CS 61C: Great Ideas in Computer Architecture
SIMD I
Instructor: David A. Patterson
http://inst.eecs.Berkeley.edu/~cs61c/sp12
2/26/12, Spring 2012, Lecture #13

New-School Machine Structures (It's a bit more complicated!)
Software: Harness Parallelism & Achieve High Performance
• Parallel Requests: assigned to computer, e.g., Search "Katz"
• Parallel Threads: assigned to core, e.g., Lookup, Ads
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., Add of 4 pairs of words (Today's Lecture)
• Hardware descriptions: all gates @ one time
• Programming Languages
Hardware: Warehouse Scale Computer, Smart Phone; Computer; Core … Core; Memory (Cache); Input/Output; Instruction Unit(s); Functional Unit(s) (A0+B0 A1+B1 A2+B2 A3+B3); Cache Memory; Logic Gates

Review
• To access cache, Memory Address divided into 3 fields: Tag, Index, Block Offset
• Cache size is Data + Management (tags, valid, dirty bits)
• Write misses trickier to implement than reads
  – Write back vs. Write through
  – Write allocate vs. No write allocate
• Cache Performance Equations:
  – CPU time = IC × CPI_stall × CC = IC × (CPI_ideal + Memory-stall cycles) × CC
• If understand caches, can adapt software to improve cache performance and thus program performance

Agenda
• Flynn Taxonomy
• Administrivia
• DLP and SIMD
• Intel Streaming SIMD Extensions (SSE)
• (Amdahl's Law if time permits)

Alternative Kinds of Parallelism: The Programming Viewpoint
• Job-level parallelism/process-level parallelism
  – Running independent programs on multiple processors simultaneously
  – Example?
• Parallel processing program
  – Single program that runs on multiple processors simultaneously
  – Example?

Alternative Kinds of Parallelism: Single Instruction/Single Data Stream
• Single Instruction, Single Data stream (SISD)
  – Sequential computer that exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines
[Figure: a single processing unit]

Alternative Kinds of Parallelism: Multiple Instruction/Single Data Stream
• Multiple Instruction, Single Data streams (MISD)
  – Computer that exploits multiple instruction streams against a single data stream for data operations that can be naturally parallelized. For example, certain kinds of array processors.
  – No longer commonly encountered, mainly of historical interest only

Alternative Kinds of Parallelism: Single Instruction/Multiple Data Stream
• Single Instruction, Multiple Data streams (SIMD or "sim-dee")
  – Computer that applies a single instruction stream to multiple data streams for operations that may be naturally parallelized, e.g., SIMD instruction extensions or Graphics Processing Unit (GPU)

Alternative Kinds of Parallelism: Multiple Instruction/Multiple Data Streams
• Multiple Instruction, Multiple Data streams (MIMD or "mim-dee")
  – Multiple autonomous processors simultaneously executing different instructions on different data.
  – MIMD architectures include multicore and Warehouse Scale Computers
  – (Discuss after midterm)

Flynn Taxonomy
• In 2012, SIMD and MIMD most common parallel computers
• Most common parallel processing programming style: Single Program Multiple Data ("SPMD")
  – Single program that runs on all processors of an MIMD
  – Cross-processor execution coordination through conditional expressions (thread parallelism after midterm)
• SIMD (aka hw-level data parallelism): specialized function units, for handling lock-step calculations involving arrays
  – Scientific computing, signal processing, multimedia (audio/video processing)

Data-Level Parallelism (DLP) (from 2nd lecture, January 19)
• 2 kinds of DLP
  – Lots of data in memory that can be operated on in parallel (e.g., adding together 2 arrays)
  – Lots of data on many disks that can be operated on in parallel (e.g., searching for documents)
• 2nd lecture (and 1st project) did DLP across 10s of servers and disks using MapReduce
• Today's lecture (and 3rd project) does Data Level Parallelism (DLP) in memory

SIMD Architectures
• Data parallelism: executing one operation on multiple data streams
• Example to provide context:
  – Multiplying a coefficient vector by a data vector (e.g., in filtering)
      y[i] := c[i] × x[i], 0 ≤ i < n
• Sources of performance improvement:
  – One instruction is fetched & decoded for entire operation
  – Multiplications are known to be independent
  – Pipelining/concurrency in memory access as well
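
The coefficient-vector example y[i] := c[i] × x[i] can be sketched in C in two ways: a scalar loop, and a loop that uses SSE intrinsics to do four multiplies per instruction. This is a minimal sketch, not code from the lecture: the function names `vec_mul_scalar` and `vec_mul_sse` are made up for illustration, and unaligned loads are used so that no assumption about array alignment is needed.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps */

/* Scalar baseline: one multiply per loop iteration. */
void vec_mul_scalar(float *y, const float *c, const float *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = c[i] * x[i];
}

/* SIMD version (hypothetical name): each MULPS computes four
   independent products, so one instruction is fetched and decoded
   for four elements. */
void vec_mul_sse(float *y, const float *c, const float *x, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vc = _mm_loadu_ps(c + i);  /* unaligned load of c[i..i+3] */
        __m128 vx = _mm_loadu_ps(x + i);  /* unaligned load of x[i..i+3] */
        _mm_storeu_ps(y + i, _mm_mul_ps(vc, vx));
    }
    for (; i < n; i++)                    /* scalar tail when n % 4 != 0 */
        y[i] = c[i] * x[i];
}
```

Both functions produce the same y; the SIMD version works only because the n multiplications are independent of one another.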

"Advanced Digital Media Boost"
• To improve performance, Intel's SIMD instructions
  – Fetch one instruction, do the work of multiple instructions
  – MMX (MultiMedia eXtension, Pentium II processor family)
  – SSE (Streaming SIMD Extension, Pentium III and beyond)

Example: SIMD Array Processing

    for each f in array
        f = sqrt(f)

    for each f in array
    {
        load f to the floating-point register
        calculate the square root
        write the result from the register to memory
    }

    for each 4 members in array        (SIMD style)
    {
        load 4 members to the SSE register
        calculate 4 square roots in one operation
        store the 4 results from the register to memory
    }
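
The SIMD-style pseudocode above maps almost line for line onto SSE intrinsics. The following is a sketch under stated assumptions: the function name `sqrt_array_sse` is invented here, the array length is assumed to be a multiple of 4, and unaligned loads and stores are used so the array may start at any address.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_sqrt_ps, _mm_storeu_ps */

/* Replace every element of a[0..n) with its square root, four at a time.
   Assumption for this sketch: n is a multiple of 4. */
void sqrt_array_sse(float *a, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(a + i);  /* load 4 members to an SSE register */
        v = _mm_sqrt_ps(v);              /* calculate 4 square roots in one operation */
        _mm_storeu_ps(a + i, v);         /* store the 4 results to memory */
    }
}
```

The loop body runs n/4 times instead of n times, which is the point of the slide: one fetched instruction does the work of four.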

Administrivia
• Lab #7 posted
• Midterm in 1 week:
  – Exam: Tu, Mar 6, 6:40-9:40 PM, 2050 VLSB
  – Covers everything through lecture today
  – Closed book, can bring one sheet notes, both sides
  – Copy of Green card will be supplied
  – No phones, calculators, …; just bring pencils & eraser
  – TA Review: Su, Mar 4, Starting 2PM, 2050 VLSB
• Will send (anonymous) 61C midway survey before Midterm

Agenda
• Flynn Taxonomy
• Administrivia
• DLP and SIMD
• Technology Break
• Intel Streaming SIMD Extensions (SSE)
• (Amdahl's Law if time permits)

Intel SSE Instruction Categories for Multimedia Support
[Figure: table of SSE instruction categories]
• SSE-2+ supports wider data types to allow 16 x 8-bit and 8 x 16-bit operands

Intel Architecture SSE2+ 128-Bit SIMD Data Types
[Figure: a 128-bit register divided into 16, 8, 4, or 2 fields (16 / 128 bits, 8 / 128 bits, 4 / 128 bits, 2 / 128 bits)]
• Note: in Intel Architecture (unlike MIPS) a word is 16 bits
  – Single precision FP: Double word (32 bits)
  – Double precision FP: Quad word (64 bits)

XMM Registers
• Architecture extended with eight 128-bit data registers: XMM registers
  – IA 64-bit address architecture: 16 128-bit XMM registers are available (adds XMM8 – XMM15)
  – E.g., 128-bit packed single-precision floating-point data type (doublewords), allows four single-precision operations to be performed simultaneously

SSE/SSE2 Floating Point Instructions
xmm: one operand is a 128-bit SSE2 register
mem/xmm: other operand is in memory or an SSE2 register
{SS} Scalar Single precision FP: one 32-bit operand in a 128-bit register
{PS} Packed Single precision FP: four 32-bit operands in a 128-bit register
{SD} Scalar Double precision FP: one 64-bit operand in a 128-bit register
{PD} Packed Double precision FP: two 64-bit operands in a 128-bit register
{A} means the 128-bit operand is aligned in memory
{U} means the 128-bit operand is unaligned in memory
{H} means move the high half of the 128-bit operand
{L} means move the low half of the 128-bit operand
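
The {SS} and {PS} suffixes can be compared directly with intrinsics. This is an illustrative sketch, assuming the helper name `scalar_vs_packed_add` (not from the lecture): `_mm_add_ss` compiles to ADDSS and touches only the lowest 32-bit lane, while `_mm_add_ps` compiles to ADDPS and adds all four lanes.

```c
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_add_ss, _mm_add_ps, _mm_storeu_ps */

/* Hypothetical helper: computes a + b twice, once with the scalar-single
   form and once with the packed-single form, so the results can be compared. */
void scalar_vs_packed_add(const float a[4], const float b[4],
                          float out_ss[4], float out_ps[4])
{
    __m128 va = _mm_loadu_ps(a);  /* MOVUPS: {U} unaligned, {PS} packed single */
    __m128 vb = _mm_loadu_ps(b);

    /* ADDSS: lane 0 = a[0] + b[0]; lanes 1..3 are copied from a unchanged. */
    _mm_storeu_ps(out_ss, _mm_add_ss(va, vb));

    /* ADDPS: all four lanes are added. */
    _mm_storeu_ps(out_ps, _mm_add_ps(va, vb));
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the scalar form yields {11, 2, 3, 4} and the packed form yields {11, 22, 33, 44}.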

Packed and Scalar Double-Precision Floating-Point Operations
[Figure: packed vs. scalar double-precision operations on XMM registers]

Example: Add Two Single Precision FP Vectors
Computation to be performed:

    vec_res.x = v1.x + v2.x;
    vec_res.y = v1.y + v2.y;
    vec_res.z = v1.z + v2.z;
    vec_res.w = v1.w + v2.w;

SSE Instruction Sequence:
(Note: Destination on the right in x86 assembly)

    movaps address-of-v1, %xmm0
           // v1.w | v1.z | v1.y | v1.x -> xmm0
    addps  address-of-v2, %xmm0
           // v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x -> xmm0
    movaps %xmm0, address-of-vec_res

mov a ps: move from mem to XMM register, memory aligned, packed single precision
add ps: add from mem to XMM register, packed single precision
mov a ps: move from XMM register to mem, memory aligned, packed single precision
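
The same three-instruction sequence can be written with intrinsics instead of assembly. This is a sketch rather than the lecture's code: the `vec4` type and the `vec4_add` function are assumptions made for illustration, and the 16-byte alignment attribute (GCC syntax) is what makes the aligned MOVAPS forms legal.

```c
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_add_ps, _mm_store_ps */

/* Four packed floats. The type name and field layout are assumptions;
   the aligned(16) attribute guarantees the alignment MOVAPS requires. */
typedef struct {
    float x, y, z, w;
} __attribute__((aligned(16))) vec4;

/* vec_res = v1 + v2 using one ADDPS instead of four scalar adds. */
void vec4_add(vec4 *vec_res, const vec4 *v1, const vec4 *v2)
{
    __m128 r = _mm_load_ps(&v1->x);          /* movaps: v1 -> xmm register */
    r = _mm_add_ps(r, _mm_load_ps(&v2->x));  /* addps:  add v2 into it     */
    _mm_store_ps(&vec_res->x, r);            /* movaps: xmm -> vec_res     */
}
```

Passing a pointer that is not 16-byte aligned to `_mm_load_ps` or `_mm_store_ps` can fault, which is why the {U} variants (`_mm_loadu_ps`, `_mm_storeu_ps`) exist.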

Example: Image Converter
• Converts BMP (bitmap) image to a YUV (color space) image format:
  – Read individual pixels from the BMP image, convert pixels into YUV format
  – Can pack the pixels and operate on a set of pixels with a single instruction
• E.g., bitmap image consists of 8 bit monochrome pixels
  – Pack these pixel values in a 128 bit register (8 bit * 16 pixels), can operate on 16 values at a time
  – Significant performance boost

Example: Image Converter
• FMADDPS – Multiply and add packed single precision floating point instruction
• One of the typical operations computed in transformations (e.g., DFT or FFT)
    $P = \sum_{n=1}^{N} f(n) \times x(n)$

Example: Image Converter
Floating point numbers f(n) and x(n) in src1 and src2; p in dest;
C implementation for N = 4 (128 bits):

    for (int i = 0; i < 4; i++)
        p = p + src1[i] * src2[i];

Regular x86 instructions for the inner loop:

    // src1 is on the top of the stack; src1 * src2 -> src1
    fmul DWORD PTR _src2$[%esp+148]
    // p = ST(1), src1 = ST(0); ST(0)+ST(1) -> ST(1); ST = Stack Top
    faddp %ST(0), %ST(1)

(Note: Destination on the right in x86 assembly)
Number regular x86 Fl. Pt. instructions executed: 4 * 2 = 8

Example: SSE Image Converter
Floating point numbers f(n) and x(n) in src1 and src2; p in dest;
C implementation for N = 4 (128 bits):

    for (int i = 0; i < 4; i++)
        p = p + src1[i] * src2[i];

• SSE2 instructions for the inner loop:

    // xmm0 = p, xmm1 = src1[i], xmm2 = src2[i]
    mulps %xmm1, %xmm2    // xmm2 * xmm1 -> xmm2
    addps %xmm2, %xmm0    // xmm0 + xmm2 -> xmm0

• Number regular instructions executed: 2 SSE2 instructions vs. 8 x86
• SSE5 instruction accomplishes same in one instruction:

    fmaddps %xmm0, %xmm1, %xmm2, %xmm0    // xmm2 * xmm1 + xmm0 -> xmm0
    // multiply xmm1 x xmm2 paired single,
    // then add product paired single to sum in xmm0

• Number regular instructions executed: 1 SSE5 instruction vs. 8 x86

Intel SSE Intrinsics
• Intrinsics are C functions and procedures for putting in assembly language, including SSE instructions
  – With intrinsics, can program using these instructions indirectly
  – One-to-one correspondence between SSE instructions and intrinsics

Example SSE Intrinsics
Intrinsics:                          Corresponding SSE instructions:
• Vector data type:
    __m128d
• Load and store operations:
    _mm_load_pd                      MOVAPD/aligned, packed double
    _mm_store_pd                     MOVAPD/aligned, packed double
    _mm_loadu_pd                     MOVUPD/unaligned, packed double
    _mm_storeu_pd                    MOVUPD/unaligned, packed double
• Load and broadcast across vector
    _mm_load1_pd                     MOVSD + shuffling/duplicating
• Arithmetic:
    _mm_add_pd                       ADDPD/add, packed double
    _mm_mul_pd                       MULPD/multiply, packed double

Example: 2 x 2 Matrix Multiply
Definition of Matrix Multiply:
    $C_{i,j} = (A \times B)_{i,j} = \sum_{k=1}^{2} A_{i,k} \times B_{k,j}$

    | A1,1  A1,2 |   | B1,1  B1,2 |   | C1,1 = A1,1B1,1 + A1,2B2,1   C1,2 = A1,1B1,2 + A1,2B2,2 |
    | A2,1  A2,2 | x | B2,1  B2,2 | = | C2,1 = A2,1B1,1 + A2,2B2,1   C2,2 = A2,1B1,2 + A2,2B2,2 |

Example: 2 x 2 Matrix Multiply
• Using the XMM registers
  – 64-bit/double precision/two doubles per XMM reg

    C1:  C1,1  C2,1     (stored in memory in column order)
    C2:  C1,2  C2,2
    A:   A1,i  A2,i
    B1:  Bi,1  Bi,1
    B2:  Bi,2  Bi,2
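
Before the SSE version, it may help to see the matrix multiply definition as plain scalar C using the same column-order storage, where element (i, j) of an n x n matrix lives at index i + j*n. This is a reference sketch only; the function name `matmul_colmajor` is an assumption, and the routine accumulates into C, so the caller is expected to zero C first.

```c
/* C += A * B for n x n matrices stored in column order:
   element (i, j) of a matrix M is M[i + j*n] (0-based i, j). */
void matmul_colmajor(int n, const double *A, const double *B, double *C)
{
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < n; i++) {
            double sum = C[i + j*n];
            for (int k = 0; k < n; k++)
                sum += A[i + k*n] * B[k + j*n];  /* A(i,k) * B(k,j) */
            C[i + j*n] = sum;
        }
    }
}
```

For n = 2, the two values of i for a fixed j are exactly the two doubles that share one XMM register in the SIMD version, and the k loop corresponds to its two iterations.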

Example: 2 x 2 Matrix Multiply
• Initialization

    C1:  0  0
    C2:  0  0

• I = 1

    A:   A1,1  A2,1
    B1:  B1,1  B1,1
    B2:  B1,2  B1,2

  – _mm_load_pd: Load 2 doubles into XMM reg, stored in memory in column order
  – _mm_load1_pd: SSE instruction that loads one double-precision value and stores it in both the high and low halves of the XMM register (duplicates value in both halves of XMM)

Example: 2 x 2 Matrix Multiply
• First iteration intermediate result

    C1:  0+A1,1B1,1  0+A2,1B1,1
    C2:  0+A1,1B1,2  0+A2,1B1,2

    c1 = _mm_add_pd(c1,_mm_mul_pd(a,b1));
    c2 = _mm_add_pd(c2,_mm_mul_pd(a,b2));

  SSE instructions first do parallel multiplies and then parallel adds in XMM registers
• I = 2 (registers reloaded for the second iteration)

    A:   A1,2  A2,2
    B1:  B2,1  B2,1
    B2:  B2,2  B2,2

Example: 2 x 2 Matrix Multiply
• Second iteration intermediate result

    C1:  A1,1B1,1+A1,2B2,1  A2,1B1,1+A2,2B2,1     (= C1,1  C2,1)
    C2:  A1,1B1,2+A1,2B2,2  A2,1B1,2+A2,2B2,2     (= C1,2  C2,2)

    c1 = _mm_add_pd(c1,_mm_mul_pd(a,b1));
    c2 = _mm_add_pd(c2,_mm_mul_pd(a,b2));

Live Example: 2 x 2 Matrix Multiply
Definition of Matrix Multiply:
    $C_{i,j} = (A \times B)_{i,j} = \sum_{k=1}^{2} A_{i,k} \times B_{k,j}$

    | 1  0 |   | 1  3 |   | C1,1 = 1*1 + 0*2 = 1   C1,2 = 1*3 + 0*4 = 3 |
    | 0  1 | x | 2  4 | = | C2,1 = 0*1 + 1*2 = 2   C2,2 = 0*3 + 1*4 = 4 |

Example: 2 x 2 Matrix Multiply (Part 1 of 2)

    #include <stdio.h>
    // header file for SSE compiler intrinsics
    #include <emmintrin.h>

    // NOTE: vector registers will be represented in comments as v1 = [ a | b ]
    // where v1 is a variable of type __m128d and a, b are doubles

    int main(void) {
        // allocate A,B,C aligned on 16-byte boundaries
        double A[4] __attribute__ ((aligned (16)));
        double B[4] __attribute__ ((aligned (16)));
        double C[4] __attribute__ ((aligned (16)));
        int lda = 2;
        int i = 0;
        // declare several 128-bit vector variables
        __m128d c1,c2,a,b1,b2;

        // Initialize A, B, C for example
        /* A = (note column order!)
           1 0
           0 1
         */
        A[0] = 1.0; A[1] = 0.0; A[2] = 0.0; A[3] = 1.0;
        /* B = (note column order!)
           1 3
           2 4
         */
        B[0] = 1.0; B[1] = 2.0; B[2] = 3.0; B[3] = 4.0;
        /* C = (note column order!)
           0 0
           0 0
         */
        C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;

Example: 2 x 2 Matrix Multiply (Part 2 of 2)

        // use aligned loads to set
        // c1 = [c_11 | c_21]
        c1 = _mm_load_pd(C+0*lda);
        // c2 = [c_12 | c_22]
        c2 = _mm_load_pd(C+1*lda);

        for (i = 0; i < 2; i++) {
            /* a =
               i = 0: [a_11 | a_21]
               i = 1: [a_12 | a_22]
             */
            a = _mm_load_pd(A+i*lda);
            /* b1 =
               i = 0: [b_11 | b_11]
               i = 1: [b_21 | b_21]
             */
            b1 = _mm_load1_pd(B+i+0*lda);
            /* b2 =
               i = 0: [b_12 | b_12]
               i = 1: [b_22 | b_22]
             */
            b2 = _mm_load1_pd(B+i+1*lda);
            /* c1 =
               i = 0: [c_11 + a_11*b_11 | c_21 + a_21*b_11]
               i = 1: [c_11 + a_12*b_21 | c_21 + a_22*b_21]
             */
            c1 = _mm_add_pd(c1,_mm_mul_pd(a,b1));
            /* c2 =
               i = 0: [c_12 + a_11*b_12 | c_22 + a_21*b_12]
               i = 1: [c_12 + a_12*b_22 | c_22 + a_22*b_22]
             */
            c2 = _mm_add_pd(c2,_mm_mul_pd(a,b2));
        }

        // store c1,c2 back into C for completion
        _mm_store_pd(C+0*lda,c1);
        _mm_store_pd(C+1*lda,c2);

        // print C
        printf("%g,%g\n%g,%g\n",C[0],C[2],C[1],C[3]);
        return 0;
    }

Inner loop from gcc -O -S

    L2: movapd  (%rax,%rsi), %xmm1   // Load aligned A[i,i+1] -> m1
        movddup (%rdx), %xmm0        // Load B[j], duplicate -> m0
        mulpd   %xmm1, %xmm0         // Multiply m0*m1 -> m0
        addpd   %xmm0, %xmm3         // Add m0+m3 -> m3
        movddup 16(%rdx), %xmm0      // Load B[j+1], duplicate -> m0
        mulpd   %xmm0, %xmm1         // Multiply m0*m1 -> m1
        addpd   %xmm1, %xmm2         // Add m1+m2 -> m2
        addq    $16, %rax            // rax+16 -> rax (i+=2)
        addq    $8, %rdx             // rdx+8 -> rdx (j+=1)
        cmpq    $32, %rax            // rax == 32?
        jne     L2                   // jump to L2 if not equal
        movapd  %xmm3, (%rcx)        // store aligned m3 into C[k,k+1]
        movapd  %xmm2, (%rdi)        // store aligned m2 into C[l,l+1]

Performance-Driven ISA Extensions
• Subword parallelism, used primarily for multimedia applications
  – Intel MMX: multimedia extension
    • 64-bit registers can hold multiple integer operands
  – Intel SSE: Streaming SIMD extension
    • 128-bit registers can hold several floating-point operands
• Adding instructions that do more work per cycle
  – Shift-add: replace two instructions with one (e.g., multiply by 5)
  – Multiply-add: replace two instructions with one (x := c + a × b)
  – Multiply-accumulate: reduce round-off error (s := s + a × b)
  – Conditional copy: to avoid some branches (e.g., in if-then-else)

Big Idea: Amdahl's (Heartbreaking) Law
• Speedup due to enhancement E is

    Speedup w/ E = Exec time w/o E / Exec time w/ E

• Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected

    Execution Time w/ E = Execution Time w/o E × [ (1-F) + F/S ]
    Speedup w/ E = 1 / [ (1-F) + F/S ]
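
Amdahl's Law is short enough to write as a one-line function. The sketch below is illustrative: the names `amdahl_speedup` and `amdahl_limit` are assumptions, with F and S taken from the formula Speedup w/ E = 1 / [ (1-F) + F/S ]. The second function gives the bound obtained by letting S grow without limit, where the F/S term vanishes.

```c
/* Amdahl's Law: overall speedup when a fraction f (0 <= f <= 1) of the
   original execution time is accelerated by a factor s (s > 0). */
double amdahl_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

/* Upper bound on speedup for f < 1, no matter how large s becomes:
   only the unaffected (1 - f) part remains. */
double amdahl_limit(double f)
{
    return 1.0 / (1.0 - f);
}
```

For example, `amdahl_speedup(0.5, 2.0)` evaluates 1 / (0.5 + 0.25), about 1.33, and `amdahl_limit(0.5)` is 2: accelerating half of a program can never more than double its speed.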

Big Idea: Amdahl's Law

    Speedup = 1 / [ (1-F) + F/S ]
    (1-F): non-speed-up part; F/S: speed-up part

Example: the execution time of half of the program can be accelerated by a factor of 2. What is the program speed-up overall?

    1 / (0.5 + 0.5/2) = 1 / (0.5 + 0.25) = 1.33

Big Idea: Amdahl's Law
If the portion of the program that can be parallelized is small, then the speedup is limited
The non-parallel portion limits the performance

Example #1: Amdahl's Law
Speedup w/ E = 1 / [ (1-F) + F/S ]
• Consider an enhancement which runs 20 times faster but which is only usable 25% of the time
    Speedup w/ E = 1/(.75 + .25/20) = 1.31
• What if it's usable only 15% of the time?
    Speedup w/ E = 1/(.85 + .15/20) = 1.17
• Amdahl's Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!
• To get a speedup of 90 from 100 processors, the percentage of the original program that could be scalar would have to be 0.1% or less
    Speedup w/ E = 1/(.001 + .999/100) = 90.99

Example #2: Amdahl's Law
Speedup w/ E = 1 / [ (1-F) + F/S ]
• Consider summing 10 scalar variables and two 10 by 10 matrices (matrix sum) on 10 processors
    Speedup w/ E = 1/(.091 + .909/10) = 1/0.1819 = 5.5
• What if there are 100 processors?
    Speedup w/ E = 1/(.091 + .909/100) = 1/0.10009 = 10.0
• What if the matrices are 100 by 100 (or 10,010 adds in total) on 10 processors?
    Speedup w/ E = 1/(.001 + .999/10) = 1/0.1009 = 9.9
• What if there are 100 processors?
    Speedup w/ E = 1/(.001 + .999/100) = 1/0.01099 = 91
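
The four matrix-sum results in Example #2 can be reproduced by counting adds instead of plugging in a rounded fraction F. This is a sketch under a simplifying assumption that every add takes one time unit and the matrix adds divide evenly among the processors; the function name `sum_speedup` and its parameters are invented for illustration.

```c
/* Speedup for a workload of `serial` adds that must run on one processor
   plus `parallel` adds that split evenly across p processors.
   Time is measured in adds. */
double sum_speedup(double serial, double parallel, double p)
{
    double t_one  = serial + parallel;      /* time on a single processor   */
    double t_many = serial + parallel / p;  /* serial part is not sped up   */
    return t_one / t_many;
}
```

With 10 scalar adds and 100 matrix adds, 10 processors give 110/20 = 5.5 and 100 processors give 110/11 = 10. With 10,000 matrix adds, 10 processors give 10010/1010, about 9.9, and 100 processors give 10010/110 = 91. Growing the problem, not just the machine, is what brings the speedup close to the processor count.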

Strong and Weak Scaling
• To get good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
  – Strong scaling: when speedup can be achieved on a parallel processor without increasing the size of the problem
  – Weak scaling: when speedup is achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors
• Load balancing is another important factor: every processor doing same amount of work
  – Just 1 unit with twice the load of others cuts speedup almost in half

Review
• Flynn Taxonomy of Parallel Architectures
  – SIMD: Single Instruction Multiple Data
  – MIMD: Multiple Instruction Multiple Data
  – SISD: Single Instruction Single Data
  – MISD: Multiple Instruction Single Data (unused)
• Intel SSE SIMD Instructions
  – One instruction fetch that operates on multiple operands simultaneously
  – 128/64 bit XMM registers
• SSE Instructions in C
  – Embed the SSE machine instructions directly into C programs through use of intrinsics
  – Achieve efficiency beyond that of optimizing compiler
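
The load-balancing remark under Strong and Weak Scaling can be checked with a small model in which the parallel phase ends only when the most heavily loaded processor finishes. Everything here is an assumption made for illustration: the function name `imbalanced_speedup`, the one-time-unit-per-add cost model, and the scenario of 10 scalar adds plus 10,000 matrix adds on 100 processors.

```c
/* Speedup when parallel work is split unevenly. load[i] is the number of
   adds given to processor i (p processors in total); `serial` adds run on
   one processor. The slowest processor determines the parallel time. */
double imbalanced_speedup(double serial, const double *load, int p)
{
    double total = serial;   /* time on a single processor */
    double slowest = 0.0;    /* largest per-processor load */
    for (int i = 0; i < p; i++) {
        total += load[i];
        if (load[i] > slowest)
            slowest = load[i];
    }
    return total / (serial + slowest);
}
```

With a balanced split of 100 adds per processor the result is 10010/110 = 91. If one processor instead receives 200 adds and the other 99 share the remaining 9,800, the parallel phase takes 200 time units and the speedup falls to 10010/210, about 48.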