13LecSp12DLPIx6 - CS 61C: Great Ideas in Computer Architecture, Spring 2012, Lecture #13 (2/26/12): SIMD I


CS 61C: Great Ideas in Computer Architecture
SIMD I
Instructor: David A. Patterson
http://inst.eecs.Berkeley.edu/~cs61c/sp12

New-School Machine Structures (It's a bit more complicated!)
•  Parallel Requests: assigned to computer, e.g., search "Katz"
•  Parallel Threads: assigned to core, e.g., lookup, ads
•  Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
•  Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words (today's lecture)
•  Hardware descriptions: all gates @ one time
•  Programming Languages
Harness parallelism & achieve high performance across the software/hardware stack, from a Smart Phone or Warehouse Scale Computer down through Computer, Core, Memory (Cache), Input/Output, Instruction Unit(s), Functional Unit(s) (A0+B0, A1+B1, A2+B2, A3+B3), Cache Memory, and Logic Gates.

Review
•  To access the cache, the memory address is divided into 3 fields: Tag, Index, Block Offset
•  Cache size is Data + Management (tags, valid, dirty bits)
•  Write misses are trickier to implement than reads
   –  Write back vs. write through
   –  Write allocate vs. no write allocate
•  Cache performance equations:
   –  CPU time = IC × CPI_stall × CC = IC × (CPI_ideal + Memory-stall cycles) × CC
   –  AMAT = Time for a hit + Miss rate × Miss penalty
•  If you understand caches, you can adapt software to improve cache performance and thus program performance
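To make the review equations concrete, here is a minimal worked AMAT example in C; the hit time, miss rate, and miss penalty values are assumptions chosen for illustration, not numbers from the lecture.

    #include <stdio.h>

    int main(void) {
        /* Assumed example values (not from the lecture):
           1-cycle hit time, 5% miss rate, 100-cycle miss penalty. */
        double hit_time     = 1.0;    /* cycles */
        double miss_rate    = 0.05;
        double miss_penalty = 100.0;  /* cycles */

        /* AMAT = Time for a hit + Miss rate x Miss penalty */
        double amat = hit_time + miss_rate * miss_penalty;
        printf("AMAT = %g cycles\n", amat);   /* 1 + 0.05*100 = 6 cycles */
        return 0;
    }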
Agenda
•  Flynn Taxonomy
•  Administrivia
•  DLP and SIMD
•  Intel Streaming SIMD Extensions (SSE)
•  (Amdahl's Law if time permits)

Alternative Kinds of Parallelism: The Programming Viewpoint
•  Job-level parallelism / process-level parallelism
   –  Running independent programs on multiple processors simultaneously
   –  Example?
•  Parallel processing program
   –  Single program that runs on multiple processors simultaneously
   –  Example?

Alternative Kinds of Parallelism: Single Instruction/Single Data Stream
•  Single Instruction, Single Data stream (SISD)
   –  Sequential computer that exploits no parallelism in either the instruction or data streams; a single processing unit works through one instruction stream. Examples of SISD architecture are traditional uniprocessor machines.

Alternative Kinds of Parallelism: Multiple Instruction/Single Data Stream
•  Multiple Instruction, Single Data streams (MISD)
   –  Computer that exploits multiple instruction streams against a single data stream, for data operations that can be naturally parallelized (for example, certain kinds of array processors)
   –  No longer commonly encountered, mainly of historical interest only

Alternative Kinds of Parallelism: Single Instruction/Multiple Data Stream
•  Single Instruction, Multiple Data streams (SIMD, or "sim-dee")
   –  Computer that exploits multiple data streams against a single instruction stream, for operations that may be naturally parallelized, e.g., SIMD instruction extensions or a Graphics Processing Unit (GPU)

Alternative Kinds of Parallelism: Multiple Instruction/Multiple Data Streams
•  Multiple Instruction, Multiple Data streams (MIMD, or "mim-dee")
   –  Multiple autonomous processors simultaneously executing different instructions on different data
   –  MIMD architectures include multicore and Warehouse Scale Computers
   –  (Discussed after the midterm)

Flynn Taxonomy
•  In 2012, SIMD and MIMD are the most common parallel computers
•  Most common parallel processing programming style: Single Program Multiple Data ("SPMD")
   –  Single program that runs on all processors of an MIMD
   –  Cross-processor execution coordination through conditional expressions (thread parallelism after the midterm)
•  SIMD (aka hardware-level data parallelism): specialized function units for handling lock-step calculations involving arrays
   –  Scientific computing, signal processing, multimedia (audio/video processing)

Data-Level Parallelism (DLP) (from 2nd lecture, January 19)
•  Data parallelism: executing one operation on multiple data streams
•  2 kinds of DLP
   –  Lots of data in memory that can be operated on in parallel (e.g., adding together 2 arrays)
   –  Lots of data on many disks that can be operated on in parallel (e.g., searching for documents)
•  2nd lecture (and 1st project) did DLP across 10s of servers and disks using MapReduce
•  Today's lecture (and 3rd project) does Data-Level Parallelism (DLP) in memory

SIMD Architectures
•  Example to provide context: multiplying a coefficient vector by a data vector (e.g., in filtering)
      y[i] := c[i] × x[i], 0 ≤ i < n
•  Sources of performance improvement:
   –  One instruction is fetched & decoded for the entire operation
   –  The multiplications are known to be independent
   –  Pipelining/concurrency in memory access as well

Example: SIMD Array Processing
Task: for each f in array, f = sqrt(f)

Scalar style:
   for each f in array {
       load f to the floating-point register
       calculate the square root
       write the result from the register to memory
   }

SIMD style:
   for each 4 members in array {
       load 4 members to the SSE register
       calculate 4 square roots in one operation
       store the 4 results from the register to memory
   }

"Advanced Digital Media Boost"
•  To improve performance, Intel's SIMD instructions fetch one instruction but do the work of multiple instructions
   –  MMX (MultiMedia eXtension, Pentium II processor family)
   –  SSE (Streaming SIMD Extension, Pentium III and beyond)
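A minimal C sketch of the SIMD style above, written with SSE intrinsics instead of pseudocode; the array name, its length, the 16-byte alignment, and the function name are assumptions for illustration.

    #include <xmmintrin.h>   /* SSE intrinsics: _mm_load_ps, _mm_sqrt_ps, _mm_store_ps */

    #define N 16             /* assumed array length, a multiple of 4 */

    /* data assumed 16-byte aligned so the aligned load/store forms can be used */
    float data[N] __attribute__ ((aligned (16)));

    void sqrt_simd(void) {
        for (int i = 0; i < N; i += 4) {
            __m128 v = _mm_load_ps(data + i);  /* load 4 floats into an XMM register */
            v = _mm_sqrt_ps(v);                /* 4 square roots in one operation */
            _mm_store_ps(data + i, v);         /* store the 4 results back to memory */
        }
    }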
Administrivia
•  Lab #7 posted
•  Midterm in 1 week:
   –  Exam: Tu, Mar 6, 6:40-9:40 PM, 2050 VLSB
   –  Covers everything through today's lecture
   –  Closed book; can bring one sheet of notes, both sides
   –  Copy of Green card will be supplied
   –  No phones, calculators, ...; just bring pencils & eraser
   –  TA Review: Su, Mar 4, starting 2 PM, 2050 VLSB
•  Will send (anonymous) 61C midway survey before the midterm

Agenda
•  Flynn Taxonomy
•  Administrivia
•  DLP and SIMD
•  Technology Break
•  Intel Streaming SIMD Extensions (SSE)
•  (Amdahl's Law if time permits)

Intel SSE Instruction Categories for Multimedia Support
(summary table shown as a figure in the slides)

Intel Architecture SSE2+ 128-Bit SIMD Data Types
[Diagram: a 128-bit value viewed as 16 x 8-bit, 8 x 16-bit, 4 x 32-bit, or 2 x 64-bit packed operands]
•  SSE2+ supports wider data types to allow 16 x 8-bit and 8 x 16-bit operands
•  Note: in Intel Architecture (unlike MIPS) a word is 16 bits
   –  Single precision FP: double word (32 bits)
   –  Double precision FP: quad word (64 bits)

XMM Registers
•  Architecture extended with eight 128-bit data registers: the XMM registers
   –  The IA 64-bit address architecture makes 16 such 128-bit registers available (adding XMM8 - XMM15)
   –  E.g., the 128-bit packed single-precision floating-point data type (four doublewords) allows four single-precision operations to be performed simultaneously

SSE/SSE2 Floating Point Instructions
•  Operand forms:
   –  xmm: one operand is a 128-bit SSE2 register
   –  mem/xmm: the other operand is in memory or an SSE2 register
•  Name suffixes:
   –  {SS} Scalar Single precision FP: one 32-bit operand in a 128-bit register
   –  {PS} Packed Single precision FP: four 32-bit operands in a 128-bit register
   –  {SD} Scalar Double precision FP: one 64-bit operand in a 128-bit register
   –  {PD} Packed Double precision FP: two 64-bit operands in a 128-bit register
   –  {A} the 128-bit operand is aligned in memory
   –  {U} the 128-bit operand is unaligned in memory
   –  {H} move the high half of the 128-bit operand
   –  {L} move the low half of the 128-bit operand

Packed and Scalar Double-Precision Floating-Point Operations
(diagram)

Example: Add Two Single Precision FP Vectors
Computation to be performed:
   vec_res.x = v1.x + v2.x;
   vec_res.y = v1.y + v2.y;
   vec_res.z = v1.z + v2.z;
   vec_res.w = v1.w + v2.w;

SSE instruction sequence (note: destination is on the right in x86 assembly):
   movaps address-of-v1, %xmm0
          // v1.w | v1.z | v1.y | v1.x -> xmm0
   addps  address-of-v2, %xmm0
          // v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x -> xmm0
   movaps %xmm0, address-of-vec_res

Instructions used:
   movaps: move from memory to XMM register (memory aligned, packed single precision)
   addps:  add from memory to XMM register (packed single precision)
   movaps: move from XMM register to memory (memory aligned, packed single precision)
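The same four-wide add can be written in C with SSE intrinsics instead of assembly; this is a minimal sketch, assuming the vectors are stored as 16-byte-aligned arrays of four floats (the array layout and the function name are assumptions, not from the slide).

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* assumed layout: x, y, z, w stored contiguously as float[4], 16-byte aligned */
    float v1[4]      __attribute__ ((aligned (16)));
    float v2[4]      __attribute__ ((aligned (16)));
    float vec_res[4] __attribute__ ((aligned (16)));

    void add4(void) {
        __m128 a = _mm_load_ps(v1);    /* v1.w | v1.z | v1.y | v1.x -> XMM reg */
        __m128 b = _mm_load_ps(v2);    /* v2.w | v2.z | v2.y | v2.x -> XMM reg */
        __m128 r = _mm_add_ps(a, b);   /* four single-precision adds in one instruction */
        _mm_store_ps(vec_res, r);      /* store the four sums back to memory */
    }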
Example: Image Converter
•  Converts a BMP (bitmap) image to a YUV (color space) image format:
   –  Read individual pixels from the BMP image, convert pixels into YUV format
   –  Can pack the pixels and operate on a set of pixels with a single instruction
•  E.g., a bitmap image consists of 8-bit monochrome pixels
   –  Pack these pixel values in a 128-bit register (8 bits * 16 pixels), so 16 values can be operated on at a time
   –  Significant performance boost

Example: Image Converter
•  FMADDPS: multiply and add packed single precision floating point instruction
•  One of the typical operations computed in transformations (e.g., DFT or FFT):
      P = Σ (n = 1 to N) f(n) × x(n)

Example: Image Converter
•  Floating point numbers f(n) and x(n) in src1 and src2; p in dest
•  C implementation for N = 4 (128 bits):
      for (int i = 0; i < 4; i++)
          p = p + src1[i] * src2[i];
•  Regular x86 instructions for the inner loop:
      // src1 is on the top of the stack; src1 * src2 -> src1
      fmul DWORD PTR _src2$[%esp+148]
      // p = ST(1), src1 = ST(0); ST(0)+ST(1) -> ST(1); ST = stack top
      faddp %ST(0), %ST(1)
   (Note: destination is on the right in x86 assembly)
•  Number of regular x86 floating-point instructions executed: 4 * 2 = 8

Example: SSE Image Converter
•  Same C implementation for N = 4 (128 bits):
      for (int i = 0; i < 4; i++)
          p = p + src1[i] * src2[i];
•  SSE2 instructions for the inner loop:
      // xmm0 = p, xmm1 = src1[i], xmm2 = src2[i]
      mulps %xmm1, %xmm2    // xmm2 * xmm1 -> xmm2
      addps %xmm2, %xmm0    // xmm0 + xmm2 -> xmm0
•  Number of instructions executed: 2 SSE2 instructions vs. 8 x86
•  An SSE5 instruction accomplishes the same in one instruction:
      fmaddps %xmm0, %xmm1, %xmm2, %xmm0
      // xmm2 * xmm1 + xmm0 -> xmm0
      // multiply xmm1 x xmm2 paired single, then add the product paired single to the sum in xmm0
•  Number of instructions executed: 1 SSE5 instruction vs. 8 x86
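The next section introduces intrinsics; as a preview, here is a hedged C sketch of the same N = 4 multiply-accumulate written with them. The example values, the scalar horizontal sum at the end, and the single packed multiply (rather than the slide's per-iteration accumulate) are my choices for illustration.

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics for packed single precision */

    int main(void) {
        /* f(n) in src1, x(n) in src2; these values are assumed for illustration */
        float src1[4] __attribute__ ((aligned (16))) = {1.0f, 2.0f, 3.0f, 4.0f};
        float src2[4] __attribute__ ((aligned (16))) = {5.0f, 6.0f, 7.0f, 8.0f};
        float dest[4] __attribute__ ((aligned (16)));

        __m128 a    = _mm_load_ps(src1);   /* load all four f(n) */
        __m128 b    = _mm_load_ps(src2);   /* load all four x(n) */
        __m128 prod = _mm_mul_ps(a, b);    /* mulps: 4 products in one instruction */
        _mm_store_ps(dest, prod);

        /* horizontal sum of the 4 partial products, done in scalar C here */
        float p = dest[0] + dest[1] + dest[2] + dest[3];
        printf("p = %g\n", p);             /* 5 + 12 + 21 + 32 = 70 */
        return 0;
    }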
Intel SSE Intrinsics
•  Intrinsics are C functions and procedures that insert assembly language, including SSE instructions, into C programs
   –  With intrinsics, you can program using these instructions indirectly
   –  One-to-one correspondence between SSE instructions and intrinsics

Example SSE Intrinsics
Intrinsic and corresponding SSE instruction:
•  Vector data type: __m128d
•  Load and store operations:
   –  _mm_load_pd     MOVAPD / aligned, packed double
   –  _mm_store_pd    MOVAPD / aligned, packed double
   –  _mm_loadu_pd    MOVUPD / unaligned, packed double
   –  _mm_storeu_pd   MOVUPD / unaligned, packed double
•  Load and broadcast across vector:
   –  _mm_load1_pd    MOVSD + shuffling/duplicating
•  Arithmetic:
   –  _mm_add_pd      ADDPD / add, packed double
   –  _mm_mul_pd      MULPD / multiply, packed double

Example: 2 x 2 Matrix Multiply
Definition of matrix multiply:
   Ci,j = (A×B)i,j = Σ (k = 1 to 2) Ai,k × Bk,j
   C1,1 = A1,1 B1,1 + A1,2 B2,1     C1,2 = A1,1 B1,2 + A1,2 B2,2
   C2,1 = A2,1 B1,1 + A2,2 B2,1     C2,2 = A2,1 B1,2 + A2,2 B2,2

Using the XMM registers:
•  64-bit / double precision / two doubles per XMM register
•  Register contents (matrices stored in memory in column order):
   C1 = [C1,1 | C2,1]     C2 = [C1,2 | C2,2]
   A  = [A1,i | A2,i]
   B1 = [Bi,1 | Bi,1]     B2 = [Bi,2 | Bi,2]

Initialization:
   C1 = [0 | 0]
   C2 = [0 | 0]

Iteration i = 1:
   A  = [A1,1 | A2,1]   _mm_load_pd: load 2 doubles into an XMM register (stored in memory in column order)
   B1 = [B1,1 | B1,1]   _mm_load1_pd: load a double word and store it in the high and low double words
   B2 = [B1,2 | B1,2]   of the XMM register (duplicates the value in both halves)

First iteration intermediate result:
   C1 = [0 + A1,1 B1,1 | 0 + A2,1 B1,1]
   C2 = [0 + A1,1 B1,2 | 0 + A2,1 B1,2]
   c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
   c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
   The SSE instructions first do the parallel multiplies and then the parallel adds in the XMM registers.

Iteration i = 2:
   A  = [A1,2 | A2,2]   _mm_load_pd
   B1 = [B2,1 | B2,1]   _mm_load1_pd
   B2 = [B2,2 | B2,2]   _mm_load1_pd

Second iteration intermediate result:
   C1 = [A1,1 B1,1 + A1,2 B2,1 | A2,1 B1,1 + A2,2 B2,1] = [C1,1 | C2,1]
   C2 = [A1,1 B1,2 + A1,2 B2,2 | A2,1 B1,2 + A2,2 B2,2] = [C1,2 | C2,2]

Live Example: 2 x 2 Matrix Multiply
   A = | 1  0 |     B = | 1  3 |
       | 0  1 |         | 2  4 |
   C1,1 = 1*1 + 0*2 = 1     C1,2 = 1*3 + 0*4 = 3
   C2,1 = 0*1 + 1*2 = 2     C2,2 = 0*3 + 1*4 = 4
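Before looking at the full SSE listing below, a plain scalar C version of the same column-major 2 x 2 multiply is useful as a reference for checking the result; this sketch and the function name are mine, not from the slides.

    #include <stdio.h>

    /* Scalar reference for the 2 x 2 multiply, using the same column-major
       layout as the SSE listing below (element (i,j) lives at index i + j*lda). */
    void mm2x2_scalar(const double A[4], const double B[4], double C[4]) {
        int lda = 2;
        for (int j = 0; j < 2; j++)
            for (int i = 0; i < 2; i++) {
                double sum = 0.0;
                for (int k = 0; k < 2; k++)
                    sum += A[i + k*lda] * B[k + j*lda];
                C[i + j*lda] = sum;
            }
    }

    int main(void) {
        double A[4] = {1, 0, 0, 1};   /* column order: A = identity */
        double B[4] = {1, 2, 3, 4};   /* column order: B = [[1,3],[2,4]] */
        double C[4];
        mm2x2_scalar(A, B, C);
        printf("%g,%g\n%g,%g\n", C[0], C[2], C[1], C[3]);   /* expect 1,3 / 2,4 */
        return 0;
    }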
Example: 2 x 2 Matrix Multiply (Part 1 of 2)

#include <stdio.h>
// header file for SSE compiler intrinsics
#include <emmintrin.h>

// NOTE: vector registers will be represented in comments as v1 = [ a | b ]
// where v1 is a variable of type __m128d and a, b are doubles

int main(void) {
    // allocate A, B, C aligned on 16-byte boundaries
    double A[4] __attribute__ ((aligned (16)));
    double B[4] __attribute__ ((aligned (16)));
    double C[4] __attribute__ ((aligned (16)));
    int lda = 2;
    int i = 0;
    // declare several 128-bit vector variables
    __m128d c1, c2, a, b1, b2;

    // Initialize A, B, C for example
    /* A = (note column order!)
       1 0
       0 1 */
    A[0] = 1.0; A[1] = 0.0; A[2] = 0.0; A[3] = 1.0;
    /* B = (note column order!)
       1 3
       2 4 */
    B[0] = 1.0; B[1] = 2.0; B[2] = 3.0; B[3] = 4.0;
    /* C = (note column order!)
       0 0
       0 0 */
    C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;

Example: 2 x 2 Matrix Multiply (Part 2 of 2)

    // use aligned loads to set
    // c1 = [c_11 | c_21]
    c1 = _mm_load_pd(C+0*lda);
    // c2 = [c_12 | c_22]
    c2 = _mm_load_pd(C+1*lda);

    for (i = 0; i < 2; i++) {
        /* a =
           i = 0: [a_11 | a_21]
           i = 1: [a_12 | a_22] */
        a = _mm_load_pd(A+i*lda);
        /* b1 =
           i = 0: [b_11 | b_11]
           i = 1: [b_21 | b_21] */
        b1 = _mm_load1_pd(B+i+0*lda);
        /* b2 =
           i = 0: [b_12 | b_12]
           i = 1: [b_22 | b_22] */
        b2 = _mm_load1_pd(B+i+1*lda);

        /* c1 =
           i = 0: [c_11 + a_11*b_11 | c_21 + a_21*b_11]
           i = 1: [c_11 + a_12*b_21 | c_21 + a_22*b_21] */
        c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
        /* c2 =
           i = 0: [c_12 + a_11*b_12 | c_22 + a_21*b_12]
           i = 1: [c_12 + a_12*b_22 | c_22 + a_22*b_22] */
        c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
    }

    // store c1, c2 back into C for completion
    _mm_store_pd(C+0*lda, c1);
    _mm_store_pd(C+1*lda, c2);

    // print C
    printf("%g,%g\n%g,%g\n", C[0], C[2], C[1], C[3]);
    return 0;
}

Inner loop from gcc -O -S

L2: movapd  (%rax,%rsi), %xmm1   // Load aligned A[i,i+1] -> m1
    movddup (%rdx), %xmm0        // Load B[j], duplicate  -> m0
    mulpd   %xmm1, %xmm0         // Multiply m0*m1        -> m0
    addpd   %xmm0, %xmm3         // Add m0+m3             -> m3
    movddup 16(%rdx), %xmm0      // Load B[j+1], duplicate -> m0
    mulpd   %xmm0, %xmm1         // Multiply m0*m1        -> m1
    addpd   %xmm1, %xmm2         // Add m1+m2             -> m2
    addq    $16, %rax            // rax+16 -> rax (i += 2)
    addq    $8, %rdx             // rdx+8  -> rdx (j += 1)
    cmpq    $32, %rax            // rax == 32?
    jne     L2                   // jump to L2 if not equal
    movapd  %xmm3, (%rcx)        // store aligned m3 into C[k,k+1]
    movapd  %xmm2, (%rdi)        // store aligned m2 into C[l,l+1]

Performance-Driven ISA Extensions
•  Subword parallelism, used primarily for multimedia applications
   –  Intel MMX: multimedia extension (64-bit registers can hold multiple integer operands)
   –  Intel SSE: Streaming SIMD extension (128-bit registers can hold several floating-point operands)
•  Adding instructions that do more work per cycle
   –  Shift-add: replace two instructions with one (e.g., multiply by 5)
   –  Multiply-add: replace two instructions with one (x := c + a × b)
   –  Multiply-accumulate: reduce round-off error (s := s + a × b)
   –  Conditional copy: to avoid some branches (e.g., in if-then-else)

Big Idea: Amdahl's (Heartbreaking) Law
•  Speedup due to enhancement E is
      Speedup w/ E = Exec time w/o E / Exec time w/ E
•  Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected:
      Execution Time w/ E = Execution Time w/o E × [ (1-F) + F/S ]
      Speedup w/ E = 1 / [ (1-F) + F/S ]
•  Example: the execution time of half of the program can be accelerated by a factor of 2. What is the program speed-up overall?
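A tiny C check of the formula just above, using the worked example answered on the next slide (F = 0.5, S = 2); the function name is mine, not from the lecture.

    #include <stdio.h>

    /* Speedup w/ E = 1 / [ (1-F) + F/S ] */
    double amdahl_speedup(double F, double S) {
        return 1.0 / ((1.0 - F) + F / S);
    }

    int main(void) {
        /* Half of the program (F = 0.5) accelerated by a factor of 2 (S = 2) */
        printf("speedup = %g\n", amdahl_speedup(0.5, 2.0));   /* 1 / 0.75 = 1.33... */
        return 0;
    }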
Big Idea: Amdahl's Law
   Speedup = 1 / [ (1-F) + F/S ]
   where (1-F) is the non-sped-up part and F/S is the sped-up part.

   Example: the execution time of half of the program can be accelerated by a factor of 2.
   What is the program speed-up overall?
      1 / (0.5 + 0.5/2) = 1 / (0.5 + 0.25) = 1.33

Big Idea: Amdahl's Law
•  If the portion of the program that can be parallelized is small, then the speedup is limited
   –  The non-parallel portion limits the performance

Example #1: Amdahl's Law
   Speedup w/ E = 1 / [ (1-F) + F/S ]
•  Consider an enhancement which runs 20 times faster but which is only usable 25% of the time
      Speedup w/ E = 1 / (0.75 + 0.25/20) = 1.31
•  What if it's usable only 15% of the time?
      Speedup w/ E = 1 / (0.85 + 0.15/20) = 1.17
•  Amdahl's Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!
•  To get a speedup of 90 from 100 processors, the percentage of the original program that could be scalar would have to be 0.1% or less
      Speedup w/ E = 1 / (0.001 + 0.999/100) = 90.99

Example #2: Amdahl's Law
   Speedup w/ E = 1 / [ (1-F) + F/S ]
•  Consider summing 10 scalar variables and two 10 by 10 matrices (matrix sum) on 10 processors
      Speedup w/ E = 1 / (0.091 + 0.909/10) = 1/0.1819 = 5.5
•  What if there are 100 processors?
      Speedup w/ E = 1 / (0.091 + 0.909/100) = 1/0.10009 = 10.0
•  What if the matrices are 100 by 100 (or 10,010 adds in total) on 10 processors?
      Speedup w/ E = 1 / (0.001 + 0.999/10) = 1/0.1009 = 9.9
•  What if there are 100 processors?
      Speedup w/ E = 1 / (0.001 + 0.999/100) = 1/0.01099 = 91

Strong and Weak Scaling
•  Getting good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem
   –  Strong scaling: speedup achieved on a parallel processor without increasing the size of the problem
   –  Weak scaling: speedup achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors
•  Load balancing is another important factor: every processor should do the same amount of work
   –  Just 1 unit with twice the load of the others cuts the speedup almost in half

Review
•  Flynn Taxonomy of Parallel Architectures
   –  SIMD: Single Instruction Multiple Data
   –  MIMD: Multiple Instruction Multiple Data
   –  SISD: Single Instruction Single Data
   –  MISD: Multiple Instruction Single Data (unused)
•  Intel SSE SIMD Instructions
   –  One instruction fetch that operates on multiple operands simultaneously
   –  128/64-bit XMM registers
•  SSE Instructions in C
   –  Embed the SSE machine instructions directly into C programs through the use of intrinsics
   –  Achieve efficiency beyond that of an optimizing compiler
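As a closing check of the Amdahl's Law arithmetic in Example #2 above, a short C program (mine, not from the slides) reproduces the four speedups and shows the strong vs. weak scaling contrast numerically.

    #include <stdio.h>

    /* Amdahl: speedup = 1 / [ (1-F) + F/S ], with S = number of processors here */
    static double speedup(double parallel_adds, double scalar_adds, double procs) {
        double F = parallel_adds / (parallel_adds + scalar_adds);  /* parallelizable fraction */
        return 1.0 / ((1.0 - F) + F / procs);
    }

    int main(void) {
        /* Example #2: 10 scalar adds + a 10x10 matrix sum (100 parallel adds) */
        printf("10x10   on  10 procs: %.1f\n", speedup(100.0, 10.0, 10.0));     /* ~5.5  */
        printf("10x10   on 100 procs: %.1f\n", speedup(100.0, 10.0, 100.0));    /* ~10.0 */
        /* Weak scaling: grow the problem to 100x100 (10,000 parallel adds) */
        printf("100x100 on  10 procs: %.1f\n", speedup(10000.0, 10.0, 10.0));   /* ~9.9  */
        printf("100x100 on 100 procs: %.1f\n", speedup(10000.0, 10.0, 100.0));  /* ~91   */
        return 0;
    }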