14LecSp12DLPIIx6 - 2/26/12 - CS 61C: Great Ideas in Computer Architecture, Lecture #14: SIMD II (New-School Machine Structures)


2/26/12

CS 61C: Great Ideas in Computer Architecture
Lecture #14: SIMD II
Instructor: David A. Patterson
http://inst.eecs.Berkeley.edu/~cs61c/sp12

New-School Machine Structures (It's a bit more complicated!)
•  Parallel Requests: assigned to a computer, e.g., search "Katz" (warehouse scale computer)
•  Parallel Threads: assigned to a core, e.g., lookup, ads
   (Harness parallelism & achieve high performance)
•  Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
•  Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words (A0+B0, A1+B1, A2+B2, A3+B3) -- today's lecture
•  Hardware descriptions: all gates @ one time
•  Programming Languages
(Figure: hardware hierarchy from Smart Phone / Warehouse Scale Computer down through Computer, Core, Memory (Cache), Input/Output, Instruction Unit(s), Functional Unit(s), Cache Memory, and Logic Gates)

Review
•  Flynn Taxonomy of Parallel Architectures
   –  SIMD: Single Instruction Multiple Data
   –  MIMD: Multiple Instruction Multiple Data
   –  SISD: Single Instruction Single Data
   –  MISD: Multiple Instruction Single Data (unused)
•  Intel SSE SIMD Instructions
   –  One instruction fetch that operates on multiple operands simultaneously
   –  128/64-bit XMM registers
•  SSE Instructions in C
   –  Embed the SSE machine instructions directly into C programs through use of intrinsics
   –  Achieve efficiency beyond that of an optimizing compiler

Agenda
•  Amdahl's Law
•  Administrivia
•  SIMD and Loop Unrolling
•  Technology Break
•  Memory Performance for Caches
•  Review of 1st Half of 61C

Big Idea: Amdahl's (Heartbreaking) Law
•  Speedup due to enhancement E is
      Speedup w/ E = (Exec time w/o E) / (Exec time w/ E)
•  Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1), and the remainder of the task is unaffected:
      Execution Time w/ E = Execution Time w/o E x [(1 - F) + F/S]
      Speedup w/ E = 1 / [(1 - F) + F/S]

Big Idea: Amdahl's Law
      Speedup = 1 / [(1 - F) + F/S], where (1 - F) is the non-sped-up part and F/S is the sped-up part
•  Example: the execution time of half of the program can be accelerated by a factor of 2. What is the overall program speedup?
      Speedup = 1 / (0.5 + 0.5/2) = 1 / 0.75 = 1.33
•  If the portion of the program that can be parallelized is small, then the speedup is limited: the non-parallel portion limits the performance.

Example #1: Amdahl's Law
      Speedup w/ E = 1 / [(1 - F) + F/S]
•  Consider an enhancement which runs 20 times faster but which is only usable 25% of the time
      Speedup w/ E = 1/(0.75 + 0.25/20) = 1.31
•  What if it's usable only 15% of the time?
      Speedup w/ E = 1/(0.85 + 0.15/20) = 1.17
•  Amdahl's Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!
•  To get a speedup of 90 from 100 processors, the percentage of the original program that could be scalar would have to be 0.1% or less
      Speedup w/ E = 1/(0.001 + 0.999/100) = 90.99

Parallel Speed-up Example
•  Workload: sum the scalars Z0 + Z1 + … + Z10 (non-parallel part), plus the element-wise sum of two 10x10 matrices X and Y (X1,1 + Y1,1 through X10,10 + Y10,10); the matrix sum is partitioned 10 ways and performed on 10 parallel processing units (parallel part)
•  10 "scalar" operations (non-parallelizable)
•  100 parallelizable operations
•  110 operations total
   –  100/110 = 0.909 parallelizable, 10/110 = 0.091 scalar

Example #2: Amdahl's Law
      Speedup w/ E = 1 / [(1 - F) + F/S]
•  Consider summing 10 scalar variables and two 10 by 10 matrices (matrix sum) on 10 processors
      Speedup w/ E = 1/(0.091 + 0.909/10) = 1/0.1819 = 5.5
•  What if there are 100 processors?
      Speedup w/ E = 1/(0.091 + 0.909/100) = 1/0.10009 = 10.0
•  What if the matrices are 33 by 33 (or 1019 adds in total) on 10 processors? (increase parallel data by 10x)
      Speedup w/ E = 1/(0.009 + 0.991/10) = 1/0.108 = 9.2
•  What if there are 100 processors?
      Speedup w/ E = 1/(0.009 + 0.991/100) = 1/0.019 = 52.6

Strong and Weak Scaling
•  Getting good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
   –  Strong scaling: speedup is achieved on a parallel processor without increasing the size of the problem (e.g., the 10x10 matrix as the processor count grows from 10 to 100)
   –  Weak scaling: speedup is achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors (e.g., 10x10 matrix on 10 processors => 33x33 matrix on 100)
•  Load balancing is another important factor: every processor should do the same amount of work
   –  Just 1 unit with twice the load of the others cuts speedup almost in half
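The worked numbers in the Amdahl's Law slides above are easy to check mechanically. Below is a minimal C sketch (not from the lecture; the function name amdahl_speedup is just an illustrative choice) that evaluates Speedup = 1 / [(1 - F) + F/S] for the cases shown on these slides.

```c
#include <stdio.h>

/* Amdahl's Law: speedup when a fraction F of the work is sped up by a factor S. */
static double amdahl_speedup(double F, double S) {
    return 1.0 / ((1.0 - F) + F / S);
}

int main(void) {
    /* Half the program accelerated by 2x -> 1.33 */
    printf("F=0.50,  S=2  : %.2f\n", amdahl_speedup(0.50, 2.0));
    /* Enhancement 20x faster, usable 25% (then 15%) of the time -> 1.31, 1.17 */
    printf("F=0.25,  S=20 : %.2f\n", amdahl_speedup(0.25, 20.0));
    printf("F=0.15,  S=20 : %.2f\n", amdahl_speedup(0.15, 20.0));
    /* 99.9% parallelizable on 100 processors -> 90.99 */
    printf("F=0.999, S=100: %.2f\n", amdahl_speedup(0.999, 100.0));
    /* Matrix-sum example: 90.9% parallel on 10 and 100 processors -> 5.5, 10.0 */
    printf("F=0.909, S=10 : %.1f\n", amdahl_speedup(0.909, 10.0));
    printf("F=0.909, S=100: %.1f\n", amdahl_speedup(0.909, 100.0));
    return 0;
}
```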
Administrivia (peer instruction question)
Suppose a program spends 80% of its time in a square-root routine. How much must you speed up square root to make the program run 5 times faster?
      Speedup w/ E = 1 / [(1 - F) + F/S]
☐ 10
☐ 20
☐ 100
☐

Administrivia
•  Lab #7 posted
•  Midterm in 5 days:
   –  Exam: Tu, Mar 6, 6:40-9:40 PM, 2050 VLSB
   –  Covers everything through lecture today
   –  Closed book; can bring one sheet of notes, both sides
   –  Copy of the Green Card will be supplied
   –  No phones, calculators, …; just bring pencils & eraser
   –  TA Review: Su, Mar 4, starting 2 PM, 2050 VLSB
•  Will send (anonymous) 61C midway survey before Midterm

Agenda
•  Amdahl's Law
•  Administrivia
•  SIMD and Loop Unrolling
•  Technology Break
•  Memory Performance for Caches
•  Review of 1st Half of 61C

Data Level Parallelism and SIMD
•  SIMD wants adjacent values in memory that can be operated on in parallel
•  Usually specified in programs as loops:
      for(i=1000; i>0; i=i-1)
         x[i] = x[i] + s;
•  How can we reveal more data-level parallelism than is available in a single iteration of a loop?
•  Unroll the loop and adjust the iteration rate

Looping in MIPS
Assumptions:
 -  R1 is initially the address of the element in the array with the highest address
 -  F2 contains the scalar value s
 -  8(R2) is the address of the last element to operate on
CODE:
Loop: 1. l.d    F0, 0(R1)     ; F0 = array element
      2. add.d  F4, F0, F2    ; add s to F0
      3. s.d    F4, 0(R1)     ; store result
      4. addui  R1, R1, #-8   ; decrement pointer by 8 bytes
      5. bne    R1, R2, Loop  ; repeat loop if R1 != R2

Loop Unrolled
Loop: l.d    F0, 0(R1)
      add.d  F4, F0, F2
      s.d    F4, 0(R1)
      l.d    F6, -8(R1)
      add.d  F8, F6, F2
      s.d    F8, -8(R1)
      l.d    F10, -16(R1)
      add.d  F12, F10, F2
      s.d    F12, -16(R1)
      l.d    F14, -24(R1)
      add.d  F16, F14, F2
      s.d    F16, -24(R1)
      addui  R1, R1, #-32
      bne    R1, R2, Loop
NOTE:
1.  Different registers eliminate stalls
2.  Only 1 loop overhead every 4 iterations
3.  This unrolling works if loop_limit (mod 4) = 0

Loop Unrolled Scheduled
Loop: l.d    F0, 0(R1)
      l.d    F6, -8(R1)
      l.d    F10, -16(R1)
      l.d    F14, -24(R1)     ; 4 loads side-by-side: could replace with a 4-wide SIMD load
      add.d  F4, F0, F2
      add.d  F8, F6, F2
      add.d  F12, F10, F2
      add.d  F16, F14, F2     ; 4 adds side-by-side: could replace with a 4-wide SIMD add
      s.d    F4, 0(R1)
      s.d    F8, -8(R1)
      s.d    F12, -16(R1)
      s.d    F16, -24(R1)     ; 4 stores side-by-side: could replace with a 4-wide SIMD store
      addui  R1, R1, #-32
      bne    R1, R2, Loop

Loop Unrolling in C
•  Instead of the compiler doing the loop unrolling, you could do it yourself in C:
      for(i=1000; i>0; i=i-1)
         x[i] = x[i] + s;
•  Could be rewritten as:
      for(i=1000; i>0; i=i-4) {
         x[i]   = x[i]   + s;
         x[i-1] = x[i-1] + s;
         x[i-2] = x[i-2] + s;
         x[i-3] = x[i-3] + s;
      }
•  What is the downside of doing it in C? (A 4-wide SIMD sketch of this loop follows below.)
Generalizing Loop Unrolling
•  Take a loop of n iterations and make k copies of the body of the loop
•  Then run the loop with 1 copy of the body (n mod k) times and with k copies of the body floor(n/k) times
•  (Will revisit loop unrolling again when we get to pipelining later in the semester)

Agenda
•  Amdahl's Law
•  Administrivia
•  SIMD and Loop Unrolling
•  Memory Performance for Caches
•  Review of 1st Half of 61C

Reading Miss Penalty: Memory Systems that Support Caches
•  The off-chip interconnect and memory architecture affect overall system performance in dramatic ways
•  Assume a one-word-wide organization (one-word-wide bus and one-word-wide memory) connecting CPU, cache, and DRAM memory, with 32-bit data and 32-bit address per bus cycle:
   –  1 memory bus clock cycle to send the address
   –  15 memory bus clock cycles to get the 1st word in the block from DRAM (row cycle time), 5 memory bus clock cycles for the 2nd, 3rd, and 4th words (subsequent column access time); note the effect of latency!
   –  1 memory bus clock cycle to return a word of data
•  Memory-bus-to-cache bandwidth = number of bytes accessed from memory and transferred to cache/CPU per memory bus clock cycle
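As a back-of-the-envelope check (not worked out on the slide), the parameters above imply a miss penalty and a bus bandwidth once a block size is assumed. The 4-word block and the exact cycle accounting in this C sketch are my assumptions, not numbers from the lecture.

```c
#include <stdio.h>

int main(void) {
    /* Slide parameters for the one-word-wide organization (block size assumed). */
    int words_per_block = 4;   /* assumption: 4-word cache block          */
    int addr_cycles     = 1;   /* send the address                        */
    int first_word      = 15;  /* DRAM row cycle time for the 1st word    */
    int next_word       = 5;   /* column access time for each later word  */
    int xfer_cycles     = 1;   /* return one word of data                 */

    int miss_penalty = addr_cycles
                     + first_word + (words_per_block - 1) * next_word
                     + words_per_block * xfer_cycles;            /* 35 bus cycles   */
    double bandwidth = (4.0 * words_per_block) / miss_penalty;   /* bytes per cycle */

    printf("miss penalty = %d bus clock cycles\n", miss_penalty);
    printf("bandwidth    = %.2f bytes per bus clock cycle\n", bandwidth);
    return 0;
}
```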
(DDR) SDRAM Operation
•  After a row is read into the N x M SRAM row register:
   –  Input CAS as the starting "burst" address along with a burst length
   –  Transfers a burst of data (ideally a cache block) from a series of sequential addresses within that row
   –  The memory bus clock controls the transfer of successive words in the burst
•  (Figure: DRAM array of N rows x N cols with M bit planes; the row address selects a row into the N x M SRAM, then the column address, auto-incremented (+1), streams out M-bit outputs. Timing: RAS/row address, then CAS/column address, then the 1st through 4th M-bit accesses within one cycle time.)

Agenda
•  Amdahl's Law
•  Administrivia
•  SIMD and Loop Unrolling
•  Memory Performance for Caches
•  Technology Break
•  Review of 1st Half of 61C

New-School Machine Structures (It's a bit more complicated!)
The same software/hardware picture as the opening slide: Parallel Requests (assigned to a computer, e.g., search "Katz"), Parallel Threads (assigned to a core; harness parallelism & achieve high performance), Parallel Instructions (>1 instruction @ one time, e.g., 5 pipelined instructions), Parallel Data (>1 data item @ one time, e.g., add of 4 pairs of words, A0+B0 through A3+B3), Hardware descriptions (all gates functioning in parallel at the same time), and Programming Languages. This time the figure is annotated with where Projects 1 through 4 fall: from the warehouse scale computer (Project 1) down through the core and instruction/functional units (Project 2), main memory and data-level parallelism (Project 3), and logic gates (Project 4).

First Half of 61C: 6 Great Ideas in Computer Architecture
1.  Layers of Representation/Interpretation
2.  Moore's Law
3.  Principle of Locality/Memory Hierarchy
4.  Parallelism
5.  Performance Measurement & Improvement
6.  Dependability via Redundancy

Great Idea #1: Levels of Representation/Interpretation (first half of 61C)
High Level Language Program (e.g., C):
      temp = v[k];
      v[k] = v[k+1];
      v[k+1] = temp;
   ↓ Compiler
Assembly Language Program (e.g., MIPS):
      lw $t0, 0($2)
      lw $t1, 4($2)
      sw $t1, 0($2)
      sw $t0, 4($2)
   ↓ Assembler
Machine Language Program (MIPS): the same instructions encoded as 32-bit binary words; anything can be represented as a number, i.e., data or instructions
   ↓ Machine Interpretation
Hardware Architecture Description (e.g., block diagrams)
   ↓ Architecture Implementation
Logic Circuit Description (circuit schematic diagrams)

Great Idea #2: Moore's Law
•  Predicts: 2X transistors per chip every 1.5 years
•  Gordon Moore, Intel cofounder, B.S. Cal 1950
•  (Plot: number of transistors on an integrated circuit (IC) vs. year)

Great Idea #3: Principle of Locality/Memory Hierarchy (first half of 61C)
•  (Figure: memory hierarchy from the first half of 61C)

Great Idea #4: Parallelism
•  Data-level parallelism in the 1st half of 61C:
   –  Lots of data in memory that can be operated on in parallel (e.g., adding together 2 arrays)
   –  Lots of data on many disks that can be operated on in parallel (e.g., searching for documents)
•  1st project: DLP across 10s of servers and disks using MapReduce
•  Next week's lab, 3rd project: DLP in memory

Summary
•  Amdahl's Cruel Law: the Law of Diminishing Returns
•  Loop unrolling to expose parallelism
•  Optimize miss penalty via the memory system
•  As the field changes, CS 61C has to change too!
•  Still about the software-hardware interface:
   –  Programming for performance via measurement!
   –  Understanding the memory hierarchy and its impact on application performance
   –  Unlocking the capabilities of the architecture for performance: SIMD