CS 61C: Great Ideas in Computer Architecture (Machine Structures)
Performance
Instructors: Randy H. Katz, David A. Patterson
http://inst.eecs.Berkeley.edu/~cs61c/sp11
Spring 2011, Lecture #10, 2/14/11

New-School Machine Structures (It's a bit more complicated!)
Software / Hardware:
•  Parallel Requests: assigned to computer, e.g., search "Katz"
•  Parallel Threads: assigned to core, e.g., lookup, ads
   (Harness parallelism & achieve high performance: how do we know?)
•  Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
•  Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words (A0+B0, A1+B1, A2+B2, A3+B3)
•  Hardware descriptions: all gates @ one time
[Diagram: Smart Phone and Warehouse Scale Computer at top; inside a Computer: Core(s) with Instruction Unit(s) and Functional Unit(s), Memory (Cache), Input/Output, Main Memory, down to Logic Gates]

Agenda
•  Defining Performance
•  Administrivia
•  Workloads and Benchmarks
•  Technology Break
•  Measuring Performance
•  Summary

What is Performance?
•  Latency (or response time or execution time)
   – Time to complete one task
•  Bandwidth (or throughput)
   – Tasks completed per unit time

Running Systems to 100% Utilization
•  Implication of the graph at the right?
•  Can you explain why this happens? (Student Roulette?)
[Figure: service time, aka latency or responsiveness, vs. utilization; response time rises sharply at a "knee" as utilization approaches 100%]

The Iron Law of Queues (aka Little's Law)
L = λ × W
Average number of customers in the system (L) = average arrival rate (λ) × average time each customer spends in the system (W)

Cloud Performance: Why Application Latency Matters
•  Key figure of merit: application responsiveness
   – The longer the delay, the fewer the user clicks, the less the user happiness, and the lower the revenue per user

Google Instant Search ("Instant Efficiency")
•  A typical search takes 24 seconds; Google's search algorithm accounts for only 300 ms of this
•  "It's not search 'as you type', but 'search before you type'!"
•  "We can predict what you are likely to type and give you those results in real time"

Defining CPU Performance
•  What does it mean to say X is faster than Y?
•  Ferrari vs. School Bus?
•  2009 Ferrari 599 GTB: 2 passengers, 11.1 secs in the quarter mile
•  2009 Type D school bus: 54 passengers, quarter mile time?
   http://www.youtube.com/watch?v=KwyCoQuhUNA
•  Response Time/Latency: e.g., time to travel ¼ mile
•  Throughput/Bandwidth: e.g., passenger-miles in 1 hour

Defining Relative CPU Performance
•  Performance_X = 1 / Execution Time_X
•  Performance_X > Performance_Y
   => 1/Execution Time_X > 1/Execution Time_Y
   => Execution Time_Y > Execution Time_X
•  Computer X is N times faster than Computer Y:
   Performance_X / Performance_Y = N, or
   Execution Time_Y / Execution Time_X = N
•  Bus is to Ferrari as 12 is to 11.1: the Ferrari is 1.08 times faster than the bus!
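To make the relative-performance definition concrete, here is a minimal C sketch (an editorial illustration, not from the slides; the 12-second bus quarter-mile time is the figure quoted in the bullet above):

#include <stdio.h>

/* X is N times faster than Y when Execution Time_Y / Execution Time_X = N */
static double times_faster(double time_x, double time_y) {
    return time_y / time_x;
}

int main(void) {
    double ferrari_s = 11.1;  /* 2009 Ferrari 599 GTB quarter mile, seconds */
    double bus_s = 12.0;      /* school bus quarter mile, seconds (from the slide) */
    printf("Ferrari is %.2f times faster than the bus\n",
           times_faster(ferrari_s, bus_s));  /* prints 1.08 */
    return 0;
}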
Measuring CPU Performance
•  Computers use a clock to determine when events take place within hardware
•  Clock cycles: discrete time intervals
   – aka clocks, cycles, clock periods, clock ticks
•  Clock rate or clock frequency: clock cycles per second (inverse of clock cycle time)
•  3 GigaHertz clock rate => clock cycle time = 1/(3x10^9) seconds = 333 picoseconds (ps)

CPU Performance Factors
•  To distinguish between processor time and I/O, CPU time is time spent in the processor
•  CPU Time/Program = Clock Cycles/Program x Clock Cycle Time
•  Or CPU Time/Program = Clock Cycles/Program ÷ Clock Rate

Restating the Performance Equation
•  But a program executes instructions
•  CPU Time/Program = Clock Cycles/Program x Clock Cycle Time
   = Instructions/Program x Average Clock Cycles/Instruction x Clock Cycle Time
•  Time = Seconds/Program = (Instructions/Program) x (Clock Cycles/Instruction) x (Seconds/Clock Cycle)
•  1st term called Instruction Count
•  2nd term abbreviated CPI, for average Clock Cycles Per Instruction
•  3rd term is 1 / Clock Rate

What Affects Each Component? (Instruction Count, CPI, Clock Rate)
For each: hardware or software component? Affects what?
•  Algorithm
•  Programming Language
•  Compiler
•  Instruction Set Architecture

Peer Instruction Question
•  Computer A: clock cycle time 250 ps, CPI_A = 2
•  Computer B: clock cycle time 500 ps, CPI_B = 1.2
•  Assume A and B have the same instruction set
•  Which statement is true? (Student Roulette?)
   Red:    Computer A is ~1.2 times faster than B
   Orange: Computer A is ~4.0 times faster than B
   Green:  Computer B is ~1.7 times faster than A
   Yellow: Computer B is ~3.4 times faster than A
   Pink:   None of the above
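A minimal C sketch of the performance equation above (an editorial illustration; the instruction count and CPI are made-up values, and the cycle time is the 3 GHz example from the measuring slide):

#include <stdio.h>

/* Iron Law: CPU Time = Instruction Count x CPI x Clock Cycle Time */
static double cpu_time_s(double instructions, double cpi, double cycle_time_s) {
    return instructions * cpi * cycle_time_s;
}

int main(void) {
    /* Hypothetical program: 1 billion instructions at CPI = 2 on a 3 GHz machine */
    printf("CPU time = %.3f s\n", cpu_time_s(1e9, 2.0, 1.0 / 3e9));  /* ~0.667 s */
    return 0;
}

Because computers A and B in the peer instruction question execute the same instruction count, plugging each machine's CPI and cycle time into this formula is enough to rank them.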
Administrivia
•  Lab #5 posted
•  Project #2.1 due Sunday @ 11:59:59
•  HW #4 due Sunday @ 11:59:59
•  Midterm in less than three weeks:
   – No discussion during exam week
   – TA Review: Su, Mar 6, 2-5 PM, 2050 VLSB
   – Exam: Tu, Mar 8, 6-9 PM, 145/155 Dwinelle
   – Small number of special consideration cases, due to class conflicts, etc.: contact Dave or Randy

Workload and Benchmark
•  Workload: set of programs run on a computer
   – Actual collection of applications run, or made from real programs to approximate such a mix
   – Specifies both programs and relative frequencies
•  Benchmark: program selected for use in comparing computer performance
   – Benchmarks form a workload
   – Usually standardized so that many use them

SPEC (System Performance Evaluation Cooperative)
•  Computer vendor cooperative for benchmarks, started in 1989
•  SPECCPU2006
   – 12 Integer Programs
   – 17 Floating-Point Programs
•  Often turned into a number where bigger is faster
•  SPECratio: reference execution time on an old reference computer divided by execution time on the new computer, giving an effective speed-up

SPECINT2006 on AMD Barcelona
Description                       | Instr Count (B) | CPI  | Clock cycle time (ps) | Execution Time (s) | Reference Time (s) | SPECratio
Interpreted string processing     | 2,118           | 0.75 | 400                   | 637                | 9,770              | 15.3
Block-sorting compression         | 2,389           | 0.85 | 400                   | 817                | 9,650              | 11.8
GNU C compiler                    | 1,050           | 1.72 | 400                   | 724                | 8,050              | 11.1
Combinatorial optimization        | 336             | 10.0 | 400                   | 1,345              | 9,120              | 6.8
Go game                           | 1,658           | 1.09 | 400                   | 721                | 10,490             | 14.6
Search gene sequence              | 2,783           | 0.80 | 400                   | 890                | 9,330              | 10.5
Chess game                        | 2,176           | 0.96 | 400                   | 837                | 12,100             | 14.5
Quantum computer simulation       | 1,623           | 1.61 | 400                   | 1,047              | 20,720             | 19.8
Video compression                 | 3,102           | 0.80 | 400                   | 993                | 22,130             | 22.3
Discrete event simulation library | 587             | 2.94 | 400                   | 690                | 6,250              | 9.1
Games/path finding                | 1,082           | 1.79 | 400                   | 773                | 7,020              | 9.1
XML parsing                       | 1,058           | 2.70 | 400                   | 1,143              | 6,900              | 6.0

Summarizing Performance ... Depends Who's Selling
Average throughput:
System | Rate (Task 1) | Rate (Task 2) | Average
A      | 10            | 20            | 15
B      | 20            | 10            | 15
Throughput relative to B:
System | Rate (Task 1) | Rate (Task 2) | Average
A      | 0.50          | 2.00          | 1.25
B      | 1.00          | 1.00          | 1.00
Throughput relative to A:
System | Rate (Task 1) | Rate (Task 2) | Average
A      | 1.00          | 1.00          | 1.00
B      | 2.00          | 0.50          | 1.25
Which system is faster? (Student Roulette?)

Energy and Power (Energy = Power x Time)
•  Energy to complete operation (Joules)
   – Corresponds approximately to battery life
•  Peak power dissipation (Watts = Joules/s)
   – Affects heat (and cooling demands)
   – IT equipment's power is in the denominator of the Power Utilization Efficiency (PUE) equation, a WSC figure of merit

Summarizing SPEC Performance
•  Varies from 6x to 22x faster than the reference computer
•  Geometric mean of ratios: N-th root of the product of N ratios
   – Geometric mean gives the same relative answer no matter what computer is used as reference
•  Geometric mean for Barcelona is 11.7
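A small C sketch of the geometric mean just defined, applied to the twelve SPECratios in the Barcelona table above (an editorial illustration; computing via logarithms to avoid overflow is a standard trick, not something the slide specifies):

#include <stdio.h>
#include <math.h>   /* link with -lm */

/* Geometric mean: the N-th root of the product of N ratios,
   computed as exp(average of the logs) to avoid overflow */
static double geometric_mean(const double *r, int n) {
    double log_sum = 0.0;
    for (int i = 0; i < n; i++)
        log_sum += log(r[i]);
    return exp(log_sum / n);
}

int main(void) {
    /* SPECratios from the SPECINT2006 table above */
    double ratios[12] = {15.3, 11.8, 11.1, 6.8, 14.6, 10.5,
                         14.5, 19.8, 22.3, 9.1, 9.1, 6.0};
    printf("Geometric mean = %.1f\n", geometric_mean(ratios, 12));  /* ~11.7 */
    return 0;
}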
Peak Power vs. Lower Energy (Power x Time = Energy)
[Figure: power vs. time curves for two systems, with Peak A higher than Peak B; integrate the power curve to get energy]
•  Which system has higher peak power?
•  Which system has higher energy? (Student Roulette?)

Energy-Proportional Computing
"The Case for Energy-Proportional Computing," Luiz André Barroso and Urs Hölzle, IEEE Computer, December 2007
•  It is surprisingly hard to achieve high levels of utilization of typical servers (and your home PC or laptop is even worse)
•  Figure 1. Average CPU utilization of more than 5,000 servers during a six-month period. Servers are rarely completely idle and seldom operate near their maximum utilization, instead operating most of the time at between 10 and 50 percent of their maximum.

SPECPower
•  Increasing importance of power and energy: create a benchmark for performance and power
•  Most servers in WSCs have average utilization between 10% and 50%, so measure power at medium as well as at high load
•  Measure best performance and power, then step down the request rate to measure power for every 10% reduction in performance
•  Java server benchmark performance is operations per second (ssj_ops), so the metric is ssj_ops/Watt

SPECPower on Barcelona
Target Load % | Performance (ssj_ops) | Avg. Power (Watts)
100%          | 231,867               | 295
90%           | 211,282               | 286
80%           | 185,803               | 275
70%           | 163,427               | 265
60%           | 140,160               | 256
50%           | 118,324               | 246
40%           | 92,035                | 233
30%           | 70,500                | 222
20%           | 47,126                | 206
10%           | 23,066                | 180
0%            | 0                     | 141
Sum           | 1,283,590             | 2,605
ssj_ops/Watt: 493
[Figure: Energy Proportionality, target load (0-100%) vs. power (0-300 Watts)]
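To see where the 493 ssj_ops/Watt summary figure comes from, here is a minimal C check (an editorial illustration): SPECPower's overall metric is the sum of ssj_ops delivered at each load level divided by the sum of average power at each level, including active idle.

#include <stdio.h>

int main(void) {
    /* (ssj_ops, Watts) rows from the SPECPower-on-Barcelona table,
       100% load down to 0% (active idle) */
    double ops[11] = {231867, 211282, 185803, 163427, 140160, 118324,
                      92035, 70500, 47126, 23066, 0};
    double watts[11] = {295, 286, 275, 265, 256, 246, 233, 222, 206, 180, 141};
    double sum_ops = 0.0, sum_watts = 0.0;
    for (int i = 0; i < 11; i++) {
        sum_ops += ops[i];
        sum_watts += watts[i];
    }
    printf("%.0f ssj_ops / %.0f W = %.0f ssj_ops/Watt\n",
           sum_ops, sum_watts, sum_ops / sum_watts);  /* 1,283,590 / 2,605 = ~493 */
    return 0;
}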
Which is Better? (1 Red Machine vs. 5 Green Machines)
Machine configurations on the slide: two 3.0-GHz Xeons, 16 GB DRAM, 1 disk; one 2.4-GHz Xeon, 8 GB DRAM, 1 disk
•  Five machines running at 10% utilization: Total Power = ?
•  One machine running at 50% utilization: Total Power = ?
•  Assume 85% of peak power @ 50% utilization and 65% of peak power @ 10% utilization (Student Roulette?)

Other Benchmark Attempts
•  Rather than run a collection of real programs and take their average (geometric mean), create a single program that matches the average behavior of a set of programs
•  Called a synthetic benchmark
•  First example, called Whetstone, in 1972, for floating-point-intensive programs in Fortran
•  Second example, called Dhrystone, in 1985, for integer programs in Ada and C
   – Pun on wet vs. dry ("Whet" vs. "Dhry")

Dhrystone Shortcomings
•  Dhrystone features unusual code that is not usually representative of real-life programs
•  Dhrystone is susceptible to compiler optimizations
•  Dhrystone's small code size means it always fits in caches, so it is not representative
•  Yet it is still used in handheld, embedded CPUs!

Compiler Optimization and Dhrystone
•  gcc compiler options:
   -O1: the compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time
   -O2: optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. Compared to -O, this option increases both compilation time and the performance of the generated code
   -O3: optimize yet more. Turns on all -O2 optimizations and also the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-vectorize and -fipa-cp-clone options

Detailed -O1, -O2 Optimizations
-fauto-inc-dec -fcprop-registers -fdce -fdefer-pop -fdelayed-branch -fdse -fthread-jumps -fguess-branch-probability -fif-conversion2 -fif-conversion -fipa-pure-const -fipa-profile -fipa-reference -fmerge-constants -fsplit-wide-types -ftree-bit-ccp -fcse-follow-jumps -fcse-skip-blocks -fdelete-null-pointer-checks -fexpensive-optimizations -fgcse -fgcse-lm -finline-small-functions -findirect-inlining -fipa-sra -foptimize-sibling-calls -fpartial-inlining -fpeephole2 -fregmove -ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-copyrename -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre -ftree-phiprop -ftree-sra -ftree-pta -ftree-ter -funit-at-a-time -falign-functions -falign-jumps -falign-loops -falign-labels -fcaller-saves -fcrossjumping -freorder-blocks -freorder-functions -frerun-cse-after-loop -fsched-interblock -fschedule-insns -fsched-spec -fschedule-insns2 -fstrict-aliasing -fstrict-overflow -ftree-switch-conversion -ftree-pre -ftree-vrp
http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

Measuring Time
•  UNIX time command measures in seconds
•  Time Stamp Counter
   – 64-bit counter of clock cycles on Intel 80x86 instruction set computers
   – 80x86 instruction RDTSC (Read TSC) returns the TSC in registers EDX (upper 32 bits) and EAX (lower 32 bits)
   – Can read, but can't set
   – How long can it measure?
   – Measures overall time, not just time for one program

How to get RDTSC access in C?
static inline unsigned long long RDTSC(void)
{
    unsigned hi, lo;
    asm volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long) lo) | (((unsigned long long) hi) << 32);
}
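A sketch of how the routine above might be used to time a region of code (an editorial illustration; it assumes an x86 machine and a gcc-compatible compiler, and it repeats the slide's RDTSC() definition so the example is self-contained):

#include <stdio.h>

static inline unsigned long long RDTSC(void) {
    unsigned hi, lo;
    asm volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long) lo) | (((unsigned long long) hi) << 32);
}

int main(void) {
    volatile double x = 0.0;   /* volatile keeps the loop from being optimized away */
    unsigned long long start = RDTSC();
    for (int i = 0; i < 1000000; i++)
        x += i * 0.5;          /* the region being timed */
    unsigned long long stop = RDTSC();
    /* Per the caveat above: this is elapsed cycles for the whole system
       during the interval, not CPU time charged to this program alone */
    printf("elapsed cycles: %llu\n", stop - start);
    return 0;
}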
Where Do You Spend the Time in Your Program?
•  Profiling a program (e.g., using gprof) shows where it spends its time by function, so you can determine which code consumes most of the execution time
•  Usually a 90/10 rule: 10% of the code is responsible for 90% of the execution time
   – Or an 80/20 rule, where 20% of the code is responsible for 80% of the time

gcc Optimization Experiment
[Results grid to fill in during lecture: rows No Opt, -O1, -O2, -O3; columns BubbleSort.c, Dhrystone.c]

gprof
•  Learn where the program spent its time
•  Learn which functions were called while it was executing
   – And which functions call other functions
•  Three steps:
   – Compile & link the program with profiling enabled: cc -pg x.c (in addition to the other flags you use)
   – Execute the program to generate a profile data file
   – Run gprof to analyze the profile data

gprof example (flat profile excerpt; see http://linuxgazette.net/100/vinayak.html)
Calls  | Self ms/call | Total ms/call | Name
12,484 | 0.00         | 0.00          | file_hash_1
29,981 | 0.00         | 0.00          | hash_find_slot
14,769 | 0.00         | 0.00          | next_token
23,480 | 0.00         | 0.00          | find_char_unquote
120    | 0.33         | 0.73          | pattern_search
5,120  | 0.01         | 0.01          | collapse_continuations
148    | 0.20         | 0.88          | update_file_1
37     | 0.81         | 4.76          | eval
6,596  | 0.00         | 0.00          | get_next_mword
5,800  | 0.00         | 0.00          | variable_expand_string

Test Program to Profile with Saturn
#include <math.h>
#define LIMIT 500000000

void exponential() {
    double a;
    int i;
    for (i = 1; i != LIMIT; i++)
        a = exp(i / 1000.0);
}

void sinFunc() {
    double a;
    int i;
    for (i = 1; i != LIMIT; i++)
        a = sin(i / 1000.0);
}

int main() {
    exponential();
    sinFunc();
    return 0;
}
(Unfortunately gprof isn't supported on my Intel-based Mac with Mac OS X; I use an alternative tool called Saturn)

Cautionary Tale
•  "More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason, including blind stupidity" -- William A. Wulf
•  "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil" -- Donald E. Knuth

And In Conclusion, ...
•  Time (seconds/program) is the measure of performance:
   Seconds/Program = (Instructions/Program) x (Clock Cycles/Instruction) x (Seconds/Clock Cycle)
•  Benchmarks stand in for real workloads as a standardized measure of relative performance
•  Power is of increasing concern, and is being added to benchmarks
•  Time measurement via clock cycles is machine specific
•  Profiling tools are a way to see where your program spends its time
•  Don't optimize prematurely!
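As a closing arithmetic footnote to the 90/10 rule and the "don't optimize prematurely" advice: the payoff from optimizing only the hot code can be computed with Amdahl's-Law-style reasoning (an editorial addition, not from the slides). If a fraction f of execution time is sped up by a factor s, the overall speedup is 1 / ((1 - f) + f/s).

#include <stdio.h>

/* Overall speedup when a fraction f of the time is made s times faster */
static double overall_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void) {
    /* 90/10 rule: 90% of the time is spent in 10% of the code */
    printf("2x speedup on the hot 90%% of time: %.2fx overall\n",
           overall_speedup(0.90, 2.0));    /* ~1.82x */
    printf("Unbounded speedup on the cold 10%%: %.2fx overall\n",
           overall_speedup(0.10, 1e9));    /* ~1.11x */
    return 0;
}

Profiling tells you which side of that split each function is on, which is exactly why it belongs in the workflow before any hand optimization.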