CA4 - Chapter 4: Assessing and Understanding CPU Performance

Chapter 4: Assessing and Understanding CPU Performance

Performance
• Measure, report, and summarize
• Make intelligent choices
• See through the marketing hype
• Key to understanding underlying organizational motivation
• Why is some hardware better than others for different programs? What factors of system performance are hardware related? (e.g., do we need a new machine, or a new operating system?)

Which of these airplanes has the best performance?
  Airplane           Passengers   Range (mi)   Speed (mph)
  Boeing 737-100     101          630          598
  Boeing 747         470          4150         610
  Concorde           132          4000         1350
  Douglas DC-8-50    146          8720         544
• How much faster is the Concorde compared to the 747?
• How much bigger is the 747 than the Douglas DC-8?

Computer Performance: TIME, TIME, TIME
• Response time (latency):
  - How long does it take for my job to run?
  - How long does it take to execute a job?
  - How long must I wait for the database query?
• Throughput:
  - How many jobs can the machine run at once?
  - What is the average execution rate?
  - How much work is getting done?
• If we upgrade a machine with a new processor, what do we improve? (Response time, and with it throughput.) If we add a new machine to the lab, what do we improve? (Throughput only.)

Execution Time
• Elapsed time: counts everything (disk and memory accesses, I/O, etc.); a useful number, but often not good for comparison purposes.
• CPU time: doesn't count I/O or time spent running other programs; can be broken up into system time and user time.
• Our focus: user CPU time, or user CPU time + system CPU time.
• Note: parallel processing aims to reduce elapsed time rather than CPU time!
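The elapsed-vs-CPU-time distinction can be demonstrated with Python's standard timers (a minimal sketch; busy_work and the 0.5 s sleep are illustrative stand-ins for computation and for an I/O wait):

```python
import time

def busy_work(n):
    """CPU-bound loop: consumes user CPU time."""
    total = 0
    for i in range(n):
        total += i * i
    return total

start_wall = time.perf_counter()   # wall-clock (elapsed) timer
start_cpu = time.process_time()    # user + system CPU timer for this process

busy_work(500_000)
time.sleep(0.5)    # stands in for an I/O wait: elapsed time grows, CPU time does not

elapsed = time.perf_counter() - start_wall
cpu = time.process_time() - start_cpu

# Like /usr/bin/time's "real" vs "user" + "sys": the sleep appears
# only in the elapsed figure.
print(f"elapsed: {elapsed:.3f} s, cpu: {cpu:.3f} s")
```

Here perf_counter plays the role of "real" and process_time the role of "user" + "sys" in the /usr/bin/time output shown below.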
Execution Time (cont.)
• /usr/bin/time a.out reports three times:
    real  1:23:31.5   (elapsed time)
    user  30:33.2     (user CPU time)
    sys   11:12.2     (system CPU time)

Definition of Performance
• For some program running on machine X:
    Performance_X = 1 / Execution time_X
• "X is n times faster than Y" means:
    Performance_X / Performance_Y = n
• Example: machine A runs a program in 20 secs; machine B runs the same program in 25 secs. Performance_A / Performance_B = 25/20 = 1.25, so A is 1.25 times as fast as B.

Exercise
• Machine A runs a program in 80 secs; machine B runs the same program in 100 secs.
  1) A is 20% faster than B
  2) B is 20% faster than A
  3) A is 25% faster than B
  4) B is 25% faster than A
• Answer: 3. Performance_A / Performance_B = (1/80) / (1/100) = 100/80 = 1.25, so A is 25% faster.

Exercise
• Program A used to take 100 seconds to run; your optimization reduces the runtime to 75 seconds. How much is the improvement?
• Answer: 33.3%. Performance_new / Performance_old = (1/75) / (1/100) = 100/75 = 1.333...

Clock Cycles
• Instead of reporting execution time in seconds, we often count cycles:
    seconds/program = (cycles/program) x (seconds/cycle)
• Clock "ticks" indicate when to start activities (one abstraction of time).
• Cycle time = time between ticks = seconds per cycle.
• Clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec).
• A 4 GHz clock has a cycle time of 1 / (4 x 10^9) s = 250 picoseconds (ps).

How to Improve Performance?
    seconds/program = (cycles/program) x (seconds/cycle)
• To improve performance (everything else being equal) you can either decrease the number of required cycles for a program, or decrease the clock cycle time, or, said another way, increase the clock rate.

How many cycles are required for a program?
• One could assume that the number of cycles equals the number of instructions (1st instruction, 2nd instruction, 3rd instruction, ...). This assumption is incorrect, since different instructions take different amounts of time, e.g.
fp divide > fp multiply > int multiply > int add.

Different numbers of cycles for different instructions
• Multiplication takes more time than addition.
• Floating-point operations take longer than integer ones.
• Accessing memory takes more time than accessing registers.
• Important point: changing the cycle time often changes the number of cycles required for various instructions (e.g., P4 vs ...).

Relative instruction cost
• Off-chip cache miss: ~200-500 clock cycles
• On-chip L2 cache miss: ~60 cycles
• On-chip L1 cache miss: ~20 cycles
• Load that hits in L1: 2 cycles
• Mis-predicted branch: 8-9 cycles
• FP multiply/add: 4-5 cycles
• Int add/sub: 1 cycle
• Logical operations: 1 cycle
• Int multiply: 3-5 cycles
• Int divide: ~40-70 cycles

Example
• Program P runs in 10 seconds on computer A, which has a 4 GHz clock. We are designing a new machine B to run this program in 6 seconds. The design can use new technology to increase the clock rate, but this will affect the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A for the same program. What clock rate should we target?
• How many cycles does program P take on machine A? 10 sec / (1 / 4 GHz) = 40 billion cycles.
• Machine B requires 40 B x 1.2 = 48 B cycles.
• What clock rate yields 6 seconds for 48 B cycles? Clock rate = 48 B cycles / 6 sec = 8 GHz.

CPI (Clock Cycles Per Instruction)
• Average number of clock cycles each instruction takes to execute.
• CPI provides one way of comparing two different implementations of the same ISA (it removes ISA and clock rate from the equation).
• IPC is the inverse of CPI: the average number of instructions per clock cycle.
• MIPS (Million Instructions Per Second, i.e., IPC x clock frequency) is often used (or mis-used) to represent the peak execution rate. Real applications usually have much lower IPC or effective MIPS.

MIPS rating
• Measured using an artificial code sequence with few branches and no cache misses.
• In the 1970s, minicomputer performance was compared in DEC VAX MIPS, where the VAX 11/780 was marked as 1 MIPS.
• MIPS had been used to rate IBM mainframe server performance for a long time.
• MIPS has been abused, so critics often refer to the term as "Meaningless Indication of Processor Speed" or "Meaningless Information on Performance for Salespeople".

Example MIPS ratings
  Processor        MIPS rating               Year
  Motorola 68000   1 MIPS at 8 MHz           1979
  Intel 486 DX     54 MIPS at 66 MHz         1992
  PowerPC G2       35 MIPS at 33 MHz         1994
  ARM 7500FE       36 MIPS at 40 MHz         1996
  PowerPC G3       525 MIPS at 233 MHz       1997
  ARM 10           400 MIPS at 300 MHz       1998
  Pentium 4 EE     9726 MIPS at 3.2 GHz      2003
  AMD Athlon 64    8400 MIPS at 2.8 GHz      2005
• What is the trend?

CPI Example
• Suppose we have two implementations of the same instruction set architecture (ISA). For some program:
    Machine A has a clock cycle time of 250 ps and a CPI of 2.0.
    Machine B has a clock cycle time of 500 ps and a CPI of 1.2.
  Which machine is faster for this program, and by how much?
• Execution time = cycles x cycle time, and cycles = number of instructions x CPI.
• Execution time for A = I x 2.0 x 250 ps = 500 I ps (I is the number of instructions).
• Execution time for B = I x 1.2 x 500 ps = 600 I ps.
• Performance_A / Performance_B = 600/500 = 1.2, so A is 1.2 times as fast as B.

# of Instructions Example
• A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, requiring one, two, and three cycles respectively.
    The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C.
    The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C.
• Which sequence will be faster? By how much? What is the CPI for each sequence?
• The second sequence is faster by one cycle.
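The cycle counts and CPIs in this example can be checked with a few lines of Python (total_cycles and cpi are hypothetical helper names; the per-class cycle costs come from the text):

```python
# Cycle cost per instruction class, from the example
CYCLES = {"A": 1, "B": 2, "C": 3}

def total_cycles(mix):
    """Total cycles for a sequence given as {class: instruction count}."""
    return sum(CYCLES[c] * n for c, n in mix.items())

def cpi(mix):
    """Average clock cycles per instruction for the sequence."""
    return total_cycles(mix) / sum(mix.values())

seq1 = {"A": 2, "B": 1, "C": 2}   # 5 instructions
seq2 = {"A": 4, "B": 1, "C": 1}   # 6 instructions

print(total_cycles(seq1), cpi(seq1))   # 10 cycles, CPI 2.0
print(total_cycles(seq2), cpi(seq2))   # 9 cycles, CPI 1.5
```

Note that the sequence with more instructions is the faster one, which is why instruction count alone is a poor performance metric.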
CPI_1 = 10/5 = 2.0, and CPI_2 = 9/6 = 1.5.

Benchmarks
• Performance is best determined by running a real application.
  - Use programs typical of the expected workload, or typical of the expected class of applications (e.g., compilers/editors, scientific apps, graphics, etc.).
• Small benchmarks:
  - nice for architects and designers
  - easy to standardize
  - can be abused
• SPEC (System Performance Evaluation Cooperative):
  - companies have agreed on a set of real programs and inputs
  - valuable indicator of performance (and compiler technology)

Benchmark History
• In 1980, a set of kernels called the Livermore Loops was commonly used in the architecture field to evaluate designs for scientific and engineering computers. LANL and NAS benchmarks from the national labs (Los Alamos and NASA) were also commonly used.
• Dhrystone is a synthetic benchmark program developed in 1984. It was intended to be representative of system (integer) programming. It was popular for several years until it was superseded by the SPEC CPU89 benchmark.
• The LINPACK benchmark solves a dense system of linear equations. It is widely used, and performance numbers are available for almost all related systems. For the TOP500 supercomputers, check ...

Benchmark History (cont.)
• In the late 1980s, CSRD at UIUC led an effort to collect a suite of scientific and engineering applications, called the Perfect Club, for evaluating new concepts in high-performance computing.
• In 1988, SPEC (Standard Performance Evaluation Corporation) formed; SPEC89 was released the next year. It now has benchmark suites for CPU, graphics, HPC, Java client/server, mail server, web server, network file system, etc. SPEC89, CPU92, CPU95, and CPU2000 have been widely used. CPU2006 is coming out next year; 52 candidates are under evaluation. (cpu89, cpu92, cpu95, cpu2000, cpu2006 ... why the repeated updates?)

Benchmark History (cont.)
• In 1988, 8 companies formed the TPC (Transaction Processing Performance Council) to select a set of database benchmarks and to lay out the process for reporting performance. The first benchmark, TPC-A, was released in 1989. TPC benchmarks are now widely used for evaluating commercial systems. TPC-C simulates a complete computing environment in which many users execute transactions against a database (entering and delivering orders, recording payments, checking status, monitoring the level of stock, ...).
• In 1997, EEMBC, the Embedded Microprocessor Benchmark Consortium, was formed to develop meaningful performance benchmarks for hardware and software used in embedded systems. Benchmarks include automotive, consumer, digital entertainment, Java, networking, office automation, and telecom.

Reporting Benchmark Performance
• Guiding principle: reproducibility. List everything another experimenter would need to duplicate the results, e.g., OS version, compiler version, input data, computer configuration, etc.
• Large input sets tend to stress the memory system to a greater extent.
• Using realistically sized workloads is critical for servers.
• How can we run a reduced input set and still accurately predict the performance of a larger run?

Reporting Performance (example): hardware
  Vendor:           Dell
  Model:            Precision 360 (3.2 GHz P4 EE)
  CPU:              Pentium 4 (800 MHz bus)
  CPU MHz:          3200
  CPU(s) enabled:   1
  Parallel:         no
  Primary cache:    12 KB I + 8 KB D
  Secondary cache:  512 KB (I+D)
  L3 cache:         2 MB on-chip
  Other caches:     no
  Memory:           4 x 512 MB ECC DDR400 SDRAM
  Disk subsystem:   1 x 80 GB ATA/100 7200 RPM
  Other hardware:   none listed

Reporting Performance (example): software
  Operating System: Windows XP Professional SP1
  Compiler:         Intel C++ Compiler 7.1 (20030402Z), Microsoft Visual Studio.NET (7.0.9466)
  Library:          MicroQuill SmartHeap Library 6.01
  File System:      NTFS
  System state:     Default
• There are 23 lines of notes describing special flag settings used for portability, optimizations, tuning, and special libraries.
Comparing and Summarizing Performance
• Why summarize the performance of a group of benchmarks in a single number?
  - A necessary evil: marketers and users prefer to have a single number.
  - A simple arithmetic mean (AM) can be misleading, especially when performance is reported as a rate.
  - The simplest approach to summarizing relative performance is to use total execution time.
  - Weighting factors should be applied if the frequency of each program in the workload differs.

Example: which computer is faster?
  Benchmark    Computer 1 time (secs)   Computer 2 time (secs)
  Program A    1                        10
  Program B    100                      50
  Total time   101                      60

Example: which computer is faster? (adding a ratio column)
  Benchmark    C1 time (secs)   C2 time (secs)   Time ratio (C2 / C1)
  Program A    1                10               10
  Program B    100              50               0.5
  Total time   101              60               ?
• Should the summary ratio be 0.6 (the ratio of total times, 60/101) or 5.25 (the arithmetic mean of the per-program ratios)?

Example: summarize performance
  Benchmark    Millions of FP operations   C1 time (secs)   C2 time (secs)   C3 time (secs)
  Program A    100                         1                10               20
  Program B    100                         1000             100              20
  Total time                               1001             110              40

Example: misleading summary using MFLOPS
  Benchmark         Computer 1    Computer 2   Computer 3
  Program A         100 Mflops    10 Mflops    5 Mflops
  Program B         0.1 Mflops    1 Mflops     5 Mflops
  Arithmetic mean   50.1 Mflops   5.5 Mflops   5 Mflops
  Geometric mean    3.2 Mflops    3.2 Mflops   5 Mflops
  Harmonic mean     0.2 Mflops    1.8 Mflops   5.0 Mflops

AM, GM and HM
• For rates M_1 ... M_n (e.g., MFLOPS on each program):
    Arithmetic mean: AM = (M_1 + ... + M_n) / n
    Geometric mean:  GM = (M_1 x ... x M_n)^(1/n)
    Harmonic mean:   HM = n / (1/M_1 + ... + 1/M_n)

Comparing and Summarizing Performance
• The arithmetic mean can be an accurate summary of performance expressed as time, but it should not be used to summarize performance expressed as a rate.
• The harmonic mean should be used to summarize performance expressed as a rate.
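Applied to the Computer 1 column of the MFLOPS table (100 and 0.1 Mflops), the three means can be computed directly (am, gm, and hm are illustrative helper names):

```python
import math

def am(rates):
    """Arithmetic mean: misleading when applied to rates."""
    return sum(rates) / len(rates)

def gm(rates):
    """Geometric mean."""
    return math.prod(rates) ** (1 / len(rates))

def hm(rates):
    """Harmonic mean: the appropriate mean for rates."""
    return len(rates) / sum(1 / r for r in rates)

c1 = [100.0, 0.1]   # MFLOPS of programs A and B on Computer 1
# Matches the table row for Computer 1: AM ~ 50.1, GM ~ 3.2, HM ~ 0.2
print(am(c1), gm(c1), hm(c1))
```

The arithmetic mean is dominated by the one fast program, while the harmonic mean tracks the total-time view, which is why the table's rows disagree so sharply.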
• If performance is to be normalized with respect to a specific machine, an aggregate measure such as total time or harmonic mean rate should be calculated before any normalizing is done. Benchmarks should not be normalized individually.

Amdahl's Law
• Amdahl's law is a demonstration of the law of diminishing returns.
• The speedup achievable on a parallel machine is determined by:
    P: fraction of execution time that can be parallelized
    S: speedup factor of the parallelized part
    Speedup = 1 / ((1 - P) + P/S)
• If P = 1.0 (100% parallelizable), the speedup is S.
• If P = 0.0 (totally sequential), the speedup is 1.
• If P = 0.8 (80% parallelizable), the speedup is 1 / (0.2 + 0.8/S).

Amdahl's Law (examples)
• Example: "Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?" How about making it 5 times faster?
• Example: "Suppose a program spends 10% of its time in a sequential part. What is the maximum speedup you can get on a 10,000-node parallel machine?"

Example: misleading summary revisited
  Benchmark         Computer 1    Computer 2   Computer 3
  Program A         100 Mflops    10 Mflops    5 Mflops
  Program B         0.1 Mflops    1 Mflops     5 Mflops
  Arithmetic mean   50.1 Mflops   5.5 Mflops   5 Mflops
• Assume A and B take the same amount of execution time. Amdahl's law then says Computer 1 can never be more than 2x faster than Computer 2 or Computer 3, no matter how much it speeds up program A. So the arithmetic-mean summary is clearly wrong.

Rebuttal from the MP world
    Speedup = 1 / ((1 - P) + P/S)
• There are no issues with the equation itself.
• But P is not fixed: when a parallel program is scaled up to solve a much larger problem, P usually increases. E.g., a 1000 x 1000 matrix multiply spends much more time in the parallelized loop than in the sequential part.
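Both Amdahl's-law examples above can be checked numerically (speedup is a hypothetical helper implementing the formula from the slide):

```python
def speedup(p, s):
    """Amdahl's law: overall speedup when fraction p of the time is sped up by s."""
    return 1.0 / ((1.0 - p) + p / s)

# Limiting cases from the slide
print(speedup(1.0, 10))   # fully parallelizable: speedup equals S
print(speedup(0.0, 10))   # fully sequential: speedup is 1

# Example 1: 100 s total, 80 s in multiply, so p = 0.8.
# Solving 1/(0.2 + 0.8/s) = 4 gives 0.8/s = 0.05, i.e. s = 16.
print(speedup(0.8, 16))   # 4x overall
# Running 5x faster is impossible: even as s -> infinity the
# bound is 1/(1 - 0.8) = 5, which is approached but never reached.

# Example 2: 10% sequential on a 10,000-node machine (p = 0.9)
print(speedup(0.9, 10_000))   # just under the 1/0.1 = 10 ceiling
```

The sequential fraction (1 - P) sets a hard ceiling of 1/(1 - P) on the achievable speedup, which is the "diminishing returns" the slide refers to.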
Benchmark Games
• "An embarrassed Intel Corp. acknowledged Friday that a bug in a software program known as a compiler had led the company to overstate the speed of its microprocessor chips on an industry benchmark by 10 percent. However, industry analysts said the coding error ... was a sad commentary on a common industry practice of 'cheating' on standardized performance tests ... The error was pointed out to Intel two days ago by a competitor, Motorola ... came in a test known as SPECint92 ... Intel acknowledged that it had 'optimized' its compiler to improve its test scores. The company had also said that it did not like the practice but felt compelled to make the optimizations because its competitors were doing the same thing ... At the heart of Intel's problem is the practice of 'tuning' compiler programs to recognize certain computing problems in the test and then substituting special handwritten pieces of code ..."

SPEC '89
• Compiler "enhancements" and performance.
[Figure: SPEC performance ratio (0-800) for gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, and tomcatv, with the baseline compiler vs. the enhanced compiler. The enhanced compiler's enormous gain on matrix300 comes from a well-known cache-blocking transformation.]

Blocked Matrix Multiplication
• "Block" means a sub-block within the matrix.
• Example: N = 1000, sub-block size = 500. Partition each matrix into a 2x2 grid of 500x500 blocks:
    [A11 A12]   [B11 B12]   [C11 C12]
    [A21 A22] x [B21 B22] = [C21 C22]
  C11 = A11 B11 + A12 B21
  C21 = A21 B11 + A22 B21
  C12 = A11 B12 + A12 B22
  C22 = A21 B12 + A22 B22

SPEC CPU2000

SPEC 2000
• Does doubling the clock rate double the performance? Can a machine with a slower clock rate have better performance?
[Figure 1: SPEC CINT2000 and CFP2000 ratings of the Pentium III and Pentium 4 plotted against clock rate (500-3500 MHz).]
[Figure 2: SPECINT2000 and SPECFP2000 of the Pentium M @ 1.6/0.6 GHz, Pentium 4-M @ 2.4/1.2 GHz, and Pentium III-M @ 1.2/0.8 GHz under three power modes: always on/maximum clock, laptop mode/adaptive clock, and minimum power/minimum clock.]

Remember
• Performance is specific to a particular program.
  - Total execution time is a consistent summary of performance.
• For a given architecture, performance increases come from:
  - increases in clock rate (without adverse CPI effects)
  - improvements in processor organization that lower CPI
  - compiler enhancements that lower CPI and/or instruction count
  - algorithm/language choices that affect instruction count
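The blocked matrix multiplication scheme can be sketched in pure Python (illustrative only: matmul_blocked is a hypothetical helper, and real code would use much larger matrices and tuned block sizes; correctness is checked against a naive triple loop):

```python
def matmul_blocked(A, B, n, bs):
    """Multiply n x n matrices (lists of lists) using bs x bs sub-blocks.

    Working block by block keeps each sub-block of A, B, and C resident
    in the cache while it is reused, which is the point of the
    cache-blocking transformation behind the matrix300 result.
    """
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):            # block row of C
        for jj in range(0, n, bs):        # block column of C
            for kk in range(0, n, bs):    # accumulate one block product
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C

# Tiny correctness check against a naive triple loop
n, bs = 4, 2
A = [[i + j for j in range(n)] for i in range(n)]
B = [[i * j + 1 for j in range(n)] for i in range(n)]
naive = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
assert matmul_blocked(A, B, n, bs) == naive
```

Blocking changes only the loop order, not the arithmetic, so the result is identical; the performance benefit appears only once the matrices exceed the cache.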