L7_Perform - CS 324 Computer Architecture Lecture 7 ISA...

CS 324 Computer Architecture
Lecture 7: ISA II; Performance

Constants

Design Principle: Make the common case fast.
– Small constants are used frequently (50% of operands), e.g., A = A + 5; B = B + 1; C = C - 18;
– Solutions?
  • put 'typical constants' in memory and load them
  • create hard-wired registers (e.g. $zero) for constants like 0
– MIPS instructions:
    addi $29, $29, 4
    andi $29, $29, 6
    ori  $29, $29, 4
– Which format?
    R:  op | rs | rt | rd | shamt | funct
    I:  op | rs | rt | 16 bit address
    J:  op | 26 bit address

Logical Operations: Shifting

– Facilitates examination of individual bits or bytes
– Shifting left or right = multiplying or dividing by a power of 2
– sll or srl (shift left/right logical)
– sll $t1, $s0, 8 => shift the value in $s0 left by 8 and store in $t1
    $s0: 0000 0000 0000 0000 0000 0000 0001 1011
    $t1: 0000 0000 0000 0000 0001 1011 0000 0000
– R-type; the shift amount (8 here) goes in the sh.amt field:
    op     | rs     | rt     | rd     | sh.amt | funct
    6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits

Logical Operations: AND/OR

Bit-wise operations:
– AND leaves a 1 iff both operands' bits are 1; it forces 0s where they occur in the bit pattern
– OR leaves a 1 if either operand's bit is 1; it forces 1s where they occur in the mask, and a 0 in the mask leaves the bit unchanged

Can use a mask (immediate value) to set bits with a logical op, but must first set bits in the mask …
– Can't just set bits directly without logical op. Why?
  • logical immediates are 16 bit and zero-extended
  • if the upper 16 bits of the operand have any 1's, we need a 32-bit immediate to avoid clearing the upper 16 bits

AND/OR

Let's assume we want to clear the LSB of:
    0000 0000 1010 0111
We want to AND with fffe (hex):
    0000 0000 1010 0111
    1111 1111 1111 1110
But if some of the high-order bits are set, we must create a 32-bit constant:
    0000 0000 1000 0010 0000 0000 1010 0111
    1111 1111 1111 1111 1111 1111 1111 1110

How about larger constants?
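
As a rough illustration of why a 16-bit immediate is not enough here, the sketch below models the shift, the zero-extended logical immediates, and a two-step lui/ori construction of a 32-bit mask. It uses Python integers as stand-ins for 32-bit registers; the helper names (sll, andi, lui, ori) are hypothetical Python functions named after the MIPS instructions, not an emulator.

```python
MASK32 = 0xFFFFFFFF  # assumption: registers are modeled as 32-bit unsigned values


def sll(value, shamt):
    """Shift left logical; bits shifted past bit 31 are discarded."""
    return (value << shamt) & MASK32


def andi(value, imm16):
    """AND with a 16-bit immediate that is zero-extended to 32 bits."""
    return value & (imm16 & 0xFFFF)


def lui(imm16):
    """Load upper immediate: imm16 fills the upper half, lower half is zeros."""
    return ((imm16 & 0xFFFF) << 16) & MASK32


def ori(value, imm16):
    """OR with a 16-bit immediate that is zero-extended to 32 bits."""
    return (value | (imm16 & 0xFFFF)) & MASK32


# Shift example from the slide: $s0 = ...0001 1011 shifted left by 8.
s0 = 0b0001_1011
t1 = sll(s0, 8)

# Clearing the LSB of an operand whose upper 16 bits are not all zero.
operand = 0x008200A7
wrong = andi(operand, 0xFFFE)    # zero-extension also clears the upper 16 bits
mask = ori(lui(0xFFFF), 0xFFFE)  # 32-bit constant built in two steps
right = operand & mask           # only the LSB is cleared

print(hex(t1), hex(wrong), hex(mask), hex(right))
```

The operand 0x008200A7 is the 32-bit pattern shown above; the point of the comparison is that the single andi loses the upper half while the two-step mask preserves it.
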
We'd like to be able to load a 32 bit constant into a register.
– Must use two instructions; first, the "load upper immediate" instruction:
    lui $t0, 1111111111111111
    $t0: 1111111111111111 0000000000000000   (lower half filled with zeros)
– Then must get the lower order bits right, i.e.,
    ori $t0, $t0, 1111111111111110

         1111111111111111 0000000000000000
    ori  0000000000000000 1111111111111110
      =  1111111111111111 1111111111111110   (32-bit constant)

Addressing Modes

– Immediate: operand is a binary value coded in the instruction
– Indexed/Base: instruction specifies a base address and an offset, which is added to the base to get the address of the item
– Register: value is in a register; the instruction contains the register number
– PC-Relative: address is the sum of the PC and a constant in the instruction
– Pseudo-direct: jump address is the 26 bits of the instruction + upper bits of the PC

Assembly Language vs. Machine Language

– Assembly provides convenient symbolic representation
  • much easier than writing down numbers
  • e.g., destination first
– Machine language is the underlying reality
  • e.g., destination is no longer first
– Assembly can provide 'pseudoinstructions'
  • e.g., "move $t0, $t1" exists only in Assembly
  • would be implemented using "add $t0,$t1,$zero"
– When considering performance you should count real instructions (we'll talk more about this later)

Overview of MIPS

– simple instructions, all 32 bits wide
– very structured, no unnecessary baggage
– only three instruction formats
    R:  op | rs | rt | rd | shamt | funct
    I:  op | rs | rt | 16 bit address
    J:  op | 26 bit address
– rely on compiler to achieve performance
  • what are the compiler's goals?
  • Efficient use of registers!

Addressing mode formats:
1. Immediate addressing: op | rs | rt | Immediate
2. Register addressing: op | rs | rt | rd | ... | funct; the operand is in a register
3. Base addressing: op | rs | rt | Address; register + address selects a byte, halfword, or word in memory
4. PC-relative addressing: op | rs | rt | Address; PC + address selects a word in memory
5. Pseudodirect addressing: op | Address; the address combined with the PC selects a word in memory

Defining (Speed) Performance

Normally interested in reducing
– Response time (aka execution time): the time between the start and the completion of a task
  • Important to individual users
  • Thus, to maximize performance, need to minimize execution time

      performance_X = 1 / execution_time_X

    If X is n times faster than Y, then

      performance_X / performance_Y = execution_time_Y / execution_time_X = n

– Throughput: the total amount of work done in a given time
  • Important to data center managers
  • Decreasing response time almost always improves throughput

Two Notions of "Performance"

    Plane              DC to Paris   Top Speed   Passengers   Throughput (pmph)
    Boeing 747         6.5 hours     610 mph     470          286,700
    BAC/Sud Concorde   3 hours       1350 mph    132          178,200

• Which has higher performance?
• Time to deliver 1 passenger?
• Time to deliver 400 passengers?
• In a computer:
  • time for 1 job => Response Time or Execution Time
  • jobs per day => Throughput or Bandwidth

Example of Response Time v. Throughput

– Time of Concorde vs. Boeing 747?
  • Concorde is 6.5 hours / 3 hours = 2.2 times faster
  • Concorde is 2.2 times ("120%") faster in terms of flying time (response time)
– Throughput of Boeing vs. Concorde?
  • Boeing 747: 286,700 passenger-mph / 178,200 passenger-mph = 1.6 times faster
  • Boeing is 1.6 times ("60%") faster in terms of throughput

We will focus primarily on response time.
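
The Concorde vs. Boeing 747 comparison can be checked with a few lines of arithmetic. This is only a sketch; the function names are made up for illustration, and the numbers are the ones given in the table.

```python
def speedup(time_slow, time_fast):
    """How many times faster the quicker option is: ratio of execution times."""
    return time_slow / time_fast


def throughput_pmph(passengers, top_speed_mph):
    """Throughput in passenger-miles per hour (pmph)."""
    return passengers * top_speed_mph


# Response time: DC to Paris flying time in hours.
concorde_vs_747 = speedup(6.5, 3.0)

# Throughput: passengers x top speed.
boeing_pmph = throughput_pmph(470, 610)
concorde_pmph = throughput_pmph(132, 1350)
boeing_vs_concorde = boeing_pmph / concorde_pmph

print(f"Concorde is {concorde_vs_747:.1f}x faster in response time")
print(f"Boeing 747 is {boeing_vs_concorde:.1f}x faster in throughput")
```

Both ratios come from the same definition (performance is the inverse of time per unit of work); which plane "wins" depends on whether the unit of work is one passenger or many.
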
Performance Factors

– Want to distinguish elapsed time from the time spent on our task
– CPU execution time (CPU time): the time the CPU spends working on a task
  • Does not include time waiting for I/O or running other programs

    CPU execution time for a program = # CPU clock cycles for a program x clock cycle time

  or

    CPU execution time for a program = # CPU clock cycles for a program / clock rate

– Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program

Machine Clock Rate

Clock rate (MHz, GHz) is the inverse of clock cycle time (clock period): CC = 1 / CR

    10 nsec clock cycle  => 100 MHz clock rate
     5 nsec clock cycle  => 200 MHz clock rate
     2 nsec clock cycle  => 500 MHz clock rate
     1 nsec clock cycle  =>   1 GHz clock rate
    500 psec clock cycle =>   2 GHz clock rate
    250 psec clock cycle =>   4 GHz clock rate
    200 psec clock cycle =>   5 GHz clock rate

Clock Cycles per Instruction

– Not all instructions take the same amount of time to execute
  • One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction

    # CPU clock cycles for a program = # Instructions for a program x Average clock cycles per instruction

– Clock cycles per instruction (CPI): the average number of clock cycles each instruction takes to execute
  • A way to compare two different implementations of the same ISA

    Instruction class:                A   B   C
    CPI for this instruction class:   1   2   3

Effective CPI

– Overall effective CPI: weighted average of the individual cycle counts for the different instruction types

    Overall effective CPI = Σ (i = 1 to n) (CPI_i x IC_i)

  • IC_i is the count (percentage) of instructions of class i executed
  • CPI_i is the average number of clock cycles per instruction for that class
  • n is the number of instruction classes
– The overall effective CPI varies by instruction mix: a measure of the dynamic frequency of instructions across one or many programs

THE Performance Equation

Our basic performance equation is then

    CPU time = Instruction_count x CPI x cycle time

  or

    CPU time = Instruction_count x CPI / clock_rate

Note the three key factors that affect performance:
– Can measure the CPU execution time by running the program
– The clock rate is usually given
– Get the instruction count by using profilers/simulators without knowing all implementation details
– CPI varies by instruction type and ISA implementation, for which we must know the implementation details

Determinants of CPU Performance

CPU time = Instruction_count x CPI x clock_cycle

                             Instruction_count   CPI   clock_cycle
    Algorithm                        X            X
    Programming language             X            X
    Compiler                         X            X
    ISA                              X            X         X
    Processor organization                        X         X
    Technology                                              X

A Simple Example

    Op       Freq   CPI_i   Freq x CPI_i   Load CPI = 2   Branch CPI = 1   Two ALU at once
    ALU      50%    1       .5             .5             .5               .25
    Load     20%    5       1.0            .4             1.0              1.0
    Store    10%    3       .3             .3             .3               .3
    Branch   20%    2       .4             .4             .2               .4
                            Σ = 2.2        1.6            2.0              1.95

– How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
    CPU time new = 1.6 x IC x CC, so 2.2/1.6 means 37.5% faster
– How does this compare with using branch prediction to shave a cycle off the branch time?
    CPU time new = 2.0 x IC x CC, so 2.2/2.0 means 10% faster
– What if two ALU instructions could be executed at once?
    CPU time new = 1.95 x IC x CC, so 2.2/1.95 means 12.8% faster

The Use of Benchmarks

A benchmark is an application and a problem that jointly define a test.
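
The effective-CPI arithmetic in the simple example can be reproduced with a short sketch. The dictionary layout and function names below are illustrative choices, not part of the lecture; the frequencies and CPI values are the ones from the example table, and "two ALU instructions at once" is modeled as halving the ALU CPI.

```python
def effective_cpi(mix):
    """Sum of frequency x CPI over all instruction classes."""
    return sum(freq * cpi for freq, cpi in mix.values())


def percent_faster(old_cpi, new_cpi):
    """Speedup as a percentage, assuming instruction count and clock cycle are unchanged."""
    return (old_cpi / new_cpi - 1.0) * 100.0


# (frequency, CPI) per instruction class, from the example table.
base = {"ALU": (0.5, 1), "Load": (0.2, 5), "Store": (0.1, 3), "Branch": (0.2, 2)}

better_cache = dict(base, Load=(0.2, 2))     # average load time reduced to 2 cycles
branch_pred = dict(base, Branch=(0.2, 1))    # one cycle shaved off branches
dual_alu = dict(base, ALU=(0.5, 0.5))        # two ALU instructions per cycle

base_cpi = effective_cpi(base)
for name, mix in [("better cache", better_cache),
                  ("branch prediction", branch_pred),
                  ("dual ALU", dual_alu)]:
    cpi = effective_cpi(mix)
    print(f"{name}: CPI {cpi:.2f}, {percent_faster(base_cpi, cpi):.1f}% faster")
```

Because instruction count and clock cycle time cancel in the ratio, comparing effective CPIs is enough to rank the three options.
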
Benchmarks should efficiently serve four purposes:
– Differentiation of a system from among its competitors
  • System and architecture studies
  • Purchase/selection
– Validate that a system works the way expected once it is built and/or delivered
– Assure that systems perform as expected throughout their lifetime, e.g. after upgrades, changes, and in regular use
– Guidance to future system designs and implementation

What Programs Measure for Comparison?

– Ideally run typical programs with typical input before purchase, or even before building the machine
  • Called a "workload"; for example:
  • Engineer uses compiler, spreadsheet
  • Author uses word processor, drawing program, compression software
– In some situations this is hard to do
  • Don't have access to the machine to "benchmark" before purchase
  • Don't know the workload in the future

Benchmarks

– Apparent sustained speed of a processor depends on the code used to test it
– Industry standards ensure different processors can be fairly compared
  • Most "standard suites" are simplified
    • Type of algorithm
    • Size of problem
    • Run time
– Organizations create "typical" code used to evaluate systems
– Tests need to be changed every ~5 years (HW design cycle time), since designers could (and do!) target specific HW for these standard benchmarks
  • This HW may have little or no general benefit

Example Standardized Benchmarks

Standard Performance Evaluation Corporation (SPEC)
– SPEC CPU2006
  • CINT2006: 12 integer (perl, bzip, gcc, go, ...)
  • CFP2006: 17 floating-point (povray, bwaves, ...)
  • All relative to a base machine (which gets 100), e.g., Sun Ultra Enterprise 2 w/ 296 MHz UltraSPARC II
  • They measure
    • System speed (SPECint2006)
    • System throughput (SPECint_rate2006)
  • www.spec.org/osg/cpu2006/

Comparing and Summarizing Performance

Summarize benchmark performance with a single number:
– Arithmetic mean (AM): the average of execution times is directly proportional to total execution time

    AM = (1/n) x Σ (i = 1 to n) Time_i

  • Time_i is the execution time for the ith program of n programs in the workload
  • smaller mean => smaller average execution time, thus improved performance

Guiding principle in performance measurements: reproducibility
– list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.))

Example SPEC Ratings [figure]

Other Performance Metrics

– Power consumption: especially in the embedded market, where battery life is important (and passive cooling)
  • For power-limited applications, the most important metric is energy efficiency

"And in conclusion…"

– Latency vs. Throughput
– Performance doesn't depend on any single factor: need Instruction Count, Clocks Per Instruction (CPI) and Clock Rate to get valid estimations
– User Time: time the user waits for the program to execute; depends heavily on how the OS switches between tasks
– CPU Time: time spent executing a single program; depends solely on the design of the processor (datapath, pipelining effectiveness, caches, etc.)
– Benchmarks
  • Attempt to understand (and project) performance
  • Updated every few years
  • Measure everything from simulation of desktop graphics programs to battery life
– Megahertz Myth
  • MHz ≠ performance, it's just one factor

Summary: Evaluating ISAs

Design-time metrics:
– Can it be implemented, in how long, at what cost?
– Can it be programmed? Ease of compilation?
Static Metrics:
– How many bytes does the program occupy in memory?
Dynamic Metrics:
– How many instructions are executed? How many bytes does the processor fetch to execute the program?
– How many clocks are required per instruction? (CPI)
– How "lean" a clock is practical?
Best Metric: Time to execute the program!
– depends on the instruction set, the processor organization, and compilation techniques
Inst. Count Cycle Time ...
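
The factors named in this summary (instruction count, CPI, and cycle time) combine through the performance equation CPU time = Instruction_count x CPI / clock_rate given earlier in the lecture. The sketch below simply evaluates that equation; the instruction count of one billion and the 2 GHz clock are made-up illustrative inputs, and the CPI of 2.2 is borrowed from the simple example.

```python
def clock_cycle_seconds(clock_rate_hz):
    """Clock cycle time is the inverse of the clock rate."""
    return 1.0 / clock_rate_hz


def cpu_time_seconds(instruction_count, cpi, clock_rate_hz):
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * clock_cycle_seconds(clock_rate_hz)


# Illustrative inputs only (not measurements from the lecture).
instruction_count = 1_000_000_000
cpi = 2.2
clock_rate_hz = 2_000_000_000  # 2 GHz, i.e. a 500 psec clock cycle

t = cpu_time_seconds(instruction_count, cpi, clock_rate_hz)
print(f"cycle time: {clock_cycle_seconds(clock_rate_hz) * 1e12:.0f} psec")
print(f"CPU time:   {t:.2f} s")
```

Halving any one of the three factors halves the CPU time, which is why no single factor (such as clock rate alone) determines performance.
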