Lecture 03 - 0306-381 Applied Programming Performance


0306-381 Applied Programming Performance
• Program Performance Evaluation
• Optimization Techniques for Modern Processors

Performance Evaluation
• Execution time
  – Time to solution
  – Base metric for performance comparison
• Time measurement
  – Relative, from start to finish
  – Elapsed during execution

Timing Procedure
• Determining execution time:
  1. Record the start time
  2. Execute the code to be timed
  3. Record the stop time
  4. Compute the difference between the stop and start times
• Program requirements
  – Time recording
  – Time arithmetic

Time Module
• time.h
    #include <time.h>
• Typemark:
  – clock_t
  – Clock value (time)
• Function:
  – clock_t clock (void);
  – Program execution time

Timing Example

    #include <time.h>
    ...
    clock_t TimeExecution, TimeStart, TimeStop;
    ...
    TimeStart = clock ();
    /* Code to be timed goes here */
    TimeStop = clock ();
    TimeExecution = TimeStop - TimeStart;

• Records start and stop times
• Computes execution time

Time Output
• Output requires a printf format descriptor: what standard type corresponds to clock_t?
• C type descriptions are specific to each platform: where is clock_t defined?
  – The definition is in the header files; clock_t is long int on cluster.ce.rit.edu.

Time Units
• Interpretation and comparison require units: in what unit is clock_t?
  – clock_t is in clock ticks.
• How do clock ticks relate to time?
  – The conversion is in the manual page: divide clock_t by CLOCKS_PER_SEC.

Timing Output Example

    #include <time.h>
    ...
    clock_t TimeExecution, TimeStart, TimeStop;
    ...
    TimeStart = clock ();
    /* Code to be timed goes here */
    TimeStop = clock ();
    TimeExecution = TimeStop - TimeStart;
    printf ("Execution time = %f s\n",
            (double) TimeExecution / (double) CLOCKS_PER_SEC);

0306-381 Applied Programming Performance
• Program Performance Evaluation
► Optimization Techniques for Modern Processors

Modern Microprocessor Characteristics
• Instructions executed in parallel
  – Theoretical: the hardware design sets the potential
  – Practical: a program may or may not realize it
• Fastest execution
  – Code for the greatest potential utilization
  – Still ensure correct results
• This class
  – Incorporates straightforward optimizations
  – Discusses, but does not pursue, error-prone optimizations

Program Performance and Processor Architecture
• Processor architecture
  – Determines the performance potential for a given program
  – Dictates efficient code implementation
• Instruction-level parallelism (ILP)
  – The basis of modern microprocessor performance
  – Does not apply to older-generation processors:
    • 68000
    • 68HC11
    • 8051
    • 80386

Instruction-Level Parallelism (ILP)
• Program instructions performed simultaneously
  – Increased instruction throughput (IPC)
  – Transparent "sequential" execution
• Microarchitecture techniques
  – Instruction pipelining
  – Superscalar execution
  – Out-of-order execution
  – VLIW

Instruction Pipelining Example

    Instruction       Cycle: 1    2    3    4    5    6    7    8    9
    add $3, $4, $6           IF   ID   EX   MEM  WB
    sub $5, $3, $2                IF   ID   EX   MEM  WB
    lw  $7, 100($5)                    IF   ID   EX   MEM  WB
    (nop)                                   Bub  Bub  Bub  Bub  Bub
    add $8, $7, $2                               IF   ID   EX   MEM  WB

Superscalar Example
• Multiple instructions issued per cycle
• Multiple units, scheduled at run time:
  – integer
  – floating point

  Example (loop unrolled four times):

    LOOP: LD   F0, 0(R1)
          LD   F6, -8(R1)
          LD   F10, -16(R1)
          LD   F14, -24(R1)
          ADDD F4, F0, F2
          ADDD F8, F6, F2
          ADDD F12, F10, F2
          ADDD F16, F14, F2
          SD   0(R1), F4
          SD   -8(R1), F8
          SD   -16(R1), F12
          SUB  R1, R1, #32
          BNEZ R1, LOOP
          SD   -24(R1), F16

  Superscalar result, INT unit issue slot:

    LOOP: LD   F0, 0(R1)
          LD   F6, -8(R1)
          LD   F10, -16(R1)
          LD   F14, -24(R1)
          SD   0(R1), F4
          SD   -8(R1), F8
          SD   -16(R1), F12
          SUB  R1, R1, #32
          BNEZ R1, LOOP
          SD   -24(R1), F16

  FP unit issue slot:

    LOOP: ADDD F4, F0, F2
          ADDD F8, F6, F2
          ADDD F12, F10, F2
          ADDD F16, F14, F2

Out-of-Order Execution Example
• Try producing fast code for

    a = b + c;
    d = e - f;

  assuming a, b, c, d, e, and f are in memory.

    Slow code:          Fast code:
    LW  Rb, b           LW  Rb, b
    LW  Rc, c           LW  Rc, c
    ADD Ra, Rb, Rc      LW  Re, e
    SW  a, Ra           ADD Ra, Rb, Rc
    LW  Re, e           LW  Rf, f
    LW  Rf, f           SW  a, Ra
    SUB Rd, Re, Rf      SUB Rd, Re, Rf
    SW  d, Rd           SW  d, Rd

VLIW Example
• Multiple instructions issued per cycle
• Multiple units, scheduled by the compiler:
  – 2 memory
  – 2 floating point
  – 1 integer

  Example (loop unrolled seven times):

    LOOP: LD   F2, 0(R1)
          LD   F6, -8(R1)
          LD   F10, -16(R1)
          LD   F14, -24(R1)
          LD   F18, -32(R1)
          LD   F22, -40(R1)
          LD   F26, -48(R1)
          ADDD F4, F2, F0
          ADDD F8, F6, F0
          ADDD F12, F10, F0
          ADDD F16, F14, F0
          ADDD F20, F18, F0
          ADDD F24, F22, F0
          ADDD F28, F26, F0
          SD   0(R1), F4
          SD   -8(R1), F8
          SD   -16(R1), F12
          SD   -24(R1), F16
          SD   -32(R1), F20
          SD   -40(R1), F24
          SUB  R1, R1, #48
          BNEZ R1, LOOP
          SD   -48(R1), F28

  VLIW result, by functional unit (issue slots):

    Mem1: LD F2, 0(R1);    LD F10, -16(R1);   LD F18, -32(R1);   LD F26, -48(R1)
    Mem2: LD F6, -8(R1);   LD F14, -24(R1);   LD F22, -40(R1)
    FP1:  ADDD F4, F2, F0; ADDD F12, F10, F0; ADDD F20, F18, F0; ADDD F28, F26, F0
    FP2:  ADDD F8, F6, F0; ADDD F16, F14, F0; ADDD F24, F22, F0
    INT:  SD 0(R1), F4;    SD -16(R1), F12;   SD -32(R1), F20;   SD -48(R1), F28;
          SD -8(R1), F8;   SD -24(R1), F16;   SD -40(R1), F24;
          SUB R1, R1, #48; BNEZ R1, LOOP

Harnessing Available ILP
• Program ILP
  – Increases performance if utilized
  – Detection is crucial to utilization
• Detection strategies
  – Compiler: automatic
  – Hardware: automatic
  – Programmer: manual
    • Eliminating long dependent-chain operations: error-prone
    • Interleaving independent calculations: straightforward

Dependent Chain Operations
• Intuitive programming approach
  – A sequence of operations performed on data
  – Reflects how humans think through a procedure
• Alternative approach
  – Compute results from very few sequential operations
  – Not possible in some cases
  – Difficult to understand and debug:
    • Many variables
    • Many intermediate results
• Ignored in this class: no attempt to eliminate dependent chains

Interleaved Independent Calculations
• Loop unrolling
  – Replicate the loop body
  – Reduce the iteration count
• Benefits
  – Reduced loop overhead
  – Disambiguated ILP
• Implementation
  – Automatically, by the compiler: some cases
  – Manually, by the programmer: does a better job

Loop Example

    for (I = 0; I < 100; I++) {
        Sum [I] = VectorA [I] + VectorB [I];
    } /* for (I) */

• Each loop iteration
  – Is completely independent of all others
  – Could be executed by a different CPU in parallel
• Increase ILP by unrolling the loop
  – Modern microprocessors accomplish this to a certain degree
  – The programmer can always do better:
    • More knowledge of the program
    • Explicitly coded in the program

Loop Unrolling Example

    for (I = 0; I < 100; I += 4) {
        Sum [I+0] = VectorA [I+0] + VectorB [I+0];
        Sum [I+1] = VectorA [I+1] + VectorB [I+1];
        Sum [I+2] = VectorA [I+2] + VectorB [I+2];
        Sum [I+3] = VectorA [I+3] + VectorB [I+3];
    } /* for (I) */

• Loop unrolled four (4) times
  – Loop counter increment: four times larger
  – Loop iterations: one-fourth as many
  – Loop code: four times as many lines
• General case
  – Unrolled n times
  – If the iteration count is not evenly divisible by n,
    add "cleanup" iterations after the loop for the remainder

Loop Unrolling Analysis
• Greater ILP potential
  – Interleaves independent calculations
  – n-times unrolling guarantees n independent calculations
• The best unrolling factor depends on
  – The calculations involved
  – The CPU's functional units
• Investigated in Homework Two

This note was uploaded on 04/27/2010 for the course EECC 0306-381 taught by Professor Roymelton during the Spring '10 term at RIT.
