CA61 - Chapter 6: Enhancing Performance with Pipelining



General Concept of Pipelining
Example: Registration Process
Course selection → Course approval → Cashier → Registrar → ID photo → Pickup
Each station takes 5 min; the whole process takes 30 min.
After the initial 30 min, one student can finish registration every 5 min.

The Five Stages of a Load
Instruction lw: Cycle 1 IFetch, Cycle 2 Dec, Cycle 3 Exec, Cycle 4 Mem, Cycle 5 WB
• IFetch: Instruction Fetch and Update PC
• Dec: Registers Fetch and Instruction Decode
• Exec: Execute R-type; calculate memory address
• Mem: Read/write the data from/to the Data Memory
• WB: Write the result data into the register file

Pipelined MIPS Processor
• Start the next instruction while still working on the current one
  – improves throughput or bandwidth - the total amount of work done in a given time (average instructions per second or per clock)
  – instruction latency is not reduced (time from the start of an instruction to its completion)
[pipeline diagram: lw, sw, R-type entering the pipeline one cycle apart over cycles 1-8]

Pipelined MIPS Processor (cont.)
• the pipeline clock cycle (pipeline stage time) is limited by the slowest stage
• for some instructions, some stages are wasted cycles

Single Cycle vs. Multiple Cycle vs. Pipeline
[timing diagram: the single-cycle implementation fits lw and sw each in one long clock cycle ("Store Waste"); the multicycle implementation takes cycles 1-10 for lw, sw, and R-type; the pipelined implementation overlaps them, with "wasted" cycles for the shorter instructions]

Multiple Cycle vs. Pipeline, Bandwidth vs. Latency
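The registration example above can be checked with a short sketch: with six 5-minute stations, the pipeline "fills" in 30 minutes, and after that one student completes every 5 minutes. (Function and parameter names are illustrative, not from the slides.)

```python
# Completion time of task k in an s-stage pipeline with stage time t:
# the pipeline fills in s*t, then one task completes every t thereafter.
def completion_time(k, stages=6, stage_time=5):
    """Minutes until the k-th student (1-based) finishes registration."""
    return stages * stage_time + (k - 1) * stage_time

print(completion_time(1))   # 30: the first student sees the full latency
print(completion_time(2))   # 35: after the initial 30 min, one finishes every 5 min
```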
Multiple Cycle Implementation:
[timing diagram: lw occupies cycles 1-5 (IFetch Dec Exec Mem WB), sw cycles 6-9, and R-type begins at cycle 10]
Pipeline Implementation:
[timing diagram: lw, sw, and R-type overlapped one cycle apart]
• Latency per lw = 5 clock cycles for both
• Bandwidth of lw is 1 per clock (IPC) for pipeline vs. 1/5 IPC for multicycle
• Pipelining improves instruction bandwidth, not instruction latency

MIPS Pipeline Datapath Modifications
• What do we need to add/modify in the datapath?
  – registers between pipeline stages to isolate them: IFetch/Dec, Dec/Exec, Exec/Mem, Mem/WB
[datapath diagram: IF:IFetch | ID:Dec | EX:Execute | MEM:MemAccess | WB:WriteBack, with the PC, Instruction Memory, Register File, Sign Extend, ALU, and Data Memory, and the System Clock driving the pipeline registers]

Graphically Representing the MIPS Pipeline
[diagram: each instruction drawn as IM - Reg - ALU - DM - Reg]
• Can help with answering questions like:
  – how many cycles does it take to execute this code?
  – what is the ALU doing during cycle 4?
  – is there a hazard, why does it occur, and how can it be fixed?

Pipeline for Throughput!
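The bandwidth-vs-latency point above can be made concrete with a little arithmetic (cycle-time units are hypothetical): per-instruction latency is 5 cycles in both designs, but total time for a long run of instructions differs by nearly 5x.

```python
# Total time for n instructions on a 5-stage design, cycle time t:
def multicycle_time(n, t=1):
    return n * 5 * t            # each instruction takes all 5 cycles serially

def pipelined_time(n, t=1):
    return (5 + (n - 1)) * t    # 5 cycles to fill, then 1 instruction per cycle

print(multicycle_time(100))     # 500 cycles at 1/5 IPC
print(pipelined_time(100))      # 104 cycles: bandwidth approaches 1 IPC
```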
[diagram: Inst 0 through Inst 4, each going IM Reg ALU DM Reg, staggered one cycle apart]
Once the pipeline is full, one instruction is completed every cycle.
(The first few cycles are the time to fill the pipeline.)

Pipeline Hazards
• structural hazards: attempt to use the same resource by two different instructions at the same time
• data hazards: attempt to use data before it is ready
  – instruction source operands are produced by a prior instruction still in the pipeline
  – load instruction followed immediately by an ALU instruction that uses the load result as a source value
• control hazards: attempt to make a decision before the condition has been evaluated
  – branch instructions
• Can always resolve hazards by waiting
  – pipeline control must detect the hazard
  – take action (or delay action) to resolve hazards

A Single Memory Would Be a Structural Hazard
[diagram: lw reads data from memory in its Mem stage in the same cycle that a later instruction reads its instruction from the same memory]

How About Register File Access?
[diagram: add r1,... writes r1 in its WB stage during the same cycle that add r2,r1,... reads r1 in its Dec stage]
Potential read before write data hazard.

How About Register File Access? (cont.)
Can fix the register file access hazard by doing reads in the second half of the cycle and writes in the first half.

Register Usage Can Cause Data Hazards
• Dependencies backward in time cause hazards
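The write-first-half/read-second-half fix above can be modeled with a toy register file (the class and method names are illustrative, not from the slides): within one clock cycle, the WB-stage write lands before the Dec-stage read, so the reader sees the new value.

```python
# Toy model of the split-cycle register file: per clock cycle, the write
# happens in the first half, reads in the second half of the same cycle.
class RegisterFile:
    def __init__(self):
        self.regs = {}

    def cycle(self, write=None, reads=()):
        """write: (reg, value) done first; reads return values afterwards."""
        if write is not None:
            reg, value = write
            self.regs[reg] = value
        return [self.regs.get(r, 0) for r in reads]

rf = RegisterFile()
# add r1,... is in WB while add r2,r1,... is in Dec: same cycle, no hazard
print(rf.cycle(write=("r1", 42), reads=["r1"]))  # [42]
```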
[diagram: add r1,r2,r3 followed by sub r4,r1,r5; and r6,r1,r7; or r8,r1,r9; xor r4,r1,r5, each entering one cycle apart]
Which are read before write data hazards?

Register Usage Can Cause Data Hazards (cont.)
• Dependencies backward in time cause hazards
Read before write data hazards: sub and and need r1 before add has written it back to the register file.

[datapath diagram with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers: the branch/PC path causes control hazards; the register/ALU/memory paths cause data hazards]

Loads Can Cause Data Hazards
• Dependencies backward in time cause hazards
[diagram: lw r1,100(r2) followed by sub r4,r1,r5; and r6,r1,r7; or r8,r1,r9; xor r4,r1,r5]
Load-use data hazard.

One Way to "Fix" a Data Hazard
Can fix a data hazard by waiting - stall - but it affects throughput.
[diagram: add r1,r2,r3, two stall cycles, then sub r4,r1,r5 and and r6,r1,r7]

Another Way to "Fix" a Data Hazard
Can fix a data hazard by forwarding results as soon as they are available to where they are needed.
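The "which are hazards?" question above can be answered mechanically. A sketch (a simplification that only models register names, assuming the split-cycle register file, so a result is safely readable starting three instructions later):

```python
# Detect read-before-write (RAW) hazards: with a split-cycle register file,
# only the next two instructions can read a result before it is written back.
def raw_hazards(instrs):
    """instrs: list of (dest, src1, src2). Returns (producer, consumer) pairs."""
    hazards = []
    for i, (dest, *_) in enumerate(instrs):
        for j in range(i + 1, min(i + 3, len(instrs))):
            if dest in instrs[j][1:]:
                hazards.append((i, j))
    return hazards

seq = [("r1", "r2", "r3"),   # add r1,r2,r3
       ("r4", "r1", "r5"),   # sub r4,r1,r5  <- hazard on r1
       ("r6", "r1", "r7"),   # and r6,r1,r7  <- hazard on r1
       ("r8", "r1", "r9"),   # or  r8,r1,r9  - r1 written back by now
       ("r4", "r1", "r5")]   # xor r4,r1,r5
print(raw_hazards(seq))      # [(0, 1), (0, 2)]
```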
[diagram: forwarding paths from add r1,r2,r3 to the following sub, and, or, and xor]
Data forwarding is also called bypassing - it allows results to bypass the register file.

Forwarding with Load-use Data Hazards
[diagram: lw r1,100(r2) followed by sub r4,r1,r5; and r6,r1,r7; or r8,r1,r9; xor r4,r1,r5]
• Will still need one stall cycle even with forwarding

Pipelined Control
• No separate write signal is needed for the PC and the pipeline registers (IF/ID, ID/EX, EX/MEM, MEM/WB) since they are written during every clock cycle.
• Control lines can be divided into 5 groups according to the respective pipe stage:
  – IF: IMread, PCwrite always on
  – ID: regs always read, no special controls needed
  – EX: RegDst, ALUop, ALUsrc, immed.
  – MEM: branch, MemRead, MemWR
  – WB: MemtoReg, RegWR

Control Lines for the Last Three Stages
[diagram: the Control unit produces the EX, M, and WB control groups during decode; they travel with the instruction through the ID/EX, EX/MEM, and MEM/WB registers]

Pipeline Control Path Modifications
• All control signals can be determined during Decode
  – and held in the state registers between pipeline stages
[datapath diagram: the Control unit feeds the ID/EX pipeline register]

Data Hazard Reductions
• Hardware
  – Data forwarding/bypassing
  – Out-of-order instruction issuing
  – Value prediction
  – Combining
• Software
  – Instruction scheduling
  – Transformations to allow more effective scheduling

Instruction Scheduling
A=B+E; C=B+F;
1 lw  $t1, 0($sp)   ; load B
2 lw  $t2, 4($sp)   ; load E
3 add $t3,$t1,$t2   ; A=B+E
4 sw  $t3,12($sp)   ; store A
5 lw  $t4, 8($sp)   ; load F
6 add $t5,$t1,$t4   ; C=B+F
7 sw  $t5,16($sp)   ; store C
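Scheduling the seven instructions above is a topological-sort problem. A sketch, with the dependence edges derived from the register uses in the code (1→3, 2→3, 3→4, 1→6, 5→6, 6→7):

```python
# Any topological order of the dependence graph is a valid schedule.
from graphlib import TopologicalSorter   # stdlib, Python 3.9+

# node -> set of predecessors it depends on
deps = {3: {1, 2}, 4: {3}, 6: {1, 5}, 7: {6}, 1: set(), 2: set(), 5: set()}
order = list(TopologicalSorter(deps).static_order())
print(order)

def is_valid(schedule):
    pos = {inst: i for i, inst in enumerate(schedule)}
    return all(pos[p] < pos[i] for i, preds in deps.items() for p in preds)

# Hoisting the load of F (instr 5) above the first add separates the
# loads from their uses, e.g. the schedule 1 2 5 3 4 6 7:
print(is_valid([1, 2, 5, 3, 4, 6, 7]))  # True
```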
Instruction Scheduling (cont.)
After scheduling (the load of F is hoisted above the first add):
lw  $t1, 0($sp)   ; load B
lw  $t2, 4($sp)   ; load E
lw  $t4, 8($sp)   ; load F
add $t3,$t1,$t2   ; A=B+E
sw  $t3,12($sp)   ; store A
add $t5,$t1,$t4   ; C=B+F
sw  $t5,16($sp)   ; store C

Data Dependence Graph
[graph: 1→3, 2→3, 3→4, 1→6, 5→6, 6→7]
Any topological sort order is a valid schedule, e.g. 1235647, 1253647, 1256374, ...

Transformations for More Effective Scheduling
• Loop unrolling
• Software pipelining
• Trace formation
• Superblock formation
• Loop transformations
  – Loop fusion
• Procedure inlining

Loop Unrolling
for (i=1; i<n; i++) { a[i] = a[i] + b[i]; }
unrolled by 2:
for (i=1; i<n; i+=2) { a[i] = a[i] + b[i]; a[i+1] = a[i+1] + b[i+1]; }
[dependence graph: ld a and ld b feed add, which feeds st a; after unrolling, the two independent chains (ld a/ld b and ld a1/ld b1) can be interleaved]

Software Pipelining
for (i=1; i<n; i++) { a[i] = a[i] + b[i]; }
becomes:
x=a[1]; y=b[1];
for (i=2; i<n; i++) { a[i-1] = x+y; x=a[i]; y=b[i]; }
(plus an epilogue a[n-1] = x+y after the loop to finish the last iteration)
[dependence graph: the loads of iteration i overlap with the add/store of iteration i-1]

Data Forwarding (aka Bypassing)
• Any data dependence that goes backwards in time:
  – EX stage generating R-type ALU results or effective address calculation
  – MEM stage generating lw results
• Forward by taking the inputs to the ALU from any pipeline register rather than just ID/EX by
  – adding multiplexors to the inputs of the ALU so Rd can be passed back to either (or both) of the EX stage's Rs and Rt ALU inputs
      00: normal input (ID/EX pipeline registers)
      10: forward from previous instr (EX/MEM pipeline registers)
      01: forward from instr 2 back (MEM/WB pipeline registers)
  – adding the proper control hardware

Data Forwarding Control Conditions
1. EX/MEM hazard:
   if (EX/MEM.RegisterRd == ID/EX.RegisterRs) ForwardA = 10
   (forwards the result from the previous instr.)
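The unrolling transformation above can be sketched on Python lists (the data and function names are illustrative): the unrolled body exposes two independent add chains per iteration, plus a cleanup iteration when the trip count is odd.

```python
# Loop unrolled by 2, equivalent to: for i: a[i] = a[i] + b[i]
def add_arrays_unrolled(a, b):
    n = len(a)
    i = 0
    while i + 1 < n:                 # two independent adds per iteration
        a[i] = a[i] + b[i]
        a[i + 1] = a[i + 1] + b[i + 1]
        i += 2
    if i < n:                        # cleanup iteration for odd n
        a[i] = a[i] + b[i]

x, y = [1, 2, 3, 4, 5], [10, 20, 30, 40, 50]
add_arrays_unrolled(x, y)
print(x)  # [11, 22, 33, 44, 55]
```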
   if (EX/MEM.RegisterRd == ID/EX.RegisterRt) ForwardB = 10
   (forwards the result from the previous instr.)
2. MEM/WB hazard:
   if (MEM/WB.RegisterRd == ID/EX.RegisterRs) ForwardA = 01
   if (MEM/WB.RegisterRd == ID/EX.RegisterRt) ForwardB = 01
   (forwards the result from the second previous instr.)
– "RegisterRd" is the number of the register to be written (RD or RT)
– "RegisterRs" is the number of the RS register
– "RegisterRt" is the number of the RT register
– "ForwardA, ForwardB" control the forwarding muxes
What's wrong with this hazard control?

Data Forwarding Control Conditions (cont.)
1. EX/MEM hazard:
   if (EX/MEM.RegWrite and (EX/MEM.RegisterRd == ID/EX.RegisterRs)) ForwardA = 10
   if (EX/MEM.RegWrite and (EX/MEM.RegisterRd == ID/EX.RegisterRt)) ForwardB = 10
   (forwards the result from the previous instr., provided it writes a register)
2. MEM/WB hazard:
   if (MEM/WB.RegWrite and (MEM/WB.RegisterRd == ID/EX.RegisterRs)) ForwardA = 01
   if (MEM/WB.RegWrite and (MEM/WB.RegisterRd == ID/EX.RegisterRt)) ForwardB = 01

Another Complication
• Another potential data hazard can occur when there is a conflict between the result of the WB stage instruction and the MEM stage instruction - which should be forwarded?
[diagram: add $1,$1,$2; add $1,$1,$3; add $1,$1,$4 - both earlier adds produce $1 for the third]

Corrected Data Forwarding Control Conditions
MEM/WB hazard:
   if (MEM/WB.RegWrite
       and (MEM/WB.RegisterRd == ID/EX.RegisterRs)
       and (EX/MEM.RegisterRd != ID/EX.RegisterRs or ~EX/MEM.RegWrite))
     ForwardA = 01
   if (MEM/WB.RegWrite
       and (MEM/WB.RegisterRd == ID/EX.RegisterRt)
       and (EX/MEM.RegisterRd != ID/EX.RegisterRt or ~EX/MEM.RegWrite))
     ForwardB = 01
(The EX/MEM forward takes priority: MEM/WB forwards only when the EX/MEM instruction does not also write the needed register.)

Datapath with Forwarding Hardware
[datapath diagram: a Forwarding Unit compares EX/MEM.RegisterRd and MEM/WB.RegisterRd against the IF/ID.RegisterRs and IF/ID.RegisterRt numbers carried into ID/EX, and drives the muxes on the ALU inputs]

Branch Instructions Cause Control Hazards
• Dependencies backward in time cause hazards
[diagram: beq followed by lw, Inst 3, Inst 4 - the instructions after the branch enter the pipeline before the branch outcome is known]

One Way to "Fix" a Control Hazard
Can fix a branch hazard by waiting - stall - but it affects throughput.
[diagram: beq, three stall cycles, then lw and Inst 3]

Cost of Stalling on Each Branch Inst.
• Assume 13% of instructions are branches and each branch pays 3 stall cycles.
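The corrected forwarding conditions above can be written out as a small function (the dict field names follow the slides; the data structures themselves are illustrative). The EX/MEM-priority clause is encoded by only taking the MEM/WB forward when no EX/MEM forward was selected:

```python
# Corrected forwarding-unit logic: returns the ForwardA/ForwardB mux selects
# ('00' = register file, '10' = EX/MEM result, '01' = MEM/WB result).
def forward_controls(id_ex, ex_mem, mem_wb):
    fa = fb = "00"
    # EX/MEM hazard: previous instruction's result, only if it writes a register
    if ex_mem["RegWrite"] and ex_mem["Rd"] == id_ex["Rs"]:
        fa = "10"
    if ex_mem["RegWrite"] and ex_mem["Rd"] == id_ex["Rt"]:
        fb = "10"
    # MEM/WB hazard: second-previous instruction, only if EX/MEM didn't match
    if mem_wb["RegWrite"] and mem_wb["Rd"] == id_ex["Rs"] and fa == "00":
        fa = "01"
    if mem_wb["RegWrite"] and mem_wb["Rd"] == id_ex["Rt"] and fb == "00":
        fb = "01"
    return fa, fb

# add $1,$1,$2; add $1,$1,$3; add $1,$1,$4 - the most recent $1 must win:
print(forward_controls({"Rs": 1, "Rt": 4},
                       {"RegWrite": True, "Rd": 1},    # EX/MEM: add $1,$1,$3
                       {"RegWrite": True, "Rd": 1}))   # MEM/WB: add $1,$1,$2
# -> ('10', '00'): ForwardA takes the newer EX/MEM value
```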
• 3 × 0.13 = 0.39, i.e. 39% overhead
• the cost could be higher for a deeper pipeline
Q: do we need to stall on jumps?

Jumps Incur One Stall
• Jumps are not decoded until ID, so one stall is needed
[diagram: j, one stall cycle, then lw and and]
• Jumps are infrequent - only 2% of the SPECint instruction mix

Reducing the Delay of Branches
• We have assumed the next PC is selected in the MEM stage.
• The MIPS branch instruction is designed for smaller branch delays
  – Simple tests (e.g. equal, or sign) do not require a full ALU operation
  – More complex branches are just pseudo instructions
• Moving the branch decision up requires two actions
  – Computing the target address in the ID stage
  – Branch decision evaluation in the ID stage
• This complicates data forwarding and hazard detection!!

Reduced Branch Delay
[diagram: beq, a single stall cycle, then lw and Inst 3 - the target address is always computed in ID, and the EQ test selects the new PC there]

Control Hazard Reductions
• Hardware
  – Branch prediction: static, dynamic
• Software
  – Branch elimination
  – Code duplication
  – Loop unrolling/peeling
  – If-conversion

Branch Prediction
Two outcomes to predict:
• Whether: taken or not
  – Static: predict not taken; backward taken, forward not taken; prediction bit; register convention
  – Dynamic: 2-bit saturating counters, 2-level predictions
• Where: target address
  – JR is a problem, but a return stack can make return branches predictable.

Simple Branch Prediction
• Predict not taken:
  – for many applications, taken branches outnumber untaken branches
  – performance loss is too high if the pipeline is deeper or combined with superscalar implementations
• Predict backward taken, forward fall-thru:
  – integer codes often have loops that iterate a small number of times
  – some forward branches may be taken
  – the programmer/compiler can make this scheme more effective
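The stall-cost arithmetic above generalizes to a one-line CPI overhead formula (function name illustrative): extra cycles per instruction = branch frequency × stall penalty.

```python
# CPI overhead added by branch stalls on top of the base CPI of 1.
def branch_stall_overhead(branch_freq, stall_cycles):
    return branch_freq * stall_cycles

print(round(branch_stall_overhead(0.13, 3), 2))  # 0.39 extra CPI (39%)
print(round(branch_stall_overhead(0.13, 1), 2))  # 0.13 with the reduced 1-cycle delay
```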
Simple Branch Prediction (cont.)
Test: if (x>y) then { A } else { B }
• Is A or B the fall-thru block?
• The compiler reverses the condition to make A the fall-thru block!!
• So, put the most likely executed code in the THEN block.

Static Branch Prediction
• Static prediction bit
  – Assume there is a bit in the branch instruction to hint at the branch direction
  – The compiler can use this bit to communicate the most likely branch direction
  – The compiler may take user directives, but few users like to provide such directives.
• Compiler static prediction:
  – Profile based - this has been shown to be effective
  – Heuristic based - interesting, but less accurate

Static Branch Prediction (cont.)
Heuristic based static prediction:
• Opcode based: programmers often use negative numbers for error conditions, so bltz and blez are predicted not taken; bgtz and bgez are predicted taken.
• Loop based: loop back branches are predicted taken, loop header branches not taken.
• Call based: many conditional calls are used to handle exceptions, so predict not taken.
• Pointer based: (p!=NULL) predict true, (p==q) predict not true, ... etc.

Static Branch Prediction (cont.)
What if no predict bit is available?
HP PA uses a register convention to guide branch prediction:
• If the register number of the 1st operand is smaller than the 2nd, normal prediction (backward taken, forward not taken) is used.
• If the register number of the 1st is greater, reverse prediction (backward not taken, forward taken) is used.
• COMPB, gt, $R1,$R2 is the same as COMPB, ltz, $R2,$R1

Dynamic Branch Prediction
Keeps the branch history and branch target address for each branch, such as the following branch prediction buffer that combines "whether" and "where" information for prediction:

  Address Tag    Target Address    History
  01000100       00100100          101111
  01000150       001001A0          000000
  ...            ...               ...

Dynamic Branch Prediction (cont.)
• Assume the target address can be computed early in the pipeline, and set aside indirect branches, so the target address column can be taken out. The branch prediction buffer becomes a simpler branch history table:

  Address Tag    History
  01000100       101111
  01000150       000000
  ...            ...

Dynamic Branch Prediction (cont.)
• We may skip the address tag, and use the lower address bits to index into the prediction table. Since multiple branch instructions may map to the same entry, aliasing does exist. However, branch prediction is just a hint; aliasing only impacts performance, not correct execution.

Dynamic Branch Prediction (cont.)
• The history can be as short as one outcome, making the history table a one-bit array: if the last branch outcome was taken (1), predict taken; if the last outcome was not taken (0), predict not taken.

Dynamic Branch Prediction (cont.)
Single-bit prediction has a performance shortcoming. Consider the following loop:
  for (i=0; i<9; i++)
The loop back branch will be taken 9 times, but the prediction accuracy is only 80%: the predictor mispredicts twice per pass through the loop - on the final (not-taken) iteration, and again on the first (taken) iteration of the next pass - 2 mispredictions per 10 branches.
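The 80% figure above can be reproduced by simulating the 1-bit predictor on the loop-back branch pattern (taken 9 times, then not taken, repeated):

```python
# 1-bit predictor: remember only the last outcome and predict it repeats.
def one_bit_accuracy(outcomes, initial=0):
    state, correct = initial, 0
    for taken in outcomes:
        correct += (state == taken)
        state = taken
    return correct / len(outcomes)

pattern = ([1] * 9 + [0]) * 100   # the loop entered 100 times in a row
print(one_bit_accuracy(pattern))  # 0.8 - two misses per pass through the loop
```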
Dynamic Branch Prediction (cont.)
In practice, a 2-bit prediction scheme is often used. In a 2-bit scheme, a prediction must be wrong twice before it is changed.
[state diagram: Predict Taken (strong) <-> Predict Taken (weak) <-> Predict Not Taken (weak) <-> Predict Not Taken (strong), moving toward Taken on T and toward Not Taken on N]

Performance Comparison: Multicycle vs. Pipelined
• Assume the same cycle time per unit as in the previous example for the multicycle datapath.
• For pipelined execution, assume the following:
  – 50% of loads are followed by an instruction that immediately uses the result
  – Branch misprediction costs 1 cycle
  – 25% of branches are mispredicted
  – Jumps always take 2 cycles

CPI Comparison
Using the SPECint2000 instruction mix:

  Instruction   Mix    Multicycle cycles   Pipelined cycles
  Load          25%    5                   1.5  (*)
  Store         10%    4                   1
  Branch        11%    3                   1.25 (**)
  Jump           2%    3                   2    (***)
  ALU           52%    4                   1

  (*)   50% of loads followed by an immediate use: 1 stall cycle
  (**)  25% misprediction, 1 cycle delay
  (***) jumps take 2 cycles

Multicycle CPI: 0.25*5 + 0.10*4 + 0.11*3 + 0.02*3 + 0.52*4 = 4.12
Pipelined CPI:  0.25*1.5 + 0.10*1 + 0.11*1.25 + 0.02*2 + 0.52*1 = 1.17

Advanced Pipelining
• Pipelining exploits the potential parallelism among instructions. This parallelism is called Instruction Level Parallelism (ILP).
• Two methods to increase ILP:
  – Superpipelining: increase the depth of the pipeline (reduce cycle time)
  – Multiple issue: replicate functional units so that multiple instructions can be issued (increase IPC)
      Static multiple issue (e.g. VLIW)
      Dynamic multiple issue (e.g. Superscalar)
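The 2-bit scheme above is a saturating counter with four states; on the same loop-back branch that defeats the 1-bit predictor, it misses only on the final not-taken iteration of each pass. A sketch (state encoding 0-3 is a common convention, assumed here):

```python
# 2-bit saturating counter: states 0..3, predict taken when state >= 2.
# A prediction must be wrong twice in a row before the direction flips.
def two_bit_accuracy(outcomes, state=3):
    correct = 0
    for taken in outcomes:
        correct += ((state >= 2) == bool(taken))
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

pattern = ([1] * 9 + [0]) * 100   # loop-back branch: taken 9x, then not taken
print(two_bit_accuracy(pattern))  # 0.9 - only the final iteration misses
```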
Speculation
• One of the most important methods for finding and exploiting ILP is speculation.
• Speculation allows the execution to exploit statistical ILP (e.g. the branch is taken 90% of the time, or the address of pointer p is 99% of the time different from the address of pointer q).
• Two forms of speculation:
  – Control speculation
  – Data speculation

Control Speculation Example
if (cond) { A = p[i]->b; }

  lw   $r6,...
  add  $r3, $r6...
  beq  ...            ; branch guards the if-block
  lw   $r1, 0($r2)
  add  $r3, $r1,$r4
  lw   $r5,4($r3)
  sw   $r5,4($sp)
  beq  ...

In this block, there is no room to schedule the load!! Why not move the load instruction into the previous block?

Control Speculation Example (cont.)
• Is the cond most likely to be true? Profile feedback may guide the optimization.
• What if the address of p is bad and causes a memory fault? Can we have a special load instruction that ignores memory faults?
• But what if the real load of p causes a memory fault? We cannot just ignore it!!

Data Speculation Example
{ *p = a; b = *q + 1; }

  lw   $r3,4($sp)
  sw   $r3, 0($r1)
  lw   $r5,0($r2)
  addi $r6,$r5,1
  sw   $r6,8($sp)

In this block, there is no room to schedule the load!! How can we move the load instruction ahead of the store? $r2 and $r1 may be different most of the time.

Data Speculation Example (cont.)
  lw   $r3,4($sp)
  lw   $r5,0($r2)            ; load moved above the store
  sw   $r3, 0($r1)
  if (r1==r2) copy $r5,$r3   ; repair if the store aliased the load
  addi $r6,$r5,1
  sw   $r6,8($sp)

What if there are m loads moving above n stores? m x n comparisons must be generated!!
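The data-speculation transform above can be sketched with a dict standing in for memory (addresses, names, and the repair convention are all illustrative): the load of *q is hoisted above the store to *p, and a runtime comparison repairs the value when the two pointers alias.

```python
# *p = a; b = *q + 1  -- with the load of *q speculatively moved up.
def spec_store_load(mem, p, q, a):
    speculative = mem.get(q, 0)   # load *q early, before the store
    mem[p] = a                    # the store that might alias q
    if p == q:                    # runtime check: repair if it did
        speculative = a
    return speculative + 1        # b

mem = {100: 7}
print(spec_store_load(mem, p=200, q=100, a=5))  # 8: no alias, speculation holds
print(spec_store_load(mem, p=100, q=100, a=5))  # 6: alias detected, repaired
```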
Instruction Encoding in Itanium
[bundle layout: Instruction 2 (41 bits) | Instruction 1 (41 bits) | Instruction 0 (41 bits) | Template (5 bits)]
• A bundle is 128 bits, which has room for 3 instructions.
• Each instruction is 41 bits long, with a 5-bit template to help decode and route instructions.
• The template can also mark the end of groups of instructions that can execute in parallel.
• In Itanium-I and Itanium-II (McKinley), the processor can handle two bundles (i.e. 6 instructions) per clock cycle.

Compiler Directed Speculation
• Allows the compiler to issue an operation early, before a dependency
• Removes the latency of the operation from the critical path
• Helps hide long latency memory operations
• Two types of speculation:
  – Control speculation, which is the execution of an operation before the branch which guards it
  – Data speculation, which is the execution of a memory load prior to a preceding store which may alias with it

Speculation Examples
Control speculation:
  original:              transformed:
  (p1) br.cond           ld8.s r1 = [r2]
  ld8 r1 = [r2]          ...
                         (p1) br.cond
                         ...
                         chk.s r1, recovery

Data speculation:
  original:              transformed:
  st4 [r3] = r7          ld8.a r1 = [r2]
  ld8 r1 = [r2]          ...
                         st4 [r3] = r7
                         ...
                         chk.a r1, recovery
Predication in Itanium
• Allows instructions to be conditionally executed
• A predicate register operand controls execution
• Removes branches and associated mispredict penalties
• Creates larger basic blocks and simplifies compiler optimizations

  MIPS:                     Itanium:
      bne  $r1,$r2, NT          cmp.eq p1,p2 = r1,r2 ;;
      addi $r1,$r2,4       (p1) add r1 = r2, 4
      lw   $r7,8($r8)      (p2) ld8 r7 = [r8], 8
  NT:

If p1 is true, the add is performed, else it acts as a nop.
If p2 is true, the ld8 is performed, else it acts as a nop.

Speculation in Superscalar OOO Processors
[re-order queue diagram, cycles 1-2: the later lw $r7,0($r8) is executed before the branch instruction]

Speculation in Superscalar OOO Processors (cont.)
[re-order queue diagram, cycles 3-4: a load is executed before the store instruction is complete]

Summary
• All modern day processors use pipelining
• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
  – Multiple tasks operate simultaneously using different resources
• Potential speedup = number of pipe stages
• Pipeline rate is limited by the slowest pipeline stage
  – Unbalanced lengths of pipe stages reduce speedup
  – Time to "fill" the pipeline and time to "drain" it reduce speedup
• Must detect and resolve hazards
  – Hazard reduction techniques (HW & SW)

This note was uploaded on 01/26/2011 for the course CSCI 4203 taught by Professor Weichunghsu during the Fall '05 term at Minnesota.
