L17-Pipelined-MIPS

L17-Pipelined-MIPS - COMP541 Pipelined MIPS Montek Singh...


COMP541
Pipelined MIPS
Montek Singh
Mar 30, 2010

Topics
  Pipelining
    Can be thought of as
      a way to parallelize, or
      a way to make better use of the hardware
    Goal: use all hardware every cycle
  Section 7.5 of text

Parallelism
  Two types of parallelism:
    Spatial parallelism
      duplicate hardware performs multiple tasks at once
    Temporal parallelism
      task is broken into multiple stages
      also called pipelining
      for example, an assembly line

Parallelism Definitions
  Some definitions:
    Token: a group of inputs processed to produce a group of outputs
    Latency: time for one token to pass from start to end
    Throughput: the number of tokens that can be produced per unit time
  Parallelism increases throughput
    Often sacrificing latency

Parallelism Example
  Ben is baking cookies
  It takes 5 minutes to roll the cookies and 15 minutes to bake them.
  After finishing one batch he immediately starts the next batch.
  What are the latency and throughput if Ben doesn't use parallelism?
    Latency = 5 + 15 = 20 minutes = 1/3 hour
    Throughput = 1 tray / (1/3 hour) = 3 trays/hour

Parallelism Example
  What are the latency and throughput if Ben uses parallelism?
    Spatial parallelism: Ben asks Alyssa to help, using her own oven
    Temporal parallelism: Ben breaks the task into two stages, rolling and baking. He uses two trays. While the first batch is baking he rolls the second batch, and so on.

Spatial Parallelism
  [Figure: timeline, 0 to 50 minutes, of roll and bake steps. Trays 1 and 3 are Ben's (Ben 1, Ben 2); trays 2 and 4 are Alyssa's (Alyssa 1, Alyssa 2). Latency is the time to the first tray.]
  Latency = ?
  Throughput = ?
Spatial Parallelism
  [Figure: timeline, 0 to 50 minutes, of roll and bake steps. Trays 1 and 3 are Ben's (Ben 1, Ben 2); trays 2 and 4 are Alyssa's (Alyssa 1, Alyssa 2). Latency is the time to the first tray.]
  Latency = 5 + 15 = 20 minutes = 1/3 hour (same)
  Throughput = 2 trays / (1/3 hour) = 6 trays/hour (doubled)

Temporal Parallelism
  [Figure: timeline, 0 to 50 minutes. Trays 1, 2, and 3 are all Ben's (Ben 1, Ben 2, Ben 3); each tray is rolled while the previous one bakes. Latency is the time to the first tray.]
  Latency = ?
  Throughput = ?

Temporal Parallelism
  [Figure: timeline, 0 to 50 minutes, for Ben's trays 1 to 3.]
  Latency = 5 + 15 = 20 minutes = 1/3 hour
  Throughput = 1 tray / (1/4 hour) = 4 trays/hour
  Using both techniques, the throughput would be 8 trays/hour

Pipelined MIPS
  Temporal parallelism
  Divide single-cycle processor into 5 stages:
    Fetch
    Decode
    Execute
    Memory
    Writeback
  Add pipeline registers between stages
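
The cookie numbers above, and the idea of cutting a processor into stages, follow from the same arithmetic. The sketch below is only an illustration: the function name and its parameters are made up for this note, and it assumes steady state (no start-up effects, no overhead between stages).

```python
from fractions import Fraction

def pipeline_metrics(stage_minutes, copies=1, pipelined=True):
    """Return (latency in minutes, throughput in tokens per hour).

    Hypothetical helper: assumes steady state and zero overhead between stages.
    """
    latency = sum(stage_minutes)
    # Pipelined: a new token can start as soon as the slowest stage is free.
    # Not pipelined: a new token starts only when the previous one has finished.
    interval = max(stage_minutes) if pipelined else latency
    throughput = Fraction(60 * copies, interval)
    return latency, throughput

stages = [5, 15]  # roll, bake (minutes)
serial = pipeline_metrics(stages, pipelined=False)
spatial = pipeline_metrics(stages, copies=2, pipelined=False)
temporal = pipeline_metrics(stages)
both = pipeline_metrics(stages, copies=2)

for name, (lat, thr) in [("serial", serial), ("spatial", spatial),
                         ("temporal", temporal), ("both", both)]:
    print(f"{name:8s} latency = {lat} min, throughput = {thr} trays/hour")
```

Latency stays at 20 minutes in every case; only the interval between finished trays changes.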
Single-Cycle vs. Pipelined Performance
  [Figure: timing diagram, 0 to 1900 ps. Single-cycle: instruction 1 runs Fetch Instruction, Decode / Read Reg, Execute ALU, Memory Read/Write, and Write Reg to completion before instruction 2 begins. Pipelined: instructions 1, 2, and 3 overlap, each one stage behind the previous one.]

Pipelining Abstraction
  [Figure: pipeline diagram over cycles 1 to 10. Each instruction passes through IM, RF, the ALU operation, DM, and RF, entering one cycle after the previous one.]
    lw  $s2, 40($0)
    add $s3, $t1, $t2
    sub $s4, $s1, $s5
    and $s5, $t5, $t6
    sw  $s6, 20($s1)
    or  $s7, $t3, $t4

Single-Cycle and Pipelined Datapath
  [Figure: single-cycle datapath and pipelined datapath. The pipelined version places registers between the Fetch, Decode, Execute, Memory, and Writeback stages; signal names carry a stage suffix, e.g. PCF, InstrD, SrcAE, ALUOutM, ResultW.]

Multi-Cycle and Pipelined Datapath
  [Figure: multi-cycle datapath and pipelined datapath; pipeline stages Fetch, Decode, Execute, Memory, Writeback.]
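
The cycle numbering in the pipelining abstraction diagram can be reproduced with a few lines. This is a sketch under the assumption of an ideal pipeline with no stalls or flushes; the names STAGES and schedule are invented here for illustration.

```python
STAGES = ["Fetch", "Decode", "Execute", "Memory", "Writeback"]

def schedule(instructions):
    """Pair each instruction with {stage: cycle}, assuming no stalls or flushes."""
    table = []
    for i, instr in enumerate(instructions):
        # Instruction i (0-based) is fetched in cycle i + 1 and advances one stage per cycle.
        cycles = {stage: i + 1 + s for s, stage in enumerate(STAGES)}
        table.append((instr, cycles))
    return table

program = [
    "lw  $s2, 40($0)",
    "add $s3, $t1, $t2",
    "sub $s4, $s1, $s5",
    "and $s5, $t5, $t6",
    "sw  $s6, 20($s1)",
    "or  $s7, $t3, $t4",
]

table = schedule(program)
total_cycles = table[-1][1]["Writeback"]

for instr, cycles in table:
    print(f"{instr:20s}", " ".join(f"{stage[0]}@{c}" for stage, c in cycles.items()))
print("cycles to finish all instructions:", total_cycles)
```

With six instructions and five stages the last writeback lands in cycle 10, which is where the diagram's time axis ends.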
Corrected Pipelined Datapath
  WriteReg must arrive at the same time as Result
  [Figure: pipelined datapath with WriteReg carried through the pipeline registers as WriteRegE, WriteRegM, and WriteRegW.]

Pipelined Control
  [Figure: pipelined datapath with control unit. The control signals (RegWrite, MemtoReg, MemWrite, Branch, ALUControl, ALUSrc, RegDst) are produced from Op (31:26) and Funct (5:0) and carried through the pipeline registers with stage suffixes D, E, M, W.]
  Same control unit as single-cycle processor
  Control delayed to proper pipeline stage

Pipeline Hazard
  Occurs when an instruction depends on results from a previous instruction that hasn't completed.
  Types of hazards:
    Data hazard: register value not written back to register file yet
    Control hazard: next instruction not decided yet (caused by branches)

Data Hazard
  [Figure: pipeline diagram over cycles 1 to 8. add writes $s0; the and, or, and sub that follow all read $s0.]
    add $s0, $s2, $s3
    and $t0, $s0, $s1
    or  $t1, $s4, $s0
    sub $t2, $s0, $s5

Handling Data Hazards
  Static
    Insert nops in code at compile time
    Rearrange code at compile time
  Dynamic
    Forward data at run time
    Stall the processor at run time

Compile-Time Hazard Elimination
  Insert enough nops for result to be ready
  Or move independent useful instructions forward
  [Figure: pipeline diagram over cycles 1 to 10.]
    add $s0, $s2, $s3
    nop
    nop
    and $t0, $s0, $s1
    or  $t1, $s4, $s0
    sub $t2, $s0, $s5

Data Forwarding
  Also known as bypassing
  [Figure: pipeline diagram over cycles 1 to 8.]
    add $s0, $s2, $s3
    and $t0, $s0, $s1
    or  $t1, $s4, $s0
    sub $t2, $s0, $s5

Data Forwarding
  [Figure: pipelined datapath with a Hazard Unit. The Hazard Unit drives ForwardAE and ForwardBE, which select the SrcAE and SrcBE operands through three-way muxes (00, 01, 10).]
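
The compile-time approach (insert nops until the result is ready) can be sketched as a small pass over the instruction list. This is an illustration only: it assumes a dependent instruction must sit at least three slots after the instruction that produces its operand (consistent with the two nops in the example above), it handles only simple "op rd, rs, rt" instructions, and the names MIN_DISTANCE, regs, and insert_nops are invented for this note.

```python
import re

# Assumption: a consumer must be at least this many slots after its producer.
MIN_DISTANCE = 3

def regs(instr):
    """Return (destination, sources) for a simple 'op rd, rs, rt' string."""
    names = re.findall(r"\$\w+", instr)
    return names[0], names[1:]

def insert_nops(program):
    """Return a copy of program with nops inserted to separate dependent instructions."""
    out = []
    for instr in program:
        _, sources = regs(instr)
        needed = 0
        # Check the most recent emitted instructions that are still too close.
        for back in range(1, MIN_DISTANCE):
            if back > len(out):
                break
            prev = out[-back]
            if prev != "nop" and regs(prev)[0] in sources:
                needed = max(needed, MIN_DISTANCE - back)
        out.extend(["nop"] * needed)
        out.append(instr)
    return out

program = [
    "add $s0, $s2, $s3",
    "and $t0, $s0, $s1",
    "or  $t1, $s4, $s0",
    "sub $t2, $s0, $s5",
]
padded = insert_nops(program)
print("\n".join(padded))
```

A real compiler would first try to move independent instructions into those slots, as the slide notes, and fall back to nops only when nothing useful fits.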
Data Forwarding
  Forward to Execute stage from either:
    Memory stage or
    Writeback stage
  Forwarding logic for ForwardAE:
    if ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM) then
      ForwardAE = 10
    else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW) then
      ForwardAE = 01
    else
      ForwardAE = 00
  Forwarding logic for ForwardBE same, but replace rsE with rtE

Data Forwarding
  [Figure: the ForwardAE logic above, shown next to the pipelined datapath with the Hazard Unit.]

Forwarding can fail…
  [Figure: pipeline diagram over cycles 1 to 8. The and needs $s0 in its Execute stage before lw has read it from data memory ("Trouble!").]
    lw  $s0, 40($0)
    and $t0, $s0, $s1
    or  $t1, $s4, $s0
    sub $t2, $s0, $s5
  lw has a 2-cycle latency!
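
The ForwardAE / ForwardBE rule given above translates almost word for word into code. The sketch below is a software model for illustration, not the hardware description; the function name and argument names are made up, and the register numbers used in the example are the standard MIPS ones ($t0 = 8, $s0 = 16, $s1 = 17, $s4 = 20).

```python
def forward_select(src_e, write_reg_m, reg_write_m, write_reg_w, reg_write_w):
    """Mux select for one Execute-stage operand (rsE for ForwardAE, rtE for ForwardBE)."""
    # The Memory stage is checked first, so the most recent result wins.
    if src_e != 0 and src_e == write_reg_m and reg_write_m:
        return 0b10
    if src_e != 0 and src_e == write_reg_w and reg_write_w:
        return 0b01
    return 0b00

T0, S0, S1, S4 = 8, 16, 17, 20

# 'and $t0, $s0, $s1' in Execute while 'add $s0, ...' is in Memory.
and_a = forward_select(S0, write_reg_m=S0, reg_write_m=True, write_reg_w=0, reg_write_w=False)
and_b = forward_select(S1, write_reg_m=S0, reg_write_m=True, write_reg_w=0, reg_write_w=False)

# 'or $t1, $s4, $s0' in Execute: 'and' (writes $t0) is in Memory, 'add' (writes $s0) in Writeback.
or_a = forward_select(S4, write_reg_m=T0, reg_write_m=True, write_reg_w=S0, reg_write_w=True)
or_b = forward_select(S0, write_reg_m=T0, reg_write_m=True, write_reg_w=S0, reg_write_w=True)

print(f"and: ForwardAE={and_a:02b} ForwardBE={and_b:02b}")
print(f"or:  ForwardAE={or_a:02b} ForwardBE={or_b:02b}")
```

The test against register 0 matters because $0 is hardwired to zero, so a write to it must never be forwarded.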
Stalling
  [Figure: pipeline diagram over cycles 1 to 9. The and is held in Decode and the or in Fetch for one extra cycle (Stall).]
    lw  $s0, 40($0)
    and $t0, $s0, $s1
    or  $t1, $s4, $s0
    sub $t2, $s0, $s5

Stalling Hardware
  [Figure: pipelined datapath with Hazard Unit. The Hazard Unit now takes MemtoRegE as an input and outputs StallF, StallD, and FlushE; the affected pipeline registers have EN and CLR inputs.]

Stalling Hardware
  Stalling logic:
    lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemtoRegE
    StallF = StallD = FlushE = lwstall

Stalling Control
  lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemtoRegE
  StallF = StallD = FlushE = lwstall
  [Figure: pipelined datapath with Hazard Unit, including StallF, StallD, and FlushE.]
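
The stalling logic above is a single Boolean expression, so a software model of it is short. The function below is a hypothetical illustration (its name and the returned dictionary are not part of the slides); register numbers are the standard MIPS ones ($s0 = 16, $s1 = 17, $s4 = 20).

```python
def load_stall_signals(rs_d, rt_d, rt_e, mem_to_reg_e):
    """Model of: lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemtoRegE."""
    lwstall = bool((rs_d == rt_e or rt_d == rt_e) and mem_to_reg_e)
    # All three outputs are driven by the same condition.
    return {"StallF": lwstall, "StallD": lwstall, "FlushE": lwstall}

S0, S1, S4 = 16, 17, 20

# 'lw $s0, 40($0)' is in Execute (rtE = $s0, MemtoRegE = 1),
# 'and $t0, $s0, $s1' is in Decode (rsD = $s0, rtD = $s1).
hazard = load_stall_signals(rs_d=S0, rt_d=S1, rt_e=S0, mem_to_reg_e=True)

# Same registers, but the Execute-stage instruction is not a load.
no_load = load_stall_signals(rs_d=S0, rt_d=S1, rt_e=S0, mem_to_reg_e=False)

# A load in Execute whose destination is not read by the Decode-stage instruction.
unrelated = load_stall_signals(rs_d=S1, rt_d=S4, rt_e=S0, mem_to_reg_e=True)

print(hazard, no_load, unrelated)
```

Stalling Fetch and Decode holds the dependent instruction in place, while flushing Execute inserts the bubble that gives lw its extra cycle.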
Control Hazards
  beq: branch is not determined until the fourth stage of the pipeline
  Instructions after the branch are fetched before branch occurs
  These instructions must be flushed if the branch happens

Effect & Solutions
  Could stall when branch decoded
    Expensive: 3 cycles lost per branch!
  Could predict and flush if wrong
  Branch misprediction penalty
    Instructions flushed when branch is taken
    May be reduced by determining branch earlier

Control Hazards: Flushing
  [Figure: pipeline diagram over cycles 1 to 9. The three instructions fetched after the beq are flushed ("Flush these instructions"); slt at the branch target is fetched next.]
    20  beq $t1, $t2, 40
    24  and $t0, $s0, $s1
    28  or  $t1, $s4, $s0
    2C  sub $t2, $s0, $s5
    30  ...
        ...
    64  slt $t3, $s2, $s3

Control Hazards: Original Pipeline (for comparison)
  [Figure: pipelined datapath with Hazard Unit; the branch signals BranchM and PCSrcM are in the Memory stage.]

Control Hazards: Early Branch Resolution
  [Figure: pipelined datapath with an equality comparator (=) on the register file outputs in the Decode stage, producing EqualD and PCSrcD; the branch target PCBranchD is also computed in Decode.]
  Introduced another data hazard in Decode stage (fix a few slides away)

Control Hazards with Early Branch Resolution
  [Figure: pipeline diagram over cycles 1 to 9. Only the and fetched right after the beq is flushed ("Flush this instruction").]
    20  beq $t1, $t2, 40
    24  and $t0, $s0, $s1
    28  or  $t1, $s4, $s0
    2C  sub $t2, $s0, $s5
    30  ...
        ...
    64  slt $t3, $s2, $s3
  Penalty now only one lost cycle

Aside: Delayed Branch
  MIPS always executes instruction following a branch
    So branch delayed
  This allows us to avoid killing inst.
  Compilers move an instruction that has no conflict with the branch into the delay slot

Example
  This sequence
    add $4, $5, $6
    beq $1, $2, 40
  reordered to this
    beq $1, $2, 40
    add $4, $5, $6

Handling the New Hazards
  [Figure: pipelined datapath with Hazard Unit. The Hazard Unit gains the inputs BranchD and RegWriteE and the outputs ForwardAD and ForwardBD, which select the operands of the Decode-stage equality comparator.]

Control Forwarding and Stalling Hardware
  Forwarding logic:
    ForwardAD = (rsD != 0) AND (rsD == WriteRegM) AND RegWriteM
    ForwardBD = (rtD != 0) AND (rtD == WriteRegM) AND RegWriteM
  Stalling logic:
    branchstall = BranchD AND RegWriteE AND (WriteRegE == rsD OR WriteRegE == rtD)
                  OR
                  BranchD AND MemtoRegM AND (WriteRegM == rsD OR WriteRegM == rtD)
    StallF = StallD = FlushE = lwstall OR branchstall

Branch Prediction
  Especially important if branch penalty > 1 cycle
  Guess whether branch will be taken
    Backward branches are usually taken (loops)
    Perhaps consider history of whether branch was previously taken to improve the guess
  Good prediction reduces the fraction of branches requiring a flush

Pipelined Performance Example
  Ideally CPI = 1
  But in practice CPI is higher, due to stalls (caused by loads and branches)
  SPECINT2000 benchmark:
    25% loads
    10% stores
    11% branches
    2% jumps
    52% R-type
  Suppose:
    40% of loads used by next instruction
    25% of branches mispredicted
    All jumps flush next instruction
  What is the average CPI?

Pipelined Performance Example
  Load/Branch CPI = 1 when no stalling, 2 when stalling. Thus,
    CPI_lw = 1(0.6) + 2(0.4) = 1.4
    CPI_beq = 1(0.75) + 2(0.25) = 1.25
  Average CPI = (0.25)(1.4) + (0.1)(1) + (0.11)(1.25) + (0.02)(2) + (0.52)(1) = 1.15

Pipelined Performance
  Pipelined processor critical path:
    T_c = max {
      t_pcq + t_mem + t_setup,
      2(t_RFread + t_mux + t_eq + t_AND + t_mux + t_setup),
      t_pcq + t_mux + t_mux + t_ALU + t_setup,
      t_pcq + t_memwrite + t_setup,
      2(t_pcq + t_mux + t_RFwrite)
    }
  [Figure: complete pipelined datapath with control unit and Hazard Unit.]

Pipelined Performance Example
  T_c = 2(t_RFread + t_mux + t_eq + t_AND + t_mux + t_setup)
      = 2[150 + 25 + 40 + 15 + 25 + 20] ps = 550 ps
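
The CPI and cycle-time arithmetic above can be checked in a few lines. This is just a worked restatement of the slide's numbers; the dictionary and variable names are invented for this note, and only the Decode-stage path is evaluated because it is the only one for which delays are given.

```python
# Instruction mix from the SPECINT2000 example.
mix = {"load": 0.25, "store": 0.10, "branch": 0.11, "jump": 0.02, "rtype": 0.52}

# Per-class CPI: 1 without a stall, 2 with a one-cycle stall or flush.
cpi = {
    "load": 1 * 0.60 + 2 * 0.40,    # 40% of loads are used by the next instruction
    "store": 1,
    "branch": 1 * 0.75 + 2 * 0.25,  # 25% of branches are mispredicted
    "jump": 2,                      # every jump flushes the next instruction
    "rtype": 1,
}

average_cpi = sum(mix[k] * cpi[k] for k in mix)

# Decode-stage path delays in picoseconds, as given in the example.
t_rfread, t_mux, t_eq, t_and, t_setup = 150, 25, 40, 15, 20
t_c_ps = 2 * (t_rfread + t_mux + t_eq + t_and + t_mux + t_setup)

print(f"average CPI = {average_cpi:.4f} (rounds to {average_cpi:.2f})")
print(f"T_c = {t_c_ps} ps")
```

The unrounded average is 1.1475, which the slide reports as 1.15.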
Pipelined Performance Example
  For a program with 100 billion instructions executing on a pipelined MIPS processor,
    CPI = 1.15
    T_c = 550 ps
  Execution Time = (# instructions) × CPI × T_c
                 = (100 × 10^9)(1.15)(550 × 10^-12)
                 = 63 seconds

Summary
  Pipelining attempts to use hardware more efficiently
  Throughput increases at cost of latency
    Hazards ensue
  Modern processors are pipelined

Next Time
  I/O
    Joysticks
    Keyboard (and mouse?)