chapter4-part2 - CSCI-365 Computer Organization Lecture...


CSCI-365 Computer Organization Lecture

Note: Some slides and/or pictures in the following are adapted from: Computer Organization and Design, Patterson & Hennessy, ©2005

Recap: Single Cycle Datapath
• Uses the clock cycle inefficiently
  – the clock cycle must be timed to accommodate the slowest instruction
  – especially problematic for more complex instructions like floating point multiply
[Figure: clock timing diagram; lw fills Cycle 1, sw finishes early in Cycle 2 and the remainder of that cycle is wasted]

Instruction Times (Critical Paths)
What is the clock cycle time assuming negligible delays for muxes, control unit, sign extend, PC access, shift left 2, wires, setup and hold times except:
• Instruction and Data Memory (200 ps)
• ALU and adders (200 ps)
• Register File access (reads or writes) (100 ps)

Instr.         I Mem   Reg Rd   ALU Op   D Mem   Reg Wr   Total
R-type (45%)   200     100      200              100      600
Load (25%)     200     100      200      200     100      800
Store (10%)    200     100      200      200              700
Beq (15%)      200     100      200                       500
Jump (5%)      200                                        200
(all times in ps)

The clock cycle must cover the slowest instruction, the load, so the single cycle clock is 800 ps.

[Figure: Simplified MIPS Pipelined Datapath]

Stages
• Five stages, one step per stage
  1. IF: Instruction fetch from memory
  2. ID: Instruction decode & register read
  3. EX: Execute operation or calculate address
  4. MEM: Access memory operand
  5. WB: Write result back to register

Single Cycle vs. Multiple Cycle
Single Cycle Implementation:
[Figure: lw fills Cycle 1; sw finishes early in Cycle 2, leaving wasted time]
Multiple Cycle Implementation:
Cycle:   1    2    3    4    5    6    7    8    9    10
lw       IF   ID   EX   MEM  WB
sw                                IF   ID   EX   MEM
R-type                                                IF

Gotta Do Laundry
• Michael, Conan, Jimmy, Pat each have one load of clothes to wash, dry, fold, and put away
  – Washer takes 30 minutes
  – Dryer takes 30 minutes
  – “Folder” takes 30 minutes
  – “Stasher” takes 30 minutes to put clothes into drawers

Sequential Laundry
[Figure: task-order timeline from 6 PM to 2 AM; loads M, C, J, P run one after another, each as four 30-minute steps]
• Sequential laundry takes 8 hours for 4 loads

Pipelined Laundry
[Figure: task-order timeline starting at 6 PM; loads M, C, J, P overlap across seven 30-minute slots]
• Pipelined laundry takes 3.5 hours for 4 loads!

General Definitions
• Latency: time to completely execute a certain task
  – E.g., time to read a sector from disk is disk access time or disk latency
• Throughput: amount of work that can be done over a period of time

Pipelining Lessons
[Figure: pipelined laundry timeline for loads M, C, J, P]
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Multiple tasks operate simultaneously using different resources
• Potential speedup = number of pipe stages
• Time to “fill” the pipeline and time to “drain” it reduce speedup: 2.3X vs. 4X in this example (8 hours / 3.5 hours ≈ 2.3)

Pipelining Lessons
• Suppose new Washer takes 20 minutes, new Stasher takes 20 minutes. How much faster is the pipeline?
  – Pipeline rate limited by slowest pipeline stage
  – Unbalanced lengths of pipe stages reduce speedup

A Pipelined MIPS Processor
• Start the next instruction before the current one has completed
  – improves throughput
  – instruction latency is not reduced
  – clock cycle (pipeline stage time) limited by slowest stage
  – for some instructions, some stages are wasted cycles

Cycle:   1    2    3    4    5    6    7    8
lw       IF   ID   EX   MEM  WB
sw            IF   ID   EX   MEM  WB
R-type             IF   ID   EX   MEM  WB

Single Cycle vs. Multiple Cycle vs. Pipelined
[Figure: three timing diagrams. Single cycle: lw fills Cycle 1 and sw wastes part of Cycle 2. Multiple cycle: lw uses cycles 1 to 5, sw uses cycles 6 to 9, R-type starts in cycle 10. Pipeline: lw, sw, and R-type start in consecutive cycles and overlap.]

Single Cycle vs. Pipelined
• Example: Compare the average time between lw instructions of a single cycle implementation to that of a pipelined implementation. Assume the following operation times for the major functional units
  – 200 ps for memory access
  – 200 ps for ALU operation
  – 100 ps for register file read or write
(Done in class; try 3, 100, and n instructions.)

[Figure: Pipelined Control (Simplified)]
[Figure: Pipelined Control]
[Figure: Simplified MIPS Pipelined Datapath]
Can you foresee any problems with these right-to-left flows?
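The single cycle vs. pipelined comparison can be worked through numerically. The sketch below uses only the functional-unit delays and the instruction mix given in the slides; the function names, the dictionary layout, and the idealized pipeline model (one clock per stage, clock set by the slowest stage, no hazards) are assumptions made for illustration. The weighted average is a hypothetical figure: the time per instruction if each class could take exactly its own critical path.

```python
# Functional-unit delays from the slides, in picoseconds.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Stages each instruction class uses, paired with its share of the
# instruction mix in percent (layout is an illustrative assumption).
USES = {
    "R-type": (["IF", "ID", "EX", "WB"], 45),
    "load": (["IF", "ID", "EX", "MEM", "WB"], 25),
    "store": (["IF", "ID", "EX", "MEM"], 10),
    "beq": (["IF", "ID", "EX"], 15),
    "jump": (["IF"], 5),
}


def critical_path_ps(instr):
    """Sum the delays of the units this instruction class passes through."""
    stages, _ = USES[instr]
    return sum(STAGE_PS[s] for s in stages)


def single_cycle_clock_ps():
    """The single cycle clock must fit the slowest instruction."""
    return max(critical_path_ps(i) for i in USES)


def weighted_average_ps():
    """Mix-weighted average critical path (hypothetical variable clock)."""
    return sum(critical_path_ps(i) * pct for i, (_, pct) in USES.items()) / 100


def single_cycle_time_ps(n):
    """Time for n instructions when each takes one full single cycle clock."""
    return n * single_cycle_clock_ps()


def pipelined_time_ps(n):
    """Time for n instructions in an ideal pipeline with no stalls."""
    clock = max(STAGE_PS.values())  # slowest stage sets the clock
    fill = len(STAGE_PS) - 1        # cycles needed to fill the pipeline
    return (n + fill) * clock


if __name__ == "__main__":
    for n in (3, 100):
        single = single_cycle_time_ps(n)
        piped = pipelined_time_ps(n)
        print(n, single, piped, round(single / piped, 2))
```

Under these assumptions the speedup for n instructions is 800n / (200(n + 4)), which approaches 4 as n grows; fill time keeps it below the five-stage ideal only partly, since the unbalanced stages already cap it at 800 / 200.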
Pipeline registers
• Need registers between stages
  – To hold information produced in previous cycle
[Figures: datapath snapshots for IF, ID, EX for Load, MEM for Load, WB for Load]
• There is a BUG here: wrong register number
[Figure: Corrected Datapath for Load]

Pipelined Control
• Control signals derived from instruction
  – As in single-cycle implementation

Hazards
• Situations that prevent starting the next instruction in the next cycle
  – Structure hazards
  – Data hazards
  – Control hazards

Structure Hazards
• Instruction cannot execute in proper clock cycle because hardware cannot support the combination of instructions that are set to execute in the clock cycle
• In MIPS pipeline with a single memory
  – Load/store requires data access
  – Instruction fetch would have to stall for that cycle
    • Would cause a pipeline “bubble”
• Hence, pipelined datapaths require separate instruction/data memories
  – Or separate instruction/data caches

Data Hazards
• Instruction cannot execute in proper clock cycle because data that is needed is not yet available
    add $s0, $t0, $t1
    sub $t2, $s0, $t3
[Figure: Simplified MIPS Pipelined Datapath]

Forwarding (aka Bypassing)
• Use result when it is computed
  – Don’t wait for it to be stored in a register
  – Requires extra connections in the datapath

Load-Use Data Hazard
• Can’t always avoid stalls by forwarding
  – If value not computed when needed
  – Can’t forward backward in time!
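The load-use rule can be checked mechanically. The sketch below counts one stall whenever an instruction reads a register that the immediately preceding lw loads, which is the case forwarding cannot cover. The parser, the function names, and the simplified model (five stages, full forwarding, every other dependence resolved without stalling) are assumptions for illustration only. The two instruction sequences are the orderings from the Code Scheduling slide in these notes.

```python
import re


def parse(line):
    """Split one MIPS line into (opcode, destination, source registers)."""
    op, rest = line.split(None, 1)
    regs = re.findall(r"\$\w+", rest)
    if op == "sw":
        # sw reads both of its registers and writes none.
        return op, None, regs
    return op, regs[0], regs[1:]


def load_use_stalls(program):
    """Count instructions that use the result of the lw right before them."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = parse(prev)
        _, _, cur_sources = parse(cur)
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1
    return stalls


def total_cycles(program, stages=5):
    """Cycles to finish: one per instruction, plus pipeline fill, plus stalls."""
    return len(program) + (stages - 1) + load_use_stalls(program)


ORIGINAL = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "add $t3, $t1, $t2",
    "sw $t3, 12($t0)", "lw $t4, 8($t0)", "add $t5, $t1, $t4",
    "sw $t5, 16($t0)",
]
REORDERED = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "lw $t4, 8($t0)",
    "add $t3, $t1, $t2", "sw $t3, 12($t0)", "add $t5, $t1, $t4",
    "sw $t5, 16($t0)",
]

if __name__ == "__main__":
    print(load_use_stalls(ORIGINAL), total_cycles(ORIGINAL))
    print(load_use_stalls(REORDERED), total_cycles(REORDERED))
```

The add/sub pair from the Data Hazards slide produces no stall in this model, because the add result can be forwarded from the ALU output; only a load immediately followed by its consumer forces a bubble.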
Code Scheduling to Avoid Stalls
• Reorder code to avoid use of load result in the next instruction
• C code: A = B + E; C = B + F;

Original order (two stalls):
    lw   $t1, 0($t0)
    lw   $t2, 4($t0)
    add  $t3, $t1, $t2    # stall: $t2 is loaded by the previous instruction
    sw   $t3, 12($t0)
    lw   $t4, 8($t0)
    add  $t5, $t1, $t4    # stall: $t4 is loaded by the previous instruction
    sw   $t5, 16($t0)

Reordered (no stalls):
    lw   $t1, 0($t0)
    lw   $t2, 4($t0)
    lw   $t4, 8($t0)
    add  $t3, $t1, $t2
    sw   $t3, 12($t0)
    add  $t5, $t1, $t4
    sw   $t5, 16($t0)

Control Hazards
• Instruction cannot execute in proper clock cycle because the instruction that was fetched is not the one that is needed
  – Branch determines flow of control
  – Fetching next instruction depends on branch outcome
  – Pipeline can’t always fetch correct instruction
• In MIPS pipeline
  – Need to compare registers and compute target early in the pipeline
  – Add hardware to do it in ID stage

Stall on Branch
• Wait until branch outcome determined before fetching next instruction

Branch Prediction
• Correct branch prediction is very important and can produce substantial performance improvements.
  – static prediction
  – dynamic prediction
• To take full advantage of branch prediction, the predicted instructions can be not only fetched but also started executing. This is known as speculative execution

MIPS with Predict Not Taken
[Figure: two pipeline diagrams; panels: Prediction correct, Prediction incorrect]

More Realistic Branch Prediction
• Static branch prediction
  – Based on typical branch behavior
  – Example: loop and if-statement branches
    • Predict backward branches taken
    • Predict forward branches not taken
• Dynamic branch prediction
  – Hardware measures actual branch behavior
    • e.g., record recent history of each branch
  – Assume future behavior will continue the trend
    • When wrong, stall while re-fetching, and update history

Branches
• Branch instructions can dramatically affect pipeline performance. Control operations are very frequent in current programs.
• 20% to 35% of the instructions executed are branches (conditional and unconditional).
• 65% of the branches are actually taken.
• Conditional branches are much more frequent than unconditional ones (more than twice as frequent). More than 50% of conditional branches are taken.

Static Branch Prediction
• Static prediction techniques do not take execution history into consideration.
• Predict never taken (Motorola 68020): assumes that the branch is not taken.
• Predict always taken: assumes that the branch is taken.

Dynamic Branch Prediction
• Improve the accuracy of prediction by recording the history of conditional branches.
• One-bit prediction scheme
  – records whether the last execution resulted in a taken branch or not. The system predicts the same behavior as last time.
• Two-bit prediction scheme
  – with a two-bit scheme, predictions can be made depending on the last two instances of execution.
[Figure: One-Bit Prediction Scheme]
[Figure: Two-Bit Prediction Scheme]

Branch History Table
• History information can be used not only to predict the outcome of a conditional branch but also to avoid recalculating the target address. Together with the bits used for prediction, the target address can be stored for later use in a branch history table.
• Using dynamic branch prediction (D.B.P.) with history tables, up to 90% of predictions can be correct.
• Pentium and PowerPC 620 use speculative execution with D.B.P. based on a branch history table.

Branch History Table ...
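To see why the second history bit helps, the sketch below replays one branch's outcomes through a one-bit predictor and a two-bit predictor and counts mispredictions. The state diagrams from the slides are not reproduced in the text, so the two-bit version here assumes the common saturating-counter design (states 0 to 3, predict taken when the counter is 2 or 3); the function names, the initial states, and the loop pattern are also assumptions chosen for illustration.

```python
def one_bit_mispredictions(outcomes, last_taken=False):
    """Predict that the branch repeats whatever it did last time."""
    misses = 0
    for taken in outcomes:
        if last_taken != taken:
            misses += 1
        last_taken = taken  # remember only the most recent outcome
    return misses


def two_bit_mispredictions(outcomes, counter=0):
    """Two-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            misses += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


# A loop-closing branch: taken three times, then falls through; the loop
# itself is entered five times (hypothetical workload).
LOOP_BRANCH = [True, True, True, False] * 5

if __name__ == "__main__":
    print(one_bit_mispredictions(LOOP_BRANCH))
    print(two_bit_mispredictions(LOOP_BRANCH))
```

With this pattern the one-bit scheme misses twice per pass through the loop (once at the exit and again on re-entry), while the two-bit counter, once warmed up, misses only at the exit because a single not-taken outcome does not flip its prediction.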

This note was uploaded on 11/26/2009 for the course MATH AND C CSCI365 taught by Professor Laurencetianruoyang during the Spring '09 term at St. Francis Xavier, Antigonish.
