L12_Pipeline

L12_Pipeline - CS324: Computer Architecture...


CS324: Computer Architecture
Pipelining

Example: Pipelining
- Only 1 person uses the laundry room at a time
- Ann, Brian, Kat, Dave each have 1 load of clothes to wash, dry, fold, & stash
- Washer takes 30 minutes
- Dryer takes 30 minutes
- “Folder” takes 30 minutes
- “Stasher” takes 30 minutes to put clothes into drawers

Sequential Laundry
[Figure: timeline from 6 PM to 2 AM; loads A, B, C, D in task order, each using four consecutive 30-minute slots, one load after another]
- Sequential laundry takes 8 hours for 4 loads
- Pipelining is natural! Can you suggest a better strategy?

Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM to 2 AM; loads A, B, C, D overlapped across seven 30-minute slots]
- Pipelined laundry takes 3.5 hours for 4 loads!

General Definitions
- Latency: time to completely execute a certain task
  – for example, time to read a sector from disk is disk access time or disk latency
- Throughput: amount of work that can be done over a period of time

Pipelining Lessons
[Figure: pipelined laundry timeline, 6 PM to 9 PM, loads A, B, C, D over seven 30-minute slots]
- Multiple tasks operating simultaneously using different resources
- Pipelining doesn’t help latency of a single task; it helps throughput of the entire workload

Pipelining Lessons
[Figure: same pipelined laundry timeline]
- Potential speedup = Number of pipe stages
- Time to “fill” the pipeline and time to “drain” it reduce speedup
  – (2.3X vs. 4X in this example)

Pipelining Lessons
[Figure: same pipelined laundry timeline]
- Suppose new Washer takes 20 minutes, new Stasher takes 20 minutes. How much faster is pipeline?
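The laundry numbers above (8 hours sequential, 3.5 hours pipelined, 2.3X speedup) and the 20-minute washer/stasher question can be checked with a small sketch. This is not from the slides: the function names are hypothetical, and it assumes a load may wait between stages (for example in a basket) until the next stage is free.

```python
def sequential_minutes(stage_minutes, loads):
    """One load at a time: each load runs through every stage before the next starts."""
    return loads * sum(stage_minutes)


def pipelined_finish_times(stage_minutes, loads):
    """Finish time (in minutes) of each load when stages overlap.

    Assumption (not in the slides): a load can wait between stages,
    so a stage starts as soon as both the load and the stage are ready.
    """
    stage_free = [0] * len(stage_minutes)  # time at which each stage is next free
    finish = []
    for _ in range(loads):
        t = 0  # time at which this load is ready for its next stage
        for s, minutes in enumerate(stage_minutes):
            start = max(t, stage_free[s])
            t = start + minutes
            stage_free[s] = t
        finish.append(t)
    return finish


if __name__ == "__main__":
    balanced = [30, 30, 30, 30]    # wash, dry, fold, stash
    unbalanced = [20, 30, 30, 20]  # new washer and stasher

    seq = sequential_minutes(balanced, 4)
    pipe = pipelined_finish_times(balanced, 4)
    print(seq / 60, pipe[-1] / 60)    # hours: sequential vs. pipelined
    print(round(seq / pipe[-1], 1))   # speedup, below the 4X potential

    fast = pipelined_finish_times(unbalanced, 4)
    print(fast)                       # finish time of each load
    print([b - a for a, b in zip(fast, fast[1:])])  # gap between finished loads
```

Under these assumptions the faster washer and stasher shorten the fill and drain time, but finished loads still come out 30 minutes apart, because the dryer and folder still take 30 minutes each.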
- Pipeline rate limited by slowest pipeline stage
- Unbalanced lengths of pipe stages reduce speedup
  – Stall for Dependences

MIPS Execution Steps

        Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
Load    Ifetch    Reg/Dec   Exec      Mem       Wr

- Ifetch: Instruction Fetch
  – Fetch instruction from Instruction Memory, Increment PC
- Reg/Dec: Register Fetch and Instruction Decode
- Exec:
  – Mem-ref: Calculate Address
  – Arith-log: Perform Operation
- Mem:
  – Load: Read Data from Memory
  – Store: Write Data to Memory
- Wr: Write Data Back to Register

Conventional Pipelined Execution Representation

Time ->
IFtch Dcd   Exec  Mem   WB
      IFtch Dcd   Exec  Mem   WB
            IFtch Dcd   Exec  Mem   WB
                  IFtch Dcd   Exec  Mem   WB
                        IFtch Dcd   Exec  Mem   WB
                              IFtch Dcd   Exec  Mem   WB

- Every instruction must take the same number of steps, also called pipeline “stages”, so some will go idle sometimes

Review: Datapath for MIPS
[Figure: MIPS datapath with PC, +4 adder, instruction memory, registers (rs, rt, rd), imm, ALU, and data memory, divided into 1. Instruction Fetch, 2. Decode/Register Read, 3. Execute, 4. Memory, 5. Write Back]
- Use datapath figure to represent pipeline

  IFtch   Dcd   Exec   Mem   WB
  I$      Reg   ALU    D$    Reg

Graphical Pipeline Representation
(In Reg, right half highlight read, left half write)
[Figure: time in clock cycles across, instruction order down; Load, Add, Store, Sub, Or each drawn as I$, Reg, ALU, D$, Reg, starting one cycle apart]

Graphically Representing Pipelines
- Can help with answering questions like:
  – how many cycles does it take to execute this code?
  – what is the ALU doing during cycle 4?
  – use this representation to help understand datapaths

Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: clock timelines for three implementations. Single Cycle Implementation: Load, then Store, with part of the Store cycle marked “Waste” (Cycle 1, Cycle 2). Multiple Cycle Implementation: Load (Ifetch, Reg, Exec, Mem, Wr), Store (Ifetch, Reg, Exec, Mem), R-type (Ifetch, ...) over Cycle 1 to Cycle 10. Pipeline Implementation: Load, Store, R-type overlapped. Annotations: “Start R at 9”, “7 cycles”]

Example
- Suppose 2 ns for memory access, 2 ns for ALU operation, and 1 ns for register file read or write; compute instr rate
- Nonpipelined Execution:
  – lw: IF + Read Reg + ALU + Memory + Write Reg = 2 + 1 + 2 + 2 + 1 = 8 ns
  – add: IF + Read Reg + ALU + Write Reg = 2 + 1 + 2 + 1 = 6 ns
  (recall 8 ns for single-cycle processor)
- Pipelined Execution:
  – Max(IF, Read Reg, ALU, Memory, Write Reg) = 2 ns
  – (once pipeline is full)

Why Pipeline?
- Suppose we execute 100 instructions
- Single Cycle Machine
  – 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
- Multicycle Machine
  – 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
- Ideal pipelined machine
  – 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
[Figure: five overlapped instructions, each passing through IFetch, Dcd, Exec, Mem, WB, starting one cycle apart]

Pipeline Hazard: Matching socks in later load
[Figure: laundry timeline from 6 PM to 2 AM with loads A through F in task order and a bubble in the pipeline]
- A depends on D; stall since folder tied up

Problems for Pipelining CPUs
- Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
  – Structural hazards: HW cannot support some combination of instructions (single person to fold and put clothes away)
  – Control hazards: Pipelining of branches causes later instruction fetches to wait for the result of the branch
  – Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)
- These might result in pipeline stalls or “bubbles” in the pipeline.

Structural Hazard #1: Single Memory
[Figure: time in clock cycles across, instruction order down; Load, Instr 1, Instr 2, Instr 3, Instr 4 each drawn as I$, Reg, ALU, D$, Reg, starting one cycle apart]
- Read same memory twice in same clock cycle

Structural Hazard #1: Single Memory
- Solution:
  – infeasible and inefficient to create second memory
  – so simulate this by having two Level 1 Caches (a temporary smaller [of usually most recently used] copy of memory)
  – have both an L1 Instruction Cache and an L1 Data Cache
  – need more complex hardware to control when both caches miss

Structural Hazard #2: Registers
[Figure: time in clock cycles across, instruction order down; sw, Instr 1, Instr 2, Instr 3, Instr 4 each drawn as I$, Reg, ALU, D$, Reg, starting one cycle apart]
- Can we read and write to registers simultaneously?

Structural Hazard #2: Registers
- Two different solutions have been used:
  1) RegFile access is VERY fast: takes less than half the time of ALU stage
     – Write to Registers during first half of each clock cycle
     – Read from Registers during second half of each clock cycle
  2) Build RegFile with independent read and write ports
- Result: can perform Read and Write during same clock cycle

Things to Remember
- Optimal Pipeline
  – Each stage is executing part of an instruction each clock cycle.
  – One instruction finishes during each clock cycle.
  – On average, execute far more quickly.
- What makes this work?
  – Similarities between instructions allow us to use same stages for all instructions (generally).
  – Each stage takes about the same amount of time as all others: little wasted time.
...
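To put numbers on “execute far more quickly”, the arithmetic from the Example and Why Pipeline? slides can be reproduced with a short sketch. The function and variable names are hypothetical (not from the slides); the cycle times, CPI values, and stage latencies are the ones quoted there.

```python
# Stage latencies from the Example slide, in ns.
STAGE_NS = {"IF": 2, "Read Reg": 1, "ALU": 2, "Memory": 2, "Write Reg": 1}


def nonpipelined_ns(stages_used):
    """Latency of one instruction that uses the given stages back to back."""
    return sum(STAGE_NS[s] for s in stages_used)


def pipelined_cycle_ns():
    """Once the pipeline is full, the clock is set by the slowest stage."""
    return max(STAGE_NS.values())


def single_cycle_ns(cycle_ns, instructions):
    """Single cycle machine: CPI is 1, but the cycle is long."""
    return cycle_ns * 1 * instructions


def multicycle_ns(cycle_ns, cpi, instructions):
    """Multicycle machine: short cycle, several cycles per instruction."""
    return cycle_ns * cpi * instructions


def ideal_pipeline_ns(cycle_ns, instructions, drain_cycles):
    """Ideal pipeline: one instruction per cycle plus cycles to drain."""
    return cycle_ns * (1 * instructions + drain_cycles)


if __name__ == "__main__":
    lw = nonpipelined_ns(["IF", "Read Reg", "ALU", "Memory", "Write Reg"])
    add = nonpipelined_ns(["IF", "Read Reg", "ALU", "Write Reg"])
    print(lw, add, pipelined_cycle_ns())      # ns per lw, per add, per pipelined cycle

    print(single_cycle_ns(45, 100))           # 4500
    print(round(multicycle_ns(10, 4.6, 100)))
    print(ideal_pipeline_ns(10, 100, 4))      # 1040
```

With the slide's numbers the ideal pipeline is a little over 4 times faster than the single cycle machine for 100 instructions; hazards and stalls would reduce that in practice.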

This note was uploaded on 02/15/2010 for the course CS 324 taught by Professor Lballesteros during the Fall '08 term at Mt. Holyoke.
