Lec06b - 1 COMP 4300 Computer Architecture Multicycle...


Slide 1
COMP 4300 Computer Architecture
Multicycle Implementation
Dr. Xiao Qin
Auburn University
http://www.eng.auburn.edu/~xqin
xqin@auburn.edu
Fall, 2010

Slide 2: The slowest instruction...
If all instructions must complete within one clock cycle, then the cycle time has to be large enough to accommodate the slowest instruction. Assuming the delays shown here:

[Figure: single-cycle datapath (instruction memory, register file, ALU, data memory, sign extend, muxes) annotated with component delays: 2 ns for each memory and for the ALU, 1 ns for the register file, 0 ns for the remaining components]

Instruction   Time (ns)
Arithmetic    2+1+2+1 = 6
Loads         2+1+2+2+1 = 8
Stores        2+1+2+2 = 7
Branches      2+1+2 = 5

Slide 3: How bad is this?
With these same component delays, an sw instruction would need 7 ns, and beq would need just 5 ns. Let's consider the gcc benchmark. With a single-cycle datapath, each instruction would require 8 ns. But if we could execute each instruction as fast as possible, the average time per instruction for gcc would be:

(48% x 6 ns) + (22% x 8 ns) + (11% x 7 ns) + (19% x 5 ns) = 6.36 ns

Instruction   Frequency   Time
Arithmetic    48%         6 ns
Loads         22%         8 ns
Stores        11%         7 ns
Branches      19%         5 ns

The single-cycle datapath is about 1.26 times slower (8 ns / 6.36 ns)!

Slide 4: It gets worse...
We've made very optimistic assumptions about memory latency:
- Main memory accesses on modern machines take >50 ns.
- For comparison, an ALU on the Pentium 4 takes ~0.3 ns.
Our worst-case cycle (loads/stores) includes 2 memory accesses:
- A modern single-cycle implementation would be stuck at < 10 MHz.
- Caches will improve common-case access time, not worst-case.

Slide 5: It isn't particularly hardware efficient, either
A single-cycle datapath also uses extra hardware: one ALU is not enough, since we must do up to three calculations in one clock cycle for a beq. This used to be a big deal, but now transistors are cheap. Heat is the more important issue.

[Figure: complete single-cycle datapath with the PC, two adders, shift left 2, instruction memory, register file, sign extend, ALU, and data memory, plus the control signals PCSrc, RegDst, RegWrite, ALUSrc, ALUOp, MemRead, MemWrite, and MemToReg]

(Go to p. 8)

Slide 6: A multistage approach to instruction execution: Key Idea
- Break instruction execution into multiple cycles
- One clock cycle for each major task:
  1. Instruction Fetch (IF)
  2. Instruction Decode and Register Fetch (ID)
  3. Execution, memory address computation, or branch computation (EX)
  4. Memory access / R-type instruction completion (MEM)
  5. Memory read completion (WB)
- Share hardware to simplify the datapath
This would mean that instructions complete as soon as possible, instead of being limited by the slowest instruction.

Slide 7: ...
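The timing arithmetic on slides 2 and 3 can be reproduced with a short script. This is only a sketch: the names `DELAY_NS`, `PATHS`, `GCC_MIX`, and the helper functions are made up for illustration, and the mapping of each delay to a component (2 ns memories and ALU, 1 ns register read and write) is an assumption inferred from the per-instruction sums in the slides.

```python
# Component delays in ns. The assignment of delays to components is an
# assumption inferred from the sums on slide 2 (e.g. loads: 2+1+2+2+1).
DELAY_NS = {
    "instruction_memory": 2,
    "register_read": 1,
    "alu": 2,
    "data_memory": 2,
    "register_write": 1,
}

# Components each instruction class passes through, in order (hypothetical
# names; the order mirrors the sums in the slide's table).
PATHS = {
    "arithmetic": ["instruction_memory", "register_read", "alu", "register_write"],
    "load": ["instruction_memory", "register_read", "alu", "data_memory", "register_write"],
    "store": ["instruction_memory", "register_read", "alu", "data_memory"],
    "branch": ["instruction_memory", "register_read", "alu"],
}

# gcc instruction mix from slide 3, as fractions of executed instructions.
GCC_MIX = {"arithmetic": 0.48, "load": 0.22, "store": 0.11, "branch": 0.19}


def latency_ns(kind):
    """Time one instruction of this class needs: the sum of its path delays."""
    return sum(DELAY_NS[component] for component in PATHS[kind])


def single_cycle_time_ns():
    """A single-cycle design must fit the slowest instruction in one cycle."""
    return max(latency_ns(kind) for kind in PATHS)


def ideal_average_ns(mix):
    """Average time per instruction if each class ran at its own latency."""
    return sum(fraction * latency_ns(kind) for kind, fraction in mix.items())


if __name__ == "__main__":
    cycle = single_cycle_time_ns()
    average = ideal_average_ns(GCC_MIX)
    print(f"single-cycle time: {cycle} ns")
    print(f"ideal average:     {average:.2f} ns")
    print(f"slowdown:          {cycle / average:.2f}x")
```

Changing the entries in `GCC_MIX` or `DELAY_NS` shows how sensitive the slowdown is to the instruction mix and to memory latency, which is the point slide 4 makes about realistic memory delays.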

This note was uploaded on 12/07/2011 for the course COMP 3400 taught by Professor Staff during the Fall '10 term at Auburn University.

