Lec09 - 1 Processor: Multicycle Implementation Dr. Tao Xie...

1 Processor: Multicycle Implementation
Dr. Tao Xie
Fall, 2008
These slides are adapted from notes by Dr. David Patterson (UCB)

2 The slowest instruction...
If all instructions must complete within one clock cycle, then the cycle time has to be large enough to accommodate the slowest instruction. Assuming the delays shown here:

[Figure: single-cycle datapath annotated with component delays. Instruction memory 2 ns, register file 1 ns, ALU 2 ns, data memory 2 ns; multiplexers, sign extend, and the remaining components 0 ns.]

Instruction   Time (ns)
Arithmetic    2+1+2+1 = 6
Loads         2+1+2+2+1 = 8
Stores        2+1+2+2 = 7
Branches      2+1+2 = 5

3 How bad is this?
With these same component delays, a sw instruction would need 7 ns, and beq would need just 5 ns.
Let's consider the gcc benchmark.

Instruction   Frequency
Arithmetic    48%
Loads         22%
Stores        11%
Branches      19%

With a single-cycle datapath, each instruction would require 8 ns, the time of the slowest instruction (a load). But if we could execute each instruction as fast as possible, the average time per instruction for gcc would be:

(48% x 6 ns) + (22% x 8 ns) + (11% x 7 ns) + (19% x 5 ns) = 6.36 ns

The single-cycle datapath is about 1.26 times slower (8 ns / 6.36 ns)!

4 It gets worse...
- We've made very optimistic assumptions about memory latency:
  - Main memory accesses on modern machines take >50 ns.
  - For comparison, an ALU on the Pentium 4 takes ~0.3 ns.
- Our worst-case cycle (loads/stores) includes 2 memory accesses.
  - A modern single-cycle implementation would be stuck at <10 MHz, since two 50 ns accesses already require a cycle of at least 100 ns.
  - Caches will improve common-case access time, not worst case.

5 It isn't particularly hardware efficient, either
A single-cycle datapath also uses extra hardware: one ALU is not enough, since we must do up to three calculations in one clock cycle for...
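
The gcc arithmetic on slide 3 and the clock-rate bound on slide 4 can be checked with a short script. This is only a sketch: the latencies and instruction mix are copied from the slides, the 50 ns memory access time is the slides' lower bound treated as an exact value, and the function names are made up for illustration.

```python
# Per-class instruction latencies (ns) and the gcc instruction mix,
# both copied from the slides.
latency_ns = {"arithmetic": 6, "loads": 8, "stores": 7, "branches": 5}
gcc_mix = {"arithmetic": 0.48, "loads": 0.22, "stores": 0.11, "branches": 0.19}


def single_cycle_time(latency):
    """A single-cycle design must clock at the slowest instruction's latency."""
    return max(latency.values())


def ideal_average_time(latency, mix):
    """Weighted average latency if every instruction took only the time it needs."""
    return sum(mix[name] * latency[name] for name in latency)


def max_clock_mhz(cycle_ns):
    """Convert a cycle time in nanoseconds to a clock frequency in MHz."""
    return 1000.0 / cycle_ns


cycle = single_cycle_time(latency_ns)
average = ideal_average_time(latency_ns, gcc_mix)
slowdown = cycle / average

print(f"single-cycle time per instruction: {cycle} ns")
print(f"ideal average time per instruction: {average:.2f} ns")
print(f"slowdown of single-cycle design: {slowdown:.2f}x")

# Slide 4: a load or store makes 2 memory accesses; assume 50 ns each
# (the slides only say ">50 ns", so this gives an upper bound on the clock).
memory_access_ns = 50
print(f"clock upper bound: {max_clock_mhz(2 * memory_access_ns):.0f} MHz")
```

Changing the mix or the latencies shows how sensitive the slowdown is to the share of loads, which are what set the cycle time here.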