This preview shows page 1. Sign up to view the full content.
Unformatted text preview: g each stage in a particular way (see Figure 9.9): The fetch and memory stages are effectively increased from one to one-and-ahalf clock cycles by providing the address for the next cycle early. To achieve this in the memory stage, memory addresses are computed in a separate adder that can produce its result faster than the main ALU (because it implements only a subset of the ALU's functionality). The execute stage uses a combination of improved circuit techniques and restruc turing to reduce its critical path. For example, the multiplier does not feed into the main ALU to resolve its partial sum and product terms; instead it has its own adder in the memory stage (multiplications never access memory, so this stage is free.) The instruction decode stage is the only part of the processor logic that could not be streamlined sufficiently to support the higher clock rate, so here an additional 'Issue' pipeline stage was inserted. The result is a 6-stage pipeline that can operate faster than the 5-stage ARM9TDMI pipeline, but requires its supporting memories to be little faster than the ARM9TDMI's memories. This is of significance since very fast memories tend to be power-hungry. The extra pipeline stage, inserted to allow more time for instruction decode, only incurs pipeline dependency costs when an unpredicted branch is executed. Since the extra stage comes before the register read takes place it introduces no new operand dependencies and requires no new forwarding paths. With the inclusion of a branch prediction mechanism this pipeline will give a very similar CPI to the ARM9TDMI pipeline while supporting the higher clock rate. Figure 9.9 The ARM10TDMI pipeline. ARM10TDMI 265 Reduced CPI The pipeline enhancements described above support a 50% higher clock rate without compromising the CPI. This is a good start, but will not yield the required 100% performance improvement. For this an improved CPI is needed on top of the increased clock rate. Any plan to improve the CPI must start from a consideration of memory bandwidth. The ARM7TDMI uses its single 32-bit memory on (almost) every clock c...
View Full Document
- Spring '09