viability of the original very simple 3-stage pipeline in a world dominated by PCs with increasingly complex superscalar, superpipelined, high-performance (and very power-hungry) microprocessors.

The second decade of ARM development has seen a careful diversification of the ARM organization in the quest for higher performance levels:

The first step, to a 5-stage pipeline, yields a doubling of performance (all other factors being equal) at the cost of some forwarding logic in the core and either a double-bandwidth memory (as in the ARM8) or separate instruction and data memories (as in the ARM9TDMI and StrongARM).

The next doubling of performance, achieved in the ARM10TDMI, is rather harder-won. The 6-stage pipeline is quite similar to the 5-stage pipeline used before, but the time slots allocated to memory access have been extended to enable the memories to support higher clock rates without burning excessive power. The processor core also incorporates more decoupling: in the prefetch unit, to allow branches to be predicted and removed from the instruction stream, and in the data memory interface, to allow the processor to continue executing when a data access takes some time to resolve (for example, due to a cache miss).

Performance improvement is achieved through a combination of increased clock rate and reduced CPI (the average number of clocks per instruction). An increased clock rate usually requires a deeper pipeline, which tends to worsen the CPI, so remedial measures are required to recover the CPI loss and then improve it further.

To date, all ARM processors have been based on organizations that issue at most one instruction per clock cycle, and always in program order.
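The interplay between clock rate and CPI described above can be made concrete with a small calculation. The following is a minimal sketch that uses the standard relation: execution time = instruction count × CPI / clock rate. The function names and all the numbers (clock rates and CPI values) are illustrative assumptions, not measurements of any particular ARM core.

```python
def execution_time_s(instruction_count, cpi, clock_hz):
    """Time to run a program: instructions * clocks-per-instruction / clock rate."""
    return instruction_count * cpi / clock_hz


def speedup(base_cpi, base_clock_hz, new_cpi, new_clock_hz,
            instruction_count=1_000_000):
    """Ratio of baseline execution time to new execution time (same program)."""
    base = execution_time_s(instruction_count, base_cpi, base_clock_hz)
    new = execution_time_s(instruction_count, new_cpi, new_clock_hz)
    return base / new


# Hypothetical baseline: shallow pipeline, 25 MHz, CPI of 1.9.
BASE_CPI, BASE_CLOCK = 1.9, 25e6

# Case 1 (assumed): doubling the clock with a deeper pipeline,
# but the CPI worsens to 2.4 because nothing compensates for the extra stages.
naive = speedup(BASE_CPI, BASE_CLOCK, new_cpi=2.4, new_clock_hz=50e6)

# Case 2 (assumed): same doubled clock, with remedial measures
# (forwarding, separate memories) that bring the CPI down to 1.5.
tuned = speedup(BASE_CPI, BASE_CLOCK, new_cpi=1.5, new_clock_hz=50e6)

print(f"clock doubled, CPI worsened: {naive:.2f}x")
print(f"clock doubled, CPI improved: {tuned:.2f}x")
```

Under these assumed figures, doubling the clock alone gives well under a 2x gain when the CPI degrades, while recovering and then improving the CPI pushes the gain beyond 2x.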
The ARM10TDMI and the AMULET3 processor (described in Section 14.5 on page 387) handle out-of-order completion in order to keep instructions flowing during a slow data access, and both of these processors also include branch prediction logic to reduce the cost of refilling their pipelines on branch instructions. AMULET3 suppresses the fetching of predicted branch instructions but still executes them; ARM10TDMI fetches branch instructions but suppresses their execution. But by the standards of today's high-end PC and workstation processors, these are still very simple machines. This simplicity has direct ben...
This document was uploaded on 10/30/2011 for the course CSE 378 380 at SUNY Buffalo.
- Spring '09