Unformatted text preview: re can be improved by: Increasing the clock rate. This requires the logic in each pipeline stage to be simplified and, therefore, the number of pipeline stages to be increased. Reducing the CPI (clock cycles per instruction). This requires either that instructions which occupy more than one pipeline slot in an ARM7 are re-implemented to occupy fewer slots, or that pipeline stalls caused by dependencies between instructions are reduced, or a combination of both. Reducing the CPI Again repeating the argument presented earlier, the fundamental problem with reducing the CPI relative to an ARM7 core is related to the von Neumann bottleneck - any stored-program computer with a single instruction and data memory will have its performance limited by the available memory bandwidth. An ARM7 core accesses memory on (almost) every clock cycle either to fetch an instruction or to transfer data. To get a significantly better CPI than ARM7 the memory system must deliver more than one value in each clock cycle either by delivering more than 32 bits per cycle from a single memory or by having separate memories for instruction and data accesses. ARMS 257 Double-ban dwidth memory ARMS retains a unified memory (either in the form of cache or on-chip RAM) but exploits the sequential nature of most memory accesses to achieve double-bandwidth from a single memory. It assumes that the memory it is connected to can deliver one word in a clock cycle and deliver the next sequential word half a cycle later concurrently with starting the next access. Typical memory organizations are quite capable of supplying the extra data with only a little extra hardware cost. Restricting the extra bandwidth to sequential accesses may seem to limit its usefulness, but instruction fetches are highly sequential and ARM's load multiple instructions generate sequential addresses (as do the store multiple instructions, though these do not exploit the double-bandwidth memory on ARMS), so the occurrence of sequential accesses is quite high in typical ARM code. A 64-bit wide memory has the required characteristics, but delaying the arri...
View Full Document
This document was uploaded on 10/30/2011 for the course CSE 378 380 at SUNY Buffalo.
- Spring '09