The VFP10 incorporates a 5-stage load/store pipeline and a 7-stage execution pipeline and supports single- and double-precision IEEE 754 floating-point arithmetic (see Section 6.3 on page 158). It is capable of issuing a floating-point multiply-accumulate operation at a rate of one per clock cycle. It exploits the 64-bit data memory interface of the ARM10TDMI to load or store one double-precision value or two single-precision values in a clock cycle. Both the arithmetic and load/store instructions include 'vector' variants that perform the same operation on a set of registers, and since vector operations and vector load/stores can run concurrently, a peak throughput of 800 MFLOPS (at 400 MHz) is achievable.
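The 800 MFLOPS figure follows from the numbers quoted above if a multiply-accumulate is counted as two floating-point operations. That counting convention is an assumption here, not something the text states. The minimal sketch below, with hypothetical function and constant names, reproduces the arithmetic and also derives the peak load/store rate implied by one 64-bit transfer per clock cycle.

```python
# Back-of-envelope check of the VFP10 peak figures.
# All names are illustrative; only the numeric inputs come from the text.

CLOCK_HZ = 400_000_000    # 400 MHz clock, from the text
MACS_PER_CYCLE = 1        # one multiply-accumulate issued per clock, from the text
FLOPS_PER_MAC = 2         # ASSUMPTION: a multiply-accumulate = 1 multiply + 1 add
BUS_WIDTH_BITS = 64       # ARM10TDMI data memory interface width, from the text


def peak_mflops(clock_hz, macs_per_cycle, flops_per_mac):
    """Peak floating-point rate in millions of operations per second."""
    return clock_hz * macs_per_cycle * flops_per_mac / 1e6


def peak_load_store_bytes_per_second(clock_hz, bus_width_bits):
    """Peak transfer rate, assuming one full-width transfer every cycle."""
    return clock_hz * (bus_width_bits // 8)


mflops = peak_mflops(CLOCK_HZ, MACS_PER_CYCLE, FLOPS_PER_MAC)
bytes_per_s = peak_load_store_bytes_per_second(CLOCK_HZ, BUS_WIDTH_BITS)
print(f"peak arithmetic rate: {mflops:.0f} MFLOPS")
print(f"peak load/store rate: {bytes_per_s / 1e9:.1f} GB/s")
```

Both results are upper bounds: they hold only while a multiply-accumulate issues every cycle and the memory interface delivers a full 64-bit value every cycle.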

ARM10200 silicon

A plot of the 0.25 µm ARM10200 silicon is shown in Figure 12.14. This first version of the chip has a fully synthesized VFP10 core and the cache was designed using generic design rules. The 0.18 µm version of the chip will incorporate a VFP10 core with a manually laid-out custom datapath and synthesized control, and the caches will use more process-specific design rules that will significantly reduce their relative size.

Figure 12.14 ARM10200 chip plot.

12.7 Discussion

The ARM CPU cores described in this chapter highlight a number of important aspects of the development of high-performance low-power processor subsystems. The issues relating to the design of the processor core itself were discussed in Chapter 9 and summarized in Section 9.5 on page 266. Here we are concerned with the other components that are intimately connected to the processor core and are critical to its ability to realize its intrinsic performance potential.

Memory bandwidth

The performance of a processor is ultimately limited by the bandwidth of its associated memory system. The ARM CPU core family demonstrates how a cache memory system can be optimized to different performance points:

The ARM7TDMI is designed to waste very few memory cycles. It requires a memory that can supply a word of data in every clock cycle. As it employs a...