Intel Software Developer's Manual



...e word loads must wait for the MOVQ instruction to write to memory before they can access the data they require. This stall can also occur with other data types (for example, when doublewords or words are stored and then words or bytes are read from the same area of memory). Changing the code sequence as follows allows the processor to access the data without a stall:

    MOVQ  mem, mm0      ; store qword to address "mem"
    :
    :
    MOVQ  mm1, mem      ; load qword at address "mem"
    MOVD  eax, mm1      ; transfer "mem + 2" to ax from
                        ; MMX(TM) register, not memory
    PSRLQ mm1, 32
    SHR   eax, 16
    MOVD  ebx, mm1      ; transfer "mem + 4" to bx from
                        ; MMX register, not memory
    AND   ebx, 0ffffh

14-26 CODE OPTIMIZATION

These transformations, in general, increase the number of instructions required to perform the desired operation. For the Pentium® II and Pentium® III processors, the performance penalty due to the increased number of instructions is more than offset by the number of clocks saved. For the Pentium® processor with MMX™ technology, however, the increased number of instructions can negatively impact performance. For this reason, careful and efficient coding of these transformations is necessary to minimize any potential negative impact on Pentium® processor performance.

14.6.3. Write Allocation Effects

P6 family processors have a “write allocate by read-for-ownership” cache, whereas the Pentium® processor has a “no-write-allocate; write through on write miss” cache. On P6 family processors, when a write occurs and the write misses the cache, the entire 32-byte cache line is fetched. On the Pentium® processor, when the same write miss occurs, the write is simply sent out to memory.

Write allocate is generally advantageous, since sequential stores are merged into burst writes, and the data remains in the cache for use by later loads. This is why P6 family processors adopted this write strategy, and why some Pentium® processor system designs implement it for the L2 cache.
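The difference between the two write-miss policies described above can be illustrated with a small model. The following C sketch is not taken from the manual: it assumes a hypothetical direct-mapped cache of 512 lines (the real caches are set-associative), issues only one-byte stores, and counts bus events. The names simulate_stores, struct traffic, and NUM_LINES are invented for illustration.

```c
#include <assert.h>
#include <string.h>

#define LINE_BYTES 32u   /* cache line size given in the text */
#define NUM_LINES  512u  /* hypothetical: 512 lines = 16 KB, direct-mapped */

/* Bus traffic counters for one simulated run. */
struct traffic {
    unsigned long line_fetches;  /* lines read in because of write misses */
    unsigned long writebacks;    /* dirty lines evicted back to memory */
    unsigned long direct_writes; /* stores sent straight to memory */
};

/*
 * Simulate n one-byte stores to addresses 0, stride, 2*stride, ...
 * write_allocate != 0 models the P6-style policy; 0 models the
 * Pentium-style "no-write-allocate; write through on write miss" policy.
 */
static struct traffic simulate_stores(unsigned long n, unsigned long stride,
                                      int write_allocate)
{
    static unsigned long tag[NUM_LINES];
    static unsigned char valid[NUM_LINES];
    static unsigned char dirty[NUM_LINES];
    struct traffic t = {0, 0, 0};
    unsigned long k;

    assert(stride > 0);
    memset(valid, 0, sizeof valid);  /* start each run with an empty cache */
    memset(dirty, 0, sizeof dirty);

    for (k = 0; k < n; k++) {
        unsigned long line = (k * stride) / LINE_BYTES;
        unsigned long slot = line % NUM_LINES;

        if (valid[slot] && tag[slot] == line) {
            dirty[slot] = 1;          /* write hit: no bus traffic */
            continue;
        }
        if (!write_allocate) {
            t.direct_writes++;        /* miss: the write goes out to memory */
            continue;
        }
        if (valid[slot] && dirty[slot])
            t.writebacks++;           /* evict the dirty line first */
        t.line_fetches++;             /* read-for-ownership of the new line */
        tag[slot] = line;
        valid[slot] = 1;
        dirty[slot] = 1;
    }
    return t;
}
```

Under these assumptions, simulate_stores(1000, 64, 1) reports 1000 line fetches and 744 writebacks, while simulate_stores(1000, 64, 0) reports 1000 direct writes and no fetches. With a 1-byte stride, the write-allocate run needs only 32 line fetches, which is the sequential case where write allocate pays off.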
Write allocate can be a disadvantage in code where:

• Just one piece of a cache line is written.
• The entire cache line is not read.
• Strides are larger than the 32-byte cache line.
• Writes are made to a large number of addresses (greater than 8000).

When a large number of writes occur within an application, and both the stride is longer than the 32-byte cache line and the array is large, every store on a P6 family processor will cause an entire cache line to be fetched. In addition, this fetch will probably replace one (sometimes two) dirty cache line(s). The result is that every store causes an additional cache line fetch and slows down the execution of the program. When many writes occur in a program, the performance decrease can be significant.

The following Sieve of Eratosthenes example program demonstrates these cache effects. In this example, a large array is stepped through in increasing strides, writing zero to a single element of the array at each step.

NOTE
This is a very simplistic example used only to demonstrate cache effects. Many other optimizations are possible in this code.

    boolean array[max];
    for(i=2;i<max;i++) {
        array[i] = 1;
    }
    for(i=2;i<max;i++) {
        if( array[i] ) {
            /* start at the first multiple of i beyond i itself */
            for(j=2*i;j<max;j+=i) {
                array[j] = 0; /* here we assign memory to 0, causing
                                 the cache line fetch within the j loop */
            }
        }
    }

Two optimizations are available for this specific example:

• Optimization 1: In this example, “boolean” is a “char” array. It may well be better to make the “boolean” array into an array of bits, thereby reducing the size of the array, which in turn reduces the number of cache line fetches. The array is packed so that read-modify-writes are done (since the cache protocol makes every read into a read-modify-write). Unfortunately, in this example, the vast majority of strides are greater than 256 bits (one cache line of bits), so the performance increase is not significant.
• Optimization 2: Another optimization is to check whether the value is already zero before writing (as shown in the following example), thereby reducing the number of writes to memory (dirty cache lines).

    boolean array[max];
    for(i=2;i<max;i++) {
        array[i] = 1;
    }
    for(i=2;i<max;i++) {
        if( array[i] ) {
            /* start at the first multiple of i beyond i itself */
            for(j=2*i;j<max;j+=i) {
                if( array[j] != 0 ) { /* check to see if value is already 0 */
                    array[j] = 0;
                }
            }
        }
    }

External bus activity is reduced by half because, most of the time in the Sieve program, the data is already zero. By checking first, you need only 1 burst bus cycle for the read, and you save the burst bus cycle for every line you do not write. The actual write back of the modified line is no longer needed, therefore saving the extra cycles.

NOTE
This operation benefits the P6 family processors, but it may not enhance the performance of Pentium® processors. As such, it should not be considered generic.

14.7. ADDRESSING MODES AND REGISTER USAGE

On the Pentium® processor, when a register is used as the base component, an additional clock is used if that register is the destinat...

This note was uploaded on 06/07/2013 for the course ECE 1234 taught by Professor Kwhon during the Spring '10 term at Berkeley.
