y holds:

    Instructions / 2 < MemoryAccesses + NonMMXInstructions / 2

For memory-bound MMX™ code, Intel recommends merging loads whenever the same memory address is used more than once, to reduce memory accesses. For example, the following code sequence can be sped up by using a MOVQ instruction in place of the opcode reg, mem forms of the MMX™ instructions:
    OPCODE MM0, [address A]
    OPCODE MM1, [address A]

    ; optimized by use of a MOVQ instruction and opcode reg, reg forms
    ; of the MMX(TM) instructions
    MOVQ   MM2, [address A]
    OPCODE MM0, MM2
    OPCODE MM1, MM2

Another alternative is to incorporate the prefetch instruction introduced in the Pentium® III processor. Prefetching the data preloads the cache prior to actually needing the data. Proper use of prefetch can improve performance if the application is not memory bandwidth bound or the data does not already fit into cache. For more information on proper usage of the prefetch instruction, see the Intel Architecture Optimization Manual, order number 245127-001.

For MMX™ code that is not memory-bound, load merging is recommended only if the same memory address is used more than twice. Where load merging is not possible, usage of the opcode reg, mem instructions is recommended to minimize instruction count and code size. For example, the following code sequence can be shortened by removing the MOVQ instruction and using an opcode reg, mem form of the MMX™ instruction:
    MOVQ   mm0, [address A]
    OPCODE mm1, mm0

    ; optimized by removing the MOVQ instruction and using an
    ; opcode reg, mem form of the MMX(TM) instructions
    OPCODE mm1, [address A]

In many cases, a MOVQ reg, reg and opcode reg, mem can be replaced by a MOVQ reg, mem and the opcode reg, reg. This should be done where possible, since it saves one micro-op on the Pentium® II and Pentium® III processors. The following example is one where the opcode is a symmetric operation:
    MOVQ   mm1, mm0           ; 1 micro-op
    OPCODE mm1, [address A]   ; 2 micro-ops

One clock can be saved by rewriting the code as follows:
    MOVQ   mm1, [address A]   ; 1 micro-op
    OPCODE mm1, mm0           ; 1 micro-op

14.6.2. Partial Memory Accesses With MMX™ Instructions

The MMX™ registers allow large quantities of data to be moved without stalling the processor. Instead of loading single array values that are 8, 16, or 32 bits long, the values can be loaded in a single quadword, with the structure or array pointer being incremented accordingly.

Any data that will be manipulated by MMX™ instructions should be loaded using either:

• The MMX™ instruction that loads a 64-bit operand (for example, MOVQ MM0, m64), or
• The register-memory form of any MMX™ instruction that operates on a quadword memory operand (for example, PMADDWD MM0, m64).

All data in MMX™ registers should be stored using the MMX™ instruction that stores a 64-bit operand (for example, MOVQ m64, MM0).

The goal of these recommendations is twofold. First, the loading and storing of data in MMX™ registers is more efficient using the larger quadword data block sizes. Second, using quadword data block sizes helps to avoid the mixing of 8-, 16-, or 32-bit load and store operations with 64-bit MMX™ load and store operations on the same data. This, in turn, prevents situations in which small loads follow large stores to the same area of memory, or large loads follow small stores to the same area of memory. The Pentium® II and Pentium® III processors will stall in these situations.

Consider the following examples. The first example illustrates the effects of a large load after a series of small stores to the same area of memory (beginning at memory address mem). The large load will stall the processor:
    MOV   mem, eax       ; store dword to address "mem"
    MOV   mem + 4, ebx   ; store dword to address "mem + 4"
    :
    :
    MOVQ  mm0, mem       ; load qword at address "mem", stalls

The MOVQ instruction in this example must wait for the stores to write memory before it can access all the data it requires. This stall can also occur with other data types (for example, when bytes or words are stored and then words or doublewords are read from the same area of memory). By changing the code sequence as follows, the processor can access the data without delay:
    MOVD  mm1, ebx       ; build data into a qword first
    MOVD  mm2, eax       ; before storing it to memory
    PSLLQ mm1, 32
    POR   mm1, mm2
    MOVQ  mem, mm1       ; store SIMD variable to "mem" as a qword
    :
    :
    MOVQ  mm0, mem       ; load qword SIMD variable "mem", no stall

The second example illustrates the effect of a series of small loads after a large store to the same area of memory (beginning at memory address mem). Here, the small loads will stall the processor:
    MOVQ  mem, mm0       ; store qword to address "mem"
    :
    :
    MOV   bx, mem + 2    ; load word at address "mem + 2" stalls
    MOV   cx, mem + 4    ; load word at address "mem + 4" stalls

Th...
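
The qword-building sequence from the first example (MOVD, MOVD, PSLLQ, POR, then a single MOVQ store) can be sketched in portable C. This is an illustration only, not code from the manual: the function names pack_dwords, store_pair_as_qword, and load_qword are hypothetical, plain 64-bit integers stand in for the MMX™ registers, and whether a compiler turns each memcpy into a single 64-bit store or load is an assumption that depends on the compiler and target.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: combine two dwords into one qword in a register,
   mirroring MOVD / MOVD / PSLLQ 32 / POR from the text. */
uint64_t pack_dwords(uint32_t lo, uint32_t hi)
{
    uint64_t q = hi;      /* MOVD  mm1, ebx  (value for the high half) */
    q <<= 32;             /* PSLLQ mm1, 32   (move it into bits 63..32) */
    q |= lo;              /* MOVD  mm2, eax ; POR mm1, mm2 */
    return q;
}

/* Hypothetical helper: write both dwords with one 64-bit store, so a
   later 64-bit load reads data produced by a store of the same size. */
void store_pair_as_qword(void *mem, uint32_t lo, uint32_t hi)
{
    uint64_t q = pack_dwords(lo, hi);
    assert(mem != NULL);
    memcpy(mem, &q, sizeof q);   /* corresponds to MOVQ mem, mm1 */
}

/* Hypothetical helper: the matching 64-bit load. */
uint64_t load_qword(const void *mem)
{
    uint64_t q;
    assert(mem != NULL);
    memcpy(&q, mem, sizeof q);   /* corresponds to MOVQ mm0, mem */
    return q;
}
```

The point of the sketch is the ordering: the two 32-bit values are merged in a register first, and memory sees only one 64-bit write followed by one 64-bit read, rather than two small stores followed by a large load.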