until the clock cycle after the stalled micro-op continues through the pipe. In general, to avoid stalls, do not read a large register after writing a small register that is contained in the large register.

Special cases of writing and reading corresponding small and large registers have been implemented in the P6 family processors to simplify the blending of code across processor generations. The special cases include the XOR and SUB instructions when using EAX, EBX, ECX, EDX, EBP, ESP, EDI and ESI as shown in the following examples:
    xor   eax, eax
    movb  al, mem8
    add   eax, mem32    ; no partial stall

    xor   eax, eax
    movw  ax, mem16
    add   eax, mem32    ; no partial stall

    sub   ax, ax
    movb  al, mem8
    add   ax, mem16     ; no partial stall

    sub   eax, eax
    movb  al, mem8
    or    ax, mem16     ; no partial stall

    xor   ah, ah
    movb  al, mem8
    sub   ax, mem16     ; no partial stall

In general, when implementing this sequence, always write all zeros to the large register, then write to the lower half of the register.

14.4. ALIGNMENT RULES AND GUIDELINES
The following section gives rules and guidelines for aligning code and data for optimum code execution speed.

14.4.1. Alignment Penalties

The following are common penalties for accesses to misaligned data or code:

• On a Pentium® processor, a misaligned access costs 3 clocks.
• On a P6 family processor, a misaligned access that crosses a cache line boundary costs 6 to 9 clocks.
• On a P6 family processor, unaligned accesses that cause a data cache split stall the processor. A data cache split is a memory access that crosses a 32-byte cache line boundary.

For best performance, make sure that data structures and arrays greater than 32 bytes are 32-byte aligned, and that access patterns to data structures and arrays do not break the alignment rules.

14.4.2. Code Alignment

The P6 family and Pentium® processors have a cache line size of 32 bytes. Since the prefetch buffers fetch on 16-byte boundaries, code alignment has a direct impact on prefetch buffer efficiency. For optimal performance across the Intel Architecture family, it is recommended that:

• A loop entry label should be 16-byte aligned when it is less than 8 bytes away from that boundary.
• A label that follows a conditional branch should not be aligned.
• A label that follows an unconditional branch or function call should be 16-byte aligned when it is less than 8 bytes away from that boundary.

14.4.3. Data Alignment

A misaligned access in the data cache or on the bus costs at least 3 extra clocks on the Pentium® processor. A misaligned access in the data cache that crosses a cache line boundary costs 9 to 12 clocks on the P6 family processors. It is recommended that data be aligned on the following boundaries for optimum code execution on all processors:

• Align 8-bit data on any boundary.
• Align 16-bit data to be contained within an aligned 4-byte word.
• Align 32-bit data on any boundary that is a multiple of 4.
• Align 64-bit data on any boundary that is a multiple of 8.
• Align 80-bit data on a 128-bit boundary (that is, any boundary that is a multiple of 16 bytes).
• Align 128-bit SIMD floating-point data on a 128-bit boundary (that is, any boundary that is a multiple of 16 bytes).

14.4.3.1. ALIGNMENT OF DATA STRUCTURES AND ARRAYS GREATER THAN 32 BYTES

A 32-byte or greater data structure or array should be aligned such that the beginning of each structure or array element is aligned on a 32-byte boundary, and such that each structure or array element does not cross a 32-byte cache line boundary.

14.4.3.2. ALIGNMENT OF DATA IN MEMORY AND ON THE STACK

On the Pentium® processor, accessing 64-bit variables that are not 8-byte aligned will cost an extra 3 clocks. On the P6 family processors, a 64-bit variable that is not 8-byte aligned can straddle a 32-byte cache line boundary, and accessing it then causes a data cache split.

Some commercial compilers do not align double precision variables on 8-byte boundaries. In such cases, the following techniques can be used to force optimum alignment of data:

• Use static variables instead of dynamic (stack) variables.
• Use in-line assembly code that explicitly aligns data.
• In C code, use "malloc" to explicitly allocate variables.

The following sections describe these techniques.

Static Variables

When a compiler allocates stack space for a dynamic variable, it may not align the variable (see Figure 14-1). However, in most cases, when the compiler allocates space in memory for static variables, the variables are aligned.

    static float a;
    float b;
    static float c;

[Figure: the stack holds b; memory holds a and c.]
Figure 14-1. Stack and Memory Layout of Static Variables

Alignment Using Assembly Language

Use in-line assembly code to explicitly align variables. The following example aligns the stack to 64 bits.
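    ; procedure prologue
    push  ebp
    mov   ebp, esp      ; destination first: copy the stack pointer into ebp
    and   ebp, -8       ; clear the low 3 bits so ebp is 8-byte aligned
    sub   esp, 12       ; reserve space for local variables

    ; procedure epilogue
    add   esp, 12       ; release the local variable space
    pop   ebp
    ret

Dynamic Allocation Using MALLOC

When using dyna...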
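As an illustration only, the mask arithmetic behind `and ebp, -8` can be carried over to heap storage in C. The sketch below assumes the common over-allocate-and-round-up approach, which is not necessarily the technique this manual describes. The helper names `align_up`, `crosses_line`, and `alloc_aligned_doubles` are hypothetical and do not come from the manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Round addr up to the next multiple of align.
   align must be a nonzero power of two (for example 8, 16, or 32). */
static uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}

/* Return nonzero if an access of size bytes (size > 0) starting at addr
   touches two different lines of line_size bytes, that is, if it would
   be a cache line split for that line size. */
static int crosses_line(uintptr_t addr, size_t size, uintptr_t line_size)
{
    return (addr / line_size) != ((addr + size - 1) / line_size);
}

/* Hypothetical helper: allocate room for count doubles starting on an
   8-byte boundary. Up to 7 extra bytes are requested so that the start
   can be rounded up. *raw receives the pointer that must later be
   passed to free(); the returned pointer must not be freed directly. */
static double *alloc_aligned_doubles(size_t count, void **raw)
{
    *raw = malloc(count * sizeof(double) + 7);
    if (*raw == NULL)
        return NULL;
    return (double *)align_up((uintptr_t)*raw, 8);
}
```

With a 32-byte line, `crosses_line(28, 8, 32)` reports a split because the 8-byte access covers bytes 28 through 35, while `crosses_line(24, 8, 32)` does not. A caller keeps the `raw` pointer from `alloc_aligned_doubles` and passes it to `free` when the storage is no longer needed.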