ion of the immediately preceding instruction (assuming all instructions are already in the prefetch queue). For example:
    add esi, eax      ; esi is destination register
    mov eax, [esi]    ; esi is base, 1 clock penalty

Since the Pentium® processor has two integer pipelines, a register used as the base or index component of an effective address calculation (in either pipe) causes an additional clock if that register is the destination of either instruction from the immediately preceding clock (see Figure 14-2). This effect is known as Address Generation Interlock (AGI). To avoid the AGI, the instructions should be separated by at least 1 clock by placing other instructions between them. The MMX™ registers cannot be used as base or index registers, so the AGI does not apply for MMX™ register destinations. No penalty occurs in the P6 family processors for the AGI condition.
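A minimal sketch of the rescheduling described above, assuming three unrelated loads are available to move between the write and the address use. The registers ecx, edx, ebx, and edi and the filler instructions themselves are hypothetical and not taken from the original text. Three fillers are used because two instructions can issue per clock, so fewer may leave the dependent instruction in the clock immediately after the write.

```asm
; Before: esi is written and then used as a base in the next clock.
    add esi, eax        ; esi is destination register
    mov eax, [esi]      ; esi is base, 1 clock AGI penalty on Pentium

; After (sketch): independent work separates the write from the address use.
    add esi, eax        ; esi is destination register
    mov ecx, [edi]      ; hypothetical filler, does not read or write esi
    mov edx, [edi+4]    ; hypothetical filler
    mov ebx, [edi+8]    ; hypothetical filler
    mov eax, [esi]      ; esi was written at least 1 full clock earlier, no AGI
```

The fillers only help if they are useful work that had to be done anyway, and if edi itself was not written in the clock just before it is used as a base.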
[Figure 14-2. Pipeline Example of AGI Stall. Three instructions pass through the PF, D1, D2, E, and WB stages; the third incurs an AGI penalty clock between D2 and E.]

14-29 CODE OPTIMIZATION

Note that some instructions have implicit reads/writes to registers. Instructions that generate addresses implicitly through ESP (such as PUSH, POP, RET, CALL) also suffer from the AGI penalty, as shown in the following example:
    sub esp, 24
                      ; 1 clock cycle stall
    push ebx
    mov esp, ebp
                      ; 1 clock cycle stall
    pop ebp

The PUSH and POP instructions also implicitly write to the ESP register. These writes, however, do not cause an AGI when the next instruction addresses through the ESP register. Pentium® processors “rename” the ESP register from PUSH and POP instructions to avoid the AGI penalty (see the following example):
    push edi
    mov ebx, [esp]    ; no stall

On Pentium® processors, instructions that include both an immediate and a displacement field are pairable in the U-pipe. When it is necessary to use constants, it is usually more efficient to use immediate data instead of loading the constant into a register first. If the same immediate data is used more than once, however, it is faster to load the constant in a register and then use the register multiple times, as illustrated in the following example:
    mov result, 555              ; 555 is immediate, result is displacement
    mov word ptr [esp+4], 1      ; 1 is immediate, 4 is displacement

Since MMX™ instructions have 2-byte opcodes (0FH opcode map), any MMX™ instruction that uses base or index addressing with a 4-byte displacement to access memory will have a length of 8 bytes. Instructions over 7 bytes can slow macro instruction decoding and should be avoided where possible. It is often possible to reduce the size of such instructions by adding the displacement value to the value in the base or index register, thus removing the displacement field.

14.8. INSTRUCTION LENGTH
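As an illustration of the size reduction described for MMX™ memory operands with a 4-byte displacement, the following sketch folds the displacement into a base register once and then addresses relative to it. The registers, the scale factor, and the constant 12345678h are hypothetical values chosen for the example; the byte counts in the comments follow from the standard opcode, ModR/M, SIB, and displacement field sizes.

```asm
; Before: each access carries a 4-byte displacement.
    movq mm0, [esi+ecx*8+12345678h]   ; 0FH 6FH + ModR/M + SIB + disp32 = 8 bytes
    movq mm1, [esi+ecx*8+12345680h]   ; 8 bytes

; After (sketch): fold the displacement into a base register once.
    lea  edx, [esi+12345678h]         ; 8DH + ModR/M + disp32 = 6 bytes, done once
    ; ... at least 1 clock of other work here, so that edx is not used
    ; as a base in the clock right after it is written (AGI on Pentium)
    movq mm0, [edx+ecx*8]             ; 0FH 6FH + ModR/M + SIB = 4 bytes
    movq mm1, [edx+ecx*8+8]           ; 0FH 6FH + ModR/M + SIB + disp8 = 5 bytes
```

This pays off mainly when the LEA can be hoisted out of a loop, since it costs an extra instruction and a register.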
On Pentium® processors, instructions greater than 7 bytes in length cannot be executed in the V-pipe. In addition, two instructions cannot be pushed into the instruction FIFO unless both are 7 bytes or less in length. If only one instruction is pushed into the instruction FIFO, pairing will not occur unless the instruction FIFO already contains at least one instruction. In code where pairing is very high (as is often the case in MMX™ code) or after a mispredicted branch, the instruction FIFO may be empty, leading to a loss of pairing whenever the instruction length is over 7 bytes. In addition, the P6 family processors can only decode one instruction at a time when an instruction is longer than 7 bytes. So, for best performance on all Intel processors, use simple instructions that are less than 8 bytes in length.

14.9. PREFIXED OPCODES
On the Pentium® processor, an instruction with a prefix is pairable in the U-pipe (PU) if the instruction (without the prefix) is pairable in both pipes (UV) or in the U-pipe (PU). The prefixes are issued to the U-pipe and get decoded in 1 clock for each prefix and then the instruction is issued to the U-pipe and may be paired. For the P6 family and Pentium® processors, the prefixes that should be avoided for optimum code execution speeds are:

• Lock.
• Segment override.
• Address size.
• Operand size.
• 2-byte opcode map (0FH) prefix. An exception is the Streaming SIMD Extensions instructions introduced with the Pentium® III processor. The first byte of these instructions is 0FH. It is not used as a prefix.

On Pentium® processors with MMX™ technology, a prefix on an instruction can delay the parsing and inhibit pairing of instructions. The following list highlights the effects of instruction prefixes on the Pentium® processor instruction FIFO:

• There is no penalty on 0FH-prefix instructions.
• An instruction with a 66H or 67H prefix takes 1 clock for prefix detection, another clock for length calculation, and another clock to enter the instruction FIFO (3 clocks total). It must be the first instruction to enter the instruction FIFO, and a second instruction can be pushed with it.
• Instructions with other prefixes (not 0FH,...
This note was uploaded on 06/07/2013 for the course ECE 1234 taught by Professor Kwhon during the Spring '10 term at University of California, Berkeley.