IntelSoftwareDevelopersManual

Sign extension is usually quite expensive often the

Info iconThis preview shows page 1. Sign up to view the full content.

View Full Document Right Arrow Icon
This is the end of the preview. Sign up to access the rest of the document.

Unformatted text preview: 66H, or 67H) take 1 additional clock to detect each prefix. These instructions are pushed into the instruction FIFO only as the first instruction. An instruction with two prefixes will take 3 clocks to be pushed into the instruction FIFO (2 clocks for the prefixes and 1 clock for the instruction). A second instruction can be pushed with the first into the instruction FIFO in the same clock. • The impact on performance exists only when the instruction FIFO does not hold at least two entries. As long as the decoder (D1 stage) has two instructions to decode there is no penalty. The instruction FIFO will quickly become empty if the instructions are pulled from the instruction FIFO at the rate of two per clock. So, if the instructions just before the prefixed instruction suffer from a performance loss (for example, no pairing, stalls due to cache misses, misalignments, etc.), then the performance penalty of the prefixed instruction may be masked. On the P6 family processors, instructions longer than 7 bytes in length limit the number of instructions decoded in each clock. Prefixes add 1 to 2 bytes to the length of an instruction, possibly limiting the decoder. 14-31 CODE OPTIMIZATION It is recommended that, whenever possible, prefixed instructions not be used or that they be scheduled behind instructions which themselves stall the pipe for some other reason. 14.10. INTEGER INSTRUCTION SELECTION AND OPTIMIZATIONS This section describes both instruction sequences to avoid and sequences to use when generating optimal assembly code. The information applies to the P6 family processors and the Pentium® processors with and without MMX™ technology. • LEA Instruction. The LEA instruction can be used in the following situations to optimize code execution: — The LEA instruction may be used sometimes as a three/four operand addition instruction (for example, LEA ECX, [EAX+EBX+4+a]). — In many cases, an LEA instruction or a sequence of LEA, ADD, SUB and SHIFT instructions may be used to replace constant multiply instructions. For the P6 family processors the constant multiply is faster relative to other instructions than on the Pentium® processor, therefore the trade off between the two options occurs sooner. It is recommended that the integer multiply instruction be used in code designed for P6 family processor execution. — The above technique can also be used to avoid copying a register when both operands to an ADD instruction are still needed after the ADD, since the LEA instruction need not overwrite its operands. The disadvantage of the LEA instruction is that it increases the possibility of an AGI stall with previous instructions. LEA is useful for shifts of 2, 4, and 8 because on the Pentium® processor, LEA can execute in either the U- or V-pipe, but the shift can only execute in the U-pipe. On the P6 family processors, both the LEA and SHIFT instructions are single micro-op instructions that execute in 1 clock. • • Complex Instructions. For greater execution speed, avoid using complex instructions (for example, LOOP, ENTER, or LEAVE). Use sequences of simple instructions instead to accomplish the function of a complex instruction. Zero-Extension of Short Integers. On the Pentium® processor, the MOVZX instruction has a prefix and takes 3 clocks to execute totaling 4 clocks. It is recommended that the following sequence be used instead of the MOVZX instruction: xor mov eax, eax al, mem If this code occurs within a loop, it may be possible to pull the XOR instruction out of the loop if the only assignment to EAX is the MOV AL, MEM. This has greater importance for the Pentium® processor since the MOVZX is not pairable and the new sequence may be paired with adjacent instructions. In order to avoid a partial register stall on the P6 family processors, special hardware has been implemented that allows this code sequence to execute without a stall. Even 14-32 CODE OPTIMIZATION so, the MOVZX instruction is a better choice for the P6 family processors than the alternative sequences. • PUSH Mem. The PUSH mem instruction takes 4 clocks for the Intel486™ processor. It is recommended that the following sequence be used in place of a PUSH mem instruction because it takes only 2 clocks for the Intel486™ processor and increases pairing opportunity for the Pentium® processor. mov push reg, mem reg • Short Opcodes. Use 1 byte long instructions as much as possible. This will reduce code size and help increase instruction density in the instruction cache. The most common example is using the INC and DEC instructions rather than adding or subtracting the constant 1 with an ADD or SUB instruction. Another common example is using the PUSH and POP instructions instead of the equivalent sequence. 8/16 Bit Operands. With 8-bit operands, try to use the byte opcodes, rather than using 32bit operations on sign and zero extended bytes. Prefixes for operand size override apply to 16-bit operands, not to 8-bit operands. Sign Extension is usually quite expensive. Often, the semantics can...
View Full Document

This note was uploaded on 06/07/2013 for the course ECE 1234 taught by Professor Kwhon during the Spring '10 term at University of California, Berkeley.

Ask a homework question - tutors are online