on is executed from the cache. When the above conditions are true, the instruction is almost “free” and can be used to access elements in the deeper levels of the floating-point stack instead of storing them and then loading them again.

14.5.3. Scheduling Rules for P6 Family Processors
The P6 family processors have 3 decoders that translate Intel Architecture macro instructions into micro operations (micro-ops, also called “uops”). The decoder limitations are as follows:

• The first decoder (decoder 0) can decode instructions up to 7 bytes in length and with up to 4 micro-ops in one clock cycle. The second two decoders (decoders 1 and 2) can decode instructions that are 1 micro-op instructions, and these instructions will also be decoded in one clock cycle.
• Three macro instructions in an instruction sequence that fall into this envelope will be decoded in one clock cycle.
• Macro instructions outside this envelope will be decoded through decoder 0 alone. While decoder 0 is decoding a long macro instruction, decoders 1 and 2 (second and third decoders) are quiescent.

Appendix C of the Intel Architecture Optimization Manual lists all Intel macro-instructions and the decoders on which they can be decoded.

14-22 CODE OPTIMIZATION

The macro instructions entering the decoder travel through the pipe in order; therefore, if a macro instruction will not fit in the next available decoder, then the instruction must wait until the next clock to be decoded. It is possible to schedule instructions for the decoder such that the instructions in the in-order pipeline are less likely to be stalled. Consider the following examples:

• If the next available decoder for a multimicro-op instruction is not decoder 0, the multimicro-op instruction will wait for decoder 0 to be available, usually in the next clock, leaving the other decoders empty during the current clock. Hence, the following two instructions will take 2 clocks to decode.
    add  eax, ecx      ; 1 uop instruction (decoder 0)
    add  edx, [ebx]    ; 2 uop instruction (stall 1 cycle wait till
                       ; decoder 0 is available)

• During the beginning of the decoding clock, if two consecutive instructions are more than 1 micro-op, decoder 0 will decode one instruction and the next instruction will not be decoded until the next clock.
    add  eax, [ebx]    ; 2 uop instruction (decoder 0)
    mov  ecx, [eax]    ; 2 uop instruction (stall 1 cycle to wait until
                       ; decoder 0 is available)
    add  ebx, 8        ; 1 uop instruction (decoder 1)

Instructions of the opcode reg, mem form produce two micro-ops: the load from memory and the operation micro-op. Scheduling for the decoder template (4-1-1) can improve the decoding throughput of your application.

In general, the opcode reg, mem forms of instructions are used to reduce register pressure in code that is not memory bound, and when the data is in the cache. Use simple instructions for improved speed on the Pentium® and P6 family processors.

The following rules should be observed while using the opcode reg, mem instruction on Pentium® processors with MMX™ technology:

• Schedule for minimal stalls in the Pentium® processor pipe. Use as many simple instructions as possible. Generally, 32-bit assembly code that is well optimized for the Pentium® processor pipeline will execute well on the P6 family processors.
• When scheduling for Pentium® processors, keep in mind the primary stall conditions and decoder (4-1-1) template on the P6 family processors, as shown in the example below.
    pmaddwd mm6, [ebx]   ; 2 uops instruction (decoder 0)
    paddd   mm7, mm6     ; 1 uop instruction (decoder 1)
    add     ebx, 8       ; 1 uop instruction (decoder 2)

14.6. ACCESSING MEMORY
The following subsections describe optimizations that can be obtained when scheduling instructions that access memory.

14.6.1. Using MMX™ Instructions That Access Memory

An MMX™ instruction may have two register operands (opcode reg, reg) or one register and one memory operand (opcode reg, mem), where opcode represents the instruction opcode, reg represents the register, and mem represents memory. The opcode reg, mem instructions are useful in some cases to reduce register pressure, increase the number of operations per clock, and reduce code size.

The following discussion assumes that the memory operand is present in the data cache. If it is not, then the resulting penalty is usually large enough to obviate the scheduling effects discussed in this section.

In the Pentium® processor with MMX™ technology, the opcode reg, mem MMX™ instructions do not have longer latency than the opcode reg, reg instructions (assuming a cache hit). They do have more limited pairing opportunities, however.

In the Pentium® II and Pentium® III processors, the opcode reg, mem MMX™ instructions translate into two micro-ops, as opposed to one micro-op for the opcode reg, reg instructions. Thus, they tend to limit decoding bandwidth and occupy more resources than the opcode reg, reg instructions.

The recommended usage of the opcode reg, reg instructions depends on whether the MMX™ code is memory-bound (that is, execution speed is limited by memory accesses). As a rule of thumb, an MMX™ code sequence is considered to be memory-bound if the following inequalit...
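
The 4-1-1 decoder template from Section 14.5.3 can be sketched as a small model. This is only an illustration under simplifying assumptions, not Intel's implementation: the function name decode_schedule is hypothetical, each instruction is represented only by its micro-op count, the 7-byte length limit of decoder 0 is ignored, and instructions with more than 4 micro-ops are rejected rather than modeled. Nothing about execution or the data cache is modeled either.

```python
def decode_schedule(uop_counts):
    """Assign instructions to decoders under a simplified 4-1-1 model.

    uop_counts: micro-op count of each instruction, in program order.
    Returns a list of clocks; each clock is a list of (decoder, uops) pairs.
    Assumptions (not from the source): instruction length is ignored and
    only instructions of 1 to 4 micro-ops are accepted.
    """
    clocks = []
    next_decoder = 3  # 3 means "no decoder free", so the first instruction opens a clock
    for uops in uop_counts:
        if not 1 <= uops <= 4:
            raise ValueError("this sketch only models 1 to 4 micro-op instructions")
        # Decoder 0 takes up to 4 micro-ops; decoders 1 and 2 take exactly 1.
        fits = next_decoder < 3 and (next_decoder == 0 or uops == 1)
        if not fits:
            # In-order decode: the instruction waits for decoder 0 in the next clock.
            clocks.append([])
            next_decoder = 0
        clocks[-1].append((next_decoder, uops))
        next_decoder += 1
    return clocks


# The three instruction sequences discussed in Section 14.5.3:
print(len(decode_schedule([1, 2])))     # add reg,reg ; add reg,mem        -> 2
print(len(decode_schedule([2, 2, 1])))  # add reg,mem ; mov ; add reg,imm   -> 2
print(len(decode_schedule([2, 1, 1])))  # pmaddwd reg,mem ; paddd ; add     -> 1
```

Under this model, placing the two-micro-op opcode reg, mem instruction first in each group of three lets all three decoders work in the same clock, which is the ordering the 4-1-1 template favors.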
This note was uploaded on 06/07/2013 for the course ECE 1234 taught by Professor Kwhon during the Spring '10 term at Berkeley.