Architecture processors the code is intended to run on.

Pipelining can greatly increase the performance of code written to run on the Pentium® family of processors. It is less important for code written to run on the P6 family processors, because the dynamic execution model that these processors use does a significant amount of pipelining automatically.

The following subsections describe general pipelining guidelines for MMX™ and floating-point instructions. These guidelines yield significant improvements in execution speed for code running on the Pentium® processors and may yield additional improvements in execution speed on the P6 family processors. Specific pipelining guidelines for the P6 family processors are given in Section 14.5.3., “Scheduling Rules for P6 Family Processors”.

MMX™ INSTRUCTION PIPELINING GUIDELINES

All MMX™ instructions can be pipelined on P6 family and Pentium® (with MMX™ technology) processors, including the multiply instructions. All MMX™ instructions take a single clock to execute except the MMX™ multiply instructions, which take 3 clocks.

Since MMX™ multiply instructions take 3 clocks to execute, the result of a multiply instruction can be used only by other instructions issued 3 clocks later. For this reason, avoid scheduling a dependent instruction in the 2 instruction pairs following the multiply.

The store of a register after writing the register must wait for 2 clocks after the update of the register. Scheduling the store 2 clocks after the update avoids a pipeline stall.

FLOATING-POINT PIPELINING GUIDELINES

Many of the floating-point instructions have a latency greater than 1 clock; therefore, on Pentium® processors the next floating-point instruction cannot access the result until the first operation has finished execution. To hide this latency, instructions should be inserted between the pair that causes the pipe stall.
These instructions can be integer instructions or floating-point instructions that will not cause a new stall themselves. The number of instructions that should be inserted depends on the length of the latency. Because of the out-of-order execution capability of the P6 family processors, stalls will not necessarily occur on an instruction or micro-op basis. However, if an instruction has a very long latency such as an FDIV, then scheduling can improve the throughput of the overall application. The following sections list considerations for floating-point pipelining on Pentium® processors.

Pairing of Floating-Point Instructions

In a Pentium® processor, pairing floating-point instructions with one another (with one exception) does not result in a performance enhancement because the processor has only one floating-point unit (FPU). However, some floating-point instructions can be paired with integer instructions or the FXCH instruction to improve execution times. The following are some general pairing rules and restrictions for floating-point instructions:

• All floating-point instructions can be executed in the V-pipe and paired with suitable instructions (generally integer instructions) in the U-pipe.
• The only floating-point instruction that can be executed in the U-pipe is the FXCH instruction. The FXCH instruction, if executed in the U-pipe, can be paired with another floating-point instruction executing in the V-pipe.
• The floating-point instructions FSCALE, FLDCW, and FST cannot be paired with any instruction (integer instruction or the FXCH instruction).

Using Integer Instructions to Hide Latencies and Schedule Floating-Point Instructions

When a floating-point instruction depends on the result of the immediately preceding instruction, and that instruction is also a floating-point instruction, performance can be improved by placing one or more integer instructions between the two floating-point instructions.
This is true even if the integer instructions perform loop control. The following example restructures a loop in this manner:
    for (i=0; i<Size; i++)
        array1 [i] += array2 [i];

    ; assume eax=Size-1, esi=array1, edi=array2
                                            PENTIUM(R) PROCESSOR CLOCKS
    LoopEntryPoint:
        fld   real4 ptr [esi+eax*4]         ; 2 - AGI
        fadd  real4 ptr [edi+eax*4]         ; 1
        fstp  real4 ptr [esi+eax*4]         ; 5 - waits for fadd
        dec   eax                           ; 1
        jnz   LoopEntryPoint

    ; assume eax=Size-1, esi=array1, edi=array2
        jmp   LoopEntryPoint
    Align 16
    TopOfLoop:
        fstp  real4 ptr [esi+eax*4+4]       ; 4 - waits for fadd + AGI
    LoopEntryPoint:
        fld   real4 ptr [esi+eax*4]         ; 1
        fadd  real4 ptr [edi+eax*4]         ; 1
        dec   eax                           ; 1
        jnz   TopOfLoop
    ;
        fstp  real4 ptr [esi+eax*4+4]

By moving the integer instructions between the FADDS and FSTPS instructions, the integer instructions can be executed while the FADDS instruction is completing in the floating-point unit and before the FSTPS instruction begins execution. Note that this new loop structure requires a separate entry point for the first iteration because the loop needs to begin with the FLDS instruction. Also, there needs to be an addition...
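The restructured loop above can also be sketched at the C level to show the data flow: the sum for one element is held until the loop counter has been updated, and is stored at the top of the next pass. This is only an illustrative sketch, not a translation of the assembly listing. The function names add_arrays_simple and add_arrays_deferred_store are hypothetical, the arrays are assumed not to overlap, and the loop bounds (every index from size-1 down to 0) are a choice made for this sketch. A C compiler is free to schedule these statements differently, so the sketch shows the structure rather than any guaranteed instruction order.

```c
#include <stddef.h>

/* Reference version: load, add, and store in the same iteration. */
static void add_arrays_simple(float *array1, const float *array2, size_t size)
{
    for (size_t i = 0; i < size; i++)
        array1[i] += array2[i];
}

/* Sketch of the restructured loop (hypothetical helper, arrays assumed
 * not to overlap). The store of each sum is deferred until after the
 * index update, so integer work sits between the add and its store. */
static void add_arrays_deferred_store(float *array1, const float *array2,
                                      size_t size)
{
    if (size == 0)
        return;

    size_t i = size - 1;

    /* Separate entry for the first element: load and add, no store yet. */
    float sum = array1[i] + array2[i];

    while (i > 0) {
        i--;                          /* loop control between add and store */
        array1[i + 1] = sum;          /* deferred store of the previous sum */
        sum = array1[i] + array2[i];  /* load and add for the current element */
    }

    /* Additional store after the loop for the last computed sum (i == 0). */
    array1[i] = sum;
}
```

Both functions leave array1 with the same contents. The deferred version needs the extra load and add before the loop and the extra store after it, which corresponds to the separate entry point and the trailing store in the restructured assembly loop.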
- Spring '10