al FSTPS instruction after the conditional jump to finish the final loop iteration.

Hiding the One-Clock Latency of a Floating-Point Store

A floating-point store must wait an extra clock for its floating-point operand. After an FLD, an FST must wait 1 clock, as shown in the following example:

    fld  mem1    ; 1    fld takes 1 clock
                 ; 2    fst waits, schedule something here
    fst  mem2    ; 3,4  fst takes 2 clocks

After the common arithmetic operations, FMUL and FADD, which normally have a latency of 3 clocks, FST waits an extra clock for a total of 4 (see following example).

    fadd mem1    ; 1    add takes 3 clocks
                 ; 2    add, schedule something here
                 ; 3    add, schedule something here
                 ; 4    fst waits, schedule something here
    fst  mem2    ; 5,6  fst takes 2 clocks

Other instructions such as FADDP and FSUBRP also exhibit this type of latency. In the next example, the store is not dependent on the previous load:

    fld  mem1     ; 1
    fld  mem2     ; 2
    fxch st(1)    ; 2
    fst  mem3     ; 3    stores values loaded from mem1

Here, a register may be used immediately after it has been loaded (with FLD):

    fld  mem1    ; 1
    fadd mem2    ; 2,3,4

Use of a register by a floating-point operation immediately after it has been written by another FADD, FSUB, or FMUL causes a 2-clock delay. If instructions are inserted between these two, then latency and a potential stall can be hidden.

Additionally, there are multiclock floating-point instructions (FDIV and FSQRT) that execute in the floating-point unit pipe (the U-pipe). While executing these instructions in the floating-point unit pipe, integer instructions can be executed in parallel. Emitting a number of integer instructions after such an instruction will keep the integer execution units busy (the exact number of instructions depends on the floating-point instruction’s clock count).

14-20 CODE OPTIMIZATION

Integer instructions generally overlap with the floating-point operations except when the last floating-point operation was FXCH. In this case there is a 1 clock delay:

    U-pipe          V-pipe
    fadd            fxch        ; 1
                                ; 2  fxch delay
    mov  eax, 1     inc  edx

Integer and Floating-Point Multiply

The integer multiply operations, the MUL and IMUL instructions, are executed by the FPU’s multiply unit. Therefore, for the Pentium® processor, these instructions cannot be executed in parallel with a floating-point instruction. This restriction does not apply to the P6 family processors, because these processors have two internal FPU execution units.

A floating-point multiply instruction (FMUL) delays for 1 clock if the immediately preceding clock executed an FMUL or an FMUL-FXCH pair. The multiplier can only accept a new pair of operands every other clock.

Floating-Point Operations with Integer Operands

Floating-point operations that take integer operands (the FIADD or FISUB instruction) should be avoided. These instructions should be split into two instructions: the FILD instruction and a floating-point operation. The number of clocks before another instruction can be issued (throughput) for FIADD is 4, while for FILD and simple floating-point operations it is 1, as shown in the example below:

    Complex Instructions      Better for Potential Overlap
    fiadd [ebp]    ; 4        fild  [ebp]    ; 1
                              faddp st(1)    ; 2

Using the FILD and FADDP instructions in place of FIADD yields 2 free clocks for executing other instructions.

FSTSW Instruction

The FSTSW instruction that usually appears after a floating-point comparison instruction (FCOM, FCOMP, FCOMPP) delays for 3 clocks. Other instructions may be inserted after the comparison instruction to hide this latency. On the P6 family processors the FCMOVcc instruction can be used instead.

Transcendental Instructions

Transcendental instructions execute in the U-pipe and nothing can be overlapped with them, so an integer instruction following a transcendental instruction will wait until the previous instruction completes.

Transcendental instructions execute on the Pentium® processor (and later Intel Architecture processors) much faster than the software emulations of these instructions found in most math libraries. Therefore, it may be worthwhile in-lining transcendental instructions in place of math library calls to transcendental functions. Software emulations of transcendental instructions will execute faster than the equivalent instructions only if accuracy is sacrificed.

FXCH Guidelines

The FXCH instruction costs no extra clocks on the Pentium® processor when all of the following conditions occur, allowing the instruction to execute in the V-pipe in parallel with another floating-point instruction executing in the U-pipe:

• A floating-point instruction follows the FXCH instruction.
• A floating-point instruction from the following list immediately precedes the FXCH instruction: FADD, FSUB, FMUL, FLD, FCOM, FUCOM, FCHS, FTST, FABS, or FDIV.
• An FXCH instruction has already been executed. This is because the instruction boundaries in the cache are marked the first time the instruction is executed, so pairing only happens the second time this instructi...
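
The three FXCH conditions listed above can be restated as a small predicate. The following Python sketch is only a toy model for checking a hand-written instruction sequence against that list; the function name `fxch_is_free` and its parameters are hypothetical and are not part of any Intel tool, and the model covers nothing beyond the three conditions as stated for the Pentium® processor.

```python
# Toy restatement of the three FXCH pairing conditions (Pentium processor only).
# All names here are illustrative assumptions, not an Intel-provided API.

# Instructions that may immediately precede a zero-cost FXCH, per the list above.
PAIRABLE_BEFORE_FXCH = frozenset({
    "FADD", "FSUB", "FMUL", "FLD", "FCOM",
    "FUCOM", "FCHS", "FTST", "FABS", "FDIV",
})


def fxch_is_free(prev_mnemonic: str,
                 next_is_floating_point: bool,
                 executed_before: bool) -> bool:
    """Return True only when all three stated conditions hold.

    prev_mnemonic          -- instruction immediately before the FXCH
    next_is_floating_point -- whether the instruction after the FXCH is floating-point
    executed_before        -- whether this FXCH has already been executed once,
                              so that its boundaries are marked in the cache
    """
    return (
        prev_mnemonic.upper() in PAIRABLE_BEFORE_FXCH
        and next_is_floating_point
        and executed_before
    )


if __name__ == "__main__":
    # FADD before, floating-point after, second time through a loop body.
    print(fxch_is_free("fadd", True, True))
    # Same sequence on its first execution.
    print(fxch_is_free("fadd", True, False))
    # FSQRT is not in the list of permitted preceding instructions.
    print(fxch_is_free("fsqrt", True, True))
```

A check like this is mainly useful for reviewing inner loops, where the "already executed" condition is satisfied from the second iteration onward.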
This note was uploaded on 06/07/2013 for the course ECE 1234 taught by Professor Kwhon during the Spring '10 term at University of California, Berkeley.