Intel Software Developer's Manual


CODE OPTIMIZATION

...an 2 clocks waste loop overhead. Write code to follow the static prediction algorithm. The static prediction algorithm follows the natural flow of program code. Following this algorithm reduces the number of branch mispredictions.

14.2.3. Eliminating and Reducing the Number of Branches

Eliminating branches improves processor performance by:

• Removing the possibility of branch mispredictions.
• Reducing the number of BTB entries required.

Branches can be eliminated by using the SETcc instruction, or by using the P6 family processors’ conditional move (CMOVcc or FCMOVcc) instructions.

The following C code example shows a condition that depends on the constants A and B:

    /* C Code */
    ebx = (A < B) ? C1 : C2;

This code compares the values A and B. If the condition is true, EBX is set to C1; otherwise it is set to C2. The assembly-language equivalent of the C code is shown in the example below (C1 and C2 appear as CONST1 and CONST2):

    ; Assembly Code
    cmp A, B                        ; condition
    jge L30                         ; conditional branch
    mov ebx, CONST1
    jmp L31                         ; unconditional branch
    L30:
    mov ebx, CONST2
    L31:

Replacing the JGE instruction in the previous example with a SETcc instruction allows the EBX register to be set to either C1 or C2 without branching. The code can be optimized to eliminate the branches as shown in the following code:

    xor ebx, ebx                    ; clear ebx
    cmp A, B
    setge bl                        ; ebx = 0 or 1
                                    ; (or use the complement condition)
    dec ebx                         ; ebx = 00...00 or 11...11
    and ebx, (CONST2-CONST1)        ; ebx = 0 or (CONST2-CONST1)
    add ebx, min(CONST1,CONST2)     ; ebx = CONST1 or CONST2

The optimized code sets register EBX to 0, then compares A and B. If A is greater than or equal to B, then EBX is set to 1. EBX is then decremented and ANDed with the difference of the constant values. This sets EBX to either 0 or the difference of the values. By adding the minimum of the two constants, the correct value is written to EBX.
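The SETcc sequence described above can be mirrored step by step in C. The following is a minimal sketch, not the manual's own code: the function names select_branchy and select_branchless are hypothetical, and the arithmetic uses (c1 - c2) plus c2 rather than the listing's (CONST2-CONST1) and min() terms, an assumption made so that the result is correct for either ordering of the constants without switching to the complement condition. A compiler is free to emit a branch or a conditional move for either function, so this shows the arithmetic rather than guaranteeing branch-free machine code.

```c
#include <stdint.h>

/* Reference version with a branch: (a < b) ? c1 : c2. */
uint32_t select_branchy(int32_t a, int32_t b, uint32_t c1, uint32_t c2)
{
    if (a < b) {
        return c1;
    }
    return c2;
}

/* Branch-free version modeled on the SETcc / DEC / AND / ADD steps.
 * Assumption: the mask is applied to (c1 - c2) and c2 is added back,
 * which differs from the (CONST2-CONST1) / min() form in the listing. */
uint32_t select_branchless(int32_t a, int32_t b, uint32_t c1, uint32_t c2)
{
    uint32_t flag = (uint32_t)(a >= b); /* setge: 1 if a >= b, else 0        */
    uint32_t mask = flag - 1u;          /* dec:   0 if a >= b, else all ones */
    uint32_t diff = mask & (c1 - c2);   /* and:   0 or (c1 - c2) mod 2^32    */
    return diff + c2;                   /* add:   c2 or c1                   */
}
```

Unsigned arithmetic is used for the constants so that the subtraction and addition wrap modulo 2^32 instead of overflowing, which matches what the 32-bit register operations do.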
When CONST1 or CONST2 is equal to zero, the last instruction can be deleted, as the correct value has already been written to EBX. When ABS(CONST1-CONST2) is one of {2,3,5,9}, the following example applies:

    xor ebx, ebx
    cmp A, B
    setge bl                        ; or the complement condition
    lea ebx, [ebx*D+ebx+CONST1-CONST2]

where D stands for ABS(CONST1 − CONST2) − 1.

A second way to remove branches on P6 family processors is to use the new CMOVcc and FCMOVcc instructions. The following example shows how to use the CMOVcc instruction to eliminate the branch from a test-and-branch instruction sequence. If the test sets the equal (zero) flag, then the value in register EBX will be moved to register EAX. This branch is data dependent, and is representative of an unpredictable branch.

    test ecx, ecx
    jne 1h
    mov eax, ebx
    1h:

To change the code, the JNE and the MOV instructions are combined into one CMOVcc instruction, which checks the equal flag. The optimized code is shown below:

    test ecx, ecx                   ; test the flags
    cmove eax, ebx                  ; if the equal flag is set, move ebx to eax
    1h:

The label 1h: is no longer needed unless it is the target of another branch instruction.

The CMOVcc and FCMOVcc instructions generate invalid-opcode exceptions when executed on previous-generation Intel Architecture processors. Therefore, use the CPUID instruction to check feature bit 15 of the EDX register, which when set indicates the presence of the CMOVcc family of instructions. Do not use the family and model codes returned by CPUID to test for the presence of specific features.

Additional information on branch optimization can be found in the Intel Architecture Optimization Manual.

14.3. REDUCING PARTIAL REGISTER STALLS ON P6 FAMILY PROCESSORS

On P6 family processors, when a large (32-bit) general-purpose register is read immediately after a small register (8- or 16-bit) that is contained in the large register has been written, the read is stalled until the write retires (a minimum of 7 clocks).
Consider the example below:

    MOV AX, 8
    ADD ECX, EAX                    ; Partial stall occurs on access of
                                    ; the EAX register

Here, the first instruction moves the value 8 into the small register AX. The next instruction accesses the large register EAX. This code sequence results in a partial register stall. Pentium® and Intel486™ processors do not generate this stall.

Table 14-1 lists the groups of small registers and their corresponding large register for which a partial register stall can occur. For example, writing to register BL, BH, or BX and subsequently reading register EBX will result in a stall.

Table 14-1. Small and Large General-Purpose Register Pairs

    Small Registers      Large Registers
    AL   AH   AX         EAX
    BL   BH   BX         EBX
    CL   CH   CX         ECX
    DL   DH   DX         EDX
    SP                   ESP
    BP                   EBP
    DI                   EDI
    SI                   ESI

Because the P6 family processors can execute code out of order, the instructions need not be immediately adjacent for the stall to occur. The following example also contains a partial stall:

    MOV AL, 8
    MOV EDX, 0x40
    MOV EDI, new_value
    ADD EDX, EAX                    ; Partial stall occurs on access of
                                    ; the EAX register

In addition, any micro-ops that follow the stalled micro-op will also wait...