Intel Software Developer's Manual

• Use single precision (32-bits) whenever possible; it uses only half the memory space of double precision (64-bits) or double extended (80-bits).
• Use a library that provides fast floating-point to integer routines. Many library routines do more work than is necessary.
• Ensure whenever possible that computations stay in range. Out-of-range numbers cause very high overhead.
• Schedule code in assembly language using the FXCH instruction. When possible, unroll loops and pipeline code.
• Perform transformations to improve memory access patterns. Use loop fusion or compression to keep as much of the computation in the cache as possible.
• Break dependency chains.

14.1.4. Guidelines for Optimizing SIMD Floating-point Code

Generally, it is important to understand and balance port utilization to create efficient SIMD floating-point code. Use the following guidelines to optimize SIMD floating-point code:

• Balance the limitations of the architecture.
• Schedule instructions to resolve dependencies.
• Schedule utilization of the triple/quadruple rule (port 0, port 1, and ports 2, 3, and 4).
• Group instructions that utilize the same registers as closely as possible. Take into consideration the resolution of true dependencies.
• Intermix SIMD-fp operations that utilize port 0 and port 1.
• Do not issue consecutive instructions that utilize the same port.
• Use the reciprocal instructions followed by iteration for increased accuracy. These instructions yield reduced accuracy but execute much faster. If reduced accuracy is acceptable, use them with no iteration. If near full accuracy is needed, use a Newton-Raphson iteration. If full accuracy is needed, use the divide and square root instructions, which provide more accuracy but slow down performance.
• Exceptions: mask exceptions to achieve higher performance. Unmasked exceptions may cause a reduction in the retirement rate.
• Utilize the flush-to-zero mode for higher performance to avoid the penalty of dealing with denormals and underflows.
• Incorporate the prefetch instruction whenever possible (for details, refer to Chapter 6, “Optimizing Cache Utilization for Pentium® III processors”).
• Try to emulate conditional moves by masked compares and logicals instead of using conditional jumps.
• Utilize MMX™ technology instructions if the computations can be done in SIMD-integer, or for shuffling or copying data that is not used later in SIMD floating-point computations.
• If the algorithm requires extended precision, then conversion to SIMD floating-point code is not advised because the SIMD floating-point instructions are single-precision.

14.2. BRANCH PREDICTION OPTIMIZATION

The P6 family and Pentium® processors provide dynamic branch prediction using the branch target buffers (BTBs) on the processors. Understanding the flow of branches and improving their predictability can increase code execution speed significantly.

14.2.1. Branch Prediction Rules

Three elements of dynamic branch prediction are important to understand:

• If the instruction address is not in the BTB, execution is predicted to continue without branching (fall through).
• Predicted taken branches have a 1-clock delay.
• The BTB stores a four-bit history of branch predictions on Pentium® Pro processors, the Pentium® II processor family, and the Pentium® III processor. The Pentium® II and Pentium® III processors' BTB pattern matches on the direction of the last four branches to dynamically predict whether a branch will be taken.
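As an illustration of the four-bit history rule above, a conditional branch whose outcomes repeat in a short pattern can be predicted correctly once the pattern has been observed, while a branch that depends on irregular data gives the history little to work with. The following C sketch is illustrative only and is not taken from the manual; the function names, array size, and loop pattern are assumptions:

/* Illustrative sketch: the condition in sum_periodic() holds once in
   every four iterations, a short repeating pattern that a history of
   the last four branch outcomes can capture. The condition in
   sum_irregular() depends on the data and follows no short repeating
   pattern, so its branch is far more likely to be mispredicted. */

#define N 100000

long sum_periodic(const long *a)
{
    long sum = 0;
    for (int i = 0; i < N; i++) {
        if ((i & 3) == 0)       /* outcome pattern repeats with period 4 */
            sum += a[i];
    }
    return sum;
}

long sum_irregular(const long *a)
{
    long sum = 0;
    for (int i = 0; i < N; i++) {
        if (a[i] & 1)           /* data-dependent; no fixed pattern */
            sum += a[i];
    }
    return sum;
}

When a branch outcome cannot be made predictable in this way, emulating the conditional move by masked compares and logicals, as suggested in Section 14.1.4, removes the branch entirely.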
During the process of instruction prefetch, the instruction address of a conditional instruction is checked against the entries in the BTB. When the address is not in the BTB, execution is predicted to fall through to the next instruction. On P6 family processors, branches that do not have a history in the BTB are predicted using a static prediction algorithm, which does the following:

• Predicts unconditional branches to be taken.
• Predicts backward conditional branches to be taken. This rule is suitable for loops.
• Predicts forward conditional branches to be not taken.

14.2.2. Optimizing Branch Predictions in Code

To optimize branch predictions in application code, apply the following techniques:

• Reduce or eliminate branches (see Section 14.2.3., “Eliminating and Reducing the Number of Branches”).
• Ensure that each CALL instruction has a matching RET instruction. The P6 family of processors has a return stack buffer that keeps track of the target address of the next RET instruction. Do not use pops and jumps to return from a CALL instruction; always use the RET instruction.
• Do not intermingle data with instructions in a code segment. Unconditional jumps, when not in the BTB, are predicted to be not taken. If data follows an unconditional branch, the data might be fetched, causing the loss of instruction fetch cycles and valuable instruction cache space. When data must be stored in the code segment, move it to the end where it will not be in the instruction fetch stream.
• Unroll all very short loops. Loops that execute for less th...
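The last technique can be illustrated with a short C sketch, which is not taken from the manual; the function names and the three-element trip count are assumptions. Unrolling a very short loop with a known trip count removes the loop's conditional branch, and therefore its prediction cost, altogether:

/* Rolled form: the loop overhead (increment, compare, conditional
   branch) is executed on every iteration of a three-element copy. */
void copy3_rolled(float *dst, const float *src)
{
    for (int i = 0; i < 3; i++)
        dst[i] = src[i];
}

/* Unrolled form: the loop and its conditional branch are removed,
   leaving only the three copies. */
void copy3_unrolled(float *dst, const float *src)
{
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
}

The same transformation also eliminates the induction-variable update and compare that would otherwise run on every iteration.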