Optimize Data Cache Bandwidth to MMX™ Registers


g information unavailable to the hardware. Specifically, the machine-check exception handler can, after logging, carefully analyze the error-reporting registers when the error-logging routine reports an error that does not allow execution to be restarted. These recovery techniques can use external-bus-related model-specific information provided with the error report to localize the source of the error within the system and determine the appropriate recovery strategy.

CHAPTER 14
CODE OPTIMIZATION

This chapter describes the more important code optimization techniques for Intel Architecture processors with and without MMX™ technology, as well as with and without Streaming SIMD Extensions. The chapter begins with general code-optimization guidelines and continues with a brief overview of the more important blended techniques for optimizing integer, MMX™ technology, floating-point, and SIMD floating-point code. A comprehensive discussion of code optimization techniques can be found in the Intel Architecture Optimization Manual, Order Number 242816.

14.1. CODE OPTIMIZATION GUIDELINES

This section contains general guidelines for optimizing application code, as well as specific guidelines for optimizing MMX™, floating-point, and SIMD floating-point code. Developers creating applications that use MMX™ and/or floating-point instructions should apply the general guidelines in addition to the MMX™ and/or floating-point code optimization guidelines. Developers creating applications that use SIMD floating-point code should apply the general guidelines and the MMX™ and/or floating-point code optimization guidelines, in addition to the SIMD floating-point code optimization guidelines.

14.1.1. General Code Optimization Guidelines

Use the following guidelines to optimize code to run efficiently across several families of Intel Architecture processors:

• Use a current-generation compiler that produces optimized code, to ensure that efficient code is generated from the start of code development.
• Write code that can be optimized by the compiler. For example:
  — Minimize the use of global variables, pointers, and complex control flow statements.
  — Do not use the “register” modifier.
  — Use the “const” modifier.
  — Do not defeat the typing system.
  — Do not make indirect calls.
  — Use minimum sizes for integer and floating-point data types, to enable SIMD parallelism.
• Pay attention to the branch prediction algorithm for the target processor. This optimization is particularly important for P6 family processors. Code that optimizes branch predictability will spend fewer clocks fetching instructions.
• Take advantage of the SIMD capabilities of MMX™ technology and Streaming SIMD Extensions.
• Avoid partial register stalls.
• Align all data.
• Organize code to minimize instruction cache misses and optimize instruction prefetches.
• Schedule code to maximize pairing on Pentium® processors.
• Avoid prefixed opcodes other than 0FH.
• When possible, load and store data to the same area of memory using the same data sizes and address alignments; that is, avoid small loads after large stores to the same area of memory, and avoid large loads after small stores to the same area of memory.
• Use software pipelining.
• Always pair CALL and RET (return) instructions.
• Avoid self-modifying code.
• Do not place data in the code segment.
• Calculate store addresses as soon as possible.
• Avoid instructions that contain 4 or more micro-ops or instructions that are more than 7 bytes long. If possible, use instructions that require 1 micro-op.
• Cleanse partial registers before calling callee-save procedures.

14.1.2. Guidelines for Optimizing MMX™ Code

Use the following guidelines to optimize MMX™ code:

• Do not intermix MMX™ instructions and floating-point instructions.
• Use the opcode reg, mem instruction format whenever possible. This format helps to free registers and reduce clocks without generating unnecessary loads.
• Put an EMMS instruction at the end of all MMX™ code sections that you know will transition to floating-point code.
• Optimize data cache bandwidth to MMX™ registers.

14.1.3. Guidelines for Optimizing Floating-Point Code

Use the following guidelines to optimize floating-point code:

• Understand how the compiler handles floating-point code. Look at the assembly dump and see what transforms are already performed on the program.
• Study the loop nests in the application that dominate the execution time. Determine why the compiler is not creating the fastest code. For example, look for dependences that can be resolved by rearranging code.
• Look for and correct situations known to cause slow execution of floating-point code, such as:
  — Large memory bandwidth requirements.
  — Poor cache locality.
  — Long-latency floating-point arithmetic operations.
• Do not use more precision than is necessary. Single precision (32-bits) is faster on some operations and consum...
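
As a rough illustration of the compiler-friendly coding guidelines in Section 14.1.1 (no global variables, the “const” modifier, and minimum-size data types), the following C sketch adds two byte arrays with unsigned saturation. The function name add_saturate_u8 and its parameters are hypothetical and do not come from the manual. Because each element is a single byte, eight elements fit in one 64-bit MMX™ register, and the loop body matches what the PADDUSB instruction does to packed bytes. Whether a given compiler actually vectorizes this loop depends on the compiler and its options, so treat this as a sketch under those assumptions rather than a guaranteed optimization.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the manual): element-wise unsigned
 * saturating add of two byte arrays.
 *
 * - All state is passed as parameters; no global variables are used.
 * - The inputs are "const", and all pointers are "restrict"-qualified
 *   (C99), which promises the compiler that the arrays do not overlap.
 * - The element type is uint8_t, the smallest type that holds the data,
 *   so eight elements fit in a 64-bit register.
 */
void add_saturate_u8(uint8_t *restrict dst,
                     const uint8_t *restrict a,
                     const uint8_t *restrict b,
                     size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);

    for (size_t i = 0; i < n; ++i) {
        /* Widen before adding so the sum cannot wrap around. */
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];

        /* Clamp to 255 instead of wrapping (unsigned saturation). */
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

A caller would pass three non-overlapping buffers of equal length, for example add_saturate_u8(out, a, b, 4). Using a wider element type such as a 32-bit integer would leave the result unchanged but would fit only two elements in a 64-bit register.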