...tput. Figure 12-1 illustrates the asymmetric processing of the SSE3 instruction ADDSUBPD. Figure 12-2 illustrates the horizontal data movement of the SSE3 instruction HADDPD.

[Figure 12-1. Asymmetric Processing in ADDSUBPD. Operands X1, X0 and Y1, Y0; ADD produces X1 + Y1 and SUB produces X0 - Y0.]

12-2 Vol. 1 PROGRAMMING WITH SSE3 AND SUPPLEMENTAL SSE3

[Figure 12-2. Horizontal Data Movement in HADDPD. Operands X1, X0 and Y1, Y0; the two ADD operations produce Y0 + Y1 and X0 + X1.]

12.2 OVERVIEW OF SSE3 INSTRUCTIONS

SSE3 extensions include 13 instructions. See:

- Section 12.3, "SSE3 Instructions," provides an introduction to individual SSE3 instructions.
- Intel 64 and IA-32 Architectures Software Developer's Manual, Volumes 2A & 2B, provide detailed information on individual instructions.
- Chapter 12, "System Programming for Streaming SIMD Instruction Sets," in the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A, gives guidelines for integrating SSE/SSE2/SSE3 extensions into an operating-system environment.

12.3 SSE3 INSTRUCTIONS

SSE3 instructions are grouped as follows:

- x87 FPU instruction
  -- One instruction that improves x87 FPU floating-point to integer conversion
- SIMD integer instruction
  -- One instruction that provides a specialized 128-bit unaligned data load
- SIMD floating-point instructions
  -- Three instructions that enhance LOAD/MOVE/DUPLICATE performance
  -- Two instructions that provide packed addition/subtraction
  -- Four instructions that provide horizontal addition/subtraction
- Thread synchronization instructions
  -- Two instructions that improve synchronization between multi-threaded agents

The instructions are discussed in more detail in the following paragraphs.

12.3.1 x87 FPU Instruction for Integer Conversion

The FISTTP instruction (x87 FPU Store Integer and Pop with Truncation) behaves like FISTP, but uses truncation regardless of what rounding mode is specified in the x87 FPU control word.
The instruction converts the top of stack (ST0) to an integer by truncation (rounding toward zero) and pops the stack. The FISTTP instruction is available in three precisions: short integer (word or 16-bit), integer (double word or 32-bit), and long integer (64-bit). With FISTTP, applications no longer need to change the x87 FPU control word (FCW) when truncation is required.

12.3.2 SIMD Integer Instruction for Specialized 128-bit Unaligned Data Load

The LDDQU instruction is a special 128-bit unaligned load designed to avoid cache line splits. If the address of a 16-byte load is on a 16-byte boundary, LDDQU loads the bytes requested. If the address of the load is not aligned on a 16-byte boundary, LDDQU loads a 32-byte block starting at the 16-byte aligned address immediately below the load request. It then extracts the requested 16 bytes. The instruction provides significant performance improvement on 128-bit unaligned memory accesses at the cost of some usage model restrictions.

12.3.3 SIMD Floating-Point Instructions That Enhance LOAD/MOVE/DUPLICATE Performance

The MOVSHDUP instructi...
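
The lane arithmetic shown in Figures 12-1 and 12-2 can be sketched in a few lines of Python. This is only an illustrative scalar model, not how the instructions are issued in practice: the function names `addsubpd` and `haddpd` are hypothetical, and each 128-bit operand is represented as a `(high, low)` tuple of Python floats (which are double precision, matching the packed-double forms).

```python
def addsubpd(x, y):
    """Scalar model of ADDSUBPD: add the high lanes, subtract the low lanes.

    x and y are (high, low) tuples standing in for the X1:X0 and Y1:Y0
    operands in Figure 12-1. This representation is an assumption made
    for illustration only.
    """
    x1, x0 = x
    y1, y0 = y
    return (x1 + y1, x0 - y0)


def haddpd(x, y):
    """Scalar model of HADDPD: each result lane is a sum within one operand.

    Following Figure 12-2, the high result lane is Y0 + Y1 and the low
    result lane is X0 + X1.
    """
    x1, x0 = x
    y1, y0 = y
    return (y0 + y1, x0 + x1)


x = (1.0, 2.0)    # X1 = 1.0, X0 = 2.0
y = (10.0, 20.0)  # Y1 = 10.0, Y0 = 20.0
print(addsubpd(x, y))  # (X1 + Y1, X0 - Y0)
print(haddpd(x, y))    # (Y0 + Y1, X0 + X1)
```

The contrast between the two functions is the point of the figures: ADDSUBPD applies different operations to different lanes of the two operands, while HADDPD combines the two lanes inside each operand.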
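
To make the difference described in Section 12.3.1 concrete, the following sketch compares truncation with round-to-nearest-even conversion. It is a rough model under stated assumptions: the names `fisttp_model` and `fistp_nearest_model` are hypothetical, Python's built-in `round` stands in for FISTP only when the control word selects round-to-nearest (other rounding modes are not modeled), and out-of-range or non-finite inputs are not handled.

```python
import math


def fisttp_model(st0: float) -> int:
    """Model of FISTTP: always truncate (round toward zero)."""
    return math.trunc(st0)


def fistp_nearest_model(st0: float) -> int:
    """Model of FISTP under round-to-nearest-even only.

    Python's round() rounds halfway cases to the nearest even integer,
    which is used here as a stand-in for that one rounding mode.
    """
    return round(st0)


for value in (2.7, -2.7, 2.5, 3.5):
    print(value, fisttp_model(value), fistp_nearest_model(value))
```

For 2.7 the truncating model yields 2 while the round-to-nearest model yields 3, which is the kind of difference that previously required changing the rounding mode before the store.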
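
The address arithmetic described for LDDQU in Section 12.3.2 can be sketched as follows. This models only the behavior stated in the text (an aligned 16-byte load, or a 32-byte aligned block load followed by extraction); the function name `lddqu_model` and the use of a `bytes` object as memory are assumptions for illustration, and the sketch assumes the buffer is large enough to hold the full 32-byte block.

```python
def lddqu_model(memory: bytes, addr: int) -> bytes:
    """Return the 16 bytes at addr, following the steps in Section 12.3.2."""
    if addr % 16 == 0:
        # Aligned case: load exactly the 16 bytes requested.
        return memory[addr:addr + 16]

    # Unaligned case: find the 16-byte aligned address immediately below
    # the request and load a 32-byte block starting there.
    base = addr & ~0xF
    block = memory[base:base + 32]

    # Extract the requested 16 bytes from inside the block.
    offset = addr - base
    return block[offset:offset + 16]


memory = bytes(range(64))             # hypothetical 64-byte memory image
aligned = lddqu_model(memory, 16)     # takes the aligned path
unaligned = lddqu_model(memory, 21)   # block starts at 16, offset 5
print(aligned == memory[16:32], unaligned == memory[21:37])
```

Both paths return the same bytes a plain 16-byte read would; the difference is only in which aligned region is touched, which also means the unaligned path reads bytes outside the requested 16.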
This note was uploaded on 10/01/2013 for the course CPE 103 taught by Professor Watlins during the Winter '11 term at Mississippi State.