SSE-Intro - 10/11/10 Agenda Intel SSE SIMD Instructions...


10/11/10, Fall 2010 -- Lecture #18

Agenda
- Intel SSE SIMD Instructions
- Administrivia
- Technology Break
- SSE in C

Single Instruction/Multiple Data Stream
- Single Instruction, Multiple Data streams (SIMD)
- A computer that applies a single instruction stream to multiple data streams, for operations that may be naturally parallelized, e.g., an array processor or Graphics Processing Unit (GPU)

"Advanced Digital Media Boost"
- To improve performance, Intel's SIMD instructions fetch one instruction and do the work of multiple instructions
- MMX (MultiMedia eXtension, Pentium MMX and Pentium II processor families)
- SSE (Streaming SIMD Extension, Pentium III and beyond)

Example: SIMD Array Processing

    for each f in array
        f = sqrt(f)

Scalar version:

    for each f in array {
        load f to the floating-point register
        calculate the square root
        write the result from the register to memory
    }

SIMD version:

    for each 4 members in array {
        load 4 members to the SSE register
        calculate 4 square roots in one operation
        write the result from the register to memory
    }
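The SIMD loop in the pseudocode above could be written in C roughly as follows. This is only a sketch, assuming an x86 compiler that provides the SSE intrinsics header `<xmmintrin.h>`; the function name `sqrt_array_sse` is hypothetical and does not come from the lecture. Unaligned loads and stores are used so the array needs no special alignment, and leftover elements (when the length is not a multiple of 4) are handled one at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_sqrt_ps, ... */

/* Hypothetical helper (not from the lecture): replace each element of
   a[0..n-1] with its square root, four floats per SSE operation. */
void sqrt_array_sse(float *a, size_t n)
{
    assert(a != NULL || n == 0);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(a + i);  /* load 4 members into an XMM register */
        v = _mm_sqrt_ps(v);              /* calculate 4 square roots in one operation */
        _mm_storeu_ps(a + i, v);         /* write the 4 results back to memory */
    }
    for (; i < n; i++) {                 /* leftover elements, one at a time */
        _mm_store_ss(a + i, _mm_sqrt_ss(_mm_load_ss(a + i)));
    }
}
```

Calling `sqrt_array_sse(a, 6)` on a six-element array would process the first four elements with one packed square root and the remaining two with the scalar form of the same instruction.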
SSE Instruction Categories for Multimedia Support
- SSE2 and later support wider data types to allow 16 x 8-bit and 8 x 16-bit operands

Intel Architecture 128-Bit SIMD Data Types
- Note: in Intel Architecture (unlike MIPS) a word is 16 bits
- Single precision FP: double words (32 bits)
- Double precision FP: quad words (64 bits)

XMM Registers
- Architecture extended with eight 128-bit data registers: the XMM registers
- 64-bit address architecture: sixteen 128-bit registers available (adds XMM8 to XMM15)
- E.g., the 128-bit packed single-precision floating-point data type (four doublewords) allows four single-precision operations to be performed simultaneously

SSE/SSE2 Floating Point Instructions
- xmm: one operand is a 128-bit SSE2 register
- mem/xmm: other operand is in memory or an SSE2 register
- {SS} Scalar Single precision FP: one 32-bit operand in a 128-bit register
- {PS} Packed Single precision FP: four 32-bit operands in a 128-bit register
- {SD} Scalar Double precision FP: one 64-bit operand in a 128-bit register
- {PD} Packed Double precision FP: two 64-bit operands in a 128-bit register
- {A} means the 128-bit operand is aligned in memory
- {U} means the 128-bit operand is unaligned in memory
- {H} means move the high half of the 128-bit operand
- {L} means move the low half of the 128-bit operand

Example: Add Two Quad Word Vectors
Computation to be performed:

    vec_res.x = v1.x + v2.x;
    vec_res.y = v1.y + v2.y;
    vec_res.z = v1.z + v2.z;
    vec_res.w = v1.w + v2.w;

SSE instruction sequence:

    movaps xmm0, address-of-v1       ; xmm0 = v1.w | v1.z | v1.y | v1.x
    addps  xmm0, address-of-v2       ; xmm0 = v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x
    movaps address-of-vec_res, xmm0

- movaps (mov a ps): move from mem to XMM register, memory aligned, packed single precision
- addps (add ps): add from mem to XMM register, packed single precision
- movaps (mov a ps): move from XMM register to mem, memory aligned, packed single precision

Displays and Pixels
- Each coordinate in the frame buffer on the left determines the shade of the corresponding coordinate for the raster scan CRT display on the right
- Pixel (X0, Y0) contains bit pattern 0011, a lighter shade on the screen than the bit pattern 1101 in pixel (X1, Y1)

Example: Image Converter
- Converts a BMP image to a YUV image format:
  - Read individual pixels from the BMP image, convert pixels into YUV format
  - Can pack the pixels and operate on a set of pixels with a single instruction
- E.g., a bitmap image consists of 8-bit monochrome pixels
  - Pack these pixel values in a 128-bit register (8 bits * 16 pixels); can operate on 16 values at a time
  - Significant performance boost

Example: Image Converter (continued)
- FMADDPS: multiply and add packed single precision floating point instruction
- One of the typical operations computed in transformations (e.g., DFT or FFT):

    P = \sum_{n=1}^{N} f(n) \times x(n)

Example: Image Converter (continued)
- Floating point numbers f(n) and x(n) in src1 and src2; p in dest
- C implementation for N = 4 (128 bits):

    for (int i = 0; i < 4; i++)
    {
        p = p + src1[i] * src2[i];
    }

- SSE2 instructions for the inner loop:

    // xmm0 = p, xmm1 = src1, xmm2 = src2
    mulps xmm1, xmm2
    addps xmm0, xmm1

- SSE5 instruction accomplishes the same in one instruction:

    // xmm0 = p, xmm1 = src1, xmm2 = src2
    fmaddps xmm0, xmm1, xmm2, xmm0

- Note: the packed sequences leave four lane-wise results in xmm0; obtaining the single scalar p of the C loop still requires summing the four lanes
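The multiply-then-add pattern above (mulps followed by addps) can be reached from C through intrinsics. The following is a minimal sketch, assuming an x86 compiler with `<xmmintrin.h>`; the name `dot_sse` and its signature are illustrative and not from the lecture, and the length is assumed to be a multiple of 4. It also shows the final step the slide leaves implicit: adding the four lanes together to get the scalar sum P.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_mul_ps, _mm_add_ps */

/* Hypothetical helper (name and signature are illustrative): returns the
   sum of f[n] * x[n] over n elements. Assumes n is a multiple of 4. */
float dot_sse(const float *f, const float *x, size_t n)
{
    assert(n % 4 == 0);
    __m128 acc = _mm_setzero_ps();                  /* four partial sums, all 0 */
    for (size_t i = 0; i < n; i += 4) {
        __m128 prod = _mm_mul_ps(_mm_loadu_ps(f + i),
                                 _mm_loadu_ps(x + i));  /* mulps: 4 products */
        acc = _mm_add_ps(acc, prod);                    /* addps: accumulate per lane */
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);                      /* spill the four partial sums */
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

Storing the accumulator to a small array for the final sum keeps the sketch within plain SSE; later instruction set extensions offer horizontal-add instructions, but they are not needed here.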
Intel SSE Intrinsics
- Intrinsics are C functions and procedures for SSE instructions
- With intrinsics, can program using these instructions indirectly
- In most cases, one-to-one correspondence between SSE instructions and intrinsics

Example SSE Intrinsics
Vector data type: __m128d

Load and store operations:
    _mm_load_pd      MOVAPD   aligned, packed double
    _mm_store_pd     MOVAPD   aligned, packed double
    _mm_loadu_pd     MOVUPD   unaligned, packed double
    _mm_storeu_pd    MOVUPD   unaligned, packed double
Load and broadcast across vector:
    _mm_load1_pd     MOVSD + shuffling
Arithmetic:
    _mm_add_pd       ADDPD    add, packed double
    _mm_mul_pd       MULPD    multiply, packed double

(This slide also carries the footer "02/09/2010, CS267 - Lecture 7".)

Example: 2 x 2 Matrix Multiply
Definition of Matrix Multiply:

    C_{i,j} = (AB)_{i,j} = \sum_{k=1}^{2} A_{i,k} B_{k,j}

    | A1,1  A1,2 |   | B1,1  B1,2 |   | C1,1 = A1,1B1,1 + A1,2B2,1    C1,2 = A1,1B1,2 + A1,2B2,2 |
    | A2,1  A2,2 | x | B2,1  B2,2 | = | C2,1 = A2,1B1,1 + A2,2B2,1    C2,2 = A2,1B1,2 + A2,2B2,2 |

Example: 2 x 2 Matrix Multiply
Using the XMM registers
- 64-bit/double precision/two doubles per XMM reg
- Stored in memory in column order

    C1 = [ C1,1 | C2,1 ]      C2 = [ C1,2 | C2,2 ]
    A  = [ A1,i | A2,i ]
    B1 = [ Bi,1 | Bi,1 ]      B2 = [ Bi,2 | Bi,2 ]

Example: 2 x 2 Matrix Multiply
Initialization (i = 1)

    C1 = [ 0 | 0 ]            C2 = [ 0 | 0 ]
    A  = [ A1,1 | A2,1 ]
    B1 = [ B1,1 | B1,1 ]      B2 = [ B1,2 | B1,2 ]

- _mm_load_pd: loads A, which is stored in memory in column order
- _mm_load1_pd: SSE instruction that loads one double-precision value (64 bits) and stores it in both the high and low halves of the XMM register

Example: 2 x 2 Matrix Multiply
First iteration intermediate result (i = 1)

    C1 = [ 0+A1,1B1,1 | 0+A2,1B1,1 ]      C2 = [ 0+A1,1B1,2 | 0+A2,1B1,2 ]

    c1 = _mm_add_pd(c1,_mm_mul_pd(a,b1));
    c2 = _mm_add_pd(c2,_mm_mul_pd(a,b2));

Loads for the second iteration (i = 2)

    A  = [ A1,2 | A2,2 ]
    B1 = [ B2,1 | B2,1 ]      B2 = [ B2,2 | B2,2 ]

Example: 2 x 2 Matrix Multiply
Second iteration intermediate result (i = 2)

    C1 = [ A1,1B1,1+A1,2B2,1 | A2,1B1,1+A2,2B2,1 ] = [ C1,1 | C2,1 ]
    C2 = [ A1,1B1,2+A1,2B2,2 | A2,1B1,2+A2,2B2,2 ] = [ C1,2 | C2,2 ]

    c1 = _mm_add_pd(c1,_mm_mul_pd(a,b1));
    c2 = _mm_add_pd(c2,_mm_mul_pd(a,b2));

Example: 2 x 2 Matrix Multiply (Part 1 of 2)

    #include <stdio.h>
    // header file for SSE compiler intrinsics
    #include <emmintrin.h>

    // NOTE: vector registers will be represented in comments as v1 = [ a | b ]
    // where v1 is a variable of type __m128d and a,b are doubles

    int main(void) {
        // allocate A,B,C aligned on 16-byte boundaries
        double A[4] __attribute__ ((aligned (16)));
        double B[4] __attribute__ ((aligned (16)));
        double C[4] __attribute__ ((aligned (16)));
        int lda = 2;
        int i = 0;
        // declare a couple of 128-bit vector variables
        __m128d c1,c2,a,b1,b2;

        /* A = (note column order!)
           1 0
           0 1
        */
        A[0] = 1.0; A[1] = 0.0; A[2] = 0.0; A[3] = 1.0;

        /* B = (note column order!)
           1 3
           2 4
        */
        B[0] = 1.0; B[1] = 2.0; B[2] = 3.0; B[3] = 4.0;

        /* C = (note column order!)
           0 0
           0 0
        */
        C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;

Example: 2 x 2 Matrix Multiply (Part 2 of 2)

        // use aligned loads to set
        // c1 = [c_11 | c_21]
        c1 = _mm_load_pd(C+0*lda);
        // c2 = [c_12 | c_22]
        c2 = _mm_load_pd(C+1*lda);

        for (i = 0; i < 2; i++) {
            /* a =
               i = 0: [a_11 | a_21]
               i = 1: [a_12 | a_22]
            */
            a = _mm_load_pd(A+i*lda);
            /* b1 =
               i = 0: [b_11 | b_11]
               i = 1: [b_21 | b_21]
            */
            b1 = _mm_load1_pd(B+i+0*lda);
            /* b2 =
               i = 0: [b_12 | b_12]
               i = 1: [b_22 | b_22]
            */
            b2 = _mm_load1_pd(B+i+1*lda);

            /* c1 =
               i = 0: [c_11 + a_11*b_11 | c_21 + a_21*b_11]
               i = 1: [c_11 + a_12*b_21 | c_21 + a_22*b_21]
            */
            c1 = _mm_add_pd(c1,_mm_mul_pd(a,b1));
            /* c2 =
               i = 0: [c_12 + a_11*b_12 | c_22 + a_21*b_12]
               i = 1: [c_12 + a_12*b_22 | c_22 + a_22*b_22]
            */
            c2 = _mm_add_pd(c2,_mm_mul_pd(a,b2));
        }

        // store c1,c2 back into C for completion
        _mm_store_pd(C+0*lda,c1);
        _mm_store_pd(C+1*lda,c2);

        // print C
        printf("%g,%g\n%g,%g\n",C[0],C[2],C[1],C[3]);
        return 0;
    }

Summary
- Intel SSE SIMD Instructions
  - One instruction fetch that operates on multiple operands simultaneously
  - 128/64-bit XMM registers
- SSE Instructions in C
  - Embed the SSE machine instructions directly into C programs through use of intrinsics
  - Can achieve efficiency beyond that of an optimizing compiler
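As one illustration of the intrinsics approach summarized above, the 2 x 2 matrix multiply can be wrapped in a reusable function. This is a sketch under stated assumptions rather than the lecture's own code: the name `matmul2x2_sse` is hypothetical, the matrices are assumed to be stored in column order as in the example, and unaligned loads and stores (`_mm_loadu_pd`, `_mm_storeu_pd`) are swapped in for the aligned ones so callers do not have to align their arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_*_pd */

/* Hypothetical helper: C += A * B for 2 x 2 matrices of doubles stored in
   column order (element in row r, column c lives at index r + 2*c). */
void matmul2x2_sse(const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    const int lda = 2;
    __m128d c1 = _mm_loadu_pd(C + 0 * lda);          /* [c_11 | c_21] */
    __m128d c2 = _mm_loadu_pd(C + 1 * lda);          /* [c_12 | c_22] */
    for (int i = 0; i < 2; i++) {
        __m128d a  = _mm_loadu_pd(A + i * lda);      /* column i+1 of A */
        __m128d b1 = _mm_load1_pd(B + i + 0 * lda);  /* broadcast b_(i+1),1 */
        __m128d b2 = _mm_load1_pd(B + i + 1 * lda);  /* broadcast b_(i+1),2 */
        c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));      /* accumulate column 1 of C */
        c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));      /* accumulate column 2 of C */
    }
    _mm_storeu_pd(C + 0 * lda, c1);
    _mm_storeu_pd(C + 1 * lda, c2);
}
```

Because the function accumulates into C, the caller is expected to zero C first when a plain product is wanted, mirroring the initialization step in the lecture example.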