Standard_Answer_Ashari

----------------------------------------------------------------------
Arash Ashari
200105422
CSE721 Programming Assignment #1
Winter 2011
----------------------------------------------------------------------

1) The following is my code, in which I first unrolled the loops and then used SSE:

    #include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_load1_ps, _mm_mul_ps, ... */

    /* C = A * B for row-major 4x4 float matrices.
       B is read with aligned loads, so it must be 16-byte aligned;
       C is written with unaligned stores. */
    static inline void mul4x4sse(float *A, float *B, float *C)
    {
        __m128 AA, BB, CC1, CC2, CC3, CC4;

        /* Row 0 of B, scaled by column 0 of A (one broadcast per row of C) */
        BB = _mm_load_ps(B);
        AA = _mm_load1_ps(A);
        CC1 = _mm_mul_ps(AA, BB);
        AA = _mm_load1_ps(A + 4);
        CC2 = _mm_mul_ps(AA, BB);
        AA = _mm_load1_ps(A + 8);
        CC3 = _mm_mul_ps(AA, BB);
        AA = _mm_load1_ps(A + 12);
        CC4 = _mm_mul_ps(AA, BB);

        /* Row 1 of B, scaled by column 1 of A, accumulated */
        BB = _mm_load_ps(B + 4);
        AA = _mm_load1_ps(A + 1);
        CC1 = _mm_add_ps(CC1, _mm_mul_ps(AA, BB));
        AA = _mm_load1_ps(A + 5);
        CC2 = _mm_add_ps(CC2, _mm_mul_ps(AA, BB));
        AA = _mm_load1_ps(A + 9);
        CC3 = _mm_add_ps(CC3, _mm_mul_ps(AA, BB));
        AA = _mm_load1_ps(A + 13);
        CC4 = _mm_add_ps(CC4, _mm_mul_ps(AA, BB));

        /* Row 2 of B, scaled by column 2 of A, accumulated */
        BB = _mm_load_ps(B + 8);
        AA = _mm_load1_ps(A + 2);
        CC1 = _mm_add_ps(CC1, _mm_mul_ps(AA, BB));
        AA = _mm_load1_ps(A + 6);
        CC2 = _mm_add_ps(CC2, _mm_mul_ps(AA, BB));
        AA = _mm_load1_ps(A + 10);
        CC3 = _mm_add_ps(CC3, _mm_mul_ps(AA, BB));
        AA = _mm_load1_ps(A + 14);
        CC4 = _mm_add_ps(CC4, _mm_mul_ps(AA, BB));

        /* Row 3 of B, scaled by column 3 of A; each row of C is stored
           as soon as it is complete */
        BB = _mm_load_ps(B + 12);
        AA = _mm_load1_ps(A + 3);
        CC1 = _mm_add_ps(CC1, _mm_mul_ps(AA, BB));
        _mm_storeu_ps(C, CC1);
        AA = _mm_load1_ps(A + 7);
        CC2 = _mm_add_ps(CC2, _mm_mul_ps(AA, BB));
        _mm_storeu_ps(C + 4, CC2);
        AA = _mm_load1_ps(A + 11);
        CC3 = _mm_add_ps(CC3, _mm_mul_ps(AA, BB));
        _mm_storeu_ps(C + 8, CC3);
        AA = _mm_load1_ps(A + 15);
        CC4 = _mm_add_ps(CC4, _mm_mul_ps(AA, BB));
        _mm_storeu_ps(C + 12, CC4);
    }

The results of my runs are as follows:

    Compiler         Code Version   GFLOPS     Execution Time
    icc -O           Default        1.646578   0.155474
    icc -O           SSE            3.110456   0.082303
    gcc -O -msse3    Default        0.661215   0.387166
    gcc -O -msse3    SSE            2.993668   0.085514
    pgcc -O          Default        0.590471   0.433552
    pgcc -O          SSE            1.387871   0.184455

As the table shows, the compilers rank by achieved performance, for both the default and the SSE version, as follows: icc > gcc > pgcc. From the GFLOPS column, the SSE version is about 1.9x faster than the default code with icc, about 4.5x faster with gcc, and about 2.4x faster with pgcc.

----------------------------------------------------------------------

2) For this question, I kept the same loops and the same computations and did not try to reach the best possible performance, but I used SSE registers and instructions. The following is my code:

    #include <pmmintrin.h>  /* SSE3: _mm_hadd_ps */

    /* A and x are read with aligned loads, so they must be 16-byte aligned. */
    static inline void mvl8x8dotsse(float *A, float *x, float *y)
    {
        __m128 xx1, xx2, AA, yy;
        int i;

        /* Rows 0-3: only columns 0-3 of A are multiplied with x[0..3] */
        xx1 = _mm_load_ps(x);
        for (i = 0; i < 4; i++) {
            AA = _mm_load_ps(A + i * 8);
            yy = _mm_mul_ps(AA, xx1);
            /* Two horizontal adds leave the sum of all 4 lanes in every lane */
            yy = _mm_hadd_ps(yy, yy);
            yy = _mm_hadd_ps(yy, yy);
            _mm_store_ss(y + i, yy);
        }

        /* Rows 4-7: all 8 columns of A are multiplied with x[0..7] */
        xx2 = _mm_load_ps(x + 4);
        for (i = 4; i < 8; i++) {
            AA = _mm_load_ps(A + i * 8);
            yy = _mm_mul_ps(AA, xx1);
            AA = _mm_load_ps(A + i * 8 + 4);
            yy = _mm_add_ps(yy, _mm_mul_ps(AA, xx2));
            yy = _mm_hadd_ps(yy, yy);
            yy = _mm_hadd_ps(yy, yy);
            _mm_store_ss(y + i, yy);
        }
    }

    static inline void mvl8x8saxpysse(float *A, float *x, float *y)
    {
        __m128 xx, AA, yy[8];...
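
One way to check the SSE dot-product routine from question 2 is to compare it against a scalar reference. The sketch below is not part of the original submission: the name mvl8x8_ref is hypothetical, and it assumes A is stored row-major as 8x8 floats. It mirrors the access pattern of mvl8x8dotsse as written, where rows 0-3 read only columns 0-3 and rows 4-7 read all 8 columns.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar reference (an illustration, not the author's code).
   Assumes A is a row-major 8x8 float matrix.
   Same access pattern as mvl8x8dotsse: rows 0-3 use columns 0-3 only,
   rows 4-7 use all 8 columns. */
void mvl8x8_ref(const float *A, const float *x, float *y)
{
    assert(A != NULL && x != NULL && y != NULL);

    for (int i = 0; i < 8; i++) {
        int ncols = (i < 4) ? 4 : 8;   /* number of columns read in row i */
        float sum = 0.0f;
        for (int j = 0; j < ncols; j++)
            sum += A[i * 8 + j] * x[j]; /* dot product of row i with x */
        y[i] = sum;
    }
}
```

With every entry of A set to 1 and x = (1, 2, ..., 8), this gives y[0..3] = 10 and y[4..7] = 36. The SSE routine can be run on the same 16-byte-aligned inputs and compared element by element against these values.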