------------------------------------------------------------------------------------------------------------------------------------------
Arash Ashari
200105422
CSE721 Programming Assignment #1
Winter 2011
------------------------------------------------------------------------------------------------------------------------------------------

1) The following shows my code, in which I first unrolled the loops and then used SSE:

static inline void mul4x4sse(float *A, float *B, float *C)
{
    __m128 AA, BB, CC1, CC2, CC3, CC4;

    /* Broadcast A[i][0] for each row i and multiply by row 0 of B */
    BB  = _mm_load_ps(B);
    AA  = _mm_load1_ps(A);
    CC1 = _mm_mul_ps(AA, BB);
    AA  = _mm_load1_ps(A+4);
    CC2 = _mm_mul_ps(AA, BB);
    AA  = _mm_load1_ps(A+8);
    CC3 = _mm_mul_ps(AA, BB);
    AA  = _mm_load1_ps(A+12);
    CC4 = _mm_mul_ps(AA, BB);

    /* Broadcast A[i][1] and accumulate row 1 of B */
    BB  = _mm_load_ps(B+4);
    AA  = _mm_load1_ps(A+1);
    CC1 = _mm_add_ps(CC1, _mm_mul_ps(AA, BB));
    AA  = _mm_load1_ps(A+5);
    CC2 = _mm_add_ps(CC2, _mm_mul_ps(AA, BB));
    AA  = _mm_load1_ps(A+9);
    CC3 = _mm_add_ps(CC3, _mm_mul_ps(AA, BB));
    AA  = _mm_load1_ps(A+13);
    CC4 = _mm_add_ps(CC4, _mm_mul_ps(AA, BB));

    /* Broadcast A[i][2] and accumulate row 2 of B */
    BB  = _mm_load_ps(B+8);
    AA  = _mm_load1_ps(A+2);
    CC1 = _mm_add_ps(CC1, _mm_mul_ps(AA, BB));
    AA  = _mm_load1_ps(A+6);
    CC2 = _mm_add_ps(CC2, _mm_mul_ps(AA, BB));
    AA  = _mm_load1_ps(A+10);
    CC3 = _mm_add_ps(CC3, _mm_mul_ps(AA, BB));
    AA  = _mm_load1_ps(A+14);
    CC4 = _mm_add_ps(CC4, _mm_mul_ps(AA, BB));

    /* Broadcast A[i][3], accumulate row 3 of B, and store each row of C */
    BB  = _mm_load_ps(B+12);
    AA  = _mm_load1_ps(A+3);
    CC1 = _mm_add_ps(CC1, _mm_mul_ps(AA, BB));
    _mm_storeu_ps(C, CC1);
    AA  = _mm_load1_ps(A+7);
    CC2 = _mm_add_ps(CC2, _mm_mul_ps(AA, BB));
    _mm_storeu_ps(C+4, CC2);
    AA  = _mm_load1_ps(A+11);
    CC3 = _mm_add_ps(CC3, _mm_mul_ps(AA, BB));
    _mm_storeu_ps(C+8, CC3);
    AA  = _mm_load1_ps(A+15);
    CC4 = _mm_add_ps(CC4, _mm_mul_ps(AA, BB));
    _mm_storeu_ps(C+12, CC4);
}

The output of my executions is as follows:

Compiler         Code Version    GFLOPS      Execution Time
icc -O           Default         1.646578    0.155474
                 SSE             3.110456    0.082303
gcc -O -msse3    Default         0.661215    0.387166
                 SSE             2.993668    0.085514
pgcc -O          Default         0.590471    0.433552
                 SSE             1.387871    0.184455

As the table shows, the order of performance improvement for the different compilers is: icc > gcc > pgcc.

------------------------------------------------------------------------------------------------------------------------------------------

2) For this question I just used the same loops to do the same computations; I did not try to reach the best performance, but I did use SSE registers and instructions. The following shows my code:

static inline void mvl8x8dotsse(float *A, float *x, float *y)
{
    __m128 xx1, xx2, AA, yy;
    int i;

    /* Rows 0-3: dot product of the first four columns only */
    xx1 = _mm_load_ps(x);
    for (i = 0; i < 4; i++) {
        AA = _mm_load_ps(A+i*8);
        yy = _mm_mul_ps(AA, xx1);
        yy = _mm_hadd_ps(yy, yy);
        yy = _mm_hadd_ps(yy, yy);
        _mm_store_ss(y+i, yy);
    }

    /* Rows 4-7: dot product over all eight columns */
    xx2 = _mm_load_ps(x+4);
    for (i = 4; i < 8; i++) {
        AA = _mm_load_ps(A+i*8);
        yy = _mm_mul_ps(AA, xx1);
        AA = _mm_load_ps(A+i*8+4);
        yy = _mm_add_ps(yy, _mm_mul_ps(AA, xx2));
        yy = _mm_hadd_ps(yy, yy);
        yy = _mm_hadd_ps(yy, yy);
        _mm_store_ss(y+i, yy);
    }
}

static inline void mvl8x8saxpysse(float *A, float *x, float *y)
{
    __m128 xx, AA, yy[8];
    int i, j;

    /* Column 0 initializes all eight partial sums */
    xx = _mm_load_ss(x);
    for (j = 0; j < 8; j++) {
        AA = _mm_load_ss(A+j*8);
        yy[j] = _mm_mul_ss(AA, xx);
    }
    _mm_store_ss(y, yy[0]);

    /* Column i updates rows j >= i; y[i] is final after column i */
    for (i = 1; i < 8; i++) {
        xx = _mm_load_ss(x+i);
        for (j = i; j < 8; j++) {
            AA = _mm_load_ss(A+j*8+i);
            yy[j] = _mm_add_ss(yy[j], _mm_mul_ss(AA, xx));
        }
        _mm_store_ss(y+i, yy[i]);
    }
}
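The two _mm_hadd_ps calls in the dot-product kernel perform the horizontal reduction of a 4-float partial product down to a single scalar; _mm_hadd_ps is an SSE3 instruction, which is why gcc needs the -msse3 flag. A minimal sketch of that reduction (the helper name hsum_ps is my own, not part of the assignment code):

```c
#include <immintrin.h>

/* Horizontal sum of a 4-float vector via two SSE3 hadd steps, as in
   the dot-product kernel: (a,b,c,d) -> (a+b, c+d, a+b, c+d)
   -> (a+b+c+d, ...). The target attribute enables SSE3 even when the
   compiler's default baseline is plain SSE2. */
__attribute__((target("sse3")))
static float hsum_ps(__m128 v)
{
    v = _mm_hadd_ps(v, v);
    v = _mm_hadd_ps(v, v);
    return _mm_cvtss_f32(v);   /* all four lanes now hold the sum */
}
```

For example, hsum_ps applied to the vector (1, 2, 3, 4) yields 10, matching the scalar accumulation a dot product would perform.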
The output of my executions is as follows:

Compiler         Product Version    Code Version    GFLOPS      Execution Time
icc -O           Dot Product        Default         0.671676    0.214389
                                    SSE             2.787132    0.051666
                 SAXPY              Default         0.694699    0.207284
                                    SSE             2.680816    0.053715
gcc -O -msse3    Dot Product        Default         0.351105    0.410134
                                    SSE             3.210404    0.044854
                 SAXPY              Default         0.558689    0.257746
                                    SSE             0.652735    0.220610
pgcc -O          Dot Product        Default         0.356397    0.404044
                                    SSE             0.674113    0.213614
                 SAXPY              Default         0.654402    0.220048
                                    SSE             0.391669    0.367657

As the table shows, the performance improvement differs between compilers and between the dot-product and SAXPY versions. With icc we get roughly the same improvement for the dot product and for SAXPY. gcc shows a much larger improvement on the dot product and a much smaller one on SAXPY. pgcc shows only a small improvement on the dot product, and its SSE SAXPY result is actually worse than the original.
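As a cross-check of what the SAXPY kernel computes, the same loop structure can be written in scalar code. The SSE version only touches entries A[j*8+i] with j >= i (the lower triangle of the 8x8 row-major matrix), so this mirror keeps the same bounds; the function name is my own:

```c
/* Scalar mirror of mvl8x8saxpysse: column i of the 8x8 row-major
   matrix A (rows j >= i, the same bounds as the SSE loop) is scaled
   by x[i] and accumulated into y. */
static void mvl8x8saxpyref(const float *A, const float *x, float *y)
{
    int i, j;
    for (j = 0; j < 8; j++)          /* column 0 initializes y */
        y[j] = A[j*8] * x[0];
    for (i = 1; i < 8; i++)
        for (j = i; j < 8; j++)      /* columns 1..7 accumulate */
            y[j] += A[j*8 + i] * x[i];
}
```

With A and x filled with ones this yields y[j] = j + 1, since row j receives exactly one contribution from each column 0..j.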
