# Standard_Answer_Joshua - Ohio State University CSE 721...


Ohio State University CSE 721
Programming Assignment 1: SSE-Intrinsics
Author: Josh Mahaffey
February 27, 2011

Answer to Question 1:

Matrix-matrix multiplication typically requires a triple-nested for-loop. Consider the multiplication of two input matrices A and B to obtain a result matrix C. Each entry of C, which we denote as $C_{ij}$, is the dot product of one row of A (i.e. the row $A_{i,1:K}$) and one column of B (i.e. the column $B_{1:K,j}$). The typical pseudocode for matrix-matrix multiplication is given in Algorithm 1.

    Algorithm 1: General Matrix-Matrix Multiply Routine

    procedure Matrix-Matrix Multiply()
        for i = 1 to Nrow_c do
            for j = 1 to Ncol_c do
                C(i,j) := 0
                for k = 1 to K do
                    C(i,j) := C(i,j) + A(i,k) * B(k,j)
                end for
            end for
        end for
    end procedure

When using intrinsics such as SSE, we can shorten one of the for-loops by a factor of 16/sizeof(dataType), where dataType is one of int, float, double, etc. For this particular problem, we are interested in performing a 4 x 4 float matrix-matrix product. Since 16/sizeof(float) = 4, a loop of length 4 collapses into a single SSE operation, so we can essentially hide one of the nested loops shown in Algorithm 1.

To determine which loop is best to hide, we need to consider how to obtain the best performance from SSE intrinsics. SSE intrinsics perform best when the data that is retrieved and/or operated on lies in contiguous memory. When performing the dot-product operation between a row of A and a column of B, the column of B will not be contiguous in memory (assuming row-major storage). Furthermore, the "dot-product" form of matrix-matrix multiplication requires a reduction operation across the lanes of an SSE register in order to obtain a single element of the C matrix. For these reasons, we do not want to use SSE intrinsics for the k-loop.

With this in mind, we can examine how to use SSE intrinsics for the j-loop of the above matrix-matrix multiply algorithm. The structure of such an SSE-based matrix-matrix multiply is given in Algorithm 2. Note that the inner index j in Algorithm 2 runs over the columns of A, so it plays the role of k in Algorithm 1; the original j-loop over the columns of C is the one hidden inside the SSE operations.

    Algorithm 2: SSE (SAXPY) Matrix-Matrix Multiply Routine

    procedure SSE Matrix-Matrix Multiply()
        for i = 1 to Nrow_c do
            y_sse := 0.0
            for j = 1 to Ncol_a do
                b_sse := LOAD(B(j, 1:Ncol_b))
                ▷ Broadcast a single element of A to all lanes of a_sse
                a_sse := BroadcastToRegisters(A(i,j))
                y_sse := y_sse + MUL(b_sse, a_sse)
            end for
            STORE(y_sse, C(i, 1:Ncol_c))
        end for
    end procedure

Algorithm 2 is similar to the SAXPY form of matrix-matrix multiplication, where one element of A is multiplied by the corresponding row of B and accumulated into the corresponding row of C. The performance of this algorithm should allow the SSE intrinsics to exploit the packed data structure...
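The two routines above could be sketched in C roughly as follows. This is an illustration, not the author's submitted code: the function names `matmul4x4_scalar` and `matmul4x4_sse` are hypothetical, the matrices are assumed to be stored as flat row-major arrays of 16 floats, and unaligned loads and stores are used so that no alignment is assumed for the arrays.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_set1_ps, ... */

/* Scalar reference (Algorithm 1 style): C = A * B for 4x4 matrices.
 * Assumption: flat row-major storage, element (i,j) lives at index 4*i + j. */
void matmul4x4_scalar(const float *A, const float *B, float *C)
{
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) {
                sum += A[4 * i + k] * B[4 * k + j];
            }
            C[4 * i + j] = sum;
        }
    }
}

/* SSE sketch (Algorithm 2 style): row i of C is accumulated as
 * sum over k of A(i,k) * (row k of B). The loop over the four
 * columns of C is hidden inside the 4-wide SSE operations. */
void matmul4x4_sse(const float *A, const float *B, float *C)
{
    for (int i = 0; i < 4; ++i) {
        __m128 y = _mm_setzero_ps();                 /* y_sse := 0.0 */
        for (int k = 0; k < 4; ++k) {
            __m128 b = _mm_loadu_ps(&B[4 * k]);      /* contiguous row k of B */
            __m128 a = _mm_set1_ps(A[4 * i + k]);    /* broadcast A(i,k) to all lanes */
            y = _mm_add_ps(y, _mm_mul_ps(b, a));     /* y_sse += b_sse * a_sse */
        }
        _mm_storeu_ps(&C[4 * i], y);                 /* store row i of C */
    }
}
```

Both functions take the same arguments, so the scalar version can serve as a check on the SSE version. If the arrays are known to be 16-byte aligned, `_mm_load_ps` and `_mm_store_ps` could replace the unaligned variants.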


## This note was uploaded on 03/08/2012 for the course CSE 721 taught by Professor Saday during the Winter '11 term at Ohio State.
