Matrix multiply: performance

for i = 1 to n
  for j = 1 to n
    C[i,j] = 0
    for k = 1 to n
      C[i,j] += A[i,k]*B[k,j]

B is accessed in column order. If arrays are stored in row-major order (as in C), cache lines do not help, which can cause a cache miss on every access to B.

Solution: transpose B

Tiling
• Instead of reading a whole row of A and computing n full inner products (row of A times column of B), we can read a block of A and compute smaller inner products with sub-columns of B.
• These partial products are then added up.

Conventional matrix multiply
While computing row i of C, all elements of B are used once, while all elements of row A[i] are used n times.
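As a rough illustration of the transpose fix mentioned earlier, the sketch below puts the conventional triple loop next to a variant that first copies B into its transpose. It assumes square n x n matrices of double stored row-major in flat arrays with 0-based indices. The function names matmul_naive and matmul_transposed are made up for this example and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Conventional multiply, as in the pseudocode but 0-based.
 * A, B, C are n x n, row-major (assumed layout for this sketch). */
void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++) {
                /* B[k*n + j] jumps n elements per step: column-order access. */
                sum += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = sum;
        }
    }
}

/* Same product, but B is copied into Bt = transpose(B) first, so the
 * inner loop reads both operands with unit stride.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int matmul_transposed(size_t n, const double *A, const double *B, double *C)
{
    assert(n > 0);

    double *Bt = malloc(n * n * sizeof *Bt);
    if (Bt == NULL) {
        return -1;
    }

    /* Bt[j][k] = B[k][j] */
    for (size_t k = 0; k < n; k++) {
        for (size_t j = 0; j < n; j++) {
            Bt[j * n + k] = B[k * n + j];
        }
    }

    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++) {
                /* Row i of A and row j of Bt are both contiguous. */
                sum += A[i * n + k] * Bt[j * n + k];
            }
            C[i * n + j] = sum;
        }
    }

    free(Bt);
    return 0;
}
```

Both functions compute the same C. Whether the transposed variant is actually faster depends on n and on the cache sizes of the machine, so it is worth timing both before drawing conclusions.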
A[i,*] may fit in the cache, B will probably not!

Tiling A and B
• A k x k tile of A (which can fit in the cache) block multiplies...
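The tiling idea can be sketched roughly as follows. This is one possible loop arrangement under stated assumptions, not necessarily the scheme the slides go on to describe: matrices are square n x n doubles in row-major flat arrays, indices are 0-based, and the tile edge is called t here (the slides call it k) to avoid clashing with the loop index k. The names matmul_tiled and min_size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_size(size_t a, size_t b)
{
    return a < b ? a : b;
}

/* C = A * B for n x n row-major matrices, processed in t x t tiles.
 * Each (ii, kk, jj) step multiplies one tile of A by one tile of B and
 * adds the partial products into the matching tile of C. */
void matmul_tiled(size_t n, size_t t, const double *A, const double *B,
                  double *C)
{
    assert(t > 0);

    /* Partial products are accumulated, so C must start at zero. */
    for (size_t x = 0; x < n * n; x++) {
        C[x] = 0.0;
    }

    for (size_t ii = 0; ii < n; ii += t) {
        for (size_t kk = 0; kk < n; kk += t) {
            for (size_t jj = 0; jj < n; jj += t) {
                /* Clamp tile bounds so n need not be a multiple of t. */
                size_t i_end = min_size(ii + t, n);
                size_t k_end = min_size(kk + t, n);
                size_t j_end = min_size(jj + t, n);

                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < j_end; j++) {
                            C[i * n + j] += a * B[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

A reasonable starting point is to pick t so that the tiles in use at one time (one each from A, B and C) fit in the cache together, then tune by measurement. Setting t >= n reduces this to a single untiled pass.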
 Fall '08
 Staff
 CPU cache, Hierarchical storage management, Locality of reference
