r performance than number of memory accesses!
• For large N, kij and ikj performance is almost constant: hardware prefetching recognizes their stride-1 access patterns.

Stephen Chong, Harvard University 32

Topics for today
• Cache performance metrics
• Discovering your cache's size and performance
• The “Memory Mountain”
• Matrix multiply, six ways
• Blocked matrix multiplication
• Exploiting locality in your programs

Using blocking to improve locality
• Blocked matrix multiplication
  • Break matrix into smaller blocks and perform independent multiplications on each block.
  • Improves locality by operating on one block at a time.
  • Best if each block can fit in the cache!
• Example: Break each matrix into four sub-blocks

    [ A11 A12 ]   [ B11 B12 ]   [ C11 C12 ]
    [ A21 A22 ] × [ B21 B22 ] = [ C21 C22 ]

  Key idea: Sub-blocks (i.e., Axy) can be treated just like scalars.

    C11 = A11B11 + A12B21    C12 = A11B12 + A12B22
    C21 = A21B11 + A22B21    C22 = A21B12 + A22B22

Blocked Matrix Multiply

    /* B is the block size (bsize in the text), assumed to be defined
       elsewhere; n is assumed to be a multiple of B. */
    void bmmm(int n, double a[n][n], double b[n][n], double c[n][n])
    {
        int i, j, k, i1, j1, k1;
        for (i = 0; i < n; i += B)
            for (j = 0; j < n; j += B)
                for (k = 0; k < n; k += B)
                    /* B x B mini matrix multiplications */
                    for (i1 = i; i1 < i+B; i1++)
                        for (j1 = j; j1 < j+B; j1++)
                            for (k1 = k; k1 < k+B; k1++)
                                c[i1][j1] += a[i1][k1] * b[k1][j1];
    }

Code becomes harder to read! Is it worth it? Tradeoff between performance and maintainability...

• Partition arrays into bsize × bsize chunks (bsize is B in the code)
• Innermost (i1, j1, k1) loop nest multiplies an A chunk by a B chunk and accumulates the result in a C chunk

Blocked matrix multiply
• Assume 3 chunks can fit into the cache, i.e., 3 × bsize^2 < C (here C is the cache size, not the matrix C)
• First block iteration
• After first iteration in cache (schematic)
[Figure: matrices A, B, and C, labeled "n/B chunks"]

Cache miss analysis
• Assume 3 chunks can fit into the cache
• Assume bsize is a multiple of 4
• bsize^2/4 misses per chunk (one miss for every 4 elements), so 3/4 × bsize^2 misses per chunk iteration
• (n/bsize)^3 chunk iterations
• Total of (n/bsize)^3 × 3/4 × bsize^2 misses = n^3 × 3/(4 × bsize)
• Compare with n^3 × 1/2 total misses for kij algorithm
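The blocked loop nest and the miss estimates above can be tried out with a small sketch like the one below. It is not the slides' code: it assumes flat row-major arrays instead of C99 variable-length array parameters, passes the block size as a run-time argument named bsize, and requires n to be a multiple of bsize. The names mmm_naive, mmm_blocked, misses_blocked, misses_kij, and blocked_matches_naive are hypothetical helpers introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Reference triple loop (ijk order): c += a * b, n x n row-major matrices. */
void mmm_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Blocked version: c += a * b, one bsize x bsize chunk at a time.
 * Assumption of this sketch: bsize > 0 and n is a multiple of bsize. */
void mmm_blocked(size_t n, size_t bsize,
                 const double *a, const double *b, double *c)
{
    assert(bsize > 0 && n % bsize == 0);
    for (size_t i = 0; i < n; i += bsize)
        for (size_t j = 0; j < n; j += bsize)
            for (size_t k = 0; k < n; k += bsize)
                /* bsize x bsize mini matrix multiplication */
                for (size_t i1 = i; i1 < i + bsize; i1++)
                    for (size_t j1 = j; j1 < j + bsize; j1++)
                        for (size_t k1 = k; k1 < k + bsize; k1++)
                            c[i1 * n + j1] += a[i1 * n + k1] * b[k1 * n + j1];
}

/* Miss estimates taken from the formulas in the text:
 * blocked: n^3 * 3 / (4 * bsize), kij: n^3 / 2. */
double misses_blocked(double n, double bsize)
{
    return n * n * n * 3.0 / (4.0 * bsize);
}

double misses_kij(double n)
{
    return n * n * n / 2.0;
}

/* Returns 1 if both versions produce identical results on a small
 * deterministic input, 0 if they differ, -1 if allocation failed. */
int blocked_matches_naive(size_t n, size_t bsize)
{
    size_t len = n * n;
    double *a = malloc(len * sizeof *a);
    double *b = malloc(len * sizeof *b);
    double *c_ref = calloc(len, sizeof *c_ref);
    double *c_blk = calloc(len, sizeof *c_blk);
    int ok = -1;

    if (a && b && c_ref && c_blk) {
        for (size_t x = 0; x < len; x++) {
            a[x] = (double)(x % 7) - 3.0;   /* small integer-valued entries */
            b[x] = (double)(x % 5) + 1.0;
        }
        mmm_naive(n, a, b, c_ref);
        mmm_blocked(n, bsize, a, b, c_blk);
        ok = 1;
        for (size_t x = 0; x < len; x++)
            if (c_ref[x] != c_blk[x])
                ok = 0;
    }
    free(a);
    free(b);
    free(c_ref);
    free(c_blk);
    return ok;
}
```

Calling blocked_matches_naive(8, 4) checks that blocking changes only the order in which chunks are visited, not the result. Plugging n = 64 and bsize = 8 into the two formulas gives 24576 estimated misses for the blocked version against 131072 for kij, under the same assumptions as the analysis above.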

This note was uploaded on 10/19/2012 for the course CS 61 taught by Professor Eddie Kohler during the Fall '12 term at Carnegie Mellon.
