Lec14-Cache_measurement

# Matrix multiplication: loop order and cache misses



Stephen Chong, Harvard University

... B = 0.25, C = 0.25, Total: 0.5

## Matrix Multiplication (jki)

    /* jki */
    for (j=0; j<n; j++) {
      for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
          c[i][j] += a[i][k] * r;
      }
    }

Inner loop access pattern: A `(*,k)` columnwise, B `(k,j)` fixed, C `(*,j)` columnwise.

• 2 loads, 1 store per iteration
• Assume cache line size of 32 bytes, so 4 doubles per line
• Misses per iteration: A = 1, B = 0, C = 1, Total: 2

## Matrix Multiplication (kji)

    /* kji */
    for (k=0; k<n; k++) {
      for (j=0; j<n; j++) {
        r = b[k][j];
        for (i=0; i<n; i++)
          c[i][j] += a[i][k] * r;
      }
    }

Inner loop access pattern: A `(*,k)` columnwise, B `(k,j)` fixed, C `(*,j)` columnwise.

• Same as jki, just swapped order of outer loops
• 2 loads, 1 store per iteration
• Assume cache line size of 32 bytes, so 4 doubles per line
• Misses per iteration: A = 1, B = 0, C = 1, Total: 2

## Summary of Matrix Multiplication

ijk or jik: 2 loads, 0 stores, misses/iter = 1.25

    for (i=0; i<n; i++) {
      for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
          sum += a[i][k] * b[k][j];
        c[i][j] = sum;
      }
    }

kij or ikj: 2 loads, 1 store, misses/iter = 0.5

    for (k=0; k<n; k++) {
      for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
          c[i][j] += r * b[k][j];
      }
    }

jki or kji: 2 loads, 1 store, misses/iter = 2.0

    for (j=0; j<n; j++) {
      for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
          c[i][j] += a[i][k] * r;
      }
    }

## Matrix Multiply Performance

[Figure: cycles per loop iteration (y-axis, 0 to 125) versus array size n (x-axis, 150 to 1250), one curve each for ijk, jik, jki, kji, kij, and ikj.]

• Each implementation does the same number of arithmetic operations, but there is a ~20× difference in performance!
• Pairs with the same number of memory references and misses per iteration are almost identical

Curve labels on the plot:

• jki and kji: 2 loads, 1 store, misses/iter = 2.0
• ijk and jik: 2 loads, 0 stores, misses/iter = 1.25
• kij and ikj: 2 loads, 1 store, misses/iter = 0.5

• Miss rate better predictor o...