Lec14-Cache_measurement


(Lecture slides 37-44 by Stephen Chong, Harvard University.)

Blocked Matrix Multiply Performance

[Chart: cycles per inner-loop iteration (0-60) vs. array size n (25-375) for the six loop orderings kji, jki, kij, ikj, jik, ijk and the blocked versions bijk and bikj (bsize = 25), measured on a 550 MHz Pentium III Xeon.]

• Blocking (bijk and bikj) improves performance by a factor of two over the unblocked versions (ijk and jik)
• Performance is relatively insensitive to array size

Blocked Matrix Multiply Performance

[Chart: cycles per loop iteration (0-125) vs. n (50-1250) for all eight variants (ijk, jik, jki, kji, kij, ikj, bijk, bikj) on an Intel Core i7 at 2.7 GHz with a 32 KB L1 d-cache, 256 KB L2 cache, and 8 MB L3 cache. CAVEAT: tested on a VM.]

Blocked Matrix Multiply Performance

[Chart: the same Core i7 measurements zoomed in to 0-10 cycles per loop iteration, showing only the four fastest variants: kij, ikj, bijk, and bikj. CAVEAT: tested on a VM.]

Exploiting locality in your programs

• Focus attention on inner loops
  • This is where most of your program's computation and memory accesses occur
• Try to maximize spatial locality
  • Read data objects sequentially, with stride 1, in the order they are stored in memory
• Try to maximize temporal locality
  • Use a data object as often as possible once it has been read from memory

Next lecture

• Virtual memory
  • Using memory as a cache for disk

Cache performance test program

/* The test function: reads every stride-th element of data[] */
void test(int elems, int stride)
{
    int i, result = 0;
    volatile int sink;

    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result;    /* So compiler doesn't optimize away the loop */
}

/* Run test(elems, stride) and return read throughput (MB/s) */
double run(int size, int stride)
{
    uint64_t start_cycles, end_cycles, diff;
    int elems = size / sizeof(int);

    test(elems, stride);                        /* Warm up the cache */
    start_cycles = get_cpu_cycle_counter();     /* Read CPU cycle counter */
    test(elems, stride);                        /* Run test */
    end_cycles = get_cpu_cycle_counter();       /* Read CPU cycle counter again */
    diff = end_cycles - start_cycles;           /* Compute time (in cycles) */
    return (size / stride) / (diff / CPU_MHZ);  /* Convert cycles to MB/s */
}

Cache performance main routine

#define CPU_MHZ   (2.8 * 1024.0 * 1024.0)   /* e.g., 2.8 GHz */
#define MINBYTES  (1 << 10)                 /* Working set size ranges from 1 KB */
#define MAXBYTES  (1 << 23)                 /* ... up to 8 MB */
#define MAXSTRIDE 16                        /* Strides range from 1 to 16 */
#define MAXELEMS  (MAXBYTES / sizeof(int))

int data[MAXELEMS];    /* The array we'll be traversing */

int main()
{
    int size;      /* Working set size (in bytes) */
    int stride;    /* Stride (in array elements) */

    init_data(data, MAXELEMS);    /* Initialize each element in data to 1 */

    for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
        for (stride = 1; stride <= MAXSTRIDE; stride++)
            printf("%.1f\t", run(size, stride));
        printf("\n");
    }
    exit(0);
}
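The two listings above call get_cpu_cycle_counter() and init_data() without defining them, and they omit the headers a standalone build would need (<stdio.h>, <stdlib.h>, <stdint.h>). Below is a minimal sketch of those missing pieces, assuming an x86-64 machine and a GCC- or Clang-compatible compiler with the __rdtsc intrinsic from <x86intrin.h>; the slides do not show their actual implementations, so treat this as one possible way to fill the gaps rather than the course's own code. To assemble a complete file, place the includes, the #defines, the data[] declaration, and these helpers above test() and run().

#include <stdint.h>
#include <x86intrin.h>    /* __rdtsc (GCC/Clang on x86) -- an assumption, not from the slides */

/* Read the processor's time-stamp counter.  RDTSC is not a serializing
   instruction, so very fine-grained measurements might add fencing
   (e.g., rdtscp), but it is adequate for the relative comparisons
   this program makes. */
uint64_t get_cpu_cycle_counter(void)
{
    return __rdtsc();
}

/* Initialize each element of the array to 1, matching the comment on the
   init_data call in the main routine. */
void init_data(int *d, int n)
{
    int i;
    for (i = 0; i < n; i++)
        d[i] = 1;
}

With these in place, the program prints one row of throughput estimates per working-set size and one column per stride, which can then be plotted to see how read throughput drops as the working set grows past each level of the cache hierarchy.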