The class routines 2 memory accesses and 125 misses


me. For example, on our desktop machine, sumarraycols runs in about 20 clock cycles per iteration, while sumarrayrows runs in about 10 cycles per iteration.

To summarize, programmers should be aware of locality in their programs and try to write programs that exploit it.

Practice Problem 6.14: Transposing the rows and columns of a matrix is an important problem in signal processing and scientific computing applications. It is also interesting from a locality point of view because its reference pattern is both row-wise and column-wise. For example, consider the following transpose routine:

    typedef int array[2][2];

    void transpose1(array dst, array src)
    {
        int i, j;

        for (i = 0; i < 2; i++) {
            for (j = 0; j < 2; j++) {
                dst[j][i] = src[i][j];
            }
        }
    }

Assume this code runs on a machine with the following properties:

- sizeof(int) == 4.
- The src array starts at address 0 and the dst array starts at address 16 (decimal).
- There is a single L1 data cache t...
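To make the mixed row-wise and column-wise reference pattern concrete, the following sketch lists the byte addresses that transpose1 touches, in loop order. It relies only on the stated assumptions (4-byte ints, src at address 0, dst at address 16) plus C's row-major array layout. It does not model the cache, since the cache parameters are cut off in this excerpt. The names elem_addr, trace_transpose, and print_trace are hypothetical helpers introduced for illustration, not part of the original problem.

```c
#include <assert.h>
#include <stdio.h>

/* Values taken from the problem statement: 2x2 arrays, 4-byte ints,
 * src at address 0, dst at address 16 (decimal). */
enum { N = 2, INT_SIZE = 4, SRC_BASE = 0, DST_BASE = 16 };

/* Hypothetical helper: byte address of element [row][col] in a
 * row-major N x N int array that starts at address `base`. */
static int elem_addr(int base, int row, int col)
{
    assert(row >= 0 && row < N && col >= 0 && col < N);
    return base + INT_SIZE * (row * N + col);
}

/* Hypothetical helper: record, in loop order, the address read from
 * src and the address written to dst by each iteration of transpose1. */
static void trace_transpose(int src_addr[N * N], int dst_addr[N * N])
{
    int k = 0;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            src_addr[k] = elem_addr(SRC_BASE, i, j); /* read  src[i][j] */
            dst_addr[k] = elem_addr(DST_BASE, j, i); /* write dst[j][i] */
            k++;
        }
    }
}

/* Hypothetical helper: print the trace, one iteration per line. */
static void print_trace(void)
{
    int src_addr[N * N], dst_addr[N * N];
    trace_transpose(src_addr, dst_addr);
    for (int k = 0; k < N * N; k++) {
        printf("iteration %d: read src @ %2d, write dst @ %2d\n",
               k, src_addr[k], dst_addr[k]);
    }
}
```

Calling print_trace() from a main function shows src being read at consecutive addresses (0, 4, 8, 12) while dst is written with a stride of two elements (16, 24, 20, 28). The reads follow the row-wise pattern and the writes follow the column-wise one.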