If we permute the loops and make some other minor code changes, we can create the six functionally equivalent versions of matrix multiply shown in Figure 6.45. Each version is uniquely identified by the ordering of its loops.

At a high level, the six versions are quite similar. If addition is associative, then each version computes an identical result. Each version performs $O(n^3)$ total operations and an identical number of adds and multiplies. Each of the $n^2$ elements of $A$ and $B$ is read $n$ times. Each of the $n^2$ elements of $C$ is computed by summing $n$ values. However, if we analyze the behavior of the innermost loop iterations, we find that there are differences in the number of accesses and the locality.

For the purposes of our analysis, let's make the following assumptions:

• Each array is an $n \times n$ array of double, with sizeof(double) == 8.
• There is a single cache with a 32-byte block size ($B = 32$).
• The array size $n$ is so large that a single matrix row does not fit in the L1 cache.
• The compiler...
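
To make the idea of loop permutation concrete, here is a minimal sketch of two of the six orderings (ijk and kij). It is not the code from Figure 6.45: the function names, the fixed size N, and the helpers doubles_per_block and check_orderings are illustrative assumptions, and N is kept tiny so the check runs instantly, whereas the analysis assumes a much larger $n$.

```c
#include <assert.h>
#include <stddef.h>

#define N 4             /* illustrative size only; the analysis assumes large n */
#define BLOCK_BYTES 32  /* cache block size taken from the stated assumptions */

/* ijk ordering: the inner loop walks a row of a (stride 1)
 * and a column of b (stride N). */
void mm_ijk(double a[N][N], double b[N][N], double c[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}

/* kij ordering: the inner loop walks rows of b and c (both stride 1);
 * a[i][k] is constant across the inner loop. c must be zeroed by the caller. */
void mm_kij(double a[N][N], double b[N][N], double c[N][N])
{
    for (size_t k = 0; k < N; k++)
        for (size_t i = 0; i < N; i++) {
            double r = a[i][k];
            for (size_t j = 0; j < N; j++)
                c[i][j] += r * b[k][j];
        }
}

/* Number of doubles that share one cache block under the assumptions above. */
size_t doubles_per_block(void)
{
    return BLOCK_BYTES / sizeof(double);
}

/* Hypothetical self-check: both orderings produce the same product. */
void check_orderings(void)
{
    double a[N][N], b[N][N], c1[N][N], c2[N][N] = {{0}};

    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            a[i][j] = (double)(i + j);      /* small integers keep sums exact */
            b[i][j] = (double)(i * N + j);
        }

    mm_ijk(a, b, c1);
    mm_kij(a, b, c2);

    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            assert(c1[i][j] == c2[i][j]);
}
```

Both functions perform the same multiplies and adds; only the order in which the inner loop touches memory differs. With 8-byte doubles and 32-byte blocks, four consecutive elements share a block, so a stride-1 inner loop can reuse each fetched block for several iterations, while a stride-N walk down a column touches a different block on every iteration.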