[...] Pentium III Xeon system (bsize = 25). Notice that blocking improves the running time by a factor of two over the best non-blocked version, from about 20 cycles per iteration down to about 10 cycles per iteration. The other interesting impact of blocking is that the time per iteration remains nearly constant with increasing array size. For small array sizes, the additional overhead in the blocked version causes it to run slower than the non-blocked versions. There is a crossover point, at about n = 100, after which the blocked version runs faster.
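To make the comparison concrete, here is a minimal sketch in C of a non-blocked ijk multiply next to a blocked multiply with a bijk-style loop order. It is an illustration under assumptions, not the exact code behind the measurements: the function names `matmul_ijk` and `matmul_bijk`, the row-major flat-array layout, and the handling of array sizes that are not a multiple of `bsize` are all choices made for this example.

```c
#include <assert.h>

/* Non-blocked reference: C = A * B for n x n row-major matrices. */
void matmul_ijk(const double *A, const double *B, double *C, int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}

/* Blocked variant (bijk-style ordering): the two outer loops walk over
 * bsize x bsize blocks of B, so each block is reused for every row i
 * of A before the code moves on to the next block. */
void matmul_bijk(const double *A, const double *B, double *C,
                 int n, int bsize)
{
    assert(n >= 0 && bsize > 0);

    /* C accumulates partial sums, so it must start at zero. */
    for (int i = 0; i < n * n; i++)
        C[i] = 0.0;

    for (int kk = 0; kk < n; kk += bsize) {
        /* Clamp the block edge so n need not be a multiple of bsize. */
        int kmax = (kk + bsize < n) ? kk + bsize : n;
        for (int jj = 0; jj < n; jj += bsize) {
            int jmax = (jj + bsize < n) ? jj + bsize : n;
            for (int i = 0; i < n; i++) {
                for (int j = jj; j < jmax; j++) {
                    double sum = C[i * n + j];
                    for (int k = kk; k < kmax; k++)
                        sum += A[i * n + k] * B[k * n + j];
                    C[i * n + j] = sum;
                }
            }
        }
    }
}
```

The extra loops, the block-edge computations, and the repeated load and store of each partial sum in `C` are the kind of added work that plausibly accounts for the overhead at small array sizes. Once the matrices outgrow the cache, the reuse of each block of `B` outweighs that cost.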
Figure 6.50: Pentium III Xeon blocked matrix multiply performance (cycles per iteration versus array size n). Legend: bijk and bikj: two different versions of blocked matrix multiply (bsize = 25). Performance of the unblocked versions from Figure 6.47 (kji, jki, kij, ikj, jik, ijk) is shown for reference.

Aside: Caches and streaming media workloads
Applications that process network video and audio data in real time are becoming increasingly important. In these applications, the data arrive at the machin...