... the assembly code for the inner loop and its translation into operations.

Assembly Instructions          Execution Unit Operations
.L49:
  addl (%eax,%edx,4),%ecx      load (%eax, %edx.0, 4)    → t.1a
                               addl t.1a, %ecx.0c        → %ecx.1a
  addl 4(%eax,%edx,4),%ecx     load 4(%eax, %edx.0, 4)   → t.1b
                               addl t.1b, %ecx.1a        → %ecx.1b
  addl 8(%eax,%edx,4),%ecx     load 8(%eax, %edx.0, 4)   → t.1c
                               addl t.1c, %ecx.1b        → %ecx.1c
  addl $3,%edx                 addl $3, %edx.0           → %edx.1
  cmpl %esi,%edx               cmpl %esi, %edx.1         → cc.1
  jl .L49                      jl-taken cc.1

As mentioned earlier, loop unrolling by itself will only help the performance of the code for the case of integer sum, since our other cases are limited by the latency of the functional units. For integer sum, three-way unrolling allows us to combine three elements with six integer/branch operations (three accumulating adds, the index update, the compare, and the branch), as shown in Figure 5.20. With two functional units for these operations, the six operations need at least three cycles per iteration, so we could potentially achieve a CPE of 1.0 (three cycles for three elements). Figure 5.21 ...

236 CHAPTER 5. OPTIMIZING PROGRAM PERFORMANCE

[Figure 5.21 residue: cycle axis 5 to 15; iterations i=6 and i=9; operations load, addl; values %edx.2, %ecx.2c, t.3a, t.3b, t.3c, %ecx....]
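
For orientation, the three-way unrolled inner loop discussed above corresponds roughly to C code of the following shape. This is a minimal sketch, not the text's own source: the function name sum_unroll3, its parameters, and the cleanup loop for leftover elements are assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical helper (name and signature are assumptions):
 * sum `length` ints with the loop unrolled three ways. */
int sum_unroll3(const int *data, int length)
{
    assert(length >= 0);

    int sum = 0;
    int limit = length - 2;  /* a full group of three fits only while i < limit */
    int i;

    /* Main loop: three accumulating adds, one index update,
     * one compare and one branch per iteration. */
    for (i = 0; i < limit; i += 3)
        sum = sum + data[i] + data[i + 1] + data[i + 2];

    /* Cleanup loop: handle the zero to two elements left over. */
    for (; i < length; i++)
        sum += data[i];

    return sum;
}
```

The single statement in the main loop is written as a left-to-right chain so that each add depends on the previous one, mirroring the three successive addl instructions that accumulate into %ecx.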