has even begun. We can also see the effect of pipelining. Each iteration requires at least seven cycles from start to end, but successive iterations are completed every 4 cycles. Thus, the effective processing rate is one iteration every 4 cycles, giving a CPE of 4.0. The 4-cycle latency of integer multiplication constrains the performance of the processor for this program. Each imull operation must wait until the previous one has completed, since it needs the result of ...

[Section 5.7, Understanding Modern Processors; Chapter 5, Optimizing Program Performance]

[Figure 5.15: Scheduling of Operations for Integer Multiplication with Unlimited Number of Execution Units. The 4-cycle latency of the multiplier is the performance-limiting resource. The diagram has a cycle axis labeled 1 through 15 and shows Iterations 1 to 3 (i=0, i=1, i=2), each made up of load, imull, incl, cmpl, and jl operations.]
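The timing argument above can be reproduced with a small model. The sketch below is only illustrative: `product` is a hypothetical stand-in for the loop being analyzed, `mul_done_cycle` is a made-up helper, and the 3-cycle load latency is an assumption chosen so that one iteration takes seven cycles from start to end. Only the 4-cycle multiply latency comes from the text.

```c
#include <assert.h>

/* Assumed latencies, in cycles. MUL_LATENCY is stated in the text;
   LOAD_LATENCY is an assumption that makes one iteration take 7 cycles. */
enum { LOAD_LATENCY = 3, MUL_LATENCY = 4 };

/* Hypothetical stand-in for the analyzed loop: every multiply needs
   the product computed by the previous iteration. */
int product(const int *data, int n) {
    int x = 1;
    for (int i = 0; i < n; i++)
        x = x * data[i];
    return x;
}

/* Cycle at which the multiply of iteration k (0-based) completes,
   assuming unlimited execution units and one new load issued per cycle. */
int mul_done_cycle(int k) {
    assert(k >= 0);
    int prev_done = 0;
    for (int i = 0; i <= k; i++) {
        /* The load for iteration i is issued at cycle i. */
        int operand_ready = i + LOAD_LATENCY;
        /* The multiply waits for both its operand and the previous product. */
        int start = operand_ready > prev_done ? operand_ready : prev_done;
        prev_done = start + MUL_LATENCY;
    }
    return prev_done;
}
```

Under these assumptions, `mul_done_cycle` returns 7, 11, and 15 for the first three iterations. The loads finish early, so the chain of dependent multiplies sets the pace: one iteration completes every 4 cycles, which is the CPE of 4.0 described above.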