tions in iteration 8 occur eight cycles later. As the iterations proceed, the patterns shown for iterations 4 to 7 would keep repeating. Thus, we complete four iterations every eight cycles, achieving the optimum CPE of 2.0.

Summary of combine4 Performance
We can now consider the measured performance of combine4 for all four combinations of data type and combining operations:

                                             Integer         Floating Point
Function   Page   Method                     +       *       +       *
combine4   219    Accumulate in temporary    2.00    4.00    3.00    5.00

With the exception of integer addition, these cycle times nearly match the latency for the combining operation, as shown in Figure 5.12. Our transformations to this point have reduced the CPE value to the point where the time for the combining operation becomes the limiting factor. For the case of integer addition, we have seen that the limited number of functional units for branch and integer operations limits the achievable performance. With four such operations per iteration, and just two functional units, we cannot expect the program to go faster than 2 cycles per itera...
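To make the "accumulate in temporary" method and the functional-unit bound concrete, here is a minimal sketch in C. It is not the original listing: the names combine4_sketch and min_cycles_per_iteration, the plain array interface, and the data_t, IDENT, and OP definitions are assumptions chosen for illustration. The first function keeps the running result in a local variable and writes to memory only once, at the end. The second computes the lower bound described above: when each iteration issues a fixed number of operations and each unit accepts one per cycle, an iteration needs at least the ceiling of (operations / units) cycles.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed definitions for illustration: integer data combined with +. */
typedef int data_t;
#define IDENT 0
#define OP +

/* Accumulate in a temporary: the running result lives in the local
   variable x, and *dest is written once after the loop finishes. */
static void combine4_sketch(const data_t *data, size_t length, data_t *dest)
{
    data_t x = IDENT;
    for (size_t i = 0; i < length; i++) {
        x = x OP data[i];
    }
    *dest = x;
}

/* Hypothetical helper: lower bound on cycles per iteration when each
   iteration issues `ops` operations to `units` functional units, each
   accepting one operation per cycle. Computes ceil(ops / units). */
static unsigned min_cycles_per_iteration(unsigned ops, unsigned units)
{
    assert(units > 0);
    return (ops + units - 1) / units;
}
```

With the figures from the text, min_cycles_per_iteration(4, 2) evaluates to 2, which agrees with the measured CPE of 2.00 for integer addition. The bound considers only issue capacity; the other three cases in the table are limited instead by the latency of the combining operation.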