26 shows a graphical representation of the first three iterations (i = 0, 2, and 4) for integer multiplication. For each iteration, the two multiplications must wait until the results from the previous iteration have been computed. Still, the machine can generate two results every four clock cycles, giving a theoretical CPE of 2.0. In this figure we do not take into account the limited set of integer functional units, but this does not prove to be a limitation for this particular procedure.

Comparing loop unrolling alone to loop unrolling with two-way parallelism, we obtain the following performance:

                                             Integer         Floating Point
Function   Page   Method                     +       *       +       *
                  Unroll ×2                  1.50    4.00    3.00    5.00
combine6   241    Unroll ×2, Parallelism ×2  1.50    2.00    2.00    2.50

For integer sum, parallelism does not help, as the latency of integer addition is only one clock cycle. For integer and floating-point product, however, we reduce the CPE by a factor of two. We are essentially doubling the use of the functional units. For floating-point sum, some other resource constraint is limit...
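As a rough illustration of the two methods compared above, the following C sketch computes an integer product first with two-way unrolling alone and then with two-way unrolling plus two independent accumulators. The function names `product_unroll2` and `product_unroll2_par2`, and the use of a plain `int` array in place of the book's vector type, are assumptions made for this example; the code is not the text's combine6. The arithmetic behind the theoretical figure is that two results every four clock cycles gives a CPE of 4 / 2 = 2.0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical baseline: unroll by 2 with a single accumulator.
   Each multiplication depends on the previous one, so the chain of
   dependent multiplications is still one per element. */
int product_unroll2(const int *data, long len)
{
    assert(len >= 0);
    long limit = len - 1;
    int x = 1;
    long i;

    /* Combine two elements per iteration. */
    for (i = 0; i < limit; i += 2)
        x = x * data[i] * data[i + 1];

    /* Finish any remaining element when len is odd. */
    for (; i < len; i++)
        x *= data[i];

    return x;
}

/* Hypothetical sketch of unrolling by 2 with two-way parallelism.
   x0 accumulates even-indexed elements and x1 odd-indexed elements,
   so the two multiplications in one iteration do not depend on each
   other, only on results from the previous iteration. */
int product_unroll2_par2(const int *data, long len)
{
    assert(len >= 0);
    long limit = len - 1;
    int x0 = 1; /* product of data[0], data[2], ... */
    int x1 = 1; /* product of data[1], data[3], ... */
    long i;

    /* Two independent multiplications per iteration. */
    for (i = 0; i < limit; i += 2) {
        x0 *= data[i];
        x1 *= data[i + 1];
    }

    /* Finish any remaining element when len is odd. */
    for (; i < len; i++)
        x0 *= data[i];

    /* Merge the two partial products. */
    return x0 * x1;
}
```

Both functions return the same value for inputs whose product fits in an `int`, since reordering integer multiplications does not change the result. The same transformation applied to floating-point data may give slightly different results because of rounding, so it should be used there only when that is acceptable.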