In fact if it predicted the branch will always be

26 shows a graphical representation of the first three iterations (i = 0, 2, and 4) for integer multiplication. For each iteration, the two multiplications must wait until the results from the previous iteration have been computed. Still, the machine can generate two results every four clock cycles, giving a theoretical CPE of 2.0. In this figure we do not take into account the limited set of integer functional units, but this does not prove to be a limitation for this particular procedure.

Comparing loop unrolling alone to loop unrolling with two-way parallelism, we obtain the following performance:

Function   Page   Method                       Integer          Floating Point
                                               +       *        +       *
                  Unroll ×2                    1.50    4.00     3.00    5.00
combine6   241    Unroll ×2, Parallelism ×2    1.50    2.00     2.00    2.50

For integer sum, parallelism does not help, as the latency of integer addition is only one clock cycle. For integer and floating-point product, however, we reduce the CPE by a factor of two. We are essentially doubling the use of the functional units. For floating-point sum, some other resource constraint is limit...
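The difference between the two table rows can be sketched in C as follows. This is a minimal illustration, not the book's combine6 listing: it assumes a plain array of long in place of the book's vector type, and the function names product_unroll2 and product_unroll2_par2 are hypothetical. The first version keeps one accumulator, so every multiplication waits on the previous one. The second keeps two independent accumulators, so two multiplications can be in flight at once, which is where the "two results every four clock cycles" figure (4 / 2 = 2.0 CPE) would come from.

```c
#include <stddef.h>

/* Unroll by 2, single accumulator (hypothetical name).
 * Both multiplications in the loop body feed the same variable,
 * so each one must wait for the one before it. */
long product_unroll2(const long *data, size_t n)
{
    long acc = 1;
    size_t i = 0;

    /* Process two elements per iteration. */
    for (; i + 1 < n; i += 2) {
        acc = acc * data[i] * data[i + 1];
    }
    /* Finish any leftover element when n is odd. */
    for (; i < n; i++) {
        acc *= data[i];
    }
    return acc;
}

/* Unroll by 2 with two-way parallelism (hypothetical name).
 * Even-indexed elements go into acc0, odd-indexed into acc1.
 * The two chains do not depend on each other inside the loop. */
long product_unroll2_par2(const long *data, size_t n)
{
    long acc0 = 1;
    long acc1 = 1;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        acc0 *= data[i];
        acc1 *= data[i + 1];
    }
    /* Finish any leftover element when n is odd. */
    for (; i < n; i++) {
        acc0 *= data[i];
    }
    /* Combine the two partial products at the end. */
    return acc0 * acc1;
}
```

For integer data without overflow the two functions return the same value, because integer multiplication is associative and commutative. For floating-point data the reordering can change rounding, so results may differ slightly.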