Suppose that a program frequently needs to find the absolute value of an
integer quantity. For a value denoted by n, this may be expressed as:
(n > 0 ? n : -n)
However, instead of replicating this expression in many places in the
B[i+1] = C[i] + D[i];
The dependence between the two statements is no longer loop-carried,
so that iterations of the loop may be overlapped, provided the statements
in each iteration are kept in order.
Our analysis needs to begin by finding all loop-
eration and is not loop-carried. Thus, if this were the only dependence,
multiple iterations of the loop could execute in parallel, as long as each
pair of statements in an iteration were kept in order. We saw this type of
dependence in an example in Sect
ples we considered in Section 4.1 have no loop-carried dependences and, thus,
are loop-level parallel. To see that a loop is parallel, let us first look at the source:
for (i=1000; i>0; i=i-1)
x[i] = x[i] + s;
In this loop, there is a dependen
schemes. Another approach is to temper the strictness of the approach so that binary compatibility is still feasible. This latter approach is used in the IA-64 architecture, as we will see in Section 4.7.
The major challenge for all multiple-issue processo
To keep the functional units busy, there must be enough parallelism in a code
sequence to fill the available operation slots. This parallelism is uncovered by unrolling loops and scheduling the code within the single larger loop body. If the
The first multiple-issue processors that required the instruction stream to be explicitly organized to avoid dependences used wide instructions with multiple operations per instruction. For this reason, this architectural approach was named
VLIW, standing for Very Long Instruction Word.
FIGURE 4.4 Accura
On the iteration i, the loop references element i - 5. The loop is said to have a
dependence distance of 5. Many loops with carried dependences have a dependence distance of 1. The larger the distance, the more potential parallelism can be
obtained by unrol
FIGURE 4.6 A software-pipelined loop chooses instructions from different loop iterations, thus separating the dependent instructions within one iteration of the original loop.
cally on this analysis. The major drawback of dependence analysis is that it applies only under a limited set of circumstances, namely among references within
a single loop nest and using affine index functions. Thus, there are a wide variety
by back substitution, which increases the amount of parallelism and sometimes
increases the amount of computation required. These techniques can be applied
both within a basic block and within loops, and we describe them differently.
Within a basic block,
uated once per unrolled iteration. One common type of recurrence arises from an
explicit program statement, such as:
sum = sum + x;
Assume we unroll a loop with this recurrence five times. If we let the value of x
on these five iterations be given by x1, x2,
sible to use points-to analysis to determine the possible set of objects referenced by a pointer. One important use is to determine if two pointer parameters
may designate the same object.
When a pointer can point to one of several types, it is someti
the output dependences and antidependences by renaming.
for (i=1; i<=100; i=i+1) {
    Y[i] = X[i] / c; /* S1 */
    X[i] = X[i] + c; /* S2 */
    Z[i] = Y[i] + c; /* S3 */
    Y[i] = c - Y[i]; /* S4 */
}
The following dependences exist among the four statements:
1. True dependences from S1 to S3 and from S1 to S4, because of Y[i].
2. An antidependence from S1 to S2, based on X[i].
3. An antidependence from S3 to S4 for Y[i].
4. An output dependence from S1 to S4, based on Y[i].
changing the input so that the profile is for a different run leads to only a small
change in the accuracy of profile-based prediction.
In general, we cannot determine whether a dependence exists at compile time.
For example, the values of a, b, c, and d may not be known (they could be values
in other arrays), making it impossible to tell if a dependence exists. In other
cases, the depend
ing the computation into the offset of the L.D and S.D instructions and by
changing the final DADDUI into a decrement by 32. This transformation
makes the three DADDUI instructions unnecessary, and the compiler can remove
them. There are other types of dependences in th
The dependence of the DSUBU and BEQZ on the LD instruction means that a stall
will be needed after the LD. Suppose we knew that this branch was almost always
taken and that the value of R7 was not needed on the fall-through path. Then we
ble to allocate all the live values to registers. The transformed code, while theoretically faster, may lose some or all of its advantage, because it generates a shortage
of registers. Without unrolling, aggressive scheduling is sufficiently limited by
; 8 - 32 = -24
The execution time of the unrolled loop has dropped to a total of 14 clock
cycles, or 3.5 clock cycles per element.
c = c + f;
b = a + f;
A good exercise, but it requires describing how scoreboards work. There are a number of problems based on
scoreboards, which may be salvageable by one of the following: introducing scoreboards (maybe not worth
it), removing part of the r