Tomasulo’s Algorithm (question 3.6 in the book)

In this exercise, we will look at how variations on Tomasulo’s algorithm perform when running a common vector loop. The loop is the so-called DAXPY loop (double-precision aX plus Y) and is the central operation in Gaussian elimination. The following code implements the operation Y = aX + Y for a vector of length 100. Initially, R1 = 0 and F0 contains a.

    foo: L.D     F2,0(R1)     ;load X(i)
         MUL.D   F4,F2,F0     ;multiply a*X(i)
         L.D     F6,0(R2)     ;load Y(i)
         ADD.D   F6,F4,F6     ;add a*X(i) + Y(i)
         S.D     F6,0(R2)     ;store Y(i)
         DADDUI  R1,R1,#8     ;increment X index
         DADDUI  R2,R2,#8     ;increment Y index
         DSGTUI  R3,R1,#800   ;test if done
         BEQZ    R3,foo       ;loop if not done

The pipeline function units are as described below.

    FU type         Cycles in EX   Number of FUs   Number of reservation stations
    Integer         1              1               5
    FP adder        4              1               3
    FP multiplier   15             1               2

Assume the following:

• Function units are not pipelined.
• There is no forwarding between function units; results are communicated by the CDB.
• The execution stage (EX) does both the effective address calculation and the memory access for loads and stores. Thus the pipeline is IF/ID/IS/EX/WB, so LD/ST can execute in the same cycle as the address calculation.
• Loads take 1 cycle (always a cache hit).
• The issue (IS) and write result (WB) stages each take 1 clock cycle.
• There are 5 load buffer slots and 5 store buffer slots.
• Assume that the BEQZ instruction takes 0 clock cycles. This means that BEQZ must wait until all data dependences are resolved, after which there is no latency in EX; there is also no latency between its EX and the issue cycle of the next instruction.
• For LD/ST, the address calculation and the memory access are done in the same cycle (MEM and EX happen at the same time, so there is no need for the MEM column on page 222).
• Assume the FU is free starting at WB.
• Assume the reservation station becomes free at the WB stage.
• Assume BEQZ does not take up a slot in the reservation station...
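
To make the timing rules above concrete, here is a minimal Python sketch that applies them to the first five instructions of one loop iteration (the load/multiply/add/store chain). It is an illustration, not the exercise's answer. The names Instr, EX_CYCLES, CHAIN and schedule are hypothetical, and the sketch assumes the following, none of which the exercise states outright: the first instruction issues in cycle 1 and one instruction issues per cycle; loads and stores execute on the integer FU; a consumer may start EX in the cycle after its producer's WB; FUs are claimed in program order. It ignores reservation-station and buffer limits, CDB contention, the integer loop overhead, and BEQZ.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Cycles in EX per function-unit type, taken from the table in the exercise.
EX_CYCLES = {"int": 1, "fp_add": 4, "fp_mul": 15}


@dataclass
class Instr:
    text: str               # assembly text, for display only
    fu: str                 # key into EX_CYCLES
    srcs: Tuple[str, ...]   # source registers
    dest: Optional[str]     # destination register (None for a store)


# Assumption: loads and stores run on the single integer FU.
CHAIN = [
    Instr("L.D   F2,0(R1)", "int", ("R1",), "F2"),
    Instr("MUL.D F4,F2,F0", "fp_mul", ("F2", "F0"), "F4"),
    Instr("L.D   F6,0(R2)", "int", ("R2",), "F6"),
    Instr("ADD.D F6,F4,F6", "fp_add", ("F4", "F6"), "F6"),
    Instr("S.D   F6,0(R2)", "int", ("F6", "R2"), None),
]


def schedule(program):
    """Return (text, IS, EX start, EX end, WB) for each instruction."""
    ready = {}    # register -> first cycle a consumer may start EX
    fu_free = {}  # FU type -> first cycle the FU may start a new EX
    rows = []
    for issue, ins in enumerate(program, start=1):
        # EX waits for issue, for every operand on the CDB, and for the FU.
        start = max([issue + 1, fu_free.get(ins.fu, 0)]
                    + [ready.get(r, 0) for r in ins.srcs])
        end = start + EX_CYCLES[ins.fu] - 1
        # Only instructions with a destination broadcast on the CDB.
        wb = end + 1 if ins.dest is not None else None
        # "FU is free starting at WB": the cycle right after EX ends.
        fu_free[ins.fu] = end + 1
        if ins.dest is not None:
            # No forwarding: the consumer starts EX the cycle after WB.
            ready[ins.dest] = wb + 1
        rows.append((ins.text, issue, start, end, wb))
    return rows


if __name__ == "__main__":
    for text, issue, start, end, wb in schedule(CHAIN):
        print(f"{text:<16} IS={issue:<2} EX={start}-{end:<3} WB={wb}")
```

Under these assumptions the 15-cycle multiply dominates: MUL.D occupies EX in cycles 4 to 18 and writes back in cycle 19, so ADD.D cannot start EX until cycle 20 and S.D until cycle 25. Because FUs are claimed in program order, the sketch is adequate for this short chain but is not a general out-of-order scheduler.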
