CS425: Computer Systems Architecture
Homework Problem Set 3

Assignment: November 15, 2009
Due: November 22, 2009, 23:59:59

Instructions: Solve all problems in a .pdf file and send them via e-mail to Vassilis Papaefstathiou ([email protected]). Use the subject: HY425 - Homework 3

Problem 1 (100 points)

The following code is known as the DAXPY loop (Double-precision AX Plus Y) from the BLAS package (Basic Linear Algebra Subprograms), where x and y are arrays of doubles and a is a double:

    for ( i=0 ; i<N ; i++ ){
        y[i] = a * x[i] + y[i];
    }

Assume that our compiler has generated the following RISC assembly code:
[note: R1 keeps x index, R2 keeps y index, R4 keeps x[N-1] index, F0 keeps a]

          Instruction           Notes
    Loop: LD    F2, 0(R1)       load x[i] into F2
          MULTD F4, F2, F0      put a*x[i] into F4
          LD    F6, 0(R2)       load y[i] into F6
          ADDD  F6, F4, F6      put a*x[i] + y[i] into F6
          SD    F6, 0(R2)       store F6 into y[i]
          ADDI  R1, R1, 8       increment x index (R1)
          ADDI  R2, R2, 8       increment y index (R2)
          SGT   R3, R1, R4      test if loop done
          BEQZ  R3, Loop        loop if not done
          NOP                   branch delay slot

Further assume the following latencies of a typical 5-stage in-order pipelined RISC processor (IF, ID, EX, MEM, WB) and that bypassing is applied whenever possible:

    Operation(s)    Stage    Latency (cycles)
    All Integer     EX       1
    LD              MEM      2
    SD              MEM      1
    ADDD            EX       3
    MULTD           EX       5

i. Show how the RISC processor would execute each loop iteration (indicate stalls) and calculate the total number of cycles required to run 120 iterations of the loop.

ii. Try to rearrange the instructions in order to reduce the number of stalls and then calculate the total number of cycles required to run 120 iterations of the loop. Compare the performance now with (i).

iii. Loop-unroll as many iterations as needed in order to reduce the number of stalls, and then calculate the total number of cycles required to run 120 iterations of the loop. Compare the performance now with (i) and (ii).

iv. Apply the technique of software pipelining and then calculate the total number of cycles required to run 120 iterations of the loop. Compare the performance now with (i), (ii) and (iii). Do not forget the startup and cleanup code!

Now assume a VLIW processor that can issue two memory references, two FP operations, and one integer operation or branch in every clock cycle. Further assume the same operation latencies as the RISC processor above and that you have infinite registers.

v. Show how the code that you generated in (iii) would run in the VLIW processor and then calculate the total number of cycles required to run 120 iterations of the loop. Compare the performance now with (iii) and (iv).

vi. Show how the code that you generated in (iv) would run in the VLIW processor and then calculate the total number of cycles required to run 120 iterations of the loop. Compare the performance now with (iii), (iv) and (v).

vii. Loop-unroll as many iterations as needed in order to reduce the number of stalls and keep the VLIW pipeline utilized, and then calculate the total number of cycles required to run 120 iterations of the loop. Compare the performance now with (iii), (iv), (v) and (vi).
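
Illustrative sketch (not part of the original handout): the two loop transformations named in (iii) and (iv) can be pictured at the C level before they are applied to the assembly code. The function names daxpy_unrolled4, daxpy_swp and daxpy_variants_agree, the unroll factor of 4, and the three-stage split (multiply, add, store) are assumptions chosen for illustration only. They are not necessarily the factor or schedule that minimizes stalls for the latencies above, and the cycle counts still have to be derived from the latency table.

```c
#include <stddef.h>

/* Reference DAXPY loop from the problem statement. */
void daxpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

/* Loop unrolling with an assumed factor of 4. Assumes n is a multiple
 * of 4 (true for n = 120), so no remainder loop is needed. The four
 * statements are independent, which is what gives a scheduler room to
 * interleave them and hide load/FP latencies. */
void daxpy_unrolled4(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i += 4) {
        y[i]     = a * x[i]     + y[i];
        y[i + 1] = a * x[i + 1] + y[i + 1];
        y[i + 2] = a * x[i + 2] + y[i + 2];
        y[i + 3] = a * x[i + 3] + y[i + 3];
    }
}

/* Software pipelining with an assumed three-stage split. Each pass of
 * the steady-state loop stores element i, adds for element i+1 and
 * multiplies for element i+2. Assumes x and y do not overlap. */
void daxpy_swp(size_t n, double a, const double *x, double *y)
{
    if (n < 3) {              /* too short to fill the pipeline */
        daxpy(n, a, x, y);
        return;
    }

    /* Startup code: fill the pipeline. */
    double sum  = a * x[0] + y[0];   /* finished result for element 0 */
    double prod = a * x[1];          /* product for element 1 */

    /* Steady state. */
    for (size_t i = 0; i + 2 < n; i++) {
        y[i] = sum;                  /* store stage,    element i     */
        sum  = prod + y[i + 1];      /* add stage,      element i + 1 */
        prod = a * x[i + 2];         /* multiply stage, element i + 2 */
    }

    /* Cleanup code: drain the pipeline. */
    y[n - 2] = sum;
    y[n - 1] = prod + y[n - 1];
}

/* Returns 1 if all three variants agree for N = 120, else 0. The inputs
 * are small integers so every intermediate value is exact. */
int daxpy_variants_agree(void)
{
    enum { N = 120 };
    double x[N], y0[N], y1[N], y2[N];

    for (size_t i = 0; i < N; i++) {
        x[i] = (double)i;
        y0[i] = y1[i] = y2[i] = 2.0 * (double)i;
    }
    daxpy(N, 3.0, x, y0);
    daxpy_unrolled4(N, 3.0, x, y1);
    daxpy_swp(N, 3.0, x, y2);

    for (size_t i = 0; i < N; i++) {
        if (y0[i] != 5.0 * (double)i || y1[i] != y0[i] || y2[i] != y0[i]) {
            return 0;
        }
    }
    return 1;
}
```

Calling daxpy_variants_agree() from a small main is one way to confirm that a transformed loop still computes the same y before counting its cycles. The startup and cleanup sections of daxpy_swp correspond to the startup and cleanup code that (iv) asks for.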
