CS433: Computer Systems Organization
Fall 2009
Homework 1
Assigned: Sept/1
Due in class Sept/15
Total points: 40 for undergraduate students, 54 for graduate students.

Instructions: Please write your name, NetID, and an alias on your homework submissions for posting grades (if you don’t want your grades posted, don’t write an alias). We will use this alias throughout the semester. Homeworks are due in class on the date posted.

1. Amdahl’s law [8 points]

Three enhancements with the following speedups are proposed for a new architecture:

    Speedup1 = 30
    Speedup2 = 20
    Speedup3 = 10

Only one enhancement is usable at a time.

a) [4 points] If enhancements 1 and 2 are each usable for 30% of the time, what fraction of the time must enhancement 3 be used to achieve an overall speedup of 10?

Solution: To solve this problem, we first need a form of Amdahl’s Law that handles multiple enhancements where only one enhancement is usable at a time. We replace the terms involving the fraction of time an enhancement can be used with summations:

    Speedup = [ 1 - (FE1 + FE2 + FE3) + (FE1/SE1 + FE2/SE2 + FE3/SE3) ]^(-1)

If we plug in the numbers, we get:

    10 = [ 1 - (0.30 + 0.30 + FE3) + (0.30/30 + 0.30/20 + FE3/10) ]^(-1)
    0.1 = 0.425 - 0.9 * FE3
    FE3 = 0.325 / 0.9 = 0.36

Therefore, the third enhancement must be usable in the enhanced system 36% of the time to achieve an overall speedup of 10.

Grading: 3 points for correctly setting up the equation. 1 point for the correct values in the equation and the final answer.

b) [4 points] Assume for some benchmark, the fraction of use is 15% for each of enhancements 1 and 2 and 70% for enhancement 3. We want to maximize performance. If only one enhancement can be implemented, which should it be? If two enhancements can be implemented, which should be chosen?

Solution: Here we will again use Amdahl’s law to compute speedups.
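The generalized form of Amdahl’s law used in this problem can be sketched as a small function. This is only an illustrative check, not part of the original solution; the names amdahl_speedup and required_fraction are assumptions introduced here.

```python
def amdahl_speedup(fractions, speedups):
    """Overall speedup when only one enhancement is usable at a time.

    fractions[i] is the fraction of the original time in which enhancement i
    applies and speedups[i] is its speedup (illustrative names).
    """
    unenhanced = 1.0 - sum(fractions)
    enhanced = sum(f / s for f, s in zip(fractions, speedups))
    return 1.0 / (unenhanced + enhanced)


def required_fraction(target, fixed_fractions, fixed_speedups, speedup_x):
    """Fraction x of one more enhancement needed to reach the target speedup."""
    base = 1.0 - sum(fixed_fractions) + sum(
        f / s for f, s in zip(fixed_fractions, fixed_speedups))
    # 1/target = base - x + x/speedup_x, solved for x
    return (base - 1.0 / target) / (1.0 - 1.0 / speedup_x)


# Part (a): enhancements 1 and 2 each usable 30% of the time, target speedup 10
fe3 = required_fraction(10, [0.30, 0.30], [30, 20], 10)
print(round(fe3, 4))
# Plugging the fraction back in should recover the target speedup
print(round(amdahl_speedup([0.30, 0.30, fe3], [30, 20, 10]), 4))
```

The same amdahl_speedup function can be called with any subset of the enhancements, which is how the one-enhancement and two-enhancement cases of part (b) are evaluated.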
    Speedup for one enhancement only = [ 1 - FE1 + (FE1/SE1) ]^(-1)
    Speedup for two enhancements = [ 1 - (FE1 + FE2) + (FE1/SE1 + FE2/SE2) ]^(-1)

If we plug in the numbers, we get:

    Speedup1 = (1 - 0.15 + 0.15/30)^(-1) = 1.170
    Speedup2 = (1 - 0.15 + 0.15/20)^(-1) = 1.166
    Speedup3 = (1 - 0.70 + 0.70/10)^(-1) = 2.703

Therefore, if we are allowed to select a single enhancement, we would choose E3.

    Speedup12 = [(1 - 0.15 - 0.15) + (0.15/30 + 0.15/20)]^(-1) = 1.4035
    Speedup13 = [(1 - 0.15 - 0.70) + (0.15/30 + 0.70/10)]^(-1) = 4.4444
    Speedup23 = [(1 - 0.15 - 0.70) + (0.15/20 + 0.70/10)]^(-1) = 4.3956

Therefore, if two enhancements can be implemented, we would choose E1 and E3.

Grading: 2 points for correctly calculating the one-enhancement speedups. 2 points for correctly calculating the two-enhancement speedups.

2. Measuring processor’s time [8 points]

After graduating, you are asked to become the lead computer designer at Hyper Computer, Inc. Your study of usage of high-level language constructs suggests that procedure calls are one of the most expensive operations. You have invented a new architecture with an ISA that reduces the loads and stores normally associated with procedure calls and returns. The first thing you do is run some experiments with and without this optimization. Your experiments use the same state-of-the-art optimizing compiler that will be used with either version of the computer. These experiments reveal the following information:

- The clock cycle time of the optimized version is 5% lower than the unoptimized version.
- Thirty percent of the instructions in the unoptimized version are loads or stores.
- The optimized version executes two-thirds as many loads and stores as the unoptimized version. For all other instructions the dynamic execution counts are unchanged.
- Every instruction (including load and store) in the unoptimized version takes one clock cycle.
- Due to the optimization, the procedure call and return instructions take one extra cycle in the optimized version, and these instructions account for 5% of the total instruction count in the optimized version.

Which is faster? Justify your decision quantitatively.

Solution: To decide which is faster, we need to measure the CPU time:

    CPU Time = IC * CPI * Clk

For the unoptimized case, we have the CPU time:

    CPUun = ICun * CPIun * Clkun

Because CPIun = 1.0, we have:

    CPUun = ICun * 1.0 * Clkun

Since 30% of the instructions are loads and stores, and the optimized version executes only 2/3 of them, the optimized version eliminates 30% * 1/3 = 10% of the instructions. In the optimized version, 95% of the instructions take 1 cycle and the 5% that are procedure calls and returns take 2 cycles. This gives:

    ICnew = 0.9 * ICun
    CPInew = 0.95 * 1 + 0.05 * 2 = 1.05
    Clknew = 0.95 * Clkun

So we have:

    CPUnew = ICnew * CPInew * Clknew
           = 0.9 * ICun * 1.05 * 0.95 * Clkun
           = 0.89775 * ICun * Clkun
           = 0.89775 * CPUun

The optimized version takes about 10% less CPU time, so we should use the optimized version.

Grading: 2 points for pointing out that we should use the CPU time formula to compare. 2 points for each correct component calculation (IC, CPI, and Clk).

3. Basic Pipelining [16 points]

Consider the following code fragment:

    Loop: LW    R1, 0(R2)
          DADDI R1, R1, 1
          SW    R1, 0(R2)
          DADDI R2, R2, 4
          DADDI R4, R4, -4
          BNEZ  R4, Loop

Consider the standard 5 stage pipeline machine (IF ID EX MEM WB). Assume the initial value of R4 is 396 and all memory accesses hit in the cache.

a. [5 points] Show the timing of the above code fragment for one iteration as well as for the load of the second iteration. For this part, assume there is no forwarding or bypassing hardware. Assume a register write occurs in the first half of the cycle and a register read occurs in the last half of the cycle. Also, assume that branches are resolved in the memory stage and are handled by flushing the pipeline. Use a pipeline timing chart to show the timing as below (expand the chart if you need more cycles).
How many cycles does this loop take to complete (for all iterations, not just one iteration)?

Instruction       C1  C2  C3  C4  C5  C6  C7  C8  C9  C10 C11 C12 C13 C14 C15
LW R1, 0(R2)      F   D   X   M   W
DADDI R1, R1, 1
SW R1, 0(R2)
DADDI R2, R2, 4
DADDI R4, R4, -4
BNEZ R4, Loop
LW R1, 0(R2)

Solution: R4 starts at 396 and is decremented by 4 each iteration, so the loop iterates 396 / 4 = 99 times. To calculate the total time the loop takes, we look at the length of the first 98 iterations, then factor in the 99th iteration, which takes a bit longer to execute. The pipeline diagram:

Instruction       C1  C2  C3  C4  C5  C6  C7  C8  C9  C10 C11 C12 C13 C14 C15 C16 C17 C18 C19 C20
LW R1, 0(R2)      F   D   X   M   W
DADDI R1, R1, 1       F   D   S   S   X   M   W
SW R1, 0(R2)              F   S   S   D   S   S   X   M   W
DADDI R2, R2, 4                       F   S   S   D   X   M   W
DADDI R4, R4, -4                                  F   D   X   M   W
BNEZ R4, Loop                                         F   D   S   S   X   M   W
LW R1, 0(R2)                                                                  F   D   X   M   W

Here, “S” indicates a stall. The last cycle of an iteration is overlapped with the first cycle of the next, so it is not counted until the end. Therefore, the first 98 iterations take 15 cycles each, while the last iteration takes 16 cycles. The total time taken for the code to execute is 98 x 15 + 16 = 1486 clock cycles.

Grading: 1 point each for lines 2, 3, and 6. 0.5 points each for lines 1, 4, 5, and 7.

b. [5 points] Show the timing for the same instruction sequence for the pipeline with full forwarding and bypassing hardware (as discussed in class). Assume that branches are resolved in the MEM stage and are predicted as not taken. How many cycles does this loop take to complete?

Solution:

Instruction       C1  C2  C3  C4  C5  C6  C7  C8  C9  C10 C11 C12 C13 C14 C15
LW R1, 0(R2)      F   D   X   M   W
DADDI R1, R1, 1       F   D   S   X   M   W
SW R1, 0(R2)              F   S   D   X   M   W
DADDI R2, R2, 4                   F   D   X   M   W
DADDI R4, R4, -4                      F   D   X   M   W
BNEZ R4, Loop                             F   D   X   M   W
LW R1, 0(R2)                                              F   D   X   M   W

The last cycle of an iteration is overlapped with the first cycle of the next, so it is not counted until the end. Therefore, the first 98 iterations take 10 cycles each, while the last iteration takes 11 cycles.
Therefore, the total time taken for the code to execute is 98 x 10 + 11 = 991 clock cycles.

Grading: Same as problem (a).

(c) [3 points] How does the branch delay slot improve performance? Point out where in your solution for part b it would be beneficial.

Solution: The branch delay slot is a place after the branch instruction for an instruction that will be executed regardless of whether the branch is taken or not. By placing such an instruction after the branch and always executing it, we can do useful work while we are still calculating whether the branch is taken and what the target address is. More specifically, in part b, the LW instruction enters the IF stage when the branch is in the WB stage (the LW cannot enter the IF stage until this point because we don’t know the branch target until after the MEM stage). If we had a branch delay slot, we could have fit an extra instruction between the two with no penalty.

Grading: 2 points for explanation. 1 point for pointing out where in part b it is useful.

(d) [3 points] Why does static branch prediction improve performance over no branch prediction?

Solution: It allows the processor to fetch an instruction and put it into the pipeline right after a branch instead of waiting for the branch to resolve. If the branch prediction is correct, then nothing needs to be undone and we save a few clock cycles. If it mispredicts, we flush the pipeline and then fetch the correct instruction, so in this case it is no worse than not predicting at all.

Grading: 3 points for explanation.

4. Hazards [8 points]

Consider a pipeline with the following structure: IF ID EX MEM WB. Assume that the EX stage is 1 cycle long for all ALU operations, loads and stores. Also, the EX stage is 3 cycles long for the FP add, and 6 cycles long for the FP multiply. The pipeline supports full forwarding. All other stages in the pipeline take one cycle each. The branch is resolved in the ID stage. WAW hazards are resolved by stalling the later instruction.
For the following code, list all the data hazards that cause stalls. State the type of data hazard and give a brief explanation why each hazard occurs. (A quick inspection should be ok. You don’t need to do a thorough pipeline diagram like in question 3.)

    loop: L.D    F0, 0(R1)      #1
          L.D    F2, 8(R1)      #2
          L.D    F4, 16(R1)     #3
          L.D    F6, 24(R1)     #4
          MULT.D F8, F6, F0     #5
          ADD.D  F10, F4, F0    #6
          ADD.D  F8, F2, F0     #7
          S.D    0(R2), F8      #8
          DADDI  R2, R2, 8      #9
          S.D    8(R2), F10     #10
          DSUBI  R1, R1, 32     #11
          BNEZ   R1, loop       #12

Solution:

(a) RAW hazard between instructions 4 and 5: instruction 5 (MULT.D) needs the result of instruction 4 (L.D) before it is available.

(b) WAW hazard between instructions 5 and 7: instruction 7 (ADD.D) would write back to the same register (F8) before instruction 5 (MULT.D) writes back.

(c) RAW hazard between instructions 7 and 8: instruction 8 (S.D) stores the value computed by instruction 7 (ADD.D).

(d) RAW hazard between instructions 11 and 12: instruction 12 (BNEZ) uses the result of instruction 11 (DSUBI) to determine the branch outcome. This is a hazard because branches are resolved in the ID stage. The BNEZ ID stage occurs at the same time as the DSUBI EX stage, so forwarding cannot eliminate this hazard.

Grading: 2 points per hazard (1 point for type, 1 point for reason).

5. Graduate Problem, pipeline analysis [14 points]
(Graduate students are required to solve this problem. Undergraduate students receive no additional points for answering it.)

For these problems, we will explore a pipeline for a register-memory architecture. The architecture has two instruction formats: a register-register format and a register-memory format. In the register-memory format, one of the operands for an ALU instruction could come from memory. There is a single memory-addressing mode (offset + base register).
The only non-branch register-memory instructions available have the format:

    Op Rdest, Rsrc1, Rsrc2
or
    Op Rdest, Rsrc1, MEM

where Op is one of the following: Add, Subtract, And, Or, Load (in which case Rsrc1 is ignored), or Store. Rsrc1, Rsrc2, and Rdest are registers. MEM is a (base register, offset) pair. Branches compare two registers and, depending on the outcome of the comparison, move to a target address. The target address can be specified as a PC-relative offset or in a register (with no offset).

Assume that the pipeline structure of the machine is as follows:

    IF RF ALU1 MEM ALU2 WB

The first ALU stage is used for effective address calculation for memory references and branches. The second ALU stage is used for operations and branch comparison. RF is both the decode and register-fetch stage. Assume that when a register read and a register write of the same register occur in the same cycle, the write data is forwarded.

(a) [4 points] Find the number of adders, counting any adder or incrementor, needed to minimize the number of structural hazards. Justify why you need this number of adders.

Solution: We need three adders: one for each of the two ALUs, and one to increment the PC.

Grading: 1 point for each adder, 1 bonus point for the correct answer.

(b) [4 points] Find the number of register read and write ports and memory read and write ports needed to minimize the number of structural hazards. Justify why you need this number of ports for the register file and memory.

Solution: The register file is used in two pipeline stages; we have to sum the ports required by both stages to find out how many ports we must have to avoid structural hazards. In the RF stage, we need up to three reads because branch instructions can have three register specifiers. In the WB stage, we need one write. We need three read ports and one write port for the register file.
Memory is accessed in two stages: IF, to read the next instruction from memory, and MEM, which can either read or write memory. So we need two read ports and one write port for memory.

Grading: 2 points for register ports (0.5 point for each port), 2 points for memory ports (0.5 point for each port, 0.5 bonus for the correct answer).

(c) [3 points] Will data forwarding from the ALU2 stage to any of ALU1, MEM, or ALU2 stages reduce or avoid stalls? Explain your answer for each stage.

Solution: The result of ALU2 could be used in the ALU1 stage or the ALU2 stage, so forwarding to those stages is beneficial. There are also instances where ALU2 to MEM forwarding is required to avoid a stall. E.g.:

    ADD R1 R2 R3
    Some other instruction
    STORE R1 0(R4)

If there is forwarding from ALU2 to MEM, then we will avoid a stall.

Grading: 1 point for each stage.

(d) [3 points] Will data forwarding from the MEM stage to any of ALU1, MEM, or ALU2 stages reduce or avoid stalls? Explain your answer for each stage.

Solution: The result of a memory access could be used in the ALU1 stage, so forwarding to the ALU1 stage is beneficial. Forwarding to ALU2 is not needed since ALU2 comes after MEM. MEM to MEM forwarding is also required. E.g.:

    LOAD R1 0(R2)
    STORE R1 0(R3)

Grading: 1 point for each stage.

...
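The arithmetic in problems 2 and 3 can be re-checked with a short script. This is a minimal sketch under the assumptions stated in those problems; the helper names relative_cpu_time and total_cycles are hypothetical and do not come from the handout.

```python
def relative_cpu_time(ic_ratio, cpi, clock_ratio):
    """CPU time relative to a baseline with CPI = 1 (CPU Time = IC * CPI * Clk)."""
    return ic_ratio * cpi * clock_ratio


def total_cycles(iterations, cycles_per_iteration, cycles_last_iteration):
    """All iterations but the last overlap their final cycle with the next one."""
    return (iterations - 1) * cycles_per_iteration + cycles_last_iteration


# Problem 2: 30% loads/stores, of which one third are eliminated
ic_ratio = 1.0 - 0.30 * (1.0 / 3.0)
# 95% of instructions take 1 cycle, 5% (calls/returns) take 2 cycles
cpi_new = 0.95 * 1 + 0.05 * 2
# Clock cycle time is 5% lower
cpu_ratio = relative_cpu_time(ic_ratio, cpi_new, 0.95)
print(round(cpu_ratio, 5))

# Problem 3: R4 starts at 396 and drops by 4 per iteration
iterations = 396 // 4
print(total_cycles(iterations, 15, 16))  # part (a), no forwarding
print(total_cycles(iterations, 10, 11))  # part (b), full forwarding
```

A ratio below 1 means the optimized machine is faster, and the two cycle totals correspond to the no-forwarding and full-forwarding cases of problem 3.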
This note was uploaded on 04/18/2010 for the course CS 433 taught by Professor Harrison during the Fall '08 term at University of Illinois, Urbana Champaign.