University of California, Berkeley
College of Engineering, Computer Science Division | EECS
Spring 1998, D.A. Patterson

CS252 Graduate Computer Architecture
Quiz 1 Solutions

Notes for future semesters: This quiz was long. If we were going to give this quiz again, we would probably drop the third part of Question 2, and parts b, c, and i of Question 3.

Question 1: Calculate Your Cache
A certain system with a 350 MHz clock uses separate data and instruction caches and a unified second-level cache. The first-level data cache is a direct-mapped, write-through, write-allocate cache with 8 KBytes of data total and 8-Byte blocks, and has a perfect write buffer that never causes any stalls. The first-level instruction cache is a direct-mapped cache with 4 KBytes of data total and 8-Byte blocks. The second-level cache is a two-way set-associative, write-back, write-allocate cache with 2 MBytes of data total and 32-Byte blocks.

The first-level instruction cache has a miss rate of 2%. The first-level data cache has a miss rate of 15%. The unified second-level cache has a local miss rate of 10%. Assume that 40% of all instructions are data memory accesses; 60% of those are loads, and 40% are stores. Assume that 50% of the blocks in the second-level cache are dirty at any time. Assume that there is no optimization for fast reads on an L1 or L2 cache miss. All first-level cache hits cause no stalls. The second-level hit time is 10 cycles; that means the L1 miss penalty, assuming a hit in the L2 cache, is 10 cycles. Main memory access time is 100 cycles to the first bus width of data; after that, the memory system can deliver consecutive bus widths of data on each following cycle. Outstanding non-consecutive memory requests cannot overlap; an access to one memory location must complete before an access to another memory location can begin. There is a 128-bit bus from memory to the L2 cache, and a 64-bit bus from both L1 caches to the L2 cache. Assume a perfect TLB that never causes any stalls.

a) (2 points) What percent of all data memory references cause a main memory access (i.e., main memory is accessed before the memory request is satisfied)? First show the equation, then the numeric result.
If you did not treat all stores as L1 misses:

  L1 miss rate x L2 local miss rate = 0.15 x 0.10 = 1.5%

If you treated all stores as L1 misses (every write-through store accesses the L2):

  (% of data refs that are writes x L2 local miss rate) + (% of data refs that are reads x L1 miss rate x L2 local miss rate)
  = 0.4 x 0.10 + 0.6 x 0.15 x 0.10 = 4.9%

b) (3 points) How many bits are used to index each of the caches? Assume the caches are presented physical addresses.

  Data: 8 KB / 8 B = 1024 blocks -> 10 bits
  Inst: 4 KB / 8 B = 512 blocks -> 9 bits
  L2:   2 MB / 32 B = 64K blocks = 32K sets (two-way) -> 15 bits
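The answers to parts (a) and (b) can be reproduced with a short calculation. This is our own cross-check sketch, not part of the original solution; all parameters come from the problem statement.

```python
from math import log2

l1d_miss, l2_miss = 0.15, 0.10
loads, stores = 0.6, 0.4        # fractions of data references

# (a) Percent of data references that reach main memory.
pct_no_store_miss = l1d_miss * l2_miss * 100                               # 1.5%
pct_store_as_miss = (stores * l2_miss + loads * l1d_miss * l2_miss) * 100  # 4.9%

# (b) Index bits = log2(number of sets) = log2(blocks / associativity).
l1d_index = int(log2(8 * 1024 // 8))               # 1024 blocks, direct-mapped
l1i_index = int(log2(4 * 1024 // 8))               # 512 blocks, direct-mapped
l2_index = int(log2(2 * 1024 * 1024 // 32 // 2))   # 64K blocks, 2-way -> 32K sets
```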
c) (3 points) How many cycles can the longest possible data memory access take? Describe briefly the events that occur during this access.

L1 miss, L2 miss, writeback of a dirty L2 block: 1 + 10 + 2 x 101 = 213 cycles. Note that the time to transfer an L2 cache line to or from memory is 101 cycles: the first 16 B arrive in 100 cycles; the next 16 B arrive on the following cycle.

d) (4 points) What is the average memory access time in cycles, including instruction and data memory references? First show the equation, then the numeric result.

If you did not treat all stores as L1 misses:

  AMATtotal = (1/1.4) AMATinst + (0.4/1.4) AMATdata
  AMAT = L1 hit time + L1 miss rate x (L2 hit time + L2 miss rate x mem transfer time)
  AMATinst = 1 + 0.02 x (10 + 0.10 x 1.5 x 101) = 1.503
  AMATdata = 1 + 0.15 x (10 + 0.10 x 1.5 x 101) = 4.7725
  AMATtotal = 2.44
Note that the mem transfer time is multiplied by 1.5 to account for writebacks in the L2 cache (50% of L2 misses must also write back a dirty block).

If you treated all stores as L1 misses:

  AMATtotal = (1/1.4) AMATinst + (0.24/1.4) AMATloads + (0.16/1.4) AMATstores
  AMAT = L1 hit time + L1 miss rate x (L2 hit time + L2 miss rate x mem transfer time)
  AMATinst = 1 + 0.02 x (10 + 0.10 x 1.5 x 101) = 1.503
  AMATloads = 1 + 0.15 x (10 + 0.10 x 1.5 x 101) = 4.7725
  AMATstores = 1 + 1.0 x (10 + 0.10 x 1.5 x 101) = 26.15
  AMATtotal = 4.88
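Parts (c) and (d) can be cross-checked numerically. This sketch is our own addition; the 101-cycle line transfer time and 1.5 writeback factor follow the solution above.

```python
# Memory transfer time per L2 line: 100 cycles for the first 16 B (128-bit bus),
# plus 1 cycle for the second 16 B of the 32 B block.
mem_xfer = 100 + 1
miss_penalty = 10 + 0.10 * 1.5 * mem_xfer    # L2 hit time + L2 miss w/ writeback factor

amat_inst = 1 + 0.02 * miss_penalty          # 1.503
amat_data = 1 + 0.15 * miss_penalty          # 4.7725 (loads, or all data refs in variant 1)
amat_store = 1 + 1.00 * miss_penalty         # 26.15, if every store misses L1

# Weights: 1 instruction fetch + 0.4 data references per instruction (0.24 loads, 0.16 stores).
amat_v1 = (amat_inst + 0.4 * amat_data) / 1.4
amat_v2 = (amat_inst + 0.24 * amat_data + 0.16 * amat_store) / 1.4

# Part (c): worst-case single access = L1 miss + L2 hit time + dirty writeback + line fill.
longest = 1 + 10 + 2 * mem_xfer              # 213 cycles
```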
Note that the mem transfer time is multiplied by 1.5 to account for writebacks in the L2 cache.

Question 2: Tomasulo's Revenge
Using the DLX code shown below, show the state of the reservation stations, reorder buffer, and FP register status for a speculative processor implementing Tomasulo's algorithm. Assume the following:

- Only one instruction can issue per cycle.
- The reorder buffer has 8 slots, and it implements the functionality of the load buffers and store buffers.
- All FUs are fully pipelined.
- There are 2 FP multiply reservation stations, 3 FP add reservation stations, and 3 integer reservation stations; the integer stations also execute load and store instructions.
- No exceptions occur during the execution of this code.
- All integer operations require 1 execution cycle; memory requests occur and complete in this cycle. For this problem, assume that, barring structural hazards, loads issue in one cycle, execute in the next, write in the third, and a dependent instruction can start execution on the fourth.
- All FP multiply operations require 4 execution cycles.
- All FP addition operations require 2 execution cycles.
- On a CDB write conflict, the instruction that issued earlier gets priority.
- Execution for a dependent instruction can begin on the cycle after its operand is broadcast on the CDB.
- If any item changes from "Busy" to "Not Busy", you should update the "Busy" column to reflect this, but you should not erase any other information in the row unless another instruction then overwrites that information.
- Assume that all reservation stations, reorder buffer entries, and functional units were empty and not busy when the code shown below began execution.
- The "Value" column gets updated when the value is broadcast on the CDB.
- Integer registers are not shown, and you do not have to show their state.
a) (4 points) The tables below show the state after the cycle in which the second SUBI from the code below issued. Show the state after the next cycle.
Code:

Lp: LD    F0, 0(R1)
    LD    F2, 0(R2)
    MULTD F4, F0, F2
    ADDD  F6, F0, F0
    SUBI  R1, R1, 8
    SUBI  R2, R2, 8
    ADDI  R3, R3, 1

Reservation stations:

Name   Busy  Op     Vj  Vk  Qj  Qk  Dest
Add1   N     ADDD   F0  F0          4
Add2
Add3
Mult1  Y     MULTD  F0  F2          3
Mult2
Int1   Y     SUBI   R2  8           6
Int2   Y     ADDI   R3  1           7
Int3   Y     SUBI   R1  8           5

Reorder buffer:

Entry  Busy  Instruction        State    Destination  Value
1      N     LD    F0, 0(R1)    Commit   F0           Mem[0(R1)]
2      N     LD    F2, 0(R2)    Commit   F2           Mem[0(R2)]
3      Y     MULTD F4, F0, F2   Execute  F4
4      Y     ADDD  F6, F0, F0   Write    F6           F0 + F0
5      Y     SUBI  R1, R1, 8    Execute  R1
6      Y     SUBI  R2, R2, 8    Execute  R2
7      Y     ADDI  R3, R3, 1    Issue    R3
8

FP register status:

Field     F0  F2  F4  F6  F8  F10  ...  F30
Reorder#  1   2   3   4
Busy      N   N   Y   Y
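As a cross-check on the cycle numbering behind this state, the issue/execute-start/write cycles implied by the assumptions (one issue per cycle, the stated load timing, 4-cycle multiplies, 2-cycle adds, and a single CDB that favors the earlier-issued instruction on a conflict) can be sketched in a few lines of Python. This tiny timing model is our own illustration, not part of the original solution; integer and memory operands are assumed always ready.

```python
# (op, destination, FP source operands) for the part (a) code, in program order.
code = [
    ("LD",    "F0", []),
    ("LD",    "F2", []),
    ("MULTD", "F4", ["F0", "F2"]),
    ("ADDD",  "F6", ["F0", "F0"]),
    ("SUBI",  "R1", []),            # integer operands assumed ready at issue
    ("SUBI",  "R2", []),
    ("ADDI",  "R3", []),
]
LAT = {"LD": 1, "MULTD": 4, "ADDD": 2, "SUBI": 1, "ADDI": 1}

write_of = {}      # register -> cycle its value is broadcast on the CDB
cdb_busy = set()   # cycles on which the single CDB is already claimed
timing = []        # (issue, execute start, write) per instruction
for i, (op, dest, srcs) in enumerate(code):
    issue = i + 1                                            # one issue per cycle
    ready = max([write_of[s] + 1 for s in srcs], default=0)  # cycle after CDB broadcast
    start = max(issue + 1, ready)
    req = start + LAT[op]                  # earliest cycle the result could write
    while req in cdb_busy:                 # earlier-issued instruction holds the CDB
        req += 1
    cdb_busy.add(req)
    write_of[dest] = req
    timing.append((issue, start, req))
```

The model reproduces the state shown: after cycle 7 the ADDD has just written (cycle 7), both SUBIs have executed but not written, and the ADDI has only issued.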
b) (8 points) The tables below show the state during the cycle in which the second MULTD from the code below issued. Show the state after two cycles.
Code:

    MULTD F0, F2, F4
    ADDD  F6, F6, F0
    ADDD  F2, F2, F8
    LD    F4, 0(R2)
    ADDI  R1, R1, 8
    MULTD F8, F10, F12
    ADDD  F4, F4, F10
    ADDI  R2, R2, 1

Reservation stations:

Name   Busy  Op     Vj   Vk   Qj  Qk  Dest
Add1   Y     ADDD   F6   F0           2
Add2   N     ADDD   F2   F8           3
Add3   Y     ADDD   F4   F10          7
Mult1  N     MULTD  F2   F4           1
Mult2  Y     MULTD  F10  F12          6
Int1   N     LD     R2   0            4
Int2   Y     ADDI   R1   8            5
Int3   Y     ADDI   R2   1            8

Reorder buffer:

Entry  Busy  Instruction         State    Destination  Value
1      N     MULTD F0, F2, F4    Commit   F0           F2 * F4
2      Y     ADDD  F6, F6, F0    Execute  F6
3      Y     ADDD  F2, F2, F8    Write    F2           F2 + F8
4      Y     LD    F4, 0(R2)     Write    F4           Mem[0(R2)]
5      Y     ADDI  R1, R1, 8     Execute  R1
6      Y     MULTD F8, F10, F12  Execute  F8
7      Y     ADDD  F4, F4, F10   Issue    F4
8      Y     ADDI  R2, R2, 1     Issue    R2

FP register status:

Field     F0  F2  F4  F6  F8  F10  ...  F30
Reorder#  1   3   7   2   6
Busy      N   Y   Y   Y   Y
c) (8 points) The tables below show the state after the cycle in which the SUBD from the code below issued. Show the state after the next four cycles.
Code:

    ADDD  F4, F0, F0
    SUBD  F4, F4, F2
    ADDI  R2, R2, 1
    ADDI  R3, R3, 1
    ADDD  F2, F6, F8
    MULTD F0, F6, F8

Reservation stations:

Name   Busy  Op     Vj  Vk  Qj  Qk  Dest
Add1   N     ADDD   F0  F0          1
Add2   Y     SUBD   F4  F2          2
Add3   Y     ADDD   F6  F8          5
Mult1  Y     MULTD  F6  F8          6
Mult2
Int1   N     ADDI   R2  1           3
Int2   N     ADDI   R3  1           4
Int3

Reorder buffer:

Entry  Busy  Instruction        State    Destination  Value
1      N     ADDD  F4, F0, F0   Commit   F4           F0 + F0
2      Y     SUBD  F4, F4, F2   Execute  F4
3      Y     ADDI  R2, R2, 1    Write    R2           R2 + 1
4      Y     ADDI  R3, R3, 1    Write    R3           R3 + 1
5      Y     ADDD  F2, F6, F8   Execute  F2
6      Y     MULTD F0, F6, F8   Issue    F0
7
8

FP register status:

Field     F0  F2  F4  F6  F8  F10  ...  F30
Reorder#  6   5   2
Busy      Y   Y   Y

Question 3: Vector vs. DSP Showdown
Examine the two architectures below. The first architecture is a 25 MHz 3-stage DSP processor. A block diagram showing some of the fully-bypassed datapath is shown below. The three stages are fetch; decode, where branches are evaluated and the PC updated; and execute, where memory and register writes also occur.

[Figure 1: The DSP block diagram -- instruction RAM, control, registers W/X/Y/Z, multiplier, ALU, shifter, and accumulator S]

The processor is able to multiply, accumulate, and shift during its execute stage. It has the same load, store, and branch instructions as DLX. It also includes an LT instruction, which loads a value into a register from memory and decrements the base register to the next element. The arithmetic operations are slightly different:

- Register 0 always contains the value 0.
- Register 1 always contains the value 1.
- The result from the shifter is always written to the accumulator on arithmetic operations.
- Operations can be specified as MAC W, X, Y, Z, S, where W is the register to be written; X and Y are registers that go to the multiplier; Z is the register that goes to the ALU; and S specifies the amount to right-shift the result.
- Operations can also be specified as MACA W, X, Y, S, where W is the register to be written; X and Y are registers that go to the multiplier; the accumulator goes to the ALU; and S specifies the amount to right-shift the result.

The second architecture is a 100 MHz vector processor with an MVL of 64 elements. It has one FP add/subtract FU, one FP multiply/divide FU, and a single memory FU. The startup overhead is 5 cycles for add, subtract, multiply, and divide instructions, and 10 cycles for memory instructions. It supports flexible chaining but not tailgating.
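The MAC/MACA semantics described above can be modeled in a few lines. This is our own illustrative sketch, not part of the quiz: the register file and the W destination are omitted, and only the arithmetic path (multiplier, ALU, shifter, accumulator) is shown.

```python
# Sketch of the DSP arithmetic operations: the shifter result is always
# written to the accumulator, and MACA feeds the accumulator into the ALU.
acc = 0  # the accumulator

def mac(x, y, z, s):
    """MAC: (x*y + z) >> s; the result also updates the accumulator."""
    global acc
    acc = (x * y + z) >> s
    return acc

def maca(x, y, s):
    """MACA: like MAC, but the accumulator supplies the ALU operand."""
    return mac(x, y, acc, s)
```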
Here is the code for the DSP:

LP: LT   R2, 0(R5)           ; Load R2 with new value
    MAC  R0, R10, R2, R0, 0  ; Perform the calculation
    MACA R0, R11, R2, 0
    MACA R2, R12, R2, 0
    BNEZ R5, LP
    SW   R2, -4(R5)          ; Delayed branch slot

a) (2 points) What is the peak performance, in results per second, of the above three-tap filter?

3 results per loop, 6 instructions per loop, at 25 MHz: 12.5M results per second.

b) (2 points) What would be the peak performance, in results per second, of the above code if it were a five-tap filter?

5 results per loop, 8 instructions per loop, at 25 MHz: 15.625M results per second.
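The peak-rate arithmetic for parts (a) and (b) can be checked directly. A sketch (our own addition), counting results per loop iteration as in the solution:

```python
# Peak rate = clock rate * (results per loop) / (instructions per loop),
# since the 3-stage DSP can sustain one instruction per cycle at peak.
clock_hz = 25e6
three_tap = clock_hz * 3 / 6   # 6-instruction loop
five_tap = clock_hz * 5 / 8    # 8-instruction loop (two more MACAs)
```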
c) (3 points) Translate the following DLX code to code that will operate on this DSP. Assume that all floating-point calculations below can be done in fixed point on the DSP. Do not worry about round-off error from converting between floating point and fixed point. Assume that for the DLX code, F0 contains 0, F2 contains 0.5, and F4 contains 2.0. Assume that for the DSP code you will write, R2 contains the value 2. Assume for both that register R5 contains the correct initial loop count.
LP: LW      R3, 0(R5)    ; Load R3 with the new value
    MOVI2FP F6, R3       ; Move the value from R3 to F6
    CVTI2F  F6, F6       ; Convert integer value to floating point
    MULTF   F8, F4, F6   ; Multiply by 2
    ADDD    F10, F8, F0  ; Add accumulator to value
    MULTF   F0, F10, F2  ; Divide value by 2
    CVTF2I  F12, F0      ; Convert to integer representation
    MOVFP2I R3, F12      ; Move it to integer registers
    SW      R3, 0(R5)    ; Store value back to mem
    ADDI    R5, R5, -4   ; Point to next element
    BNEZ    R5, LP       ; If not done, branch back

The equivalent DSP code:

    MAC  R0, R0, R0, R0, 0  ; Clear accumulator
LP: LT   R3, 0(R5)          ; Load value
    MACA R3, R3, R2, 1      ; R3 <- (R3*2 + acc) / 2
    BNEZ R5, LP             ; Branch if not done
    SW   R3, -4(R5)         ; Delayed branch slot
Here is the code for the vector machine:
LP: LV     V1, R5       ; Load V1 with new value
    MULTSV V2, F0, V1   ; Perform the calculation
    MULTSV V3, F1, V1
    ADDV   V2, V2, V3
    MULTSV V3, F2, V1
    ADDV   V2, V2, V3
    SV     V2, R5
    SUBI   R5, R5, 8
    BNEZ   R5, LP

d) (3 points) Show the convoys of vector instructions for the above code. Follow the timing examples in the book. Draw lines to show the convoys on the existing code shown above.

The convoys were shown with lines drawn on the code above. With flexible chaining there are three convoys (three chimes): {LV, MULTSV}, {MULTSV, ADDV}, {MULTSV, ADDV, SV}; each new MULTSV starts a new convoy because there is only one multiply FU.

e) (4 points) Show the execution time in clock cycles of this loop with n elements (Tn); assume Tloop = 15. Show the equation, and give the value of the execution time for n = 64.
Tn = ceil(n / MVL) * (Tloop + Tstart) + n * Tchime

T64 = ceil(64/64) * (Tloop + LVstart + MULTSVstart + MULTSVstart + ADDVstart + MULTSVstart + ADDVstart + SVstart) + 64 * 3
    = (15 + 10 + 5 + 5 + 5 + 5 + 5 + 10) + 192

T64 = 252
f) (3 points) What is R∞ for this loop?

R∞ = lim[n->∞] (Operations per iteration * Clock rate) / (Clock cycles per iteration)

lim[n->∞] Clock cycles per iteration = lim[n->∞] Tn / n = lim[n->∞] (3n + 60 * ceil(n/64)) / n = 3.9375

R∞ = (5 * 100 MHz) / 3.9375 = 127 MFLOPS
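Parts (e) and (f) can be verified numerically. A sketch (our own addition), using the startup overheads and chime count from the solution above:

```python
from math import ceil

T_LOOP = 15
T_START = 10 + 5 + 5 + 5 + 5 + 5 + 10   # LV + 3 MULTSV + 2 ADDV + SV startups = 45
CHIMES = 3                               # three convoys
MVL = 64

def T(n):
    """Execution time in cycles for n elements."""
    return ceil(n / MVL) * (T_LOOP + T_START) + CHIMES * n

t64 = T(64)                                            # 252 cycles
cycles_per_elem = CHIMES + (T_LOOP + T_START) / MVL    # limit of T(n)/n as n -> infinity
r_inf = 5 * 100 / cycles_per_elem                      # 5 FLOPs/element at 100 MHz, MFLOPS
```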
g) (3 points) List 6 characteristics of DSP instruction set architectures that differ from general-purpose microprocessors.

For g, h, i, and j, there are many possible answers besides what is listed here.

- Autoincrement addressing
- Circular addressing
- Bit-reverse (FFT-specific) addressing
- Saturating overflow
- Fast multiply-add
- Narrow data
- Fast loops

h) (1 point) Which of those characteristics are supported in vector architectures?

- Autoincrement addressing
- Multiply-add (via chaining)
- Fast loops

i) (1 point) Which of the unsupported characteristics could be handled in software?

- Circular addressing

j) (2 points) What changes to the hardware would you make to handle the remaining characteristics?

- Saturating overflow
- Narrow data support
- FFT (bit-reverse addressing) support