University of California, Berkeley
College of Engineering, Computer Science Division | EECS
Fall 1997, D.A. Patterson
Midterm I - SOLUTIONS, October 8, 1997
CS152 Computer Architecture and Engineering

You are allowed to use a calculator and one 8.5" x 11" double-sided page of notes. You have 3 hours. Good luck!
Your Name:
SID Number:
Discussion Section:

Question   1     2     3     4     Total
Score      /20   /10   /10   /30   /70

Question 1
You are running a benchmark on your company's processor, which runs at 200 MHz and has these characteristics:

Instruction Type          Frequency (%)   Cycles
Arithmetic and Logical    40              1
Load and Store            30              2
Branches                  20              3
Floating Point            10              4

Your company is considering offering a cheaper, lower-performance version of the processor. Their plan is to remove some of the floating point hardware to reduce the die size of the chip. The wafer on which the chip is produced has a diameter of 10 cm, a cost of $1000, and 1 defect per cm^2. The manufacturing process results in a 90% wafer yield and a value of 2 for \(\alpha\). The current processor has a die size of 12 mm x 12 mm. After the changes, the die size will be 10 mm x 10 mm, and floating point instructions will take 12 cycles to execute. Here are some equations you may find useful:

\[ \text{dies/wafer} = \frac{\pi \times (\text{wafer diameter}/2)^2}{\text{die area}} - \frac{\pi \times \text{wafer diameter}}{\sqrt{2 \times \text{die area}}} \]

\[ \text{die yield} = \text{wafer yield} \times \left(1 + \frac{\text{defects per unit area} \times \text{die area}}{\alpha}\right)^{-\alpha} \]

a) What is the CPI and MIPS rating of the original processor?

\[ \text{CPI} = (1 \times 0.4) + (2 \times 0.3) + (3 \times 0.2) + (4 \times 0.1) = 2 \]
\[ \text{MIPS} = 200 / 2 = 100 \]
b) What is the CPI and MIPS rating of the smaller processor?

\[ \text{CPI} = (1 \times 0.4) + (2 \times 0.3) + (3 \times 0.2) + (12 \times 0.1) = 2.8 \]
\[ \text{MIPS} = 200 / 2.8 = 71.4 \]
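The CPI and MIPS arithmetic in parts (a) and (b) can be checked with a short script. This is a minimal sketch: the dictionary keys and the function names `cpi` and `mips` are illustrative, not part of the exam.

```python
# Instruction mix as fractions of the dynamic instruction count (from the table).
MIX = {"alu": 0.40, "load_store": 0.30, "branch": 0.20, "fp": 0.10}

# Cycles per instruction class for the original and the smaller processor.
ORIGINAL_CYCLES = {"alu": 1, "load_store": 2, "branch": 3, "fp": 4}
SMALLER_CYCLES = {**ORIGINAL_CYCLES, "fp": 12}

CLOCK_MHZ = 200


def cpi(mix, cycles):
    """Weighted average of cycles per instruction over the mix."""
    return sum(mix[kind] * cycles[kind] for kind in mix)


def mips(clock_mhz, cpi_value):
    """MIPS rating = clock rate in MHz divided by CPI."""
    return clock_mhz / cpi_value


cpi_original = cpi(MIX, ORIGINAL_CYCLES)
cpi_smaller = cpi(MIX, SMALLER_CYCLES)
mips_original = mips(CLOCK_MHZ, cpi_original)
mips_smaller = mips(CLOCK_MHZ, cpi_smaller)

print(f"original: CPI={cpi_original:.1f} MIPS={mips_original:.1f}")
print(f"smaller:  CPI={cpi_smaller:.1f} MIPS={mips_smaller:.1f}")
```

Only the floating point entry changes between the two processors, so the CPI difference comes entirely from the 10% of instructions that now take 12 cycles instead of 4.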
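c) What is the old cost per (working) processor?

\[ \text{dies/wafer} = \frac{\pi (5)^2}{1.44} - \frac{\pi \times 10}{\sqrt{2 \times 1.44}} = \frac{78.54}{1.44} - \frac{31.4}{1.70} = 54.54 - 18.48 = 36.05 \approx 36 \]

\[ \text{die yield} = 0.9 \times \left(1 + \frac{1 \times 1.44}{2}\right)^{-2} = 0.9 \times (1 + 0.72)^{-2} = 0.9 \times 0.338 = 0.30 \]

\[ \text{working processors} = \lfloor 0.30 \times 36 \rfloor = 10 \]
\[ \text{die cost} = \$1000 / 10 = \$100 \]

d) What is the new cost per (working) processor?

\[ \text{dies/wafer} = \frac{\pi (5)^2}{1} - \frac{\pi \times 10}{\sqrt{2 \times 1}} = 78.54 - \frac{31.41}{1.41} = 78.54 - 22.28 = 56.26 \approx 56 \]

\[ \text{die yield} = 0.9 \times \left(1 + \frac{1 \times 1}{2}\right)^{-2} = 0.9 \times (1 + 0.5)^{-2} = 0.9 \times 0.444 = 0.40 \]

\[ \text{working processors} = \lfloor 0.40 \times 56 \rfloor = 22 \]
\[ \text{die cost} = \$1000 / 22 = \$45 \]

e) What is the improvement in price per performance?

\[ \frac{\$100 / 100}{\$45 / 71.4} = \frac{1}{0.63} = 1.59 \]

The smaller processor costs about $0.63 per MIPS against $1 per MIPS for the original, a 1.59x improvement.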
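The cost calculation in parts (c) through (e) follows directly from the two formulas given in the problem. The sketch below evaluates them in Python; the function names are illustrative, and rounding down to whole dies and whole working processors is an assumption that matches the worked numbers above. Because the script does not round intermediate values, its price/performance ratio comes out slightly below the hand-computed 1.59.

```python
import math


def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    """Dies per wafer: wafer area over die area, minus an edge-loss term."""
    radius = wafer_diameter_cm / 2
    return (math.pi * radius ** 2 / die_area_cm2
            - math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2))


def die_yield(wafer_yield, defects_per_cm2, die_area_cm2, alpha):
    """Fraction of dies that work, per the yield formula in the problem."""
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)


def cost_per_working_die(die_side_cm, wafer_cost=1000.0, wafer_diameter_cm=10.0,
                         wafer_yield=0.9, defects_per_cm2=1.0, alpha=2.0):
    """Wafer cost divided by the whole number of working dies (assumed floor)."""
    area = die_side_cm ** 2
    dies = math.floor(dies_per_wafer(wafer_diameter_cm, area))
    working = math.floor(dies * die_yield(wafer_yield, defects_per_cm2, area, alpha))
    return wafer_cost / working


old_cost = cost_per_working_die(1.2)   # 12 mm x 12 mm die
new_cost = cost_per_working_die(1.0)   # 10 mm x 10 mm die

# Price per MIPS, using the MIPS ratings from parts (a) and (b).
old_mips = 200 / 2.0
new_mips = 200 / 2.8
improvement = (old_cost / old_mips) / (new_cost / new_mips)

print(f"old cost ${old_cost:.2f}, new cost ${new_cost:.2f}, "
      f"price/performance improvement {improvement:.2f}x")
```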
Your competitors produce a chip that runs at 250 MHz and has the following characteristics for the benchmark:

Instruction Type          Frequency (%)   Cycles
Arithmetic and Logical    40              1
Load and Store            30              3
Branches                  20              3
Floating Point            10              5

f) What is the CPI and MIPS rating of your competitor's processor for this benchmark?

\[ \text{CPI} = (0.4 \times 1) + (0.3 \times 3) + (0.2 \times 3) + (0.1 \times 5) = 2.4 \]
\[ \text{MIPS} = 250 / 2.4 = 104 \]

g) Your company's advertising department wants to defend your company's motto ("The Appearance of Excellence") by advertising a higher MIPS rating for your processor than your competitor's processor. They want you to write a benchmark that gives this result. Describe an instruction mix that would accomplish this (give specific percentages of each instruction type).

Your benchmark must have a large proportion of load or store instructions, since these instructions execute in less time on your processor than on your competitor's processor. For example, a mix of half arithmetic/logical and half load/store would give your machine:

\[ \text{CPI} = (0.5 \times 1) + (0.5 \times 2) = 1.5 \]
\[ \text{MIPS} = 200 / 1.5 = 133 \]

and your competitor's machine:

\[ \text{CPI} = (0.5 \times 1) + (0.5 \times 3) = 2 \]
\[ \text{MIPS} = 250 / 2 = 125 \]
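The comparison in parts (f) and (g) can be reproduced by evaluating both machines on both instruction mixes. This is a sketch under the numbers stated above; the names `mips_rating`, `ORIGINAL_MIX`, and `LOAD_STORE_HEAVY_MIX` are illustrative.

```python
YOUR_CLOCK_MHZ = 200
COMPETITOR_CLOCK_MHZ = 250

YOUR_CYCLES = {"alu": 1, "load_store": 2, "branch": 3, "fp": 4}
COMPETITOR_CYCLES = {"alu": 1, "load_store": 3, "branch": 3, "fp": 5}

# The benchmark mix from the tables, and the 50/50 mix proposed in part (g).
ORIGINAL_MIX = {"alu": 0.40, "load_store": 0.30, "branch": 0.20, "fp": 0.10}
LOAD_STORE_HEAVY_MIX = {"alu": 0.50, "load_store": 0.50, "branch": 0.0, "fp": 0.0}


def mips_rating(clock_mhz, mix, cycles):
    """Clock rate in MHz divided by the mix-weighted CPI."""
    cpi = sum(mix[kind] * cycles[kind] for kind in mix)
    return clock_mhz / cpi


for name, mix in [("original", ORIGINAL_MIX), ("load/store heavy", LOAD_STORE_HEAVY_MIX)]:
    yours = mips_rating(YOUR_CLOCK_MHZ, mix, YOUR_CYCLES)
    theirs = mips_rating(COMPETITOR_CLOCK_MHZ, mix, COMPETITOR_CYCLES)
    print(f"{name}: yours={yours:.1f} MIPS, competitor={theirs:.1f} MIPS")
```

The load/store class is the only one where the 200 MHz part needs fewer cycles than the competitor (2 against 3), which is why shifting the mix toward loads and stores reverses the MIPS ranking.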
h) Instead, you decide to improve the compiler that is used to compile this benchmark on your processor. Your compiler reduces the branches by 50%, but it increases the number of arithmetic and logical instructions by 25%; it does not affect the number of other instructions. What is your new CPI and MIPS?

Per 100 original instructions, branches drop from 20 to 10 and arithmetic/logical instructions rise from 40 to 50, so the total is still 100:

\[ \text{CPI} = (0.5 \times 1) + (0.3 \times 2) + (0.1 \times 3) + (0.1 \times 4) = 1.8 \]
\[ \text{MIPS} = 200 / 1.8 = 111 \]

i) Using the original instruction mix, which machine is faster? Why?
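Both machines execute the same number of instructions for the benchmark, so the instruction count cancels, and the clock periods are 5 ns (200 MHz) and 4 ns (250 MHz):

\[ \text{Speedup} = \frac{\text{ExecutionTime}_{excellence}}{\text{ExecutionTime}_{competitor}} = \frac{\text{CPI}_{excellence} \times T_{CLK,excellence}}{\text{CPI}_{competitor} \times T_{CLK,competitor}} = \frac{2 \times 5}{2.4 \times 4} = 1.0416 \]

Thus, the competitor's processor is 1.0416 times faster, or 4.16% faster.

Question 2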
The ALU presented in the book supported set less than (slt) using the sign bit of the adder (\(a < b \iff a - b < 0\)). Let's try the set less than operation using the values \(-7_{ten}\) and \(6_{ten}\). To make it simpler to follow the example, let's limit the binary representation to 4 bits: \(1001_{two}\) and \(0110_{two}\).

\[ 1001_{two} - 0110_{two} = 1001_{two} + 1010_{two} = 0011_{two} \]

The result suggests that \(-7 > 6\), which is clearly wrong. Hence we must factor in overflow in the decision. Modify the given schematic below of the 1-bit ALU for the most significant bit to handle slt correctly. You have to fix the set output, which is the same as ADDout in this schematic. Explain your modifications and give the new function for the set output as well. Assume that the overflow signal is correct and can be used and that the gates available to you are: inverters, AND and OR.
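The 4-bit example above can be modeled in a few lines to see exactly when the sign bit alone gives the wrong answer. This is a sketch, not the exam's schematic: the helper names are illustrative, and overflow is computed here from the operand and result signs as a stand-in for the overflow detection block that the problem says may be assumed correct. The sketch also evaluates the two corrected forms of set given in the solution to this question.

```python
BITS = 4
MASK = (1 << BITS) - 1


def msb_signals(a, b):
    """Return (add_out, overflow, sign_a) for the BITS-wide subtraction a - b."""
    a_bits, b_bits = a & MASK, b & MASK
    diff = (a_bits + (~b_bits & MASK) + 1) & MASK   # a + (not b) + 1
    sign_a = a_bits >> (BITS - 1)
    sign_b = b_bits >> (BITS - 1)
    add_out = diff >> (BITS - 1)                     # sign bit of the adder
    # Subtraction overflows only when the operand signs differ and the
    # result sign differs from the sign of a.
    overflow = int(sign_a != sign_b and add_out != sign_a)
    return add_out, overflow, sign_a


def set_and_or(a, b):
    """set = (not overflow AND add_out) OR (overflow AND sign of a)."""
    add_out, overflow, sign_a = msb_signals(a, b)
    return ((1 - overflow) & add_out) | (overflow & sign_a)


def set_xor(a, b):
    """Alternative: set = overflow XOR add_out."""
    add_out, overflow, _ = msb_signals(a, b)
    return overflow ^ add_out


pairs = [(a, b) for a in range(-8, 8) for b in range(-8, 8)]
naive_failures = [(a, b) for a, b in pairs if msb_signals(a, b)[0] != int(a < b)]
overflow_pairs = [(a, b) for a, b in pairs if msb_signals(a, b)[1]]

print("signals for (-7, 6):", msb_signals(-7, 6))
print("sign-bit-only slt is wrong for", len(naive_failures), "of", len(pairs), "pairs")
print("failures coincide with overflow:", naive_failures == overflow_pairs)
```

Enumerating all 256 operand pairs is cheap at 4 bits and shows that the sign-bit-only result is wrong exactly in the overflow cases, which is why the correction only needs the overflow signal.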
[Figure: schematic of the 1-bit ALU for the most significant bit, showing inputs A, B, Less, CarryIn, Binvert and Operation, the adder output ADDout, the Result output, and the overflow detection block with its set and Overflow outputs. The solution adds gates that compute set from Overflow, ADDout and A.]

When a and b have the same sign there can be no overflow, so set is the output of the adder (just like before). An overflow can happen only when a > 0 and b < 0, or a < 0 and b > 0. In the first case set should be 0 (a > b), while in the second one it should be 1 (a < b). So the function for set is:

\[ \text{set} = \overline{\text{overflow}} \cdot \text{adder output} + \text{overflow} \cdot a \]

where a is the MS bit of the first operand, in other words its sign. This can be implemented with 1 inverter, 2 AND gates and 1 OR gate, as shown in the schematic above.

Another valid solution is to XOR the overflow and ADDout signals. When there is no overflow, set is the same as ADDout. When overflow occurs, set is the inverse of ADDout.

Question 3
The figure below presents the portion of the schematic of a full adder that calculates the carry out signal.

[Figure: gate-level schematic of the carry-out logic, with inputs A, B and CarryIn and output CarryOut.]

Assume the following characteristics for the gates:

AND2: Input load = 150 fF, propagation delay low-to-high TPlh = 0.2 ns, propagation delay high-to-low TPhl = 0.5 ns, load dependent delay TPlhf = TPhlf = 0.002 ns/fF.

OR2: Input load = 100 fF, propagation delay low-to-high TPlh = 0.5 ns, propagation delay high-to-low TPhl = 0.1 ns, load dependent delay TPlhf = TPhlf = 0.002 ns/fF.
Identify the critical path in this schematic and fully characterize its delay using the linear delay model. Assume that the last OR2 gate drives a capacitance of 300 fF.

The critical path for CarryOut consists of 3 gates in series: one AND followed by two OR gates. There is no need to calculate the delay for the path with 2 gates (one AND and one OR) since, excluding the additional OR gate, the two paths have the same gates with the same loads.

\[ TP_{total} = TP_{AND} + TP_{OR1} + TP_{OR2} \]

For each gate:

\[ TP = TP_{inherent} + TP_{load\,dependent} \times C_{load} \]
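The per-gate formula above can be evaluated along the three-gate path with a few lines of code. This is a minimal sketch assuming the gate parameters stated in the problem and the loads implied by the schematic: the AND gate and the first OR gate each drive one OR2 input (100 fF), and the last OR gate drives 300 fF. The dictionary layout and the name `path_delay_ns` are illustrative.

```python
# Intrinsic delays in ns and load-dependent delay in ns/fF, from the problem.
GATES = {
    "AND2": {"tplh": 0.2, "tphl": 0.5, "per_ff": 0.002},
    "OR2": {"tplh": 0.5, "tphl": 0.1, "per_ff": 0.002},
}

# Critical path: (gate type, capacitive load in fF driven by that gate).
CRITICAL_PATH = [("AND2", 100), ("OR2", 100), ("OR2", 300)]


def path_delay_ns(path, edge):
    """Sum intrinsic plus load-dependent delay for each gate on the path.

    edge is "tplh" (low-to-high) or "tphl" (high-to-low). AND and OR gates
    are non-inverting, so the same edge propagates through the whole path.
    """
    return sum(GATES[gate][edge] + GATES[gate]["per_ff"] * load_ff
               for gate, load_ff in path)


tplh = path_delay_ns(CRITICAL_PATH, "tplh")
tphl = path_delay_ns(CRITICAL_PATH, "tphl")
print(f"TPLH = {tplh:.1f} ns, TPHL = {tphl:.1f} ns, worst case = {max(tplh, tphl):.1f} ns")
```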
For low-to-high:

\[ TP_{LH} = (0.2 + 0.002 \times 100) + (0.5 + 0.002 \times 100) + (0.5 + 0.002 \times 300) = 2.2\,\text{ns} \]

For high-to-low:

\[ TP_{HL} = (0.5 + 0.002 \times 100) + (0.1 + 0.002 \times 100) + (0.1 + 0.002 \times 300) = 1.7\,\text{ns} \]

Since the cell delay is the worst-case one, the delay of the CarryOut calculation is 2.2 ns.

Question 4
In October of 1996, Silicon Graphics introduced a new set of instructions known as MIPS Digital Media Extensions (MDMX). Similar to the Intel MMX, the MDMX specification uses a single instruction multiple data (SIMD) data path to perform parallel narrow data operations on bytes and halfwords within a single instruction. The MDMX has yet to be implemented on a commercially available microprocessor.

We will explore some MDMX ideas by extending the single cycle datapath discussed in class. Where the real MDMX uses 64-bit floating point registers, we will use the 32-bit integer registers to perform parallel operations on two half words ("Dual Halfs" instructions).

Consider two pseudo-MDMX instructions (based on the real MDMX!), ADD.DH and MAX.DH. ADD.DH adds two 16-bit signed integers in parallel. MAX.DH is more unusual. It performs two simultaneous comparisons and stores the larger results in a third register. The register transfer operations are given below. R[x][0] refers to the half word in bits 15:0 of register x, and R[x][1] refers to bits 31:16.

INSTRUCTION rd, rs, rt

ADD.DH $r1, $r2, $r3
    R[rd][0] ← R[rs][0] + R[rt][0]
    R[rd][1] ← R[rs][1] + R[rt][1]
    PC ← PC + 4

MAX.DH $r1, $r2, $r3
    for i = 0, 1
    begin
        if (R[rs][i] < R[rt][i]) then
            R[rd][i] ← R[rt][i]
        else
            R[rd][i] ← R[rs][i]
    end
    PC ← PC + 4
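The register transfer descriptions above can be mirrored by a small software model that operates on 32-bit register values. This is a sketch for illustration only: the function names are hypothetical, and wrapping each 16-bit sum on overflow is an assumption, since the problem does not say how ADD.DH treats overflow.

```python
HALF_MASK = 0xFFFF


def half(reg, i):
    """Half word i of a 32-bit register: i=0 is bits 15:0, i=1 is bits 31:16."""
    return (reg >> (16 * i)) & HALF_MASK


def signed16(value):
    """Interpret a 16-bit pattern as a signed two's complement integer."""
    return value - 0x10000 if value & 0x8000 else value


def add_dh(rs, rt):
    """ADD.DH: two independent 16-bit adds (assumed to wrap on overflow)."""
    result = 0
    for i in (0, 1):
        total = (half(rs, i) + half(rt, i)) & HALF_MASK  # no carry across halves
        result |= total << (16 * i)
    return result


def max_dh(rs, rt):
    """MAX.DH: per half word, keep the larger of the two signed values."""
    result = 0
    for i in (0, 1):
        a, b = half(rs, i), half(rt, i)
        pick = b if signed16(a) < signed16(b) else a
        result |= pick << (16 * i)
    return result


print(hex(add_dh(0x0001FFFF, 0x00010001)))
print(hex(max_dh(0x0005FFFF, 0x00030001)))
```

In the first call the low half wraps from 0xFFFF to 0x0000 without carrying into the upper half, which is the property that distinguishes ADD.DH from an ordinary 32-bit add. In the second call the low halves compare as -1 against 1, so the signed comparison picks 1.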
a) The single cycle processor developed in class is shown below. Make the necessary datapath modifications for the two MDMX instructions. You may use a 16-bit version of the 32-bit ALU. If you define your own component, be sure to specify its behavior. Label your control signals with descriptive names. To maximize your chances for partial credit, write down anything else that will help us evaluate your work (you do not need to specify the control functions until part b). You will be graded for correctness more than efficiency. An additional datapath (in case you needed it) is provided on the following page.

The meanings of some control signals are given below.

nPCSel:  0 ⇒ PC ← PC + 4
         1 ⇒ PC ← PC + 4 + SignExt(Imm16) || 00
ALUCtr:  "add", "sub", "and", "or", "slt" (signed comparison)
ExtOp:   "zero", "sign"
[Figure: the single cycle datapath with the MDMX additions. Labels recoverable from the drawing: the instruction fetch unit (nPC_sel, Imm16), the register file (Ra, Rb, RW, busA, busB, busW, RegDst, RegWr), the extender (ExtOp), the 32-bit ALU (ALUctr, ALUSrc, Zero), the data memory (MemWr, MemtoReg), and the added components: two 16-bit ALUs controlled by MDMX-ALUctr, the buses busHigh and busLow, two 3-to-1 MUXes selected by MDMXSrcH and MDMXSrcL, and a MUX selected by MDMXSel.]

b) For ADD.DH and MAX.DH, give the values of all control signals, including those you added in part (a). The control signals can be functions of other control signals, values labeled on the datapath, or don't cares. Use b[i] to denote bit i on bus b. You may use high level specifications such as if-then-else. The number of grids below is not an indication of the number of control signals you will need.

Control Line   ADD.DH   MAX.DH
nPCSel         0        0
RegDst         1        1
RegWr          1        1
ExtOp          X        X
ALUSrc         X        X
ALUCtr         X        X
MemWr          0        0
MemtoReg       0        0
MDMX-ALUctr    add      slt
MDMXSrcH       1        if (busHigh[0] == 1) 0 else 2
MDMXSrcL       1        if (busLow[0] == 1) 0 else 2
MDMXSel        1        1

Extra Credit. Sixteen 16-bit signed integers are stored in memory as follows:
0x00000000 (half-word 0)
0x00000002 (half-word 1)
0x00000004 (half-word 2)
. .
0x0000001E (half-word 15)

The following MIPS code (assuming no delay slots) finds the largest integer and stores the result in the lower 16 bits of $v0. Since we are dealing with half words, the final value of the upper 16 bits of $v0 is irrelevant.
         lh   $v0, 0x001E($zero)      #assume the last number is the largest
         addi $t0, $zero, 0x001C      #initialize the half-word pointer
LARGEST: lh   $s0, 0($t0)             #load next half word
         slt  $t1, $v0, $s0
         beq  $t1, $zero, NEXT
         add  $v0, $zero, $s0         #update $v0 if $s0 is larger
NEXT:    addi $t0, $t0, -2
         slt  $t1, $t0, $zero
         beq  $t1, $zero, LARGEST     #continue the search until pointer becomes negative
END:

Take advantage of the parallelism in MAX.DH to write a faster version of this code. Exactly how many instructions are executed in your code? What are the minimum and maximum number of instructions executed in the non-MDMX code given above?

Answer:
         lw     $v0, 0x001C($zero)
         addi   $t0, $zero, 0x0018
LARGEST: lw     $v1, 0($t0)
         max.dh $v0, $v0, $v1
         addi   $t0, $t0, -4
         slt    $t1, $t0, $zero
         beq    $t1, $zero, LARGEST
         srl    $v1, $v0, 16          #put the larger half-word in $v0 into the lower
         max.dh $v0, $v0, $v1         #16-bits of $v0
END:

Instruction counts:

MDMX         ⇒ 2 + 7*5 + 2 = 39
non-MDMX min ⇒ 2 + 15*6 = 92
non-MDMX max ⇒ 2 + 15*7 = 107
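The instruction counts above can be cross-checked by stepping through both loops in a small model that tallies one count per executed instruction. This is a sketch, not a MIPS simulator: the function names are illustrative, memory is modeled as a Python list of 16 signed values indexed by half-word number, and the placement of the two half words within a loaded word (which depends on endianness) is ignored because it does not affect the maximum.

```python
def count_scalar(values):
    """Model the non-MDMX loop; return (largest, instructions executed)."""
    executed = 2                      # lh, addi
    largest = values[15]              # half-word at 0x1E
    pointer = 0x1C
    while True:
        candidate = values[pointer // 2]
        executed += 3                 # lh, slt, beq
        if largest < candidate:
            largest = candidate
            executed += 1             # add (skipped when the branch is taken)
        pointer -= 2
        executed += 3                 # addi, slt, beq
        if pointer < 0:
            break
    return largest, executed


def count_mdmx(values):
    """Model the MAX.DH loop; return (largest, instructions executed)."""
    executed = 2                      # lw, addi
    best = [values[14], values[15]]   # word at 0x1C holds half-words 14 and 15
    pointer = 0x18
    while True:
        first, second = values[pointer // 2], values[pointer // 2 + 1]
        best = [max(best[0], first), max(best[1], second)]
        executed += 5                 # lw, max.dh, addi, slt, beq
        pointer -= 4
        if pointer < 0:
            break
    executed += 2                     # srl, max.dh
    return max(best), executed


ascending = list(range(16))           # last value largest: add never executes
descending = list(range(15, -1, -1))  # each earlier value larger: add always executes

print("scalar min:", count_scalar(ascending))
print("scalar max:", count_scalar(descending))
print("MDMX:      ", count_mdmx(ascending), count_mdmx(descending))
```

The scalar count varies with the data because the update instruction is conditional, while the MAX.DH version has no data-dependent branch, so its count is the same for every input.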