ee457_Final_Fall2004_sol


Fall 2004 EE457 Final Exam (35%)
Instructor: Gandhi Puvvada
Name: SOLUTION
Date: 12/10/2004, Friday
Time: 1:45 - 4:15 PM, SGM123
Closed Book, Closed Notes; Calculators allowed
Total points: 170
Perfect score: 160 / 170
ee457_Final_Fall2004.fm 12/8/04

1 (11 + 21 = 33 points) 35 min.

Note: This is a little difficult design question.

Pipelining: Here, we are modifying your lab 7 part 3 as follows. Instead of the SUB3 and the ADD4 units in EX1 and EX2 stages, here we have a SUB3 unit in each of the two execution stages. So the operations possible are NOP, SUB3, and SUB6 (subtract 6 by subtracting 3 twice).

[Figure: pipeline stages of Lab 7 Part 3 and of the new design (RF, EX1, EX2, WB)]

One-hot coded 2-bit opcode is used as shown below.

Instruction                         Opcode (SUB3 SUB6)
NOP                                 0 0
SUB3 $R, $X;  ($R) <= ($X) - 3      1 0
SUB6 $R, $X;  ($R) <= ($X) - 6      0 1

To execute the SUB3 instruction, you can choose to perform the subtract 3 operation either in EX1 or in EX2. You need to exploit this aspect to reduce stalling.

Instruction Sequence #1
SUB6 $4, $2;  ($4) <= ($2) - 6
SUB3 $6, $4;  ($6) <= ($4) - 3

Instruction Sequence #2
SUB6 $4, $2;  ($4) <= ($2) - 6
SUB3 $6, $4;  ($6) <= ($4) - 3
SUB6 $8, $6;  ($8) <= ($6) - 6

For example, in instruction sequence #1, you can avoid stalling the dependent SUB3 instruction by postponing the subtraction operation until it reaches the EX2 stage. However, in instruction sequence #2, we avoided stalling the SUB3 but ended up stalling the next SUB6. This is considered fine. It means we gain sometimes and we may not gain sometimes.

Summary: Our policy for this new design is never to stall SUB3; if needed we will execute it in EX2. A SUB6 needs to be stalled if it is dependent on a SUB6 immediately ahead of it or dependent on a SUB3 immediately ahead of it which decided to execute in EX2. All other dependencies can be solved through forwarding.
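The policy in the summary can be sketched in Python as a small decision function. This is only an illustration of the stated rules, not the gate-level logic the exam asks for: the names `Instr`, `is_late_producer`, and `decide` are hypothetical, and the sketch assumes that only the instruction immediately ahead (the one in EX1) matters for the stall and postpone decisions, with every other dependency left to forwarding.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instr:
    """Hypothetical model of one instruction: op is "NOP", "SUB3", or "SUB6"."""
    op: str
    dest: int = 0
    src: int = 0
    postponed: bool = False  # meaningful only for a SUB3 that executes in EX2


def is_late_producer(ahead: Instr) -> bool:
    # A SUB6, or a SUB3 that postponed itself, finishes only in EX2,
    # so it cannot forward in time to an instruction arriving in EX1.
    return ahead.op == "SUB6" or (ahead.op == "SUB3" and ahead.postponed)


def decide(rf: Instr, ahead: Optional[Instr]) -> str:
    """Return "issue", "postpone", or "stall" for the instruction in RF."""
    if rf.op == "NOP" or ahead is None or ahead.op == "NOP":
        return "issue"
    depends = ahead.dest == rf.src
    if depends and is_late_producer(ahead):
        # SUB3 is never stalled; it moves its subtraction to EX2 instead.
        return "postpone" if rf.op == "SUB3" else "stall"
    return "issue"


# Instruction sequence #1: the dependent SUB3 postpones instead of stalling.
sub6_a = Instr("SUB6", dest=4, src=2)
sub3_b = Instr("SUB3", dest=6, src=4)
print(decide(sub3_b, sub6_a))

# Instruction sequence #2: the SUB6 behind the postponed SUB3 must stall.
sub3_b.postponed = True
sub6_c = Instr("SUB6", dest=8, src=6)
print(decide(sub6_c, sub3_b))
```

Running the two cases reproduces the behavior described for sequences #1 and #2: the SUB3 is postponed, and the SUB6 that follows it is stalled.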
The register file is an internally forwarding register file. Given on the next page is a block diagram for this new design. Complete it. Also complete on page 3, the postpone logic and the logic for HDU, FU1, and FU2.

© Copyright 2004 Gandhi Puvvada

[Figure: block diagram of the modified pipeline (RF, EX1, EX2, WB) with handwritten completions]

Reproduced below is the logic in RF_stage converting the SUB3 and SUB6 control signals into SUB3_1 and SUB3_2 signals. If postpone is true, then SUB3 causes ID_SUB3_2_IN to go active. Otherwise SUB3 causes ID_SUB3_1_IN to go active. If SUB6 is true, both ID_SUB3_1_IN and ID_SUB3_2_IN will go active.

[Figure: RF_stage logic with signals STALL, ID_SUB3_1_IN, ID_SUB3_2_IN, ID_SUB3_1_OUT, ID_SUB3_2_OUT]

Postpone logic: If there is an instruction just ahead of the instruction in RF stage whose destination register matches with the source register of the RF stage instruction and if the instruction ahead cannot provide forwarding help in time, then the SUB3 in RF needs to postpone itself.
[Figure: postpone logic, handwritten solution]

HDU: If there is an instruction just ahead of the instruction in RF stage whose destination register matches with the source register of the RF stage instruction and if the instruction ahead cannot provide forwarding help in time, and the instruction in RF stage does require help when it arrives in EX1 itself, then stall the RF instruction.

[Figure: HDU logic producing STALL, handwritten solution]

FU1: The nearer is of higher priority compared to the farther. Produce PRIORITY based on when the nearer instruction is eligible to help. Activate FORW1 if any (nearer or farther) can help.

[Figure: FU1 logic producing PRIORITY and FORW1, handwritten solution]

FU2: If the instruction in WB stage is eligible to help and if the help is not provided to the same recipient for the second time, then activate FORW2. Example: A SUB6 after an unpostponed SUB3 should receive help from the SUB3 due to FORW1 and not because of FORW2.

[Figure: FU2 logic producing FORW2, handwritten solution]

2 (7 + 5 + 3 + 8 = 23 points)

Multi-cycle CPU design modification: The 2nd Edition CU (state diagram) and the DPU are given on the next two pages. Mr. Trojan has already completed modifications to the DPU. We notice that, while the register file takes substantial time (nearly half clock time) to read from it or write to it, the stand-alone registers [such as MDR (Memory Data Register) or ALUOut register] can be written or read instantly. Mr. Trojan suggested saving a clock in load-word and R-type instruction execution by skipping states 4 and 7 as follows. These states are created as writing to the register file takes time.
Since the data to be written into the register file is anyway saved in temporary registers MDR and ALUOut, if you add three more flip-flops to record the original 3 signals of states 4 and 7, namely RegDst, RegWrite, and MemtoReg, then the register file gets written in the background while you start accessing the next instruction. This is called a "posted write" (a write operation posted (scheduled) to occur later).

[Figure: three D flip-flops, clocked by CLK and cleared by Reset, capturing MemtoReg, RegWrite, and RegDst as MemtoReg_Q, RegWrite_Q, and RegDst_Q]

The DPU modifications were carried out completely. The state diagram modifications are to be carried out by you on the next page.

What could be the reason for Mr. Trojan's choice of positive-edge triggered flip-flops? Could he have chosen negative-edge triggered flip-flops?

[Handwritten answer not legible]

Miss Bruin suggested a RegWrite_FF_Write control signal as shown on the side. Comment using words such as "totally wrong", "unnecessary", etc.

[Figure: RegWrite flip-flop with CLK, RegWrite_FF_Write, and RegWrite_Q]

[Handwritten answer not legible]

2.4 Mr. Bruin tried to copy Mr. Trojan and suggested saving a clock in the store-word instruction execution by skipping state 5 and suggested adding the following two flip-flops to the DPU.

[Figure: two flip-flops with Reset on CLR, producing MemWrite_Q and IorD_Q]

Your advice to Mr. Bruin:

[Handwritten answer not legible]
[Figure: 2nd Ed. state diagram (instruction fetch, instruction decode/register fetch, memory address computation, and the remaining states with their control signals), with handwritten modifications that are not legible]

[Figure: 2nd Ed. DPU]

3 (9 + 8 points)

Cache: Given below is a diagram similar to the one in your classnotes. Please divide the address from the CPU into TAG, SET, WORD, and BYTE fields based on the information provided on the diagram. Also write address ranges (example A[10:4]) in the 6 boxes, shown as A[ ].

[Figure: cache with Block 0 TAG RAM and Block 0's DATA RAM, Block 1 TAG RAM and Block 1's DATA RAM, valid bits, and comparators; address from the CPU is A12 to A0; handwritten address ranges are not legible]

3.1 The above diagram is slightly changed as shown below. Instead of two 4x7 TAG RAMs, we have four 4x7 TAG RAMs here.
Accordingly we have 4 sets of DATA RAMs (BLOCK 0's; BLOCK 1's; BLOCK 2's; BLOCK 3's). The size of the CPU address is the same. Again divide the CPU address and fill in the 12 boxes shown as A[ ].

[Figure: cache with four TAG RAMs and four sets of DATA RAMs; handwritten address ranges are not legible]

4 Parallel processors: 5 min.

4.1 The abbreviation RMW in RMW-race stands for READ MODIFY WRITE.

4.2 One way to solve the problem of RMW-race is to keep the shared variables in a global (local/global) memory and locking (locking/not locking) access to the memory until all three parts of RMW are done. Such operation is called an atomic (atomic/molecular) operation.

4.3 The operating system may declare some areas of the global (local/global) memory as non-cacheable so that it can force all processors to access shared (shared/local) variables from that area.

4.4 Snoopy controller in a write-through cache-coherence system helps in snooping (watching) for write (read/write/both read and write) transactions from the other processors (other processors/same processor).

4.5 The abbreviation MESI (in MESI cache coherence protocol) stands for Modified Exclusive Shared Invalid.

5 Arithmetic:

Delay of the 64-bit CLA shown below: [handwritten answer not legible] gate delays.

[Figure: 64-bit three-level CLA with group signals such as P32_47**, G32_47**]

The bottom-most CLL (Carry Look-Ahead Logic) does not (does/does not) need to produce group generate (G0_63***) and group propagate (P0_63***). Note that the design above does not produce C64.

Among the second-level CLLs, only the right three (all four, only the left three, only the right three) CLLs need to produce group generate (G**) and group propagate (P**).

5.1.1 Now you are asked to design a 33-bit CLA (note 33-bit (32:0)). Cut out whatever hardware is not necessary in the 64-bit adder design below.

[Handwritten note not legible]

Delay of S32 of your 33-bit CLA: [not legible] gate delays.
Delay of S31 of your 33-bit CLA: [not legible] gate delays.
Maximum delay of any sum bit Si of your 33-bit CLA: [not legible] gate delays.

[Figure: 64-bit CLA with group signals such as P8_11*, G8_11*, marked up by hand]

5.2 In your homework #1, you designed a constant adder (Y3Y2Y1Y0 = X3X2X1X0 + 1011₂). Here, you are using a constant 1000₂ (= 8 viewed as an unsigned number) (= -8 viewed as a signed number).

5.2.1 To perform an unsigned addition X3X2X1X0 + 1000₂, it is possible to complete the design on the side. YES / NO (YES is circled.) If you answered YES, please complete the design.

5.2.2 To perform an unsigned subtraction X3X2X1X0 - 1000₂, it is possible to complete the design on the side. YES / NO (YES is circled.) If you answered YES, please complete the design.

5.2.3 To perform a signed addition X3X2X1X0 + 1000₂, it is possible to complete the design on the side. YES / NO. If you answered YES, please complete the design.
5.2.4 To perform a signed subtraction X3X2X1X0 - 1000₂, it is possible to complete the design on the side. YES / NO. If you answered YES, please complete the design.

[Figures: four constant-adder designs, one per part, each with four sum outputs (for example unsigned_sum3 to unsigned_sum0) and an overflow output (for example unsigned_add_overflow, signed_sub_overflow), and each carrying the instruction "Place a wire or an inverter in each of the two boxes."]

6 (15 + 12 = 27 points) 15 min.

Non-Linear pipeline design: Given below is a non-linear pipeline to produce parity P of an 18-bit data.

P = A17 (+) A16 (+) A15 (+) ... (+) A1 (+) A0

It takes 3 iterations through this partial tree of XOR gates to process the 18 bits at the rate of 6 bits per iteration. Complete the reservation table for the same and arrive at ICV. Based on the ICV draw a state diagram for MAL analysis and arrive at MAL.

[Figure: bit generator/shift register feeding stages S0 to S4, and the reservation table for S0 to S4; the handwritten table entries and ICV are not legible]

6.1 Given on the side are two design variations of the above design. In the original design, the result of the previous iteration produced by stage S3 (S3/S4) is fed into stage S1 (S1/S2). In the design variation #1, the result of the previous iteration produced by stage S3 (S3/S4) is fed into stage S2 (S1/S2). State if the variations are acceptable or not. If unacceptable, state your reasons.
Greedy Simple Latency Cycles = [not legible]
MAL = 3

[Figure: Design Variation #1, bit generator/shift register with stages S0 to S4]

Variation #1: unacceptable. [Handwritten reason not legible]

[Figure: Design Variation #2, bit generator/shift register with stages S0 to S4]

Variation #2: acceptable. [Handwritten reason not legible]

7 (3 + 2 + 6 + 3 + 3 + 3 = 20 points) 15 min.

Virtual memory

7.1 TLB (Translation Look-aside Buffer) is built in SRAM but not in DRAM (SRAM but not in DRAM / DRAM but not in SRAM) technology because [handwritten answer only partly legible: DRAMs are slow; SRAMs are fast].

7.2 In a 32-bit virtual address system, VA31-VA0, the VPN is 18 bits (VA31-VA14) and the page-offset is 14 bits (VA13-VA0). The virtual page size is 16 K Bytes. The physical page size is 16 K Bytes.

7.3 If the system uses a 128-entry TLB (2-way set-associative = 2 entries per set), how many comparators of what size are needed to perform TLB search? Include valid bit in your size calculations.

2 comparators, each 13 bits in size (12 TAG + 1 valid)

If the same 128-entry TLB is converted to a fully-associative TLB, how many comparators of what size are needed to perform TLB search? Include valid bit in your size calculations.
128 comparators, each 19 bits in size (18 TAG + 1 valid)

7.4 In a 3-level page-table organization consisting of A-level (top level), B-level, and C-level, at any time, there is only one A-level table. True / False

The number of C-level tables (CN) when compared to the number of B-level tables (BN) is such that
(i) CN is always greater than BN
(ii) CN is always greater than or equal to BN
(iii) CN is always less than BN
(iv) CN is always less than or equal to BN
(v) no relation between CN and BN
Answer: (ii)

7.5 Originally, the PT (Page Table) was organized as a 3-level PT. It was later changed to a 4-level PT. This will lower (lower / improve) performance slightly because you spend a little more (more / less) time whenever a TLB (TLB / Cache) miss (hit / miss) occurs.

7.6 The degree of lower-order interleaving of the main memory is dependent on (select all right answers)
(a) the size of the virtual page
(b) the size of the physical page
(c) the set-associativity of the TLB
(d) the number of levels in the page table
(e) the set-associativity of the cache
(f) none of the above.

I enjoyed teaching this course. Hope you liked the course. Hope to see some of you in EE454L or EE560. Grades will be out in a week. Enjoy winter break before the school starts again - Gandhi
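As a cross-check on the arithmetic in questions 7.2 and 7.3, the page size and TLB comparator sizing can be worked out in a few lines of Python. This is a minimal sketch under stated assumptions: the function names `page_size_bytes` and `tlb_comparators` are hypothetical, the number of TLB sets is assumed to be a power of two, and the TLB is assumed to be indexed by the low-order VPN bits with the remaining VPN bits stored as the tag.

```python
def page_size_bytes(offset_bits: int) -> int:
    """Page size implied by the number of page-offset bits."""
    return 2 ** offset_bits


def tlb_comparators(vpn_bits: int, entries: int, ways: int) -> tuple:
    """Return (number of comparators, comparator width in bits).

    One comparator is needed per way, because all entries of the selected
    set are compared in parallel. Each comparator covers the tag plus one
    valid bit. Assumes the number of sets is a power of two.
    """
    sets = entries // ways
    index_bits = (sets - 1).bit_length()  # log2(sets) for a power of two
    tag_bits = vpn_bits - index_bits
    return ways, tag_bits + 1             # +1 for the valid bit


# 7.2: a 14-bit page offset gives a 16 KB page.
print(page_size_bytes(14) // 1024, "KB")

# 7.3: 128-entry, 2-way set-associative TLB with an 18-bit VPN.
print(tlb_comparators(vpn_bits=18, entries=128, ways=2))

# 7.3: the same 128 entries organized as a fully-associative TLB.
print(tlb_comparators(vpn_bits=18, entries=128, ways=128))
```

With 64 sets, 6 of the 18 VPN bits select the set, leaving a 12-bit tag; with a single set, no bits are used for indexing and the whole 18-bit VPN is the tag.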