final04_solution_revised - UNIVERSITY OF CALIFORNIA, LOS ANGELES

UNIVERSITY OF CALIFORNIA, LOS ANGELES
BERKELEY - DAVIS - IRVINE - LOS ANGELES - RIVERSIDE - SAN DIEGO - SAN FRANCISCO - SANTA BARBARA - SANTA CRUZ

CS M151B / EE M116C Final Exam — Fall 2004

Before you start, make sure you have all 14 pages attached to this cover sheet. Please put your name at the top of each page. All work and answers should be written directly on these pages; use the backs of pages if needed. This is an open book, open notes final — but you cannot share books, notes, or calculators. I will uphold the university policy on cheating — so please do not cheat on this exam. Keep your eyes on your own exam — and show all of your work.

Problem 1 (14 points):
Problem 2 (20 points):
Problem 3 (20 points):
Problem 4 (20 points):
Problem 5 (30 points):
Problem 6 (21 points):
Problem 7 (25 points):
Total: (out of 150)

Fall 2004    NAME

1. Déjà Vu (14 points): Consider a processor with a base CPI (BCPI) of 2.5. The instruction cache has a 10% miss rate and the data cache has a 15% miss rate. The miss latency for both caches is 12 cycles. Assume that 25% of all instructions are loads and that store misses do not cause stalls, and calculate the total CPI (TCPI). Show your work.

TCPI = BCPI + instruction-cache stall cycles + data-cache stall cycles
     = 2.5 + (0.10 x 12) + (0.25 x 0.15 x 12)
     = 2.5 + 1.2 + 0.45

TCPI: 4.15

You are considering increasing the size of the data cache to reduce the miss rate. However, this will impact the clock rate of the processor. The new data cache you intend to use will have a miss rate of 10%, but will decrease the clock rate by 5%. The instruction cache will not be impacted, and we will assume here that the miss latency for both caches will remain 12 cycles. Calculate the performance improvement (or reduction) for this modification in MIPS relative to the original configuration. Show your work.

New TCPI = 2.5 + (0.10 x 12) + (0.25 x 0.10 x 12) = 2.5 + 1.2 + 0.30 = 4.00
MIPS is proportional to clock rate / CPI, so:
relative MIPS = (0.95 x f / 4.00) / (f / 4.15) = 0.95 x 4.15 / 4.00 ≈ 0.986
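The arithmetic in Problem 1 can be checked with a short script. This is only a cross-check of the CPI model stated in the problem (base CPI plus instruction-cache and load-miss stall cycles); the function name `tcpi` is mine, not from the exam.

```python
# CPI/MIPS model for Problem 1 (all constants come from the problem statement).
BCPI = 2.5            # base CPI
ICACHE_MISS = 0.10    # instruction-cache miss rate
MISS_LAT = 12         # miss latency in cycles, both caches
LOAD_FRAC = 0.25      # fraction of instructions that are loads

def tcpi(dcache_miss):
    # Every instruction can miss in the I-cache; only loads stall on D-cache
    # misses (store misses do not stall, per the problem).
    return BCPI + ICACHE_MISS * MISS_LAT + LOAD_FRAC * dcache_miss * MISS_LAT

old = tcpi(0.15)   # original 15% data-cache miss rate
new = tcpi(0.10)   # bigger data cache, 10% miss rate
# MIPS is proportional to clock_rate / CPI; the new cache cuts the clock by 5%.
speedup = (0.95 / new) / (1.0 / old)
print(round(old, 2), round(new, 2), round(speedup, 3))   # 4.15 4.0 0.986
```

The slower clock almost exactly cancels the CPI gain, so the change is a slight net loss.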
Speedup in MIPS when increasing the data cache size: ≈ 0.986 (a slight reduction, about 1.4%)

Fall 2004    NAME

2. Cycles of Pain (20 points): Consider the following instruction sequence:

SCH: lw   $t0, 0($s0)
     add  $t1, $s2, $t0
     lw   $t3, 8($t1)
     bne  $t3, $s3, SCH
     add  $s0, $s0, $t3

These instructions are executed on the 5-stage pipelined MIPS processor, using full forwarding and hazard detection. The branch penalty is two cycles. There are two branch delay slots indicated by the boxes at the bottom — they are filled with the final add and a nop. Show your work in the form of a pipeline diagram — a table is provided (use IF, ID, EX, M, and WB in the appropriate slots).

a) How many cycles will it take to completely execute two iterations of this loop? 20

[Pipeline diagram: each iteration issues lw, add, lw, bne, the delay-slot add, and a nop. The lw-to-add and lw-to-bne dependences each cost one stall cycle, so the twelve instructions of two iterations issue over 16 cycles and the last instruction finishes WB in cycle 20.]

b) Now suppose that we use a data cache in our 5-stage pipeline. In the previous section we ideally assumed that memory would take a single cycle. But with a cache, the cache latency is 2 cycles and it does not miss during the two iterations of the loop. This means our pipeline will now have 6 stages that will be seen by all instructions (i.e. memory will take two stages). Show your work. How many cycles will it take to completely execute two iterations of this loop now? 25

[Pipeline diagram: with a 2-cycle memory (stages M1 and M2), a load's result is available only after M2, so each of the two load-use dependences now costs two stall cycles; the twelve instructions issue over 20 cycles and the last instruction finishes WB in cycle 25.]

Fall 2004    NAME

3. Too Many 2's (20 points): For this problem, we will look at a 2-level page table.
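The cycle counts for Problem 2 (20 and 25 cycles) can be cross-checked with a small issue-slot model. This is a sketch, assuming two load-use stall cycles per iteration in the 5-stage pipeline (one per dependent lw) and four per iteration in the 6-stage pipeline, with the branch penalty fully hidden by the delay slots; the helper name is mine.

```python
# Issue-slot model: in an in-order scalar pipeline, the last instruction
# issues after every earlier issue slot (instructions plus stall bubbles),
# then needs the full pipeline depth to reach writeback.
def total_cycles(n_instructions, n_stalls, n_stages):
    # Cycle in which the last instruction issues, plus cycles to drain it.
    return (n_instructions + n_stalls - 1) + n_stages

# Two iterations = 12 issued instructions (including delay-slot add and nop).
print(total_cycles(12, 4, 5))   # 20: 5-stage, 2 stalls per iteration
print(total_cycles(12, 8, 6))   # 25: 6-stage, 4 stalls per iteration
```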
The first level of the page table provides a physical address for the second-level page table containing the desired translation, if that second-level table is in physical memory (otherwise a page fault is raised and the table is loaded from disk). The virtual address is still split into two components: the virtual page number and the virtual page offset. But now the virtual page number will be split into two components: a page table number and a page table offset. The page table number will be used to index into the first-level page table. The page table offset will be used to index into the second-level page table pointed to by the first-level page table. This is exactly the idea we discussed in class — and in case you need it, there is a diagram on the next page that shows this idea.

We will assume that each second-level page table occupies exactly one page of virtual memory. And we will further assume that each translation is stored along with only extra protection bits (no other bits — i.e. dirty bits — are required). Consider memory with a 64-bit virtual address space, 128KB pages, 2GB of physical memory, and a 512-entry 4-way set-associative TLB. Show your work.

Fill in the following:

a) # of entries in each 2nd-level page table (express this as a power of 2):

b) # of 2nd-level page tables (express this as a power of 2):

c) # of entries in the first-level page table (express this as a power of 2):

d) Circle the bits of the following 64-bit virtual address that are used to find an index into the TLB (i.e. the bits that select one of the indices of the TLB — not the tag or offset bits):

0000100001000010000100001000010000100001010000100001000010000100
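The blanks in Problem 3 can be worked out mechanically. The sketch below makes two assumptions that are not stated in the problem: each second-level entry (physical page number plus protection bits) is padded to 4 bytes, and the garbled TLB size in this scan is read as 512 entries.

```python
# Two-level page-table sizing for Problem 3 (a sketch under stated assumptions).
VA_BITS = 64
PAGE = 128 * 1024        # 128 KB pages  -> 17-bit page offset
PHYS = 2 * 1024**3       # 2 GB of physical memory
PTE_BYTES = 4            # ASSUMPTION: 14-bit PPN + protection bits, padded to 4 B

offset_bits = PAGE.bit_length() - 1                 # 17
ppn_bits = (PHYS.bit_length() - 1) - offset_bits    # 31 - 17 = 14
entries_per_l2 = PAGE // PTE_BYTES                  # one page per table -> 2**15
l2_index_bits = entries_per_l2.bit_length() - 1     # 15 (page table offset)
l1_entries_bits = VA_BITS - offset_bits - l2_index_bits   # 64 - 17 - 15 = 32

TLB_ENTRIES, TLB_WAYS = 512, 4   # ASSUMPTION: 512-entry, 4-way TLB
tlb_sets = TLB_ENTRIES // TLB_WAYS                  # 128 sets
tlb_index_bits = tlb_sets.bit_length() - 1          # 7: VA bits 17..23

print(offset_bits, l2_index_bits, l1_entries_bits, tlb_index_bits)  # 17 15 32 7
```

Under these assumptions: (a) 2^15 entries per second-level table, (b) 2^32 second-level tables, (c) 2^32 first-level entries, and (d) the seven bits just above the 17-bit page offset index the TLB.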
[Figure: two-level translation. The virtual page number splits into a page table number (indexing the first-level page table) and a page table offset (indexing the second-level page table); each second-level entry holds a valid bit and a physical page number or disk address, backed by physical memory and disk storage.]

Fall 2004    NAME

4. Taken for a Loop (20 points): In this problem, we will schedule code to execute on a 2-way superscalar VLIW pipelined processor. For this processor, assume that ANY two independent instructions can be executed in each cycle — and that full bypassing is provided. Assume that there is a single-cycle branch penalty, and that the processor uses branch delay slots to resolve this single-cycle penalty. Consider the following MIPS fragment:

loop: lw   $t0, 0($s1)
      lw   $t1, 4($s1)
      add  $t2, $t0, $t1
      sw   $t2, 0($s1)
      addi $s1, $s1, 4
      bne  $s1, $t3, loop

Assume that $s0, $s1, and $t3 are initialized before the loop is entered, and that the loop will always be taken a number of times that is a multiple of four. Unroll the loop three times (i.e. to make four copies of the loop body) and optimize the instructions for scheduling. Hint — you can reduce the number of loads to 5.

List the new code sequence here:

Now that you have an optimized sequence of instructions, schedule these instructions in the following slots. Remember to fill the branch delay slots.

Cycle | 1st Issue Slot (for ANY instruction) | 2nd Issue Slot (for ANY instruction)

Fall 2004    NAME

5. Cache Me If You Can (30 points): You are designing the data cache for an embedded processor. Power is critical, so you do not want something too associative or too large.
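Problem 4's hint (five loads) follows from a reuse pattern: the word one iteration loads as 4($s1) is exactly the word the next iteration loads as 0($s1), because the store targets 0($s1) before $s1 is incremented. A small Python sketch (word-indexed lists standing in for byte-addressed memory; this checks the reuse argument, it is not the MIPS schedule itself):

```python
# Verify that carrying the 4($s1) value across iterations gives the same
# memory contents as the plain loop, using one load per body plus one extra.
def plain_loop(mem, iters):
    mem = list(mem); i = 0
    for _ in range(iters):
        t0, t1 = mem[i], mem[i + 1]   # lw $t0, 0($s1); lw $t1, 4($s1)
        mem[i] = t0 + t1              # sw $t2, 0($s1)
        i += 1                        # addi $s1, $s1, 4
    return mem

def reused_loop(mem, iters):
    mem = list(mem)
    t0 = mem[0]                       # the single "extra" load
    for i in range(iters):
        t1 = mem[i + 1]               # one load per body
        mem[i] = t0 + t1
        t0 = t1                       # next body's 0($s1) is this body's 4($s1)
    return mem

data = [1, 2, 3, 4, 5]
assert plain_loop(data, 4) == reused_loop(data, 4)   # same result, 5 loads total
print(reused_loop(data, 4))   # [3, 5, 7, 9, 5]
```

Four unrolled bodies therefore need 1 + 4 = 5 loads, leaving more slack for scheduling the adds and stores.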
You consider the following two alternatives:

DM - A 2KB direct-mapped cache with 16-byte block size
SA - A 2KB 2-way set-associative cache with 16-byte block size (uses LRU replacement within a set)

a. Consider the performance of these caches on the given stream of byte addresses. Note that there are 6 unique byte addresses here — and that the sequence of six addresses is repeated to make 12 total addresses. Mark whether the DM and SA cache has a "hit" or "miss" for each address — i.e. whether or not the desired memory address is found in the cache. For the addresses — assume that "..." means all leading 0's. Assume that both caches are completely empty (all entries invalid) at the start of the stream. For the DM cache only, classify each miss as capacity, compulsory, or conflict (Miss Type).

Address                  | Address in Decimal | DM Hit or Miss | Miss Type  | SA Hit or Miss
...11001010010100100000  | 828704             | Miss           | Compulsory | Miss
...11001010010100001000  | 828680             | Miss           | Compulsory | Miss
...11001010010100001100  | 828684             | Hit            |            | Hit
...11001011110100001000  | 834824             | Miss           | Compulsory | Miss
...11010010001011110000  | 860912             | Miss           | Compulsory | Miss
...11010010101011110000  | 862960             | Miss           | Compulsory | Miss
...11001010010100100000  | 828704             | Hit            |            | Hit
...11001010010100001000  | 828680             | Miss           | Conflict   | Hit
...11001010010100001100  | 828684             | Hit            |            | Hit
...11001011110100001000  | 834824             | Miss           | Conflict   | Hit
...11010010001011110000  | 860912             | Miss           | Conflict   | Hit
...11010010101011110000  | 862960             | Miss           | Conflict   | Hit

Fall 2004    NAME

b. You have an idea to try a compromise between the SA and DM caches: a pseudo-associative cache (PA cache). The PA cache will look exactly like a direct-mapped cache, but if you do not find the block you are looking for at the index specified by your address, you will just check the next contiguous index for a hit in the next cycle. For example, if your address demands that you check index 12 of the PA cache, then you will check 12 first, then 13 in the next cycle. But you will *only* check 13 if the block you wanted is not in 12. This means that hits in the first location take 1 cycle, and hits in the second location take 2 cycles.
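The Problem 5a trace can be verified with a small simulator (a sketch: 2 KB capacity, 16-byte blocks, LRU within a set, addresses taken from the decimal column of the table):

```python
# Direct-mapped (1-way) vs. 2-way set-associative cache simulation.
from collections import OrderedDict

STREAM = [828704, 828680, 828684, 834824, 860912, 862960] * 2

def simulate(n_ways, cache_bytes=2048, block=16):
    n_sets = cache_bytes // block // n_ways
    sets = [OrderedDict() for _ in range(n_sets)]   # tag -> None, in LRU order
    results = []
    for addr in STREAM:
        blk = addr // block                 # block address
        s, tag = sets[blk % n_sets], blk // n_sets
        if tag in s:
            s.move_to_end(tag)              # refresh LRU position
            results.append("hit")
        else:
            if len(s) == n_ways:
                s.popitem(last=False)       # evict the LRU way
            s[tag] = None
            results.append("miss")
    return results

dm, sa = simulate(1), simulate(2)
print(dm)   # 5 compulsory misses, then conflict misses dominate the repeat
print(sa)   # the same 5 compulsory misses, then the repeat hits every time
```

The conflicting pairs (828680/834824 and 860912/862960) map to the same direct-mapped index but fit together in one 2-way set, which is why the SA cache turns the second pass into all hits.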
If you miss in the second location, then it is a cache miss. If your index is the last entry of the cache, then your second access goes back to the first entry of the cache. So a PA cache is just a direct-mapped cache where you look in two places for a desired cache block. The PA cache will be 2KB with a 16-byte block size. Consider the performance of this cache on the same address stream. Mark whether the PA cache has a "hit" or "miss" for each address — i.e. whether or not the desired memory address is found in the cache. For the addresses — assume that "..." means all leading 0's.

Address                  | Address in Decimal | PA Hit or Miss
...11001010010100100000  | 828704             |
...11001010010100001000  | 828680             |
...11001010010100001100  | 828684             |
...11001011110100001000  | 834824             |
...11010010001011110000  | 860912             |
...11010010101011110000  | 862960             |
...11001010010100100000  | 828704             |
...11001010010100001000  | 828680             |
...11001010010100001100  | 828684             |
...11001011110100001000  | 834824             |
...11010010001011110000  | 860912             |
...11010010101011110000  | 862960             |

Fall 2004    NAME

c. Assume that for a given workload, a hit in the PA cache will be in the first location 80% of the time, and in the second location 20% of the time. A hit in the first location takes 1 cycle and a hit in the second location takes 2 cycles. The PA cache has a 10% miss rate (so of the remaining 90%, 80% are in the first location and 20% are in the second location). The next level of the memory hierarchy is an L2 cache with a 5% miss rate and an access time of 10 cycles. The time to access main memory is 150 cycles. Calculate the average memory access time in cycles for this memory hierarchy. Show your work.

Fall 2004    NAME

6. Problem Solver (21 points): Given the following problems, suggest one solution and give one benefit and one drawback of the solution.

EXAMPLE
Problem: long memory latencies
Solution: caches
benefit: low latency when the data is in the cache
drawback: when the cache misses, the latency becomes worse due to the cache access latency

We would not accept solutions like "do not use memory", "use a slower CPU", etc.

Problem: too many conflict misses in the data cache
Solution:
benefit:
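For Problem 5c, one consistent reading (an assumption, since the problem does not say whether a miss pays the two probe cycles before going to L2) treats 72% of accesses as first-location hits, 18% as second-location hits, and 10% as misses that pay both probes plus the L2 time:

```python
# Average memory access time for the PA cache / L2 / memory hierarchy.
# ASSUMPTION: a PA miss checks both locations (2 cycles) before going to L2.
L2_TIME = 10 + 0.05 * 150            # L2 access plus its misses to main memory
amat = (0.9 * (0.8 * 1 + 0.2 * 2)   # hits: 72% at 1 cycle, 18% at 2 cycles
        + 0.1 * (2 + L2_TIME))       # misses: 2 probe cycles + 17.5 L2 cycles
print(round(amat, 2))                # 3.03 cycles under these assumptions
```

If the two probe cycles are instead folded into the L2 access, the 2 in the miss term drops out and the result is 2.83 cycles; the structure of the calculation is the same either way.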
drawback:

Problem: too much traffic/contention on the bus to memory
Solution:
benefit:
drawback:

Problem: too many control hazards
Solution:
benefit:
drawback:

Problem: our use of daisy chaining to select a bus master is starving low-priority devices
Solution:
benefit:
drawback:

Problem: our ripple carry adder is too slow
Solution: carry-lookahead adder
benefit: computes the carries in parallel, so addition is faster
drawback: requires more logic and area

Problem: we want more instructions in the MIPS ISA
Solution:
benefit:
drawback:

Problem: our page table takes up too much space in memory
Solution:
benefit:
drawback:

Fall 2004    NAME

7. A Staggering Blow (25 points): Consider the 5-stage pipelined architecture — we use forwarding to avoid pipeline stalls. The following figure demonstrates the addition of forwarding paths as we examined in class.

[FIGURE 7.1: (a) the datapath with no forwarding; (b) with forwarding — ID/EX, EX/MEM, and MEM/WB pipeline registers, the data memory, and a forwarding unit reading EX/MEM.RegisterRd and MEM/WB.RegisterRd.]

Suppose that the ALU is the bottleneck to increasing your clock rate. One solution (used on the Pentium 4) is to use a staggered ALU. This is where the ALU is pipelined over two stages. In the first stage, the first half of the computation is done. In the second stage, the second half of the computation is done. As in the above diagram, you will need to add forwarding hardware to this new 6-stage pipeline.
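The staggered-ALU idea can be sketched numerically: a 32-bit add is split into two 16-bit slices chained by the carry, which is why the low half of a result is ready (and forwardable) one stage before the high half. This is a toy model of the arithmetic, not the datapath itself:

```python
# Model of a staggered 32-bit add: two 16-bit ALU slices in successive stages.
MASK16 = 0xFFFF

def alu16(a, b, cin):
    """One 16-bit adder slice: returns (sum, carry_out)."""
    total = (a & MASK16) + (b & MASK16) + cin
    return total & MASK16, total >> 16

def staggered_add(a, b):
    lo, cout = alu16(a, b, 0)                  # stage EX1: low 16 bits
    hi, _ = alu16(a >> 16, b >> 16, cout)      # stage EX2: high 16 bits + Cin
    return (hi << 16) | lo                     # 32-bit result, carry-out dropped

assert staggered_add(0x0000FFFF, 1) == 0x00010000   # carry crosses the halves
print(hex(staggered_add(0x12345678, 0x11111111)))   # 0x23456789
```

Because `lo` exists a full stage before `hi`, a dependent instruction entering EX1 can receive the forwarded low half immediately and the high half one cycle later, exactly the back-to-back add/and case the problem asks you to support.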
Fall 2004    NAME

First, let's look at the staggered ALU:

[Figure: two 16-bit ALUs. The lower ALU produces ALUOut0..15 and a carry out (Cout) that feeds the Cin of the upper ALU, which produces ALUOut16..31.]

There are two ALUs in this figure — both are 16-bit ALUs. Suppose for the purposes of this problem that we are not supporting slt. The first ALU computes the desired operation on the lower 16 bits of the 32-bit registers A and B. The carry out (Cout) from this ALU is used as the Cin for the second ALU, which computes the desired operation on the upper 16 bits of the 32-bit registers A and B. The results of both ALUs together make up the 32-bit result of the operation on A and B. The following diagram demonstrates how these ALUs will fit into the pipeline (pipeline registers ID/EX1, EX1/EX2, EX2/MEM, and MEM/WB).

Note that there is no forwarding shown here. You will want to be able to support back-to-back instructions — like the following sequence:

add $t0, $t1, $t2
and $t3, $t0, $t4

If each 16-bit ALU is placed in a separate pipeline stage, the value of $t0 should be forwarded from the add to the and without any stalling. First, the result of the lower 16-bit sum is forwarded to the and as it enters the first adder, and then the result of the upper 16-bit sum is forwarded to the and as it enters the second adder. We have started the datapath and forwarding logic for the diagram below, just as was done for the 5-stage pipeline in Figure 7.1 above. Complete this modification by adding muxes and wires — label all wires that you add. You do not need to design the internals of the forwarding units — this should be just as was done in Figure 7.1.

[Figure: partial 6-stage datapath showing the EX2/MEM and MEM/WB registers and a forwarding unit.]