More Practice


1. I Amdahl-ighted with Tradeoffs (10 points): Given the following problems, suggest one solution and give one drawback of the solution. Be brief, but specific.

EXAMPLE
Problem: long memory latencies
Solution: caches
Drawback: when the cache misses, the latency becomes worse due to the cache access latency

We would not accept solutions like "do not use memory", "use a slower CPU", "cache is hard to spell", etc.

Problem: too many capacity misses in the data cache
Solution: ________    Drawback: ________

Problem: too many control hazards
Solution: ________    Drawback: ________

Problem: our carry lookahead adder is too slow
Solution: ________    Drawback: ________

Problem: we want to be able to use a larger immediate field in the MIPS ISA
Solution: ________    Drawback: ________

Problem: the execution time of our CPU with a single-cycle datapath is too high
Solution: ________    Drawback: ________

2. Hazard a Guess? (10 points): Assume you are using the 5-stage pipelined MIPS processor, with a three-cycle branch penalty. Further assume that we always use predict not taken. Consider the following instruction sequence, where the bne is taken once, and then not taken once (so 7 instructions will be executed in total):

Loop: lw  $t0, 512($t0)
      lw  $t1, 64($t0)
      bne $t0, $t1, Loop
      sw  $s1, 128($t0)

Assuming that the pipeline is empty before the first instruction:

a. Suppose we do not have any data forwarding hardware; we stall on data hazards. The register file is still written in the first half of a cycle and read in the second half of a cycle, so there is no hazard from WB to ID. Calculate the number of cycles that this sequence of instructions would take: ________
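An aside on Problem 1: its title puns on Amdahl's law, which is the quantitative version of every tradeoff listed there, since speeding up only part of a system bounds the overall gain. As a refresher (this snippet is mine, not part of the exam), the law in one function:

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of execution time is sped up by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# Speeding up half the program by 2x yields only about 1.33x overall
print(round(amdahl_speedup(0.5, 2.0), 2))
```

The untouched fraction (1 - f) dominates as s grows, which is why every "Solution" above buys its "Drawback".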
b. How many cycles would this sequence of instructions take with data forwarding hardware: ________

3. More & More Problems (10 points): Find the data cache hit or miss stats for a given set of addresses. The data cache is a 1 KB, direct-mapped cache with 64-byte blocks. Find the hit/miss behavior of the cache for the given byte address stream, and label misses as compulsory, capacity, or conflict misses. All blocks in the cache are initially invalid.

Address in Binary    Hit or Miss    Miss Type
(the six binary addresses in the original table are illegible in this scan)

4. The Trouble with TLBs (10 points): Consider an architecture with 32-bit virtual addresses and 1 GB of physical memory. Pages are 32 KB and we have a TLB with 64 sets that is 8-way set associative. The data and instruction caches are 8 KB with 16 B block sizes and are direct mapped, and they are both virtually indexed and physically tagged. Assume that every page mapping (in the TLB or page table) requires 1 extra bit for storing protection information. Answer the following:

a. How many pages of virtual memory can fit in physical memory at a time? ________
b. How large (in bytes) is the page table? ________
c. What fraction of the total number of page translations can fit in the TLB? ________
d. What bits of a virtual address will be used for the index to the TLB? Specify this as a range of bits, e.g. bits 4 to 28 will be used as the index. The least significant bit is labeled 0 and the most significant bit is labeled 31. ________

5. Starting Some Static (Scheduling) (20 points): Consider the 2-way superscalar processor we covered in class: a five-stage pipeline where we can issue one ALU or branch instruction along with one load or store instruction every cycle.
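The address arithmetic behind Problems 3 and 4 can be sanity-checked with a few lines. This is a sketch under the geometries stated in the two problems; the variable names are mine, and the page-table-size line assumes a minimal entry of physical page number plus the one protection bit, rounded to whole bytes.

```python
from math import log2

# Problem 3: 1 KB direct-mapped cache with 64-byte blocks
cache_bytes, block_bytes = 1024, 64
num_blocks = cache_bytes // block_bytes      # 16 blocks
offset_bits = int(log2(block_bytes))         # 6-bit block offset
index_bits = int(log2(num_blocks))           # 4-bit index

def cache_index(byte_addr):
    """Direct-mapped index for a byte address: strip the offset, mask the index."""
    return (byte_addr >> offset_bits) & (num_blocks - 1)

# Problem 4: 32-bit VAs, 1 GB physical memory, 32 KB pages, 64-set 8-way TLB
page_bits = 15                               # 32 KB pages
phys_pages = (1 << 30) >> page_bits          # pages that fit in physical memory
virt_pages = (1 << 32) >> page_bits          # page-table entries
pte_bits = 15 + 1                            # 15-bit PPN + 1 protection bit
page_table_bytes = virt_pages * (pte_bits // 8)
tlb_entries = 64 * 8
tlb_fraction = tlb_entries / virt_pages
# 64 sets need 6 index bits, taken just above the 15-bit page offset
tlb_index_range = (page_bits, page_bits + 6 - 1)
```

Under these assumptions the numbers come out to 2^15 physical pages, a 2^18-byte page table, a TLB holding 512 of the 2^17 translations (1/256), and TLB index bits 15 to 20.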
Suppose that the branch delay penalty is two cycles and that we handle control hazards with branch delay slots (since the penalty is two cycles, and this is a 2-way superscalar processor, that would be four instructions that we need to place in delay slots). This processor has full forwarding hardware. This processor is a VLIW machine. How long would the following code take to execute on this processor, assuming the loop is executed 200 times? Assume the pipeline is initially empty and give the time taken up until the completed execution of the instruction sequence shown here. First you will need to schedule (i.e. reorder) the code (use the table below) to reduce the total number of cycles required (but don't unroll it...yet).

Total # of cycles for 200 iterations: ________
(Hint: schedule the code first for one iteration, then figure out how long it will take the processor to run 200 iterations of this scheduled code.)

Loop: lw   $t0, 0($s0)
      lw   $t1, 0($s3)
      add  $t1, $t0, $t1
      sw   $t1, 0($s3)      # you may assume that this store never goes to the same address as the first load
      addi $s0, $s0, 4
      bne  $s0, $s2, Loop

1st Issue Slot (ALU or Branch)    2nd Issue Slot (LW or SW)
(empty scheduling table)

Now unroll the loop once to make two copies of the loop body. Schedule it again and record the total # of cycles for 200 iterations: ________

6. A Branch Too Far (10 points): One difficulty in designing a branch predictor is trying to avoid cases where two PCs with very different branch behavior index to the same entry of a 2-bit branch predictor. This is called destructive aliasing. One way around this is to use multiple 2-bit branch predictors with different sizes. This way, if two PCs index to the same entry in one predictor, they will not likely index to the same entry in the other predictor. We will evaluate a scheme with three 2-bit branch predictors, each with a different number of entries. The three predictors will be accessed in parallel, and the majority decision of the predictors will be chosen.
So if two predictors say taken and the other predictor says not taken, the majority decision will be taken. The scheme looks like this:

[Figure: three 2-bit predictors accessed in parallel, feeding a majority vote that produces the final taken/not-taken prediction]

Each predictor has the following FSM:

[Figure: 2-bit saturating counter FSM with states 11 (predict taken), 10 (predict taken), 01 (predict not taken), and 00 (predict not taken); each taken outcome moves one state toward 11, each not-taken outcome moves one state toward 00]

Evaluate the performance of this prediction scheme on the following sequence of PCs. The table shows the address of the branch and the actual direction of the branch (taken or not taken). You get to fill in whether or not the branch predictor would guess correctly. Each node of the FSM is marked with the 2-bit value representing that state. Assume that all predictors are initialized to 00. To find an index into a predictor, assume we use the simplified branch indexing formula: index = PC % predictor_size. The symbol % represents the modulo operator. Predictor_size will be different according to the predictor.

PC    Actual Direction    Predicted?    Correct?
(the address and direction entries are illegible in this scan)

7. With Friends Like These... (30 points): Consider the scalar pipeline we have explored in class:

[Figure: the five-stage pipelined datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB latches, register file, instruction and data memories, and forwarding muxes into the ALU]

a. (10 points) Suppose 10% of instructions are stores, 15% are branches, 25% are loads, and the rest are R-type. 30% of all loads are followed by a dependent instruction. We have full forwarding hardware on this architecture. We use a predict not taken branch prediction policy and there is a 2-cycle branch penalty. This means that the PC is updated at the end of the EX stage, after the comparison is made in the ALU. One third of all branches are taken.
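Going back to Problem 6 for a moment: the majority-vote scheme is easy to prototype, which is a good way to check a hand-filled table. This sketch assumes the standard 2-bit saturating-counter FSM described above (counter values 0 to 3, with 10 and 11 predicting taken) and uses placeholder predictor sizes, since the real entry counts appear only in the exam's figure.

```python
class TwoBitPredictor:
    """One 2-bit saturating-counter predictor, indexed by PC % size."""
    def __init__(self, size):
        self.size = size
        self.counters = [0] * size           # all states initialized to 00

    def predict(self, pc):
        return self.counters[pc % self.size] >= 2   # states 10 and 11 predict taken

    def update(self, pc, taken):
        i = pc % self.size
        self.counters[i] = min(3, self.counters[i] + 1) if taken else max(0, self.counters[i] - 1)

def majority_step(predictors, pc, taken):
    """Majority vote across the predictors, then train all of them; True if correct."""
    prediction = sum(p.predict(pc) for p in predictors) >= 2
    for p in predictors:
        p.update(pc, taken)
    return prediction == taken

# Placeholder sizes; substitute the entry counts from the figure
predictors = [TwoBitPredictor(s) for s in (4, 8, 16)]
```

Feeding the (PC, direction) pairs from the table through majority_step then reproduces the Predicted?/Correct? columns one row at a time.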
There is an instruction cache with a single-cycle latency and a miss rate of 10%, and a data cache with a single-cycle latency and a miss rate of 20%. We have an L2 cache that misses 5% of the time (it has a 10-cycle latency), and memory has a 100-cycle latency. Find the TCPI for this architecture.

TCPI = ________

b. (5 points) Your friend has a flash of brilliance: "I know a way to get rid of stalls in this pipeline. The reason we have to stall now is because a load can have a dependent instruction follow it through the pipeline, and we cannot forward the load's data until the end of the MEM stage, but the dependent instruction needs it at the beginning of the EX stage. So what if we add another ALU that recomputes what we did in EX if the instruction before it is a load and it is dependent on the load?" This ALU will be in the memory stage of the pipeline as shown below in this simplified picture:

[Figure: the pipeline with a second ALU added in the MEM stage, fed by muxes and the data memory]

Is your friend right or wrong? If they are wrong, give an example of when we would still need to stall.

They are right: ________
Or
Counter example: ________

c. (5 points) Another friend offers an alternative: using the original pipeline from part a, let's get rid of base + displacement addressing for loads and stores. Loads and stores can only use register addressing now. This will allow us to combine EX and MEM into one stage (called EM) and avoid the need to stall entirely. Instructions will either use the ALU or memory, but not both. There is still forwarding hardware, but now we only need to forward from the EM/WB latch to the EM stage ALU. The pipeline will now be: IF, ID, EM, WB.

Suppose that four fifths of loads actually use base + displacement addressing (i.e. they have a non-zero displacement), which means that these loads will need to have add instructions before them to do their effective address computation.
Half of stores use base + displacement addressing, and these will also need to be replaced with an add plus the store instruction. This modification has no impact on the branch penalty or the instruction cache miss rate. Is your friend right or wrong; will this eliminate all stalls? If they are wrong, give an example of when we would still need to stall.

They are right: ________
Or
Counter example: ________

d. (10 points) A third friend has a different idea (it may be time for you to get new friends who don't talk about architecture all the time). Forget about trying to eliminate hazards, she says; we should just use superpipelining and get a win on cycle time. Take the original architecture from part a (ignore the suggestions from b and c) and assume that the stages have the following latencies:

Stage             IF    ID    EX    MEM   WB
Latency (ps)      200   100   200   200   100

Your friend suggests a way to cut the IF, EX, and MEM stages in half: just increase the pipeline depth and make each of these stages into two stages. So your pipeline would now have IF1, IF2, ID, EX1, EX2, MEM1, MEM2, and WB stages, each of which would have 100-picosecond latency. Your friend also finds a way to do full forwarding between stages, even into the ALU, but loads are still a problem. In fact, load stalls will increase now because of this increase in pipeline depth. To help you figure out the new # of pipeline stalls from load data hazards, use the following table:

% of Loads    Dependent instruction follows the load...
30%           exactly 1 cycle later
20%           exactly 2 cycles later
20%           exactly 3 cycles later
__%           exactly 4 cycles later
__%           exactly 5 cycles later
__%           exactly 6 cycles later
__%           7 or more cycles later
(the remaining percentages are illegible in this scan)

So this means that 30% of loads are immediately followed by a dependent (i.e. 1 cycle later), 20% of loads have a dependent exactly 2 cycles later, 20% have a dependent 3 cycles later, and so on. These classifications are completely disjoint: the 20% of loads that have a dependent 2 cycles later do NOT have dependents 1 cycle later.
Find the TCPI of this new architecture: TCPI = ________

Assume your target application will run 1M instructions. Find the execution time of this architecture for that application: ET = ________
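For part (a) of Problem 7, the TCPI arithmetic is easy to mistype by hand, so here is one way to organize it. This is a sketch under a common penalty model (an L1 miss pays the 10-cycle L2 latency, and the 5% of L2 misses additionally pay the 100-cycle memory latency); if the course used a different penalty model, the constants change but the structure is the same.

```python
# Instruction mix from Problem 7a
f_load, f_store, f_branch = 0.25, 0.10, 0.15

base_cpi = 1.0

# Load-use stalls: 30% of loads are followed by a dependent instruction (1 bubble)
load_use_cpi = f_load * 0.30 * 1

# Branches: predict not taken, 2-cycle penalty, one third of branches taken
branch_cpi = f_branch * (1 / 3) * 2

# Memory: every instruction fetches; loads and stores also touch the D-cache
l1_miss_penalty = 10 + 0.05 * 100            # L2 latency + (L2 miss rate * memory latency)
icache_cpi = 0.10 * l1_miss_penalty
dcache_cpi = (f_load + f_store) * 0.20 * l1_miss_penalty

tcpi = base_cpi + load_use_cpi + branch_cpi + icache_cpi + dcache_cpi
print(tcpi)                                  # about 3.725 under these assumptions
```

The same skeleton extends to part (d): replace the load-use and branch terms with the deeper pipeline's stall counts, then multiply TCPI by instruction count and the 100 ps cycle time to get execution time.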
This note was uploaded on 06/09/2010 for the course CS 152 taught by Professor Staff during the Spring '98 term at UCLA.