UNIVERSITY OF CALIFORNIA, LOS ANGELES

CS M151B / EE M116C Final Exam

Before you start, make sure you have all 13 pages attached to this cover sheet. All work and answers should be written directly on these pages; use the backs of pages if needed. This is an open book, open notes final, but you cannot share books, notes, or calculators.

Problem 1 (10 points):
Problem 2 (10 points):
Problem 3 (10 points):
Problem 4 (10 points):
Problem 5 (20 points):
Problem 6 (10 points):
Problem 7 (30 points):
Total: (out of 100 points)

1. Amdahl-ighted with Tradeoffs (10 points): Given the following problems, suggest one solution and give one drawback of the solution. Be brief, but specific.

EXAMPLE
Problem: long memory latencies
Solution: caches
Drawback: when the cache misses, the latency becomes worse due to the cache access latency

We would not accept solutions like: "do not use memory", "use a slower CPU", "cache is hard to spell", etc.

Problem: too many capacity misses in the data cache
Solution: ______
Drawback: ______

Problem: too many control hazards
Solution: ______
Drawback: ______

Problem: our carry lookahead adder is too slow
Solution: ______
Drawback: ______

Problem: we want to be able to use a larger immediate field in the MIPS ISA
Solution: ______
Drawback: ______

Problem: the execution time of our CPU with a single-cycle datapath is too high
Solution: ______
Drawback: ______

2. Hazard a Guess? (10 points): Assume you are using the 5-stage pipelined MIPS processor, with a three-cycle branch penalty. Further assume that we always use predict not taken. Consider the following instruction sequence, where the bne is taken once, and then not taken once (so 7 instructions will be executed total):

Loop: lw  $t0, 512($t0)
      lw  $t1, 64($t0)
      bne $t0, $t1, Loop
      sw  $t1, 128($t0)

Assuming that the pipeline is empty before the first instruction:

a. Suppose we do not have any data forwarding hardware; we stall on data hazards. The register file is still written in the first half of a cycle and read in the second half of a cycle, so there is no hazard from WB to ID. Calculate the number of cycles that this sequence of instructions would take: ______
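For hand counts like part (a), each RAW hazard's stall count follows a simple distance rule: with a five-stage pipeline, no forwarding, and a register file written in the first half of a cycle and read in the second half, a consumer must wait until its producer's WB cycle to decode. A minimal sketch of that rule (the function name is illustrative, not from the exam):

```python
def raw_stalls(distance):
    """Stall cycles for a RAW hazard in a classic 5-stage pipeline with no
    forwarding, where the register file is written in the first half of a
    cycle and read in the second half.

    The producer writes back 4 cycles after its IF; a consumer `distance`
    instructions behind it would normally decode (distance + 1) cycles after
    that IF, so it stalls for the difference, never less than zero.
    """
    return max(0, 3 - distance)

# Back-to-back producer/consumer: 2 stall cycles.
print(raw_stalls(1))  # 2
# One independent instruction in between: 1 stall cycle.
print(raw_stalls(2))  # 1
```

With three or more instructions between producer and consumer, the split-cycle register file makes the hazard disappear entirely.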
b. How many cycles would this sequence of instructions take with data forwarding hardware: ______

3. More $ More Problems (10 points): Find the data cache hit or miss stats for a given set of addresses. The data cache is a 1 KB, direct-mapped cache with 64-byte blocks. Find the hit/miss behavior of the cache for a given byte address stream, and label misses as compulsory, capacity, or conflict misses. All blocks in the cache are initially invalid.
Address in Binary        Hit or Miss        Miss Type
[address table entries illegible in this scan]
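The mechanics behind a table like this can be checked with a short simulation. For the 1 KB direct-mapped cache with 64-byte blocks described above, the block offset is 6 bits and the index is 4 bits (16 blocks). The address stream below is made up for illustration; it is not the exam's (illegible) stream, and the compulsory-vs-conflict labeling is a simplification noted in the comment:

```python
BLOCK_BITS = 6           # 64-byte blocks -> 6 offset bits
NUM_BLOCKS = 1024 // 64  # 1 KB direct-mapped cache -> 16 blocks

def simulate(addresses):
    """Return (hit/miss, miss type) per access for a direct-mapped cache.

    A miss is 'compulsory' if the block was never referenced before;
    otherwise this sketch calls it 'conflict', which is only right when the
    live working set still fits in the cache (a true 'capacity' miss needs
    more live blocks than the cache can hold).
    """
    cache = [None] * NUM_BLOCKS      # one tag per set, None = invalid
    seen = set()                     # block numbers referenced so far
    out = []
    for addr in addresses:
        block = addr >> BLOCK_BITS
        index = block % NUM_BLOCKS
        tag = block // NUM_BLOCKS
        if cache[index] == tag:
            out.append(("hit", None))
        else:
            kind = "compulsory" if block not in seen else "conflict"
            out.append(("miss", kind))
            cache[index] = tag
        seen.add(block)
    return out

# Hypothetical stream: 0x000 and 0x400 share index 0 with different tags,
# so the second reference to 0x000 becomes a conflict miss.
print(simulate([0x000, 0x010, 0x400, 0x000]))
```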
4. The Trouble with TLBs (10 points): Consider an architecture with 32-bit virtual addresses and 1 GB of physical memory. Pages are 32 KB and we have a TLB with 64 sets that is 8-way set associative. The data and instruction caches are 8 KB with 16 B block sizes and are direct mapped, and they are both virtually indexed and physically tagged. Assume that every page mapping (in the TLB or page table) requires 1 extra bit for storing protection information. Answer the following:
a. How many pages of virtual memory can fit in physical memory at a time? ______
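The sub-questions in this problem all reduce to address-bit bookkeeping for the parameters above (32-bit virtual addresses, 1 GB physical memory, 32 KB pages, a 64-set 8-way TLB). A sketch of that arithmetic, assuming a page-table entry holds just the physical page number plus the 1 protection bit:

```python
VA_BITS = 32
PHYS_BYTES = 1 << 30          # 1 GB of physical memory
PAGE_BYTES = 32 * 1024        # 32 KB pages
TLB_SETS, TLB_WAYS = 64, 8

page_offset_bits = PAGE_BYTES.bit_length() - 1   # log2(32 KB) = 15
vpn_bits = VA_BITS - page_offset_bits            # 17 -> 2**17 virtual pages
physical_pages = PHYS_BYTES // PAGE_BYTES        # pages resident at once
tlb_entries = TLB_SETS * TLB_WAYS                # translations the TLB holds

# Assumed PTE layout: physical page number (15 bits) + 1 protection bit.
pte_bits = (physical_pages - 1).bit_length() + 1
page_table_bytes = (1 << vpn_bits) * pte_bits // 8

# The TLB set index is the low log2(64) = 6 bits of the virtual page
# number, i.e. the 6 bits just above the page offset.
tlb_index_bits = TLB_SETS.bit_length() - 1

print(page_offset_bits, vpn_bits, physical_pages,
      tlb_entries, pte_bits, page_table_bytes, tlb_index_bits)
```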
b. How large (in bytes) is the page table? ______

c. What fraction of the total number of page translations can fit in the TLB? ______

d. What bits of a virtual address will be used for the index to the TLB? Specify this as a range of bits, i.e. "bits 4 to 28 will be used as the index." The least significant bit is labeled 0 and the most significant bit is labeled 31. ______

5. Starting Some Static (Scheduling) (20 points): Consider the 2-way superscalar processor we covered in class,
i.e. bits 4 to 28 will be used as the index. The Eeast signiﬁcant bit is iabeled 0 and the most signiﬁcant bit is iabeied 31. 20W / 5,. Ur . Starting Some Static (Scheduling) (20 points): Consider the 2—way superscalal' processor we covered in class
i a five stage pipeline where we can iSSue one ALU or branch instruction along with one load or store
instruction every cycle. Suppose that the branch delay penalty is two cycles and that we handle control
hazards with branch delay slots (since the penalty is two cycles, and this is a 2way superscalar processor, that
would be four instructions that we need to place in delay slots). This processor has full forwarding hardware.
This processor is a VLIW machine. How long wouid the following code take to execute on this processor
assuming the ioop is executed 200 times? Assume the pipeline is initially empty and give the time taken up
until the completed execution of the instruction sequence shown here. First you will need to schedule (i.e.
reorder) the code (use the table below) to reduce the total number ofcycles required (but don’t unroll it...yet). Tow! # OfCYCIeS for 200 iterations: j (Him 7 schedule the codeﬁrsthr one iteration, lhenﬁgtrre out how long it will take the processor 10 mm 200
{fermions ofthis scheduled cede) Loop: lw SEQ, 0 ($50)
lw st; 0 (s39)
add at}, $31, Stl
sw $31, 0 ($33M it you may assume that this store never goes to the same address as the first load
eddi $50, $50, 4
      bne  $s0, $s2, Loop

[scheduling table: 1st Issue Slot (ALU or Branch) / 2nd Issue Slot (LW or SW)]

Now unroll the loop once to make two copies of the loop body. Schedule it again and record the total # of cycles for 200 iterations: ______

6. A Branch Too Far (10 points): One difficulty in designing a branch predictor is trying to avoid cases where
two PCs with very different branch behavior index to the same entry of a 2-bit branch predictor. This is called destructive aliasing. One way around this is to use multiple 2-bit branch predictors with different sizes. This way, if two PCs index to the same entry in one predictor, they will not likely index to the same entry in the other predictor. We will evaluate a scheme with three 2-bit branch predictors, each with a different number of entries. The three predictors will be accessed in parallel, and the majority decision of the predictors will be chosen. So if two predictors say taken and the other predictor says not taken, the majority decision will be taken. The scheme looks like this:

[diagram: three predictors indexed by the PC in parallel, feeding a majority vote]

Each predictor has the following FSM:

[FSM diagram: a 2-bit counter with states 00 and 01 (predict not taken) and 10 and 11 (predict taken); taken outcomes move the state toward 11, not-taken outcomes move it toward 00]

Evaluate the performance of this prediction scheme on the following sequence of PCs. The table shows the
address of the branch and the actual direction of the branch (taken or not taken). You get to fill in whether or not the branch predictor would guess correctly or not. Each node of the FSM is marked with the 2-bit value representing that state. Assume that all predictors are initialized to 00. To find an index into a predictor, assume we use the simplified branch indexing formula: index = PC % predictor_size. The symbol % represents the modulo operator. Predictor_size will be different according to the predictor.

PC        Actual Direction        Predicted?        Correct?
[table entries illegible in this scan]
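The voting scheme can be prototyped in a few lines. The table sizes below are made-up placeholders (the exam's actual predictor sizes are in the illegible table), and the FSM is assumed to be the standard 2-bit saturating counter, which matches the four labeled states:

```python
class TwoBitPredictor:
    """2-bit saturating counter table: states 0 and 1 predict not taken,
    states 2 and 3 predict taken. Indexed by PC % table_size."""

    def __init__(self, size):
        self.state = [0] * size          # all counters initialized to 00

    def predict(self, pc):
        return self.state[pc % len(self.state)] >= 2   # True = taken

    def update(self, pc, taken):
        i = pc % len(self.state)
        self.state[i] = min(3, self.state[i] + 1) if taken else max(0, self.state[i] - 1)

def majority_predict(predictors, pc):
    """Majority vote of an odd number of predictors accessed in parallel."""
    votes = sum(p.predict(pc) for p in predictors)
    return votes > len(predictors) // 2

# Three differently sized tables (sizes are illustrative, not the exam's).
preds = [TwoBitPredictor(n) for n in (4, 8, 16)]
pc, taken = 0x40, True
guess = majority_predict(preds, pc)       # every counter is 00 -> not taken
for p in preds:
    p.update(pc, taken)
print(guess)  # False
```

Note that a single taken branch moves each counter only from 00 to 01, so the scheme still predicts not taken until a counter reaches state 10.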
7. With Friends Like These... (30 points): Consider the scalar pipeline we have explored in class:

[five-stage datapath diagram: instruction memory, register file, ALU, and data memory, with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline latches and forwarding muxes driven by the IF/ID register specifiers and the EX/MEM and MEM/WB destination registers]

a. (10 points) Suppose 10% of instructions are stores, 15% are branches, 25% are loads, and the rest are
R-type. 30% of all loads are followed by a dependent instruction. We have full forwarding hardware on this architecture. We use a predict not taken branch prediction policy and there is a 2-cycle branch penalty. This means that the PC is updated at the end of the EX stage, after the comparison is made in the ALU. One third of all branches are taken. There is an instruction cache with a single cycle latency and a miss rate of 10%, and a data cache with a single cycle latency and a miss rate of 20%. We have an L2 cache that misses 5%; it has a 10 cycle latency, and memory has a 100 cycle latency.
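Bookkeeping for a TCPI question like this is mostly a matter of keeping the stall components separate. The sketch below shows the method only; it assumes the components simply add, that the 5% L2 miss rate is local, and that every L1 miss pays the 10-cycle L2 latency plus memory on an L2 miss. Whether those conventions match the intended solution is a judgment call, so treat the final number as illustrative:

```python
base_cpi = 1.0

loads, stores, branches = 0.25, 0.10, 0.15
load_use = loads * 0.30 * 1            # lw -> use needs 1 bubble with forwarding
branch_stalls = branches * (1 / 3) * 2 # only taken branches pay the 2-cycle penalty

l1_miss_penalty = 10 + 0.05 * 100      # L2 latency + local 5% going to memory
icache = 1.0 * 0.10 * l1_miss_penalty  # every instruction is fetched
dcache = (loads + stores) * 0.20 * l1_miss_penalty

tcpi = base_cpi + load_use + branch_stalls + icache + dcache
print(tcpi)
```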
Find the TCPI for this architecture. TCPI = ______

b. (5 points) Your friend has a flash of brilliance: "I know a way to get rid of stalls in this pipeline. The reason we have to stall now is because a load can have a dependent instruction follow it through the pipeline, and we cannot forward the load's data until the end of the MEM stage, but the dependent instruction needs it at the beginning of the EX stage. So what if we add another ALU that recomputes what we did in EX if the instruction before it is a load and it is dependent on the load?" This ALU will be in the memory stage of the pipeline as shown below in this simplified picture:
[simplified datapath diagram with a second ALU added in the MEM stage, next to the data memory]

Is your friend right or wrong? If they are wrong, give an example of when we would still need to stall.

They are right: ______
Or
Counter example: ______

c. (5 points) Another friend offers an alternative: using the original pipeline from part a, let's get rid of
base + displacement addressing for loads and stores. Loads and stores can only use register addressing now. This will allow us to combine EX and MEM into one stage (called EM) and avoid the need to stall entirely. Instructions will either use the ALU or memory, but not both. There is still forwarding hardware, but now we only need to forward from the EM/WB latch to the EM stage ALU. The pipeline will now be:

[four-stage pipeline diagram: IF, ID, EM, WB]

Suppose that four fifths of loads actually use base + displacement addressing (i.e. they have a nonzero displacement), which means that these loads will need to have add instructions before them to do their effective address computation. Half of stores use base + displacement addressing, and these will also need to be replaced with an add plus the store instruction. This modification has no impact on the branch penalty or the instruction cache miss rate. Is your friend right or wrong? Will this eliminate all stalls? If they are wrong, give an example of when we would still need to stall.

They are right: ______
Or
Counter example: ______

d. (10 points) A third friend has a different idea (it may be time for you to get new friends who don't talk
about architecture all the time). Forget about trying to eliminate hazards, she says; we should just use superpipelining and get a win on cycle time. Take the original architecture from part a (ignore the suggestions from b and c) and assume that the stages have the following latencies:

Stage:                  IF     ID     EX     MEM    WB
Latency (picoseconds):  200    100    200    200    100

Your friend suggests a way to cut the IF, EX, and MEM stages in half: just increase the pipeline depth and make each of these stages into two stages. So your pipeline would now have IF1, IF2, ID, EX1, EX2, MEM1, MEM2, and WB stages, each of which would have 100 picosecond latency. Your friend also finds a way to do full forwarding between stages, even in the ALU, but loads are still a problem. In fact, load stalls will increase now because of this increase in pipeline depth. To help you figure out the new # of pipeline stalls from load data hazards, use the following table:
Dependent follows the load...       % of Loads
Exactly 1 cycle later               30%
Exactly 2 cycles later              20%
Exactly 3 cycles later              20%
Exactly 4 cycles later              [illegible]
Exactly 5 cycles later              [illegible]
Exactly 6 cycles later              [illegible]
Exactly 7 or more cycles later      [illegible]

So this means that 30% of loads are immediately followed by a dependent (i.e. 1 cycle later), 20% of loads have a dependent exactly 2 cycles later, 20% have a dependent 3 cycles later, and so on. These classifications are completely disjoint; the 20% of loads that have a dependent 2 cycles later do NOT have dependents 1 cycle later.
Find the TCPI of this new architecture: TCPI = ______
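For part (d)'s load stalls, each load's penalty depends on how far behind its first dependent is. If the loaded value is forwarded from the end of MEM2 into EX1 (an assumption about this pipeline, not something the problem states), a dependent k cycles behind the load stalls max(0, 4 - k) cycles. The distribution below uses only the three percentages that are legible in the table (30/20/20); the remaining 30% is lumped at "4 or more cycles later", which contributes no stalls under this assumption:

```python
def expected_load_stalls(dist, window=4):
    """Average stall cycles per load, assuming a dependent k cycles behind
    a load stalls max(0, window - k) cycles (value forwarded from the end
    of MEM2 into the start of EX1 in the 8-stage superpipeline)."""
    return sum(frac * max(0, window - k) for k, frac in dist.items())

# Legible table rows; everything >= 4 cycles later never stalls here.
dependents = {1: 0.30, 2: 0.20, 3: 0.20, 4: 0.30}
per_load = expected_load_stalls(dependents)
print(per_load)   # 0.3*3 + 0.2*2 + 0.2*1, i.e. about 1.5 stalls per load
```

Multiplying this per-load figure by the load frequency gives the load-hazard component of the new TCPI; the branch and cache components are tallied the same way as in part (a), with the deeper pipeline's penalties.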
Assume your target application will run [illegible] instructions. Find the execution time of this architecture for that application: ET = ______
This note was uploaded on 04/18/2010 for the course CS 151B taught by Professor N/A during the Spring '10 term at UCLA.