hw3soln - 4.12.6 We already computed clock cycle times for...


Prof. Schimmel
ECE 3055 Computer Architecture and OS
Summer 2009
Homework 3 Solution

4.12.6 We already computed clock cycle times for the pipelined and single-cycle organizations in 4.12.1, and the multi-cycle organization has the same clock cycle time as the pipelined organization. We will compute execution times relative to the pipelined organization. In single-cycle, every instruction takes one (long) clock cycle. In pipelined, a long-running program with no pipeline stalls completes one instruction in every cycle. Finally, a multi-cycle organization completes a lw in 5 cycles, a sw in 4 cycles (no WB), an ALU instruction in 4 cycles (no MEM), and a beq in 4 cycles (no WB). So we have the speed-up of pipeline:

    Multi-cycle execution time is X times      Single-cycle execution time is X times
    pipelined execution time, where X is       pipelined execution time, where X is
a.  0.15 × 5 + 0.85 × 4 = 4.15                 1650ps/500ps = 3.30
b.  0.30 × 5 + 0.70 × 4 = 4.30                 800ps/200ps = 4.00
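The two ratios in 4.12.6 can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the lw fractions (15% and 30%) and clock periods are read off the arithmetic in the table rather than from the exercise statement.

```python
def multicycle_relative_time(lw_fraction):
    """Average cycles per instruction for the multi-cycle design.

    lw takes 5 cycles; sw, ALU and beq instructions all take 4.
    Because the multi-cycle and pipelined clocks are equal, this average
    is also the execution time relative to a stall-free pipeline.
    """
    return lw_fraction * 5 + (1 - lw_fraction) * 4


def single_cycle_relative_time(single_cycle_ps, pipelined_ps):
    """Both designs finish one instruction per cycle, so only the clock ratio matters."""
    return single_cycle_ps / pipelined_ps


# (label, lw fraction, single-cycle clock in ps, pipelined clock in ps)
cases = [("a", 0.15, 1650, 500), ("b", 0.30, 800, 200)]
for label, lw_fraction, single_ps, pipe_ps in cases:
    multi = multicycle_relative_time(lw_fraction)
    single = single_cycle_relative_time(single_ps, pipe_ps)
    print(f"{label}. multi-cycle: {multi:.2f}x  single-cycle: {single:.2f}x")
```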
Solution 4.13

4.13.1

    Instruction sequence       Dependences
a.  I1: lw  $1,40($6)          RAW on $1 from I1 to I3
    I2: add $6,$2,$2           RAW on $6 from I2 to I3
    I3: sw  $6,50($1)          WAR on $6 from I1 to I2 and I3
b.  I1: lw  $5,-16($5)         RAW on $5 from I1 to I2 and I3
    I2: sw  $5,-16($5)         WAR on $5 from I1 and I2 to I3
    I3: add $5,$5,$5           WAW on $5 from I1 to I3

Note: A RAW hazard results from a true (forward, RAW) dependency. A WAR hazard results from an anti (backward, WAR) dependency. A WAW hazard results from an output (WAW) dependency.

4.13.2 In the basic five-stage pipeline WAR and WAW dependences do not cause any hazards. Without forwarding, any RAW dependence between an instruction and the next two instructions causes a hazard (if register read happens in the second half of the clock cycle and the register write happens in the first half). The code that eliminates these hazards by inserting nop instructions is:

    Instruction sequence
a.  lw  $1,40($6)
    add $6,$2,$2
    nop                        Delay I3 to avoid RAW hazard on $1 from I1
    sw  $6,50($1)
b.  lw  $5,-16($5)
    nop                        Delay I2 to avoid RAW hazard on $5 from I1
    nop
    sw  $5,-16($5)
    add $5,$5,$5               Note: no RAW hazard on $5 from I1 now

4.13.3 With full forwarding, an ALU instruction can forward a value to the EX stage of the next instruction without a hazard. However, a load cannot forward to the EX stage of the next instruction (but can to the instruction after that). The code that eliminates these hazards by inserting nop instructions is:

    Instruction sequence
a.  lw  $1,40($6)
    add $6,$2,$2
    sw  $6,50($1)              No RAW hazard on $1 from I1 (forwarded)
b.  lw  $5,-16($5)
    nop                        Delay I2 to avoid RAW hazard on $5 from I1
    sw  $5,-16($5)             Value for $5 is forwarded from I2 now
    add $5,$5,$5               Note: no RAW hazard on $5 from I1 now

4.13.4 The total execution time is the clock cycle time times the number of cycles. Without any stalls, a three-instruction sequence executes in 7 cycles (5 to complete the first instruction, then one per instruction). The execution without forwarding must add a stall for every nop we had in 4.13.2, and execution with forwarding must add a stall cycle for every nop we had in 4.13.3. Overall, we get:

    No forwarding                With forwarding              Speed-up due to forwarding
a.  (7 + 1) × 300ps = 2400ps     7 × 400ps = 2800ps           0.86 (This is really a slowdown)
b.  (7 + 2) × 200ps = 1800ps     (7 + 1) × 250ps = 2000ps     0.90 (This is really a slowdown)

4.13.5 With ALU-ALU-only forwarding, an ALU instruction can forward to the next instruction, but not to the second-next instruction (because that would be forwarding from MEM to EX). A load cannot forward at all, because it determines the data value in the MEM stage, when it is too late for ALU-ALU forwarding. We have:

    Instruction sequence
a.  lw  $1,40($6)
    add $6,$2,$2
    nop                        Can't use ALU-ALU forwarding ($1 loaded in MEM)
    sw  $6,50($1)
b.  lw  $5,-16($5)
    nop                        Can't use ALU-ALU forwarding ($5 loaded in MEM)
    nop
    sw  $5,-16($5)
    add $5,$5,$5

4.13.6

    No forwarding                With ALU-ALU forwarding only   Speed-up with ALU-ALU forwarding
a.  (7 + 1) × 300ps = 2400ps     (7 + 1) × 360ps = 2880ps       0.83 (This is really a slowdown)
b.  (7 + 2) × 200ps = 1800ps     (7 + 2) × 220ps = 1980ps       0.91 (This is really a slowdown)
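The timing tables in 4.13.4 and 4.13.6 use the same arithmetic: (7 + number of nops) cycles multiplied by the clock period. A small sketch of that calculation follows; the helper names are illustrative, and the nop counts and clock periods are copied from the tables.

```python
BASE_CYCLES = 7  # 5 cycles for the first instruction, then 1 for each of the other two


def exec_time_ps(nops, clock_ps):
    """Execution time of the three-instruction sequence with `nops` stall cycles."""
    return (BASE_CYCLES + nops) * clock_ps


def speedup(baseline_ps, new_ps):
    """Values below 1.0 mean the new design is slower than the baseline."""
    return baseline_ps / new_ps


# (nops, clock period in ps) for each design point, taken from the tables
no_forwarding = {"a": (1, 300), "b": (2, 200)}
full_forwarding = {"a": (0, 400), "b": (1, 250)}
alu_alu_only = {"a": (1, 360), "b": (2, 220)}

for label in ("a", "b"):
    base = exec_time_ps(*no_forwarding[label])
    full = exec_time_ps(*full_forwarding[label])
    alu = exec_time_ps(*alu_alu_only[label])
    print(f"{label}. no fwd {base}ps | full fwd {full}ps ({speedup(base, full):.2f}) "
          f"| ALU-ALU only {alu}ps ({speedup(base, alu):.2f})")
```

In every case the longer clock period outweighs the cycles saved, which is why all four speed-up values fall below 1.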
Solution 4.14

4.14.1 In the pipelined execution shown below, *** represents a stall when an instruction cannot be fetched because a load or store instruction is using the memory in that cycle. Cycles are represented from left to right, and for each instruction we show the pipeline stage it is in during that cycle:

    Instruction        Pipeline stage                                                Cycles
a.  lw  $1,40($6)      IF   ID   EX   MEM  WB                                        9
    beq $2,$0,Lbl           IF   ID   EX   MEM  WB
    add $2,$3,$4                 IF   ID   EX   MEM  WB
    sw  $3,50($4)                     ***  IF   ID   EX   MEM  WB
b.  lw  $5,-16($5)     IF   ID   EX   MEM  WB                                        12
    sw  $4,-16($4)          IF   ID   EX   MEM  WB
    lw  $3,-20($4)               IF   ID   EX   MEM  WB
    beq $2,$0,Lbl                     ***  ***  ***  IF   ID   EX   MEM  WB
    add $5,$1,$4                                          IF   ID   EX   MEM  WB

We cannot add nops to the code to eliminate this hazard, because nops need to be fetched just like any other instructions, so this hazard must be addressed with a hardware hazard detection unit in the processor.

4.14.2 This change only saves one cycle in an entire execution without data hazards (such as the one given). This cycle is saved because the last instruction finishes one cycle earlier (one less stage to go through). If there were data hazards from loads to other instructions, the change would help eliminate some stall cycles.

    Instructions   Cycles with    Cycles with
    Executed       5 stages       4 stages       Speed-up
a.  4              4 + 4 = 8      3 + 4 = 7      8/7 = 1.14
b.  5              4 + 5 = 9      3 + 5 = 8      9/8 = 1.13

4.14.3 Stall-on-branch delays the fetch of the next instruction until the branch is executed. When branches execute in the EXE stage, each branch causes two stall cycles. When branches execute in the ID stage, each branch only causes one stall cycle. Without branch stalls (e.g., with perfect branch prediction) there are no stalls, and the execution time is 4 plus the number of executed instructions. We have:

    Instructions   Branches    Cycles with             Cycles with
    Executed       Executed    branch in EXE           branch in ID            Speed-up
a.  4              1           4 + 4 + 1 × 2 = 10      4 + 4 + 1 × 1 = 9       10/9 = 1.11
b.  5              1           4 + 5 + 1 × 2 = 11      4 + 5 + 1 × 1 = 10      11/10 = 1.10

4.14.4 The number of cycles for the (normal) 5-stage and the (combined EX/MEM) 4-stage pipeline is already computed in 4.14.2. The clock cycle time is equal to the latency of the longest-latency stage. Combining EX and MEM stages affects ...

... Rt field from the ID/EX pipeline register. The register is already an input of the hazard detection unit in Figure 4.60.

... No additional outputs are needed. We can stall the pipeline using the three output signals that we already have.

4.21.6 As explained for 4.21.5, we only need to specify the value of the PCWrite signal, because IF/IDWrite is equal to PCWrite and the ID/EXzero signal is its opposite. We have:

    Instruction sequence     Signals (first five cycles)
a.  lw  $1,40($6)            1: PCWrite = 1
    add $2,$3,$1             2: PCWrite = 1
    add $1,$6,$4             3: PCWrite = 1
    sw  $2,20($4)            4: PCWrite = 0
    and $1,$1,$4             5: PCWrite = 0
b.  add $1,$5,$3             1: PCWrite = 1
    sw  $1,0($2)             2: PCWrite = 1
    lw  $1,4($2)             3: PCWrite = 1
    add $5,$5,$1             4: PCWrite = 0
    sw  $1,0($2)             5: PCWrite = 0

[Per-cycle pipeline stage columns of this table are not recoverable from the preview.]
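The relationship stated in 4.21.6 (IF/IDWrite equals PCWrite, and ID/EXzero is its opposite) means one bit per cycle determines all three hazard-detection outputs. A minimal sketch, assuming the signals are modeled as plain 0/1 integers; the function name and the dictionary layout are illustrative only.

```python
def hazard_unit_outputs(pc_write):
    """Derive all three hazard-detection outputs from PCWrite (0 or 1)."""
    return {
        "PCWrite": pc_write,
        "IF/IDWrite": pc_write,     # equal to PCWrite
        "ID/EXzero": 1 - pc_write,  # opposite of PCWrite
    }


# PCWrite for cycles 1..5 as listed in the 4.21.6 table (identical for a and b)
pc_write_by_cycle = [1, 1, 1, 0, 0]

for cycle, pc_write in enumerate(pc_write_by_cycle, start=1):
    print(cycle, hazard_unit_outputs(pc_write))
```

When PCWrite is 0 the PC and the IF/ID register hold their values while the ID/EX control signals are zeroed, which is how a bubble is injected into the pipeline.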
Solution 4.22

4.22.1

    Executed Instructions        Pipeline stages (cycles 1-14)
a.  lw  $1,40($6)                IF ID EX MEM WB
    beq $2,$3,Label2 (T)         IF ID EX MEM WB
    beq $1,$2,Label1 (NT)        IF ID EX MEM WB
    sw  $2,20($4)                IF ID EX MEM WB
    and $1,$1,$4                 IF ID EX MEM WB
b.  add $1,$5,$3                 IF ID EX MEM WB
    sw  $1,0($2)                 IF ID EX MEM WB
    add $2,$2,$3                 IF ID EX MEM WB
    beq $2,$4,Label1 (NT)        IF ID EX MEM WB
    add $5,$5,$1                 IF ID EX MEM WB
    sw  $1,0($2)                 IF ID EX MEM WB

[The cycle in which each instruction starts is not recoverable from the preview.]

4.22.2

    Executed Instructions        Pipeline stages (cycles 1-14)
a.  lw  $1,40($6)                IF ID EX MEM WB
    beq $2,$3,Label2 (T)         IF ID EX MEM WB
    add $1,$6,$4                 IF ID EX MEM WB
    beq $1,$2,Label1 (NT)        IF ID EX MEM WB
    sw  $2,20($4)                IF ID EX MEM WB
    and $1,$1,$4                 IF ID EX MEM WB
b.  add $1,$5,$3                 IF ID EX MEM WB
    sw  $1,0($2)                 IF ID EX MEM WB
    add $2,$2,$3                 IF ID EX MEM WB
    beq $2,$4,Label1 (NT)        IF ID EX MEM WB
    add $5,$5,$1                 IF ID EX MEM WB
    sw  $1,0($2)                 IF ID EX MEM WB

[The cycle in which each instruction starts is not recoverable from the preview.]

4.22.3

a.  Label1: lw   $1,40($6)
            seq  $8,$2,$3
            bnez $8,Label2 ; Taken
            add  $1,$6,$4
    Label2: seq  $8,$1,$2
            bnez $8,Label1 ; Not taken
            sw   $2,20($4)
            and  $1,$1,$4
b.          add  $1,$5,$3
    Label1: sw   $1,0($2)
            add  $2,$2,$3
            seq  $8,$2,$4
            bnez $8,Label1 ; Not taken
            add  $5,$5,$1
            sw   $1,0($2)

4.22.4 The hazard detection logic must detect situations when the branch depends on the result of the previous R-type instruction, or on the result of two previous loads. When the branch uses the values of its register operands in its ID stage, the R-type instruction's result is still being generated in the EX stage. Thus we must stall the processor and repeat the ID stage of the branch in the next cycle. Similarly, if the branch depends on a load that immediately precedes it, the result of the load is only generated two cycles after the branch enters the ID stage, so we must stall the branch for two cycles. Finally, if the branch depends on a load that is the second-previous instruction, the load is completing its MEM stage when the branch is in its ID stage, so we must stall the branch for one cycle. In all three cases, the hazard is a data hazard. Note that in all three cases we assume that the values of preceding instructions are forwarded to the ID stage of the branch if possible.

4.22.5 4.22.1 already shows the pipeline execution diagram for the case when branches are executed in the EX stage. The following is the pipeline diagram when branches are executed in the ID stage, including new stalls due to data dependences described for 4.22.4:

    Executed Instructions        Pipeline stages (cycles 1-14)
a.  lw  $1,40($6)                IF ID EX MEM WB
    beq $2,$3,Label2 (T)         IF ID EX MEM WB
    beq $1,$2,Label1 (NT)        IF *** ID EX MEM WB
    sw  $2,20($4)                IF ID EX MEM WB
    and $1,$1,$4                 IF ID EX MEM WB
b.  add $1,$5,$3                 IF ID EX MEM WB
    sw  $1,0($2)                 IF ID EX MEM WB
    add $2,$2,$3                 IF ID EX MEM WB
    beq $2,$4,Label1 (NT)        IF *** ID EX MEM WB
    add $5,$5,$1                 IF ID EX MEM WB
    sw  $1,0($2)                 IF ID EX MEM WB

[The cycle in which each instruction starts is not recoverable from the preview.]

Now the speed-up can be computed as:

a.  11/10 = 1.1
b.  12/12 = 1

4.22.6 Branch instructions are now executed in the ID stage. If the branch instruction is using a register value produced by the immediately preceding instruction, as we described for 4.22.4 the branch must be stalled because the preceding instruction is in the EX stage when the branch is already using the stale register values in the ID stage. If the branch in the ID stage depends on an R-type instruction that is in the MEM stage, we need forwarding to ensure correct execution of the branch. Similarly, if the branch in the ID stage depends on an R-type or load instruction in the WB stage, we need forwarding to ensure correct execution of the branch. Overall, we need another forwarding unit that takes the same inputs as the one that forwards to the EX stage. The new forwarding unit should control two Muxes placed right before the branch comparator. Each Mux selects between the value read from Registers, the ALU output from the EX/MEM pipeline register, ...

Solution 5.3

5.3.1

a.  Binary address: 1₂, 10000110₂, 11010100₂, 1₂, 10000111₂, 11010101₂, 10100010₂, 10100001₂, 10₂, 101100₂, 101001₂, 11011101₂
    Tag: Binary address >> 4 bits
    Index: Binary address mod 16
    Hit/Miss: M, M, M, H, M, M, M, M, M, M, M, M
b.  Binary address: 00000110₂, 11010110₂, 10101111₂, 11010110₂, 00000110₂, 01010100₂, 01000001₂, 10101110₂, 01000000₂, 01101001₂, 01010101₂, 11010111₂
    Tag: Binary address >> 4 bits
    Index: Binary address mod 16
    Hit/Miss: M, M, M, H, M, M, M, M, M, M, M, M

5.3.2

a.  Binary address: 1₂, 10000110₂, 11010100₂, 1₂, 10000111₂, 11010101₂, 10100010₂, 10100001₂, 10₂, 101100₂, 101001₂, 11011101₂
    Tag: Binary address >> 3 bits
    Index: (Binary address >> 1 bit) mod 8
    Hit/Miss: M, M, M, H, H, H, M, M, M, M, M, M
b.  Binary address: 00000110₂, 11010110₂, 10101111₂, 11010110₂, 00000110₂, 01010100₂, 01000001₂, 10101110₂, 01000000₂, 01101001₂, 01010101₂, 11010111₂
    Tag: Binary address >> 3 bits
    Index: (Binary address >> 1 bit) mod 8
    Hit/Miss: M, M, M, H, M, M, M, H, H, M, H, M

5.3.3

a.  C1: 1 hit, C2: 3 hits, C3: 2 hits.
    C1: Stall time = 25 × 11 + 2 × 12 = 299
    C2: Stall time = 25 × 9 + 3 × 12 = 261
    C3: Stall time = 25 × 10 + 4 × 12 = 298
b.  C1: 1 hit, stall time = 25 × 11 + 2 × 12 = 299 cycles
    C2: 4 hits, stall time = 25 × 8 + 3 × 12 = 236 cycles
    C3: 4 hits, stall time = 25 × 8 + 5 × 12 = 260 cycles

5.3.4

a.  Using the equation on page 351, n = 14 bits, m = 0 (1 word per block):
    2^14 × (2^0 × 32 + (32 − 14 − 0 − 2) + 1) = 802 Kbits
    Calculating for 16-word blocks, m = 4, if n = 10 then the cache is 541 Kbits, and if n = 11 then the cache is 1 Mbit. Thus the cache has 128 KB of data.
    The larger cache may have a longer access time, leading to lower performance.
b.  Using the equation total cache size = 2^n × (2^m × 32 + (32 − n − m − 2) + 1), n = 13 bits, m = 1 (2 words per block):
    2^13 × (2^1 × 32 + (32 − 13 − 1 − 2) + 1) = 2^13 × (64 + 17) = 663 Kbits total cache size
    For m = 4 (16-word blocks), if n = 10 then the cache is 541 Kbits and if n = 11 then the cache is 1 Mbit. Thus the cache has 64 KB of data.
    The larger cache may have a longer access time, leading to lower performance.

[Exercise number not shown in the preview]

a.  16 KB / 8 bytes per block => 2K sets => 11 bits for indexing
    With a 4 KB page, the lower 12 bits are available for use for indexing prior to translation from VA to PA. However, a 16 KB direct-mapped cache needs the lower 14 bits to remain the same between a VA to PA translation. Thus it is not possible to build this cache.
    If the cache's data size is to be increased, a higher associativity must be used. If the cache has 2 words per block, only 9 bits are available for indexing (thus 512 sets). To make a 16 KB cache, a 4-way associativity must be used.
b.  virtual address size of 64 bits
    16 KB (2^14) page size, 8 (2^3) bytes per page table entry
    16 KB direct-mapped cache
    2 words or 8 (2^3) bytes per block, means 3 bits for cache block offset
    16 KB / 8 bytes per block = 2K sets or 11 bits for indexing
    With a 16 KB page, the lower 14 bits are available for use for indexing prior to translation from virtual to physical. A 16 KB direct-mapped cache requires the lower 14 bits to remain the same across translation. Hence, it is possible to build this cache.
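The reasoning about indexing a cache before address translation reduces to comparing two bit counts: the index plus block-offset bits the cache needs, and the page-offset bits that translation leaves unchanged. A minimal sketch under the assumption that all sizes are powers of two; the function names are illustrative and do not come from the solution.

```python
def log2_exact(value):
    """log2 of a power of two, computed without floating point."""
    assert value > 0 and value & (value - 1) == 0, "expected a power of two"
    return value.bit_length() - 1


def index_and_offset_bits(cache_bytes, ways):
    """Address bits needed to pick a set and a byte within the block."""
    return log2_exact(cache_bytes // ways)


def can_index_before_translation(cache_bytes, ways, page_bytes):
    """True if every index/offset bit lies inside the untranslated page offset."""
    return index_and_offset_bits(cache_bytes, ways) <= log2_exact(page_bytes)


def min_ways_for_page(cache_bytes, page_bytes):
    """Smallest associativity that keeps the index inside the page offset."""
    return max(1, cache_bytes // page_bytes)


KB = 1024
print(can_index_before_translation(16 * KB, 1, 4 * KB))   # 14 bits needed, 12 available
print(can_index_before_translation(16 * KB, 1, 16 * KB))  # 14 bits needed, 14 available
print(min_ways_for_page(16 * KB, 4 * KB))                 # associativity needed with 4 KB pages
```

With 4 KB pages each way may hold at most 4 KB, so a 16 KB cache needs 4 ways, which matches the 512-set, 2-words-per-block configuration described above.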
Solution 5.11

5.11.1

a.  virtual address 32 bits, physical memory 4 GB
    page size 8 KB or 13 bits, page table entry 4 bytes or 2 bits
    #PTE = 32 − 13 = 19 bits or 512K entries
    PT physical memory = 512K × 4 bytes = 2 MB
b.  virtual address 64 bits, physical memory 16 GB
    page size 4 KB or 12 bits, page table entry 8 bytes or 3 bits
    #PTE = 64 − 12 = 52 bits or 2^52 entries
    PT physical memory = 2^52 × 2^3 = 2^55 bytes

5.11.2

a.  virtual address 32 bits, physical memory 4 GB
    page size 8 KB or 13 bits, page table entry 4 bytes or 2 bits
    #PTE = 32 − 13 = 19 bits or 512K entries
    8 KB page / 4 byte PTE = 2^11 pages indexed per page
    Hence 2^19 PTEs will need a 2-level page table setup.
    Each address translation will require at least 2 physical memory accesses.
b.  virtual address 64 bits, physical memory 16 GB
    page size 4 KB or 12 bits, page table entry 8 bytes or 3 bits
    #PTE = 64 − 12 = 52 bits or 2^52 entries
    4 KB page / 8 byte PTE = 2^9 pages indexed per page
    Hence 2^52 PTEs will need a 6-level page table setup.
    Each address translation will require at least 6 physical memory accesses.
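The page-table figures in 5.11.1 and 5.11.2 follow mechanically from the virtual address width, page size, and PTE size. The sketch below reproduces them; the function name and the returned dictionary keys are illustrative, and it assumes page and PTE sizes are powers of two.

```python
def page_table_stats(va_bits, page_bytes, pte_bytes):
    """Flat page-table size and the number of levels a multi-level table needs."""
    offset_bits = page_bytes.bit_length() - 1   # log2 of the page size
    vpn_bits = va_bits - offset_bits            # bits that select a PTE
    entries = 1 << vpn_bits
    flat_bytes = entries * pte_bytes            # single-level table size
    # Each page of the table holds page_bytes / pte_bytes entries,
    # so one level resolves log2 of that many VPN bits.
    bits_per_level = (page_bytes // pte_bytes).bit_length() - 1
    levels = -(-vpn_bits // bits_per_level)     # ceiling division
    return {
        "vpn_bits": vpn_bits,
        "entries": entries,
        "flat_bytes": flat_bytes,
        "bits_per_level": bits_per_level,
        "levels": levels,
    }


a = page_table_stats(va_bits=32, page_bytes=8 * 1024, pte_bytes=4)
b = page_table_stats(va_bits=64, page_bytes=4 * 1024, pte_bytes=8)
print(a)
print(b)
```

The number of levels is also the minimum number of physical memory accesses per translation, since a walk reads one PTE at each level.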
5.11.3

a.  Since there is only 4 GB of physical DRAM, only 512K PTEs are really needed to store the page table.
    Common case: no hash conflict, so one memory reference per address translation.
    Worst case: almost 512K memory references are needed if the hash table degrades into a linked list.
b.  virtual address 64 bits, physical memory 16 GB
    page size 4 KB or 12 bits, page table entry 8 bytes or 3 bits
    #PTE = 64 − 12 = 52 bits or 2^52 entries
    Since there is only 16 GB of physical memory, only 2^(34−12) PTEs are really needed to store the page table.
    Common case: no hash conflict, so one memory reference per address translation.
    Worst case: almost 2^(34−12) memory references are needed if the hash table degrades into a linked list.

5.11.4 TLB initialization, or process context switch.

5.11.5 TLB miss. When most missed TLB entry is cached in processor caches.

5.11.6 Write protection exception.

Solution 5.12

5.12.1
a.  0 hits
b.  2 hits

5.12.2
a.  3 hits
b.  3 hits

5.12.3
a.  3 hits or fewer
b.  3 hits or fewer

...

This note was uploaded on 07/30/2009 for the course ECE 3055 taught by Professor Staff during the Spring '08 term at Georgia Institute of Technology.
