### 3.7

Suppose we have a deeply pipelined processor in which we implement a branch-target buffer for conditional branches only. Assume that the misprediction penalty is always 4 cycles and the buffer miss penalty is always 3 cycles. Assume a 90% hit rate, 90% accuracy, and a 15% branch frequency. How much faster is the processor with the branch-target buffer than a processor that has a fixed 2-cycle branch penalty? Assume a base CPI without branch stalls of 1.

### 4.2

Here is an unusual loop. First, list the dependences and then rewrite the loop so that it is parallel.

```c
for (i = 1; i < 100; i = i + 1) {
    a[i]   = b[i] + c[i];   /* S1 */
    b[i]   = a[i] + d[i];   /* S2 */
    a[i+1] = a[i] + e[i];   /* S3 */
}
```

### 4.3

Assuming the pipeline latencies from Figure 4.1, unroll the following loop as many times as necessary to schedule it without any delays, collapsing the loop overhead instructions. Assume a one-cycle delayed branch. Show the schedule. The loop computes Y[i] = a × X[i] + Y[i], the key step in Gaussian elimination.

```
loop:  L.D     F0,0(R1)
       MUL.D   F0,F0,F2
       L.D     F4,0(R2)
       ADD.D   F0,F0,F4
       S.D     F0,0(R2)
       DADDUI  R1,R1,#-8
       DADDUI  R2,R2,#-8
       BNE     R1,R3,loop
```

### 5.8

If the base CPI with a perfect memory system is 1.5, what is the CPI for these cache organizations? Use Figure 5.14 (attached in the PDF file):

- 16-KB direct-mapped unified cache using write back.
- 16-KB two-way set-associative unified cache using write back.
- 32-KB direct-mapped unified cache using write back.

Assume the memory latency is 40 clocks, the transfer rate is 4 bytes per clock cycle, and 50% of the transfers are dirty. There are 32 bytes per block and 20% of the instructions are data transfer instructions. There is no write buffer.

Add to the assumptions above a TLB that takes 20 clock cycles on a TLB miss. A TLB does not slow down a cache hit. For the TLB, make the simplifying assumption that 0.2% of all references aren't found in the TLB, either when addresses come directly from the CPU or when addresses come from cache misses.

a. Compute the effective CPI for the three caches assuming an ideal TLB.
b. Using the results from part (a), compute the effective CPI for the three caches with a real TLB.
c. What is the impact on performance of a TLB if the caches are virtually or physically addressed?
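Problem 3.7 reduces to comparing two CPIs. Below is a minimal Python sketch under one common reading of the penalties, which is an assumption: a BTB hit that mispredicts pays the full 4 cycles, while a buffer miss pays a flat 3 cycles with no additional misprediction charge.

```python
branch_freq = 0.15   # 15% of instructions are branches
hit_rate    = 0.90   # branch-target buffer hit rate
accuracy    = 0.90   # prediction accuracy on a BTB hit

# Stall cycles per branch: a hit that mispredicts pays 4 cycles,
# a BTB miss pays a flat 3 cycles (assumed reading of the problem).
stalls_per_branch = hit_rate * (1 - accuracy) * 4 + (1 - hit_rate) * 3

cpi_btb   = 1.0 + branch_freq * stalls_per_branch   # base CPI of 1
cpi_fixed = 1.0 + branch_freq * 2                   # fixed 2-cycle penalty

speedup = cpi_fixed / cpi_btb
print(f"CPI with BTB = {cpi_btb:.3f}, fixed-penalty CPI = {cpi_fixed:.3f}")
print(f"speedup = {speedup:.2f}x")
```

Under these assumptions the BTB machine runs at CPI 1.099 versus 1.3, roughly an 18% speedup.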
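For problem 4.2, one way to sanity-check a candidate rewrite is to run both loops on random data and compare the results. The rewrite below is an assumption about the intended answer: every a[i+1] written by S3 except the last is overwritten by S1 in the following iteration before anyone reads it, so only the final write survives and can be peeled out of the loop, leaving fully independent iterations.

```python
import random

def original(a, b, c, d, e):
    a, b = a[:], b[:]            # don't mutate the caller's arrays
    for i in range(1, 100):
        a[i]   = b[i] + c[i]     # S1
        b[i]   = a[i] + d[i]     # S2
        a[i+1] = a[i] + e[i]     # S3
    return a, b

def rewritten(a, b, c, d, e):
    a, b = a[:], b[:]
    for i in range(1, 100):      # every iteration is now independent
        a[i] = b[i] + c[i]
        b[i] = a[i] + d[i]
    a[100] = a[99] + e[99]       # the only S3 write that survives
    return a, b

rng = random.Random(0)
arrs = [[rng.randint(0, 9) for _ in range(101)] for _ in range(5)]
assert original(*arrs) == rewritten(*arrs)
print("rewrite matches the original loop on random inputs")
```

The within-iteration true dependences (S1 to S2, S1 to S3 through a[i]) remain, but no value now flows between iterations, so the rewritten loop can be parallelized.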
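Problem 5.8 depends on the miss rates in Figure 5.14, which are not reproduced here, so the Python sketch below is parametric: MISS_RATE is a hypothetical placeholder to be replaced with the figure's value for each cache organization, and the TLB term reflects one reading of "0.2% of all references" (CPU references plus the references generated by cache misses).

```python
LATENCY  = 40           # clocks of memory latency
TRANSFER = 4            # bytes transferred per clock
BLOCK    = 32           # bytes per block
DIRTY    = 0.50         # fraction of replaced blocks that are dirty
BASE_CPI = 1.5
REFS_PER_INSTR = 1.0 + 0.20   # unified cache: every fetch + 20% data transfers

clean_penalty = LATENCY + BLOCK // TRANSFER   # 40 + 8 = 48 clocks
avg_penalty   = clean_penalty * (1 + DIRTY)   # dirty miss writes back first: 72

def cpi_ideal_tlb(miss_rate):
    return BASE_CPI + REFS_PER_INSTR * miss_rate * avg_penalty

def cpi_real_tlb(miss_rate, tlb_miss_rate=0.002, tlb_penalty=20):
    # Assumed reading: TLB misses occur on CPU references and on the
    # extra references generated by cache misses.
    tlb_refs = REFS_PER_INSTR * (1 + miss_rate)
    return cpi_ideal_tlb(miss_rate) + tlb_refs * tlb_miss_rate * tlb_penalty

MISS_RATE = 0.02   # hypothetical; substitute Figure 5.14's miss rate
print(f"ideal TLB: CPI = {cpi_ideal_tlb(MISS_RATE):.3f}")
print(f"real  TLB: CPI = {cpi_real_tlb(MISS_RATE):.3f}")
```

For part (c), note that a physically addressed cache must translate before (or while) indexing, so TLB misses sit on the cache-hit path as well, whereas a virtually addressed cache only needs translation on a miss.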
