fa09_cs433_hw4_sol - CS433: Computer Systems Organization...

CS433: Computer Systems Organization
Fall 2009
Homework 4
Assigned: Oct/23    Due in class: Nov/3
Total points: 48 for undergraduate students, 58 for graduate students.

Instructions: Please write your name, NetID, and an alias on your homework submissions for posting grades. (If you don't want your grades posted, don't write an alias.) We will use this alias throughout the semester. Homework is due in class on the date posted.

Problem 1: Compiler prefetches [12 points]

Consider the code below.

    register int i, j, k, *indx;       /* indx[64] */
    register float sum, *b, **a;       /* b[64], a[64][64] */

    for (i = 0; i < 64; i++) {             // 1
        k = i + 1;                         // 2
        sum = 0;                           // 3
        for (j = 0; j < 64; j += 2) {      // 4
            sum += a[i][j] * b[j];         // 5
        }
        b[k] = sum;                        // 6
    }

Assume the following:
- Both int and float are 4 bytes.
- We have a fully associative cache with 64 lines.
- Each cache line is 16 bytes.
- Initially, the cache is empty.
- The cache uses an LRU replacement policy.
- The variables (i, j, k, indx, sum, a, b) are all stored in registers.
- The array a is stored in row-major order.
- A cache hit incurs no penalty; a miss incurs a penalty of 40 cycles.
- There is a prefetch instruction, which takes 2 cycles to execute. The prefetched data is in the cache 40 cycles after the prefetch instruction executes.
- Without misses, line 5 takes 10 cycles to execute.
- Without misses, each of lines 2, 3, and 6 takes 2 cycles to execute.
- Without misses, each of lines 1 and 4 takes 4 cycles to execute.
Part A. [3 points] How many cycles does the code fragment take to execute if we do NOT use prefetching?

Instruction cycles: lines 1, 2, 3, and 6 execute 64 times each, and lines 4 and 5 execute 32 × 64 = 2048 times each. The miss-free execution time is therefore 64×4 + 64×2 + 64×2 + 64×2 + 2048×4 + 2048×10 = 29312 cycles.

Cache misses: the cache holds only 64 × 16 = 1024 bytes, far less than the 16 KB occupied by a, so lines of a are evicted as the loop proceeds. This causes no extra misses, however: each element of a is used only once, and b (64 × 4 = 256 bytes, i.e. 16 lines) is reused in every outer iteration, so LRU always evicts old lines of a rather than lines of b. Only compulsory misses remain.
- Each 16-byte line of a holds 4 floats, and the inner loop touches every other element, so every 2 accesses to a cause one miss: 32/2 × 64 = 1024 misses for a.
- b occupies 64/4 = 16 lines, all brought in during the first pass of the inner loop (i = 0) and never evicted afterwards: 16 misses for b.

The miss penalty is 40 × (1024 + 16) = 41600 cycles, so the total execution time is 29312 + 41600 = 70912 cycles.

Part B. [3 points] Consider inserting prefetch instructions for the inner loop, so as to eliminate cache misses when accessing matrix a. Explain why we may need to unroll the loop to insert prefetches.

Since a cache line is 16 bytes long and a float is 4 bytes, a cache block contains 4 floats, so one prefetch instruction brings in four elements of the array. Because the inner loop advances j by 2, each line of a serves two iterations, and a prefetch is needed only in every other iteration of the original loop; unrolling lets every unrolled iteration issue its prefetches without a conditional test or redundant prefetches. To hide the 40-cycle latency of fetching the data, we unroll the loop 4 times: one unrolled iteration then covers two cache lines of a and, counting its two prefetches, takes 4 + 4 × 10 + 2 × 2 = 48 cycles without misses, which is longer than the prefetch latency.

Part C.

This note was uploaded on 04/18/2010 for the course CS 433 taught by Professor Harrison during the Fall '08 term at University of Illinois, Urbana Champaign.