Sample-Soln-SMFaisal_HW1

CSE721 HW1 Date: 01/18/2011 Thanks to S M Faisal
Answer to question# 1 (Problem 2.3)

Given:
Cache size = 32 KB
Processor speed = 1 GHz
Cache latency = 1 cycle = 1 ns
DRAM latency = 100 cycles = 100 ns
Cache line size = 4 words
Matrix dimension = 4K x 4K

Assuming that the vector b is never evicted from the cache, we have an initial 1K cold cache misses for vector b. The cost of bringing the whole b vector into the cache is then 1K x 100 = 100K cycles. Once in cache, the vector b is reused heavily, with both spatial and temporal locality. Over the whole computation, vector b is accessed (4K)^2 = 16M times, which adds 16M cycles to the latency. Hence, the total delay for accessing vector b is (100K + 16M) cycles.

Matrix a does not give any temporal locality, so we can make use of only the spatial locality. Every 4 accesses to matrix a result in one cache miss, giving (100 + 4) = 104 cycles of latency for every 4 accesses to matrix a. Hence, the total latency associated with accessing matrix a is 104 x 1K x 4K = 416M cycles.

Vector c gives 1K cold misses, which add another 1K x 100 = 100K cycles of latency. Each element of c is accessed 4K x 2 = 8K times and there are 4K elements. Hence, the effective latency for accessing c is (1K x 100 + 32K) = 132K cycles.

From the loops, we see that there are 2 x (4K)^2 = 32M FLOP in this computation. The total latency is (416M + 100K + 16M + 132K) = (432M + 232K) cycles. So, the peak achievable performance is
2^30 x 32M / (432M + 232K) FLOPS = 79.49 MFLOPS (Ans)

Also, if we calculate assuming vector b is already in cache, we can exclude the 100K cycles of delay for bringing b in. In that case, neglecting the remaining 132K cycles as well, we get a peak performance of
2^30 x 32M / (432M) FLOPS = 79.54 MFLOPS (Ans)

Comments: There is more than one way to solve this problem, and different methods may give different answers. I did not grade only on the final answer. If you lost points on this question, please check the method you used, and if you have any questions, please stop by my office. The standard answer from Vijay gives more detail on how to solve this problem; please refer to it.
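The cycle accounting in the answer to question 1 can be re-checked with a short script. This is only a sketch under the same assumptions as the answer: 1K is taken as 2^10, 1M as 2^20, and 1 GHz as 2^30 cycles per second. The function and variable names are made up for illustration, and the 32K hit cycles for vector c are copied from the answer rather than derived.

```python
K = 1024      # the answer treats 1K as 2**10
M = K * K     # and 1M as 2**20

N = 4 * K             # matrix dimension (4K x 4K)
LINE_WORDS = 4        # words per cache line
CACHE_CYCLES = 1      # cache hit latency in cycles
DRAM_CYCLES = 100     # DRAM latency in cycles
CLOCK_HZ = 2**30      # 1 GHz taken as 2**30 cycles/s, as in the answer


def cycle_breakdown():
    """Return the cycles attributed to a, b and c, following the answer."""
    lines_per_vector = N // LINE_WORDS            # 1K cache lines per vector
    b_cold = lines_per_vector * DRAM_CYCLES       # 100K cycles of cold misses
    b_hits = N * N * CACHE_CYCLES                 # 16M cache accesses
    # One miss per line of a: (100 + 4) cycles for every 4 accesses.
    a_total = (N * N // LINE_WORDS) * (DRAM_CYCLES + LINE_WORDS * CACHE_CYCLES)
    c_cold = lines_per_vector * DRAM_CYCLES       # 100K cycles of cold misses
    c_hits = 32 * K                               # figure copied from the answer
    return {"a": a_total, "b": b_cold + b_hits, "c": c_cold + c_hits}


def mflops(total_cycles):
    """Peak rate in MFLOPS for 2 * N^2 floating-point operations."""
    flop = 2 * N * N
    return CLOCK_HZ * flop / total_cycles / 1e6


breakdown = cycle_breakdown()
total = sum(breakdown.values())
print("total cycles:", total)
print("MFLOPS, cold b:", round(mflops(total), 2))
print("MFLOPS, denominator 432M:", round(mflops(432 * M), 2))
```

Changing LINE_WORDS or DRAM_CYCLES shows how strongly the result depends on the one-miss-per-line cost of streaming matrix a.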
Answer to question# 2 (Problem 2.10)

A Butterfly network with eight processing nodes is as follows:

[Figure: Butterfly network with levels L0, L1, L2, L3]

Now, in a Butterfly network, each switch at level l is connected to two switches in level l+1, where the switches in level l+1 differ only in the l-th MSB. For instance, 000 in level 0 connects to 000 and 100 in level 1. Again, 000 in level 1 connects to 000 and 010 in level 2, and so on.

An Omega network with eight processing units is as follows:

[Figure: Omega network with levels L0, L1, L2, L3]
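The Butterfly wiring rule described in this answer can be sketched in a few lines of Python. The function name butterfly_neighbors and the convention that level 0 flips the most significant bit are illustrative assumptions, chosen to match the two examples given (000 to 000 and 100, then 000 to 000 and 010).

```python
def butterfly_neighbors(switch, level, bits=3):
    """Return the two switches in level + 1 reached from `switch` in `level`.

    Assumes levels are numbered from 0 and that level l toggles the
    l-th bit counted from the most significant end.
    """
    mask = 1 << (bits - 1 - level)     # bit that may differ between the two targets
    return switch, switch ^ mask       # straight link, cross link


# Print the links leaving every switch of an 8-node network.
for level in range(3):
    for switch in range(8):
        straight, cross = butterfly_neighbors(switch, level)
        print(f"L{level} {switch:03b} -> L{level + 1} {straight:03b}, {cross:03b}")
```

Because the mask depends on the level, the wiring differs from one level to the next.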
In an Omega network, switch i connects to the two switches 2i mod N and (2i+1) mod N in the next level, where N is the number of switches in a single level (8 in the figure). Also, the connection pattern is level independent in the case of the Omega network. The connection pattern actually is a left-shift with insertion of
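The Omega rule stated in this answer, where switch i feeds 2i mod N and (2i+1) mod N, can be sketched as follows. The function name omega_neighbors is hypothetical, and N = 8 is taken from the figure.

```python
def omega_neighbors(switch, n=8):
    """Return the two switches in the next level reached from `switch`.

    Implements the stated rule: 2i mod N and (2i + 1) mod N.
    The same rule is applied at every level.
    """
    return (2 * switch) % n, (2 * switch + 1) % n


# Print the links leaving every switch; no level argument is needed.
for switch in range(8):
    first, second = omega_neighbors(switch)
    print(f"{switch:03b} -> {first:03b}, {second:03b}")
```

For N a power of two, taking 2i mod N drops the top bit of i after doubling, so the rule needs no information about the level.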