CSE721 Homework 3
Due on Tuesday, 1/18/2011
Thanks to Xin Huo, S M Faisal

Problem 1 (5.3)

1. Compute the degree of concurrency.

The degree of concurrency is the maximum number of tasks that can be executed in parallel:

a) $2^{n-1}$   b) $2^{n-1}$   c) $n$   d) $n$

2. Compute the maximum possible speedup if an unlimited number of processing elements is available.

a) Sequential time $T_s = 2^n - 1$; parallel run time $T_p = n$; speedup $S = (2^n - 1)/n$.
b) Sequential time $T_s = 2^n - 1$; parallel run time $T_p = n$; speedup $S = (2^n - 1)/n$.
c) Sequential time $T_s = n^2$; parallel run time $T_p = 2n - 1$; speedup $S = n^2/(2n - 1)$.
d) Sequential time $T_s = n(n+1)/2$; parallel run time $T_p = n$; speedup $S = (n+1)/2$.

3. Compare the values of speedup, efficiency, and overhead function if the number of processing elements is (i) the same as the degree of concurrency and (ii) equal to half of the degree of concurrency.

(i) The number of processors equals the degree of concurrency.

1) Number of processors:
$P_a = 2^{n-1}$,  $P_b = 2^{n-1}$,  $P_c = n$,  $P_d = n$

2) Speedup:
$S_a = (2^n - 1)/n$,  $S_b = (2^n - 1)/n$,  $S_c = n^2/(2n - 1)$,  $S_d = (n+1)/2$

3) Efficiency:
$E_a = S_a/P_a = \frac{2^n - 1}{n \, 2^{n-1}}$
$E_b = S_b/P_b = \frac{2^n - 1}{n \, 2^{n-1}}$
$E_c = S_c/P_c = \frac{n}{2n - 1}$
$E_d = S_d/P_d = \frac{n + 1}{2n}$

4) Overhead ($T_o = p T_p - T_s$):
$T_o(a) = n \, 2^{n-1} - 2^n + 1$
$T_o(b) = n \, 2^{n-1} - 2^n + 1$
$T_o(c) = 2n^2 - n - n^2 = n^2 - n$
$T_o(d) = n^2 - n(n+1)/2 = (n^2 - n)/2$

(ii) The number of processors equals half of the degree of concurrency.

1) Number of processors:
$P_a = 2^{n-2}$,  $P_b = 2^{n-2}$,  $P_c = n/2$,  $P_d = n/2$

2) Speedup:
The parallel run times of graphs (a) and (b) are $T_p(a) = T_p(b) = n + 1$.

For graph (c), the execution order runs from the top-left of the matrix to the bottom-right. In the first $n/2$ and the last $n/2$ steps, the degree of concurrency grows from 1 to $n/2$; in the center of the matrix there are enough ready tasks to keep all processors busy. The run time of the two ends is therefore $2 \cdot \frac{n}{2} = n$, and the number of remaining tasks in the center is
$n^2 - 2 \cdot \frac{(1 + n/2)(n/2)}{2} = \frac{3n^2 - 2n}{4}.$
Thus
$T_p(c) = n + \frac{3n^2 - 2n}{4} \Big/ \frac{n}{2} = n + \frac{3n - 2}{2} = \frac{5n - 2}{2}.$

The computation of the parallel time for graph (d) is very similar. The top $n/2$ rows take time $n/2$ and contain $\frac{(n/2)(n/2 + 1)}{2} = \frac{n^2 + 2n}{8}$ tasks, so the number of remaining tasks is
$\frac{n(n+1)}{2} - \frac{n^2 + 2n}{8} = \frac{3n^2 + 2n}{8}.$
Thus
$T_p(d) = \frac{n}{2} + \frac{3n^2 + 2n}{8} \Big/ \frac{n}{2} = \frac{n}{2} + \frac{3n + 2}{4} = \frac{5n + 2}{4}.$

Consequently, the speedups of the graphs are:
$S_a = S_b = \frac{2^n - 1}{n + 1}$
$S_c = n^2 \Big/ \frac{5n - 2}{2} = \frac{2n^2}{5n - 2}$
$S_d = \frac{n(n+1)}{2} \Big/ \frac{5n + 2}{4} = \frac{2(n^2 + n)}{5n + 2}$

3) Efficiency:
$E_a = S_a/P_a = \frac{2^n - 1}{(n + 1) \, 2^{n-2}}$
$E_b = S_b/P_b = \frac{2^n - 1}{(n + 1) \, 2^{n-2}}$
$E_c = S_c/P_c = \frac{4n}{5n - 2}$
$E_d = S_d/P_d = \frac{4(n + 1)}{5n + 2}$

4) Overhead ($T_o = p T_p - T_s$):
$T_o(a) = 2^{n-2}(n + 1) - 2^n + 1$
$T_o(b) = 2^{n-2}(n + 1) - 2^n + 1$
$T_o(c) = \frac{5n^2 - 2n}{4} - n^2 = \frac{n^2 - 2n}{4}$
$T_o(d) = \frac{5n^2 + 2n}{8} - \frac{n(n+1)}{2} = \frac{n^2 - 2n}{8}$
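The closed forms above are easy to verify numerically. The following is a minimal Python sketch, not part of the original solution; the helper name `metrics` and the sample size $n = 8$ are illustrative choices.

```python
# Numerical sanity check for Problem 1 (hypothetical helper, illustrative n).
def metrics(Ts, Tp, p):
    """Return (speedup, efficiency, overhead) given sequential time Ts,
    parallel time Tp, and p processing elements."""
    S = Ts / Tp
    return S, S / p, p * Tp - Ts

n = 8  # a power of two, so n/2 and 2**(n-2) are integers

# (Ts, Tp, p) per graph: case (i) p = degree of concurrency,
# case (ii) p = half the degree of concurrency.
cases = {
    "(i)": {
        "a": (2**n - 1, n, 2**(n - 1)),
        "b": (2**n - 1, n, 2**(n - 1)),
        "c": (n**2, 2 * n - 1, n),
        "d": (n * (n + 1) // 2, n, n),
    },
    "(ii)": {
        "a": (2**n - 1, n + 1, 2**(n - 2)),
        "b": (2**n - 1, n + 1, 2**(n - 2)),
        "c": (n**2, (5 * n - 2) / 2, n // 2),
        "d": (n * (n + 1) // 2, (5 * n + 2) / 4, n // 2),
    },
}
for case, graphs in cases.items():
    for g, (Ts, Tp, p) in graphs.items():
        S, E, To = metrics(Ts, Tp, p)
        print(f"{case} graph {g}: S = {S:.3f}, E = {E:.3f}, To = {To:.1f}")
```

Running this for a few values of $n$ confirms, for example, that the efficiencies in case (ii) for graphs (c) and (d) approach $4/5$ as $n$ grows, as the closed forms $\frac{4n}{5n-2}$ and $\frac{4(n+1)}{5n+2}$ predict.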
Answer to question number 2 (Problem 5.5)

Given, Scaled Speedup $= \dfrac{pW}{T(pW, p)}$.

For the problem of adding $n$ numbers on $p$ processing elements, we get the following values of Standard Speedup and Scaled Speedup. In the case of Scaled Speedup, the problem size increases linearly with the number of processing elements; in the standard case the problem size stays fixed at $W = 256$, so the $W$ and $T_s$ columns below refer to the scaled case. The data are shown below:

p      W        Ts       Tp(standard)   Standard Speedup   Tp(scaled)   Scaled Speedup
1      256      255      255            1                  255          1
4      1024     1023     85             3                  277          3.6932
16     4096     4095     59             4.3221             299          13.6957
64     16384    16383    69             3.6957             321          51.0374
256    65536    65535    88             2.8978             343          191.0642

Table 1: Standard Speedup and Scaled Speedup

Plotting the Standard Speedup and the Scaled Speedup gives the plot shown in the figure.

[Figure 1: Comparison between standard speedup and scaled speedup]

Answer to question number 3 (Problem 5.9)

The expression for $T_p$ from Problem 5.5 is
$T_p = \left(\frac{n}{p} - 1\right) + 11 \log p,$
which rearranges to
$n = p\,(T_p - 11 \log p + 1).$

Now, using this expression with $T_p = 512$ and $p = 1, 4, 16, 64, 256, 1024$, and $4096$, we get the following problem sizes $W$ that can be solved in 512 time units:

p    1     4      16     64      256      1024     4096
W    513   1964   7504   28608   108800   412672   1560576

Table 2: Size of problem that can be solved within 512 time units

So the size of the problem that we can solve with $p$ processing elements within 512 time units is given in the table. In general, it is NOT possible to solve an arbitrarily large problem in a fixed amount of time, even if an unlimited number of processing elements is available. The reason is clear if we look at the given expression for $T_p$ carefully:
$T_p = \underbrace{\left(\frac{n}{p} - 1\right)}_{\text{(i)}} + \underbrace{11 \log p}_{\text{(ii)}}$
Part (i) can be made equal to zero by choosing the number of processing elements $p$ equal to $n$. But part (ii) keeps increasing logarithmically with the number of processing elements (multiplied by a constant). Thus, we cannot keep increasing the number of processing elements $p$ without also increasing the time $T_p$. This increase comes from the communication that must take place between the processing elements. Hence, at some point, the term $11 \log p$ (for this summation example) by itself exceeds the given fixed time limit, no matter what the problem size $n$ is. Thus, it is not possible to solve an arbitrarily large problem in a fixed amount of time even if an unlimited number of processing elements is available.
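Both tables can be regenerated directly from the run-time model $T_p = (n/p - 1) + 11 \log p$. The following short Python script is an illustrative addition (not part of the original solution); it assumes base-2 logarithms, consistent with the numbers in the tables.

```python
# Reproduce Tables 1 and 2 from Tp = (n/p - 1) + 11*log2(p).
import math

def tp(n, p):
    return (n / p - 1) + 11 * math.log2(p)

# Table 1: adding n numbers; standard keeps W = 256, scaled uses W = 256*p.
for p in (1, 4, 16, 64, 256):
    w_scaled = 256 * p
    standard = (256 - 1) / tp(256, p)
    scaled = (w_scaled - 1) / tp(w_scaled, p)
    print(f"p={p:4d}  standard={standard:7.4f}  scaled={scaled:9.4f}")

# Table 2: largest W solvable in 512 time units, W = p*(512 - 11*log2(p) + 1).
for p in (1, 4, 16, 64, 256, 1024, 4096):
    w = p * (512 - 11 * math.log2(p) + 1)
    print(f"p={p:5d}  W={round(w)}")
```

The printed values match Tables 1 and 2, e.g. $p = 4$ gives a standard speedup of $255/85 = 3$ and $W = 4(512 - 22 + 1) = 1964$.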
Answer to question number 4 (Problem 8.3)

Here, the given run time can be attained by slightly altering the method described in Section 8.1.2. The overall strategy is the same; we only modify the way the vector elements are sent to each corresponding processing element in the column, since $n \gg p$. The change is motivated by the fact that a small increase in the $t_s$ cost, caused by transmitting several smaller chunks, can be cheaper than one broadcast of a single big chunk (i.e., of size $n/\sqrt{p}$). The steps are enumerated below:

1. The chunk of size $n/\sqrt{p}$ is sent to the respective processor in the column just as described in Section 8.1.2. This step costs time
$t_s + \frac{n}{\sqrt{p}} t_w.$

2. In this step, rather than broadcasting the chunk of size $n/\sqrt{p}$ to the processing elements in the column, we divide it further into $\sqrt{p}$ parts, so each new part has size $n/p$. We then do a one-to-all personalized communication along the processing elements in the column; that is, each processing element in the column receives one message of size $n/p$. The time for this step is
$t_s \log \sqrt{p} + \frac{n}{p} t_w (\sqrt{p} - 1).$

3. At this point, each processor along the column holds an $n/p$ portion of the vector, totalling $n/\sqrt{p}$ in size. In this step, an all-to-all broadcast is done among the processing elements in each column with messages of size $n/p$. The time for this step is
$t_s \log \sqrt{p} + \frac{n}{p} t_w (\sqrt{p} - 1).$
Although the communication is along a ring, we can use the hypercube algorithm because the message size remains fixed.

4. Now each processing element in the column has the whole $n/\sqrt{p}$ chunk of the vector, so the computation is carried out, at cost
$\frac{n^2}{p}.$

5. Finally, the all-to-one reduction takes place, which is just the dual of one-to-all broadcast, so the costs of steps 3 and 2 are incurred again:
$t_s \log \sqrt{p} + \frac{n}{p} t_w (\sqrt{p} - 1) + t_s \log \sqrt{p} + \frac{n}{p} t_w (\sqrt{p} - 1).$

Totalling the time costs of the steps above, the total parallel time is
$T_p = t_s + \frac{n}{\sqrt{p}} t_w + 2\left[t_s \log \sqrt{p} + \frac{n}{p} t_w (\sqrt{p} - 1)\right] + \frac{n^2}{p} + 2\left[t_s \log \sqrt{p} + \frac{n}{p} t_w (\sqrt{p} - 1)\right]$
$T_p \approx \frac{n^2}{p} + 4 t_s \log \sqrt{p} + 5 \frac{n}{\sqrt{p}} t_w$
$T_p \approx \frac{n^2}{p} + 2 t_s \log p + 5 \frac{n}{\sqrt{p}} t_w$

We can reduce the constant coefficient of the $\frac{n}{\sqrt{p}} t_w$ term by overlapping some of the communication along rows and columns. For instance, we can subdivide the initial transfer to individual processing elements into smaller parts and overlap it with the communication along the column. The lower bound on the coefficient is 2. In fact, as shown below, the coefficient of the $t_w$ term does not matter, because the asymptotic bound for the isoefficiency function is dominated by the $t_s$ term; we therefore take the coefficient at its lower bound of 2. Hence,
$T_p = \frac{n^2}{p} + 2 t_s \log p + 2 \frac{n}{\sqrt{p}} t_w \qquad (1)$

Scalability Analysis

From equation (1) above and the relation $T_0 = p T_p - W$ (with $W = n^2$), we get the overhead function of this parallel algorithm:
$T_0 = n^2 + 2 t_s p \log p + 2 n \sqrt{p}\, t_w - n^2 = 2 t_s p \log p + 2 n \sqrt{p}\, t_w$

Now, performing isoefficiency analysis using the equation $W = K T_0(W, p)$:

From the $t_s$ term,
$W = 2 K t_s p \log p \implies W = \Theta(p \log p).$

Again, from the $t_w$ term,
$W = n^2 = 2 K n \sqrt{p}\, t_w \implies n = 2 K \sqrt{p}\, t_w \implies n^2 = 4 K^2 p\, t_w^2 \implies W = \Theta(p).$

Hence, the overall asymptotic isoefficiency function is $\Theta(p \log p)$. Comparing this with the isoefficiency function of the algorithm in Section 8.1.2, which is $\Theta(p \log^2 p)$, we see that the isoefficiency function of the current algorithm, $\Theta(p \log p)$, is better. So this solution is more scalable than the solution in Section 8.1.2.
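To make the isoefficiency claim concrete, the efficiency predicted by equation (1) can be tabulated. The sketch below is an illustrative addition, not part of the original solution: the machine parameters $t_s = 50$, $t_w = 1$ and the scaling constant 64 are assumed values chosen only to show the trend.

```python
# Efficiency under the model of equation (1):
#   Tp = n^2/p + 2*ts*log2(p) + 2*(n/sqrt(p))*tw,  W = n^2,  E = W / (p*Tp).
# ts and tw are assumed, illustrative machine parameters.
import math

ts, tw = 50.0, 1.0

def tp(n, p):
    return n * n / p + 2 * ts * math.log2(p) + 2 * (n / math.sqrt(p)) * tw

# Grow W = n^2 in proportion to p*log2(p); if the isoefficiency function
# is Theta(p log p), the efficiency should stay roughly constant.
for p in (4, 16, 64, 256, 1024):
    n = round(math.sqrt(64 * p * math.log2(p)))
    W = n * n
    E = W / (p * tp(n, p))
    print(f"p={p:5d}  n={n:5d}  E={E:.3f}")
```

With these assumed parameters the printed efficiencies stay roughly flat as $p$ grows, illustrating that scaling $W$ as $\Theta(p \log p)$ is sufficient to maintain efficiency, whereas a fixed $W$ would cause the $t_s$ term to drive efficiency toward zero.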