7 Solutions

Solution 7.1

There is no single right answer for this question. The purpose is to get students to think about the parallelism present in their daily lives. The answer should identify at least 10 activities.

7.1.1 Any reasonable answer is correct here.

7.1.2 Any reasonable answer is correct here.

7.1.3 Any reasonable answer is correct here.

7.1.4 The student is asked to quantify the savings due to parallelism. The answer should consider the amount of overlap provided through parallelism, and should be less than or equal to the original time computed if each activity were carried out serially (equal only if no parallelism was possible).

Solution 7.2

7.2.1 While binary search has very good serial performance, it is difficult to parallelize without modifying the code. Part A asks us to compute the speed-up factor, but increasing X beyond 2 or 3 should yield no benefit. While we can perform the comparison of low and high on one core, the computation of mid on a second core, and the comparison against A[mid] on a third core, without some restructuring or speculative execution we will not obtain any speed-up. The answer should include a graph showing that no speed-up is obtained beyond a value of 1, 2, or 3 for Y (the exact value depends somewhat on the assumptions made).

7.2.2 In this question, we suggest increasing the number of cores to match the number of array elements. Again, given the current code, we really cannot obtain any benefit from these extra cores. But if we create threads that compare the N elements to the value X and run them in parallel, then we can get ideal speed-up (Y times speed-up), and the comparison can be completed in the time needed to perform a single comparison. This problem illustrates that some computations can be done in parallel if the serial code is restructured.
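The restructured search described in 7.2.2 can be sketched as follows. This is a hypothetical Python illustration, not the textbook's code: `parallel_find`, its `workers` parameter, and the chunking scheme are all assumptions made for this sketch. Each worker compares one slice of the array against X concurrently; with one worker per element, every comparison happens at the same time.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_find(A, X, workers=4):
    """Compare the N elements of A against X in parallel.

    Each worker scans one slice of A. With one worker per element
    (workers == len(A)), every comparison runs concurrently, so the
    whole search takes roughly the time of a single comparison.
    Returns an index of X in A, or -1 if X is absent.
    """
    n = len(A)
    chunk = (n + workers - 1) // workers  # ceil(n / workers)

    def scan(start):
        # Serial comparisons within this worker's slice.
        for i in range(start, min(start + chunk, n)):
            if A[i] == X:
                return i
        return -1

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for result in pool.map(scan, range(0, n, chunk)):
            if result != -1:
                return result
    return -1
```

Note that CPython threads do not give true CPU parallelism for compute-bound work, so this sketch shows the restructuring (independent comparisons replacing the data-dependent binary-search steps), not a measured speed-up; on real hardware the same structure maps to one comparison per core.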
But more importantly, we may want to provide for SIMD operations in our ISA, and allow for data-level parallelism when performing the same operation on multiple data items.

Solution 7.3

7.3.1 This is a straightforward computation. The first instruction is executed once, and the loop body is executed 998 times.

Version 1: 17,965 cycles
Version 2: 22,955 cycles
Version 3: 20,959 cycles

7.3.2 Array elements D[j] and D[j-1] will have loop-carried dependencies: these are f3 in the current iteration and f1 in the next iteration.

7.3.3 This is a very challenging problem, and there are many possible implementations of the solution. The preferred solution will try to utilize the two nodes by unrolling the loop 4 times (this already gives a substantial speed-up by eliminating many loop-increment, branch, and load instructions). The loop body running on node 1 would look something like this (the code is not the most efficient sequence):

    DADDIU r2, r0, 996
    L.D    f1, 16(r1)
    L.D    f2, 8(r1)
loop:
    ADD.D  f3, f2, f1
    ADD.D  f4, f3, f2
    Send   (2, f3)
    Send   (2, f4)
    S.D    f3, 0(r1)
    S.D    f4, 8(r1)
    Receive(f5)
    ADD.D  f6, f5, f4
    ADD.D  f1, f6, f5
    Send   (2, f6)...
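The loop-carried dependence in 7.3.2, where each iteration reads the value the previous iteration just produced, is what blocks naive parallelization of this loop. A minimal Python sketch, assuming the recurrence D[j] = D[j-1] + D[j-2], which matches the pattern in the unrolled assembly above (f3 = f2 + f1, then f4 = f3 + f2); the function name and initial values are illustrative, not from the exercise:

```python
def compute_serial(D):
    """Fill D[2:] using the recurrence D[j] = D[j-1] + D[j-2].

    Iteration j reads D[j-1], which was written by iteration j-1,
    so the iterations form a dependence chain and cannot simply be
    distributed across cores: this is a loop-carried dependence.
    Unrolling (as in 7.3.3) does not remove the chain, but it lets
    each node compute several new values per Send/Receive exchange.
    """
    for j in range(2, len(D)):
        D[j] = D[j - 1] + D[j - 2]
    return D
```

With starting values D[0] = D[1] = 1 this is the Fibonacci sequence; the point of the sketch is that D[j] cannot be computed until D[j-1] exists, which is why 7.3.3 resorts to unrolling plus explicit message passing rather than independent parallel iterations.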
This note was uploaded on 02/02/2011 for the course CS 2214 taught by Professor Hadimioglu during the Spring '10 term at NYU Poly.