CS 6643 F '11 Lec 22 – Parallel Sorting



Sorting
(Some slides are adapted from M. Quinn's slides for Parallel Prog. in C with MPI and OpenMP, 2004)
Rajendra V. Boppana, UT San Antonio

• Given S = <a1, …, an>, sorting permutes S into S' = <b1, …, bn> such that each bi is some ai and b1 ≤ b2 ≤ … ≤ bn
• Sorting techniques
– Internal: all data fit into the available main memory; external: data are too large and are stored in disk memory
» We look at internal sorting techniques
– Comparison-based: use the compare-and-exchange operation as the basic operation
» W = seq. work = Θ(n log n)
– Noncomparison-based: use certain properties of the elements, such as the range of values, to avoid comparing elements with one another
» W = Θ(n)

Input and Output
• Input: stored in the memory of one process or distributed among the memories of multiple processes
• Output: the sorted sequence is placed in the memory of one process or distributed evenly among all processes
– We assume the latter
– So if i < j, the elements in Pi ≤ the elements in Pj

Performing Comparisons
• One element per process
– Use the compare-and-exchange (actually, exchange-and-compare) operation
– (ts + tw) time; the time to compare is assumed small compared to the time to communicate

Quicksort
• Basic idea: divide and conquer
– Given a sequence of elements A[1..N], divide A into two smaller nonempty sets such that all elements in one set are no bigger than a pivot element and all elements in the second set are greater than the pivot element
– Repeat the divide step on the two subsets until the subsets are of size 1
• Selection of the pivot is the key
• Average number of compare-exchanges = 1.4 n log n, but Ο(n²) worst case
• In the example that follows, the first element of the list is the pivot.
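The divide-and-conquer scheme above can be written down in a few lines. This is a minimal serial sketch (not the course's MPI code), using the first element as the pivot, as in the example:

```python
def quicksort(a):
    """Recursively sort list a; the first element serves as the pivot."""
    if len(a) <= 1:
        return a
    pivot = a[0]
    low  = [x for x in a[1:] if x <= pivot]   # elements no bigger than the pivot
    high = [x for x in a[1:] if x >  pivot]   # elements greater than the pivot
    return quicksort(low) + [pivot] + quicksort(high)

print(quicksort([17, 14, 65, 4, 22, 63, 11]))  # [4, 11, 14, 17, 22, 63, 65]
```

Building the low/high lists by copying is the simple functional form; an in-place partition is what gives the sequential algorithm its small constant factors.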
Performing Comparisons
• Multiple elements per process
– Use the compare-and-split (actually, exchange-merge-and-split) operation
– (ts + m tw) time for m elements

Quicksort
• Sequential quicksort
• Parallel quicksort
• Hyperquicksort
• Parallel sorting by regular sampling

Sequential Quicksort (example)
• Unordered list of values: 17, 14, 65, 4, 22, 63, 11
• Choose pivot value: 17
• Low list (≤ 17): 14, 4, 11; high list (> 17): 65, 22, 63
• Recursively apply quicksort to the low list: 4, 11, 14
• Recursively apply quicksort to the high list: 22, 63, 65
• Sorted list of values: 4, 11, 14, 17, 22, 63, 65

Attributes of Sequential Quicksort
• Average-case time complexity: Θ(n log n)
• Worst-case time complexity: Θ(n²)
– Occurs when the low and high lists are maximally unbalanced at every partitioning step
– Can make the worst case less probable by using sampling to choose the pivot value
– Example: "median of 3" technique

Parallel Sorting Based on Quicksort
• Speed: when possible, design the parallel algorithm based on the fastest sequential algorithm
– Quicksort is generally recognized as the fastest sort in the average case
• Natural concurrency: the recursive sorts of the low and high lists can be done in parallel

Parallelizing Quicksort
• Key steps
– Selecting a pivot: inherently sequential
– Splitting the array into two: can be done in parallel, but may need to combine the partial splits made by each process
– Sorting the two subarrays: easy to parallelize
• How the first two steps are done determines the overall time complexity

Parallel Quicksort (example, p = 4)
• Initial distribution:
P0: 75, 91, 15, 64, 21, 8, 88, 54
P1: 50, 12, 47, 72, 65, 54, 66, 22
P2: 83, 66, 67, 0, 70, 98, 99, 82
P3: 20, 40, 89, 47, 19, 61, 86, 85
• Process P0 chooses and broadcasts a randomly chosen pivot value (75 here)
• The lower "half" (P0, P1) and upper "half" (P2, P3) exchange "lower half" and "upper half" values (P0 ↔ P2, P1 ↔ P3); after the exchange step:
P0: 75, 15, 64, 21, 8, 54, 66, 67, 0, 70
P1: 50, 12, 47, 72, 65, 54, 66, 22, 20, 40, 47, 19, 61
P2: 83, 98, 99, 82, 91, 88
P3: 89, 86, 85
• Processes P0 and P2 choose and broadcast randomly chosen pivots for their halves; exchange values again:
– Lower "half" of lower "half" (P0): 15, 21, 8, 0, 12, 20, 19
– Upper "half" of lower "half" (P1): 50, 47, 72, 65, 54, 66, 22, 40, 47, 61, 75, 64, 54, 66, 67, 70
– Lower "half" of upper "half" (P2): 83, 82, 91, 88, 89, 86, 85
– Upper "half" of upper "half" (P3): 98, 99
• Each process sorts the values it controls:
P0: 0, 8, 12, 15, 19, 20, 21
P1: 22, 40, 47, 47, 50, 54, 54, 61, 64, 65, 66, 66, 67, 70, 72, 75
P2: 82, 83, 85, 86, 88, 89, 91
P3: 98, 99

Analysis of Parallel Quicksort
• Execution time is dictated by when the last process completes
• The algorithm is likely to do a poor job of balancing the number of elements sorted by each process
– Cannot expect the pivot value to be the true median
– Can choose a better pivot value
• Time complexity: Tp = local sort time + split-send-merge time + pivot selection time
= Θ((n/p) log(n/p)) + Θ((n/p) log p) + Θ(log² p)

Impact of Pivot Selection
• In the sequential version: a bad pivot increases the number of recursive steps by 1
• In the parallel version: each time a bad pivot is selected, the time to finish the rest of the steps is increased by a factor of 2
– For example, all elements may be distributed among processes 0, …, p/2−1 after the first step
– These processes then perform quicksort with 2n/p elements per process

Quicksort on Hypercube
• n/p elements per process
• Step i = log p − 1, …, 0
– Use all-reduce for pivot selection and dissemination in each subcube
– Each process compares its elements with the pivot and sends elements greater than (less than or equal to) the pivot to its neighbor in dimension i if that neighbor has a higher (lower) id
– Nodes that have the same bit pattern for bits (log p − 1) … (i+1) form a subcube (exception: all nodes are used for i = log p − 1)

Hyperquicksort
• First, each process sorts its sublist
• A process can then use the median of its sorted list as the pivot value
– This is much more likely to be close to the true median
• To complete the sorting, processes exchange values

Hyperquicksort (example, p = 4)
• The number of processes is a power of 2
• Each process sorts the values it controls:
P0: 75, 91, 15, 64, 21, 8, 88, 54 → 8, 15, 21, 54, 64, 75, 88, 91
P1: 50, 12, 47, 72, 65, 54, 66, 22 → 12, 22, 47, 50, 54, 65, 66, 72
P2: 83, 66, 67, 0, 70, 98, 99, 82 → 0, 66, 67, 70, 82, 83, 98, 99
P3: 20, 40, 89, 47, 19, 61, 86, 85 → 19, 20, 40, 47, 61, 85, 86, 89
• Process P0 broadcasts its median value (54)
• Processes exchange "low" and "high" lists (P0 ↔ P2, P1 ↔ P3) and merge the kept and received values; after the exchange-and-merge step:
P0: 0, 8, 15, 21, 54
P1: 12, 19, 20, 22, 40, 47, 47, 50, 54
P2: 64, 66, 67, 70, 75, 82, 83, 88, 91, 98, 99
P3: 61, 65, 66, 72, 85, 86, 89
• Processes P0 and P2 broadcast their median values; the second exchange is within each half (P0 ↔ P1, P2 ↔ P3); after the second exchange-and-merge step:
P0: 0, 8, 12, 15
P1: 19, 20, 21, 22, 40, 47, 47, 50, 54, 54
P2: 61, 64, 65, 66, 66, 67, 70, 72, 75, 82
P3: 83, 85, 86, 88, 89, 91, 98, 99
• Scalability of hyperquicksort depends on the ratio of communication speed to computation speed; the value of the constant C in the isoefficiency relation that follows determines the scalability
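The exchange-merge-split step used by hyperquicksort can be simulated serially. The sketch below (plain Python, no MPI; `compare_split` is an illustrative helper name, not course code) pairs two processes' sorted lists around a pivot, as in the first exchange of the example:

```python
import bisect

def compare_split(low_keeper, high_keeper, pivot):
    """One exchange-merge-split step between a pair of processes.

    low_keeper / high_keeper are sorted lists; afterward the first process
    holds every value <= pivot and the second every value > pivot, each
    list produced by merging the kept and received sorted runs.
    """
    # Split each sorted list around the pivot (bisect finds the cut point).
    i = bisect.bisect_right(low_keeper, pivot)
    j = bisect.bisect_right(high_keeper, pivot)
    kept_low,  sent_high = low_keeper[:i],  low_keeper[i:]
    sent_low,  kept_high = high_keeper[:j], high_keeper[j:]
    # Merge kept and received runs (stand-in for a linear two-way merge).
    merge = lambda a, b: sorted(a + b)
    return merge(kept_low, sent_low), merge(sent_high, kept_high)

# First hyperquicksort exchange between P0 and P2, pivot 54 (P0's median):
p0, p2 = compare_split([8, 15, 21, 54, 64, 75, 88, 91],
                       [0, 66, 67, 70, 82, 83, 98, 99], 54)
print(p0)  # [0, 8, 15, 21, 54]
print(p2)  # [64, 66, 67, 70, 75, 82, 83, 88, 91, 98, 99]
```

Because both inputs are already sorted, the split costs Ο(log m) and the merge Ο(m), which is where the (ts + m tw) cost per exchanged run comes from.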
Complexity Analysis Assumptions (hyperquicksort)
• Average-case analysis
• Lists stay reasonably balanced
• Communication time is dominated by message transmission time, rather than message latency

Complexity Analysis
• The initial quicksort step has time complexity Θ((n/p) log(n/p))
• Total comparisons needed for the log p merge steps: Θ((n/p) log p)
• Total communication time for the log p exchange steps: Θ((n/p) log p)

Isoefficiency Analysis
• Sequential time complexity: Θ(n log n)
• Parallel overhead: Θ(n log p)
• Isoefficiency relation: n log n = C n log p ⇒ log n = C log p ⇒ n = p^C
– The asymptotic growth rate of n depends on C
• Memory scalability: M(p^C)/p = p^C/p = p^(C−1)
– For C ≤ 1, the parallel system is highly scalable

Another Scalability Concern
• Our analysis assumes the lists remain balanced
• As p increases, each process's share of the list decreases
• Hence as p increases, the likelihood of the lists becoming unbalanced increases
• Unbalanced lists lower efficiency
• It would be better to get sample values from all processes before choosing the median

Parallel Sorting by Regular Sampling (PSRS Algorithm)
• Each process sorts its share of the elements
• Each process selects a regular sample of its sorted list
• One process gathers and sorts the samples, chooses p − 1 pivot values from the sorted sample list, and broadcasts these pivot values
• Each process partitions its list into p pieces, using the pivot values
• Each process sends the partitions to the other processes
• Each process merges its p partitions
• The number of processes does not have to be a power of 2.
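The six PSRS phases can be simulated serially in one function. This is a hedged sketch: the sample and pivot positions below are one simple regular-spacing choice (an assumption; the slides use a slightly different spacing), and the per-process loops stand in for the gather, broadcast, and all-to-all steps:

```python
import bisect

def psrs(sublists):
    """Serial sketch of Parallel Sorting by Regular Sampling for p lists."""
    p = len(sublists)
    # Phase 1: each process sorts its own sublist.
    sorted_subs = [sorted(s) for s in sublists]
    # Phase 2: each process contributes up to p regularly spaced samples.
    samples = []
    for s in sorted_subs:
        step = max(1, len(s) // p)
        samples.extend(s[::step][:p])
    # Phase 3: one process sorts the samples and picks p-1 pivots.
    samples.sort()
    pivots = [samples[(i + 1) * len(samples) // p] for i in range(p - 1)]
    # Phases 4-6: partition every sorted sublist by the pivots; partition j
    # from every process goes to process j, which merges what it receives.
    result = []
    bounds = pivots + [float("inf")]
    for j in range(p):
        lo = pivots[j - 1] if j > 0 else float("-inf")
        part = []
        for s in sorted_subs:
            a = bisect.bisect_right(s, lo)
            b = bisect.bisect_right(s, bounds[j])
            part.extend(s[a:b])          # values in (lo, bounds[j]]
        result.extend(sorted(part))      # stand-in for a p-way merge
    return result

data = [[75, 91, 15, 64, 21, 8, 88, 54],
        [50, 12, 47, 72, 65, 54, 66, 22],
        [83, 66, 67, 0, 70, 98, 99, 82]]
print(psrs(data) == sorted(sum(data, [])))  # True
```

Note that correctness does not depend on the exact sample positions; the sampling only controls how evenly the p final partitions are sized.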
PSRS Algorithm (example, p = 3)
• Initial distribution:
P0: 75, 91, 15, 64, 21, 8, 88, 54
P1: 50, 12, 47, 72, 65, 54, 66, 22
P2: 83, 66, 67, 0, 70, 98, 99, 82
• Each process sorts its list using quicksort:
P0: 8, 15, 21, 54, 64, 75, 88, 91
P1: 12, 22, 47, 50, 54, 65, 66, 72
P2: 0, 66, 67, 70, 82, 83, 98, 99
• Each process chooses p regular samples:
P0: 15, 54, 75; P1: 22, 50, 65; P2: 66, 70, 83
• One process collects the p² regular samples: 15, 54, 75, 22, 50, 65, 66, 70, 83
• One process sorts the p² regular samples: 15, 22, 50, 54, 65, 66, 70, 75, 83
• One process chooses p − 1 pivot values (50 and 70) and broadcasts them
• Each process divides its list into p partitions, based on the pivot values:
P0: 8, 15, 21 | 54, 64 | 75, 88, 91
P1: 12, 22, 47, 50 | 54, 65, 66 | 72
P2: 0 | 66, 67, 70 | 82, 83, 98, 99
• Each process sends its partitions to the correct destination process
• Each process merges its p partitions:
P0: 0, 8, 12, 15, 21, 22, 47, 50
P1: 54, 54, 64, 65, 66, 66, 67, 70
P2: 72, 75, 82, 83, 88, 91, 98, 99

Assumptions
• Each process ends up merging close to n/p elements
– Experimental results show this is a valid assumption
• The processor interconnection network supports p simultaneous message transmissions at full speed
– A 4-ary hypertree is an example of such a network

Time Complexity Analysis (PSRS)
• Computations
– Initial quicksort: Θ((n/p) log(n/p))
– Sorting the regular samples: Θ(p² log p)
– Merging the sorted sublists: Θ((n/p) log p)
– Overall: Θ((n/p)(log n + log p) + p² log p)
• Communications
– Gather samples, broadcast pivots: Θ(log p)
– All-to-all exchange: Θ(n/p)
– Overall: Θ(n/p)

Isoefficiency Analysis (PSRS)
• Sequential time complexity: Θ(n log n)
• Parallel overhead: Θ(n log p)
• The scalability function is the same as for hyperquicksort
– Scalability depends on the ratio of communication to computation speeds

Summary of Quicksort Methods
• Three parallel algorithms based on quicksort
• Keeping list sizes balanced:
– Parallel quicksort: poor
– Hyperquicksort: better
– PSRS algorithm: excellent
• Average number of times each key is moved:
– Parallel quicksort and hyperquicksort: (log p)/2
– PSRS algorithm: (p − 1)/p

Sorting Networks
• Special networks with Omega-network-like topologies, built from columns of two-element comparators
• The sorting algorithm is embedded into the network through the connection pattern
• Elements to be sorted are fed to the network at the left, and the sorted sequence appears at the right of the network
• Time taken is proportional to the depth, i.e., the number of columns of comparators
(Figure: a typical sorting network)

Comparators
• Two types: increasing comparator and decreasing comparator

Bitonic Sorting Network
• Proposed by Batcher in 1968
• A bitonic sequence is a sequence <a1, …, an> such that either
– there exists an index 1 ≤ i ≤ n such that <a1, …, ai> is monotonically increasing and <ai, …, an> is monotonically decreasing, or
– there exists a cyclic shift of indices such that the above condition is satisfied
• Examples:
» <1,3,5,7,8,6,4,2,0> is a bitonic sequence with i = 5
» <7,8,6,4,2,0,1,3,5> is a bitonic sequence, since right-rotating it by three elements yields the sequence in the first example
» Any sorted sequence is trivially bitonic
• Two key operations
– Given two bitonic sequences, create a larger bitonic sequence consisting of the elements of both sequences
– Given a bitonic sequence, obtain a sorted sequence

Bitonic Merge
• Given a bitonic sequence, obtain a sorted sequence by using bitonic splits repeatedly (log n times)
• A network that achieves this is called a bitonic merging network
– Denoted BM(n); built from columns of 2-element comparators
– Has depth log n
• The basic operation is the bitonic split: let S = <a1, …, an> be a bitonic sequence.
Then S1 = <min{a1, an/2+1}, …, min{an/2, an}> and S2 = <max{a1, an/2+1}, …, max{an/2, an}> are bitonic sequences such that each element of S1 ≤ every element of S2.
• Splitting S into S1 and S2 as given above is called a bitonic split
• BM(n) can be implemented by an n×n Omega network with 2-element comparators in place of the 2×2 switching elements

Bitonic Sorting Network (BSN)
• Given an arbitrary sequence of n elements, create a bitonic sequence of n elements, then sort it
• How:
– Divide the given sequence into subsequences of size 2, which are trivially bitonic
– Use BM networks to sort the subsequences into ascending or descending order, as needed
– Concatenate pairs of adjacent subsequences to create larger bitonic sequences
– Repeat until all n numbers form a bitonic sequence
– Use a BM(n) to finish the sorting
• Time taken = depth of the BSN, d(n):
d(n) = d(n/2) + log n ⇒ d(n) = [(log n)(log n + 1)]/2
(Figure: bitonic sorting network for 16 elements; bitonic sequence creating network)

Bitonic Sorting on a Hypercube
• One element per process
• Careful observation of BSN(n) indicates that elements at distances that are powers of 2 are compared and exchanged
• So each compare-and-exchange operation done by BSN(n) is a near-neighbor communication on a hypercube
• Add a prefix of 0 to the label; this will be the bit used for step 5 when i = d − 1
(Figure: dimensions used for bitonic sorting)
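The bitonic split and the full network can be sketched serially as recursion (a hedged sketch, not a network simulation; sequence lengths are assumed to be powers of 2):

```python
def bitonic_merge(s, ascending=True):
    """Sort a bitonic sequence s (power-of-2 length) by repeated bitonic splits."""
    n = len(s)
    if n == 1:
        return s
    half = n // 2
    # Bitonic split: s1 holds the pairwise minima, s2 the pairwise maxima;
    # both halves are bitonic, and every element of s1 <= every element of s2.
    s1 = [min(s[i], s[i + half]) for i in range(half)]
    s2 = [max(s[i], s[i + half]) for i in range(half)]
    first, second = (s1, s2) if ascending else (s2, s1)
    return bitonic_merge(first, ascending) + bitonic_merge(second, ascending)

def bitonic_sort(s, ascending=True):
    """Full bitonic sorting network: build a bitonic sequence, then merge it."""
    if len(s) <= 1:
        return list(s)
    half = len(s) // 2
    # Sort the two halves in opposite directions so their concatenation is bitonic.
    left  = bitonic_sort(s[:half], True)
    right = bitonic_sort(s[half:], False)
    return bitonic_merge(left + right, ascending)

print(bitonic_merge([1, 3, 5, 7, 8, 6, 4, 2]))  # [1, 2, 3, 4, 5, 6, 7, 8]
print(bitonic_sort([75, 91, 15, 64, 21, 8, 88, 54]))  # [8, 15, 21, 54, 64, 75, 88, 91]
```

Each recursion level of `bitonic_merge` corresponds to one comparator column of BM(n), which is where the depth-log n count comes from.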
• Tp = Ο(log² n); parallel cost p·Tp = Ο(n log² n), which is not cost-optimal
(Figure: example of bitonic merging on a hypercube)

Bitonic Sorting on a 2D Mesh
• Apply the hypercube/BSN algorithm
• How are elements mapped to processes? Row-major order, row-major snakelike order, or row-major shuffled order
• The majority of compare-and-exchanges take multiple hops and thus cause link contention
• Tp = Ο(√n); parallel cost p·Tp = Ο(n^(3/2))
• With n = p, each process performs Ο(log² n) comparisons
– Not optimal with respect to the sequential work
– But it is the best that can be done on a mesh (distance argument)
(Figure: BM(16) on a mesh, row-major shuffled order)

Multiple Elements Per Process
• Use n >> p (# of elements to sort >> # of processes) for cost-optimal parallel sorting
• The idea is to minimize the impact of the communication cost and the excess comparisons inherent in the underlying algorithm with enough work in each process
• Each process has n/p elements; initially, each process sorts its elements using a sequential sorting algorithm
• Each process then participates in bitonic sorting using exchange-merge-split as the primitive operation
• On a hypercube: Tp = local sort time + merge & split time + exchange time
= Θ((n/p) log(n/p)) + Θ((n/p) log² p) + Θ((n/p) log² p)
• For efficient operation, p = Ο(2^√(log n))

Sorting Using General-Purpose Processes
• Bubble sort has Θ(n²) sequential time complexity
– It is also difficult to parallelize the key compare-and-exchange step, since each step is influenced by the previous step
• Odd-even transposition sort (for an even # of elements)
– Odd and even phases, with n/2 comparisons in each phase
– In the odd phase, elements in odd positions are compared and exchanged, if necessary, with the next elements
– In the even phase, elements in even positions are compared and exchanged with the next elements
– n steps
– Easy to parallelize
(Figures: odd-even sorting example; parallel odd-even sorting)

Enumeration Sort
• Rank each element: the rank of ai is the number of elements smaller than ai in the original sequence
• The elements are permuted using their ranks
• Using Ο(n²) processes gives Ο(1) sorting time

Shell Sort
• Odd-even sort moves elements only one position at a time
– An almost sorted sequence with a few elements O(n) positions away from their final sorted positions will require O(n) iterations in the sequential case
• Shell sort is useful in such cases
– Initially, processes that are far away from each other compare and split their elements (log p steps)
– Then odd-even sort is performed as long as processes are exchanging data
(Figure: shell sort example)
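The odd-even transposition scheme described above is compact enough to sketch directly (a serial sketch; which 1-based parity maps to which 0-based start index is an implementation choice, and n phases always suffice):

```python
def odd_even_transposition_sort(a):
    """n alternating odd/even phases of neighbor compare-exchanges."""
    a = list(a)
    n = len(a)
    for phase in range(n):
        # Odd phase pairs 1-based positions (1,2), (3,4), ...: 0-based start 0;
        # even phase pairs (2,3), (4,5), ...: 0-based start 1.
        start = 0 if phase % 2 == 0 else 1
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_transposition_sort([75, 91, 15, 64, 21, 8, 88, 54]))
# [8, 15, 21, 54, 64, 75, 88, 91]
```

The inner loop's comparisons within one phase touch disjoint pairs, which is exactly why each phase parallelizes trivially across n/2 processes.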

This note was uploaded on 01/29/2012 for the course CS 6643 taught by Professor Staff during the Fall '08 term at Texas San Antonio.
