Advanced Algorithm Solutions
Homework 2

Eric Nordenstam, Erwan Lemonnier
December 3, 2000

1 Suffix array implementation

1.1 Sorting the suffix array

We tried 3 algorithms to sort the suffix array:

• Quicksort
• 3 implementations of counting sort
• the Manber-Myers algorithm

Below is a short description of our implementation of each of these algorithms, as well as a comparison of their efficiency. All the algorithms below were implemented in C and compiled with the -O3 option, which for some algorithms cut the running time by up to a factor of 2.

1.1.1 Quicksort

A naive implementation of quicksort. Quicksort itself performs O(n log n) comparisons, but since it compares strings of up to n characters, each comparison has to examine n/2 characters on average. Hence, to sort suffix arrays, quicksort runs in O(n^2 log n).

1.1.2 Counting sort

Each character in a string is an integer in the range 0 to 255, which makes it reasonable to try a counting sort. Our first implementation is recursive: it begins by sorting on the first character of each suffix, then on the 2nd character within each group of suffixes sharing the same 1st character, and so on.

We wrote a second, optimized implementation of the above algorithm, using a loop instead of recursion and 4 arrays of the size of the suffix array, allocated up front to avoid dynamic memory allocation during the sort. We gained about 10% in running time, but we were still on average 5 times slower than the target speed.

We wrote a third implementation in which the counting sort is done on 2 characters at a time, but it turned out to be much slower, for 2 reasons. First, the resulting count array is very large (65536 entries), and if the text contains only alphabetical characters (that is, it is not a binary file), only a small range of it is used, so time is wasted scanning across the array. Second, some computation is required to invert the order of the 2 characters (on x86 processors at least).
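The first, recursive variant described above can be sketched roughly as follows. This is a minimal illustration and not our original code: the names msd_sort and counting_sort_suffixes, the extra count slot for "suffix already ended", and the single shared scratch array are all assumptions made for the example.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: sort the n suffix start positions in sa by the
 * character found at offset depth, then recurse into every group of
 * suffixes that share that character. tmp is scratch space for n ints. */
static void msd_sort(const unsigned char *text, int len,
                     int *sa, int n, int depth, int *tmp)
{
    int count[257] = {0}; /* slot 0: suffix ended; slots 1..256: byte + 1 */
    int start[257];
    int next[257];
    int i, c;

    if (n < 2)
        return;

    /* Count how many suffixes fall into each bucket. */
    for (i = 0; i < n; i++) {
        int pos = sa[i] + depth;
        count[pos < len ? text[pos] + 1 : 0]++;
    }

    /* Prefix sums give the first index of each bucket. */
    start[0] = 0;
    for (c = 1; c < 257; c++)
        start[c] = start[c - 1] + count[c - 1];

    /* Distribute the suffixes into tmp, bucket by bucket, then copy back. */
    memcpy(next, start, sizeof next);
    for (i = 0; i < n; i++) {
        int pos = sa[i] + depth;
        tmp[next[pos < len ? text[pos] + 1 : 0]++] = sa[i];
    }
    memcpy(sa, tmp, (size_t)n * sizeof *sa);

    /* Bucket 0 holds at most one suffix (the one that ended here), so
     * only the character buckets with 2 or more entries need more work. */
    for (c = 1; c < 257; c++)
        if (count[c] > 1)
            msd_sort(text, len, sa + start[c], count[c], depth + 1, tmp);
}

/* Hypothetical entry point: fills sa[0..len-1] with the suffix start
 * positions in lexicographic order. Returns 0 on success, -1 if the
 * scratch array cannot be allocated. */
int counting_sort_suffixes(const unsigned char *text, int len, int *sa)
{
    int i;
    int *tmp = malloc((size_t)(len > 0 ? len : 1) * sizeof *tmp);

    if (tmp == NULL)
        return -1;
    for (i = 0; i < len; i++)
        sa[i] = i;
    msd_sort(text, len, sa, len, 0, tmp);
    free(tmp);
    return 0;
}
```

For the text "banana", calling counting_sort_suffixes with len = 6 fills sa with 5, 3, 1, 0, 4, 2. The recursion depth grows with the length of the longest repeated substring, which is one reason a loop-based variant such as our second implementation is attractive.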
This recursive counting sort runs in O(nl), where l is the length of the longest repeated substring in the text. Thus, for well-behaved texts (with few repetitions), it can sort the suffix array reasonably fast, but in the worst case (l = n, i.e. a text made of a single repeated letter), it runs in time O(n^2).

1.1.3 Manber-Myers algorithm

This algorithm was described in the article “Suffix arrays: A new method for on-line string searches” by Udi Manber and Gene Myers. We first give the general idea of the algorithm, before describing it in detail.

Take a string of length n. In the first stage of the algorithm, we bucket the suffixes on their first character. Once this is done, we no longer need to look at the character string: all the information we need is in the placement of the suffixes in buckets. For simplicity, we number the stages of the algorithm 1, 2, 4, 8, etc. After stage number h, the suffix array is sorted by the first h characters of each suffix. This is done in the following manner....
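As an illustration of the stage invariant only, the sketch below doubles h at each stage and re-sorts the suffixes using nothing but the ranks from the previous stage, never the text itself. It is not the bucket-based procedure of the article: it simply calls qsort on pairs of ranks, so it performs O(n log n) comparisons per stage instead of linear work. All names (doubling_suffix_array, cmp_2h, the g_ globals) are assumptions made for the example.

```c
#include <stdlib.h>

/* qsort takes no context argument, so the comparator reads the current
 * stage from file-scope variables (an assumption made for brevity). */
static const int *g_rank; /* rank of each suffix by its first g_h chars */
static int g_h;           /* current stage number h */
static int g_n;           /* text length */

/* Rank of the h characters that follow the first h; -1 if past the end,
 * so that a shorter suffix sorts before a longer one it prefixes. */
static int second_key(int i)
{
    return i + g_h < g_n ? g_rank[i + g_h] : -1;
}

/* Order two suffixes by their first 2h characters using ranks only. */
static int cmp_2h(const void *a, const void *b)
{
    int i = *(const int *)a;
    int j = *(const int *)b;
    int ki, kj;

    if (g_rank[i] != g_rank[j])
        return g_rank[i] < g_rank[j] ? -1 : 1;
    ki = second_key(i);
    kj = second_key(j);
    return (ki > kj) - (ki < kj);
}

/* Fills sa[0..n-1] with the sorted suffix positions.
 * Returns 0 on success, -1 on allocation failure. */
int doubling_suffix_array(const unsigned char *text, int n, int *sa)
{
    int *rank, *next, *swap;
    int i, h;

    if (n <= 0)
        return 0;
    rank = malloc((size_t)n * sizeof *rank);
    next = malloc((size_t)n * sizeof *next);
    if (rank == NULL || next == NULL) {
        free(rank);
        free(next);
        return -1;
    }

    /* Stage 1: the first character alone serves as the bucket number. */
    for (i = 0; i < n; i++) {
        sa[i] = i;
        rank[i] = text[i];
    }

    for (h = 1; ; h *= 2) {
        g_rank = rank;
        g_h = h;
        g_n = n;
        qsort(sa, (size_t)n, sizeof *sa, cmp_2h);

        /* Suffixes that agree on their first 2h characters share a rank. */
        next[sa[0]] = 0;
        for (i = 1; i < n; i++)
            next[sa[i]] = next[sa[i - 1]]
                        + (cmp_2h(&sa[i - 1], &sa[i]) != 0);

        swap = rank;
        rank = next;
        next = swap;

        /* Once every suffix has its own rank, the order is final. */
        if (rank[sa[n - 1]] == n - 1)
            break;
    }

    free(rank);
    free(next);
    return 0;
}
```

For "banana" the first pass orders the suffixes by their first 2 characters and the second by their first 4, at which point all ranks are distinct and sa holds 5, 3, 1, 0, 4, 2.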