Advanced Algorithm
Solutions Homework 2
Eric Nordenstam, Erwan Lemonnier
December 3, 2000
1 Suffix array implementation
1.1 Sorting the suffix array
We tried 3 algorithms to sort the suffix array:
• Quicksort
• 3 implementations of counting sort
• the Manber-Myers algorithm
Below is a short description of our implementation of each of these algorithms,
as well as a comparison of their efficiency. All the algorithms below were
implemented in C and compiled with the -O3 option, which for some algorithms
cut the running time by up to a factor of 2.
1.1.1 Quicksort
A naive implementation of quicksort. Quicksort itself runs in time $O(n \log n)$,
but since it compares strings of up to $n$ characters, each comparison has to
test on average $n/2$ characters. Hence, to sort suffix arrays, quicksort runs in
time $O(n^2 \log n)$.
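As a rough illustration of this naive approach, the sketch below sorts the suffix start offsets with the C library's qsort and a strcmp-based comparator. It is only a minimal sketch under assumptions, not the exact code used for the measurements: the names build_suffix_array_qsort, compare_suffixes and g_text are hypothetical, the text is assumed to be NUL-terminated, and the C standard does not guarantee that qsort is actually a quicksort.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Text shared with the comparator, since qsort() takes no user-data argument
 * (hypothetical name; this makes the sketch non-reentrant). */
static const char *g_text;

/* Compare two suffixes, each given as a start offset into g_text.
 * strcmp may scan many characters, which is the source of the extra factor n. */
static int compare_suffixes(const void *a, const void *b)
{
    size_t i = *(const size_t *)a;
    size_t j = *(const size_t *)b;
    return strcmp(g_text + i, g_text + j);
}

/* Fill sa[0..n-1] with the start offsets of the suffixes of text,
 * in lexicographic order. text must be NUL-terminated, with n == strlen(text). */
void build_suffix_array_qsort(const char *text, size_t n, size_t *sa)
{
    assert(text != NULL && sa != NULL);
    for (size_t i = 0; i < n; i++)
        sa[i] = i;
    g_text = text;
    qsort(sa, n, sizeof sa[0], compare_suffixes);
}
```

For the text "banana" this yields the offsets 5, 3, 1, 0, 4, 2, i.e. the suffixes "a", "ana", "anana", "banana", "na", "nana".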
1.1.2 Counting sort
Each character in a string is an integer in the range 0 to 255, which makes
it reasonable to try a counting sort. Our first implementation runs recursively,
and begins by sorting on the first character of each suffix string, then on the
2nd character of each group of suffix strings having the same 1st character, and
so on.
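The recursive scheme just described can be sketched as follows. This is an illustrative reconstruction under assumptions, not the measured implementation: the names sort_group, key_at and build_suffix_array_counting are hypothetical, a suffix that has run out of characters is given key 0 so that it sorts first, and a single scratch array tmp is used for the stable distribution step.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Key of the suffix starting at `start`, looking `depth` characters in:
 * 0 if the suffix has ended, otherwise the byte value plus 1 (1..256). */
static size_t key_at(const unsigned char *text, size_t n,
                     size_t start, size_t depth)
{
    return (start + depth < n) ? (size_t)text[start + depth] + 1 : 0;
}

/* Counting sort of sa[lo..hi-1] on the character at `depth`, followed by a
 * recursive sort of every group of suffixes sharing that character.
 * The recursion depth can reach the length of the longest repeated substring. */
static void sort_group(const unsigned char *text, size_t n,
                       size_t *sa, size_t *tmp,
                       size_t lo, size_t hi, size_t depth)
{
    if (hi - lo < 2)
        return;

    size_t count[258] = {0};
    for (size_t i = lo; i < hi; i++)
        count[key_at(text, n, sa[i], depth) + 1]++;
    for (int k = 1; k < 258; k++)      /* count[k] = first slot of key k */
        count[k] += count[k - 1];
    for (size_t i = lo; i < hi; i++)   /* stable distribution into tmp */
        tmp[lo + count[key_at(text, n, sa[i], depth)]++] = sa[i];
    memcpy(sa + lo, tmp + lo, (hi - lo) * sizeof sa[0]);

    /* Now count[k] is the end of group k and count[k - 1] its start.
     * Group 0 holds at most one (exhausted) suffix, so it is skipped. */
    for (int k = 1; k <= 256; k++)
        sort_group(text, n, sa, tmp, lo + count[k - 1], lo + count[k],
                   depth + 1);
}

/* Fill sa[0..n-1] with the sorted suffix offsets; returns 0 on success,
 * -1 if the scratch array cannot be allocated. */
int build_suffix_array_counting(const char *text, size_t n, size_t *sa)
{
    assert(text != NULL && sa != NULL);
    size_t *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL && n > 0)
        return -1;
    for (size_t i = 0; i < n; i++)
        sa[i] = i;
    sort_group((const unsigned char *)text, n, sa, tmp, 0, n, 0);
    free(tmp);
    return 0;
}
```

Each call clears and scans a 258-entry counting array, which is cheap for large groups but adds a fixed overhead to every small group.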
We wrote a second, optimized implementation of the above algorithm, using
a loop instead of recursion, and 4 arrays of the size of the suffix array, allocated
at the start to avoid dynamic memory allocation. This saved about 10% of the
running time, but we were still on average 5 times slower than the target speed.
We did a third implementation in which the counting sort is performed on 2
characters at a time, but it turned out to be much slower, for 2 reasons. First,
the resulting counting array is very large (65536 entries), and if the text contains
only alphabetical characters (i.e. it is not a binary file), only a small part of it
is used, so time is wasted traversing the array. Second, some computation is
required to invert the order of the 2 characters (on x86 processors at least).
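The byte-order issue can be illustrated with a small sketch. It assumes, as one plausible reading of the remark above, that the 2-character key was obtained by loading 16 bits directly from the text; the function names key_in_text_order and key_raw_load are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Key whose numeric order matches the lexicographic order of the two
 * characters: the first character goes in the high byte. */
static uint16_t key_in_text_order(const unsigned char *p)
{
    assert(p != NULL);
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Result of a plain 16-bit load. On a little-endian machine such as x86 the
 * first character ends up in the low byte, so the two bytes must be swapped
 * before the value can serve as a sort key. */
static uint16_t key_raw_load(const unsigned char *p)
{
    uint16_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}
```

With the characters "ab", key_in_text_order always gives 0x6162, whereas key_raw_load gives 0x6261 on a little-endian machine and 0x6162 on a big-endian one.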
This recursive counting sort runs in $O(nl)$, where $l$ is the length of the
longest repeated substring in the text. Thus, for well-behaved texts (with few
repetitions), it can sort the suffix array reasonably fast, but in the worst case
($l = n$, i.e. a text made of the same repeated letter), it runs in time $O(n^2)$.
1.1.3 Manber-Myers algorithm
This algorithm was described in the article “Suffix arrays: A new method for
on-line string searches” by Udi Manber and Gene Myers. We first give the general
idea of the algorithm, before describing it in detail. Take a string of length $n$.
In the first stage of the algorithm, we bucket on the first character of every
suffix. After this is done, we no longer need to look at the character string: all
the information we need is in the placement of suffixes in buckets. For simplicity,
we number the stages of the algorithm 1, 2, 4, 8, etc. After stage number $h$,
the suffix array will be sorted according to the first $h$ characters of each suffix.
This is done in the following manner.