1 Suffix Trees and Suffix Arrays
Srinivas Aluru
Iowa State University
1.1 Basic Definitions and Properties
1.2 Linear Time Construction Algorithms
    Suffix Trees vs. Suffix Arrays • Linear Time Construction of Suffix Trees • Linear Time Construction of Suffix Arrays • Space Issues
1.3 Applications
    Pattern Matching • Longest Common Substrings • Text Compression • String Containment • Suffix-Prefix Overlaps
1.4 Lowest Common Ancestors
1.5 Advanced Applications
    Suffix Links from Lowest Common Ancestors • Approximate Pattern Matching • Maximal Palindromes
1.1 Basic Definitions and Properties
Suffix trees and suffix arrays are versatile data structures fundamental to string processing applications. Let s' denote a string over the alphabet Σ. Let $ ∉ Σ be a unique termination character, and let s = s'$ be the string resulting from appending $ to s'. We use the following notation: |s| denotes the size of s, s[i] denotes the i-th character of s, and s[i..j] denotes the substring s[i]s[i+1]...s[j]. Let suff_i = s[i]s[i+1]...s[|s|] be the suffix of s starting at the i-th position.
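The notation can be made concrete with a minimal Python sketch. The helper names `substring` and `suffix` are hypothetical, introduced only for illustration; the sketch assumes the 1-based positions used in the text and maps them onto Python's 0-based slices.

```python
def substring(s: str, i: int, j: int) -> str:
    """Return s[i..j] with 1-based, inclusive positions as in the text."""
    if not (1 <= i <= j <= len(s)):
        raise IndexError("positions must satisfy 1 <= i <= j <= |s|")
    return s[i - 1:j]


def suffix(s: str, i: int) -> str:
    """Return suff_i, the suffix of s starting at 1-based position i."""
    return substring(s, i, len(s))


s_prime = "mississippi"   # the input string s'
s = s_prime + "$"         # s = s'$, with '$' assumed not to occur in s'

print(len(s))              # |s|
print(substring(s, 2, 5))  # s[2..5]
print(suffix(s, 9))        # suff_9
```

For the string mississippi this prints 12, issi and ppi$.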
The suffix tree of s, denoted ST(s) or simply ST, is a compacted trie of all suffixes of the string s. Let |s| = n. It has the following properties:
1. The tree has n leaves, labelled 1...n, one corresponding to each suffix of s.
2. Each internal node has at least 2 children.
3. Each edge in the tree is labelled with a substring of s.
4. The concatenation of edge labels from the root to the leaf labelled i is suff_i.
5. The labels of the edges connecting a node with its children start with different characters.
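The five properties can be checked on a small example. The following is a minimal sketch, not one of the linear time algorithms of Section 1.2: it builds the compacted trie naively by inserting one suffix at a time, which takes quadratic time in the worst case. The names `Node`, `build_suffix_tree`, `leaves` and `internal_nodes` are hypothetical, and edge labels are stored as plain strings for readability.

```python
class Node:
    def __init__(self):
        # Maps the first character of an edge label to (label, child).
        # Keying on the first character enforces property 5.
        self.children = {}
        self.leaf_id = None  # 1-based suffix index, set only for leaves


def build_suffix_tree(s: str) -> Node:
    """Naive construction; assumes s ends with a unique terminator."""
    if not s or s.count(s[-1]) != 1:
        raise ValueError("s must end with a character that occurs only once")
    root = Node()
    for i in range(len(s)):
        node, rest = root, s[i:]
        while True:
            first = rest[0]
            if first not in node.children:
                leaf = Node()
                leaf.leaf_id = i + 1
                node.children[first] = (rest, leaf)  # new leaf edge
                break
            label, child = node.children[first]
            k = 0  # length of the common prefix of label and rest
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):
                # Split the edge at the point where the paths bifurcate.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[first] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root


def leaves(node: Node, path: str = ""):
    """Yield (leaf label, concatenated edge labels from the root)."""
    if not node.children:
        yield node.leaf_id, path
    for label, child in node.children.values():
        yield from leaves(child, path + label)


def internal_nodes(node: Node):
    """Yield every node that has children, including the root."""
    if node.children:
        yield node
        for _, child in node.children.values():
            yield from internal_nodes(child)


s = "mississippi$"
root = build_suffix_tree(s)
found = dict(leaves(root))
print(len(found) == len(s))                                   # property 1
print(all(len(v.children) >= 2 for v in internal_nodes(root)))  # property 2
print(all(found[i] == s[i - 1:] for i in found))              # property 4
```

Property 3 follows from the construction, since every stored label is a slice of s.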
The paths from the root to the leaves labelled i and j coincide up to the longest common prefix of suff_i and suff_j, at which point they bifurcate. If a suffix of the string were a prefix of another, longer suffix, the shorter suffix would end at an internal node instead of at a leaf as desired. It is to avoid this possibility that the unique termination character is added to the end of the string. Keeping this in mind, we use the notation ST(s') to denote the suffix tree of the string obtained by appending $ to s'.
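The effect of the termination character can be seen with a short brute-force check. This is an illustrative sketch only; the function name `suffixes_that_are_prefixes` is hypothetical.

```python
def suffixes_that_are_prefixes(s: str) -> list[int]:
    """Return the 1-based positions i such that suff_i is a prefix of
    some other (necessarily longer) suffix of s."""
    n = len(s)
    result = []
    for i in range(1, n + 1):
        suf = s[i - 1:]
        if any(j != i and s[j - 1:].startswith(suf) for j in range(1, n + 1)):
            result.append(i)
    return result


print(suffixes_that_are_prefixes("mississippi"))   # [11]
print(suffixes_that_are_prefixes("mississippi$"))  # []
```

Without the terminator, the suffix i at position 11 is a prefix of the suffixes at positions 2, 5 and 8, so it could not end at a leaf of its own. With $ appended, no suffix is a prefix of another.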
0849385970/01/$0.00+$1.50 © 2001 by CRC Press, LLC
[Figure: suffix tree of mississippi$ with leaves labelled 1 to 12 and internal nodes labelled r, u, v, w, x, y, z, drawn above the two arrays below.]
SA:  12 11 8 5 2 1 10 9 7 4 6 3
Lcp: 0 1 1 4 0 0 1 0 2 1 3
FIGURE 1.1: Suffix tree, suffix array and Lcp array of the string mississippi. The suffix links in the tree are given by x → z → y → u → r, v → r, and w → r.
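The two arrays in Figure 1.1 can be reproduced with a brute-force sketch. It assumes, since this section has not yet defined them, that SA lists the starting positions of the suffixes in lexicographic order and that each Lcp entry is the length of the longest common prefix of two lexicographically adjacent suffixes. It also relies on $ sorting before the lowercase letters, which holds for Python's default string comparison. The function names are hypothetical, and sorting whole suffixes is far from linear time.

```python
def suffix_array(s: str) -> list[int]:
    """1-based starting positions of the suffixes of s in sorted order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def lcp_array(s: str, sa: list[int]) -> list[int]:
    """Longest common prefix lengths of suffixes adjacent in sa."""
    def lcp(a: str, b: str) -> int:
        k = 0
        while k < len(a) and k < len(b) and a[k] == b[k]:
            k += 1
        return k

    return [lcp(s[sa[k] - 1:], s[sa[k + 1] - 1:]) for k in range(len(sa) - 1)]


s = "mississippi$"
sa = suffix_array(s)
print(sa)                # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(lcp_array(s, sa))  # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
```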
As each internal node has at least 2 children, an n-leaf suffix tree has at most n − 1 internal nodes. Because of property (5), the maximum number of children per node is bounded by |Σ| + 1. Except for the edge labels, the size of the tree is O(n). In order to allow a linear space representation of the tree, each edge label is represented by a pair of integers denoting the starting and ending positions, respectively, of the substring describing the edge label. If the edge label corresponds to a repeated substring, the indices corresponding to any occurrence of the substring may be used. The suffix tree of the string
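The index-pair representation of edge labels can be sketched as follows. The helper `edge_label` is hypothetical, and the positions are 1-based and inclusive as in the notation of this section.

```python
def edge_label(s: str, start: int, end: int) -> str:
    """Recover the edge label s[start..end] from its pair of positions."""
    return s[start - 1:end]


s = "mississippi$"

# The substring "issi" occurs at positions 2..5 and 5..8, so an edge
# labelled "issi" may store either pair; both use two integers
# regardless of the label's length.
print(edge_label(s, 2, 5))  # issi
print(edge_label(s, 5, 8))  # issi
```

Each edge then occupies constant space, which is what keeps the whole tree, labels included, within O(n).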