6.851: Advanced Data Structures
Spring 2010
Lecture 7 – February 26, 2010
Prof. Andre Schulz
Scribe: Mark Chen
1 Overview
In this lecture, we consider the string matching problem: finding all places in a text where some
query string occurs. From the perspective of a one-shot approach, we can solve string matching in
$O(|T|)$ time, where $|T|$ is the size of our text. This purely algorithmic approach has been studied
extensively in the papers by Knuth-Morris-Pratt [6], Boyer-Moore [1], and Rabin-Karp [4].
However, we entertain the possibility that multiple queries will be made to the same text. This
motivates the development of data structures that preprocess the text to allow for more efficient
queries. We will show how to construct, use, and analyze these string data structures.
2 Storing Strings and String Matching
First, we introduce some notation. Throughout these notes, $\Sigma$ will denote a finite alphabet. An
example of a finite alphabet is the standard set of English letters $\Sigma = \{a, b, c, \ldots, z\}$.
A fixed string of characters $T \in \Sigma^*$ will comprise what we call a text. Another string of
characters $P \in \Sigma^*$ will be called a search pattern.
For integers $i$ and $j$, define $T[i : j]$ as the substring of $T$ starting from the $i$th character
and ending with the $j$th character, inclusive. We will often omit $j$ and write $T[i :]$ to denote the
suffix of $T$ starting at the $i$th character. Finally, we let the symbol $\circ$ denote concatenation.
As a simple illustration of our notation,
$(\texttt{abcde}[0 : 2]) \circ (\texttt{cde}[1 :]) = \texttt{abcde}$.
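The notation above can be mirrored in a few lines of Python. This is only a sketch: the helper names `substring` and `suffix` are hypothetical, and the 0-based indexing is an assumption inferred from the worked example (where index 0 selects the first character). The one point worth noticing is that the notes' $T[i : j]$ includes position $j$, whereas a Python slice excludes its end index.

```python
def substring(t: str, i: int, j: int) -> str:
    """Return T[i : j] in the notes' notation: characters i through j, inclusive.

    Assumes 0-based indices, as the worked example in the text suggests.
    """
    # Python slices exclude the end index, so shift it by one.
    return t[i:j + 1]


def suffix(t: str, i: int) -> str:
    """Return T[i :], the suffix of T starting at character i (0-based)."""
    return t[i:]


# The worked example from the text; string concatenation plays the role of the
# concatenation symbol.
result = substring("abcde", 0, 2) + suffix("cde", 1)
print(result)  # abcde
```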
Now we can formally state the string matching problem: Given an input text $T \in \Sigma^*$ and a
pattern $P \in \Sigma^*$, we want to find all occurrences of $P$ in $T$. Closely related variants of
the string matching problem ask for the first, first $k$, or some occurrences, rather than for all
occurrences.
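To make the problem statement concrete, here is a minimal brute-force sketch. It is not one of the linear-time algorithms cited in the overview: it simply tests every alignment and can take time proportional to $|T| \cdot |P|$ in the worst case. The function name `find_all_occurrences` and the choice to report occurrences as 0-based start positions are assumptions made for illustration.

```python
from typing import List


def find_all_occurrences(text: str, pattern: str) -> List[int]:
    """Return every 0-based start position at which pattern occurs in text.

    Brute force: compare the pattern against each alignment of the text.
    """
    n, m = len(text), len(pattern)
    positions = []
    # Only alignments where the whole pattern fits inside the text are tried.
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            positions.append(i)
    return positions


occurrences = find_all_occurrences("banana", "ana")
print(occurrences)      # [1, 3]  (overlapping occurrences are both reported)
print(occurrences[:1])  # the "first occurrence" variant
```

The variants mentioned above (first, or first $k$ occurrences) correspond to truncating this list, although a dedicated implementation would stop scanning early.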
2.1 Tries and Compressed Tries
A commonly used string data structure is called a trie, a tree where each edge stores a letter,
each node stores a string, and the root stores the empty string. The recursive relationship between
the values stored on the edges and the values stored in the nodes is as follows: Given a path of
increasing depth $p = r, v_1, v_2, \ldots, v$ from the root $r$ to a node $v$, the string stored at
node $v_i$ is the concatenation of the string stored in $v_{i-1}$ with the letter stored on the edge
$v_{i-1} v_i$ (taking $v_0 = r$). We will refer to the strings stored in the leaves of the trie as
words, and the strings stored in all other nodes as prefixes.
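The recursive definition above can be sketched directly in Python. The names `TrieNode`, `build_trie`, and `leaf_strings` are hypothetical, and the sketch assumes that no input word is a proper prefix of another, so that every word ends at a leaf as the definition requires. Each node stores its full string only to mirror the definition; a practical implementation would usually keep just the edge letters and recover the string from the root-to-node path.

```python
from typing import Dict, List


class TrieNode:
    def __init__(self, string: str = "") -> None:
        # String stored at this node; the root stores the empty string.
        self.string = string
        # Outgoing edges: edge letter -> child node.
        self.children: Dict[str, "TrieNode"] = {}


def build_trie(words: List[str]) -> TrieNode:
    """Insert each word letter by letter, creating nodes as needed."""
    root = TrieNode()
    for word in words:
        node = root
        for letter in word:
            if letter not in node.children:
                # Child string = parent string concatenated with the edge letter.
                node.children[letter] = TrieNode(node.string + letter)
            node = node.children[letter]
    return root


def leaf_strings(node: TrieNode) -> List[str]:
    """Collect the strings stored in the leaves below node (the words)."""
    if not node.children:
        return [node.string]
    found = []
    for child in node.children.values():
        found.extend(leaf_strings(child))
    return found


root = build_trie(["ana", "and", "bat"])
print(root.children["a"].children["n"].string)  # an  (a prefix, stored at an internal node)
print(sorted(leaf_strings(root)))               # ['ana', 'and', 'bat']
```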
If there is a natural lexicographical ordering on the elements in $\Sigma$, we order the edges of every
node’s fanout alphabetically, from left to right. With respect to this ordering, in-order traversal