Algorithms
Non-Lecture G: String Matching
Why are our days numbered and not, say, lettered?
— Woody Allen
G String Matching
G.1 Brute Force
The basic object that we’re going to talk about for the next two lectures is a string, which is really just an array. The elements of the array come from a set Σ called the alphabet; the elements themselves are called characters. Common examples are ASCII text, where each character is a seven-bit integer [1]; strands of DNA, where the alphabet is the set of nucleotides {A, C, G, T}; and proteins, where the alphabet is the set of 22 amino acids.
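To make the "array over an alphabet" view concrete, here is a minimal Python sketch. The helper name `is_over_alphabet` and the sample strings are illustrative assumptions, not part of the notes.

```python
# A string is an array of characters drawn from an alphabet (the set Sigma).
ascii_text = "CAN"
codes = [ord(c) for c in ascii_text]      # each ASCII character viewed as an integer
print(codes)                               # [67, 65, 78]
print(all(0 <= c < 2**7 for c in codes))   # True: every code fits in seven bits

DNA_ALPHABET = {"A", "C", "G", "T"}        # the nucleotide alphabet

def is_over_alphabet(s, alphabet):
    """Hypothetical helper: check that every character of s lies in the alphabet."""
    return all(c in alphabet for c in s)

print(is_over_alphabet("GATTACA", DNA_ALPHABET))  # True
print(is_over_alphabet("SPAM", DNA_ALPHABET))     # False
```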
The problem we want to solve is the following. Given two strings, a text T[1..n] and a pattern P[1..m], find the first substring of the text that is the same as the pattern. (It would be easy to extend our algorithms to find all matching substrings, but we will resist.) A substring is just a contiguous subarray. For any shift s, let T_s denote the substring T[s .. s+m−1]. So more formally, we want to find the smallest shift s such that T_s = P, or report that there is no match. For example, if the text is the string ‘AMANAPLANACATACANALPANAMA’ [2] and the pattern is ‘CAN’, then the output should be 15. If the pattern is ‘SPAM’, then the answer should be ‘none’. In most cases the pattern is much smaller than the text; to make this concrete, I’ll assume that m < n/2.
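The formal statement can be transcribed almost literally. The sketch below is an executable version of the definition, not an algorithm from these notes; the function name `first_match_by_definition` is a hypothetical choice, and the 1-based shifts of the text are mapped onto Python's 0-based slices.

```python
def first_match_by_definition(text, pattern):
    """Return the smallest 1-based shift s with T_s == P, or None if there is no match."""
    n, m = len(text), len(pattern)
    for s in range(1, n - m + 2):              # candidate shifts s = 1 .. n-m+1
        # T_s = T[s .. s+m-1] in 1-based notation is text[s-1 : s-1+m] in Python.
        if text[s - 1 : s - 1 + m] == pattern:
            return s
    return None

text = "AMANAPLANACATACANALPANAMA"
print(first_match_by_definition(text, "CAN"))   # 15
print(first_match_by_definition(text, "SPAM"))  # None
```

Python's `None` stands in for the answer ‘none’.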
Here’s the ‘obvious’ brute force algorithm, but with one immediate improvement. The inner while loop compares the substring T_s with P. If the two strings are not equal, this loop stops at the first character mismatch.
ALMOSTBRUTEFORCE(T[1..n], P[1..m]):
    for s ← 1 to n − m + 1
        equal ← true
        i ← 1
        while equal and i ≤ m
            if T[s + i − 1] ≠ P[i]
                equal ← false
            else
                i ← i + 1
        if equal
            return s
    return ‘none’
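For readers who want to run it, the pseudocode translates to Python roughly as follows. This is a sketch under stated assumptions: the name `almost_brute_force` is hypothetical, the returned shift stays 1-based to agree with the text, and `None` plays the role of ‘none’.

```python
def almost_brute_force(text, pattern):
    """Return the smallest 1-based shift where pattern occurs in text, or None."""
    n, m = len(text), len(pattern)
    for s in range(1, n - m + 2):              # s = 1 .. n-m+1
        equal = True
        i = 1
        while equal and i <= m:
            # T[s+i-1] and P[i] in 1-based notation, shifted to 0-based indices.
            if text[s + i - 2] != pattern[i - 1]:
                equal = False                  # stop at the first character mismatch
            else:
                i += 1
        if equal:
            return s
    return None

text = "AMANAPLANACATACANALPANAMA"
print(almost_brute_force(text, "CAN"))   # 15
print(almost_brute_force(text, "SPAM"))  # None
```

The `equal` flag is kept only to mirror the pseudocode line by line; idiomatic Python would more likely break out of the inner loop or compare slices directly.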
[1] Yes, seven. Most computer systems use some sort of 8-bit character set, but there’s no universally accepted standard. Java supposedly uses the Unicode character set, which has variable-length characters and therefore doesn’t really fit into our framework. Just think, someday you’ll be able to write ‘¶ = ℵ[∞++]/f;’ in your Java code! Joy!
[2] Dan Hoey (or rather, his computer program) found the following 540-word palindrome in 1984:
A man, a plan, a caret, a ban, a myriad, a sum, a lac, a liar, a hoop, a pint, a catalpa, a gas, an oil, a bird, a yell, a vat, a caw, a pax, a wag, a tax, a nay, a ram, a cap,
a yam, a gay, a tsar, a wall, a car, a luger, a ward, a bin, a woman, a vassal, a wolf, a tuna, a nit, a pall, a fret, a watt, a bay, a daub, a tan, a cab, a datum, a gall, a
hat, a fag, a zap, a say, a jaw, a lay, a wet, a gallop, a tug, a trot, a trap, a tram, a torr, a caper, a top, a tonk, a toll, a ball, a fair, a sax, a minim, a tenor, a bass, a
passer, a capital, a rut, an amen, a ted, a cabal, a tang, a sun, an ass, a maw, a sag, a jam, a dam, a sub, a salt, an axon, a sail, an ad, a wadi, a radian, a room, a
rood, a rip, a tad, a pariah, a revel, a reel, a reed, a pool, a plug, a pin, a peek, a parabola, a dog, a pat, a cud, a nu, a fan, a pal, a rum, a nod, an eta, a lag, an eel, a