Lecture Notes
CMSC 251
Lecture 25: Longest Common Subsequence
(April 28, 1998)
Read: Section 16.3 in CLR.
Strings:
One important area of algorithm design is the study of algorithms for character strings. There are
a number of important problems here. Among the most important is efficiently searching for a
substring, or more generally a pattern, in a large piece of text. (This is what text editors and
programs like "grep" do when you perform a search.) In many instances you do not want to find a piece
of text exactly, but rather something that is "similar". This arises, for example, in genetics
research. Genetic codes are stored as long DNA molecules. A DNA strand can be broken down into a long
sequence of bases, each of which is one of four basic types: C, G, T, A.
But exact matches rarely occur in biology because of small changes in DNA replication, and exact
substring search will only find exact matches. For this reason, it is of interest to compute
similarities between strings that do not match exactly. A measure of string similarity should be
insensitive to random insertions and deletions of characters from some originating string. There are
a number of measures of similarity between strings. The first is the edit distance, that is, the
minimum number of single-character insertions, deletions, or transpositions necessary to convert one
string into another. The other, which we will study today, is the length of the longest common
subsequence.
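To see why exact search is too strict here, consider the small sketch below. The strand and the function name find_exact are made up for illustration; the only library call is Python's built-in str.find, which returns -1 when the pattern does not occur.

```python
def find_exact(text: str, pattern: str) -> int:
    """Return the 0-based position of the first exact occurrence of
    pattern in text, or -1 if there is none."""
    return text.find(pattern)


# Hypothetical strand, chosen only for illustration.
strand = "ACCGGTATTCGA"

exact_hit = find_exact(strand, "GGTAT")    # an exact copy is present
near_miss = find_exact(strand, "GGTTAT")   # same pattern with one inserted T

print(exact_hit, near_miss)
```

A single inserted character is enough to make the exact search fail, even though the two patterns are nearly identical. This is the behavior a similarity measure is meant to avoid.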
Longest Common Subsequence:
Let us think of character strings as sequences of characters. Given two sequences
$X = \langle x_1, x_2, \ldots, x_m \rangle$ and $Z = \langle z_1, z_2, \ldots, z_k \rangle$, we say
that $Z$ is a subsequence of $X$ if there is a strictly increasing sequence of $k$ indices
$\langle i_1, i_2, \ldots, i_k \rangle$ ($1 \le i_1 < i_2 < \cdots < i_k \le m$) such that
$Z = \langle x_{i_1}, x_{i_2}, \ldots, x_{i_k} \rangle$. For example, let
$X = \langle ABRACADABRA \rangle$ and let $Z = \langle AADAA \rangle$; then $Z$ is a subsequence of
$X$.
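The definition can be checked directly by searching for a witness index sequence. The following is a minimal sketch, not part of the original notes: the function name subsequence_indices is hypothetical, and it uses a greedy left-to-right scan (always taking the earliest remaining match), which finds a valid index sequence whenever one exists. Indices are reported 1-based to match the notation above.

```python
from typing import List, Optional


def subsequence_indices(z: str, x: str) -> Optional[List[int]]:
    """Return 1-based indices i_1 < i_2 < ... < i_k with x[i_j] == z[j],
    or None if z is not a subsequence of x."""
    indices: List[int] = []
    pos = 0  # next 0-based position in x that may still be used
    for ch in z:
        # Greedily take the earliest remaining occurrence of ch in x.
        pos = x.find(ch, pos)
        if pos == -1:
            return None  # ch cannot be matched, so no index sequence exists
        indices.append(pos + 1)  # convert to 1-based, as in the notes
        pos += 1  # later matches must use strictly larger indices
    return indices


print(subsequence_indices("AADAA", "ABRACADABRA"))  # [1, 4, 7, 8, 11]
print(subsequence_indices("CAB", "ABC"))            # None
```

For the example above, the scan matches the characters of Z at positions 1, 4, 7, 8, and 11 of X. The second call shows that order matters: every character of CAB appears in ABC, but not in increasing index order.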