PRUNER: Algorithms for Finding Monad Patterns in DNA Sequences
Ravi VijayaSatya and Amar Mukherjee
(rvijaya, amar)@cs.cuf.edu
School of Computer Science, University of Central Florida, Orlando, FL 32816-2362
Abstract
In this paper, we present new algorithms for discovering monad patterns in DNA sequences. Monad
patterns are of the form $(l, d)$-$k$, where $l$ is the length of the pattern, $d$ is the maximum
number of mismatches allowed, and $k$ is the minimum number of times the pattern is repeated in
the given sample.
The time complexity of some of the best known algorithms to date is $O(n t^2 l^d |\Sigma|^d)$,
where $t$ is the number of input sequences, and $n$ is the length of each input sequence. The
first algorithm that we present in this paper takes $O(n^2 t^2 l^{d/2} |\Sigma|^{d/2})$ time and
$O(n t l^{d/2} |\Sigma|^{d/2})$ space, and the second algorithm takes
$O(n^3 t^3 l^{d/2} |\Sigma|^{d/2})$ time using $O(l^{d/2} |\Sigma|^{d/2})$ space. In practice, our
algorithms have much better performance provided the $d/l$ ratio is small.
The second algorithm performs very well even for large values of $l$ and $d$ as long as the
$d/l$ ratio is small.
Keywords: Pattern discovery, regulatory patterns, k-mismatch patterns
1. Introduction
Discovering regulatory patterns in DNA sequences is a well-known problem in
computational biology. Due to mutations and other errors, the actual occurrences of these
regulatory patterns allow for a certain degree of error. Therefore, the actual regulatory
pattern (or the consensus pattern) may never appear in a gene upstream region, but
$d$-mismatch occurrences of this pattern might appear. The general approach to this problem
is to take a set of $t$ DNA sequences, each of length $n$, at least $k$ of which are
guaranteed to contain the desired binding site, and look for patterns of a certain length
$l$ that occur in at least $k$ out of the $t$ sequences with at most $d$ mismatches at each
occurrence. The values of $l$, $d$ and $k$ can be determined either from prior knowledge
about the binding site, or by trial and error, trying different values of $l$ and $d$.
These single contiguous blocks of patterns are called monad patterns. In general, many
regulatory signals are made up of a group of monad patterns occurring within a certain
distance from each other [Eskin et al., 2003, Eskin et al., 2002, GuhaThakurtha et al.,
2001, van Helden et al., 2000]. In such a case, the patterns are called dyad, triad,
multiad, or, in general, composite patterns. Finding the composite patterns by finding the
component monad patterns individually is significantly more difficult, since the
component monad patterns might be too subtle to detect. Eskin & Pevzner [Eskin et al.,
2002] present a simple transformation to convert a multiad problem into a slightly larger
monad problem. In this paper, we present an algorithm to solve the monad problem. The
same transformation as in [Eskin et al., 2002] can be applied to transform a multiad
problem into a monad problem that is handled by our algorithm.
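
To make the monad problem statement above concrete, the following is a minimal Python sketch of a naive exhaustive search. It is not the PRUNER algorithm presented in this paper; it only illustrates the definition by testing every candidate pattern of length l over the DNA alphabet and counting the sequences that contain a window within d mismatches of it. The function names (hamming, occurs, find_monads) and the default alphabet "ACGT" are assumptions introduced for illustration.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which the equal-length strings a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs(pattern, seq, d):
    """True if some window of seq lies within d mismatches of pattern."""
    l = len(pattern)
    return any(hamming(pattern, seq[i:i + l]) <= d
               for i in range(len(seq) - l + 1))


def find_monads(seqs, l, d, k, alphabet="ACGT"):
    """Naive baseline (hypothetical helper, not the paper's method):
    list every length-l pattern that occurs, with at most d mismatches,
    in at least k of the input sequences."""
    hits = []
    for cand in product(alphabet, repeat=l):
        pattern = "".join(cand)
        # Count the sequences that contain a d-mismatch occurrence.
        support = sum(occurs(pattern, s, d) for s in seqs)
        if support >= k:
            hits.append(pattern)
    return hits
```

Because this sketch enumerates every string of length l over the alphabet, its running time grows exponentially with l, so it is only usable for very short patterns. That cost is the motivation for algorithms that restrict the set of candidate patterns rather than enumerating all of them.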