Unformatted text preview: © 1999 Oxford University Press Nucleic Acids Research, 1999, Vol. 27, No. 2 573–580 Tandem repeats finder: a program to analyze DNA
sequences
Gary Benson*
Department of Biomathematical Sciences, Mount Sinai School of Medicine, New York, NY 100296574, USA
Received September 10, 1998; Revised and Accepted November 12, 1998 ABSTRACT
A tandem repeat in DNA is two or more contiguous,
approximate copies of a pattern of nucleotides. Tandem
repeats have been shown to cause human disease,
may play a variety of regulatory and evolutionary roles
and are important laboratory and analytic tools.
Extensive knowledge about pattern size, copy number,
mutational history, etc. for tandem repeats has been
limited by the inability to easily detect them in genomic
sequence data. In this paper, we present a new
algorithm for finding tandem repeats which works
without the need to specify either the pattern or pattern
size. We model tandem repeats by percent identity and
frequency of indels between adjacent pattern copies
and use statistically based recognition criteria. We
demonstrate the algorithm’s speed and its ability to
detect tandem repeats that have undergone extensive mutational change by analyzing four sequences: the human frataxin gene, the human β T cell
receptor locus sequence and two yeast chromosomes. These sequences range in size from 3 kb up to
700 kb. A World Wide Web server interface at
c3.biomath.mssm.edu/trf.html has been established
for automated use of the program.
INTRODUCTION
DNA molecules are subject to a variety of mutational events. One
of the less well understood is tandem duplication in which a
stretch of DNA, which we call the pattern, is converted into two
or more copies, each following the preceding one in a contiguous
fashion. For example we could have
. . . TCGGA . . . → . . . TCGGCGGCGGA . . .
in which the single occurrence of triplet CGG has been
transformed into three identical, adjacent copies. The result of a
tandem duplication event is termed a tandem repeat. Over time,
individual copies within a tandem repeat may undergo additional,
uncoordinated mutations so that typically, only approximate
tandem copies are present.
Tandem repeats are presumed to occur frequently in genomic
sequences, comprising perhaps 10% or more of the human
genome. But, accurate characterization of the properties of
tandem repeats has been limited by the inability to easily detect them. In recent years, the discovery of the trinucleotide repeat
diseases has piqued interest in tandem repeats. These diseases,
including fragileX mental retardation (1), Huntington’s disease
(2), myotonic dystrophy (3), spinal and bulbar muscular atrophy
(4) and Friedreich’s ataxia (5), are the result of a dramatic increase
in the number of copies of a trinucleotide pattern. In afflicted
individuals, the copy number has been amplified from the normal
range of tens of copies to hundreds or thousands, resulting in the
disease. It has been suggested that the repeats themselves produce
unusual physical structures in the DNA, causing polymerase
slippage and the resulting amplification (6,7).
A more salubrious potential role for tandem repeats is gene
regulation, in which the repeats may interact with transcription
factors, alter the structure of the chromatin or act as protein
binding sites (8–12). Tandem repeats have an apparent function
in development of immune system cells. Breakpoints for
immunoglobulin heavy chain switch recombination occur within
tandem repeats preceding the heavy chain constant region genes
(13). Because the number of copies in any specific tandem repeat
is often polymorphic in the population, tandem repeats have
proven useful in linkage analysis and DNA fingerprinting
(14,15). Recent studies of allele diversity at tandem repeat loci
have provided support for the ‘Out of Africa’ hypothesis of
modern human evolution (16,17).
To date, much of the research on tandem repeats has focused on
those with short patterns (2–5 nt), presumably because such
repeats are relatively easy to spot by eye in printed sequences.
Repeats with long patterns (sometimes called variable number of
tandem repeats or VNTRs) are notoriously harder to detect [even
when the copies are identical, for example see Benson (18) for the
101 bp repeats undetected in Hellman et al. (19), a paper on the
role of tandem repeats as hot spots for recombination]. Given the
importance of known and potential biological roles for tandem
repeats and their usefulness in other biological studies, it seemed
essential to us to develop an efficient and sensitive algorithm for
detecting these repeats so that they may receive further study.
A number of algorithms already exist which either directly or
indirectly detect tandem repeats. All suffer from significant
limitations. One group of algorithms is based on computing
alignment matrices (20–22). Their primary limitation is excessive
running time. The best algorithm in this group (22) has time
complexity O[n2 polylog(n)] for a sequence of length n and would
not be useful for sequences much longer than several thousand *Tel: +1 212 241 5777; Fax: +1 212 860 4630; Email: [email protected] 574 Nucleic Acids Research, 1999, Vol. 27, No. 2 bases. (In this paper we report on our analysis of sequences up to
700 kb in length.)
Another group of algorithms finds tandem repeats indirectly
using methods from the field of data compression. An algorithm
by Milosavljevic and Jurka (23) detects ‘simple sequences’,
i.e. mixtures of fragments that occur elsewhere. Simple sequences
may or may not contain tandem repeats and this algorithm makes
no attempt to deduce a repeated pattern. An algorithm by Rivals
et al. (24) bases the compression on the presence of small
preselected patterns (all those of size 1–3) and is not readily
generalized to longer patterns for which there is an algorithmic
need. To their credit, both of these methods provide a measure of
statistical significance based on the amount of compression.
Another collection of algorithms aim more directly at finding
tandem repeats. Of these, one exact algorithm (25) is limited by
its definition of approximate patterns, requiring that two copies
differ either by k or fewer substitutions (Hamming distance) or by
k or fewer substitutions and indels (unit cost edit distance).
Besides treating substitutions and indels as equals, the requirement
for a fixed number of differences rather than a percentage
difference is unsatisfactory. Any fixed number of differences
suitable for small patterns (say five differences for patterns of size
20) would be unreasonably restrictive for larger patterns (five
differences for patterns of size 100). Conversely, any fixed
number for large patterns would allow too much variability in
small patterns. A heuristic algorithm by Karlin et al. (26) is
similarly hampered by the use of matching blocks separated by
error blocks of fixed size. The remaining two algorithms in this
group require input from the user which limits their usefulness.
An earlier heuristic algorithm by Benson (27) finds tandem repeats
only if they have a pattern size which is specified in advance. An
exact algorithm by Myers and Sagot (28) (limited to patterns with
size of at most 40 bases) requires that the approximate pattern size
and a range for the number of copies be specified.
The algorithm (29) presented in this paper is designed to
overcome many of the aforementioned limitations: (i) it uses the
method of ktuple matching to avoid the need for full scale
alignment matrix computations; (ii) it requires no a priori
knowledge of the pattern, pattern size or number of copies;
(iii) there are no restrictions on the size of the repeats that can be
detected; (iv) it uses percentage differences between adjacent
copies and treats substitutions and indels separately; (v) it
determines a consensus pattern for the smallest repetitive unit in
the tandem repeat. The program has already been used as a
preprocessor in a new alignment algorithm where tandem
duplication augments the standard mutation set of insertion,
deletion and substitution (18).
A number of ideas incorporated into this new algorithm have
been utilized in earlier homology detection programs (30,31), yet
the goals and methods differ. Instead of looking for highest
scoring homologous regions, the algorithm looks for tandem
repeats which are often hidden in larger homologous regions or
which may fall well below the level of significance required for
other programs to report a match. The detection criteria are based
on a stochastic model of tandem repeats specified by percent
identity and frequency of insertions and deletions, rather than
some minimal alignment score. Finally, the program aligns repeat
copies against a consensus sequence, revealing patterns of
common mutations. These patterns yield insight into the history
of duplications that produced the tandem repeat, thus providing
a potentially valuable tool for phylogenetic research. Figure 1. Two adjacent copies from a tandem repeat in the human β T cell
receptor locus sequence (37). H indicates a match, T indicates a mismatch,
insertion or deletion. The remainder of this paper is organized as follows. In Methods
we present a probabilistic model of tandem repeats, an algorithm
overview and the set of criteria that guide the recognition process.
In the Discussion we present our analysis of the frataxin
(Friedreich’s ataxia) gene sequence, the human β T cell receptor
locus and two yeast chromosomes. Finally, in the Conclusion we
describe directions for future research.
METHODS
Probabilistic model of tandem repeats
We model alignment of two tandem copies of a pattern of length
n by a sequence of nindependent Bernoulli trials (coin tosses).
The probability of success, P (heads), which we also call pM or
matching probability, represents the average percent identity
between the copies. Each head in the Bernoulli sequence is
interpreted as a match between aligned nucleotides. Each tail is
a mismatch, insertion or deletion. A second probability, pI or indel
probability, specifies the average percentage of insertions and
deletions between the copies. Figure 1 illustrates the underlying
idea for the model.
While Figure 1 is an interpretation of a particular alignment as
a Bernoulli sequence, we are more generally interested in the
distribution of Bernoulli sequences and the properties of alignments
that they represent when dealing with a specific pair (pM, pI), for
example (pM = 0.80, pI = 0.10). Note that these conservation
parameters serve as a type of extremal bound, i.e. as a quantitative
description of the most divergent copies we hope to detect.
Program outline
Our program has detection and analysis components. The
detection component uses a set of statistically based criteria to
find candidate tandem repeats. The analysis component attempts
to produce an alignment for each candidate and if successful
gathers a number of statistics about the alignment (percent
identity, percent indels) and the nucleotide sequence (composition,
entropy measure).
Detection component. We assume that adjacent copies of any
pattern will contain some matching characters in corresponding
positions. Just how many matches and how the distance between
those matches should vary depend on the fixed values of pM and
pI. In the next section, we develop the statistical criteria to answer
these questions. Here, we describe how the matches are detected.
The algorithm looks for matching nucleotides separated by a
common distance d, which is not specified in advance. For
reasons of efficiency it looks for runs of k matches, which we call
ktuple matches. A ktuple is a window of k consecutive
characters from the nucleotide sequence. Matching ktuples are
two windows with identical contents and if aligned in the
Bernoulli model would produce a run of k heads. Because we
limit ourselves to ktuple matches, we will not detect all matching
characters. For example, if k = 6 and two windows contain 575
Nucleic Acids Research, 1999, Vol. 27, No.
Nucleic Acids Research, 1994, Vol. 22, No. 12 575 a sliding window of length d which keeps track of the positions
of matches and their total.
List Dd is updated every time a match at distance d is detected.
Position i of the match is stored on the list and the total is
increased. The right end of the window is set to i and matches that
occurred before j = i – d are dropped from the list and subtracted
from the total. Lists for other nearby distances are also updated at
this time (Random Walk Distribution in the next section), but only
to reset their right ends to i and remove matches that have been
passed by the advancing windows. Information in the updated
distance lists is used for the sum of heads and apparent size criteria
tests as described in the next section. If both tests are passed, the
program moves on to the analysis component.
Statistical criteria. The statistical criteria are based on runs of
heads in Bernoulli sequences, corresponding to matches detected
with the ktuples and stored in the distance lists. The criteria are
based on four distributions which depend upon: (i) the pattern
length, d; (ii) the matching probability, pM; (iii) the indel
probability, pI; (iv) the tuple size, k. For each distribution, we
either calculate it with a formula or estimate it using simulation.
Then, we select a cutoff value that serves as our criterion. Below
we describe the distributions and criteria in more detail. Figure 2. Tandem repeats are detected by scanning the sequence with a small
window, determining the distance between exact matches and testing the
statistical criteria. TCATGT and TCTTGT we will not know that there are 5
matching characters because the window contents are not
identical. Put in terms of the Bernoulli model, the aligned
windows would be represented by the sequence HHTHHH,
which is not a run of 6 heads.
The basic operation of the detection component is illustrated in
Figure 2. Let S be a nucleotide sequence. We select a small integer
k for the tuple or window size (k = 5 for example) and keep a list
of all possible k length strings (there are 4k for the DNA alphabet
A,C,G,T) which we call the probes. By sliding the window across
the sequence, we determine the probe at each position i in S. For
each probe p, we maintain a history list Hp of the positions at
which p occurs.
When a position i is added to Hp, we scan Hp for all earlier
occurrences of p. Let one earlier occurrence be at j. Since i and
j are the indices of matching ktuples, the distance d = i – j is a
possible pattern size for a tandem repeat. For the criteria tests, we
need information about other ktuple matches at the same distance
d where the leading tuple occurs in the sequence between j and i.
A distance list Dd stores this information. It can be thought of as Sum of heads distribution. This distribution indicates how many
matches are required. Let the random variable Rd,k,pM = the total
number of heads in head runs of length k or longer in an iid
Bernoulli sequence of length d with success probability pM. The
distribution of Rd,k,pM is well approximated by the normal
distribution and we have previously shown that its exact mean
and variance can be calculated in constant time (32). For the sum
of heads criterion, we use the normal distribution to determine the
largest number, x, such that 95% of the time Rd,k,pM ≥ x. For
example, if pM = 0.75, k = 5 and d = 100, then the criterion is 26.
Put another way, if a pattern has length 100 and aligned copies are
expected to match in 75 positions, then by counting only matches
that fill a window of length 5, we expect to count at least 26
matches 95% of the time.
Random walk distribution. This distribution describes how
distances between matches may vary due to indels. Because
indels change the distance between matching ktuples (Fig. 3),
there will be situations where the pattern has size d, yet the
distance between matching ktuples is d ± 1, d ± 2, etc. In order
to test the sum of heads criterion, we count the matches in Dd ± ∆d,
for ∆d = 0, 1, ..., ∆dmax for some ∆dmax. In our model, indels are
single nucleotide events occurring with probability pI. Insertions
and deletions are considered equally likely and we treat the
distance change as a problem of random walks. Let the random
variable Wd,pI = the maximum displacement from the origin of a
onedimensional random walk with expected number of steps
equal to pI·d. It can be shown (33) that 95% of the time, Wd,pI
ranges between ±2.3 p I @ d . We set ∆dmax = b2.3 p I @ d c. For
cexample if pI = 0.1 and d = 100, then ∆dmax = 7.
Apparent size distribution. This distribution is used to distinguish
between tandem repeats and nontandem direct repeats (Fig. 4).
For tandem repeats, the leading tuples in matching ktuples will
be distributed throughout the interval from j to i, whereas for
nontandem repeats, they should be concentrated on the right side
of the interval near i. Let the random variable Sd,k,pM = the
distance between the first and last run of k heads in an iid 576 Nucleic Acids Research, 1999, Vol. 27, No. 2 Figure 3. Insertions and deletions change the distance between exact matches. The inserted character X causes one pair of matching ktuples to be separated by distance
d + 1 while another pair is separated only by distance d. Figure 4. We must distinguish between (a) a tandem repeat (leading tuples in ktuple matches spread over the interval between i and j) and (b) a nontandem, direct
repeat (leading tuples concentrated on the right). Matching ktuples are indicated by the shaded boxes. w is the distance between the first and last leading tuple. Bernoulli sequence of length d with success probability pM.
Sd,k,pM is the apparent size of the repeat when using ktuples to
find the matches and will usually be shorter than the pattern size
d. We estimate the distribution of Sd,k,pM by simulation because
we make it conditional on first meeting the sum of heads criterion.
For given d, k and pM, random Bernoulli sequences are generated
using pM. For every sequence that meets or exceeds the sum of
heads criteria, the distance between the first and last run of heads
of length k or larger is recorded. From the distribution, we
determine the maximum number y such that 95% of the time
Sd,k,pM > y. We use y as our apparent size criterion. For example,
if pM = 0.75, k = 5 and d = 100, then the criterion is 56. In order
to test the apparent size criterion, we compute the distance
between the first and last tuple on list Dd. If the distance between
the tuples is smaller than the criterion, we assume the repeat is not
tandem or that we have not yet seen enough of it to be convinced.
Waiting time distribution. This distribution is used to pick tuple
sizes. Tuple size has a significant inverse effect on the running
time of the program because increasing tuple size causes an
exponential decrease in the expected number of tuple matches. If
the nucleotides occur with equal frequency, then increasing the
tuple size by ∆k increases the average distance between randomly
matching tuples by a factor of 4∆k. If k = 5, the average distance
between random matches is ∼1 kb, but if k = 7, the average
distance is ∼16 kb. Thus, by using a larger tuple size, we keep the
history lists short. On the other hand, increasing the tuple size decreases the chance of noticing approximate copies because they
may not contain a long, unbroken run of matches. Let the random
variable Tk,pM = the number of iid Bernoulli trials with success
probability pM until the first occurrence of a run of k successes.
Tk,pM follows the geometric distribution of order k. If we let p = pM
and q = 1 – p then the exact probability P(Tk,pM = x) for x ≥ 0 is
given by the recursive formula (34)
0
for x t k
pk
for x + k
P(T k,pM + x) +
x–k–1
for x u k
qp k [1 – i + 0 P(T k,pM + i)
For example, if pM = 0.75 and k = 5 then we need at least 31 trials
(coin tosses) to have a 95% chance of seeing a run of 5 heads. For
patterns smaller than 31 characters, we need to use a smaller
ktuple. The waiting time distribution allows us to balance the
running time and sensitivity of our algorithm by picking a set of
tuple sizes, each applying to a different range of pattern sizes. The
program processes the sequence once, simultaneously checking
these different tuple sizes. We require that the smallest pattern for
tuple size k have a sum of heads criterion of at least k + 1. Table
1 shows the range of tuple sizes and the corresponding pattern
sizes currently used by the program.
Analysis component. If the information in the distance list passes
the criteria tests, a candidate pattern consisting of positions j + 1
. . . i is selected from the nucleotide sequence and aligned with the 577
Nucleic Acids Research, 1999, Vol. 27, No.
Nucleic Acids Research, 1994, Vol. 22, No. 12
surrounding sequence using wraparound dynamic programming
(WDP) (35,36). If at least two copies of the pattern are aligned
with the sequence, the tandem repeat is reported. Several
implementation details of the analysis component are described
below.
Table 1. Tuple sizes and the range of pattern sizes each is used to detect Multiple reporting of repeat at different pattern sizes. When a
single tandem repeat contains many copies, several pattern sizes
are possible. For example, if the basic pattern size is 26, then the
repeat may be reported at sizes 26, 52, 78, etc. We limit this
redundancy in the output to, at most, three pattern sizes. Note that
we do not automatically limit the output to the smallest period size
because a much better alignment may come from a larger size (for
example Table 5, indices 410172–410459).
Narrow band alignment. Alignments are the program’s most time
intensive calculations. To decrease running time, we limit WDP
calculations to a narrow diagonal band in the alignment matrix for
patterns larger than 20 characters. In accordance with the random
walk results, the band radius is ∆dmax. The band is periodically
recentered around a run of matches in the current best alignment.
Consensus pattern and period size. An initial candidate pattern P
is drawn from the sequence, but this is usually not the best pattern
to align with the tandem repeat. To improve the alignment, we
determine a consensus pattern by majority rule from the
alignment of the copies with P. The consensus is used to realign
the sequence and this final alignment is reported in the output.
Period size is defined as the most common matching distance
between corresponding characters in the alignment and may not
be identical to consensus size. 577 of the four nucleotides; (ix) entropy measure based on percent
composition.
RESULTS
To demonstrate the capabilities of our program, we used it to
analyze four sequences, the human frataxin gene sequence
(Friedreich’s ataxia) (5), the human β T cell receptor locus
sequence (37) and two yeast chromosomes (I and VIII). [The
frataxin gene sequence and the human β T cell receptor sequences
were obtained from GenBank. The yeast chromosomes sequences
were obtained via ftp from ftp.ebi.ac.uk directory pub/databases/
yeast in files chri_230209.ascii and chrviii_562638.ascii. Indexing
in this paper is relative to the sequences in these files. Data file
accession numbers for these sequences are: frataxin gene
promoter and intron 1, U43748; human T cell receptor, L36092;
yeast chromosome 1, U12980, L20125, L05146, L22015,
L28920; yeast chromosome 8, U11583, U11582, U11581,
U10555, U10400, U10399, U00062, U00061, U10556, U00060,
U00059, U10398, U10397, U00027, U00028, U00030, U00029.]
In our analysis, we searched for all pattern sizes between 1 and
500 bases (the implementation’s current upper size limit, to be
extended in subsequent versions). We used one of two sets of
alignment parameters (match, mismatch, gap), either (+2,–7,–7)
or (+2,–5,–7). Only those repeats scoring at least 50 with these
parameters are reported. Occasionally, the same repeat is reported
at different pattern sizes. We have omitted these redundancies.
We performed two searches on each sequence, using different
conservation parameter values, (pM = 0.75, pI = 0.20) and
(pM = 0.80, pI = 0.10). While the first search is slower than the
second, the detected repeats are nearly identical. Table 2 shows
running times of the program and Tables 4–7 list the tandem
repeats found.
Table 2. Running times of program on selected sequences using a Silicon
Graphics O2 RS10000 Program usage and output
Input to the program consists of a sequence file and the following
parameters: (i) alignment weights for match, mismatch and
indels; (ii) pM and pI; (iii) a minimum size for patterns to report;
(iv) a minimum alignment score to report. We have developed a
web based interface for the program. Using an HTML form at
c3.biomath.mssm.edu/trf.html , the user provides an input DNA
sequence file. Defaults can be used for the remaining parameters.
After program execution, two files are returned. The first is a
summary table describing the location and statistical properties of
the tandem repeats found. The second contains the alignment of
each repeat with its consensus sequence. The files are linked so
that selecting an entry from the table opens a second browser
window which contains the proper alignment. The summary table
includes the following information: (i) indices of the repeat in the
sequence; (ii) period size; (iii) number of copies aligned with the
consensus pattern; (iv) size of the consensus pattern (may differ
from the period size); (v) percent of matches between adjacent
copies overall; (vi) percent of indels between adjacent copies
overall; (vii) alignment score; (viii) percent composition for each Time grows linearly with sequence length. With conservation parameter
values (pM = 0.75, pI = 0.20) running time is ∼10 times slower than with
values (pM = 0.80, pI = 0.10) although the detected repeats are nearly identical.
Alignment weights also affect running time. The most liberal weights tested
increase the times shown here by ∼50%. Human frataxin gene (Friedreich’s ataxia), intron 1
Friedreich’s ataxia is one of the triplet repeat diseases (5). It is
caused by copy number expansion of the triplet GAA in the first
intron of the frataxin gene. Table 4 lists the repeats found in the
sequence. Besides the triplet repeat, our program found two
others which were apparently unknown, a 44 bp pattern and a
14 bp pattern. Figure 5 shows the program’s alignment of the
44 bp repeat. 578 Nucleic Acids Research, 1999, Vol. 27, No. 2
Table 5. Tandem repeats detected in human β T cell receptor locus sequence Figure 5. The program’s alignment of the 44 bp repeat from the frataxin gene
intron 1 (Friedreich’s ataxia). This repeat was apparently unknown. The actual
sequence is on the top; the consensus sequence is on the bottom. Each pair of
lines represents one period. Position of the beginning of the repeat is relative to
the detected pattern when the criteria were met and is therefore arbitrary.
Symbol * indicates a mismatch. Summary refers to matches, mismatches and
indels between adjacent copies in the sequence, not between the sequence and
the consensus pattern. Table 3. Varying copy numbers in the four similar tandem repeat clusters
found in yeast chromosomes 1 and 8 See text and Tables 6 and 7 for cluster locations. Table 4. Tandem repeats detected in the human frataxin gene intron 1
sequence Not shown are all mononucleotide repeats and those repeats already annotated in the GenBank entries (accession nos L36092, U66059, U66060 and
U66061) except for the 60 bp repeat marked with symbol *. Symbol { indicates a pattern which is included even though a longer pattern has a better
scoring alignment. been identified. Table 5 lists the new repeats we found. Of the 83
repeats that we found, 38 were previously annotated and most of
those were for patterns of size 5 or smaller. We missed 6 annotated
repeats: 4 dinucleotide repeats and 1 tetranucleotide repeat (alignment scores were below our cutoff) and 1 repeat with period size
10 567 bases (beyond the current implementation’s pattern upper
size limit). Of the 45 unannotated repeats, 13 have short patterns
(2–6 bp) and may be polymorphic and thus useful for linkage
analysis. Six unannotated repeats have large pattern sizes (116,
65, 52, 49, 34 and 30 bp). The 116 base pattern is also reported
at size 39 with a lower scoring alignment. The annotated 60 base
pattern repeat (indices 12596–13266) is indicative of the
program’s ability to find repeats with substantial amounts of
mutation between adjacent copies (74% matching characters and
7% indels overall).
Yeast chromosomes Human β T cell receptor locus sequence
This sequence (37) contains a family of immune recognition
coding elements, the T cell receptor variable, diversity, joining
and constant gene segments. It was selected for its size and
because many tandem repeats within the sequence had already Tables 6 and 7 list the tandem repeats found for the yeast
sequences. Of special interest are the clusters of tandem repeats
which show up repeatedly at the ends of the chromosomes,
suggesting recent swapping of the ends. Chromosome 8, in
particular, has two different clusters on its right end.
The (27, 21, 48, 15, 135) cluster
The FLO1 gene and its paralogous pseudogenes in chromosomes
1 and 8 contain a cluster of 5 tandem repeats with pattern sizes 27, 579
Nucleic Acids Research, 1999, Vol. 27, No.
Nucleic Acids Research, 1994, Vol. 22, No. 12
Table 6. Tandem repeats detected in yeast chromosome 1 579 Table 7. Tandem repeats detected in yeast chromosome 8 (only the latter
half of the sequence is shown) Period sizes in bold indicate one of four similar clusters found at the ends of
chromosomes 1 and 8. Cluster C4 is shown. Period sizes in italics indicate one
of three similar clusters found at both ends of chromosome 8 and one end of
chromosome 6. Period sizes in bold indicate similar clusters found at the ends of chromosomes 1 and 8. From the top, these are clusters C1, C2 and C3. 21, 48, 15 and 135. We designate these clusters C1 and C2
(adjacent on the left end of chromosome 1), C3 (right end,
opposite strand) and C4 (right end of chromosome 8). The 27, 48
and 135 base patterns are not reported in every cluster in Tables 6
and 7. Subsequent analysis of the surrounding sequences,
however, revealed that every pattern is present but not necessarily
as two or more copies (Table 3). For each pattern size, the number
of copies varies among the four clusters. More specifically, no
cluster is identical in its copy number to any other cluster. This
implies that duplication or excision events (deletion of copies)
have occurred since the separate clusters were incorporated into
the chromosomes. The sequences around these clusters also
reveal close homology. For example, C3 and C4 are nearly
identical over 18 000 bases and C2 and C3 display homology over
15 000 bases.
The (13, 10, 36) cluster
A cluster of 3 tandem repeats with pattern sizes 13, 10 and
36 bases appears on both ends of chromosome 8 (Table 7). The
36 bp pattern also appears on the left end (low index numbers) of
chromosome 6 (not shown). For the 36 bp pattern, each occurrence
has a different copy number. The 10 and 13 bp patterns are identical
in their occurrences. Surrounding sequences comprising 4200 bases
are nearly identical for these three clusters. CONCLUSION
In this paper, we have presented a new algorithm for finding
tandem repeats in DNA sequences without the need to specify
either the pattern or pattern size. The algorithm is based on the
detection of ktuple matches. It uses a probabilisitic model of
tandem repeats and a collection of statistical criteria based on that
model. We have demonstrated the speed and utility of the
algorithm by analyzing four sequences ranging in size up to
700 kb. Several avenues for future research are raised by this
work, including methods to estimate statistical significance for
tandem repeats and algorithms to determine plausible mutational
histories.
Statistical issues
We have yet to develop a good statistical significance measure for
tandem repeats. For now, we use a cutoff alignment score based
on simulations with random sequences. Difficulties include the
local variation in nucleotide content in real sequences, which is
decidedly nonrandom, and the problem of accounting for copy
number as well as total repeat length. Estimates of significance
developed in Benson and Waterman (27) are too high in this
application because they apply to tandem repeats of one pattern
size only, rather than the range of sizes considered here.
Mutational history
Analyzing the mutational history of tandem repeats requires
utilizing the pattern of mutations among adjacent copies to
describe the interwoven progression of substitutions, indels and 580 Nucleic Acids Research, 1999, Vol. 27, No. 2 duplication/excision events leading from a single copy of the
pattern to the present day sequence. Such histories can suggest
how the boundaries and size of the duplication unit vary and may
reveal details about the duplication mechanism.
ACKNOWLEDGEMENTS
The author would like to thank Xiaoping Su for his help in
analyzing the sum of heads criterion, Astrid Jervis for her help in
simulating many of the statistical measures and examining the
program output and Lan Dong for her help with some of the
programing and examining the output. Thanks also to Mike
Waterman, Richard Arratia and Rolf Backofen for helpful
discussions. This work was partially supported by NSF grant
CCR9623532.
REFERENCES
1 Verkerk,A., Pieretti,M., Sutcliffe,J. Fu,Y., Kuhl,D., Pizzuti,A., Reiner,O.,
Richards,S., Victoria,M., Zhang,F., Eussen,B., van Ommen,G.,
Blonden,A., Riggins,G., Chastain,J., Kunst,C., Galjaard,H., Caskey,C.,
Nelson,D., Oostra,B. and Warren,S. (1991) Cell, 65, 905–914.
2 Huntington’s Disease Collaborative Research Group. (1993) Cell, 72,
971–983.
3 Fu,Y.H., Pizzuti,A., Fenwick,J., King,R.G.Jr. and Rajnarayan,S.,
Dunne,P.W., Dubel,J., Nasser,G.A., Ashizawa,T., DeJong,P., Wieringa,B.
Korneluk,R., Perryman,M.B., Epstein,H.F. and Caskey,C.T. (1992)
Science, 255, 1256–1258.
4 La Spada,A., Wilson,E., Lubahn,D., Harding,A. and Fischbeck,K. (1991)
Nature, 352, 77–79.
5 Campuzano,V., Montermini,L., Molto,M.D., Pianese,L. and Cossee,M.
(1996) Science, 271, 1423–1427.
6 Wells,R. (1996) J. Biol. Chem., 271, 2875–2878.
7 Weitzmann,M., Woodford,K. and Usdin,K. (1997) J. Biol. Chem., 272,
9517–9523.
8 Hamada,H., Seidman,M., Howard,B. and Gorman,C. (1984) Mol. Cell. Biol.,
4, 2622–2630.
9 Pardue,M., Lowenhaupt,K., Rich,A. and Nordheim,A. (1987) EMBO J., 6,
1781–1789.
10 Yee,H., Wong,A., van den Sande,J. and Rattner,J. (1991) Nucleic Acids Res.,
19, 949–953.
11 Richards,R., Holman,K., Yu,S. and Southerland,G. (1993) Hum. Mol. Genet.,
2, 1429–1435.
12 Lu,Q., Wallrath,L., Granok,H. and Elgin,S. (1993) Mol. Cell. Biol., 13,
2802–2814. 13 Du,J., Zhu,Y., Shanmugam,A. and Kenter,A. (1997) Nucleic Acids Res.,
25, 3066–3073.
14 Edwards,A., Hammond,H., Jin,L., Caskey,C. and Chakraborty,R. (1992)
Genomics, 12, 241–253.
15 Weber,J. and May,P. (1989) Am. J. Hum. Genet., 44, 388–396.
16 Tishkoff,S.A., Dietzsch,E., Speed,W., Pakstis,A.J. and Kidd,J.R. (1996)
Science, 271, 1380–1387.
17 Armour,J., Anttinen,T., May,C., Vega,E., Sajantila,A., Kidd,J., Kidd,K.,
Bertranpetit,J., Pääbo,S. and Jeffreys,A. (1996) Nature Genet., 13, 154–160.
18 Benson,G. (1997) J. Comput. Biol., 4, 351–367.
19 Hellman,L., Steen,M., Sundvall,M. and Pettersson,U. (1988) Gene, 68,
93–100.
20 Kannan,S.K. and Myers,E.W. (1996) SIAM J. Comput., 25, 648–662.
21 Benson,G. (1995) Theor. Comput. Sci., 145, 357–369.
22 Schmidt,J.P. (1998) SIAM J. Comput., 27, 972–992.
23 Milosavljevic,A. and Jurka,J. (1993) CABIOS, 9, 407–411.
24 Rivals,E., Delgrange,O., Delahaye,J.P., Dauchet,M., Delorme,M.O.,
Hénaut,A. and Ollivier,E. CABIOS, 13, 131–136, 1997.
25 Landau,G. and Schmidt,J. (1993) In Apostolico,A., Crochemore,M., Galil,Z.
and Manber,U (eds), Proceedings of the 4th Annual Symposium on
Combinatorial Pattern Matching, Lecture Notes in Computer Science,
Vol. 648. SpringerVerlag, Berlin, pp. 120–133.
26 Karlin,S., Morris,M., Ghandour,G. and Leung,M.Y. (1988) Proc. Natl
Acad. Sci. USA, 85, 841–845.
27 Benson,G. and Waterman,M. (1994) Nucleic Acids Res., 22, 4828–4836.
28 Sagot,M. and Myers,E. (1998) In Istrail,S., Pevzner,P. and Waterman,M.
(eds), Proceedings of the Second Annual International Conference on
Computational Molecular Biology. ACM Press, NY, pp. 234–242.
29 Benson,G. (1998) In Istrail,S., Pevzner,P. and Waterman,M. (eds),
Proceedings of the Second Annual International Conference on
Computational Molecular Biology. ACM Press, NY, pp. 20–29.
30 Pearson,W. and Lipman,D. (1988) Proc. Natl Acad. Sci. USA, 85,
2444–2448.
31 Altschul,S., Gish,W., Miller,W., Myers,E. and Lipman,D. (1990)
J. Mol. Biol., 215, 403–410.
32 , Benson,G. and Su,X. (1998) J. Comput. Biol., 5, 87–100., Benson,G. and
Su,X. (1998) J. Comput. Biol., 5, 87–100.
33 Feller,W. (1968) An Introduction to Probability Theory and its
Applications, 3rd Edn, Vol. I. John Wiley & Sons, New York, NY.
34 Aki,S., Kuboki,H. and Hirano,K. (1984) Ann. Inst. Statist. Math., 36,
431–440.
35 Miller,W. and Myers,E. (1989) Bull. Math. Biol., 51, 5–37.
36 Fischetti,V., Landau,G., Schmidt,J. and Sellers,P. (1992) In Apostolico,A.,
Crochemore,M., Galil,Z. and Manber,U (eds), Proceedings of the 3rd
Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in
Computer Science, Vol. 644. SpringerVerlag, Berlin, pp. 111–120.
37 Rowan,L., Koop,B. and Hood,L. (1996) Science, 272, 1755–1768. ...
View
Full
Document
This note was uploaded on 12/27/2011 for the course CMPSC 290a taught by Professor Vandam during the Fall '09 term at UCSB.
 Fall '09
 Vandam

Click to edit the document details