Multiple Sequence Alignment
CMSC 423

Multiple Sequence Alignment
Multiple sequence alignment: find more subtle patterns & find patterns common to all of the sequences.

MSA
• The multiple sequence alignment problem:
  Input: sequences S1, S2, ..., Sm.
  Let M be an MSA of these sequences.
  Let dM(Si, Sj) be the score of the alignment between Si and Sj implied by M.
  SPScore(M) = ∑i<j dM(Si, Sj) = sum of all pairwise alignment scores.
• Goal: find M to minimize SPScore(M).
• But this is NP-hard.

SPScore in a Picture
SPScore(M) = ∑i<j dM(Si, Sj) = sum of all the scores of the pairwise alignments implied by M.
[Figure: two rows Si and Sj of the alignment M; dM(Si, Sj) is the score of the pairwise alignment they imply.]
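
The definition of SPScore can be made concrete with a minimal Python sketch. The function names, the scoring values (match = -1, mismatch = 1, gap = 2), and the choice to score a column holding two gaps as 0 are assumptions made for illustration. Each pair of rows is counted once.

```python
from itertools import combinations

GAP = "-"

def column_cost(a, b, match=-1, mismatch=1, gap=2):
    """Cost of one column of a pairwise alignment (lower is better)."""
    if a == GAP and b == GAP:
        return 0  # assumption: a column with two gaps adds nothing to this pair
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch

def implied_score(row_i, row_j):
    """d_M(S_i, S_j): score of the pairwise alignment implied by two MSA rows."""
    return sum(column_cost(a, b) for a, b in zip(row_i, row_j))

def sp_score(msa):
    """SPScore(M): sum of implied scores over all unordered pairs of rows."""
    return sum(implied_score(r, s) for r, s in combinations(msa, 2))

msa = ["AT", "A-", "-T", "AT", "AT"]  # rows of an MSA of AT, A, T, AT, AT
print(sp_score(msa))                   # 4
print(implied_score("A-", "-T"))       # 4
print(implied_score("A", "T"))         # 1
```

The last two lines show that the alignment of A and T implied by this MSA costs 4, while aligning them directly costs 1.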

MSA
• A multiple sequence alignment (MSA) implies a pairwise alignment between every pair of sequences.
• This implied alignment need not be optimal, however:

match = -1, mismatch = 1, gap = 2
Sequences: AT, A, T, AT, AT

Optimal MSA:
AT
A-
-T
AT
AT

Pairs in column 1 of the MSA:
(A,A), (A,-), (A,A), (A,A), (A,-), (A,A), (A,A), (-,A), (-,A), (A,A)
-1 + 2 - 1 - 1 + 2 - 1 - 1 + 2 + 2 - 1 = +2
Column 2 scores +2 in the same way, so SPScore = +2 +2 = 4.

Alignment between A and T implied by the MSA: A- over -T, cost +2 +2 = 4.
Optimal alignment between A and T: A over T, cost +1.

Slow Dynamic Programming
Suppose you had just 3 sequences.
Apply the same DP idea as sequence alignment for 2 sequences, but now with a 3-dimensional matrix.
[Figure: cell (i, j, k) and its seven predecessors (i-1, j, k), (i, j-1, k), (i, j, k-1), (i-1, j-1, k), (i-1, j, k-1), (i, j-1, k-1), (i-1, j-1, k-1).]

DP Recurrence for 3 sequences
\[
A[i,j,k] = \min
\begin{cases}
\mathrm{cost}(x_i, y_j, z_k) + A[i-1, j-1, k-1] \\
\mathrm{cost}(x_i, -, -) + A[i-1, j, k] \\
\mathrm{cost}(x_i, y_j, -) + A[i-1, j-1, k] \\
\mathrm{cost}(-, y_j, z_k) + A[i, j-1, k-1] \\
\mathrm{cost}(-, y_j, -) + A[i, j-1, k] \\
\mathrm{cost}(x_i, -, z_k) + A[i-1, j, k-1] \\
\mathrm{cost}(-, -, z_k) + A[i, j, k-1]
\end{cases}
\]
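
A minimal Python sketch of this recurrence follows. It assumes a sum-of-pairs column cost with match = 0, mismatch = 1, gap = 2 and a gap-gap pair costing 0; the names `pair_cost`, `column_cost` and `msa3_cost` are illustrative, not from the slides. It returns only the optimal cost, not the alignment itself.

```python
from itertools import product

GAP = "-"

def pair_cost(a, b, mismatch=1, gap=2):
    """Assumed pairwise column cost: 0 for equal symbols (including gap-gap)."""
    if a == b:
        return 0
    if a == GAP or b == GAP:
        return gap
    return mismatch

def column_cost(x, y, z):
    """cost(x, y, z): sum of the three pairwise costs in one MSA column."""
    return pair_cost(x, y) + pair_cost(x, z) + pair_cost(y, z)

def msa3_cost(x, y, z):
    """Minimum sum-of-pairs cost of aligning three sequences."""
    nx, ny, nz = len(x), len(y), len(z)
    inf = float("inf")
    A = [[[inf] * (nz + 1) for _ in range(ny + 1)] for _ in range(nx + 1)]
    A[0][0][0] = 0
    # The 7 gap patterns: each sequence either consumes a character (1) or not (0).
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i, j, k in product(range(nx + 1), range(ny + 1), range(nz + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue  # this predecessor lies outside the matrix
            col = (x[pi] if di else GAP,
                   y[pj] if dj else GAP,
                   z[pk] if dk else GAP)
            A[i][j][k] = min(A[i][j][k], column_cost(*col) + A[pi][pj][pk])
    return A[nx][ny][nz]

print(msa3_cost("AT", "A", "T"))  # 6
```

The triple loop visits every cell once and tries all 7 predecessors, which is where the cubic number of subproblems comes from.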
The recurrence tries every possible pattern for the gaps: about 2^k of them for k sequences.

Running time
• n^3 subproblems, each takes 2^3 time: O(n^3) time.
• For k sequences: n^k subproblems, each takes 2^k time for the min and k^2 to compute cost(): O(k^2 2^k n^k).
• Even O(n^3) is often too slow for the length of sequences encountered in practice.
• One solution: approximation algorithm.

Star Alignment Approximation
[Figure: SPScore sums dM(Si, Sj) over all pairs Si, Sj; StarScore sums only the alignments dM(Si, Sc) between each Si and the center Sc.]
StarScore = ∑i dM(Si, Sc)

Star Alignment Algorithm
Input: sequences S1, S2, ..., Sk
• Build all O(k^2) pairwise alignments.
• Let Sc = the sequence in S1, S2, ..., Sk that is closest to the others. That is, choose Sc to minimize: ∑i≠c d(Sc, Si)
• Iteratively align all other sequences to Sc.
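
The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the course's reference implementation: costs are match = 0, mismatch = 1, gap = 2, and the helper names `align`, `merge` and `star_alignment` are hypothetical. The merge step inserts a gap into every existing row whenever the new pairwise alignment needs a gap in the center.

```python
GAP = "-"
GAP_COST = 2  # assumed gap cost

def sub_cost(a, b):
    """Assumed substitution cost: match 0, mismatch 1."""
    return 0 if a == b else 1

def align(s, t):
    """Optimal global alignment of s and t; returns (cost, s_row, t_row)."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * GAP_COST
    for j in range(1, m + 1):
        D[0][j] = j * GAP_COST
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),
                          D[i - 1][j] + GAP_COST,
                          D[i][j - 1] + GAP_COST)
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:  # trace back one optimal path
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + GAP_COST:
            a.append(s[i - 1]); b.append(GAP); i -= 1
        else:
            a.append(GAP); b.append(t[j - 1]); j -= 1
    return D[n][m], "".join(reversed(a)), "".join(reversed(b))

def merge(rows, center_row, new_row):
    """Add new_row to the MSA `rows` (rows[0] is the center), guided by the
    pairwise alignment (center_row, new_row)."""
    old_center = rows[0]
    out = [[] for _ in range(len(rows) + 1)]
    p = q = 0
    while p < len(old_center) or q < len(center_row):
        in_old, in_new = p < len(old_center), q < len(center_row)
        if in_old and in_new and old_center[p] == center_row[q]:
            cols = [r[p] for r in rows] + [new_row[q]]   # columns agree
            p, q = p + 1, q + 1
        elif in_old and old_center[p] == GAP:
            cols = [r[p] for r in rows] + [GAP]          # keep an existing gap column
            p += 1
        else:
            cols = [GAP] * len(rows) + [new_row[q]]      # new gap in all existing rows
            q += 1
        for row, ch in zip(out, cols):
            row.append(ch)
    return ["".join(r) for r in out]

def star_alignment(seqs):
    """Returns (index of center, order of rows, aligned rows)."""
    k = len(seqs)
    pair = {(i, j): align(seqs[i], seqs[j])
            for i in range(k) for j in range(k) if i != j}
    c = min(range(k), key=lambda i: sum(pair[i, j][0] for j in range(k) if j != i))
    rows, order = [seqs[c]], [c]
    for j in range(k):
        if j != c:
            _, center_row, new_row = pair[c, j]
            rows = merge(rows, center_row, new_row)
            order.append(j)
    return c, order, rows

c, order, rows = star_alignment(["ACGT", "AGT", "ACT", "ACGGT"])
print(c)
print("\n".join(rows))
```

Ties between equally good pairwise alignments are broken arbitrarily by the traceback, so the exact rows printed may differ between equally valid implementations.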
[Figure: the star centered at Sc, with edges d(Sc, Si), next to an implied alignment dM(Si, Sj).]

Iterative Alignment
• Build a multiple sequence alignment up from pairwise alignments.

Start with an alignment between Sc and some other sequence:
SC YFPHFDLSHGSAQVKAHGKKVGDALTLAVGHLDDLPGAL
S1 YFPHFDLSHGAQVKGKKVADALTNAVAHVDDMPNAL

Add a 3rd sequence, say S2, and use the SC-S2 alignment as a guide, adding spaces into the MSA as needed.

SC-S2 alignment:
SC YFPHFDLSHGSAQVKAHGKKVGDALTLAVGHLDDLPGAL
S2 FFPKFKGLTTADQLKKSADVRWHAERIINAVNDAVASMDDTEKMS

New {SC, S1, S2} alignment (gaps added in S1):
SC YFPHFDLSHGSAQVKAHGKKVGDALTLAVGHLDDLPGAL
S1 YFPHFDLSHGAQVKGKKVADALTNAVAHVDDMPNAL
S2 FFPKFKGLTTADQLKKSADVRWHAERIINAVNDAVASMDDTEKMS

Continue with S3, S4, ...

Performance
Assume the cost function satisfies the triangle inequality:
cost(x, y) ≤ cost(x, z) + cost(z, y)
Example: cost(A, C) ≤ cost(A, T) + cost(T, C)
(cost of 1 mutation from A→C) ≤ (cost of a mutation from A→T and then from T→C)

STAR = cost of star alignment under SP-score
OPT = cost of optimal multiple sequence alignment (under SP-score)
Theorem. If cost satisfies the triangle inequality, then STAR ≤ 2×OPT.
Example: if the optimal alignment has cost 10, the star alignment will have cost ≤ 20.

Proof (1)
Theorem. If cost satisfies the triangle inequality, then STAR ≤ 2·OPT, that is, STAR/OPT ≤ 2.
For some B we will prove the 2 statements:
STAR ≤ 2B
OPT ≥ B
This will imply:
\[
\frac{\mathrm{STAR}}{\mathrm{OPT}} \le \frac{2B}{B} = 2
\]

Proof (2)
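Theorem. If cost satisfies the triangle inequality, then STAR ≤ 2·OPT.
\begin{align*}
2 \cdot \mathrm{STAR} &= \sum_{ij} d_{\mathrm{STAR}}(S_i, S_j) && \text{defn of SP-score} \\
&\le \sum_{ij} \bigl( d_{\mathrm{STAR}}(S_i, S_c) + d_{\mathrm{STAR}}(S_c, S_j) \bigr) && \text{by triangle inequality} \\
&= \sum_{ij} \bigl( d(S_i, S_c) + d(S_c, S_j) \bigr) && \text{because STAR alignment is optimal for pairs involving } S_c \\
&= \sum_{ij} d(S_i, S_c) + \sum_{ij} d(S_c, S_j) && \text{distribute the sum} \\
&\le 2k \sum_{i} d(S_i, S_c) && \text{sums are the same and each term appears} \le k \text{ (\# of sequences) times}
\end{align*}

Proof (3)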
Theorem. If cost satisfies the triangle inequality, then STAR ≤ 2·OPT.
\begin{align*}
2 \cdot \mathrm{OPT} &= \sum_{ij} d_{\mathrm{OPT}}(S_i, S_j) && \text{defn of SP-score} \\
&\ge \sum_{ij} d(S_i, S_j) && \text{optimal pairwise alignment is} \le \text{pairwise alignment induced by any MSA} \\
&\ge k \sum_{i} d(S_i, S_c) && \text{see below}
\end{align*}
Last step: the sum of the costs of all pairwise alignments equals the sum of k different stars, and we chose Sc because it was the lowest-cost star.

End of Proof
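For some B we will prove the 2 statements:
STAR ≤ 2B
OPT ≥ B
This will imply:
\[
\frac{\mathrm{STAR}}{\mathrm{OPT}} \le \frac{2B}{B} = 2
\]
We have shown:
\[
2 \cdot \mathrm{STAR} \le 2k \sum_{i} d(S_i, S_c)
\qquad
2 \cdot \mathrm{OPT} \ge k \sum_{i} d(S_i, S_c)
\]
\[
\Longrightarrow \quad \frac{\mathrm{STAR}}{\mathrm{OPT}} \le \frac{2k \sum_{i} d(S_i, S_c)}{k \sum_{i} d(S_i, S_c)} = 2
\]

Consensus Sequence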
For every column j, choose c ∈ Σ that minimizes ∑i cost(c, Si[j]) (typically this means the most common letter).

S1 YFPHFDLSHGSAQVKAHGKKVGDALTLAVAHLDDLPGAL
S2 YFPHFDLSHGAQVKG—GKKVADALTNAVAHVDDMPNAL
S3 FFPKFKGLTTADQLKKSADVRWHAERIINAVNDAVASMDDTEKMS
S4 LFSFLKGTSEVPQNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATL
CO YFPHFKDLSHGSAQVKAHGKKVGDALTLAVAHVDDTPGAL
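
The column rule can be sketched in a few lines of Python. The unit cost (0 for a match, 1 otherwise), the small example alignment, and the choice to exclude the gap character from the candidate letters are assumptions made for illustration. With this cost, minimizing the column sum is the same as picking the most common letter.

```python
def unit_cost(a, b):
    """Assumed cost: 0 if the characters match, 1 otherwise."""
    return 0 if a == b else 1

def consensus(msa, alphabet, cost=unit_cost):
    """For each column, pick the letter c minimizing sum_i cost(c, S_i[j])."""
    out = []
    for column in zip(*msa):  # iterate over the columns of the alignment
        best = min(alphabet, key=lambda c: sum(cost(c, x) for x in column))
        out.append(best)
    return "".join(out)

msa = ["ACGT-", "ACGTA", "TCGAA", "ACCTA"]
print(consensus(msa, "ACGT"))  # ACGTA
```

Ties are broken by the order of `alphabet`, since `min` returns the first minimizer it meets.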
• Consensus is a summarization of the whole alignment.
• Consensus sequence is sometimes used as an estimate for the ancestral sequence.
• Sometimes the MSA problem is formulated as: find MSA M that minimizes: ∑i dM(CO, Si)

Profiles
• Another way to summarize an MSA:

S1 ACG-TT-GA
S2 ATC-GTCGA
S3 ACGCGA-CC
S4 ACGCGT-TA

Call this profile matrix R. Each entry is the fraction of time the given column had the given character.

            Column in the alignment
Character   1     2     3     4     5     6     7     8     9
A           1     0     0     0     0     0.25  0     0     0.75
C           0     0.75  0.25  0.5   0     0     0.25  0.25  0.25
G           0     0     0.75  0     0.75  0     0     0.5   0
T           0     0.25  0     0     0.25  0.75  0     0.25  0
-           0     0     0     0.5   0     0     0.75  0     0

Profile-based Alignment
[Figure: a sequence aligned against the columns of the profile R, with a gap introduced in the profile to better fit the sequence.]

R =
     1     2     3     4     5     6     7     8     9
A    1     0     0     0     0     0.25  0     0     0.75
C    0     0.75  0.25  0.5   0     0     0.25  0.25  0.25
G    0     0     0.75  0     0.75  0     0     0.5   0
T    0     0.25  0     0     0.25  0.75  0     0.25  0
-    0     0     0     0.5   0     0     0.75  0     0

\[
A[i,j] = \max
\begin{cases}
A[i-1, j-1] + P(x_i, j) & \text{align } x_i \text{ to column } j \\
A[i-1, j] + \mathrm{gap} & \text{introduce gap into profile} \\
A[i, j-1] + P(\text{``-''}, j) & \text{introduce gap into } x
\end{cases}
\]

Score of matching character x with column j of the profile:
\[
P(x, j) = \sum_{c \in \Sigma} \mathrm{sim}(x, c) \times R[c, j]
\]
sim(x, c) = how similar character x is to character c.

Recap
• Multiple sequence alignments (MSAs) are a fundamental tool. They help reveal subtle patterns, compute consistent distances between sequences, etc.
• Quality of MSAs is often measured using the SP-score: sum of the scores of the pairwise alignments implied by the MSA.
• Same DP idea as pairwise alignment leads to an exponentially slow algorithm for MSA.
• 2-approximation obtainable via star alignments.
• MSAs are often used to create profiles summarizing a family of sequences. Profile-sequence alignments are solvable via dynamic programming.
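
The profile-sequence dynamic program from the last bullet can be sketched in Python as below. Several details are assumptions for illustration: the similarity function `sim` (+1 for a matching letter, -1 for a mismatch or a letter against a gap, 0 for gap against gap), the gap score of -1, and the gap positions in the four-row example alignment, which were chosen to agree with the profile values. The profile is stored as one dictionary per column, so `R[j][c]` plays the role of R[c, j].

```python
GAP = "-"

def build_profile(msa, alphabet="ACGT" + GAP):
    """R[j][c] = fraction of rows that have character c in column j."""
    n = len(msa)
    return [{c: sum(row[j] == c for row in msa) / n for c in alphabet}
            for j in range(len(msa[0]))]

def sim(x, c):
    """Assumed similarity: +1 match, -1 mismatch, 0 for gap against gap."""
    if x == c:
        return 0 if x == GAP else 1
    return -1

def P(x, column):
    """Score of matching character x with one profile column."""
    return sum(sim(x, c) * frac for c, frac in column.items())

def align_to_profile(x, R, gap=-1):
    """Best score of aligning sequence x to profile R (score only, no traceback)."""
    n, m = len(x), len(R)
    A = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        A[i][0] = A[i - 1][0] + gap                  # x characters against no column
    for j in range(1, m + 1):
        A[0][j] = A[0][j - 1] + P(GAP, R[j - 1])     # columns against gaps in x
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            A[i][j] = max(A[i - 1][j - 1] + P(x[i - 1], R[j - 1]),  # align x_i to column j
                          A[i - 1][j] + gap,                        # gap in profile
                          A[i][j - 1] + P(GAP, R[j - 1]))           # gap in x
    return A[n][m]

msa = ["ACG-TT-GA", "ATC-GTCGA", "ACGCGA-CC", "ACGCGT-TA"]
R = build_profile(msa)
print(R[3])                          # column 4: C and gap each 0.5
print(align_to_profile("ACGTTGA", R))
```

Recovering the alignment itself would need a traceback over `A`, as in pairwise alignment.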
 Fall '07
 staff
