Lec06-msa - Multiple Sequence Alignment CMSC 423 Multiple...

Info iconThis preview shows page 1. Sign up to view the full content.

View Full Document Right Arrow Icon
This is the end of the preview. Sign up to access the rest of the document.

Unformatted text preview: Multiple Sequence Alignment CMSC 423 Multiple Sequence Alignment Multiple sequence alignment: find more subtle patterns & find common patterns between all sequence. MSA • The multiple sequence alignment problem: Input: Sequences: S1, S2, ..., Sm Let M be a MSA between these sequences. Let dM(Si, Sj) be the score of the alignment between Si and Sj implied by M. SP-Score(M) = ∑i,j dM(Si, Sj) = Sum of all pairwise alignment scores. • • Goal: find M to minimize SP-Score(M). But this is NP-hard. SP-Score in a Picture SP-Score(M) = ∑i,j dM(Si, Sj) = sum of all the scores of the pairwise alignments implied by M. Sj dM(Si, Sj) Si MSA • A multiple sequence alignment (MSA) implies a pairwise alignment between every pair of sequences. • This implied alignment need not be optimal, however: match = -1, a mismatch = 1, gap = 2 Sequences: AT, A, T, AT, AT AT AOptimal MSA: -T AT AT +2 +2 = 4 Optimal Alignment between A and T: A T +1 (A,A), (A,-), (A, A), (A,A), (A, -), (A,A), (A, A) (-,A), (-,A), (A,A) -1 + 2 -1 -1 +2 -1 -1 +2 +2 -1 = +2 Slow Dynamic Programming Suppose you had just 3 sequences. Apply the same DP idea as sequence alignment for 2 sequences, but now with a 3-dimensional matrix (i-1, j, k) (i, j, k) (i, j, k-1) (i-1, j, k-1) (i-1, j-1, k) (i-1, j-1, k-1) (i, j-1, k) (i, j-1, k-1) DP Recurrence for 3 sequences cost(xi , yj , zk ) + A[i − 1, j − 1, k − 1] cost(x , −, −) + A[i − 1, j, k ] i cost(xi , yj , −) + A[i − 1, j − 1, k ] A[i, j, k ] = min cost(−, yj , zk ) + A[i, j − 1, k − 1] cost(−, yj , −) + A[i, j − 1, k ] cost(x , −, z ) + A[i − 1, j, k − 1] i k cost(−, −, zk ) + A[i, j, k − 1] (i-1, j, k) Every possible for the gaps. 2k pattern (i, j, k-1) (i-1, j, k-1) (i-1, j-1, k) (i-1, j-1, k-1) (i, j, k) (i, j-1, k) (i, j-1, k-1) Running time • n3 subproblems, each takes 23 time O(n3) time. • For k sequences: nk subproblems, each takes 2k time for the max and k2 to compute cost() O(k2nk2k) • Even O(n3) is often too slow for the length of sequences encountered in practice. • One solution: approximation algorithm. Star Alignment Approximation Sj Si dM(Si, Sj) Si Sc SP-Score dM(Si, Sc) Star-Score = ∑i dM(Si, Sc) Star Alignment Algorithm Input: sequences S1, S2, ..., Sk • • Build all O(k2) pairwise alignments. Let Sc = the sequence in S1, S2, ..., Sk that is closest to the others. That is, choose Sc to minimize: ∑i≠c d(Sc, Si) • Iteratively align all other sequences to Sc. d(Sc, Si) dM(Si, Sj) Sc Sc Iterative Alignment • Build a multiple sequence alignment up from pairwise alignments. Start with an alignment between Sc and some other sequence: SC YFPHFDLSHGSAQVKAHGKKVGDALTLAVGHLDDLPGAL S1 YFPHFDLSHG-AQVKG--KKVADALTNAVAHVDDMPNAL Add 3rd sequence, say S2, and use the SC - S2 alignment as a guide, adding spaces into the MSA as needed. SC - S2 alignment: SC YFPHF-DLS-----HGSAQVKAHGKKVGDALTLAVGHL----DDLPGAL S2 FFPKFKGLTTADQLKKSADVRWHAERII----NAVNDAVASMDDTEKMS New {SC, S1, S2} alignment (red gaps added in S1): SC YFPHF-DLS-----HGSAQVKAHGKKVGDALTLAVGHL----DDLPGAL S1 YFPHF-DLS-----HG-AQVKG--KKVADALTNAVAHV----DDMPNAL S2 FFPKFKGLTTADQLKKSADVRWHAERII----NAVNDAVASMDDTEKMS Continue with S3, S4, ... Performance Assume the cost function satisfies the triangle inequality: cost(x,y) ≤ cost(x, z) + cost(z,y) Example: cost(A, C) ≤ cost(A, T) + cost(T,C) cost of 1 mutation from A→C cost of a mutation from A→T and then from T→C STAR = cost of star alignment under SP-score OPT = cost of optimal multiple sequence alignment (under SP-score) Theorem. If cost satisfies the triangle inequality, then STAR ≤ 2×OPT. Example: if optimal alignment has cost 10, the star alignment will have cost ≤ 20. Proof (1) Theorem. If cost satisfies the triangle inequality, then STAR ≤ 2OPT. STAR ≤2 OPT For some B we will prove the 2 statements: STAR ≤ 2B OPT ≥ B This will imply: STAR 2B =⇒ ≤ =2 OPT B Proof (2) Theorem. If cost satisfies the triangle inequality, then STAR ≤ 2OPT. 2 · STAR by triangle inequality because STAR alignment is optimal for pairs involving Sc distribute ∑ = ￿ dSTAR (Si , Sj ) defn of SP-score ij ≤ = ￿ (dSTAR (Si , Sc ) + dSTAR (Sc , Sj )) ij ￿ (d(Si , Sc ) + d(Sc , Sj )) ij = ￿ ij ≤ 2k d(Si , Sc ) + ￿ i ￿ d(Sc , Sj ) ij d(Si , Sc ) sums are the same and each term appears ≤ k (# of sequences) times. Proof (3) Theorem. If cost satisfies the triangle inequality, then STAR ≤ 2OPT. 2 · OPT = optimal pairwise alignment is ≤ pairwise alignment induced by any MSA sum of all cost of all pairwise alignments is = the sum of k different stars. We chose Sc because it was the lowest-cost star. ≥ ￿ dOPT (Si , Sj ) ij ￿ d(Si , Sj ) ij ≥k ￿ i d(Si , Sc ) defn of SP-score End of Proof For some B we will prove the 2 statements: This will imply: STAR ≤ 2B OPT ≥ B 2 · ST AR 2 · OP T ≤ 2k ≥k ￿ STAR 2B =⇒ ≤ =2 OPT B d(Si , Sc ) i ￿ i d(Si , Sc ) ￿ 2k i d(Si , Sc ) STAR =⇒ ≤￿ =2 OPT k i d(Si , Sc ) Consensus Sequence For every column j, choose c ∈ ∑ that minimizes ∑i cost(c, Si[j]) S1 S2 S3 S4 CO (typically this means the most common letter) YFPHF-DLS-----HGSAQVKAHGKKVG-----DALTLAVAHLDDLPGAL YFPHF-DLS-----HG-AQVKG—GKKVA-----DALTNAVAHVDDMPNAL FFPKFKGLTTADQLKKSADVRWHAERII-----NAVNDAVASMDDTEKMS LFSFLKGTSEVP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATL YFPHFKDLS-----HGSAQVKAHGKKVG-----DALTLAVAHVDDTPGAL • • Consensus is a summarization of the whole alignment. • Sometimes the MSA problem is formulated as: find MSA M that minimizes: Consensus sequence is sometimes used as an estimate for the ancestral sequence. ∑i dM(CO, Si) Profiles • Another way to summarize an MSA: S1 S2 S3 S4 ACG-TT-GA ATC-GTCGA ACGCGA-CC ACGCGT-TA Column in the alignment Character 123456789 A C G T - 1 0 0 0 0 0.75 0.25 0.5 0 0.25 0 0 0 0 0 0.75 0 0.75 0 0.25 0 0 0.25 0.75 0 0 0 0.5 0 0 0 0 Call this profile matrix R 0.75 0.25 0.25 0.25 0 0.5 0 0 0.25 0 0.75 0 0 Fraction of time given column had the given character Profile-based Alignment gap in profile introduced to better fit sequence 1 3 4 5 6 7 8 9 A R= 2 1 0 0 0 0 0.25 0 0 0.75 C 0 0.75 0.25 0.5 0 0 0.25 0.25 0.25 G 0 0 0.75 0 0.75 0 0 0.5 0 T 0 0.25 0 0 0.25 0.75 0 0.25 0 - 0 0 0 0.5 0 0 0.75 0 0 A C C - AG A C GA A[i − 1, j − 1] + P (xi , j ) A[i, j ] = max A[i − 1, j ] + gap A[i, j − 1] + P (“ “, j ) Score of matching character x with column j of the profile: ￿ P (x, j ) = sim(x, c) × R[c, j ] c∈Σ sim(x,c) = how similar character x is to character c. align xi to column j introduce gap into profile introduce gap into x Recap • Multiple sequence alignments (MSAs) are a fundamental tool. They help reveal subtle patterns, compute consistent distances between sequences, etc. • Quality of MSAs often measured using the SP-score: sum of the scores of the pairwise alignments implied by the MSA. • Same DP idea as pairwise alignment leads to exponentially slow algorithm for MSA. • 2-approximation obtainable via star alignments. • MSAs often used to create profiles summarizing a family of sequences. Profile-sequence alignments solvable via dynamic programming. ...
View Full Document

Ask a homework question - tutors are online