Gap Penalties
CMSC 423

General Gap Penalties

    AAAGAATTCA          AAAGAATTCA
    A-A-A-T-CA   vs.    AAA----TCA

These have the same score, but the second one is often more plausible: a single insertion of "GAAT" into the first string could change it into the second.

• Currently, the cost of a run of k gaps is GAP × k.
• A solution to the problem above is to support a general gap penalty, so that the score of a run of k gaps is gap(k) < GAP × k.
• Then, the optimization will prefer to group gaps together.
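The effect of a general gap penalty on the two alignments of AAAGAATTCA and AAATCA can be checked with a short Python sketch. The values of `GAP` and `general_gap`, the helper names, and the exact placement of the four scattered spaces are assumptions chosen for illustration; only the idea that gap(k) < GAP × k comes from the slides.

```python
from itertools import groupby

GAP = 3.0  # hypothetical per-space cost in the basic scheme


def linear_gap(k):
    """Basic scheme: a run of k spaces costs GAP * k."""
    return GAP * k


def general_gap(k):
    """Hypothetical general penalty chosen so that gap(k) < GAP * k."""
    return 2.0 + 0.5 * k


def gap_runs(row):
    """Lengths of the maximal runs of '-' in one row of an alignment."""
    return [len(list(group)) for ch, group in groupby(row) if ch == "-"]


def total_gap_cost(row_x, row_y, gap):
    """Sum gap(k) over every run of spaces in either row."""
    return sum(gap(k) for row in (row_x, row_y) for k in gap_runs(row))


x_row = "AAAGAATTCA"
scattered = "A-A-A-T-CA"  # four runs of length 1 (assumed placement)
grouped = "AAA----TCA"    # one run of length 4

for name, gap in (("linear", linear_gap), ("general", general_gap)):
    print(name,
          total_gap_cost(x_row, scattered, gap),
          total_gap_cost(x_row, grouped, gap))
```

With these assumed numbers the linear scheme charges both alignments 12, while the general scheme charges 10 for the scattered spaces and 4 for the single run, so an optimizer would prefer the grouped alignment.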
The previous DP no longer works with general gap penalties because the score of the last character depends on details of the previous alignment:

    AAAGAAC          AAAGAATC
    AAA----   vs.    AAA-----

Instead, we need to "know" how the previous alignment ends in order to give a score to the last subproblem.

Three Matrices
We now keep 3 different matrices:

M[i,j] = score of best alignment of x[1..i] and y[1..j] ending with a character-character match or mismatch.
X[i,j] = score of best alignment of x[1..i] and y[1..j] ending with a space in x.
Y[i,j] = score of best alignment of x[1..i] and y[1..j] ending with a space in y.

\[
M[i,j] = \mathrm{match}(i,j) + \max
\begin{cases}
M[i-1,j-1] \\
X[i-1,j-1] \\
Y[i-1,j-1]
\end{cases}
\]

\[
X[i,j] = \max
\begin{cases}
M[i,j-k] - \mathrm{gap}(k) & \text{for } 1 \le k \le j \\
Y[i,j-k] - \mathrm{gap}(k) & \text{for } 1 \le k \le j
\end{cases}
\]

\[
Y[i,j] = \max
\begin{cases}
M[i-k,j] - \mathrm{gap}(k) & \text{for } 1 \le k \le i \\
X[i-k,j] - \mathrm{gap}(k) & \text{for } 1 \le k \le i
\end{cases}
\]

The M Matrix
\[
M[i,j] = \mathrm{match}(i,j) + \max
\begin{cases}
M[i-1,j-1] \\
X[i-1,j-1] \\
Y[i-1,j-1]
\end{cases}
\]

• By definition, the alignment ends in a match (the match(i, j) term).
• Any kind of alignment is allowed before the match (the max over M, X, and Y at [i − 1, j − 1]).

The X (and Y) matrices
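[Figure: x[1..i] aligned with y[1..j], where the last k characters of y (positions j − k + 1 through j) are aligned against a gap of length k in x.]

\[
X[i,j] = \max
\begin{cases}
M[i,j-k] - \mathrm{gap}(k) & \text{for } 1 \le k \le j \\
Y[i,j-k] - \mathrm{gap}(k) & \text{for } 1 \le k \le j
\end{cases}
\]

• k decides how long to make the gap.
• We have to make the whole gap at once in order to know how to score it.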
[Figure: a further gap configuration for x and y, annotated "This case is automatically handled."]
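The recurrences for M, X, and Y with a general gap function can be sketched in Python as follows. This is a minimal illustration, not the course's reference implementation: the function name, the choice to pass `match` as a function of two characters (rather than two indices), and the base case `M[0][0] = 0` with every other cell starting at negative infinity are assumptions.

```python
NEG_INF = float("-inf")


def general_gap_score(x, y, match, gap):
    """Best global alignment score with a general gap cost gap(k).

    match(a, b) scores aligning character a with character b;
    gap(k) is the (positive) cost of a run of k spaces.
    """
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends in match/mismatch
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends in a space in x
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends in a space in y
    M[0][0] = 0  # assumed base case: the empty alignment scores 0

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                M[i][j] = match(x[i - 1], y[j - 1]) + max(
                    M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Gap of length k in x: the last k characters of y face spaces.
            for k in range(1, j + 1):
                X[i][j] = max(X[i][j],
                              M[i][j - k] - gap(k),
                              Y[i][j - k] - gap(k))
            # Gap of length k in y: the last k characters of x face spaces.
            for k in range(1, i + 1):
                Y[i][j] = max(Y[i][j],
                              M[i - k][j] - gap(k),
                              X[i - k][j] - gap(k))

    return max(M[n][m], X[n][m], Y[n][m])


if __name__ == "__main__":
    unit = lambda a, b: 1 if a == b else -1   # hypothetical match scores
    cost = lambda k: 2.0 + 0.5 * k            # hypothetical gap(k)
    print(general_gap_score("AAAGAATTCA", "AAATCA", unit, cost))
```

The two inner loops over k are what make each X and Y cell cost O(n) rather than constant time.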
Running Time for Gap Penalties

Final score is max {M[n,m], X[n,m], Y[n,m]}.
How do you do the traceback?

Runtime:
• Assume |x| = |y| = n for simplicity: 3n² subproblems.
• 2n² subproblems (the entries of X and Y) take O(n) time to solve, because we have to try all k.
• O(n³) total time.

Affine Gap Penalties
• O(n³) for general gap penalties is usually too slow...
• We can still encourage spaces to group together using a special case of general penalties called affine gap penalties:
    gap_start = the cost of starting a gap
    gap_extend = the cost of extending a gap by one more space
• Same idea of using 3 matrices, but now we don't need to search over all gap lengths; we just have to know whether we are starting a new gap or not.

\[
M[i,j] = \mathrm{match}(i,j) + \max
\begin{cases}
M[i-1,j-1] \\
X[i-1,j-1] \\
Y[i-1,j-1]
\end{cases}
\qquad \text{(match between x and y)}
\]

\[
X[i,j] = \max
\begin{cases}
\mathrm{gap\_start} + \mathrm{gap\_extend} + M[i,j-1] \\
\mathrm{gap\_extend} + X[i,j-1] \\
\mathrm{gap\_start} + \mathrm{gap\_extend} + Y[i,j-1]
\end{cases}
\qquad \text{(gap in x)}
\]

\[
Y[i,j] = \max
\begin{cases}
\mathrm{gap\_start} + \mathrm{gap\_extend} + M[i-1,j] \\
\mathrm{gap\_start} + \mathrm{gap\_extend} + X[i-1,j] \\
\mathrm{gap\_extend} + Y[i-1,j]
\end{cases}
\qquad \text{(gap in y)}
\]

If the previous alignment ends in a match, this is a new gap, so the M term pays gap_start + gap_extend. Here gap_start and gap_extend are added to the score, so as penalties they are negative numbers; in terms of the general scheme, gap(k) = −(gap_start + k × gap_extend).

Affine Base Cases
• M[0, i] = "score of best alignment between 0 characters of x and i characters of y that ends in a match" = −∞, because no such alignment can exist.
• X[0, i] = "score of best alignment between 0 characters of x and i characters of y that ends in a gap in x" = gap_start + i × gap_extend, because this alignment looks like:

    ---------
    yyyyyyyyy

• X[i, 0] = "score of best alignment between i characters of x and 0 characters of y that ends in a gap in x" = −∞, because the only such alignment ends in a gap in y:

    xxxxxxxxx
    ---------   ← not allowed

• M[i, 0] = M[0, i], and Y[0, i] and Y[i, 0] are computed using the same logic as X[i, 0] and X[0, i].

Affine Gap Runtime
• 3n² subproblems
• Each one takes constant time
• Total runtime O(n²), the same as the running time of the basic algorithm.

Why do you "need" 3 matrices?
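As a reference point for this question, the affine recurrences and base cases given earlier can be sketched in Python. The function name, passing `match` as a function of two characters, and setting `M[0][0] = 0` (with `X[0][0]` and `Y[0][0]` left at negative infinity) are assumptions; `gap_start` and `gap_extend` are expected to be negative, since the recurrences add them to the score.

```python
NEG_INF = float("-inf")


def affine_gap_score(x, y, match, gap_start, gap_extend):
    """Best global alignment score with affine gap penalties, O(n*m) time."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends in match/mismatch
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends in a gap in x
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends in a gap in y

    M[0][0] = 0  # assumed: the empty alignment scores 0
    for j in range(1, m + 1):
        X[0][j] = gap_start + j * gap_extend  # all of y faces spaces in x
    for i in range(1, n + 1):
        Y[i][0] = gap_start + i * gap_extend  # all of x faces spaces in y

    new_gap = gap_start + gap_extend  # cost of the first space of a gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = match(x[i - 1], y[j - 1]) + max(
                M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(new_gap + M[i][j - 1],
                          gap_extend + X[i][j - 1],
                          new_gap + Y[i][j - 1])
            Y[i][j] = max(new_gap + M[i - 1][j],
                          new_gap + X[i - 1][j],
                          gap_extend + Y[i - 1][j])

    return max(M[n][m], X[n][m], Y[n][m])


if __name__ == "__main__":
    score = lambda a, b: 10 if a == b else -2  # hypothetical match scores
    print(affine_gap_score("CART", "CAT", score, -15, -7))
```

Each cell looks at a constant number of neighbours, which is where the O(n²) total comes from.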
• Alternative WRONG algorithm:

    M[i][j] = max(
        M[i-1][j-1] + cost(x[i], y[j]),
        M[i-1][j] + gap + (gap_start if Arrow[i-1][j] is not already a gap arrow in this direction),
        M[i][j-1] + gap + (gap_start if Arrow[i][j-1] is not already a gap arrow in this direction)
    )

Intuition: we only need to know whether we are starting a gap or extending a gap. The arrows coming out of each subproblem tell us how the best alignment ends, so we can use them to decide if we are starting a new gap.

[Figure: DP table with traceback arrows. One cell is annotated "The best alignment up to this cell ends in a gap." and another "The best alignment up to this cell ends in a match."]

PROBLEM: The best alignment for strings x[1..i] and y[1..j] doesn't have to be used in the best alignment between x[1..i+1] and y[1..j+1].

Why 3 Matrices: Example
match = 10, mismatch = −2, gap = −7, gap_start = −15

    CART
    CA-T     OPT(4, 3) = optimal score = 30 − 15 − 7 = 8

    CARTS
    CA-T-    WRONG(5, 3) = 30 − 15 − 7 − 15 − 7 = −14

    CARTS
    CAT--    OPT(5, 3) = 20 − 2 − 15 − 14 = −11
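The three scores can be re-derived by scoring each alignment column by column. The sketch below is only a checker for a given alignment, not an alignment algorithm; the function name is hypothetical, and the gap placements (CA-T, CA-T-, CAT--) are inferred from the arithmetic shown in the example.

```python
def affine_alignment_score(row_x, row_y, match=10, mismatch=-2,
                           gap=-7, gap_start=-15):
    """Score two equal-length alignment rows under affine gap penalties."""
    score = 0
    prev_gap_row = None  # which row held a space in the previous column
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            gap_row = "x" if a == "-" else "y"
            if gap_row != prev_gap_row:
                score += gap_start  # first space of a new gap
            score += gap            # every space pays the extension cost
            prev_gap_row = gap_row
        else:
            score += match if a == b else mismatch
            prev_gap_row = None
    return score


if __name__ == "__main__":
    for row_x, row_y in (("CART", "CA-T"),
                         ("CARTS", "CA-T-"),
                         ("CARTS", "CAT--")):
        print(row_x, row_y, affine_alignment_score(row_x, row_y))
```

Extending CA-T with one more space opens a second gap and pays gap_start twice, whereas CAT-- gives up one match in exchange for a single gap of length 2.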
This is why we need to keep the X and Y matrices around: they tell us the score of ending with a gap in one of the sequences.

Side Note: Lower Bounds
• Suppose the lengths of x and y are n.
• Clearly, we need at least Ω(n) time to find their global alignment (we have to read the strings!).
• The DP algorithms show global alignment can be done in O(n²) time.
• A trick called the "Four Russians Speedup" can make a similar dynamic programming algorithm run in O(n² / log n) time.
• We won't talk about the Four Russians Speedup, but it's in your book in Sections 7.3 & 7.4.
• Open questions: Can we do better? Can we prove that we can't do better? No one knows...

Recap
• Semiglobal alignment: set the initial columns to 0, or take the maximum over the last row or column.
• Local alignment: extra "0" case.
• General gap penalties require 3 matrices and O(n³) time.
• Affine gap penalties require 3 matrices, but only O(n²) time.

What you should know by now...
• Global & local sequence alignment algorithms with basic gap penalties
• Alignment with general gap penalties
• Alignment with affine gap penalties
• Longest common subsequence
• Dynamic programming framework
...
This note was uploaded on 01/13/2012 for the course CMSC 423 taught by Professor Staff during the Fall '07 term at Maryland.