CSE182 lecture 4 notes & questions
Vineet Bafna
October 5, 2006
1 Notes
Recall that we are interested in computing local alignments of a query string of length $n$ against a subsequence from a database. Certainly, we can apply the Smith-Waterman (local alignment) algorithm, treating the entire database as a single string of length $m$ and computing the optimum local alignment. See Problem ??. The number of steps, from earlier arguments, is $O(nm)$. As a rough calculation, suppose we were querying the entire human genome against the entire mouse genome, implying that $n \approx m \approx 3 \times 10^{9}$. A full-blown local alignment would require $\sim 10^{19}$ steps. Even with a fast computation of $10^{10}$ steps per second, we would need $10^{9}$ s ($\approx 31$ CPU-years) to do the computation. It is worth considering if we can do better.
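The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the function name `alignment_cost` is made up for illustration, and the constant hidden inside $O(nm)$ is assumed to be 1.

```python
# Rough cost of a full O(nm) local alignment, using the figures quoted above.
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # about 3.16e7 seconds

def alignment_cost(n, m, steps_per_sec):
    """Return (steps, seconds, cpu_years), taking the hidden O(.) constant as 1."""
    steps = n * m
    seconds = steps / steps_per_sec
    return steps, seconds, seconds / SECONDS_PER_YEAR

# Human vs. mouse genome: n and m are both about 3e9, at 1e10 steps per second.
steps, seconds, years = alignment_cost(3e9, 3e9, 1e10)
print(f"{steps:.1e} steps, {seconds:.1e} s, {years:.1f} CPU-years")
```

With the unrounded product $9 \times 10^{18}$ this comes to a little under 30 CPU-years; rounding up to $10^{19}$ steps gives the $10^{9}$ s, or roughly 31 to 32 CPU-years, quoted in the text.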
A general approach to this problem is through database filtering. Think of a database filter as a program that rapidly eliminates a large portion of the database without losing any of the similar strings. For example, suppose we had a filter that runs in time $O(m)$ (independent of the query size) and rejects all but a fraction $f \ll 1$ of the database. Then, by aligning the query only to the filtered sequence, the total running time is reduced to $O(m + fmn)$. Suppose we had a filter with $f = 10^{-8}$. Then the total running time for the previous query would be $\sim 10^{9} + 10^{-8} \cdot 10^{19} \approx 10^{11}$ steps. At $10^{10}$ steps per second, we could do the query in about 10 seconds. This is the idea that is pursued in BLAST.
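The effect of the filter on the step count can be checked the same way. The sketch below assumes constants of 1 in both the $O(m)$ filter pass and the $O(fmn)$ alignment; `filtered_cost` is a hypothetical helper, not part of any BLAST implementation.

```python
def filtered_cost(n, m, f):
    """Steps for an O(m) filter pass plus an O(nm) alignment on a fraction f of the database."""
    return m + f * n * m

n = m = 3e9          # query and database lengths from the running example
f = 1e-8             # fraction of the database that survives the filter
rate = 1e10          # steps per second, as assumed in the text

unfiltered = n * m
filtered = filtered_cost(n, m, f)
print(f"filtered: {filtered:.1e} steps, {filtered / rate:.1f} s")
print(f"speed-up over full alignment: {unfiltered / filtered:.1e}x")
```

Note that for these values the alignment term $fmn$ still dominates the filter term $m$, so a more selective filter would keep paying off until $fn$ drops to around 1.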
2 Basics
Let us start with the assumption that the database is a random string over the characters $\{A, C, G, T\}$, each occurring independently with probability $0.25$. Next, assume that the query is a string of $k$ ones, given by
$$q = \underbrace{111 \ldots 111}_{k}$$
We are interested in computing
$$\Pr(q \text{ is contained in a database substring})$$
As it turns out, this is somewhat difficult to compute because of the dependencies between occurrences at different positions. However, given a fixed position $i$ in the database,
$$\Pr(q \text{ occurs at position } i) = \left(\frac{1}{4}\right)^{k}$$
Therefore, the expected number of occurrences of $q$ is $n \left(\frac{1}{4}\right)^{k}$, where $n$ here counts the database positions. Why?
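Before working through the argument, the claim can be checked empirically. The following is a minimal simulation sketch, not part of the original notes. It assumes the query is the fixed string `"A" * k` over the DNA alphabet (a stand-in for the all-ones query), counts overlapping occurrences, and uses the exact number of start positions, $n - k + 1$, which is close to $n$ when $k \ll n$.

```python
import random

def count_occurrences(text, q):
    """Count all start positions (overlaps allowed) where q occurs in text."""
    k = len(q)
    return sum(1 for i in range(len(text) - k + 1) if text[i:i + k] == q)

def mean_occurrences(n, q, trials, seed=0):
    """Average number of occurrences of q over random uniform DNA strings of length n."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        text = "".join(rng.choices("ACGT", k=n))
        total += count_occurrences(text, q)
    return total / trials

n, k = 5000, 4
q = "A" * k                           # assumed stand-in for the all-ones query
expected = (n - k + 1) * 0.25 ** k    # (number of start positions) * (1/4)^k
observed = mean_occurrences(n, q, trials=200)
print(f"expected {expected:.2f}, observed {observed:.2f}")
```

Occurrences of a repeated-letter query overlap heavily, so the counts at neighbouring positions are far from independent, yet the observed average should still land close to the expected value.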
2.1 Basic probability
To see this, define an indicator variable $X_i$ for all positions $1 \le i \le n$.
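One standard way to complete the argument is by linearity of expectation. The sketch below assumes that $X_i$ indicates an occurrence of $q$ at position $i$, which is the natural reading of the definition above but is not stated explicitly here.

```latex
\[
X_i =
\begin{cases}
1 & \text{if } q \text{ occurs at position } i,\\
0 & \text{otherwise,}
\end{cases}
\qquad
E[X_i] = \Pr(X_i = 1) = \left(\frac{1}{4}\right)^{k}.
\]
\[
E\!\left[\sum_{i=1}^{n} X_i\right]
  = \sum_{i=1}^{n} E[X_i]
  = n \left(\frac{1}{4}\right)^{k}.
\]
```

The middle equality holds whether or not the $X_i$ are independent, which is why the expected count is easy to compute even though the probability that $q$ occurs somewhere in the database is not.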