CSE182 lecture 4 notes & questions
Vineet Bafna
October 5, 2006

1 Notes

Recall that we are interested in computing local alignments of a query string of length $n$ against a substring of the database. Certainly, we can apply the Smith-Waterman (local alignment) algorithm, treating the entire database as a single string of length $m$, and compute the optimum local alignment. See Problem ??. The number of steps, from earlier arguments, is $O(nm)$. As a rough calculation, suppose we were querying the entire human genome against the entire mouse genome, implying that $n \simeq m \simeq 3 \times 10^{9}$. A full-blown local alignment would require about $10^{19}$ steps. Even with a fast computation of $10^{10}$ steps per second, we would need $10^{9}$ s (about 31 CPU-years) to do the computation. It is worth considering whether we can do better.

A general approach to this problem is through database filtering. Think of a database filter as a program that rapidly eliminates a large portion of the database without losing any of the similar strings. For example, suppose we had a filter that runs in time $O(m)$ (independent of the query size) and rejects all but a fraction $f \ll 1$ of the database. Then, by aligning the query only to the filtered sequence, the total running time is reduced to $O(m + fmn)$. Suppose we had a filter with $f = 10^{-8}$. Then the total running time for the previous query would be $10^{9} + 10^{-8} \cdot 10^{19} \simeq 10^{11}$ steps. At $10^{10}$ steps per second, we could do the query in 10 seconds. This is the idea that is pursued in BLAST.

2 Basics

Let us start with the assumption that the database is a random string over the characters $\{A, C, G, T\}$, each occurring independently with probability $0.25$. Next, assume that the query is a string of $k$ ones, given by

$$q = \underbrace{111 \ldots 111}_{k}$$

We are interested in computing

$$\Pr(q \text{ is contained in a database substring})$$

As it turns out, this is somewhat difficult to compute because of the dependencies between occurrences at different positions.
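However, given a fixed position $i$ in the database, $\Pr(q \text{ occurs at position } i) = \left(\frac{1}{4}\right)^{k}$. Therefore, the expected number of occurrences of $q$ is $n \left(\frac{1}{4}\right)^{k}$, where $n$ here denotes the length of the database. Why?...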
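
As a rough numerical check of the per-position probability and the expected count above, the following Python sketch compares the formula with a small simulation. The function names, the use of "A" repeated k times as a stand-in for the fixed query, and the trial counts are all illustrative assumptions, not part of the notes. The sketch counts start positions exactly as db_len - k + 1, which the expression above approximates by the database length.

```python
import random


def expected_occurrences(db_len, k, alphabet_size=4):
    """Expected number of start positions at which a fixed k-mer occurs
    in a uniformly random string of length db_len."""
    positions = max(db_len - k + 1, 0)  # valid start positions
    return positions / alphabet_size ** k


def count_occurrences(db, q):
    """Count occurrences of q in db, allowing overlaps."""
    return sum(1 for i in range(len(db) - len(q) + 1) if db.startswith(q, i))


def simulate_mean(db_len, q, trials, seed=0):
    """Average occurrence count of q over random DNA strings (assumed model:
    independent, uniform characters from ACGT)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        db = "".join(rng.choices("ACGT", k=db_len))
        total += count_occurrences(db, q)
    return total / trials


if __name__ == "__main__":
    k, db_len = 4, 2000
    query = "A" * k  # hypothetical stand-in for the fixed query of length k
    print("formula:   ", expected_occurrences(db_len, k))
    print("simulation:", simulate_mean(db_len, query, trials=200))
```

The two printed values should be close but not identical, since the simulation is a finite sample. Occurrences at neighbouring positions are not independent (a run of A's contributes several overlapping matches), yet the average still tracks the formula, because the expectation only adds up the per-position probabilities.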
This note was uploaded on 02/14/2008 for the course CSE 182 taught by Professor Bafna during the Fall '06 term at UCSD.