lecture 4 notes &questions

An Introduction to Bioinformatics Algorithms (Computational Molecular Biology)

Info iconThis preview shows pages 1–2. Sign up to view the full content.

View Full Document Right Arrow Icon

Info iconThis preview has intentionally blurred sections. Sign up to view the full version.

View Full DocumentRight Arrow Icon
This is the end of the preview. Sign up to access the rest of the document.

Unformatted text preview: CSE182 lecture 4 notes &questions Vineet Bafna October 5, 2006 1 Notes Recall that we are interested in computing local alignments of a query string of length n against a subsequence from database. Certainly, we can apply the smith waterman (local alignment) algorithm treating the entire database as a single string of length m , and computing the optimum local alignment. See Problem ?? . The number of steps, from earlier arguments is O ( nm ) . As a rough calculation, suppose,we were querying the entire human genome, against the entire mouse genome implying that n ' m ' 3 10 9 . An full-blown local alignment would require 10 19 steps. Even with a fast computation of 10 10 steps per sec., we would need 10 9 s ( 31 CPU-years) to do the computation. It is worth considering if we can do better. A general approach to this problem is through database filtering . Think of a database filter as a program that rapidly eliminates a large portion of the database without losing any of the similar strings. For example, suppose we had a filter that runs in time O ( m ) (independent of the query size), and rejects all but a fraction f << 1 of the database. Then, by aligning the query only to the filtered sequence , the total running time is reduced to O ( m + fmn ) . Suppose, we had a filter with f = 10- 8 . then, the total running time for the previous query would have 10 9 + 10- 8 10 19 ' 10 11 steps. At 10 10 steps per second, we could do the query in 10 secs. This is the idea that is pursued in Blast. 2 Basics Let us start with the assumption that the database is a random string over the characters { A,C,G,T } , each occurring independently with probability . 25 . Next, assume that the query is a string of k ones, given by q = 111 ... 111 | {z } k We are interested in computing Pr ( q is contained in a database substring ) As it turns out, this is somewhat difficult to compute because of the dependencies between occurrence at different positions. However, given a fixed position i in the database, Pr ( q occurs at position i ) = 1 4 k Therefore, the expected number of occurrences of q = n ( 1 4 ) k . Why?...
View Full Document

This note was uploaded on 02/14/2008 for the course CSE 182 taught by Professor Bafna during the Fall '06 term at UCSD.

Page1 / 3

lecture 4 notes &questions - CSE182 lecture 4 notes...

This preview shows document pages 1 - 2. Sign up to view the full document.

View Full Document Right Arrow Icon
Ask a homework question - tutors are online