791_my_lecture4


7.91 Lecture #4
Database Searching & Molecular Phylogenetics
Michael Yaffe

[Figure: phylogenetic tree of taxa A, B, C and D, shown with its nested notation (((A, B) C) D)]

Outline
  FASTA, Blast searching, Smith-Waterman
  Psi-Blast
  Review of genomic DNA structure
  Substitution patterns and mutation rates
  Synonymous and non-synonymous substitutions
  Jukes-Cantor Model
  Kimura's Two-Parameter Model
  Molecular clocks
  Phylogenetic trees: rooted and unrooted
  Distance matrix methods
  Neighbor-Joining method and related neighbor methods
  Maximum likelihood

Outline (cont.)
  Parsimony
  Branch and bound
  Heuristic searching
  Consensus trees
  Software (PHYLIP, PAUP)
  The Tree of Life

Reading: Mount, p. 237-280, 283-286, 291-308

Database Searching

Problem is simple: I want to find homologues to my protein in the database.
How do I do it?
Do the obvious: compare my protein against every other protein in the database and look for local alignments by dynamic programming.

Uh oh!
[Figure: dynamic-programming matrix with columns 1..n and rows 1..m]
Filling one such matrix is essentially an O(mn) problem.
For k sequences in the database this becomes an O(mnk) problem!

Database Searching

Still, this can be done, at roughly 50x slower than Blast/FASTA:
  Smith-Waterman algorithm
  SSEARCH (ftp.virginia.edu/pub/fasta): do it locally!
But in the old days, a faster method was needed.
Two approaches: Blast and FASTA.
  Both are heuristic (i.e. tried and true): they almost always find related proteins but cannot guarantee the optimal solution.

FASTA: Basic Idea

1- Search for matching sequence patterns or words, called k-tuples, which are exact matches of k characters between the two sequences.
   e.g. RW = 2-tuple
     Seq 1: AHFY RW NKLCV
     Seq 2: D RW NLFCVATYWE

2- Repeat for all possible k-tuples.
   e.g. CV = 2-tuple
     Seq 1: AHFY RW NKL CV
     Seq 2: D RW NLF CV ATYWE

3- Make a hash table (hashing) that has the position of each k-tuple in each sequence, e.g.

     2-tuple   pos. in Seq 1   pos. in Seq 2   Offset (pos1 - pos2)
     RW        5               2               3
     CV        10              7               3
     AH        1               --              --

4- Look for words (k-tuples) with the same offset.
   These are in phase and reveal a region of alignment between the two sequences.

5- Build a local alignment based on these and extend it outwards.
   (Seq 2 is shown shifted right by the shared offset of 3.)
     Seq 1: AHFY RW NKL CV
     Seq 2:    D RW NLF CV ATYWE

Database Searching

With hashing, the number of comparisons is proportional to the average sequence length (i.e. an O(n) problem), not an O(mn) problem as in dynamic programming....
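The k-tuple lookup and offset steps above (steps 1 to 4) can be sketched in a few lines of Python. This is only an illustration of the idea under simple assumptions, not the actual FASTA implementation: the function names ktuple_positions, shared_offsets and best_offset are made up for this sketch, positions are 1-based as in the table, and no scoring or extension (step 5) is attempted.

```python
from collections import Counter, defaultdict


def ktuple_positions(seq, k=2):
    """Hash table: each k-tuple -> list of 1-based start positions in seq."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i + 1)
    return table


def shared_offsets(seq1, seq2, k=2):
    """List (word, pos1, pos2, offset) for every k-tuple found in both sequences."""
    table2 = ktuple_positions(seq2, k)
    hits = []
    for i in range(len(seq1) - k + 1):
        word = seq1[i:i + k]
        pos1 = i + 1
        # .get() avoids inserting empty entries into the defaultdict
        for pos2 in table2.get(word, ()):
            hits.append((word, pos1, pos2, pos1 - pos2))
    return hits


def best_offset(hits):
    """Return (offset, count) for the offset shared by the most hits, or None."""
    counts = Counter(offset for _, _, _, offset in hits)
    return counts.most_common(1)[0] if counts else None


# The two sequences from the slides, written without the highlighting spaces.
seq1 = "AHFYRWNKLCV"
seq2 = "DRWNLFCVATYWE"

hits = shared_offsets(seq1, seq2)
for word, pos1, pos2, offset in hits:
    print(word, pos1, pos2, offset)
print("most common offset:", best_offset(hits))
```

On these two sequences the sketch reports RW (5, 2) and CV (10, 7), both at offset 3, matching the table. It also reports WN (6, 3) at the same offset, an overlapping 2-tuple that the slide's table does not list. Only one sequence needs to be hashed; the other is scanned once against the table, which is where the roughly linear number of lookups comes from.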

This note was uploaded on 11/11/2011 for the course BIO 20.410j taught by Professor Roger D. Kamm during the Spring '03 term at MIT.
