Computational Biology, Part 6: Sequence Database Searching. Robert F. Murphy. Copyright 1996-2009. All rights reserved.



Sequence Analysis Tasks
- Given a query sequence, search for similar sequences in a database.

Global or Local?
- Both local and global alignment methods can be applied to database scanning, but local alignment methods are more useful because they do not assume that the query protein and the database sequence are of similar length.

Efficient database searching methods
- Dynamic programming requires on the order of $N^2 L$ computations (where $N$ is the size of the query sequence and $L$ is the size of the database).
- Given the size of databases, more efficient methods are needed.

"Hit and extend" sequence searching
- Problem: Too many calculations are "wasted" by comparing regions that have nothing in common.
- Initial insight: Regions that are similar between two sequences are likely to share short stretches that are identical.
- Basic method: Look for similar regions only near short stretches that match exactly.
- We define a word (or k-tuple) size that is the minimum number of exact "letter" matches that must occur before we do any further comparison or alignment.
- How do we find all of the occurrences of matching words between a sequence and a database?
  - We could scan the sequence a word at a time, but each such scan is of order L (size of database).

Word searching - hashing
- Solution: Use a precomputed table that lists where in the database each possible word occurs.
  - Generation of the table is of order L (size of database), but use of the table is of order N (size of query sequence).
- The computer science term for this approach is hashing.

Hashing
- Hashing table of size 10
- Hashing function H(x) = x mod 10
- Applet: http://www.engin.umd.umich.edu/CIS/course
- Insertion & Search

Author: R.F. Murphy, Feb. 6, 1995 (revised Feb.
15, 1996)

Demonstration: Hashing algorithm for sequence searching
This demonstration takes a piece of database sequence, calculates hash values for each k-tuple, builds a hash table (listing the positions in the database at which each hash value occurs), and uses a simplified version of the hash table to find the position in the database sequence of the first occurrence of each k-tuple in a query sequence.

Hashing (Demonstration A10)
This section converts each base to a number from 0 to 3 and combines those numbers three at a time to form an integer from 0 to 63 that is unique for each three-base sequence. Each three-base sequence is called a "k-tuple."

Database sequence, with the hash value of the k-tuple starting at each position i:

i     seq(i) as char   seq(i) as int   hash value
1     a                0               6
2     c                1               27
3     g                2               47
4     t                3               63
5     t                3               63
6     t                3               60
7     t                3               48
8     a                0               0
9     a                0               0
10    a                0               1
11    a                0               6
12    c                1               24
13    g                2               33
14    a                0               4
15    c                1               17
16    a                0               5
17    c                1               -
18    c                1               -

Hash table for the database sequence (hash values 0 to 8 shown):

hash value   pos1   pos2   pos3   positions in database   first hit
0            a      a      a      8, 9                    8
1            a      a      c      10                      10
2            a      a      g      -                       not found
3            a      a      t      -                       not found
4            a      c      a      14                      14
5            a      c      c      16                      16
6            a      c      g      1, 11                   1
7            a      c      t      -                       not found
8            a      g      a      -                       not found

FASTA
- Heavily used for searching databases until the advent of BLAST (see below)
- Inputs
  - k (word or k-tuple) size
  - similarity matrix
- Compares the query sequence pairwise with each sequence in the database

FASTA method
- The initial step in the algorithm is to identify all exact matches of length k (k-tuples) or greater between the two sequences.
1. Find diagonals (paired pieces from each sequence without gaps) that have the highest density of common words
2. Rescore these using a scoring (similarity) matrix and trim ends that do not contribute to the highest score
   - Result: partial alignments without gaps
   - Reported as the "init1" score
3. Join regions together, including penalties for gaps
   - Result: unoptimized alignment with gaps
   - Reported as the "initn" score
4.
Use dynamic programming in a band 32 residues wide around the best "initn" score
   - Result: optimized alignment with gaps
   - Reported as the "opt" score

Comments on FASTA
- A larger k-tuple increases speed, since fewer "hits" are found, but it also decreases sensitivity for finding similar but not identical sequences, since exact matches of this length are required.

Limitations of FASTA
- FASTA can miss significant similarity because:
  - For proteins, similar sequences do not have to share identical residues.
    - Asp-Lys-Val is quite similar to Glu-Arg-Ile, yet it is missed even with a k-tuple size of 1 since no amino acid matches.
    - Gly-Asp-Gly-Lys-Gly is quite similar to Gly-Glu-Gly-Arg-Gly, but matches are found only with a k-tuple size of 1.
  - For nucleic acids, due to codon "wobble", DNA sequences may look like XXyXXyXXy, where the X's are conserved and the y's are not.
    - GGuUCuACgAAg and GGcUCcACaAAa both code for the same peptide sequence (Gly-Ser-Thr-Lys), but they do not match with a k-tuple size of 3 or higher.

BLAST (Basic Local Alignment Search Tool)
- Goal: find sequences from the database that are similar to the query sequence
- Previous tools use either
  - a direct, theoretically sound but computationally slow approach that examines all possible alignments of the query with the database (dynamic programming), or
  - an indirect, heuristic but computationally fast approach that finds similar sequences by first finding identical stretches (FASTP, FASTA)
- BLAST combines the best of both by using a theoretically sound method that searches for similar sequences directly yet is computationally fast
- Reference
  - S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman. Basic Local Alignment Search Tool. J. Mol. Biol.
215:403-410 (1990)

BLAST basics
- Need a similarity measure, as in dynamic programming - use PAM-120 for proteins
- Define the maximal segment pair (MSP) to be the highest-scoring pair of identical-length segments chosen from 2 sequences (in FASTA terms, the highest init1 diagonal)
- Define a segment pair to be locally maximal if its score cannot be improved either by extending or by shortening both segments
- Approach: find segment pairs by first finding word pairs that score above a threshold, i.e., find word pairs of fixed length w with a score of at least T
- Key concept: This seems similar to FASTA, but we are searching for words that score at least T rather than words that match exactly

BLAST method for proteins
1. Compile a list of words that give a score of at least T when paired with the query sequence.
   - Example using PAM-120 for the query sequence ACDE:
       ACDE = +3 +9 +5 +5 = 22
     - Try all possibilities:
         AAAA = +3 -3 0 0 = 0 (no good)
         AAAC = +3 -3 0 -7 = -7 (no good)
     - ...too slow, try directed change

Generating word list: ACDE (w=4, T=17)
   ACDE = +3 +9 +5 +5 = 22
- Change the 1st position to all acceptable substitutions:
    gCDE = 1 9 5 5 = 20 ok (=pCDE, sCDE, tCDE)
    nCDE = 0 9 5 5 = 19 ok (=dCDE, eCDE, nCDE, vCDE)
    iCDE = -1 9 5 5 = 18 ok (=qCDE)
    kCDE = -2 9 5 5 = 17 ok (=mCDE)
- Change the 2nd position: not possible, since all alternatives are negative and the other three positions only add up to 13
- Change the 3rd position in combination with the first position:
    gCnE = 1 9 2 5 = 17 ok
- Continue - use recursion
- For "best" values of w and T there are typically about 50 words in the list for every residue in the query sequence

BLAST method for proteins
2. Scan the database for hits with the compiled list of words. Two approaches:
   - Use an index of all possible words (for w=4, this needs an array of size $20^4 = 160{,}000$). Can compress this index using pointers to save space.
   - Use a finite state machine (actually used)
     - Calculate a state transition table that tells what state to go to based on the next character in the sequence
3a. Extend hits to form HSPs (high-scoring segment pairs)
3b. BLAST2, or gapped BLAST, uses an approach similar to FASTA to combine hits before trying to extend them as in 3a.
4. Compare the score for each HSP to a threshold S to decide whether to keep it
5. Proceed to estimating statistical significance (see below)

BLAST Method for DNA
1. Make a list of all contiguous w-mers in the query sequence (often w=12)
2. Compress the database by packing 4 nucleotides into a single byte (use an auxiliary table to record where sequences start and stop within the compressed database)
   - This does not allow for unspecified bases (wildcards)
3. Compress the w-mers from the query sequence the same way.
4. Search the compressed database for matches with the compressed w-mers
   - Since all frames of the query sequence are considered separately, any match of length w>=11 must contain a match of length 8 that lies on a byte boundary of one of the w-mers from the query sequence. The search can therefore scan a (packed) byte at a time, improving speed 4-fold over comparing one nucleotide at a time.

BLAST Method for DNA
- Problem: if the query sequence has a stretch of unusual base composition (e.g., A-T rich) or a repeated sequence element (e.g., Alu sequence), there will be many hits with "uninteresting" regions.
- Solution:
  - During compression of the database, tabulate the frequencies of all 8-tuples.
  - Make a list of those occurring very frequently (much more frequently than expected by chance).
  - Remove these words from the query list of w-mers before searching the database.
  - Remove words matching a sublibrary of repeated sequences (but report the matches to that sublibrary when done).
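The packing and byte-at-a-time scan described in steps 2 to 4 of the DNA method can be sketched roughly as follows. This is only an illustration under stated assumptions, not the BLAST implementation: the names `pack` and `seed_hits`, the 2-bit code (a=0, c=1, g=2, t=3, borrowed from the hashing demonstration), the choice to put the first base in the high-order bits, the padding of a trailing partial byte, and the use of a Python dictionary as the lookup structure are all hypothetical choices made for the example.

```python
# Assumed 2-bit code per base (same numbering as the hashing demonstration).
CODE = {"a": 0, "c": 1, "g": 2, "t": 3}

def pack(seq):
    """Pack 4 nucleotides per byte, first base in the high-order bits.

    A trailing partial group is padded with 'a' (code 0), so the true
    sequence length has to be tracked separately. Any character other
    than a, c, g, t raises KeyError: the packed form has no wildcards.
    """
    out = bytearray()
    for i in range(0, len(seq), 4):
        group = seq[i:i + 4].lower().ljust(4, "a")
        byte = 0
        for base in group:
            byte = (byte << 2) | CODE[base]
        out.append(byte)
    return bytes(out)

def seed_hits(database, query):
    """Return (database_position, query_position) pairs, both 0-based,
    where two byte-aligned database bytes equal a packed query 8-mer."""
    packed_db = pack(database)
    # Index every 8-mer of the query, i.e. all four reading frames.
    index = {}
    for q in range(len(query) - 7):
        index.setdefault(pack(query[q:q + 8]), []).append(q)
    hits = []
    full_bytes = len(database) // 4  # ignore a padded final byte
    # Step through the database one packed byte (4 bases) at a time.
    for j in range(full_bytes - 1):
        for q in index.get(packed_db[j:j + 2], []):
            hits.append((4 * j, q))
    return hits

database = "gggacgttgcaacgtg"
query = "acgttgcaacg"          # occurs in the database starting at position 3
print(seed_hits(database, query))
```

In this toy case the 11-base match starts at database position 3, which is not on a byte boundary, yet the scan still finds it through the aligned 8-mer at database position 4 (query position 1). A real search would then extend each seed in both directions.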
BLAST Statistical significance
- A key to the utility of BLAST is the ability to calculate expected probabilities of occurrence of Maximal Segment Pairs (MSPs) given w and T
- This allows BLAST to rank matching sequences in order of "significance" and to cut off listings at a user-specified probability
- From the Karlin-Altschul formulation, the expected value (mean) of the HSP scores between a query of length $m$ and a set of random sequences of length $n$ is
    $u \cong [\log_e(Kmn)]/\lambda$, or $u \cong [\ln(Kmn)]/\lambda$
- BLAST uses a correction to this formulation that takes into account the effective sequence lengths of the query and the database sequences:
    $u = [\ln(K m' n')]/\lambda$
- The corrected lengths are given by
    $m' = m - \ln(Kmn)/H$
    $n' = n - \ln(Kmn)/H$
  with $H = \ln(Kmn)/\ell$, where $\ell$ is the average length of the alignment that can be achieved between random sequences of length $m$ and $n$
- Given $u$, we can calculate the probability $p$ of observing a score $S$ between a query sequence and a given database sequence that is equal to or greater than $x$:
    $p(S \ge x) = 1 - \exp(-e^{-\lambda(x - u)})$
- Lastly, we have to consider that we are searching many database sequences and can expect even a relatively rare score to occur with high probability given enough comparisons
- For a database of $D$ sequences, this is
    $E \approx 1 - e^{-p(S \ge x) D}$

Summary of Database Search Methods

Authors (Program)            Description
Needleman & Wunsch           full alignment
Wilbur & Lipman              match k-tuple - form diag - NW
Lipman & Pearson (FASTP)     k-tuple - diag - rescore
Pearson & Lipman (FASTA)     FASTP - join diags - NW
Altschul et al (BLAST)       word match list - statistics

Reading for next class
- Paper by Grundy and Bailey ...
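The significance formulas from the BLAST Statistical significance slides can be evaluated numerically with a few lines of Python. This is a minimal sketch for illustration only: the function names are hypothetical, and the values used for K, lambda, H, the sequence lengths, and the database size D are made-up placeholders, not parameters taken from the lecture or from any real scoring matrix.

```python
import math

def effective_lengths(m, n, K, H):
    """Corrected lengths m' = m - ln(Kmn)/H and n' = n - ln(Kmn)/H."""
    ell = math.log(K * m * n) / H  # average length of a chance alignment
    return m - ell, n - ell

def characteristic_score(m_eff, n_eff, K, lam):
    """u = ln(K m' n') / lambda."""
    return math.log(K * m_eff * n_eff) / lam

def p_at_least(x, u, lam):
    """p(S >= x) = 1 - exp(-exp(-lambda (x - u)))."""
    # -expm1(-y) equals 1 - exp(-y) but is more accurate for small y.
    return -math.expm1(-math.exp(-lam * (x - u)))

def database_wide(p, D):
    """1 - exp(-p D): chance of such a score somewhere among D sequences."""
    return -math.expm1(-p * D)

# Placeholder values, chosen only to exercise the formulas.
K, lam, H = 0.1, 0.3, 0.5
m, n, D = 250, 400, 100000

m_eff, n_eff = effective_lengths(m, n, K, H)
u = characteristic_score(m_eff, n_eff, K, lam)
for x in (40, 60, 80):
    p = p_at_least(x, u, lam)
    print(x, p, database_wide(p, D))
```

With any such parameters, a score equal to u has single-comparison probability 1 - 1/e, and the probability falls off roughly exponentially as the score rises above u, which is why multiplying the number of comparisons by D matters so much for borderline scores.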

This note was uploaded on 01/13/2012 for the course BIO 101 taught by Professor Staff during the Fall '10 term at DePaul.
