Chapter 3
Computational Molecular Biology
Michael Smith
firstname.lastname@example.org

Sequence Comparison
  Sequence comparison is the most important operation in computational biology
  Consists of finding which parts of the sequences are alike and which parts differ

Similarity and Alignment
  Similarity
    Gives a measure of how similar sequences are
  Alignment
    A way of placing sequences one above the other in order to make clear the correspondence between similar characters or substrings

Sequence Comparison
  Want best alignment between two or more sequences
  Global Comparison
    Alignment involving entire sequences
  Local Comparison
    Alignment involving substrings
  Semiglobal Comparison
    Aligning prefixes and suffixes of the sequences
  All can be solved by Dynamic Programming

Global Comparison
  Consider the following DNA sequences
    GACGGATTAG
    GATCGGAATAG
  Are they similar? After alignment, similarities are more obvious
    GA-CGGATTAG
    GATCGGAATAG

Alignment and Score
  Alignment, more precise definition
    Insertion of spaces in arbitrary locations along the sequences so that they end up with the same size
    No column can be entirely composed of spaces
  Measure of similarity
    Each column receives +1 for a match, -1 for a mismatch, or -2 for a space
    Sum the column values to get the score
    For the alignment above: 9 matches, 1 mismatch, and 1 space give 9 - 1 - 2 = 6

Dynamic Programming
  Solving an instance of a problem by taking advantage of already computed solutions for smaller instances of the problem
  Main algorithmic approach used in sequence alignment
  Figures 3.1, 3.2

Optimal Alignments
  From Figure 3.1, start at (m,n) and follow arrows to (0,0)
  Each arrow gives one column of the alignment
    If arrow is horizontal, it corresponds to a column with a space in s matched with t[j]
    If arrow is vertical, it corresponds to s[i] matched with a space in t
    If arrow is diagonal, s[i] is matched with t[j]
  Many alignments are possible, depending on which arrow is given priority

Local Comparison
  A local alignment between s and t is an alignment between a substring of s and a substring of t
  Goal: find the highest scoring local alignment between two sequences
  Variation of basic algorithm (Figure 3.2)
  Each entry (i,j) holds the highest score of an alignment between a suffix of s[1..i] and a suffix of t[1..j] (page 55)

Semiglobal Comparison
  Score alignments ignoring some of the end spaces in the sequences
  End spaces are those that appear before the first or after the last character in a sequence
  For example,
    CAGCA-CTTGGATTCTCGG
    ---CAGCGTGG--------
  If we aligned the sequences in the usual way, then
    CAGCACTTGGATTCTCGG
    CAGC-----G-T----GG

Extensions to Basic Algorithm
  Basic algorithm runs in O(mn) time and uses O(mn) space
  Possible to improve space complexity from quadratic to linear at the expense of roughly doubling processing time
  Can be accomplished by using a Divide and Conquer strategy
    Divide the problem into small subproblems and later combine the solutions to obtain a solution for the whole problem

Gap Penalty Functions
  A gap is a run of consecutive spaces
  When mutations occur, a single block of spaces is more likely than a series of isolated spaces
  The column-by-column scoring method discussed previously is not appropriate in this case
  For example,
    A------ATTCCTTCCTTCC
    AAAGAGAATTCCTTCCTTCC
  Scoring is done at a block level, not a column level
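
The difference between column-level and block-level scoring can be illustrated with a small sketch. The slides do not give a specific gap penalty function, so the affine form w(k) = gap_open + gap_extend * k and the values gap_open = 2, gap_extend = 1 are assumptions chosen only for illustration; the function names are hypothetical as well.

```python
def column_score(s, t, match=1, mismatch=-1, space=-2):
    """Basic scheme: every column is scored on its own."""
    total = 0
    for a, b in zip(s, t):
        if a == "-" or b == "-":
            total += space
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


def block_score(s, t, match=1, mismatch=-1, gap_open=2, gap_extend=1):
    """Block scheme with an ASSUMED affine penalty w(k) = gap_open + gap_extend * k.

    A run of k consecutive spaces in the same row is charged once as a block.
    """
    total = 0
    prev = None  # row that held a space in the previous column: "s", "t", or None
    for a, b in zip(s, t):
        if a == "-" or b == "-":
            row = "s" if a == "-" else "t"
            if row != prev:
                total -= gap_open   # a new gap starts here
            total -= gap_extend     # every space in the gap pays the extension cost
            prev = row
        else:
            total += match if a == b else mismatch
            prev = None
    return total


s = "A------ATTCCTTCCTTCC"
t = "AAAGAGAATTCCTTCCTTCC"
print(column_score(s, t))  # 14 matches and 6 isolated-space charges
print(block_score(s, t))   # 14 matches and one gap of length 6
```

Under these assumed parameters the single gap of six spaces costs 8 instead of 12, so the same alignment scores 6 at the block level versus 2 at the column level.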
  Blocks of this alignment:
    A   ------   ATTCCTTCCTTCC
    A   AAGAGA   ATTCCTTCCTTCC

Multiple Sequences
  Multiple sequence alignment is a generalization of the two sequence case
  Multiple alignment of s1, s2, ..., sk is obtained by inserting spaces in the sequences in such a way as to make them all the same size
  No column is made entirely of spaces
  Figure 3.10

Scoring Multiple Sequences
  Need a function that takes amino acid sequences as input and returns a score
  The function must have two properties
    Must be independent of the order of its arguments. For example, if a column has I,V,-, the same score should be produced if the order is -,V,I
    Should reward the presence of many equal residues and penalize unequal residues and spaces

Sum-of-Pairs (SP)
  Sum-of-Pairs (SP) satisfies the properties
  Sum of pairwise scores of all pairs of symbols in a column
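
The sum-of-pairs definition can be sketched in a few lines. The pairwise scores used here (+1 match, -1 mismatch, -2 for a symbol against a space) are borrowed from the two-sequence scheme purely for illustration, and scoring a pair of spaces as 0 is an assumption; a real protein comparison would more likely use a substitution matrix. The function names are hypothetical.

```python
from itertools import combinations


def p(a, b, match=1, mismatch=-1, space=-2):
    """Pairwise score of two symbols (illustrative values, not from a real matrix)."""
    if a == "-" and b == "-":
        return 0  # ASSUMPTION: two spaces facing each other contribute nothing
    if a == "-" or b == "-":
        return space
    return match if a == b else mismatch


def sp_column(column):
    """Sum of pairwise scores over all unordered pairs of symbols in one column."""
    return sum(p(a, b) for a, b in combinations(column, 2))


def sp_alignment(rows):
    """SP score of a whole multiple alignment given as equal-length strings."""
    return sum(sp_column(col) for col in zip(*rows))


print(sp_column("I-IV"))
print(sp_column("VI-I"))  # same symbols in another order, same score
```

Because every unordered pair is counted exactly once, reordering the symbols in a column cannot change the result, which is the first of the two required properties.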
    SP-score(I,-,I,V) = p(I,-) + p(I,I) + p(I,V) + p(-,I) + p(-,V) + p(I,V)
  where p(a,b) is the pairwise score of a and b

Algorithm Paradigm
  Dynamic programming is used again
  Basic algorithm can be used, but there will be problems
    In the two sequence case, complexity is O(n^2)
    For the k sequence case, complexity is O(n^k)
    Can take a really long time if k is large
  Must reduce the number of cells to compute
  Apply a heuristic to reduce the number of computed cells

Star Alignments
  Building a multiple alignment based on pairwise alignments between a fixed sequence and all others
  Fixed sequence is the center of the star
  Example
    a = ATTGCCATT
    b = ATGGCCATT
    c = ATCCAATTTT
    d = ATCTTCTT
    e = ACTGACC
  Select a as the center of the star
  Align a with b, a with c, a with d, a with e

    ATTGCCATT
    ATGGCCATT

    ATTGCCATT--
    ATC-CAATTTT

    ATTGCCATT
    ATCTTC-TT

    ATTGCCATT
    ACTGACC--

  Combine results

    ATTGCCATT--
    ATGGCCATT--
    ATC-CAATTTT
    ATCTTC-TT--
    ACTGACC----

Database Search
  Databases exist for searching and comparing protein and DNA sequences
  Methods described work, but may take too long and be impractical for searching large databases
  Novel and faster methods have been developed

PAM Matrix
  When scoring protein sequences, the +1, -1, -2 scheme may not be sufficient
  Amino acids have properties that influence the likelihood that they will be substituted in an evolutionary scenario
  Point Accepted Mutations
  A 1-PAM matrix is suitable for comparing sequences that are 1 unit of evolution apart
  A 250-PAM matrix is suitable for comparing sequences that are 250 units of evolution apart
  Markovian in nature
  Need the probability of occurrence for each amino acid
  Probability transition matrix
  Score matrix

BLAST
  Among the most frequently used programs for searching sequence databases
  Acronym for Basic Local Alignment Search Tool
  Returns a list of high scoring segment pairs between the query sequence and sequences in the database
  http://www.ncbi.nlm.nih.gov

FAST
  Another family of programs for sequence database search
  http://www.rcsb.org/pdb/index.html

BLAST and FAST use PAM matrices ...
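
Looking back at the global comparison from the start of the chapter, the dynamic programming idea can be sketched as follows. This is a minimal illustration using the +1 / -1 / -2 column scores, not a reproduction of the textbook's figures; the function name and the tie-breaking order in the traceback (diagonal first, then vertical, then horizontal) are assumptions, and a different priority may return a different but equally good alignment.

```python
def global_align(s, t, match=1, mismatch=-1, space=-2):
    """Fill the (m+1) x (n+1) score array, then trace back from (m, n) to (0, 0)."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * space            # prefix of s aligned against spaces only
    for j in range(1, n + 1):
        a[0][j] = j * space            # prefix of t aligned against spaces only

    def p(i, j):
        return match if s[i - 1] == t[j - 1] else mismatch

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a[i][j] = max(a[i - 1][j - 1] + p(i, j),   # diagonal: s[i] with t[j]
                          a[i - 1][j] + space,         # vertical: s[i] with a space
                          a[i][j - 1] + space)         # horizontal: space with t[j]

    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + p(i, j):
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and a[i][j] == a[i - 1][j] + space:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return a[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, row_s, row_t = global_align("GACGGATTAG", "GATCGGAATAG")
print(score)
print(row_s)
print(row_t)
```

For the two DNA sequences used in the global comparison example, the optimal score is 6, matching the hand-scored alignment. Both loops visit every cell once, which is where the O(mn) time and space of the basic algorithm come from.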
This note was uploaded on 02/10/2012 for the course CSE 5615 taught by Professor Mitra during the Fall '11 term at FIT.