LecturesPart04

LecturesPart04 - Computational Biology, Part 4 Similarity...

Info iconThis preview shows page 1. Sign up to view the full content.

View Full Document Right Arrow Icon
This is the end of the preview. Sign up to access the rest of the document.

Unformatted text preview: Computational Biology, Part 4 Similarity Matrices/Statistics of Pattern Appearance Robert F. Murphy Copyright © 1996-2009. Copyright All rights reserved. Deriving and Using Similarity Matrices Origin of PAM matrices s s s Take aligned set of closely related proteins x 71 groups of proteins that were at least 85% 71 similar similar Each group of sequences were organized into a Each phylogenetic tree phylogenetic x Creates a model of the order in which Creates substitutions occurred substitutions Count the number of changes of each amino acid Count into every other amino acid into x Each substitution is considered to be an Each “accepted mutation” - an amino acid change “accepted” by natural selection “accepted” Origin of PAM matrices s s s For each group of proteins, find the “exposure to For mutation” for each amino acid. Product of x the frequency of each amino acid in that group x the number of all amino acid changes per 100 the residues (total number of amino acid changes divided by the combined length of all sequences in that group, then times 100) in For each group, divide counts of changes for each For amino acid pair by the exposure to mutation of the “original” amino acid “original” Average these across all groups to create PAM1 Average matrix (Point Accepted Mutation at 1% change) matrix Origin of PAM matrices s s s s This table is equivalent to a transition matrix for a This first-order Markov model for protein sequence evolution with a 1% overall probability of change evolution Appropriate for comparing sequences separated by Appropriate an evolutionary distance that would yield changes in 1% of the positions in Note that PAM1 is not symmetric To compare sequences across greater distances, To can multiply the PAM1 matrix by itself (if Markov model is correct) model Origin of PAM matrices s s s s Squaring PAM1 considers all the ways that an Squaring “original” amino acid may have changed over two steps of 1% mutation rate each steps For staying the same, sum probability that it didn’t For change in first step times probability that it didn’t change in second step plus product of all the probability of all changes in first step times probability of changing back probability For changing from x -> y, consider sum of For consider products of all the changes that could have happened in first step (x -> z) times probability of happened times changing from that into y (z -> y) changing This gives PAM2 (still not symmetric!) Origin of PAM matrices s Can raise PAM1 to any power (e.g., Can PAM250) PAM250) s Major effect of raising PAM matrix to a Major power is to decrease the probability that a particular amino acid is unchanged (and increase the probabilities for it to change into all others) into Use of PAM matrices s s Sum of the product of diagonal elements times Sum overall frequency of each amino acid gives expected degree of similarity between two proteins sequences that are separated by an evolutionary distance corresponding to the power of the matrix distance x For PAM60 this is 60% x For PAM80 it is 50% x For PAM120 it is 40% Can choose a matrix based on an estimate of the Can similarity between proteins (but this might require an alignment!) an Use of PAM matrices s s s s When aligning two sequences, don’t know which When amino acid is the original and which is the change amino Therefore average element (a,b) with element (b,a) Therefore to get symmetric matrix to Then convert to odds score by dividing by Then probability that change would occur randomly probability Then convert to log base 2 Dayhoff PAM250 similarity matrix in log-odds form A A C D E F G H I K L M N P Q R S T V W Y C D E F G H I K L M N P Q R S T V W Y 2 -2 0 0 -4 1 -1 -1 -1 -2 -1 0 1 0 -2 1 1 0 -6 -3 -2 12 -5 -5 -4 -3 -3 -2 -5 -6 -5 -4 -3 -5 -4 0 -2 -2 -8 0 0 -5 4 3 -6 1 1 -2 0 -4 -3 2 -1 2 -1 0 0 -2 -7 -4 0 -5 3 4 -5 0 1 -2 0 -3 -2 1 -1 2 -1 0 0 -2 -7 -4 -4 -4 -6 -5 9 -5 -2 1 -5 2 0 -4 -5 -5 -4 -3 -3 -1 0 7 1 -3 1 0 -5 5 -2 -3 -2 -4 -3 0 -1 -1 -3 1 0 -1 -7 -5 -1 -3 1 1 -2 -2 6 -2 0 -2 -2 2 0 3 2 -1 -1 -2 -3 0 -1 -2 -2 -2 1 -3 -2 5 -2 2 2 -2 -2 -2 -2 -1 0 4 -5 -1 -1 -5 0 0 -5 -2 0 -2 5 -3 0 1 -1 1 3 0 0 -2 -3 -4 -2 -6 -4 -3 2 -4 -2 2 -3 6 4 -3 -3 -2 -3 -3 -2 2 -2 -1 -1 -5 -3 -2 0 -3 -2 2 0 4 6 -2 -2 -1 0 -2 -1 2 -4 -2 0 -4 2 1 -4 0 2 -2 1 -3 -2 2 -1 1 0 1 0 -2 -4 -2 1 -3 -1 -1 -5 -1 0 -2 -1 -3 -2 -1 6 0 0 1 0 -1 -6 -5 0 -5 2 2 -5 -1 3 -2 1 -2 -1 1 0 4 1 -1 -1 -2 -5 -4 -2 -4 -1 -1 -4 -3 2 -2 3 -3 0 0 0 1 6 0 -1 -2 2 -4 1 0 0 0 -3 1 -1 -1 0 -3 -2 1 1 -1 0 2 1 -1 -2 -3 1 -2 0 0 -3 0 -1 0 0 -2 -1 0 0 -1 -1 1 3 0 -5 -3 0 -2 -2 -2 -1 -1 -2 4 -2 2 2 -2 -1 -2 -2 -1 0 4 -6 -2 -6 -8 -7 -7 0 -7 -3 -5 -3 -2 -4 -4 -6 -5 2 -2 -5 -6 17 0 -3 0 -4 -4 7 -5 0 -1 -4 -1 -2 -2 -5 -4 -4 -3 -3 -2 0 10 Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr Dayhoff, M., Schwartz, R.M. and Orcutt, B.C. (1978) in: Dayhoff M, ed. Atlas of protein sequence and structure, vol 5 (suppl 3) . Silver Spring, MD: NBRF. 345-352. Dayhoff PAM250 similarity matrix (partial) A A C D E F G H I K L M N C D E F G H I K L M N P 2 -2 0 0 -4 1 -1 -1 -1 -2 -1 0 -2 12 -5 -5 -4 -3 -3 -2 -5 -6 -5 -4 0 -5 4 3 -6 1 1 -2 0 -4 -3 2 0 -5 3 4 -5 0 1 -2 0 -3 -2 1 -4 -4 -6 -5 9 -5 -2 1 -5 2 0 -4 1 -3 1 0 -5 5 -2 -3 -2 -4 -3 0 -1 -3 1 1 -2 -2 6 -2 0 -2 -2 2 -1 -2 -2 -2 1 -3 -2 5 -2 2 2 -2 -1 -5 0 0 -5 -2 0 -2 5 -3 0 1 -2 -6 -4 -3 2 -4 -2 2 -3 6 4 -3 -1 -5 -3 -2 0 -3 -2 2 0 4 6 -2 0 -4 2 1 -4 0 2 -2 1 -3 -2 2 1 -3 -1 -1 -5 -1 0 -2 -1 -3 -2 -1 Updated PAM matrices s PAM matrices calculated in 1978 when far PAM fewer protein sequences were available fewer s Similar method used in 1991 to create new Similar matrices, e.g., JTT250 matrices, s These should be more accurate BLOSUM62 matrix s An alternative that does not involve an An explicit model of evolutionary change is to just find blocks of similar sequence and count how often one amino acid is substituted by another substituted s Good for general comparisons without Good consideration of evolutionary distance consideration Origin of BLOSUM matrices s Start with groupings (families) of proteins Start that have similar biochemical function (the Prosite catalog) Prosite s Find ungapped amino acid patterns (blocks) Find that were present in each family that s Search protein sequence database using Search these blocks to add previously unrecognized members of each family members Origin of BLOSUM matrices s s s s s For each block, count number of changes between For each amino acid pair each No assumption of “original” amino acid so No consider all observed pairs equally consider Group blocks by the amount of similarity Average changes within each group and convert to Average log odds form log Results in series of matrices, e.g., BLOSUM60 is Results for blocks that are 60% identical) for BLOSUM62 in log-odds form A A C D E F G H I K L M N P Q R S T V W Y C D E F G H I K L M N P Q R S T V W Y 4 0 -2 -1 -2 0 -2 -1 -1 -1 -1 -2 -1 -1 -1 1 0 0 -3 -2 0 9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -2 -2 -3 6 2 -3 -1 -1 -3 -1 -4 -3 1 -1 0 -2 0 -1 -3 -4 -3 -1 -4 2 5 -3 -2 0 -3 1 -3 -2 0 -1 2 0 0 -1 -2 -3 -2 -2 -2 -3 -3 6 -3 -1 0 -3 0 0 -3 -4 -3 -3 -2 -2 -1 1 3 0 -3 -1 -2 -3 6 -2 -4 -2 -4 -3 0 -2 -2 -2 0 -2 -3 -2 -3 -2 -3 -1 0 -1 -2 8 -3 -1 -3 -2 1 -2 0 0 -1 -2 -3 -2 2 -1 -1 -3 -3 0 -4 -3 4 -3 2 1 -3 -3 -3 -3 -2 -1 3 -3 -1 -1 -3 -1 1 -3 -2 -1 -3 5 -2 -1 0 -1 1 2 0 -1 -2 -3 -2 -1 -1 -4 -3 0 -4 -3 2 -2 4 2 -3 -3 -2 -2 -2 -1 1 -2 -1 -1 -1 -3 -2 0 -3 -2 1 -1 2 5 -2 -2 0 -1 -1 -1 1 -1 -1 -2 -3 1 0 -3 0 1 -3 0 -3 -2 6 -2 0 0 1 0 -3 -4 -2 -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2 7 -1 -2 -1 -1 -2 -4 -3 -1 -3 0 2 -3 -2 0 -3 1 -2 0 0 -1 5 1 0 -1 -2 -2 -1 -1 -3 -2 0 -3 -2 0 -3 2 -2 -1 0 -2 1 5 -1 -1 -3 -3 -2 1 -1 0 0 -2 0 -1 -2 0 -2 -1 1 -1 0 -1 4 1 -2 -3 -2 0 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1 0 -1 -1 -1 1 5 0 -2 -2 0 -1 -3 -2 -1 -3 -3 3 -2 1 1 -3 -2 -2 -3 -2 0 4 -3 -1 -3 -2 -4 -3 1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11 2 -2 -2 -3 -2 3 -3 2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1 2 7 Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr Henikoff, S. and Henikoff, J.G. (1992) Proc. Nat. Acad. Sci. USA 8 9 , 19915-10919. Comparison of PAM250 and BLOSUM62 Comparison of PAM250 and BLOSUM62 12 10 8 6 4 2 0 BLOSUM62 score ­2 ­4 ­6 See Demonstration A13 ­10 ­5 0 5 PAM250 score 10 15 20 Sequence Analysis Tasks ⇒ Calculating the probability of finding a sequence pattern sequence Statistics of pattern appearance s s Goal: Determine the significance of observing a Goal: pattern pattern Method: Estimate the probability that that pattern Method: would occur randomly in a given sequence. Three different methods different x x x Assume all nucleotides are equally frequent Use measured frequencies of each nucleotide Use (mononucleotide frequencies) Use measured frequencies with which a given Use nucleotide follows another (dinucleotide frequencies) nucleotide Determining mononucleotide frequencies s s s s s Count how many times each nucleotide appears in Count sequence sequence Divide (normalize) by total number of nucleotides Result: fA ≡ mononucleotide frequency of A (frequency that A is observed) (frequency Define: pA≡ mononucleotide probability that a nucleotide will be an A nucleotide pA assumed to equal fA Determining dinucleotide frequencies s Make 4 x 4 matrix, one element for each Make ordered pair of nucleotides ordered s Zero all elements s Go through sequence linearly, adding one to Go matrix entry corresponding to the pair of sequence elements observed at that position sequence s Divide by total number of dinucleotides s Result: fAC ≡ dinucleotide frequency of AC (frequency that AC is observed out of all Determining conditional dinucleotide probabilities s Divide each dinucleotide frequency by the Divide mononucleotide frequency of the first nucleotide nucleotide s Result: p*AC ≡ conditional dinucleotide conditional probability of observing a C given an A of s p*AC ≡ p(xi+1=C | xi=A)= fAC/ fA Illustration of probability calculation s What is the probability of observing the What sequence feature ART (A followed by a ART purine, either A or G, followed by a T)? followed s Using equal mononucleotide frequencies x pA = pC = pG = pT = 1/4 x pART s = 1/4 * (1/4 + 1/4) * 1/4 = 1/32 Using observed mononucleotide Using frequencies: frequencies: x pART = pA (pA + pG) pT Interactive Demonstration s (A2 Mono- and Dinucleotide Frequencies) Illustration using dinucleotide probabilities s When using dinucleotide probabilities, have When to be careful about how the probabilities are combined combined s Which is right? s pART=pA(p*AA+p*AG)(p*AT+p*GT) [eq.1] s pART=pA(p*AAp*AT+p*AGp*GT) [eq.2] Expansions s pART=pA(p*AA+p*AG)(p*AT+p*GT) [eq.1] s pART=pAp*AAp*AT + pAp*AAp*GT AT GT + pAp*AGp*AT + pAp*AGp*GT) AT s pART=pA(p*AAp*AT+p*AGp*GT) s pART= pAp*AAp*AT + pAp*AGp*GT ART= AT [eq.2] Proof s pART=pAAT+pAGT s pAAT=pAp*AAp*AT s pAGT=pAp*AGp*GT s pART= pAp*AAp*AT + pAp*AGp*GT ART= AT s This matches equation 2 on previous slide Need further convincing? Imagine that p*AA=0 and p*GT=0 (but all =0 other p* are non-zero) other s Then pART should be zero since there is no way to create either AAT or AGT way s This is predicted by eq. 2 but not by eq. 1 s More complicated probability illustration s s What is the probability of observing the sequence What feature ARYT (A followed by a purine {either A ARYT or G}, followed by a pyrimidine {either C or T}, }, }, followed by a T)? Using equal mononucleotide frequencies x x pA = pC = pG = pT = 1/4 pARYT = 1/4 * (1/4 + 1/4) * (1/4 + 1/4) * 1/4 = 1/64 Illustration (continued) s Using observed mononucleotide Using frequencies: frequencies: x pARYT s = pA (pA + pG) (pC + pT) pT Using dinucleotide frequencies: x pARYT = pA (p*AA (p*ACp*CT + p*ATp*TT) + AA (p p*AG (p*GCp*CT + p*GTp*TT) ) AG (p Illustration (continued) s Using dinucleotide frequencies: A R A T +T=AACT +T=AAT +T=AATT +C=AGC +A=AA Y +C=AAC +T=AGCT +T=AGT +T=AGTT +G=AG Multiply then add s We conclude that for such strings our rule We should be “multiply dinucleotide probabilities along each allowed path and then add the results” then A recursive solution s s s Some programming languages allow recursion Some recursion the calling (invoking) of a function by itself the This is useful here because we can branch when This branch we encounter an ambiguous base and consider all alternatives separately separately Allows multiplication down the branches and then Allows addition addition Site Probability Calculation via Recursion s Illustration: Make a function that prints out Illustration: all possible sequences that can match a restriction site restriction s (Demo Program PossibleSites.c) PossibleSites.c s s /* PossibleSites.c s s Prints out all possible sites that can match a string of IUB codes s s s s January 22, 1997 - R.F. Murphy */ s s #include <stdio.h> #include <string.h> s s s s s s s s s void PossibleSites(char SiteString, int Index); short Test1(char SiteString, Index); short Test2(char SiteString, Index); short Test3(char SiteString, Index); short Test4(char SiteString, Index); s int int int int void main(void) { char Site[10]; do { printf("Enter a string of IUB codes (up to 10 characters): "); scanf("%s", Site); PossibleSites(Site,0); } while (0==0); } s s s s s s s s s s Test for s each type s of s ambiguous s base s s void PossibleSites(char SiteString, int Index) { if (Index>=strlen(SiteString)) { printf("%s\n",SiteString); return; } else { if (Test1(SiteString, Index)) ; else if (Test2(SiteString, Index)) ; else if (Test3(SiteString, Index)) ; else if (Test4(SiteString, Index)) ; else { printf("Illegal character (%c) encountered\n",SiteString[Index]) ; Unwind here s s s s s PossibleSites(SiteString,Index+1); } } return; } s s s s s s s s s s s s s s s s short Test1(char SiteString, int Index) { /* printf("In Test1: Index %d, SiteString[Index] %c\n",Index,SiteString[Index]); */ switch (SiteString[Index]) { case 'A': case 'C': case 'G': case 'T': break; default: return false; } PossibleSites(SiteString,Index+1); return true; } s s s s short Test2(char SiteString, int Index) { char Save; /* printf("In Test2: Index %d, SiteString[Index] %c\n",Index,SiteString[Index]); */ s s s s s s s s s s s s s s s s s s s s s s s s s Save = SiteString[Index]; switch (SiteString[Index]) { case 'R': SiteString[Index]='A'; PossibleSites(SiteString, Index); SiteString[Index]='G'; PossibleSites(SiteString, Index); break; case 'Y': SiteString[Index]='C'; PossibleSites(SiteString, Index); SiteString[Index]='T'; PossibleSites(SiteString, Index); break; case 'S': SiteString[Index]='G'; PossibleSites(SiteString, Index); SiteString[Index]='C'; PossibleSites(SiteString, Index); break; s s s s s s s s s s s s s s s s s s s s case 'W': SiteString[Index]='A'; PossibleSites(SiteString, SiteString[Index]='T'; PossibleSites(SiteString, break; case 'M': SiteString[Index]='A'; PossibleSites(SiteString, SiteString[Index]='C'; PossibleSites(SiteString, break; case 'K': SiteString[Index]='G'; PossibleSites(SiteString, SiteString[Index]='T'; PossibleSites(SiteString, break; default: return false; } SiteString[Index] = Save; return true; } Index); Index); Index); Index); Index); Index); s s s s s s s s s s s s s s s s s s s s s s s s short Test3(char SiteString, int Index) { char Save; /* printf("In Test3: Index %d, SiteString[Index] %c\n",Index,SiteString[Index]); */ s Save = SiteString[Index]; switch (SiteString[Index]) { case 'B': /* not A */ SiteString[Index]='C'; PossibleSites(SiteString, Index); SiteString[Index]='G'; PossibleSites(SiteString, Index); SiteString[Index]='T'; PossibleSites(SiteString, Index); break; case 'D': /* not C */ SiteString[Index]='A'; PossibleSites(SiteString, Index); SiteString[Index]='G'; PossibleSites(SiteString, Index); SiteString[Index]='T'; PossibleSites(SiteString, Index); break; s s s s s s s s s s s s s s s s s s s s s case 'H': /* not G */ SiteString[Index]='A'; PossibleSites(SiteString, Index); SiteString[Index]='C'; PossibleSites(SiteString, Index); SiteString[Index]='T'; PossibleSites(SiteString, Index); break; case 'V': /* not T/U */ SiteString[Index]='A'; PossibleSites(SiteString, Index); SiteString[Index]='C'; PossibleSites(SiteString, Index); SiteString[Index]='G'; PossibleSites(SiteString, Index); break; default: return false; } SiteString[Index] = Save; return true; } s s s short Test4(char SiteString, int Index) { char Save; s /* printf("In Test4: Index %d, SiteString[Index] %c\n",Index,SiteString[Index]); */ s s s s Save = SiteString[Index]; switch (SiteString[Index]) { case 'N': /* A,C,G,T/U (iNdeterminate) */ case 'X': /* alternate for N */ SiteString[Index]='A'; PossibleSites(SiteString, Index); SiteString[Index]='C'; PossibleSites(SiteString, Index); SiteString[Index]='G'; PossibleSites(SiteString, Index); SiteString[Index]='T'; PossibleSites(SiteString, Index); break; default: return false; } SiteString[Index] = Save; return true; } s s s s s s s s s s s s s s s s Another illustration s What is pACT in the sequence TTTAACTGGG? TTTAACTGGG? x fA = 2/10, fC = 1/10 x pA = 0.2 x fAC = 1/10, fCT = 1/10 x p*AC = 0.1/0.2 = 0.5, p*CT = 0.1/0.1 = 1 s pACT = pA p*AC p*CT = 0.2 * 0.5 * 1 = 0.1 0.1 AC s (would have been 1/5 * 1/10 * 4/10 = 0.008 (would 0.008 using mononucleotide frequencies) using Expected number and spacing s Probabilities are per nucleotide Probabilities per s How do we calculate number of expected How number features in a sequence of length L? features x Expected number (for large L) ≅ Expected s Lp How do we calculate the expected spacing How spacing between features? between x µART ≡ expected spacing between ART features = 1/pART 1/p Sequence Analysis Tasks ⇒ Calculating the probability of finding an alignment with a particular score alignment Statistics of pattern appearance s s Goal: Determine the significance of observing a Goal: particular alignment between two sequences particular Method: Estimate the probability that that Method: alignment would occur randomly in any pair of sequences of the same lengths and compositions. sequences x Use measured frequencies of each nucleotide Use (mononucleotide frequencies) Probability of consecutive matches s What is probability of j consecutive matches What between two sequences of equal mononucleotide frequencies? mononucleotide (1/4)j s If non-equal mononucleotide frequencies, If use p as the probability of a match between two positions two 2 A 2 C 2 G 2 T p= p + p + p + p Probability of consecutive matches s s s Probability of j matches using p Probability pj Model as flipping coin with p chance of heads Model heads corresponds to matching a given position in the two sequences the Expected length of longest run between two Expected sequences of length n (Erdos and Renyi law) (Erdos log1 / p ( n ) s This is for just one alignment between the two Expected longest match length s Given M is the length of the longest match Given between two sequences of length m and n, what is what E(M), the expected value of M? M? E ( M ) @log1 / p ( mn ) + log1 / p (1 - p) + g log(e) - 1 / 2 or E ( M ) @log1 / p (Kmn) s where K depends on p (which depends on base where composition) and γ =0.577 (Euler’s number) Karlin-Altschul formulation s p is easy to estimate for nucleic acid comparison without a scoring matrix, but how do we estimate it when using a scoring matrix? Karlin and Altschul rewrote the previous result as Altschul E ( M ) @[log e (Kmn )] / l where l = log e (1 / p) s and then generalize to allow λ to be calculated and empirically for a given scoring matrix and sequence composition sequence Estimating Significance of Local Alignment s s The distribution of scores obtained for aligning a The random pair of sequences follow an extreme value distribution distribution Can calculate probability that a score S greater Can than equal to x will be observed P ( S ³ x ) @Kmne -lx Estimating Significance of Local Alignment P ( S ³ x ) @Kmne s -lx Can convert to score corresponding to a particular Can probability threshold (e.g., 0.05) probability S = log 2 (Kmn ) - log 2 P S = log 2 (K / P ) + log 2 ( nm) s For P=0.05, first term is negligible, thus S » log 2 ( nm) Reading for next class s Jones/Pevzner, rest of Chapter 6 ...
View Full Document

Ask a homework question - tutors are online