LecturesPart04

LecturesPart04 - Computational Biology, Part 4 Statistics...

Info iconThis preview shows page 1. Sign up to view the full content.

View Full Document Right Arrow Icon
This is the end of the preview. Sign up to access the rest of the document.

Unformatted text preview: Computational Biology, Part 4 Statistics of Pattern Appearance Robert F. Murphy Copyright © 1996-2006. Copyright All rights reserved. Sequence Analysis Tasks ⇒ Calculating the probability of finding a sequence pattern sequence Statistics of pattern appearance s s Goal: Determine the significance of observing a Goal: pattern pattern Method: Estimate the probability that that pattern Method: would occur randomly in a given sequence. Three different methods different x x x Assume all nucleotides are equally frequent Use measured frequencies of each nucleotide Use (mononucleotide frequencies) Use measured frequencies with which a given Use nucleotide follows another (dinucleotide frequencies) nucleotide Determining mononucleotide frequencies s s s s s Count how many times each nucleotide appears in Count sequence sequence Divide (normalize) by total number of nucleotides Result: fA ≡ mononucleotide frequency of A (frequency that A is observed) (frequency Define: pA≡ mononucleotide probability that a nucleotide will be an A nucleotide pA assumed to equal fA Determining dinucleotide frequencies s Make 4 x 4 matrix, one element for each Make ordered pair of nucleotides ordered s Zero all elements s Go through sequence linearly, adding one to Go matrix entry corresponding to the pair of sequence elements observed at that position sequence s Divide by total number of dinucleotides s Result: fAC ≡ dinucleotide frequency of AC (frequency that AC is observed out of all Determining conditional dinucleotide probabilities s Divide each dinucleotide frequency by the Divide mononucleotide frequency of the first nucleotide nucleotide s Result: p*AC ≡ conditional dinucleotide conditional probability of observing a C given an A of s p*AC ≡ p(xi+1=C | xi=A)= fAC/ fA Illustration of probability calculation s What is the probability of observing the What sequence feature ART (A followed by a ART purine, either A or G, followed by a T)? followed s Using equal mononucleotide frequencies x pA = pC = pG = pT = 1/4 x pART s = 1/4 * (1/4 + 1/4) * 1/4 = 1/32 Using observed mononucleotide Using frequencies: frequencies: x pART = pA (pA + pG) pT Interactive Demonstration s (A2 Mono- and Dinucleotide Frequencies) Illustration using dinucleotide probabilities s When using dinucleotide probabilities, have When to be careful about how the probabilities are combined combined s Which is right? s pART=pA(p*AA+p*AG)(p*AT+p*GT) [eq.1] s pART=pA(p*AAp*AT+p*AGp*GT) [eq.2] Expansions s pART=pA(p*AA+p*AG)(p*AT+p*GT) [eq.1] s pART=pAp*AAp*AT + pAp*AAp*GT AT GT + pAp*AGp*AT + pAp*AGp*GT) AT s pART=pA(p*AAp*AT+p*AGp*GT) s pART= pAp*AAp*AT + pAp*AGp*GT ART= AT [eq.2] Proof s pART=pAAT+pAGT s pAAT=pAp*AAp*AT s pAGT=pAp*AGp*GT s pART= pAp*AAp*AT + pAp*AGp*GT ART= AT s This matches equation 2 on previous slide Need further convincing? Imagine that p*AA=0 and p*GT=0 (but all =0 other p* are non-zero) other s Then pART should be zero since there is no way to create either AAT or AGT way s This is predicted by eq. 2 but not by eq. 1 s More complicated probability illustration s s What is the probability of observing the sequence What feature ARYT (A followed by a purine {either A ARYT or G}, followed by a pyrimidine {either C or T}, }, }, followed by a T)? Using equal mononucleotide frequencies x x pA = pC = pG = pT = 1/4 pARYT = 1/4 * (1/4 + 1/4) * (1/4 + 1/4) * 1/4 = 1/64 Illustration (continued) s Using observed mononucleotide Using frequencies: frequencies: x pARYT s = pA (pA + pG) (pC + pT) pT Using dinucleotide frequencies: x pARYT = pA (p*AA (p*ACp*CT + p*ATp*TT) + AA (p p*AG (p*GCp*CT + p*GTp*TT) ) AG (p Illustration (continued) s Using dinucleotide frequencies: A R A T +T=AACT +T=AAT +T=AATT +C=AGC +A=AA Y +C=AAC +T=AGCT +T=AGT +T=AGTT +G=AG Multiply then add s We conclude that for such strings our rule We should be “multiply dinucleotide probabilities along each allowed path and then add the results” then How do we program this? s “for” loops? s Nested “if” structure? s Other? Will this work? result=monoprob(seq(1)); for i=2 to n { temp=0. for j=1 to 4 /*for each base*/ for /*for { if(seq(i)&mask(j)) if(seq(i)&mask(j)) temp=temp+diprob(seq(itemp=temp+diprob(seq(i1),seq(i)) } result=result*temp } No to for s No, it generates add then multiply A recursive solution s s s Some programming languages allow recursion Some recursion the calling (invoking) of a function by itself the This is useful here because we can branch when This branch we encounter an ambiguous base and consider all alternatives separately separately Allows multiplication down the branches and then Allows addition addition Site Probability Calculation via Recursion s Illustration: Make a function that prints out Illustration: all possible sequences that can match a restriction site restriction s (Demo Program PossibleSites.c) x (found in (found /afs/andrew.cmu.edu/usr/murphy/CompBiol/DemoProgra /afs/andrew.cmu.edu/usr/murphy/CompBiol/DemoProgra ms or Mellon: BioServer: Comp. Biol. 03-310: Demo ms Programs: PossibleSites ƒ) Programs: PossibleSites.c s s /* PossibleSites.c s s Prints out all possible sites that can match a string of IUB codes s s s s January 22, 1997 - R.F. Murphy */ s s #include <stdio.h> #include <string.h> s s s s s s s s s void PossibleSites(char SiteString, int Index); short Test1(char SiteString, Index); short Test2(char SiteString, Index); short Test3(char SiteString, Index); short Test4(char SiteString, Index); s int int int int void main(void) { char Site[10]; do { printf("Enter a string of IUB codes (up to 10 characters): "); scanf("%s", Site); PossibleSites(Site,0); } while (0==0); } s s s s s s s s s s Test for s each type s of s ambiguous s base s s void PossibleSites(char SiteString, int Index) { if (Index>=strlen(SiteString)) { printf("%s\n",SiteString); return; } else { if (Test1(SiteString, Index)) ; else if (Test2(SiteString, Index)) ; else if (Test3(SiteString, Index)) ; else if (Test4(SiteString, Index)) ; else { printf("Illegal character (%c) encountered\n",SiteString[Index]) ; Unwind here s s s s s PossibleSites(SiteString,Index+1); } } return; } s s s s s s s s s s s s s s s s short Test1(char SiteString, int Index) { /* printf("In Test1: Index %d, SiteString[Index] %c\n",Index,SiteString[Index]); */ switch (SiteString[Index]) { case 'A': case 'C': case 'G': case 'T': break; default: return false; } PossibleSites(SiteString,Index+1); return true; } s s s s short Test2(char SiteString, int Index) { char Save; /* printf("In Test2: Index %d, SiteString[Index] %c\n",Index,SiteString[Index]); */ s s s s s s s s s s s s s s s s s s s s s s s s s Save = SiteString[Index]; switch (SiteString[Index]) { case 'R': SiteString[Index]='A'; PossibleSites(SiteString, Index); SiteString[Index]='G'; PossibleSites(SiteString, Index); break; case 'Y': SiteString[Index]='C'; PossibleSites(SiteString, Index); SiteString[Index]='T'; PossibleSites(SiteString, Index); break; case 'S': SiteString[Index]='G'; PossibleSites(SiteString, Index); SiteString[Index]='C'; PossibleSites(SiteString, Index); break; s s s s s s s s s s s s s s s s s s s s case 'W': SiteString[Index]='A'; PossibleSites(SiteString, SiteString[Index]='T'; PossibleSites(SiteString, break; case 'M': SiteString[Index]='A'; PossibleSites(SiteString, SiteString[Index]='C'; PossibleSites(SiteString, break; case 'K': SiteString[Index]='G'; PossibleSites(SiteString, SiteString[Index]='T'; PossibleSites(SiteString, break; default: return false; } SiteString[Index] = Save; return true; } Index); Index); Index); Index); Index); Index); s s s s s s s s s s s s s s s s s s s s s s s s short Test3(char SiteString, int Index) { char Save; /* printf("In Test3: Index %d, SiteString[Index] %c\n",Index,SiteString[Index]); */ s Save = SiteString[Index]; switch (SiteString[Index]) { case 'B': /* not A */ SiteString[Index]='C'; PossibleSites(SiteString, Index); SiteString[Index]='G'; PossibleSites(SiteString, Index); SiteString[Index]='T'; PossibleSites(SiteString, Index); break; case 'D': /* not C */ SiteString[Index]='A'; PossibleSites(SiteString, Index); SiteString[Index]='G'; PossibleSites(SiteString, Index); SiteString[Index]='T'; PossibleSites(SiteString, Index); break; s s s s s s s s s s s s s s s s s s s s s case 'H': /* not G */ SiteString[Index]='A'; PossibleSites(SiteString, Index); SiteString[Index]='C'; PossibleSites(SiteString, Index); SiteString[Index]='T'; PossibleSites(SiteString, Index); break; case 'V': /* not T/U */ SiteString[Index]='A'; PossibleSites(SiteString, Index); SiteString[Index]='C'; PossibleSites(SiteString, Index); SiteString[Index]='G'; PossibleSites(SiteString, Index); break; default: return false; } SiteString[Index] = Save; return true; } s s s short Test4(char SiteString, int Index) { char Save; s /* printf("In Test4: Index %d, SiteString[Index] %c\n",Index,SiteString[Index]); */ s s s s Save = SiteString[Index]; switch (SiteString[Index]) { case 'N': /* A,C,G,T/U (iNdeterminate) */ case 'X': /* alternate for N */ SiteString[Index]='A'; PossibleSites(SiteString, Index); SiteString[Index]='C'; PossibleSites(SiteString, Index); SiteString[Index]='G'; PossibleSites(SiteString, Index); SiteString[Index]='T'; PossibleSites(SiteString, Index); break; default: return false; } SiteString[Index] = Save; return true; } s s s s s s s s s s s s s s s s Another illustration s What is pACT in the sequence TTTAACTGGG? TTTAACTGGG? x fA = 2/10, fC = 1/10 x pA = 0.2 x fAC = 1/10, fCT = 1/10 x p*AC = 0.1/0.2 = 0.5, p*CT = 0.1/0.1 = 1 s pACT = pA p*AC p*CT = 0.2 * 0.5 * 1 = 0.1 0.1 AC s (would have been 1/5 * 1/10 * 4/10 = 0.008 (would 0.008 using mononucleotide frequencies) using Expected number and spacing s Probabilities are per nucleotide Probabilities per s How do we calculate number of expected How number features in a sequence of length L? features x Expected number (for large L) ≅ Expected s Lp How do we calculate the expected spacing How spacing between features? between x µART ≡ expected spacing between ART features = 1/pART 1/p Sequence Analysis Tasks ⇒ Calculating the probability of finding an alignment with a particular score alignment Statistics of pattern appearance s s Goal: Determine the significance of observing a Goal: particular alignment between two sequences particular Method: Estimate the probability that that Method: alignment would occur randomly in any pair of sequences of the same lengths and compositions. sequences x Use measured frequencies of each nucleotide Use (mononucleotide frequencies) Probability of consecutive matches s What is probability of j consecutive matches What between two sequences of equal mononucleotide frequencies? mononucleotide (1/4)j s If non-equal mononucleotide frequencies, If use p as the probability of a match between two positions two 2 A 2 C 2 G 2 T p= p + p + p + p Probability of consecutive matches s s s Probability of j matches using p Probability pj Model as flipping coin with p chance of heads Model heads corresponds to matching a given position in the two sequences the Expected length of longest run between two Expected sequences of length n (Erdos and Renyi law) (Erdos log1 / p ( n ) s This is for just one alignment between the two Expected longest match length s Given M is the length of the longest match Given between two sequences of length m and n, what is what E(M), the expectation value of M? M? E ( M ) @log1 / p ( mn ) + log1 / p (1 - p) + g log(e) - 1 / 2 or E ( M ) @log1 / p (Kmn) s where K depends on p (which depends on base where composition) and γ =0.577 (Euler’s number) Karlin-Altschul formulation s p is easy to estimate for nucleic acid comparison without a scoring matrix, but how do we estimate it when using a scoring matrix? Karlin and Altschul rewrote the previous result as Altschul E ( M ) @[log e (Kmn )] / l where l = log e (1 / p) s and then generalize to allow λ to be calculated and empirically for a given scoring matrix and sequence composition sequence Reading for next class s Mount, Chapter 5 through p. 186 ...
View Full Document

Ask a homework question - tutors are online