Unformatted text preview: Computational Biology, Part 4
Statistics of Pattern Appearance
Robert F. Murphy
Copyright © 19962006.
Copyright
All rights reserved. Sequence Analysis Tasks
⇒ Calculating the probability of finding a
sequence pattern
sequence Statistics of pattern appearance
s s Goal: Determine the significance of observing a
Goal:
pattern
pattern
Method: Estimate the probability that that pattern
Method:
would occur randomly in a given sequence. Three
different methods
different
x
x x Assume all nucleotides are equally frequent
Use measured frequencies of each nucleotide
Use
(mononucleotide frequencies)
Use measured frequencies with which a given
Use
nucleotide follows another (dinucleotide frequencies)
nucleotide Determining mononucleotide
frequencies
s s
s s s Count how many times each nucleotide appears in
Count
sequence
sequence
Divide (normalize) by total number of nucleotides
Result: fA ≡ mononucleotide frequency of A
(frequency that A is observed)
(frequency
Define: pA≡ mononucleotide probability that a
nucleotide will be an A
nucleotide
pA assumed to equal fA Determining dinucleotide
frequencies
s Make 4 x 4 matrix, one element for each
Make
ordered pair of nucleotides
ordered
s Zero all elements
s Go through sequence linearly, adding one to
Go
matrix entry corresponding to the pair of
sequence elements observed at that position
sequence
s Divide by total number of dinucleotides
s Result: fAC ≡ dinucleotide frequency of AC
(frequency that AC is observed out of all Determining conditional
dinucleotide probabilities
s Divide each dinucleotide frequency by the
Divide
mononucleotide frequency of the first
nucleotide
nucleotide
s Result: p*AC ≡ conditional dinucleotide
conditional
probability of observing a C given an A
of
s p*AC ≡ p(xi+1=C  xi=A)= fAC/ fA Illustration of probability
calculation
s What is the probability of observing the
What
sequence feature ART (A followed by a
ART
purine, either A or G, followed by a T)?
followed
s Using equal mononucleotide frequencies
x pA = pC = pG = pT = 1/4 x pART s = 1/4 * (1/4 + 1/4) * 1/4 = 1/32 Using observed mononucleotide
Using
frequencies:
frequencies:
x pART = pA (pA + pG) pT Interactive Demonstration
s (A2 Mono and Dinucleotide Frequencies) Illustration using dinucleotide
probabilities
s When using dinucleotide probabilities, have
When
to be careful about how the probabilities are
combined
combined
s Which is right?
s pART=pA(p*AA+p*AG)(p*AT+p*GT) [eq.1]
s pART=pA(p*AAp*AT+p*AGp*GT) [eq.2] Expansions
s pART=pA(p*AA+p*AG)(p*AT+p*GT) [eq.1] s pART=pAp*AAp*AT + pAp*AAp*GT
AT
GT
+ pAp*AGp*AT + pAp*AGp*GT)
AT s pART=pA(p*AAp*AT+p*AGp*GT) s pART= pAp*AAp*AT + pAp*AGp*GT
ART=
AT [eq.2] Proof
s pART=pAAT+pAGT s pAAT=pAp*AAp*AT s pAGT=pAp*AGp*GT s pART= pAp*AAp*AT + pAp*AGp*GT
ART=
AT s This matches equation 2 on previous slide Need further convincing?
Imagine that p*AA=0 and p*GT=0 (but all
=0
other p* are nonzero)
other
s Then pART should be zero since there is no
way to create either AAT or AGT
way
s This is predicted by eq. 2 but not by eq. 1
s More complicated probability
illustration
s s What is the probability of observing the sequence
What
feature ARYT (A followed by a purine {either A
ARYT
or G}, followed by a pyrimidine {either C or T},
},
},
followed by a T)?
Using equal mononucleotide frequencies
x
x pA = pC = pG = pT = 1/4
pARYT = 1/4 * (1/4 + 1/4) * (1/4 + 1/4) * 1/4 = 1/64 Illustration (continued)
s Using observed mononucleotide
Using
frequencies:
frequencies:
x pARYT s = pA (pA + pG) (pC + pT) pT Using dinucleotide frequencies:
x pARYT = pA (p*AA (p*ACp*CT + p*ATp*TT) +
AA (p
p*AG (p*GCp*CT + p*GTp*TT) )
AG (p Illustration (continued)
s Using dinucleotide frequencies:
A R A T
+T=AACT +T=AAT +T=AATT +C=AGC +A=AA Y
+C=AAC +T=AGCT +T=AGT +T=AGTT +G=AG Multiply then add
s We conclude that for such strings our rule
We
should be “multiply dinucleotide
probabilities along each allowed path and
then add the results”
then How do we program this?
s “for” loops?
s Nested “if” structure?
s Other? Will this work?
result=monoprob(seq(1));
for i=2 to n
{
temp=0.
for j=1 to 4 /*for each base*/
for
/*for
{
if(seq(i)&mask(j))
if(seq(i)&mask(j))
temp=temp+diprob(seq(itemp=temp+diprob(seq(i1),seq(i))
}
result=result*temp
} No to for
s No, it generates add then multiply A recursive solution
s s s Some programming languages allow recursion Some
recursion
the calling (invoking) of a function by itself
the
This is useful here because we can branch when
This
branch
we encounter an ambiguous base and consider all
alternatives separately
separately
Allows multiplication down the branches and then
Allows
addition
addition Site Probability Calculation via
Recursion
s Illustration: Make a function that prints out
Illustration:
all possible sequences that can match a
restriction site
restriction
s (Demo Program PossibleSites.c)
x (found in
(found
/afs/andrew.cmu.edu/usr/murphy/CompBiol/DemoProgra
/afs/andrew.cmu.edu/usr/murphy/CompBiol/DemoProgra
ms or Mellon: BioServer: Comp. Biol. 03310: Demo
ms
Programs: PossibleSites ƒ)
Programs: PossibleSites.c
s s /* PossibleSites.c s
s Prints out all possible sites that
can match a string of IUB codes s
s
s s January 22, 1997  R.F. Murphy */ s s #include <stdio.h>
#include <string.h> s s s
s s
s
s
s
s void PossibleSites(char
SiteString, int Index);
short Test1(char SiteString,
Index);
short Test2(char SiteString,
Index);
short Test3(char SiteString,
Index);
short Test4(char SiteString,
Index); s int
int
int
int void main(void)
{ char Site[10];
do
{
printf("Enter a string of IUB
codes (up to 10 characters): ");
scanf("%s", Site);
PossibleSites(Site,0);
} while (0==0);
} s
s
s
s
s
s
s
s
s
s Test for s
each type s
of
s
ambiguous
s
base
s
s void PossibleSites(char
SiteString, int Index)
{
if (Index>=strlen(SiteString))
{
printf("%s\n",SiteString);
return;
}
else
{
if (Test1(SiteString, Index)) ;
else if (Test2(SiteString,
Index)) ;
else if (Test3(SiteString,
Index)) ;
else if (Test4(SiteString,
Index)) ;
else
{
printf("Illegal character (%c)
encountered\n",SiteString[Index]) ; Unwind here s
s
s
s
s PossibleSites(SiteString,Index+1);
}
}
return;
} s
s
s s
s
s
s
s
s
s
s
s
s
s
s
s short Test1(char SiteString, int
Index)
{
/* printf("In Test1: Index %d,
SiteString[Index]
%c\n",Index,SiteString[Index]); */
switch (SiteString[Index])
{
case 'A':
case 'C':
case 'G':
case 'T':
break;
default:
return false;
}
PossibleSites(SiteString,Index+1);
return true;
} s
s
s
s short Test2(char SiteString, int
Index)
{
char Save;
/* printf("In Test2: Index %d,
SiteString[Index]
%c\n",Index,SiteString[Index]); */ s
s
s
s s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s Save = SiteString[Index];
switch (SiteString[Index])
{
case 'R':
SiteString[Index]='A';
PossibleSites(SiteString,
Index);
SiteString[Index]='G';
PossibleSites(SiteString,
Index);
break;
case 'Y':
SiteString[Index]='C';
PossibleSites(SiteString,
Index);
SiteString[Index]='T';
PossibleSites(SiteString,
Index);
break;
case 'S':
SiteString[Index]='G';
PossibleSites(SiteString,
Index);
SiteString[Index]='C';
PossibleSites(SiteString,
Index);
break; s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s case 'W':
SiteString[Index]='A';
PossibleSites(SiteString,
SiteString[Index]='T';
PossibleSites(SiteString,
break;
case 'M':
SiteString[Index]='A';
PossibleSites(SiteString,
SiteString[Index]='C';
PossibleSites(SiteString,
break;
case 'K':
SiteString[Index]='G';
PossibleSites(SiteString,
SiteString[Index]='T';
PossibleSites(SiteString,
break;
default:
return false;
}
SiteString[Index] = Save;
return true;
} Index);
Index); Index);
Index); Index);
Index); s
s
s
s s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s short Test3(char SiteString, int
Index)
{
char Save;
/* printf("In Test3: Index %d,
SiteString[Index]
%c\n",Index,SiteString[Index]); */ s Save = SiteString[Index];
switch (SiteString[Index])
{
case 'B': /* not A */
SiteString[Index]='C';
PossibleSites(SiteString,
Index);
SiteString[Index]='G';
PossibleSites(SiteString,
Index);
SiteString[Index]='T';
PossibleSites(SiteString,
Index);
break;
case 'D': /* not C */
SiteString[Index]='A';
PossibleSites(SiteString,
Index);
SiteString[Index]='G';
PossibleSites(SiteString,
Index);
SiteString[Index]='T';
PossibleSites(SiteString,
Index);
break; s s
s s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s case 'H': /* not G */
SiteString[Index]='A';
PossibleSites(SiteString,
Index);
SiteString[Index]='C';
PossibleSites(SiteString,
Index);
SiteString[Index]='T';
PossibleSites(SiteString,
Index);
break;
case 'V': /* not T/U */
SiteString[Index]='A';
PossibleSites(SiteString,
Index);
SiteString[Index]='C';
PossibleSites(SiteString,
Index);
SiteString[Index]='G';
PossibleSites(SiteString,
Index);
break;
default:
return false;
}
SiteString[Index] = Save;
return true;
} s
s
s short Test4(char SiteString, int
Index)
{
char Save; s /* printf("In Test4: Index %d,
SiteString[Index]
%c\n",Index,SiteString[Index]); */ s
s
s
s Save = SiteString[Index];
switch (SiteString[Index])
{
case 'N': /* A,C,G,T/U
(iNdeterminate) */
case 'X': /* alternate for N */
SiteString[Index]='A';
PossibleSites(SiteString,
Index);
SiteString[Index]='C';
PossibleSites(SiteString,
Index);
SiteString[Index]='G';
PossibleSites(SiteString,
Index);
SiteString[Index]='T';
PossibleSites(SiteString,
Index);
break;
default:
return false;
}
SiteString[Index] = Save;
return true;
} s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s Another illustration
s What is pACT in the sequence
TTTAACTGGG?
TTTAACTGGG?
x fA = 2/10, fC = 1/10 x pA = 0.2 x fAC = 1/10, fCT = 1/10 x p*AC = 0.1/0.2 = 0.5, p*CT = 0.1/0.1 = 1 s pACT = pA p*AC p*CT = 0.2 * 0.5 * 1 = 0.1
0.1
AC s (would have been 1/5 * 1/10 * 4/10 = 0.008
(would
0.008
using mononucleotide frequencies)
using Expected number and spacing
s Probabilities are per nucleotide
Probabilities
per
s How do we calculate number of expected
How
number
features in a sequence of length L?
features
x Expected number (for large L) ≅
Expected s Lp How do we calculate the expected spacing
How
spacing
between features?
between
x µART ≡ expected spacing between ART features
= 1/pART
1/p Sequence Analysis Tasks
⇒ Calculating the probability of finding an
alignment with a particular score
alignment Statistics of pattern appearance
s s Goal: Determine the significance of observing a
Goal:
particular alignment between two sequences
particular
Method: Estimate the probability that that
Method:
alignment would occur randomly in any pair of
sequences of the same lengths and compositions.
sequences
x Use measured frequencies of each nucleotide
Use
(mononucleotide frequencies) Probability of consecutive
matches
s What is probability of j consecutive matches
What
between two sequences of equal
mononucleotide frequencies?
mononucleotide
(1/4)j
s If nonequal mononucleotide frequencies,
If
use p as the probability of a match between
two positions
two
2
A 2
C 2
G 2
T p= p + p + p + p Probability of consecutive
matches
s s s Probability of j matches using p
Probability
pj
Model as flipping coin with p chance of heads Model
heads corresponds to matching a given position in
the two sequences
the
Expected length of longest run between two
Expected
sequences of length n (Erdos and Renyi law)
(Erdos log1 / p ( n )
s This is for just one alignment between the two Expected longest match length
s Given M is the length of the longest match
Given
between two sequences of length m and n, what is
what
E(M), the expectation value of M?
M? E ( M ) @log1 / p ( mn ) + log1 / p (1  p) + g log(e)  1 / 2
or
E ( M ) @log1 / p (Kmn)
s where K depends on p (which depends on base
where
composition) and γ =0.577 (Euler’s number) KarlinAltschul formulation
s p is easy to estimate for nucleic acid comparison
without a scoring matrix, but how do we estimate
it when using a scoring matrix? Karlin and
Altschul rewrote the previous result as
Altschul E ( M ) @[log e (Kmn )] / l
where
l = log e (1 / p)
s and then generalize to allow λ to be calculated
and
empirically for a given scoring matrix and
sequence composition
sequence Reading for next class
s Mount, Chapter 5 through p. 186 ...
View
Full Document
 Fall '10
 Staff
 Biology, Conditional Probability, Probability, Computational Biology, Probability theory

Click to edit the document details