Unformatted text preview: Computational Biology, Part 4
Similarity Matrices/Statistics of
Pattern Appearance
Robert F. Murphy
Copyright © 19962009.
Copyright
All rights reserved. Deriving and Using Similarity
Matrices Origin of PAM matrices
s s s Take aligned set of closely related proteins
x 71 groups of proteins that were at least 85%
71
similar
similar
Each group of sequences were organized into a
Each
phylogenetic tree
phylogenetic
x Creates a model of the order in which
Creates
substitutions occurred
substitutions
Count the number of changes of each amino acid
Count
into every other amino acid
into
x Each substitution is considered to be an
Each
“accepted mutation”  an amino acid change
“accepted” by natural selection
“accepted” Origin of PAM matrices
s s s For each group of proteins, find the “exposure to
For
mutation” for each amino acid. Product of
x the frequency of each amino acid in that group
x the number of all amino acid changes per 100
the
residues (total number of amino acid changes
divided by the combined length of all sequences
in that group, then times 100)
in
For each group, divide counts of changes for each
For
amino acid pair by the exposure to mutation of the
“original” amino acid
“original”
Average these across all groups to create PAM1
Average
matrix (Point Accepted Mutation at 1% change)
matrix Origin of PAM matrices
s s s
s This table is equivalent to a transition matrix for a
This
firstorder Markov model for protein sequence
evolution with a 1% overall probability of change
evolution
Appropriate for comparing sequences separated by
Appropriate
an evolutionary distance that would yield changes
in 1% of the positions
in
Note that PAM1 is not symmetric
To compare sequences across greater distances,
To
can multiply the PAM1 matrix by itself (if Markov
model is correct)
model Origin of PAM matrices
s s s s Squaring PAM1 considers all the ways that an
Squaring
“original” amino acid may have changed over two
steps of 1% mutation rate each
steps
For staying the same, sum probability that it didn’t
For
change in first step times probability that it didn’t
change in second step plus product of all the
probability of all changes in first step times
probability of changing back
probability
For changing from x > y, consider sum of
For
consider
products of all the changes that could have
happened in first step (x > z) times probability of
happened
times
changing from that into y (z > y)
changing
This gives PAM2 (still not symmetric!) Origin of PAM matrices
s Can raise PAM1 to any power (e.g.,
Can
PAM250)
PAM250)
s Major effect of raising PAM matrix to a
Major
power is to decrease the probability that a
particular amino acid is unchanged (and
increase the probabilities for it to change
into all others)
into Use of PAM matrices
s s Sum of the product of diagonal elements times
Sum
overall frequency of each amino acid gives
expected degree of similarity between two proteins
sequences that are separated by an evolutionary
distance corresponding to the power of the matrix
distance
x For PAM60 this is 60%
x For PAM80 it is 50%
x For PAM120 it is 40%
Can choose a matrix based on an estimate of the
Can
similarity between proteins (but this might require
an alignment!)
an Use of PAM matrices
s
s
s
s When aligning two sequences, don’t know which
When
amino acid is the original and which is the change
amino
Therefore average element (a,b) with element (b,a)
Therefore
to get symmetric matrix
to
Then convert to odds score by dividing by
Then
probability that change would occur randomly
probability
Then convert to log base 2 Dayhoff PAM250 similarity
matrix in logodds form
A A
C
D
E
F
G
H
I
K
L
M
N
P
Q
R
S
T
V
W
Y C D E F G H I K L M N P Q R S T V W Y 2
2
0
0
4
1
1
1
1
2
1
0
1
0
2
1
1
0
6
3 2
12
5
5
4
3
3
2
5
6
5
4
3
5
4
0
2
2
8
0 0
5
4
3
6
1
1
2
0
4
3
2
1
2
1
0
0
2
7
4 0
5
3
4
5
0
1
2
0
3
2
1
1
2
1
0
0
2
7
4 4
4
6
5
9
5
2
1
5
2
0
4
5
5
4
3
3
1
0
7 1
3
1
0
5
5
2
3
2
4
3
0
1
1
3
1
0
1
7
5 1
3
1
1
2
2
6
2
0
2
2
2
0
3
2
1
1
2
3
0 1
2
2
2
1
3
2
5
2
2
2
2
2
2
2
1
0
4
5
1 1
5
0
0
5
2
0
2
5
3
0
1
1
1
3
0
0
2
3
4 2
6
4
3
2
4
2
2
3
6
4
3
3
2
3
3
2
2
2
1 1
5
3
2
0
3
2
2
0
4
6
2
2
1
0
2
1
2
4
2 0
4
2
1
4
0
2
2
1
3
2
2
1
1
0
1
0
2
4
2 1
3
1
1
5
1
0
2
1
3
2
1
6
0
0
1
0
1
6
5 0
5
2
2
5
1
3
2
1
2
1
1
0
4
1
1
1
2
5
4 2
4
1
1
4
3
2
2
3
3
0
0
0
1
6
0
1
2
2
4 1
0
0
0
3
1
1
1
0
3
2
1
1
1
0
2
1
1
2
3 1
2
0
0
3
0
1
0
0
2
1
0
0
1
1
1
3
0
5
3 0
2
2
2
1
1
2
4
2
2
2
2
1
2
2
1
0
4
6
2 6
8
7
7
0
7
3
5
3
2
4
4
6
5
2
2
5
6
17
0 3
0
4
4
7
5
0
1
4
1
2
2
5
4
4
3
3
2
0
10 Ala
Cys
Asp
Glu
Phe
Gly
His
Ile
Lys
Leu
Met
Asn
Pro
Gln
Arg
Ser
Thr
Val
Trp
Tyr Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr
Dayhoff, M., Schwartz, R.M. and Orcutt, B.C. (1978) in: Dayhoff M, ed. Atlas of
protein sequence and structure, vol 5 (suppl 3) . Silver Spring, MD: NBRF. 345352. Dayhoff PAM250 similarity
matrix (partial)
A
A
C
D
E
F
G
H
I
K
L
M
N C D E F G H I K L M N P 2
2
0
0
4
1
1
1
1
2
1
0 2
12
5
5
4
3
3
2
5
6
5
4 0
5
4
3
6
1
1
2
0
4
3
2 0
5
3
4
5
0
1
2
0
3
2
1 4
4
6
5
9
5
2
1
5
2
0
4 1
3
1
0
5
5
2
3
2
4
3
0 1
3
1
1
2
2
6
2
0
2
2
2 1
2
2
2
1
3
2
5
2
2
2
2 1
5
0
0
5
2
0
2
5
3
0
1 2
6
4
3
2
4
2
2
3
6
4
3 1
5
3
2
0
3
2
2
0
4
6
2 0
4
2
1
4
0
2
2
1
3
2
2 1
3
1
1
5
1
0
2
1
3
2
1 Updated PAM matrices
s PAM matrices calculated in 1978 when far
PAM
fewer protein sequences were available
fewer
s Similar method used in 1991 to create new
Similar
matrices, e.g., JTT250
matrices,
s These should be more accurate BLOSUM62 matrix
s An alternative that does not involve an
An
explicit model of evolutionary change is to
just find blocks of similar sequence and
count how often one amino acid is
substituted by another
substituted
s Good for general comparisons without
Good
consideration of evolutionary distance
consideration Origin of BLOSUM matrices
s Start with groupings (families) of proteins
Start
that have similar biochemical function (the
Prosite catalog)
Prosite
s Find ungapped amino acid patterns (blocks)
Find
that were present in each family
that
s Search protein sequence database using
Search
these blocks to add previously unrecognized
members of each family
members Origin of BLOSUM matrices
s
s
s
s
s For each block, count number of changes between
For
each amino acid pair
each
No assumption of “original” amino acid so
No
consider all observed pairs equally
consider
Group blocks by the amount of similarity
Average changes within each group and convert to
Average
log odds form
log
Results in series of matrices, e.g., BLOSUM60 is
Results
for blocks that are 60% identical)
for BLOSUM62 in logodds form
A
A
C
D
E
F
G
H
I
K
L
M
N
P
Q
R
S
T
V
W
Y C D E F G H I K L M N P Q R S T V W Y 4
0
2
1
2
0
2
1
1
1
1
2
1
1
1
1
0
0
3
2 0
9
3
4
2
3
3
1
3
1
1
3
3
3
3
1
1
1
2
2 2
3
6
2
3
1
1
3
1
4
3
1
1
0
2
0
1
3
4
3 1
4
2
5
3
2
0
3
1
3
2
0
1
2
0
0
1
2
3
2 2
2
3
3
6
3
1
0
3
0
0
3
4
3
3
2
2
1
1
3 0
3
1
2
3
6
2
4
2
4
3
0
2
2
2
0
2
3
2
3 2
3
1
0
1
2
8
3
1
3
2
1
2
0
0
1
2
3
2
2 1
1
3
3
0
4
3
4
3
2
1
3
3
3
3
2
1
3
3
1 1
3
1
1
3
2
1
3
5
2
1
0
1
1
2
0
1
2
3
2 1
1
4
3
0
4
3
2
2
4
2
3
3
2
2
2
1
1
2
1 1
1
3
2
0
3
2
1
1
2
5
2
2
0
1
1
1
1
1
1 2
3
1
0
3
0
1
3
0
3
2
6
2
0
0
1
0
3
4
2 1
3
1
1
4
2
2
3
1
3
2
2
7
1
2
1
1
2
4
3 1
3
0
2
3
2
0
3
1
2
0
0
1
5
1
0
1
2
2
1 1
3
2
0
3
2
0
3
2
2
1
0
2
1
5
1
1
3
3
2 1
1
0
0
2
0
1
2
0
2
1
1
1
0
1
4
1
2
3
2 0
1
1
1
2
2
2
1
1
1
1
0
1
1
1
1
5
0
2
2 0
1
3
2
1
3
3
3
2
1
1
3
2
2
3
2
0
4
3
1 3
2
4
3
1
2
2
3
3
2
1
4
4
2
3
3
2
3
11
2 2
2
3
2
3
3
2
1
2
1
1
2
3
1
2
2
2
1
2
7 Ala
Cys
Asp
Glu
Phe
Gly
His
Ile
Lys
Leu
Met
Asn
Pro
Gln
Arg
Ser
Thr
Val
Trp
Tyr Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr
Henikoff, S. and Henikoff, J.G. (1992) Proc. Nat. Acad. Sci. USA 8 9 , 1991510919. Comparison of PAM250 and
BLOSUM62
Comparison of PAM250 and BLOSUM62
12
10
8
6
4
2
0
BLOSUM62 score
2
4
6 See Demonstration A13
10 5 0 5
PAM250 score 10 15 20 Sequence Analysis Tasks
⇒ Calculating the probability of finding a
sequence pattern
sequence Statistics of pattern appearance
s s Goal: Determine the significance of observing a
Goal:
pattern
pattern
Method: Estimate the probability that that pattern
Method:
would occur randomly in a given sequence. Three
different methods
different
x
x x Assume all nucleotides are equally frequent
Use measured frequencies of each nucleotide
Use
(mononucleotide frequencies)
Use measured frequencies with which a given
Use
nucleotide follows another (dinucleotide frequencies)
nucleotide Determining mononucleotide
frequencies
s s
s s s Count how many times each nucleotide appears in
Count
sequence
sequence
Divide (normalize) by total number of nucleotides
Result: fA ≡ mononucleotide frequency of A
(frequency that A is observed)
(frequency
Define: pA≡ mononucleotide probability that a
nucleotide will be an A
nucleotide
pA assumed to equal fA Determining dinucleotide
frequencies
s Make 4 x 4 matrix, one element for each
Make
ordered pair of nucleotides
ordered
s Zero all elements
s Go through sequence linearly, adding one to
Go
matrix entry corresponding to the pair of
sequence elements observed at that position
sequence
s Divide by total number of dinucleotides
s Result: fAC ≡ dinucleotide frequency of AC
(frequency that AC is observed out of all Determining conditional
dinucleotide probabilities
s Divide each dinucleotide frequency by the
Divide
mononucleotide frequency of the first
nucleotide
nucleotide
s Result: p*AC ≡ conditional dinucleotide
conditional
probability of observing a C given an A
of
s p*AC ≡ p(xi+1=C  xi=A)= fAC/ fA Illustration of probability
calculation
s What is the probability of observing the
What
sequence feature ART (A followed by a
ART
purine, either A or G, followed by a T)?
followed
s Using equal mononucleotide frequencies
x pA = pC = pG = pT = 1/4 x pART s = 1/4 * (1/4 + 1/4) * 1/4 = 1/32 Using observed mononucleotide
Using
frequencies:
frequencies:
x pART = pA (pA + pG) pT Interactive Demonstration
s (A2 Mono and Dinucleotide Frequencies) Illustration using dinucleotide
probabilities
s When using dinucleotide probabilities, have
When
to be careful about how the probabilities are
combined
combined
s Which is right?
s pART=pA(p*AA+p*AG)(p*AT+p*GT) [eq.1]
s pART=pA(p*AAp*AT+p*AGp*GT) [eq.2] Expansions
s pART=pA(p*AA+p*AG)(p*AT+p*GT) [eq.1] s pART=pAp*AAp*AT + pAp*AAp*GT
AT
GT
+ pAp*AGp*AT + pAp*AGp*GT)
AT s pART=pA(p*AAp*AT+p*AGp*GT) s pART= pAp*AAp*AT + pAp*AGp*GT
ART=
AT [eq.2] Proof
s pART=pAAT+pAGT s pAAT=pAp*AAp*AT s pAGT=pAp*AGp*GT s pART= pAp*AAp*AT + pAp*AGp*GT
ART=
AT s This matches equation 2 on previous slide Need further convincing?
Imagine that p*AA=0 and p*GT=0 (but all
=0
other p* are nonzero)
other
s Then pART should be zero since there is no
way to create either AAT or AGT
way
s This is predicted by eq. 2 but not by eq. 1
s More complicated probability
illustration
s s What is the probability of observing the sequence
What
feature ARYT (A followed by a purine {either A
ARYT
or G}, followed by a pyrimidine {either C or T},
},
},
followed by a T)?
Using equal mononucleotide frequencies
x
x pA = pC = pG = pT = 1/4
pARYT = 1/4 * (1/4 + 1/4) * (1/4 + 1/4) * 1/4 = 1/64 Illustration (continued)
s Using observed mononucleotide
Using
frequencies:
frequencies:
x pARYT s = pA (pA + pG) (pC + pT) pT Using dinucleotide frequencies:
x pARYT = pA (p*AA (p*ACp*CT + p*ATp*TT) +
AA (p
p*AG (p*GCp*CT + p*GTp*TT) )
AG (p Illustration (continued)
s Using dinucleotide frequencies:
A R A T
+T=AACT +T=AAT +T=AATT +C=AGC +A=AA Y
+C=AAC +T=AGCT +T=AGT +T=AGTT +G=AG Multiply then add
s We conclude that for such strings our rule
We
should be “multiply dinucleotide
probabilities along each allowed path and
then add the results”
then A recursive solution
s s s Some programming languages allow recursion Some
recursion
the calling (invoking) of a function by itself
the
This is useful here because we can branch when
This
branch
we encounter an ambiguous base and consider all
alternatives separately
separately
Allows multiplication down the branches and then
Allows
addition
addition Site Probability Calculation via
Recursion
s Illustration: Make a function that prints out
Illustration:
all possible sequences that can match a
restriction site
restriction
s (Demo Program PossibleSites.c) PossibleSites.c
s s /* PossibleSites.c s
s Prints out all possible sites that
can match a string of IUB codes s
s
s s January 22, 1997  R.F. Murphy */ s s #include <stdio.h>
#include <string.h> s s s
s s
s
s
s
s void PossibleSites(char
SiteString, int Index);
short Test1(char SiteString,
Index);
short Test2(char SiteString,
Index);
short Test3(char SiteString,
Index);
short Test4(char SiteString,
Index); s int
int
int
int void main(void)
{ char Site[10];
do
{
printf("Enter a string of IUB
codes (up to 10 characters): ");
scanf("%s", Site);
PossibleSites(Site,0);
} while (0==0);
} s
s
s
s
s
s
s
s
s
s Test for s
each type s
of
s
ambiguous
s
base
s
s void PossibleSites(char
SiteString, int Index)
{
if (Index>=strlen(SiteString))
{
printf("%s\n",SiteString);
return;
}
else
{
if (Test1(SiteString, Index)) ;
else if (Test2(SiteString,
Index)) ;
else if (Test3(SiteString,
Index)) ;
else if (Test4(SiteString,
Index)) ;
else
{
printf("Illegal character (%c)
encountered\n",SiteString[Index]) ; Unwind here s
s
s
s
s PossibleSites(SiteString,Index+1);
}
}
return;
} s
s
s s
s
s
s
s
s
s
s
s
s
s
s
s short Test1(char SiteString, int
Index)
{
/* printf("In Test1: Index %d,
SiteString[Index]
%c\n",Index,SiteString[Index]); */
switch (SiteString[Index])
{
case 'A':
case 'C':
case 'G':
case 'T':
break;
default:
return false;
}
PossibleSites(SiteString,Index+1);
return true;
} s
s
s
s short Test2(char SiteString, int
Index)
{
char Save;
/* printf("In Test2: Index %d,
SiteString[Index]
%c\n",Index,SiteString[Index]); */ s
s
s
s s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s Save = SiteString[Index];
switch (SiteString[Index])
{
case 'R':
SiteString[Index]='A';
PossibleSites(SiteString,
Index);
SiteString[Index]='G';
PossibleSites(SiteString,
Index);
break;
case 'Y':
SiteString[Index]='C';
PossibleSites(SiteString,
Index);
SiteString[Index]='T';
PossibleSites(SiteString,
Index);
break;
case 'S':
SiteString[Index]='G';
PossibleSites(SiteString,
Index);
SiteString[Index]='C';
PossibleSites(SiteString,
Index);
break; s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s case 'W':
SiteString[Index]='A';
PossibleSites(SiteString,
SiteString[Index]='T';
PossibleSites(SiteString,
break;
case 'M':
SiteString[Index]='A';
PossibleSites(SiteString,
SiteString[Index]='C';
PossibleSites(SiteString,
break;
case 'K':
SiteString[Index]='G';
PossibleSites(SiteString,
SiteString[Index]='T';
PossibleSites(SiteString,
break;
default:
return false;
}
SiteString[Index] = Save;
return true;
} Index);
Index); Index);
Index); Index);
Index); s
s
s
s s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s short Test3(char SiteString, int
Index)
{
char Save;
/* printf("In Test3: Index %d,
SiteString[Index]
%c\n",Index,SiteString[Index]); */ s Save = SiteString[Index];
switch (SiteString[Index])
{
case 'B': /* not A */
SiteString[Index]='C';
PossibleSites(SiteString,
Index);
SiteString[Index]='G';
PossibleSites(SiteString,
Index);
SiteString[Index]='T';
PossibleSites(SiteString,
Index);
break;
case 'D': /* not C */
SiteString[Index]='A';
PossibleSites(SiteString,
Index);
SiteString[Index]='G';
PossibleSites(SiteString,
Index);
SiteString[Index]='T';
PossibleSites(SiteString,
Index);
break; s s
s s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s case 'H': /* not G */
SiteString[Index]='A';
PossibleSites(SiteString,
Index);
SiteString[Index]='C';
PossibleSites(SiteString,
Index);
SiteString[Index]='T';
PossibleSites(SiteString,
Index);
break;
case 'V': /* not T/U */
SiteString[Index]='A';
PossibleSites(SiteString,
Index);
SiteString[Index]='C';
PossibleSites(SiteString,
Index);
SiteString[Index]='G';
PossibleSites(SiteString,
Index);
break;
default:
return false;
}
SiteString[Index] = Save;
return true;
} s
s
s short Test4(char SiteString, int
Index)
{
char Save; s /* printf("In Test4: Index %d,
SiteString[Index]
%c\n",Index,SiteString[Index]); */ s
s
s
s Save = SiteString[Index];
switch (SiteString[Index])
{
case 'N': /* A,C,G,T/U
(iNdeterminate) */
case 'X': /* alternate for N */
SiteString[Index]='A';
PossibleSites(SiteString,
Index);
SiteString[Index]='C';
PossibleSites(SiteString,
Index);
SiteString[Index]='G';
PossibleSites(SiteString,
Index);
SiteString[Index]='T';
PossibleSites(SiteString,
Index);
break;
default:
return false;
}
SiteString[Index] = Save;
return true;
} s
s
s
s
s
s
s
s
s
s
s
s
s
s
s
s Another illustration
s What is pACT in the sequence
TTTAACTGGG?
TTTAACTGGG?
x fA = 2/10, fC = 1/10 x pA = 0.2 x fAC = 1/10, fCT = 1/10 x p*AC = 0.1/0.2 = 0.5, p*CT = 0.1/0.1 = 1 s pACT = pA p*AC p*CT = 0.2 * 0.5 * 1 = 0.1
0.1
AC s (would have been 1/5 * 1/10 * 4/10 = 0.008
(would
0.008
using mononucleotide frequencies)
using Expected number and spacing
s Probabilities are per nucleotide
Probabilities
per
s How do we calculate number of expected
How
number
features in a sequence of length L?
features
x Expected number (for large L) ≅
Expected s Lp How do we calculate the expected spacing
How
spacing
between features?
between
x µART ≡ expected spacing between ART features
= 1/pART
1/p Sequence Analysis Tasks
⇒ Calculating the probability of finding an
alignment with a particular score
alignment Statistics of pattern appearance
s s Goal: Determine the significance of observing a
Goal:
particular alignment between two sequences
particular
Method: Estimate the probability that that
Method:
alignment would occur randomly in any pair of
sequences of the same lengths and compositions.
sequences
x Use measured frequencies of each nucleotide
Use
(mononucleotide frequencies) Probability of consecutive
matches
s What is probability of j consecutive matches
What
between two sequences of equal
mononucleotide frequencies?
mononucleotide
(1/4)j
s If nonequal mononucleotide frequencies,
If
use p as the probability of a match between
two positions
two
2
A 2
C 2
G 2
T p= p + p + p + p Probability of consecutive
matches
s s s Probability of j matches using p
Probability
pj
Model as flipping coin with p chance of heads Model
heads corresponds to matching a given position in
the two sequences
the
Expected length of longest run between two
Expected
sequences of length n (Erdos and Renyi law)
(Erdos log1 / p ( n )
s This is for just one alignment between the two Expected longest match length
s Given M is the length of the longest match
Given
between two sequences of length m and n, what is
what
E(M), the expected value of M?
M? E ( M ) @log1 / p ( mn ) + log1 / p (1  p) + g log(e)  1 / 2
or
E ( M ) @log1 / p (Kmn)
s where K depends on p (which depends on base
where
composition) and γ =0.577 (Euler’s number) KarlinAltschul formulation
s p is easy to estimate for nucleic acid comparison
without a scoring matrix, but how do we estimate
it when using a scoring matrix? Karlin and
Altschul rewrote the previous result as
Altschul E ( M ) @[log e (Kmn )] / l
where
l = log e (1 / p)
s and then generalize to allow λ to be calculated
and
empirically for a given scoring matrix and
sequence composition
sequence Estimating Significance of Local
Alignment
s s The distribution of scores obtained for aligning a
The
random pair of sequences follow an extreme value
distribution
distribution
Can calculate probability that a score S greater
Can
than equal to x will be observed P ( S ³ x ) @Kmne lx Estimating Significance of Local
Alignment P ( S ³ x ) @Kmne
s lx Can convert to score corresponding to a particular
Can
probability threshold (e.g., 0.05)
probability S = log 2 (Kmn )  log 2 P
S = log 2 (K / P ) + log 2 ( nm)
s For P=0.05, first term is negligible, thus S » log 2 ( nm) Reading for next class
s Jones/Pevzner, rest of Chapter 6 ...
View
Full
Document
This note was uploaded on 12/03/2011 for the course BIO 118 taught by Professor Staff during the Fall '08 term at Rutgers.
 Fall '08
 Staff
 Computational Biology

Click to edit the document details