This preview shows page 1. Sign up to view the full content.
Unformatted text preview: Computational Biology, Part 5
Hidden Markov Models
Robert F. Murphy
Copyright © 20052009.
Copyright
All rights reserved. Markov chains
s If we can predict all of the properties of a
If
sequence knowing only the conditional
dinucleotide probabilities, then that
sequence is an example of a Markov chain
Markov
s A Markov chain is defined as a sequence
Markov
of states in which each state depends only
on the previous state
on Formalism for Markov chains
s
s M=(Q,π,P) is a Markov chain, where
Q = vector (1,..,n) is the list of states
is
x s π= vector (p1,..,pn) is the initial probability of each state
x s Q(1)=A, Q(2)=C, Q(3)=G, Q(4)=T for DNA
(1)=A, (2)=C, (3)=G,
π i)=pQ(i) (e,g., π(1)=pA for DNA)
(
for P= n x n matrix where the entry in row i and column j is
the probability of observing state j if the previous state is i
and the sum of entries in each row is 1 (≡ dinucleotide
and
probabilities)
x P(i,j)=p*Q(i)Q(i) (e.g., P(1,2)=p*AC for DNA)
Q(i)Q(i) Generating Markov chains
s s s Given Q,π,P (and a random number generator), we
Given Q,
can generate sequences that are members of the
Markov chain M
Markov
If π,P are derived from a single sequence, the
If ,P
family of sequences generated by M will include
that sequence as well as many others
that
If π,P are derived from a sampled set of sequences,
If ,P
the family of sequences generated by M will be
the population from which that set has been
sampled
sampled Interactive Demonstration
s (A11 Markov chains) Matlab code for generating
Markov chains
chars = ['a' 'c' 'g' 't'];
% the dinucs array shows the frequency of observing the character in the
% row followed by the character in the column
row
% these values show strong preference for cc
dinucs = [2, 1, 2, 0; 0, 8, 0, 1; 2, 0, 2, 0; 1, 0, 0, 1];
% these values restrict transitions more
%dinucs = [2, 0, 2, 0; 0, 8, 0, 0; 2, 0, 2, 0; 1, 1, 0, 1];
% calculate mononucleotide frequencies only as the probability of
calculate
% starting with each nucleotide
monocounts = sum(dinucs,2);
monofreqs = monocounts/sum(monocounts);
cmonofreqs = cumsum(monofreqs); Matlab code for generating
Markov chains
% calculate dinucleotide frequencies and cumulative dinuc freqs
freqs = dinucs./repmat(monocounts,1,4);
cfreqs = cumsum(freqs,2);
disp('Dinucleotide frequencies (transition probabilities)');
disp('Dinucleotide
fprintf(' %c
%c
%c
%c\n',chars)
for i=1:4
fprintf('%c %f %f %f %f\n',chars(i),freqs(i,:))
fprintf('%c
end Matlab code for generating
Markov chains
nseq = 10;
for ntries=1:20
rnums = rand(nseq,1);
rnums
% start sequence using mononucleotide frequencies
start
seq(1) = min(find(cmonofreqs>=rnums(1)));
seq(1)
for i=2:nseq
for
% extend it using the appropriate row from the dinuc freqs
extend
seq(i) = min(find(cfreqs(seq(i1),:)>=rnums(i)));
seq(i)
end
end
output=chars(seq);
output=chars(seq);
disp(strvcat(output));
disp(strvcat(output));
end Discriminating between two
states with Markov chains
s To determine which of two states a
To
sequence is more likely to have resulted
from, we calculate
from,
L a
P ( x  model+)
S ( x ) = log
= å log
P ( x  model) i=1
a
L S( x ) = å b xi  1 xi
i =1 +
xi  1 xi
xi  1 xi State probablities for + and models
s +
A
C
G
T Given examples sequences that are from
Given
either + model (CpG island) or  model (not
CpG island), can calculate the probability
that each nucleotide will occur for each
model (the a values for each model)
A
0.180
0.171
0.161
0.079 C
0.274
0.368
0.339
0.355 G
0.426
0.274
0.375
0.384 T
0.120
0.188
0.125
0.182 A
C
G
T A
0.300
0.322
0.248
0.177 C
0.205
0.298
0.246
0.239 G
0.285
0.078
0.298
0.292 T
0.210
0.302
0.208
0.292 Transition probabilities converted
to log likelihood ratios
ß
A
A 0.740
C 0.913
G 0.624
T 1.169 C
0.419
0.302
0.461
0.573 G
0.580
1.812
0.331
0.393 T
0.803
0.685
0.730
0.679 Example
s What is relative probability of C+G+C+
What
compared with CGC?
compared
s First calculate logodds ratio:
S(CGC)= ß(CG) +ß(GC)=1.812+0.461=2.273
s Convert to relative probability:
22.273=4.833
s Relative probability is ratio of (+) to ()
P(+)=4.833 P() Example
s Convert to percentage
P(+) + P() = 1
4.833P() + P() = 1
P() = 1/5.833 = 17%
s Conclusion
P(+)=83% P()=17% Hidden Markov models
s “Hidden” connotes that the sequence is
Hidden”
generated by two or more states that have
different transition probability matrices
different More definitions
s π i = state at position i in a path
path s akl = P(π i = l  π i1 = k)
x probabilityof going from one state to another
x “transition probability” s ek(b) = P(xi = b  π i = k)
x probability of emitting a b when in state k
probability emitting
x “emission probability” Generating sequences (see
previous example code)
s s
s
s
s % force emission to match state (normal Markov
force
model, not hidden)
model,
emit = diag(repmat(1,4,1));
[seq2,states]=hmmgenerate(10,freqs,emit)
output2=chars(seq2);
disp(strvcat(output2)); Decoding
s The goal of using an HMM is often to
The
determine (estimate) the sequence of
underlying states that likely gave rise to an
observed sequence
observed
s This is called “decoding” in the jargon of
This
speech recognition
speech More definitions
s Can calculate the joint probability of a
Can
sequence x and a state sequence π
L P ( x, p ) = a0 p 1 Õ ep i ( x i ) ap i p i +1
i=1 requiring
p L +1 = 0 Determining the optimal path:
the Viterbi algorithm
s Viterbi algorithm is form of dynamic
Viterbi
programming
programming
s Definition: Let vk(i) be the probability of the
(i)
most probable path ending in state k with
observation i
observation Determining the optimal path:
the Viterbi algorithm
s Initialisation (i=0):
=0):
v0(0)=1, vk(0)=0 for k>0
(0)=1, (0)=0 s Recursion (i=1..L):
):
vl(i)=el(xi)maxk(vk(i1)akl)
ptri(l)=argmaxk(vk(i1)akl)
ptr s Termination: P(x,π*)=maxk(vk(L)ak0)
πL*=argmaxk(vk(L)ak0) s Traceback (i=L..1):
..1): πi1*=ptri(πi*) Block Diagram for Viterbi
Algorithm
alphabet
emission
probabilities
transition
probabilities
sequence Viterbi
Algorithm most
probable
state
sequence Multiple paths can give the same
sequence
s The Viterbi algorithm finds the most likely
The
path given a sequence
path
s Other paths could also give rise to the same
Other
sequence
sequence
s How do we calculate the probability of a
How
sequence given an HMM?
sequence Probability of a sequence
s Sum the probabilities of all possible paths
Sum
that give that sequence
that
s Let P(x) be the probability of observing
Let P(x)
sequence x given an HMM P ( x ) = å P ( x, p )
p Probability of a sequence
s Can find P(x) using a variation on Viterbi
Can
P(x)
algorithm using sum instead of max
algorithm
s This is called the forward algorithm
This
forward
s Replace vk(i) with fk(i)=P(x1…xi,πi=k)
Replace Forward algorithm
s Initialisation (i=0):
=0):
f0(0)=1, fk(0)=0 for k>0
(0)=1, (0)=0 s Recursion (i=1..L):
): f l (i) = el ( x i )å f k (i  1) akl
k s Termination: P ( x ) = å f k ( L) ak 0
k Backward algorithm
s We may need to know the probability that a
We
particular observation xi came from a
particular state k given a sequence x,
P(πi=kx)
P( s Use algorithm analogous to forward
Use
algorithm but starting from the end
algorithm Backward algorithm
s Initialisation (i=0):
=0):
bk(L)=ak0 for all k s Recursion (i=L1,…,1):
=L1,…,1): bk (i) = å akl el ( x i +1)bl (i + 1)
l s Termination: P ( x ) = å a0 l el ( x1)bl (1)
l Estimating probability of state at
particular position
s Combine the forward and backward probabilities
Combine
to estimate the posterior probability of the
sequence being in a particular state at a particular
position
position f k (i)bk (i)
P (p i = k  x ) =
P( x) Parameter estimation for HMMs
s Simple when state sequence is known for
Simple
training examples
training
s Can be very complex for unknown paths
Can Estimation when state sequence
known
s Count number of times each transition
Count
occurs, Akl s Count number of times each emission
Count
occurs from each state, Ek(b) s Convert to probabilities E k (b)
Akl
ek (b) =
akl =
å E k (b' )
å Akl '
l' b' BaumWelch
s Make initial parameter estimates
s Use forward algorithm and backward
Use
algorithm to calculate probability of each
sequence according to the model
sequence
s Calculate new model parameters
s Repeat until termination criteria met
Repeat
(change in log likelihood < threshold)
(change Estimating transition frequencies
Probability that akl is used as position i in
Probability
sequence x
f k (i) akl el ( x i +1)bl (i + 1)
P (p i = k, p i +1 = l  x,q ) =
P( x) s s Sum over all positions (i) and all sequences
Sum
(j) to get expected number of times akl is used
1
j
j
j
Akl = å
f k (i) akl el ( x i +1 )bl (i + 1)
jå
j P( x ) i Estimating emission frequencies
s Sum over all positions for which the emitted
Sum
character is b and all sequences E k (b) = å
j 1
j
j
åj f k (i)bk (i)
j
P ( x ) i x = b
i Updating model parameters
s Convert expected numbers to probabilities
Convert
as if expected numbers were actual counts
as Akl
E k (b)
akl =
ek (b) =
å Akl '
å E k (b' )
l' b' Test for termination
s Calculate the log likelihood of the model for all of
Calculate
the sequences using the new parameters
the n å j log P ( x  q ) j =1
s If the change in log likelihood exceeds some
If
threshold, go back and make new estimates of a
and e ...
View
Full
Document
This note was uploaded on 12/03/2011 for the course BIO 118 taught by Professor Staff during the Fall '08 term at Rutgers.
 Fall '08
 Staff
 Computational Biology

Click to edit the document details