LecturesPart05 - Computational Biology Part 5 Hidden Markov Models Robert F Murphy Copyright © 2005-2009 Copyright All rights reserved Markov

Info iconThis preview shows page 1. Sign up to view the full content.

View Full Document Right Arrow Icon
This is the end of the preview. Sign up to access the rest of the document.

Unformatted text preview: Computational Biology, Part 5 Hidden Markov Models Robert F. Murphy Copyright © 2005-2009. Copyright All rights reserved. Markov chains s If we can predict all of the properties of a If sequence knowing only the conditional dinucleotide probabilities, then that sequence is an example of a Markov chain Markov s A Markov chain is defined as a sequence Markov of states in which each state depends only on the previous state on Formalism for Markov chains s s M=(Q,π,P) is a Markov chain, where Q = vector (1,..,n) is the list of states is x s π= vector (p1,..,pn) is the initial probability of each state x s Q(1)=A, Q(2)=C, Q(3)=G, Q(4)=T for DNA (1)=A, (2)=C, (3)=G, π i)=pQ(i) (e,g., π(1)=pA for DNA) ( for P= n x n matrix where the entry in row i and column j is the probability of observing state j if the previous state is i and the sum of entries in each row is 1 (≡ dinucleotide and probabilities) x P(i,j)=p*Q(i)Q(i) (e.g., P(1,2)=p*AC for DNA) Q(i)Q(i) Generating Markov chains s s s Given Q,π,P (and a random number generator), we Given Q, can generate sequences that are members of the Markov chain M Markov If π,P are derived from a single sequence, the If ,P family of sequences generated by M will include that sequence as well as many others that If π,P are derived from a sampled set of sequences, If ,P the family of sequences generated by M will be the population from which that set has been sampled sampled Interactive Demonstration s (A11 Markov chains) Matlab code for generating Markov chains chars = ['a' 'c' 'g' 't']; % the dinucs array shows the frequency of observing the character in the % row followed by the character in the column row % these values show strong preference for c-c dinucs = [2, 1, 2, 0; 0, 8, 0, 1; 2, 0, 2, 0; 1, 0, 0, 1]; % these values restrict transitions more %dinucs = [2, 0, 2, 0; 0, 8, 0, 0; 2, 0, 2, 0; 1, 1, 0, 1]; % calculate mononucleotide frequencies only as the probability of calculate % starting with each nucleotide monocounts = sum(dinucs,2); monofreqs = monocounts/sum(monocounts); cmonofreqs = cumsum(monofreqs); Matlab code for generating Markov chains % calculate dinucleotide frequencies and cumulative dinuc freqs freqs = dinucs./repmat(monocounts,1,4); cfreqs = cumsum(freqs,2); disp('Dinucleotide frequencies (transition probabilities)'); disp('Dinucleotide fprintf(' %c %c %c %c\n',chars) for i=1:4 fprintf('%c %f %f %f %f\n',chars(i),freqs(i,:)) fprintf('%c end Matlab code for generating Markov chains nseq = 10; for ntries=1:20 rnums = rand(nseq,1); rnums % start sequence using mononucleotide frequencies start seq(1) = min(find(cmonofreqs>=rnums(1))); seq(1) for i=2:nseq for % extend it using the appropriate row from the dinuc freqs extend seq(i) = min(find(cfreqs(seq(i-1),:)>=rnums(i))); seq(i) end end output=chars(seq); output=chars(seq); disp(strvcat(output)); disp(strvcat(output)); end Discriminating between two states with Markov chains s To determine which of two states a To sequence is more likely to have resulted from, we calculate from, L a P ( x | model+) S ( x ) = log = å log P ( x | model-) i=1 a L S( x ) = å b xi - 1 xi i =1 + xi - 1 xi xi - 1 xi State probablities for + and models s + A C G T Given examples sequences that are from Given either + model (CpG island) or - model (not CpG island), can calculate the probability that each nucleotide will occur for each model (the a values for each model) A 0.180 0.171 0.161 0.079 C 0.274 0.368 0.339 0.355 G 0.426 0.274 0.375 0.384 T 0.120 0.188 0.125 0.182 A C G T A 0.300 0.322 0.248 0.177 C 0.205 0.298 0.246 0.239 G 0.285 0.078 0.298 0.292 T 0.210 0.302 0.208 0.292 Transition probabilities converted to log likelihood ratios ß A A -0.740 C -0.913 G -0.624 T -1.169 C 0.419 0.302 0.461 0.573 G 0.580 1.812 0.331 0.393 T -0.803 -0.685 -0.730 -0.679 Example s What is relative probability of C+G+C+ What compared with C-G-C-? compared s First calculate log-odds ratio: S(CGC)= ß(CG) +ß(GC)=1.812+0.461=2.273 s Convert to relative probability: 22.273=4.833 s Relative probability is ratio of (+) to (-) P(+)=4.833 P(-) Example s Convert to percentage P(+) + P(-) = 1 4.833P(-) + P(-) = 1 P(-) = 1/5.833 = 17% s Conclusion P(+)=83% P(-)=17% Hidden Markov models s “Hidden” connotes that the sequence is Hidden” generated by two or more states that have different transition probability matrices different More definitions s π i = state at position i in a path path s akl = P(π i = l | π i-1 = k) x probabilityof going from one state to another x “transition probability” s ek(b) = P(xi = b | π i = k) x probability of emitting a b when in state k probability emitting x “emission probability” Generating sequences (see previous example code) s s s s s % force emission to match state (normal Markov force model, not hidden) model, emit = diag(repmat(1,4,1)); [seq2,states]=hmmgenerate(10,freqs,emit) output2=chars(seq2); disp(strvcat(output2)); Decoding s The goal of using an HMM is often to The determine (estimate) the sequence of underlying states that likely gave rise to an observed sequence observed s This is called “decoding” in the jargon of This speech recognition speech More definitions s Can calculate the joint probability of a Can sequence x and a state sequence π L P ( x, p ) = a0 p 1 Õ ep i ( x i ) ap i p i +1 i=1 requiring p L +1 = 0 Determining the optimal path: the Viterbi algorithm s Viterbi algorithm is form of dynamic Viterbi programming programming s Definition: Let vk(i) be the probability of the (i) most probable path ending in state k with observation i observation Determining the optimal path: the Viterbi algorithm s Initialisation (i=0): =0): v0(0)=1, vk(0)=0 for k>0 (0)=1, (0)=0 s Recursion (i=1..L): ): vl(i)=el(xi)maxk(vk(i-1)akl) ptri(l)=argmaxk(vk(i-1)akl) ptr s Termination: P(x,π*)=maxk(vk(L)ak0) πL*=argmaxk(vk(L)ak0) s Traceback (i=L..1): ..1): πi-1*=ptri(πi*) Block Diagram for Viterbi Algorithm alphabet emission probabilities transition probabilities sequence Viterbi Algorithm most probable state sequence Multiple paths can give the same sequence s The Viterbi algorithm finds the most likely The path given a sequence path s Other paths could also give rise to the same Other sequence sequence s How do we calculate the probability of a How sequence given an HMM? sequence Probability of a sequence s Sum the probabilities of all possible paths Sum that give that sequence that s Let P(x) be the probability of observing Let P(x) sequence x given an HMM P ( x ) = å P ( x, p ) p Probability of a sequence s Can find P(x) using a variation on Viterbi Can P(x) algorithm using sum instead of max algorithm s This is called the forward algorithm This forward s Replace vk(i) with fk(i)=P(x1…xi,πi=k) Replace Forward algorithm s Initialisation (i=0): =0): f0(0)=1, fk(0)=0 for k>0 (0)=1, (0)=0 s Recursion (i=1..L): ): f l (i) = el ( x i )å f k (i - 1) akl k s Termination: P ( x ) = å f k ( L) ak 0 k Backward algorithm s We may need to know the probability that a We particular observation xi came from a particular state k given a sequence x, P(πi=k|x) P( s Use algorithm analogous to forward Use algorithm but starting from the end algorithm Backward algorithm s Initialisation (i=0): =0): bk(L)=ak0 for all k s Recursion (i=L-1,…,1): =L-1,…,1): bk (i) = å akl el ( x i +1)bl (i + 1) l s Termination: P ( x ) = å a0 l el ( x1)bl (1) l Estimating probability of state at particular position s Combine the forward and backward probabilities Combine to estimate the posterior probability of the sequence being in a particular state at a particular position position f k (i)bk (i) P (p i = k | x ) = P( x) Parameter estimation for HMMs s Simple when state sequence is known for Simple training examples training s Can be very complex for unknown paths Can Estimation when state sequence known s Count number of times each transition Count occurs, Akl s Count number of times each emission Count occurs from each state, Ek(b) s Convert to probabilities E k (b) Akl ek (b) = akl = å E k (b' ) å Akl ' l' b' Baum-Welch s Make initial parameter estimates s Use forward algorithm and backward Use algorithm to calculate probability of each sequence according to the model sequence s Calculate new model parameters s Repeat until termination criteria met Repeat (change in log likelihood < threshold) (change Estimating transition frequencies Probability that akl is used as position i in Probability sequence x f k (i) akl el ( x i +1)bl (i + 1) P (p i = k, p i +1 = l | x,q ) = P( x) s s Sum over all positions (i) and all sequences Sum (j) to get expected number of times akl is used 1 j j j Akl = å f k (i) akl el ( x i +1 )bl (i + 1) jå j P( x ) i Estimating emission frequencies s Sum over all positions for which the emitted Sum character is b and all sequences E k (b) = å j 1 j j åj f k (i)bk (i) j P ( x ) i| x = b i Updating model parameters s Convert expected numbers to probabilities Convert as if expected numbers were actual counts as Akl E k (b) akl = ek (b) = å Akl ' å E k (b' ) l' b' Test for termination s Calculate the log likelihood of the model for all of Calculate the sequences using the new parameters the n å j log P ( x | q ) j =1 s If the change in log likelihood exceeds some If threshold, go back and make new estimates of a and e ...
View Full Document

This note was uploaded on 12/03/2011 for the course BIO 118 taught by Professor Staff during the Fall '08 term at Rutgers.

Ask a homework question - tutors are online