MIT OpenCourseWare
http://ocw.mit.edu

6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution
Fall 2008

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

6.047/6.878 Lecture 7: HMMs II, September 25, 2008

October 1, 2008

The previous lecture introduced hidden Markov models (HMMs), a technique used to infer “hidden” information, such as whether a particular nucleotide is part of the coding sequence of a gene, from observable information, such as the sequence of nucleotides. Recall that a Markov chain consists of states Q, initial state probabilities p, and state transition probabilities A. The key assumption that makes the chain a Markov chain is that the probability of going to a particular state depends only on the previous state, not on all the ones before that. A hidden Markov model has the additional property of emitting a series of observable outputs, one from each state, with various emission probabilities E. Because the observations do not allow one to uniquely infer the states, the model is “hidden.”

The principle we used to determine hidden states in an HMM was the same as the principle used for sequence alignment. In alignment, we had an exponential number of possible alignments; in the HMM matching problem, we have an exponential number of possible parse sequences, i.e., choices of generating states. Indeed, in an HMM with k states, at each position we can be in any of k states; hence, for a sequence of length n, there are k^n possible parses. As we have seen, in both cases we nonetheless avoid actually doing exponential work by using dynamic programming.

HMMs present several problems of interest beyond simply finding the optimal parse sequence, however. So far, we have discussed the Viterbi decoding algorithm for finding the single optimal path that could have generated a given sequence, and scoring (i.e., computing the probability of) such a path.
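As a rough illustration of the point about avoiding exponential work, the sketch below runs Viterbi decoding on a toy model and compares it against explicit enumeration of all k^n parses. The two states ("C" for coding, "N" for non-coding), every probability value, and the function names are assumptions made up for this example; they do not come from the lecture.

```python
from itertools import product

# Hypothetical two-state toy HMM; all numbers below are illustrative assumptions.
states = ["C", "N"]                        # Q: coding / non-coding
p = {"C": 0.5, "N": 0.5}                   # initial state probabilities
A = {"C": {"C": 0.9, "N": 0.1},            # transition probabilities A[from][to]
     "N": {"C": 0.2, "N": 0.8}}
E = {"C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},   # emission probabilities E[state][symbol]
     "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}


def path_probability(x, path):
    """Joint probability of emitting x along one specific state path."""
    prob = p[path[0]] * E[path[0]][x[0]]
    for i in range(1, len(x)):
        prob *= A[path[i - 1]][path[i]] * E[path[i]][x[i]]
    return prob


def viterbi(x):
    """Most probable state path for x, using O(k^2 * n) work."""
    # V[i][s] = probability of the best path that ends in state s at position i
    V = [{s: p[s] * E[s][x[0]] for s in states}]
    back = []                              # back[i-1][s] = best predecessor of s at position i
    for i in range(1, len(x)):
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda r: V[-1][r] * A[r][s])
            col[s] = V[-1][prev] * A[prev][s] * E[s][x[i]]
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, V[-1][last]


def brute_force(x):
    """Same answer by scoring all k^n parses; only feasible for tiny n."""
    best = max(product(states, repeat=len(x)),
               key=lambda q: path_probability(x, q))
    return list(best), path_probability(x, best)


if __name__ == "__main__":
    x = "GCGCATAT"
    print(viterbi(x))
    print(brute_force(x))
```

Both functions should report the same best probability; the brute-force version scores 2^8 = 256 paths here, while the dynamic program fills an 8-by-2 table. A practical implementation would typically work with log probabilities to avoid underflow on long sequences.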
We also discussed the Forward algorithm for computing the total probability of a given sequence being generated by a particular HMM over all possible state paths that could have generated it; the method is yet another application of dynamic programming. One motivation for computing this probability is the desire to measure the accuracy of a model. Being able to compute the total probability of a sequence allows us to compare alternate models by asking the question: “Given a portion of a genome, how likely is it that each HMM produced this sequence?”

Although we now know the Viterbi decoding algorithm for finding the single optimal path, we will talk about another notion of decoding known as posterior decoding, which finds the most likely state at any position of a sequence (given the knowledge that our HMM produced the entire sequence). The posterior decoding algorithm will apply both the forward algorithm and the closely related backward algorithm. After this discussion, we will pause for an aside on encoding “memory”...
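The forward, backward, and posterior decoding computations described above can be sketched as follows. This is a minimal sketch on an assumed toy model: the state names, the probability values, and the function names (`forward`, `backward`, `posterior_decode`) are hypothetical choices for illustration, and the code uses raw probabilities rather than the scaling or log-space arithmetic a real implementation would need for long sequences.

```python
# Hypothetical two-state toy HMM; all numbers below are illustrative assumptions.
states = ["C", "N"]                        # Q: coding / non-coding
p = {"C": 0.5, "N": 0.5}                   # initial state probabilities
A = {"C": {"C": 0.9, "N": 0.1},            # transition probabilities A[from][to]
     "N": {"C": 0.2, "N": 0.8}}
E = {"C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},   # emission probabilities E[state][symbol]
     "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}


def forward(x):
    """f[i][s] = P(x[0..i], state at position i is s)."""
    f = [{s: p[s] * E[s][x[0]] for s in states}]
    for i in range(1, len(x)):
        prev = f[-1]
        f.append({s: E[s][x[i]] * sum(prev[r] * A[r][s] for r in states)
                  for s in states})
    return f


def backward(x):
    """b[i][s] = P(x[i+1..n-1] | state at position i is s)."""
    n = len(x)
    b = [None] * n
    b[n - 1] = {s: 1.0 for s in states}    # nothing left to emit after the last position
    for i in range(n - 2, -1, -1):
        b[i] = {s: sum(A[s][r] * E[r][x[i + 1]] * b[i + 1][r] for r in states)
                for s in states}
    return b


def posterior_decode(x):
    """Most likely state at each position, given the whole sequence x."""
    f, b = forward(x), backward(x)
    total = sum(f[-1].values())            # P(x), summed over all state paths
    post = [{s: f[i][s] * b[i][s] / total for s in states}
            for i in range(len(x))]
    path = [max(col, key=col.get) for col in post]
    return path, post, total


if __name__ == "__main__":
    path, post, total = posterior_decode("GCGCATAT")
    print("P(x) =", total)
    print("posterior path:", "".join(path))
```

Note that the posterior path picks the best state independently at each position, so it need not coincide with the single best Viterbi path, and it can in principle include transitions that the model assigns low (or even zero) probability.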
 Fall '08
 ManolisKellis
