This preview shows page 1. Sign up to view the full content.
Unformatted text preview: MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution
Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. 6.047/6.878  Computational Biology: Genomes, Networks, Evolution Modeling Biological Sequence using Hidden Markov Models Lecture 6 Sept 23, 2008 Challenges in Computational Biology
4 Genome Assembly 9 Regulatory motif discovery 6 Gene Finding DNA 2 14 Comparative Genomics 10 Evolutionary Theory
TCATGCTAT TCGTGATAA TGAGGATAT TTATCATAT TTATGATTT Sequence alignment 3 Database search 5 Gene expression analysis RNA transcript 4 Cluster discovery 9 Gibbs sampling 11 Protein network analysis 12 Regulatory network inference 13 Emerging network properties What have we learned so far?
• String searching and counting
– Bruteforce algorithm – Wmer indexing • Sequence alignment
– Dynamic programming, duality path alignment – Global / local alignment, general gap penalties • String comparison
– Exact string match, seminumerical matching • Rapid database search
– Exact matching: Hashing, BLAST – Inexact matching: neighborhood search, projections • Problem set 1 So, you find a new piece of DNA… What do you do?
…GTACTCACCGGGTTACAGGATTATGGGTTACAGGTAACCGTT… • Align it to things we know about • Align it to things we don’t know about • Stare at it
– Nonstandard nucleotide composition? – Interesting kmer frequencies? – Recurring patterns? • Model it
– Make some hypotheses about it – Build a ‘generative model’ to describe it – Find sequences of similar type This week: Modeling biological sequences
(a.k.a. What to do with a huge chunk of DNA)
Intergenic CpG Promoter First island exon Intron Other exon Intron TTACAGGATTATGGGTTACAGGTAACCGTTGTACTCACCGGGTTACAGGATTATGGGTTACAGGTAACCGGTACTCACCGGGTTACAGGATTATGGTAACGGTACTCACCGGGTTACAGGATTGTTACA GG • Ability to emit DNA sequences of a certain type
– Not exact alignment to previously known gene – Preserving ‘properties’ of type, not identical sequence • Ability to recognize DNA sequences of a certain type (state)
– What (hidden) state is most likely to have generated observations – Find set of states and transitions that generated a long sequence • Ability to learn distinguishing characteristics of each state
– Training our generative models on large datasets – Learn to classify unlabelled data Why Probabilistic Sequence Modeling?
• Biological data is noisy • Probability provides a calculus for manipulating models • Not limited to yes/no answers – can provide “degrees of belief” • Many common computational tools based on probabilistic models • Our tools:
– Markov Chains and Hidden Markov Models (HMMs) Definition: Markov Chain
Definition: A Markov chain is a triplet (Q, p, A), where: Q is a finite set of states. Each state corresponds to a symbol in the alphabet Σ p is the initial state probabilities. A is the state transition probabilities, denoted by ast for each s, t in Q. For each s, t in Q the transition probability is: ast ≡ P(xi = txi1 = s) Output: The output of the model is the set of states at each instant time => the set of states are observable Property: The probability of each symbol xi depends only on the value of the preceding symbol xi1 : P (xi  xi1,…, x1) = P (xi  xi1) Formula: The probability of the sequence:
P(x) = P(xL,xL1,…, x1) = P (xL  xL1) P (xL1  xL2)… P (x2  x1) P(x1) Definitions: HMM (Hidden Markov Model)
Definition: An HMM is a 5tuple (Q, V, p, A, E), where:
Q is a finite set of states, Q=N V is a finite set of observation symbols per state, V=M p is the initial state probabilities. A is the state transition probabilities, denoted by ast for each s, t in Q. For each s, t in Q the transition probability is: ast ≡ P(xi = txi1 = s) E is a probability emission matrix, esk ≡ P (vk at time t  qt = s) underlying random walk between states > “hidden” Output: Only emitted symbols are observable by the system but not the Property: Emissions and transitions are dependent on the current state
only and not on the past. The six algorithmic settings for HMMs
One path Scoring
1. Scoring x, one path P(x,π) Prob of a path, emissions 3. Viterbi decoding π* = argmaxπ P(x,π) Most likely path All paths
2. Scoring x, all paths P(x) = Σπ P(x,π) Prob of emissions, over all paths 4. Posterior decoding π^ = {πi  πi=argmaxk ΣπP(πi=kx)} Path containing the most likely state at any time point. 6. Unsupervised learning Λ* = argmaxΛ ΣπP(x,πΛ) BaumWelch training, over all paths Decoding Learning 5. Supervised learning, given π Λ* = argmaxΛ P(x,πΛ) 6. Unsupervised learning. Λ* = argmaxΛ maxπP(x,πΛ) Viterbi training, best path Example 1: Finding GCrich regions
• Promoter regions frequently have higher counts of Gs and Cs • Model genome as nucleotides drawn independently from two distributions: Background (B) and Promoters (P). • Emission probabilities based on nucleotide composition in each. • Transition probabilities based on relative abundance & avg. length
0.15 0.85 Background (B) 0.25 Promoter Region (P) 0.75 A: 0.25 T: 0.25 G: 0.25 C: 0.25 P(XiB) A: 0.15 T: 0.13 G: 0.30 C: 0.42 P(XiP) TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTGGCA GACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCATATTGGC HMM as a Generative Model
P B P B P B P B P B P B P B P B S: G C A A A T G C P(Li+1Li)
Bi+1 Pi+1 Bi Pi
0.85 0.15 0.25 0.75 P(SB)
A: 0.25 T: 0.25 G: 0.25 C: 0.25 P(SP)
A: 0.42 T: 0.30 G: 0.13 C: 0.15 Sequence Classification
PROBLEM: Given a sequence, is it a promoter region?
– We can calculate P(SMP), but what is a sufficient P value? SOLUTION: compare to a null model and calculate loglikelihood ratio
– e.g. background DNA distribution model, B N P( Si  MP) N P( Si  MP) P( S  MP) Score = log = log ∏ = ∑ log P( S  B) P ( Si  B ) i =1 i =1 P ( Si  B ) Pathogenicity Islands A: 0.15 T: 0.13 G: 0.30 C: 0.42 Background DNA Score Matrix A: 0.73 T: 0.94 G: 0.26 C: 0.74 Finding GCrich regions
• Could use the loglikelihood ratio on windows of fixed size • Downside: have to evaluate all islands of all lengths repeatedly • Need: a way to easily find transitions
TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTGGCA GACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCATATTGGC Probability of a sequence if all promoter 0.75 0.75 0.75 0.75 0.75 0.75 0.75 L: P B P B P B P B P B P B P B P B S: G 0.30 C 0.42 A 0.15 A 0.15 A 0.15 T 0.13 G 0.30 C 0.30 P(x,π)=aP*eP(G)*aPP*eP(G)*aPP*eP(C)*aPP*eP(A)*aPP*… =ap*(0.75)7*(0.15)3*(0.13)1*(0.30)2*(0.42)2 =9.3*107
Why is this so small?
A: 0.15 T: 0.13 G: 0.30 C: 0.42 Probability of the same sequence if all background L: P B
0.85 0.25 P B
0.85 0.25 P B
0.85 0.25 P B
0.85 0.25 P B
0.85 0.25 P B
0.85 0.25 P B
0.85 0.25 P B
0.25 S: G C A A A T G C P = P(G  B) P( B1  B0 ) P(C  B) P( B2  B1 ) P( A  B) P( B3  B2 )...P(C  B7 ) = (0.85)7 × (0.25)8 = 4.9 ×10−6
Compare relative probabilities: 5fold more likely!
A: 0.25 T: 0.25 G: 0.25 C: 0.25 Probability of the same sequence if mixed L: P B
0.85 0.25 P B
0.85 0.25 P
0.15 P B 0.75 P B 0.75 P B
0.30 P
0.25 P B
0.85 0.25 B
0.25 B
0.25 0.42 0.42 S: G C A A A T G C P = P(G  B) P( B1  B0 ) P(C  B) P( B2  B1 ) P( A  B) P( P3  B2 )...P(C  B7 ) = (0.85)3 × (0.25)6 × (0.75) 2 × (0.42) 2 × 0.30 × 0.15 = 6.7 ×10−7
Should we try all possibilities? What is the most likely path? The six algorithmic settings for HMMs
One path Scoring
1. Scoring x, one path P(x,π) Prob of a path, emissions 3. Viterbi decoding π* = argmaxπ P(x,π) Most likely path All paths
2. Scoring x, all paths P(x) = Σπ P(x,π) Prob of emissions, over all paths 4. Posterior decoding π^ = {πi  πi=argmaxk ΣπP(πi=kx)} Path containing the most likely state at any time point. 6. Unsupervised learning Λ* = argmaxΛ ΣπP(x,πΛ) BaumWelch training, over all paths Decoding Learning 5. Supervised learning, given Λ* = argmaxΛ P(x,πΛ) 6. Unsupervised learning. Λ* = argmaxΛ maxπP(x,πΛ) Viterbi training, best path 3. DECODING: What was the sequence of hidden states?
Given: Given: Find: Model parameters ei(.), aij Sequence of emissions x Sequence of hidden states π Finding the optimal path
• We can now evaluate any path through hidden states, given the emitted sequences • How do we find the best path? • Optimal substructure! Best path through a given state is:
– Best path to previous state – Best transition from previous state to this state – Best path to the end state Viterbi algortithm – Define Vk(i) = Probability of the most likely path through state πi=k – Compute Vk(i+1) as a function of maxk’ { Vk’(i) } – Vk(i+1) = ek(xi+1) * maxj ajk Vj(i) Dynamic Programming Finding the most likely path 1 2
… 1 2
… 1 2
… … … 1 2
… K
x1 K
x2 K
x3 … K
xK • Find path π* that maximizes total joint probability P[ x, π ] • P(x,π) = a0π * Πi eπ (xi) × aπ π
1 i i i+1 start emission transition Calculate maximum P(x,π) recursively
… hidden states … ajk Vj(i1) j … k
ek Vk(i) observations xi1 xi • Assume we know Vj for the previous time step (i1) • Calculate Vk(i) =
current max ek(xi) * maxj ( Vj(i1) × ajk
this emission max ending in state j at step i all possible previous states j ) Transition from state j The Viterbi Algorithm
State 1 2 Vk(i) K x1 x2 x3 ………………………………………..xN Input: x = x1……xN Initialization: V0(0)=1, Vk(0) = 0, for all k > 0 Iteration: Vk(i) = eK(xi) × maxj ajk Vj(i1) Termination: P(x, π*) = maxk Vk(N) Traceback: Follow max pointers back Similar to aligning states to seq In practice: Use log scores for computation Running time and space: Time: O(K2N) Space: O(KN) The six algorithmic settings for HMMs
One path Scoring
1. Scoring x, one path P(x,π) Prob of a path, emissions 3. Viterbi decoding π* = argmaxπ P(x,π) Most likely path All paths
2. Scoring x, all paths P(x) = Σπ P(x,π) Prob of emissions, over all paths 4. Posterior decoding π^ = {πi  πi=argmaxk ΣπP(πi=kx)} Path containing the most likely state at any time point. 6. Unsupervised learning Λ* = argmaxΛ ΣπP(x,πΛ) BaumWelch training, over all paths Decoding Learning 5. Supervised learning, given π Λ* = argmaxΛ P(x,πΛ) 6. Unsupervised learning. Λ* = argmaxΛ maxπP(x,πΛ) Viterbi training, best path 2. EVALUATION (how well does our model capture the world) Given: Given: Find: Model parameters ei(.), aij Sequence of emissions x P(xM), summed over all possible paths π Simple: Given the model, generate some sequence x
1
a02 1 2
… 1 2
… … … 1 2
… 2
… 0 K
e2(x1) K K … K x1
1. 2. 3. 4. x2 x3 xn Given a HMM, we can generate a sequence of length n as follows:
Start at state π1 according to prob a0π1 Emit letter x1 according to prob eπ1(x1) Go to state π2 according to prob aπ1π2 … until emitting xn We have some sequence x that can be emitted by p. Can calculate its likelihood. However, in general, many different paths may emit this same sequence x. How do we find the total probability of generating a given x, over any path? Complex: Given x, was it generated by the model?
1
a02 1 2
… 1 2
… … … 1 2
… 2
… 0 K
e2(x1) K K … K x1 x2 x3 xn Given a sequence x, What is the probability that x was generated by the model (using any path)?
– P(x) = Σπ P(x,π) • Challenge: exponential number of paths Calculate probability of emission over all paths
• Each path has associated probability
– Some paths are likely, others unlikely: sum them all up Return total probability that emissions are observed, summed over all paths – Viterbi path is the most likely one
• How much ‘probability mass’ does it contain? • (cheap) alternative: – Calculate probability over maximum (Viterbi) path π* – Good approximation if Viterbi has highest density – BUT: incorrect • (real) solution – Calculate the exact sum iteratively
P(x) = Σπ P(x,π) – Can use dynamic programming The Forward Algorithm – derivation
Define the forward probability: fl(i) = P(x1…xi, πi = l) = Σπ1…πi1 P(x1…xi1, π1,…, πi2, πi1, πi = l) el(xi) akl el(xi) = Σk Σπ1…πi2 P(x1…xi1, π1,…, πi2, πi1=k) = Σk fk(i1) akl el(xi) = el(xi) Σk fk(i1) akl Calculate total probability Σπ P(x,π) recursively
… hidden states … ajk fj(i1) j … k
ek fk(i) observations xi1 xi • Assume we know fj for the previous time step (i1) • Calculate fk(i) =
updated sum ek(xi) * sumj ( fj(i1)
this emission × ajk ) sum ending in state j at step i transition from state j every possible previous state j The Forward Algorithm
State 1 2 fk(i) K x1 x2 x3 ………………………………………..xN Input: x = x1……xN Initialization: f0(0)=1, fk(0) = 0, for all k > 0 Iteration: fk(i) = eK(xi) × sumj ajk fj(i1) Termination: P(x, π*) = sumk fk(N) In practice: Sum of log scores is difficult approximate exp(1+p+q) scaling of probabilities Running time and space: Time: O(K2N) Space: O(KN) The six algorithmic settings for HMMs
One path Scoring
1. Scoring x, one path P(x,π) Prob of a path, emissions 3. Viterbi decoding π* = argmaxπ P(x,π) Most likely path All paths
2. Scoring x, all paths P(x) = Σπ P(x,π) Prob of emissions, over all paths 4. Posterior decoding π^ = {πi  πi=argmaxk ΣπP(πi=kx)} Path containing the most likely state at any time point. 6. Unsupervised learning Λ* = argmaxΛ ΣπP(x,πΛ) BaumWelch training, over all paths Decoding Learning 5. Supervised learning, given π Λ* = argmaxΛ P(x,πΛ) 6. Unsupervised learning. Λ* = argmaxΛ maxπP(x,πΛ) Viterbi training, best path Introducing memory
• State, emissions, only depend on current state • How do you count dinucleotide frequencies?
– CpG islands – Codon triplets – Dicodon frequencies • Introducing memory to the system
– Expanding the number of states Example 2: CpG islands: incorporating memory
aAT
A+ T+ A+ C+
A: 0 C: 1 G: 0 T: 0 G+
A: 0 C: 0 G: 1 T: 0 T+
A: 0 C: 0 G: 0 T: 1 aAC
C+ aGT aGC
G+
A: 1 C: 0 G: 0 T: 0 • Markov Chain
– Q: states – p: initial state probabilities – A: transition probabilities • HMM
– – – – – Q: states V: observations p: initial state probabilities A: transition probabilities E: emission probabilities Counting nucleotide transitions: Markov/HMM
T+ G+ A+ C+ aAT
A+ T+ aAC
C+ aGT aGC
G+ • Markov Chain
– Q: states – p: initial state probabilities – A: transition probabilities • HMM
– – – – – Q: states V: observations p: initial state probabilities A: transition probabilities E: emission probabilities A: 1 C: 0 G: 0 T: 0 A: 0 C: 1 G: 0 T: 0 A: 0 C: 0 G: 1 T: 0 A: 0 C: 0 G: 0 T: 1 A: 1 C: 0 G: 0 T: 0 A: 0 C: 1 G: 0 T: 0 A: 0 C: 0 G: 1 T: 0 A: 0 C: 0 G: 0 T: 1 A+ C+ G+ T+ What have we learned ?
• Modeling sequential data
– Recognize a type of sequence, genomic, oral, verbal, visual, etc… • Definitions
– Markov Chains – Hidden Markov Models (HMMs) • Simple examples
– Recognizing GCrich regions. – Recognizing CpG dinucleotides • Our first computations
– – – – Running the model: know model generate sequence of a ‘type’ Evaluation: know model, emissions, states p? Viterbi: know model, emissions find optimal path Forward: know model, emissions total p over all paths • Next time:
– Posterior decoding – Supervised learning – Unsupervised learning: BaumWelch, Viterbi training ...
View
Full
Document
This note was uploaded on 09/24/2010 for the course EECS 6.047 / 6. taught by Professor Manoliskellis during the Fall '08 term at MIT.
 Fall '08
 ManolisKellis

Click to edit the document details