MIT6_047f08_lec08_slide08

MIT6_047f08_lec08_slide08 - MIT OpenCourseWare...

Info iconThis preview shows page 1. Sign up to view the full content.

View Full Document Right Arrow Icon
This is the end of the preview. Sign up to access the rest of the document.

Unformatted text preview: MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. Computational Biology: Genomes, Networks, Evolution Computational Gene Prediction and Generalized Hidden Markov Models Lecture 8 September 30, 2008 Today • Gene Prediction Overview • HMMs for Gene Prediction • GHMMs for Gene Prediction • Genscan Genome Annotation Genome Sequence Transcription Translation RNA Protein Eukaryotic Gene Structure complete mRNA coding segment ATG ATG TGA TGA exon intron exon intron exon ATG . . . GT start codon AG ... GT AG . . . TGA donor site acceptor site Courtesy of William Majoros. Used with permission. donor site acceptor stop codon site http://geneprediction.org/book/classroom.html Translation Chain of amino acids r RNA one amino acid transf e Anti-codon (3 bases) Codon (3 bases) messenger RNA (mRNA) Courtesy of William Majoros.Used with permission. (performs translation) http://geneprediction.org/book/classroom.html Ribosome Genetic Code Acid A Codons GCA GCC GCG GCT Acid G Codons GGA GGC GGG GGT Acid M Codons ATG Acid S Codons AGC AGT TCA TCC TCG TCT ACA ACC ACG ACT GTA GTC GTG GTT TGG TAC TAT Each amino acid is encoded by one or more codons. Each codon encodes a single amino acid. The third position of the codon is the most likely to vary, for a given amino acid. C TGC TGT GAC GAT H CAC CAT ATA ATC ATT AAA AAG CTA CTC CTG CTT TTA N AAC AAT CCA CCC CCG CCT CAA CAG AGA AGG CGA CGC CGG CGT T D I P V E F GAA GAG TTC TTT K L Q R W Y Figure by MIT OpenCourseWare. http://geneprediction.org/book/classroom.html Gene Prediction as “Parsing” • Given a genome sequence, we wish to label each nucleotide according to the parts of genes – Exon, intron, intergenic, etc • The sequence of labels must follow the syntax of genes – e.g. exons must be followed by introns or intergenic not by other exons • We wish to find the optimal parsing of a sequence by some measure Features A feature is any DNA subsequence of biological significance. For practical reasons, we recognize two broad classes of features: signals — short, fixed-length features content regions — variable-length features Courtesy of William Majoros. Used with permission. http://geneprediction.org/book/classroom.html Signals Associated with short fixed(-ish) length sequences Start Codon - ATG 5’ E I E I E Stop Codons – TAA, TAG,TGA 3’ 5’ Splice Site (Acceptor) 3’ Splice Site (Donor) Content Regions Content regions often have characteristic base composition Example • Recall: often multiple codons for each amino acid • All codons are not used equally Characteristic higher order nucleotide statistics in coding sequences (hexanucleotides) Pexon(Xi | Xi-1, Xi-2, Xi-3, Xi-4, Xi-5) 5’ E I E I E 3’ P(Xi | Xi-1, Xi-2, Xi-3, Xi-4, Xi-5) = P(Xi) Extrinsic Evidence intron exon Gene Gene Prediction Algorithms BLAST Hits HMMer Domains EST Alignments Neurospora crassa (a fungus) HMMs for Gene Prediction • States correspond to gene and genomic regions (exons, introns, intergenic, etc) • State transitions ensure legal parses • Emission matrices describe nucleotide statistics for each state A (Very) Simple HMM Donor T Donor T Intron Intron Acceptor A Acceptor A Donor G Donor G Start Start Codon G Codon G Start Start Codon T Codon T Start Start Codon A Codon A Intergenic Intergenic Exon Exon Acceptor G Acceptor G Stop Stop Codon G Codon G Stop Stop Codon T Codon T Stop Stop Codon A Codon A the Markov the model: model: q0 q0 Courtesy of William Majoros. Used with permission. http://geneprediction.org/book/classroom.html A Generative Model We can use this HMM to generate a sequence and state labeling • • The initial state is q0 Choose a subsequent state, conditioned on the current state, according to ajk=P(qk|qj) Choose a nucleotide to emit from the state emissions matrix ek(Xi) Repeat until number of nucleotides equals desired length of sequence Donor T Donor T Intron Intron Acceptor A Acceptor A Donor G Donor G Start Start Codon G Codon G Start Start Codon T Codon T Start Start Codon A Codon A Intergenic Intergenic Exon Exon Acceptor G Acceptor G Stop Stop Codon G Codon G Stop Stop Codon T Codon T Stop Stop Codon A Codon A • • q0 q0 But We Usually Have the Sequence Donor T Donor T Intron Intron Acceptor A Acceptor A Donor G Donor G Start Start Codon G Codon G Start Start Codon T Codon T Start Start Codon A Codon A Intergenic Intergenic Exon Exon Acceptor G Acceptor G Stop Stop Codon G Codon G Stop Stop Codon T Codon T Stop Stop Codon A Codon A the Markov the model: model: q0 q0 the input sequence: AGCTAGCAGTATGTCATGGCATGTTCGGAGGTAGTACGTAGAGGTAGCTAGTATAGGTCGATAGTACGCGA What is the best state labeling? Courtesy of William Majoros. Used with permission. http://geneprediction.org/book/classroom.html Finding The Most Likely Path • A sensible choice is to choose π that maximizes P[π|x] • This is equivalent to finding path π* that maximizes total joint probability P[ x, π ]: P(x,π) = a0π * Πi eπ (xi) × aπ π 1 i i i+1 start emission transition How do we select π∗ efficiently? A (Very) Simple HMM Donor T Donor T Intron Intron Acceptor A Acceptor A Donor G Donor G Start Start Codon G Codon G Start Start Codon T Codon T Start Start Codon A Codon A Intergenic Intergenic Exon Exon Acceptor G Acceptor G Stop Stop Codon G Codon G Stop Stop Codon T Codon T Stop Stop Codon A Codon A the Markov model: q0 q0 the input sequence: the most probable path: the gene prediction: AGCTAGCAGTATGTCATGGCATGTTCGGAGGTAGTACGTAGAGGTAGCTAGTATAGGTCGATAGTACGCGA exon 1 exon 1 exon 2 exon 2 exon 3 exon 3 Courtesy of William Majoros. Used with permission. http://geneprediction.org/book/classroom.html A Real HMM Gene Predictor Title page of journal article removed due to copyright restrictions. The article is the following: Krogh, Anders, I. Saira Mian, and David Haussler. "A Hidden Markov Model That Finds Genes in E.coli DNA." Nucleic Acids Research 22, no. 22 (1994): 4768-4778. HMM Limitations The HMM framework imposes constraints on state paths… Donor T Donor T Intron Intron Acceptor A Acceptor A Donor G Donor G Start Start Codon G Codon G Start Start Codon T Codon T Start Start Codon A Codon A Intergenic Intergenic Exon Exon Acceptor G Acceptor G Stop Stop Codon G Codon G Stop Stop Codon T Codon T Stop Stop Codon A Codon A q0 q0 Human Exon Lengths Courtesy of Christopher Burge. Used with permission. Burge, MIT PhD Thesis 10% 0% 20% 30% 40% 50% 10% 20% 30% 40% 50% 0% N. crassa C. neoformans Fungal Intron Lengths Nucleotides 10% 20% 30% 40% 50% 10% 0% 0% 31-40 71-80 111-120 151-160 191-200 231-240 271-280 311-320 351-360 391-400 431-440 471-480 511-520 551-560 591-600 631-640 671-680 711-720 751-760 791-800 20% 30% 40% 50% S. cerevisiae S. pombe Nucleotides Figure by MIT OpenCourseWare. 31-40 71-80 111-120 151-160 191-200 231-240 271-280 311-320 351-360 391-400 431-440 471-480 511-520 551-560 591-600 631-640 671-680 711-720 751-760 791-800 Kupfer et al., (2004) Eukaryotic Cell HMM Emissions Donor T Intron Acceptor A HMMs typically emit a single nucleotide per state Donor G Exon Acceptor G P(T)=1 P(A),P(G),P(C)=0 Intergenic q0 Generalized HMMs (GHMMs) • GHMMs emit more than one symbol per state • Emissions probabilities modeled by any arbitrary probabilistic model • Feature lengths are explicitly modeled W.H. Majaros (http://geneprediction.org/book/classroom.html) Courtesy of William Majoros. Used with permission. Human Introns Courtesy of Elsevier, Inc. http://www.sciencedirect.com. Used with permission. Burge, Karlin (1997) GHMM Elements • • • • States Observations Initial state probabilities Transition Probabilities Q V πi = P(q0=i) ajk= P(qi=k|qi-1=j) Like HMMs • Duration Probabilities fk(d)=P(state k of length d) • Emission Probabilities ek(Xα,α+d)=Pk(Xα…Xα+d| qk,d) Now emit a subsequence Model Abstraction in GHMMs W.H. Majaros (http://geneprediction.org/book/classroom.html) Models must return the probability of a subsequence given a state and duration Courtesy of William Majoros. Used with permission. GHMM Submodel Examples 1. WMM (Weight Matrix) ∏ P (x ) i i i=0 L−1 2. Nth-order Markov Chain (MC) ∏ P( x | x ...x )∏ P( x | x i 0 i−1 i i=0 i= n n−1 L−1 i− n ...xi−1 ) 3. Three-Periodic Markov Chain (3PMC) 5. Codon Bias 6. MDD ∏P i= 0 L−1 ( f + i)(mod 3) (xi ) ∏ P( xα i=0 n−1 +3 i α +3 i+1 α +3 i+2 x x ) Ref: Burge C (1997) Identification of complete gene structures in human genomic DNA. PhD thesis. Stanford University. PeIMM (s | g0 ...g k −1 ) = 7. Interpolated Markov Model ⎧ λG P (s | g ...g ) + (1− λG )P IMM (s | g ...g ) 0 k −1 k e 1 k −1 Ref: Salzberg SL, Delcher AL, Kasif S, White O (1998) ⎨ke Microbial gene identification using interpolated Markov ⎩ Pe (s) models. Nucleic Acids Research 26:544-548. Courtesy of William Majoros. Used with permission. if k > 0 if k = 0 http://geneprediction.org/book/classroom.html GHMMs as Generative Models Like HMM, we can use a GHMM to generate a sequence and state labeling • • • • Choose initial state, q1, from πi Choose a duration d1, from length distribution fq1(d) Generate a subsequence from state model eq1(X1,d) Choose a subsequent state, q2, conditioned on the current state, according to ajk=P(qk|qj) Repeat until number of nucleotides equals desired length of sequence • Given a sequence, how do we pick a state labeling (segmentation)? The Viterbi Algorithm - HMMs State 1 2 Vk(i) K x1 x2 x3 ………………………………………..xN Input: x = x1……xN Initialization: V0(0)=1, Vk(0) = 0, for all k > 0 Iteration: Vk(i) = eK(xi) × maxj ajk Vj(i-1) Termination: P(x, π*) = maxk Vk(N) Traceback: Follow max pointers back In practice: Use log scores for computation Running time and space: Time: O(K2N) Space: O(KN) HMM Viterbi Recursion … hidden states … ajk Vj(i-1) j … k ek Vk(i) observations xi-1 xi • Assume we know Vj for the previous time step (i-1) • Calculate Vk(i) = current max ek(xi) * maxj ( Vj(i-1) × ajk this emission max ending in state j at step i-1 ) Transition from state j all possible previous states j GHMM Recursion … hidden states … ajk Vj(i-1) j … k ek Vk(i) observations xi-1 xi We could have come from state j at the last position… GHMM Recursion But we could also have come from state j 4 positions ago… (State k could have duration 4 so far) GHMM Recursion In general, we could have come from state j d positions ago (State k could have duration d so far) GHMM Recursion This leads to the following recursion equation: Vk ( i ) = max max ⎡ P ( xi … xi − d | k ) ⋅ V j ( i − d ) ⋅ P ( d k ) ⋅ a jk ⎤ ⎣ ⎦ j d Prob of Max over all prev states and state subsequence given state k durations Max ending in state j at step i-d Prob that state k has duration d Transition from j->k Comparing GHMMs and HMMs HMM – O(K2L) Vk(i) = ek(xi) * maxj ( Vj(i-1) × ajk max ending in state j at step i-1 ) Transition from state j current max this emission all possible previous states j GHMM – O(K2L3) Similar modifications needed for forward and backward algorithms Vk ( i ) = max max ⎡ P ( xi … xi − d | k ) ⋅ V j ( i − d ) ⋅ P ( d k ) ⋅ a jk ⎤ ⎣ ⎦ j d Prob of Max over all prev states and state subsequence given state k durations Max ending in state j at step i-d Prob that state k has duration d Transition from j->k Courtesy of Elsevier, Inc. http://www.sciencedirect.com. Used with permission. Genscan - Burge and Karlin, (1997) • • Explicit State Duration GHMM 5th order markov models for coding and non-coding sequences Each CDS frame has own model WAM models for start/stop codons and acceptor sites MDD model for donor sites Separate parameters for regions of different GC content Model +/- strand concurrently • • • • • Courtesy of Elsevier, Inc. http://www.sciencedirect.com. Used with permission. Types of Exons Three types of exons are defined, for convenience: • initial exons extend from a start codon to the first donor site; • internal exons extend from one acceptor site to the next donor site; • final exons extend from the last acceptor site to the stop codon; • single exons (which occur only in intronless genes) extend from the start codon to the stop codon: Courtesy of William Majoros. Used with permission. http://geneprediction.org/book/classroom.html Intron and Exon Phase Phase 0 Intron Codon Phase 0 Exon 3123 Exon Intron 1231 Exon Phase 1 Intron 2312 Exon Intron Phase 1 Exon 3123 Exon Phase 2 Intron 1231 Exon Intron Phase 2 Exon 2312 Exon Two State Types in Genscan • D-type represented by diamonds • C-type represented by circles D states are always followed by C and vice versa C states always preceded by same Dstate Courtesy of Elsevier, Inc. http://www.sciencedirect.com. Used with permission. Two State Types in Genscan • D-Type – geometric length distribution – Intergenic regions – UTRs – Introns Sequence models are “factorable”: Pk ( Xa, c ) = Pk ( Xa, b) Pk ( Xb + 1, c) f k (d ) = P (state duration d|state k)=p k f k (d − 1) Increasing duration by one changes probability by constant factor Two State Types in Genscan • C-Type – general length distributions and sequence generating modes – Exons (initial, internal, terminal) – Promotors – Poly-A Signals Genscan Inference • Genscan uses same basic forward, backward, and viterbi algorithms as generic GHMMs • But assumptions about C, and D states reduce algorithmic complexity Genscan Inference – C-state List State 1 D states 2 Vk(i) C states K C1 C2 x1 x2 x3 ………………………………………..xN Lk(i) C1, duration 2, previous D-state=1 C2, duration 3, previous D-state=2 Genscan Viterbi Induction Max from previous step in this state Extend D-type state by one step (factorable state) Probability of emiting one more nucleotide from state k Was in state K in previous step Vk ( i +1) = max [ Vk ( i ) ⋅ pk ⋅ ek ( Xi+1), max {Vyc ( i − d −1) ⋅ P(yc|c)(1-pyc ) • P(d|c)P(Xi-d ..Xi |c) • P(k|c) ⋅ pk ⋅ ek ( Xi+1)}] c ∈ c-type states Max probability of ending D type state yc at position i-d-1 Probability of transition from yc to c Probability that C state was duration d Probability of subsequence of length d from state c Probability of transition from c to k Probability of nucleotide from state k Just transitioned from c-type state c of duration d which previously transitioned from D-type state yc Terminate Dtype state length Training A Gene Predictor During training of a gene finder, only a subset K of an organism’s gene set will be available for training: The gene finder will later be deployed for use in predicting the rest of the organism’s genes. The way in which the model parameters are inferred during training can significantly affect the accuracy of the deployed program. Courtesy of William Majoros. Used with permission. http://geneprediction.org/book/classroom.html Training A Gene Predictor Courtesy of William Majoros. Used with permission. http://geneprediction.org/book/classroom.html Gene Prediction Accuracy Gene predictions can be evaluated in terms of true positives (predicted features that are real), true negatives (non-predicted features that are not real), false positives (predicted features that are not real), and false negatives (real features that were not predicted: These definitions can be applied at the whole-gene, whole-exon, or individual nucleotide level to arrive at three sets of statistics. Courtesy of William Majoros. Used with permission. http://geneprediction.org/book/classroom.html Accuracy Metrics TP Sn = TP + FN TP Sp = TP + FP 2 × Sn × Sp F= Sn + Sp TP + TN SMC = TP + TN + FP + FN CC = (TP × TN ) − (FN × FP ) (TP + FN ) × (TN + FP ) × (TP + FP ) × (TN + FN ) . TP TN TN 1 ⎛ TP + + + ACP = ⎜ n ⎝ TP + FN TP + FP TN + FP TN + FN ⎞ ⎟, ⎠ AC = 2( ACP − 0.5). Courtesy of William Majoros. Used with permission. http://geneprediction.org/book/classroom.html More Information • http://genes.mit.edu/burgelab/links.html • http://www.geneprediction.org/book/clas sroom.html ...
View Full Document

This note was uploaded on 09/24/2010 for the course EECS 6.047 / 6. taught by Professor Manoliskellis during the Fall '08 term at MIT.

Ask a homework question - tutors are online