MIT6_047f08_lec08_slide08

For practical reasons we recognize two broad classes


For practical reasons, we recognize two broad classes of features:
• signals: short, fixed-length features
• content regions: variable-length features
Courtesy of William Majoros. Used with permission. http://geneprediction.org/book/classroom.html

Signals
Associated with short fixed(-ish) length sequences
• Start codon: ATG
• Stop codons: TAA, TAG, TGA
• Donor splice site (5' end of the intron)
• Acceptor splice site (3' end of the intron)
[Figure: gene structure drawn 5' to 3' as alternating exons (E) and introns (I): E I E I E]

Content Regions
Content regions often have characteristic base composition
Example
• Recall: often multiple codons for each amino acid
• Not all codons are used equally
Characteristic higher-order nucleotide statistics in coding sequences (hexanucleotides)
P_exon(X_i | X_{i-1}, X_{i-2}, X_{i-3}, X_{i-4}, X_{i-5})
[Figure: gene structure drawn 5' to 3': E I E I E]
P(X_i | X_{i-1}, X_{i-2}, X_{i-3}, X_{i-4}, X_{i-5}) = P(X_i)

Extrinsic Evidence
[Figure: annotation tracks for a Neurospora crassa (a fungus) gene. Legend: intron, exon. Tracks: Gene, Gene Prediction Algorithms, BLAST Hits, HMMer Domains, EST Alignments]

HMMs for Gene Prediction
• States correspond to gene and genomic regions (exons, introns, intergenic, etc.)
• State transitions ensure legal parses
• Emission matrices describe nucleotide statistics for each state

A (Very) Simple HMM
[Figure: the Markov model. States: q0 (initial), Intergenic, Start Codon A, Start Codon T, Start Codon G, Exon, Donor G, Donor T, Intron, Acceptor A, Acceptor G, Stop Codon T, Stop Codon A, Stop Codon G]
Courtesy of William Majoros. Used with permission.
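The hexanucleotide statistics for content regions can be illustrated with a small sketch. It estimates a fifth-order Markov model P_exon(X_i | X_{i-1}, ..., X_{i-5}) from example coding sequences and compares it with a position-independent background P(X_i) through a log-odds score. The log-odds scoring, the function names `train_markov` and `log_odds`, the pseudocount, the uniform background, and the toy training sequence are all assumptions made for illustration; they are not taken from the slides.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"
ORDER = 5  # condition on the 5 preceding bases (hexanucleotide statistics)


def train_markov(sequences, order=ORDER, pseudocount=1.0):
    """Estimate P(X_i | previous `order` bases) with add-one style smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0

    def prob(context, base):
        ctx = counts.get(context, {})  # unseen contexts fall back to uniform
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq, coding_prob, background, order=ORDER):
    """Sum over i of log[P_exon(X_i | 5 previous bases) / P(X_i)]."""
    score = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        score += math.log(coding_prob(context, base) / background[base])
    return score


# Toy data, invented for this sketch (not real training sequences).
coding_prob = train_markov(["ATGGCCGCCGCCGCCGCCGCCTAA"])
background = {base: 0.25 for base in ALPHABET}  # assumed uniform P(X_i)

print(log_odds("GCCGCCGCCGCC", coding_prob, background))  # resembles training data
print(log_odds("ATATATATATAT", coding_prob, background))  # contexts never seen
```

A positive score means the window looks more like the coding training data than like the background; a real model would be trained on many annotated genes and would typically track codon position as well.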
A Generative Model
We can use this HMM to generate a sequence and state labeling
• The initial state is q0
• Choose a subsequent state, conditioned on the current state, according to a_{jk} = P(q_k | q_j)
• Choose a nucleotide to emit from the state emissions matrix e_k(X_i)
• Repeat until number of nucleotides equals desired length of sequence
[Figure: the simple HMM state diagram, starting from q0]

But We Usually Have the Sequence
[Figure: the simple HMM state diagram] ...
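The generative procedure listed under "A Generative Model" can be sketched in a few lines of Python. The state names follow the slide's diagram, but the transition topology, every probability value, the uniform emissions for Intergenic, Exon, and Intron, and the restriction to a single stop codon (TAG) are assumptions made for illustration; the slide text does not give these numbers. The function names `draw` and `generate` are likewise hypothetical.

```python
import random

UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Transition probabilities a_jk = P(q_k | q_j). All values are invented.
TRANSITIONS = {
    "q0": {"Intergenic": 1.0},
    "Intergenic": {"Intergenic": 0.9, "Start Codon A": 0.1},
    "Start Codon A": {"Start Codon T": 1.0},
    "Start Codon T": {"Start Codon G": 1.0},
    "Start Codon G": {"Exon": 1.0},
    "Exon": {"Exon": 0.8, "Donor G": 0.1, "Stop Codon T": 0.1},
    "Donor G": {"Donor T": 1.0},
    "Donor T": {"Intron": 1.0},
    "Intron": {"Intron": 0.8, "Acceptor A": 0.2},
    "Acceptor A": {"Acceptor G": 1.0},
    "Acceptor G": {"Exon": 1.0},
    "Stop Codon T": {"Stop Codon A": 1.0},
    "Stop Codon A": {"Stop Codon G": 1.0},
    "Stop Codon G": {"Intergenic": 1.0},
}

# Emission probabilities e_k(X_i). q0 is silent and emits nothing.
EMISSIONS = {"Intergenic": UNIFORM, "Exon": UNIFORM, "Intron": UNIFORM}
for name in TRANSITIONS:
    if name not in EMISSIONS and name != "q0":
        # Signal states emit one fixed base, the last letter of the state name.
        EMISSIONS[name] = {name[-1]: 1.0}


def draw(dist, rng):
    """Sample one key of `dist` with probability equal to its value."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes])[0]


def generate(length, seed=None):
    """Return (sequence, state labels), one label per emitted nucleotide."""
    rng = random.Random(seed)
    state = "q0"  # the initial state
    sequence, labels = [], []
    while len(sequence) < length:  # repeat until the desired length is reached
        state = draw(TRANSITIONS[state], rng)          # next state via a_jk
        sequence.append(draw(EMISSIONS[state], rng))   # emit via e_k(X_i)
        labels.append(state)
    return "".join(sequence), labels


seq, labels = generate(60, seed=1)
print(seq)
print(labels[:10])
```

Because each signal state has exactly one outgoing transition, every generated labeling is a legal parse of this toy topology: a start codon is always ATG, and an intron always begins with GT and ends with AG.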