MIT6_047f08_lec18_note18

MIT OpenCourseWare
http://ocw.mit.edu

6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution
Fall 2008

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

6.047/6.878 Computational Biology
Nov. 4, 2008
Lecture 18: CRFs for Computational Gene Prediction
Lecturer: James E. Galagan

Overview of gene prediction

One of the fundamental problems in computational biology is to identify genes in very long genome sequences. As we know, DNA is a sequence of nucleotide molecules (a.k.a. bases) that encode instructions for generating proteins. However, not all of these bases are responsible for protein generation. As the example in the 4th slide on page 1 of [2] shows, in the eukaryotic gene structure only exons contribute to protein manufacturing. Other segments, such as introns, intergenic regions, and the start and stop signals, are not directly responsible for protein production.

Our task in gene prediction (or genome annotation) is therefore the following: given a DNA sequence (X) containing zero or more genes, together with some evidence associated with that sequence, determine a labeling (Y) that assigns to each base a label according to the function of that part of the gene. For example, the labels can be intergenic, start, exon, acceptor, intron, donor, stop, etc.

How do we do gene prediction? It turns out that we can make use of several types of evidence. For example, some gene parts are associated with short fixed sequences, which are called signals. However, since such short sequences also occur by chance throughout the DNA sequence, we cannot rely on these signals alone for gene prediction. Other evidence relies on the tendency of certain genomic regions to have a specific base composition (a.k.a. content measures). For example, there are usually multiple codons for each amino acid and not all codons are used equally often, which gives rise to different ratios of nucleotides in coding sequences.
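To make the two kinds of sequence-derived evidence concrete, here is a minimal Python sketch. It is an illustration only, not code from the lecture: the helper names `find_signal` and `codon_counts` and the toy sequence are assumptions made up for this example. It shows that a short signal such as ATG or GT matches at several positions even in a tiny sequence, and that codon counts in a reading frame give a simple content measure.

```python
from collections import Counter


def find_signal(seq, motif):
    """Return every 0-based position where motif occurs in seq (overlaps allowed)."""
    width = len(motif)
    return [i for i in range(len(seq) - width + 1) if seq[i:i + width] == motif]


def codon_counts(seq, frame=0):
    """Count non-overlapping triplets of seq, starting at the given frame offset."""
    usable = len(seq) - frame
    end = frame + usable - usable % 3  # drop a trailing partial codon, if any
    return Counter(seq[i:i + 3] for i in range(frame, end, 3))


# Hypothetical toy sequence, chosen only for illustration.
toy = "ATGGCCGTAAGTATGCTGGCCTAA"

atg_hits = find_signal(toy, "ATG")  # candidate start signals
gt_hits = find_signal(toy, "GT")    # candidate donor signals
usage = codon_counts(toy)           # a crude content measure

print("ATG positions:", atg_hits)
print("GT positions:", gt_hits)
print("codon counts:", dict(usage))
```

Because every match of a short motif is only a candidate, a real gene finder would have to weigh such hits against content measures and other evidence rather than trust any single match.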
Apart from this evidence, which stems from properties of the DNA itself, there is also other evidence, such as direct experimental evidence, BLAST hits, HMMer hits, etc. The main challenge for gene prediction algorithms is how to take advantage of all these types of evidence together.

The most popular method so far for combining evidence is the Hidden Markov Model (HMM). In an HMM, we model the labels as hidden states and assume that the hidden states constitute a Markov chain. We assign an emission probability to every pair (x ∈ X, y ∈ Y) to model the probability of observing base x when the hidden state is y. An HMM is also called a generative model: a convenient way to think of it is to imagine a machine with a button, where each push of the button generates a genome sequence according to some probability distribution. In gene prediction, we use maximum-likelihood training to compute the state transition probabilities and emission probabilities, and then find the most likely hidden state sequence Y = y_1 ... y_n. Note that HMMs in fact...
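As a rough illustration of finding the most likely hidden state sequence, the following Python sketch runs Viterbi decoding on a toy two-state HMM. Everything specific here is an assumption for the example: the two labels, the start, transition and emission probabilities, and the function name `viterbi` are invented, not taken from the lecture, and in practice the probabilities would come from maximum-likelihood training on annotated sequences.

```python
import math

# Hypothetical two-label model; all probabilities below are made-up toy values.
STATES = ("intergenic", "exon")
START = {"intergenic": 0.9, "exon": 0.1}
TRANS = {
    "intergenic": {"intergenic": 0.9, "exon": 0.1},
    "exon": {"intergenic": 0.1, "exon": 0.9},
}
EMIT = {
    "intergenic": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(seq):
    """Return the most likely label sequence for a non-empty string over ACGT."""
    # score[y] is the log-probability of the best path that ends in state y.
    score = {y: math.log(START[y]) + math.log(EMIT[y][seq[0]]) for y in STATES}
    backpointers = []
    for x in seq[1:]:
        new_score, pointer = {}, {}
        for y in STATES:
            # Best previous state for arriving in y at this position.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][y]))
            new_score[y] = (score[prev] + math.log(TRANS[prev][y])
                            + math.log(EMIT[y][x]))
            pointer[y] = prev
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state to recover the full labeling.
    path = [max(STATES, key=lambda y: score[y])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path


labels = viterbi("ATATGCGCGCGCGCGCGCGCATAT")
print(labels)
```

Working in log space avoids numerical underflow on long sequences, which matters for genome-scale inputs. A realistic gene model would need many more states (start, donor, acceptor, intron, stop, and so on) than this two-state toy.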