MIT6_047f08_lec18_note18

MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms .
6.047/6.878 Computational Biology, Nov. 4, 2008
Lecture 18: CRFs for Computational Gene Prediction
Lecturer: James E. Galagan

Overview of gene prediction

One of the fundamental problems in computational biology is to identify genes in very long genome sequences. As we know, DNA is a sequence of nucleotides (a.k.a. bases) that encodes instructions for making proteins. However, not all of these bases are responsible for protein production. As the example in the 4th slide on page 1 of [2] shows, in the eukaryotic gene structure only exons contribute to protein manufacturing. Other segments, such as intron, intergenic, start, and stop, are not directly responsible for protein production.

Our task in gene prediction (or genome annotation) is therefore the following: given a DNA sequence ($X$) with zero or more genes and some “evidence” associated with that sequence, determine a labeling ($Y$) that assigns to each base a label according to the function of that part of the gene. For example, the labels can be intergenic, start, exon, acceptor, intron, donor, stop, etc.

How do we do gene prediction? It turns out that we can make use of several types of evidence. For example, some gene parts are associated with short fixed sequences, which are called signals. However, since such short sequences also appear by chance throughout the DNA sequence, we cannot rely on these signals alone for gene prediction. Other evidence relies on the tendency of certain genomic regions to have a specific base composition (a.k.a. content measures). For example, there are usually multiple codons for each amino acid and not all codons are used equally often, which gives rise to different ratios of nucleotides in coding sequences. Apart from this evidence, which stems from properties of the DNA itself, there is also direct experimental evidence, BLAST hits, HMMer hits, etc. The main challenge for gene prediction algorithms is how to take advantage of all these types of evidence together.
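The two kinds of sequence-derived evidence described above can be illustrated with a small sketch. It is not part of the lecture: the function names `find_signal` and `codon_usage`, the choice of the start codon ATG as the example signal, and the toy sequences are all assumptions made for illustration. The first part counts how often a 3-base signal turns up in uniformly random DNA (roughly once every 64 positions), which is why signals alone are weak evidence; the second part tabulates codon usage as a simple content measure.

```python
import random
from collections import Counter

def find_signal(seq, motif):
    """Return every 0-based position where motif occurs in seq (overlaps allowed)."""
    return [i for i in range(len(seq) - len(motif) + 1) if seq.startswith(motif, i)]

def codon_usage(coding_seq):
    """Count codons read in frame 0; a trailing partial codon is ignored."""
    usable = len(coding_seq) - len(coding_seq) % 3
    return Counter(coding_seq[i:i + 3] for i in range(0, usable, 3))

# Signals: a fixed 3-mer appears by chance about once per 4**3 = 64 positions
# in uniformly random DNA (hypothetical data, seeded for reproducibility).
rng = random.Random(0)
random_dna = "".join(rng.choice("ACGT") for _ in range(100_000))
hits = find_signal(random_dna, "ATG")
expected = (len(random_dna) - 2) / 64
print(f"ATG hits in random DNA: {len(hits)} (expected about {expected:.0f})")

# Content measure: GCT and GCC both encode alanine, but this toy coding
# sequence uses GCT twice and GCC once.
print(codon_usage("ATGGCTGCTGCCTAA"))
```

A real gene finder would compare such counts between known coding and non-coding regions rather than inspect a single sequence.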
The most popular method so far for combining evidence is the Hidden Markov Model (HMM). In an HMM, we model the labels as hidden states and assume that the hidden states form a Markov chain. We assign an emission probability to every pair $(x \in X, y \in Y)$ to model the probability of observing base $x$ when the hidden state is $y$. This model is also called a generative model, since a convenient way to think of an HMM is to imagine a machine with a button: pushing the button generates a genome sequence according to some probability distribution. In gene prediction, we use maximum-likelihood training to compute the state transition probabilities and emission probabilities, and then find the most likely hidden state sequence $Y = y_1 \cdots y_n$. Note that HMMs in fact model a joint distribution over bases and hidden states, namely $P(X, Y) = P(\text{Labels}, \text{Sequence})$. However, in gene
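The HMM procedure described in this paragraph (maximum-likelihood estimation of transition and emission probabilities, followed by a search for the most likely hidden state sequence) can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the two-state label set ("N" for non-coding, "E" for exon), the toy training pairs, the add-one pseudocounts, and the function names are all assumptions. The decoding step uses the standard Viterbi algorithm in log space.

```python
import math

def train_hmm(labeled, states, alphabet, pseudo=1.0):
    """Maximum-likelihood estimates by counting, returned as log-probabilities.

    labeled: list of (bases, labels) string pairs of equal length.
    pseudo: smoothing pseudocount (an assumption of this sketch) so that
    events unseen in training do not get probability zero.
    """
    start = {s: pseudo for s in states}
    trans = {s: {t: pseudo for t in states} for s in states}
    emit = {s: {b: pseudo for b in alphabet} for s in states}
    for bases, labels in labeled:
        start[labels[0]] += 1
        for i, (b, y) in enumerate(zip(bases, labels)):
            emit[y][b] += 1
            if i > 0:
                trans[labels[i - 1]][y] += 1

    def log_normalize(counts):
        total = sum(counts.values())
        return {k: math.log(v / total) for k, v in counts.items()}

    return (log_normalize(start),
            {s: log_normalize(trans[s]) for s in states},
            {s: log_normalize(emit[s]) for s in states})

def joint_log_prob(bases, labels, start, trans, emit):
    """log P(X, Y): start term plus one transition and one emission per base."""
    lp = start[labels[0]] + emit[labels[0]][bases[0]]
    for i in range(1, len(bases)):
        lp += trans[labels[i - 1]][labels[i]] + emit[labels[i]][bases[i]]
    return lp

def viterbi(bases, states, start, trans, emit):
    """Return (most likely label string, its joint log-probability)."""
    score = {s: start[s] + emit[s][bases[0]] for s in states}
    back = []  # back[i][t] = best predecessor of state t at position i + 1
    for b in bases[1:]:
        prev, score, ptr = score, {}, {}
        for t in states:
            best = max(states, key=lambda s: prev[s] + trans[s][t])
            score[t] = prev[best] + trans[best][t] + emit[t][b]
            ptr[t] = best
        back.append(ptr)
    last = max(states, key=lambda s: score[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), score[last]

# Hypothetical toy training data: AT-rich flanks labeled N, GC-rich core labeled E.
labeled = [("ATATGCGCGCATAT", "NNNNEEEEEENNNN"),
           ("TTAAGCGGCCTATA", "NNNNEEEEEENNNN")]
start, trans, emit = train_hmm(labeled, "NE", "ACGT")
path, score = viterbi("ATATGCGCAT", "NE", start, trans, emit)
print(path, score)
```

Working in log space avoids numerical underflow on long sequences, and `joint_log_prob` makes explicit that the quantity being maximized is the joint $P(X, Y)$ rather than a conditional probability of the labels.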