MIT6_047f08_lec08_note08 - MIT OpenCourseWare...


MIT OpenCourseWare

6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution
Fall 2008

For information about citing these materials or our Terms of Use, visit: .

6.047/6.878 Computational Biology
September 29, 2008
Lecture 8: Computational Gene Prediction and GHMMs
Lecturer: James E. Galagan

Overview of gene prediction

One of the fundamental problems in computational biology is the identification of genes in very long genome sequences. As we know, DNA is a sequence of nucleotide molecules, or bases, which encode instructions for the generation of proteins. However, not all of these bases correspond directly to amino acids. Even within a gene, a relatively small percentage of the nucleotides might actually be translated into an amino acid chain.

For example, eukaryotic genomes contain introns, or long segments of non-coding nucleotides within a gene; the introns are discarded during processing into mature RNA, leaving only the exons linked together to contribute to protein manufacturing. Also, both prokaryotic and eukaryotic genomes contain substantial intergenic regions, and there are many other types of segments, such as introns and start and stop codons, which do not code for proteins directly but are crucial to protein synthesis in other ways.

The contiguous sequence of bases that is finally parsed out of DNA during processing into RNA is called the coding sequence; it consists of the exons joined together with all introns removed. This sequence is then deterministically translated into amino acids.

Our task in gene prediction (or genome annotation) is: given a DNA sequence (X) with zero or more genes and some "evidence" associated with the given DNA sequence, determine a labeling (Y) that assigns to each base a label according to the functionality of that part of the gene. Some of the most basic labels are:

Intron: Non-coding regions within a gene; removed during processing into mature RNA.

Exon: Coding regions. 3-letter words, or codons, correspond directly to amino acids.

Intergenic Region: Non-coding region located between two genes.

Start/Stop Codons: Specific sequences appearing at the beginning and end of all genes.

Acceptor Site: Specific sequence appearing at the end of an intron / the beginning of an exon.

Donor Site: Specific sequence appearing at the beginning of an intron / the end of an exon.

With these labels, we can think of gene prediction as parsing a sequence of letters into words. In this analogy, the different regions of a DNA sequence are similar to different types of words (nouns, verbs, etc.), and just as we must follow grammatical rules when constructing sentences, there are syntactical rules for parsing DNA as well. For example, we only allow introns to occur within genes, and we expect to find start and stop codons at the beginning and end of each gene. There are likely to be many "valid" parses, and in general we are searching for the most probable legal parse of the sequence....
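The labeling view of gene prediction can be made concrete with a small sketch. The Python below is illustrative only and is not taken from the lecture: the one-letter labels (N, E, I), the function names, and the toy sequence are all hypothetical, and the codon table and the start/stop codons (ATG; TAA, TAG, TGA) are the standard genetic code, which the notes themselves do not spell out. It checks whether a labeling Y is a legal parse of X under the rule that introns occur only inside genes, then splices out the coding sequence and translates it. It does not score how probable a parse is.

```python
import re

# Hypothetical one-letter labels: N = intergenic, E = exon, I = intron.
# A gene begins and ends with an exon; introns may only sit between exons.
GENE = "E+(?:I+E+)*"
LEGAL_LABELING = re.compile(rf"N*(?:{GENE}(?:N+{GENE})*N*)?")

# Standard genetic code (an assumption; not listed in the notes).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_legal_parse(x, y):
    """True if y labels every base of x and obeys the syntactic rules."""
    return len(x) == len(y) and LEGAL_LABELING.fullmatch(y) is not None


def coding_sequences(x, y):
    """Return one coding sequence per gene: its exons with introns removed."""
    if not is_legal_parse(x, y):
        raise ValueError("labeling is not a legal parse of the sequence")
    result = []
    for gene in re.finditer(GENE, y):
        start, end = gene.span()
        exon_bases = (b for b, lab in zip(x[start:end], y[start:end]) if lab == "E")
        result.append("".join(exon_bases))
    return result


def has_start_and_stop(cds):
    """Check that a coding sequence is framed by a start and a stop codon."""
    return len(cds) % 3 == 0 and cds.startswith("ATG") and cds[-3:] in STOP_CODONS


def translate(cds):
    """Deterministically translate complete codons into amino acid letters."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


# Toy example: intergenic, exon, intron (GT...AG), exon, intergenic.
x = "CC" + "ATGGCC" + "GTAAGTTTTCAG" + "AAATAA" + "GG"
y = "NN" + "EEEEEE" + "IIIIIIIIIIII" + "EEEEEE" + "NN"

cds = coding_sequences(x, y)[0]
print(cds, has_start_and_stop(cds), translate(cds))  # ATGGCCAAATAA True MAK*
```

Encoding the grammar as a regular expression over the label string is only one convenient choice for a toy example; a labeling such as "NNIIEE", which places an intron outside a gene, is rejected by is_legal_parse.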

This note was uploaded on 09/24/2010 for the course EECS 6.047 / 6. taught by Professor Manolis Kellis during the Fall '08 term at MIT.
