Genome_Research - Insight/Outlook Gene-Finding Approaches for Eukaryotes Gary D Stormo1 Department of Genetics Washington University School of

Gene-Finding Approaches for Eukaryotes

Gary D. Stormo
Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63110-8232 USA

The goal of this paper is to introduce the methods commonly used for predicting protein-coding regions in eukaryotic DNA, primarily for the benefit of those not familiar with the topic. This is not meant as a comprehensive review, nor do I describe in detail the underlying mathematical formalism. Those seeking additional information are encouraged to read some recent reviews (Fickett 1996; Claverie 1997; Burge and Karlin 1998; Haussler 1998). Most of the papers in this issue from the recent Genome Annotation Assessment Project (GASP) attempt to identify protein-coding genes using one or more of the methods I describe. I do not assess the success of the different methods, as that is done in the accompanying paper (Reese et al. 2000) and by each paper individually.

There are two important aspects to any program for gene identification: one is the type of information used by the program, and the other is the algorithm that is employed to combine that information into a coherent prediction. Three types of information are used in predicting gene structures: "signals" in the sequence, such as splice sites; "content" statistics, such as codon bias; and similarity to known genes. The first two types have been used since the early days of gene prediction, whereas similarity information has been used routinely only in recent years. One of the reasons that the accuracy of gene-prediction programs has improved in the last few years is the enormous increase in the number of examples of known coding sequences. This much larger sample size allows more reliable statistical measures to be developed, and it makes it much more likely that a newly encountered gene is related to one that has been identified previously.
Types of Information Used

Signals

The most important features to identify are the splice junctions: the donor and acceptor sites. If these could be reliably detected from the genomic DNA, the difficulty in identifying the coding regions would be greatly reduced because most genes could be recognized simply by finding the long ORFs. It would still be somewhat more difficult than for prokaryotes simply because genes are much less dense in eukaryotes, but a high degree of accuracy could be obtained easily. Unfortunately, splice junctions are not reliably detectable in the genomic sequence. The most common method for predicting them has been the "weight matrix." This is simply a matrix with a score for each possible base at every position within a "site" (Stormo 1990; Gelfand 1995). There are separate weight matrices for acceptor and donor sites, and the scores for each base depend on the frequencies of each base at each position in the known sites. The method of Staden (1984a) simply used the logarithm of the frequency, but it is more common now to use a log-odds ra-
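
As an illustration of the weight-matrix idea described above, here is a minimal Python sketch. It is not taken from any of the cited programs. The function names (`build_weight_matrix`, `score_site`, `scan`), the uniform background frequencies, the pseudocount, and the four toy "donor" sequences are all assumptions made for the example. Each score is the base-2 logarithm of the observed frequency of a base at a position divided by its assumed background frequency.

```python
import math

BASES = "ACGT"


def build_weight_matrix(sites, background=None, pseudocount=1.0):
    """Build a log-odds weight matrix from aligned example sites.

    Returns one dict per position, mapping each base to its score.
    The background frequencies and pseudocount are illustrative choices.
    """
    if background is None:
        # Assumption: uniform base composition outside the sites.
        background = {b: 0.25 for b in BASES}
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")

    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for pos in range(width):
        # Start from the pseudocount so unseen bases get a finite score.
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        matrix.append(
            {b: math.log2((counts[b] / total) / background[b]) for b in BASES}
        )
    return matrix


def score_site(matrix, window):
    """Sum the per-position scores for one candidate site."""
    return sum(column[base] for column, base in zip(matrix, window))


def scan(matrix, sequence):
    """Score every window of the matrix width along the sequence."""
    width = len(matrix)
    return [
        (start, score_site(matrix, sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]


# Toy, made-up donor-like sites; real matrices use many known sites.
toy_sites = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGT", "AGGTAAGA"]
toy_matrix = build_weight_matrix(toy_sites)
hits = scan(toy_matrix, "TTCAGGTAAGTCC")
best_start, best_score = max(hits, key=lambda hit: hit[1])
```

In practice a threshold on the score would decide which windows are reported as candidate sites. As the text notes, such scores alone do not detect splice junctions reliably, so this sketch shows only the scoring step.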

This note was uploaded on 12/04/2011 for the course CHEM 590A taught by Professor Staff during the Summer '10 term at University of Illinois, Urbana Champaign.
