
LecturesPart11 - Computational Biology, Part 11 Genefinding...


Computational Biology, Part 11: Genefinding
Robert F. Murphy
Copyright © 1997, 2001, 2003-2006. All rights reserved.

Clues to locations of genes (Prokaryotic Signals)
- For transcription:
  - Promoters
  - Transcription factor binding sites
- For translation:
  - Ribosome binding sites
  - Start/stop codons

Prokaryotic vs. Eukaryotic Genefinding
- Simple to create programs that look for "grammatical" combinations of prokaryotic signals
- Much more complicated for eukaryotes due to the presence of introns and additional regulatory elements

Clues to locations of genes (Eukaryotic Signals)
- For transcription:
  - Promoters
  - Transcription terminators
  - Topoisomerase II binding sites
  - Topoisomerase I cleavage sites
  - Transcription factor binding sites
- For splicing:
  - Donor and acceptor sites
  - Branch points

Clues to locations of genes (Eukaryotic Signals)
- For mRNA processing:
  - Polyadenylation sites
- For translation:
  - Ribosome binding sites
  - Start/stop codons

[Figure: DNA -> RNA -> protein]

Signal Sensors: Consensus Sequences
- Simple: TATA
- PROSITE expression: Y-x-G-A-[FL]-[KRHNQ]-C-L-x(3,4)-G-[DENQ]-V-[GA]-[FYW] (iron binding site in transferrin)

Signal Sensors: Networks
- Profile / PSSM (equivalent to a perceptron)
- Neural network (multi-layer)

Content Sensors: Coding Regions
- GeneMark: three fifth-order Markov models, one for each reading frame
- GRAIL: uses a neural net with inputs from
  - coding potential measures
  - base composition
  - signal sensor output for flanking splice sites

Integrated Systems
- Use dynamic programming to find the best combination of signal/content sensors
- Apply "linguistic" rules to say which parts are required and in what order

Gene model
- B = gene start
- S = translation start
- D = donor
- A = acceptor
- T = translation stop
- E = gene end

Basic Implementation
- Use an HMM to model which state Q each nucleotide from X is in (given parameters θ)
- Train the HMM with known genes to estimate θ
- For an unknown sequence, find Q to maximize P(Q | X, θ)
- Used by GENSCAN, HMMgene

GENSCAN HMM
- Handles genes on both the forward and reverse strands

Adding Homology
- Can try to include information from databases of known proteins to help decide whether an exon is coding
- For each candidate exon, increase the score if there is homology with a known protein
- This approach is used by Genie, GeneID+, GeneParser3, Grail

Adding ESTs
- Can try to include information from EST databases
- EST (Expressed Sequence Tag) databases show sequences that are "known" to be present in mRNA (cDNA)
- For each candidate exon, increase the score if it matches an EST
- Used by AAT, Grail

Drawbacks
- Using homology or ESTs may bias results toward genes similar to known genes (homology) or highly expressed genes (ESTs)

Assessing Performance
- 1995: Burset and Guigo used a benchmark set of 575 vertebrate genes to compare programs
- 2001: Rogic et al. redid the comparison with a better test set of 195 genes (Genome Research 11:817-832, 2001)
  - http://www.cs.ubc.ca/~rogic/evaluation/

Machine Learning 101
- Two types of learning methods
- Supervised:
  - have examples of things in different "classes" that you want the program to recognize or "classify"
  - goal is for the program to learn "rules" or "boundaries" to distinguish the classes
- Unsupervised:
  - have the program figure out what classes exist

Machine Learning 101
- Assume that we have a group of things that are either of a specific type or not
- Define Positive as being of the desired type and Negative as not
- Compare classifier predictions with reality
- Define:
  - True Positive (TP) = number of + predicted as +
  - False Negative (FN) = number of + predicted as -
  - True Negative (TN) = number of - predicted as -
  - False Positive (FP) = number of - predicted as +

Machine Learning 101
- Need two measures of performance
- Define:
  - Recall = TP/(TP+FN): "what fraction of + are found"
  - Precision = TP/(TP+FP): "what fraction of + predictions are right"
- Recall is also referred to as sensitivity
- Precision is also referred to as positive predictive value
- Precision is also sometimes referred to as specificity
- But specificity is sometimes defined as TN/(TN+FP)

Performance Measures
[Table: results for 7 programs, nucleotide accuracy and exon accuracy] ...
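The performance measures defined above can be sketched in a few lines of Python. This is only an illustration of the formulas; the function name and the example counts are invented here, not taken from the lecture or from any genefinder's output.

```python
def performance_measures(tp, fn, tn, fp):
    """Compute the measures from the slides, given confusion-matrix counts.

    Returns (recall, precision, specificity), where recall is also
    called sensitivity, precision is also called positive predictive
    value, and specificity uses the TN/(TN+FP) definition noted above.
    """
    recall = tp / (tp + fn)        # what fraction of + are found
    precision = tp / (tp + fp)     # what fraction of + predictions are right
    specificity = tn / (tn + fp)   # alternative definition of specificity
    return recall, precision, specificity

# Hypothetical example: a genefinder finds 8 of 10 real exons (TP=8, FN=2),
# calls 4 non-exons exons (FP=4), and correctly rejects 86 (TN=86).
r, p, s = performance_measures(tp=8, fn=2, tn=86, fp=4)
```

Note that with heavily imbalanced classes (many more non-coding than coding positions, as in genomic DNA), precision and specificity can differ sharply, which is why the slides warn that "specificity" is used with two different meanings in the literature.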

This note was uploaded on 01/13/2012 for the course BIO 101 taught by Professor Staff during the Fall '10 term at DePaul.
