gene finding 2



Computational Methods for Gene Finding (part II)
Ying Xu (徐鹰)

Building a Simple Gene Finder
• Collect coding and non-coding sequences for a target genome
• Build preference models (or more sophisticated models)
• For each reading frame, scan through a genomic sequence to score each of its segments (e.g., 60 bp)
[Figure: segment scores plotted along the sequence; score axis from -5 to +5]

Boundary Signals of Genes
• Knowing the boundaries of coding regions can help identify them more accurately
• Possible boundaries of an exon
  – Translation start: in-frame ATG
  – Splice junctions:
      donor site: coding region | GT
      acceptor: YAG | coding region
  – Stop codon: TAA | TAG | TGA
• An exon begins at one of { translation start, acceptor site } and ends at one of { translation stop, donor site }

Translation Starts
• Translation start: ATG
[Figure: a sequence containing several ATGs: ATG ATG ATG ……]
• Predict a translation start: collect a set of experimentally validated translation starts with flanking regions and align them
    GCCATGGCGA …
    ACGATGCTGT …
    GACATGGTAC …
    AGGATGGGCT …
    GCGATGTGGC …

• Certain nucleotides prefer to be in certain positions around the start "ATG", and other nucleotides prefer not to be there
[Table: frequencies of A, C, G, T at positions -4, -3, -2, -1, +3, +4, +5, +6 around ATG; values not preserved in this preview]
• The "biased" nucleotide distribution is information! It is the basis for translation start prediction
• Question: which one is more probable to be a translation start?
    CACC ATG GC
    TCGA ATG TT

• Mathematical model: F_i(X) is the frequency (in percent) of nucleotide X (A, C, G, T) at position i
• Score a string by Σ_i log( F_i(X_i) / 25 ), where X_i is the string's nucleotide at position i, 25 is the percent frequency expected by chance, and log is base 10 (as the numbers below show)

CACC ATG GC:
    log(58/25) + log(49/25) + log(40/25) + log(50/25) + log(43/25) + log(49/25)
    = 0.37 + 0.29 + 0.20 + 0.30 + 0.24 + 0.29
    = 1.69

TCGA ATG TT:
    log(6/25) + log(6/25) + log(15/25) + log(7/25) + log(13/25) + log(14/25)
    = -(0.62 + 0.62 + 0.22 + 0.55 + 0.28 + 0.25)
    = -2.54

The model captures our intuition!
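The worked example above can be reproduced with a few lines of Python. This is only a sketch: the full frequency table is not preserved here, so `FREQ` holds just the twelve percentages that appear in the two sums, and their assignment to positions -4 … +4 assumes the terms are listed in left-to-right order. The names `FREQ`, `POSITIONS` and `score_start` are made up for illustration.

```python
import math

# Positions scored around the ATG (4 upstream, 2 downstream), as in the example.
POSITIONS = [-4, -3, -2, -1, 3, 4]

# Partial frequency table in percent. ASSUMPTION: only the entries used in the
# slide's two sums are known, mapped to positions in the order they are listed.
FREQ = {
    -4: {"C": 58, "T": 6},
    -3: {"A": 49, "C": 6},
    -2: {"C": 40, "G": 15},
    -1: {"C": 50, "A": 7},
    3: {"G": 43, "T": 13},
    4: {"C": 49, "T": 14},
}

BACKGROUND = 25.0  # percent frequency expected by chance


def score_start(upstream, downstream):
    """Sum of log10(F_i(X) / 25) over the flanking positions of an ATG."""
    bases = upstream + downstream
    total = 0.0
    for pos, base in zip(POSITIONS, bases):
        # Raises KeyError for a base missing from the partial table.
        total += math.log10(FREQ[pos][base] / BACKGROUND)
    return total


good = score_start("CACC", "GC")  # candidate CACC ATG GC
bad = score_start("TCGA", "TT")   # candidate TCGA ATG TT
print(round(good, 2), round(bad, 2))
```

The second value may differ from the slide's -2.54 in the last digit, because the slide rounds each term before summing.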
Translation Starts
• Build a mathematical model based on the collected translation start sequences
• For each candidate translation start (each ATG), apply the model and get a score
• If the score is larger than zero, predict it as a possible "translation start"; the higher the score, the higher the probability that the prediction is true

Splice Junction Sites
• Splice junctions:
  – donor site: coding region | GT
  – acceptor: YAG | coding region
• Like translation starts, the flanks of splice junctions (acceptors and donors) show "biased" distributions of nucleotides in certain positions
• These biased distributions of nucleotides are the basis for prediction of splice junctions

Acceptor Sites
• Nucleotide distribution in the flanks of acceptors
[Figure: nucleotide distribution around acceptor sites; not preserved in this preview]
• Multiple positions have high "information content"
• Information content of a position: Σ_X F(X) log( F(X) / 25 )
• If every nucleotide has 25% frequency in a position, then the position's information content is ZERO.
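A minimal sketch of the information-content formula above. Two choices here are assumptions, not taken from the slides: the weight F(X) is converted from percent to a fraction, and the logarithm is base 10 to match the scoring example. The function names and the sample columns are hypothetical.

```python
import math


def information_content(freq_percent, background=25.0):
    """Sum over X of (F(X)/100) * log10(F(X)/background) for one position.

    freq_percent maps each nucleotide to its frequency in percent.
    Nucleotides with zero frequency contribute nothing (limit of p*log p).
    """
    total = 0.0
    for base in "ACGT":
        f = freq_percent.get(base, 0.0)
        if f > 0:
            total += (f / 100.0) * math.log10(f / background)
    return total


def informative_positions(columns, threshold):
    """Return the positions whose information content exceeds threshold."""
    return [pos for pos, col in columns.items()
            if information_content(col) > threshold]


# Made-up columns for illustration only (not data from the slides).
uniform = {"A": 25, "C": 25, "G": 25, "T": 25}
skewed = {"A": 100, "C": 0, "G": 0, "T": 0}
print(information_content(uniform), information_content(skewed))
```

A uniform column scores zero, and the score grows as the column becomes more skewed, so a threshold on this value is one way to decide how far the flanks should extend.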
• Use "information content" as a criterion for determining the length of the flanks

Acceptor Sites
• Mathematical model: F_i(X) is the frequency of X (A, C, G, T) in position i
• Score a segment as a candidate acceptor site by Σ_i log( F_i(X) / 25 )
• For each candidate acceptor sequence (each YAG), apply the model and get a score
• If the score is larger than zero, predict it is an "acceptor"; the higher the score, the higher the probability that the prediction is true

Donor Sites
• Nucleotide distribution in the flanks of donors
• Mathematical model: F_i(X) is the frequency of X (A, C, G, T) in position i
• Score a segment as a possible donor site by Σ_i log( F_i(X) / 25 )
• For each candidate donor sequence (each GT), apply the model and get a score
• If the score is larger than zero, predict it is a "donor"; the higher the score, the higher the probability that the prediction is true

Prediction of Exons
• For each ORF, find all donor and acceptor candidates by finding GT and YAG motifs
[Figure: a sequence with candidate motifs marked: GT CAG GT CAG GT GT]
• Score each donor and acceptor candidate using our position-specific weight matrix models
• Find all pairs of (acceptor, donor) above some thresholds
• Score the coding potential of the segment [acceptor, donor] using the hexamer model (e.g., the preference model)

Piecing All the Information Together
[Figure: the evidence to combine: translation starts, donor sites, acceptor sites, stop codons, exon length distribution (label: 150 bp), intron length distribution (label: 60 bp)]
• Each of these scores provides one piece of evidence of coding regions and their boundaries
  – effectively combining these scores could
      • enhance signals and reduce noise
      • make coding-region/gene predictions more reliable
• Combining these scores is nontrivial as they could be
  – conflicting
  – not independent
  – incomplete
  – ......
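The candidate enumeration described under "Prediction of Exons" can be sketched as follows. This is an illustration under simplifying assumptions: it only enumerates (acceptor, donor) pairs from the YAG and GT motifs and does not apply the weight-matrix thresholds, the ORF check, or the hexamer scoring. All function names are hypothetical.

```python
import re


def find_acceptors(seq):
    """Exon start indices: the position right after each YAG (Y = C or T)."""
    # Lookahead so that overlapping motifs are all reported.
    return [m.start() + 3 for m in re.finditer(r"(?=[CT]AG)", seq)]


def find_donors(seq):
    """Exon end indices (exclusive): the position of each GT."""
    return [m.start() for m in re.finditer(r"(?=GT)", seq)]


def exon_candidates(seq, min_len=1):
    """All (start, end, segment) with an acceptor before a donor."""
    donors = find_donors(seq)
    pairs = []
    for start in find_acceptors(seq):
        for end in donors:
            if end - start >= min_len:
                pairs.append((start, end, seq[start:end]))
    return pairs


# Toy sequence: one CAG acceptor and one GT donor around "ATGGCC".
print(exon_candidates("AACAGATGGCCGTAA"))
```

In a fuller version, each acceptor and donor would first be scored with its weight matrix and dropped if below threshold, which keeps the number of pairs manageable.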
Piecing All the Information Together
• Represent a DNA segment as a list of scores, including preference score, Markov score, splice junction scores, length, G+C composition, .......
• Find simple mathematical functions that can best separate different classes of training data
  – linear discriminator: Y = aX + b
  – quadratic discriminator: Y = aX^2 + bX + c
• How to determine the parameters to best separate the classes?

• A neural network is a popular and powerful tool for
  – data classification
  – unsupervised learning
  – pattern discovery
  – ......
• It provides a natural framework for "classifying" DNA segments into coding and non-coding categories based on all the scores
• A neural network needs to be trained on data with known coding/non-coding classification results

[Figure: neural network with inputs, one hidden layer, and outputs, used first for training and then for application; classes: coding, non-coding]

Training data (scores for each segment, followed by the known class):
    1.50, -1.00, 821.0, 0.001, 1
    9.01, -4.50, 120.0, 4.123, 0
    2.36, -0.09, 621.0, 0.057, 1
    7.89, -7.02, 210.0, 8.025, 0
    1.24, -2.05, 709.0, 1.401, 1
    .......
Application (class to be predicted):
    1.47, -0.79, 901.0, 1.01, ?

• Select a neural network trainer (there are many of them in the public domain) and train a neural network using the data file
    1.50, -1.00, 821.0, 0.001, 0.78
    9.01, -4.50, 120.0, 4.123, 0.00
    2.36, -0.09, 621.0, 0.057, 1.00
    7.89, -7.02, 210.0, 8.025, 0.01
    1.24, -2.05, 709.0, 1.401, 0.53
    .......
    (figure label: desired values)
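As an illustration of learning a linear discriminator from the training rows above, here is a perceptron, the simplest single-unit neural network. It is a stand-in chosen for brevity, not the trainer used in the course. Standardizing each column first is an added assumption, needed because the scores live on very different scales. All names are hypothetical.

```python
# Training rows from the slide: four scores per segment, then the class
# (1 = coding, 0 = non-coding). The meaning of each column is not stated.
TRAIN = [
    ([1.50, -1.00, 821.0, 0.001], 1),
    ([9.01, -4.50, 120.0, 4.123], 0),
    ([2.36, -0.09, 621.0, 0.057], 1),
    ([7.89, -7.02, 210.0, 8.025], 0),
    ([1.24, -2.05, 709.0, 1.401], 1),
]


def column_stats(rows):
    """Per-column mean and standard deviation (std 0 is replaced by 1)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [(sum((v - m) ** 2 for v in c) / n) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    return means, stds


def standardize(x, means, stds):
    return [(v - m) / s for v, m, s in zip(x, means, stds)]


def predict(z, w, b):
    """Linear discriminator: class 1 if w.z + b > 0, else class 0."""
    return 1 if sum(wi * zi for wi, zi in zip(w, z)) + b > 0 else 0


def train_perceptron(data, epochs=1000, lr=0.1):
    means, stds = column_stats([x for x, _ in data])
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in data:
            z = standardize(x, means, stds)
            pred = predict(z, w, b)
            if pred != y:  # update weights only on a mistake
                errors += 1
                step = lr * (y - pred)
                w = [wi + step * zi for wi, zi in zip(w, z)]
                b += step
        if errors == 0:  # every training row is classified correctly
            break
    return w, b, means, stds


def classify(x, model):
    w, b, means, stds = model
    return predict(standardize(x, means, stds), w, b)


model = train_perceptron(TRAIN)
print(classify([1.47, -0.79, 901.0, 1.01], model))  # the "?" row
```

A perceptron only finds a separating line when one exists; a network with a hidden layer, as in the slides, can also learn boundaries that are not linear.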
• During neural network training, the NN parameters are adjusted to minimize Σ (desired value - calculated value)^2

Piecing All the Information Together
• Gene model construction: select a set of exon candidates such that
  – a gene model starts with a "start" codon and ends with a "stop" codon
  – internal exons are of the "internal exon" type
  – adjacent exons are frame-consistent: exon1 [i, j] in frame a and exon2 [m, n] in frame b are consistent if b = (m - j - 1 + a) mod 3
  – the sum of all exon scores is maximal
• Use a hidden Markov model to piece all the information together
  – Coding potential
  – Boundary signals
  – Exon/intron length
  – G/C content
  – …..

Existing Gene Finders
• GRAIL
  – Gene finding program for eukaryotic genomes, including human and mouse
• GeneScan
  – Gene finding program (mainly) for eukaryotic genomes
• GeneMark
  – Gene finding program for prokaryotic genomes
• Glimmer
  – Gene finding program for prokaryotic genomes

General Gene Finding Methods
• Homology-based methods
  – gene prediction through homology search
• Ab initio methods
  – gene prediction through identifying distinguishing characteristics

Homology-Based Gene Finding
• A high degree of similarity between sequences often indicates that they evolved from a common ancestral gene
• Similar sequences can be detected through sequence alignment algorithms
  – BLAST, Smith-Waterman
• NCBI nr is a database with all known genes; by finding sequences in a newly sequenced genome that are similar to nr genes, one can predict genes in that genome

Question
• How do we find genes in a newly sequenced genome?
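The gene-model construction rules listed earlier (frame consistency and maximal total score) can be sketched with a small dynamic program. This is a simplification: it applies the consistency formula b = (m - j - 1 + a) mod 3 and maximizes the sum of exon scores, but ignores the start/stop codon and exon-type constraints. The `Exon` record and the function names are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Exon(NamedTuple):
    start: int    # i (or m): first position of the exon
    end: int      # j (or n): last position of the exon, inclusive
    frame: int    # a (or b): 0, 1, or 2
    score: float


def frame_consistent(e1: Exon, e2: Exon) -> bool:
    """True if e2 follows e1 and b == (m - j - 1 + a) mod 3."""
    return (e2.start > e1.end
            and e2.frame == (e2.start - e1.end - 1 + e1.frame) % 3)


def best_chain(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Highest-scoring chain in which adjacent exons are frame-consistent."""
    if not exons:
        return 0.0, []
    order = sorted(exons, key=lambda e: e.start)
    best = [e.score for e in order]  # best chain score ending at each exon
    prev = [-1] * len(order)         # back-pointers for recovering the chain
    for k, e2 in enumerate(order):
        for p in range(k):
            if (frame_consistent(order[p], e2)
                    and best[p] + e2.score > best[k]):
                best[k] = best[p] + e2.score
                prev[k] = p
    k = max(range(len(order)), key=best.__getitem__)
    total, chain = best[k], []
    while k != -1:
        chain.append(order[k])
        k = prev[k]
    return total, chain[::-1]


# Made-up candidates: two competing middle exons in different frames.
candidates = [
    Exon(0, 99, 0, 5.0),
    Exon(200, 299, 1, 4.0),
    Exon(200, 299, 0, 6.0),
    Exon(400, 450, 2, 3.0),
]
total, chain = best_chain(candidates)
print(total, [(e.start, e.end, e.frame) for e in chain])
```

In this toy input the higher-scoring middle exon (frame 0) is rejected because it is not frame-consistent with its neighbours, so the chain goes through the frame-1 exon instead.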
Challenging Problems
• Prediction of splice junction sites
[Figure: donor and acceptor sites]
• Prediction of translation start sites
[Figure: translation start]
• Prediction of genes with alternatively spliced forms
[Figure: exons]
• Prediction of RNA genes
  – mainly rely on recognition of RNA secondary structures

Homework
• Calculate the coding potential of the following DNA sequence using the di-codon preference model and the provided di-codon frequencies, and predict whether the region is a coding region or not.

    ACTGGGATCCGT

    Di-codon   Coding    Non-Coding
    ACTGGG     0.01      0.0001
    ATCCGT     0.001     0.0001
    CTGGGA     0.002     0.001
    GATCCG     0.008     0.001
    GGATCC     0.01      0.0001
    GGGATC     0.001     0.001
    TGGGAT     0.0005    0.0001

Homework
• Calculate the scores of the following translation start candidates using the provided position-specific weight matrix, and predict which is more probable to be a translation start site. Explain why.

    GCCATGGCC
    CTTATGTGA

[Table: frequencies of A, C, G, T at positions -4, -3, -2, -1, +3, +4, +5, +6 around ATG; values not preserved in this preview] ...
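For the first homework exercise, one possible way to set up the calculation is sketched below. The di-codon preference model itself is defined in part I and is not shown here, so the scoring rule is an assumption: the sum of log10(coding frequency / non-coding frequency) over all seven overlapping hexamers, with a positive total read as "coding". The name `coding_potential` is hypothetical.

```python
import math

# Di-codon (hexamer) frequencies as given in the homework table.
CODING = {
    "ACTGGG": 0.01, "ATCCGT": 0.001, "CTGGGA": 0.002, "GATCCG": 0.008,
    "GGATCC": 0.01, "GGGATC": 0.001, "TGGGAT": 0.0005,
}
NONCODING = {
    "ACTGGG": 0.0001, "ATCCGT": 0.0001, "CTGGGA": 0.001, "GATCCG": 0.001,
    "GGATCC": 0.0001, "GGGATC": 0.001, "TGGGAT": 0.0001,
}


def coding_potential(seq, k=6):
    """ASSUMED rule: sum of log10 ratios over every overlapping k-mer."""
    total = 0.0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        total += math.log10(CODING[word] / NONCODING[word])
    return total


score = coding_potential("ACTGGGATCCGT")
print(round(score, 2), "coding" if score > 0 else "non-coding")
```

If the model from part I scores only in-frame di-codons, the loop would step by 3 instead of 1; the table supplies frequencies for either reading.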

This note was uploaded on 06/16/2011 for the course BIO 127 taught by Professor Xuyin during the Spring '10 term at Georgetown.
