gene finding 1



Computational Methods for Gene Finding (part I)
Ying Xu (徐鹰)

Genome and Sequencing

Genome
• Genome
ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagcggatgctg
agtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagaggtcagtgactga
tgatcgatgcatgcatggatgatgcagctgatcgatgtagatgcaataagtcgatgatcgatgatgatgctagatgatagctagatgtgatcgatggtaggtaggatggtagg
taaattgatagatgctagatcgtaggtagtagctagatgcagggataaacacacggaggcgagtgatcggtaccgggctgaggtgttagctaatgatgagtacgtatgag
gcaggatgagtgacccgatgaggctagatgcgatggatggatcgatgatcgatgcatggtgatgcgatgctagatgatgtgtgtcagtaagtaagcgatgcggctgctgag
agcgtaggcccgagaggagagatgtaggaggaaggtttgatggtagttgtagatgattgtgtagttgtagctgatagtgatgatcgtag
……

Known Elements in Genomes
• Protein-encoding genes
• Regulatory elements
• Repetitive elements
• RNA genes

Genes in Genomes
• The human genome has ~3 billion base pairs and about 20,000 protein-coding genes
ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggt ……
Where are the protein-encoding genes?

Gene Structure
• Eukaryotic genes
– exons, introns
– translation starts and stops, splice (donor/acceptor) junctions

Gene Structure
• Prokaryotic genes
– coding regions, non-coding regions
– translation starts and stops
[Figure: an operon, with labels promoter, gene start, gene, gene, stop, terminator]
Prokaryotic genes are easier to identify than eukaryotic genes because of the simplicity of their gene structure and the high density of genes in the genome.

Genes, mRNA and Proteins
• From gene to protein
– transcription: gene to mRNA; introns are spliced out
– translation: mRNA to protein, following the rule of the genetic code

Features of Genes
• Double-strand structure of DNA
forward strand: …..ACGTTTGA……
reverse strand: …..TGCAAACT…….
[Figure: genes marked on the two strands]

Features of Genes
• Translating a nucleotide sequence to an amino acid sequence: each triplet (codon) is translated into an amino acid
AAATCACGAGAT …….
K H E
• Every triplet can be translated into an amino acid but …..

Translation Frame
• Reading (or translation) frame: each DNA segment has six possible reading frames

Forward strand: ATGGCTTACGCTTGA
Reading frame #0: ATG GCT TAC GCT TGA
Reading frame #1: TGG CTT ACG CTT GA.
Reading frame #2: GGC TTA CGC TTG A..

Reverse strand: TCAAGCGTAAGCCAT
Reading frame #0: TCA AGC GTA AGC CAT
Reading frame #1: CAA GCG TAA GCC AT.
Reading frame #2: AAG CGT AAG CCA T..

Codons and Amino Acids
• There are 64 codons; three are STOP codons and each of the other 61 encodes an amino acid
AAA: K    AAC: N    AAG: K    AAT: N
CAA: Q    CAC: H    CAG: Q    CAT: H
GAA: E    GAC: D    GAG: E    GAT: D
TAA: stop    TAG: stop    TGA: stop

Open Reading Frame (ORF)
• Open reading frame (ORF): a segment of DNA with in-frame stop codons at its two ends and no in-frame stop codon in the middle
[Figure: an ORF lying between two in-frame stop codons]
• Each ORF has a fixed reading frame
How many genes can an ORF have inside it?
Answer: one, because an ORF has only one stop.

Open Reading Frame (ORF)
• Generally true: all long (> 300 bp) ORFs in prokaryotic genomes encode genes
• But this may not necessarily be true for eukaryotic genomes
• Coding region
– gene in prokaryotic genomes
– exon in eukaryotic genomes

Codon Frequencies
• Coding sequences are translated into protein sequences
• We found the following: the dimer frequency in protein sequences is NOT evenly distributed
The average frequency is 5%
Some amino acids prefer to be next to each other
Some other amino acids prefer not to be next to each other
[Figure: shewanella]

Dicodon Frequencies
• The biased (uneven) dimer frequencies are the foundation of most gene finding programs!
• Basic idea of gene finding: if a dimer has a lower than average frequency, proteins prefer not to have such dimers in their sequences; otherwise proteins prefer to have such dimers
Hence if we see a dicodon encoding such a low-frequency dimer, we may want to bet against this dicodon being in a coding region!

Dicodon Frequencies
• If we see many such dicodons in a DNA segment, we may want to bet that this region is a non-coding region!
This is the very basic idea of gene finding!

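The reading-frame and ORF definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the lecture: the function names (`reverse_complement`, `six_frames`, `find_orfs`) are hypothetical, the ORF definition is the slide's (a stop-free stretch bounded by two in-frame stop codons), and the default `min_len` of 300 bp simply mirrors the rule of thumb for long prokaryotic ORFs.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def codons(seq, frame):
    """Split seq into complete codons starting at offset frame (0, 1, or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def six_frames(seq):
    """Map (strand, frame) to the codon list of each of the six reading frames."""
    strands = {"+": seq, "-": reverse_complement(seq)}
    return {(s, f): codons(dna, f) for s, dna in strands.items() for f in range(3)}


def find_orfs(seq, min_len=300):
    """Yield (strand, frame, start, end) for each ORF of at least min_len bases.

    start and end are 0-based offsets on the strand being read; end is
    exclusive and both stop codons are excluded. Stretches at the sequence
    edges that are not bounded by two stops are skipped (an assumption).
    """
    for (strand, frame), cds in six_frames(seq).items():
        prev_stop = None  # codon index of the most recent in-frame stop
        for idx, codon in enumerate(cds):
            if codon in STOP_CODONS:
                if prev_stop is not None:
                    start = frame + 3 * (prev_stop + 1)
                    end = frame + 3 * idx
                    if end - start >= min_len:
                        yield strand, frame, start, end
                prev_stop = idx
```

Calling `six_frames("ATGGCTTACGCTTGA")` reproduces the codon splits of the Translation Frame slide, except that trailing incomplete codons such as `GA.` are dropped rather than padded.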
Dicodon Frequencies
• Relative frequencies of a dicodon in coding versus non-coding regions
– frequency of dicodon X (e.g., AAAAAA) in coding regions: total number of occurrences of X divided by the total number of dicodon occurrences
– frequency of dicodon X (e.g., AAAAAA) in non-coding regions: total number of occurrences of X divided by the total number of dicodon occurrences
In the human genome, the frequency of dicodon “AAA AAA” is ~1% in coding regions versus ~5% in non-coding regions
Question: if you see a region with many “AAA AAA”, would you guess it is a coding or non-coding region?

Basic Idea of Gene Finding
• Most dicodons show bias towards either coding or non-coding regions; only a fraction of dicodons are neutral
• Foundation for coding region identification
Regions consisting of dicodons that mostly tend to be in coding regions are probably coding regions; otherwise non-coding regions
• Dicodon frequencies are the key signal used for coding region detection; all gene finding programs use this information

Computational Model for Gene Finding
• Preference model:
– for each dicodon X (e.g., AAA AAA), calculate its frequencies in coding and non-coding regions, FC(X) and FN(X)
– calculate X’s preference value P(X) = log (FC(X)/FN(X))
• Properties:
– P(X) is 0 if X has the same frequency in coding and non-coding regions
– P(X) is positive if X has a higher frequency in coding than in non-coding regions; the larger the difference, the more positive the score
– P(X) is negative if X has a higher frequency in non-coding than in coding regions; the larger the difference, the more negative the score

Computational Model for Gene Finding
• Example
AAA ATT, AAA GAC, AAA TAG have the following frequencies
FC(AAA ATT) = 1.4%,
FN(AAA ATT) = 5.2%
FC(AAA GAC) = 1.9%, FN(AAA GAC) = 4.8%
FC(AAA TAG) = 0.0%, FN(AAA TAG) = 6.3%
We have (using base-10 logarithms)
P(AAA ATT) = log (1.4/5.2) = -0.57
P(AAA GAC) = log (1.9/4.8) = -0.40
P(AAA TAG) = - infinity (treating STOP codons differently)
A region consisting of only these dicodons is probably a non-coding region
• Coding preference of a region
Calculate the preference scores of all dicodons of the region and sum them up; if the total score is positive, predict the region to be a coding region; otherwise a non-coding region.

Computational Gene Finding
• Prediction procedure for coding regions
Procedure:
Calculate all ORFs of a DNA segment;
For each ORF, do the following:
– slide through the ORF with an increment of 10 base pairs
– calculate the preference score, in the same frame as the ORF, within a window of 60 base pairs, and assign the score to the center of the window
Example (forward strand in one particular frame)
[Figure: preference scores plotted along the sequence, with the score axis running from -5 to +5]

Computational Gene Finding
• Making the call: coding or non-coding, and where the boundaries are
coding region? where to draw the boundaries?
• Need a training set with known coding and non-coding regions
– select threshold(s) to include as many known coding regions as possible and, at the same time, to exclude as many known non-coding regions as possible
If threshold = 0.2, we will include 90% of coding regions and also 10% of non-coding regions
If threshold = 0.4, we will include 70% of coding regions and also 6% of non-coding regions
If threshold = 0.5, we will include 60% of coding regions and also 2% of non-coding regions
where to draw the line?

More Sophisticated Models
• Markov chain model
• Hidden Markov model

Question
• Can a gene finding program, trained on one genome, be useful for finding genes in another genome?

Universal Gene Finders?
• Dicodon frequencies in coding versus non-coding regions are genome dependent
[Figure: bovine versus shewanella]

Universal Gene Finders?
• For bacterial genomes, it is solvable
• For eukaryotic genomes, it is quite challenging!

Question
• Why dicodon (6-mer)?
Codon (3-mer)-based models are not nearly as information rich as dicodon-based models
Tricodon (9-mer)-based models need too many data points to be practical
People have used 7-mer or 8-mer based models; they could provide better prediction than 6-mer based models
There are
4*4*4 = 64 codons
4*4*4*4*4*4 = 4,096 dicodons
4*4*4*4*4*4*4*4*4 = 262,144 tricodons
To make our statistics reliable, we would need at least ~15 occurrences of each X-mer; so for tricodon-based models, we need at least 15*262144 = 3932160 coding bases in our training data, which is probably not going to be available for most genomes

Take-Home Message
• The key information for distinguishing coding regions from non-coding regions is the biased dicodon frequencies in these two types of regions
• Gene finders are generally genome dependent since the dicodon biases are genome specific

Homework
• Find an appropriate gene-finding server on the Internet for the provided genome sequence (seq.txt), and apply it to the sequence for gene finding. Provide a summary of the predicted genes, including the number of predicted genes and the location of each predicted gene on either the forward or the reverse strand of the sequence. Justify why this particular gene-prediction server is appropriate for the given sequence.
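
As a recap of the take-home message, the dicodon preference model and the sliding-window scoring from the slides can be sketched in Python. Several details here are assumptions rather than anything the lecture specifies: the function names are hypothetical, dicodons are counted at every codon boundary (so consecutive dicodons overlap by one codon), training sequences are assumed to be given in frame, the logarithm is base 10 (which matches the worked example, log (1.4/5.2) = -0.57), and a pseudocount is added to avoid division by zero, whereas the slides treat stop-containing dicodons as a special case.

```python
import math
from collections import Counter


def dicodon_counts(seqs):
    """Count dicodons (6-mers starting at codon boundaries) in in-frame seqs."""
    counts = Counter()
    for seq in seqs:
        for i in range(0, len(seq) - 5, 3):
            counts[seq[i:i + 6]] += 1
    return counts


def preference_table(coding, noncoding, pseudocount=1.0):
    """Return {X: log10(FC(X) / FN(X))} estimated from training sequences.

    The pseudocount is an assumption of this sketch, not part of the slides.
    """
    fc, fn = dicodon_counts(coding), dicodon_counts(noncoding)
    keys = set(fc) | set(fn)
    total_c = sum(fc.values()) + pseudocount * len(keys)
    total_n = sum(fn.values()) + pseudocount * len(keys)
    return {
        x: math.log10(((fc[x] + pseudocount) / total_c)
                      / ((fn[x] + pseudocount) / total_n))
        for x in keys
    }


def region_score(seq, pref):
    """Sum P(X) over the in-frame dicodons of seq; unseen dicodons score 0."""
    return sum(pref.get(seq[i:i + 6], 0.0) for i in range(0, len(seq) - 5, 3))


def window_scores(orf, pref, window=60, step=10):
    """Return (center, score) pairs for windows slid along an ORF.

    Windows advance by step bases, but dicodons inside each window are
    read in the ORF's own frame, starting at the next codon boundary.
    """
    scores = []
    for start in range(0, len(orf) - window + 1, step):
        first = start + (-start) % 3  # next codon boundary at or after start
        end = start + window
        score = sum(pref.get(orf[i:i + 6], 0.0) for i in range(first, end - 5, 3))
        scores.append((start + window // 2, score))
    return scores
```

A region would then be called coding when its score exceeds a threshold chosen on a training set, as in the "Making the call" slide; the sketch deliberately leaves that threshold to the caller.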