LecturesPart10 - Computational Biology, Part 10 Protein...


Computational Biology, Part 10
Protein Coding Regions
Robert F. Murphy
Copyright © 1996-2006. All rights reserved.

Sequence Analysis Tasks
⇒ Finding protein coding regions

Goal
- Given a DNA or RNA sequence, find those regions that code for protein(s)
  - Direct approach: Look for stretches that can be interpreted as protein using the genetic code
  - Statistical approaches: Use other knowledge about likely coding regions

Direct Approach

Genetic codes
- The set of tRNAs that an organism possesses defines its genetic code(s)
- The "universal" genetic code is common to most organisms
- Prokaryotes, mitochondria and chloroplasts often use slightly different genetic codes
- More than one tRNA may be present for a given codon, allowing more than one possible translation product

Genetic codes
- Differences in genetic codes occur mainly in start and stop codons
- Alternate initiation codons: codons that encode amino acids but can also be used to start translation (GUG, UUG, AUA, UUA, CUG)
- Suppressor tRNA codons: codons that normally stop translation but are translated as amino acids (UAG, UGA, UAA)

Genetic codes
- Note additional start codons: UUA, UUG, CUG
- Note conversion of stop codon UGA (opal) to Trp

Modifying genetic codes in MacVector
- Under Options select Modify Genetic Codes...
- Enter a name for the new code in the box
- Make changes by clicking on individual codons in the table and selecting new values
- Click OK

Reading Frames
- Since nucleotide sequences are "read" three bases at a time, there are three possible "frames" in which a given nucleotide sequence can be "read" (in the forward direction)
- Taking the complement of the sequence and reading in the reverse direction gives three more reading frames

Reading frames

       TTC TCA TGT TTG ACA GCT
RF1    Phe Ser Cys Leu Thr Ala>
RF2     Ser His Val *** Gln Leu>
RF3      Leu Met Phe Asp Ser>
       AAG AGT ACA AAC TGT CGA
RF4   <Glu *** Thr Gln Cys Ser
RF5     <Glu His Lys Val Ala
RF6    <Arg Met Asn Ser Leu

Reading frames
- To find which reading frame a region is in, take the nucleotide number of the lower bound of the region, divide by 3 and take the remainder (modulus 3)
- 1=RF1, 2=RF2, 0=RF3
- This is the convention used by MacVector
- Assumes the first nucleotide is 1 (not 0)

Reading frames
- For reverse reading frames, take the nucleotide number of the upper bound of the region, subtract it from the total number of nucleotides, divide by 3 and take the remainder (modulus 3)
- 1=RF4, 2=RF5, 0=RF6

Open Reading Frames (ORF)
- Concept: Region of DNA or RNA sequence that could be translated into a peptide sequence ("open" refers to the absence of stop codons)
- Prerequisite: A specific genetic code
- Definition:
  - (start codon) (amino acid coding codon)n (stop codon), where the amino acid coding codon is repeated n times
- Note: Not all ORFs are actually used

Open Reading Frames
- Open file YSPTUBB in the Sample Files folder
- Under Analyze select Open Reading Frames
- Click box next to start/stop codons...
- Click OK

Open Reading Frames
- Click boxes for List ORFs and ORF map
- Check reading frame: mod(696,3)=0 -> RF3

Splicing ORFs
- For eukaryotes, which have interrupted genes, ORFs in different reading frames may be spliced together to generate the final product
- ORFs from forward and reverse directions cannot be combined

ORFs and Exons
- MacVector displays "annotations" to the sequence in a features table
- Open the feature table for YSPTUBB by clicking on the icon
- Note the six exons for the tubulin gene
- Does the large exon (exon 5) correspond to the large ORF in reading frame 3?
- Yes, mod(639,3)=0 -> RF3, which matches the reading frame of the large ORF at 696

Block Diagram for Search for ORFs
- Inputs: Genetic code; Both strands?; Ends start/stop?; Sequence to be searched
- Search Engine
- Output: List of ORF positions

Statistical Approaches

Calculation Windows
- Many sequence analyses require calculating some statistic over a long sequence, looking for regions where the statistic is unusually high or low
- To do this, we define a window size to be the width of the region over which each calculation is to be done
- Example: %AT

Base Composition Bias
- For a protein with a roughly "normal" amino acid composition, the first 2 positions of all codons will be about 50% GC
- If an organism has a high GC content overall, the third position of all codons must be mostly GC
- Useful for prokaryotes
- Not useful for eukaryotes due to the large amount of noncoding DNA

Fickett's statistic
- Also called TestCode analysis
- Looks for asymmetry of base composition
- Strong statistical basis for calculations
- Method:
  - For each window on the sequence, calculate the base composition of nucleotides 1, 4, 7..., then of 2, 5, 8..., and then of 3, 6, 9...
  - Calculate the statistic from the resulting three numbers

Codon Bias (Codon Preference)
- Principle
  - Different levels of expression of different tRNAs for a given amino acid lead to pressure on coding regions to "conform" to the preferred codon usage
  - Non-coding regions, on the other hand, feel no selective pressure and can drift

Codon Bias (Codon Preference)
- Starting point: Table of observed codon frequencies in known genes from a given organism
  - best to use highly expressed genes
- Method
  - Calculate "coding potential" within a moving window for all three reading frames
  - Look for ORFs with high scores

Codon Bias (Codon Preference)
- Works best for prokaryotes or unicellular eukaryotes because for multicellular eukaryotes, different pools of tRNA may be expressed at different stages of development in different tissues
  - may have to group genes into sets
- Codon bias can also be used to estimate protein expression level

Portion of D. melanogaster codon frequency table

Amino Acid  Codon  Number  Freq/1000  Fraction
Gly         GGG        11       2.60      0.03
Gly         GGA        92      21.74      0.28
Gly         GGT        86      20.33      0.26
Gly         GGC       142      33.56      0.43
Glu         GAG       212      50.11      0.75
Glu         GAA        69      16.31      0.25

Comparison of Glycine codon frequencies

Codon  E. coli  D. melanogaster
GGG    0.02     0.03
GGA    0.00     0.28
GGT    0.59     0.26
GGC    0.38     0.43

Illustration of Codon Bias Plots
- Use Entrez via MacVector to get sequence of lexA
  - Under "Database" select "Internet Entrez Search"
  - Select gene=lexA AND organism=Escherichia
  - Pick one (e.g., region from 89.2 to 92.8)
- Under "Analyze" select "Codon Preference Plots"
  - Choose Escherichia coli codon bias file
  - Choose gene region corresponding to lexA
  - Click on Staden codon bias and Gribskov codon bias

[Figure: Staden codon preference plots for frames +1, +2 and +3; window = 40 codons; sequence positions 122400 to 122900]

[Figure: Staden codon preference plots for frames +1, +2 and +3; window = 40 codons; sequence positions 122000 to 123200]

Codon Preference Algorithms
- The Staden method (from Staden & McLachlan, 1982) uses a codon usage table directly in identifying coding regions. The codon usage table is normalized so that the sum over all 64 codons is 1. The usages for each codon in each reading frame in each window are multiplied together and normalized by the sum of the probabilities for all three reading frames to generate a relative coding probability.

Codon Preference Algorithms
- The Gribskov method uses a codon usage table normalized so that the alternatives for each amino acid sum to 1.
  The values for each codon for each reading frame in each window are multiplied together and normalized by the random probability expected for that codon given the mononucleotide frequencies of the target sequence. It is the most commonly used method.

Summary, Part 10
- Translation of nucleic acid sequences into hypothetical protein sequences requires a genetic code
- Translation can occur in three forward and three reverse reading frames
- Open reading frames are regions that can be translated without encountering a stop codon

Summary, Part 10
- The likelihood that a particular open reading frame is in fact a coding region (actually made into protein) can be estimated using third-codon base composition or codon preference tables
- This can be used to scan long sequences for possible coding regions
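
Worked example (not part of the original slides): the forward reading-frame convention and the ORF definition described earlier can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a DNA alphabet, ATG as the only start codon, and TAA/TAG/TGA as stop codons. The function names forward_frame, reverse_complement and find_orfs are hypothetical and are not MacVector functions.

```python
# Minimal sketch of the forward reading-frame rule and an ORF scan.
# Assumptions (not from the slides): DNA input, ATG as the only start
# codon, TAA/TAG/TGA as stop codons. All names here are hypothetical.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def forward_frame(lower_bound):
    """Frame of a forward region from its 1-based lower bound.

    Follows the slide convention: remainder 1 = RF1, 2 = RF2, 0 = RF3.
    """
    return {1: "RF1", 2: "RF2", 0: "RF3"}[lower_bound % 3]


def reverse_complement(seq):
    """Complement the sequence and read it in the reverse direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def find_orfs(seq, starts=("ATG",), stops=("TAA", "TAG", "TGA"), min_codons=1):
    """Return (start, end, frame) for each forward-strand ORF.

    An ORF is (start codon)(coding codon)n(stop codon). Positions are
    1-based and inclusive; the end position includes the stop codon.
    min_codons counts the start codon but not the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for offset in range(3):                      # three forward frames
        start = None
        for i in range(offset, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in starts:
                start = i                        # open a candidate ORF
            elif start is not None and codon in stops:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3, forward_frame(start + 1)))
                start = None                     # closed; look for the next start
    return orfs


if __name__ == "__main__":
    print(forward_frame(696))                    # large ORF in YSPTUBB
    print(find_orfs("CCATGAAATTTTAGCC"))
    # Reverse-strand ORFs: scan the reverse complement. The coordinates
    # returned are then relative to the reverse-complemented sequence.
    print(find_orfs(reverse_complement("CCATGAAATTTTAGCC")))
```

Calling find_orfs on the reverse complement covers the three reverse frames, but the positions and frame labels it reports refer to the reverse-complemented sequence; mapping them back to RF4 to RF6 would need the reverse-frame rule from the slides.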

This note was uploaded on 01/13/2012 for the course BIO 101 taught by Professor Staff during the Fall '10 term at DePaul.
