Computational Biology, Part 12: Predicting from Protein Sequence
Robert F. Murphy
Copyright © 1996, 1999-2006. All rights reserved.

Starting Point

- Broad goal: to determine or predict as much as we can from a "new" protein sequence.
- We have covered how to find protein motifs, such as targets for post-translational modification (profiles/PSSMs/HMMs).
- We have covered how to find homologous proteins; we may need them to predict something from their properties.

Starting Point

- Some properties or "propensities" can be calculated directly from individual amino acids.
- These properties are useful in themselves, and may also be used in place of the original sequence (or in addition to it) for some prediction methods.

Use of Amino Acid Properties in Prediction Schemes

Sequence → [propensity function] → vector of propensities → [prediction function] → prediction
(both functions may take other inputs as well)

Hydropathy/Hydrophobicity/Hydrophilicity

- One of the most commonly used properties is the suitability of an amino acid for an aqueous environment.
- Hydropathy and hydrophobicity: the degree to which something is "water hating" or "water fearing."
- Hydrophilicity: the degree to which something is "water loving."

Hydropathy/Hydrophilicity Analysis

- Goal: obtain quantitative descriptions of the degree to which regions of a protein are likely to be exposed to aqueous solvents.
- Starting point: tables of propensities for each amino acid.

Hydrophobicity/Hydrophilicity Tables

- Describe the likelihood that each amino acid will be found in an aqueous environment: one value per amino acid.
- Commonly used tables:
  - Kyte-Doolittle hydropathy
  - Hopp-Woods hydrophilicity
  - Eisenberg et al.
    normalized consensus hydrophobicity

Kyte-Doolittle Hydropathy Index

Amino Acid   Index
R            -4.5
K            -3.9
D            -3.5
Q            -3.5
N            -3.5
E            -3.5
H            -3.2
P            -1.6
Y            -1.3
W            -0.9
S            -0.8
T            -0.7
G            -0.4
A             1.8
M             1.9
C             2.5
F             2.8
L             3.8
V             4.2
I             4.5

Basic Hydropathy/Hydrophilicity Plot

- Calculate the average hydropathy over a window (e.g., 7 amino acids) and slide the window along until the entire sequence has been analyzed.
- Plot the average for each window versus the position of the window in the sequence.

Example Hydrophilicity Plot

[Figure: hydrophilicity plot for a tubulin, a soluble cytoplasmic protein. Regions with high hydrophilicity are likely to be exposed to the solvent (cytoplasm), while those with low hydrophilicity are likely to be internal or interacting with other proteins.]

Amphiphilicity/Amphipathicity

- A structural domain of a protein (e.g., an α-helix) can be present at an interface between polar and non-polar environments.
  - Example: a domain of a membrane-associated protein that anchors it to the membrane.
- Such a domain will ideally be hydrophilic on one side and hydrophobic on the other.
- This is termed an amphiphilic or amphipathic sequence or domain.

Amphiphilicity/Amphipathicity

- To find such sequences, we look for regions where short stretches of charged residues alternate with short stretches of hydrophobic residues, with a repeat distance corresponding to the period of the structure.
- A helical wheel plot can aid in finding such repeats.

Helical Wheel for Prion Protein

[Figure] From Susan Jean Johns and Steven M.
Thompson.

Hydrophobic Moment

- We can avoid visual interpretation of helical wheel plots by representing each amino acid as a vector: its direction points orthogonally out from the backbone, and its sign and magnitude come from a hydrophilicity table. We then calculate a "net" vector, which is termed the hydrophobic moment.
- Approach developed by David Eisenberg.

Hydrophobic Moment (formula)

For an N-residue stretch, Eisenberg's hydrophobic moment is

  μH = { [ Σ(n=1..N) Hn sin(nδ) ]^2 + [ Σ(n=1..N) Hn cos(nδ) ]^2 }^(1/2)

where Hn = hydrophobicity value for residue n and δ = angular frequency of the repeat of the helix (about 100° per residue) or sheet.

Prediction Methods

- These methods don't really "predict" anything; they just "calculate" things.
- Now let's consider ways to make predictions about proteins.

Machine Learning 102

- Supervised learning methods need examples to "train" on.
- If all examples are used for training, we can't evaluate how well the method can be expected to work on a new example.
- Thus, examples are normally divided into "training" and "testing" sets.

Machine Learning 102

- Use the training data to adjust the parameters of the method until it gives the best agreement between its predictions and the known classes.
- Use the testing data to evaluate how well the method works (without adjusting parameters!).

Machine Learning 102

- How do we report the performance?
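One standard answer is average accuracy plus a confusion matrix, which can be sketched in a few lines of Python (the function name `evaluate` is my own illustrative choice, not from any library):

```python
from collections import Counter

def evaluate(true_labels, predicted_labels, classes):
    """Average accuracy and a confusion matrix (rows = true, cols = predicted)."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    accuracy = correct / len(true_labels)
    # Count (true, predicted) pairs, then lay them out as a matrix.
    counts = Counter(zip(true_labels, predicted_labels))
    confusion = [[counts[(t, p)] for p in classes] for t in classes]
    return accuracy, confusion
```

For example, three test cases of which two are classified correctly give an accuracy of 2/3 and a 2x2 confusion matrix.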
- Average accuracy = fraction of all test examples that were classified correctly.
- Confusion matrix:
  - rows for the "true" class, columns for the "predicted" class
  - count how many test cases are predicted to be in each class

Example Confusion Matrix for Three Classes

              Predicted
           A     B     C
True  A   96     3     1
      B    2    92     6
      C    0     1    99

Example Confusion Matrix for Three Classes with "Unknown" Allowed as a Prediction

              Predicted
           A     B     C   Unknown
True  A   95     2     1      2
      B    2    90     6      2
      C    0     1    98      1

Goal

- Take the primary structure (sequence) and, using rules derived from known structures, predict the secondary structure most likely to be adopted by each residue.
- Major classes are α-helices, β-sheets and loops.

Structural Propensities

- Due to the size, shape and charge of its side chain, each amino acid may "fit" better in one type of secondary structure than another.
- Classic example: the rigidity and side chain angle of proline cannot be accommodated in an α-helical structure.

Structural Propensities

- Two ways to view the significance of this preference (or propensity):
  - It may control or affect the folding of the protein in its immediate vicinity (amino acid determines structure).
  - It may constitute selective pressure to use particular amino acids in regions that must have a particular structure (structure determines amino acid).

Secondary Structure Prediction

- In either case, amino acid propensities should be useful for predicting secondary structure.
- Two classical methods that use previously determined propensities:
  - Chou-Fasman
  - Garnier-Osguthorpe-Robson (GOR)

Chou-Fasman Method

- Uses a table of conformational parameters (propensities) determined primarily from measurements of secondary structure by CD spectroscopy.
- The table consists of one "likelihood" for each structure for each amino acid.
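One simple way to hold such a table in code, using the partial propensity values given in these notes. The naive per-residue lookup below ignores the nucleation and extension rules described later in these notes, so it is only a sketch:

```python
# Partial Chou-Fasman table (values as in these notes): (P_alpha, P_beta, P_turn).
PROPENSITY = {
    "E": (1.51, 0.37, 0.74),  # Glu
    "M": (1.45, 1.05, 0.60),  # Met
    "A": (1.42, 0.83, 0.66),  # Ala
    "V": (1.06, 1.70, 0.50),  # Val
    "I": (1.08, 1.60, 0.50),  # Ile
    "Y": (0.69, 1.47, 1.14),  # Tyr
    "P": (0.57, 0.55, 1.52),  # Pro
    "G": (0.57, 0.75, 1.56),  # Gly
}
STRUCTURES = ("helix", "sheet", "turn")

def favored_structure(aa):
    """Structure with the highest propensity for one residue (context ignored)."""
    values = PROPENSITY[aa]
    return STRUCTURES[values.index(max(values))]
```

So Glu favors helix (1.51), Val favors sheet (1.70), and Pro favors turn (1.52).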
Chou-Fasman Propensities (partial table)

Amino Acid    Pα     Pβ     Pt
Glu          1.51   0.37   0.74
Met          1.45   1.05   0.60
Ala          1.42   0.83   0.66
Val          1.06   1.70   0.50
Ile          1.08   1.60   0.50
Tyr          0.69   1.47   1.14
Pro          0.57   0.55   1.52
Gly          0.57   0.75   1.56

Chou-Fasman Method

- A prediction is made for each type of structure for each amino acid.
  - This can result in ambiguity if a region has high propensities for both helix and sheet (the higher value is usually chosen, with exceptions).

Chou-Fasman Method

- The calculation rules are somewhat ad hoc.
- Example: method for helix:
  - Search for a nucleating region where 4 out of 6 amino acids have Pα > 1.03.
  - Extend until 4 consecutive amino acids have an average Pα < 1.00.
  - If the region is at least 6 amino acids long, has an average Pα > 1.03, and its average Pα > average Pβ, consider the region to be helix.

Accuracy of Chou-Fasman Predictions

- Sequences whose 3D structures are known are processed so that each residue is "assigned" to a secondary structure class by looking at the backbone angles.
- Three classes are most often used (helix = H, sheet = E, turn = C), but sometimes four classes are used (helix, sheet, turn, loop).

Confusion Matrix for the Chou-Fasman Method on 78 Proteins

              Predicted
           H      E      C    Unknown
True  H   47.5    3.0    4.3   45.2
      E   20.8   16.8    7.1   55.4
      C    6.4    3.6   38.0   52.0

Average accuracy = 54.4
Data from Z-Y Zhu, Protein Engineering 8:103-109, 1995.

Garnier-Osguthorpe-Robson

- Uses a table of propensities calculated primarily from structures determined by X-ray crystallography.
- The table consists of one "likelihood" for each structure for each amino acid for each position in a 17 amino acid window.

Garnier-Osguthorpe-Robson

- Analogous to searching for "features" with a 17-amino-acid-wide frequency matrix.
- One matrix for each "feature":
  - α-helix
  - β-sheet
  - turn
  - coil
- The highest scoring "feature" is found at each location.
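The window-scoring idea can be sketched as follows. The matrices here are hypothetical placeholders supplied by the caller, not the published GOR values, and the function names are mine:

```python
# GOR-style sketch: one 17-position matrix per feature; matrix[pos][aa] is the
# likelihood of amino acid `aa` at window position `pos` for that feature.

def score_window(window, matrix):
    """Sum the positional likelihoods over a 17-residue window."""
    return sum(matrix[pos][aa] for pos, aa in enumerate(window))

def predict_center(seq, center, matrices):
    """Assign the residue at `center` the highest-scoring feature."""
    window = seq[center - 8:center + 9]  # 17 residues centered on the residue
    scores = {feature: score_window(window, m) for feature, m in matrices.items()}
    return max(scores, key=scores.get)
```

With a toy helix matrix that rewards Ala and a toy coil matrix that rewards Gly, a run of 17 alanines is labeled helix and a run of 17 glycines is labeled coil.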
Confusion Matrix for the GOR Method on 78 Proteins

              Predicted
           H      E      C    Unknown
True  H   40.0    9.3    5.6   45.1
      E    7.5   47.9    4.7   39.9
      C    5.4    3.6   40.5   50.5

Average accuracy = 58.8
Data from Z-Y Zhu, Protein Engineering 8:103-109, 1995.

Accuracy of Predictions

- GOR is much better at recognizing β-sheets.
- Both methods are only about 55-65% accurate.

Accuracy of Predictions

- A major reason for the modest accuracies is that while these methods consider the local context of each sequence element, they do not consider the global context of the sequence: the type of protein.
  - The same amino acids may adopt a different configuration in a cytoplasmic protein than in a membrane protein.

Neural Networks

- Learn how to "map" input values (x1, x2, x3, x4) to output values (o1, o2).
- [Figure: network diagram. Lines connect inputs to outputs; each input is multiplied by a "weight" for its line, and the products are summed to create the output.]

Neural Network Methods

- A neural network with multiple layers is presented with known sequences and structures; the network is trained until it can predict those structures given those sequences.
- This allows the network to adapt as needed (it can consider neighboring residues, like GOR).

Neural Network Methods

- Single-sequence methods: train the network using sets of known proteins of certain types (all alpha, all beta, alpha+beta), then use it to predict for a query sequence.
  - NNPREDICT (>65% accuracy)

Homology-Based Modeling

- Principle: from the sequences of proteins whose structures are known, choose a subset that is similar to the query sequence.
- Develop rules (e.g., train a network) for just this subset.
- Use these rules to make a prediction for the query sequence.

Homology-Based Modeling

- Homology-based methods predict structure using rules derived only from proteins homologous to the query sequence.
  - SOPM (>70% accuracy)
  - PHDsec (>72% accuracy)
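Selecting the homologous subset can be sketched as below. This is a deliberate simplification: real pipelines find homologs with an alignment or search tool such as BLAST, whereas this sketch assumes pre-aligned, equal-length sequences, and the 30% identity threshold is an arbitrary illustrative choice:

```python
def percent_identity(a, b):
    """Percent identity of two pre-aligned, equal-length sequences."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def choose_training_subset(query, known_structures, threshold=30.0):
    """Keep only the known-structure proteins similar enough to the query.

    known_structures: dict mapping protein name -> aligned sequence.
    """
    return [name for name, seq in known_structures.items()
            if len(seq) == len(query)
            and percent_identity(query, seq) >= threshold]
```

The rules (e.g., a network) would then be trained on the returned subset only, and applied to the query sequence.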

This note was uploaded on 01/13/2012 for the course BIO 101 taught by Professor Staff during the Fall '10 term at DePaul.
