protein structure

Computational Prediction of Protein Structures
Ying Xu (徐鹰)

Protein Sequence, Structure and Function
• Protein sequence
  >1MBN:_ MYOGLOBIN (154 AA)
  MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHL
  KTEAEMKASEDLKKAGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKI
  PIKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAAKYKEL
  GYQG
• Protein structure
• Protein function: oxygen storage

Protein Structures
• Protein folding: a protein sequence folds into a “unique” shape (“structure”) that minimizes its free energy
• Primary sequence: MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
• Secondary structure: α-helix; β-sheet (parallel or anti-parallel)
• Tertiary structure
• Quaternary structure
• Backbone versus all-atom structures
  – Backbone + sidechain = all-atom structure
  – Backbone structure == structural fold
• A protein structure carries the key information about its function. Knowledge of the structure of a protein enables us to
  – understand its function and functional mechanism
  – design better mutagenesis experiments
  – carry out structure-based rational drug design
• Soluble protein structure
  – generally compact
  – individual domains are generally globular
  – they share various common characteristics, e.g. hydrophobic moment profile
• Membrane protein structure
  – most of the amino acid sidechains of transmembrane segments are non-polar
  – polar groups of the polypeptide backbone of transmembrane segments generally participate in hydrogen bonds

Protein Structure Prediction
Problem: given the amino acid sequence of a protein, computationally predict its 3-dimensional shape.
  MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHL
  KTEAEMKASEDLKKAGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKI
  PIKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAAKYKEL
  GYQG
  ?

Protein Structure Prediction
• Big gap between the number of protein sequences and the number of protein structures
  – UniProt/Swiss-Prot: 497,293 protein sequences
  – UniProt/TrEMBL: 9,145,906 gene sequences
  – PDB (Protein Data Bank): 60,173 protein structures
• Fundamental, unsolved, challenging problem

Why We Can Predict Structure
• In theory, a protein structure can be solved computationally
• A protein folds into a 3D structure to minimize its free potential energy
  – Anfinsen’s classic experiment on Ribonuclease A folding in the 1960s
  – energy functions
• This problem can be formulated as an optimization problem
  – the protein folding problem, or ab initio folding

Protein Structure Prediction
• ab initio
  – use first principles to fold proteins
  – does not require templates
  – high computational complexity
• homology modeling
  – similar sequences imply similar structures
  – practically very useful, but needs homologues
• protein threading
  – many proteins share the same structural fold
  – a folding problem becomes a fold recognition problem

ab initio Structure Prediction
• An energy function to describe the protein
  o bond energy
  o bond angle energy
  o dihedral angle energy
  o van der Waals energy
  o electrostatic energy
• Need an algorithm to search the conformational space to find the structural conformation that minimizes the function.
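The idea of "an energy function plus a search over conformations" can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the bead-chain model, the function names `toy_energy` and `greedy_search`, the parameter values, and the greedy random search itself. It includes only bond, van der Waals and electrostatic terms (bond angle and dihedral terms are left out for brevity) and is not a real force field.

```python
import math
import random


def toy_energy(coords, charges, bond_len=1.0, k_bond=100.0, eps=0.2, sigma=0.9):
    """Toy energy of a bead chain: bond + van der Waals + electrostatic terms.

    All parameters are made-up illustration values in arbitrary units.
    """
    energy = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(coords[i], coords[j])
            if j == i + 1:
                # Bond energy: harmonic spring between consecutive beads.
                energy += k_bond * (r - bond_len) ** 2
            else:
                r = max(r, 1e-6)  # avoid division by zero
                # Van der Waals energy: Lennard-Jones 12-6 form.
                energy += 4 * eps * ((sigma / r) ** 12 - (sigma / r) ** 6)
                # Electrostatic energy: Coulomb form.
                energy += charges[i] * charges[j] / r
    return energy


def greedy_search(coords, charges, steps=2000, step_size=0.1, seed=0):
    """Move one random bead at a time and keep the move only if energy drops."""
    rng = random.Random(seed)
    best = [tuple(c) for c in coords]
    best_e = toy_energy(best, charges)
    for _ in range(steps):
        i = rng.randrange(len(best))
        trial = list(best)
        trial[i] = tuple(x + rng.uniform(-step_size, step_size) for x in best[i])
        e = toy_energy(trial, charges)
        if e < best_e:
            best, best_e = trial, e
    return best, best_e


start = [(float(i), 0.0, 0.0) for i in range(6)]  # fully extended chain
charges = [1, 0, -1, 0, 1, -1]                    # hypothetical bead charges
e0 = toy_energy(start, charges)
final, e1 = greedy_search(start, charges)
print(f"start energy {e0:.3f}, after search {e1:.3f}")
```

Even in this six-bead toy, every search step re-evaluates all pairwise terms, which hints at why the search becomes expensive for real proteins.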
• Not practical in general
  o computationally too expensive
  o accuracy is poor

Comparative Modeling
• Homology modeling: identification of homologous proteins through sequence alignment; structure prediction through placing residues into “corresponding” positions of homologous structure models
• Protein threading: structure prediction through identification of a “good” sequence-structure fit

Protein Threading
• Basic premise: the number of unique structural (domain) folds in nature is fairly small (possibly a few thousand)
• Statistics from the Protein Data Bank (~65,000 structures): 90% of new structures submitted to PDB in the past three years have similar structural folds in PDB
• Chances for a protein to have a native-like structural fold in PDB are quite good (estimated to be 60-70%)
  – Proteins with similar structural folds could be homologues or analogues
• The goal: find the “correct” sequence-structure alignment between a target sequence and its native-like fold in PDB
  MTYKLILN …. NGVDGEWTYTE
• Energy function: knowledge (or statistics) based rather than physics based
  – should be able to distinguish correct structural folds from incorrect structural folds
  – should be able to distinguish the correct sequence-fold alignment from incorrect sequence-fold alignments

Protein Threading – four basic components
• Structure database
• Energy function
• Sequence-structure alignment algorithm
• Prediction reliability assessment

Protein Threading – structure database
• Structural template database

Protein Threading – energy function
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
• E_p: how preferable it is to put two particular residues nearby
• E_s: how well a residue fits a structural environment
• E_g: alignment gap penalty
• Total energy: E_p + E_s + E_g
• Find a sequence-structure alignment that minimizes the energy function
• A simple definition of structural environment
  – secondary structure: alpha-helix, beta-strand, loop
  – solvent accessibility: 0, 10, 20, …, 100% of accessibility
  – each combination of secondary structure and solvent accessibility level defines a structural environment
  – E.g., (alpha-helix, 30%), (loop, 80%), …
• E_s: a scoring matrix of 30 structural environments by 20 amino acids (singleton energy term)
  – E.g., E_s((loop, 30%), A)
  – E_s(S, X) = log (FE(S, X)/FO(S, X))
  – FE(): expected frequency
  – FO(): observed frequency
• E_p: a scoring matrix of 20 amino acids by 20 amino acids (pairwise interaction energy term)
  – E_p(X, Y, C) = log (FE(X, Y)/FO(X, Y, C))
  – FE(): expected frequency
  – FO(): observed frequency
  – X, Y: amino acids
  – C: condition, e.g., distance, relative angle, …
• E_g: alignment gap penalty
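The singleton term E_s(S, X) = log (FE(S, X)/FO(S, X)) can be sketched from counts as follows. This is a minimal illustration, not the scoring used by any particular threading program. It assumes the expected frequency FE is the product of the marginal frequencies of the environment and the amino acid (an independence null model, which the slides do not specify), adds a pseudocount to avoid log of zero, and uses made-up toy observations. The names `singleton_energy` and `alignment_energy` are hypothetical, and the pairwise term E_p is left out.

```python
import math
from collections import Counter


def singleton_energy(observations, pseudocount=1.0):
    """Return {(env, aa): log(FE / FO)} from a list of (env, aa) observations.

    FO is the observed joint frequency. FE is ASSUMED here to be the product
    of the marginal frequencies (environment and residue type independent).
    """
    pair_counts = Counter(observations)
    env_counts = Counter(env for env, _ in observations)
    aa_counts = Counter(aa for _, aa in observations)
    envs, aas = sorted(env_counts), sorted(aa_counts)
    total = len(observations) + pseudocount * len(envs) * len(aas)
    table = {}
    for env in envs:
        for aa in aas:
            f_obs = (pair_counts[(env, aa)] + pseudocount) / total
            p_env = (env_counts[env] + pseudocount * len(aas)) / total
            p_aa = (aa_counts[aa] + pseudocount * len(envs)) / total
            table[(env, aa)] = math.log(p_env * p_aa / f_obs)
    return table


def alignment_energy(aligned, table, gap_penalty=1.0):
    """Sum E_s over aligned (env, aa) pairs plus E_g for each gap (None)."""
    return sum(gap_penalty if pair is None else table[pair] for pair in aligned)


# Made-up toy data: A is seen mostly in (helix, 30%), G mostly in (loop, 80%).
helix, loop = ("alpha-helix", 30), ("loop", 80)
observations = ([(helix, "A")] * 8 + [(helix, "G")] * 2
                + [(loop, "A")] * 2 + [(loop, "G")] * 8)
table = singleton_energy(observations)
print(table[(helix, "A")], table[(helix, "G")])
total = alignment_energy([(helix, "A"), None, (loop, "G")], table)
```

Because the ratio is expected over observed, a residue seen in an environment more often than expected gets a negative (favorable) energy, which is consistent with searching for the alignment that minimizes the total.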

Protein Threading
• E_p: a scoring matrix of 20 amino acids by 20 amino acids
  – Unlike the BLOSUM matrix, this matrix measures how strongly two amino acids prefer to be next to each other

Protein Threading – algorithm
• Considering the pairwise interaction energy makes the problem much more difficult to solve: the dynamic programming algorithm does not work any more!
• There are other techniques that can be used to solve the problem: integer programming, divide-and-conquer

Protein Threading
• Protein fold recognition: leading to more functional information
• Backbone structure prediction: providing structural information

Fold Recognition
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
Score = -1500   Score = -720   Score = -1120   Score = -900
Which one is the correct structural fold for the target sequence, if any? The one with the lowest score?
• Different template structures may have different background scores, making direct comparison of threading scores against different templates invalid
• Comparison of threading results should be based on how much the score stands out in its background score distribution rather than on the threading scores directly
• Threading 100,000 sequences against a template structure provides the baseline information about the background scores of the template
• E-value: by locating where the threading score of a particular query sequence falls in this background distribution, one can decide how significant the score, and hence the threading result, is
  [Figure: background score distribution with “not significant” and “significant” regions]

Fold Recognition
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
Score = -1500   Score = -720   Score = -1120   Score = -900
E-value = e-1   E-value = e-2   E-value = 0.5 e-1   E-value = e-21
• If no predictions have significant e-values, a prediction program should indicate that it could not make a reliable prediction!
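The point about background scores can be made concrete with a small sketch. It assumes each template's background is a list of threading scores from unrelated sequences, and it reports a z-score and an empirical tail fraction; this is an illustration of the idea, not the E-value formula of PROSPECT or any other program. The function name `score_significance`, the simulated Gaussian backgrounds and their parameters are all made up (10,000 samples are used instead of 100,000 to keep the example fast).

```python
import random
import statistics


def score_significance(query_score, background_scores):
    """Return (z, empirical_p) for a threading score; lower scores are better.

    z is positive when the query scores below (better than) the background
    mean. empirical_p is the fraction of background scores at least as low
    as the query, with a +1 correction so it is never exactly zero.
    """
    mean = statistics.fmean(background_scores)
    sd = statistics.pstdev(background_scores)
    z = (mean - query_score) / sd
    as_good = sum(1 for s in background_scores if s <= query_score)
    empirical_p = (as_good + 1) / (len(background_scores) + 1)
    return z, empirical_p


rng = random.Random(0)
# Hypothetical backgrounds: template A tends to score low for any sequence,
# template B does not.
background_a = [rng.gauss(-1400, 100) for _ in range(10_000)]
background_b = [rng.gauss(-500, 50) for _ in range(10_000)]

z_a, p_a = score_significance(-1500, background_a)
z_b, p_b = score_significance(-900, background_b)
print(f"template A: score -1500, z = {z_a:.1f}, p = {p_a:.4f}")
print(f"template B: score  -900, z = {z_b:.1f}, p = {p_b:.4f}")
```

Under these assumed backgrounds, the raw score of -1500 is lower than -900, yet the -900 hit stands out far more from its own background, which is why the raw scores should not be compared across templates.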

Structure Prediction
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE

Prediction of Protein Structures
• Protein threading can predict only the backbone structure of a protein (side-chains have to be predicted using other methods)
  [Figure: blue, actual structure; green, predicted structure]
• Typically the lower the e-value, the higher the prediction accuracy
• State of the art: ~60% of the soluble proteins in a genome could have correct fold prediction, and 50% of these proteins have good backbone structure prediction
• Functional inference could be made based on
  – accurately predicted structures
  – correctly identified structural folds

Challenging Issues
• Structure prediction of membrane proteins
• Structure prediction of splicing isoforms

Take-Home Message
• Protein structure can be computationally predicted
• Template-based structure prediction represents a key technique for structure prediction

Homework
• Run PROSPECT (select "PROSPECT" only) on the following protein sequence to make a structure prediction. Select the hit with the highest z-score as your structure prediction.
  FVFQQSEKFAKVENQYQLLKLETNEFQQLQSKISLISEKLESTESILQEATSSMSLMTQF
  EQEVSNLQDIMHDIQNNEEVLTQRMQSLNEKFQNITDFWKRSLEEMNINTDIFKSEAKHI
  HSQVTVQINSAEQEIKLLTERLKDLEDSTLRNIRTVKRQEEEDLLRVEEQLGSDTKAIEK
  LEEEQHALFARDEDLTNKLSDYEPKVEECKTHLPTIESAIHSVLRVSQDLIETEKKMEDL
  TMQMFNMEDDMLKAVSEIMEMQKTLEGIQYDNSILKMQNELDILKEKVHDFIAYSSTGEK
  GTLKEYNIENKGIGGDF
  – Please provide the sequence alignment and the predicted structure (2D image).
  – Do you consider your prediction reliable? Why?
  – Which SCOP family and superfamily does the protein belong to (using the first four digits/letters of the protein code to search)?
• PROSPECT server: http://csbl.bmb.uga.edu/protein_pipeline
  – Username: guest
  – Password: bcmb3600
  – Unselect all options except PROSPECT
  – Give your name as the sequence name
• Describe a scenario where sequence-based and structure-based methods for protein function complement each other.

This note was uploaded on 06/16/2011 for the course BIO 127 taught by Professor Xuyin during the Spring '10 term at Georgetown.
