protein function



Computational Prediction of Protein Functions

Ying Xu (徐鹰)

Protein Function Prediction

A new bacterium was identified and found to be able to live on wood and produce ethanol.

Question: which genes are involved in the conversion from wood (cellulose) to ethanol?

Protein Functions at Different Levels

• Molecular level function – molecular activity with respect to interactions with other molecules
  – DNA-binding protein
• Cellular level function – molecular role in a cellular system
  – A transcription regulator
• Phenotypic function – a molecule's influence on the properties of an organism as a whole
  – It affects the color of the eyes

Protein Functions at Different Levels

• Low-resolution prediction
  – E.g., protein X is an enzyme
  – Protein X is a DNA-binding protein
• Medium-resolution prediction
  – E.g., protein X is a protease
  – Protein X is a transcriptional repressor when it binds to DNA
• High-resolution prediction
  – E.g., protein X is a protease with trypsin specificity
  – Protein X is a lac repressor

Orthologous versus Paralogous Genes

Orthologous genes are homologous genes that are descended from a common ancestor through speciation only and typically encode proteins with the same function.

Paralogous genes are homologous genes that evolved through duplication and may encode proteins with similar but not identical functions.

For "high-resolution" function predictions, we need to distinguish orthologous genes from paralogous genes.

Detection of Orthologous Genes

• One simple approach: genes A and B of two genomes are predicted to be orthologous if A is the best "hit" of B in A's genome and vice versa
  – this may not always work when two genomes are only remotely related
• An enhanced method: genes A, B, and C from three genomes are considered orthologous if A and B, B and C, and C and A are all "best hits" in pairwise comparisons
  – an implementation of this strategy is called COG (clusters of orthologous genes)
  – http://www.ncbi.nlm.nih.gov/COG

Function Prediction Methods

• Sequence-based approaches
  – protein A has function X, and protein B is a homolog (ortholog) of protein A; hence B has function X
• Motif-based approaches
  – a group of genes have function X and they all have motif Y; protein A has motif Y; hence protein A's function might be related to X
• Structure-based approaches
  – protein A has structure X, and X has such-and-such structural features; hence A's functional sites are ….
• Function prediction based on "guilt-by-association"
  – gene A has function X and gene B is often "associated" with gene A; B might have a function related to X

Sequence-based Approaches

• Find orthologues or homologues of a new gene that have known functions through sequence comparison, and predict the function of the new gene using the known function
• Key: need a database of genes with well-annotated functions

Protein Function Families

• Proteins can be grouped into functional families
  – Each family consists of proteins with the same functions
  – By associating a novel protein with a protein family, one can predict the function of the novel protein
• Pfam is one of the most popular protein family classification schemes/databases
  – currently has 11,912 protein families
  – 75% of protein sequences have at least one match to Pfam

Search Pfam Database

MRVLKFGGTS VANAERFLRV ADILESNARQ GQVATVLSAP AKITNHLVAM IEKTISGQDA …….
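The two orthology rules from the "Detection of Orthologous Genes" slide (reciprocal best hits between two genomes, and the three-genome "triangle" of best hits that underlies COG) can be sketched as follows. This is a minimal illustration and not the COG implementation: the `best_hit` table, the genome labels `g1`/`g2`/`g3`, and all gene names are hypothetical, and in practice the best-hit tables would be derived from a sequence-similarity search.

```python
# Hypothetical best-hit tables (made-up data for illustration only):
# best_hit[(source_genome, target_genome)][gene] is the top-scoring
# match of `gene` in target_genome.
best_hit = {
    ("g1", "g2"): {"a1": "b1", "a2": "b2", "a3": "b2"},
    ("g2", "g1"): {"b1": "a1", "b2": "a3"},
    ("g1", "g3"): {"a1": "c1", "a2": "c2", "a3": "c2"},
    ("g3", "g1"): {"c1": "a1", "c2": "a2"},
    ("g2", "g3"): {"b1": "c1", "b2": "c2"},
    ("g3", "g2"): {"c1": "b1", "c2": "b2"},
}


def is_reciprocal_best_hit(table, genome_x, gene_x, genome_y, gene_y):
    """True if gene_x's best hit in genome_y is gene_y and vice versa."""
    forward = table.get((genome_x, genome_y), {}).get(gene_x)
    backward = table.get((genome_y, genome_x), {}).get(gene_y)
    return forward == gene_y and backward == gene_x


def reciprocal_best_hits(table, genome_x, genome_y):
    """All predicted ortholog pairs between two genomes (simple approach)."""
    pairs = []
    for gene_x, gene_y in table.get((genome_x, genome_y), {}).items():
        if is_reciprocal_best_hit(table, genome_x, gene_x, genome_y, gene_y):
            pairs.append((gene_x, gene_y))
    return sorted(pairs)


def orthologous_triangles(table, g1, g2, g3):
    """Gene triples whose three pairwise comparisons are all reciprocal best hits."""
    triangles = []
    for a, b in reciprocal_best_hits(table, g1, g2):
        c = table.get((g2, g3), {}).get(b)
        if c is None:
            continue
        if (is_reciprocal_best_hit(table, g2, b, g3, c)
                and is_reciprocal_best_hit(table, g3, c, g1, a)):
            triangles.append((a, b, c))
    return triangles


pairs_g1_g2 = reciprocal_best_hits(best_hit, "g1", "g2")
triangles = orthologous_triangles(best_hit, "g1", "g2", "g3")
print(pairs_g1_g2)
print(triangles)
```

In this toy data the pair (a3, b2) passes the two-genome test but is not supported by the third genome, so it is dropped by the triangle rule. That is the kind of case the enhanced method is meant to filter out.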
Pfam – basic functions

- Functions
- Structure
- Multiple sequence alignment
- Domains
- COGs
- Functional sites
- Interactions
- Process
- Pathways
- Phylogenetic tree
- Superfamilies
- ……

Pfam – alignment of family members

Pfam – function sites

Pfam – interactions

Pfam – pathways

Enzyme Classification Database

• EC: a database of all known enzymes (~4,150 entries)
• Classification of enzymes into classes, sub-classes, sub-sub-classes, and sub-sub-sub-classes
  – Oxidoreductases
  – Transferases
  – Hydrolases
  – Lyases
  – Isomerases
  – Ligases
• http://www.brenda.uni-koeln.de/

Gene Ontology Database

GO: Gene Ontology: http://www.geneontology.org/

Three gene ontologies:

• Molecular function – tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
• Biological process – broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions
• Cellular component – subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme

Motif-based Function Prediction

• Prediction of protein functions based on identified sequence motifs
• PROSITE contains patterns specific for more than a thousand protein families. ScanPROSITE scans a protein sequence for occurrences of the patterns and profiles stored in PROSITE

Motif-based Function Prediction

Search PROSITE using ScanPROSITE

MSEGSDNNGDPQQQGAEGEAVGENKMKSRLRK GALKKKNVFNVKDHCFIARFFKQPTFCSHCKDFIC GYQSGYAWMGFGKQGFQCQVCSYVVHKRCHEY VTFICPGKDKG IDSDSPKTQH ……..

The sequence has ASN_GLYCOSYLATION N-glycosylation site: 242 245 NETL

Structure-based Function Prediction

Structure-based methods could possibly detect remote homologues that are not detectable by sequence-based methods
  – using structural information in addition to sequence information
  – protein threading (sequence-structure alignment) is a popular method

Structure-based methods could provide more than just "homology" information

Structure-based Function Prediction

Prediction of ligand-binding sites
  – For ~85% of ligand-binding proteins, the largest cleft is the ligand-binding site
  – For an additional ~10% of ligand-binding proteins, the second largest cleft is the ligand-binding site

Structure-based Function Prediction

Prediction of macromolecular binding sites
  – there is a strong correlation between macromolecular binding sites (with protein, DNA and RNA) and disordered regions
  – disordered regions in a protein sequence can be predicted using computational methods

Dynamics-based functional inference … providing functional mechanism …

Phylogenetic Profile Analysis

• Function prediction of genes based on "guilt-by-association" – a non-homologous approach
• The phylogenetic profile of a protein is a string that encodes the presence or absence of the protein in every sequenced genome
• Because proteins that participate in a common structural complex or metabolic pathway are likely to co-evolve, the phylogenetic profiles of such proteins are often "similar"

Phylogenetic Profile Analysis

Phylogenetic profile (against N genomes)
  – For each gene X in a target genome (e.g., E. coli), build a phylogenetic profile as follows
  – If gene X has a homolog in genome #i, the i-th bit of X's phylogenetic profile is "1"; otherwise it is "0"

Phylogenetic Profile Analysis

• Example – phylogenetic profiles based on 89 genomes
orf1034:1110110110010111110100010100000000111100011111110110111010101
orf1036:1011110001000001010000010010000000010111101110011011010000101
orf1037:1101100110000001110010000111111001101111101011101111000010100
orf1038:1110100110010010110010011100000101110101101111111111110000101
orf1039:1111111111111111111111111111111111111111101111111111111111101
orf104: 1000101000000000000000101000000000110000000000000100101000100
orf1040:1110111111111101111101111100000111111100111111110110111111101
orf1041:1111111111111111110111111111111101111111101111111111111111101
orf1042:1110100101010010010110000100001001111110111110101101100010101
orf1043:1110100110010000010100111100100001111110101111011101000010101
orf1044:1111100111110010010111010111111001111111111111101101100010101
orf1045:1111110110110011111111111111111101111111101111111111110010101
orf1046:0101100000010001011000000111110000010100000001010010100000000
orf1047:0000000000000001000010000001000100000000000000010000000000000
orf105: 0110110110100010111101101010111001101100101111100010000010001
orf1054:0100100110000001100001000100000000100100100001000100100000000

Genes with similar phylogenetic profiles have related functions or are functionally linked – D. Eisenberg and colleagues (1999)

Phylogenetic Profile Analysis

• Phylogenetic profiles contain a great amount of functional information
• Phylogenetic profile analysis can be used to distinguish orthologous genes from paralogous genes
• Subcellular localization: 361 yeast nucleus-encoded mitochondrial proteins are identified at 50% accuracy with 58% coverage through phylogenetic profile analysis
• Functional complementarity: by examining inverse phylogenetic profiles, one can find functionally complementary genes that have evolved through one of several mechanisms of convergent evolution.
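The profile construction and comparison described in the slides above can be sketched in a few lines. This is an illustration under stated assumptions: the slides do not say how "similar" is measured, so Hamming distance is used here as one simple choice, genome indices are 0-based rather than the slides' #i numbering, and the gene names and 8-genome profiles are hypothetical toy data rather than the orf profiles listed above.

```python
def build_profile(genomes_with_homolog, n_genomes):
    """Bit string whose i-th character is '1' if the gene has a homolog in genome i."""
    return "".join("1" if i in genomes_with_homolog else "0"
                   for i in range(n_genomes))


def hamming_distance(profile_a, profile_b):
    """Number of genomes in which the two genes differ in presence/absence."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same set of genomes")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def inverse_profile(profile):
    """Flip every bit; a gene matching this is present exactly where the other is absent."""
    return "".join("0" if bit == "1" else "1" for bit in profile)


def linked_pairs(profiles, max_distance):
    """Gene pairs whose profiles differ in at most max_distance genomes."""
    names = sorted(profiles)
    pairs = []
    for i, name_x in enumerate(names):
        for name_y in names[i + 1:]:
            distance = hamming_distance(profiles[name_x], profiles[name_y])
            if distance <= max_distance:
                pairs.append((name_x, name_y, distance))
    return pairs


# Hypothetical profiles against 8 genomes (toy data).
profiles = {
    "geneA": build_profile({0, 1, 3, 6}, 8),
    "geneB": "11010110",
    "geneC": "00101101",
    "geneD": "11111111",
}

similar = linked_pairs(profiles, max_distance=1)
complementary = [name for name, profile in profiles.items()
                 if profile == inverse_profile(profiles["geneA"])]
print(similar)
print(complementary)
```

Here geneA and geneB differ in a single genome and are reported as candidates for a functional link, while geneC has exactly the inverse profile of geneA, the pattern the functional complementarity bullet refers to. A gene such as geneD that is present in every genome carries little information, which is why a real analysis would usually weight or filter such profiles.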
• Question: which genes are involved in the conversion from wood (cellulose) to ethanol?

Challenging Problems

• Prediction of orthology relationships
• Identification of functional associations

Take-Home Message

• The homology-based approach is the key technique for protein function prediction
• Function prediction at higher resolution requires additional information
  – orthology
  – protein structure
• Functions can be predicted at different levels
  – molecular
  – cellular
  – phenotypic

Homework

• Give a detailed report on the known functions, structures, etc. of a Pfam protein family whose name starts with "A"

Homework

• Write a search report on a specific protein, say cpiB, based on your search results against the GO database http://www.geneontology.org/ ...