motif finding


Identification of Functional Motifs in DNA and Protein
Ying Xu (徐鹰)

What are Motifs
• Motif: a recurring element in a bio-sequence or structure
• A motif typically has functional implications, since it is preserved (recurring) during evolution
• There are different types of motifs in biological data
  – sequence motifs
  – structure motifs
  – network motifs
  – …

Sequence Motifs
• DNA sequence motifs: they generally function as regulatory elements in biological systems
  – transcription regulatory motifs
• Protein sequence motifs: they are generally functional sites

Transcription Regulation
• Transcription: the process by which genetic information from DNA is transferred into (messenger) RNA
• Regulation of the transcription machinery through
  – activation of “transcription regulators” (factors) that help to recruit RNAP (expression)
  – inhibition of RNAP from binding to the promoter (repression)
• Different classes of transcription factors regulate the on and off of transcription as well as the efficiency
  – activators (general or specific)
  – enhancers (general or specific)
  – repressors (general or specific)
  – inhibitors (general or specific)
  – specificity regulators
  – …
[Figure: diagram labeled RNAP]
[Figure labels: trans-acting element, gene]
A transcription factor could regulate multiple genes …
The bigger picture … [Figure labels: gene, cis-acting element]
Knowing the genes regulated by a transcription factor can help to elucidate a biological process responsible for a complex task – a systems biology view

Problem Definition
• In a genome, find genes that are transcriptionally (co-)regulated by the same transcription factor
• This represents one of the most fundamental problems in the elucidation of complex biological systems
• … by identifying genes sharing “common” regulatory binding sites, we can infer that these genes are co-regulated

TGTGAAAGACTGTTTTTTTGATCGTTTTGACAAAAATGGAAGTCCACA
AAGTCCACATTGATTATTTGCACGGCGTCACACTTTGCTATCCCATAG
TGATGTACTGCATGTATGCAAAGGACGTCAGATTACCGTGCAGTACAG
TAAACGATTCCACTAATTTATTCCATGTCACTCTTTTCGCATCTTTGT
ACATTACCGCCAATTCTGTAACAGAGATCACACAAAGCGACGGTGGGG
ACTTTTTTTTCATATGCCTGACGGAGTTGACACTTGTAAGTTTTCAAC

DNA Motif Identification
• Binding sites of the same transcription factor do not have to be exactly the same in sequence; rather, they should be “conserved” (see the six sequences above)
• Measure of “conservedness”: information content!
• Information content of one aligned position, where F(X) is the frequency of base X at that position and 0.25 is the uniform background frequency:

  $IC = \sum_{X \in \{A, C, G, T\}} F(X) \log_{10} \frac{F(X)}{0.25}$

  Bases with F(X) = 0 contribute nothing to the sum. The example values below are consistent with a base-10 logarithm.

  Position   F(A)   F(C)   F(G)   F(T)   IC
  1          1/6    1/6    2/6    2/6    0.0246
  2          0      5/6    1/6    0      0.407
  3          0      0      0      6/6    0.602

DNA Motif Identification
• A group of sequence motifs is considered “conserved” if their aligned positions have “high” information content
• How to find blocks of DNA that have high information content?
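The per-position information content defined above can be sketched in a few lines of Python. This is only an illustration: the function name `information_content` and the strings used to encode the three example columns are assumptions, and the base-10 logarithm is inferred from the slide's numbers rather than stated in it.

```python
import math
from collections import Counter

BACKGROUND = 0.25  # uniform background frequency used in the slide's formula


def information_content(column: str) -> float:
    """IC of one aligned column: sum of F(X) * log10(F(X) / 0.25) over observed bases.

    Bases that never occur in the column are skipped, i.e. they contribute 0.
    """
    n = len(column)
    ic = 0.0
    for count in Counter(column).values():
        freq = count / n
        ic += freq * math.log10(freq / BACKGROUND)
    return ic


# The three example columns, written as the bases seen across six sequences
# (the order of bases within a column does not matter).
example_columns = {
    "position 1": "ACGGTT",  # A: 1/6, C: 1/6, G: 2/6, T: 2/6
    "position 2": "CCCCCG",  # C: 5/6, G: 1/6
    "position 3": "TTTTTT",  # T: 6/6
}

for name, column in example_columns.items():
    print(name, round(information_content(column), 4))
```

Under these assumptions the first and third columns reproduce the slide's 0.0246 and 0.602. The second column evaluates to about 0.406, slightly below the 0.407 shown on the slide.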
Simple if the sequences are already aligned! But …

DNA Motif Identification
• … this is what you typically get

TGTGAAAGACTGTTTTTTTGATCGTTTTGACAAAAATGGAAGTCCACA
AAGTCCACATTGATTATTTGCACGGCGTCACACTTTGCTATCCCATAG
TGATGTACTGCATGTATGCAAAGGACGTCAGATTACCGTGCAGTACAG
TAAACGATTCCACTAATTTATTCCATGTCACTCTTTTCGCATCTTTGT
ACATTACCGCCAATTCTGTAACAGAGATCACACAAAGCGACGGTGGGG
ACTTTTTTTTCATATGCCTGACGGAGTTGACACTTGTAAGTTTTCAAC

• How to find the “conserved” sites?

DNA Motif Identification
• Basic idea
  – assume that we know that the length of the “conserved” sites is K (K is typically from 5 to 30)
  – go through all possible combinations of K-mers, one K-mer from each sequence, and calculate the information content
  – call a particular combination a “conserved” site if its IC is high
• Too many combinations to consider!
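The basic idea above, and why it does not scale, can be sketched as follows. This is a minimal illustration and not the lecture's own implementation: the helper names (`column_ic`, `alignment_ic`, `best_combination`, `count_combinations`) and the three toy sequences are hypothetical, the alignment score is assumed to be the sum of per-column IC values, and the logarithm is assumed to be base 10.

```python
import math
from collections import Counter
from itertools import product


def column_ic(column):
    """IC of one aligned column with a uniform 0.25 background (base-10 log assumed)."""
    n = len(column)
    return sum((c / n) * math.log10((c / n) / 0.25) for c in Counter(column).values())


def alignment_ic(kmers):
    """Score a gapless stack of K-mers as the sum of its per-column IC values."""
    return sum(column_ic(col) for col in zip(*kmers))


def kmers_of(seq, k):
    """All overlapping K-mers of one sequence, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def best_combination(seqs, k):
    """Exhaustively try one K-mer per sequence and keep the highest-IC combination."""
    candidates = [kmers_of(s, k) for s in seqs]
    best = max(product(*candidates), key=alignment_ic)
    return alignment_ic(best), best


def count_combinations(seqs, k):
    """Number of combinations the exhaustive search has to score."""
    return math.prod(len(s) - k + 1 for s in seqs)


# Hypothetical toy input: three short sequences that all contain ATTAC.
toy = ["GGATTACC", "CATTACGT", "TTGATTAC"]
score, site = best_combination(toy, 5)
print(site, round(score, 3), count_combinations(toy, 5))
```

The toy case needs only 4 × 4 × 4 = 64 evaluations. For six sequences of 48 bases and K = 5, `count_combinations` gives 44^6, roughly 7.3 × 10^9 combinations, and the count grows exponentially with the number of sequences, which is why the exhaustive search is impractical.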
DNA Motif Identification
[Figure: graph whose nodes are the 5-mers ATTAC, ATTAG, ATTTG, GGCTA, GGCTT, GGGTA and GGGTT, with numeric edge weights (1, 2, 4, 5)]

DNA Motif Identification
• Prim’s algorithm
  – step 1: select an arbitrary node as the current tree
  – step 2: find an external node that is closest to the tree, and add it with its corresponding edge into the tree
  – step 3: repeat step 2 until all nodes are connected in the tree
[Figure: panels (a) to (e), a weighted graph and the tree grown on it step by step]

DNA Motif Identification
[Figure: the 5-mer graph again, followed by a tree connecting the same 5-mers using edges of weight 1, 1, 4, 1, 1]

DNA Motif Identification
• Solving the motif identification problem through identification of clusters in noisy data

Protein Sequence Motifs
• The number of different proteins in nature could be in many billions of trillions
• Proteins are grouped into functional families, where proteins of the same family generally share the same functional attributes
• … and the number of functional families of proteins is believed to be small (from a few thousand up to tens of thousands)
• Sequence motifs in a protein sequence
  – proteins of the same family often have “conserved” sequence segments
• Example
  – R-N-L-[LIV]-S-[VG]-[GA]-Y-[KN]-N-[IVA] is known to be associated with the 14-3-3 family of proteins (it is called a 14-3-3 proteins signature)
• A conserved motif typically has the same length in every family member, and each position is either of one type of amino acid or of a small number of possible amino acid types (“conserved”)

Protein Sequence Motifs
The ligand-binding site consists of a hydrophobic patch that contains a cluster of conserved aromatic residues and is surrounded by two charged and variable loops.
Individual TPR domains are composed of two anti-parallel alpha helices separated by a turn. This creates a groove with a large amount of surface area available for ligand binding.

Protein Sequence Motifs
• By identifying a particular known sequence motif (e.g., R-N-L-[LIV]-S-[VG]-[GA]-Y-[KN]-N-[IVA]), one could possibly predict the function of a novel protein

Identification of Protein Motifs
• What information can we use to identify a “functional motif”?

MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWT
YTEAGTYKILGCSVRRAETDRRPPKVADIILCSYYKKVMGCQY
……

• Approach #1: prediction through detection of conserved regions in a multiple sequence alignment of “related proteins”
  – Get a collection of proteins known to be of the same family
  – Do multiple sequence alignment to align these sequences
  – Identify the “conserved” positions
• The more distantly related the proteins are, the more significant the “conserved” positions are

Identification of Protein Motifs
• Approach #2: identification of “functional motifs” through matching the target sequence with known functional motifs

Protein sequence:
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWT
YTEAGTYKILGCSVRRAETDRRPPKVADIILCSYYKKVMGCQY
……
Known motif: R-N-L-[LIV]-S-[VG]-[GA]-Y-[KN]-N-[IVA]

Matching procedure:
– for each alignment of the motif against the protein sequence
  – score each matched pair using a simple scoring scheme: if the aa in the protein sequence is one of the aa’s allowed at the matched position in the motif, assign “1”; otherwise “0”
  – add up the scores for all matched positions

Protein Motif Databases
• There are a number of databases/servers for retrieving and predicting functional motifs in protein sequences
• PROSITE – http://us.expasy.org/prosite/prosite_details.html
• BLOCKS – http://blocks.fhcrc.org/blocks/help/about_blocks.html
• PRINTS – http://umber.sbs.man.ac.uk/dbbrowser/PRINTS/
• Pfam – http://www.sanger.ac.uk/Software/Pfam/

PROSITE
• PROSITE currently contains patterns specific for each of more than a thousand protein families. Each of these signatures comes with documentation providing background information on the structure and function of these proteins.
• Example patterns:
  C-x-H-R-[GA]-x(8)-G-N-x(5)-C-x-[FY]-H
  [FY]-x-[LIVMFY]-x-S-[TV]-x-K-x(4)-[AGLM]-x(2)-[LC]

Take-Home Message
• Conserved motifs are generally functional
• DNA motifs tend to be regulatory elements, while protein motifs tend to have “functional” roles

Homework
• PS01159 is a known functional site. Search the PROSITE website to answer the following questions:
  – What is the consensus pattern of this functional site?
  – Give five proteins that have this pattern in their sequences
  – What biological functions is this functional site typically involved in?
• Describe the main differences among the motifs in the PROSITE, PRINTS and BLOCKS databases
• Calculate the information content of each aligned position in the multiple sequence alignment below, and predict which portion might be a binding site based on the information content. Explain why.

TGATCGTTTTGACAAAAATG
AGCACGGCGTCACACTTTGC
CAAGGTACGTCAGATGACCG
TATTTCCTCTCACTCTCTTT
GAACAAAGATCACACAAAGC
TGGCGGTGTTGACACTGGTA
...
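
Supplementary sketch for Approach #2: the 1/0 matching procedure described under “Identification of Protein Motifs” could look like the Python below. It is an illustration under stated assumptions, not the lecture's code: `parse_motif`, `score_window` and `best_match` are hypothetical names, the parser handles only single residues and bracketed sets (enough for the 14-3-3 pattern, but not `x` or `x(n)` elements), “each alignment” is taken to mean every ungapped placement of the motif along the sequence, and the test sequence is made up.

```python
def parse_motif(pattern):
    """Turn 'R-N-L-[LIV]-S' into a list of allowed-residue sets, one per position.

    Only single upper-case residues and [..] sets are supported in this sketch.
    """
    positions = []
    for element in pattern.split("-"):
        if element.startswith("[") and element.endswith("]"):
            positions.append(set(element[1:-1]))
        elif len(element) == 1 and element.isupper():
            positions.append({element})
        else:
            raise ValueError(f"unsupported pattern element: {element!r}")
    return positions


def score_window(window, motif):
    """Add 1 for each residue that is allowed at its motif position, 0 otherwise."""
    return sum(1 for aa, allowed in zip(window, motif) if aa in allowed)


def best_match(sequence, motif):
    """Slide the motif along the sequence; return (score, start) of the best placement.

    Ties are resolved in favour of the earliest start position.
    """
    k = len(motif)
    if len(sequence) < k:
        raise ValueError("sequence is shorter than the motif")
    scored = [(score_window(sequence[i:i + k], motif), i)
              for i in range(len(sequence) - k + 1)]
    return max(scored, key=lambda pair: pair[0])


motif = parse_motif("R-N-L-[LIV]-S-[VG]-[GA]-Y-[KN]-N-[IVA]")
# Made-up sequence with a perfect occurrence of the motif starting at index 4.
made_up = "MTYK" + "RNLLSVAYKNV" + "GETT"
print(best_match(made_up, motif))
```

A placement whose score equals the motif length (11 here) matches every position; lower scores indicate partial matches, and where to set a cutoff is left open by the slides.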