MIT OpenCourseWare http://ocw.mit.edu
6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution, Fall 2008
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

Computational Biology: Genomes, Networks, Evolution
Motif Discovery
Lecture 9, October 2, 2008

Regulatory Motifs
Find promoter motifs associated with co-regulated or functionally related genes

Motifs Are Degenerate
• Protein-DNA interactions
– Proteins read DNA by “feeling” the chemical properties of the bases
– Without opening DNA (not by base complementarity)
• Sequence specificity
– Topology of 3D contact dictates sequence specificity of binding
– Some positions are fully constrained; other positions are degenerate
– “Ambiguous / degenerate” positions are loosely contacted by the transcription factor
[Figure: protein-DNA contacts; labels include the sugar-phosphate backbone, base pairs, and the residues Ser, Arg, Asn. Figure by MIT OpenCourseWare.]

Other “Motifs”
• Splicing Signals
– Splice junctions
– Exonic Splicing Enhancers (ESE)
– Exonic Splicing Suppressors (ESS)
• Protein Domains
– Glycosylation sites
– Kinase targets
– Targeting signals
• Protein Epitopes
– MHC binding specificities

Essential Tasks
• Modeling Motifs
– How to computationally represent motifs
• Visualizing Motifs
– Motif “Information”
• Predicting Motif Instances
– Using the model to classify new sequences
• Learning Motif Structure
– Finding new motifs, assessing their quality

Modeling Motifs

Consensus Sequences
Useful for publication
IUPAC symbols for degenerate sites
Not very amenable to computation

Gene     Binding site
HEM13    CCCATTGTTCTC
HEM13    TTTCTGGTTCTC
HEM13    TCAATTGTTTAG
ANB1     CTCATTGTTGTC
ANB1     TCCATTGTTCTC
ANB1     CCTATTGTTCTC
ANB1     TCCATTGTTCGT
ROX1     CCAATTGTTTTG
Consensus: YCHATTGTTCTC

Figure by MIT OpenCourseWare. Nature Biotechnology 24, 423-425 (2006)

Probabilistic Model
Aligned binding sites (positions 1 to K; first six positions shown):
HEM13    CCCATT
HEM13    TTTCTG
HEM13    TCAATT
ANB1     CTCATT
ANB1     TCCATT
ANB1     CCTATT
ANB1     TCCATT
ROX1     CCAATT

Count frequencies, add pseudocounts

Position Frequency Matrix (PFM); columns are motif positions M1 ... MK, entries are P_k(S|M):
     M1                       MK
A    .1   .2   .1   .4   .1   .1
C    .2   .2   .2   .2   .5   .1
G    .4   .5   .4   .2   .2   .1
T    .3   .1   .2   .2   .2   .7
Figure by MIT OpenCourseWare.

Scoring A Sequence
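The "count frequencies, add pseudocounts" step that produces a PFM from aligned sites can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `position_frequency_matrix` and the pseudocount of 1 per base are assumptions chosen for the example.

```python
from collections import Counter

ALPHABET = "ACGT"

def position_frequency_matrix(sites, pseudocount=1.0):
    """Return one {base: frequency} dict per motif position.

    `pseudocount` (an assumed value) is added to every base count so that
    unseen bases never get probability zero.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    pfm = []
    for k in range(length):
        counts = Counter(site[k] for site in sites)
        pfm.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return pfm

# First six positions of the eight binding sites from the lecture figure.
sites = ["CCCATT", "TTTCTG", "TCAATT", "CTCATT",
         "TCCATT", "CCTATT", "TCCATT", "CCAATT"]
pfm = position_frequency_matrix(sites)
for k, column in enumerate(pfm, start=1):
    print(k, {b: round(p, 2) for b, p in column.items()})
```

Each column sums to 1 by construction; a larger pseudocount pulls the estimates toward uniform, which matters most when only a few sites are known.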
To score a sequence, we compare to a null model:

$$\text{Score} = \log \frac{P(S \mid \text{PFM})}{P(S \mid B)} = \log \prod_{i=1}^{N} \frac{P_i(S_i \mid \text{PFM})}{P(S_i \mid B)} = \sum_{i=1}^{N} \log \frac{P_i(S_i \mid \text{PFM})}{P(S_i \mid B)}$$

PFM:
A    .1   .2   .1   .4   .1   .1
C    .2   .2   .2   .2   .5   .1
G    .4   .5   .4   .2   .2   .1
T    .3   .1   .2   .2   .2   .7

Background DNA (B): A: 0.25, T: 0.25, G: 0.25, C: 0.25

Position Weight Matrix (PWM), the log (base 2) ratio of each PFM entry to the background:
A   -1.3  -0.3  -1.3   0.6  -1.3  -1.3
C   -0.3  -0.3  -0.3  -0.3   1    -1.3
G    0.6   1     0.6  -0.3  -0.3  -1.3
T    0.3  -1.3  -0.3  -0.3  -0.3   1.4

Scoring a Sequence
[Figure] Courtesy of Kenzie MacIsaac and Ernest Fraenkel. Used with permission. MacIsaac, Kenzie, and Ernest Fraenkel. "Practical Strategies for Discovering Regulatory DNA Sequence Motifs." PLoS Computational Biology 2, no. 4 (2006): e36.
Common threshold = 60% of maximum score (MacIsaac & Fraenkel (2006) PLoS Comp Bio)

Visualizing Motifs – Motif Logos
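The log-odds scoring of a sequence against a PFM and a background model can be sketched as follows. This is an illustration under stated assumptions: the helper names (`pwm_from_pfm`, `score`, `max_score`) are hypothetical, the three-column PFM reuses columns from the lecture's matrix, and base-2 logarithms are assumed so that scores are in bits.

```python
import math

BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def pwm_from_pfm(pfm, background=BACKGROUND):
    """Convert each PFM column into log2(frequency / background)."""
    return [{b: math.log2(col[b] / background[b]) for b in col} for col in pfm]

def score(seq, pwm):
    """Sum of per-position log-odds; seq must be as long as the motif."""
    if len(seq) != len(pwm):
        raise ValueError("sequence and motif lengths differ")
    return sum(col[base] for col, base in zip(pwm, seq))

def max_score(pwm):
    """Score of the best possible sequence (the highest entry in each column)."""
    return sum(max(col.values()) for col in pwm)

# Three columns taken from the lecture's PFM, used here as a short example motif.
pfm = [
    {"A": 0.1, "C": 0.2, "G": 0.4, "T": 0.3},
    {"A": 0.2, "C": 0.2, "G": 0.5, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
pwm = pwm_from_pfm(pfm)
threshold = 0.6 * max_score(pwm)  # the "60% of maximum score" rule of thumb

for candidate in ["GGT", "TGT", "AAA"]:
    s = score(candidate, pwm)
    print(candidate, round(s, 2), "hit" if s >= threshold else "no hit")
```

Working in log space turns the product over positions into a sum, which is why the PWM can be precomputed once and reused for every window.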
Represent both base frequency and conservation at each position
• Height of letter proportional to frequency of base at that position
• Height of stack proportional to conservation at that position

Motif Information
The height of a stack is often called the motif information at that position, measured in bits
[Figure: motif logo; axes: Information versus Motif Position]

$$\text{Motif Position Information} = 2 - \sum_{b \in \{A,T,G,C\}} -p_b \log_2 p_b$$

Why is this a measure of information?

Uncertainty and probability
Uncertainty is related to our surprise at an event
• “The sun will rise tomorrow”: not surprising (p ~ 1)
• “The sun will not rise tomorrow”: very surprising (p << 1)
Uncertainty is inversely related to the probability of an event

Average Uncertainty
Two possible outcomes for sun rising
A: “The sun will rise tomorrow”, P(A) = p1
B: “The sun will not rise tomorrow”, P(B) = p2
What is our average uncertainty about the sun rising? Taking the uncertainty of an outcome with probability p to be −log p:

$$P(A)\,\text{Uncertainty}(A) + P(B)\,\text{Uncertainty}(B) = -p_1 \log p_1 - p_2 \log p_2 = -\sum_i p_i \log p_i = \text{Entropy}$$

Entropy
Entropy measures average uncertainty
Entropy measures randomness

$$H(X) = -\sum_i p_i \log_2 p_i$$

If log is base 2, then the units are called bits

Entropy versus randomness
Entropy is maximum at maximum randomness
Example: Coin Toss
• P(heads) = 0.1: not very random, H(X) = 0.47 bits
• P(heads) = 0.5: completely random, H(X) = 1 bit
[Figure: Entropy (0 to 1) plotted against P(heads) (0 to 1)]

Entropy Examples
[Figure: bar chart of P(x) over A, T, G, C, all equal]

$$H(X) = -[0.25\log_2(0.25) + 0.25\log_2(0.25) + 0.25\log_2(0.25) + 0.25\log_2(0.25)] = 2 \text{ bits}$$

[Figure: bar chart of P(x) over A, T, G, C, dominated by one base]

$$H(X) = -[0.1\log_2(0.1) + 0.1\log_2(0.1) + 0.1\log_2(0.1) + 0.75\log_2(0.75)] \approx 1.31 \text{ bits}$$

Information Content
Information is a decrease in uncertainty
Once I tell you the sun will rise, your uncertainty about the event decreases

$$\text{Information} = H_{\text{before}}(X) - H_{\text{after}}(X)$$

Information is the difference in entropy after receiving information

Motif Information

$$\text{Motif Position Information} = \underbrace{2}_{H_{\text{background}}(X)} - \underbrace{\sum_{b \in \{A,T,G,C\}} -p_b \log_2 p_b}_{H_{\text{motif}_i}(X)}$$

• H_background(X): prior uncertainty about the nucleotide (uniform over A, T, G, C, so H(X) = 2 bits)
• H_motif_i(X): uncertainty after learning it is position i in a motif (H(X) ≈ 1.31 bits for the skewed distribution above)
Uncertainty at this position has been reduced by about 0.69 bits

Motif Logo
• Conserved residue: reduction of uncertainty of 2 bits
• Little conservation: minimal reduction of uncertainty

Background DNA Frequency
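The entropy and motif-information definitions from the preceding slides can be checked numerically with a short Python sketch. The function names `entropy_bits` and `information_bits` are assumptions for illustration; the skewed column (0.1, 0.1, 0.1, 0.7) is the last column of the lecture's PFM, and a uniform background (2 bits) is assumed.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_bits(probs, background_entropy=2.0):
    """Motif position information: background entropy minus position entropy."""
    return background_entropy - entropy_bits(probs)

print(round(entropy_bits([0.1, 0.9]), 2))   # biased coin
print(round(entropy_bits([0.5, 0.5]), 2))   # fair coin
print(round(entropy_bits([0.25] * 4), 2))   # uniform nucleotides

# A fully conserved position versus the skewed PFM column (A, C, G at 0.1, T at 0.7).
print(round(information_bits([1.0, 0.0, 0.0, 0.0]), 2))
print(round(information_bits([0.1, 0.1, 0.1, 0.7]), 2))
```

A fully conserved position carries the full 2 bits, while a position that matches the background carries none; the stack height in a logo is this quantity.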
The definition of information assumes a uniform background DNA nucleotide frequency
What if the background frequency is not uniform? (e.g. Plasmodium)
[Figure: bar charts of P(x) over A, T, G, C. H_background(X): H(X) = 1.7 bits. H_motif_i(X): H(X) = 1.9 bits]

With the uniform value 2 replaced by the actual background entropy of 1.7 bits:

$$\text{Motif Position Information} = 1.7 - \sum_{b \in \{A,T,G,C\}} -p_b \log_2 p_b = 1.7 - 1.9 = -0.2 \text{ bits}$$

Some motifs could have negative information!

A Different Measure
Relative entropy or Kullback-Leibler (KL) divergence
Divergence between a “true” distribution and another

$$D_{KL}(P_{motif} \,\|\, P_{background}) = \sum_{i \in \{A,T,G,C\}} P_{motif}(i) \log \frac{P_{motif}(i)}{P_{background}(i)}$$

Here P_motif is the “true” distribution and P_background is the other distribution
D_KL is larger the more different P_motif is from P_background
Same as Information if P_background is uniform
Properties
• D_KL ≥ 0
• D_KL = 0 if and only if P_motif = P_background
• D_KL(P || Q) ≠ D_KL(Q || P)

Comparing Both Methods
[Figure: two motif logos. One shows Information assuming uniform background DNA; the other shows KL Distance assuming 20% GC content (e.g. Plasmodium)]

Online Logo Generation
http://weblogo.berkeley.edu/
http://biodev.hgen.pitt.edu/cgibin/enologos/enologos.cgi

Finding New Motifs
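The properties listed for the KL divergence can be verified with a small Python sketch. The function name `kl_divergence_bits` and the example distributions are assumptions for illustration; the AT-rich background (A = T = 0.4, C = G = 0.1) is one way to realize the 20% GC content mentioned for Plasmodium, and base-2 logarithms are assumed.

```python
import math

def kl_divergence_bits(p, q):
    """D_KL(p || q) in bits for two distributions over the same bases."""
    return sum(p[b] * math.log2(p[b] / q[b]) for b in p if p[b] > 0)

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
at_rich = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}    # assumed 20% GC background
motif_col = {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}  # a column from the PFM

# With a uniform background, KL divergence equals 2 minus the column entropy.
entropy = -sum(p * math.log2(p) for p in motif_col.values())
print(round(kl_divergence_bits(motif_col, uniform), 3), round(2 - entropy, 3))

# Against the AT-rich background the same column is less surprising.
print(round(kl_divergence_bits(motif_col, at_rich), 3))

# KL divergence is not symmetric.
print(round(kl_divergence_bits(uniform, at_rich), 3),
      round(kl_divergence_bits(at_rich, uniform), 3))
```

Because the divergence is never negative, using it for stack heights avoids the negative-information problem that arises when the uniform-background formula is applied to a skewed genome.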
Learning Motif Models

A Promoter Model
[Figure: a promoter made of background DNA containing a length-K motif]
Motif model; columns are motif positions M1 ... MK, entries are P_k(S|M):
     M1                       MK
A    .1   .2   .1   .4   .1   .3
C    .2   .2   .2   .2   .5   .4
G    .4   .5   .4   .2   .2   .2
T    .3   .1   .2   .2   .2   .1
Background DNA, P(S|B): A: 0.25, T: 0.25, G: 0.25, C: 0.25
The same motif model in all promoters

Probability of a Sequence
Given a sequence(s), motif model and motif location
[Figure: a sequence of positions 1 to 100 with the motif ATATGC occupying positions 60 to 65]

$$P(\text{Seq} \mid M_{start} = 60, \text{Model}) = \prod_{i=1}^{59} P(S_i \mid B) \prod_{k=1}^{6} P_k(S_{k+59} \mid M) \prod_{i=66}^{100} P(S_i \mid B)$$

S_i = nucleotide at position i in the sequence
[Figure: the motif matrix M1 ... MK with rows A, C, G, T]

Parameterizing the Motif Model
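The probability of a sequence given a motif model and a motif location is a product of background terms outside the motif window and motif terms inside it. A minimal Python sketch, working in log space to avoid underflow on a 100-base sequence; the function name `log_prob_sequence`, the toy sequence, and the six-column motif (0.7 for each consensus base of ATATGC) are all assumptions for illustration.

```python
import math

def log_prob_sequence(seq, motif, start, background):
    """log P(seq | motif starts at 0-based index `start`, model)."""
    width = len(motif)
    total = 0.0
    for i, base in enumerate(seq):
        if start <= i < start + width:
            total += math.log(motif[i - start][base])   # motif term P_k(S|M)
        else:
            total += math.log(background[base])         # background term P(S|B)
    return total

background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def column(consensus):
    """Hypothetical motif column: 0.7 for the consensus base, 0.1 otherwise."""
    return {b: (0.7 if b == consensus else 0.1) for b in "ACGT"}

motif = [column(c) for c in "ATATGC"]

# 100 bases with ATATGC at 1-based positions 60 to 65 (0-based start 59).
seq = "A" * 59 + "ATATGC" + "A" * 35
print(round(log_prob_sequence(seq, motif, 59, background), 3))
print(round(log_prob_sequence(seq, motif, 0, background), 3))
```

The log probability is higher when the motif window is placed over the true occurrence than when it is placed elsewhere, which is the signal the later E-step exploits.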
Given multiple sequences and motif locations but no motif model
Motif instances (positions M1 to M6):
AATGCG
ATATGG
ATATCG
GATGCA
Count frequencies, add pseudocounts: for example, A at M1 has frequency 3/4 and G at M6 has frequency 3/4, etc.

Finding Known Motifs
Given multiple sequences and motif model but no motif locations
[Figure: a window sliding along the sequence, scored by P(Seq_window|Motif)]
Calculate P(Seq_window|Motif) for every starting location

Motif Position Distribution Zij
• the element Z_ij of the matrix Z represents the probability that the motif starts in position j in sequence i

Z =
         1     2     3     4
seq1    0.1   0.1   0.2   0.6
seq2    0.4   0.2   0.1   0.3
seq3    0.3   0.1   0.5   0.1
seq4    0.1   0.5   0.1   0.3

Some examples:
[Figure: example vectors Z1 to Z4, illustrating the cases: no clear winner, two candidates, one big winner, uniform]

Calculating the Z Vector
$$P(Z_{ij} = 1 \mid S, M) = \frac{P(S \mid Z_{ij} = 1, M)\, P(Z_{ij} = 1)}{P(S)} \quad \text{(Bayes' rule)}$$

$$P(Z_{ij} = 1 \mid S, M) = \frac{P(S \mid Z_{ij} = 1, M)\, P(Z_{ij} = 1)}{\sum_{k=1}^{L-K+1} P(S \mid Z_{ik} = 1, M)\, P(Z_{ik} = 1)}$$

Assume uniform priors (motif equally likely to start at any position), so the priors cancel:

$$P(Z_{ij} = 1 \mid S, M) = \frac{P(S \mid Z_{ij} = 1, M)}{\sum_{k=1}^{L-K+1} P(S \mid Z_{ik} = 1, M)}$$

Calculating the Z Vector: Example
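X_i = G C T G T A G

p =
      0      1     2     3
A    0.25   0.1   0.5   0.2
C    0.25   0.4   0.2   0.1
G    0.25   0.3   0.1   0.6
T    0.25   0.2   0.2   0.1
(column 0 is the background; columns 1 to 3 are the motif positions)

$$Z_{i1} = 0.3 \times 0.2 \times 0.1 \times 0.25 \times 0.25 \times 0.25 \times 0.25$$
$$Z_{i2} = 0.25 \times 0.4 \times 0.2 \times 0.6 \times 0.25 \times 0.25 \times 0.25$$

• then normalize so that
$$\sum_{j=1}^{L-W+1} Z_{ij} = 1$$

Discovering Motifs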
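The Z vector example (X_i = G C T G T A G with the three-position matrix p) can be reproduced with a short Python sketch. The function name `z_vector` is an assumption for illustration; the motif and background values are taken from the example's matrix p.

```python
def z_vector(seq, motif, background):
    """Normalized probability that the motif starts at each position of seq."""
    width = len(motif)
    weights = []
    for j in range(len(seq) - width + 1):
        w = 1.0
        for i, base in enumerate(seq):
            if j <= i < j + width:
                w *= motif[i - j][base]   # inside the motif window
            else:
                w *= background[base]     # outside the window
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]

background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
motif = [
    {"A": 0.1, "C": 0.4, "G": 0.3, "T": 0.2},  # column 1 of p
    {"A": 0.5, "C": 0.2, "G": 0.1, "T": 0.2},  # column 2 of p
    {"A": 0.2, "C": 0.1, "G": 0.6, "T": 0.1},  # column 3 of p
]

z = z_vector("GCTGTAG", motif, background)
print([round(v, 4) for v in z])
```

The unnormalized first two entries are exactly the products written out for Z_i1 and Z_i2; after normalization the five start positions share a total probability of 1.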
Given a set of co-regulated genes, we need to discover motifs with only sequences
We have neither a motif model nor motif locations
Need to discover both
How can we approach this problem? (Hint: start with a random motif model)

Expectation Maximization (EM)
Remember the basic idea!
1. Use model to estimate distribution of missing data
2. Use estimate to update model
3. Repeat until convergence
Model is the motif model
Missing data are the motif locations

EM for Motif Discovery
1. Start with random motif model
2. E Step: estimate probability of motif positions for each sequence
3. M Step: use estimate to update motif model
4. Iterate (to convergence)
[Figure: the motif matrix (rows A, C, G, T) before and after an update, etc.]

The M-Step: Calculating the Motif Matrix
• M_ck is the probability of character c at position k
• With specific motif positions, we can estimate M_ck:

$$M_{c,k} = \frac{n_{c,k} + d_{c,k}}{\sum_{b} (n_{b,k} + d_{b,k})}$$

where n_{c,k} is the count of c at position k over the motif positions and d_{c,k} are pseudocounts
• But with probabilities of positions, Z_ij, we average:

$$n_{c,k} = \sum_{\text{sequences } S_i} \; \sum_{\{j \,\mid\, S_{i,j+k-1} = c\}} Z_{ij}$$

MEME
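The E-step and M-step for the one-occurrence-per-sequence case can be combined into a compact EM loop. The following Python sketch is an illustration under stated assumptions, not MEME's implementation: the function names, the pseudocount of 0.5, the fixed iteration count, the uniform background, and the toy sequences are all hypothetical.

```python
import random

ALPHABET = "ACGT"

def e_step(seqs, motif, background):
    """Z[i][j]: probability that the motif starts at position j of sequence i."""
    width = len(motif)
    Z = []
    for seq in seqs:
        weights = []
        for j in range(len(seq) - width + 1):
            # Ratio of motif to background over the window; the background
            # terms outside the window are the same for every j and cancel.
            w = 1.0
            for k in range(width):
                base = seq[j + k]
                w *= motif[k][base] / background[base]
            weights.append(w)
        total = sum(weights)
        Z.append([w / total for w in weights])
    return Z

def m_step(seqs, Z, width, pseudocount=0.5):
    """Re-estimate M_ck from expected counts n_ck plus pseudocounts."""
    motif = []
    for k in range(width):
        counts = {b: pseudocount for b in ALPHABET}
        for seq, z in zip(seqs, Z):
            for j, z_ij in enumerate(z):
                counts[seq[j + k]] += z_ij
        total = sum(counts.values())
        motif.append({b: counts[b] / total for b in ALPHABET})
    return motif

def random_motif(width, rng):
    """Random starting model; each column is normalized to sum to 1."""
    motif = []
    for _ in range(width):
        raw = [rng.random() + 0.1 for _ in ALPHABET]
        motif.append({b: r / sum(raw) for b, r in zip(ALPHABET, raw)})
    return motif

def em(seqs, width, iterations=50, seed=0):
    rng = random.Random(seed)
    background = {b: 0.25 for b in ALPHABET}
    motif = random_motif(width, rng)
    for _ in range(iterations):
        Z = e_step(seqs, motif, background)
        motif = m_step(seqs, Z, width)
    return motif, e_step(seqs, motif, background)

# Toy sequences (hypothetical), each containing one copy of TTGACA.
seqs = ["ACGTTGACATGC", "GGTTGACACCTA", "CATTGACAGGAT", "TTGACAACGGTC"]
motif, Z = em(seqs, width=6)
for seq, z in zip(seqs, Z):
    print(seq, "most probable start:", z.index(max(z)))
```

The result depends on the random starting model, so a single run may settle on a local maximum rather than the planted motif; running from several seeds is the usual remedy.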
• MEME – implements EM for motif discovery in DNA and proteins
• MAST – search sequences for motifs given a model
http://meme.sdsc.edu/meme/

P(Seq|Model) Landscape
EM searches for parameters to increase P(seqs|parameters)
Useful to think of P(seqs|parameters) as a function of parameters
EM starts at an initial set of parameters
And then “climbs uphill” until it reaches a local maximum
[Figure: surface of P(Sequences|params1,params2) over Parameter1 and Parameter2]
Where EM starts can make a big difference

Search from Many Different Starts
To minimize the effects of local maxima, you should search multiple times from different starting points
MEME uses this idea:
• Start at many points
• Run for one iteration
• Choose starting point that got the “highest” and continue
[Figure: surface of P(Sequences|params1,params2) over Parameter1 and Parameter2]

The ZOOPS Model
• The approach as we’ve outlined it assumes that each sequence has exactly one motif occurrence; this is the OOPS model
• The ZOOPS model assumes zero or one occurrences per sequence

E-step in the ZOOPS Model
• We need to consider another alternative: the ith sequence doesn’t contain the motif
• We add another parameter (and its relative):
– λ: prior probability that any position in a sequence is the start of a motif
– γ = (L − W + 1)λ: prior probability of a sequence containing a motif

E-step in the ZOOPS Model
$$P(Z_{ij} = 1) = \frac{\Pr(S_i \mid Z_{ij} = 1, M)\,\lambda}{\Pr(S_i \mid Q_i = 0, M)(1 - \gamma) + \sum_{k=1}^{L-W+1} \Pr(S_i \mid Z_{ik} = 1, M)\,\lambda}$$

• here Q_i is a random variable that takes on 0 to indicate that the sequence doesn’t contain a motif occurrence:

$$Q_i = \sum_{j=1}^{L-W+1} Z_{i,j}$$

M-step in the ZOOPS Model
• update p same as before
• update λ, γ as follows (n sequences, m positions per sequence):

$$\lambda^{(t+1)} = \frac{\gamma^{(t+1)}}{L - W + 1} = \frac{1}{n(L - W + 1)} \sum_{i=1}^{n} \sum_{j=1}^{m} Z_{i,j}^{(t)}$$

• average of Z_{i,j}^{(t)} across all sequences, positions

The TCM Model
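The ZOOPS E-step and the update of γ can be sketched in Python as follows. This is a minimal illustration: the function names `zoops_e_step` and `zoops_update_gamma`, the two-column motif, and the toy sequence are assumptions. Numerator and denominator are both divided by the all-background probability Pr(S_i | Q_i = 0), which leaves the ratio unchanged.

```python
def zoops_e_step(seq, motif, background, gamma):
    """Z_ij for one sequence under ZOOPS; the entries sum to at most 1."""
    width = len(motif)
    m = len(seq) - width + 1       # number of possible start positions
    lam = gamma / m                # lambda = gamma / (L - W + 1)
    ratios = []
    for j in range(m):
        r = 1.0
        for k in range(width):
            base = seq[j + k]
            r *= motif[k][base] / background[base]
        ratios.append(r)
    denom = (1.0 - gamma) + lam * sum(ratios)
    return [lam * r / denom for r in ratios]

def zoops_update_gamma(Z):
    """gamma for the next iteration: mean over sequences of Q_i = sum_j Z_ij."""
    return sum(sum(z) for z in Z) / len(Z)

background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# Hypothetical two-column motif that strongly prefers "AC".
motif = [
    {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
    {"A": 0.01, "C": 0.97, "G": 0.01, "T": 0.01},
]

z_hit = zoops_e_step("TTACTT", motif, background, gamma=0.5)
z_miss = zoops_e_step("TTTTTT", motif, background, gamma=0.5)
print(round(sum(z_hit), 3), round(sum(z_miss), 3))
print(round(zoops_update_gamma([z_hit, z_miss]), 3))
```

The sum of a sequence's Z entries is the posterior probability that it contains the motif at all, so a sequence without a good match is effectively down-weighted in the M-step.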
• The TCM (two-component mixture) model assumes zero or more motif occurrences per sequence

Likelihood in the TCM Model
• the TCM model treats each length W subsequence independently
• to determine the likelihood of such a subsequence:

$$\Pr(S_{ij} \mid Z_{ij} = 1, M) = \prod_{k=j}^{j+W-1} M_{c_k,\,k-j+1} \quad \text{(assuming a motif starts there)}$$

$$\Pr(S_{ij} \mid Z_{ij} = 0, B) = \prod_{k=j}^{j+W-1} P(c_k \mid B) \quad \text{(assuming a motif doesn’t start there)}$$

E-step in the TCM Model
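$$Z_{ij} = \frac{\Pr(S_{i,j} \mid Z_{ij} = 1, M)\,\lambda}{\Pr(S_{i,j} \mid Z_{ij} = 0, B)(1 - \lambda) + \Pr(S_{i,j} \mid Z_{ij} = 1, M)\,\lambda}$$

In the denominator, the first term is the case where the subsequence isn’t a motif and the second term is the case where the subsequence is a motif
• M-step same as before

Gibbs Sampling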
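The TCM E-step treats every length-W window independently, so each Z_ij is a two-way posterior between "motif" and "background". A minimal Python sketch under stated assumptions: the function name `tcm_e_step`, the two-column motif, the value of λ, and the toy sequence are hypothetical.

```python
def tcm_e_step(seq, motif, background, lam):
    """Posterior probability that each length-W window is a motif occurrence."""
    width = len(motif)
    Z = []
    for j in range(len(seq) - width + 1):
        p_motif = 1.0
        p_background = 1.0
        for k in range(width):
            base = seq[j + k]
            p_motif *= motif[k][base]
            p_background *= background[base]
        Z.append(lam * p_motif / ((1.0 - lam) * p_background + lam * p_motif))
    return Z

background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# Hypothetical two-column motif that strongly prefers "AC".
motif = [
    {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
    {"A": 0.01, "C": 0.97, "G": 0.01, "T": 0.01},
]

# The sequence contains "AC" twice, at 0-based starts 0 and 4.
Z = tcm_e_step("ACTTAC", motif, background, lam=0.1)
print([round(z, 3) for z in Z])
```

Unlike OOPS and ZOOPS, the entries are not normalized across positions, so a sequence with two good matches can have both windows score highly.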
A stochastic version of EM that differs from deterministic EM in two key ways
1. At each iteration, we only update the motif position of a single sequence
2. We may update a motif position to a “suboptimal” new position

Gibbs Sampling
1. Start with random motif locations and calculate a motif model
2. Randomly select a sequence, remove its motif and recalculate a temporary model
3. With the temporary model, calculate the probability of the motif at each position on the sequence
4. Select a new position based on this distribution
5. Update model and iterate
[Figure: the “best” location versus the sampled new location; the motif matrix (rows A, C, G, T) before and after the update, etc.]

Gibbs Sampling and Climbing
Because Gibbs sampling does not always choose the best new location, it can move to another place not directly uphill
[Figure: surface of P(Sequences|params1,params2) over Parameter1 and Parameter2]
In theory, Gibbs sampling is less likely to get stuck at a local maximum

AlignACE
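The five Gibbs sampling steps can be sketched in Python as follows. This is an illustration under stated assumptions and not AlignACE's algorithm: the function names, the pseudocount of 0.5, the uniform background, the fixed iteration count, and the toy sequences are all hypothetical.

```python
import random

ALPHABET = "ACGT"

def build_model(seqs, starts, width, skip, pseudocount=0.5):
    """Motif model from the current motif locations, leaving out sequence `skip`."""
    motif = []
    for k in range(width):
        counts = {b: pseudocount for b in ALPHABET}
        for i, (seq, start) in enumerate(zip(seqs, starts)):
            if i != skip:
                counts[seq[start + k]] += 1
        total = sum(counts.values())
        motif.append({b: counts[b] / total for b in ALPHABET})
    return motif

def gibbs_sample(seqs, width, iterations=200, seed=0):
    rng = random.Random(seed)
    background = {b: 0.25 for b in ALPHABET}
    # Step 1: random motif locations.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        # Step 2: pick a sequence and build a temporary model without it.
        i = rng.randrange(len(seqs))
        motif = build_model(seqs, starts, width, skip=i)
        # Step 3: weight of the motif at each position of the held-out sequence.
        seq = seqs[i]
        weights = []
        for j in range(len(seq) - width + 1):
            w = 1.0
            for k in range(width):
                base = seq[j + k]
                w *= motif[k][base] / background[base]
            weights.append(w)
        # Step 4: sample (rather than maximize) the new position.
        starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
        # Step 5: the model is rebuilt from the updated starts on the next pass.
    return starts

# Toy sequences (hypothetical), each containing one copy of TTGACA.
seqs = ["ACGTTGACATGC", "GGTTGACACCTA", "CATTGACAGGAT", "TTGACAACGGTC"]
print(gibbs_sample(seqs, width=6))
```

Sampling the new position in proportion to its weight, instead of always taking the maximum, is what lets the search occasionally step "downhill" and leave a poor local maximum.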
• Implements Gibbs sampling for motif discovery
– Several enhancements
• ScanAce – look for motifs in a sequence given a model
• CompareAce – calculate “similarity” between two motifs (i.e. for clustering motifs)
http://atlas.med.harvard.edu/cgibin/alignace.pl

Antigen Epitope Prediction

Antigens and Epitopes
• Antigens are molecules that induce the immune system to produce antibodies
• Antibodies recognize parts of molecules called epitopes

Genome to “Immunome”
Pathogen genome sequences define all proteins that could elicit an immune response
• Looking for a needle…
– Only a small number of epitopes are typically antigenic
• …in a very big haystack
– Vaccinia virus (258 ORFs): 175,716 potential epitopes (8-, 9-, and 10-mers)
– M. tuberculosis (~4K genes): 433,206 potential epitopes
– A. nidulans (~9K genes): 1,579,000 potential epitopes
Can computational approaches predict all antigenic epitopes from a genome?

Modeling MHC Epitopes
• Have a set of peptides that have been associated with a particular MHC allele
• Want to discover motif within the peptide bound by MHC allele
• Use motif to predict other potential epitopes

Motifs Bound by MHCs
• MHC 1
– Closed ends of groove
– Peptides 8-10 AAs in length
– Motif is the peptide
• MHC 2
– Groove has open ends
– Peptides have broad length distribution: 10-30 AAs
– Need to find binding motif within peptides ...
This note was uploaded on 09/24/2010 for the course EECS 6.047 / 6. taught by Professor Manolis Kellis during the Fall '08 term at MIT.