MIT OpenCourseWare
http://ocw.mit.edu
6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution, Fall 2008
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms

Conditional Random Fields for Computational Gene Prediction
Lecture 18, November 4, 2008

Genome Annotation
[Figure: a raw genomic DNA sequence, annotated with the flow Genome sequence → Transcription → RNA → Translation → Protein. Figure by MIT OpenCourseWare.]

Eukaryotic Gene Structure
[Figure: eukaryotic gene structure for a complete mRNA, showing the coding segment from the ATG start codon to the TGA stop codon, with exons separated by introns; each intron begins at a GT donor site and ends at an AG acceptor site. Courtesy of William Majoros. Used with permission. http://geneprediction.org/book/classroom.html]

Gene Prediction with HMM
[Figure: an HMM for gene prediction with hidden states Y1, Y2, ..., Yn (Start, Exon, 5' Splice, Intron, 3' Splice) emitting the observed sequence X1, X2, ..., Xn (e.g. ATGCCCCAGTTTTTGT). Figure by MIT OpenCourseWare.]

The HMM is a model of the joint distribution P(Y,X) = P(Labels, Seq).
For gene prediction, we are given X…
How do we select a Y efficiently?

Limitations of HMM Approach (1)
All components of HMMs have strict probabilistic semantics: transition probabilities such as P(Yi = exon | Yi-1 = intron) and emission probabilities such as P(Xi = G | Yi = exon), each of which must sum to 1.
But what is P(HMMer hit | exon)? P(BLAST hit | exon)? And what about incorporating both BLAST and HMMer?

Dependent Evidence
• HMMer protein domain predictions come from models based on known protein sequences
  – Protein sequences for the same domain are aligned
  – Conservation modeled with an HMM
• But these are the same proteins searched by BLAST
• If we see an HMMer hit, we are already more likely to get a BLAST hit, and vice versa

BLAST and HMMER do not provide independent evidence
⇒ Dependence is the rule for most evidence

Dependent Evidence in HMMs
• HMMs explicitly model P(Xi|Yi) = P(Blast, HMMer | Yi)
  – Not enough to know P(HMMer | Yi); we also need to know P(HMMer | Yi, Blast)
  – Need to model these dependencies in the input data
• Every time we add new evidence (e.g. ESTs) we need to know about its dependence on previous evidence
  – E.g. not just P(EST | Yi) but P(EST | Yi, Blast, HMMer)
• Unpleasant and unnecessary for our task: classification
• A common strategy is to simply assume independence (the Naive Bayes assumption):

  P(X1, X2, X3, ..., XN | Yi) = ∏i P(Xi | Yi)

• Almost always a false assumption

Independencies in X
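To see why a false independence assumption hurts, here is a toy numerical sketch (all probabilities here are hypothetical, not from the lecture): duplicate one evidence source, treat the two copies as independent under Naive Bayes, and the posterior becomes overconfident.

```python
# Sketch (hypothetical numbers) of how the Naive Bayes assumption misleads
# with dependent evidence: a single BLAST hit is reported twice (say, once
# directly and once via a perfectly correlated HMMer hit), and the naive
# product double-counts it.

p_exon = 0.5
p_hit_given_exon = 0.8      # P(hit | exon), a made-up value
p_hit_given_intron = 0.2    # P(hit | intron), a made-up value

# Correct posterior using the single real evidence source:
num = p_hit_given_exon * p_exon
den = num + p_hit_given_intron * (1 - p_exon)
correct = num / den          # 0.8

# Naive posterior treating the duplicated report as independent evidence:
num = p_hit_given_exon**2 * p_exon
den = num + p_hit_given_intron**2 * (1 - p_exon)
naive = num / den            # ~0.941: overconfident

print(round(correct, 3), round(naive, 3))
```

The naive model is not just wrong in the abstract: its posterior (≈0.94) is sharper than the evidence actually warrants (0.8), which is exactly the failure mode for correlated BLAST/HMMer hits.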
HMMs make assumptions about dependencies in X:

  P(Xi | Yi, Yi-1, Yi-2, Yi-3, ..., Y1) = P(Xi | Yi)

Effectively, each Yi "looks" at only a contiguous subset of X given the previous Yi-1.

Limitations Stem from Generative Modeling
HMMs are models of the full joint probability distribution P(X,Y).

[Figure: the same HMM state diagram (Start, Exon, 5' Splice, Intron, 3' Splice) with states Y1...Yn emitting X1...Xn. Figure by MIT OpenCourseWare.]

P(X,Y) = P(Y|X) P(X)

But P(Y|X) alone is all we need for gene prediction!

Generative Modeling of P(X)
• HMMs expend unnecessary effort to model P(X), which is never needed for gene prediction
  – Must model dependencies in X
• During learning, we might trade accuracy in modeling P(Y|X) in order to model P(X) accurately
  – Less accurate gene prediction

Discriminative Models

[Figure: the same state diagram, now modeling the conditional distribution P(Y|X) directly, with the input X comprising the sequence (ATGCCCCAGTTTTTGT) plus BLAST hits, ESTs, etc.]

Model the conditional distribution P(Y|X) directly.
Discriminative models outperform generative models in several natural language processing tasks.

Discriminative Model
Desirable characteristics:
1. Efficient learning & inference algorithms
2. Easily incorporate diverse evidence
3. Build on the best existing HMM models for gene calling

Linear Chain CRF
Hidden state labels (exon, intron, etc.): Y1, Y2, ..., YN
Input data (sequence, BLAST hits, ESTs, etc.): X

P(Y|X) = (1/Z(X)) exp{ Σj=1..J λj Σi fj(Yi, Yi-1, X) }

where the λj are feature weights, the fj are feature functions, and Z(X) is the normalization:

Z(X) = ΣY exp{ Σj=1..J λj Σi fj(Yi, Yi-1, X) }

The Basic Idea
P(Y|X) = (1/Z(X)) exp{ Σj λj Σi fj(Yi, Yi-1, X) }

• Feature functions, fj, return real values on pairs of labels and input data that we think are important for determining P(Y|X)
  – e.g. if the last state (yi-1) was intron and we have a BLAST hit (x), we have a different probability for whether we are in an exon (yi) now
• We may not know how this probability has changed, or its dependence on other evidence
• We learn this by selecting weights, λj, to maximize the likelihood of training data
• Z(X) is a normalization constant that ensures that P(Y|X) sums to one over all possible Ys

Using CRFs
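The formula above can be made concrete with a tiny brute-force implementation. Everything here is a toy sketch: the two labels, the "blast,exon" and "same-label" features, and the weights are hypothetical choices for illustration, and Z(X) is computed by enumerating all labelings (exponential in sequence length, so only viable for short toy inputs).

```python
import itertools, math

# A minimal linear-chain CRF sketch (toy example, not the lecture's actual
# model): labels are 'exon'/'intron', X is a list of per-position evidence
# strings, and there are two hypothetical features:
#   "blast,exon"  – fires when a BLAST hit coincides with an exon label
#   "same-label"  – fires when consecutive labels agree (rewards label runs)

LABELS = ["exon", "intron"]

def features(y_i, y_prev, x, i):
    """Return the names of features active at position i."""
    feats = []
    if x[i] == "BLAST" and y_i == "exon":
        feats.append("blast,exon")
    if y_prev is not None and y_prev == y_i:
        feats.append("same-label")
    return feats

def score(y, x, weights):
    """Unnormalized log-score: sum over positions of lambda_j * f_j."""
    s = 0.0
    for i, y_i in enumerate(y):
        y_prev = y[i - 1] if i > 0 else None
        for name in features(y_i, y_prev, x, i):
            s += weights.get(name, 0.0)
    return s

def prob(y, x, weights):
    """P(Y|X), with Z(X) found by brute-force enumeration of all labelings."""
    z = sum(math.exp(score(list(yp), x, weights))
            for yp in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(y, x, weights)) / z

x = ["BLAST", "-", "BLAST"]
w = {"blast,exon": 1.5, "same-label": 0.5}
total = sum(prob(list(yp), x, w)
            for yp in itertools.product(LABELS, repeat=len(x)))
print(round(total, 6))  # probabilities over all labelings sum to 1
```

With these weights, the all-exon labeling gets a higher probability than the all-intron one, since both BLAST positions feed the "blast,exon" feature. The forward algorithm introduced later in the lecture replaces this brute-force Z(X).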
P(Y|X) = (1/Z(X)) exp{ Σj λj Σi fj(Yi, Yi-1, X) }

Design: 1. Select feature functions on label pairs {Yi, Yi-1} and X.
Inference: 2. Given weights and feature functions, find the most probable labeling Y given an input X.
Learning: 3. Use a training set of data to select the weights, λ.

What Are Feature Functions?
Core issue in CRFs – selecting feature functions.

1. Features are arbitrary functions that return a real value for some pair of labels {Yi, Yi-1} and the input, X
   • Indicator function – 1 for certain {Yi, Yi-1, X}, 0 otherwise
   • Sum, product, etc. over labels and data
   • Could return some probability over {Yi, Yi-1, X} – but this is not required
2. We want to select feature functions that capture constraints or conjunctions of label pairs {Yi, Yi-1} and the input X that we think are important for P(Y|X)
3. Features determine characteristics of the training data that must hold in our CRF model

Example Feature Function
A BLAST hit at position i impacts the probability that Yi = exon. To capture this, we can define an indicator function:

f_blast,exon(Yi, Yi-1, X) = 1 if Yi = exon and Xi = BLAST hit; 0 otherwise

[Figure: a labeling Y1...YN (intron, intron, exon, ...) over an input X containing BLAST and EST hits; f_blast,exon evaluates to 1 exactly at positions where an exon label coincides with a BLAST hit.]

P(Y|X) = (1/Z(X)) exp{ λblast,exon Σi f_blast,exon(Yi, Yi-1, X) }
       = (1/Z(X)) exp{ λblast,exon (0 + 0 + 0 + 1 + 0 + 1 + 0) }
       = (1/Z(X)) exp{ 2 λblast,exon }

Adding Evidence
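Indicator features like f_blast,exon are straightforward to code. In this sketch the per-position evidence is represented as a set of hit types (a hypothetical encoding, not the lecture's), and we simply count how often each feature fires along a fixed labeling – the Σi term in the formula above.

```python
# Sketch of the slide's indicator features (the evidence encoding is
# hypothetical): each returns 1 when its hit type co-occurs with an
# exon label at position i, else 0.

def f_blast_exon(y_i, y_prev, x_i):
    return 1 if (y_i == "exon" and "BLAST" in x_i) else 0

def f_hmmer_exon(y_i, y_prev, x_i):
    return 1 if (y_i == "exon" and "HMMer" in x_i) else 0

# Per-position evidence: the set of hit types observed at each position.
X = [set(), {"BLAST"}, {"BLAST", "HMMer"}, {"HMMer"}, set()]
Y = ["intron", "exon", "exon", "exon", "intron"]

blast_count = sum(f_blast_exon(Y[i], Y[i - 1] if i else None, X[i])
                  for i in range(len(Y)))
hmmer_count = sum(f_hmmer_exon(Y[i], Y[i - 1] if i else None, X[i])
                  for i in range(len(Y)))
print(blast_count, hmmer_count)  # 2 2
```

Each count would then be multiplied by its own weight (λblast,exon, λHMMer,exon) inside the exponential – the two features can overlap freely, which is the point of the next slide.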
A BLAST hit at position i impacts the probability that Yi = exon. To capture this, we defined the indicator function:

f_blast,exon(Yi, Yi-1, X) = 1 if Yi = exon and Xi = BLAST hit; 0 otherwise

A protein domain predicted by the tool HMMer at position i also impacts the probability that Yi = exon:

f_HMMer,exon(Yi, Yi-1, X) = 1 if Yi = exon and Xi = HMMer hit; 0 otherwise

But recall that these two pieces of evidence are not independent.

Dependent Evidence in CRFs
There is no requirement that evidence represented by feature functions be independent.
• Why? CRFs do not model P(X)!
• All that matters is whether evidence constrains P(Y|X)
• The weights determine the extent to which each set of evidence contributes and interacts

A Strategy for Selecting Features
• Typical applications use thousands or millions of arbitrary indicator feature functions – a brute-force approach
• But we know gene prediction HMMs encode useful information

Strategy:
1. Start with feature functions derived from the best HMM-based gene prediction algorithms
2. Use arbitrary feature functions to capture evidence that is hard to model probabilistically

Alternative Formulation of HMM
[Figure: HMM chain Y1, Y2, ..., YN over emissions X1, X2, ..., XN.]

The HMM probability factors over pairs of nodes:

P(yi | yi-1) × P(xi | yi) = P(yi, xi | yi-1)

∏i=1..N P(yi, xi | yi-1) = P(Y,X)

Alternative Formulation of HMM
We can define a function, f, over each of these pairs:

log{ P(yi | yi-1) × P(xi | yi) } ≡ f(yi, yi-1, xi)

then

P(Y,X) = ∏i=1..N P(yi | yi-1) P(xi | yi) = ∏i=1..N exp{ f(yi, yi-1, xi) } = exp{ Σi=1..N f(yi, yi-1, xi) }

Conditional Probability from HMM
P(Y|X) = P(Y,X) / P(X) = P(Y,X) / ΣY P(Y,X) = exp{ Σi=1..N f(yi, yi-1, xi) } / ΣY exp{ Σi=1..N f(yi, yi-1, xi) }

P(Y|X) = (1/Z(X)) exp{ Σi=1..N 1 · f(yi, yi-1, xi) }

where Z(X) = ΣY exp{ Σi=1..N 1 · f(yi, yi-1, xi) }

This is the formula for a linear chain CRF with all λ = 1.

Implementing HMMs as CRFs
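The derivation above says a CRF with f(yi, yi-1, xi) = log P(yi | yi-1) + log P(xi | yi) and λ = 1 reproduces the HMM's conditional distribution. A brute-force numerical check on a toy two-state HMM (all parameters hypothetical; the initial distribution is folded into the "transition" into y1):

```python
import itertools, math

# Toy 2-state HMM (states E/I, emissions A/C) re-expressed as a linear-chain
# CRF with the single feature f = log P(y_i|y_prev) + log P(x_i|y_i), lambda=1.

STATES = ["E", "I"]
init  = {"E": 0.5, "I": 0.5}
trans = {("E", "E"): 0.9, ("E", "I"): 0.1, ("I", "E"): 0.2, ("I", "I"): 0.8}
emit  = {("E", "A"): 0.7, ("E", "C"): 0.3, ("I", "A"): 0.4, ("I", "C"): 0.6}

def f_hmm(y_i, y_prev, x_i):
    p_trans = init[y_i] if y_prev is None else trans[(y_prev, y_i)]
    return math.log(p_trans) + math.log(emit[(y_i, x_i)])

def crf_prob(y, x):
    """P(Y|X) from the CRF formula, Z(X) by enumeration."""
    def s(yp):
        return sum(f_hmm(yp[i], yp[i - 1] if i else None, x[i])
                   for i in range(len(x)))
    z = sum(math.exp(s(yp)) for yp in itertools.product(STATES, repeat=len(x)))
    return math.exp(s(tuple(y))) / z

def hmm_cond_prob(y, x):
    """P(Y|X) = P(Y,X)/P(X) computed directly from the HMM."""
    def joint(yp):
        p = 1.0
        for i in range(len(x)):
            p_trans = init[yp[i]] if i == 0 else trans[(yp[i - 1], yp[i])]
            p *= p_trans * emit[(yp[i], x[i])]
        return p
    px = sum(joint(yp) for yp in itertools.product(STATES, repeat=len(x)))
    return joint(tuple(y)) / px

x = ["A", "C", "A"]
y = ["E", "E", "I"]
print(abs(crf_prob(y, x) - hmm_cond_prob(y, x)) < 1e-12)  # True
```

The two routes agree exactly, because both reduce to the same ratio of exp-scores – which is the content of the derivation.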
We can implement an HMM as a CRF by choosing

f_HMM(yi, yi-1, xi) = log{ P(yi | yi-1) × P(xi | yi) },  λ_HMM = 1

Or, more commonly:

f_HMM_Transition(yi, yi-1, xi) = log{ P(yi | yi-1) }
f_HMM_Emission(yi, yi-1, xi) = log{ P(xi | yi) }
λ_HMM_Transition = λ_HMM_Emission = 1

Either formulation creates a CRF that models the same conditional probability P(Y|X) as the original HMM.

Adding New Evidence
• Additional features are added with arbitrary feature functions (e.g. f_blast,exon)
• When features are added, learning of the weights empirically determines the impact of new features relative to existing features (e.g. the relative value of λ_HMM vs. λ_blast,exon)

CRFs provide a framework for incorporating diverse evidence into the best existing models for gene prediction.

Conditional Independence of Y
[Figure: two graphs over Y1...Y4 and X1...X4 – a directed Bayesian network (the HMM) and an undirected Markov random field conditioned on X (the CRF).]

Directed graph semantics (Bayesian networks) – factorization:
P(X,Y) = ∏ over all nodes v of P(v | parents(v)) = ∏i P(Yi | Yi-1) P(Xi | Yi)

Potential functions over cliques, conditioned on X (Markov random field) – factorization:
P(Y|X) = ∏ over all nodes v of U(clique(v), X) = ∏i P(Yi | Yi-1, X)

Both cases: Yi is conditionally independent of all other Y given Yi-1.

Conditional-Generative Pairs
HMMs and linear chain CRFs explore the same family of conditional distributions P(Y|X)*

Can convert HMM to CRF:
• Training an HMM to maximize P(Y|X) yields the same decision boundary as a CRF
Can convert CRF to HMM:
• Training a CRF to maximize P(Y,X) yields the same classification boundary as an HMM

HMMs and CRFs form a generative-discriminative pair
Sutton, McCallum (CRF tutorial); Ng, Jordan (2002)
* Assuming P of the form exp(U(Y))/Z – exponential family

Conditional-Generative Pairs

Figure 1.2 from Sutton, C., and A. McCallum, "An Introduction to Conditional Random Fields for Relational Learning." In Getoor, Lise, and Ben Taskar, editors, Introduction to Statistical Relational Learning. Cambridge, MA: MIT Press, 2007. ISBN: 9780262072885. Courtesy of MIT Press. Used with permission.

Practical Benefit of Factorization
• Factorization allows us to take a very large probability distribution and model it using much smaller distributions over "local" sets of variables
• Example: a CRF with N states and 5 labels (ignoring X for now):
  – Full distribution P(Y1, Y2, Y3, ..., YN): 5^N parameters
  – Factored Pi(Yi, Yi-1): 5 × 5 × N parameters (5 × 5 if Pi = Pi-1 for all i)

Using CRFs
P(Y|X) = (1/Z(X)) exp{ Σj λj Σi fj(Yi, Yi-1, X) }

Design: 1. Select feature functions on label pairs {Yi, Yi-1} and X.
Inference: 2. Given weights and feature functions, find the most probable labeling Y given an input X.
Learning: 3. Use a training set of data to select the weights, λ.

Labeling A Sequence
Given sequence & evidence X, we wish to select a labeling, Y, that is in some sense "best" given our model.

As with HMMs, one sensible choice is the most probable labeling given the data and model:

argmaxY P(Y|X) = argmaxY [ (1/Z(X)) exp{ Σj λj Σi fj(Yi, Yi-1, X) } ]

But of course, we don't want to score every possible Y. This is where the chain structure of the linear chain CRF comes in handy…

Why? Dynamic Programming
[Figure: the Viterbi dynamic programming table – K labels (rows) by nucleotide positions 1 .. i-1, i (columns).]

Vk(i-1) = probability of the most likely path through position i-1 ending in label k, given X.

The score for moving from Yi-1 = l to Yi = k is derived from the feature functions:

exp{ Σj=1..J λj fj(k, l, X) }

Vk(i) = maxl [ Vl(i-1) × exp{ Σj=1..J λj fj(k, l, X) } ]

By Analogy With HMM
⎝ By Analogy With HMM
Recall from HMM lectures Vk ( i ) =e k (x i ) ∗ max ( Vk ( i ) × a jk ) = max ( Vk ( i ) × a jk × e k (x i ) )
j j = max ( Vk ( i ) × P(Yi =jYi1 =k)P(x i Yi =k) )
j = max ( Vk ( i ) × Ψ HMM (Yi =j,Yi1 =k,X) )
j Where we have defined Ψ HMM (Yi ,Yi1 ,X) = P(Yi Yi1 )P(X i Yi ) Recall From Previous Slides
log{ P(yi | yi-1) × P(xi | yi) } = f(yi, yi-1, xi),  λ_HMM = 1

P(Y|X) = (1/Z(X)) exp{ Σi=1..N λ_HMM f(yi, yi-1, xi) }   – a linear chain CRF

Ψ_HMM(Yi, Yi-1, X) = P(Yi | Yi-1) P(Xi | Yi) = exp{ λ_HMM f(yi, yi-1, xi) }

Combined HMM and CRF Inference
We can define the same quantity for a generic CRF:

Ψ_CRF(Yi, Yi-1, X) = exp{ Σk λk fk(yi, yi-1, xi) }

We can rewrite all the HMM equations in terms of Ψ_HMM. If we then plug Ψ_CRF in for Ψ_HMM, they work analogously:

Viterbi:   Vk(i) = maxl [ Vl(i-1) × Ψ(k, l, X) ]
Forward:   αk(i) = Σl Ψ(k, l, X) × αl(i-1)
Backward:  βk(i) = Σl Ψ(l, k, X) × βl(i+1)

Using CRFs
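The forward recursion gives an efficient route to the normalizer: summing αk over labels at the last position yields Z(X). A sketch with a brute-force cross-check (toy features and weights, hypothetical names):

```python
import itertools, math

# Computing Z(X) via the forward recursion
#   alpha_k(i) = sum_l Psi(k, l, X) * alpha_l(i-1)
# and checking it against brute-force enumeration over all labelings.

LABELS = ["exon", "intron"]

def psi(y_i, y_prev, x, i, w):
    """exp{sum_j lambda_j f_j(y_i, y_prev, X)} at position i (toy features)."""
    s = 0.0
    if x[i] == "BLAST" and y_i == "exon":
        s += w["blast,exon"]
    if y_prev == y_i:
        s += w["same-label"]
    return math.exp(s)

def z_forward(x, w):
    """Z(X) in O(N * K^2) via the forward algorithm."""
    alpha = {k: psi(k, None, x, 0, w) for k in LABELS}
    for i in range(1, len(x)):
        alpha = {k: sum(psi(k, l, x, i, w) * alpha[l] for l in LABELS)
                 for k in LABELS}
    return sum(alpha.values())

def z_brute(x, w):
    """Z(X) by summing the product of Psi over every labeling (exponential)."""
    return sum(
        math.prod(psi(y[i], y[i - 1] if i else None, x, i, w)
                  for i in range(len(x)))
        for y in itertools.product(LABELS, repeat=len(x)))

x = ["BLAST", "-", "BLAST"]
w = {"blast,exon": 1.0, "same-label": 0.3}
print(abs(z_forward(x, w) - z_brute(x, w)) < 1e-9)  # True
```

The forward pass costs O(N·K²) instead of O(K^N), which is what makes CRF training (where Z(X) is recomputed many times) feasible at all.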
P(Y|X) = (1/Z(X)) exp{ Σj λj Σi fj(Yi, Yi-1, X) }

Design: 1. Select feature functions on label pairs {Yi, Yi-1} and X.
Inference: 2. Given weights and feature functions, find the most probable labeling Y given an input X.
Learning: 3. Use a training set of data to select the weights, λ.

Maximum Likelihood Learning
• We assume an i.i.d. training set {(x(k), y(k))} of K labeled sequences of length N
  – A set of manually curated gene sequences for which all nucleotides are labeled
• We then select weights, λ, that maximize the log-likelihood, L(λ), of the data:

L(λ) = Σk=1..K Σj=1..J λj Σi=1..N fj(Yi(k), Yi-1(k), X(k)) − Σk=1..K log Z(X(k))

Good news: L(λ) is concave ⇒ guaranteed global maximum.

Maximum Likelihood Learning
• Maximum where ∂L(λ)/∂λ = 0
• From the homework, at the maximum we know:

Σk Σi=1..N fj(Yi(k), Yi-1(k), X(k)) = Σk Σi=1..N ΣY' fj(Y'i, Y'i-1, X(k)) P_model(Y' | X(k))

(left side: count in the data; right side: expected count under the model)

Features determine characteristics of the training data that must hold in our CRF model.
Maximum entropy solution – no assumptions in the CRF distribution other than the feature constraints.

Gradient Search
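The moment-matching condition above can be watched in action with plain gradient ascent on a one-feature toy problem (the feature, training data, and learning rate are all hypothetical); brute-force enumeration over Y' supplies the model expectation:

```python
import itertools, math

# dL/d(lambda_j) = (empirical count of f_j) - (expected count under model).
# One toy feature and one tiny training sequence; the expectation is
# computed by enumerating all labelings.

LABELS = ["exon", "intron"]

def f_blast_exon(y_i, x_i):
    return 1 if (y_i == "exon" and x_i == "BLAST") else 0

def score(y, x, lam):
    return sum(lam * f_blast_exon(y[i], x[i]) for i in range(len(x)))

def grad(x, y_true, lam):
    labelings = list(itertools.product(LABELS, repeat=len(x)))
    z = sum(math.exp(score(y, x, lam)) for y in labelings)
    empirical = sum(f_blast_exon(y_true[i], x[i]) for i in range(len(x)))
    expected = sum(
        (math.exp(score(y, x, lam)) / z) *
        sum(f_blast_exon(y[i], x[i]) for i in range(len(x)))
        for y in labelings)
    return empirical - expected

# Training data: three BLAST-hit positions, two of which are truly exonic.
x = ["BLAST", "BLAST", "BLAST"]
y_true = ["exon", "exon", "intron"]

lam = 0.0
for _ in range(200):
    lam += 0.5 * grad(x, y_true, lam)   # plain gradient ascent

print(round(lam, 3))  # converges to log(2) ~ 0.693, where counts match
```

At convergence, the expected count 3·e^λ/(e^λ+1) equals the empirical count 2, i.e. λ = log 2: the learned weight makes the model reproduce exactly the feature statistics of the training data, and nothing more – the maximum entropy property. Real implementations replace both the brute-force expectation (with forward/backward) and the fixed-step ascent (with quasi-Newton methods), as the outline below describes.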
Bad news: no closed-form solution – we need a gradient method, with efficient calculation of ∂L(λ)/∂λ and Z(X).

Outline:
1. Define forward/backward variables akin to HMMs
2. Calculate Z(X) using forward/backward
3. Calculate ∂L(λ)/∂λj using Z(X) and forward/backward
4. Update each parameter with gradient search (quasi-Newton)
5. Continue until convergence to the global maximum

Very slow – many iterations of forward/backward.

Using CRFs
P(Y|X) = (1/Z(X)) exp{ Σj λj Σi fj(Yi, Yi-1, X) }

Design: 1. Select feature functions on label pairs {Yi, Yi-1} and X.
Inference: 2. Given weights and feature functions, find the most probable labeling Y given an input X.
Learning: 3. Use a training set of data to select the weights, λ.

CRF Applications to Gene Prediction
CRFs actually work for gene prediction:
• Culotta, Kulp, McCallum (2005)
• CRAIG (Bernal et al., 2007)
• Conrad (DeCaprio et al. (2007), Vinson et al. (2006))
• PhyloCRF (Majoros, http://geneprediction.org/book/)

Conrad
• A Semi-Markov CRF
  – Explicit state duration models
  – Features over intervals
• Incorporates diverse information
  – GHMM features
  – BLAST
  – ESTs
• Comparative data
  – Genome alignments with a model of evolution
  – Alignment gaps
• Alternative objective function
  – Maximizes expected accuracy instead of likelihood during learning
Course taught by Professor Manolis Kellis, Fall 2008, MIT.