Probabilistic Graphical Models, Spring 2007
Homework 3 solutions

1 Score Equivalence

Solution due to Steve Gardiner

Proposition 1. The Bayesian score with a K2 prior is not score equivalent.

Proof. Consider the problem of learning a Bayesian network structure over two binary random variables $X$ and $Y$ with the following empirical joint distribution and sample size $M = 4$:

\[
P(X=0, Y=0) = \tfrac{1}{4}; \quad
P(X=0, Y=1) = \tfrac{1}{2}; \quad
P(X=1, Y=0) = \tfrac{1}{4}; \quad
P(X=1, Y=1) = 0
\]

The Bayesian score with K2 prior (drawing each distribution from a Dirichlet(1, 1) prior) is as follows:

\[
\mathrm{score}_B(G : D) = \log P(G) + \log P(D \mid G)
= \log P(G) + \log \int_{\theta_G} P(D \mid \theta_G, G) \, P(\theta_G \mid G) \, d\theta_G
\]

If we assume $G_{X \to Y}$ and $G_{Y \to X}$ are equally likely, then the first term is a constant $q = \log P(G)$. I'll use the notation $\alpha = \alpha_0 + \alpha_1 = 1 + 1 = 2$ for the hyperparameters, and count variables of the form $m_{0*} = \sum_{i=1}^{M} I(X = 0, Y \in \{0, 1\})$. The graph $G_{X \to Y}$ has score

\[
\begin{aligned}
\mathrm{score}_B(G_{X \to Y} : D)
&= q + \log \int_{\theta_{G_{X \to Y}}} P(D \mid \theta_{G_{X \to Y}}, G_{X \to Y}) \, P(\theta_{G_{X \to Y}} \mid G_{X \to Y}) \, d\theta_{G_{X \to Y}} \\
&= q + \log \Big[ \frac{\Gamma(\alpha)}{\Gamma(\alpha + m_{**})} \cdot \frac{\Gamma(\alpha_0 + m_{0*})}{\Gamma(\alpha_0)} \cdot \frac{\Gamma(\alpha_1 + m_{1*})}{\Gamma(\alpha_1)} \cdot \qquad \text{(KF eqn 16.9)} \\
&\qquad\qquad \frac{\Gamma(\alpha)}{\Gamma(\alpha + m_{0*})} \cdot \frac{\Gamma(\alpha_0 + m_{00})}{\Gamma(\alpha_0)} \cdot \frac{\Gamma(\alpha_1 + m_{01})}{\Gamma(\alpha_1)} \cdot \\
&\qquad\qquad \frac{\Gamma(\alpha)}{\Gamma(\alpha + m_{1*})} \cdot \frac{\Gamma(\alpha_0 + m_{10})}{\Gamma(\alpha_0)} \cdot \frac{\Gamma(\alpha_1 + m_{11})}{\Gamma(\alpha_1)} \Big]
\end{aligned}
\]

The score of $G_{Y \to X}$ can be derived analogously. With data exhibiting the above empirical distribution (counts $m_{00} = 1$, $m_{01} = 2$, $m_{10} = 1$, $m_{11} = 0$) and natural logarithms, we get

\[
\mathrm{score}_B(G_{X \to Y} : D) = q - 6.1738 > q - 6.2916 = \mathrm{score}_B(G_{Y \to X} : D)
\]

The two graphs are I-equivalent but receive different scores, so the score is not score equivalent.

2 Scoring functions

Solution due to Steve Gardiner

Proposition 2 (2.1). The optimal network structure according to the BIC scoring function is not necessarily the same as the optimal network structure according to the ML scoring function.

Proof.
Consider data $D_1$ exhibiting the following empirical joint distribution with sample size $M = 8$:

\[
P(X=0, Y=0) = \tfrac{1}{4}; \quad
P(X=0, Y=1) = \tfrac{1}{4}; \quad
P(X=1, Y=0) = \tfrac{1}{8}; \quad
P(X=1, Y=1) = \tfrac{3}{8}
\tag{1}
\]

The BIC score differs from the ML score by the term $-\frac{\log M}{2} \mathrm{Dim}[G]$. By counting the independent parameters in the graphs, we see that $\mathrm{Dim}[G_\emptyset] = 2$ and $\mathrm{Dim}[G_{X \to Y}] = 3$, so we have

\[
\begin{aligned}
\mathrm{score}_{BIC}(G_\emptyset : D_1) - \mathrm{score}_{BIC}(G_{X \to Y} : D_1)
&= \big[ \ell(\hat{\theta}_{G_\emptyset} : D_1) - \log M \big] - \big[ \ell(\hat{\theta}_{G_{X \to Y}} : D_1) - \tfrac{3}{2} \log M \big] \\
&= \big[ \ell(\hat{\theta}_{G_\emptyset} : D_1) - \ell(\hat{\theta}_{G_{X \to Y}} : D_1) \big] + \tfrac{1}{2} \log M
\end{aligned}
\]

The ML scoring function is always nondecreasing as we add more edges to the graph, so we know that $\big[ \ell(\hat{\theta}_{G_\emptyset} : D_1) - \ell(\hat{\theta}_{G_{X \to Y}} : D_1) \big] \le 0$; with the empirical distribution above, we get a specific value of $[-10.8377 - (-10.5671)] = -0.2706$. For $M = 8$ as above, we have that $\frac{1}{2} \log M = 1.0397 > |-0.2706|$, so the difference of BIC scores is $-0.2706 + 1.0397 = 0.7691 > 0$. Thus the optimal network structure under BIC is $G_\emptyset$, which is different from the optimal network structure under ML. These results are summarized in the following table: ...
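The numerical values quoted in the two proofs can be checked directly. The following is a minimal Python sketch, not part of the original solution: the helper names (`log_dirichlet_marginal`, `k2_log_marginal`, `max_loglik`) are illustrative, natural logarithms are assumed (this matches the quoted values), and the count tables are obtained by multiplying each empirical probability by $M$.

```python
from math import lgamma, log


def log_dirichlet_marginal(counts, alpha=1.0):
    """Log marginal likelihood of multinomial counts under a symmetric
    Dirichlet(alpha, ..., alpha) prior (one factor group of KF eqn 16.9)."""
    total_alpha = alpha * len(counts)
    result = lgamma(total_alpha) - lgamma(total_alpha + sum(counts))
    for m in counts:
        result += lgamma(alpha + m) - lgamma(alpha)
    return result


def k2_log_marginal(joint):
    """log P(D | G) for the graph parent -> child under the K2 prior.
    joint[p][c] is the number of samples with parent = p and child = c."""
    parent_counts = [sum(row) for row in joint]
    score = log_dirichlet_marginal(parent_counts)
    for row in joint:  # one Dirichlet per parent value
        score += log_dirichlet_marginal(row)
    return score


def max_loglik(counts):
    """Maximized multinomial log-likelihood, sum of m * log(m / total).
    Zero counts contribute nothing."""
    total = sum(counts)
    return sum(m * log(m / total) for m in counts if m > 0)


# Section 1: counts[x][y] for the M = 4 data set.
counts_1 = [[1, 2], [1, 0]]
transposed_1 = [list(col) for col in zip(*counts_1)]  # counts[y][x]
k2_x_to_y = k2_log_marginal(counts_1)      # score minus the constant q
k2_y_to_x = k2_log_marginal(transposed_1)  # score minus the constant q

# Section 2: counts[x][y] for the M = 8 data set D_1.
counts_2 = [[2, 2], [1, 3]]
M = 8
x_counts = [sum(row) for row in counts_2]
y_counts = [sum(col) for col in zip(*counts_2)]
ll_empty = max_loglik(x_counts) + max_loglik(y_counts)
ll_x_to_y = max_loglik(x_counts) + sum(max_loglik(row) for row in counts_2)
bic_empty = ll_empty - 0.5 * log(M) * 2    # Dim[G_empty] = 2
bic_x_to_y = ll_x_to_y - 0.5 * log(M) * 3  # Dim[G_{X->Y}] = 3

print("K2 log marginals:", k2_x_to_y, k2_y_to_x)
print("ML log-likelihoods:", ll_empty, ll_x_to_y)
print("BIC scores:", bic_empty, bic_x_to_y)
```

Under these assumptions the K2 log marginal likelihoods come out as $\log(1/480)$ and $\log(1/540)$, and the BIC comparison favors the empty graph while the likelihood alone favors $G_{X \to Y}$, in line with both propositions.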
This note was uploaded on 05/25/2008 for the course MACHINE LE 10708 taught by Professor Carlosgustin during the Fall '07 term at Carnegie Mellon.