Probabilistic Graphical Models, Spring 2007
Homework 3 solutions

1 Score Equivalence

Solution due to Steve Gardiner

Proposition 1. The Bayesian score with a K2 prior is not score equivalent.

Proof. Consider the problem of learning a Bayesian network structure over two binary random variables $X$ and $Y$ with the following empirical joint distribution and sample size $M = 4$:

\[
\hat{P}(X=0, Y=0) = \tfrac{1}{4}; \quad
\hat{P}(X=0, Y=1) = \tfrac{1}{2}; \quad
\hat{P}(X=1, Y=0) = \tfrac{1}{4}; \quad
\hat{P}(X=1, Y=1) = 0.
\]

The Bayesian score with K2 prior (drawing each distribution from a Dirichlet(1, 1) prior) is as follows:

\[
\mathrm{score}_B(G : D) = \log P(G) + \log P(D \mid G)
= \log P(G) + \log \int_{\Theta_G} P(D \mid \theta_G, G)\, P(\theta_G \mid G)\, d\theta_G .
\]

If we assume $G_{X \to Y}$ and $G_{Y \to X}$ are equally likely, then the first term is a constant $q = \log P(G)$. I'll use the notation $\alpha = \alpha_0 + \alpha_1 = 1 + 1 = 2$ for the hyperparameters, and count variables of the form $m_0 = \sum_{i=1}^{M} I(X = 0, Y \in \{0, 1\})$, so that $m_x$ counts the instances with $X = x$ and $m_{xy}$ counts the instances with $X = x$, $Y = y$. The graph $G_{X \to Y}$ has score

\[
\begin{aligned}
\mathrm{score}_B(G_{X \to Y} : D)
&= q + \log \int_{\Theta_{G_{X \to Y}}} P(D \mid \theta_{G_{X \to Y}}, G_{X \to Y})\, P(\theta_{G_{X \to Y}} \mid G_{X \to Y})\, d\theta_{G_{X \to Y}} \\
&= q + \log \Bigl[
\frac{\Gamma(\alpha)}{\Gamma(\alpha + M)} \cdot
\frac{\Gamma(\alpha_0 + m_0)}{\Gamma(\alpha_0)} \cdot
\frac{\Gamma(\alpha_1 + m_1)}{\Gamma(\alpha_1)}
\qquad \text{(KF eqn 16.9)} \\
&\qquad\qquad \cdot
\frac{\Gamma(\alpha)}{\Gamma(\alpha + m_0)} \cdot
\frac{\Gamma(\alpha_0 + m_{00})}{\Gamma(\alpha_0)} \cdot
\frac{\Gamma(\alpha_1 + m_{01})}{\Gamma(\alpha_1)} \\
&\qquad\qquad \cdot
\frac{\Gamma(\alpha)}{\Gamma(\alpha + m_1)} \cdot
\frac{\Gamma(\alpha_0 + m_{10})}{\Gamma(\alpha_0)} \cdot
\frac{\Gamma(\alpha_1 + m_{11})}{\Gamma(\alpha_1)}
\Bigr].
\end{aligned}
\]

The score of $G_{Y \to X}$ can be derived analogously. With data exhibiting the above empirical distribution, we get

\[
\mathrm{score}_B(G_{X \to Y} : D) = q - 6.1738 \;>\; q - 6.2916 = \mathrm{score}_B(G_{Y \to X} : D).
\]

The two graphs are I-equivalent yet receive different scores, so the Bayesian score with a K2 prior is not score equivalent.

2 Scoring functions

Solution due to Steve Gardiner

Proposition 2 (2.1). The optimal network structure according to the BIC scoring function is not necessarily the same as the optimal network structure according to the ML scoring function.

Proof. Consider data $D_1$ exhibiting the following empirical joint distribution with sample size $M = 8$:

\[
\hat{P}(X=0, Y=0) = \tfrac{1}{4}; \quad
\hat{P}(X=0, Y=1) = \tfrac{1}{4}; \quad
\hat{P}(X=1, Y=0) = \tfrac{1}{8}; \quad
\hat{P}(X=1, Y=1) = \tfrac{3}{8}.
\tag{1}
\]

The BIC score differs from the ML score by the term $-\frac{\log M}{2}\,\mathrm{Dim}[G]$.
By counting the independent parameters in the graphs, we see that $\mathrm{Dim}[G_\emptyset] = 2$ for the empty graph $G_\emptyset$ and $\mathrm{Dim}[G_{X \to Y}] = 3$, so we have

\[
\begin{aligned}
\mathrm{score}_{BIC}(G_\emptyset : D_1) - \mathrm{score}_{BIC}(G_{X \to Y} : D_1)
&= \bigl[\ell(G_\emptyset : D_1) - \log M\bigr] - \bigl[\ell(G_{X \to Y} : D_1) - \tfrac{3}{2}\log M\bigr] \\
&= \bigl[\ell(G_\emptyset : D_1) - \ell(G_{X \to Y} : D_1)\bigr] + \tfrac{1}{2}\log M .
\end{aligned}
\]

The ML scoring function is always nondecreasing as we add more edges to the graph, so we know that $\ell(G_\emptyset : D_1) - \ell(G_{X \to Y} : D_1) \le 0$; with the empirical distribution above, we get a specific value of $[-10.8377 - (-10.5671)] = -0.2706$. For $M = 8$ as above, and using natural logarithms, we have that $\tfrac{1}{2}\log M = 1.0397 > |-0.2706|$, so the difference in BIC scores is positive. Thus the optimal network structure under BIC is $G_\emptyset$, which is different from the optimal network structure under ML, namely $G_{X \to Y}$. These results are summarized in the following table: ...
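The numerical values quoted in both proofs can be checked with a few lines of Python. The following is a minimal sketch, not part of the original solution: the helper names (`log_marginal_k2`, `k2_score`, `max_loglik`) and the count-table layout are assumptions made for illustration, and natural logarithms are assumed throughout.

```python
from math import lgamma, log

def log_marginal_k2(counts):
    """Log marginal likelihood of one multinomial under a Dirichlet(1, ..., 1) prior."""
    alpha = len(counts)          # sum of the hyperparameters, each equal to 1
    total = sum(counts)
    # log[ Gamma(alpha) / Gamma(alpha + total) * prod_k Gamma(1 + m_k) / Gamma(1) ]
    return lgamma(alpha) - lgamma(alpha + total) + sum(lgamma(1 + m) for m in counts)

def k2_score(joint):
    """log P(D | G) for the graph parent -> child; joint[p][c] counts (parent=p, child=c)."""
    parent_counts = [sum(row) for row in joint]
    return log_marginal_k2(parent_counts) + sum(log_marginal_k2(row) for row in joint)

def max_loglik(counts):
    """Maximized multinomial log-likelihood for a vector of counts."""
    total = sum(counts)
    return sum(m * log(m / total) for m in counts if m > 0)

# Proposition 1: M = 4, rows indexed by X, columns by Y.
d = [[1, 2], [1, 0]]
score_xy = k2_score(d)                               # G_{X->Y}, without the constant q
score_yx = k2_score([list(col) for col in zip(*d)])  # G_{Y->X}: transpose the table
print(f"K2:  X->Y {score_xy:.4f}   Y->X {score_yx:.4f}")

# Proposition 2: M = 8, rows indexed by X, columns by Y.
d1 = [[2, 2], [1, 3]]
M = sum(sum(row) for row in d1)
x_counts = [sum(row) for row in d1]
y_counts = [sum(col) for col in zip(*d1)]
ll_empty = max_loglik(x_counts) + max_loglik(y_counts)
ll_xy = max_loglik(x_counts) + sum(max_loglik(row) for row in d1)
bic_empty = ll_empty - 0.5 * log(M) * 2              # Dim = 2 for the empty graph
bic_xy = ll_xy - 0.5 * log(M) * 3                    # Dim = 3 for X -> Y
print(f"ML:  empty {ll_empty:.4f}   X->Y {ll_xy:.4f}")
print(f"BIC: empty {bic_empty:.4f}   X->Y {bic_xy:.4f}")
```

Working in log space with `lgamma` avoids evaluating the Gamma functions directly; for these small counts the direct product would also work, since the two marginal likelihoods in Proposition 1 come out to 1/480 and 1/540.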
 Fall '07
 CarlosGustin
