Probabilistic Graphical Models, Spring 2007
Homework 3 solutions

1 Score Equivalence

Solution due to Steve Gardiner.

Proposition 1. The Bayesian score with a K2 prior is not score equivalent.

Proof. Consider the problem of learning a Bayesian network structure over two binary random variables $X$ and $Y$ with the following empirical joint distribution and sample size $M = 4$:
\[
P(X=0, Y=0) = \tfrac{1}{4}; \quad P(X=0, Y=1) = \tfrac{1}{2}; \quad P(X=1, Y=0) = \tfrac{1}{4}; \quad P(X=1, Y=1) = 0.
\]
The Bayesian score with a K2 prior (drawing each distribution from a Dirichlet(1, 1) prior) is as follows:
\[
\mathrm{score}_B(\mathcal{G} : \mathcal{D}) = \log P(\mathcal{G}) + \log P(\mathcal{D} \mid \mathcal{G})
= \log P(\mathcal{G}) + \log \int_{\theta_{\mathcal{G}}} P(\mathcal{D} \mid \theta_{\mathcal{G}}, \mathcal{G})\, P(\theta_{\mathcal{G}} \mid \mathcal{G})\, d\theta_{\mathcal{G}}.
\]
If we assume $\mathcal{G}_{X \to Y}$ and $\mathcal{G}_{Y \to X}$ are equally likely, then the first term is a constant $q = \log P(\mathcal{G})$. I'll use the notation $\alpha = \alpha_0 + \alpha_1 = 1 + 1 = 2$ for the hyperparameters, and count variables of the form $m_{0\cdot} = \sum_{i=1}^{M} \mathbf{1}[x_i = 0,\, y_i \in \{0, 1\}]$; for this data, $m = 4$, $m_{0\cdot} = 3$, $m_{1\cdot} = 1$, $m_{00} = 1$, $m_{01} = 2$, $m_{10} = 1$, and $m_{11} = 0$. Using the closed form of the Dirichlet marginal likelihood (KF eqn 16.9), the graph $\mathcal{G}_{X \to Y}$ has score
\[
\begin{aligned}
\mathrm{score}_B(\mathcal{G}_{X \to Y} : \mathcal{D})
&= q + \log \int_{\theta_{\mathcal{G}_{X \to Y}}} P(\mathcal{D} \mid \theta_{\mathcal{G}_{X \to Y}}, \mathcal{G}_{X \to Y})\, P(\theta_{\mathcal{G}_{X \to Y}} \mid \mathcal{G}_{X \to Y})\, d\theta_{\mathcal{G}_{X \to Y}} \\
&= q + \log \Bigl[ \frac{\Gamma(\alpha)}{\Gamma(\alpha + m)} \frac{\Gamma(\alpha_0 + m_{0\cdot})}{\Gamma(\alpha_0)} \frac{\Gamma(\alpha_1 + m_{1\cdot})}{\Gamma(\alpha_1)} \\
&\qquad\quad \cdot \frac{\Gamma(\alpha)}{\Gamma(\alpha + m_{0\cdot})} \frac{\Gamma(\alpha_0 + m_{00})}{\Gamma(\alpha_0)} \frac{\Gamma(\alpha_1 + m_{01})}{\Gamma(\alpha_1)} \\
&\qquad\quad \cdot \frac{\Gamma(\alpha)}{\Gamma(\alpha + m_{1\cdot})} \frac{\Gamma(\alpha_0 + m_{10})}{\Gamma(\alpha_0)} \frac{\Gamma(\alpha_1 + m_{11})}{\Gamma(\alpha_1)} \Bigr].
\end{aligned}
\]
The score of $\mathcal{G}_{Y \to X}$ can be derived analogously. With data exhibiting the above empirical distribution, we get
\[
\mathrm{score}_B(\mathcal{G}_{X \to Y} : \mathcal{D}) = q + (-6.1738) > q + (-6.2916) = \mathrm{score}_B(\mathcal{G}_{Y \to X} : \mathcal{D}),
\]
so the K2 prior assigns different scores to the two I-equivalent structures.

2 Scoring functions

Solution due to Steve Gardiner.

Proposition 2 (2.1). The optimal network structure according to the BIC scoring function is not necessarily the same as the optimal network structure according to the ML scoring function.

Proof. Consider data $\mathcal{D}_1$ exhibiting the following empirical joint distribution with sample size $M = 8$:
\[
P(X=0, Y=0) = \tfrac{1}{4}; \quad P(X=0, Y=1) = \tfrac{1}{4}; \quad P(X=1, Y=0) = \tfrac{1}{8}; \quad P(X=1, Y=1) = \tfrac{3}{8}. \tag{1}
\]
The BIC score differs from the ML score by the term $-\frac{\log M}{2}\,\mathrm{Dim}[\mathcal{G}]$. By counting the independent parameters in the graphs, we see that $\mathrm{Dim}[\mathcal{G}_\emptyset] = 2$ for the empty graph $\mathcal{G}_\emptyset$ and $\mathrm{Dim}[\mathcal{G}_{X \to Y}] = 3$, so we have
\[
\begin{aligned}
\mathrm{score}_{\mathrm{BIC}}(\mathcal{G}_\emptyset : \mathcal{D}_1) - \mathrm{score}_{\mathrm{BIC}}(\mathcal{G}_{X \to Y} : \mathcal{D}_1)
&= \bigl[\ell(\mathcal{G}_\emptyset : \mathcal{D}_1) - \log M\bigr] - \bigl[\ell(\mathcal{G}_{X \to Y} : \mathcal{D}_1) - \tfrac{3}{2}\log M\bigr] \\
&= \bigl[\ell(\mathcal{G}_\emptyset : \mathcal{D}_1) - \ell(\mathcal{G}_{X \to Y} : \mathcal{D}_1)\bigr] + \tfrac{1}{2}\log M.
\end{aligned}
\]
The ML scoring function is always nondecreasing as we add more edges to the graph, so we know that $\ell(\mathcal{G}_\emptyset : \mathcal{D}_1) - \ell(\mathcal{G}_{X \to Y} : \mathcal{D}_1) \le 0$; with the empirical distribution above, we get the specific value $[-10.8377 - (-10.5671)] = -0.2706$. For $M = 8$ as above, we have $\tfrac{1}{2}\log M = 1.0397 > |{-0.2706}|$. Thus the optimal network structure under BIC is $\mathcal{G}_\emptyset$, which is different from the optimal network structure under ML. These results are summarized in the following table: ...
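As a numerical check on Section 1, here is a minimal Python sketch that recomputes both K2 marginal likelihoods directly from the counts above via the closed form (KF eqn 16.9). It is not part of the original solution; the helper names `log_family_score` and `k2_score` are illustrative.

```python
from math import lgamma

# Joint counts for the M = 4 samples in Section 1:
# P(0,0)=1/4, P(0,1)=1/2, P(1,0)=1/4, P(1,1)=0.
m = {(0, 0): 1, (0, 1): 2, (1, 0): 1, (1, 1): 0}

def log_family_score(counts, alphas=(1.0, 1.0)):
    """Log Dirichlet marginal likelihood of one multinomial family under
    a Dirichlet(1,1) (K2) prior -- one factor of KF eqn 16.9."""
    a, n = sum(alphas), sum(counts)
    s = lgamma(a) - lgamma(a + n)
    for a_k, n_k in zip(alphas, counts):
        s += lgamma(a_k + n_k) - lgamma(a_k)
    return s

def k2_score(m, parent_is_x):
    """log P(D | G) for the two-node network X -> Y (or Y -> X)."""
    if parent_is_x:
        marg = [m[0, 0] + m[0, 1], m[1, 0] + m[1, 1]]    # counts of X
        cond = [[m[0, 0], m[0, 1]], [m[1, 0], m[1, 1]]]  # Y given X = 0, 1
    else:
        marg = [m[0, 0] + m[1, 0], m[0, 1] + m[1, 1]]    # counts of Y
        cond = [[m[0, 0], m[1, 0]], [m[0, 1], m[1, 1]]]  # X given Y = 0, 1
    return log_family_score(marg) + sum(log_family_score(c) for c in cond)

print(round(k2_score(m, True), 4))   # -6.1738  (= log 1/480)
print(round(k2_score(m, False), 4))  # -6.2916  (= log 1/540)
```

The two printed values match the $-6.1738$ and $-6.2916$ quoted in the proof, confirming that the two I-equivalent graphs score differently under the K2 prior.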
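Similarly, a short sketch for Section 2 recovers the ML and BIC values quoted in the proof of Proposition 2; again, the helper names (`loglik_empty`, `loglik_edge`) are my own, not from the original solution.

```python
from math import log

# Joint counts for the M = 8 samples of distribution (1) in Section 2.
M = 8
m = {(0, 0): 2, (0, 1): 2, (1, 0): 1, (1, 1): 3}

def loglik_empty(m, M):
    """Maximized log-likelihood of the empty graph (X, Y independent)."""
    mx = [m[0, 0] + m[0, 1], m[1, 0] + m[1, 1]]
    my = [m[0, 0] + m[1, 0], m[0, 1] + m[1, 1]]
    return (sum(n * log(n / M) for n in mx if n) +
            sum(n * log(n / M) for n in my if n))

def loglik_edge(m, M):
    """Maximized log-likelihood with an edge (fits the full joint)."""
    return sum(n * log(n / M) for n in m.values() if n)

ll0, ll1 = loglik_empty(m, M), loglik_edge(m, M)
bic0 = ll0 - 0.5 * log(M) * 2   # Dim[G_empty] = 2
bic1 = ll1 - 0.5 * log(M) * 3   # Dim[G_{X->Y}] = 3
print(round(ll0, 4), round(ll1, 4))    # -10.8377 -10.5671 : ML prefers the edge
print(round(bic0, 4), round(bic1, 4))  # -12.9171 -13.6863 : BIC prefers empty
```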