Bayes Nets
CPS 170
Ron Parr

Modeling Distributions
• Suppose we knew P(X1…Xn) for all features
– Can answer any classification question optimally
• Let Y=Xi
• P(Y|X1…Xn\Xi)
– Can answer many clustering type questions
• P(Xi|Xj)? (How often do two features co-occur?)
• P(X1…Xn) (How typical is an instance?)
• To do this correctly we need the joint probability distribution
• Unwieldy for discrete variables (exponentially many entries)
• Use independence to make this tractable

Conditional Independence
• Suppose we know the following:
– The flu causes sinus inflammation
– Allergies cause sinus inflammation
– Sinus inflammation causes a runny nose
– Sinus inflammation causes headaches
• How are these connected?

Causal Structure
(DAG: Flu → Sinus ← Allergy; Sinus → Headache; Sinus → Nose)
Knowing sinus separates the variables from each other.

Conditional Independence
• We say that two variables, A and B, are conditionally independent given C if:
– P(A|BC) = P(A|C)
– P(AB|C) = P(A|C)P(B|C)
• How does this help?
• We store only a conditional probability table (CPT) of each variable given its parents
• Naïve Bayes (e.g. SpamAssassin) is a special case of this!

Notation Reminder
• P(A|B) is a conditional prob. distribution
– It is a function! P(A=true|B=true), P(A=true|B=false), P(A=false|B=true), P(A=false|B=false)
• P(A|b) is a probability distribution, a function
• P(a|B) is a function, not a distribution
• P(a|b) is a number

Naïve Bayes Spam Filter
(DAG: S → W1, W2, …, Wn; CPTs P(S), P(W1|S), …, P(Wn|S))
We will see later why this is a particularly convenient representation. (Does it make a correct assumption?)

Getting More Formal
• What is a Bayes net?
– A directed acyclic graph (DAG)
– Given the parents, each variable is independent of non-descendents
– Joint probability decomposes:
P(x1…xn) = ∏i P(xi | parents(xi))
– For each node Xi, store P(Xi|parents(Xi))
– Represent as a table called a CPT

Real Applications of Bayes Nets
• Diagnosis of lymph node disease
• Used in Microsoft Office and Windows
– http://research.microsoft.com/en-us/groups/mlas/
• Used by robots to identify meteorites to study
• Study the human genome: Alex Hartemink et al.
• Many other applications…

Space Efficiency
(Flu/Allergy/Sinus network)
• Entire joint distribution has 32 (31) entries
– P(H|S), P(N|S) have 4 (2) each
– P(S|AF) has 8 (4)
– P(A), P(F) have 2 (1) each
– Total is 20 (10)
• This can require exponentially less space
• Space problem is solved for “most” problems

Naïve Bayes Space Efficiency
(DAG: S → W1, W2, …, Wn)
Entire joint distribution has 2^(n+1) (2^(n+1) − 1) numbers vs. 4n+2 (2n+1)

Atomic Event Probabilities
P(x1…xn) = ∏i P(xi | parents(xi))
(Flu/Allergy/Sinus network)
Note that this is guaranteed true if we construct the net incrementally, so that for each new variable added, we connect all influencing variables as parents (prove it by induction)
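The factored joint above can be evaluated directly from the CPTs. A minimal sketch for the Flu/Allergy/Sinus network — the slides give no numeric CPTs, so every probability below is an invented illustration:

```python
# Chain-rule evaluation of one atomic event in the Flu network.
# CPT numbers are made up for illustration; the slides give no values.
P_f = 0.1                         # P(flu)
P_a = 0.2                         # P(allergy)
P_s = {(True, True): 0.9, (True, False): 0.8,   # P(sinus | flu, allergy)
       (False, True): 0.7, (False, False): 0.05}
P_h = {True: 0.6, False: 0.1}     # P(headache | sinus)
P_n = {True: 0.8, False: 0.05}    # P(nose | sinus)

def joint(f, a, s, h, n):
    """P(f,a,s,h,n) = P(f) P(a) P(s|f,a) P(h|s) P(n|s)."""
    def bern(p, x):               # P(X=x) when P(X=True) = p
        return p if x else 1.0 - p
    return (bern(P_f, f) * bern(P_a, a) * bern(P_s[(f, a)], s)
            * bern(P_h[s], h) * bern(P_n[s], n))

# One atomic event: flu, no allergy, sinus, headache, runny nose
p = joint(True, False, True, True, True)
```

Five CPT lookups replace a lookup into a 32-entry joint table; the 32 atomic-event probabilities still sum to one by construction.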
Doing Things the Hard Way

P(f|h) = P(fh) / P(h)   (defn. of conditional probability)
       = Σ_SAN P(fhSAN) / Σ_SANF P(hSANF)   (marginalization)

Doing this naïvely, we need to sum over all atomic events defined over these variables. There are exponentially many of these.

Working Smarter I
(Flu/Allergy/Sinus network)
P(hSANF) = ∏x p(x | parents(x)) = P(h|S)P(N|S)P(S|AF)P(A)P(F)

Working Smarter II
P(h) = Σ_SANF P(hSANF)
     = Σ_SANF P(h|S)P(N|S)P(S|AF)P(A)P(F)
     = Σ_NS P(h|S)P(N|S) Σ_AF P(S|AF)P(A)P(F)
     = Σ_S P(h|S) Σ_N P(N|S) Σ_AF P(S|AF)P(A)P(F)
Potential for exponential reduction in computation.

Computational Efficiency
Σ_SANF P(hSANF) = Σ_SANF P(h|S)P(N|S)P(S|AF)P(A)P(F)
                = Σ_S P(h|S) Σ_N P(N|S) Σ_AF P(S|AF)P(A)P(F)
The distributive law allows us to decompose the sum. AKA: Sum-product algorithm.
Potential for an exponential reduction in computation costs.

Naïve Bayes Efficiency
(DAG: S → W1, W2, …, Wn)
Given a set of words, we want to know which is larger: P(s|W1…Wn) or P(¬s|W1…Wn). Use Bayes Rule:
P(S|W1…Wn) = P(W1…Wn|S)P(S) / P(W1…Wn)

Naïve Bayes Efficiency II
P(S|W1…Wn) = P(W1…Wn|S)P(S) / P(W1…Wn)
Observation 1: We can ignore P(W1…Wn)
Observation 2: P(S) is given
Observation 3: P(W1…Wn|S) is easy: P(W1…Wn|S) = ∏_{i=1}^{n} P(Wi|S)
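The three observations above reduce classification to a prior times a product of CPT entries, with the shared denominator dropped. A minimal sketch — the prior and per-word probabilities are invented for illustration, not from the slides:

```python
# Naive Bayes spam decision using Observations 1-3:
# compare P(s)*prod P(wi|s) with P(~s)*prod P(wi|~s); the common
# denominator P(W1...Wn) is ignored. All numbers are made up.
P_spam = 0.4                                  # prior P(s)
P_word = {                                    # P(word appears | class)
    "viagra":  {"spam": 0.30, "ham": 0.01},
    "meeting": {"spam": 0.02, "ham": 0.20},
    "free":    {"spam": 0.25, "ham": 0.05},
}

def spam_score(words, label):
    """Unnormalized P(label | words): prior times product of likelihoods."""
    score = P_spam if label == "spam" else 1.0 - P_spam
    for w in words:
        score *= P_word[w][label]
    return score

def classify(words):
    return "spam" if spam_score(words, "spam") > spam_score(words, "ham") else "ham"
```

In practice one compares log-scores to avoid underflow when n is large; the comparison is unchanged because log is monotone.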
Checkpoint
• BNs can give us an exponential reduction in the space required to represent a joint distribution.
• Storage is exponential in the largest parent set.
• Claim: Parent sets are often reasonable.
• Claim: Inference cost is often reasonable.
• Question: Can we quantify the relationship between structure and inference cost?

Now the Bad News…
• In full generality: Inference is NP-hard
• Decision problem: Is P(X)>0?
• We reduce from 3SAT
• 3SAT variables map to BN variables
• Clauses become variables with the corresponding SAT variables as parents

Reduction
(X1 ∨ X2 ∨ X3) ∧ (X2 ∨ X3 ∨ X4) ∧ …
(DAG: variables X1…X4; clause nodes C1, C2 with their clause’s variables as parents)
Problem: What if we have a large number of clauses? How does this fit into our decision problem framework?

And Trees
We could make a single variable which is the AND of all of our clauses, but this would have a CPT that is exponential in the number of clauses.
(DAG: clause nodes C1, C2, … feeding a binary tree of AND nodes A1, A2, A3, …)
Implement as a tree of ANDs. This is polynomial.

Is BN Inference NP-Complete?
• Can show that BN inference is #P-hard
• #P is counting the number of satisfying assignments
• Idea: Assign variables uniform probability
• Probability of conjunction of clauses tells us how many assignments are satisfying

Checkpoint
• BNs can be very compact
• Worst case: Inference is intractable
• Hope that the worst case is:
– Avoidable
– Easily characterized in some way

Clues in the Graphical Structure
• Q: How does graphical structure relate to our ability to push in summations over variables?
• A:
– We relate summations to graph operations
– Summing out a variable =
• Removing node(s) from DAG
• Creating new replacement node
– Relate graph properties to computational efficiency

Variable Elimination
Recall that in variable elimination for CSPs, we eliminated variables and created new supervariables
Another Example Network
(DAG: Cloudy → Sprinkler; Cloudy → Rain; Sprinkler, Rain → W. Grass)
P(c) = 0.5
P(s|c) = 0.1, P(s|¬c) = 0.5
P(r|c) = 0.8, P(r|¬c) = 0.2
P(w|sr) = 0.99, P(w|s¬r) = 0.9, P(w|¬sr) = 0.9, P(w|¬s¬r) = 0.0

Marginal Probabilities
Suppose we want P(W):
P(W) = Σ_CSR P(CSRW)
     = Σ_CSR P(C)P(S|C)P(R|C)P(W|RS)
     = Σ_SR P(W|RS) Σ_C P(S|C)P(C)P(R|C)

Eliminating Cloudy
Summing out Cloudy replaces P(C), P(S|C), P(R|C) with a single factor P(SR):
P(sr) = 0.5 * 0.1 * 0.8 + 0.5 * 0.5 * 0.2 = 0.09
P(s¬r) = 0.5 * 0.1 * 0.2 + 0.5 * 0.5 * 0.8 = 0.21
P(¬sr) = 0.5 * 0.9 * 0.8 + 0.5 * 0.5 * 0.2 = 0.41
P(¬s¬r) = 0.5 * 0.9 * 0.2 + 0.5 * 0.5 * 0.8 = 0.29
(DAG after elimination: supervariable (Sprinkler, Rain) → W. Grass)

Eliminating Sprinkler/Rain
P(sr) = 0.09, P(s¬r) = 0.21, P(¬sr) = 0.41, P(¬s¬r) = 0.29
P(w|sr) = 0.99, P(w|s¬r) = 0.9, P(w|¬sr) = 0.9, P(w|¬s¬r) = 0.0
P(w) = Σ_SR P(w|RS)P(RS)
     = 0.09 * 0.99 + 0.21 * 0.9 + 0.41 * 0.9 + 0.29 * 0
     = 0.6471
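The two elimination steps above can be checked numerically. A sketch that reproduces the factor P(SR) and the final P(w) = 0.6471 from the CPTs on the slides:

```python
# Variable elimination on the Cloudy/Sprinkler/Rain/W.Grass network,
# reproducing the slides' numbers. CPTs are taken from the slides.
P_c = 0.5
P_s = {True: 0.1, False: 0.5}                 # P(s | C)
P_r = {True: 0.8, False: 0.2}                 # P(r | C)
P_w = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.0}   # P(w | S, R)

def bern(p, x):
    """P(X=x) when P(X=True) = p."""
    return p if x else 1.0 - p

# Eliminate Cloudy: new factor P(S,R) = sum_C P(C) P(S|C) P(R|C)
P_sr = {(s, r): sum(bern(P_c, c) * bern(P_s[c], s) * bern(P_r[c], r)
                    for c in (True, False))
        for s in (True, False) for r in (True, False)}

# Eliminate Sprinkler and Rain: P(w) = sum_SR P(w|S,R) P(S,R)
P_w_true = sum(P_w[(s, r)] * P_sr[(s, r)]
               for s in (True, False) for r in (True, False))
```

The intermediate dictionary `P_sr` is exactly the supervariable created by summing out Cloudy: 0.09, 0.21, 0.41, 0.29.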
Dealing With Evidence
Suppose we have observed that the grass is wet. What is the probability that it has rained?
P(R|w) = αP(Rw)
       = α Σ_CS P(CSRw)
       = α Σ_CS P(C)P(S|C)P(R|C)P(w|RS)
       = α Σ_C P(R|C)P(C) Σ_S P(S|C)P(w|RS)
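The α-normalization above can be carried out by brute-force enumeration: compute the two unnormalized values P(r, w) and P(¬r, w), and α is one over their sum. A sketch using the sprinkler-network CPTs from the slides:

```python
# P(Rain | wet grass) by enumeration and normalization (alpha),
# using the sprinkler-network CPTs from the slides.
P_c = 0.5
P_s = {True: 0.1, False: 0.5}                 # P(s | C)
P_r = {True: 0.8, False: 0.2}                 # P(r | C)
P_w = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.0}   # P(w | S, R)

def bern(p, x):
    """P(X=x) when P(X=True) = p."""
    return p if x else 1.0 - p

def unnormalized(r):
    """P(R=r, w) = sum_CS P(C) P(S|C) P(R=r|C) P(w|S, R=r)."""
    return sum(bern(P_c, c) * bern(P_s[c], s) * bern(P_r[c], r) * P_w[(s, r)]
               for c in (True, False) for s in (True, False))

alpha = 1.0 / (unnormalized(True) + unnormalized(False))
P_rain_given_wet = alpha * unnormalized(True)
```

As a consistency check, the normalizing constant 1/α equals P(w) = 0.6471 computed earlier, and P(r|w) comes out to about 0.708.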
Is there a more clever way to deal with w?

Turning our Summation Trick into an Algorithm
• What happens when we “sum out” a variable?
– All CPTs that reference this variable get pushed to the right of the summation
– A new function defined over the union of these variables replaces these CPTs
• We call this “variable elimination”
• Analogous to Gaussian elimination in many ways

The Variable Elimination Algorithm
Elim(bn, query)
  If bn.vars = query return bn
  Else
    x = pick_variable(bn)
    newbn.vars = bn.vars − x
    newbn.vars = newbn.vars − neighbors(x)
    newbn.vars = newbn.vars + newvar
    newbn.vars(newvar).function = Σ_X ∏_{Y ∈ X ∪ neighbors(X)} bn.vars(Y).function
    return(elim(newbn, query))
(Can also sum out variables that are “hidden”)

Efficiency of Variable Elimination
• Exponential in the largest domain size of new variables created (just as in CSPs)
• Equivalently: Exponential in the largest function created by pushing in summations (sum-product algorithm)
• Linear for trees
• Almost linear for almost trees

Naïve Bayes Efficiency
(DAG: S → W1, W2, …, Wn)
Another way to understand why Naïve Bayes is efficient: It’s a tree!
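The Elim pseudocode above can be fleshed out with an explicit factor representation. A minimal sketch — the factor data structure and the fixed elimination order C, S, R are my own choices, not from the slides; it reproduces P(w) = 0.6471 on the sprinkler network:

```python
# Generic sum-product variable elimination over boolean factors.
# A factor is (vars_tuple, table), where table maps value-tuples to floats.
from itertools import product

def multiply(f1, f2):
    """Pointwise product of two factors, over the union of their variables."""
    (v1, t1), (v2, t2) = f1, f2
    vs = tuple(dict.fromkeys(v1 + v2))            # union, order-preserving
    table = {}
    for vals in product((True, False), repeat=len(vs)):
        env = dict(zip(vs, vals))
        table[vals] = (t1[tuple(env[v] for v in v1)]
                       * t2[tuple(env[v] for v in v2)])
    return (vs, table)

def sum_out(var, factors):
    """Eliminate var: multiply every factor mentioning it, then sum it away."""
    touching = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    vs, table = touching[0]
    for f in touching[1:]:
        vs, table = multiply((vs, table), f)
    new_vs = tuple(v for v in vs if v != var)
    new_table = {}
    for vals, p in table.items():
        key = tuple(v for x, v in zip(vs, vals) if x != var)
        new_table[key] = new_table.get(key, 0.0) + p
    return rest + [(new_vs, new_table)]          # replacement "supervariable"

# Sprinkler network factors: P(C), P(S|C), P(R|C), P(W|S,R)
fC = (("C",), {(True,): 0.5, (False,): 0.5})
fS = (("C", "S"), {(True, True): 0.1, (True, False): 0.9,
                   (False, True): 0.5, (False, False): 0.5})
fR = (("C", "R"), {(True, True): 0.8, (True, False): 0.2,
                   (False, True): 0.2, (False, False): 0.8})
fW = (("S", "R", "W"), {(s, r, True): p for (s, r), p in
                        {(True, True): 0.99, (True, False): 0.9,
                         (False, True): 0.9, (False, False): 0.0}.items()})
fW[1].update({(s, r, False): 1.0 - fW[1][(s, r, True)]
              for s in (True, False) for r in (True, False)})

factors = [fC, fS, fR, fW]
for var in ("C", "S", "R"):                       # eliminate down to P(W)
    factors = sum_out(var, factors)
vs, table = factors[0]
```

Each `sum_out` call is one step of Elim: the factors touching the eliminated variable are removed and one new function over their remaining variables takes their place.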
Facts About Variable Elimination
• Picking variables in optimal order is NP-hard
• For some networks, there will be no elimination ordering that results in a poly time solution (must be the case unless P=NP)
• Polynomial for trees
• Need to get a little fancier if there are a large number of query variables or evidence variables

Beyond Variable Elimination
• Variable elimination must be rerun for every new query
• Possible to compile a Bayes net into a new data structure to make repeated queries more efficient
– Recall that inference in trees is linear
– Define a “cluster tree” where
• Clusters = sets of original variables
• Can infer original probs from cluster probs
• For networks w/o good elimination schemes
– Sampling (discussed briefly)
– Variational methods (not covered in this class)
– Loopy belief propagation (not covered in this class)

Sampling
• A Bayes net is an example of a generative model of a probability distribution
• Generative models allow one to generate samples from a distribution in a natural way
• Sampling algorithm:
– While some variables are not sampled
• Pick variable x with no unsampled parents
• Assign this variable a value from p(x|parents(x))

Comments on Sampling
• Sampling is the easiest algorithm to implement
• Can compute marginal or conditional distributions by counting
• Problem: How do we handle observed values?
– Rejection sampling: Quit and start over when mismatches occur
– Importance sampling: Use a reweighting trick to compensate for mismatches

Bayes Net Summary
• Bayes net = data structure for joint distribution
• Can give exponential reduction in storage
• Variable elimination:
– simple, elegant method
– efficient for many networks
• For some networks, must use approximation
• BNs are a major success story for modern AI
– BNs do the “right” thing (no ugly approximations)
– Exploit structure in problem to reduce storage/computation
– Not always efficient, but inefficient cases are well understood
– Work and are used in practice
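The sampling loop and rejection sampling described above can be sketched on the sprinkler network — the sample count and random seed are arbitrary choices:

```python
# Forward sampling the sprinkler network in topological order,
# plus rejection sampling to estimate P(rain | wet grass).
import random

P_c = 0.5
P_s = {True: 0.1, False: 0.5}                 # P(s | C)
P_r = {True: 0.8, False: 0.2}                 # P(r | C)
P_w = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.0}   # P(w | S, R)

def sample(rng):
    """Sample each variable after its parents (topological order)."""
    c = rng.random() < P_c
    s = rng.random() < P_s[c]
    r = rng.random() < P_r[c]
    w = rng.random() < P_w[(s, r)]
    return c, s, r, w

def estimate_rain_given_wet(n, seed=0):
    """Rejection sampling: discard samples where the grass is not wet."""
    rng = random.Random(seed)
    kept = rained = 0
    while kept < n:
        _, _, r, w = sample(rng)
        if w:                                  # keep only matching evidence
            kept += 1
            rained += r
    return rained / kept

est = estimate_rain_given_wet(20000)           # exact answer is ~0.708
```

The wasted (rejected) samples illustrate the problem noted above: when the evidence is unlikely, most samples are thrown away, which is what importance sampling's reweighting trick avoids.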
This note was uploaded on 02/17/2012 for the course COMPSCI 170 taught by Professor Parr during the Spring '11 term at Duke.