Bayes Nets
CPS 170
Ron Parr

Modeling Distributions
•  Suppose we knew P(X1…Xn) for all features
  –  Can answer any classification question optimally
    •  Let Y = Xi
    •  P(Y | X1…Xn \ Xi)
  –  Can answer many clustering-type questions
    •  P(Xi Xj)? (How often do two features co-occur?)
    •  P(X1…Xn) (How typical is an instance?)
•  To do this correctly we need the joint probability distribution
•  Unwieldy even for discrete variables
•  Use independence to make this tractable

Conditional Independence
•  Suppose we know the following:
  –  The flu causes sinus inflammation
  –  Allergies cause sinus inflammation
  –  Sinus inflammation causes a runny nose
  –  Sinus inflammation causes headaches
•  How are these connected?

Causal Structure
Flu and Allergy are parents of Sinus; Sinus is the parent of Headache and Nose.
Knowing Sinus separates the variables from each other.

Conditional Independence
•  We say that two variables, A and B, are conditionally independent given C if:
  –  P(A|BC) = P(A|C)
  –  P(AB|C) = P(A|C)P(B|C)
•  How does this help?
•  We store only a conditional probability table (CPT) for each variable given its parents
•  Naïve Bayes (e.g. SpamAssassin) is a special case of this!

Notation Reminder
•  P(A|B) is a conditional probability distribution
  –  It is a function!
  –  P(A=true|B=true), P(A=true|B=false), P(A=false|B=true), P(A=false|B=false)
•  P(A|b) is a probability distribution and a function
•  P(a|B) is a function, but not a distribution
•  P(a|b) is a number

Naïve Bayes Spam Filter
A class node S with prior P(S); word nodes W1, W2, …, Wn, each with CPT P(Wi|S).
We will see later why this is a particularly convenient representation. (Does it make a correct assumption?)

Getting More Formal
•  What is a Bayes net?
  –  A directed acyclic graph (DAG)
  –  Given its parents, each variable is independent of its non-descendants
  –  The joint probability decomposes:
       P(x1…xn) = ∏_i P(xi | parents(xi))
  –  For each node Xi, store P(Xi | parents(Xi))
  –  Represent each as a table called a CPT

Real Applications of Bayes Nets
•  Diagnosis of lymph node disease
•  Used in Microsoft Office and Windows
  –  http://research.microsoft.com/en-us/groups/mlas/
•  Used by robots to identify meteorites to study
•  Study the human genome: Alex Hartemink et al.
•  Many other applications…

Space Efficiency
(Flu, Allergy → Sinus → Headache, Nose)
•  The entire joint distribution has 32 (31) entries
  –  P(H|S) and P(N|S) each have 4 (2)
  –  P(S|AF) has 8 (4)
  –  P(A) and P(F) each have 2 (1)
  –  Total is 20 (10)
•  This can require exponentially less space
•  The space problem is solved for "most" problems

Naïve Bayes Space Efficiency
(S → W1, W2, …, Wn)
The entire joint distribution has 2^(n+1) (2^(n+1) - 1) numbers vs. 4n+2 (2n+1) for the CPTs.

Atomic Event Probabilities
P(x1…xn) = ∏_i P(xi | parents(xi))
(Flu, Allergy → Sinus → Headache, Nose)
Note that this is guaranteed to hold if we construct the net incrementally, so that for each new variable added, we connect all influencing variables as parents (prove it by induction).

Doing Things the Hard Way
P(f|h) = P(fh) / P(h)                            (defn. of conditional probability)
       = Σ_{SAN} P(fhSAN) / Σ_{SANF} P(hSANF)    (marginalization)
Doing this naïvely, we need to sum over all atomic events defined over these variables. There are exponentially many of these.

Working Smarter I
P(hSANF) = ∏_x P(x | parents(x)) = P(h|S) P(N|S) P(S|AF) P(A) P(F)

Working Smarter II
P(h) = Σ_{SANF} P(hSANF)
     = Σ_{SANF} P(h|S) P(N|S) P(S|AF) P(A) P(F)
     = Σ_{NS} P(h|S) P(N|S) Σ_{AF} P(S|AF) P(A) P(F)
     = Σ_S P(h|S) Σ_N P(N|S) Σ_{AF} P(S|AF) P(A) P(F)
Potential for an exponential reduction in computation.
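To make the factorization and the "push the sums in" trick concrete, here is a minimal Python sketch for the Flu/Allergy/Sinus network. The CPT numbers below are invented for illustration (the slides do not give them); only the structure P(h,N,S,A,F) = P(h|S)P(N|S)P(S|AF)P(A)P(F) comes from the slides.

```python
# Hypothetical CPTs for the Flu/Allergy/Sinus network. The numbers are
# made up for illustration; only the structure comes from the slides.
def p_f(f):              # P(Flu)
    return 0.1 if f else 0.9

def p_a(a):              # P(Allergy)
    return 0.2 if a else 0.8

def p_s(s, a, f):        # P(Sinus | Allergy, Flu)
    p1 = {(1, 1): 0.9, (1, 0): 0.7, (0, 1): 0.6, (0, 0): 0.05}[(a, f)]
    return p1 if s else 1 - p1

def p_h(h, s):           # P(Headache | Sinus)
    p1 = 0.8 if s else 0.1
    return p1 if h else 1 - p1

def p_n(n, s):           # P(Nose | Sinus)
    p1 = 0.7 if s else 0.2
    return p1 if n else 1 - p1

B = (0, 1)  # binary domain

# The hard way: sum the factored joint over every atomic event.
p_h_brute = sum(p_h(1, s) * p_n(n, s) * p_s(s, a, f) * p_a(a) * p_f(f)
                for s in B for a in B for n in B for f in B)

# Working smarter: push the summations inward (sum-product).
#   P(h) = sum_S P(h|S) [sum_N P(N|S)] [sum_{A,F} P(S|AF) P(A) P(F)]
p_h_smart = sum(
    p_h(1, s)
    * sum(p_n(n, s) for n in B)                                   # = 1
    * sum(p_s(s, a, f) * p_a(a) * p_f(f) for a in B for f in B)
    for s in B)

print(p_h_brute, p_h_smart)  # identical values, far fewer multiplications
```

Both expressions return the same number; the second touches each CPT entry far fewer times, which is where the exponential savings come from.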
Computational Efficiency
Σ_{SANF} P(hSANF) = Σ_{SANF} P(h|S) P(N|S) P(S|AF) P(A) P(F)
                  = Σ_S P(h|S) Σ_N P(N|S) Σ_{AF} P(S|AF) P(A) P(F)
The distributive law allows us to decompose the sum.
AKA: the sum-product algorithm.
Potential for an exponential reduction in computation costs.

Naïve Bayes Efficiency
(S → W1, W2, …, Wn)
Given a set of words, we want to know which is larger: P(s|W1…Wn) or P(¬s|W1…Wn).
Use Bayes Rule:
P(S|W1…Wn) = P(W1…Wn|S) P(S) / P(W1…Wn)

Naïve Bayes Efficiency II
P(S|W1…Wn) = P(W1…Wn|S) P(S) / P(W1…Wn)
Observation 1: We can ignore P(W1…Wn)
Observation 2: P(S) is given
Observation 3: P(W1…Wn|S) is easy: P(W1…Wn|S) = ∏_{i=1}^{n} P(Wi|S)

Checkpoint
•  BNs can give us an exponential reduction in the space required to represent a joint distribution.
•  Storage is exponential in the largest parent set.
•  Claim: Parent sets are often reasonable.
•  Claim: Inference cost is often reasonable.
•  Question: Can we quantify the relationship between structure and inference cost?

Now the Bad News…
•  In full generality: Inference is NP-hard
•  Decision problem: Is P(X) > 0?
•  We reduce from 3SAT
•  3SAT variables map to BN variables
•  Clauses become variables with the corresponding SAT variables as parents

Reduction
(X1 ∨ X2 ∨ X3) ∧ (X2 ∨ X3 ∨ X4) ∧ …
Variable nodes X1, X2, X3, X4 are parents of clause nodes C1, C2, …
Problem: What if we have a large number of clauses? How does this fit into our decision-problem framework?

And Trees
We could make a single variable which is the AND of all of our clauses, but this would have a CPT that is exponential in the number of clauses.
Instead, implement the conjunction as a tree of ANDs (e.g., C1 and C2 feed A1, and further AND nodes A2, A3, … combine the results). This is polynomial.

Is BN Inference NP-Complete?
•  Can show that BN inference is #P-hard
•  #P is counting the number of satisfying assignments
•  Idea: Assign the variables uniform probability
•  The probability of the conjunction of clauses tells us how many assignments are satisfying

Checkpoint
•  BNs can be very compact
•  Worst case: Inference is intractable
•  Hope that the worst case is:
  –  Avoidable
  –  Easily characterized in some way

Clues in the Graphical Structure
•  Q: How does graphical structure relate to our ability to push in summations over variables?
•  A:
  –  We relate summations to graph operations
  –  Summing out a variable =
    •  Removing node(s) from the DAG
    •  Creating a new replacement node
  –  Relate graph properties to computational efficiency

Variable Elimination
Recall that in variable elimination for CSPs, we eliminated variables and created new supervariables.

Another Example Network
Cloudy → Sprinkler, Cloudy → Rain; Sprinkler and Rain → W. Grass
P(c) = 0.5
P(s|c) = 0.1, P(s|¬c) = 0.5
P(r|c) = 0.8, P(r|¬c) = 0.2
P(w|s r) = 0.99, P(w|s ¬r) = 0.9, P(w|¬s r) = 0.9, P(w|¬s ¬r) = 0.0

Marginal Probabilities
Suppose we want P(W):
P(W) = Σ_{CSR} P(CSRW)
     = Σ_{CSR} P(C) P(S|C) P(R|C) P(W|RS)
     = Σ_{SR} P(W|RS) Σ_C P(S|C) P(C) P(R|C)

Eliminating Cloudy
Summing out Cloudy removes the Cloudy node and replaces P(C), P(S|C), P(R|C) with a new factor over (Sprinkler, Rain):
P(s r)   = 0.5 * 0.1 * 0.8 + 0.5 * 0.5 * 0.2 = 0.09
P(s ¬r)  = 0.5 * 0.1 * 0.2 + 0.5 * 0.5 * 0.8 = 0.21
P(¬s r)  = 0.5 * 0.9 * 0.8 + 0.5 * 0.5 * 0.2 = 0.41
P(¬s ¬r) = 0.5 * 0.9 * 0.2 + 0.5 * 0.5 * 0.8 = 0.29
This is exactly the inner sum in
P(W) = Σ_{CSR} P(C) P(S|C) P(R|C) P(W|RS) = Σ_{SR} P(W|RS) Σ_C P(S|C) P(C) P(R|C)
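A small Python sketch of this single elimination step, using the CPT numbers above (variable and function names are my own), reproduces the new factor:

```python
# CPTs from the Cloudy/Sprinkler/Rain/Wet-Grass network (slide numbers).
P_c = 0.5                                   # P(Cloudy = true)
P_s_given_c = {True: 0.1, False: 0.5}       # P(Sprinkler = true | Cloudy)
P_r_given_c = {True: 0.8, False: 0.2}       # P(Rain = true | Cloudy)

def bern(p_true, value):
    """P(X = value) for a binary X with P(X = true) = p_true."""
    return p_true if value else 1.0 - p_true

# Sum out Cloudy: f(S, R) = sum_C P(C) P(S|C) P(R|C)
factor_sr = {}
for s in (True, False):
    for r in (True, False):
        factor_sr[(s, r)] = sum(
            bern(P_c, c) * bern(P_s_given_c[c], s) * bern(P_r_given_c[c], r)
            for c in (True, False))

for (s, r), p in factor_sr.items():
    print(f"P(S={s}, R={r}) = {p:.2f}")
# Prints 0.09, 0.21, 0.41, 0.29 -- the factor shown on the slide.
```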
Eliminating Sprinkler/Rain
What remains is the new factor P(SR) and the W. Grass CPT:
P(s r) = 0.09, P(s ¬r) = 0.21, P(¬s r) = 0.41, P(¬s ¬r) = 0.29
P(w|s r) = 0.99, P(w|s ¬r) = 0.9, P(w|¬s r) = 0.9, P(w|¬s ¬r) = 0.0
P(w) = Σ_{SR} P(w|RS) P(RS)
     = 0.09 * 0.99 + 0.21 * 0.9 + 0.41 * 0.9 + 0.29 * 0 = 0.6471

Dealing With Evidence
Suppose we have observed that the grass is wet. What is the probability that it has rained?
P(R|w) = α P(Rw)
       = α Σ_{CS} P(CSRw)
       = α Σ_{CS} P(C) P(S|C) P(R|C) P(w|RS)
       = α Σ_C P(R|C) P(C) Σ_S P(S|C) P(w|RS)
Is there a more clever way to deal with w?

Turning our Summation Trick into an Algorithm
•  What happens when we "sum out" a variable?
  –  All CPTs that reference this variable get pushed to the right of the summation
  –  A new function defined over the union of these variables replaces those CPTs
•  We call this "variable elimination"
•  Analogous to Gaussian elimination in many ways

The Variable Elimination Algorithm
Elim(bn, query)
  If bn.vars = query, return bn
  Else
    x = pick_variable(bn)
    newbn.vars = bn.vars - x
    newbn.vars = newbn.vars - neighbors(x)
    newbn.vars = newbn.vars + newvar
    newbn.vars(newvar).function = Σ_x ∏_{Y ∈ x ∪ neighbors(x)} bn.vars(Y).function
    return elim(newbn, query)
(We can also sum out variables that are "hidden".)

Efficiency of Variable Elimination
•  Exponential in the largest domain size of the new variables created (just as in CSPs)
•  Equivalently: exponential in the largest function created by pushing in summations (sum-product algorithm)
•  Linear for trees
•  Almost linear for almost-trees

Naïve Bayes Efficiency
(S → W1, W2, …, Wn)
Another way to understand why Naïve Bayes is efficient: it's a tree!

Facts About Variable Elimination
•  Picking variables in the optimal order is NP-hard
•  For some networks, there will be no elimination ordering that results in a polynomial-time solution (this must be the case unless P = NP)
•  Polynomial for trees
•  Need to get a little fancier if there are a large number of query variables or evidence variables

Beyond Variable Elimination
•  Variable elimination must be rerun for every new query
•  It is possible to compile a Bayes net into a new data structure that makes repeated queries more efficient
  –  Recall that inference in trees is linear
  –  Define a "cluster tree" where
    •  Clusters = sets of original variables
    •  The original probabilities can be inferred from the cluster probabilities
•  For networks without good elimination schemes
  –  Sampling (discussed briefly)
  –  Variational methods (not covered in this class)
  –  Loopy belief propagation (not covered in this class)

Sampling
•  A Bayes net is an example of a generative model of a probability distribution
•  Generative models allow one to generate samples from a distribution in a natural way
•  Sampling algorithm:
  –  While some variables are not sampled
    •  Pick a variable x with no unsampled parents
    •  Assign this variable a value drawn from P(x | parents(x))

Comments on Sampling
•  Sampling is the easiest algorithm to implement
•  Can compute marginal or conditional distributions by counting
•  Problem: How do we handle observed values?
  –  Rejection sampling: quit and start over when mismatches occur
  –  Importance sampling: use a reweighting trick to compensate for mismatches
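Here is a minimal Python sketch of ancestral sampling and rejection sampling on the sprinkler network, using the CPTs from the slides (function names and the sample count are my own choices). With these numbers the estimate of P(r|w) should converge to roughly 0.708 (= 0.4581 / 0.6471, computed from the slide's CPTs):

```python
import random

# CPTs from the Cloudy/Sprinkler/Rain/Wet-Grass network (slide numbers).
P_c = 0.5
P_s = {True: 0.1, False: 0.5}                       # P(S=true | C)
P_r = {True: 0.8, False: 0.2}                       # P(R=true | C)
P_w = {(True, True): 0.99, (True, False): 0.9,      # P(W=true | S, R)
       (False, True): 0.9, (False, False): 0.0}

def sample_once():
    """Ancestral sampling: sample each variable given its already-sampled parents."""
    c = random.random() < P_c
    s = random.random() < P_s[c]
    r = random.random() < P_r[c]
    w = random.random() < P_w[(s, r)]
    return c, s, r, w

def estimate_rain_given_wet(n=200_000):
    """Rejection sampling for P(Rain=true | WetGrass=true):
    discard samples that disagree with the evidence, count the rest."""
    kept = rainy = 0
    for _ in range(n):
        _, _, r, w = sample_once()
        if not w:          # mismatch with the observed evidence -> reject
            continue
        kept += 1
        rainy += r
    return rainy / kept

print(estimate_rain_given_wet())   # roughly 0.708 with these CPTs
```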
Bayes Net Summary
•  Bayes net = a data structure for a joint distribution
•  Can give an exponential reduction in storage
•  Variable elimination:
  –  a simple, elegant method
  –  efficient for many networks
•  For some networks, we must use approximation
•  BNs are a major success story for modern AI
  –  BNs do the "right" thing (no ugly approximations)
  –  They exploit structure in the problem to reduce storage and computation
  –  Not always efficient, but the inefficient cases are well understood
  –  They work and are used in practice