SP11 Final Review 6PP


Probabilistic Reasoning
CS 188: Artificial Intelligence, Spring 2011
Final Review, 5/2/2011
Pieter Abbeel, UC Berkeley

Outline
- Probability: random variables; joint and marginal distributions; conditional distributions; inference by enumeration; product rule, chain rule, Bayes' rule; independence
- Distributions over LARGE numbers of random variables: Bayesian networks
  - Representation
  - Inference: exact (enumeration, variable elimination) and approximate (sampling)
  - Learning: maximum likelihood parameter estimation, Laplace smoothing, linear interpolation

Probability Recap
- Conditional probability: P(x | y) = P(x, y) / P(y)
- Product rule: P(x, y) = P(x | y) P(y)
- Chain rule: P(x1, ..., xn) = prod_i P(xi | x1, ..., x(i-1))
- X and Y are independent iff: for all x, y: P(x, y) = P(x) P(y)
  - equivalently, iff for all x, y: P(x | y) = P(x)
  - equivalently, iff for all x, y: P(y | x) = P(y)
- X and Y are conditionally independent given Z iff: for all x, y, z: P(x, y | z) = P(x | z) P(y | z)
  - equivalently, iff for all x, y, z: P(x | y, z) = P(x | z)
  - equivalently, iff for all x, y, z: P(y | x, z) = P(y | z)

Inference by Enumeration
The joint distribution over Season (S), Temperature (T), and Weather (W):

  S       T     W     P
  summer  hot   sun   0.30
  summer  hot   rain  0.05
  summer  cold  sun   0.10
  summer  cold  rain  0.05
  winter  hot   sun   0.10
  winter  hot   rain  0.05
  winter  cold  sun   0.15
  winter  cold  rain  0.20

- P(sun)? Sum out S and T: 0.30 + 0.10 + 0.10 + 0.15 = 0.65
- P(sun | winter)? Restrict to the winter rows and normalize: (0.10 + 0.15) / 0.50 = 0.5
- P(sun | winter, hot)? 0.10 / (0.10 + 0.05) = 2/3
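Inference by enumeration on this table can be checked with a short sketch (the dictionary encoding and helper name are mine, not from the slides):

```python
# Joint distribution P(S, T, W) from the table above.
joint = {
    ('summer', 'hot',  'sun'):  0.30, ('summer', 'hot',  'rain'): 0.05,
    ('summer', 'cold', 'sun'):  0.10, ('summer', 'cold', 'rain'): 0.05,
    ('winter', 'hot',  'sun'):  0.10, ('winter', 'hot',  'rain'): 0.05,
    ('winter', 'cold', 'sun'):  0.15, ('winter', 'cold', 'rain'): 0.20,
}

def prob(predicate):
    """Inference by enumeration: sum every joint entry matching a predicate."""
    return sum(p for outcome, p in joint.items() if predicate(*outcome))

p_sun = prob(lambda s, t, w: w == 'sun')  # marginal: sum out S and T
p_sun_given_winter = (prob(lambda s, t, w: s == 'winter' and w == 'sun')
                      / prob(lambda s, t, w: s == 'winter'))  # restrict, then normalize
```

Conditioning is just enumeration twice: once for the restricted event, once for the normalizer.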
Chain Rule → Bayes Net
- Chain rule: we can always write any joint distribution as an incremental product of conditional distributions:
  P(x1, ..., xn) = prod_i P(xi | x1, ..., x(i-1))
- Bayes nets make conditional independence assumptions of the form
  P(xi | x1, ..., x(i-1)) = P(xi | parents(Xi))
  giving us: P(x1, ..., xn) = prod_i P(xi | parents(Xi))

Example: Alarm Network
Nodes: Burglary (B), Earthquake (E), Alarm (A), John calls (J), Mary calls (M).
Edges: B → A, E → A, A → J, A → M.

  P(B): +b 0.001, ¬b 0.999        P(E): +e 0.002, ¬e 0.998

  P(A | B, E):
    +b +e: +a 0.95,  ¬a 0.05
    +b ¬e: +a 0.94,  ¬a 0.06
    ¬b +e: +a 0.29,  ¬a 0.71
    ¬b ¬e: +a 0.001, ¬a 0.999

  P(J | A): +a: +j 0.9, ¬j 0.1;   ¬a: +j 0.05, ¬j 0.95
  P(M | A): +a: +m 0.7, ¬m 0.3;   ¬a: +m 0.01, ¬m 0.99

Bayes Nets: Assumptions
- Assumptions we are required to make to define the Bayes net when given the graph:
  P(xi | x1, ..., x(i-1)) = P(xi | parents(Xi))
- Given a Bayes net graph, additional conditional independences can be read off directly from the graph
- Question: are two nodes necessarily independent given certain evidence?
  - If no, we can prove it with a counterexample: pick a set of CPTs and show that the independence assumption is violated by the resulting distribution
  - If yes, we can prove it with algebra (tedious) or with d-separation (which analyzes the graph)

Size of a Bayes' Net
- How big is a joint distribution over N Boolean variables? 2^N
- Size of the representation if we use the chain rule? Still 2^N
- How big is an N-node net if nodes have up to k parents? O(N * 2^(k+1))
- Both give you the power to calculate P(X1, ..., XN); BNs give huge space savings, make local CPTs easier to elicit, and answer queries faster

D-Separation
- Question: are X and Y conditionally independent given evidence variables {Z}?
- Yes, if X and Y are "separated" by Z: consider all (undirected) paths from X to Y; no active paths = independence!
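The chain-rule factorization over the alarm network can be sanity-checked by assembling one full joint entry from the CPTs above (the dictionary encoding is mine; True stands for + and False for ¬):

```python
# CPTs from the alarm network above.
p_b = {True: 0.001, False: 0.999}
p_e = {True: 0.002, False: 0.998}
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(+a | b, e)
p_j = {True: 0.9, False: 0.05}                        # P(+j | a)
p_m = {True: 0.7, False: 0.01}                        # P(+m | a)

def joint(b, e, a, j, m):
    """P(b, e, a, j, m) = P(b) P(e) P(a | b, e) P(j | a) P(m | a)."""
    pa = p_a[(b, e)] if a else 1 - p_a[(b, e)]
    pj = p_j[a] if j else 1 - p_j[a]
    pm = p_m[a] if m else 1 - p_m[a]
    return p_b[b] * p_e[e] * pa * pj * pm

# P(+b, +e, +a, +j, +m) = 0.001 * 0.002 * 0.95 * 0.9 * 0.7
p_all_true = joint(True, True, True, True, True)
```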
D-Separation Algorithm
- Given the query Xi ⊥ Xj | {Xk1, ..., Xkn}:
  - Shade all evidence nodes
  - For all (undirected!) paths between Xi and Xj, check whether the path is active
  - A path is active if each triple along it is active:
    - Causal chain A → B → C (either direction) where B is unobserved
    - Common cause A ← B → C where B is unobserved
    - Common effect (aka v-structure) A → B ← C where B, or one of its descendants, IS observed
  - If any path is active, the independence is not guaranteed
  - If all paths have been checked and shown inactive, return Xi ⊥ Xj | {Xk1, ..., Xkn}
- All it takes to block a path is a single inactive segment

(Example slide: d-separation queries on a small network over nodes R, B, T, D, L; the visible answers are Yes, Yes, Yes.)

All Conditional Independences
- Given a Bayes net structure, we can run d-separation to build a complete list of the conditional independences that are necessarily true, each of the form Xi ⊥ Xj | {Xk1, ..., Xkn}
- This list determines the set of probability distributions that can be represented by Bayes' nets with this graph structure

Topology Limits Distributions
- Given some graph topology G, only certain joint distributions can be encoded
- The graph structure guarantees certain (conditional) independences (there might be more independence)
- Adding arcs increases the set of representable distributions, but has several costs
- Full conditioning can encode any distribution
- Example over X, Y, Z: the edgeless graph guarantees {X ⊥ Y, X ⊥ Z, Y ⊥ Z, X ⊥ Z | Y, X ⊥ Y | Z, Y ⊥ Z | X}; the chain X → Y → Z guarantees {X ⊥ Z | Y}; the fully connected graph guarantees {} (nothing)

Bayes Nets Status: representation done; next, inference, then learning Bayes nets from data.

Example: Enumeration (in the alarm network B, E, A, J, M)
- In this simple method, we only need the BN to synthesize the joint entries
- Given unlimited time, inference in BNs is easy
- Recipe: state the marginal probabilities you need; figure out ALL the atomic probabilities you need; calculate and combine them

Variable Elimination
- Why is inference by enumeration so slow?
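The active-triple rules can be turned into a tiny path checker. This is a simplified sketch with hypothetical helper names: it takes the pre-classified triples along one path as input rather than analyzing a graph.

```python
def triple_active(kind, observed_middle):
    """Active-triple rules from the slide.
    kind: 'chain' (A->B->C, either direction), 'common_cause' (A<-B->C),
          or 'v_structure' (A->B<-C).
    observed_middle: True if B (or, for a v-structure, any descendant
    of B) is in the evidence set."""
    if kind in ('chain', 'common_cause'):
        return not observed_middle      # active iff B is unobserved
    if kind == 'v_structure':
        return observed_middle          # active iff B (or a descendant) IS observed
    raise ValueError(kind)

def path_active(triples):
    """A path is active iff every triple on it is active;
    a single inactive segment blocks the whole path."""
    return all(triple_active(kind, obs) for kind, obs in triples)

# Chain A -> B -> C with B observed: blocked, so A and C are d-separated.
blocked = not path_active([('chain', True)])
# V-structure A -> B <- C with B observed: the path becomes active.
active = path_active([('v_structure', True)])
```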
- You join up the whole joint distribution before you sum out the hidden variables
- You end up repeating a lot of work!
- Idea: interleave joining and marginalizing!
  - Track objects called factors; the initial factors are the local CPTs (one per node)
  - Any known values are selected: e.g., if we know L = +l, the factor for L keeps only its +l rows
- Called Variable Elimination (VE): still NP-hard, but usually much faster than inference by enumeration
- VE: alternately join factors and eliminate variables

Variable Elimination Example (chain R → T → L, query P(L))
Initial factors:

  P(R):      +r 0.1, -r 0.9
  P(T | R):  +r +t 0.8, +r -t 0.2, -r +t 0.1, -r -t 0.9
  P(L | T):  +t +l 0.3, +t -l 0.7, -t +l 0.1, -t -l 0.9

Join R, giving the factor P(R, T):
  +r +t 0.08, +r -t 0.02, -r +t 0.09, -r -t 0.81
Sum out R, giving P(T):
  +t 0.17, -t 0.83
Join T, giving P(T, L):
  +t +l 0.051, +t -l 0.119, -t +l 0.083, -t -l 0.747
Sum out T, giving P(L):
  +l 0.134, -l 0.866

Example (alarm network): choose (eliminate) E, then choose A, finish with B, and normalize.

General Variable Elimination
- Query: P(Q | e1, ..., ek)
- Start with the initial factors: local CPTs (but instantiated by the evidence)

Approximate Inference: Sampling
- Basic idea: draw N samples from a sampling distribution S; compute an approximate posterior probability; show this converges to the true probability P
- Why?
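The join/sum-out steps of the R → T → L example can be reproduced numerically (the dictionary encoding is mine; the CPT numbers are the ones above):

```python
# Initial factors for the chain R -> T -> L.
p_r = {'+r': 0.1, '-r': 0.9}
p_t_given_r = {('+r', '+t'): 0.8, ('+r', '-t'): 0.2,
               ('-r', '+t'): 0.1, ('-r', '-t'): 0.9}
p_l_given_t = {('+t', '+l'): 0.3, ('+t', '-l'): 0.7,
               ('-t', '+l'): 0.1, ('-t', '-l'): 0.9}

# Join on R, then immediately sum R out: leaves a factor over T alone.
p_t = {t: sum(p_r[r] * p_t_given_r[(r, t)] for r in p_r)
       for t in ('+t', '-t')}
# Join on T, then sum T out: leaves the answer P(L).
p_l = {l: sum(p_t[t] * p_l_given_t[(t, l)] for t in p_t)
       for l in ('+l', '-l')}
```

No factor over more than two variables is ever built, which is exactly the saving over enumeration's full joint.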
- Faster than computing the exact answer, as sampling can be done quickly

General Variable Elimination (continued)
- While there are still hidden variables (not Q or the evidence): pick a hidden variable H; join all factors mentioning H; eliminate (sum out) H
- Join all remaining factors and normalize

Sampling methods
- Prior sampling: sample ALL the variables in topological order
- Rejection sampling for the query P(Q | e1, ..., ek): like prior sampling, but reject a sample as soon as a variable Ei is sampled inconsistently with the evidence, i.e., differently from ei
- Likelihood weighting for the same query: like prior sampling, but the evidence variables Ei are not sampled; when it is their turn they are set to ei, and the sample is weighted by P(ei | values of parents(Ei) in the current sample)
- Gibbs sampling: repeatedly resamples each non-evidence variable conditioned on all the other variables; can incorporate downstream evidence

Prior Sampling Example
Network: Cloudy (C) with children Sprinkler (S) and Rain (R); S and R are both parents of WetGrass (W).

  P(C):        +c 0.5, ¬c 0.5
  P(S | C):    +c: +s 0.1, ¬s 0.9;   ¬c: +s 0.5, ¬s 0.5
  P(R | C):    +c: +r 0.8, ¬r 0.2;   ¬c: +r 0.2, ¬r 0.8
  P(W | S, R): +s +r: +w 0.99;  +s ¬r: +w 0.90;  ¬s +r: +w 0.90;  ¬s ¬r: +w 0.01

We'll get a bunch of samples from the BN, e.g.:
  +c, -s, +r, +w
  +c, +s, +r, +w
  -c, +s, +r, -w
  +c, -s, +r, +w
  -c, -s, -r, +w

If we want to know P(W):
- We have counts <+w: 4, -w: 1>; normalize to get the estimate P(W) = <+w: 0.8, -w: 0.2>
- This will get closer to the true distribution with more samples
- Can estimate anything else too: what about P(C | +w)? P(C | +r, +w)? P(C | -r, -w)?
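Prior sampling for this network is a few lines (a sketch; under the CPTs above the exact marginal P(+w) works out to 0.65, so the estimate should land near that):

```python
import random

def prior_sample(rng):
    """Sample (c, s, r, w) in topological order from the CPTs above."""
    c = rng.random() < 0.5
    s = rng.random() < (0.1 if c else 0.5)
    r = rng.random() < (0.8 if c else 0.2)
    p_w = {(True, True): 0.99, (True, False): 0.90,
           (False, True): 0.90, (False, False): 0.01}[(s, r)]
    w = rng.random() < p_w
    return c, s, r, w

rng = random.Random(0)
samples = [prior_sample(rng) for _ in range(50000)]
# Estimate P(+w) by counting and normalizing.
p_w_hat = sum(w for _, _, _, w in samples) / len(samples)
```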
- Fast: we can use fewer samples if there is less time

Likelihood Weighting
- Sampling distribution when z is sampled and the evidence e is fixed:
  S_WE(z, e) = prod_i P(zi | parents(Zi))
- Now the samples have weights:
  w(z, e) = prod_i P(ei | parents(Ei))
- Together, the weighted sampling distribution is consistent:
  S_WE(z, e) * w(z, e) = prod_i P(zi | parents(Zi)) * prod_i P(ei | parents(Ei)) = P(z, e)
- (Same sprinkler network and CPTs as above; a sample looks like +c, +s, +r, +w together with its weight.)

Gibbs Sampling
- Idea: instead of sampling from scratch, create samples that are each like the last one
- Procedure: resample one variable at a time, conditioned on all the rest, but keep the evidence fixed
- Properties: the samples are not independent (in fact they're nearly identical), but sample averages are still consistent estimators!
- What's the point: both upstream and downstream variables condition on the evidence

Markov Models
- A Markov model is a chain-structured BN: X1 → X2 → X3 → X4 → ...
- Each node is identically distributed (stationarity); the value of X at a given time is called the state
- The chain is just a (growing) BN; we can always use generic BN reasoning on it if we truncate the chain at a fixed length

Stationary Distributions
- For most chains, the distribution we end up in is independent of the initial distribution
- It is called the stationary distribution of the chain
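Convergence to a stationary distribution can be watched numerically. This two-state weather chain is illustrative only; the transition numbers are mine, not from the slides.

```python
# Illustrative two-state chain: P(sun -> sun) = 0.9, P(rain -> sun) = 0.3.
T = {('sun', 'sun'): 0.9, ('sun', 'rain'): 0.1,
     ('rain', 'sun'): 0.3, ('rain', 'rain'): 0.7}

p = {'sun': 0.5, 'rain': 0.5}  # any initial distribution works
for _ in range(100):
    # One step of the chain: P_{t+1}(x') = sum_x P(x' | x) * P_t(x)
    p = {x2: sum(T[(x1, x2)] * p[x1] for x1 in p) for x2 in ('sun', 'rain')}
```

Iterating from any starting point reaches the same fixed point; for this chain it is 0.3 / (0.1 + 0.3) = 0.75 for sun.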
- The stationary distribution satisfies
  P_inf(X) = sum_x P_(t+1|t)(X | x) * P_inf(x)
- Example applications: web link analysis (PageRank) and Gibbs sampling

Hidden Markov Models
- An underlying Markov chain over states S; you observe outputs (effects) at each time step:
  X1 → X2 → X3 → X4 → X5, with an observed Ei hanging off each Xi
- Speech recognition HMMs: Xi are specific positions in specific words; Ei are acoustic signals
- Machine translation HMMs: Xi are translation options; the observations Ei are words
- Robot tracking: Xi are positions on a map; Ei are range readings

Online Belief Updates
- At every time step, we start with the current P(X | evidence)
- We update for time: P(xt | e1:t-1) = sum over x(t-1) of P(xt | x(t-1)) * P(x(t-1) | e1:t-1)
- We update for evidence: P(xt | e1:t) is proportional to P(et | xt) * P(xt | e1:t-1)
- The forward algorithm does both at once (and doesn't normalize)

Particle Filtering
- = likelihood weighting + resampling at each time slice
- Why: sometimes |X| is too big to use exact inference ("particle" is just a new name for "sample")
- Elapse time: each particle is moved by sampling its next position from the transition model
- Observe: we don't sample the observation; we fix it and downweight our samples based on the evidence, just as in likelihood weighting
- Resample: rather than tracking weighted samples, we resample: N times, we choose from our weighted sample distribution

Dynamic Bayes Nets (DBNs)
- We want to track multiple variables over time, using multiple sources of evidence
- Idea: repeat a fixed Bayes net structure at each time slice; variables from time t can condition on those from t-1
- Discrete-valued dynamic Bayes nets are also HMMs

Bayes Nets Status: representation and inference done; next, learning Bayes nets from data.

Parameter Estimation
- Estimating the distribution of random variables like X or X | Y
- Empirically: use training data; for each outcome x, look at the empirical rate of that value, e.g., for the observed draws r, g, g: P_ML(g) = 2/3
- This is the estimate that maximizes the likelihood of the data
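The empirical-rate estimate can be written out for the r, g, g example above (the Counter encoding is mine):

```python
from collections import Counter

# Observed draws of a random variable X, as on the slide: r, g, g.
data = ['r', 'g', 'g']
counts = Counter(data)

# Maximum likelihood estimate: the empirical rate of each outcome.
p_ml = {x: c / len(data) for x, c in counts.items()}
```

Note that this assigns probability zero to any outcome never seen in training, which is exactly what Laplace smoothing (next) repairs.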
Laplace Smoothing
- Pretend you saw every outcome k extra times:
  P_LAP,k(x) = (count(x) + k) / (N + k|X|)
- Smooth each condition independently:
  P_LAP,k(x | y) = (count(x, y) + k) / (count(y) + k|X|)

Classification: Feature Vectors
- Spam example: the text "Hello, Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just ..." becomes a feature vector such as free: 2, YOUR_NAME: 0, MISSPELLED: 2, FROM_FRIEND: 0, ..., with label SPAM (+)
- Digit example: PIXEL-7,12: 1, PIXEL-7,13: 0, ..., NUM_LOOPS: 1, with label "2"

Classification Overview
- Naive Bayes: builds a model of the training data; gives prediction probabilities; strong assumptions about feature independence; one pass through the data (counting)
- Perceptron: makes fewer assumptions about the data; mistake-driven learning; multiple passes through the data (prediction); often more accurate
- MIRA: like the perceptron, but with adaptive scaling of the size of the update
- SVM: properties similar to the perceptron; convex optimization formulation
- Nearest-neighbor: non-parametric, so more expressive with more training data
- Kernels: an efficient way to make linear learning architectures into nonlinear ones

Bayes Nets for Classification
- One method of classification: use a probabilistic model!
- Features are observed random variables Fi; Y is the query variable
- Use probabilistic inference to compute the most likely Y; you already know how to do this inference

General Naive Bayes
- A general naive Bayes model: Y is the parent of F1, F2, ..., Fn
- Parameters: |Y| for P(Y) plus n * |F| * |Y| for the feature CPTs, versus |Y| * |F|^n for the full joint
- We only specify how each feature depends on the class; the total number of parameters is linear in n
- Our running example: digits

Bag-of-Words Naive Bayes
- Generative model; Wi is the word at position i, not the ith word in the dictionary!
- Bag-of-words: each position is identically distributed, and all positions share the same conditional probabilities P(W | C)
- When learning the parameters, the data is shared over all positions in the document (rather than separately learning a distribution for each position)
- Our running example: spam vs. ham
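A bag-of-words naive Bayes classifier with Laplace smoothing fits in a few lines. The toy corpus below is invented for illustration, not from the slides:

```python
from collections import Counter
import math

# Invented toy corpus of (label, tokens).
docs = [('spam', ['free', 'printer', 'free']),
        ('spam', ['free', 'money']),
        ('ham',  ['meeting', 'tomorrow']),
        ('ham',  ['free', 'tomorrow'])]

k = 1  # Laplace smoothing strength
vocab = {w for _, ws in docs for w in ws}
labels = sorted({y for y, _ in docs})
prior = Counter(y for y, _ in docs)
counts = {y: Counter() for y in labels}
for y, ws in docs:
    counts[y].update(ws)

def log_posterior(y, words):
    """log P(y) + sum_i log P(w_i | y), with Laplace-smoothed word estimates."""
    total = sum(counts[y].values())
    lp = math.log(prior[y] / len(docs))
    for w in words:
        lp += math.log((counts[y][w] + k) / (total + k * len(vocab)))
    return lp

def classify(words):
    return max(labels, key=lambda y: log_posterior(y, words))
```

Working in log space avoids underflow, and smoothing keeps unseen words from zeroing out a class.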
Linear Classifiers
- Binary linear classifier: predict + iff w · f(x) >= 0
- Multiclass linear classifier: a weight vector w_y for each class; the score (activation) of class y is w_y · f(x); the prediction is the class with the highest score
- Binary = multiclass where the negative class has weight zero
- Perceptron = an algorithm to learn the weights w

Perceptron
- Start with zero weights
- Pick up training instances one by one and classify with the current weights
- If correct: no change. If wrong: lower the score of the wrong answer and raise the score of the right answer: w_wrong <- w_wrong - f(x), w_right <- w_right + f(x)

Problems with the Perceptron
- Noise: if the data isn't separable, the weights might thrash; averaging weight vectors over time can help (averaged perceptron)
- Mediocre generalization: finds a barely separating solution
- Overtraining: test / held-out accuracy usually rises, then falls; overtraining is a kind of overfitting

Fixing the Perceptron: MIRA
- Choose the update size that fixes the current mistake and also minimizes the change to w

Support Vector Machines
- Maximizing the margin: good according to intuition, theory, and practice
- Support vector machines (SVMs) find the separator with the maximum margin
- Basically, SVMs are MIRA where you optimize over all examples at once

Non-Linear Separators
- Data that is linearly separable (with some noise) works out great; but what are we going to do if the dataset is just too hard?
- How about mapping the data to a higher-dimensional space?
- General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)
  (This and the next few slides adapted from Ray Mooney, UT)

Some Kernels
- Kernels implicitly map original vectors to higher-dimensional spaces, take the dot product there, and hand the result back
- Linear kernel: K(x, z) = x · z, i.e., φ(x) = x
- Quadratic kernel: K(x, z) = (x · z)^2, or (x · z + 1)^2
- Polynomial kernel: K(x, z) = (x · z + 1)^d

Why and When Kernels?
- Can't you just add these features on your own (e.g., add all pairs of features instead of using the quadratic kernel)?
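The question above can be made concrete: for the quadratic kernel in the form K(x, z) = (x · z)^2, computing implicitly agrees exactly with explicitly building all pairwise-product features (the vectors here are invented):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def quadratic_kernel(x, z):
    """Implicit: K(x, z) = (x . z)^2, computed in the original space."""
    return dot(x, z) ** 2

def phi(x):
    """Explicit feature map for (x . z)^2: all pairwise products x_i * x_j."""
    return [a * b for a in x for b in x]

x, z = [1.0, 2.0], [3.0, 0.5]
implicit = quadratic_kernel(x, z)   # O(n) work
explicit = dot(phi(x), phi(z))      # O(n^2) features, same value
```

With n base features the explicit map has n^2 entries while the kernel stays O(n) per dot product; that gap is the whole point.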
- Yes, in principle: just compute them, and no algorithm needs modifying
- But the number of features can get large (or even infinite); kernels let us compute with these features implicitly
- Example: the implicit dot product in the polynomial, Gaussian, and string kernels takes much less space and time per dot product
- When can we use kernels? When our learning algorithm can be reformulated in terms of only inner products between feature vectors; examples: perceptron, support vector machines

K-Nearest Neighbors
- 1-NN: copy the label of the most similar data point
- K-NN: let the k nearest neighbors vote (have to devise a weighting scheme)
- (Figure: fits with 2, 10, 100, and 10000 examples versus the truth)
- Parametric models: fixed set of parameters; more data means better settings
- Non-parametric models: the complexity of the classifier increases with data; better in the limit, often worse in the non-limit; (K)NN is non-parametric

Basic Similarity
- Many similarities are based on feature dot products: sim(x, x') = f(x) · f(x')
- If the features are just the pixels, this is the dot product of the images
- Note: not all similarities are of this form

Important Concepts
- Data: labeled instances, e.g., emails marked spam/ham; split into a training set, a held-out set, and a test set
- Features: attribute-value pairs which characterize each x
- Experimentation cycle: learn parameters (e.g., model probabilities) on the training set; tune hyperparameters on the held-out set; compute accuracy on the test set
- Very important: never peek at the test set!
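Returning to 1-NN and dot-product similarity above, the two combine into a three-line classifier (the tiny dataset is invented):

```python
# 1-NN using feature dot products as the similarity, per "Basic Similarity".
train = [([1.0, 0.0, 1.0], 'a'),
         ([0.0, 1.0, 0.0], 'b'),
         ([1.0, 0.0, 0.0], 'a')]

def similarity(f1, f2):
    """Feature dot product: large when the two vectors overlap."""
    return sum(u * v for u, v in zip(f1, f2))

def nearest_neighbor(f):
    """1-NN: copy the label of the most similar training point."""
    _, label = max(train, key=lambda xy: similarity(f, xy[0]))
    return label
```

A k-NN variant would sort by similarity, take the top k, and let them vote.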
- Evaluation: accuracy is the fraction of instances predicted correctly
- Overfitting and generalization: we want a classifier which does well on test data; overfitting means fitting the training data very closely but not generalizing well; we'll investigate overfitting and generalization formally in a few lectures

Tuning on Held-Out Data
- Now we've got two kinds of unknowns:
  - Parameters: e.g., the probabilities P(Y | X), P(Y)
  - Hyperparameters: e.g., the amount of smoothing to do (k, α in naive Bayes), or the number of passes over the training data (perceptron)
- Where to learn? Learn parameters from the training data; hyperparameters must be tuned on different data
- For each value of the hyperparameters, train on the training data and evaluate on the held-out data; choose the best value and do a final test on the test data

Extension: Web Search
- Information retrieval: given information needs (e.g., the query x = "Apple Computers"), produce information; includes web search, question answering, and classic IR
- Web search: not exactly classification, but rather ranking

Feature-Based Ranking and Perceptron for Ranking
- Inputs: a query x and candidate results y; many feature vectors f(x, y), one weight vector w
- Prediction: the candidate with the highest score w · f(x, y)
- Update (if wrong): lower the score of the wrong candidate and raise the score of the correct one, as in the multiclass perceptron
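The train / held-out / test protocol above can be sketched end to end. Everything here is a toy: the "model" is a threshold on a single number, standing in for a real classifier with hyperparameter k.

```python
def fit(train, k):
    # Toy model: predict positive when x > k; k plays the hyperparameter role.
    # (train is unused by this stand-in; a real fit would learn from it.)
    return lambda x: x > k

def accuracy(model, data):
    """Fraction of (x, label) pairs predicted correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

train = []
held_out = [(0.2, False), (0.4, False), (0.6, True), (0.9, True)]
test = [(0.1, False), (0.7, True)]

# Tune the hyperparameter on held-out data only...
candidates = [0.1, 0.3, 0.5, 0.7]
best_k = max(candidates, key=lambda k: accuracy(fit(train, k), held_out))
# ...then evaluate exactly once on the test set, which was never peeked at.
test_acc = accuracy(fit(train, best_k), test)
```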

This note was uploaded on 08/26/2011 for the course CS 188 taught by Professor Staff during the Spring '08 term at Berkeley.
