CS 188: Artificial Intelligence, Spring 2011
Final Review, 5/2/2011
Pieter Abbeel – UC Berkeley

Probabilistic Reasoning

Probability
  Random variables
  Joint and marginal distributions
  Conditional distribution
  Inference by enumeration
  Product rule, chain rule, Bayes rule
  Independence

Distributions over LARGE numbers of random variables: Bayesian networks
  Representation
  Inference
    Exact: enumeration, variable elimination
    Approximate: sampling
  Learning
    Maximum likelihood parameter estimation
    Laplace smoothing
    Linear interpolation

Probability recap

Conditional probability: P(x | y) = P(x, y) / P(y)
Product rule: P(x, y) = P(x | y) P(y)
Chain rule: P(x1, ..., xn) = ∏_i P(xi | x1 · · · xi−1)

X, Y independent iff: ∀x, y : P(x, y) = P(x) P(y)
  equivalently, iff: ∀x, y : P(x | y) = P(x)
  equivalently, iff: ∀x, y : P(y | x) = P(y)

X and Y are conditionally independent given Z iff: ∀x, y, z : P(x, y | z) = P(x | z) P(y | z)
  equivalently, iff: ∀x, y, z : P(x | y, z) = P(x | z)
  equivalently, iff: ∀x, y, z : P(y | x, z) = P(y | z)

Inference by Enumeration

  S       T     W     P
  summer  hot   sun   0.30
  summer  hot   rain  0.05
  summer  cold  sun   0.10
  summer  cold  rain  0.05
  winter  hot   sun   0.10
  winter  hot   rain  0.05
  winter  cold  sun   0.15
  winter  cold  rain  0.20

  P(sun)?
  P(sun | winter)?
  P(sun | winter, hot)?

Chain Rule → Bayes Net
Example: Alarm Network
[Figure: alarm network. Burglary (B) and Earthquake (E) are parents of Alarm (A); Alarm is the parent of John calls (J) and Mary calls (M)]

  P(B):   +b  0.001     ¬b  0.999
  P(E):   +e  0.002     ¬e  0.998

  B   E   A   P(A | B, E)
  +b  +e  +a  0.95
  +b  +e  ¬a  0.05
  +b  ¬e  +a  0.94
  +b  ¬e  ¬a  0.06
  ¬b  +e  +a  0.29
  ¬b  +e  ¬a  0.71
  ¬b  ¬e  +a  0.001
  ¬b  ¬e  ¬a  0.999

  A   J   P(J | A)          A   M   P(M | A)
  +a  +j  0.9               +a  +m  0.7
  +a  ¬j  0.1               +a  ¬m  0.3
  ¬a  +j  0.05              ¬a  +m  0.01
  ¬a  ¬j  0.95              ¬a  ¬m  0.99

Chain rule: can always write any joint distribution as an incremental product of conditional distributions:
  P(x1, ..., xn) = ∏_i P(xi | x1 · · · xi−1)
Bayes nets: make conditional independence assumptions of the form:
  P(xi | x1 · · · xi−1) = P(xi | parents(Xi))
giving us:
  P(x1, ..., xn) = ∏_i P(xi | parents(Xi))

Bayes Nets: Assumptions
Assumptions we are required to make to define the Bayes net when given the graph:
  P(xi | x1 · · · xi−1) = P(xi | parents(Xi))
Given a Bayes net graph, additional conditional independences can be read off directly from the graph.
Question: Are two nodes necessarily independent given certain evidence?
  If no, can prove with a counterexample, i.e., pick a set of CPTs and show that the independence assumption is violated by the resulting distribution
  If yes, can prove with algebra (tedious)

Size of a Bayes' Net
How big is a joint distribution over N Boolean variables? 2^N
Size of representation if we use the chain rule: 2^N
How big is an N-node net if nodes have up to k parents? O(N * 2^(k+1))
Both give you the power to calculate P(X1, ..., XN)
BNs:
  Huge space savings!
  Easier to elicit local CPTs
  Faster to answer queries
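To make the product-of-CPTs idea concrete, here is a minimal sketch that rebuilds joint entries of the alarm network from its local CPTs and answers a query by enumeration. The CPT numbers are the ones on the alarm network slide; the function names (`bern`, `joint`, `posterior_burglary`) and the choice of query P(B | +j, +m) are illustrative assumptions, not part of the slides.

```python
from itertools import product

# CPTs copied from the alarm network slide; True stands for +x, False for ¬x.
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {  # P(+a | B, E), keyed by (b, e)
    (True, True): 0.95, (True, False): 0.94,
    (False, True): 0.29, (False, False): 0.001,
}
P_J = {True: 0.9, False: 0.05}   # P(+j | A), keyed by a
P_M = {True: 0.7, False: 0.01}   # P(+m | A), keyed by a


def bern(p_true, value):
    """Probability of `value` for a Boolean variable with P(true) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(b, e, a, j, m):
    """One joint entry as the product of local conditionals P(xi | parents(Xi))."""
    return (P_B[b] * P_E[e] * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))


def posterior_burglary(j, m):
    """P(B | j, m) by enumeration: sum out the hidden E and A, then normalize."""
    scores = {
        b: sum(joint(b, e, a, j, m)
               for e, a in product([True, False], repeat=2))
        for b in (True, False)
    }
    total = sum(scores.values())
    return {b: s / total for b, s in scores.items()}


total_mass = sum(joint(*values) for values in product([True, False], repeat=5))
posterior = posterior_burglary(j=True, m=True)
print(round(total_mass, 6), round(posterior[True], 3))
```

The five CPTs hold 10 free numbers between them, while the explicit joint over five Boolean variables has 2^5 = 32 entries; the sketch generates each of those entries on demand.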
D-Separation

Question: Are X and Y conditionally independent given evidence vars {Z}?
  Yes, if X and Y "separated" by Z
  Consider all (undirected) paths from X to Y
  No active paths = independence!

A path is active if each triple is active:
  Causal chain A → B → C where B is unobserved (either direction)
  Common cause A ← B → C where B is unobserved
  Common effect (aka v-structure) A → B ← C where B or one of its descendants is observed
All it takes to block a path is a single inactive segment.

[Figure: active triples and inactive triples]

D-Separation (analyzes graph)

Given query Xi ⊥ Xj | {Xk1, ..., Xkn}:
  Shade all evidence nodes
  For all (undirected!) paths between Xi and Xj:
    Check whether path is active
    If active, return: Xi ⊥ Xj | {Xk1, ..., Xkn} is not guaranteed
  (If reaching this point all paths have been checked and shown inactive)
  Return Xi ⊥ Xj | {Xk1, ..., Xkn}

Example
[Figure: example network with nodes labeled L, R, D, B, T, T; three independence queries, each answered "Yes"]

All Conditional Independences
Given a Bayes net structure, can run d-separation to build a complete list of conditional independences that are necessarily true of the form
  Xi ⊥ Xj | {Xk1, ..., Xkn}
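The path-checking procedure above can be sketched in a few lines. This is a brute-force illustration under stated assumptions: it enumerates every simple undirected path (fine for small graphs, exponential in general), the helper names are hypothetical, and the graph used is the alarm network structure from earlier in the deck rather than the example pictured on the slide.

```python
def descendants(children, node):
    """All nodes reachable from `node` by following arcs forward."""
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def d_separated(edges, x, y, evidence):
    """True if every undirected path between x and y is inactive given evidence."""
    evidence = set(evidence)
    edge_set = set(edges)
    children, neighbors = {}, {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
        neighbors.setdefault(parent, set()).add(child)
        neighbors.setdefault(child, set()).add(parent)

    def triple_active(a, b, c):
        # Common effect a -> b <- c: active only if b or a descendant is observed.
        if (a, b) in edge_set and (c, b) in edge_set:
            return b in evidence or bool(descendants(children, b) & evidence)
        # Causal chain or common cause: active only if b is unobserved.
        return b not in evidence

    def has_active_path(path):
        last = path[-1]
        if last == y:
            return all(triple_active(*path[i:i + 3])
                       for i in range(len(path) - 2))
        return any(has_active_path(path + [n])
                   for n in neighbors.get(last, ()) if n not in path)

    return not has_active_path([x])


# Alarm network structure: B -> A <- E, A -> J, A -> M.
ALARM_EDGES = [("B", "A"), ("E", "A"), ("A", "J"), ("A", "M")]

print(d_separated(ALARM_EDGES, "B", "E", []))      # v-structure, A unobserved
print(d_separated(ALARM_EDGES, "B", "E", ["J"]))   # descendant of A observed
print(d_separated(ALARM_EDGES, "J", "M", ["A"]))   # common cause observed
```

A `True` result means independence is guaranteed by the graph; `False` only means it is not guaranteed, since a particular choice of CPTs may still happen to make the variables independent.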
The list of conditional independences produced by d-separation determines the set of probability distributions that can be represented by Bayes' nets with this graph structure.

Topology Limits Distributions

Given some graph topology G, only certain joint distributions can be encoded
The graph structure guarantees certain (conditional) independences
(There might be more independence)
Adding arcs increases the set of distributions, but has several costs
Full conditioning can encode any distribution

[Figure: three-node graphs over X, Y, Z, grouped by the independences each topology guarantees:
  {X ⊥ Y, X ⊥ Z, Y ⊥ Z, X ⊥ Z | Y, X ⊥ Y | Z, Y ⊥ Z | X}
  {X ⊥ Z | Y}
  {}]

Bayes Nets Status
  Representation
  Inference
  Learning Bayes Nets from Data

Inference by Enumeration
Given unlimited time, inference in BNs is easy
Recipe:
  State the marginal probabilities you need
  Figure out ALL the atomic probabilities you need
  Calculate and combine them
Example: [Figure: alarm network over B, E, A, J, M]

Example: Enumeration
In this simple method, we only need the BN to synthesize the joint entries

Variable Elimination
Why is inference by enumeration so slow?
  You join up the whole joint distribution before you sum out the hidden variables
  You end up repeating a lot of work!
Idea: interleave joining and marginalizing!

Variable Elimination Outline
Track objects called factors
Initial factors are local CPTs (one per node)
  P(R)            P(T | R)             P(L | T)
  +r  0.1         +r  +t  0.8          +t  +l  0.3
  ¬r  0.9         +r  ¬t  0.2          +t  ¬l  0.7
                  ¬r  +t  0.1          ¬t  +l  0.1
                  ¬r  ¬t  0.9          ¬t  ¬l  0.9

Any known values are selected
E.g. if we know L = +l, the initial factors are

  P(R)            P(T | R)             P(+l | T)
  +r  0.1         +r  +t  0.8          +t  +l  0.3
  ¬r  0.9         +r  ¬t  0.2          ¬t  +l  0.1
                  ¬r  +t  0.1
                  ¬r  ¬t  0.9

VE: Alternately join factors and eliminate variables
Called "Variable Elimination"
Still NP-hard, but usually much faster than inference by enumeration

[Figure: chain R → T → L]

Variable Elimination Example
Chain R → T → L, starting from the initial factors:

  P(R)            P(T | R)             P(L | T)
  +r  0.1         +r  +t  0.8          +t  +l  0.3
  ¬r  0.9         +r  ¬t  0.2          +t  ¬l  0.7
                  ¬r  +t  0.1          ¬t  +l  0.1
                  ¬r  ¬t  0.9          ¬t  ¬l  0.9

Join R: remaining factors are P(R, T) and P(L | T)

  P(R, T)
  +r  +t  0.08
  +r  ¬t  0.02
  ¬r  +t  0.09
  ¬r  ¬t  0.81

Sum out R: remaining factors are P(T) and P(L | T)

  P(T)
  +t  0.17
  ¬t  0.83

Join T: remaining factor is P(T, L)

  P(T, L)
  +t  +l  0.051
  +t  ¬l  0.119
  ¬t  +l  0.083
  ¬t  ¬l  0.747

Sum out T: remaining factor is P(L)

  P(L)
  +l  0.134
  ¬l  0.866

* VE is variable elimination

Example
  Choose E
  Choose A
  Finish with B
  Normalize
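The join and sum-out steps of the R → T → L example can be reproduced with a small factor representation. This is a sketch under stated assumptions: a factor is stored as a pair (variable names, table keyed by tuples of Booleans), and the helper names `join` and `sum_out` are hypothetical. The CPT numbers are the ones on the slides.

```python
from itertools import product


def join(f, g):
    """Pointwise product of two factors over the union of their variables."""
    f_vars, f_table = f
    g_vars, g_table = g
    out_vars = f_vars + tuple(v for v in g_vars if v not in f_vars)
    table = {}
    for values in product([True, False], repeat=len(out_vars)):
        row = dict(zip(out_vars, values))
        f_key = tuple(row[v] for v in f_vars)
        g_key = tuple(row[v] for v in g_vars)
        table[values] = f_table[f_key] * g_table[g_key]
    return out_vars, table


def sum_out(var, f):
    """Marginalize `var` out of factor f."""
    f_vars, f_table = f
    i = f_vars.index(var)
    table = {}
    for key, p in f_table.items():
        reduced = key[:i] + key[i + 1:]
        table[reduced] = table.get(reduced, 0.0) + p
    return f_vars[:i] + f_vars[i + 1:], table


# Initial factors (True stands for +x, False for ¬x).
P_R = (("R",), {(True,): 0.1, (False,): 0.9})
P_T_GIVEN_R = (("R", "T"), {(True, True): 0.8, (True, False): 0.2,
                            (False, True): 0.1, (False, False): 0.9})
P_L_GIVEN_T = (("T", "L"), {(True, True): 0.3, (True, False): 0.7,
                            (False, True): 0.1, (False, False): 0.9})

p_t = sum_out("R", join(P_R, P_T_GIVEN_R))   # join R, then sum out R
p_l = sum_out("T", join(p_t, P_L_GIVEN_T))   # join T, then sum out T
print({key: round(p, 3) for key, p in p_l[1].items()})
```

The largest factor this ever builds has two variables, whereas enumeration would first build the full three-variable joint P(R, T, L).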
General Variable Elimination
Query: P(Q | E1 = e1, ..., Ek = ek)
Start with initial factors:
  Local CPTs (but instantiated by evidence)
While there are still hidden variables (not Q or evidence):
  Pick a hidden variable H
  Join all factors mentioning H
  Eliminate (sum out) H
Join all remaining factors and normalize

Approximate Inference: Sampling
Basic idea:
  Draw N samples from a sampling distribution S
  Compute an approximate posterior probability
  Show this converges to the true probability P
Why? Faster than computing the exact answer
Prior sampling:
  Sample ALL variables in topological order as this can be done quickly
Rejection sampling for query P(Q | E1 = e1, ..., Ek = ek):
  Like prior sampling, but reject when a variable is sampled inconsistent with the query, in this case when a variable Ei is sampled differently from ei
Likelihood weighting for query P(Q | E1 = e1, ..., Ek = ek):
  Like prior sampling, but variables Ei are not sampled; when it's their turn, they get set to ei, and the sample gets weighted by P(ei | value of parents(ei) in current sample)
Gibbs sampling:
  Repeatedly samples each non-evidence variable conditioned on all other variables, so it can incorporate downstream evidence

Prior Sampling
[Figure: Bayes net over Cloudy (C), Sprinkler (S), Rain (R), WetGrass (W)]

  P(C)            P(S | C)             P(R | C)
  +c  0.5         +c  +s  0.1          +c  +r  0.8
  ¬c  0.5         +c  ¬s  0.9          +c  ¬r  0.2
                  ¬c  +s  0.5          ¬c  +r  0.2
                  ¬c  ¬s  0.5          ¬c  ¬r  0.8

  P(W | S, R)
  +s  +r  +w  0.99
  +s  +r  ¬w  0.01
  +s  ¬r  +w  0.90
  +s  ¬r  ¬w  0.10
  ¬s  +r  +w  0.90
  ¬s  +r  ¬w  0.10
  ¬s  ¬r  +w  0.01
  ¬s  ¬r  ¬w  0.99

Samples:
  +c, ¬s, +r, +w
  ¬c, +s, ¬r, +w
  …

Example
We'll get a bunch of samples from the BN:
  +c, ¬s, +r, +w
  +c, +s, +r, +w
  ¬c, +s, +r, ¬w
  +c, ¬s, +r, +w
  ¬c, ¬s, ¬r, +w
If we want to know P(W):
  We have counts <+w:4, ¬w:1>
  Normalize to get P(W) = <+w:0.8, ¬w:0.2>
  This will get closer to the true distribution with more samples
  Can estimate anything else, too
  What about P(C | +w)? P(C | +r, +w)? P(C | ¬r, ¬w)?
  Fast: can use fewer samples if less time
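Prior sampling on this network is short enough to sketch directly, with rejection sampling as a filter on top. The CPT values are the ones on the slide; the function names, the fixed seed, and the sample count of 20000 are illustrative assumptions. For reference, multiplying out the CPTs by hand gives P(+w) = 0.65, so the first estimate should land near that value.

```python
import random

# CPTs from the slide, stored as P(+x | parents); True stands for +x.
P_C = 0.5
P_S = {True: 0.1, False: 0.5}            # P(+s | C)
P_R = {True: 0.8, False: 0.2}            # P(+r | C)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}   # P(+w | S, R)


def prior_sample(rng):
    """Sample every variable in topological order: C, then S and R, then W."""
    c = rng.random() < P_C
    s = rng.random() < P_S[c]
    r = rng.random() < P_R[c]
    w = rng.random() < P_W[(s, r)]
    return c, s, r, w


def estimate_p_wet(n, seed=0):
    """Estimate P(+w) as the fraction of prior samples with W = +w."""
    rng = random.Random(seed)
    return sum(prior_sample(rng)[3] for _ in range(n)) / n


def estimate_cloudy_given_wet(n, seed=0):
    """Rejection sampling for P(+c | +w): discard samples where W = ¬w."""
    rng = random.Random(seed)
    kept = [c for c, _, _, w in (prior_sample(rng) for _ in range(n)) if w]
    return sum(kept) / len(kept)


print(estimate_p_wet(20000))
print(estimate_cloudy_given_wet(20000))
```

Rejection sampling throws away every sample that disagrees with the evidence, which is cheap here because +w is common but becomes wasteful when the evidence is unlikely.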
Likelihood Weighting

[Figure: the Cloudy (C), Sprinkler (S), Rain (R), WetGrass (W) network with the same CPTs as on the Prior Sampling slide]

Samples:
  +c, +s, +r, +w
  …

Likelihood Weighting
Sampling distribution if z sampled and e fixed evidence:
  S_WS(z, e) = ∏_i P(zi | Parents(Zi))
Now, samples have weights:
  w(z, e) = ∏_i P(ei | Parents(Ei))
Together, weighted sampling distribution is consistent:
  S_WS(z, e) · w(z, e) = ∏_i P(zi | Parents(Zi)) ∏_i P(ei | Parents(Ei)) = P(z, e)
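A minimal sketch of likelihood weighting on the same network follows. The query P(C | +s, +w) is chosen only for illustration, and the function name, seed, and sample count are assumptions. Evidence variables are never sampled: each is clamped to its observed value and the sample's weight is multiplied by the probability of that value given the sampled parents.

```python
import random

# CPTs from the slides, stored as P(+x | parents); True stands for +x.
P_C = 0.5
P_S = {True: 0.1, False: 0.5}            # P(+s | C)
P_R = {True: 0.8, False: 0.2}            # P(+r | C)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}   # P(+w | S, R)


def likelihood_weighting(n, seed=0):
    """Estimate P(C | S = +s, W = +w) from n weighted samples."""
    rng = random.Random(seed)
    totals = {True: 0.0, False: 0.0}
    for _ in range(n):
        weight = 1.0
        c = rng.random() < P_C          # sampled
        weight *= P_S[c]                # S is evidence: fix +s, weight by P(+s | c)
        r = rng.random() < P_R[c]       # sampled
        weight *= P_W[(True, r)]        # W is evidence: fix +w, weight by P(+w | +s, r)
        totals[c] += weight
    normalizer = totals[True] + totals[False]
    return {c: t / normalizer for c, t in totals.items()}


estimate = likelihood_weighting(50000)
print(estimate)
```

Working the CPTs out by hand gives P(+c | +s, +w) = 0.0486 / 0.2781, roughly 0.175, so the weighted estimate should settle close to that. Unlike rejection sampling, every sample contributes, although samples with small weights contribute little.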
Gibbs Sampling
Idea: instead of sampling from scratch, create samples that are each like the last one.
Procedure: resample one variable at a time, conditioned on all the rest, but keep evidence fixed.
Properties: Now samples are not independent (in fact they're nearly identical), but sample averages are still consistent estimators!
What's the point: both upstream and downstream variables condition on evidence.

Markov Models
A Markov model is a chain-structured BN
  Each node is identically distributed (stationarity)
  Value of X at a given time is called the state
  As a BN: X1 → X2 → X3 → X4
The chain is just a (growing) BN
  We can always use generic BN reasoning on it if we truncate the chain at a fixed length
Stationary distributions
  For most chains, the distribution we end up in is independent of the initial distribution
  Called the stationary distribution of the chain:
    P∞(X) = Σ_x P_{t+1|t}(X | x) P∞(x)
Example applications: Web link analysis (PageRank) and Gibbs sampling

Hidden Markov Models
Underlying Markov chain over states S
You observe outputs (effects) at each time step

Online Belief Updates
Every time step, we start with current P(X | evidence)
We update for time:
  P(x_t | e_1:t−1) = Σ_{x_t−1} P(x_t−1 | e_1:t−1) · P(x_t | x_t−1)
We update for evidence:
  P(x_t | e_1:t) ∝ P(x_t | e_1:t−1) · P(e_t | x_t)
The forward algorithm does both at once (and doesn't normalize)

Hidden Markov Models: examples
[Figure: HMM with hidden states X1 ... X5 and observations E1 ... E5]
Speech recognition HMMs:
  Xi: specific positions in specific words; Ei: acoustic signals
Machine translation HMMs:
  Xi: translation options; Ei: observations are words
Robot tracking:
  Xi: positions on a map; Ei: range readings

Particle Filtering
Particle filtering = likelihood weighting + resampling at each time slice
Why: sometimes |X| is too big to use exact inference
Particle is just new name for sample
Elapse time:
  Each particle is moved by sampling its next position from the transition model
Observe:
  We don't sample the observation, we fix it
  This is like likelihood weighting, so we downweight our samples based on the evidence
Resample:
  Rather than tracking weighted samples, we resample
  N times, we choose from our weighted sample distribution
[Figure: grid of belief values 0.0 0.1 0.0 / 0.0 0.0 0.2 / 0.0 0.2 0.5]

Dynamic Bayes Nets (DBNs)
We want to track multiple variables over time, using multiple sources of evidence
Idea: Repeat a fixed Bayes net structure at each time
Variables from time t can condition on those from t−1
[Figure: DBN unrolled for t = 1, 2, 3 with variables G1a, G1b, E1a, E1b through G3a, G3b, E3a, E3b]
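The two online belief updates for an HMM (update for time, then update for evidence) can be sketched as follows. Everything numeric here is an assumption made for illustration: the two states, the transition table, and the emission table form a hypothetical rain/umbrella model that does not appear in the slides. This sketch also normalizes after every observation, whereas the forward algorithm as described on the slide skips normalization.

```python
# Hypothetical two-state model; none of these numbers come from the slides.
TRANSITION = {"rain": {"rain": 0.7, "sun": 0.3},
              "sun": {"rain": 0.3, "sun": 0.7}}            # P(x_t | x_t-1)
EMISSION = {"rain": {"umbrella": 0.9, "no_umbrella": 0.1},
            "sun": {"umbrella": 0.2, "no_umbrella": 0.8}}   # P(e_t | x_t)


def elapse_time(belief, transition):
    """Update for time: push the current belief through the transition model."""
    return {nxt: sum(belief[prev] * transition[prev][nxt] for prev in belief)
            for nxt in belief}


def observe(belief, emission, evidence):
    """Update for evidence: weight each state by P(evidence | state), then normalize."""
    weighted = {x: belief[x] * emission[x][evidence] for x in belief}
    total = sum(weighted.values())
    return {x: p / total for x, p in weighted.items()}


def filter_beliefs(prior, observations):
    """Run the time update followed by the evidence update for each observation."""
    belief = dict(prior)
    for evidence in observations:
        belief = elapse_time(belief, TRANSITION)
        belief = observe(belief, EMISSION, evidence)
    return belief


belief = filter_beliefs({"rain": 0.5, "sun": 0.5}, ["umbrella", "umbrella"])
print({state: round(p, 3) for state, p in belief.items()})
```

Particle filtering follows the same elapse/observe rhythm but replaces the exact sums over states with sampled particles, which is what makes it usable when the state space is too large to enumerate.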
Discrete valued dynamic Bayes nets are also HMMs

Parameter Estimation
Estimating distribution of random variables like X or X | Y
Empirically: use training data
  For each outcome x, look at the empirical rate of that value:
    P_ML(x) = count(x) / total samples
    e.g. for samples r, g, g: P_ML(r) = 1/3
  This is the estimate that maximizes the likelihood of the data
Laplace smoothing
  Pretend saw every outcome k extra times:
    P_LAP,k(x) = (c(x) + k) / (N + k|X|)
  Smooth each condition independently:
    P_LAP,k(x | y) = (c(x, y) + k) / (c(y) + k|X|)

Classification: Feature Vectors

  Input:    Hello, Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just
  Features: # free       : 2
            YOUR_NAME    : 0
            MISSPELLED   : 2
            FROM_FRIEND  : 0
            ...
  Label:    SPAM or +

  Input:    [image of a handwritten digit]
  Features: PIXEL-7,12   : 1
            PIXEL-7,13   : 0
            ...
            NUM_LOOPS    : 1
            ...
  Label:    "2"

Classification overview
Naïve Bayes:
  Builds a model of the training data
  Gives prediction probabilities
  Strong assumptions about feature independence
  One pass through data (counting)
Perceptron:
  Makes fewer assumptions about data
  Mistake-driven learning
  Multiple passes through data (prediction)
  Often more accurate
MIRA:
  Like perceptron, but adaptive scaling of size of update
SVM:
  Properties similar to perceptron
  Convex optimization formulation
Nearest-Neighbor:
  Non-parametric: more expressive with more training data
Kernels
  Efficient way to make linear learning architectures into nonlinear ones

Bayes Nets for Classification
One method of classification:
  Use a probabilistic model!
  Features are observed random variables Fi
  Y is the query variable
  Use probabilistic inference to compute most likely Y

General Naïve Bayes
A general naive Bayes model:
  P(Y, F1 ... Fn) = P(Y) ∏_i P(Fi | Y)
  [Figure: class node Y with children F1, F2, ..., Fn]
  Full joint over Y, F1 ... Fn: |Y| x |F|^n parameters
  Naive Bayes: |Y| parameters for P(Y), plus n x |F| x |Y| parameters for the P(Fi | Y)
We only specify how each feature depends on the class
Total number of parameters is linear in n
You already know how to do this inference
Our running example: digits

Bag-of-Words Naïve Bayes
Generative model:
  P(C, W1 ... Wn) = P(C) ∏_i P(Wi | C)
  (Wi is the word at position i, not the ith word in the dictionary!)
Bag-of-words
  Each position is identically distributed
  All positions share the same conditional probs P(W | C)
  When learning the parameters, data is shared over all positions in the document (rather than separately learning a distribution for each position in the document)
Our running example: spam vs. ham

Linear Classifier
Binary linear classifier:
  Output +1 if w · f(x) ≥ 0, else −1
Multiclass linear classifier:
  A weight vector for each class: w_y
  Score (activation) of a class y: w_y · f(x)
  Prediction: highest score wins, y = argmax_y w_y · f(x)
Binary = multiclass where the negative class has weight zero

Perceptron = algorithm to learn weights w
  Start with zero weights
  Pick up training instances one by one
  Classify with current weights
  If correct, no change!
  If wrong: lower score of wrong answer, raise score of right answer

Problems with the Perceptron
  Noise: if the data isn't separable, weights might thrash
    Averaging weight vectors over time can help (averaged perceptron)
  Mediocre generalization: finds a "barely" separating solution
  Overtraining: test / held-out accuracy usually rises, then falls
    Overtraining is a kind of overfitting
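The perceptron steps listed above (start at zero, classify, adjust only on mistakes) can be sketched for the multiclass case as follows. The toy spam/ham examples, the feature names, the `bias` feature, and the cap of 10 passes are all assumptions made for illustration; the update itself adds the feature vector to the correct class's weights and subtracts it from the wrongly predicted class's weights.

```python
def dot(weights, features):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def predict(weights, features):
    """Highest score wins; ties go to the class listed first."""
    return max(weights, key=lambda label: dot(weights[label], features))


def train_perceptron(data, labels, passes=10):
    """Mistake-driven training: weights change only when a prediction is wrong."""
    weights = {label: {} for label in labels}
    for _ in range(passes):
        mistakes = 0
        for features, truth in data:
            guess = predict(weights, features)
            if guess != truth:
                mistakes += 1
                for name, value in features.items():
                    # Raise the score of the right answer, lower the wrong one.
                    weights[truth][name] = weights[truth].get(name, 0.0) + value
                    weights[guess][name] = weights[guess].get(name, 0.0) - value
        if mistakes == 0:   # a full clean pass: the training set is separated
            break
    return weights


# Hypothetical training set: (feature vector, label).
TRAIN = [
    ({"free": 2, "misspelled": 2, "bias": 1}, "spam"),
    ({"your_name": 1, "from_friend": 1, "bias": 1}, "ham"),
    ({"free": 1, "bias": 1}, "spam"),
    ({"from_friend": 1, "bias": 1}, "ham"),
]

weights = train_perceptron(TRAIN, labels=["spam", "ham"])
print([predict(weights, features) for features, _ in TRAIN])
```

On this separable toy set the loop stops after its first pass without mistakes. On non-separable data it would simply run out of passes, which is the thrashing behaviour that averaging the weight vectors is meant to dampen.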
Fixing the Perceptron: MIRA
Update size that fixes the current mistake and also minimizes the change to w
Update w by solving:

Support Vector Machines
Maximizing the margin: good according to intuition, theory, practice
Support vector machines (SVMs) find the separator with max margin
Basically, SVMs are MIRA where you optimize over all examples at once
[Equations: MIRA and SVM optimization problems]

Non-Linear Separators
Data that is linearly separable (with some noise) works out great:
  [Figure: one-dimensional data on axis x with origin 0]
But what are we going to do if the dataset is just too hard?
  [Figure: one-dimensional data on axis x with origin 0]
How about… mapping data to a higher-dimensional space:
  [Figure: the same data plotted on axes x and x²]
This and next few slides adapted from Ray Mooney, UT

Non-Linear Separators
General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
  Φ: x → φ(x)

Some Kernels
Kernels implicitly map original vectors to higher dimensional spaces, take the dot product there, and hand the result back
  Linear kernel: φ(x) = x
  Quadratic kernel
Some Kernels (2)
  Polynomial kernel

Why and When Kernels?
Can't you just add these features on your own (e.g. add all pairs of features instead of using the quadratic kernel)?
  Yes, in principle, just compute them
  No need to modify any algorithms
  But, number of features can get large (or infinite)
Kernels let us compute with these features implicitly
  Example: implicit dot product in polynomial, Gaussian and string kernel takes much less space and time per dot product
When can we use kernels?
  When our learning algorithm can be reformulated in terms of only inner products between feature vectors
  Examples: perceptron, support vector machine

K-nearest neighbors
1-NN: copy the label of the most similar data point
K-NN: let the k nearest neighbors vote (have to devise a weighting scheme)
[Figure: panels "2 Examples", "10 Examples", "100 Examples", "10000 Examples", "Truth"]
Parametric models:
  Fixed set of parameters
  More data means better settings
Non-parametric models:
  Complexity of the classifier increases with data
  Better in the limit, often worse in the non-limit
(K)NN is non-parametric

Basic Similarity
Many similarities based on feature dot products:
If features are just the pixels:
Note: not all similarities are of this form

Important Concepts
Data: labeled instances, e.g. emails marked spam/ham
  Training set
  Held out set
  Test set
Features: attribute-value pairs which characterize each x
Experimentation cycle
  Learn parameters (e.g. model probabilities) on training set
  (Tune hyperparameters on held-out set)
  Compute accuracy of test set
  Very important: never "peek" at the test set!
Evaluation
  Accuracy: fraction of instances predicted correctly
Overfitting and generalization
  Want a classifier which does well on test data
  Overfitting: fitting the training data very closely, but not generalizing well
  We'll investigate overfitting and generalization formally in a few lectures
[Figure: data split into Training Data, Held-Out Data, Test Data]
Tuning on Held-Out Data
Now we've got two kinds of unknowns
  Parameters: the probabilities P(Y | X), P(Y)
  Hyperparameters:
    Amount of smoothing to do: k, α (naïve Bayes)
    Number of passes over training data (perceptron)
Where to learn?
  Learn parameters from training data
  Must tune hyperparameters on different data
  For each value of the hyperparameters, train and test on the held-out data
  Choose the best value and do a final test on the test data

Extension: Web Search
Information retrieval:
  Given information needs, produce information
  Includes, e.g. web search, question answering, and classic IR
Web search: not exactly classification, but rather ranking
[Figure: query x = "Apple Computers"]

Feature-Based Ranking
[Figure: query x = "Apple Computers" paired with candidate results]

Perceptron for Ranking
Inputs: x
Candidates: y
Many feature vectors: f(x, y)
One weight vector: w
Prediction: y = argmax_y w · f(x, y)
Update (if wrong): w = w + f(x, y*) − f(x, y)