CSci 5512: Artificial Intelligence II
Instructor: Arindam Banerjee
February 6, 2012

o Problems are primarily of two types: integration and optimization
o Bayesian inference and learning
  o Computing normalization in Bayesian methods:
    $$p(y \mid x) = \frac{p(y)\, p(x \mid y)}{\int_{\mathcal{Y}} p(y')\, p(x \mid y')\, dy'}$$
  o Marginalization: $p(y \mid x) = \int_{\mathcal{Z}} p(y, z \mid x)\, dz$
  o Expectation: $E_{y \mid x}[f(y)] = \int_{\mathcal{Y}} f(y)\, p(y \mid x)\, dy$
o Statistical mechanics: computing the partition function
o Optimization, model selection, etc.

Monte Carlo Principle
o Target density $p(x)$ on a high-dimensional space
o Draw i.i.d. samples $\{x_i\}_{i=1}^{n}$ from $p(x)$
o Construct empirical point mass function
    $$p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}(x)$$
o One can approximate integrals/sums by
    $$I_n(f) = \frac{1}{n} \sum_{i=1}^{n} f(x_i) \approx I(f) = \int_{\mathcal{X}} f(x)\, p(x)\, dx$$
o Unbiased estimate $I_n(f)$ converges to $I(f)$ by the strong law of large numbers
o For finite $\sigma_f^2$ (the variance of $f(x)$ under $p$), the central limit theorem implies
    $$\sqrt{n}\, \big(I_n(f) - I(f)\big) \Rightarrow \mathcal{N}(0, \sigma_f^2) \quad \text{as } n \to \infty$$

Rejection Sampling
o Target density $p(x)$ is known, but hard to sample
o Use an easy to sample proposal distribution $q(x)$
o $q(x)$ satisfies $p(x) \le M q(x)$, $M < \infty$
o Algorithm: For $i = 1, \ldots, n$
  1. Sample $x_i \sim q(x)$ and $u \sim \mathcal{U}(0, 1)$
  2. If $u < \frac{p(x_i)}{M q(x_i)}$, accept $x_i$, else reject
o Issues:
  o Tricky to bound $p(x)/q(x)$ with a reasonable constant $M$
  o If $M$ is too large, acceptance probability is small

Rejection Sampling (Contd.)
[Figure: rejection sampling illustration; the only recoverable label is "Reject region"]

Markov Chains
o Use a Markov chain to explore the state space
o Markov chain in a discrete space is a process with
    $$p(x_i \mid x_{i-1}, \ldots, x_1) = T(x_i \mid x_{i-1})$$
o A chain is homogeneous if $T$ is invariant for all $i$
o The chain will stabilize into an invariant distribution if it is
  1. Irreducible, transition graph is connected
  2. Aperiodic, does not get trapped in cycles
o Sufficient condition to ensure $p(x)$ is the invariant distribution:
    $$p(x_i)\, T(x_{i-1} \mid x_i) = p(x_{i-1})\, T(x_i \mid x_{i-1})$$
o MCMC samplers: invariant distribution = target distribution
o Design of samplers for fast convergence

Markov Chains (Contd.)
o Random walker on the web
  o Irreducibility, should be able to reach all pages
  o Aperiodicity, do not get stuck in a loop
o PageRank uses $T = L + E$
  o $L$ = link matrix for the web graph
  o $E$ = uniform random matrix, to ensure irreducibility, aperiodicity
o Invariant distribution $p(x)$ represents rank of webpage $x$
o Continuous spaces, $T$ becomes an integral kernel $K$:
    $$\int p(x_i)\, K(x_{i+1} \mid x_i)\, dx_i = p(x_{i+1})$$
o $p(x)$ is the corresponding eigenfunction

The Metropolis-Hastings Algorithm
o Most popular MCMC method
o Based on a proposal distribution $q(x^* \mid x)$
o Algorithm: For $i = 0, \ldots, (n - 1)$
  o Sample $u \sim \mathcal{U}(0, 1)$
  o Sample $x^* \sim q(x^* \mid x_i)$
  o Then
    $$x_{i+1} = \begin{cases} x^* & \text{if } u < A(x_i, x^*) = \min\left\{1, \frac{p(x^*)\, q(x_i \mid x^*)}{p(x_i)\, q(x^* \mid x_i)}\right\} \\ x_i & \text{otherwise} \end{cases}$$
o The transition kernel is
    $$K_{MH}(x_{i+1} \mid x_i) = q(x_{i+1} \mid x_i)\, A(x_i, x_{i+1}) + \delta_{x_i}(x_{i+1})\, r(x_i)$$
  where $r(x_i)$ is the term associated with rejection
    $$r(x_i) = \int_{\mathcal{X}} q(x \mid x_i)\, \big(1 - A(x_i, x)\big)\, dx$$
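The Metropolis-Hastings loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the course's implementation: the target `log_target` (an unnormalized standard normal), the Gaussian random-walk proposal, and the names `log_q`, `metropolis_hastings`, `step`, and `seed` are all hypothetical choices made for the example. The acceptance ratio is computed in log space to avoid overflow.

```python
import math
import random


def log_target(x):
    """Unnormalized log p(x) for a hypothetical target: a standard normal."""
    return -0.5 * x * x


def log_q(x_to, x_from, step):
    """Log-density (up to a constant) of the assumed proposal N(x_from, step^2)."""
    return -0.5 * ((x_to - x_from) / step) ** 2


def metropolis_hastings(log_p, x0, n, step=1.0, seed=0):
    """Run n Metropolis-Hastings iterations; return samples and acceptance rate."""
    rng = random.Random(seed)
    x = x0
    samples = []
    accepted = 0
    for _ in range(n):
        u = rng.random()                    # u ~ U(0, 1)
        x_star = x + rng.gauss(0.0, step)   # x* ~ q(x* | x_i)
        # log of p(x*) q(x_i | x*) / (p(x_i) q(x* | x_i))
        log_ratio = (log_p(x_star) + log_q(x, x_star, step)
                     - log_p(x) - log_q(x_star, x, step))
        if u < math.exp(min(0.0, log_ratio)):
            x = x_star                      # accept the candidate
            accepted += 1
        samples.append(x)                   # on rejection the chain stays at x_i
    return samples, accepted / n


samples, rate = metropolis_hastings(log_target, x0=0.0, n=20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"acceptance rate = {rate:.2f}, mean = {mean:.2f}, variance = {var:.2f}")
```

Because the assumed proposal is symmetric, the two `log_q` terms cancel; they are kept so the code mirrors the general acceptance probability and still works if `log_q` is swapped for an asymmetric proposal.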
The Metropolis-Hastings Algorithm (Contd.)
o By construction
    $$p(x_i)\, K_{MH}(x_{i+1} \mid x_i) = p(x_{i+1})\, K_{MH}(x_i \mid x_{i+1})$$
o Implies $p(x)$ is the invariant distribution
o Basic properties
  o Irreducibility, ensure support of $q$ contains support of $p$
  o Aperiodicity, ensured since rejection is always a possibility
o Independent sampler: $q(x^* \mid x_i) = q(x^*)$ so that
    $$A(x_i, x^*) = \min\left\{1, \frac{p(x^*)\, q(x_i)}{q(x^*)\, p(x_i)}\right\}$$
o Metropolis sampler: symmetric $q(x^* \mid x_i) = q(x_i \mid x^*)$
    $$A(x_i, x^*) = \min\left\{1, \frac{p(x^*)}{p(x_i)}\right\}$$

The Metropolis-Hastings Algorithm (Contd.)
[Figure: MCMC approximation vs. target distribution; panel labels include $\sigma^* = 1$ and $\sigma^* = 100$]

Mixtures of MCMC Kernels
o Powerful property of MCMC: combination of samplers
o Let $K_1$, $K_2$ be kernels with invariant distribution $p$
  o Mixture kernel $\alpha K_1 + (1 - \alpha) K_2$, $\alpha \in [0, 1]$, converges to $p$
  o Cycle kernel $K_1 K_2$ converges to $p$
o Mixtures can use global and local proposals
  o Global proposals explore the entire space (with probability $\alpha$)
  o Local proposals discover finer details (with probability $(1 - \alpha)$)
o Example: Target has many narrow peaks
  o Global proposal gets the peaks
  o Local proposals get the neighborhood of peaks (random walk)
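A mixture kernel of the kind described above can be sketched as follows. Everything specific here is an assumption made for illustration: the two-peak target (`PEAKS`, `WIDTH`), the wide independent proposal used as the global kernel, the narrow random walk used as the local kernel, and the values of `alpha`, `global_scale`, and `local_step`. At each iteration the global kernel is chosen with probability `alpha`, otherwise the local kernel is used; each is an ordinary Metropolis-Hastings step.

```python
import math
import random

PEAKS = (-5.0, 5.0)  # hypothetical peak locations
WIDTH = 0.3          # hypothetical peak standard deviation


def log_p(x):
    """Unnormalized log-density of an equal mixture of two narrow Gaussians."""
    terms = [-0.5 * ((x - m) / WIDTH) ** 2 for m in PEAKS]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def log_q_global(x, scale):
    """Log-density (up to a constant) of the independent proposal N(0, scale^2)."""
    return -0.5 * (x / scale) ** 2


def mixture_sampler(n, alpha=0.2, global_scale=6.0, local_step=0.3, seed=1):
    """Mixture kernel: global independent MH w.p. alpha, local random walk otherwise."""
    rng = random.Random(seed)
    x = PEAKS[0]
    samples = []
    for _ in range(n):
        if rng.random() < alpha:
            # K1: independent sampler, A = min{1, p(x*) q(x) / (q(x*) p(x))}
            x_star = rng.gauss(0.0, global_scale)
            log_ratio = (log_p(x_star) + log_q_global(x, global_scale)
                         - log_p(x) - log_q_global(x_star, global_scale))
        else:
            # K2: Metropolis random walk, A = min{1, p(x*) / p(x)}
            x_star = x + rng.gauss(0.0, local_step)
            log_ratio = log_p(x_star) - log_p(x)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = x_star
        samples.append(x)
    return samples


samples = mixture_sampler(50000)
frac_right = sum(s > 0 for s in samples) / len(samples)
print(f"fraction of samples near the right-hand peak: {frac_right:.2f}")
```

With only the local random walk, a chain started at one peak would be very unlikely to reach the other in this setup; the occasional global proposal is what lets the chain move between peaks.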
Cycles of MCMC Kernels
o Split a multivariate state into blocks
o Each block can be updated separately
o Convergence is faster if correlated variables are blocked
o Transition kernel is given by
    $$K_{MH\text{-}Cycle}\big(x^{(i+1)} \mid x^{(i)}\big) = \prod_{j=1}^{n_b} K_{MH(j)}\big(x_{b_j}^{(i+1)} \mid x_{b_j}^{(i)}, x_{-b_j}^{(i+1)}\big)$$
o Tradeoff on block size
  o If block size is small, chain takes long time to explore the space
  o If block size is large, acceptance probability is low
o Gibbs sampling effectively uses block size of 1

The Gibbs Sampler
o For a $d$-dimensional vector $x$, assume we know
    $$p(x_j \mid x_{-j}) = p(x_j \mid x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_d)$$
o Gibbs sampler uses the following proposal distribution
    $$q\big(x^* \mid x^{(i)}\big) = \begin{cases} p\big(x_j^* \mid x_{-j}^{(i)}\big) & \text{if } x_{-j}^* = x_{-j}^{(i)} \\ 0 & \text{otherwise} \end{cases}$$
o The acceptance probability
    $$A\big(x^{(i)}, x^*\big) = \min\left\{1, \frac{p(x^*)\, q\big(x^{(i)} \mid x^*\big)}{p\big(x^{(i)}\big)\, q\big(x^* \mid x^{(i)}\big)}\right\} = 1$$
o Deterministic scan: All samples are accepted

The Gibbs Sampler (Contd.)
o Initialize $x^{(0)}$. For $i = 0, \ldots, (N - 1)$
  o Sample $x_1^{(i+1)} \sim p\big(x_1 \mid x_2^{(i)}, x_3^{(i)}, \ldots, x_d^{(i)}\big)$
  o Sample $x_2^{(i+1)} \sim p\big(x_2 \mid x_1^{(i+1)}, x_3^{(i)}, \ldots, x_d^{(i)}\big)$
  o ...
  o Sample $x_d^{(i+1)} \sim p\big(x_d \mid x_1^{(i+1)}, \ldots, x_{d-1}^{(i+1)}\big)$
o Possible to have MH steps inside a Gibbs sampler
o For $d = 2$, Gibbs sampler is the data augmentation algorithm
o For Bayes nets, the conditioning is on the Markov blanket
    $$p(x_j \mid x_{-j}) \propto p\big(x_j \mid x_{pa(j)}\big) \prod_{k \in ch(j)} p\big(x_k \mid x_{pa(k)}\big)$$

Simulated Annealing
o Problem: To find global maximum of $p(x)$
o Initial idea: Run MCMC, estimate $p(x)$, compute max
o Issue: the Markov chain may not come close to the mode(s)
o Simulate a non-homogeneous Markov chain
o Invariant distribution at iteration $i$ is $p_i(x) \propto p^{1/T_i}(x)$
o Sample update follows
    $$x_{i+1} = \begin{cases} x^* & \text{if } u < A(x_i, x^*) = \min\left\{1, \frac{p^{1/T_i}(x^*)\, q(x_i \mid x^*)}{p^{1/T_i}(x_i)\, q(x^* \mid x_i)}\right\} \\ x_i & \text{otherwise} \end{cases}$$
o $T_i$ decreases following a cooling schedule, $\lim_{i \to \infty} T_i = 0$
o Cooling schedule needs proper choice, e.g., $T_i = \frac{1}{C \log(i + T_0)}$

Simulated Annealing (Contd.)
[Figure]

Hybrid Monte Carlo
o Uses gradient of the target distribution
o Improves "mixing" in high dimensions
o Effectively, take steps based on gradient of $p(x)$
o Introduce auxiliary momentum variables $u \in \mathbb{R}^d$ with
    $$p(x, u) = p(x)\, \mathcal{N}(u; 0, I_d)$$
o Gradient vector $\Delta(x) = \partial \log p(x) / \partial x$, stepsize $\rho$
o Gradient descent for $L$ steps to get proposal candidate
o When $L = 1$, one obtains the Langevin algorithm
    $$x^* = x_0 + \rho u_0 = x^{(i-1)} + \rho\big(u^* + \rho \Delta(x^{(i-1)})/2\big)$$
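The gradient-guided proposal can be made concrete with a short sketch. The target here (`SCALES`, an axis-aligned Gaussian with hypothetical standard deviations), the helper names `log_p`, `grad_log_p`, and `hmc`, and the values of `rho` and `L` are assumptions for illustration only. The momentum gets an initial half step, positions and momenta then alternate for `L` steps with a final half step for the momentum, and the candidate is accepted with a Metropolis test on the joint density $p(x)\,\mathcal{N}(u; 0, I_d)$.

```python
import math
import random

SCALES = (1.0, 2.0)  # hypothetical per-coordinate standard deviations


def log_p(x):
    """Unnormalized log-density of an axis-aligned Gaussian target."""
    return -0.5 * sum((xk / s) ** 2 for xk, s in zip(x, SCALES))


def grad_log_p(x):
    """Gradient vector Delta(x) = d log p(x) / dx."""
    return [-xk / s ** 2 for xk, s in zip(x, SCALES)]


def hmc(n, rho=0.15, L=10, seed=2):
    """Hybrid Monte Carlo with stepsize rho and L steps per proposal."""
    rng = random.Random(seed)
    x = [0.0] * len(SCALES)
    samples = []
    accepted = 0
    for _ in range(n):
        v = rng.random()                                 # v ~ U[0, 1]
        u_star = [rng.gauss(0.0, 1.0) for _ in x]        # u* ~ N(0, I_d)
        # Initial half step for the momentum.
        u = [uk + 0.5 * rho * gk for uk, gk in zip(u_star, grad_log_p(x))]
        x_new = list(x)
        for step in range(1, L + 1):
            x_new = [xk + rho * uk for xk, uk in zip(x_new, u)]
            rho_l = rho if step < L else 0.5 * rho       # half step at the end
            u = [uk + rho_l * gk for uk, gk in zip(u, grad_log_p(x_new))]
        # log of p(x_L)/p(x) * exp(-(|u_L|^2 - |u*|^2) / 2)
        log_ratio = (log_p(x_new) - log_p(x)
                     - 0.5 * (sum(uk * uk for uk in u)
                              - sum(uk * uk for uk in u_star)))
        if v < math.exp(min(0.0, log_ratio)):
            x = x_new
            accepted += 1
        samples.append(x)
    return samples, accepted / n


samples, rate = hmc(5000)
var0 = sum(s[0] ** 2 for s in samples) / len(samples)
var1 = sum(s[1] ** 2 for s in samples) / len(samples)
print(f"acceptance rate = {rate:.2f}, second moments = {var0:.2f}, {var1:.2f}")
```

Setting `L=1` in this sketch gives the Langevin-style single-step proposal.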
Auxiliary Variable Samplers
o Sometimes easier to sample from $p(x, u)$ rather than $p(x)$
o Sample $(x_i, u_i)$, and then ignore $u_i$
o Consider two well-known examples:
  o Hybrid Monte Carlo
  o Slice sampling

Hybrid Monte Carlo (Contd.)
o Initialize $x^{(0)}$. For $i = 0, \ldots, (n - 1)$
  1. Sample $v \sim \mathcal{U}[0, 1]$, $u^* \sim \mathcal{N}(0, I_d)$
  2. Let $x_0 = x^{(i)}$, $u_0 = u^* + \rho \Delta(x_0)/2$. For $\ell = 1, \ldots, L$, with $\rho_\ell = \rho$ for $\ell < L$ and $\rho_L = \rho/2$:
    $$x_\ell = x_{\ell-1} + \rho\, u_{\ell-1}, \qquad u_\ell = u_{\ell-1} + \rho_\ell\, \Delta(x_\ell)$$
  3. Set
    $$\big(x^{(i+1)}, u^{(i+1)}\big) = \begin{cases} (x_L, u_L) & \text{if } v < \min\left\{1, \frac{p(x_L)}{p(x^{(i)})} \exp\left(-\tfrac{1}{2}\big(\|u_L\|^2 - \|u^*\|^2\big)\right)\right\} \\ \big(x^{(i)}, u^*\big) & \text{otherwise} \end{cases}$$
o Tradeoffs for $\rho$, $L$
  o Large $\rho$ gives low acceptance, small $\rho$ needs many steps
  o Large $L$ gives candidates far from $x_0$, but expensive

The Slice Sampler
o Construct extended target distribution
    $$p^*(x, u) = \begin{cases} 1 & \text{if } 0 \le u \le p(x) \\ 0 & \text{otherwise} \end{cases}$$
o It follows that: $\int p^*(x, u)\, du = \int_0^{p(x)} du = p(x)$
o From the Gibbs sampling perspective
    $$p(u \mid x) = \mathcal{U}[0, p(x)], \qquad p(x \mid u) = \mathcal{U}_A, \quad A = \{x : p(x) \ge u\}$$
o Algorithm is easy if $A$ is easy to figure out
o Otherwise, several auxiliary variables need to be introduced
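The two Gibbs steps of the slice sampler can be sketched for a case where the set $A$ is available in closed form. The target below, an unnormalized standard normal $p(x) = e^{-x^2/2}$, is a hypothetical choice made for this example, as are the names `p`, `slice_sampler`, and `seed`; for this target $A = \{x : p(x) \ge u\}$ is the interval $[-\sqrt{-2 \log u}, \sqrt{-2 \log u}]$.

```python
import math
import random


def p(x):
    """Unnormalized target density (hypothetical choice): exp(-x^2 / 2)."""
    return math.exp(-0.5 * x * x)


def slice_sampler(n, x0=0.0, seed=3):
    """Alternate u ~ U[0, p(x)] and x ~ U(A), with A = {x : p(x) >= u}."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n):
        # Step 1: sample the auxiliary variable given x.
        u = rng.uniform(0.0, p(x))
        u = max(u, 1e-300)  # guard against log(0) in the next line
        # Step 2: for this target, A is a symmetric interval around zero.
        half_width = math.sqrt(-2.0 * math.log(u))
        x = rng.uniform(-half_width, half_width)
        samples.append(x)
    return samples


samples = slice_sampler(20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean = {mean:.2f}, variance = {var:.2f}")
```

No proposal is rejected here: both steps draw exactly from the conditionals, which is why the method is attractive whenever $A$ can be computed.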
The Slice Sampler (Contd.)
 Spring '08
 Staff
