lec6: Monte Carlo Principle

CSci 5512: Artificial Intelligence II
Instructor: Arindam Banerjee
February 6, 2012

Monte Carlo Principle
- Target density p(x) on a high-dimensional space X
- Draw i.i.d. samples {x_i}_{i=1}^n from p(x)
- Construct the empirical point-mass function from the samples
- One can then approximate integrals/sums by
    I(f) = ∫_X f(x) p(x) dx  ≈  I_n(f) = (1/n) ∑_{i=1}^n f(x_i)
- The unbiased estimate I_n(f) converges by the strong law of large numbers
- For finite σ_f², the central limit theorem implies
    √n (I_n(f) - I(f))  ⇒  N(0, σ_f²)

Applications
- Primarily of two types: integration and optimization
- Bayesian inference and learning:
  - Computing the normalization in Bayesian methods:
      p(y|x) = p(y) p(x|y) / ∫ p(y') p(x|y') dy'
  - Marginalization: p(y|x) = ∫_Z p(y, z|x) dz
  - Expectation: E_{y|x}[f(y)] = ∫ f(y) p(y|x) dy
- Statistical mechanics: computing the partition function
- Optimization, model selection, etc.

Rejection Sampling
- The target density p(x) is known but hard to sample from
- Use an easy-to-sample proposal distribution q(x)
- q(x) satisfies p(x) ≤ M q(x) for some constant M < ∞
- Algorithm: for i = 1, ..., n
  - Sample x_i ~ q(x) and u ~ U(0, 1)
  - If u < p(x_i) / (M q(x_i)), accept x_i; else reject
- Issues:
  - Tricky to bound p(x)/q(x) with a reasonable constant M
  - If M is too large, the acceptance probability is small
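The rejection sampling loop can be sketched in a few lines of Python. The half-normal target, the Exp(1) proposal, and the bound M below are my own toy choices for illustration, not from the lecture; the accepted samples are then reused for the Monte Carlo average I_n(f).

```python
import math
import random

def rejection_sample(p, q_sample, q_pdf, M, n):
    """Draw n proposals from q; keep those accepted under p(x) <= M * q(x)."""
    accepted = []
    for _ in range(n):
        x = q_sample()
        u = random.random()              # u ~ U(0, 1)
        if u < p(x) / (M * q_pdf(x)):
            accepted.append(x)           # accept x_i; otherwise reject
    return accepted

# Toy target: half-normal on [0, inf), proposed from Exp(1).
p = lambda x: math.sqrt(2 / math.pi) * math.exp(-x * x / 2)
q_pdf = lambda x: math.exp(-x)
q_sample = lambda: random.expovariate(1.0)
M = math.sqrt(2 * math.e / math.pi)      # max_x p(x)/q(x), attained at x = 1

random.seed(0)
samples = rejection_sample(p, q_sample, q_pdf, M, 100_000)
# Monte Carlo estimate I_n(f) = (1/n) * sum f(x_i), here with f(x) = x
mean_est = sum(samples) / len(samples)
```

The acceptance rate is 1/M ≈ 0.76 here; a looser bound M would waste proportionally more proposals.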
Rejection Sampling (Contd.)
[Figure: the scaled proposal M q(x) envelopes p(x); proposals landing between p(x) and M q(x) fall in the reject region]

Markov Chains
- Use a Markov chain to explore the state space
- A Markov chain in a discrete space is a process with
    p(x_i | x_{i-1}, ..., x_1) = T(x_i | x_{i-1})
- A chain is homogeneous if T is invariant for all i
- The chain stabilizes into an invariant distribution if it is
  - Irreducible: the transition graph is connected
  - Aperiodic: it does not get trapped in cycles
- A sufficient condition ensuring p(x) is the invariant distribution is detailed balance:
    p(x_i) T(x_{i-1} | x_i) = p(x_{i-1}) T(x_i | x_{i-1})
- MCMC samplers: invariant distribution = target distribution
- Samplers are designed for fast convergence

Markov Chains (Contd.)
- Random walker on the web:
  - Irreducibility: the walker should be able to reach all pages
  - Aperiodicity: the walker does not get stuck in a loop
- PageRank uses T = L + E
  - L = link matrix for the web graph
  - E = uniform random matrix, to ensure irreducibility and aperiodicity
- The invariant distribution p(x) represents the rank of webpage x
- In continuous spaces, T becomes an integral kernel K:
    ∫ p(x_i) K(x_{i+1} | x_i) dx_i = p(x_{i+1})
- p(x) is the corresponding eigenfunction

The Metropolis-Hastings Algorithm
- Most popular MCMC method
- Based on a proposal distribution q(x*|x)
- Algorithm: for i = 0, ..., (n - 1)
  - Sample u ~ U(0, 1)
  - Sample x* ~ q(x*|x_i)
  - Then
      x_{i+1} = x*   if u < A(x_i, x*) = min{1, [p(x*) q(x_i|x*)] / [p(x_i) q(x*|x_i)]}
      x_{i+1} = x_i  otherwise
- The transition kernel is
    K_MH(x_{i+1} | x_i) = q(x_{i+1} | x_i) A(x_i, x_{i+1}) + δ_{x_i}(x_{i+1}) r(x_i)
  where r(x_i) is the term associated with rejection:
    r(x_i) = ∫_X q(x | x_i) (1 - A(x_i, x)) dx
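The Metropolis-Hastings loop can be sketched as follows. The unnormalized Gaussian target, the random-walk proposal, and all parameter values are my own illustrative choices, not from the lecture:

```python
import math
import random

def metropolis_hastings(p, q_sample, q_pdf, x0, n):
    """General MH chain: p is the (unnormalized) target, q the proposal.

    q_pdf(y, x) is the (possibly unnormalized) density of y given x.
    """
    x = x0
    chain = []
    for _ in range(n):
        x_star = q_sample(x)                 # x* ~ q(x*|x_i)
        u = random.random()                  # u ~ U(0, 1)
        # A(x, x*) = min{1, p(x*) q(x|x*) / (p(x) q(x*|x))}
        a = (p(x_star) * q_pdf(x, x_star)) / (p(x) * q_pdf(x_star, x))
        if u < min(1.0, a):
            x = x_star                       # accept; otherwise keep x_i
        chain.append(x)
    return chain

# Toy run: unnormalized N(0, 1) target, Gaussian random-walk proposal.
p = lambda x: math.exp(-x * x / 2)
sigma = 1.0
q_sample = lambda x: random.gauss(x, sigma)
# Proposal is symmetric, so normalizing constants cancel in the ratio.
q_pdf = lambda y, x: math.exp(-(y - x) ** 2 / (2 * sigma ** 2))

random.seed(0)
chain = metropolis_hastings(p, q_sample, q_pdf, 0.0, 50_000)
burned = chain[5_000:]                       # discard burn-in
mean_est = sum(burned) / len(burned)
var_est = sum(v * v for v in burned) / len(burned) - mean_est ** 2
```

Because this proposal is symmetric, the q terms cancel and the chain reduces to the Metropolis sampler discussed next.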
The Metropolis-Hastings Algorithm (Contd.)
[Figure: MCMC approximation vs. target distribution for proposal scales σ* = 1, 10, 100]

The Metropolis-Hastings Algorithm (Contd.)
- By construction, the kernel satisfies detailed balance:
    p(x_i) K_MH(x_{i+1} | x_i) = p(x_{i+1}) K_MH(x_i | x_{i+1})
- This implies p(x) is the invariant distribution
- Basic properties:
  - Irreducibility: ensure the support of q contains the support of p
  - Aperiodicity: ensured, since rejection is always a possibility
- Independent sampler: q(x*|x_i) = q(x*), so that
    A(x_i, x*) = min{1, [p(x*) q(x_i)] / [q(x*) p(x_i)]}
- Metropolis sampler: symmetric q(x*|x_i) = q(x_i|x*), so that
    A(x_i, x*) = min{1, p(x*) / p(x_i)}

Mixtures of MCMC Kernels
- Powerful property of MCMC: combination of samplers
- Let K_1, K_2 be kernels with invariant distribution p
  - The mixture kernel αK_1 + (1 - α)K_2, α ∈ [0, 1], converges to p
  - The cycle kernel K_1 K_2 converges to p
- Mixtures can use global and local proposals:
  - Global proposals explore the entire space (with probability α)
  - Local proposals discover finer details (with probability 1 - α)
- Example: the target has many narrow peaks
  - The global proposal finds the peaks
  - Local proposals explore the neighborhood of each peak (random walk)

Cycles of MCMC Kernels
- Split a multivariate state into blocks
- Each block can be updated separately
- Convergence is faster if correlated variables are blocked together
- The transition kernel is the cycle (composition) of the per-block MH kernels

The Gibbs Sampler (Contd.)
- Initialize x^(0). For i = 0, ..., (N - 1):
  - Sample x_1^(i+1) ~ p(x_1 | x_2^(i), x_3^(i), ..., x_d^(i))
  - Sample x_2^(i+1) ~ p(x_2 | x_1^(i+1), x_3^(i), ..., x_d^(i))
  - ...
  - Sample x_d^(i+1) ~ p(x_d | x_1^(i+1), ..., x_{d-1}^(i+1))
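A Gibbs scan of this form can be made concrete for a toy bivariate Gaussian, where both full conditionals are known in closed form. The zero-mean unit-variance target with correlation 0.8, and the chain length, are my own illustrative choices:

```python
import random

def gibbs_bivariate_normal(rho, n):
    """Gibbs scan for a 2-d Gaussian N(0, [[1, rho], [rho, 1]]).

    Full conditionals: x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically.
    """
    sd = (1.0 - rho * rho) ** 0.5
    x1, x2 = 0.0, 0.0
    chain = []
    for _ in range(n):
        x1 = random.gauss(rho * x2, sd)  # sample x1^(i+1) ~ p(x1 | x2^(i))
        x2 = random.gauss(rho * x1, sd)  # sample x2^(i+1) ~ p(x2 | x1^(i+1))
        chain.append((x1, x2))
    return chain

random.seed(0)
chain = gibbs_bivariate_normal(0.8, 50_000)
burned = chain[5_000:]
# Both coordinates have mean 0 and variance 1, so E[x1 * x2] = rho.
mean1 = sum(a for a, _ in burned) / len(burned)
corr_est = sum(a * b for a, b in burned) / len(burned)
```

Each coordinate is a block of size 1, matching the remark that Gibbs sampling effectively uses block size 1.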
Cycles of MCMC Kernels (Contd.)
- The cycle transition kernel is the product of the per-block MH kernels:
    K_MH-Cycle(x^(i+1) | x^(i)) = ∏_j K_MH^(j)(x_{b_j}^(i+1) | x_{b_j}^(i), x_{-b_j})
  where x_{-b_j} contains the already-updated blocks from iteration i+1 and the remaining blocks from iteration i
- It is possible to have MH steps inside a Gibbs sampler
- For d = 2, the Gibbs sampler is the data augmentation algorithm
- For Bayes nets, the conditioning is on the Markov blanket:
    p(x_j | x_-j) ∝ p(x_j | x_pa(j)) ∏_{k ∈ ch(j)} p(x_k | x_pa(k))
- Trade-off on block size:
  - If the block size is small, the chain takes a long time to explore the space
  - If the block size is large, the acceptance probability is low
- Gibbs sampling effectively uses a block size of 1

The Gibbs Sampler
- For a d-dimensional vector x, assume we know the full conditionals
    p(x_j | x_-j) = p(x_j | x_1, ..., x_{j-1}, x_{j+1}, ..., x_d)
- The Gibbs sampler uses the following proposal distribution:
    q(x* | x^(i)) = p(x_j* | x_-j^(i))   if x_-j* = x_-j^(i)
                  = 0                    otherwise
- The acceptance probability is
    A(x^(i), x*) = min{1, [p(x*) q(x^(i)|x*)] / [p(x^(i)) q(x*|x^(i))]} = 1
- Deterministic scan: all samples are accepted

Simulated Annealing
- Problem: find the global maximum of p(x)
- Initial idea: run MCMC, estimate p(x), compute the max
- Issue: the chain may not come close to the mode(s)
- Instead, simulate a non-homogeneous Markov chain whose invariant distribution at iteration i is
    p_i(x) ∝ p^{1/T_i}(x)
- The sample update follows
    x_{i+1} = x*   if u < A(x_i, x*) = min{1, [p^{1/T_i}(x*) q(x_i|x*)] / [p^{1/T_i}(x_i) q(x*|x_i)]}
    x_{i+1} = x_i  otherwise
- T_i decreases following a cooling schedule, with lim_{i→∞} T_i = 0
- The cooling schedule needs a proper choice, e.g., logarithmic cooling T_i ∝ 1 / ln(i + c)
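The annealing update can be sketched as follows. The double-well target with global modes at x = +/-1, the random-walk proposal, and the particular logarithmic schedule are my own toy choices; the acceptance test is done on the log scale to avoid overflow as T_i shrinks:

```python
import math
import random

def simulated_annealing(log_p, x0, n, sigma=0.5):
    """Anneal toward the modes of p: at step i the chain targets p^(1/T_i)."""
    x, lp = x0, log_p(x0)
    x_best, lp_best = x, lp
    for i in range(n):
        T = 1.0 / math.log(i + 2)            # logarithmic cooling, T_i -> 0
        x_star = random.gauss(x, sigma)      # symmetric random-walk proposal
        lp_star = log_p(x_star)
        # Metropolis acceptance for p^(1/T): u in (0, 1], compared in logs
        if math.log(1.0 - random.random()) < (lp_star - lp) / T:
            x, lp = x_star, lp_star
        if lp > lp_best:                     # track the best state seen
            x_best, lp_best = x, lp
    return x_best

# Toy target with two global modes at x = +/-1: p(x) = exp(-(x^2 - 1)^2)
log_p = lambda x: -(x * x - 1.0) ** 2

random.seed(0)
x_best = simulated_annealing(log_p, x0=5.0, n=20_000)
```

As T_i falls, uphill moves in p are accepted ever more rarely, so the chain settles into one of the global modes.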
Simulated Annealing (Contd.)
[Figure: annealed distributions p^{1/T_i}(x) sharpening around the global mode as T_i decreases]

Auxiliary Variable Samplers
- Sometimes it is easier to sample from p(x, u) than from p(x)
- Sample (x_i, u_i), and then ignore u_i
- Consider two well-known examples:
  - Hybrid Monte Carlo
  - Slice sampling

Hybrid Monte Carlo
- Uses the gradient of the target distribution
- Improves "mixing" in high dimensions
- Effectively, takes steps based on the gradient of p(x)
- Introduce auxiliary momentum variables u ∈ R^d with
    p(x, u) = p(x) N(u; 0, I_d)
- Gradient vector Δ(x) = ∂ log p(x)/∂x, step size ρ
- Follow the gradient for L steps to get the proposal candidate
- When L = 1, one obtains the Langevin algorithm:
    x* = x_0 + ρ u_0 = x^(i) + ρ (u* + ρ Δ(x^(i))/2)

Hybrid Monte Carlo (Contd.)
- Initialize x^(0). For i = 0, ..., (n - 1):
  - Sample v ~ U[0, 1], u* ~ N(0, I_d)
  - Let x_0 = x^(i), u_0 = u* + ρ Δ(x_0)/2
  - For ℓ = 1, ..., L, with ρ_ℓ = ρ for ℓ < L and ρ_L = ρ/2:
      x_ℓ = x_{ℓ-1} + ρ u_{ℓ-1}
      u_ℓ = u_{ℓ-1} + ρ_ℓ Δ(x_ℓ)
  - Set
      (x^(i+1), u^(i+1)) = (x_L, u_L)   if v < min{1, [p(x_L)/p(x^(i))] exp(-(‖u_L‖² - ‖u*‖²)/2)}
      (x^(i+1), u^(i+1)) = (x^(i), u*)  otherwise
- Trade-offs for ρ, L:
  - Large ρ gives low acceptance; small ρ needs many steps
  - Large L gives candidates far from x_0, but is expensive

The Slice Sampler
- Construct the extended target distribution
    p*(x, u) = 1   if 0 ≤ u ≤ p(x)
             = 0   otherwise
- It follows that ∫ p*(x, u) du = ∫_0^{p(x)} du = p(x)
- From the Gibbs sampling perspective:
    p(u | x) = U[0, p(x)]
    p(x | u) = U_A,  where A = {x : p(x) ≥ u}
- The algorithm is easy if A is easy to figure out
- Otherwise, several auxiliary variables need to be introduced

The Slice Sampler (Contd.)
[Figure: alternately sampling a height u under p(x) and a point x uniformly on the horizontal slice A]
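For a target where the slice A has a closed form, the two Gibbs steps of the slice sampler fit in a few lines. The unnormalized Gaussian target below is my own toy choice; for it, exp(-x²/2) ≥ u exactly when |x| ≤ sqrt(-2 ln u):

```python
import math
import random

def slice_sample_std_normal(n, x0=0.0):
    """Slice sampler for the unnormalized target p(x) = exp(-x^2 / 2)."""
    x = x0
    chain = []
    for _ in range(n):
        # u | x ~ U[0, p(x)]; use (0, 1] scaling so log(u) is always defined
        u = (1.0 - random.random()) * math.exp(-x * x / 2)
        w = math.sqrt(-2.0 * math.log(u))      # half-width of the slice A
        x = random.uniform(-w, w)              # x | u ~ Uniform(A)
        chain.append(x)
    return chain

random.seed(0)
chain = slice_sample_std_normal(50_000)
mean_est = sum(chain) / len(chain)
var_est = sum(v * v for v in chain) / len(chain) - mean_est ** 2
```

Every step is accepted, and for this symmetric unimodal target the chain mixes almost like i.i.d. sampling; for multimodal targets, determining A is the hard part, as the slide notes.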