Posterior approximation with the Gibbs sampler

For many multiparameter models the joint posterior distribution is nonstandard and difficult to sample from directly. However, it is often the case that it is easy to sample from the full conditional distribution of each parameter. In such cases, posterior approximation can be made with the Gibbs sampler, an iterative algorithm that constructs a dependent sequence of parameter values whose distribution converges to the target joint posterior distribution. In this chapter we outline the Gibbs sampler in the context of the normal model with a semiconjugate prior distribution, and discuss how well the method is able to approximate the posterior distribution.

6.1 A semiconjugate prior distribution

In the previous chapter we modeled our uncertainty about $\theta$ as depending on $\sigma^2$: $p(\theta|\sigma^2) = \mathrm{dnorm}(\theta, \mu_0, \tau_0 = \sigma/\sqrt{\kappa_0})$. This prior distribution relates the prior variance of $\theta$ to the sampling variance of our data in such a way that $\kappa_0$ can be thought of as the number of prior samples from the population. In some situations this makes sense, but in others we may want to specify our uncertainty about $\theta$ as being independent of $\sigma^2$, so that $p(\theta, \sigma^2) = p(\theta)\,p(\sigma^2)$. One such joint distribution is the following "semiconjugate" prior distribution:

$$\theta \sim \text{normal}(\mu_0, \tau_0^2)$$
$$1/\sigma^2 \sim \text{gamma}(\nu_0/2,\ \nu_0\sigma_0^2/2).$$

If $\{Y_1, \ldots, Y_n \mid \theta, \sigma^2\} \sim$ i.i.d. normal$(\theta, \sigma^2)$, we showed in Section 5.2 that $\{\theta \mid \sigma^2, y_1, \ldots, y_n\} \sim \text{normal}(\mu_n, \tau_n^2)$, with

$$\mu_n = \frac{\mu_0/\tau_0^2 + n\bar{y}/\sigma^2}{1/\tau_0^2 + n/\sigma^2} \quad \text{and} \quad \tau_n^2 = \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}\right)^{-1}.$$

P.D. Hoff, A First Course in Bayesian Statistical Methods, Springer Texts in Statistics, DOI 10.1007/978-0-387-92407-6_6, © Springer Science+Business Media, LLC 2009

In the conjugate case where $\tau_0^2$ was proportional to $\sigma^2$, we showed that $p(\sigma^2 \mid y_1, \ldots, y_n)$ was an inverse-gamma distribution, and that a Monte Carlo sample of $\{\theta, \sigma^2\}$ from their joint posterior distribution could be obtained by sampling

1. a value $\sigma^{2(s)}$ from $p(\sigma^2 \mid y_1, \ldots, y_n)$, an inverse-gamma distribution, then
2. a value $\theta^{(s)}$ from $p(\theta \mid \sigma^{2(s)}, y_1, \ldots, y_n)$, a normal distribution.

However, in the case where $\tau_0^2$ is not proportional to $\sigma^2$, the marginal density of $1/\sigma^2$ is not a gamma distribution, or any other standard distribution from which we can easily sample.

6.2 Discrete approximations

Letting $\tilde\sigma^2 = 1/\sigma^2$ be the precision, recall that the posterior distribution of $\{\theta, \tilde\sigma^2\}$ is equal to the joint distribution of $\{\theta, \tilde\sigma^2, y_1, \ldots, y_n\}$, divided by $p(y_1, \ldots, y_n)$, which does not depend on the parameters. Therefore the relative posterior probability of one set of parameter values $\{\theta_1, \tilde\sigma_1^2\}$ to another $\{\theta_2, \tilde\sigma_2^2\}$ is directly computable:

$$\frac{p(\theta_1, \tilde\sigma_1^2 \mid y_1, \ldots, y_n)}{p(\theta_2, \tilde\sigma_2^2 \mid y_1, \ldots, y_n)} = \frac{p(\theta_1, \tilde\sigma_1^2, y_1, \ldots, y_n)/p(y_1, \ldots, y_n)}{p(\theta_2, \tilde\sigma_2^2, y_1, \ldots, y_n)/p(y_1, \ldots, y_n)} = \frac{p(\theta_1, \tilde\sigma_1^2, y_1, \ldots, y_n)}{p(\theta_2, \tilde\sigma_2^2, y_1, \ldots, y_n)}.$$

The joint distribution is easy to compute, as it was built out of standard prior and sampling distributions:

$$p(\theta, \tilde\sigma^2, y_1, \ldots, y_n) = p(\theta, \tilde\sigma^2)\, p(y_1, \ldots, y_n \mid \theta, \tilde\sigma^2)$$
$$= \mathrm{dnorm}(\theta, \mu_0, \tau_0) \times \mathrm{dgamma}(\tilde\sigma^2, \nu_0/2, \nu_0\sigma_0^2/2) \times \prod_{i=1}^n \mathrm{dnorm}(y_i, \theta, 1/\sqrt{\tilde\sigma^2}).$$

A discrete approximation to the posterior distribution makes use of these facts by constructing a posterior distribution over a grid of parameter values, based on relative posterior probabilities. This is done by evaluating $p(\theta, \tilde\sigma^2, y_1, \ldots, y_n)$ on a two-dimensional grid of values of $\{\theta, \tilde\sigma^2\}$. Letting $\{\theta_1, \ldots, \theta_G\}$ and $\{\tilde\sigma_1^2, \ldots, \tilde\sigma_H^2\}$ be sequences of evenly spaced parameter values, the discrete approximation to the posterior distribution assigns a posterior probability to each pair $\{\theta_k, \tilde\sigma_l^2\}$ on the grid, given by

$$p_D(\theta_k, \tilde\sigma_l^2 \mid y_1, \ldots, y_n) = \frac{p(\theta_k, \tilde\sigma_l^2 \mid y_1, \ldots, y_n)}{\sum_{g=1}^G \sum_{h=1}^H p(\theta_g, \tilde\sigma_h^2 \mid y_1, \ldots, y_n)} = \frac{p(\theta_k, \tilde\sigma_l^2, y_1, \ldots, y_n)}{\sum_{g=1}^G \sum_{h=1}^H p(\theta_g, \tilde\sigma_h^2, y_1, \ldots, y_n)}.$$

This is a real joint probability distribution for $\{\theta_1, \ldots, \theta_G\}$ and $\{\tilde\sigma_1^2, \ldots, \tilde\sigma_H^2\}$, in the sense that it sums to 1. In fact, it is the actual posterior distribution of $\{\theta, \tilde\sigma^2\}$ if the joint prior distribution for these parameters is discrete on this grid.

Let's try the approximation for the midge data from the previous chapter. Recall that our data were $\{n = 9, \bar y = 1.804, s^2 = 0.017\}$. The conjugate prior distribution on $\theta$ and $\sigma^2$ of Chapter 5 required that the prior variance on $\theta$ be $\sigma^2/\kappa_0$, i.e. proportional to the sampling variance. A small value of the sampling variance then has the possibly undesirable effect of reducing the nominal prior uncertainty for $\theta$. In contrast, the semiconjugate prior distribution frees us from this constraint. Recall that we first suggested that the prior mean and standard deviation of $\theta$ should be $\mu_0 = 1.9$ and $\tau_0 = 0.95$, as this would put most of the prior mass on $\theta > 0$, which we know to be true. For $\sigma^2$, let's use prior parameters of $\nu_0 = 1$ and $\sigma_0^2 = 0.01$.

The R-code below evaluates $p(\theta, \tilde\sigma^2 \mid y_1, \ldots, y_n)$ on a 100×100 grid of evenly spaced parameter values, with $\theta \in \{1.505, 1.510, \ldots, 1.995, 2.000\}$ and $\tilde\sigma^2 \in \{1.75, 3.5, \ldots, 173.25, 175.0\}$. The first panel of Figure 6.1 gives the discrete approximation to the joint distribution of $\{\theta, \tilde\sigma^2\}$. Marginal and conditional posterior distributions for $\theta$ and $\tilde\sigma^2$ can be obtained from the approximation to the joint distribution with simple arithmetic. For example,

$$p_D(\theta_k \mid y_1, \ldots, y_n) = \sum_{h=1}^H p_D(\theta_k, \tilde\sigma_h^2 \mid y_1, \ldots, y_n).$$

The resulting discrete approximations to the marginal posterior distributions of $\theta$ and $\tilde\sigma^2$ are shown in the second and third panels of Figure 6.1.

mu0<-1.9 ; t20<-0.95^2 ; s20<-.01 ; nu0<-1
y<-c(1.64,1.70,1.72,1.74,1.82,1.82,1.82,1.90,2.08)

G<-100 ; H<-100
mean.grid<-seq(1.505,2.00,length=G)
prec.grid<-seq(1.75,175,length=H)
post.grid<-matrix(nrow=G,ncol=H)

for(g in 1:G) {
  for(h in 1:H) {
    post.grid[g,h]<- dnorm(mean.grid[g],mu0,sqrt(t20)) *
                     dgamma(prec.grid[h],nu0/2,s20*nu0/2) *
                     prod(dnorm(y,mean.grid[g],1/sqrt(prec.grid[h])))
  }
}
post.grid<-post.grid/sum(post.grid)

Fig. 6.1. Joint and marginal posterior distributions based on a discrete approximation.

Evaluation of this two-parameter posterior distribution at 100 values of each parameter required a grid of size 100×100 = 100². In general, to construct a similarly fine approximation for a $p$-dimensional posterior distribution we would need a $p$-dimensional grid containing $100^p$ posterior probabilities. This means that discrete approximations will only be feasible for densities having a small number of parameters.

6.3 Sampling from the conditional distributions

Suppose for the moment you knew the value of $\theta$. The conditional distribution of $\tilde\sigma^2$ given $\theta$ and $\{y_1, \ldots, y_n\}$ is

$$p(\tilde\sigma^2 \mid \theta, y_1, \ldots, y_n) \propto p(y_1, \ldots, y_n, \theta, \tilde\sigma^2) = p(y_1, \ldots, y_n \mid \theta, \tilde\sigma^2)\, p(\theta \mid \tilde\sigma^2)\, p(\tilde\sigma^2).$$

If $\theta$ and $\sigma^2$ are independent in the prior distribution, then $p(\theta \mid \tilde\sigma^2) = p(\theta)$ and

$$p(\tilde\sigma^2 \mid \theta, y_1, \ldots, y_n) \propto p(y_1, \ldots, y_n \mid \theta, \tilde\sigma^2)\, p(\tilde\sigma^2)$$
$$\propto (\tilde\sigma^2)^{n/2} \exp\Big\{-\tilde\sigma^2 \sum_{i=1}^n (y_i - \theta)^2/2\Big\} \times (\tilde\sigma^2)^{\nu_0/2 - 1} \exp\{-\tilde\sigma^2 \nu_0\sigma_0^2/2\}$$
$$= (\tilde\sigma^2)^{(\nu_0 + n)/2 - 1} \exp\Big\{-\tilde\sigma^2 \big[\nu_0\sigma_0^2 + \textstyle\sum (y_i - \theta)^2\big]/2\Big\}.$$

This is the form of a gamma density, and so evidently $\{\sigma^2 \mid \theta, y_1, \ldots, y_n\} \sim$ inverse-gamma$(\nu_n/2, \nu_n\sigma_n^2(\theta)/2)$, where

$$\nu_n = \nu_0 + n, \quad \sigma_n^2(\theta) = \frac{1}{\nu_n}\left[\nu_0\sigma_0^2 + n\, s_n^2(\theta)\right],$$

and $s_n^2(\theta) = \sum (y_i - \theta)^2/n$, the unbiased estimate of $\sigma^2$ if $\theta$ were known. This means that we can easily sample directly from $p(\sigma^2 \mid \theta, y_1, \ldots, y_n)$, as well as from $p(\theta \mid \sigma^2, y_1, \ldots, y_n)$ as shown at the beginning of the chapter. However, we do not yet have a way to sample directly from $p(\theta, \sigma^2 \mid y_1, \ldots, y_n)$. Can we use the full conditional distributions to sample from the joint posterior distribution?

Suppose we were given $\sigma^{2(1)}$, a single sample from the marginal posterior distribution $p(\sigma^2 \mid y_1, \ldots, y_n)$. Then we could sample

$$\theta^{(1)} \sim p(\theta \mid \sigma^{2(1)}, y_1, \ldots, y_n)$$

and $\{\theta^{(1)}, \sigma^{2(1)}\}$ would be a sample from the joint distribution of $\{\theta, \sigma^2\}$. Additionally, $\theta^{(1)}$ can be considered a sample from the marginal distribution $p(\theta \mid y_1, \ldots, y_n)$. From this $\theta$-value, we can generate

$$\sigma^{2(2)} \sim p(\sigma^2 \mid \theta^{(1)}, y_1, \ldots, y_n).$$

But since $\theta^{(1)}$ is a sample from the marginal distribution of $\theta$, and $\sigma^{2(2)}$ is a sample from the conditional distribution of $\sigma^2$ given $\theta^{(1)}$, then $\{\theta^{(1)}, \sigma^{2(2)}\}$ is also a sample from the joint distribution of $\{\theta, \sigma^2\}$. This in turn means that $\sigma^{2(2)}$ is a sample from the marginal distribution $p(\sigma^2 \mid y_1, \ldots, y_n)$, which then could be used to generate a new sample $\theta^{(2)}$, and so on. It seems that the two conditional distributions could be used to generate samples from the joint distribution, if only we had a $\sigma^{2(1)}$ from which to start.

6.4 Gibbs sampling

The distributions $p(\theta \mid \sigma^2, y_1, \ldots, y_n)$ and $p(\sigma^2 \mid \theta, y_1, \ldots, y_n)$ are called the full conditional distributions of $\theta$ and $\sigma^2$ respectively, as each is a conditional distribution of a parameter given everything else. Let's make the iterative sampling idea described in the previous paragraph more precise. Given a current state of the parameters $\phi^{(s)} = \{\theta^{(s)}, \tilde\sigma^{2(s)}\}$, we generate a new state as follows:

1. sample $\theta^{(s+1)} \sim p(\theta \mid \tilde\sigma^{2(s)}, y_1, \ldots, y_n)$;
2. sample $\tilde\sigma^{2(s+1)} \sim p(\tilde\sigma^2 \mid \theta^{(s+1)}, y_1, \ldots, y_n)$;
3. let $\phi^{(s+1)} = \{\theta^{(s+1)}, \tilde\sigma^{2(s+1)}\}$.

This algorithm is called the Gibbs sampler, and generates a dependent sequence of our parameters $\{\phi^{(1)}, \phi^{(2)}, \ldots, \phi^{(S)}\}$.
The R-code to perform this sampling scheme for the normal model with the semiconjugate prior distribution is as follows:

### data
mean.y<-mean(y) ; var.y<-var(y) ; n<-length(y)

### starting values
S<-1000
PHI<-matrix(nrow=S,ncol=2)
PHI[1,]<-phi<-c(mean.y, 1/var.y)

### Gibbs sampling
set.seed(1)
for(s in 2:S) {

  # generate a new theta value from its full conditional
  mun<- ( mu0/t20 + n*mean.y*phi[2] ) / ( 1/t20 + n*phi[2] )
  t2n<- 1/( 1/t20 + n*phi[2] )
  phi[1]<- rnorm(1, mun, sqrt(t2n))

  # generate a new 1/sigma^2 value from its full conditional
  nun<- nu0+n
  s2n<- ( nu0*s20 + (n-1)*var.y + n*(mean.y-phi[1])^2 ) / nun
  phi[2]<- rgamma(1, nun/2, nun*s2n/2)

  PHI[s,]<- phi
}

In this code, we have used the identity

$$n\, s_n^2(\theta) = \sum_{i=1}^n (y_i - \theta)^2 = \sum_{i=1}^n (y_i - \bar y + \bar y - \theta)^2$$
$$= \sum_{i=1}^n \left[(y_i - \bar y)^2 + 2(y_i - \bar y)(\bar y - \theta) + (\bar y - \theta)^2\right]$$
$$= \sum_{i=1}^n (y_i - \bar y)^2 + 0 + n(\bar y - \theta)^2 = (n-1)s^2 + n(\bar y - \theta)^2.$$

The reason for writing the code this way is that $s^2$ and $\bar y$ do not change with each new $\theta$-value, and computing $(n-1)s^2 + n(\bar y - \theta)^2$ is faster than recomputing $\sum_{i=1}^n (y_i - \theta)^2$ at each iteration.

Fig. 6.2. The first 5, 15 and 100 iterations of a Gibbs sampler.

Using the midge data from the previous chapter and the prior distributions described above, a Gibbs sampler consisting of 1,000 iterations was constructed.
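For readers working outside of R, the same two-step sampler can be sketched in Python with only the standard library. This is our own translation, not code from the text; variable names mirror the R code above, and note that Python's gamma sampler is parameterized by shape and scale rather than shape and rate:

```python
import math
import random

random.seed(1)

# midge data and prior parameters from the chapter
y = [1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.82, 1.90, 2.08]
n = len(y)
mean_y = sum(y) / n
var_y = sum((yi - mean_y) ** 2 for yi in y) / (n - 1)
mu0, t20 = 1.9, 0.95 ** 2
s20, nu0 = 0.01, 1

S = 1000
theta, prec = mean_y, 1 / var_y          # starting values
PHI = []
for s in range(S):
    # sample theta from its full conditional (normal)
    t2n = 1 / (1 / t20 + n * prec)
    mun = (mu0 / t20 + n * mean_y * prec) * t2n
    theta = random.gauss(mun, math.sqrt(t2n))
    # sample the precision 1/sigma^2 from its full conditional (gamma);
    # rate nun*s2n/2 corresponds to scale 2/(nun*s2n) in gammavariate
    nun = nu0 + n
    s2n = (nu0 * s20 + (n - 1) * var_y + n * (mean_y - theta) ** 2) / nun
    prec = random.gammavariate(nun / 2, 2 / (nun * s2n))
    PHI.append((theta, prec))

post_mean = sum(t for t, _ in PHI) / S   # should be close to ybar = 1.804
```

The posterior mean of $\theta$ computed this way should agree closely with the R results reported later in the chapter.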
Figure 6.2 plots the first 5, 15 and 100 simulated values, and the first panel of Figure 6.3 plots the 1,000 values over the contours of the discrete approximation to $p(\theta, \tilde\sigma^2 \mid y_1, \ldots, y_n)$. The second and third panels of Figure 6.3 give density estimates of the distributions of the simulated values of $\theta$ and $\tilde\sigma^2$. Finally, let's find some empirical quantiles of our Gibbs samples:

### CI for population mean
> quantile(PHI[,1], c(.025,.5,.975))
     2.5%       50%     97.5%
 1.707282  1.804348  1.901129

### CI for population precision
> quantile(PHI[,2], c(.025,.5,.975))
     2.5%       50%     97.5%
 17.48020  53.62511 129.20020

### CI for population standard deviation
> quantile(1/sqrt(PHI[,2]), c(.025,.5,.975))
      2.5%        50%      97.5%
0.08797701 0.13655763 0.23918408

The empirical distribution of these Gibbs samples very closely resembles the discrete approximation to their posterior distribution, as can be seen by comparing Figures 6.1 and 6.3. This gives some indication that the Gibbs sampling procedure is a valid method for approximating $p(\theta, \tilde\sigma^2 \mid y_1, \ldots, y_n)$.

Fig. 6.3. The first panel shows 1,000 samples from the Gibbs sampler, plotted over the contours of the discrete approximation. The second and third panels give kernel density estimates to the distributions of Gibbs samples of $\theta$ and $\tilde\sigma^2$. Vertical gray bars on the second plot indicate 2.5% and 97.5% quantiles of the Gibbs samples of $\theta$, while nearly identical black vertical bars indicate the 95% confidence interval based on the t-test.

6.5 General properties of the Gibbs sampler

Suppose you have a vector of parameters $\phi = \{\phi_1, \ldots, \phi_p\}$, and your information about $\phi$ is measured with $p(\phi) = p(\phi_1, \ldots, \phi_p)$. For example, in the normal model $\phi = \{\theta, \sigma^2\}$, and the probability measure of interest is $p(\theta, \sigma^2 \mid y_1, \ldots, y_n)$. Given a starting point $\phi^{(0)} = \{\phi_1^{(0)}, \ldots, \phi_p^{(0)}\}$, the Gibbs sampler generates $\phi^{(s)}$ from $\phi^{(s-1)}$ as follows:

1. sample $\phi_1^{(s)} \sim p(\phi_1 \mid \phi_2^{(s-1)}, \phi_3^{(s-1)}, \ldots, \phi_p^{(s-1)})$
2. sample $\phi_2^{(s)} \sim p(\phi_2 \mid \phi_1^{(s)}, \phi_3^{(s-1)}, \ldots, \phi_p^{(s-1)})$
⋮
p. sample $\phi_p^{(s)} \sim p(\phi_p \mid \phi_1^{(s)}, \phi_2^{(s)}, \ldots, \phi_{p-1}^{(s)})$.

This algorithm generates a dependent sequence of vectors:

$$\phi^{(1)} = \{\phi_1^{(1)}, \ldots, \phi_p^{(1)}\}$$
$$\phi^{(2)} = \{\phi_1^{(2)}, \ldots, \phi_p^{(2)}\}$$
$$\vdots$$
$$\phi^{(S)} = \{\phi_1^{(S)}, \ldots, \phi_p^{(S)}\}.$$

In this sequence, $\phi^{(s)}$ depends on $\phi^{(0)}, \ldots, \phi^{(s-1)}$ only through $\phi^{(s-1)}$, i.e. $\phi^{(s)}$ is conditionally independent of $\phi^{(0)}, \ldots, \phi^{(s-2)}$ given $\phi^{(s-1)}$. This is called the Markov property, and so the sequence is called a Markov chain. Under some conditions that will be met for all of the models discussed in this text,

$$\Pr(\phi^{(s)} \in A) \to \int_A p(\phi)\, d\phi \quad \text{as } s \to \infty.$$

In words, the sampling distribution of $\phi^{(s)}$ approaches the target distribution as $s \to \infty$, no matter what the starting value $\phi^{(0)}$ is (although some starting values will get you to the target sooner than others). More importantly, for most functions $g$ of interest,

$$\frac{1}{S} \sum_{s=1}^S g(\phi^{(s)}) \to \mathrm{E}[g(\phi)] = \int g(\phi)\, p(\phi)\, d\phi \quad \text{as } S \to \infty. \tag{6.1}$$

This means we can approximate $\mathrm{E}[g(\phi)]$ with the sample average of $\{g(\phi^{(1)}), \ldots, g(\phi^{(S)})\}$, just as in Monte Carlo approximation. For this reason, we call such approximations Markov chain Monte Carlo (MCMC) approximations, and the procedure an MCMC algorithm. In the context of the semiconjugate normal model, Equation 6.1 implies that the empirical distribution of $\{(\theta^{(1)}, \sigma^{2(1)}), \ldots, (\theta^{(1000)}, \sigma^{2(1000)})\}$ is approximately equal to $p(\theta, \sigma^2 \mid y_1, \ldots, y_n)$, and that

$$\mathrm{E}[\theta \mid y_1, \ldots, y_n] \approx \frac{1}{1000} \sum_{s=1}^{1000} \theta^{(s)} = 1.804, \quad \text{and}$$
$$\Pr(\theta \in [1.71, 1.90] \mid y_1, \ldots, y_n) \approx 0.95.$$
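Equation 6.1 can be seen in miniature with a two-state Markov chain, a toy example of ours rather than one from the text: the long-run fraction of time the chain spends in a state converges to that state's stationary probability, no matter where the chain starts.

```python
import random

random.seed(5)

# Two-state Markov chain with Pr(0 -> 1) = 0.2 and Pr(1 -> 0) = 0.3.
# Its stationary distribution puts probability 0.3/(0.2+0.3) = 0.6
# on state 0, so the time-average indicator of "state == 0" should
# approach 0.6 as the chain runs (Equation 6.1 with g = indicator).
a, b = 0.2, 0.3
S = 100000
state = 1                       # deliberately start in the rarer state
visits0 = 0
for _ in range(S):
    if state == 0:
        state = 1 if random.random() < a else 0
    else:
        state = 0 if random.random() < b else 1
    visits0 += (state == 0)

frac0 = visits0 / S             # long-run fraction of time in state 0
```

Because consecutive states are dependent, `frac0` converges more slowly than an average of independent draws would, which is exactly the MCMC behavior discussed in Section 6.6.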
We will discuss practical aspects of MCMC in the context of specific models in the next section and in the next several chapters.

Distinguishing parameter estimation from posterior approximation

A Bayesian data analysis using Monte Carlo methods often involves a confusing array of sampling procedures and probability distributions. With this in mind it is helpful to distinguish the part of the data analysis which is statistical from that which is numerical approximation. Recall from Chapter 1 that the necessary ingredients of a Bayesian data analysis are

1. Model specification: a collection of probability distributions $\{p(y|\theta), \theta \in \Theta\}$ which should represent the sampling distribution of your data for some value of $\theta$;
2. Prior specification: a probability distribution $p(\theta)$, ideally representing someone's prior information about which parameter values are likely to describe the sampling distribution.

Once these items are specified and the data have been gathered, the posterior $p(\theta|y)$ is completely determined. It is given by

$$p(\theta|y) = \frac{p(\theta)\, p(y|\theta)}{p(y)} = \frac{p(\theta)\, p(y|\theta)}{\int p(\theta)\, p(y|\theta)\, d\theta},$$

and so in a sense there is no more modeling or estimation. All that is left is

3. Posterior summary: a description of the posterior distribution $p(\theta|y)$, done in terms of particular quantities of interest such as posterior means, medians, modes, predictive probabilities and confidence regions.

For many models, $p(\theta|y)$ is complicated and hard to write down. In these cases, a useful way to "look at" $p(\theta|y)$ is by studying Monte Carlo samples from $p(\theta|y)$. Thus, Monte Carlo and MCMC sampling algorithms are not models, and they do not generate "more information" than is in $y$ and $p(\theta)$; they are simply "ways of looking at" $p(\theta|y)$. For example, if we have Monte Carlo samples $\theta^{(1)}, \ldots, \theta^{(S)}$ that are approximate draws from $p(\theta|y)$, then these samples help describe $p(\theta|y)$:

$$\frac{1}{S} \sum_{s=1}^S \theta^{(s)} \approx \int \theta\, p(\theta|y)\, d\theta,$$
$$\frac{1}{S} \sum_{s=1}^S 1(\theta^{(s)} \le c) \approx \Pr(\theta \le c \mid y) = \int_{-\infty}^c p(\theta|y)\, d\theta,$$

and so on.
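As a toy illustration of such summaries (ours, not from the text), suppose the posterior were exactly normal with mean 1.8 and standard deviation 0.05. Monte Carlo draws then recover its mean and tail probabilities to within Monte Carlo error:

```python
import random

random.seed(0)

# Pretend p(theta|y) is normal(1.8, 0.05^2) and summarize it by sampling.
S = 50000
draws = [random.gauss(1.8, 0.05) for _ in range(S)]

post_mean = sum(draws) / S                    # approximates E[theta | y]
pr_below = sum(d <= 1.8 for d in draws) / S   # approximates Pr(theta <= 1.8 | y)
```

Here the sampler adds nothing to what the closed-form posterior already says; it is purely a way of reading off its features, which is the point of the distinction above.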
To keep this distinction in mind, it is useful to reserve the word estimation to describe how we use $p(\theta|y)$ to make inference about $\theta$, and to use the word approximation to describe the use of Monte Carlo procedures to approximate integrals.

6.6 Introduction to MCMC diagnostics

The purpose of Monte Carlo or Markov chain Monte Carlo approximation is to obtain a sequence of parameter values $\{\phi^{(1)}, \ldots, \phi^{(S)}\}$ such that

$$\frac{1}{S} \sum_{s=1}^S g(\phi^{(s)}) \approx \int g(\phi)\, p(\phi)\, d\phi$$

for any functions $g$ of interest. In other words, we want the empirical average of $\{g(\phi^{(1)}), \ldots, g(\phi^{(S)})\}$ to approximate the expected value of $g(\phi)$ under a target probability distribution $p(\phi)$ (in Bayesian inference, the target distribution is usually the posterior distribution). In order for this to be a good approximation for a wide range of functions $g$, we need the empirical distribution of the simulated sequence $\{\phi^{(1)}, \ldots, \phi^{(S)}\}$ to look like the target distribution $p(\phi)$. Monte Carlo and Markov chain Monte Carlo are two ways of generating such a sequence. Monte Carlo simulation, in which we generate independent samples from the target distribution, is in some sense the "gold standard." Independent MC samples automatically create a sequence that is representative of $p(\phi)$: the probability that $\phi^{(s)} \in A$ for any set $A$ is $\int_A p(\phi)\, d\phi$. This is true for every $s \in \{1, \ldots, S\}$, and conditionally or unconditionally on the other values in the sequence. This is not true for MCMC samples, in which case all we are sure of is that

$$\lim_{s \to \infty} \Pr(\phi^{(s)} \in A) = \int_A p(\phi)\, d\phi.$$

Let's explore the differences between MC and MCMC with a simple example. Our target distribution will be the joint probability distribution of two variables: a discrete variable $\delta \in \{1, 2, 3\}$ and a continuous variable $\theta \in \mathbb{R}$. The target density for this example will be defined as $\{\Pr(\delta = 1), \Pr(\delta = 2), \Pr(\delta = 3)\} = (.45, .10, .45)$ and $p(\theta|\delta) = \mathrm{dnorm}(\theta, \mu_\delta, \sigma_\delta)$, where $(\mu_1, \mu_2, \mu_3) = (-3, 0, 3)$ and $(\sigma_1^2, \sigma_2^2, \sigma_3^2) = (1/3, 1/3, 1/3)$.
This is a mixture of three normal densities, where we might think of $\delta$ as being a group membership variable and $(\mu_\delta, \sigma_\delta^2)$ as the population mean and variance for group $\delta$. A plot of the exact marginal density of $\theta$, $p(\theta) = \sum_\delta p(\theta|\delta)\, p(\delta)$, appears in the black lines of Figure 6.4. Notice that there are three modes representing the three different group means.

Fig. 6.4. A mixture of normal densities and a Monte Carlo approximation.

It is very easy to obtain independent Monte Carlo samples from the joint distribution of $\phi = (\delta, \theta)$. First, a value of $\delta$ is sampled from its marginal distribution, then that value is plugged into $p(\theta|\delta)$, from which a value of $\theta$ is sampled. The sampled pair $(\delta, \theta)$ represents a sample from the joint distribution $p(\delta, \theta) = p(\delta)\, p(\theta|\delta)$. The empirical distribution of the $\theta$-samples provides an approximation to the marginal distribution $p(\theta) = \sum_\delta p(\theta|\delta)\, p(\delta)$. A histogram of 1,000 Monte Carlo $\theta$-values generated in this way is shown in Figure 6.4. The empirical distribution of the Monte Carlo samples looks a lot like $p(\theta)$.

It is also straightforward to construct a Gibbs sampler for $\phi = (\delta, \theta)$. A Gibbs sampler would alternately sample values of $\delta$ and $\theta$ from their full conditional distributions. The full conditional distribution of $\theta$ is already provided, and using Bayes' rule we can show that the full conditional distribution of $\delta$ is given by

$$\Pr(\delta = d \mid \theta) = \frac{\Pr(\delta = d)\, \mathrm{dnorm}(\theta, \mu_d, \sigma_d)}{\sum_{e=1}^3 \Pr(\delta = e)\, \mathrm{dnorm}(\theta, \mu_e, \sigma_e)}$$

for $d \in \{1, 2, 3\}$. The first panel of Figure 6.5 shows a histogram of 1,000 MCMC values of $\theta$ generated with the Gibbs sampler. Notice that the empirical distribution of the MCMC samples gives a poor approximation to $p(\theta)$: values of $\theta$ near $-3$ are underrepresented, whereas values near zero and $+3$ are overrepresented. What went wrong? A plot of the $\theta$-values versus iteration number in the second panel of the figure tells the story.
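The two samplers behind Figures 6.4 and 6.5 are each only a few lines. Here is a Python sketch of ours (standard library only) of the independent Monte Carlo scheme and of the full conditional $\Pr(\delta = d \mid \theta)$ used by the Gibbs sampler:

```python
import math
import random

random.seed(2)

w = [0.45, 0.10, 0.45]          # Pr(delta = 1, 2, 3)
mu = [-3.0, 0.0, 3.0]
sd = [math.sqrt(1 / 3)] * 3

def dnorm(x, m, s):
    """Normal density with mean m and standard deviation s."""
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def mc_sample():
    """Independent Monte Carlo: sample delta, then theta | delta."""
    d = random.choices([0, 1, 2], weights=w)[0]
    return random.gauss(mu[d], sd[d])

thetas = [mc_sample() for _ in range(10000)]
mc_mean = sum(thetas) / len(thetas)   # the mixture mean is exactly 0

def delta_probs(theta):
    """Full conditional of delta given theta (Bayes' rule)."""
    num = [w[d] * dnorm(theta, mu[d], sd[d]) for d in range(3)]
    tot = sum(num)
    return [v / tot for v in num]
```

Evaluating `delta_probs` at a value of $\theta$ near one of the means shows why the chain sticks: for $\theta = -3$, nearly all conditional mass sits on $\delta = 1$, so moves to the other groups are rare.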
The $\theta$-values get "stuck" in certain regions, and rarely move among the three regions represented by the three values of $\delta$. The technical term for this "stickiness" is autocorrelation, or correlation between consecutive values of the chain. In this Gibbs sampler, if we have a value of $\theta$ near 0, for example, then the next value of $\delta$ is likely to be 2. If $\delta$ is 2, then the next value of $\theta$ is likely to be near 0, resulting in a high degree of positive correlation between consecutive $\theta$-values in the chain.

Isn't the Gibbs sampler guaranteed to eventually provide a good approximation to $p(\theta)$? It is, but "eventually" can be a very long time in some situations. The first panel of Figure 6.6 indicates that our approximation has greatly improved after using 10,000 iterations of the Gibbs sampler, although it is still somewhat inadequate.

Fig. 6.5. Histogram and traceplot of 1,000 Gibbs samples.

In the case of a generic parameter $\phi$ and target distribution $p(\phi)$, it is helpful to think of the sequence $\{\phi^{(1)}, \ldots, \phi^{(S)}\}$ as the trajectory of a particle moving around the parameter space. In terms of MCMC integral approximation, the critical thing is that the amount of time the particle spends in a given set $A$ is proportional to the target probability $\int_A p(\phi)\, d\phi$.

Fig. 6.6. Histogram and traceplot of 10,000 Gibbs samples.

Now suppose $A_1$, $A_2$ and $A_3$ are three disjoint subsets of the parameter space, with $\Pr(A_2) < \Pr(A_1) \approx \Pr(A_3)$ (these could be, for example, the regions near the three modes of the normal mixture distribution above). In terms of the integral approximation, this means that we want the particle to spend little time in $A_2$, and about the same amount of time in $A_1$ as in $A_3$.
Since in general we do not know $p(\phi)$ (otherwise we would not be trying to approximate it), it is possible that we would accidentally start our Markov chain in $A_2$. In this case, it is critical that the number of iterations $S$ is large enough so that the particle has a chance to

1. move out of $A_2$ and into higher probability regions, and
2. move between $A_1$ and $A_3$, and any other sets of high probability.

The technical term for attaining item 1 is to say that the chain has achieved stationarity or has converged. If your Markov chain starts off in a region of the parameter space that has high probability, then convergence generally is not a big issue. If you do not know whether you are starting off in a good region, assessing convergence is fraught with epistemological problems. In general, you cannot know for sure if your chain has converged. But sometimes you can know if your chain has not converged, so we at least check for this latter possibility. One thing to check for is stationarity, or that samples taken in one part of the chain have a similar distribution to samples taken in other parts. For the normal model with semiconjugate prior distributions from the previous section, stationarity is achieved quite quickly and is not a big issue. However, for some highly parameterized models that we will see later on, the autocorrelation in the chain is high, good starting values can be hard to find, and it can take a long time to get to stationarity. In these cases we need to run the MCMC sampler for a very long time.

Item 2 above relates to how quickly the particle moves around the parameter space, which is sometimes called the speed of mixing. An independent MC sampler has perfect mixing: it has zero autocorrelation and can jump between different regions of the parameter space in one step.
As we have seen in the example above, an MCMC sampler might have poor mixing, take a long time between jumps to different parts of the parameter space, and have a high degree of autocorrelation.

How does the correlation of the MCMC samples affect posterior approximation? Suppose we want to approximate the integral $\theta_0 = \int \theta\, p(\theta)\, d\theta$ using the empirical distribution of $\{\theta^{(1)}, \ldots, \theta^{(S)}\}$. If the $\theta$-values are independent Monte Carlo samples from $p(\theta)$, then the variance of $\bar\theta = \sum_s \theta^{(s)}/S$ is

$$\mathrm{Var}_{\mathrm{MC}}[\bar\theta] = \mathrm{E}[(\bar\theta - \theta_0)^2] = \frac{\mathrm{Var}[\theta]}{S},$$

where $\mathrm{Var}[\theta] = \int \theta^2\, p(\theta)\, d\theta - \theta_0^2$. Recall from Chapter 4 that the square root of $\mathrm{Var}_{\mathrm{MC}}[\bar\theta]$ is the Monte Carlo standard error, and is a measure of how well we expect $\bar\theta$ to approximate the integral $\int \theta\, p(\theta)\, d\theta$. If we were to rerun the MC approximation procedure many times, perhaps with different starting values or random number generators, we expect that $\theta_0$, the true value of the integral, would be contained within the interval $\bar\theta \pm 2\sqrt{\mathrm{Var}_{\mathrm{MC}}[\bar\theta]}$ for roughly 95% of the MC approximations. The width of this interval is $4\sqrt{\mathrm{Var}_{\mathrm{MC}}[\bar\theta]}$, and we can make it as small as we want by generating more MC samples.

What if we use an MCMC algorithm such as the Gibbs sampler? As can be seen in Figures 6.5 and 6.6, consecutive MCMC samples $\theta^{(s)}$ and $\theta^{(s+1)}$ can be positively correlated. Assuming stationarity has been achieved, the expected squared difference from the MCMC integral approximation to the target $\theta_0 = \int \theta\, p(\theta)\, d\theta$ is the MCMC variance, and is given by

$$\mathrm{Var}_{\mathrm{MCMC}}[\bar\theta] = \mathrm{E}[(\bar\theta - \theta_0)^2]$$
$$= \mathrm{E}\Big[\Big\{\frac{1}{S} \sum_{s=1}^S (\theta^{(s)} - \theta_0)\Big\}^2\Big]$$
$$= \frac{1}{S^2} \sum_{s=1}^S \mathrm{E}[(\theta^{(s)} - \theta_0)^2] + \frac{1}{S^2} \sum_{s \ne t} \mathrm{E}[(\theta^{(s)} - \theta_0)(\theta^{(t)} - \theta_0)]$$
$$= \mathrm{Var}_{\mathrm{MC}}[\bar\theta] + \frac{1}{S^2} \sum_{s \ne t} \mathrm{E}[(\theta^{(s)} - \theta_0)(\theta^{(t)} - \theta_0)].$$

So the MCMC variance is equal to the MC variance plus a term that depends on the correlation of samples within the Markov chain. This term is generally positive, and so the MCMC variance is higher than the MC variance, meaning that we expect the MCMC approximation to be further away from $\theta_0$ than the MC approximation is.
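This variance inflation can be seen in a small simulation of our own, not from the text: sample means of a highly autocorrelated AR(1) sequence vary far more across repeated runs than sample means of the same number of independent draws from the same marginal distribution.

```python
import math
import random

random.seed(3)

S = 200          # length of each sequence
reps = 400       # number of repeated approximations
rho = 0.9        # lag-1 autocorrelation of the stationary AR(1) chain

def mean_iid():
    """Average of S independent standard normal draws."""
    return sum(random.gauss(0, 1) for _ in range(S)) / S

def mean_ar1():
    """Average of S draws from a stationary AR(1) chain whose
    marginal distribution is also standard normal."""
    x = random.gauss(0, 1)
    total = x
    innov_sd = math.sqrt(1 - rho ** 2)
    for _ in range(S - 1):
        x = rho * x + random.gauss(0, innov_sd)
        total += x
    return total / S

def var(vals):
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / (len(vals) - 1)

var_mc = var([mean_iid() for _ in range(reps)])     # about 1/S = 0.005
var_mcmc = var([mean_ar1() for _ in range(reps)])   # roughly (1+rho)/(1-rho) larger
```

With $\rho = 0.9$ the inflation factor $(1+\rho)/(1-\rho)$ is about 19, so `var_mcmc` comes out an order of magnitude larger than `var_mc` even though both sequences have the same marginal distribution.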
The higher the autocorrelation in the chain, the larger the MCMC variance and the worse the approximation. To assess how much correlation there is in the chain we often compute the sample autocorrelation function. For a generic sequence of numbers {φ₁, ..., φ_S}, the lag-t sample autocorrelation function estimates the correlation between elements of the sequence that are t steps apart:

  acf_t(φ) = [ (1/(S−t)) Σ_{s=1}^{S−t} (φₛ − φ̄)(φ_{s+t} − φ̄) ] / [ (1/(S−1)) Σ_{s=1}^{S} (φₛ − φ̄)² ],

which is computed by the R function acf. For the sequence of 10,000 θ-values plotted in Figure 6.6, the lag-10 autocorrelation is 0.93 and the lag-50 autocorrelation is 0.812. A Markov chain with such high autocorrelation moves around the parameter space slowly, taking a long time to achieve the correct balance among the different regions of the parameter space. The higher the autocorrelation, the more MCMC samples we need to attain a given level of precision for our approximation. One way to measure this is to calculate the effective sample size of an MCMC sequence, using the R command effectiveSize in the "coda" package. The effective sample size function estimates the value S_eff such that

  Var_MCMC[θ̄] = Var[θ]/S_eff,

so that S_eff can be interpreted as the number of independent Monte Carlo samples necessary to give the same precision as the MCMC samples. For the normal mixture density example above, the effective sample size of the 10,000 Gibbs samples of θ is 18.42, indicating that the precision of the MCMC approximation to E[θ] is only as good as the precision that would have been obtained from about 18 independent samples of θ. There is a large literature on the practical implementation and assessment of Gibbs sampling and MCMC approximation. Much insight can be gained by hands-on experience supplemented by reading books and articles.
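The lag-t autocorrelation formula and the effective-sample-size idea above can be sketched directly. The functions below are our own illustrative Python analogues of R's acf and coda's effectiveSize (which uses a spectral estimate of Var_MCMC; the version here uses the simplest truncated-sum-of-autocorrelations approximation instead).

```python
# Sketch of the lag-t sample autocorrelation and a crude effective
# sample size, S_eff = S / (1 + 2 * sum of positive-lag autocorrelations).
# Function names and the AR(1) test chain are ours, not from the text.
import numpy as np

def acf_t(phi, t):
    """Lag-t sample autocorrelation of the sequence phi_1, ..., phi_S."""
    phi = np.asarray(phi, dtype=float)
    S = len(phi)
    phibar = phi.mean()
    num = np.sum((phi[: S - t] - phibar) * (phi[t:] - phibar)) / (S - t)
    den = np.sum((phi - phibar) ** 2) / (S - 1)
    return num / den

def effective_size(phi, max_lag=200):
    """Truncate the autocorrelation sum at the first negative estimate."""
    total = 0.0
    for t in range(1, max_lag + 1):
        r = acf_t(phi, t)
        if r < 0:
            break
        total += r
    return len(phi) / (1 + 2 * total)

# A highly autocorrelated AR(1) chain has a small effective sample size.
rng = np.random.default_rng(0)
chain = np.empty(10_000)
chain[0] = 0.0
for s in range(1, len(chain)):
    chain[s] = 0.95 * chain[s - 1] + rng.normal(scale=np.sqrt(1 - 0.95**2))

print(acf_t(chain, 1), effective_size(chain))
```

For a chain of 10,000 values with lag-1 correlation near 0.95, this estimate of S_eff comes out in the low hundreds, i.e., a small fraction of the nominal sample size, which is the phenomenon the mixture example in the text illustrates in more extreme form.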
A good article to start with is "Practical Markov chain Monte Carlo" (Geyer, 1992), which includes a discussion by many researchers and a large variety of viewpoints on and techniques for MCMC approximation.

MCMC diagnostics for the semiconjugate normal analysis

We now assess the Markov chain of θ- and σ²-values generated by the Gibbs sampler in Section 6.4. Figure 6.7 plots the values of these two parameters in sequential order, and seems to indicate immediate convergence and a low degree of autocorrelation. The lag-1 autocorrelation for the sequence {θ^(1), ..., θ^(1000)} is 0.031, which is essentially zero for approximation purposes. The effective sample size for this sequence is computed in R to be 1,000. The lag-1 autocorrelation for the σ²-values is 0.147, with an effective sample size of 742. While not quite as good as an independently sampled sequence of parameter values, the Gibbs sampler for this model and prior distribution performs quite well.
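A quick sanity check of the "essentially zero" claim: for a well-mixing sampler, the lag-1 sample autocorrelation of the output should be close to zero and the effective sample size close to the number of iterations. The snippet below (a hypothetical check, not the book's R code) computes the lag-1 autocorrelation of an independent sequence as a baseline; an actual Gibbs output would be substituted in place of the simulated chain.

```python
# Baseline: lag-1 sample autocorrelation of an independent sequence of
# length 1000, which should be near 0 (sampling noise is about 1/sqrt(S)).
# The normal draws here stand in for well-mixing Gibbs output.
import numpy as np

rng = np.random.default_rng(42)
S = 1000
chain = rng.normal(size=S)

phibar = chain.mean()
lag1 = (np.sum((chain[:-1] - phibar) * (chain[1:] - phibar)) / (S - 1)) / chain.var(ddof=1)
print(round(lag1, 3))  # close to 0 for an essentially independent sequence
```

Values like 0.031 for θ in the text are within the sampling noise of a truly independent sequence of this length, which is why the corresponding effective sample size is reported as the full 1,000.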
[Figure: traceplots of the 1,000 Gibbs samples of θ and σ² against iteration number.]
Fig. 6.7. Traceplots for θ and σ².

6.7 Discussion and further references

The term "Gibbs sampling" was coined by Geman and Geman (1984) in their paper on image analysis, but the algorithm appears earlier in the context of spatial statistics, for example in Besag (1974) and Ripley (1979). However, the general utility of the Gibbs sampler for Bayesian data analysis was not fully realized until the late 1980s (Gelfand and Smith, 1990). See Robert and Casella (2008) for a historical review. Assessing the convergence of the Gibbs sampler and the accuracy of the MCMC approximation is difficult. Several authors have proposed convergence diagnostics (Gelman and Rubin, 1992; Geweke, 1992; Raftery and Lewis, 1992), although these can only highlight problems and not guarantee a good approximation (Geyer, 1992).