Unformatted Document Excerpt
Coursehero >>
Utah >>
BYU >>
ECE 771
Course Hero has millions of student submitted documents similar to the one
below including study guides, practice problems, reference materials, practice exams, textbook help and tutor support.
Course Hero has millions of student submitted documents similar to the one
below including study guides, practice problems, reference materials, practice exams, textbook help and tutor support.
Cambridge
Copyright University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
28.1
How many boxes are in the picture (gure 28.1)? In particular, how many boxes are in the vicinity of the tree? If we looked with x-ray spectacles, would we see one or two boxes behind the trunk (gure 28.2)? (Or even more?) Occams razor is the principle that states a preference for simple theories. Accept the simplest explanation that ts the data. Thus according to Occams razor, we should deduce that there is only one box behind the tree. Is this an ad hoc rule of thumb? Or is there a convincing reason for believing there is most likely one box? Perhaps your intuition likes the argument well, it would be a remarkable coincidence for the two boxes to be just the same height and colour as each other. If we wish to make artical intelligences that interpret data correctly, we must translate this intuitive feeling into a concrete theory.
Motivations for Occams razor
If several explanations are compatible with a set of observations, Occams razor advises us to buy the simplest. This principle is often advocated for one of two reasons: the rst is aesthetic (A theory with mathematical beauty is more likely to be correct than an ugly one that ts some experimental data
Occams razor
Model Comparison and Occams Razor
28
343
Figure 28.2. How many boxes are behind the tree?
or 2?
Figure 28.1. A picture to be interpreted. It contains a tree and some boxes.
1?
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
344
28 Model Comparison and Occams Razor
Figure 28.3. Why Bayesian inference embodies Occams razor. This gure gives the basic intuition for why complex models can turn out to be less probable. The horizontal axis represents the space of possible data sets D. Bayes theorem rewards models in proportion to how much they predicted the data that occurred. These predictions are quantied by a normalized probability distribution on D. This probability of the data given model Hi , P (D|Hi ), is called the evidence for Hi . A simple model H1 makes only a limited range of predictions, shown by P (D|H1 ); a more powerful model H2 , that has, for example, more free parameters than H1 , is able to predict a greater variety of data sets. This means, however, that H2 does not predict the data sets in region C 1 as strongly as H 1 . Suppose that equal prior probabilities have been assigned to the two models. Then, if the data set falls in region C 1 , the less powerful model H1 will be the more probable model.
Evidence
P(D|H ) 1 P(D|H2)
C 1
D
(Paul Dirac)); the second reason is the past empirical success of Occams razor. However there is a dierent justication for Occams razor, namely: Coherent inference (as embodied by Bayesian probability) automatically embodies Occams razor, quantitatively. It is indeed more probable that theres one box behind the tree, and we can compute how much more probable one is than two.
Model comparison and Occams razor
We evaluate the plausibility of two alternative theories H1 and H2 in the light of data D as follows: using Bayes theorem, we relate the plausibility of model H1 given the data, P (H1 |D ), to the predictions made by the model about the data, P (D |H1 ), and the prior plausibility of H1 , P (H1 ). This gives the following probability ratio between theory H1 and theory H2 : P (H1 |D ) P (H1 ) P (D |H1 ) = . P (H2 |D ) P (H2 ) P (D |H2 ) (28.1)
The rst ratio (P (H1 )/P (H2 )) on the right-hand side measures how much our initial beliefs favoured H1 over H2 . The second ratio expresses how well the observed data were predicted by H1 , compared to H2 . How does this relate to Occams razor, when H1 is a simpler model than H2 ? The rst ratio (P (H1 )/P (H2 )) gives us the opportunity, if we wish, to insert a prior bias in favour of H1 on aesthetic grounds, or on the basis of experience. This would correspond to the aesthetic and empirical motivations for Occams razor mentioned earlier. But such a prior bias is not necessary: the second ratio, the data-dependent factor, embodies Occams razor automatically. Simple models tend to make precise predictions. Complex models, by their nature, are capable of making a greater variety of predictions (gure 28.3). So if H2 is a more complex model, it must spread its predictive probability P (D |H2 ) more thinly over the data space than H1 . Thus, in the case where the data are compatible with both theories, the simpler H1 will turn out more probable than H2 , without our having to express any subjective dislike for complex models. Our subjective prior just needs to assign equal prior probabilities to the possibilities of simplicity and complexity. Probability theory then allows the observed data to express their opinion. Let us turn to a simple example. Here is a sequence of numbers: 1, 3, 7, 11. The task is to predict the next two numbers, and infer the underlying process that gave rise to this sequence. A popular answer to this question is the prediction 15, 19, with the explanation add 4 to the previous number. What about the alternative answer 19.9, 1043.8 with the underlying rule being: get the next number from the previous number, x, by evaluating
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
28.1: Occams razor x3 /11 + 9/11x2 + 23/11 ? I assume that this prediction seems rather less plausible. But the second rule ts the data (1, 3, 7, 11) just as well as the rule add 4. So why should we nd it less plausible? Let us give labels to the two general theories: Ha the sequence is an arithmetic progression, add n, where n is an integer. Hc the sequence is generated by a cubic function of the form x cx3 + dx2 + e, where c, d and e are fractions. One reason for nding the second explanation, Hc , less plausible, might be that arithmetic progressions are more frequently encountered than cubic functions. This would put a bias in the prior probability ratio P (Ha )/P (Hc ) in equation (28.1). But let us give the two theories equal prior probabilities, and concentrate on what the data have to say. How well did each theory predict the data? To obtain P (D |Ha ) we must specify the probability distribution that each model assigns to its parameters. First, Ha depends on the added integer n, and the rst number in the sequence. Let us say that these numbers could each have been anywhere between 50 and 50. Then since only the pair of values {n = 4, rst number = 1} give rise to the observed data D = (1, 3, 7, 11), the probability of the data, given Ha , is: 11 = 0.00010. (28.2) 101 101 To evaluate P (D |Hc ), we must similarly say what values the fractions c, d and e might take on. [I choose to represent these numbers as fractions rather than real numbers because if we used real numbers, the model would assign, relative to Ha , an innitesimal probability to D . Real parameters are the norm however, and are assumed in the rest of this chapter.] A reasonable prior might state that for each fraction the numerator could be any number between 50 and 50, and the denominator is any number between 1 and 50. As for the initial value in the sequence, let us leave its probability distribution the same as in Ha . There are four ways of expressing the fraction c = 1/11 = 2/22 = 3/33 = 4/44 under this prior, and similarly there are four and two possible solutions for d and e, respectively. So the probability of the observed data, given Hc , is found to be: 41 41 21 1 P (D |Hc ) = 101 101 50 101 50 101 50 = 0.0000000000025 = 2.5 1012 . (28.3) P (D |Ha ) =
345
Thus comparing P (D |Hc ) with P (D |Ha ) = 0.00010, even if our prior probabilities for Ha and Hc are equal, the odds, P (D |Ha ) : P (D |Hc ), in favour of Ha over Hc , given the sequence D = (1, 3, 7, 11), are about forty million to one. 2 This answer depends on several subjective assumptions; in particular, the probability assigned to the free parameters n, c, d, e of the theories. Bayesians make no apologies for this: there is no such thing as inference or prediction without assumptions. However, the quantitative details of the prior probabilities have no eect on the qualitative Occams razor eect; the complex theory Hc always suers an Occam factor because it has more parameters, and so can predict a greater variety of data sets (gure 28.3). This was only a small example, and there were only four data points; as we move to larger and more sophisticated problems the magnitude of the Occam factors typically increases, and the degree to which our inferences are inuenced by the quantitative details of our subjective assumptions becomes smaller.
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
346
Create alternative MODELS d E Fit each MODEL to the DATA d Gather more data T Assign preferences to the alternative MODELS c Choose future actions d d d '
28 Model Comparison and Occams Razor
Figure 28.4. Where Bayesian inference ts into the data modelling process. This gure illustrates an abstraction of the part of the scientic process in which data are collected and modelled. In particular, this gure applies to pattern classication, learning, interpolation, etc. The two double-framed boxes denote the two steps which involve inference. It is only in those two steps that Bayes theorem can be used. Bayes does not tell you how to invent models, for example. The rst box, tting each model to the data, is the task of inferring what the model parameters might be given the model and the data. Bayesian methods may be used to nd the most probable parameter values, and error bars on those parameters. The result of applying Bayesian methods to this problem is often little dierent from the answers given by orthodox statistics. The second inference task, model comparison in the light of the data, is where Bayesian methods are in a class of their own. This second inference problem requires a quantitative Occams razor to penalise over-complex models. Bayesian methods can assign objective preferences to the alternative models in a way that automatically embodies Occams razor.
Gather DATA
Create new models T
Choose what data to gather next
Decide whether to create new models
Bayesian methods and data analysis
Let us now relate the discussion above to real problems in data analysis. There are countless problems in science, statistics and technology which require that, given a limited data set, preferences be assigned to alternative models of diering complexities. For example, two alternative hypotheses accounting for planetary motion are Mr. Inquisitions geocentric model based on epicycles, and Mr. Copernicuss simpler model of the solar system with the sun at the centre. The epicyclic model ts data on planetary motion at least as well as the Copernican model, but does so using more parameters. Coincidentally for Mr. Inquisition, two of the extra epicyclic parameters for every planet are found to be identical to the period and radius of the suns cycle around the earth. Intuitively we nd Mr. Copernicuss theory more probable.
The mechanism of the Bayesian razor: the evidence and the Occam factor
Two levels of inference can often be distinguished in the process of data modelling. At the rst level of inference, we assume that a particular model is true, and we t that model to the data, i.e., we infer what values its free parameters should plausibly take, given the data. The results of this inference are often summarized by the most probable parameter values, and error bars on those parameters. This analysis is repeated for each model. The second level of inference is the task of model comparison. Here we wish to compare the models in the light of the data, and assign some sort of preference or ranking to the alternatives.
Note that both levels of inference are distinct from decision theory. The goal of inference is, given a dened hypothesis space and a particular data set, to assign probabilities to hypotheses. Decision theory typically chooses between alternative actions on the basis of these probabilities so as to minimize the expectation of a loss function. This chapter concerns inference alone and no loss functions are involved. When we discuss model comparison, this should not be construed as implying model choice . Ideal Bayesian predictions do not involve choice between models; rather, predictions are made by summing over all the alternative models, weighted by their probabilities.
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
28.1: Occams razor Bayesian methods are able consistently and quantitatively to solve both the inference tasks. There is a popular myth that states that Bayesian methods only dier from orthodox statistical methods by the inclusion of subjective priors, which are dicult to assign, and which usually dont make much difference to the conclusions. It is true that, at the rst level of inference, a Bayesians results will often dier little from the outcome of an orthodox attack. What is not widely appreciated is how a Bayesian performs the second level of inference; this section will therefore focus on Bayesian model comparison. Model comparison is a dicult task because it is not possible simply to choose the model that ts the data best: more complex models can always t the data better, so the maximum likelihood model choice would lead us inevitably to implausible, over-parameterized models, which generalize poorly. Occams razor is needed. Let us write down Bayes theorem for the two levels of inference described above, so as to see explicitly how Bayesian model comparison works. Each model Hi is assumed to have a vector of parameters w. A model is dened by a collection of probability distributions: a prior distribution P (w|Hi ), which states what values the models parameters might be expected to take; and a set of conditional distributions, one for each value of w, dening the predictions P (D |w, Hi ) that the model makes about the data D . 1. Model tting. At the rst level of inference, we assume that one model, the ith, say, is true, and we infer what the models parameters w might be, given the data D . Using Bayes theorem, the posterior probability of the parameters w is: P (w|D, Hi ) = that is, Posterior = Likelihood Prior . Evidence P (D |w, Hi )P (w|Hi ) , P (D |Hi ) (28.4)
347
The normalizing constant P (D |Hi ) is commonly ignored since it is irrelevant to the rst level of inference, i.e., the inference of w; but it becomes important in the second level of inference, and we name it the evidence for Hi . It is common practice to use gradient-based methods to nd the maximum of the posterior, which denes the most probable value for the parameters, wMP ; it is then usual to summarize the posterior distribution by the value of wMP , and error bars or condence intervals on these best t parameters. Error bars can be obtained from the curvature of the posterior; evaluating the Hessian at wMP , A = ln P (w|D, Hi )|wMP , and Taylor-expanding the log posterior probability with w = w w MP: P (w|D, Hi ) P (wMP |D, Hi ) exp 1/2wTAw , (28.5)
we see that the posterior can be locally approximated as a Gaussian with covariance matrix (equivalent to error bars) A1 . [Whether this approximation is good or not will depend on the problem we are solving. Indeed, the maximum and mean of the posterior distribution have no fundamental status in Bayesian inference they both change under nonlinear reparameterizations. Maximization of a posterior probability is only useful if an approximation like equation (28.5) gives a good summary of the distribution.]
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
348
28 Model Comparison and Occams Razor
Figure 28.5. The Occam factor. This gure shows the quantities that determine the Occam factor for a hypothesis Hi having a single parameter w. The prior distribution (solid line) for the parameter has width w . The posterior distribution (dashed line) has a single peak at w MP with characteristic width w|D . The Occam factor is w|D . w|D P (wMP |Hi ) = w
P (w|D, Hi ) w|D P (w|Hi ) w wMP w
2. Model comparison. At the second level of inference, we wish to infer which model is most plausible given the data. The posterior probability of each model is: P (Hi |D ) P (D |Hi )P (Hi ). (28.6) Notice that the data-dependent term P (D |Hi ) is the evidence for Hi , which appeared as the normalizing constant in (28.4). The second term, P (Hi ), is the subjective prior over our hypothesis space, which expresses how plausible we thought the alternative models were before the data arrived. Assuming that we choose to assign equal priors P (Hi ) to the alternative models, models Hi are ranked by evaluating the evidence. The normalizing constant P (D ) = i P (D |Hi )P (Hi ) has been omitted from equation (28.6) because in the data modelling process we may develop new models after the data have arrived, when an inadequacy of the rst models is detected, for example. Inference is open ended: we continually seek more probable models to account for the data we gather. To repeat the key idea: to rank alternative models Hi , a Bayesian evaluates the evidence P (D |Hi ). This concept is very general: the evidence can be evaluated for parametric and non-parametric models alike; whatever our data modelling task, a regression problem, a classication problem, or a density estimation problem, the evidence is a transportable quantity for comparing alternative models. In all these cases the evidence naturally embodies Occams razor.
Evaluating the evidence
Let us now study the evidence more closely to gain insight into how the Bayesian Occams razor works. The evidence is the normalizing constant for equation (28.4): P (D |Hi ) = P (D |w, Hi )P (w|Hi ) dw. (28.7)
For many problems the posterior P (w|D, Hi ) P (D |w, Hi )P (w|Hi ) has a strong peak at the most probable parameters wMP (gure 28.5). Then, taking for simplicity the one-dimensional case, the evidence can be approximated, using Laplaces method, by the height of the peak of the integrand P (D |w, Hi )P (w|Hi ) times its width, w|D : P (D |Hi ) Evidence P (D |wMP , Hi ) P (wMP |Hi ) w|D . Best t likelihood Occam factor (28.8)
Thus the evidence is found by taking the best t likelihood that the model can achieve and multiplying it by an Occam factor, which is a term with magnitude less than one that penalizes Hi for having the parameter w.
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
28.1: Occams razor
349
Figure 28.6. A hypothesis space consisting of three exclusive models, each having one parameter w, and a one-dimensional data set D. The data set is a single measured value which diers from the parameter w by a small amount of additive noise. Typical samples from the joint distribution P (D, w, H) are shown by dots. (NB, these are not data points.) The observed data set is a single particular value for D shown by the dashed horizontal line. The dashed curves below show the posterior probability of w for each model given this data set (c.f. gure 28.3). The evidence for the dierent models is obtained by marginalizing onto the D axis at the left-hand side (c.f. gure 28.5).
P (D|H3 ) P (D|H2 ) P (D|H1 )
D
P (w|D, H1 )
P (w|D, H2 ) P (w|D, H3 ) D P (w|H3 ) w|D w w P (w|H2 )
P (w|H1 )
w
w
Interpretation of the Occam factor
The quantity w|D is the posterior uncertainty in w. Suppose for simplicity that the prior P (w|Hi ) is uniform on some large interval w , representing the range of values of w that were possible a priori, according to Hi (gure 28.5). Then P (wMP |Hi ) = 1/w , and Occam factor = w|D , w (28.9)
i.e., the Occam factor is equal to the ratio of the posterior accessible volume of Hi s parameter space to the prior accessible volume, or the factor by which Hi s hypothesis space collapses when the data arrive. The model Hi can be viewed as consisting of a certain number of exclusive submodels, of which only one survives when the data arrive. The Occam factor is the inverse of that number. The logarithm of the Occam factor is a measure of the amount of information we gain about the models parameters when the data arrive. A complex model having many parameters, each of which is free to vary over a large range w , will typically be penalized by a stronger Occam factor than a simpler model. The Occam factor also penalizes models that have to be nely tuned to t the data, favouring models for which the required precision of the parameters w|D is coarse. The magnitude of the Occam factor is thus a measure of complexity of the model; it relates to the complexity of the predictions that the model makes in data space. This depends not only on the number of parameters in the model, but also on the prior probability that the model assigns to them. Which model achieves the greatest evidence is determined by a trade-o between minimizing this natural complexity measure and minimizing the data mist. In contrast to alternative measures of model complexity, the Occam factor for a model is straightforward to evaluate: it simply depends on the error bars on the parameters, which we already evaluated when tting the model to the data. Figure 28.6 displays an entire hypothesis space so as to illustrate the various probabilities in the analysis. There are three models, H1 , H2 , H3 , which
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
350
28 Model Comparison and Occams Razor
have equal prior probabilities. Each model has one parameter w (each shown on a horizontal axis), but assigns a dierent prior range W to that parameter. H3 is the most exible or complex model, assigning the broadest prior range. A one-dimensional data space is shown by the vertical axis. Each model assigns a joint probability distribution P (D, w|Hi ) to the data and the parameters, illustrated by a cloud of dots. These dots represent random samples from the full probability distribution. The total number of dots in each of the three model subspaces is the same, because we assigned equal prior probabilities to the models. When a particular data set D is received (horizontal line), we infer the posterior distribution of w for a model (H3 , say) by reading out the density along that horizontal line, and normalizing. The posterior probability P (w|D, H3 ) is shown by the dotted curve at the bottom. Also shown is the prior distribution P (w|H3 ) (c.f. gure 28.5). [In the case of model H1 which is very poorly matched to the data, the shape of the posterior distribution will depend on the details of the tails of the prior P (w|H1 ) and the likelihood P (D |w, H1 ); the curve shown is for the case where the prior falls o more strongly.] We obtain gure 28.3 by marginalizing the joint distributions P (D, w|Hi ) onto the D axis at the left-hand side. For the data set D shown by the dotted horizontal line, the evidence P (D |H3 ) for the more exible model H3 has a smaller value than the evidence for H2 . This is because H3 placed less predictive probability (fewer dots) on that line. In terms of the distributions over w, model H3 has smaller evidence because the Occam factor w|D /w is smaller for H3 than for H2 . The simplest model H1 has the smallest evidence of all, because the best t that it can achieve to the data D is very poor. Given this data set, the most probable model is H2 .
Occam factor for several parameters
If the posterior is well approximated by a Gaussian, then the Occam factor is obtained from the determinant of the corresponding covariance matrix (c.f. equation (28.8) and Chapter 27): P (D |Hi ) Evidence P (D |wMP , Hi ) P (wMP |Hi ) det 2 (A/2 ), Best t likelihood Occam factor
1
(28.10)
where A = ln P (w|D, Hi ), the Hessian which we evaluated when we calculated the error bars on wMP (equation 28.5 and Chapter 27). As the amount of data collected increases, this Gaussian approximation is expected to become increasingly accurate. In summary, Bayesian model comparison is a simple extension of maximum likelihood model selection: the evidence is obtained by multiplying the best t likelihood by the Occam factor. To evaluate the Occam factor we need only the Hessian A, if the Gaussian approximation is good. Thus the Bayesian method of model comparison by evaluating the evidence is no more demanding computationally than the task of nding for each model the best t parameters and their error bars.
28.2
Example
Lets return to the example that opened this chapter. Are there one or two boxes behind the tree in gure 28.1? Why do coincidences make us suspicious? Lets assume the image of the area round the trunk and box has a size of 50 pixels, that the trunk is 10 pixels wide, and that 16 dierent colours of
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
28.3: Minimum description length (MDL) boxes can be distinguished. The theory H1 that says there is one box near the trunk has four free parameters: three coordinates dening the top three edges of the box, and one parameter giving the boxs colour. (If boxes could levitate, there would be ve free parameters.) The theory H2 that says there are two boxes near the trunk has eight free parameters (twice four), plus a ninth, a binary variable that indicates which of the two boxes is the closest to the viewer. What is the evidence for each model? Well do H1 rst. We need a prior on the parameters to evaluate the evidence. For convenience, lets work in pixels. Lets assign a separable prior to the horizontal location of the box, its width, its height, and its colour. The height could have any of, say, 20 distinguishable values, so could the width, and so could the location. The colour could have any of 16 values. Well put uniform priors over these variables. Well ignore all the parameters associated with other objects in the image, since they dont come into the model comparison between H1 and H2 . The evidence is P (D |H1 ) = 1111 20 20 20 16 (28.11)
351
1?
or 2?
Figure 28.7. How many boxes are behind the tree?
since only one setting of the parameters ts the data, and it predicts the data perfectly. As for model H2 , six of its nine parameters are well-determined, and three of them are partly-constrained by the data. If the left-hand box is furthest away, for example, then its width is at least 8 pixels and at most 30; if its the closer of the two boxes, then its width is between 8 and 18 pixels. (Im assuming here that the visible portion of the left-hand box is about 8 pixels wide.) To get the evidence we need to sum up the prior probabilities of all viable hypotheses. To do an exact calculation, we need to be more specic about the data and the priors, but lets just get the ballpark answer, assuming that the two unconstrained real variables have half their values available, and that the binary variable is completely undetermined. (As an exercise, you can make an explicit model and work out the exact answer.) P (D |H2 ) 1 1 10 1 1 1 10 1 2 . 20 20 20 16 20 20 20 16 2 (28.12)
Thus the posterior probability ratio is (assuming equal prior probability): P (D |H1 )P (H1 ) P (D |H2 )P (H2 ) = 1
1 10 10 1 20 20 20 16
(28.13) 1000/1. (28.14)
= 20 2 2 16
So the data are roughly 1000 to 1 in favour of the simpler hypothesis. The four factors can be interpreted in terms of Occam factors. The more complex model has four extra parameters for sizes and colours three for sizes, and one for colour. It has to pay two big Occam factors (1/20 and 1/16) for the highly suspicious coincidences that the two box heights match exactly and the two colours match exactly; and it also pays two lesser Occam factors for the two lesser coincidences that both boxes happened to have one of their edges conveniently hidden behind a tree or behind each other.
28.3
Minimum description length (MDL)
A complementary view of Bayesian model comparison is obtained by replacing probabilities of events by the lengths in bits of messages that communicate
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
352
H1 : H2 : H3 :
L(H1 ) L(w(1) |H1 ) L(D|w(1) , H1 ) L(D|w(2) , H2 ) L(D|w(3) , H3 )
28 Model Comparison and Occams Razor
Figure 28.8. A popular view of model comparison by minimum description length. Each model Hi communicates the data D by sending the identity of the model, sending the best t parameters of the model w , then sending the data relative to those parameters. As we proceed to more complex models the length of the parameter message increases. On the other hand, the length of the data message decreases, because a complex model is able to t the data better, making the residuals smaller. In this example the intermediate model H2 achieves the optimum trade-o between these two trends.
L(H2 ) L(H3 )
L(w(2) |H2 ) L(w(3) |H3 )
the events without loss to a receiver. Message lengths L(x) correspond to a probabilistic model over events x via the relations: P (x) = 2L(x) , L(x) = log2 P (x). (28.15)
The MDL principle (Wallace and Boulton, 1968) states that one should prefer models that can communicate the data in the smallest number of bits. Consider a two-part message that states which model, H, is to be used, and then communicates the data D within that model, to some pre-arranged precision D . This produces a message of length L(D, H) = L(H)+ L(D |H). The lengths L(H) for dierent H dene an implicit prior P (H) over the alternative models. Similarly L(D |H) corresponds to a density P (D |H). Thus, a procedure for assigning message lengths can be mapped onto posterior probabilities: L(D, H) = log P (H) log(P (D |H)D ) = log P (H|D ) + const. (28.16) (28.17)
In principle, then, MDL can always be interpreted as Bayesian model comparison and vice versa. However, this simple discussion has not addressed how one would actually evaluate the key data-dependent term L(D |H), which corresponds to the evidence for H. Often, this message is imagined as being subdivided into a parameter block and a data block (gure 28.8). Models with a small number of parameters have only a short parameter block but do not t the data well, and so the data message (a list of large residuals) is long. As the number of parameters increases, the parameter block lengthens, and the data message becomes shorter. There is an optimum model complexity (H2 in the gure) for which the sum is minimized. This picture glosses over some subtle issues. We have not specied the precision to which the parameters w should be sent. This precision has an important eect (unlike the precision D to which real-valued data D are sent, which, assuming D is small relative to the noise level, just introduces an additive constant). As we decrease the precision to which w is sent, the parameter message shortens, but the data message typically lengthens because the truncated parameters do not match the data so well. There is a non-trivial optimal precision. In simple Gaussian cases it is possible to solve for this optimal precision (Wallace and Freeman, 1987), and it is closely related to the posterior error bars on the parameters, A 1 , where A = ln P (w|D, H). It turns out that the optimal parameter message length is virtually identical to the log of the Occam factor in equation (28.10). (The random element involved in parameter truncation means that the encoding is slightly sub-optimal.) With care, therefore, one can replicate Bayesian results in MDL terms. Although some of the earliest work on complex model comparison involved the MDL framework (Patrick and Wallace, 1982), MDL has no apparent advantages over the direct probabilistic approach. MDL does have its uses as a pedagogical tool. The description length concept is useful for motivating prior probability distributions. Also, dierent ways of breaking down the task of communicating data using a model can give helpful insights into the modelling process, as will now be illustrated.
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
28.3: Minimum description length (MDL) On-line learning and cross-validation. In cases where the data consist of a sequence of points D = t(1) , t(2) , . . . , t(N ) , the log evidence can be decomposed as a sum of on-line predictive performances: log P (D |H) = log P (t(1) |H) + log P (t(2) |t(1) , H)
353
+ log P (t(3) |t(1) , t(2) , H) + + log P (t(N ) |t(1) . . . t(N1) , H). (28.18)
This decomposition can be used to explain the dierence between the evidence and leave-one-out cross-validation as measures of predictive ability. Cross-validation examines the average value of just the last term, log P (t(N ) |t(1) . . . t(N1) , H), under random re-orderings of the data. The evidence, on the other hand, sums up how well the model predicted all the data, starting from scratch.
The bits back encoding method.
Another MDL thought experiment (Hinton and van Camp, 1993) involves incorporating random bits into our message. The data are communicated using a parameter block and a data block. The parameter vector sent is a random sample from the posterior distribution P (w|D, H) = P (D |w, H)P (w|H)/P (D |H). This sample w is sent to an arbitrary small granularity w using a message length L(w|H) = log[P (w|H) w]. The data are encoded relative to w with a message of length L(D |w, H) = log[P (D |w, H)D ]. Once the data message has been received, the random bits used to generate the sample w from the posterior can be deduced by the receiver. The number of bits so recovered is log[P (w|D, H) w]. These recovered bits need not count towards the message length, since we might use some other optimally encoded message as a random bit string, thereby communicating that message at the same time. The net description cost is therefore: L(w|H) + L(D |w, H) Bits back = log P (w|H)P (D |w, H)D P (w|D, H) = log P (D |H) log D. (28.19)
Thus this thought experiment has yielded the optimal description length. Bits back encoding has been turned into a practical compression method for data modelled with latent variable models by Frey (1998).
Further reading
Bayesian methods are introduced and contrasted with sampling-theory statistics in (Jaynes, 1983; Gull, 1988; Loredo, 1990). The Bayesian Occams razor is demonstrated on model problems in (Gull, 1988; MacKay, 1992a). Useful textbooks are (Box and Tiao, 1973; Berger, 1985). One debate worth understanding is the question of whether its permissible to use improper priors in Bayesian inference (Dawid et al., 1996). If we want to do model comparison (as discussed in this chapter), it is essential to use proper priors otherwise the evidences and the Occam factors are meaningless. Only when one has no intention to do model comparison may it be safe to use improper priors, and even in such cases there are pitfalls, as Dawid et al. explain. I would agree with their advice to always use proper priors, tempered by an encouragement to be smart when making calculations, recognizing opportunities for approximation.
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
354
28 Model Comparison and Occams Razor
28.4
Exercises
Exercise 28.1.[3 ] Random variables x come independently from a probability distribution P (x). According to model H0 , P (x) is a uniform distribution 1 x (1, 1). (28.20) P (x|H0 ) = 2 According to model H1 , P (x) is a nonuniform distribution with an unknown parameter m (1, 1): P (x|m, H1 ) = 1 (1 + mx) 2 x (1, 1). (28.21)
P (x|H0 ) 1 1x
P (x|m = 0.4, H1 ) 1 1x
Given the data D = {0.3, 0.5, 0.7, 0.8, 0.9}, what is the evidence for H0 and H1 ? Exercise 28.2.[3 ] Datapoints (x, t) are believed to come from a straight line. The experimenter chooses x, and t is Gaussian-distributed about y = w 0 + w1 x (28.22)
x
y = w0 + w1 x
2 with variance . According to model H1 , the straight line is horizontal, so w1 = 0. According to model H2 , w1 is a parameter with prior distribution Normal(0, 1). Both models assign a prior distribution Normal(0, 1) to w0 . Given the data set D = {(8, 8), (2, 10), (6, 11)}, and assuming the noise level is = 1, what is the evidence for each model?
Exercise 28.3.[3 ] A six-sided die is rolled 30 times and the numbers of times each face came up were F = {3, 3, 2, 2, 9, 11}. What is the probability that the die is a perfectly fair die (H0 ), assuming the alternative hypothesis H1 says that the die has a biased distribution p, and the prior density for p is uniform over the simplex pi 0, i pi = 1? Solve this problem two ways: exactly, using the helpful Dirichlet formulae (23.30, 23.31), and approximately, using Laplaces method. Notice that your choice of basis for the Laplace approximation is important. See MacKay (1998a) for discussion of this exercise.
Exercise 28.4.[3 ] The inuence of race on the imposition of the death penalty for murder in America has been much studied. The following three-way table classies 326 cases in which the defendant was convicted of murder. The three variables are the defendants race, the victims race, and whether the defendant was sentenced to death. (Data from M. Radelet, Racial characteristics and imposition of the death penalty, American Sociological Review, 46 (1981), pp.918-927.) White defendant Death penalty Yes No White victim Black victim 19 0 132 9 White victim Black victim Black defendant Death penalty Yes No 11 6 52 97
It seems that the death penalty was applied much more often when the victim was white then when the victim was black. When the victim was white 14% of defendants got the death penalty, but when the victim was black 6% of defendants got the death penalty. [Incidentally, these data
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
28.4: Exercises provide an example of a phenomenon known as Simpsons paradox: a higher fraction of white defendants are sentenced to death overall, but in cases involving black victims a higher fraction of black defendants are sentenced to death and in cases involving white victims a higher fraction of black defendants are sentenced to death.] Quantify the evidence for the four alternative hypotheses shown in gure 28.9. I should mention that I dont believe any of these models is adequate: several additional variables are important in murder cases, such as whether the victim and murderer knew each other, whether the murder was premeditated, and whether the defendant had a prior criminal record; none of these variables is included in the table. So this is an academic exercise in model comparison rather than a serious study of racial bias in the state of Florida. The hypotheses are shown as graphical models, with arrows showing dependencies between the variables v (victim race), m (murderer race), and d (whether death penalty given). Model H00 has only one free parameter, the probability of receiving the death penalty; model H11 has four such parameters, one for each state of the variables v and m. Assign uniform priors to these variables. How sensitive are the conclusions to the choice of prior?
H11 v d H01 v d m v d m v d H00 m H10 m
355
Figure 28.9. Four hypotheses concerning the dependence of the imposition of the death penalty d on the race of the victim v and the race of the convicted murderer m. H01 , for example, asserts that the probability of receiving the death penalty does depend on the murderers race, but not on the victims.
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
About Chapter 29
The last couple of chapters have assumed that a Gaussian approximation to the probability distribution we are interested in is adequate. What if it is not? We have already seen an example clustering where the likelihood function is multimodal, and has nasty unboundedly high spikes in certain locations in the parameter space; so maximizing the posterior probability and tting a Gaussian is not always going to work. This diculty with Laplaces method is one motivation for being interested in Monte Carlo methods. In fact, Monte Carlo methods provide a general-purpose set of tools with applications in Bayesian data modelling and many other elds. This chapter describes a sequence of methods: importance sampling, rejection sampling, the Metropolis method, Gibbs sampling and slice sampling. For each method, we discuss whether the method is expected to be useful for high-dimensional problems such as arise in inference with graphical models. [A graphical model is a probabilistic model in which dependencies and independencies of variables are represented by edges in a graph whose nodes are the variables.] Along the way, the terminology of Markov chain Monte Carlo methods is presented. The subsequent chapter gives a discussion of advanced methods for reducing random walk behaviour. For details of Monte Carlo methods, theorems and proofs and a full list of references, the reader is directed to Neal (1993b), Gilks et al. (1996), and Tanner (1996). In this chapter I will use the word sample in the following sense: a sample from a distribution P (x) is a single realization x whose probability distribution is P (x). This contrasts with the alternative usage in statistics, where sample refers to a collection of realizations {x}. When we discuss transition probability matrices, I will use a right-multiplication convention: I like my matrices to act to the right, preferring u = Mv to uT = vTMT. (29.2) A transition probability matrix Tij or Ti|j species the probability, given the current state is j , of making the transition from j to i. The columns of T are probability vectors. If we write down a transition probability density, we use the same convention for the order of its arguments: T (x ; x) is a transition probability density from x to x . This unfortunately means that you have to get used to reading from right to left the sequence xyz has probability T (z ; y )T (y ; x) (x). (29.1)
356
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
29
Monte Carlo Methods
29.1 The problems to be solved
Monte Carlo methods are computational techniques that make use of random numbers. The aims of Monte Carlo methods are to solve one or both of the following problems. Problem 1: to generate samples {x(r) }R from a given probability distribur =1 tion P (x). Problem 2: to estimate expectations of functions under this distribution, for example = (x) dN x P (x)(x). (29.3)
The probability distribution P (x), which we call the target density, might be a distribution from statistical physics or a conditional distribution arising in data modelling for example, the posterior probability of a models parameters given some observed data. We will generally assume that x is an N -dimensional vector with real components xn , but we will sometimes consider discrete spaces also. Simple examples of functions (x) whose expectations we might be interested in include the rst and second moments of quantities that we wish to predict, from which we can compute means and variances; for example if some quantity t depends on x, we can nd the mean and variance of t under P (x) by nding the expectations of the functions 1 (x) = t(x) and 2 (x) = (t(x))2 , 1 E [1 (x)] and 2 E [2 (x)], then using It is assumed that P (x) is suciently complex that we cannot evaluate these expectations by exact methods; so we are interested in Monte Carlo methods. We will concentrate on the rst problem (sampling), because if we have solved it, then we can solve the second problem by using the random samples {x(r) }R to give the estimator r =1 1 R (x(r) ).
r
(29.4) (29.5)
t = 1 and var(t) = 2 2 . 1
(29.6)
If the vectors {x(r) }R are generated from P (x) then the expectation of is r =1 will decrease . Also, as the number of samples R increases, the variance of 2 as /R, where 2 is the variance of , 2 = dN x P (x)((x) )2 . 357 (29.7)
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
358
3 2.5 2 1.5 1 0.5 0 P*(x) 3 2.5 2 1.5 1 0.5 0 -4 -2 0 2 4 P*(x)
29 Monte Carlo Methods
Figure 29.1. (a) The function P (x) = exp 0.4(x 0.4)2 0.08x4 . How to draw samples from this density? (b) The function P (x) evaluated at a discrete set of uniformly spaced points {xi }. How to draw samples from this discrete distribution?
4
(a)
(b)
-4
-2
0
2
This is one of the important properties of Monte Carlo methods. The accuracy of the Monte Carlo estimate (29.6) depends only on the variance of , not on the dimensionality of the space sampled. To be precise, the variance of goes as 2 /R. So regardless of the dimensionality of x, it may be that as few as a dozen independent samples {x(r) } suce to estimate satisfactorily. We will nd later, however, that high dimensionality can cause other diculties for Monte Carlo methods. Obtaining independent samples from a given distribution P (x) is often not easy.
Why is sampling from P (x) hard?
We will assume that the density from which we wish to draw samples, P (x), can be evaluated, at least to within a multiplicative constant; that is, we can evaluate a function P (x) such that P (x) = P (x)/Z. (29.8)
If we can evaluate P (x), why can we not easily solve problem 1? Why is it in general dicult to obtain samples from P (x)? There are two diculties. The rst is that we typically do not know the normalizing constant Z= dN x P (x). (29.9)
The second is that, even if we did know Z , the problem of drawing samples from P (x) is still a challenging one, especially in high-dimensional spaces, because there is no obvious way to sample from P without enumerating most or all of the possible states. Correct samples from P will by denition tend to come from places in x-space where P (x) is big; how can we identify those places where P (x) is big, without evaluating P (x) everywhere? There are only a few high-dimensional densities from which it is easy to draw samples, for example the Gaussian distribution. Let us start with a simple one-dimensional example. Imagine that we wish to draw samples from the density P (x) = P (x)/Z where P (x) = exp 0.4(x 0.4)2 0.08x4 , x (, ). (29.10)
We can plot this function (gure 29.1a). But that does not mean we can draw samples from it. To start with, we dont know the normalizing constant Z .
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
29.1: The problems to be solved To give ourselves a simpler problem, we could discretize the variable x and ask for samples from the discrete probability distribution over a nite set of uniformly spaced points {xi } (gure 29.1b). How could we solve this problem? If we evaluate p = P (xi ) at each point xi , we can compute i Z=
i
359
p i
(29.11)
and pi = p /Z i (29.12) and we can then sample from the probability distribution {p i } using various methods based on a source of random bits (see section 6.3). But what is the cost of this procedure, and how does it scale with the dimensionality of the space, N ? Let us concentrate on the initial cost of evaluating Z (29.11). To compute Z we have to visit every point in the space. In gure 29.1b there are 50 uniformly spaced points in one dimension. If our system had N dimensions, N = 1000 say, then the corresponding number of points would be 501000 , an unimaginable number of evaluations of P . Even if each component xn took only two discrete values, the number of evaluations of P would be 21000 , a number that is still horribly huge. If every electron in the universe (there are about 2266 of them) were a 1000 gigahertz computer that could evaluate P for a trillion (240 ) states every second, and if we ran those 2266 computers for a time equal to the age of the universe (258 seconds), they would still only visit 2364 states. Wed have to wait for more than 2636 10190 universe ages to elapse before all 21000 states had been visited. Systems with 21000 states are two a penny. One example is a collection of 1000 spins such as a 30 30 fragment of an Ising model whose probability distribution is proportional to P (x) = exp[E (x)] where xn {1} and E ( x) = 1 Jmn xm xn + 2 m,n H n xn .
n
(29.13)
Translation for American readers: such systems are a dime a dozen; incidentally, this equivalence (10c = 6p) shows that the correct exchange rate is 1.00 = $1.67.
(29.14)
The energy function E (x) is readily evaluated for any x. But if we wish to evaluate this function at all states x, the computer time required would be 21000 function evaluations. The Ising model is a simple model which has been around for a long time, but the task of generating samples from the distribution P (x) = P (x)/Z is still an active research area; the rst exact samples from this distribution were created in the pioneering work of Propp and Wilson (1996), as well describe in Chapter 32.
P (x)
A useful analogy
Imagine the tasks of drawing random water samples from a lake and nding the average plankton concentration (gure 29.2). The depth of the lake at x = (x, y ) is P (x), and we assert (in order to make the analogy work) that the plankton concentration is a function of x, (x). The required average concentration is an integral like (29.3), namely = (x) 1 Z dN x P (x)(x), (29.15)
Figure 29.2. A lake whose depth at x = (x, y ) is P (x).
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
360 where Z = dx dy P (x) is the volume of the lake. You are provided with a boat, a satellite navigation system, and a plumbline. Using the navigator, you can can take your boat to any desired location x on the map; using the plumbline you can measure P (x) at that point. You can also measure the plankton concentration there. Problem 1 is to draw 1 cm3 water samples at random from the lake, in such a way that each sample is equally likely to come from any point within the lake. Problem 2 is to nd the average plankton concentration. These are dicult problems to solve because at the outset we know nothing about the depth P (x). Perhaps much of the volume of the lake is contained in narrow, deep underwater canyons (gure 29.3), in which case, to correctly sample from the lake and correctly estimate our method must implicitly discover the canyons and nd their volume relative to the rest of the lake. Dicult problems, yes; nevertheless, well see that clever Monte Carlo methods can solve them.
29 Monte Carlo Methods
Figure 29.3. A slice through a lake that includes some canyons.
Uniform sampling
Having accepted that we cannot exhaustively visit every location x in the state space, we might consider trying to solve the second problem (estimating the expectation of a function (x)) by drawing random samples {x(r) }R r =1 uniformly from the state space and evaluating P (x) at those points. Then we could introduce a normalizing constant ZR , dened by
R
ZR =
r =1
P (x(r) ),
(29.16)
and estimate = dN x (x)P (x) by =
R
(x(r) )
r =1
P (x(r) ) . ZR
(29.17)
Is anything wrong with this strategy? Well, it depends on the functions (x) and P (x). Let us assume that (x) is a benign, smoothly varying function and concentrate on the nature of P (x). As we learnt in Chapter 4, a highdimensional distribution is often concentrated in a small region of the state space known as its typical set T , whose volume is given by |T | 2H (X) , where H (X) is the entropy of the probability distribution P (x). If almost all the probability mass is located in the typical set and (x) is a benign function, the value of = dN x (x)P (x) will be principally determined by the values that (x) takes on in the typical set. So uniform sampling will only stand a chance of giving a good estimate of if we make the number of samples R suciently large that we are likely to hit the typical set at least once or twice. So, how many samples are required? Let us take the case of the Ising model again. (Strictly, the Ising model may not be a good example, since it doesnt necessarily have a typical set, as dened in Chapter 4; the denition of a typical set was that all states had log probability close to the entropy, which for an Ising model would mean that the energy is very close to the mean energy; but in the vicinity of phase transitions, the variance of energy, also known as the heat capacity, may diverge, which means that the energy of a random state is not necessarily expected to be very close to the mean energy.) The total size of the state space is 2N states, and the typical set has size 2H . So each sample has a chance of 2H /2N of falling in the typical set. The number of samples required to hit the typical set once is thus of order Rmin 2N H . (29.18)
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
29.2: Importance sampling
64 log(2)
361
Figure 29.4. (a) Entropy of a 64-spin Ising model as a function of temperature. (b) One state of a 1024-spin Ising model.
Entropy
0 0 1 2 3 4 Temperature 5 6
(a)
(b)
So, what is H ? At high temperatures, the probability distribution of an Ising model tends to a uniform distribution and the entropy tends to H max = N bits, which means Rmin is of order 1. Under these conditions, uniform sampling may well be a satisfactory technique for estimating . But high temperatures are not of great interest. Considerably more interesting are intermediate temperatures such as the critical temperature at which the Ising model melts from an ordered phase to a disordered phase. The critical temperature of an innite Ising model, at which it melts, is c = 2.27. At this temperature the entropy of an Ising model is roughly N/2 bits (gure 29.4). For this probability distribution the number of samples required simply to hit the typical set once is of order Rmin 2N N/2 = 2N/2 , (29.19) which for N = 1000 is about 10150 . This is roughly the square of the number of particles in the universe. Thus uniform sampling is utterly useless for the study of Ising models of modest size. And in most high-dimensional problems, if the distribution P (x) is not actually uniform, uniform sampling is unlikely to be useful.
Overview
Having established that drawing samples from a high-dimensional distribution P (x) = P (x)/Z is dicult even if P (x) is easy to evaluate, we will now study a sequence of more sophisticated Monte Carlo methods: importance sampling, rejection sampling, the Metropolis method, Gibbs sampling, and slice sampling.
29.2
Importance sampling
Importance sampling is not a method for generating samples from P (x) (problem 1); it is just a method for estimating the expectation of a function (x) (problem 2). It can be viewed as a generalization of the uniform sampling method. For illustrative purposes, let us imagine that the target distribution is a one-dimensional density P (x). Let us assume that we are able to evaluate this density at any chosen point x, at least to within a multiplicative constant; thus we can evaluate a function P (x) such that P (x) = P (x)/Z. (29.20)
But P (x) is too complicated a function for us to be able to sample from it directly. We now assume that we have a simpler density Q(x) from which we can generate samples and which we can evaluate to within a multiplicative
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
362 constant (that is, we can evaluate Q (x), where Q(x) = Q (x)/ZQ ). An example of the functions P , Q and is shown in gure 29.5. We call Q the sampler density. In importance sampling, we generate R samples {x(r) }R from Q(x). If r =1 these points were samples from P (x) then we could estimate by equation (29.6). But when we generate samples from Q, values of x where Q(x) is greater than P (x) will be over-represented in this estimator, and points where Q(x) is less than P (x) will be under-represented. To take into account the fact that we have sampled from the wrong distribution, we introduce weights wr P (x(r) ) Q (x(r) ) (29.21)
29 Monte Carlo Methods
P ( x)
Q ( x)
( x)
x
which we use to adjust the importance of each point in our estimator thus:
r
wr (x(r) ) . r wr
(29.22)
Figure 29.5. Functions involved in importance sampling. We wish to estimate the expectation of (x) under P (x) P (x). We can generate samples from the simpler distribution Q(x) Q (x). We can evaluate Q and P at any point.
Exercise 29.1.[2, p.384] Prove that, if Q(x) is non-zero for all x where P (x) is non-zero, the estimator converges to , the mean value of (x), as R increases. What is the variance of this estimator, asymptotically? Hint: consider the statistics of the numerator and the denominator separately. Is the estimator an unbiased estimator for small R? A practical diculty with importance sampling is that it is hard to estimate how reliable the estimator is. The variance of the estimator is unknown beforehand, because it depends on an integral over x of a function involving P (x). And the variance of is hard to estimate, because the empirical variances of the quantities wr and wr (x(r) ) are not necessarily a good guide to the true variances of the numerator and denominator in equation (29.22). If the proposal density Q(x) is small in a region where |(x)P (x)| is large then it is quite possible, even after many points x(r) have been generated, that none of them will have fallen in that region. In this case the estimate of would be drastically wrong, and there would be no indication in the empirical variance that the true variance of the estimator is large.
-6.2 -6.2
-6.4
-6.4
-6.6
-6.6
-6.8
-6.8
-7
-7
Figure 29.6. Importance sampling in action: (a) using a Gaussian sampler density; (b) using a Cauchy sampler density. Vertical axis shows the estimate . The horizontal line indicates the true value of . Horizontal axis shows number of samples on a log scale.
-7.2 10 100 1000 10000 100000 1000000
-7.2 10 100 1000 10000 100000 1000000
(a)
(b)
Cautionary illustration of importance sampling
In a toy problem related to the modelling of amino acid probability distributions with a one-dimensional variable x, I evaluated a quantity of interest using importance sampling. The results using a Gaussian sampler and a Cauchy sampler are shown in gure 29.6. The horizontal axis shows the number of
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
29.2: Importance sampling samples on a log scale. In the case of the Gaussian sampler, after about 500 samples had been evaluated one might be tempted to call a halt; but evidently there are infrequent samples that make a huge contribution to , and the value of the estimate at 500 samples is wrong. Even after a million samples have been taken, the estimate has still not settled down close to the true value. In contrast, the Cauchy sampler does not suer from glitches; it converges (on the scale shown here) after about 5000 samples. This example illustrates the fact that an importance sampler should have heavy tails. Exercise 29.2.[2, p.385] Consider the situation where P (x) is multimodal, consisting of several widely separated peaks. (Probability distributions like this arise frequently in statistical data modelling.) Discuss whether it is a wise strategy to do importance sampling using a sampler Q(x) that is a unimodal distribution tted to one of these peaks. Assume that the function (x) whose mean is to be estimated is a smoothly varying function of x such as mx + c. Describe the typical evolution of the estimator as a function of the number of samples R.
-5 0 5
363
P(x) Q(x) phi(x)
10
15
Figure 29.7. A multimodal distribution P (x) and a unimodal sampler Q(x).
Importance sampling in many dimensions
We have already observed that care is needed in one-dimensional importance sampling problems. Is importance sampling a useful technique in spaces of higher dimensionality, say N = 1000? Consider a simple case-study where the target density P (x) is a uniform distribution inside a sphere, P (x) = where (x) ( the origin,
2 1 /2 , i xi )
1 0 (x) RP 0 (x) > RP ,
(29.23)
and the proposal density is a Gaussian centred on Normal(xi ; 0, 2 ).
i
Q(x) =
(29.24)
An importance sampling method will be in trouble if the estimator is dominated by a few large weights wr . What will be the typical range of values of the weights wr ? We know from our discussions of typical sequences in Part I see exercise 6.14 (p.124), for example that if is the distance from the origin of a sample from Q, the quantity 2 has a roughly Gaussian distribution with mean and standard deviation: (29.25) 2 N 2 2N 2 . Thus almost all samples from Q lie in a typical set with distance from the origin very close to N . Let us assume that is chosen such that the typical set of Q lies inside the sphere of radius RP . [If it does not, then the law of large numbers implies that almost all the samples generated from Q will fall outside RP and will have weight zero.] Then we know that most samples from Q will have a value of Q that lies in the range 2N N 1 . (29.26) exp 2 2 (22 )N/2 Thus the weights wr = P /Q will typically have values in the range N 2N (22 )N/2 exp . 2 2
(29.27)
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
364
(a)
29 Monte Carlo Methods
cQ (x) P ( x)
(b)
cQ (x) P ( x) u
x
x
x
So if we draw a hundred samples, what will the typical range of weights be? We can roughly estimate the ratio of the largest weight to the median weight by doubling the standard deviation in equation (29.27). The largest weight and the median weight will typically be in the ratio:
max wr 2N . = exp med wr
Figure 29.8. Rejection sampling. (a) The functions involved in rejection sampling. We desire samples from P (x) P (x). We are able to draw samples from Q(x) Q (x), and we know a value c such that cQ (x) > P (x) for all x. (b) A point (x, u) is generated at random in the lightly shaded area under the curve cQ (x). If this point also lies below P (x) then it is accepted.
(29.28)
In N = 1000 dimensions therefore, the largest weight after one hundred samples is likely to be roughly 1019 times greater than the median weight. Thus an importance sampling estimate for a high-dimensional problem will very likely be utterly dominated by a few samples with huge weights. In conclusion, importance sampling in high dimensions often suers from two diculties. First, we need to obtain samples that lie in the typical set of P , and this may take a long time unless Q is a good approximation to P . Second, even if we obtain samples in the typical set, the weights associated with those samples are likely to vary by large factors, because the probabilities of points in a typical set, although similar to each other, still dier by factors of order exp( N ), so the weights will too, unless Q is a near-perfect approximation to P.
29.3
Rejection sampling
We assume again a one-dimensional density P (x) = P (x)/Z that is too complicated a function for us to be able to sample from it directly. We assume that we have a simpler proposal density Q(x) which we can evaluate (within a multiplicative factor ZQ , as before), and from which we can generate samples. We further assume that we know the value of a constant c such that c Q (x) > P (x), for all x. (29.29)
A schematic picture of the two functions is shown in gure 29.8a. We generate two random numbers. The rst, x, is generated from the proposal density Q(x). We then evaluate c Q (x) and generate a uniformly distributed random variable u from the interval [0, c Q (x)]. These two random numbers can be viewed as selecting a point in the two-dimensional plane as shown in gure 29.8b. We now evaluate P (x) and accept or reject the sample x by comparing the value of u with the value of P (x). If u > P (x) then x is rejected; otherwise it is accepted, which means that we add x to our set of samples {x(r) }. The value of u is discarded. Why does this procedure generate samples from P (x)? The proposed point (x, u) comes with uniform probability from the lightly shaded area underneath the curve c Q (x) as shown in gure 29.8b. The rejection rule rejects all the points that lie above the curve P (x). So the points (x, u) that are accepted are uniformly distributed in the heavily shaded area under P (x). This implies
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
29.4: The MetropolisHastings method that the probability density of the x-coordinates of the accepted points must be proportional to P (x), so the samples must be independent samples from P (x). Rejection sampling will work best if Q is a good approximation to P . If Q is very dierent from P then, for c Q to exceed P everywhere, c will necessarily have to be large and the frequency of rejection will be large.
365
P(x) cQ(x)
Rejection sampling in many dimensions
-4 -3 -2 -1 0 1 2 3 4
In a high-dimensional problem it is very likely that the requirement that c Q be an upper bound for P will force c to be so huge that acceptances will be very rare indeed. Finding such a value of c may be dicult too, since in many problems we know neither where the modes of P are located nor how high they are. As a case study, consider a pair of N -dimensional Gaussian distributions with mean zero (gure 29.9). Imagine generating samples from one with standard deviation Q and using rejection sampling to obtain samples from the other whose standard deviation is P . Let us assume that these two standard deviations are close in value say, Q is 1% larger than P . [Q must be larger than P because if this is not the case, there is no c such that c Q exceeds P for all x.] So, what value of c is required if the dimensionality is N = 1000? 2 The density of Q(x) at the origin is 1/(2Q )N/2 , so for c Q to exceed P we need to set 2 (2Q )N/2 Q c= . (29.30) = exp N ln 2 P (2P )N/2 With N = 1000 and Q = 1.01, we nd c = exp(10) 20,000. What will the P acceptance rate be for this value of c? The answer is immediate: since the acceptance rate is the ratio of the volume under the curve P (x) to the volume under c Q(x), the fact that P and Q are both normalized here implies that the acceptance rate will be 1/c, for example, 1/20,000. In general, c grows exponentially with the dimensionality N , so the acceptance rate is expected to be exponentially small in N . Rejection sampling, therefore, whilst a useful method for one-dimensional problems, is not expected to be a practical technique for generating samples from high-dimensional distributions P (x).
Figure 29.9. A Gaussian P (x) and a slightly broader Gaussian Q(x) scaled up by a factor c such that c Q(x) P (x).
Q(x; x(1) )
P (x) x Q(x; x(2) )
29.4
The MetropolisHastings method
Importance sampling and rejection sampling work well only if the proposal density Q(x) is similar to P (x). In large and complex problems it is dicult to create a single density Q(x) that has this property. The MetropolisHastings algorithm instead makes use of a proposal density Q which depends on the current state x(t) . The density Q(x ; x(t) ) might be a simple distribution such as a Gaussian centred on the current x (t) . The proposal density Q(x ; x) can be any xed density from which we can draw samples. In contrast to importance sampling and rejection sampling, it is not necessary that Q(x ; x(t) ) look at all similar to P (x) in order for the algorithm to be practically useful. An example of a proposal density is shown in gure 29.10; this gure shows the density Q(x ; x(t) ) for two dierent states x(1) and x(2) . As before, we assume that we can evaluate P (x) for any x. A tentative new state x is generated from the proposal density Q(x ; x(t) ). To decide
x(1)
P (x) x x(2) Figure 29.10. MetropolisHastings method in one dimension. The proposal distribution Q(x ; x) is here shown as having a shape that changes as x changes, though this is not typical of the proposal densities used in practice.
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
366 whether to accept the new state, we compute the quantity a= P (x ) Q(x(t) ; x ) . P (x(t) ) Q(x ; x(t) ) (29.31)
29 Monte Carlo Methods
If a 1 then the new state is accepted. Otherwise, the new state is accepted with probability a. If the step is accepted, we set x(t+1) = x . If the step is rejected, then we set x(t+1) = x(t) . Note the dierence from rejection sampling: in rejection sampling, rejected points are discarded and have no inuence on the list of samples {x(r) } that we collected. Here, a rejection causes the current state to be written again onto the list. Notation. I have used the superscript r = 1, . . . , R to label points that are independent samples from a distribution, and the superscript t = 1, . . . , T to label the sequence of states in a Markov chain. It is important to note that a MetropolisHastings simulation of T iterations does not produce T independent samples from the target distribution P . The samples are correlated. To compute the acceptance probability (29.31) we need to be able to compute the probability ratios P (x )/P (x(t) ) and Q(x(t) ; x )/Q(x ; x(t) ). If the proposal density is a simple symmetrical density such as a Gaussian centred on the current point, then the latter factor is unity, and the MetropolisHastings method simply involves comparing the value of the target density at the two points. This special case is sometimes called the Metropolis method. However, with apologies to Hastings, I will call the general MetropolisHastings algorithm for asymmetric Q the Metropolis algorithm since I believe important ideas deserve short names.
Convergence of the Metropolis method to the target density
It can be shown that for any positive Q (that is, any Q such that Q(x ; x) > 0 for all x, x ), as t , the probability distribution of x(t) tends to P (x) = P (x)/Z . [This statement should not be seen as implying that Q has to assign positive probability to every point x we will discuss examples later where Q(x ; x) = 0 for some x, x ; notice also that we have said nothing about how rapidly the convergence to P (x) takes place.] The Metropolis method is an example of a Markov chain Monte Carlo method (abbreviated MCMC). In contrast to rejection sampling, where the accepted points {x(r) } are independent samples from the desired distribution, Markov chain Monte Carlo methods involve a Markov process in which a sequence of states {x(t) } is generated, each sample x(t) having a probability distribution that depends on the previous value, x (t1) . Since successive samples are correlated with each other, the Markov chain may have to be run for a considerable time in order to generate samples that are eectively independent samples from P . Just as it was dicult to estimate the variance of an importance sampling estimator, so it is dicult to assess whether a Markov chain Monte Carlo method has converged, and to quantify how long one has to wait to obtain samples that are eectively independent samples from P .
Demonstration of the Metropolis method
The Metropolis method is widely used for high-dimensional problems. Many implementations of the Metropolis method employ a proposal distribution
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
29.4: The MetropolisHastings method
367
Figure 29.11. Metropolis method in two dimensions, showing a traditional proposal density that has a suciently small step size that the acceptance frequency will be about 0.5.
P (x)
Q(x; x(1) )
x(1) L
with a length scale that is short relative to the longest length scale L of the probable region (gure 29.11). A reason for choosing a small length scale is that for most high-dimensional problems, a large random step from a typical point (that is, a sample from P (x)) is very likely to end in a state that has very low probability; such steps are unlikely to be accepted. If is large, movement around the state space will only occur when such a transition to a low-probability state is actually accepted, or when a large random step chances to land in another probable state. So the rate of progress will be slow if large steps are used. The disadvantage of small steps, on the other hand, is that the Metropolis method will explore the probability distribution by a random walk, and a random walk takes a long time to get anywhere, especially if the walk is made of small steps. Exercise 29.3.[1 ] Consider a one-dimensional random walk, on each step of which the state moves randomly to the left or to the right with equal probability. Show that after T steps of size , the state is likely to have moved only a distance about T . (Compute the root mean square distance travelled.) Recall that the rst aim of Monte Carlo sampling is to generate a number of independent samples from the given distribution (a dozen, say). If the largest length scale of the state space is L, then we have to simulate a random-walk Metropolis method for a time T (L/ )2 before we can expect to get a sample that is roughly independent of the initial condition and thats assuming that every step is accepted: if only a fraction f of the steps are accepted on average, then this time is increased by a factor 1/f . Rule of thumb: lower bound on number of iterations of a Metropolis method. If the largest length scale of the space of probable states is L, a Metropolis method whose proposal distribution generates a random walk with step size must be run for at least T (L/ )2 (29.32) iterations to obtain an independent sample. This rule of thumb gives only a lower bound; the situation may be much worse, if, for example, the probability distribution consists of several islands of high probability separated by regions of low probability.
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
368
29 Monte Carlo Methods
(b) Metropolis
100 iterations 12 10 8 6 4 2 0 0 40 35 30 5 10 400 iterations 40 35 30 25 20 15 10 5 0 0 90 80 70 60 50 40 30 20 10 0 0 5 10 15 20 5 10 1200 iterations 90 80 70 60 50 40 30 20 10 0 15 20 15 20 12 10 8 6 4 2 0
(c) Independent sampling
100 iterations
0
5
10 400 iterations
15
20
Figure 29.12. Metropolis method for a toy problem. (a) The state sequence for t = 1, . . . , 600. Horizontal direction = states from 0 to 20; vertical direction = time from 1 to 600; the cross bars mark time intervals of duration 50. (b) Histogram of occupancy of the states after 100, 400 and 1200 iterations. (c) For comparison, histograms resulting when successive points are drawn independently from the target distribution.
(a)
25 20 15 10 5 0
0
5
10 1200 iterations
15
20
0
5
10
15
20
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
29.4: The MetropolisHastings method To illustrate how slowly a random walk explores a state space, gure 29.12 shows a simulation of a Metropolis algorithm for generating samples from the distribution: 1/21 x {0, 1, 2, . . . , 20} P (x) = (29.33) 0 otherwise. The proposal distribution is Q(x ; x) =
1/2
369
0
x = x1 otherwise.
(29.34)
Because the target distribution P (x) is uniform, rejections occur only when the proposal takes the state to x = 1 or x = 21. The simulation was started in the state x0 = 10 and its evolution is shown in gure 29.12a. How long does it take to reach one of the end states x = 0 and x = 20? Since the distance is 10 steps, the rule of thumb (29.32) predicts that it will typically take a time T 100 iterations to reach an end state. This is conrmed in the present example: the rst step into an end state occurs on the 178th iteration. How long does it take to visit both end states? The rule of thumb predicts about 400 iterations are required to traverse the whole state space; and indeed the rst encounter with the other end state takes place on the 540th iteration. Thus eectively-independent samples are only generated by simulating for about four hundred iterations per independent sample. This simple example shows that it is important to try to abolish random walk behaviour in Monte Carlo methods. A systematic exploration of the toy state space {0, 1, 2, . . . , 20} could get around it, using the same step sizes, in about twenty steps instead of four hundred. Methods for reducing random walk behaviour are discussed in the next chapter.
Metropolis method in high dimensions
The rule of thumb (29.32), which gives a lower bound on the number of iterations of a random walk Metropolis method, also applies to higher dimensional problems. Consider the simple case of a target distribution that is an N dimensional Gaussian, and a proposal distribution that is a spherical Gaussian of standard deviation in each direction. Without loss of generality, we can assume that the target distribution is a separable distribution aligned with the axes {xn }, and that it has standard deviation n in direction n. Let max and min be the largest and smallest of these standard deviations. Let us assume that is adjusted such that the acceptance frequency is close to 1. Under this assumption, each variable xn evolves independently of all the others, executing a random walk with step size about . The time taken to generate eectively independent samples from the target distribution will be controlled by the largest lengthscale max . Just as in the previous section, where we needed at least T (L/ )2 iterations to obtain an independent sample, here we need max / )2 . T ( Now, how big can b e? The bigger it is, the smaller this number T becomes, but if is too big bigger than min then the acceptance rate will fall sharply. It seems plausible that the optimal must be similar to min . Strictly, this may not be true; in special cases where the second smallest n is signicantly greater than min , the optimal may be closer to that second smallest n . But our rough conclusion is this: where simple spherical proposal distributions are used, we will need at least T ( max /min )2 iterations to obtain an independent sample, where max and min are the longest and shortest lengthscales of the target distribution.
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
370
29 Monte Carlo Methods
Figure 29.13. Gibbs sampling. (a) The joint density P (x) from which samples are required. (b) Starting from a state x(t) , x1 is sampled from the conditional (t) density P (x1 |x2 ). (c) A sample is then made from the conditional density P (x2 |x1 ). (d) A couple of iterations of Gibbs sampling.
x2
x2
P (x) x(t) (a) x2 x1 (b) x2 x(t+2) x(t+1) P (x2 |x1 ) x(t) (c) x1 (d) x1 P (x1 |x2 ) x1
(t)
This good is news and bad news. It is good news because, unlike the cases of rejection sampling and importance sampling, there is no catastrophic dependence on the dimensionality N . Our computer will give useful answers in a time shorter than the age of the universe. But it is bad news all the same, because this quadratic dependence on the lengthscale-ratio may still force us to make very lengthy simulations. Fortunately, there are methods for suppressing random walks in Monte Carlo simulations, which we will discuss in the next chapter.
29.5
Gibbs sampling
We introduced importance sampling, rejection sampling and the Metropolis method using one-dimensional examples. Gibbs sampling, also known as the heat bath method or Glauber dynamics, is a method for sampling from distributions over at least two dimensions. Gibbs sampling can be viewed as a Metropolis method in which a sequence of proposal distributions Q are dened in terms of the conditional distributions of the joint distribution P (x). It is assumed that, whilst P (x) is too complex to draw samples from directly, its conditional distributions P (xi | {xj }j =i ) are tractable to work with. For many graphical models (but not all) these one-dimensional conditional distributions are straightforward to sample from. For example, if a Gaussian distribution for some variables d has an unknown mean m, and the prior distribution of m is Gaussian, then the conditional distribution of m given d is also Gaussian. Conditional distributions that are not of standard form may still be sampled from by adaptive rejection sampling if the conditional distribution satises certain convexity properties (Gilks and Wild, 1992). Gibbs sampling is illustrated for a case with two variables (x1 , x2 ) = x in gure 29.13. On each iteration, we start from the current state x(t) , and x1 is (t) sampled from the conditional density P (x1 |x2 ), with x2 xed to x2 . A sample x2 is then made from the conditional density P (x2 |x1 ), using the new value of
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
29.5: Gibbs sampling x1 . This brings us to the new state x(t+1) , and completes the iteration. In the general case of a system with K variables, a single iteration involves sampling one parameter at a time: x1
(t+1)
371
(t+1) x2 (t+1) x3
P (x1 | x2 , x3 , . . . , xK )
(t)
(t)
(t)
(29.35) (29.36) etc. (29.37)
(t+1) (t) (t) P (x2 | x1 , x3 , . . . , xK ) (t) (t+1) (t+1) , . . . , xK ), , x2 P (x3 | x1
Convergence of Gibbs sampling to the target density
Exercise 29.4.[2 ] Show that a single variable-update of Gibbs sampling can be viewed as a Metropolis method with target density P (x), and that this Metropolis method has the property that every proposal is always accepted. Because Gibbs sampling is a Metropolis method, the probability distribution of x(t) tends to P (x) as t , as long as P (x) does not have pathological properties. Exercise 29.5.[2, p.385] Discuss whether the syndrome decoding problem for a (7, 4) Hamming code can be solved using Gibbs sampling. The syndrome decoding problem, if we are to solve it with a Monte Carlo approach, is to draw samples from the posterior distribution of the noise vector n = (n1 , . . . , nn , . . . , nN ), P (n | f , z) = 1 Z
N n=1 n fn n (1 fn )(1nn ) [Hn = z],
(29.38)
where fn is the normalized likelihood for the nth transmitted bit and z is the observed syndrome. The factor [Hn = z] is 1 if n has the correct syndrome z and 0 otherwise. What about the syndrome decoding problem for any linear error-correcting code?
Gibbs sampling in high dimensions
Gibbs sampling suers from the same defect as simple Metropolis algorithms the state space is explored by a slow random walk, unless a fortuitous parameterization has been chosen that makes the probability distribution P (x) separable. If, say, two variables x1 and x2 are strongly correlated, having marginal densities of width L and conditional densities of width , then it will take at least about (L/ )2 iterations to generate an independent sample from the target density. Figure 30.3, p.390, illustrates the slow progress made by Gibbs sampling when L . However Gibbs sampling involves no adjustable parameters, so it is an attractive strategy when one wants to get a model running quickly. An excellent software package, BUGS, makes it easy to set up almost arbitrary probabilistic models and simulate them by Gibbs sampling (Thomas et al., 1992).1
1
http://www.mrc-bsu.cam.ac.uk/bugs/
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
372
29 Monte Carlo Methods
29.6
Terminology for Markov chain Monte Carlo methods
We now spend a few moments sketching the theory on which the Metropolis method and Gibbs sampling are based. We denote by p(t) (x) the probability distribution of the state of a Markov chain simulator. (To visualize this distribution, imagine running an innite collection of identical simulators in parallel.) Our aim is to nd a Markov chain such that as t , p(t) (x) tends to the desired distribution P (x). A Markov chain can be specied by an initial probability distribution p(0) (x) and a transition probability T (x ; x). The probability distribution of the state at the (t + 1)th iteration of the Markov chain, p(t+1) (x), is given by p(t+1) (x ) = dN x T (x ; x)p(t) (x). (29.39)
Example 29.6. An example of a Markov chain is given by the Metropolis demonstration of section 29.4 (gure 29.12), for which the transition probability is
1/ 1/ 2 2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 T = 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2
p(0) (x)
0 5 10 15 20
p
(1)
(x)
0 5 10 15 20
p(2) (x)
0 5 10 15 20
p(3) (x)
0 5 10 15 20
p
(10)
(x)
0 5 10 15 20
p
(100)
(x)
0 5 10 15 20
and the initial distribution was p(0) (x) =
1 .
(29.40)
p
(200)
(x)
0 5 10 15 20
The probability distribution p(t) (x) of the state at the tth iteration is shown for t = 0, 1, 2, 3, 5, 10, 100, 200, 400 in gure 29.14; an equivalent sequence of distributions is shown in gure 29.15 for the chain that begins in initial state x0 = 17. Both chains converge to the target density, the uniform density, as t .
p
(400)
(x)
0 5 10 15 20
Figure 29.14. The probability distribution of the state of the Markov chain of example 29.6.
Required properties
When designing a Markov chain Monte Carlo method, we construct a chain with the following properties: 1. The desired distribution P (x) is an invariant distribution of the chain. A distribution (x) is an invariant distribution of the transition probability T (x ; x) if (x ) = dN x T (x ; x) (x). (29.41)
An invariant distribution is an eigenvector of the transition probability matrix that has eigenvalue 1.
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
29.6: Terminology for Markov chain Monte Carlo methods 2. The chain must also be ergodic, that is, p(t) (x) (x) as t , for any p(0) (x). A couple of reasons why a chain might not be ergodic are: (a) Its matrix might be reducible, which means that the state space contains two or more subsets of states that can never be reached from each other. Such a chain has many invariant distributions; which one p(t) (x) would tend to as t would depend on the initial condition p(0) (x). The transition probability matrix of such a chain has more than one eigenvalue equal to 1. (b) The chain might have a periodic set, which means that, for some initial conditions, p(t) (x) doesnt tend to an invariant distribution, but instead tends to a periodic limit-cycle. A simple Markov chain with this property is the random walk on the N -dimensional hypercube. The chain T takes the state from one corner to a randomly chosen adjacent corner. The unique invariant distribution of this chain is the uniform distribution over all 2N states, but the chain is not ergodic; it is periodic with period two: if we divide the states into states with odd parity and states with even parity, we notice that every odd state is surrounded by even states and vice versa. So if the initial condition at time t = 0 is a state with even parity, then at time t = 1 and at all odd times the state must have odd parity, and at all even times, the state will be of even parity. The transition probability matrix of such a chain has more than one eigenvalue with magnitude equal to 1. The random walk on the hypercube, for example, has eigenvalues equal to +1 and 1. (29.42)
p(0) (x)
0 5 10 15 20
373
p(1) (x)
0 5 10 15 20
p(2) (x)
0 5 10 15 20
p(3) (x)
0 5 10 15 20
p
(10)
(x)
0 5 10 15 20
p
(100)
(x)
0 5 10 15 20
p
(200)
(x)
0 5 10 15 20
p
(400)
(x)
0 5 10 15 20
Figure 29.15. The probability distribution of the state of the Markov chain for initial condition x0 = 17 (example 29.6 (p.372)).
Methods of construction of Markov chains
It is often convenient to construct T by mixing or concatenating simple base transitions B all of which satisfy P (x ) = dN x B (x ; x)P (x), (29.43)
for the desired density P (x), i.e., they all have the desired density as an invariant distribution. These base transitions need not individually be ergodic. T is a mixture of several base transitions Bb (x , x) if we make the transition by picking one of the base transitions at random, and allowing it to determine the transition, i.e., pb Bb (x , x), (29.44) T (x , x) =
b
where {pb } is a probability distribution over the base transitions. T is a concatenation of two base transitions B1 (x , x) and B2 (x , x) if we rst make a transition to an intermediate state x using B1 , and then make a transition from state x to x using B2 . T ( x , x) = dN x B2 (x , x )B1 (x , x). (29.45)
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
374
29 Monte Carlo Methods
Detailed balance
Many useful transition probabilities satisfy the detailed balance property: T (xa ; xb )P (xb ) = T (xb ; xa )P (xa ), for all xb and xa . (29.46)
This equation says that if we pick (by magic) a state from the target density P and make a transition under T to another state, it is just as likely that we will pick xb and go from xb to xa as it is that we will pick xa and go from xa to xb . Markov chains that satisfy detailed balance are also called reversible Markov chains. The reason why the detailed balance property is of interest is that detailed balance implies invariance of the distribution P (x) under the Markov chain T , which is a necessary condition for the key property that we want from our MCMC simulation that the probability distribution of the chain should converge to P (x). Exercise 29.7.[2 ] Prove that detailed balance implies invariance of the distribution P (x) under the Markov chain T . Proving that detailed balance holds is often a key step when proving that a Markov chain Monte Carlo simulation will converge to the desired distribution. The Metropolis method satises detailed balance, for example. Detailed balance is not an essential condition, however, and we will see later that irreversible Markov chains can be useful in practice, because they may have dierent random walk properties. Exercise 29.8.[2 ] Show that, if we concatenate two base transitions B1 and B2 which satisfy detailed balance, it is not necessarily the case that the T thus dened (29.45) satises detailed balance. Exercise 29.9.[2 ] Does Gibbs sampling, with several variables all updated in a deterministic sequence, satisfy detailed balance?
29.7
Slice sampling
Slice sampling (Neal, 1997a; Neal, 2003) is a Markov chain Monte Carlo method that has similarities to rejection sampling, Gibbs sampling and the Metropolis method. It can be applied wherever the Metropolis method can be applied, that is, to any system for which the target density P (x) can be evaluated at any point x; it has the advantage over simple Metropolis methods that it is more robust to the choice of parameters like step sizes. The simplest version of slice sampling is similar to Gibbs sampling in that it consists of one-dimensional transitions in the state space; however there is no requirement that the one-dimensional conditional distributions be easy to sample from, nor that they have any convexity properties such as are required for adaptive rejection sampling. And slice sampling is similar to rejection sampling in that it is a method that asymptotically draws samples from the volume under the curve described by P (x); but there is no requirement for an upper-bounding function. I will describe slice sampling by giving a sketch of a one-dimensional sampling algorithm, then giving a pictorial description that includes the details that make the method valid.
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
29.7: Slice sampling
375
The skeleton of slice sampling
Let us assume that we want to draw samples from P (x) P (x) where x is a real number. A one-dimensional slice sampling algorithm is a method for making transitions from a two-dimensional point (x, u) lying under the curve P (x) to another point (x , u ) lying under the same curve, such that the probability distribution of (x, u) tends to a uniform distribution over the area under the curve P (x), whatever initial point we start from like the uniform distribution under the curve P (x) produced by rejection sampling (section 29.3). A single transition (x, u) (x , u ) of a one-dimensional slice sampling algorithm has the following steps, of which steps 3 and 8 will require further elaboration. 1: 2: 3: 4: 5: 6: 7: 8: 9: evaluate P (x) draw a vertical coordinate u Uniform(0, P (x)) create a horizontal interval (xl , xr ) enclosing x loop { draw x Uniform(xl , xr ) evaluate P (x ) if P (x ) > u break out of loop 4-9 else modify the interval (xl , xr ) }
There are several methods for creating the interval (xl , xr ) in step 3, and several methods for modifying it at step 8. The important point is that the overall method must satisfy detailed balance, so that the uniform distribution for (x, u) under the curve P (x) is invariant.
The stepping out method for step 3
In the stepping out method for creating an interval (xl , xr ) enclosing x, we step out in steps of length w until we nd endpoints x l and xr at which P is smaller than u. The algorithm is shown in gure 29.16. 3a: 3b: 3c: 3d: 3e: draw r Uniform(0, 1) xl := x rw xr := x + (1 r )w while (P (xl ) > u) { xl := xl w } while (P (xr ) > u) { xr := xr + w }
The shrinking method for step 8
Whenever a point x is drawn such that (x , u ) lies above the curve P (x), we shrink the interval so that one of the end points is x , and such that the original point x is still enclosed in the interval. 8a: if (x > x) { xr := x } 8b: else { xl := x }
Properties of slice sampling
Like a standard Metropolis method, slice sampling gets around by a random walk, but whereas in the Metropolis method, the choice of the step size is
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
376
29 Monte Carlo Methods
1
2
3a,3b,3c
3d,3e
5,6
8
Figure 29.16. Slice sampling. Each panel is labelled by the steps of the algorithm that are executed in it. At step 1, P (x) is evaluated at the current point x. At step 2, a vertical coordinate is selected giving the point (x, u ) shown by the box; At steps 3a-c, an interval of size w containing (x, u ) is created at random. At step 3d, P is evaluated at the left end of the interval and is found to be larger than u , so a step to the left of size w is made. At step 3e, P is evaluated at the right end of the interval and is found to be smaller than u , so no stepping out to the right is needed. When step 3d is repeated, P is found to be smaller than u , so the stepping out halts. At step 5 a point is drawn from the interval, shown by a . Step 6 establishes that this point is above P and step 8 shrinks the interval to the rejected point in such a way that the original point x is still in the interval. When step 5 is repeated, the new coordinate x (which is to the right-hand side of the interval) gives a value of P greater than u , so this point x is the outcome at step 7.
5,6,7
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
29.7: Slice sampling critical to the rate of progress, in slice sampling the step size is self-tuning. If the initial interval size w is too small by a factor f compared with the width of the probable region then the stepping-out procedure expands the interval size. The cost of this stepping-out is only linear in f , whereas in the Metropolis method the computer-time scales as the square of f if the step size is too small. If the chosen value of w is too large by a factor F then the algorithm spends a time proportional to the logarithm of F shrinking the interval down to the right size, since the interval typically shrinks by a factor in the ballpark of 0.6 each time a point is rejected. In contrast, the Metropolis algorithm responds to a too-large step size by rejecting almost all proposals, so the rate of progress is exponentially bad in F . There are no rejections in slice sampling. The probability of staying in exactly the same place is very small. Exercise 29.10.[2 ] Investigate the properties of slice sampling applied to the density shown in gure 29.17. x is a real variable between 0.0 and 11.0. How long does it take typically for slice sampling to get from an x in the peak region x (0, 1) to an x in the tail region x (1, 11), and vice versa? Conrm that the probabilities of these transitions do yield an asymptotic probability density that is correct.
377
10
1 0 1 2 3 4 5 6 7 8 9 10 11
Figure 29.17. P (x).
How slice sampling is used in real problems
An N -dimensional density P (x) P (x) may be sampled with the help of one-dimensional slice sampling method presented above by picking a sequence of directions y(1) , y(2) , . . . and dening x = x(t) + xy(t) . The function P (x) above is replaced by P (x) = P (x(t) + xy(t) ). The directions may be chosen in various ways; for example, as in Gibbs sampling, the directions could be the coordinate axes; alternatively, the directions y(t) may be selected at random in any manner such that the overall procedure satises detailed balance.
Computer-friendly slice sampling
The real variables of a probabilistic model will always be represented in a computer using a nite number of bits. In the following implementation of slice sampling due to Skilling, the stepping-out, randomization, and shrinking operations, described above in terms of oating-point operations, are replaced by binary and integer operations. We assume that the variable x that is being slice-sampled is represented by a b-bit integer X taking on one of B = 2b values, 0, 1, 2, . . . , B 1, many or all of which correspond to valid values of x. Using an integer grid eliminates any errors in detailed balance that might ensue from variable-precision rounding of oating-point numbers. The mapping from X to x need not be linear; if it is nonlinear, we assume that the function P (x) is replaced by an appropriately transformed function for example, P (X ) P (x)|dx/dX |. We assume the following operators on b-bit integers are available: X +N X N X N N := randbits(l) arithmetic sum, modulo B , of X and N . dierence, modulo B , of X and N . bitwise exclusive-or of X and N . sets N to a random l-bit integer.
A slice-sampling procedure for integers is then as follows:
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
378
29 Monte Carlo Methods Given: a current point X and a height Y = P (X ) Uniform(0, 1) P (X )
1: 2: 3: 4: 5:
U := randbits(b) set l to a value l b do { N := randbits(l) X := ((X U ) N ) + U
Dene a random translation U of the binary coordinate system. Set initial l-bit sampling range. Dene a random move within the current interval of width 2l . Randomize the lowest l bits of X (in the translated coordinate system). If X is not acceptable, decrease l and try again with a smaller perturbation of X ; termination at or before l = 0 is assured.
6: l := l 1 7: } until (X = X ) or (P (X ) Y )
The translation U is introduced to avoid permanent sharp edges, where for example the adjacent binary integers 0111111111 and 1000000000 would otherwise be permanently in dierent sectors, making it dicult for X to move from one to the other. The sequence of intervals from which the new candidate points are drawn is illustrated in gure 29.18. First, a point is drawn from the entire interval, shown by the top horizontal line. At each subsequent draw, the interval is halved in such a way as to contain the previous point X . If preliminary stepping-out from the initial range is required, step 2 above can be replaced by the following similar procedure: 2a: set l to a value l < b 2b: do { 2c: N := randbits(l) 2d: X := ((X U ) N ) + U 2e: l := l + 1 2f: } until (l = b) or (P (X ) < Y ) l sets the initial width
0
X
B1
Figure 29.18. The sequence of intervals from which the new candidate points are drawn.
These shrinking and stepping out methods shrink and expand by a factor of two per evaluation. A variant is to shrink or expand by more than one bit each time, setting l := l l with l > 1. Taking l at each step from any pre-assigned distribution (which may include l = 0) allows extra exibility. Exercise 29.11.[4 ] In the shrinking phase, after an unacceptable X has been produced, the choice of l is allowed to depend on the dierence between the slices height Y and the value of P (X ), without spoiling the algorithms validity. (Prove this.) It might be a good idea to choose a larger value of l when Y P (X ) is large. Investigate this idea theoretically or empirically. A feature of using the integer representation is that, with a suitably extended number of bits, the single integer X can represent two or more real parameters for example, by mapping X to (x1 , x2 , x3 ) through a space-lling curve such as a Peano curve. Thus multi-dimensional slice sampling can be performed using the same software as for one dimension.
29.8
Practicalities
Can we predict how long a Markov chain Monte Carlo simulation
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
29.9: Further practical issues will take to equilibrate? By considering the random walks involved in a Markov chain Monte Carlo simulation we can obtain simple lower bounds on the time required for convergence. But predicting this time more precisely is a dicult problem, and most of the theoretical results giving upper bounds on the convergence time are of little practical use. The exact sampling methods of Chapter 32 oer a solution to this problem for certain Markov chains. Can we diagnose or detect convergence in a running simulation? This is also a dicult problem. There are a few practical tools available, but none of them is perfect (Cowles and Carlin, 1996). Can we speed up the convergence time and time between independent samples of a Markov chain Monte Carlo method? Here, there is good news, as described in the next chapter, which describes the Hamiltonian Monte Carlo method, overrelaxation, and simulated annealing.
379
29.9
Further practical issues Can the normalizing constant be evaluated?
If the target density P (x) is given in the form of an unnormalized density 1 P (x) with P (x) = Z P (x), the value of Z may well be of interest. Monte Carlo methods do not readily yield an estimate of this quantity, and it is an area of active research to nd ways of evaluating it. Techniques for evaluating Z include: 1. Importance sampling (reviewed by Neal (1993b)) and annealed importance sampling (Neal, 1998). 2. Thermodynamic integration during simulated annealing, the acceptance ratio method, and umbrella sampling (reviewed by Neal (1993b)). 3. Reversible jump Markov chain Monte Carlo (Green, 1995). One way of dealing with Z , however, may be to nd a solution to ones task that does not require that Z be evaluated. In Bayesian data modelling one might be able to avoid the need to evaluate Z which would be important for model comparison by not having more than one model. Instead of using several models (diering in complexity, for example) and evaluating their relative posterior probabilities, one can make a single hierarchical model having, for example, various continuous hyperparameters which play a role similar to that played by the distinct models (Neal, 1996). In noting the possibility of not computing Z , I am not endorsing this approach. The normalizing constant Z is often the single most important number in the problem, and I think every eort should be devoted to calculating it.
The Metropolis method for big models
Our original description of the Metropolis method involved a joint updating of all the variables using a proposal density Q(x ; x). For big problems it may be more ecient to use several proposal distributions Q(b) (x ; x), each of which updates only some of the components of x. Each proposal is individually accepted or rejected, and the proposal distributions are repeatedly run through in sequence. Exercise 29.12.[2, p.385] Explain why the rate of movement through the state space will be greater when B proposals Q(1) , . . . , Q(B ) are considered
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
380 individually in sequence, compared with the case of a single proposal Q dened by the concatenation of Q(1) , . . . , Q(B ) . Assume that each proposal distribution Q(b) (x ; x) has an acceptance rate f < 1/2. In the Metropolis method, the proposal density Q(x ; x) typically has a number of parameters that control, for example, its width. These parameters are usually set by trial and error with the rule of thumb being to aim for a rejection frequency of about 0.5. It is not valid to have the width parameters be dynamically updated during the simulation in a way that depends on the history of the simulation. Such a modication of the proposal density would violate the detailed balance condition which guarantees that the Markov chain has the correct invariant distribution.
29 Monte Carlo Methods
Gibbs sampling in big models
Our description of Gibbs sampling involved sampling one parameter at a time, as described in equations (29.3529.37). For big problems it may be more ecient to sample groups of variables jointly, that is to use several proposal distributions:
( x1 , . . . , xat+1) P (x1 , . . . , xa | xa+1 , . . . , xK ) (t+1) (t) (t)
(29.47) etc.
(t+1) (t+1) xa+1 , . . . , xb
(t) (t) (t+1) ( P (xa+1 , . . . , xb | x1 , . . . , xat+1) , xb+1 , . . . , xK ),
How many samples are needed?
At the start of this chapter, we observed that the variance of an estimator depends only on the number of independent samples R and the value of 2 = dN x P (x)((x) )2 . (29.48)
We have now discussed a variety of methods for generating samples from P (x). How many independent samples R should we aim for? In many problems, we really only need about twelve independent samples from P (x). Imagine that x is an unknown vector such as the amount of corrosion present in each of 10 000 underground pipelines around Cambridge, and (x) is the total cost of repairing those pipelines. The distribution P (x) describes the probability of a state x given the tests that have been carried out on some pipelines and the assumptions about the physics of corrosion. The quantity is the expected cost of the repairs. The quantity 2 is the variance of the cost measures by how much we should expect the actual cost to dier from the expectation . Now, how accurately would a manager like to know ? I would suggest there is little point in knowing to a precision ner than about /3. After all, the true cost is likely to dier by from . If we obtain R 12 = independent samples from P (x), we can estimate to a precision of / 12 which is smaller than /3. So twelve samples suce.
Allocation of resources
Assuming we have decided how many independent samples R are required, an important question is how one should make use of ones limited computer resources to obtain these samples. A typical Markov chain Monte Carlo experiment involves an initial period in which control parameters of the simulation such as step sizes may be
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
29.10: Summary
(1) (2)
381
Figure 29.19. Three possible Markov chain Monte Carlo strategies for obtaining twelve samples in a xed amount of computer time. Time is represented by horizontal lines; samples by white circles. (1) A single run consisting of one long burn in period followed by a sampling period. (2) Four medium-length runs with dierent initial conditions and a medium-length burn in period. (3) Twelve short runs.
(3)
adjusted. This is followed by a burn in period during which we hope the simulation converges to the desired distribution. Finally, as the simulation continues, we record the state vector occasionally so as to create a list of states {x(r) }R that we hope are roughly independent samples from P (x). r=1 There are several possible strategies (gure 29.19): 1. Make one long run, obtaining all R samples from it. 2. Make a few medium length runs with dierent initial conditions, obtaining some samples from each. 3. Make R short runs, each starting from a dierent random initial condition, with the only state that is recorded being the nal state of each simulation. The rst strategy has the best chance of attaining convergence. The last strategy may have the advantage that the correlations between the recorded samples are smaller. The middle path is popular with Markov chain Monte Carlo experts (Gilks et al., 1996) because it avoids the ineciency of discarding burn-in iterations in many runs, while still allowing one to detect problems with lack of convergence that would not be apparent from a single run. Finally, I should emphasize that there is no need to make the points nearlyindependent. Averaging over dependent points is ne it wont lead to any bias in the estimates. For example, when you use strategy 1 or 2, you may, if you wish, include all the points between the rst and last sample in each run. Of course, estimating the accuracy of the estimate is harder when the points are dependent.
29.10
Summary
Monte Carlo methods are a powerful tool that allow one to sample from any probability distribution that can be expressed in the form P (x) = 1 Z P (x). Monte Carlo methods can answer virtually any query related to P (x) by putting the query in the form (x)P (x) 1 R (x(r) ).
r
(29.49)
In high-dimensional problems the only satisfactory methods are those based on Markov chains, such as the Metropolis method, Gibbs sampling and slice sampling. Gibbs sampling is an attractive method because it has no adjustable parameters but its use is restricted to cases
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
382 where samples can be generated from the conditional distributions. Slice sampling is attractive because, whilst it has step length parameters, its performance is not very sensitive to their values. Simple Metropolis algorithms and Gibbs sampling algorithms, although widely used, perform poorly because they explore the space by a slow random walk. The next chapter will discuss methods for speeding up Markov chain Monte Carlo simulations. Slice sampling does not avoid random walk behaviour, but it automatically chooses the largest appropriate step size, thus reducing the bad eects of the random walk compared with, say, a Metropolis method with a tiny step size.
29 Monte Carlo Methods
29.11
Exercises
Exercise 29.13.[2C, p.386] A study of importance sampling. We already established in section 29.2 that importance sampling is likely to be useless in high-dimensional problems. This exercise explores a further cautionary tale, showing that importance sampling can fail even in one dimension, even with friendly Gaussian distributions. Imagine that we want to know the expectation of a function (x) under a distribution P (x), = dx P (x)(x), (29.50)
and that this expectation is estimated by importance sampling with a distribution Q(x). Alternatively, perhaps we wish to estimate the normalizing constant Z in P (x) = P (x)/Z using Z= dx P (x) = dx Q(x) P (x) = Q(x) P (x) Q(x) .
xQ
(29.51)
Check your theory by simulating this importance-sampling problem on a computer.
Now, let P (x) and Q(x) be Gaussian distributions with mean zero and standard deviations p and q . Each point x drawn from Q will have an associated weight P (x)/Q(x). What is the variance of the weights? [Assume that P = P , so P is actually normalized, and Z = 1, though we can pretend that we didnt know that.] What happens to the variance 2 2 of the weights as q p /2?
Exercise 29.14.[2 ] Consider the Metropolis algorithm for the one-dimensional toy problem of section 29.4, sampling from {0, 1, . . . , 20}. Whenever the current state is one of the end states, the proposal density given in equation (29.34) will propose with probability 50% a state that will be rejected. To reduce this waste, Fred modies the software responsible for generating samples from Q so that when x = 0, the proposal density is 100% on x = 1, and similarly when x = 20, x = 19 is always proposed. Fred sets the software that implements the acceptance rule so that the software accepts all proposed moves. What probability P (x) will Freds modied software generate samples from? What is the correct acceptance rule for Freds proposal density, in order to obtain samples from P (x)?
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
29.11: Exercises Exercise 29.15.[3C ] Implement Gibbs sampling for the inference of a single one-dimensional Gaussian, which we studied using maximum likelihood in section 22.1. Assign a broad Gaussian prior to and a broad gamma prior (24.2) to the precision parameter = 1/2 . Each update of will involve a sample from a Gaussian distribution, and each update of requires a sample from a gamma distribution. Exercise 29.16.[3C ] Gibbs sampling for clustering. Implement Gibbs sampling for the inference of a mixture of K one-dimensional Gaussians, which we studied using maximum likelihood in section 22.2. Allow the clusters to have dierent standard deviations k . Assign priors to the means and standard deviations in the same way as the previous exercise. Either x the prior probabilities of the classes {k } to be equal or put a uniform prior over the parameters and include them in the Gibbs sampling. Notice the similarity of Gibbs sampling to the soft K-means clustering algorithm (algorithm 22.2). We can alternately assign the class labels {kn } given the parameters {k , k }, then update the parameters given the class labels. The assignment step involves sampling from the probability distributions dened by the responsiblities (refeq.assignII), and the update step updates the means and variances using probability distributions centred on the K-means algorithms values (22.23, 22.24). Do your experiments conrm that Monte Carlo methods bypass the overtting diculties of maximum likelihood discussed in section 22.4? A solution to this exercise and the previous one, written in octave, is available.2 Exercise 29.17.[3C ] Implement Gibbs sampling for the seven scientists inference problem, which we encountered in exercise 22.15 (p.309), and which you may have solved by exact marginalization (exercise 24.3 (p.323)) [its not essential to have done the latter]. Exercise 29.18.[2 ] A Metropolis method is used to explore a distribution P (x) that is actually a 1000-dimensional spherical Gaussian distribution of standard deviation 1 in all dimensions. The proposal density Q is a 1000-dimensional spherical Gaussian distribution of standard deviation . Roughly what is the step size if the acceptance rate is 0.5? Assuming this value of , (a) roughly how long would the method take to traverse the distribution and generate a sample independent of the initial condition? (b) By how much does ln P (x) change in a typical step? By how much should ln P (x) vary when x is drawn from P (x)? (c) What happens if, rather than using a Metropolis method that tries to change all components at once, one instead uses a concatenation of Metropolis updates changing one component at a time? Exercise 29.19.[2 ] When discussing the time taken by the Metropolis algorithm to generate independent samples we considered a distribution with longest spatial length scale L being explored using a proposal distribution with step size . Another dimension that a MCMC method must explore is the range of possible values of the log probability ln P (x).
2
383
http://www.inference.phy.cam.ac.uk/mackay/itila/
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
384 Assuming that the state x contains a number of independent random variables proportional to N , when samples are drawn from P (x), the asymptotic equipartition principle tell us that the value of ln P (x) is likely to be close to the entropy of x, varying either side with a standard deviation that scales as N . Consider a Metropolis method with a symmetrical proposal density, that is, one that satises Q(x; x ) = Q(x ; x). Assuming that accepted jumps either increase ln P (x) by some amount or decrease it by a small amount, e.g. ln e = 1 (is this a reasonable assumption?), discuss how long it must take to generate roughly independent samples from P (x). Discuss whether Gibbs sampling has similar properties. Exercise 29.20.[3 ] Markov chain Monte Carlo methods do not compute partition functions Z , yet they allow ratios of quantities like Z to be estimated. For example, consider a random-walk Metropolis algorithm in a state space where the energy is zero in a connected accessible region, and innitely large everywhere else; and imagine that the accessible space can be chopped into two regions connected by one or more corridor states. The fraction of times spent in each region at equilibrium is proportional to the volume of the region. How does the Monte Carlo method manage to do this without measuring the volumes? Exercise 29.21.[5 ] Philosophy. One curious defect of these Monte Carlo methods which are widely used by Bayesian statisticians is that they are all non-Bayesian (OHagan, 1987). They involve computer experiments from which estimators of quantities of interest are derived. These estimators depend on the proposal distributions that were used to generate the samples and on the random numbers that happened to come out of our random number generator. In contrast, an alternative Bayesian approach to the problem would use the results of our computer experiments to infer the properties of the target function P (x) and generate predictive distributions for quantities of interest such as . This approach would give answers that would depend only on the computed values of P (x(r) ) at the points {x(r) }; the answers would not depend on how those points were chosen.
29 Monte Carlo Methods
Can you make a Bayesian Monte Carlo method? (See Rasmussen and Ghahramani (2003) for a practical attempt.)
29.12
Solutions
Solution to exercise 29.1 (p.362). We wish to show that
r
wr (x(r) ) r wr
(29.52)
converges to the expectation of under P . We consider the numerator and the denominator separately. First, the denominator. Consider a single importance weight. P (x(r) ) wr (r) . (29.53) Q (x ) What is its expectation, averaged under the distribution Q = Q /ZQ of the point x(r) ? wr = dx Q(x) P (x) = Q (x) dx 1 ZP P (x) = . ZQ ZQ (29.54)
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
29.12: Solutions So the expectation of the denominator is wr
r
385
=R
ZP . ZQ
(29.55)
As long as the variance of wr is nite, the denominator, divided by R, will converge to ZP /ZQ as R increases. [In fact, the estimate converges to the right answer even if this variance is innite, as long as the expectation is well-dened.] Similarly, the expectation of one term in the numerator is wr (x) = dx Q(x) P (x) (x) = Q (x) dx 1 ZP P (x)(x) = , ZQ ZQ (29.56)
where is the expectation of under P . So the numerator, divided by R, Z converges to ZP with increasing R. Thus converged to .
Q
The numerator and the denominator are unbiased estimators of RZP /ZQ and RZP /ZQ respectively, but their ratio is not necessarily an unbiased estimator for nite R. Solution to exercise 29.2 (p.363). When the true density P is multimodal, it is unwise to use importance sampling with a sampler density tted to one mode, because on the rare occasions that a point is produced that lands in one of the other modes, the weight associated with that point will be enormous. The estimates will have enormous variance, but this enormous variance may not be evident to the user if no points in the other mode have been seen. Solution to exercise 29.5 (p.371). The posterior distribution for the syndrome decoding problem is a pathological distribution from the point of view of Gibbs sampling. The factor [Hn = z] is only 1 on a small fraction of the space of possible vectors n, namely the 2K points that correspond to the valid codewords. No two codewords are adjacent, so similarly, any single bit ip from a viable state n will take us to a state with zero probability and so the state will never move in Gibbs sampling. A general code has exactly the same problem. The points corresponding to valid codewords are relatively few in number and they are not adjacent (at least for any useful code). So Gibbs sampling is no use for syndrome decoding for two reasons. First, nding any reasonably good hypothesis is dicult, and as long as the state is not near a valid codeword, Gibbs sampling cannot help since none of the conditional distributions is dened; and second, once we are in a valid hypothesis, Gibbs sampling will never take us out of it. One could attempt to perform Gibbs sampling using the bits of the original message s as the variables. This approach would not get locked up in the way just described, but, for a good code, any single bit ip would substantially alter the reconstructed codeword, so if one had found a state with reasonably large likelihood, Gibbs sampling would take an impractically large time to escape from it. Solution to exercise 29.12 (p.379). Each Metropolis proposal will take the energy of the state up or down by some amount. The total change in energy when B proposals are concatenated will be the end-point of a random walk with B steps in it. This walk might have mean zero, or it might have a tendency to drift upwards (if most moves increase the energy and only a few decrease it). In general the latter will hold, if the acceptance rate f is small: the mean change in energy from any one move will be some E > 0 and so the acceptance probability for the concatenation of B moves will be of order
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
386 1/(1 + exp(B E )), which scales roughly as f B . The mean-square-distance moved will be of order f B B 2 , where is the typical step size. In contrast, the mean-square-distance moved when the moves are considered individually will be of order f B 2 .
1.1 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0 1.6 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1000 10000 100000 theory 6 8 10 1000 10000 100000 theory 3.5 3 2.5 2 1.5 1 2 0.5 0 1.6 0
29 Monte Carlo Methods
4
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
Solution to exercise 29.13 (p.382). The weights are w = P (x)/Q(x) and x is drawn from Q. The mean weight is dx Q(x) [P (x)/Q(x)] = dx P (x) = 1, (29.57)
assuming the integral converges. The variance is var(w) = = = dx Q(x) dx P (x) 1 Q(x)
2
(29.58) (29.59) 1, (29.60)
P (x)2 2P (x) + Q(x) Q(x) ZQ x2 2 exp 2 ZP 2 1 2 2 p q
dx
Figure 29.20. Importance sampling in one dimension. For R = 1000, 10 4 , and 105 , the normalizing constant of a Gaussian distribution (known in fact to be 1) was estimated using importance sampling with a sampler density of standard deviation q (horizontal axis). The same random number seed was used for all runs. The three plots show (a) the estimated normalizing constant; (b) the empirical standard deviation of the R weights; (c) 30 of the weights.
2 2 where ZQ /ZP = q /( 2p ). The integral in (29.60) is nite only if the 2 in the exponent is positive, i.e., if coecient of x
2 q >
12 . 2p
(29.61)
If this condition is satised, the variance is q var(w) = 2 2 2p 1 2 2 2 p q
1 2
1 =
2 q 2 2 p 2q p 1/2
1.
(29.62)
As q approaches the critical value about 0.7p the variance becomes innite. Figure 29.20 illustrates these phenomena for p = 1 with q varying from 0.1 to 1.5. The same random number seed was used for all runs, so the weights and estimates follow smooth curves. Notice that the empirical standard deviation of the R weights can look quite small and well-behaved (say, at q 0.3) when the true standard deviation is nevertheless innite.
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
30
Ecient Monte Carlo Methods
This chapter discusses several methods for reducing random walk behaviour in Metropolis methods. The aim is to reduce the time required to obtain eectively independent samples. For brevity, we will say independent samples when we mean eectively independent samples.
30.1
Hamiltonian Monte Carlo
The Hamiltonian Monte Carlo method is a Metropolis method, applicable to continuous state spaces, that makes use of gradient information to reduce random walk behaviour. [The Hamiltonian Monte Carlo method was originally called hybrid Monte Carlo, for historical reasons.] For many systems whose probability P (x) can be written in the form P (x) = eE (x) , Z (30.1)
not only E (x) but also its gradient with respect to x can be readily evaluated. It seems wasteful to use a simple random-walk Metropolis method when this gradient is available the gradient indicates which direction one should go in to nd states with higher probability!
Overview of Hamiltonian Monte Carlo
In the Hamiltonian Monte Carlo method, the state space x is augmented by momentum variables p, and there is an alternation of two types of proposal. The rst proposal randomizes the momentum variable, leaving the state x unchanged. The second proposal changes both x and p using simulated Hamiltonian dynamics as dened by the Hamiltonian H (x, p) = E (x) + K (p), (30.2)
where K (p) is a kinetic energy such as K (p) = pTp/2. These two proposals are used to create (asymptotically) samples from the joint density PH (x, p) = 1 1 exp[H (x, p)] = exp[E (x)] exp[K (p)]. ZH ZH (30.3)
This density is separable, so the marginal distribution of x is the desired distribution exp[E (x)]/Z . So, simply discarding the momentum variables, we obtain a sequence of samples {x(t) } that asymptotically come from P (x). 387
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
388
30 Ecient Monte Carlo Methods
g = gradE ( x ) ; E = findE ( x ) ; for l = 1:L p = randn ( size(x) ) ; H = p * p / 2 + E ; xnew = x ; gnew = g ; for tau = 1:Tau p = p - epsilon * gnew / 2 ; xnew = xnew + epsilon * p ; gnew = gradE ( xnew ) ; p = p - epsilon * gnew / 2 ; endfor Enew = findE ( xnew ) ; Hnew = p * p / 2 + Enew ; dH = Hnew - H ;
# set gradient using initial x # set objective function too # loop L times # initial momentum is Normal(0,1) # evaluate H(x,p)
Algorithm 30.1. Octave source code for the Hamiltonian Monte Carlo method.
# make Tau leapfrog steps # # # # make make find make half-step in p step in x new gradient half-step in p
# find new value of H # Decide whether to accept
if ( dH < 0 ) accept = 1 ; elseif ( rand() < exp(-dH) ) accept = 1 ; else accept = 0 ; endif if ( accept ) g = gnew ; endif endfor
x = xnew ;
E = Enew ;
Hamiltonian Monte Carlo
Simple Metropolis
1
(a) (c)
1 0.5 0 -0.5 -1 -1 -0.5 0 0.5 1 1
(d)
0.5 0 -0.5 -1 1
(b) 0.5
Figure 30.2. (a,b) Hamiltonian Monte Carlo used to generate samples from a bivariate Gaussian with correlation = 0.998. (c,d) For comparison, a simple random-walk Metropolis method, given equal computer time.
-1
-0.5
0
0.5
1
0.5 0
0 -0.5 -1 -1.5 -1.5 -1 -0.5 0 0.5 1 -0.5 -1 -1 -0.5 0 0.5 1
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
30.1: Hamiltonian Monte Carlo
389
Details of Hamiltonian Monte Carlo
The rst proposal, which can be viewed as a Gibbs sampling update, draws a new momentum from the Gaussian density exp[K (p)]/ZK . This proposal is always accepted. During the second, dynamical proposal, the momentum variable determines where the state x goes, and the gradient of E (x) determines how the momentum p changes, in accordance with the equations x=p p= E (x) . x (30.4) (30.5)
Because of the persistent motion of x in the direction of the momentum p during each dynamical proposal, the state of the system tends to move a distance that goes linearly with the computer time, rather than as the square root. The second proposal is accepted in accordance with the Metropolis rule. If the simulation of the Hamiltonian dynamics is numerically perfect then the proposals are accepted every time, because the total energy H (x, p) is a constant of the motion and so a in equation (29.31) is equal to one. If the simulation is imperfect, because of nite step sizes for example, then some of the dynamical proposals will be rejected. The rejection rule makes use of the change in H (x, p), which is zero if the simulation is perfect. The occasional rejections ensure that, asymptotically, we obtain samples (x(t) , p(t) ) from the required joint density PH (x, p). The source code in gure 30.1 describes a Hamiltonian Monte Carlo method that uses the leapfrog algorithm to simulate the dynamics on the function findE(x), whose gradient is found by the function gradE(x). Figure 30.2 shows this algorithm generating samples from a bivariate Gaussian whose energy function is E (x) = 1 xTAx with 2 A= 250.25 249.75 249.75 250.25 , (30.6)
corresponding to a variancecovariance matrix of 1 0.998 0.998 1 . (30.7)
In gure 30.2a, starting from the state marked by the arrow, the solid line represents two successive trajectories generated by the Hamiltonian dynamics. The squares show the endpoints of these two trajectories. Each trajectory consists of Tau = 19 leapfrog steps with epsilon = 0.055. These steps are indicated by the crosses on the trajectory in the magnied inset. After each trajectory, the momentum is randomized. Here, both trajectories are accepted; the errors in the Hamiltonian were only +0.016 and 0.06 respectively. Figure 30.2b shows how a sequence of four trajectories converges from an initial condition, indicated by the arrow, that is not close to the typical set of the target distribution. The trajectory parameters Tau and epsilon were randomized for each trajectory using uniform distributions with means 19 and 0.055 respectively. The rst trajectory takes us to a new state, (1.5, 0.5), similar in energy to the rst state. The second trajectory happens to end in a state nearer the bottom of the energy landscape. Here, since the potential energy E is smaller, the kinetic energy K = p2 /2 is necessarily larger than it was at the start of the trajectory. When the momentum is randomized before
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
390
Gibbs sampling (a)
1 0.5 0 -0.5 -1 -1 -0.5 0 0.5 1 0 -0.2 -0.4 -0.6 -0.8 -1 -1 -0.8 -0.6 -0.4 -0.2 0 1 0.5 0 -0.5 -1 -1 -0.5 0 0.5 1
30 Ecient Monte Carlo Methods
Overrelaxation Figure 30.3. Overrelaxation contrasted with Gibbs sampling for a bivariate Gaussian with correlation = 0.998. (a) The state sequence for 40 iterations, each iteration involving one update of both variables. The overrelaxation method had = 0.98. (This excessively large value is chosen to make it easy to see how the overrelaxation method reduces random walk behaviour.) The dotted line shows the contour xT1 x = 1. (b) Detail of (a), showing the two steps making up each iteration. (c) Time-course of the variable x1 during 2000 iterations of the two methods. The overrelaxation method had = 0.89. (After Neal (1995).)
(b)
(c) Gibbs sampling
3 2 1 0 -1 -2 -3 0 3 2 1 0 -1 -2 -3 0 200 400 600 800 1000 1200 1400 1600 1800 2000 200 400 600 800 1000 1200 1400 1600 1800 2000
Overrelaxation
the third trajectory, its kinetic energy becomes much smaller. After the fourth trajectory has been simulated, the state appears to have become typical of the target density. Figures 30.2(c) and (d) show a random-walk Metropolis method using a Gaussian proposal density to sample from the same Gaussian distribution, starting from the initial conditions of (a) and (b) respectively. In (c) the step size was adjusted such that the acceptance rate was 58%. The number of proposals was 38 so the total amount of computer time used was similar to that in (a). The distance moved is small because of random walk behaviour. In (d) the random-walk Metropolis method was used and started from the same initial condition as (b) and given a similar amount of computer time.
30.2
Overrelaxation
The method of overrelaxation is a method for reducing random walk behaviour in Gibbs sampling. Overrelaxation was originally introduced for systems in which all the conditional distributions are Gaussian.
An example of a joint distribution that is not Gaussian but whose conditional distri-
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
30.2: Overrelaxation
butions are all Gaussian is P (x, y ) = exp(x2 y 2 x2 y 2 )/Z .
391
Overrelaxation for Gaussian conditional distributions
In ordinary Gibbs sampling, one draws the new value x i of the current (t) variable xi from its conditional distribution, ignoring the old value xi . The state makes lengthy random walks in cases where the variables are strongly correlated, as illustrated in the left-hand panel of gure 30.3. This gure uses a correlated Gaussian distribution as the target density. (t+1) In Adlers (1981) overrelaxation method, one instead samples xi from a Gaussian that is biased to the opposite side of the conditional distribution. If the conditional distribution of xi is Normal(, 2 ) and the current value of (t) xi is xi , then Adlers method sets xi to xi
(t+1) (t+1)
= + (xi ) + (1 2 )1/2 ,
(t)
(30.8)
where Normal(0, 1) and is a parameter between 1 and 1, usually set to a negative value. (If is positive, then the method is called under-relaxation.) Exercise 30.1.[2 ] Show that this individual transition leaves invariant the conditional distribution xi Normal(, 2 ). A single iteration of Adlers overrelaxation, like one of Gibbs sampling, updates each variable in turn as indicated in equation (30.8). The transition matrix T (x ; x) dened by a complete update of all variables in some xed order does not satisfy detailed balance. Each individual transition for one coordinate just described does satisfy detailed balance so the overall chain gives a valid sampling strategy which converges to the target density P (x) but when we form a chain by applying the individual transitions in a xed sequence, the overall chain is not reversible. This temporal asymmetry is the key to why overrelaxation can be benecial. If, say, two variables are positively correlated, then they will (on a short timescale) evolve in a directed manner instead of by random walk, as shown in gure 30.3. This may signicantly reduce the time required to obtain independent samples. Exercise 30.2.[3 ] The transition matrix T (x ; x) dened by a complete update of all variables in some xed order does not satisfy detailed balance. If the updates were in a random order, then T would be symmetric. Investigate, for the toy two-dimensional Gaussian distribution, the assertion that the advantages of overrelaxation are lost if the overrelaxed updates are made in a random order.
Ordered Overrelaxation
The overrelaxation method has been generalized by Neal (1995) whose ordered overrelaxation method is applicable to any system where Gibbs sampling is used. In ordered overrelaxation, instead of taking one sample from the condi(1) (2) (K ) tional distribution P (xi |{xj }j =i ), we create K such samples xi , xi , . . . , xi , where K might be set to twenty or so. Often generating K 1 extra samples adds a negligible computational cost to the initial computations required for (k ) making the rst sample. The points {xi } are then sorted numerically, and the current value of xi is inserted into the sorted list, giving a list of K + 1 points. We give them ranks 0, 1, 2, . . . , K . Let be the rank of the current value of xi in the list. We set xi to the value that is an equal distance from
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
392
30 Ecient Monte Carlo Methods
the other end of the list, that is, the value with rank K . The role played by Adlers parameter is here played by the parameter K . When K = 1, we obtain ordinary Gibbs sampling. For practical purposes Neal estimates that ordered overrelaxation may speed up a simulation by a factor of ten or twenty.
30.3
Simulated annealing
A third technique for speeding convergence is simulated annealing. In simulated annealing, a temperature parameter is introduced which, when large, allows the system to make transitions that would be improbable at temperature 1. The temperature is set to a large value and gradually reduced to 1. This procedure is supposed to reduce the chance that the simulation gets stuck in an unrepresentative probability island. We asssume that we wish to sample from a distribution of the form P (x) = eE (x) Z (30.9)
where E (x) can be evaluated. In the simplest simulated annealing method, we instead sample from the distribution PT (x) =
1 Z (T )
e
E (x) T
(30.10)
and decrease T gradually to 1. Often the energy function can be separated into two terms, E (x) = E0 (x) + E1 (x), (30.11)
of which the rst term is nice (for example, a separable function of x) and the second is nasty. In these cases, a better simulated annealing method might make use of the distribution PT (x) =
1 Z (T )
eE0 (x)
E1 (x)/T
(30.12)
with T gradually decreasing to 1. In this way, the distribution at high temperatures reverts to a well-behaved distribution dened by E 0 . Simulated annealing is often used as an optimization method, where the aim is to nd an x that minimizes E (x), in which case the temperature is decreased to zero rather than to 1. As a Monte Carlo method, simulated annealing as described above doesnt sample exactly from the right distribution, because there is no guarantee that the probability of falling into one basin of the energy is equal to the total probability of all the states in that basin. The closely related simulated tempering method (Marinari and Parisi, 1992) corrects the biases introduced by the annealing process by making the temperature itself a random variable that is updated in Metropolis fashion during the simulation. Neals (1998) annealed importance sampling method removes the biases introduced by annealing by computing importance weights for each generated point.
30.4
Skillings multi-state leapfrog method
A fourth method for speeding up Monte Carlo simulations, due to John Skilling, has a similar spirit to overrelaxation, but works in more dimensions. This method is applicable to sampling from a distribution over a continuous
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
30.4: Skillings multi-state leapfrog method state space, and the sole requirement is that the energy E (x) should be easy to evaluate. The gradient is not used. This leapfrog method is not intended to be used on its own but rather in sequence with other Monte Carlo operators. Instead of moving just one state vector x around the state space, as was the case for all the Monte Carlo methods discussed thus far, Skillings leapfrog method simultaneously maintains a set of S state vectors {x(s) }, where S might be six or twelve. The aim is that all S of these vectors will represent independent samples from the same distribution P (x). Skillings leapfrog makes a proposal for the new state x(s) , which is accepted or rejected in accordance with the Metropolis method, by leapfrogging the current state x(s) over another state vector x(t) : x(s) = x(t) + (x(t) x(s) ) = 2x(t) x(s) . (30.13) x(t) x(s)
393
All the other state vectors are left where they are, so the acceptance probability depends only on the change in energy of x(s) . Which vector, t, is the partner for the leapfrog event can be chosen in various ways. The simplest method is to select the partner at random from the other vectors. It might be better to choose t by selecting one of the nearest neighbours x(s) nearest by any chosen distance function as long as one then uses an acceptance rule that ensures detailed balance by checking whether point t is still among the nearest neighbours of the new point, x(s) .
x(s)
Why the leapfrog is a good idea
Imagine that the target density P (x) has strong correlations for example, the density might be a needle-like Gaussian with width and length L , where L 1. As we have emphasized, motion around such a density by standard methods proceeds by a slow random walk. Imagine now that our set of S points is lurking initially in a location that is probable under the density, but in an inappropriately small ball of size . Now, under Skillings leapfrog method, a typical rst move will take the point a little outside the current ball, perhaps doubling its distance from the centre of the ball. After all the points have had a chance to move, the ball will have increased in size; if all the moves are accepted, the ball will be bigger by a factor of two or so in all dimensions. The rejection of some moves will mean that the ball containing the points will probably have elongated in the needles long direction by a factor of, say, two. After another cycle through the points, the ball will have grown in the long direction by another factor of two. So the typical distance travelled in the long dimension grows exponentially with the number of iterations. Now, maybe a factor of two growth per iteration is on the optimistic side; but even if the ball only grows by a factor of, lets say, 1.1 per iteration, the growth is nevertheless exponential. It will only take a number of iterations proportional to log L/ log(1.1) for the long dimension to be explored. Exercise 30.3.[2, p.398] Discuss how the eectiveness of Skillings method scales with dimensionality, using a correlated N -dimensional Gaussian distribution as an example. Find an expression for the rejection probability, assuming the Markov chain is at equilibrium. Also discuss how it scales with the strength of correlation among the Gaussian variables. [Hint: Skillings method is invariant under ane transformations, so the rejection probability at equilibrium can be found by looking at the case of a separable Gaussian.]
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
394
30 Ecient Monte Carlo Methods
This method has some similarity to the adaptive direction sampling method of Gilks et al. (1994) but the leapfrog method is simpler and can be applied to a greater variety of distributions.
30.5
Monte Carlo algorithms as communication channels
It may be a helpful perspective, when thinking about speeding up Monte Carlo methods, to think about the information that is being communicated. Two communications take place when a sample from P (x) is being generated. First, the selection of a particular x from P (x) necessarily requires that at least log 1/P (x) random bits be consumed. [Recall the use of inverse arithmetic coding as a method for generating samples from given distributions (section 6.3).] Second, the generation of a sample conveys information about P (x) from the subroutine that is able to evaluate P (x) (and from any other subroutines that have access to properties of P (x)). Consider a dumb Metropolis method, for example. In a dumb Metropolis method, the proposals Q(x ; x) have nothing to do with P (x). Properties of P (x) are only involved in the algorithm at the acceptance step, when the ratio P (x )/P (x) is computed. The channel from the true distribution P (x) to the user who is interested in computing properties of P (x) thus passes through a bottleneck: all the information about P is conveyed by the string of acceptances and rejections. If P (x) were replaced by a dierent distribution P2 (x), the only way in which this change would have an inuence is that the string of acceptances and rejections would be changed. I am not aware of much use being made of this information-theoretic view of Monte Carlo algorithms, but I think it is an instructive viewpoint: if the aim is to obtain information about properties of P (x) then presumably it is helpful to identify the channel through which this information ows, and maximize the rate of information transfer. Example 30.4. The information-theoretic viewpoint oers a simple justication for the widely-adopted rule of thumb, which states that the parameters of a dumb Metropolis method should be adjusted such that the acceptance rate is about one half. Lets call the acceptance history, that is, the binary string of accept or reject decisions, a. The information learned about P (x) after the algorithm has run for T steps is less than or equal to the information content of a, since all information about P is mediated by a. And the information content of a is upper-bounded by T H 2 (f ), where f is the acceptance rate. This bound on information acquired about P is maximized by setting f = 1/2. Another helpful analogy for a dumb Metropolis method is an evolutionary one. Each proposal generates a progeny x from the current state x. These two individuals then compete with each other, and the Metropolis method uses a noisy survival-of-the-ttest rule. If the progeny x is tter than the parent (i.e., P (x ) > P (x), assuming the Q/Q factor is unity) then the progeny replaces the parent. The survival rule also allows less-t progeny to replace the parent, sometimes. Insights about the rate of evolution can thus be applied to Monte Carlo methods. Exercise 30.5.[3 ] Let x {0, 1}G and let P (x) be a separable distribution, P (x) =
g
p(xg ),
(30.14)
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
30.6: Multi-state methods with p(0) = p0 and p(1) = p1 , for example p1 = 0.1. Let the proposal density of a dumb Metropolis algorithm Q involve ipping a fraction m of the G bits in the state x. Analyze how long it takes for the chain to converge to the target density as a function of m. Find the optimal m and deduce how long the Metropolis method must run for. Compare the result with the results for an evolving population under natural selection found in Chapter 19. The insight that the fastest progress that a standard Metropolis method can make, in information terms, is about one bit per iteration, gives a strong motivation for speeding up the algorithm. This chapter has already reviewed several methods for reducing random-walk behaviour. Do these methods also speed up the rate at which information is acquired? Exercise 30.6.[4 ] Does Gibbs sampling, which is a smart Metropolis method whose proposal distributions do depend on P (x), allow information about P (x) to leak out at a rate faster than one bit per iteration? Find toy examples in which this question can be precisely investigated. Exercise 30.7.[4 ] Hamiltonian Monte Carlo is another smart Metropolis method in which the proposal distributions depend on P (x). Can Hamiltonian Monte Carlo extract information about P (x) at a rate faster than one bit per iteration? Exercise 30.8.[5 ] In importance sampling, the weight wr = P (x(r) )/Q (x(r) ), a oating-point number, is computed and retained until the end of the computation. In contrast, in the dumb Metropolis method, the ratio a = P (x )/P (x) is reduced to a single bit (is a bigger than or smaller than the random number u?). Thus in principle importance sampling preserves more information about P than does dumb Metropolis. Can you nd a toy example in which this extra information does indeed lead to faster convergence of importance sampling than Metropolis? Can you design a Markov chain Monte Carlo algorithm that moves around adaptively, like a Metropolis method, and that retains more useful information about the value of P , like importance sampling? In Chapter 19 we noticed that an evolving population of N individuals can make faster evolutionary progress if the individuals engage in sexual reproduction. This observation motivates looking at Monte Carlo algorithms in which multiple parameter vectors x are evolved and interact.
395
30.6
Multi-state methods
In a multi-state method, multiple parameter vectors x are maintained; they evolve individually under moves such as Metropolis and Gibbs; there are also interactions among the vectors. The intention is either that eventually all the vectors x should be samples from P (x) (as illustrated by Skillings leapfrog method), or that information associated with the nal vectors x should allow us to approximate expectations under P (x), as in importance sampling.
Genetic methods
Genetic algorithms are not often described by their proponents as Monte Carlo algorithms, but I think this is the correct categorization, and an ideal genetic
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
396
30 Ecient Monte Carlo Methods
algorithm would be one that can be proved to be a valid Monte Carlo algorithm that converges to a specied density. Ill use R to denote the number of vectors in the population. We aim to have P ({x(r) }R ) = P (x(r) ). A genetic algorithm involves moves of two or 1 three types. First, individual moves in which one state vector is perturbed, x(r) x(r) , which could be performed using any of the Monte Carlo methods we have mentioned so far. Second, we allow crossover moves of the form x, y x , y ; in a typical crossover move, the progeny x receives half his state vector from one parent, x, and half from the other, y; the secret of success in a genetic algorithm is that the parameter x must be encoded in such a way that the crossover of two independent states x and y, both of which have good tness P , should have a reasonably good chance of producing progeny who are equally t. This constraint is a hard one to satisfy in many problems, which is why genetic algorithms are mainly talked about and hyped up, and rarely used by serious experts. Having introduced a crossover move x, y x , y , we need to choose an acceptance rule. One easy way to obtain a valid algorithm is to accept or reject the crossover proposal using the Metropolis rule with P ({x(r) }R ) as 1 the target density this involves comparing the tnesses before and after the crossover using the ratio P (x )P (y ) . (30.15) P (x)P (y) If the crossover operator is reversible then we have an easy proof that this procedure satises detailed balance and so is a valid component in a chain converging to P ({x(r) }R ). 1 Exercise 30.9.[3 ] Discuss whether the above two operators, individual variation and crossover with the Metropolis acceptance rule, will give a more ecient Monte Carlo method than a standard method with only one state vector and no crossover.
The reason why the sexual community could acquire information faster than the asexual community in Chapter 19 was because the crossover operation produced diversity with standard deviation G, then the Blind Watchmaker was able to convey lots of information about the tness function by killing o the less t ospring. The above two operators do not oer a speed-up of G compared with standard Monte Carlo methods because there is no killing. Whats required, in order to obtain a speed-up, is two things: multiplication and death; and at least one of these must operate selectively. Either we must kill o the less-t state vectors, or we must allow the more-t state vectors to give rise to more ospring. While its easy to sketch these ideas, it is hard to dene a valid method for doing it. Exercise 30.10.[5 ] Design a birth rule and a death rule such that the chain converges to P ({x(r) }R ). 1
I believe this is still an open research problem.
Particle lters
Particle lters, which are particularly popular in inference problems involving temporal tracking, are multistate methods that mix the ideas of importance sampling and Markov chain Monte Carlo. See Isard and Blake (1996), Isard and Blake (1998), Berzuini et al. (1997), Berzuini and Gilks (2001), Doucet et al. (2001).
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
30.7: Methods that do not necessarily help
397
30.7
Methods that do not necessarily help
It is common practice to use many initial conditions for a particular Markov chain (gure 29.19). If you are worried about sampling well from a complicated density P (x), can you ensure the states produced by the simulations are well distributed about the typical set of P (x) by ensuring that the initial points are well distributed about the whole state space ? The answer is, unfortunately, no. In hierarchical Bayesian models, for example, a large number of parameters {xn } may be coupled together via another parameter (known as a hyperparameter). For example, the quantities {xn } might be independent noise signals, and might be the inverse-variance of the noise source. The joint distribution of and {xn } might be
N
P (, {xn }) = P ( ) = P ( )
n=1 N n=1
P (xn | )
1 Z ( )
exn /2 ,
2
where Z ( ) = 2/ and P ( ) is a broad distribution describing our ignorance about the noise level. For simplicity, lets leave out all the other variables data and such that might be involved in a realistic problem. Lets imagine that we want to sample eectively from P (, {xn }) by Gibbs sampling alternately sampling from the conditional distribution P ( |xn ) then sampling all the xn from their conditional distributions P (xn | ). [The resulting marginal distribution of should asymptotically be the broad distribution P ( ).] If N is large then the conditional distribution of given any particular setting of {xn } will be tightly concentrated on a particular most-probable value of , with width proportional to 1/ N . Progress up and down the -axis will therefore take place by a slow random walk with steps of size 1/ N . So, to the initialization strategy. Can we nesse our slow convergence problem by using initial conditions located all over the state space ? Sadly, no. If we distribute the points {xn } widely, what we are actually doing is favouring an initial value of the noise level 1/ that is large . The random walk of the parameter will thus tend, after the rst drawing of from P ( |xn ), always to start o from one end of the -axis.
Further reading
The Hamiltonian Monte Carlo method is reviewed in Neal (1993b). This excellent tome also reviews a huge range of other Monte Carlo methods, including the related topics of simulated annealing and free energy estimation.
30.8
Further exercises
Exercise 30.11.[4 ] An important detail of the Hamiltonian Monte Carlo method is that the simulation of the Hamiltonian dynamics, while it may be inaccurate, must be perfectly reversible, in the sense that if the initial condition (x, p) (x , p ), then the same simulator must take (x , p ) (x, p), and the inaccurate dynamics must conserve statespace volume. [The leapfrog method in algorithm 30.1 satises these rules.] Explain why these rules must be satised and create an example illustrating the problems that arise if they are not.
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.
398
30 Ecient Monte Carlo Methods
Exercise 30.12.[4 ] A multi-state idea for slice sampling. Investigate the following multi-state method for slice sampling. As in Skillings multi-state leapfrog method (section 30.4), maintain a set of S state vectors {x(s) }. Update one state vector x(s) by one-dimensional slice sampling in a direction y determined by picking two other state vectors x(v) and x(w) at random and setting y = x(v) x(w) . Investigate this method on toy problems such as a highly-correlated multivariate Gaussian distribution. Bear in mind that if S 1 is smaller than the number of dimensions N then this method will not be ergodic by itself, so it may need to be mixed with other methods. Are there classes of problems that are better solved by this slice sampling method than by the standard methods for picking y such as cycling through the coordinate axes or picking u at random from a Gaussian distribution?
x(v) x(s) x(w)
30.9
Solutions
Solution to exercise 30.3 (p.393). Consider the spherical Gaussian distribution where all components have mean zero and variance 1. In one dimension, the (1) (2) nth, if xn leapfrogs over xn , we obtain the proposed coordinate (x(1) ) = 2x(2) x(1) . n n n
(1) (2)
(30.16)
Assuming that xn and xn are Gaussian random variables from Normal(0, 1), (1) (xn ) is Gaussian from Normal(0, 2 ), where 2 = 22 +(1)2 = 5. The change in energy contributed by this one dimension will be 1 (2x(2) x(1) )2 (x(1) )2 = 2(x(2) )2 2x(2) x(1) n n n n n n 2
(2)
(30.17)
so the typical change in energy is 2 (xn )2 = 2. This positive change is bad news. In N dimensions, the typical change in energy when a leapfrog move is made, at equilibrium, is thus +2N . The probability of acceptance of the move scales as e2N . (30.18) This implies that Skillings method, as described, is not eective in very highdimensional problems at least, not once convergence has occured. Nevertheless it has the impressive advantage that its convergence properties are independent of the strength of correlations between the variables a property that not even the Hamiltonian Monte Carlo and overrelaxation methods oer.
Find millions of documents on Course Hero - Study Guides, Lecture Notes, Reference Materials, Practice Exams and more.
Course Hero has millions of course specific materials providing students with the best way to expand
their education.
Below is a small sample set of documents:
BYU - ECE - 771
Parameter Estimation3rd March 20041IntroductionA commonly-encountered inverse problem occurs when a system can be modeled by a dierential equation. For example, consider the one-dimensional inhomogeneous Helmholtz equation d dx G (x) dU dx +
BYU - ECE - 771
HOMEWORK SOLUTIONS1.1a: Proof that f 1 [B ] =f 1 [B ].Proof. Assume x f 1 [ B ] then by denition there is a y B such that y = f (x) . This y must be in at least one of the sets, B , call this set B0 . By denition, x f 1 [B0 ] and thu
West Virginia University at Parkersburg - SB - 653
ACTION PLAN DOCUMENTI.ACCESS TO HIGHER EDUCATIONGoal A - Create accessible graduate education in every region of state. Goal IA.1 - In order to provide citizens access to stable and continuing graduate-level programs in every region of the stat
Penn State - SAS - 5516
Stefan Szydlowskisas5516@psu.edu Permanent Address 2100 Augusta Drive Center Valley, PA 18034 Home- (610) 694-8988 Education Local Address 102 Geary Hall University Park, PA 16802 Cell-(610)248-3894The Pennsylvania State University, University Par
UC Davis - ATT - 0503
Boletn CIIFEN Marzo 2005: Condiciones del ENOS: una perspectiva regionalEl CIIFEN presenta este servicio de informacin que pretende proveer a los usuarios cientficos, tcnicos y la poblacin en general de una sntesis til y oportuna del anlisis de las
Laurentian - CHEM - 10001
CHEMISTRY 1000INSTRUCTORS: Michelle Duke John Eng Wayne Lippa Ying ZhengLAB SCHEDULEFall 2005Laboratory Manual: University of Lethbridge Chemistry 1000 Laboratory ManualOffice: Office: Office: Office:E-856 E-790 E-711 E-788duke@uleth.ca
Laurentian - CHEM - 10001
Chemistry 1000 Labs begin the week of Monday, September 12th, 2005. Please have purchased a Chemistry 1000 laboratory manual and lab coat and bring them to your first lab period. There will be a lab completed during the first period. It is important
Syracuse - PHY - 215
PHY 215 - Problem Set #1 - Due Friday September 9, 2005August 11, 20051A Note about the AssignmentsThere are two types of Assignments: masteringphysics assignments and Problem Sets. The rst, masteringphysics assignments, are viewed and comple
Syracuse - PHY - 215
PHY 215 - Problem Set #2 - Due Friday September 16, 20051A Note about the AssignmentsThere are two types of Assignments: masteringphysics assignments and Problem Sets. In order to view and work on the Assignments, you need to use the Student Ac
Syracuse - PHY - 215
PHY 215 - Problem Set #3 - Due Friday September 23, 20051A Note about the AssignmentsThere are two types of Assignments: masteringphysics assignments and Problem Sets. In order to view and work on the Assignments, you need to use the Student Ac
Syracuse - PHY - 215
PHY 215 - Problem Set #4 - Due Friday September 30, 20051The AssignmentDo the following problems from the textbook (University Physics): 5.104, 5.120, 5.125, 5.119 Go to www.masteringphysics.com and do PS4-M (there are two easy problems).1
Syracuse - PHY - 215
PHY 215 - Problem Set #5 - Due Friday October 7, 20051The AssignmentDo the following problems from the textbook (University Physics): 5.21, 5.93, 6.4, 6.18,1
Syracuse - PHY - 215
PHY 215 - Problem Set #6 - Due Friday October 21, 20051The AssignmentDo the following problems from the textbook (University Physics): 7.68, 7.82, 7.86, 6.641
Syracuse - PHY - 215
PHY 215 - Problem Set #7 - Due Friday October 28, 20051The AssignmentDo the following problems from the textbook (University Physics): 7.65, 12.65, 12.601
Syracuse - PHY - 215
PHY 215 - Problem Set #9 - Due Friday November 18, 20051The AssignmentDo the following problems from the textbook (University Physics): 10.83, 10.90, 10.51, 9.98, 10.581
Syracuse - PHY - 215
PHY 215 - Problem Set #10 - Due Friday December 2, 20051The AssignmentDo the following problems from the textbook (University Physics): 11.21, 13.63 , 13.40, 13.1021
Penn State - MATH - 508
DYNAMICAL SYSTEMS EXERCISE SHEET 2Problem 1. Let Ti : Xi Xi be two continuous mappings of two compact metric spaces X1 , X2 . Dene T1 T2 : X1 X2 X1 X2 by (T1 T2 )(x1 , x2 ) = (T1 x1 , T2 x2 ). Prove: htop (T1 T2 ) = htop (T1 ) + htop (T2 ).
Penn State - CHEM - 431
I n the Laboratoryedited byGreen ChemistryMary M. KirchhoffACS Green Chemistry Institute Washington, DC 20036One-Pot Synthesis of 7-Hydroxy-3-carboxycoumarin in WaterWFrancesco Fringuelli,* Oriana Piermatti, and Ferdinando Pizzo Dipartime
UCSD - MAE - 156
Ocean Controllable Drogues Progress Report for Week 9 Group 15 Goals for next week All Finishing touches and present. Accomplishments from last week Verified waterproof end caps Wired circuit on Lab-X2 board Risks/challenges End caps can be diffi
UCSD - MAE - 156
Ocean Controllable Drogues Progress Report for Week 8 Group 15 Goals for next week Bryan Verify circuit wiring. Sonny Update and make the website presentable. Ryan Assist with circuit and end cap. Jimmy Small adjustments to end caps. Perform wat
UCSD - MAE - 156
Ocean Controllable Drogues Progress Report for Week 7 Group 15 Goals for next week Bryan Wire circuit. Sonny Wire circuit. Update and make the website presentable. Ryan Machining more end caps. Jimmy Machining more end caps. Test the piston tank
UCSD - MAE - 156
Ocean Controllable Drogues Progress Report for Week 6 Group 15 Goals for next week Bryan Perform strength analysis on end caps. Research what batteries to use. Sonny Study strength performance. Perform an FEA analysis on the cylinder design with i
UCSD - MAE - 156
Ocean Controllable Drogues Progress Report for Week 5 Group 15 Goals for next week Bryan Contact Scripps about motor control, hardware, and pressure sensor. Sonny Study strength performance. Perform an FEA analysis on the cylinder design. Update a
UCSD - MAE - 156
Ocean Controllable Drogues Progress Report for Week 4 Group 15 Goals for next week Bryan Research seals. Get some polycarbonate tubes for testing. Sonny Research seals. Perform an FEA analysis on the cylinder design. Update webpage and act as head
UCSD - MAE - 156
Ocean Controllable Drogues Progress Report for Week 3 Group 15 Goals for next week Bryan Buy PVC tube for Proof of Concept. Perform air tight analysis on PVC tube for Proof of Concept. Sonny Perform an FEA analysis on the cylinder design. Update w
UCSD - MAE - 156
Ocean Controllable Drogues Progress Report for Week 2 Group 15 Goals for next week Bryan Duties include setting up meetings with Andy (TA) and a group meeting to include professors and the communication team to discuss progress thus far on the drog
UCSD - MAE - 156
Ocean Controllable Drogues Progress Report for Week 1 Group 15 Goals for next week Bryan Duties include setting up meetings with the communication team to discuss how their optical equipment will fit on the drogue. Discussion will include how much
Berkeley - MCB - 140
Restriction MappingRestriction Mapping: Double DigestsFactor I Single CutFactor 2 Single CutRestriction Mapping: Double DigestsRestriction Mapping: Double DigestsRestriction Mapping: Double DigestsSingle BamHI 5600BamHIEcoRI 400 5
Berkeley - MCB - 140
Todays Agenda Run Gel for Blot Denature Gel Examine Transformations Set up Overnight Cultures for Thursday Set up DNA Blot for Hybridization on ThursdaySouthern BlotNYLON MEMBRANEDRY BLOTTING PAPER NaOH BLOTTING PAPER GEL NaOH BLOTTING PAPE
Berkeley - MCB - 140
PCR and 2-Hybrid ScreensTodays Activities Experiment 2 Digest each of your minipreps w/HindIII and PstI (separate single digests) and freeze for next time Set up PCR reactions to confirm presence and orientation of Kan gene Experiment 3 Trans
Berkeley - MCB - 140
<?xml version="1.0" encoding="UTF-8"?> <Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>201e821e99eede20fa3643da01e05e859662608c.ppt</Key><RequestId>B 3F53A135D0EAD2A</RequestId><HostId>XzK7h/B+pxENqi69G1TK00H5wvK
Berkeley - MCB - 140
Expt. 10-1: P-elements and Enhancer TrappingExpt. 10: P-elements and Enhancer Trapping -Naturally occurring P-elements:Transposase gene between two inverted repeatsTransposaseOrganization of the P ElementTransposase ORF2.9 kb P Element31
Laurentian - MGT - 200203
MGT 5541Y Derivatives PROJECT2002 Fall SemesterPart A with a notional US $100,000 beginning with Nov. 1/02 closing stock/option prices for General Electric (GE) Purchase shares in board lots (100 shares), A1 Write or issue covered calls (gener
Laurentian - MGT - 200203
University of Lethbridge Derivative Securities Markets MGt 4451Y Assignment Notice The assignments will be available to be picked up in the U of L admin office by Noon on Monday Dec. 9/02 Terry Harbottle
Laurentian - MGT - 200203
Chapter Two Basic Principles of Options Answers to Problems and Questions 1.2.Up False. The owner of an option has the right to do something, while the writer has an obligation to perform if the option owner exercises. The term "wasting asset" ref
Laurentian - MGT - 200203
<?xml version="1.0" encoding="UTF-8"?> <Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>6d3fb11fe63620fdc9a9c0b465b34a197c5d1704.ppt</Key><RequestId>7 4633FC74C2597DC</RequestId><HostId>Ix6WMmOg83oFCIOn0DJPekLJMi6
Laurentian - MGT - 200203
Chapter Five Option Pricing Answers to Problems and Questions 1. Arbitrage is the existence of a riskless profit, especially when you do not have to commit funds to earn the profit. 2. Market prices move continually, and frequently move slightly away
Laurentian - MGT - 200203
Chapter 6The Black-Scholes Option Pricing Model1 2002 South-Western PublishingIntroduction TheBlack-Scholes option pricing model (BSOPM) has been one of the most important developments in finance in the last 30 yearsbasic valuation model
Laurentian - MGT - 200203
Chapter Six The Black-Scholes Option Pricing Model Answers to Problems and Questions 1. Interest rates may have changed, the expected amount of the next dividend may have changed, or the market may anticipate a different future level of volatility in
Laurentian - MGT - 200203
Chapter 8Fundamentals of the Futures Market1 2002 South-Western PublishingOutlineA. The concept of futures contracts B. Market mechanics C. Market participants D. The clearing process E. Principles of futures contract pricing F. Hedging G. S
Laurentian - MGT - 200203
Chapter 8Fundamentals of the Futures Market1 2002 South-Western PublishingOutlineA. The concept of futures contracts B. Market mechanics C. Market participants D. The clearing process E. Principles of futures contract pricing F. Hedging G. S
Laurentian - MGT - 200203
Chapter 11Fundamentals of Interest Rate Futures1 2002 South-Western PublishingOutline Interestrate futures Treasury bills, Eurodollars, and their futures contracts Speculating & Hedging with T-bill futures Hedging with Eurodollar Future
Laurentian - MGT - 200203
Chapter Thirteen Swaps and Interest Rate Options Answers to Problems and Questions 1. In an interest rate swap, the swap seller is the party paying the floating rate. 2. The swap tenor is the length (in time) of the swap arrangement. The swap price i
Laurentian - MGT - 200203
Chapter Fourteen Swap Pricing Answers to Problems and Questions 1. If the options are European style you can use the put/call parity model with C P = 0, meaning K = S (1+R)T. Here we have K = 28.55 x (1.055)0.5 = 29.32 If the options are American st
Laurentian - MGT - 200203
Oct 1 Message: Derivatives - 4451Y Message re Text Problem Solutions I will be posting to the web-site the solutions for the end of chapter problems for chapters 8, 11 and 12. Please look for them shortly. Terry Harbottle University of Lethbridge
Laurentian - MGT - 200203
Chapter Eight Fundamentals of the Futures Market Answers to Problems and Questions 1. Open interest measures the number of futures contracts that exist. One side of the contract may change hands many times prior to the delivery month. Each time a tra
Laurentian - MGT - 200203
Chapter Eleven Fundamentals of Interest Rate Futures Answers to Problems and Questions 1. Unlike shares in a particular common stock or bushels of inspected soybeans, not all Treasury bonds are identical. Differences in coupon rate and in maturity ar
Laurentian - MGT - 200203
Chapter Twelve Futures Contracts and Portfolio Management Answers to Problems and Questions 1. Immunization refers to the situation in which the effects of interest rate risk and reinvestment rate risk largely cancel. 2. Bullet immunization seeks to
Laurentian - MGT - 200203
<?xml version="1.0" encoding="UTF-8"?> <Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>d0728d470ccb744c325027a8e150fa1c15007e15.ppt</Key><RequestId>6 137D7B2FD634153</RequestId><HostId>ZjULq4SI6hTourMWHF+/QPs1VA+
Laurentian - MGT - 200203
Chapter 3Basic Option Strategies: Covered Calls and Protective Puts1 2002 South-Western PublishingOutlineEquity Options Using options as a hedge Using options to generate income Profit and loss diagrams with seasoned stock positions Impro
Laurentian - MGT - 200203
Chapter Three Basic Option Strategies: Covered Calls and Protective Puts Answers to Problems and Questions 1. Covering a call means either 1) buying back a call that you previously wrote, or 2) adding a stock or long option position so that a short c
Laurentian - MGT - 200203
Chapter 4Option Combinations and Spreads1 2002 South-Western PublishingChapter 3 - Rolling Trading Strategies Rollingdown/out/up - possible action as part of re-evaluating ones option positionRolling down - closing your position and reest
Laurentian - MGT - 200203
Chapter Four Option Combinations and Spreads Answers to Problems and Questions 1. A straddle involves a higher maximum profit if the stock remains near the option striking price. A strangle has a lower maximum profit, but it occurs over a wider range
Laurentian - MGT - 200203
Chapter 13Swaps and Interest Rate Options1 2002 South-Western PublishingOutline Introduction Interestrate swaps Foreign currency swaps Circus swap Interest rate options2Introduction Bothswaps and interest rate options are relat
Laurentian - MGT - 200203
Chapter 14Swap Pricing1 2002 South-Western PublishingOutline Swappricing Solving for the swap price Valuing an off-market swap Hedging the swap Pricing a currency swap2Swap Pricing Swapsas a pair of bonds Swaps as a series of f
Laurentian - MGT - 200203
University of Lethbridge Derivative Securities Markets - MGT 4451Y Class Assignment update Expectations (discussed at November 5/02 class) Part A A stock outlook price assumptions, by when? A put and call strategy aligned with stock outlook Work
Laurentian - MGT - 200203
Chapter 12Futures Contracts and Portfolio Management1 2002 South-Western PublishingThe Concept of Immunization Introduction Bondrisks Duration matching Bullet Immunization Bank immunization Duration shifting2The Concept of Immuniza
Laurentian - MGT - 200203
T5.1 Chapter OutlineIntroduction to Valuation: The Time Value of Money Chapter Organization 5.1 Future Value and CompoundingChapter 5 5.2 Present Value and Discounting 5.3 More on Present and Future Values 5.4 Summary and ConclusionsCLIC
Laurentian - MGT - 200203
T3.1 Chapter OutlineWorking With Financial Statements Chapter Organization 3.1 Cash Flow and Financial Statements: A Closer Look 3.2 Standardized Financial Statements 3.3 Ratio Analysis 3.4 The Du Pont Identity 3.5 Using Financial Statement In
Laurentian - MGT - 200203
T4.1 Chapter OutlineLong-Term Financial Planning and Corporate Growth Chapter Organization 4.1 What is Financial Planning? 4.2 Financial Planning Models: A First Look 4.3 The Percentage of Sales Approach 4.4 External Financing and Growth 4.5 S
BYU - MATH - 371
REVIEW SHEET FOR EXAM 33. Theorems to know Be able to prove theorems with a . (1) Lemma 4.2.1 (2) Lemma 4.3.1 (3) Theorem 4.3.2 (4) Theorem 4.3.3 (5) Theorem 4.3.4 (6) Theorem 4.3.5 (7) Theorem 4.3.6 (8) Lemma 4.4.1 (9) Theorem 4.4.2 (10) The