70 Pages

ChipmanGeorgeMcCulloch2001

Course: ST 790, Fall 2008
School: N.C. State
Rating:
 
 
 
 
 

Word Count: 26953

Document Preview

Selection Model IMS Lecture Notes - Monograph Series (2001) Volume 38 The Practical Implementation of Bayesian Model Selection Hugh Chipman, Edward I. George and Robert E. McCulloch The University of Waterloo, The University of Pennsylvania and The University of Chicago Abstract In principle, the Bayesian approach to model selection is straightforward. Prior probability distributions are used to describe the...

Register Now

Unformatted Document Excerpt

Coursehero >> North Carolina >> N.C. State >> ST 790

Course Hero has millions of student submitted documents similar to the one
below including study guides, practice problems, reference materials, practice exams, textbook help and tutor support.

Course Hero has millions of student submitted documents similar to the one below including study guides, practice problems, reference materials, practice exams, textbook help and tutor support.
Selection Model IMS Lecture Notes - Monograph Series (2001) Volume 38 The Practical Implementation of Bayesian Model Selection Hugh Chipman, Edward I. George and Robert E. McCulloch The University of Waterloo, The University of Pennsylvania and The University of Chicago Abstract In principle, the Bayesian approach to model selection is straightforward. Prior probability distributions are used to describe the uncertainty surrounding all unknowns. After observing the data, the posterior distribution provides a coherent post data summary of the remaining uncertainty which is relevant for model selection. However, the practical implementation of this approach often requires carefully tailored priors and novel posterior calculation methods. In this article, we illustrate some of the fundamental practical issues that arise for two dierent model selection problems: the variable selection problem for the linear model and the CART model selection problem. Hugh Chipman is Associate Professor of Statistics, Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, ON N2L 3G1, Canada; email: hachipman@uwaterloo.ca. Edward I. George is Professor of Statistics, Department of Statistics, The Wharton School of the University of Pennsylvania, 3620 Locust Walk, Philadelphia, PA 19104-6302, U.S.A; email: edgeorge@wharton.upenn.edu. Robert E. McCulloch is Professor of Statistics, Graduate School of Business, University of Chicago, 1101 East 58th Street, Chicago, IL, U.S.A; email: Robert.McCulloch@gsb.uchicago.edu. This work was supported by NSF grant DMS-98.03756 and Texas ARP grant 003658.690. 0 Contents 1 Introduction 2 The General Bayesian Approach 2.1 2.2 2.3 67 69 A Probabilistic Setup for Model Uncertainty . . . . . . . . . . . . . . . . . 69 General Considerations for Prior Selection . . . . . . . . . . . . . . . . . . 71 Extracting Information from the Posterior . . . . . . . . . . . . . . . . . . 73 75 3 Bayesian Variable Selection for the Linear Model 3.1 3.2 3.3 3.4 3.5 Model Space Priors for Variable Selection . . . . . . . . . . . . . . . . . . 76 Parameter Priors for Selection of Nonzero i . . . . . . . . . . . . . . . . 80 Calibration and Empirical Bayes Variable Selection . . . . . . . . . . . . . 82 Parameter Priors for Selection Based on Practical Signicance . . . . . . . 85 Posterior Calculation and Exploration for Variable Selection . . . . . . . . 88 3.5.1 3.5.2 3.5.3 3.5.4 3.5.5 Closed Form Expressions for p(Y | ) . . . . . . . . . . . . . . . . . 88 MCMC Methods for Variable Selection . . . . . . . . . . . . . . . . 89 Gibbs Sampling Algorithms . . . . . . . . . . . . . . . . . . . . . . 90 Metropolis-Hastings Algorithms . . . . . . . . . . . . . . . . . . . . 91 Extracting Information from the Output . . . . . . . . . . . . . . . 93 95 4 Bayesian CART Model Selection 4.1 Prior Formulations for Bayesian CART . . . . . . . . . . . . . . . . . . . . 98 4.1.1 4.1.2 Tree Prior Specication . . . . . . . . . . . . . . . . . . . . . . . . 98 Parameter Prior Specications . . . . . . . . . . . . . . . . . . . . 100 4.2 Stochastic Search of the CART Model Posterior . . . . . . . . . . . . . . . 103 4.2.1 4.2.2 4.2.3 Metropolis-Hastings Search Algorithms . . . . . . . . . . . . . . . 103 Running the MH Algorithm for Stochastic Search . . . . . . . . . 105 Selecting the Best Trees . . . . . . . . . . . . . . . . . . . . . . . 106 107 5 Much More to Come Practical Bayes Model Selection 67 1 Introduction The Bayesian approach to statistical problems is fundamentally probabilistic. A joint probability distribution is used to describe the relationships between all the unknowns and the data. Inference is then based on the conditional probability distribution of the unknowns given the observed data, the posterior distribution. Beyond the specication of the joint distribution, the Bayesian approach is automatic. Exploiting the internal consistency of the probability framework, the posterior distribution extracts the relevant information in the data and provides a complete and coherent summary of post data uncertainty. Using the posterior to solve specic inference and decision problems is then straightforward, at least in principle. In this article, we describe applications of this Bayesian approach for model uncertainty problems where a large number of dierent models are under consideration for the data. The joint distribution is obtained by introducing prior distributions on all the unknowns, here the parameters of each model and the models themselves, and then combining them with the distributions for the data. Conditioning on the data then induces a posterior distribution of model uncertainty that can be used for model selection and other inference and decision problems. This is the essential idea and it can be very powerful. Especially appealing is its broad generality as it is based only on probabilistic considerations. However, two major challenges confront its practical implementation the specication of the prior distributions and the calculation of the posterior. This will be our main focus. The statistical properties of the Bayesian approach rest squarely on the specication of the prior distributions on the unknowns. But where do these prior distributions come from and what do they mean? One extreme answer to this question is the pure subjective Bayesian point of view that characterizes the prior as a wholly subjective description of initial uncertainty, rendering the posterior as a subjective post data description of uncertainty. Although logically compelling, we nd this characterization to be unrealistic in complicated model selection problems where such information is typically unavailable or dicult to precisely quantify as a probability distribution. At the other extreme is the objective Bayesian point of view which seeks to nd semi-automatic prior formulations or approximations when subjective information is unavailable. Such priors can serve as default inputs and make them attractive for repeated use by non-experts. Prior specication strategies for recent Bayesian model selection implementations, including our own, have tended to fall somewhere between these two extremes. Typically, specic parametric families of proper priors are considered, thereby reducing the specication problem to that of selecting appropriate hyperparameter values. To avoid 68 H. Chipman, E. I. George and R. E. McCulloch the need for subjective inputs, automatic default hyperparameter choices are often recommended. For this purpose, empirical Bayes considerations, either formal or informal, can be helpful, especially when informative choices are needed. However, subjective considerations can also be helpful, at least for roughly gauging prior location and scale and for putting small probability on implausible values. Of course, when substantial prior information is available, the Bayesian model selection implementations provide a natural environment for introducing realistic and important views. By abandoning the pure subjective point of view, the evaluation of such Bayesian methods must ultimately involve frequentist considerations. Typically, such evaluations have taken the form of average performance over repeated simulations from hypothetical models or of cross validations on real data. Although such evaluations are necessarily limited in scope, the Bayesian procedures have consistently performed well compared to non-Bayesian alternatives. Although more work is clearly needed on this crucial aspect, there is cause for optimism, since by the complete class theorems of decision theory, we need look no further than Bayes and generalized Bayes procedures for good frequentist performance. The second major challenge confronting the practical application of Bayesian model selection approaches is posterior calculation or perhaps more accurately, posterior exploration. Recent advances in computing technology coupled with developments in numerical and Monte Carlo methods, most notably Markov Chain Monte Carlo (MCMC), have opened up new and promising directions for addressing this challenge. The basic idea behind MCMC here is the construction of a sampler which simulates a Markov chain that is converging to the posterior distribution. Although this provides a route to calculation of the full posterior, such chains are typically run for a relatively short time and used to search for high posterior models or to estimate posterior characteristics. However, constructing eective samplers and the use of such methods can be a delicate matter involving problem specic considerations such as model structure and the prior formulations. This very active area of research continues to hold promise for future developments. In this introduction, we have described our overall point of view to provide context for the implementations we are about to describe. In Section 2, we describe the general Bayesian approach in more detail. In Sections 3 and 4, we illustrate the practical implementation of these general ideas to Bayesian variable selection for the linear model and Bayesian CART model selection, respectively. In Section 5, we conclude with a brief discussion of related recent implementations for Bayesian model selection. Practical Bayes Model Selection 69 2 2.1 The General Bayesian Approach A Probabilistic Setup for Model Uncertainty Suppose a set of K models M = {M1 , . . . , MK } are under consideration for data Y , and that under Mk , Y has density p(Y | k , Mk ) where k is a vector of unknown parameters that indexes the members of Mk . (Although we refer to Mk as a model, it is more precisely a model class). The Bayesian approach proceeds by assigning a prior probability distribution p(k | Mk ) to the parameters of each model, and a prior probability p(Mk ) to each model. Intuitively, this complete specication can be understood as a three stage hierarchical mixture model for generating the data Y ; rst the model Mk is generated from p(M1 ), . . . , p(MK ), second the parameter vector k is generated from p(k | Mk ), and third the data Y is generated from p(Y | k , Mk ). Letting Yf be future observations of the same process that generated Y , this prior formulation induces a joint distribution p(Yf , Y, k , Mk ) = p(Yf , Y | k , Mk ) p(k | Mk ) p(Mk ). Conditioning on the observed data Y , all remaining uncertainty is captured by the joint posterior distribution p(Yf , k , Mk | Y ). Through conditioning and marginalization, this joint posterior can be used for a variety Bayesian inferences and decisions. For example, when the goal is exclusively prediction of Yf , attention would focus on the predictive distribution p(Yf | Y ), which is obtained by margining out both k and Mk . By averaging over the unknown models, p(Yf |Y ) properly incorporates the model uncertainty embedded in the priors. In eect, the predictive distribution sidesteps the problem of model selection, replacing it by model averaging. However, sometimes interest focuses on selecting one of the models in M for the data Y , the model selection problem. One may simply want to discover a useful simple model from a large speculative class of models. Such a model might, for example, provide valuable scientic insights or perhaps a less costly method for prediction than the model average. One may instead want to test a theory represented by one of a set of carefully studied models. In terms of the three stage hierarchical mixture formulation, the model selection problem becomes that of nding the model in M that actually generated the data, namely the model that was generated from p(M1 ), . . . , p(MK ) in the rst step. The probability that Mk was in fact this model, conditionally on having observed Y , is the posterior model probability p(Mk | Y ) = where p(Y | Mk ) = p(Y | k , Mk )p(k | Mk )dk (2.2) p(Y | Mk )p(Mk ) k p(Y | Mk )p(Mk ) (2.1) 70 H. Chipman, E. I. George and R. E. McCulloch is the marginal or integrated likelihood of Mk . Based on these posterior probabilities, pairwise comparison of models, say M1 and M2 , is summarized by the posterior odds p(M1 | Y ) p(Y | M1 ) p(M1 ) = . p(M2 | Y ) p(Y | M2 ) p(M2 ) This expression reveals how the data, through the Bayes factor prior odds p(M1 ) p(M2 ) p(Y | M1 ) p(Y | M2 ) , (2.3) updates the to yield the posterior odds. The model posterior distribution p(M1 |Y ), . . . , p(MK |Y ) is the fundamental object of interest for model selection. Insofar as the priors p(k | Mk ) and p(Mk ) provide an initial representation of model uncertainty, the model posterior summarizes all the relevant information in the data Y and provides a complete post-data representation of model uncertainty. By treating p(Mk | Y ) as a measure of the truth of model Mk , a natural and simple strategy for model selection is to choose the most probable Mk , the one for which p(Mk | Y ) largest. Alternatively one might prefer to report a set of high posterior models along with their probabilities to convey the model uncertainty. More formally, one can motivate selection strategies based on the posterior using a decision theoretic framework where the goal is to maximize expected utility, (Gelfand, Dey and Chang 1992 and Bernardo and Smith 1994). More precisely, let represent the action of selecting Mk , and suppose that is evaluated by a utility function u(, ), where is some unknown of interest, possibly Yf . Then, the optimal selection is that which maximizes the expected utility u(, )p( | Y )d where the predictive distribution of given Y p( | Y ) = k (2.4) p( | Mk , Y )p(Mk | Y ) (2.5) is a posterior weighted mixture of the conditional predictive distributions. p( | Mk , Y ) = p( | k , Mk )p(k | Mk , Y )dk (2.6) It is straightforward to show that if identies one of the Mk as the true state of nature, and u(, ) is 0 or 1 according to whether a correct selection has been made, then selection of the highest posterior probability model will maximize expected utility. However, dierent selection strategies are motivated by other utility functions. For example, suppose entails choosing p( | Mk , Y ) as a predictive distribution for a future observation , and this selection is to be evaluated by the logarithmic score function u(, ) = log p( | Mk , Y ). Then, the best selection is that which maximizes Practical Bayes Model Selection the posterior weighted logarithmic divergence p(Mk | Y ) k 71 p( | Mk , Y ) log p( | Mk , Y ) p( | Mk , Y ) (2.7) (San Martini and Spezzaferri 1984). However, if the goal is strictly prediction and not model selection, then expected logarithmic utility is maximized by using the posterior weighted mixture p( | Y ) in (2.5). Under squared error loss, the best prediction of is the overall posterior mean E( | Y ) = k E( | Mk , Y )p(Mk | Y ). (2.8) Such model averaging or mixing procedures incorporate model uncertainty and have been advocated by Geisser (1993), Draper (1995), Hoeting, Madigan, Raftery and Volinsky (1999) and Clyde, Desimone and Parmigiani (1995). Note however, that if a cost of model complexity is introduced into these utilities, then model selection may dominate model averaging. Another interesting modication of the decision theory setup is to allow for the possibility that the true model is not one of the Mk , a commonly held perspective in many applications. This aspect can be incorporated into a utility analysis by using the actual predictive density in place of p( | Y ). In cases where the form of the true model is completely unknown, this approach serves to motivate cross validation types of training sample approaches, (see Bernardo and Smith 1994, Berger and Pericchi 1996 and Key, Perrichi and Smith 1998). 2.2 General Considerations for Prior Selection For a given set of models M, the eectiveness of the Bayesian approach rests rmly on the specication of the parameter priors p(k |Mk ) and the model space prior p(M1 ), . . . , p(MK ). Indeed, all of the utility results in the previous section are predicated on the assumption that this specication is correct. If one takes the subjective point of view that these priors represent the statisticians prior uncertainty about all the unknowns, then the posterior would be the appropriate update of this uncertainty after the data Y has been observed. However appealing, the pure subjective point of view here has practical limitations. Because of the sheer number and complexity of unknowns in most model uncertainty problems, it is probably unrealistic to assume that such uncertainty can be meaningfully described. The most common and practical approach to prior specication in this context is to try and construct noninformative, semi-automatic formulations, using subjective and 72 H. Chipman, E. I. George and R. E. McCulloch empirical Bayes considerations where needed. Roughly speaking, one would like to specify priors that allow the posterior to accumulate probability at or near the actual model that generated the data. At the very least, such a posterior can serve as a heuristic device to identify promising models for further examination. Beginning with considerations for choosing the model space prior p(M1 ), . . . , p(MK ), a simple and popular choice is the uniform prior p(Mk ) 1/K (2.9) which is noninformative in the sense of favoring all models equally. Under this prior, the model posterior is proportional to the marginal likelihood, p(Mk |Y ) p(Y |Mk ), and posterior odds comparisons in (2.3) reduce to Bayes factor comparisons. However, the apparent noninformativeness of (2.9) can be deceptive. Although uniform over models, it will typically not be uniform on model characteristics such as model size. A more subtle problem occurs in setups where many models are very similar and only a few are distinct. In such cases, (2.9) will not assign probability uniformly to model neighborhoods and may bias the posterior away from good models. As will be seen in later sections, alternative model space priors that dilute probability within model neighborhoods can be meaningfully considered when specic model structures are taken into account. Turning to the choice of parameter priors p(k | Mk ), direct insertion of improper noninformative priors into (2.1) and (2.2) must be ruled out because their arbitrary norming constants are problematic for posterior odds comparisons. Although one can avoid some of these diculties with constructs such as intrinsic Bayes factors (Berger and Pericchi 1996) or fractional Bayes factors (OHagan 1995), many Bayesian model selection implementations, including our own, have stuck with proper parameter priors, especially in large problems. Such priors guarantee the internal coherence of the Bayesian formulation, allow for meaningful hyperparameter specications and yield proper posterior distributions which are crucial for the MCMC posterior calculation and exploration described in the next section. Several features are typically used to narrow down the choice of proper parameter priors. To ease the computational burden, it is very useful to choose priors under which rapidly computable closed form expressions for the marginal p(Y | Mk ) in (2.2) can be obtained. For exponential family models, conjugate priors serve this purpose and so have been commonly used. When such priors are not used, as is sometimes necessary outside the exponential family, computational eciency may be obtained with the approximations of p(Y |Mk ) described in Section 2.3. In any case, it is useful to parametrize p(k | Mk ) by a small number of interpretable hyperparameters. For nested model formulations, which are obtained by setting certain parameters to zero, it is often natural Practical Bayes Model Selection 73 to center the priors of such parameters at zero, further simplifying the specication. A crucial challenge is setting the prior dispersion. It should be large enough to avoid too much prior inuence, but small enough to avoid overly diuse specications that tend to downweight p(Y | Mk ) through (2.2), resulting in too little probability on Mk . For this purpose, we have found it useful to consider subjective inputs and empirical Bayes estimates. 2.3 Extracting Information from the Posterior Once the priors have been chosen, all the needed information for Bayesian inference and decision is implicitly contained in the posterior. In large problems, where exact calculation of (2.1) and (2.2) is not feasible, Markov Chain Monte Carlo (MCMC) can often be used to extract such information by simulating an approximate sample from the posterior. Such samples can be used to estimate posterior characteristics or to explore the posterior, searching for models with high posterior probability. For a model characteristic , MCMC entails simulating a Markov chain, say (1) , (2) , . . ., that is converging to its posterior distribution p( | Y ). Typically, will be an index of the models Mk or an index of the values of (k , Mk ). Simulation of (1) , (2) , . . . requires a starting value (0) and proceeds by successive simulation from a probability transition kernel p( | (j) ), see Meyn and Tweedie (1993). Two of the most useful prescriptions for constructing a kernel that generates a Markov chain converging to a given p( | Y ), are the Gibbs sampler (GS) (Geman and Geman 1984, Gelfand and Smith 1990) and the Metropolis-Hastings (MH) algorithms (Metropolis 1953, Hastings 1970). Introductions to these methods can be found in Casella and George (1992) and Chib and Greenberg (1995). More general treatments that detail precise convergence conditions (essentially irreducibility and aperiodicity) can found in Besag and Green (1993), Smith and Roberts (1993) and Tierney (1994). When Rp , the GS is obtained by successive simulations from the full conditional component distributions p(i | i ), i = 1, . . . , p, where i denotes the most recently updated component values of other than i . Such GS algorithms reduce the problem of simulating from p( | Y ) to a sequence of one-dimensional simulations. MH algorithms work by successive sampling from an essentially arbitrary probability transition kernel q( | (j) ) and imposing a random rejection step at each transition. When the dimension of (j) remains xed, an MH algorithm is dened by: 1. Simulate a candidate from the transition kernel q( | (j) ) 74 2. Set (j+1) = with probability H. Chipman, E. I. George and R. E. McCulloch ( | (j) ) = min 1, Otherwise set (j+1) = (j) , q( (j) | ) p( | Y ) q( | (j) ) p( (j) | Y ) This is a special case of the more elaborate reversible jump MH algorithms (Green 1995) which can be used when dimension of is changing. The general availability of such MH algorithms derives from the fact that p( | Y ) is only needed up to the norming constant for the calculation of above. The are endless possibilities for constructing Markov transition kernels p( | (j) ) that guarantee convergence to p( | Y ). The GS can be applied to dierent groupings and reorderings of the coordinates, and these can be randomly chosen. For MH algorithms, only weak conditions restrict considerations of the choice of q(| (j) ) and can also be considered componentwise. The GS and MH algorithms can be combined and used together in many ways. Recently proposed variations such as tempering, importance sampling, perfect sampling and augmentation oer a promising wealth of further possibilities for sampling the posterior. As with prior specication, the construction of eective transition kernels and how they can be exploited is meaningfully guided by problem specic considerations as will be seen in later sections. Various illustrations of the broad practical potential of MCMC are described in Gilks, Richardson, and Spieglehalter (1996). The use of MCMC to simulate the posterior distribution of a model index is greatly facilitated when rapidly computable closed form expressions for the marginal p(Y | Mk ) in (2.2) are available. In such cases, p(Y | )p() p( | Y ) can be used to implement GS and MH algorithms. Otherwise, one can simulate an index of the values of (k , Mk ) (or at least Mk and the values of parameters that cannot be eliminated analytically). When the dimension of such an index is changing, MCMC implementations for this purpose typically require more delicate design, see Carlin and Chib (1995), Dellaportas, Forster and Ntzoufras (2000), Geweke (1996), Green (1995), Kuo and Mallick (1998) and Phillips and Smith (1996). Because of the computational advantages of having closed form expressions for p(Y |Mk ), it may be preferable to use a computable approximation for p(Y | Mk ) when exact expressions are unavailable. An eective approximation for this purpose, when h(k ) log p(Y | k , Mk )p(k | Mk ) is suciently well-behaved, is obtained by Laplaces method (see Tierney and Kadane 1986) as p(Y | Mk ) (2)dk /2 |H(k )|1/2 p(Y | k , Mk )p(k | Mk ) (2.10) where dk is the dimension of k , k is the maximum of h(k ), namely the posterior mode of p(k |Y, Mk ), and H(k ) is minus the inverse Hessian of h evaluated at k . This is obtained Practical Bayes Model Selection 75 by substituting the Taylor series approximation h(k ) h(k ) 1 (k k ) H(k )(k k ) 2 for h(k ) in p(Mk | Y ) = exp{h(k )}dk . When nding k is costly, further approximation of p(Y | M ) can be obtained by p(Y | Mk ) (2)dk /2 |H (k )|1/2 p(Y | k , Mk )p(k | Mk ) (2.11) where k is the maximum likelihood estimate and H can be H, minus the inverse Hessian of the log likelihood or Fishers information matrix. Going one step further, by ignoring the terms in (2.11) that are constant in large samples, yields the BIC approximation (Schwarz 1978) log p(Y | M ) log p(Y | k , Mk ) (dk /2) log n (2.12) where n is the sample size. This last approximation was successfully implemented for model averaging in a survival analysis problem by Raftery, Madigan and Volinsky (1996). Although it does not explicitly depend on a parameter prior, (2.12) may be considered an implicit approximation to p(Y | M ) under a unit information prior (Kass and Wasserman 1995) or under a normalized Jereys prior (Wasserman 2000). It should be emphasized that the asymptotic justication for these successive approximations, (2.10), (2.11), (2.12), may not be very good in small samples, see for example, McCulloch and Rossi (1991). 3 Bayesian Variable Selection for the Linear Model Suppose Y a variable of interest, and X1 , . . . , Xp a set of potential explanatory variables or predictors, are vectors of n observations. The problem of variable selection, or subset selection as it often called, arises when one wants to model the relationship between Y and a subset of X1 , . . . , Xp , but there is uncertainty about which subset to use. Such a situation is particularly of interest when p is large and X1 , . . . , Xp is thought to contain many redundant or irrelevant variables. The variable selection problem is usually posed as a special case of the model selection problem, where each model under consideration corresponds to a distinct subset of X1 , . . . , Xp . This problem is most familiar in the context of multiple regression where attention is restricted to normal linear models. Many of the fundamental developments in variable selection have occurred in the context of the linear model, in large part because its analytical tractability greatly facilitates insight and computational reduction, and because it provides a simple rst order approximation to more complex relationships. Furthermore, many problems of interest can be posed as linear variable selection problems. For example, for the problem of nonparametric function estimation, the values of the unknown function are represented by Y , and a linear basis such as a wavelet basis or 76 H. Chipman, E. I. George and R. E. McCulloch a spline basis are represented by X1 , . . . , Xp . The problem of nding a parsimonious approximation to the function is then the linear variable selection problem. Finally, when the normality assumption is inappropriate, such as when Y is discrete, solutions for the linear model can be extended to alternatives such as general linear models (McCullagh and Nelder 1989). We now proceed to consider Bayesian approaches to this important linear variable selection problem. Suppose the normal linear model is used to relate Y to the potential predictors X1 , . . . Xp Y Nn (X, 2 I) (3.1) where X = (X1 , . . . Xp ), is a p 1 vector of unknown regression coecients, and 2 is an unknown positive scalar. The variable selection problem arises when there is some unknown subset of the predictors with regression coecients so small that it would be preferable to ignore them. In Sections 3.2 and 3.4, we describe two Bayesian formulations of this problem which are distinguished by their interpretation of how small a regression coecient must be to ignore Xi . It will be convenient throughout to index each of these 2p possible subset choices by the vector = (1 , . . . , p ) , where i = 0 or 1 according to whether i is small or large, respectively. We use q 1 to denote the size of the th subset. Note that here, plays the role of model identier Mk described in Section 2. We will assume throughout this section that X1 , . . . Xp contains no variable that would be included in every possible model. If additional predictors Z1 , . . . , Zr were to be included every model, then we would assume that their linear eect had been removed by replacing Y and X1 , . . . Xp with (I Z(Z Z)1 Z )Y and (I Z(Z Z)1 Z )Xi , i = 1, . . . , p where Z = (Z1 , . . . , Zr ). For example, if an intercept were to be included in every model, then we would assume that Y and X1 , . . . Xp had all been centered to have mean 0. Such reductions are simple and fast, and can be motivated from a formal Bayesian perspective by integrating out the coecients corresponding to Z1 , . . . , Zr with respect to an improper uniform prior. 3.1 Model Space Priors for Variable Selection For the specication of the model space prior, most Bayesian variable selection implementations have used independence priors of the form p() = wi i (1 wi )1i , (3.2) Practical Bayes Model Selection 77 which are easy to specify, substantially reduce computational requirements, and often yield sensible results, see, for example, Clyde, Desimone and Parmigiani (1996), George and McCulloch (1993, 1997), Raftery, Madigan and Hoeting (1997) and Smith and Kohn (1996). Under this prior, each Xi enters the model independently of the other coecients, with probability p(i = 1) = 1 p(i = 0) = wi . Smaller wi can be used to downweight Xi which are costly or of less interest. A useful reduction of (3.2) has been to set wi w, yielding p() = wq (1 w)pq , (3.3) in which case the hyperparameter w is the a priori expected proportion of Xi s in the model. In particular, setting w = 1/2, yields the popular uniform prior p() 1/2p , (3.4) which is often used as a representation of ignorance. However, this prior puts most of its weight near models of size q = p/2 because there are more of them. Increased weight on parsimonious models, for example, could instead be obtained by setting w small. Alternatively, one could put a prior on w. For example, combined with a beta prior w Beta(, ), (3.3) yields p() = B( + q , + p q ) B(, ) (3.5) where B(, ) is the beta function. More generally, one could simply put a prior h(q ) on the model dimension and let p() = p q 1 h(q ), (3.6) of which (3.5) is a special case. Under priors of the form (3.6), the components of are exchangeable but not independent, (except for the special case (3.3)). Independence and exchangeable priors on may be less satisfactory when the models under consideration contain dependent components such as might occur with interactions, polynomials, lagged variables or indicator variables (Chipman 1996). Common practice often rules out certain models from consideration, such as a model with an X1 X2 interaction but no X1 or X2 linear terms. Priors on can encode such preferences. With interactions, the prior for can capture the dependence relation between the importance of a higher order term and those lower order terms from which it was formed. For example, suppose there are three independent main eects A, B, C and three twofactor interactions AB, AC, and BC. The importance of the interactions such as AB will 78 H. Chipman, E. I. George and R. E. McCulloch often depend only on whether the main eects A and B are included in the model. This belief can be expressed by a prior for = (A , . . . , BC ) of the form: p() = p(A )p(B )p(C )p(AB | A , B )p(AC | A , C )p(BC | B , C ). (3.7) The specication of terms like p(AC |A , C ) in (3.7) would entail specifying four probabilities, one for each of the values of (A , C ). Typically p(AC |0, 0) < (p(AC |1, 0), // p(AC | 0, 1)) < p(AC | 1, 1). Similar strategies can be considered to downweight or eliminate models with isolated high order terms in polynomial regressions or isolated high order lagged variables in ARIMA models. With indicators for a categorical predictor, it may be of interest to include either all or none of the indicators, in which case p() = 0 for any violating this condition. The number of possible models using interactions, polynomials, lagged variables or indicator variables grows combinatorially as the number of variables increases. In contrast to independence priors of the form (3.2), priors for dependent component models, such as (3.7), is that they concentrate mass on plausible models, when the number of possible models is huge. This can be crucial in applications such as screening designs, where the number of candidate predictors may exceed the number of observations (Chipman, Hamada, and Wu 1997). Another more subtle shortcoming of independence and exchangeable priors on is their failure to account for similarities and dierences between models due to covariate collinearity or redundancy. An interesting alternative in this regard are priors that dilute probability across neighborhoods of similar models, the so called dilution priors (George 1999). Consider the following simple example. Suppose that only two uncorrelated predictors X1 and X2 are considered, and that they yield the following posterior probabilities: Variables in p( | Y ) X1 0.3 X2 0.4 X1 , X2 0.2 Suppose now a new potential predictor X3 is introduced, and that X3 is very highly correlated with X2 , but not with X1 . If the model prior is elaborated in a sensible way, as is discussed below, the posterior may well look something like Variables in p( | Y ) X1 0.3 X2 0.13 X3 0.13 X1 , X2 0.06 Practical Bayes Model Selection 79 The probability allocated to X1 remains unchanged, whereas the probability allocated to X2 and X1 , X2 has been diluted across all the new models containing X3 . Such dilution seems desirable because it maintains the allocation of posterior probability across neighborhoods of similar models. The introduction of X3 has added proxies for the models containing X2 but not any really new models. The probability of the resulting set of equivalent models should not change, and it is dilution that prevents this from happening. Note that this dilution phenomenon would become much more pronounced when many highly correlated variables are under consideration. The dilution phenomenon is controlled completely by the model space prior p() because p( | Y ) p(Y | )p() and the marginal p(Y | ) is unaected by changes to the model space. Indeed, no dilution of neighborhood probabilities occurs under the uniform prior (3.4) where p(|Y ) p(Y |). Instead the posterior probability of every is reduced while all pairwise posterior odds are maintained. For instance, when X3 is introduced above and a uniform prior is used, the posterior probabilities become something like Variables in p( | Y ) X1 0.15 X2 0.2 X3 0.2 X1 , X2 0.1 If we continued to introduce more proxies for X2 , the probability of the X1 only model could be made arbitrarily small and the overall probability of the X2 like models could be made arbitrarily large, a disturbing feature if Y was strongly related to X1 and unrelated to X2 . Note that any independence prior (3.2), of which (3.4) is a special case, will also fail to maintain probability allocation within neighborhoods of similar models, because the addition of a new Xj reduces all the model probabilities by wj for models in which Xj is included, and by (1 wi ) for models in which Xj is excluded. What are the advantages of dilution priors? Dilution priors avoid placing too little probability on good, but unique, models as a consequence of massing excess probability on large sets of bad, but similar, models. Thus dilution priors are desirable for model averaging over the entire posterior to avoid biasing averages such as (2.8) away from good models. They are also desirable for MCMC sampling of the posterior because such Markov chains gravitate towards regions of high probability. Failure to dilute the probability across clusters of many bad models would bias both model search and model averaging estimates towards those bad models. That said, it should be noted that dilution priors would not be appropriate for pairwise model comparisons because the relative strengths of two models should not depend on whether another is considered. For this purpose, Bayes factors (corresponding to selection under uniform priors) would be preferable. 80 H. Chipman, E. I. George and R. E. McCulloch 3.2 Parameter Priors for Selection of Nonzero i We now consider parameter prior formulations for variable selection where the goal is to ignore only those Xi for which i = 0 in (3.1). In eect, the problem then becomes that of selecting a submodel of (3.1) of the form p(Y | , 2 , ) = Nn (X , 2 I) (3.8) where X is the n x q matrix whose columns correspond to the th subset of X1 , . . . , Xp , is a q 1 vector of unknown regression coecients, and 2 is the unknown residual variance. Here, ( , 2 ) plays the role of the model parameter k described in Section 2. Perhaps the most useful and commonly applied parameter prior form for this setup, especially in large problems, is the normal-inverse-gamma, which consists of a q -dimensional normal prior on p( | 2 , ) = Nq ( , 2 ), (3.9) coupled with an inverse gamma prior on 2 p( 2 | ) = p( 2 ) = IG(/2, /2), (3.10) (which is equivalent to / 2 2 ). For example, see Clyde, DeSimone and Parmigiani (1996), Fernandez, Ley and Steel (2001), Garthwaite and Dickey (1992, 1996), George and McCulloch (1997), Kuo and Mallick (1998), Raftery, Madigan and Hoeting (1997) and Smith and Kohn (1996). Note that the coecient prior (3.9), when coupled with p(), implicitly assigns a point mass at zero for coecients in (3.1) that are not contained in . As such, (3.9) may be thought of as a point-normal prior. It should also be mentioned that in one the rst Bayesian variable selection treatments of the setup (3.8), Mitchell and Beauchamp (1988) proposed spike-and-slab priors. The normal-inversegamma prior above is obtained by simply replacing their uniform slab by a normal distribution. A valuable feature of the prior combination (3.9) and (3.10) is analytical tractability; the conditional distribution of and 2 given is conjugate for (3.8), so that and 2 can be eliminated by routine integration from p(Y, , 2 | ) = p(Y | , 2 , )p( | 2 , ) p( 2 | ) to yield 2 p(Y | ) |X X + 1 |1/2 | |1/2 ( + S )(n+)/2 (3.11) where 2 S = Y Y Y X (X X + 1 )1 X Y. (3.12) As will be seen in subsequent sections, the use of these closed form expressions can substantially speed up posterior evaluation and MCMC exploration. Note that the scale Practical Bayes Model Selection 81 of the prior (3.9) for depends on 2 , and this is needed to obtain conjugacy. Integrating out 2 with respect to (3.10), the prior for conditionally only on is p( | ) = Tq (, , ) (3.13) the q -dimensional multivariate T -distribution centered at with degrees of freedom and scale . The priors (3.9) and (3.10) are determined by the hyperparameters , , , ,, which must be specied for implementations. Although a good deal of progress can be made through subjective elicitation of these hyperparameter values in smaller problems when substantial expert information is available, for example see Garthwaite and Dickey (1996), we focus here on the case where such information is unavailable and the goal is roughly to assign values that minimize prior inuence. Beginning with the choice of and , note that (3.10) corresponds to the likelihood information about 2 provided by independent observations from a N (0, ) distribution. Thus, may be thought of as a prior estimate of 2 and may be thought of as the prior sample size associated with this estimate. By using the data and treating s2 , Y the sample variance of Y , as a rough upper bound for 2 , a simple default strategy is to choose small, say around 3, and near s2 . One might also go a bit further, by Y treating s2 U LL , the traditional unbiased estimate of 2 based on a saturated model, as F a rough lower bound for 2 , and then choosing and so that (3.10) assigns substantial probability to the interval (s2 U LL , s2 ). Similar informal approaches based on the data F Y are considered by Clyde, Desimone and Parmigiani (1996) and Raftery, Madigan and Hoeting (1997). Alternatively, the explicit choice of and can be avoided by using p( 2 | ) 1/ 2 , the limit of (3.10) as 0, a choice recommended by Smith and Kohn (1996) and Fernandez, Ley and Steel (2001). This prior corresponds to the uniform distribution on log 2 , and is invariant to scale changes in Y . Although improper, it yields proper marginals p(Y | ) when combined with (3.9) and so can be used formally. Turning to (3.9), the usual default for the prior mean has been = 0, a neutral choice reecting indierence between positive and negative values. This specication is also consistent with standard Bayesian approaches to testing point null hypotheses, where under the alternative, the prior is typically centered at the point null value. For choosing the prior covariance matrix , the specication is substantially simplied by setting = c V , where c is a scalar and V is a preset form such as V = (X X )1 or V = Iq , the q q identity matrix. Note that under such V , the conditional priors (3.9) provide a consistent description of uncertainty in the sense that they are the conditional distributions of the nonzero components of given when Np (0, c 2 (X X)1 ) or Np (0, c 2 I), respectively. The choice V = (X X )1 serves to replicate the covariance structure of the likelihood, and yields the g-prior recommended by Zellner 82 H. Chipman, E. I. George and R. E. McCulloch (1986). With V = Iq , the components of are conditionally independent, causing (3.9) to weaken the likelihood covariance structure. In contrast to V = (X X )1 , the eect of V = Iq on the posterior depends on the relative scaling of the predictors. In this regard, it may be reasonable to rescale the predictors in units of standard deviation to give them a common scaling, although this may be complicated by the presence of outliers or skewed distributions. Having xed V , the goal is then to choose c large enough so that the prior is relatively at over the region of plausible values of , thereby reducing prior inuence (Edwards, Lindman and Savage 1963). At the same time, however, it is important to avoid excessively large values of c because the prior will eventually put increasing weight on the null model as c , a form of the Bartlett-Lindley paradox, Bartlett (1957). For practical purposes, a rough guide is to choose c so that (3.13) assigns substantial probability to the range of all plausible values for . Raftery, Madigan and Hoeting (1997), who used a combination of V = Iq and V = (X X )1 with standardized predictors, list various desiderata along the lines of this rough guide which lead them to the choice c = 2.852 . They also note that their resulting coecient prior is relatively at over the actual distribution of coecients from a variety of real data sets. Smith and Kohn (1996), who used V = (X X )1 , recommend c = 100 and report that performance was insensitive to values of c between 10 and 10,000. Fernandez, Ley and Steel (2001) perform a simulation evaluation of the eect of various choices for c, with V = (X X )1 , p( 2 | ) 1/ 2 and p() = 2p , on the posterior probability of the true model. Noting how the eect depends on the true model and noise level, they recommend c = max{p2 , n}. 3.3 Calibration and Empirical Bayes Variable Selection An interesting connection between Bayesian and non-Bayesian approaches to variable selection occurs when the special case of (3.9) with = 0 and V = (X X )1 , namely p( | 2 , ) = Nq (0, c 2 (X X )1 ), is combined with p() = wq (1 w)pq (3.15) (3.14) the simple independence prior in (3.3); for the moment, 2 is treated as known. As shown by George and Foster (2000), this prior setup can be calibrated by choices of c and w so that the same maximizes both the model posterior and the canonical penalized sum-of-squares criterion SS / 2 F q (3.16) Practical Bayes Model Selection 83 where SS = X X , (X X )1 X Y and F is a xed penalty. This correspondence may be of interest because a wide variety of popular model selection criteria are obtained by maximizing (3.16) with particular choices of F and with 2 replaced by an estimate 2 . For example F = 2 yields Cp (Mallows 1973) and, approximately, AIC (Akaike 1973), F = log n yields BIC (Schwarz 1978) and F = 2 log p yields RIC (Donoho and Johnstone 1994, Foster and George 1994). The motivation for these choices are varied; Cp is motivated as an unbiased estimate of predictive risk, AIC by an expected information distance, BIC by an asymptotic Bayes factor and RIC by minimax predictive risk ination. The posterior correspondence with (3.16) is obtained by reexpressing the model posterior under (3.14) and (3.15) as p( | Y ) wq (1 w)pq (1 + c)q /2 exp exp where F (c, w) = Y Y SS SS 2 2 2 2 (1 + c) (3.17) c {SS / 2 F (c, w) q } , 2(1 + c) 1+c 1w 2 log + log(1 + c) . c w (3.18) The expression (3.17) reveals that, as a function of for xed Y , p( | Y ) is increasing in (3.16) when F = F (c, w). Thus, both (3.16) and (3.17) are simultaneously maximized by the same when c and w are chosen to satisfy F (c, w) = F . In this case, Bayesian model selection based on p( | Y ) is equivalent to model selection based on the criterion (3.16). This correspondence between seemingly dierent approaches to model selection provides additional insight and interpretability for users of either approach. In particular, when c and w are such that F (c, w) = 2, log n or 2 log p, selecting the highest posterior model (with 2 set equal to 2 ) will be equivalent to selecting the best AIC/Cp , BIC or RIC models, respectively. For example, F (c, w) = 2, log n and 2 log p are obtained when c 3.92, n and p2 respectively, all with w = 1/2. Similar asymptotic connections are pointed out by Fernandez, Ley and Steel (2001) when p( 2 | ) 1/ 2 and w = 1/2. Because the posterior probabilities are monotone in (3.16) when F = F (c, w), this correspondence also reveals that the MCMC methods discussed in Section 3.5 can also be used to search for large values of (3.16) in large problems where global maximization is not computationally feasible. Furthermore, since c and w control the expected size and proportion of the nonzero components of , the dependence of F (c, w) on c and w provides an implicit connection between the penalty F and the prole of models for which its value may be appropriate. Ideally, the prespecied values of c and w in (3.14) and (3.15) will be consistent with 84 H. Chipman, E. I. George and R. E. McCulloch the true underlying model. For example, large c will be chosen when the regression coefcients are large, and small w will be chosen when the proportion of nonzero coecients are small. To avoid the diculties of choosing such c and w when the true model is completely unknown, it may be preferable to treat c and w as unknown parameters, and use empirical Bayes estimates of c and w based on the data. Such estimates can be obtained, at least in principle, as the values of c and w that maximize the marginal likelihood under (3.14) and (3.15), namely L(c, w | Y ) p( | w) p(Y | , c) wq (1 w)pq (1 + c)q /2 exp c SS . 2 2 (1 + c) (3.19) Although this maximization is generally impractical when p is large, the likelihood (3.19) simplies considerably when X is orthogonal, a setup that occurs naturally in nonparametric function estimation with orthogonal bases such as wavelets. In this case, letting 2 ti = bi vi / where vi is the ith diagonal element of X X and bi is the ith component of = (X X)1 X Y , (3.19) reduces to p L(c, w | Y ) i=1 (1 w)eti /2 + w(1 + c)1/2 eti /2(1+c) . 2 2 (3.20) Since many fewer terms are involved in the product in (3.20) than in the sum in (3.19), maximization of (3.20) is feasible by numerical methods even for moderately large p. Replacing 2 by an estimate 2 , the estimators c and w that maximize the marginal likelihood L above can be used as prior inputs for an empirical Bayes analysis under the priors (3.14) and (3.15). In particular, (3.17) reveals that the maximizing the posterior p( | Y, c, w) can be obtained as the that maximizes the marginal maximum likelihood criterion CMML = SS / 2 F (, w) q , c (3.21) where F (c, w) is given by (3.18). Because maximizing (3.19) to obtain c and w can be computationally overwhelming when p is large and X is not orthogonal, one might instead consider a computable empirical Bayes approximation, the conditional maximum likelihood criterion q CCML = SS / 2 q 1 + log+ (SS / 2 q ) 2 log(p q )(pq ) + log q (3.22) where log+ () is the positive part of log(). Selecting the that maximizes CCML provides an approximate empirical Bayes alternative to selection based on CMML . In contrast to criteria of the form (3.16), which penalize SS / 2 by F q , with F constant, CMML uses an adaptive penalty F (, w) that is implicitly based on the estic mated distribution of the regression coecients. CCML also uses an adaptive penalty, Practical Bayes Model Selection 85 but one can be expressed by a rapidly computable closed form that can be shown to act like a combination of a modied BIC penalty F = log n, which gives it same consistency property as BIC, and a modied RIC penalty F = 2 log p. Insofar as maximizing CCML approximates maximizing CMML , these interpretations at least roughly explain the behavior of the CMML penalty F (, w) in (3.21). c George and Foster (2000) proposed the empirical Bayes criteria CMML and CCML and provided simulation evaluations demonstrating substantial performance advantages over the xed penalty criteria (3.16); selection using CMML delivers excellent performance over a much wider portion of the model space, and CCML performs nearly as well. The superiority of empirical Bayes methods was conrmed in context of wavelet regression by Johnstone and Silverman (1998) and Clyde and George (1999). Johnstone and Silverman (1998) demonstrated the superiority of using maximum marginal likelihood estimates of c and w with posterior median selection criteria, and proposed EM algorithms for implementation. Clyde and George (1999) also proposed EM algorithms for implementation, extended the methods to include empirical Bayes estimates of 2 and considered both model selection and model averaging. Finally, a fully Bayes analysis which integrates out c and w with respect to some noninformative prior p(c, w) could be a promising alternative to empirical Bayes estimation of c and w. Indeed, the maximum marginal likelihood estimates c and w correspond to the posterior mode estimates under a Bayes formulation with independent uniform priors on c and w, a natural default choice. As such, the empirical Bayes methods can be considered as approximations to fully Bayes methods, but approximations which do not fully account for the uncertainty surrounding c and w. We are currently investigating the potential of such fully Bayes alternatives and plan to report on them elsewhere. 3.4 Parameter Priors for Selection Based on Practical Signicance A potential drawback of the point-normal prior (3.9) for variable selection is that with enough data, the posterior will favor the inclusion of Xi for any i = 0, no matter how small. Although this might be desirable from a purely predictive standpoint, it can also run counter to the goals of parsimony and interpretability in some problems, where it would be preferable to ignore such negligible i . A similar phenomenon occurs in frequentist hypothesis testing, where for large enough sample sizes, small departures from a point null become statistically signicant even though they are not practically signicant or meaningful. An alternative to the point-normal prior (3.9), which avoids this potential drawback, is the normal-normal formulation used in the stochastic search variable selection (SSVS) 86 H. Chipman, E. I. George and R. E. McCulloch procedure of George and McCulloch (1993, 1996, 1997). This formulation builds in the goal of excluding Xi from the model whenever |i | < i for a given i > 0. The idea is that i is a threshold of practical signicance that is prespecied by the user. A simple choice might be i = Y /Xi , where Y is the size of an insignicant change in Y , and Xi is the size of the maximum feasible change in Xi . To account for the cumulative eect of changes of other Xs in the model, one might prefer the smaller conservative choice i = Y /(pXi ). The practical potential of the SSVS formulation is nicely illustrated by Wakeeld and Bennett (1996). Under the normal-normal formulation of SSVS, the data always follow the saturated model (3.1) so that p(Y | , 2 , ) = Nn (X, 2 I) (3.23) for all . In the general notation of Section 2, the model parameters here are the same for every model, k (, 2 ). The th model is instead distinguished by a coecient prior of the form ( | 2 , ) = ( | ) = Np (0, D RD ) (3.24) where R is a correlation matrix and D is a diagonal matrix with diagonal elements (D )ii = 0i when i = 0 1i when i = 1 (3.25) Under the model space prior p(), the marginal prior distribution of each component of is here p(i ) = (1 p(i ))N (0, 0i ) + p(i )N (0, 1i ), (3.26) a scale mixture of two normal distributions. Although is independent of 2 in (3.24), the inverse Gamma prior (3.10) for 2 is still useful, as are the specication considerations for it discussed in Section 3.2. Furthermore, R (X X)1 and R = I are natural choices for R in (3.24), similarly to the commonly used choices for in (3.9). To use this normal-normal setup for variable selection, the hyperparameters 0i and 1i are set small and large respectively, so that N (0, 0i ) is concentrated and N (0, 1i ) is diuse. The general idea is that when the data support i = 0 over i = 1, then i is probably small enough so that Xi will not be needed in the model. For a given threshold i , higher posterior weighting of those values for which |i | > i when i = 1, can be achieved by choosing 0i and 1i such that p(i | i = 0) = N (0, 0i ) > p(i | i = 1) = N (0, 1i ) precisely on the interval (i , i ). This property is obtained by any 0i and 1i satisfying 1 1 2 log(1i /0i )/(0i 1i ) = i (3.27) Practical Bayes Model Selection 87 By choosing 1i such that N (0, 1i ) is consistent with plausible values of i , 0i can then be chosen according to (3.27). George and McCulloch (1997) report that computational problems and diculties with 1i too large will be avoided whenever 1i /0i 10, 000, thus allowing for a wide variety of settings. Under the normal-normal setup above, the joint distribution of and 2 given is not conjugate for (3.1) because (3.24) excludes 2 . This prevents analytical reduction of the full posterior p(, 2 , | Y ), which can severely increase the cost of posterior computations. To avoid this, one can instead consider the conjugate normal-normal formulation using p( | 2 , ) = Np (0, 2 D RD ), (3.28) which is identical to (3.24) except for the insertion of 2 . Coupled with the inverse Gamma prior (3.10) for 2 , the conditional distribution of and 2 given is conjugate. This allows for the analytical margining out of and 2 from p(Y, , 2 | ) = p(Y | , 2 )p( | 2 , )p( 2 | ) to yield 2 p(Y | ) |X X + (D RD )1 |1/2 |D RD |1/2 ( + S )(n+)/2 (3.29) where 2 S = Y Y Y X(X X + (D RD )1 )1 X Y. (3.30) As will be seen in Section 3.5, this simplication confers strong advantages for posterior calculation and exploration. Under (3.28), (3.10), and a model space prior p(), the marginal distribution each component of is p(i | ) = (1 i )T1 (, 0, 0i ) + i T1 (, 0, 1i ), (3.31) a scale mixture of t-distributions, in contrast to the normal mixture (3.26). As with the nonconjugate prior, the idea is that 0i and 1i are to be set small and large respectively, so that when the data supports i = 0 over i = 1, then i is probably small enough so that Xi will not be needed in the model. However, the way in which 0i and 1i determine small and large is aected by the unknown value of 2 , thereby making specication more dicult and less reliable than in the nonconjugate formulation. For a chosen threshold of practical signicance i , the pdf p(i | i, i = 0) = T (, 0, 0i ) is larger than the pdf p(i | i, i = 1) = T (, 0, 1i ) precisely on the interval (i , i ), when 0i and 1i satisfy 2 2 (0i /1i )/(+1) = [0i + i /()]/[1i + i /()] (3.32) By choosing 1i such that T (, 0, 1i ) is consistent with plausible values of i , 0i can then be chosen according to (3.32). 88 H. Chipman, E. I. George and R. E. McCulloch Another potentially valuable specication of the conjugate normal-normal formulation can be used to address the problem of outlier detection, which can be framed as a variable selection problem by including indicator variables for the observations as poten tial predictors. For such indicator variables, the choice 0i = 1 and 1i = K > 0 yields the well-known additive outlier formulation, see, for example, Petit and Smith (1985). Furthermore, when used in combination with the previous settings for ordinary predictors, the conjugate prior provides a hierarchical formulation for simultaneous variable selection and outlier detection. This has also been considered by Smith and Kohn (1996). A related treatment has been considered by Hoeting, Raftery and Madigan (1996). 3.5 3.5.1 Posterior Calculation and Exploration for Variable Selection Closed Form Expressions for p(Y | ) A valuable feature of the previous conjugate prior formulations is that they allow for analytical margining out of and 2 from p(Y, , 2 | ) to yield the closed form expressions in (3.11) and (3.29) which are proportional to p(Y | ). Thus, when the model prior p() is computable, this can be used to obtain a computable, closed form expression g() satisfying g() p(Y | )p() p( | Y ). (3.33) The availability of such g() can greatly facilitate posterior calculation and estimation. Furthermore, it turns out that for certain formulations, the value of g() can be rapidly updated as is changed by a single component. As will be seen, such rapid updating schemes can be used to speed up algorithms for evaluating and exploring the posterior p( | Y ). Consider rst the conjugate point-normal formulation (3.9) and (3.10) for which p(Y | ) proportional to (3.11) can be obtained. When = c (X X )1 , a function g() satisfying (3.33) can be expressed as g() = (1 + c)q /2 ( + Y Y (1 + 1/c)1 W W )(n+)/2 p() (3.34) where W = T 1 X Y for upper triangular T such that T T = X X for (obtainable by the Cholesky decomposition). As noted by Smith and Kohn (1996), the algorithm of Dongarra, Moler, Bunch and Stewart (1979) provides fast updating of T , and hence W 2 and g(), when is changed one component at a time. This algorithm requires O(q ) operations per update, where is the changed value. Now consider the conjugate normal-normal formulation (3.28) and (3.10) for which p(Y | ) proportional to (3.29) can be obtained. When R = I holds, a function g() Practical Bayes Model Selection satisfying (3.33) can be expressed as p 89 g() = ( i=1 2 Tii [(1 i )0(i) + i 1(i) ])1/2 ( + Y Y W W )(n+)/2 p() (3.35) where W = T 1 X Y for upper triangular T such that T T = X X (obtainable by the Cholesky decomposition). As noted by George and McCulloch (1997), the Chambers (1971) algorithm provides fast updating of T , and hence W and g(), when is changed one component at a time. This algorithm requires O(p2 ) operations per update. The availability of these computable, closed form expressions for g() p( | Y ) enables exhaustive calculation of p( | Y ) in moderately sized problems. In general, this simply entails calculating g() for every value and then summing over all values to obtain the normalization constant. However, when one of the above fast updating schemes can be used, this calculation can be substantially speeded up by sequential calculation of the 2p g() values where consecutive dier by just one component. Such an ordering is provided by the Gray Code, George and McCulloch (1997). After computing T , W and g() for an initial value, subsequent values of T , W and g() can be obtained with the appropriate fast updating scheme by proceeding in the Gray Code order. Using this approach, this exhaustive calculation is feasible for p less than about 25. 3.5.2 MCMC Methods for Variable Selection MCMC methods have become a principal tool for posterior evaluation and exploration in Bayesian variable selection problems. Such methods are used to simulate a sequence (1) , (2) , . . . (3.36) that converges (in distribution) to p(|Y ). In formulations where analytical simplication of p(, 2 , | Y ) is unavailable, (3.36) can be obtained as a subsequence of a simulated Markov chain of the form (1) , (1) , (1) , (2) , (2) , (2) , . . . (3.37) that converges to p(, 2 , | Y ). However, in conjugate formulations where and 2 can be analytically eliminated form the posterior, the availability of g() p( | Y ) allows for the exible construction of MCMC algorithms that simulate (3.36) directly as a Markov chain. Such chains are often more useful, in terms of both computational and convergence speed. In problems where the number of potential predictors p is very small, and g() p( | Y ) is unavailable, the sequence (3.36) may be used to evaluate the entire posterior p( | Y ). Indeed, empirical frequencies and other functions of the values will be 90 H. Chipman, E. I. George and R. E. McCulloch consistent estimates of their values under p( | Y ). In large problems where exhaustive calculation of all 2p values of p( | Y ) is not feasible, the sequence (3.36) may still provide useful information. Even when the length of the sequence (3.36) is much smaller than 2p , it may be possible to identify at least some of the high probability , since those are expected to appear more frequently. In this sense, these MCMC methods can be used to stochastically search for high probability models. In the next two subsections, we describe various MCMC algorithms which may be useful for simulating (3.36). These algorithms are obtained as variants of the Gibbs sampler (GS) and Metropolis-Hastings (MH) algorithms described in Section 2.3. 3.5.3 Gibbs Sampling Algorithms Under the nonconjugate normal-normal formulation (3.24) and (3.10) for SSVS, the posterior p(, 2 , | Y ) is p-dimensional for all . Thus, a simple GS that simulates the full parameter sequence (3.37) is obtained by successive simulation from the full conditionals p( | 2 , , Y ) p( 2 | , , Y ) = p( 2 | , Y ) p(i | , 2 , (i) , Y ) = p(i | , (i) ), i = 1, . . . , p where at each step, these distributions are conditioned on the most recently generated parameter values. These conditionals are standard distributions which can be simulated quickly and eciently by routine methods. For conjugate formulations where g() is available, a variety of MCMC algorithms for generating (3.36) directly as a Markov chain, can be conveniently obtained by applying the GS with g(). The simplest such implementation is obtained by generating each value componentwise from the full conditionals, i | (i) , Y i = 1, 2, . . . , p, (3.39) (3.38) ((i) = (1 , 2 , . . . , i1 , i+1 , . . . , p )) where the i may be drawn in any xed or random order. By margining out and 2 in advance, the sequence (3.36) obtained by this algorithm should converge faster than the nonconjugate Gibbs approach, rendering it more eective on a per iteration basis for learning about p( | Y ), see Liu, Wong and Kong (1994). The generation of the components in (3.39) in conjunction with g() can be obtained trivially as a sequence of Bernoulli draws. Furthermore, if g() allows for fast updating as in (3.34) or (3.35), the required sequence of Bernoulli probabilities can be computed Practical Bayes Model Selection 91 faster and more eciently. To see this, note that the Bernoulli probabilities are simple functions of the ratio p(i = 1, (i) | Y ) g(i = 1, (i) ) = . (3.40) p(i = 0, (i) | Y ) g(i = 0, (i) ) At each step of the iterative simulation from (3.39), one of the values of g() in (3.40) will be available from the previous component simulation. Since has been varied by only a single component, the other value of g() can then be obtained by using the appropriate updating scheme. Under the point-normal prior (3.9) with = c (X X )1 , the fast 2 updating of (3.34) requires O(q ) operations, whereas under the conjugate normal-normal prior formulation (3.28) with R = I fast updating of (3.35) requires O(p2 ) operations. Thus, GS algorithms in the former case can be substantially faster when p( | Y ) is concentrated on those for which q is small, namely the parsimonious models. This advantage could be pronounced in large problems with many useless predictors. Simple variants of the componentwise GS can be obtained by generating the components in a dierent xed or random order. Note that in any such generation, it is not necessary to generate each and every component once before repeating a coordinate. Another variant of the GS can be obtained by drawing the components of in groups, rather than one at a time. Let {Ik }, k = 1, 2, . . . , m be a partition of {1, 2, . . . , p} so that, Ik {1, 2, . . . , p}, Ik = {1, 2, . . . , p} and Ik1 Ik2 = for k1 = k2 . Let Ik = {i |i Ik } and (Ik ) = {i | i Ik }. The grouped GS generates (3.36) by iterative simulation from / Ik | (Ik ) , Y k = 1, 2, . . . , m. (3.41) Fast updating of g(), when available, can also be used to speed up this simulation by computing the conditional probabilities of each Ik in Gray Code order. The potential advantage of such a grouped GS is improved convergence of (3.36). This might be achieved by choosing the partition so that strongly correlated i are contained in the same Ik , thereby reducing the dependence between draws in the simulation. Intuitively, clusters of such correlated i should correspond to clusters of correlated Xi which, in practice, might be identied by clustering procedures. As before, variants of the grouped GS can be obtained by generating the Ik in a dierent xed or random order. 3.5.4 Metropolis-Hastings Algorithms The availability of g() p(|Y ) also facilitates the use of MH algorithms for direct simulation of (3.36). By restricting attention to the set of values, a discrete space, the simple MH form described in Section 2.3 can be used. Because g()/g( ) = p( | Y )/p( | Y ), such MH algorithms are here of the form: 1. Simulate a candidate from a transition kernel q( | (j) ). 92 2. Set (j+1) = with probability H. Chipman, E. I. George and R. E. McCulloch ( | (j) ) = min Otherwise, (j+1) = (j) . q( (j) | ) g( ) ,1 . q( | (j) ) g( (j) ) (3.42) When available, fast updating schemes for g() can be exploited. Just as for the Gibbs sampler, the MH algorthims under the point-normal formulations (3.9) with = c (X X )1 will be the fastest scheme when p( | Y ) is concentrated on those for which q is small. A special class of MH algorithms, the Metropolis algorithms, are obtained from the class of transition kernels q( 1 | 0 ) which are symmetric in 1 and 0 . For this class, the form of (3.42) simplies to M ( | (j) ) = min g( ) ,1 . g( (j) ) (3.43) Perhaps the simplest symmetric transition kernel is p q( | ) = 1/p if 1 1 0 0 1 |i i | = 1. (3.44) This yields the Metropolis algorithm 1. Simulate a candidate by randomly changing one component of (j) . 2. Set (j+1) = with probability M ( | (j) ). Otherwise, (j+1) = (j) . This algorithm was proposed in a related model selection context by Madigan and York (1995) who called it MC3 . It was used by Raftery, Madigan and Hoeting (1997) for model averaging, and was proposed for the SSVS prior formulation by Clyde and Parmigiani (1994). The transition kernel (3.44) is a special case of the class of symmetric transition kernels of the form p 1 q( | ) = qd if 1 0 0 1 |i i | = d. (3.45) Such transition kernels yield Metropolis algorithms of the form 1. Simulate a candidate by randomly changing d components of (j) with probability qd . 2. Set (j+1) = with probability M ( | (j) ). Otherwise, (j+1) = (j) . Practical Bayes Model Selection 93 Here qd is the probability that will have d new components. By allocating some weight to qd for larger d, the resulting algorithm will occasionally make big jumps to dierent values. In contrast to the algorithm obtained by (3.44) which only moves locally, such algorithms require more computation per iteration. Finally, it may also be of interest to consider asymmetric transition kernels such as p q( 1 | 0 ) = qd if 1 0 1 (i i ) = d. (3.46) Here qd is the probability of generating a candidate value which corresponds to a model with d more variables (j) . When d < 0, will represent a more parsimonious model than (j) . By suitable weighting of the qd probabilities, such Metropolis-Hastings algorithms can be made to explore the posterior in the region of more parsimonious models. 3.5.5 Extracting Information from the Output In nonconjugate setups, where g() is unavailable, inference about posterior characteristics based on (3.36) ultimately rely on the empirical frequency estimates the visited values. Although such estimates of posterior characteristics will be consistent, they may be unreliable, especially if the size of the simulated sample is small in comparison to 2p or if there is substantial dependence between draws. The use of empirical frequencies to identify high probability values for selection can be similarly problematic. However, the situation changes dramatically in conjugate setups where g() p(|Y ) is available. To begin with, g() provides the relative probability of any two values 0 and 1 via g( 0 ) / g( 1 ) and so can be used to denitively identify the higher probability in the sequence (3.36) of simulated values. Only minimal additional eort is required to obtain such calculations since g() must be calculated for each of the visited values in the execution of the MCMC algorithms described in Sections 3.5.3 and 3.5.4. The availability of g() can also be used to obtain estimators of the normalizing constant C p( | Y ) = C g() (3.47) based on the MCMC output (3.36) , say (1) , . . . , (K) . Let A be a preselected subset of values and let g(A) = A g() so that p(A | Y ) = C g(A). Then, a consistent estimator of C is K 1 C= IA ( (k) ) (3.48) g(A)K k=1 where IA ( ) is the indicator of the set A, George and McCulloch (1997). Note that if | (3.36) were an uncorrelated sequence, then Var(C) = (C 2 /K) 1p(A Y Y ) suggesting that p(A | ) 94 H. Chipman, E. I. George and R. E. McCulloch the variance of (3.48) is decreasing as p(A | Y ) increases. It is also desirable to choose A such that IA ( (k) ) can be easily evaluated. George and McCulloch (1997) obtain very good results by setting A to be those values visited by a preliminary simulation of (3.36). Peng (1998) has extended and generalized these ideas to obtain estimators of C that improve on (3.48). Inserting C into (3.47) yields improved estimates of the probability of individual values p( | Y ) = C g(), (3.49) as well as an estimate of the total visited probability p(B | Y ) = C g(B), (3.50) where B is the set of visited values. Such p(B | Y ) can provide valuable information about when to stop a MCMC simulation. Since p( | Y )/p( | Y ) C/C, the uniform accuracy of the probability estimates (3.49) is |(C/C) 1|. This quantity is also the total probability discrepancy since |C C| g() = |(C/C) 1|. (3.51) |( | Y ) p( | Y )| = p The simulated values (3.36) can also play an important role in model averaging. For example, suppose one wanted to predict a quantity of interest by the posterior mean E( | Y ) = all E( | , Y )p( | Y ). (3.52) When p is too large for exhaustive enumeration and p( | Y ) cannot be computed, (3.52) is unavailable and is typically approximated by something of the form E( | Y ) = S E( | , Y )( | Y, S) p (3.53) where S is a manageable subset of models and p( | Y, S) is a probability distribution over S. (In some cases, E( | , Y ) will also be approximated). Using the Markov chain sample for S, a natural choice for (3.53) is Ef ( | Y ) = S E( | , Y )f ( | Y, S) p (3.54) where pf ( | Y, S) is the relative frequency of in S, George (1999). Indeed, (3.54) will be a consistent estimator of E( | Y ). However, here too, it appears that when g() is available, one can do better by using Eg ( | Y ) = S E( | , Y )g ( | Y, S) p (3.55) Practical Bayes Model Selection where pg ( | Y, S) = g()/g(S) 95 (3.56) is just the renormalized value of g(). For example, when S is an iid sample from p( |Y ), (3.55) increasingly approximates the best unbiased estimator of E( | Y ) as the sample size increases. To see this, note that when S is an iid sample, Ef ( | Y ) is unbiased for E( | Y ). Since S (together with g) is sucient, the Rao-Blackwellized estimator E(Ef ( | Y ) | S) is best unbiased. But as the sample size increases, E(Ef ( | Y ) | S) Eg ( | Y ). 4 Bayesian CART Model Selection For our second illustration of Bayesian model selection implementations, we consider the problem of selecting a classication and regression tree (CART) model for the relationship between a variable y and a vector of potential predictors x = (x1 , . . . , xp ). An alternative to linear regression, CART models provide a more exible specication of the conditional distribution of y given x. This specication consists of a partition of the x space, and a set of distinct distributions for y within the subsets of the partition. The partition is accomplished by a binary tree T that recursively partitions the x space with internal node splitting rules of the form {x A} or {x A}. By moving from the root / node through to the terminal nodes, each observation is assigned to a terminal node of T which then associates the observation with a distribution for y. Although any distribution may be considered for the terminal node distributions, it is convenient to specify these as members of a single parametric family p(y|) and to assume all observations of y are conditionally independent given the parameter values. In this case, a CART model is identied by the tree T and the parameter values = (1 , . . . , b ) of the distributions at each of the b terminal nodes of T . Note that T here plays the role of Mk of model identier as described in Section 2. The model is called a regression tree model or a classication tree model according to whether y is quantitative or qualitative, respectively. For regression trees, two simple and useful specications for the terminal node distributions are the mean shift normal model p(y | i ) = N (i , 2 ), i = 1, . . . , b, where i = (i , ), and the mean-variance shift normal model 2 p(y | i ) = N (i , i ), i = 1, . . . , b, (4.1) (4.2) where i = (i , i ). For classication trees where y belongs to one of K categories, say 96 H. Chipman, E. I. George and R. E. McCulloch C1 , . . . , CK , a natural choice for terminal node distributions are the simple multinomials K p(y | i ) = k=1 pik I(yCk ) i = 1, . . . , b, (4.3) where i = pi (pi1 , . . . , piK ), pik 0 and terminal node of T . k pik = 1. Here p(y Ck ) = pik at the ith As illustration, Figure 1 depicts a regression tree model where y N (, 22 ) and x = (x1 , x2 ). x1 is a quantitative predictor taking values in [0,10], and x2 is a qualitative predictor with categories {A, B, C, D}. The binary tree has 9 nodes of which b = 5 are terminal nodes that partition the x space into 5 subsets. The splitting rules are displayed at each internal node. For example, the leftmost terminal node corresponds to x1 3.0 and x2 {C, D}. The i value identifying the mean of y given x is displayed at each terminal node. Note that in contrast to a linear model, i decreases in x1 when x2 {A, B}, but increases in x1 when x2 {C, D}. The two basic components of the Bayesian approach to CART model selection are prior specication and posterior exploration. Prior specication over CART models entails putting a prior on the tree space and priors on the parameters of the terminal node distributions. The CART model likelihoods are then used to update the prior to yield a posterior distribution that can be used for model selection. Although straightforward in principle, practical implementations require subtle and delicate attention to details. Prior formulation must be interpretable and computationally manageable. Hyperparameter specication can be usefully guided by overall location and scale measures of the data. A feature of this approach is that the prior specication can be used to downweight undesirable model characteristics such as tree complexity or to express a preference for certain predictor variables. Although the entire posterior cannot be computed in nontrivial problems, posterior guided MH algorithms can still be used to search for good tree models. However, the algorithms require repeated restarting or other modications because of the multimodal nature of the posterior. As the search proceeds, selection based on marginal likelihood rather than posterior probability is preferable because of the dilution properties of the prior. Alternatively, a posterior weighted average of the visited models can be easily obtained. CART modelling was popularized in the statistical community by the seminal book of Breiman, Friedman, Olshen and Stone (1984). Earlier work by Kass (1980) and Hawkins and Kass (1982) developed tree models in a statistical framework. There has also been substantial research on trees in the eld of machine learning, for example the C4.5 algorithm and its predecessor, ID3 (Quinlan 1986, 1993). Here, we focus on the method of Breiman et al. (1984), which proposed a nonparametric approach for tree selection based on a greedy algorithm named CART. A concise description of this approach, which Practical Bayes Model Selection 97 X2 {C, D} X2 {A, B} X1 3 X1 > 3 X1 5 X1 > 5 1 = 1 X1 7 X1 > 7 4 = 8 5 = 2 2 = 5 3 = 8 Figure 1: A regression tree where y N (, 22 ) and x = (x1 , x2 ). 98 H. Chipman, E. I. George and R. E. McCulloch seeks to partition the x space into regions where the distribution of y is homogeneous, and its implementation in S appears in Clark and Pregibon (1992). Bayesian approaches to CART are enabled by elaborating the CART tree formulation to include parametric terminal node distributions, eectively turning it into a statistical model and providing a likelihood. Conventional greedy search algorithms are also replaced by the MCMC algorithms that provide a broader search over the tree model space. The Bayesian CART model selection implementations described here were proposed by Chipman, George and McCulloch (1998) and Denison, Mallick and Smith (1998a), hereafter referred to as CGM and DMS, respectively. An earlier Bayesian approach to classication tree modelling was proposed by Buntine (1992) which, compared to CGM and DMS, uses similar priors for terminal node distributions, but dierent priors on the space of trees, and deterministic, rather than stochastic, algorithms for model search. Priors for tree models based on Minimum Encoding ideas were proposed by Quinlan and Rivest (1989) and Wallace and Patrick (1993). Oliver and Hand (1995) discuss and provide an empirical comparison of a variety of pruning and Bayesian model averaging approaches based on CART. Paass and Kindermann (1997) applied a simpler version of the CGM approach and obtained results which uniformly dominated a wide variety of competing methods. Other alternatives to greedy search methods include Sutton (1991) and Lutsko and Kuijpers (1994) who use simulated annealing, Jordan and Jacobs (1994) who use the EM algorithm, Breiman (1996), who averages trees based on bootstrap samples, and Tibshirani and Knight (1999) who select trees based on bootstrap samples. 4.1 Prior Formulations for Bayesian CART Since a CART model is identied by (, T ), a Bayesian analysis of the problem proceeds by specifying priors on the parameters of the terminal node distributions of each tree p( | T ) and a prior distribution p(T ) over the set of trees. Because the prior for T does not depend on the form of the terminal node distributions, p(T ) can be generally considered for both regression trees and classication trees. 4.1.1 Tree Prior Specication A tree T partitions the x space and consists of both the binary tree structure and the set of splitting rules associated with the internal nodes. A general formulation approach for p(T ) proposed by CGM, is to specify p(T ) implicitly by the following tree-generating stochastic process which grows trees from a single root tree by randomly splitting terminal nodes: 1. Begin by setting T to be the trivial tree consisting of a single root (and terminal) Practical Bayes Model Selection node denoted . 99 2. Split the terminal node with probability p = (1 + d ) where d is the depth of the node , and (0, 1) and 0 are prechosen control parameters. 3. If the node splits, randomly assign it a splitting rule as follows: First choose xi uniformly from the set of available predictors. If xi is quantitative, assign a splitting rule of the form {xi s} vs {xi > s} where s is chosen uniformly from the available observed values of xi . If xi is qualitative, assign a splitting rule of the form {xi C} vs {xi C} where C is chosen uniformly from the set of subsets / of available categories of xi . Next assign left and right children nodes to the split node, and apply steps 2 and 3 to the newly created tree with equal to the new left and the right children (if nontrivial splitting rules are available). By available in step 3, we mean those predictors, split values and category subsets that would not lead to empty terminal nodes. For example, if a binary predictor was used in a splitting rule, it would no longer be available for splitting rules at nodes below it. Each realization of such a process can simply be considered as a random draw from p(T ). Furthermore, this specication allows for straightforward evaluation of p(T ) for any T , and can be eectively coupled with the MH search algorithms described in Section 4.2.1. Although other useful forms can easily be considered for the splitting probability in step 2 above, the choice of p = (1+d ) is simple, interpretable, easy to compute and dependent only on the depth d of the node . The parameters and control the size and shape of the binary tree produced by the process. To see how, consider the simple specication, p < 1 when = 0. In this case the probability of any particular binary tree with b terminal nodes (ignoring the constraints of splitting rule assignments in step 3) is just b1 (1 )b , a natural generalization of the geometric distribution. (A binary tree with b terminal nodes will always have exactly (b 1) internal nodes). Setting small will tend to yield smaller trees and is a simple convenient way to control the size of trees generated by growing process. The choice of = 0 above assigns equal probability to all binary trees with b terminal nodes regardless of their shape. Indeed any prior that is only a function of b will do this; for example, DMS recommend this with a truncated Poisson distribution on b. However, for increasing > 0, p is a decreasing function of d making deeper nodes less likely to split. The resulting prior p(T ) puts higher probability on bushy trees, those whose terminal nodes do not vary too much in depth. Choosing and in practice can guided by looking at the implicit marginal distributions of characteristics such as b. Such marginals can be easily simulated and graphed. 100 H. Chipman, E. I. George and R. E. McCulloch Turning to the splitting rule assignments, step 3 of the tree growing process represents the prior information that at each node, available predictors are equally likely to be eective, and that for each predictor, available split values or category subsets are equally likely to be eective. This specication is invariant to monotone transformations of the quantitative predictors, and is uniform on the observed quantiles of a quantitative predictor with no repeated values. However, it is not uniform over all possible splitting rules because it assigns lower probability to splitting rules based on predictors with more potential split values or category subsets. This feature is necessary to maintain equal probability on predictor choices, and essentially yields the dilution property discussed in Sections 2.2 and 3.1. Predictors with more potential split values will give rise to more trees. By downweighting the splitting rules of such predictors, p(T ) serves to dilute probability within neighborhoods of similar trees. Although the uniform choices for p(T ) above seem to be reasonable defaults, nonuniform choices may also be of interest. For example, it may be preferable to place higher probability on predictors that are thought to be more important. A preference for models with fewer variables could be expressed by putting greater mass on variables already assigned to ancestral nodes. For the choice of split value, tapered distribution at the extremes would increase the tendency to split more towards the interior range of a variable. One might also consider the distribution of split values to be uniform on the available range of the predictor and so could weight the available observed values accordingly. For the choice of category subset, one might put extra weight on subsets thought to be more important. As a practical matter, note that all of the choices above consider only the observed predictor values as possible split points. This induces a discrete distribution on the set of splitting rules, and hence the support of p(T ) will be a nite set of in trees any application. This is not really a restriction since it allows for all possible partitions of any given data set. The alternative of putting a continuous distribution on the range of the predictors would needlessly increase the computational requirements of posterior search while providing no gain in generality. Finally, we note that the dependence of p(T ) on the observed x values is typical of default prior formulations, as was the case for some of the coecient prior covariance choices discussed in Sections 3.2 and 3.4. 4.1.2 Parameter Prior Specications As discussed in Section 2.3, the computational burden of posterior calculation and exploration is substantially reduced when the marginal likelihood, here p(Y | T ), can be obtained in closed form. Because of the large size of the space of CART models, this computational consideration is key in choosing the prior p( | T ) for the parameters of Practical Bayes Model Selection 101 the terminal node distributions. For this purpose, we recommend the conjugate prior forms below for the parameters of the models (4.1)-(4.3). For each of these priors, can be analytically margined out via (2.2), namely p(Y | T ) = p(Y | , T )p( | T )d, (4.4) where Y here denotes the observed values of y. For regression trees with the mean-shift normal model (4.1), perhaps the simplest prior specication for p(|T ) is the standard normal-inverse-gamma form where 1 , . . . , b are iid given and T with p(i | , T ) = N (, 2 /a) and p( 2 | T ) = p( 2 ) = IG(/2, /2). Under this prior, standard analytical simplication yields p(Y | T ) c ab/2 b 1/2 i=1 (ni + a) b (n+)/2 (4.5) (4.6) (si + ti ) + i=1 (4.7) where c is a constant which does not depend on T , si is (ni 1) times the sample variance ia of the ith terminal node Y values, ti = nn+a (i )2 , and yi is the sample mean of the y i ith terminal node Y values. In practice, the observed Y can be used to guide the choice of hyperparameter values for (, , , a). Considerations similar to those discussed for Bayesian variable selection in Section 3.2 are also useful here. To begin with, because the mean-shift model attempts to explain the variation of Y , it is reasonable to expect that 2 will be smaller than s2 , Y 2 will be larger than the sample variance of Y . Similarly, it is reasonable to expect that a pooled variance estimate obtained from a deliberate overtting of the data by a greedy algorithm, say s2 . Using these values as guides, and would then be chosen so that G 2 assigns substantial probability to the interval (s2 , s2 ). Once and the prior for GY have been chosen, and a would be selected so that the prior for is spread out over the range of Y values. For the more exible mean-variance shift model (4.2) where i can also vary across the terminal nodes, the normal-inverse-gamma form is easily extended to p(i | i , T ) = N (, i /a) 2 and 2 2 p(i | T ) = p(i ) = IG(/2, /2), (4.8) (4.9) 102 H. Chipman, E. I. George and R. E. McCulloch with the pairs (1 , 1 ), . . . , (b , b ) independently distributed given T . Under this prior, analytical simplication is still straightforward, and yields b p(Y | T ) i=1 ni /2 () /2 a ((ni + )/2) (si + ti + )(ni +)/2 ni + a (/2) (4.10) where si and ti are as above. Interestingly, the MCMC computations discussed in the next section are facilitated by the factorization of this marginal likelihood across nodes, in contrast to the marginal likelihood (4.7) for the equal variance model. Here too, the observed Y can be used to guide the choice of hyperparameter values for (, , , a). The same ideas above may be used with an additional consideration. In some cases, the mean-variance shift model may explain variance shifts much more so than mean shifts. To handle this possibility, it may be better to choose and so that 2 Y is more toward the center rather than the right tail of the prior for 2 . We might also tighten up our prior for about the average y value. In any case, it can be useful to explore the consequences of several dierent prior choices. For classication trees with the simple multinomial model (4.3), a useful conjugate prior specication for = (p1 , . . . , pb ) is the standard Dirichlet distribution of dimension K 1 with parameter = (1 , . . . , K ), k > 0, where p1 , . . . , pb are iid given T with p(pi | T ) = Dirichlet(pi | ) pi11 1 pK 1 . iK (4.11) When K = 2 this reduces to the familiar Beta prior. Under this prior, standard analytical simplication yields p(Y | T ) ( k ) k (k ) k bb i=1 (nik + k ) (ni + k k ) k (4.12) where nik is the number of ith terminal node Y values in category Ck , ni = k nik and k = 1, ..., K over the sums and products above. For a given tree, p(Y | T ) will be larger when the Y values within each node are more homogeneous. To see this, note that assignments for which the Y values at the same node are similar will lead to more disparate values of ni1 , . . . , niK , which in turn will lead to larger values of p(Y | T ). The natural default choice for is the vector (1, . . . , 1) for which the Dirichlet prior (4.11) is the uniform. However, by setting certain k to be larger for certain categories, p(Y | T ) will become more sensitive to misclassication at those categories. This would be desirable when classication into those categories is most important. One detail of analytical simplications yielding integrated likelihoods (4.7), (4.10) or (4.12) merits attention. Independence of parameters across terminal nodes means that integration can be carried out separately for each node. Normalizing constants Practical Bayes Model Selection 103 in integrals for each node that would usually be discarded (for example ab/2 in (4.7)) need to be kept, since the number of terminal nodes, b, varies across trees. This means that these normalizing constants will be exponentiated to a dierent power for trees of dierent size. All the prior specications above assume that given the tree T , the parameters in the terminal nodes are independent. When terminal nodes share many common parents, it may be desirable to introduce dependence between their i values. Chipman, George, and McCulloch (2000) introduce such a dependence for the regression tree model, resulting in a Bayesian analogue of the tree shrinkage methods considered by Hastie and Pregibon (1990) and Leblanc and Tibshirani (1998). 4.2 Stochastic Search of the CART Model Posterior Combining any of the closed form expressions (4.7), (4.10) or (4.12) for p(Y | T ) with p(T ) yields a closed form expression g(T ) satisfying g(T ) p(Y | T )p(T ) p(T | Y ). (4.13) Analogous to benets of the availability g() in (3.33) for Bayesian variable selection, the availability of g(T ) confers great advantages for posterior computation and exploration in Bayesian CART model selection. Exhaustive evaluation of g(T ) over all T will not be feasible, except in trivially small problems, because of the sheer number of possible trees. This not only prevents exact calculation of the norming constant, but also makes it nearly impossible to determine exactly which trees have largest posterior probability. In spite of these limitations, MH algorithms can still be used to explore the posterior. Such algorithms simulate a Markov chain sequence of trees T 0, T 1, T 2, . . . (4.14) which are converging in distribution to the posterior p(T | Y ). Because such a simulated sequence will tend to gravitate towards regions of higher posterior probability, the simulation can be used to stochastically search for high posterior probability trees. We now proceed to describe the details of such algorithms and their eective implementation. 4.2.1 Metropolis-Hastings Search Algorithms By restricting attention to a nite set of trees, as discussed in the last paragraph of Section 4.1.1, the simple MH form described in Section 2.3 can be used for direct simulation of the Markov chain (4.14). Because g(T )/g(T ) = p(T | Y )/p(T | Y ), such MH 104 H. Chipman, E. I. George and R. E. McCulloch algorithms are obtained as follows. Starting with an initial tree T 0 , iteratively simulate the transitions from T j to T j+1 by the two steps: 1. Simulate a candidate T from the transition kernel q(T | T j ). 2. Set T j+1 = T with probability (T | T j ) = min Otherwise, set T j+1 = T j . The key to making this an eective MH algorithm is the choice of transition kernel q(T | T j ). A useful strategy in this regard is to construct q(T | T j ) as a mixture of simple local moves from one tree to another - moves that have a chance of increasing posterior probability. In particular, CGM use the following q(T | T j ), which generates T from T j by randomly choosing among four steps: GROW: Randomly pick a terminal node. Split it into two new ones by randomly assigning it a splitting rule using the same random splitting rule assignment used to determine p(T ). PRUNE: Randomly pick a parent of two terminal nodes and turn it into a terminal node by collapsing the nodes below it. CHANGE: Randomly pick an internal node, and randomly reassign it using the same random splitting rule assignment used to determine p(T ). SWAP: Randomly pick a parent-child pair which are both internal nodes. Swap their splitting rules unless the other child has the identical rule. In that case, swap the splitting rule of the parent with that of both children. In executing the GROW, CHANGE and SWAP steps, attention is restricted to splitting rule assignments that do not force the tree have an empty terminal node. CGM also recommend further restricting attention to splitting rule assignments which yield trees with at least a small number (such as ve) observations at every terminal node. A similar q(T | T j ), without the SWAP step, was proposed by DMS. An interesting general approach for constructing such moves was proposed by Knight, Kustra and Tibshirani (1998). The transition kernel q(T | T j ) above has some appealing features. To begin with, every step from T to T has a counterpart that moves from T to T . Indeed, the GROW and PRUNE steps are counterparts of one another, and the CHANGE and q(T j | T ) g(T ) ,1 . q(T | T j ) g(T j ) (4.15) Practical Bayes Model Selection 105 SWAP steps are their own counterparts. This feature guarantees the irreducibility of the algorithm, which is needed for convergence. It also makes it easy to calculate the ratio q(T j | T )/q(T | T j ) in (4.15). Note that other reversible moves may be much more dicult to implement because their counterparts are impractical to construct. For example, pruning o more than a pair of terminal nodes would require a complicated and computationally expensive reverse step. Another computational feature occurs in the GROW and PRUNE steps, where there is substantial cancellation between g and q in the calculation of (4.15) because the splitting rule assignment for the prior is used. 4.2.2 Running the MH Algorithm for Stochastic Search The MH algorithm described in the previous section can be used to search for desirable trees. To perform an eective search it is necessary to understand its behavior as it moves through the space of trees. By virtue of the fact that its limiting distribution is p(T |Y ), it will spend more time visiting tree regions where p(T |Y ) is large. However, our experience in assorted problems (see the examples in CGM) has been that the algorithm quickly gravitates towards such regions and then stabilizes, moving locally in that region for a long time. Evidently, this is a consequence of a transition kernel that makes local moves over a sharply peaked multimodal posterior. Once a tree has reasonable t, the chain is unlikely to move away from a sharp local mode by small steps. Because the algorithm is convergent, we know it will eventually move from mode to mode and traverse the entire space of trees. However, the long waiting times between such moves and the large size of the space of trees make it impractical to search eectively with long runs of the algorithm. Although dierent move types might be implemented, we believe that any MH algorithm for CART models will have diculty moving between local modes. To avoid wasting time waiting for mode to mode moves, our search strategy has been to repeatedly restart the algorithm. At each restart, the algorithm tends to move quickly in a direction of higher posterior probability and eventually stabilize around a local mode. At that point the algorithm ceases to provide new information, and so we intervene in order to nd another local mode more quickly. Although the algorithm can be restarted from any particular tree, we have found it very productive to repeatedly restart at the trivial single node tree. Such restarts have led to a wide variety of dierent trees, apparently due to large initial variation of the algorithm. However, we have also found it productive to restart the algorithm at other trees such as previously visited intermediate trees or trees found by other heuristic methods. For example, CGM demonstrate that restarting the algorithm at trees found by bootstrap bumping (Tibshirani and Knight 1999) leads to further improvements over the start points. A practical implication of restarting the chain is that the number of restarts must be 106 H. Chipman, E. I. George and R. E. McCulloch traded o against the length of the chains. Longer chains may more thoroughly explore a local region of the model space, while more restarts could cover the space of models more completely. In our experience, a preliminary run with a small number of restarts can aid in deciding these two parameters of the run. If the marginal likelihood stops increasing before the end of each run, lengthening runs may be less protable than increasing the number of restarts. It may also be useful to consider the slower burn in modication of the algorithm proposed by DMS. Rather than let their MH algorithm move quickly to a mode, DMS intervene, forcing the algorithm to move around small trees with around 6 or fewer nodes, before letting it move on. This interesting strategy can take advantage of the fact that the problems caused by the sharply peaked multimodal posterior are less acute when small trees are constructed. Indeed, when trees remain small, the change or swap steps are more likely to be permissible (since there are fewer children to be incompatible with), and help move around the model space. Although this burn in strategy will slow down the algorithm, it may be a worthwhile tradeo if it suciently increases the probability of nding better models. 4.2.3 Selecting the Best Trees As many trees are visited by each run of the algorithm, a method is needed to identify those trees which are of most interest. Because g(T ) p(T | Y ) is available for each visited tree, one might consider selecting those trees with largest posterior probability. However, this can be problematic because of the dilution property of p(T ) discussed in Section 4.1.1. Consider the following simple example. Suppose we were considering all possible trees with two terminal nodes and a single rule. Suppose further that we had only two possible predictors, a binary variable with a single available splitting rule, and a multilevel variable with 100 possible splits. If the marginal likelihood p(Y | T ) was the same for all 101 rules, then the posterior would have a sharp mode on the binary variable because the prior assigns small probability to each of the 100 candidate splits for the multilevel predictor, and much larger probability to the single rule on the binary predictor. Selection via posterior probabilities is problematic because the relative sizes of posterior modes does not capture the fact that the total posterior probability allocated to the 100 trees splitting on the multilevel variable is the same as that allocated to the single binary tree. It should be emphasized that the dilution property is not a failure of the prior. By using it, the posterior properly allocates high probability to tree neighborhoods which are collectively supported by the data. This serves to guide the algorithm towards such regions. The diculty is that relative sizes of posterior modes do not capture the relative Practical Bayes Model Selection 107 allocation of probability to such regions, and so can lead to misleading comparisons of single trees. Note also that dilution is not a limitation for model averaging. Indeed, one could approximate the overall posterior mean by the average of the visited trees using weights proportional to p(Y | T )p(T ). Such model averages provide a Bayesian alternative to the tree model averaging proposed by see Breiman (1996) and Oliver and Hand (1995). A natural criterion for tree model selection, which avoids the diculties described above, is to use the marginal likelihood p(Y | T ). As illustrated in CGM, a useful tool in this regard is a plot of the largest observed values of p(Y | T ) against the number of terminal nodes of T , an analogue of the Cp plot (Mallows 1973). This allows the user to directly gauge the value of adding additional nodes while removing the inuence of p(T ). In the same spirit, we have also found it useful to consider other commonly used tree selection criteria such as residual sums of squares for regression trees and misclassication rates for classication trees. After choosing a selection criterion, a remaining issue is what to do when many dierent models are found, all of which t the data well. Indeed, our experience with stochastic search in applications has been to nd a large number of good tree models, distinguished only by small dierences in marginal likelihood. To deal with such output, in Chipman, George and McCulloch (1998b, 2001a), we have proposed clustering methods for organizing multiple models. We found such clustering to reveal a few distinct neighborhoods of similar models. In such cases, it may be better to select a few representative models rather than a single best model. 5 Much More to Come Because of its broad generality, the formulation for Bayesian model uncertainty can be applied to a wide variety of problems. The two examples that we have discussed at length, Bayesian variable selection for the linear model and Bayesian CART model selection, illustrate some of the main ideas that have been used to obtain eective practical implementations. However, there have been many other recent examples. To get an idea of the extent of recent activity, consider the following partial list of some of the highlights just within the regression framework. To begin with, the Bayesian variable selection formulation for the linear model has been extended to the multivariate regression setting by Brown, Vannucci and Fearn (1998). It has been applied and extended to nonparametric spline regression by Denison, Mallick and Smith (1998bc), Gustafson (2000), Holmes and Mallick (2001), Liang, Truong and Wong (2000), Smith and Kohn (1996, 1997), Smith, Wong and Kohn 108 H. Chipman, E. I. George and R. E. McCulloch (1998); and to nonparametric wavelet regression by Abramovich, Sapatinas and Silverman (1998), Chipman, Kolaczyk and McCulloch (1997), Clyde and George (1999,2000), Clyde, Parmigiani and Vidakovic (1998), Holmes and Mallick (2000) and Johnstone and Silverman (1998). Related Bayesian approaches for generalized linear models and time series models have been put forward by Chen, Ibrahim and Yiannoutsos (1999), Clyde (1999), George, McCulloch and Tsay (1995), Ibrahim and Chen (2000), Mallick and Gelfand (1994), Raftery (1996), Raftery, Madigan and Volinsky (1996), Raftery and Richardson (1996), Shively, Kohn and Wood (1999), Troughton and Godsill (1997) and Wood and Kohn (1998); for loglinear models by Dellaportas and Foster (1999) and Albert (1996); and to graphical model selection by Giuduci and Green (1999) and Madigan and York (1995). Bayesian CART has been extended to Bayesian treed modelling by Chipman, George and McCulloch (2001); an related Bayesian partitioning approach has been proposed by Holmes, Denison and Mallick (2000). Alternative recent Bayesian methods for the regression setup include the predictive criteria of Laud and Ibrahim (1995), the information distance approach of Goutis and Robert (1998) and the utility approach of Brown, Fearn and Vannucci (1999) based on the early work of Lindley (1968). An excellent summary of many of the above articles and additional contributions to Bayesian model selection can be found in Ntzoufras (1999). Spurred on by applications to new model classes, renements in prior formulations and advances in computing methods and technology, implementations of Bayesian approaches to model uncertainty are widening in scope and becoming increasingly prevalent. With the involvement of a growing number of researchers, the cross-fertilization of ideas is further accelerating developments. As we see it, there is much more to come. REFERENCES Abramovich, F., Sapatinas, T., and Silverman, B.W. (1998). Wavelet thresholding via a Bayesian approach. J. Roy. Statist. Soc. Ser. B 60, 725-749. Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In 2nd Internat. Symp. Inform. Theory (B.N. Petrov and F. Csaki, eds.) 267-81, Akademia Kiado, Budapest. Albert, J.H. (1996). The Bayesian Selection of Log-linear Models. Canad. J. Statist. 24, 327-347. Bartlett, M. (1957). A comment on D. V. Lindleys Statistical Paradox. Biometrika 44, 533-534. Bernardo, J. M., and Smith, A.F.M. (1994). Bayesian Theory, Wiley, New York. Berger, J.O. and Pericchi, L.R. (1996). The intrinsic Bayes factor for model selection Practical Bayes Model Selection and prediction. J. Amer. Statist. Asso. 91, 109-122. 109 Besag, J and Green, P.J. (1993). Spatial statistics and Bayesian computation (with discussion) J. Roy. Statist. Soc. Ser. B 55 25-37. Breiman, L (1996). Bagging predictors. Machine Learning 24, 123-140. Breiman, L., Friedman, J. Olshen, R. and Stone, C. (1984). Classication and Regression Trees. Wadsworth. Brown, P.J., Vannucci,M., and Fearn, T. (1998). Multivariate Bayesian variable selection and prediction. J. Roy. Statist. Soc. Ser. B 60, 627-642. Brown, P.J., Fearn, T. and Vannucci, M. (1999). The choice of variables in multivariate regression: a non-conjugate Bayesian decision theory approach. Biometrika 86, 635648. Buntine, W. (1992). Learning Classication Trees. Statist. Comput. 2, 63-73. Carlin, B.P. and Chib, S. (1995). Bayesian model choice via Markov Chain Monte Carlo. J. Roy. Statist. Soc. Ser. B 77, 473-484. Casella, G. and George, E.I. (1992). Explaining the Gibbs sampler, The American Statistician, 46, 167-174. Chambers, J.M. (1971). Regression updating, J. Amer. Statist. Asso. 66, 744-748. Chen, M.H., Ibrahim, J.G. and Yiannoutsos, C. (1999). Prior elicitation, variable selection and Bayesian computation for logistic regression models. J. Roy. Statist. Soc. Ser. B 61, 223-243. Chib, S. and Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm, The American Statistician, 49, 327-335. Chipman, H. (1996). Bayesian variable selection with related predictors. Canad. J. Statist. 24, 17-36. Chipman, H. A., George, E. I., and McCulloch, R. E. (1998a). Bayesian CART model search (with discussion). J. Amer. Statist. Asso. 93, 935-960. Chipman, H. A., George, E. I. and McCulloch, R. E. (1998b). Making sense of a forest of trees. Computing Science and Statistics, Proc. 30th Symp. Interface (S. Weisberg, Ed.) 84-92, Interface Foundation of North America, Fairfax, VA. Chipman, H., George, E. I., and McCulloch, R. E. (2000). Hierarchical priors for Bayesian CART shrinkage. Statist. Comput. 10, 17-24. Chipman, H., George, E.I. and McCulloch, R.E. (2001a). Managing multiple models. Articial Intelligence and Statistics 2001, (Tommi Jaakkola and Thomas Richardson, eds.) 11-18, Morgan Kaufmann, San Francisco, CA. 110 H. Chipman, E. I. George and R. E. McCulloch Chipman, H., George, E. I., and McCulloch, R. E. (2001b). Bayesian treed models. Machine Learning. To appear. Chipman, H., Hamada, M. and Wu, C. F. J., (1997). A Bayesian variable selection approach for analyzing designed experiments with complex aliasing. Technometrics, 39, 372-381. Chipman, H., Kolaczyk, E., and McCulloch, R. (1997). Adaptive Bayesian wavelet shrinkage. J. Amer. Statist. Asso. 92, 1413-1421. Clark, L. and Pregibon, D. (1992). Tree-Based Models. In Statistical Models in S (J. Chambers and T. Hastie, Eds.) 377-419, Wadsworth. Clyde, M.A. (1999). Bayesian model averaging and model search strategies (with discussion). In Bayesian Statistics 6 (J.M. Bernardo, J.O. Berger, A.P. Dawid and A.F.M. Smith, eds.) 157-185, Oxford Univ. Press. Clyde, M.A., DeSimone, H., and Parmigiani, G. (1996). Prediction via orthogonalized model mixing. J. Amer. Statist. Asso. 91, 1197-1208. Clyde, M. and George, E.I. (1999) Empirical Bayes estimation in wavelet nonparametric regression. In Bayesian Inference in Wavelet Based Models (P. Muller and B. Vidakovic, eds.) 309-22, Springer-Verlag, New York. Clyde, M. and George, E.I. (2000). Flexible empirical Bayes estimation for wavelets. J. Roy. Statist. Soc. Ser. B 62, 681-698. Clyde, M.A. and Parmigiani, G. (1994). Bayesian variable selection and prediction with mixtures, J. Biopharmaceutical Statist. Clyde, M., Parmigiani, G., Vidakovic, B. (1998). Multiple shrinkage and subset selection in wavelets. Biometrika 85, 391-402. Denison, D.G.T., Mallick, B.K. and Smith, A.F.M. (1998a). A Bayesian CART algorithm. Biometrika 85, 363-377. Denison, D.G.T., Mallick, B.K. and Smith, A.F.M. (1998b). Automatic Bayesian curve tting. J. Roy. Statist. Soc. Ser. B 60, 333-350. Denison, D.G.T., Mallick, B.K. and Smith, A.F.M. (1998c). Bayesian MARS. Statist. Comput. 8, 337-346. Dellaportas, P. and Foster, J.J. (1999). Markov Chain Monte Carlo Model determination for hierarchical and graphical log-linear Models. Biometrika 86, 615-634. Dellaportas, P. Forster, J.J. and Ntzoufras, I. (2000). On Bayesian model and variable selection using MCMC. Statist. Comput. To appear. Dongarra, J.J., Moler C.B., Bunch, J.R. and Stewart, G.W. (1979). Linpack Users Guide. SIAM, Philadelphia. Practical Bayes Model Selection 111 Donoho, D.L. and Johnstone, I.M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81, 425-56. Draper, D. (1995). Assessment and propagation of model uncertainty (with discusssion). J. Roy. Statist. Soc. Ser. B 57, 45-98. Edwards, W Lindman H. and Savage, L.J. (1963). Bayesian statistical inference for psychological research. Psychological Review 70 193-242. Fernandez, C., Ley, E. and Steel, M.F.J. (2001). Benchmark priors for Bayesian model averaging. J. Econometrics 100, 381-427. Foster, D.P. and George, E.I. (1994). The risk ination criterion for multiple regression. Ann. Statist. 22, 1947-75. Garthwaite,P. H. and Dickey, J.M. (1992). Elicitation of prior distributions for variableselection problems in regression, Ann. Statist. 20, 1697-1719. Garthwaite, P. H. and Dickey, J.M. (1996). Quantifying and using expert opinion for variable-selection problems in regression (with discussion). Chemomet. Intel. Lab. Syst. 35, 1-34. Gelfand, A. E., Dey, D. K., and Chang, H. (1992). Model determination using predictive distributions With implementations via sampling-based methods. In Bayesian Statistics 4 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 147167, Oxford Univ. Press. Gelfand, A., and Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities, J. Amer. Statist. Asso. 85, 398-409. George, E.I. (1998). Bayesian model selection. Encyclopedia of Statistical Sciences, Update Volume 3 (S. Kotz, C. Read and D. Banks, eds.) 39-46, Wiley, New York. George, E.I. (1999). Discussion of Bayesian model averaging and model search strategies by M.A. Clyde. Bayesian Statistics 6 (J.M. Bernardo, J.O. Berger, A.P. Dawid and A.F.M. Smith, eds.) 175-177, Oxford Univ. Press. George, E.I. and Foster, D.P. (2000) Calibration and empirical Bayes variable selection. Biometrika 87, 731-748. George, E.I. and McCulloch, R.E. (1993). Variable selection via Gibbs sampling. J. Amer. Statist. Asso. 88, 881-889. George, E.I. and McCulloch, R.E. (1996). Stochastic search variable selection. In Markov Chain Monte Carlo in Practice (W.R. Gilks, S. Richardson and D.J. Spiegelhalter, eds.) 203-214, Chapman and Hall, London. George, E.I. and McCulloch, R.E. (1997). Approaches for Bayesian variable selection. Statist. Sinica 7, 339-73. 112 H. Chipman, E. I. George and R. E. McCulloch George, E.I., McCulloch, R.E. and Tsay, R. (1995) Two approaches to Bayesian Model selection with applications. In Bayesian Statistics and Econometrics: Essays in Honor of Arnold Zellner (D. Berry, K. Chaloner, and J. Geweke, eds.) 339-348, Wiley, New York. Geisser, S. (1993). Predictive Inference: An Introduction. Chapman and Hall, London. Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pattn. Anal. Mach. Intell. 6, 721-741. Geweke, J. (1996). Variable selection and model comparison in regression. In Bayesian Statistics 5 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 609-620, Oxford Univ. Press. Gilks, W.R., Richardson S. and Spiegelhalter, D.J. (1996) Markov Chain Monte Carlo in Practice. Chapman and Hall, London. Giuduci, P. and Green, P.J. (1999). Decomposable graphical Gausssian model determination. Biometrika 86, 785-802. Goutis, C. and Robert, C.P. (1998). Model choice in generalized linear models: A Bayesian approach via Kullback-Leibler projections. Biometrika 82, 711-732. Green, P. (1995). Reversible Jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711-732. Gustafson, P. (2000). Bayesian regression modelling with interactions and smooth eects. J. Amer. Statist. Asso. 95, 795-806. Hastie, T., and Pregibon, D. (1990). Shrinking trees. AT&T Bell Laboratories Technical Report. Hastings, W.K. (1970). Monte Carlo sampling methods using Markov chain and their applications. Biometrika 57, 97-109. Hawkins, D. M. and Kass, G. V. (1982). Automatic interaction detection. In Topic in Applied Multivariate Analysis (D. M. Hawkins, ed.) 267-302, Cambridge Univ. Press. Hoeting, J. A., Raftery, A.E. and Madigan, D. (1996). A method for simultaneous variable selection and outlier identication in linear regression. Computational Statistics and Data Analysis 22, 251-270. Hoeting, J. A., Madigan, D., Raftery, A.E., and Volinsky, C.T. (1999). Bayesian Model averaging: A tutorial (with discussion). Statist. Sci. 14:4, 382-417. (Corrected version available at http://www.stat.washington.edu/www/research/online/hoeting1999.pdf). Holmes C.C., Denison, D.G.T. and Mallick, B.K. (2000). Bayesian prediction via partitioning. Technical Report, Dept. of Mathematics, Imperial College, London. Holmes, C.C. and Mallick, B.K. (2000). Bayesian wavelet networks for nonparametric Practical Bayes Model Selection regression. IEEE Trans. Neur. Netwrks. 10, 1217-1233. 113 Holmes, C.C. and Mallick, B.K. (2001). Bayesian regression with multivariate linear splines. J. Roy. Statist. Soc. Ser. B 63, 3-18. Ibrahim, J.G. and Chen, M.-H. (2000). Prior Elicitation and Variable Selection for Generalized Linear Mixed Models. In Generalized Linear Models: A Bayesian Perspective, (Dey, D.K., Ghosh, S.K. and Mallick, B.K. eds.) 41-53, Marcel Dekker, New York. Johnstone, I.M. and Silverman, B.W. (1998). Empirical Bayes approaches to mixture problems and wavelet regression. Technical Report, Univ. of Bristol. Jordan, M.I. and Jacobs, R.A. (1994). Mixtures of experts and the EM algorithm. Neural Computation 6, 181-214. Kass, G. V. (1980). An exploratory technique for investigating large quantities of categorical data. Appl. Statist. 29, 119-127. Kass, R. E. and Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. J. Amer. Statist. Asso. 90, 928-34. Key, J.T., Pericchi, L.R. and Smith, A.F.M. (1998). Choosing among models when none of them are true. In Proc. Workshop Model Selection, Special Issue of Rassegna di Metodi Statistici ed Applicazioni (W. Racugno, ed.) 333-361, Pitagora Editrice, Bologna. Knight, K., Kustra, R. and Tibshirani, R. (1998). Discussion of Bayesian CART model search by Chipman, H. A., George, E. I., and McCulloch, R. E. J. Amer. Statist. Asso. 93, 950-957. Kuo, L. and Mallick, B. (1998). Variable selection for regression models. Sankhy Ser. a B 60, 65-81. Laud, P.W. and Ibrahim, J.G. (1995). Predictive model selection. J. Roy. Statist. Soc. Ser. B 57, 247-262. Leblanc, M. and Tibshirani, R. (1998). Monotone shrinkage of trees. J. Comput. Graph. Statist. 7, 417-433. Liang F., Truong, Y.K. and Wong, W.H. (2000). Automatic Bayesian model averaging for linear regression and applications in Bayesian curve tting. Technical Report, Dept. of Statistics, Univ. of California, Los Angeles. Lindley, D.V. (1968). The choice of variables in regression. J. Roy. Statist. Soc. Ser. B 30, 31-66. Liu, J.S., Wong, W.H., and Kong, A. (1994). Covariance structure and convergence rate of the Gibbs sampler with applications to the comparisons of estimators and 114 H. Chipman, E. I. George and R. E. McCulloch augmentation schemes. Biometrika 81, 27-40. Lutsko, J. F. and Kuijpers, B. (1994). Simulated annealing in the construction of nearoptimal decision trees. In Selecting Models from Data: AI and Statistics IV (P. Cheeseman and R. W. Oldford, eds.) 453-462. Madigan, D. and York, J. (1995). Bayesian graphical models for discrete data. Internat. Statist. Rev. 63, 215-232. Mallick, B.K. and Gelfand, A.E. (1994). Generalized linear models with unknown number of components. Biometrika 81, 237-245. Mallows, C. L. (1973). Some Comments on Cp . Technometrics 15, 661-676. McCullagh, P. and Nelder, J.A. (1989). Generalized Liner Models, 2nd Ed. Chapman and Hall, New York. McCulloch, R.E. and Rossi P. (1991). A Bayesian approach to testing the arbitrage pricing theory. J. Econometrics 49, 141-168. Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H. and Teller,E. (1953). Equations of state calculations by fast computing machines. J. Chem. Phys. 21, 1087-1091. Meyn, S.P. and Tweedie, R.L. (1993). Markov Chains and Stochastic Stability. SpringerVerlag, New York. Mitchell, T.J. and Beauchamp, J.J. (1988). Bayesian variable selection in linear regression (with discussion). J. Amer. Statist. Asso. 83, 1023-1036. Ntzoufras, I. (1999). Aspects of Bayesian model and variable selection using MCMC. Ph.D. dissertation, Dept. of Statistics, Athens Univ. of Economics and Business, Athens, Greece. OHagan, A. (1995). Fractional Bayes factors for model comparison (with discussion). Jour. of the Roy. Statist. Soc. Ser. B 57, 99-138. Oliver, J.J. and Hand, D.J. (1995). On pruning and averaging decision trees. Proc. Internat. Machine Learning Conf. 430-437. Paass, G. and Kindermann, J. (1997). Describing the uncertainty of Bayesian predictions by using ensembles of models and its application. 1997 Real World Comput. Symp. 118-125, Real World Computing Partnership, Tsukuba Research Center, Tsukuba, Japan. Petit, L.I., and Smith, A. F. M. (1985). Outliers and inuential observations in linear models. In Bayesian Statistics 2, (J.M. Bernardo, M.H. DeGroot, D.V. Lindley and A.F.M. Smith, eds.) 473-494, North-Holland, Amsterdam. Peng, L. (1998). Normalizing constant estimation for discrete distribution simulation. Practical Bayes Model Selection Ph.D. dissertation, Dept. MSIS, Univ. of Texas, Austin. 115 Phillips, D. B., and Smith, A. F. M. (1995). Bayesian model comparison via jump diusions, In Practical Markov Chain Monte Carlo in Practice (W.R. Gilks, S. Richardson and D.J. Spiegelhalter, eds.) 215-239, Chapman and Hall, London. Quinlan, J. R. (1986). Induction of decision trees. Machine Learning 1, 81-106. Quinlan, J. R. (1993). C4.5: Tools for Machine Learning, Morgan Kauman, San Mateo, CA. Quinlan, J.R. and Rivest, R.L. (1989). Inferring decision trees using the minimum description length principle,. Information and Computation 80, 227-248. Raftery, A.E. (1996). Approximate Bayes factors and accounting for model uncertainty in generalized linear models. Biometrika 83, 251-266. Raftery, A. E., Madigan, D. and Hoeting, J. A. (1997). Bayesian model averaging for linear regression models. J. Amer. Statist. Asso. 92, 179-191. Raftery, A.E., Madigan, D. M., and Volinsky C.T. (1995). Accounting for model uncertainty in survival analysis improves predictive performance (with discussion). In Bayesian Statistics 5 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 323-350, Oxford Univ. Press. Raftery, A.E. and Richardson, S. (1996). Model selection for generalized linear models via GLIB: application to nutrition and breast cancer. Bayesian Biostatistics ( D.A. Berry and D.K. Strangl, eds.) 321-353, Marcel Dekker, New York. San Martini, A. and Spezzaferri, F. (1984). A predictive model selection criterion J. Roy. Statist. Soc. Ser. B 46, 296-303. Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461-4. Shively, T.S., Kohn, R. and Wood, S. (1999). Variable selection and function estimation in additive nonparametric regression using a data-based prior (with discussion). J. Amer. Statist. Asso. 94, 777-806. Smith, A.F.M and Roberts, G.O. (1993). Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. J. Roy. Statist. Soc. Ser. B 55, 3-24. Smith, M. and Kohn, R.(1996). Nonparametric regression using Bayesian variable selection. Journal of Econometrics, 75, 317-344. Smith, M. and Kohn, R. (1997). A Bayesian approach to nonparametric bivariate regression. J. Amer. Statist. Asso. 92, 1522-1535. Smith, M., Wong, C.M. and Kohn, R. (1998). Additive nonparametric regression with autocorrelated errors. J. Roy. Statist. Soc. Ser. B 60, 311-331. 116 H. Chipman, E. I. George and R. E. McCulloch Sutton, C. (1991). Improving classication trees with simulated annealing. Computing Science and Statistics, Proc. 23rd Symp. Interface (E. Keramidas, ed.) 396-402, Interface Foundation of North America. Tibshirani, R. and Knight, K. (1999). Model search by bootstrap bumping. J. Comput. Graph. Statist. 8, 671-686. Tierney, L. (1994). Markov chains for exploring posterior distributions (with discussion). Ann. Statist. 22, 1674-1762. Tierney, L. and Kadane, J.B. (1986). Accurate approximations for posterior moments and marginal densities. J. Amer. Statist. Asso. 81, 82-86. Troughton, P.T. and Godsill, S.J. (1997). Bayesian model selection for time series using Markov Chain Monte Carlo. Proc. IEEE Internat. Conf. Acoustics, Speech and Signal Processing, 3733-3736. Wakeeld, J.C. and Bennett, J.E. (1996). The Bayesian modelling of covariates for population pharmacokinetic models. J. Amer. Statist. Asso. 91, 917-928. Wallace, C.C. and Patrick, J.D. (1993). Coding decision trees. Machine Learning 11, 7-22. Wasserman, L. (2000). Bayesian model selection and model averaging. J. Math. Psychology 44, 92-107. Wood, S. and Kohn, R. (1998). A Bayesian approach to robust binary nonparametric regression. J. Amer. Statist. Asso. 93, 203-213. Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finietti (P.K. Goel and A. Zellner, eds.) 233-43, North-Holland, Amsterdam. M. Clyde 117 DISCUSSION M. Clyde Duke University I would like to begin by thanking the authors for their many contributions to Bayesian model selection and for providing an excellent summary of the growing body of literature on Bayesian model selection and model uncertainty. The development of computational tools such as the Gibbs sampler and Markov chain Monte Carlo approaches, has led to an explosion in Bayesian approaches for model selection. On the surface, Bayesian model averaging (BMA) and model selection is straightforward to implement: one species the distribution of the data, and the prior probabilities of models and model specic parameters; Bayes theorem provides the rest. As the authors point out the two major challenges confronting the practical implementation of Bayesian model selection are choosing prior distributions and calculating posterior distributions. In my experience, I have found that this is especially true in high dimensional problems, such as wavelet regression or non-parametric models, where subjective elicitation of prior distributions is practically infeasible and enumeration of the number of potential models is intractable (Clyde et al. 1998; Clyde and George 2000). Choice of Prior Distributions The specication of prior distributions is often broken down into two parts: (1) elicitation of distributions for parameters specic to each model, such as the distribution for regression coecients in linear models, p(|, 2 ), and (2) selection of a prior distribution over models p(). For high dimensional problems one cannot specify the prior probability of each directly, and practical implementations of Bayesian selection have usually made prior assumptions that the presence or absence of a variable is independent of the presence or absence of other variables. As a special case of this, the uniform prior distribution over models is appealing in that posterior probabilities of models depend only on the likelihood. However, this prior distribution may not be sensible for model averaging when there are groups of similar variables and does not provide the proper dilution of posterior mass over similar models (see Clyde 1999; Hoeting et al. 1999), and discussion therein (George 1999; 1999a). In this regard, uniform and independent prior distributions must be used carefully with highly correlated explanatory variables and special consideration should be given to constructing the model space. Even with Merlise Clyde is Associate Professor, Institute of Statistics and Decision Sciences, Duke University, Durham NC 27708-0251, U.S.A. email: clyde@stat.Duke.EDU. 0 118 Discussion uniform priors on models, the posterior distribution over models naturally penalizes adding redundant variables, however, this may not be enough to lead to the proper rate of dilution with nearly redundant variables. One approach to constructing dilution prior distributions is to start with a uniform prior over models and use imaginary training data to construct a posterior distribution for based on the training data; this posterior would become the prior distribution for for the observed data. Selection of the training data and hyperparameters are non-trivial issues, however, this approach would likely provide better dilution properties than starting with an independent prior. Clearly, construction of prior distributions for models that capture similarities among variables in a sensible way is a dicult task and one that needs more exploration. For (1), by far the most common choice is a normal prior distribution for , such as in the conjugate setup for point prior selection models (section 3.2 CGM), where N (0, 2 ). Again, as one cannot typically specify a separate prior distribution for under each model, any practical implementation for Bayesian model uncertainty usually resorts to structured families of prior distributions. Another important consideration is whether prior specications for are compatible across models (Dawid and Lauritzen 2000) . For example, suppose that Model 1 contains variables 1 and 2, Model 2 contains variables 2 and 3, and Model 3 includes only variable 2. With apologies for possible abuse of notation, let 2 denote the coecient for variable 2 in each model. With completely arbitrary choices for , under Model 1 the variance for 2 given that 1 = 0 may not be the same as the variance for 2 given that 3 = 0 derived under Model 2, and both may dier from the variance for 2 under the prior distribution given Model 3. To avoid this incompatibility, choices for are often obtained from conditional specications (i.e. conditional on i = 0 for i = 0) derived from a prior distribution for under the full model. For example, Zellners g-prior (Zellner 1986) is commonly used, which leads to = c(X X)1 for the full model and = c(X X )1 for model . While in many cases conditioning leads to sensible choices, the result may depend on the choice of parameterization, which can lead to a Borel paradox (Kass and Raftery 1995; Dawid and Lauritzen 2000). Other structured families may be induced by marginalization or projections, leading to possibly dierent prior distributions. While structured prior distributions may reduce the number of hyperparameters that must be selected, i.e. to one parameter c, how to specify c is still an issue. The choice of c requires careful consideration, as it appears in the marginal likelihoods of the data and hence the posterior model probabilities, and in my experience, can be inuential. In the situation where little prior information is available, being too noninformative about (taking c large) can have the un-intended consequence of favoring the null model a posteriori (Kass and Raftery 1995). While default prior distributions M. Clyde 119 (both proper and improper) can be calibrated based on information criteria such as AIC (Akaike Information Criterion - Akaike 1973), BIC (Bayes Information Criterion Schwarz 1978), or RIC (Risk Ination Criterion Foster and George 1994), there remains the question of which one to use (Clyde 2000; George and Foster 2000; Fernandez et al. 1998); such decisions may relate more to utilities for model selection rather than prior beliefs (although it may be impossible to separate the two issues). Based on simulation studies, Fernandez et al.(1998) recommend RIC-like prior distributions when n < p2 and BIC-like prior distributions otherwise. In wavelet regression, where p = n, there are cases where priors calibrated based on BIC have better predictive performance than prior distributions calibrated using RIC, and vice versa. From simulation studies and asymptotic arguments, it is clear that there is no one default choice for c that will perform well for all contingencies (Fernandez et al. 1998; Hansen and Yu 1999). Empirical Bayes approaches provide an adaptive choice for c. One class of problems where BMA has had outstanding success is in non-parametric regression using wavelet bases. In nonparametric wavelet regression where subjective information is typically not available, empirical Bayes methods for estimating the hyperparameters have (empirically) led to improved predictive performance over a number of xed hyperparameter specications as well as default choices such as AIC, BIC, and RIC (Clyde and George 1999, 2000; George and Foster 2000; Hansen and Yu 1999) for both model selection and BMA. Because of the orthogonality in the design matrix under discrete wavelet transformations, EB estimates can be easily obtained using EM algorithms (Clyde and George 1999, 2000; Johnstone and Silverman 1998) and allow for fast analytic expressions for Bayesian model averaging and model selection despite the high dimension of the parameter space (p = n) and model space (2n ), bypassing MCMC sampling altogether. George and Foster (2000) have explored EB approaches to hyperparameter selection for linear models with correlated predictors. EM algorithms for obtaining estimates of c and , as in Clyde and George (2000), can be adapted to the non-orthogonal case with unknown 2 using the conjugate point prior formulation and independent model space priors (equation 3.2 in CGM). For the EM algorithm both model indicators and 2 are treated as latent data and imputed using current estimates of c and = (1 , . . . , p ), where j is the prior probability that variable j is included under the independence prior. At iteration i, this leads to (i) = 2 S (i) p(Y |, c(i) )p(| (i) ) (i) (i) p(Y |, c )p(| ) c(i) SSR() 1 + c(i) (i) (1) (2) is a Bayesian version of = YY 2 where SSR() is the usual regression sum of squares and S 120 Discussion residual sum of squares using the current estimate c(i) . Values of c and that maximize the posterior distribution for c and given the observed data and current values of the latent data are j (i+1) = such that j =1 (i) (3) j j (i+1) c(i+1) = max 0, SSR()/(p (i) ) 2 ( + S (i) )/(n + ) 1 (4) where and are hyperparameters in the inverse gamma prior distribution for 2 (CGM equation 3.10). These steps are iterated until estimates converge. EB estimates based on a common for all variables are also easy to obtain. For ridge-regression independent priors with = cI or other prior distributions for , estimates for j have the same form, but estimates for c are slightly more complicated and require numerical optimization. The ratio in the expression for c has the form of a generalized F-ratio, which is weighted by estimates of posterior model probabilities. The EM algorithm highlights an immediate diculty with a common c for all models, as one or two highly signicant coecients may inuence the EB estimate of c. For example, the intercept may be centered far from 0, and may have an absolute t-ratio much larger than the t-ratios of other coecients. As the intercept is in all models, it contributes to all of the SSR() terms, which has the eect of increasing c as the absolute t-ratio increases. Since the same c appears in the prior variance of all other coecients, if c becomes too large in order to account for the size of the intercept, we risk having the null model being favored (Bartletts paradox (Kass and Raftery 1995)). While one could use a dierent prior distribution for the intercept (even a non-informative prior distribution, which would correspond to centering all variables), the problem may still arise among the other variables if there are many moderate to large coecients, and a few that have extreme standardized values. Implicit in the formulation based on a common c is that the non-zero standardized coecients follow a normal distribution with a common variance. As such, this model cannot accommodate one or a few extremely large standardized coecients without increasing the odds that the remaining coecients are zero. Using a heavy-tailed prior distribution for may result in more robust EB estimates of c (Clyde and George 2000). Other possible solutions including adding additional structure into the prior that would allow for dierent groups of coecients with a dierent c in each group. In the context of wavelet regression, coecients are grouped based on the multi-resolution wavelet decomposition; in other problems there may not be any natural a priori groupings. Related to EB methods is the minimum description length (MDL) approach to model selection, which eectively uses a dierent c estimated from the data M. Clyde 121 for each model (Hansen and Yu 1999). While EB methods have led to improvements in performance, part of the success depends on careful construction of the model/prior. Some of the problems discussed above highlight possible restrictions of the normal prior. Unlike the EM estimates for orthogonal regression, the EB approach with correlated predictors involves a summation over all models, which is clearly impractical in large problems. As in the inference problem, one can base EB estimation on a sample of models. This approach has worked well in moderate sized problems, where leaps and bounds (Furnival and Wilson 1974) was used to select a subset of models; these were then used to construct the EB prior distribution, and then estimates under BMA with the estimated prior. For larger problems, leaps and bounds may not be practical, feasible, or suitable (such as CART models), and models must be sampled using MCMC or other methods. How to scale the EB/EM approach up for larger problems where models must be sampled is an interesting problem. In situations where there is uncertainty regarding a parameter, the Bayesian approach is to represent that prior uncertainty via a distribution. In other words, why not add another level to the hierarchy and specify a prior distribution on c rather than using a xed value? While clearly feasible using Gibbs sampling and MCMC methods, anal...

Find millions of documents on Course Hero - Study Guides, Lecture Notes, Reference Materials, Practice Exams and more. Course Hero has millions of course specific materials providing students with the best way to expand their education.

Below is a small sample set of documents:

N.C. State - ST - 732
The SAS System 1 The GLM Procedure Class Level Information Class Levels Values
Johns Hopkins - ECE - 353
The Johns Hopkins University Department of Electrical and Computer Engineering 520.454 - Control Systems Design - Spring 2007 Problem Set #0 Problems 1. Suppose that the input to a linear, time-invariant system was u(t) = et sin(2t), and this resulte
N.C. State - CH - 331
NORTH CAROLINA STATE UNIVERSITY Department of Chemistry CH 331 Physical Chemistry Given:Name_ Practice Mid-termg = 9.81 m/s2 R = 8.314 J mol-1 K-1 = 0.08206 L atm mol-1 K-1 1 atm = 1.0133 x 105 Nm-2 = 760 Torr P = P0exp{-Mgh/RT}P2 = P1 + ln t
N.C. State - CH - 331
North Carolina State University Physical Chemistry 331 Homework #6 Name _ Section _ Due date: Friday Oct. 22 1. It is often said the the primary charge separation step of photosynthesis is the most efficient electron transfer reaction known. a. Assum
N.C. State - ACC - 410
Fall 2005 GOVERNMENT AND NONPROFIT ACCOUNTING ACCOUNTING 410 Chapter 7 The Governmental Fund Accounting Cycle - Proprietary- Type Funds Internal Service Funds (ISF) When business type activities for goods/services are supplied to other gvt. departme
Virginia Tech - ETD - 03102001
East Los Angeles College - MJ - 665
Proceedings of the 1st Conference of the European Cooperation in Informatics, Amsterdam August 9-12, 1976, pages 236-262; Lecture Notes in Computer Science 44, K Samelson ed, Springer, 1976.CONSTRUCTIVE METHODS OF PROGRAM DESIGN M. A. Jackson Micha
N.C. State - MEA - 100
MEA 100Introduction to Earth Systems: This is a stealth course in Environmental Science. We will emphasize a quantitative approach based on knowledge of both abiotic and biotic systems.courses.ncsu.edu/mea100/lec/001/Introduction to Earth System
Contra Costa College - COEN - 244
COEN244 Winter 2008 Section U by Dr. Aishy Amer 5 % (A 3, 4, &amp;5) 15 % (5/6) submitted and compiled: -1 25 % submitted and did not compile: -2 53 % did not write (DNW) or did not submit: 0 2 % Note: if you did better in the final, your mark of the mid
University of Iowa - CS - 185
CS 185: Type Preservation, Constraint Solving. Types. type := base | type typeType assignment rules. (x) = T t-var x:T t1 : T 2 T 1 t2 : T 2 t-app t1 t2 : T 1 , x : T1 t : T2 t-lam x.t : T1 T2 t : T.Type Preservation Theorem. If t : T a
University of Iowa - CS - 185
CS 185, Homework 1Denotational Semantics of IMP [100 points].Your solution to this assignment is due Sept. 16th, at the start ofclass. The start of class is defined as within the first 10 minutes,so please do not be late.Recall the course col
University of Iowa - CS - 185
&gt; I. Non-determinism and Concurrency [40 points]&gt; For each of the following commands, first state whether it is using&gt; guarded commands (Ch. 7), shared-variable concurrency (Ch. 8), or&gt; communication via channels (Ch. 9). Then show all possible t
University of Iowa - CS - 185
Programming in Lambda CalculusLambda calculus syntax is nice and compact, and has severaleasy-to-define operational semantics.But can you program in it?Yes, by encoding all data as lambda-abstractions.A piece of data is encoded as a function
University of Iowa - CS - 185
CS 185: Combinators and Substitution10/16/08Last time: Scott-encoded data, in particular, unary natural numbers.0 = \ s . \ z. z1 = \ s . \ z. (s 0)2 = \ s . \ z. (s 1).Scott-encoded booleans:tt = _ff = _and = \ b1 . \ b2. _No
University of Iowa - CS - 185
CS 185, Confluence of Lambda Calculus ContinuedAs we saw last time, our goal is to prove the diamondproperty for the weak multi-step reduction relation -w-&gt; defined inductively as follows:- refle -w-&gt; ee -w-&gt; e'- lambdalambda x. e -w-&gt; la
University of Iowa - CS - 185
CS 185, Homework 3Concurrency and Lambda Calculus [100 points].Your solution to this assignment is due Friday, Oct. 31st, at 3pm.Turn in your solution to my mailbox, in the CS mail room. --I. Non-determinism and Concurrency [40 points]For eac
University of Iowa - CS - 185
CS 185: Lecture Notes on Simple Types. Syntax. The syntax of simple types is given by type := base | type typewhere base is any non-empty set of base types (for example, int or char). We use T as a meta-variable for types. The intuition is that T1
University of Iowa - CS - 185
CS 185, Homework 4Lambda Calculus and Functional Programming [100 points, 30 extra credit].Your solution to this assignment is due Wednesday, Nov. 21, at 3pm.Turn in your solution to my mailbox, in the CS mail room. --I. Confluence [30 points]
Concordia Canada - LYRA - 11247
N.C. State - CS - 312
Farm Plan homework Due: Lab on Monday Nov 21 Yesterday (Nov 15) we reviewed the basics for developing a whole farm forage plan and we walked the old Beef Teaching Unit to evaluate pastures and observe the landscape. On Nov 21 we will use a computer p
University of Iowa - IBS - 593
Directed evolution of ampicillin-resistant activity from a functionally unrelated DNA fragment: A laboratory model of molecular evolutionTakato Yano and Hiroyuki Kagamiyama*Department of Biochemistry, Osaka Medical College, Takatsuki, Osaka 569-868
University of Iowa - ME - 159
THE UNIVERSITY OF IOWA Department of Mechanical &amp; Industrial Engineering Fracture Mechanics 58:159 Homework #1 Total Points: 20 Assigned: January 28, 2009 Due: February 09, 2009Problem 1: Consider a plate containing a circular hole of radius a, as
University of Texas - ASE - 366
N.C. State - OR - 706
Motivation, Intuition, Speculation, Theorizationy[]()()xboundary / interior points closed / open sets bounded / compact sets convex sets . . . continuous functions differentiable functions convex / concave functions Taylor Series . . .1
University of Iowa - TA - 016
1. Enter.2. Leave. 3. QuitInput your option: 1*Welcome to the parking lot!1. Enter.2. Leave. 3. QuitInput your option: 2*Good bye!1. Enter.2. Leave. 3. QuitInput your option: 1*Welcome to the parking lot!1. Enter.2. Leave. 3.
East Los Angeles College - COMS - 12200
Hardware/Software Interface : Part 3So far, we have ignored the topic of function calls. A function call can be split into two partsThe caller part is what makes the function call. The callee part is the actual function itself.To implement funct
Wilfrid Laurier - MATH - 249
Math 249 Lecture 08: Inverse Functions, One-to-one Functions. Objective: To learn the meaning of one-to-one functions and inverse functions, how to determine whether a function is one-to-one, and how to find inverse functions of one-to-one functions
Wilfrid Laurier - MATH - 249
Math 249 L05/L06 (Fall 2007) Worksheet 7 1. p.205 #27. 2. p.205 #29. 3. p.205 #32. 4. p.218 #44. 5. p.218 #32. 6. p.218 #34. 7. p.228 #38. 8. p.228 #50. 9. p.228 #52.November 5-9, 20071
Wilfrid Laurier - MATH - 249
Math 249 Lecture 27: Logarithmic Dierentiation Objective: To learn the method of logarithmic dierentiation. Concepts: To nd the derivative of y = f (x)g(x) . f (x)g(x) = eln(f (x) Logarithmic Dierentiation Examples. 1. Find the derivative of the fo
Middlebury - ECON - 0340
November 8, 2005Bush, Meeting Panama's Leader, Endorses Widening of the CanalBy ELISABETH BUMILLERPANAMA, Nov. 7 - President Bush on Monday endorsed widening the Panama Canal and cited progress in reaching a free-trade agreement with Panama's pr
Wilfrid Laurier - PMAT - 445
PMAT 445 (Winter 2007) Midterm 2 1. Dene the following. (a) (10 marks) Interior pointMarch 14, 2007(b) (10 marks) Connected sets (If you use the term separated sets dene it too) (c) (10 marks) Directional derivative2. (30 marks) Prove this theo
Wilfrid Laurier - MATH - 403
MATH 403 (Winter 2007) Assignment 3Due: March 7, 20071. Use the - characterization of continuity to verify that the following functions are continuous at the given points. (a) (3 marks) f (x) = 3x - 7y, at point p = (4, 2) . (b) (5 marks) g (x)
University of Iowa - PSYCHOLOGY - 31174
University of Iowa - PSYCHOLOGY - 31174
LSA 33:144/Psych 31:174 Mind and Behavior: Natural Science and Cognition after Darwin Brief Essay 2 DUE: March 11 (Please provide two copies of this essay when you turn it in.)Write a 2 - 4 page essay addressing the following question:B.F. Skinne
University of Iowa - PSYCHOLOGY - 31241
AuditoryProcessingPhysical Dimension Amplitude Frequency ComplexityPerceptual Dimension Loudness Pitch TimbreW. W. NortonW. W. NortonW. W. NortonW. W. NortonMapping the Auditory System in Rhesus MonkeysTheUniversityofIowa Departmentof
Allan Hancock College - ELEC - 2041
Overview ELEC2041 Microprocessors and Interfacing Lectures 24: Compiler, Assembler, Linker and Loader I http:/webct.edtec.unsw.edu.au/ Compiler Assembler Linker Loader ExampleMay 2006 Saeid Nooshabadi saeid@unsw.edu.auELEC2041 lec24-linker-I
Allan Hancock College - ELEC - 2041
Overview ELEC2041 Microprocessors and Interfacing Lectures 25: Compiler, Assembler, Linker and Loader II http:/webct.edtec.unsw.edu.au/ Assembler Linker Loader ExampleMay 2006 Saeid NooshabadiELEC2041 lec25-linker.II.1saeid@unsw.edu.auSae
N.C. State - CS - 746
Crossing-Over and Recombination Updated 2/20/06 Required Readings: Fu, H. et al., 2002. Recombination rates between adjacent genic and retrotransposon regions in maize vary by 2 orders of magnitude. PNAS 99:1082-1087. Yao, H. et al. 2002. Molecular c
N.C. State - MA - 242
MA 242Fall, 1999Final Exam - SOLUTIONSL. K. Norris1. (15 pts) The position vector of a moving particle is given by r(t) = (t2 - 1)i + (t3 - t + 1) + t2 k (a) Find the velocity and acceleration vectors of the particle for t 0. 5 points SOLUTI
N.C. State - MA - 242
MA 242 Section 009Fall, 1999Test #1L. K. Norris(You must show your work to receive partial credit)1. (10 pts) Sketch the surfaces given by the following equations: (a) z + 9 = x2 + y 2 , (b) y 2 + 4z 2 = x2 + 9SOLUTION: The rst is an elli
N.C. State - MA - 401
A Summary of Various SL Problems L.K. Norris No. 1 2 3 4 SL Type Regular Regular Periodic Regular ODE y + y = 0 y + y = 0 y + y = 0 y + y = 0 Boundary Conditions y(0) = y() = 0 y (0) = y () = 0 y() = y() y () = y () y(0) = 0 y() + y () = 0 Eigenvalue
N.C. State - MA - 242
N.C. State - MA - 242
Middlebury - MATH - 0410
MATH 410 HOMEWORK #3 (due Wednesday, February 25)Spring 2009I. Look at our handout for the gambler's ruin probabilities. For p &lt; q, consider the chance of winning when you are 25, 10 or 5 units away from the goal N. Observe that these probabiliti
N.C. State - AEE - 526
The Joy of Using ExcelNCSU Agricultural &amp; Extension Education AEE 526Excel is a _ Program A. layout B. record keeping C. data base D. spreadsheet CORRECT ANSWER: spreadsheetNCSU Agricultural &amp; Extension Education AEE 526What is Excel
University of Iowa - M - 170
What Every Computer Scientist Should Know About Floating Point ArithmeticENote This document is an edited reprint of the paper What Every Computer Scientist Should Know About Floating-Point Arithmetic, by David Goldberg, published in the March,
University of Texas - PSY - 355
Attention What is attention? Low level attention Serial and parallel searchWhat is attention? How is the word used? Examples Something fluttering caught my attention I didnt see you, I was paying attention to the game. I struggled to pay at
University of Iowa - HEPSUN - 29140
Exam 2 Practice Material Physics 29:140 1. Do Problems 4.2 -4.4, 5.2-5.6, 10.7-10.8 in Squires 2. For fun look at Problem 10.6 in Squires. Here you will see that the spherical harmonics, Ylm 's are precisely the simultaneous eigenstates of L2 and Lz
University of Iowa - HEPSUN - 29105
1 Hobbie, Intermediate Physics for Medicine and Biology, 3rd. ed.ErrataLast modified: Apr. 19, 1999 Most of these corrections were found by students in my classes during the 1997-1998 academic year. I am particularly grateful to In-young Choi, who
University of Iowa - HEPSUN - 29140
HW 3, 29:140 Due Wednesday, 17 September, 2008 (10 points each problem.) 1. (Give yourself ample time for this one.) Consider again a classical point particle in a central potential given by: V (r) = krl , where k is a constant and l is an integer, r
University of Iowa - ECN - 56244
PROCEEDINGS of the HUMAN FACTORS AND ERGONOMICS SOCIETY 43rd ANNUAL MEETING - 199951THE USE OF PREDICTIVE DISPLAYS FOR AIDING CONTROLLER SITUATION AWARENESS Mica Endsley SA Technologies Randy Sollenberger &amp; Earl Stein Federal Aviation Administrat
University of Iowa - ECN - 56244
PROCEEDINGS of the HUMAN FACTORS AND ERGONOMICS SOCIETY 43rd ANNUAL MEETING - 19991FREE FLIGHT AND THE AIR TRAFFIC CONTROLLER: ACTIVE CONTROL VERSUS PASSIVE MONITORING Ulla Metzger and Raja Parasuraman Cognitive Science Laboratory The Catholic Un
University of Iowa - ECN - 56244
RotorcraftAirframe ComponentsSource: AC-8083Flight Controls Collective Throttle Correlator Governor Cyclic Anti torque pedalsSource: AC-8083 Piston Turbine Transmission MR, TRPowerplant Drive LineSource: AC-8083Rotor System
University of Iowa - ECN - 56244
56:244 Airborne Design of ExperimentsFall, 2007Flight Testing Flight testing Validate predictive models Very data driven Typically a component of aeronautical engineering A specialized and potentially dangerous technical activity Requires s
University of Iowa - ECN - 56244
Decision Making, Recipe for Disaster Big Decision catastrophes Space Shuttle Challenger Decision to launch in spite of suggestions that booster O-Rings not designed for cold temperatures USS Vincennes Engaged in a skirmish with enemy vessels, t
University of Iowa - ECN - 56244
Vision in Driving Complex topic Visual acuity Contrast sensitivity and threshold contrast Color perception Importance Information acquisition in driving almost exclusively visual Safety for the driver and for others Detecting changes in the
University of Iowa - ECN - 56244
Automated Highway SystemsFigure from Volpe AHS PageSome level of automation Hands/feet off Dual mode, normal and automated Improved safety Increased efficiency Predictable trip times Less pollution Less stress (?) Degree of automation Adapti
University of Iowa - ECN - 56244
Human Factors in the Air Transport System56:244 - Human Factors in Transportation Tom Schnell, 200056:244 Human Factors in Transportation1National Airspace System Aircraft and Crew Air Traffic Control Infrastructure Tom Schnell, 20005
University of Iowa - ECN - 56244
EVALUATIONOFTRAFFICFLOWANALYSIS TOOLSAPPLIEDTOWORKZONESBASEDON FLOWDATACOLLECTEDINTHEFIELD Jeff Mohror Jen Blackhurst 56:244 Human Factors in Transportation November 28,2000Literature ReviewFatalities in Work Zones High of 868 in 1999Fa tal iti
University of Iowa - PHIL - 56244
DRIVER VISIBILITY OF PAVEMENT MARKINGS AT NIGHTPhil Ohme Operator Performance Laboratory, Human Factors Department of Industrial Engineering University of IowaNighttime Visibility of Pavement Markings Light to observerReference axis Light from h