Wildlife Research, 2001, 28, 111–119

Kullback–Leibler information as a basis for strong inference in ecological studies

Kenneth P. Burnham^A and David R. Anderson

Colorado Cooperative Fish and Wildlife Research Unit, Colorado State University, Fort Collins, CO 80523, USA. [Employed by USGS, Division of Biological Resources]
^A Email: [email protected]

Abstract. We describe an information-theoretic paradigm for analysis of ecological data, based on Kullback–Leibler information, that is an extension of likelihood theory and avoids the pitfalls of null hypothesis testing. Information-theoretic approaches emphasise a deliberate focus on the a priori science in developing a set of multiple working hypotheses or models. Simple methods then allow these hypotheses (models) to be ranked from best to worst and scaled to reflect a strength of evidence using the likelihood of each model (g_i), given the data and the models in the set (i.e. L(g_i | data)). In addition, a variance component due to model-selection uncertainty is included in estimates of precision. There are many cases where formal inference can be based on all the models in the a priori set, and this multi-model inference represents a powerful, new approach to valid inference. Finally, we strongly recommend that inferences based on a priori considerations be carefully separated from those resulting from some form of data dredging. An example is given for questions related to age- and sex-dependent rates of tag loss in elephant seals (Mirounga leonina).

Introduction

Theoretical and applied ecologists are becoming increasingly dissatisfied with the traditional testing-based aspects of frequentist statistics. Over the past 50 years a large body of statistical literature has shown the testing of null hypotheses to have relatively little utility, in spite of their very widespread use (Nester 1996). Inman (1994) provides a historical perspective on this issue by highlighting the points of a heated exchange in the published literature between R. A. Fisher and Karl Pearson in 1935. In the applied ecology literature Yoccoz (1991), Cherry (1998), Johnson (1999) and Anderson et al. (2000) have written on this specific issue. The statistical null hypothesis testing approach is not wrong, but it is relatively uninformative and, thus, slows scientific progress and understanding.

Bayesian approaches are relatively unknown to ecologists and will likely remain so because this material is not commonly offered in statistics departments, except perhaps in advanced courses. Still, an increasing number of people think that Bayesian statistics offer an acceptable alternative (Gelman et al. 1995; Ellison 1996), while others are leery (Forster 1995; Dennis 1996). In addition, there are fundamental issues with the subjectivity inherent in many Bayesian methods, and this has unfortunately divided the field of statistics for many decades. Also, much of Bayesian statistics has been developed from the viewpoint of decision theory. We find that science is most involved with estimation, prediction and understanding and, less so, with decision-making (see Berger 1985 for a discussion of decision-making).

The purpose of this paper is to introduce readers to the use of Kullback–Leibler information as a basis for making valid inference from the analysis of empirical data. We provide this introduction because information-theoretic approaches are simple, easy to learn and understand, compelling, and quite general.
This class of methods allows one to select the best model from an a priori set, rank and scale the models, and include model-selection uncertainty into estimates of precision. Information-theoretic approaches provide an effective strategy for objective data analysis (Burnham and Anderson 1998; Anderson and Burnham 1999). Finally, we provide a simple approach to making formal inference from more than a single model (multi-model inference, or MMI). We believe the information-theoretic approaches are excellent for the analysis of ecological data, whether experimental or observational, and provide a rational alternative to the testing-based frequentist methods and the computer-intensive Bayesian methods.

The central inferential issues in science are two-fold. First, scientists are fundamentally interested in estimates of the magnitude of the parameters or differences between parameters, and their precision; are the differences trivial, small, medium, or large? Are the differences biologically meaningful? This is an estimation problem. Second, one often wants to know whether the differences are large enough to justify inclusion in a model to be used for further inference (e.g. prediction), and this is a model-selection problem. These central issues are not properly addressed by statistical null hypothesis-testing. In particular, hypothesis-testing is a poor approach to model selection or variable selection (e.g. forward or backward selection in regression analysis).

The application of information-theoretic approaches is relatively new; however, a number of papers using these methods have already appeared in the fisheries, wildlife and conservation biology literature. Research into the analysis of marked birds has made heavy use of these new methods (see the special supplement of Bird Study, 1999, Vol. 46). Program MARK (White et al. 2001) allows a full analysis of data under the information-theoretic paradigm, including model averaging and estimates of precision that include model-selection uncertainty. Distance sampling and analysis theory (Buckland et al. 1993) should often be based on this theory, with an emphasis on making formal inference from several models. The large data sets on the threatened northern spotted owl in the United States have been the subject of large-scale analyses using these new methods (see Burnham et al. 1996). Burnham and Anderson (1998) provide a number of other examples, including formal experiments to examine the effect of a treatment, studies of spatial overlap in Anolis lizards in Jamaica, line-transect sampling of kangaroos at Wallaby Creek in Australia, predicting the frequency of storms in South Africa, and the time distribution of an insecticide (Dursban®) in a simulated ecosystem. Burnham and Anderson (1998: 96–99) provide an example of a simulated experiment on starlings (Sturnus vulgaris) to illustrate that substantial differences can arise between the results of hypothesis-testing and model-selection criteria. Another example relates to time-dependent survival of sage grouse (Centrocercus urophasianus), where Akaike's Information Criterion selected a model with 4 parameters whereas hypothesis tests suggested a model with 58 parameters (Burnham and Anderson 1998: 106–109). Information-theoretic methods have found heavy use in other fields of science (e.g. time series analysis).
Science Philosophy

First we must agree that there are no true models; instead, models, by definition, are only approximations to unknown reality or truth. George Box made the famous statement 'All models are wrong but some are useful'. In the analysis of empirical data, one must face the question 'What model should be used to best approximate reality, given the data at hand?' (the best model depends on sample size). The information-theoretic paradigm rests on the assumption that good data, relevant to the issue, are available and that these have been collected in an appropriate manner. Three general principles guide us in model-based inference in the sciences.

Simplicity and Parsimony. Many scientific concepts and theories are simple, once understood. In fact, Occam's Razor implores us to 'shave away all but what is necessary'. Albert Einstein is supposed to have said 'Everything should be made as simple as possible, but no simpler'. Parsimony enjoys a featured place in scientific thinking in general and in modelling specifically (see Forster and Sober 1994 and Forster 2001 for a strictly science-philosophy perspective).

[Fig. 1. The principle of parsimony: the conceptual trade-off between squared bias (solid line) and variance (i.e. uncertainty) versus the number of estimable parameters in the model. The best model has dimension (K_0) near the intersection of the two lines, while full reality lies far to the right of the trade-off region.]

Model selection (variable selection in regression is a special case) is a bias v. variance trade-off, and this is the principle of parsimony (Fig. 1). Models with too few parameters (variables) have bias, whereas models with too many parameters (variables) may have poor precision or tend to identify effects that are, in fact, spurious (slightly different issues arise for count data v. continuous data). These considerations call for a balance between under- and over-fitted models, the so-called 'model selection problem' (see Forster 2000).

Multiple Working Hypotheses. Over 100 years ago, Chamberlin (1890, reprinted 1965) advocated the concept of 'multiple working hypotheses'. Here, there is no null hypothesis; instead, there are several well-supported hypotheses (equivalently, 'models') that are being entertained. The a priori 'science' of the issue enters at this important point. Relevant empirical data are then gathered and analysed, and the results tend to support one or more hypotheses, while providing less support for others. Repetition of this general approach leads to advances in the sciences. New or more elaborate hypotheses are added, while hypotheses with little empirical support are gradually dropped from consideration. At any one point in time, there are multiple hypotheses (models) still under consideration. An important feature of this multiplicity is that the number of alternative models should be kept small (Zucchini 2000); the analysis of, say, hundreds of models is not justified except when prediction is the only objective, or in the most exploratory phases of an investigation.

Strength of Evidence. Providing information to judge the 'strength of evidence' is central to science. Null-hypothesis-testing provides only arbitrary dichotomies (e.g. significant v.
non-significant), and in the all-too-often-seen case where the null hypothesis is obviously false on a priori grounds, the test result is superfluous. Royall (1997) provides an interesting discussion of the likelihood-based strength-of-evidence approach in simple statistical situations.

The information-theoretic paradigm is partially grounded in the three principles above. Impetus for the general approach can be traced to several major advances made over the past half century, and this history will serve as an introduction to the subject.

Advance 1 – Kullback–Leibler information

In 1951 S. Kullback and R. A. Leibler published a now-famous paper that examined the scientific meaning of 'information' related to R. A. Fisher's concept of a 'sufficient statistic'. Their celebrated result, now called Kullback–Leibler information, is a fundamental quantity in the sciences and has earlier roots back to Boltzmann's (1877) concept of entropy. Boltzmann's entropy and the associated Second Law of Thermodynamics represent one of the most outstanding achievements of 19th-century science.

Kullback–Leibler (K–L) information is a measure (a 'distance' in a heuristic sense) between conceptual reality, f, and an approximating model, g, and is defined for continuous functions as the integral

$$I(f, g) = \int f(x)\,\log_e\!\left(\frac{f(x)}{g(x)}\right) dx,$$

where f and g are n-dimensional probability distributions. K–L information, denoted I(f, g), is the 'information' lost when model g is used to approximate reality, f. The analyst seeks an approximating model that loses as little information as possible; this is equivalent to minimising I(f, g) over the set of models of interest (we assume there are R a priori models in the candidate set).

Boltzmann's entropy H is −I(f, g), although these quantities were derived along very different lines. Boltzmann derived the fundamental relationship between entropy (H) and probability (P) as

$$H = \log_e(P),$$

and because H = −I(f, g), one can see that entropy, information and probability are linked, allowing probabilities to be multiplicative whereas information and entropy are additive. K–L information can be viewed as an extension of the famous Shannon (1948) entropy and is often referred to as 'cross entropy'. In addition, there is a close relationship with Jaynes' (1957) 'maximum entropy principle', or MaxEnt (see Akaike 1977, 1983a, 1985). Cover and Thomas (1991) provide a nice introduction to information theory in general. K–L information, by itself, will not aid in data analysis, as both reality (f) and the parameters (θ) in the approximating model are unknown to us. H. Akaike made the next breakthrough in the early 1970s.
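To fix ideas before turning to that breakthrough, here is a minimal numerical sketch (our addition; the paper contains no code) of the discrete analogue of I(f, g). The distributions are invented for illustration; only in such a toy setting is reality f actually known.

```python
import numpy as np

# Hypothetical discrete distributions; in practice, reality f is unknown.
f  = np.array([0.40, 0.30, 0.20, 0.10])   # conceptual "reality"
g1 = np.array([0.35, 0.30, 0.25, 0.10])   # a close approximating model
g2 = np.array([0.25, 0.25, 0.25, 0.25])   # a cruder approximating model

def kl_information(f, g):
    """Discrete analogue of I(f, g): sum of f(x) * log_e(f(x) / g(x))."""
    return float(np.sum(f * np.log(f / g)))

print(kl_information(f, g1))   # ~0.009: g1 loses little information about f
print(kl_information(f, g2))   # ~0.107: g2 loses more, so it is the poorer model
```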
Advance 2 – Estimation of Kullback–Leibler information (AIC)

Akaike (1973, 1974) found a formal relationship between K–L information (a dominant paradigm in information and coding theory) and maximum likelihood (the dominant paradigm in statistics) (see deLeeuw 1992). This finding makes it possible to combine estimation (e.g. maximum likelihood or least squares) and model selection under a single theoretical framework: optimisation. Akaike's breakthrough was the finding of an estimator of the expected, relative K–L information, based on the maximised log-likelihood function. Akaike's derivation (which is for large samples) relied on K–L information as averaged entropy, and this led to 'Akaike's information criterion' (AIC),

$$\mathrm{AIC} = -2\log_e(L(\hat{\theta} \mid \mathrm{data})) + 2K,$$

where log_e(L(θ̂ | data)) is the value of the maximised log-likelihood over the unknown parameters (θ), given the data and the model, and K is the number of estimable parameters in that approximating model. In the special case of least-squares (LS) estimation with normally distributed errors for all R models in the set, and apart from an arbitrary additive constant, AIC can be expressed as

$$\mathrm{AIC} = n\log_e(\hat{\sigma}^2) + 2K,$$

where σ̂² = Σε̂ᵢ²/n and the ε̂ᵢ are the estimated residuals from the fitted model. In this case the number of estimable parameters, K, must be the total number of parameters in the model, including the intercept and σ². Thus, AIC is easy to compute from the results of LS estimation in the case of linear models, or from the results of a likelihood-based analysis in general (Edwards 1992; Azzalini 1996). Akaike's procedures are now called information-theoretic because they are based on K–L information (see Akaike 1983b, 1992, 1994).

Assuming that a set of a priori candidate models has been defined and is well supported by the underlying science, AIC is computed for each of the approximating models in the set (i.e. g_i, i = 1, 2, ..., R). The model for which AIC is minimal is selected as best for the empirical data at hand. This is a simple, compelling concept, based on deep theoretical foundations (i.e. entropy, K–L information, and likelihood theory). AIC is not a test in any sense: no single hypothesis (model) is made to be the 'null', there is no arbitrary α level, and there is no arbitrary notion of 'significance'. Instead, there are concepts of evidence and a 'best' inference, given the data and the set of a priori models representing the scientific hypotheses of interest.

When K is large relative to sample size n (which includes when n is small, for any K) there is a small-sample (second-order) version called AIC_c,

$$\mathrm{AIC}_c = -2\log_e(L(\hat{\theta})) + 2K + \frac{2K(K+1)}{n - K - 1}$$

(see, for example, Hurvich and Tsai 1989), and this should be used unless n/K > ~40. Both AIC and AIC_c are estimates of expected, relative Kullback–Leibler information and are useful in the analysis of real data in the 'noisy' sciences. Assuming independence, AIC-based model selection is equivalent to certain cross-validation methods (Stone 1974, 1977), and this is an important property.

Akaike's general approach allows the best model in the set to be identified, but also allows the rest of the models to be easily ranked. Here, it is very useful (essentially imperative) to rescale AIC (or AIC_c) values such that the model with the minimum information criterion has a value of 0, i.e.

$$\Delta_i = \mathrm{AIC}_i - \min \mathrm{AIC}.$$

The Δ_i values are easy to interpret, and allow a quick 'strength of evidence' comparison and ranking of candidate hypotheses or models. The larger the Δ_i, the less plausible is fitted model i as the best approximating model in the candidate set. It is generally important to know which model (hypothesis) is second best (the ranking), as well as some measure of its standing with respect to the best model. Some simple rules of thumb are often useful in assessing the relative merits of models in the set: models having Δ_i ≤ 2 have substantial support (evidence), those where 4 ≤ Δ_i ≤ 7 have considerably less support, while models having Δ_i > 10 have essentially no support. An improved method for scaling models appears in the next section.
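These quantities are straightforward to compute. A minimal sketch follows (our addition; the maximised log-likelihoods, parameter counts and sample size are hypothetical):

```python
import numpy as np

def aic(log_lik, k):
    """AIC = -2 log_e(L(theta_hat | data)) + 2K."""
    return -2.0 * log_lik + 2.0 * k

def aicc(log_lik, k, n):
    """Second-order AIC_c; recommended unless n/K > ~40."""
    return aic(log_lik, k) + 2.0 * k * (k + 1.0) / (n - k - 1.0)

# Hypothetical maximised log-likelihoods and parameter counts for R = 3 models.
log_liks = [-642.3, -640.9, -640.6]
ks       = [3, 5, 8]
n        = 120                      # sample size

scores = np.array([aicc(ll, k, n) for ll, k in zip(log_liks, ks)])
deltas = scores - scores.min()      # Delta_i = AICc_i - min AICc
print(deltas)                       # ~[0.0, 1.5, 7.7]; the simplest model is best here
```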
The Δ_i values allow an easy ranking of hypotheses (models) in the candidate set. One must turn to goodness-of-fit tests or other measures to determine whether any of the models is good in some absolute sense. For count data, we suggest a standard goodness-of-fit test, whereas standard measures such as R² and σ̂² in regression and analysis of variance are often useful. Justification of the models in the candidate set is a very important issue. This is where the science of the problem enters the scene. Ideally, there ought to be a justification of the models in the set and a defence of why some models should remain out of the set. This is an area where ecologists need to spend much more time just thinking, well prior to data analysis and, perhaps, prior to data collection.

The principle of parsimony, or Occam's razor, provides a philosophical basis for model selection; Kullback–Leibler information provides an objective target based on deep, fundamental theory; and the information criteria (AIC and AIC_c), along with likelihood- or least-squares-based inference, provide a practical, general methodology for use in the analysis of empirical data. Objective data analysis can be rigorously based on these principles without having to assume that the 'true model' is contained in the set of candidate models; surely there are no true models in the biological sciences!

Advance 3 – Likelihood of a model, given the data

The simple transformation exp(−Δ_i/2), for i = 1, 2, ..., R, provides the likelihood of the model (Akaike 1981), given the data: L(g_i | data). This is a likelihood function over the model set in the same sense that L(θ | data, g_i) is the likelihood over the parameter space (for model g_i) of the parameters θ, given the data (x) and the model (g_i). The relative likelihood of model i versus model j is L(g_i | data)/L(g_j | data); this ratio does not depend on any of the other models under consideration. Without loss of generality we may assume model g_i is more likely than g_j. Then, if this ratio is large (e.g. >10), model g_j is a poor model for the data relative to model g_i. The expression L(g_i | data)/L(g_j | data) can be regarded as an evidence ratio: the evidence for model i versus model j.

It is often convenient to normalise these likelihoods such that they sum to 1, hence we use

$$w_i = \frac{\exp(-\Delta_i/2)}{\sum_{r=1}^{R} \exp(-\Delta_r/2)}.$$

The w_i, called Akaike weights, are useful as the 'weight of evidence' in favour of model i as being the actual K–L best model in the set. The ratios w_i/w_j are identical to the original likelihood ratios, L(g_i | data)/L(g_j | data); however, the w_i, i = 1, ..., R, are useful in additional ways. For example, the w_i are interpreted approximately as the probability that model i is, in fact, the K–L best model for the data. This latter inference about model-selection uncertainty is conditional on both the data and the full set of a priori models considered. There are simple methods to provide a confidence set on the models, in the same sense as a confidence set for estimates of parameters, and to allow prior (Bayesian-type) information to affect these weights (see Burnham and Anderson 1998: 126–128).
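A minimal sketch of the weight calculation (our addition, continuing the hypothetical Δ_i values from the sketch above):

```python
import numpy as np

def akaike_weights(deltas):
    """w_i = exp(-Delta_i/2) / sum over r of exp(-Delta_r/2)."""
    rel_lik = np.exp(-0.5 * np.asarray(deltas))   # L(g_i | data), up to a constant
    return rel_lik / rel_lik.sum()

deltas = [0.0, 1.5, 7.7]    # hypothetical Delta_i values from the sketch above
w = akaike_weights(deltas)
print(w)                    # ~[0.67, 0.32, 0.01]: weight of evidence per model
print(w[0] / w[1])          # evidence ratio, model 1 versus model 2: ~2.1
```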
Advance 4 – Unconditional sampling variance

Typically, estimates of sampling variance are conditional on a 'given' model, as if there were no uncertainty about which model to use (Breiman 1992 calls this a 'quiet scandal'). When model selection has been done, there is a variance component due to model-selection uncertainty that should be incorporated into estimates of precision. That is, one needs estimates that are 'unconditional' on the selected model. Here the estimates are unconditional on any particular model, but conditional on the R models in the a priori set. A simple estimator of the unconditional variance for the maximum likelihood estimator of a parameter from the selected (best) model is

$$\widehat{\mathrm{var}}(\hat{\bar{\theta}}) = \left[\sum_{i=1}^{R} w_i \sqrt{\widehat{\mathrm{var}}(\hat{\theta}_i \mid g_i) + (\hat{\theta}_i - \hat{\bar{\theta}})^2}\,\right]^2, \qquad \text{where } \hat{\bar{\theta}} = \sum_{i=1}^{R} w_i\,\hat{\theta}_i,$$

and this represents a form of frequentist 'model averaging'. The notation θ̂_i here means that the parameter θ is estimated on the basis of model g_i, but θ is a parameter in common to all R models (such as occurs with prediction). This estimator, from Buckland et al. (1997), includes a term for the conditional sampling variance, given model g_i (denoted as var(θ̂_i | g_i) here), and a variance component for model-selection uncertainty, (θ̂_i − θ̂̄)². These variance components are multiplied by the Akaike weights, which reflect the degree of support or evidence for model i. The unconditional variance and its square root are appropriate measures of precision after model selection. The usual 95% confidence interval, θ̂ ± 2ŝe(θ̂), should be based on the unconditional variance. Alternatively, intervals can be based on log- or logit-transformations (Burnham et al. 1987), profile likelihoods (Royall 1997) or bootstrap methods (Efron and Tibshirani 1993). Burnham and Anderson (1998: chapter 5) provide a number of Monte Carlo results on achieved confidence-interval coverage when information-theoretic approaches are used in some moderately challenging data sets. Model averaging (see below) arises naturally when the unconditional variance is derived (Burnham and Anderson 1998: section 4.2.6).

Advance 5 – Multi-model inference (MMI)

Rather than base inferences on a single, selected best model from an a priori set of models, inference can be based on the entire set of models (multi-model inference, or MMI). Such inferences can be made if a parameter, say θ, is in common over all models (as θ_i in model g_i), or if the goal is prediction. Then, by using the weighted average for that parameter across models (i.e. θ̂̄ = Σ w_i θ̂_i), we are basing point inference on the entire set of models. This approach has both practical and philosophical advantages (see Hoeting et al. 1999 for a discussion of model averaging in a Bayesian context). Where a model-averaged estimator can be used, it often has better precision and reduced bias compared with the estimator of that parameter from just the selected best model (Burnham and Anderson 1998: chapters 4 and 5); a sketch of the computation is given below.
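The following sketch (our addition) implements the Buckland et al. (1997) estimator above for a hypothetical parameter θ common to R = 3 models; the estimates, conditional variances and weights are invented for illustration.

```python
import numpy as np

def model_average(theta_hats, cond_vars, weights):
    """Model-averaged estimate and unconditional variance (Buckland et al. 1997)."""
    t, v, w = map(np.asarray, (theta_hats, cond_vars, weights))
    theta_bar = np.sum(w * t)   # theta_bar = sum_i w_i * theta_hat_i
    # [ sum_i w_i * sqrt( var(theta_hat_i | g_i) + (theta_hat_i - theta_bar)^2 ) ]^2
    se = np.sum(w * np.sqrt(v + (t - theta_bar) ** 2))
    return theta_bar, se ** 2

# Hypothetical per-model estimates of theta, conditional variances and Akaike weights.
theta_bar, var_u = model_average(theta_hats=[0.62, 0.58, 0.55],
                                 cond_vars=[0.0021, 0.0025, 0.0030],
                                 weights=[0.67, 0.32, 0.01])
# Approximate 95% interval based on the unconditional standard error:
lo, hi = theta_bar - 2 * var_u ** 0.5, theta_bar + 2 * var_u ** 0.5
```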
Assessment of the relative importance of variables has often been based only on the best model (e.g. often selected using a stepwise testing procedure). Variables in that best model are considered 'important', while excluded variables are considered not important. This is far too simplistic. Variable importance can be refined by making inference from all the models in the candidate set (see Burnham and Anderson 1998: 140–151). Akaike weights are summed for all models containing a given predictor variable x_j; we denote the resultant sum as w₊(j). For each variable considered we can compute its predictor weight. The predictor variable with the largest predictor weight, w₊(j), is estimated to be the most important; the variable with the smallest sum is estimated to be the least important predictor (as with all inferences, there is uncertainty about the inferred order of variable importance). This procedure is superior to making inferences concerning the relative importance of variables based only on the best model. This is particularly important when the second or third best model is nearly as well supported as the best model, or when all models have nearly equal support. (There are 'design' considerations about the set of models to consider when a goal is assessing the importance of predictor variables; we do not discuss these considerations here. The key issue is one of balance of models with and without each variable.)

Advance 6 – Incorporation of overdispersion for count data

Much of the statistical analysis in wildlife and ecology deals with count data (e.g. capture–recapture), and overdispersion is a fact of life with such count data. When there is more variation than predicted by Poisson or multinomial probability distributions, the data are termed overdispersed (Agresti 1990: 42). A partial dependence in the count data most often underlies the overdispersion; however, parameter heterogeneity is another contributor. Kullback–Leibler-based model-selection and inference methods have been adapted to deal with overdispersion based on ideas from quasi-likelihood methods and variance inflation (Wedderburn 1974). The usual models of count data implicitly assume a theoretical sampling variance. However, common violations of stochastic assumptions will lead to data more variable than assumed, and can do so without affecting structural aspects of the model. In this case, there is an overdispersion coefficient, c, such that c > 1 and actual variances are obtained as c × theoretical variances. Typically, with overdispersion, c is only a little larger than 1, say 1 < c < 4; c is estimated on the basis of the data.

We denote the quasi-likelihood modifications to AIC and AIC_c as (Lebreton et al. 1992; see also Hurvich and Tsai 1995; Burnham and Anderson 1998)

$$\mathrm{QAIC} = \frac{-2\log_e(L(\hat{\theta}))}{\hat{c}} + 2K$$

and

$$\mathrm{QAIC}_c = \frac{-2\log_e(L(\hat{\theta}))}{\hat{c}} + 2K + \frac{2K(K+1)}{n-K-1} = \mathrm{QAIC} + \frac{2K(K+1)}{n-K-1}.$$

When no overdispersion exists, c = 1, so the formulae for QAIC and QAIC_c then reduce to AIC and AIC_c, respectively. We note that ĉ < 1 should not be used (use 1) and that, when c is estimated, it counts as a parameter and should be included in K, the number of estimable parameters in the model (this last point was not mentioned or done in Burnham and Anderson 1998 – an oversight). Only one estimate of c should be used along with a set of models (varying ĉ over the models produces invalid results). Often there will be a global model within which all other models are nested. Then we obtain ĉ from the goodness-of-fit chi-square statistic (χ²) for the global model and its degrees of freedom (d.f.):

$$\hat{c} = \chi^2 / \mathrm{d.f.}$$

More discussion and guidance on QAIC, ĉ and variance inflation are given in Burnham and Anderson (1998).
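A sketch of these adjustments (our addition; the χ² and d.f. values are taken from the elephant-seal example in the next section, and the log-likelihood arguments are left to the caller):

```python
def c_hat(chi2, dof):
    """Variance inflation factor from the global model's goodness-of-fit statistic."""
    return max(1.0, chi2 / dof)        # c_hat < 1 should not be used (use 1)

def qaic(log_lik, k, c):
    """QAIC = -2 log_e(L(theta_hat)) / c_hat + 2K; K must count c_hat as a parameter."""
    return -2.0 * log_lik / c + 2.0 * k

def qaicc(log_lik, k, c, n):
    """Small-sample version: QAIC_c = QAIC + 2K(K+1)/(n - K - 1)."""
    return qaic(log_lik, k, c) + 2.0 * k * (k + 1.0) / (n - k - 1.0)

# Pooled goodness-of-fit result from the elephant-seal example below:
c = c_hat(chi2=255.12, dof=161)        # ~1.58; one c_hat for the whole model set
```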
An Example

Pistorius et al. (2000) evaluated age- and sex-dependent rates of tag loss in southern elephant seals, and used information-theoretic methods as the basis for data analysis and inference. Specifically, they considered 4 models representing rates of tag loss as either constant or age- or sex-dependent. We will use their results and make a number of extensions for illustrative purposes. We are not attempting to present a reanalysis and reinterpretation of their data; instead, we wish only to show that additional steps might be considered. Details of this study are contained in Pistorius et al. (2000) and we assume the reader is familiar with this paper.

We performed a goodness-of-fit test (essentially Test 2, Burnham et al. 1987) on these data, partitioned by gender, and found evidence of overdispersion (χ² for males = 157.20, d.f. = 77; χ² for females = 97.92, d.f. = 84; pooled χ² = 255.12, d.f. = 161). An estimate of the variance inflation factor was ĉ = 255.12/161 = 1.58. This estimate of c may reflect primarily heterogeneity rather than a lack of independence. Interestingly, much of the lack of fit was attributed to a single cell in the data for both males and females (the same cell, by gender). Had these two cells been in line with what was expected from the general model, the estimate of the variance inflation factor would have been only 1.30.

QAIC, rather than AIC, was used for model selection, and model-based estimates of sampling variance should be multiplied by 1.58. Pistorius et al. (2000) used the bootstrap to get robust estimates of sampling variance, thus their estimates of precision should appropriately reflect the overdispersion. The results are summarised in Table 1 and provide substantial support for the model that allows tag loss to be both age- and sex-dependent. Support for the model with age-dependence, but not sex-dependence, is more limited; the evidence ratio for the best model versus the second-best model is 0.82/0.18 = 4.6.

Table 1. Model selection statistics for the southern elephant seal data used to estimate tag loss (see Pistorius et al. 2000)

Model                          −log(L)/ĉ   K^A   QAIC   Δ_i    w_i
Age-constant, sex-constant       1,845      3    2,341   39   0.00
Age-dependent, sex-constant      1,815      4    2,305    3   0.18
Age-constant, sex-dependent      1,845      4    2,343   41   0.00
Age-dependent, sex-dependent     1,811      5    2,302    0   0.82

^A This total includes the estimation of the overdispersion parameter c.

Inference concerning tag loss could be made from the best model, whereby tag loss is a function of both age and sex. Alternatively, model averaging could be used to allow a robust inference about the derived parameters shown in Table 3 of Pistorius et al. (2000). In this case, the model parameters are the β_i, and the age- and sex-dependent estimates of tag loss are derived from the β_i. Model averaging in this example would slightly reduce the difference in estimates of tag loss by gender, relative to those shown in the original paper. Clearly, there is essentially no support for the model whereby tag loss is independent of age and sex, or for the model where tag loss is only sex-dependent.

To measure the relative importance of variables, the w_i values can be summed for all models (only 2 here) with age-dependence and all models with sex-dependence. In the example, w₊(age) = 1, whereas w₊(sex) = 0.82, confirming that age is the more important variable in explaining tag loss in these seals. In this example, model-selection uncertainty was minor, as the data point to the model allowing both age- and sex-specific tag loss. Other examples where substantial model-selection uncertainty exists are given in Burnham and Anderson (1998: chapter 5). The computation of the weights, evidence ratio and variable-importance sums from Table 1 is sketched below.
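As a check on Table 1 (our addition; only the printed QAIC values are used as input), the Akaike weights, the evidence ratio and the variable-importance sums can be reproduced as follows:

```python
import numpy as np

# QAIC values from Table 1, in row order: age-constant/sex-constant,
# age-dependent/sex-constant, age-constant/sex-dependent, age-dependent/sex-dependent.
qaic_vals = np.array([2341.0, 2305.0, 2343.0, 2302.0])

deltas = qaic_vals - qaic_vals.min()   # [39, 3, 41, 0]
w = np.exp(-0.5 * deltas)
w /= w.sum()                           # ~[0.00, 0.18, 0.00, 0.82], as in Table 1

w_age = w[1] + w[3]                    # models with age-dependence: w+(age) ~ 1.00
w_sex = w[2] + w[3]                    # models with sex-dependence: w+(sex) ~ 0.82
print(w[3] / w[1])                     # evidence ratio ~4.5 (4.6 when computed from rounded weights)
```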
Recommendations and Summary

There needs to be increased attention to separating those inferences that rest on a priori considerations from those resulting from some form of data dredging (see Mayo 1996). Essentially, no justifiable theory exists to estimate precision (or test hypotheses, for those still so inclined) when data dredging has taken place. The theory (mis)used is for a priori analyses, assuming the model was the only one fitted to the data. This glaring fact is either not understood by practitioners and journal editors or is simply ignored. Two types of data dredging include (1) an iterative approach, where patterns and differences observed after initial analysis are 'chased' by repeatedly building new models with these effects included, and (2) analysis of 'all possible models'. Data dredging is a poor approach to making reliable inferences about the sampled population, and both types of data dredging are best reserved for more exploratory investigations that probably should remain unpublished. The incorporation of a priori considerations is of paramount importance and, as such, editors, referees and authors should pay much closer attention to these issues and be wary of inferences obtained from post hoc data dredging.

At a conceptual level, reasonable data and a good model allow a separation of 'information' and 'noise'. Here, information relates to the structure of relationships, estimates of model parameters and components of variance. Noise then refers to the residuals: variation left unexplained. We can use the information extracted from the data to make proper inferences and achieve what Romesburg (1981) termed 'reliable knowledge'. We want an approximating model that minimises information loss, I(f, g), and properly separates noise (non-information or entropy) from structural information. In a very important sense, we are not trying to model the data; instead, we are trying to model the information in the data.

Information-theoretic methods are based on deep theory and are quite effective in making strong inferences from the analysis of empirical data. These methods are relatively simple to understand and practical to employ across a very large class of empirical situations and scientific disciplines. The methods are easy to compute by hand if necessary (assuming one has the parameter estimates θ̂_i, the conditional variances var(θ̂_i | g_i), and the maximised log-likelihood values for each of the R candidate models from standard statistical software). Researchers can easily understand the heuristics and application of the information-theoretic methods presented here; we believe it is very important that people understand the methods they employ. Information-theoretic approaches should not be used unthinkingly; a good set of a priori models is essential, and this involves professional judgment and integration of the science of the issue into the model set.

Publication of results under the information-theoretic paradigm would typically have substantial material in the Methods section to discuss and fully justify the candidate models in the set, whereas the Results section would typically present a table showing AIC_c or QAIC_c, K, the maximised log_e(L), Δ_i and w_i for each of the R models, followed by an effective discussion of the scientific interpretation of the table entries.
Further material, including many examples, on information-theoretic methods can be found in recent books by Burnham and Anderson (1998) and McQuarrie and Tsai (1998). Akaike's collected works have recently been published (Parzen et al. 1998) and this book will be of interest to the more quantitatively inclined.

An interesting application of the information-theoretic approach is in conflict resolution in applied aspects of ecology and environmental science (see Anderson et al. 1999 for a general protocol). Here, there are opposing parties in a technical controversy, and data are available that bear on the resolution of the disagreement. In such cases, models would be built to represent the position of each of the parties. For example, consider the case where there are 3 parties and each party might have 2 models that represent their general position; thus there are R = 6 models in the set. Computation of AIC_c and Δ_i for each model would allow a ranking of the various positions (models), while the Akaike weights would allow a scaling and weight of evidence for the opposing parties and their positions. To our knowledge, this approach has not yet been tried in a real controversy (but see Anderson et al. 2001).

Acknowledgments

Dr Peter Boveng (NMFS) provided a summary of the seal tag-loss data for our use in computing goodness-of-fit tests, and arranged for our use of the data from senior author Pistorius. Dr Richard Barker and an anonymous referee offered valuable suggestions that allowed the manuscript to be improved.

References

Agresti, A. (1990). 'Categorical Data Analysis.' (John Wiley & Sons: New York.)

Akaike, H. (1973). Information theory as an extension of the maximum likelihood principle. In 'Second International Symposium on Information Theory'. (Eds B. N. Petrov and F. Csaki.) pp. 267–281. (Akademiai Kiado: Budapest.)

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control AC-19, 716–723.

Akaike, H. (1977). On entropy maximization principle. In 'Applications of Statistics'. (Ed. P. R. Krishnaiah.) pp. 27–41. (North-Holland: Amsterdam.)

Akaike, H. (1981). Likelihood of a model and information criteria. Journal of Econometrics 16, 3–14.

Akaike, H. (1983a). Statistical inference and measurement of entropy. In 'Scientific Inference, Data Analysis, and Robustness'. (Eds G. E. P. Box, T. Leonard and C.-F. Wu.) pp. 165–189. (Academic Press: London.)

Akaike, H. (1983b). Information measures and model selection. International Statistical Institute 44, 277–291.

Akaike, H. (1985). Prediction and entropy. In 'A Celebration of Statistics'. (Eds A. C. Atkinson and S. E. Fienberg.) pp. 1–24. (Springer: New York.)

Akaike, H. (1992). Information theory and an extension of the maximum likelihood principle. In 'Breakthroughs in Statistics. Vol. 1'. (Eds S. Kotz and N. L. Johnson.) pp. 610–624. (Springer-Verlag: London.)

Akaike, H. (1994). Implications of the informational point of view on the development of statistical science. In 'Engineering and Scientific Applications. Vol. 3. Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach'. (Ed. H. Bozdogan.) pp. 27–38. (Kluwer Academic Publishers: Dordrecht, The Netherlands.)

Anderson, D. R., and Burnham, K. P. (1999). General strategies for the analysis of ringing data. Bird Study 46 (suppl.), S261–S270.
Anderson, D. R., Burnham, K. P., Franklin, A. B., Gutierrez, R. J., Forsman, E. D., Anthony, R. G., White, G. C., and Shenk, T. M. (1999). A protocol for conflict resolution in analyzing empirical data related to natural resource controversies. Wildlife Society Bulletin 27, 1050–1058.

Anderson, D. R., Burnham, K. P., and Thompson, W. L. (2000). Null hypothesis testing: problems, prevalence, and an alternative. Journal of Wildlife Management 64, 912–923.

Anderson, D. R., Burnham, K. P., and White, G. C. (2001). Kullback–Leibler information in resolving natural resource conflicts when definitive data exist. Wildlife Society Bulletin.

Azzalini, A. (1996). 'Statistical Inference Based on the Likelihood.' (Chapman and Hall: London.)

Berger, J. O. (1985). 'Statistical Decision Theory and Bayesian Analysis.' 2nd Edn. (Springer-Verlag: New York.)

Boltzmann, L. (1877). Über die Beziehung zwischen dem zweiten Hauptsatze der mechanischen Wärmetheorie und der Wahrscheinlichkeitsrechnung respective den Sätzen über das Wärmegleichgewicht. Wiener Berichte 76, 373–435.

Breiman, L. (1992). The little bootstrap and other methods for dimensionality selection in regression: X-fixed prediction error. Journal of the American Statistical Association 87, 738–754.

Buckland, S. T., Anderson, D. R., Burnham, K. P., and Laake, J. L. (1993). 'Distance Sampling: Estimating Abundance of Biological Populations.' (Chapman and Hall: London.)

Buckland, S. T., Burnham, K. P., and Augustin, N. H. (1997). Model selection: an integral part of inference. Biometrics 53, 603–618.

Burnham, K. P., and Anderson, D. R. (1998). 'Model Selection and Inference: a Practical Information-Theoretic Approach.' (Springer-Verlag: New York.)

Burnham, K. P., Anderson, D. R., White, G. C., Brownie, C., and Pollock, K. H. (1987). Design and analysis methods for fish survival experiments based on release–recapture. American Fisheries Society, Monograph No. 5. 437 pp.

Burnham, K. P., Anderson, D. R., and White, G. C. (1996). Meta-analysis of vital rates of the northern spotted owl. Studies in Avian Biology 17, 92–101.

Chamberlin, T. (1965). The method of multiple working hypotheses. Science 148, 754–759. [Reprint of 1890 paper in Science.]

Cherry, S. (1998). Statistical tests in publications of The Wildlife Society. Wildlife Society Bulletin 26, 947–953.

Cover, T. M., and Thomas, J. A. (1991). 'Elements of Information Theory.' (John Wiley and Sons: New York.)

deLeeuw, J. (1992). Introduction to Akaike (1973) information theory and an extension of the maximum likelihood principle. In 'Breakthroughs in Statistics. Vol. 1'. (Eds S. Kotz and N. L. Johnson.) pp. 599–609. (Springer-Verlag: London.)

Dennis, B. (1996). Should ecologists become Bayesians? Ecological Applications 6, 1095–1103.

Edwards, A. W. F. (1992). 'Likelihood.' Expanded Edn. (The Johns Hopkins University Press: Baltimore, Maryland.)

Efron, B., and Tibshirani, R. J. (1993). 'An Introduction to the Bootstrap.' (Chapman and Hall: New York.)

Ellison, A. M. (1996). An introduction to Bayesian inference for ecological research and environmental decision-making. Ecological Applications 6, 1036–1046.

Forster, M. R. (1995). Bayes or bust: the problem of simplicity for a probabilistic approach to confirmation. British Journal for the Philosophy of Science 46, 399–424.

Forster, M. R. (2000). Key concepts in model selection: performance and generalizability. Journal of Mathematical Psychology 44, 205–231.

Forster, M. R. (2001). The new science of simplicity. In 'Simplicity, Inference and Econometric Modelling'. (Eds H. Keuzenkamp, M. McAleer and A. Zellner.) (Cambridge University Press.)
Forster, M. R., and Sober, E. (1994). How to tell when simpler, more unified, or less ad hoc theories will provide more accurate predictions. British Journal for the Philosophy of Science 45, 1–35.

Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (1995). 'Bayesian Data Analysis.' (Chapman and Hall: London.)

Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. (1999). Bayesian model averaging: a tutorial (with discussion). Statistical Science 14, 382–417.

Hurvich, C. M., and Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika 76, 297–307.

Hurvich, C. M., and Tsai, C.-L. (1995). Model selection for extended quasi-likelihood models in small samples. Biometrics 51, 1077–1084.

Inman, H. F. (1994). Karl Pearson and R. A. Fisher on statistical tests: a 1935 exchange from Nature. The American Statistician 48, 2–11.

Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review 106, 620–630.

Johnson, D. H. (1999). The insignificance of statistical significance testing. Journal of Wildlife Management 63, 763–772.

Kullback, S., and Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics 22, 79–86.

Mayo, D. G. (1996). 'Error and the Growth of Experimental Knowledge.' (University of Chicago Press: London.)

McQuarrie, A. D. R., and Tsai, C.-L. (1998). 'Regression and Time Series Model Selection.' (World Scientific Press: Singapore.)

Nester, M. (1996). An applied statistician's creed. Applied Statistics 45, 401–410.

Parzen, E., Tanabe, K., and Kitagawa, G. (Eds) (1998). 'Selected Papers of Hirotugu Akaike.' (Springer-Verlag: New York.)

Pistorius, P. A., Bester, M. N., Kirkman, S. P., and Boveng, P. L. (2000). Evaluation of age- and sex-dependent rates of tag loss in southern elephant seals. Journal of Wildlife Management 64, 373–380.

Romesburg, H. C. (1981). Wildlife science: gaining reliable knowledge. Journal of Wildlife Management 45, 293–313.

Royall, R. M. (1997). 'Statistical Evidence: a Likelihood Paradigm.' (Chapman and Hall: London.)

Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal 27, 379–423, 623–656.

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions (with discussion). Journal of the Royal Statistical Society, Series B 36, 111–147.

Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society, Series B 39, 44–47.

Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika 61, 439–447.

White, G. C., Burnham, K. P., and Anderson, D. R. (2001). Advanced features of program MARK. In 'Integrating People and Wildlife for a Sustainable Future. Proceedings of the Second International ...