Wildlife Research, 2001, 28, 111–119

Kullback–Leibler information as a basis for strong inference in ecological studies

Kenneth P. Burnham^A and David R. Anderson

Colorado Cooperative Fish and Wildlife Research Unit, Colorado State University, Fort Collins, CO 80523, USA. [Employed by USGS, Division of Biological Resources]
^A Email: [email protected]

Abstract. We describe an information-theoretic paradigm for the analysis of ecological data, based on Kullback–Leibler information, that is an extension of likelihood theory and avoids the pitfalls of null hypothesis testing. Information-theoretic approaches emphasise a deliberate focus on the a priori science in developing a set of multiple working hypotheses or models. Simple methods then allow these hypotheses (models) to be ranked from best to worst and scaled to reflect a strength of evidence using the likelihood of each model (g_i), given the data and the models in the set (i.e. L(g_i | data)). In addition, a variance component due to model-selection uncertainty is included in estimates of precision. There are many cases where formal inference can be based on all the models in the a priori set, and this multi-model inference represents a powerful, new approach to valid inference. Finally, we strongly recommend that inferences based on a priori considerations be carefully separated from those resulting from some form of data dredging. An example is given for questions related to age- and sex-dependent rates of tag loss in elephant seals (Mirounga leonina).

Introduction

Theoretical and applied ecologists are becoming increasingly dissatisfied with the traditional testing-based aspects of frequentist statistics. Over the past 50 years a large body of statistical literature has shown the testing of null hypotheses to have relatively little utility, in spite of their very widespread use (Nester 1996). Inman (1994) provides a historical perspective on this issue by highlighting the points of a heated exchange in the published literature between R. A. Fisher and Karl Pearson in 1935. In the applied ecology literature, Yoccoz (1991), Cherry (1998), Johnson (1999) and Anderson et al. (2000) have written on this specific issue. The statistical null hypothesis testing approach is not wrong, but it is relatively uninformative and, thus, slows scientific progress and understanding.

Bayesian approaches are relatively unknown to ecologists and will likely remain so because this material is not commonly offered in statistics departments, except perhaps in advanced courses. Still, an increasing number of people think that Bayesian statistics offer an acceptable alternative (Gelman et al. 1995; Ellison 1996), while others are leery (Forster 1995; Dennis 1996). In addition, there are fundamental issues with the subjectivity inherent in many Bayesian methods, and this has unfortunately divided the field of statistics for many decades. Also, much of Bayesian statistics has been developed from the viewpoint of decision theory. We find that science is most involved with estimation, prediction and understanding and, less so, with decision-making (see Berger 1985 for a discussion of decision-making).

The purpose of this paper is to introduce readers to the use of Kullback–Leibler information as a basis for making valid inference from the analysis of empirical data. We provide this introduction because information-theoretic approaches are simple, easy to learn and understand, compelling, and quite general.
This class of methods allows one to select the best model from an a priori set, rank and scale the models, and include model-selection uncertainty in estimates of precision. Information-theoretic approaches provide an effective strategy for objective data analysis (Burnham and Anderson 1998; Anderson and Burnham 1999). Finally, we provide a simple approach to making formal inference from more than a single model (multi-model inference, or MMI). We believe the information-theoretic approaches are excellent for the analysis of ecological data, whether experimental or observational, and provide a rational alternative to the testing-based frequentist methods and the computer-intensive Bayesian methods.

The central inferential issues in science are two-fold. First, scientists are fundamentally interested in estimates of the magnitude of the parameters or differences between parameters, and their precision; are the differences trivial, small, medium, or large? Are the differences biologically meaningful? This is an estimation problem. Second, one often wants to know whether the differences are large enough to justify inclusion in a model to be used for further inference (e.g. prediction), and this is a model-selection problem. These central issues are not properly addressed by statistical null hypothesis-testing. In particular, hypothesis-testing is a poor approach to model selection or variable selection (e.g. forward or backward selection in regression analysis).

The application of information-theoretic approaches is relatively new; however, a number of papers using these methods have already appeared in the fisheries, wildlife and conservation biology literature. Research into the analysis of marked birds has made heavy use of these new methods (see the special supplement of Bird Study, 1999, Vol. 46). Program MARK (White et al. 2001) allows a full analysis of data under the information-theoretic paradigm, including model averaging and estimates of precision that include model-selection uncertainty. Distance sampling and analysis theory (Buckland et al. 1993) should often be based on this theory, with an emphasis on making formal inference from several models. The large data sets on the threatened northern spotted owl in the United States have been the subject of large-scale analyses using these new methods (see Burnham et al. 1996). Burnham and Anderson (1998) provide a number of other examples, including formal experiments to examine the effect of a treatment, studies of spatial overlap in Anolis lizards in Jamaica, line-transect sampling of kangaroos at Wallaby Creek in Australia, predicting the frequency of storms in South Africa, and the time distribution of an insecticide (Dursban®) in a simulated ecosystem. Burnham and Anderson (1998: 96–99) provide an example of a simulated experiment on starlings (Sturnus vulgaris) to illustrate that substantial differences can arise between the results of hypothesis-testing and model-selection criteria. Another example relates to time-dependent survival of sage grouse (Centrocercus urophasianus), where Akaike's Information Criterion selected a model with 4 parameters whereas hypothesis tests suggested a model with 58 parameters (Burnham and Anderson 1998: 106–109). Information-theoretic methods have found heavy use in other fields of science (e.g. time series analysis).
Science Philosophy

First we must agree on the fact that there are no true models; instead, models, by definition, are only approximations to unknown reality or truth. George Box made the famous statement 'All models are wrong but some are useful'. In the analysis of empirical data, one must face the question 'What model should be used to best approximate reality, given the data at hand?' (the best model depends on sample size). The information-theoretic paradigm rests on the assumption that good data, relevant to the issue, are available and that these have been collected in an appropriate manner. Three general principles guide us in model-based inference in the sciences.

Simplicity and Parsimony. Many scientific concepts and theories are simple, once understood. In fact, Occam's Razor implores us to 'shave away all but what is necessary'. Albert Einstein is supposed to have said 'Everything should be made as simple as possible, but no simpler'. Parsimony enjoys a featured place in scientific thinking in general and in modelling specifically (see Forster and Sober 1994; Forster 2001, for a strictly science-philosophy perspective).

[Fig. 1. The principle of parsimony: the conceptual trade-off between squared bias (solid line) and variance (i.e. uncertainty) versus the number of estimable parameters in the model. The best model has dimension (K0) near the intersection of the two lines, while full reality lies far to the right of the trade-off region.]

Model selection (variable selection in regression is a special case) is a bias v. variance trade-off, and this is the principle of parsimony (Fig. 1). Models with too few parameters (variables) have bias, whereas models with too many parameters (variables) may have poor precision or tend to identify effects that are, in fact, spurious (slightly different issues arise for count data v. continuous data). These considerations call for a balance between under- and over-fitted models, the so-called 'model selection problem' (see Forster 2000).
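The trade-off in Fig. 1 can be made concrete with a small simulation. The sketch below is ours, not the paper's: the sinusoidal 'reality', the noise level, the sample size and the evaluation point are all arbitrary assumptions chosen for illustration. It fits polynomials of increasing order to repeated noisy samples and reports Monte Carlo estimates of the squared bias and variance of the fitted value at one point; squared bias falls and variance rises as parameters are added.

    import numpy as np

    rng = np.random.default_rng(1)

    def truth(x):
        # Hypothetical "full reality": no finite polynomial matches it exactly.
        return np.sin(2.5 * x)

    n, sigma, n_rep = 30, 0.4, 500    # assumed sample size, noise, replicates
    x = np.linspace(0.0, 2.0, n)
    x0 = 1.3                          # arbitrary point at which bias/variance are assessed

    for degree in range(1, 9):        # increasing model dimension
        preds = np.empty(n_rep)
        for r in range(n_rep):
            y = truth(x) + rng.normal(0.0, sigma, n)
            coefs = np.polyfit(x, y, degree)        # least-squares polynomial fit
            preds[r] = np.polyval(coefs, x0)
        bias2 = (preds.mean() - truth(x0)) ** 2     # squared bias at x0
        var = preds.var()                           # sampling variance at x0
        print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}")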
Multiple Working Hypotheses. Over 100 years ago, Chamberlin (1890, reprinted 1965) advocated the concept of 'multiple working hypotheses'. Here, there is no null hypothesis; instead, there are several well-supported hypotheses (equivalently, 'models') being entertained. The a priori 'science' of the issue enters at this important point. Relevant empirical data are then gathered and analysed, and the results tend to support one or more hypotheses, while providing less support for other hypotheses. Repetition of this general approach leads to advances in the sciences. New or more elaborate hypotheses are added, while hypotheses with little empirical support are gradually dropped from consideration. At any one point in time, there are multiple hypotheses (models) still under consideration. An important feature of this multiplicity is that the number of alternative models should be kept small (Zucchini 2000); the analysis of, say, hundreds of models is not justified except when prediction is the only objective, or in the most exploratory phases of an investigation.

Strength of Evidence. Providing information to judge the 'strength of evidence' is central to science. Null-hypothesis-testing provides only arbitrary dichotomies (e.g. significant v. non-significant), and in the all-too-often-seen case where the null hypothesis is obviously false on a priori grounds, the test result is superfluous. Royall (1997) provides an interesting discussion of the likelihood-based strength-of-evidence approach in simple statistical situations.

The information-theoretic paradigm is partially grounded in the three principles above. Impetus for the general approach can be traced to several major advances made over the past half century, and this history will serve as an introduction to the subject.

Advance 1 – Kullback–Leibler information

In 1951 S. Kullback and R. A. Leibler published a now-famous paper that examined the scientific meaning of 'information' related to R. A. Fisher's concept of a 'sufficient statistic'. Their celebrated result, now called Kullback–Leibler information, is a fundamental quantity in the sciences and has earlier roots back to Boltzmann's (1877) concept of entropy. Boltzmann's entropy and the associated Second Law of Thermodynamics represent one of the most outstanding achievements of 19th-century science.

Kullback–Leibler (K–L) information is a measure (a 'distance' in a heuristic sense) between conceptual reality, f, and an approximating model, g, and is defined for continuous functions as the integral

    I(f, g) = \int f(x) \log_e\left(\frac{f(x)}{g(x)}\right) dx,

where f and g are n-dimensional probability distributions. K–L information, denoted I(f, g), is the 'information' lost when model g is used to approximate reality, f. The analyst seeks an approximating model that loses as little information as possible; this is equivalent to minimising I(f, g) over the set of models of interest (we assume there are R a priori models in the candidate set).

Boltzmann's entropy H is -I(f, g), although these quantities were derived along very different lines. Boltzmann derived the fundamental relationship between entropy (H) and probability (P) as

    H = \log_e(P),

and because H = -I(f, g), one can see that entropy, information and probability are linked, allowing probabilities to be multiplicative whereas information and entropy are additive. K–L information can be viewed as an extension of the famous Shannon (1948) entropy and is often referred to as 'cross entropy'. In addition, there is a close relationship with Jaynes' (1957) 'maximum entropy principle', or MaxEnt (see Akaike 1977, 1983a, 1985). Cover and Thomas (1991) provide a nice introduction to information theory in general. K–L information, by itself, will not aid in data analysis, as both reality (f) and the parameters (θ) in the approximating model are unknown to us. H. Akaike made the next breakthrough in the early 1970s.
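Although I(f, g) cannot be computed in real data analysis (f is unknown), it is easy to evaluate numerically when f is assumed known, which helps build intuition for the definition above. The sketch below is ours: the gamma 'reality' and the two normal candidate models are arbitrary assumptions. The candidate with the smaller I(f, g) loses less information.

    import numpy as np
    from scipy import stats

    # Grid spanning the effective support of f.
    x = np.linspace(1e-6, 25.0, 20000)

    # Assumed "reality" f: a gamma distribution (illustration only).
    f = stats.gamma(a=4.0, scale=1.0).pdf(x)

    # Two candidate approximating models g (assumed normal forms).
    g1 = stats.norm(loc=4.0, scale=2.0).pdf(x)
    g2 = stats.norm(loc=3.0, scale=1.0).pdf(x)

    def kl(f_vals, g_vals, grid):
        # I(f, g) = integral of f(x) log_e(f(x)/g(x)) dx, via the trapezoid rule.
        integrand = f_vals * np.log(f_vals / g_vals)
        return np.trapz(integrand, grid)

    print("I(f, g1) =", round(kl(f, g1, x), 4))   # smaller: less information lost
    print("I(f, g2) =", round(kl(f, g2, x), 4))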
Advance 2 – Estimation of Kullback–Leibler information (AIC)

Akaike (1973, 1974) found a formal relationship between K–L information (a dominant paradigm in information and coding theory) and maximum likelihood (the dominant paradigm in statistics) (see deLeeuw 1992). This finding makes it possible to combine estimation (e.g. maximum likelihood or least squares) and model selection under a single theoretical framework: optimisation. Akaike's breakthrough was the finding of an estimator of the expected, relative K–L information, based on the maximised log-likelihood function.

Akaike's derivation (which is for large samples) relied on K–L information as averaged entropy, and this led to 'Akaike's information criterion' (AIC),

    AIC = -2\log_e(L(\hat{\theta} \mid data)) + 2K,

where \log_e(L(\hat{\theta} \mid data)) is the value of the maximised log-likelihood over the unknown parameters (θ), given the data and the model, and K is the number of estimable parameters in that approximating model. In the special case of least-squares (LS) estimation with normally distributed errors for all R models in the set, and apart from an arbitrary additive constant, AIC can be expressed as

    AIC = n\log_e(\hat{\sigma}^2) + 2K,

where \hat{\sigma}^2 = \sum \hat{\epsilon}_i^2 / n and the \hat{\epsilon}_i are the estimated residuals from the fitted model. In this case the number of estimable parameters, K, must be the total number of parameters in the model, including the intercept and σ². Thus, AIC is easy to compute from the results of LS estimation in the case of linear models, or from the results of a likelihood-based analysis in general (Edwards 1992; Azzalini 1996). Akaike's procedures are now called information-theoretic because they are based on K–L information (see Akaike 1983b, 1992, 1994).

Assuming that a set of a priori candidate models has been defined and is well supported by the underlying science, AIC is then computed for each of the approximating models in the set (i.e. g_i, i = 1, 2, ..., R). The model for which AIC is minimal is selected as best for the empirical data at hand. This is a simple, compelling concept, based on deep theoretical foundations (i.e. entropy, K–L information, and likelihood theory). AIC is not a test in any sense: no single hypothesis (model) is made to be the 'null', there is no arbitrary α level, and there is no arbitrary notion of 'significance'. Instead, there are concepts of evidence and a 'best' inference, given the data and the set of a priori models representing the scientific hypotheses of interest.

When K is large relative to sample size n (which includes when n is small, for any K) there is a small-sample (second-order) version called AIC_c,

    AIC_c = -2\log_e(L(\hat{\theta})) + 2K + \frac{2K(K+1)}{n-K-1}

(see, for example, Hurvich and Tsai 1989), and this should be used unless n/K > ~40. Both AIC and AIC_c are estimates of expected, relative Kullback–Leibler information and are useful in the analysis of real data in the 'noisy' sciences. Assuming independence, AIC-based model selection is equivalent to certain cross-validation methods (Stone 1974, 1977), and this is an important property.

Akaike's general approach allows the best model in the set to be identified, but also allows the rest of the models to be easily ranked. Here, it is very useful (essentially imperative) to rescale AIC (or AIC_c) values such that the model with the minimum information criterion has a value of 0, i.e.

    \Delta_i = AIC_i - \min AIC.

The Δ_i values are easy to interpret, and allow a quick 'strength of evidence' comparison and ranking of candidate hypotheses or models. The larger the Δ_i, the less plausible is fitted model i as the best approximating model in the candidate set. It is generally important to know which model (hypothesis) is second best (the ranking), as well as some measure of its standing with respect to the best model. Some simple rules of thumb are often useful in assessing the relative merits of models in the set: models having Δ_i ≤ 2 have substantial support (evidence), those where 4 ≤ Δ_i ≤ 7 have considerably less support, and models having Δ_i > 10 have essentially no support. An improved method for scaling models appears in the next section.
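As a worked illustration of the formulas above, consider the sketch below; it is ours, and the residual sums of squares, sample size and model labels are hypothetical. It computes AIC for least-squares fits as n log_e(σ̂²) + 2K, applies the small-sample correction to obtain AIC_c (here n/K is well under 40), and rescales to Δ_i values.

    import numpy as np

    def aic_ls(rss, n, k_structural):
        # AIC for least squares with normal errors: n log_e(sigma^2_hat) + 2K,
        # where sigma^2_hat = RSS/n and K counts ALL estimable parameters,
        # including the intercept and sigma^2.
        K = k_structural + 1              # + 1 for sigma^2
        return n * np.log(rss / n) + 2 * K, K

    def aic_c(aic, n, K):
        # Second-order (small-sample) version; use unless n/K > ~40.
        return aic + 2 * K * (K + 1) / (n - K - 1)

    n = 40                                # hypothetical sample size
    # Hypothetical fits: model -> (RSS, structural parameters incl. intercept)
    fits = {"g1": (52.3, 2), "g2": (48.1, 3), "g3": (47.9, 6)}

    scores = {}
    for name, (rss, p) in fits.items():
        aic, K = aic_ls(rss, n, p)
        scores[name] = aic_c(aic, n, K)

    best = min(scores.values())
    for name, s in sorted(scores.items(), key=lambda kv: kv[1]):
        print(f"{name}: AICc = {s:6.2f}, Delta = {s - best:5.2f}")

Note that, in this hypothetical set, g2 is selected even though g3 has the smallest residual sum of squares: the 2K penalty (plus the AIC_c correction) enforces parsimony.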
The Δ_i values allow an easy ranking of hypotheses (models) in the candidate set. One must turn to goodness-of-fit tests or other measures to determine whether any of the models is good in some absolute sense. For count data, we suggest a standard goodness-of-fit test, whereas standard measures such as R² and σ̂² in regression and analysis of variance are often useful. Justification of the models in the candidate set is a very important issue. This is where the science of the problem enters the scene. Ideally, there ought to be a justification of the models in the set and a defense as to why some models should remain out of the set. This is an area where ecologists need to spend much more time just thinking, well prior to data analysis and, perhaps, prior to data collection.

The principle of parsimony, or Occam's razor, provides a philosophical basis for model selection; Kullback–Leibler information provides an objective target based on deep, fundamental theory; and the information criteria (AIC and AIC_c), along with likelihood- or least-squares-based inference, provide a practical, general methodology for use in the analysis of empirical data. Objective data analysis can be rigorously based on these principles without having to assume that the 'true model' is contained in the set of candidate models; surely there are no true models in the biological sciences!

Advance 3 – Likelihood of a model, given the data

The simple transformation exp(-Δ_i/2), for i = 1, 2, ..., R, provides the likelihood of the model (Akaike 1981) given the data: L(g_i | data). This is a likelihood function over the model set in the same sense that L(θ | data, g_i) is the likelihood over the parameter space (for model g_i) of the parameters θ, given the data (x) and the model (g_i). The relative likelihood of model i versus model j is L(g_i | data)/L(g_j | data); this ratio does not depend on any of the other models under consideration. Without loss of generality we may assume model g_i is more likely than g_j. Then, if this ratio is large (e.g. >10 is large), model g_j is a poor model relative to model g_i. The expression L(g_i | data)/L(g_j | data) can be regarded as an evidence ratio: the evidence for model i versus model j.

It is often convenient to normalise these likelihoods such that they sum to 1; hence we use

    w_i = \frac{\exp(-\Delta_i/2)}{\sum_{r=1}^{R} \exp(-\Delta_r/2)}.

The w_i, called Akaike weights, are useful as the 'weight of evidence' in favour of model i as being the actual K–L best model in the set. The ratios w_i/w_j are identical to the original likelihood ratios, L(g_i | data)/L(g_j | data); however, the w_i, i = 1, ..., R, are useful in additional ways. For example, the w_i are interpreted approximately as the probability that model i is the actual K–L best model in the set.
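Turning Δ_i values into model likelihoods, Akaike weights and evidence ratios takes only a few lines. The sketch below is ours, using hypothetical Δ_i values; it normalises the relative likelihoods exp(-Δ_i/2) so that the weights sum to 1.

    import numpy as np

    def akaike_weights(deltas):
        # w_i = exp(-Delta_i / 2) / sum over r of exp(-Delta_r / 2)
        rel_lik = np.exp(-0.5 * np.asarray(deltas))   # L(g_i | data), up to a constant
        return rel_lik / rel_lik.sum()

    deltas = [0.0, 1.2, 4.7, 11.3]     # hypothetical Delta_i values
    w = akaike_weights(deltas)
    for i, wi in enumerate(w, start=1):
        print(f"model g{i}: weight = {wi:.3f}")

    # Evidence ratio for model 1 versus model 3 (identical to w_1 / w_3).
    print("evidence ratio g1 vs g3:", round(w[0] / w[2], 1))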