Wildlife Research, 2001, 28, 111-119

Kullback-Leibler information as a basis for strong inference in ecological studies

Kenneth P. Burnham^A and David R. Anderson

Colorado Cooperative Fish and Wildlife Research Unit, Colorado State University, Fort Collins, CO 80523, USA. [Employed by USGS, Division of Biological Resources]
^A Email: [email protected]

Abstract. We describe an information-theoretic paradigm for analysis of ecological data, based on Kullback-Leibler information, that is an extension of likelihood theory and avoids the pitfalls of null hypothesis testing. Information-theoretic approaches emphasise a deliberate focus on the a priori science in developing a set of multiple working hypotheses or models. Simple methods then allow these hypotheses (models) to be ranked from best to worst and scaled to reflect a strength of evidence using the likelihood of each model (gi), given the data and the models in the set (i.e. L(gi | data)). In addition, a variance component due to model-selection uncertainty is included in estimates of precision. There are many cases where formal inference can be based on all the models in the a priori set, and this multimodel inference represents a powerful, new approach to valid inference. Finally, we strongly recommend that inferences based on a priori considerations be carefully separated from those resulting from some form of data dredging. An example is given for questions related to age- and sex-dependent rates of tag loss in elephant seals (Mirounga leonina).

Introduction

Theoretical and applied ecologists are becoming increasingly dissatisfied with the traditional testing-based aspects of
frequentist statistics. Over the past 50 years a large body of
statistical literature has shown the testing of null hypotheses
to have relatively little utility, in spite of their widespread use (Nester 1996). Inman (1994) provides a historical perspective on this issue by highlighting the points of a heated exchange in the published literature between R. A. Fisher and Karl Pearson in 1935. In the applied ecology literature, Yoccoz (1991), Cherry (1998), Johnson (1999) and Anderson et al. (2000) have written on this specific issue. The statistical null hypothesis testing approach is not wrong, but it
is relatively uninformative and, thus, slows scientific
progress and understanding.

Bayesian approaches are relatively unknown to ecologists and will likely remain so, because this material is not commonly offered in statistics departments, except perhaps in advanced courses. Still, an increasing number of people think that Bayesian statistics offer an acceptable alternative (Gelman et al. 1995; Ellison 1996), while others are leery (Forster 1995; Dennis 1996). In addition, there are fundamental issues with the subjectivity inherent in many Bayesian methods, and this has unfortunately divided the field of statistics for many decades. Also, much of Bayesian statistics has been developed from the viewpoint of decision theory. We find that science is most involved with estimation, prediction and understanding and, less so, with decision-making (see Berger 1985 for a discussion of decision-making).

(c) CSIRO 2001

The purpose of this paper is to introduce readers to the use
of Kullback-Leibler information as a basis for making valid inference from the analysis of empirical data. We provide this introduction because information-theoretic approaches are simple, easy to learn and understand, compelling, and quite general. This class of methods allows one to select the best model from an a priori set, rank and scale the models, and include model-selection uncertainty in estimates of precision. Information-theoretic approaches provide an effective strategy for objective data analysis (Burnham and Anderson 1998; Anderson and Burnham 1999). Finally, we provide a simple approach to making formal inference from more than a single model (multimodel inference, or MMI). We believe the information-theoretic approaches are excellent for the analysis of ecological data, whether experimental or observational, and provide a rational alternative to the testing-based frequentist methods and the computer-intensive Bayesian methods.

The central inferential issues in science are twofold.
First, scientists are fundamentally interested in estimates of the magnitude of the parameters, or differences between parameters, and their precision: are the differences trivial, small, medium, or large? Are the differences biologically meaningful? This is an estimation problem. Second, one often wants to know whether the differences are large enough to justify inclusion in a model to be used for further inference (e.g. prediction); this is a model-selection problem. These central issues are not properly addressed by statistical null hypothesis testing. In particular, hypothesis testing is a poor approach to model selection or variable selection (e.g. forward or backward selection in regression analysis).

10.1071/WR99107

The application of information-theoretic approaches is
relatively new; however, a number of papers using these methods have already appeared in the fisheries, wildlife and conservation biology literature. Research into the analysis of marked birds has made heavy use of these new methods (see the special supplement of Bird Study, 1999, Vol. 46). Program MARK (White et al. 2001) allows a full analysis of data under the information-theoretic paradigm, including model averaging and estimates of precision that include model-selection uncertainty. Distance sampling and analysis theory (Buckland et al. 1993) should often be based on this theory, with an emphasis on making formal inference from several models. The large data sets on the threatened northern spotted owl in the United States have been the subject of large-scale analyses using these new methods (see Burnham et al. 1996). Burnham and Anderson (1998) provide a number of other examples, including formal experiments to examine the effect of a treatment, studies of spatial overlap in Anolis lizards in Jamaica, line-transect sampling of kangaroos at Wallaby Creek in Australia, predicting the frequency of storms in South Africa, and the time distribution of an insecticide (Dursban(R)) in a simulated ecosystem. Burnham and Anderson (1998: 96-99) provide an example of a simulated experiment on starlings (Sturnus vulgaris) to illustrate that substantial differences can arise between the results of hypothesis-testing and model-selection criteria. Another example relates to time-dependent survival of sage grouse (Centrocercus urophasianus), where Akaike's Information Criterion selected a model with 4 parameters whereas hypothesis tests suggested a model with 58 parameters (Burnham and Anderson 1998: 106-109). Information-theoretic methods have found heavy use in other fields of science (e.g. time-series analysis).

Science Philosophy

First we must agree on the fact that there are no true models;
instead, models, by definition, are only approximations to unknown reality or truth. George Box made the famous statement 'All models are wrong but some are useful'. In the analysis of empirical data, one must face the question 'What model should be used to best approximate reality, given the data at hand?' (the best model depends on sample size). The information-theoretic paradigm rests on the assumption that good data, relevant to the issue, are available and that these have been collected in an appropriate manner. Three general principles guide us in model-based inference in the sciences.
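The point that the best approximating model depends on sample size can be illustrated with a small simulation, not taken from the paper: polynomial models of increasing degree are fitted to noisy data from a known curve, and the degree with the smallest validation error is reported. The function names, the sine-curve 'reality', and all numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def reality(x):
    # Stand-in for unknown 'full reality' f: a smooth nonlinear curve.
    return np.sin(2 * np.pi * x)

def best_degree(n, max_degree=8):
    """Fit polynomials of degree 1..max_degree to n noisy observations and
    return the degree with the lowest mean squared error on independent
    validation data (a crude proxy for expected information loss)."""
    x = rng.uniform(0.0, 1.0, n)
    y = reality(x) + rng.normal(0.0, 0.3, n)
    xv = rng.uniform(0.0, 1.0, 2000)
    yv = reality(xv) + rng.normal(0.0, 0.3, 2000)
    mse = {}
    for d in range(1, max_degree + 1):
        coef = np.polyfit(x, y, d)
        mse[d] = float(np.mean((yv - np.polyval(coef, xv)) ** 2))
    return min(mse, key=mse.get)

# With few data a simple model is favoured; more data supports more structure.
d_small, d_big = best_degree(15), best_degree(3000)
print(d_small, d_big)
```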
Simplicity and Parsimony. Many scientific concepts and theories are simple, once understood. In fact, Occam's Razor implores us to 'shave away all but what is necessary'. Albert Einstein is supposed to have said 'Everything should be made as simple as possible, but no simpler'. Parsimony enjoys a featured place in scientific thinking in general and in modelling specifically (see Forster and Sober 1994; Forster 2001 for a strictly science-philosophy perspective).

Fig. 1. The principle of parsimony: the conceptual trade-off between squared bias (solid line) and variance (i.e. uncertainty) versus the number of estimable parameters in the model. The best model has dimension (K0) near the intersection of the two lines, while full reality lies far to the right of the trade-off region.

Model selection (variable selection in regression is a special case) is a bias v. variance trade-off, and this is the principle of parsimony (Fig. 1). Models with too few parameters
ple of parsimony (Fig. 1). Models with too few parameters
(variables) have bias, whereas models with too many param
eters (variables) may have poor precision or tend to identify
effects that are, in fact, spurious (slightly different issues
arise for count data v. continuous data). These considerations
call for a balance between under and overfitted models —
the socalled ‘model selection problem’ (see Forster 2000). Multiple Working Hypotheses. Over 100 years ago,
Chamberlin (1890, reprinted 1965) advocated the concept of 'multiple working hypotheses'. Here, there is no null hypothesis; instead, there are several well-supported hypotheses (equivalently, 'models') that are being entertained. The a priori 'science' of the issue enters at this important point. Relevant empirical data are then gathered and analysed, and the results tend to support one or more hypotheses, while providing less support for other hypotheses. Repetition of this general approach leads to advances in the sciences. New or more elaborate hypotheses are added, while hypotheses with little empirical support are gradually dropped from consideration. At any one point in time, there are multiple hypotheses (models) still under consideration. An important feature of this multiplicity is that the number of alternative models should be kept small (Zucchini 2000); the analysis of, say, hundreds of models is not justified, except when prediction is the only objective or in the most exploratory phases of an investigation.

Strength of Evidence. Providing information to judge
the 'strength of evidence' is central to science. Null hypothesis testing provides only arbitrary dichotomies (e.g. significant v. non-significant) and, in the all-too-often-seen case where the null hypothesis is obviously false on a priori grounds, the test result is superfluous. Royall (1997) provides an interesting discussion of the likelihood-based strength-of-evidence approach in simple statistical situations.

The information-theoretic paradigm is partially grounded in the three principles above. Impetus for the general approach can be traced to several major advances made over the past half century, and this history will serve as an introduction to the subject.

Advance 1: Kullback-Leibler information

In 1951, S. Kullback and R. A. Leibler published a now-famous paper that examined the scientific meaning of 'information' related to R. A. Fisher's concept of a 'sufficient statistic'. Their celebrated result, now called Kullback-Leibler information, is a fundamental quantity in the sciences
and has earlier roots back to Boltzmann's (1877) concept of entropy. Boltzmann's entropy and the associated Second Law of Thermodynamics represent one of the most outstanding achievements of 19th-century science.

Kullback-Leibler (K-L) information is a measure (a 'distance' in a heuristic sense) between conceptual reality, f, and an approximating model, g, and is defined for continuous functions as the integral

I(f, g) = Integral of f(x) log_e( f(x) / g(x | theta) ) dx,

where f and g are n-dimensional probability distributions. K-L information, denoted I(f, g), is the 'information' lost when model g is used to approximate reality, f. The analyst seeks an approximating model that loses as little information as possible; this is equivalent to minimising I(f, g) over the set of models of interest (we assume there are R a priori models
in the candidate set).

Boltzmann's entropy H is -I(f, g), although these quantities were derived along very different lines. Boltzmann derived the fundamental relationship between entropy (H) and probability (P) as

H = log_e(P),

and because H = -I(f, g), one can see that entropy, information and probability are linked, allowing probabilities to be multiplicative whereas information and entropy are additive.

K-L information can be viewed as an extension of the famous Shannon (1948) entropy and is often referred to as 'cross entropy'. In addition, there is a close relationship with Jaynes' (1957) 'maximum entropy principle', or MaxEnt (see Akaike 1977, 1983a, 1985). Cover and Thomas (1989) provide a nice introduction to information theory in general. K-L information, by itself, will not aid in data analysis, as both reality (f) and the parameters (theta) in the approximating model are unknown to us. H. Akaike made the next breakthrough in the early 1970s.
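The definition of I(f, g) can be made concrete with a minimal numerical sketch for discrete distributions (the vectors f, g1 and g2 are invented for illustration; in real problems f is, of course, unknown):

```python
import numpy as np

# Discrete 'reality' f and two hypothetical approximating models g1, g2
# on the same four-point support; every number here is invented.
f  = np.array([0.1, 0.4, 0.3, 0.2])
g1 = np.array([0.2, 0.3, 0.3, 0.2])   # fairly close to f
g2 = np.array([0.4, 0.1, 0.1, 0.4])   # far from f

def kl_information(f, g):
    """I(f, g) = sum over x of f(x) * log_e(f(x) / g(x)): the information
    lost when g is used to approximate f; zero only when g matches f."""
    return float(np.sum(f * np.log(f / g)))

i1, i2 = kl_information(f, g1), kl_information(f, g2)
print(i1, i2)  # the closer model loses less information
```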
Advance 2: Estimation of Kullback-Leibler information (AIC)

Akaike (1973, 1974) found a formal relationship between K-L information (a dominant paradigm in information and coding theory) and maximum likelihood (the dominant paradigm in statistics) (see deLeeuw 1992). This finding makes
it possible to combine estimation (e.g. maximum likelihood or least squares) and model selection under a single theoretical framework: optimisation. Akaike's breakthrough was the finding of an estimator of the expected, relative K-L information, based on the maximised log-likelihood function. Akaike's derivation (which is for large samples) relied on K-L information as averaged entropy, and this led to 'Akaike's information criterion' (AIC),

AIC = -2 log_e( L(theta_hat | data) ) + 2K,

where log_e(L(theta_hat | data)) is the value of the maximised log-likelihood over the unknown parameters (theta), given the data and the model, and K is the number of estimable parameters in that approximating model. In the special case of least-squares (LS) estimation with normally distributed errors for all R models in the set, and apart from an arbitrary additive constant, AIC can be expressed as

AIC = n log_e(sigma_hat^2) + 2K,

where sigma_hat^2 = (sum of squared residuals)/n and the residuals are those from the fitted model. In this case the number of estimable parameters, K, must be the total number of parameters in the model, including the intercept and sigma^2. Thus, AIC is easy to compute from the results of LS estimation in the case of linear models, or from the results of a likelihood-based analysis in general (Edwards 1992; Azzalini 1996). Akaike's procedures are now called information-theoretic because they are based on K-L information (see Akaike 1983b, 1992, 1994).
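The two forms of AIC above can be sketched in a few lines. The model names, log-likelihood values, parameter counts and residuals below are all hypothetical:

```python
import numpy as np

def aic(log_likelihood, k):
    """AIC = -2 * log_e(L(theta_hat | data)) + 2K."""
    return -2.0 * log_likelihood + 2 * k

def aic_least_squares(residuals, k):
    """Least-squares/normal-errors form: AIC = n * log_e(sigma2_hat) + 2K,
    with sigma2_hat = sum(residuals**2) / n.  K must count every estimated
    parameter, including the intercept and sigma^2."""
    r = np.asarray(residuals, dtype=float)
    n = r.size
    sigma2_hat = np.sum(r ** 2) / n
    return n * np.log(sigma2_hat) + 2 * k

# Hypothetical maximised log-likelihoods and parameter counts for R = 3 models.
models = {"g1": (-148.2, 3), "g2": (-147.9, 5), "g3": (-155.6, 2)}
aics = {name: aic(ll, k) for name, (ll, k) in models.items()}
best = min(aics, key=aics.get)
print(aics, best)  # the minimum-AIC model is selected as best
```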
Assuming that a set of a priori candidate models has been defined and is well supported by the underlying science, AIC is computed for each of the approximating models in the set (i.e. gi, i = 1, 2, ..., R). The model for which AIC is minimal is selected as best for the empirical data at hand. This is a simple, compelling concept, based on deep theoretical foundations (i.e. entropy, K-L information, and likelihood theory). AIC is not a test in any sense: no single hypothesis (model) is made to be the 'null', there is no arbitrary alpha level, and there is no arbitrary notion of 'significance'. Instead, there are concepts of evidence and a 'best' inference, given the data and the set of a priori models representing the scientific hypotheses of interest.
When K is large relative to sample size n (which includes when n is small, for any K), there is a small-sample (second-order) version called AICc,

AICc = -2 log_e( L(theta_hat) ) + 2K + 2K(K + 1) / (n - K - 1)

(see, for example, Hurvich and Tsai 1989), and this should be used unless n/K > ~40. Both AIC and AICc are estimates of expected, relative Kullback-Leibler information and are useful in the analysis of real data in the 'noisy' sciences. Assuming independence, AIC-based model selection is equivalent to certain cross-validation methods (Stone 1974, 1977), and this is an important property.
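A minimal sketch of the second-order criterion, assuming the formula above (the log-likelihood, K and n values are invented):

```python
def aicc(log_likelihood, k, n):
    """AICc = -2 log_e(L) + 2K + 2K(K+1)/(n - K - 1); use unless n/K > ~40."""
    if n - k - 1 <= 0:
        raise ValueError("sample size too small relative to K")
    return -2.0 * log_likelihood + 2 * k + (2.0 * k * (k + 1)) / (n - k - 1)

# Hypothetical fit with K = 5 parameters: the small-sample correction is
# noticeable at n = 30 and nearly gone at n = 3000 (where AIC = 210).
print(aicc(-100.0, 5, 30))    # 200 + 10 + 60/24 = 212.5
print(aicc(-100.0, 5, 3000))
```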
Akaike's general approach allows the best model in the set to be identified, but also allows the rest of the models to be easily ranked. Here, it is very useful (essentially imperative) to rescale AIC (or AICc) values such that the model with the minimum information criterion has a value of 0, i.e.

Delta_i = AIC_i - min AIC.

The Delta_i values are easy to interpret, and allow a quick 'strength of evidence' comparison and ranking of candidate hypotheses or models. The larger the Delta_i, the less plausible is fitted model i as being the best approximating model in the candidate set. It is generally important to know which model (hypothesis) is second best (the ranking), as well as some measure of its standing with respect to the best model. Some simple rules of thumb are often useful in assessing the relative merits of models in the set: models having Delta_i <= 2 have substantial support (evidence), those where 4 <= Delta_i <= 7 have considerably less support, while models having Delta_i > 10 have essentially no support. An improved method for scaling models appears in the next section.
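The rescaling and the rules of thumb can be sketched as follows, with hypothetical AIC values; the 'intermediate' label is ours, covering the gaps (2-4 and 7-10) that the rules of thumb leave unnamed:

```python
# Hypothetical AIC values for four candidate models.
aic_values = {"g1": 302.4, "g2": 305.8, "g3": 315.2, "g4": 303.1}

best_aic = min(aic_values.values())
delta = {name: a - best_aic for name, a in aic_values.items()}

def support(d):
    """Rule-of-thumb labels from the text; 'intermediate' covers the gaps."""
    if d <= 2:
        return "substantial"
    if 4 <= d <= 7:
        return "considerably less"
    if d > 10:
        return "essentially none"
    return "intermediate"

# Rank models from best (Delta = 0) to worst.
for name in sorted(delta, key=delta.get):
    print(name, round(delta[name], 1), support(delta[name]))
```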
The Delta_i values allow an easy ranking of hypotheses (models) in the candidate set. One must turn to goodness-of-fit tests or other measures to determine whether any of the models is good in some absolute sense. For count data, we suggest a standard goodness-of-fit test, whereas standard measures such as R^2 and sigma_hat^2 in regression and analysis of variance are often useful. Justification of the models in the candidate set is a very important issue. This is where the science of the problem enters the scene. Ideally, there ought to be a justification of models in the set and a defense as to why some models should remain out of the set. This is an area where ecologists need to spend much more time just thinking, well prior to data analysis and, perhaps, prior to data collection.
The principle of parsimony, or Occam's razor, provides a philosophical basis for model selection; Kullback-Leibler information provides an objective target based on deep, fundamental theory; and the information criteria (AIC and AICc), along with likelihood- or least-squares-based inference, provide a practical, general methodology for use in the analysis of empirical data. Objective data analysis can be rigorously based on these principles without having to assume that the 'true model' is contained in the set of candidate models; surely there are no true models in the biological sciences!

Advance 3: Likelihood of a model, given the data
The simple transformation exp(-Delta_i/2), for i = 1, 2, ..., R, provides the likelihood of the model (Akaike 1981), given the data: L(gi | data). This is a likelihood function over the model set in the same sense that L(theta | data, gi) is the likelihood over the parameter space (for model gi) of the parameters theta, given the data (x) and the model (gi). The relative likelihood of model i versus model j is L(gi | data)/L(gj | data); this ratio does not depend on any of the other models under consideration. Without loss of generality, we may assume that model gi is more likely than gj. Then, if this ratio is large (e.g. >10 is large), model gj is a poor model to fit the data relative to model gi. The expression L(gi | data)/L(gj | data) can be regarded as an evidence ratio: the evidence for model i versus model j.
It is often convenient to normalise these likelihoods such that they sum to 1; hence we use

w_i = exp(-Delta_i/2) / Sum over r of exp(-Delta_r/2), with the sum taken over r = 1, ..., R.

The w_i, called Akaike weights, are useful as the 'weight of evidence' in favor of model i being the actual K-L best model in the set. The ratios w_i/w_j are identical to the original likelihood ratios, L(gi | data)/L(gj | data); however, the w_i, i = 1, ..., R, are useful in additional ways. For example, the w_i are interpreted approximately as the pr...
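The normalisation of the model likelihoods exp(-Delta_i/2) into Akaike weights, and the resulting evidence ratios, can be sketched with hypothetical Delta_i values:

```python
import math

# Hypothetical Delta_i = AIC_i - min(AIC) values for R = 4 models.
delta = {"g1": 0.0, "g2": 0.7, "g3": 3.4, "g4": 12.8}

model_likelihood = {m: math.exp(-d / 2.0) for m, d in delta.items()}
total = sum(model_likelihood.values())
weights = {m: l / total for m, l in model_likelihood.items()}   # sum to 1

# w_i / w_j equals the evidence ratio L(g_i | data) / L(g_j | data).
evidence_g1_vs_g4 = weights["g1"] / weights["g4"]
print({m: round(w, 3) for m, w in weights.items()})
print(round(evidence_g1_vs_g4, 1))
```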