Wildlife Research, 2001, 28, 111–119

Kullback–Leibler information as a basis for strong inference in ecological studies

Kenneth P. Burnham^A and David R. Anderson

Colorado Cooperative Fish and Wildlife Research Unit, Colorado State University, Fort Collins, CO 80523, USA.
[Employed by USGS, Division of Biological Resources]
^A Email: [email protected]

Abstract. We describe an information-theoretic paradigm for the analysis of ecological data, based on Kullback–Leibler information, that is an extension of likelihood theory and avoids the pitfalls of null hypothesis testing. Information-theoretic approaches emphasise a deliberate focus on the a priori science in developing a set of multiple working hypotheses or models. Simple methods then allow these hypotheses (models) to be ranked from best to worst and scaled to reflect a strength of evidence using the likelihood of each model (gi), given the data and the models in the set (i.e. L(gi | data)). In addition, a variance component due to model-selection uncertainty is included in estimates of precision. There are many cases where formal inference can be based on all the models in the a priori set, and this multimodel inference represents a powerful, new approach to valid inference. Finally, we strongly recommend that inferences based on a priori considerations be carefully separated from those resulting from some form of data dredging. An example is given for questions related to age- and sex-dependent rates of tag loss in elephant seals (Mirounga leonina).

Introduction

Theoretical and applied ecologists are becoming increasingly dissatisfied with the traditional testing-based aspects of
frequentist statistics. Over the past 50 years, a large body of statistical literature has shown the testing of null hypotheses to have relatively little utility, in spite of their very widespread use (Nester 1996). Inman (1994) provides a historical perspective on this issue by highlighting the points of a heated exchange in the published literature between R. A. Fisher and Karl Pearson in 1935. In the applied ecology literature, Yoccoz (1991), Cherry (1998), Johnson (1999) and Anderson et al. (2000) have written on this specific issue. The statistical null hypothesis testing approach is not wrong, but it is relatively uninformative and, thus, slows scientific progress and understanding.

Bayesian approaches are relatively unknown to ecologists
and will likely remain so because this material is not commonly offered in statistics departments, except perhaps in advanced courses. Still, an increasing number of people think that Bayesian statistics offer an acceptable alternative (Gelman et al. 1995; Ellison 1996), while others are leery (Forster 1995; Dennis 1996). In addition, there are fundamental issues with the subjectivity inherent in many Bayesian methods, and this has unfortunately divided the field of statistics for many decades. Also, much of Bayesian statistics has been developed from the viewpoint of decision theory. We find that science is most involved with estimation, prediction and understanding and, less so, with decision-making (see Berger 1985 for a discussion of decision-making).

© CSIRO 2001

The purpose of this paper is to introduce readers to the use
of Kullback–Leibler information as a basis for making valid inference from the analysis of empirical data. We provide this introduction because information-theoretic approaches are simple, easy to learn and understand, compelling, and quite general. This class of methods allows one to select the best model from an a priori set, rank and scale the models, and include model-selection uncertainty in estimates of precision. Information-theoretic approaches provide an effective strategy for objective data analysis (Burnham and Anderson 1998; Anderson and Burnham 1999). Finally, we provide a simple approach to making formal inference from more than a single model (multimodel inference, or MMI). We believe the information-theoretic approaches are excellent for the analysis of ecological data, whether experimental or observational, and provide a rational alternative to the testing-based frequentist methods and the computer-intensive Bayesian methods.

The central inferential issues in science are twofold.
First, scientists are fundamentally interested in estimates of the magnitude of the parameters, or differences between parameters, and their precision: are the differences trivial, small, medium, or large? Are the differences biologically meaningful? This is an estimation problem. Second, one often wants to know whether the differences are large enough to justify inclusion in a model to be used for further inference (e.g. prediction), and this is a model-selection problem. These central issues are not properly addressed by statistical null hypothesis testing. In particular, hypothesis testing is a poor approach to model selection or variable selection (e.g. forward or backward selection in regression analysis).

The application of information-theoretic approaches is
relatively new; however, a number of papers using these methods have already appeared in the fisheries, wildlife and conservation biology literature. Research into the analysis of marked birds has made heavy use of these new methods (see the special supplement of Bird Study, 1999, Vol. 46). Program MARK (White et al. 2001) allows a full analysis of data under the information-theoretic paradigm, including model averaging and estimates of precision that include model-selection uncertainty. Distance sampling and analysis theory (Buckland et al. 1993) should often be based on this theory, with an emphasis on making formal inference from several models. The large data sets on the threatened northern spotted owl in the United States have been the subject of large-scale analyses using these new methods (see Burnham et al. 1996). Burnham and Anderson (1998) provide a number of other examples, including formal experiments to examine the effect of a treatment, studies of spatial overlap in Anolis lizards in Jamaica, line-transect sampling of kangaroos at Wallaby Creek in Australia, predicting the frequency of storms in South Africa, and the time distribution of an insecticide (Dursban®) in a simulated ecosystem. Burnham and Anderson (1998: 96–99) provide an example of a simulated experiment on starlings (Sturnus vulgaris) to illustrate that substantial differences can arise between the results of hypothesis testing and model-selection criteria. Another example relates to time-dependent survival of sage grouse (Centrocercus urophasianus), where Akaike's Information Criterion selected a model with 4 parameters whereas hypothesis tests suggested a model with 58 parameters (Burnham and Anderson 1998: 106–109). Information-theoretic methods have found heavy use in other fields of science (e.g. time-series analysis).

Science Philosophy

First we must agree on the fact that there are no true models;
instead, models, by definition, are only approximations to unknown reality or truth. George Box made the famous statement 'All models are wrong but some are useful'. In the analysis of empirical data, one must face the question 'What model should be used to best approximate reality given the data at hand?' (the best model depends on sample size). The information-theoretic paradigm rests on the assumption that good data, relevant to the issue, are available and that these have been collected in an appropriate manner. Three general principles guide us in model-based inference in the sciences.

Simplicity and Parsimony. Many scientific concepts and theories are simple, once understood. In fact, Occam's Razor implores us to 'shave away all but what is necessary'. Albert Einstein is supposed to have said 'Everything should be made as simple as possible, but no simpler'. Parsimony
enjoys a featured place in scientific thinking in general and in modelling specifically (see Forster and Sober 1994; Forster 2001 for a strictly science-philosophy perspective).

Fig. 1. The principle of parsimony: the conceptual trade-off between squared bias (solid line) and variance (i.e. uncertainty) versus the number of estimable parameters in the model. The best model has dimension (K0) near the intersection of the two lines, while full reality lies far to the right of the trade-off region.

Model selection (variable selection in regression is a special case) is a bias v. variance trade-off, and this is the principle of parsimony (Fig. 1). Models with too few parameters (variables) have bias, whereas models with too many parameters (variables) may have poor precision or tend to identify effects that are, in fact, spurious (slightly different issues arise for count data v. continuous data). These considerations call for a balance between under- and over-fitted models – the so-called 'model selection problem' (see Forster 2000).

Multiple Working Hypotheses. Over 100 years ago,
Chamberlin (1890, reprinted 1965) advocated the concept of 'multiple working hypotheses'. Here, there is no null hypothesis; instead, there are several well-supported hypotheses (equivalently, 'models') that are being entertained. The a priori 'science' of the issue enters at this important point. Relevant empirical data are then gathered and analysed, and the results tend to support one or more hypotheses, while providing less support for other hypotheses. Repetition of this general approach leads to advances in the sciences. New or more elaborate hypotheses are added, while hypotheses with little empirical support are gradually dropped from consideration. At any one point in time, there are multiple hypotheses (models) still under consideration. An important feature of this multiplicity is that the number of alternative models should be kept small (Zucchini 2000); the analysis of, say, hundreds of models is not justified except when prediction is the only objective, or in the most exploratory phases of an investigation.

Strength of Evidence. Providing information to judge
the ‘strength of evidence’ is central to science. Nullhypoth
esistesting provides only arbitrary dichotomies (e.g. signif—
icant v. nonsignificant) and in the alltoooftenseen case
where the null hypothesis is obviously false on a priori
grounds, the test result is superfluous. Royall (1997) pro
vides an interesting discussion of the likelihoodbased
strengthofevidence approach in simple statistical situa
tions. The informationtheoretic paradigm is partially grounded
in the three principlcs above. Impetus for the general ap—
proach can be traced to several major advances made over
the past half century and this history will serve as an intro
duction to the subject. Advance 1 — Kullback—Leibler information In 1951 S. Kullback and R. A. Leibler published a nowfa
mous paper that examined the scientific meaning of ‘infor—
mation’ related to R. A. Fisher's concept of a ‘sufficient
statistic’. Their celebrated result, now called Kullback—Lei
bler information, is a fundamental quantity in the sciences
and has earlier roots back to Boltzmann’s (1877) concept of
entropy. Boltzmann’s entropy and the associated Second Law
of Thermodynamics represents one of the most outstanding
achievements of 19th century science. KullbackiLeibler (K714) information is a measure (a ‘dis—
tance’ in an heuristic sense) between conceptual reality, f;
and approximating model, g, and is defined for continuous
functions as the integral Io: g) =lf(x)10ge m) x
£0619) wheref and g are ndimensional probability distributions. K
L information, denoted 1(ﬂ g), is the ‘information’ lost when
model g is used to approximate reality, f. The analyst seeks
an approximating model that loses as little information as
possible; this is equivalent to minimising [(f, g), over the set
of models of interest (we assume there are R a priori models
in the candidate set). Boltzmann's entropy H is −I(f, g), although these quantities were derived along very different lines. Boltzmann derived the fundamental relationship between entropy (H) and probability (P) as

H = loge(P),

and because H = −I(f, g), one can see that entropy, information and probability are linked, allowing probabilities to be multiplicative whereas information and entropy are additive. K–L information can be viewed as an extension of the famous Shannon (1948) entropy and is often referred to as 'cross entropy'. In addition, there is a close relationship with Jaynes' (1957) 'maximum entropy principle', or MaxEnt (see Akaike 1977, 1983a, 1985). Cover and Thomas (1989) provide a nice introduction to information theory in general. K–L information, by itself, will not aid in data analysis, as both reality (f) and the parameters (θ) in the approximating model are unknown to us. H. Akaike made the next breakthrough in the early 1970s.

Advance 2 – Estimation of Kullback–Leibler information (AIC)

Akaike (1973, 1974) found a formal relationship between K–L information (a dominant paradigm in information and coding theory) and maximum likelihood (the dominant paradigm in statistics) (see deLeeuw 1992). This finding makes it possible to combine estimation (e.g. maximum likelihood or least squares) and model selection under a single theoretical framework – optimisation. Akaike's breakthrough was the finding of an estimator of the expected, relative K–L information, based on the maximised log-likelihood function. Akaike's derivation (which is for large samples) relied on K–L information as averaged entropy, and this led to 'Akaike's information criterion' (AIC),

AIC = −2 loge(L(θ̂ | data)) + 2K,

where loge(L(θ̂ | data)) is the value of the maximised log-likelihood over the unknown parameters (θ), given the data and the model, and K is the number of estimable parameters in that approximating model. In the special case of least-squares (LS) estimation with normally distributed errors for all R models in the set, and apart from an arbitrary additive constant, AIC can be expressed as

AIC = n loge(σ̂²) + 2K,

where σ̂² = Σ ε̂ᵢ²/n and the ε̂ᵢ are the estimated residuals from the fitted model. In
this case the number of estimable parameters, K, must be the total number of parameters in the model, including the intercept and σ². Thus, AIC is easy to compute from the results of LS estimation in the case of linear models, or from the results of a likelihood-based analysis in general (Edwards 1992; Azzalini 1996). Akaike's procedures are now called information-theoretic because they are based on K–L information (see Akaike 1983b, 1992, 1994).

Assuming that a set of a priori candidate models has been
defined and is well supported by the underlying science, AIC is computed for each of the approximating models in the set (i.e. gᵢ, i = 1, 2, …, R). The model for which AIC is minimal is selected as best for the empirical data at hand. This is a simple, compelling concept, based on deep theoretical foundations (i.e. entropy, K–L information, and likelihood theory). AIC is not a test in any sense: no single hypothesis (model) is made to be the 'null', there is no arbitrary α level, and there is no arbitrary notion of 'significance'. Instead, there are concepts of evidence and a 'best' inference, given the data and the set of a priori models representing the scientific hypotheses of interest.

When K is large relative to sample size n (which includes cases when n is small, for any K) there is a small-sample (second-order) version called AICc,

AICc = −2 loge(L(θ̂)) + 2K + 2K(K+1)/(n−K−1)

(see, for example, Hurvich and Tsai 1989), and this should be used unless n/K > ~40. Both AIC and AICc are estimates of expected, relative Kullback–Leibler information and are useful in the analysis of real data in the 'noisy' sciences. Assuming independence, AIC-based model selection is equivalent to certain cross-validation methods (Stone 1974, 1977), and this is an important property.

Akaike's general approach allows the best model in the set
to be identified, but also allows the rest of the models to be easily ranked. Here, it is very useful (essentially imperative) to rescale AIC (or AICc) values such that the model with the minimum information criterion has a value of 0, i.e.

Δᵢ = AICᵢ − min AIC.

The Δᵢ values are easy to interpret, and allow a quick 'strength of evidence' comparison and ranking of candidate hypotheses or models. The larger the Δᵢ, the less plausible is fitted model i as being the best approximating model in the candidate set. It is generally important to know which model (hypothesis) is second best (the ranking), as well as some measure of its standing with respect to the best model. Some simple rules of thumb are often useful in assessing the relative merits of models in the set: models having Δᵢ ≤ 2 have substantial support (evidence), those where 4 ≤ Δᵢ ≤ 7 have considerably less support, while models having Δᵢ > 10 have essentially no support. An improved method for scaling models appears in the next section.

The Δᵢ values allow an easy ranking of hypotheses (models) in the candidate set. One must turn to goodness-of-fit
tests or other measures to determine whether any of the models is good in some absolute sense. For count data, we suggest a standard goodness-of-fit test, whereas standard measures such as R² and σ̂² in regression and analysis of variance are often useful. Justification of the models in the candidate set is a very important issue. This is where the science of the problem enters the scene. Ideally, there ought to be a justification of the models in the set and a defence as to why some models should remain out of the set. This is an area where ecologists need to spend much more time just thinking, well prior to data analysis and, perhaps, prior to data collection.

The principle of parsimony, or Occam's razor, provides a
philosophical basis for model selection; Kullback–Leibler information provides an objective target based on deep, fundamental theory; and the information criteria (AIC and AICc), along with likelihood- or least-squares-based inference, provide a practical, general methodology for use in the analysis of empirical data. Objective data analysis can be rigorously based on these principles without having to assume that the 'true model' is contained in the set of candidate models – surely there are no true models in the biological sciences!

Advance 3 – Likelihood of a model, given the data

The simple transformation exp(−Δᵢ/2), for i = 1, 2, …, R, provides the likelihood of the model (Akaike 1981), given the data: L(gᵢ | data). This is a likelihood function over the model
set, in the same sense that L(θ | data, gᵢ) is the likelihood over the parameter space (for model gᵢ) of the parameters θ, given the data (x) and the model (gᵢ). The relative likelihood of model i versus model j is L(gᵢ | data)/L(gⱼ | data); this ratio does not depend on any of the other models under consideration. Without loss of generality, we may assume model gᵢ is more likely than gⱼ. Then, if this ratio is large (e.g. >10 is large), model gⱼ is a poor model for the data relative to model gᵢ. The expression L(gᵢ | data)/L(gⱼ | data) can be regarded as an evidence ratio – the evidence for model i versus model j.

It is often convenient to normalise these likelihoods such that they sum to 1; hence we use

wᵢ = exp(−Δᵢ/2) / Σᵣ exp(−Δᵣ/2),

where the sum runs over r = 1, …, R. The wᵢ, called Akaike weights, are useful as the 'weight of evidence' in favour of model i as being the actual K–L best model in the set. The ratios wᵢ/wⱼ are identical to the original likelihood ratios, L(gᵢ | data)/L(gⱼ | data); however, the wᵢ, i = 1, …, R, are useful in additional ways. For example, the wᵢ are
interpreted approximately as the probability that model i is, in fact, the K–L best model for the data. This latter inference about model-selection uncertainty is conditional on both the data and the full set of a priori models considered. There are simple methods to provide a confidence set on the models, in the same sense as a confidence set for estimates of parameters, and to allow prior (Bayesian-type) information to affect these weights (see Burnham and Anderson 1998: 126–128).

Advance 4 – Unconditional sampling variance

Typically, estimates of sampling variance are conditional on a 'given' model, as if there were no uncertainty about which model to use (Breiman 1992 calls this a 'quiet scandal'). When model selection has been done, there is a variance component due to model-selection uncertainty that should be incorporated into estimates of precision. That is, one needs estimates that are 'unconditional' on the selected model. Here the estimates are unconditional on any particular model, but conditional on the R models in the a priori set. A simple estimator of the unconditional variance of the maximum likelihood estimator of the parameter, from the selected (best) model, is

var(θ̄) = [ Σ wᵢ √( var(θ̂ᵢ | gᵢ) + (θ̂ᵢ − θ̄)² ) ]², where θ̄ = Σ wᵢ θ̂ᵢ and the sums run over i = 1, …, R,

and this represents a form of frequentist 'model averaging'.
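The computational chain described in Advances 2–4 (AIC, Δᵢ, Akaike weights, the model-averaged estimate and its unconditional variance) is only a few lines of arithmetic once the per-model summaries are in hand. A minimal sketch in Python with invented numbers for three hypothetical models (the log-likelihoods, parameter counts, estimates and variances below are purely illustrative, not from the paper):

```python
import math

# Hypothetical results for R = 3 candidate models: maximised log-likelihood,
# number of estimable parameters K, estimate of a parameter theta that is
# common to all models, and its conditional sampling variance.
models = [
    {"loglik": -120.3, "K": 2, "theta": 0.41, "var": 0.0016},
    {"loglik": -119.8, "K": 3, "theta": 0.38, "var": 0.0019},
    {"loglik": -119.7, "K": 4, "theta": 0.37, "var": 0.0023},
]

# Advance 2: AIC = -2 log(L) + 2K for each model
for m in models:
    m["aic"] = -2.0 * m["loglik"] + 2.0 * m["K"]

# Advance 3: rescale to Delta_i = AIC_i - min AIC, then Akaike weights w_i
best = min(m["aic"] for m in models)
for m in models:
    m["delta"] = m["aic"] - best
total = sum(math.exp(-m["delta"] / 2.0) for m in models)
for m in models:
    m["w"] = math.exp(-m["delta"] / 2.0) / total

# Advance 4: model-averaged estimate and unconditional variance
# (estimator of Buckland et al. 1997)
theta_bar = sum(m["w"] * m["theta"] for m in models)
var_uncond = sum(
    m["w"] * math.sqrt(m["var"] + (m["theta"] - theta_bar) ** 2)
    for m in models
) ** 2

se = math.sqrt(var_uncond)
print(theta_bar, se)                           # point estimate, unconditional SE
print(theta_bar - 2 * se, theta_bar + 2 * se)  # usual 95% confidence interval
```

With the Akaike weights in hand, the usual 95% confidence interval then uses the unconditional standard error rather than the standard error conditional on the selected best model.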
The notation θ̂ᵢ here means that the parameter θ is estimated on the basis of model gᵢ, but θ is a parameter in common to all R models (such as occurs with prediction).

This estimator, from Buckland et al. (1997), includes a term for the conditional sampling variance, given model gᵢ (denoted as var(θ̂ᵢ | gᵢ) here), and a variance component for model-selection uncertainty, (θ̂ᵢ − θ̄)². These variance components are multiplied by the Akaike weights, which reflect the degree of support or evidence for model i. The unconditional variance and its square root are appropriate measures of precision after model selection. The usual 95% confidence interval, θ̂ ± 2 se(θ̂), should be based on the unconditional variance. Alternatively, intervals can be based on log or logit transformations (Burnham et al. 1987), profile likelihoods (Royall 1997) or bootstrap methods (Efron and Tibshirani 1993). Burnham and Anderson (1998: chapter 5) provide a number of Monte Carlo results on achieved confidence-interval coverage when information-theoretic approaches are used on some moderately challenging data sets. Model averaging (see below) arises naturally when the unconditional variance is derived (Burnham and Anderson 1998: section 4.2.6).

Advance 5 – Multimodel inference (MMI)

Rather than base inferences on a single, selected best model
from an a priori set of models, inference can be based on the entire set of models (multimodel inference, or MMI). Such inferences can be made if a parameter, say θ, is in common over all models (as θᵢ in model gᵢ), or if the goal is prediction. Then, by using the weighted average for that parameter across models (i.e. θ̄ = Σ wᵢ θ̂ᵢ), we are basing point inference on the entire set of models. This approach has both practical and philosophical advantages (see Hoeting et al. 1999 for a discussion of model averaging in a Bayesian context). Where a model-averaged estimator can be used, it often has better precision and reduced bias compared with the estimator of that parameter from just the selected best model (Burnham and Anderson 1998: chapters 4 and 5).

Assessment of the relative importance of variables has often been based only on the best model (e.g. often selected using a stepwise testing procedure). Variables in that best
model are considered ‘important’, excluded variables are
considered not important. This is far too simplistic. Variable
importance can be refined by making inference from all the
models in the candidate set (see Burnham and Anderson
1998: 140—151). Akaike weights are summed for all models
containing a given predictor variable x]; we denote the result—
ant sum as w+(/'). For each variable considered we can com
pute its predictor weight. The predictor variable with the
largest predictor weight, W+(j), is estimated to be the most
important; the variable with the smallest sum is estimated to
be the least important predictor (as with all inferences there
is uncertainty about the inferred order of variable impor
tance). This procedure is superior to making inferences con
cerning the relative importance of variables based only on
the best model. This is particularly important when the sec
ond or third best model is nearly as well supported as the best
model, or when all models have nearly equal support. (There
are ‘design’ considerations about the set ofmodels to consid—
er when a goal is assessing the importance of predictor vari—
ables, we do not discuss these considerations here — the key
issue is one ofbalance of models with and without each var
iable.) Advance 6 4 Incorporation ofoverdispersion for count data Much of the statistical analysis in wildlife and ecology deals
with count data (e.g. capture–recapture), and overdispersion is a fact of life with such count data. When there is more variation than predicted by Poisson or multinomial probability distributions, the data are termed overdispersed (Agresti 1990: 42). A partial dependence in the count data most often underlies the overdispersion; however, parameter heterogeneity is another contributor to overdispersion. Kullback–Leibler-based model-selection and inference methods have been adapted to deal with overdispersion, based on ideas from quasi-likelihood methods and variance inflation (Wedderburn 1974). The usual models of count data implicitly assume a theoretical sampling variance. However, common violations of stochastic assumptions will lead to data more variable than assumed, and can do so without affecting structural aspects of the model. In this case, there is an overdispersion coefficient, c, such that c > 1 and actual variances are obtainable as c × theoretical variances. Typically with overdispersion c is only a little larger than 1, say 1 < c < 4; c is estimated on the basis of the data.

We denote the quasi-likelihood modifications to AIC and AICc as (Lebreton et al. 1992; see also Hurvich and Tsai 1995; Burnham and Anderson 1998)

QAIC = −2 loge(L(θ̂))/ĉ + 2K,

and

QAICc = −2 loge(L(θ̂))/ĉ + 2K + 2K(K+1)/(n−K−1) = QAIC + 2K(K+1)/(n−K−1).

When no overdispersion exists, ĉ = 1, so the formulae for QAIC and QAICc then reduce to AIC and AICc, respectively.
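These two criteria are direct to compute once ĉ is in hand; a small sketch (the function names and the numbers in the usage lines are ours, purely for illustration):

```python
def qaic(loglik, K, c_hat):
    """QAIC = -2 log(L)/c_hat + 2K; a c_hat below 1 is replaced by 1,
    as recommended in the text."""
    c = max(c_hat, 1.0)
    return -2.0 * loglik / c + 2.0 * K

def qaicc(loglik, K, n, c_hat):
    """Small-sample version: QAIC plus the 2K(K+1)/(n-K-1) correction.

    Note that K should count the estimated overdispersion parameter
    c_hat as one of the estimable parameters.
    """
    return qaic(loglik, K, c_hat) + 2.0 * K * (K + 1) / (n - K - 1)

# With c_hat = 1 the criteria reduce to AIC and AICc (illustrative values):
print(qaic(-150.0, 4, 1.0))       # -2(-150) + 2*4 = 308.0 = AIC
print(qaicc(-150.0, 4, 60, 1.0))  # 308.0 plus the correction 40/55
```

The `max(c_hat, 1.0)` guard mirrors the advice below that ĉ < 1 should not be used.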
We note that ĉ < 1 should not be used (use 1), and when ĉ is estimated it counts as a parameter and should be included in K, the number of estimable parameters in the model (this last point was not mentioned or done in Burnham and Anderson 1998 – an oversight). Only one estimate of ĉ should be used along with a set of models (varying ĉ over the models produces invalid results). Often there will be a global model wherein all other models are nested within the global model. Then we obtain ĉ from the goodness-of-fit chi-square statistic (χ²) for the global model and its degrees of freedom (d.f.):

ĉ = χ²/d.f.

More discussion and guidance on QAIC, ĉ and variance inflation using ĉ are given in Burnham and Anderson (1998).

An Example

Pistorius et al. (2000) evaluated age- and sex-dependent rates
inﬂation using 6 are given in Burnham and Anderson a%&. An Example Pistorius er al. (2000) evaluated age and sexdependent rates
of tag loss in southern elephant seals and used information
theoretic methods as the basis for data analysis and infer
ence. Specifically, they considered 4 models representing
rates of tag loss being either constant or age or sexdepend
ent. We will use their results and make a number of exten—
sions for illustrative purposes. We are not attempting to
present a reanalysis and reinterpretation of their data; in
stead, we wish only to show that additional steps might be
considered. Details of this study are contained in Pistorius et
al. (2000) and we assume the reader is familiar with this pa
per. We performed a goodness—offit test (essentially Test 2,
Burnham er al. 1987) on these data, partitioned by gender
and found evidence of overdispersion (X2 for males : 15 7.20,
d.f. = 77; x2 for females = 97.92, d.f. : 84; pooled x2 =
255.12, d.f. = 161). An estimate ofthe variance inﬂation fac—
tor was c‘ : 255.12/161 I 1.58. This estimate ofc may reﬂect Table 1. K. R Burnham and D. R. Anderson primarily heterogeneity rather than a lack of independence.
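Both the reported ĉ and the Δᵢ and Akaike weights given in Table 1 can be reproduced in a few lines from these numbers; a quick check (the model labels are our paraphrase of the table rows):

```python
import math

# Variance inflation factor from the pooled goodness-of-fit statistic
c_hat = 255.12 / 161
print(round(c_hat, 2))  # 1.58, as reported

# QAIC values for the four tag-loss models, as reported in Table 1
qaic = {
    "age-constant, sex-constant": 2341,
    "age-dependent, sex-constant": 2305,
    "age-constant, sex-dependent": 2343,
    "age-dependent, sex-dependent": 2302,
}

best = min(qaic.values())
delta = {m: q - best for m, q in qaic.items()}   # 39, 3, 41, 0
total = sum(math.exp(-d / 2.0) for d in delta.values())
weights = {m: math.exp(-d / 2.0) / total for m, d in delta.items()}

for m, w in weights.items():
    print(f"{m}: delta = {delta[m]}, w = {w:.2f}")
```

The ratio of the two largest weights equals exp(Δ/2) = exp(1.5) ≈ 4.5, consistent with the 0.82/0.18 ≈ 4.6 quoted below once the weights are rounded to two decimals.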
Interestingly, much of the lack of fit was attributed to a single cell in the data for both males and females (the same cell, by gender). Had these two cells been in line with what was expected from the general model, the estimate of the variance inflation factor would have been only 1.30.

QAIC, rather than AIC, was used for model selection, and model-based estimates of sampling variance should be multiplied by 1.58. Pistorius et al. (2000) used the bootstrap to get robust estimates of sampling variance, thus their estimates of precision should appropriately reflect the overdispersion. The results are summarised in Table 1 and provide substantial support for the model that allows tag loss to be both age- and sex-dependent. Support for the model with age-dependence, but not sex-dependence, is more limited; the evidence ratio for the best model versus the second-best model is 0.82/0.18 = 4.6.

Inference concerning tag loss could be made from the best
model, whereby tag loss is a function of both age and sex. Alternatively, model averaging could be used to allow a robust inference about the derived parameters shown in Table 3 of Pistorius et al. (2000). In this case, the model parameters are the βᵢ, and the age- and sex-dependent estimates of tag loss are derived from the βᵢ. Model averaging in this example would slightly reduce the difference in estimates of tag loss by gender, relative to those shown in the original paper. Clearly, there is essentially no support for the model whereby tag loss is independent of age and sex, or for the model where tag loss is only sex-dependent.

To measure the relative importance of variables, the wᵢ values can be summed for all models (only 2 here) with age-dependence and all models with sex-dependence. In the example, w₊(age) = 1, whereas w₊(sex) = 0.82, confirming that age is the more important variable in explaining tag loss in these seals. In this example, model-selection uncertainty was minor, as the data point to the model allowing both age- and sex-specific tag loss. Other examples where substantial model-selection uncertainty exists are given in Burnham and
Anderson (1998: chapter 5).

Table 1. Model selection statistics for the southern elephant seal data to estimate tag loss
See Pistorius et al. (2000)

Model                          −log(L)/ĉ   K^A   QAIC    Δᵢ   wᵢ
Age-constant, sex-constant     1,845       3     2,341   39   0.00
Age-dependent, sex-constant    1,815       4     2,305    3   0.18
Age-constant, sex-dependent    1,845       4     2,343   41   0.00
Age-dependent, sex-dependent   1,811       5     2,302    0   0.82

^A This total includes the estimation of the overdispersion parameter c.

Recommendations and Summary

There needs to be increased attention to separating those inferences that rest on a priori considerations from those resulting from some form of data dredging (see Mayo 1996). Essentially, no justifiable theory exists to estimate precision
(or test hypotheses, for those still so inclined) when data dredging has taken place. The theory (mis)used is for a priori analyses, assuming the model was the only one fitted to the data. This glaring fact is either not understood by practitioners and journal editors or is simply ignored. Two types of data dredging include (1) an iterative approach where patterns and differences observed after initial analysis are 'chased' by repeatedly building new models with these effects included, and (2) analysis of 'all possible models'. Data dredging is a poor approach to making reliable inferences about the sampled population, and both types of data dredging are best reserved for more exploratory investigations that probably should remain unpublished. The incorporation of a priori considerations is of paramount importance and, as such, editors, referees and authors should pay much closer attention to these issues and be wary of inferences obtained from post hoc data dredging.

At a conceptual level, reasonable data and a good model
allow a separation of ‘information’ and ‘noise’. Here, information relates to the structure of relationships, estimates of model parameters and components of variance. Noise then refers to the residuals: variation left unexplained. We can use the information extracted from the data to make proper inferences and achieve what Romesburg (1981) termed ‘reliable knowledge’. We want an approximating model that minimises information loss, I(f, g), and properly separates noise (non-information or entropy) from structural information. In a very important sense, we are not trying to model the data; instead, we are trying to model the information in
the data.

Information-theoretic methods are based on deep theory and are quite effective in making strong inferences from the analysis of empirical data. These methods are relatively simple to understand and practical to employ across a very large class of empirical situations and scientific disciplines.
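To illustrate how simple these computations are, here is a minimal sketch (the language and model labels are our choices; the QAICc values are those from the seal tag-loss table above) that turns QAICc values into Δi values, Akaike weights wi and the summed importance weights w+(age) and w+(sex):

```python
import math

# QAICc values from the seal tag-loss example (model labels are ours)
qaicc = {
    "age-constant, sex-constant": 2341.0,
    "age-dependent, sex-constant": 2305.0,
    "age-constant, sex-dependent": 2343.0,
    "age-dependent, sex-dependent": 2302.0,
}

best = min(qaicc.values())
delta = {m: q - best for m, q in qaicc.items()}          # Delta_i = QAICc_i - min QAICc
rel = {m: math.exp(-0.5 * d) for m, d in delta.items()}  # relative likelihoods exp(-Delta_i/2)
total = sum(rel.values())
w = {m: r / total for m, r in rel.items()}               # Akaike weights w_i

# Variable importance: sum the weights of all models containing each effect
w_age = sum(wi for m, wi in w.items() if m.startswith("age-dependent"))
w_sex = sum(wi for m, wi in w.items() if m.endswith("sex-dependent"))

print({m: round(wi, 2) for m, wi in w.items()})
print("w+(age) =", round(w_age, 2), " w+(sex) =", round(w_sex, 2))
```

This sketch reproduces the 0.18/0.82 split in the table, and summing the weights recovers w+(age) = 1.00 and w+(sex) = 0.82 as reported above.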
The methods are easy to compute by hand if necessary (assuming one has the parameter estimates θ̂i, the conditional variances var(θ̂i | gi), and the maximised log-likelihood values for each of the R candidate models from standard statistical software). Researchers can easily understand the heuristics and application of the information-theoretic methods presented here; we believe it is very important that people understand the methods they employ. Information-theoretic approaches should not be used unthinkingly; a good set of a priori models is essential, and this involves professional judgment and integration of the science of the issue into the model set.

Publication of results under the information-theoretic paradigm would typically have substantial material in the Methods section to discuss and fully justify the candidate models
in the set, whereas the Results section would typically
present a table showing AICC or QAICC, K, the maximised
logg(L), A, and w for each ofthe R models, followed by an ef 117 fective discussion of the scientific interpretation of the table
entries. Further material, including many examples, on infor
mationtheoretic methods can be found in recent books by
Burnham and Anderson (1998) and McQuarrie and Tsai
(1998). Akaike‘s collected works have been recently pub
lished (Parzen et al. 1997) and this book will be ofinterest to
the more quantitatively fit. An interesting application of the informationtheoretic
approach is in conflict resolution in applied aspects of ecology and environmental science (see Anderson et al. 1999 for a general protocol). Here, there are opposing parties in a technical controversy, and data are available that bear on the resolution of the disagreement. In such cases, models would be built to represent the position of each of the parties. For example, consider the case where there are three parties and each party has two models that represent its general position; thus there are R = 6 models in the set.
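Ranking the positions and scaling the evidence for each party follows the same arithmetic as for any model set. A minimal sketch, in which every AICc value and party label is invented purely for illustration:

```python
import math

# Hypothetical AICc values for R = 6 models, two per party
# (all labels and numbers here are invented for illustration)
aicc = {
    ("party A", "model A1"): 410.2,
    ("party A", "model A2"): 408.9,
    ("party B", "model B1"): 404.1,
    ("party B", "model B2"): 405.6,
    ("party C", "model C1"): 415.0,
    ("party C", "model C2"): 412.3,
}

best = min(aicc.values())
rel = {k: math.exp(-0.5 * (v - best)) for k, v in aicc.items()}
total = sum(rel.values())
weights = {k: r / total for k, r in rel.items()}  # Akaike weights over all 6 models

# Rank the models, then sum weights within each party as its weight of evidence
for (party, model), wt in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{party} {model}: w = {wt:.3f}")

evidence = {}
for (party, _), wt in weights.items():
    evidence[party] = evidence.get(party, 0.0) + wt
print(evidence)
```

With these fabricated values, party B's two models carry most of the total weight of evidence, which is exactly the kind of scaling of opposing positions the protocol envisages.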
Computation of AICc and Δi for each model would allow a ranking of the various positions (models), while the Akaike weights would allow a scaling of the weight of evidence for the opposing parties and their positions. To our knowledge, this approach has not yet been tried in a real controversy (but see Anderson et al. 2001).

Acknowledgments

Dr Peter Boveng (NMFS) provided a summary of the seal
tag-loss data for our use in computing goodness-of-fit tests and for arranging for our use of the data from senior author Pistorius. Dr Richard Barker and an anonymous referee offered valuable suggestions that allowed the manuscript to be improved.

References

Agresti, A. (1990). ‘Categorical Data Analysis.’ (John Wiley & Sons:
New York.)
Akaike, H. (1973). Information theory as an extension of the maximum likelihood principle. In ‘Second International Symposium on Information Theory’. (Eds B. N. Petrov and F. Csaki.) pp. 267–281. (Akademiai Kiado: Budapest.)
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control AC-19, 716–723.
Akaike, H. (1977). On entropy maximization principle. In ‘Applications of Statistics’. (Ed. P. R. Krishnaiah.) pp. 27–41. (North Holland: Amsterdam.)
Akaike, H. (1981). Likelihood of a model and information criteria. Journal of Econometrics 16, 3–14.
Akaike, H. (1983a). Statistical inference and measurement of entropy. In ‘Scientific Inference, Data Analysis, and Robustness’. (Eds G. E. P. Box, T. Leonard and C. F. Wu.) pp. 165–189. (Academic Press: London.)
Akaike, H. (1983b). Information measures and model selection. International Statistical Institute 44, 277–291.
Akaike, H. (1985). Prediction and entropy. In ‘A Celebration of Statistics’. (Eds A. C. Atkinson and S. E. Fienberg.) pp. 1–24. (Springer: New York.)
Akaike, H. (1992). Information theory and an extension of the maximum likelihood principle. In ‘Breakthroughs in Statistics. Vol. 1’. (Eds S. Kotz and N. L. Johnson.) pp. 610–624. (Springer-Verlag: London.)
Akaike, H. (1994). Implications of the informational point of view on the development of statistical science. In ‘Engineering and Scientific Applications. Vol. 3. Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach’. (Ed. H. Bozdogan.) pp. 27–38. (Kluwer Academic Publishers: Dordrecht, The Netherlands.)
Anderson, D. R., and Burnham, K. P. (1999). General strategies for the analysis of ringing data. Bird Study 46 (suppl.), S261–S270.
Anderson, D. R., Burnham, K. P., Franklin, A. B., Gutierrez, R. J., Forsman, E. D., Anthony, R. G., White, G. C., and Shenk, T. M. (1999). A protocol for conflict resolution in analyzing empirical data related to natural resource controversies. Wildlife Society Bulletin 27, 1050–1058.
Anderson, D. R., Burnham, K. P., and Thompson, W. L. (2000). Null hypothesis testing: problems, prevalence, and an alternative. Journal of Wildlife Management 64, 912–923.
Anderson, D. R., Burnham, K. P., and White, G. C. (2001). Kullback–Leibler information in resolving natural resource conflicts when definitive data exist. Wildlife Society Bulletin.
Azzalini, A. (1996). ‘Statistical Inference Based on the Likelihood.’ (Chapman and Hall: London.)
Berger, J. O. (1985). ‘Statistical Decision Theory and Bayesian Analysis.’ 2nd Edn. (Springer-Verlag: New York.)
Boltzmann, L. (1877). Über die Beziehung zwischen dem Hauptsatze der mechanischen Wärmetheorie und der Wahrscheinlichkeitsrechnung respective den Sätzen über das Wärmegleichgewicht. Wiener Berichte 76, 373–435.
Breiman, L. (1992). The little bootstrap and other methods for dimensionality selection in regression: X-fixed prediction error. Journal of the American Statistical Association 87, 738–754.
Buckland, S. T., Anderson, D. R., Burnham, K. P., and Laake, J. L. (1993). ‘Distance Sampling: Estimating Abundance of Biological Populations.’ (Chapman and Hall: London.)
Buckland, S. T., Burnham, K. P., and Augustin, N. H. (1997). Model selection: an integral part of inference. Biometrics 53, 603–618.
Burnham, K. P., and Anderson, D. R. (1998). ‘Model Selection and Inference: a Practical Information-Theoretic Approach.’ (Springer-Verlag: New York.)
Burnham, K. P., Anderson, D. R., White, G. C., Brownie, C., and Pollock, K. H. (1987). Design and analysis methods for fish survival experiments based on release–recapture. American Fisheries Society, Monograph No. 5. 437 pp.
Burnham, K. P., Anderson, D. R., and White, G. C. (1996). Meta-analysis of vital rates of the northern spotted owl. Studies in Avian Biology 17, 92–101.
Chamberlin, T. (1965). The method of multiple working hypotheses. Science 148, 754–759. [Reprint of 1890 paper in Science]
Cherry, S. (1998). Statistical tests in publications of The Wildlife Society. Wildlife Society Bulletin 26, 947–953.
Cover, T. M., and Thomas, J. A. (1991). ‘Elements of Information Theory.’ (John Wiley and Sons: New York.)
de Leeuw, J. (1992). Introduction to Akaike (1973) information theory and an extension of the maximum likelihood principle. In ‘Breakthroughs in Statistics. Vol. 1’. (Eds S. Kotz and N. L. Johnson.) pp. 599–609. (Springer-Verlag: London.)
Dennis, B. (1996). Should ecologists become Bayesians? Ecological Applications 6, 1095–1103.
Edwards, A. W. F. (1992). ‘Likelihood.’ Expanded Edn. (The Johns Hopkins University Press: Baltimore, Maryland.)
Efron, B., and Tibshirani, R. J. (1993). ‘An Introduction to the Bootstrap.’ (Chapman and Hall: New York.)
Ellison, A. M. (1996). An introduction to Bayesian inference for ecological research and environmental decision-making. Ecological Applications 6, 1036–1046.
Forster, M. R. (1995). Bayes or bust: the problem of simplicity for a probabilistic approach to confirmation. British Journal for the Philosophy of Science 46, 399–424.
Forster, M. R. (2000). Key concepts in model selection: performance and generalizability. Journal of Mathematical Psychology 44, 205–231.
Forster, M. R. (2001). The new science of simplicity. In ‘Simplicity, Inference and Econometric Modelling’. (Eds H. Keuzenkamp, M. McAleer and A. Zellner.) (Cambridge University Press.)
Forster, M. R., and Sober, E. (1994). How to tell simpler, more unified, or less ad hoc theories will provide more accurate predictions. British Journal for the Philosophy of Science 45, 1–35.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (1995). ‘Bayesian Data Analysis.’ (Chapman and Hall: London.)
Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. (1999). Bayesian model averaging: a tutorial (with discussion). Statistical Science 14, 382–417.
Hurvich, C. M., and Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika 76, 297–307.
Hurvich, C. M., and Tsai, C.-L. (1995). Model selection for extended quasi-likelihood models in small samples. Biometrics 51, 1077–1084.
Inman, H. F. (1994). Karl Pearson and R. A. Fisher on statistical tests: a 1935 exchange from Nature. The American Statistician 48, 2–11.
Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review 106, 620–630.
Johnson, D. H. (1999). The insignificance of statistical significance testing. Journal of Wildlife Management 63, 763–772.
Kullback, S., and Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics 22, 79–86.
Mayo, D. G. (1996). ‘Error and the Growth of Experimental Knowledge.’ (University of Chicago Press: London.)
McQuarrie, A. D. R., and Tsai, C.-L. (1998). ‘Regression and Time Series Model Selection.’ (World Scientific Press: Singapore.)
Nester, M. (1996). An applied statistician’s creed. Applied Statistics 45, 401–410.
Parzen, E., Tanabe, K., and Kitagawa, G. (Eds) (1998). ‘Selected Papers of Hirotugu Akaike.’ (Springer-Verlag: New York.)
Pistorius, P. A., Bester, M. N., Kirkman, S. P., and Boveng, P. L. (2000). Evaluation of age- and sex-dependent rates of tag loss in southern elephant seals. Journal of Wildlife Management 64, 373–380.
Romesburg, H. C. (1981). Wildlife science: gaining reliable knowledge. Journal of Wildlife Management 45, 293–313.
Royall, R. M. (1997). ‘Statistical Evidence: a Likelihood Paradigm.’ (Chapman and Hall: London.)
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal 27, 379–423, 623–656.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions (with discussion). Journal of the Royal Statistical Society, Series B 36, 111–147.
Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike’s criterion. Journal of the Royal Statistical Society, Series B 39, 44–47.
Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika 61, 439–447.
White, G. C., Burnham, K. P., and Anderson, D. R. (2001). Advanced features of program MARK. In ‘Integrating People and Wildlife for a Sustainable Future. Proceedings of the Second International ...