COMMUN. STATIST.—THEORY METH., 26(3), 525-546 (1997)

THE DISTRIBUTION OF COOK'S D STATISTIC

Keith E. Muller and Mario Chen Mok
Dept. of Biostatistics, CB#7400, University of North Carolina, Chapel Hill, North Carolina 27599

Key Words and Phrases: regression diagnostics; influence; residual analysis

ABSTRACT

Cook (1977) proposed a diagnostic to quantify the impact of deleting an
observation on the estimated regression coefficients of a General Linear Univariate
Model (GLUM). Simulations of models with Gaussian response and predictors
demonstrate that his suggestion of comparing the diagnostic to the median of the
F for overall regression captures an erratically varying proportion of the values. We describe the exact distribution of Cook's statistic for a GLUM with
Gaussian predictors and response. We also present computational forms, simple
approximations, and asymptotic results. A simulation supports the accuracy of the
results. The methods allow accurate evaluation of a single value or the maximum
value from a regression analysis. The approximations work well for a single value,
but less well for the maximum. In contrast, the cutpoint suggested by Cook
provides widely varying tail probabilities. As with all diagnostics, the data analyst
must use scientific judgment in deciding how to treat highlighted observations.

Copyright © 1997 by Marcel Dekker, Inc.

1. INTRODUCTION
1.1 Motivation

A wide variety of applications in the medical, social, and physical sciences
use regression models with continuous predictors. Often the predictors may
plausibly be assumed to follow a multivariate Gaussian distribution. For example,
a paleontologist may wish to model total skeleton length of fossils of a particular
species, as a function of sizes for a limited number of bones. Many diagnostics
have been suggested to aid in evaluating the validity of such models. Most research in regression diagnostics has centered on the impact of
deleting a single observation, with many different measures suggested. Cook
(1977) recommended evaluating the standardized shift in the vector of estimated
regression coefﬁcients. He suggested comparing the statistic to the median of the
F statistic for the test of all coefficients equal to zero. Such highlighted observations merit further examination in terms of their credibility and also their implications for the validity of the model assumptions.
Belsley, Kuh, and Welsch (1980, p. 28) and Cook and Weisberg (1982, p. 114)
discussed two alternatives for judging diagnostic statistics. Internal scaling
involves judging a value with respect to the distribution in the sample at hand.
External scaling involves judging a value with respect to the distribution that
might occur over repeated samples. Both principles have merit in data analysis.
A standard approach for a diagnostic with known sampling distribution, such
as studentized residuals, involves three steps. First, highlight observations by
reference to the sampling distribution. Second, investigate the highlighted
observations' values and roles in the analysis. Third, decide on the disposition of the
observation, in light of all knowledge about the data. Possible actions include
doing nothing, correcting a discovered error, or deleting an impossible value.
Data analysts first encountering p-values for regression diagnostics may hope
to use them for automatic elimination of observations. Sophisticated analysts use
the reference distributions to provide a common metric for the three step process
(highlight, investigate, decide). Kleinbaum, Kupper, and Muller (1988, p. 201), in
their introductory regression book, summarized their discussion of diagnostics by
stating: “One should be cautioned that deleting the most deviant observations will
in all cases slightly improve, and sometimes substantially improve, the ﬁt of the
model. One must be careful not to data snoop simply in order to polish the ﬁt of
the model by discarding troublesome data points.”
Although conceptually attractive to some observers, Cook's statistic has not
elicited universal enthusiasm. For example, Obenchain (1977) suggested ignoring
the statistic and concentrating on its two components, the residual and the
leverage. The difﬁculty in using the statistic stems from uncertainty as to what
cutpoint to use for highlighting troublesome observations. Our experience led us
to the belief that the statistic ﬂags only values already highlighted by residual
analysis. Unpublished simulations (Chen Mok, 1993) confirmed the impression.
The ability to compute quantiles for Cook's statistic based on Gaussian
predictors, described in §2, provides an accurate metric for the statistic and hence
allows the diagnostic to consistently highlight values worthy of further
examination. The new results in this paper also imply a framework and approach
for describing distributions and other properties of other diagnostics.

1.2 Related Earlier Work

Nearly all current regression texts consider regression diagnostics in some detail. Excellent book-length treatments include, in chronological order, Belsley,
Kuh and Welsch (1980), Cook and Weisberg (1982), Atkinson (1985), and
Chatterjee and Hadi (1986).
We consider two versions of the General Linear Univariate Model (GLUM)
with iid Gaussian errors. For each observational unit the predictors will be
assumed to be either a set of ﬁxed values or to follow a multivariate Gaussian
distribution. Sampson (1974) described the setting with fixed predictors as the conditional model, and the setting with Gaussian predictors as the unconditional
model. As detailed in §2, the distribution and interpretation of Cook's statistic
depend directly on the distribution of the predictors. See Jensen and Ramirez
(1996, 1997) for the distribution of Cook's statistic for fixed predictors.

2. DISTRIBUTION THEORY

2.1 Notation and Definitions

In this section we present many standard results for regression diagnostics.
Rather than cite a single source for each result, we recommend that the reader
consult any of the book-length treatments just cited. LaMotte (1994) provided a
“Rosetta Stone” for translating among the many names used for residuals.

A number of standard distributions must be considered. In general, indicate the cumulative distribution function (CDF) of the random variable $U$, which depends on parameters $\alpha_1$ through $\alpha_k$, as $F_U(t; \alpha_1 \ldots \alpha_k)$, with density $f_U(t; \alpha_1 \ldots \alpha_k)$ and $p$th quantile $F_U^{-1}(p; \alpha_1 \ldots \alpha_k)$. For notational convenience write the CDF of $U \mid V = v$ as $F_{U|v}(t; \alpha_1 \ldots \alpha_k)$. Resolution of conflict between random variable and matrix notation, and the random or fixed nature of a variable, will be specified when not obvious from context. Let $\mathcal{N}(\mu, \Sigma)$ indicate a multivariate Gaussian vector, with mean $\mu$, nonsingular covariance $\Sigma$, and CDF $\Phi(t; \mu, \Sigma)$. Most results in this paper involve $\chi^2$, $F$, or $\beta$ random variables (Johnson and Kotz, Chapter 17, 1970a; Chapters 24 and 26, 1970b). Let $\chi^2(\nu)$ indicate a central $\chi^2$ random variable on $\nu$ degrees of freedom, let $F(\nu_1, \nu_2)$ indicate a central $F$ random variable on $\nu_1$ and $\nu_2$ degrees of freedom, and similarly let $\beta(\nu_1, \nu_2)$ indicate a $\beta$ random variable, with support $(0, 1)$.

Most results for regression diagnostics concern fixed predictors, and hence
the conditional model described by Sampson (1974). In particular, consider
$$ \underset{N\times 1}{y} = \underset{N\times q}{X}\ \underset{q\times 1}{\beta} + \underset{N\times 1}{e}. \quad (2.1) $$
Let $y_i$ indicate the $i$th row of $y$, $X_i$ the $i$th row of $X$, and $e_i$ the $i$th row of $e$. Here $X$ contains fixed values, known conditionally on having designated the sampling units, $\beta$ contains fixed unknown values, and $F_e(t) = \Phi(t; 0, \sigma^2 I)$. Assume throughout that $N > q$ and that $X$ has full rank of $q$. Let $\nu = (N - q)$ indicate the error degrees of freedom. Indicate the usual estimators as
$$ \hat\beta = (X'X)^{-1}X'y, \quad (2.2) $$
$$ \hat\sigma^2 = y'(I - H)y/\nu. \quad (2.3) $$
Define
$$ H = X(X'X)^{-1}X', \quad (2.4) $$
the hat matrix, because $\hat y = Hy$ (Hoaglin and Welsch, 1978). Let $h_i$ indicate the $i$th diagonal element of $H$, the leverage for the $i$th observation:
$$ h_i = X_i(X'X)^{-1}X_i'. \quad (2.5) $$
Refer to
$$ \hat e = (y - \hat y) \quad (2.6) $$
as the vector of residuals. Note that
$$ F_{\hat e}(t) = \Phi\!\left[t; 0, \sigma^2(I - H)\right]. \quad (2.7) $$
In turn define the $i$th squared standardized residual as
$$ R_i^2 = \frac{\hat e_i^2}{\hat\sigma^2(1 - h_i)}. \quad (2.8) $$
Belsley, Kuh and Welsch (1980), Cook and Weisberg (1982) and Atkinson
(1985) reviewed the algebra of deletion and properties of residuals. Let $(-i)$ indicate deletion of the $i$th observation and index the $N$ statistics generated by doing so. Let $X_{(-i)}$ indicate the $(N-1) \times q$ matrix created by deleting the $i$th row, with corresponding leverage $h_{(-i)} = X_i(X_{(-i)}'X_{(-i)})^{-1}X_i'$. The process creates sets of $N$ estimates of $\beta$, $\{\hat\beta_{(-i)}\}$, predicted values, $\{\hat y_{(-i)} = X\hat\beta_{(-i)}\}$, residuals, $\{\hat e_{(-i)} = y_i - X_i\hat\beta_{(-i)}\}$, and variance estimates, $\{\hat\sigma^2_{(-i)}\}$. The resulting squared and standardized residual, the studentized residual, equals
$$ R_{(-i)}^2 = \frac{\hat e_{(-i)}^2}{\hat\sigma^2_{(-i)}(1 + h_{(-i)})} = \frac{\hat e_i^2}{\hat\sigma^2_{(-i)}(1 - h_i)} = R_i^2\left(\frac{\nu - 1}{\nu - R_i^2}\right), \quad (2.9) $$
with
$$ F_{R_{(-i)}^2|X}(t) = F_F(t; 1, \nu - 1). \quad (2.10) $$
Cook's statistic measures the standardized shift in predicted values and the shift in $\hat\beta$ due to deleting the $i$th observation:
$$ D_i = \frac{(\hat y_{(-i)} - \hat y)'(\hat y_{(-i)} - \hat y)}{q\,\hat\sigma^2} = \frac{(\hat\beta_{(-i)} - \hat\beta)'X'X(\hat\beta_{(-i)} - \hat\beta)}{q\,\hat\sigma^2}. \quad (2.11) $$
Furthermore
$$ D_i = R_i^2 \cdot \frac{h_i}{q(1 - h_i)} = R_i^2\,C_i. \quad (2.12) $$
Finding $d$ such that $\Pr\{D_i > d\} = \alpha$ would provide a metric for Cook's statistic.
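The identity (2.12) lends itself to a direct numerical check. The sketch below (Python with NumPy; the simulated data, sample sizes, and variable names are our own, not from the paper) computes $D_i$ once from the closed form $R_i^2 C_i$ and once from the deletion form (2.11):

```python
import numpy as np

rng = np.random.default_rng(0)
N, q = 30, 3                                # illustrative sizes (ours)
nu = N - q                                  # error degrees of freedom
X = np.column_stack([np.ones(N), rng.normal(size=(N, q - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T                       # hat matrix (2.4)
h = np.diag(H)                              # leverages h_i (2.5)
e = y - H @ y                               # residuals (2.6)
s2 = e @ e / nu                             # sigma-hat squared (2.3)
R2 = e**2 / (s2 * (1 - h))                  # squared standardized residuals (2.8)
C = h / (q * (1 - h))                       # C_i
D = R2 * C                                  # Cook's statistic via (2.12)

# Cook's statistic computed directly from (2.11) by deleting each observation
b = XtX_inv @ X.T @ y
D_direct = np.empty(N)
for i in range(N):
    Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
    b_i = np.linalg.solve(Xi.T @ Xi, Xi.T @ yi)
    D_direct[i] = (b_i - b) @ (X.T @ X) @ (b_i - b) / (q * s2)

assert np.allclose(D, D_direct)
```

Because (2.12) is an algebraic identity, the two computations agree to machine precision for any data set.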
This idea motivates the current work. The results also provide a test of whether a particular $D_i$ arose from the distribution of $D_i$ implied by the GLUM assumptions.
As highlighted in §1.1 and §4.3, the latter interpretation has more risks than benefits in practical use for the diagnostic setting.

2.2 The Distribution of Cook's Statistic for Fixed Predictors
For fixed predictors $C_i$ does not vary randomly. Hence, conditional on $X$,
$$ D_i = C_i \cdot R_i^2 = C_i \cdot \nu \cdot \beta[1/2, (\nu - 1)/2]. \quad (2.13) $$
Usually if $i \ne i'$ then $C_i \ne C_{i'}$. The value of $C_i$ does not vary randomly with fixed predictors, but does vary with the $i$th leverage, $h_i$, and hence typically varies
across sampling units. In order to provide a metric for judging Cook's statistic it would seem
natural to eliminate the heterogeneity between sampling units which occurs with
fixed predictors. However, doing so eliminates the variability due to $C_i$ and makes $D_i$ a simple multiple of $R_i^2$, with no distinct information. At least with predictor values assigned by the experimenter, Obenchain's (1977) preference for considering the leverages and residuals separately seems appealing. See Jensen
and Ramirez (1996, 1997) for a thorough treatment of fixed predictors.

2.3 The Distribution of Cook's Statistic for Gaussian Predictors

Theorem. Let $a_0 = [q(N-1)]^{-1}$, $a_1 = (q-1)N[q\nu(N-1)]^{-1}$, and $t_0 = \max(a_0, d/\nu)$. For $d > 0$ and Gaussian predictors
$$ \Pr\{D_i \le d\} = 1 - \int_{t_0}^{\infty} \Pr\left\{\beta\!\left(\frac{1}{2}, \frac{\nu-1}{2}\right) > \frac{d}{\nu t}\right\} f_{C_i}(t)\,dt, \quad (2.14) $$
with corresponding density
$$ f_{D_i}(d) = \int_{t_0}^{\infty} f_\beta\!\left(\frac{d}{\nu t}; \frac{1}{2}, \frac{\nu-1}{2}\right)(\nu t)^{-1} f_{C_i}(t)\,dt. \quad (2.15) $$
Here
$$ f_{C_i}(t) = \begin{cases} 0 & t < a_0 \\ f_F[(t - a_0)a_1^{-1}; q-1, \nu]\,a_1^{-1} & a_0 \le t. \end{cases} \quad (2.16) $$

Lemma 1. (Weisberg, 1985, p. 114) Conditional on knowing $X$ (fixed $X$),
$$ R_i^2/\nu \sim \beta[1/2, (\nu - 1)/2]. \quad (2.17) $$

Lemma 2. A leverage value from a model containing an intercept and
$(q-1)$ multivariate Gaussian predictors, with each row iid, equals a one-to-one function of an $F$ random variable.
Proof. Belsley, Kuh, and Welsch (1980, p. 66) proved that
$$ F_i = \frac{(h_i - 1/N)/(q-1)}{(1 - h_i)/\nu} \sim F(q-1, \nu). \quad (2.18) $$
Solving their result for $h_i$ yields
$$ h_i = \frac{F_i(q-1)/\nu + 1/N}{1 + F_i(q-1)/\nu}. \quad (2.19) $$

Lemma 3. With Gaussian predictors, $C_i = a_0 + a_1 F_i$, so that
$$ \Pr\{C_i \le t\} = \Pr\{a_0 + a_1 F_i \le t\} = \Pr\{F_i \le (t - a_0)/a_1\}, \quad (2.20) $$
and
$$ f_{C_i}(t) = \begin{cases} 0 & t < a_0 \\ f_F[(t - a_0)/a_1; q-1, \nu]\,a_1^{-1} & a_0 \le t. \end{cases} \quad (2.21) $$
Proof. For Gaussian predictors the expression in (2.19) for $h_i$ allows stating
$$ C_i = \frac{F_i(q-1)/\nu + 1/N}{q(1 - 1/N)} = a_0 + a_1 F_i. \quad (2.22) $$

Lemma 4. Let $X_* = XT$, with $T$ a full rank $q \times q$ matrix of constants.
Note that $T^{-\prime} = (T')^{-1} = (T^{-1})'$. Then $H$ does not vary due to this transformation of the predictors.
Proof. Observe that
$$ H = X(X'X)^{-1}X' = XT[T^{-1}(X'X)^{-1}T^{-\prime}]T'X' = XT(T'X'XT)^{-1}T'X' = X_*(X_*'X_*)^{-1}X_*'. \quad (2.23) $$

Corollary 4.1. $H$ does not vary due to the covariance matrix of iid random predictors.
Proof. Let $\Sigma_* = \Gamma\Gamma'$ indicate a factoring of the $(q-1) \times (q-1)$ covariance matrix of a row of random predictors, assumed full rank. Choosing
$$ T = \mathrm{Diag}(1, \Gamma) \quad (2.24) $$
corresponds to considering a new model with predictors $X_* = XT$. The model contains an intercept and $q-1$ random predictors, with $\Sigma_* = I$.

Corollary 4.2. $h_i$, $\hat e_i$, $\hat\sigma^2$, $R_i^2$, $C_i$, and $D_i$ do not vary due to full rank transformation of the predictors or the covariance matrix of random predictors.
Proof. Each quantity depends on $X$ only through elements of $H$.

Lemma 5. With Gaussian predictors $F_{R_{(-i)}^2|h_i}(t) = F_{R_{(-i)}^2|X}(t)$.
Proof. Consider $R_{(-i)}^2$ in terms of three pieces: $(1 - h_i)$, $\hat\sigma^2_{(-i)}$, and $\hat e_i^2$.
i) Obviously $(1 - h_i)$ depends on $X$ only through $h_i$.
ii) Conditional on $X$, $\hat\sigma^2_{(-i)}(\nu - 1)/\sigma^2 \sim \chi^2(\nu - 1)$, and does not depend on $X$.
iii) $F_{\hat e_i|X}(t) = \Phi[t; 0, (1 - h_i)\sigma^2]$ and therefore $F_{\hat e_i|X}(t) = F_{\hat e_i|h_i}(t)$.
iv) Conditional on $X$, by the nature of deletion $\hat e_i^2$ and $\hat\sigma^2_{(-i)}$ are statistically independent (LaMotte, 1994, example 1), so their joint distribution factors: $F_{\hat e_i^2, \hat\sigma^2_{(-i)}|X}(t_1, t_2) = F_{\hat e_i^2|X}(t_1)\,F_{\hat\sigma^2_{(-i)}|X}(t_2)$.
v) Combining i) through iv) completes the proof.

Corollary 5.1. With Gaussian predictors $F_{R_i^2|h_i}(t) = F_{R_i^2|X}(t)$.
Proof. Use the last line of (2.9) to write $R_i^2 = \nu[(\nu - 1)/R_{(-i)}^2 + 1]^{-1}$. Hence $R_i^2$ depends on $X$ only through $R_{(-i)}^2$, which depends on $X$ only through $h_i$.
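The algebra in (2.9) and Corollary 5.1 can likewise be verified numerically. In this sketch (NumPy; the simulated data and variable names are ours), the externally studentized residuals are computed by brute-force deletion and compared against the closed forms:

```python
import numpy as np

rng = np.random.default_rng(1)
N, q = 25, 4                                # illustrative sizes (ours)
nu = N - q
X = np.column_stack([np.ones(N), rng.normal(size=(N, q - 1))])
y = X @ rng.normal(size=q) + rng.normal(size=N)

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = y - H @ y
s2 = e @ e / nu
R2 = e**2 / (s2 * (1 - h))                  # internally standardized, squared (2.8)

# externally studentized residuals, squared, by explicit deletion
R2_del = np.empty(N)
for i in range(N):
    Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
    Hi = Xi @ np.linalg.solve(Xi.T @ Xi, Xi.T)
    ei = yi - Hi @ yi
    s2_i = ei @ ei / (nu - 1)               # variance estimate after deletion
    R2_del[i] = e[i]**2 / (s2_i * (1 - h[i]))

assert np.allclose(R2_del, R2 * (nu - 1) / (nu - R2))   # last form of (2.9)
assert np.allclose(R2, nu / ((nu - 1) / R2_del + 1))    # Corollary 5.1 inversion
```

Both identities are exact, so the assertions hold up to floating-point rounding.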
Corollary 5.2. With Gaussian predictors $F_{R_i^2|C_i}(t) = F_{R_i^2|X}(t)$.
Proof. $C_i = h_i/[q(1 - h_i)]$ and hence depends on $X$ only through $h_i$.

Proof of the Theorem. Use the law of total probability to state
$$ \Pr\{D_i > d\} = \int_0^{\infty} \Pr\{(R_i^2 \mid C_i = t) > d/t\}\,f_{C_i}(t)\,dt. \quad (2.25) $$
Equation (2.17) describes the distribution function of $R_i^2$ conditional on $X$, which equals the distribution of $R_i^2$ conditional on $C_i$, by Corollary 5.2. Combining the distribution in (2.17) with (2.25) allows concluding that
$$ \Pr\{D_i > d\} = \begin{cases} \displaystyle\int_{a_0}^{\infty} \Pr\left\{\beta\!\left(\tfrac{1}{2}, \tfrac{\nu-1}{2}\right) > \frac{d}{\nu t}\right\} f_{C_i}(t)\,dt & 0 < d < a_0\nu \\[2ex] \displaystyle\int_{d/\nu}^{\infty} \Pr\left\{\beta\!\left(\tfrac{1}{2}, \tfrac{\nu-1}{2}\right) > \frac{d}{\nu t}\right\} f_{C_i}(t)\,dt & a_0\nu < d. \end{cases} \quad (2.26) $$
Note that $t_0 = \max(a_0, d/\nu)$ and simplify. Finding the density requires differentiating each form in (2.26) separately, and recognizing that the lower limit depends on $d$. The two apparently distinct forms reduce to a single one upon noting that $f_\beta[1; 1/2, (\nu-1)/2] = 0$.

2.4 Computational Forms for Numerical Integration

Although tantalizing in form, the integral for the CDF of $D_i$ does not allow
closed form integration. Numerical integration allows accurate and convenient
computation of $\Pr\{D_i > d\}$. Both functions in the integral require careful consideration in order to produce a form amenable to computation. Among various forms considered, the ones used here provide the simplest proofs and least computational time for any level of accuracy, except perhaps for small values of $\Pr\{D_i \le d\}$. Interest usually centers on large values of $\Pr\{D_i \le d\}$.
Two distinct representations create a finite region of integration, which
greatly simplifies numerical integration. First express the density of $C_i$ in terms of an $F$. If $u = (t - a_0)/a_1$, so that $t = a_1 u + a_0$ and $u_0 = (t_0 - a_0)/a_1$, then
$$ \Pr\{D_i > d\} = \int_{u_0}^{\infty} \Pr\left\{\beta\!\left(\frac{1}{2}, \frac{\nu-1}{2}\right) > \frac{d}{\nu(a_1 u + a_0)}\right\} f_F(u; q-1, \nu)\,du, \quad (2.27) $$
or equivalently
$$ \Pr\{D_i > d\} = \int_{u_0}^{\infty} \Pr\left\{F(1, \nu-1) > \frac{\nu-1}{(a_1 u + a_0)\nu/d - 1}\right\} f_F(u; q-1, \nu)\,du. \quad (2.28) $$
The relationship of $F$ and $\beta$ random variables allows creating a finite region of integration. If $z = (q-1)u[\nu + (q-1)u]^{-1}$ then $u = \nu(q-1)^{-1}z(1-z)^{-1}$ and $z_0 = (q-1)u_0[\nu + (q-1)u_0]^{-1}$. Also let
$$ g(z) = \frac{\nu - 1}{[a_1\nu(q-1)^{-1}z(1-z)^{-1} + a_0]\nu/d - 1}. \quad (2.29) $$
With this transformation
$$ \Pr\{D_i > d\} = \int_{z_0}^{1} \Pr\{F(1, \nu-1) > g(z)\}\,f_\beta\!\left(z; \frac{q-1}{2}, \frac{\nu}{2}\right)dz. \quad (2.30) $$
A second useful representation results from applying the transformation $w = u/(1 + u)$ to the integral in (2.28). With $w_0 = u_0/(1 + u_0)$ and
$$ h(w) = \frac{\nu - 1}{[a_1 w(1-w)^{-1} + a_0]\nu/d - 1}, \quad (2.31) $$
it follows that
$$ \Pr\{D_i > d\} = \int_{w_0}^{1} \Pr\{F(1, \nu-1) > h(w)\}\,f_F\!\left[\frac{w}{1-w}; q-1, \nu\right]\frac{1}{(1-w)^2}\,dw. \quad (2.32) $$

2.5 Approximations

Equation (2.27) allows recognizing that $\Pr\{D_i > d\}$ equals the expected
value of a function of a random variable whenever $t_0 = a_0$. For fixed $q$, $\lim_{N\to\infty} d/\nu = \lim_{N\to\infty} a_0 = 0$. Consequently the expected value interpretation holds, at least asymptotically, in all cases. The accuracy of a series based on treating the integral as an expected value depends both on the remainder term and on any discrepancy due to $d/\nu > a_0$.

Creating a two term Taylor's series approximation for (2.30) involves noting that $\mathcal{E}\,\beta[(q-1)/2, \nu/2] = (q-1)/(\nu + q - 1)$. Ignoring any discrepancy due to $d/\nu > a_0$ yields
$$ \Pr\{D_i > d\} \approx \Pr\left\{F(1, \nu-1) > \frac{\nu-1}{(a_1 + a_0)\nu/d - 1}\right\}. \quad (2.33) $$
Applying a series expansion for an $F$ random variable, using (2.27) or (2.28), requires $\nu > 2k$ to insure a finite $k$th moment. If $\nu > 2$ then $\mathcal{E}\,F[(q-1), \nu] = \nu/(\nu - 2)$ and, ignoring discrepancy due to $d/\nu > a_0$, a two term series equals
$$ \Pr\{D_i \le d\} \approx \Pr\left\{F(1, \nu-1) \le \frac{\nu-1}{[a_1\nu/(\nu-2) + a_0]\nu/d - 1}\right\}. \quad (2.34) $$
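As an illustration of how the exact probability and these approximations compare (a sketch using SciPy; the sample sizes, seed, and function names are our own, not from the paper), the integral (2.28) can be evaluated with `scipy.integrate.quad`, the two term approximation follows (2.34), and a small Monte Carlo run with Gaussian predictors provides a rough cross-check of the exact form:

```python
import numpy as np
from scipy import stats, integrate

q, N = 3, 100                               # illustrative sizes (ours)
nu = N - q
a0 = 1.0 / (q * (N - 1))
a1 = (q - 1) * N / (q * nu * (N - 1))

def exact_cdf(d):
    """Pr{D_i <= d} by numerical integration of the F form (2.28)."""
    u0 = (max(a0, d / nu) - a0) / a1
    def integrand(u):
        thresh = (nu - 1) / ((a1 * u + a0) * nu / d - 1.0)
        return stats.f.sf(thresh, 1, nu - 1) * stats.f.pdf(u, q - 1, nu)
    return 1.0 - integrate.quad(integrand, u0, np.inf)[0]

def approx_cdf(d, m):
    """Two term approximation: m = nu/(nu-2) gives (2.34)."""
    return stats.f.cdf((nu - 1) / ((a1 * m + a0) * nu / d - 1.0), 1, nu - 1)

# approximate 95th percentile from the quantile formula, then compare CDFs
m = nu / (nu - 2)
d95 = (a1 * m + a0) * nu / (1.0 + (nu - 1) / stats.f.ppf(0.95, 1, nu - 1))
print(exact_cdf(d95), approx_cdf(d95, m))

# Monte Carlo cross-check (D_i is invariant to beta and sigma,
# so simulate under beta = 0, sigma = 1 with Gaussian predictors)
rng = np.random.default_rng(3)
M, hits = 2000, 0
for _ in range(M):
    X = np.column_stack([np.ones(N), rng.normal(size=(N, q - 1))])
    y = rng.normal(size=N)
    G = np.linalg.inv(X.T @ X)
    h1 = X[0] @ G @ X[0]
    e = y - X @ (G @ X.T @ y)
    s2 = e @ e / nu
    D1 = (e[0]**2 / (s2 * (1 - h1))) * h1 / (q * (1 - h1))
    hits += D1 > d95
print(hits / M)
```

For this configuration the approximate and exact CDF values at the approximate 95th percentile differ by only a few hundredths, consistent with the paper's claim that the approximations work well for a single value.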
For $\nu \le 2$, a one term $F$ based expansion about the number 1 yields
$$ \Pr\{D_i \le d\} \approx \Pr\left\{F(1, \nu-1) \le \frac{\nu-1}{(a_1 + a_0)\nu/d - 1}\right\}, \quad (2.35) $$
which corresponds to the two term expansion for the $\beta$ representation (2.33). The approximate probability of (2.34) will never be greater than that of (2.35). The probability approximations imply approximations for quantiles of $D_i$:
$$ \hat d_p = (a_1 m + a_0)\,\nu\left[1 + \frac{\nu-1}{F_F^{-1}(p; 1, \nu-1)}\right]^{-1}. \quad (2.36) $$
Here $m = \nu/(\nu-2)$ for (2.34), or $m = 1$ for (2.35). Assigning $m$ the value of the median, $F_F^{-1}(.50; q-1, \nu)$, or mode, $\nu(q-3)/[(q-1)(\nu+2)]$ for $q > 3$, also provides a one term approximation.

One convenient form for creating a long series arises from (2.28):
$$ \Pr\{D_i > d\} = \int_{u_0}^{\infty} \Pr\left\{F(1, \nu-1) > \frac{\nu-1}{(a_1 u + a_0)\nu/d - 1}\right\} f_F(u; q-1, \nu)\,du \quad (2.37) $$
$$ = \int_{u_0}^{\infty} \Pr\left\{F(\nu-1, 1) \le \frac{(a_1 u + a_0)\nu/d - 1}{\nu-1}\right\} f_F(u; q-1, \nu)\,du = \int_{u_0}^{\infty} \Pr\{F(\nu-1, 1) \le c_1 u + c_0\}\,f_F(u; q-1, \nu)\,du = \int_{u_0}^{\infty} P(u)\,f_F(u; q-1, \nu)\,du, $$
with $c_1 = a_1\nu/[d(\nu-1)]$ and $c_0 = (a_0\nu/d - 1)/(\nu-1)$. In turn
$$ P^{(0)}(u) = \int_0^{c_1 u + c_0} f_F(s; \nu-1, 1)\,ds \quad (2.38) $$
$$ P^{(1)}(u) = c_1 f_F(c_1 u + c_0; \nu-1, 1) $$
$$ P^{(k)}(u) = c_1^k f_F^{(k-1)}(c_1 u + c_0; \nu-1, 1). $$

2.6 Large Sample Properties

The behavior of $D_i$ in large samples merits separate consideration. The
results have both analytic and computational value. Rather than study $D_i$ directly, consider $D_{*i} = \nu D_i$. Then
$$ \Pr\{D_{*i} > d_*\} = \Pr\{\nu D_i > d_*\} = \Pr\{D_i > d_*/\nu\} = \Pr\{D_i > d\}, \quad (2.39) $$
with $d = d_*/\nu$. Using (2.28) the distribution function for $D_{*i}$ may be expressed as
$$ \Pr\{D_{*i} > d_*\} = \int_{u_{0*}}^{\infty} \Pr\{F(1, \nu-1) > s_*(d_*/\nu, u)\}\,f_F(u; q-1, \nu)\,du, \quad (2.40) $$
with
$$ s_*(d_*/\nu, u) = \frac{\nu-1}{(a_1 u + a_0)\nu^2/d_* - 1}, \quad (2.41) $$
$u_{0*} = [t_0(d_*/\nu) - a_0]/a_1$, and $t_0(d_*/\nu) = \max(a_0, d_*/\nu^2)$.

Consider $D_{*i}$ as $N \to \infty$. In that case
$$ \lim_{N\to\infty} s_*(d_*/\nu, u) = \frac{d_* q}{u(q-1) + 1}. \quad (2.42) $$
That $\lim_{N\to\infty} a_0 = 0$ and $\lim_{N\to\infty} d_*/\nu^2 = 0$ combine to imply $\lim_{N\to\infty} u_{0*} = 0$. Therefore
$$ \lim_{N\to\infty} \Pr\{D_{*i} > d_*\} = \int_0^{\infty} \Pr\left\{\chi^2(1) > \frac{d_* q}{u(q-1) + 1}\right\}(q-1)\,f_{\chi^2}[(q-1)u; q-1]\,du. \quad (2.43) $$
Let $w = (q-1)u$, so that $dw = (q-1)\,du$. Then
$$ \lim_{N\to\infty} \Pr\{D_{*i} > d_*\} = \int_0^{\infty} \Pr\left\{\chi^2(1) > \frac{d_* q}{w + 1}\right\} f_{\chi^2}(w; q-1)\,dw. \quad (2.44) $$
A Taylor's series about $\mathcal{E}\,W = (q-1)$ yields the two term approximation
$$ \Pr\{D_{*i} > d_*\} \approx \Pr\{\chi^2(1) > d_*\}. \quad (2.45) $$
Also, with $d = d_*/\nu$, for large $N$
$$ \Pr\{D_i \le d\} \approx \Pr\{\chi^2(1) \le \nu d\}, \quad (2.46) $$
with corresponding quantile approximation
$$ \hat d_p \approx F_{\chi^2}^{-1}(p; 1)/\nu. \quad (2.47) $$
The $F$ based approximation in (2.36) provides more accuracy, except in large samples. Additional terms are required for the approximation to vary with $q$.

Three conclusions follow. First, as $N$ increases $D_i$ converges to a degenerate random variable with all mass at zero. Second, $D_{*i}$ converges to a nondegenerate random variable. Third, calculations of quantiles in terms of $D_{*i}$ can greatly reduce numerical difficulties with large samples.
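The limiting behavior can be checked numerically as well (again a sketch with SciPy; the parameter choices are ours). The exact tail probability of $D_{*i}$ from (2.40)-(2.41) is compared with the limiting integral (2.44) and with the simple two term value from (2.45):

```python
import numpy as np
from scipy import stats, integrate

q, N = 3, 2000                              # a large sample (sizes ours)
nu = N - q
a0 = 1.0 / (q * (N - 1))
a1 = (q - 1) * N / (q * nu * (N - 1))

def exact_tail_star(d_star):
    """Pr{D*_i > d*} from (2.40)-(2.41), i.e. (2.28) with d = d*/nu."""
    d = d_star / nu
    u0 = (max(a0, d / nu) - a0) / a1
    def integrand(u):
        s_star = (nu - 1) / ((a1 * u + a0) * nu**2 / d_star - 1.0)
        return stats.f.sf(s_star, 1, nu - 1) * stats.f.pdf(u, q - 1, nu)
    return integrate.quad(integrand, u0, np.inf)[0]

def limit_tail_star(d_star):
    """Limiting value (2.44) as N goes to infinity."""
    def integrand(w):
        return stats.chi2.sf(d_star * q / (w + 1.0), 1) * stats.chi2.pdf(w, q - 1)
    return integrate.quad(integrand, 0.0, np.inf)[0]

d_star = stats.chi2.ppf(0.95, 1)            # (2.45) would give tail 0.05 here
print(exact_tail_star(d_star), limit_tail_star(d_star))
```

At $N = 2000$ the exact tail and the limiting integral essentially coincide, while the $\chi^2(1)$ two term value (2.45) is a cruder approximation that does not vary with $q$, as noted above.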
2.7 The Maximum of N Values of Cook's Statistic