CHAPTER 4
DIAGNOSTICS FOR INFLUENTIAL OBSERVATIONS

Influential observations are observations whose presence in the data can have a distorting effect on the parameter estimates and possibly the entire analysis, e.g. identifying the wrong model.

Distinction from outliers, though it is possible for one observation to be both influential and an outlier.

Outliers:
1. Data points that contain unusual dependent ($y$) values.
2. Outlying independent ($x$) values, not indicating lack of fit of the model, but some observations still influence the fit more than others.

Detection: In simple linear regression, usually easy from plots of the data, but in multiple regression, more formal measures are required.

Figure 4.1. Three least squares lines fitted to sample data, where the observation at $x = 8$ is allowed to move between the three points A, B and C. The corresponding least squares fits are the solid, dashed and dotted lines respectively.

The hat matrix

Recall $\hat{Y} = HY$, $H = X(X^T X)^{-1} X^T$, so the covariance matrix of $\hat{Y}$ is

$$\mathrm{Var}\{\hat{Y}\} = H\sigma^2.$$

The variance of $\hat{y}_i$ is $h_{ii}\sigma^2$, and the variance of the $i$th residual $e_i$ is $(1 - h_{ii})\sigma^2$.

Properties of the $\{h_{ii}\}$ values include

$$0 \le h_{ii} \le 1 \quad \text{for all } i, \tag{1}$$
$$\sum_i h_{ii} = p. \tag{2}$$

Property (1) follows simply from the fact that both $h_{ii}\sigma^2$ and $(1 - h_{ii})\sigma^2$ are the variances of random quantities, and therefore are nonnegative. For property (2), note that $\mathrm{tr}(H) = \mathrm{tr}\{(X^T X)^{-1} X^T X\} = \mathrm{tr}(I_p) = p$.

Leverage

A data point with large $h_{ii}$ is called a point of high leverage. How high is high? By (2), the average value of $h_{ii}$ is $p/n$. A standard criterion is to call any data point for which $h_{ii} > 2p/n$ a point of high leverage. Note that since $h_{ii}$ is a function of $X$ only, it has no distribution, and thus there is no formal test.

Example: Consider the artificial data of Fig. 4.1. The twelve $x$ values here are $0, 0.2, 0.4, \ldots, 1.8, 2, 8$. The corresponding $h_{ii}$ values are $0.1342, 0.1221, \ldots, 0.0869, 0.9182$.
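The leverage values quoted for this example can be checked numerically. The following is a minimal sketch in Python with NumPy, not part of the original notes. It assumes a straight-line model with an intercept (so $p = 2$) and takes the first $x$ value to be 0; the variable names (`x`, `X`, `H`, `h`, `high_leverage`) are illustrative.

```python
import numpy as np

# Twelve x values from the example: 0, 0.2, ..., 2.0, plus the isolated point at 8.
# Taking the first value as 0 is an assumption; it reproduces the quoted h_ii.
x = np.append(np.arange(11) * 0.2, 8.0)

# Design matrix for a straight-line fit with intercept, so p = 2 (assumed model).
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

# Hat matrix H = X (X^T X)^{-1} X^T; its diagonal holds the leverages h_ii.
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)

# Apply the 2p/n rule of thumb for high leverage.
threshold = 2 * p / n
high_leverage = np.flatnonzero(h > threshold)

print(np.round(h, 4))
print("sum of h_ii:", round(h.sum(), 6), "| threshold 2p/n:", round(threshold, 4))
print("high-leverage indices (0-based):", high_leverage)
```

Under these assumptions the computed diagonal agrees with the quoted values to four decimals, the leverages sum to $p = 2$ as in property (2), and only the point at $x = 8$ exceeds the $2p/n = 1/3$ cutoff. No $y$ values are needed anywhere in the calculation.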
The last observation, corresponding to $x = 8$, is clearly highly influential. Intuitively, this is because if this point is moved up or down, the least squares straight line will tend to follow it: the overall least squares fit on the other 11 observations is not much affected by modest changes in the slope of the fitted straight line, but the fit at $x = 8$ has a big influence. Note that this has nothing to do with $y_{12}$ possibly being an outlier, since for any $i$, the actual value of $y_i$ does not even enter into the calculation of $h_{ii}$.

Real data examples from Chapter 3

Tree data: The highest $h_{ii}$ value is $h_{20} = 0.2428$ (diameter = 13.8, height = 64), not extreme for either independent variable, but it does correspond to a fairly large diameter combined with the second smallest height...
This note was uploaded on 11/17/2011 for the course STOR 664 taught by Professor Staff during the Fall '11 term at UNC.