Handout 4 Influence & outliers

Case Analysis: Influential data points and outliers
September 12, 2006

Introduction

The process of doing regression can be divided into four steps: specification, estimation, evaluation and criticism. So far, we have concentrated on the first three. Model specification refers to the process of writing down a model with the right predictor variables and the right functional forms of the relationships between the response variable ($y$) and the predictor variables ($X_1, X_2, \ldots, X_k$). Model estimation is the process whereby we arrive at estimates ($b_1, b_2, \ldots, b_k$) of the population parameters ($\beta_1, \beta_2, \ldots, \beta_k$), and model evaluation refers to the F-tests and t-tests used to construct confidence intervals and make decisions about hypotheses.

The fourth step is to plot the residuals and then use them to evaluate how well we meet the assumptions of the model. This step involves both the examination of specific cases (case analysis) and the analysis of patterns in the residuals (pattern analysis). In the following pages we discuss the first of the two aspects of model criticism. This discussion borrows from DeMaris (2004: pp. 218-223), Sanford Weisberg's Applied Linear Regression (2nd ed. 1985) and Neter, Kutner, Nachtsheim & Wasserman's Applied Linear Regression Models (3rd ed. 1990).

To illustrate the importance of model criticism, especially case analysis, Anscombe (1973) devised a set of 11 data points with the following plot and summary statistics:

[Figure: scatterplot of y against X for the first data set]

$y = 3.0 + 0.50X$, $R^2 = 66.8\%$

Source     df     SS      MS
Reg (X)     1    27.51   27.51
Error       9    13.76    1.53

But he was also able to create exactly the same prediction equation, R-square, and summary statistics with the following radically different data sets.

[Figure: scatterplots of the three other data sets]

The point Anscombe was making is that it isn't sufficient to look at the summary statistics to evaluate a model; you must also look at data plots, especially residual plots.
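Anscombe's point can be checked numerically. The sketch below fits a simple least-squares line to each of his four data sets and collects the quantities quoted above. The data values are the ones commonly reproduced from Anscombe (1973); they are not listed in this handout, so treat them as an assumption. The names `simple_ols` and `summarize` are hypothetical helpers introduced only for this illustration.

```python
# Anscombe's four data sets as commonly reproduced from Anscombe (1973).
# Assumption: these values are not printed in the handout itself.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
DATASETS = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (X4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                5.25, 12.50, 5.56, 7.91, 6.89]),
}


def simple_ols(x, y):
    """Fit y = b0 + b1*x by least squares and return summary statistics."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    syy = sum((yi - y_bar) ** 2 for yi in y)  # total sum of squares
    b1 = sxy / sxx                            # slope
    b0 = y_bar - b1 * x_bar                   # intercept
    ss_reg = b1 * sxy                         # regression sum of squares
    ss_err = syy - ss_reg                     # error sum of squares
    return {
        "b0": b0,
        "b1": b1,
        "r2": ss_reg / syy,
        "ss_reg": ss_reg,
        "ss_err": ss_err,
        "ms_err": ss_err / (n - 2),           # error df = n - 2
    }


def summarize():
    """Return the fitted summary for each of the four data sets."""
    return {name: simple_ols(x, y) for name, (x, y) in DATASETS.items()}


if __name__ == "__main__":
    for name, fit in summarize().items():
        print(
            f"{name:>3}: y = {fit['b0']:.2f} + {fit['b1']:.3f} X, "
            f"R^2 = {fit['r2']:.3f}, SS_reg = {fit['ss_reg']:.2f}, "
            f"SS_err = {fit['ss_err']:.2f}, MS_err = {fit['ms_err']:.2f}"
        )
```

Running the script prints one line per data set. The fitted lines and sums of squares agree to about two decimal places, even though scatterplots of the four sets look nothing alike.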
Notice that each of Anscombe's three alternative displays reflects a distinct problem: the first shows non-linearity, the second shows an outlier and the third shows an "influential" data point. In the third display, there isn't enough information in the data to construct a stable model; one data point completely determines the slope of the line.

In the paragraphs that follow, we study outliers and influential data points so as to address these two related questions:

1. How well does our model actually resemble the data?
2. Do any of the cases unduly influence the estimation of the prediction equation?

To address these questions, we develop the following four distinct themes:

1. The idea of leverage.
2. A measure of an "outlier" that takes leverage into account.
3. A test for outliers.
4. A distinct measure of influence.

Leverage

As illustrated in Anscombe's example, a data point that is horizontally distinct from the mass of data points in simple regression is important because it may strongly influence the slope of the regression line. In multiple regression, a point that is distinct from the vector of means $\bar{X}$ may be influential because it can dramatically alter the vector of estimates of the slopes, $b^T = (b_0, b_1, b_2, \ldots$