Model Checking and Refinement
Chapter 11

data analysis strategy
– identify questions of interest, review design of study and scope of inference
– explore data graphically
– fit model
– check model
– carry out inferences
– communicate results

Good Quote
“Multiple regression analysis takes time and care…going about the analysis in a proper order…transformations and outliers must be dealt with early on.” (page 304)

case study 1 – alcohol metabolism
observational study (with some aspects of a designed experiment)
subjects were volunteers: 18 women, 14 men
3 of the women and 5 of the men were alcoholics
response variable: first-pass alcohol metabolism
explanatory variables: gender (2-level factor), alcoholism (2-level factor), gastric alcohol dehydrogenase activity (numeric)

case study 1 – alcohol metabolism
questions:
– Does metabolism differ between women and men?
– Can the differences be completely explained by differences in gastric AD activity?
– Are the differences complicated by alcoholism?
note the observations with unusual values of gastric AD activity (an explanatory variable)

case study 1 – alcohol metabolism
[Figure: overlay plot of METABOL versus GASTRIC, with points grouped by SEX (FEMALE, MALE) and ALCOHOL (ALCOHOLIC, NONALCOHOLIC)]

case study 2 – blood-brain barrier
controlled experiment (with some aspects of an observational study)
34 rats NOT randomly assigned to combinations of sacrifice times and barrier disruption treatments
response variable: tumor-to-liver ratio of antibody concentration

case study 2 – blood-brain barrier
explanatory variables:
– design variables: sacrifice time (numeric), barrier disruption treatment (2-level factor)
– noncontrolled variables (or covariates): days post inoculation (numeric), sex (2-level factor), weight loss (numeric), tumor weight (numeric)

case study 2 – blood-brain barrier
questions:
– does the barrier disruption treatment work?
– if so, by how much does it increase the ratio?
– do the answers to these questions depend on sacrifice time or the covariates?
note the apparent outliers

case study 2 – blood-brain barrier
[Figure: scatterplot of log(brain/liver) versus logtime, with points grouped by TREAT (BD, NS)]

Residual plots
initial plots – specially coded scatterplots – are important in multiple regression, but there are problems
– coded plots are complicated and limited
– matrix plots don’t show adjusted relationships
residual plots are more important in multiple regression than in simple linear regression
– often simpler than, say, coded scatterplots
– display finer detail on assumptions, etc.
however, they require that an initial tentative model be selected

Initial model
choosing this is an art
“It is disadvantageous to start with too many or too few explanatory variables in the tentative model. With too few, outliers may appear simply because of omitted relationships. With too many (lots of interactions and quadratic terms …), the analyst risks overfitting the data – causing real outliers to be explained by complex, but meaningless … relationships.” (page 310)

Residual Plots
Residuals vs Fits – useful in assessing need for transformation, outliers, general lack-of-fit of the model
Residuals vs Time Order (when applicable)
Normal Probability Plot of Residuals
Partial Residual Plots (Leverage)

Partial Residual Plots
Partial residual plots were first developed by Wayne Larsen and Susan McCleary in a 1972 article in Technometrics
display the nature of the adjusted relationship between an explanatory variable and the response variable
a different partial residual plot for every explanatory variable in the model

Computing Partial Residuals
1. Store residuals
2. Create a new column = Residual + (βj * Xj)
3. Plot new column versus Xj
Note: regressing the partial residual on Xj yields a slope of βj
OR just look at the leverage plots in JMP

What does it mean?
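The recipe above (store residuals, add βj * Xj, plot against Xj) can be sketched in Python with NumPy. This is only an illustration on synthetic data: the variable names and the simulated model are assumptions, not anything from the case studies. It checks the note that regressing the partial residual on Xj returns the slope βj from the full model.

```python
import numpy as np

# Synthetic data (hypothetical): y depends on two correlated explanatory variables.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.3, size=n)

# Fit the full model y ~ 1 + x1 + x2 by least squares and store the residuals.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Partial residual for x1: residual + beta_j * x_j.
j = 1
partial_resid = residuals + beta[j] * X[:, j]

# Regress the partial residual on x_j alone (straight line with intercept).
slope, intercept = np.polyfit(X[:, j], partial_resid, 1)
print(f"beta_j from full model:           {beta[j]:.4f}")
print(f"slope of partial residual on x_j: {slope:.4f}")
```

Plotting partial_resid against X[:, j] gives the partial residual plot itself; the scatter around the fitted line is the full-model residual scatter.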
shows where the regression coefficient came from
– pictorial representation of the p-value and slope, adjusted for the other variables
– the tighter the fit around the partial residual line, the smaller the p-value in the original regression
displays curvature, unequal spreads, outliers, etc. for the adjusted relationship
suggests adding new explanatory variables like squared terms, or removing explanatory variables

Partial Residual Plots (Leverage) in JMP
automatically produced in JMP as Leverage Plots
e.g. El Niño storm data
[Figure: JMP leverage plot for year; y-axis: storms leverage residuals; x-axis: year leverage, P=0.0851]

Back to plot of residuals vs fitted values
helps identify outliers and evaluate transformations
e.g. alcohol metabolism data
usual course of action for apparent outliers
1. examine for recording error or contamination – if so, correct the error
2. see if transformation resolves the problem
3. if neither of these helps, examine to see if the apparent outliers actually influence the conclusions

Quotes Worth Noting
“Least squares regression is not resistant to outliers.” – page 313
but . . . “…removing an observation simply because it is influential is not justified.”

Influential Observations
the question about apparent outliers is whether to worry about them or not – whether they influence conclusions of the study
data points which have a disproportionate influence on the conclusions are called influential observations
– need to be identified and investigated
potentially influential observations can be identified and evaluated for actual influence using case influence statistics

Case Influence Statistics
1. Leverage (Hats)
measures how unusual an observation (case) is in terms of values of the explanatory variables (X’s)
observations with ‘large’ leverage values have the potential to be influential
crude guideline for ‘large’: > 2p/n (p = number of regression coefficients, n = number of cases)
text denotes this as h_i; JMP calls these “Hats”

Case Influence Statistics
2. Studentized Residual
ordinary residual divided by its estimated standard deviation
can be thought of as standardized distance of Y from its fitted value
observations with ‘large’ studentized residual values also have potential to be influential
crude guideline for ‘large’: > 2 or < -2

Case Influence Statistics
3. Cook’s Distance
measure of actual overall influence
based on omitting the observation and measuring the effect on the regression coefficients
observations with ‘large’ Cook’s distance are actually influential in general
crude guideline for ‘large’: > 1

Case Influence Statistics
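The three case statistics defined on the preceding slides can be computed directly from the hat matrix. The NumPy sketch below is an illustration on synthetic data, not the alcohol metabolism data; the function name case_influence and the simulated values are hypothetical, and the studentized residual is taken to be the internally studentized version (each residual divided by its own estimated standard deviation).

```python
import numpy as np

def case_influence(X, y):
    """Return leverage, studentized residuals, and Cook's distance.

    X must already include a column of ones for the intercept.
    """
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    # Leverage: h_i = x_i' (X'X)^{-1} x_i, the diagonal of the hat matrix.
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)           # residual mean square
    stud = resid / np.sqrt(sigma2 * (1 - h))   # residual / its estimated SD
    cooks = stud**2 * h / (p * (1 - h))        # Cook's distance
    return h, stud, cooks

# Synthetic data (hypothetical), with one case far out in x and off the line.
rng = np.random.default_rng(1)
x = rng.uniform(0, 3, size=30)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=30)
x[-1], y[-1] = 8.0, 5.0

X = np.column_stack([np.ones_like(x), x])
h, stud, cooks = case_influence(X, y)
n, p = X.shape

# Flag cases using the crude guidelines: h > 2p/n, |studres| > 2, Cook's D > 1.
flagged = np.flatnonzero((h > 2 * p / n) | (np.abs(stud) > 2) | (cooks > 1))
for i in flagged:
    print(f"case {i + 1}: h={h[i]:.2f}, studres={stud[i]:.2f}, CookD={cooks[i]:.2f}")
```

Plotting h, stud, and cooks against case number gives the kind of display recommended on this slide.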
1. Leverage (Hats)
2. Studentized residual
3. Cook’s distance
– Why are all three useful rather than just Cook’s distance?
– useful to plot these case statistics versus case number
– all three are readily available in JMP; must be stored and then plotted
– look at scenarios in Display 11.11

e.g. alcohol metabolism data
Overlay Plot
[Figure: METABOL versus GASTRIC, with points grouped by SEX (FEMALE, MALE) and ALCOHOL (ALCOHOLIC, NONALCOHOLIC)]
[Figure: scatterplot matrix of h METABOL, Studentized Resid METABOL, and Cook's D Influence METABOL versus SUBJECT]

Sensible Strategy for Influential Observations – Display 11.8
Do conclusions change if the case is omitted?
– N: Proceed with the case included.
– Y: Does the case belong to a different population?
  – Y: Omit the case and proceed.
  – N: Does the case have unusual explanatory variable values?
    – Y: Omit the case and proceed. Report conclusions for reduced range of X’s.
    – N: More clarification needed.

Strategy for Influential Observations
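The yes/no questions of Display 11.8 can be written as a small function. This is only a sketch: the function and argument names are hypothetical, and answering the three questions remains a judgment call for the analyst.

```python
def influential_case_action(conclusions_change: bool,
                            different_population: bool,
                            unusual_x_values: bool) -> str:
    """Walk the yes/no questions of the strategy in order."""
    if not conclusions_change:
        return "proceed with the case included"
    if different_population:
        return "omit the case and proceed"
    if unusual_x_values:
        return "omit the case; report conclusions for the reduced range of X's"
    return "more clarification needed"

# Conclusions change, same population, unusual explanatory variable values.
print(influential_case_action(True, False, True))
```

The call shown follows the branch that ends in omitting the case and restricting the range of the X’s, which is the outcome reported for cases 31 and 32 of the alcohol metabolism data.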
apply strategy to alcohol metabolism data
– leave out cases 31 and 32
– restrict inferences to people with gastric AD activity of 3 or lower
– extra-sum-of-squares F-statistic for gender and gender*gastric AD coefficients is 11.04 with 2 and 26 df (p-value = .0003)
there are gender differences beyond those due to gastric AD activity

Refining the model
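The extra-sum-of-squares F-test quoted above for the gender terms is also the usual tool for judging whether a group of terms can be dropped together. As a minimal sketch (the function names are hypothetical, and the closed-form tail area applies only when the numerator has exactly 2 df), the reported p-value can be checked from F = 11.04 on 2 and 26 df:

```python
def extra_ss_f(ss_res_reduced, ss_res_full, df_extra, df_full):
    """Extra-sum-of-squares F-statistic for comparing nested models.

    ss_res_reduced, ss_res_full: residual sums of squares of the two models
    df_extra: number of coefficients tested; df_full: residual df, full model
    """
    return ((ss_res_reduced - ss_res_full) / df_extra) / (ss_res_full / df_full)

def f_pvalue_2_numerator_df(f_stat, df_full):
    """Upper-tail p-value of an F(2, df_full) statistic.

    With exactly 2 numerator df the tail area has a closed form:
    P(F > f) = (1 + 2 f / df_full) ** (-df_full / 2).
    """
    return (1.0 + 2.0 * f_stat / df_full) ** (-df_full / 2.0)

# F-statistic and degrees of freedom as reported on the slide.
p = f_pvalue_2_numerator_df(11.04, 26)
print(f"p-value = {p:.4f}")
```

For other numerator degrees of freedom a general F-distribution routine would be needed in place of the closed form.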
Should you drop explanatory variables if their coefficients are not significantly different from 0?
– if the p-value is quite large and the variable (and its coefficient) is not essential to answering the scientific questions, yes
– otherwise, no

Additional topics
weighted regression – for dealing with nonconstant variance
measurement error – methods for dealing with situations in which the explanatory variables are measured with error

Weighted Regression
“Although nonconstant variance can sometimes be corrected by a transformation of the response, in many situations it cannot.”
“If enough information is known about the form of the nonconstant variance, the method of weighted least squares may be used.”

Example Pinewood Derby
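Weighted least squares with weights 1/(std dev)^2, the weighting used in this example, can be sketched for a single explanatory variable as follows. The data are hypothetical stand-ins, not the Pinewood Derby measurements, and the function name is an assumption.

```python
def weighted_line_fit(x, y, sd):
    """Weighted least squares fit of y = a + b*x with weights 1/sd**2."""
    w = [1.0 / s**2 for s in sd]
    sw = sum(w)
    # Weighted means of x and y.
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    # Weighted cross-product and sum of squares about the weighted means.
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope

# Hypothetical data: the last two responses are much noisier than the rest.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.0, 9.5, 8.0]
sd = [0.1, 0.1, 0.1, 2.0, 2.0]

a, b = weighted_line_fit(x, y, sd)
print(f"intercept = {a:.3f}, slope = {b:.3f}")
```

Observations with large standard deviations get small weights, so the fitted line is determined mostly by the precisely measured cases.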
It is clear that the standard deviations of the car weights are drastically different.
We use regression weights = 1/(std dev)^2
Note – it is a coincidence that we introduce weighted regression with an example where one of the terms is weights.

Measurement Errors in X’s
Case I: Only prediction matters (individual coefficients are not relevant) – NO PROBLEM
Case II: Individual coefficients important – CONSULT YOUR LOCAL STATISTICIAN!
This note was uploaded on 11/30/2011 for the course STAT 380 taught by Professor Stevens during the Spring '11 term at Brigham Young University, Hawaii.