511_Chapter_11 - Model Checking and Refinement...



Model Checking and Refinement
Chapter 11

data analysis strategy
- identify questions of interest, review design of study and scope of inference
- explore data graphically
- fit model
- check model
- carry out inferences
- communicate results

Good Quote
“Multiple regression analysis takes time and care…going about the analysis in a proper order…transformations and outliers must be dealt with early on.” page 304

case study 1 - alcohol metabolism
- observational study (with some aspects of a designed experiment)
- subjects were volunteers: 18 women, 14 men
- 3 of the women, 5 of the men were alcoholics
- response variable: first pass alcohol metabolism
- explanatory variables: gender (2-level factor), alcoholism (2-level factor), gastric alcohol dehydrogenase activity (numeric)
- questions:
  - Does metabolism differ between women and men?
  - Can the differences be completely explained by differences in gastric AD activity?
  - Are the differences complicated by alcoholism?
note the observations with unusual values of gastric AD activity (an explanatory variable)

[Figure: overlay plot of METABOL versus GASTRIC; groups: SEX=FEMALE/ALCOHOL=ALCOHOLIC, SEX=FEMALE/ALCOHOL=NON-ALCOHOLIC, SEX=MALE/ALCOHOL=ALCOHOLIC, SEX=MALE/ALCOHOL=NON-ALCOHOLIC]

case study 2 - blood-brain barrier
- controlled experiment (with some aspects of an observational study)
- 34 rats NOT randomly assigned to combinations of sacrifice times and barrier disruption treatments
- response variable: tumor-to-liver ratio of antibody concentration
- explanatory variables:
  - design variables: sacrifice time (numeric), barrier disruption treatment (2-level factor)
  - non-controlled variables (or covariates): days post inoculation (numeric), sex (2-level factor), weight loss (numeric), tumor weight (numeric)
- questions:
  - does the barrier disruption treatment work?
  - if so, by how much does it increase the ratio?
  - do the answers to these questions depend on sacrifice time or the covariates?

note the apparent outliers

[Figure: log(brain/liver) versus logtime; groups: TREAT=BD, TREAT=NS]

Residual plots
- initial plots (specially coded scatterplots) are important in multiple regression, but there are problems
  - coded plots are complicated and limited
  - matrix plots don’t show adjusted relationships
- residual plots are more important in multiple regression than in simple linear regression
  - often simpler than, say, coded scatterplots
  - display finer detail on assumptions, etc.
- however, they require that an initial tentative model be selected

Initial model
choosing this is an art
“It is disadvantageous to start with too many or too few explanatory variables in the tentative model. With too few, outliers may appear simply because of omitted relationships.
With too many (lots of interactions and quadratic terms …), the analyst risks overfitting the data – causing real outliers to be explained by complex, but meaningless … relationships.” page 310

Residual Plots
- Residuals vs Fits: useful in assessing need for transformation, outliers, general lack-of-fit of the model
- Residuals vs Time Order (when applicable)
- Normal Probability Plot of Residuals
- Partial Residual Plots (Leverage)

Partial Residual Plots
- first developed by Wayne Larsen and Susan McCleary in a 1972 article in Technometrics
- display the nature of the adjusted relationship between an explanatory variable and the response variable
- a different partial residual plot for every explanatory variable in the model

Computing Partial Residuals
- Store residuals
- Create a new column = Residual + (βj * Xj)
- Plot new column versus Xj
- Note: regressing the partial residual on Xj yields a slope of βj
- OR just look at the leverage plots in JMP

What does it mean?
- shows where the regression coefficient came from
  - pictorial representation of the p-value and slope, adjusted for the other variables
  - the tighter the fit around the partial residual line, the smaller the p-value in the original regression
- displays curvature, unequal spreads, outliers, etc. for the adjusted relationship
- suggests adding new explanatory variables like squared terms, or removing explanatory variables

Partial Residual Plots (Leverage) in JMP
- automatically produced in JMP as Leverage Plots
- e.g. el nino storm data

[Figure: JMP leverage plot for year; storms leverage residuals versus year leverage, P=0.0851]

Back to plot of residuals vs fitted values
- helps identify outliers and evaluate transformations
- e.g. alcohol metabolism data
- usual course of action for apparent outliers
  1. examine for recording error or contamination; if found, correct the error
  2. see if transformation resolves the problem
  3. if neither of these helps, examine to see if the apparent outliers actually influence the conclusions

Quotes Worth Noting
“Least squares regression is not resistant to outliers.” – page 313
but . . .
“…removing an observation simply because it is influential is not justified.”

Influential Observations
- the question about apparent outliers is whether to worry about them or not, that is, whether they influence conclusions of the study
- data points which have a disproportionate influence on the conclusions are called influential observations
- these need to be identified and investigated
- potentially influential observations can be identified and evaluated for actual influence using case influence statistics

Case Influence Statistics
1. Leverage (Hats)
   - measures how unusual an observation (case) is in terms of values of the explanatory variables (X’s)
   - observations with ‘large’ leverage values have the potential to be influential
   - crude guideline for ‘large’: > 2p/n
   - text denotes this as h_i
   - JMP calls these “Hats”
2. Studentized Residual
   - ordinary residual divided by its estimated standard deviation
   - can be thought of as standardized distance of Y from its fitted value
   - observations with ‘large’ studentized residual values also have potential to be influential
   - crude guideline for ‘large’: > 2 or < -2
3. Cook’s Distance
   - measure of actual overall influence based on omitting the observation and measuring the effect on the regression coefficients
   - observations with ‘large’ Cook’s distance are actually influential in general
   - crude guideline for ‘large’: > 1

Using the three statistics (leverage, studentized residual, Cook’s distance)
- Why are all three useful rather than just Cook’s distance?
- useful to plot these case statistics versus case number
- all three are readily available in JMP; must be stored and then plotted
- look at scenarios in Display 11.11
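The three case influence statistics described above can be computed directly from the design matrix. The following is a minimal sketch in Python with NumPy, not the JMP procedure the slides use. The function name `case_influence`, the synthetic data, and the planted unusual case are all assumptions made for illustration. The studentized residual here is the internally studentized form (residual divided by its estimated standard deviation), and Cook's distance is computed from it and the leverage.

```python
import numpy as np


def case_influence(X, y):
    """Return leverage, studentized residuals, and Cook's distance.

    X is an (n, p) design matrix that already contains the intercept column.
    """
    n, p = X.shape
    # Leverage h_i is the i-th diagonal of the hat matrix; with X = QR it is
    # the sum of squares of row i of Q.
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)  # residual variance estimate
    # Ordinary residual divided by its estimated standard deviation.
    student = resid / np.sqrt(sigma2 * (1.0 - h))
    # Cook's distance written in terms of studentized residual and leverage.
    cooks = student**2 * h / (p * (1.0 - h))
    return h, student, cooks


# Synthetic data (hypothetical, not the alcohol metabolism data).
rng = np.random.default_rng(0)
n = 30
x = rng.uniform(1.0, 3.0, size=n)
y = 2.0 + 1.5 * x + rng.normal(0.0, 0.5, size=n)
# Planted unusual case: extreme x value with a response far from the trend.
x[-1], y[-1] = 8.0, 3.0

X = np.column_stack([np.ones(n), x])
n, p = X.shape
h, student, cooks = case_influence(X, y)

# Apply the crude guidelines: h > 2p/n, |studentized residual| > 2, Cook's D > 1.
flagged = np.flatnonzero((h > 2 * p / n) | (np.abs(student) > 2) | (cooks > 1))
for i in flagged:
    print(f"case {i + 1}: h={h[i]:.3f}, studentized={student[i]:.2f}, Cook's D={cooks[i]:.2f}")
```

Plotting `h`, `student`, and `cooks` against case number, as the slides suggest, shows which cases are merely unusual in X, which are merely far from the fit, and which actually move the coefficients.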
alcohol metabolism data

[Figure: overlay plot of METABOL versus GASTRIC; groups: SEX=FEMALE/ALCOHOL=ALCOHOLIC, SEX=FEMALE/ALCOHOL=NON-ALCOHOLIC, SEX=MALE/ALCOHOL=ALCOHOLIC, SEX=MALE/ALCOHOL=NON-ALCOHOLIC]

[Figure: scatterplot matrix of h METABOL, Studentized Resid METABOL, and Cook's D Influence METABOL versus SUBJECT]

Sensible Strategy for Influential Observations - Display 11.8
1. Do conclusions change if the case is omitted?
   - No: proceed with the case included.
   - Yes: go to 2.
2. Does the case belong to a different population?
   - Yes: omit the case and proceed.
   - No: go to 3.
3. Does the case have unusual explanatory variable values?
   - Yes: omit the case and proceed. Report conclusions for reduced range of X’s.
   - No: more clarification needed.

Strategy for Influential Observations
apply strategy to alcohol metabolism data
- leave out cases 31 and 32
- restrict inferences to people with gastric AD activity of 3 or lower
- extra sum of squares F-statistic for the gender and gender*gastric AD coefficients is 11.04 with 2 and 26 df (p-value=.0003)
- there are gender differences beyond those due to gastric AD activity

Refining the model
Should you drop explanatory variables if their coefficients are not significantly different from 0?
- if the p-value is quite large and the variable (and its coefficient) is not essential to answering the scientific questions, yes
- otherwise, no

Additional topics
- weighted regression for dealing with non-constant variance
- measurement error methods for dealing with situations in which the explanatory variables are measured with error

Weighted Regression
“Although nonconstant variance can sometimes be corrected by a transformation of the response, in many situations it cannot.”
“If enough information is known about the form of the non-constant variance, the method of weighted least squares may be used.”

Example: Pinewood Derby
- It is clear that the standard deviations of the car weights are drastically different.
- We use regression weights = 1/(std dev)^2
- Note: it is a coincidence that we introduce weighted regression with an example where one of the terms is weights.

Measurement Errors in X’s
- Case I: only prediction matters (individual coefficients are not relevant): NO PROBLEM
- Case II: individual coefficients important: CONSULT YOUR LOCAL STATISTICIAN!
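The weighted regression idea from the Pinewood Derby example, with regression weights = 1/(std dev)^2, can be sketched as follows. This is a minimal illustration in Python with NumPy rather than the JMP workflow of the slides. The function name `weighted_least_squares` and the synthetic data are assumptions (the Pinewood Derby data are not reproduced here), and the standard deviations are treated as known.

```python
import numpy as np


def weighted_least_squares(X, y, sd):
    """Fit y = X b by least squares with weights w_i = 1 / sd_i**2."""
    w = 1.0 / np.asarray(sd, dtype=float) ** 2
    # Scaling each row by sqrt(w_i) turns the weighted problem into an
    # ordinary least squares problem.
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta, w


# Synthetic data (hypothetical): the spread of y grows with x.
rng = np.random.default_rng(1)
x = np.linspace(1.0, 10.0, 40)
sd = 0.2 * x  # assumed known standard deviation of each response
y = 1.0 + 0.5 * x + rng.normal(0.0, sd)

X = np.column_stack([np.ones_like(x), x])
beta_wls, w = weighted_least_squares(X, y, sd)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("weighted fit   (intercept, slope):", beta_wls)
print("unweighted fit (intercept, slope):", beta_ols)
```

Cases with small standard deviation receive large weight, so the weighted fit relies most on the most precisely measured responses.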

This note was uploaded on 11/30/2011 for the course STAT 380 taught by Professor Stevens during the Spring '11 term at Brigham Young University, Hawaii.
