

Selection of the Best Regression Equation by Sorting Out Variables

Mohammad Ehsanul Karim <[email protected]>
Institute of Statistical Research and Training, University of Dhaka, Dhaka – 1000, Bangladesh
The general problem discussed in this document is as follows. In many practical applications of regression analysis, the set of variables to be included in the regression model is not predetermined, and selecting these variables is often the first part of the analysis. Of course, on some occasions theoretical or other considerations determine the variables to be included in the equation, and then the problem of variable selection does not arise. But where there is no clear-cut theory, selecting variables for a regression equation becomes an important problem, and it is the subject of the present discussion.

Suppose we have one response variable Y and a set of k predictor variables $X_1, X_2, \ldots, X_k$, and we wish to establish a linear regression equation for this response Y in terms of the basic predictor variables. We want to determine or select the best (most important or most valid) subset of the k predictors and the corresponding best-fitting regression model for describing the relationship between Y and the X's. What exactly we mean by "best" depends in part on our overall goal in modeling.

Basis of Data Collection

We need to establish the basis of the data collection, as the conclusions we can draw depend on it.

a) In an experiment, treatments are allocated to the experimental units, and there should be elements of statistical design (such as randomization).
b) In a survey, it is not possible to allocate treatments, and we only take note of the existing state of affairs. Survey data are in no sense collected under controlled conditions, and many important factors may be overlooked.
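The selection problem stated above (one response Y, k candidate predictors, choose a "best" subset) can be sketched by fitting every non-empty subset and ranking the fits. This is only an illustration under assumptions that are not in the text: the data are simulated, the helper names `fit_rss` and `all_subsets` are hypothetical, and adjusted R² is used as the ranking criterion purely as one possible meaning of "best".

```python
import itertools
import numpy as np

def fit_rss(X, y):
    """Least-squares fit with an intercept; returns the residual sum of squares."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def all_subsets(X, y):
    """Rank every non-empty subset of the columns of X by adjusted R^2 (assumed criterion)."""
    n, k = X.shape
    tss = float(((y - y.mean()) ** 2).sum())
    results = []
    for size in range(1, k + 1):
        for subset in itertools.combinations(range(k), size):
            rss = fit_rss(X[:, list(subset)], y)
            # Adjusted R^2 penalizes each additional predictor.
            adj_r2 = 1.0 - (rss / (n - size - 1)) / (tss / (n - 1))
            results.append((subset, adj_r2))
    return sorted(results, key=lambda item: item[1], reverse=True)

# Simulated (hypothetical) data: Y depends on columns 0 and 2; column 1 is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

ranking = all_subsets(X, y)
for subset, adj_r2 in ranking:
    print(subset, round(adj_r2, 4))
```

With k predictors there are 2^k - 1 non-empty subsets, so exhaustive enumeration of this kind is practical only for small k.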
Dangers of using unplanned data

When we do regression calculations with any selection procedure on unplanned data arising from continuing operations rather than from a designed experiment, some potentially dangerous possibilities [1] can arise:

a) Errors in the model may not be random and may be due to the joint effect of several variables.
b) Bias may be introduced.
c) Consequently, the prediction equation becomes unreliable because joint interaction and confounding effects are not considered.
d) Ranges become invalid for prediction.
e) Large correlations between predictors are seen.

However, by randomizing out the variables, some of these problems might be avoided. Common sense and basic knowledge of the data being analyzed should also be employed.

Consequences of Model Misspecification

By deleting variables from the model, we may improve the precision [2] of the parameter estimates of the retained variables even though some of the deleted variables are not negligible. This is also true for the variance of a predicted response. Deleting

[1] These problems are well described in Montgomery and Peck (1992), "Introduction to Linear Regression Analysis", 2nd ed., page 269.
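The precision claim under "Consequences of Model Misspecification" can be checked numerically. The sketch below compares the sampling variance of the coefficient of one predictor, taken from sigma^2 (X'X)^{-1}, with and without a second, correlated predictor in the model. The data, the correlation of about 0.9, the unit error variance, and the helper name `coef_variances` are all assumptions made for illustration. Only variance is examined here; the sketch says nothing about bias in the reduced model.

```python
import numpy as np

def coef_variances(X, sigma2=1.0):
    """Diagonal of sigma^2 (X'X)^{-1} for a design matrix with an added intercept."""
    design = np.column_stack([np.ones(X.shape[0]), X])
    return sigma2 * np.diag(np.linalg.inv(design.T @ design))

# Simulated (hypothetical) predictors: x2 is built to correlate strongly with x1.
rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)

# Index 1 is the coefficient of x1 (index 0 is the intercept).
var_full = coef_variances(np.column_stack([x1, x2]))[1]   # x1 and x2 in the model
var_reduced = coef_variances(x1.reshape(-1, 1))[1]        # x2 deleted

print("Var(b1), full model:   ", var_full)
print("Var(b1), reduced model:", var_reduced)
```

In this two-predictor setting the ratio var_full / var_reduced equals 1 / (1 - r^2), where r is the sample correlation between x1 and x2, so the gain in precision from deleting x2 grows with the correlation.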