Selection of the Best Regression
Equation by sorting out Variables
Mohammad Ehsanul Karim <[email protected]>
Institute of Statistical Research and Training,
University of Dhaka, Dhaka – 1000, Bangladesh
The general problem discussed in this document is as follows. In many practical
applications of regression analysis, the set of variables to be included in the
regression model is not predetermined, and selecting these variables is often the
first part of the analysis. Of course, there are occasions when theoretical or other
considerations determine the variables to be included in the equation; there the
problem of variable selection does not arise. But in situations where there is no
clear-cut theory, the problem of selecting variables for a regression equation becomes
an important one, and this is the subject matter of our present discussion.
Suppose we have one response variable Y and a set of k predictor variables
$X_1, X_2, \ldots, X_k$, and we wish to establish a linear regression equation for this
response Y in terms of the basic predictor variables. We want to determine or select
the best (most important or most valid) subset of the k predictors and the
corresponding best-fitting regression model for describing the relationship between Y
and the X’s. What exactly we mean by “best” depends in part on our overall goal in
modeling.
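The idea of searching for a best subset of the k predictors can be sketched in a few
lines of Python. This is only an illustration under stated assumptions: the data are
synthetic, the helper names `adjusted_r2` and `best_subset` are hypothetical, and
adjusted R^2 is used as the criterion purely for concreteness, since the meaning of
“best” depends on the goal of the modeling.

```python
from itertools import combinations

import numpy as np


def adjusted_r2(X, y):
    """Fit ordinary least squares with an intercept; return the adjusted R^2."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])           # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares coefficients
    resid = y - A @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    p = A.shape[1]                                 # number of fitted parameters
    return 1.0 - (ss_res / (n - p)) / (ss_tot / (n - 1))


def best_subset(X, y):
    """Try every non-empty subset of columns; keep the highest adjusted R^2."""
    k = X.shape[1]
    best_score, best_cols = -np.inf, ()
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            score = adjusted_r2(X[:, list(cols)], y)
            if score > best_score:
                best_score, best_cols = score, cols
    return best_score, best_cols


# Synthetic data (an assumption for illustration): only columns 0 and 2 matter.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))
y = 2.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

score, cols = best_subset(X, y)
print("selected columns:", cols, "adjusted R^2:", round(score, 3))
```

An exhaustive search fits 2^k - 1 models, so it is practical only for a modest number
of predictors; a different criterion can be swapped in by replacing `adjusted_r2`.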
Basis of Data Collection
We need to establish the basis of the data collection, as the conclusions we can draw
depend on this.
a) In an experiment, treatments are allocated to the experimental units, and there
   should be elements of statistical design (such as randomization).
b) In a survey, it is not possible to allocate treatments; we can only take note of the
   existing state of affairs. In no sense are survey data collected under controlled
   conditions, and many important factors may be overlooked.
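As a small illustration of randomization in a designed experiment, the sketch below
allocates treatments to experimental units at random. The twelve units and the three
treatment labels are hypothetical, and a balanced, completely randomized allocation is
assumed only for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(3)

units = list(range(12))                      # hypothetical experimental units
treatments = np.repeat(["A", "B", "C"], 4)   # balanced: four units per treatment

# Shuffle the treatment labels, then pair them with the units in order.
shuffled = rng.permutation(treatments).tolist()
allocation = dict(zip(units, shuffled))

for unit, treatment in allocation.items():
    print(unit, treatment)
```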
Dangers of Using Unplanned Data
When we apply any selection procedure to regression calculations on unplanned data
arising from continuing operations rather than from a designed experiment, some
potentially dangerous possibilities [1] can arise, such as:
a) Errors in the model may not be random and may be due to the joint effect of
   several variables.
b) Bias may be introduced.
c) Consequently, the prediction equation becomes unreliable because joint interaction
   and confounding effects are not considered.
d) The ranges of the variables may become invalid for prediction.
e) Large correlations between predictors may be seen.
However, by randomizing out the variables, some of these problems might be avoided.
Common sense and basic knowledge of the data being analyzed should also be employed.
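One of the dangers listed above, large correlations between predictors, is easy to
check before any selection procedure is run. The sketch below uses synthetic data in
which one predictor is nearly a copy of another; the cut-off of 0.8 is an arbitrary
assumption for illustration, not a recommended rule.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly a copy of x1
x3 = rng.normal(size=n)                   # unrelated to the others
X = np.column_stack([x1, x2, x3])

# Pairwise correlation matrix of the predictors (columns are variables).
corr = np.corrcoef(X, rowvar=False)

threshold = 0.8                           # assumed cut-off, for illustration only
k = X.shape[1]
flagged = [
    (i, j, float(corr[i, j]))
    for i in range(k)
    for j in range(i + 1, k)
    if abs(corr[i, j]) > threshold
]

for i, j, r in flagged:
    print(f"predictors {i} and {j} are highly correlated: r = {r:.3f}")
```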
Consequences of Model Misspecification
By deleting variables from the model, we may improve the precision [2] of the parameter
estimates of the retained variables, even though some of the deleted variables are not
negligible. This is also true for the variance of a predicted response. Deleting
[1] These problems are well described in Montgomery and Peck (1992), “Introduction to
Linear Regression Analysis”, 2nd ed., page 269.
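The claim made under Consequences of Model Misspecification, that deleting a variable
can improve the precision of the estimates for the retained variables, can be
illustrated with a small simulation. Everything here is an assumption chosen for the
illustration: two correlated predictors (correlation 0.9), true coefficients 1.0 and
0.2, unit error variance, and 2000 replications of samples of size 50. The sketch also
shows the usual price of deletion, namely that the retained coefficient is estimated
with bias when the deleted variable is not negligible.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, rho = 50, 2000, 0.9
beta1, beta2 = 1.0, 0.2        # x2 has a small but non-zero effect

full, reduced = [], []
for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)

    A_full = np.column_stack([np.ones(n), x1, x2])   # model with x1 and x2
    A_red = np.column_stack([np.ones(n), x1])        # model with x2 deleted

    # Keep the estimated coefficient of x1 from each fit.
    full.append(np.linalg.lstsq(A_full, y, rcond=None)[0][1])
    reduced.append(np.linalg.lstsq(A_red, y, rcond=None)[0][1])

full, reduced = np.array(full), np.array(reduced)
print("full model:    mean =", full.mean(), " sd =", full.std(ddof=1))
print("reduced model: mean =", reduced.mean(), " sd =", reduced.std(ddof=1))
```

In this setup the reduced model gives a less variable estimate of the coefficient of
x1, but its average is pulled away from the true value of 1.0 because x1 absorbs part
of the effect of the deleted x2.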