An intelligent way to recode discrete predictors is to replace each discrete value
by the mean of the target conditioned on that discrete value.
For example, if the
average label value is 20 for men versus 16 for women, these values could replace
the male and female values of a variable for gender. This idea is especially useful as
1
The ordering of the values, i.e. which value is associated with
j
= 1
, etc., is arbitrary. Mathemat
ically it is preferable to use only
k

1
realvalued features. For the last categorical value, set all
k

1
features equal to 0.0. For the
j
th categorical value where
j < k
, set the
j
th feature value to 1.0 and set
all
k

1
others equal to 0.0.
14
CHAPTER 2. PREDICTIVE ANALYTICS IN GENERAL
a way to convert a discrete feature with many values, for example the 50 U.S. states,
into a useful single numerical feature.
However, as just explained, the standard way to recode a discrete feature with
m
values is to introduce
m

1
binary features. With this standard approach, the
training algorithm can learn a coefficient for each new feature that corresponds to an
optimal numerical value for the corresponding discrete value. Conditional means are
likely to be meaningful and useful, but they will not yield better predictions than the
coefficients learned in the standard approach. A major advantage of the conditional
means approach is that it avoids an explosion in the dimensionality of training and
test examples.
Mixed types.
Sparse data.
Normalization. After conditionalmean new values have been created, they can
be scaled to have zero mean and unit variance in the same way as other features.
When preprocessing and recoding data, it is vital not to peek at test data.
If
preprocessing is based on test data in any way, then the test data is available indirectly
for training, which can lead to overfitting. If a feature is normalized by zscoring, its
mean and standard deviation must be computed using only the training data. Then,
later, the same mean and standard deviation values must be used for normalizing this
feature in test data.
Even more important, target values from test data must not be used in any way
before or during training. When a discrete value is replaced by the mean of the target
conditioned on it, the mean must be just of
training
target values. Later, in test data,
the same discrete value must be replaced by the same mean of training target values.
2.3
Linear regression
Let
x
be an instance and let
y
be its realvalued label. For linear regression,
x
must
be a vector of real numbers of fixed length. Remember that this length
p
is often
called the dimension, or dimensionality, of
x
.
Write
x
=
x
1
, x
2
, . . . , x
p
.
The
linear regression model is
y
=
b
0
+
b
1
x
1
+
b
2
x
2
+
. . .
+
b
p
x
p
.
The righthand side above is called a linear function of
x
.
The linear function is
defined by its coefficients
b
0
to
b
p
.
These coefficients are the
output
of the data
mining algorithm.
You've reached the end of your free preview.
Want to read all 165 pages?
 Winter '08
 staff
 Linear Regression, Regression Analysis, Data Mining, The Land