An intelligent way to recode discrete predictors is to replace each discrete value by the mean of the target conditioned on that discrete value. For example, if the average label value is 20 for men versus 16 for women, these values could replace the male and female values of a variable for gender. This idea is especially useful as 1 The ordering of the values, i.e. which value is associated with j = 1 , etc., is arbitrary. Mathemat- ically it is preferable to use only k - 1 real-valued features. For the last categorical value, set all k - 1 features equal to 0.0. For the j th categorical value where j < k , set the j th feature value to 1.0 and set all k - 1 others equal to 0.0.
14 CHAPTER 2. PREDICTIVE ANALYTICS IN GENERAL a way to convert a discrete feature with many values, for example the 50 U.S. states, into a useful single numerical feature. However, as just explained, the standard way to recode a discrete feature with m values is to introduce m - 1 binary features. With this standard approach, the training algorithm can learn a coefficient for each new feature that corresponds to an optimal numerical value for the corresponding discrete value. Conditional means are likely to be meaningful and useful, but they will not yield better predictions than the coefficients learned in the standard approach. A major advantage of the conditional- means approach is that it avoids an explosion in the dimensionality of training and test examples. Mixed types. Sparse data. Normalization. After conditional-mean new values have been created, they can be scaled to have zero mean and unit variance in the same way as other features. When preprocessing and recoding data, it is vital not to peek at test data. If preprocessing is based on test data in any way, then the test data is available indirectly for training, which can lead to overfitting. If a feature is normalized by z-scoring, its mean and standard deviation must be computed using only the training data. Then, later, the same mean and standard deviation values must be used for normalizing this feature in test data. Even more important, target values from test data must not be used in any way before or during training. When a discrete value is replaced by the mean of the target conditioned on it, the mean must be just of training target values. Later, in test data, the same discrete value must be replaced by the same mean of training target values. 2.3 Linear regression Let x be an instance and let y be its real-valued label. For linear regression, x must be a vector of real numbers of fixed length. Remember that this length p is often called the dimension, or dimensionality, of x . Write x = x 1 , x 2 , . . . , x p . The linear regression model is y = b 0 + b 1 x 1 + b 2 x 2 + . . . + b p x p . The righthand side above is called a linear function of x . The linear function is defined by its coefficients b 0 to b p . These coefficients are the output of the data mining algorithm.