Backward Elimination
PCA, the Fisher criterion and the t-test construct or rank features
independent of the learning machine that does the actual
classification and are therefore called filter methods (Blum
& Langley, 1997). Another approach is to use the learning
machine itself for feature selection. These techniques are
called wrapper methods and try to identify the features
that are important for the generalisation capability of the
machine. Here we propose to use the features that are
important for achieving a low training error as a simple
approximation. In the case of a SVM with linear kernel Feature selection for DNA methylation these features are easily identiﬁed by looking at the normal
vector w of the separating hyperplane. The smaller the
angle between a feature basis vector and the normal
vector, the more important the feature is for the separation.
Features orthogonal to the normal vector obviously have
no influence on the discrimination at all. This means
the feature ranking is simply given by the squared components
$w_k^2$ of the normal vector. Of course this ranking is
not very realistic because the SVM solution on the full
feature set is far from optimal as we demonstrated in
the last subsections. A simple heuristic is to assume that
the feature with the smallest $w_k^2$ is really unimportant
for the solution and can be safely removed from the
feature set. Then the SVM can be retrained on the
reduced feature set and the procedure is repeated until the
feature set is empty. Such a successive feature removal
is called backward elimination (Blum & Langley, 1997).
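As a minimal sketch, the ranking-and-removal loop can be written as follows. The text trains a linear-kernel SVM at each step; to keep the snippet self-contained, a least-squares estimate of the hyperplane normal stands in for the SVM training (an assumption, not the method described above), and the synthetic data are purely illustrative:

```python
import numpy as np

def fit_normal_vector(X, y):
    """Stand-in for SVM training: least-squares hyperplane normal.
    (The text uses a linear-kernel SVM; this is a simplified proxy.)"""
    # Solve X w ~= y for labels y in {-1, +1}.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def backward_elimination(X, y):
    """Repeatedly drop the feature with the smallest w_k^2 and retrain.
    Returns feature indices in the order they were removed
    (least important first)."""
    remaining = list(range(X.shape[1]))
    removal_order = []
    while remaining:
        w = fit_normal_vector(X[:, remaining], y)
        k = int(np.argmin(w ** 2))          # least important feature
        removal_order.append(remaining.pop(k))
    return removal_order

# Tiny synthetic example: feature 0 carries the class signal,
# feature 1 is pure noise, so feature 1 should be removed first.
rng = np.random.default_rng(0)
y = np.repeat([-1.0, 1.0], 20)
X = np.column_stack([y + 0.1 * rng.standard_normal(40),
                     rng.standard_normal(40)])
print(backward_elimination(X, y))  # prints [1, 0]
```

The loop retrains after every removal, which is why the procedure scales poorly compared with the one-shot Fisher ranking.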
The resulting CpG ranking on our data set is shown in
Fig. 2d and differs considerably from the Fisher and t-test
rankings. It seems backward elimination is able to remove
redundant features. However, as shown in Tab. 1 and Fig. 3
the generalisation results are not better than for the Fisher
criterion. Furthermore, backward elimination seems to be
more dependent on the dimension of the feature set, and it is
computationally more expensive. It follows that, at least for this data set, the
simple Fisher criterion is the preferable feature selection
technique.

Exhaustive Search
A canonical way to construct a wrapper method for feature
selection is to evaluate the generalisation performance of
the learning machine on every possible feature subset.
Cross-validation on the training set can be used to estimate
the generalisation of the machine on a given feature set.
What makes this exhaustive search of the feature space
practically useless is the enormous number of
$\sum_{k=0}^{n} \binom{n}{k} = 2^n$ different feature combinations,
and there are numerous heuristics to search the feature space
more efficiently (e.g. backward elimination) (Blum & Langley, 1997).
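The size of the search space follows directly from the binomial theorem; a few lines verify the count, taking n = 81 from the CpG count used below (the variable name is illustrative):

```python
from math import comb

n = 81  # number of CpG features, as in the pairwise search below
# Summing the number of subsets of every size k gives all 2^n subsets.
assert sum(comb(n, k) for k in range(n + 1)) == 2 ** n
print(2 ** n)  # roughly 2.4e24 subsets: far too many to evaluate
```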
Here we only want to demonstrate that there are no
higher order correlations between features and class labels
in our data set. In order to do this we exhaustively
searched the space of all two-feature combinations. For
each of the $\binom{81}{2} = 3240$ two-CpG combinations we
computed the leave-one-out cross-validation error of an
SVM with quadratic kernel on the training set. From all
CpG pairs with minimum leave-one-out error we selected
the one...
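A sketch of this pairwise exhaustive search, assuming a nearest-centroid classifier as a self-contained stand-in for the quadratic-kernel SVM and synthetic data in place of the methylation measurements:

```python
import numpy as np
from itertools import combinations

def loo_error(X, y):
    """Leave-one-out error of a nearest-centroid classifier.
    (A simple stand-in for the quadratic-kernel SVM in the text.)"""
    errors = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i          # hold out sample i
        Xt, yt = X[mask], y[mask]
        centroids = {c: Xt[yt == c].mean(axis=0) for c in np.unique(yt)}
        pred = min(centroids,
                   key=lambda c: np.linalg.norm(X[i] - centroids[c]))
        errors += pred != y[i]
    return errors / len(y)

def best_pairs(X, y):
    """Exhaustively score every two-feature (CpG) combination
    and return all pairs achieving the minimum LOO error."""
    scores = {pair: loo_error(X[:, list(pair)], y)
              for pair in combinations(range(X.shape[1]), 2)}
    best = min(scores.values())
    return [p for p, e in scores.items() if e == best]

# Synthetic check: features 0 and 1 carry the class signal,
# features 2-4 are noise, so the best pairs should involve 0 or 1.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 15)
X = rng.standard_normal((30, 5))
X[:, 0] += 4 * y
X[:, 1] -= 4 * y
print(best_pairs(X, y))
```

Even this toy version evaluates 10 pairs with 30 refits each; with 81 CpGs the same loop runs 3240 pairs, which is still feasible, while anything beyond low-order combinations is not.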