all possible
choices of the test group. To obtain a reliable estimate
for the test error, the number of misclassifications was
averaged over 50 different partitionings of the samples
into 8 groups.

FEATURE SELECTION
The simplest way to apply an SVM to our methylation
data is to use every CpG position as a separate dimension,
making no assumption about the interdependence of CpG
sites from the same gene. On the leukemia subclassification
task, the SVM with linear kernel trained on this
81-dimensional input space had an average test error of
16%. Using a quadratic kernel did not significantly
improve the results (see Tab. 1).
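The baseline protocol just described can be sketched as follows. This is a minimal illustration using scikit-learn with synthetic stand-in data (random values in place of the real 25 x 81 methylation matrix, alternating labels in place of the two leukemia subclasses); it reproduces the procedure, not the reported 16% error.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold
from sklearn.svm import SVC

# Synthetic stand-in for the data set: 25 samples, 81 CpG positions.
rng = np.random.default_rng(0)
X = rng.normal(size=(25, 81))
y = np.arange(25) % 2  # placeholder labels for the two subclasses

# Test error averaged over 50 different partitionings into 8 groups:
# repeated 8-fold cross-validation with a linear-kernel SVM.
errors = []
cv = RepeatedKFold(n_splits=8, n_repeats=50, random_state=0)
for train_idx, test_idx in cv.split(X):
    clf = SVC(kernel="linear").fit(X[train_idx], y[train_idx])
    errors.append(1.0 - clf.score(X[test_idx], y[test_idx]))

print(f"average test error over {len(errors)} splits: {np.mean(errors):.2f}")
```

Swapping `kernel="linear"` for `kernel="poly", degree=2` gives the quadratic-kernel variant.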
An obvious explanation
for this relatively poor performance is that we have only
25 data points (even fewer in the training set) in an
81-dimensional space. Finding a separating hyperplane under
these conditions is a heavily underdetermined problem.
As it turns out, the SVM technique of maximising
the margin is not sufficient to find the solution with
optimal generalisation properties. It is necessary to reduce
the dimensionality of the input space while retaining the
information relevant for classification. This should be
possible because it can be expected that only a minority
of CpG positions has any connection with the two subtypes
of leukemia.

S159 F.Model et al.

Table 1. Performance of different feature selection methods.

                             2 Features           5 Features
                           Training  Test       Training  Test
  Linear Kernel
    Fisher Criterion         0.01    0.05         0.00    0.03
    Golub's Method           0.01    0.05         0.00    0.04
    t-Test                   0.05    0.13         0.00    0.08
    Backward Elimination     0.02    0.17         0.00    0.05
    PCA                      0.13    0.21         0.05    0.28
    No Feature Selection†    0.00    0.16
  Quadratic Kernel
    Fisher Criterion         0.00    0.06         0.00    0.03
    Golub's Method           0.00    0.06         0.00    0.05
    t-Test                   0.04    0.14         0.00    0.07
    Backward Elimination     0.00    0.12         0.00    0.05
    PCA                      0.10    0.30         0.00    0.31
    Exhaustive Search        0.00    0.06
    No Feature Selection†    0.00    0.15

† The SVM was trained on all 81 features.

Principal Component Analysis
Probably the most popular method for dimension reduction
is principal component analysis (PCA) (Bishop, 1995).
For a given training set X, PCA constructs a set of
orthogonal vectors (principal components) which correspond
to the directions of maximum variance. The projection
of X onto the first k principal components gives the
2-norm optimal representation of X in a k-dimensional
orthogonal subspace. Because this projection does not
explicitly use the class information Y, PCA is an
unsupervised learning technique.
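The projection step can be sketched in a few lines. This is an illustration on synthetic data of the same shape as ours (25 samples, 81 dimensions), not the actual methylation measurements; note that no labels appear anywhere in the computation.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the 25 x 81 data matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(25, 81))

# Project onto the first k principal components: the k orthogonal
# directions of maximum variance, fitted without any class information.
k = 5
pca = PCA(n_components=k).fit(X)
X_proj = pca.transform(X)

print(X_proj.shape)                   # (25, 5)
print(pca.explained_variance_ratio_)  # variance captured, in decreasing order
```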
In order to reduce the dimension of the input space for
the SVM, we performed a PCA on the combined training
and test set and projected both sets onto the first
k principal components. This gives considerably better
results than performing PCA only on the training set X
and is justified by the fact that no label information is
used. However, the generalisation results for k = 2 and
k = 5, as shown in Tab. 1, were even worse than for
the SVM without feature selection. The reason for this
is that PCA does not necessarily extract features that are
important for the discrimination between the two classes.
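The combined-set pipeline can be sketched as follows, again with synthetic stand-in data and placeholder labels. PCA is fitted on training and test features together (legitimate here, since no labels are used), both sets are projected onto the first k components, and only then is the SVM trained on the projected training set.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Synthetic stand-in for one train/test partitioning (22 + 3 samples).
rng = np.random.default_rng(2)
X_train = rng.normal(size=(22, 81))
X_test = rng.normal(size=(3, 81))
y_train = np.array([0, 1] * 11)   # placeholder subclass labels
y_test = np.array([0, 1, 0])

# Fit PCA on the combined training and test features: unsupervised,
# so the test labels are never touched.
k = 5
pca = PCA(n_components=k).fit(np.vstack([X_train, X_test]))

# Train the SVM on the projected training set only.
clf = SVC(kernel="linear").fit(pca.transform(X_train), y_train)
test_error = 1.0 - clf.score(pca.transform(X_test), y_test)
print(f"test error on this split: {test_error:.2f}")
```

Because the k directions are chosen purely by variance, a low-variance but highly discriminative direction can be projected away, which is consistent with the poor PCA rows in Tab. 1.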