


[...] all possible choices of the test group. To obtain a reliable estimate of the test error, the number of misclassifications was averaged over 50 different partitionings of the samples into 8 groups.

FEATURE SELECTION

The simplest way to apply an SVM to our methylation data is to use every CpG position as a separate dimension, making no assumption about the interdependence of CpG sites from the same gene. On the leukemia subclassification task, the SVM with linear kernel trained on this 81-dimensional input space had an average test error of 16%. Using a quadratic kernel did not significantly improve the results (see Tab. 1). An obvious explanation for this relatively poor performance is that we have only 25 data points (even fewer in the training set) in an 81-dimensional space. Finding a separating hyperplane under these conditions is a heavily under-determined problem, and, as it turns out, the SVM technique of maximising the margin is not sufficient to find the solution with optimal generalisation properties. It is necessary to reduce the dimensionality of the input space while retaining the relevant information for classification. This should be possible because it can be expected that only a minority of CpG positions has any connection with the two subtypes of leukemia.

F.Model et al., S159

Table 1. Performance of different feature selection methods.

                         Training Error  Test Error  Training Error  Test Error
                         2 Features      2 Features  5 Features      5 Features
Linear Kernel
  Fisher Criterion       0.01            0.05        0.00            0.03
  Golub's Method         0.01            0.05        0.00            0.04
  t-Test                 0.05            0.13        0.00            0.08
  Backward Elimination   0.02            0.17        0.00            0.05
  PCA                    0.13            0.21        0.05            0.28
  No Feature Selection†  0.00            0.16

Quadratic Kernel
  Fisher Criterion       0.00            0.06        0.00            0.03
  Golub's Method         0.00            0.06        0.00            0.05
  t-Test                 0.04            0.14        0.00            0.07
  Backward Elimination   0.00            0.12        0.00            0.05
  PCA                    0.10            0.30        0.00            0.31
  Exhaustive Search      0.00            0.06        -               -
  No Feature Selection†  0.00            0.15

† The SVM was trained on all 81 features.
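The evaluation protocol above can be sketched as follows: train an SVM on all 81 CpG features and average the misclassification rate over 50 different partitionings of the 25 samples into 8 groups. This is a minimal sketch only; the paper does not specify an implementation, so scikit-learn is assumed here and synthetic data stand in for the real methylation measurements.

```python
# Sketch of the evaluation: SVMs with linear and quadratic kernels on all
# 81 features, test error averaged over 50 repeated 8-fold partitionings.
# Data are random placeholders, NOT the methylation data from the paper.
import numpy as np
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 81))           # 25 samples in an 81-dimensional space
y = rng.integers(0, 2, size=25)         # two leukemia subtypes (placeholder labels)

# 50 different partitionings of the samples into 8 groups.
cv = RepeatedKFold(n_splits=8, n_repeats=50, random_state=0)

errors = {}
for name, clf in [("linear", SVC(kernel="linear")),
                  ("quadratic", SVC(kernel="poly", degree=2))]:
    errors[name] = 1.0 - cross_val_score(clf, X, y, cv=cv).mean()
    print(f"{name} kernel: mean test error {errors[name]:.2f}")
```

With random labels the estimated error is of course near chance level; the point of the sketch is the averaging protocol, not the numbers.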
Principal Component Analysis

Probably the most popular method for dimension reduction is principal component analysis (PCA) (Bishop, 1995). For a given training set X, PCA constructs a set of orthogonal vectors (principal components) which correspond to the directions of maximum variance. The projection of X onto the first k principal components gives the 2-norm optimal representation of X in a k-dimensional orthogonal subspace. Because this projection does not explicitly use the class information Y, PCA is an unsupervised learning technique. In order to reduce the dimension of the input space for the SVM, we performed a PCA on the combined training and test set and projected both sets onto the first k principal components. This gives considerably better results than performing PCA only on the training set X, and is justified by the fact that no label information is used. However, the generalisation results for k = 2 and k = 5, as shown in Tab. 1, were even worse than for the SVM without feature selection. The reason for this is that PCA does not necessarily extract features that are important for the discrimination between ...
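The PCA step described above can be sketched as follows: fit the principal components on the combined training and test set (legitimate because no class labels are used), project both sets onto the first k components, and train the SVM in that k-dimensional subspace. Again scikit-learn is assumed and synthetic data replace the real methylation measurements.

```python
# Sketch of PCA-based dimension reduction followed by a linear SVM.
# Data are random placeholders, NOT the methylation data from the paper.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X_train = rng.normal(size=(17, 81))     # training samples
X_test = rng.normal(size=(8, 81))       # held-out test group
y_train = rng.integers(0, 2, size=17)   # placeholder subtype labels

k = 5
# PCA is fitted on the combined training and test set; this uses no
# label information Y, so it is still an unsupervised step.
pca = PCA(n_components=k).fit(np.vstack([X_train, X_test]))
Z_train = pca.transform(X_train)        # shape (17, k)
Z_test = pca.transform(X_test)          # shape (8, k)

clf = SVC(kernel="linear").fit(Z_train, y_train)
pred = clf.predict(Z_test)
```

Note that the components are ranked by variance, not by class separability, which is exactly why this projection can discard discriminative information, as the text goes on to explain.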

This note was uploaded on 04/06/2010 for the course COMPUTER S COSC1520 taught by Professor Paul during the Spring '09 term at York University.
