
…levels of a CpG position within a class and used a two-sample t-test to rank the CpGs according to the significance of the difference between the class means (Mendenhall & Sincich, 1995). Fig. 2c shows the ranking, which is very similar to the Fisher criterion, because a large mean difference and a small within-class variance are the important factors for both methods.

Fig. 3. Dimension dependence of feature selection performance. The plot shows the generalisation performance of a linear SVM with four different feature selection methods (Fisher criterion, t-test, backward elimination and PCA) against the number of selected features. The x-axis is scaled logarithmically and gives the number of input features for the SVM, starting with two. The y-axis gives the achieved generalisation performance (test error). Note that the maximum number of principal components corresponds to the number of available samples. The performance of Golub's method was very similar to the Fisher criterion and is not shown.

In order to improve classification performance we trained SVMs on the k highest-ranking CpGs according to the Fisher criterion, Golub's method or the t-test. Fig. 4 shows an SVM trained on the two best CpGs from the Fisher criterion. The test errors for k = 2 and k = 5 are given in Tab. 1. The results show a dramatic improvement in generalisation performance: using the Fisher criterion for feature selection and k = 5 CpGs, the test error decreased to 3%, compared with 16% for the SVM without feature selection. Fig. 3 shows the dependence of generalisation performance on the selected dimension k and indicates that the Fisher criterion in particular gives good, dimension-independent generalisation for reasonably small k. The performance of Golub's ranking method was equal to or slightly inferior to that of the Fisher criterion on our data set, whereas the t-test performance was considerably worse for small feature numbers.

Fig. 4. Support Vector Machine on the two best features of the Fisher criterion. The plot shows an SVM trained on the two highest-ranking CpG sites according to the Fisher criterion (CSNK2B CpG2 and CDK4 CpG3), with all ALL and AML samples used as training data. The black points are AML samples, the grey ones ALL samples. Circled points are the support vectors defining the white borderline between the areas of AML and ALL prediction. The grey value of the background corresponds to the prediction strength.

Although the described CpG ranking methods give very good generalisation, they have some potential drawbacks. One problem is that they can only detect linear dependencies between features and class labels: a simple XOR, or even OR, combination of two CpGs would be completely missed. Another drawback is that redundant features are not removed. In our case there are usually several CpGs from the same gene, which have a high likelihood of comethylation. This can result in a large set of high-ranking features that carry essentially the same information. Although the good results seem to indicate that the described problems do not appear in our data set, they should be considered. Ba...
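The three univariate rankings discussed above are straightforward to compute. The following is a minimal sketch, assuming the methylation data are given as a samples × CpGs NumPy matrix X with binary class labels y; the function name rank_cpgs, the variable names, and the equal-variance t-test are illustrative assumptions, not the authors' code:

```python
import numpy as np
from scipy import stats

def rank_cpgs(X, y, method="fisher"):
    """Rank the CpG columns of X (samples x CpGs) by a univariate score.

    method: 'fisher' -- Fisher criterion (mu1 - mu2)^2 / (s1^2 + s2^2)
            'golub'  -- Golub's signal-to-noise |mu1 - mu2| / (s1 + s2)
            'ttest'  -- two-sample t-test, ranked by |t|
    Returns CpG indices sorted from most to least discriminative.
    """
    X1, X2 = X[y == 0], X[y == 1]
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    s1, s2 = X1.std(axis=0, ddof=1), X2.std(axis=0, ddof=1)
    if method == "fisher":
        score = (mu1 - mu2) ** 2 / (s1 ** 2 + s2 ** 2)
    elif method == "golub":
        score = np.abs(mu1 - mu2) / (s1 + s2)
    elif method == "ttest":
        t, _ = stats.ttest_ind(X1, X2, axis=0)
        score = np.abs(t)
    else:
        raise ValueError(method)
    return np.argsort(score)[::-1]  # highest score first
```

All three scores reward a large between-class mean difference relative to the within-class spread, which is why the t-test and Fisher rankings in Figs 2c come out so similar.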
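Training a linear SVM on the k top-ranked CpGs can then be sketched as below, reusing rank_cpgs from the previous snippet. The cross-validation protocol here is an assumption for illustration (the paper's exact evaluation protocol may differ); note that the ranking is recomputed inside each training fold, since selecting features on the full data set before splitting would bias the estimated test error downwards.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def top_k_test_error(X, y, k, method="fisher", n_splits=5, seed=0):
    """Mean cross-validated test error of a linear SVM trained on the
    k top-ranked CpGs; feature selection never sees the held-out fold."""
    errors = []
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train, test in cv.split(X, y):
        top = rank_cpgs(X[train], y[train], method)[:k]
        clf = SVC(kernel="linear").fit(X[train][:, top], y[train])
        errors.append(1.0 - clf.score(X[test][:, top], y[test]))
    return float(np.mean(errors))
```

Sweeping k over a logarithmic grid for each ranking method would produce curves analogous to those in Fig. 3.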
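The XOR drawback is easy to demonstrate on synthetic data: when two CpGs determine the class only jointly, each one's Fisher score is close to zero, so a univariate ranking would discard both. A toy example (entirely synthetic, not from the paper's data set):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
# Two synthetic "CpG" features taking low/high methylation states.
a = rng.integers(0, 2, n)
b = rng.integers(0, 2, n)
X = np.column_stack([a, b]).astype(float) + rng.normal(0, 0.1, (n, 2))
y = a ^ b  # class label is the XOR of the two features

mu1, mu2 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
s1, s2 = X[y == 0].std(axis=0, ddof=1), X[y == 1].std(axis=0, ddof=1)
fisher = (mu1 - mu2) ** 2 / (s1 ** 2 + s2 ** 2)
print(fisher)  # both scores near zero: each CpG alone looks useless,
               # although the pair separates the classes perfectly
```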