featureSelectionDNAMethyCancerClassification_01bioinfo



[...] ALL and AML. It first picks the features with the highest variance, which in this case discriminate between cell lines and primary patient tissue (see Fig. 2a), i.e. subgroups that are not relevant to the classification task. As is shown in Fig. 3, features carrying information about the leukemia subclasses appear only from the 9th principal component on. The generalisation performance including the 9th component is significantly better than for an SVM without feature selection. However, it seems clear that a supervised feature selection method, which takes the class labels of the training set into account, should be more reliable and give better generalisation.

Fisher Criterion and t-Test

A classical measure to assess the degree of separation between two classes is given by the Fisher criterion (Bishop, 1995). In our case it gives the discriminative power of the $k$th CpG as

$$J(k) = \frac{\left(\mu_k^{\mathrm{ALL}} - \mu_k^{\mathrm{AML}}\right)^2}{\left(\sigma_k^{\mathrm{ALL}}\right)^2 + \left(\sigma_k^{\mathrm{AML}}\right)^2},$$

where $\mu_k^{\mathrm{ALL/AML}}$ is the mean and $\sigma_k^{\mathrm{ALL/AML}}$ is the standard deviation of all $x_k^i$ with $y_i = \mathrm{ALL/AML}$. The Fisher criterion gives a high ranking to CpGs where the two classes are far apart compared to the within-class variances. Fig. 2b shows the methylation profiles of the best 20 CpGs according to the Fisher criterion.
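As a minimal sketch of how such a ranking could be computed (not the authors' code), the following Python snippet evaluates the Fisher criterion for each feature column of a small made-up data matrix and sorts the features by score. The function names `fisher_scores` and `rank_features`, the toy values, the zero-variance fallback, and the use of the population variance are all assumptions made for illustration; the text does not say which variance estimator was used.

```python
from statistics import mean, pvariance


def fisher_scores(X, y, class_a="ALL", class_b="AML"):
    """Return the Fisher criterion J(k) for every feature column k.

    X is a list of samples (each a list of feature values) and y holds
    the class label of each sample. Population variance is an assumption.
    """
    scores = []
    for k in range(len(X[0])):
        a = [row[k] for row, label in zip(X, y) if label == class_a]
        b = [row[k] for row, label in zip(X, y) if label == class_b]
        numerator = (mean(a) - mean(b)) ** 2
        denominator = pvariance(a) + pvariance(b)
        if denominator > 0:
            scores.append(numerator / denominator)
        elif numerator > 0:
            # Hypothetical fallback: perfectly separated, zero variance.
            scores.append(float("inf"))
        else:
            scores.append(0.0)
    return scores


def rank_features(scores):
    """Return feature indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)


# Toy data (made up): 4 samples, 3 features standing in for CpG sites.
X = [
    [0.9, 0.5, 0.2],
    [0.8, 0.4, 0.9],
    [0.2, 0.6, 0.1],
    [0.1, 0.5, 0.8],
]
y = ["ALL", "ALL", "AML", "AML"]

scores = fisher_scores(X, y)
ranking = rank_features(scores)
print(scores)
print(ranking)
```

In this toy example the first feature has well separated class means and small within-class spread, so it ranks first; the third feature has a similar mean difference to the second but much larger spread, so it ranks last.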
Fig. 2. Feature selection methods. a) Principal component analysis. The whole data set was projected onto its first 2 principal components (plot axes: 1.PC and 2.PC). Circles represent cell lines, triangles primary patient tissue. Filled circles or triangles are AML, empty ones ALL samples. b) Fisher criterion. The 20 highest ranking CpG sites according to the Fisher criterion are shown. The highest ranking features are on the bottom of the plot. High probability of methylation corresponds to black, uncertainty to grey and low probability to white. c) Two sample t-test. d) Backward elimination.

The very similar criterion

$$G(k) = \frac{\left|\mu_k^{\mathrm{ALL}} - \mu_k^{\mathrm{AML}}\right|}{\sigma_k^{\mathrm{ALL}} + \sigma_k^{\mathrm{AML}}}$$

was used by Golub and coworkers for their ALL/AML classification based on mRNA expression data (Golub et al., 1999). Its relation to the Fisher criterion is given by

$$G(k) = \sqrt{J(k)} \left(1 + \frac{2\,\sigma_k^{\mathrm{ALL}} \sigma_k^{\mathrm{AML}}}{\left(\sigma_k^{\mathrm{ALL}}\right)^2 + \left(\sigma_k^{\mathrm{AML}}\right)^2}\right)^{-1/2},$$

which shows the preference of Golub’s ranking for features with different within-class variances compared to the Fisher criterion.

Another approach to rank CpGs by their discriminative power is to use a test statistic for computing the significance of class differences. Here we assumed a normal distribution of the methylation...
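The relation between Golub's criterion and the Fisher criterion stated above, namely that G(k) equals the square root of J(k) times a factor depending only on the two within-class standard deviations, can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function names, the toy values and the choice of population standard deviation are hypothetical.

```python
from math import isclose, sqrt
from statistics import mean, pstdev


def fisher(a, b):
    """Fisher criterion J for one feature, given the values of each class."""
    return (mean(a) - mean(b)) ** 2 / (pstdev(a) ** 2 + pstdev(b) ** 2)


def golub(a, b):
    """Golub criterion G for one feature, given the values of each class."""
    return abs(mean(a) - mean(b)) / (pstdev(a) + pstdev(b))


def correction_factor(sigma_a, sigma_b):
    """Factor linking the two criteria: G = sqrt(J) * factor."""
    ratio = 2 * sigma_a * sigma_b / (sigma_a ** 2 + sigma_b ** 2)
    return (1 + ratio) ** -0.5


def golub_from_fisher(a, b):
    """Compute G indirectly from J and the correction factor."""
    return sqrt(fisher(a, b)) * correction_factor(pstdev(a), pstdev(b))


# Toy values for a single feature (made up for illustration).
a = [0.9, 0.8, 0.85, 0.7]  # samples labelled ALL
b = [0.1, 0.5, 0.2, 0.4]   # samples labelled AML

print(golub(a, b), golub_from_fisher(a, b))
print(correction_factor(1.0, 1.0), correction_factor(1.0, 0.01))
```

The factor is smallest, 1/sqrt(2), when both standard deviations are equal and approaches 1 as they become very different, which is one way to read the stated preference of Golub's ranking for features with unequal within-class variances.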

This note was uploaded on 04/06/2010 for the course COMPUTER S COSC1520 taught by Professor Paul during the Spring '09 term at York University.
