n ALL and AML.
It first picks the features with the highest variance, which
in this case discriminate between cell lines and
primary patient tissue (see Fig. 2a), i.e. subgroups that
are not relevant to the classification task. As shown in
Fig. 3, features carrying information about the leukemia
subclasses appear only from the 9th principal component
on. The generalisation performance including the 9th
component is significantly better than for an SVM without
feature selection. However, it seems clear that a supervised
feature selection method, which takes the class labels of
the training set into account, should be more reliable and
give better generalisation.

Fisher Criterion and t-Test

A classical measure to assess the degree of separation
between two classes is given by the Fisher criterion
(Bishop, 1995). In our case it gives the discriminative
power of the k-th CpG as

$$J(k) = \frac{\left(\mu_k^{\mathrm{ALL}} - \mu_k^{\mathrm{AML}}\right)^2}{\left(\sigma_k^{\mathrm{ALL}}\right)^2 + \left(\sigma_k^{\mathrm{AML}}\right)^2},$$

where $\mu_k^{\mathrm{ALL/AML}}$ is the mean and $\sigma_k^{\mathrm{ALL/AML}}$ the
standard deviation of all $x_i^k$ with $y_i = \mathrm{ALL/AML}$. The
Fisher criterion gives a high ranking for CpGs where the
two classes are far apart compared to the within-class
variances. Fig. 2b shows the methylation profiles of the
best 20 CpGs according to the Fisher criterion. The very
similar criterion

$$G(k) = \frac{\left|\mu_k^{\mathrm{ALL}} - \mu_k^{\mathrm{AML}}\right|}{\sigma_k^{\mathrm{ALL}} + \sigma_k^{\mathrm{AML}}}$$

was used by Golub and coworkers for their ALL/AML
classification based on mRNA expression data (Golub
et al., 1999). Its relation to the Fisher criterion is given
by

$$G(k)^2 = J(k) \left[ 1 + \frac{2\,\sigma_k^{\mathrm{ALL}} \sigma_k^{\mathrm{AML}}}{\left(\sigma_k^{\mathrm{ALL}}\right)^2 + \left(\sigma_k^{\mathrm{AML}}\right)^2} \right]^{-1},$$

which shows the preference of Golub’s ranking for
features with different within-class variances compared to
the Fisher criterion.

Fig. 2. Feature selection methods. a) Principal component analysis. The whole data set was projected onto its first 2 principal components.
Circles represent cell lines, triangles primary patient tissue. Filled circles or triangles are AML, empty ones ALL samples. b) Fisher criterion.
The 20 highest ranking CpG sites according to the Fisher criterion are shown. The highest ranking features are on the bottom of the plot.
High probability of methylation corresponds to black, uncertainty to grey and low probability to white. c) Two sample t-test. d) Backward
elimination.

Another approach to rank CpGs by their discriminative
power is to use a test statistic for computing the significance of class differences. Here we assumed a normal
distribution of the methylation...
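
The Fisher criterion J(k) and the Golub criterion G(k) defined earlier in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation. The data are synthetic stand-ins rather than methylation measurements, all function names are hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the text does not say which estimator was used.

```python
import random
from statistics import mean, stdev


def class_stats(X, y, label, k):
    """Mean and sample standard deviation of feature k over samples with the given label."""
    values = [row[k] for row, lab in zip(X, y) if lab == label]
    return mean(values), stdev(values)  # stdev uses n - 1 (assumption, see lead-in)


def fisher_criterion(X, y, k):
    """J(k): squared difference of class means over the sum of class variances."""
    mu_all, sd_all = class_stats(X, y, "ALL", k)
    mu_aml, sd_aml = class_stats(X, y, "AML", k)
    return (mu_all - mu_aml) ** 2 / (sd_all ** 2 + sd_aml ** 2)


def golub_criterion(X, y, k):
    """G(k): absolute difference of class means over the sum of class standard deviations."""
    mu_all, sd_all = class_stats(X, y, "ALL", k)
    mu_aml, sd_aml = class_stats(X, y, "AML", k)
    return abs(mu_all - mu_aml) / (sd_all + sd_aml)


def golub_squared_from_fisher(X, y, k):
    """G(k)^2 computed from J(k) via the relation stated in the text."""
    _, sd_all = class_stats(X, y, "ALL", k)
    _, sd_aml = class_stats(X, y, "AML", k)
    correction = 1.0 + 2.0 * sd_all * sd_aml / (sd_all ** 2 + sd_aml ** 2)
    return fisher_criterion(X, y, k) / correction


def rank_features(X, y, criterion):
    """Feature indices sorted from highest to lowest criterion value."""
    n_features = len(X[0])
    scores = [criterion(X, y, k) for k in range(n_features)]
    return sorted(range(n_features), key=lambda k: scores[k], reverse=True)


# Synthetic stand-in data: 18 samples, 5 features, Gaussian noise.
random.seed(0)
labels = ["ALL"] * 10 + ["AML"] * 8
X = [[random.gauss(0.0, 1.0) for _ in range(5)] for _ in labels]
for row, lab in zip(X, labels):
    if lab == "AML":
        row[2] += 10.0  # feature 2 separates the classes by construction

fisher_ranking = rank_features(X, labels, fisher_criterion)
golub_ranking = rank_features(X, labels, golub_criterion)
```

Both rankings place the constructed feature first, and `golub_criterion(X, labels, k) ** 2` agrees with `golub_squared_from_fisher(X, labels, k)` up to floating-point error for every feature. The correction factor lies between 1 and 2 and is largest when the two class standard deviations are equal, so for a fixed J(k) a feature with unequal within-class variances receives a higher G(k).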