featureSelectionDNAMethyCancerClassification_01bioinfo



Feature selection for DNA methylation

[Figure 1. Panel (a): probability density plotted against log(Ratio) for F8-5 and F8-3, with curves labelled 0%, 33%, 66% and 100%. Panel (b): samples grouped as Female and Male, with significance levels p<0.1, p<0.01 and p<0.001 marked; row labels as extracted: CD1A CpG2, ELK1 CpG6, CD63 CpG1, AR CpG4, TUBB2 CpG4, CDK4 CpG3, CSNK2B CpG10, AR CpG5, ELK1 CpG5, ELK1 CpG12, AR CpG2, ELK1 CpG8, ELK1 CpG11, MYCN CpG2, AR CpG1, ELK1 CpG9, ELK1 CpG2, ELK1 CpG3, AR CpG3, ELK1 CpG1.]

Fig. 1. Validation of measurements. (a) Quantification of methylation measurements for two CpG dinucleotides. A series of hybridizations was performed with mixtures of artificially up- and down-methylated DNA fragments of the factor VIII exon 14 gene. Down- and up-methylated DNA fragments were mixed at ratios of 0:3, 1:2, 2:1 and 3:0, representing a methylation status of 100%, 66%, 33% and 0%, respectively. For the 4 kinds of compounds, 59, 36, 40 and 63 identical slides were made. The log-ratio of the CG and the TG detection oligomer hybridization intensities was calculated and then averaged over experimental subgroups, each containing 3 identical experiments. The distribution function of the CG:TG ratios shows that the measurement values of the different mixtures are well separated and therefore allow high-resolution detection of the methylation level of a single CpG. (b) Gender separation. The 20 CpG sites with the most significant difference between female and male samples are shown. Only non-cell-line leukemia samples and healthy control samples were used. As expected, the absolute majority of the significant CpG dinucleotides come from the two X-chromosome genes (ELK1, AR). High probability of methylation corresponds to black, uncertainty to grey and low probability to white. The labels on the left side of the plot are gene and CpG identifiers. The CpGs are ranked from bottom to top according to the significance of the difference between the means of the two groups, estimated by a two-sample t-test. Each row corresponds to a single CpG and each column to the methylation levels of one sample.

† Every hybridization experiment was repeated at least 3 times and the results were averaged.

...e non-linearly mapped into a potentially higher dimensional feature space by a mapping function $\Phi: x_i \to \Phi(x_i)$. Because the SVM algorithm in its dual formulation uses only the inner product between elements of the input space, knowledge of the kernel function $k(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$ is sufficient to train the SVM. It is not necessary to know the mapping $\Phi$ explicitly, and a non-linear SVM can be trained efficiently by computing only the kernel function. Here we will use only the linear kernel $k(x_i, x_j) = x_i \cdot x_j$ and the quadratic kernel $k(x_i, x_j) = (x_i \cdot x_j + 1)^2$.

In the next section we will compare SVMs trained on different feature sets. To evaluate the prediction performance of these SVMs we used a cross-validation method (Bishop, 1995). For each classification task, the samples were partitioned into 8 groups of approximately equal size. The SVM was then trained on 7 of the groups and predicted the class of the test samples in the remaining group. The number of misclassifications was counted over 8 runs of the SVM algorithm for a...
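The two ideas in this passage can be sketched in plain Python: a kernel gives the feature-space inner product without the mapping being written out, and prediction errors are counted over 8 held-out groups. This is an illustrative sketch, not the authors' code. All function names, the seeded random partition and the toy data are assumptions introduced here, and the classifier is a kernel nearest-class-mean rule standing in for an SVM so that the example needs no external library.

```python
import random


def linear_kernel(x, y):
    """Linear kernel: k(x, y) = x . y"""
    return sum(a * b for a, b in zip(x, y))


def quadratic_kernel(x, y):
    """Quadratic kernel: k(x, y) = (x . y + 1)^2"""
    return (linear_kernel(x, y) + 1.0) ** 2


def quadratic_feature_map(x):
    """One explicit mapping whose inner product equals quadratic_kernel.

    Used only to check the kernel identity; training never needs it.
    """
    r2 = 2.0 ** 0.5
    d = len(x)
    phi = [1.0]
    phi += [r2 * xi for xi in x]
    phi += [xi * xi for xi in x]
    phi += [r2 * x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
    return phi


def train_nearest_mean(train_x, train_y, kernel):
    """Stand-in classifier (NOT an SVM): pick the class whose mean is
    nearest in feature space, computed from kernel values only."""
    groups = {}
    for x, y in zip(train_x, train_y):
        groups.setdefault(y, []).append(x)
    # Per-class constant: (1/n^2) * sum over i, j of k(x_i, x_j)
    self_term = {
        c: sum(kernel(a, b) for a in xs for b in xs) / len(xs) ** 2
        for c, xs in groups.items()
    }

    def predict(x):
        def squared_distance(c):
            xs = groups[c]
            cross = sum(kernel(x, a) for a in xs) / len(xs)
            return kernel(x, x) - 2.0 * cross + self_term[c]
        return min(groups, key=squared_distance)

    return predict


def cross_validation_errors(samples, labels, train_fn, n_groups=8, seed=0):
    """Count misclassifications over n_groups train/test runs.

    The shuffled round-robin partition is an assumption; the text only
    says the groups are of approximately equal size.
    """
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    folds = [order[i::n_groups] for i in range(n_groups)]
    errors = 0
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in order if i not in held_out]
        predict = train_fn([samples[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        errors += sum(predict(samples[i]) != labels[i] for i in fold)
    return errors


if __name__ == "__main__":
    # Made-up two-class toy data, not methylation measurements.
    rng = random.Random(1)
    xs = [[rng.gauss(0, 0.3), rng.gauss(0, 0.3)] for _ in range(16)]
    xs += [[rng.gauss(5, 0.3), rng.gauss(5, 0.3)] for _ in range(16)]
    ys = [0] * 16 + [1] * 16
    for name, k in [("linear", linear_kernel), ("quadratic", quadratic_kernel)]:
        trainer = lambda tx, ty, k=k: train_nearest_mean(tx, ty, k)
        print(name, cross_validation_errors(xs, ys, trainer))
```

Passing the trainer as a function keeps the cross-validation loop independent of the classifier, so an actual SVM trainer could be substituted for the stand-in without changing how the errors are counted.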