selected by pruning and cross-validation. The cross-validation estimate of the misclassification rate is 0.29. The misclassification rate on a separate test set of size 5000 is 0.28. The Bayes classification rule can be derived. Applying this rule to the test set yields a misclassification rate of 0.14.
Results:

Jia Li http://www.stat.psu.edu/jiali Classification/Decision Trees (I)

Advantages of the Tree-Structured Approach
Handles both categorical and ordered variables in a simple and natural way.
Automatic stepwise variable selection and complexity reduction.
It provides an estimate of the misclassification rate for a query sample.
It is invariant under all monotone transformations of individual ordered variables.
Robust to outliers and misclassified points in the training set.
Easy to interpret.

Variable Combinations
Splits perpendicular to the coordinate axes are inefficient in certain cases.
Use linear combinations of variables: Is $\sum_j a_j x_j \le c$?
The amount of computation is increased significantly.
Price to pay: model complexity increases.

Missing Values
Certain variables are missing in some training samples.
Often occurs in gene-expression microarray data.
Suppose each variable has a 5% chance of being missing, independently of the others. Then for a training sample with 50 variables, the probability that at least one variable is missing is as high as 92.3%.
A query sample to be classified may have missing variables.
Find surrogate splits.
Suppose the best split for node $t$ is $s$, which involves a question on $X_m$. Find another split $s'$ on a variable $X_j$, $j \neq m$, which is most similar to $s$ in a certain sense.
Similarly, the second best surrogate split, the third, and so on, can be found.
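The surrogate-split search described above can be sketched in a few lines. This is a minimal illustration, not the slides' method: the slides only say "most similar in a certain sense", so the sketch assumes similarity means the fraction of training samples that both splits send to the same side. The function name `best_surrogate`, the midpoint thresholds, and the toy data are all hypothetical, and reversed-direction surrogates are not considered.

```python
import numpy as np

def best_surrogate(X, m, c):
    """Return (j, threshold, agreement) for the split 'x_j <= threshold'
    that best mimics the primary split 'x_m <= c'.

    Assumption: similarity is the fraction of samples sent to the same
    side by both splits."""
    primary_left = X[:, m] <= c
    best = (None, None, -1.0)
    for j in range(X.shape[1]):
        if j == m:
            continue  # a surrogate must use a different variable
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values.
        for t in (values[:-1] + values[1:]) / 2.0:
            agreement = np.mean((X[:, j] <= t) == primary_left)
            if agreement > best[2]:
                best = (j, float(t), float(agreement))
    return best

# Toy data: column 1 is a monotone function of column 0, column 2 alternates.
x0 = np.arange(10, dtype=float)
X = np.column_stack([x0, 2.0 * x0 + 1.0, np.tile([0.0, 1.0], 5)])

result = best_surrogate(X, m=0, c=4.5)
print(result)  # (1, 10.0, 1.0)
```

In this toy case column 1 reproduces the primary split exactly, so a sample with a missing value in column 0 could be routed by asking whether its column 1 value is at most 10. Ranking the remaining variables by the same score would give the second best surrogate, the third, and so on.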

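The 92.3% figure quoted for missing values follows from the independence assumption: a sample is complete only if all 50 variables are present, which happens with probability 0.95 to the power 50. A short check, with variable names chosen here for illustration:

```python
p_missing = 0.05      # chance that any one variable is missing
n_variables = 50      # variables per training sample

# Probability that at least one of the variables is missing.
p_any_missing = 1.0 - (1.0 - p_missing) ** n_variables
print(f"{p_any_missing:.1%}")  # 92.3%
```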