Classification/Decision Trees (I)


... selected by pruning and cross-validation.
- The cross-validation estimate of the misclassification rate is 0.29.
- The misclassification rate on a separate test set of size 5000 is 0.28.
- The Bayes classification rule can be derived. Applying this rule to the test set yields a misclassification rate of 0.14.

Results:

Jia Li http://www.stat.psu.edu/jiali Classification/Decision Trees (I)

Advantages of the Tree-Structured Approach
- Handles both categorical and ordered variables in a simple and natural way.
- Automatic stepwise variable selection and complexity reduction.
- It provides an estimate of the misclassification rate for a query sample.
- It is invariant under all monotone transformations of individual ordered variables.
- Robust to outliers and misclassified points in the training set.
- Easy to interpret.

Variable Combinations
- Splits perpendicular to the coordinate axes are inefficient in certain cases.
- Use linear combinations of variables: Is $\sum_j a_j x_j \le c$?
- The amount of computation is increased significantly.
- Price to pay: model complexity increases.

Missing Values
- Certain variables are missing in some training samples.
- Often occurs in gene-expression microarray data.
- Suppose each variable has a 5% chance of being missing, independently. Then for a training sample with 50 variables, the probability that at least one variable is missing is as high as 92.3% ($1 - 0.95^{50} \approx 0.923$).
- A query sample to be classified may have missing variables.
- Find surrogate splits.
- Suppose the best split for node $t$ is $s$, which involves a question on $X_m$. Find another split $s'$ on a variable $X_j$, $j \ne m$, which is most similar to $s$ in a certain sense.
- Similarly, the second best surrogate split, the third, and so on, can be found.
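The two quantitative points on the Missing Values slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slide says a surrogate is "most similar in a certain sense" without defining the measure, so the sketch assumes similarity means the fraction of training samples that both splits send to the same side. The function names `prob_any_missing` and `best_surrogate` and the toy data are hypothetical and do not come from the slides.

```python
def prob_any_missing(n_vars, p_missing):
    """Probability that at least one of n_vars variables is missing,
    assuming each is missing independently with probability p_missing."""
    return 1.0 - (1.0 - p_missing) ** n_vars


def best_surrogate(primary_left, candidates):
    """Search threshold splits on other variables for the one that best
    mimics the primary split.

    primary_left: list of bools, True if the primary split s sends the
                  sample to the left child.
    candidates:   dict mapping a variable name to its list of values
                  (same sample order as primary_left).

    ASSUMPTION: similarity = fraction of samples sent to the same side.
    Returns (name, threshold, left_if_le, agreement), where left_if_le
    tells whether "value <= threshold" means left (True) or right (False).
    """
    best = None
    n = len(primary_left)
    for name, values in candidates.items():
        for threshold in sorted(set(values)):
            goes_left = [v <= threshold for v in values]
            agree = sum(a == b for a, b in zip(primary_left, goes_left)) / n
            # Try both orientations of the candidate split.
            for left_if_le, score in ((True, agree), (False, 1.0 - agree)):
                if best is None or score > best[3]:
                    best = (name, threshold, left_if_le, score)
    return best


# Slide example: 50 variables, each missing with probability 0.05.
print(round(prob_any_missing(50, 0.05), 3))

# Toy data (hypothetical): the primary split on X_m sends the first
# four samples left and the last four right.
primary_left = [True, True, True, True, False, False, False, False]
candidates = {
    "x_j": [1, 2, 3, 4, 10, 11, 12, 2.5],  # tracks X_m closely
    "x_k": [5, 1, 7, 3, 6, 2, 8, 4],       # nearly unrelated to X_m
}
print(best_surrogate(primary_left, candidates))
```

Removing the winning variable from `candidates` and calling `best_surrogate` again would give the second best surrogate under the same assumed measure, and so on down the ranking.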