Sec. 9.4] Cost datasets 153

in the original dataset in which there were four different degrees of heart disease. Table 9.16 gives the costs of the possible misclassifications. Nine-fold cross-validation was used to estimate the average misclassification cost. Naive Bayes performed best on the heart dataset, which may reflect the careful selection of attributes by the doctors. Of the decision trees, CART and Cal5 performed best. Cal5 tuned the pruning parameter and used an average of 8 nodes in the trees, whereas [...] used 45 nodes. However, [...] did not take the cost matrix into account, so the preferred pruning is still an open question. This data has been studied in the literature before, but without taking any cost matrix into account, so the results are not comparable with those obtained here.

Table 9.16: Misclassification costs for the heart disease dataset. The columns represent the predicted class, and the rows the true class.

            absent   present
absent         0        1
present        5        0

9.4.3 German credit (Cr.Ger)

Table 9.17: Cost matrix for the German credit dataset. The columns are the predicted class and the rows the true class.

            good   bad
good          0     1
bad           5     0

The original dataset (provided by Professor Dr. Hans Hofmann, Universität Hamburg) contained some categorical/symbolic attributes. For algorithms that required numerical attributes, a version was produced with several indicator variables added. The attributes that were ordered categorical were coded as integers. The preprocessed dataset had 24 numerical attributes, and for uniformity all algorithms used this version; 10-fold cross-validation was used for the classification. It is of interest that NewID ran the trials with both the preprocessed version and the original data, and obtained nearly identical error rates (32.8% and 31.3%, respectively) but rather different tree sizes (179 and 306 nodes).
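The preprocessing described above (integer codes for ordered categorical attributes, indicator variables for unordered ones) can be sketched roughly as follows. This is only an illustration: the attribute names, category levels and their ordering are hypothetical and are not taken from the actual German credit coding.

```python
# Hypothetical attributes and levels, chosen purely for illustration.
# Ordered categorical attributes are mapped to integer codes.
ORDERED = {"savings": ["low", "medium", "high"]}
# Unordered categorical attributes are expanded into 0/1 indicator variables.
UNORDERED = {"purpose": ["new car", "furniture", "other"]}


def encode(record):
    """Turn one record (dict of attribute -> value) into a list of numbers."""
    row = []
    for name, levels in ORDERED.items():
        # Integer code = position of the value in the assumed ordering.
        row.append(levels.index(record[name]))
    for name, levels in UNORDERED.items():
        # One indicator per level: 1 if the record has that level, else 0.
        row.extend(1 if record[name] == level else 0 for level in levels)
    return row


example = {"savings": "medium", "purpose": "furniture"}
encoded = encode(example)  # one integer code followed by three indicators
```

A tree algorithm that handles symbolic attributes directly would not need this step, which is one plausible reason the tree sizes on the two versions can differ.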
The attributes of the original dataset include: status of existing current account, duration of current account, credit history, reason for loan request (e.g. new car, furniture), credit amount, savings account/bonds, length of employment, installment rate as a percentage of disposable income, marital status and sex, length of time at present residence, age and job.

154 Dataset descriptions and results [Ch. 9

Results are given in Table 9.18. The providers of this dataset suggest the cost matrix of Table 9.17. It is interesting that only 10 algorithms do better than the Default. The results clearly demonstrate that some Decision Tree algorithms are at a disadvantage when costs are taken into account. That costs can be incorporated into decision trees is demonstrated by the good results of Cal5 and CART (Breiman et al., 1984). Cal5 achieved a good result with an average of only 2 nodes, which would lead to very transparent rules....
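To make the comparison with the Default concrete, the average misclassification cost under a cost matrix can be sketched as below. The sketch assumes that correct classifications cost nothing and that the Default rule assigns every case to the single class giving the lowest average cost. The class counts (700 good, 300 bad) are an assumption for illustration and are not stated in the text above.

```python
# COST[true_class][predicted_class]; zero cost on the diagonal is assumed.
COST = {
    "good": {"good": 0, "bad": 1},
    "bad": {"good": 5, "bad": 0},
}


def average_cost(true_labels, predicted_labels, cost=COST):
    """Mean cost per case, looking up each (true, predicted) pair in the matrix."""
    total = sum(cost[t][p] for t, p in zip(true_labels, predicted_labels))
    return total / len(true_labels)


def default_rule(true_labels, cost=COST):
    """Class that minimises average cost when predicted for every case."""
    n = len(true_labels)
    return min(cost, key=lambda c: average_cost(true_labels, [c] * n, cost))


# Illustrative class counts (an assumption, not taken from the text).
labels = ["good"] * 700 + ["bad"] * 300
best = default_rule(labels)
default_cost = average_cost(labels, [best] * len(labels))
```

Under these assumptions, always predicting "bad" is cheaper than always predicting "good", because wrongly accepting a bad risk costs five times as much as wrongly rejecting a good one. A classifier tuned only for error rate can therefore end up worse than the Default once costs are counted.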