

1. lettr    capital letter (26 values from A to Z)
2. x-box    horizontal position of box (integer)
3. y-box    vertical position of box (integer)
4. width    width of box (integer)
5. high     height of box (integer)
6. onpix    total # on pixels (integer)
7. x-bar    mean x of on pixels in box (integer)
8. y-bar    mean y of on pixels in box (integer)
9. x2bar    mean x variance (integer)
10. y2bar   mean y variance (integer)
11. xybar   mean x y correlation (integer)
12. x2ybr   mean of x * x * y (integer)
13. xy2br   mean of x * y * y (integer)
14. x-ege   mean edge count left to right (integer)
15. xegvy   correlation of x-ege with y (integer)
16. y-ege   mean edge count bottom to top (integer)
17. yegvx   correlation of y-ege with x (integer)

Create a classification model for letter recognition using decision trees as a classification method with a holdout partitioning technique for splitting the data into training versus testing.

a. (15 points) Changing the values for the depth, number of cases per parent and number of cases per leaf produces different tree configurations with different accuracies for training and testing. Choose at least five different configurations and report the accuracy for training and testing for each one of them. Which configuration will you choose as the best model? Explain your answer.

         Configuration              Result
Model    Tree    Parent   Child     Depth   Nodes   Terminal   Training   Testing
         Depth   Nodes    Nodes                     Nodes      Accuracy   Accuracy
1        10      100      50        11      143     72         63.5%      62.1%
2        20      80       40        13      173     87         66.5%      65.1%
3        10      50       25        11      113     57         59.8%      59.9%
4        10      200      100       11      91      45         58.9%      58.4%
5        10      120      60        11      125     63         63.7%      61.5%
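The holdout partitioning named in the task can be sketched in a few lines. This is a minimal illustration, not the procedure used to produce the results here: the 70/30 ratio, the seed, the function name `holdout_split`, and the stand-in rows are all assumptions, since the assignment does not state them.

```python
import random


def holdout_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists.

    test_fraction and seed are illustrative defaults, not values
    taken from the assignment.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible partition
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical stand-in for the data: (letter, 16 integer features) pairs.
rows = [(chr(ord("A") + i % 26), [i % 16] * 16) for i in range(1000)]
train, test = holdout_split(rows)
print(len(train), len(test))
```

The model is fitted on `train` only, and `test` is kept aside so that the testing accuracy reflects cases the tree has not seen.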
DSC441 - Fall 2018, Assignment 4, Page 4 of 9

The above table displays the configurations and results of the five decision tree models generated with this data. I would choose the fifth model as the best fit, although it does not have the highest training or testing accuracy. Model 2 has the highest accuracy on both sets (66.5% and 65.1%), but it is also the largest tree, at 13 levels deep with 173 nodes and 87 terminal nodes. The training and testing accuracy of the fifth model are similar (63.7% and 61.5%, a gap of 2.2 percentage points), and its complexity is not too high: it is 11 levels deep with 125 nodes and 63 terminal nodes.
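To make the trade-off between accuracy, generalization gap, and tree size easier to compare, the figures from the results table can be tabulated with a short script. The numbers are copied from the table; the function name `summarize` and the dictionary keys are illustrative choices.

```python
# Rows copied from the results table:
# (model, depth, nodes, terminal nodes, training accuracy %, testing accuracy %)
results = [
    (1, 11, 143, 72, 63.5, 62.1),
    (2, 13, 173, 87, 66.5, 65.1),
    (3, 11, 113, 57, 59.8, 59.9),
    (4, 11, 91, 45, 58.9, 58.4),
    (5, 11, 125, 63, 63.7, 61.5),
]


def summarize(results):
    """Return one dict per model with the training-minus-testing gap."""
    summary = []
    for model, depth, nodes, leaves, train_acc, test_acc in results:
        summary.append({
            "model": model,
            "depth": depth,
            "nodes": nodes,
            "terminal_nodes": leaves,
            "test_acc": test_acc,
            # Gap in percentage points; a large positive gap suggests overfitting.
            "gap": round(train_acc - test_acc, 1),
        })
    return summary


summary = summarize(results)
for row in summary:
    print(f"Model {row['model']}: test {row['test_acc']}%, "
          f"gap {row['gap']} points, {row['nodes']} nodes")
```

Listing the gap next to the node count puts the two criteria used in the answer (similar training and testing accuracy, moderate complexity) side by side for all five configurations.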
b. (4 points) For the best tree configuration, report the misclassification matrix and interpret it. In your opinion, is accuracy a good way to interpret the performance of the model? If not, suggest other measures.
o The accuracy per class ranges from 41.9% (letter V) to 79.8% (letter G) on the testing set.

Accuracy is one of the better ways to interpret the model's performance, but it should be read alongside other information, such as the class balance in the data. In the letter recognition data, the letters/classes are well balanced (as shown in the table below), so no single letter dominates the others and distorts the overall accuracy.
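As a sketch of the other measures the question asks about, per-class recall, precision, and F1 can be read directly off a misclassification matrix. The per-class accuracy quoted above (percent correct among the cases that truly belong to a letter) corresponds to recall. The 3 x 3 matrix below is invented purely for illustration and is not the matrix from this model; `per_class_metrics` is a hypothetical helper name.

```python
def per_class_metrics(matrix, labels):
    """Compute recall, precision, and F1 for each class.

    matrix[i][j] is the number of cases whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    metrics = {}
    for k, label in enumerate(labels):
        true_pos = matrix[k][k]
        actual = sum(matrix[k])                    # row total: true members of the class
        predicted = sum(row[k] for row in matrix)  # column total: predicted as the class
        recall = true_pos / actual if actual else 0.0
        precision = true_pos / predicted if predicted else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"recall": recall, "precision": precision, "f1": f1}
    return metrics


# Invented counts for three letters only, NOT results from the assignment.
labels = ["G", "V", "Y"]
matrix = [
    [80, 5, 15],
    [10, 42, 48],
    [5, 20, 75],
]

metrics = per_class_metrics(matrix, labels)
overall_accuracy = sum(matrix[k][k] for k in range(len(labels))) / sum(map(sum, matrix))
for label, m in metrics.items():
    print(f"{label}: recall={m['recall']:.3f} precision={m['precision']:.3f} f1={m['f1']:.3f}")
print(f"overall accuracy={overall_accuracy:.3f}")
```

Reporting precision next to recall shows not only how often a letter is recognized but also how often other letters are mistaken for it, which a single overall accuracy figure hides.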
