DSC 441 - Assignment 3

Problem 1

1) Graph 1 in the appendix shows the decision tree. I set a custom maximum tree depth of 20, a minimum of 10 cases for a parent node, and a minimum of 5 cases for a child node. I chose these values by trial and error: I tested different maximum and minimum values and compared the risk estimate, its standard error, and the percent correct in the classification table. As I lowered the maximum value and the minimum values, the risk decreased and the percent correct rose to almost 100% (appendix, Table 1). I used the CHAID growing method with cross-validation for validation.

2) Number of nodes: 17. Number of terminal nodes: 9.

3) As seen in Graph 1 in the appendix, the three most important data features are Symptom10, Symptom1, and Symptom9. Symptom10 is the first split from the root node (diagnosis), and its branches lead to the first two decision nodes, which split on Symptom1 and Symptom9. These three features therefore form the top of the decision tree. The model summary confirms that they are the three most important data features (Table 2).

4) Keeping all else equal, I increased the parent value from 20 to 30 to 100 and the child value from 10 to 15 to 50. The decision tree got smaller and expanded less, as seen in the appendix, Graphs 2 and 3: the number of nodes went from 17 to 13 to 9 in Graphs 1, 2, and 3, respectively. The percent correct also decreased as the parent and child values increased (appendix, Tables 3 and 4).
So the complexity of the decision tree decreased as the parent and child values increased. The parent value is the minimum number of cases a node must have before it can be split, and its default is 100 (Graph 3), so any node with fewer than 100 cases will not be split further. The child value is the minimum number of cases each child node must contain for the parent node to be split, and its default is 50 (Graph 3). This exercise shows that the lower the parent and child values are, the higher the accuracy and the more complex the model, which can lead to overfitting. Whereas the higher the parent and child values are, the less accurate and
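
The minimum-cases stopping rule described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the CHAID procedure used for the graphs: the function names `can_split` and `count_nodes`, the 400-case sample size, and the rule that every split divides a node into two equal halves are all hypothetical. It only shows why raising the parent and child minimums produces a smaller tree.

```python
def can_split(n_cases, child_sizes, min_parent, min_child):
    """Return True if a node passes both minimum-size stopping rules."""
    # Parent rule: a node with fewer than min_parent cases is never split.
    if n_cases < min_parent:
        return False
    # Child rule: every child created by the split needs at least min_child cases.
    return all(size >= min_child for size in child_sizes)


def count_nodes(n_cases, min_parent, min_child):
    """Grow a toy tree and return (total nodes, terminal nodes).

    Hypothetical simplification: each split divides a node into two halves.
    Assumes min_child >= 1 so that the recursion always stops.
    """
    left = n_cases // 2
    right = n_cases - left
    if not can_split(n_cases, [left, right], min_parent, min_child):
        return 1, 1  # this node stays a terminal node
    left_nodes, left_leaves = count_nodes(left, min_parent, min_child)
    right_nodes, right_leaves = count_nodes(right, min_parent, min_child)
    return 1 + left_nodes + right_nodes, left_leaves + right_leaves


if __name__ == "__main__":
    # Same three (parent, child) settings as in the text, on 400 made-up cases.
    for min_parent, min_child in [(20, 10), (30, 15), (100, 50)]:
        nodes, leaves = count_nodes(400, min_parent, min_child)
        print(min_parent, min_child, nodes, leaves)
```

In this sketch the node count shrinks as the two minimums grow, which matches the direction of the change reported for Graphs 1, 2, and 3. The actual node counts there come from the CHAID runs, not from this simplified splitting rule.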
