

Use the Count Matrix to Make Decisions
- Multi-way split
- Two-way split

Continuous Attributes: Computing GINI Index
- Use binary decisions based on one value
  o Several choices for the splitting value
  o Number of possible splitting values = number of distinct values
- Each splitting value v has a count matrix associated with it
  o Class counts in each of the partitions, A < v and A >= v
- Simple method to choose the best v
  o For each v, scan the database to gather the count matrix and compute its GINI index
  o Computationally inefficient: the scan is repeated for every candidate value, so the total work is O(n^2)
- For efficient computation, for each attribute:
  o Sort the attribute on its values
  o Linearly scan these values, each time updating the count matrix and computing the GINI index
  o Choose the split position that has the least GINI index
  o The sort dominates, so the cost is O(n log n) (a sketch follows this section)

Splitting Based on Classification Error
- Measures the misclassification error made by a node
  o Maximum of 1 - 1/n_c (0.5 for two classes), when records are equally distributed among all classes, implying the least interesting information
  o Minimum of 0, when all records belong to one class, implying the most interesting information: a pure (homogeneous) node with low degrees of freedom and no impurity
- Classification error at a node t:
  Error(t) = 1 - max_i P(i | t)
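A minimal Python sketch of the sorted-scan computation above. The function names, the midpoint choice for candidate split values, and the sample data are illustrative assumptions, not from the course material:

def gini(counts):
    # GINI index of one partition given its per-class record counts.
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def classification_error(counts):
    # Error(t) = 1 - max_i P(i | t); 0 for a pure node.
    n = sum(counts.values())
    return 1.0 - max(counts.values()) / n if n else 0.0

def best_split(values, labels):
    # Sort once (O(n log n)), then scan linearly, updating the count
    # matrix incrementally instead of rescanning the whole database
    # for every candidate v (which would be O(n^2)).
    pairs = sorted(zip(values, labels))
    left = {c: 0 for c in set(labels)}     # class counts for A < v
    right = dict(left)                     # class counts for A >= v
    for _, y in pairs:
        right[y] += 1
    n = len(pairs)
    best_v, best_gini = None, float("inf")
    for i in range(n - 1):
        x, y = pairs[i]
        left[y] += 1                       # one record crosses the boundary
        right[y] -= 1
        if pairs[i + 1][0] == x:           # only split between distinct values
            continue
        v = (x + pairs[i + 1][0]) / 2      # midpoint candidate (an assumption)
        w = (i + 1) / n
        g = w * gini(left) + (1 - w) * gini(right)
        if g < best_gini:
            best_v, best_gini = v, g
    return best_v, best_gini

# Illustrative data; prints (97.5, 0.3): the best binary split puts the
# three "Y" records and the three low "N" records on the left.
print(best_split([60, 70, 75, 85, 90, 95, 100, 120, 125, 220],
                 ["N", "N", "N", "Y", "Y", "Y", "N", "N", "N", "N"]))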
Comparing Attribute Selection Measures
1) Information gain: biased towards multivalued attributes
2) Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
3) Gini index:
   a. biased towards multivalued attributes
   b. has difficulty when the number of classes is large
   c. tends to favor tests that result in equal-sized partitions and purity in both partitions

Stopping Criteria for Tree Induction
- Stop expanding a node when all the records belong to the same class
- Stop when the maximum tree depth has been reached
- Early termination (pre-pruning)

Determining the Final Tree Size
- Use the minimum description length (MDL) principle
  o Uses encoding techniques to define the best decision tree as the one that requires the fewest bits to both encode the tree and encode the exceptions to the tree
  o Main idea: the simplest solution is preferred
  o Has the least bias towards multivalued attributes
  o Every model provides a (lossless) encoding of the data; the model that gives the shortest encoding (best compression) of the data is the best, since good compression implies the model has captured the regularities in the data
- Halt growth of the tree when the encoding is minimized (a toy sketch follows below)
** Complex models describe the data in great detail, but this implies a maximum description length: the model itself is expensive to describe
** Simple models imply a minimum description length and are cheap to describe, but describing the data given the model (the exceptions) becomes expensive
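A toy illustration of the MDL trade-off. The per-node and per-exception bit costs below are made-up assumptions (the notes do not fix an actual encoding scheme); the point is only that MDL picks whichever total description length is smallest:

BITS_PER_NODE = 8        # assumed cost to encode one tree node
BITS_PER_EXCEPTION = 16  # assumed cost to encode one misclassified record

def description_length(num_nodes, num_exceptions):
    # Total cost = cost(encoding the tree) + cost(encoding its exceptions).
    return num_nodes * BITS_PER_NODE + num_exceptions * BITS_PER_EXCEPTION

# A complex tree: detailed model, few exceptions -> expensive model part.
complex_tree = description_length(num_nodes=41, num_exceptions=2)
# A simple tree: cheap model, many exceptions -> expensive data part.
simple_tree = description_length(num_nodes=5, num_exceptions=25)
# MDL prefers whichever candidate minimizes the total encoding.
print(min(("complex", complex_tree), ("simple", simple_tree),
          key=lambda t: t[1]))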
Extracting Classification Rules from the Tree
- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf (see the sketch after this list)
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
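A minimal sketch of the rule-extraction walk. The nested-dict tree representation and the weather-style attributes are assumptions for illustration only:

# One IF-THEN rule per root-to-leaf path; each attribute-value test
# along the path is AND-ed into the rule's antecedent.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "humidity",
                  "branches": {"high": {"label": "no"},
                               "normal": {"label": "yes"}}},
        "overcast": {"label": "yes"},
        "rain": {"label": "no"},
    },
}

def extract_rules(node, conditions=()):
    # Depth-first walk; a leaf yields one rule whose antecedent is the
    # conjunction of the attribute = value tests collected on the path.
    if "label" in node:
        antecedent = " AND ".join(conditions) or "TRUE"
        yield f"IF {antecedent} THEN class = {node['label']}"
        return
    for value, child in node["branches"].items():
        test = f"{node['attribute']} = {value}"
        yield from extract_rules(child, conditions + (test,))

for rule in extract_rules(tree):
    print(rule)

For the toy tree above this prints four rules, e.g. "IF outlook = sunny AND humidity = high THEN class = no", one per leaf.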
Avoiding Overfitting in Classification
- An induced tree may overfit the training data