... for building decision trees, including CHAID (Chi-squared Automatic Interaction Detection), CART
(Classification And Regression Trees), QUEST, and C5.0.
Decision trees are grown through an iterative splitting of data into discrete groups, where the goal is
to maximize the “distance” between groups at each split.
One of the distinctions between decision tree methods is how they measure this distance. While the
details of such measurement are beyond the scope of this introduction, you can think of each split as
separating the data into new groups which are as different from each other as possible. This is also
sometimes called making the groups purer. Using our simple example where the data had two
possible output classes — Good Risk and Bad Risk — it would be preferable if each data split found
a criterion resulting in “pure” groups with instances of only one class instead of both classes.
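To make the idea of purity concrete, here is a minimal sketch using Gini impurity, which is one common measure (CART uses it, for example) and not the only one. The function names and the six-record sample are illustrative assumptions, not part of any particular product.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure group, 0.5 for an even two-class mix."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted average impurity of the two groups a split produces."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

# Hypothetical data: three Good Risk and three Bad Risk records.
parent = ["Good", "Good", "Good", "Bad", "Bad", "Bad"]
pure_split = (["Good", "Good", "Good"], ["Bad", "Bad", "Bad"])
mixed_split = (["Good", "Bad", "Good"], ["Bad", "Good", "Bad"])

print("parent:", gini(parent))
print("pure split:", split_impurity(*pure_split))
print("mixed split:", split_impurity(*mixed_split))
```

A tree builder using this measure would prefer the first split, because its weighted impurity is lower than that of the second.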
Decision trees which are used to predict categorical variables are called classification trees because
they place instances in categories or classes. Decision trees used to predict continuous variables are
called regression trees.
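The difference shows up in what a leaf predicts. The sketch below assumes a common convention, which the text above does not spell out: a classification leaf predicts the majority class of the records that reach it, and a regression leaf predicts their mean. The function names and sample values are hypothetical.

```python
from collections import Counter
from statistics import mean

def classification_leaf(labels):
    """Predict the most frequent class among the records in the leaf."""
    return Counter(labels).most_common(1)[0][0]

def regression_leaf(values):
    """Predict the average of the continuous target values in the leaf."""
    return mean(values)

print(classification_leaf(["Good", "Good", "Bad"]))
print(regression_leaf([100, 200, 300]))
```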
© 1999 Two Crows Corporation
The example we’ve been using up until now has been very simple. The tree is easy to understand and
interpret. However, trees can become very complicated. Imagine the complexity of a decision tree
derived from a database of hundreds of attributes and a response variable with a dozen output classes.
Such a tree would be extremely difficult to understand, although each path to a leaf is usually
understandable. In that sense a decision tree can explain its predictions, which is an important
advantage.
However, this clarity can be somewhat misleading. For example, the hard splits of decision trees
imply a precision that is rarely reflected in reality. (Why would someone whose salary was $40,001
be a good credit risk whereas someone whose salary was $40,000 not be?) Furthermore, since several
trees can often represent the same data with equal accuracy, what interpretation should be placed on
the rules?
Decision trees make few passes through the data (no more than one pass for each level of the tree)
and they work well with many predictor variables. As a consequence, models can be built very
quickly, making them suitable for large data sets.
Trees left to grow without bound take longer to build and become unintelligible, but more importantly
they overfit the data. Tree size can be controlled via stopping rules that limit growth. One common
stopping rule is simply to limit the maximum depth to which a tree may grow. Another stopping rule
is to establish a lower limit on the number of records in a node and not to split nodes that fall below this limit.
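The two stopping rules just described can be sketched in a few lines. This is a simplified illustration, not any vendor's algorithm: it assumes numeric predictors, Gini impurity as the split measure, and hypothetical parameter names max_depth and min_records.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature, threshold) that most reduces impurity, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def grow(rows, labels, depth=0, max_depth=3, min_records=2):
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping rules: depth limit reached, too few records, or node already pure.
    if depth >= max_depth or len(rows) < min_records or len(set(labels)) == 1:
        return {"leaf": majority}
    split = best_split(rows, labels)
    if split is None:
        return {"leaf": majority}
    f, t = split
    lo = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    hi = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": grow([r for r, _ in lo], [y for _, y in lo], depth + 1, max_depth, min_records),
        "right": grow([r for r, _ in hi], [y for _, y in hi], depth + 1, max_depth, min_records),
    }

def depth_of(tree):
    if "leaf" in tree:
        return 0
    return 1 + max(depth_of(tree["left"]), depth_of(tree["right"]))

# Hypothetical data: one predictor (salary in thousands) with alternating classes.
rows = [(10,), (20,), (30,), (40,), (50,), (60,), (70,), (80,)]
labels = ["Bad", "Good", "Bad", "Good", "Bad", "Good", "Bad", "Good"]

unbounded = grow(rows, labels, max_depth=100, min_records=1)
shallow = grow(rows, labels, max_depth=2, min_records=1)
coarse = grow(rows, labels, max_depth=100, min_records=8)
print(depth_of(unbounded), depth_of(shallow), depth_of(coarse))
```

On this noisy sample the unbounded tree keeps splitting until every leaf is pure, while either stopping rule cuts growth short.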
An alternative to stopping rules is to prune the tree. The tree is allowed to grow to its full size and
then, using either built-in heuristics or user intervention, the tree is pruned back to the smallest size
that does not compromise accuracy. For example, a branch or subtree that the user feels is
inconsequential because it has very few cases might be removed. CART prunes trees by
cross-validating them to see if the improvement in accuracy justifies the extra nodes.
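Pruning can be illustrated with a simpler heuristic than the cross-validation CART uses. The sketch below swaps in reduced-error pruning against a single held-out sample: working from the bottom up, a subtree is replaced by a leaf whenever doing so does not increase errors on the held-out records. The tree layout, the "majority" field, and the data are all hypothetical.

```python
def predict(tree, row):
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]

def errors(tree, rows, labels):
    return sum(predict(tree, r) != y for r, y in zip(rows, labels))

def prune(tree, rows, labels):
    """Collapse a subtree into a leaf when held-out errors do not increase.

    A branch that no held-out record reaches is also collapsed, since
    nothing in the sample supports keeping it.
    """
    if "leaf" in tree:
        return tree
    f, t = tree["feature"], tree["threshold"]
    lo = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    hi = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    candidate = dict(
        tree,
        left=prune(tree["left"], [r for r, _ in lo], [y for _, y in lo]),
        right=prune(tree["right"], [r for r, _ in hi], [y for _, y in hi]),
    )
    as_leaf = {"leaf": tree["majority"]}
    if errors(as_leaf, rows, labels) <= errors(candidate, rows, labels):
        return as_leaf
    return candidate

# Hypothetical full-grown tree on salary (in thousands); the split at 75
# was fitted to very few training cases.
full_tree = {
    "feature": 0, "threshold": 40, "majority": "Good",
    "left": {"leaf": "Bad"},
    "right": {
        "feature": 0, "threshold": 75, "majority": "Good",
        "left": {"leaf": "Good"},
        "right": {"leaf": "Bad"},
    },
}

# Hypothetical held-out records, not used to grow the tree.
holdout_rows = [(20,), (30,), (50,), (60,), (80,), (90,)]
holdout_labels = ["Bad", "Bad", "Good", "Good", "Good", "Good"]

pruned = prune(full_tree, holdout_rows, holdout_labels)
print(pruned)
```

Here the split at 75 is removed because it hurts accuracy on the held-out records, while the split at 40 survives because removing it would add errors.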
A common criticism of decision trees is that they choose a split using a “greedy” algorithm in which
the decision on which variable to split on doesn’t take into account any effect the split might have on
future splits. In other words, the split decision is made at the node “in the moment” and it is never
revisited. In addition, all splits are made sequentially, so each split is dependent on its predecessor.
Thus all future splits are dependent on the first split, which means the final solution could be very
different if a different first split is made. The benefit of looking ahead to make the best splits based on
two or more levels at one time is unclear. Such attempts to look ahead are in the research stage; they
are very computationally intensive and presently unavailable in commercial implementations.
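A deliberately contrived case shows what a greedy, one-step view can miss. In the sketch below, two hypothetical binary predictors determine the class only in combination (an exclusive-or pattern), and Gini impurity is assumed as the split measure. Neither variable offers any improvement as a first split, yet once one of them has been split on, the other separates the classes completely.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gain(rows, labels, feature):
    """Impurity reduction from splitting a binary feature into 0 and 1."""
    left = [y for r, y in zip(rows, labels) if r[feature] == 0]
    right = [y for r, y in zip(rows, labels) if r[feature] == 1]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    return gini(labels) - weighted

# Hypothetical data: the class is "Good" only when exactly one flag is set.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = ["Bad", "Good", "Good", "Bad"]

# One step ahead: neither variable looks useful at the root.
first_split_gains = [gain(rows, labels, f) for f in (0, 1)]

# Two steps ahead: after splitting on variable 0, variable 1 is decisive.
second_split_gains = []
for value in (0, 1):
    branch = [(r, y) for r, y in zip(rows, labels) if r[0] == value]
    second_split_gains.append(gain([r for r, _ in branch], [y for _, y in branch], 1))

print(first_split_gains, second_split_gains)
```

Real data rarely has such a clean pattern, which is part of why the practical benefit of looking ahead remains unclear.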
Furthermore, algorithms used for splitting are generally univariate; that is, they consider only one
predictor variable at a time. And while this approach is one of the reasons the model builds quickly —
it limits the number of possible splitting rules to test — it also makes relationships between predictor
variables harder to detect.
Decision trees that are not limited to univariate splits could use multiple predictor variables in a single
splitting rule. Trees that split on linear combinations of variables are known as
oblique trees. A criterion for a split might be “SALARY < (0.35 * MORTGAGE),” for instance.
Splitting on logical combinations of variables (such as “SALARY > 35,000 OR MORTGAGE <
150,000”) is another kind of multivariate split.
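The contrast between univariate and oblique splits can be sketched as follows. The six records, their class labels, and the choice of which side of the rule counts as a bad risk are hypothetical; amounts are assumed to be in thousands. The code tries every single-variable threshold rule and compares the best of them with the oblique rule quoted above.

```python
def accuracy(rule, rows, labels):
    return sum(rule(r) == y for r, y in zip(rows, labels)) / len(rows)

def best_univariate_accuracy(rows, labels):
    """Try every one-variable threshold rule and return the best accuracy."""
    classes = sorted(set(labels))
    best = 0.0
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            for low in classes:
                for high in classes:
                    # Bind loop variables as defaults so each rule keeps its own.
                    def rule(r, f=f, t=t, low=low, high=high):
                        return low if r[f] <= t else high
                    best = max(best, accuracy(rule, rows, labels))
    return best

# Hypothetical records: (SALARY, MORTGAGE).
rows = [(20, 50), (30, 100), (40, 100), (60, 200), (80, 200), (100, 300)]
labels = ["Good", "Bad", "Good", "Bad", "Good", "Bad"]

def oblique_rule(r):
    """One multivariate rule: SALARY < 0.35 * MORTGAGE means Bad (assumed)."""
    salary, mortgage = r
    return "Bad" if salary < 0.35 * mortgage else "Good"

print("best single-variable rule:", best_univariate_accuracy(rows, labels))
print("oblique rule:", accuracy(oblique_rule, rows, labels))
```

A univariate tree could still fit these records, but only by stacking several splits to approximate the one diagonal boundary.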
Decision trees handle non-numeric data very well. This ability to accept categorical data minimizes
the number of data transformations and the explosion of predictor variables inherent in neural nets.
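The explosion of predictor variables can be seen in a small sketch. It assumes the usual indicator (one-of-N) encoding for feeding a categorical variable to a neural net, and a hypothetical OCCUPATION column; a tree, by contrast, can test category membership directly.

```python
def one_hot(values):
    """Encode a categorical column as one 0/1 indicator column per level."""
    levels = sorted(set(values))
    return levels, [[1 if v == level else 0 for level in levels] for v in values]

# Hypothetical categorical predictor.
occupation = ["clerk", "engineer", "teacher", "clerk", "nurse"]

# Neural-net style: one input column becomes one column per category.
levels, encoded = one_hot(occupation)
print(levels)
print(encoded)

# Tree style: a single split can ask whether the value is in a set of categories.
goes_left = [value in {"engineer", "nurse"} for value in occupation]
print(goes_left)
```

With four categories the single column becomes four inputs; a variable with hundreds of categories would add hundreds.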