…parameters (nine weights and four bias or constant terms) in the neural network shown in
Figure 4. Because they are so numerous, and because so many combinations of parameters result in
similar predictions, the parameters become uninterpretable and the network serves as a “black box”
predictor. In fact, a given result can be associated with several different sets of weights. Consequently,
the network weights in general do not aid in understanding the underlying process generating the
prediction. However, this is acceptable in many applications. A bank may want to automatically
recognize handwritten applications, but does not care about the form of the functional relationship
between the pixels and the characters they represent. Some of the many applications where hundreds
of variables may be input into models with thousands of parameters (node weights) include modeling
of chemical plants, robots and financial markets, and pattern recognition problems such as speech,
vision and handwritten character recognition.
One advantage of neural network models is that they can easily be implemented to run on massively
parallel computers with each node simultaneously doing its own calculations.
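This parallelism can be illustrated in a few lines (Python with NumPy; the layer sizes, weights and inputs below are invented for illustration). Each node in a layer computes its own weighted sum and activation independently of its neighbors, so all of a layer's nodes can be evaluated at once as a single matrix product:

```python
import numpy as np

def layer_forward(x, W, b):
    """Each row of W holds one node's weights. Every node's weighted sum is
    independent of the others, so one matrix product computes them all at
    once -- the same property that lets each node run on its own processor."""
    return np.tanh(W @ x + b)  # squashing activation, one output per node

x = np.array([0.5, -1.0, 2.0])  # three input values
W = np.zeros((4, 3))            # four nodes, three weights each (illustrative)
b = np.zeros(4)                 # one bias term per node
out = layer_forward(x, W, b)    # four outputs, one per node
```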
Users must be conscious of several facts about neural networks. First, neural networks are not easily interpreted: there is no explicit rationale given for the decisions or predictions a neural network makes.
Second, they tend to overfit the training data unless stringent measures, such as weight decay and/or cross-validation, are used judiciously. This is because a network of sufficient size has so many parameters that it can fit any data set arbitrarily well when allowed to train to convergence.
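The idea behind weight decay can be sketched in a few lines (the learning rate and decay constant below are illustrative, not taken from the text). Each gradient step also shrinks every weight toward zero, which discourages the large weight values a big network needs to memorize the training data:

```python
def decayed_step(w, grad, lr=0.1, decay=0.01):
    """One training update with weight decay: the ordinary gradient step
    plus an extra penalty term proportional to the weight itself."""
    return w - lr * (grad + decay * w)

w = 5.0
for _ in range(100):
    # With no data gradient at all, the decay term steadily pulls w toward 0.
    w = decayed_step(w, grad=0.0)
```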
Third, neural networks require an extensive amount of training time unless the problem is very small.
Once trained, however, they can provide predictions very quickly.
Fourth, they require no less data preparation than any other method, which is to say they require a lot
of data preparation. One myth of neural networks is that data of any quality can be used to provide
reasonable predictions. The most successful implementations of neural networks (or decision trees, or
logistic regression, or any other method) involve very careful data cleansing, selection, preparation
and pre-processing. For instance, neural nets require that all variables be numeric. Therefore
categorical data such as “state” is usually broken up into multiple dichotomous variables (e.g.,
“California,” “New York”), each with a “1” (yes) or “0” (no) value. The resulting increase in
variables is called the categorical explosion.
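The dichotomous recoding described above can be sketched as follows (the list of states is shortened for illustration; with all fifty states, one variable would become fifty):

```python
def one_hot(value, categories):
    """Recode one categorical value as a set of 0/1 variables,
    one per category -- the source of the 'categorical explosion'."""
    return {c: (1 if value == c else 0) for c in categories}

states = ["California", "New York", "Texas"]
encoded = one_hot("New York", states)
# encoded has one dichotomous variable per state, exactly one set to 1
```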
Finally, neural networks tend to work best when the data set is sufficiently large and the signal-to-noise ratio is reasonably high. Because they are so flexible, they will find many false patterns in a low signal-to-noise ratio situation.
Decision trees are a way of representing a series of rules that lead to a class or value. For example,
you may wish to classify loan applicants as good or bad credit risks. Figure 7 shows a simple decision
tree that solves this problem while illustrating all the basic components of a decision tree: the decision
node, branches and leaves.

© 1999 Two Crows Corporation

[Figure 7. A simple classification tree: the root node tests “Income > $40,000,” and each branch leads either to a further test or to a “Good Risk” or “Bad Risk” leaf.]

The first component is the top decision node, or root node, which specifies a test to be carried out.
The root node in this example is “Income > $40,000.” The results of this test cause the tree to split
into branches, each representing one of the possible answers. In this case, the test “Income >
$40,000” can be answered either “yes” or “no,” and so we get two branches.
Depending on the algorithm, each node may have two or more branches. For example, CART
generates trees with only two branches at each node. Such a tree is called a binary tree. When more
than two branches are allowed it is called a multiway tree.
Each branch will lead either to another decision node or to the bottom of the tree, called a leaf node.
By navigating the decision tree you can assign a value or class to a case by deciding which branch to
take, starting at the root node and moving to each subsequent node until a leaf node is reached. Each
node uses the data from the case to choose the appropriate branch.
Armed with this sample tree and a loan application, a loan officer could determine whether the
applicant was a good or bad credit risk. An individual with “Income > $40,000” and “High Debt”
would be classified a “Bad Risk,” whereas an individual with “Income < $40,000” and “Job > 5
Years” would be classified a “Good Risk.”
Decision tree models are commonly used in data mining to examine the data and induce the tree and
its rules that will be used to make predictions. A number of different algorithms may be used f...
This note was uploaded on 01/19/2014 for the course STATS 315B taught by Professor Friedman during the Winter '08 term at Stanford.