After the input layer, each node takes in a set of inputs, multiplies each input by a connection weight Wxy
(e.g., the weight from node 1 to node 3 is W13; see Figure 5), adds the products together, applies a function
(called the activation or squashing function) to the sum, and passes the output to the node(s) in the next
layer. For example, the value passed from node 4 to node 6 is:
    Activation function applied to ([W14 * value of node 1] + [W24 * value of node 2])

[Figure 5: network diagram of nodes 1 through 6, with connection weights including W13, W15, W23, W24, W36, and W46.]
Figure 5. Wxy is the weight from node x to node y.

Each node may be viewed as a predictor variable (nodes 1 and 2 in this example) or as a combination
of predictor variables (nodes 3 through 6). Node 6 is a non-linear combination of the values of nodes
1 and 2, because of the activation function on the summed values at the hidden nodes. In fact, a neural
net with a linear activation function and no hidden layer is equivalent to a linear regression, and
with certain non-linear activation functions (such as the logistic function) it is equivalent to logistic regression.
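
The node computation described above can be sketched in a few lines of Python. This is only an illustration: the logistic (sigmoid) function is assumed as the activation, and the input values and weights are hypothetical numbers, not values from Figure 5.

```python
import math

def sigmoid(x):
    """Logistic squashing function, one common choice of activation (assumed here)."""
    return 1.0 / (1.0 + math.exp(-x))

def node_output(inputs, weights, activation=sigmoid):
    """Multiply each input by its weight, add the products, apply the activation."""
    total = sum(w * v for w, v in zip(weights, inputs))
    return activation(total)

# Hypothetical input values and weights, chosen only for illustration.
node1, node2 = 1.0, 0.5
W14, W24 = 0.4, -0.2

# Value passed from node 4 to node 6:
# activation([W14 * value of node 1] + [W24 * value of node 2])
node4 = node_output([node1, node2], [W14, W24])

# With a linear (identity) activation and no hidden layer, the output is
# just the weighted sum, which has the same form as a linear regression
# prediction without an intercept.
linear_prediction = node_output([node1, node2], [W14, W24], activation=lambda s: s)
```

Swapping the activation is the only change needed to move between the linear and non-linear cases, which is why the choice of activation function matters so much to what the network can represent.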
The connection weights (W’s) are the unknown parameters which are estimated by a training method.
Originally, the most common training method was backpropagation; newer methods include conjugate
gradient, quasi-Newton, Levenberg-Marquardt, and genetic algorithms. Each training method has a
set of parameters that control various aspects of training such as avoiding local optima or adjusting
the speed of convergence.
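
To make the idea of training parameters concrete, here is a minimal sketch of gradient descent on a toy one-weight error surface. The parameter names `learning_rate` and `momentum` are common in gradient-based training but are assumptions here, not settings named in the text; the error surface is invented for illustration.

```python
def gradient_descent(grad, start, learning_rate=0.1, momentum=0.0, steps=50):
    """Plain gradient descent with an optional momentum term.

    learning_rate scales each step; momentum carries over part of the
    previous step, which can speed up progress on flat error surfaces.
    """
    w = start
    velocity = 0.0
    for _ in range(steps):
        velocity = momentum * velocity - learning_rate * grad(w)
        w += velocity
    return w

# Toy error surface: error(w) = (w - 3)^2, so the gradient is 2 * (w - 3)
# and the error is smallest at w = 3.
def grad(w):
    return 2.0 * (w - 3.0)

slow = gradient_descent(grad, start=0.0, learning_rate=0.01)
fast = gradient_descent(grad, start=0.0, learning_rate=0.1)
with_momentum = gradient_descent(grad, start=0.0, learning_rate=0.01, momentum=0.9)
```

After the same number of steps, the run with the larger learning rate ends much closer to the minimum than the run with the smaller one, and adding momentum to the small learning rate also closes much of the gap.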
The architecture (or topology) of a neural network is the number of nodes and hidden layers, and how
they are connected. In designing a neural network, either the user or the software must choose the
number of hidden nodes and hidden layers, the activation function, and limits on the weights. While
there are some general guidelines, you may have to experiment with these parameters.
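
One way to see what an architecture choice costs is to count the connection weights it implies. The sketch below assumes a fully connected feed-forward network with no bias terms, and assumes the Figure 5 network has two input nodes, three hidden nodes, and one output node; `count_weights` is a hypothetical helper name.

```python
def count_weights(layer_sizes):
    """Number of connection weights in a fully connected feed-forward net.

    layer_sizes lists the number of nodes per layer, input layer first.
    Bias terms are not counted (an assumption of this sketch).
    """
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

# Assumed layout of the Figure 5 network: 2 inputs, 3 hidden nodes, 1 output.
figure5_weights = count_weights([2, 3, 1])    # 2*3 + 3*1

# Adding hidden nodes or layers increases the number of weights to estimate.
wider_weights = count_weights([2, 10, 1])     # 2*10 + 10*1
deeper_weights = count_weights([2, 3, 3, 1])  # 2*3 + 3*3 + 3*1
```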
One of the most common types of neural network is the feed-forward backpropagation network. For
simplicity of discussion, we will assume a single hidden layer.
Backpropagation training is simply a version of gradient descent, a type of algorithm that tries to
reduce a target value (error, in the case of neural nets) at each step. The algorithm proceeds as
follows.

© 1999 Two Crows Corporation

Feed forward: The value of the output node is calculated based on the input node values and
a set of initial weights. The values from the input nodes are combined in the hidden layer,
and the values of those nodes are combined to calculate the output value.

Backpropagation: The error in the output is computed by finding the difference between the
calculated output and the desired output (i.e., the actual values found in the training set).
Next, the error from the output is assigned to the hidden layer nodes proportionally to their
weights. This permits an error to be computed for every output node and hidden node in the
network. Finally, the error at each of the hidden and output nodes is used by the algorithm to
adjust the weights coming into that node to reduce the error.
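
The two steps above can be sketched for a single training row as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: it assumes one hidden layer, a sigmoid activation at every node, a squared-error measure, no bias terms, and a hypothetical `learning_rate` parameter. The starting weights and the training row are made up.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_row(inputs, target, w_hidden, w_out, learning_rate=0.5):
    """One feed-forward pass and one backpropagation update for a single row.

    w_hidden[j][i] is the weight from input node i to hidden node j;
    w_out[j] is the weight from hidden node j to the single output node.
    Weights are updated in place. Returns the squared error measured
    before the weights were adjusted.
    """
    # Feed forward: combine inputs at the hidden nodes, then at the output.
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs))) for ws in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))

    # Backpropagation: error at the output node, scaled by the sigmoid slope.
    error = target - output
    delta_out = error * output * (1.0 - output)

    # Assign the output error to each hidden node in proportion to its weight.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]

    # Adjust the weights coming into each node to reduce the error.
    for j, h in enumerate(hidden):
        w_out[j] += learning_rate * delta_out * h
    for j, ws in enumerate(w_hidden):
        for i, x in enumerate(inputs):
            ws[i] += learning_rate * delta_hidden[j] * x
    return error ** 2

# Hypothetical starting weights for a 2-input, 3-hidden-node, 1-output network.
w_hidden = [[0.1, -0.2], [0.4, 0.3], [-0.3, 0.2]]
w_out = [0.2, -0.1, 0.3]

first_error = train_row([1.0, 0.5], 1.0, w_hidden, w_out)
for _ in range(200):
    last_error = train_row([1.0, 0.5], 1.0, w_hidden, w_out)
```

Repeating the update on the same row drives the squared error down, which is the behavior gradient descent is meant to produce. The hidden-node errors are computed before any weight is changed so that they reflect the weights used in the forward pass.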
This process is repeated for each row in the training set. Each pass through all rows in the training set
is called an epoch. The training set will be used repeatedly, until the error no longer decreases. At that
point the neural net is considered to be trained to find the pattern in the test set.
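
The epoch loop can be sketched independently of the network itself. In the illustration below, a one-weight linear model stands in for the neural net to keep the example short; `train_until_flat`, `update_row`, and `tolerance` are hypothetical names, and the two training rows are invented.

```python
def train_until_flat(rows, update_row, max_epochs=1000, tolerance=1e-6):
    """Repeat passes through the training rows until the error stops decreasing.

    update_row(inputs, target) adjusts the model for one row and returns
    that row's squared error. One pass through all rows is one epoch.
    Returns (epochs_run, total_error_in_last_epoch).
    """
    previous = float("inf")
    for epoch in range(1, max_epochs + 1):
        total = sum(update_row(inputs, target) for inputs, target in rows)
        if previous - total < tolerance:
            break  # the error no longer decreases meaningfully
        previous = total
    return epoch, total

# Stand-in model with a single weight: prediction = w * x.
model = {"w": 0.0}

def update_row(x, target, learning_rate=0.1):
    error = target - model["w"] * x
    model["w"] += learning_rate * error * x
    return error ** 2

rows = [(1.0, 2.0), (2.0, 4.0)]  # made-up training set where target = 2 * x
epochs, final_error = train_until_flat(rows, update_row)
```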
Because so many parameters may exist in the hidden layers, a neural net with enough hidden nodes
will always eventually fit the training set if left to run long enough. But how well will it do on other
data? To avoid an overfitted neural network that works well only on the training data, you must
know when to stop training. Some implementations will evaluate the neural net against the test data
periodically during training. As long as the error rate on the test set is decreasing, training will
continue. If the error rate on the test data goes up, even though the error rate on the training data is
still decreasing, then the neural net may be overfitting the data.
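
The stopping rule described above can be sketched as a small loop. This is one possible arrangement, not a description of any particular product: `train_one_epoch`, `test_error`, and the `patience` parameter (how many epochs to wait for an improvement before stopping) are assumptions, and the error curve is simulated rather than taken from a real network.

```python
def train_with_early_stopping(train_one_epoch, test_error, max_epochs=500, patience=10):
    """Train until the error on the test data has stopped improving.

    train_one_epoch: callable that performs one pass through the training set
    test_error: callable that returns the current error on the test data
    Returns (best_epoch, best_error).
    """
    best_epoch, best_error = 0, float("inf")
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        err = test_error()
        if err < best_error:
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            break  # test error has not improved for `patience` epochs
    return best_epoch, best_error

# Simulated test-error curve that bottoms out at epoch 100 and then rises.
state = {"epoch": 0}

def fake_epoch():
    state["epoch"] += 1

def fake_test_error():
    e = state["epoch"]
    return 4.0 + ((e - 100) / 100.0) ** 2

best_epoch, best_error = train_with_early_stopping(fake_epoch, fake_test_error)
```

In practice an implementation would also keep a copy of the weights from the best epoch, since the weights at the moment training stops are already slightly past the minimum.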
The graph in Figure 6 illustrates how the test data set helps us avoid overfitting. You can see how the
error rate on the training data decreases with each pass the neural net makes through the data (small circle markers), but
the error rate for the test data (triangle markers) bottoms out and starts increasing. Since the goal of
data mining is to make predictions on data other than the training set, you are clearly better off using a
neural net that minimizes the error on the test data, not the training data.

[Figure 6: plot titled "Error as function of training", showing Test Set Error and Training Set Error (y-axis: ERROR) against the number of training epochs (x-axis: 0 to 500).]
Figure 6. Error rate as a function of the number of epochs in a neural net.
(Screen shot courtesy Dr. Richard D. De Veaux, Williams College)

Neural networks differ in philosophy from many statistical methods in several ways. First, a neural
network usually has more parameters than does a typical statistical model. For example, there are