
After the input layer, each node takes in a set of inputs, multiplies them by a connection weight Wxy (e.g., the weight from node 1 to 3 is W13; see Figure 5), adds them together, applies a function (called the activation or squashing function) to them, and passes the output to the node(s) in the next layer. For example, the value passed from node 4 to node 6 is:

    activation function applied to ([W14 * value of node 1] + [W24 * value of node 2])

[Figure 5. Wxy is the weight from node x to node y. The figure shows input nodes 1 and 2 feeding hidden nodes 3, 4, and 5 (weights W13, W14, W15, W23, W24, W25), which in turn feed output node 6 (weights W36, W46, W56).]

Each node may be viewed as a predictor variable (nodes 1 and 2 in this example) or as a combination of predictor variables (nodes 3 through 6). Node 6 is a non-linear combination of the values of nodes 1 and 2, because of the activation function applied to the summed values at the hidden nodes. In fact, with a linear activation function and no hidden layer, a neural net is equivalent to linear regression; and with certain non-linear activation functions, neural nets are equivalent to logistic regression.

The connection weights (W's) are the unknown parameters that are estimated by a training method. Originally, the most common training method was backpropagation; newer methods include conjugate gradient, quasi-Newton, Levenberg-Marquardt, and genetic algorithms. Each training method has a set of parameters that control various aspects of training, such as avoiding local optima or adjusting the speed of convergence.

The architecture (or topology) of a neural network is the number of nodes and hidden layers, and how they are connected. In designing a neural network, either the user or the software must choose the number of hidden nodes and hidden layers, the activation function, and limits on the weights. While there are some general guidelines, you may have to experiment with these parameters.

One of the most common types of neural network is the feed-forward backpropagation network. For simplicity of discussion, we will assume a single hidden layer. Backpropagation training is simply a version of gradient descent, a type of algorithm that tries to reduce a target value (error, in the case of neural nets) at each step. The algorithm proceeds as follows.

Feed forward: The value of the output node is calculated based on the input node values and a set of initial weights. The values from the input nodes are combined in the hidden layer, and the values of those nodes are combined to calculate the output value.

Backpropagation: The error in the output is computed by finding the difference between the calculated output and the desired output (i.e., the actual values found in the training set). Next, the error from the output is assigned to the hidden layer nodes proportionally to their weights. This permits an error to be computed for every output node and hidden node in the network. Finally, the error at each of the hidden and output nodes is used by the algorithm to adjust the weight coming into that node to reduce the error.

This process is repeated for each row in the training set. Each pass through all rows in the training set is called an epoch. The training set will be used repeatedly, until the error no longer decreases. At that point the neural net is considered to be trained to find the pattern in the training set. Because so many parameters may exist in the hidden layers, a neural net with enough hidden nodes will always eventually fit the training set if left to run long enough.
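To make the feed-forward and backpropagation steps concrete, here is a minimal sketch (not part of the original document) in Python with NumPy. It builds a network shaped like Figure 5, with two input nodes, three hidden nodes, and one output node, uses a sigmoid activation, and trains by plain gradient descent on a toy XOR problem. For brevity it updates the weights once per pass over the whole training set rather than row by row, and all names (sigmoid, W_in, W_out, lr) are invented for illustration.

    import numpy as np

    def sigmoid(z):
        # a common "squashing" activation function
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    W_in = rng.normal(scale=0.5, size=(2, 3))    # weights from input nodes 1, 2 to hidden nodes 3, 4, 5
    W_out = rng.normal(scale=0.5, size=(3, 1))   # weights from hidden nodes 3, 4, 5 to output node 6
    lr = 0.5                                     # learning rate: step size for gradient descent

    # toy training set (XOR), a pattern that genuinely needs the hidden layer
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    for epoch in range(10000):                   # one epoch = one pass through all training rows
        # feed forward: combine inputs at the hidden nodes, then at the output node
        hidden = sigmoid(X @ W_in)               # e.g. node 4 = sigmoid(W14*x1 + W24*x2)
        output = sigmoid(hidden @ W_out)

        # backpropagation: compute the output error and push it back through the weights
        err_out = (output - y) * output * (1 - output)
        err_hidden = (err_out @ W_out.T) * hidden * (1 - hidden)

        # adjust each weight a small step in the direction that reduces the error
        W_out -= lr * hidden.T @ err_out
        W_in -= lr * X.T @ err_hidden

    # after training the outputs usually approach [0, 1, 1, 0]; like any gradient descent,
    # a run can stall in a local optimum, in which case a different seed or learning rate helps
    print(np.round(sigmoid(sigmoid(X @ W_in) @ W_out), 2))

The two error terms mirror the text: err_out is the difference between the calculated and desired output (scaled by the activation's slope), and err_hidden distributes that error back to the hidden nodes in proportion to their weights.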
But how well will it do on other data? To avoid an overfitted neural network, which will only work well on the training data, you must know when to stop training. Some implementations will evaluate the neural net against the test data periodically during training. As long as the error rate on the test set is decreasing, training continues. If the error rate on the test data goes up, even though the error rate on the training data is still decreasing, then the neural net may be overfitting the data (see the sketch at the end of this section).

The graph in Figure 6 illustrates how the test data set helps us avoid overfitting. You can see how the error rate decreases with each pass the neural net makes through the data (small circle markers), but the error rate for the test data (triangle markers) bottoms out and starts increasing. Since the goal of data mining is to make predictions on data other than the training set, you are clearly better off using a neural net that minimizes the error on the test data, not the training data.

[Figure 6. Error rate as a function of the number of training epochs, with training set error falling steadily while test set error bottoms out and rises. (Screen shot courtesy Dr. Richard D. De Veaux, Williams College)]

Neural networks differ in philosophy from many statistical methods in several ways. First, a neural network usually has more parameters than does a typical statistical model. For example, there are thirteen p...
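The stopping rule described above can be sketched in a few lines of Python (again illustrative, not from the original document): train one epoch at a time, watch the error on the held-out test set, and keep the weights from the epoch where that error was lowest. The helpers train_one_epoch and error_on are hypothetical stand-ins for the training step and the error computation.

    import copy

    def early_stopping_fit(net, train_set, test_set, max_epochs=500, patience=10):
        # Train until the test-set error stops improving (simple early stopping).
        best_error = float("inf")
        best_net = copy.deepcopy(net)
        epochs_without_improvement = 0

        for epoch in range(max_epochs):
            net.train_one_epoch(train_set)        # hypothetical: one pass over the training rows
            test_error = net.error_on(test_set)   # hypothetical: error rate on the held-out data

            if test_error < best_error:           # test error still falling: keep training
                best_error = test_error
                best_net = copy.deepcopy(net)
                epochs_without_improvement = 0
            else:                                 # test error rising: likely overfitting
                epochs_without_improvement += 1
                if epochs_without_improvement >= patience:
                    break

        return best_net                           # the weights that minimized test-set error

Returning the best saved weights, rather than the final ones, corresponds to picking the bottom of the test-error curve in Figure 6.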