Networks also get more computationally expensive to make predictions as the depth increases, since the computation takes time linear in the number of edges in the network. This is not terrible, especially since all the nodes on any given level can be evaluated in parallel on multiple cores to reduce the prediction time. Training time is where the real computational bottlenecks generally exist.

Non-linearity

The image of recognizing increasing levels of abstraction up the hidden layers of a network is certainly a compelling one. It is fair to ask whether it is real, however. Do extra layers in a network really give us additional computational power to do things we cannot do with fewer? The example of Figure 11.13 seems to argue the converse. It shows addition networks built with two and three layers of nodes, respectively, but both compute exactly the same function on all inputs. This suggests that the extra layer was unnecessary, except perhaps to reduce the engineering constraint of node degree, the number of edges entering a node as input.

What it really shows is that we need more complicated, non-linear node activation functions $\phi(v)$ to take advantage of depth. Non-linear functions cannot be composed in the same way that addition can be composed to yield addition. This non-linear activation function $\phi(v_i)$ typically operates on a weighted sum
of the inputs $x$, where

$$v_i = \beta + \sum_i w_i x_i.$$

Here $\beta$ is a constant for the given node, perhaps to be learned in training. It is called the bias of the node, because it defines the activation in the absence of other inputs.

That computing the output values of layer $l$ involves applying the activation function $\phi$ to weighted sums of the values from layer $l-1$ has an important implication for performance. In particular, neural network evaluation basically involves just one matrix multiplication per level, where the weighted sums are obtained by multiplying an $|V_l| \times |V_{l-1}|$ weight matrix $W$ by an $|V_{l-1}| \times 1$ output vector $V_{l-1}$. Each element of the resulting $|V_l| \times 1$ vector is then hit with the $\phi$ function to produce the output values for that layer. Fast libraries for matrix multiplication can perform the heart of this evaluation very efficiently.

A suite of interesting, non-linear activation functions has been deployed in building networks. Two of the most prominent, shown in Figure 11.14, include:

Figure 11.14: The logistic (left) and ReLU (right) activation functions for nodes in neural networks.

Logit: We have previously encountered the logistic function, or logit, in our discussion of logistic regression for classification. Here

$$f(x) = \frac{1}{1 + e^{-x}}$$

This unit has the property that the output is constrained to the range $[0,1]$, where $f(0) = 1/2$. Further, the function is differentiable, so backpropagation can be used to train the resulting network.
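The one-matrix-multiplication-per-level evaluation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the text's implementation: the function names `logistic` and `forward` are my own, and for convenience the per-node bias constants $\beta$ are gathered into one bias vector per layer.

```python
import numpy as np

def logistic(v):
    # f(v) = 1 / (1 + e^{-v}), applied elementwise to the weighted sums
    return 1.0 / (1.0 + np.exp(-v))

def forward(layers, x):
    """Evaluate a feed-forward network, one matrix multiply per level.

    layers: list of (W, beta) pairs, where W is the |V_l| x |V_{l-1}|
    weight matrix and beta holds the bias constants for level l.
    """
    v = x
    for W, beta in layers:
        v = logistic(beta + W @ v)  # weighted sums, then activation phi
    return v

rng = np.random.default_rng(1)
# A toy network: 3 inputs -> 4 hidden nodes -> 1 output
layers = [(rng.standard_normal((4, 3)), rng.standard_normal(4)),
          (rng.standard_normal((1, 4)), rng.standard_normal(1))]
y = forward(layers, rng.standard_normal(3))
print(y.shape)  # → (1,); the logistic keeps each output within (0, 1)
```

Because each level reduces to a dense matrix-vector product, the whole evaluation inherits the speed of optimized linear algebra libraries, as the text notes.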
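The earlier point about the addition networks of Figure 11.13 generalizes: any stack of purely linear layers collapses to a single linear layer, which is why a non-linear $\phi$ is needed to benefit from depth. A small numerical check of this claim, with arbitrary made-up weight matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked linear (weighted-sum-only) layers: y = W2 (W1 x)
W1 = rng.standard_normal((4, 3))   # first layer: 3 inputs -> 4 hidden nodes
W2 = rng.standard_normal((2, 4))   # second layer: 4 hidden -> 2 outputs
x = rng.standard_normal(3)

deep = W2 @ (W1 @ x)               # two-layer network, identity activations
shallow = (W2 @ W1) @ x            # equivalent one-layer network

print(np.allclose(deep, shallow))  # → True: the extra layer added no power
```

By associativity of matrix multiplication, the two-layer network computes exactly the same function as the single matrix $W_2 W_1$ on every input; a non-linear activation between the layers breaks this collapse.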