Networks also get more computationally expensive to make predictions as the depth increases, since the computation takes time linear in the number of edges in the network. This is not terrible, especially since all the nodes on any given level can be evaluated in parallel on multiple cores to reduce the prediction time. Training time is where the real computational bottlenecks generally exist.

Non-linearity

The image of recognizing increasing levels of abstraction up the hidden layers of a network is certainly a compelling one. It is fair to ask if it is real, however. Do extra layers in a network really give us additional computational power to do things we can't with less?

The example of Figure 11.13 seems to argue the converse. It shows addition networks built with two and three layers of nodes, respectively, but both compute exactly the same function on all inputs. This suggests that the extra layer was unnecessary, except perhaps to reduce the engineering constraint of node degree, the number of edges entering as input.

What it really shows is that we need more complicated, non-linear node activation functions $\phi(v)$ to take advantage of depth. Non-linear functions cannot be composed in the same way that addition can be composed to yield addition.
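To make this composition point concrete, here is a minimal sketch in Python with NumPy (the matrix sizes and variable names are illustrative assumptions, not taken from the text): stacking two purely linear layers collapses into a single linear layer, while inserting a non-linear activation such as ReLU between them yields a function that no single linear layer can reproduce.

    import numpy as np

    rng = np.random.default_rng(0)

    # Two purely "linear" layers: each just multiplies by a weight matrix.
    W1 = rng.normal(size=(4, 3))   # maps 3 inputs to 4 hidden values
    W2 = rng.normal(size=(2, 4))   # maps 4 hidden values to 2 outputs
    x = rng.normal(size=(3, 1))    # a single input vector

    # Composing linear layers yields another linear layer: W2 (W1 x) = (W2 W1) x.
    two_layer = W2 @ (W1 @ x)
    one_layer = (W2 @ W1) @ x
    print(np.allclose(two_layer, one_layer))   # True: the extra layer added nothing

    # Insert a non-linear activation (ReLU) between the layers and the collapse fails.
    relu = lambda v: np.maximum(v, 0)
    nonlinear = W2 @ relu(W1 @ x)
    print(np.allclose(nonlinear, one_layer))   # generally False: depth now matters

In a real network, such an activation is applied not to the raw inputs but to a weighted sum computed at each node, as described next.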
This nonlinear activation function $\phi(v_i)$ typically operates on a weighted sum of the inputs $x$, where

$$v_i = \beta + \sum_i w_i x_i.$$

Here $\beta$ is a constant for the given node, perhaps to be learned in training. It is called the bias of the node because it defines the activation in the absence of other inputs.

That computing the output values of layer $l$ involves applying the activation function $\phi$ to weighted sums of the values from layer $l-1$ has an important implication on performance. In particular, neural network evaluation basically just involves one matrix multiplication per level, where the weighted sums are obtained by multiplying a $|V_l| \times |V_{l-1}|$ weight matrix $W$ by a $|V_{l-1}| \times 1$ output vector $V_{l-1}$. Each element of the resulting $|V_l| \times 1$ vector is then hit with the $\phi$ function to prepare the output values for that layer. Fast libraries for matrix multiplication can perform the heart of this evaluation very efficiently.

Figure 11.14: The logistic (left) and ReLU (right) activation functions for nodes in neural networks.

A suite of interesting, non-linear activation functions has been deployed in building networks. Two of the most prominent, shown in Figure 11.14, include:

• Logit: We have previously encountered the logistic function, or logit, in our discussion of logistic regression for classification. Here

$$f(x) = \frac{1}{1 + e^{-x}}$$

This unit has the property that the output is constrained to the range $[0, 1]$, where $f(0) = 1/2$. Further, the function is differentiable, so backpropagation can be used to train the resulting network.
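As a rough illustration of this layer-at-a-time evaluation (a minimal sketch, assuming NumPy and arbitrary layer sizes and variable names that are not from the text), the outputs of layer $l$ come from one matrix-vector product plus a bias, pushed element-wise through the logistic activation:

    import numpy as np

    def logistic(z):
        # f(z) = 1 / (1 + e^(-z)), applied element-wise
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(1)

    # Suppose layer l-1 produced 5 output values and layer l has 3 nodes.
    v_prev = rng.normal(size=(5, 1))   # |V_{l-1}| x 1 output vector of layer l-1
    W = rng.normal(size=(3, 5))        # |V_l| x |V_{l-1}| weight matrix
    beta = rng.normal(size=(3, 1))     # one bias term per node of layer l

    # One matrix multiplication gives the weighted sums for every node in layer l;
    # the activation function is then applied to each element of the result.
    v_l = logistic(W @ v_prev + beta)  # |V_l| x 1 output vector of layer l
    print(v_l)

Evaluating a deeper network simply repeats this step once per layer, which is why fast matrix-multiplication libraries carry most of the prediction cost.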