Dropout was introduced in 2014 by Hinton and co-workers as a means to prevent overfitting. Again, they took their cue from biology:

“A motivation for dropout comes from a theory of the role of sex in evolution (Livnat et al., 2010). Sexual reproduction involves taking half the genes of one parent and half of the other, adding a very small amount of random mutation, and combining them to produce an offspring. The asexual alternative is to create an offspring with a slightly mutated copy of the parent’s genes. It seems plausible that asexual reproduction should be a better way to optimize individual fitness because a good set of genes that have come to work well together can be passed on directly to the offspring. On the other hand, sexual reproduction is likely to break up these co-adapted sets of genes, especially if these sets are large. Intuitively, this should decrease the fitness of organisms that have already evolved complicated coadaptations. However, sexual reproduction is the way most advanced organisms evolved.” (Hinton et al., 2014)

The two images (above) demonstrate how this idea was implemented in neural networks. The image on the left is a standard feedforward network in which three neurons feed into four, which, in turn, feed into two output neurons. The second image shows how 50% dropout is implemented. Given the dropout rate for a specific layer (50% in this case), that percentage of randomly chosen neurons is temporarily switched off, together with their connecting weights. Since the active neurons can no longer rely on their neighbors, they must become as useful as possible themselves. Note that this only happens during training. Once the neural network is fully trained, all the neurons are active. Since twice as many neurons are now in use as during training, the weights are divided by two.
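As a concrete illustration of this scheme, the sketch below applies dropout to a single layer's activations in NumPy. It is a minimal sketch, not the implementation of any particular library, and the function name, the 50% rate, and the example values are chosen only for illustration. During training a random mask switches each neuron off with probability equal to the dropout rate; at test time every neuron stays active and the outputs are scaled by the keep probability, which is equivalent to dividing the weights by two for 50% dropout.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, drop_rate=0.5, training=True):
    """Illustrative sketch of the dropout scheme described above.

    Training: each neuron is kept with probability (1 - drop_rate);
    dropped neurons (and hence their outgoing connections) contribute zero.
    Test time: all neurons are active and the outputs are scaled by the
    keep probability, e.g. by 0.5 for 50% dropout.
    """
    keep_prob = 1.0 - drop_rate
    if training:
        # Randomly switch off a fraction drop_rate of the neurons.
        mask = rng.random(activations.shape) < keep_prob
        return activations * mask
    # Fully trained network: every neuron active, rescale to compensate.
    return activations * keep_prob

# Example: a layer of four activations with 50% dropout.
h = np.array([0.7, 1.2, 0.3, 0.9])
print(dropout_forward(h, drop_rate=0.5, training=True))   # roughly half set to zero
print(dropout_forward(h, drop_rate=0.5, training=False))  # all values scaled by 0.5
```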
Activation Functions

Figure 7: Sigmoid activation function

The sigmoid was, for a long time, the favorite activation function. However, deep neural networks built with it turned out to be very hard, if not impossible, to train. The problem is that its gradient vanishes for input values away from zero, as should be clear from Figure 7. Since the magnitude of the gradient determines how much the weights are updated during gradient descent, a vanishing gradient means no update and hence no training. We will discuss this issue in much more detail in a later module when we look at the mechanics of backpropagation.

Figure 8: ReLU activation function

The rectified linear unit (ReLU) was proposed by Nair and Hinton (2010), and its efficacy was subsequently demonstrated by Xu et al. It should be clear that it largely eliminates the vanishing gradient problem, certainly for positive inputs. It remains a popular activation function.
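To make the vanishing-gradient argument concrete, the following minimal NumPy sketch evaluates the derivative of the sigmoid, which is sigma(x) * (1 - sigma(x)), and the derivative of the ReLU at a few input values. The sigmoid's derivative peaks at 0.25 at zero and is essentially zero a few units away from the origin, while the ReLU's derivative is exactly one for any positive input.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # at most 0.25, vanishes for large |x|

def relu_grad(x):
    return (x > 0).astype(float)  # 1 for positive inputs, 0 otherwise

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid_grad(xs))  # approx [0.000045, 0.105, 0.25, 0.105, 0.000045]
print(relu_grad(xs))     # [0., 0., 0., 1., 1.]
```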
