parallel/distributed SGD and touch on second-order methods for completeness. Most of the examples presented in the accompanying code for this chapter are based on a Python package called downhill. Downhill implements SGD with many of its variations and is an excellent choice for experimenting.

Optimization Problems

Simply put, an optimization problem involves finding the parameters that minimize a mathematical function. For example, given the function f(x) = x², finding the value of x that minimizes the function is an optimization problem (refer to Figure 8-1).

Figure 8-1. An optimization problem involves finding the parameters that minimize a given function
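As a minimal sketch (not taken from the book's accompanying code), the toy problem of minimizing f(x) = x² can be solved numerically by repeatedly stepping against the derivative f′(x) = 2x:

```python
# Minimizing f(x) = x**2 by stepping against its derivative f'(x) = 2*x.

def f(x):
    return x ** 2

def grad_f(x):
    return 2 * x

x = 5.0          # arbitrary starting point
alpha = 0.1      # step size (assumed value for illustration)
for _ in range(100):
    x = x - alpha * grad_f(x)   # move opposite to the derivative

print(round(x, 6))  # prints 0.0 -- x has converged to the minimizer
```

Each update multiplies x by (1 − 2α) = 0.8, so x shrinks geometrically toward 0, the unique minimizer of f.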
CHAPTER 8: STOCHASTIC GRADIENT DESCENT

While the functions we want to minimize when building deep learning models are far more complicated (involving multiple parameters, which may be scalars, vectors, or matrices), conceptually it is simply a matter of finding the parameters that minimize the function. The function one wants to optimize while building a deep learning model is referred to as the loss function. The loss function may have a number of scalar-, vector-, or matrix-valued parameters, but it always has a scalar output. This scalar output represents the goodness of the model. Goodness typically means a combination of how well the model predicts and how simple the model is.

Note: For now, we will stay away from the statistical/machine learning aspects of a loss function (covered elsewhere in the book) and focus purely on solving such optimization problems. That is, we assume that we have been presented with a loss function L(x), where x represents the parameters of the model, and the job at hand is to find the values of x that minimize L(x).

Method of Steepest Descent

Let us now look at a simple mathematical idea, which is the intuition behind SGD. For the sake of simplicity, let us assume that x is just one vector. Given that we want to minimize L(x), we want to change or update x such that L(x) decreases. Let u represent the unit vector or direction in which x should ideally be changed, and let α denote the magnitude (a scalar) of this step. A higher value of α implies a larger step in the direction u, which is not desired, because u is evaluated at the current value of x and will be different for a different x. Thus, we want to find a u such that L(x + αu) is minimized as α → 0⁺. For small α, a first-order Taylor expansion gives

L(x + αu) ≈ L(x) + α uᵀ∇ₓL(x).

Thus, we basically want to find a u such that uᵀ∇ₓL(x) is minimized. Note that ∇ₓL(x) is the gradient of L(x).
Given that both u and ∇ₓL(x) are vectors, it follows that uᵀ∇ₓL(x) = ‖u‖ ‖∇ₓL(x)‖ cos θ, where θ is the angle between the two vectors (refer to Figure 8-2).
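A quick numerical sweep (with an assumed gradient vector, chosen purely for illustration) shows how the dot product uᵀ∇ₓL(x) varies with the angle of the unit direction u, and that it is smallest when u points exactly opposite the gradient:

```python
import numpy as np

g = np.array([3.0, 4.0])  # stand-in for the gradient vector (assumed)

# Sample unit directions u at one-degree increments around the circle.
angles = np.linspace(0, 2 * np.pi, 360, endpoint=False)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)

dots = dirs @ g                 # u^T g for every sampled direction
best = dirs[np.argmin(dots)]    # direction giving the smallest dot product
# best is (up to the 1-degree grid) -g / ||g||, i.e. roughly [-0.6, -0.8]
```

Because the directions are sampled on a one-degree grid, the result matches −g/‖g‖ only approximately, but it makes the cos θ picture concrete.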
The value of cos θ is minimized at θ = π, that is, when the two vectors point in opposite directions. Thus, it follows that setting the direction u = −∇ₓL(x) achieves our desired objective. This leads to a simple iterative algorithm, as follows:

Input: α, n
Initialize x to a random value.
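The steepest-descent loop described above can be sketched as follows. The two-parameter loss L(x) = (x₀ − 1)² + (x₁ + 2)² and its gradient are assumptions chosen for illustration; its true minimizer is x = [1, −2].

```python
import numpy as np

def L(x):
    # Example loss with known minimizer [1, -2] (assumed for illustration)
    return (x[0] - 1) ** 2 + (x[1] + 2) ** 2

def grad_L(x):
    # Analytic gradient of L
    return np.array([2 * (x[0] - 1), 2 * (x[1] + 2)])

def steepest_descent(alpha, n):
    rng = np.random.default_rng(0)
    x = rng.standard_normal(2)          # initialize x to a random value
    for _ in range(n):
        x = x - alpha * grad_L(x)       # step in the direction u = -grad L(x)
    return x

x_min = steepest_descent(alpha=0.1, n=200)
```

With α = 0.1 each coordinate's error contracts by a factor of 0.8 per iteration, so after 200 steps x_min is numerically indistinguishable from [1, −2].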