parallel/distributed SGD and touch on second order methods for completeness. Most of the examples presented in the accompanying code for this chapter are based on a Python package called downhill. Downhill implements SGD with many of its variations and is an excellent choice for experimenting.

Optimization Problems

Simply put, an optimization problem involves finding the parameters that minimize a mathematical function. For example, given the function $f(x) = x^2$, finding the value of $x$ that minimizes the function is an optimization problem (refer to Figure 8-1).

Figure 8-1. An optimization problem involves finding the parameters that minimize a given function
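To make the idea concrete, here is a minimal sketch (independent of the downhill package) that finds the minimizer of $f(x) = x^2$ by repeatedly stepping against the derivative; the function name and step-size value are illustrative choices, not part of the book's code.

```python
# Minimal gradient descent on f(x) = x^2, whose derivative is f'(x) = 2x.
# Each iteration moves x a small step opposite to the derivative.

def minimize_square(x0, lr=0.1, steps=100):
    x = x0
    for _ in range(steps):
        grad = 2 * x       # derivative of x^2 at the current x
        x = x - lr * grad  # step against the derivative
    return x

print(minimize_square(5.0))  # converges toward the minimizer x = 0
```

With `lr=0.1`, each update multiplies `x` by 0.8, so the iterate shrinks geometrically toward 0, the true minimizer of $x^2$.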
CHAPTER 8 ■ STOCHASTIC GRADIENT DESCENT

While the functions we want to minimize while building deep learning models are far more complicated (involving multiple parameters, which may be scalars, vectors, or matrices), conceptually it is simply a matter of finding the parameters that minimize the function.

The function one wants to optimize while building a deep learning model is referred to as the loss function. The loss function may have a number of scalar/vector/matrix-valued parameters but always has a scalar output. This scalar output represents the goodness of the model. Goodness typically means a combination of how well the model predicts and how simple the model is.

■ Note For now, we will stay away from the statistical/machine learning aspects of a loss function (covered elsewhere in the book) and focus purely on solving such optimization problems. That is, we assume that we have been presented with a loss function $L(x)$, where $x$ represents the parameters of the model, and the job at hand is to find the values of $x$ that minimize $L(x)$.

Method of Steepest Descent

Let us now look at a simple mathematical idea, which is the intuition behind SGD. For the sake of simplicity, let us assume that $x$ is just one vector. Given that we want to minimize $L(x)$, we want to change or update $x$ such that $L(x)$ decreases. Let $u$ represent the unit vector or direction in which $x$ should ideally be changed, and let $\alpha$ denote the magnitude (a scalar) of this step. We want $\alpha$ to be small, because $u$ is evaluated at the current value of $x$ and will be different for a different $x$; a large step in the direction $u$ is therefore not desired.

Thus, we want to find a $u$ that minimizes

$$\lim_{\alpha \to 0^+} L(x + \alpha u).$$

A first-order expansion gives

$$\lim_{\alpha \to 0^+} L(x + \alpha u) = L(x) + \alpha\, u^T \nabla_x L(x).$$

Thus, we basically want to find a $u$ that minimizes $u^T \nabla_x L(x)$. Note that $\nabla_x L(x)$ is the gradient of $L(x)$. Given that both $u$ and $\nabla_x L(x)$ are vectors, it follows that

$$u^T \nabla_x L(x) = \|u\| \cdot \|\nabla_x L(x)\| \cdot \cos\theta = \|\nabla_x L(x)\| \cdot \cos\theta,$$

since $\|u\| = 1$, where $\theta$ is the angle between the two vectors (refer to Figure 8-2).
The value of $\cos\theta$ is minimized at $\theta = \pi$, that is, when the two vectors point in opposite directions. Thus, it follows that setting the direction $u$ along $-\nabla_x L(x)$ achieves our desired objective. This leads to a simple iterative algorithm, as follows:

Input: α, n
Initialize x to a random value.