parallel/distributed SGD and touch on Second Order Methods for completeness.
Most of the examples presented in the accompanying code for this chapter are based on a Python
package called downhill. Downhill implements SGD with many of its variations and is an excellent choice
for experimenting.
Optimization Problems
Simply put, an optimization problem involves finding the parameters which minimize a mathematical function.
For example, given the function f(x) = x², finding the value of x which minimizes the function is an optimization problem (refer to Figure 8-1).
Figure 8-1.
An optimization problem involves finding the parameters that minimize a given function
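To make this concrete, here is a minimal sketch (an illustration of the idea, not part of the chapter's downhill-based code) that minimizes f(x) = x² by repeatedly stepping against its derivative f′(x) = 2x:

```python
def minimize_square(x0, alpha=0.1, steps=100):
    """Minimize f(x) = x**2 by gradient descent from the starting point x0."""
    x = x0
    for _ in range(steps):
        grad = 2 * x          # derivative of x**2 at the current x
        x = x - alpha * grad  # step against the derivative
    return x

print(minimize_square(5.0))  # approaches 0.0, the minimizer of x**2
```

Each update multiplies x by (1 − 2α), so for a small step size the iterate shrinks toward 0, the minimizer.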

CHAPTER 8 ■ STOCHASTIC GRADIENT DESCENT
While the functions we want to minimize when building deep learning models are far more complicated (involving multiple parameters, which may be scalars, vectors, or matrices), the task is conceptually the same: finding the parameters that minimize the function.
The function one wants to optimize while building a deep learning model is referred to as the loss function. The loss function may have a number of scalar-, vector-, or matrix-valued parameters, but it always has a scalar output. This scalar output represents the goodness of the model. Goodness typically means a combination of how well the model predicts and how simple the model is.
■ Note For now, we will stay away from the statistical/machine learning aspects of a loss function (covered elsewhere in the book) and focus purely on solving such optimization problems. That is, we assume that we have been presented with a loss function L(x), where x represents the parameters of the model, and the job at hand is to find the values for x which minimize L(x).
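As an illustration of such a loss function, the following sketch has a vector-valued parameter x but always returns a scalar; the linear model, data, and penalty weight here are invented for the example and are not from the book:

```python
# Illustrative only: L(x) combines how well a linear model predicts
# (mean squared error) with how simple the model is (an L2 penalty).
def loss(x, rows, targets, weight_decay=0.01):
    preds = [sum(w * f for w, f in zip(x, row)) for row in rows]
    fit = sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)
    simplicity = weight_decay * sum(w * w for w in x)
    return fit + simplicity  # a single scalar, whatever the shape of x

rows = [[1.0, 2.0], [3.0, 4.0]]  # made-up feature rows
targets = [1.0, 2.0]             # made-up targets
print(loss([0.0, 0.0], rows, targets))  # prints 2.5: pure fit error, no penalty
```

Optimizing would then mean searching for the vector x that makes this scalar as small as possible.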
Method of Steepest Descent
Let us now look at a simple mathematical idea, which is the intuition behind SGD. For the sake of simplicity, let us assume that x is just one vector. Given that we want to minimize L(x), we want to change or update x such that L(x) decreases. Let u represent the unit vector or direction in which x should ideally be changed, and let α denote the magnitude (a scalar) of this step. A higher value of α implies a larger step in the direction u, which is not desired, because u is evaluated for the current value of x and will be different for a different x.
Thus, we want to find a u such that L(x + αu) decreases as quickly as possible for a small step α, that is, a u that minimizes the directional derivative lim_{α→0⁺} ∂/∂α L(x + αu). It follows that

lim_{α→0⁺} ∂/∂α L(x + αu) = u^T ∇_x L(x)
Thus, we basically want to find a u such that u^T ∇_x L(x) is minimized. Note that ∇_x L(x) is the gradient of L(x).
Given that both u^T and ∇_x L(x) are vectors, it follows that

u^T ∇_x L(x) = ||u|| ⋅ ||∇_x L(x)|| cos θ

where θ is the angle between the two vectors (refer to Figure 8-2).
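This cos θ relationship can be checked numerically. The sketch below uses the illustrative function L(x, y) = x² + y² (chosen for this example, not from the book), whose gradient at the point (3, 4) is (6, 8), and evaluates the slope u^T ∇_x L(x) for unit vectors u at several angles θ to the gradient:

```python
import math

grad = (6.0, 8.0)                    # gradient of x**2 + y**2 at (3, 4)
norm = math.hypot(grad[0], grad[1])  # ||grad|| = 10.0

def directional_slope(theta):
    """Slope u . grad for a unit vector u at angle theta to the gradient."""
    base = math.atan2(grad[1], grad[0])  # direction of the gradient itself
    u = (math.cos(base + theta), math.sin(base + theta))
    return u[0] * grad[0] + u[1] * grad[1]

# The slope equals ||grad|| * cos(theta): largest along the gradient,
# zero perpendicular to it, and most negative directly against it.
for theta in (0.0, math.pi / 2, math.pi):
    print(theta, directional_slope(theta))
```

At θ = π the slope reaches its minimum value of −||∇_x L(x)||, which is the motivation for the descent direction derived next.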

The value of cos θ is minimized at θ = π, that is to say, when the two vectors point in opposite directions. Thus, it follows that setting the direction u = −∇_x L(x) achieves our desired objective. This leads to a simple iterative algorithm as follows:
Input: α, n
Initialize x to a random value.