to the positive side, the right side points toward the local minimum. From calculus we can compute dJ(w)/dw, which represents the slope of the function and tells us which direction to go. In logistic regression the cost function is J(w, b), and we update

w := w − α dJ(w, b)/dw
b := b − α dJ(w, b)/db

Sometimes we also write ∂J(w, b)/∂w for the derivative above, which denotes a "partial derivative" in calculus; both refer to the same notion. Lowercase "d" is used when the function has a single variable, while the symbol ∂ is used when it has several. In code, we simply write "dw" and "db".

1.3.5 Derivatives

In this small section let us dive into a few points from calculus. Consider a function f(a) = 3a. If a = 2, then f(a) = 6. If a = 2.001, then f(a) = 6.003, so f changes by 0.003 as a changes by 0.001. We therefore say the slope (or derivative) of f(a) at a = 2 is 3. To see why this is true, take a look at another point a = 5, where f(a) = 15. If we move a by 0.001, that is, set a = 5.001, the value of the function becomes f(a) = 15.003, the same increment as when we moved a from 2 to 2.001. We conclude
Notes in Deep Learning [Notes by Yiqiao Yin] [Instructor: Andrew Ng] §1

d/da f(a) = 3

for the derivative of the function f(a).

Next, we consider f(a) = a^2. Starting from a = 2, the value of the function is f(a) = 4. If we move a to a = 2.001, we have f(a) ≈ 4.004. Then the slope (or derivative) of f(a) at a = 2 is 4. That is,

d/da f(a) = 4 at a = 2.

However, since f(a) is a higher power this time, it does not have the same slope everywhere. Consider a = 5, with value f(a) = 25. We have f(a) ≈ 25.010 when a = 5.001. Then

d/da f(a) = 10 when a = 5,

which is larger than before. Moreover, if we have f(a) = log_e(a) = ln(a), then d/da f(a) = 1/a.
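These slopes can be checked numerically by nudging a by 0.001, exactly as in the examples above; a minimal sketch (the helper name `numerical_slope` is ours):

```python
import math

def numerical_slope(f, a, h=0.001):
    """Approximate the slope of f at a by nudging a by a small amount h."""
    return (f(a + h) - f(a)) / h

# f(a) = 3a has constant slope 3
print(numerical_slope(lambda a: 3 * a, 2))   # ~3.0
# f(a) = a^2 has slope 2a: about 4 at a = 2, about 10 at a = 5
print(numerical_slope(lambda a: a ** 2, 2))  # ~4.001
print(numerical_slope(lambda a: a ** 2, 5))  # ~10.001
# f(a) = ln(a) has slope 1/a, so about 0.5 at a = 2
print(numerical_slope(math.log, 2))          # ~0.5
```

Shrinking h makes the estimate approach the true derivative, which is how the limit definition of the derivative works.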
1.3.6 Computation Graph

Let us use a simple example as a start. Consider a function

J(a, b, c) = 3(a + bc),

and let us write u = bc, v = a + u, and J = 3v. If we let a = 5, b = 3, and c = 2, we will have u = bc = 6, v = a + u = 11, and finally J = 3v = 33. For the derivatives, we would have to go backward.
Let us say we want the derivative of J with respect to v, i.e. dJ/dv = ?. Consider J = 3v with v = 11. Then we can compute dJ/dv = 3. By the same procedure we can compute dJ/da: since dv/da = 1, the chain rule gives

dJ/da = (dJ/dv)(dv/da) = 3 × 1 = 3.
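The forward and backward passes through this small graph can be written out directly; a minimal sketch (variable names mirror the text, and the derivatives with respect to b and c are filled in by the same chain rule):

```python
# Forward pass: J(a, b, c) = 3 * (a + b*c)
a, b, c = 5, 3, 2
u = b * c          # u = 6
v = a + u          # v = 11
J = 3 * v          # J = 33

# Backward pass: derivatives flow from right to left
dJ_dv = 3                  # from J = 3v
dv_da = 1                  # from v = a + u
dJ_da = dJ_dv * dv_da      # chain rule: 3 * 1 = 3
dv_du = 1                  # from v = a + u
dJ_du = dJ_dv * dv_du      # 3
du_db = c                  # from u = b*c
dJ_db = dJ_du * du_db      # 3 * 2 = 6
du_dc = b                  # from u = b*c
dJ_dc = dJ_du * du_dc      # 3 * 3 = 9

print(J, dJ_da, dJ_db, dJ_dc)   # 33 3 6 9
```

Notice that every backward step reuses dJ_dv or dJ_du already computed downstream, which is why going backward through the graph is efficient.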
It is the same procedure to implement gradient descent for logistic regression. Let us say we have the following setup:

z = w^T x + b
ŷ = a = σ(z)
L(a, y) = −(y log(a) + (1 − y) log(1 − a))

Now we are trying to adjust z so as to change a and thus reduce the loss L. In code, we would have

da = dL(a, y)/da = −y/a + (1 − y)/(1 − a)

and also

dz = dL/dz = dL(a, y)/dz = a − y,

which follows from the chain rule: dz = (dL/da) · (da/dz), where da/dz = a(1 − a) and dL/da = −y/a + (1 − y)/(1 − a).
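For a single training example these quantities are easy to compute directly; a minimal sketch (the values of w, b, x, and y are made up for illustration):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# One training example with two features (illustrative numbers)
w = [0.5, -0.3]
b = 0.1
x = [1.0, 2.0]
y = 1

z = w[0] * x[0] + w[1] * x[1] + b   # z = w^T x + b; here 0.5 - 0.6 + 0.1 ≈ 0
a = sigmoid(z)                       # a = sigma(z) ≈ 0.5
loss = -(y * math.log(a) + (1 - y) * math.log(1 - a))

da = -y / a + (1 - y) / (1 - a)     # dL/da
dz = a - y                           # dL/dz

# sanity check: dz equals (dL/da) * (da/dz), with da/dz = a * (1 - a)
assert abs(dz - da * a * (1 - a)) < 1e-12
```

The final assertion verifies the chain-rule identity from the text numerically.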
Now let us consider m examples, for which

J(w, b) = (1/m) Σ_{i=1}^m L(a^(i), y^(i))
a^(i) = ŷ^(i) = σ(z^(i)) = σ(w^T x^(i) + b),

with one training example being (x^(i), y^(i)). Then the derivative with respect to w_1 would be

∂J(w, b)/∂w_1 = (1/m) Σ_{i=1}^m ∂L(a^(i), y^(i))/∂w_1
Algorithm 1.3.2. Initialize J = 0, dw_1 = 0, dw_2 = 0, db = 0; then

for i = 1 to m:
    z^(i) = w^T x^(i) + b,  a^(i) = σ(z^(i))
    J += −[y^(i) log a^(i) + (1 − y^(i)) log(1 − a^(i))]
    dz^(i) = a^(i) − y^(i)
    dw_1 += x_1^(i) dz^(i),  dw_2 += x_2^(i) dz^(i),  db += dz^(i)

Then J /= m; dw_1 /= m; dw_2 /= m; db /= m. Then dw_1 = ∂J/∂w_1
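Algorithm 1.3.2 translates almost line for line into Python; a sketch for the two-feature case (the function name `gradients` is ours):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def gradients(w1, w2, b, xs, ys):
    """One pass of Algorithm 1.3.2 over m examples.

    xs is a list of (x1, x2) feature pairs; ys a list of labels in {0, 1}.
    Returns the averaged cost J and gradients dw1, dw2, db.
    """
    m = len(xs)
    J, dw1, dw2, db = 0.0, 0.0, 0.0, 0.0
    for (x1, x2), y in zip(xs, ys):
        z = w1 * x1 + w2 * x2 + b        # z^(i) = w^T x^(i) + b
        a = sigmoid(z)                    # a^(i) = sigma(z^(i))
        J += -(y * math.log(a) + (1 - y) * math.log(1 - a))
        dz = a - y                        # dz^(i) = a^(i) - y^(i)
        dw1 += x1 * dz
        dw2 += x2 * dz
        db += dz
    # Average the accumulated sums over the m examples
    return J / m, dw1 / m, dw2 / m, db / m
```

A gradient descent step then updates w_1 := w_1 − α dw_1, and similarly for w_2 and b. In practice this explicit loop is replaced by vectorized operations over all m examples at once.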