
# To the positive side the right side point at the



Notes in Deep Learning [Notes by Yiqiao Yin] [Instructor: Andrew Ng] §1

to the positive side, the right side point at the local minimum. From calculus, we can compute $\frac{dJ(\omega)}{d\omega}$, which represents the slope of the function and tells us which direction to go. In logistic regression, the cost function is $J(\omega, b)$ and we want to update

$$\omega := \omega - \alpha \frac{dJ(\omega, b)}{d\omega}, \qquad b := b - \alpha \frac{dJ(\omega, b)}{db}$$

Sometimes we also write $\frac{\partial J(\omega, b)}{\partial \omega}$ for the derivatives above, which is the notation for "partial derivatives" in calculus. Both refer to the same notion. The lowercase "d" is used when the function has a single variable, and the partial derivative symbol $\partial$ when it has several. In coding, we can simply write "dw" and "db".

## 1.3.5 Derivatives

This small section dives into a few points in calculus. Consider a function $f(a) = 3a$. If $a = 2$, then $f(a) = 6$. If $a = 2.001$, then $f(a) = 6.003$; that is, $f$ changes by 0.003 as $a$ changes by 0.001. We then say the slope (or derivative) of $f(a)$ at $a = 2$ is 3. To see that this holds at other points too, take $a = 5$, where $f(a) = 15$. If we move $a$ by 0.001, i.e. set $a = 5.001$, the value of the function is $f(a) = 15.003$, which is the same increment as before, when we moved $a$ from 2 to 2.001. We conclude

$$\frac{df(a)}{da} = 3 = \frac{d}{da} f(a)$$

for the derivative of the function $f(a)$.

Next, consider $f(a) = a^2$. Starting from $a = 2$, the value of the function is $f(a) = 4$. If we move $a$ to $a = 2.001$, we have $f(a) \approx 4.004$. Then the slope (or derivative) of $f(a)$ at $a = 2$ is 4. That is,

$$\frac{d}{da} f(a) = 4 \quad \text{at } a = 2.$$

However, since $f(a)$ is of higher power this time, the slope is not the same everywhere. Consider $a = 5$, where $f(a) = 25$. We have $f(a) \approx 25.010$ when $a = 5.001$, so $\frac{d}{da} f(a) = 10$ at $a = 5$, which is larger than before. Moreover, if $f(a) = \log_e(a) = \ln(a)$, then $\frac{d}{da} f(a) = \frac{1}{a}$.

## 1.3.6 Computation Graph

Let us use a simple example as a start.
Consider a function $J(a, b, c) = 3(a + bc)$, and let us write $u = bc$, $v = a + u$, and $J = 3v$. If we let $a = 5$, $b = 3$, and $c = 2$, we have $u = bc = 6$, $v = a + u = 11$, and finally $J = 3v = 33$. For the derivatives, we have to go backward. Let us say we want the derivative of $J$ with respect to $v$, i.e. $\frac{dJ}{dv} = ?$. Consider $J = 3v$ with $v = 11$. Then we can compute $\frac{dJ}{dv} = 3$. By the same procedure, we can compute $\frac{dJ}{da} = 3$: since $\frac{dv}{da} = 1$, the chain rule gives

$$\frac{dJ}{da} = \frac{dJ}{dv} \cdot \frac{dv}{da} = 3 \times 1 = 3.$$

Implementing Gradient Descent for Logistic Regression works the same way. Let us say we have the following setup:

$$z = w^T x + b$$
$$\hat{y} = a = \sigma(z)$$
$$\mathcal{L}(a, y) = -\big(y \log(a) + (1 - y) \log(1 - a)\big)$$

We compute $z$ to get $a$, and then work backward from the loss $\mathcal{L}$ in order to reduce it. In coding, we would have

$$da = \frac{d\mathcal{L}(a, y)}{da} = -\frac{y}{a} + \frac{1 - y}{1 - a}$$

and also

$$dz = \frac{d\mathcal{L}}{dz} = \frac{d\mathcal{L}(a, y)}{dz} = a - y = \underbrace{\frac{d\mathcal{L}}{da}}_{-\frac{y}{a} + \frac{1 - y}{1 - a}} \cdot \underbrace{\frac{da}{dz}}_{a(1 - a)}$$

Now let us consider $m$ examples, for which

$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(a^{(i)}, y^{(i)})$$
$$a^{(i)} = \hat{y}^{(i)} = \sigma(z^{(i)}) = \sigma(w^T x^{(i)} + b)$$

with one training example being $(x^{(i)}, y^{(i)})$. Then the derivative w.r.t. $w_1$ would be

$$\frac{\partial}{\partial w_1} J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial w_1} \mathcal{L}(a^{(i)}, y^{(i)})$$

Algorithm 1.3.2. Let us initialize $J = 0$, $dw_1 = 0$, $dw_2 = 0$, $db = 0$. Then for $i = 1$ to $m$:

$$z^{(i)} = w^T x^{(i)} + b, \qquad a^{(i)} = \sigma(z^{(i)})$$
$$J \mathrel{+}= -\big[y^{(i)} \log a^{(i)} + (1 - y^{(i)}) \log(1 - a^{(i)})\big]$$
$$dz^{(i)} = a^{(i)} - y^{(i)}$$
$$dw_1 \mathrel{+}= x_1^{(i)} dz^{(i)}, \qquad dw_2 \mathrel{+}= x_2^{(i)} dz^{(i)}, \qquad db \mathrel{+}= dz^{(i)}$$

After the loop, $J \mathrel{/}= m$; $dw_1 \mathrel{/}= m$; $dw_2 \mathrel{/}= m$; $db \mathrel{/}= m$.

Then $dw_1 = \frac{\partial J}{\partial w_1}$ …
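The accumulation loop of Algorithm 1.3.2 can be sketched in plain Python for the two-feature case. This is only an illustrative sketch, not the notes' own code: the function names `sigmoid`, `cost_and_gradients`, and `gradient_step`, the toy data, the learning rate `alpha=0.1`, and the 100 iterations are all assumptions made for the example.

```python
import math


def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def cost_and_gradients(w1, w2, b, examples):
    """Accumulate J, dw1, dw2, db over all examples, then average.

    `examples` is a list of (x1, x2, y) tuples with y in {0, 1}.
    """
    m = len(examples)
    J = dw1 = dw2 = db = 0.0
    for x1, x2, y in examples:
        z = w1 * x1 + w2 * x2 + b          # z = w^T x + b
        a = sigmoid(z)                      # a = sigma(z)
        J += -(y * math.log(a) + (1 - y) * math.log(1 - a))
        dz = a - y                          # dL/dz
        dw1 += x1 * dz
        dw2 += x2 * dz
        db += dz
    return J / m, dw1 / m, dw2 / m, db / m


def gradient_step(w1, w2, b, examples, alpha):
    """One update w := w - alpha * dw, b := b - alpha * db."""
    _, dw1, dw2, db = cost_and_gradients(w1, w2, b, examples)
    return w1 - alpha * dw1, w2 - alpha * dw2, b - alpha * db


# Hypothetical toy data, made up purely for illustration.
examples = [(0.5, 1.5, 1), (1.0, -1.0, 0), (-1.5, 2.0, 1), (2.0, 0.5, 0)]

w1, w2, b = 0.0, 0.0, 0.0
cost_before, *_ = cost_and_gradients(w1, w2, b, examples)
for _ in range(100):
    w1, w2, b = gradient_step(w1, w2, b, examples, alpha=0.1)
cost_after, *_ = cost_and_gradients(w1, w2, b, examples)
print(cost_before, cost_after)
```

With all parameters at zero, every prediction is $a = 0.5$, so the initial cost is $\ln 2$; repeated steps should lower it. The explicit loop mirrors the algorithm as stated, and a vectorized version would replace it in practice.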
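For the computation graph example $J(a, b, c) = 3(a + bc)$, the forward and backward passes can be written out step by step. The sketch below is an illustration under assumptions: the function names `forward` and `backward` are hypothetical, and the derivatives with respect to $b$ and $c$ go one step beyond what the text computes, using the same chain rule.

```python
def forward(a, b, c):
    """Forward pass through the graph: u = bc, v = a + u, J = 3v."""
    u = b * c
    v = a + u
    J = 3 * v
    return u, v, J


def backward(a, b, c):
    """Backward pass: apply the chain rule from J back to the inputs."""
    dJ_dv = 3.0            # J = 3v
    dv_da = 1.0            # v = a + u
    dv_du = 1.0
    du_db = float(c)       # u = bc
    du_dc = float(b)
    dJ_da = dJ_dv * dv_da
    dJ_du = dJ_dv * dv_du
    dJ_db = dJ_du * du_db
    dJ_dc = dJ_du * du_dc
    return {"dv": dJ_dv, "da": dJ_da, "du": dJ_du, "db": dJ_db, "dc": dJ_dc}


u, v, J = forward(5, 3, 2)
print(u, v, J)             # 6 11 33
grads = backward(5, 3, 2)
print(grads["da"])         # 3.0

# Numerical check in the spirit of Section 1.3.5: nudge a by 0.001.
slope_a = (forward(5.001, 3, 2)[2] - J) / 0.001
print(slope_a)             # close to 3
```

The numerical slope obtained by nudging $a$ agrees with the chain-rule value $\frac{dJ}{da} = 3$ up to floating-point error.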