CS229 Lecture notes

Andrew Ng

# Part IX: The EM algorithm

In the previous set of notes, we talked about the EM algorithm as applied to fitting a mixture of Gaussians. In this set of notes, we give a broader view of the EM algorithm, and show how it can be applied to a large family of estimation problems with latent variables. We begin our discussion with a very useful result called Jensen's inequality.

## 1 Jensen's inequality

Let $f$ be a function whose domain is the set of real numbers. Recall that $f$ is a convex function if $f''(x) \geq 0$ (for all $x \in \mathbb{R}$). In the case of $f$ taking vector-valued inputs, this is generalized to the condition that its Hessian $H$ is positive semi-definite ($H \geq 0$). If $f''(x) > 0$ for all $x$, then we say $f$ is strictly convex (in the vector-valued case, the corresponding statement is that $H$ must be positive definite, written $H > 0$). Jensen's inequality can then be stated as follows:

**Theorem.** Let $f$ be a convex function, and let $X$ be a random variable. Then:

$$\mathrm{E}[f(X)] \geq f(\mathrm{E}X).$$

Moreover, if $f$ is strictly convex, then $\mathrm{E}[f(X)] = f(\mathrm{E}X)$ holds true if and only if $X = \mathrm{E}[X]$ with probability 1 (i.e., if $X$ is a constant).

Recall our convention of occasionally dropping the parentheses when writing expectations, so in the theorem above, $f(\mathrm{E}X) = f(\mathrm{E}[X])$.

For an interpretation of the theorem, consider the figure below.

*[Figure: a convex function $f$ (solid line), with $a$, $\mathrm{E}[X]$, and $b$ marked on the x-axis, and $f(a)$, $f(\mathrm{E}X)$, $\mathrm{E}[f(X)]$, and $f(b)$ marked on the y-axis.]*

Here, $f$ is a convex function shown by the solid line. Also, $X$ is a random variable that has a 0.5 chance of taking the value $a$, and a 0.5 chance of taking the value $b$ (indicated on the x-axis). Thus, the expected value of $X$ is given by the midpoint between $a$ and $b$. We also see the values $f(a)$, $f(b)$, and $f(\mathrm{E}[X])$ indicated on the y-axis. Moreover, the value $\mathrm{E}[f(X)]$ is now the midpoint on the y-axis between $f(a)$ and $f(b)$. From our example, we see that because $f$ is convex, it must be the case that $\mathrm{E}[f(X)] \geq f(\mathrm{E}X)$.
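As a quick numeric sanity check of the theorem (a sketch of mine, not part of the notes), one can evaluate both sides of the inequality for a simple convex function and the two-point distribution from the figure. Here $f(x) = e^x$, and $a$, $b$ are arbitrary made-up values:

```python
import math

# Convex function f(x) = exp(x), and X taking the value a with
# probability 0.5 and b with probability 0.5 (values are illustrative).
f = math.exp
a, b = 0.0, 2.0

E_X  = 0.5 * a + 0.5 * b          # E[X]: midpoint between a and b
E_fX = 0.5 * f(a) + 0.5 * f(b)    # E[f(X)]: midpoint between f(a) and f(b)
f_EX = f(E_X)                     # f(E[X])

# Jensen's inequality for convex f: E[f(X)] >= f(E[X]),
# with equality only when X is constant.
assert E_fX >= f_EX
print(E_fX, f_EX)
```

Since $f$ here is strictly convex and $X$ is not constant, the inequality is strict, matching the equality condition in the theorem.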
Incidentally, quite a lot of people have trouble remembering which way the inequality goes, and remembering a picture like this is a good way to quickly figure out the answer.

**Remark.** Recall that $f$ is [strictly] concave if and only if $-f$ is [strictly] convex (i.e., $f''(x) \leq 0$ or $H \leq 0$). Jensen's inequality also holds for concave functions $f$, but with the direction of all the inequalities reversed ($\mathrm{E}[f(X)] \leq f(\mathrm{E}X)$, etc.).

## 2 The EM algorithm

Suppose we have an estimation problem in which we have a training set $\{x^{(1)}, \ldots, x^{(m)}\}$ consisting of $m$ independent examples. We wish to fit the parameters of a model $p(x, z)$ to the data, where the log-likelihood is given by

$$\ell(\theta) = \sum_{i=1}^m \log p(x^{(i)}; \theta) = \sum_{i=1}^m \log \sum_z p(x^{(i)}, z; \theta).$$

But, explicitly finding the maximum likelihood estimates of the parameters...
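To make the log-of-a-sum structure of $\ell(\theta)$ concrete, here is a small sketch of mine (not from the notes) that evaluates it for a hypothetical 1-D mixture of two Gaussians, with $z \in \{0, 1\}$ and $p(x, z; \theta) = p(z)\,p(x \mid z)$; all parameter values are made up for illustration:

```python
import math

# Hypothetical parameters theta for a 1-D two-component Gaussian mixture:
# mixing proportions phi (the prior p(z)), component means mu, and
# component standard deviations sigma. Purely illustrative values.
phi   = [0.3, 0.7]
mu    = [-1.0, 2.0]
sigma = [1.0, 0.5]

def gaussian_pdf(x, m, s):
    """Density of N(m, s^2) at x."""
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def log_likelihood(xs):
    """l(theta) = sum_i log sum_z p(z) * p(x^(i) | z; theta)."""
    total = 0.0
    for x in xs:
        total += math.log(sum(phi[z] * gaussian_pdf(x, mu[z], sigma[z])
                              for z in range(2)))
    return total

xs = [-1.2, 0.3, 1.9, 2.1]   # a tiny made-up training set
print(log_likelihood(xs))
```

Note that the sum over $z$ sits *inside* the logarithm, so setting the gradient of $\ell(\theta)$ to zero does not yield closed-form solutions for the parameters; this is exactly the difficulty that EM is designed to address.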
