\[
D(q(\cdot)\,\|\,p(\cdot)) = \sum_{y} q(y) \log \frac{q(y)}{p(y)}.
\]
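As an illustration of this definition, here is a minimal sketch that computes the KL divergence numerically for two small, hypothetical discrete distributions (the distributions `q` and `p` below are made-up examples, not from the notes):

```python
import numpy as np

# Hypothetical example distributions over a 3-letter alphabet.
q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])

def kl_divergence(q, p):
    """D(q || p) = sum_y q(y) * log(q(y) / p(y)).

    Defined only when p(y) > 0 for every y with q(y) > 0.
    Terms with q(y) = 0 contribute 0, by the convention 0 log 0 = 0.
    """
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

print(kl_divergence(q, p))  # D(q || p)
print(kl_divergence(p, q))  # D(p || q): in general a different number
```

Running both directions on the same pair of distributions shows the asymmetry discussed below: the two values generally differ.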
The KL divergence is defined only if p(y) > 0 for every y such that q(y) > 0. It is sometimes referred to as the KL distance; however, it is not a metric in the mathematical sense because in general it is asymmetric: D(q(·)‖p(·)) ≠ D(p(·)‖q(·)). It is the average of the logarithmic difference between the probability distributions q(y) and p(y), where the average is taken with respect to q(y). The following property of the KL divergence will be very important for us.

Theorem 2 (Nonnegativity of KL Divergence). D(q(·)‖p(·)) ≥ 0, with equality if and only if q(y) = p(y) for all y.

Proof. We will rely on Jensen's inequality, which states that for any convex function f and random variable X,
\[
\mathbb{E}[f(X)] \ge f(\mathbb{E}[X]).
\]
When f is strictly convex, Jensen's inequality holds with equality if and only if X is constant, so that \mathbb{E}[X] = X and \mathbb{E}[f(X)] = f(X). Take y ∼ q(y) and define the random variable X = p(y)/q(y). Let f(X) = −log(X), a strictly convex function. Now we can...
This note was uploaded on 03/24/2014 for the course MIT 15.097 taught by Professor Cynthiarudin during the Spring '12 term at MIT.