The entropy integral can, however, be approximated as a sample mean:

    h(X) \approx \hat{h}_b(X) \equiv -E_b\!\left[ \log \hat{p}(X) \right] ,    (3.20)

where E_b is the sample mean taken over the sample b, \hat{p} is the estimate for the sample density, and \hat{h}_b(X) is the sample entropy first introduced in Section 2.3.1. The sample mean converges toward the true mean at a rate proportional to 1/\sqrt{N_b}, where N_b is the size of b. Based on this insight, two samples can be used to estimate the entropy of a distribution: the first is used to estimate the density, the second is used to estimate the entropy.
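As an illustration, here is a minimal sketch of the two-sample procedure in Python; it assumes, purely for concreteness, a maximum-likelihood Gaussian as the density estimate, and the function and variable names are illustrative rather than from the text:

    import numpy as np

    def two_sample_entropy(sample_a, sample_b):
        # Fit the density estimate p-hat on sample a (here: an ML Gaussian fit).
        mu, var = sample_a.mean(), sample_a.var()
        # Estimate h(X) = -E[log p(X)] as a sample mean over sample b (Equation 3.20).
        log_p = -0.5 * np.log(2 * np.pi * var) - (sample_b - mu) ** 2 / (2 * var)
        return -log_p.mean()

    rng = np.random.default_rng(0)
    a = rng.normal(0.0, 2.0, size=1000)   # sample for the density estimate
    b = rng.normal(0.0, 2.0, size=1000)   # sample for the entropy mean
    print(two_sample_entropy(a, b))       # about 0.5*log(2*pi*e*4), i.e. 2.11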
While the two-sample approach can be used to estimate entropy, it is not a practical algorithm for entropy manipulation. In the two applications above, changes in the parameter vector affect the densities that are being approximated. As the search through parameter space adjusts the parameter vector, a new sample must be drawn, a new density estimated, and the derivative of entropy evaluated. If estimating the density is itself a complex search process, the search for the correct parameter vector can take an unbearably long time.

3.2 Estimating Entropy with Parzen Densities
In this section we will describe a technique that can effectively estimate and manipulate the entropy of non-Gaussian distributions. The basic insight is that rather than use maximum likelihood to estimate the density of a sample, we will instead use Parzen window density estimation (see Section 2.4.3). The Parzen scheme for estimating densities has two significant advantages over maximum likelihood: (1) since the Parzen estimate is computed directly from the sample, there is no search for parameters; (2) the derivative of the entropy of the Parzen estimate is simple to compute.
In the following general derivation we will assume that we have samples of a random variable X, and that we would like to manipulate the entropy of the random variable Y = F(X, v). The entropy h(Y) is now a function of v and can be manipulated by changing v. Since there is no direct technique for finding the parameters that will extremize h(Y), we will search the parameter space using gradient descent. The following derivation assumes that Y is a vector random variable. The joint entropy of two random variables, h(Y_1, Y_2), can be evaluated by constructing the vector random variable W = [Y_1, Y_2]^T and evaluating h(W).
The form of the Parzen estimate constructed from a sample a is

    P^*(y, a) \equiv \frac{1}{N_a} \sum_{y_a \in a} g_\psi(y - y_a) ,    (3.21)

where the Parzen estimator is constructed with Gaussian smoothing functions g_\psi of covariance \psi. Given P^*(y, a) we can approximate entropy as the sample mean,

    h(Y) \approx h^*(Y) \equiv -E_b\!\left[ \log P^*(Y, a) \right]    (3.22)
                         = \frac{-1}{N_b} \sum_{y_b \in b} \log P^*(y_b, a) ,    (3.23)

computed over a second sample b; h^*(Y) is the EMMA estimate of empirical entropy.
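As a concrete sketch of Equations (3.21) through (3.23), the following computes the EMMA entropy estimate in the one-dimensional case; the function name emma_entropy and the kernel variance psi = 0.1 are illustrative assumptions:

    import numpy as np

    def emma_entropy(sample_a, sample_b, psi):
        # P*(y, a): Parzen estimate built from sample a with Gaussian kernels
        # of variance psi (Equation 3.21).
        diff = sample_b[:, None] - sample_a[None, :]   # (N_b, N_a) pairwise differences
        kernel = np.exp(-0.5 * diff**2 / psi) / np.sqrt(2 * np.pi * psi)
        p_star = kernel.mean(axis=1)                   # P*(y_b, a) for each y_b in b
        # h*(Y) = -E_b[log P*(Y, a)], the mean over the second sample b.
        return -np.log(p_star).mean()

    rng = np.random.default_rng(0)
    a, b = rng.normal(size=(2, 500))      # two independent samples of Y
    print(emma_entropy(a, b, psi=0.1))    # near 0.5*log(2*pi*e), i.e. 1.42, for N(0, 1)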
In order to extremize entropy we must calculate the derivative of entropy with respect to v. This may be expressed as

    \frac{d}{dv} h^*(Y) = \frac{-1}{N_b} \sum_{y_b \in b} \frac{\sum_{y_a \in a} \frac{d}{dv}\, g_\psi(y_b - y_a)}{\sum_{y_a \in a} g_\psi(y_b - y_a)} .    (3.24)
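The step from (3.24) to (3.25) below uses the derivative of the Gaussian smoothing function; this intermediate identity is standard and is spelled out here only for completeness (it is not written out in the text):

    g_\psi(y) = (2\pi)^{-n/2}\, |\psi|^{-1/2} \exp\!\left( -\tfrac{1}{2}\, y^T \psi^{-1} y \right)
    \qquad \Rightarrow \qquad
    \frac{d}{dv}\, g_\psi(y_b - y_a) = -\, g_\psi(y_b - y_a)\, (y_b - y_a)^T \psi^{-1}\, \frac{d}{dv}(y_b - y_a) .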
After differentiating the Gaussian, the minus sign above cancels the leading \frac{-1}{N_b} and we obtain

    \frac{d}{dv} h^*(Y) = \frac{1}{N_b} \sum_{y_b \in b} \frac{\sum_{y_a \in a} g_\psi(y_b - y_a)\, (y_b - y_a)^T \psi^{-1}\, \frac{d}{dv}(y_b - y_a)}{\sum_{y_a \in a} g_\psi(y_b - y_a)} .    (3.25)

This expression may be written more compactly as follows,
    \frac{d}{dv} h^*(Y) = \frac{1}{N_b} \sum_{y_b \in b} \sum_{y_a \in a} W_y(y_b, y_a)\, (y_b - y_a)^T \psi^{-1}\, \frac{d}{dv}(y_b - y_a) ,    (3.26)

using the following definition:

    W_y(y_b, y_a) \equiv \frac{g_\psi(y_b - y_a)}{\sum_{y_{a'} \in a} g_\psi(y_b - y_{a'})} .    (3.27)
ya 2a b 3.27 Wy yb; ya takes on values between zero and one. It will approach one if yb is signi cantly
closer to ya than any other element of a. It will be near zero if some other element of a is
signi cantly closer to yb. Distance is interpreted with respect to the squared Mahalonobis
distance see Duda and Hart, 1973 D y yT ,1y :
Thus, Wy yb; ya is an indicator of the degree of match between its arguments, in a soft"
sense. It is equivalent to using the softmax" function of neural networks Bridle, 1989 on
the negative of the Mahalonobis distance to indicate correspondence between yb and elements
of a.
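A minimal sketch of this weighting, written as a softmax over negative half squared Mahalanobis distances: the Gaussian normalization constant cancels in the ratio of Equation (3.27), which is why no constant appears below. The name soft_match_weights is illustrative:

    import numpy as np

    def soft_match_weights(y_b, sample_a, psi_inv):
        # W_y(y_b, y_a) for every y_a in sample_a (Equation 3.27), computed as a
        # softmax over -0.5 * D_psi(y_b - y_a), with D_psi(y) = y^T psi^{-1} y.
        diffs = y_b - sample_a                         # shape (N_a, dim)
        d_psi = np.einsum('ij,jk,ik->i', diffs, psi_inv, diffs)
        logits = -0.5 * d_psi
        logits -= logits.max()                         # stabilize the exponentials
        w = np.exp(logits)
        return w / w.sum()                             # weights lie in [0, 1] and sum to one

    rng = np.random.default_rng(1)
    sample_a = rng.normal(size=(5, 2))                 # five 2-D points
    psi_inv = np.linalg.inv(0.25 * np.eye(2))          # kernel covariance psi = 0.25 I
    print(soft_match_weights(sample_a[0], sample_a, psi_inv))  # largest weight on itself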
Equation (3.26) may also be expressed as

    \frac{d}{dv} h^*(Y) = \frac{1}{N_b} \sum_{y_b \in b} \sum_{y_a \in a} W_y(y_b, y_a)\, \frac{d}{dv}\, \frac{1}{2} D_\psi(y_b - y_a) .    (3.28)

In this form it is apparent that to reduce entropy, the parameters v should be adjusted such that there is a reduction in the average squared distance between points which W indicates are nearby.
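The pieces above can be assembled into a single gradient evaluation. The sketch below does so for the simple linear map Y = v X, so that d(y_b - y_a)/dv = x_b - x_a; this map, and the step size, are assumptions for illustration, not the thesis's setup:

    import numpy as np

    def emma_entropy_grad(v, x_a, x_b, psi):
        # d/dv h*(Y) from Equation (3.26) for the 1-D map Y = v * X.
        y_a, y_b = v * x_a, v * x_b
        diff = y_b[:, None] - y_a[None, :]             # pairwise y_b - y_a, shape (N_b, N_a)
        logits = -0.5 * diff**2 / psi
        logits -= logits.max(axis=1, keepdims=True)    # stable softmax per y_b
        w = np.exp(logits)
        w /= w.sum(axis=1, keepdims=True)              # W_y(y_b, y_a); each row sums to one
        dv_diff = x_b[:, None] - x_a[None, :]          # d/dv (y_b - y_a) = x_b - x_a
        return (w * diff / psi * dv_diff).sum(axis=1).mean()

    rng = np.random.default_rng(2)
    x_a, x_b = rng.normal(size=200), rng.normal(size=200)
    v, psi, rate = 1.0, 0.1, 0.05
    for _ in range(100):
        v -= rate * emma_entropy_grad(v, x_a, x_b, psi)  # descend to reduce h*(Y)
    print(v)  # |v| shrinks: squeezing Y = v*X together lowers its entropy

Ascending the same gradient would instead push the points of b apart from their soft matches in a, increasing entropy.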
Before moving on, it is worth re-emphasizing that for most density models \frac{d}{dv} h(Y) is very difficult to compute. The general derivation of the derivative of entropy is much more complex than the Parzen derivation:

    \frac{\partial h(Y)}{\partial v} \approx \frac{d}{dv}\, \frac{-1}{N_b} \sum_{y_b \in b} \log p(y_b, a)    (3.29)

    = \frac{-1}{N_b} \sum_{y_b \in b} \frac{\frac{d}{dv}\, p(y_b, a)}{p(y_b, a)}    (3.30)

    = \frac{-1}{N_b} \sum_{y_b \in b} \frac{\frac{d}{dy_b}\, p(y_b, a)\, \frac{dy_b}{dv} + \frac{d}{da}\, p(y_b, a)\, \frac{da}{dv}}{p(y_b, a)} .    (3.31)
3.31 d
The numerator of the derivative has two components. The rst, dyb pyb; a dyb , is the change
dv
da
d
in entropy that results from changes in the sample b. The second, da...