CHAPTER 3. EMPIRICAL ENTROPY MANIPULATION AND STOCHASTIC GRADIENT DESCENT    Paul A. Viola

The entropy integral can, however, be approximated as a sample mean:

h(X) \approx \hat{h}_b(X) \equiv E_b[-\log \hat{p}(X)] ,     (3.20)

where E_b is the sample mean taken over the sample b, \hat{p} is the estimate for the density, and \hat{h}_b(X) is the sample entropy first introduced in Section 2.3.1. The sample mean converges toward the true mean at a rate proportional to 1/\sqrt{N_b}, where N_b is the size of b. Based on this insight, two samples can be used to estimate the entropy of a distribution: the first is used to estimate the density, the second is used to estimate the entropy.

While the two-sample approach can be used to estimate entropy, it is not a practical algorithm for entropy manipulation. In the two applications above, changes in the parameter vector affect the densities that are being approximated. As the search through parameter space adjusts the parameter vector, a new sample must be drawn, a new density estimated, and the derivative of entropy evaluated. If estimating the density is itself a complex search process, the search for the correct parameter vector can take an unbearably long time.

3.2 Estimating Entropy with Parzen Densities

In this section we will describe a technique that can effectively estimate and manipulate the entropy of non-Gaussian distributions. The basic insight is that rather than use maximum likelihood to estimate the density of a sample, we will instead use Parzen window density estimation (see Section 2.4.3). The Parzen scheme for estimating densities has two significant advantages over maximum likelihood: (1) since the Parzen estimate is computed directly from the sample, there is no search for parameters; (2) the derivative of the entropy of the Parzen estimate is simple to compute. In the following general derivation we will assume that we have samples of a random variable X, and we would like to manipulate the entropy of the random variable Y = F(X, v).
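The two-sample scheme can be sketched numerically. The following is a minimal illustration, not the thesis code: sample a fixes the density estimate (here a Gaussian Parzen window, anticipating Section 3.2), and an independent sample b supplies the sample mean of -log p-hat. The kernel variance psi = 0.25, the sample sizes, and the random seed are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def parzen_density(y, a, psi=0.25):
    # P(y, a) = (1/N_a) * sum over y_a in a of g_psi(y - y_a):
    # the average of Gaussian kernels of variance psi centered on sample a.
    return np.mean(np.exp(-0.5 * (y - a) ** 2 / psi) / np.sqrt(2 * np.pi * psi))

def entropy_two_sample(a, b, psi=0.25):
    # h(X) ~= E_b[-log p_hat(X)]: sample a estimates the density,
    # sample b estimates the entropy as a sample mean.
    return -np.mean([np.log(parzen_density(yb, a, psi)) for yb in b])

x = rng.standard_normal(2000)   # X ~ N(0, 1); true h(X) = 0.5*log(2*pi*e) ~ 1.42
a, b = x[:1000], x[1000:]
h = entropy_two_sample(a, b)
print(h)  # near 1.42, biased slightly upward by the kernel smoothing
```

Note the bias: the Parzen window effectively convolves the true density with the kernel, so the estimate converges to the entropy of a slightly smoothed distribution rather than of X itself.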
The entropy, h(Y), is now a function of v and can be manipulated by changing v. Since there is no direct technique for finding the parameters that will extremize h(Y), we will search the parameter space using gradient descent. The following derivation assumes that Y is a vector random variable. The joint entropy of two random variables, h(Y_1, Y_2), can be evaluated by constructing the vector random variable W = [Y_1, Y_2]^T and evaluating h(W).

The form of the Parzen estimate constructed from a sample a is

P(y, a) \equiv \frac{1}{N_a} \sum_{y_a \in a} g_\psi(y - y_a) ,     (3.21)

where the Parzen estimator is constructed with Gaussian smoothing functions g_\psi of covariance \psi. Given P(y, a) we can approximate entropy as the sample mean,

h(Y) \approx -E_b[\log P(Y, a)]     (3.22)
     = -\frac{1}{N_b} \sum_{y_b \in b} \log P(y_b, a) ,     (3.23)

computed over a second sample b; this is the EMMA estimate of empirical entropy. In order to extremize entropy we must calculate the derivative of entropy with respect to v. This may be expressed as

\frac{d}{dv} h(Y) = \frac{-1}{N_b} \sum_{y_b \in b} \frac{\sum_{y_a \in a} \frac{d}{dv} g_\psi(y_b - y_a)}{\sum_{y_a \in a} g_\psi(y_b - y_a)} ,     (3.24)

and, after differentiating the Gaussian,

\frac{d}{dv} h(Y) = \frac{1}{N_b} \sum_{y_b \in b} \frac{\sum_{y_a \in a} g_\psi(y_b - y_a)\, (y_b - y_a)^T \psi^{-1} \frac{d}{dv}(y_b - y_a)}{\sum_{y_a \in a} g_\psi(y_b - y_a)} .     (3.25)

3.2. ESTIMATING ENTROPY WITH PARZEN DENSITIES    AI-TR 1548

This expression may be written more compactly as

\frac{d}{dv} h(Y) = \frac{1}{N_b} \sum_{y_b \in b} \sum_{y_a \in a} W_y(y_b, y_a)\, (y_b - y_a)^T \psi^{-1} \frac{d}{dv}(y_b - y_a) ,     (3.26)

using the following definition:

W_y(y_b, y_a) \equiv \frac{g_\psi(y_b - y_a)}{\sum_{y_a' \in a} g_\psi(y_b - y_a')} .     (3.27)

W_y(y_b, y_a) takes on values between zero and one. It will approach one if y_b is significantly closer to y_a than to any other element of a. It will be near zero if some other element of a is significantly closer to y_b. Distance is interpreted with respect to the squared Mahalanobis distance (see Duda and Hart, 1973),

D_\psi(y) \equiv y^T \psi^{-1} y .

Thus, W_y(y_b, y_a) is an indicator of the degree of match between its arguments, in a "soft" sense.
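Equations (3.21) through (3.27) translate directly into a few lines of array code. The sketch below is illustrative only; the scalar mapping Y = vX, the kernel variance psi = 0.1, the sample sizes, and the seed are assumptions, not from the text. It computes the EMMA entropy estimate and its analytic derivative using the softmax-like weights W_y, then checks the derivative against a central finite difference.

```python
import numpy as np

rng = np.random.default_rng(1)
psi = 0.1  # Gaussian kernel variance (a hypothetical choice)

def g(d):
    # Gaussian smoothing function g_psi of eq. (3.21), scalar case
    return np.exp(-0.5 * d * d / psi) / np.sqrt(2 * np.pi * psi)

def emma_entropy(v, xa, xb):
    # h(Y) ~= -(1/N_b) sum_b log[(1/N_a) sum_a g(y_b - y_a)], eqs. (3.22)-(3.23)
    d = v * xb[:, None] - v * xa[None, :]
    return -np.mean(np.log(np.mean(g(d), axis=1)))

def emma_entropy_grad(v, xa, xb):
    # eq. (3.26): d/dv h = (1/N_b) sum_b sum_a W(y_b, y_a)
    #             * (y_b - y_a) * psi^{-1} * d/dv(y_b - y_a)
    d = v * xb[:, None] - v * xa[None, :]    # y_b - y_a
    k = g(d)
    W = k / k.sum(axis=1, keepdims=True)     # softmax-like weights, eq. (3.27)
    dd_dv = xb[:, None] - xa[None, :]        # d/dv (y_b - y_a), since Y = vX
    return np.mean(np.sum(W * d / psi * dd_dv, axis=1))

xa, xb = rng.standard_normal(200), rng.standard_normal(200)
v = 1.5
analytic = emma_entropy_grad(v, xa, xb)
eps = 1e-5
numeric = (emma_entropy(v + eps, xa, xb) - emma_entropy(v - eps, xa, xb)) / (2 * eps)
print(analytic, numeric)  # the two values should agree closely
```

The finite-difference check is the point of the sketch: because the Parzen estimate is an explicit function of the sample, the gradient in (3.26) is exact for the estimate (3.23), with no density re-fitting step in between.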
It is equivalent to using the "softmax" function of neural networks (Bridle, 1989) on the negative of the Mahalanobis distance to indicate correspondence between y_b and the elements of a. Equation 3.26 may also be expressed as

\frac{d}{dv} h(Y) = \frac{1}{N_b} \sum_{y_b \in b} \sum_{y_a \in a} W_y(y_b, y_a) \frac{d}{dv} \frac{1}{2} D_\psi(y_b - y_a) .     (3.28)

In this form it is apparent that to reduce entropy, the parameters v should be adjusted such that there is a reduction in the average squared distance between points which W indicates are nearby.

Before moving on, it is worth re-emphasizing that for most density models \frac{d}{dv} h(Y) is very difficult to compute. The general derivation of the derivative of entropy is much more complex than the Parzen derivation:

\frac{\partial h(Y)}{\partial v} \approx -\frac{d}{dv} \frac{1}{N_b} \sum_{y_b \in b} \log p(y_b, a)     (3.29)
  = -\frac{1}{N_b} \sum_{y_b \in b} \frac{\frac{d}{dv} p(y_b, a)}{p(y_b, a)}     (3.30)
  = -\frac{1}{N_b} \sum_{y_b \in b} \frac{\frac{d}{dy_b} p(y_b, a) \frac{dy_b}{dv} + \frac{d}{da} p(y_b, a) \frac{da}{dv}}{p(y_b, a)} .     (3.31)

The numerator of the derivative has two components. The first, \frac{d}{dy_b} p(y_b, a) \frac{dy_b}{dv}, is the change in entropy that results from changes in the sample b. The second, \frac{d}{da} ...
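The interpretation of Equation (3.28) can be made concrete with a small gradient-descent loop. This is a hypothetical setup, not from the text: Y = vX with a scalar parameter v and a Gaussian kernel of variance psi = 0.1. Descending the EMMA entropy gradient shrinks the weighted pairwise distances between nearby points, and the entropy estimate falls.

```python
import numpy as np

rng = np.random.default_rng(2)
psi = 0.1  # kernel variance (assumed)

def g(d):
    return np.exp(-0.5 * d * d / psi) / np.sqrt(2 * np.pi * psi)

def emma_entropy(v, xa, xb):
    # EMMA entropy estimate, eqs. (3.22)-(3.23), for Y = vX
    d = v * (xb[:, None] - xa[None, :])
    return -np.mean(np.log(np.mean(g(d), axis=1)))

def grad(v, xa, xb):
    # eq. (3.26)/(3.28): softmax-weighted derivative of half squared distances
    d = v * (xb[:, None] - xa[None, :])
    k = g(d)
    W = k / k.sum(axis=1, keepdims=True)
    return np.mean(np.sum(W * d / psi * (xb[:, None] - xa[None, :]), axis=1))

xa, xb = rng.standard_normal(100), rng.standard_normal(100)
v = 2.0
h0 = emma_entropy(v, xa, xb)
for _ in range(100):
    v -= 0.01 * grad(v, xa, xb)   # descend the entropy gradient
h1 = emma_entropy(v, xa, xb)
print(h0, "->", h1)  # entropy decreases as v shrinks the spread of Y
```

For this toy mapping the effect is easy to see analytically: h(vX) = h(X) + log|v|, so reducing entropy drives |v| toward zero until the kernel smoothing dominates, which matches the weighted-distance-shrinking reading of (3.28).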