...gradient
techniques, can be overcome using stochastic gradient descent.
We believe that stochastic approximation serendipitously combines efficient computation
with effective escape from local minima.

3.3.1 Estimating the Covariance
In addition to the learning rate $\lambda$, the covariance matrices of the smoothing functions $R$ are
important parameters of EMMA. These parameters may be chosen so that they are optimal
in the maximum likelihood sense. This is equivalent to minimizing the cross entropy of the
estimated distribution with the true distribution (see Section 2.4.3). Our goal is to find the
parameters that minimize empirical entropy.
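Paul A. Viola    CHAPTER 3. EMPIRICAL ENTROPY MANIPULATION AND STOCHASTIC GRADIENT DESCENT

For simplicity, we assume that the covariance matrices are diagonal,

$$\psi = \mathrm{DIAG}(\sigma_1^2, \sigma_2^2, \ldots). \tag{3.57}$$

Following a derivation almost identical to the one described in Section 3.2 we can derive an
equation analogous to 3.26,

$$\frac{d}{d\sigma_k} h^*(Y) = -\frac{1}{N_b} \sum_{y_b \in b} \sum_{y_a \in a} W_y(y_b, y_a) \, \frac{1}{\sigma_k} \left( \frac{[y]_k^2}{\sigma_k^2} - 1 \right),$$

where $[y]_k$ is the $k$th component of the vector $y$ (here $y = y_b - y_a$). This equation forms the basis for a method
of stochastic maximization of likelihood.

Repeat:
    $A \leftarrow \{ N_a \text{ points drawn from } y \}$
    $B \leftarrow \{ N_b \text{ points drawn from } y \}$
    $\sigma_k' \leftarrow \sigma_k - \lambda \frac{d}{d\sigma_k} h^*(Y) \quad \forall k$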
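A minimal Python sketch of the covariance update loop above, assuming a Gaussian smoothing function with diagonal covariance. The function names, the default learning rate, the small positive floor on each standard deviation, and the choice to draw A and B as disjoint subsets are illustrative assumptions, not details taken from the text. The gradient is subtracted because the stated goal is to minimize empirical entropy.

```python
import math
import random


def log_gauss_diag(d, sigma):
    """Log density of a zero-mean Gaussian with covariance DIAG(sigma_k^2) at d."""
    return sum(-0.5 * math.log(2.0 * math.pi * s * s) - 0.5 * (dk / s) ** 2
               for dk, s in zip(d, sigma))


def entropy_and_grad(A, B, sigma):
    """Return the empirical entropy of B under a Parzen estimate built from A,
    and its derivative with respect to each sigma_k."""
    h = 0.0
    grad = [0.0] * len(sigma)
    for yb in B:
        diffs = [[b - a for b, a in zip(yb, ya)] for ya in A]
        logs = [log_gauss_diag(d, sigma) for d in diffs]
        m = max(logs)                          # log-sum-exp shift for stability
        ws = [math.exp(l - m) for l in logs]
        total = sum(ws)
        h -= (m + math.log(total / len(A))) / len(B)
        for d, w in zip(diffs, ws):
            W = w / total                      # the weight W_y(y_b, y_a)
            for k, s in enumerate(sigma):
                grad[k] -= W * (d[k] ** 2 / s ** 2 - 1.0) / (s * len(B))
    return h, grad


def fit_sigma(data, sigma, rate=0.05, steps=200, n_a=10, n_b=10, seed=0):
    """Stochastic descent on empirical entropy with respect to sigma."""
    rng = random.Random(seed)
    sigma = list(sigma)
    for _ in range(steps):
        drawn = rng.sample(data, n_a + n_b)    # disjoint A and B (an assumption)
        A, B = drawn[:n_a], drawn[n_a:]
        _, grad = entropy_and_grad(A, B, sigma)
        # Assumed safeguard: keep each sigma_k above a small positive floor.
        sigma = [max(1e-3, s - rate * g) for s, g in zip(sigma, grad)]
    return sigma
```

Drawing A and B without overlap avoids the degenerate case in which a point is compared with itself, which would reward shrinking every $\sigma_k$ toward zero.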
The above procedure is very similar to the one described in Section 3.3. During entropy
manipulation, it is possible to interleave covariance updates with parameter updates.

3.4 Principal Components Analysis and Information
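As a demonstration, we can derive a parameter estimation rule akin to principal components
analysis that truly maximizes information. This new EMMA based component analysis
(ECA) manipulates the entropy of the random variable $Y_v = X \cdot v$ under the constraint that
$|v| = 1$. For any given value of $v$ the entropy of $Y_v$ can be estimated from a sample of $X$ as:

\begin{align*}
h^*(Y_v) &= -\frac{1}{N_b} \sum_{y_b \in b} \log \left( \frac{1}{N_a} \sum_{y_a \in a} g_\psi(y_b - y_a) \right) \\
&= -\frac{1}{N_b} \sum_{x_b \in b} \log \left( \frac{1}{N_a} \sum_{x_a \in a} g_\psi(x_b \cdot v - x_a \cdot v) \right),
\end{align*}

3.4. PRINCIPAL COMPONENTS ANALYSIS AND INFORMATION    AI-TR 1548

where $\psi$ is the variance of the Parzen smoothing function. Moreover we can estimate the
derivative of entropy:

\begin{align*}
\frac{d}{dv} h^*(Y_v) &= \frac{1}{N_b} \sum_{y_b \in b} \sum_{y_a \in a} W_y(y_b, y_a) \, \frac{1}{\psi} (y_b - y_a) \frac{d}{dv} (y_b - y_a) \\
&= \frac{1}{N_b} \sum_{x_b \in b} \sum_{x_a \in a} W_y(y_b, y_a) \, \frac{1}{\psi} (y_b - y_a)(x_b - x_a).
\end{align*}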
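The entropy estimate and its derivative with respect to $v$ can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the author's implementation: the smoothing function is taken to be a one dimensional Gaussian of variance `psi`, the function names are hypothetical, and the constraint $|v| = 1$ is enforced by renormalizing after each ascent step, which is one possible choice among several.

```python
import math


def eca_entropy_and_grad(Xa, Xb, v, psi):
    """Return h*(Y_v) and d h*/dv for Y_v = X . v, with sample Xa defining
    the Parzen estimate and sample Xb used to evaluate the entropy."""
    dim = len(v)
    ya = [sum(x[i] * v[i] for i in range(dim)) for x in Xa]
    yb = [sum(x[i] * v[i] for i in range(dim)) for x in Xb]
    h = 0.0
    grad = [0.0] * dim
    for xb, y_b in zip(Xb, yb):
        logs = [-0.5 * math.log(2.0 * math.pi * psi) - 0.5 * (y_b - y_a) ** 2 / psi
                for y_a in ya]
        m = max(logs)                         # log-sum-exp shift for stability
        ws = [math.exp(l - m) for l in logs]
        total = sum(ws)
        h -= (m + math.log(total / len(Xa))) / len(Xb)
        for xa, y_a, w in zip(Xa, ya, ws):
            # W_y(y_b, y_a) * (1 / psi) * (y_b - y_a), averaged over Xb
            c = (w / total) * (y_b - y_a) / (psi * len(Xb))
            for i in range(dim):
                grad[i] += c * (xb[i] - xa[i])
    return h, grad


def eca_step(Xa, Xb, v, psi, rate):
    """One gradient ascent step on entropy, then rescale v to unit length
    (renormalization is an assumed way of keeping |v| = 1)."""
    _, g = eca_entropy_and_grad(Xa, Xb, v, psi)
    v = [vi + rate * gi for vi, gi in zip(v, g)]
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [vi / norm for vi in v]
```

Repeatedly calling `eca_step` with freshly drawn samples `Xa` and `Xb` gives a stochastic ascent loop of the same shape as the procedure in Section 3.3.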
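Let us decompose the derivative into parts which can be understood more easily. We will
first analyze the second part of the summand: $(y_b - y_a)(x_b - x_a)$. Ignoring the weighting
function $W_y \psi^{-1}$ we are left with the derivative of some unknown function $f(Y_v)$:

\begin{align*}
\frac{d}{dv} f(Y_v) &= \sum_{b} \sum_{a} (y_b - y_a)(x_b - x_a) \\
&= N_b N_a E_b\left[ E_a\left[ (y_b - y_a)(x_b - x_a) \right] \right]. \tag{3.64}
\end{align*}

What then is $f(Y_v)$? The derivative of the squared difference between samples is:

\begin{align*}
\frac{d}{dv} (y_b - y_a)^2 &= 2 (y_b - y_a) \frac{d}{dv} (x_b \cdot v - x_a \cdot v) \tag{3.65} \\
&= 2 (y_b - y_a) \frac{d}{dv} \left( (x_b - x_a) \cdot v \right) \\
&= 2 (y_b - y_a)(x_b - x_a). \tag{3.67}
\end{align*}

So we can see that

$$f(Y_v) = \frac{1}{2} N_b N_a E_b\left[ E_a\left[ (y_b - y_a)^2 \right] \right] \tag{3.68}$$

is proportional to the expectation of the squared difference between pairs of trials of $Y_v$.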
Figure 3.2: A scatter plot of a sample from a two dimensional Gaussian density. The sample
contains 200 points. The principal axis and the ECA axis are also plotted as vectors from
the origin. The vectors are nearly identical.

Recall that PCA searches for the RV $Y_v$ that has the largest variance: $E_a[(y_a - E_a[y_a])^2] = \mathrm{Var}_a(Y_v)$.
Interestingly the expected squared difference between a pair of trials is precisely
twice the variance:

\begin{align*}
C(v) &= E_b\left[ E_a\left[ (y_b - y_a)^2 \right] \right] \\
&= E_b\left[ E_a\left[ y_b^2 - 2 y_a y_b + y_a^2 \right] \right] \\
&= E_b\left[ y_b^2 + E_a\left[ -2 y_a y_b + y_a^2 \right] \right] \tag{3.71} \\
&= E_b\left[ y_b^2 - 2 y_b E_a[y_a] + E_a[y_a^2] \right] \tag{3.72} \\
&= E_b[y_b^2] - 2 E_b[y_b] E_a[y_a] + E_a[y_a^2] \\
&= E_b[Y^2] - 2 E_b[Y] E_a[Y] + E_a[Y^2] \\
&= 2 \mathrm{Var}_a(Y). \tag{3.75}
\end{align*}
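The identity between the expected squared pairwise difference and twice the variance is easy to check numerically. The short Python sketch below assumes that the two samples a and b are the same finite sample and that the variance is the population form (dividing by the sample size), in which case the identity holds exactly. The function names are illustrative.

```python
def mean_squared_pair_difference(ys):
    """Average of (y_b - y_a)^2 over all ordered pairs drawn from ys."""
    n = len(ys)
    return sum((yb - ya) ** 2 for yb in ys for ya in ys) / (n * n)


def variance(ys):
    """Population variance of ys (divides by n, not n - 1)."""
    n = len(ys)
    mean = sum(ys) / n
    return sum((y - mean) ** 2 for y in ys) / n


sample = [1.0, 2.0, 3.0, 4.0]
print(mean_squared_pair_difference(sample))   # 2.5
print(2.0 * variance(sample))                 # 2.5
```

With two independent samples from the same density the identity holds in expectation rather than exactly.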
Without the weighting term, $W_y \psi^{-1}$, ECA would find exactly the same vector that PCA does:
the maximum variance projection vector. However the derivative of ECA does not act on all
points of $Y_v$ equally. Recall that $W_y(y_b, y_a)$ is a measure of the distance between $y_b$ and $y_a$.
It is large when $y_b$ is significantly closer to $y_a$ than to any other element of $a$. As a result ECA
maximizes variance in a local way. Points that are very far apart are forced no further apart.
Another way of interpreting ECA is as a type of robust variance maximization. Points that
might best be interpreted as outliers, because they are very far from the body of other points,
play a very small role in the maximization. These robust characteristics stand in contrast to
PCA which is very sensitive to outliers.
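The effect of the weighting term on outliers can be seen in a small Python sketch. It assumes a Gaussian smoothing function of variance `psi`, so that the weights for a fixed $y_b$ reduce to a normalized exponential of the negative squared distances. The function name and the sample values are made up for illustration.

```python
import math


def parzen_weights(y_b, ya, psi):
    """W_y(y_b, y_a) for every y_a in the sample ya, assuming a Gaussian
    smoothing function of variance psi. The weights sum to one."""
    logs = [-0.5 * (y_b - y_a) ** 2 / psi for y_a in ya]
    m = max(logs)                              # shift for numerical stability
    ws = [math.exp(l - m) for l in logs]
    total = sum(ws)
    return [w / total for w in ws]


# Three nearby points and one distant point playing the role of an outlier.
ya = [-0.2, 0.0, 0.1, 8.0]
weights = parzen_weights(0.05, ya, psi=1.0)
for y_a, w in zip(ya, weights):
    # Each pair contributes to the ECA derivative in proportion to w * (0.05 - y_a).
    print(y_a, w, w * (0.05 - y_a))
```

The weight attached to the distant point is many orders of magnitude smaller than the others, so its pair contributes almost nothing to the derivative, whereas an unweighted variance gradient would be dominated by that same pair.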
For densities that are Gaussian, the maximum entropy projection is the first principal
component. In simulations ECA effectively finds the same projection as PCA. Figure 3.2
shows a sample of data and the PCA and ECA principal components. Since this density
has a larger variance along the horizontal axis, both the ECA and PCA axes point along the
horizontal axis. Our ECA code takes roughly 10 second...