1995_Viola_thesis_registrationMI



CHAPTER 3. EMPIRICAL ENTROPY MANIPULATION AND STOCHASTIC GRADIENT DESCENT (Paul A. Viola, AI-TR 1548)

... gradient techniques, can be overcome using stochastic gradient descent. We believe that stochastic approximation serendipitously combines efficient computation with effective escape from local minima.

## 3.3.1 Estimating the Covariance

In addition to the learning rate $\lambda$, the covariance matrices of the smoothing functions $R$ are important parameters of EMMA. These parameters may be chosen so that they are optimal in the maximum likelihood sense. This is equivalent to minimizing the cross entropy of the estimated distribution with the true distribution (see Section 2.4.3). Our goal is to find the parameters that minimize empirical entropy.

For simplicity, we assume that the covariance matrices are diagonal,

$$\psi = \mathrm{DIAG}(\sigma_1^2, \sigma_2^2, \ldots). \tag{3.57}$$

Following a derivation almost identical to the one described in Section 3.2 we can derive an equation analogous to 3.26,

$$\frac{d}{d\sigma_k} h^*(Y) = \frac{1}{N_b} \sum_{y_b \in b} \sum_{y_a \in a} W_y(y_b, y_a) \left(\frac{1}{\sigma_k}\right) \left(\frac{[y]_k^2}{\sigma_k^2} - 1\right) \tag{3.58}$$

where $[y]_k$ is the $k$th component of the vector $y$ (here $y = y_b - y_a$). This equation forms the basis for a method of stochastic maximization of likelihood.

Repeat:

$$\begin{aligned}
A &\leftarrow \{N_a \text{ points drawn from } y\} \\
B &\leftarrow \{N_b \text{ points drawn from } y\} \\
\sigma_k &\leftarrow \sigma_k + \lambda \frac{d}{d\sigma_k} h^*(Y) \quad \forall k
\end{aligned}$$

The above procedure is very similar to the one described in Section 3.3. During entropy manipulation, it is possible to interleave covariance updates with parameter updates.

## 3.4 Principal Components Analysis and Information

As a demonstration, we can derive a parameter estimation rule akin to principal components analysis that truly maximizes information. This new EMMA based component analysis (ECA) manipulates the entropy of the random variable $Y_v = X \cdot v$ under the constraint that $|v| = 1$. For any given value of $v$ the entropy of $Y_v$ can be estimated from a sample of $X$ as:

$$h^*(Y_v) = -\frac{1}{N_b} \sum_{y_b \in b} \log \left( \frac{1}{N_a} \sum_{y_a \in a} g_\psi(y_b - y_a) \right) \tag{3.59}$$

$$= -\frac{1}{N_b} \sum_{x_b \in b} \log \left( \frac{1}{N_a} \sum_{x_a \in a} g_\psi(x_b \cdot v - x_a \cdot v) \right) \tag{3.60}$$

where $\psi$ is the variance of the Parzen smoothing function. Moreover we can estimate the derivative of entropy:

$$\frac{d}{dv} h^*(Y_v) = \frac{1}{N_b} \sum_{y_b \in b} \sum_{y_a \in a} W_y(y_b, y_a) \frac{1}{\psi} (y_b - y_a) \frac{d}{dv}(y_b - y_a) \tag{3.61}$$

$$= \frac{1}{N_b} \sum_{y_b \in b} \sum_{y_a \in a} W_y(y_b, y_a) \frac{1}{\psi} (y_b - y_a)(x_b - x_a). \tag{3.62}$$

Let us decompose the derivative into parts which can be understood more easily. We will first analyze the second part of the summand: $(y_b - y_a)(x_b - x_a)$. Ignoring the weighting function $W_y \psi^{-1}$ we are left with the derivative of some unknown function $f(Y_v)$:

$$\frac{d}{dv} f(Y_v) = \sum_{b} \sum_{a} (y_b - y_a)(x_b - x_a) \tag{3.63}$$

$$= N_b N_a E_b\left[E_a\left[(y_b - y_a)(x_b - x_a)\right]\right]. \tag{3.64}$$

What then is $f(Y_v)$? The derivative of the squared difference between samples is:

$$\frac{d}{dv}(y_b - y_a)^2 = 2(y_b - y_a) \frac{d}{dv}(x_b \cdot v - x_a \cdot v) \tag{3.65}$$

$$= 2(y_b - y_a) \frac{d}{dv}\left((x_b - x_a) \cdot v\right) \tag{3.66}$$

$$= 2(y_b - y_a)(x_b - x_a). \tag{3.67}$$

So we can see that

$$f(Y_v) = \frac{N_b N_a}{2} E_b\left[E_a\left[(y_b - y_a)^2\right]\right] \tag{3.68}$$

is proportional to the expectation of the squared difference between pairs of trials of $Y_v$.

Recall that PCA searches for the RV $Y_v$ that has the largest variance: $E_a\left[(y_a - E_a[y_a])^2\right] = \mathrm{Var}_a(Y_v)$. Interestingly the expected squared difference between a pair of trials is precisely twice the variance:

$$C(v) = E_b\left[E_a\left[(y_b - y_a)^2\right]\right] \tag{3.69}$$

$$= E_b\left[E_a\left[y_b^2 - 2 y_a y_b + y_a^2\right]\right] \tag{3.70}$$

$$= E_b\left[y_b^2 + E_a\left[-2 y_a y_b + y_a^2\right]\right] \tag{3.71}$$

$$= E_b\left[y_b^2 - 2 y_b E_a[y_a] + E_a[y_a^2]\right] \tag{3.72}$$

$$= E_b[y_b^2] - 2 E_b[y_b] E_a[y_a] + E_a[y_a^2] \tag{3.73}$$

$$= E_b[Y^2] - 2 E_b[Y] E_a[Y] + E_a[Y^2] \tag{3.74}$$

$$= 2 \mathrm{Var}_a(Y). \tag{3.75}$$

Without the weighting term, $W_y \psi^{-1}$, ECA would find exactly the same vector that PCA does: the maximum variance projection vector. However the derivative of ECA does not act on all points of $Y_v$ equally. Recall that $W_y(y_b, y_a)$ is a measure of the closeness of $y_b$ and $y_a$: it is large when $y_b$ is significantly closer to $y_a$ than to any other element of $a$. As a result ECA maximizes variance in a local way. Points that are very far apart are forced no further apart. Another way of interpreting ECA is as a type of robust variance maximization. Points that might best be interpreted as outliers, because they are very far from the body of other points, play a very small role in the optimization. These robust characteristics stand in contrast to PCA, which is very sensitive to outliers.

Figure 3.2: A scatter plot of a sample from a two dimensional Gaussian density. The sample contains 200 points. The principal axis and the ECA axis are also plotted as vectors from the origin. The vectors are nearly identical.

For densities that are Gaussian, the maximum entropy projection is the first principal component. In simulations ECA effectively finds the same projection as PCA. Figure 3.2 shows a sample of data and the PCA and ECA principal components. Since this density has a larger variance along the horizontal axis, both the ECA and PCA axes point along the horizontal axis. Our ECA code takes roughly 10 second...
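The ECA update built from equations 3.61 and 3.62 can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the thesis implementation: the function names (`make_sample`, `eca_gradient`, `eca_axis`, `pca_axis`) are hypothetical, the kernel variance `psi`, step size, batch size and iteration count are arbitrary choices, and the weight $W_y(y_b, y_a)$ is assumed to be the Gaussian kernel value normalized over the sample $a$, since its definition lies outside this excerpt. The constraint $|v| = 1$ is enforced here by renormalizing after each step, which is also an assumption.

```python
import math
import random


def make_sample(count=200, seed=1):
    """Draw 2-D Gaussian points with more variance along the horizontal axis."""
    rng = random.Random(seed)
    return [(rng.gauss(0.0, 1.5), rng.gauss(0.0, 0.5)) for _ in range(count)]


def gaussian(z, psi):
    """One-dimensional Gaussian density with variance psi."""
    return math.exp(-0.5 * z * z / psi) / math.sqrt(2.0 * math.pi * psi)


def eca_gradient(v, sample_a, sample_b, psi):
    """Estimate d/dv h*(Y_v) from two samples, following equation 3.62."""
    grad = [0.0, 0.0]
    y_a = [x[0] * v[0] + x[1] * v[1] for x in sample_a]
    for x_b in sample_b:
        y_b = x_b[0] * v[0] + x_b[1] * v[1]
        kernel = [gaussian(y_b - ya, psi) for ya in y_a]
        total = sum(kernel)
        if total == 0.0:  # guard against numerical underflow
            continue
        for x_a, ya, k in zip(sample_a, y_a, kernel):
            w = k / total  # assumed form of W_y(y_b, y_a)
            coeff = w * (y_b - ya) / psi
            grad[0] += coeff * (x_b[0] - x_a[0])
            grad[1] += coeff * (x_b[1] - x_a[1])
    n_b = len(sample_b)
    return [grad[0] / n_b, grad[1] / n_b]


def eca_axis(data, psi=1.0, rate=0.2, steps=400, n=20, seed=0):
    """Stochastic gradient ascent on the entropy of Y_v with |v| = 1."""
    rng = random.Random(seed)
    v = [0.5, math.sqrt(0.75)]  # unit vector, 60 degrees from horizontal
    for _ in range(steps):
        drawn = rng.sample(data, 2 * n)  # disjoint samples a and b
        grad = eca_gradient(v, drawn[:n], drawn[n:], psi)
        v = [v[0] + rate * grad[0], v[1] + rate * grad[1]]
        norm = math.hypot(v[0], v[1])
        v = [v[0] / norm, v[1] / norm]  # project back onto the unit circle
    return v


def pca_axis(data):
    """First principal axis of 2-D data from the sample covariance."""
    n = len(data)
    mx = sum(x[0] for x in data) / n
    my = sum(x[1] for x in data) / n
    cxx = sum((x[0] - mx) ** 2 for x in data) / n
    cyy = sum((x[1] - my) ** 2 for x in data) / n
    cxy = sum((x[0] - mx) * (x[1] - my) for x in data) / n
    theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    return [math.cos(theta), math.sin(theta)]


data = make_sample()
v_pca = pca_axis(data)
v_eca = eca_axis(data)
print("PCA axis:", v_pca)
print("ECA axis:", v_eca)
```

The two axes are defined only up to sign, so a fair comparison uses the absolute value of their dot product. Drawing fresh small samples `a` and `b` at every step is what makes the procedure stochastic; larger batches would reduce gradient noise at a quadratic cost per step.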