Moreover principal components analysis finds the axis


[Running head: 3.1. EMPIRICAL ENTROPY, AI-TR 1548]

...imized by the principal component vector. There are many schemes for finding the principal component of a density. One of the more elegant accomplishments of linear algebra is the proof that the first eigenvector of the covariance matrix of $X$, $\Sigma_X$, is the principal component vector.

Under the assumption that $X$ is Gaussian we can prove that the principal component is the projection with maximum entropy. First, every projection of a Gaussian is also Gaussian. Second, the entropy of a Gaussian is monotonically related to variance. Therefore, the $Y_v$ corresponding to the axis of highest variance is a Gaussian with the highest possible entropy.

Moreover, principal components analysis finds the axis that contains the most information about $X$. The mutual information between $X$ and $Y_v$ is

$$ I(X; Y_v) = h(Y_v) - h(Y_v \mid X) . \tag{3.10} $$

This equation has two components. The first implies that $Y_v$ will give you more information about $X$ when $Y_v$ has a lot of entropy. The second can be misleading. Since knowing $X$ removes all of the randomness from $Y_v$, $h(Y_v \mid X)$ is negative infinity. This is not particularly bothersome precisely because relative entropy is relative. Only the differences between relative entropies are significant. The variable $Y_{v_1}$ yields more information about $X$ than $Y_{v_2}$ when

\begin{align}
I(X; Y_{v_1}) &> I(X; Y_{v_2}) \tag{3.11} \\
h(Y_{v_1}) - h(Y_{v_1} \mid X) &> h(Y_{v_2}) - h(Y_{v_2} \mid X) \tag{3.12} \\
h(Y_{v_1}) &> h(Y_{v_2}) . \tag{3.13}
\end{align}

We can conclude that the principal component axis carries more information about a distribution than any other axis.

Function Learning

There are other well known problems that can be formulated within the entropy framework. Let us analyze a simple learning problem. Given a random variable $X$ we can define a functionally dependent RV, $Y = F(X, v) + \eta$, which we assume has been perturbed by measurement noise. We are given samples of the joint occurrences of the RVs: $a = \{ \ldots \{x_a, y_a\} \ldots \}$. How can we estimate $v$? Typically this is formulated as a least squares problem.
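The least squares formulation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: F is taken to be a scalar linear model, F(x, v) = v * x, the noise is Gaussian, and the names model, cost, fit_closed_form and fit_gradient_descent, together with the sample size, noise level and step size, are hypothetical choices that do not come from the text. The sketch generates noisy samples {x_a, y_a}, evaluates the sum-of-squares cost, and recovers an estimate of v both in closed form and by plain gradient descent.

```python
import random


def model(x, v):
    """Hypothetical F(x, v): a linear map with one scalar parameter (an assumption)."""
    return v * x


def cost(samples, v_hat):
    """Sum of squared differences between observed y and predicted F(x, v_hat)."""
    return sum((y - model(x, v_hat)) ** 2 for x, y in samples)


def fit_closed_form(samples):
    """Least squares estimate for the scalar linear model: sum(x*y) / sum(x*x)."""
    sxy = sum(x * y for x, y in samples)
    sxx = sum(x * x for x, _ in samples)
    return sxy / sxx


def fit_gradient_descent(samples, v0=0.0, rate=0.1, steps=500):
    """Minimise the mean squared error by fixed-step gradient descent."""
    v_hat = v0
    n = len(samples)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to v_hat.
        grad = -2.0 * sum(x * (y - model(x, v_hat)) for x, y in samples) / n
        v_hat -= rate * grad
    return v_hat


# Illustrative data: true parameter, noise level and sample size are assumptions.
rng = random.Random(0)
true_v = 1.5
samples = []
for _ in range(200):
    x = rng.uniform(-1.0, 1.0)
    y = model(x, true_v) + rng.gauss(0.0, 0.1)  # Gaussian measurement noise
    samples.append((x, y))

v_ls = fit_closed_form(samples)
v_gd = fit_gradient_descent(samples)
print("closed form:", v_ls, "gradient descent:", v_gd)
```

On this toy problem the two estimates should agree closely, because the cost is a convex quadratic in the single parameter. For a nonlinear F a closed form is generally unavailable, and an iterative search is the usual fallback.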
A cost is defined as the sum of the squares of the differences between predicted $\hat{y}_a = F(x_a, \hat{v})$ and the actual samples of $Y$,

$$ C(\hat{v}) = \sum_{\{x_a, y_a\} \in a} \left( y_a - F(x_a, \hat{v}) \right)^2 . \tag{3.14} $$

The cost is a function of the estimated parameter vector $\hat{v}$. The sum of squared difference can be justified from a log likelihood perspective (see Section 2.3.1). If we assume that the noise added to $Y$ is Gaussian and independent of $Y$ then the log likelihood of $\hat{v}$ is:

\begin{align}
\log \ell(\hat{v}) &= \sum_{\{x_a, y_a\} \in a} \log p\left( Y = y_a \mid \hat{Y} = F(x_a, \hat{v}) \right) \tag{3.15} \\
&= \sum_{\{x_a, y_a\} \in a} \log g\left( y_a - F(x_a, \hat{v}) \right) \tag{3.16} \\
&= - \sum_{\{x_a, y_a\} \in a} \left( y_a - F(x_a, \hat{v}) \right)^2 + k , \tag{3.17}
\end{align}

where $k$ is a constant value independent of $\hat{v}$. The $\hat{v}$ that minimizes the sum of the squares is also the $\hat{v}$ that makes the data observed most likely. For most problems like this, gradient descent is used to search for $\hat{v}$.

[Running head: CHAPTER 3. EMPIRICAL ENTROPY MANIPULATION AND STOCHASTIC GRADIENT DESCENT, Paul A. Viola]

Finally we can show that minimizing cost maximizes mutual information. We showed in Section 2.3.1 that log likelihood is related to sample entropy:

$$ \log \ell(\hat{v}) = -\frac{1}{N_a} h_a(Y \mid \hat{Y}) \approx -\frac{1}{N_a} h(Y \mid \hat{Y}) . \tag{3.18} $$

The mutual information between $Y$ and $\hat{Y}$ is

$$ I(Y; \hat{Y}) = h(Y) - h(Y \mid \hat{Y}) . \tag{3.19} $$

The first term is not a function of $\hat{v}$. The second is an approximation of log likelihood: minimizing $C(\hat{v})$ maximizes $I(Y; \hat{Y})$.

Non-Gaussian Densities

There is a commonly held misconception that mutual information techniques are all equivalent to simple well known algorithms. Contrary to the impression that the above two examples may give, this is far from the truth. Entropy is only equivalent to least squares when the data is assumed to be Gaussian. The approach to alignment and bias correction that we will describe in the next chapters does not and could not assume that distributions were Gaussian. We will show that if the data were Gaussian our alignment technique would reduce to correlation.

[Running head: 3.2. ESTIMATING ENTROPY WITH PARZEN DENSITIES, AI-TR 1548]

There are a number of non-Gaussian problems that can be solved using entropy or mutual information. Bell has shown that signal separation and de-correlation can be thought of as entropy problems (Bell and Sejnowski, 1995). Bell's technique can be derived both for Gaussian and non-Gaussian distributions. Bell shows that the Gaussian assumption leads to a well known and ineffective algorithm. When the signals are presumed to be non-Gaussian, the resulting algorithms are much more effective. Many compression and image processing problems clearly involve non-Gaussian distributions.

In theory, empirical entropy estimation can be used with any type of density model. The procedure is the same: estimate the density from a sample and compute the entropy from the density. In practice, the process can be computationally intensive. The first part, maximum likelihood density estimation, is an iterative search through parameter space. The second, evaluating the entropy integral, may well be impossible. For example, there is no known closed form solution for the entropy of a mixture of Gaussians. The ent...
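The remark above that a mixture of Gaussians has no known closed form entropy can be illustrated numerically. The Python sketch below is a hedged example, not the author's procedure: it assumes the mixture parameters are already known (the density-fitting step is skipped) and replaces the entropy integral with a plain sample average of -log p(x). The function names, the two-component mixture and the sample size are all hypothetical. A single Gaussian, whose entropy does have a closed form, serves as a sanity check.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, components):
    """Density of a mixture given as (weight, mean, std) triples summing to weight 1."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)


def sample_mixture(rng, components):
    """Draw one value: pick a component by weight, then sample from it."""
    u = rng.random()
    acc = 0.0
    for w, m, s in components:
        acc += w
        if u <= acc:
            return rng.gauss(m, s)
    # Guard against floating point round-off in the cumulative weights.
    _, m, s = components[-1]
    return rng.gauss(m, s)


def empirical_entropy(sample, pdf):
    """Sample average of -log p(x), in nats (a Monte Carlo stand-in for the integral)."""
    return -sum(math.log(pdf(x)) for x in sample) / len(sample)


def gaussian_entropy(std):
    """Closed form differential entropy of a Gaussian, in nats."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std * std)


rng = random.Random(0)

# Sanity check on a single Gaussian, where the exact answer is available.
single = [(1.0, 0.0, 2.0)]
sample_single = [sample_mixture(rng, single) for _ in range(20000)]
h_single_est = empirical_entropy(sample_single, lambda x: mixture_pdf(x, single))
h_single_exact = gaussian_entropy(2.0)

# Illustrative two-component mixture (parameters are assumptions).
mix = [(0.5, -3.0, 1.0), (0.5, 3.0, 1.0)]
sample_mix = [sample_mixture(rng, mix) for _ in range(20000)]
h_mix_est = empirical_entropy(sample_mix, lambda x: mixture_pdf(x, mix))

# Entropy of a Gaussian with the same variance as the mixture (1 + 9 = 10).
h_matched_gaussian = gaussian_entropy(math.sqrt(10.0))

print("single Gaussian:", h_single_est, "exact:", h_single_exact)
print("mixture estimate:", h_mix_est, "matched Gaussian:", h_matched_gaussian)
```

The mixture estimate should come out below the entropy of a Gaussian with the same variance, since the Gaussian has the largest entropy among densities of a given variance. That gap is one way to see how far a Gaussian assumption can be from a non-Gaussian density.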