...imized by the principal component vector. There are many schemes for finding
the principal component of a density. One of the more elegant accomplishments of linear
algebra is the proof that the first eigenvector of the covariance matrix of X, $\Sigma_X$, is the
principal component vector.
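As a small numerical check of this result, the sketch below builds the sample covariance of two-dimensional data and confirms that no other direction gives a larger projected variance than its first eigenvector. The synthetic data, the restriction to two dimensions, and all function names are illustrative assumptions, not part of the source.

```python
import math
import random

def covariance_2d(points):
    """Return the sample covariance entries (sxx, sxy, syy) of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    return sxx, sxy, syy

def first_eigenvector_2d(sxx, sxy, syy):
    """Unit eigenvector of [[sxx, sxy], [sxy, syy]] with the largest eigenvalue."""
    # For a symmetric 2x2 matrix the major axis satisfies
    # tan(2 * theta) = 2 * sxy / (sxx - syy).
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return math.cos(theta), math.sin(theta)

def projected_variance(v, sxx, sxy, syy):
    """Variance of the data projected onto unit vector v, i.e. v^T Sigma v."""
    return v[0] ** 2 * sxx + 2.0 * v[0] * v[1] * sxy + v[1] ** 2 * syy

# Hypothetical correlated data: y depends on x plus independent noise.
rng = random.Random(0)
points = []
for _ in range(1000):
    x = rng.gauss(0.0, 3.0)
    points.append((x, 0.5 * x + rng.gauss(0.0, 1.0)))

sigma = covariance_2d(points)
v1 = first_eigenvector_2d(*sigma)
best = projected_variance(v1, *sigma)

# Sweep directions in one-degree steps; none should beat the first eigenvector.
sweep = [
    projected_variance((math.cos(math.radians(d)), math.sin(math.radians(d))), *sigma)
    for d in range(180)
]
print("first eigenvector:", v1)
print("variance along it:", best, "largest variance in sweep:", max(sweep))
```

The closed-form angle is used only because the example is two-dimensional; in higher dimensions a general symmetric eigensolver would take its place.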
54 3.1. EMPIRICAL ENTROPY AITR 1548

Under the assumption that X is Gaussian we can prove that the principal component is
the projection with maximum entropy. First, every projection of a Gaussian is also Gaussian.
Second, the entropy of a Gaussian is monotonically related to variance. Therefore, the $Y_v$
corresponding to the axis of highest variance is a Gaussian with the highest possible entropy.
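The two steps of this argument can be written out with the standard closed form for the entropy of a one-dimensional Gaussian. The notation $Y_v = v^{T} X$ for the projection onto a unit vector $v$, and $\lambda_1$ for the largest eigenvalue of $\Sigma_X$, is assumed here for illustration.

```latex
\begin{align*}
Y_v &= v^{T} X, \quad \|v\| = 1, \quad X \sim N(\mu, \Sigma_X)
  \;\Rightarrow\; Y_v \sim N\!\left(v^{T}\mu,\; v^{T}\Sigma_X v\right) \\
h(Y_v) &= \tfrac{1}{2}\log\!\left(2\pi e \, v^{T}\Sigma_X v\right) \\
\max_{\|v\|=1} h(Y_v) &= \tfrac{1}{2}\log\!\left(2\pi e \, \lambda_1\right),
  \quad \text{attained when $v$ is the first eigenvector of $\Sigma_X$.}
\end{align*}
```

Because the logarithm is increasing, the ordering of directions by entropy is the same as their ordering by projected variance.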
Moreover, principal components analysis finds the axis that contains the most information
about X. The mutual information between X and $Y_v$ is

$$I(X; Y_v) = h(Y_v) - h(Y_v \mid X). \tag{3.10}$$

This equation has two components. The first implies that $Y_v$ will give you more information
about X when $Y_v$ has a lot of entropy. The second can be misleading. Since knowing X
removes all of the randomness from $Y_v$, $h(Y_v \mid X)$ is negative infinity. This is not particularly
bothersome precisely because relative entropy is relative. Only the differences between relative entropies are significant. The variable $Y_{v_1}$ yields more information about X than $Y_{v_2}$
when

\begin{align}
I(X; Y_{v_1}) &> I(X; Y_{v_2}) \tag{3.11} \\
h(Y_{v_1}) - h(Y_{v_1} \mid X) &> h(Y_{v_2}) - h(Y_{v_2} \mid X) \tag{3.12} \\
h(Y_{v_1}) &> h(Y_{v_2}). \tag{3.13}
\end{align}

We can conclude that the principal component axis carries more information about a distribution than any other axis.

Function Learning
There are other well known problems that can be formulated within the entropy framework.
Let us analyze a simple learning problem. Given a random variable X we can define a
functionally dependent RV, $Y = F(X, v) + \eta$, which we assume has been perturbed by measurement noise. We are given samples of the joint occurrences of the RVs: $a = \{\ldots \{x_a, y_a\} \ldots\}$.
How can we estimate v? Typically this is formulated as a least squares problem. A cost is
defined as the sum of the squares of the differences between predicted $\hat{y}_a = F(x_a, \hat{v})$ and the
actual samples of Y,

$$C(\hat{v}) = \sum_{\{x_a, y_a\} \in a} \left( y_a - F(x_a, \hat{v}) \right)^2 . \tag{3.14}$$

55 CHAPTER 3. Paul A. Viola EMPIRICAL ENTROPY MANIPULATION AND STOCHASTIC GRADIENT DESCENT

The cost is a function of the estimated parameter vector $\hat{v}$. The sum of squared differences
can be justified from a log likelihood perspective (see Section 2.3.1). If we assume that the
noise added to Y is Gaussian and independent of Y then the log likelihood of $\hat{v}$ is:

\begin{align}
\log \ell(\hat{v}) &= \sum_{\{x_a, y_a\} \in a} \log p\left(Y = y_a \mid \hat{Y} = F(x_a, \hat{v})\right) \tag{3.15} \\
&= \sum_{\{x_a, y_a\} \in a} \log g\left(y_a - F(x_a, \hat{v})\right) \tag{3.16} \\
&= -\sum_{\{x_a, y_a\} \in a} \left( y_a - F(x_a, \hat{v}) \right)^2 + k, \tag{3.17}
\end{align}

where $g$ is the Gaussian density of the noise and $k$ is a constant value independent of $\hat{v}$. A positive scale factor set by the noise variance is omitted from (3.17), since it does not change which $\hat{v}$ is best. The $\hat{v}$ that minimizes the sum of the squares
is also the $\hat{v}$ that makes the observed data most likely. For most problems like this, gradient
descent is used to search for $\hat{v}$.
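The least squares formulation and its likelihood interpretation can be sketched in a few lines. The linear form chosen for F, the synthetic data, the step size, and every function name below are assumptions made for illustration; the source does not commit to a particular F or search procedure beyond gradient descent.

```python
import math
import random

def predict(x, v):
    """Hypothetical model F(x, v): a line with intercept v[0] and slope v[1]."""
    return v[0] + v[1] * x

def cost(samples, v):
    """Sum of squared differences between actual and predicted y (Equation 3.14)."""
    return sum((y - predict(x, v)) ** 2 for x, y in samples)

def log_likelihood(samples, v, noise_std):
    """Log likelihood of v under additive Gaussian noise with the given std."""
    log_norm = -math.log(noise_std * math.sqrt(2.0 * math.pi))
    return sum(
        log_norm - (y - predict(x, v)) ** 2 / (2.0 * noise_std ** 2)
        for x, y in samples
    )

def gradient_descent(samples, steps=500, rate=0.4):
    """Minimize the cost by following its negative gradient, scaled by 1/N."""
    v = [0.0, 0.0]
    n = len(samples)
    for _ in range(steps):
        residuals = [(x, y - predict(x, v)) for x, y in samples]
        grad0 = -2.0 * sum(r for _, r in residuals) / n
        grad1 = -2.0 * sum(r * x for x, r in residuals) / n
        v = [v[0] - rate * grad0, v[1] - rate * grad1]
    return v

# Synthetic samples {x_a, y_a}: the true parameters and noise level are made up.
rng = random.Random(1)
true_v, noise_std = [1.0, 2.0], 0.5
samples = []
for _ in range(200):
    x = rng.uniform(-1.0, 1.0)
    samples.append((x, predict(x, true_v) + rng.gauss(0.0, noise_std)))

v_hat = gradient_descent(samples)
print("estimated v:", v_hat)

# log likelihood + cost / (2 sigma^2) is the same constant k for every v,
# so the v that minimizes the cost also maximizes the likelihood.
for v in ([0.0, 0.0], true_v, v_hat):
    k = log_likelihood(samples, v, noise_std) + cost(samples, v) / (2.0 * noise_std ** 2)
    print(v, k)
```

The printed constant is identical for all three parameter vectors, which is the content of Equation 3.17 with the noise-variance scale factor written out explicitly.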
Finally we can show that minimizing cost maximizes mutual information. We showed in
Section 2.3.1 that log likelihood is related to sample entropy:

$$\frac{1}{N_a} \log \ell(\hat{v}) = -h_a(Y \mid \hat{Y}) \approx -h(Y \mid \hat{Y}), \tag{3.18}$$

where $N_a$ is the number of samples in $a$. The mutual information between Y and $\hat{Y}$ is

$$I(Y; \hat{Y}) = h(Y) - h(Y \mid \hat{Y}). \tag{3.19}$$

The first term is not a function of $\hat{v}$. The second is an approximation of log likelihood, up to a positive scale factor, so minimizing $C(\hat{v})$ maximizes $I(Y; \hat{Y})$.

NonGaussian Densities
There is a commonly held misconception that mutual information techniques are all equivalent
to simple well known algorithms. Contrary to the impression that the above two examples
may give, this is far from the truth. Entropy is only equivalent to least squares when the
data is assumed to be Gaussian. The approach to alignment and bias correction that we
will describe in the next chapters does not and could not assume that distributions were
Gaussian. We will show that if the data were Gaussian our alignment technique would
reduce to correlation.
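A textbook identity gives some intuition for why Gaussian assumptions lead back to correlation. For two jointly Gaussian scalar variables, written here as $U$ and $V$ with correlation coefficient $\rho$ (symbols chosen for illustration, not taken from the source), mutual information depends on the data only through $\rho$:

```latex
\begin{align*}
h(U) &= \tfrac{1}{2}\log\!\left(2\pi e\,\sigma_U^2\right), \qquad
h(U \mid V) = \tfrac{1}{2}\log\!\left(2\pi e\,\sigma_U^2\,(1-\rho^2)\right) \\
I(U; V) &= h(U) - h(U \mid V) = -\tfrac{1}{2}\log\!\left(1-\rho^2\right)
\end{align*}
```

Under this assumption, maximizing mutual information is the same as maximizing $|\rho|$. This is only a sketch of the Gaussian special case, not the derivation that the later chapters refer to.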
56 3.2. ESTIMATING ENTROPY WITH PARZEN DENSITIES AITR 1548

There are a number of nonGaussian problems that can be solved using entropy or mutual
information. Bell has shown that signal separation and decorrelation can be thought of as
entropy problems (Bell and Sejnowski, 1995). Bell's technique can be derived both for
Gaussian and nonGaussian distributions. Bell shows that the Gaussian assumption leads to
a well known and ineffective algorithm. When the signals are presumed to be nonGaussian,
the resulting algorithms are much more effective. Many compression and image processing
problems clearly involve nonGaussian distributions.
In theory, empirical entropy estimation can be used with any type of density model. The
procedure is the same: estimate the density from a sample and compute the entropy from the
density. In practice, the process can be computationally intensive. The first part, maximum
likelihood density estimation, is an iterative search through parameter space. The second,
evaluating the entropy integral, may well be impossible. For example, there is no known
closed form solution for the entropy of a mixture of Gaussians. The ent...
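The two-step procedure described in this paragraph can be sketched for one model where the entropy has a closed form and one where it does not. For the mixture, the code below replaces the entropy integral with the sample average of -log p(x), which is one common workaround and not necessarily the one the source goes on to describe. The mixture parameters are assumed known rather than fitted, and all names and numbers are hypothetical.

```python
import math
import random

def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, components):
    """Density of a Gaussian mixture given (weight, mean, std) triples."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)

def sample_entropy(sample, pdf):
    """Estimate entropy as the sample average of -log p(x), in nats."""
    return -sum(math.log(pdf(x)) for x in sample) / len(sample)

def gaussian_entropy(std):
    """Closed-form entropy of a one-dimensional Gaussian, in nats."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)

# Hypothetical two-component mixture; its entropy has no known closed form.
components = [(0.5, -3.0, 1.0), (0.5, 3.0, 1.0)]
rng = random.Random(2)
sample = []
for _ in range(5000):
    _, mean, std = components[0] if rng.random() < 0.5 else components[1]
    sample.append(rng.gauss(mean, std))

# Step 1 for a single-Gaussian model: the maximum likelihood fit is the
# sample mean and the (biased) sample standard deviation.
n = len(sample)
fit_mean = sum(sample) / n
fit_std = math.sqrt(sum((x - fit_mean) ** 2 for x in sample) / n)

# Step 2: closed form for the Gaussian model, sample average for the mixture.
h_gaussian_model = gaussian_entropy(fit_std)
h_mixture_model = sample_entropy(sample, lambda x: mixture_pdf(x, components))
print("entropy under single-Gaussian model:", h_gaussian_model)
print("sample entropy under mixture model: ", h_mixture_model)
```

The single-Gaussian model reports a larger entropy than the mixture model on the same data, which illustrates how much the estimate depends on the choice of density model.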