2.4. MODELING DENSITIES    AITR 1548
... contained in a sphere of radius $R$, on average $N_a (r/R)^d$ data points
will fall in a randomly chosen window. Generally, $r$ is selected so that $r/R \ll 1$. As a result,
with increased dimension the number of points falling in a randomly chosen window drops
exponentially and the normalized standard deviation of $P$ will increase rapidly. This implies
that the Parzen density estimate will become very unreliable as dimensionality increases.
While in theory this could be remedied by increasing the size of the sample exponentially,
things rapidly get out of hand. Empirical evidence argues against using Parzen estimation
in many more than six dimensions.
Chapter 3
Empirical Entropy Manipulation and
Stochastic Gradient Descent
This chapter presents a novel technique for evaluating and manipulating the empirical entropy
of a distribution, called EMMA. The theory of entropy manipulation plays a critical role in
the rest of this thesis and forms the algorithmic core in all of the applications.
There are a number of existing techniques that manipulate the entropy of a density. They
each have significant theoretical and practical limitations that make them unsuitable for our
purposes. We will begin with a description of these techniques, and two simple applications.
The second part of the chapter describes a new procedure for evaluating empirical entropy,
EMMA. We will present an efficient stochastic gradient scheme for extremizing the EMMA
estimates. This scheme has applications outside of entropy manipulation.
The final section of this chapter presents a tutorial application of EMMA. We will show
how EMMA can be used to derive an information theoretic version of principal components
analysis.
3.1 Empirical Entropy
As we saw in the previous chapter, in many cases the true density of a random variable
is not known. Instead one must make do with an estimate of the density obtained from a
sample of the variable. Likewise, there is no direct procedure for evaluating entropy from a
sample. A common approach is to first model the density from the sample, and then estimate
the entropy from the density. This divides the problem into manageable parts which can be
solved separately.
By far the most popular density model for entropy calculations is the Gaussian. This is
based on two considerations: (1) finding the Gaussian that fits the data best is very easy
(see Section 2.3.1) and (2) the entropy of the Gaussian can be directly calculated from its
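variance. The entropy of a Gaussian distribution $g_\psi(x - \mu)$ with mean $\mu$ and variance $\psi$ is
\begin{align}
h(X) &= -\int_{-\infty}^{\infty} g_\psi(x - \mu) \log g_\psi(x - \mu) \, dx \tag{3.1} \\
&= -\int_{-\infty}^{\infty} g_\psi(x - \mu) \left[ \log \frac{1}{\sqrt{2\pi\psi}} - \frac{1}{2} \frac{(x - \mu)^2}{\psi} \right] dx \tag{3.2} \\
&= E\left[ \frac{1}{2} \frac{(x - \mu)^2}{\psi} \right] + \frac{1}{2} \log 2\pi\psi \tag{3.3} \\
&= \frac{1}{2} + \frac{1}{2} \log 2\pi\psi \tag{3.4} \\
&= \frac{1}{2} \log e + \frac{1}{2} \log 2\pi\psi \tag{3.5} \\
&= \frac{1}{2} \log 2\pi e \psi . \tag{3.6}
\end{align}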
The entropy of a Gaussian is a function of its variance and is not a function of its mean.
Wider Gaussians have more entropy and narrower Gaussians have less. There is a simple
procedure for finding the empirical entropy of a Gaussian distribution: compute the variance
of the sample and evaluate (3.6).
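The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not code from the thesis: the function names, the synthetic samples, and the choice of natural logarithms (entropy in nats) are assumptions made for the example.

```python
import math
import random
import statistics


def gaussian_entropy(variance):
    """Entropy (in nats) of a Gaussian with the given variance, per (3.6)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)


def empirical_gaussian_entropy(sample):
    """Fit a Gaussian through the sample variance, then evaluate (3.6)."""
    return gaussian_entropy(statistics.pvariance(sample))


# Hypothetical samples: the means differ, but only the spread matters.
rng = random.Random(0)
narrow = [rng.gauss(5.0, 1.0) for _ in range(10000)]
wide = [rng.gauss(-3.0, 2.0) for _ in range(10000)]

h_narrow = empirical_gaussian_entropy(narrow)
h_wide = empirical_gaussian_entropy(wide)
print(h_narrow, h_wide)
```

Shifting the mean leaves the estimate unchanged, while doubling the standard deviation adds about log 2 to the entropy.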
Paul A. Viola    CHAPTER 3. EMPIRICAL ENTROPY MANIPULATION AND STOCHASTIC GRADIENT DESCENT
The equivalence between the log of variance and entropy can be used to reformulate well
known signal and image processing problems as entropy problems. Since the logarithm is a
monotonically increasing function, any technique that maximizes or minimizes the variance
of a signal can be viewed as an entropy technique. Examples include principal components
analysis, where variance is maximized, and least squares solutions to matching problems,
where variance is minimized. There is, however, one significant caveat. Variance maximization
is only equivalent to entropy maximization if the density of the signal involved is Gaussian.
When this assumption is violated it is possible to reduce entropy as variance is increased.
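A small worked case makes the caveat concrete. The sketch below compares two non-Gaussian densities chosen purely for illustration (they do not appear in the thesis): a single uniform density of width 1, and an equal mixture of two narrow uniform bumps of width 0.1 centered at -1 and +1. Both have closed-form variance and differential entropy; natural logarithms are assumed.

```python
import math


def uniform_stats(width):
    """Variance and differential entropy (nats) of a uniform density."""
    return width ** 2 / 12.0, math.log(width)


def two_bump_stats(center, width):
    """Equal mixture of two uniform bumps of the given width at +/- center.

    The bumps must not overlap, so the density is 1 / (2 * width) on a
    support of total length 2 * width.
    """
    assert center > width / 2.0
    variance = center ** 2 + width ** 2 / 12.0
    entropy = math.log(2.0 * width)
    return variance, entropy


var_a, h_a = uniform_stats(1.0)
var_b, h_b = two_bump_stats(1.0, 0.1)

print("single uniform: variance=%.4f entropy=%.4f" % (var_a, h_a))
print("two bumps:      variance=%.4f entropy=%.4f" % (var_b, h_b))
```

The two-bump density has roughly twelve times the variance of the single uniform density, yet its entropy is lower, because its mass is concentrated on a support of total length 0.2. A Gaussian entropy formula evaluated on the variances alone would wrongly report an increase.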
All but a very few of the techniques that manipulate the entropy of a signal assume that the
signals are Gaussian or another exponential distribution (Linsker, 1988; Becker and Hinton,
1992; Bell and Sejnowski, 1995). We will discuss these techniques in Section 7.
Principal Components Analysis
There are a number of signal processing and learning problems that can be formulated as
entropy maximization problems. One well known example is principal components analysis.
The principal component of a $d$-dimensional distribution is a $d$-dimensional vector. Given a
density $X$, every vector $v$ defines a new random variable, $Y_v = X \cdot v$. The variance along an
axis $v$ is the variance of this new variable:
$$\mathrm{Var}(Y_v) = E_X\left[ \left( X \cdot v - E_X[X \cdot v] \right)^2 \right]. \tag{3.7}$$
The principal component $v$ is the vector for which $\mathrm{Var}(Y_v)$ is maximized.
In practice neither the density of $X$ nor $Y_v$ is known. The projection variance is computed
from a sample $a$ of points from $X$,
$$\mathrm{Var}(Y_v) \approx \mathrm{Var}_a(Y_v) \equiv E_a\left[ \left( X \cdot v - E_a[X \cdot v] \right)^2 \right]. \tag{3.8}$$
We can then define a vector cost function,
$$C(v) = -\mathrm{Var}_a(Y_v), \tag{3.9}$$
which is min...
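Equations (3.8) and (3.9) translate directly into code. The sketch below is illustrative only: the 2-D synthetic sample, the restriction of v to unit length, and the brute-force search over angles are assumptions made for this example, standing in for whatever optimization procedure the thesis goes on to use.

```python
import math
import random


def projection_variance(sample, v):
    """Var_a(Y_v): sample variance of the projections x . v, as in (3.8)."""
    proj = [sum(xi * vi for xi, vi in zip(x, v)) for x in sample]
    mean = sum(proj) / len(proj)
    return sum((p - mean) ** 2 for p in proj) / len(proj)


def cost(sample, v):
    """C(v) = -Var_a(Y_v), as in (3.9)."""
    return -projection_variance(sample, v)


# Hypothetical 2-D sample: wide along (1, 1)/sqrt(2), narrow across it.
rng = random.Random(0)
sample = []
for _ in range(2000):
    s, t = rng.gauss(0.0, 3.0), rng.gauss(0.0, 0.5)
    sample.append(((s - t) / math.sqrt(2.0), (s + t) / math.sqrt(2.0)))

# Assumption: v has unit length, so a 2-D axis is described by one angle.
# Brute-force search in 1-degree steps for the angle with the lowest cost.
angles = [math.pi * k / 180.0 for k in range(180)]
best_angle = min(angles, key=lambda a: cost(sample, (math.cos(a), math.sin(a))))
best_v = (math.cos(best_angle), math.sin(best_angle))
print(math.degrees(best_angle), best_v)
```

The lowest-cost direction lies close to 45 degrees, the axis along which the synthetic sample was given its largest spread.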