Paul A. Viola CHAPTER 2. PROBABILITY AND ENTROPY
log likelihood and entropy. Recall that entropy is
a statistic of X. The finite sample average of entropy, or empirical entropy, which will figure strongly later in the thesis, is
\[ h_a(X) \equiv -E_a[\log P_X(X)] = -\frac{1}{N_a} \sum_{x_a \in a} \log P_X(x_a) . \tag{2.25} \]
We can therefore conclude that
\[ h_a(X) = -\frac{1}{N_a} \log \ell(a) . \tag{2.26} \]
This provides us with an interpretation of model selection in terms of entropy. Instead of
finding the model that makes the data most likely, we could instead find the model that has
the lowest empirical entropy. Conversely, we could present a new interpretation of entropy:
a distribution has low entropy if the probability of a sample drawn from that distribution is
high. A distribution has high entropy if the sample has low probability. A density with a
very narrow peak has low entropy because most of the samples will fall in the region where
the density is large. A very broad density has high entropy because the samples are spread
out and fall where the density is low.
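The relationship between empirical entropy and log likelihood can be checked numerically. The following is a minimal sketch, assuming one-dimensional Gaussian densities with standard deviations 0.1 and 3.0, a sample size of 5000, and a fixed random seed; all of these values, and the function names, are illustrative choices rather than anything taken from the thesis.

```python
import math
import random

def gaussian_pdf(x, mean, variance):
    """One-dimensional Gaussian density evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / variance) / math.sqrt(2.0 * math.pi * variance)

def log_likelihood(sample, density):
    """Log probability of the whole sample, assuming independent draws."""
    return sum(math.log(density(x)) for x in sample)

def empirical_entropy(sample, density):
    """Negative average log probability of the sample points (equation 2.26)."""
    return -log_likelihood(sample, density) / len(sample)

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
narrow_sample = [rng.gauss(0.0, 0.1) for _ in range(5000)]
broad_sample = [rng.gauss(0.0, 3.0) for _ in range(5000)]

def narrow_density(x):
    return gaussian_pdf(x, 0.0, 0.1 ** 2)

def broad_density(x):
    return gaussian_pdf(x, 0.0, 3.0 ** 2)

h_narrow = empirical_entropy(narrow_sample, narrow_density)
h_broad = empirical_entropy(broad_sample, broad_density)
print("narrow peak:", h_narrow, "broad density:", h_broad)
```

The narrow density should report the lower empirical entropy, because its samples land where the density is large and so each contributes a large log probability.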
The close relationship between entropy and log likelihood is well known, but is often
overlooked by students of probability. As a result the parallels between research on entropy
and maximum likelihood can be easily missed. The fact that a system "manipulates entropy"
does not necessarily make it any better than, or different from, one based on likelihood. For
instance, log likelihood model selection can be derived directly from the entropy framework
using cross entropy. The cross entropy, $D(p_X \| \tilde{p}_X)$, or asymmetric divergence, is a measure
of the difference between two distributions:
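\begin{align*}
D(p_X \| \tilde{p}_X) &= E_X\left[ \log \frac{p_X(X)}{\tilde{p}_X(X)} \right] \\
&= \int \log\left( \frac{p_X(x')}{\tilde{p}_X(x')} \right) p_X(x') \, dx' \\
&= \int \log(p_X(x')) \, p_X(x') \, dx' - \int \log(\tilde{p}_X(x')) \, p_X(x') \, dx' \\
&= -h(X) - E_X[\log \tilde{p}_X(X)] \\
&\approx -h(X) - E_a[\log \tilde{p}_X(X)] \tag{2.31} \\
&= -h(X) + h_a(X) .
\end{align*}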
2.4. MODELING DENSITIES AI-TR 1548
Cross entropy is non-negative, reaching zero if and only if $p_X$ and $\tilde{p}_X$ are identical. Where
maximum likelihood model selection searches for the model that makes the sample most likely,
cross entropy model selection searches for the model that is closest, in the cross entropy sense,
to the true distribution. If we use the approximation in 2.31, the two procedures are in
fact identical. The first term in 2.31, $-h(X)$, is a constant and does not play a role in model
selection. The second term, which equals $h_a(X)$ evaluated under the model density, is $-1/N_a$ times the log likelihood of a sample drawn from X
under the density $\tilde{p}_X$. Minimization of cross entropy implies the maximization of likelihood.
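The equivalence of the two selection procedures can be illustrated with a small experiment. This is only a sketch under stated assumptions: the "true" density is taken to be a Gaussian with mean 2.0 and standard deviation 1.5, the four candidate models are arbitrary (mean, variance) pairs, and the names `approx_cross_entropy` and `candidates` are hypothetical. The true entropy is computed from the standard closed form for a Gaussian, which the thesis excerpt above does not state.

```python
import math
import random

def log_gaussian(x, mean, variance):
    """Log of a one-dimensional Gaussian density at x."""
    return -0.5 * math.log(2.0 * math.pi * variance) - 0.5 * (x - mean) ** 2 / variance

TRUE_MEAN, TRUE_STD = 2.0, 1.5  # assumed "true" distribution for this demo
rng = random.Random(1)          # fixed seed, chosen arbitrarily
sample = [rng.gauss(TRUE_MEAN, TRUE_STD) for _ in range(2000)]

# Candidate models as (mean, variance) pairs; the third matches the true density.
candidates = [(0.0, 1.0), (2.0, 0.5), (2.0, 2.25), (3.0, 4.0)]

def log_likelihood(model):
    mean, variance = model
    return sum(log_gaussian(x, mean, variance) for x in sample)

def empirical_entropy(model):
    """h_a(X) under the model: -1/N_a times the log likelihood."""
    return -log_likelihood(model) / len(sample)

# h(X) of the true Gaussian; it is the same constant for every candidate.
true_entropy = 0.5 * math.log(2.0 * math.pi * math.e * TRUE_STD ** 2)

def approx_cross_entropy(model):
    """Sample approximation of the cross entropy: -h(X) + h_a(X)."""
    return -true_entropy + empirical_entropy(model)

best_by_likelihood = max(candidates, key=log_likelihood)
best_by_cross_entropy = min(candidates, key=approx_cross_entropy)
print(best_by_likelihood, best_by_cross_entropy)
```

Because `true_entropy` does not depend on the candidate, the ranking by approximate cross entropy is the reverse of the ranking by log likelihood, so both criteria pick the same model. The approximate cross entropy of the best model can come out slightly negative, since it is a finite-sample estimate.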
2.4 Modeling Densities
In this section we will describe a number of techniques for estimating densities from data.
Understanding the process by which this is done is an important prerequisite for understanding the main algorithms in this thesis. We will begin with a discussion of the most widely
observed continuous density: the Gaussian density [3]. We will then derive a closed form
expression for the most likely Gaussian given a sample. This section will also include a
discussion of other parametric density functions and finally a non-parametric technique for
estimating densities known as Parzen window density estimation.
2.4.1 The Gaussian Density
The most ubiquitous of all random processes is the Gaussian or normal density. It literally
appears everywhere. The most common justification arises from the "central limit theorem",
which shows that the density of the sum of a very large number of independent random variables will tend toward Gaussian. An equally important justification is that the mathematics
of the Gaussian density are quite simple because it is an exponential. Moreover, since any
linear function of a Gaussian random variable is itself Gaussian, Gaussians are widely used in linear systems
theory. It is almost certainly the case that the majority of continuous random processes have
been modeled as Gaussians, whether they are or not. A Gaussian density is defined as:
\[ g_\psi(x - \mu) \equiv \frac{1}{\sqrt{2 \pi \psi}} \, e^{-\frac{1}{2} \frac{(x - \mu)^2}{\psi}} . \tag{2.33} \]
The parameters $\psi$ and $\mu$ are the variance and mean of the density. One can demonstrate this
with some clever integration.
[3] For the sake of brevity we will sometimes refer to a Gaussian density as simply a Gaussian.
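In place of the integration, the claim about the parameters can be checked numerically. The sketch below assumes a mean of 1.5 and a variance of 0.64 (arbitrary illustrative values) and uses a simple midpoint rule over ten standard deviations on either side of the mean; the helper `integrate` is hypothetical and is not a substitute for the analytic argument.

```python
import math

def gaussian(x, mean, variance):
    """One-dimensional Gaussian density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / variance) / math.sqrt(2.0 * math.pi * variance)

def integrate(f, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(f(lo + (i + 0.5) * width) for i in range(steps))

mean, variance = 1.5, 0.64  # illustrative parameter values
half_range = 10.0 * math.sqrt(variance)
lo, hi = mean - half_range, mean + half_range

# Total probability mass, first moment, and second central moment.
total_mass = integrate(lambda x: gaussian(x, mean, variance), lo, hi)
numeric_mean = integrate(lambda x: x * gaussian(x, mean, variance), lo, hi)
numeric_variance = integrate(
    lambda x: (x - numeric_mean) ** 2 * gaussian(x, mean, variance), lo, hi
)
print(total_mass, numeric_mean, numeric_variance)
```

The three printed values should be close to 1, the chosen mean, and the chosen variance respectively.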
The Gaussian density can also be defined in higher dimensions:
\[ g_\psi(x - \mu) \equiv \frac{1}{(2\pi)^{d/2} |\psi|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \psi^{-1} (x - \mu) \right) . \tag{2.34} \]
In a d dimensional space, the mean is a d-vector. The variance is replaced by a covariance
matrix $\psi$, a d-by-d matrix; $|\psi|$ is the determinant of $\psi$. Recall that variance is defined as the
expected square of the difference from the mean; covariance is somewhat more complex:
\[ \psi_{ij} = E[(X_i - E[X_i])(X_j - E[X_j])] , \tag{2.35} \]
where $X_i$ is the i'th component of the RV X. The diagonal entries of $\psi$ contain the variances
of the components. The off-diagonal entries measure the expected co-variation.
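The covariance definition and the higher-dimensional density can be made concrete in two dimensions. The following is a sketch only: it assumes a synthetic sample whose second component is 0.8 times the first plus independent noise, so that the covariance matrix is approximately [[1, 0.8], [0.8, 1]], and it inverts the 2-by-2 matrix by hand. The function names are hypothetical, and the expectations in the covariance definition are replaced by sample averages.

```python
import math
import random

def mean_and_covariance(points):
    """Sample mean vector and covariance matrix, with expectations replaced by averages."""
    n = len(points)
    d = len(points[0])
    mean = [sum(p[i] for p in points) / n for i in range(d)]
    cov = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
            for j in range(d)] for i in range(d)]
    return mean, cov

def gaussian_2d(x, mean, cov):
    """Two-dimensional Gaussian density (the d = 2 case of the general formula)."""
    (a, b), (c, e) = cov
    det = a * e - b * c
    inverse = ((e / det, -b / det), (-c / det, a / det))
    diff = (x[0] - mean[0], x[1] - mean[1])
    quad = sum(diff[i] * inverse[i][j] * diff[j] for i in range(2) for j in range(2))
    # (2*pi)^(d/2) equals 2*pi when d = 2.
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

rng = random.Random(2)  # fixed seed, chosen arbitrarily
points = []
for _ in range(20000):
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    points.append((z1, 0.8 * z1 + 0.6 * z2))  # unit variances, covariance 0.8

sample_mean, sample_cov = mean_and_covariance(points)
peak = gaussian_2d(sample_mean, sample_mean, sample_cov)
print(sample_cov, peak)
```

The diagonal entries of `sample_cov` estimate the two component variances and the off-diagonal entries estimate their co-variation; the density is largest at the mean and falls off away from it.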
Equation 2.33 defines an infinite family of density functions. Any one member of this