1995_Viola_thesis_registrationMI

… log likelihood and entropy. Recall that entropy is a statistic of X. The finite sample average of entropy, or empirical entropy, which will figure strongly later in this thesis, is

\hat{h}_a(X) \equiv -E_a[\log p_X(X)] = -\frac{1}{N_a} \sum_{x_a \in a} \log p_X(x_a) .    (2.25)

We can therefore conclude that

\hat{h}_a(X) = -\frac{1}{N_a} \log \ell_a ,    (2.26)

where \ell_a is the likelihood of the sample a. This provides us with an interpretation of model selection in terms of entropy. Instead of finding the model that makes the data most likely, we could instead find the model that has the lowest empirical entropy. Conversely, we can give a new interpretation of entropy: a distribution has low entropy if the probability of a sample drawn from it is high, and high entropy if the sample has low probability. A density with a very narrow peak has low entropy because most of the samples fall in the region where the density is large; a very broad density has high entropy because the samples are spread out and fall where the density is low.

The close relationship between entropy and log likelihood is well known, but it is often overlooked by students of probability. As a result, the parallels between research on entropy and research on maximum likelihood are easily missed. The fact that a system "manipulates entropy" does not make it necessarily any better, or different, than one based on likelihood. For instance, log likelihood model selection can be derived directly from the entropy framework using cross entropy. The cross entropy D(p_X \| \tilde{p}_X), or asymmetric divergence, is a measure of the difference between two distributions:

D(p_X \| \tilde{p}_X) = E_X\!\left[\log \frac{p_X(X)}{\tilde{p}_X(X)}\right]    (2.27)
    = \int_{-\infty}^{\infty} \log\!\left(\frac{p_X(x')}{\tilde{p}_X(x')}\right) p_X(x')\, dx'    (2.28)
    = \int_{-\infty}^{\infty} \log(p_X(x'))\, p_X(x')\, dx' - \int_{-\infty}^{\infty} \log(\tilde{p}_X(x'))\, p_X(x')\, dx'    (2.29)
    = -h(X) - E_X[\log \tilde{p}_X(X)]    (2.30)
    \approx -h(X) - E_a[\log \tilde{p}_X(X)]    (2.31)
    = -h(X) + \hat{h}_a(X) .    (2.32)

Cross entropy is non-negative, reaching zero if and only if p_X and \tilde{p}_X are identical. Where maximum likelihood model selection searches for the model that makes the sample most likely, cross entropy model selection searches for the model that is closest, in the cross entropy sense, to the true distribution. If we use the approximation in (2.31), the two procedures are in fact identical. The first term in (2.31), h(X), is a constant and plays no role in model selection. The second term, \hat{h}_a(X), is -1/N_a times the log likelihood of a sample drawn from X, evaluated under the density \tilde{p}_X. Minimization of cross entropy therefore implies maximization of likelihood.
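To make this equivalence concrete, the following is a minimal numerical sketch (not part of the thesis) in Python with NumPy. It draws a sample from an assumed true density, evaluates two hypothetical candidate model densities on it, and checks that the empirical entropy of equation (2.25), computed under each candidate, equals minus the per-sample log likelihood, so ranking models by lowest empirical entropy and by highest likelihood gives the same answer. The true density, the candidates, and all parameter values are invented for illustration.

```python
# Minimal sketch (not from the thesis): empirical entropy vs. log likelihood.
# All densities and parameter values below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=1.0, scale=2.0, size=10_000)  # the sample "a", N_a points

def gaussian_pdf(x, mu, sigma):
    """1-D Gaussian density with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def empirical_entropy(x, pdf):
    """Equation (2.25): minus the sample average of the log density."""
    return -np.mean(np.log(pdf(x)))

def log_likelihood(x, pdf):
    """log(l_a): the sum of the log densities over the sample."""
    return np.sum(np.log(pdf(x)))

# Two candidate models p~: one close to the truth, one badly mismatched.
candidates = {
    "good model": lambda x: gaussian_pdf(x, mu=1.0, sigma=2.0),
    "bad model":  lambda x: gaussian_pdf(x, mu=0.0, sigma=0.5),
}

for name, model in candidates.items():
    h_a = empirical_entropy(sample, model)
    per_sample_nll = -log_likelihood(sample, model) / sample.size
    # Equation (2.26): h_a(X) = -(1/N_a) log(l_a), so the two numbers agree.
    print(f"{name}: empirical entropy {h_a:.4f}, "
          f"-(1/N_a) log-likelihood {per_sample_nll:.4f}")
```

The candidate that matches the true density achieves the lower empirical entropy and, equivalently, the higher likelihood; the constant term h(X) in (2.31) never enters the comparison, which is why it can be ignored during model selection.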
2.4 Modeling Densities

In this section we will describe a number of techniques for estimating densities from data. Understanding the process by which this is done is an important prerequisite for understanding the main algorithms in this thesis. We will begin with a discussion of the most widely observed continuous density: the Gaussian density (for the sake of brevity we will sometimes refer to a Gaussian density as simply a Gaussian). We will then derive a closed form expression for the most likely Gaussian given a sample. This section will also include a discussion of other parametric density functions and, finally, a non-parametric technique for estimating densities known as Parzen window density estimation.

2.4.1 The Gaussian Density

The most ubiquitous of all random processes is the Gaussian or normal density. It literally appears everywhere. The most common justification arises from the "central limit theorem", which shows that the density of the sum of a very large number of independent random variables tends toward a Gaussian. An equally important justification is that the mathematics of the Gaussian density are quite simple because it is an exponential. Moreover, since any linear function of a Gaussian random variable is itself Gaussian, Gaussians are widely used in linear systems theory. It is almost certainly the case that the majority of continuous random processes have been modeled as Gaussians, whether they truly are or not. A Gaussian density is defined as

g(x - \mu, \sigma) \equiv \frac{1}{\sqrt{2\pi\sigma}} \exp\!\left(-\frac{1}{2} \frac{(x - \mu)^2}{\sigma}\right) .    (2.33)

The parameters \sigma and \mu are the variance and mean of the density; one can demonstrate this with some clever integration. The Gaussian density can also be defined in higher dimensions:

g(x - \mu, \Sigma) \equiv \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right) .    (2.34)

In a d-dimensional space the mean \mu is a d-vector, and the variance is replaced by a covariance matrix \Sigma, a d-by-d matrix; |\Sigma| is the determinant of \Sigma. Recall that variance is defined as the expected square of the difference from the mean; covariance is somewhat more complex:

\Sigma_{ij} = E[(X_i - E[X_i])(X_j - E[X_j])] ,    (2.35)

where X_i is the i'th component of the RV X. The diagonal entries of \Sigma contain the variances of the individual components, while the off-diagonal entries measure the expected co-variation between components.

Equation (2.33) defines an infinite family of density functions. Any one member of this fami…
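To accompany equations (2.33)-(2.35), here is a small sketch (again not part of the thesis) in Python with NumPy that evaluates the d-dimensional Gaussian density of equation (2.34) and fits a Gaussian to data using the standard maximum likelihood estimates, namely the sample mean and the 1/N-normalized sample covariance. The thesis only announces its own derivation of this closed form above, so these estimates are quoted here as the well-known result; the synthetic data and all numbers are assumptions made for illustration.

```python
# Minimal sketch (not from the thesis): the d-dimensional Gaussian density of
# equation (2.34) and a maximum likelihood fit of its parameters.
# The data below are synthetic and purely illustrative.
import numpy as np

def gaussian_density(x, mu, cov):
    """Evaluate equation (2.34) at a point x, with mean mu and covariance cov."""
    d = mu.shape[0]
    diff = x - mu
    norm = (2.0 * np.pi) ** (d / 2.0) * np.sqrt(np.linalg.det(cov))
    quad = diff @ np.linalg.inv(cov) @ diff  # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad) / norm

# A synthetic 2-D sample drawn from an assumed "true" Gaussian.
rng = np.random.default_rng(1)
true_mu = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.6],
                     [0.6, 1.0]])
sample = rng.multivariate_normal(true_mu, true_cov, size=5_000)

# Maximum likelihood estimates: the sample mean and the (1/N-normalized)
# sample covariance, i.e. equation (2.35) with expectations replaced by
# sample averages.
mu_hat = sample.mean(axis=0)
cov_hat = np.cov(sample, rowvar=False, bias=True)

print("ML mean      :", mu_hat)
print("ML covariance:\n", cov_hat)
print("density at the estimated mean:", gaussian_density(mu_hat, mu_hat, cov_hat))
```

The diagonal of cov_hat recovers the per-component variances and the off-diagonal entries the expected co-variation, exactly as described after equation (2.35).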