Empirical evidence argues against using Parzen


2.4. MODELING DENSITIES (AI-TR 1548)

[...] contained in a sphere of radius $R$, on average $N_a (r/R)^d$ data points will fall in a randomly chosen window, where $d$ is the dimensionality. Generally, $r$ is selected so that $r/R \ll 1$. As a result, with increased dimension the number of points falling in a randomly chosen window drops exponentially, and the normalized standard deviation of $P$ will increase rapidly. This implies that the Parzen density estimate will become very unreliable as dimensionality increases.

While in theory this could be remedied by increasing the size of the sample exponentially, things rapidly get out of hand. Empirical evidence argues against using Parzen estimation in many more than six dimensions.

Chapter 3

Empirical Entropy Manipulation and Stochastic Gradient Descent

This chapter presents a novel technique for evaluating and manipulating the empirical entropy of a distribution, called EMMA. The theory of entropy manipulation plays a critical role in the rest of this thesis and forms the algorithmic core in all of the applications.

There are a number of existing techniques that manipulate the entropy of a density. They each have significant theoretical and practical limitations that make them unsuitable for our purposes. We will begin with a description of these techniques, and two simple applications. The second part of the chapter describes a new procedure for evaluating empirical entropy, EMMA. We will present an efficient stochastic gradient scheme for extremizing the EMMA estimates. This scheme has applications outside of entropy manipulation. The final section of this chapter presents a tutorial application of EMMA. We will show how EMMA can be used to derive an information theoretic version of principal components analysis.

3.1 Empirical Entropy

As we saw in the previous chapter, in many cases the true density of a random variable is not known.
Instead one must make do with an estimate of the density obtained from a sample of the variable. Likewise, there is no direct procedure for evaluating entropy from a sample. A common approach is to first model the density from the sample, and then estimate the entropy from the density. This divides the problem into manageable parts which can be solved separately.

By far the most popular density model for entropy calculations is the Gaussian. This is based on two considerations: (1) finding the Gaussian that fits the data best is very easy (see Section 2.3.1), and (2) the entropy of the Gaussian can be directly calculated from its variance. The entropy of a Gaussian distribution is

\begin{align}
h(X) &= -\int_{-\infty}^{\infty} g(x - \mu) \log g(x - \mu) \, dx \tag{3.1} \\
&= -\int_{-\infty}^{\infty} g(x - \mu) \left[ \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2} \frac{(x - \mu)^2}{\sigma^2} \right] dx \tag{3.2} \\
&= E_X\left[ \frac{1}{2} \frac{(x - \mu)^2}{\sigma^2} \right] + \frac{1}{2} \log 2\pi\sigma^2 \tag{3.3} \\
&= \frac{1}{2} + \frac{1}{2} \log 2\pi\sigma^2 \tag{3.4} \\
&= \frac{1}{2} \log e + \frac{1}{2} \log 2\pi\sigma^2 \tag{3.5} \\
&= \frac{1}{2} \log 2\pi e \sigma^2 . \tag{3.6}
\end{align}

where $g(x - \mu)$ denotes the Gaussian density with mean $\mu$ and variance $\sigma^2$. The entropy of a Gaussian is a function of its variance and is not a function of its mean. Wider Gaussians have more entropy and narrower Gaussians have less. There is a simple procedure for finding the empirical entropy of a Gaussian distribution: compute the variance of the sample and evaluate (3.6).

CHAPTER 3. EMPIRICAL ENTROPY MANIPULATION AND STOCHASTIC GRADIENT DESCENT (Paul A. Viola)

The equivalence between the log of variance and entropy can be used to reformulate well known signal and image processing problems as entropy problems. Since the logarithm is a monotonically increasing function, any technique that maximizes or minimizes the variance of a signal can be viewed as an entropy technique. Examples include principal components analysis, where variance is maximized, and least-squares solutions to matching problems, where variance is minimized. There is, however, one significant caveat. Variance maximization is only equivalent to entropy maximization if the density of the signal involved is Gaussian. When this assumption is violated it is possible to reduce entropy as variance is increased.
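The procedure described above (compute the sample variance, then evaluate the closed form for the entropy of a Gaussian) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `gaussian_entropy` and `empirical_gaussian_entropy` are hypothetical, the natural logarithm is assumed so that the result is in nats, and the variance is taken to be the maximum-likelihood (divide by n) estimate.

```python
import math
import random


def gaussian_entropy(variance):
    """Differential entropy, in nats, of a 1-D Gaussian with this variance."""
    # Closed form: (1/2) * log(2 * pi * e * variance); natural log is an assumption.
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)


def empirical_gaussian_entropy(sample):
    """Fit a Gaussian to the sample via its variance, then apply the closed form."""
    n = len(sample)
    mean = sum(sample) / n
    # Maximum-likelihood variance estimate (divides by n, an assumption here).
    variance = sum((x - mean) ** 2 for x in sample) / n
    return gaussian_entropy(variance)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the illustration is repeatable
    narrow = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
    wide = [rng.gauss(5.0, 3.0) for _ in range(10_000)]
    shifted = [x + 100.0 for x in narrow]  # same spread, different mean

    print(empirical_gaussian_entropy(narrow))
    print(empirical_gaussian_entropy(wide))     # larger: the wider Gaussian
    print(empirical_gaussian_entropy(shifted))  # agrees with `narrow` up to rounding
```

The estimate depends only on the spread of the sample, not on its location, and it is a meaningful entropy estimate only to the extent that the Gaussian assumption holds for the data.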
All but a very few of the techniques that manipulate the entropy of a signal assume that the signals are Gaussian or another exponential distribution (Linsker, 1988; Becker and Hinton, 1992; Bell and Sejnowski, 1995). We will discuss these techniques in Section 7.

Principal Components Analysis

There are a number of signal processing and learning problems that can be formulated as entropy maximization problems. One well known example is principal components analysis. The principal component of a $d$ dimensional distribution is a $d$ dimensional vector. Given a density $X$, every vector $v$ defines a new random variable, $Y_v = X \cdot v$. The variance along an axis $v$ is the variance of this new variable:

$$\mathrm{Var}(Y_v) = E_X\left[ \left( X \cdot v - E_X[X \cdot v] \right)^2 \right] . \tag{3.7}$$

The principal component $v$ is the vector for which $\mathrm{Var}(Y_v)$ is maximized. In practice neither the density of $X$ nor $Y_v$ is known. The projection variance is computed from a sample $a$ of points from $X$,

$$\mathrm{Var}(Y_v) \approx \mathrm{Var}_a(Y_v) \equiv E_a\left[ \left( X \cdot v - E_a[X \cdot v] \right)^2 \right] . \tag{3.8}$$

We can then define a vector cost function,

$$C(v) = -\mathrm{Var}_a(Y_v) , \tag{3.9}$$

which is min...

This note was uploaded on 02/10/2010 for the course TBE 2300 taught by Professor Cudeback during the Spring '10 term at Webber.
