Paul A. Viola, Chapter 2: Probability and Entropy


... select the best set of parameters for a given sample a. It is possible to search for the correct parameter vector using gradient ascent, but for Gaussian mixture models there is a more efficient technique known as Expectation Maximization (Dempster et al., 1977). In either case finding the best parameter vector can involve a lengthy search process.

While mixture models are fairly popular, almost any parameterized function can be used as a density estimate. Within the neural networks literature some have trained back-propagation neural networks to approximate densities (Jacobs et al., 1991; see also Haykin, 1994, for an excellent review of neural network research). There is nothing terribly special about using a neural network for this purpose. It is just another form of parametric density estimation.

Now that we have some feel for the density estimation process, it is critical that important limitations be pointed out. Whenever one estimates a density from a sample, a very important first step is required: assumptions must be made about the form of the density. The space of possible functions is so large that for any sample there are an infinite number of density functions that fit it equally well.

Continuous densities can defy intuition. For instance, it is always possible to define a density that makes a given sample infinitely likely. Take for example a density function made up of delta functions. As we did in Section 2.2, we could make up a function with a delta function for each trial:

$$p(x) = \frac{1}{N_a} \sum_{x_a \in a} \delta(x - x_a). \qquad (2.40)$$

The likelihood of this model density is then infinite, and is guaranteed to be bigger than any other density's likelihood. Wouldn't this imply that fitting any other density is sub-optimal? While intuition argues against such an artificial density, there is no principled scheme for dealing with this dilemma.
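The delta-function construction can be checked numerically. The following is a minimal sketch, not part of the original text: it assumes narrow Gaussians of width `sigma` as a stand-in for the delta functions of equation 2.40, and the sample values and function names are invented for illustration. As `sigma` shrinks, the log-likelihood of the sample under its own spike density keeps growing.

```python
import math


def gaussian(x, mean, sigma):
    """Density of a 1-D Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def spike_density(x, sample, sigma):
    """Average of narrow Gaussians, one centered on each sample point.

    As sigma goes to zero this approaches a sum of delta functions.
    """
    return sum(gaussian(x, xa, sigma) for xa in sample) / len(sample)


def log_likelihood(sample, sigma):
    """Log-likelihood of the sample under the density built from that same sample."""
    return sum(math.log(spike_density(x, sample, sigma)) for x in sample)


# Hypothetical data, chosen only for illustration.
sample = [-1.3, -0.2, 0.1, 0.9, 2.4]
sigmas = (1.0, 0.1, 0.01, 0.001)

# Narrower spikes always score a higher likelihood on the sample itself.
log_likelihoods = [log_likelihood(sample, s) for s in sigmas]
for s, ll in zip(sigmas, log_likelihoods):
    print(f"sigma={s:<6} log-likelihood={ll:.3f}")
```

For small widths the log-likelihood grows roughly like -N log(sigma), so no finite-width model can ever beat the limiting spike density on likelihood alone, even though that density says nothing useful about new data.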
Much has been written about this problem in the machine learning literature, where it is called function approximation. There simply is not enough information in a finite sample to uniquely determine which of the infinite number of possible functions fits the sample best. The only solution is to make strong assumptions about the form of the correct function: for example that it is Gaussian, that it is polynomial, or that it is smooth. These assumptions provide a strong prior probability over the space of possible functions. Together the likelihood of a function and its prior probability can often uniquely determine a solution.

Maximum likelihood model selection is not guaranteed to do a good job of density estimation. There are three reasons why the most likely model may fail to be an accurate model. The first reason is that the set of evaluated models may not contain the correct model. This is called inductive inadequacy and it arises when the underlying assumptions about the density are wrong. The second reason is that maximum likelihood can be fooled. An especially unlikely sample, as in our example of the unbiased coin (see Section 2.3), can lead to a model estimate that is not correct. This is a question of confidence. A larger sample is less likely to be unusual, and gives us more confidence in our model. The third reason is that the search through parameter space may fail. A good solution may exist but cannot be found, for example if there are local minima.

2.4. MODELING DENSITIES (AI-TR 1548)

2.4.3 Parzen Window Density Estimation

The final class of density functions we will discuss is the class of non-parametric density estimators. For these models no search for parameters is needed. While parametric methods use the parameters as the model, non-parametric methods use the sample to directly define the model. The non-parametric scheme on which we will focus is known as Parzen window density estimation.
The general form of the density is:

$$P(x; a) = \frac{1}{N_a} \sum_{x_a \in a} R(x - x_a), \qquad (2.41)$$

where a is a sample, N_a is the number of points in it, and R is a valid density function. The function R is often called the smoothing or window function. The quality of the approximation is dependent both on the functional form of R and its width. Different window functions will lead to very different density estimates. The Gaussian density is a common selection for R, making the Parzen density estimate a mixture of Gaussians. There is one Gaussian centered at each sample.

Figure 2.4 contains a graph of a density, a sample, and the Parzen estimate constructed from the sample. Figure 2.5 contains a graph of ten different Parzen estimates from ten different samples. The different Parzen estimates do not show significantly more variation than the Gaussian estimates shown in Figure 2.2.

In practice the Parzen density estimate is much more flexible than a parametric density estimate. Where parametric techniques make very strong assumptions about the functional form of the density to be approximated, Parzen estimation requires only that the density be smooth. Figure 2.6 shows the Parzen density estimate of a bimodal distribution. Contrast it to the p...
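The Parzen estimate of equation 2.41 is short enough to write out directly. What follows is a minimal sketch rather than the author's implementation: it assumes a one-dimensional Gaussian window for R, and the window width, the bimodal sample, and all function names are hypothetical choices made for illustration.

```python
import math


def gaussian_window(u, width):
    """Gaussian window R(u) with standard deviation `width` (integrates to one)."""
    return math.exp(-0.5 * (u / width) ** 2) / (width * math.sqrt(2.0 * math.pi))


def parzen_density(x, sample, width):
    """Parzen estimate at x: the average of R(x - x_a) over all x_a in the sample."""
    return sum(gaussian_window(x - xa, width) for xa in sample) / len(sample)


# Hypothetical bimodal sample, with clusters near -2 and +2.
sample = [-2.1, -1.9, -2.4, -1.6, 1.8, 2.2, 2.0, 2.5]
width = 0.5  # assumed window width; the estimate is sensitive to this choice

# No parameter search is needed: the sample itself defines the model.
for x in (-2.0, 0.0, 2.0):
    print(f"P({x:+.1f}) = {parzen_density(x, sample, width):.4f}")

# Riemann-sum check that the estimate is itself a valid density on [-10, 10].
step = 0.01
area = sum(parzen_density(-10.0 + i * step, sample, width) * step for i in range(2001))
print(f"approximate integral = {area:.4f}")
```

Because each window is a valid density and the estimate averages N_a of them, the result integrates to one for any sample. The estimate follows both clusters without being told that the data are bimodal, which is the flexibility the text attributes to Parzen estimation; a much larger `width` would smear the two modes together, and a much smaller one would approach the spiky delta-function density of equation 2.40.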