lect the best set
of parameters for a given sample a. It is possible to search for the correct parameter vector
using gradient ascent, but for Gaussian mixture models there is a more efficient technique
known as Expectation Maximization (Dempster et al., 1977). In either case finding the best
parameter vector can involve a lengthy search process.
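As a rough illustration of the search described above, the following sketch runs Expectation Maximization on a one-dimensional mixture of two Gaussians. It is a minimal sketch under stated assumptions, not the procedure of Dempster et al. as published: the function names (`gaussian_pdf`, `log_likelihood`, `em_step`), the synthetic sample, the starting parameters and the fixed count of 50 iterations are all hypothetical choices made for this example.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(sample, weights, means, variances):
    """Log likelihood of the sample under the mixture, assuming independent trials."""
    total = 0.0
    for x in sample:
        mix = sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
        total += math.log(mix)
    return total


def em_step(sample, weights, means, variances):
    """One EM update for a one-dimensional Gaussian mixture."""
    # E-step: responsibility of each component for each sample point.
    resp = []
    for x in sample:
        joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        norm = sum(joint)
        resp.append([j / norm for j in joint])

    # M-step: re-estimate each component from its responsibility-weighted points.
    new_weights, new_means, new_variances = [], [], []
    for k in range(len(weights)):
        r_k = sum(r[k] for r in resp)
        mean_k = sum(r[k] * x for r, x in zip(resp, sample)) / r_k
        var_k = sum(r[k] * (x - mean_k) ** 2 for r, x in zip(resp, sample)) / r_k
        new_weights.append(r_k / len(sample))
        new_means.append(mean_k)
        new_variances.append(var_k)
    return new_weights, new_means, new_variances


# Hypothetical data: two well-separated clusters drawn with a fixed seed.
rng = random.Random(0)
sample = [rng.gauss(-2.0, 0.5) for _ in range(200)] + [rng.gauss(3.0, 1.0) for _ in range(200)]

# Hypothetical starting guess for the parameter vector.
weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
history = [log_likelihood(sample, weights, means, variances)]
for _ in range(50):
    weights, means, variances = em_step(sample, weights, means, variances)
    history.append(log_likelihood(sample, weights, means, variances))

print("estimated means:", means)
print("log likelihood, first and last:", history[0], history[-1])
```

Each update should leave the log likelihood no lower than before, which is what makes the method attractive compared with tuning a step size for gradient ascent. The sketch omits safeguards a practical implementation would need, such as a floor on the variances and a convergence test.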
39 Paul A. Viola CHAPTER 2. PROBABILITY AND ENTROPY
While mixture models are fairly popular, almost any parameterized function can be used
as a density estimate. Within the neural networks literature some have trained back propagation neural networks to approximate densities (Jacobs et al., 1991; see also Haykin, 1994,
for an excellent review of neural network research). There is nothing terribly special about
using a neural network for this purpose. It is just another form of parametric density estimation.
Now that we have some feel for the density estimation process, it is critical that important
limitations be pointed out. Whenever one estimates a density from a sample a very important
first step is required: assumptions must be made about the form of the density. The space
of possible functions is so large that for any sample there are an infinite number of density
functions that fit it equally well. Continuous densities can defy intuition. For instance it
is always possible to define a density that makes a given sample infinitely likely. Take for
example a density function made up of delta functions. As we did in Section 2.2 we could
make up a function with a delta function for each trial:
$$ p(x) = \frac{1}{N_a} \sum_{x_a \in a} \delta(x - x_a) . $$
The likelihood of this model density is then infinite, and is guaranteed to be bigger than any other
density's likelihood. Wouldn't this imply that fitting any other density is sub-optimal? While
intuition argues against such an artificial density, there is no principled scheme for dealing
with this dilemma.
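The dilemma can be seen numerically. The sketch below is an illustration under an assumption of ours: since a delta function cannot be evaluated directly, each delta is replaced by a narrow Gaussian of width `sigma` centered on a sample point, and the width is then shrunk. The sample values and the name `log_likelihood_spikes` are hypothetical.

```python
import math


def log_likelihood_spikes(sample, sigma):
    """Log likelihood of `sample` under equal-weight Gaussians of width `sigma`,
    one centered at each sample point (a stand-in for the delta-function density)."""
    n = len(sample)
    norm = n * sigma * math.sqrt(2.0 * math.pi)
    total = 0.0
    for x in sample:
        # The term with xa == x always contributes exp(0) = 1, so the density is positive.
        density = sum(math.exp(-((x - xa) ** 2) / (2.0 * sigma * sigma)) for xa in sample) / norm
        total += math.log(density)
    return total


# Hypothetical sample of five trials.
sample = [0.3, 1.1, 1.9, 2.4, 4.0]
sigmas = [1.0, 0.1, 0.01, 0.001]
log_liks = [log_likelihood_spikes(sample, s) for s in sigmas]

for s, ll in zip(sigmas, log_liks):
    print(f"sigma = {s:<6} log likelihood = {ll:.2f}")
```

Once the spikes are narrow enough that they no longer overlap, every further tenfold reduction in width adds the same fixed amount to the log likelihood, so the likelihood grows without bound as the Gaussians approach delta functions.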
Much has been written about this problem in the machine learning literature, where it
is called function approximation. There simply is not enough information in a finite sample
to uniquely determine which of the infinite number of possible functions fits the sample best.
The only solution is to make strong assumptions about the form of the correct function: for
example that it is Gaussian, that it is polynomial, or that it is smooth. These assumptions
provide a strong prior probability over the space of possible functions. Together the likelihood
of a function and its prior probability can often uniquely determine a solution.
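One common way to write this combination down is the maximum a posteriori choice sketched here. The notation is ours and not taken from the text: f ranges over candidate density functions, p(f) is the prior over functions, and the likelihood is the product over independent trials.

```latex
\hat{f} = \arg\max_{f} \; p(a \mid f) \, p(f),
\qquad
p(a \mid f) = \prod_{x_a \in a} f(x_a)
```

When the prior assigns zero probability to everything outside a chosen family (Gaussians, say), this reduces to maximum likelihood within that family.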
40 2.4. MODELING DENSITIES AI-TR 1548
Maximum likelihood model selection is not guaranteed to do a good job of density estimation. There are three reasons why the most likely model may fail to be an accurate
model. The first reason is that the set of evaluated models may not contain the correct
model. This is called inductive inadequacy and it arises when the underlying assumptions
about the density are wrong. The second reason is that maximum likelihood can be fooled.
An especially unlikely sample, as in our example of the unbiased coin (see Section 2.3), can
lead to a model estimate that is not correct. This is a question of confidence. A larger sample
is less likely to be unusual, and gives us more confidence in our model. The third reason is
that the search through parameter space may fail. A good solution may exist but cannot be
found, for example if there are local minima.
2.4.3 Parzen Window Density Estimation
The final class of density functions we will discuss are called non-parametric density estimators. For these models no search for parameters is needed. While parametric methods use
the parameters as the model, non-parametric methods use the sample to directly define the
model.
The non-parametric scheme on which we will focus is known as Parzen window density
estimation. The general form of the density is:
$$ P(x; a) = \frac{1}{N_a} \sum_{x_a \in a} R(x - x_a) , $$
where a is a sample and R is a valid density function. The function R is often called the
smoothing or window function. The quality of the approximation is dependent both on the
functional form of R and its width. Different window functions will lead to very different
density estimates. The Gaussian density is a common selection for R, making the Parzen
density estimate a mixture of Gaussians. There is one Gaussian centered at each sample.
Figure 2.4 contains a graph of a density, a sample, and the Parzen estimate constructed from
the sample. Figure 2.5 contains a graph of ten different Parzen estimates from ten different
samples. The different Parzen estimates do not show significantly more variation than the
Gaussian estimates shown in Figure 2.2.
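The Parzen estimate with a Gaussian window is short enough to write out directly. The following is a minimal sketch, assuming a one-dimensional sample; the names `gaussian_window` and `parzen_density`, the synthetic bimodal sample and the two window widths (0.3 and 3.0) are hypothetical choices for illustration and do not reproduce the data behind the figures.

```python
import math
import random


def gaussian_window(u, width):
    """Gaussian window function R with standard deviation `width`."""
    return math.exp(-(u * u) / (2.0 * width * width)) / (width * math.sqrt(2.0 * math.pi))


def parzen_density(x, sample, width):
    """Parzen estimate at x: the average of one window centered at each sample point."""
    return sum(gaussian_window(x - xa, width) for xa in sample) / len(sample)


# Hypothetical bimodal sample: two clusters centered at -2 and +2.
rng = random.Random(1)
sample = [rng.gauss(-2.0, 0.5) for _ in range(100)] + [rng.gauss(2.0, 0.5) for _ in range(100)]

# The estimate is itself a density, so a Riemann sum over a wide grid should be close to 1.
step = 0.01
grid = [-8.0 + i * step for i in range(1601)]
area = sum(parzen_density(x, sample, 0.3) for x in grid) * step

# Compare the estimate at one mode (x = -2) and at the trough between the modes (x = 0).
narrow_mode = parzen_density(-2.0, sample, 0.3)
narrow_trough = parzen_density(0.0, sample, 0.3)
wide_mode = parzen_density(-2.0, sample, 3.0)
wide_trough = parzen_density(0.0, sample, 3.0)

print("area under narrow estimate:", area)
print("width 0.3: mode", narrow_mode, "trough", narrow_trough)
print("width 3.0: mode", wide_mode, "trough", wide_trough)
```

No parameter search is involved: the sample and the window width fully determine the estimate. The comparison of the two widths illustrates the dependence on window width noted above. With the narrow window the two modes are clearly separated, while with the very wide window the estimate is higher between the clusters than at either cluster center, so the bimodal structure is smoothed away.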
In practice the Parzen density estimate is much more flexible than a parametric density
estimate. Where parametric techniques make very strong assumptions about the functional
form of the density to be approximated, Parzen estimation requires only that the density be
smooth. Figure 2.6 shows the Parzen density estimate of a bimodal distribution. Contrast it
to the p...