... $\frac{d\,p(y_b; a)}{dv}$, is far more
problematic. The second component is a measure of the change in the density estimate that
results from changes in the sample a. In the Parzen framework the two components of the
derivative collapse into a single term and can be directly computed from the samples. In the
maximum likelihood framework $p(y_b; a)$ is a complex function of the sample. Since there is
no closed form function that computes the density estimate from the sample, computing its
derivative can be very difficult.

3.3 Stochastic Maximization Algorithm

The variance maximization/minimization applications described above (principal components analysis and learning) are deterministic procedures. Starting from an initial guess,
gradient descent uses the derivative of cost to repeatedly update the parameter vector. Two
different runs that start from the same initial parameters will end up with the same final
parameters. Our justification for using probability and entropy to analyze these problems
is purely convenience. There is nothing random about these problems once the samples are
drawn. One of the benefits of understanding the probabilistic interpretation of these problems
is that we can introduce randomness into our solutions and understand its effect. Here is a
simple example: we want to know the average of a large sample of data. Without knowing
anything else, it would make sense to sum over the entire sample. But, if we needed only
a rough estimate of the average, significant computation could be saved by averaging over
a subset of the sample. Furthermore, knowledge of the sample variance would allow us to
compute the size of the subsample needed to estimate the mean to a given precision.
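The subsample-size calculation described above can be sketched as follows. This is a minimal illustration, not part of the original text: it assumes the usual normal approximation for the sample mean, so a subsample of size n has a confidence half-width of roughly z * sigma / sqrt(n). The function name `subsample_size`, the synthetic data, the pilot sample of 200 points, and the tolerance of 0.1 are all hypothetical choices.

```python
import math
import random
import statistics


def subsample_size(std_dev, tolerance, z=1.96):
    """Smallest n with z * std_dev / sqrt(n) <= tolerance (normal approximation)."""
    return math.ceil((z * std_dev / tolerance) ** 2)


rng = random.Random(0)
data = [rng.gauss(5.0, 2.0) for _ in range(100_000)]  # synthetic stand-in data

# A small pilot sample supplies the variance estimate (an assumed extra step).
pilot = rng.sample(data, 200)
n = subsample_size(statistics.stdev(pilot), tolerance=0.1)

# Average over n points instead of all 100,000.
subsample = rng.sample(data, min(n, len(data)))
print(n, statistics.fmean(subsample), statistics.fmean(data))
```

With a standard deviation near 2 and a tolerance of 0.1, the required subsample is on the order of 1,500 points, a small fraction of the full sample.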
A similar analysis can be applied to principal components analysis or function learning.
The cost of a particular parameter vector is computed by summing over an entire sample
as we did in Equation 3.14. But, when that sample is very large this expectation can be
approximated by a smaller random sample. The same argument applies to the gradient.
Since the gradient is defined as an average over a very large sample it may make sense to
approximate it over a smaller random sample. When we use random samples, both the error
estimate and the gradient estimate are now truly random. For very large samples, accurate
error/gradient estimates can be made without averaging over the entire sample. For problems
where the gradient needs to be evaluated often, this can save significant computation.
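As a rough illustration of approximating a gradient over a smaller random sample, the sketch below compares the gradient of a mean squared error computed over a full sample with the same quantity computed over a random subset. The one-parameter model y = w * x, the synthetic data, and the name `mse_gradient` are assumptions made for the example and do not come from the text.

```python
import random


def mse_gradient(w, points):
    """Derivative of the mean squared error of the model y = w * x w.r.t. w."""
    return sum(2.0 * (w * x - y) * x for x, y in points) / len(points)


rng = random.Random(1)
data = []
for _ in range(50_000):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 3.0 * x + rng.gauss(0.0, 0.1)))  # synthetic targets

w = 0.0
full_gradient = mse_gradient(w, data)                     # averages 50,000 terms
random_gradient = mse_gradient(w, rng.sample(data, 500))  # averages 500 terms
print(full_gradient, random_gradient)
```

The random estimate costs one hundredth of the full evaluation here. It differs from the full gradient from draw to draw, which is the randomness the paragraph above refers to.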
Though a random estimate of the gradient is cheaper to compute, it could be useless.
Under what conditions does it make sense to use a random gradient estimate? The theory
of stochastic approximation tells us that stochastic estimates of the gradient can be used
instead of the true gradient when the following conditions hold: (1) the gradient estimate
is unbiased; (2) the parameter update rate asymptotically converges to zero; (3) the error
surface is quadratic in the parameters (Robbins and Monro, 1951; Ljung and Söderström,
1983; Haykin, 1994). The first condition requires that on average the estimate for the gradient
is the true gradient. The second ensures that the search will eventually stop moving about
randomly in parameter space. In practice the third condition can be relaxed to include most
smooth nonlinear error surfaces, though there is no guarantee that the parameters will end
up in any particular minimum.
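The three conditions can be seen together in a small sketch. The error surface here is quadratic in a single parameter w (a least-squares fit of y = w * x), the gradient estimate is an average over a uniformly drawn batch and is therefore unbiased, and the update rate decays toward zero. The particular schedule rate0 / (1 + decay * t), the batch size, and all names are assumptions chosen for illustration; the text does not prescribe a schedule.

```python
import random


def sgd_fit(data, steps=2000, batch=10, rate0=0.5, decay=0.01, seed=2):
    """Fit y = w * x by stochastic gradient descent with a decaying rate."""
    rng = random.Random(seed)
    w = 0.0
    for t in range(steps):
        sample = rng.sample(data, batch)
        # Unbiased estimate of the mean-squared-error gradient (condition 1).
        grad = sum(2.0 * (w * x - y) * x for x, y in sample) / batch
        # Update rate shrinks toward zero as t grows (condition 2).
        rate = rate0 / (1.0 + decay * t)
        w -= rate * grad
    return w


rng = random.Random(3)
xs = [rng.uniform(-1.0, 1.0) for _ in range(20_000)]
data = [(x, 3.0 * x + rng.gauss(0.0, 0.1)) for x in xs]  # true slope is 3.0
w_hat = sgd_fit(data)
print(w_hat)
```

Each update touches only 10 of the 20,000 points, yet the estimate settles close to the slope used to generate the data.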
Returning our attention to Equations 3.22 and 3.26, notice that both the calculation of the EMMA entropy estimate and its derivative involve a double summation. One
summation is over the points in sample a and another over the points in b. As a result the cost
of evaluation is quadratic in sample size: $O(N_a N_b)$. We will present an experiment where
the derivative of entropy for an image containing 60,000 pixels is evaluated. While the true
derivative of empirical entropy could be obtained by exhaustively sampling the data, a random estimate of the entropy can be obtained with much less computation. This is especially
critical in entropy manipulation problems, where the derivative of entropy is evaluated many
thousands of times. Without the quadratic savings that arise from using smaller samples,
entropy manipulation would be impossible.
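To make the double summation concrete, here is a minimal sketch of a Parzen-window entropy estimate in one dimension. It assumes the form in which a Gaussian kernel density is built from sample a and the negative log density is averaged over sample b. Equations 3.22 and 3.26 are not reproduced in this excerpt, so the exact form, the kernel width `sigma`, and the name `emma_entropy` should be read as assumptions.

```python
import math
import random


def emma_entropy(a, b, sigma):
    """Parzen-window entropy estimate: density from sample a, average over sample b."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    total = 0.0
    for zb in b:                      # outer sum: N_b terms
        density = 0.0
        for za in a:                  # inner sum: N_a kernel evaluations
            density += norm * math.exp(-0.5 * ((zb - za) / sigma) ** 2)
        total += math.log(density / len(a))
    return -total / len(b)


rng = random.Random(4)
data = [rng.gauss(0.0, 1.0) for _ in range(60_000)]
a = rng.sample(data, 200)
b = rng.sample(data, 200)
h = emma_entropy(a, b, sigma=0.25)   # 200 * 200 kernel evaluations
print(h, 0.5 * math.log(2.0 * math.pi * math.e))
```

Using all 60,000 points for both a and b would require 3.6 billion kernel evaluations per estimate, against 40,000 for the two samples of 200 used here. The second printed value is the exact entropy of a unit-variance Gaussian, given for comparison.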
For entropy manipulation problems involving large samples we will use stochastic gradient
descent. Stochastic gradient descent seeks a local maximum of entropy by using a stochastic
estimate of the gradient instead of the true gradient. Steps are repeatedly taken that are
proportional to the approximation of the derivative of the mutual information with respect to
the parameters:
Repeat:
    $a \leftarrow \{N_a \text{ samples drawn from } y\}$
    $b \leftarrow \{N_b \text{ samples drawn from } y\}$
    $v \leftarrow v + \lambda \frac{dh}{dv}$
where $\frac{dh}{dv}$ is the derivative of entropy evaluated over samples a and b, v is the parameter to be
estimated, and t...
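The update loop above can be sketched in a toy setting. The sketch assumes a scalar parameter v, a transformation y = v * x, a Gaussian Parzen kernel of fixed width `sigma`, and a constant step size `rate`. None of these choices comes from the text, and the derivative is obtained by differentiating the Parzen entropy estimate for this particular transformation rather than copied from Equation 3.26. Because the entropy of v * x grows without bound as v grows, the loop runs for a fixed number of steps instead of converging.

```python
import math
import random


def entropy_gradient(v, a, b, sigma):
    """Derivative w.r.t. v of the Parzen entropy estimate of y = v * x."""
    total = 0.0
    for xb in b:
        diffs = [xb - xa for xa in a]
        kernels = [math.exp(-0.5 * (v * d / sigma) ** 2) for d in diffs]
        norm = sum(kernels)
        # Kernel-weighted average of squared differences around x_b.
        total += sum(k * d * d for k, d in zip(kernels, diffs)) / norm
    return v * total / (sigma ** 2 * len(b))


def stochastic_ascent(data, v, steps=50, n_a=30, n_b=30, rate=0.02,
                      sigma=0.5, seed=5):
    """Repeat: draw a and b, then step v along the estimated derivative."""
    rng = random.Random(seed)
    for _ in range(steps):
        a = rng.sample(data, n_a)   # a <- N_a samples
        b = rng.sample(data, n_b)   # b <- N_b samples
        v += rate * entropy_gradient(v, a, b, sigma)
    return v


rng = random.Random(6)
data = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
v_final = stochastic_ascent(data, v=0.5)
print(v_final)
```

Each step costs 30 * 30 kernel evaluations regardless of how large the full data set is, which is the saving the section argues for.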