Our justification for using probability and entropy to


... $\frac{d}{dv} p(y_b; a)$, is far more problematic. The second component is a measure of the change in the density estimate that results from changes in the sample a. In the Parzen framework the two components of the derivative collapse into a single term and can be directly computed from the samples. In the maximum likelihood framework $p(y_b; a)$ is a complex function of the sample. Since there is no closed-form function that computes the density estimate from the sample, computing its derivative can be very difficult.

3.3 Stochastic Maximization Algorithm

The variance maximization/minimization applications described above (principal components analysis and learning) are deterministic procedures. Starting from an initial guess, gradient descent uses the derivative of cost to repeatedly update the parameter vector. Two different runs that start from the same initial parameters will end up with the same final parameters. Our justification for using probability and entropy to analyze these problems is purely convenience. There is nothing random about these problems once the samples are drawn.

One of the benefits of understanding the probabilistic interpretation of these problems is that we can introduce randomness into our solutions and understand its effect. Here is a simple example: we want to know the average of a large sample of data. Without knowing anything else, it would make sense to sum over the entire sample. But if we needed only a rough estimate of the average, significant computation could be saved by averaging over a subset of the sample. Furthermore, knowledge of the sample variance would allow us to compute the size of the subsample needed to estimate the mean to a given precision.

60    3.3. STOCHASTIC MAXIMIZATION ALGORITHM    AI-TR 1548

A similar analysis can be applied to principal components analysis or function learning. The cost of a particular parameter vector is computed by summing over an entire sample, as we did in Equation 3.14. But when that sample is very large, this expectation can be approximated by a smaller random sample. The same argument applies to the gradient. Since the gradient is defined as an average over a very large sample, it may make sense to approximate it over a smaller random sample. When we use random samples, both the error estimate and the gradient estimate are now truly random. For very large samples, accurate error and gradient estimates can be made without averaging over the entire sample. For problems where the gradient needs to be evaluated often, this can save significant computation.

Though a random estimate of the gradient is cheaper to compute, it could be useless. Under what conditions does it make sense to use a random gradient estimate? The theory of stochastic approximation tells us that stochastic estimates of the gradient can be used instead of the true gradient when the following conditions hold: (1) the gradient estimate is unbiased; (2) the parameter update rate asymptotically converges to zero; (3) the error surface is quadratic in the parameters (Robbins and Monro, 1951; Ljung and Söderström, 1983; Haykin, 1994). The first condition requires that on average the estimate for the gradient is the true gradient. The second ensures that the search will eventually stop moving about randomly in parameter space. In practice the third condition can be relaxed to include most smooth non-linear error surfaces, though there is no guarantee that the parameters will end up in any particular minimum.

Returning our attention to equations 3.22 and 3.26, notice that both the calculation of the EMMA entropy estimate and its derivative involve a double summation. One summation is over the points in sample a and another over the points in b. As a result the cost of evaluation is quadratic in sample size: $O(N_a N_b)$. We will present an experiment where the derivative of entropy for an image containing 60,000 pixels is evaluated.
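As a minimal sketch of the three conditions listed above, consider the averaging example from earlier in this section. The sample mean minimizes a quadratic error, the gradient computed on a small random batch is an unbiased estimate of the full gradient, and the step size decays to zero. The function name `sgd_mean`, the batch size, and the 1/t step schedule are illustrative assumptions and do not come from the source.

```python
import random


def sgd_mean(data, n_steps=2000, batch_size=10, rate0=1.0, seed=0):
    """Estimate the mean of `data` by stochastic gradient descent.

    Minimizes the quadratic error E(v) = mean((v - x)^2) / 2, whose exact
    gradient is v - mean(data). All parameter defaults are assumptions.
    """
    rng = random.Random(seed)
    v = 0.0
    for t in range(1, n_steps + 1):
        # Condition 1: a random batch gives an unbiased gradient estimate.
        batch = [rng.choice(data) for _ in range(batch_size)]
        grad = sum(v - x for x in batch) / batch_size
        # Condition 2: the update rate converges to zero as t grows.
        rate = rate0 / t
        # Condition 3 holds by construction: E(v) is quadratic in v.
        v -= rate * grad
    return v


if __name__ == "__main__":
    demo_rng = random.Random(1)
    sample = [demo_rng.gauss(5.0, 2.0) for _ in range(20000)]
    print("full-sample mean:", sum(sample) / len(sample))
    print("stochastic estimate:", sgd_mean(sample))
```

With `rate0 = 1.0` the iterate reduces to the running average of the batch means, so the estimate tightens as more batches are drawn, without any single step touching the whole sample.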
While the true derivative of empirical entropy could be obtained by exhaustively sampling the data, a random estimate of the entropy can be obtained with much less computation. This is especially critical in entropy manipulation problems, where the derivative of entropy is evaluated many thousands of times. Without the quadratic savings that arise from using smaller samples, entropy manipulation would be impossible.

61    CHAPTER 3. EMPIRICAL ENTROPY MANIPULATION AND STOCHASTIC GRADIENT DESCENT    Paul A. Viola

For entropy manipulation problems involving large samples we will use stochastic gradient descent. Stochastic gradient descent seeks a local maximum of entropy by using a stochastic estimate of the gradient instead of the true gradient. Steps are repeatedly taken that are proportional to the approximation of the derivative of the mutual information with respect to the parameters:

Repeat:
    $a \leftarrow \{N_a \text{ samples drawn from } y\}$
    $b \leftarrow \{N_b \text{ samples drawn from } y\}$
    $v \leftarrow v + \lambda \frac{dh}{dv}$

where $\frac{dh}{dv}$ is the derivative of entropy evaluated over samples a and b, v is the parameter to be estimated, and t...
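The Repeat loop above can be sketched in Python under several assumptions that are not in the source. The transformation is a hypothetical one-parameter projection y = cos(v) x1 + sin(v) x2. The entropy estimate is a Gaussian Parzen-window estimate with a fixed width `psi`, standing in for the EMMA estimate, since equations 3.22 and 3.26 are not reproduced in this excerpt. The decaying step schedule and all function names are likewise illustrative.

```python
import math
import random


def entropy_gradient(v, a, b, psi=1.0):
    """Derivative dh/dv of a Parzen-window entropy estimate (assumed form).

    The density of y is estimated from sample a and evaluated at the points
    of sample b, so the cost is O(len(a) * len(b)).
    """
    c, s = math.cos(v), math.sin(v)
    ya = [c * x1 + s * x2 for x1, x2 in a]      # y for each point in a
    dya = [-s * x1 + c * x2 for x1, x2 in a]    # dy/dv for each point in a
    grad = 0.0
    for x1, x2 in b:
        yb = c * x1 + s * x2
        dyb = -s * x1 + c * x2
        # Log kernel values; subtract the maximum before exponentiating
        # so the normalizing sum cannot underflow to zero.
        logs = [-(yb - y) ** 2 / (2.0 * psi ** 2) for y in ya]
        top = max(logs)
        weights = [math.exp(value - top) for value in logs]
        total = sum(weights)
        for w, y, dy in zip(weights, ya, dya):
            grad += (w / total) * (yb - y) / psi ** 2 * (dyb - dy)
    return grad / len(b)


def stochastic_entropy_ascent(data, v=1.0, n_steps=400, n_a=30, n_b=30,
                              rate0=0.1, seed=0):
    """Repeat: draw a and b, then step v along the estimated dh/dv."""
    rng = random.Random(seed)
    for t in range(n_steps):
        a = rng.sample(data, n_a)           # a <- {N_a samples}
        b = rng.sample(data, n_b)           # b <- {N_b samples}
        rate = rate0 / (1.0 + t / 50.0)     # assumed decaying step size
        v += rate * entropy_gradient(v, a, b)
    return v


if __name__ == "__main__":
    demo_rng = random.Random(1)
    points = [(demo_rng.gauss(0.0, 3.0), demo_rng.gauss(0.0, 0.5))
              for _ in range(2000)]
    print("final v:", stochastic_entropy_ascent(points))
```

For Gaussian data whose variance is largest along x1, the entropy of the projection grows with its variance, so the loop should drive sin(v) toward zero. Each step costs 30 x 30 kernel evaluations rather than 2000 x 2000, which is the quadratic saving the text describes.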

This note was uploaded on 02/10/2010 for the course TBE 2300 taught by Professor Cudeback during the Spring '10 term at Webber.
