The parameter \lambda is called the learning rate. The above procedure is repeated a fixed number of times or until convergence is detected. In most problems the initial value of \lambda is reduced during the search. In subsequent chapters we will describe experiments where samples of 50 or fewer can be used to effectively find entropy maxima.

Stochastic approximation does not seem to be well known in the computer vision community. We believe that it is applicable to a number of cost minimization problems that arise in computer vision. Stochastic gradient descent is most appropriate for tasks where evaluation of the true gradient is expensive, but an unbiased estimate of the gradient is easy to compute. Examples include cost functions whose derivative is a sum over all of the pixels in an image. In these cases, stochastic gradient search can be orders of magnitude faster than even the most complex second-order gradient search schemes. In Chapter 6, the experimental section of this thesis, we will briefly describe joint work in which an existing vision application was sped up by a factor of fifty using stochastic approximation.

Convergence of Stochastic EMMA

Most of the conditions that ensure convergence of stochastic gradient descent are easy to obtain in practice. For example, it is not strictly necessary for the learning rate to asymptotically converge to zero. At non-zero learning rates the parameter vector will move randomly about the optimum endlessly; smaller learning rates make for smaller excursions from the true answer. An effective way to terminate the search is to detect when, on average, the parameters are no longer changing, and then reduce the learning rate. The learning rate only needs to approach zero if the goal is zero error, something that no practical system can achieve anyway. A better idea is to reduce the learning rate until the parameters have a reasonably small variance and then take the average of the parameters.
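The search procedure just described can be sketched in a few lines of Python. The quadratic per-pixel cost, the sample size of 50, and the halving schedule for the learning rate are illustrative assumptions for this sketch, not the thesis's actual EMMA cost function:

```python
import random

# Illustrative problem: find the scalar theta minimizing the mean of
# (theta - pixel_i)^2 over a large "image". The true gradient sums over
# every pixel; the stochastic estimate averages over a small random sample.
random.seed(0)
pixels = [random.gauss(5.0, 2.0) for _ in range(100_000)]

def stochastic_gradient(theta, sample_size=50):
    """Unbiased estimate of the derivative of the mean squared cost."""
    sample = random.sample(pixels, sample_size)
    return sum(2.0 * (theta - p) for p in sample) / sample_size

theta = 0.0
rate = 0.1                # the learning rate (lambda in the text)
for step in range(2000):
    theta -= rate * stochastic_gradient(theta)
    if step % 500 == 499:
        rate *= 0.5       # reduce the learning rate during the search

true_mean = sum(pixels) / len(pixels)
print(theta, true_mean)   # both land near 5.0
```

At a fixed non-zero rate, theta keeps making small random excursions around the optimum; halving the rate shrinks those excursions, which is the termination strategy described above.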
3.3. STOCHASTIC MAXIMIZATION ALGORITHM    AI-TR 1548

The first proofs of stochastic approximation required that the error be quadratic in the parameters; more modern proofs are considerably more general. For convergence to a particular optimum, the parameter vector must be guaranteed to enter that optimum's basin of attraction infinitely often. A basin of attraction of an optimum is defined with respect to true gradient descent: each basin is the set of points from which true gradient descent will converge to the same optimum. Quadratic error surfaces have a single optimum, and its basin of attraction is the entire parameter space. Non-linear error surfaces may have many optima, and the parameter space is partitioned into many basins of attraction. When there is a finite number of optima, we can prove that stochastic gradient descent will converge to one of them. The proof proceeds by contradiction. Assume that the parameter vector never converges; instead it wanders about parameter space forever. Since parameter space is partitioned into basins of attraction, it is always in some basin. But since there are a finite number of basins, it must be in one basin infinitely often. So it must converge to the optimum of that basin.

One condition will give us more trouble than the others: the stochastic estimate of the gradient must be unbiased. The sample approximation for empirical entropy is not unbiased. Moreover, we have been able to prove that it has a consistent bias: it is too large. In Section 2.4.3 we described the conditions under which the Parzen density estimate is unbiased. When these conditions are met, a number of equalities hold:

    p_X(x) = \lim_{N_a \to \infty} P(x; a)                                   (3.32)
           = E_a[\, P(x; a) \,]                                              (3.33)
           = E_a\left[ \frac{1}{N_a} \sum_{x_a \in a} R(x_a - x) \right]     (3.34)
           = E_X[\, R(X - x) \,] .                                           (3.35)

Here E_a[\, P(x; a) \,] denotes the expectation over all possible random samples a of size N_a drawn from the random variable X. Assuming the different samples of X are independent allows us to move the expectation inside the summation.
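Equalities (3.33) through (3.35) can be checked numerically. The Gaussian kernel, its width, the N(0, 1) source density, and the sample size of 20 below are illustrative assumptions; the point is only that averaging the Parzen estimate P(x; a) over many small samples agrees with E_X[R(X - x)] from one very large sample:

```python
import math
import random

random.seed(1)
SIGMA = 0.5  # Parzen kernel width (an illustrative choice)

def R(u):
    """Gaussian Parzen kernel R(u)."""
    return math.exp(-u * u / (2 * SIGMA**2)) / (SIGMA * math.sqrt(2 * math.pi))

def parzen(x, a):
    """P(x; a): Parzen density estimate at x from sample a."""
    return sum(R(xa - x) for xa in a) / len(a)

def draw(n):
    """A sample of n points from X ~ N(0, 1)."""
    return [random.gauss(0.0, 1.0) for _ in range(n)]

x = 0.3
# Average P(x; a) over many independent samples a of size N_a = 20 ...
avg_P = sum(parzen(x, draw(20)) for _ in range(5000)) / 5000
# ... and compare with E_X[R(X - x)] estimated from one huge sample.
big = draw(200_000)
smoothed = sum(R(xi - x) for xi in big) / len(big)

print(avg_P, smoothed)  # the two Monte Carlo estimates agree closely
```

Both quantities estimate the same kernel-smoothed density, which is why the Parzen estimate is unbiased for it under the stated conditions.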
The true entropy of the RV X can be expressed as

    h(X) = -E_X[\, \log E_a[\, P(x; a) \,] \,] .                             (3.36)

We can define a similar statistic, the expectation of the stochastic EMMA estimate h^*(X) over the samples b and a:

    \bar{h}(X) \equiv E_{b,a}[\, h^*(X) \,]                                  (3.37)

CHAPTER 3. EMPIRICAL ENTROPY MANIPULATION AND STOCHASTIC GRADIENT DESCENT    Paul A. Viola

               = -E_{b,a}\left[ E_b\left[ \log E_a[\, R(x_b - x_a) \,] \right] \right]   (3.38)
               = -E_X[\, E_a[\, \log P(x; a) \,] \,] .                       (3.39)

\bar{h}(X) is the expected value of h^*(X); therefore, h^*(X) provides an unbiased estimate of \bar{h}(X). Jensen's inequality allows us to move the logarithm inside the expectation:

    h(X) = -E_X[\, \log E_a[\, P(x; a) \,] \,]                               (3.40)
         \le -E_X[\, E_a[\, \log P(x; a) \,] \,]                             (3.41)
         = \bar{h}(X) .                                                      (3.42)

The stochastic EMMA estimate is thus an unbiased estimator of a statistic that is provably larger than the true entropy. Intuitively, overly large estimates arise when elements of b fall in regions where P(x; a) is too small; for these points the log of P(x; a) is much smaller than it should be. How then might we patch the definition of EMMA to remedy the bias? Another statistic that is similar to entropy is

    \hat{h}(X) = -E_X[\, p(X) \,] = -\int_{-\infty}^{\infty} p(x)^2 \, dx .  (3.43)

\hat{h}(X) is a measure of the ra...
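The upward bias established in (3.40) through (3.42) can also be observed numerically. The sketch below assumes a unit Gaussian source, a Gaussian Parzen kernel of width sigma, and sample sizes of 20 for both a and b; none of these choices come from the thesis. For this setup the quantity -E_X[log E_a[P(x; a)]] has a closed form (the cross entropy of N(0, 1) under the N(0, 1 + sigma^2) density obtained by smoothing), which the averaged sample estimate should exceed:

```python
import math
import random

random.seed(2)
SIGMA = 0.5  # Parzen kernel width (illustrative)
N_A = 20     # size of the density sample a
N_B = 20     # size of the entropy sample b

def R(u):
    """Gaussian Parzen kernel."""
    return math.exp(-u * u / (2 * SIGMA**2)) / (SIGMA * math.sqrt(2 * math.pi))

def parzen(x, a):
    return sum(R(xa - x) for xa in a) / len(a)

def draw(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]

def emma_estimate():
    """One stochastic entropy estimate: -(1/N_b) sum_b log P(x_b; a)."""
    a, b = draw(N_A), draw(N_B)
    return -sum(math.log(parzen(xb, a)) for xb in b) / N_B

# Average the stochastic estimate over many independent trials.
avg = sum(emma_estimate() for _ in range(3000)) / 3000

# Closed form for -E_X[log E_a[P(x; a)]]: cross entropy of N(0, 1)
# under the smoothed density N(0, 1 + SIGMA^2).
s2 = 1.0 + SIGMA**2
smoothed_h = 0.5 * math.log(2 * math.pi * s2) + 0.5 / s2

print(avg, smoothed_h)  # the averaged sample estimate is the larger one
```

The gap between the two printed numbers is exactly the Jensen gap of (3.41): points of b that land where P(x; a) happens to be small contribute log terms that are far too negative, pushing the estimate up.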