This parameter is called the "learning rate". The above procedure is repeated
a fixed number of times or until convergence is detected. In most problems the initial value
of the learning rate is reduced during the search. In subsequent chapters we will describe experiments where
samples of 50 or fewer can be used to effectively find entropy maxima.
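The procedure above can be sketched in a few lines. The example below is illustrative and not from the thesis: it minimizes a simple quadratic cost whose true gradient is a sum over many data terms, using an unbiased gradient estimate computed from a sample of 50 terms, with the learning rate reduced during the search.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=100_000)  # stand-in for per-pixel cost terms

theta = 0.0   # parameter being searched for; the true optimum is near 3.0
lam = 0.1     # the learning rate
for step in range(2000):
    sample = rng.choice(data, size=50)       # a sample of 50, as in the text
    grad = 2.0 * (theta - sample.mean())     # unbiased estimate of the true gradient
    theta -= lam * grad                      # repeated update step
    if step % 500 == 499:
        lam *= 0.5                           # reduce the learning rate during the search
```

Each sample gradient is cheap to compute, yet its expectation equals the true gradient, which is all the convergence argument below requires.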
Stochastic approximation does not seem to be well known in the computer vision community. We believe that it is applicable to a number of cost minimization problems that
arise in computer vision. Stochastic gradient descent is most appropriate for tasks where
evaluation of the true gradient is expensive, but an unbiased estimate of the gradient is easy
to compute. Examples include cost functions whose derivative is a sum over all of the pixels
in an image. In these cases, stochastic gradient search can be orders of magnitude faster
than even the most complex second order gradient search schemes. In Chapter 6 of this thesis we will briefly describe joint work where an existing vision
application was sped up by a factor of fifty using stochastic approximation.

Convergence of Stochastic EMMA
Most of the conditions that ensure convergence of stochastic gradient descent are easy to obtain in practice. For example, it is not really necessary for the learning rate to asymptotically
converge to zero. At nonzero learning rates the parameter vector will move randomly about
the minimum (or maximum) endlessly. Smaller learning rates make for smaller excursions from
the true answer. An effective way to terminate the search is to detect when, on average, the
parameter is not changing, and then reduce the learning rate. The learning rate only needs
to approach zero if your goal is zero error, something that no practical system can achieve
anyway. A better idea is to reduce the learning rate until the parameters have a reasonable
variance and then take the average of the parameters.
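This termination strategy can be sketched as follows. The plateau threshold and window sizes below are hypothetical choices for illustration: run at a fixed rate, halve the rate whenever the parameter has stopped moving on average, and report the average of recent parameter values.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=-2.0, scale=1.0, size=50_000)  # cost terms; optimum near -2.0

theta, lam = 5.0, 0.2
history = []
for step in range(6000):
    sample = rng.choice(data, size=50)
    theta -= lam * 2.0 * (theta - sample.mean())   # unbiased stochastic gradient step
    history.append(theta)
    if step % 200 == 199:                          # periodically test for a plateau
        drift = abs(np.mean(history[-100:]) - np.mean(history[-200:-100]))
        if drift < 0.01:                           # on average, the parameter is not changing
            lam *= 0.5                             # so reduce the learning rate
final_estimate = np.mean(history[-500:])           # take the average parameters
```

Averaging the final iterates, rather than reporting the last one, removes most of the random excursions that remain at a nonzero learning rate.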
The first proofs of stochastic approximation required that the error be quadratic in the
parameters. More modern proofs are more general. For convergence to a particular optimum,
the parameter vector must be guaranteed to enter that optimum's basin of attraction infinitely
often. A basin of attraction of an optimum is defined with respect to true gradient descent.
Each basin is the set of points from which true gradient descent will converge to the same
optimum. Quadratic error surfaces have a single optimum, and the basin of attraction is the
entire parameter space. Nonlinear error surfaces may have many optima, and the parameter
space is partitioned into many basins of attraction. When there is a finite number of optima,
we can prove that stochastic gradient descent will converge to one of them. The proof
proceeds by contradiction. Assume that the parameter vector never converges; instead it
wanders about parameter space forever. Since parameter space is partitioned into basins of
attraction, it is always in some basin. But since there are a finite number of basins, it must
be in one basin infinitely often. So it must converge to the optimum of that basin.
One condition will give us more trouble than the others: the stochastic estimate of the
gradient must be unbiased. The sample approximation for empirical entropy
is not unbiased. Moreover, we have been able to prove that it has a consistent bias: it is too
large. In Section 2.4.3 we described the conditions under which the Parzen density estimate
is unbiased. When these conditions are met, a number of equalities hold:

    p(X = x) = \lim_{N_a \to \infty} P(x, a)                               (3.32)
             = E_a[ P(x, a) ]                                              (3.33)
             = E_a\Big[ \frac{1}{N_a} \sum_{x_a \in a} R(x - x_a) \Big]    (3.34)
             = E_X[ R(X - x) ]                                             (3.35)

Here E_a[ P(x, a) ] denotes the expectation over all possible random samples a of size N_a
drawn from the random variable X. Assuming the different samples of X are independent
allows us to move the expectation inside the summation. The true entropy of the RV X can
be expressed as
    h(X) = -E_X[ \log E_a[ P(x, a) ] ]                                     (3.36)
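The equalities above can be checked numerically. In the sketch below (illustrative only; the Gaussian kernel R, its width, and the unit-Gaussian X are assumptions, not the thesis's setup), the average of the Parzen estimate P(x, a) over many samples a agrees with E_X[ R(X - x) ] computed from one very large sample:

```python
import numpy as np

rng = np.random.default_rng(2)
psi = 0.5                                       # assumed Parzen kernel width
R = lambda u: np.exp(-0.5 * (u / psi) ** 2) / (psi * np.sqrt(2 * np.pi))

x = 0.7                                         # point at which the density is estimated
Na = 50                                         # size of each sample a

# Left side: expectation of the Parzen estimate P(x, a) over many samples a
parzen = [R(x - rng.normal(size=Na)).mean() for _ in range(2000)]
lhs = np.mean(parzen)

# Right side: E_X[ R(X - x) ] from a single very large sample of X
rhs = R(rng.normal(size=1_000_000) - x).mean()
```

The two estimates agree to within Monte Carlo error, which is the content of (3.33)-(3.35).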
We can define a similar statistic, the expected value of the EMMA estimate h^*(X):

    \bar{h}^*(X) \equiv E_{b,a}[ h^*(X) ]                                  (3.37)
                 = -E_{b,a}\big[ E_b[ \log E_a[ R(x_b - x_a) ] ] \big]     (3.38)
                 = -E_X\big[ E_a[ \log P(x, a) ] \big]                     (3.39)

\bar{h}^*(X) is the expected value of h^*(X). Therefore, h^*(X) provides an unbiased estimate of
\bar{h}^*(X). Jensen's inequality allows us to move the logarithm inside the expectation:

    h(X) = -E_X[ \log E_a[ P(x, a) ] ]                                     (3.40)
         \le -E_X[ E_a[ \log P(x, a) ] ]                                   (3.41)
         = \bar{h}^*(X)                                                    (3.42)
The stochastic EMMA estimate is an unbiased estimator of a statistic that is provably larger
than the true entropy. Intuitively, overly large estimates arise when elements of b fall in
regions where P(x, a) is too small. For these points the log of P(x, a) is much smaller than
it should be.
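This upward bias is easy to exhibit numerically. In the sketch below (illustrative; the kernel width and sample sizes are assumptions), the average of many EMMA-style estimates -E_b[ \log P(x_b, a) ] exceeds the true entropy of a unit Gaussian, 0.5 log(2 pi e):

```python
import numpy as np

rng = np.random.default_rng(3)
psi = 0.5                                       # assumed Parzen kernel width
R = lambda u: np.exp(-0.5 * (u / psi) ** 2) / (psi * np.sqrt(2 * np.pi))

def emma_entropy(Na=50, Nb=50):
    a = rng.normal(size=Na)                       # sample used for the density estimate
    b = rng.normal(size=Nb)                       # sample at which log P is evaluated
    P = np.array([R(xb - a).mean() for xb in b])  # Parzen estimate P(x_b, a)
    return -np.log(P).mean()                      # EMMA-style entropy estimate

true_h = 0.5 * np.log(2 * np.pi * np.e)           # true entropy of a unit Gaussian
avg_estimate = np.mean([emma_entropy() for _ in range(500)])
```

Averaged over many trials, `avg_estimate` sits above `true_h`: part of the gap comes from kernel smoothing, and part from Jensen's inequality as derived above.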
How then might we patch the definition of EMMA to remedy the bias? Another statistic
that is similar to entropy is

    \hat{h}(X) = -E_X[ p(X) ] = -\int_{-\infty}^{\infty} p(x)^2 \, dx      (3.43)

\hat{h}(X) is a measure of the ra...