…randomness of $X$; we will define $\hat h(X)$ shortly. Strongly peaked distributions have large negative values for $\hat h$. For widely spread uniform distributions $\hat h$ approaches zero. Using the well known inequality $\log x \le x - 1$, we can show
$$\hat h(X) = -\int_{-\infty}^{\infty} p(x)^2 \, dx \tag{3.44}$$
$$\le -\int_{-\infty}^{\infty} p(x) \log p(x) \, dx \tag{3.45}$$
$$= h(X). \tag{3.46}$$
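As a quick numerical sanity check of this bound (our own illustration, not from the text), we can evaluate both integrals on a grid for a unit-variance Gaussian; the grid limits and resolution below are arbitrary choices:

```python
import numpy as np

# Numerical check of h-hat(X) <= h(X) for a unit-variance Gaussian.
x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]
p = np.exp(-x ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

h_hat = -np.sum(p ** 2) * dx        # -integral p(x)^2 dx      (Riemann sum)
h = -np.sum(p * np.log(p)) * dx     # -integral p(x) log p(x) dx

print(h_hat)  # about -0.282  (= -1 / (2 sqrt(pi)))
print(h)      # about  1.419  (= 0.5 log(2 pi e))
assert h_hat <= h
```

For a Gaussian both quantities have closed forms, which makes the gap between the lower-bound statistic and the true entropy easy to see.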
Parzen window density estimation can be used to construct a stochastic measure,

$$\hat h^*(X) \equiv -E_b[P^*(x_b, a)]. \tag{3.47}$$

3.3. STOCHASTIC MAXIMIZATION ALGORITHM (AITR 1548)
The expectation of $\hat h^*(X)$ is $\hat h(X)$:

$$E_{b \in \{X\},\, a \in \{X\}}[\hat h^*(X)] = -E_{b \in \{X\},\, a \in \{X\}}\big[E_b[P^*(x_b, a)]\big] \tag{3.48}$$
$$= -E_X\big[E_a[P^*(X, a)]\big] \tag{3.49}$$
$$= -E_X[p(X)] \tag{3.50}$$
$$= \hat h(X). \tag{3.51}$$

Further simplifying,

$$\hat h^*(X) = -E_{x_b \in b}\big[E_{x_a \in a}[R(x_b - x_a)]\big] \tag{3.52}$$
is an expectation over a pair of events from the RV $X$. On average $\hat h$ is far too negative when $p(X)$ is large.
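Equations (3.47) and (3.52) can be made concrete with a small sketch (our own illustration, not code from the thesis): two samples $a$ and $b$ are drawn from $X$, a Gaussian kernel plays the role of $R$, and the statistic is the negative mean, over $b$, of the Parzen density estimated from $a$. The kernel width and sample sizes are arbitrary assumptions:

```python
import numpy as np

def gauss_kernel(u, sigma):
    """Gaussian Parzen kernel R(u) with smoothing width sigma."""
    return np.exp(-u ** 2 / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

def h_hat_star(sample_b, sample_a, sigma):
    """Stochastic estimate h-hat*(X): minus the mean Parzen density of
    sample b, where the density is estimated from the independent sample a."""
    diffs = sample_b[:, None] - sample_a[None, :]      # all pairs x_b - x_a
    parzen = gauss_kernel(diffs, sigma).mean(axis=1)   # P*(x_b, a) for each x_b
    return -parzen.mean()

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=50)   # sample a: supports the density estimate
b = rng.normal(0.0, 1.0, size=50)   # sample b: evaluation points
print(h_hat_star(b, a, sigma=0.25))  # near -0.28 for N(0,1), up to sampling noise
```

The double mean over the two samples is exactly the pair-of-events expectation of (3.52).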
We have now defined two alternative statistics for which inexpensive unbiased estimates are available: $\hat h(X)$ and $h^*(X)$. These statistics bound the true entropy above and below,

$$\hat h(X) \le h(X) \le h^*(X). \tag{3.53}$$

$h^*$ is on average too large and $\hat h$ is on average too small. Instead of using either, we have
had good success using a third estimate,

$$\tilde h(X) \equiv -E_b[F(x_b, a)], \tag{3.54}$$

where

$$F(x, a) = \begin{cases} \log P^*(x, a) & \text{if } P^*(x, a) \ge p_{min}, \\ \dfrac{P^*(x, a)}{p_{min}} + \log p_{min} - 1 & \text{otherwise.} \end{cases} \tag{3.55}$$

The function is designed so that both $F(x, a)$ and $\frac{d}{dx}F(x, a)$ are continuous. See Figure 3.1 for a plot of the function. The intuition behind $\tilde h$ is that we use $h^*$ whenever possible, and for those sample points where $P^*(x, a)$ has a large standard deviation, we use $\hat h$ instead. The standard deviation of the Parzen density estimate is highest at points where the probability density is lowest, so we use $\hat h$ where $P^*(x, a)$ is below $p_{min}$ (see Section 2.4.3). The variable $p_{min}$ allows us to continuously vary the $\tilde h$ estimate between the two extremes of $h^*$ and $\hat h$.
Paul A. Viola EMPIRICAL ENTROPY MANIPULATION AND STOCHASTIC GRADIENT DESCENT
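A sketch of the clipped-log function of (3.55) and the resulting $\tilde h$ estimate may help; this is our own illustration, the symbol F names the piecewise function, and the kernel, widths, and sample sizes are assumptions:

```python
import numpy as np

def F(p_star, p_min):
    """Piecewise function of eq. (3.55): log P* at or above p_min, and the
    tangent line P*/p_min + log(p_min) - 1 below it, so that the value and
    the first derivative both match at P* = p_min."""
    p_star = np.asarray(p_star, dtype=float)
    low = p_star / p_min + np.log(p_min) - 1.0
    high = np.log(np.maximum(p_star, p_min))   # maximum() guards log of tiny values
    return np.where(p_star >= p_min, high, low)

def gauss_kernel(u, sigma):
    """Gaussian Parzen kernel with smoothing width sigma."""
    return np.exp(-u ** 2 / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

def h_tilde(sample_b, sample_a, sigma, p_min):
    """h-tilde(X) = -mean over sample b of F(P*(x_b, a)), eq. (3.54)."""
    diffs = sample_b[:, None] - sample_a[None, :]
    p_star = gauss_kernel(diffs, sigma).mean(axis=1)  # Parzen estimate P*(x_b, a)
    return -np.mean(F(p_star, p_min))

# The two branches agree at the switch point:
print(F(0.1, 0.1), F(0.1 - 1e-9, 0.1))  # both ~ log(0.1) = -2.302...
```

Checking the two branches at $P^* = p_{min}$ confirms the continuity property the text describes: both the value and the slope ($1/p_{min}$) match at the boundary.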
[Figure 3.1 plot omitted. Legend: up(x, 0.1), up(x, 0.8), x*log(x), x*(x-1).]

Figure 3.1: Plot of the function labeled up, of $x \log x$, and of $x(x - 1)$. Two different values for $p_{min}$ are plotted: 0.1 and 0.8. Notice that smaller values of $p_{min}$ cause the approximation of $x \log x$ to be very good. The difference between the two functions when $p_{min} = 0.1$ is almost unnoticeable.

Other Stochastic Search Techniques
Nonlinear stochastic gradient descent is commonly used in the neural network literature, where it is often called the LMS rule. It was introduced there by Widrow and Hoff (Widrow and Hoff, 1960) and has been used extensively with good results. Since a stochastic estimate for the gradient of the error is much cheaper to compute than a true estimate of the gradient, for many real problems LMS is faster than all other gradient techniques. The textbook by Haykin (Haykin, 1994) discusses the use of such algorithms by the neural network community. An excellent discussion of stochastic approximation appears in the textbook by Ljung and Söderström (Ljung and Söderström, 1983).
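The LMS (Widrow-Hoff) rule is simple enough to sketch on a toy linear problem; this is an illustration of the general idea, not code from the thesis, and the data, learning rate, and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])

# Toy data: y = x . true_w + small noise.
X = rng.normal(size=(500, 2))
y = X @ true_w + 0.01 * rng.normal(size=500)

w = np.zeros(2)
lr = 0.05
for x_t, y_t in zip(X, y):
    err = y_t - x_t @ w    # instantaneous (single-sample) error
    w += lr * err * x_t    # LMS update: a stochastic gradient step on err^2 / 2
print(w)                   # close to [2, -1]
```

Each update uses only one sample, which is exactly why the stochastic gradient is so much cheaper than the true gradient over the whole data set.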
Simulated annealing is a related method that has been used in optimization problems which have many local minima (Kirkpatrick et al., 1983). These minima can "trap" gradient techniques far from the optimal solution. Simulated annealing performs a random, though usually local, search through parameter space. At each step a random modification to the parameters is proposed and the new cost is evaluated. If the new cost is lower than the previous cost the parameter modification is accepted. If the difference in cost is positive, the modification is accepted probabilistically. The probability of acceptance is proportional to the negative exponential of the difference,

$$p_{accept}(d) = e^{-d/t}, \tag{3.56}$$

where $d$ is the difference between the new and the old cost and $t$ is a temperature that
controls the likelihood that a bad modification is accepted. Simulated annealing is based on the insight that physical systems, like iron, invariably find good energy minima when heated and then cooled slowly. The process of physical annealing is basically a gradient search perturbed by thermal noise. The thermal noise provides the energy to kick physical systems out of unfavorable local minima. The parameter $t$ is an analog of physical temperature: it is initially set to large values and is gradually "cooled" during learning. In some cases it can be proven that with the right annealing schedule simulated annealing will converge to the global optimum.
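A minimal annealing loop using the acceptance rule of (3.56) can be sketched as follows; the one-dimensional cost function, step size, and cooling schedule are all illustrative assumptions, not from the thesis:

```python
import math
import random

def cost(x):
    # A broad quadratic bowl with many narrow local minima on top of it.
    return x * x + 2.0 * math.sin(10.0 * x)

random.seed(0)
x = 4.0        # start far from the good minima
t = 2.0        # initial temperature
for step in range(20000):
    x_new = x + random.gauss(0.0, 0.5)      # random local modification
    d = cost(x_new) - cost(x)               # difference in cost
    if d < 0 or random.random() < math.exp(-d / t):
        x = x_new                           # accept better, or worse with prob e^{-d/t}
    t = max(1e-3, t * 0.9995)               # gradual "cooling" schedule
print(x, cost(x))  # final state, typically in one of the low central minima
```

Early on the high temperature lets the search hop between basins; as $t$ shrinks, uphill moves are rejected almost surely and the search settles into a low minimum.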
Stochastic gradient descent can effectively penetrate narrow local minima which trap gradient techniques. Alignment applications can often have many narrow local minima which arise because of false matches between the high frequency components in the model and image. Because these false matches are based on features which are small, the local minima are narrow in pose space. We have found that narrow local minima, which can trap g...