…randomness of X; we will define ĥ(X) shortly. Strongly peaked distributions have large negative values for ĥ. For widely spread, uniform distributions ĥ approaches zero. Using the well known inequality x − 1 ≥ log(x), we can show

    ĥ(X) ≡ −∫ p(x)² dx              (3.44)
         ≤ −∫ p(x) log p(x) dx      (3.45)
         = h(X) .                   (3.46)

Parzen window density estimation can be used to construct a stochastic measure

    ĥ*(X) ≡ −E_b[P*(x_b, a)] .      (3.47)

3.3. STOCHASTIC MAXIMIZATION ALGORITHM                              AI-TR 1548

The expectation of ĥ*(X) is ĥ(X):

    E_{b∈{X}, a∈{X}}[ĥ*(X)] = −E_{b∈{X}, a∈{X}}[E_b[P*(x_b, a)]]    (3.48)
                            = −E_X[E_{a∈{X}}[P*(X, a)]]            (3.49)
                            = −E_X[p(X)]                           (3.50)
                            = ĥ(X) .                               (3.51)

Further simplifying,

    ĥ(X) = −E_{x∼X}[E_{x̃∼X}[R(x − x̃)]] .                           (3.52)

ĥ is an expectation over a pair of events from the RV X. On average ĥ is far too negative when p(X) is large.

We have now defined two alternative statistics for which inexpensive unbiased estimates are available: ĥ(X) and h*(X). These statistics bound the true entropy above and below,

    ĥ(X) ≤ h(X) ≤ h*(X) .          (3.53)

h* is on average too large and ĥ is on average too small. Instead of using either, we have had good success using a third estimate,

    h̃(X) = −E_b[Ψ(x_b, a)] ,  where                                (3.54)

    Ψ(x, a) = { log P*(x, a)                      if P*(x, a) ≥ p_min
              { P*(x, a)/p_min + log(p_min) − 1   otherwise .      (3.55)

The Ψ function is designed so that both Ψ(x, a) and dΨ(x, a)/dx are continuous. See Figure 3.1 for a plot of the Ψ function. The intuition behind h̃ is that we use h* whenever possible, and for those sample points where P*(x, a) has a large standard deviation, we use ĥ instead. The standard deviation of the Parzen density estimate is highest at points where the probability density is lowest, so we use ĥ where P*(x, a) is below p_min (see Section 2.4.3). The variable p_min allows us to continuously vary the h̃ estimate between the two extremes of h* and ĥ.

CHAPTER 3. EMPIRICAL ENTROPY MANIPULATION AND STOCHASTIC GRADIENT DESCENT    Paul A. Viola
Figure 3.1: Plot of the function xΨ(x), which is labeled up, together with x log(x) and x(x − 1). Two different values for p_min are plotted: 0.1 and 0.8. Notice that smaller values of p_min cause the approximation of x log(x) to be very good. The difference between the two functions when p_min = 0.1 is almost unnoticeable.

Other Stochastic Search Techniques

Non-linear stochastic gradient descent is commonly used in the neural network literature, where it is often called the LMS rule. It was introduced there by Widrow and Hoff [Widrow and Hoff, 1960] and has been used extensively with good results. Since a stochastic estimate of the gradient of the error is much cheaper to compute than the true gradient, for many real problems LMS is faster than all other gradient techniques. The textbook by Haykin [Haykin, 1994] discusses the use of such algorithms by the neural network community. An excellent discussion of stochastic approximation appears in the textbook by Ljung and Söderström [Ljung and Söderström, 1983].

Simulated annealing is a related method that has been used in optimization problems which have many local minima [Kirkpatrick et al., 1983]. These minima can "trap" gradient techniques far from the optimal solution. Simulated annealing performs a random, though usually local, search through parameter space. At each step a random modification to the parameters is proposed and the new cost is evaluated. If the new cost is lower than the previous cost, the parameter modification is accepted. If the difference in cost is positive, the modification is accepted probabilistically. The probability of acceptance is proportional to
the negative exponential of the difference,

    p_accept(d) = exp(−d / t) ,    (3.56)

where d is the difference between the new and the old cost and t is a temperature that controls the likelihood that a bad modification is accepted. Simulated annealing is based on the insight that physical systems, like iron, invariably find good energy minima when heated and then cooled slowly. The process of physical annealing is basically a gradient search perturbed by thermal noise. The thermal noise provides the energy to kick physical systems out of unfavorable local minima. The parameter t is an analog of physical temperature: it is initially set to large values and is gradually "cooled" during learning. In some cases it can be proven that, with the right annealing schedule, simulated annealing will converge to the global optimum.

Stochastic gradient descent can effectively penetrate narrow local minima which trap gradient techniques. Alignment applications can often have many narrow local minima, which arise because of false matches between the high frequency components in the model and image. Because these false matches are based on features which are small, the local minima are narrow in pose space. We have found that narrow local minima, which can trap g…
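The acceptance rule of Equation 3.56 can be sketched as follows. This is a generic illustration, not code from the thesis: the one-dimensional cost function, the Gaussian proposal scale, and the geometric cooling schedule are all arbitrary choices of mine.

```python
import math
import random

def simulated_annealing(cost, x0, t0=1.0, cooling=0.95, steps=2000, scale=0.1):
    """Minimize `cost` by random local search with the acceptance rule
    p_accept(d) = exp(-d / t) of Eq. 3.56."""
    x, t = x0, t0
    c = cost(x0)
    best_x, best_c = x0, c
    for _ in range(steps):
        x_new = x + random.gauss(0.0, scale)   # random local modification
        d = cost(x_new) - c                    # difference between new and old cost
        # Downhill moves are always accepted; uphill moves with prob. exp(-d/t).
        if d < 0 or random.random() < math.exp(-d / t):
            x, c = x_new, c + d
            if c < best_c:
                best_x, best_c = x, c
        t *= cooling                           # gradually "cool" the temperature
    return best_x, best_c

# A cost with many narrow local minima riding on a broad quadratic bowl,
# loosely analogous to false high-frequency matches in pose space.
def cost(x):
    return x * x + 0.5 * math.sin(20.0 * x)

random.seed(1)
x_opt, c_opt = simulated_annealing(cost, x0=3.0)
print(x_opt, c_opt)
```

Early on, the high temperature lets uphill moves through so the search can escape the narrow ripples; as t shrinks the rule degenerates into greedy descent on the underlying bowl.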