al Alignment
EMMA bears some similarity to methods used for evaluating and adjusting geometrical alignment. These similarities may be seen by revisiting the entropy derivative of Equation 3.28,
and comparing it to the derivative of the following construct.
Paul A. Viola CHAPTER 7. CONCLUSION

We define D, half the averaged Mahalanobis distance between values in B and their
nearest correspondences in A,

$$D(T) = \frac{1}{N_B} \sum_{z_i \in B} \min_{z_j \in A} \frac{1}{2} D(z_i - z_j). \tag{7.1}$$

Locally away from discontinuities, the derivative of the above expression is

$$\frac{d}{dT} D(T) = \frac{1}{N_B} \sum_{z_i \in B} \min_{z_j \in A} \frac{d}{dT} \frac{1}{2} D(z_i - z_j).$$
Comparing the above expression with Equation 3.28, we see the following analogy. If the
transformation T is adjusted to reduce the "averaged squared differences" between points in
B and their counterparts from A that are nearest in signal value, then a reduction in entropy
is obtained. This is intuitive, in that entropy will be lower if "clusters in signal value" are
tighter so that nearby signal differences will be smaller. The approximation of this analogy
is due to the dissimilarity between max and softmax.
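
The analogy above can be illustrated numerically. The following is a minimal sketch, not the thesis's implementation: it computes the hard-minimum distance measure described above and a softmax-weighted variant. The function names, the covariance argument `cov`, and the choice of weights proportional to `exp(-d)` (where `d` is the half Mahalanobis distance) are assumptions made for illustration.

```python
import numpy as np

def pairwise_half_mahalanobis(B, A, cov):
    """Matrix of 0.5 * (b - a)^T cov^{-1} (b - a) for every b in B and a in A."""
    diff = B[:, None, :] - A[None, :, :]            # shape (N_B, N_A, d)
    cov_inv = np.linalg.inv(cov)
    return 0.5 * np.einsum("ijk,kl,ijl->ij", diff, cov_inv, diff)

def hard_min_distance(B, A, cov):
    """Average over B of the half Mahalanobis distance to the nearest value in A."""
    return pairwise_half_mahalanobis(B, A, cov).min(axis=1).mean()

def soft_min_distance(B, A, cov):
    """Same average, but every value in A contributes with a softmax weight.

    Assumption: weights are proportional to exp(-d), so the closest value
    dominates when the values are well separated.
    """
    d = pairwise_half_mahalanobis(B, A, cov)
    w = np.exp(-(d - d.min(axis=1, keepdims=True)))  # subtract row minimum for stability
    w /= w.sum(axis=1, keepdims=True)
    return (w * d).sum(axis=1).mean()

# Hypothetical one-dimensional signal values.
A = np.array([[0.0], [10.0]])
B = np.array([[1.0], [9.0]])
cov = np.array([[1.0]])
hard = hard_min_distance(B, A, cov)
soft = soft_min_distance(B, A, cov)
```

When the values in A are well separated, as in this example, the two measures nearly coincide; they differ most when several values in A are about equally close to a value in B.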
Equation 7.1 is essentially the measure used in chamfer matching techniques, such as the
method described by Borgefors (Borgefors, 1988). Huttenlocher (Huttenlocher et al., 1991)
has used a related measure in feature matching applications, the Hausdorff distance, which
uses the maximum instead of the sum that appears in Equation 7.1. The similarity between
geometrical matching and entropy becomes even stronger if one uses the softmax operation
to weight the closest element rather than simply selecting the closest, as Wells has (Wells III,
1992b; Wells III, 1992a).
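
To make the distinction between the two aggregates concrete, here is a small sketch under simplifying assumptions: it uses plain Euclidean nearest-neighbor distances between hypothetical 2-D point sets, and the function names are illustrative rather than taken from any cited method. The chamfer-style measure averages the nearest-neighbor distances, while the directed Hausdorff-style measure takes their maximum.

```python
import numpy as np

def nearest_distances(B, A):
    """Euclidean distance from each point in B to its nearest point in A."""
    diff = B[:, None, :] - A[None, :, :]             # shape (N_B, N_A, d)
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)

def chamfer_style(B, A):
    """Average of the nearest-neighbor distances (sum divided by N_B)."""
    return nearest_distances(B, A).mean()

def directed_hausdorff_style(B, A):
    """Maximum of the nearest-neighbor distances."""
    return nearest_distances(B, A).max()

# Hypothetical point sets; the last point of B is an outlier.
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0], [4.0, 0.0]])
mean_dist = chamfer_style(B, A)
max_dist = directed_hausdorff_style(B, A)
```

In this example the single outlier determines the maximum entirely, while it is only one of three terms in the average.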
We reiterate that in vision applications, these methods have typically been used to measure
aggregate geometrical distance, while here we are measuring aggregate distances among signal
values (typically intensities, brightnesses, or surface properties).

Appendix A
A.1 Gradient Descent
In a number of problems described in this thesis one must find a set of parameters that extremizes an evaluation function. Examples include: (1) finding the parameters of a density so that
the likelihood of a sample is maximized; (2) finding the pose parameters that best align a model
and an image; and (3) finding the weights of a neural network so that it best approximates
a function. In each case there is a function of a parameter set, F(p), whose value is
to be either maximized or minimized. The parameters are continuous variables, and we are
therefore faced with an infinite number of possible solutions. The gradient descent procedure
is an effective though greedy technique for searching such a space.
There are many closely related gradient descent algorithms. Here we will describe the
simplest: steepest descent or hill climbing. Starting from an initial guess for the parameters,
steepest descent is an iterative procedure that uses the partial derivatives of a function to
construct an improved estimate for its parameters. Each parameter is updated by

$$p \leftarrow p + \lambda \frac{\partial F(p)}{\partial p}.$$

The update rate $\lambda$, which is also known as the learning rate, must be chosen carefully. When
$\lambda$ is sufficiently small one can use a Taylor expansion of F to prove that

$$F\left(p + \lambda \frac{\partial F(p)}{\partial p}\right) \geq F(p).$$

When $\lambda$ is too small p might take arbitrarily long to approach a maximum. If $\lambda$ is chosen
correctly p will converge toward the maximum relatively rapidly.
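
The update rule described above can be sketched in a few lines. This is an illustrative example, not code from the thesis: the function `steepest_ascent`, the quadratic objective, and the specific update rate and step count are all assumptions chosen so that the behavior is easy to check by hand.

```python
import numpy as np

def steepest_ascent(grad_F, p0, rate=0.1, steps=100):
    """Repeatedly move p a small step along the gradient of F (hill climbing)."""
    p = np.asarray(p0, dtype=float)
    for _ in range(steps):
        p = p + rate * grad_F(p)
    return p

# Hypothetical objective: F(p) = -||p - target||^2, maximized at p = target.
target = np.array([1.0, -2.0])

def grad_F(p):
    """Gradient of F(p) = -||p - target||^2."""
    return -2.0 * (p - target)

p_final = steepest_ascent(grad_F, [0.0, 0.0], rate=0.1, steps=100)
```

For this objective each step multiplies the error `p - target` by `1 - 2 * rate`, so a rate of 0.1 shrinks the error by a factor of 0.8 per step. A rate above 1.0 would make the error grow instead, which is one way to see why the update rate must be chosen carefully.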
There are many gradient-based techniques that attempt to speed the rate of convergence of
p. Second-order techniques such as Levenberg-Marquardt and Newton-Raphson use the second
derivatives of F(p) to re-estimate $\lambda$. Conjugate gradient techniques attempt to find better
directions than the gradient of F. In every case one must be careful that the theoretical
advantages of the algorithm are not outweighed by the costs of computing it. Researchers
in neural networks have found that for many problems it is difficult to realize any actual
improvement in convergence speed. The problems for which steepest descent works as well
as more complex techniques include functions with a large number of parameters,
which makes computing the second derivatives quite expensive.

Bibliography
Anderson, J. and Rosenfeld, E., editors 1988. Neurocomputing: Foundations of Research.
MIT Press, Cambridge.
Baclawski, K., Rota, G.-C., and Billey, S. 1990. Introduction to the theory of probability.
MIT course notes for 18.313.
Becker, S. and Hinton, G. E. 1992. Learning to make coherent predictions in domains
with discontinuities. In Moody, J. E., Hanson, S. J., and Lippmann, R. P., editors,
Advances in Neural Information Processing, volume 4, Denver 1991. Morgan Kaufmann,
Bell, A. J. and Sejnowski, T. J. 1995. An information-maximisation approach to blind
separation. In Advances in Neural Information Processing, volume 7, Denver 1994.
Morgan Kaufmann, San Francisco.
Besl, P. and Jain, R. 1985. Three-Dimensional Object Recognition. Computing Surveys,
Bezdek, J., Hall, L., and Clarke, L. 1993. Review of MR...