


Paul A. Viola, Chapter 7. Conclusion

al Alignment

EMMA bears some similarity to methods used for evaluating and adjusting geometrical alignment. These similarities may be seen by revisiting the entropy derivative of Equation 3.28, and comparing it to the derivative of the following construct.

We define $D$, half the averaged Mahalanobis distance between values in $B$ and their nearest correspondences in $A$,

\[ D(T) \equiv \frac{1}{N_B} \sum_{z_i \in B} \min_{z_j \in A} \frac{1}{2} D(z_i, z_j) . \qquad (7.1) \]

Locally, away from discontinuities, the derivative of the above expression is

\[ \frac{d}{dT} D(T) = \frac{1}{N_B} \sum_{z_i \in B} \min_{z_j \in A} \frac{d}{dT} \frac{1}{2} D(z_i, z_j) . \]

Comparing the above expression with Equation 3.28, we see the following analogy. If the transformation $T$ is adjusted to reduce the "averaged squared differences" between points in $B$ and their counterparts from $A$ that are nearest in signal value, then a reduction in entropy is obtained. This is intuitive, in that entropy will be lower if "clusters in signal value" are tighter, so that nearby signal differences will be smaller. The approximation of this analogy is due to the dissimilarity between max and softmax.

Equation 7.1 is essentially the measure used in chamfer matching techniques, such as the method described by Borgefors (Borgefors, 1988). Huttenlocher (Huttenlocher et al., 1991) has used a related measure in feature matching applications, the Hausdorff distance, which uses the maximum instead of the sum that appears in Equation 7.1. The similarity between geometrical matching and entropy becomes even stronger if one uses the softmax operation to weight the closest element rather than simply selecting the closest, as Wells has (Wells III, 1992b; Wells III, 1992a).

We reiterate that in vision applications, these methods have typically been used to measure aggregate geometrical distance, while here we are measuring aggregate distances among signal values (typically intensities, brightnesses, or surface properties).
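As a rough illustration of Equation 7.1, the following Python sketch computes half the averaged distance between each value in B and its nearest value in A. It is only a sketch under stated assumptions: the signal values are taken to be scalars with a single shared variance, so the Mahalanobis distance reduces to a squared difference divided by that variance. The function name and the `variance` parameter are hypothetical and do not come from the text.

```python
def half_mean_nearest_distance(values_b, values_a, variance=1.0):
    """Half the averaged distance from each value in B to its nearest value in A.

    Assumption: values are scalars, so the (squared) Mahalanobis distance
    between z_i and z_j is (z_i - z_j)**2 / variance.
    """
    if not values_b or not values_a:
        raise ValueError("both samples must be non-empty")
    if variance <= 0.0:
        raise ValueError("variance must be positive")

    total = 0.0
    for z_i in values_b:
        # Select the closest element of A in signal value (the min in Eq. 7.1).
        nearest = min((z_i - z_j) ** 2 / variance for z_j in values_a)
        total += 0.5 * nearest
    # Average over the N_B values in B.
    return total / len(values_b)


# Hypothetical signal values, chosen only for illustration.
sample_a = [0.0, 10.0]
sample_b = [1.0, 9.0, 10.0]
measure = half_mean_nearest_distance(sample_b, sample_a)
```

Replacing the `min` with a softmax-weighted average over the elements of A would give the softer variant mentioned in the text; that variant is not shown here.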
Appendix A

Appendix

A.1 Gradient Descent

In a number of problems described in this thesis one must find a set of parameters that extremizes an evaluation function. Examples include: (1) finding the parameters of a density so that the likelihood of a sample is maximized; (2) finding the pose parameters that best align a model and an image; and (3) finding the weights of a neural network so that it best approximates a function. In each case there is a function of a parameter set, $F(p)$, whose value is to be either maximized or minimized. The parameters are continuous variables, and we are therefore faced with an infinite number of possible solutions. The gradient descent procedure is an effective, though greedy, technique for searching such a space.

There are many closely related gradient descent algorithms. Here we will describe the simplest: steepest descent, or hill climbing. Starting from an initial guess for the parameters, steepest descent is an iterative procedure that uses the partial derivatives of a function to construct an improved estimate for its parameters. Each parameter is updated by

\[ p \leftarrow p + \lambda \frac{\partial F(p)}{\partial p} . \]

The update rate $\lambda$ (which is also known as the learning rate) must be chosen carefully. When $\lambda$ is sufficiently small one can use a Taylor expansion of $F$ to prove that

\[ F\left( p + \lambda \frac{\partial F(p)}{\partial p} \right) \geq F(p) . \]

When $\lambda$ is too small, $p$ might take arbitrarily long to approach a maximum. If $\lambda$ is chosen correctly, $p$ will converge toward the maximum relatively rapidly.

There are many gradient-based techniques that attempt to speed the rate of convergence of $p$. Second-order techniques such as Levenberg-Marquardt and Newton-Raphson use the second derivatives of $F(p)$ to re-estimate $\lambda$. Conjugate gradient techniques attempt to find better directions than the gradient of $F$. In every case one must be careful that the theoretical advantages of the algorithm are not outweighed by the costs of computing it.
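The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis's implementation: the function names, the fixed step count, and the quadratic test function are all assumptions made for the example. Because the update adds the gradient, the loop climbs toward a maximum of F.

```python
def steepest_ascent(gradient, initial_params, rate=0.1, steps=100):
    """Repeatedly apply p <- p + rate * dF/dp for a fixed number of steps.

    `gradient` is a caller-supplied function returning the partial
    derivatives of F at p, one per parameter (hypothetical interface).
    """
    params = list(initial_params)
    for _ in range(steps):
        grad = gradient(params)
        # Move every parameter a small distance along its partial derivative.
        params = [p_k + rate * g_k for p_k, g_k in zip(params, grad)]
    return params


def example_gradient(params):
    """Gradient of the illustrative function
    F(p) = -(p[0] - 3)^2 - (p[1] + 1)^2, which has its maximum at (3, -1)."""
    return [-2.0 * (params[0] - 3.0), -2.0 * (params[1] + 1.0)]


estimate = steepest_ascent(example_gradient, [0.0, 0.0], rate=0.1, steps=200)
```

For this particular quadratic, a rate of 0.1 shrinks the distance to the maximum by a constant factor on every step. A much smaller rate would need many more steps to get as close, which is the slow-convergence behavior described in the text.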
Researchers in neural networks have found that for many problems it is difficult to realize any actual improvement in convergence speed. The problems for which steepest descent works as well as more complex techniques include functions where there are a large number of parameters, since this makes computing the second derivatives quite expensive.

Bibliography

Anderson, J. and Rosenfeld, E., editors (1988). Neurocomputing: Foundations of Research. MIT Press, Cambridge.

Baclawski, K., Rota, G.-C., and Billey, S. (1990). Introduction to the theory of probability. MIT course notes for 18.313.

Becker, S. and Hinton, G. E. (1992). Learning to make coherent predictions in domains with discontinuities. In Moody, J. E., Hanson, S. J., and Lippmann, R. P., editors, Advances in Neural Information Processing, volume 4, Denver 1991. Morgan Kaufmann, San Mateo.

Bell, A. J. and Sejnowski, T. J. (1995). An information-maximisation approach to blind separation. In Advances in Neural Information Processing, volume 7, Denver 1994. Morgan Kaufmann, San Francisco.

Besl, P. and Jain, R. (1985). Three-Dimensional Object Recognition. Computing Surveys, 17:75-145.

Bezdek, J., Hall, L., and Clarke, L. (1993). Review of MR...

This note was uploaded on 02/10/2010 for the course TBE 2300 taught by Professor Cudeback during the Spring '10 term at Webber.
