This preview shows page 1. Sign up to view the full content.
Unformatted text preview: MASSACHUSETTS INSTITUTE OF TECHNOLOGY
ARTIFICIAL INTELLIGENCE LABORATORY
A.I. Technical Report No. 1548 June, 1995 Alignment by Maximization of Mutual
Information
Paul A. Viola This publication can be retrieved by anonymous ftp to publications.ai.mit.edu. Abstract
A new informationtheoretic approach is presented for nding the pose of an object in an image.
The technique does not require information about the surface properties of the object, besides its
shape, and is robust with respect to variations of illumination. In our derivation, few assumptions
are made about the nature of the imaging process. As a result the algorithms are quite general and
can foreseeably be used in a wide variety of imaging situations.
Experiments are presented that demonstrate the approach registering magnetic resonance MR
images with computed tomography CT images, aligning a complex 3D object model to real scenes
including clutter and occlusion, tracking a human head in a video sequence and aligning a viewbased
2D object model to real images.
The method is based on a formulation of the mutual information between the model and the
image called EMMA. As applied here the technique is intensitybased, rather than featurebased.
It works well in domains where edge or gradientmagnitude based methods have di culty, yet it
is more robust than traditional correlation. Additionally, it has an e cient implementation that is
based on stochastic approximation.
Finally, we will describe a number of additional realworld applications that can be solved efciently and reliably using EMMA. EMMA can be used in machine learning to nd maximally
informative projections of highdimensional data. EMMA can also be used to detect and correct
corruption in magnetic resonance images MRI.
Copyright c Massachusetts Institute of Technology, 1995
This report describes research done at the Arti cial Intelligence Laboratory of the Massachusetts Institute of
Technology. Support for the laboratory's arti cial intelligence research is provided in part by the Advanced
Research Projects Agency of the Department of Defense under O ce of Naval Research contract :N0001494010994. Paul Viola was also supported by USAF ASSERT program, Parent Grant:F496209310263. 1 2 Alignment by Maximization of Mutual Information
by Paul A. Viola
Submitted to the Department of Electrical Engineering and Computer Science on
June 1995, in partial ful llment of the requirements for the degree of Doctor of
Philosophy. Abstract Over the last 30 years the problems of image registration and recognition have proven more
di cult than even the most pessimistic might have predicted. Progress has been hampered by the
sheer complexity of the relationship between an object and its image, which involves the object's
shape, surface properties, position, and illumination.
Changes in illumination can radically alter the intensity and shading of an image. Nevertheless,
the human visual system can use shading both for recognition and image interpretation. We will
present a measure for comparing objects and images that uses shading information, yet is explicitly
insensitive to changes in illumination. This measure is unique in that it compares 3D object models
directly to raw images. No preprocessing or edge detection is required. We will show that when the
mutual information between model and image is large they are likely to be aligned. Toward making
this technique a reality we have de ned a concrete and e cient technique for evaluating entropy
called EMMA.
In our derivation of mutual information based alignment few assumptions are made about the
nature of the imaging process. As a result the algorithms are quite general and can be used in a wide
variety of imaging situations. Experiments demonstrate this approach aligning a number of complex
3D object models to real images. In addition, we demonstrate that the same technique can be used
to solve problems in medical registration.
Alignment is accomplished by adjusting the pose of an object until the mutual information
between image and object is maximized. We will present a gradient descent alignment procedure
based on stochastic approximation that has a very e cient implementation. For this application
stochastic approximation a ords a speed up of at least a factor of 500 over gradient descent. In
addition, stochastic approximation can be used to accelerate a variety of other vision applications.
We will describe an existing vision application which can be accelerated by a factor of 30 using
stochastic approximation.
Finally, we will describe a number of additional realworld applications that can be solved efciently and reliably using EMMA. EMMA can be used in machine learning to nd maximally
informative projections of highdimensional data. EMMA can also be used to detect and correct
corruption in magnetic resonance images MRI. Thesis Committee: Prof.
Prof.
Prof.
Prof. Tomas LozanoPerez CoSupervisor
Christopher G. Atkeson CoSupervisor
W. Eric L. Grimson
Berthold K. P. Horn
3 4 Acknowledgments
I would like to acknowledge the MIT Arti cial Intelligence Laboratory for being a haven of
intellectual freedom. Professors Tom
s LozanoP
rez, Christopher Atkeson, Eric Grimson,
a
e
Rodney Brooks, and others have worked to build an environment where every resource is
unfettered. Chris and Tom
s have supported me through thick and through thin. They've
a
taught me how truly valuable unconditional support can be.
Students are the AI lab's most precious resource and I bene ted from discussions with
Phil Agre, Davi Geiger, David Chapman, Jose Robles, Tao Alter, Misha Bolotski, Jonathan
Connel, Karen Sarachik, Maja Mataric, Ian Horswill, Colin Angle, Cynthia Ferrel, Henry
Minsky, Saed Younis, Rick Lathrope, Barbara Moore, and many others. Willliam Wells
has stood out among all those I've met at MIT. His unique approach to vision, his care in
research and his advice as a friend have proven invaluable. I owe much to John Shewchuk,
then of Brown University, for introducing me to the eld of statistical learning. Memories of
his irrepressible intellectual curiosity serve to continually motivate my own thinking.
Outside of MIT I have spent many productive months at the Salk Institute in the Computational Neurobiology Laboratory of Terrence Sejnowski. Terry is the most tirelessly devoted
scientist that I have ever met. In his lab I learned that science is an uncompromising pursuit
of truth. Science is about building on the work of others, and that to build one must rst
understand. In Terry's lab I have had the pleasure of working with David Lawrence, Rich
Zemel, Nici Shraudolph, Tony Bell, and Peter Dayan. Each of them has had an e ect on
some part of this thesis.
Most importantly I must recognize the technical contributions of Sara Billey. She worked
tirelessly with me to understand the most cryptic mathematics and clear up my own cryptic
thinking. Without her this thesis would not exist. 5 To:
Sara Billey for being the love of my life.
My parents Mary AnconaViola and Alfredo Viola for making it all possible. 6 Contents
1 Introduction 9 1.1 An Introduction to Alignment : : : : : : : : : : : : : : : : : : : : : : : : : : 11
1.1.1 An Alignment Example : : : : : : : : : : : : : : : : : : : : : : : : : : 12
1.2 Overview of the Thesis : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 18 2 Probability and Entropy 2.1 Random Variables : : : : : : : : : : : : : : : : : : : : :
2.2 Entropy : : : : : : : : : : : : : : : : : : : : : : : : : :
2.2.1 Di erential Entropy : : : : : : : : : : : : : : : :
2.3 Samples versus Distributions : : : : : : : : : : : : : : :
2.3.1 Model Selection, Likelihood and Cross Entropy :
2.4 Modeling Densities : : : : : : : : : : : : : : : : : : : :
2.4.1 The Gaussian Density : : : : : : : : : : : : : :
2.4.2 Other Parametric Densities : : : : : : : : : : : :
2.4.3 Parzen Window Density Estimation : : : : : : : :
:
:
:
:
:
:
:
: :
:
:
:
:
:
:
:
: :
:
:
:
:
:
:
:
: :
:
:
:
:
:
:
:
: :
:
:
:
:
:
:
:
: :
:
:
:
:
:
:
:
: :
:
:
:
:
:
:
:
: :
:
:
:
:
:
:
:
: 3 Empirical Entropy Manipulation and Stochastic Gradient Descent
3.1 Empirical Entropy : : : : : : : : : : : : : : : :
3.2 Estimating Entropy with Parzen Densities : : :
3.3 Stochastic Maximization Algorithm : : : : : : :
3.3.1 Estimating the Covariance : : : : : : : :
3.4 Principal Components Analysis and Information
3.5 Conclusion : : : : : : : : : : : : : : : : : : : : : 4 Matching and Alignment :
:
:
:
:
: :
:
:
:
:
: :
:
:
:
:
: :
:
:
:
:
: :
:
:
:
:
: :
:
:
:
:
: :
:
:
:
:
: :
:
:
:
:
: :
:
:
:
:
: :
:
:
:
:
: :
:
:
:
:
: :
:
:
:
:
: :
:
:
:
:
:
:
:
:
:
:
:
:
:
: :
:
:
:
:
:
:
:
:
:
:
:
:
:
: :
:
:
:
:
:
:
:
:
:
:
:
:
:
: :
:
:
:
:
:
:
:
:
:
:
:
:
:
: 20
21
26
30
31
32
35
35
38
41 52
53
57
60
67
68
73 76 4.1 Alignment : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 77
7 4.2
4.3
4.4
4.5 4.1.1 Correlation as a Maximum Likelihood Technique :
4.1.2 Correlation and Mutual Information : : : : : : : :
Weighted Neighbor Likelihood vs. EMMA : : : : : : : :
4.2.1 Nonfunctional Signals : : : : : : : : : : : : : : :
Alignment Derivation : : : : : : : : : : : : : : : : : : : :
Matching and Minimum Description Length : : : : : : :
Summary : : : : : : : : : : : : : : : : : : : : : : : : : : 5 Alignment Experiments 5.1 Alignment of 3D Objects to Video : : : : : : :
5.1.1 Alignment of Skull Model : : : : : : :
5.1.2 Alignment of Head Model : : : : : : :
5.1.3 Alignment of Curved Surfaces : : : : :
5.2 Medical Registration Experiments : : : : : : :
5.2.1 Three Dimensional MR CT Alignment
5.3 View Based Recognition Experiments : : : : :
5.3.1 Photometric Stereo : : : : : : : : : : :
5.4 Limitations of EMMA Alignment : : : : : : : :
:
:
:
:
:
:
:
: :
:
:
:
:
:
:
:
: :
:
:
:
:
:
:
:
: :
:
:
:
:
:
:
:
: :
:
:
:
:
:
:
:
: :
:
:
:
:
:
:
:
: :
:
:
:
:
:
:
:
:
:
:
:
:
:
:
: :
:
:
:
:
:
:
:
:
:
:
:
:
:
:
: :
:
:
:
:
:
:
:
:
:
:
:
:
:
:
: :
:
:
:
:
:
:
:
:
:
:
:
:
:
:
: :
:
:
:
:
:
:
:
:
:
:
:
:
:
:
: :
:
:
:
:
:
:
:
:
:
:
:
:
:
:
: :
:
:
:
:
:
:
:
:
:
:
:
:
:
:
: :
:
:
:
:
:
:
:
:
:
:
:
:
:
:
: :
:
:
:
:
:
:
:
:
:
:
:
:
:
:
: :
:
:
:
:
:
:
:
:
:
:
:
:
:
:
: :
:
:
:
:
:
:
:
:
:
:
:
:
:
:
: 79
81
93
94
100
101
105 106
107
111
119
123
126
128
132
133
135 6 Other Applications of EMMA 137 7 Conclusion 147 A Appendix 151 6.1 Bias Compensation : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 137
6.2 Alignment of Line Drawings : : : : : : : : : : : : : : : : : : : : : : : : : : : 142
7.1 Related Work : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 148
7.2 A Parallel with Geometrical Alignment : : : : : : : : : : : : : : : : : : : : : 149
A.1 Gradient Descent : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 151 8 Chapter 1
Introduction
This thesis is about a new information theoretic approach for solving several standing problems in computer vision and image processing. For example, this approach can be used to
nd the correct alignment between a three dimensional model and an image. While alignment is a critical component of the object recognition problem, it is also useful by itself in
medical and military applications. We will also describe several other applications, including
an image processing application and a new form of unsupervised learning. While the form
of these applications is quite di erent the underlying theory and derivations are very similar.
Preliminary investigation imply that the theory presented here will have wide application.
Computer vision has proven more di cult than even the most pessimistic might have
predicted. While the problem has been of interest for over 30 years, progress has been
painstakingly slow. Even the best computer vision systems stand in stark contrast to the
human visual system: our perception of images is e ortless and robust; computer vision
systems are at best slow and unreliable. Among other di culties, progress has been hampered
by the sheer complexity of the relationship between image and object, which involves the
object's shape, surface properties, position, and illumination.
A computer vision program is faced with the task of interpreting an image of intensities.
While information about the shape and location of objects is somehow embedded in these
intensities, the actual intensities that arise in an image are di cult to interpret. For example,
changes in illumination can radically alter the intensity and shading of an image. Though 9 Paul A. Viola CHAPTER 1. INTRODUCTION the human visual system can use shading both for recognition and image interpretation,
most existing computer object recognition systems cannot. These systems throw out shading
information in an e ort to obtain illumination invariance". We will present a measure for
comparing objects and images that uses shading information, yet is explicitly insensitive to
changes in illumination. This measure is unique in that it compares 3D object models directly
to raw images. No preprocessing or edge detection is required.
This image model comparison measure has been rigorously derived from information
theory. Both the theory and algorithms involved are new, and are based on a e cient
scheme for evaluating mutual information called EMMA1. The derivation of the the alignment
procedure requires few assumptions about the nature of the imaging process. As a result
the algorithms are quite general and can be used in a wide variety of imaging situations.
Experiments demonstrate that this approach can align a number of complex 3D object models
to real images. In addition, the same technique can be used to solve problems in medical
registration.
Alignment adjusts the pose of an object until the mutual information between image and
object is maximized. Pose adjustment can be accomplished by ascending the gradient of
mutual information. We will present an alignment procedure based on stochastic approximation that a ords a speed up of at least a factor of 500 over gradient ascent. In addition,
stochastic approximation can be used to accelerate a variety of other vision applications. We
will describe an existing vision application which can be accelerated by a factor of 30 using
stochastic approximation.
EMMA has also proven useful in a number of tasks beside alignment. For example, an
entropy minimization framework that can be used to detect and correct corruption in magnetic
resonance images MRI. EMMA can also be used to de ne a new form of unsupervised
learning. Unsupervised learning has been popularized in the neural network literature as a
scheme for simplifying the representations of complex data. EMMA can be used to nd lowdimensional projections of a high dimensional input space that are maximally informative. EMMA is a random but pronounceable subset of the letters in the words EMpirical entropy Manipulation
and Analysis".
1 10 1.1. AN INTRODUCTION TO ALIGNMENT AITR 1548 1.1 An Introduction to Alignment
The general problem of alignment entails comparing a predicted image of an object with an
actual image. Given an object model and a pose coordinate transformation, a model for
the imaging process could be used to predict the image that will result. If we had a good
imaging model then deciding whether an image contained a particular model at a given pose
is straightforward: compute the predicted image and compare it to the actual image directly.
Given a perfect imaging model the two images will be identical, or close to it. Of course
nding the correct alignment is still a remaining challenge.
The relationship between an object model no matter how accurate and the object's
image is a complex one. The appearance of a small patch of a surface is a function of the
surface properties, the patch's orientation, the position of the lights and the position of the
observer. Given a model ux and an image vy we can formulate an imaging equation, vT x = F ux; q
or equivalently, 1.1 vy = F uT ,1y; q : 1.2
The imaging equation is separable into two distinct components. The rst component is
called a transformation, or pose, denoted T . It relates the coordinate frame of the model
to the coordinate frame of the image. The transformation tells us which point in the model
is responsible for a particular point in the image. The second component is the imaging
function, F ux; q. The imaging function determines the value of image point vT x. In
general a pixel's value may be a function both of the model and other exogenous factors. For
example an image of a three dimensional object depends not only on the object but also on
the lighting. The parameter, q, collects all of the exogenous in uences into a single vector.
One reason that it is, in principle, possible to de ne F is that the image does convey
information about the model. Clearly if there were no mutual information between u and v,
there could be no meaningful F . We propose to nesse the problem of nding and computing
F by dealing with this mutual information directly. We will present an algorithm that aligns
by maximizing the mutual information between model and image. It requires no a priori
model of the relationship between surface properties and scene intensities it only assumes
11 Paul A. Viola CHAPTER 1. INTRODUCTION that the model tells more about the scene when it is correctly aligned. 1.1.1 An Alignment Example
One of the alignment problems that we will address involves nding the pose of a threedimensional object that appears in a video image. This problem involves comparing two
very di erent kinds of representations: a threedimensional model of the shape of the object
and a video image of that object. For example, Figure 1.1 contains a video image of an
example object on the left and a depth map of that same object on the right the object in
question is a person's head: Ron. A depth map is an image that displays the depth from
the camera to every visible point on the object model. A depth map is a complete description
of the shape of the object, at least the visible parts.
From the depth map alone it might be di cult to see that the image and the model are
aligned. The task can be made much easier, at least for us, if we simulate the imaging
process and construct an image from the 3D model. Figure 1.2 contains two computer
graphics renderings of the object model. These synthetic images are constructed assuming
that the 3D model has a Lambertian surface and that the lighting comes from the right. It
is almost immediately obvious that the model on the left is more closely aligned to the true
image than the model on the right. Unfortunately, what we nd trivial is very di cult for a
computer. The intensities of the true video image and the synthetic images are very di erent.
The true image and the correct model image are in fact uncorrelated. Yet any person can
glance at these images and decide that both are images of a head and that both heads are
looking in roughly the same direction. The human visual system is capable of ignoring the
super cial di erences that arise from changes in illumination and surface properties.
It is not easy to build an automated alignment procedure that can make this kind of
comparison. It is harder still to construct a system that can nd the correct model pose.
We have built such a system. That system selected the pose of the model shown at left in
Figure 1.2.
As mentioned above, the synthetic images of Ron were generated under the assumption
the model surface is Lambertian and the lighting is from the right. Lambert's law is perhaps
the simplest model of surface re ectivity. It is an accurate model of the re ectance of a matte
12 1.1. AN INTRODUCTION TO ALIGNMENT AITR 1548 Figure 1.1: Two di erent views of Ron. On the left is a video image. On the right is a
depth map of a model of Ron. A depth map describes the distance to each of the visible
points of the model. White denotes points that are closer, black further. Figure 1.2: At left is a computer graphics rendering of a 3D model of Ron. The position of
the model is the same as the position of the actual head. At right is a rendering of the head
model in an incorrect pose.
or nonshiny surface. Lambert's law states that the visible intensity of a surface patch is
related to the dot product between the surface normal and the lighting. For a Lambertian
object the imaging equation is:
X~
vT x =
1.3
i li ux ;
i where the model value ux is the normal vector of a surface patch on the object, li is a
vector pointing toward light source i, and i is proportional to the intensity of that light
source Horn, 1986 contains an excellent review of imaging and its relationship to vision.
13 Paul A. Viola CHAPTER 1. INTRODUCTION Drawing the explicit parallel between 1.1 and 1.3 we can see that the imaging function is
X~
F ux; q =
1.4
i li ux ;
i where q = f i; ~ig. As the illumination changes the functional relationship between the model
l
and image will change.
Since we can not know beforehand what the imaging function will be, aligning a model and
image can be quite di cult. These di culties are only compounded if the surface properties of
the object are not well understood. For example, many objects can not be modeled as having
a Lambertian surface. Di erent surface nishes will have di erent re ectance functions. In
general re ectance is a function of lighting direction, surface normal and viewing direction.
The intensity of an observed patch is then:
X
vT x = R i; ~i;~; ux ;
lo
1.5
i where ~ is a vector pointing toward the observer from the patch and R is the re ectance
o
function of the surface. For an unknown material a great deal of experimentation is necessary to completely categorize the re ectance function. Since a general vision system should
work with a variety of objects and under general illumination conditions, overly constraining
assumptions about re ectance or illumination should be avoided.
Let us examine the relationship between a real image and model. This will allow us to
build intuition about the alignment process. Data from the real re ectance function can
be obtained by aligning a model to a real image. An alignment associates points from the
image with points from the model. If the alignment is correct, each pixel of the image can
be interpreted as a sample of the imaging function R. The imaging function could be displayed by plotting intensity against lighting direction, viewing direction and surface normal.
Unfortunately, because intensity is a function of so many di erent parameters the resulting
plot can be prohibitively complex and impossible to visualize. Signi cant simpli cation will
be necessary if we are to detect any structure in this data.
In a wide variety of real images we can assume that the light sources are far from the
object at least in terms of the dimensions of the object. When this is true and there are no
shadows, each patch of the object will be illuminated in the same way. Furthermore, we will
14 1.1. AN INTRODUCTION TO ALIGNMENT AITR 1548 assume that the observer is far from the object, and that the viewing direction is therefore
constant throughout the image. The resulting relationship between normal and intensity is
three dimensional: the normal vector has unit length and is determined by two parameters,
its x and y components; the intensity is a third parameter. A three dimensional scatter plot
of normal versus intensity is really a slice through the high dimensional space in which R
is de ned. Though this graph is much simpler than the original, three dimensional plots are
still quite di cult to interpret. We will slice the data once again so that all of the points have
a single value for the y component of the normal.
Figure 1.3 contains a graph of the intensities along a single scanline of the image of Ron.
Figure 1.4 shows similar data for the correctly aligned model of Ron. Model normals from
this scanline are displayed in two graphs: the rst shows the x component of the normal
while the second shows the y component. Notice that we have chosen this portion of the
model so that the y component of the normal is almost constant. As a result the relationship
between normal and intensity can be visualized in only two dimensions. Figure 1.5 shows the
intensities in the image plotted against the x component of the normal in the model. Notice
that this relationship appears both consistent and functional. Points from the model with
similar surface normals have very similar intensities. The data in this graph could be well
approximated by a smooth curve. We will call an imaging function like this one consistent.
Interestingly, we did not need any information about the illumination or surface properties
of the object to determine that there is a consistent relationship between model normal and
image intensity.
Figure 1.6 shows the relationship between normal and intensity when the model and
image are no longer aligned. The only di erence between this graph and the rst is that the
intensities come from a scanline 3 centimeters below the correct alignment i.e. the model is
no longer aligned with the image, it is 3 centimeters too low. The normals used are the same.
The resulting graph is no longer consistent. It does not look as though a simple smooth curve
would t this data well.
In summary, when model and image are aligned there will be a consistent relationship
between image intensity and model normal. This is predicted by our assumption that there
is an imaging function that relates models and images. While the actual form of this function
depends on lighting and surface properties, a correct alignment will always lead to a consistent
relationship. Conversely, when model and image are misaligned the relationship between
15 Paul A. Viola 16 1 0.8 0.6 0.4 0 20 60 80 CHAPTER 1. INTRODUCTION 40 Position Though developed for alignment, the EMMA estimates of entropy and mutual information can be used in other applications. We will show that EMMA can be used to correct
inhomogeneities in MRI scans. In addition, we will derive a new approach for dimensionality
reduction based on entropy. Similar to principal components analysis, our technique can nd
low dimensional projections of higher dimensional data that preserve the most information This same technology can be used for the alignment of other types of signals. In its full
generality, EMMA can be used whenever there is a need to align images from two di erent
sensors, the socalled sensor fusion" problem. For example, in medical imaging data from
one type of sensor such as magnetic resonance imaging must be aligned to data from another
sensor such as computed tomography. We will demonstrate that EMMA can be used to
solve problems such as this. A major contribution of this thesis is the derivation of a formal technique that delivers
a principled estimate of consistency". We will show that when the mutual information
between an image and a model is high they are likely to be aligned. Toward making this
technique a reality we have de ned a new approach for evaluating entropy and information
called EMMA. We have also de ned an e cient scheme for adjusting a set of parameters so
that mutual information and entropy can be optimized. We will use EMMA to e ectively
evaluate and adjust the alignment of three dimensional models and two dimensional images. intensity and normal is inconsistent. Figure 1.3: On the left is a video image of Ron with the single scanline highlighted. On
the right is a graph of the intensities observed along this scan line. Real Intensity. 17 1 0.5 0 0.5 1 1 0.5 0 0.5 1 0 0 20 20 40 60 60 Position 40 Position 80 80 AITR 1548 Figure 1.4: On the left is a depth map of Ron with the single scanline highlighted. At
top right is a graph of the x component of the surface normal. On the bottom right is the y
component of the normal. 1.1. AN INTRODUCTION TO ALIGNMENT Normal: X Comp. Normal: Y Comp. Paul A. Viola CHAPTER 1. INTRODUCTION Intensity 1
0.8
0.6
0.4
0.2
0
0.8 0.6 0.4 0.2 0 0.2 0.4 0.6 X Component Normal Figure 1.5: The Aligned Case: A scatter plot of the intensity of the video image versus
the x component of the surface normal from the model. The image and model are correctly
aligned. Intensity 1
0.8
0.6
0.4
0.2
0
0.8 0.6 0.4 0.2 0 0.2 0.4 0.6 X Component Normal Figure 1.6: The Misaligned Case: On the left is the misaligned scanline from the video
image of Ron. On the right is a scatter plot of the intensity of this part of the video image
versus the x component of the surface normal from the model.
possible. 1.2 Overview of the Thesis
The second chapter contains an overview of the probability theory necessary to understand
what EMMA is doing and how it does it. The third chapter discusses estimation of entropy
from samples. While a number of techniques for manipulating entropy currently exist, EMMA
combines computational e ciency with the exibility necessary to model a wide variety of
18 1.2. OVERVIEW OF THE THESIS AITR 1548 distributions. The fourth chapter returns to our discussion of alignment. We show here that
EMMA is capable of aligning signals where simpler techniques cannot. This chapter will
present the basic equations that underly alignment by maximization of mutual information.
The fth chapter contains a wide variety of alignment experiments designed both to validate
our approach and explore the scope of possible application. In chapter six we will describe
applications besides alignment to which we have applied EMMA. For example, our scheme
for e ciently manipulating entropy includes a stochastic form of gradient descent. We will
describe a ow estimation problem in which stochastic gradient descent speeds convergence
by a factor of thirty. Chapter seven will include a discussion of our results and a comparison
with related work. 19 Chapter 2
Probability and Entropy
One of the key insights in this thesis is that many of the techniques that are common in
computer vision, such as correlation, are easily interpreted as statistics of random variables.
Once this is done we can use a broad range of tools from probability to analyze the behavior of
these statistics. The theory of probability will help us determine how statistics converge, what
they converge to, and more importantly how alternative statistics might be more appropriate.
In this chapter we will introduce the basic mathematics that underly probability. In
subsequent chapters we will assume that the reader has a fairly thorough knowledge of probability, statistics, entropy, and coding. This chapter is intended both as a review of the
required techniques and theorems and as a bridge for a reader unfamiliar with these topics.
The nal sections of this chapter contain a new analysis and discussion of Parzen density
estimation. Parzen density estimation will play an important role in the estimation of entropy
in subsequent chapters.
We will sometimes use a simpli ed, or looser, de nition of concepts like events and random
variables than is typical. If you get overly confused reading this chapter, any good book on
probability should clear things up Papoulis, 1991; Baclawski et al., 1990. In general we
will leave out the proofs of anything that is easily looked up, and of course most of the theory
presented is cited here without reference. Unfortunately probability and statistics seems to
have many con icting standard notations. In our own de nitions we will try to be consistent
with the prevailing conventions. 20 2.1. RANDOM VARIABLES AITR 1548 2.1 Random Variables
In many cases an algebraic model of a physical system allows us to accurately predict its
behavior. For instance, circuit theory can be used to analyze a particular circuit and predict
that when a switch is closed current will ow. The physics of many circuits can be modeled
as equations where unknown quantities are recorded as variables. In the case of a switched
circuit, we can model the resistance of a switch as a variable that can take on one of two
values: zero when closed or in nity when open. The current that ows through that resistor
can then be predicted from algebraic manipulations. Conversely, knowing the value of the
current allows us to predict whether the switch is open or closed. The equivalence of a circuit
and a circuit model is fundamental within the elds of physics and engineering.
In a wide variety of physical systems the behavior of particular measurements cannot
be easily predicted. The voltage of a wire may be a complex function of the circuit and the
thermal noise in a resistor. Even when all of the other circuit variables are known, the voltage
cannot be predicted accurately. Luckily, all hope is not lost. We may not know the actual
voltage but we may know that it will be near" V0 , and that it is never, in our experience,
higher than Vmax . Probability, random processes, and random variables provide the tools to
quantify the intuitive concepts of near" and never".
A random variable, or RV, is a variable whose value is unpredictable. Recall that a variable
is a symbol X and a set of values X over which the variable can range. For example, X
could range over the real numbers between 1 and 10. In this thesis we will assume X will
always be a subset of the real numbers. A random variable X is a variable together with a
function PX : X ! 0; 1 called a probability distribution. For example we can construct
an RV that models the roll of a six sided die. If the die is fair" we cannot in advance know
what its value will be, but we do know that its value will be one of 6 integers from 1 to
6, and that each will appear roughly one sixth of the time. The RV that describes this die
includes the variable symbol X , a sample space X = f1; 2; 3; 4; 5; 6g of possible outcomes,
and a probability distribution function PX n which tells us the probability that X will take
on the value n. A particular value of an RV is called a trial, for example from a die roll. A
collection of trials is called a sample. An event is a set A such that A X . The probability
of an event, PX X 2 A is the proportion of times that you expect to see event A in a large
21 Paul A. Viola CHAPTER 2. PROBABILITY AND ENTROPY sample. The sum over the sample space of the probability distribution equals one:
X
P X 2 fxig = 1 :
xi 2 X Here we denote the elements of the sample space X with the lower case letter x. In many
cases we will write PX xi, P X = xi or P xi for PX X 2 fxig1. An RV which takes on
a nite or discrete set of values is known as a discrete random variable. An RV whose range
includes some in nite set of continuous values is known as a continuous random variable.
A bit of thought leads one to a conundrum regarding continuous RV'ssince there are
an in nite number of possible outcomes the probability of almost every outcome will be zero.
This will in fact be a continuing annoyance to us as we move toward the de nition of entropy.
Instead of probability distributions for continuous RV's we use probability densities: pX x0 = lim P x0 X x0 + :
!0
The probability of an event can just as easily be de ned from the density by
Z xhigh
P xlow X xhigh =
pX xdx :
xlow The probability density of an RV always integrates to 1,
Z1
pX xdx = PX ,1 X 1 = 1 :
,1 It is not true, however, that 0 pX x 1. Probability densities are always nonnegative,
but can have arbitrarily large values. Often densities can be manipulated in the same way
that distributions are. In subsequent discussion we will avoid duplication whenever de nitions
and theorems are the same for both distributions and densities. Typically probability books say that this is only done when there is no chance for confusion. We know
better.
1 22 2.1. RANDOM VARIABLES AITR 1548 Simple Statistics
A random variable model of a process allows us to answer a variety of quantitative questions
about the behavior of the process. Though the voltage across a resistor is unpredictable, its
long term average is not. Let us de ne the intuitive notion of long term average" as the
expected value or mean of an RV. The expected value EX X is de ned as:
X
EX X xiP X = xi ;
2.1
xi 2 X or Z EX X xpX = xdx : 2.2 For notational convenience we will sometimes refer to the expectation of X as E X . The
mean of a random variable is a deterministic function of its distribution. Intuitively E X
is the average of the RV's value over a large sample. We will denote a sample a, somewhat
nonstandardly, by an ordered collection of trials xa,
a = :::xa::: :
The size of a sample kak we will refer to as Na. In a small abuse of notation we will write
1X Ea X N
a xa ; for the average over the sample a. Unlike the mean, the sample mean is a random variable.
The law of large numbers allows us to prove that in the limit the sample mean equals the
expectation:
1X
EX X = Nlim Ea X = Nlim N xa :
2.3
a!1
a !1
a The mean is an example of a statistic. Statistics are deterministic values computed from
an RV that sum up its gross or long term behavior. Statistics of X are de ned as the
expectation of functions of X or possibly P X .
By itself, E X does not tell us much about X . For example, the average lottery number
does not help us guess what the next lottery number will be. In addition to knowing the
mean, we would like to know on average how close samples of X will be to the mean. We
23 Paul A. Viola CHAPTER 2. PROBABILITY AND ENTROPY can tell that the average lottery number is a useless statistic by the fact that the variation in
lottery numbers is huge. One measure of expected variation is called variance and is de ned
as:
V arx EX X , EX X 2 = EX X 2 , EX X 2 :
2.4
The square root of variance is the standard deviation, X . The standard deviation is a
measure of how far, on average, the samples of X will be from E X .
Though the expectation of an RV is equal to the in nite mean, as in 2.3, we have not
explored its relation to the sample mean. Is the sample mean a good estimate for the true
mean of an RV? In a quali ed sense the answer is yes. The expectation of the sample mean
is the same as the expectation:
1X
1X
E Ea X = E N xa = N E xa = E X :
a a Expectation, because it is de ned as an integral, is linear and can be moved inside the
summation. The sample mean is often called an unbiased estimator of the true mean. But,
how close on average will the sample mean be to the true mean? Under the assumption that
the di erent trials of X are independent and identically distributed, the standard deviation
of the sample mean is
Ea X = pX : Na Therefore, the standard deviation of the sample mean approaches 0 as Na approaches in nity.
We can conclude that the sample mean is an unbiased estimate for the true mean, and that
the quality of the estimate is better for larger samples.
The mean and variance are the zeroth and rst elements of an in nite class of moment
statistics. These statistics can used to classify the behavior of an RV with ever increasing
accuracy. The Algebra of Random Variables
Random variables are useful descriptions of processes that occur in the real world. RV's can
be used in algebraic equations just as variables are. The value of an equation that includes
an RV is another random process. A new RV, Y , can be de ned from X as Y = F X . For
24 2.1. RANDOM VARIABLES AITR 1548 discrete RV's, the probability distribution of Y is easily de ned as: PY F n = PX n. For
continuous RV's it is not quite as simple, x
pY F x = pdX x :
F
dx Intuitively this equation tells us to scale down the density at points where @F x is large. In
@x
these regions F acts to stretch out X . The density pX x gets diluted by this stretching.
With this new theory of random variables, and many identities that can only really be
hinted at here, we can begin to analyze systems such as the noisy circuit described above. We
can answer questions like, If there is random noise in the voltage from a power supply, how
much variation will there be in the current across a resistor on the other side of the circuit?"
In general this kind of analysis starts from a description of the distribution of one RV and
derives the distribution of other functionally related RV's in the system. Joint and Conditional Distributions
When one RV is a function of another RV, as in Y = F X , Y and X are said to be dependent.
Measuring X allows us to predict Y . It is also possible that two RV's are related but not
directly predictable from each other. An example is a noisy voltage source that is powering
a noisy current source. Actually measuring voltage tells you something about current, but it
doesn't tell you everything. There is still unpredictability that arises from the current source
itself. Finally, it is possible that two RV's are completely independent. For example, two
di erent rolls of a fair die are considered independent.
Dependency can be formalized by examining the joint distribution of two RV's, P X; Y .
The joint distribution tells us about the cooccurrence of events from the RVs X and Y . It is
a complete description of the random behavior of both X and Y . From the joint distribution
one can compute the marginal distributions:
X
P X =
P X; Y = y
y2 Y 25 Paul A. Viola and CHAPTER 2. PROBABILITY AND ENTROPY P Y = Two variables are independent if X
x2 X P X = x; Y : P X; Y = P X P Y : 2.5 They are considered dependent when the joint distribution is not the product of marginal
distributions. A closely related distribution, the conditional distribution, P Y j X , is the
probability of Y if we knew X . It is de ned as:
P Y j X = PPX; Y :
X
Complete, functional dependence can be determined from conditional probability when it is
the case that for all x 2 X that P Y = F x j X = x = 1 :
What is known as Bayes' Law can be concluded from the following equation:
P X j Y = PPX; Y P X = P Y j X P X :
Y P X
P Y
Bayes' Law inverts conditional probabilities. It is quite useful in situations where one would
like to conclude the distribution of X from a measurement of Y , but in principle all that is
known is P Y j X . 2.2 Entropy
Entropy is a statistic that summarizes randomness. The de nition of a random variable
makes no mention of how random the variable is. Is a lottery number more or less random
than the roll of a die? Entropy helps us answer this question. As we will see, the more
random a variable is the more entropy it will have. Much additional material on entropy can
be found in the excellent textbook by Cover and Thomas Cover and Thomas, 1991.
26 2.2. ENTROPY AITR 1548 Entropy in one form or another is a very old concept. Its origins clearly date back to the
rst work on thermodynamics in the last century. Nonetheless, most of the credit for de ning
entropy and promoting its use in data analysis and engineering falls to Shannon Shannon,
1948. The most straightforward de nition of entropy is as an expectation:
X
H X = ,EX logP X = ,
logP X = xiP X = xi :
xi 2 X where we de ne 0 log0 = 0 here and elsewhere in the thesis. The classical de nition of
entropy applies only to discrete random variables. We will present the de nition of continuous
entropy, known as di erential entropy, later. Entropy is typically de ned in terms of the
logarithm base 2. In that case entropy is given in units of bits. Entropy is Code Length
One way of measuring randomness is to compose the shortest message that describes either
one or a number of trials of an RV. A trial of a fair coin takes one bit of information to
encode: a 1 for heads and a 0 for tails. There is no more e cient technique for encoding a
single trial2. This restriction does not apply to a message that describes a sample of many
trials. If the coin in question comes up tails only one time in a thousand, there are a number
of straightforward schemes for encoding a sample that require less than one bit per trial. For
instance one could send the length of the sample a and then the position of the zeros. It
would take logNa bits to send the length and logNa bits to send the position of each zero.
The length of a message describing Na ips of our coin would on average be E lengtha = logNa + P X = 0Na logNa :
In this coding scheme a message describing one thousand trials will on average take about
20 bits. The number of bits it takes to encode a sample is dependent both on the number of
events and the distribution of the random variable.
A discussion and comparison of coding schemes could take up quite a lot of space. Luckily,
using the Kraft inequality it can be proven that on average one needs to send H X bits to
communicate a trial of the random variable X . Furthermore, Shannon showed that is possible
2 Provided the coin doesn't have two heads. 27 Paul A. Viola CHAPTER 2. PROBABILITY AND ENTROPY to construct a code that will take at most H X + 1 bits on average. A simple algorithm
discovered by Hu man can construct the shortest possible codes for any random variable.
Because entropy is a bound on the code length that is required to transmit a trial, entropy
is often called information. Conditional Entropy, Joint Entropy, and Mutual Information
The concept of mutual information plays a critical role in this thesis. One of the key problems
that we will need to solve is, How likely is it that the random variable Y is functionally
dependent on X ?" In Section 2.1 we saw that two RV's were independent if and only if their
joint density was the product of their marginal densities see 2.5. Entropy will allow us to
quantify the extent to which two RV's are dependent.
Quantifying dependence is very much like quantifying randomness. Total dependence
implies that a measurement of one RV completely determines the other i.e. knowledge of
X removes any randomness from Y . Independence is just the opposite i.e. knowledge
of X does not help you predict Y . Just as joint and conditional distributions relate the
cooccurrences of two RV's, entropy can be used to relate the predictability of two RV's.
Conditional entropy and joint entropy are de ned as: H Y j X EX EY logP Y j X
and H Y; X EX EY logP Y; X :
Conditional entropy is a measure of the randomness of Y given knowledge of X . Note that it
is an expectation over the di erent events of X , so it measures on average just how random
Y is given X . H Y j X = x is the randomness one expects from Y if X takes on a particular
value. Random variables are considered independent when H Y j X = H Y
or, H X; Y = H X + H Y :
28 2.2. ENTROPY AITR 1548 As Y becomes more dependent on X , H Y j X gets smaller. However, conditional
entropy by itself is not a measure of dependency. A small value for H Y jX may not imply
dependence, it may only imply that H Y is small. The mutual information, MI, between
two random variables is given by I X; Y = H Y , H Y j X : 2.6 I X; Y is a measure of the reduction in the entropy of Y given X .
A number of simple logarithm equalities can be used to prove relations between conditional
and joint entropy. For instance, conditional entropy can be expressed in terms of marginal
and joint entropies:
H Y j X = H X; Y , H X :
This allows us to provide three equivalent expression for mutual information and a useful
identity: I X; Y = H Y , H Y j X
= H X + H Y , H X; Y
= H X , H X j Y
= I Y; X : 2.7
2.8
2.9
2.10 An extremely useful inequality on expectations, known as Jensen's inequality, allows us
to prove that for any concave function F that E F X F E X :
A function is concave when its second derivative is negative everywhere. Using the fact that
the logarithm function is concave, Jensen's inequality allows us to prove the following useful
inequalities: H X 0
H Y H Y j X
I X; Y 0 29 2.11
2.12
2.13 Paul A. Viola CHAPTER 2. PROBABILITY AND ENTROPY 2.2.1 Di erential Entropy
While a number of the main theorems of entropy apply both to continuous and discrete
distributions, a number of other theorems change signi cantly. The continuous version of
entropy is called di erential entropy, and is de ned as:
Z1
hX ,EX logpX = , px logpxdx :
2.14
,1 For the most part di erential entropies can be manipulated, and obey the same identities,
as entropy. In fact all of the equalities and inequalities of the previous section hold except
for 2.11. Throughout the thesis, when entropy is mentioned, it is to be understood as the
applicable form of entropy. When the di erence matters we will be explicit.
The most perplexing di erence between entropy and di erential entropy is that there is
no longer a direct relationship between hX and code length. It is possible to construct
examples where di erential entropy is negative. This is an implication of the fact that pX
can take on values greater than 1. Code length, however, is never negative.
Di erential entropy does not provide an absolute measure of randomness. Discouragingly,
it is even the case that a density with a di erential entropy of negative in nity may still be
unpredictable. Examples of this sort can be constructed by embedding a discrete process
into a continuous space. For example one could model the roll of a die as a continuous RV.
The density would then be made up of a series of delta functions centered at the points one
through six. A delta function, often called a Dirac delta function, can be de ned from a box
car function,
81
bx; xlow; xhigh = : xhigh ,xlow if xlow x xhigh :
2.15
0
otherwise
The box car function is de ned so that it integrates to one. The delta function is a box car
function in the limit as it approaches zero width,
x = !0 bx; 0; :
lim
The delta function, because it is a box car, integrates to one. It can be shown that
Z1
f x0 =
x0 , xf xdx
,1 30 2.16
2.17 2.3. SAMPLES VERSUS DISTRIBUTIONS AITR 1548 Furthermore, from the de nition of convolution,
Z1
f gx f x , x0gx0dx0 2.18 we can see that the delta function is the identity operator,
Z1
f x =
x , x0f x0dx0 = f x : 2.19 ,1 ,1 The density of the continuous model of die rolling can be formulated as px = 6
X1
x , i
i=1 6 which will integrate to 1. Furthermore if we de ne the probability of an event as
Z x+
P x = !0 x, px ;
lim 2.20 2.21 1
then P X = 1 = 6 , as will the probability of the other events. Finally we can show that the
entropy of X is negative in nity,
Z1
6
0 logpx0 dx0 = , X 1 log1 = ,1 ;
2.22
hX = , ,1 px
i=1 6 yet X is pretty clearly random.
Though di erential entropy does not provide an absolute measure of randomness or code
length, it does provide a relative measure of these properties. A random variable X is less
predictable than Y whenever hX hY . Similarly an event from X requires more bits
on average to encode than an event from Y . 2.3 Samples versus Distributions
A random variable is a mathematical structure that is intended to model the behavior of a
physical process. In some cases there are excellent physical reasons to believe that an RV
is an accurate model of a process. In many other cases the properties of a random physical
31 Paul A. Viola CHAPTER 2. PROBABILITY AND ENTROPY process may well be unknown. In these cases we may still wish to use the theory of probability
to analyze a system. The rst step is to nd an accurate RV model of our data.
In order to insure that our probabilistic inferences will be correct, our model must be as
accurate as possible. While it is possible to model every coin as a fair coin, we do so at
our peril. It makes much more sense to perform a large number of experiments intended to
test the hypothesis, Is the coin fair?" There are two important intuitions behind nding an
accurate model of a random process. First and foremost, you want a model that seems to
explain the data well. It might not make sense to guess that a coin is fair if after 500 ips
heads has come up 400 times. But it is also important that your model be plausible. If after
a lifetime of experience you realize that most coins are pretty fair, than perhaps it makes
more sense to assume that 400 heads is unusual but not su ciently unusual to assume that
the coin is biased. 2.3.1 Model Selection, Likelihood and Cross Entropy
The eld of statistics provides many tools for testing the validity of random models. A lot of
this work shares a particular form, called maximum likelihood model selection. The goal is to
select the most probable model given a large sample of measurements. Maximum likelihood
selection proceeds in steps: 1 guess the de nition of a random variable that might model
the process; 2 evaluate the goodness" of the model by computing the probability that the
data observed could have been generated by the model, 3 after evaluating many models
retain the model that makes the data most probable. The probability of a sample a under
the RV X is the probability of the cooccurrence of the trials in a, `a = PX a = PX x1 = xa ; x2 = xa ; ::: :
1 2 2.23 The probability of a sample is usually called its likelihood and is denoted `a.
Justi cation for maximum likelihood model selection is based on Bayes' law. The likelihood of a sample is really a conditional probability, P a j X . Bayes' law allows us to turn
the conditional around and nd the most likely model given the sample, P X j a = P a j X P X :
Pa
32 2.24 2.3. SAMPLES VERSUS DISTRIBUTIONS AITR 1548 In order to compute the model likelihood one must multiply the sample likelihood by the
correcting factor P X . The unconditioned probability of the sample P a could well be
Pa
arbitrary, because the sample is the same for all of the models we will evaluate. The prior
probability of the model, P X , poses more problems. Maximum likelihood model selection
assumes that all of the models that are to be evaluated are equally likely to have occured, i.e.
P X is constant. As a result P X is the same for all models, and the most probable model
Pa
is the one that makes the data most probable.
When reliable information about the prior probability of a model is available we can use
Bayes' law directly. This technique is known as maximum a posteriori model selection. For
instance, over a wide variety of experiments, we may have observed that fair coins are far
more common than unfair coins. It is very implausible that any particular coin would be
unfair, but not impossible. While our prior knowledge should bias us toward the conclusion
that a new coin is fair, it does not determine our conclusion. The likelihood of a model
together with its prior probability can be used to determine which model has the highest
probability of explaining the data. We want a model that both explains the data well and is
plausible.
In general, evaluating joint probability over a large number of random variables is intractable. In practice most maximum likelihood schemes assume that the di erent trials of X are
independent. The probability of cooccurrence is then the product of the independent RVs,
Y
`a = PX xa :
xa2a Maximizing `a is still a daunting process. Signi cant simpli cation can be obtained by
maximizing the logarithm of `,
X
log `a = log PX xa :
xa2a Log likelihood has the same maximum as `a, but has a much simpler derivative.
There is an interesting parallel between log likelihood and entropy. Recall that entropy is
a statistic of X . The nite sample average of entropy, or empirical entropy, which will gure 33 Paul A. Viola CHAPTER 2. PROBABILITY AND ENTROPY strongly later in the thesis, is
1 X log P x :
haX ,Ea log PX X = , N
Xa
a xa 2a 2.25 We can therefore conclude that
1
haX = , N log`a :
a 2.26 This provides us with an interpretation of model selection in terms of entropy. Instead of
nding the model that makes the data most likely, we could instead nd the model that has
the lowest empirical entropy. Conversely, we could present a new interpretation of entropy:
a distribution has low entropy if the probability of a sample drawn from that distribution is
high. A distribution has high entropy if the sample has low probability. A density with a
very narrow peak has low entropy because most of the samples will fall in the region where
the density is large. A very broad density has high entropy because the samples are spread
out and fall where the density is low.
The close relationship between entropy and log likelihood is well known, but is often
overlooked by students of probability. As result the parallels between research on entropy
and maximum likelihood can be easily missed. The fact that a system manipulates entropy"
does not make it necessarily any better, or di erent, than one based on likelihood. For
instance, log likelihood model selection can be derived directly from the entropy framework
using cross entropy. The cross entropy, DpX kpX , or asymmetric divergence is a measure
~
of the di erence between two distributions:
"
pX X
DpX jj pX = EX log p X
2.27
~
~
X
Z 1 pX x0 !
=
log p x0 pX x0dx
2.28
,1
~
X
Z1
Z1
=
logpX x0pX x0dx ,
logpX x0pX x0dx
2.29
~
,1
,1
= ,hX , EX logpX X
2.30
~
,hX , Ea logpX X
2.31
~
= ,hX + haX :
2.32
Cross entropy is nonnegative, reaching zero if and only if pX and pX are identical. Where
~
34 2.4. MODELING DENSITIES AITR 1548 maximum likelihood model selection searches for the model that makes the sample most likely,
cross entropy model selection searches for the model that is closest, in the cross entropy sense,
to the true distribution. If we use the approximation in 2.31, the two procedures are in
fact identical. The rst term in 2.31, hX , is a constant and does not play a role in model
selection. The second term, haX , is ,Na times the log likelihood of a sample drawn from X
under the density p~. Minimization of cross entropy implies the maximization of likelihood.
x 2.4 Modeling Densities
In this section we will describe a number of techniques for estimating densities from data.
Understanding the process by which this is done is an important prerequisite for understanding the main algorithms in this thesis. We will begin with a discussion of the most widely
observed continuous density: the Gaussian density3. We will then derive an closed form
expression for the most likely Gaussian given a sample. This section will also include a
discussion of other parametric density functions and nally a nonparametric technique for
estimating densities known as Parzen window density estimation. 2.4.1 The Gaussian Density
The most ubiquitous of all random processes is the Gaussian or normal density. It literally
appears everywhere. The most common justi cation arises from the central limit theorem",
which shows that the density of the sum of a very large number of independent random variables will tend toward Gaussian. An equally important justi cation is that the mathematics
of the Gaussian density are quite simple because it is an exponential. Moreover, since any
linear function of a Gaussian density is itself Gaussian, they are widely used in linear systems
theory. It is almost certainly the case that the majority of continuous random processes have
been modeled as Gaussians; whether they are or not. A Gaussian density is de ned as: g x , p21 e,
3 x,2
1
2 : For the sake of brevity we will sometimes refer to a Gaussian density as simply a Gaussian. 35 2.33 Paul A. Viola CHAPTER 2. PROBABILITY AND ENTROPY The parameters and are the variance and mean of the density. One can demonstrate this
with some clever integration.
The Gaussian density can also be de ned in higher dimensions: g x , 1
n 2 jj
2 1
exp, 2 x , T ,1x , : 1
2 2.34 In a d dimensional space, the mean is a dvector. The variance is replaced by a covariance
matrix, a dbyd matrix jj is the determinant of . Recall that variance is de ned as the
expected square of the di erence from the mean; covariance is somewhat more complex: ij = E Xi , E Xi Xj , E Xj ; 2.35 where Xi is the i'th component of the RV X . The diagonal entries of contain the variances
of the components. The o diagonal entries measure the expected covariation.
Equation 2.33 de nes an in nite family of density functions. Any one member of this
family is determined by its mean and its variance. Before we can attempt to model an
unknown density with a density from this family, we must rst decide if the density is
Gaussian. Maximum likelihood model selection can then be used to estimate and from a
sample. The form of the Gaussian density makes nding the maximum likelihood parameters
very easy. The log likelihood of a sample of a Gaussian density is
X
log`a = logPX xa
2.36
xa 2a
X
= logg xa ,
2.37
xa 2a
2
X
= log p 1 , 1 xa , :
2.38
2 2
xa 2a
The most likely minimizes X xa , 2 ; xa2a a quadratic function of . Di erentiating and solving for zeroes we nd that the most likely
is
X
= 1
xa : Na xa2a
36 2.4. MODELING DENSITIES AITR 1548 0.8
Sample
True Gaussian
M.L. Fit 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 4 3 2 1 0 1 2 3 4 Figure 2.1: Three views of a Gaussian density with a mean of 0:0 and a variance 1:0: First
a sample of 100 points drawn from the density. Each point is is represented a vertical black
line. Second the density of the true Gaussian. Third the density of the Gaussian estimated
from the sample mean = 0:045, variance = 0:869. This is a very satisfying result. We have proven that the most likely estimate for the mean, ,
is the mean of the sample. A very similar argument can be used to prove that the maximum
likelihood estimate for the variance, , is the sample variance:
1 X x , 2 :
=N
a
a xa 2a Figure 2.1 displays a 100 point sample drawn from a Gaussian density. The true density
is shown together with the most likely model. Because the sample mean and sample variance
are not perfect measures of the true mean and variance, the most likely model is not perfect.
The accuracy of the estimated mean and variance gets better as the sample size increases.
Even for a sample of 100 points there is signi cant variability in the estimated model for
di erent samples. Figure 2.2 shows ten di erent estimates from ten di erent samples of the
same density.
37 Paul A. Viola CHAPTER 2. PROBABILITY AND ENTROPY
0.8
0
1
2
3
4
5
6
7
8
9 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 4 3 2 1 0 1 2 3 4 Figure 2.2: The maximum likelihood density estimates for ten di erent samples of 100 points
drawn from the same Gaussian. 2.4.2 Other Parametric Densities
Finding the most likely Gaussian model for a sample is a very e cient operation. The mean
and the variance are trivially computable in linear time. E cient estimation is a property
shared by all of the exponential densities, a class of densities which include the Gaussian
density, the Gamma density, and others. For all other types of densities it is not possible
to nd maximum likelihood parameter estimates directly from statistics of the density. The
most likely set of parameters must be determined by a search process. Since there are an
in nite number of possible parameter values, nding values for these parameters that are
optimal is not straightforward. Generally problems of this sort are solved using an iterative
re nement process known as gradient descent. The gradient descent procedure is described
in Appendix A.1.
The Gaussian density has many advantages. Why not use it to model every sample? The
simple answer is that not all real densities are Gaussian. In fact, many real densities are far
from Gaussian. One of the strongest limitations of the Gaussian, and of all the exponential
densities, is that they are unimodal they have a single peak. Modeling densities that have
multiple peaks as if they had a single peak is foolhardy. Figure 2.3 shows an attempt to t
a two peaked function with a single Gaussian. In many situations it may seem as though
the simplicity and e ciency that arises from using a Gaussian density outweigh the added
accuracy that arises from using a more accurate model. As we will see, this is a temptation
38 2.4. MODELING DENSITIES AITR 1548 0.4
Sample
True Distribution
M.L. Fit 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 4 3 2 1 0 1 2 3 4 Figure 2.3: Three views of a density constructed from a combination of two Gaussians. The
Gaussians have variance 0:3 and means of 2:0 and ,2:0 respectively. As before the sample
contains 100 points. The maximum likelihood Gaussian has mean 0:213 and variance 3:824.
to which many have succumbed.
Once the decision to use a more complex model has been made, the set of possible model
densities is literally in nite. In terms of accuracy this is an unambiguous advantage. Density
can be modeled by any function that can be guaranteed to integrate to one. The most
common model after the simple Gaussian is a mixture of Gaussians: M x; W = N
X
i=1 ci gi x , i ; 2.39 here W represents the collection of parameters N; fig; fig; fcig. When P ci = 1, the
mixture model is guaranteed to integrate to one. A mixture density need not be unimodal;
it may have as many as N peaks. Figure 2.3 contains a graph of a mixture of Gaussian
density with two equal components. Given a large number of Gaussians, almost any density
can be modeled accurately. As before, maximum likelihood can be used to select the best set
of parameters for a given sample a. It is possible to search for the correct parameter vector
using gradient ascent, but for Gaussian mixture models there is a more e cient technique
known as Expectation Maximization Dempster et al., 1977. In either case nding the best
parameter vector can involve a lengthy search process.
While mixture models are fairly popular, almost any parameterized function can be used
39 Paul A. Viola CHAPTER 2. PROBABILITY AND ENTROPY as a density estimate. Within the neural networks literature some have trained back propagation neural networks to approximate densities Jacobs et al., 1991 see also Haykin, 1994
for an excellent review of neural network research. There is nothing terribly special about
using a neural network for this purpose. It is just another form of parametric density estimation.
Now that we have some feel for the density estimation process, it is critical that important
limitations be pointed out. Whenever one estimates a density from a sample a very important
rst step is required: assumptions must be made about the form of the density. The space
of possible functions is so large that for any sample there are an in nite number of density
functions that t it equally well. Continuous density can defy intuition. For instance it
is always possible to de ne a density that makes a given sample in nitely likely. Take for
example a density function made up of delta functions. As we did in section 2.2 we could
make up a function with a delta function for each trial:
1 X x , x :
2.40
pX = N
a
a xa2a The likelihood of this model density is then 1, and is guaranteed to be bigger than any other
density's likelihood. Wouldn't this imply that tting any other density is suboptimal? While
intuition argues against such an arti cial density, there is no principled scheme for dealing
with this dilemma.
Much has been written about this problem in the machine learning literature, where it
is called function approximation. There simply is not enough information in a nite sample
to uniquely determine which of the in nite number of possible functions t the sample best.
The only solution is to make strong assumptions about the form of the correct function: for
example that it is Gaussian, that it is polynomial, or that it is smooth. These assumptions
provide a strong prior probability over the space of possible functions. Together the likelihood
of a function and its prior probability can often uniquely determine a solution.
Maximum likelihood model selection is not guaranteed to do a good job of density estimation. There are three reasons why the most likely model may fail to be an accurate
model. The rst reason is that the set of evaluated models may not contain the correct
model. This is called inductive inadequacy and it arises when the underlying assumptions
about the density are wrong. The second reason is that maximum likelihood can be fooled.
40 2.4. MODELING DENSITIES AITR 1548 An especially unlikely sample, as in our example of the unbiased coin see Section 2.3, can
lead to a model estimate that is not correct. This is a question of con dence. A larger sample
is less likely to be unusual, and gives us more con dence in our model. The third reason is
that the search through parameter space may fail. A good solution may exist but cannot be
found, for example if there are local minima. 2.4.3 Parzen Window Density Estimation
The nal class of density functions we will discuss are called nonparametric density estimators. For these models no search for parameters is needed. While parametric methods use
the parameters as the model, nonparametric methods use the sample to directly de ne the
model.
The nonparametric scheme on which we will focus is known as Parzen window density
estimation. The general form of the density is:
1 X Rx , x ;
2.41
P x; a N
a
a xa 2a where a is a sample and R is a valid density function. The function R is often called the
smoothing or window function. The quality of the approximation is dependent both on the
functional form of R and its width. Di erent window functions will lead to very di erent
density estimates. The Gaussian density is a common selection for R, making the Parzen
density estimate a mixture of Gaussians. There is one Gaussian centered at each sample.
Figure 2.4 contains a graph of a density, a sample, and the Parzen estimate constructed from
the sample. Figure 2.5 contains a graph of ten di erent Parzen estimates from ten di erent
samples. The di erent Parzen estimates do not show signi cantly more variation than the
Gaussian estimates shown in Figure 2.2.
In practice the Parzen density estimate is much more exible than a parametric density
estimate. Where parametric techniques make very strong assumptions about the functional
form of the density to be approximated, Parzen estimation requires only that the density be
smooth. Figure 2.6 shows the Parzen density estimate of a bimodal distribution. Contrast it
to the parametric estimate of the same density as shown in Figure 2.3.
41 Paul A. Viola CHAPTER 2. PROBABILITY AND ENTROPY
0.8
Sample
True Gaussian
Parzen Fit 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 4 3 2 1 0 1 2 3 4 Figure 2.4: Three views of a Gaussian density with a mean of 0:0 and a variance 1:0: First
a sample of 100 points drawn from the density. Each point is is represented a vertical black
line. Second the density of the true Gaussian. Third the Parzen density estimate constructed
from the sample. The window functions are Gaussians with variance 0:35.
Intuitively, the Parzen density estimator computes a local, or windowed, average of the
sample. Looking back to 2.41, notice that if R is symmetrical about the origin we can
view the window function as being centered on the query point, x, rather than at the data
points. Viewed in this light, the density estimate at a query point is a weighted sum over
the sample, where the weighting is determined by the window function. The most common
window functions are unimodal, symmetric about the origin, and fall o quickly to zero. In
e ect, the window function de nes a region centered on x in which sample points contribute
to the density estimate. Points that fall outside of this window do not contribute. The
density estimate at x is the ratio of the number of weighted sample points within the window
divided by the total number of sample points, Na. Getting a reliable estimate of this ratio
involves having a reasonable number of points fall into the window around the query point.
The number of points that we expect to fall into this window is a function both of the size
of the sample and the size of the window. As the number of points that fall into a window
decreases, the variance of the Parzen density estimate increases. We will analyze the variance
of the Parzen estimate later in the chapter.
The balance of computation required by Parzen window density estimation is qualitatively
very di erent from parametric schemes. Constructing a parametric model involves a lengthy
search through parameter space that takes more time for larger samples. Constructing a
42 2.4. MODELING DENSITIES AITR 1548 0.7
0
1
2
3
4
5
6
7
8
9 0.6 0.5 0.4 0.3 0.2 0.1 0 4 3 2 1 0 1 2 3 4 Figure 2.5: The Parzen density estimates for ten di erent samples of 100 points drawn from
the same Gaussian density.
Parzen model is cheap. One need only memorize the sample. Evaluating a parametric model
is usually e cient. Once the parameters are known the number of operations required is
usually very small and does not grow with the size of the sample. Evaluating P x; a is more
expensive; requiring time proportional to the size of the sample. The overall computational
complexity of either technique is a function of how they are used.
Though the Parzen estimate is a mixture model, it is not the maximum likelihood mixture
model. Unlike the Parzen estimate, the maximum likelihood model is not constrained to place
one Gaussians at each of the sample points. There is however an asymptotic proof of Parzen
convergence that relies on the law of large numbers. The Parzen estimate can be written as
a sample mean:
1 X Rx , x = E Rx , X :
P x0; a = N
0
a
a
0
a xa 2a In the limit this equals the true expectation which in turn is a convolution
lim P x; a = E Rx , X
Z1
=
Rx , x0px0dx0
,1
= R px : Na !1 2.42
2.43
2.44 So P x; a converges to px if and only if px = R px. There are two distinct
conditions under which equality holds. The rst is that R tends toward the delta function
43 Paul A. Viola CHAPTER 2. PROBABILITY AND ENTROPY
0.7
Sample
True Density
Parzen Fit
0.6 0.5 0.4 0.3 0.2 0.1 0 4 3 2 1 0 1 2 3 4 Figure 2.6: Three views of a density constructed from a combination of two Gaussians. The
Gaussians have variance 0:3 and means of 2:0 and ,2:0 respectively. As before the sample
contains 100 points. The Parzen estimate is constructed with Gaussians of variance 0:20.
when the sample size approaches in nity. The second occurs when convolution by R does not
change px. In theory this could be achieved when px has bounded frequency content and
R is a perfect low pass lter. In practice approximate equality holds whenever px has low
frequency content and R is primarily a low pass lter, for example when px is a smooth
function and R is a Gaussian. Finally, whenever px = R px the Parzen estimate,
P x; a, is an unbiased estimator of px.
There are other conditions under which the Parzen estimate will converge to the correct
density estimate. This proof assumes that the samples are corrupted by measurement noise
f
of a known density. Instead of X , a corrupted random variable, X = X + , is observed. If were known the probability of X would be, f~
pX = xjX = x; = x , x , :
~
Without knowledge of we must integrate over all its possible values,
Z1
f~
f~
pX = xjX = x = ,1 pX = xjX = x; 0p 0d0
Z1
=
x , x , p d0
~
,1
= p x , x :
~
44 2.45
2.46
2.47 2.4. MODELING DENSITIES AITR 1548 f
To compute pX we must integrate over X . We show that the integral is approximated by
the Parzen estimator,
Z1
f f ef f
pX = x = ,1 pX = xjX = X 0pX X 0dX 0
2.48
Z1
f ef f
=
p x , X 0pX X 0dX 0
2.49
,1
f
= EX p x , X
2.50
e
~
2.51
Ea p x , xa
1 X p x , x ;
2.52
=N 0 ~a
a xa 2a
~ f
where a is a sample of X . The probability of the uncorrupted random variable X is apf
proximated by the Parzen estimate constructed from the samples of X where the smoothing
function is the density function of the noise. The probability of the corrupted random variable
can be derived from a very similar argument,
1 X p p x , x :
f~
pX = x N 0 ~a
a xa 2a
~ f
The probability of a noise corrupted random variable X is approximated by the Parzen
estimate using the smoothing function p p x. This result is independent of the density
of X . Often is Gaussian noise, a very common assumption that we will return to in our
discussions of entropy. The smoothing function is then a Gaussian density that has twice the
standard deviation of . Finding the Best Smoothing Functions
As we have seen, when a priori information about the density is available Parzen estimation
will converge to the correct density. Moreover, when we know either that the density is smooth
or that it has been perturbed by noise it is possible to nd the correct smoothing function.
In the absence of a priori information, the quality of the Parzen estimate is dependent on the
variance of the smoothing functions. Figures 2.7 and 2.8 display the dependence of the
density estimate on . Each shows the Parzen estimates computed from a 100 point sample
as is changed. Notice that the actual density function that results is very dependent on
the variance. The qualitative nature of this dependence varies across the range of variances
45 Paul A. Viola CHAPTER 2. PROBABILITY AND ENTROPY
1
0.005
0.02
0.08
0.32
1.28 0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0 4 3 2 1 0 1 2 3 4 Figure 2.7: Five plots of the Parzen density estimates derived from a 100 point sample of
a Gaussian. The Gaussian has variance 1:0 and mean 0:0. The di erent estimates use a
di erent value for the variance of the component smoothing functions. The variances used
range over a factor of 256, from 0:005 to 1:28.
shown. When the variance of the smoothing function is small, less that 0.1, the resulting
density changes very rapidly as variance is changed. Above 0.2 small changes in variance do
not change the resulting density nearly as rapidly.
Selection of the correct variance for the smoothing functions need not be a hit or miss
process. Much in the same way that likelihood can be used to nd the parameters of a
Gaussian to t a sample, likelihood can be used to nd the variance of the Gaussians that
make up the Parzen estimate. In general it is possible to compute the best variance for each
Gaussian in the Parzen density estimate separately. This process requires a great deal of time
and data. Since we wish to preserve the simplicity of the Parzen estimate, a single variance
will be used for all of the smoothing functions.
Recall that likelihood is maximized when empirical entropy is minimized see Section 2.3.1.
Since subsequent chapters will focus on empirical entropy, we will use empirical entropy to
estimate the optimal variance. Figure 2.9 graphs the empirical entropy of the sample versus
variance. The sample used in this graph is the same as was used to estimate the densities in
Figures 2.7 and 2.8. The broad minimum in entropy at 0:25 implies that the Parzen density
estimate is not critically dependent on variance. The variance need only be within a factor
of ten of the optimal variance.
46 2.4. MODELING DENSITIES AITR 1548 0.7
0.6
0.5
0.4
0.3
0.2
0.1
0 3 2 1 0 1 2 0.7 0.6 0.5 0.4 0.3 0.2 0.1 3 Figure 2.8: A parametric surface plot of the Parzen density versus variance this is the same
data shown in the previous graph. The horizontal and vertical axes are the location and
density respectively. Variance changes with depth in the graph. Here variance ranges from
0:80 to 0:01.
The true entropy of a Gaussian with variance 1:0 is 1:419. The optimal Parzen density
estimate has an empirical entropy of 1:47. This close agreement is not coincidence. It is
argued in the next chapter that the true entropy of a density can be e ectively estimated
from a Parzen density estimate.
There is a small technical note that should not be overlooked. We must be careful
whenever the same sample that is used both to construct the Parzen estimate and to estimate
entropy. Recall that the most likely, or lowest entropy, density estimate for a sample is a
collection of delta functions centered at each point from the sample see 2.2.1. We also know
that this delta function density will have an entropy of negative in nity. The Parzen density
is very similar in form to the delta function density. It too centers a function at each point
from the sample. In the limit as the variance of the smoothing functions tends towards zero,
each smoothing function approximates a delta function. Therefore the minimum empirical
entropy should be obtained when the variance is zero. This di culty only arises when the
sample from which the density is estimated is the same as the sample with which empirical
entropy is calculated. If these two samples are di erent, and the density is not degenerate in
some way, then no point should appear in both samples. As the variance of the smoothing
functions tends to zero the density at all points that are not in the Parzen sample tends to
zero. As a result the empirical entropy should tend toward positive in nity as the variance
47 Paul A. Viola CHAPTER 2. PROBABILITY AND ENTROPY
5.5
Log Likelihood
5 4.5 4 3.5 3 2.5 2 1.5 1
0.001 0.01 0.1 1 10 100 Figure 2.9: A log plot of negative log likelihood versus . Near the minimum, the log
likelihood is not terribly sensitive to the . Values within a factor of 10 are all roughly
equivalent. tends to zero. This e ectively precludes the solution where the variance of the smoothing
functions is zero.
We can simulate having two di erent samples by a process called crossvalidation. Crossvalidation splits a single sample a into two samples. One sample has a single point fxg and
the second contains the remaining points a , fxg. There are Na di erent ways to split the
sample in two parts. Rather than draw two di erent samples, we use the Na di erent split
samples. In each case the larger sample, of size Na , 1, is used for the Parzen estimate, and
the smaller sample is used to estimate the entropy. Estimating log likelihood or empirical
entropy with two samples a and b yields
log`b = ,NbhbX = ,NbEb log P X; a ;
versus ,NaEa logP xa; a , fxag ; 2.53 2.54
using cross validation. The cross validated empirical entropy is an unbiased estimate of the
two sample empirical entropy.
48 2.4. MODELING DENSITIES AITR 1548 The Quality of the Parzen Estimate One way to evaluate the quality of the Parzen estimate is to evaluate the standard deviation
of its estimate. Another, perhaps more useful statistic is the standard deviation normalized
by the mean
X :
2.55
EX The normalized standard deviation measures expected deviation from the mean as a function
of the overall scale. For many types of problems, when the mean of a variable is large, small
deviations about the mean are usually unimportant. But when the mean is very small, a
small deviation can make a big di erence. Normalized standard deviation is a good measure
to use when the log of a variable is important like log likelihood and entropy. Using
the constant and linear terms of the Taylor expansion of logarithm and assuming that the
standard deviation of X is small,
logX EX :
X 2.56 The standard deviation of the Parzen density estimate at a point x is a function of the total
number of sample points used to estimate the density. The normalized standard deviation of
a Parzen estimate is
P x; a = P x; a ;
2.57
E P x; a
pX where both the standard deviation and the expectation are taken over the space of possible
samples. The two equations are equal whenever P x; a is an unbiased estimator for pX .
The standard deviation of the Parzen estimate can be computed exactly when the smoothing functions are box car functions. The Parzen estimate is then the number of sample points
that fall into the box car window divided by the total number of sample points,
1 X Rx , x = kNin :
P x; a N
2.58
a
N
a xa2a a where Nin is the number of points for which Rx , xa is nonzero, and k = R0 is chosen
49 Paul A. Viola CHAPTER 2. PROBABILITY AND ENTROPY so that P integrates to one. The standard deviation of the Parzen estimate is, k
P x; a = N Nin :
a Assuming each point in the sample is independent then
v
!
u
k uN P x; a 1 , P x; a :
x; a =
ta
P
N
k
a The normalized standard deviation of the Parzen estimate is then
v
!
u
P x; a =
1
k uN P x; a 1 , P x; a
ta
E P x; a E P x; a Na
k
s
x;
k
P 1 a N NaP x; a1 , P k a
x;v a
u
k u 1 , P kx;a :
pN t P x; a
a 2.59 2.60 2.61
2.62
2.63 The Parzen estimate has a larger normalized standard deviation at points where the estimated
probability is small. The Parzen density estimate converges to the true estimate at a rate
1
proportional to pNa .
The de nition of Parzen window estimation can be generalized to higher dimensions by
replacing the one dimensional smoothing functions by their d dimensional counterparts see
Section 2.4 for a d dimensional Gaussian. Though the de nition of Parzen estimation is the
same for any number of dimensions, the behavior of the algorithm can be very di erent. As
the number of dimensions grows the number of data points required rapidly increases. In
d dimensions, the window of support of a Gaussian smoothing function is an d dimensional
sphere whose radius r is a function of its standard deviation. The volume of a d dimensional
sphere of radius r is Vd rd, where Vd is a constant dependent only on the dimension. Assuming r d
that all of the sample data is contained in a sphere of radius R, on average Na R data points
r
will fall in a randomly chosen window. Generally, r is selected so that R 1. As a result
with increased dimension the number of points falling in a randomly chosen window drops
exponentially and the normalized standard deviation of P will increase rapidly. This implies
that the Parzen density estimate will become very unreliable as dimensionality increases. 50 2.4. MODELING DENSITIES AITR 1548 While in theory this could be remedied by increasing the size of the sample exponentially,
things rapidly get out of hand. Empirical evidence argues against using Parzen estimation
in many more than six dimensions. 51 Chapter 3
Empirical Entropy Manipulation and
Stochastic Gradient Descent
This chapter presents a novel technique for evaluating and manipulating the empirical entropy
of a distribution, called EMMA. The theory of entropy manipulation plays a critical role in
the rest of this thesis and forms the algorithmic core in all of the applications.
There are a number of existing techniques that manipulate the entropy of a density. They
each have signi cant theoretical and practical limitations that make them unsuitable for our
purposes. We will begin with a description of these techniques, and two simple applications.
The second part of the chapter describes a new procedure for evaluating empirical entropy,
EMMA. We will present an e cient stochastic gradient scheme for extremizing the EMMA
estimates. This scheme has applications outside of entropy manipulation.
The nal section of this chapter presents a tutorial application of EMMA. We will show
how EMMA can be used to derive an information theoretic version of principal components
analysis. 52 3.1. EMPIRICAL ENTROPY AITR 1548 3.1 Empirical Entropy
As we saw in the previous chapter, in many cases the true density of a random variable
is not known. Instead one must make do with an estimate of the density obtained from a
sample of the variable. Likewise, there is no direct procedure for evaluating entropy from a
sample. A common approach is to rst model the density from the sample, and then estimate
the entropy from the density. This divides the problem into manageable parts which can be
solved separately.
By far the most popular density model for entropy calculations is the Gaussian. This is
based on two considerations: 1 nding the Gaussian that ts the data best is very easy
see Section 2.3.1 and 2 the entropy of the Gaussian can be directly calculated from its
variance. The entropy of a Gaussian distribution is
Z1
hX = , ,1 g x , logg x , dx
3.1
!
"
Z1
2
= , g x , log p21 , 1 x , dx
3.2
2
,1
"
1 x , 2 + 1 log2
3.3
=E 2
2
= 1 + 1 log2
3.4
22
1
3.5
= 1 loge + 2 log2
2
3.6
= 1 log2e :
2
The entropy of a Gaussian is a function of its variance and is not a function of its mean.
Wider Gaussians have more entropy and narrower Gaussians have less. There is a simple
procedure for nding the empirical entropy of a Gaussian distribution: compute the variance
of the sample and evaluate 3.6.
The equivalence between the log of variance and entropy can be used to reformulate well
known signal and image processing problems as entropy problems. Since the logarithm is a
monotonically increasing function, any technique that maximizes or minimizes the variance
of a signal can be viewed as an entropy technique. Examples include principal components
analysis, where variance is maximized, and least square solutions to matching problems,
53 CHAPTER 3.
Paul A. Viola EMPIRICAL ENTROPY MANIPULATION AND STOCHASTIC GRADIENT DESCENT where variance is minimized. There is however one signi cant caveat. Variance maximization
is only equivalent to entropy maximization if the density of the signal involved is Gaussian.
When this assumption is violated it is possible to reduce entropy as variance is increased.
All but a very few of the techniques that manipulate the entropy of a signal assume that the
signals are Gaussian or another exponential distribution Linsker, 1988; Becker and Hinton,
1992; Bell and Sejnowski, 1995. We will discuss these techniques in Section 7. Principal Components Analysis
There are a number of signal processing and learning problems that can be formulated as
entropy maximization problems. One well known example is principal components analysis.
The principal component of a d dimensional distribution is a d dimensional vector. Given a
density X every vector v de nes a new random variable, Yv = X v. The variance along an
axis v is the variance of this new variable: V arYv = EX X v , EX X v2 : 3.7 The principal component v is the vector for which V arYv is maximized.
In practice neither the density of X nor Yv is known. The projection variance is computed
from a sample a of points from X , V arYv V araYv Ea X v , Ea X v 2 : 3.8 We can then de ne a vector cost function, C v = ,V araYv ; 3.9 which is minimized by the principal component vector. There are many schemes for nding
the principal component of a density. One of the more elegant accomplishments of linear
algebra is the proof that the rst eigenvector of the covariance matrix of X, X , is the
principal component vector.
Under the assumption that X is Gaussian we can prove that the principal component is
the projection with maximum entropy. First, every projection of a Gaussian is also Gaussian.
54 3.1. EMPIRICAL ENTROPY AITR 1548 Second, the entropy of a Gaussian is monotonically related to variance. Therefore, the Yv
corresponding the axis of highest variance is a Gaussian with the highest possible entropy.
Moreover, principal components analysis nds the axis that contains the most information
about X . The mutual information between X and Yv is I X; Yv = hYv , hYv jX : 3.10 This equation has two components. The rst implies that Yv will give you more information
about X when Yv has a lot of entropy. The second can be misleading. Since knowing X
removes all of the randomness from Yv , hYv jX is negative in nity. This is not particularly
bothersome precisely because relative entropy is relative. Only the di erences between relative entropies are signi cant. The variable Yv yields more information about X than Yv
when
1 2 I X; Yv I X; Yv
hYv , hYv jX hYv , hYv jX
hYv hYv :
1 1 2 1 2 1 2 2 3.11
3.12
3.13 We can conclude that the principal component axis carries more information about a distribution than any other axis. Function Learning
There are other well known problems that can be formulated within the entropy framework.
Let us analyze a simple learning problem. Given a random variable X we can de ne a
functionally dependent RV, Y = F X v + , which we assume has been perturbed by measurement noise. We are given samples of the joint occurrences of the RVs: a = :::fxa; yag::: .
How can we estimate v? Typically this is formulated as a least squares problem. A cost is
de ned as the sum of the squares of the di erences between predicted ya = F xa v and the
^
^
actual samples of Y ,
X
C ^ =
v
ya , F xa v2 :
^
3.14
fxa ;ya g2a The cost is a function of the estimated parameter vector v. The sum of squared di erence
^
can be justi ed from a log likelihood perspective see Section 2.3.1. If we assume that the
55 CHAPTER 3.
Paul A. Viola EMPIRICAL ENTROPY MANIPULATION AND STOCHASTIC GRADIENT DESCENT noise added to Y is Gaussian and independent of Y then the log likelihood of v is:
^
X
^
log`^ =
v
logpY = yajY = F xa v
^
fxa;ya g2a
X
=
logg ya , F xa v
^
fxa;ya g2a
X
=,
ya , F xa v2 + k ;
^
fxa ;ya g2a 3.15
3.16
3.17 where k is a constant value independent of v. The v that minimizes the sum of the squares
^
^
is also the v that makes the data observed most likely. For most problems like this, gradient
^
descent is used to search for v.
^
Finally we can show that minimizing cost maximizes mutual information. We showed in
Section 2.3.1 that log likelihood is related to sample entropy:
1
1
^
^
log`^ = , N haY jY , N hY jY :
v
a a 3.18 ^
The mutual information between Y and Y is
^
^
I Y; Y = hY , hY jY : 3.19 The rst term is not a function of v. The second is an approximation of log likelihood
^
^
minimizing C ^ maximizes I Y; Y .
v NonGaussian Densities
There is a commonly held misconception that mutual information techniques are all equivalent
to simple well known algorithms. Contrary to the impression that the above two examples
may give, this is far from the truth. Entropy is only equivalent to least squares when the
data is assumed to be Gaussian. The approach to alignment and bias correction that we
will describe in the next chapters does not and could not assume that distributions were
Gaussian. We will show that if the data were Gaussian our alignment technique would
reduce to correlation.
There are a number of nonGaussian problems that can be solved using entropy or mutual
56 3.2. ESTIMATING ENTROPY WITH PARZEN DENSITIES AITR 1548 information. Bell has shown that signal separation and decorrelation can be thought of as
entropy problems Bell and Sejnowski, 1995. Bell's technique can be derived both for
Gaussian and nonGaussian distributions. Bell shows that the Gaussian assumption leads to
a well known and ine ective algorithm. When the signals are presumed to be nonGaussian,
the resulting algorithms are much more e ective. Many compression and image processing
problems clearly involve nonGaussian distributions.
In theory, empirical entropy estimation can be used with any type of density model. The
procedure is the same: estimate the density from a sample and compute the entropy from the
density. In practice, the process can be computationally intensive. The rst part, maximum
likelihood density estimation, is an iterative search through parameter space. The second,
evaluating the entropy integral, may well be impossible. For example, there is no known
closed form solution for the entropy of a mixture of Gaussians. The entropy integral can
however be approximated as a sample mean: hX hbX = Eb log^X ;
p 3.20 where Eb is the sample mean taken over the sample b, p is the estimate for the sample
^
density and hbX is the sample entropy rst introduced in p
Section 2.3.1. The sample mean
converges toward the true mean at a rate proportional to 1= Nb Nb is the size of b. Based
on this insight, two samples can be used to estimate the entropy of a distribution: the rst
is used to estimate the density, the second is used to estimate the entropy.
While the two sample approach can be used to estimate entropy, it is not a practical
algorithm for entropy manipulation. In the two applications above, changes in the parameter
vector e ect the densities that are being approximated. As the search through parameter
space adjusts the parameter vector a new sample must be drawn, a new density estimated,
and the derivative of entropy evaluated. If estimating the density is itself a complex search
process, the search for the correct parameter vector can take an unbearably long time. 3.2 Estimating Entropy with Parzen Densities
In this section we will describe a technique that can e ectively estimate and manipulate the
entropy of nonGaussian distributions. The basic insight is that rather than use maximum
57 CHAPTER 3.
Paul A. Viola EMPIRICAL ENTROPY MANIPULATION AND STOCHASTIC GRADIENT DESCENT likelihood to estimate the density of a sample, we will instead use Parzen window density
estimation see Section 2.4.3. The Parzen scheme for estimating densities has two signi cant
advantages over maximum likelihood: 1 since the Parzen estimate is computed directly from
the sample, there is no search for parameters; 2 the derivative of the entropy of the Parzen
estimate is simple to compute.
In the following general derivation we will assume that we have samples of a random
variable X , and we would like to manipulate the entropy of the random variable Y = F X; v.
The entropy, hY , is now a function of v and can be manipulated by changing v. Since there
is no direct technique for nding the parameters that will extremize hY we will search the
parameter space using gradient descent. The following derivation assumes that Y is a vector
random variable. The joint entropy of a two random variables, hY1; Y2, can be evaluated
by constructing the vector random variable, W = Y1; Y2 T and evaluated hW .
The form of the Parzen estimate constructed from a sample a is
1 X g y , y ;
P y; a = N
a
a ya 2a 3.21 where the Parzen estimator is constructed with Gaussian smoothing functions. Given P y; a
we can approximate entropy as the sample mean, hY Eb logP Y; a
1 X logP y ; a ;
=N
b
b yb 2b 3.22
3.23 computed over a second sample b hX is the EMMA estimate of empirical entropy.
In order to extremize entropy we must calculate the derivative of entropy with respect to
v. This may be expressed as
d
d hY = ,1 X Pya2a dv g yb , ya ;
3.24
dv
Nb yb2b Pya2a g yb , ya
and, after di erentiating the Gaussian,
d
d hY = 1 X Pya2a g yb , ya yb , yaT ,1 dv yb , ya :
P g y , y
dv
N
ya 2a b b yb 2b 58 a 3.25 3.2. ESTIMATING ENTROPY WITH PARZEN DENSITIES This expression may be written more compactly as follows,
d hY = 1 X X W y ; y y , y T ,1 d y , y ;
yba b
a
dv
N
dv b a
b yb 2b ya 2a AITR 1548 3.26 using the following de nition: y,a
Wy yb; ya P g g b y y,y :
a
ya 2a b 3.27 Wy yb; ya takes on values between zero and one. It will approach one if yb is signi cantly
closer to ya than any other element of a. It will be near zero if some other element of a is
signi cantly closer to yb. Distance is interpreted with respect to the squared Mahalonobis
distance see Duda and Hart, 1973 D y yT ,1y :
Thus, Wy yb; ya is an indicator of the degree of match between its arguments, in a soft"
sense. It is equivalent to using the softmax" function of neural networks Bridle, 1989 on
the negative of the Mahalonobis distance to indicate correspondence between yb and elements
of a.
Equation 3.26 may also be expressed as
d hY = 1 X X W y ; y d 1 D y , y :
yba
dv
N
dv 2 b a
B yb 2b ya 2a 3.28 In this form it is apparent that to reduce entropy, the parameters v should be adjusted such
that there is a reduction in the average squared distance between points which W indicates
are nearby.
d
Before moving on, it is worth reemphasizing that for most density models dv hY is very
di cult to compute. The general derivation of the derivative of entropy is much more complex
than the Parzen derivation:
@hY d 1 X logpy ; a
3.29
b
@v
dv N
b yb 2b 59 CHAPTER 3.
Paul A. Viola EMPIRICAL ENTROPY MANIPULATION AND STOCHASTIC GRADIENT DESCENT
d
1 X dv pyb ; a
=N
b yb 2b pyb ; a
d
d
1 X dyb pyb; a dyb + da pyb; a da :
dv
dv
=N
pyb ; a
b yb 2b 3.30
3.31 d
The numerator of the derivative has two components. The rst, dyb pyb; a dyb , is the change
dv
da
d
in entropy that results from changes in the sample b. The second, da pyb ; a dv , is far more
problematic. The second component is a measure of the change in the density estimate that
results from changes in the sample a. In the Parzen framework the two components of the
derivative collapse into a single term and can be directly computed from the samples. In the
maximum likelihood framework pyb; a is a complex function of the sample. Since there is
no closed form function that computes the density estimate from the sample, computing its
derivative can be very di cult. 3.3 Stochastic Maximization Algorithm
The variance maximization minimization applications described above principal components analysis and learning are deterministic procedures. Starting from an initial guess,
gradient descent uses the derivative of cost to repeatedly update the parameter vector. Two
di erent runs that start from the same initial parameters will end up with the same nal
parameters. Our justi cation for using probability and entropy to analyze these problems
is purely convenience. There is nothing random about these problems once the samples are
drawn. One of the bene ts of understanding the probabilistic interpretation of these problems
is that we can introduce randomness into our solutions and understand its e ect. Here is a
simple example: we want to know the average of a large sample of data. Without knowing
anything else, it would make sense to sum over the entire sample. But, if we needed only
a rough estimate of the average, signi cant computation could be saved by averaging over
a subset of the sample. Furthermore, knowledge of the sample variance would allow us to
compute the size of the subsample needed to estimate the mean to a given precision.
A similar analysis can be applied to principal components analysis or function learning.
The cost of a particular parameter vector is computed by summing over an entire sample
as we did in Equation 3.14. But, when that sample is very large this expectation can be
60 3.3. STOCHASTIC MAXIMIZATION ALGORITHM AITR 1548 approximated by a smaller random sample. The same argument applies to the gradient.
Since the gradient is de ned as an average over a very large sample it may make sense to
approximate it over a smaller random sample. When we use random samples, both the error
estimate and the gradient estimate are now truly random. For very large samples, accurate
error gradient estimates can be made without averaging over the entire sample. For problems
where the gradient needs to be evaluated often, this can save signi cant computation.
Though a random estimate of the gradient is cheaper to compute, it could be useless.
Under what conditions does it make sense to use a random gradient estimate? The theory
of stochastic approximation tells us that stochastic estimates of the gradient can be used
instead of the true gradient when the following conditions hold: 1 the gradient estimate
is unbiased; 2 the parameter update rate asymptotically converges to zero; 3 the error
surface is quadratic in the parameters Robbins and Munroe, 1951; Ljung and Soderstrom,
1983; Haykin, 1994. The rst condition requires that on average the estimate for the gradient
is the true gradient. The second insures that the search will eventually stop moving about
randomly in parameter space. In practice the third condition can be relaxed to include most
smooth nonlinear error surfaces; though there is no guarantee that the parameters will end
up in any particular minimum.
Returning our attention to equations 3.22 and 3.26, notice that the both the calculation of the EMMA entropy estimate and its derivative involve a double summation. One
summation is over the points in sample a and another over the points in b. As a result the cost
of evaluation is quadratic in sample size: ONa Nb. We will present an experiment where
the derivative of entropy for an image containing 60; 000 pixels is evaluated. While the true
derivative of empirical entropy could be obtained by exhaustively sampling the data, a random estimate of the entropy can be obtained with much less computation. This is especially
critical in entropy manipulation problems, where the derivative of entropy is evaluated many
thousands of times. Without the quadratic savings that arise from using smaller samples
entropy manipulation would be impossible.
For entropy manipulation problems involving large samples we will use stochastic gradient
descent. Stochastic gradient descent seeks a local maximum of entropy by using a stochastic
estimate of the gradient instead of the true gradient. Steps are repeatedly taken that are
proportional to the approximation of the derivative of the mutual information with respect to
the parameters:
61 CHAPTER 3.
Paul A. Viola EMPIRICAL ENTROPY MANIPULATION AND STOCHASTIC GRADIENT DESCENT Repeat: a fNa samples drawn from yg
b fNb samples drawn from yg
v v + dh
dv
where dh is the derivative of entropy evaluated over samples a and b, v is the parameter to be
dv
estimated, and the parameter is called the learning rate". The above procedure is repeated
a xed number of times or until convergence is detected. In most problems the initial value
of is reduced during the search. In subsequent chapters we will describe experiments where
samples of 50 or less can be used to e ectively nd entropy maxima.
Stochastic approximation does not seem to be well known in the computer vision community. We believe that it is applicable to a number of cost minimization problems that
arise in computer vision. Stochastic gradient descent is most appropriate for tasks where
evaluation of the true gradient is expensive, but an unbiased estimate of the gradient is easy
to compute. Examples include cost functions whose derivative is a sum over all of the pixels
in an image. In these cases, stochastic gradient search can be orders of magnitude faster
than even the most complex second order gradient search schemes. In the experimental section Chapter 6 of this thesis we will brie y describe joint work where an existing vision
application was sped up by a factor of fty using stochastic approximation. Convergence of Stochastic EMMA
Most of the conditions that insure convergence of stochastic gradient descent are easy to obtain in practice. For example, it is not really necessary for the learning rate to asymptotically
converge to zero. At nonzero learning rates the parameter vector will move randomly about
the minimum maximum endlessly. Smaller learning rates make for smaller excursions from
the true answer. An e ective way to terminate the search is to detect when on average the
parameter is not changing and then reduce the learning rate. The learning rate only needs
to approach zero if your goal is zero error, something that no practical system can achieve
anyway. A better idea is to reduce the learning rate until the parameters have a reasonable
variance and then take the average parameters.
62 3.3. STOCHASTIC MAXIMIZATION ALGORITHM AITR 1548 The rst proofs of stochastic approximation required that the error be quadratic in the
parameters. More modern proofs are more general. For convergence to a particular optimum,
the parameter vector must be guaranteed to enter that optimum's basin of attraction in nitely
often. A basin of attraction of an optimum is de ned with respect to true gradient descent.
Each basin is a set of points from which true gradient descent will converge to the same
optimum. Quadratic error surfaces have a single optimum and the basin of attraction is the
entire parameter space. Nonlinear error spaces may have many optima, and the parameter
space is partitioned into many basins of attraction. When there is a nite number of optima,
we can prove that stochastic gradient descent will converge to one of them. The proof
proceeds by contradiction. Assume that the parameter vector never converges. Instead it
wanders about parameter space forever. Since parameter space in partitioned into basins of
attraction it is always in some basin. But since there are a nite number of basins it must
be in one basin in nitely often. So it must converge to the optimum of that basin.
One condition will give us more trouble than the others. The stochastic estimate of the
gradient must be unbiased. It is not true that the sample approximation for empirical entropy
is unbiased. Moreover, we have been able to prove that it has a consistent bias: it is too
large. In Section 2.4.3 we described the conditions under which the Parzen density estimate
is unbiased. When these conditions are met, a number of equalities hold: pX = x = Nlim P x; a
a !1
= Efa2fX gg " x; a
P
1 X Rx , x
= Efa2fX gg N
a
a xa 2a
= EX RX , x : 3.32
3.33
3.34
3.35 Here Efa2fX gg P x; a denotes the expectation over all possible random samples of size Na
drawn from the random variable X . Assuming the di erent samples of X are independent
allows us to move the expectation inside the summation. The true entropy of the RV X can
be expressed as
hX = ,EX log Efa2fX gg P x; a :
3.36
We can de ne a similar statistic: hX Efb2fX g;a2fX gg hX
63 3.37 CHAPTER 3.
Paul A. Viola EMPIRICAL ENTROPY MANIPULATION AND STOCHASTIC GRADIENT DESCENT = ,Efb2fX g;a2fX gg Eb logEa Rxb , xa
= ,EX Efa2fX gg logP x; a : 3.38
3.39 hX is the expected value of hX . Therefore, hX provides an unbiased estimate of
hX . Jensen's inequality allows us to move the logarithm inside the expectation:
hX = ,EX logEfa2fX gg P x; a
,EX Efa2fX gg logP x; a
hX : 3.40
3.41
3.42 The stochastic EMMA estimate is an unbiased estimator of a statistic that is provably larger
than the true entropy. Intuitively, overly large estimates arise when elements of b fall in
regions where P x; a is too small. For these points the log of P x; a is much smaller than
it should be.
How then might we patch the de nition of EMMA to remedy the bias? Another statistic
that is similar to entropy is
Z
^ X = ,EX pX = , 1 px2dx :
h
3.43
,1 ^ X is a measure of the randomness of X we will de ne ^ X shortly. Strongly peaked
h
h
^ . For widely spread uniform distributions
distributions have large negative values for h
^
h approaches zero. Using the well known inequality, x , 1 logx, we can show,
Z1
^
hX = , ,1 px2dx
3.44
Z1
, ,1 pxlogpxdx
3.45
hX
3.46
Parzen window density estimation can be used to construct a stochastic measure
^ X ,Eb P X; a :
h 64 3.47 3.3. STOCHASTIC MAXIMIZATION ALGORITHM AITR 1548 h
The expectation of ^ X is ^ X :
h ,Efb2fX g;a2fX gg ^ X = ,Efb2fX g;a2fX gg Eb P X; a
h
= ,EX Efa2fX gg P X; a
= ,EX pX
= ^ X :
h Further simplifying, 3.48
3.49
3.50
3.51 ^ X = ,Ex=X Ex=X Rx , x
h
~
~ 3.52
^
is an expectation over a pair of events from the RV X . On average h is far too negative when
pX is large.
We have now de ned two alternative statistics for which inexpensive unbiased estimates
are available: ^ X and hX . These statistics bound the true entropy above and below,
h
^ X hX hX :
h 3.53 h is on average too large and ^ is on average too small. Instead of using either, we have
h
had good success using a third estimate, e
hX = ,Eb xb; a ;
where 3.54 8 logP x; a
if P x; a pmin
:
3.55
x;a
P
pmin + log pmin , 1 otherwise
The function is designed so that both x; a and dx;a are continuous. See Figure 3.1
dx
e
for a plot of the function. The intuition behind h is that we use h whenever possible, and
for those sample points where P x; a has a large standard deviation, we use ^ instead. The
h
x; a = : standard deviation of the Parzen density estimate is highest at points where the probability
density is lowest, so we use ^ where P x; a is below pmin see Section 2.4.3. The variable
h
e
^
pmin allows us to continuously vary the h estimate from the two extremes of h and h. 65 CHAPTER 3.
Paul A. Viola EMPIRICAL ENTROPY MANIPULATION AND STOCHASTIC GRADIENT DESCENT
2
up(x, 0.1)
up(x, 0.8)
x*log(x)
x* (x1) 1.5 1 0.5 0 0.5
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 Figure 3.1: Plot of the functions x x which is labeled up, x logx, and x x , 1. Two
di erent values for pmin are plotted: 0.1 and 0.8. Notice that smaller values of pmin cause the
approximation of x logx to be very good. The di erence between the two functions when
pmin = 0:1 is almost unnoticeable. Other Stochastic Search Techniques
Nonlinear stochastic gradient descent is commonly used in the neural network literature,
where it is often called the LMS rule. It was introduced there by Widrow and Ho Widrow
and Ho , 1960 and has been used extensively with good results. Since a stochastic estimate
for the gradient of error is much cheaper to compute than a true estimate of the gradient,
for many real problems LMS is faster than all other gradient techniques. The textbook by
Haykin Haykin, 1994 discusses the use of such algorithms by the neural network community.
An excellent discussion of stochastic approximation appears in the textbook by Ljung and
Soderstrom Ljung and Soderstrom, 1983.
Simulated annealing is a related method that has been used in optimization problems
which have many local minima Kirkpatrick et al., 1983. These minima can trap" gradient
techniques far from the optimal solution. Simulated annealing performs a random, though
usually local, search through parameter space. At each step a random modi cation to the
parameters is proposed and the new cost is evaluated. If the new cost is lower than the
previous cost the parameter modi cation is accepted. If the di erence in cost is positive, the
modi cation is accepted probabilistically. The probability of acceptance is proportional to
66 3.3. STOCHASTIC MAXIMIZATION ALGORITHM AITR 1548 the negative exponential of the di erence, paccept d = expt, t ;
d 3.56 where d is the di erence between the new and the old cost and t is a temperature that
controls the likelihood that a bad modi cation is accepted. Simulated annealing is based on
the insight that physical systems, like iron, invariably nd good energy minima when heated
and then cooled slowly. The process of physical annealing is basically a gradient search
perturbed by thermal noise. The thermal noise provides the energy to kick physical systems
out of unfavorable local minima. The parameter t is an analog of physical temperature, it is
initially set to large values and is gradually cooled" during learning. In some cases it can
be proven that with the right annealing schedule simulated annealing will converge to the
global optimum.
Stochastic gradient descent can e ectively penetrate narrow local minima which trap
gradient techniques. Alignment applications can often have many narrow local minima which
arise because of false matches between the high frequency components in the model and
image. Because these false matches are based on features which are small the local minima
are narrow in pose space. We have found that narrow local minima, which can trap gradient
techniques, can be overcome using stochastic gradient descent.
We believe that stochastic approximation serendipitously combines e cient computation
with e ective escape from local minima. 3.3.1 Estimating the Covariance
In addition to the learning rate , the covariance matrices of the smoothing functions R are
important parameters of EMMA. These parameters may be chosen so that they are optimal
in the maximum likelihood sense. This is equivalent to minimizing the cross entropy of the
estimated distribution with the true distribution see Section 2.4.3. Our goal is to nd the
parameters that minimize empirical entropy.
67 CHAPTER 3.
Paul A. Viola EMPIRICAL ENTROPY MANIPULATION AND STOCHASTIC GRADIENT DESCENT For simplicity, we assume that the covariance matrices are diagonal,
22
= DIAG 1 ; 2 ; : : : : 3.57 Following a derivation almost identical to the one described in Section 3.2 we can derive an
equation analogous to 3.26,
!
d hY = 1 X X W y ; y 1 y 2 , 1
k
3.58
yba
2
d
N
k b yb 2b ya 2a k k where y k is the kth component of the vector y. This equation forms the basis for a method
of stochastic maximization of likelihood.
Repeat: A fNa points drawn from yg
B fNb points drawn from yg
0 d h Y 8k
k k+
d
k The above procedure is very similar to the one described in Section 3.3. During entropy
manipulation, it is possible to interleave covariance updates with parameter updates. 3.4 Principal Components Analysis and Information
As a demonstration, we can derive a parameter estimation rule akin to principal components
analysis that truly maximizes information. This new EMMA based component analysis
ECA manipulates the entropy of the random variable Yv = X v under the constraint that
jvj = 1. For any given value of v the entropy of Yv can be estimated from a sample of X as:
0
1
1 X log @ 1 X g y , y A
3.59
hYv = N
Na ya2a b a
b yb 2b
!
1 X log 1 X g x v , x v
=N
3.60
b
a
N
b xb2b a xa 2a 68 3.4. PRINCIPAL COMPONENTS ANALYSIS AND INFORMATION AITR 1548 where is the variance of the Parzen smoothing function. Moreover we can estimate the
derivative of entropy:
d hY = 1 X X W y ; y 1 y , y d y , y
3.61
dv v Nb yb2b ya2a y b a b a dv b a
1 X X W y ; y 1 y , y x , x :
=N
3.62
yba
b a b a
bb a Let us decompose the derivative into parts which can be understood more easily. We will
rst analyze the second part of the summand: yb , yaxb , xa. Ignoring the weighting
function Wy ,1 we are left with the derivative of some unknown function f Yv :
d f Y = X Xy , y x , x
3.63
b
ab
a
dv v
a
= NbNaEb
b Ea yb , yaxb , xa : 3.64 What then is f Yv ? The derivative of the squared di erence between samples is: d y , y 2 = 2y , y d x v , x v
b
a
a
dv b a
dv b
= 2yb , ya d xb , xa v
dv
= 2yb , yaxb , xa : 3.65
3.66
3.67 So we can see that f Yv = NbNaEb Ea yb , ya2 3.68 is the expectation of the squared di erence between pairs of trials of Yv .
Recall that PCA searches for the RV Yv that has the largest variance: Ea ya , Ea ya 2 =
V araYv . Interestingly the expected squared di erence between a pair of trials is precisely
twice the variance: C v = Eb Ea yb , ya2
2
= Eb Ea yb2 , 2yayb + ya
2
= Eb yb2 + Ea 2yayb + ya
69 3.69
3.70
3.71 CHAPTER 3.
Paul A. Viola EMPIRICAL ENTROPY MANIPULATION AND STOCHASTIC GRADIENT DESCENT 4 ECA
PCA 3 2 1 0 1 2 3 4
4 3 2 1 0 1 2 3 4 Figure 3.2: A scatter plot of a sample from a two dimensional Gaussian density. The sample
contains 200 points. The principal axis and the ECA axis are also plotted as vectors from
the origin. The vectors are nearly identical.
2
= Eb yb2 , 2ybEa ya + Ea ya
2
= Eb yb2 , 2Eb yb Ea ya + Ea ya
= Eb Y 2 , 2Eb Y Ea Y + Ea Y 2
= 2V araY : 3.72
3.73
3.74
3.75 Without the weighting term, Wy ,1, ECA would nd exactly the same vector that PCA does:
the maximum variance projection vector. However the derivative of ECA does not act on all
points of Yv equally. Recall that Wy ya; yb is a measure of the distance between yb and ya.
It is large when yb is signi cantly closer to ya than any other element of a. As a result ECA
maximizes variance in a local way. Points that are very far apart are forced no further apart.
Another way of interpreting ECA is as a type of robust variance maximization. Points that
might best be interpreted as outliers, because they are very far from the body of other points,
play a very small role in the minimization. These robust characteristic stand in contrast to
PCA which is very sensitive to outliers.
For densities that are Gaussian, the maximum entropy projection is the rst principal
component. In simulations ECA e ectively nds the same projection as PCA. Figure 3.2
shows a sample of data and the PCA and ECA principal components. Since this density
has a larger variance along the horizontal axis, both the ECA and PCA axes point along the
horizontal axis. Our ECA code take roughly 10 seconds to run on Sun Sparc5 workstation.
70 3.4. PRINCIPAL COMPONENTS ANALYSIS AND INFORMATION AITR 1548 4 ECA
PCA 3 2 1 0 1 2 3 4
4 3 2 1 0 1 2 3 4 Figure 3.3: A scatter plot of a 400 point sample from a two dimensional density. The density
is a mixture of two horizontally stretched Gaussians. The PCA and ECA principal axes are
also plotted as vectors from the origin.
This is comparable to the time it takes to run PCA.
In general, PCA does not nd the highest entropy projection of nonGaussian densities.
For more complex densities the PCA axis is very di erent from the entropy maximizing
axis. Figure 3.3 shows a density for which the PCA and ECA axes are very di erent. The
PCA axis, which is vertical, spreads the points in the sample as far apart as possible. The
ECA axis, which is oblique, spreads nearby points in the sample as far apart as possible.
The resulting densities, YPCA and YECA , are graphed in Figure 3.4. The PCA density is
very tightly peaked, the ECA density is broadly spread out. Though the nal variance
of YPCA is larger, 2:005 vs. 1:626, the entropy of the YECA distribution is much higher,
hYPCA = ,0:17 and hYECA = 1:61.
Linsker has argued that the PCA axis separates the clusters of a distribution Linsker,
1988. To justify this claim, he uses gures much like Figure 3.3 and Figure 3.4. These
graphs show the PCA axis projecting points from separated clusters so that they remain
separate. It is then proposed that the PCA axis is useful for cluster classi cation of high
dimensional data. In other words, that high dimensional data can be projected down into a
low dimensional space without perturbing the cluster structure. In general this is not true.
PCA only separates clusters when the variance between clusters is higher than the variance
within clusters.
71 CHAPTER 3.
Paul A. Viola EMPIRICAL ENTROPY MANIPULATION AND STOCHASTIC GRADIENT DESCENT
2
ECA
PCA 1.8
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0 4 3 2 1 0 1 2 3 4 Figure 3.4: The Parzen density estimates of YPCA and YECA .
Ironically, it is the minimum entropy projection that should separate clusters well. Let
us assume that each cluster is generated from a prototypical point that has been perturbed
by random noise. If there is very little noise, the sample points associated with a cluster
prototype will be clustered together tightly. The resulting density is sharply peaked around
the cluster prototypes and has low entropy. Additional noise acts to spread out each cluster,
adding entropy to the density. Most of the entropy in this density arises from the noise, not
the clusters. An entropy maximizing algorithm will nd a projection vector that maximizes
the projection of the noise. On the other hand, an entropy minimizing algorithm should, if
possible, nd a projection that is perpendicular to the noise. ECA can be used both to nd
the entropy maximizing ECAMAX and minimizing ECAMIN axes.
Figure 3.5 shows a distribution where the noise, or spread, of the clusters is perpendicular
to the axis that separates the clusters. As a result, the PCA axis does not separate these
clusters. The ECA axis shown is the minimum entropy axis which is obtained by running the
EMMA algorithm with a negative learning rate. The ECAMIN axis separates the clusters
much better than the PCA axis see Figure 3.6.
To provide further intuition regarding the behavior of ECA we have run ECA, PCA, and
two related procedures BCM and BINGO on the same density. BCM is a learning rule that
was originally proposed to explain development of receptive elds patterns in visual cortex
Bienenstock et al., 1982. More recently it has been argued that the rule nds projections
that are far from Gaussian Intrator and Cooper, 1992. Under a limited set of conditions
72 3.5. CONCLUSION AITR 1548 4 ECA Min
PCA 3 2 1 0 1 2 3 4
4 3 2 1 0 1 2 3 4 Figure 3.5: A scatter plot of a 400 point sample from a two dimensional density. The density
is a mixture of two horizontally stretched Gaussians. The PCA and ECA minimum entropy
axes are also plotted as vectors from the origin.
BCM nds the minimum entropy projection. BINGO was proposed to nd axes along which
there is a bimodal distribution Schraudolph and Sejnowski, 1993.
Figure 3.7 displays a 400 point sample and the ve di erent projection axes found by
the algorithms discussed above discussed above. The density is a mixture of two clusters.
Each cluster has high kurtosis in the horizontal direction. The oblique axis projects the data
so that it is most uniform and hence has the highest entropy; ECAMAX nds this axis.
Along the vertical axis the data is clustered and has low entropy; ECAMIN nds this axis.
Interestingly because the vertical axis has high variance, PCA nds the entropy minimizing
axis. BCM, while it may nd minimum entropy projections for some densities, is attracted to
the kurtosis along the horizontal axis. The horizontal axis neither minimizes nor maximizes
entropy. Finally, BINGO successfully discovers that the vertical axis is very bimodal. The
densities of the di erent projections is shown in Figure 3.8. 3.5 Conclusion
This chapter has presented a new technique for estimating the entropy of a distribution
called EMMA. Provided the density being approximated is smooth, we have proven that
the technique will converge to the correct entropy estimate. Moreover we have presented
73 CHAPTER 3.
Paul A. Viola EMPIRICAL ENTROPY MANIPULATION AND STOCHASTIC GRADIENT DESCENT
2.5
ECA
PCA 2 1.5 1 0.5 0 4 3 2 1 0 1 2 3 4 Figure 3.6: The Parzen density estimates of YPCA and YECA from the previous graph.
a computationally e cient stochastic technique for manipulating entropy. For reasonable
sample sizes, the technique is not guaranteed to optimize true entropy. Instead it optimizes
a very similar statistic that retains all of the salient characteristics of entropy.
We have also described a simple application of EMMA. EMMA enables us to nd low
dimensional projections of higher dimensional data that minimize or maximize entropy. 74 3.5. CONCLUSION AITR 1548 ECAMIN
ECAMAX
BCM
BINGO
PCA 3
2
1
0
1
2
3
4 2 0 2 4 Figure 3.7: A scatter plot of a 400 point sample from a two dimensional density. Each cluster
has very high kurtosis along the horizontal axis. See text for description of projection axes.
2.5
ECAMIN
ECAMAX
BCM
BINGO
PCA Density 2
1.5
1
0.5
0
4 3 2 1 0
1
Position 2 3 4 Figure 3.8: The densities along various projection axes.
75 Chapter 4
Matching and Alignment
This chapter is perhaps the most important in this thesis. Previous chapters have presented
the mathematics and algorithms that underly the computation of empirical entropy. We
have already seen that empirical entropy can be used to de ne a new algorithm for nding
the most informative projection of a distribution. This chapter will show that matching
and alignment can also be formulated as an entropy problem. In addition, we will discuss
the intuition behind our framework and suggest some simpli ed schemes that re ect these
intuitions. Throughout this chapter a number of synthetic alignment problems will drive our
discussions.
We will begin with a rederivation of correlation as a maximum likelihood method. This
derivation will make clear the assumptions under which correlation will work, and when it
may fail. We will then attempt to generalize correlation so that it will work with a wider set of
inputs. While this generalization is theoretically straightforward it will prove computationally
intractable.
Dropping our focus on correlation, we will de ne an intuitive approach to alignment
which is e ciently computable. Using this intuition we will then de ne an approximation to
a maximum likelihood technique that is both concrete and computable. Finally we will draw
a parallel between this technique and mutual information. Experimental data from synthetic
alignment problems will help us evaluate the proposed alignment techniques.
This chapter will conclude with an entirely di erent motivation for the use of mutual
76 4.1. ALIGNMENT AITR 1548 information in alignment. We will show how the alignment problem can be thought of as
a Minimum Description Length problem Rissanen, 1978; Leclerc, 1988. This formulation
will naturally focus on the task of coding e ciency and entropy minimization. A very similar
set of alignment equations will arise from these considerations. 4.1 Alignment
We are given two signals of time or space: ux and vy. We will call ux the model.
Often it is a description of a physical object that has been computed with great care. For
example, in one of our experiments the model is an accurate three dimensional description of
a skull. The second signal, vy, is an image of the model. In general the form and even the
coordinate systems of the model and image can be very di erent. For example, one of our
3D models is a collection of three dimensional points and normals; the corresponding image
is a two dimensional array of intensities. It is assumed that vy is an observation of ux,
for example that vy is a picture of the skull ux.
The relationship between ux and vy is based on the physics of imaging. The process
of constructing an observation has two separate components. The rst component is called
a transformation, or pose, denoted T . It relates the coordinate frame of the model, x, to
the coordinate frame of the image, y. The transformation tells us which part of the model
is responsible for a particular pixel of the image. The second component is the imaging
function, F ux; q. The imaging function determines the value of image point vT x. In
general a pixel's value may be a function both of the model and other exogenous factors.
For example, an image of an object depends not only on the object but also on the lighting.
The parameter, q, collects all of the exogenous in uences into a single vector. The complete
imaging model is then:
vT x = F ux; q + ;
4.1
or equivalently,
vy = F uT ,1y; q + ;
4.2
where is a random variable that models noise in the imaging process.
For a number of practical problems, the transformation between a model and an image
77 Paul A. Viola CHAPTER 4. MATCHING AND ALIGNMENT u(x)
v(x) Intensity 1
0.5
0
0.5
1 0 100 200
300
Position 400 Figure 4.1: Graph of ux and vx = ux + versus x.
is not known. Alignment is the process by which the correct transformation is extracted.
Alignment can be a di cult problem for a number of reasons:
The imaging function F of the physical world can be di cult to model.
The exogenous parameters q are not necessarily known and can be di cult to nd. For
example computing the lighting in an image is a nontrivial problem.
The space of transformations, which may have many dimensions, is di cult to search.
Rigid objects often have a 6 dimensional transformation space. Nonrigid objects can
in principle have an unbounded number of pose parameters.
A simple example can lend intuition to these de nitions. Let ux and vy be one
dimensional signals. Let the transformation space be the space of all possible translations T x = x , : 4.3 Let the imaging function F be the identity function. Choosing = 0 leads to vx = ux + : 4.4 Figure 4.1 contains a graph of two signals that obey this relationship. Though we show
the image and model aligned, the correct alignment between v to u may not be known.
78 AITR 1548 Intensity 4.1. ALIGNMENT 2
1.5
1
0.5
0
0.5
1
1.5
2
2.5
3 u(x)
v(x) 0 100 200
300
Position 400 Figure 4.2: Graph of ux and vx = ,ux2 versus x.
In all of our synthetic experiments 10 random noise has been added to v1. Noise is of
course an unavoidable reality of any real system. But more importantly the addition of noise
demonstrates that the algorithms presented are numerically stable.
More complex imaging functions are possible. For example, F might be nonlinear F u = ,u2 : 4.5 Figure 4.2 contains a graph of ux and vT x = F ux. 4.1.1 Correlation as a Maximum Likelihood Technique
The search for the correct alignment can be cast as a maximum likelihood or variance minimization problem see Section 2.3.1. The probability of an image given the model, the
transformation, the noise distribution, the exogenous parameters, and the imaging function
is:
Y
pv j u; T; ; q; F = p = vT xa , F uxa; q :
4.6
x a 2a We use white noise that has been lowpassed ltered to roughly 0.3 cycles per unit. The peak to peak
amplitude of the noise is 10 of the peak to peak amplitude of the signal.
1 79 Paul A. Viola CHAPTER 4. MATCHING AND ALIGNMENT In the above equation we have assumed that each pixel of v is conditionally independent.
Conditional independence does not imply that the pixels are independent, just that if u; T; ; q;
and F are known the pixels are independent. Assuming that the noise is Gaussian, we can
then compute the log likelihood of a transformation as log`T = log pv j u; T; ; q; F
X
= log p = vT xa , F uxa; q
xa 2a
X
= ,k1 vT xa , F uxa; q2
xa2a
h
i
,k2E vT X , F uX ; q2
h
i
h
i
,k2E vT X 2 , 2E vT X F uX ; q + E F uX ; q2 4.7
4.8
4.9
4.10
4.11 where k1 and k2 are constants computed from the variance of the noise and the number of
sample points. They play no role in the maximization. In 4.11 we have expanded the square
of the di erence to show that the log likelihood of a transformation has three components: one
that arises from the variance of the model; a second that arises from the correlation between
the image and the predicted image; and a third that arises from the variance of the predicted
image. For problems where the variance of the image and predicted image are xed, the best
transformation is the one that maximizes the correlation between the actual and predicted
image.
For convenience we will de ne the cost of a transformation as
h
i
C T = E vT X , F uX ; q2
,log`T : 4.12
4.13 The lowest cost transformation is the one that causes the model to match the image best".
As we did in the analysis of principal components and function learning, we have invented" random variables: X , vT X and uX . The random variable X ranges over points
from the coordinate system of u where uX is de ned. The random variables uX and
vT X range over the values in the model and image. In reality there are no random processes involved in matching and alignment. The model and image are predetermined and
xed. Alignment could proceed deterministically; the cost of a transformation being evaluated directly from all of the points in the model and image. We have chosen to interpret
80 4.1. ALIGNMENT AITR 1548 the summation over pixels that arises in correlation as an expectation over a set of random
variables. As a result the insights of probability and statistics can be brought to bear on
these problems. 4.1.2 Correlation and Mutual Information
Alignment is very similar to the problem of function learning that we encountered in Section 3.1. Equation 3.15 is almost identical to 4.7. In both problems we are looking for a
set of parameters that cause the inputs to model the outputs. Function learning attempts to
nd the best parameter vector v; alignment attempts to nd the best transformation T . As
we did for function learning, we can draw an analogy with sample entropy. The log likelihood
of T is proportional to the conditional entropy of the image given the model,
1
log`T = , N havT X j uX ; T; ; q; F :
a 4.14 This is not the EMMA estimate of entropy, but the conditional entropy of v under the
assumption that v is conditionally Gaussian. For the problems described in Section 3.1 it
was possible to show that entropy optimization led to maximum mutual information solutions.
For this problem however, we cannot claim that maximizing log likelihood is equivalent to
maximizing the mutual information between v and u. The mutual information I vT X ; uX = hvT X , hvT X j uX ; T; ; q; F ; 4.15 includes both a conditioned and unconditioned entropy. For some types of transformations
hvT X may change as T is varied. In these cases minimizing conditional entropy is not
equivalent to maximizing mutual information. One must also maximize the unconditioned
entropy. In our simple example, where only translation is varied and the signals are periodic,
unconditioned entropy does not change as T is varied.
Returning to the rst synthetic example, we can plot C T from 4.12 versus translation. Figure 4.3 graphs C T for the two signals from Figure 4.1 we have assumed periodic
boundary conditions on the signals. We can see that the cost has a very strong minimum at
the true translation of 0 pixels.
81 Paul A. Viola CHAPTER 4. MATCHING AND ALIGNMENT
1400
u(x)
v(x) Difference Squared 1200 Intensity 1
0.5
0
0.5
1 1000
800
600
400
200
0 0 100 200
300
Position 400 150 75 0
Position 75 150 Figure 4.3: On the left is a plot of image and model that are identical except for noise. On
the right is a plot of C T versus translation. There is a signi cant minimum at the correct
aligning translation of 0 pixels.
u(x)
v(x) 1400 Difference Squared Intensity 1500
2
1.5
1
0.5
0
0.5
1
1.5
2
2.5
3 1300
1200
1100
1000
900
800 0 100 200
300
Position 400 150 75 0
Position 75 150 Figure 4.4: On the left is a plot of image and model that are related nonlinearly. On the
right is a plot of C T versus translation. There is no minima at the aligning translation of
0 pixels. In fact minima exist at incorrect translations.
Correlation works very well at matching together u and v when the imaging model and
exogenous parameters are known. In many cases however we may be faced with a situation
where F and q are unknown. In some cases alignment problems can still be solved by
assuming that the imaging function is the identity function. This assumption is not e ective
when aligning the nonmonotonically related signals shown in Figure 4.2. Figure 4.4 graphs
C T versus translation for these two signals. Notice that each of the actual minima are at
incorrect translations.
In general C T cannot be used to align signals related by an unknown nonlinear transformations. C T can however be generalized to work with signals that have been transformed
linearly. Rather than minimize the squared di erence between signals, we can instead minimize the squared di erence between signals that have been normalized. A normalized signal
82 4.1. ALIGNMENT AITR 1548 4 u(x)
v(x) Intensity 2
0
2
4
6
0 100 200
300
Position 400 Figure 4.5: Graph of ux and vx = 3ux , 2 versus x.
is one with a mean of zero and a standard deviation of one and can be computed as E
b
ux = ux , XuX :
u 4.16 The normalized version of a signal is invariant to multiplicative and additive changes to the
original. The sum of the squared di erences between the normalized signals, NC T , can
be computed directly as one minus the Normalized correlation between the signals u and v.
Normalized cost is de ned as:
, u E
4.17
NC T = 1 , Ea uX vT X EavTXX a vT X :
au X a
As a shorthand we have abbreviated sums over the coordinates x as expectations and variances.
Normalized cost can be used on signals like the ones shown in Figure 4.5 F u = 3u , 2.
A plot of NC T versus translation is identical to Figure 4.3. In some cases, normalized cost
can be applied to signals transformed by nonlinear monotonic functions. Note however that
the two signals shown in Figure 4.2 are related by a nonmonotonic function and cannot be
accommodated in this way. In these examples translation does not e ect the mean or the
standard deviation of the signals. As a result, normalized cost will not produce a convincing
minimum where cost alone does not.
83 Paul A. Viola CHAPTER 4. MATCHING AND ALIGNMENT We may still wish to align models and images that are related by nonmonotonic functions.
In this case alignment can be performed by jointly searching over the space of possible imaging
functions, exogenous parameters and transformations. Probability can be used to motivate
this procedure. We can evaluate pv j u; T; N; q; F when F and q are unknown by integrating
out the unknown variables. The probability of the image would then be,
ZZ Y
pv j u; T; =
p = vT xa , F uxa; q pF pq dF dq :
4.18
x a 2a This equation integrates over all possible imaging functions and all possible sets of exogenous
variables. We are not aware of any approach that has come close to evaluating such an
integral. It may not be feasible. Another possible approach is to nd the imaging function
and exogenous variables that make the image most likely,
Y
pv j u; T; N max p = vT xa , F uxa; q pF pq :
4.19
F;q
xa 2a Here we have assumed that the integral in Equation 4.18 is approximated by the component
of the integrand that is maximal. The approximation is a good one when a particular F and
q are much more likely than any other.
Using 4.19 we can de ne an alignment procedure as a nested search: 1 given an
estimate for the transformation, nd F and q that make the image most likely; 2 given
estimates for F and q, nd a new transformation that makes the image most likely. Terminate
when the transformation has stabilized. In other words, a transformation associates points
from the model with points in the image; for every ux there is a corresponding vT x. A
function F and parameter vector q are sought that best model the relationship between ux
and vT x. This can be accomplished by training" a function to t the collection of pairs
fvT xa; uxag. Algorithms for nding F and q are very similar to the those for density
approximation and learning described in Chapter 3. Notice also that that alignment with an
unknown imaging model is very similar to entropy maximization. Entropy maximization is
a nested search for a density estimate and parameters. Alignment is a nested search for an
imaging model and a transformation. We will return to this analogy shortly.
Many of the pitfalls of density approximation as described in Chapter 2 apply to function
approximation as well. Before we can hope to learn the function F we must rst make a set
of assumptions about the form of F . Without these assumptions discontinuous estimates for
84 4.1. ALIGNMENT AITR 1548 F , which t the data perfectly well but are very unlikely, can prevent convergence. One way to prevent, or discourage, this behavior is to formulate a strong prior probability over the
space of functions, pF .
In many cases the search for an imaging function and exogenous parameters can be
combined. For any particular F and q, another function Fq ux = F ux; q can be
de ned. Combining functions like this is a common technique in both shape from shading"
and photometric stereo" research. Both techniques compute the shape of an object from
the shading that is present in an image or images. Rather than independently model the
exogenous variable the lighting direction and imaging function the re ectance function
a combined function is represented and manipulated. The combined function is called a
re ectance map Horn, 1986. It maps the normals of an object directly into intensities.
The three dimensional alignment procedure we will describe manipulates a similar combined
function.
How might Equation 4.19 be approximated e ciently? It seems reasonable to assume
that for most real imaging functions similar inputs should yield similar outputs. In other
words, that the unknown imaging function is continuous and piecewise smooth. An e cient
scheme for alignment could skip the step of approximating the imaging function and attempt
to directly evaluate the consistency of a transformation. A transformation is considered
consistent if points that have similar values in the model project to similar values in the
image. By similar we do not mean similar in physical location, as in jxa , xbj, but similar
in value, juxa , uxbj and jvT xa , vT xbj. One adhoc technique for estimating
consistency is to pick a similarity constant and evaluate the following sum:
X
Consistency1T = , vT xb , vT xa2 ;
4.20
where the sum is over xa 2 a and xb 2 b such that juxb,uxaj and xa 6= xb. Consistency
is awed in a number of ways. For instance, there are no obvious clues of picking . We can
replace the all or nothing" nature of the test with a more gradual discrimination:
X
Consistency2T = ,
g uxb , uxavT xb , vT xa2 ;
4.21
xa 6=xb where g is a Gaussian with standard deviation, . In order to minimize this measure, points
that are close together must be more consistent, and those further apart less so. Another
85 Paul A. Viola CHAPTER 4. MATCHING AND ALIGNMENT problem with any consistency measure is that it is too aggressive; consistency is maximized
by constancy. The most consistent transformation projects the points of the model onto a
constant region of the image. For example, if scale is one of the transformation parameters,
one entirely consistent transformation projects all of the points of the model down to a single
point of the image. Though there are a number of problems with consistency that need to be
addressed, it will serve as source of intuition as we analyze di erent approaches.
We now have two alternatives for alignment when the imaging function is unknown: a
theoretical technique that may be intractable 4.19, and an outwardly e cient technique
that has a number of important di culties 4.21. One would like to nd a technique that
combines the best feature from each approach. Perhaps the complex search for the most
likely imaging function, Fq , can be replaced with a simpler search for a consistent imaging
function.
One type of function approximator that maximizes consistency is known as a nearest
neighbor function approximator Duda and Hart, 1973. A nearest neighbor function approximator FN u; a is constructed directly from a sample a. The approximator's value at a
point u is the value of the point which is nearest in the sample: FN u; a = vT ^ such that x = argmin ju , uxaj :
x
^ x 2a
a 4.22 FN can be used to estimate the likelihood of a model as we did in 4.19. The nearest neighbor formulation can be much more e cient than a naive implementation of 4.19, since
there is no need to search for Fq. The model, image, and transformation de ne FN directly.
The nearest neighbor function approximator plays a role in the likelihood computation
that is very similar to the role that Parzen density estimation plays in entropy estimation
see Sections 2.4.3 and 3.2. Unlike the Parzen estimate, the nearest neighbor approximation
is not continuously di erentiable. A similar though di erentiable version is called a weighted
neighbor approximator2:
P
u; a = a Ru , uxav T xa :
4.23
F
P Ru , ux
a
a
The weighting function R usually has a maximum at zero, and falls o asymptotically away
2 This technique is also known as kernel regression. 86 4.1. ALIGNMENT AITR 1548 from zero. A common choice for R is the Gaussian density function g with standard deviation
. We can rewrite F as,
X
F u; a = Wau; uxavT xa :
4.24
a where W is the soft nearest neighbor function rst de ned in Section 3.2, Wau1; u2 P g u1 , u2 :
a g u1 , uxa
An estimate for the log likelihood is then,
X
log`T = ,k2 vT xa , F uxa; a 2
4.25
a
2
X"
X
= ,k2 vT xa , Wauxa; uxbvT xb
4.26
a
b
2
X "X
X
= ,k2
Wauxa; uxbvT xa , Wauxa; uxbvT xb 4.27
ab
b
"X
2
X
= ,k2
Wauxa; uxbvT xa , vT xb :
4.28
a b Step 4.27 relies on the fact that
X
b Wauxa; uxb = 1 ; where xa 2 a. The log likelihood of a transformation using a weighted neighbor function
approximation is very similar to the intuitive consistency measure 4.21. In addition its
derivative bears a striking resemblance to the derivative of the EMMA estimate see 3.26: d log`T
4.29
dT d
d
X Pb g uxa , uxb2vT xa , vT xb dT vT xa , dT vT xb
= ,k2
4.30
Pb g uxa , uxb2
a
!
XX
2 v T xa , v T xb d v T xa , v T xb 4.31
= ,k2
Wauxa; uxb
:
dT
ab 87 Paul A. Viola CHAPTER 4. MATCHING AND ALIGNMENT 450 u(x)
v(x) Log Likelihood Intensity 1
0.5
0
0.5 350
250
150 1
50
0 100 200
300
Position 400 150 75 0
Position 75 150 2
1.5
1
0.5
0
0.5
1
1.5
2
2.5
3 u(x)
v(x) 380 Log Likelihood Intensity Figure 4.6: On the left is a plot of image and model that are identical except for noise. On
the right is a plot of the logarithm of weighted neighbor likelihood versus translation. 280
180
80 0 100 200
300
Position 400 150 75 0
Position 75 150 Figure 4.7: On the left is a plot of image and model that are related nonlinearly. On the
right is a plot of the logarithm of weighted neighbor likelihood versus translation.
Weighted neighbor likelihood can be used to evaluate the cost of di erent translations.
Figure 4.6 shows a graph of weighted neighbor likelihood versus translation for the initial
pair of signals, ux and vx = ux + . Figure 4.7 contains a similar graph for the second,
nonlinear experiment, ux and vx = ux , 22 + . Both graphs show a strong minimum
at the correct alignment of the signals. We can conclude that weighted neighbor likelihood
can be used in situations where neither cost nor normalized cost would work.
The parallel between EMMA and weighted neighbor likelihood is more than structural.
EMMA estimates the density of the sample directly and uses it to compute the derivative of
entropy with respect to the parameter vector; weighted neighbor likelihood estimates the imaging function directly and uses it to compute the derivative of log likelihood with respect to the
transformation. More importantly both techniques manipulate the entropy of the of the joint
distribution u and v. EMMA can be used to evaluate the joint entropy hvT x; uX ;
weighted neighbor likelihood evaluates the conditional entropy hvT x j uX under the
88 4.1. ALIGNMENT AITR 1548 assumption that pvT xjuX ; a = g vT x , F ux; a see Sections 2.3.1 and 4.1.2
for commentary on the equivalence of log likelihood and sample entropy.
We can relax the constraint that v be conditionally Gaussian by using EMMA to estimate
the conditional entropy: hvT xjuX hux , hvT x; uX : 4.32 The rst term is the entropy of the model. It is not a function of the transformation.
Why are these two seemingly unrelated concepts, weighted neighbor likelihood and conditional entropy, so closely related? We can gain some intuition about this equivalence by
looking at the joint distribution of u and v. For any particular transformation we can sample
points uxa; vT xa T and plot them. Figure 4.8 shows the joint samples of the signals
from Figure 4.1 when aligned. The thin line in the plot is the weighted neighbor function
approximation of this data; it is a good t to the data. There is a noticeable clumping, or
clustering, in the data. These clumps arise from the regions of almost constant intensity in
the signals. There are four large regions of constant intensity and four clusters.
When these almost identical signals are aligned they are strongly correlated. Large values in one signal corresponds to large values of the other. Conversely, small values in one
correspond to small values in the other. Correlation measures the tendency of the data to lie
along the line x = y normalized correlation measures the tendency of the data to lie along
some line of positive slope. Figure 4.9 shows the joint samples of the two signals shown in
Figure 4.2. These signals are not linearly related or even correlated, but they are functionally
related.
Weighted neighbor likelihood measures the quality of the weighted neighbor function
approximation. In both of these graphs the points of the sample lie near the weighted neighbor
function approximation. Moreover, in both of these graphs the joint distribution of samples
is tightly packed together. Points are not distributed throughout the space, but lie instead
in a small part of the joint space. This is the hallmark of a low entropy distribution.
We can generate similar graphs for signals that are not aligned. Figures 4.10 and 4.11 show
the same signals except for the fact that the image has been shifted 30 units. For these shifted
signals the structure of the joint distribution is destroyed. The weighted neighbor function
89 Paul A. Viola CHAPTER 4. MATCHING AND ALIGNMENT
2.5
2
1.5
1
0.5
0
0.5
1
1.5
2
2 1.5 1 0.5 0 0.5 1 1.5 2 2.5 Figure 4.8: Samples from the joint space of ux and vx = ux + . A small black square
is plotted for every pixel in the signals. The Xaxis is the value of ux. The Yaxis is the
value of vx. The clumping of points in clusters is caused by the regions of almost constant
intensity in the images. The thin line plotted through the data is the weighted neighbor
function estimate.
approximation is a terrible t to the data. As a result the weighted neighbor likelihood of
these signals is low.
Alternatively we could look directly at the distributions. When the signals are aligned the
distributions are compact. When they are misaligned the distributions are spread out and
haphazard. Or in other words, aligned signals have low joint entropy and misaligned signals
have high joint entropy.
This suggests an alternative to weighted neighbor likelihood: the EMMA approximation
of joint entropy. Graphed below are the EMMA estimates of joint entropy, hw, versus
translation for each signal alignment problems discussed. Figure 4.12 shows a graph of joint
entropy for the two signals that are nearly identical. Figure 4.13 shows a graph of joint
entropy for the model and the nonlinearly transformed image. In both case the graphs show
strong minima at the correct aligning translation. 90 4.1. ALIGNMENT AITR 1548 0.5
0
0.5
1
1.5
2
2.5
3
3.5
2 1.5 1 0.5 0 0.5 1 1.5 2 2.5 Figure 4.9: Samples from the joint space of ux and vx = ,ux2 + . 2.5
2
1.5
1
0.5
0
0.5
1
1.5
2
2 1.5 1 0.5 0 0.5 1 1.5 2 2.5 Figure 4.10: Samples from the joint space of ux and vx = ux + 30 + . Unlike the
previous graph these two signal are no longer aligned. 91 Paul A. Viola CHAPTER 4. MATCHING AND ALIGNMENT 2.5
2
1.5
1
0.5
0
0.5
1
1.5
2
3.5 3 2.5 2 1.5 1 0.5 0 0.5 Figure 4.11: Samples from the joint space of ux and vx = ,ux + 302 + . The two
signals are not aligned.
2.2 u(x)
v(x) Joint Entropy Intensity 1
0.5
0
0.5 2
1.8
1.6 1
1.4
0 100 200
300
Position 400 150 75 0
Position 75 150 2
1.5
1
0.5
0
0.5
1
1.5
2
2.5
3 u(x)
v(x) 2.2 Joint Entropy Intensity Figure 4.12: On the left is a plot of image and model that are identical except for noise. On
the right is a plot of the estimated joint entropy versus translation. 2 1.8 0 100 200
300
Position 400 150 75 0
Position 75 150 Figure 4.13: On the left is a plot of image and model that are related nonlinearly. On the
right is a plot of estimated joint entropy versus translation.
92 4.2. WEIGHTED NEIGHBOR LIKELIHOOD VS. EMMA AITR 1548 4.2 Weighted Neighbor Likelihood vs. EMMA
Weighted neighbor likelihood and EMMA are both smoothly di erentiable functions that can
be used to align signals when the imaging function is unknown. Qualitatively the EMMA
estimate of joint entropy seems better. Joint entropy seems to have a wider basin in these
synthetic experiments.
If weighted neighbor likelihood and EMMA are so similar, why is there a di erence?
Recall that weighted neighbor likelihood measures the conditional entropy of the image given
the model. It does this under the assumption that the conditional distribution of the image
is Gaussian. Weighted neighbor likelihood use the data around a point to estimate the mean
of a Gaussian. The log likelihood of that point is then proportional to the squared di erence
from this mean. In general log likelihood calculations are very sensitive to outliers. Outliers
are points that are, because of noise or measurement error, perturbed and land far from
where they should have. Recall that the log likelihood of a sample is the sum of the log
likelihoods of each point in the sample. As a result a single outlier can ruin a sample that
would otherwise have had a high likelihood.
A more reasonable measure might introduce a bound on the penalty for a single point.
Once a single point moved beyond a certain distance from the local mean the cost would
no longer increase. Calculating likelihood in this way is closely related to the concept of a
robust statistic
EMMA on the other hand does not assume that the conditional distribution of the image
given the model is Gaussian. Instead it approximates the density nonparametrically. EMMA
can handle situations where there are multiple peaks in the conditional distribution. While
there is a likelihood penalty if a group of points are perturbed away from the local mean,
it is not a function of the distance from the mean. Once these points move outside the
e ective range of the smoothing function there is no additional penalty. EMMA's robust
nature prevents it from getting swamped by a few outliers in the joint distribution. This
gives it a greater ability to deal with the distributions that arise from misalignments between
model and image.
In the next sections we will describe a number of other situations where EMMA is better
than weighted neighbor likelihood.
93 Paul A. Viola CHAPTER 4. MATCHING AND ALIGNMENT 4.2.1 Nonfunctional Signals
Up until this point our analysis of alignment has assumed that there exists an imaging
function that relates the model and the image. For at least two classes of problems there will
be no imaging function at all. The rst arises from a common situation in computer vision:
occlusion. The second arises when the model does not contain all of the information required
to predict the image. In both cases no single function, regardless of exogenous variables, can
be used to predict the image from the model.
Figure 4.14 shows a graph of our original pair of signals, except that vx has now been
corrupted by an occlusion. Occlusion proves to be particularly bothersome for the alignment
techniques we have proposed. For example, the basic assumption behind normalized cost
has been violated; the occluded signal is not a linearly transformed version of the model.
In addition, a quick glance at the joint space shows that the assumption behind weighted
neighbor likelihood has also been violated see Figure 4.15; even when the signals are aligned,
there is no longer any function that relates ux and vx. Figure 4.16 show a graph of
weighted neighbor likelihood versus translation. The global minimum no longer coincides
with the correct translation.
In some cases EMMA can be used to align partially occluded signals. Joint entropy does
not su er from the strong assumption that the signals are functionally related. Though part
of the signal may be corrupted, the remaining parts retain their low entropy relationship.
Figure 4.17 show a plot of joint entropy for the occluded pair of signals.
The simplest example of nonfunctional signals often arises when the model and the image
are swapped. Whenever the function between the model and the image is nonmonotonic,
the relationship between the image and the model is nonfunctional. The nonmonotonically
related signals shown in Figure 4.2 are an example. Figure 4.18 shows the joint space of the
swapped signals and a weighted neighbor function approximation. The function t to this
joint space is a terrible approximation of the data. The quality of the function approximation
points out an important limitation of weighted neighbor likelihood. While normalized cost
is a symmetric comparison metric, weighted neighbor likelihood is not. It may seem at rst
that this is an unimportant distinction. It is not. Symmetric measures allow us to match
images to models as well as models to images. This can be critical when it is not possible to
construct a detailed model.
94 4.2. WEIGHTED NEIGHBOR LIKELIHOOD VS. EMMA AITR 1548 u(x)
v(x) 2 Intensity 1.5
1
0.5
0
0.5
1
1.5
0 100 200
300
Position 400 Figure 4.14: Graph of ux and vx where vx has been perturbed by noise, and a portion
of it occluded. Both joint entropy and mutual information are symmetric measures. A plot of EMMA
entropy for the swapped signals is identical to Figure 4.13.
A more complex example of nonfunctional signals arises when both the model and the
image are functions of some third unmeasurable signal. Call this signal zx. There are
now two imaging functions, one that creates the model ux = Fuzx; qu, and another that
creates the image vT x = Fv zx; qv . Medical registration, which we describe in some
detail in the next chapter, is a clear example of a two sensor problem. In medical registration
one seeks an alignment of signals from two types of sensors for example a CT scan and an
MRI scan. Neither gives perfect information about the object, and neither is completely
predictable from the other.
The two sensor problem can be simulated by transforming our original signal by two
di erent nonlinear transforms. Using the original signal from Figure 4.1 we can de ne
Fuz = sin2z and Fv z = z2. The resulting aligned distribution should fall approximately
along a line that looks like Figure 4.19. The actual distribution is shown in Figure 4.20. Here
too EMMA shows a strong minimum at the correct alignment of model and image see
Figure 4.21.
95 Paul A. Viola CHAPTER 4. MATCHING AND ALIGNMENT
2
1.5
1
0.5
0
0.5
1
1.5
2
2 1.5 1 0.5 0 0.5 1 1.5 2 2.5 Figure 4.15: Samples from the joint space of ux and vx where the image has been occluded.
Though these signals are aligned the the weighted neighbor function is a terrible t to the
data. The data is segregated into two parts: the linearly related part that arise from the
nonoccluded signal, and the constant part that projects to the occluded part. Mutual Information versus Joint Entropy
Recall that conditional entropy is not a measure of the dependence between two signals see
Section 2.2. Conditional entropy hvju can be small for two di erent reasons: hv is itself
small, or v is dependent on u. Mutual information is a better measure of dependence. For
the simple examples described in this chapter hv j u can be used alone as a measure of
alignment. In more complex examples, where the model can change scale or project on a
limited part of the image, hv must be taken into account. In general we will solve alignment
problems by maximizing mutual information. 96 Log Likelihood 4.2. WEIGHTED NEIGHBOR LIKELIHOOD VS. EMMA AITR 1548 380 280 150 75 0
Position 75 150 Figure 4.16: Graph of weighted neighbor likelihood versus translation where vx has been
occluded. Joint Entropy 2.2 2 1.8 1.6
150 75 0
Position 75 150 Figure 4.17: Graph of EMMA joint entropy estimate versus translation for the occluded pair
of signals. 97 Paul A. Viola CHAPTER 4. MATCHING AND ALIGNMENT 2.5
2
1.5
1
0.5
0
0.5
1
1.5
2
3.5 3 2.5 2 1.5 1 0.5 0 0.5 Figure 4.18: Samples from the joint space of ux and vx = ux2 + . The roles of the
model and the image have been reversed. 2.5
sin(x) vs. x^2 2 1.5 1 0.5 0 0.5
1.5 1 0.5 0 0.5 1 1.5 Figure 4.19: Graph of the function de ned by u = sin2z and v = z2 as z varies from 1.5
to 1.5. 98 4.2. WEIGHTED NEIGHBOR LIKELIHOOD VS. EMMA AITR 1548 4
3.5
3
2.5
2
1.5
1
0.5
0
0.5
1.5 1 0.5 0 0.5 1 1.5 Joint Entropy Figure 4.20: Samples from the joint space of from the simulated two sensory data. 2.2 2
150 75 0
Position 75 150 Figure 4.21: Graph of EMMA joint entropy estimate versus translation for the nonfunctional
pair of signals. 99 Paul A. Viola CHAPTER 4. MATCHING AND ALIGNMENT 4.3 Alignment Derivation
Let us derive the equations and algorithms used for alignment by maximization of mutual
information. We wish to maximize the mutual information between model and image signals.
This requires a search of the aligning transformation space. We will use the stochastic
gradient descent algorithm described in Section 3.3.
The derivative of mutual information is d I ux; vT x = d hux + d hvT x , d hux; vT x :
dT
dT
dT
dT
Since hux is not a function of T , it drops from our calculations. Using the EMMA entropy
estimate, d I ux; vT x 1 X X W v ; v v , v T ,1 d v , v
4.33
dT
NB xi2B xj 2A v i j i j v dT i j
1 X X W w ; w w , w T ,1 d w , w : 4.34
,N
uv i j
i
j
j
uv dT i
B xi 2B xj 2A
The following de nitions have been used:
j
Wv vi; vj P gv vi , v, v ;
k
xk 2A gv vi Wuv wi; wj P guvgwi , wj w ;
xk 2A uv wi , k
ui uxi ; uj uxj ; uk uxk ;
vi vT xi ; vj vT xj ; vk vT xk ;
wi ui; vi T ; wj uj ; vj T ; and wk uk ; vk T :
We assume that the covariance matrices of the component densities used in the approximation
scheme for the joint density are block diagonal,
,
,,
uv1 = DIAGuu1; vv1 ; 100 4.4. MATCHING AND MINIMUM DESCRIPTION LENGTH AITR 1548 and we obtain an estimate for the derivative of the mutual information as follows
d 1 XX
dI =
,
,d
vi; vj T Wv vi; vj v 1 , Wuv wi; wj vv1 dT vi , vj :
dT N
B xi 2B xj 2A If we are to increase the mutual information, then the e ect of the rst term in the brackets
may be interpreted as acting to increase the squared distance between pairs of samples that
are nearby in image intensity, while the second term acts to decrease the squared distance
between pairs of samples that are nearby in both image intensity and the model properties.
It is important to emphasize that distances are in the space of values intensities, brightness,
or surface properties, rather than coordinate locations.
d
The term dT vi , vj will generally involve gradients of the image intensities and the
derivative of transformed coordinates with respect to the transformation. In the simple case
that T is a linear operator, we obtain the following outer product expression, d vT x = rvT x xT :
i
ii
dT 4.4 Matching and Minimum Description Length
There is another entirely di erent motivation for using mutual information as a alignment
metric. Alignment, and many other vision problems, can be reformulated as minimum description length MDL problems Rissanen, 1978; Leclerc, 1988. MDL can provide us with
some new insight into the problem of alignment and help us derive a missing and often useful
term in the alignment equations.
The standard framework of MDL involves a sender and a receiver communicating descriptions of images. Given that the sender and the receiver have an agreed upon language
for describing images, the sender's goal is to nd a message that will accurately describe an
image in the fewest bits. The concept of description length is clearly related to the code
length introduced in Section 2.2.
For the problem of alignment we will assume that the sender and the receiver share the
same set of object models. The sender's goal is to communicate an image of one of these
101 Paul A. Viola CHAPTER 4. MATCHING AND ALIGNMENT models. Knowing nothing else, the sender could simply ignore the models and send a message
describing the entire image. This would require a message that is on average as long as the
entropy of the image. However, whenever the image is an observation of a model a more
e cient approach is possible. For example the sender could send the pose of the model
and a description for how to render it. From these the receiver can reconstruct the part of
the original image in which the model lies. To send the entire image, the sender need only
encode the errors in this reconstruction, if there are any, and any part of the image that is
unexplained by the model. Alignment can be thought of as the process by which the sender
attempts to nd the model pose that minimizes the code length of the overall message.
The encoding of the entire image has several parts: 1 a message describing the pose;
2 a message describing the imaging function; 3 a message describing the errors in the
reconstruction; and 4 a message describing the parts of the image unexplained by the
model. The length of each part of the message is proportional to its entropy. We can assume
that poses are uniformly distributed, and that sending a pose incurs some small uniform
cost. The length of part 4 is the entropy of the image that is unexplained. Parts 2 and 3
can be interpreted in two ways. We can assume that the imaging function can be sent with
a xed or small cost. Part 3 is then proportional to the conditional entropy of the image
given the model and imaging function. This is precisely what was estimated and minimized
with weighted neighbor alignment. A second interpretation comes from EMMA. EMMA
estimates the joint entropy of the model and image, hu; v. The conditional entropy of the
image given the model can be computed as hvju = hu; v , hu. Since the entropy of the
model is xed, minimizing the joint entropy minimizes the conditional entropy. In both cases
entropy based alignment as proposed in the rst part of this chapter minimizes the cost of
sending parts 1, 2 and 3. MDL suggests that we must also minimize the entropy of the
unmodeled part of the image.
In the previous information theoretic formulation there was no concept of pixels or of
the proportion of the image explained by the model. In fact, in the previous formulation the
entropy of the explained part of the image could get larger as the model shrunk. For example,
assume that the model covers a contiguous region of an image where most of the pixels have
constant value. At the center of this region is a small patch containing varied pixels. Recall
that the image is sampled at points that are projected from the model. Most of the model
points will project into the region of constant intensity and a few will project onto the varied
patch. The resulting distribution of image pixels, because it has many samples of the same
102 4.4. MATCHING AND MINIMUM DESCRIPTION LENGTH AITR 1548 value, has fairly low entropy. If the model were shrunk to cover only the varied patch, then
all of the points from the model would fall in the varied region. The new distribution of pixel
values will have higher entropy.
Can the mutual information between the model and the entire image be measured? This
would be a more direct parallel to the MDL framework. MDL directs the sender to transmit
the model and pose that will produce an encoding for the entire image that is shortest. As
the model explains more, less of the image needs to be encoded directly. We can derive an
approximation to mutual information between the model and the entire image. Let U and V
represent the whole model and whole image respectively. The mutual information equation
can now be rewritten,
I U; V = hU + hV , hU; V :
4.35
The dependence of hU; V on T is implicit in the above equation.
Logically, the image can be split into two parts: Vm and Vm . Vm is the part of the image
in which the model lies. Vm is the unmodeled part of the image. We can de ne them as
Vm fvy such that uT ,1y is de nedg ;
and, 4.36 Vm V , Vm :
4.37 It is assumed that the two parts of the image are independent. This allows us to split the
entropies that involve V into two distinct parts, I U; V = hU + hV , hU; V
= hU + hVm ; Vm , hU; Vm ; Vm
= hU + hVm + hVm , hU; Vm , hVm
= hU + hVm , hU; Vm : 4.38
4.39
4.40
4.41 Step 4.40 relies on the assumption that Vm is independent of both Vm and U i.e. that the
background is independent both of the image of the object and the object. Equation 4.41
directs us to maximize the entropy of the modeled part of the image, hVm , as well as
minimizing the joint entropy of model and the image. 103 Paul A. Viola CHAPTER 4. MATCHING AND ALIGNMENT EMMA can estimate the entropy of a signal, but what is the entropy of an entire image?
This is a very di cult question to answer. The entropy of an image is a function of the
distribution of a vector random variable with between ten thousand and a million dimensions.
Though the entropy of an image can be modeled as the the sum of the pixel entropies, this
is guaranteed to be an overestimate of the true image entropy.
One of the problems with mutual information alignment is that it does not tell us whether
the object is present in the image or not. The principle of minimum description length should
allow us to derive a decision threshold for the presence of an object. The object is present
if the image is easier to encode with it than without it. Unfortunately this decision is highly
dependent on the estimate of the entropy, or code length, of the explained part of the image.
The naive overestimate of image entropythat simply sums the entropies of the pixelsis
not tight enough to determine the decision threshold correctly. An important area of open
research is deriving a more reasonable estimate of the code length of an image.
In previous derivations points were sampled from the model and projected into the image.
Since we are now explicitly modeling the entropy of the image, we will sample from the image
and project back into the model. The joint entropy is then, hvy; uT ,1y = EY log pvy; uT ,1y
Z
= pvy; uT ,1ylog pvy; uT ,1ydy
Z
= pvT x; uxlog pvT x; uxAT; xdx
= EX AT; xlog pvT x; ux : 4.42
4.43
4.44
4.45 Equation 4.44 involves a change of variables from y back into x. The correcting term AT; x
plays the role of the Jacobian of the transformation, measuring the ratio of the area in the
model as it projects into the image. For a ne transformations the projected area is related
to the model area by the determinant of the transformation. For projective transformations
there will also be a term that arises from foreshortening. The mutual information is then
2
0
13
+logP uxb; a
6
B
C7
C7 :
I U; V Eb 6AT; xb B +logP vT xb; a
4.46
4
@
A5
fv T xb; uxb g; a
,logP
The mutual information between the whole model and image is the EMMA estimate of the
104 4.5. SUMMARY AITR 1548 mutual information between the model and image signals weighted by the projected area of
the model. This new formulation of mutual information includes a term which encourages
the model to explain as much of the image as possible. The derivative
2
0
13
+logP uxb; a
6
C7
dB
6 AT; xb dT B +logP vT xb; a
C7
6
@
A7
6
7
fv T xb; uxbg; a
6
7
,logP
d I U; V E 6 0
7;
1
6
7
4.47
b6
7
dT
+logP uxb; a
6B
7
Cd
6 +B
C dT AT; xb 7
6 @ +logP vT xb; a
7
A
4
5
,logP fvT xb; uxbg; a
encourages the model to grow as large as possible. 4.5 Summary
In this chapter several motivations for the selection of mutual information as a measure of
alignment have been presented. The primary existing measure of alignment, correlation, is
rederived as a maximum likelihood technique which has a number of important weaknesses.
The concept of maximum likelihood matching is then extended to de ne a more general
measure of alignment. This measure, called weighted neighbor likelihood, can be interpreted
as estimating the conditional entropy of an image given a model. While it is more general than
correlation, weighted neighbor likelihood is still limited. It can be still further generalized
yielding a technique called EMMA alignment. This technique explicitly estimates the mutual
information between an image and model. A number of synthetic experiments demonstrate
that mutual information is a very exible measure of alignment.
In the nal section of this chapter the concept of minimum description length is used as
yet another motivation for mutual information as an alignment measure. This alternative
framework encourages the model to explain as much of the image as possible. 105 Chapter 5
Alignment Experiments
This chapter contains a number of experiments designed to demonstrate that alignment by
maximization of mutual information is a practical technique. The previous chapter contained
a very general de nition of alignment. Though a procedure for adjusting the pose parameters
during alignment was derived, many of the details regarding representations and implementations were left out. Each experiment described in this chapter will include both these needed
details and a general discussion of the experimental framework.
By the end of this chapter we will have developed some familiarity with the application
of EMMA alignment. The chapter will conclude with a section describing explicit limitations of this approach. In addition to describing problems for which EMMA alignment is
poorly suited, it will be emphasized that EMMA alignment by itself is not a complete object
recognition system.
For clarity the parameters and assumptions that underly EMMA alignment have been
identi ed. We have broken the process of setting up an experiment into discrete steps that
can be applied to a wide variety of alignment problems see Table 5.1. Along with the
description of each experiment we will include a similar table with a speci c realization for
each step. 106 5.1. ALIGNMENT OF 3D OBJECTS TO VIDEO AITR 1548 1. Choose a model and image representation i. e. de ne u and
v. De ne an interpolation scheme for sampling v at nonintegral coordinates.
2. Choose a scheme for sampling the model i.e. de ne x.
3. Determine the space of possible aligning transformations and its
concrete representation i.e. de ne T . The de nition of the
random variables ux and vT x is now complete.
4. Derive an expression for dvy=dy.
5. Pick a metric for computing distances between pairs of samples
of ux, vT x, and fux, vT xg.
6. Pick the variance for the component densities: .
7. Choose a value for pmin .
8. Determine the number of samples used to estimate the distribution, and the number used to estimate the entropy.
9. Pick a parameter update rate, . In general the update rate will
decrease with time.
Table 5.1: The process for setting up an alignment. 5.1 Alignment of 3D Objects to Video
In our rst experiment we will return to the example described in the introduction: alignment
of a threedimensional object to a video image. In all of our alignment experiments we will
assume that the entire object has the same surface properties. We can then treat surface
property as yet another exogenous variable.
Following Table 5.1:
1. Models are a collection of points that lie on the surface of the object. We chose this representation because it is capable of representing any shape including smoothly curved
or irregular forms. It is equally capable of representing objects with at faces such as
polyhedra. The models have been constructed so the distribution of surface points is
as close to uniform as possible. Associated with each surface point is the local surface
normal, a unit vector perpendicular to the surface. The models used have between 7000
and 65,000 points. Video images are represented as simple two dimensional arrays.
2. The random variable x that is used to sample the model and image is de ned from the
107 Paul A. Viola CHAPTER 5. ALIGNMENT EXPERIMENTS model. A trial of x is a randomly selected model point. The value of the trial is the
3D location of that model point. We sample the points of the model uniformly.
3. The transformation space is the space of rigid three dimensional translations and rotations followed by perspective projection. The overall transformation is a concatenation
of rotations and translations each acting on the prede ned center" of the object.
Because of selfocclusion, not every point on the model is visible. Visibility is determined by a Zbu er rendering of the model. Zbu er rendering takes each point in the
model and projects it into the image. When multiple points fall onto the same pixel,
only the point that is nearest is considered visible. As pose changes, some points become visible and others become invisible. In theory Zbu ering needs be repeated every
time the pose of an object changes. Unfortunately, Zbu ering takes time proportional
to the size of the model. This cost is far larger than the cost of computing an estimate
for the derivative of entropy. Since pose does not change much between iterations of
gradient descent, it has proven su cient to Zbu er every 300 iterations.
4. The derivative dvy=dy is the spatial gradient of the image.
5. The metric used for comparing points sampled from the image is squared di erence.
The representation of joint events, w = fvT x; uxg, is somewhat more complex. We
will represent only two dimensions of the normal vector: the x and y components. Since
the normal is always a unit vector, the z component is redundant. The joint events are
therefore three dimensional vectors, two components from the model and one from the
image. We will use Euclidean distance to measure the distance between joint events.
6. Since we will be using diagonal covariance matrices for the smoothing functions, four
variances are required. Three for the joint entropy and one for the image entropy. Based
on maximum likelihood estimates from aligned objects, we have settled on a single set
of smoothing parameters that we will use for all of our 3D alignment experiments. For
the joint entropy, the variance of x and y components of the normal are both 0.3 and
the variance for image intensity is 0.2. For the image entropy, the variance for image
intensity is 0.15. Having a single set of parameters is for every experiment is possible
in part because we have prenormalized all images so that their variance is 1.0.
7. We will use a value of 0.01 for pmin . The alignment process shows very little sensitivity
to pmin . We have repeated a number of experiments with a pmin value of 0.1 and
108 5.1. ALIGNMENT OF 3D OBJECTS TO VIDEO AITR 1548 1.0. Our results are not signi cantly di erent. Values that are more than a factor
of 10 smaller than 0:01 cause the derivative of estimated entropy to be too noisy see
Section 3.3. This noise can prevent convergence to the correct pose.
8. Rather than draw two di erent samples, we will use the crossvalidation approximation
see Section 2.4.3. In all of our experiments we use a sample size of 25.
9. Finally, we must choose a parameter update rate, . Actually, since the units of rotation and translation are very di erent two update rates are necessary. Internally we
represent rotations in radians and translations in millimeters. For an object with a 100
millimeter radius a rotation of 0.01 radians about the center of mass can translate a
model point up to 1 millimeter. A translation of 0.01 can at most translate a model
point 0.01 millimeters. The derivative of mutual information with respect to a model
point's position is a combination of a rotation and translation. A small step in the direction of the derivative will move the model point up to 100 times further by rotation
than translation. If there is only a single update rate a poor compromise must be made
between the rapid changes that arise from the rotation and the slow changes that arise
from translation. If the rotation update rate is reduced by a factor of 100 the model
point will move approximately as far by rotation as it does by translation. Scale issues
such as these do not arise when more complex gradient descent techniques are used, for
example conjugate gradient descent or LevenbergMarquardt. Unfortunately, neither
of these techniques can use stochastic estimates of the gradient. Since our models have
a radius that is on the order of 100 millimeters, we have chosen rotation update rates
are 100 times smaller than translation rates. Most of our 3D alignment experiments
proceed in two stages. In the rst stage the rotation update rate is 0.0005 and the
translation update rate is 0.05. After a number of iterations the update rates are then
reduced to 0.0001 and 0.01 respectively. We have chosen a simple automatic descent
procedure in an e ort to simplify subsequent analysis of convergence. The realization of the basic framework is summarized in Table 5.2.
109 Paul A. Viola CHAPTER 5. ALIGNMENT EXPERIMENTS 1. De ne the model and image u and v: u contains points
distributed on the surface of the object. Each point has an associated normal. v is the image of intensities.
2. Sampling x: The sampling is determined by the distribution of
surface points which is close to uniform.
3. Transformation space T : The space of rigid 3D rotations and
translations by perspective projection using an estimate for the
camera parameters.
4. De nition of dvy=dy: This is the intensity gradient.
5. Distance metric: Euclidean distance.
6. Variance, : Assuming diagonal covariance matrices, four different variance are necessary, three for the joint entropy estimate
and one for the image entropy estimate. The variances were 0.3,
0.3, and 0.2 for the x component of the normal, y component of
the normal, and image intensity. The variance was 0.15 for the
image entropy.
7. Minimum probability, pmin : 0.01.
8. Number of samples: One sample of 25 using crossvalidation.
9. Update rate, : Rotation rate: 0.0005 for 3200 steps and then
0.0001 for 3200 steps. Translation rate: 0.05 and then 0.01.
Table 5.2: Summary of 3D to video alignment. 110 5.1. ALIGNMENT OF 3D OBJECTS TO VIDEO AITR 1548 Figure 5.1: A typical image of the skull object. 5.1.1 Alignment of Skull Model
In our rst experiment we will align a 3D skull model to a number of di erent images. The
skull model was produced automatically from a Computed Tomography CT scan of a plastic
skull1. The same plastic skull was then photographed in a number of di erent poses under
natural lighting2. The skull model contains 65000 points. The video images are 240 by 320
pixels. Figure 5.1 is an example video image of the skull.
Figure 5.2 contains a representation of the shape of the skull model. It is an image
displaying the distance from the camera to the visible points on the skull model. White is
further and black nearer. This image is computed by projecting each model point into the
image plane. The pixel to which the model point projects records the distance of the model
point from the camera. There may, however, be a number of model points that project to the
same image pixel. In this case, the depth of the model point which is nearest the camera is
used. Since the model is constructed from a collection of points, it is not dense. As a result
there are some pixels to which no model point projects. A few of these pixels, which remain
white, appear throughout the model.
Thanks to Gil Ettinger and Ron Kikinis for providing the skull model. Their work on medical registration
using this model is described in Grimson et al., 1994.
2 Thanks to J. P. Mellor for providing the skull images. His work on registration is described in Mellor,
1995.
1 111 Paul A. Viola CHAPTER 5. ALIGNMENT EXPERIMENTS Figure 5.2: A depth map of the skull model. See text for description.
We can use a Lambertian re ectance model to render a graphical picture of the skull. A
Lambertian model relates the model normal and image intensity:
X~
vTx =
5.1
i li ux ;
i where the model value ux is the normal vector of a surface patch, li is a vector pointing
toward light source i, and i is proportional to the intensity of that light source. Figure 5.3
shows a rendered version of the model in the same pose as Figure 5.1. To the human eye
this sort of image is more readily interpretable than a depth map. We can bring to bear our
substantial visual competence when the shape of an object is rendered as an image. From
Figure 5.3 it is almost immediately clear that the pose of the object model is close to correct.
There is however no simple relationship between the intensities of the video image and the
rendered image.
The goals of this rst experiment is to answer three questions. 1 Can EMMA align a
complex 3D object model to a number of di erent images taken under uncontrolled lighting?
2 How long does EMMA alignment take to run? 3 What is the range of poses from
which a correct" alignment can be obtained. Regarding this third point, we do not have
true information about either the pose of the object nor the camera parameters of the video
camera. The correct" pose has been determined by inspection of the alignment results. We
can however ask a related question about reliability. How far can the the object be perturbed
112 5.1. ALIGNMENT OF 3D OBJECTS TO VIDEO AITR 1548 Figure 5.3: A rendered image of the skull model.
away from the correct" pose and have EMMA alignment reliably realign it?
To answer our rst question, we must establish that in the six dimensional space of rigid
transformations there is a maximum of mutual information at a plausible alignment pose.
For each image the object model was initially adjusted so that it's pose was close to correct.
This was done by eye. EMMA alignment was then used to pull the object into a correct"
pose.
One scheme for assessing the quality of an alignment is to display the model pose and
the video image together. This can be done by taking a random collection of model points
and projecting them into the coordinate frame of the image. The pixels to which the model
points project are then set to white. The nature of the alignment is readily apparent from
such images. When the model and image are misaligned model points will project onto the
background and the coverage of the object's image will be incomplete. When the model and
image are correctly aligned there is close agreement between the occluding contours of the
model points and the object's image. Figure 5.4 shows an initial incorrect pose in this way.
Figure 5.5 shows the nal pose obtained after running EMMA alignment.
Figures 5.6, 5.7, and 5.8 show the nal alignment obtained for three other images. Notice
that in each of these images the boundaries of the skull are in close agreement with the outline
of the model points.
113 Paul A. Viola CHAPTER 5. ALIGNMENT EXPERIMENTS Figure 5.4: Initial pose of the skull model before alignment.
We would like to emphasize that in none of these experiments have we presegmented the
image. The initial poses often project the model into regions of the image that contain a
signi cant amount of clutter. EMMA reliably settles on a pose where few if any of the model
points project onto the background.
In answer to the second question, EMMA requires roughly 35 seconds on a Sun SparcStation5 for each of the alignments shown above. Run times are identical because we have
chosen to use a xed number of update iterations for each alignment experiment. In some
cases an accurate alignment was obtained well before the full number of iterations had been
completed. In others it appeared that the nal alignment could have been improved if the
number of iterations were increased.
There are few if any principled results on the convergence of stochastic approximation.
Convergence detection is a subtle issue. For example, EMMA does not make a direct estimate
of the mutual information between model and image. During alignment only a stochastic
estimate of the gradient is available. It may be possible to construct an ad hoc procedure
that would be able to detect convergence. Alignment could then be continued until the pose
estimate had converged.
From an analysis of the program's memory access and computation patterns, we conclude
that an implementation on a digital signal processor would be as much as 100 times faster
than our current implementation. One major issue is cache performance. Because EMMA
114 5.1. ALIGNMENT OF 3D OBJECTS TO VIDEO AITR 1548 Figure 5.5: Final pose of the skull model after alignment.
randomly accesses each of the points in the image and model, much time is wasted ushing
and re lling the cache. The cache on a general purpose processor is often fairly limited. Most
digital signal processors include a large quantity of fast SRAM, eliminating the need for a
cache. For random memory accesses a digital signal processor should be approximately 5
times faster than a conventional computer. The inner loop of the EMMA derivative estimation
procedure is dominated by simple oating point operations. Modern digital signal processors
can execute these instructions 10 to 20 times faster than conventional computers. Together
these advantages should lead to an overall improvement in speed of between 50 and 100.
A number of randomized experiments were performed to determine the reliability, accuracy and repeatability of alignment. This data is reported in Table 5.3. An initial alignment
was performed to establish a base pose. This pose, shown in Figure 5.5, is used as a point
of reference. A set of randomized experiments was performed where the base pose is rst
perturbed, and then EMMA is used to realign the image and model. The perturbation is
computed as follows: a random uniformly distributed o set is added to each translational
axis labeled T and then the model is rotated about a randomly selected axis by a random
uniformly selected angle . There were four experiments each including 50 random initial poses. The distribution of the nal and initial poses can be compared by comparing the
variance of the location of the centroid, computed separately in X, Y and Z. Furthermore, the
average angular rotation from the true pose is computed labeled j 4 j. Finally, the number
of poses that failed to converge near the correct solution is reported. The nal statistics are
115 Paul A. Viola CHAPTER 5. ALIGNMENT EXPERIMENTS Figure 5.6: Final pose of the skull model after alignment.
4T XYZ
mm
10
30
20
10; 20 4 INITIAL
X 10
10
20
20; 40 5.94
16.53
10.12
14.83 Y Z j 4 j FINAL
X mm
5.56 6.11 5.11 .61
18.00 16.82 5.88 1.80
12.04 10.77 11.56 1.11
15.46 14.466 28.70 1.87 Y mm
.53
.81
.41
2.22 Z j 4 j 5.49
14.56
9.18
14.19 3.22
2.77
3.31
3.05
100
96
96
78 Table 5.3: Skull Results Table. The nal column contains the percentage of poses that
successfully converged to a pose near the correct pose.
only evaluated over the poses that converged near the correct solution.
These experiments demonstrate that the alignment procedure is reliable when the initial
pose is close to the correct" pose. Outside of this range gradient descent, by itself, is not
capable of converging to the correct solution. The capture range is not unreasonably small
however. Translations as large as half the diameter of the skull can be accommodated, as
can rotations in the plane of up to 45 degrees. Empirically it seems that alignment is most
sensitive to rotation in depth. This is not terribly surprising since only the visible points play
a role in the calculation of the derivative. As a result, when the chin is hidden the derivative
gives you no information about how move the chin out from behind the rest of the skull.
Finally, we have done a number of experiments to demonstrate that EMMA alignment
116 5.1. ALIGNMENT OF 3D OBJECTS TO VIDEO AITR 1548 Figure 5.7: Final pose of the skull model after alignment.
can deal with occlusion. Figure 5.9 shows an initial and nal alignment for an image that
includes an arti cial occlusion that covers the entire chin area. The nal alignment is very
close to the correct one despite the occlusion. Figure 5.10 shows an initial and nal pose for a
more complex occlusion. In this image we have replaced a rectangular window with another
randomly chosen window of the image. The source of the rectangle is near the bottom of the
image. In a number of experiments, we have found that alignment to occluded images can
require more time for convergence. 117 Paul A. Viola CHAPTER 5. ALIGNMENT EXPERIMENTS Figure 5.8: Final pose of the skull model after alignment. Figure 5.9: An image including an arti cial occlusion. White spots denote the pose of the
model. On the left is the initial pose, on the right is the nal pose. Figure 5.10: An image including an arti cial occlusion. White spots denote the pose of the
model. On the left is the initial pose, on the right is the nal pose.
118 5.1. ALIGNMENT OF 3D OBJECTS TO VIDEO AITR 1548 5.1.2 Alignment of Head Model
We have repeated many of the skull experiments with a three dimensional model of a human head. This model was obtained from a Cyberware scan of the subject that was taken
approximately two years before the video images3. A Cyberware scan is a complete three
dimensional representation of the shape of the subject's head in cylindrical coordinates. The
surface normals were computed from the surface by smoothing and di erencing neighboring
surface points.
The experiments in this section are designed to answer two questions: 1 Will the same
techniques and parameters work with two di erent types of models and images? 2 Is it
possible to use the pose re nement procedure to track a moving object in a video sequence?
Figure 5.11 shows an image of the head and a rendering of the model.
How are the face experiments di erent from the skull experiments? Firstly, the face
model is much smoother than the skull model. There really aren't any creases or points of
high curvature. As a result it is much less likely that an edgebased system could construct
a representation either of the image or the model that would be stable under changes in
illumination. Secondly, the albedo of the actual object is not exactly constant. The face
contains eyebrows, lips and other regions where the albedo is not the same. As a result this
is a test of EMMA's ability to handle objects where the assumption of constant albedo is
violated. Thirdly, not all of the occluding contours of the object are present in the model.
The model is truncated both at the chin and the forehead. As a result experiments with this
model demonstrate that EMMA can work even when the occluding contours of the image and
model are not in agreement.
In the previous experiment projecting points from the model into the image was su cient
to describe the model pose. Since the head model is very smooth and some occluding contours
are missing simply projecting the model points into the image is not su cient to determine
the quality of an alignment. For our experiments with the head model we will display the
original image, augmented with model points, alongside a rendered image of the model.
Figures 5.11 and 5.12 show the model before and after alignment. In this experiment the
model has been rotated 30 degrees around the vertical and translated 40 millimeters to the
3 Thanks to Ron Kikinis for providing the Cyberware scan and for allowing me to take the images of him. 119 Paul A. Viola CHAPTER 5. ALIGNMENT EXPERIMENTS Figure 5.11: An initial incorrect pose. The model has been rotated 30 degrees about the
vertical and translated 40 millimeters to the right. On the left is an image of the head along
with a collection of points projected from the model. On the right is a rendering of the model
in the same pose. Figure 5.12: The nal aligned poses. On the left is an image of the head along with a
collection of points projected from the model. On the right is a rendering of the model in the
same pose.
right. Figures 5.13 and 5.14 show another experiment where EMMA alignment corrects for
a 150 millimeter translation in depth.
We have also tested EMMA alignment on a video sequence digitized from a video tape.
The sequence was taken at the same time as the other images, though the camera and the
lens were di erent. Ten frames were acquired from a video tape at 3 frames per second. The
quality of the resulting images is very low. The images were degraded both by their storage
on video tape and by the frame grabber that was used. It was somewhat surprising that these
images worked nearly as well as the higher quality still frames.
Motion in the video sequence was tracked by sequentially aligning the model to each of
the frames. The starting pose for each frame was obtained by using the nal estimated pose
from the previous frame. The starting pose for the rst frame was hand selected so that
EMMA alignment could acquire a good initial alignment. The sequence and pose estimates
are displayed in Figure 5.15.
120 5.1. ALIGNMENT OF 3D OBJECTS TO VIDEO AITR 1548 Figure 5.13: An initial incorrect pose. The model has been moved 150 millimeters toward
the camera. Figure 5.14: The nal aligned poses. 121 Paul A. Viola CHAPTER 5. ALIGNMENT EXPERIMENTS Figure 5.15: Ten frames from a video sequence of Ron's head. 122 5.1. ALIGNMENT OF 3D OBJECTS TO VIDEO AITR 1548 Figure 5.16: Bumps with Di erent Lighting, and their Edges 5.1.3 Alignment of Curved Surfaces
The third experiment is designed to explore the nature of the information that EMMA alignment uses to detect the correct pose. The previous two experiments, because they are based
on real data, can be di cult to analyze. We would like to determine which component of
the information in the image and model is critical to alignment. For example, it could be the
case that EMMA alignment relies implicitly on intensity edges to match model to image. Or,
it could be the case that the occluding contours of the object are of critical importance.
For this experiment we created a very simple, almost pathological, synthetic example.
123 Paul A. Viola CHAPTER 5. ALIGNMENT EXPERIMENTS Figure 5.17: Target Image, Final Model Pose, and Initial Model Pose
The object is a set of three Gaussian shaped bumps in a at patch of surface. This object in
has no sharp edges and does not have a well de ned occluding contour. Figure 5.16 shows
ten di erent images of this object. Each image uses the same Lambertian re ectance model
but has di erent illumination. Across the top row the light source moves gradually from left
to right. In the second row the light source moves from top to bottom. Even for a simple
Lambertian surface, image variation can be signi cant. Below this we show the output of a
Canny edge detector run on the ten di erent images Canny, 1986. The variation between
the di erent edges extracted is quite striking.
Figure 5.17 shows the target image on the left. Here the bumps are inserted into an in nite
plane. Also shown is a rendered version of a typical nal and initial pose of the model. As in
other experiments the rendered images of the model are made using a particular surface and
lighting scheme; they are for visualization only and play no role in the alignment process.
The black regions of the rendered image lie outside the borders of the model. Notice that
the model's boundaries do not coincide with any discontinuities in the target image. Since
there are neither stable edges nor usable occluding boundaries, we can conclude that EMMA
alignment can proceed using only shading information.
The bumps each have a sigma of 7mm the bump is about 3 sigma or 21mm wide. The
bump height is 20mm. Lying together in the same plane they take up an L shaped region that
is 100mm by 100mm. The true pose is 1000mm away from the camera and perpendicular to
the camera axis. The camera has a viewing angle of 18 degrees. This experiment proceeds
exactly as in the previous three dimensional experiments.
As we did with the skull model we performed an analysis of the reliability of the maxima
of mutual information. These experiments summarized in Table 5.4.
124 5.1. ALIGNMENT OF 3D OBJECTS TO VIDEO X 4T
Y mm Z 4 10 10 25 30
15 15 25 20
20 20 25 20 X AITR 1548 INITIAL Y Z j 4 j mm
5.21 5.82 19.97 15.41
8.75 8.95 14.61 9.15
11.72 11.07 13.85 9.22 X
.56
.82
.65 FINAL SUCCESS mm
.45 9.73
.39 8.95
.38 8.35
100
100
80 Y Z j 4 j Table 5.4: Curved Surface Alignment Data 125 5.68
3.23
3.12 Paul A. Viola CHAPTER 5. ALIGNMENT EXPERIMENTS 5.2 Medical Registration Experiments
EMMA alignment of three dimensional objects relies on the fact that the image is a function
of the model and the lighting. It is not necessary that we know the exact nature of this
function. In medical imaging we are faced with a di erent though related task. We are given
two di erent observations of the same object. For example we may be given a Computed
Tomography CT scan and Magnetic Resonance Image MRI of the same patient. Two scans
are often obtained because neither gives perfect information about the patient. CT is good
at nding bone. MRI is good for distinguishing soft tissue. These measurements are taken at
di erent times with di erent machines. A clinician that would like to have information about
bone and soft tissue must integrate the two scans into a single selfconsistent picture. Once
this is done the spatial relationships between structures in the two di erent scans become
apparent. For example the distance between a tumor and a bone can be measured.
In Chapter 4 we argued that EMMA alignment should be able to align two signals
whenever there is mutual information between them. In this experiment two di erent MR
images of the same head will be aligned see Figure 5.18. These images comprise the proton
density and T2weighted images of a doubleecho MR scan. We have chosen these two MR
scans for two reasons: 1 it is clear that the two images share a great deal of information,
while they are not identical; and 2 since they are taken simultaneously the correct alignment should be close to the identity transformation. Because we know ground truth, we can
evaluate the accuracy of the EMMA alignment procedure.
A typical initial alignment appears in Figure 5.19. Notice that this image is a scaled,
sheared, rotated and translated version of the original. The nal alignment is displayed as
a checkerboard. Here every other 20x20 pixel block is taken either from the model image or
aligned target image. Notice that the boundary of the brain in the two images is in close
agreement.
We represent transformation space as a 6 element a ne matrix that is used to project
two dimensional points from the image into the model. This scheme can represent any
combination of scaling, shearing, rotation and translation. The remaining algorithmic details
are summarized in Table 5.5.
In order to determine the reliability and precision of the alignment procedure, 50 random126 5.2. MEDICAL REGISTRATION EXPERIMENTS AITR 1548 1. De ne the model and image u and v: u is the model image,
v is the target image.
2. Sampling x: Sample the pixels of the model image uniformly.
3. Transformation space T : The space of a ne transformations
mapping pixels locations from the model into pixel locations in
the image.
4. De nition of dvy=dy: This is the intensity gradient.
5. Distance metric: Euclidean distance.
6. Variance, : Assuming diagonal covariance matrices, three different variance are necessary: two for the joint entropy and one
for the image entropy. The variances were 0.1 in all cases.
7. Minimum probability, pmin : 0.01.
8. Number of samples: One sample of 20 using crossvalidation.
9. Update rate, : 0.02 for 500 steps and then 0.005 for 500 steps.
Table 5.5: Summary of MRI alignment experiments.
ized alignments were performed. The initial transformations were randomly selected, having
a translation of up to 35 pixels this is about one third of the width of the head, a rotation
of up to 30 degrees, and a scaling of up to 20. The correct alignment was obtained in 100
of the experiments. After alignment the a ne transformations had an average translation
error of 0.1 pixels. The remaining a ne parameters represent a mixing of rotation, scale and
shearing. They are somewhat more di cult to interpret. Given that the correct transformation is the identity matrix we can evaluate the nal matrices by measuring the di erence
from the identity. On average the coe cients of the nal transformation were in error by
0:02. These experiments demonstrate that EMMA alignment is both precise and reliable.
These two MRI images are fairly similar. Good alignment could probably have been
obtained with a normalized correlation metric. Normalized correlation assumes, at least
locally, that one signal is a scaled and o set version of the other. Our technique makes no
such assumption. In fact, it will work across a wide variety of nonlinear transformations.
More di cult alignment problems are easily simulated. In Figure 5.20 we show the model
image after a nonmonotonic nonlinear function has been applied. Recall that initially the
image lies in the range 0,1 . We subtract 0.5, square the result, and renormalize to 0,1 . This
operation is shown at the right of the gure. After applying this nonlinear transformation
the two images are anticorrelated; no variant of correlation can correctly align them. EMMA
127 Paul A. Viola CHAPTER 5. ALIGNMENT EXPERIMENTS Figure 5.18: MR Images
alignment performance, however, is not e ected. 5.2.1 Three Dimensional MR CT Alignment
The medical alignment procedure described above can be extended to volumetric data. In the
resulting system both the model and image are 3D arrays and a full three dimensional aligning
transformation is estimated. Recently a number of three dimensional CT MRI alignments
have been performed. Because these results are preliminary, many of the experimental details
are still in ux. Our description here will be necessarily brief. A more complete description
can be found in Wells III and Viola, 1995.
The scans used were obtained from the same patient at di erent times4. Display of these
three dimensional scans, and their alignment, is a di cult problem. Though the entire scan
cannot be shown, some feeling for the data can obtained be displaying the three central
slices. The central slices are perpendicular planes that pass trough the centroid of the data.
Figure 5.21 shows the three central slices of the CT scan. Figure 5.22 shows the initial
The images and the standard transformations were provided as part of the project, Evaluation of Retrospective Image Registration", National Institutes of Health, Project Number 1 R01 NS3392601, Principal
Investigator, J. Michael Fitzpatrick, Vanderbilt University, Nashville, TN.
4 128 5.2. MEDICAL REGISTRATION EXPERIMENTS AITR 1548 Figure 5.19: Initial Pose and Display of Result
alignment of the CT MR pair as a checkerboard. When the signals involved are very di erent,
the checkerboard representation can be somewhat confusing. Figure 5.23 shows the nal
alignment as the composition of the MR data with the intensity edges computed from the
CT data. Notice the close agreement between the skull in both scans. 129 Paul A. Viola CHAPTER 5. ALIGNMENT EXPERIMENTS 1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0 0.2 0.4 0.6 0.8 1 Figure 5.20: Transformed Model and the Transformation Figure 5.21: The three central slices of the CT data used in the MRCT experiments. 130 5.2. MEDICAL REGISTRATION EXPERIMENTS AITR 1548 Figure 5.22: An initial condition for MRCT registration by maximization of mutual information displayed as a checkerboard composite of the three central slices. Figure 5.23: A nal con guration for MRCT registration by maximization of mutual information. The three central slices of the MRI data are shown with the edges from the registered
CT data overlaid. 131 Paul A. Viola CHAPTER 5. ALIGNMENT EXPERIMENTS 5.3 View Based Recognition Experiments In the previous vision experiments we used knowledge of the physics of imaging to show that
the surface normal of an object should be predictive of the intensity observed in an image.
Unfortunately, in many experimental situations no three dimensional model is available. In
these situations it is frequently the case that the only available information about an object
is a collection of images taken under a variety conditions. One approach for solving problems
like this is to use a collection of images as the model. This is often called a view based"
approach since the model is made up of a number of views of the model object. Given a novel
image of some object, each model image is compared to it in turn. If some model image is
close enough" to the novel image, the model and novel image are considered aligned or
recognized. One can signi cantly reduce the number of model images required by adding an
a ne transformation to the comparison process. The novel image is then compared to each
model image under a set of a ne transformations. The most commonly used comparison
metric is correlation. As we saw in Section 4.1.1, correlation makes the assumption that the
model and the image are identical or possibly related by linear function.
In general the set of images that can arise from a single object under varying illumination
is very broad. Figure 5.24 shows two images of the same object in the same pose. These
images are very di erent and are in fact anticorrelated: bright pixels in the left image
correspond to dark pixels in the right image; dark pixels in the left image correspond to
bright pixels in the right image. No variant of correlation could match these images together.
We have presented techniques based on entropy that can match both correlated and anticorrelated signals. These techniques require only that there is some consistent relationship
between model and image. Discouragingly, it is not di cult to nd two images of the same
object for which there is no consistent relationship. Figure 5.25 shows a novel image which
is aligned with the two model images. Figure 5.26 contains two scatter plots of the pixel
values in the novel image versus the pixel values in the model images. Clearly, there is no
simple consistent relationship displayed in either of these graphs. EMMA could not be used
to match this novel image to either model image.
132 5.3. VIEW BASED RECOGNITION EXPERIMENTS AITR 1548 Figure 5.24: Car Model Images Figure 5.25: A novel image of the car model. 5.3.1 Photometric Stereo
By itself each model image does not contain enough information to constrain the match
between image and model. However, it is well known that taken together a collection of
images can be used to determine the 3D shape of an object. As we've seen the 3D shape is
su cient to constrain the match between image and model.
When multiple images of an object are available a technique called photometric stereo can
be used to estimate its 3D shape Horn, 1986. Photometric stereo works with images which
are taken from the same location but under di erent illumination conditions. It is assumed
that detailed information both about illumination and surface properties are available for
each image. As a result a re ectance map can be computed for each image. The re ectance
map determines the relationship between the normals of an object and the intensities in an
image.
The re ectance map together with the intensity of a pixel acts as a constraint on the
normal vector visible from that pixel. The allowable normals usually lie along a closed curve
on the unit circle. From a second image, and its associated re ectance map, another set
of allowable normals can be computed. By intersecting these constraints, two images are
su cient to determine the surface normal at each pixel. From the normals the shape can be
obtained through integration.
133 Paul A. Viola CHAPTER 5. ALIGNMENT EXPERIMENTS
2 1 1 Novel Image Novel Image 2 0 1 0 1 1 0
1
Model Image 1 2 1 0
1
Model Image 2 2 Figure 5.26: The relationship between pixels in the novel image and each of the model images.
Once the shape of the object is determined, the correct alignment could be computed
using the three dimensional version of EMMA. The imaging function of this new two stage
process is:
I T xi = F Gu1xi; r1; u2xi; r2; q
where G is the photometric stereo function that takes two images and two re ectance
maps and returns the shape, and F is our original imaging function which predicts image
intensities from object normals.
In practice, however, performing photometric stereo requires the kind of detailed metric
information about illumination that is only available under very controlled circumstances.
One cannot use natural images where the lighting is unknown and di cult to determine.
Luckily, we need not actually know G, r1, r2, F , or q. As long as they exist there will be
high mutual information between any novel image and a pair of model images. This is the
essence of view based EMMA alignment. We don't actually perform photometric stereo, we
simply assume that it is possible. As a result a pair of images should give information about
any third image.
To demonstrate this approach we have built a model using the two images in Figure 5.24.
Figure 5.27 shows the target image, and the nal pose obtained after alignment. Figure 5.28
shows the initial pose of the model.
Technically this experiment is very similar to the MRI alignment experiment. The main
di erence is that the model is constructed from a pair of model images. A sample of the
model ux = u1x; u2x T is a two dimensional vector containing the intensity of the two
images at location x. This is similar to the two component representation of normal used in
134 5.4. LIMITATIONS OF EMMA ALIGNMENT AITR 1548 Figure 5.27: Car Image and Final Pose Figure 5.28: Initial Pose of Car Model
the three dimensional alignment experiments. For this experiment is 0:1. The parameters
were updated for 1000 iterations at a rate of 0.002. From a set of randomized experiments
we have determined that the capture range of the alignment procedure is about 40 of the
length and width of the car, and 35 degrees of rotation. 5.4 Limitations of EMMA Alignment
Before we complete our discussion of EMMA alignment, several important caveats must
be emphasized. EMMA alignment is not a recognition procedure. Though EMMA could
well play a role in recognition there are two major missing components. The rst missing
component is an indexing scheme. EMMA alignment only works when the initial hypothetical
pose is close" to the true pose. A number of experiments have been performed in which an
empirical estimate of close" is determined see Tables 5.3 and 5.4. The capture range of
135 Paul A. Viola CHAPTER 5. ALIGNMENT EXPERIMENTS alignment is not large enough to expect that a randomly chosen transformation will converge
to the true solution. As a result, some sort of additional procedure is required that can
rapidly propose possible poses. Such a procedure is typically called an indexing" scheme.
The second missing component is the recognition process itself. Recognition requires that
a decision be made about whether the model object is really present in the image. Object
recognition is very similar in concept to the problem of pattern classi cation Duda and Hart,
1973; Fukunaga, 1990. Pattern classi cation is a process by which a novel input pattern is
classi ed as an example of a class. The task is di cult when the structure of the class is
complex and underspeci ed. For example, determining which stocks are a good investment
is a classi cation task that even the most complex classi ers, human investors, have di culty
performing.
Pattern classi cation can be formulated as a maximum likelihood or maximum a posteriori
problem. Given a novel pattern, the likelihood of each possible class is evaluated in turn. If
one class is much more likely then any other, then that class is considered the correct class.
Object recognition is a similar process. Given a novel image, the likelihood of each object
model is evaluated. In order that there be con dence in the classi cation, it is particularly
important that the most likely model be more likely than the null hypothesis: that the image
does not contain any model. The likelihood of the null hypothesis is proportional to the
unconditioned likelihood of an image.
As we have seen, log likelihood is closely related to entropy. As a result, EMMA can
be used to de ne a classi cation procedure: the mutual information between each model
and the novel image is evaluated; the model selected is the one that provides the most
information about the image. What is missing is a reliable measure of the unconditioned
entropy of an image i.e. the information that the null hypothesis gives us about the image.
The assumption that the pixels are independent underlies the EMMA estimate of image
entropy. Though it leads to inaccuracy, the independence assumption has proven su cient
for alignment. This is primarily because alignment is a relative procedure. The model is
adjusted so that the image is best explained. Recognition, because it is an absolute procedure,
is not so forgiving. A more accurate estimate of image entropy will be required before EMMA
can be used for object recognition. 136 Chapter 6
Other Applications of EMMA
The theory and algorithms presented in this thesis are quite general and can in principle
be applied to a variety of problems. This chapter is devoted to a description of two such
problems and their solutions. In the rst part of the chapter we will show how EMMA can
be used to correct images that have been have been corrupted" by a slowly varying bias
eld. Examples include: MRI corruption that arises from nonuniformity in magnetic eld,
and lightness correction in visual images. The second part of the chapter is devoted to an
application of stochastic gradient descent outside of entropy manipulation. Jones and Poggio
have presented a system that aligns line drawings of faces with novel line drawings Jones
and Poggio, 1995. Their published work uses a complex second order gradient descent
technique known as LevenbergMarquardt. We will show that similar if not better results
can be obtained with stochastic gradient descent. The resulting algorithm operates roughly
30 times as fast as the original. 6.1 Bias Compensation
A magnetic resonance image MRI is a 2 or 3 dimensional image that records the density
of tissues inside the body. In the head, as in other parts of the body, there are a number
of distinct tissue classes including: bone, water, white matter, grey matter, and fat. In
principle the distribution of pixel values in an MRI scan should be clustered, with one
137 Paul A. Viola CHAPTER 6. OTHER APPLICATIONS OF EMMA cluster for each tissue class. In reality MRI signals are corrupted by a bias eld, an additive
or multiplicative o set that varies slowly in space. The bias eld results from unavoidable
variations in magnetic eld see Wells III et al., 1994 for an overview of this problem. The
bias eld makes constructing automatic tissue classi ers di cult.
Wells et al. have built a system for bias correction around the assumption that an uncorrupted MRI scan will have a particular distribution of pixel values. This distribution
will have a peak for each type of tissue. Using an explicit physical model of MRI image
formation they construct a prior model for this distribution as a mixture of Gaussians, with
one Gaussian for each tissue type. The model can then be used to compute the likelihood of
an MRI. Corrupted MRI's will be unlikely because the bias eld blurs together the clusters.
Wells et al. use maximum likelihood to select the correction eld the inverse of the bias
eld that makes a corrupted MRI most likely.
To reiterate, their system nds an estimated correction eld that when applied to the data
makes it look like a particular type of clustered multiclass data. Applying the correction eld
sharpens up the classes and makes automatic classi cation easy. As in the learning problems
encountered in previous chapters, some assumption about the nature of the correction eld
is necessary to condition the problem. If we have prior knowledge that the bias eld varies
slowly across space, the correction eld should also vary slowly. Wells et al. assume that the
bias eld is smooth. To encourage smoothness they introduce a probabilistic prior in which
smooth elds are more likely than nonsmooth ones.
The main disadvantage of their MRI correction system is that it requires a fairly accurate
model of the tissue distribution. These models can be di cult to construct. Furthermore,
since the model includes estimates for the relative proportions of the tissue types, a di erent
model is required for each region of the body.
Using entropy we can proceed in a much less modelbased way. Since Wells et al.'s
technique has proven to be quite e ective, we can safely assume that the pixel values of an
uncorrupted MRI image are clustered into distinct classes. Such a distribution should have
low entropy. Corruption from the bias eld blurs together the clusters. The bias eld acts
like noise, adding entropy to the pixel distribution. This insight is the central idea behind
our approach. We attempt to nd the lowfrequency correction eld that when applied to
the image, makes the pixel distribution have a lower entropy. The resulting bias corrected"
image will have a tighter clustering than the original distribution.
138 6.1. BIAS COMPENSATION AITR 1548 Insuring the smoothness of the correction eld can be tricky. Wells et al. estimate a dense
correction eld, with one estimate for every pixel in the MRI. They insure smoothness by
periodically, at every iteration, smoothing the correction eld estimates. Another approach
would be to represent the bias eld as smooth in the rst place. This can be done parametrically by representing the correction eld as a smooth parameterized function. Or it can
be done using a lowfrequency correction image, with say 1 pixel for every 10 in the image.
Another approach, one which is guaranteed to be better when it is possible, is to represent
the correction eld with an explicit physical model. In the case of MRI, physics can be used
to show that the bias eld should be a low order polynomial of location M. Tincher and Williams, 1993. We take this approach for correcting MRI scans, representing the correction
eld as a third degree polynomial in the x and y coordinates of the scan.
The code that minimizes the entropy of the MRI is very similar to the other entropy
manipulation applications we have described. Once again we sample points from the image,
x, where each point now has a value, vx, and a current estimate for the bias eld, bx.
Proceeding as we have before, we approximate the entropy of the bias compensated image,
cx = vx , bx see Equation 3.22. The bias eld is adjusted to minimize hcx by
taking steps in the direction of the derivative as approximated in Equation 3.24. In this case
the parameters over which we are minimizing entropy are the coe cients of the bias eld
polynomial. The derivatives of the bias eld, since they are polynomial, are easy to compute.
The rst experiment is identical to a synthetic experiment proposed by Wells et al. A
binary checkerboard is used as a prototypical example of a two class image. Half of the
pixels belong to the black class, the other half to the white class. The pixel entropy of a
checkerboard is very low. The checkerboard is then corrupted by a large unknown bias eld
see Figures 6.1 and 6.2. The corruption is so large that any cluster structure in the data
has disappeared. This is apparent both from the distribution of pixels and a thresholding of
the corrupted image see Figures 6.3 and 6.2. Entropy minimization comes very close to
exactly compensating for the bias eld. Figure 6.2 shows the corrected image. Its distribution
is also shown in Figure 6.3. The image is not perfectly corrected because we use a di erent
bias eld representation than Wells et al.
One of the goals of the work by Wells et al. is to correctly classify white versus grey matter
in the brain see Bezdek et al., 1993 for a comprehensive overview of MRI segmentation.
They show that classi cation is much easier if the bias eld of a scan is known beforehand.
139 Paul A. Viola CHAPTER 6. OTHER APPLICATIONS OF EMMA Figure 6.1: The original checkerboard image and the bias eld that corrupts it. Figure 6.2: Left: the corrupted checkerboard. Center: a thresholded version of the corrupted
image. Right: the corrected checkerboard.
We have performed a number of experiments where EMMA is used to nd the correction
eld. While Wells et al.'s system is designed to give a tissue classi cation for the corrected
scan, EMMA based correction does not provide a classi cation. Instead we will examine the
distribution of the corrected scan and attempt to determine if the scan has been corrected
in a way that would make classi cation of white and grey matter easier. Figure 6.4 shows a
slice of MRI data taken from a brain. Figure 6.5 shows the histogram of the scan before and
after correction. In the histogram of the original scan white and grey matter tissue classes are
confounded into a single peak ranging from about 0.4 to 0.7. The histogram of the corrected
scan shows much better separation between these two classes. We can highlight the pixels
in this distribution by mapping all the pixels below the rst peak to black, all the pixels
above the second peak to white, and linearly scaling between white matter appears darker
than grey matter in this MRI scan. Figure 6.6 shows the original and corrected scans in this
manner. Notice that the inhomogeneity in the original image becomes immediately apparent.
The lower left hand portion of the original scan is dark. Since the corrected scan does not
show this inhomogeneity, the white and grey matter of the corrected scan are distinct.
140 6.1. BIAS COMPENSATION AITR 1548 18000
Corrupted
Corrected
16000 14000 12000 10000 8000 6000 4000 2000 0
0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1 Figure 6.3: The pixel density of the corrupted checkerboard and the compensated checkerboard. Figure 6.4: A slice from an MRI scan of a head. A Second Experiment
The procedure for bias eld correction has been repeated for a number of di erent scans.
Figures 6.7, 6.8 and 6.9 show the results of an experiment performed on a coronal slice of an
MRI scan.
141 Paul A. Viola CHAPTER 6. OTHER APPLICATIONS OF EMMA Corrupted
Corrected 1200 1000 800 600 400 200 0
0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Figure 6.5: The distribution of pixel value in the MRI scan before and after correction. Experimental Details
All of the experiments use a smoothing function variance of 0.01. The learning rate was
0.003. The sample size was 30. Every run was concluded after 8000 parameter updates.
The pmin value was 0.01.
The checkerboard image is 256 by 256 pixels. The checkerboard experiment uses a bias
eld which is a 20 by 20 pixel low resolution image. This 20 by 20 image is then bilinearly
interpolated to create a continuous bias o set for each pixel in the checkerboard. The 400
parameters in this bias eld are the most ever simultaneously approximated by an EMMA
based technique.
The MRI scans varied in size from 100 square to 256 square. The bias eld for the MRI
experiments was a 3rd order polynomial in x and y location. 6.2 Alignment of Line Drawings
Jones and Poggio have constructed a system that automatically analyzes hand drawn faces
Jones and Poggio, 1995. Their system can estimate the emotional content of these faces i.e.
it can determine that you have drawn a very happy face that is somewhat surprised. Their
142 6.2. ALIGNMENT OF LINE DRAWINGS AITR 1548 Figure 6.6: At left is the MRI scan shown with a modi ed intensity scale. The inhomogeneity
is not much more apparent. In the center is the correct scan shown using the same intensity
scale. At right is the estimated correction eld. Figure 6.7: A coronal slice from an MRI scan of a head.
system works by constructing a nonrigid transformation that maps a novel drawing onto a
hand drawn neutral" face see gure Figure 6.10. The shape of the nonrigid transformation
determines the emotion of the face.
Jones and Poggio use a representation for nonrigid transformations that is called ow. A
ow is an image of displacement vectors. They search for a ow that minimizes the di erence
between the base image bx and the novel image nx,
X
C f = nx , bx + f x2 :
6.1
where f is the ow image and the summation is over all of the pixels in the novel image. The
problem of nonrigidly transforming one image into another has been very heavily studied in
computer vision. In most previous work ow is represented directly as an image of displace143 Paul A. Viola CHAPTER 6. OTHER APPLICATIONS OF EMMA Corrupted
Corrected 1200 1000 800 600 400 200 0
0.05 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 Figure 6.8: The distribution of pixel value in the MRI scan before and after correction. Solid
line is the original distribution. The dashed line is the corrected distribution. Figure 6.9: The same scan but with a modi ed intensity scale. Points above the intensity for
grey matter appear white. Points below the intensity for white matter appear black. There
is a linear scale between. Notice that the lower part of the image is darker in the uncorrected
image. On the right is the estimated bias.
ment vectors. The search problem is then conditioned with the addition of a smoothness prior
over ow images. Jones and Poggio decompose ow into a linear combination of component
ows,
!2
X
X
C f ig =
nx , bx + ifix :
6.2
i The search for ow then becomes an unconstrained optimization over the parameters i.
Each component ow represents a di erent type of emotion. For example, one kind of
ow might transform a neutral face into a smiling face. Another might make a face look
more surprised. These ows can be mixed together to produce images that have combination
144 6.2. ALIGNMENT OF LINE DRAWINGS AITR 1548 Figure 6.10: The rst face is the neutral" face. The others are frown", narrow eyes",
surprise", eyebrows", and smile".
of properties. Given a novel image, a set of i's are determined that provide for the closest
match with the neutral face. These i's determine to what extent the face is smiling or
frowning. Figure 6.11 shows several novel images and the best reconstruction obtained by
transforming the neutral face.
In their paper Jones and Poggio use LevenbergMarquardt, a second order gradient descent procedure, to determine the ow parameters see William H. Press and Veterling, 1992
for an excellent discussion optimization techniques. Together we have replaced this technique with a much simpler stochastic gradient descent procedure. Over the course of many
experiments stochastic alignment has improved running times from 60 seconds to 2 seconds.
The quality of the minima found with stochastic gradient descent is equivalent to, if not
better, than LevenbergMarquardt. Moreover, stochastic gradient descent rarely if ever gets
stuck in local minima. There were a number of cases where LevenbergMarquardt converges
far from the the best solution. On these same problems, stochastic gradient descent almost
always nds a good solution. 145 Paul A. Viola CHAPTER 6. OTHER APPLICATIONS OF EMMA Figure 6.11: A series of novel faces with the best reconstruction displayed below. 146 Chapter 7
Conclusion
Maximization of mutual information appears to be a new and powerful means of performing
local alignment of objects and images. In a typical vision application it is an intensitybased,
rather than feature based method. While intensity based, it is more robust than traditional
correlation, as shown by the insensitivity to lighting demonstrated in the experiment of
Section 5.1.3. In addition, the method is insensitive to negating the image data, as well
as a variety of nonlinear transformations, which would defeat conventional intensitybased
correlation.
The weaknesses of intensity correlation may be corrected, to some extent, by performing
correlations on the magnitude of the brightness gradient. This, as well as edgebased matching
techniques, can perform well on objects having discontinuous surface properties, or useful
silhouettes. These approaches work because the image counterparts of these discontinuities
are reasonably stable with respect to illumination.
Gradient magnitude correlation, as well as edgebased methods can have serious di culties in domains lacking discontinuities, such as the example shown in Section 5.1.3, because neither edges, nor their precursor, gradient magnitude, are stable in image position
with respect to lighting changes see Figure 5.16. While our technique works well using
only shading, it also works well in domains having surface property discontinuities and silhouette information see Section 5.1.1. 147 Paul A. Viola CHAPTER 7. CONCLUSION 7.1 Related Work
Alignment by extremizing properties of the joint signal has been used by Hill and Hawkes
Hill et al., 1994 to align MRI, CT, and other medical image modalities. They use third order
moments to characterize the clustering of the joint data. We believe that mutual information
is perhaps a more direct measure of the salient property of the joint data at alignment, and
demonstrate an e cient means of estimating and extremizing it.
There are many schemes that represent models and images by collections of edges and
de ne a distance metric between them that is proportional to the number of edges that coincide
see the excellent survey articles: Besl and Jain, 1985; Chin and Dyer, 1986. A smooth,
optimizable version of this metric can be de ned by introducing a penalty both for unmatched
edges and for the distance between those that are matched Lowe, 1985; Wells III, 1992b;
Huttenlocher et al., 1991. This metric can then be used both for image model comparison
and for pose re nement. Edge based metrics can work under a variety of di erent lighting
conditions, but they make two very strong assumptions: the edges that arise are stable under
changes in lighting, and the models are well described as a collection of edges. Clearly
smoothly curved objects are a real problem for these techniques. As we alluded before, Wells
has performed a number of experiments where he attempts to match edges that are extracted
under varying lighting. In general for even moderately curved objects, the number of unstable
and therefore unreliable edges is problematic. Faces, cars, fruit and a myriad of other objects
have proven to be very di cult to model using edges.
Others use more direct techniques to build models. Generally these approaches revolve
around the use of the image itself as an object model. Objects need not have edges to be well
represented in this way, but care must be taken to deal with changes in lighting and pose.
Turk and Pentland have used a large collection of face images to train a system to construct
representations that are invariant to some changes in lighting and pose Turk and Pentland,
1991. These representations are a projection onto the largest eigenvectors of the distribution
of images within the collection. Their system addresses the problem of recognition rather
than alignment, and as a result much of the emphasis and many of the results are di erent.
For instance, it is not clear how much variation in pose can be handled by their system.
We do not see a straightforward extension of this or similar eigenspace work to the problem
of pose re nement. On a related note Shashua has shown that all of the images, under
di erent lighting, of a Lambertian surface are a linear combination of any three of the images
148 7.2. A PARALLEL WITH GEOMETRICAL ALIGNMENT AITR 1548 Shashua, 1992. This also bears a clear relation to the work of Turk and Pentland in that
the eigenvectors of a set of images of an object should span this three dimensional space.
Entropy is playing an ever increasing role within the eld of neural networks. We know
of no work on the alignment of models and images, but there has been work using entropy
and information in vision problems. None of these techniques uses a nonparametric scheme
for density entropy estimation as we do. In most cases the distributions are assumed to be
either binomial or Gaussian. This both simpli es and limits such approaches.
Linsker has used the concept of information maximization to motivate a theory of development in the primary visual cortex Linsker, 1986. He has been able to predict the
development of receptive elds that are very reminiscent of the ones found in the primate
visual cortex. He uses a Gaussian model both for the signal and the noise.
Becker and Hinton have used the maximization of mutual information as a framework for
learning di erent lowlevel processing algorithms such as disparity estimation and curvature
estimation Becker and Hinton, 1992. They assume that the signals whose mutual information is to be maximized are Gaussian. In addition, they assume that the only joint information
between images is the information that they wish to extract i.e. they train their disparity
detectors on random dot stereograms.
Finally, Bell has used a measure of information to separate signals that have been linearly
mixed together Bell and Sejnowski, 1995. His technique assumes that the di erent mixed
signals carry little mutual information. While he does not assume that the distribution has
a particular functional form, he does assume that the distribution is well matched to a preselected transfer function. For example, a Gaussian is well matched to the logistic function
because applying a correctly positioned and scaled logistic function results in a uniform
distribution. 7.2 A Parallel with Geometrical Alignment
EMMA bears some similarity to methods used for evaluating and adjusting geometrical alignment. These similarities may be seen by revisiting the entropy derivative of Equation 3.28,
and comparing it to the derivative of the following construct.
149 Paul A. Viola CHAPTER 7. CONCLUSION We de ne D, half the averaged Mahalonobis distance between values in B and their
nearest correspondences in A,
1 X min 1 Dz , z :
7.1
DT N
i
j
B zi 2B zj 2A 2
Locally away from discontinuities, the derivative of the above expression is
d DT 1 X min d 1 Dz , z :
dT
NB zi 2B zj 2A dT 2 i j
Comparing the above expression with Equation 3.28, we see the following analogy. If the
transformation T is adjusted to reduce the averaged squared di erences" between points in
B and their counterparts from A that are nearest in signal value, then a reduction in entropy
is obtained. This is intuitive, in that entropy will be lower if clusters in signal value" are
tighter so that nearby signal di erences will be smaller. The approximation of this analogy
is due to the dissimilarity between max and softmax.
Equation 7.1 is essentially the measure used in chamfer matching techniques, such as the
method described by Borgefors Borgefors, 1988. Huttenlocher Huttenlocher et al., 1991
has used a related measure in feature matching applications, the Hausdor distance, which
uses maximum instead of the sum that appears in Equation 7.1. The similarity between
geometrical matching and entropy becomes even stronger if one uses the softmax operation
to weight the closest element rather than simply selecting the closest, as Wells has Wells III,
1992b; Wells III, 1992a.
We reiterate that in vision applications, these methods have typically been used to measure
aggregate geometrical distance, while here we are measuring aggregate distances among signal
values typically intensities, brightnesses, or surface properties. 150 Appendix A
Appendix
A.1 Gradient Descent
In a number problems described in this thesis one must nd a set of parameters that extremizes an evaluation function. Examples include: 1 nding the parameters of density so that
the likelihood of sample is maximized; 2 nding the pose parameters that align a model
and an image best; and 3 nding the weights of a neural network so that it approximates
a function best. In each case there is a function of a parameters set F p, whose value is
to be either maximized or minimized. The parameters are continuous variables, and we are
therefore faced with an in nite number of possible solutions. The gradient descent procedure
is an e ective though greedy technique for searching such a space.
There are many closely related gradient descent algorithms. Here we will describe the
simplest: steepest descent or hill climbing. Starting from an initial guess for the parameters,
steepest descent is an iterative procedure that uses the partial derivatives of a function to
construct an improved estimate for its parameters. Each parameter is updated by p p + @F p :
@p
The update rate which is also known as the learning rate must be chosen carefully. When
151 Paul A. Viola APPENDIX A. APPENDIX is su ciently small one can use a Taylor expansion of F to prove that
F p + @F p F p :
@p
When is too small p might take arbitrarily long to approach a maximum. If is chosen
correctly p will converge toward the maximum relatively rapidly.
There are many gradient based techniques that attempt to speed the rate of convergence of
p. Second order techniques such as LevenbergMarquart and NewtonRaphson use the second
derivatives of F p to reestimate . Conjugate gradient techniques attempt to nd better
directions than the gradient of F . In every case one must be careful that the theoretical
advantages of the algorithm are not outweighed by the costs of computing it. Researchers
in neural networks have found that for many problems it is di cult to realize any actual
improvement in convergence speed. The problems for which steepest descent works as well
as more complex techniques include functions where there are a large number of parameters
this makes computing the second derivatives quite expensive. 152 Bibliography
Anderson, J. and Rosenfeld, E., editors 1988. Neurocomputing: Foundations of Research.
MIT Press, Cambridge.
Baclawski, K., Rota, G.C., and Billey, S. 1990. Introduction to the theory of probability.
MIT course notes for 18.313.
Becker, S. and Hinton, G. E. 1992. Learning to make coherent predictions in domains
with discontinuities. In Moody, J. E., Hanson, S. J., and Lippmann, R. P., editors,
Advances in Neural Information Processing, volume 4, Denver 1991. Morgan Kaufmann,
San Mateo.
Bell, A. J. and Sejnowski, T. J. 1995. An informationmaximisation approach to blind
separation. In Advances in Neural Information Processing, volume 7, Denver 1994.
Morgan Kaufmann, San Francisco.
Besl, P. and Jain, R. 1985. ThreeDimensional Object Recognition. Computing Surveys,
17:75 145.
Bezdek, J., Hall, L., and Clarke, L. 1993. Review of MR Image Segmentation Techniques
using Pattern Recognition. Medical Physics, 204:1033 1048.
Bienenstock, E., Cooper, L., and Munro, P. 1982. Theory for the development of neuron
selectivity: Orientation speci city and binocular interaction in visual cortex. Journal of
Neuroscience, 2. Reprinted in Anderson and Rosenfeld, 1988.
Borgefors, G. 1988. Hierarchical Chamfer Matching: A Parametric Edge Matching Algorithm. IEEE Transactions PAMI, 106:849 865.
Bridle, J. S. 1989. Training stochastic model recognition algorithms as networks can lead
to maximum mutual information estimation of parameters. In Touretzky, D. S., editor,
Advances in Neural Information Processing 2, pages 211 217. Morgan Kaufman.
Canny, J. 1986. A Computational Approach to Edge Detection. IEEE Transactions PAMI,
PAMI86:679 698.
153 Paul A. Viola BIBLIOGRAPHY Chin, R. and Dyer, C. 1986. ModelBased Recognition in Robot Vision. Computing
Surveys, 18:67 108.
Cover, T. M. and Thomas, J. A. 1991. Elements of Information Theory. John Wiley and
Sons.
Dempster, A., Laird, N., and Rubin, D. 1977. Maximum Likelihood from Incomplete Data
via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39:1 38.
Duda, R. and Hart, P. 1973. Pattern Classi cation and Scene Analysis. John Wiley and
Sons.
Fukunaga, K. 1990. Introduction to Statistical Pattern Recognition. Harcourt Brace Jovanovich.
Grimson, W., LozanoP
rez, T., Wells, W., et al. 1994. An Automatic Registration Method
e
for Frameless Stereotaxy, Image Guided Surgery, and Enhanced Realigy Visualization.
In Proceedings of the Computer Society Conference on Computer Vision and Pattern
Recognition, Seattle, WA. IEEE.
Haykin, S. 1994. Neural Networks: A comprehensive foundation. Macmillan College Publishing.
Hill, D. L., Studholme, C., and Hawkes, D. J. 1994. Voxel Similarity Measures for Automated Image Registration. In Proceedings of the Third Conference on Visualization in
Biomedical Computing, pages 205 216. SPIE.
Horn, B. 1986. Robot Vision. McGrawHill, New York.
Huttenlocher, D., Kedem, K., Sharir, K., and Sharir, M. 1991. The Upper Envelope of
Voronoi Surfaces and its Applications. In Proceedings of the Seventh ACM Symposium
on Computational Geometry, pages 194 293.
Intrator, N. and Cooper, L. N. 1992. Objective function formulation of the bcm theory of
visual cortical plasticity: Statistical connections, stability conditions. Neural Networks,
5:3 17.
Jacobs, R., Jordan, M., Nowlan, S., and Hinton, G. 1991. Adaptive mixtures of local
experts. Neural Computation, 31:79 87.
Jones, M. and Poggio, T. 1995. Modelbased matching of line drawings by linear combinations of prototypes. Proceedings of the International Conference on Computer Vision.
Kirkpatrick, S., Gelatt, C., and Vecchi, M. 1983. Optimization by Simulated Annealing.
Science, 2204598:671 680.
154 BIBLIOGRAPHY AITR 1548 Leclerc, Y. 1988. Constructing simple stable descriptions for image partitioning. Proceedings
Image Understanding Workshop, 1:365 382.
Linsker, R. 1986. From basic network principles to neural architecture. Proceedings of the
National Academy of Sciences, USA, 83:7508 7512, 8390 8394, 8779 8783.
Linsker, R. 1988. Selforganization in a perceptual network. IEEE Computer, pages 105
117.
Ljung, L. and Soderstrom, T. 1983. Theory and Practice of Recursive Identi cation. MIT
Press.
Lowe, D. 1985. Perceptual Organization and Visual Recognition. Kluwer Academic Publishers.
M. Tincher, C. R. Meyer, R. G. and Williams, D. M. 1993. Polynomial modeling and
reduction of rf body coil spatial inhomogeneity in mri. IEEE Trans. Med. Imaging,
122:361 365.
Mellor, J. 1995. Realtime camera calibration for enhanced reality visualization. In Computer Vision, Virtual Reality and Robotics in Medicine, pages 471 475. Nice, France.
Papoulis, A. 1991. Probability, Random Variables, and Stochastic Processes. McGrawHill,
Inc., third edition.
Rissanen, J. 1978. Modeling by shortest data description. Automatica, 14:465 471.
Robbins, H. and Munroe, S. 1951. A stochastic approximation method. Annals of Mathematical Statistics, 22:400 407.
Schraudolph, N. N. and Sejnowski, T. J. 1993. Unsupervised discrimination of clustered
data via optimization of binary information gain. In Hanson, S. J., Cowan, J. D., and
Giles, C. L., editors, Advances in Neural Information Processing, volume 5, pages 499
506, Denver 1992. Morgan Kaufmann, San Mateo.
Shannon, C. E. 1948. A mathematical theory of communication. Bell Systems Technical
Journal, 27:379 423 and 623 656.
Shashua, A. 1992. Geometry and Photometry in 3D Visual Recognition. PhD thesis, M.I.T
Arti cial Intelligence Laboratory, AITR1401.
Turk, M. and Pentland, A. 1991. Face Recognition using Eigenfaces. In Proceedings of
the Computer Society Conference on Computer Vision and Pattern Recognition, pages
586 591, Lahaina, Maui, Hawaii. IEEE.
155 Paul A. Viola BIBLIOGRAPHY Wells III, W. 1992a. Posterior Marginal Pose Estimation. In Proceedings: Image Understanding Workshop, pages 745 751. Morgan Kaufmann.
Wells III, W. 1992b. Statistical Object Recognition. PhD thesis, MIT Department Electrical
Engineering and Computer Science, Cambridge, Mass. MIT AI Laboratory TR 1398.
Wells III, W., Grimson, W., Kikinis, R., and Jolesz, F. 1994. Statistical Gain Correction
and Segmentation of MRI Data. In Proceedings of the Computer Society Conference on
Computer Vision and Pattern Recognition, Seattle, Wash. IEEE , Submitted.
Wells III, W. M. and Viola, P. A. 1995. Multimodal volume registration by maximization
of mutual information. In preparation.
Widrow, B. and Ho , M. 1960. Adaptive switching circuits. In 1960 IRE WESCON
Convention Record, volume 4, pages 96 104. IRE, New York.
William H. Press, Brian P. Flannery, S. A. T. and Veterling, W. T. 1992. Numerical
Recipes in C: The Art of Scienti c Computing. Cambridge University Press, Cambridge,
England, second edition edition. 156 ...
View
Full
Document
This note was uploaded on 02/10/2010 for the course TBE 2300 taught by Professor Cudeback during the Spring '10 term at Webber.
 Spring '10
 Cudeback
 The Land

Click to edit the document details