1995_Viola_thesis_registrationMI




... transformation, a model for the imaging process could be used to predict the image that will result. If we had a good imaging model then deciding whether an image contained a particular model at a given pose is straightforward: compute the predicted image and compare it to the actual image directly. Given a perfect imaging model the two images will be identical, or close to it. Of course, finding the correct alignment is still a remaining challenge.

The relationship between an object model (no matter how accurate) and the object's image is a complex one. The appearance of a small patch of a surface is a function of the surface properties, the patch's orientation, the position of the lights and the position of the observer. Given a model $u(x)$ and an image $v(y)$ we can formulate an imaging equation,

$$v(T(x)) = F(u(x), q) \tag{1.1}$$

or equivalently,

$$v(y) = F(u(T^{-1}(y)), q). \tag{1.2}$$

The imaging equation is separable into two distinct components. The first component is called a transformation, or pose, denoted $T$. It relates the coordinate frame of the model to the coordinate frame of the image. The transformation tells us which point in the model is responsible for a particular point in the image. The second component is the imaging function, $F(u(x), q)$. The imaging function determines the value of image point $v(T(x))$. In general a pixel's value may be a function both of the model and other exogenous factors. For example, an image of a three-dimensional object depends not only on the object but also on the lighting. The parameter, $q$, collects all of the exogenous influences into a single vector.

One reason that it is, in principle, possible to define $F$ is that the image does convey information about the model. Clearly if there were no mutual information between $u$ and $v$, there could be no meaningful $F$. We propose to finesse the problem of finding and computing $F$ by dealing with this mutual information directly.
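To make the link between a meaningful $F$ and mutual information concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the estimator used in the thesis: the excerpt does not say how mutual information is estimated, so the sketch uses a plain joint-histogram estimate. The names `mutual_information`, `pearson`, and `demo`, the bin count, and the stand-in imaging function $F(u) = u^2$ are all hypothetical choices made for the example.

```python
import math
import random
from collections import Counter


def mutual_information(u, v, bins=8):
    """Estimate I(u; v) in bits from paired samples using a joint histogram.

    Assumption: a simple histogram estimator chosen for illustration only.
    """
    def quantize(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant signals
        return [min(int((x - lo) / width), bins - 1) for x in values]

    qu, qv = quantize(u), quantize(v)
    n = len(qu)
    joint = Counter(zip(qu, qv))
    pu, pv = Counter(qu), Counter(qv)
    # I(u; v) = sum over cells of p(a, b) * log2(p(a, b) / (p(a) * p(b)))
    return sum(
        (c / n) * math.log2(c * n / (pu[a] * pv[b]))
        for (a, b), c in joint.items()
    )


def pearson(a, b):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def demo(n=5000, seed=0):
    """Compare a correctly paired (u, v) with a scrambled pairing."""
    rng = random.Random(seed)
    u = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    v = [x * x for x in u]      # hypothetical imaging function F(u) = u^2
    v_shuffled = v[:]
    rng.shuffle(v_shuffled)     # destroys the correspondence between u and v
    return (
        mutual_information(u, v),
        mutual_information(u, v_shuffled),
        pearson(u, v),
    )
```

Calling `demo()` returns the mutual information of the correct pairing, the mutual information of the scrambled pairing, and the correlation of the correct pairing. In this toy setup the correct pairing carries substantial mutual information even though its linear correlation is close to zero, while the scrambled pairing carries almost none.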
We will present an algorithm that aligns by maximizing the mutual information between model and image. It requires no a priori model of the relationship between surface properties and scene intensities; it only assumes that the model tells more about the scene when it is correctly aligned.

11 Paul A. Viola CHAPTER 1. INTRODUCTION

1.1.1 An Alignment Example

One of the alignment problems that we will address involves finding the pose of a three-dimensional object that appears in a video image. This problem involves comparing two very different kinds of representations: a three-dimensional model of the shape of the object and a video image of that object. For example, Figure 1.1 contains a video image of an example object on the left and a depth map of that same object on the right (the object in question is a person's head: Ron). A depth map is an image that displays the depth from the camera to every visible point on the object model. A depth map is a complete description of the shape of the object, at least the visible parts.

From the depth map alone it might be difficult to see that the image and the model are aligned. The task can be made much easier, at least for us, if we simulate the imaging process and construct an image from the 3D model. Figure 1.2 contains two computer graphics renderings of the object model. These synthetic images are constructed assuming that the 3D model has a Lambertian surface and that the lighting comes from the right. It is almost immediately obvious that the model on the left is more closely aligned to the true image than the model on the right.

Unfortunately, what we find trivial is very difficult for a computer. The intensities of the true video image and the synthetic images are very different. The true image and the correct model image are in fact uncorrelated. Yet any person can glance at these images and decide that both are images of a head and that both heads are looking in roughly the same direction.
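The synthetic renderings described above can be approximated with a few lines of Python. This is a minimal sketch, not the renderer used for the figures. It assumes an orthographic view, unit pixel spacing in the same units as depth, x increasing with column index, y increasing with row index, depth increasing away from the camera, and a single distant light. The function name `render_lambertian` and its parameters are hypothetical, and the depth map must have at least two rows and two columns.

```python
import math


def render_lambertian(depth, light, albedo=1.0):
    """Shade a depth map with Lambert's law: I = albedo * max(0, n . l).

    depth: list of rows of distances from the camera (at least 2 x 2).
    light: vector pointing from the surface toward the light, e.g.
           (1, 0, -1) for light from the right and in front (assumed axes).
    """
    rows, cols = len(depth), len(depth[0])
    length = math.sqrt(sum(c * c for c in light))
    lx, ly, lz = (c / length for c in light)
    image = []
    for r in range(rows):
        r0, r1 = max(r - 1, 0), min(r + 1, rows - 1)
        row = []
        for c in range(cols):
            c0, c1 = max(c - 1, 0), min(c + 1, cols - 1)
            # Central differences inside the image, one-sided at the border.
            dzdx = (depth[r][c1] - depth[r][c0]) / (c1 - c0)
            dzdy = (depth[r1][c] - depth[r0][c]) / (r1 - r0)
            # Unit normal facing the camera is proportional to (dz/dx, dz/dy, -1).
            scale = math.sqrt(dzdx * dzdx + dzdy * dzdy + 1.0)
            n_dot_l = (dzdx * lx + dzdy * ly - lz) / scale
            row.append(albedo * max(0.0, n_dot_l))
        image.append(row)
    return image
```

For example, a surface that recedes toward the right faces the right-hand side, so it is fully lit by `light=(1, 0, -1)` and completely dark under `light=(-1, 0, -1)`. Changing only the light direction changes every pixel intensity while the shape stays fixed, which is one reason direct intensity comparison between a rendering and a real image is fragile.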
The human visual system is capable of ignoring the superficial differences that arise from changes in illumination and surface properties. It is not easy to build an automated alignment procedure that can make this kind of comparison. It is harder still to construct a system that can find the correct model pose. We have built such a system. That system selected the pose of the model shown at left in Figure 1.2.

As mentioned above, the synthetic images of Ron were generated under the assumption the model surface is Lambertian and the lighting is from the right. Lambert's law is perhaps the simplest model of surface reflectivity. It is an accurate model of the reflectance of a matte

12 1.1. AN INTRODUCTION TO ALIGNMENT AI-TR 1548

Figure 1.1: Two different views of Ron. On the left is a video image. On the right is a depth map of a model of Ron. A depth map describes the distance to each of the visible points of the model. White denotes points that are closer, black further.

Figure 1.2: At left is a computer graphics rendering of a 3D model of Ron. The position of the model is the same as the position of the actual head. At right is a rende...

