Object Level Grouping for Video Shots

Josef Sivic, Frederik Schaffalitzky, and Andrew Zisserman
Robotics Research Group, Department of Engineering Science, University of Oxford
http://www.robots.ox.ac.uk/~vgg

Abstract. We describe a method for automatically associating image patches from frames of a movie shot into object-level groups. The method employs both the appearance and motion of the patches. There are two areas of innovation: first, affine invariant regions are used to repair short gaps in individual tracks and also to join sets of tracks across occlusions (where many tracks are lost simultaneously); second, a robust affine factorization method is developed which is able to cope with motion degeneracy. This factorization is used to associate tracks into object-level groups. The outcome is that separate parts of an object that are never visible simultaneously in a single frame are associated together, for example the front and back of a car, or the front and side of a face. In turn this enables object-level matching and recognition throughout a video. We illustrate the method for a number of shots from the feature film Groundhog Day.

1 Introduction

The objective of this work is to automatically extract and group independently moving 3D semi-rigid (that is, rigid or slowly deforming) objects from video shots. The principal reason we are interested in this is that we wish to be able to match such objects throughout a video or feature length film. An object, such as a vehicle, may be seen from one aspect in a particular shot (e.g. the side of the vehicle) and from a different aspect (e.g. the front) in another shot. Our aim is to learn multi-aspect object models from shots which cover several visual aspects, and thereby enable object level matching.

In a video or film shot the object of interest is usually tracked by the camera: think of a car being driven down a road, and the camera panning to follow it, or tracking with it.
The fact that the camera motion follows the object motion has several beneficial effects for us: the background changes systematically, and may often be motion blurred (and so features are not detected there); and the regions of the object are present in the frames of the shot for longer than other regions. Consequently, object level grouping can be achieved by determining the regions that are most common throughout the shot.

In more detail we define object level grouping as determining the set of appearance patches which (a) last for a significant number of frames, and (b) move (semi-rigidly) together throughout the shot. In particular (a) requires that every appearance of a patch is identified and linked, which in turn requires extended tracks for a patch, even associating patches across partial and complete occlusions. Such thoroughness has two...
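As a rough illustration of criterion (a) in the definition of object level grouping, the sketch below keeps only the patches that are visible in a large enough share of the shot's frames. It is not the paper's method: the function name `long_lived_patches`, the `min_fraction` threshold, the patch names and the frame numbers are all hypothetical, and criterion (b), which the abstract says is handled by affine factorization, is not covered here.

```python
from typing import Dict, List


def long_lived_patches(
    tracks: Dict[str, List[int]],
    n_frames: int,
    min_fraction: float = 0.5,  # hypothetical threshold, not from the paper
) -> List[str]:
    """Return ids of patches seen in at least min_fraction of the shot's frames."""
    keep = []
    for patch_id, frames in tracks.items():
        visible = len(set(frames))  # count each frame index only once
        if visible >= min_fraction * n_frames:
            keep.append(patch_id)
    return sorted(keep)


# Toy data: patch id -> frame indices in which the patch was tracked.
tracks = {
    "car_door": [0, 1, 2, 3, 4, 5, 6, 7],
    "car_wheel": [0, 1, 2, 5, 6, 7],  # gap at frames 3-4, e.g. an occlusion
    "background_tree": [0, 1],        # leaves the view as the camera pans
}

print(long_lived_patches(tracks, n_frames=8))  # ['car_door', 'car_wheel']
```

The wheel patch survives the filter only because its two track fragments are already stored under one id. Linking fragments across an occlusion is exactly the thoroughness that criterion (a) demands, and this sketch assumes it has been done beforehand.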
This note was uploaded on 06/13/2011 for the course CAP 6412 taught by Professor Staff during the Spring '08 term at University of Central Florida.