Tracking and Modeling Non-Rigid Objects with Rank Constraints

Lorenzo Torresani, Danny B. Yang, Christoph Bregler
{ltorresa, dbyang, bregler}@cs.stanford.edu
Computer Science Department
Stanford University, Stanford, CA 94305

Eugene J. Alexander
[email protected]
Mechanical Engineering Department
Stanford University, Stanford, CA 94305

0-7695-1272-0/01 $10.00 © 2001 IEEE

Abstract

This paper presents a novel solution for flow-based
tracking and 3D reconstruction of deforming objects in monocular image sequences. A non-rigid 3D object undergoing rotation and deformation can be effectively approximated using a linear combination of 3D basis shapes. This puts a bound on the rank of the tracking matrix. The rank constraint is used to achieve robust and precise low-level optical flow estimation without prior knowledge of the 3D shape of the object. The bound on the rank is also exploited to handle occlusion at the tracking level, leading to the possibility of recovering the complete trajectories of occluded/disoccluded points. Following the same low-rank principle, the resulting flow matrix can be factored to get the 3D pose, configuration coefficients, and 3D basis shapes. The flow matrix is factored in an iterative manner, looping between solving for pose, configuration, and basis shapes. The flow-based tracking is applied to several video sequences and provides the input to the 3D non-rigid reconstruction task. Additional results on synthetic data and comparisons to ground truth complete the experiments.

1. Introduction

This paper addresses the problem of 3D tracking and
model acquisition of non-rigid motion in video sequences. We are specifically concerned with human motion, which is a challenging domain. Standard low-level tracking schemes usually fail due to local ambiguities and noise. Most recent approaches overcome this problem with the use of a model. In those techniques, optical flow vectors or the motion of feature locations can be constrained by a low degree-of-freedom parametric model. For instance, to track joint angles of human limb segments, an approximate kinematic chain model can be used. The models lose many details that cannot be recovered by simple cylinder or sphere shape models and fixed-axis rotations. Non-rigid torso motions, deforming shoe motions, or subtle facial skin motions are problem areas. Alternatively, such non-rigid motions can
be captured with basis-shape models that are learned from example data. Most of the previous work is based on PCA techniques applied to 2D or 3D training data. For example, human face deformations have been tracked in 2D and 3D with such models. For 3D domains, prior models are acquired using stereo cameras or cyber-scan hardware. Carefully labeled data have to be provided to derive the PCA-based models.

We are interested in cases where no such 3D models are available, or where existing models are too restricted and would not be able to recover all subtleties. The input to our technique is a single-view video recording of an arbitrary deforming object, and the output is the 3D motion and a 3D shape model parameterized by its modes of non-rigid deformation. We face three very challenging problems:

1. Without a model, how can we reliably track ambiguous and noisy local features in this domain?

2. Without point feature tracks or robust optical flow, how can we derive a model?

3. Given reliable 2D tracks, how can we recover 3D non-rigid motion and shape structure?

We have previously demonstrated that single-view 2D
point tracks are enough to recover 3D non-rigid motion and structure by exploiting low-rank constraints [7]. Based on the same assumption, we show in this paper that it is also possible to constrain the low-level flow estimation and to handle occlusion without any model assumption. Irani [14] has demonstrated that model-free low-rank constraints can be applied to overcome local ambiguities in flow estimation for rigid scenes. We show that this can be extended to 3D non-rigid tracking and model acquisition. Our new techniques do not need 2D point tracks, can deal with ambiguous and noisy local features, and can handle occlusion. By exploiting the low-rank constraints in low-level tracking and in 3D non-rigid model acquisition, we are able to solve all three challenges mentioned above in one unified manner. We demonstrate the technique on tracking several video sequences and on deriving 3D deformable models from those measurements.

2 Previous Work

Many non-rigid tracking solutions have been proposed
previously. As mentioned earlier, most techniques use an a-priori model. Examples are [16, 5, 9, 19, 3, 4]. Most of these techniques model 2D non-rigid motion, but some of these approaches also recover 3D pose and deformations based on a 3D model. The 3D model is obtained from 3D scanning devices [6], stereo cameras [10], or multi-view reconstruction [18, 11]. The multi-view reconstruction is based on the assumption that for a specific deformed configuration all views are sampled at the same time. This is equivalent to the structure-from-motion problem, which assumes rigidity between the different views [22]. Extensions have been proposed, such as the multi-body factorization method of Costeira and Kanade [8], which relaxes the rigidity constraint. In this method, K independently moving objects are allowed, which results in a tracking matrix of rank 3K and a permutation algorithm that identifies the submatrix corresponding to each object. More recently, Bascle and Blake [1] proposed a method for factoring facial expressions and pose during tracking. Although it exploits the bilinearity of 3D pose and non-rigid object configuration, it again requires a set of basis images selected before factorization is performed. The discovery of these basis images is not part of their algorithm.

In addition, most techniques treat low-level tracking and
3D structural constraints independently. In the following section we describe how we can track and reconstruct non-rigid motions from single views without prior models.

3 Technical Approach

The central theme in this paper is the exploitation of rank bounds for recovering 3D non-rigid motion. We first describe in general why and under what circumstances 3D non-rigid motion puts rank bounds on 2D image motion (section 3.1). We then detail how these bounds can be used to constrain low-level tracking in a model-free fashion (section 3.2). We then describe how this technique can also be used for prediction of occluded features (section 3.3), and we then introduce three techniques that are able to reconstruct 3D deformable shapes and their motion from those 2D measurements (sections 3.4.1, 3.4.2, and 3.4.3).

3.1 Low-rank constraints for non-rigid motion

Given a sequence of F video frames, the optical flow of
P pixels can be coded into two F × P matrices, U and V. Each row of U holds all x-displacements of all P locations for a specific time frame, and each row of V holds all y-displacements for a specific time frame. It has been shown that if U and V describe a 3D rigid motion, the rank of [U; V] has an upper bound, which depends on the assumed camera model (for example, the rank is r ≤ 4 for an orthographic camera model, while r ≤ 8 for a perspective camera model) [22, 14]. This rank constraint derives from the fact that [U; V] can be factored into two matrices, Q × S: Q describes the relative pose between camera and object for each time frame, and S describes the 3D structure of the scene, which is invariant to camera and object motion.

Previously we have shown that non-rigid object motion
can also be factored into two matrices [7], but of a rank r that is higher than the bounds for the rigid case. Assuming the 3D non-rigid motion can be approximated by a set of K modes of variation, the 3D shape of a specific object configuration can be expressed as a linear combination of K basis shapes (S_1, S_2, ..., S_K). Each basis shape S_i is a 3 × P matrix describing P points. The shape of a specific configuration is a linear combination of this basis set:

    S = \sum_{i=1}^{K} l_i \cdot S_i    (1)

Assuming weak-perspective projection, at a specific time
frame t, the P points of a configuration S are projected onto 2D image points (u_{t,i}, v_{t,i}):

    \begin{bmatrix} u_{t,1} & \cdots & u_{t,P} \\ v_{t,1} & \cdots & v_{t,P} \end{bmatrix}
    = R_t \cdot \Big( \sum_{i=1}^{K} l_{t,i} S_i \Big) + T_t    (2)

    R_t = \begin{bmatrix} r_1 & r_2 & r_3 \\ r_4 & r_5 & r_6 \end{bmatrix}    (3)

where R_t contains the first two rows of the full 3D camera rotation matrix, and T_t is the camera translation. The weak-perspective scaling (f / Z_{avg}) of the projection is implicitly coded in l_{t,1}, ..., l_{t,K}. As in [22], we can eliminate T_t by subtracting the mean of all 2D points, and henceforth can assume that S is centered at the origin.

Weak-perspective projection is in practice a good approximation if the perspective effects between the closest and furthest points on the object surface are small. Extending this framework to full-perspective projection is straightforward using an iterative extension. All experiments reported here assume weak-perspective projection.

We can rewrite the linear combination in (2) as a matrix multiplication:

    \begin{bmatrix} u_{t,1} & \cdots & u_{t,P} \\ v_{t,1} & \cdots & v_{t,P} \end{bmatrix}
    = \begin{bmatrix} l_{t,1} R_t & l_{t,2} R_t & \cdots & l_{t,K} R_t \end{bmatrix}
      \cdot \begin{bmatrix} S_1 \\ S_2 \\ \vdots \\ S_K \end{bmatrix}    (4)

We stack all point tracks from time frames 1 to F into one large 2F × P measurement matrix W.
Using (4) we can write:

    W = \begin{bmatrix} l_{1,1} R_1 & \cdots & l_{1,K} R_1 \\ l_{2,1} R_2 & \cdots & l_{2,K} R_2 \\ \vdots & & \vdots \\ l_{F,1} R_F & \cdots & l_{F,K} R_F \end{bmatrix}
        \cdot \begin{bmatrix} S_1 \\ S_2 \\ \vdots \\ S_K \end{bmatrix}
      = Q \cdot B    (5)

Since Q is a 2F × 3K matrix and B is a 3K × P matrix, in the noise-free case W has a rank r ≤ 3K.
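The bound can be checked numerically. The sketch below is ours (synthetic random data, not from the paper); it builds Q and B exactly as in (5) and verifies the rank of the resulting W:

```python
import numpy as np

rng = np.random.default_rng(0)
F, P, K = 30, 40, 3  # frames, points, basis shapes (illustrative sizes)

def random_pose():
    # First two rows of a random full 3D rotation (weak perspective).
    Q_full, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return Q_full[:2, :]

R = [random_pose() for _ in range(F)]
l = rng.standard_normal((F, K))        # configuration weights l_{t,k}
S = rng.standard_normal((K, 3, P))     # K basis shapes, each 3 x P

# Build Q (2F x 3K) and B (3K x P) exactly as in equation (5).
Q = np.vstack([np.hstack([l[t, k] * R[t] for k in range(K)]) for t in range(F)])
B = np.vstack([S[k] for k in range(K)])
W = Q @ B                              # 2F x P tracking matrix

assert np.linalg.matrix_rank(W) <= 3 * K
```

With generic random data the rank is exactly 3K; noise in real tracks makes W only approximately low-rank.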
In the following sections we describe how this rank bound on W can be exploited for 1) constrained low-level tracking, 2) recovery of occluded feature locations, and 3) 3D reconstruction of pose, non-rigid deformations, and key shapes.

3.2 Basis Flow

The previous analysis tells us why W is rank bounded
and how W can be factored. In this section we discuss how to derive the optical flow matrix W from an image sequence and how the rank bound can be used to disambiguate the local flow.

Features can usually be tracked reliably with local methods, such as Lucas-Kanade [17] and extensions [21, 2], if they contain a distinctive high-contrast pattern with 2D texture, such as corner features. For traditional rigid-shape reconstruction, only a few feature locations are necessary. Non-rigid objects go through much more severe motion variations, hence many more features need to be tracked. In the extreme case it might be desirable to track every pixel location. Unfortunately, many objects that we are interested in, including the human body, do not have many of those very reliable features.

Our solution to the tracking dilemma builds on a technique introduced in [14] that exploits rank constraints for optical flow estimation in the case of rigid motion. Since W is assumed to have rank r, all P columns of W
can be modeled as a linear combination of r "basis tracks", Q. The basis is not uniquely defined, but if there are more than r points whose trajectories over the F frames can be reliably estimated, then we can compute with SVD the first r eigenvectors Q̂ of the reduced tracking matrix W_reliable. Q̂ (a 2F × r matrix) is an initial estimate of the basis for all P tracks. Our next task is to estimate all P tracks (the entire W) using this eigenbasis Q̂ and additional local image constraints.
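A minimal sketch of this step (variable names and synthetic data are ours): the eigenbasis is simply the first r left singular vectors of the reliable sub-matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
F, r = 30, 9                 # frames and assumed rank (r <= 3K)
P_reliable = 25              # number of reliably tracked points

# W_reliable: 2F x P_reliable tracks, built with rank r so the example
# mirrors the noise-free model of equation (5).
W_reliable = rng.standard_normal((2 * F, r)) @ rng.standard_normal((r, P_reliable))

# The first r left singular vectors span the column space of W_reliable.
U, s, Vt = np.linalg.svd(W_reliable, full_matrices=False)
Q_hat = U[:, :r]             # 2F x r eigenbasis for all P tracks

# Any reliable track is (numerically) a linear combination of the basis.
coeffs, *_ = np.linalg.lstsq(Q_hat, W_reliable, rcond=None)
assert np.allclose(Q_hat @ coeffs, W_reliable)
```

With noisy real tracks, the trailing singular values are small but nonzero, and truncating at r acts as a denoising step.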
As in the original Lucas-Kanade tracking, we assume that a small image patch centered at a track-point location will not change its appearance drastically between two consecutive frames. Therefore the local patch flow [u, v] can be computed by solving the following well-known equation [17]:

    [\, u, v \,] \cdot \begin{bmatrix} c & d \\ d & e \end{bmatrix} = [\, g, h \,]    (6)

where \begin{bmatrix} c & d \\ d & e \end{bmatrix} = \begin{bmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{bmatrix} is the second-moment matrix of the local image patch in the first frame, g = \sum I_x I_t, and h = \sum I_y I_t (for further details see [17, 21, 2]).
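For a single patch, (6) is just a 2 × 2 linear solve. A sketch with synthetic gradients (variable names ours):

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic spatial/temporal gradients over one flattened 7x7 image patch.
Ix = rng.standard_normal(49)
Iy = rng.standard_normal(49)
It = rng.standard_normal(49)

# Second-moment matrix entries and the right-hand side of equation (6).
c, d, e = (Ix * Ix).sum(), (Ix * Iy).sum(), (Iy * Iy).sum()
g, h = (Ix * It).sum(), (Iy * It).sum()

# [u, v] . [[c, d], [d, e]] = [g, h]; M is symmetric, so solving
# M [u; v] = [g; h] gives the same flow vector.
M = np.array([[c, d], [d, e]])
u, v = np.linalg.solve(M, np.array([g, h]))

assert np.allclose(np.array([u, v]) @ M, np.array([g, h]))
```

A patch with genuine 2D texture makes M well-conditioned; for a pure 1D edge M becomes singular and only the normal flow is recoverable, which is exactly the ambiguity the rank constraint resolves below.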
If all F × P flow vectors across the entire image sequence are coded relative to one single image template, the following equation system can be written [14]:

    [\, U \,|\, V \,] \cdot \begin{bmatrix} C & D \\ D & E \end{bmatrix} = [\, G \,|\, H \,]    (7)

where C, D, E are diagonal P × P matrices that contain the corresponding c, d, and e values for each of the P local image patches. Accordingly, G and H are F × P matrices that contain the g and h values for all P local patches across all F time frames. This system of equations is a rewriting of the Lucas-Kanade linearization for every flow vector, with no additional constraints yet applied. The number of free variables is equal to the number of constraints. If a local patch has no 2D texture, the single equation describing its motion in the system will only provide an accurate estimate of its normal flow (aperture problem).
Now we split Q into Q_u, which contains all even rows of Q, and Q_v, which contains all odd rows of Q. Since Q is a basis for W, there must exist some r × P matrix \tilde{B} for which the following equations hold:

    Q_u \tilde{B} = U, \qquad Q_v \tilde{B} = V    (8)

Using (7) we can write [14]:

    [\, Q_u \tilde{B} \,|\, Q_v \tilde{B} \,] \cdot \begin{bmatrix} C & D \\ D & E \end{bmatrix} = [\, G \,|\, H \,]    (9)

This is a system with r × P unknowns (the entries of \tilde{B}) and 2F × P equations. For long tracks (2F ≫ r) the system is very overconstrained, in contrast to (7). We can exploit
this redundancy to derive the optical ﬂow for points difﬁcult
to track and for features along lD edges. Since [GIH] is computed based on the LucasKanade lin—
earization, the resulting ﬂow [U IV] = [Qu  1?le  3] will
only be a ﬁrst approximation. We rewarp all images of the
sequence using the new ﬂow and then iterate equation (9). 3.3 Dealing with Occlusion By reordering the elements of B into a r  Pdimensional
vector b. equation (9) can be rewritten in the form: LZPerP_8rP><l :mZPNXl where now each row describes one point in one particular
frame. If we have occlusion, or the tracker used for initial—
ization has lost some points at certain time frames, then the
corresponding entries in the m vector will not be measur
able. We eliminate those rows from the L matrix and the m
vector. If the number of missing points is not overly large,
we are still left with an overconstrained system that can give
us an accurate solution for I}. As long as the disappearing
features are visible in enough frames, the product Q  3 pro vides also a good prediction of the displacements for the
missing points 3.4 3D Reconstruction As mentioned earlier, the factorization of W into Q and B
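The row-elimination step can be sketched generically (synthetic L and m; a random visibility mask stands in for the tracker's occlusion reports):

```python
import numpy as np

rng = np.random.default_rng(3)
n_eq, n_unk = 200, 12          # stands in for 2FP equations, rP unknowns

L = rng.standard_normal((n_eq, n_unk))
b_true = rng.standard_normal(n_unk)
m = L @ b_true                 # measurement vector

# Mark roughly 30% of the measurements as occluded / lost by the tracker.
visible = rng.uniform(size=n_eq) > 0.3

# Eliminate the unmeasurable rows and solve the remaining system.
b_hat, *_ = np.linalg.lstsq(L[visible], m[visible], rcond=None)

# The full product L @ b_hat predicts the missing measurements too.
m_pred = L @ b_hat
assert np.allclose(b_hat, b_true)
assert np.allclose(m_pred[~visible], m[~visible])
```

The recovery works as long as the visible rows still overdetermine the unknowns; with too many occlusions the reduced system becomes rank deficient.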
3.4 3D Reconstruction

As mentioned earlier, the factorization of W into Q and B is not unique. Any invertible r × r matrix A applied to Q and B in the following way leads to an alternative factorization:

    Q_a = Q A, \qquad B_a = A^{-1} B    (11)

Q_a and B_a multiplied together approximate W with the same sum-of-squares error as Q and B. Using SVD, we compute a Q̂ (with orthonormal columns) and a B̂. In general Q̂ will not comply with the structure we described in (5):

    \bar{Q} = \begin{bmatrix} \bar{Q}_1 \\ \vdots \\ \bar{Q}_F \end{bmatrix},
    \quad \text{with} \quad
    \bar{Q}_t = [\, l_{t,1} R_t \,|\, \cdots \,|\, l_{t,K} R_t \,]    (12)

For the general case, transforming Q̂ into a Q̄ that complies with those constraints cannot be done with a linear least-squares technique. For the specific case of rigid scenes, each sub-block is equal to the first 2 rows of a rotation matrix (Q̄_t = R_t). Tomasi-Kanade [22] suggested a linear approximation scheme to find an A that enforces the sub-blocks of Q to comply with rotation matrices.

3.4.1 Sub-block factorization

For the non-rigid case, we previously proposed a second
factorization step on each sub-block that transforms every Q̂_t into a Q̄_t that complies with the constraints in (5) [7]. Q̄_t can be rewritten as:

    \bar{Q}_t = [\, l_{t,1} R_t \,|\, \cdots \,|\, l_{t,K} R_t \,]
            = \begin{bmatrix}
                l_{t,1} r_1 & l_{t,1} r_2 & l_{t,1} r_3 & \cdots & l_{t,K} r_1 & l_{t,K} r_2 & l_{t,K} r_3 \\
                l_{t,1} r_4 & l_{t,1} r_5 & l_{t,1} r_6 & \cdots & l_{t,K} r_4 & l_{t,K} r_5 & l_{t,K} r_6
              \end{bmatrix}

We reorder the elements of Q̄_t into a new K × 6 matrix Q̃_t:

    \tilde{Q}_t = \begin{bmatrix}
                    l_{t,1} r_1 & l_{t,1} r_2 & l_{t,1} r_3 & l_{t,1} r_4 & l_{t,1} r_5 & l_{t,1} r_6 \\
                    l_{t,2} r_1 & l_{t,2} r_2 & l_{t,2} r_3 & l_{t,2} r_4 & l_{t,2} r_5 & l_{t,2} r_6 \\
                    \vdots      &             &             &             &             & \vdots      \\
                    l_{t,K} r_1 & l_{t,K} r_2 & l_{t,K} r_3 & l_{t,K} r_4 & l_{t,K} r_5 & l_{t,K} r_6
                  \end{bmatrix}

which shows that Q̃_t is of rank 1 and can be factored into the pose R_t and configuration weights l_t by SVD.
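This rank-1 factorization is one SVD per sub-block; note that the weights and the rotation entries are only recovered up to a common scale and sign. A sketch (names ours):

```python
import numpy as np

rng = np.random.default_rng(4)
K = 4

# Ground-truth weights l_t (length K) and the 6 entries r1..r6 of the
# two rotation rows, flattened.
l_true = rng.standard_normal(K)
r_true = rng.standard_normal(6)

Qt_tilde = np.outer(l_true, r_true)   # the reordered K x 6 sub-block, rank 1

U, s, Vt = np.linalg.svd(Qt_tilde)
assert np.isclose(s[1:], 0).all()     # rank 1: higher singular values vanish

l_est = U[:, 0] * s[0]                # configuration weights (up to scale/sign)
r_est = Vt[0]                         # flattened 2x3 pose rows of R_t

# The factors reproduce the sub-block exactly; alpha * l and r / alpha
# give the same product, which is the remaining ambiguity.
assert np.allclose(np.outer(l_est, r_est), Qt_tilde)
```

With noisy sub-blocks the higher singular values do not vanish, which is exactly the failure mode motivating the iterative alternative in section 3.4.2.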
After the second factorization step is applied to each of the individual sub-blocks Q̃_t, a non-linear optimization over the entire time sequence is performed to find one invertible matrix A that orthonormalizes all of the sub-blocks. The result is that each sub-block is a scaled rotation matrix. In the presence of noise and ambiguities, the second and higher singular values of many sub-blocks do not vanish. In those cases this results in bad rank-1 approximations and bad estimates for R_t. We therefore propose a second alternative in the next section that overcomes this limitation.

3.4.2 Iterative Optimization

Instead of local factorizations on the sub-blocks, we propose a new iterative technique that solves (5) directly.

Many non-rigid objects have a dominant rigid component, and we take advantage of this to get an initial estimate for all pose matrices (R_1, ..., R_F). Given an initial guess of
the pose at each time frame, we can solve for the configuration weights and the basis shapes.

To initialize the pose, we factor W into a 2F × 3 rigid pose matrix Q_rig and a 3 × P matrix B_rig (as originally done by Tomasi-Kanade). As usual, we transform Q_rig into a matrix Q̄_rig whose sub-blocks are all weak-perspective rotation matrices (as outlined in section 3.4.1).

Using Q̄_rig as an initial guess for the pose of the non-rigid shape, we solve for the non-rigid l_{t,k} and B terms in (5). We do this iteratively by first initializing l_{t,k} randomly and then iterating between solving for B, then for l_{t,k}, and then refining R_t again¹.

1. Given all R_t and l_{t,k} terms (the Q matrix), equation (5)
can be used to find the linear least-squares fit of B.

2. Given B and all R_t, we can solve for all l_{t,k} with linear least squares.

3. Given B and L, we can rewrite (5) as:

    W_t = R_t \sum_k l_{t,k} S_k    (13)

Solving for all R_t such that they fit this equation and remain rotation matrices can be done by parameterizing R_t with exponential coordinates. A full rotation matrix can be described by 3 variables [ω_x, ω_y, ω_z] as:

    R(\omega) = \exp \begin{pmatrix} 0 & -\omega_z & \omega_y \\ \omega_z & 0 & -\omega_x \\ -\omega_y & \omega_x & 0 \end{pmatrix}    (14)

Assume \bar{\omega} is the estimate of R_t at the previous iteration; we can then linearize (13) around the previous estimate:

    W_t = \begin{bmatrix} 1 & -\omega'_z & \omega'_y \\ \omega'_z & 1 & -\omega'_x \end{bmatrix} R(\bar{\omega}) \sum_k l_{t,k} S_k    (15)

and solve for a new \omega'. We then update R(\omega) := R(\omega') R(\bar{\omega}) and iterate².

We iterate all 3 steps until convergence.

¹ Alternatively, we can use the sub-block factorization described in section 3.4.1 for initialization.
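Equation (14) can be evaluated in closed form with the Rodrigues formula instead of a general matrix exponential. A sketch of the parameterization and of the multiplicative update used in step 3 (helper names ours):

```python
import numpy as np

def skew(w):
    """Cross-product matrix of w = [wx, wy, wz], as inside equation (14)."""
    wx, wy, wz = w
    return np.array([[0.0, -wz,  wy],
                     [ wz, 0.0, -wx],
                     [-wy,  wx, 0.0]])

def R_exp(w):
    """R(w) = exp(skew(w)), computed via the Rodrigues formula."""
    w = np.asarray(w, dtype=float)
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    Wn = skew(w / theta)
    return np.eye(3) + np.sin(theta) * Wn + (1.0 - np.cos(theta)) * (Wn @ Wn)

# Update step: compose a small correction with the previous estimate,
# R <- R(w') R(w_bar); the result remains a valid rotation.
R_bar = R_exp([0.1, -0.2, 0.3])
R_new = R_exp([0.01, 0.0, -0.02]) @ R_bar

assert np.allclose(R_new @ R_new.T, np.eye(3))
assert np.isclose(np.linalg.det(R_new), 1.0)
```

Because the update is multiplicative, R_t can never drift away from the rotation group, unlike an additive update on the nine matrix entries.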
Similar to the technique described in section 3.3, we can easily handle missing entries in W when points are occluded or are lost by the tracker. B and L are overconstrained, so we leave out the missing data points and solve the linear fit as before.

3.4.3 Multi-View Input

Another extension of this factorization technique is the incorporation of multi-view inputs from M cameras.
This enlarges the input matrix W to size 2FM × P:

    W = \begin{bmatrix} W_{1,1} \\ \vdots \\ W_{1,M} \\ \vdots \\ W_{F,1} \\ \vdots \\ W_{F,M} \end{bmatrix},
    \qquad
    W_{t,m} = \begin{bmatrix} u_{t,1} & \cdots & u_{t,P} \\ v_{t,1} & \cdots & v_{t,P} \end{bmatrix}    (16)

As before, we assume that W_{t,m} can be described by a 2 × 3 pose matrix R_{t,m}, by K deformation coefficients l_{t,1}, l_{t,2}, ..., l_{t,K}, and by a 3K × P key-shape matrix B. Assuming the cameras are synchronized, an additional constraint for the multi-view case is that all M views share the same deformation coefficients for a particular time frame t:

    W_t = [\, l_{t,1} R_t \,|\, l_{t,2} R_t \,|\, \cdots \,|\, l_{t,K} R_t \,] \cdot B    (17)

    R_t = \begin{bmatrix} R_{t,1} \\ \vdots \\ R_{t,M} \end{bmatrix}    (18)

Similar to our previous 2-step factorization, we can factor W into Q and B complying with this new structure. Furthermore, we can enforce another constraint if we assume that all M cameras remain fixed relative to each other: the relative rotation between all R_{t,m}'s in the R_t sub-block of Q is constant over time. This is enforced with a non-linear iterative optimization after the 2-step factorization.

² A future extension of this algorithm will deal with an iterative version for true perspective models. However, we would like to point out that for the orthographic case there also exist several closed-form solutions, including Horn's technique [12, 13] and an SVD-based method proposed by Ruderman [20], which we will include in an extended technical report.

3.4.4 Shape Regularization

If there is not enough out-of-plane rotation, the Z values of
B can be ill-conditioned. For instance, a small non-rigid deformation in X and Y can also be explained by a small out-of-image-plane rigid rotation of a shape with...