Global motion model based on B-spline wavelets. Application to motion estimation and video indexing

E. Bruno¹, D. Pellerin¹,²
Laboratoire des Images et des Signaux (LIS)
¹ INPG, 46 Av. Félix Viallet, 38031 Grenoble Cedex, France
² ISTG, Université Joseph Fourier, Grenoble, France
e-mail: eric.bruno, [email protected]
Abstract. This paper describes a framework for estimating a global motion model based on B-spline wavelets. The wavelet-based model allows optical flow to be recovered at different resolution levels from image derivatives. By combining estimations from different resolution levels in a coarse to fine scheme, our algorithm is able to recover a large range of velocity magnitudes. The algorithm is evaluated on artificial and real image sequences and provides accurate optical flows. The wavelet coefficients of the model at low resolution levels also provide features to index video or to recognize activities. Preliminary experiments on video indexing based on coarse wavelet coefficients are presented: as an example, we consider video sequences containing different human activities, and show that the wavelet coefficients yield a database partition related to the kind of human activity.
1 Introduction

Motion estimation is a fundamental problem in video analysis. The knowledge of motion is required in many applications, such as video compression, video indexing, activity recognition or 3D reconstruction. Parametric models of optical flow aid estimation by enforcing strong constraints on the spatial variation of motion within a region or over the whole image. Many authors have used affine or quadratic basis sets [2, 8] to model motion. These models are very effective at describing planar transformations, but fail when there are two or more differently moving regions in the image sequence. Such methods generally aim at determining a partition of the moving regions and then fitting a motion model on each region. Other works in activity recognition use motion specificities to generate basis functions. In [3], Black et al. model motion by a linear combination of eigenvectors computed from a training set of flow fields (motion discontinuities and non-rigid mouth motion). These methods are very effective at modeling velocity fields close to the training set, but cannot be used to estimate arbitrary motions.
In our previous work [4], global motion was expressed as a Fourier series expansion. Such a model allows a large range of motions to be approximated with a few harmonic basis functions. Following the same principle, we now use a set of B-spline wavelet functions. A wavelet series expansion is better suited than Fourier analysis to approximate piecewise smooth functions, such as optical flow. Using B-spline wavelets, which are smooth, symmetrical and compactly supported, thus improves motion estimation accuracy. In addition, the model parameters provide a concise description of global motion. Srinivasan and Chellappa have recently proposed to model global motion by a set of overlapped cosine windows [10]; their approach is close to our algorithm.
This paper is organized as follows: Section 2 outlines the problem of estimating a global motion model from the image sequence brightness function and presents the motion model based on B-spline wavelets. Section 3 details the robust estimation of the model parameters, based on a sparse system inversion. This procedure is then embedded in a coarse to fine scheme so as to handle a large range of motion magnitudes. Section 4 presents experimental results on the accuracy of the estimated motion model and a performance comparison with Srinivasan and Chellappa's algorithm. This section also presents some preliminary experiments on video indexing based on motion wavelet coefficients.
2 Global Motion model

2.1 General remarks on motion model estimation

Let us consider an image sequence I(p_i, t), with p_i = (x_i, y_i) ∈ Ω the location of each pixel in the image. The brightness constancy assumption states that the image brightness I(p_i, t+1) is a simple deformation of the image at time t:

    I(p_i, t) = I(p_i + V(p_i, t), t + 1)    (1)

V(p_i, t) = (u, v)^T is the optical flow (also named velocity field or global motion) between the two frames I(p_i, t) and I(p_i, t+1). V(p_i, t) is also defined over Ω. This velocity field can be globally modeled as a linear combination of basis functions:

    V(p_i, t) = Σ_{k=0}^{N} θ_k φ_k(p_i),  with θ_k = (θ_{kx}, θ_{ky})^T    (2)

where φ_k(p_i) are basis functions. The motion parameter vector Θ = [θ_0 ... θ_N]^T is estimated by minimizing an objective function [8]:

    E = Σ_{p_i ∈ Ω} ρ( I(p_i + V(p_i, t), t + 1) − I(p_i, t), σ )
      = Σ_{p_i ∈ Ω} ρ( r(p_i + V), σ )    (3)

where ρ(·, σ) is a robust error norm (or M-estimator) applied to the difference of warped images r(p_i + V) = I(p_i + V(p_i, t), t + 1) − I(p_i, t). A robust estimation is needed to reduce the weight of outlying data in the estimation process. The solution for Θ is then given by:

    Θ̂ = argmin_Θ E(Θ)    (4)

The success of the minimization stage depends on the model's ability to fit the real global motion between I(t) and I(t + 1). Wavelet basis functions are reputed to give the best approximation of piecewise smooth functions, such as optical flow, and are therefore an effective solution to our problem.

2.2 Wavelet series expansion

Any function f(x) ∈ L²(ℝ) can be expanded as a weighted sum of basis functions:

    f(x) = Σ_k c_{l,k} φ_{l,k}(x) + Σ_{j≥l} Σ_k d_{j,k} ψ_{j,k}(x)    (5)

where φ_{j,k}(x) = φ(2^j x − k) and ψ_{j,k}(x) = ψ(2^j x − k) are respectively the scaling and the wavelet functions, dilated and translated by k at a level j. Let V_j, with j ∈ ℤ, be a set of closed subspaces of L²(ℝ). The scaling functions {φ_{j,k}(x)}_k are a basis for V_j, which contains all functions of L²(ℝ) at a resolution level j. The new details at level j are represented by the wavelet set {ψ_{j,k}(x)}_k, which is a basis for W_j. The subspace V_{j+1} is then defined as:

    V_{j+1} = V_j ⊕ W_j,  V_j ⊂ V_{j+1}    (6)

Relation (6) leads to:

    Σ_k c_{j+1,k} φ_{j+1,k}(x) = Σ_k c_{j,k} φ_{j,k}(x) + Σ_k d_{j,k} ψ_{j,k}(x)    (7)

Estimation of the {c_{j+1,k}}_k coefficients therefore also provides the scaling and wavelet coefficients for all lower resolution levels. The approximation of f(x) at level j can then be expressed as a linear combination of scaling functions alone:

    f̃_j(x) = Σ_k c_{j,k} φ_{j,k}(x)    (8)

We want to model the velocity field V(x, y) using two-dimensional scaling basis functions. A natural extension of the one-dimensional scaling basis function to two dimensions is the tensor product:

    Φ_{j,k1,k2}(x, y) = φ(2^j x − k1) φ(2^j y − k2)    (9)

The subscripts j, k1, k2 represent respectively the resolution scale and the horizontal and vertical translations. The global motion can then be approximated at a resolution scale j by:

    V_j(p_i, t) = Σ_{k1,k2=0}^{2^j − 1} θ_{j,k1,k2} Φ_{j,k1,k2}(p_i)    (10)
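To make the model of Eqs. (9) and (10) concrete, here is a minimal sketch (our own illustration, not the authors' code) that builds the tensor-product basis from a degree-1 B-spline (hat) scaling function and synthesizes a flow field from a grid of coefficient vectors θ_{j,k1,k2}. The function names and the normalization of image coordinates to [0, 1] are assumptions:

```python
import numpy as np

def hat(x):
    """Degree-1 B-spline (triangle) scaling function, supported on [0, 2]."""
    return np.maximum(0.0, 1.0 - np.abs(x - 1.0))

def basis_2d(j, k1, k2, X, Y):
    """Tensor-product scaling function Phi_{j,k1,k2}(x, y) of Eq. (9)."""
    return hat(2**j * X - k1) * hat(2**j * Y - k2)

def synthesize_flow(theta, j, shape):
    """Evaluate Eq. (10): V_j(p) = sum_{k1,k2} theta_{j,k1,k2} Phi_{j,k1,k2}(p).
    theta has shape (2**j, 2**j, 2): one (u, v) vector per basis function."""
    h, w = shape
    # Normalized coordinates in [0, 1], so the 2**j shifted functions tile the image.
    Y, X = np.mgrid[0:h, 0:w]
    X, Y = X / (w - 1), Y / (h - 1)
    V = np.zeros((h, w, 2))
    for k1 in range(2**j):
        for k2 in range(2**j):
            phi = basis_2d(j, k1, k2, X, Y)
            V += phi[..., None] * theta[k1, k2]
    return V

# A constant coefficient field reproduces a uniform translation away from
# the image border, where the shifted hat functions sum to one.
theta = np.zeros((4, 4, 2))
theta[..., 0] = 1.0                      # u component only
V = synthesize_flow(theta, j=2, shape=(32, 32))
```

With non-constant coefficients the same loop produces smoothly varying flow fields, which is the piecewise-smooth behavior the wavelet model is designed to capture.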
2.3 B-spline basis functions

So as to recover a smooth and regular optical flow, the scaling functions must be as smooth and symmetrical as possible. The B-spline wavelet has maximum regularity and symmetry, and is therefore a good candidate to model motion. The B-spline scaling function of degree N − 1 is the convolution of N box functions:

    φ⁰_{N−1}(x) = (B ∗ B ∗ ... ∗ B)(x)    (11)

with B = 1_{[−1/2, 1/2]}. The B-spline scaling function at the resolution level j + 1 is defined by the dilation equation:

    φ^{N−1}_{j+1}(x) = 2^{1−N} Σ_{k=0}^{N} C(N, k) φ^{N−1}_j(2x − k)    (12)

where C(N, k) denotes the binomial coefficient.
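The convolution definition (11) can be checked numerically: convolving N sampled box functions yields the degree N − 1 B-spline, supported on an interval of length N. A small sketch (illustrative only, with an assumed sampling step):

```python
import numpy as np

def bspline_scaling(degree, step=0.001):
    """Sample the B-spline scaling function of the given degree by
    convolving degree + 1 box functions (Eq. (11)). The returned grid
    starts at 0, so the support is [0, degree + 1]."""
    box = np.ones(int(round(1.0 / step)))   # indicator of a unit-length interval
    phi = box.copy()
    for _ in range(degree):                 # 'degree' further convolutions
        phi = np.convolve(phi, box) * step  # Riemann-sum normalization
    x = np.arange(phi.size) * step
    return x, phi

x1, phi1 = bspline_scaling(1)   # triangle (hat) function on [0, 2]
x2, phi2 = bspline_scaling(2)   # quadratic B-spline on [0, 3]
```

The samples reproduce the expected properties: unit area at every degree, peak value 1 for the hat function and 3/4 for the quadratic B-spline.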
φ⁰_{N−1}(x) is supported on [0, N]. Since we work on a finite sampled signal, the scaling function φ⁰_{N−1}(x) is mapped onto the signal support. The two-dimensional basis functions are computed at different resolution levels using the product (9). Figure 1 represents the 1D B-spline functions of degree 1, 2 and 3 and the corresponding 2D scaling functions used to model motion.

Figure 1. a) 1-dimensional B-spline scaling functions of degree 1, 2 and 3 and b), c), d) the corresponding 2-dimensional scaling functions.

3 Global Motion estimation

To estimate the global motion V, the robust error E(V) defined in (3) has to be minimized. This step is achieved by using an incremental scheme.

3.1 Incremental robust estimation

The error to minimize for an increment δV is:

    E(V + δV) = Σ_{p_i ∈ Ω} ρ( r(p_i + V + δV), σ )    (13)

Given an estimate V (initially zero), the goal is to estimate the increment δV that minimizes equation (13). The first-order Taylor series expansion of I(t + 1) with respect to δV provides a new error quantity:

    Ẽ(V + δV) = Σ_{p_i} ρ( r(p_i + V) + δV^T ∇I(p_i + V, t + 1), σ )    (14)

where ∇I = [I_x, I_y]^T represents the spatial derivatives of I. The M-estimation problem can be converted into an iteratively reweighted least squares problem [7]:

    Ẽ(V + δV) = Σ_{p_i} w(r(p_i + V)) ( r(p_i + V) + δV^T ∇I(p_i + V, t + 1) )²    (15)

with w(x) = ρ'(x)/x. The ρ-function is the Tukey biweight:

    ρ(x, σ) = (σ²/6) (1 − (1 − (x/σ)²)³)   if |x| ≤ σ
    ρ(x, σ) = σ²/6                          if |x| > σ    (16)

Using the global motion model defined in (10) for V and δV at a resolution level j, the robust error in (15) becomes, in matrix notation:

    Ẽ(V + δV) = (M_j δΘ_j + B)^T W (M_j δΘ_j + B)    (17)

with, for (k1, k2) ∈ [0 ... 2^j − 1]² and ∀ p_i ∈ Ω (N = (2^j)² and M = card(Ω)):

    M_j = [ I_x(p_i + V) Φ_{j,k1,k2}(p_i), I_y(p_i + V) Φ_{j,k1,k2}(p_i) ],  M_j ∈ ℝ^{M×2N}
    δΘ_j = [ δθ_{j,k1,k2} ]^T,  δΘ_j ∈ ℝ^{2N}
    W = diag( w(p_i + V) ),  W ∈ ℝ^{M×M}
    B = [ r(p_i + V) ],  B ∈ ℝ^{M}    (18)

The minimum of (17) with respect to δΘ_j leads to the linear system:

    M_j^T W M_j δΘ_j = −M_j^T W B    (19)

and we obtain the solution:

    δΘ_j = −( M_j^T W M_j )^{−1} M_j^T W B    (20)

The matrix M_j^T W M_j can be large when the resolution level j is high, and computing its inverse with numerical techniques is then expensive in time and memory. Fortunately, because we use localized scaling functions (Fig. 1) as basis functions, system (19) is sparse. Figure 2 represents the zero and nonzero entries of M_j^T W M_j for the B-spline scaling functions of the first three degrees at level j = 4. Note that sparsity depends on the spatial extent of the scaling function used. δΘ_j is then iteratively estimated in (20) using the generalized minimal residual method [9], which is effective and fast for solving such a sparse system.

Figure 2. M_j^T W M_j matrix for B-spline scaling functions of degree a) 1, b) 2 and c) 3 at level j = 4. White and black regions represent zero and non-zero entries, respectively.

The first step of the reweighted least squares consists in obtaining a first estimate δΘ_j⁰ with the weight matrix W equal to the identity. New weights are next evaluated and a new estimate δΘ_j¹ is obtained. This process is repeated until the incremental estimate δΘ_jⁱ is small enough or a predefined number of iterations is reached. The motion parameter vector at level j is then:

    Θ_j = Σ_k δΘ_jᵏ    (21)

3.2 Coarse to fine estimation

At a coarse resolution level, the wavelet coefficients are estimated on large image regions, and can recover large but coarse motion. So as to provide an accurate global motion estimation for both large and fine displacements, the incremental scheme is embedded in a coarse to fine refinement algorithm. A first estimate Θ_l is obtained at the coarsest level l. Then Θ_l is transmitted to a finer level, where a new incremental estimation is done. This is repeated until the finest level L is reached. The final motion parameter vector Θ_L is then estimated, and contains all wavelet coefficients that describe the global motion V_L at resolution level L.

4 Results

The estimated global motion parameter vector not only provides an optical flow estimation, but also describes the sequence activity by means of a few wavelet coefficients. In this section, we present our results on motion estimation accuracy and some results on video indexing based on wavelet coefficients.

Figure 3. Optical flow result on the Yosemite sequence: a) frame from the sequence, b) real flow field, c) estimated flow field. The estimated optical flow is modeled by the B-spline basis functions of degree 1, with a coarse to fine estimation from level 2 to level 4.

4.1 Performances on synthetic and real image sequences

We test our algorithm on both synthetic and real sequences. The angular error [1] is used to evaluate our estimation when the true motion field (u_r, v_r) is known (in the Yosemite sequence (Fig. 3.a), downloaded from Barron et al.'s FTP site¹). This angular error is defined at each location by:

    e = arccos( (u u_r + v v_r + 1) / ( √(u² + v² + 1) √(u_r² + v_r² + 1) ) )    (22)
This provides an error in degrees which accounts for both velocity magnitude and orientation errors. Table 1 compares the angular error average and standard deviation for B-spline scaling functions of degree 1, 2 and 3. The motion models estimated with the different B-splines are close to the real optical flow. The B-splines of degree 2 and 3 perform slightly better than the B-spline of degree 1, but at a larger computational cost. Whatever the degree of the B-spline used, our algorithm outperforms the approach of Srinivasan and Chellappa [10], who use overlapped cosine windows to model global motion. The next example, the Baltrain sequence (Fig. 4.a), is a real sequence where motion is more complex than in Yosemite. Figures 4.b), c) and d) display the progress of the estimation scheme across the resolution levels. The estimated optical flows are close to the perceptual view, whereas they are defined by only a small number of parameters (only 16×16 wavelet coefficient vectors at level 4).

¹ ftp.csd.uwo.ca
Figure 4. Optical flow result on the Baltrain sequence: a) frame from the sequence, b), c), d) estimated flow fields at j = 2, 3 and 4. The estimated optical flow is modeled by the B-spline basis functions of degree 3.
B-spline degree               | Resol. level | Avg error | Std error
1                             | j = 2        | 9.1°      | 9.6°
                              | j = 3        | 6.4°      | 8.5°
                              | j = 4        | 5.6°      | 9.7°
2                             | j = 2        | 9.2°      | 9.6°
                              | j = 3        | 5.0°      | 7.5°
                              | j = 4        | 4.6°      | 9.0°
3                             | j = 2        | 10.0°     | 10.7°
                              | j = 3        | 5.1°      | 8.0°
                              | j = 4        | 4.4°      | 8.7°
Srinivasan and Chellappa [10] |              | 8.9°      | 10.6°

Table 1. Motion estimation errors for the Yosemite sequence, using different B-spline scaling functions. The coarse to fine scheme was performed from resolution level j = 2 to j = 4.
To sum up, a motion model based on B-spline wavelets efficiently approximates the global optical flow with very few parameters. Indeed, with only 16 basis functions (j = 2) it is possible to obtain a coarse motion estimation, and hence information about motion activity along a video sequence.
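The estimation machinery of Section 3.1 — Tukey reweighting (16), the sparse system (19) and its GMRES solution (20) — can be sketched as follows. This is a simplified reconstruction using scipy, assuming a precomputed sparse matrix Phi of basis values; all names are illustrative, not the authors' implementation:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import gmres

def tukey_weight(res, sigma):
    """IRLS weight w(x) = rho'(x)/x derived from the Tukey biweight (Eq. (16)):
    (1 - (x/sigma)^2)^2 inside [-sigma, sigma], zero (outlier rejected) outside."""
    w = (1.0 - (res / sigma) ** 2) ** 2
    w[np.abs(res) > sigma] = 0.0
    return w

def irls_increment(Phi, Ix, Iy, r, sigma, n_irls=5):
    """Estimate the parameter increment delta_Theta_j of Eqs. (15)-(20).
    Phi    : (P, N) sparse matrix of basis values Phi_{j,k1,k2} at the P pixels.
    Ix, Iy : spatial image derivatives at the (warped) pixel positions.
    r      : warped-image difference r(p_i + V)."""
    # M_j = [diag(Ix) Phi, diag(Iy) Phi], shape (P, 2N) -- Eq. (18)
    M = sparse.hstack([sparse.diags(Ix) @ Phi, sparse.diags(Iy) @ Phi]).tocsr()
    delta = np.zeros(M.shape[1])
    for _ in range(n_irls):
        res = r + M @ delta                 # linearized residual at current estimate
        w = tukey_weight(res, sigma)        # reweighting step
        W = sparse.diags(w)
        A = (M.T @ W @ M).tocsr()           # sparse normal matrix of system (19)
        b = -(M.T @ (w * r))
        delta, _ = gmres(A, b, atol=1e-10)  # generalized minimal residual solve
    return delta
```

Each call returns one increment δΘ_j of Eq. (20); in the full algorithm it would be accumulated as in Eq. (21) and propagated across resolution levels in the coarse to fine scheme.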
4.2 Video indexing based on wavelet coefficients

In a motion-based video indexing problem, the first stage is to define relevant motion features. These features have to contain enough information to distinguish activities, yet be coarse enough to remain stable across similar motion activities. We show in this section that a wavelet model estimated at coarse levels can provide such motion features. In the following, sequences are considered as elementary video shots.
4.2.1 Definition of the motion features

The wavelet components extracted from different sequences provide, for each frame i of a sequence S, a motion parameter vector Θ_i = [θ_1 ... θ_N]^T ∈ ℝ^{2N}. The motion-based feature vector associated with a sequence S of M frames is then defined as the center of gravity of all reduced motion parameter vectors of S:

    S = (1/M) Σ_{i=1}^{M} Λ^{−1} Θ_i    (23)

where Λ is a diagonal matrix which contains the standard deviations of the Θ components over the whole sequence. Computing S for K videos provides a motion-based feature space:

    { S_i },  i = 1 ... K    (24)

The dimension of this feature space depends on the number of wavelets used to model global motion. Curvilinear component analysis (CCA) [6] is an algorithm for dimensionality reduction and representation of multidimensional data sets. Applying CCA to the motion feature space gives a revealing low-dimensional representation of it, preparing a basis for further clustering and classification.

4.2.2 Experimental results

We have built a video database representing six human activities (Figure 5.a): up, down, left, right, come and go. These sequences were acquired for activity recognition [5] (acquisition rate: 10 Hz). Each sequence includes ten frames, and there are five sequences for each activity, with different persons, so the video database contains 30 sequences. B-splines of degree 1 at resolution level 2 were used to model coarse motion. Figure 5.b) represents the 2D mapping of the motion feature space. The projection shows that our motion features are relevant to cluster the 6 activities. In addition to the six clusters,
Figure 5. a) Typical video sequences in the database (up, down, left, right, come and go) and b) 2D mapping of the motion feature space obtained by CCA.
the activities are also grouped by their main global motion characteristics. The sequences up, right and come have a global motion towards the right of the scene, whereas the other sequences move towards the left. The same division appears in the projection map.
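The feature vector of Eq. (23) reduces a whole shot to a single point; a sketch (assuming the per-frame parameter vectors are stacked in an (M, 2N) array; the zero-variance guard is our own addition):

```python
import numpy as np

def motion_feature(thetas):
    """Shot-level feature vector S of Eq. (23): mean over the M frames of the
    motion parameter vectors, each component divided by its standard deviation
    over the shot (the diagonal matrix Lambda).
    thetas : (M, 2N) array, one row of wavelet coefficients per frame."""
    std = thetas.std(axis=0)
    std[std == 0.0] = 1.0    # guard: leave constant components unscaled (our addition)
    return (thetas / std).mean(axis=0)
```

Stacking one such vector per shot yields the feature space of Eq. (24), on which a dimensionality reduction such as CCA can then be applied.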
5 Conclusion

We have presented a new motion model based on B-spline wavelets. Using this framework, it is possible to estimate motion accurately in image sequences and also to extract global motion features from video sequences. The method is based on a B-spline wavelet series expansion of the optical flow between two consecutive frames. This kind of expansion has the advantage of modeling various types of motion and provides relevant information about motion structure. It is important to note that the motion wavelet coefficients are estimated directly from the image derivatives and require no prior dense image motion or image segmentation. We have illustrated the relevance of the motion model with regard to motion estimation accuracy, and shown that the wavelet coefficients can be used for video indexing or activity recognition. Future work will focus specifically on video indexing problems: we have to study how to combine wavelet coefficients along video sequences and how to classify these motion features to obtain a relevant video database representation.
References

[1] J. Barron, D. Fleet, and S. Beauchemin. Performance of optical flow techniques. International Journal of Computer Vision, 12(1):43–77, 1994.
[2] M. Black and A. Jepson. Estimating optical flow in segmented images using variable-order parametric models with local deformations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(10):972–986, October 1996.
[3] M. Black, Y. Yacoob, A. Jepson, and D. Fleet. Learning parameterized models of image motion. In Computer Vision and Pattern Recognition, CVPR'97, pages 561–567, June 1997.
[4] E. Bruno and D. Pellerin. Global motion Fourier series expansion for video indexing and retrieval. In Advances in Visual Information Systems, VISUAL, pages 327–337, Lyon, November 2000.
[5] O. Chomat and J. Crowley. Probabilistic recognition of activity using local appearance. In Conference on Computer Vision and Pattern Recognition, CVPR'99, pages 104–109, June 1999.
[6] P. Demartines and J. Herault. Curvilinear component analysis: A self-organizing neural network for nonlinear mapping of data sets. IEEE Transactions on Neural Networks, 8(1):148–154, 1997.
[7] P. Holland and R. Welsch. Robust regression using iteratively reweighted least-squares. Communications in Statistics – Theory and Methods, A6:813–828, 1977.
[8] J.-M. Odobez and P. Bouthemy. Robust multiresolution estimation of parametric motion models. Journal of Visual Communication and Image Representation, 6(4):348–365, December 1995.
[9] W. H. Press, S. Teukolsky, W. Vetterling, and B. Flannery. Numerical Recipes in C, Second Edition. Cambridge University Press, 1992.
[10] S. Srinivasan and R. Chellappa. Noise-resilient estimation of optical flow by use of overlapped basis functions. Journal of the Optical Society of America A, 16(3):493–507, March 1999.