Signal Processing: Image Communication 14 (1998) 95–110

Flexible 3D motion estimation and tracking for multiview image sequence coding

Ioannis Kompatsiaris*, Dimitrios Tzovaras, Michael G. Strintzis

Information Processing Laboratory, Electrical and Computer Engineering Department, Aristotle University of Thessaloniki, Thessaloniki 54006, Greece

Abstract

This paper describes a procedure for model-based coding of all channels of a multiview image sequence. The 3D model is initialised by accurate adaptation of a 2D wireframe model to the foreground object of one of the views. The rigid 3D motion is estimated for each triangle, and spatial homogeneity neighbourhood constraints are used to improve the reliability and efficiency of the estimation and to smooth the motion field produced. A novel technique is used to estimate the flexible motion of the nodes of the wireframe from the rigid 3D motion vectors of the wireframe triangles containing each node. Kalman filtering is used to track both the rigid 3D motion of each triangle and the flexible deformation of each node of the wireframe. The performance of the resulting 3D flexible motion estimation method is evaluated experimentally. © 1998 Elsevier Science B.V. All rights reserved.

Keywords: Multiview image sequence compression; Model-based coding; Non-rigid 3D motion estimation

1. Introduction

Multiview video processing has been the focus of considerable attention in recent literature [7,10,20,23,26]. In a multiview image sequence, each view is recorded from a different observation angle, creating an enhanced 3D impression for the observer and increased 'telepresence'

* Corresponding author. Tel.: (+30-31) 996-359; fax: (+30-31) 996-398; e-mail: {ikom,tzovaras}@dion.ee.auth.gr. This work was supported by the EU CEC Project ACTS PANORAMA (Package for New Autostereoscopic Multiview Systems and Applications, ACTS project 092) and the Greek Secretariat for Science and Technology Programme YPER.

in teleconferencing and several other (medical, entertainment, etc.) applications. New international video coding standards, such as MPEG-4, are in the process of being issued for the coding of image sequences containing up to four views of a video scene. In both monoscopic and multiview vision, the ability of model-based techniques to describe a scene in a structural way has opened up new areas of applications. Video production, realistic computer graphics, multimedia interfaces and medical visualisation are some of the applications that may benefit by exploiting the potential of model-based schemes. Current model-based image coding schemes may be divided into two broad categories. The first category [2,8,12,13,27] is knowledge-based and uses explicit human head models for

0923-5965/98/$ – see front matter © 1998 Elsevier Science B.V. All rights reserved. PII: S0923-5965(98)00030-7


coding, primarily of videoconferencing scenes. A second, analysis-by-synthesis group of methods [4,14,18] is suitable for the coding of more general classes of images. The derivation of 3D models directly from images usually requires estimation of dense disparity fields, postprocessing to remove erroneous estimates, and fitting of a surface model to the calculated depth map. In [16] an algorithm was presented which optimally models each scene using a hierarchical structure derived directly from intensity images. The wireframe model consists of adjacent triangles that may be split into smaller ones over areas that need to be represented in higher detail. The motion of the model surface, under both rigid and non-rigid body assumptions, is estimated concurrently with depth parameters. The face model in [24] incorporates a physically based approximation to facial tissues and a set of anatomically motivated facial muscle actuators. Active contour meshes or 'snakes' have also been used for tracking and estimating flexible motion [15].

In the present paper, a procedure for model-based coding, using flexible motion estimation of all channels of a three-view image sequence, is proposed. The 3D model is initialised by adapting a 2D wireframe to the foreground object. The adaptation of the 2D wireframe is a novel procedure resulting in a very accurate representation of the foreground object. Using depth and multiview camera geometry, the 2D wireframe is reprojected into 3D space, forming a consistent wireframe for all views. The rigid 3D motion of each triangle is estimated next, taking into account the temporal correlation between subsequent frames. For each group of pictures (GOP) the rigid 3D motion of the first frame is estimated using a least median of squares technique. In order to improve the efficiency and stability of this procedure, homogeneity neighbourhood constraints on the motion of each triangle are imposed. The rigid 3D motion of each triangle for subsequent frames is estimated using an iterative Kalman estimation algorithm. For the flexible motion estimation of each node of the 3D model, the rigid 3D motion vectors of each neighbouring triangle are used in a novel flexible motion estimation procedure. Since more than one observation is available for each node, principal component

analysis (PCA) may be used in order to find the best estimate. For the first frame in each GOP, the flexible 3D motion of each node is estimated by solving directly the equation containing the motion parameters. The motion of the nodes in the remaining frames is tracked by Kalman filtering, using the estimate of the first frame as initial value. With the flexible motion of each node available, the model is updated at the next time instant and all views are reconstructed using only the 3D flexible motion, the 3D model and the information from the previous frame. The proposed algorithm fully exploits both the spatial and the temporal correlation of the images, since neighbourhood constraints are taken into account (both for the rigid 3D motion estimation of each triangle and for the PCA) and Kalman filtering is used for tracking the motion in subsequent frames. The novel flexible motion estimator assigns the deformation vector to each node using a one-step approach. This is a major departure from the currently popular practice of first estimating a global motion, which is then improved by testing and adopting local deformations.

In the present paper very few assumptions are made about the content of the scene and the motion it undergoes. The a priori information needed consists only of the depth map and a background/foreground segmentation mask; all other information is extracted using only the projected images of all views. For the depth map estimation the algorithms presented in [26] or in [5] can be used, while the foreground object can be separated using motion information [19] or combined motion and depth information [25]. The basic camera geometry used in this paper is dictated by practical considerations in videoconferencing applications, where cameras are placed on either side and on top of the monitor. All results of the present paper are derived with respect to this geometry and are evaluated with real multiview sequences produced by precisely such an arrangement of cameras. However, the basic ideas and results of the paper may easily be extended to arbitrary multiview systems using arbitrary arrangements and numbers of cameras.

The paper is organised as follows. In the following section, the three-camera geometry is described,


and in Section 3 the procedure for the 3D model initialisation is given. Section 4.1 describes the rigid 3D motion estimation of each triangle for the first frame of each GOP, while in Section 4.2 we present the tracking of the motion at subsequent frames using a Kalman filter approach. The PCA used to find the best estimate among neighbouring triangles is described in Section 5.1, while in Section 5.2 the motion of each node is tracked using Kalman filtering. The 3D flexible motion compensation mechanisms needed to reconstruct all views are presented in Section 6. In Section 7 experimental results are given evaluating the performance of the proposed methods. Conclusions are finally drawn in Section 8.

Fig. 1. The CAHV camera model.

2. Camera model

A camera model describes the projection of 3D points onto a camera target. The model used here is the CAHV model introduced in [28]. This model describes extrinsic camera parameters such as position and orientation, and intrinsic camera parameters such as focal length and the intersection of the optical axis with the image plane. As mentioned in the Introduction, in our multiview camera geometry three cameras $c$ are used: $c =$ left, top, right. For each camera $c$ the model contains the following parameters, shown in Fig. 1: (a) the position of the camera $C_c$, (b) the optical axis $A_c$, i.e. the viewing direction of the camera (unit vector), (c) the horizontal camera target vector $H_c$ ($x$-axis of the camera target), (d) the vertical camera target vector $V_c$ ($y$-axis of the camera target), the radial distortion, and the pixel sizes $s_x$, $s_y$. In our camera model we shall assume that the radial distortion is compensated. In this case the projection of a 3D point $P$, with coordinates relative to the world coordinate system, onto the image plane $(X_c, Y_c)$ is [28]

$$X_c = \frac{(P - C_c)^T H_c}{(P - C_c)^T A_c}, \qquad Y_c = \frac{(P - C_c)^T V_c}{(P - C_c)^T A_c}. \qquad (1)$$

The coordinates $(X_c, Y_c)$ are camera centred (image plane coordinate system), with the pel as unit. The origin of the coordinate system is the centre point of the camera. The coordinates of a point relative to the picture coordinate system $(X'_c, Y'_c)$ are given by $(X'_c, Y'_c) = (X_c + O_{x,c}, Y_c + O_{y,c})$, where $(O_{x,c}, O_{y,c})$ is the centre of the image plane in the picture coordinate system. Conversely, given its position $(X_c, Y_c)$ on the camera plane, the 3D position of a point can be determined by

$$P = C_c + q_c \, S_c(X_c, Y_c), \qquad (2)$$

where $S_c(X_c, Y_c)$ is the unit vector pointing from the camera to the point in the direction of the optical axis and $q_c$ is the distance between the 3D point and the centre of camera $c$.
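To make Eqs. (1) and (2) concrete, they can be implemented in a few lines. The following sketch is ours, not from the paper: function names and array conventions are assumptions, and the construction of the viewing direction $S_c(X_c, Y_c)$ from the camera parameters is omitted.

```python
import numpy as np

def project_cahv(P, C, A, H, V):
    """Project a 3D point P onto the image plane of a CAHV camera, Eq. (1).

    C: camera position; A: optical axis (unit vector);
    H, V: horizontal and vertical camera target vectors.
    """
    d = P - C
    denom = d @ A                              # (P - C)^T A
    return (d @ H) / denom, (d @ V) / denom    # (X_c, Y_c)

def backproject_cahv(C, q, S):
    """Recover a 3D point from its distance q along the viewing
    direction S = S_c(X_c, Y_c), Eq. (2)."""
    return C + q * S
```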

3. 3D model initialisation

The generation of the 3D model of the object is based on an initial adaptation of a 2D wireframe model to the foreground object of one of the projected images (left, top or right). Since the 2D wireframe is adapted to the image, the 3D model can be formed by using depth information (i.e. the distance of the 3D point from the specific camera) in Eq. (2). The consistent camera geometry allows the 3D model to be adapted to the other views as well. In the present paper very few assumptions are made regarding the content of the scene and the


motion of its objects. The a priori information needed consists only of the depth map and a background/foreground segmentation mask; all other information is extracted using only the projected images of all views. For the depth map estimation the algorithms presented in [26] or in [5] can be used, while the foreground object can be separated using motion information [19] or combined motion and depth information [25]. Without loss of generality, to simplify the subsequent discussion, we shall assume that only one object exists in the foreground. However, it is emphasised that the methodology of the paper remains valid and applicable even in the presence of more than one such object. The adaptation of the 2D wireframe to the foreground object is a coarse-to-fine procedure consisting of the following steps:
• A regular grid is first created, covering the full size of the 2D image.
• Only triangles overlapping with the foreground object or with the boundary of the foreground object are retained. The remaining triangles are discarded.
• Nodes of the triangles lying outside the boundary of the foreground object are forced to meet this boundary.
• The 'force' applied for the movement of the outer nodes is propagated to the remaining nodes so as not to drastically alter the regularity of the wireframe.
These steps are described in more detail in the sequel. The initial regular grid is of the form $p = (i \Delta X, j \Delta Y)$, where $\Delta X$, $\Delta Y$ are the horizontal and vertical distances between the nodes of the wireframe, $i = 0, \ldots, SX/\Delta X$ and $j = 0, \ldots, SY/\Delta Y$. $SX$, $SY$ are the horizontal and vertical size of the image, respectively. The wireframe is formed using a Delaunay triangulation on all nodes/vertices $p$. Such a triangulation covering the full size of the middle view can be seen in Fig. 6(a). In the next step, a background/foreground segmentation mask is used to separate and discard the triangles that do not overlap with the foreground

Fig. 2. Special case where a triangle may be discarded from the wireframe.

object. Such a mask is particularly easily calculated if the foreground consists of a moving object in front of a static, non-textured background (such as in videotelephony applications). A simple edge detection algorithm then suffices to provide the foreground object silhouette. A triangle is retained as part of the wireframe when at least one of its vertices lies on the foreground object. The nodes remaining outside the boundary of the object are attracted to the object by a 'force' proportional to their distance from its boundary. Specifically, each node is assumed to have eight degrees of freedom $\{f_1, \ldots, f_8\}$, as shown in Fig. 2. Following each of the eight directions $f_i$, a point $p_i$ on the object boundary is found and a force $F_i$ is defined, having the same direction as $f_i$ and magnitude equal to the Euclidean distance between node $p$ and point $p_i$,

$$F_i = \| p - p_i \|.$$

If no point $p_i$ on the boundary of the foreground object can be found in direction $f_i$, then $F_i$ is set to infinity. The force applied to node $p$ in order to meet the object boundary is chosen to have magnitude

$$F = \min_{i \in \{1, \ldots, 8\}} F_i$$

and the direction of the corresponding $F_i$.
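As an illustration of this attraction step, the sketch below scans the eight directions from a node and returns the smallest boundary force and its direction. It is our own code under stated assumptions (a binary foreground mask indexed as `mask[y, x]`, unit grid steps along each direction), not an implementation from the paper.

```python
import numpy as np

# The eight directions f_1, ..., f_8 of Fig. 2.
DIRECTIONS = [(1, 0), (1, 1), (0, 1), (-1, 1),
              (-1, 0), (-1, -1), (0, -1), (1, -1)]

def boundary_force(node, mask):
    """Return (F, direction): the minimum of the eight forces F_i and the
    direction f_i attaining it; F is infinite if no boundary point is met."""
    h, w = mask.shape
    best_mag, best_dir = np.inf, None
    for d in DIRECTIONS:
        p = np.array(node, dtype=float)
        while True:
            p += d
            x, y = int(round(p[0])), int(round(p[1]))
            if not (0 <= x < w and 0 <= y < h):
                break                          # F_i = infinity in this direction
            if mask[y, x]:                     # first point p_i on the silhouette
                mag = np.linalg.norm(p - np.asarray(node))  # F_i = ||p - p_i||
                if mag < best_mag:
                    best_mag, best_dir = mag, d
                break
    return best_mag, best_dir
```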


In the special case where a triangle has two of its nodes lying on the object boundary, forcing the third node onto the boundary would create a very small triangle (Fig. 2), which must be discarded. Specifically, the triangle is discarded when

$$u_1 + u_2 < th,$$

where $u_1$, $u_2$ are the angles between the triangle vertex within the object and the tangents $e_1$, $e_2$ to the object boundary at the two nodes on the boundary, and $th$ is a threshold. The above procedure results in an accurate adaptation of the wireframe to the foreground object boundary. To avoid the creation of very small triangles and disruption of the regularity of the wireframe, the 'force' $F$ applied to each node is used for the adaptation of the remaining nodes of the wireframe. Specifically, the force $F$ is propagated according to a Gaussian model taking into account the distance between the node $p$ and all other mesh nodes, following the approach in [3]. If $F$ is the force applied to node $p$, then the force $F_q$ applied to a node $q$ at distance $d$ from node $p$ has the same direction as $F$ and magnitude

$$F_q = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(d-\mu)^2/(2\sigma^2)}.$$

This means that node $q$ is moved to a new position $q'$ in the direction of $F$, at distance $\| q' - q \| = F_q$ from its previous position. The mean value $\mu$ and standard deviation $\sigma$ are selected so as to harmonise the resulting wireframe mesh. The result of the above procedure is a very accurately adapted wireframe over the foreground object in one of the views (Fig. 6(c)). By applying Eq. (2) to all nodes $p = (X, Y)$, the 3D points $P = (x, y, z)$ are found and the 3D model is formed (Fig. 6(d)). Due to the consistency of the camera geometry, the projection of the 3D model onto the image planes of the other cameras also leads to 2D wireframes adapted to the foreground object of the other views.

4. Rigid 3D motion estimation and tracking of each triangle

The estimation of rigid 3D motion vectors for each triangle of the 3D model is needed before the


evaluation of flexible motion estimation is attempted. In order to increase the efficiency and stability of the triangle motion estimation algorithm, neighbouring triangles are taken into account. This also results in a smoother 3D motion field, since erroneous large local deformations are suppressed.

4.1. Rigid 3D motion estimation

In the first frame of each group of pictures (GOP) the rigid motion of each triangle $T_k$, $k = 1, \ldots, K$, where $K$ is the number of triangles in the foreground object, is modelled using a linear 3D model with three rotation and three translation parameters [1],

$$P_{t+1} = R^k P_t + T^k, \qquad (3)$$

with $R^k$ and $T^k$ being of the form

$$R^k = \begin{pmatrix} 1 & -w_z^k & w_y^k \\ w_z^k & 1 & -w_x^k \\ -w_y^k & w_x^k & 1 \end{pmatrix}, \qquad (4)$$

$$T^k = \begin{pmatrix} t_x^k \\ t_y^k \\ t_z^k \end{pmatrix}, \qquad (5)$$

where $P_t = (x_t, y_t, z_t)$ is a 3D point on the plane defined by the coordinates of the vertices of triangle $T_k$.

In order to guarantee the production of a homogeneous estimated triangle motion vector field, smoothness neighbourhood constraints are imposed for the estimation of the model parameter vector $a^k = (w_x^k, w_y^k, w_z^k, t_x^k, t_y^k, t_z^k)$. Let $N_k$ be the ensemble of the triangles neighbouring the triangle $T_k$, i.e. those sharing at least one common node with $T_k$. We define as neighbourhood $T_k^u$ of each triangle $T_k$ the set of triangles

$$T_k^u = \{ T_j \in N_k \} \cup \{ T_k \}.$$

For the estimation of the model parameter vector $a^k$ the MLMS iterative algorithm [22] was used. The MLMS algorithm is based on median filtering and is very efficient in suppressing noise with a large amount of outliers (i.e. in situations


where conventional least squares techniques usually fail). At time $t$, each point $P_t$ in $T_k^u$ is projected to points $(X_{c,t}, Y_{c,t})$, $c = l, t, r$, on the planes of the three cameras. Using Eqs. (1) and (3), the projected 2D motion vector $d_c(X_{c,t}, Y_{c,t})$ is determined by

$$d_{x,c}(X_{c,t}, Y_{c,t}) = X_{c,t+1} - X_{c,t} = \frac{(R^k P_t + T^k - C_c)^T H_c}{(R^k P_t + T^k - C_c)^T A_c} - \frac{(P_t - C_c)^T H_c}{(P_t - C_c)^T A_c}, \qquad (6)$$

$$d_{y,c}(X_{c,t}, Y_{c,t}) = Y_{c,t+1} - Y_{c,t} = \frac{(R^k P_t + T^k - C_c)^T V_c}{(R^k P_t + T^k - C_c)^T A_c} - \frac{(P_t - C_c)^T V_c}{(P_t - C_c)^T A_c}, \qquad (7)$$

where $d_c(X_{c,t}, Y_{c,t}) = (d_{x,c}(X_{c,t}, Y_{c,t}), d_{y,c}(X_{c,t}, Y_{c,t}))$. Using the initial 2D motion vectors, estimated by applying a block matching algorithm to the images corresponding to the left, top and right cameras, and also using Eqs. (6) and (7), a linear system for the global motion parameter vector $a^k$ of triangle $T_k$ is formed,

$$b^k = D^k a^k. \qquad (8)$$

Omitting, for notational simplicity, the dependence on time $t$, triangle $k$ and camera $c$, $D$ is seen to equal

$$D = \begin{pmatrix}
d_{x,0}[0] & d_{x,0}[1] & d_{x,0}[2] & d_{x,0}[3] & d_{x,0}[4] & d_{x,0}[5] \\
d_{y,0}[0] & d_{y,0}[1] & d_{y,0}[2] & d_{y,0}[3] & d_{y,0}[4] & d_{y,0}[5] \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
d_{x,L}[0] & d_{x,L}[1] & d_{x,L}[2] & d_{x,L}[3] & d_{x,L}[4] & d_{x,L}[5] \\
d_{y,L}[0] & d_{y,L}[1] & d_{y,L}[2] & d_{y,L}[3] & d_{y,L}[4] & d_{y,L}[5]
\end{pmatrix}, \qquad (9)$$

where

$$d_{x,l}[0] = -H_y z_l + H_z y_l + q_{x,l} A_y z_l - q_{x,l} A_z y_l,$$
$$d_{x,l}[1] = H_x z_l - H_z x_l - q_{x,l} A_x z_l + q_{x,l} A_z x_l,$$
$$d_{x,l}[2] = -H_x y_l + H_y x_l + q_{x,l} A_x y_l - q_{x,l} A_y x_l,$$
$$d_{x,l}[3] = H_x - q_{x,l} A_x, \quad d_{x,l}[4] = H_y - q_{x,l} A_y, \quad d_{x,l}[5] = H_z - q_{x,l} A_z,$$

and

$$q_{x,l} = d_{x,l} + \frac{(P_l - C)^T H}{(P_l - C)^T A}.$$

For the $y$ coordinates,

$$d_{y,l}[0] = -V_y z_l + V_z y_l + q_{y,l} A_y z_l - q_{y,l} A_z y_l,$$
$$d_{y,l}[1] = V_x z_l - V_z x_l - q_{y,l} A_x z_l + q_{y,l} A_z x_l,$$
$$d_{y,l}[2] = -V_x y_l + V_y x_l + q_{y,l} A_x y_l - q_{y,l} A_y x_l,$$
$$d_{y,l}[3] = V_x - q_{y,l} A_x, \quad d_{y,l}[4] = V_y - q_{y,l} A_y, \quad d_{y,l}[5] = V_z - q_{y,l} A_z,$$

and

$$q_{y,l} = d_{y,l} + \frac{(P_l - C)^T V}{(P_l - C)^T A}.$$

Also,

$$b = \begin{pmatrix}
q_{x,0} (P_0 - C)^T A - (P_0 - C)^T H \\
q_{y,0} (P_0 - C)^T A - (P_0 - C)^T V \\
\vdots \\
q_{x,L} (P_L - C)^T A - (P_L - C)^T H \\
q_{y,L} (P_L - C)^T A - (P_L - C)^T V
\end{pmatrix}. \qquad (10)$$

In the above $l = 0, \ldots, L$, where $L$ is the number of 3D points $P_l = [x_l, y_l, z_l]^T$ contained in $T_k^u$, and $C$, $A = [A_x, A_y, A_z]^T$, $H = [H_x, H_y, H_z]^T$, $V = [V_x, V_y, V_z]^T$ are the multiview camera parameters. The 2D motion vectors $[d_{x,l}, d_{y,l}]^T$ correspond to the projected 3D point $P_l$. This is a system of $2 \times 3 \times L$ equations with six unknowns, since for each 3D point $P_l$ two equations are formed, for the $X$ and $Y$ coordinates, for each one of the three cameras. Since $2 \times 3 \times L \geq 6$ for all triangles (in the worst case $T_k^u$ is composed of a single triangle with $L = 3$ and $2 \times 3 \times 3 = 18$), the system is overdetermined and can be solved by the robust least median of squares motion estimation algorithm described in detail in [22]. Erroneous initial 2D estimates, produced by the block-matching algorithm, will be discarded by the least median of squares motion estimation algorithm.
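To make the construction of the system (8) concrete, the following sketch assembles the two rows of $D$ and the two entries of $b$ contributed by a single 3D point seen by a single camera, per Eqs. (9) and (10). This is our own illustrative code: the paper solves the stacked system with the robust least-median-of-squares (MLMS) estimator of [22], for which the ordinary least squares call mentioned afterwards is only a crude stand-in.

```python
import numpy as np

def rows_for_point(P, d2d, C, A, H, V):
    """Rows of D and entries of b for one point and one camera, Eqs. (9)-(10).

    P:   3D point of the neighbourhood T_k^u;
    d2d: its initial 2D motion vector (dx, dy) from block matching;
    C, A, H, V: parameters of the camera considered.
    """
    x, y, z = P
    dPC = P - C
    qx = d2d[0] + (dPC @ H) / (dPC @ A)   # q_{x,l}
    qy = d2d[1] + (dPC @ V) / (dPC @ A)   # q_{y,l}

    def row(T, q):                        # T = H for the x row, V for the y row
        return np.array([
            -T[1]*z + T[2]*y + q*(A[1]*z - A[2]*y),        # coefficient of w_x
             T[0]*z - T[2]*x - q*(A[0]*z - A[2]*x),        # coefficient of w_y
            -T[0]*y + T[1]*x + q*(A[0]*y - A[1]*x),        # coefficient of w_z
             T[0] - q*A[0], T[1] - q*A[1], T[2] - q*A[2]]) # t_x, t_y, t_z

    D = np.vstack([row(H, qx), row(V, qy)])
    b = np.array([qx * (dPC @ A) - dPC @ H,
                  qy * (dPC @ A) - dPC @ V])
    return D, b
```

Stacking these rows over all points of $T_k^u$ and all three cameras and calling, e.g., `np.linalg.lstsq(D, b, rcond=None)` yields the six parameters of $a^k$ under a plain least squares criterion.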


4.2. 3D motion tracking using Kalman filtering

In order to exploit the temporal correlation between consecutive frames, a Kalman filter similar to the one developed in [6] is applied for the calculation of the 3D rigid motion parameters at every time instant. In this way, the estimation of the motion parameters is honed by additional observations as additional frames arrive. Omitting, for the sake of notational simplicity, the explicit dependence of the motion parameters on the triangle $T_k$, thus writing $a_t$, $b_t$, $D_t$ instead of $a_t^k$, $b_t^k$, $D_t^k$, the dynamics of the system are described as follows:

$$a_{t+1} = a_t + w \, e_{t+1}, \qquad (11)$$
$$b_{t+1} = D_{t+1} a_{t+1} + v_{t+1}, \qquad (12)$$

where $a_t$ is the rigid 3D motion vector of each triangle and $e_t$ is a unit-variance white random sequence. The term $w \, e_{t+1}$ describes the changes from frame to frame; a high value of $w$ implies small correlation between subsequent frames and can be used to describe fast-changing scenes, whereas a low value of $w$ may be used when the motion is relatively slow and the temporal correlation is high. The noise term $v_{t+1}$ represents the random error of the formation of the system (8), modelled as white zero-mean Gaussian noise with $E\{v_n \, v_{n'}\} = R_v \, \delta(n - n')$, $n$ being the $n$th element of $v$. The equations [11,21] giving the estimated value $\hat{a}_{t+1}$ from $\hat{a}_t$ are

$$\hat{a}_{t+1} = \hat{a}_t + K_{t+1} (b_{t+1} - D_{t+1} \hat{a}_t), \qquad (13)$$
$$K_{t+1} = (R_t + w^2 I) D_{t+1}^T k^{-1}, \qquad (14)$$
$$k = D_{t+1} R_t D_{t+1}^T + D_{t+1} w^2 I \, D_{t+1}^T + R_v, \qquad (15)$$
$$R_{t+1} = (I - K_{t+1} D_{t+1}) (R_t + w^2 I), \qquad (16)$$

where $\hat{a}_{t+1}$ and $\hat{a}_t$ are the new and old predictions of the unknown motion parameters corresponding to the $(t+1)$th and $t$th frame, respectively, $K_{t+1}$ represents the correction matrix, and $R_t$ and $R_{t+1}$ describe the old and new covariance matrices of the estimation errors $E_t$ and $E_{t+1}$, respectively,

$$E_t = (a_t - \hat{a}_t), \qquad R_t = E\{E_t E_t^T\},$$
$$E_{t+1} = (a_{t+1} - \hat{a}_{t+1}), \qquad R_{t+1} = E\{E_{t+1} E_{t+1}^T\}.$$
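As an illustration, the corrector (13)–(16) can be written compactly as below; this is our own sketch (the helper name and the scalar process-noise variance `w2` standing for $w^2$ are assumptions), not code from the paper.

```python
import numpy as np

def kalman_update(a_hat, R, D, b, w2, Rv):
    """One tracking iteration, Eqs. (13)-(16).

    a_hat, R: previous estimate and error covariance;
    D, b:     system of the current frame, Eq. (8);
    w2:       process-noise variance w^2; Rv: observation-noise covariance.
    """
    n = R.shape[0]
    P = R + w2 * np.eye(n)                 # predicted covariance R_t + w^2 I
    k = D @ P @ D.T + Rv                   # innovation covariance, Eq. (15)
    K = P @ D.T @ np.linalg.inv(k)         # gain matrix, Eq. (14)
    a_new = a_hat + K @ (b - D @ a_hat)    # corrected estimate, Eq. (13)
    R_new = (np.eye(n) - K @ D) @ P        # updated covariance, Eq. (16)
    return a_new, R_new
```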


The initial value $\hat{a}_0$ of the filter (beginning of each GOP) is given by solving Eq. (8). The correlation matrix $R_0$ is

$$R_0 = E\{a_0 \, a_0^T\}.$$

In the above, $w$ and $v$ are assumed to be the same for the whole mesh, hence independent of the triangle $T_k$. Notice that Eq. (8) is solved only once, in order to provide the initial values for the Kalman filtering. During the next frames, $D$ and $b$ are only formed and used in the Kalman filter procedure. The benefits of using Kalman filtering, rather than simply solving Eq. (8) for each pair of consecutive frames, are illustrated by the experimental results in Section 7.

5. Estimation and tracking of flexible surface deformation using PCA

The 3D rigid motion parameters $a^k$ estimated by the previously described procedure are input to the 3D flexible motion estimation procedure. The aim of the 3D flexible motion estimation is to assign a motion vector to every node of the 3D model. Then the 3D model can be updated at the next frame and all views can be reconstructed using only the 3D model, the 3D flexible motion of each node and the previous frame information.

5.1. Estimation of flexible surface deformation

The flexible motion of each node of the wireframe $P_i$, $i = 1, \ldots, P$, is affected by the rigid 3D motion $a^k$ of every triangle connected to $P_i$, whose 3D motion was estimated by the procedure described in the preceding sections. We propose the use of principal component analysis (PCA) in order to determine the best representative motion vector for node $P_i$. More specifically, for each node $P_i$, $N_i$ observations $a_i^l$, $l = 1, \ldots, N_i$, are available, where $N_i$ is the number of triangles containing this node. To form the covariance matrix of $a_i$, their mean value is found by

$$\bar{a}_i = \frac{1}{N_i} \sum_{l=1}^{N_i} a_i^l.$$


The covariance matrix of $a_i$ is

$$C_i = \frac{1}{N_i} \sum_{l=1}^{N_i} (a_i^l - \bar{a}_i)(a_i^l - \bar{a}_i)^T.$$

Let $u_{i,k}$ be the eigenvector of the $6 \times 6$ matrix $C_i$ corresponding to its $k$th highest eigenvalue. The mean value of the projections of all observations onto $u_{i,k}$ is

$$q_{i,k} = \frac{1}{N_i} \sum_{l=1}^{N_i} u_{i,k}^T (a_i^l - \bar{a}_i); \qquad (17)$$

the best estimate of the motion parameters for node $i$ based on $M_i$ eigenvectors is

$$a_i = \sum_{m=1}^{M_i} q_{i,m} u_{i,m} + \bar{a}_i, \qquad (18)$$

where $M_i \leq 6$. The number $M_i$ used for each node depends on the corresponding number of dominant eigenvalues.
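A direct transcription of Eqs. (17) and (18) is given below; it is our own sketch (array layout and the fixed choice of `M` are assumptions), intended only to show the mechanics of the PCA estimate.

```python
import numpy as np

def node_motion_pca(obs, M):
    """Best motion estimate for one node from the rigid motions of its
    adjacent triangles, Eqs. (17)-(18).

    obs: N_i x 6 array, one motion vector a_i^l per adjacent triangle;
    M:   number of dominant eigenvectors kept (M <= 6).
    """
    a_bar = obs.mean(axis=0)                    # mean observation
    dev = obs - a_bar
    Ci = dev.T @ dev / obs.shape[0]             # 6x6 covariance matrix C_i
    vals, vecs = np.linalg.eigh(Ci)             # eigenvalues in ascending order
    U = vecs[:, ::-1][:, :M]                    # M dominant eigenvectors
    q = (dev @ U).mean(axis=0)                  # mean projections, Eq. (17)
    return a_bar + U @ q                        # Eq. (18)
```

In practice $M_i$ would be chosen per node from the number of dominant eigenvalues, e.g. by an energy threshold on `vals`.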

5.2. 3D motion tracking using Kalman filtering

The previously described flexible motion estimation procedure is applied only at the beginning of each GOP. An iterative Kalman filter is used for the flexible motion estimation of each node in subsequent frames, taking into account their temporal correlation. Eq. (18) with $M = 6$ can be written as

$$Q = U (\Delta a), \qquad (19)$$

where $\Delta a = a - \bar{a}$, $Q$ is the $6 \times 1$ vector with components given by Eq. (17) for each node $i$, and $U$ is the $6 \times 6$ matrix having the eigenvectors as rows. Temporal relations can be written as

$$(\Delta a)_{t+1} = (\Delta a)_t + w \, e_{t+1}, \qquad (20)$$
$$Q_{t+1} = U_{t+1} (\Delta a)_{t+1} + v_{t+1}. \qquad (21)$$

The term $w \, e_{t+1}$ remains the same as in Eq. (11), since the temporal correlation is the same, but the noise term $v$ now represents the random error of Eq. (19). Eqs. (13)–(16), with $a$ replaced by $\Delta a$, $b$ replaced by $Q$ and $D$ replaced by $U$, provide the best estimate of the 3D motion of each node.
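In terms of the earlier sketch, tracking a node therefore amounts to reusing the same update with the substitutions named above; assuming the hypothetical `kalman_update` helper from Section 4.2:

```python
# Track the deviation Delta a of one node (names are ours, not the paper's):
# U plays the role of D and Q the role of b, per Eqs. (19)-(21).
da_hat, R = kalman_update(da_hat, R, U, Q, w2, Rv)
```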

6. 3D flexible motion compensation

After the procedure described in the previous section, a 3D motion vector has been assigned to every node of the 3D model. Using this information, the 3D model may be updated and the projected views of the next frame may be found. More specifically, each node of the 3D model is governed by the following equation:

$$P_{i,t+1} = R_i P_{i,t} + T_i,$$

where $R_i$ is of the form of Eq. (4), $T_i$ is of the form of Eq. (5) and both are derived from the $a_i$ calculated in the previous section. The 3D flexible motion vector of each node is given by

$$S_{i,t} = P_{i,t+1} - P_{i,t}.$$

In order to assign a 3D motion vector to every 3D point we use the barycentric coordinates of this point. If $P = (x, y, z)$ is a point on the triangular patch $P_1 P_2 P_3 = \{(x_1, y_1, z_1), (x_2, y_2, z_2), (x_3, y_3, z_3)\}$, then the 3D flexible motion vector of this point is given by

$$S_t = S_t^1 g_1(x, y) + S_t^2 g_2(x, y) + S_t^3 g_3(x, y). \qquad (22)$$

The functions $g_i(x, y)$ are the barycentric coordinates of $(x, y, z)$ relative to the triangle and they are given by $g_i(x, y) = \mathrm{Area}(A_i)/\mathrm{Area}(P_1 P_2 P_3)$, where $\mathrm{Area}(A_i)$ is the area of the triangle having as vertices the point $(x, y, z)$ and two of the vertices of $P_1 P_2 P_3$, excluding $P_i$. Each view at the next frame can be reconstructed using

$$I_{c,t+1}(p_{c,t+1}) = I_{c,t}(p_{c,t}),$$

where $I_c$ is the intensity of each projected view and $p_{c,t}$, $p_{c,t+1}$ are the 2D projected points of $P_t$, $P_{t+1}$ using camera $c =$ left, top, right in Eq. (1). It can be seen from the above discussion that the only parameters whose transmission is necessary for the reconstruction of all but the first frame in each GOP are the 3D flexible motion vectors of the nodes of the 3D wireframe. For the coding of these vectors, a simple DPCM technique is used, coding only the differences between successive nodes. At the beginning of each GOP the model geometry and the projected views must also be transmitted. For the coding of images at the beginning of each GOP, intra frame techniques are used, as in [17].
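The barycentric interpolation of Eq. (22) is straightforward to implement; the sketch below is ours (vertex ordering and array conventions are assumptions) and computes the weights from triangle areas exactly as defined above.

```python
import numpy as np

def flexible_motion(P, tri, S):
    """3D motion of a surface point from its barycentric weights, Eq. (22).

    P:   point (x, y, z) on the triangular patch;
    tri: 3x3 array, the patch vertices P_1, P_2, P_3 as rows;
    S:   3x3 array, the flexible motion vectors S^1, S^2, S^3 of the vertices.
    """
    def area(a, b, c):            # triangle area via the cross product
        return 0.5 * np.linalg.norm(np.cross(b - a, c - a))
    total = area(tri[0], tri[1], tri[2])
    # g_i = Area(A_i) / Area(P_1 P_2 P_3), with A_i excluding vertex P_i
    g = np.array([area(P, tri[1], tri[2]),
                  area(P, tri[0], tri[2]),
                  area(P, tri[0], tri[1])]) / total
    return g @ S                  # S = g_1 S^1 + g_2 S^2 + g_3 S^3
```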


Fig. 3. (a) Synthetic 3D model. (b) Left projected view. (c) Top projected view. (d) Right projected view.

7. Experimental results

The proposed model-based coding was evaluated for the flexible 3D motion estimation of both synthetically created and real multiview image sequences obtained with a three-camera setup such as described in Section 2.

7.1. Experimental results for synthesised multiview images

In order to test the performance of the proposed algorithm for the tracking of 3D deformations,


Fig. 5. The estimated 3D flexible motion.

Fig. 4. (a) Projected wireframe of the synthetic 3D model. (b) The projected wireframe after undergoing major deformation.

a highly textured 3D model undergoing a major deformation was created. The wireframe of the 3D model is shown in Fig. 3(a), and the projected views using a highly textured image are shown in Fig. 3(b–d). A hemispherical 3D model was chosen, since this shape approximates several naturally occurring volumes, including the human face. A similar shape was used in [9] in order to reconstruct

the 3D left ventricle (LV) heart surface from volumetric data. The multiview data were subjected to a 3D deformation consisting of a translation in the $x$ direction of one node $P$ of the model, which is also propagated over the mesh in a Gaussian fashion. In other words, a 3D point $P_q$ of the model was translated to the point $P_q'$ given by

$$P_q' = P_q + \begin{pmatrix} t_x \\ 0 \\ 0 \end{pmatrix},$$

where $t_x$ is given by

$$t_x = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(d-\mu)^2/(2\sigma^2)},$$

where $d$ is the 3D Euclidean distance between $P$ and $P_q$. The mean value $\mu$ is the initial translation applied to


Fig. 6. (a) Regular wireframe covering the full size of the middle view. (b) Coarse adaptation with triangles covering only the foreground object. (c) Fine adaptation to the foreground object. (d) The 3D model produced by reprojecting the 2D wireframe into 3D space.

node $P$, and the standard deviation $\sigma$ determines the severity of the modelled deformation. The 3D point $P$ was selected to be close to the centre of the wireframe. The initial and the deformed model are projected onto the three camera planes, producing three views. The projected 3D model before and after the deformation is shown in Fig. 4(a) and Fig. 4(b), respectively. The above procedure results in three images, each corresponding to one of the views. The algorithms described in Sections 4.1 and 5.1 were used to estimate the 3D flexible motion only from the projected images and the 3D model, as if those images were the first and second frames in a GOP. The estimated 3D flexible motion vectors are shown in Fig. 5, where it can be seen that large deformation

vectors, shown in dark colour, are correctly estimated around the centre of the model and abate near the outer edge, shown in light colour. The direction of the 3D flexible motion vectors can be seen from the vectors plotted for each node of the 3D model. This demonstrates the ability of the proposed algorithm to determine large deformation vectors using the initial 3D model and three projected images.
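For reference, the synthetic deformation used in this experiment can be reproduced in a few lines; this is our own sketch of the construction described above (the node array layout is an assumption).

```python
import numpy as np

def gaussian_deformation(nodes, centre_idx, mu, sigma):
    """Translate the mesh along x with a Gaussian profile in the 3D distance
    d from the selected node, as in the synthetic test of Section 7.1."""
    d = np.linalg.norm(nodes - nodes[centre_idx], axis=1)
    t_x = np.exp(-(d - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    deformed = nodes.copy()
    deformed[:, 0] += t_x           # translation applied along the x axis only
    return deformed
```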

7.2. Experimental results for real multiview images The proposed model-based coding was also evaluated for the coding of left, top, right channels of a real multiview image sequence. The interlaced


Fig. 7. (a), (c), (e) Original left, top, right view of frame 3. (b), (d), (f) Reconstructed left, top, right view of frame 3.

multiview videoconference sequence 'Ludo' of size 720×576 was used.¹ All experiments were performed on the top field of the interlaced sequence, thus using images of size 720×288. The 3D model was formed using the techniques described in Section 3. A regular wireframe was first created covering the full size of the top view image, as shown in Fig. 6(a). Only triangles lying on the foreground object (body and desk of 'Ludo') were then retained, providing a coarse adaptation of the wireframe to the foreground object (Fig. 6(b)). The wireframe was then finely adapted to the foreground object (Fig. 6(c)) and, using depth information, the 2D wireframe was reprojected into 3D space, giving the 3D model (Fig. 6(d)).

¹ This sequence was prepared by the CCETT for use in the PANORAMA ACTS project.

Since the geometry of the 3D model is known, the 3D motion of each triangle can be estimated using the algorithm in Section 4.1. The 3D motion of each triangle was then used as input to the flexible motion estimation algorithm described in Section 5.1. Since a 3D motion vector is assigned to each node of the wireframe, succeeding frames can be reconstructed using only the 3D model, the 3D motion and the frames of the previous time instant (Section 6). The flexible motion estimation was performed between frames 0 and 3 (since the differences between frames 0 and 1 were negligible). The original top view of frame 0 can be seen in Fig. 6, and the original left, top and right views of frame 3 are shown in Fig. 7(a,c,e), respectively. The reconstructed left, top and right views are shown in Fig. 7(b,d,f). The frame differences between the original frames 0 and 3 are given in Fig. 8(a,c,e) for


Fig. 8. (a), (c), (e) Difference between original frames 0 and 3 (left, top, right views, respectively). (b), (d), (f) Difference between original frame 3 and reconstructed frame 3 (left, top, right views, respectively).

all views, while the frame differences between the original frame 3 and the reconstructed frame 3 are shown in Fig. 8(b,d,f). As can be seen, the reconstructed images are very close to the original frame 3. The performance of the algorithm was also tested in terms of PSNR, giving an improvement of approximately 4 dB for all views, as compared to the PSNR between frames 0 and 3. The proposed algorithm was also tested for the coding of a sequence of frames at 10 frames/s. The model adaptation procedure was applied only at the beginning of each GOP. Each GOP consists of 10 frames. The first frame of each GOP was transmitted using intra frame coding techniques. For the rigid 3D motion estimation of each triangle and for the flexible 3D motion estimation of each node between the first and second frames in a GOP, the

techniques described in Sections 4.1 and 5.1, respectively, were used. For the rigid 3D motion estimation of each triangle and for the flexible 3D motion estimation of each node between subsequent frames, the Kalman filtering approach of Sections 4.2 and 5.2 was used. The only parameters that need to be transmitted in this case are the 3D flexible motion vectors of each node. The methodology developed in this paper allows the left, top and right images to be reconstructed using the same 3D flexible motion vectors, thus achieving considerable bitrate savings. The coding algorithm requires a bitrate of 64.4 kbps and produces better image quality compared to a correspondingly simple block matching motion estimation algorithm. For the block matching scheme only the first


Fig. 9. (a) PSNR of each frame of the left channel of the proposed algorithm, compared with flexible motion estimation without Kalman filtering and with the block matching scheme with a block size of 16×16 pixels. (b) PSNR of each frame of the top channel. (c) PSNR of each frame of the right channel.

frame of each group of frames was transmitted using intra frame coding. The second frame was reconstructed from the first one using the estimated flexible motion vectors between the first and the second frames, and this procedure was continued for the rest of the group of frames, using each previously reconstructed frame for the estimation of the next one. The bitrate required by this scheme with a 16×16 block size was 68.5 kbps. In order to demonstrate the benefits of the Kalman filtering approach, we also estimated motion

between consecutive frames without taking into account motion information from previous frames. More specifically, for the rigid 3D motion estimation of each triangle and for the flexible 3D motion estimation of each node between all subsequent frames, the techniques described in Sections 4.1 and 5.1, respectively, were used. The parameters which need to be transmitted remain the same. In Fig. 9(a–c) the resulting image quality for every reconstructed frame is shown, comparing the block-based scheme with our approach, with and


without the use of the Kalman filter, for the left, top and right views, respectively. As can be seen from the plots, the proposed approach performs better with the use of Kalman filters. We also note that the additional computational overhead resulting from the use of the Kalman filter is negligible. The results also demonstrate the robustness of the algorithm to noisy inputs, since the experiments involve real image sequences, where the initial 2D vector fields are estimated by a simple block-matching algorithm without any special pre- or post-processing, and as a result they contain many noisy estimates. Even in the case of synthetic data, the initial 2D vector fields are derived directly from the produced images (Fig. 3(b–d)) and again contain many noisy estimates.

8. Conclusions

We have presented an algorithm for estimating the 3D flexible motion of each node of a 3D model. This flexible motion can be used to reconstruct the next frames of a multiview image sequence. A novel algorithm was used for the exact initial adaptation of the model to the foreground object. Information from all cameras was then used in order to estimate the 3D motion of each triangle. This motion was tracked over time using Kalman filtering. Principal component analysis of the triangles' 3D motion vectors was used in order to assign a flexible 3D motion vector to each node of the wireframe. The deformation motion was updated from frame to frame using a Kalman filter approach. Using the barycentric coordinates of each 3D point contained in the 3D model, the next frame of all three views was reconstructed using only the flexible 3D motion of each node. The algorithm requires very little a priori information (depth map, background/foreground mask) and is applicable to general scenes. It can be used in a variety of applications, including very low bitrate teleconferencing with enhanced 'telepresence' and video production, where the 3D motion of a specific scene is used to produce a scene with similar motion but different texture, as when producing a video with a model mimicking the motion of an actor.


References

[1] G. Adiv, Determining three-dimensional motion and structure from optical flow generated by several moving objects, IEEE Trans. Pattern Anal. Mach. Intell. 7 (July 1985) 384–401.
[2] K. Aizawa, H. Harashima, T. Saito, Model-based analysis synthesis image coding (MBASIC) system for a person's face, Signal Processing: Image Communication 1 (2) (1989) 139–152.
[3] D. Burr, Elastic matching of line drawings, IEEE Trans. Pattern Anal. Mach. Intell. 3 (6) (1981) 708–713.
[4] H. Busch, Subdividing non-rigid 3D objects into quasi-rigid parts, in: Proc. IEE 3rd Internat. Conf. Image Processing Applications, Warwick, UK, 1989.
[5] L. Falkenhagen, Depth estimation from stereoscopic image pairs assuming piecewise continuous surfaces, in: European Workshop on Combined Real and Synthetic Image Processing for Broadcast and Video Productions, Hamburg, Germany, November 1994, pp. 23–24.
[6] L. Falkenhagen, 3D object-based depth estimation from stereoscopic image sequences, in: Proc. Internat. Workshop on Stereoscopic and 3D Imaging '95, Santorini, Greece, September 1995, pp. 81–86.
[7] N. Grammalidis, S. Malassiotis, D. Tzovaras, M.G. Strintzis, Stereo image sequence coding based on three-dimensional motion estimation and compensation, Signal Processing: Image Communication 7 (2) (1995) 129–145.
[8] L. Haibo, P. Roivanen, R. Forcheimer, 3D motion estimation in model-based facial image coding, IEEE Trans. Pattern Anal. Mach. Intell. 15 (June 1993) 545–555.
[9] W.C. Huang, D.B. Goldgof, Adaptive-size meshes for rigid and non-rigid shape analysis and synthesis, IEEE Trans. Pattern Anal. Mach. Intell. 15 (6) (1993) 611–616.
[10] E. Izquierdo, M. Ernst, Motion/disparity analysis for 3D-video-conference applications, in: M.G. Strintzis et al. (Eds.), Proc. Internat. Workshop on Stereoscopic and 3D Imaging, Santorini, Greece, September 1995, pp. 180–186.
[11] A.K. Jain, Fundamentals of Digital Image Processing, Prentice-Hall, Englewood Cliffs, NJ, 1986, pp. 304–306.
[12] J. Jaou, N. Duffy, A texture mapping approach to 3D facial image synthesis, Comput. Graphics Forum 7 (1988) 129–134.
[13] M. Kaneko, A. Koike, Y. Hatori, Coding of facial image sequences based on 3D model of the head and motion detection, J. Visual Commun. Image Represent. 2 (March 1991) 39–54.
[14] F. Kappei, C.E. Liedtke, 3D motion estimation in model based facial image coding, Dept. Elec. Eng. Rep. LiTH-ISY-I-1278, October 1991.
[15] M. Kass, A. Witkin, D. Terzopoulos, Snakes: Active contour models, in: Proc. 1st Internat. Conf. Comput. Vision, London, June 1987, pp. 259–268.
[16] S. Malassiotis, M.G. Strintzis, Model based joint motion and structure estimation from stereo images, Comput. Vision and Image Understanding 64 (November 1996).


[17] MPEG-2, Generic coding of moving pictures and associated audio information, Tech. Rep. ISO/IEC 13818, 1996.
[18] H.G. Musmann, M. Hötter, J. Ostermann, Object-oriented analysis-synthesis coding of moving images, Signal Processing: Image Communication 1 (2) (1989) 117–138.
[19] M.J.T. Reinders, Model Adaptation for Image Coding, Delft University Press, 1995.
[20] L. Robert, R. Deriche, Dense depth map reconstruction using a multiscale regularization approach with discontinuities preserving, in: M.G. Strintzis et al. (Eds.), Proc. Internat. Workshop on Stereoscopic and 3D Imaging, Santorini, Greece, September 1995, pp. 32–39.
[21] A.P. Sage, J.L. Melsa, Estimation Theory with Applications to Communications and Control, McGraw-Hill, New York, 1971.
[22] S.S. Sinha, B.G. Schunck, A two-stage algorithm for discontinuity-preserving surface reconstruction, IEEE Trans. Pattern Anal. Mach. Intell. 14 (January 1992).

[23] A. Tamtaoui, C. Labit, Constrained disparity and motion estimators for 3DTV image sequence coding, Signal Processing: Image Communication 4 (November 1991) 45–54.
[24] D. Terzopoulos, K. Waters, Analysis and synthesis of facial image sequences using physical and anatomical models, IEEE Trans. Pattern Anal. Mach. Intell. 15 (June 1993) 569–579.
[25] D. Tzovaras, I. Kompatsiaris, M.G. Strintzis, 3D object articulation and motion estimation in model-based stereoscopic videoconference image sequence analysis and coding, Signal Processing: Image Communication 14 (1999), to appear.
[26] D. Tzovaras, N. Grammalidis, M.G. Strintzis, Object-based coding of stereo image sequences using joint 3-D motion/disparity compensation, IEEE Trans. Circuits Systems Video Technol. 7 (April 1997).
[27] B. Welsh, Model based coding of images, Ph.D. Dissertation, British Telecom Res. Lab., January 1991.
[28] Y. Yakimovski, R. Cunningham, A system for extracting 3D measurements from a stereo pair of TV cameras, Comput. Graphics Image Process. 7 (2) (1978) 195–210.
