3D Body Animation and Coding within a MPEG-4 Compliant Framework

Marius Preda, Titus Zaharia and Françoise Prêteux
ARTEMIS Project Unit, Institut National des Télécommunications,
9, Rue Charles Fourier, 91011 Evry Cedex, FRANCE
e-mail: [email protected]

ABSTRACT

Addressing the new functionalities of video access and coding, the work presented here lies within the scope of the MPEG-4 activities related to Face and Body Animation (FBA). Specifically, we show how the new MPEG-4 standard offers an appropriate framework for virtual actor animation, gesture synthesis and compression/transmission. The targeted multimedia service is a low bit rate video-phone system for the deaf community, based on sign language. To address the above-mentioned functionalities and service, we have developed and implemented: 1) an MPEG-4 compliant method for virtual human body modeling and animation, 2) two compression techniques currently adopted by the MPEG-4 standard, and 3) a novel method for estimating the 3D pose of the hand from video sequences.

1. INTRODUCTION

The continuous development of multimedia software and hardware technologies, together with the explosive growth of the Internet, has led to increasing interest in effective compression tools for audio-visual content, in order to reduce the cost of data transmission in multimedia environments. The Moving Picture Experts Group (MPEG) aims at providing standardized core technologies allowing efficient storage, transmission and manipulation of video data. MPEG-4, which will become an international standard by December 1999, is specifically intended to cope with the requirements of multimedia applications, allowing new functionalities such as video manipulation, scalable video encoding, synthetic and natural hybrid coding, 3D object compression, and face and body animation and coding. Model-based parametric video coding leads to very low bit rate compression, making possible applications such as video conferencing, video-telephony, mobile communications and mobile multimedia applications.

The issue of face and body animation and coding has been addressed by the Face and Body Animation (FBA) Subgroup of the Synthetic and Natural Hybrid Coding (SNHC) Group of MPEG. Virtual actors can be portrayed visually as 2D icons, cartoons [1], composited video, 3D shapes or full 3D bodies [2, 3, 4]. The MPEG-4 standard provides tools for the efficient compression of the animation parameters associated with the H-Anim [5] 3D articulated human body model. We emphasize that the MPEG-4 standard addresses only the efficient coding of the animation parameters; it standardizes neither a specific model nor the parameter extraction methods. However, tools for gesture analysis and synthesis are of great interest. The experimental results provided in this paper are closely related to a domain-specific application: communication by means of sign language. During the last few years, intensive research work has been carried out, leading to web-oriented educational applications [6], sign language dictionaries [7], compression and transmission of sign language-based applications [8], and vision-based sign language recognition/interpretation. In this paper, we show how the new MPEG-4 standard offers an appropriate framework for the development of sign language related applications, in terms of animation, gesture synthesis/analysis, and compression/transmission. The paper is organized as follows. In Section 2, we present the basic concepts related to the MPEG-4 body object: the body definition parameters, specifying the model properties (topology, geometry, texture, color), and the body animation parameters, defining the 3D pose of a generic model. We describe in detail the generation of a 3D articulated hand and present the so-called ARTEMIS Hand Animation Interface [9], which provides a complete and user-friendly software environment for MPEG-4 compliant gesture synthesis. Section 3 deals with two compression techniques currently adopted in MPEG-4 [10]: a predictive approach and a DCT-based compression scheme. Experimental results, obtained in the case of alphabet letters expressed in American Sign Language, are then discussed. Section 4 describes the method developed for estimating the 3D pose of the hand from video sequences.

2. MPEG-4 BODY OBJECT

The MPEG-4 Body Object is a hierarchical graph [10] consisting of nodes associated with anatomical segments and edges defining sub-part relationships. Each segment is individually specified and animated by means of two distinct bitstreams, referred to as Body Definition Parameters (BDPs) and Body Animation Parameters (BAPs). BDPs control the intrinsic properties of a segment, namely its local surface topology, geometry and texture. BAPs define the extrinsic properties of a segment, i.e. its 3D pose with respect to a reference frame attached to the parent segment. BDPs are actor-specific; hence, the overall morphology of an actor can be readily altered by overriding the current BDPs. As opposed to BDPs, BAPs are meant to be generic: if correctly interpreted, a given set of BAPs will produce perceptually similar motion when applied to different actor models specified by their own BDPs.

2.1. 3D Human Body Modeling

Within the MPEG-4 framework, it is necessary to develop a 3D humanoid model and to extract the animation parameters directly from video sequences. Starting from a non-articulated, static VRML humanoid model, we first produce a complete segmentation into anatomical subparts (Figure 1).
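To fix ideas, the hierarchical organization described above, with BDP-like definition data and a BAP-like pose attached to each anatomical segment, can be pictured with the following minimal C++ sketch; all type and field names are illustrative assumptions and do not reproduce the MPEG-4 node syntax.

#include <array>
#include <memory>
#include <string>
#include <vector>

// Purely illustrative view of the Body Object hierarchy: each anatomical
// segment carries its own definition data (BDP side) and a pose relative to
// its parent (BAP side), and points to its child segments.
struct SegmentDefinition {                           // BDP-like, actor-specific
    std::vector<std::array<float, 3>> vertices;      // local geometry
    std::vector<std::array<int, 3>> triangles;       // local surface topology
    std::string textureFile;                         // appearance
};

struct SegmentPose {                                 // BAP-like, generic
    float rotation[3][3];                            // orientation w.r.t. the parent frame
};

struct BodySegment {
    std::string name;                                // e.g. "r_forearm"
    SegmentDefinition bdp;
    SegmentPose bap;
    std::vector<std::unique_ptr<BodySegment>> children;  // sub-part relationships
};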

We then define and generate a hierarchical graph description of the body, equivalent to an articulated model. Finally, BAP extraction issues are addressed in a supervised fashion by means of a user-friendly graphical interface allowing the interactive setting of each parameter.

Figure 1: Articulation generation procedure: steps of the recursive propagation-based segmentation.

Decomposing a VRML model into anatomical subparts is performed by a supervised triangular mesh propagation algorithm. This algorithm recursively constructs node and face adjacency lists for the sub-mesh associated with each anatomical segment in the following way: the node list is initialized with an arbitrary seed point belonging to the anatomical segment under consideration; at step t, all neighbors of the points added at step t-1, and all the triangles containing these points, are appended to the node and triangle lists, respectively. Here, two points are said to be neighbors if they belong to the same triangle of the initial mesh. Visualizing the model at each step allows the user to decide when the segmentation procedure must be stopped. A sketch of this propagation step is given below.

Subpart relationships between the resulting anatomical segments are represented in a hierarchical graph. The tree structure of the graph defines, for each component, one parent node and possibly several child nodes. BAP information is attached to each node, yielding the rotation matrix which is used to register the segment local frame onto that of its parent. Such information allows local movements to be specified directly, and global movements to be obtained by matrix multiplication. Local axes are part of the MPEG-4 Body object specifications [10]. Frame origins are computed as follows. Given the sub-mesh M associated with a segment, its boundary is defined as the set B(M) of points P such that there exists no closed and simple path on M that surrounds P. The set B(M) consists of the vertices shared by the segment and its parent, and therefore corresponds to the junction between the two anatomical components. The frame origin is then set as the center of gravity GB(M) of the boundary points. Denoting by GM the center of gravity of the sub-mesh, the y-axis lies along GB(M)GM. The remaining two axes are finally determined in accordance with the MPEG-4 specifications.
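The propagation step described above can be sketched as follows. This is a minimal illustration assuming the mesh is given as a flat triangle-index array; the data layout, function names and the fixed number of steps are assumptions, not the actual AHAI implementation.

#include <array>
#include <cstddef>
#include <set>
#include <vector>

// Minimal sketch of the supervised propagation step: starting from a seed
// vertex, the vertex set is grown one "ring" per step, and every triangle
// touching the vertices added so far is collected.
struct Segment {
    std::set<int> vertices;   // node list of the sub-mesh
    std::set<int> triangles;  // face list of the sub-mesh
};

Segment growSegment(const std::vector<std::array<int, 3>>& triangles,
                    int seedVertex, int numSteps) {
    Segment segment;
    segment.vertices.insert(seedVertex);
    std::set<int> frontier = {seedVertex};          // points added at the previous step

    for (int step = 0; step < numSteps && !frontier.empty(); ++step) {
        std::set<int> next;
        for (std::size_t t = 0; t < triangles.size(); ++t) {
            const std::array<int, 3>& tri = triangles[t];
            // Two points are neighbors if they share a triangle, so a triangle
            // touching the frontier contributes its vertices to the segment.
            if (!frontier.count(tri[0]) && !frontier.count(tri[1]) &&
                !frontier.count(tri[2]))
                continue;
            segment.triangles.insert(static_cast<int>(t));
            for (int v : tri)
                if (segment.vertices.insert(v).second)
                    next.insert(v);                  // newly reached vertex
        }
        frontier.swap(next);
        // In the interactive tool the user would visualize the model here and
        // decide whether to stop the propagation.
    }
    return segment;
}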

2.2. 3D Human Body Animation

In order to facilitate BAP manipulation, we have developed the so-called ARTEMIS Hand Animation Interface (AHAI). The AHAI is a user-friendly C++ interface, available on X11/Motif and Windows platforms, which offers the following features: loading video sequences; loading a human body model or anatomical part models (hand, arm); calibrating the model according to the anthropometric characteristics of the actor in the video sequence (dimensions of the palm, length of the fingers, etc.); extracting the BAPs specifying any gesture posture or corresponding to the posture shown in the video sequence; storing the extracted BAPs; and animating the model according to the BAP file. This interface has been used to generate the MPEG-4 BAP data set of alphabet letters expressed in American Sign Language and has been donated to ISO [11]. In addition, we have addressed more sophisticated aspects dealing with the realistic 3D deformations induced by model animation (muscle contraction, clothing folds and adjustments, etc.) [12]. Deformation modeling is achieved by instantiating Body Animation Tables (BATs). BATs specify a list of vertices of the 3D model which undergo non-rigid motion and define their displacement fields as functions of the BAPs. Our contribution [13] has consisted in creating BATs dedicated to hand deformations. For the sake of compactness in the deformation field specification, we have implemented a BAT interpolation method exploiting reference BATs associated with key frames; a sketch of this interpolation is given below. Figure 2 shows the results of the BAT-based interpolation technique in the case of a simple finger flexion movement.
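A minimal sketch of such key-frame-based BAT interpolation follows. The per-vertex linear blend between two reference tables and the data layout are assumptions made for illustration; they do not reproduce the exact BAT syntax of [13].

#include <array>
#include <cstddef>
#include <vector>

// Illustrative sketch: a reference BAT stores, for a key BAP value, the
// displacement of each affected vertex; deformations for intermediate BAP
// values are obtained by linearly blending the two surrounding key tables.
struct ReferenceBAT {
    float bapValue;                                   // key BAP value (e.g. a flexion angle)
    std::vector<int> vertexIds;                       // vertices undergoing non-rigid motion
    std::vector<std::array<float, 3>> displacements;  // one displacement per listed vertex
};

std::vector<std::array<float, 3>>
interpolateBAT(const ReferenceBAT& keyA, const ReferenceBAT& keyB, float bap) {
    float t = (bap - keyA.bapValue) / (keyB.bapValue - keyA.bapValue);
    if (t < 0.f) t = 0.f;                             // clamp outside the key interval
    if (t > 1.f) t = 1.f;

    std::vector<std::array<float, 3>> result(keyA.displacements.size());
    for (std::size_t i = 0; i < result.size(); ++i)
        for (int c = 0; c < 3; ++c)
            result[i][c] = (1.f - t) * keyA.displacements[i][c]
                         + t * keyB.displacements[i][c];
    return result;  // to be added to the vertices listed in keyA.vertexIds
}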

Figure 2: BAT interpolation. Key frames (a) and (d); interpolated frames (b) and (c).

3. BODY ANIMATION PARAMETER CODING

The generic nature of the BAP description makes it possible to avoid transmitting the model during animation. Moreover, BAP coding ensures very low bit rate transmission. We have contributed [14, 15] to the evaluation of the two coding approaches (predictive and DCT-based) selected for the future MPEG-4 standard [10]. In the first method, BAPs are quantized and coded by a predictive coding scheme: for each parameter to be coded in frame n, the decoded value of this parameter in frame n-1 is used as the prediction. The prediction error is then encoded by arithmetic coding. This scheme avoids encoding error accumulation. Moreover, since BAPs can be assigned different precision requirements, different quantization step sizes are applied; they consist of a local (BAP-specific) step size and a global one (used for bit rate control). The quantized values are passed to the adaptive arithmetic encoder. The coding efficiency is increased by providing the encoder with range estimates for each BAP. A sketch of this prediction/quantization loop is given below. Table 1 shows the results obtained by applying such a coding to letters "A" to "L". For all BAPs, the frame rate is 10 frames/second.
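The following sketch illustrates the prediction/quantization loop under the assumptions stated in the comments; the adaptive arithmetic coder and the BAP range estimates are omitted, and all names are hypothetical rather than taken from the verification model.

#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of the predictive BAP coding loop: each BAP is quantized with a
// local (BAP-specific) step size scaled by the global step Q, predicted from
// its own decoded value in the previous frame, and only the prediction error
// is passed on to the (omitted) adaptive arithmetic coder. Because the
// prediction uses decoded values, quantization errors do not accumulate.
std::vector<std::vector<int>>
encodePredictive(const std::vector<std::vector<float>>& baps,   // baps[frame][parameter]
                 const std::vector<float>& localStep,            // per-BAP quantization step
                 float globalQ) {                                // global step (bit-rate control)
    std::vector<int> previousDecoded(localStep.size(), 0);       // quantized BAPs of frame n-1
    std::vector<std::vector<int>> residuals;

    for (const std::vector<float>& frame : baps) {
        std::vector<int> frameResiduals(frame.size());
        for (std::size_t k = 0; k < frame.size(); ++k) {
            float step = localStep[k] * globalQ;
            int quantized = static_cast<int>(std::lround(frame[k] / step));
            frameResiduals[k] = quantized - previousDecoded[k];  // prediction error
            previousDecoded[k] = quantized;                      // decoder rebuilds the same value
        }
        residuals.push_back(frameResiduals);
    }
    return residuals;  // symbols that would feed the adaptive arithmetic coder
}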

Letter   # Fr   Br Q=1   Br Q=2   Br Q=4   Br Q=8   Br Q=16   Br Q=31
A        44     2.70     2.62     2.54     2.47     2.39      2.32
B        18     4.71     4.56     4.44     4.33     4.23      4.11
C        29     3.85     3.78     3.65     3.54     3.45      3.34
D        49     2.67     2.60     2.53     2.46     2.40      2.35
E        50     2.95     2.87     2.78     2.71     2.63      2.55
F        27     3.59     3.49     3.41     3.30     3.22      3.15
G        42     3.24     3.13     3.06     2.97     2.89      2.79
H        41     3.25     3.16     3.08     2.99     2.92      2.83
I        43     2.93     2.86     2.78     2.69     2.63      2.57
J        68     2.25     2.19     2.14     2.08     2.02      1.97
K        30     3.55     3.46     3.37     3.27     3.18      3.09
L        27     3.91     3.79     3.66     3.53     3.41      3.30

Table 1: Predictive coding scheme. Results for letters "A" to "L". Fr, Br and Q denote the number of frames, the bitrate and the global quantization value, respectively.

The DCT-based coding method splits the BAP time sequences into BAP segments made of 16 consecutive BAP frames. Encoding a BAP segment involves three steps, applied to all BAPs: (1) determining the 16 coefficient values using the discrete cosine transform (DCT), (2) quantizing and coding the AC coefficients, and (3) quantizing and differentially coding the DC coefficients.

Letter   # Fr   Br Q=1   Br Q=2   Br Q=5   Br Q=8   Br Q=15   Br Q=30
A        44     2.80     2.42     1.83     1.54     1.14      0.86
B        18     2.61     2.22     1.64     1.36     1.08      0.81
C        29     3.65     3.15     2.24     1.87     1.41      1.10
D        49     2.66     2.30     1.69     1.40     1.01      0.74
E        50     3.23     2.85     2.13     1.74     1.30      1.00
F        27     2.04     1.83     1.36     1.12     0.88      0.67
G        42     2.83     2.43     1.76     1.47     1.12      0.84
H        41     2.83     2.39     1.70     1.40     1.08      0.76
I        43     2.70     2.39     1.73     1.45     1.10      0.81
J        68     2.20     1.89     1.34     1.11     0.82      0.59
K        30     2.23     1.94     1.44     1.23     0.93      0.68
L        27     2.74     2.36     1.66     1.38     1.12      0.87

Table 2: DCT-based coding scheme. Results for letters "A" to "L". Fr, Br and Q denote the number of frames, the bitrate and the global quantization value, respectively.

The DC coefficient of an intra-coded segment is stored as is; for an inter-coded segment, the DC coefficient of the previous segment is used as a prediction of the current DC coefficient. The prediction error and the AC coefficients (for both inter- and intra-coded segments) are coded using Huffman tables. Table 2 shows the results for the DCT-based coding of letters "A" to "L". A sketch of the segment-wise transform and quantization step is given below.
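The following sketch illustrates the segment-wise DCT coding of a single BAP track. The unnormalized DCT-II and the uniform quantization rule are generic textbook forms chosen for illustration; the Huffman tables and the exact quantizer of the verification model are omitted.

#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of DCT-based BAP coding for one BAP track: the time sequence is cut
// into segments of 16 consecutive frames, each segment is transformed with an
// (unnormalized) DCT-II, the DC coefficient is coded differentially with
// respect to the previous segment, and the AC coefficients are quantized.
// 'q' plays the role of the global quantization value.
static const int kSegmentLength = 16;

std::vector<double> dct16(const std::vector<double>& x) {      // x.size() == 16
    const double pi = std::acos(-1.0);
    std::vector<double> X(kSegmentLength, 0.0);
    for (int k = 0; k < kSegmentLength; ++k)
        for (int n = 0; n < kSegmentLength; ++n)
            X[k] += x[n] * std::cos(pi * (n + 0.5) * k / kSegmentLength);
    return X;
}

void encodeBapTrack(const std::vector<double>& track, double q,
                    std::vector<int>& dcResiduals,                    // one symbol per segment
                    std::vector<std::vector<int>>& acCoefficients) {  // 15 symbols per segment
    double previousDC = 0.0;
    for (std::size_t s = 0; s + kSegmentLength <= track.size(); s += kSegmentLength) {
        std::vector<double> segment(track.begin() + s, track.begin() + s + kSegmentLength);
        std::vector<double> X = dct16(segment);

        // DC coefficient: predicted by the DC coefficient of the previous segment.
        dcResiduals.push_back(static_cast<int>(std::lround((X[0] - previousDC) / q)));
        previousDC = X[0];

        // AC coefficients: quantized directly.
        std::vector<int> ac(kSegmentLength - 1);
        for (int k = 1; k < kSegmentLength; ++k)
            ac[k - 1] = static_cast<int>(std::lround(X[k] / q));
        acCoefficients.push_back(ac);
    }
}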

In order to objectively compare both coding schemes, we introduce a distortion measure between the original and the decoded sequences, defined as the mean square error of the BAP vectors:

D = \frac{1}{No\_frames} \sum_{i=0}^{No\_frames} \left\| BAP_i^{(o)} - BAP_i^{(d)} \right\|^2 ,

where BAP_i^{(o)} and BAP_i^{(d)} are the original and the decoded BAP vectors of frame i.
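For completeness, this distortion can be computed directly from the two BAP sequences, as in the short sketch below (assuming original and decoded sequences of equal length; all names are illustrative).

#include <cstddef>
#include <vector>

// Mean square error between original and decoded BAP vectors, averaged over
// all frames, as defined above. Each element of 'original' and 'decoded' is
// the BAP vector of one frame.
double bapDistortion(const std::vector<std::vector<double>>& original,
                     const std::vector<std::vector<double>>& decoded) {
    double d = 0.0;
    for (std::size_t i = 0; i < original.size(); ++i) {
        double frameError = 0.0;
        for (std::size_t k = 0; k < original[i].size(); ++k) {
            double diff = original[i][k] - decoded[i][k];
            frameError += diff * diff;                 // squared norm of the BAP difference
        }
        d += frameError;
    }
    return d / static_cast<double>(original.size());  // average over the frames
}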

Figure 3: Predictive versus DCT-based BAP coding schemes. Distortion as a function of bitrate. (a) letter "A"; (b) letter "G".

4. 3D POSE ESTIMATION OF THE HAND

Animating a virtual body so that it reproduces real video sequences requires the ability to estimate BAPs. Since we are involved in sign language analysis/recognition, we are more specifically interested in 3D face and hand pose estimation. We first developed a robust approach for estimating the 3D global head pose [16]. The method is based on a 3D/2D matching between 2D image features (texture and optical flow) and 3D object features of a generic head model (shape and theoretical flow). For 3D hand pose estimation, we propose a novel method which directly exploits the semantic/syntactic structure of sign language [17]. Among the six features characterizing a sign gesture, the hand configuration (i.e. the relative finger positions) has the most discriminating power. Tracking such a feature within a sign sequence proves to be a complex task because of the dynamic and 3D nature of a sign.

In addition, ambiguities due to partial occlusions, global perspective deformations, local deformations and shadowing effects can occur. To overcome such problems, we exploit the specific dynamic structure of a sign, which is characterized by the transitions between two "over-time-stable" hand configurations. The notion of a stable hand configuration makes it possible to take into account slight variations around a reference configuration, called a prototype. Specifically, we have created two types of hand configuration prototypes:
• natural prototypes, extracted from natural sign language video sequences, and
• synthetic prototypes, directly generated from a 3D mesh model of the hand (as previously defined).
Prototypes are indexed by means of a shape descriptor defined as a translation, rotation and scale invariant Hough Transform (τ,ρ,λ HT) [17]. A very compact representation is obtained by taking the Fourier Transform of the Hough coefficients. For each frame of the video sequence, a flesh-color-based segmentation of the hand is performed. The shape of the resulting gray-level mask is analyzed by applying the τ,ρ,λ HT. Then, for each frame, the hand configuration is compared to all prototypes using the mean square error between the corresponding shape descriptors. The prototype corresponding to the minimum distance is then selected (Figure 4), as sketched below.
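The per-frame classification step can be sketched as a nearest-prototype search over the shape descriptors. Computing the τ,ρ,λ HT itself is beyond this illustration, so the descriptor is treated as an already-computed feature vector; all names are assumptions.

#include <cstddef>
#include <limits>
#include <vector>

// Sketch of the per-frame prototype selection: each hand configuration
// prototype is indexed by a shape descriptor (an abstract feature vector,
// standing in for the Fourier coefficients of the tau,rho,lambda Hough
// transform). The descriptor extracted from the current frame is compared to
// every prototype with a mean square error and the closest one is retained.
struct Prototype {
    std::vector<double> descriptor;  // invariant shape descriptor
    std::vector<float> baps;         // hand-configuration BAPs attached to the prototype
};

std::size_t selectPrototype(const std::vector<double>& frameDescriptor,
                            const std::vector<Prototype>& prototypes) {
    std::size_t best = 0;
    double bestError = std::numeric_limits<double>::max();
    for (std::size_t p = 0; p < prototypes.size(); ++p) {
        double error = 0.0;
        for (std::size_t k = 0; k < frameDescriptor.size(); ++k) {
            double diff = frameDescriptor[k] - prototypes[p].descriptor[k];
            error += diff * diff;
        }
        error /= static_cast<double>(frameDescriptor.size());  // mean square error
        if (error < bestError) { bestError = error; best = p; }
    }
    return best;  // the BAPs of the selected prototype give the hand configuration
}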

Figure 4: The τ,ρ,λ HT-based recognition method. Original image (a), associated hand gray-level mask (b), τ,ρ,λ HT performed on the mask (c), and the selected synthetic prototype (d).

Results obtained on two corpora, consisting of letters and words respectively, show the effectiveness of the method. The recognition rate equals 100% for the natural prototypes and 89% for the synthetic prototypes, provided that these prototypes are calibrated with respect to the morphometry of the signer's hand. Finally, once the hand configuration recognition has been successfully achieved, the 3D pose of the hand is defined as the 3D parameters associated with the selected prototype. The virtual 3D hand is then animated, in terms of hand configuration, according to the extracted BAPs. For a complete and realistic hand animation, the global motion and orientation of the hand still have to be added.

5. CONCLUSION

As illustrated above, the MPEG-4 FBA framework provides appropriate functionalities for animating virtual actors, creating virtual universes, and developing gesture analysis and human-machine communication. The high compression rates obtained by transmitting a parametric description instead of images make real-time applications such as video-conferencing or sign language transmission feasible.

6. REFERENCES

[1] D. Kurlander, T. Skelly, and D. Salesin, "Comic Chat", ACM Computer Graphics, Annual Conf. Series, pp. 225-236, 1996.
[2] N. Badler, M. Hollick, and J. Granieri, "Real-Time Control of a Virtual Human Using Minimal Sensors", Presence, 2(1):82-86, 1993.
[3] B. Robertson, "Best behaviors. Digital magic", Computer Graphics World (Supplement), pp. S12-S19, 1996.
[4] S. Stansfield, "Distributed Virtual Reality Simulation System for Situational Training", Presence, 3(4):360-366, 1994.
[5] "Specification for a Standard VRML Humanoid (Draft), Version 1.1", http://ece.uwaterloo.ca:80/~h-anim/newspec.html.
[6] Seamless Solutions Inc. demos, http://www.seamless-solutions.com.
[7] Sign Language Web Site at University Lumière Lyon 2, http://signserver.univ-lyon2.fr.
[8] G. Mozelle, F. Prêteux, J.E. Viallet, "Tele-sign: A Compression Framework for Sign Language Distant Communication", Proceedings SPIE Conference on Mathematical Modeling and Estimation Techniques in Computer Vision, San Diego, CA, Vol. 3457, July 1998, pp. 94-110.
[9] F. Prêteux, M. Preda and G. Mozelle, "Hand Animation and BAPs Extraction: Reports on Core Experiments", ISO/IEC JTC1/SC29/WG11, M3592, July 1998.
[10] "SNHC Verification Model 9.0", ISO/IEC JTC1/SC29/WG11, W2301, Dublin, July 1998.
[11] F. Prêteux, M. Preda and G. Mozelle, "Donation to ISO of Hand Animation Software", ISO/IEC JTC1/SC29/WG11, M3590, Dublin, July 1998.

[12] T. Capin, S. Bandi, J. Esmerado, "Results of Core Experiments on Body Animation", ISO/IEC JTC1/SC29/WG11, M4202, Roma, December 1998.
[13] F. Prêteux, M. Preda, T. Zaharia, "Preliminary results on hand BAT interpolation", ISO/IEC JTC1/SC29/WG11, M4278, Roma, December 1998.
[14] F. Prêteux, M. Preda, T. Zaharia, "Results of Core Experiment on BAP Coding", ISO/IEC JTC1/SC29/WG11, M4283, Roma, December 1998.
[15] F. Prêteux, M. Preda, T. Zaharia, "Predictive versus DCT-based BAP Coding", ISO/IEC JTC1/SC29/WG11, M4254, Seoul, March 1999.
[16] F. Prêteux, M. Malciu, "Model-based head tracking and 3D pose estimation", Proceedings SPIE Conference on Mathematical Modeling and Estimation Techniques in Computer Vision, San Diego, CA, Vol. 3457, July 1998, pp. 94-110.
[17] T. Zaharia, M. Preda, F. Prêteux, "Sign Language Indexation within the MPEG-7 Framework", accepted to SPIE Conference on Mathematical Modeling, Bayesian Estimation, and Inverse Problems, Vol. 3816, Denver, July 1999.
