Rate-Distortion-Optimized Video Coding Using a 3-D Head Model

Peter Eisert, Thomas Wiegand, and Bernd Girod
Telecommunications Laboratory, University of Erlangen-Nuremberg
Cauerstrasse 7, D-91058 Erlangen, Germany
E-mail: [email protected]
URL: http://www.nt.e-technik.uni-erlangen.de/~eisert/mbbc.html
Abstract: Model-based video synthesis is combined with block-based motion-compensated video coding for rate-distortion efficient transmission of head-and-shoulder sequences. Two frames are utilized for block-based motion-compensated prediction of the current video frame: one is the previously reconstructed frame and the other is provided by a model-based coder. The model-based coder uses a parameterized 3-D head model for the estimation of the motion as well as the facial expressions of a person. A synthetic output frame is rendered that serves as a second reference frame for block-based motion compensation of a video coder. The coder is based on the H.263 syntax, which is extended to multi-frame motion-compensated prediction incorporating a picture reference parameter. Rate-distortion optimization is employed for the coding control of the video coder, providing robustness against model failures. Hence, the coding efficiency of the proposed combined codec never degrades below what is achievable with H.263 syntax, even if the model-based coder cannot describe the current scene and the second reference frame is useless. On the other hand, if the objects in the scene are well described by the model-based coder, significant coding gains can be achieved. Experiments are conducted for two CIF sequences with accurate 3-D head shape information from a 3-D laser scanner. In addition, results for the well-known test sequence Akiyo are presented, where instead of an accurate 3-D scan a generic head model is used, enabling low-cost applications such as video-telephony or video-conferencing. Bit-rate savings of about 35% are achieved at equal average PSNR when comparing the proposed codec to TMN-10, the state-of-the-art test model of the H.263 standard. When encoding at equal bit-rate for both codecs, significant improvements in terms of subjective quality are visible.
I. Introduction
In recent years, several video coding standards such as H.261, MPEG-1, MPEG-2, and H.263 have been introduced, which mainly address the compression of generic video data for digital storage and communication services. H.263 [1], as well as the other standards, is a hybrid video coding scheme which consists of block-based motion-compensated prediction (MCP) and DCT-based quantization of the prediction error. Also the future MPEG-4 standard [2] will follow the same video coding approach. These schemes utilize the statistics of the video signal without knowledge of the semantic content of the frames and can therefore robustly be used for arbitrary scenes. In case the semantic information of the scene can be exploited, higher coding efficiency may be achieved by model-based video codecs [3], [4]. For example, 3-D models that
describe the shape and texture of the objects in the scene could be used. The 3-D object descriptions are encoded only once. When encoding a video sequence, individual video frames are characterized by 3-D motion and deformation parameters of these objects. In most cases, the parameters can be transmitted at extremely small bit-rates, making the scheme suitable for very low bit-rate applications. Such a 3-D model-based coder is restricted to scenes that can be composed of objects that are known by the decoder. One typical class of scenes are head-and-shoulder sequences, which are frequently found in applications such as video-telephone or video-conferencing systems. For head-and-shoulder scenes, bit-rates of about 1 kbit/s with acceptable quality can be achieved [5]. This has also motivated the recently determined synthetic and natural hybrid coding (SNHC) part of the MPEG-4 standard [2]. SNHC allows the transmission of a 3-D face model that can be animated to generate different facial expressions. The transmission of SNHC-based 3-D models is supported in combination with 2-D video streams [2]. Using object boundary information (in MPEG-4 called video object planes), the mode decision can be done individually for different regions in the frame and the resulting picture can be displayed using a compositor. However, due to the additional bit-rate needed for transmitting the shape of these regions, this method may require a prohibitive amount of overhead information, adversely affecting its application for very low bit-rate coding, which is the objective of this paper. Note that the compression ratio obtained in our experimental results is about 1200:1, as will be described later in the paper.

The combination of traditional hybrid video coding methods with model-based coding has been proposed by Chowdhury et al. in 1994 [6]. In [6], a switched model-based coder is introduced that decides between the encoded output frames from an H.261 block-based and a 3-D model-based coder. The decision which frame to take is based on rate and distortion. However, the mode decision is only done for a complete frame, and therefore the information from the 3-D model cannot be exploited if parts of the frame cannot be described by the model-based coder. An extension to the switched model-based coder is the layered coder published by Musmann in 1995 [7] as well as
Kampmann and Ostermann in 1997 [8]. The layered coder chooses the output from up to five different coders. The mode decision between the layers is also done frame-wise or object-wise and no combined encoding is performed.

In this paper, we present an extension of an H.263 video codec [1] that incorporates information from a model-based coder. Instead of exclusively predicting the current frame of the video sequence from the previously decoded frame, motion compensation using the rendered synthetic output frame of the model-based coder is also considered. The block-based video coder decides which prediction is efficient in terms of rate-distortion performance by minimizing a Lagrangian cost function D + λR, where the distortion D is minimized together with the rate R. The Lagrange parameter λ controls the balance between distortion and rate. A large value of λ corresponds to low bit-rate and low fidelity solutions, while a small value results in high bit-rate and high fidelity [9], [10], [11]. The minimization proceeds over both prediction signals and chooses the most efficient one in terms of the cost function. The incorporation of the model-based coder into the motion compensation enables the coding option to further refine an imperfect model-based prediction. However, neither the switched coder nor the layered coder nor MPEG-4 allows efficient residual coding of the model frame, leading to a less flexible video codec. The coding efficiency of our combined codec never degrades below H.263 in the case that the model-based coder cannot describe the current scene. However, if the objects in the scene are compliant with the 3-D models in the codec, a significant improvement in coding efficiency can be achieved.

This paper is organized as follows. We first describe the architecture of the video coder that combines the traditional hybrid video coding loop with a model-based coder that is able to encode head-and-shoulder scenes at very low bit-rates. Then, the underlying semantic model is presented and the algorithm for the estimation of Facial Animation Parameters (FAPs), which determine the rendered output frame, is explained. Given the rendered output frame of the model-based coder, we describe how bit allocation is done in our combined block- and model-based coder using Lagrangian optimization techniques. Finally, experimental results verify the improved rate-distortion performance of the proposed scheme compared to TMN-10, the test model of the H.263 standard.
II. Video Coding Architecture

Figure 1 shows the architecture of the proposed video codec. This figure depicts the well-known hybrid video coding loop that is extended by a model-based codec. The model-based codec runs simultaneously with the hybrid video codec, generating a synthetic model frame. This model frame is employed as a second reference frame for block-based MCP in addition to the previously reconstructed reference frame. For each block, the video coder decides which of the two frames to reference for MCP. The bit-rate reduction for the proposed scheme arises from those parts of the image that are well approximated by the model frame. For these blocks, the bit-rate required for transmission of the motion vector and the DCT coefficients of the residual coding is often highly reduced.

Fig. 1. Structure of the video coder with a predictor that uses the previously decoded frame and the current model frame.

Incorporating the model-based codec into an H.263-based video codec requires changes to the syntax. The H.263 syntax is modified by extending the inter-prediction macroblock modes in order to enable multi-frame MCP. More precisely, the inter-prediction macroblock modes INTER and UNCODED are assigned one code word representing the picture reference parameter for the entire macroblock. The INTER-4V macroblock mode utilizes four picture reference parameters, each associated with one of the four 8×8 block motion vectors. For further details on the H.263 syntax, please refer to the ITU-T Recommendation [1]. In addition to the picture reference parameter, the FAPs for rendering the model frame at the decoder side are included in the picture header. In the next section, we will describe the model-based codec in more detail. After that, we will return to the combined video codec and explain the coder control.
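To make the two-frame prediction concrete, the following sketch illustrates per-block selection between the previously reconstructed frame and the model frame; the full coder additionally weighs the motion bit-rate with a Lagrange multiplier (Section IV). All function and variable names here are illustrative and not part of the H.263 syntax or the authors' implementation.

```python
import numpy as np

def sad(block_a, block_b):
    """Sum of absolute differences between two luminance blocks."""
    return np.abs(block_a.astype(int) - block_b.astype(int)).sum()

def predict_block(original, references, top, left, size=16, search=16):
    """Pick the motion vector and picture reference (0: reconstructed frame,
    1: model frame) that minimize the SAD for one block. Illustrative sketch only."""
    target = original[top:top + size, left:left + size]
    best = None
    for ref_idx, ref in enumerate(references):
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = top + dy, left + dx
                if y < 0 or x < 0 or y + size > ref.shape[0] or x + size > ref.shape[1]:
                    continue
                cost = sad(target, ref[y:y + size, x:x + size])
                if best is None or cost < best[0]:
                    best = (cost, (dy, dx), ref_idx)  # ref_idx would be signaled with one bit
    return best
```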
III. Model-Based Codec

The structure of the model-based codec, which is described in the following, is depicted in Fig. 2. The coder analyzes the incoming frames and estimates the parameters of the 3-D motion and deformation of the head model. These deformations are represented by a set of FAPs that are entropy coded and transmitted through the channel. The 3-D head model and the facial expression synthesis are incorporated into the parameter estimation. The 3-D head model consists of shape, texture, and the description of facial expressions. For synthesis of facial expressions, the transmitted FAPs are used to deform the 3-D head model. Finally, the head-and-shoulder scene is approximated by simply rendering a 3-D model frame.

Fig. 2. Basic structure of our model-based codec.

Fig. 4. Transformation from FAPs to image points.
A. 3-D Head Model
For the description of the shape of head and shoulders, we use a generic 3-D model with fixed topology similar to the well-known Candide model [12]. Texture is mapped onto the triangular mesh to obtain a photorealistic appearance. In contrast to the Candide model, we describe the object's surface with triangular B-splines [13] in order to reduce the number of degrees of freedom of the shape. This simplifies modeling and estimation of the facial expressions and does not severely restrict the possible shapes of the surface, since facial expressions result in smooth movements of surface points due to the anatomical properties of tissue and muscles. Hoch et al. [14] have shown that B-splines are well suited to model the surface properties of facial skin. To reduce the computational complexity of the rendering, the B-splines are only evaluated at fixed discrete positions of the surface. These points are called vertices. They form a triangular mesh that approximates the smooth B-spline surface. The number of vertices and thus triangles can be varied to obtain a trade-off between approximation accuracy and rendering complexity. The positions of the vertices in 3-D space and the shape of the head are determined in our model by the location of 231 control points of the B-spline surface. An example of such a shaped triangular mesh together with the corresponding textured model is shown in Fig. 3.

Fig. 3. Left: hidden-line representation of the head model; right: corresponding textured version.
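As an illustration of how such a model might be organized in software, the following minimal data structure groups the control points, precomputed B-spline basis weights, and the triangle list; the field names are our own choices and are not taken from the paper or from MPEG-4.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class HeadModel:
    control_points: np.ndarray  # (231, 3) B-spline control points defining the shape
    basis_weights: list         # per vertex: list of (control_point_index, weight) pairs
    triangles: np.ndarray       # (T, 3) vertex indices forming the triangular mesh
    texture: np.ndarray         # texture map used for photorealistic rendering
```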
B. Synthesis of the Model Frame

The synthesis of the model frame consists of first animating the 3-D head model using the FAPs and then rendering the model frame. The mechanism by which changes of the FAPs influence the shape and mimic of the face in the rendered model frame is defined by a set of cascaded transformations as shown in Fig. 4. Given a set of FAPs defining a certain facial expression, the corresponding 2-D model frame is created by first placing the control points in order to shape the B-spline surface. Using the basis functions of the B-splines, the algorithm computes the position of the vertices from the control points. The 3-D location of all object points on the triangle surface is specified by their barycentric coordinates. Finally, the 2-D object points in the model frame are obtained by projecting the 3-D points into the image plane. In the following, all transformations are explained in more detail.

B.1 Facial Animation Parameters

To describe the mimic in the face of a person, we parameterize the facial expressions following the proposal of the MPEG-4/SNHC group [2]. According to that scheme, every facial expression can be generated by a superposition of different basic expressions, each quantified by a facial animation parameter (FAP). These FAPs describe elementary motion fields of the face's surface, including both global motion, like head rotation, and local motion, like eye or mouth movement.

B.2 Control Points

Changes of the FAPs influence the shape and mimic of the 3-D head model. The relation between a parameterized facial expression and the surface shape is defined by a set of transformations that is applied to the control points of the B-spline surface. The final control point position c_i is obtained by concatenating all transformations, each assigned to one FAP, according to

    c_i = T_{FAP_K}( ... T_{FAP_1}( T_{FAP_0}(c_{i0}) ) ... ),                    (1)

with c_{i0} being the initial control point location corresponding to a neutral expression of the person. Each transformation describes either a translation or a rotation

    T_{FAP_k}(c_i) = c_i + d_{ik} FAP_k                    (translation)
    T_{FAP_k}(c_i) = R_{FAP_k}(c_i - o_k) + o_k            (rotation)             (2)

In case of a translation, the control point c_i is moved in direction d_{ik} by the amount ||d_{ik}|| FAP_k. For rotational movements, the control point c_i is rotated as specified by R_{FAP_k} around an axis through the point o_k.

To generate the facial expression specified by a set of FAPs, the transformations are computed and sequentially applied to the control point positions. Hence, the resulting facial expression is not independent of the order of the transformations. We start with local deformations in the face and apply global head rotation and translation last.
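A minimal sketch of this transformation cascade, under the assumptions of Eqs. (1) and (2), is given below. Each FAP is assumed to carry either a displacement direction or a rotation axis and origin per control point, and the mapping of a FAP value to a rotation angle is assumed to be linear; the function names and the `spec` dictionary keys are ours, chosen purely for illustration.

```python
import numpy as np

def rotation_matrix(axis, angle):
    """Rodrigues' formula: rotation by `angle` (radians) around unit vector `axis`."""
    axis = axis / np.linalg.norm(axis)
    k = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    return np.eye(3) + np.sin(angle) * k + (1 - np.cos(angle)) * (k @ k)

def apply_fap(c, fap_value, spec):
    """Apply one FAP transformation to a single control point c (Eq. (2))."""
    if spec["type"] == "translation":
        return c + spec["d"] * fap_value                   # move along d by ||d|| * FAP
    angle = spec["scale"] * fap_value                      # assumed linear FAP-to-angle map
    R = rotation_matrix(spec["axis"], angle)
    return R @ (c - spec["origin"]) + spec["origin"]       # rotate around axis through o_k

def animate_control_points(c0, faps, specs):
    """Cascade all FAP transformations (Eq. (1)); specs are ordered so that local
    deformations come first and global head rotation and translation come last."""
    c = np.array(c0, dtype=float)
    for fap_value, spec in zip(faps, specs):
        c = apply_fap(c, fap_value, spec)
    return c
```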
B.3 Vertices

Once all control points are placed according to the FAPs, the shape of the B-spline surface is defined everywhere. Since we approximate this surface by a triangular mesh,
we have to compute the position of the vertices. This is accomplished using the linear relation

    v_j = Σ_{i∈I_j} b_{ji} c_i                                                    (3)

between a vertex v_j and the control points c_i, with b_{ji} being the basis functions of the B-spline. The basis functions for each vertex are constant and computed off-line. Note that I_j is the index set containing the indices of the control points that influence the position of vertex v_j. The number of indices in this set is usually between 3 and 6.

B.4 3-D Object Points

Arbitrary object points x on the surface of the triangular mesh are computed from the position of the vertices using

    x = Σ_{j=0}^{2} β_j v_j = Σ_i ( Σ_{j=0}^{2} β_j b_{ji} ) c_i,                 (4)

where the β_j are the barycentric coordinates of the object point related to the surface triangle enclosing it.

B.5 Rendering of the Model Frame

Given the 3-D object points, a 2-D model frame is rendered using perspective projection. The geometry of the camera model is shown in Fig. 5.

Fig. 5. Camera model and its associated coordinate systems.

The 3-D coordinates of an object point x = [x y z]^T are projected into the image plane according to

    X = X_0 - f_x x/z,    Y = Y_0 - f_y y/z,                                      (5)

with f_x and f_y denoting the scaled focal length parameters. These internal camera parameters transform world coordinates into pixel coordinates X and Y. Additionally, they allow the use of non-square pixel geometries. The two parameters X_0 and Y_0 describe the location of the optical axis and its deviation from the image center due to inaccurate placement of the CCD sensor in the camera. The four parameters f_x, f_y, X_0, and Y_0 are obtained from an initial camera calibration using Tsai's algorithm [15].
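A compact sketch of this synthesis chain, Eqs. (3)-(5), from control points to projected 2-D image points could look as follows; the dense weight matrix and the variable names are illustrative choices and not the authors' implementation.

```python
import numpy as np

def compute_vertices(control_points, basis):
    """Eq. (3): each vertex is a fixed linear combination of a few control points.
    `basis` is a (num_vertices, num_control_points) matrix of B-spline weights b_ji
    (mostly zero, typically 3 to 6 non-zero entries per row)."""
    return basis @ control_points                      # (V, 3)

def object_point(vertices, triangle, bary):
    """Eq. (4): point on a triangle given barycentric coordinates beta_0..beta_2."""
    v = vertices[triangle]                             # (3, 3): the triangle corners
    return np.asarray(bary) @ v

def project(points, fx, fy, x0, y0):
    """Eq. (5): perspective projection with scaled focal lengths and principal point."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    X = x0 - fx * x / z
    Y = y0 - fy * y / z
    return np.stack([X, Y], axis=1)
```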
C. Scene Analysis and Encoding of Facial Animation Parameters

The model-based coder analyzes the incoming frames and estimates the 3-D motion and facial expression of a person. The determined FAPs are entropy coded and transmitted to the decoder. The 3-D head model and the motion constraints used by the facial expression synthesis are incorporated into the parameter estimation as described in the following.

C.1 Facial Parameter Estimation

In our model-based coder, all FAPs are estimated simultaneously using a hierarchical optical-flow-based method [16]. We currently employ a hierarchy of 3 spatial resolution layers, with CIF as the highest resolution and each subsequent lower resolution layer subsampled by a factor of two vertically and horizontally. We use the whole picture of the face for the estimation, in contrast to feature-based approaches, where only discrete feature point correspondences are exploited. To simplify the optimization in the high-dimensional parameter space, a linearized solution is directly computed using information from the optical flow and the motion constraints from the head model.

Fig. 6. Analysis-synthesis loop of the coder.

In the optimization, an analysis-synthesis loop is employed [17] as shown in Fig. 6. The algorithm estimates the facial parameter changes between the previous synthetic frame Î[n-1] and the current frame I[n] from the video sequence. This approximate solution is used to compensate the differences between the two frames by rendering the deformed 3-D model at the new position. The remaining linearization errors are reduced by iterating through different levels of resolution. Note that the texture map of the head model, once complete, does not need to be updated during the video sequence. All changes between the model image and the video frame must therefore be compensated by the underlying 3-D model, the motion model, and the illumination model. To also allow for changes in the illumination, which have a large influence on the image intensities, we estimate the current illumination situation as well [18]. The parameters that describe the direction and intensity of the incoming light are then used at the decoder to reduce the differences between real and model frames caused by photometric effects. In the following, the components of the motion estimation algorithm are described in more detail.

C.2 3-D Motion Constraint

Estimating the 3-D motion of flexible bodies from 2-D images is a difficult task. However, the deformation of most objects can be constrained to simplify the estimation process. For the 3-D head model, the constraint is composed of the chain of transformations depicted in Fig. 4 and describes a relation between the 3-D object point motion and changes of FAPs. The motion and deformation of the 3-D head model surface is restricted to a superposition of certain motion fields which are controlled by the unknown FAPs. Following the description of the transformations in Section III-B, we can set up one linear 3-D motion equation for each 3-D object point of the surface. This constraint determines how the point moves in 3-D space if the facial animation parameters are changed. Instead of the 6 parameters specifying a rigid body motion, we have a much larger number of unknowns. Since every pixel in the image that contains the object provides us with one equation for the parameter estimation, we still have an over-determined set of linear equations that can be solved easily.

For the motion estimation, we do not estimate the absolute values of the FAPs. We rather estimate relative changes between the previous frame and the current one. The analysis-synthesis loop assures that no error accumulation occurs when the relative changes are summed up over a long time. In order to obtain a constraint that defines the relative motion between two frames, we have to reformulate Eqs. (1)-(4). In a first step, a relation between the control point position c'_i of the current frame and the position c_i of the previous frame is established. This relation is given by Eqs. (1) and (2). To obtain the new control point coordinates, we have to apply the inverse transformations of the previous frame T^{-1}_{FAP_0}, ..., T^{-1}_{FAP_K} in descending order and then the transformations of the current frame T'_{FAP_0}, ..., T'_{FAP_K} in ascending order

    c'_i = T'_{FAP_K}( ... T'_{FAP_0}( T^{-1}_{FAP_0}( ... T^{-1}_{FAP_K}(c_i) ... ) ) ... ).      (6)

This leads to a complicated description which cannot easily be incorporated into the motion estimation framework. We therefore make use of the assumption that the rotation angles originating from changes in the FAPs are relatively small between two successive video frames. The rotation of a control point around an axis with direction ω can then be substituted by a translation along the tangent

    T_{FAP_k}(c_i) = R_{FAP_k}(c_i - o_k) + o_k  ≈  c_i + d_{ik} ΔFAP_k,          (7)

with

    d_{ik} = s_k (ω × (c_i - o_k)) / ||ω × (c_i - o_k)||.                         (8)

Here, s_k denotes a scaling factor that determines the range of the corresponding FAP, and

    ΔFAP_k = FAP'_k - FAP_k                                                       (9)

is the change of facial animation parameter k between the two frames. We now have a uniform description for rotation as well as translation and can estimate both global and local motion simultaneously. The small error caused by the approximation is compensated after some iterations in the feedback structure shown in Fig. 6. The desired function for the control point motion results in

    c'_i = c_i + Σ_k ΔFAP_k \hat{d}_{ik},                                         (10)

with

    \hat{d}_{ik} = ( Π_{j=k+1}^{K} R_{FAP_j} ) d_{ik}                             (11)
being the new rotated 3-D direction vector of the corresponding movement. For translational transformations, R_{FAP_j} is the identity matrix.

Combination of Eqs. (4) and (10) leads to the motion equation for a surface point

    x' = x + Σ_k t_k ΔFAP_k = x + T ΔFAP,                                         (12)

where the t_k are the new direction vectors corresponding to the facial animation parameters, calculated from the \hat{d}_{ik} by applying the linear transforms (3) and (4). T combines all direction vectors in a single matrix and ΔFAP is the vector of all FAP changes. The matrix T can be derived from the 3-D model. Equation (12) describes the change of the 3-D point location x' - x as a linear function of the FAP changes ΔFAP. Due to the presence of local deformations, we cannot write the transformation with the same matrix T for all object points as in the case of rigid body motion. Rather, T has to be set up for each image point independently.

C.3 Gradient-based FAP Determination

For the estimation of the facial expression parameters, the complete picture of the head-and-shoulder scene is incorporated into the optical flow constraint equation

    I_X u + I_Y v + I_t = 0,                                                      (13)

where [I_X  I_Y] is the gradient of the intensity at point [X  Y], u and v are the velocities in x- and y-direction, and I_t is the intensity gradient in the temporal direction. Since u and v can take on arbitrary values at each point [X  Y], this set of equations is under-determined, and we need additional constraints to compute a unique solution. Instead of determining the optical flow field by using smoothness constraints and then extracting the motion parameter set from this flow field, we estimate the facial animation parameters from (13) together with the 3-D motion equations of the head's object points [17], [19]. Our technique is very similar to the one described in [20]. One main difference is that we estimate the motion from synthetic frames and camera images using an analysis-synthesis loop as shown in Fig. 6. This allows the usage of a hierarchical framework that can handle larger motion vectors between two successive frames. Another difference between the two approaches is that we use a textured 3-D model, which permits us to generate new synthetic views of our virtual scene after having estimated the motion.

Writing Eq. (12) for each component leads to

    x' = x (1 + (1/x) t_x ΔFAP)                                                   (14)
    y' = y (1 + (1/y) t_y ΔFAP)                                                   (15)
    z' = z (1 + (1/z) t_z ΔFAP),                                                  (16)

with t_x, t_y, and t_z being the row vectors of the matrix T. Dividing Eqs. (14) and (15) by (16), incorporating the camera model (5), and using a first-order approximation yields

    u = X' - X = -(1/z) (f_x t_x + (X - X_0) t_z) ΔFAP                            (17)
    v = Y' - Y = -(1/z) (f_y t_y + (Y - Y_0) t_z) ΔFAP.                           (18)

These equations serve as the motion constraint in the 2-D image plane. Together with Eq. (13), a linear equation at each pixel position can be written as

    (1/z) [ I_X f_x t_x + I_Y f_y t_y + (I_X (X - X_0) + I_Y (Y - Y_0)) t_z ] ΔFAP = I_t,      (19)

with z being the depth information obtained from the model. This over-determined system is solved in a least-squares sense with low computational complexity. The size of the system depends directly on the number of FAPs.

In order to increase the robustness of the high-dimensional parameter optimization, the range of the parameter space is restricted. We add constraints given by the matrices A and B to specify the allowed range of the FAPs,

    A · FAP ≥ 0,                                                                  (20)

or to restrict the changes in the facial parameters between two successive frames,

    B · ΔFAP ≥ 0.                                                                 (21)

One example of a constraint in Eq. (20) is the restriction not to close the eyelid more than completely. The additional terms are incorporated in the optimization of the least-squares problem given by Eq. (19) using an LSI estimator [21].
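As a rough illustration of how the per-pixel constraints (19) can be stacked and solved, consider the sketch below. It ignores the inequality constraints (20) and (21) and uses an unconstrained least-squares solve instead of the LSI estimator, so it only approximates the procedure described above; the data layout is an assumption on our part.

```python
import numpy as np

def estimate_fap_changes(IX, IY, It, T_rows, z, fx, fy, X0, Y0, coords):
    """Stack one row of Eq. (19) per object pixel and solve for the FAP changes.

    IX, IY, It : spatial and temporal intensity gradients (2-D arrays)
    T_rows     : dict mapping pixel (X, Y) -> (t_x, t_y, t_z), each a length-K array
    z          : dict mapping pixel (X, Y) -> model depth at that pixel
    coords     : iterable of integer pixel positions (X, Y) covered by the head model
    """
    rows, rhs = [], []
    for (X, Y) in coords:
        tx, ty, tz = T_rows[(X, Y)]
        row = (IX[Y, X] * fx * tx + IY[Y, X] * fy * ty
               + (IX[Y, X] * (X - X0) + IY[Y, X] * (Y - Y0)) * tz) / z[(X, Y)]
        rows.append(row)
        rhs.append(It[Y, X])
    delta_fap, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return delta_fap
```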
C.4 Illumination Estimation

Motion estimation algorithms are often based on the constant brightness assumption. However, this assumption is violated if the illumination conditions in the scene change or the objects move. To overcome this restriction, we add an illumination model to the virtual scene that describes the photometric properties of colored light and surfaces. In contrast to the methods proposed in [22] and [23], we incorporate the 3-D information from our head model, which leads to a linear low-complexity algorithm for the estimation of the illumination parameters.

The incoming light in the original scene is assumed to consist of ambient light and a directional light source with illumination direction l. The surface is modeled by Lambert reflection [24], and we describe the intensity relation between the video frame I and the corresponding intensity of the texture map I_tex by

    I^R = I_tex^R (c_amb^R + c_dir^R max{-n·l, 0})
    I^G = I_tex^G (c_amb^G + c_dir^G max{-n·l, 0})                                (22)
    I^B = I_tex^B (c_amb^B + c_dir^B max{-n·l, 0}),

with c_amb and c_dir being the relative intensities of ambient and directional light, respectively [18]. The surface normal n is derived from the 3-D head model. The Lambert model is applied to all three RGB color components independently with a common direction of the incoming light. These equations have 8 degrees of freedom that specify the current illumination condition. By estimating the illumination parameters c^i_amb, c^i_dir, and the illuminant direction l, we are able to compensate intensity errors between the model frame and the video frame according to the illumination model. However, the unknowns are non-linearly coupled, which complicates the estimation. When dividing the estimation process into two steps, we can use low-complexity linear estimators for the unknowns. First, we determine the illumination direction; in a second step, the remaining photometric properties c^i_amb and c^i_dir are computed.

To obtain the illuminant direction l, we divide each component in (22) by its corresponding texture intensity I^i_tex and sum up the three equations. We obtain a system with four unknowns,

    [1  -n_x  -n_y  -n_z] [c_amb  c_dir l_x  c_dir l_y  c_dir l_z]^T = I^R/I_tex^R + I^G/I_tex^G + I^B/I_tex^B,      (23)

where the new constants c_amb = c^R_amb + c^G_amb + c^B_amb and c_dir = c^R_dir + c^G_dir + c^B_dir are no longer considered in the following. Since (23) contains quotients of intensities, we can improve the error measure for the minimization of the intensity differences between the camera image and the texture by using a weighted least-squares estimator with weights I_tex^R + I_tex^G + I_tex^B for the solution of (23).

With the known illumination vector, we obtain three independent systems of equations of size 2x2 (shown here for the red color component)

    I_tex^R [1  d] [c^R_amb  c^R_dir]^T = I^R                                     (24)

that can again be solved with low complexity in a least-squares sense. We use a non-negative least-squares estimator (NNLS) [21] to constrain the intensities to positive values. The reflectivity d = max{-n·l, 0} is calculated from the previously determined light direction and the surface normals from the 3-D model. Note that the images from the video camera are gamma-pre-distorted [25], which has to be compensated before estimating the photometric properties. The estimated variables are then used for the compensation of the illumination differences between the camera images and the synthetic images using Eq. (22).
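A hedged sketch of this two-step estimation is given below: a weighted least-squares solve for the light direction, Eq. (23), followed by a per-channel non-negative least-squares fit, Eq. (24). It glosses over the gamma pre-distortion and visibility handling, and all names and the data layout are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import nnls

def estimate_illumination(I, I_tex, normals):
    """I, I_tex: (N, 3) RGB intensities of the frame and the texture for N surface pixels.
    normals: (N, 3) surface normals obtained from the 3-D head model."""
    # Step 1: light direction from Eq. (23), weighted by the summed texture intensity.
    w = I_tex.sum(axis=1)
    A = np.hstack([np.ones((len(I), 1)), -normals])        # rows [1, -nx, -ny, -nz]
    b = (I / I_tex).sum(axis=1)
    p, *_ = np.linalg.lstsq(A * w[:, None], b * w, rcond=None)
    l = p[1:] / np.linalg.norm(p[1:])                       # illuminant direction

    # Step 2: per-channel ambient/directional intensities from Eq. (24) via NNLS.
    d = np.maximum(-normals @ l, 0.0)                       # Lambert reflectivity
    coeffs = []
    for ch in range(3):
        M = np.column_stack([I_tex[:, ch], I_tex[:, ch] * d])
        c, _ = nnls(M, I[:, ch])                            # [c_amb, c_dir] >= 0
        coeffs.append(c)
    return l, np.array(coeffs)
```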
C.5 Encoding of Facial Animation Parameters

In our experiments, 19 FAPs are estimated. These parameters include global head rotation and translation (6 parameters), movement of the eyebrows (4 parameters), two parameters for eye blinking, and 7 parameters for the motion of the mouth and the lips. For the transmission of the FAPs, we predict the current values from the previous frame, scale and quantize them. An arithmetic coder that is initialized with experimentally determined probabilities is then used to encode the quantized values. Note that the training set for the arithmetic coder is separate from the test set. The parameters for the body motion (4 parameters) and the illumination model (8 values) are coded accordingly. The resulting bit-stream has to be transmitted as side information if the synthetic frame is employed for prediction.
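The temporal prediction and quantization step could be sketched as follows; the scale factors and quantizer step are placeholders, and the arithmetic coding stage is only indicated, since the paper does not specify its implementation details.

```python
import numpy as np

def encode_fap_residuals(fap_current, fap_previous, scale, step=1.0):
    """Predict each FAP from the previous frame, scale, and quantize the residual.
    The resulting integer symbols would then be fed to the arithmetic coder."""
    residual = (np.asarray(fap_current) - np.asarray(fap_previous)) * np.asarray(scale)
    return np.round(residual / step).astype(int)

def reconstruct_faps(symbols, fap_previous, scale, step=1.0):
    """Decoder-side reconstruction, also used as the prediction for the next frame."""
    return np.asarray(fap_previous) + symbols * step / np.asarray(scale)
```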
IV. Rate-Constrained Coder Control

The coder control employed for the proposed scheme mainly follows the specifications of TMN-10 [26] as proposed to the ITU-T in [27], since the optimization method in [27] has proven efficient while requiring a reasonable amount of computational complexity. A further motivation is the use of the H.263-based TMN-10 coder as the anchor for comparisons. In the following, we briefly describe the TMN-10 scheme and explain the extensions to multi-frame motion-compensated prediction enabling the incorporation of the model-based coder into H.263.

The problem of optimum bit allocation to the motion vectors and the residual coding in any hybrid video coder is a non-separable problem requiring a high amount of computation. To circumvent this joint optimization, we split the problem into two parts: motion estimation and mode decision. Motion estimation determines the motion vector and the picture reference parameter to provide the motion-compensated signal. Mode decision determines the macroblock mode, which includes the MCP parameters, the DCT coefficients, and coder control information. Motion estimation and mode decision are conducted for each macroblock given the decisions made for past macroblocks.

Our block-based motion estimation proceeds over both reference frames, the previously reconstructed one and the model frame. For each block, a Lagrangian cost function is minimized [28], [29] that is given by

    D_DFD(v, Δ) + λ_MOTION R_MOTION(v, Δ),                                        (25)

where the distortion D_DFD is measured as the sum of absolute differences (SAD) between the luminance pixels in the original block and the block that is displaced by v in the reference frame Δ. The rate term R_MOTION is associated with the motion vector v and the picture reference Δ. The motion vector v is entropy coded according to the H.263 specification, while the picture reference Δ is signaled using one bit. The motion search covers, for both frames, a range of 16 pixels horizontally and vertically. However, we have noticed that large displacements are very unlikely to occur when the model frame is referenced, leaving further room for optimization.

Given the motion vectors and picture reference parameters, the macroblock modes are chosen. Again, we employ a rate-constrained decision scheme where a Lagrangian cost function is minimized for each macroblock [30]

    D_REC(h, v, Δ, c) + λ_MODE R_REC(h, v, Δ, c).                                 (26)

Here, the distortion after reconstruction D_REC, measured as the sum of squared differences (SSD), is weighted against the bit-rate R_REC using the Lagrange multiplier λ_MODE. The corresponding rate term is given by the total bit-rate R_REC that is needed to transmit and reconstruct a particular macroblock mode, including the macroblock header h, the motion information v and Δ, as well as the DCT coefficients c. The mode decision determines whether to code each macroblock using the H.263 modes INTER, INTER-4V, or INTRA [1]. Following [31], the Lagrange multiplier for the mode decision is chosen as

    λ_MODE = 0.85 Q^2,                                                            (27)

with Q being the DCT quantizer value. For the Lagrange multiplier used in the motion estimation, λ_MOTION, we make an adjustment to this relationship to allow the use of the SAD measure. Experimentally, we have found that an effective method is to measure the distortion during motion estimation using the SAD rather than the SSD and to simply adjust the Lagrange multiplier for the lack of the squaring operation in the error computation, as given by λ_MOTION = √(λ_MODE).
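The following sketch shows how the Lagrangian selection of Eqs. (25)-(27) might be organized; the candidate-generation and rate functions are stubs assumed for illustration, and only the cost comparison reflects the procedure described above.

```python
import math

def lambda_mode(Q):
    """Eq. (27): Lagrange multiplier for the mode decision."""
    return 0.85 * Q * Q

def lambda_motion(Q):
    """Adjusted multiplier for SAD-based motion estimation: sqrt of lambda_mode."""
    return math.sqrt(lambda_mode(Q))

def best_motion(candidates, Q):
    """candidates: iterable of (sad, motion_rate_bits, mv, ref) gathered over both
    reference frames; pick the minimum of SAD + lambda_MOTION * R_MOTION (Eq. (25))."""
    lm = lambda_motion(Q)
    return min(candidates, key=lambda c: c[0] + lm * c[1])

def best_mode(mode_costs, Q):
    """mode_costs: list of (mode_name, ssd_after_reconstruction, total_rate_bits);
    pick the minimum of SSD + lambda_MODE * R_REC (Eq. (26))."""
    lmode = lambda_mode(Q)
    return min(mode_costs, key=lambda m: m[1] + lmode * m[2])
```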
V. Experimental Results

Experiments are conducted using two natural CIF sequences, Peter and Eckehard, consisting of 200 and 100 frames, respectively. Both sequences are encoded at 8.33 Hz. Additionally, 200 frames of the standard video test sequence Akiyo are encoded at 10 Hz and CIF resolution. For the two sequences Peter and Eckehard, the shape information from a 3-D laser scan of these persons is exploited to create the head model. We map the texture onto the 3-D model and optimize the control points' positions to adapt the spline surface to the measured data. This 3-D head model is then used to estimate the FAPs and the illumination parameters. At coder and decoder side, the model is animated according to the quantized values and a synthetic sequence of model frames is rendered. Figure 7 shows one frame of the sequence Peter and its synthetic counterpart. Since the 3-D scan does not provide us with information about the hair and the inside of the mouth, model failures lead to errors in the synthetic frame that have to be compensated by the block-based coder.

Fig. 7. Left: frame 120 of the sequence Peter, right: corresponding model frame.

The quality of the synthetic model-based prediction is illustrated in Fig. 8, where the PSNR, measured only in the facial area, is plotted over the bit-rate needed for coding the sequence. Note that model failures lead to a saturation of the curve at about 32.8 dB. The bit-rate necessary to reach that point is below 1 kbit/s.

Fig. 8. Rate-distortion plot for the sequence Peter using the model-based coder.
For the sequence Akiyo, we have no head shape information from a scan. Hence, the 3-D model is generated using the shape of an artificial plaster head and mapping texture from the first video frame onto the object. This solution has been chosen due to its simplicity. However, more sophisticated approaches to initial model adaptation can be found in the literature, e.g., see [20]. Nevertheless, even with our simple approach to initial model adaptation, we are able to demonstrate the practicability of the combined block- and model-based codec, since we do not require the availability of a 3-D scan. Figure 9 shows an original and the corresponding model frame for the Akiyo sequence.
Fig. 9. Left: frame 18 of the sequence Akiyo, right: corresponding model frame.
For comparing the proposed coder with the anchor, which is TMN-10, the state-of-the-art test model of the H.263 standard, rate-distortion curves are generated by varying the DCT quantizer over the values 10, 15, 20, 25, and 31. Bit-streams are generated that are decodable, producing the same PSNR values as at the encoder. In our simulations, the data for the first INTRA frame and the initial 3-D model are excluded from the results. This way we simulate steady-state behavior, meaning that we compare the inter-frame coding performance of both codecs excluding the transition phase at the beginning of the sequence. We assume that the texture of the model is extracted from those video frames that are transmitted during the transition phase at the beginning of the sequence. Another practical scenario for our simulation conditions is the case when the decoder has stored the texture of the model from a previous video-phone session between the two terminals.

We first show rate-distortion curves for the proposed coder in comparison to the H.263 test model, TMN-10. The following abbreviations are used for the two codecs compared:

TMN-10: the result produced by the H.263 test model, TMN-10, using Annexes D, F, I, J, and T.

MBBC: model- and block-based coder; H.263 extended by model-based prediction, with Annexes D, F, I, J, and T enabled as well.

Figures 10 and 11 show the results obtained for our two test sequences Peter and Eckehard.
Fig. 10. Rate-distortion plot for the sequence Peter.

Fig. 11. Rate-distortion plot for the sequence Eckehard.

Significant gains in coding efficiency are achieved compared to TMN-10. For example, the results obtained for the sequence Peter show bit-rate savings of about 35% at equal average PSNR at the low bit-rate end. This corresponds to a gain of about 2.8 dB in terms of average PSNR. Figure 12 shows a similar curve for the sequence Akiyo. Also for this sequence, where no 3-D scan of the head shape is available, the bit-rate savings are about 35% at the low bit-rate end.

Fig. 12. Rate-distortion plot for the sequence Akiyo.
In the following, we present subjective results by means of comparing reconstructed frames, since we have found that the PSNR measure allows only limited conclusions about the comparisons. Figure 13 shows frame 120 of the test sequence Peter, where the upper picture is decoded and reconstructed from the TMN-10 decoder, while the lower one comes from the model- and block-based coder. Both frames are transmitted at the same bit-rate of 1680 bits. In Fig. 14, similar results are shown for the sequence Eckehard, which is encoded with both encoders at the same bit-rate. The upper image corresponds to frame 27, decoded and reconstructed from the TMN-10 decoder, while the lower one is generated from the model- and block-based coder. Again, both frames are transmitted at the same bit-rate of about 1200 bits. Finally, the corresponding results for the sequence Akiyo are depicted in Fig. 15. Both frames are taken from sequences encoded with the same average PSNR using the TMN-10 and the model- and block-based coder.

Finally, let us take a closer look at the robustness of the proposed video codec. Figure 16 shows the model-based prediction of frame 18 that was already shown in Fig. 9. As can be seen in the enlargement of the image region around the mouth, a model failure occurs that causes the black bar inside the mouth. However, the rate-constrained coder control handles model failures like the one depicted in Fig. 16 robustly, as illustrated by the selection mask shown in Fig. 17. Figure 17 shows all macroblocks that are predicted from the model frame, while the macroblocks
Fig. 13. Frame 120 of the Peter sequence coded at the same bit-rate using the TMN-10 and the MBBC, upper image: TMN-10 (33.88 dB PSNR, 1680 bits), lower image: MBBC (37.34 dB PSNR, 1682 bits).

Fig. 14. Frame 27 of the Eckehard sequence coded at the same bit-rate using the TMN-10 and the MBBC, upper image: TMN-10 (34.4 dB PSNR, 1264 bits), lower image: MBBC (37.02 dB PSNR, 1170 bits).
predicted from the reconstructed frame are shown in grey. As can be seen in Fig. 17, the mouth is not predicted from the model frame, thus avoiding the prediction error coding of the black bar in Fig. 16.

VI. Concluding Remarks
The combination of model-based video coding with block-based motion compensation into a multi-frame predictor yields a superior video coding scheme for head-and-shoulder sequences. When integrating the multi-frame motion-compensated predictor into an H.263-based video codec, bit-rate savings of around 35% are achieved at equal average PSNR in comparison to TMN-10, the state-of-the-art test model of the H.263 video compression standard. Significant improvements in terms of subjective quality are visible when running both codecs at equal average bit-rate.

The model-based coder analyzes the incoming original video frames and estimates facial animation parameters that are entropy coded and transmitted through the channel. The 3-D head model and the synthesis framework are incorporated into the parameter estimation, allowing direct access to the appearance of the facial expressions in the rendered output frame. The 3-D head model consists of shape, texture, and the description of facial expressions. The initial 3-D head model adaptation can either be conducted using a 3-D scan of the head shape or a generic head model, as demonstrated in the paper. The generic head model approach avoids the use of a 3-D scanner, enabling low-cost applications such as video-telephony or video-conferencing. For synthesis of facial expressions, the transmitted facial animation parameters are used to deform the 3-D head model. Finally, the head-and-shoulder scene is approximated by simply rendering a model frame. In most cases, only a few parameters are encoded and transmitted, leading to very low bit-rates, typically less than 1 kbit/s as shown in the paper.

The model-based codec runs simultaneously with the hybrid H.263-based video coder, generating a synthetic model frame as a prediction for each transmitted frame. This model frame is employed as a second reference frame for rate-constrained block-based motion-compensated prediction, in addition to the previously reconstructed reference frame. For each block, the video coder decides which of the two frames to reference for motion compensation using a Lagrangian cost function. Moreover, the macroblock mode decision is also embedded into a Lagrangian formulation. This provides increased robustness to estimation errors or model failures that may occur in the model-based coder, as illustrated in the paper.

Fig. 15. Frame 150 of the Akiyo sequence coded at the same bit-rate using the TMN-10 and the MBBC, upper image: TMN-10 (31.08 dB PSNR, 720 bits), lower image: MBBC (33.19 dB PSNR, 725 bits).

Fig. 16. Model frame 18 of the sequence Akiyo with enlargement of the image region around the mouth.
Fig. 17. Selection mask for model frame 18 of the sequence Akiyo. Those macroblocks are shown that have been selected for motion compensation using the model frame. The grey parts of the image are either chosen to be compensated from the previously reconstructed frame or they are coded using the INTRA macroblock mode.

References

[1] ITU-T Recommendation H.263 Version 2 (H.263+), "Video Coding for Low Bitrate Communication", Jan. 1998.
[2] ISO/IEC 14496-2, Coding of Audio-Visual Objects: Visual (MPEG-4 video), Committee Draft, Document N1902, Oct. 1997.
[3] W. J. Welsh, S. Searsby, and J. B. Waite, "Model-based image coding", British Telecom Technology Journal, vol. 8, no. 3, pp. 94-106, Jul. 1990.
[4] D. E. Pearson, "Developments in model-based video coding", Proceedings of the IEEE, vol. 83, no. 6, pp. 892-906, Jun. 1995.
[5] P. Eisert and B. Girod, "Analyzing facial expressions for virtual conferencing", IEEE Computer Graphics and Applications, pp. 70-78, Sep. 1998.
[6] M. F. Chowdhury, A. F. Clark, A. C. Downton, E. Morimatsu, and D. E. Pearson, "A switched model-based coder for video signals", IEEE Transactions on Circuits and Systems for Video Technology, vol. 4, no. 3, pp. 216-227, Jun. 1994.
[7] H. G. Musmann, "A layered coding system for very low bit rate video coding", Signal Processing: Image Communication, vol. 7, no. 4-6, pp. 267-278, Nov. 1995.
[8] M. Kampmann and J. Ostermann, "Automatic adaptation of a face model in a layered coder with an object-based analysis-synthesis layer and a knowledge-based layer", Signal Processing: Image Communication, pp. 201-220, 1997.
[9] H. Everett III, "Generalized Lagrange Multiplier Method for Solving Problems of Optimum Allocation of Resources", Operations Research, vol. 11, pp. 399-417, 1963.
[10] Y. Shoham and A. Gersho, "Efficient Bit Allocation for an Arbitrary Set of Quantizers", IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 36, pp. 1445-1453, Sept. 1988.
[11] P. A. Chou, T. Lookabaugh, and R. M. Gray, "Entropy-Constrained Vector Quantization", IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, no. 1, pp. 31-42, Jan. 1989.
[12] M. Rydfalk, CANDIDE: A Parameterized Face, PhD thesis, Linkoping University, 1978, LiTH-ISY-I-0866.
[13] G. Greiner and H. P. Seidel, "Splines in computer graphics: Polar forms and triangular B-spline surfaces", Eurographics, 1993, State-of-the-Art Report.
[14] M. Hoch, G. Fleischmann, and B. Girod, "Modeling and animation of facial expressions based on B-splines", Visual Computer, vol. 11, pp. 87-95, 1994.
[15] R. Y. Tsai, "A versatile camera calibration technique for high accuracy 3D machine vision metrology using off-the-shelf cameras and lenses", IEEE Journal of Robotics and Automation, vol. RA-3, no. 4, pp. 323-344, Aug. 1987.
[16] P. Eisert and B. Girod, "Model-based estimation of facial expression parameters from image sequences", International Conference on Image Processing, vol. 2, pp. 418-421, Oct. 1997.
[17] H. Li, P. Roivainen, and R. Forchheimer, "3-D motion estimation in model-based facial image coding", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 6, pp. 545-555, Jun. 1993.
[18] P. Eisert and B. Girod, "Model-based coding of facial image sequences at varying illumination conditions", 10th IMDSP Workshop 98, pp. 119-122, Jul. 1998.
[19] J. Ostermann, "Object-based analysis-synthesis coding (OBASC) based on the source model of moving flexible 3D-objects", IEEE Transactions on Image Processing, pp. 705-711, Sep. 1994.
[20] D. DeCarlo and D. Metaxas, "The integration of optical flow and deformable models with applications to human face shape and motion estimation", Computer Vision and Pattern Recognition, pp. 231-238, 1996.
[21] C. L. Lawson and R. J. Hanson, Solving Least Squares Problems, Prentice Hall, Inc., 1974.
[22] J. Stauder, "Estimation of point light source parameters for object-based coding", Signal Processing: Image Communication, pp. 355-379, 1995.
[23] G. Bozdagi, A. M. Tekalp, and L. Onural, "3-D motion estimation and wireframe adaptation including photometric effects for model-based coding of facial image sequences", IEEE Transactions on Circuits and Systems for Video Technology, vol. 4, no. 3, pp. 246-256, Jun. 1994.
[24] J. H. Lambert, Photometria sive de mensura et gradibus luminis, colorum et umbrae, Eberhard Klett, Augsburg, Germany, 1760.
[25] C. A. Poynton, "Gamma and its disguises: The nonlinear mappings of intensity in perception, CRTs, film and video", SMPTE Journal, pp. 1099-1108, Dec. 1993.
[26] ITU-T/SG16/Q15-D-65, "Video Codec Test Model, Near Term, Version 10 (TMN-10), Draft 1", download via anonymous ftp to: standard.pictel.com/video-site/9804 Tam/q15d65d1.doc, Apr. 1998.
[27] ITU-T/SG16/Q15-D-13, T. Wiegand and B. Andrews, "An Improved H.263-Codec Using Rate-Distortion Optimization", download via anonymous ftp to: standard.pictel.com/video-site/9804 Tam/q15d13.doc, Apr. 1998.
[28] G. J. Sullivan and R. L. Baker, "Motion Compensation for Video Compression Using Control Grid Interpolation", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, May 1991, vol. 5, pp. 2713-2716.
[29] B. Girod, "Rate-Constrained Motion Estimation", in Proceedings of the SPIE Conference on Visual Communications and Image Processing, Chicago, USA, Sept. 1994, pp. 1026-1034 (invited paper).
[30] T. Wiegand, M. Lightstone, D. Mukherjee, T. G. Campbell, and S. K. Mitra, "Rate-Distortion Optimized Mode Selection for Very Low Bit Rate Video Coding and the Emerging H.263 Standard", IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, no. 2, pp. 182-190, Apr. 1996.
[31] G. J. Sullivan and T. Wiegand, "Rate-Distortion Optimization for Video Compression", IEEE Signal Processing Magazine, vol. 15, no. 6, pp. 74-90, Nov. 1998.