An Efficient Use of MPEG-4 FAP Interpolation for Facial Animation at 70 bits/Frame

Fabio Lavagetto and Roberto Pockaj

Abstract—An efficient algorithm is proposed to exploit the facial animation parameter (FAP) interpolation modality specified by the MPEG-4 standard in order to allow very low bit-rate transmission of the animation parameters. The proposed algorithm is based on a comprehensive analysis of the cross-correlation properties that characterize FAPs, which is here reported and discussed extensively. Based on this analysis, a subset of ten almost independent FAPs has been selected from the full set of 66 low-level FAPs; these are transmitted and used at the decoder to interpolate the remaining ones. The performance achievable through the proposed algorithm has been evaluated objectively by means of conventional PSNR measures and compared to an alternative solution based on increasing the quantization scaling factor used for FAP encoding. The subjective evaluation and comparison of the results has also been made possible by uploading MPEG movies to a freely accessible web site (referenced in the bibliography). Experimental results demonstrate that the proposed FAP interpolation algorithm allows efficient parameter encoding at around 70 bits/frame or, in other words, at less than 2 kbits/s for smooth synthetic video at 25 frames/s.

Index Terms—Avatars, facial animation, MPEG-4.

I. INTRODUCTION

THE OBJECTIVE of the present paper is to define appropriate criteria for allowing MPEG-4 facial animation at very low bit rates and to provide, at the same time, a detailed explanation of a particular part of the facial animation specification. As in any conventional approach to lossy video compression, the straightforward solution for reducing the bit rate due to the transmission of facial animation parameters (FAPs) is to increase the quantization scaling factor used for parameter encoding. The evident disadvantage of this approach is that it progressively degrades the quality of the animation, introducing jerky facial movements that are usually very annoying. In synthetic facial animation, unlike natural video coding, coarse quantization of the parameters does not affect the rendering quality of each individual frame, but rather the smoothness of the rendered facial movements.

However, another possibility exists to reduce the bit rate of the stream encoding the FAPs according to the MPEG-4 specifications. This method is based on exploiting the a priori knowledge about the object that is animated, namely a human face. Through the analysis of the time correlation characterizing the facial movements of a person while talking, it is possible to remove much of the FAP redundancy. Instead of coding and transmitting the complete set of 66 FAPs at each frame, it is possible to encode only a significant subset of them and let the decoder generate the missing parameters, thus achieving very low bit-rate animation. This procedure, named FAP interpolation, represents the core issue discussed in this paper.

In Section II, we introduce the specifications of MPEG-4 concerned with FAP interpolation and explain how the techniques we propose are compliant with the standard and oriented to fully exploit its potential. In Section III, we present the methodology and the setup used for high-precision data acquisition of real facial movements. In Section IV, a description is given of the techniques adopted to post-process the acquired data and to define an appropriate subset of FAPs, assumed as a minimum basis capable of approximating all possible human facial movements. The mechanism used by the decoder to interpolate the missing FAPs is explained in Section V, while preliminary objective and subjective results are reported in Section VI.

II. FAP "STATUS" AND INTERPOLATION

The MPEG-4 specifications concerning facial animation [1] adopt the term "FAP interpolation" to indicate the procedure used by a generic facial animation decoder to autonomously define the value of the FAPs that have not been encoded and transmitted. Based on the knowledge of only a limited number of FAPs, the objective of the interpolation procedure is to estimate a variable number of the missing ones. In this context, since both the known and the estimated FAPs belong to the same frame, the estimation is performed intra-frame, without taking into account any inter-frame FAP prediction. In this respect, therefore, rather than an interpolation, we should more properly call the procedure an actual FAP extrapolation. Having stressed this terminological ambiguity, let us now examine the related MPEG-4 specifications and focus on some issues whose interpretation is not immediate.

FAPs have been subdivided into a few groups, depending on the region of the face they are applied to, with the objective of optimizing the compression efficiency. These groups are listed in Table I. Together with I-frames, two hierarchical masks, fap_mask_type and fap_group_mask, are transmitted, with the objective of selecting the subset of the complete set of 68 defined FAPs that will be transmitted in the present I-frame and in the following P-frames. The fap_mask_type is a 2-bit mask whose meaning is described in Table II.

Manuscript received June 1, 2000; revised July 18, 2001. This work was supported in part by the European Union under the ACTS Research Project VIDAS and by the IST Research Project INTERFACE, both coordinated by DIST. This paper was recommended by Associate Editor E. Petajan. The authors are with DIST, University of Genova, 16145 Genova, Italy (e-mail: [email protected]; [email protected]). Publisher Item Identifier S 1051-8215(01)09156-X.



TABLE I FAP GROUPS AND NUMBER OF FAPS PER GROUP

TABLE II FAP_MASK_TYPE

TABLE III LENGTH IN BITS OF FAP_GROUP_MASK VERSUS THE GROUP NUMBER

The fap_group_mask, which is encoded only when the value of fap_mask_type is 01 or 10, specifies which FAPs among those in the group are actually transmitted. The size of this mask is variable (see Table III), depending on the specific group it refers to. The value of this mask can be interpreted as a composition of 1-bit fields: if the bit value is 1, the corresponding FAP is transmitted; otherwise, it is not. As described before, these masks are used to specify which FAPs are transmitted. Moreover, the masks are also used to encode the so-called "FAP status." At each transmission of frame parameters to the decoder, the FAP status can be one of the following.

SET: the FAP was transmitted by the encoder;
LOCK: the FAP was not transmitted and maintains the value of the previous frame;
INTERP: the FAP was not transmitted and the decoder may define its new value (i.e., it can interpolate it).

SET, LOCK, and INTERP are terms not explicitly mentioned in the standard; they have been introduced by the authors and will be used in the sequel to denote the FAP status.
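To make the mask semantics concrete, the following sketch shows how a decoder might derive the per-FAP status from the two masks. This is an illustrative reconstruction, not code from the standard or from a reference decoder: the function name, the data layout, and the handling of the mask value 00 (assumed here to mean that no FAP of the group is transmitted) are our assumptions.

```python
from enum import Enum

class FapStatus(Enum):
    SET = 1      # FAP transmitted by the encoder in this frame
    LOCK = 2     # not transmitted; keep the value of the previous frame
    INTERP = 3   # not transmitted; the decoder may interpolate its value

def fap_status(fap_mask_type, fap_group_mask, ever_received):
    """Derive the status of each FAP of one group (illustrative sketch).

    fap_mask_type: 2-bit value from the bitstream (0b00 .. 0b11);
    fap_group_mask: one bit per FAP of the group (meaningful only for 01/10);
    ever_received: True if the decoder has seen a value for that FAP before.
    """
    statuses = []
    for bit, seen in zip(fap_group_mask, ever_received):
        if fap_mask_type == 0b11 or (fap_mask_type in (0b01, 0b10) and bit == 1):
            statuses.append(FapStatus.SET)
        elif fap_mask_type == 0b01 and seen:
            statuses.append(FapStatus.LOCK)    # "01": hold the previous value
        else:
            statuses.append(FapStatus.INTERP)  # "10", or no past value known
    return statuses
```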

The status of non-transmitted FAPs is determined by the decoder according to the values of fap_mask_type and fap_group_mask. In fact, if fap_mask_type has value "01," non-transmitted FAPs (represented by a "0" bit value in the corresponding fields of the fap_group_mask) must maintain the value of the previous frame (FAPs are in LOCK status); if fap_mask_type has value "10," non-transmitted FAPs can be interpolated by the decoder (FAPs are in INTERP status). The only exception is when the encoder has never transmitted a FAP. In this case, since no past reference value is available for this FAP, the only solution is to force its initial status to INTERP. It is worth pointing out some problems that this specification could originate in the case of broadcast transmission, at least according to the authors' interpretation of the standard. When different decoders start to decode the broadcast bitstream at different instants, an unresolved ambiguity arises about the correct handling of non-transmitted FAPs. Consider the example of a FAP transmitted for some time at the beginning of the communication and, after a while, transmitted no more. If the decoder is activated from the beginning (i.e., while the FAP is still transmitted), as soon as its transmission stops, the FAP enters the LOCK status. Conversely, if the decoder is activated after the FAP transmission has stopped, the FAP enters the INTERP status.

It must also be noticed that when a FAP is in INTERP status, the decoder is not always free to fix its value. The standard, in fact, defines two default criteria for interpolating FAPs, called "left–right interpolation" and "inner–outer lip contour interpolation," respectively. These criteria have been defined to exploit two evident characteristics of facial motion: the vertical symmetry of facial movements and the strong correlation between the movements of the inner and outer lip contours. As a practical consequence, in case only the FAPs of the right part of the face are transmitted while those of the left part are in INTERP status, the decoder is forced to reproduce the received values also on the left half of the face, and vice versa. The same process is applied to the lip contours: if only the FAPs related to the inner contour are transmitted while those of the outer contour are in INTERP status, the decoder is forced to reproduce the received values also on the outer contour, and vice versa.

Another FAP interpolation method is included in the MPEG-4 standard, applicable to all the profiles including facial animation except the simplest one, Simple FA. This method makes use of the FAP Interpolation Table (FIT) [2] to allow the encoder to define inter-FAP dependence criteria in polynomial form. After downloading the FIT parameters, the encoder activity can be limited to the transmission of a subset of FAPs, leaving to the decoder the task of interpolating the missing ones on the basis of the inter-FAP relations specified by the FIT. For each FAP to be interpolated, FIT allows the definition of relations such as


Each interpolation function $I(\cdot)$ is in a rational polynomial form (this is not true for FAP 1 and FAP 2; please refer to [2] for more details)

$$F_k = I_k(F_{i_1}, \ldots, F_{i_n}) = \frac{P(F_{i_1}, \ldots, F_{i_n})}{Q(F_{i_1}, \ldots, F_{i_n})} \qquad (1)$$

where $P$ and $Q$ are polynomials in the transmitted FAPs.

An example of FIT can, therefore, consist of simple relations of the form

$$F_j = c_1\,F_i \qquad (2)$$

$$F_k = c_2\,F_i \qquad (3)$$

expressing the FAPs to be interpolated as direct proportions of a transmitted FAP. In the opinion of the authors, however, the use of FIT is mainly oriented to guaranteeing the predictability of the animation when used together with the Facial Animation Table (FAT). In this way, the encoder is able to guarantee a minimum level of quality, since the results of the animation are fully predictable, and to achieve in the meantime significant savings in bandwidth. It is reasonable to imagine that, for most applications, the inter-FAP dependence criteria will be very simple. Thanks to the strong symmetries characterizing a human face, it should be possible to express the majority of these dependencies in terms of simple direct proportions, without the need for complex polynomial functions. Based on these considerations, it turns out that almost no FIT information is, in general, needed at the decoder, provided that, besides implementing the default interpolation functions "left–right" and "inner–outer lip," it also includes this kind of simple proportional relation.
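As an illustration of how lightweight such decoder-side logic can be, the following sketch applies a table of simple proportional relations of the kind just discussed, including left–right mirroring. The rule table is hypothetical; the actual coefficients derived in this paper appear in Tables IX-XI.

```python
# (target FAP, source FAP, coefficient): target = coefficient * source.
# The entries below are placeholders, not values from the paper's tables.
PROPORTIONAL_RULES = [
    (31, 33, 1.0),   # e.g., inner eyebrow follows middle eyebrow
    (34, 33, 1.0),   # left-right symmetry: right FAP mirrors the left one
]

def interpolate_missing(faps):
    """faps: {fap_number: value} for the transmitted FAPs of one frame."""
    out = dict(faps)
    for target, source, c in PROPORTIONAL_RULES:
        if target not in out and source in out:
            out[target] = c * out[source]   # simple direct proportion
    return out

print(interpolate_missing({33: 120.0}))  # -> {33: 120.0, 31: 120.0, 34: 120.0}
```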

Fig. 1. Location of feature points defined by MPEG-4 (left) and location of markers in the data acquisition phase (right).

III. ACQUISITION SETUP

The data have been acquired by the "ELITE" system [3], which is composed of four synchronized cameras (one video camera and three IR cameras), together with a microphone. The 3-D positions of small IR-reflecting markers distributed on the speaker's face are estimated and tracked by the system at 100 Hz, with sub-millimeter precision, by means of suitable triangulation algorithms applied to the trinocular IR images. Once the 3-D trajectories of each marker have been automatically computed by the system, suitable post-processing is applied to convert them into MPEG-4 compliant FAPs. After having estimated the neutral position of the speaker (as defined by the standard) and the Facial Animation Parameter Units (FAPUs), the rigid motion of the speaker's head is computed for each frame by analyzing the positions of three reference markers located on almost nondeformable facial structures, such as the tip of the nose (see Fig. 1). The 3-D coordinates of each marker are normalized with respect to the neutral position (by compensating the rigid motion of the head), and the FAPs associated with the current frame are estimated by comparing the compensated positions of the markers with their coordinates in the neutral position, as sketched below.

Twenty markers have been distributed on the speaker's face, as shown in Fig. 1. Thanks to this marker configuration, 30 FAPs have been estimated, specifically: group 4 (eyebrow), group 5 (cheek), group 7 (head), group 8 (outer lip), and FAPs 3, 14, 15, 16, and 17 of group 2 (inner lip, jaw). The acquisition procedure described above has been applied to record a few sequences, for a total of about 22 000 frames, with three different speakers, two males and one female. The acquired data have been subdivided into two parts: the Training Data Set, obtained by selecting the first 2/3 of each sequence, and the Test Data Set, composed of the remaining 1/3 of each sequence. The former has been used to analyze the FAP correlation, while the latter has been used for the performance evaluation.
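The post-processing step can be summarized by the following sketch, which compensates the rigid head motion and converts a marker displacement into a FAP value. It assumes the rigid pose (rotation R and translation t) has already been estimated from the reference markers; the function name and the choice of axis and FAPU are illustrative.

```python
import numpy as np

def marker_to_fap(p, neutral, R, t, axis, fapu):
    """p: measured 3-D marker position (shape (3,)); neutral: its position in
    the neutral pose; R, t: rigid head rotation matrix and translation for the
    current frame; axis: 0/1/2 for x/y/z; fapu: the relevant FAP unit."""
    p_compensated = R.T @ (p - t)            # undo the head's rigid motion
    displacement = p_compensated - neutral   # residual facial deformation
    return displacement[axis] / fapu         # express the FAP in FAPU units
```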

IV. ANALYSIS OF FAP CORRELATION

For the analysis of FAP correlation, only a few of the 10 FAP groups defined in MPEG-4 have been considered (see Table I). The analysis we have carried out is oriented to modeling the FAP trajectories related to the human facial movements associated with normal speech production. The objective is to reduce the number of FAPs to encode and transmit and to provide the decoder with a suitable FAP interpolation mechanism. Should particular nonhuman facial movements have to be rendered for specific applications (such as the animation of a cartoon-like character), it is enough to modify the FAP mask associated with the transmission of an I-frame and to transmit all the FAPs describing that particular animation for the time needed.

As far as the correlation analysis is concerned, FAPs in group 1 (visemes and expressions) have not been considered, since they do not simply encode the scalar value of a specific facial feature like the other FAPs but, on the contrary, encode high-level information such as the global posture of the mouth and the whole facial expression. No measurement of FAPs in group 6 (tongue) was possible, due to the limitations of the data acquisition system that has been used, and the same holds for some FAPs of group 2 (inner lip contour). FAPs in groups 9 and 10 have been excluded from the analysis, since human facial movements associated with the nose and ears are usually negligible and of limited interest.


FAPs in group 7 (head) have been considered, even if the three rotations of the head have been assumed to be decorrelated. In conclusion, for the correlation analysis, three sets of FAPs have been considered: 1) some of those in group 2, together with those in groups 5 and 8; 2) those in group 3; and 3) those in group 4. The reason for analyzing together, in the first set, FAPs coming from different groups (while the other two sets include only homogeneous FAPs) is the a priori knowledge about the anatomy of human faces. The bone and muscle structures of a human head, in fact, indicate that some regions of the face are affected by correlated motion, while others can reasonably be assumed to be independent. As an example, movements of cheeks and lips are strongly correlated, while movements of the eyebrows can be considered rather independent of those of the jaw. This consideration is very evident if facial movements are analyzed at a low level and instant by instant, while it becomes less and less valid if the analysis is shifted to the semantic level and applied over longer time intervals. In this last case, for instance, it might be possible to identify long-term correlations between the emphasis of pronunciation (in relation to lip movements) and movements of the head and eyebrows [4]. Anyway, these considerations go far beyond the scope of the paper, and our investigation here will be limited to low-level FAP analysis even if, in the opinion of the authors, this represents a promising and fertile field of research for future applications.

A. Computation of FAP Correlations

Our analysis has been based on the FAP correlation matrix [5], computed as follows:

$$\mathbf{R} = \{\rho_{ij}\}, \qquad i, j = 1, \ldots, N \qquad (4)$$

where $N$ is the number of FAPs under investigation and $\rho_{ij}$ represents the correlation coefficient between the $i$th and the $j$th FAPs, defined as

$$\rho_{ij} = \frac{c_{ij}}{\sqrt{\sigma_i^2\,\sigma_j^2}} \qquad (5)$$

The term $\sigma_i^2$ represents the variance of the $i$th FAP and $c_{ij}$ the coefficient of the covariance matrix, defined as

$$c_{ij} = \frac{1}{T}\sum_{t=1}^{T}\bigl(F_i(t) - \mu_i\bigr)\bigl(F_j(t) - \mu_j\bigr) \qquad (6)$$

$\mu_i$ being the mean value of the $i$th FAP over the $T$ frames of the training sequences. The matrix $\mathbf{R}$ is symmetric by definition, that is, $\rho_{ij} = \rho_{ji}$. Though in rigorous mathematical terms $\rho_{ij}$ is a signed real number, we have chosen to consider its absolute value to simplify the graphical interpretation of the results. The more $\rho_{ij}$ approximates the value 1, the more the $i$th and the $j$th FAPs are correlated.

It is worth mentioning that other methods, different from the approach based on the correlation matrix described above, have been considered by the authors for the purpose of identifying a set of uncorrelated FAPs among the 66 defined by MPEG-4.
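For reference, (4)-(6) translate directly into a few lines of numerical code; the sketch below assumes the training data are arranged as an array of T frames by N FAPs.

```python
import numpy as np

def correlation_matrix(F):
    """F: array of shape (T, N), one FAP trajectory per column.
    Returns |rho_ij| as in (4)-(6)."""
    T = F.shape[0]
    mu = F.mean(axis=0)              # per-FAP mean values
    D = F - mu
    C = D.T @ D / T                  # covariance coefficients c_ij, eq. (6)
    sigma = np.sqrt(np.diag(C))      # per-FAP standard deviations
    R = C / np.outer(sigma, sigma)   # correlation coefficients, eq. (5)
    return np.abs(R)                 # absolute value, as used in the paper
```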

Fig. 2. Graphical representation of matrix R for group 4 (white squares indicate high correlation).

The most suitable among them might seem to be principal component analysis (PCA), of widespread use in this class of problems. Such methods, however, are of no use in this application and have therefore been discarded, since the $M$-vector basis they provide for $N$ given vectors ($M \le N$) almost never coincides with a subset of the given vectors. On the contrary, this is an obvious constraint for an MPEG-4 compliant codec, where only the 66 FAPs, or possibly a subset of them, are allowed to be transmitted.

After defining the criterion for the correlation estimation, let us examine the various FAP groups. In the following, it will be discussed how to choose the minimum set of FAPs that must be transmitted. In Section V, on the contrary, criteria will be defined to interpolate the non-transmitted parameters.

B. Group 4 (Eyebrows)

Matrix $\mathbf{R}$ is represented with gray levels in Fig. 2. The graphical representation of $\mathbf{R}$ allows a faster estimation of the correlations: bright blocks indicate a high value of $\rho_{ij}$, while dark blocks identify totally uncorrelated FAPs. The visualization of $\mathbf{R}$ also allows a first data validation, by comparing the correlation of FAPs on the right side of the face with that of FAPs on the left side. By inspection, it turns out that FAP 38 (squeeze_r_eyebrow) has an unexpectedly low correlation with the other FAPs. By analyzing the temporal trajectory of FAP 38, it was discovered that its intrinsically low dynamics, typically in the order of a few tens of the Eye Separation unit (ES), had been completely obscured by the acquisition noise, whose standard deviation was comparable with the signal. Because of this systematic error in the measurement process, all the FAPs affected by this kind of acquisition noise have been excluded from further analysis, and a second matrix $\tilde{\mathbf{R}}$ has been computed, where the columns corresponding to symmetric FAPs have been merged. Since MPEG-4 forces identical values for symmetric FAPs in case of interpolation, it is convenient to consider symmetric FAPs as a single parameter (in our case, we have chosen to represent only the left FAPs). On the basis of the above considerations, measures related to FAP 38 have been replaced with those of the symmetric FAP 37, whose acquisition was significantly less noisy.

Matrix $\tilde{\mathbf{R}}$ is visualized in Fig. 3, while the values of its coefficients $s_i$ are reported in Table IV. The information associated with Table IV provides good indications for selecting the specific FAP that will be used to interpolate the other ones. The criterion we have adopted is that of selecting the $i$th FAP having the highest average correlation with respect to the other FAPs

$$s_i = \frac{1}{N-1}\sum_{j \neq i}\rho_{ij} \qquad (7)$$

The FAP maximizing $s_i$ can now be used to interpolate any other $j$th FAP with $\rho_{ij} \ge 0.75$. The threshold value 0.75 has been determined experimentally, seeking a reasonable tradeoff between rate and distortion. Once this operation has been completed, the rows and columns corresponding to the selected FAP and to the FAPs it interpolates are removed from the matrix, and the coefficients $s_i$ are recomputed. This process is then iterated until all the examined FAPs have been analyzed. As far as FAPs in group 4 are concerned, the values of $\rho_{ij}$ and $s_i$ allow an easy choice of FAP 33 (or of its symmetric FAP 34) as the best and unique FAP to transmit, since all the remaining eyebrow FAPs are correlated with it above the threshold. In Section V, the interpolation criteria will be discussed and defined.
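The iterative selection procedure just described can be summarized by the following sketch: pick the FAP with the highest s coefficient, assign to it every remaining FAP whose correlation exceeds the 0.75 threshold, remove them all, and repeat. The exact form of (7) and the handling of the final iterations are our reconstruction from the text.

```python
import numpy as np

def select_faps(R, labels, thr=0.75):
    """R: |rho| matrix (N x N); labels: FAP numbers. Returns the list of FAPs
    to transmit and a map {interpolated FAP: transmitted FAP}."""
    remaining = list(range(len(labels)))
    transmitted, interp_map = [], {}
    while remaining:
        sub = R[np.ix_(remaining, remaining)]
        # average correlation with the other remaining FAPs (rho_ii = 1 excluded)
        s = (sub.sum(axis=1) - 1.0) / max(len(remaining) - 1, 1)
        best = remaining[int(np.argmax(s))]
        transmitted.append(labels[best])
        covered = [j for j in remaining if j != best and R[best, j] >= thr]
        for j in covered:
            interp_map[labels[j]] = labels[best]  # j interpolated from best
        remaining = [j for j in remaining if j != best and j not in covered]
    return transmitted, interp_map
```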

Fig. 3. Graphical representation of matrix R̃ for group 4 (white squares indicate high correlation).

TABLE IV MATRIX R̃ FOR GROUP 4 (LEFT) AND VALUES OF THE s COEFFICIENT (RIGHT)

Fig. 4. Graphical representation of matrix R for group 2 (partial), 5, and 8 (white squares indicate high correlation).

C. Groups 2, 5, and 8 (Jaw, Chin, Lip Protrusion, Cheeks, Outer Lip, Cornerlip)

The corresponding matrix $\mathbf{R}$ is visualized in Fig. 4. By inspecting it, a few significant considerations can be drawn. Also in this case, an unexpectedly low correlation is found between FAPs 53 and 54 (stretch outer cornerlips). If we then consider the FAPs corresponding to the upper lip, it is evident that they are substantially uncorrelated with all the other FAPs. Also FAP 15 (shift_jaw) appears totally uncorrelated with the other FAPs. This conclusion is confirmed by a deeper analysis of its time trajectory, revealing small random variations very likely due to acquisition noise, since during speech this movement is substantially absent.

Let us now consider the matrix $\tilde{\mathbf{R}}$, obtained from $\mathbf{R}$ by removing the noisy FAPs and by merging left and right pairs. When inspecting the coefficients of matrix $\tilde{\mathbf{R}}$ (see Fig. 5 and Table V), choosing the FAPs to transmit appears more complex than in group 4. The first FAP selected for transmission is FAP 52, used then to interpolate FAPs 3, 14, and 57 (having $\rho \ge 0.75$). Therefore, four rows and four columns can be removed from $\tilde{\mathbf{R}}$, and the new matrix $\tilde{\mathbf{R}}^{(1)}$ is obtained by recomputing the coefficients $s^{(1)}_i$ (the superscript 1 indicates that the first FAP to transmit has been selected or, in other terms, that the first simplification step has been completed).

As evidenced in Table VI and Fig. 6, the FAPs with the highest values of $s^{(1)}$ are FAP 41 (lift_l_cheek) and FAP 59 (raise_l_cornerlip_o). The decision of selecting FAP 59 as the second parameter to transmit, though $s^{(1)}_{41}$ was slightly greater than $s^{(1)}_{59}$, is due to evident technical reasons: experiments show that the automatic extraction of lip corners from real video sequences is far easier than that of cheek coordinates.

TABLE V MATRIX R̃ FOR GROUP 2 (PARTIAL), 5, AND 8 (LEFT), AND VALUES OF THE s COEFFICIENT (RIGHT)

TABLE VI MATRIX R̃^(1) FOR GROUP 2 (PARTIAL), 5, AND 8 (LEFT), AND VALUES OF THE s^(1) COEFFICIENT (RIGHT)

Fig. 5. Graphical representation of matrix R̃ for group 2 (partial), 5, and 8 (white squares indicate high correlation).

The comparable values of $s^{(1)}_{41}$ and $s^{(1)}_{59}$, together with the high cross-correlation between the two FAPs, should guarantee good interpolation anyway. Since the only value over the threshold ($\rho \ge 0.75$) is just met for FAP 41, it can be reasonably assumed that FAP 41 can be interpolated effectively from FAP 59.

Fig. 6. Graphical representation of matrix R̃^(1) for group 2 (partial), 5, and 8 (white squares indicate high correlation).

Let us now erase these two FAPs and recompute $s^{(2)}$, as shown in Table VII and Fig. 7. Now the highest value is reached by $s^{(2)}_{53}$, even if it is only slightly higher than $s^{(2)}_{39}$. Also in this case, only FAP 39 has a correlation with FAP 53 over the threshold, and it results as the only parameter that can be interpolated from FAP 53.

TABLE VII MATRIX R̃^(2) FOR GROUP 2 (PARTIAL), 5, AND 8 (LEFT), AND VALUES OF THE s^(2) COEFFICIENT (RIGHT)

Fig. 7. Graphical representation of matrix R̃^(2) for group 2 (partial), 5, and 8 (white squares indicate high correlation).

Fig. 8. Graphical representation of matrix R̃^(3) for group 2 (partial), 5, and 8 (white squares indicate high correlation).

TABLE VIII MATRIX R̃^(3) FOR GROUP 2 (PARTIAL), 5, AND 8 (LEFT), AND VALUES OF THE s^(3) COEFFICIENT (RIGHT)

The next step is the computation of $s^{(3)}$, as described in Table VIII and Fig. 8. The next parameter to transmit would be FAP 55, with a value only slightly greater than that of FAP 51. Also in this case, as in the first step, we have preferred to transmit FAP 51 (lower_top_midlip_o), since it is highly correlated with FAP 55 and is more easily trackable from real video analysis. FAPs 16 and 17 are the last ones to be transmitted, being maximally uncorrelated with the others.

Some comments are worth drawing about lip protrusion. Since these FAPs describe variations along the $z$ axis (with positive orientation out of the screen) of the midpoints of the upper and lower lip, their values are partially correlated with lip openings and with the movements of the mouth corners (see the complete matrix $\mathbf{R}$). Anyway, experimental results prove that these relations cannot be modeled easily and, in particular, that no linear dependence can be formalized between FAPs 16 and 17, on one side, and the other FAPs of groups 2, 8, and 5, on the other side.

D. Group 3 (Eyeballs, Pupils, Eyelids)

There is no need of experimental evidence to state that the movements of the eyeballs and pupils are maximally correlated, since they are affected by the same rigid motion. Lower eyelids are most of the time static and rarely affected by an almost unperceivable motion. Movements of the upper eyelids, on the other hand, are substantially uncorrelated with eyeballs and pupils. Moreover, as defined in the MPEG-4 specifications through the "left–right interpolation" criterion, all the right FAPs in this group can be interpolated from the corresponding left ones (or vice versa). Based on the previous considerations, we can conclude that only three FAPs (19, 23, and 25) are necessary for animating the entire group; the remaining FAPs are not necessary or can be interpolated as indicated in Table IX. Even their transmission, if needed, can be omitted by simulating eye blinking at the decoder and by synthesizing the movements of the eyeballs based on the parameters encoding the head rotation. In Section V-D, some criteria are explained to synthetically generate FAPs 19, 23, and 25 and, therefore, to completely avoid the transmission of any FAP in group 3.


TABLE IX FAP INTERPOLATION COEFFICIENTS (VALUES OF c) FOR GROUP 3 (TX MEANS TRANSMITTED FAP; EMPTY ROWS INDICATE FAPs NOT NECESSARY FOR TYPICAL ANIMATIONS): 1) INDICATES THAT THE FAP CAN POSSIBLY BE TOTALLY SYNTHESIZED; 2) INDICATES THAT THE FAP CAN BE INTERPOLATED FROM HEAD MOVEMENTS

TABLE X FAP INTERPOLATION COEFFICIENTS (VALUES OF c) FOR GROUP 4 (TX MEANS TRANSMITTED FAP)

V. FAP INTERPOLATION CRITERIA

After having selected the FAPs that optimize the estimation of the non-transmitted parameters (see Section IV), their specific mutual dependencies must be formalized and implemented.

A. Computation of FAP Interpolation Criteria

For the sake of simplicity, let us assume that each of the FAPs to be interpolated is linearly dependent on a single one of the FAPs actually transmitted. As will be evidenced by the experimental results reported in the following, this hypothesis is very close to reality. By comparing the trajectories of the estimated and measured FAPs, it turns out that they are quite similar, except in intervals where the FAP amplitude is very high. It is reasonable to suppose that this effect is due to a nonlinear saturation distortion affecting the estimates, which would be difficult to counter even by modeling the FAP dependences through more complex relations. As an example, let us consider FAP 3 (open_jaw) and FAP 52 (raise_b_midlip_o): when the jaw is completely closed or open, the lower lip still has some residual possibility to move independently of the jaw itself. Despite this annoying but, fortunately, rare and scarcely perceivable distortion, let us formalize the linear inter-FAP dependence as follows:

Fig. 9. Trajectories of FAP 31 (raised left inner eyebrow, solid line) estimated by FAP 33 (raised left middle eyebrow) and its actual value (dashed line).

$$\hat{F}_j(t) = c_{ij}\,F_i(t), \qquad t = 1, \ldots, T \qquad (8)$$

where $F(t)$ represents the value of a FAP at the $t$th frame of a sequence of $T$ frames, $F_j$ is the parameter to be interpolated, and $F_i$ the parameter actually transmitted. The problem consists of determining the interpolation coefficient $c_{ij}$ that minimizes the mean square error (MSE), defined as

$$\mathrm{MSE} = \frac{1}{T}\sum_{t=1}^{T}\bigl(F_j(t) - c_{ij}\,F_i(t)\bigr)^2 \qquad (9)$$

Fig. 10. Trajectories of FAP 37 (squeezed left eyebrow, solid line) estimated by FAP 33 (raised left middle eyebrow) and its actual value (dashed line).

The optimal coefficient $c_{ij}$ results as

$$c_{ij} = \frac{\sum_{t=1}^{T} F_i(t)\,F_j(t)}{\sum_{t=1}^{T} F_i(t)^2} \qquad (10)$$
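Equation (10) follows from setting the derivative of (9) with respect to $c_{ij}$ to zero; numerically, it is a one-line least-squares fit, as in this sketch (the trajectory arrays and their names are illustrative):

```python
import numpy as np

def interpolation_coefficient(Fi, Fj):
    """Optimal c in the sense of (9)-(10): Fj(t) is approximated by c * Fi(t).
    Fi, Fj: equal-length arrays with the two measured FAP trajectories."""
    return float(np.dot(Fi, Fj) / np.dot(Fi, Fi))

# usage (hypothetical arrays): c = interpolation_coefficient(F52, F3)
# then F3_hat = c * F52 reproduces the measured trajectory, up to the
# saturation effects discussed above
```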

B. Group 4 (Eyebrows)

Table X summarizes the values of $c$ computed for the FAPs of group 4. In Figs. 9 and 10, the trajectories of the estimated FAPs (solid line) are compared with the actual measured values (dashed line).

TABLE XI FAP INTERPOLATION COEFFICIENTS (VALUES OF c) FOR GROUPS 2, 5, AND 8 (TX MEANS TRANSMITTED FAP; EMPTY ROWS INDICATE FAPs NOT NECESSARY FOR TYPICAL ANIMATIONS)

Besides the substantially correct reproduction of the trajectories, it must be noticed how, in the case of FAP 37, which is affected by significant acquisition noise, the estimates are somehow a low-pass filtered replica of the original parameters. Instead of degrading the quality of the synthesis, this low-pass filtering has the positive effect of gracefully smoothing the animation, thus making it more natural and realistic.

C. Groups 2, 5, and 8 (Jaw, Chin, Lip Protrusion, Cheeks, Outer Lip, Cornerlip)

Table XI indicates the values of $c$ computed for the FAPs of groups 2, 5, and 8. In Figs. 11-14, the trajectories of the estimated FAPs (solid line) are reported, compared with the actual FAP values (dashed line).

D. Group 3 (Eyeballs, Pupils, Eyelids)

Various studies, both in medicine and in psychology, have measured the typical frequency and duration of eye blinking. In [6], as an example, it is evidenced how this frequency ranges between 10-22 blinks/min, with an average duration of eye closure equal to 50-75 ms. Based on this experimental evidence, it is easy to simulate eye blinking at the decoder by means of FAP 19. Some experiments and subjective evaluations carried out by the authors have proven that, for many applications, it can be acceptable to interpolate FAPs 23 and 25 based only on the head rotation.

Fig. 11. Trajectory of FAP 3 (open jaw, solid line) estimated by FAP 52 (raised bottom midlip outer) and its actual value (dashed line).

This is done in such a way as to keep the gaze of the virtual character as frontal as possible, meeting the eyes of the interacting human, who is supposed to be seated in front of the monitor.


Fig. 12. Trajectory of FAP 57 (raised bottom lip left middle outer, solid line) estimated by FAP 52 (raised bottom midlip outer) and its actual value (dashed line).

Fig. 13. Trajectory of FAP 39 (puffed left cheek, solid line) estimated by FAP 53 (stretched left cornerlip outer) and its actual value (dashed line).

The relations that we have used to interpolate the movements of the eyes from the movements of the head, with gains obtained experimentally, make the eyes counter-rotate with respect to the head so that the gaze stays roughly frontal; a sketch is given below.
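A minimal sketch of how such decoder-side synthesis of group 3 could look is given here: blink onsets on FAP 19 drawn at a plausible rate, and eye rotations (FAPs 23 and 25) counter-rotating against the head. The blink statistics follow [6]; the memoryless blink model, the gains, and the eyelid amplitude convention are our assumptions.

```python
import random

BLINK_CLOSURE_S = 0.06   # 50-75 ms eye closure reported in [6]

def blink_onsets(duration_s, rate_per_min=15.0):
    """Draw blink start times over a clip (memoryless model, an assumption)."""
    t, onsets = 0.0, []
    while t < duration_s:
        t += random.expovariate(rate_per_min / 60.0)
        onsets.append(t)
    return onsets

def group3_faps(t, onsets, head_yaw, head_pitch, gain=1.0):
    """FAP 19 (upper eyelid), FAP 23 (eyeball yaw), FAP 25 (eyeball pitch)
    at time t (seconds). Eyelid value 1.0 stands for 'fully closed' here."""
    fap19 = 1.0 if any(o <= t < o + BLINK_CLOSURE_S for o in onsets) else 0.0
    fap23 = -gain * head_yaw     # counter-rotate the eyes against the head
    fap25 = -gain * head_pitch   # so that the gaze stays roughly frontal
    return fap19, fap23, fap25
```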

VI. EXPERIMENTAL RESULTS

In conclusion, from what has been described in the previous sections, it turns out that a subset of only 10 FAPs, suitably chosen from the complete set of 66 FAPs, can be used to guarantee the efficient encoding of MPEG-4 facial animation sequences: one FAP for the eyebrow movements (FAP 33), six FAPs for mouth and cheek movements (FAPs 16, 17, 51, 52, 53, and 59), and three FAPs for the head rotation (FAPs 48, 49, and 50). As proven experimentally, the remaining FAPs can easily be interpolated from those actually transmitted, or turn out to be superfluous in typical sequences with continuous speech produced by natural faces.

Fig. 14. Trajectory of FAP 41 (lifted left cheek, solid line) estimated by FAP 59 (raised left cornerlip outer) and its actual value (dashed line).

In the opinion of the authors, at least one more FAP, from group 6, should be transmitted to control the movements of the tongue. However, the evident difficulty of tracking the tongue movements has so far prevented its analysis and modeling.

In the remainder of this section, we provide a quality evaluation, both objective and subjective, of the animation obtained by using only this subset of 10 FAPs, compared to what is achievable by exploiting the full set of 46 FAPs captured with the acquisition system described in Section III. For running the experiments, three different sets of FAPs have been generated. The first, set A, has been generated by interpolating the FAP Test Data Set (see the description in Section III) starting from only 10 FAPs, encoded with a quantization scaling factor (qsf) of 1 and then decoded. The second, set B, has been obtained by encoding, and then decoding, the entire Test Data Set with a quantization scaling factor of 16, so as to maintain the same bit rate (around 1.4 kbits/s) associated with set A. The third, set C, has been obtained by encoding, and then decoding, the entire Test Data Set with a quantization scaling factor of 1 and, therefore, with the same quality as set A, but at a higher bit rate. Each of the three sets A, B, and C achieves a frame rate of 25 frames/s. Fig. 15 provides a graphical description of the three sets. The results have been compared to the original Test Data Set before parameter encoding.

In addition to the 30 FAPs actually captured through the acquisition system, the Test Data Set includes some FAPs whose values have been synthesized artificially, like the ten missing FAPs in group 2, those controlling the upper eyelids, and those responsible for the eyeball rotation. The purpose is to simulate a more realistic situation, obtainable in case the complete set of FAPs could be captured through a more sophisticated acquisition system, where the maximum available information for facial animation is used, and to better compare FAP encoding with and without FAP interpolation.

In Table XII, the bit rates achieved for the three sets A, B, and C are reported. Let us notice that the bit rate for set A is significantly lower than for set C, while maintaining high quality in FAP reproduction (as evidenced in Table XIII).

Fig. 15. The three test data sets used in the experiments.

TABLE XII BIT RATES FOR THE THREE TEST DATA SETS

TABLE XIII PSNR FOR SOME FAP OBTAINED WITH THE THREE DIFFERENT CODING SCHEMES, WITH (Y) OR WITHOUT (N) INTERPOLATION

Fig. 16. FAP 35 (raise_l_o_eyebrow) encoded with qsf = 1 and interpolated (the two marked solid lines), and encoded with qsf = 16 (dotted line); note that, though FAP 35 encoded with qsf = 16 has a better PSNR, its step-wise behavior results in a subjectively worse animation than the interpolated FAP.

Table XIII reports the PSNR values computed for a few FAPs, part of them interpolated and part of them transmitted, as specified in the column Interp. The results reported in Table XIII suggest some important considerations. In the case of set A, the interpolated FAPs are obviously characterized by a PSNR lower than in the other two cases. Nevertheless, the temporal trajectories of these FAPs, unlike those of set B, are not substantially affected by quantization distortion. Fig. 16 clearly evidences this phenomenon: when FAP 35 (raise_l_o_eyebrow) is interpolated, its value differs from the actual measure more than in the case of qsf equal to 16.

Fig. 17. FAP 48 (head_pitch) encoded with qsf = 1 (solid line) and encoded with qsf = 16 (dotted line); note that FAP 48 encoded with qsf = 16 has both a worse PSNR and a step-wise behavior compared with FAP 48 encoded with qsf = 1.

However, the step-wise characteristics of the quantization noise turn out to be subjectively more annoying during the animation.
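For reproducibility, per-FAP PSNR values of the kind reported in Table XIII can be computed as in the sketch below; the choice of the trajectory's peak amplitude as the reference level is our assumption, since the paper does not spell out the normalization.

```python
import numpy as np

def fap_psnr(reference, decoded):
    """PSNR (dB) between a measured and a decoded FAP trajectory."""
    reference = np.asarray(reference, dtype=float)
    decoded = np.asarray(decoded, dtype=float)
    mse = float(np.mean((reference - decoded) ** 2))
    peak = float(np.max(np.abs(reference)))   # assumed reference level
    return float("inf") if mse == 0 else 10.0 * np.log10(peak**2 / mse)
```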


TABLE XIV ENCODING PARAMETERS OF THE “WOW” SEQUENCE

A second consideration concerns the 10 FAPs transmitted in the case of set A, whose quality results far higher than in the case of set B, where all the FAPs are subject to coarse quantization. In particular, the movements of the head are among those most sensitive to quantization noise, as shown in Fig. 17: the dark line represents the trajectory of FAP 48 (head_pitch) both in the case of set A and of set C, while the gray line refers to case B. Based on this experimental evidence, it is important to notice that the reproduction of head movements must be sufficiently smooth to avoid severe subjective artifacts, like annoying jerky head motion. Differently from many other movements of the face, the FAPs controlling the head motion must be quantized with very small values of qsf.

The third and last consideration suggested by the analysis of Table XIII concerns the negligible information conveyed by the average PSNR in the mutual comparison of sets A, B, and C. The above considerations on FAP quantization, and the fact that, in set A, the PSNR computed over the interpolated FAPs differs significantly from the PSNR associated with the transmitted FAPs, make the use of the average PSNR as an objective evaluation criterion almost meaningless.

The achieved results evidence how, toward the objective of reducing as much as possible the bit rate for transmitting a FAP stream, it is preferable to employ FAP interpolation rather than to increase the quantization scaling factor. In order to allow a more reliable subjective evaluation of the quality improvements that can be achieved by exploiting FAP interpolation, two movies are available on our web site,1 based on the Facial Animation Engine (FAE) [7] developed at the DSP Lab of DIST. In the first movie, the stream wow.fap (donated by DIST to the MPEG Face and Body Animation Ad Hoc Group) encoded with qsf = 1 (case C) is compared to the same stream encoded at very low bit rate with qsf = 16 (case B). The second movie compares case C with the same stream encoded at very low bit rate by using FAP interpolation (case A). The subjective evaluation is left to the readers. In Table XIV, the characteristics of the sequences are summarized. As a final consideration, it is important to notice how the conventional objective evaluation based on the PSNR,

1http://www-dsp.com.dist.unige.it/~pok/RESEARCH/MPEG/fapintrp.htm

computed on each static frame, has almost no meaning here, since the quality of the synthetic images is good all the time. On the contrary, the increase of the quantization factor significantly affects the movement smoothness which, as evidenced by the movie WowQ.mpg, can only be appreciated and evaluated by playing the video.

VII. CONCLUSION

The cross-correlation analysis between MPEG-4 FAPs reported in this paper, together with the proposed algorithm for FAP interpolation, represents a key reference for any study oriented to exploiting this FAP encoding modality for the efficient transmission of FAP streams. The innovative contribution of this study consists both of the specific technical solution that is proposed and of the experimental evidence that is produced. Up to now, to the knowledge of the authors, no investigation has been reported in the scientific literature on the exploitation of the FAP interpolation modality, nor any concrete proposal of procedural solutions. The experimental results reported here provide a clear indication of the performance level that FAP interpolation can guarantee, and they therefore suggest a large variety of possible applications of MPEG-4 facial animation technologies within Internet-based services, mobile interpersonal communications, etc.

REFERENCES

[1] Text for ISO/IEC FDIS 14496-2 Visual, ISO/IEC JTC1/SC29/WG11 N2502, Nov. 1998.
[2] Text for ISO/IEC FDIS 14496-1 Systems, ISO/IEC JTC1/SC29/WG11 N2501, Nov. 1998.
[3] G. Ferrigno and A. Pedotti, "ELITE: A digital dedicated hardware system for movement analysis via real-time TV signal processing," IEEE Trans. Biomed. Eng., vol. BME-32, pp. 943-950, 1985.
[4] C. Pelachaud, N. Badler, and M. Steedman, "Generating facial expressions for speech," Cog. Sci., vol. 20, no. 1, pp. 1-46, 1996.
[5] A. Papoulis, Probability, Random Variables, and Stochastic Processes. New York: McGraw-Hill, 1984.
[6] J. A. Stern, D. Boyer, D. J. Schroeder, R. M. Touchstone, and N. Stoliarov, "Blinks, saccades, and fixation pauses during vigilance task performance: II. Gender and time of day," FAA Office of Aviation Medicine, Civil Aeromedical Institute, Aviation Medicine Report ADA307024, 1996.
[7] F. Lavagetto and R. Pockaj, "The facial animation engine: Toward a high-level interface for the design of MPEG-4 compliant animated faces," IEEE Trans. Circuits Syst. Video Technol., vol. 9, pp. 277-289, Mar. 1999.


Fabio Lavagetto was born in Genoa, Italy, in 1962. He received the Laurea degree in electrical engineering from the University of Genoa, Genoa, Italy, in March 1987, and the Ph.D. degree from the Department of Communication, Computer and System Sciences (DIST), University of Genoa, in 1992. From 1987 to 1988, he was with the Marconi Group, Genova, Italy, working on real-time image processing. He was a visiting researcher with AT&T Bell Laboratories, Holmdel, NJ, during 1990 and a Contract Professor in digital signal processing at the University of Parma, Parma, Italy, in 1993. Presently, he is an Associate Professor with DIST, where he teaches a course on radio communication systems, and is responsible for many national and international research projects. During 1995–2000, he coordinated the European ACTS project VIDAS, concerned with the application of MPEG-4 technologies in multimedia telecommunication products. Since January 2000, he has been coordinating the IST European project INTERFACE, which is oriented to speech/image emotional analysis/synthesis. He is the author of more than 70 scientific papers in the area of multimedia data management and coding.


Roberto Pockaj was born in Genova, Italy, in 1967. He received the Master's degree in electronic engineering in 1993 from the University of Genova, Genova, Italy, and the Ph.D. degree in computer engineering and computer science from the Department of Communications, Computer and System Sciences (DIST), University of Genova, in 1999. From June 1992 to June 1996, he was a software designer with the Marconi Group, Genova, Italy, working in the field of real-time image and signal processing for optoelectronic applications (active and passive laser sensors). Between 1996 and 2001, he collaborated on the management of the European projects ACTS-VIDAS and IST-INTERFACE, and participated in the definition of the new MPEG-4 standard for the coding of multimedia contents within the "Ad Hoc Group" on Face and Body Animation. He is currently a Contract Researcher at DIST. He has authored many papers on image processing and multimedia management.
