Signal Processing: Image Communication 21 (2006) 531–549 www.elsevier.com/locate/image

A perceptually optimised video coding system for sign language communication at low bit rates

Dimitris Agrafiotis a,*, Nishan Canagarajah a, David R. Bull a, Jim Kyle b, Helen Seers b, Matthew Dye c

a CCR, Department of Electrical and Electronic Engineering, University of Bristol, UK
b Centre for Deaf Studies, University of Bristol, UK
c Department of Brain and Cognitive Sciences, University of Rochester, NY, USA

Received 24 May 2005; received in revised form 17 February 2006; accepted 20 February 2006

Abstract

The ability to communicate remotely through video, as promised by wireless networks and already practised over fixed networks, is as important for deaf people as voice telephony is for hearing people. Sign languages are visual–spatial languages and as such demand good image quality for interaction and understanding. In this paper we first analyse the sign language viewer's eye gaze, based on the results of an eye-tracking study that we conducted, as well as the video content involved in sign language person-to-person communication. Based on this analysis we propose a sign language video coding system using foveated processing, which can lead to bit rate savings without compromising the comprehension of the coded sequence or, equivalently, produce a coded sequence with higher comprehension value at the same bit rate. We support this claim with the results of an initial comprehension assessment trial of such coded sequences by deaf users. The proposed system constitutes a new paradigm for coding sign language image sequences at limited bit rates.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Sign language video coding; Eye tracking; Foveated video coding; Rate control; H.264

* Corresponding author. Fax: +44 117 9545206. E-mail address: d.agrafi[email protected] (D. Agrafiotis).

1. Introduction

Remote communication through the transmission of video over fixed or wireless networks is very important to the deaf community because it allows deaf people to communicate in their own language, sign language. Deaf people using sign language are currently among the most eager buyers of videophones and are equally eager to move on to mobile services [8]. The arrival of mobile video telephony will enable them to communicate anytime/anywhere in their own language, as hearing people have been able to do for some time. Sign languages are visual–spatial languages and as such demand good image quality for interpersonal interaction and mutual understanding. Many video coding systems have focused on the compression of typical video conferencing sequences, where a head and shoulders view of the participant is usually involved. Sign language video includes, in addition, the rapidly moving hands and arms of the imaged signer, resulting in increased bit rate requirements [19].

Moreover, some of the requirements for successful person-to-person video communication in sign language suggested in [8], such as high temporal resolution (at least 12 frames per second (fps)) and a preference for CIF spatial resolution video over QCIF, further emphasise the need for efficient compression, especially at low bit rates. Some sign language video coding systems have focused on providing a practical solution to the problem of sign language telecommunication at very low bit rates by relaxing the video quality requirements and resorting to the use of moving binary sketches or cartoons instead [4,11]. Other systems opt for a region of interest coding approach wherein specific regions of each video frame are treated differently according to their apparent contribution to sign language video quality [16,19]. Most commonly the face and hands are considered more important, and the proposed systems aim at coding these regions with better quality than the background. Such methods are associated with complex segmentation procedures which have to be applied prior to coding, mainly because of the rapidly moving hands and arms of the imaged signer [7,16,17]. Additionally, the resulting video quality assessment, as well as the initial assumptions about quality importance, are based on the apparent predominance of the hands in the construction of signs, which has led to the belief (especially among hearing people) that sign language viewers focus on the larger moving parts of the image—the hands—in order to understand the message.

This paper presents a video coding system for low bit rates adapted to the requirements of sign language (SL) communication. The proposed system is based on the analysis of the needs and characteristics of SL video communication given in Section 2. Central to this analysis is the gaze-tracking study that we conducted in order to characterise the visual attention(1) of sign language viewers and thus obtain some information about the visual importance of the various regions in the sign language video frame. The proposed system is described in Section 3. Coding results obtained with the proposed system follow in Section 4. Results of an initial comprehension and quality assessment are given in Section 5, followed by conclusions and possible further work. The work described in this paper has been partly presented by the authors in [1–3].

(1) That is, the overt visual attention, since attention can be directed towards a location that is not the source of fixation (covert attention).

2. Analysis of sign language video communication

In order to propose a system for SL video communication it is necessary to first analyse the specific case with the aim of finding possible requirements and characteristics which should be fulfilled/exploited by such a system. Consequently we examined the two key components that make up the particular (or any) video communication loop, namely the users/viewers and the video content.

2.1. Sign language video viewers

In order to examine the sign language video viewer's behaviour—how sign language viewers watch/perceive sign language video material—a gaze-tracking study was set up, wherein the viewer's eye gaze was tracked while watching sign language video clips.

2.1.1. Gaze-tracking trials description

Twenty-eight subjects took part in experiments involving the use of an eye-tracking system which recorded the participants' eye gaze while watching four one-minute clips. The clips shown were short narratives signed in British Sign Language (BSL) by an expert signer (a deaf native signer) sitting in front of a plain blue background and wearing plain clothes. All clips were shot in a studio environment and were displayed during the trials in uncompressed CIF format, i.e. with a resolution of 352 × 288 pixels, at 25 fps. The participants included deaf and hearing BSL signers (i.e. interpreters), referred to as experienced viewers hereafter, as well as hearing beginners/non-signers, referred to as naive viewers. The system used for recording the participants' eye gaze [15] consists of a headband with miniature high-speed cameras which provide true gaze position tracking through an operator PC with a DSP card that analyses the captured images at a sampling rate of 250 Hz. Software written and used in the experiments reported results as fixation locations per frame (i.e. the locus at which eye gaze was directed), as well as overall eye movement events (which are either fixations or jumps/saccades). The frame fixation locations were computed as the median of the 10 fixation samples per frame provided by the system.


The trials took place in a darkened room with the participants sitting in front of a 21″ screen (39 × 29 cm active visible area) with the resolution set at 640 × 480 pixels and a viewing distance of 60 cm (Fig. 1). A chinrest was used in order to ensure the best possible gaze position accuracy. Based on the horizontal and vertical average gaze position resolutions of 20 and 40 pixels per visual degree (ppd) respectively reported by the system, and an average gaze position error of less than 0.5° (as stated in the system's specifications), we can estimate the gaze position accuracy of the results to be within 10 pixels horizontally and 20 pixels vertically.
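As a quick sanity check of these numbers, the conversion from angular error to on-screen pixel accuracy is just a multiplication; a minimal Python snippet using the values quoted above:

error_deg = 0.5        # average gaze position error reported for the system (degrees)
ppd_horizontal = 20.0  # horizontal gaze position resolution (pixels per visual degree)
ppd_vertical = 40.0    # vertical gaze position resolution (pixels per visual degree)

# 0.5 deg x 20 ppd = 10 px horizontally, 0.5 deg x 40 ppd = 20 px vertically
print(error_deg * ppd_horizontal, error_deg * ppd_vertical)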


The procedure followed during the experiments consisted of the following steps:

(1) fitting of the headband on the participant,
(2) calibration and validation of the system (based on displayed targets and estimated position),
(3) display of a video clip and recording of eye-gaze data,
(4) assessment of comprehension.

The procedure (excluding fitting of the headband) was repeated for each of the video clips of the experiment.

Fig. 1. Layout and setup used for the gaze-tracking trials.


2.1.2. Gaze-tracking trials results

The outcome of the gaze-tracking trials is graphically illustrated in Figs. 3 and 4. Analysis of the results showed that sign language viewers, excluding the naive viewers, fixate on the face of the signer. In fact, a closer look at the results strongly suggests that sign language viewers concentrate on the mouth of the viewed signer. There were no fixations on the hands except where the hands occlude the mouth. In contrast, naive viewers tend to follow the hands, thus giving a much more spread viewing pattern. The graphs of Fig. 3 show the vertical position of fixations per frame, for 1000 frames (40 s) of clips 1, 2 and 3, with respect to the clip's position on the screen (i.e. vertical position 0 is the top of the clip, not of the screen). Two graphs are shown per clip, one with typical results of experienced sign language viewers (left column) and one with typical results of naive viewers (right column). The horizontal line at vertical position 150 represents a threshold for the vertical location of the face/neck (anything below that corresponds to fixations on the body or hands), as shown in Fig. 2. Looking at Fig. 3 one can see that the experienced viewers (subject 2 is deaf from hearing parents, subject 5 is deaf from deaf parents and subject 3 is a hearing interpreter) never looked at the hands while watching these clips. In contrast, the naive viewers did look at the hands on numerous occasions, as indicated by the multiple crossings of the threshold line. The results are also visualised (Fig. 4) in terms of the distribution of the average location of the overall fixations that were recorded during one whole clip (clip 2) for the same experienced (a) and naive viewers (b). The points shown represent the main regions towards which the visual attention of the viewers was directed while watching the specific clip. These results are superimposed on one frame of the respective test sequences. One can easily recognise the fixed nature of the sign language user's viewing pattern, which seems to concentrate on the area spanned by the mouth of the imaged signer, as opposed to the spread pattern generated by the non-signers. The latter seem to be searching for meaning throughout the moving visual scene and thereby fixate the hands frequently. A more rigorous discussion of these results can be found in [22]. The results obtained from the gaze-tracking trials are in accordance with anecdotal reports from deaf people which indicate that signers maintain their visual attention on the face of the person signing.

Fig. 2. Vertical location limit of the face/neck region corresponding to a fixation location threshold. Fixations below the line are mainly on the body and hands.
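The threshold analysis behind Fig. 3 can be reproduced with very little code. The Python sketch below assumes the eye-tracking output has already been reduced to one vertical fixation position per frame in clip coordinates (0 = top of the clip); the 150-pixel face/neck line is the threshold described above, while the input format and the example values are hypothetical.

def fixation_statistics(vertical_positions, threshold=150):
    """Count frames fixated below the face/neck line and the threshold crossings."""
    below = sum(1 for y in vertical_positions if y > threshold)
    crossings = sum(
        1 for prev, cur in zip(vertical_positions, vertical_positions[1:])
        if (prev > threshold) != (cur > threshold)
    )
    return {"frames_below_face": below,
            "fraction_below_face": below / len(vertical_positions),
            "threshold_crossings": crossings}

# Example: an experienced viewer should give ~0 crossings, a naive viewer many.
trace = [112, 118, 120, 145, 180, 210, 190, 140, 125, 119]  # made-up per-frame values
print(fixation_statistics(trace))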


Fig. 3. Gaze-tracking results for 3 test clips. Left column shows results for 3 experienced sign language viewers (apart from clip 3, for which there are no data for subject 5) while right column shows results for 2 naive viewers.


Fig. 4. Average location of overall fixations for clip 'sign2'. (a) Experienced viewers, (b) naive viewers.

2.2. Sign language video characteristics

Due to specific features of BSL, sign language video displays certain characteristics. As in the case of hearing person-to-person video conferencing, the signer at either end is the main point of focus and as such is located close to the centre of the viewing area. However, in sign language communication the hands (and hence the upper body) as well as the face and shoulders of the signer have to be visible, since they play a key role in the language, requiring an increased field of view for the cameras at each end. As a result a large part of the background is usually visible, which not only is irrelevant for sign language communication but can also be perceived as increased noise (distraction) in the viewers' visual field. Apart from any motion in the background, the main activity in sign language video consists of facial expressions and head/hand changes. In conversation with another person, the signer's position may also change, but not usually to the extent of altering the body shape on screen—i.e. it may rotate but not move around the screen. Face rotations are generally small since eye contact with the signer at the other end is important. These characteristics have an effect on the number of bits generated by each macroblock (MB) of the video frame when coded by a typical hybrid video coder (H.264 [13] is used in this work). The amount of bits generated depends on the amount of activity in that particular MB, the effectiveness of the prediction and the quantisation parameter (QP) used for coding the transform coefficients.

Fig. 5. Typical bit distribution of a coded sign language video frame (QP30).


A typical bit distribution of sign language video is shown in Fig. 5, where the number of bits required to code each macroblock with a QP of 30 in one frame of a (CIF) plain-background sequence is depicted. Each square represents the location of one MB, and the brightness of the square specifies the number of bits spent. It can be seen that MBs corresponding to the position of the hands require a large number of bits, more than any other region.

3. Proposed sign language video coding system

The recorded viewing pattern confirms that the central point of fixation is the mouth and that the rest of the moving image is seen with decreasing acuity. The fact that the hands apparently play an important role in the lexicon but are never fixated suggests that their motion and shape are processed only in peripheral vision. That is where the background is processed too (but probably discarded). A coding approach that follows this visual processing model is that of foveated video coding [6,9,18].

3.1. Foveated processing

Foveated video compression aims to exploit the fall-off in spatial resolution of the human visual system away from the point of fixation in order to reduce the bandwidth requirements of compressed video. The foveation model effectively specifies a relation for the maximum detectable spatial frequency at a point of an image as a function of the coordinates of the fixation point and the viewing distance of the observer from the image [6]. If the viewing distance (normalised with regard to the image width) is known along with the point of fixation for every frame, then one can filter out a number of high spatial frequencies (those greater than the maximum detectable at each specific location) without causing any perceived reduction in quality. For lossy coding (as is the case in this work) we do not need to know the exact viewing distance. Instead we use it as a means of controlling the amount of loss. Moreover, as suggested in [6], we limit the maximum detectable frequency to eight possible values, which results in partitioning the video frame into 8 regions of constant maximum detectable frequency (all regions are defined at a macroblock level, i.e. they are constrained to be the union of disjoint MBs). Hence the only remaining obstacle to applying foveated coding is finding the fixation point, since this normally requires real-time tracking of the viewers' eye gaze [9]. The result of our gaze-tracking study—"sign language viewers fixate mainly on the face"—removes this obstacle for the case of sign language video. The fixation point is known prior to coding and will (almost) always lie on the face of the signer and close to the mouth. There remains the (simpler) need of locating the face of the displayed signer, which is tackled by the face tracking algorithm described in Section 3.2. Once the location of the fixation point is identified and a normalised viewing distance is chosen, the video frame can be partitioned into 8 regions based on their eccentricity (viewing angle with regard to the fixation point) using Eqs. (1) and (2). The derivation of these formulas originates in experimental human contrast sensitivity graphs measured as a function of spatial frequency and retinal eccentricity (for details see [18]):

e_c = (e_2 / (a f)) ln(1 / CT_0) - e_2,          (1)

e(x) = tan^{-1}( d(x) / (N v) ).                 (2)

The parameters in Eq. (1) are the following: f is the spatial frequency in cycles per degree, e_c is the eccentricity in degrees, CT_0 is a minimal contrast threshold constant, a is a spatial frequency decay constant, and e_2 is the half-resolution eccentricity constant. The suggested values for the constants in (1) (as given in [18]) are: a = 0.106, e_2 = 2.3, and CT_0 = 1/64. For a given f, Eq. (1) gives the critical eccentricity e_c from the foveation (fixation) point beyond which the given spatial frequency will be imperceptible. Assuming the viewing configuration of Fig. 6, the eccentricity at each macroblock (measured as the eccentricity at its central pixel) can be found using Eq. (2), wherein N is the width of the image in pixels, d(x) is the Euclidean distance from the fixation point, and v is the viewing distance measured in image widths (note that d(x)/N is the distance measured in image widths).

Fig. 6. Viewing configuration for eccentricity calculations.


Fig. 7. Foveation macroblock maps for two different viewing distances. The fixation point is located at the central pixel of macroblock (5,11), i.e. pixel location (88,174), marked with 0. From left to right: MB map before foveation, foveation MB map with v = 3 and foveation MB map with v = 2.

Knowing the eccentricity of each macroblock we can then classify it into one of the eight regions by comparing its eccentricity value with the critical eccentricity corresponding to each region. Foveated processing produces a map showing the region each MB belongs to in each frame. Fig. 7 shows two such macroblock maps produced for one CIF video frame (288 × 352 pixels, i.e. 18 × 22 macroblocks) for two different viewing distances (v = 3 and v = 2). The radius of the highest acuity (highest priority) region (region 0 in the maps) has to be specified. The foveation map is used either for assigning a different QP to each MB when no rate control is used (variable QP coding), or for redistributing the bits available to each frame among the different priority regions when rate control is used (variable priority rate controlled coding).
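To make the region partitioning concrete, the following Python sketch computes critical eccentricities with Eq. (1), using the constants quoted above, and classifies each macroblock with Eq. (2). The seven spatial frequency levels separating the eight regions and the forced radius of region 0 are illustrative assumptions, not values taken from the paper.

import math

# Constants of Eq. (1) as quoted in the text (taken from [18]).
ALPHA, E2, CT0 = 0.106, 2.3, 1.0 / 64.0

def critical_eccentricity(f):
    """Eq. (1): eccentricity (deg) beyond which frequency f (cycles/deg) is invisible."""
    return (E2 / (ALPHA * f)) * math.log(1.0 / CT0) - E2

def foveation_map(width_mb, height_mb, fix_x, fix_y, v, freqs, region0_radius_mb=2):
    """Classify each 16x16 macroblock into one of 8 regions (0 = highest acuity).

    fix_x, fix_y: fixation point in pixels; v: viewing distance in image widths;
    freqs: 7 decreasing frequency levels separating the 8 regions (assumed values).
    """
    n = width_mb * 16                                        # image width in pixels
    thresholds = [critical_eccentricity(f) for f in freqs]   # 7 increasing eccentricities
    fmap = []
    for my in range(height_mb):
        row = []
        for mx in range(width_mb):
            cx, cy = mx * 16 + 8, my * 16 + 8                # MB centre pixel
            d = math.hypot(cx - fix_x, cy - fix_y)           # distance from fixation point
            ecc = math.degrees(math.atan(d / (n * v)))       # Eq. (2)
            region = next((i + 1 for i, t in enumerate(thresholds) if ecc <= t), 7)
            if d <= region0_radius_mb * 16:                  # force region 0 around the face
                region = 0
            row.append(region)
        fmap.append(row)
    return fmap

# Example: CIF frame (22 x 18 MBs), fixation at pixel (88,174) as in Fig. 7, v = 3.
regions = foveation_map(22, 18, fix_x=88, fix_y=174, v=3,
                        freqs=[32, 24, 16, 12, 8, 6, 4])
print("\n".join(" ".join(str(r) for r in row) for row in regions))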

3.2. Face location/tracking

As mentioned in the previous section, face tracking is necessary for finding the location of the fixation point (which will lie on the face of the imaged signer) as well as the extent of the imaged signer's face on the screen, which determines the size of the highest acuity (priority) foveation region. A large number of face detection/tracking methods exist, a good survey of which can be found in [23]. Methods can be classified into different categories based on the main approach used. We have employed a cascade of such methods based on [14], combined with temporal information, to track the signer's face in a sign language sequence. Skin colour segmentation is first performed in the UV colour space, followed by the hierarchical multiscale approach of [12] applied to pixels classified as skin. The specific approach extracts horizontal edges from the selected regions of the image at a number of resolutions (formed by averaging and sub-sampling) with the aim of identifying six facial components: two eyebrows, two eyes, one nose and one mouth. The result of the two previous procedures is a number of face candidates based on the detection and relative arrangement of the facial features. The face candidates are passed on to a template matching module (after rotation compensation) which verifies true faces based on their correlation with stored templates. If more than one face is verified, knowledge about the sign language video is used to find the one that is most likely to be the signer's face (e.g. the face located closest to the centre, and/or the largest face in the image). This procedure is applied to the whole video frame (except for regions close to the frame borders) at the beginning of the sequence in order to locate the signer's face. In a commercial system this process of first locating the signer's face could easily be replaced by input from the user (for example using a rectangle at the centre of the image and allowing the user to adjust the camera accordingly). Apart from the first frame, the whole process is applied to the MBs corresponding to the signer's face in the previous frame along with a ring of MBs situated around them (tracking). If the algorithm fails (the most likely cause being the hands occluding facial features), then the results of the skin colour module together with the temporal information are used to give the signer's face position. That, along with the fact that signers will maintain eye contact most of the time (meaning that facial features will be visible), makes the whole method very robust. Tracking also makes the algorithm fast, since the cascade of methods employed is only applied to a very limited region of each video frame. The typical performance on a Pentium IV at 2.8 GHz with no optimisations employed is 27 fps. The face location/tracking algorithm is schematically described in Fig. 8.
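A heavily simplified, runnable Python sketch of the tracking idea follows: skin-colour segmentation on the chroma planes and a search restricted to the previous face position plus a ring of macroblocks. The U/V thresholds, the assumption of full-resolution chroma planes, and the reduction of the full cascade (feature detection, template matching) to a skin-blob bounding box are simplifications for illustration only, not the authors' implementation.

import numpy as np

MB = 16  # macroblock size in pixels (chroma assumed at full resolution here)

def skin_mask(u_plane, v_plane, u_range=(100, 130), v_range=(135, 175)):
    """Very rough skin classifier on 8-bit U/V chroma planes (thresholds assumed)."""
    return ((u_plane >= u_range[0]) & (u_plane <= u_range[1]) &
            (v_plane >= v_range[0]) & (v_plane <= v_range[1]))

def track_face_mb(u_plane, v_plane, prev_box=None, ring=1):
    """Return the face bounding box in macroblock units, searching near the previous box."""
    h_mb, w_mb = u_plane.shape[0] // MB, u_plane.shape[1] // MB
    mask = skin_mask(u_plane, v_plane)
    # Fraction of skin pixels in each macroblock.
    skin_mb = mask[:h_mb * MB, :w_mb * MB].reshape(h_mb, MB, w_mb, MB).mean(axis=(1, 3))
    if prev_box is not None:                      # tracking: restrict the search area
        x0, y0, x1, y1 = prev_box
        window = np.zeros_like(skin_mb, dtype=bool)
        window[max(0, y0 - ring):y1 + ring + 1, max(0, x0 - ring):x1 + ring + 1] = True
        skin_mb = np.where(window, skin_mb, 0.0)
    ys, xs = np.nonzero(skin_mb > 0.5)            # macroblocks that are mostly skin
    if len(xs) == 0:
        return prev_box                           # fallback: keep the previous position
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

# Example on a synthetic 288 x 352 frame with a "skin" patch around the face area:
u = np.full((288, 352), 90, dtype=np.uint8); v = np.full((288, 352), 120, dtype=np.uint8)
u[64:160, 144:224] = 115; v[64:160, 144:224] = 150   # fake face region
print(track_face_mb(u, v))                            # -> (9, 4, 13, 9) in MB coordinates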


Fig. 8. Schematic description of the algorithm employed for tracking the face of the imaged signer. The steps followed are: 1—skin colour segmentation, 2—multiscale facial feature detection, 3—template correlation of candidate faces after rotation compensation, 4—detection, 5—region definition for detection in next frame.

3.3. Variable QP coding

We have modified the H.264 reference software (JM) in order to enable variable quantisation based on a given foveation map. The foveation map produced for each frame of the coded sequence, as described in the previous sections, is combined with a given range of QP values (minimum QP/best quality to maximum QP/lowest quality) to produce macroblock regions that are quantised with different step sizes, with the step size getting bigger for regions of higher eccentricity. The outer regions always have their QP increased before inner regions, with the highest QP in the range being assigned to the lowest priority region.

Fig. 9 shows the 8 different foveation regions that correspond to the depicted video frame for a viewing distance v = 3. Region 0 is the highest priority region around the face, the extent of which is provided by the face tracking module. The QP allocations for the 8 priority regions corresponding to a QPmin of 30 and a QPmax of 40–44 are shown in Table 1. Coding with such a variable QP (VQP) incurs only a small overhead due to the coding of the difference in QP values (QPdelta) of MBs lying on region borders. A typical overhead for a CIF sized coded sequence with a QP range of 30–40 was found to be less than 3 kbits/s.


Fig. 9. Priority regions for one video frame (shown on the left).

Table 1
QP allocations for the 8 priority regions

QPmin–max   Region 0   Region 1   Region 2   Region 3   Region 4   Region 5   Region 6   Region 7
30–40       30         31         32         33         34         36         38         40
30–41       30         31         32         33         35         37         39         41
30–42       30         31         32         34         36         38         40         42
30–43       30         31         33         35         37         39         41         43
30–44       30         32         34         36         38         40         42         44
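The allocations in Table 1 are consistent with a simple rule: start from one QP step per region and, for every extra step needed to reach QPmax, increment an expanding suffix of the outer regions (outer regions first, as stated above). The exact rule used by the authors is not spelled out, so the Python sketch below is one plausible reconstruction; it reproduces the values in Table 1.

def region_qps(qp_min, qp_max, num_regions=8):
    """One possible QP allocation rule consistent with Table 1 (a reconstruction).

    Region 0 gets qp_min, region (num_regions - 1) gets qp_max, and any range wider
    than (num_regions - 1) steps is absorbed by incrementing the outer regions first.
    """
    deltas = list(range(num_regions))                 # 0, 1, 2, ..., 7
    extra = (qp_max - qp_min) - (num_regions - 1)     # steps beyond the base range
    for step in range(extra):
        for r in range(num_regions - 1 - step, num_regions):
            deltas[r] += 1                            # widen an expanding outer suffix
    return [qp_min + d for d in deltas]

# Reproduces the rows of Table 1:
for qp_max in range(40, 45):
    print(f"30-{qp_max}:", region_qps(30, qp_max))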

3.4. Variable priority rate controlled (VPRC) coding

Within a rate controlled context it is the rate control algorithm that selects the QP values. This is normally done through the use of a rate model that predicts the number of bits that will be output after coding a macroblock or frame with a specific quantiser, given a measure of the variance of the residual signal (the prediction difference signal) and a specific bit budget allocation. We have enhanced the rate control method of the H.264 JM encoder [10,21] to accommodate the multiple priority levels specified by the foveated processing module. Below we briefly describe the JM rate control algorithm and our modifications. The rate control method adopted by the JM encoder differs from previous approaches in that the QP values are chosen prior to the prediction taking place (i.e. prior to the residual signal being known). This is mainly due to the multiple coding modes available in H.264 for each MB and the fact that their use can be optimised in a rate-distortion sense given a specific QP. The JM rate control instead uses a linear model for predicting the statistics (mean absolute difference—MAD) of the residual of the current basic unit (e.g. frame, slice, or macroblock) based on the MAD of past (co-located) basic units. Once the MAD is predicted, the quadratic model of [5] is employed to find a QP which will lead to a bit stream that adheres to the specific bit budget allocation. The basic unit size defines the number of rate control layers (up to 3). Herein we discuss the case of a basic unit size equal to that of the frame, which results in two rate control layers, the group of pictures (GOP) layer and the frame layer. We also assume use of I and P frames only.
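As a rough illustration of the MAD prediction and quadratic-model step just described, the Python sketch below predicts the current frame MAD with a linear model and solves the quadratic rate model of [5] for the quantisation step (which would then be mapped to an H.264 QP). The model coefficients are assumed, fixed values chosen only to give plausible numbers; in the JM encoder they are re-estimated after every coded frame.

import math

def predict_mad(prev_actual_mad, a1=1.0, a2=0.0):
    """Linear MAD prediction from the previous (co-located) P frame."""
    return a1 * prev_actual_mad + a2

def qstep_from_target_bits(target_texture_bits, mad, c1, c2):
    """Solve the quadratic model R = c1*MAD/Q + c2*MAD/Q^2 of [5] for the step Q > 0."""
    if c2 == 0.0:                               # model degenerates to R = c1*MAD/Q
        return c1 * mad / target_texture_bits
    # R*Q^2 - c1*MAD*Q - c2*MAD = 0  ->  take the positive root of the quadratic in Q
    a, b, c = -target_texture_bits, c1 * mad, c2 * mad
    disc = b * b - 4 * a * c
    return (-b - math.sqrt(disc)) / (2 * a)

# Illustrative numbers only: the predicted MAD and a 6 kbit texture budget give a
# quantisation step; the resulting QP would finally be clipped to within +/-2 of
# the previous frame's QP, as described in the text.
mad = predict_mad(prev_actual_mad=4.2)
qstep = qstep_from_target_bits(target_texture_bits=6000, mad=mad, c1=10000.0, c2=80000.0)
print(f"predicted MAD = {mad:.2f}, quantisation step = {qstep:.2f}")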

Our modified rate control algorithm treats the priority regions defined by the foveation map independently across all operations of both rate control layers. In the first rate control layer a specific bit budget is allocated to the remaining pictures within a GOP, based on the coding rate and the occupancy of the virtual buffer employed for the rate regulation. The GOP layer also assigns a QP—QP_i(1)—to the intra (IDR) and first predicted (P) picture of the ith GOP, based on the average QP assigned to the P pictures of the previous GOP as well as the respective QP_{i-1}(1) allocation made at the beginning of the previous GOP. In our modified rate control this QP assignment is done separately for each priority region, i.e. a QP_{i,j}(1) is assigned to the jth priority region based on the average QP assigned to this region in the P pictures of the previous GOP and the value of QP_{i-1,j}(1). At the second layer the frame bit budget allocation for P frames takes place. There are two stages in the frame bit allocation process. The first stage takes place after coding the first two pictures (IDR, P) of the current GOP and determines a target buffer level for each remaining P picture in the GOP based on the bit usage of these first two frames.


The second stage of the frame bit allocation process takes place at each remaining P frame and determines the bits that will be allocated to the current P picture in the current GOP based on the target buffer level, the frame rate, the available channel bandwidth and the actual buffer occupancy. The actual allocated (target) bits are a weighted combination of the outputs of these two stages. Once a specific number of bits has been allocated to the current frame, the algorithm normally proceeds to compute the frame QP. This involves the use of a linear model which predicts the MAD of the current frame based on the actual MAD of the previous P picture. The quantisation step corresponding to the target bits is then computed using the quadratic model of [5], with the computed QP being restricted to lie within a range of [-2, +2] of the previous frame QP. For variable priority coding we distribute the texture bits allocated to the current frame among the different regions in a way that reflects their relative importance. More specifically, the amount of bits allocated to each region is calculated based on the size of the region, the predicted MAD value of the region and its priority. At a first stage the frame texture bits are distributed to each region based on the normalised size and normalised predicted MAD of each region. Once the initial allocation has taken place, we employ a priority constant P_0 (ranging from 0 to 1) to specify the percentage of texture bits that will be redistributed from the lower priority regions 1 to 7 to the highest priority region 0. The priority of the rest of the regions is then calculated using the following exponential model:

P_j = e^{-j/3} P_0,   for all j = 1, 2, ..., 6.          (3)

Each jth region is assigned a percentage P_j of the remaining texture bits of all lower priority regions. The number of texture bits remaining in each of the lower priority regions is updated before each redistribution step. The QP of each region is then calculated using the region's target bits and the same quadratic model, albeit with region-specific coefficients and predicted MAD values. The QP values for all regions are restricted to lie within a range of [-2, +2] of the QP of the same priority region in the previous frame.
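The redistribution step can be summarised by the following Python sketch. The proportional initial split and the order in which bits are taken from the lower priority regions follow the description above; the exact proportions used inside the authors' encoder are not specified, so treat this as an approximation. Regions 0 to 6 each take their share in turn, and region 7 keeps whatever remains.

import math

def redistribute_texture_bits(frame_texture_bits, region_sizes, region_mads, p0=0.25):
    """Split a frame's texture bit budget over 8 priority regions (0 = most important).

    region_sizes: number of MBs per region; region_mads: predicted MAD per region.
    p0 is the priority constant; priorities P_j = exp(-j/3) * p0 follow Eq. (3).
    """
    n = len(region_sizes)
    # Stage 1: initial allocation proportional to normalised size x predicted MAD.
    weights = [s * m for s, m in zip(region_sizes, region_mads)]
    total_w = sum(weights)
    bits = [frame_texture_bits * w / total_w for w in weights]

    # Stage 2: each region j takes a fraction P_j of the bits still held by the
    # regions of lower priority (higher index), starting with region 0.
    for j in range(n - 1):
        p_j = math.exp(-j / 3) * p0        # equals p0 for j = 0, Eq. (3) for j >= 1
        pool = sum(bits[j + 1:])
        if pool <= 0:
            break
        taken = p_j * pool
        for k in range(j + 1, n):
            bits[k] -= taken * (bits[k] / pool)   # remove proportionally from lower regions
        bits[j] += taken
    return bits

# Example with made-up region sizes/MADs for an 18 x 22 MB CIF frame:
sizes = [12, 20, 30, 40, 50, 60, 80, 104]          # sums to 396 MBs
mads  = [5.0, 4.5, 4.0, 3.5, 3.0, 2.5, 2.0, 1.5]
print([round(b) for b in redistribute_texture_bits(20000, sizes, mads, p0=0.25)])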


4. Coding results

Coding results are given for 4 different CIF resolution clips, a plain-background indoor scene and 3 outdoor scenes, at 25 fps. In all cases we employed the JM 9.5 encoder with 3 reference frames and fast motion estimation with a search range of 16 pixels. Two cases are examined, with and without rate control. The duration of the clips ranges from 20 to 60 s (500–1500 frames). For the case of no rate control, sequences coded with a variable QP (VQP) and a QP range of 30–40 are compared in terms of the resulting bit rate with constant QP (CQP) coded sequences, where the QP assigned to MBs of the highest priority region (face) is the same for both versions (essentially similar quality for the face). Results are also given for CQP coding at similar bit rates to the VQP approach. One intra frame (the first frame) was used, with the rest being P frames. With rate control we compare the performance of our enhanced rate control method that allows variable priority rate controlled coding (VPRC) with that of the rate control (RC) used in the H.264 JM encoder with a basic unit size equal to the whole frame. Results are given for 4 priority levels of region 0 equal to 15%, 25%, 35% and 45%. The comparison is done in terms of the PSNR of the multiple priority regions. The results given also highlight the coding flexibility of our method. IDR and P frames were used, with one IDR frame every 48 P frames.

4.1. Variable QP coding

Table 2 shows that the proposed system with VQP coding leads to significant bit rate savings (around 40%) compared to a standard CQP approach while keeping the quality of the important regions at similar high levels (Fig. 10). The rate savings are large partly because the region that generates the largest amount of bits (the hands) undergoes coarser quantisation, especially when located far away from the face. When closer to the face the hands are coded with higher fidelity (since they enter higher priority regions). This fits nicely with suggestions made in an early paper on the workings of sign language [20], according to which the language has evolved/is evolving to accommodate a face-centred viewing pattern by bringing the hands closer to the face when detailed signs have to be made, leaving mostly gross movements and gestures to take place in regions away from it.

Table 2
Bit rate savings obtained with the variable quality (VQP) approach vs. constant quality (CQP)

Bit rate (kbits/s)   Indoor    Outdoor 1   Outdoor 2   Outdoor 3
CQP                  168.82    289.48      192.33      220.30
VQP                  109.36    170.46      105.20      141.50
Saving (%)           35.2      41.1        45.3        35.7


Fig. 10. One frame from the "indoor" and "outdoor" sequences coded with VQP (30–40)—top—and CQP (30)—bottom—offering similar quality for the highest priority region.

Additionally, when background activity is present (e.g. in outdoor scenes) the proposed approach benefits from spending fewer bits on coding what is (essentially) irrelevant information for sign language communication. Visual comparisons with CQP coding at similar rates can be made by looking at Fig. 11. The proposed approach ensures that the face and areas around it are coded with adequate quality so that comprehension is not hampered.

4.2. Variable priority rate controlled coding

We show representative results collected for the sequence "outdoor 2". Fig. 12 shows average PSNR results for each region with standard rate control and our VPRC at different priorities for all tested bit rates (96–256 kbits/s). It can be seen that the perceptually important regions (regions 0–4) are favoured by VPRC at the expense of regions 5–7, which are less significant for sign language comprehension as they are located further away from the face. The variation in quality is due to the bit redistribution process, the results of which are shown in Fig. 13. On a frame-by-frame basis, Fig. 14 shows the quality improvements offered for the highest priority region by our method (VPRC at 25%) for the duration of two GOPs at a rate of 128 kbits/s.


Fig. 11. One frame from the "outdoor 2" sequence: (a) original, (b) coded with VQP (30–40), (c) coded with CQP (35), (d) detail of facial area with VQP, (e) detail of facial area with CQP. (Rate ≈ 100 kbits/s.)


Fig. 12. ‘‘Outdoor 2’’ average PSNR graphs for each priority region with standard RC and VPRC.


Fig. 13. Bit rate distribution among regions for different bit rates and priorities (‘‘Outdoor 2’’).

The effectiveness of our method is particularly visible at the high-activity frames shown in Fig. 14, where the quality of the perceptually important regions is preserved, in contrast to the standard approach which experiences dips in performance that can hamper comprehension. For the specific priority constant (25%) the minimum region 0 PSNR recorded with VPRC at 128 kbits/s was 26.04 dB, whereas the minimum region 0 PSNR recorded with standard RC was 23.2 dB. Similarly, the respective maximum region 0 PSNR values were 35.63 dB for VPRC and 33.87 dB for the standard RC.

5. Comprehension and quality assessment

An essential aspect of this work is the comprehension "value" of the coded material. In order to assess the effect of the proposed coding approach on the ability to comprehend the coded sign language video, a small trial was set up with 17 deaf participants (8 from deaf parents and 9 from hearing parents) who watched 2 clips (outdoor scene/indoor scene), with and without background activity respectively. Each clip was separated into 3 segments, with each segment being randomly assigned to and coded with a QP range of 30 (i.e. CQP), 30–36 or 30–40, at 12.5 fps (Fig. 15).

The level of comprehension was assessed by asking questions related to the signed content after watching each segment. The trial also aimed to assess the perceived quality while watching and understanding (at the same time) the content of the clips, and to that end participants were asked to rate the clips in terms of quality and blurriness on a scale from 0 to 100. The comprehension results were at ceiling, indicating that the proposed coding approach does not affect understanding even though it introduces losses in quality in peripheral regions. In terms of perceived quality there was no significant difference across clips—the plain-background sequence received a similar rating in all cases, while the outdoor sequence received a slightly lower rating for the 30–40 version. Fig. 16 shows average rating results for quality and blurriness for the three types of coding across all clips.

6. Conclusions

Coding of image sequences will always result in some information being lost in order to satisfy rate requirements set by the network over which transmission will take place. In this paper we have proposed and described a sign language video coding system with which it is possible to localise this information loss in a way that should not impair sign language comprehension.


Fig. 14. Frame results for the ‘‘outdoor 2’’ sequence at 128 kbits/s: (a) PSNR trace for frames 720–815 (2 GOPs), (b) frame 784 coded with standard RC, (c) same frame with VPRC at 25% priority, (d) PSNR trace for frames 1296–1392 (2 GOPs), (e) frame 1313 coded with standard RC, (f) same frame with VPRC at 25%.

The system employs variable quality/priority coding based on foveated processing of the input video frames, which requires tracking of the imaged signer's face in the clip. The proposed approach is based on the outcome of a gaze-tracking study conducted by the authors, which indicates that the visual attention of sign language users is directed towards the face of the viewed signer.


Fig. 15. One frame from the 2 clips used in the comprehension and quality assessment trials. (a,d)—CQP 30, (b,e)—VQP 30–36, (c,f)—VQP 30–40.


Fig. 16. Quality assessment results.

The results presented are very promising, indicating that substantial bit rate savings can be achieved without affecting the comprehension ability of the viewers, as suggested by the comprehension assessment study. Equivalently, when rate control is employed, better perceptual quality for sign language communication can be achieved at a fixed low bit rate compared to standard methods. This work sets a new paradigm for coding sign language image sequences at limited bit rates and opens up a number of possibilities for further work, including sign language adapted error resilience (via unequal error protection) and sign language specific coding complexity reductions.

Acknowledgements

The authors would like to thank the DTI (UK) for funding this work.

References

[1] D. Agrafiotis, N. Canagarajah, D.R. Bull, M. Dye, Perceptually optimised sign language video coding based on eye tracking analysis, IEE Electron. Lett. 39 (24) (November 2003) 1703–1705.

[2] D. Agrafiotis, N. Canagarajah, D.R. Bull, M. Dye, H. Twyford, J. Kyle, J. Chung-How, Optimised sign language video coding based on eye-tracking analysis, in: Proceedings of the Visual Communications and Image Processing Conference (VCIP), vol. 5150, SPIE, 2003, pp. 1244–1252.
[3] D. Agrafiotis, N. Canagarajah, D.R. Bull, J. Kyle, H. Seers, M. Dye, A video coding system for sign language communication at low bit rates, in: Proceedings of the International Conference on Image Processing (ICIP), 2004.
[4] M. Boubekker, Automatic feature extraction for the transmission of American sign language over telephone lines, in: Proceedings of the 10th Annual Conference on Rehabilitation Technology, 1987, pp. 434–436.
[5] T. Chiang, Y.-Q. Zhang, A new rate control scheme using quadratic rate-distortion modelling, IEEE Trans. Circuits Systems Video Technol. (February 1997).
[6] W.S. Geisler, J.S. Perry, A real-time foveated multiresolution system for low-bandwidth video communication, SPIE Proc. 3299 (1998).
[7] N. Habili, C.-C. Lim, A. Moini, Segmentation of the face and hands in sign language video sequences using colour and motion cues, IEEE Trans. Circuits Systems Video Technol. 14 (8) (August 2004) 1086–1097.
[8] ITU-T Q9/16, Draft Application Profile, Sign language and lip-reading real time conversation usage of low bit rate video communication, September 1998.
[9] S. Lee, A.C. Bovik, Fast algorithms for foveated video processing, IEEE Trans. Circuits Systems Video Technol. 13 (2) (February 2003) 149–162.
[10] Z. Li, F. Pan, K.P. Lim, X. Lin, S. Rahardja, Adaptive rate control for H.264, in: Proceedings of ICIP, 2004.
[11] M.D. Manoranjan, J. Robinson, Practical low-cost visual communication using binary images for deaf sign language, IEEE Trans. Rehabil. Eng. 8 (1) (March 2000) 81–88.
[12] J. Miao, B. Yin, K. Wang, L. Shen, X. Chen, A hierarchical multiscale and multiangle system for human face detection in a complex background using gravity-center template, Pattern Recognition 32 (7) (July 1999) 1237–1248.
[13] MPEG-4 Part 10 (ISO 14496-10), Advanced Video Coding.
[14] M. Pickering, et al., A proposal for an automatic face extraction algorithm, ISO/MPEG M5399, Maui, 1999.
[15] SR Research Ltd, The Eyelink System, http://www.eyelinkinfo.com/.
[16] D.M. Saxe, R. Foulds, Robust region of interest coding for improved sign language telecommunication, IEEE Trans. Inform. Technol. Biomed. 6 (4) (December 2002) 310–314.
[17] R. Schumeyer, K. Barner, A colour-based classifier for region identification in video, in: Proceedings of the Visual Communications and Image Processing Conference (VCIP), vol. 3309, SPIE, 1998, pp. 189–200.
[18] H.R. Sheikh, S. Liu, B.L. Evans, A.C. Bovik, Real time foveation techniques for H.263 video encoding in software, in: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 3, 2001, pp. 1781–1784.
[19] R.P. Shumeyer, E.A. Heredia, K.E. Barner, Region of interest priority coding for sign language videoconferencing, in: Proceedings of the IEEE Workshop on Multimedia Signal Processing, 1997, pp. 531–536.
[20] P. Siple, Visual constraints for sign language communication, Sign Language Stud. 19 (1978) 95–110.

[21] G. Sullivan, T. Wiegand, K.-P. Lim, Joint Model Reference Encoding Methods and Decoding Concealment Methods, Document JVT-I049, San Diego, USA, September 2003.
[22] H.E. Twyford, J.G. Kyle, M.W.G. Dye, D.S. Waters, M.A. Canavan, N. Canagarajah, D. Agrafiotis, Watching the signs: eye-gaze during sign language comprehension, J. Deaf Stud. Deaf Educ. (2004), submitted for publication.
[23] M.-H. Yang, D.J. Kriegman, N. Ahuja, Detecting faces in images: a survey, IEEE Trans. Pattern Anal. Machine Intell. 24 (1) (January 2002) 34–57.
