IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 22, NO. 9, SEPTEMBER 2013


Multiview-Video-Plus-Depth Coding Based on the Advanced Video Coding Standard

Miska M. Hannuksela, Member, IEEE, Dmytro Rusanovskyy, Wenyi Su, Lulu Chen, Ri Li, Payman Aflaki, Deyan Lan, Michal Joachimiak, Houqiang Li, Member, IEEE, and Moncef Gabbouj, Fellow, IEEE

Abstract— This paper presents a multiview-video-plus-depth coding scheme that is compatible with the Advanced Video Coding (H.264/AVC) standard and its Multiview Video Coding (MVC) extension. The scheme introduces several encoding and in-loop coding tools for depth and texture video coding, such as depth-based texture motion vector prediction, depth-range-based weighted prediction, joint inter-view depth filtering, and gradual view refresh. The presented coding scheme was submitted to the 3D video coding (3DV) call for proposals (CfP) of the Moving Picture Experts Group standardization committee. When measured with commonly used objective metrics against the MVC anchor, the proposed scheme provides an average bitrate reduction of 26% and 35% for the 3DV CfP test scenarios with two and three views, respectively. The observed bitrate reduction is similar according to an analysis of the results obtained for the subjective tests on the 3DV CfP submissions.

Index Terms— H.264/AVC, three-dimensional video, video coding.

Manuscript received October 15, 2012; revised March 15, 2013; accepted May 31, 2013. Date of publication June 18, 2013; date of current version July 30, 2013. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Gary Sullivan.

M. M. Hannuksela is with Nokia Research Center, Tampere 33720, Finland (e-mail: [email protected]). D. Rusanovskyy was with Nokia Research Center, Tampere 33720, Finland. He is now with LG Electronics (e-mail: [email protected]). W. Su, L. Chen, R. Li, D. Lan, and H. Li are with the University of Science and Technology of China, Hefei 230026, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]). P. Aflaki, M. Joachimiak, and M. Gabbouj are with Tampere University of Technology, Tampere 33720, Finland (e-mail: [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2013.2269274

I. Introduction

Coding of multiview-video-plus-depth (MVD) data [1] facilitates more flexible three-dimensional (3D) video displaying at the receiving or playback devices when compared to conventional frame-compatible stereoscopic video as well as multiview video coding, such as the Multiview Video Coding (MVC) extension of the Advanced Video Coding (H.264/AVC) standard [2]. While coding of two texture views provides a basic 3D perception on stereoscopic displays, it has been discovered that disparity adjustment between the views is needed for adapting the content to different displays and viewing conditions, such as the viewing distance, as well as for meeting individual preferences [3]. Moreover, auto-stereoscopic display technology typically requires displaying a relatively large number of views simultaneously, for which views have to be generated in the playback device from the received views. These needs can be served by the MVD format, using the decoded MVD data as the source for depth-image-based rendering (DIBR) [4]. In the MVD format, each texture view is accompanied by a respective depth view, from which new views can be synthesized using any appropriate DIBR algorithm.

MPEG issued a Call for Proposals (CfP) for 3D video coding technology in March 2011 [5], aiming at standardizing a coding format supporting advanced stereoscopic display processing and improved support for auto-stereoscopic multiview displays. The CfP invited submissions in two categories, the first one compatible with H.264/AVC and the second compatible with the High Efficiency Video Coding (H.265/HEVC) standard [6], which was under development at the time of the CfP. As a result of the CfP evaluation, MPEG and, since July 2012, the Joint Collaborative Team on 3D Video Coding (JCT-3V) [7] have initiated two parallel H.264/AVC-based MVD coding developments, which are briefly described in the following paragraphs.

An MVC extension for the inclusion of depth maps, abbreviated MVC+D, specifies the encapsulation of MVC-coded texture and depth views into a single bitstream [8], [9]. The coding technology is identical to MVC, and hence MVC+D is backward-compatible with MVC, and the texture views of MVC+D bitstreams can be decoded with an MVC decoder. The last technical changes to the MVC+D specification were finalized in January 2013.

Another ongoing JCT-3V development is a multiview video and depth extension of H.264/AVC, referred to as 3D-AVC [10]. This development exploits redundancies between texture and depth and includes several coding tools that provide a compression improvement over MVC+D. The specification requires that the base texture view be compatible with H.264/AVC, while compatibility of the dependent texture views with MVC may optionally be provided. 3D-AVC is planned to be finalized in November 2013.

In this paper, we present Nokia's codec submission [11] to the MPEG 3DV CfP [5], referred to as the Nokia 3DV Test Model or Nokia 3DV-TM in this paper. Nokia 3DV-TM was evaluated as the best-performing submission in the H.264/AVC-compatible category of the MPEG 3DV CfP. Consequently, it was selected as the basis of the initial test model for the MVC+D and 3D-AVC development, where the backward compatibility requirements of MVC+D could be met by configuring the encoder.



The paper is organized as follows. Section II describes the general principles and the architecture utilized for the coding of MVD data with Nokia 3DV-TM as well as the bitstream design of Nokia 3DV-TM. Section III describes the coding tools of Nokia 3DV-TM for texture data, whereas the tools for coding depth map data are described in Section IV. The conditions of the MPEG 3DV CfP, its evaluation procedure, and the results achieved by the proposed coding design are given in Section V, which additionally analyzes the impact of the individual tools on the results. Section VI briefly analyzes the complexity of the tools and explains how the tools have evolved in the 3D-AVC standardization process. Finally, the paper is concluded in Section VII.

II. Proposed Codec Architecture and Bitstream Design

Fig. 1. Example of GVR access units (picture order counts 15 and 45) coded at every other random access point.

A. Design Goals

A goal of the Nokia 3DV-TM development was an MVD coding system that is able to benefit from the wide deployment of H.264/AVC-based video services and from widely available hardware and software implementations of H.264/AVC. Our intent was to allow only a limited number of changes to low-level processing and at the same time obtain a significant compression improvement compared to MVC-compatible coding. In Nokia 3DV-TM, a modified motion vector prediction (MVP) scheme is the only low-level tool introduced on top of the original H.264/AVC technology. The Nokia 3DV-TM encoder can be configured to code a selected number of texture views as H.264/AVC and MVC compatible, while the remaining texture views utilize enhanced texture coding.

B. Codec Architecture and Bitstream Structure

The encoder input and decoder output of Nokia 3DV-TM follow the MVD data format, as detailed in the MPEG 3DV CfP [5]. The encoder codes the input data into a bitstream, which consists of a sequence of access units. Each access unit consists of texture view components and depth view components representing one sampling or playback instant of MVD data. Since the bitrate required for transmission of high-quality texture content is typically significantly larger than the bitrate required for coded depth maps [12], a design concept in Nokia 3DV-TM is to utilize depth data for enhanced texture coding. In particular, a depth view component (D) can be coded prior to the texture view component (T) of the same view and hence used as an inter-component prediction reference for the texture view component. In Nokia 3DV-TM this coding order is used for depth-based motion vector prediction (D-MVP) and joint view depth filtering (JVDF).

3DV-TM supports joint coding of texture and depth that have different spatial resolutions. In particular, coding of depth data is supported at full, half (reduced in the vertical or horizontal direction), and quarter spatial resolution (downsampled in both the vertical and horizontal directions) compared to the resolution of the texture data. To enable coding tools such as D-MVP and view synthesis prediction (VSP), the resolution of the depth map images is normalized to the resolution of the luma texture images. Depth image normalization is implemented as in-loop upsampling with bilinear interpolation. To enable the use of VSP and depth-range-based weighted prediction (DRWP), Nokia 3DV-TM transmits camera parameters and the depth range represented by the depth views as part of the bitstream. These parameters include, for example, the closest and farthest real-world depth values Z_near and Z_far, respectively.

C. Gradual View Refresh (GVR)

Nokia 3DV-TM allows random access into the bitstream with a new type of access unit, referred to as a gradual view refresh (GVR) access unit. This section reviews GVR briefly, while an in-depth analysis of GVR is available in [13].

MVC enables random access through instantaneous decoding refresh (IDR) and anchor access units, which allow only inter-view prediction and disallow temporal prediction (a.k.a. inter prediction). All access units following an IDR or anchor access unit in output order can be correctly decoded. GVR access units are coded in such a way that inter prediction is selectively enabled, and hence a compression improvement compared to IDR and anchor access units may be obtained. When decoding is started from a GVR access unit, a subset of the views in the multiview bitstream can be accurately decoded, while the remaining views can only be approximately reconstructed. The encoder selects which views are refreshed in a GVR access unit and codes these view components in the GVR access unit without inter prediction, while the remaining non-refreshed views may use both inter and inter-view prediction. Accurate decoding of all views can be achieved at a subsequent IDR, anchor, or GVR access unit.

Fig. 1 presents an example bitstream where GVR access units are coded at every other random access point. It is assumed that the frame rate is 30 Hz and that random access points are coded every half a second. In the example, GVR access units refresh the base view only, while the non-base views are refreshed once per second with anchor access units.

When decoding is started from a GVR access unit, the texture and depth view components which do not use inter prediction are decoded. Then, DIBR may be used to reconstruct those views that cannot be decoded because inter prediction was used for them. It is noted that the separation between the base view and the synthesized view is selected based on the rendering preferences for the used display environment and therefore need not be the same as the camera separation between the coded views. Fig. 2 presents an example of the decoder-side operation when decoding is started at a GVR access unit of the bitstream presented in Fig. 1.
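To make the refresh pattern concrete, the following Python sketch models the Fig. 1 schedule and derives which views are exactly decodable after random access. The function, its parameters, and the schedule encoding are our own toy illustration, not part of any specification.

```python
def decodable_views(start_poc, poc, num_views, anchor_period=30):
    """Views that are exactly decodable at picture order count `poc` when
    decoding starts at random access point `start_poc`. Toy model of the
    Fig. 1 schedule: anchor access units (full refresh) once per second,
    GVR access units (base-view refresh only) halfway in between."""
    if poc < start_poc:
        return set()
    if start_poc % anchor_period == 0:
        return set(range(num_views))        # started at an anchor: all views
    # Started at a GVR access unit: only the refreshed base view is exact
    # until the next anchor access unit; the rest are approximated by DIBR.
    next_anchor = (start_poc // anchor_period + 1) * anchor_period
    return set(range(num_views)) if poc >= next_anchor else {0}

print(decodable_views(start_poc=15, poc=23, num_views=3))  # {0}
print(decodable_views(start_poc=15, poc=30, num_views=3))  # {0, 1, 2}
```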

Fig. 2. Decoder operation when starting decoding from a GVR access unit at picture order count 15: frames preceding the GVR access unit are not received, and DIBR substitutes for the non-refreshed views until the next full refresh.

Fig. 3. High-level flow chart of the texture encoder in the Nokia 3DV-TM.

III. Depth-Based Enhanced Texture Coding Tools

A. Introduction

Nokia 3DV-TM was designed with the assumption that most of the compression gain for MVD coding can be achieved from improvements in texture coding, since the bitrate budget for depth maps is usually a minor share of the total MVD bitrate [12]. Therefore, Nokia 3DV-TM includes two texture coding tools that utilize depth information: view synthesis prediction (VSP) and depth-based motion vector prediction (D-MVP). Fig. 3 shows a high-level flowchart of the texture coding in Nokia 3DV-TM with the VSP and D-MVP modules highlighted.

B. View Synthesis Prediction (VSP)

In VSP, an already decoded texture view component is projected according to the camera parameters to the viewpoint of the currently (de)coded dependent view using DIBR, as described in many earlier papers, such as [14]. In Nokia 3DV-TM the projected image is included in the reference picture list(s) and serves as a reference for motion-compensated prediction (MCP). The VSP implemented in Nokia 3DV-TM is similar to that presented in [15]. However, in order to keep the syntax of the macroblock coding layer unchanged, Nokia 3DV-TM does not include specific VSP skip and direct modes as in [15].

There are multiple DIBR implementations available, including different projection and post-processing techniques. The DIBR algorithm of VSP utilized in Nokia 3DV-TM was implemented using the 1D image projection of the MPEG view synthesis reference software (VSRS) [16], [17]. As part of the DIBR algorithm, the depth sample values are converted to disparity vectors, which are rounded to quarter-pixel accuracy, resulting in a virtual image t(x, y) that is horizontally four times the size of the source image s(x, y). In order to be used as a reference picture for MCP, the virtual image t(x, y) is downsampled using the default filter of VSRS before being inserted in the initial reference picture lists at a position subsequent to the temporal and inter-view reference frames. The reference picture list modification syntax was extended to support VSP reference pictures, and thus any ordering of the reference picture lists is allowed.

C. Depth-Based Motion Vector Prediction (D-MVP)

In this sub-section we review H.264/AVC motion vector prediction with the goal of explaining its shortcomings for MVD coding. We then introduce D-MVP, which is a novel feature in Nokia 3DV-TM. The D-MVP scheme is described in more detail in [18].

In H.264/AVC, the motion information associated with each prediction block of a current block (Cb) consists of three components: a reference index (refIdx) indicating the reference picture and two spatial components of the motion vector (MV_x and MV_y). In order to reduce the number of bits required to encode the motion information, the blocks adjacent to Cb are used to produce a predicted motion vector (mvp_x, mvp_y), and the difference between the actual motion information of Cb and the mvp is transmitted. H.264/AVC specifies that the components of the predicted motion vector are calculated as the median of the corresponding motion vector components (MV_x, MV_y) of the neighboring blocks A, B, and C:

$$mvp_x = \mathrm{median}(MV_x(A), MV_x(B), MV_x(C))$$
$$mvp_y = \mathrm{median}(MV_y(A), MV_y(B), MV_y(C)) \qquad (1)$$

where the subscripts x and y indicate the horizontal and vertical components of the motion vector MV, respectively. The layout of the spatial neighbors (A, B, C) utilized in MVP is depicted in the top-left corner of Fig. 4, and the motion vectors of the corresponding blocks are marked accordingly (MV(A), MV(B), MV(C)).

As described in more detail in [18], the median MVP of H.264/AVC is not suitable when more than one prediction direction (inter, inter-view, VSP) is in use, because it operates independently in the horizontal and vertical directions and because the magnitude of the motion vector components can differ greatly between prediction directions. Therefore, in Nokia 3DV-TM we restricted the conventional median MVP of (1) to identical prediction directions: all available neighboring blocks are classified according to the direction of their prediction (temporal, inter-view, VSP), as illustrated by the sketch below.
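The following Python sketch shows the direction-separated median rule. The block and neighbor representations are our simplification, not the JM-based implementation; only neighbors with the same prediction direction as Cb contribute to the median.

```python
import numpy as np

def direction_separated_mvp(cb_direction, neighbors):
    """Median MVP restricted to neighbors (A, B, C) whose prediction
    direction matches that of the current block Cb.
    neighbors: list of (direction, (mv_x, mv_y)) tuples;
    direction is 'temporal', 'inter-view', or 'vsp'."""
    same = [mv for direction, mv in neighbors if direction == cb_direction]
    if not same:
        # No candidate in the matching direction: fall back to the default
        # candidate (for inter-view prediction, the average disparity of (2)).
        return None
    xs = [mv[0] for mv in same]
    ys = [mv[1] for mv in same]
    # Component-wise median as in (1); with fewer than three candidates
    # this degenerates to the value itself or the midpoint, which
    # simplifies the H.264/AVC special cases.
    return (int(np.median(xs)), int(np.median(ys)))

# Cb is inter-view predicted, so the temporal neighbor is ignored.
print(direction_separated_mvp('inter-view',
                              [('temporal', (3, 1)),
                               ('inter-view', (-24, 0)),
                               ('inter-view', (-26, 1))]))  # (-25, 0)
```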

Fig. 4. Flow chart of direction-separated MVP.

For example, if Cb uses an inter-view reference picture, all neighboring blocks which do not utilize inter-view prediction are marked as not available for MVP and are not considered in the median MVP of (1). The flowchart of this process is depicted in Fig. 4 for inter and inter-view prediction; it is applied similarly for VSP in Nokia 3DV-TM.

Furthermore, we introduced a new default candidate vector for the case when inter-view prediction is in use, replacing the zero default of the original H.264/AVC design: if no motion vector candidates are available from the neighboring blocks, MV_x is set to the average disparity $\bar{D}$, which is associated with Cb and computed by (2):

$$\bar{D}(Cb) = \frac{1}{N} \sum_i D(Cb(i)) \qquad (2)$$

where i is the index of a pixel within Cb, D(Cb(i)) is the disparity of pixel Cb(i), and N is the total number of luma pixels in Cb.
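As a concrete illustration, the sketch below converts depth samples to disparities using the transmitted Z_near/Z_far and camera parameters, and averages them as in (2). The inverse-depth quantization (v = 255 at Z_near) is the common MVD convention; the function names and camera numbers are our assumptions for illustration.

```python
import numpy as np

def depth_to_disparity(v, z_near, z_far, focal, baseline):
    """Map an 8-bit depth sample v to real-world depth Z using inverse-depth
    quantization (v = 255 at Z_near), then to a disparity focal*baseline/Z."""
    inv_z = v / 255.0 * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far
    return focal * baseline * inv_z            # = focal * baseline / Z

def average_disparity(depth_block, z_near, z_far, focal, baseline):
    """The average disparity of (2): mean over the block's luma pixels."""
    d = depth_to_disparity(depth_block.astype(np.float64),
                           z_near, z_far, focal, baseline)
    return float(d.mean())

block = np.full((16, 16), 128, dtype=np.uint8)      # a flat 16x16 depth block
print(average_disparity(block, z_near=40.0, z_far=400.0,
                        focal=1000.0, baseline=5.0))
```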

In addition to the generic MVP defined by (1), there are two special modes in H.264/AVC, the Direct and Skip modes. In these modes, the motion vector components are predicted as shown in (1), whereas the minimal reference index used in the neighboring blocks (A, B, C) is selected for Cb. This selection of reference indices favors the prediction direction of the first reference picture in the reference picture list and hence constrains the use of the Direct and Skip modes for multiple prediction directions. We therefore introduced depth-based motion competition (DMC) into Nokia 3DV-TM, as described in the following paragraphs.

The flow chart of DMC in the Skip mode is shown in Fig. 5. In the Skip mode, the motion vectors {MV_i} of the texture data blocks {A, B, C} are grouped according to their prediction direction. The DMC process, which is detailed in the grey block of Fig. 5, is performed for each group independently. For each motion vector MV_i within a given group, we first derive a motion-compensated depth block d(Cb, MV_i), where the motion vector MV_i is applied relative to the position of Cb to obtain the depth block from the reference picture pointed to by MV_i. Then, we estimate the similarity of d(Cb) and d(Cb, MV_i) by computing the sum of absolute differences (SAD) as follows:

$$SAD(MV_i) = SAD(d(Cb, MV_i), d(Cb)) \qquad (3)$$

The MV_i that provides the minimal SAD value within its group is selected as the optimal predictor for that particular direction (mvp_dir). Following this, the predictor in the temporal direction (mvp_temp) is compared to the predictor in the inter-view direction (mvp_inter), and the predictor that provides the minimal SAD is used in the Skip mode.

The MVP for the Direct mode of B slices is very similar to that of the Skip mode, but DMC (marked with grey blocks in Fig. 5) is performed over both reference picture lists (List 0 and List 1) independently. Thus, for each prediction direction (temporal or inter-view) DMC produces two predictors (mvp0_dir and mvp1_dir) for List 0 and List 1, respectively. The SAD values of mvp0_dir and mvp1_dir are computed as shown in (3) and averaged to form the SAD of bi-prediction for each direction independently. Finally, the MVP for the Direct mode is selected from mvp_inter and mvp_temp based on which one produces the smaller SAD, similarly to the Skip mode.

Fig. 5. Flow chart of the DMC for the Skip mode in a P slice.
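A minimal Python sketch of this competition is given below. The helper names, the integer-pel depth fetch, and the dictionary-based grouping are our simplifications under the stated assumptions, not the JM-based implementation.

```python
import numpy as np

def fetch_depth_block(depth_ref, y, x, h, w, mv):
    """Motion-compensated depth block d(Cb, MV): apply MV = (mv_x, mv_y)
    relative to the position of Cb (integer-pel here for simplicity)."""
    yy = int(np.clip(y + mv[1], 0, depth_ref.shape[0] - h))
    xx = int(np.clip(x + mv[0], 0, depth_ref.shape[1] - w))
    return depth_ref[yy:yy + h, xx:xx + w]

def dmc_select(d_cb, depth_refs, groups, y, x):
    """groups: candidate MVs per prediction direction, e.g.
    {'temporal': [...], 'inter-view': [...]}; depth_refs holds the
    corresponding reference depth pictures. Returns (direction, mv, sad)."""
    h, w = d_cb.shape
    winners = []
    for direction, mvs in groups.items():
        best = None
        for mv in mvs:
            d_pred = fetch_depth_block(depth_refs[direction], y, x, h, w, mv)
            sad = int(np.abs(d_pred.astype(int) - d_cb.astype(int)).sum())
            if best is None or sad < best[2]:
                best = (direction, mv, sad)           # Eq. (3) per candidate
        if best is not None:
            winners.append(best)                      # mvp_dir of this group
    # mvp_temp competes against mvp_inter: the smaller SAD wins.
    return min(winners, key=lambda t: t[2]) if winners else None

depth = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
d_cb = depth[16:32, 16:32]
print(dmc_select(d_cb, {'temporal': depth, 'inter-view': depth},
                 {'temporal': [(0, 0), (4, 0)], 'inter-view': [(-8, 0)]},
                 y=16, x=16))  # ('temporal', (0, 0), 0)
```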

IV. Depth Coding Tools

A. Introduction

The following depth coding tools were included in Nokia 3DV-TM: joint view depth filtering (JVDF), intended to improve the fidelity of depth maps across views; depth-range-based weighted prediction (DRWP), which utilizes transmitted depth range parameters for deriving the weighted prediction parameters implicitly; and VSP, which operates as described above. The interaction of these tools with the other depth coding blocks is illustrated in Fig. 6. VSP has been described in Section III.B; JVDF and DRWP are described in the next sub-sections.

B. Joint View Depth Filtering (JVDF)

The main idea of JVDF is that depth map filtering can utilize the redundancy of the multi-view depth map representation: the depth maps of all available N viewpoints are filtered jointly.


JVDF attempts to make the depth maps of the same time instant consistent across views and hence to remove depth estimation and coding errors. JVDF is similar to, but somewhat simpler than, the approach proposed in [19]. A detailed description and simulation results for JVDF can be found in [20], while a brief description of the JVDF algorithm is presented next.

All available depth maps are first warped to a single view m. Since warping results in multiple estimates of a noise-free depth map value at a spatial location (x_m, y_m), we select the samples among which the filtering is carried out. We assume that the depth value Z_m of view m is relatively accurate, and therefore a correctly projected depth value Z_i from another view that describes the same object should be close in value to Z_m. The classification of similarity is defined through a confidence range, i.e., a threshold T on the absolute difference between Z_i and Z_m. Depth values Z_i for which the absolute difference exceeds the threshold T are excluded from the joint filtering, whereas the other depth values at location (x_m, y_m) are averaged in order to produce a "noise-free" estimate, as sketched below. The produced "noise-free" estimate of the depth value is then warped back to the corresponding views that participated in the joint filtering. Depth map values which were found to be outliers remain unchanged.
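The per-pixel averaging step can be sketched as follows, assuming the warping has already produced the candidate values from the other views. This is our illustration of the selection rule, not the 3DV-TM implementation; the warping itself is omitted.

```python
def jvdf_filter_pixel(z_m, z_projected, threshold):
    """z_m: depth of view m at (x_m, y_m); z_projected: depth values warped
    from the other views to the same location. Candidates outside the
    confidence range |Z_i - Z_m| <= T are excluded as outliers; the rest
    are averaged into a "noise-free" estimate."""
    candidates = [z_m] + [z for z in z_projected if abs(z - z_m) <= threshold]
    return sum(candidates) / len(candidates)

print(jvdf_filter_pixel(100.0, [102.0, 97.0, 160.0], threshold=8.0))
# -> 99.67; the value 160.0 exceeds the confidence range and is excluded
```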

Fig. 6. High-level flow chart of the depth encoder in the Nokia 3DV-TM.

C. Depth-Range-Based Weighted Prediction (DRWP)

Since every frame of a depth map sequence may be created with different Z_near/Z_far values, the same actual depth Z can be represented by different depth map values. To compensate for this mismatch, we introduced a novel coding tool, DRWP, which can be utilized to produce the weights for the weighted prediction process of H.264/AVC. The weighted prediction of H.264/AVC is implemented as shown in (4):

$$v_2 = v_1 \cdot W + \mathit{Offset} + 0.5 \qquad (4)$$

In DRWP, the weighted prediction parameters W and Offset are computed as follows:

$$W = \frac{Z_{far1} - Z_{near1}}{Z_{far2} - Z_{near2}} \cdot \frac{Z_{near2} \cdot Z_{far2}}{Z_{near1} \cdot Z_{far1}} \qquad (5)$$

$$\mathit{Offset} = \frac{255 \cdot Z_{far2} \, (Z_{far2} - Z_{far1})}{Z_{far1} \, (Z_{far2} - Z_{near2})} \qquad (6)$$

where variables with subscript 1 represent the parameters of the currently coded/decoded depth image and variables with subscript 2 represent the parameters of the reference depth map image.
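The derivation of (4)–(6) is straightforward to express in code. The floating-point sketch below is ours for clarity; as discussed in Section VI, 3D-AVC later adopted a fixed-point formulation [32].

```python
def drwp_params(z_near1, z_far1, z_near2, z_far2):
    """Weight and offset of (5) and (6): subscript 1 is the current depth
    picture, subscript 2 the reference depth picture."""
    w = ((z_far1 - z_near1) / (z_far2 - z_near2)) * \
        ((z_near2 * z_far2) / (z_near1 * z_far1))
    offset = (255.0 * z_far2 * (z_far2 - z_far1)) / \
             (z_far1 * (z_far2 - z_near2))
    return w, offset

def drwp_predict(v_ref, z_near1, z_far1, z_near2, z_far2):
    """Weighted prediction of (4); truncation of the +0.5 term rounds
    non-negative values to the nearest integer."""
    w, offset = drwp_params(z_near1, z_far1, z_near2, z_far2)
    return int(v_ref * w + offset + 0.5)

# When the depth ranges are identical, the prediction is the identity mapping.
print(drwp_predict(128, 40.0, 400.0, 40.0, 400.0))  # -> 128
```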

V. Coding Conditions and Results

A. Introduction

In this section, we present the coding and view synthesis performance results of Nokia 3DV-TM. The results are based on the conditions of the MPEG 3DV CfP [5], which are described in sub-section V.B. The Nokia 3DV-TM encoding and view synthesis settings used to fulfill the requirements of the MPEG 3DV CfP are presented in sub-section V.C. Objective and subjective performance results are presented in sub-sections V.D and V.E, respectively. Sub-section V.F analyzes the tool-wise performance.

B. MPEG 3DV CfP

Eight test sequences were used, as summarized in Table I. As can be observed from the table, four of the sequences have a 1080p resolution, while the other four have a resolution of 1024 × 768. Six of the sequences were captured with a multi-camera setup, while two sequences were fully or mostly computer-generated.

Bitstreams were encoded for two test scenarios: the 2-view scenario (C2), providing disparity adjustment capability for stereoscopic displays, and the 3-view scenario (C3), additionally providing view synthesis capability for multiview autostereoscopic displays. The input views selected for encoding in the two scenarios are presented in Table III. In order to test the operation of the codec over a wide range of operation points, bitstreams were generated for four bitrates, R1 to R4, listed in Table II.

The JMVC reference software [21] of MVC was used as the anchor for comparisons in the CfP. Texture views were coded as one MVC bitstream and the depth views were coded as another MVC bitstream. The anchor bitstreams follow the same constraints imposed on the proposals. MPEG VSRS was used for DIBR for the anchor encoding. For further information, the anchor bitstreams as well as the configuration files are available on-line [16].

Two viewing environments, a stereoscopic glasses-based one and an autostereoscopic one, were used in the subjective evaluations organized by MPEG. Stereoscopic viewing was performed on a 46" stereo display with passive glasses, while in the autostereoscopic viewing 28 views were displayed simultaneously on a 52" panel. The subjective test setup is described in [5].


TABLE I. MPEG 3DV CfP Test Sequences

TABLE II. Bitrates for Rate Points R1 to R4 of C2 and C3

TABLE III. Input Views for Coding, Generated Stereo Pairs for Viewing, and Camera Separation for Synthesized Views (for C3)

C2 was tested only with the stereoscopic display, while C3 was tested with both the stereoscopic and the autostereoscopic display. Synthesized views were generated from the decoded texture and depth views as summarized in Table III. The displayed views for C2 are also indicated in the "Stereo pair" column of Table III. As can be observed, the displayed stereo pair consisted of one decoded view and one synthesized view, which is a reasonable assumption for disparity-adjusted stereoscopic viewing.

For C3 and stereo viewing, views were synthesized at regular intervals as presented in Table III. Out of the synthesized views, one fixed stereo pair and, for the lower-resolution sequences, also one randomly selected stereo pair were evaluated, as indicated in Table III. For C3 and autostereoscopic viewing, 28 adjacent views were randomly selected from all the synthesized and coded views. The same random selections were made for all submissions, but the proponents were not aware of the random selections at the time of the submission in order to avoid tuning of the encoding settings to favor certain views.

C. Encoding and View Synthesis Settings for Nokia 3DV-TM

The Nokia 3DV-TM coding tools as well as the high-level syntax and codec operation were implemented on top of the JM 17.2 reference software [22] of H.264/AVC.

In C3, the PIP inter-view prediction structure was used, in which the central view is the base view used as the inter-view reference for coding the two side views. In C2, the non-base view was inter-view-predicted from the base view. As governed by the CfP, the random access period was set to 12 and 15 for 25-Hz and 30-Hz sequences, respectively. GVR was used at every other random access point.

Quantization parameter (QP) cascading was used between views, i.e., the side views were quantized more coarsely, with a QP value increase of 3 compared to the QP value of the base view. Subjective testing was performed to verify that the selected inter-view QP cascading was preferred over the other tested options for selecting QP values across views [23]. A dyadic inter prediction hierarchy was used. Temporal QP cascading was used for the different temporal levels of the inter prediction hierarchy, as proposed in [24], i.e., intra frames were coded with a certain QP b, while pictures at temporal level n ≥ 1 were coded with QP b+3+n. The value b was kept unchanged for the entire coded view sequence. The same inter-view and GOP prediction patterns as well as the same QP cascading scheme were used for both texture and depth. The QP values were selected manually to match the target bitrates, with an emphasis on the selection of the texture QP value and with the depth QP used as a mechanism for finer-granularity bitrate matching.

Depth views were coded at quarter resolution relative to their original resolution listed in Table I. To produce the reduced-resolution depth map data, the linear 13-tap filter specified in the JSVM reference software [25] for the Scalable Video Coding extension of H.264/AVC was utilized. After decoding, the depth map data is up-sampled back to the original resolution with a simple 2-tap bilinear filter. However, more advanced upsampling approaches may be used in order to preserve depth contours and yield a better subjective and objective quality [26].

The spatial resolution of the texture views was automatically selected from three options: full resolution (as listed in Table I), ¾ resolution, and ½ resolution horizontally and vertically. For each spatial resolution, the mean square error (MSE) of the reconstructed image upsampled to full resolution was calculated against the original image. The resolution providing the smallest MSE value was selected; if two resolutions provided approximately equal MSE within a 5% margin, the smaller resolution of the two was selected, as sketched after this sub-section. Later on, we studied a resolution selection algorithm based on frequency analysis, which provided subjectively better resolution selection, particularly for the 1080p test sequences [27].

The presented enhanced texture coding tools were used for all non-base texture views. Hence, non-base depth view components preceded the respective texture view components in coding/decoding order. The presented depth coding tools were used, except for VSP for depth, which was turned off mainly to reduce execution times.

The 1D parallel mode of the MPEG VSRS [16], [17] was used for DIBR at the post-processing stage. The same VSRS configuration files and camera parameters as those used for the anchor encoding were utilized as such [16].
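The resolution selection rule can be sketched as follows. The `reconstruct_at` callback is a placeholder standing in for a full encode-decode-upsample cycle at the given scale; the function and the toy stand-in are our assumptions for illustration.

```python
import numpy as np

def select_resolution(original, reconstruct_at, candidates=(1.0, 0.75, 0.5)):
    """candidates are scale factors; reconstruct_at(scale) must return a
    full-resolution reconstruction of `original` coded at that scale.
    Pick the smallest MSE, preferring the smaller resolution whenever it
    is within a 5% margin of the best."""
    mses = {}
    for s in candidates:
        rec = reconstruct_at(s)
        mses[s] = float(np.mean((original.astype(float) - rec.astype(float)) ** 2))
    best = min(candidates, key=lambda s: mses[s])
    for s in sorted(candidates):              # smallest resolution first
        if mses[s] <= 1.05 * mses[best]:      # within the 5% margin
            return s
    return best

# Toy stand-in: pretend coding at lower resolution adds more distortion.
img = np.random.rand(64, 64) * 255
fake_codec = lambda s: img + np.random.randn(64, 64) * (3.0 / s)
print(select_resolution(img, fake_codec))
```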

TABLE IV. Coding Efficiency of Nokia 3DV-TM vs. Reference MVC Coding in Terms of Bjontegaard Metrics

Fig. 7. An example of the subjective viewing results from the MPEG 3DV CfP evaluation: average subjective rating versus bitrate (kbps). The average subjective ratings of Nokia 3DV-TM (the C2 case, sequence S06) are compared against the reference MVC (H.264/MVC) results.

TABLE V. Bitrate Saving of Nokia 3DV-TM vs. Reference MVC Coding Obtained from the Mean Opinion Score

D. Objective Results

The objective quality of the decoded views of the Nokia 3DV-TM bitstreams was measured against the MVC reference encoding (anchor), where the texture views were coded as one MVC bitstream and the depth views were coded as another MVC bitstream. The commonly used Bjontegaard delta bitrate (dBR) and Bjontegaard delta peak signal-to-noise ratio (dPSNR) [28] metrics were applied in comparing the rate-distortion (RD) curves of Nokia 3DV-TM against the reference MVC results. These metrics were produced by taking into account the total bitrate required for the MVD data transmission and the PSNR results for the luma component of the decoded texture views; a sketch of the dBR computation is given at the end of this sub-section. The obtained results are presented in Table IV.

As can be observed from Table IV, Nokia 3DV-TM outperformed MVC coding by a clear margin. The RD improvement compared to MVC in C2 is smaller than that in C3, because the coding tools in Nokia 3DV-TM improve particularly the coding efficiency of the non-base texture views.

It should be noted that the perceived quality obtained with MVD coding is not limited to the quality of the decoded texture views, although that quality serves as a clear indicator of the performance of the coding technology; the quality of the synthesized views should also be taken into account. However, for most of the synthesized views there is no original data available, and hence full-reference objective quality metrics, such as PSNR, are not applicable as such. Therefore, this component of MVD coding (i.e., the quality of synthesized views) was evaluated through a large-scale formal viewing procedure arranged by MPEG. It is also noted that in an analysis of the subjective evaluation results of the MPEG 3DV CfP it was discovered that the PSNR of the decoded view had the highest correlation with the subjective ratings of displayed stereo pairs consisting of one decoded view and one synthesized view [29]. Consequently, the results of Table IV can be considered indicative of the subjective quality improvement of the synthesized stereo pairs too.
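The following Python sketch follows the cubic-fit method of [28] for computing dBR: fit a third-order polynomial to log-bitrate as a function of PSNR for each codec, integrate the difference over the overlapping PSNR range, and express the average bitrate difference in percent. The sample rate-PSNR points are illustrative, not data from the CfP.

```python
import numpy as np

def bd_rate(rates_anchor, psnr_anchor, rates_test, psnr_test):
    """Bjontegaard delta bitrate [28]; a negative result means the test
    codec needs less bitrate than the anchor at equal PSNR."""
    fit_a = np.polyfit(psnr_anchor, np.log(rates_anchor), 3)
    fit_t = np.polyfit(psnr_test, np.log(rates_test), 3)
    lo = max(min(psnr_anchor), min(psnr_test))   # overlapping PSNR range
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(fit_a), hi) - np.polyval(np.polyint(fit_a), lo)
    int_t = np.polyval(np.polyint(fit_t), hi) - np.polyval(np.polyint(fit_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0

print(bd_rate([400, 700, 1100, 1800], [34.0, 36.1, 38.0, 39.8],
              [300, 520, 820, 1350], [34.1, 36.2, 38.1, 39.9]))
```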

E. Subjective Results

A detailed review of the MPEG 3DV CfP test arrangement and results is provided in [30], while the results for Nokia 3DV-TM compared to MVC are analyzed in greater detail in this paper. The Double Stimulus Impairment Scale (DSIS) test method was used in the subjective testing with 11 quality levels, where 10 indicates the highest quality and 0 indicates the lowest quality.

In order to provide results that are intuitively comprehensible as well as comparable to the dBR results presented in the previous sub-section, we analyzed graphs of the mean opinion score (MOS) plotted against the target bitrate. An example of such a curve is provided in Fig. 7 for the C2 case. In the figure, the MOS points obtained from the subjective testing are piecewise linearly connected. For the MOS range that overlaps between the curves, the bitrates of the MVC curve were compared to the bitrates of the Nokia 3DV-TM curve at an equal MOS value; the sketch below illustrates the calculation. The bitrate saving averaged over the overlapping MOS range is presented in Table V. It can be observed that the objective bitrate results presented in Table IV are approximately aligned with the bitrate saving results at equal MOS values.
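A minimal sketch of this equal-MOS comparison follows; the interpolation grid and the sample points are our own illustration, not the CfP data, and MOS is assumed to increase monotonically with bitrate.

```python
import numpy as np

def mos_bitrate_saving(r_anchor, mos_anchor, r_test, mos_test, steps=100):
    """Average relative bitrate difference at equal MOS over the
    overlapping MOS range; negative means the test codec saves bitrate."""
    lo = max(min(mos_anchor), min(mos_test))
    hi = min(max(mos_anchor), max(mos_test))
    mos_grid = np.linspace(lo, hi, steps)
    # Invert the piecewise-linear curves: bitrate as a function of MOS.
    ra = np.interp(mos_grid, mos_anchor, r_anchor)
    rt = np.interp(mos_grid, mos_test, r_test)
    return float(np.mean((rt - ra) / ra) * 100.0)

print(mos_bitrate_saving([500, 700, 900, 1100], [3.1, 4.6, 6.0, 7.2],
                         [500, 700, 900, 1100], [4.4, 5.9, 7.1, 8.0]))
```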


TABLE VI. Estimated Tool-Wise Average Bjontegaard Bitrate Reduction in Nokia's 3DV CfP Response (%)

TABLE VII. Bjontegaard Delta Bitrate Impact of VSP and D-MVP (%)

There are sequence-wise differences, such as for the 2-view coding of sequence S04, where the subjective scores of Nokia 3DV-TM were slightly inferior to those of MVC, whereas the objective coding results indicate that Nokia 3DV-TM clearly outperformed MVC for the 2-view coding of S04. We suspect that this behavior is due to RD optimization in the encoder causing some residual blocks to be left uncoded, creating a clearly visible temporal trail. This problem was later fixed by an improved texture view resolution selection algorithm [27].

F. Objective Results Per Coding Tool

This sub-section summarizes the contribution of the main tools of Nokia 3DV-TM to the objective quality improvement relative to the MVC anchor of the MPEG 3DV CfP. Table VI presents a summary of the tool-wise estimated coding gain in dBR for the MPEG 3DV CfP coding conditions, while the next paragraphs provide more details on how the estimates were obtained. A more detailed analysis of the most substantial coding tools and encoding methods has been provided in [13], [18], [20], [27]. It can be observed that the total RD improvement reported in Table VI does not add up to the RD improvement reported in Table IV. The remainder of the RD gain can be considered to result from various encoder algorithms and configurations.

VSP and D-MVP were tested using the common test conditions [31] of the JCT-3V by turning these tools off individually. The results of this experiment are reported in Table VII. Even though the common test conditions differ from the conditions we used in the submission to the MPEG 3DV CfP, for example when it comes to the QP settings, it can be estimated based on these results that the depth-based texture coding tools provided roughly a 10–15% dBR reduction in C3.

The impact of the depth coding tools, namely JVDF and DRWP, was analyzed as follows. As presented in [20], JVDF applied to full-resolution depth maps provides a noticeable coding gain (up to 10% dBR) for the noisy depth maps of natural test sequences. However, taking into account the low-resolution depth map coding used in Nokia 3DV-TM and the small share of the depth bitrate in the total MVD bitrate (about 11% in Nokia's response to the MPEG 3DV CfP), the impact of JVDF in Nokia's CfP response can be estimated to be up to 1% dBR of the total bitrate of the texture and depth views. DRWP is applicable only if the depth range varies during the coded sequence, and among the sequences in the CfP only GT_Fly (S04) is such a sequence. The impact of DRWP for GT_Fly was measured to be 5.6% dBR of the coded depth views, which corresponds to 0.15% dBR of all coded views [32]. In conclusion, as the depth bitrate share was small in Nokia's CfP response and as the underlying coding technology remained the same as in the MVC anchor, the depth coding tools (i.e., JVDF and DRWP) have a minor impact on the overall results presented in sub-sections V.D and V.E.

GVR was analyzed in [13], where it was concluded that an RD gain was achieved for 3 out of the 8 test sequences in the MPEG 3DV CfP. The average texture coding gain of GVR over all 8 test sequences was 1.4% and 3.0% dBR in C2 and C3, respectively, and the estimated impact of the respective depth coding gain on the total bitrate was 0.1% and 0.2% dBR in C2 and C3, respectively.

VI. Discussion on Complexity and Standardization Development

A. Introduction

Nokia 3DV-TM was intended to include coding tools and coding approaches that have significant potential in terms of coding efficiency, while the optimization and finer tuning of these coding tools were expected to be conducted during the collaboration phase of the standardization process. With this approach, Nokia 3DV-TM provided a solid starting point for collaboration, and it was selected as the initial basis for the 3DV-ATM reference test model [33]. The later development within MPEG and JCT-3V has resulted in rigorous studies of the complexity-performance tradeoffs for the tools in 3DV-ATM, complexity-optimized solutions, and improved compression efficiency. In this section, we briefly describe the relevant tools in terms of their complexity, the complexity optimization techniques, and the evolution of the presented coding tools within the scope of the 3D-AVC specification.

B. Depth-Based Enhanced Texture Coding Tools

The original design of VSP in Nokia 3DV-TM was implemented with a forward view synthesis approach (F-VSP) [15]. This process is considered to be demanding in terms of memory use and processing power. For example, the subpixel-based processing of F-VSP and the hole and occlusion handling, which are inherent parts of F-VSP, significantly increase memory access rates and prevent the block-based processing concept that is typically utilized in state-of-the-art video coding systems. The analysis provided in [34] showed that in-loop F-VSP is the most computationally demanding module of the 3DV-ATM, requiring about 30–40% of the total decoding time. This was considered unacceptable, and hence the original F-VSP design was replaced by a backward VSP approach (B-VSP) [34] utilizing the depth-first coding order for non-base views. In B-VSP, the depth view component of the current non-base view is used to derive a block-wise disparity to obtain a prediction block from an adjacent texture view component. This design is aligned with the conventional MCP of H.264/AVC and was found to provide compression efficiency comparable to F-VSP [34].

The original D-MVP design described in this paper consists of two conceptual modules, direction-separated MVP (DS-MVP) and depth-based motion competition (DMC). The complexity of DS-MVP can be considered very close to that of the original MVP design in H.264/AVC: the only change is the computation of a disparity vector, which is required if no other candidate is available. The disparity derivation described in this paper specifies computing the average disparity over Cb as shown in (2). However, the spatial redundancy of depth information allowed a simple sub-sampling approach to replace the averaging procedure of (2), as proposed in [32]. The complexity of DMC, in turn, can be regarded as significant, since a SAD operation between two depth blocks has to be performed at the decoder side for up to three pairs of blocks. Therefore, DMC was replaced during the 3D-AVC standardization by the simpler depth-based MVP process for the Skip and Direct modes proposed in [35].

C. Depth Coding Tools

In terms of the number of operations, the complexity of JVDF can be considered insignificant. JVDF processing can be estimated to require seven operations per pixel, which is significantly lower than the average complexity of H.264/AVC interpolation, for example. However, the pixel-wise operation and the relatively large number of required memory accesses made this tool relatively complex, similarly to F-VSP. These facts and the fairly small coding gain led to excluding JVDF from the normative part of 3D-AVC. However, JVDF remains a part of 3DV-ATM as a non-normative pre-processing and post-processing tool.

The complexity of DRWP is considered negligible, since it introduces no changes to the block-level processing of H.264/AVC. However, the original implementation of DRWP described in this paper utilizes the floating-point calculations (4)–(6), which are performed at the slice level. As floating-point operations may be rounded differently in different computing systems and are computationally more demanding than fixed-point operations, an implementation of (4)–(6) in fixed-point arithmetic was proposed in [32] and was adopted into 3D-AVC.

D. Gradual View Refresh

Finally, GVR is included in 3DV-ATM as a non-normative tool. The use of GVR can be signaled with supplemental enhancement information messages, as explained in [33]. GVR does not involve additional coding tools, and hence its complexity impact can be considered negligible.


VII. Conclusion

This paper described the Nokia 3D video coding test model (Nokia 3DV-TM), which was found to be the best-performing submission in the H.264/AVC-compatible category of the call for proposals (CfP) on 3D video coding technology organized by the Moving Picture Experts Group (MPEG). Both objective coding performance results and subjective viewing experience results were provided and compared against the coding of the texture views as one Multiview Video Coding (MVC) bitstream and the depth views as another MVC bitstream. As a result of the CfP evaluation, Nokia 3DV-TM was selected as a starting point for the development of the MVC- and H.264/AVC-compatible 3D video coding standards.

Acknowledgment

The authors would like to thank T. Utriainen, E. Pesonen, and S. Jumisko-Pyykkö from the laboratory of Human-Centered Technology of Tampere University of Technology for performing the systematic subjective testing supporting the development of Nokia 3DV-TM. Moreover, the authors thank Prof. M. Domański et al. for providing the Poznan test sequences and their camera parameters [36].

References

[1] Multi-View Video Plus Depth (MVD) Format for Advanced 3D Video Systems, document JVT-W100.doc, Joint Video Team, Apr. 2007.
[2] Advanced Video Coding for Generic Audiovisual Services, document H.264.doc, ITU-T Recommendation, Apr. 2013.
[3] T. Shibata, J. Kim, D. M. Hoffman, and M. S. Banks, "The zone of comfort: Predicting visual discomfort with stereo displays," J. Vis., vol. 11, no. 8, p. 11, Jul. 2011.
[4] L. McMillan, Jr., "An image-based approach to three-dimensional computer graphics," Ph.D. thesis, Dept. Comput. Sci., Univ. North Carolina at Chapel Hill, Chapel Hill, NC, USA, 1997.
[5] Call for Proposals on 3D Video Coding Technology, document N12036.doc, MPEG, Mar. 2011.
[6] High Efficiency Video Coding, document H.265.doc, ITU-T Recommendation, Apr. 2013.
[7] JCT-3V Document Repository. [Online]. Available: http://phenix.int-evry.fr/jct3v/
[8] MVC Extension for Inclusion of Depth Maps Draft Text 6, document JCT3V-C1001.doc, JCT-3V, Mar. 2013.
[9] Y. Chen, M. M. Hannuksela, T. Suzuki, and S. Hattori, "Overview of the MVC+D 3D video coding standard," J. Vis. Commun. Image Represent., Apr. 2013.
[10] 3D-AVC Draft Text 6, document JCT3V-D1002.doc, JCT-3V, May 2013.
[11] Description of 3D Video Coding Technology Proposal by Nokia, document M22552.doc, MPEG, Nov. 2011.
[12] A. Vetro, S. Yea, and A. Smolic, "Toward a 3D video format for auto-stereoscopic displays," in Proc. SPIE Applications of Digital Image Processing XXXI, vol. 7073, pp. 1–12, Sep. 2008, doi:10.1117/12.797353.
[13] M. M. Hannuksela, L. Chen, D. Rusanovskyy, and H. Li, "Gradual view refresh in depth-enhanced multiview video," in Proc. Picture Coding Symp., May 2012, pp. 141–144.
[14] S. Shimizu, M. Kitahara, H. Kimata, K. Kamikura, and Y. Yashima, "View scalable multiview video coding using 3-D warping with depth map," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 11, pp. 1485–1495, Nov. 2007.
[15] S. Yea and A. Vetro, "View synthesis prediction for multiview video coding," Signal Process., Image Commun., vol. 24, nos. 1–2, pp. 89–100, Jan. 2009.
[16] MPEG View Synthesis Reference Software. [Online]. Available FTP: ftp.merl.com/pub/avetro/3dv-cfp/
[17] D. Tian, P.-L. Lai, P. Lopez, and C. Gomila, "View synthesis techniques for 3D video," in Proc. SPIE Applications of Digital Image Processing XXXII, vol. 7443, pp. 74430T-1–74430T-11, Sep. 2009, doi:10.1117/12.829372.
[18] W. Su, D. Rusanovskyy, M. M. Hannuksela, and H. Li, "Depth-based motion vector prediction in 3D video coding," in Proc. Picture Coding Symp., May 2012, pp. 37–40.


[19] E. Ekmekcioglu, V. Velisavljevic, and S. T. Worrall, "Content adaptive enhancement of multi-view depth maps for free viewpoint video," IEEE J. Sel. Topics Signal Process., vol. 5, no. 2, pp. 352–361, Apr. 2011.
[20] R. Li, D. Rusanovskyy, M. M. Hannuksela, and H. Li, "Joint view filtering for multiview depth map sequences," in Proc. IEEE Int. Conf. Image Process., Sep./Oct. 2012, pp. 1329–1332.
[21] MVC Reference Software, document N10897.doc, MPEG, Dec. 2009.
[22] JM Software. [Online]. Available: http://iphome.hhi.de/suehring/tml/download/old_jm/jm17.2.zip
[23] P. Aflaki, D. Rusanovskyy, T. Utriainen, E. Pesonen, M. M. Hannuksela, S. Jumisko-Pyykkö, and M. Gabbouj, "Study of asymmetric quality between coded views in depth-enhanced multiview video coding," in Proc. IC3D, Dec. 2011.
[24] Hierarchical B Pictures, document JVT-P014.doc, JVT, Jul. 2005.
[25] JSVM Software. [Online]. Available: http://wftp3.itu.int/av-arch/jvt-site/2008_01_Antalya/JVT-Z203.zip
[26] P. Aflaki, M. M. Hannuksela, D. Rusanovskyy, and M. Gabbouj, "Nonlinear depth map resampling for depth-enhanced 3-D video coding," IEEE Signal Process. Lett., vol. 20, no. 1, pp. 87–90, Jan. 2013.
[27] P. Aflaki, D. Rusanovskyy, M. M. Hannuksela, and M. Gabbouj, "Frequency based adaptive spatial resolution selection for 3D video coding," in Proc. EUSIPCO, Aug. 2012, pp. 759–763.
[28] Calculation of Average PSNR Differences Between RD-Curves, document VCEG-M33.doc, ITU-T SG16 Q.6 (VCEG), Apr. 2001.
[29] P. Hanhart, F. De Simone, and T. Ebrahimi, "Quality assessment of asymmetric stereo pair formed from decoded and synthesized views," in Proc. Int. Workshop QoMEX, Jul. 2012, pp. 236–241.
[30] Report of Subjective Test Results from the Call for Proposals on 3D Video Coding Technology, document N12347.doc, MPEG, Jan. 2012.
[31] Common Test Conditions of 3DV Core Experiments, document JCT3V-A1100.doc, JCT-3V, Jul. 2012.
[32] Calculation Process for Parameters of Depth-Range-Based Weighted Prediction with Fixed-Point/Integer Operations, document JCT3V-A0112.doc, JCT-3V, Jul. 2012.
[33] 3D-AVC Test Model 5, document JCT3V-C1003.doc, JCT-3V, Jan. 2013.
[34] 3DV-CE1.a: Block-Based View Synthesis Prediction for 3DV-ATM, document JCT3V-A0107.doc, JCT-3V, Jul. 2012.
[35] 3D-CE5.a Results on Motion Vector Competition-Based Skip/Direct Mode with Explicit Signaling, document JCT3V-A0045.doc, JCT-3V, Jul. 2012.
[36] Poznan Multiview Video Test Sequences and Camera Parameters, document M17050.doc, MPEG, Oct. 2009.

Miska M. Hannuksela (M'03) received the Master of Science degree in engineering and the Doctor of Science degree in technology from the Tampere University of Technology, Tampere, Finland, in 1997 and 2010, respectively. He has been with Nokia since 1996 in different roles, including Research Manager and Leader in the areas of video and image compression, end-to-end multimedia systems, as well as sensor signal processing and context extraction. Currently, he works as a Distinguished Scientist with Multimedia Technologies, Nokia Research Center, Tampere. He has published more than 100 journal and conference papers and 100 standardization contributions in JCT-VC, JCT-3V, JVT, MPEG, 3GPP, and DVB. He has granted patents from more than 70 patent families. His current research interests include video compression and multimedia communication systems. Dr. Hannuksela received the Best Doctoral Thesis award of the Tampere University of Technology in 2009 and the Scientific Achievement Award nominated by the Centre of Excellence of Signal Processing, Tampere University of Technology, in 2010. He has been an Associate Editor of the IEEE Transactions on Circuits and Systems for Video Technology since 2010.

Dmytro Rusanovskyy received the Master of Science degree from the Kharkiv National University of Radioelectronics, Kharkiv, Ukraine, in 2000, and the Doctor of Science degree in technology from the Tampere University of Technology, Tampere, Finland, in 2009. He is currently a Consultant in video architecture for LG Electronics. His current research interests include image and video processing and enhancement, 2-D/3-D video coding, and depth-enhanced video processing. He is the co-author of multiple journal and conference papers, standardization contributions, and patent applications.

Wenyi Su received the Bachelor of Science degree from Xidian University, Xi'an, China, in 2010. He is currently pursuing the master's degree in signal and information processing with the University of Science and Technology of China, Hefei, China. His current research interests include 3-D video coding and processing.

Lulu Chen received the B.S. degree in electronic information engineering from the University of Science and Technology of China (USTC), Hefei, China, in 2010. She is currently pursuing the M.S. degree in signal and information processing with USTC. Her current research interests include 3-D video coding and processing. She received the Best Paper Award of the 2012 Visual Communications and Image Processing conference with Dr. Hannuksela and Dr. Li.

Ri Li received the Bachelor of Engineering and master's degrees in signal and information processing from the University of Science and Technology of China, Hefei, China, in 2009 and 2012, respectively. His current research interests include 3-D depth coding and pre/post-processing.

Payman Aflaki received the master's degrees from the Polytechnic University of Turin, Turin, Italy, and the Polytechnic University of Catalonia, Catalonia, Spain, in 2008 and 2009, respectively. He is pursuing the Ph.D. degree with the Tampere University of Technology, Tampere, Finland. Since June 2011, he has been an External Researcher with the Nokia Research Center, Tampere, contributing actively to the ongoing 3-D video coding standardization activities. He has been working in 3-D video coding for four years. His current research interests include asymmetric 3-D video compression and depth-enhanced video processing/coding.

Deyan Lan received the master's degree in signal and information processing from the University of Science and Technology of China, Hefei, China, in 2012. His current research interests include 3-D depth coding and pre/post-processing.

Michal Joachimiak received the master's degree in computer science from the Lodz University of Technology, Lodz, Poland, in 2006. From 2007 to 2009, he was a Computer Vision Researcher with the Tampere University of Technology (TUT), Tampere, Finland. In 2009, he started Ph.D. studies in 3-D video coding with the Department of Signal Processing, TUT. His current research interests include 3-D video processing and coding.

Houqiang Li (M'10) received the B.S., M.Eng., and Ph.D. degrees from the University of Science and Technology of China (USTC), Hefei, China, in 1992, 1997, and 2000, respectively, all in electronic engineering. He is currently a Professor with the Department of Electronic Engineering and Information Science, USTC. He has authored or co-authored over 90 papers in journals and conferences. His current research interests include video coding and communication, multimedia search, and image/video analysis. He is an Associate Editor of the IEEE Transactions on Circuits and Systems for Video Technology and a member of the Editorial Board of the Journal of Multimedia. He has served on Technical/Program Committees and Organizing Committees, and as a Program Co-Chair and Track/Session Chair, for over ten international conferences. He was a recipient of the Best Paper Award of the ACM International Conference on Mobile and Ubiquitous Multimedia in 2011 and a senior author of the Best Student Paper of the 5th International Mobile Multimedia Communications Conference (MobiMedia) in 2009.

Moncef Gabbouj (M'85–SM'95–F'11) received the B.S. degree in electrical engineering from Oklahoma State University, Stillwater, OK, USA, in 1985, and the M.S. and Ph.D. degrees in electrical engineering from Purdue University, West Lafayette, IN, USA, in 1986 and 1989, respectively. He has been an Academy Professor with the Academy of Finland, Helsinki, Finland, since January 2011. He is currently with the Department of Electronic and Computer Engineering and the Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong. He was with the School of Electrical Engineering, Purdue University, from August 2011 to December 2011, and with the Viterbi School of Engineering, University of Southern California, Los Angeles, CA, USA, from January 2012 to June 2012. He was a Senior Research Fellow with the Academy of Finland from 1997 to 1998 and from 2007 to 2008. His current research interests include multimedia content-based analysis, indexing and retrieval, nonlinear signal and image processing and analysis, voice conversion, and video processing and coding. Dr. Gabbouj has served as an Associate Editor of the IEEE Transactions on Image Processing and as a Guest Editor of Multimedia Tools and Applications and the European Journal of Applied Signal Processing.