Novel Stereo-Video Quality Metric

Lina Jin, Atanas Boev, Satu Jumisko-Pyykkö, Tomi Haustola, Atanas Gotchev

MOBILE3DTV Project No. 216503 – Deliverable D5.5

Abstract: We present the design of a new objective metric for evaluating the quality of stereo video. The design process comprised two stages. First, specific subjective tests were designed and conducted to support the design of the metric. Then, the metric was designed and validated using the results of these tests. Correspondingly, the report is organized in two parts: the first part describes the subjective tests and the second part describes the metric design. The results show that the amount of artefacts in the visual content dominates the experienced quality. Depth also influences the experienced quality, and 3D was rated more favourably than 2D. However, the amount of artefacts is critical for both the experienced quality and the acceptance of the visual output. The results also show that videos encoded with the two best QP values (25 and 30) are practically the only acceptable options. The new metric was designed to provide modules for characterizing both 2D-type and 3D-type distortions. Specifically, a recent 2D metric, PSNR-HVS, was selected as the main block for evaluating 2D-type artefacts, as it has proved superior for large classes of distortions. The metric accounts for the contrast sensitivity and masking mechanisms of human vision through block-DCT transform-domain masking. It was modified to work on smaller blocks (i.e. 4x4) in order to make it more suitable for H.264 encoding applications.

Keywords: 3D-DCT, contrast masking, cyclopean view, display comfort zone


Executive Summary

We present the design of a new objective metric for evaluating the quality of stereo video. The design process comprised two stages. First, specific subjective tests were designed and conducted to support the design of the metric. Then, the metric was designed and validated using the results of these tests. Correspondingly, the report is organized in two parts: the first part describes the subjective tests and the second part describes the metric design.

The goal of the subjective study was to examine the influence of depth, spatial quality and their interaction on experienced visual quality. Two experiments were conducted: 1) a video-content experiment and 2) a still-image-content experiment. Both experiments were carried out in a controlled laboratory context with an identical setup. In the experiments, users evaluated the quality and acceptance of the presented visual material. The material was presented on an auto-stereoscopic NEC mobile 3D display. The variables of the study were the depth of the content and the amount of artefacts in the content. There were three levels of depth: mono (2D), short baseline (3D) and wide baseline (3D). The amount of artefacts had five levels, controlled by the quantisation parameter (QP) of H.264 encoding. The results show that the amount of artefacts in the visual content dominates the experienced quality. Depth also influences the experienced quality, and 3D was rated more favourably than 2D. However, the amount of artefacts is critical for both the experienced quality and the acceptance of the visual output. The results also show that videos encoded with the two best QP values (25 and 30) are practically the only acceptable options; the other QP values received low acceptance levels. The study provides information about the influence of the amount of artefacts as well as the influence of depth in 2D vs. 3D presentation. The results reveal only a small difference between the 3D setups with varying baseline, which can be attributed to display-specific effects such as a narrower comfort zone and cross-talk.

The new metric was designed to provide modules for characterizing both 2D-type and 3D-type distortions. Specifically, a recent 2D metric, PSNR-HVS, was selected as the main block for evaluating 2D-type artefacts, as it has proved superior for large classes of distortions. The metric accounts for the contrast sensitivity and masking mechanisms of human vision through block-DCT transform-domain masking. It was modified to work on smaller blocks (i.e. 4x4) in order to make it more suitable for H.264 encoding applications. A computational structure based on the 3D-DCT is proposed to account for the formation of the cyclopean view and thus to provide an integral evaluation of the amount of 2D distortions in the stereo views. Furthermore, it is augmented by blocks providing factors that describe the local disparity activity and the global disparity distortions. The compound metric correlates very well with the mean opinion scores collected through the subjective tests. The metric was also validated against the results of previous tests and compared with other available 2D and 3D metrics. All comparisons demonstrate the superiority of the metric when tested on mobile-resolution stereo video sequences.


Table of Contents

1 Introduction
2 Subjective Tests
  2.1 Experiment 1
    2.1.1 Participants
    2.1.2 Test Procedure
    2.1.3 Context of Viewing
    2.1.4 Stimuli Material
    2.1.5 Production of Test Material
    2.1.6 Presentation of Test Material
    2.1.7 Method of Analysis
  2.2 Experiment 2
    2.2.1 Participants
    2.2.2 Test Procedure
    2.2.3 Context of Viewing
    2.2.4 Stimuli Material
    2.2.5 Production of Test Material
    2.2.6 Presentation of Test Material
    2.2.7 Method of Analysis
  2.3 Results
    2.3.1 Experiment 1 – video content
    2.3.2 Experiment 2 – still image content
3 Design of Objective Quality Metrics
  3.1 Introduction
  3.2 Image processing channel for mobile stereo-video quality estimation
  3.3 PSNR-HVS and PSNR-HVS-M
    3.3.1 Introduction
    3.3.2 Modified Version
  3.4 3D Video Quality Metrics I for stereo video
    3.4.1 Finding block-disparity map
    3.4.2 Block selection and 3D-DCT transform
    3.4.3 Modified MSE
  3.5 3D Video Quality Metrics II for Mobile 3DTV Content
    3.5.1 Disparity map and local disparity variance
    3.5.2 Assessment of visual artefacts in a transform-domain cyclopean view model
    3.5.3 Weighting with local disparity variance
    3.5.4 Composite quality measure
  3.6 Test Sequences and Subjective Tests
    3.6.1 3D Video Database I
    3.6.2 3D Video Database II
  3.7 Results
    3.7.1 Results of 3D Quality Metrics I
    3.7.2 Results of 3D Quality Metrics II
4 Discussion and Conclusion
  4.1 Subjective tests
    4.1.1 Summary of results
    4.1.2 Conclusions
  4.2 Objective metrics


1 Introduction

The challenge for mobile 3D television is to deliver an experience that satisfies viewers using limited resources. Factors such as limited bandwidth, a vulnerable transmission channel, constraints of the receiving devices and the amount of 3D data create the need for tight optimization of system resources [1]. For this we need information about how depth, the amount of artefacts and other factors influence the quality of experience, and what quality is "sufficiently good" for the viewers. According to previous studies on the subject, 3D enhances the experienced quality compared to 2D presentation [2], [3], but the amount of stereoscopic artefacts is the dominant factor [1], [4]. In essence, the value that 3D adds to the quality of experience becomes irrelevant if the amount of artefacts is high.

This study continues the examination of the influences of depth and spatial quality on experienced visual quality. While previous studies introduced the effects of depth and of the amount of artefacts, this study aims at setting more precise limits of acceptance of experienced quality when both compression artefacts and varying amounts of depth are present. The study also takes a more systematic approach to examining depth versus compression artefacts by using a denser set of varied parameters. The goal of the study is to examine the influence of depth, spatial quality and their interaction on experienced visual quality. The study consists of two subjective evaluation experiments: a video-content experiment and a still-image-content experiment. In both experiments users evaluate the quality and acceptance of the contents in a controlled laboratory environment. The content is presented on a portable auto-stereoscopic display. Two variables are tested: the amount of compression artefacts, varied through the QP, and the strength of the depth effect, varied through the camera baseline. The test content consists only of visual stimuli; no audio stimuli are included in the test.

Early attempts to objectively quantify 3D video quality were based on the use of 2D metrics: each channel of a stereo video is evaluated by some 2D metric, and the overall 3D video quality is calculated as the mean over the two video channels. This approach, however, hardly corresponds to the binocular mechanisms of the human visual system (HVS) and thus correlates poorly with subjective quality scores. Therefore, the inclusion of 3D factors in the quality evaluation has been attempted. In [25], a monoscopic quality component and a stereoscopic quality component are combined for measuring stereoscopic image quality. The former component assesses the ordinary monoscopically perceived distortions caused by blur, noise, contrast change, etc., while the latter assesses the perceived degradation of binocular depth cues only. In [26], the popular 2D image quality metric known as the structural similarity index (SSIM) [27] has been applied to 3D images in the view-plus-depth format, where information about depth has been added to the metric using a local or a global approach. In [28], an overall quality metric has been suggested that combines image quality with disparity quality using a nonlinear function. In [29], a quality metric for colour stereo images has been proposed based on the binocular energy contained in the left and right retinal images, calculated by the complex wavelet transform (CWT) and the Bandelet transform. Other works have addressed the use of the discrete cosine transform (DCT) as a component in 2D image and video quality metrics [30], [31], [32]. However, the DCT has not been investigated as a component of stereoscopic image and video quality metrics.

In this report, we propose two versions of a full-reference stereoscopic quality metric based on the 3D-DCT, which take into account HVS properties such as the contrast sensitivity function (CSF) and luminance masking. In the proposed metrics, the 3D-DCT is used to analyze the perceptual similarity of blocks in stereo frames grouped using disparity correspondence and block matching. An MSE adjusted by contrast masking is used to quantify the quality difference between the reference and distorted blocks. Global disparity distortion and local disparity variance are used to further tune the metric to respond more adequately to the presented 3D effect. We present experimental results demonstrating the feasibility of the proposed metric.
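To make the computational idea concrete, the following is a minimal Python sketch of the core building block: a disparity-matched pair of 4x4 blocks from the left and right views is stacked and transformed with a 3D-DCT, and a CSF-weighted MSE is computed between the reference and distorted coefficient volumes. The CSF weights, the zero-disparity matching in the toy example and the function names are illustrative placeholders, not the exact formulation used in this deliverable.

```python
import numpy as np
from scipy.fft import dctn

# Illustrative 4x4 CSF-like weights (placeholder values, not the PSNR-HVS table).
CSF_4x4 = np.array([[1.00, 0.90, 0.70, 0.50],
                    [0.90, 0.80, 0.60, 0.40],
                    [0.70, 0.60, 0.45, 0.30],
                    [0.50, 0.40, 0.30, 0.20]])

def block_3d_dct(left_block, right_block):
    """Stack a disparity-matched pair of 4x4 blocks and apply an orthonormal 3D-DCT."""
    volume = np.stack([left_block, right_block], axis=-1)   # shape (4, 4, 2)
    return dctn(volume, norm='ortho')

def csf_weighted_mse(ref_left, ref_right, dist_left, dist_right):
    """CSF-weighted MSE between reference and distorted 3D-DCT coefficient volumes."""
    c_ref = block_3d_dct(ref_left, ref_right)
    c_dist = block_3d_dct(dist_left, dist_right)
    diff = (c_ref - c_dist) * CSF_4x4[..., None]             # weight both depth planes
    return float(np.mean(diff ** 2))

# Toy usage on random 4x4 blocks standing in for disparity-matched image blocks.
rng = np.random.default_rng(0)
ref_l = rng.random((4, 4)) * 255
ref_r = ref_l.copy()                        # zero-disparity toy example
dist_l = ref_l + rng.normal(0, 5, (4, 4))   # simulated coding noise
dist_r = ref_r + rng.normal(0, 5, (4, 4))
print(csf_weighted_mse(ref_l, ref_r, dist_l, dist_r))
```

In the metric described later in the report, such block-level scores are further combined with disparity-related factors; the sketch only shows the transform-domain comparison step.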


2 Subjective Tests

2.1 Experiment 1

2.1.1 Participants

The test included a total of 30 participants, equally stratified by gender and by age between 18 and 45 years. There were two outlier participants whose data were not used in the analysis. In participant recruitment, the requirement was that 20% of the sample could be categorised as innovators and early adopters and 80% as naive participants [5]. The participants had little to no prior experience of quality evaluation experiments, they were not experts in technical implementation, and they were not studying, working or otherwise engaged in information technology or multimedia processing [6][7].

2.1.2 Test Procedure

The test procedure was divided into three phases: pre-test, test and post-test. In the pre-test session the test procedure was introduced to the participant, sensorial tests were carried out, the simulator sickness questionnaire [8] was filled in by the participants, and combined training and anchoring was performed. The whole pre-test session was held in the laboratory environment. The sensorial tests included measurement of visual acuity (Landolt chart, at least 20/40), colour vision (Ishihara test) and acuity of stereo vision (Randot stereo test, at least 0.5). After the sensorial tests, the simulator sickness questionnaire was filled in by the participants to obtain information on their state before the test phase. Then, as combined training and anchoring, the extremes of the sample qualities and all contents were shown to the participants. This familiarised the participants with the quality scale, the test content types and the evaluation task [6].

In the test, the bi-dimensional research method of acceptance was used [9]. The stimuli were presented one by one and rated independently and retrospectively using the single-stimulus absolute category rating (ACR) method [6][7]. After each clip, participants retrospectively marked the overall quality satisfaction score on a discrete, unlabeled scale from 0 to 10 and the acceptance of the quality for viewing mobile 3DTV on a binary yes/no scale. To measure satisfaction, we used a wide scale to compensate for the end-avoidance effect and for the problems of labeled scales caused by cultural differences [10]. Acceptance of quality was measured on a binary scale (yes/no) to find a threshold for acceptable quality [9]. The instructions for the quality evaluation tasks were as follows. For the quality satisfaction score, the participants were asked to assess the overall quality of the presented clip. Acceptance of quality was evaluated by asking whether the participants would accept the presented overall quality if they were watching mobile 3D television. No other evaluation criteria or advice were given. The actual test phase was conducted in the laboratory environment.

In the post-test session the participants answered the simulator sickness questionnaire [8]. Then they filled in a questionnaire that collected demographic data, mainly covering television consumption habits, use of different devices, previous 3D experiences and attitude towards technology. The questionnaire also measured their attitudes towards each presented content and their user requirements for mobile 3D television.

2.1.3 Context of Viewing

The experiment took place in a laboratory context (Figure 2.1). The conditions in the laboratory were fixed according to the ITU-T P.911 [7] specifications (Table 1). The viewing distance was fixed at 40 cm, approximately 10 times the video height, as suggested by Knoche et al. [11], [12].


In this context, the users' viewing height and angle were adjusted according to the feedback of the participants, while the viewing distance was kept fixed at 40 cm.

Table 1. ITU-T P.911 (ITU-T, 1998) specifications and the corresponding test setup.

Parameter                                            | Specification         | Test setup
Viewing distance                                     | 1-8 H (image height)  | 10 H (image height)
Peak luminance of the screen                         | 100-200 cd/m2         | 69 cd/m2 (Note 1)
Ratio of screen's peak black to peak white luminance | 0.1                   | 0.014 (Note 1)
Background room illumination                         | 20 lux                | Lab: 12 lux, Home: 26 lux
Background noise level                               | 30 dBA                | 25 dBA (Note 2)
Listening level                                      | ~80 dBA               | 75 dBA (+10 dBA for peaks)

Note 2 – Some individual peaks of background noise were possible from the surrounding environment, such as adjacent rooms or the ventilation system.

Figure 2.1. The laboratory context.

2.1.4 Stimuli Material

There were four kinds of content (see Table 2), which varied in the amount of spatial detail, temporal motion, amount of depth and depth dynamism. The video clips did not include any audio. The length of each stimulus was 10 seconds and there were no scene cuts in any content.

Table 2. Stimuli content descriptions and visual characteristics (VSD = visual spatial details, VTD = temporal motion, VD = amount of depth, VDD = depth dynamism; screenshots omitted).

Akkokayo – Two women carry boxes around the screen, both moving from side to side. VSD: med, VTD: med, VD: med, VDD: med
Champagne Tower – A woman pours champagne into the glasses. VSD: med, VTD: low, VD: low, VDD: low
Pantomime – Two clowns fool around; the left clown blows up a balloon. VSD: med, VTD: med, VD: low, VDD: low
Love Birds – Lovers walk hand in hand to the edge of the scene; the man points out to the scenery. VSD: med, VTD: low, VD: med, VDD: med

2.1.5 Production of Test Material

The source content was selected from the multi-view videos available to the MPEG community, which are also included in the Mobile3DTV stereo-video database [14]. Multi-view videos were used because of the aim to vary the depth range in the test videos. Three depth levels and five quantization parameters were varied. The depth levels comprised a mono presentation and stereoscopic short and wide baselines. The values of the quantization parameter (QP) were 25, 30, 35, 40 and 45. The goal of this selection of parameters was to systematically address the juxtaposition between the positive influence of depth and the negative influence of artefacts on experienced quality, based on previous work. It is known that, under perceptually error-free conditions, stereoscopic video on a small screen is experienced as providing higher visual quality than a conventional monoscopic presentation [14]. However, this positive effect of stereoscopic depth is overridden when detectable artefacts (e.g. from compression, rendering or the display) are part of the visual quality, in which case the mono presentation is more pleasurable [14][13]. In consumer services such as mobile 3D video, visible artefacts occurring independently or jointly, from capture through the vulnerable wireless transmission to the visualization on the display, can potentially reduce the visual quality of experience (see e.g. the review of artefacts in [15][14]).

The preparation of the variable depth levels contained the following steps. For each video clip, multiple stereoscopic versions were prepared by selecting different camera pairs from the available multi-view video tracks.


The left camera of all sequences was kept the same. The 3D effect of the video was controlled by the position of the right camera, as follows:

1) Monoscopic (2D) video, where the left and right channels correspond to the same camera in the multi-view sequence.
2) Short baseline – a camera baseline which produces a 3D scene with a limited disparity range and a less pronounced, but visible, 3D effect.
3) Wide baseline – a camera baseline selected to provide the optimal disparity range for the chosen stereoscopic display.

By the notion of short baseline we refer to the use scenario where high-resolution video content is repurposed for mobile use by direct linear down-scaling. In the general case, this results in a 'shallow' depth. By the notion of wide baseline we refer to the scenario where the content is specifically adapted to the viewing conditions of the portable display. In the ideal case, this results in full utilization of the mobile display's comfort zone [16], [17].

Each stereoscopic sequence was converted from its original resolution to the resolution of the target display using a four-step procedure (see the illustrative sketch below):

1) The disparity range of each stereo pair was analyzed.
2) The left and right channels of the video were cropped from the sides in order to shift the disparity range and equalize the absolute positive and negative disparity values, as well as to avoid frame violations. The position of the first cropping window in the left channel was kept intact, while in the right channel it was varied for the different camera pairs.
3) Both channels were downscaled with respect to the smaller target dimension while maintaining the source aspect ratio.
4) The extra pixels along the larger target dimension were cropped to achieve the display (target) aspect ratio. The position of the second cropping window was the same for all channels and all frames, and was selected manually based on the movie content.

For the cropping operations, cubic spline interpolation was applied, while for the resizing (down-scaling), least-squares cubic projection was applied [18]. Following these steps, three depth levels were created for the selected contents. After downscaling, each test video was compressed with the H.264 reference encoder in intra-frame mode, applied independently to the left and right channels (no inter-view prediction). Five values of the quantization parameter, QP = [25, 30, 35, 40, 45], were used to introduce different levels of blocky compression artefacts.
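As an illustration of steps 2-4, the following Python sketch shows one possible way to shift the disparity range by asymmetric horizontal cropping and to resize a stereo pair to the 427x240 target, using OpenCV's bicubic resampling as a stand-in for the least-squares cubic projection of [18]. The crop offsets, function names and the centre-crop choice are illustrative assumptions, not the exact procedure used to produce the test material.

```python
import cv2
import numpy as np

TARGET_W, TARGET_H = 427, 240  # native resolution of the NEC display prototype

def crop_and_resize_view(view, x0, x1):
    """Crop a view horizontally to [x0, x1) and resize it to the target resolution.

    Using different horizontal crop windows for the left and right views shifts the
    disparity range of the stereo pair (step 2). Scaling and the final centre crop
    approximate steps 3 and 4 of the conversion procedure.
    """
    cropped = view[:, x0:x1]
    h, w = cropped.shape[:2]

    # Step 3: scale with respect to the smaller target dimension, keeping aspect ratio.
    scale = max(TARGET_W / w, TARGET_H / h)
    resized = cv2.resize(cropped, (int(round(w * scale)), int(round(h * scale))),
                         interpolation=cv2.INTER_CUBIC)

    # Step 4: crop the excess of the larger dimension (centre crop as a placeholder for
    # the manually selected cropping window described in the text).
    y_off = (resized.shape[0] - TARGET_H) // 2
    x_off = (resized.shape[1] - TARGET_W) // 2
    return resized[y_off:y_off + TARGET_H, x_off:x_off + TARGET_W]

# Toy usage with random frames standing in for decoded multi-view frames.
left = np.random.randint(0, 256, (768, 1024, 3), dtype=np.uint8)
right = np.random.randint(0, 256, (768, 1024, 3), dtype=np.uint8)
shift = 12  # hypothetical horizontal shift derived from the disparity analysis (step 1)
left_out = crop_and_resize_view(left, 0, 1024 - shift)
right_out = crop_and_resize_view(right, shift, 1024)
print(left_out.shape, right_out.shape)  # (240, 427, 3) each
```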
2.1.6 Presentation of Test Material

The test materials were presented on a prototype NEC mobile 3D display. The display had a 6.90x3.88 cm lenticular-sheet-based auto-stereoscopic screen with a native resolution of 427x240 pixels (157 PPI). The test materials were presented in pseudo-random order [6]. The clips were played from an Asus G51J laptop computer connected to the NEC display. Media Player 12 was used as the video player for presenting the material. No sound was played to the participants. A total of 164 video clips were shown to each participant, and each stimulus was shown twice during the test. There were four dummy clips: two at the beginning and two in the middle of the session. One test session lasted about 1 h 10 min. All tests were done within two weeks.

2.1.7 Method of Analysis

The Wilcoxon matched-pairs signed-rank test and Friedman's test were used to analyse the satisfaction data, as the assumption of normality was not met (Kolmogorov-Smirnov: p < .05). For the acceptance data, McNemar's test was applied to examine the differences between two categories of the related data, and Cochran's Q test was also applied for analysing the acceptance data [19].
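As a hedged illustration of this analysis pipeline, the sketch below runs the named tests on toy rating data with SciPy and statsmodels; the array shapes and values are invented for demonstration and do not correspond to the collected data.

```python
import numpy as np
from scipy.stats import wilcoxon, friedmanchisquare
from statsmodels.stats.contingency_tables import mcnemar, cochrans_q

rng = np.random.default_rng(1)

# Toy satisfaction scores (0-10) for 28 participants under three hypothetical conditions.
cond_a = rng.integers(0, 11, 28)
cond_b = np.clip(cond_a + rng.integers(-2, 3, 28), 0, 10)
cond_c = np.clip(cond_a + rng.integers(-3, 2, 28), 0, 10)

# Pairwise comparison of two related conditions (Wilcoxon matched-pairs signed-rank test).
print(wilcoxon(cond_a, cond_b))

# Omnibus comparison across all related conditions (Friedman's test).
print(friedmanchisquare(cond_a, cond_b, cond_c))

# Toy binary acceptance data (yes=1 / no=0) for two conditions, as a 2x2 contingency table.
acc_a = rng.integers(0, 2, 28)
acc_b = rng.integers(0, 2, 28)
table = np.array([[np.sum((acc_a == 1) & (acc_b == 1)), np.sum((acc_a == 1) & (acc_b == 0))],
                  [np.sum((acc_a == 0) & (acc_b == 1)), np.sum((acc_a == 0) & (acc_b == 0))]])
print(mcnemar(table, exact=True).pvalue)

# Acceptance across several related conditions (Cochran's Q test on an N x k binary matrix).
acceptance_matrix = rng.integers(0, 2, (28, 5))
print(cochrans_q(acceptance_matrix))
```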


2.2 Experiment 2

2.2.1 Participants

The test included a total of 32 participants, recruited according to the same requirements on age and gender stratification and the same participant characteristics as in Experiment 1. There were no outlier participants in this experiment.

2.2.2 Test Procedure

The test procedure of Experiment 2 followed the same structure as that of Experiment 1.

2.2.3 Context of Viewing

Experiment 2 was conducted in the same context as Experiment 1.

2.2.4 Stimuli Material

There were four kinds of content (see Table 2), which varied in the amount of spatial detail, temporal motion, amount of depth and depth dynamism. The length of each stimulus was 7 seconds. The stimuli included the same contents as the video material in Experiment 1 (see Section 2.1.4).

2.2.5 Production of Test Material

The same procedure as in Section 2.1.5 was followed, applied to the second frame of each test sequence.

2.2.6 Presentation of Test Material

The test materials were presented on a prototype NEC mobile 3D display. The display had a 6.90x3.88 cm lenticular-sheet-based auto-stereoscopic screen with a native resolution of 427x240 pixels (157 PPI). The test materials were presented individually in pseudo-random order [6]. The images were shown from an Asus G51J laptop computer connected to the NEC display. IrfanView v. 4.27 was used to create slideshow presentations of the content. No sound was played to the participants. A total of 164 still images were shown to each participant, and every stimulus image was shown twice during the test. The material also included four dummy images. One test session lasted about 1 h. All tests were done within a two-week period.

2.2.7 Method of Analysis

The data met the assumption of normality (Kolmogorov-Smirnov: p > .05).
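For completeness, a minimal sketch of such a normality check with SciPy is shown below; the data and the use of a Kolmogorov-Smirnov test against a normal distribution with fitted parameters are illustrative assumptions rather than the exact procedure of the report.

```python
import numpy as np
from scipy import stats

# Toy satisfaction scores standing in for the still-image ratings of one condition.
rng = np.random.default_rng(2)
scores = rng.normal(loc=6.0, scale=1.5, size=32)

# One-sample Kolmogorov-Smirnov test against a normal distribution with fitted parameters.
statistic, p_value = stats.kstest(scores, 'norm', args=(scores.mean(), scores.std(ddof=1)))
print(f"KS statistic = {statistic:.3f}, p = {p_value:.3f}")
print("normality assumption met" if p_value > 0.05 else "normality assumption not met")
```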