
Three Dimensional Scalable Video Adaptation via User-end Perceptual Quality Assessment

Guangtao Zhai, Jianfei Cai, Senior Member, IEEE, Weisi Lin, Senior Member, IEEE, Xiaokang Yang, Senior Member, IEEE, Wenjun Zhang, Senior Member, IEEE

This research is partially supported by Singapore A*STAR SERC Grants (032 101 0006) and (062 130 0059). G. Zhai is with the Institute of Image Communication and Information Processing, Shanghai Jiao Tong University, Shanghai, 200240, China. This work was done during his visit at the School of Computer Engineering, Nanyang Technological University, 639798, Singapore. E-mail: [email protected]. J. Cai and W. Lin are with the School of Computer Engineering, Nanyang Technological University, 639798, Singapore. E-mail: {wslin,asjfcai}@ntu.edu.sg. X. Yang and W. Zhang are with the Institute of Image Communication and Information Processing, Shanghai Jiao Tong University, Shanghai, 200240, China. E-mail: {xkyang,zhangwenjun}@sjtu.edu.cn.

Abstract—For wireless video streaming, the three dimensional scalabilities (spatial, temporal and SNR) provided by the advanced scalable video coding (SVC) technique can be directly utilized to adapt video streams to dynamic wireless network conditions and heterogeneous wireless devices. However, the question is how to optimally trade off among the three dimensional scalabilities so as to maximize the perceived video quality, given the available resources. In this paper, we propose a low-complexity algorithm that executes at the resource-limited user end to quantitatively and perceptually assess video quality under different spatial, temporal and SNR combinations. Based on the video quality measures, we further propose an efficient adaptation algorithm, which dynamically adapts scalable video to a suitable three-dimensional combination. Experimental results demonstrate the effectiveness of our proposed perceptual video adaptation framework.

Index Terms—Perceptual video adaptation, human visual system, no-reference video quality assessment, scalable video coding.

I. INTRODUCTION

Scalable video coding (SVC), which offers spatial, temporal, and SNR scalabilities at the bit-stream level, has become increasingly attractive since it enables easy adaptation of video streams to the varied requirements of storage devices, terminals, and communication networks. In this research, we consider the scenario of wireless video streaming, where stored videos encoded by SVC in a server are distributed to multiple wireless users. The SVC scalabilities can be directly utilized to adapt video streams to dynamic wireless network conditions and heterogeneous wireless devices. One remaining challenge is how to achieve an optimal tradeoff among the three scalability dimensions provided by SVC so as to maximize the received video quality given the available resources. To tackle such a three dimensional scalable video adaptation problem, one of the major obstacles is that it is difficult to quantitatively assess video quality under different

spatial, temporal and SNR combinations. Traditional quality metrics such as PSNR are often used for assessing SNR-scaled video but are ill-suited to assessing video scaled in all three dimensions. Furthermore, since the human visual system (HVS) is the ultimate receiver in any visual communication system, it is desirable to quantify the perceived video quality relative to the HVS, which is called perceptual video quality assessment (VQA). In general, VQA can be performed either subjectively or objectively. Subjective VQA requires human viewers to participate in some kind of watching and voting exercise to determine the perceptual video quality, while objective VQA uses mathematical models to fulfill the assessment task. Some subjective VQA based video adaptation algorithms [1]–[3] have been proposed in the literature. The basic idea is to utilize off-line subjective VQA to construct empirical models, which are then applied for real-time video adaptation. In particular, Rajendran et al. studied the optimum frame rate selection problem for MPEG-4 FGS videos in [1], where they concluded that higher temporal resolution is preferred in the cases of high PSNR. Through subjective viewing tests, Wang et al. [2] derived the preferable frame rate in the bitrate range of 50 kbps ∼ 1 Mbps; empirically, they found that 440 kbps and 175 kbps are the crucial bitrate points at which the frame rate should be halved. Also through subjective tests, Cranley and Murphy [3] observed that under some bitrate conditions certain combinations of spatial and temporal resolutions can maximize the perceptual quality. Although the models reported in [2], [3] can help adapt video to some suboptimum frame-rate and frame-size combinations, the authors of both [2] and [3] acknowledged that the optimal perceptual video adaptation model should be content-related. In order to develop a universal model for the optimal three dimensional adaptation, one would have to conduct extensive subjective tests covering all types of video sequences at all possible frame-rate and frame-size combinations at different bandwidths, which is prohibitively time-consuming and costly due to the extremely large sample space. Alternatively, we can use objective VQA to assess the received video quality in real time, which is fully content-dependent. Note that we do not specify the SNR resolution since it is determined by the bandwidth constraint once the temporal and spatial combination is given. Unlike most of the existing approaches, where objective VQA is performed at the sender side, we consider applying user-end objective VQA. Although user-end VQA has the disadvantage that no original undistorted video is available, it


also has some attractive advantages. First, considering the dynamic wireless channel conditions, sender-end VQA can only estimate the end-to-end statistical performance, which might not be the actual performance at the user end [4]–[6]. On the contrary, user-end VQA measures the quality of the exact received video. Second, with multiple wireless users using heterogeneous wireless devices, it is hard for the video server to estimate and track the situation of each individual user. In contrast, user-end VQA enables a purely receiver-based quality assessment and adaptation, which greatly reduces the burden at the server side. In addition, user-end VQA can be customized by individual users or even bring end users into the loop, which could lead to personalized video adaptation. In general, depending on the availability of the original video, objective VQA algorithms can be classified into no-reference (NR), reduced-reference (RR) and full-reference (FR) methods [7]. Since we consider user-end VQA, only the NR and RR methods are applicable. Lu et al. [8] proposed an objective VQA based adaptation algorithm using Wolf and Pinson's RR-VQA model [9], which requires transmitting side information such as features of the original video and thus inevitably lowers the bandwidth efficiency of the system. To the best of our knowledge, there is no NR-VQA based video adaptation system reported in the literature. This might be because NR-VQA itself is not an easy task and has not been thoroughly studied [7]. Most of the state-of-the-art NR-VQA algorithms are based on some prior knowledge of the scenario, the compression method and the major artifacts, and they heavily depend on the measurements of certain artifacts to predict the perceptual video quality. Many of them are based on sophisticated HVS models, which result in high computational complexity. For example, in [10], Winkler et al. considered blockiness as the only quality degradation in video streaming applications. Massidda et al. [11] analyzed luminance and temporal masking together with blockiness, blur and motion artifact measures to predict the perceptual video quality; although the scheme is technically sound, its computational complexity is very high. The NR-VQA algorithm proposed by Yang et al. [12] involves motion vector estimation and translational region detection, which is also quite complex. Considering the real-time video adaptation requirement, the resource-poor wireless devices and the bandwidth-limited wireless networks, in this paper we propose a low-complexity NR-VQA algorithm to gauge both intra- and inter-frame distortions caused by three-dimensional video adaptation. The proposed NR-VQA combines independent measures of blockiness, blur and jerkiness artifacts. Based on the distortion measures, we further propose an adaptation algorithm, which dynamically adapts scalable video to an optimal three-dimensional combination that maximizes the perceptual quality. Experimental results demonstrate the effectiveness of our proposed NR-VQA and adaptation algorithms. Note that our proposed NR-VQA algorithm may be outperformed by some other more sophisticated VQA methods. However, we should bear in mind that our purpose is to effectively

Fig. 1. The system diagram in an end-to-end sender-receiver scenario.

capture the relative perceptual quality for video adaptation rather than to predict the absolute perceptual quality. Moreover, in terms of the match with the perceptual quality, our subjective validation test shows that the performance of the proposed NR-VQA is comparable to that of common full-reference quality metrics such as PSNR and SSIM. Thus, it is able to reveal the perceptual quality order under different SVC spatial, temporal and SNR combinations and guide the subsequent adaptation.

Figure 1 shows the system diagram in an end-to-end sender-receiver scenario. In particular, a video sequence is pre-encoded by an SVC encoder and stored at the sender. When the receiver requests to stream the video, it informs the sender how the video should be adapted, and the stream extractor extracts the corresponding partial bitstream and transmits it over the network. Generally only the combination of temporal and spatial resolution is sent to the sender, and the stream extractor truncates SNR packets to meet the bandwidth budget. At the receiver side, the decoded video is sent to the NR-VQA module, where the distortions are measured in real time and the corresponding adaptation decision is made accordingly. The NR-VQA contains three artifact-evaluation components: the blockiness measure, the blur measure and the jerkiness measure. These measures are integrated to indicate the relative perceptual video quality.

The rest of the paper is organized as follows. Section II describes our proposed NR-VQA algorithm and validates the proposed quality metric. Section III discusses how to apply our proposed quality metric for three dimensional scalable video adaptation. Section IV shows simulation results and finally Section V concludes the paper.

II. PROPOSED NR-VQA ALGORITHM

In our proposed NR-VQA algorithm, we measure the three most evident distortions [13] for block-based video coding, i.e., blockiness, blur, and motion jerkiness. Blockiness is perhaps the most severe artifact for all types of block-based image and video coding schemes; it is caused by inconsistent block-based quantization and inter-frame prediction. Blur is another well-known artifact usually caused by the truncation of high-frequency transform coefficients. Motion jerkiness is an inter-frame artifact that is mainly caused by frame dropping. Although we only consider these three distortions in this


paper, other artifacts can be easily integrated into the current framework.

To find suitable assessment methods for the three artifacts, we should keep in mind that, since the proposed algorithm is designed for real-time applications and resource-limited wireless devices, its computational complexity and memory consumption must be kept low. To balance affordable complexity against acceptable accuracy, for the intra-frame artifacts the various HVS masking effects (e.g., luminance/textural masking for intra-frame and motion masking for inter-frame) are not considered in the current implementation, due to their complexity and the marginal improvement in quality prediction for low bitrate videos observed in our experiments. For the inter-frame artifacts, it has been reported that the widely referenced temporal contrast sensitivity function (CSF) does not work well for highly compressed low bitrate videos [14], [15]. Therefore, in our algorithm, we simply use the frame rate and the mean inter-frame difference to gauge the impact of frame dropping on perceived visual quality. All the quality measures are performed only on the luminance component to further reduce the computation.

In order to have a fair quality comparison for a video sequence coded at different spatial and temporal combinations, each individual combination should be converted into the same reference level, which is set to be the highest resolution (CIF at 30 fps in this paper). We use spatial up-sampling or frame replication to convert a lower spatial or temporal resolution video into full resolution. For example, to convert a 15 fps QCIF video into a 30 fps CIF video, we first use the AVC half-sample interpolation filter [16] (with six taps [1, −5, 20, 20, −5, 1]/32) for spatial up-sampling and then duplicate each frame once. Note that other more advanced spatial and temporal up-sampling algorithms can also be applied, but at the cost of higher computational complexity. The NR-VQA algorithm is then performed on the up-sampled and frame-repeated video. In the following, we describe the proposed NR-VQA algorithm in detail.

A. Distortion Measures

1) Blockiness Measure (KM): A few no-reference blockiness measure algorithms [17]–[20] have been proposed in the literature. Among these algorithms, we choose the approach called mean squared difference of slope (MSDS) proposed by Minami and Zakhor in [17] due to its low complexity. It has been indicated in [17] that MSDS increases with the quantization of DCT coefficients and serves as a good yet efficient measure of blockiness. In particular, to compute the MSDS, we first calculate the squared difference of slope (SDS) for each block boundary. Considering the four pixels a, b, c and d lying across a block boundary (two on each side), as shown in Figure 2, the SDS for the boundary between b and c is computed as

$$SDS(b, c) = \left\{(c - b) - \frac{1}{2}\left[(b - a) + (d - c)\right]\right\}^2 = \left([a, b, c, d] \cdot [0.5, -1.5, 1.5, -0.5]^T\right)^2, \quad (1)$$

which measures the squared difference between the slope/gradient across a block boundary and the mean of

Fig. 2. An illustration of computing MSDS.

Fig. 3. An illustration of computing edge expansion.

the slopes/gradients on each side of that boundary. The KM, equal to the MSDS, is then calculated as the mean SDS value over all the block boundaries, i.e.,

$$KM = MSDS = \frac{1}{MN} \sum_{\text{all boundary points}} SDS. \quad (2)$$
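As an illustration (not the authors' exact implementation), the KM/MSDS computation of Eqs. (1) and (2) could be sketched with NumPy as follows; the function name and the assumption of an 8-pixel block grid are ours.

```python
import numpy as np

def blockiness_km(frame, block=8):
    """Sketch of the MSDS-based blockiness measure, Eqs. (1)-(2).

    'frame' is a 2-D luminance array; block boundaries are assumed to lie
    every 'block' pixels (8 for typical DCT-based codecs).
    """
    f = frame.astype(np.float64)
    w = np.array([0.5, -1.5, 1.5, -0.5])          # weights from Eq. (1)
    sds_sum = 0.0

    # Vertical block boundaries: pixels a, b | c, d taken along each row.
    for x in range(block, f.shape[1] - 1, block):
        cols = f[:, x - 2:x + 2]                  # columns holding a, b, c, d
        sds_sum += np.sum((cols @ w) ** 2)

    # Horizontal block boundaries: pixels a, b | c, d taken along each column.
    for y in range(block, f.shape[0] - 1, block):
        rows = f[y - 2:y + 2, :]                  # rows holding a, b, c, d
        sds_sum += np.sum((w @ rows) ** 2)

    # Eq. (2): normalise by the frame size M x N.
    return sds_sum / f.size
```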

The computation of KM approximately needs MN multiplications and MN additions for an M × N frame.

2) Blur Measure (BM): Image blur measurement is another well studied topic and many algorithms have been proposed [21]–[24]. Again, for the low-complexity requirement, we adopt the metric proposed by Marziliano et al. in [21], where the perceptual blur metric is based on the measure of local edge expansions. In particular, the vertical binary edge map is first computed with the Sobel filter. Then, the local extrema in the horizontal neighborhood of each edge point are detected, and the distance between these extrema is denoted as the local expansion of the edge (see Figure 3). Finally, BM is computed as the average of the edge expansions over all the edge points, i.e.,

$$BM = \frac{1}{N_e} \sum_{\text{all edge points}} |x_{p1} - x_{p2}|, \quad (3)$$

where $N_e$ is the number of edge points.
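A minimal sketch of this edge-expansion blur measure is given below; it is illustrative rather than the exact implementation of [21], and the edge threshold value is an assumption.

```python
import numpy as np
from scipy.ndimage import sobel

def blur_bm(frame, edge_thresh=100.0):
    """Sketch of the edge-expansion blur measure, Eq. (3).

    The Sobel threshold and the simple extremum search below are
    illustrative assumptions, not the exact choices of [21].
    """
    f = frame.astype(np.float64)
    gx = sobel(f, axis=1)                 # horizontal gradient -> vertical edges
    ys, xs = np.nonzero(np.abs(gx) > edge_thresh)
    widths = []
    for y, x in zip(ys, xs):
        row, sign = f[y], np.sign(gx[y, x])
        # Walk left and right until the luminance profile stops rising
        # (falling), i.e. until the local extrema bounding the edge.
        l = x
        while l > 0 and sign * (row[l] - row[l - 1]) > 0:
            l -= 1
        r = x
        while r < len(row) - 1 and sign * (row[r + 1] - row[r]) > 0:
            r += 1
        widths.append(r - l)              # local edge expansion |x_p1 - x_p2|
    return float(np.mean(widths)) if widths else 0.0
```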

The BM computation generally uses 6MN multiplications and 6MN additions for an M × N frame, where for each edge point six multiplications and five additions are needed for the vertical edge detection with the Sobel mask (containing six nonzero elements), and one more addition is needed for the computation in Eq. (3).

3) Motion Jerkiness Measure (JM): Psychovisual studies have shown that a spatiotemporal energy model can accurately characterize the motion perception of the HVS [25]. For low bit rate videos, the most prominent temporal distortion is motion jerkiness, which is mainly caused by frame dropping


[26]. Motion jerkiness can be very annoying to the HVS because its appearance of distinctive "snapshots" disrupts the preferred continuous and smooth motion [27]. Lu et al. [15] examined the negative impact of frame dropping on the perceived video quality and concluded that this impact is closely related to the motion jerkiness (computed using optical flow) and the frame rate. However, the optical flow computation is too complex for real-time VQA running on resource-poor mobile devices. Therefore, in this research, we use the direct frame difference in place of optical flow to measure motion jerkiness. Similar to [15], we set the frame rate of 30 fps as an anchor point, to which other in-service frame rates are compared. For a frame i, its motion jerkiness is calculated as

$$JM = \frac{1}{MN} \sum_{x=1}^{M} \sum_{y=1}^{N} |f_i(x, y) - f_{i-1}(x, y)| \cdot \sqrt{\frac{30}{\text{frame rate}}}, \quad (4)$$

where (x, y) and (M, N) are the pixel coordinates and frame dimensions, respectively. For an M × N frame, MN additions are needed for the computation of JM.

B. Quality Impairment Score (QIS)

Before the above measures are integrated to predict the final quality impairment score (QIS), the KM, BM and JM values are averaged over the past several frames. The motivation is twofold. First, since KM and BM are calculated in a frame-by-frame manner with only intra-frame information, they should be smoothed in order to avoid the abrupt changes caused by different frame encoding types (I, P, B). Second, vision persistence also requires the perceived quality of the current frame to be correlated with the past several frames. Thus, in this research, the KM, BM and JM values are averaged over 30 frames using a sliding window, and the averaged measures are denoted as AKM, ABM and AJM, respectively. Note that this sliding window averaging filters out small-scale video content variations while keeping large-scale variations. In this way, too frequent video adaptation triggered by these perceptual quality measures can be avoided. We define the integrated measure, QIS, as

$$QIS(i) = QIK \cdot QIB \cdot QIJ = AKM(i)^{a_1} \cdot ABM(i)^{a_2} \cdot AJM(i)^{a_3}, \quad (5)$$

where the quality impairment of blockiness (QIK), quality impairment of blur (QIB) and quality impairment of jerkiness (QIJ) are power functions of AKM, ABM and AJM, respectively, and the parameters $a_1$, $a_2$ and $a_3$ are introduced to give different weights to the different measures. Empirically, we choose $a_1$, $a_2$ and $a_3$ to be 0.1, 1 and 1, respectively. QIK is given a smaller weight because of the much larger fluctuations observed in the QIK measures in our experiments, while QIB and QIJ are relatively stable. Note that these parameters are fixed to the same values for all the video sequences tested throughout the paper. The calculated QIS values are then used as the overall perceptual video quality indicator. The smaller the QIS, the better the perceived visual quality.
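Combining the pieces, the per-frame jerkiness measure of Eq. (4) and the sliding-window QIS integration of Eq. (5) might be sketched as follows; the class and function names are illustrative, while the 30-frame window and the weights 0.1, 1 and 1 follow the text.

```python
import numpy as np
from collections import deque

def jerkiness_jm(cur, prev, frame_rate):
    """Eq. (4): mean absolute frame difference scaled by sqrt(30 / frame rate)."""
    diff = np.abs(cur.astype(np.float64) - prev.astype(np.float64))
    return diff.mean() * np.sqrt(30.0 / frame_rate)

class QISMeter:
    """Sliding-window integration of KM, BM and JM into QIS, Eq. (5)."""

    def __init__(self, window=30, a1=0.1, a2=1.0, a3=1.0):
        self.km = deque(maxlen=window)
        self.bm = deque(maxlen=window)
        self.jm = deque(maxlen=window)
        self.a = (a1, a2, a3)             # empirical weights from the text

    def update(self, km, bm, jm):
        self.km.append(km)
        self.bm.append(bm)
        self.jm.append(jm)
        akm, abm, ajm = np.mean(self.km), np.mean(self.bm), np.mean(self.jm)
        a1, a2, a3 = self.a
        # QIS(i) = AKM(i)^a1 * ABM(i)^a2 * AJM(i)^a3; smaller means better.
        qis = (akm ** a1) * (abm ** a2) * (ajm ** a3)
        return qis, (akm, abm, ajm)
```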

TABLE I
THE DETAILS OF THE SUBJECTIVE TEST SETUP FOR THE 'FOREMAN' SEQUENCE.

Bit rate (resolution): 24, 48, 64 and 128 kbps at QCIF; 64, 128 and 384 kbps at CIF.
Frame rate: 7.5, 15 and 30 Hz.

C. Quality Metric Validation

In order to validate the accuracy of the proposed NR-VQA metric, a subjective viewing test is conducted. In particular, the 'foreman' sequence is compressed by MPEG-4 with spatial resolutions of {CIF, QCIF} and frame rates of {30, 15, 7.5} fps at various bitrates of {24, 48, 64, 128, 384} kbps. The details of the subjective test setup are listed in Table I. The reconstructed videos at all the combinations are played back at CIF and 30 fps, using up-sampling and frame repetition if necessary. The experiments involve 20 participants and the setup follows the specification of ITU-R Recommendation BT.500-11 [28]. The mean opinion score (MOS) is collected with the double stimulus impairment scale, variant II (DSIS II)¹. The monitor used is a SONY BVM 21F, and the viewing distance is set to 3 to 4 times the image height.

Fig. 4 shows the scatter plots of the different metrics versus MOS. We compare our proposed QIS metric with SSIM [29], a recently developed metric, and PSNR, the most widely used video quality assessment metric. The data are fitted with quadratic functions, and the corresponding R-square² values of the fittings are 0.7440, 0.6484 and 0.5286 for QIS, Mean SSIM (MSSIM) and PSNR, respectively, which suggests that QIS is more correlated with MOS than MSSIM and PSNR. This is because SSIM and PSNR do not exploit any temporal information. It can also be seen that QIS slightly outperforms MSSIM and PSNR since QIS has tighter prediction bounds and fewer outliers. Therefore, we can conclude that the proposed QIS can serve as a simple yet accurate metric to measure perceptual video quality under different SVC scalability combinations.

¹ The DSIS method displays the reference sequence followed by the distorted sequence, after which the viewer votes on the visual quality of the distorted sequence. The DSIS variant II further uses two repetitions of the display process before voting.
² R-square is the coefficient of multiple determination. It measures how well the fit explains the variation of the data. A value closer to 1 indicates a better fit.
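For reference, the quadratic fitting and R-square computation used in this validation can be reproduced along the following lines; this is a generic NumPy sketch, and the variable names in the example call are placeholders rather than the actual subjective data.

```python
import numpy as np

def r_square(metric, mos):
    """Quadratic fit of a metric against MOS and its R-square value."""
    metric = np.asarray(metric, dtype=float)
    mos = np.asarray(mos, dtype=float)
    coeffs = np.polyfit(mos, metric, deg=2)         # quadratic fit: metric ~ f(MOS)
    predicted = np.polyval(coeffs, mos)
    ss_res = np.sum((metric - predicted) ** 2)      # residual sum of squares
    ss_tot = np.sum((metric - metric.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot                    # closer to 1 -> better fit

# e.g. r_square(qis_values, mos_values)  # placeholder arrays, not the measured data
```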

Fig. 4. The scatter plots of different metrics versus MOS: (a) QIS, (b) PSNR and (c) MSSIM, each fitted with a quadratic function and shown with 95% prediction bounds.

III. THREE-DIMENSIONAL VIDEO ADAPTATION

After we assess the current perceived video quality at the user end, the question is how to make use of the three scalabilities to adapt the scalable video bitstream. Specifically, the three-dimensional video adaptation problem can be summarized as: given the current bandwidth, find the optimal combination of spatial, temporal and SNR resolutions so as to maximize the perceived video quality for future frames. Since this adaptation decision is made at the user end, where the future frames are not available, it is impossible to achieve a global optimization. The best we can do is to find the best combination for the recent frames and assume that it remains a reasonable choice for the future frames. Considering the bandwidth constraint, the coarse adjustments provided by the spatial and temporal scalabilities and the fine granularity of the SNR scalability, we should clearly determine the spatial and temporal resolutions first and then fine-tune the SNR scalability to meet the available bitrate constraint. In particular, we determine the temporal and spatial resolutions at the receiver end, and pass the decision together with the currently available bandwidth information to the sender. At the sender side, we truncate the SNR packets, where any existing rate-distortion optimized (RDO) algorithm can be applied, and extract the corresponding bitstream. Since only a small amount of information needs to be sent back, we neglect this feedback transmission overhead. Through our experiments, we find that, for the limited number of spatial and temporal resolutions and the common bitrate range typically used in wireless video streaming applications, the temporal resolution dominates the overall perceived video quality. Similar observations have also been reported in [1], [2], [15]. Thus, we propose to select the temporal resolution first and then the spatial resolution. The detailed algorithm is described as follows.

Initialization step: We choose the highest temporal resolution and then the highest spatial resolution available under the given bitrate constraint.

Updating step: For each frame, we compute KM, BM, JM and QIS according to the algorithms in Section II, and we update their average values, AKM, ABM, AJM and AQIS, over all the past frames.

Verification step: The adaptation configuration calculated at the last frame is verified against the QIS and AQIS values calculated at the current frame. Specifically, if the QIS is higher than the AQIS, which means that the picture quality is below the average level, we approve the pre-computed adaptation configuration; otherwise, we do not perform adaptation. This verification stage is introduced to prevent too frequent adaptations. As indicated in [30], adjusting the temporal or spatial resolution too frequently is annoying to viewers.

Configuration step: In this stage, we compute a good spatial and temporal configuration. The basic idea is to compare the JM, KM, and BM values with their average values to see whether the current artifacts significantly deviate from the average levels. In particular, first, if the current jerkiness is very high (low), $JM > JM_{MAX}$ ($JM < JM_{MIN}$), we increase (decrease) the frame rate. Then, if the current blockiness is very high, $KM > KM_{MAX}$, or the blur is very low, $BM < BM_{MIN}$, the frame size is decreased; otherwise,


if the current blockiness is very low, $KM < KM_{MIN}$, or the blur is very high, $BM > BM_{MAX}$, the frame size is increased. The thresholds are empirically set as scaled versions of AJM, AKM and ABM:

$$JM_{MAX} = 1.5 \cdot AJM, \quad JM_{MIN} = 0.5 \cdot AJM,$$
$$KM_{MAX} = 1.5 \cdot AKM, \quad KM_{MIN} = 0.8 \cdot AKM,$$
$$BM_{MAX} = 1.2 \cdot ABM, \quad BM_{MIN} = 0.8 \cdot ABM. \quad (6)$$
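To make the verification and configuration steps concrete, a simplified sketch of the per-frame decision logic is given below; the layer-stepping helper, the function names and the omission of the bandwidth signaling to the sender are simplifications of ours, while the threshold scaling factors follow Eq. (6).

```python
FRAME_RATES = [3.75, 7.5, 15.0, 30.0]   # temporal layers in the SVC stream
FRAME_SIZES = ["QCIF", "CIF"]           # spatial layers

def _step(options, current, up):
    """Move to the neighbouring layer in 'options' (clamped at the end points)."""
    i = options.index(current)
    return options[min(i + 1, len(options) - 1)] if up else options[max(i - 1, 0)]

def adapt(qis, aqis, km, bm, jm, akm, abm, ajm, frame_rate, frame_size):
    """Sketch of the verification and configuration steps; thresholds follow Eq. (6)."""
    # Verification: adapt only when the current quality drops below the running average.
    if qis <= aqis:
        return frame_rate, frame_size
    # Configuration, temporal resolution first (it dominates perceived quality).
    if jm > 1.5 * ajm:                          # JM > JM_MAX: too jerky
        frame_rate = _step(FRAME_RATES, frame_rate, up=True)
    elif jm < 0.5 * ajm:                        # JM < JM_MIN
        frame_rate = _step(FRAME_RATES, frame_rate, up=False)
    # Then the spatial resolution.
    if km > 1.5 * akm or bm < 0.8 * abm:        # strong blockiness or little blur
        frame_size = _step(FRAME_SIZES, frame_size, up=False)
    elif km < 0.8 * akm or bm > 1.2 * abm:      # little blockiness or strong blur
        frame_size = _step(FRAME_SIZES, frame_size, up=True)
    return frame_rate, frame_size
```

In the full system, the decision returned here is fed back to the sender together with the available bandwidth, and the sender then truncates the SNR packets to meet the bitrate budget.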

IV. SIMULATION RESULTS

The video sequence used is a concatenation of the first 100 frames of ten 4:2:0 test video sequences, namely 'akiyo', 'coastguard', 'container', 'foreman', 'mobile', 'mother and daughter', 'news', 'stefan', 'tempete', and 'weather', as shown in Figure 5(e). We encode the video at a very high bitrate using the standard SVC reference software JSVM v.5.11 with a GOP size of eight. The compressed video bitstream is then truncated to lower testing bitrates ranging from 640 kbps down to 128 kbps. The set of possible spatial resolutions is {CIF, QCIF}, and the set of possible temporal resolutions is {30, 15, 7.5, 3.75} fps. Here, we only show the results at 640 kbps; the results at other bitrates are similar.

A. NR-VQA Results

Figure 5 shows the computed QIK, QIB, QIJ and QIS results under different spatial and temporal resolutions with a total bitrate budget of 640 kbps. We have the following observations. First, the QIK and QIB results mainly depend on the frame size. Compared with the CIF streams, the QCIF streams in general result in higher QIB but lower QIK. This is reasonable since, for the QCIF streams, the up-sampling process introduced in the VQA reduces the blockiness but enhances the blur effect. Second, the QIJ results are closely related to the frame rate and slightly dependent on the video content, while the frame size has little impact on QIJ. This observation is similar to that made in [15], which is based on a much more complex method. Third, the QIS results show that the temporal resolution has a massive impact on perceived video quality. Fourth, also from the QIS results, we find that for the same frame rate the CIF streams generally give better quality for large-motion sequences such as 'mobile' and 'stefan', while the QCIF configuration performs better for small-motion sequences such as 'mother and daughter' and 'news'. This is because the QCIF configuration typically has less accuracy in describing motion, and thus has inferior performance on large-motion sequences. Figures 6 and 7 illustrate the advantage of using different spatial resolutions for different video contents.

B. Video Adaptation Results

We consider eight spatial and temporal combinations as listed in Table II, which can be regarded as eight adaptation modes. Figure 8 shows the adaptation results and the corresponding QIS values. It can be seen that at the beginning mode 1 is chosen. After about 30 frames, mode 2 is selected because it has lower QIS. After that, at frame 420

TABLE II
DIFFERENT SPATIAL AND TEMPORAL COMBINATIONS.

Mode:  1    2     3    4     5    6     7     8
FPS:   30   30    15   15    7.5  7.5   3.75  3.75
Size:  CIF  QCIF  CIF  QCIF  CIF  QCIF  CIF   QCIF

the adaptation is switched back to mode 1, since the current video content 'mobile' prefers mode 1, as is evident in Figure 6. The mode switching occurring at other places can be explained in a similar way. From the adaptation results, we also observe that there is some "delay" in the adaptation process, which is caused by both the temporal averaging in the NR-VQA algorithm and the verification stage in the adaptation algorithm. Such a short "delay" is necessary since it prevents annoying abrupt adaptation changes. It can also be seen that our proposed algorithm always chooses the frame size and frame rate combination that gives the better perceptual quality, e.g., 'Foreman' and 'Mother and daughter' with the QCIF resolution at frame numbers 310 and 550, and 'Mobile' and 'Tempete' with the CIF resolution at frame numbers 450 and 850.

V. CONCLUSION

In this paper, we have proposed a three-dimensional video adaptation framework using user-end perceptual quality assessment for streaming scalable video coding over wireless networks. The proposed NR-VQA algorithm has low complexity and is suitable for deployment in wireless devices for real-time applications. Experimental results showed that, given a certain bit budget, the proposed video adaptation is able to find the optimal combination of spatial and temporal resolutions that maximizes the perceived video quality. We would like to emphasize one important point observed from our studies, i.e., the temporal resolution has a massive impact on perceived video quality. By making use of this point, the bandwidth-constrained three-dimensional video adaptation problem can be simplified into a two-step one-dimensional optimization problem, i.e., finding the best temporal resolution first and then searching for the best spatial resolution.

REFERENCES

[1] R. K. Rajendran, M. van der Schaar, and S.-F. Chang, "FGS+: optimizing the joint SNR-temporal video quality in MPEG-4 fine grained scalable coding," in 2002 IEEE International Symposium on Circuits and Systems, vol. 1, 2002.
[2] Y. Wang, S.-F. Chang, and A. Loui, "Subjective preference of spatiotemporal rate in video adaptation using multi-dimensional scalable coding," in 2004 IEEE International Conference on Multimedia and Expo, vol. 3, 2004.
[3] N. Cranley, P. Perry, and L. Murphy, "Optimum adaptation trajectories for streamed multimedia," Multimedia Systems, vol. 10, no. 5, pp. 392–401, 2005.
[4] S. Kanumuri, P. C. Cosman, A. R. Reibman, and V. A. Vaishampayan, "Modeling packet-loss visibility in MPEG-2 video," IEEE Trans. on Multimedia, vol. 8, no. 2, pp. 341–355, 2006.
[5] H. Koumaras, A. Kourtis, C.-H. Lin, and C.-K. Shieh, "A theoretical framework for end-to-end video quality prediction of MPEG-based sequences," in The Third International Conference on Networking and Services - ICNS07, Athens, Greece, June 2007.
[6] Z. He and H. Xiong, "Transmission distortion analysis for real-time video encoding and streaming over wireless networks," IEEE Trans. on Circuits and Systems for Video Technology, vol. 16, no. 9, pp. 1051–1062, 2006.


Fig. 5. The QIK, QIB, QIJ and QIS results under different configurations with a total bitrate budget of 640 kbps: (a) QIK vs. frame number, (b) QIB vs. frame number, (c) QIJ vs. frame number, (d) QIS vs. frame number, and (e) sequence information. Each plot compares the CIF (352x288) and QCIF (176x144) streams at 30, 15, 7.5 and 3.75 fps.

[7] S. Winkler, Digital Video Quality: Vision Models and Metrics. John Wiley and Sons, 2005.
[8] X. Lu, S. Tao, J. E. Zarki, and R. Guerin, "Quality-based adaptive video over the internet," in Proceedings of CNDS 2003, Orlando, FL, 2003.
[9] S. Wolf and M. H. Pinson, "Spatial-temporal distortion metric for in-service quality monitoring of any digital video system," in Proceedings of SPIE - The International Society for Optical Engineering, vol. 3845, 1999, pp. 266–277.
[10] S. Winkler, A. Sharma, and D. McNally, "Perceptual video quality and blockiness metrics for multimedia streaming applications," in 4th International Symposium on Wireless Personal Multimedia Communications, 2001, pp. 553–556.
[11] F. Massidda, D. D. Giusto, and C. Perra, "No reference video quality estimation based on human visual system for 2.5/3G devices," in Proceedings of SPIE - The International Society for Optical Engineering, vol. 5666, 2005, pp. 168–179.
[12] F. Yang, S. Wan, Y. Chang, and H. R. Wu, "A novel objective no-reference metric for digital video quality assessment," IEEE Signal Processing Letters, vol. 12, no. 10, pp. 685–688, 2005.
[13] M. Yuen, Digital Video Image Quality and Perceptual Coding. Boca Raton, FL: CRC Press, 2006, ch. Coding Artifacts and Visual Distortions, pp. 87–122.
[14] R. Pastrana-Vidal, J. Gicquel, C. Colomes, and H. Cherifi, "Sporadic frame dropping impact on quality perception," in Proceedings of the SPIE - The International Society for Optical Engineering, vol. 5292, no. 1, 2004.
[15] Z. Lu, W. Lin, B. C. Seng, S. Kato, S. Yao, E. Ong, and X. Yang, "Measuring the negative impact of frame dropping on perceptual visual quality," in Proceedings of SPIE - The International Society for Optical Engineering, vol. 5666, 2005, pp. 554–562.
[16] MPEG-4 AVC/H.264 Video Group, Advanced video coding for generic

audiovisual services, ITU-T Rec. H.264 (03/2005) Std., 2005.
[17] S. Minami and A. Zakhor, "Optimization approach for removing blocking effects in transform coding," IEEE Trans. on Circuits and Systems for Video Technology, vol. 5, no. 7, pp. 74–82, 1995.
[18] H. R. Wu and M. Yuen, "Generalized block-edge impairment metric for video coding," IEEE Signal Processing Letters, vol. 4, no. 11, pp. 317–320, 1997.
[19] Z. Wang, H. R. Sheikh, and A. C. Bovik, "No reference perceptual quality assessment of JPEG compressed images," in IEEE International Conference on Image Processing, 2002, pp. 477–480.
[20] L. Shizhong and A. C. Bovik, "Efficient DCT-domain blind measurement and reduction of blocking artifacts," IEEE Trans. on Circuits and Systems for Video Technology, vol. 12, no. 12, pp. 1139–1149, 2002.
[21] P. Marziliano, F. Dufaux, S. Winkler, and T. Ebrahimi, "A no-reference perceptual blur metric," in Proceedings 2002 International Conference on Image Processing, vol. 3, 2002, pp. 57–60.
[22] E. Ong, W. Lin, Z. Lu, S. Yao, X. Yang, and L. Jiang, "No-reference JPEG-2000 image quality metric," in Proceedings 2003 International Conference on Multimedia and Expo, vol. 1, 2003.
[23] J. Buzzi and F. Guichard, "Uniqueness of blur measure," in Proceedings - International Conference on Image Processing, ICIP, vol. 2, 2004, pp. 2985–2988.
[24] J. Elder and S. Zucker, "Local scale control for edge detection and blur estimation," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 20, no. 7, pp. 699–716, 1998.
[25] E. Adelson and J. Bergen, "Spatiotemporal energy models for the perception of motion," Journal of the Optical Society of America A (Optics and Image Science), vol. 2, no. 2, 1985.
[26] R. Pastrana-Vidal, J. Gicquel, C. Colomes, and H. Cherifi, "Sporadic frame dropping impact on quality perception," in Proceedings of the SPIE - The International Society for Optical Engineering, vol. 5292, no. 1, 2004.


(a) Frame 310, CIF, 30 fps, 640 kbps: QIK=1.1182, QIB=7.8761, QIJ=12.3157, QIS=108.4676. (b) Frame 310, QCIF, 30 fps, 640 kbps: QIK=0.8522, QIB=9.4155, QIJ=12.3034, QIS=98.7269.

(c) Frame 450, CIF, 30 fps, 640 kbps: QIK=1.0759, QIB=4.9239, QIJ=11.5857, QIS=61.3756. (d) Frame 450, QCIF, 30 fps, 640 kbps: QIK=0.8163, QIB=7.8016, QIJ=11.5742, QIS=73.7107.

Fig. 6. Perceptual quality comparison under different spatial resolutions.

[27] ANSI, "American national standard for telecommunications - digital transport of video teleconferencing / video telephony signals - performance terms, definitions and examples," T1.801.02-1996, 1996.
[28] ITU, "Methodology for the subjective assessment of the quality of television pictures," Recommendation ITU-R BT.500-11, 2002.
[29] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[30] G. Ghinea, J. Thomas, and R. Fish, "Multimedia, network protocols and users - bridging the gap," in Proceedings ACM Multimedia 99, 1999.

Guangtao Zhai graduated from Shandong University, Shandong, China, with B.E. and M.E. degrees in 2001 and 2004, respectively. He is currently pursuing the Ph.D. degree at the Institute of Image Communication and Information Processing, Shanghai Jiao Tong University, Shanghai, China. From August 2006 to February 2007, he was an intern student at the Institute for Infocomm Research, Singapore. From March 2007 to January 2008, he was a visiting student at the School of Computer Engineering, Nanyang Technological University. His research interests include image and video processing, perceptual signal processing and pattern recognition.


(a) Frame 550, CIF, 30 fps, 640 kbps: QIK=0.7024, QIB=15.6357, QIJ=10.7504, QIS=118.0701. (b) Frame 550, QCIF, 30 fps, 640 kbps: QIK=0.6302, QIB=9.3705, QIJ=10.7596, QIS=63.5420.

(c) Frame 850, CIF, 30 fps, 640 kbps: QIK=0.9340, QIB=6.3026, QIJ=8.4982, QIS=50.0271. (d) Frame 850, QCIF, 30 fps, 640 kbps: QIK=0.7442, QIB=9.4005, QIJ=8.4681, QIS=59.2402.

Fig. 7. Perceptual quality comparison under different spatial resolutions (continued).

Fig. 8. The adaptation results and the corresponding QIS values: (a) the video adaptation process for 640 kbps, showing the selected spatial/temporal mode over the frame index, with the underlying sequences 'Foreman', 'Mobile', 'Tempete' and 'Mother and Daughter' marked; (b) the corresponding QIS values of the 352x288@30fps and 176x144@30fps streams at 640 kbps.


Jianfei Cai (S'98-M'02-SM'07) received his PhD degree from the University of Missouri-Columbia in 2002. Currently, he is an Associate Professor with Nanyang Technological University, Singapore. His major research interests include digital media processing, multimedia compression, communications and networking technologies. He has published over 80 technical papers in international conferences and journals. He has actively participated in the program committees of various conferences, and he was the mobile multimedia track co-chair for ICME 2006, the technical program co-chair for Multimedia Modeling (MMM) 2007 and the conference co-chair for Multimedia on Mobile Devices 2007. He is also an Associate Editor for IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT).

Weisi Lin (M'92-SM'98) graduated from Zhongshan University, China, with a B.Sc in Electronics and an M.Sc in Digital Signal Processing in 1982 and 1985, respectively, and from King's College, London University, UK, with a Ph.D in Computer Vision in 1992. He has taught and conducted research at Zhongshan University, Shantou University (China), Bath University (UK), the National University of Singapore, the Institute of Microelectronics (Singapore), the Centre for Signal Processing (Singapore), and the Institute for Infocomm Research (Singapore). He has been the project leader of 12 successfully delivered projects in digital multimedia technology development. He served as the Lab Head of Visual Processing and then the Acting Department Manager of Media Processing at the Institute for Infocomm Research. Currently, he is an Associate Professor in the School of Computer Engineering, Nanyang Technological University, Singapore. His areas of expertise include image processing, perceptual modeling, video compression, multimedia communication and computer vision. He holds 10 patents, has written 4 book chapters, and has published over 130 refereed papers in international journals and conferences. He is a member of IET and a Chartered Engineer (UK). He believes that good theory is practical, and so has kept a balance of academic research and industrial deployment throughout his working life.

Xiaokang Yang (M’00-SM’04) received the B. S. degree from Xiamen University, Xiamen, China, in 1994, the M.S. degree from the Chinese Academy of Sciences, Shanghai, China, in 1997, and the Ph.D. degree from Shanghai Jiao Tong University, Shanghai, in 2000. He is currently a Full Professor and Deputy Director of the Institute of Image Communication and Information Processing, Department of Electronic Engineering, Shanghai Jiao Tong University. From September 2000 to March 2002, he was a Research Fellow in Centre for Signal Processing, Nanyang Technological University, Singapore. From April 2002 to October 2004, he was a Research Scientist with the Institute for Infocomm Research, Singapore. He has published over 80 refereed papers, and has filed six patents. His current research interests include video processing and communication, media analysis and retrieval, perceptual visual processing, and pattern recognition. He actively participates in the International Standards such as MPEG-4, JVT, and MPEG-21. Dr. Yang received the Microsoft Young Professorship Award 2006, the Best Young Investigator Paper Award at IS&T/SPIE International Conference on Video Communication and Image Processing (VCIP2003), and awards from A-STAR and Tan Kah Kee foundations. He is a member of Visual Signal Processing and Communications Technical Committee of the IEEE Circuits and Systems Society. He was the Special Session Chair of Perceptual Visual Processing of IEEE ICME2006. He is the local co-chair of ChinaCom2007 and the technical program co-chair of IEEE SiPS2007.

Wenjun Zhang received the B.S., M.S. and Ph.D. degrees in electronic engineering from Shanghai Jiao Tong University, Shanghai, China, in 1984, 1987 and 1989, respectively. As a group leader, he was successfully in charge of developing the first Chinese HDTV prototype system in 1998. He is a Changjiang Scholarship Professor in the field of communications and electronic systems with Shanghai Jiao Tong University. His research interests include digital media processing and transmission, video coding, wireless wideband communication systems.
