SUBJECTIVE EVALUATION OF MULTI-USER RATE ALLOCATION FOR STREAMING HETEROGENEOUS VIDEO CONTENTS OVER WIRELESS NETWORKS

Xiaoqing Zhu and Bernd Girod
Information Systems Laboratory, Stanford University, Stanford, CA 94305, USA
{zhuxq,bgirod}@stanford.edu

ABSTRACT

Multi-user rate allocation is essential for preventing multiple simultaneous video streaming sessions from congesting a time-shared wireless network. Although the problem fits naturally within an optimization framework, it remains unclear which objective function to optimize, especially for video streams with heterogeneous resolutions, frame rates and content complexity. To address this issue, we have conducted subjective tests and collected viewers' opinion scores for various quality combinations of high-definition (HD) and standard-definition (SD) video snapshots. The test results yield a simple subjective quality model, useful for comparing the performance of several rate allocation schemes. It is shown that the subjectively preferred allocation can be closely approximated by minimizing a weighted or equal-weight sum of the mean-squared-error (MSE) distortions of all streams. Significant performance gains, up to 1.5 in MOS, can be achieved over media-unaware TCP-friendly allocation when network resources are limited.

Index Terms— subjective test, rate allocation, multi-user video streaming, wireless networks

1. INTRODUCTION

Video streaming over wireless networks is compelling for many applications, ranging from surveillance camera networking to interconnecting home entertainment devices. Rate allocation becomes an important task in the presence of multiple simultaneous streaming sessions. Since the video streams time-share the common wireless radio channel, their rates need to be jointly allocated even when there is no shared link in their routes. The rate allocation problem is further complicated by heterogeneity in both the video contents and the wireless link capacities, as pointed out in [1] and [2].
Many research efforts have been dedicated to joint rate adaptation of multiple video streams based on various quality metrics, such as mean-squared-error (MSE) video distortion [3] and mean-opinion-score (MOS) [4]. In general, it remains to be explored which objective function to optimize, to better reflect human preference in a realistic application scenario. The work in [1] has shown that minimizing MSE distortion is the subjectively preferred criterion when video streams of the same resolution time-share a wireless LAN. However, this finding may not carry over to video streams with heterogeneous resolutions and frame rates. Consider, for instance, one high-definition (HD) video stream competing with one standard-definition (SD) video stream over two wireless links, as in Fig. 1. Should the rate allocation scheme minimize the total MSE distortion of both streams? Or, rather, should one account for their resolution differences and assign a different weight to the distortion of each stream? In the latter case, how much weight should be assigned to the HD and the SD stream, respectively?

This paper addresses the above questions. We conduct subjective viewing tests of pairs of image snapshots from HD and SD video streams in various quality combinations. Analysis of the viewers' opinion scores, collected from 20 participants for each of the 4 sequence pairs, leads to a simple subjective quality model. It is shown that the mean-opinion-score (MOS) of an image pair can be approximated as a bilinear function of the individual image MOS, which, in turn, fits linearly to the MSE distortion of that image. This allows subjective evaluation of several rate allocation schemes, including media-aware schemes that minimize a weighted or equal-weight sum of MSE distortions, and a media-unaware scheme with TCP-friendly allocation. The rest of the paper is organized as follows. The next section describes the subjective viewing test. In Section 3, we analyze the test results, present the subjective quality model, and compare the subjective evaluation of various rate allocation schemes.

(This work is partially supported by NSF Grant CCR-0325639, a gift from HiSilicon Technologies, and the Max Planck Center for Visual Computing and Communication at Stanford.)

Fig. 1. Network topology of one HD stream competing with one SD stream, over wireless links with capacities C1 and C2, respectively.

2. SUBJECTIVE VIEWING TEST

Subjective viewing tests are conducted to collect viewers' opinion scores of various quality combinations of two video sequences. To shorten testing time, the different video quality levels are presented as still image snapshots instead of continuously playing video segments. Each pair of test images is displayed side-by-side on two 24-inch Samsung Syncmaster 245BW LCD monitors at a distance of 1.5 m from the viewer, corresponding to approximately 5 times the picture height of the HD image and 6 times the picture height of the SD image. Potential inconsistencies between the two monitors are mitigated by randomizing which monitor displays which image. The testing room is kept dark, as recommended by [5]. Figure 2 illustrates the test setup.

Each test involves one HD video sequence with a resolution of 1280 × 720 pixels per frame at 60 frames per second (fps), and one SD video sequence at 704 × 576 pixels per frame and 30 fps. Four test data sets are chosen to cover a range of content complexity. The rate-PSNR tradeoff curves of the HD/SD video pairs are plotted in Fig. 3. Each video is represented at 5 quality levels, corresponding to the uncompressed version and compressed versions with quantization parameters (QP) of 24, 34, 40 and 44, using x264, a fast implementation of the H.264/AVC codec [6][7]. In addition to the 25 HD/SD image pairs, each test data set also contains pairs of identical HD or SD images on both displays, for all 5 quality levels. Scores collected from these 10 image pairs are used to calculate the MOS of individual HD and SD images.

Each test comprises one training session and two scoring sessions. The training session contains pairs of HD and SD images at the highest, intermediate and lowest quality levels, as highlighted in Fig. 4. Its purpose is to familiarize the viewer with the expected range of perceived image qualities; no scores are collected at this stage. In the scoring sessions, the viewer provides an opinion score for each presented image pair, based on the perceived overall quality of both images. The opinion scores range from 1 to 5, where a score of 1 means "bad" and a score of 5 means "excellent"; fractional scores are allowed between adjacent description levels. All 35 image pairs are presented in random order, to mitigate the contextual effect [5]. The viewers are allowed sufficient time to observe and score each image pair. Each data set is presented to 20 participants.

Fig. 2. Subjective viewing test with HD and SD image contents on two adjacent displays.

Fig. 3. Four HD/SD video pairs used in the subjective viewing test: (a) Fountain vs. Ice, (b) Dijana vs. Crew, (c) City vs. City, (d) Raven vs. Harbor. The rate-PSNR tradeoff curves are obtained from compression with the x264 codec [6] at various quantization parameters.

Fig. 4. Quality combinations of HD/SD image pairs presented in the test. The highlighted circles indicate image pairs used in the training session.

3. TEST RESULTS

3.1. Subjective Quality Model

Figure 5 shows the mean opinion score (MOS) of each HD or SD image, averaged over 40 readings from 20 participants in each test, as a function of its mean-squared-error (MSE) distortion. The standard deviations of the MOS values are also plotted as error bars. It can be observed that the MOS of each image can be fitted as a linear function of its MSE distortion:

X = X0 − α DHD,   (1)
Y = Y0 − β DSD.   (2)

The parameters X0 and Y0 correspond to the highest MOS, for uncompressed images, and are around 4.5 for all images. The slope of each curve, α or β, is content-dependent, and can be calculated by looking up the MSE distortion of the images with the lowest MOS, typically around 1.5. The fitting curves are shown as dotted lines in the same figure. Interestingly, the MOS values of images containing many complex details tend to be less sensitive to changes in MSE distortion, as in Fountain and Harbor, while the MOS values of images containing simple scenes and smooth regions tend to vary more rapidly with MSE distortion, as in Ice and Raven. This can be explained by the spatial masking effect of the human visual system [8].
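As an illustration, the linear relation in (1) and (2) can be obtained by an ordinary least-squares fit. The sketch below uses hypothetical (MSE, MOS) readings rather than the actual test data; `fit_linear_mos` is a helper introduced here only for illustration.

```python
import numpy as np

def fit_linear_mos(mse, mos):
    """Least-squares fit of MOS = X0 - alpha * MSE, as in Eqs. (1)-(2).

    Returns (X0, alpha), where alpha is the content-dependent slope.
    """
    mse = np.asarray(mse, dtype=float)
    mos = np.asarray(mos, dtype=float)
    # np.polyfit returns [slope, intercept] for a degree-1 fit
    slope, intercept = np.polyfit(mse, mos, 1)
    return intercept, -slope

# Hypothetical readings: MOS falls from ~4.5 (uncompressed) toward ~1.5
mse_levels = [0.0, 30.0, 70.0, 120.0, 180.0]
mos_scores = [4.5, 4.1, 3.4, 2.6, 1.8]
X0, alpha = fit_linear_mos(mse_levels, mos_scores)
```

With these hypothetical readings the fit recovers an intercept near the 4.5 reported for uncompressed images and a small positive slope.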

Fig. 5. MOS value as a function of MSE distortion for individual HD or SD images. The standard deviations are plotted in the form of error bars, and range from 0.3 to 0.8 in MOS. The dotted lines indicate the linear fit.

Fig. 6. Bilinear fit of MOS of an image pair as a function of the individual image MOS. A common set of parameters, w0 = 0.80, w1 = 0.20, w2 = 0.15 and w3 = 0.10, can be used to fit all experiments, with a maximum fitting error of 0.24 in MOS.
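A bilinear surface such as the one reported in Fig. 6 can be fitted by ordinary least squares over the features (1, X, Y, XY). The sketch below generates synthetic pair scores from the reported weights plus small noise, purely to illustrate the fitting procedure; `fit_bilinear` is a hypothetical helper, not code from the study.

```python
import numpy as np

def fit_bilinear(x, y, z):
    """Least-squares fit of Z = w0 + w1 X + w2 Y + w3 XY, as in Eq. (3).

    x, y: individual HD and SD image MOS; z: observed pair MOS.
    Returns the weight vector (w0, w1, w2, w3).
    """
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    A = np.column_stack([np.ones_like(x), x, y, x * y])
    w, *_ = np.linalg.lstsq(A, z, rcond=None)
    return w

# Synthetic pair scores from the reported weights, plus small noise
rng = np.random.default_rng(0)
xs = rng.uniform(1.0, 4.5, 50)
ys = rng.uniform(1.0, 4.5, 50)
zs = 0.80 + 0.20 * xs + 0.15 * ys + 0.10 * xs * ys + rng.normal(0, 0.02, 50)
w_fit = fit_bilinear(xs, ys, zs)
```

On this synthetic data the recovered weights closely match the generating values (0.80, 0.20, 0.15, 0.10).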

In addition, the MOS of an image pair can be fitted to a bilinear function of the individual image MOS:

Z = w0 + w1 X + w2 Y + w3 XY.   (3)

In (3), Z denotes the MOS of the image pair; X and Y are the MOS values of the HD and SD image, respectively. Surprisingly, a common set of parameters w0 - w3 can be used to fit all 4 data sets despite their content differences. The maximum fitting error is 0.24 in MOS, comparable to the standard deviation of the collected scores. Figure 6 shows the contour plot of this fitted function. As expected, the MOS of the image pair increases with the MOS of either image, and is slightly higher when the qualities of both images are more balanced. We note that the weights for the HD and SD images are similar even though their areas on the display differ by a factor of 2.3.

Combining (1) - (3) with the video distortion-rate tradeoff described by the following parametric model [9]:

D(R) = D0 + θ / (R − R0),   (4)

we can derive the subjective opinion score as a function of the given rate allocation. Figure 7 shows the contour plots of such functions, which are content-dependent.

Fig. 7. Contour graphs of the subjective MOS of HD/SD sequence pairs, as a function of the rate allocated to each stream: (a) Fountain vs. Ice, (b) City vs. City, (c) Dijana vs. Crew, (d) Raven vs. Harbor.

3.2. Comparison of Rate Allocation Schemes

We now consider rate allocation for two competing streams over parallel wireless links, as in Fig. 1. Given the video sequence pairs, the wireless link capacities and a total utilization limit γ, different rate allocation schemes result in different rate pairs (RHD, RSD) along the tradeoff line:

RHD / C1 + RSD / C2 = γ.   (5)

Figure 8 shows the MOS derived from the subjective quality model along the tradeoff line, for the sequence pair Dijana vs. Crew. Different lines indicate different values of C2. Note that the MOS values become more sensitive to the rate allocation with lower values of C2, i.e., more stringent network resources. Results from four allocation schemes are also marked on the graph:

• MAX-MOS: allocation by maximizing the MOS value derived from the subjective quality model, marked in red "◦" signs. This corresponds to the subjectively preferred allocation, and serves as an upper bound of performance.

• MIN-WMSE: allocation by minimizing the weighted sum of MSE distortions of both streams, marked in purple "×" signs. The weights for the HD and SD streams are chosen as α and β from (1) and (2).

• MIN-MSE: allocation by minimizing the total MSE distortion of both sequences, marked in blue "+" signs.

• TCP-Friendly: allocation based on TCP-Friendly Rate Control (TFRC) [10], marked in green "∗" signs. Both streams are assigned equal rates, since they observe the same round-trip time and packet loss ratio over the common wireless network.

From Fig. 8, it can be noted that the MIN-WMSE scheme closely approximates the subjectively preferred allocation. The MIN-MSE scheme tends to allocate a slightly higher rate to the HD stream due to the mismatch in the weights, yet still achieves close-to-optimal MOS. In contrast, the media-unaware TCP-friendly allocation results in significantly lower MOS values when network resources are limited.
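Combining (1) - (5), each allocation scheme reduces to a one-dimensional search along the tradeoff line. The sketch below implements the MAX-MOS and MIN-WMSE objectives under illustrative distortion-rate parameters (D0, θ, R0) and linear-model slopes, not values fitted to the actual sequences; only the bilinear weights are those reported for Fig. 6.

```python
import numpy as np

# Bilinear weights common to all four data sets (Fig. 6)
W0, W1, W2, W3 = 0.80, 0.20, 0.15, 0.10

def pair_mos(x, y):
    """Eq. (3): MOS of an image pair from the individual image MOS x and y."""
    return W0 + W1 * x + W2 * y + W3 * x * y

def distortion(rate, d0, theta, r0):
    """Eq. (4): MSE distortion as a function of the encoding rate (Mbps)."""
    return d0 + theta / (rate - r0)

def sweep(c1, c2, gamma, objective, steps=2000):
    """Sweep the tradeoff line (5), RHD/C1 + RSD/C2 = gamma, and return
    the rate pair (RHD, RSD) that maximizes the given objective."""
    best_val, best_pair = -np.inf, None
    for u in np.linspace(0.05, gamma - 0.05, steps):  # HD utilization share
        r_hd, r_sd = u * c1, (gamma - u) * c2
        val = objective(r_hd, r_sd)
        if val > best_val:
            best_val, best_pair = val, (r_hd, r_sd)
    return best_pair

# Illustrative parameters, not fitted values:
# (X0 or Y0, slope alpha or beta) and (D0, theta, R0) per stream
hd_lin, hd_dr = (4.5, 0.015), (1.0, 40.0, 0.0)
sd_lin, sd_dr = (4.5, 0.030), (1.0, 10.0, 0.0)

def max_mos(r_hd, r_sd):
    x = hd_lin[0] - hd_lin[1] * distortion(r_hd, *hd_dr)  # Eq. (1)
    y = sd_lin[0] - sd_lin[1] * distortion(r_sd, *sd_dr)  # Eq. (2)
    return pair_mos(x, y)

def min_wmse(r_hd, r_sd):
    # Negated weighted distortion sum, with weights alpha and beta (MIN-WMSE)
    return -(hd_lin[1] * distortion(r_hd, *hd_dr)
             + sd_lin[1] * distortion(r_sd, *sd_dr))

rates_mos = sweep(20.0, 5.0, 0.9, max_mos)
rates_wmse = sweep(20.0, 5.0, 0.9, min_wmse)
```

In this toy setting the MIN-WMSE rate pair lands close to the MAX-MOS one, mirroring the observation in Fig. 8 that MIN-WMSE closely approximates the subjectively preferred allocation.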

Fig. 8. MOS of rate allocation results, as the utilization of the HD stream increases, for the sequence pair Dijana vs. Crew. The HD sequence Dijana is streamed over the first wireless link, while the SD sequence Crew is streamed over the second wireless link. The capacity of the first link is chosen at 20 Mbps; the capacity of the second link is chosen as 1 Mbps, 2 Mbps, 5 Mbps, and 20 Mbps in each experiment. The utilization limit γ is set to 90%.

Fig. 9. Subjective evaluation of the four allocation schemes (MAX-MOS, MIN-WMSE, MIN-MSE, TCP-Friendly): (a) Fountain vs. Ice, (b) Dijana vs. Crew, (c) City vs. City, (d) Raven vs. Harbor. The MOS is plotted against varying capacity of the second wireless link, while the capacity of the first link is fixed at 20 Mbps. The utilization limit γ is set to 90%.

The relative performance of the four schemes is confirmed in Fig. 9, which plots the MOS achieved by each allocation scheme against varying values of C2. Both the MIN-WMSE and MIN-MSE schemes achieve almost the same MOS values as the subjectively preferred allocation. The performance gain over the media-unaware TCP-friendly allocation varies with sequence content, and increases with decreasing values of C2. In particular, the improvement in subjective evaluation can be as large as 1.5 in MOS, with the Dijana HD sequence over a link of 20 Mbps and the Crew SD sequence over a link of 1 Mbps.

4. CONCLUSIONS

This paper presents a subjective viewing test designed to evaluate rate allocation results for streaming heterogeneous video contents over wireless networks. We collect opinion scores from 20 participants for each of the 4 HD/SD video pairs, presented in the form of snapshots in various quality combinations. A simple subjective quality model is derived from the test results. It is shown that the MOS of an image pair can be fitted as a bilinear function of the individual image MOS, with a common set of parameters irrespective of image content. In addition, the subjective MOS value relates linearly to the objective MSE distortion of the individual HD or SD images. The model is then used to compare the performance of several rate allocation schemes. The subjectively preferred allocation can be closely approximated by both the MIN-WMSE and MIN-MSE schemes, whereas media-unaware TCP-friendly allocation may incur significantly suboptimal results when network resources are limited. Despite the simplicity of the test scenario, we believe that the insights obtained from this work may apply to more general conditions in designing rate allocation schemes for streaming heterogeneous video contents over wireless networks.

5. ACKNOWLEDGEMENTS

The authors would like to thank Dr. Joyce Farrell for many insightful discussions on the test design. We thank Dr. Peter van Beek at Sharp Labs of America, Dr. Erwin Bellers at NXP Semiconductors and Dr. Jianhua Zheng at HiSilicon Technologies for providing some of the test video sequences.

6. REFERENCES

[1] M. Kalman and B. Girod, "Optimal channel-time allocation for the transmission of multiple video streams over a shared channel," in Proc. IEEE International Workshop on Multimedia Signal Processing, Shanghai, China, Oct. 2005.

[2] X. Zhu and B. Girod, "Distributed rate allocation for video streaming over wireless networks with heterogeneous link speeds," in Proc. International Symposium on Multimedia over Wireless, Honolulu, HI, USA, Aug. 2007, pp. 296-301.

[3] T. Schierl, S. Johansen, C. Hellge, T. Stockhammer, and T. Wiegand, "Distributed rate-distortion optimization for rateless channel coded scalable video in mobile ad hoc networks," in Proc. International Conference on Image Processing, San Antonio, TX, USA, 2007, vol. 6, pp. 497-500.

[4] S. Khan, S. Duhovnikov, E. Steinbach, and W. Kellerer, "MOS-based multiuser multiapplication cross-layer optimization for mobile multimedia communication," Advances in Multimedia, 2007.

[5] ITU-R BT.500-8, Methodology for the Subjective Assessment of the Quality of Television Pictures, Aug. 1998.

[6] "x264," http://www.videolan.org/x264.html.

[7] ITU-T and ISO/IEC JTC 1, Advanced Video Coding for Generic Audiovisual Services, ITU-T Recommendation H.264 - ISO/IEC 14496-10 (AVC), 2003.

[8] B. Girod, "The information theoretical significance of spatial and temporal masking in video signals," in Proc. SPIE/SPSE Conf. on Human Vision, Visual Processing and Digital Display, Los Angeles, CA, USA, Jan. 1989, pp. 178-187.

[9] K. Stuhlmüller, N. Färber, M. Link, and B. Girod, "Analysis of video transmission over lossy channels," IEEE Journal on Selected Areas in Communications, vol. 18, no. 6, pp. 1012-1032, June 2000.

[10] S. Floyd and K. Fall, "Promoting the use of end-to-end congestion control in the Internet," IEEE/ACM Trans. on Networking, vol. 7, pp. 458-472, Aug. 1999.