
Designing QoE Experiments to Evaluate Peer-to-Peer Streaming Applications

Tom Z. J. Fu(a), Dah Ming Chiu(a) and Zhibin Lei(b)

(a) The Chinese University of Hong Kong, Hong Kong, China
(b) Applied Science & Technology Research Institute (ASTRI), Hong Kong, China

ABSTRACT

Quality of Experience (QoE) refers to subjective criteria for evaluating multimedia content. Methods have been devised to study the design of Voice over IP systems and video codecs. In recent years, due to more abundant network bandwidth, it has become quite popular to watch video streamed over the Internet, whether by the client-server method or through a peer-to-peer (P2P) network. In this paper, we describe our experience in conducting QoE studies of P2P streaming using a chunk-level model. Instead of considering fine-grained network service impairments such as bit errors, packet losses or delays, we focus on chunk-level delays. We carry out some preliminary QoE experiments on low-bit rate, low-frame rate and low-resolution video (3L-video) sequences. We apply the chunk-level model to help improve the design of P2P streaming algorithms, and the design of video players that play back network-streamed video.

Keywords: QoE, Subjective assessment, Video quality, Peer-to-Peer streaming

1. INTRODUCTION

Increasingly, multimedia content is delivered over the network, and there are many mechanisms for transporting it. Traditionally, multimedia content is accessed via some client-server mechanism (e.g. a web service). Efficiency can be improved by proxy servers and CDNs (content distribution networks). Recently, it has become popular to use peer-to-peer (P2P) mechanisms: the load on dedicated servers and proxies is relieved by the customers accessing the content, who redistribute it among themselves. Often, content delivery and playback are carried out concurrently - this is known as "streaming", whether on a one-to-one basis (as in client-server) or on a one-to-many basis (as in P2P).

There is therefore a need to evaluate the different P2P mechanisms and algorithms for content streaming. For traditional file transfers, a simple measurable quantity such as total delay can serve as the objective. For multimedia content streaming, however, the objective is the quality of the playback experience (referred to as QoE), a more elusive and subjective quantity. The only accepted method for measuring QoE is through subjective scoring by representative users. But the subjective scoring process is labor-intensive and time-consuming. Ideally, the subjective testing should be repeated for different settings (network topology, protocols and algorithms, number of users, etc.) and different types of content; furthermore, the experiments should be repeatable to be convincing. In reality, large-scale P2P experiments are very difficult to set up. Some people use PlanetLab as a (semi-)controlled environment, but exactly the same experiment is hardly repeatable there. Another approach is to use some unique content to attract users to participate in an experiment (e.g. some tests were done during the 2008 Olympics broadcasting). Although such experiments can be reasonably large-scale, they cannot be easily repeated.

In P2P streaming systems, content is divided into chunks, and the timeliness with which a chunk is received relative to its playback time has a dominant impact on the resulting QoE. Therefore, we conduct QoE subjective scoring for typical chunk delay scenarios. This approach helps us design playback scheduling strategies. It also allows us to evaluate the QoE of different P2P algorithms based on the chunk delay characteristics they produce.

Further author information:
Tom Z. J. Fu and Dah Ming Chiu, E-mail: {zjfu6, dmchiu}@ie.cuhk.edu.hk
Zhibin Lei, E-mail: [email protected]

In this paper, we propose a chunk-level impairment model and use it to characterize the many parameters in a typical P2P streaming application. The model is simple, yet effective enough to allow us to capture the many factors affecting user-perceived QoE. Based on this model, we design a corresponding subjective experiment and single out an important factor, average (dis)continuity, for validation as a performance metric. The preliminary results show the effectiveness of the model, which can lead to better understanding and improvement of practical P2P streaming systems by taking users' subjective experience into consideration.

1.1. Related works

Video quality assessment for TV applications has received much attention in the past. A number of publications and standards are available, e.g., ITU-R BT.500,1 the first recommendation on subjective assessment methodology. Although some concrete methods have been designed for subjective assessment of video quality in multimedia applications,2 they are quite preliminary and do not provide technical specifications for experiments on personal computers.

More recently, there have been more studies on subjective assessment of 3L-video, namely low-bit rate, low-frame rate and low-resolution video, which constitutes most of the video streamed on the Internet. In these existing works, each reference sequence (SRC) was processed through a number of error conditions, or so-called Hypothetical Reference Circuits (HRC), to generate processed video sequences (PVS).3-6 Roughly speaking, the HRC is the counterpart of our chunk-level impairment model. The HRCs considered so far are suitable for client-server streaming services, where there is a single link between the server and the client and the packet loss rate is the only type of network transmission error considered. But for P2P systems delivering 3L-video, such an HRC model operates at too fine a granularity. Logically, the minimum processing unit is no longer the packet but the chunk. The size of a chunk is implementation-dependent, but definitely much larger than one packet. In P2P streaming applications,7-9 almost all the important building blocks and components are designed around chunks, e.g. the chunk selection algorithm, the peer selection algorithm (based on chunk bit-map information exchanged among neighbor peers), local chunk buffer management, and so on.

The rest of the paper is organized as follows. The chunk-level impairment model is described and discussed in Section 2. Section 3 covers the preliminary experiment we have carried out and the data analysis results. Finally, we conclude our work in Section 4.

2. CHUNK-LEVEL IMPAIRMENT MODEL

In this section, we introduce the design of the chunk-level impairment model for subjective assessment of video content distributed by P2P mechanisms over the Internet, shown in Fig. 1. As illustrated in Fig. 1, the top part is the same as the HRC with a single-link network condition for the client-server model,3-5 i.e., different codecs and bit rates can be chosen at the video encoder component. In our design, however, the packet loss rate (plr) is replaced by a chunk-level impairment model, which works in the following way. First, the encoded video content is organized into video chunks according to the assigned chunk size (a pre-defined parameter). These chunks are then distributed through the P2P overlay. During the distribution process, different types of chunk-level distortion are introduced. After receiving the video chunks, each (peer) client keeps the received chunks in a local chunk-level buffer for continued sharing with other peers. At the same time, the client must make a playback decision for each chunk (the playback controller function): when the playback deadline of a downloaded chunk arrives, the client removes it from the local buffer and sends it to the decoder and player.

The playback controller runs at the network streaming layer, above the media (video and audio) codec layer. Hence, it is transparent to the codec used, and can be designed to take the various tradeoffs in a typical streaming application into consideration. This is an important practical consideration: with a large deployed client base in place, any improvement to the playback player must not create standards-compliance issues for existing decoders.

Next, we list the decision parameters and error conditions involved in the chunk-level impairment model. The implementation issues are discussed afterwards.

[Figure: the traditional HRC chain - source video (SRC), video encoder, network transmission, video decoder, processed video (PVS) - with the network-transmission stage replaced by the chunk-level impairment model: a chunk maker (governed by the chunk size parameter) feeds chunks into a peer-to-peer overlay transmitted over the Internet, introducing chunk-level distortion; each peer runs a playback controller and a chunk buffer manager.]

Figure 1. Chunk-level impairment generation chain.

I. Chunk size parameter. The chunk size is a key parameter that determines the trade-offs between chunk-level transmission, control overhead, and chunk-level playback delay. There are two extreme cases: when the chunk size is set to the smallest value (equal to the packet size), the system reduces to traditional client-server streaming; when it is set to the largest value (equal to the size of the whole video content), it resembles P2P file downloading.

II. Chunk-level distortion. We introduce the following notation to describe it: $T_i^e$, the expected playback time of the $i$th chunk; $T_i^s$, the download start time of chunk $i$; $T_i^c$, the download completion time of chunk $i$.

1. Chunk-level delay. Chunk $i$ is delayed if its download completion time is later than its expected playback time. We further define $D_i = \{T_i^c - T_i^e\}^+$, the duration of the chunk-level delay of the $i$th chunk, where $\{x\}^+ = x$ when $x > 0$, and $0$ otherwise.

2. Chunk delay distribution. The chunk delay distribution is an aggregate statistic over all delayed chunks. In the simplest case, it can be represented by a discrete random variable. For example, if $X \in \{1, 2\}$ (seconds) with $p(X = 1) = p(X = 2) = 0.5$, all delayed chunks encounter a delay of either 1 or 2 seconds with equal probability. Less restrictive assumptions can be made for other, more general scenarios.

3. Chunk receiving pattern. This describes how a chunk, say chunk $i$, is filled over the whole downloading process. If we denote by $f_i(t)$, $t \in [T_i^s, T_i^c]$, the download completion percentage of chunk $i$ at time $t$, then $f_i(t)$ is a non-decreasing function with $f_i(T_i^s) = 0$ and $f_i(T_i^c) = 1$. Mathematically, the chunk receiving pattern can be represented by any non-decreasing curve $f_i(t)$ over $t \in [T_i^s, T_i^c]$.

III. Playback controller. When a chunk is delayed (arrives after its playback time), the playback controller must decide how to handle the condition. For example, the following three actions are typical choices (a minimal sketch of this controller logic, together with the continuity metric of item IV, is given at the end of this section):

a) wait until the chunk is completely downloaded and then send it to the decoder;

b) directly send the incomplete chunk to the decoder with no waiting;

c) wait for the chunk to complete, with a timer initiated with a pre-defined expiration threshold; once the timer expires, the client stops waiting and sends the chunk to the decoder immediately.

Let us define the longest waiting time (LWT) as the timer expiration threshold. In this sense, actions (a) and (b) are special cases of (c), corresponding to $LWT = \infty$ and $LWT = 0$, respectively.

IV. Average playback continuity, $c$. This is an objective metric widely used by P2P streaming system designers to evaluate the overall performance of the streaming service. It is believed to be highly correlated with users' subjective feelings and QoE. The correctness of this common belief needs experimental validation, which is an interesting question for our study. In reality, various definitions of average playback continuity have been proposed.7, 9, 10 In this paper, we adopt the definition in Ref. 9:

$$c = \frac{\text{Continuously viewing time}}{\text{Total viewing time (including waiting time)}}.$$

It is worth noting that our chunk-level model can be used to evaluate the relationship between QoE and average continuity under other definitions as well.
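To make items III and IV concrete, the following is a minimal Python sketch of the playback controller's longest-waiting-time logic and of the average continuity computation. It is our own illustration under a simplified timing model, not the authors' implementation; the function and variable names are hypothetical, and treating incomplete chunks as contributing no viewing time is an assumption (following the discard rule adopted later in Section 3.1).

```python
def playback_outcomes(expected, completed, lwt):
    """Per-chunk decision under action (c): wait for completion, but at
    most `lwt` seconds past the playback deadline.

    expected:  list of expected playback times T_i^e (seconds)
    completed: list of download completion times T_i^c (seconds)
    lwt = float('inf') reproduces action (a); lwt = 0.0 reproduces (b).
    Returns a list of (freeze_seconds, chunk_complete) pairs.
    """
    outcomes = []
    for t_e, t_c in zip(expected, completed):
        delay = max(t_c - t_e, 0.0)          # D_i = {T_i^c - T_i^e}^+
        if delay <= lwt:
            outcomes.append((delay, True))   # freeze D_i seconds, then play
        else:
            outcomes.append((lwt, False))    # timer expires: chunk incomplete
    return outcomes


def average_continuity(outcomes, chunk_duration):
    """c = continuously-viewing time / total viewing time (incl. waiting).

    Assumption: incomplete chunks are discarded, so they add waiting
    time but no viewing time.
    """
    viewing = sum(chunk_duration for _, complete in outcomes if complete)
    waiting = sum(freeze for freeze, _ in outcomes)
    return viewing / (viewing + waiting)
```

For example, three 1-second chunks due at t = 1, 2, 3 and completed at t = 0.5, 3.5, 7.0 give outcomes (0, True), (1.5, True), (3.0, False) under LWT = 3, hence c = 2 / (2 + 4.5), approximately 0.31.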

Based on the chunk-level model described above, there is a huge number of ways in which videos can be distorted. It is impossible, and unnecessary, to cover all of them in a real subjective assessment. Careful experiment design is therefore very important, to focus on the scenarios of likely interest. This is illustrated by our QoE experiment in the next section.

3. PRELIMINARY EXPERIMENT AND RESULTS

In this section, we describe some preliminary subjective experiments we have conducted on 3L-video sequences processed by the chunk-level impairment model, and discuss the data analysis results. We first state two experiment objectives. After that, we describe the chunk-level impairment settings we adopted and the selection procedure for the 50 source video clips. Finally, we describe the subjective assessment methodology.

We first reduce the huge experimental space of possible impairment scenarios down to a manageable set of typical settings. This process is ad hoc, and will be made more rigorous and systematic in future work. We have two objectives:

1. To validate the effectiveness of average playback (dis)continuity against subjective assessment. If the correlation does exist, we will try to find a simple mapping scheme between the objective and subjective metrics.

2. To explore the relationship between the chunk delay distribution and subjective QoE while keeping all the remaining settings unchanged.

3.1. Experimental settings and chunk-level distortions

The experimental settings and chunk-level distortions were designed in accordance with these two objectives. First of all, the video and audio encoders were fixed to settings reflecting the 3L-video characteristics, as listed in Table 1:

Video codec:  H.264        Video bit rate:  512 kbps
Audio codec:  AAC          Audio bit rate:  128 kbps
Resolution:   720 × 576    Frame rate:      30 fps

Table 1. Fixed settings applied by the video and audio encoders.

Secondly, the behaviors of both the playback controller and the decoder were further simplified: the playback controller fixed LWT = 3 seconds (a default setting of some practical systems) throughout the whole generating process, while the decoder was configured to follow two rules:

i. If the video chunk from the peer client is incomplete, discard it; otherwise, decode and play it back.

ii. If no chunk has been received from the peer client at the expected playback time, the decoder simply freezes at the last playable image until new content arrives.

As a result, there are only three possible conditions for chunk-level video distortion:

i. $D_i = 0$, no distortion. If chunk $i$ is completed before its expected playback time, it is decoded and played back normally.

ii. $0 < D_i < LWT$, freeze-and-play. If chunk $i$ is delayed but still completed before the longest waiting time, the resulting effect in the PVS is an image frozen for a duration of $D_i$, after which chunk $i$ plays back normally.

iii. $D_i \geq LWT$, freeze-and-discard. If chunk $i$ is delayed and remains incomplete when the longest waiting time expires, the effect in the PVS is an image frozen for $LWT$, after which playback skips to chunk $i + 1$ (chunk $i$ is discarded).

Finally, we arrive at a reduced set of 20 combinations of chunk-level distortions, composed of two factors:

- Average discontinuity ($d = 1 - c$): 0, 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%.

- Chunk delay distribution: delays uniformly distributed in [0, 2] seconds (the short delay distribution), or all delays equal to 3 seconds, the LWT (the long delay distribution).

Note that for each selected average discontinuity $d$, two chunk-level distortions were generated and applied to the source videos, one with the long and one with the short delay distribution. Since all the source videos have approximately the same length, it can easily be inferred that the short delay distribution, with mean delay $\bar{D} \approx 1$ second, produces more delay events than the long delay distribution, for which $D_i = 3$ seconds for all $i$.
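As an illustration of how such a reduced test set could be enumerated, and how each delayed chunk maps onto the three distortion conditions above, here is a short Python sketch. It is our own sketch; the constant names and sampling helpers are hypothetical, not taken from the authors' tooling.

```python
import random

LWT = 3.0  # longest waiting time (seconds), fixed in the experiment

# The ten average-discontinuity levels d = 1 - c used in the experiment.
DISCONTINUITIES = [0.0, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.50, 0.60]


def sample_delay(distribution):
    """Draw one chunk-delay value from the two studied distributions."""
    if distribution == "short":
        return random.uniform(0.0, 2.0)  # uniform in [0, 2] seconds
    if distribution == "long":
        return LWT                       # every delay equals 3 seconds
    raise ValueError(f"unknown distribution: {distribution}")


def classify(delay):
    """Map a chunk delay D_i onto the three distortion conditions."""
    if delay == 0.0:
        return "no distortion"           # played back normally
    if delay < LWT:
        return "freeze-and-play"         # freeze D_i seconds, then play
    return "freeze-and-discard"          # freeze LWT seconds, skip chunk


# 10 discontinuity levels x 2 delay distributions = 20 distortion settings.
SETTINGS = [(d, dist) for d in DISCONTINUITIES for dist in ("short", "long")]
assert len(SETTINGS) == 20
```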

3.2. Source video sequences (SRC)

Subjective feelings and judgements can be affected by video content. It is therefore crucial to select a variety of video content for the subjective assessment. Following the specification in the VQEG MM test plan,3 we selected 50 video clips from five categories, each with different characteristics in terms of scene changes and object movement, as summarized in Table 2.

   Category            Description
A  Movies              Most are movie trailers, with numerous scene changes and rapid cuts.
B  Advertisements      Action content, with a considerable number of scene changes.
C  Vigorous sports     Basketball and soccer games involving more than ten players, with fewer scene changes than advertisements and movies.
D  Static sports       Matches of badminton, tennis and table tennis with limited object movement; camera motion is present but scene changes are few.
E  Music videos/news   People singing or speaking, with low-motion objects and low texture complexity.

Table 2. Brief description of the five selected categories of source videos.

As a result, each category contained 10 video clips, and each source video clip had approximately the same length, 32 seconds.

3.3. Subjective assessment methodology

3.3.1. Laboratory set-up and viewing conditions

The subjective experiments were conducted in our research lab with a controlled lighting environment conforming to the Recommendations.1, 2 Since the experiments were entirely computer-based, two computers were set up. All the video clips were played back by a proprietary player on these two computers and shown on LCD displays as specified in the VQEG MM test plan.3 The specifications of the two computers and LCD displays are shown in Tables 3 and 4.

              Computer A              Computer B
CPU           Core 2 6320 1.86 GHz    Pentium IV 3.2 GHz
Memory        2 GB                    1 GB
Video card    On-board                On-board

Table 3. Technical specifications of the testing computers.

                    Computer A            Computer B
Model               Philips 170S5 17”     Dell 1704FPT 17”
Native resolution   1280 × 1024 pixels    1280 × 1024 pixels
Dot pitch           0.264 mm              0.264 mm
Response time       16 ms typical         12 ms typical
Vertical refresh    60 Hz                 60 Hz
Color temperature   6500 K                6500 K

Table 4. Technical specifications of the LCD displays.

It is noted that the video cards on the two computers were different, and the LCD displays were different but with similar specifications.

3.3.2. Assessment scheme and experiment procedure

According to the comparison results of a recent work by Huynh-Thu and Ghanbari,6 the quality ratings given by subjects under the Absolute Category Rating (ACR) with hidden reference method are highly correlated with those given under the Double-Stimulus Continuous Quality Scale (DSCQS) method. ACR with hidden reference has therefore become the preferred method for conducting subjective assessments on 3L-video sequences, because it allows assessment of about four times as many test files as DSCQS while still providing information as accurate as DSCQS does. In our experiment, we adopted ACR with hidden reference as the assessment method. "Hidden reference" means that one of the chunk-level impairment settings was to leave the original video sequence unprocessed and include it directly in the final PVS set, without explicitly informing the viewers.

After reading the experiment instructions, viewers were required to watch 50 video clips and provide their subjective judgement of the video quality. The whole experiment was divided into two equal parts with a 5-minute break in between; in each part, viewers scored 25 video clips. The presentation order of the video clips was randomly generated and was the same for all viewers. Between each two consecutive video clips, a 10-second interval was reserved for the viewers to evaluate the previous clip. Viewers were required to give an integer score from 1 to 5 for each video clip, and all video clips were played back only once for each viewer. The score guideline given to the viewers in the instructions is shown in Table 5.

Score                5          4      3      2      1
Subjective feeling   Excellent  Good   Fair   Poor   Bad

Table 5. Subjective score guideline.

3.3.3. Viewers

A total of 30 non-expert viewers (16 males and 14 females) were invited to participate in the subjective experiment. All of them were university students, either undergraduate or postgraduate, and their ages were in the range 18-28. Before the subjective test, each viewer needed to fulfill the following screening requirements4:

1. Normal vision (20/20, verified by the Snellen test) or corrective glasses;

2. Normal color vision (verified by the Ishihara color test);

3. Not having been involved in a similar subjective video evaluation test in the past 4 months;

4. Familiarity with the language used, so as to understand the instructions and provide the necessary semantic judgements.

3.4. Subjective data analysis

From the experiment, we calculate the Mean Opinion Score (MOS) for each PVS by averaging all the integer scores given by the viewers. In addition, the hidden reference removal (HRR) process is applied to derive the Differential MOS (DMOS) value for each PVS. DMOS is a variation of MOS and is calculated by Eq. (1):3

$$\mathrm{DMOS} = \mathrm{MOS}_{cat.}(d) - \mathrm{MOS}_{cat.}(r) + 5. \quad (1)$$

In Eq. (1), $\mathrm{MOS}_{cat.}(d)$ is the MOS of a processed video sequence with average discontinuity $d$, where $cat.$ labels the video category it belongs to (out of the five types), and $\mathrm{MOS}_{cat.}(r)$ is the MOS of the unprocessed video sequence (hidden reference) belonging to the same category. By taking the difference between the MOS of a processed video sequence and that of the corresponding reference video, DMOS removes the bias in the subjective scoring process caused by individual preferences for particular video content. Although DMOS ranges from 1 to 9, it can be treated in the same way as MOS; the exception occurs when $\mathrm{DMOS} > 5$, meaning that the distorted video received a higher score than the reference, in which case it is treated as $\mathrm{DMOS} = 5$, the excellent-quality case.
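A minimal sketch of the HRR computation in Eq. (1), assuming `scores` maps a (category, discontinuity) pair to the list of integer opinion scores collected from the viewers, with d = 0.0 denoting that category's hidden reference; the names are ours, for illustration only.

```python
def mos(scores, category, d):
    """Mean Opinion Score: average of all viewers' integer scores."""
    values = scores[(category, d)]
    return sum(values) / len(values)


def dmos(scores, category, d):
    """Differential MOS per Eq. (1), with DMOS > 5 treated as 5."""
    value = mos(scores, category, d) - mos(scores, category, 0.0) + 5.0
    return min(value, 5.0)
```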

We plot the MOS and DMOS values of all the processed video sequences in Fig. 2. There are in total five hidden reference videos (one for each video category), appearing at the top left of Fig. 2(a) (the points with d = 0). As shown in Figs. 2(a) and 2(b), the results are very similar, and both reveal a clear negative correlation between the subjective evaluation and the average discontinuity, d. We believe that the effectiveness of the average (dis)continuity metric has been validated by the subjective experiment.

[Figure 2. Subjective assessment results for each processed video sequence versus average discontinuity d: (a) MOS values, with correlation coefficient R = -0.90146 and linear regression y = -5.619x + 4.6345 (norm of residuals 3.2933); (b) DMOS values, with R = -0.87342 and linear regression y = -5.2407x + 4.7645 (norm of residuals 3.0905).]

To explore the relationship between the chunk delay distribution and subjective QoE, we compared the processed video sequences with the two types of delay distribution through their MOS and DMOS values, shown in Fig. 3.

[Figure 3. Comparison between processed video sequences with the long and short delay distributions versus average discontinuity d: (a) MOS values, with linear regressions y = -4.2572x + 4.5246 (norm of residuals 1.6989) and y = -6.7742x + 4.6645 (norm of residuals 1.5925) for the long and short delay distributions, respectively; (b) DMOS values, with regressions y = -3.9561x + 4.6748 (norm of residuals 1.4789) and y = -6.5253x + 4.8543 (norm of residuals 1.5247), respectively.]

On average, processed video sequences with the long delay distribution (all delays equal to 3 seconds) obtain higher MOS and DMOS values than those with the short delay distribution. The implication is that viewers care more about the number of screen-freezing events than about the duration of each freezing event. Further study and analysis are needed to find out to what extent this preference can be quantified, and the tradeoff parameterized, so that it can be exploited in a practical P2P streaming system to improve overall performance.
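Objective 1 asked for a simple mapping scheme between the objective and subjective metrics; the fitted line in Fig. 2(a) provides one candidate. The sketch below (our own illustration, not part of the study) wraps the reported regression coefficients in a predictor clamped to the 1-5 opinion scale.

```python
def predict_mos(d, slope=-5.619, intercept=4.6345):
    """Predict MOS from average discontinuity d using the Fig. 2(a) fit."""
    return max(1.0, min(5.0, slope * d + intercept))

# Example: a stream with 10% average discontinuity maps to MOS ~ 4.07,
# i.e. between "Good" and "Excellent" on the Table 5 scale.
print(predict_mos(0.10))
```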

4. CONCLUSION

In this paper, we first proposed the chunk-level impairment model, which makes it possible to conduct controlled subjective assessment of 3L-videos distributed by P2P mechanisms over the Internet. Secondly, by applying this model, we carried out some preliminary subjective experiments on 3L-video sequences and achieved two goals: (1) we validated the effectiveness of the objective performance metric, average (dis)continuity, which has been widely used by people deploying or measuring practical P2P streaming systems; (2) we also discovered that processed video sequences with the long delay distribution, which experience fewer delay events, earn higher subjective scores than those with the short delay distribution and more delay events.

ACKNOWLEDGMENTS

We would like to thank Trevor Ng and Rachel Ching for letting us use their video player and providing helpful comments and suggestions. We also want to thank Steve Ip and Jason Wong for their help in designing the experiments and conducting the subjective tests.

REFERENCES

1. International Telecommunication Union, "ITU-R Recommendation BT.500-11: Methodology for the subjective assessment of the quality of television pictures," 2002.
2. International Telecommunication Union, "ITU-T Recommendation P.910: Subjective video quality assessment methods for multimedia applications," September 1999.
3. Video Quality Experts Group, "VQEG multimedia test plan," Nov. 2005. Latest version available at http://www.vqeg.org.
4. S. Winkler and R. Campos, "Video quality evaluation for Internet streaming applications," in Proceedings of SPIE Human Vision and Electronic Imaging, 5007, pp. 104-115, January 20-24, 2003.
5. Q. Huynh-Thu, M. Ghanbari, D. Hands, and M. Brotherton, "Subjective video quality evaluation for multimedia applications," in Proceedings of SPIE Human Vision and Electronic Imaging, 6057, pp. 60571D.1-60571D.11, January 16-18, 2006.
6. Q. Huynh-Thu and M. Ghanbari, "A comparison of subjective video quality assessment methods for low-bit rate and low-resolution video," in Proceedings of the IASTED International Conference on Signal and Image Processing, August 2005.
7. X. Zhang, J. Liu, B. Li, and T. S. P. Yum, "CoolStreaming/DONet: A data-driven overlay network for efficient live media streaming," in Proceedings of IEEE INFOCOM'05, March 2005.
8. Y. Zhou, D. M. Chiu, and J. C. S. Lui, "A simple model for analyzing P2P streaming protocols," in Proceedings of IEEE ICNP'07, October 2007.
9. Y. Huang, T. Z. J. Fu, D. M. Chiu, J. C. S. Lui, and C. Huang, "Challenges, design and analysis of a large-scale P2P-VoD system," in Proceedings of ACM SIGCOMM'08, 2008.
10. C. Wu, B. Li, and S. Zhao, "Multi-channel live P2P streaming: refocusing on servers," in Proceedings of IEEE INFOCOM'08, April 2008.
