About Subjective Evaluation of Adaptive Video Streaming

Samira Tavakoli (Universidad Politécnica de Madrid, Spain), Kjell Brunnström (Acreo Swedish ICT AB, Sweden; Mid Sweden University, Sweden), and Narciso García (Universidad Politécnica de Madrid, Spain)

ABSTRACT
The usage of HTTP Adaptive Streaming (HAS) technology by content providers is increasing rapidly. With the video content available in multiple qualities, HAS allows the quality of the downloaded video to be adapted to the current network conditions, providing smooth video playback. However, the time-varying video quality by itself introduces a new type of impairment. The quality adaptation can be done in different ways. In order to find the adaptation strategy that maximizes the users' perceptual quality, it is necessary to investigate the subjective perception of adaptation-related impairments. However, the novelty of these impairments and their comparably long duration make most of the standardized assessment methodologies less suited for studying HAS degradations. Furthermore, in traditional testing methodologies, the quality of the video in audiovisual services is often evaluated separately and not in the presence of audio. Jointly evaluating the audio and the video within a subjective test remains a relatively under-explored research field. In this work, we address the research question of determining an appropriate assessment methodology to evaluate sequences with time-varying quality due to adaptation. This was done by studying the influence of different adaptation-related parameters through two subjective experiments using a methodology developed to evaluate long test sequences. In order to study the impact of audio presence on the quality assessment by the test subjects, one of the experiments was done in the presence of audio stimuli. The experimental results were subsequently compared with another experiment using the standardized single-stimulus Absolute Category Rating (ACR) methodology.

Keywords: HTTP Adaptive Streaming, Perceptual Quality, Subjective Assessment, Audiovisual Quality
1. INTRODUCTION
Bandwidth fluctuations due to changing network conditions pose severe challenges to Internet video streaming and affect the quality of the delivered video. Buffering the content at the client side has been one of the approaches to overcome network resource limitations on a short time scale. Nevertheless, different temporal impairments such as initial delay (in cases where large playout buffers have to be filled initially) or stalling (i.e., playback interruption due to an empty playout buffer) are still possible. To cope with these problems, many content delivery networks such as Netflix and YouTube use HTTP Adaptive Streaming (HAS) over TCP. In this way, the quality of the delivered video is controlled entirely by the application, adapting to the current network conditions. HAS requires the video to be available on the server in a set of video representations (adaptation streams) encoded at multiple quality levels by adjusting the video compression, resolution, framerate, and/or audio compression. The representations are segmented into small chunks of a few seconds, according to the size of the GOPs (Groups of Pictures). This information is stored in a file (manifest) at the server. After filling the buffer at the initial bitrate, the video playout is started. The HAS client is able to automatically switch the video quality according to the available bandwidth: before downloading each chunk, it decides on the most suitable quality level based on the TCP throughput measured over the last chunk or the last few chunks. In this way the available bandwidth is utilized efficiently with a reduced risk of stalling. However, in terms of user-perceived quality, another impairment dimension is introduced instead: time-varying quality switching. The quality switching procedure (adaptation) can be performed by the client in different ways. There are several research works on the performance analysis of different adaptation solutions, which can be differentiated along technical and perception-based assessments targeting the users' Quality of Experience (QoE). Nevertheless, there is still no clear guideline on the impact of different switching strategies on the user's QoE. Results reported in [1] showed that quality switches in active adaptation are perceived as a QoE degradation in themselves.
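For illustration, the chunk-by-chunk decision logic described above can be sketched as follows. This is a minimal illustrative example in Python, not the algorithm of any specific HAS client; the bitrate ladder, the 0.8 safety margin, and the three-chunk throughput window are assumptions chosen for the example.

    # Minimal sketch of throughput-based HAS quality selection (illustrative
    # assumptions: bitrate ladder, 0.8 safety margin, harmonic-mean estimate).
    BITRATES_KBPS = [600, 1000, 3000, 5000]  # available representations

    def harmonic_mean(values):
        return len(values) / sum(1.0 / v for v in values)

    def select_quality(throughput_history_kbps, margin=0.8):
        """Pick the highest bitrate sustainable by recent TCP throughput."""
        if not throughput_history_kbps:
            return BITRATES_KBPS[0]  # start conservatively
        estimate = harmonic_mean(throughput_history_kbps[-3:])  # last few chunks
        budget = margin * estimate  # leave headroom to reduce stalling risk
        feasible = [b for b in BITRATES_KBPS if b <= budget]
        return max(feasible) if feasible else BITRATES_KBPS[0]

    # Example: measured throughput of the last chunks -> request 3000 kbps next
    print(select_quality([4200, 3900, 4100]))  # -> 3000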
In this regard there are different possible QoE influence factors that have been addressed in previous studies. Two key aspects are specifically emphasized when designing perceptually efficient adaptation strategies: the switching amplitude and the frequency of the switching. Regarding the switching frequency, the number of switches in each adaptation event* and the number of adaptation events occurring during the whole video playback could result in different perceptual quality for the user. Regarding the amplitude of the switch, the question can be raised whether there is any difference between gradual quality switching (i.e., stepwise change from the current to the target quality) and rapid quality change. Moreover, the perceptual influence of using different chunk lengths (different switching periods) at different switching frequencies and amplitudes should be taken into account. Studying the impact of these factors when increasing the quality (up-switching) and decreasing the quality (down-switching) is also important. Considering the switching frequency, results presented in [2-6] show that the frequency of adaptation should be kept as low as possible. On the other hand, the study presented in [3] shows that if the duration spent on a high quality level is sufficiently long, higher switching frequencies do not significantly degrade the QoE. Considering the amplitude of the switch, most of the previous studies [3-7] conclude that gradual multiple variations are preferred over rapid variations. Nevertheless, as highlighted in [4], this conclusion may not apply to scenarios where the difference between quality levels is very small. Although the aforementioned adaptation-related parameters have been tackled in previous studies, different research questions are still open. This is mostly due to shortcomings of the reported studies, such as missing information in the respective publications, evident limitations in the considered sets of test conditions, or methodological shortcomings in terms of how the tests were conducted [8]. Specifically, in regard to the latter point, because of the novelty of HAS degradations, the usage of traditional experimental methods to evaluate the HAS users' perceived quality is open to different objections. One of the common approaches to evaluate the impact of visual distortions on the user's perceived QoE is subjective assessment. Extensive research related to subjective studies of audiovisual quality has produced several assessment methodologies to obtain reliable results for the development of multimedia technologies. There are different international recommendations, such as ITU-R BT.500 [9] and ITU-T P.910 [10], provided by standardization organizations, which give guidelines to assess the quality of television pictures. However, the novelty of adaptive streaming technology points to the need for a new assessment methodology that allows representative conclusions to be drawn about the HAS visual experience, most notably to make it possible to study the impact that each rather long quality switch has on the overall QoE. For instance, one of the most common methodologies, single-stimulus Absolute Category Rating (ACR) [10], recommends the use of short test video sequences of around 15 seconds, after which the observers provide their ratings. However, with adaptive streaming, the quality will probably vary over periods of up to several minutes. Therefore, using longer test sequences is more appropriate to study these cases.
Also, as presented in [11], traditional testing methods may not accurately predict the perceptual quality, since the relative impact of the impairment types can change with the setting of the subjective test. This means it is not clear whether the perceptual quality of an individual stimulus (cf. an adaptation event) evaluated in the ACR manner would be the same as the perceived quality of the adaptation event when it occurs in a longer sequence. Another standardized methodology that may be more suitable for adaptive streaming evaluation is Single Stimulus Continuous Quality Evaluation (SSCQE) [9], since the observers provide instantaneous ratings of the video quality continuously during the sequence. However, the recency and hysteresis effects of human behavioral responses while evaluating time-varying video quality would lead to an unreliable evaluation through this methodology [12], because it is not clear how to aggregate the instantaneous quality ratings into an overall score or into per-event scores. Therefore, research on new methodologies more appropriate for evaluating the effects of adaptive streaming systems is required. Some novel assessment approaches have already been proposed in the literature. For example, in our previous work [13], a semi-continuous evaluation of different adaptive streaming strategies was made using long sequences (around 6 minutes) selected from content that is usually watched by viewers at home (e.g., movies, news, etc.). The aim of this methodology, first presented in [14], is to simulate realistic viewing conditions and engage the observers with the content as they would be in real life, rather than focusing on detecting impairments, which can happen with traditional methodologies using short and less entertaining test videos.
* An adaptation event denotes the period of video playback during which the quality switches from the current level to the target level.
In this sense, other approaches have been presented, such as the method proposed in [15] for immersive evaluation of audiovisual content, based on the use of long test stimuli to encourage the observers' engagement with the content and to simulate real situations of using audiovisual applications. In this work, not only are longer sequences recommended for evaluating video quality in a more realistic way, but also test sequences with audio (contrary to traditional standard recommendations). This recommendation makes sense, since video-only presentations poorly represent the user experience of an audiovisual application, as people rarely watch videos without sound. Another interesting study was presented in [3], analyzing the impact of flickering effects in adaptive video streaming to mobile devices. The authors used a non-standardized test environment and gave the observers the freedom to choose their location in the room, in order to simulate real-life viewing conditions. Taking into account the open research questions regarding the adaptation-QoE influence factors, in our previous study presented in [16] the impact of different switching-related parameters on different content types was assessed following the ACR methodology. In order to understand the impact of the assessment methodology on the test subjects' evaluations, two further experiments were subsequently carried out considering the same Processed Video Sequences (PVS) but using the evaluation method employed in [13]. This paper presents the comparison of the results of these experiments, obtained with the aforementioned methodology called Content-Immersive Evaluation of Transmission Impairments (CIETI). One of the experiments was done by presenting the video stimuli only, and the other in the presence of audio. The remainder of the paper is organized as follows. In Section 2 we describe the study factors and the experimental setup. In Section 3 the comparative results are presented, which are then discussed in Section 4. We conclude the paper in Section 5.
2. STUDY DESCRIPTION
Among the possible technical switching parameters, this study investigated the impact of the switching amplitude and the chunk size when decreasing and increasing the quality. Table 1 lists the twelve adaptation strategies considered as Hypothetical Reference Circuits (HRCs). For each status in which the client requests a lower-bitrate chunk (decreasing quality) or a higher-bitrate chunk (increasing quality), gradual and rapid switching (cf. the amplitude aspect) were simulated using 2-second and 10-second chunks. The chunk lengths were chosen to be in line with current HAS solutions. To study the perception of adaptation streams in different content, four HRCs were considered representing constant quality levels. Seven source videos (SRCs) of approximately six minutes in length were chosen from commercial content, as listed in Table 2. They were originally 1080p picture size and 24 or 50 fps. The spatial and temporal information of the content (SI and TI, respectively), as defined in [10], covered a large portion of the SI-TI plane. The adaptive streams were produced considering the encoding settings used in practice by video streaming companies for the living-room platform. Following the recommendation of [3], the compression domain was considered as the switching dimension. For each SRC, four quality levels (representations) were produced using Rhozet Carbon Coder with the settings summarized in Table 3.

Table 1: Test adaptation strategies (HRCs)

    Status               Possible behavior          HRC code
    Increasing quality   Gradually, 10 s chunk      IGR10
                         Gradually, 2 s chunk       IGR2
                         Rapidly, 10 s chunk        IRP10
                         Rapidly, 2 s chunk         IRP2
    Decreasing quality   Gradually, 10 s chunk      DGR10
                         Gradually, 2 s chunk       DGR2
                         Rapidly, 10 s chunk        DRP10
                         Rapidly, 2 s chunk         DRP2
    Constant             Whole PVS at 5 Mbps        N5
                         Whole PVS at 3 Mbps        N3
                         Whole PVS at 1 Mbps        N1
                         Whole PVS at 600 kbps      N600
Table 2: Source sequence (SRC) description

    Content Type   Description
    Movie1         Action, adventure, fantasy; some scenes with smooth motion, some with groups of walking people, some with camera panning
    Movie2         Drama, music, romance; mostly smooth motion against a static background, some scenes with groups of dancing people
    Movie3         Action, Sci-Fi, drama; rapid changes in some sequences, cloudy atmosphere
    Sport          Soccer, 2010 World Cup final; average motion, wide-angle camera sequences with uniform camera panning
    Documentary    Sport documentary with Spanish commentary; mostly handheld camera shooting
    Music          Music concert; high movement of the singer with some sudden scene changes
    News           Spanish news broadcast; some scenes with a static camera and one/two standing/sitting people, some outdoor scenes
It was assumed that the network bandwidth varies along these levels and that, consequently, the video quality is switched between these representations. To produce the test sequences, each SRC was segmented using Adobe Premiere Pro CS6. The segmented videos were subsequently used either as PVSs (by applying one of the HRCs to the segment) or as voting segments (VS) with no degradation. In this way, all HRCs were applied to the subsequent segments, interleaved with VSs. Because of the session time limitation, a full factorial design was not feasible. Therefore, only four out of the seven SRCs (Movie1, Movie2, Sport and Documentary) were used in two different variants to compare the relevant amplitude degradations (i.e., comparing GRx and RPx, cf. Table 1) and the constant-quality PVSs (i.e., comparing N3 and N5, as well as N600 and N1, cf. Table 1) in the same video segment. As a result, 132 PVSs were generated for evaluation. The length of the PVSs varied depending on the HRC: 40 seconds for those with quality switching using 10-second chunks, and 14 seconds for the rest of the HRCs. In our previous study, denoted as the 'Acreo' experiment [16], the PVSs were evaluated following the ACR methodology [10]. After the presentation of each PVS, the subjects were asked to evaluate it by answering two questions: the overall quality of the PVS (rated on the five-grade scale Bad (1), Poor (2), Fair (3), Good (4) and Excellent (5)), and whether they perceived any change in the quality (options: Increasing, No change, Decreasing). We followed up that study by evaluating the identical PVSs through two independent experiments in the Universidad Politécnica de Madrid's lab: one presenting only the video stimulus (denoted as the 'UPM-NoAudio' experiment), and the other in the presence of audio (denoted as the 'UPM-Audio' experiment).

Table 3: Transcoding parameters of quality levels

    Stream code   Framerate   Resolution   Target bitrate (kbps)
    Q1            24          720p         600
    Q2            24          720p         1000
    Q3            24          720p         3000
    Q4            24          720p         5000
    Video coding: H.264, high profile, closed GOP, disabled scene change detection; reference frames: 2, B frames: 2, CBR, adaptive QP. Audio coding: AAC, 192 kbps.

Figure 1: Test sequence structure in the CIETI methodology. Evaluation of each non-numbered segment (HRCi) is done during the subsequent numbered segment (VS).

Figure 1 shows an example of the test sequences used in the UPM experiments following the CIETI methodology. The first segment had no degradation, providing a reference of the video quality; a '0' printed in the lower right corner of its frames indicated the start of the test. The following segments are the sequence of PVS and VS pairs. The VS frames had a printed number indicating the preceding PVS, which was also the box number in the paper questionnaire to be rated by the test subjects. For the evaluation, the subjects were asked to answer the same questions as in Acreo, using the same scales. In the test session, the test sequences, each including 12 sequential PVS-VS pairs, were presented in a random order. One minute after the evaluation of a test sequence was finished, the next test video was played. The whole session was divided into three parts, with two breaks of about 10 minutes during which the subjects were encouraged to leave the test room to minimize the effect of fatigue on their evaluations. The total experiment lasted approximately 65 minutes. In order to allow for cross-lab comparison, the ambient conditions and all the hardware and software used at UPM were adjusted to be similar to Acreo, both complying with Recommendation ITU-R BT.500-11 [9]. A 46-inch Hyundai S465D display was used, with a native resolution of 1920x1080 and a 60 Hz refresh rate. The viewing distance was set to four times the display height. The TV's peak white luminance was 177 cd/m2 and the illumination level of the room was about 20 lux. The test sequences (PVSs in the Acreo experiment) were displayed in uncompressed format to ensure that all observers were presented with the same sequences. A computer connected to the TV was used to play the sequences using VLC. To avoid any temporal distortion introduced by the player, the videos were preloaded into the computer's RAM. The TV resolution was set to the resolution of the test videos (720p) to avoid scaling when displaying them. Prior to the test session, the test subjects were screened for visual acuity and color vision. They were then given the test instructions and the rating scale in their native language. After reading the instructions, a training session was conducted to familiarize the observers with the range of qualities, the quality variation (by showing some test sequence samples specially prepared for the training) and the test procedure. After post-screening of the subjective data in accordance with the latest recommendations from the Video Quality Experts Group [17], the scores of 21 observers (6 females/15 males, ages 27 to 50) in the UPM-Audio experiment and 22 observers (5 females/17 males, ages 24 to 54) in the UPM-NoAudio experiment were considered for the evaluation. In each experiment, about 80% of the subjects had a telecommunications background (engineer, researcher, etc.) and 4 to 6 of them had a subscription with a media service provider.
3. RESULTS
Two sets of data, comprising the scores for 'perceptual quality' and for 'quality switching detection', were collected from each experiment, and accordingly the mean opinion scores (MOS) and the 95% confidence intervals (CI) of their statistical distributions were calculated. To investigate the impact of the evaluation methodology on the observers' assessment, first the MOS of the two UPM experiments were compared, and then the result was compared with the Acreo experiment.
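For illustration, the aggregation of raw votes into MOS and 95% CI values can be sketched as follows. The vote vector is hypothetical, and the Student-t interval is the usual choice for this computation; the exact procedure used by the labs is not restated here.

    import numpy as np
    from scipy import stats

    def mos_and_ci(votes, confidence=0.95):
        """Mean opinion score and t-based confidence interval for one PVS."""
        votes = np.asarray(votes, dtype=float)
        mos = votes.mean()
        # Half-width of the confidence interval of the mean
        half = stats.t.ppf((1 + confidence) / 2, df=len(votes) - 1) * stats.sem(votes)
        return mos, half

    # Hypothetical votes from 21 observers on the 5-point ACR scale
    votes = [4, 3, 4, 5, 3, 4, 4, 2, 3, 4, 5, 4, 3, 4, 4, 3, 5, 4, 3, 4, 4]
    mos, ci = mos_and_ci(votes)
    print(f"MOS = {mos:.2f} +/- {ci:.2f}")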
3.1 UPM experiments: Impact of the audio presence
The MOS comparison of the Audio and NoAudio experiments, considering the arithmetic difference of the corresponding MOS for each PVS, is shown in Figure 2. It can be observed that for some PVSs with increasing quality (especially in Sport, Music, and Movie2, which included scenes with dancing people) the presence of audio had a positive impact of up to 0.9 MOS on the evaluation of the video quality. However, there are also a few PVSs in other content (such as News) that were perceived up to 0.6 MOS better in the NoAudio experiment.
Figure 2: Difference between the MOS in the Audio and NoAudio experiments (UPM)
Figure 3: Relationship between the Audio and NoAudio experiments (UPM)

By computing the Pearson linear correlation coefficient (PCC) between the MOS of the two experiments, it was observed that the results of the two experiments follow a similar trend (PCC = 93%). Figure 3 shows this relationship, with the results of the NoAudio experiment on the x axis and the results of the Audio experiment on the y axis. Considering the solid line as the main diagonal (indicating the ideal case in which both data sets would match each other) and the dashed line as the regression mapping of the Audio experiment data to NoAudio, it can be observed that the data deviate slightly above the reference, which indicates the higher scores given in the Audio experiment. In addition, a slightly larger span in the MOS of the NoAudio experiment (from 1.3 to 4.7) compared to the Audio one (from 1.5 to 4.6) was observed. To explore the significance of the difference between the results of the two experiments, a repeated measures ANalysis Of VAriance (ANOVA) was performed, with the scores as the dependent variable, the experiment as a between-subjects factor, the HRCs as within-subjects factors, α = 5% as the level of significance, and the null hypothesis that the main effect of the experiment is not significant. The ANOVA resulted in acceptance of the null hypothesis, which means no significant difference in the main effect of the audio presence (p = 0.63). The Tukey HSD post-hoc test was performed for the PVS multiple comparisons. The purpose of a multiple-comparisons procedure is to control the overall significance level for a set of inferences performed as a follow-up to ANOVA; the overall significance level, or error rate, is the probability, conditional on all the null hypotheses being tested being true, of rejecting at least one of them. The result of the Tukey test also showed that no single PVS was perceived significantly differently in the two experiments. The result of numerically averaging the two data sets is denoted as 'UPM' in the rest of the paper.
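For illustration, the correlation and post-hoc comparison steps can be reproduced along the following lines with scipy and statsmodels (not necessarily the tools used in the study); the MOS and vote vectors are hypothetical placeholders.

    import numpy as np
    from scipy.stats import pearsonr
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    # Hypothetical per-PVS MOS vectors from the two experiments
    mos_audio   = np.array([4.1, 3.2, 2.5, 4.4, 3.0, 1.9])
    mos_noaudio = np.array([3.9, 3.1, 2.7, 4.2, 2.8, 2.0])

    pcc, _ = pearsonr(mos_audio, mos_noaudio)
    print(f"PCC = {100 * pcc:.0f}%")

    # Tukey HSD comparison of raw votes between the two experiments
    # (vote arrays here are hypothetical placeholders), alpha = 0.05
    votes  = np.concatenate([[4, 3, 4, 5, 3], [3, 3, 4, 4, 3]])
    groups = ["Audio"] * 5 + ["NoAudio"] * 5
    print(pairwise_tukeyhsd(votes, groups, alpha=0.05))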
3.2 UPM vs. Acreo: Impact of the evaluation methodology
Table 4 presents the correlation between each pair of experiments, which has its lowest value for the Audio and Acreo pair. Considering the MOS values of the individual contents evaluated in the different experiments, presented in Figure 4, it can be observed that the subjects' judgment of the different contents varied even within the same experiment. This was most pronounced in the comparison between the results of the Audio and Acreo experiments, where the perceived quality of some test conditions in the Documentary content was significantly lower in the Audio experiment. Moreover, the correlation between these two experiments differs considerably between the two variants of this content: for instance, the MOS difference between the two experiments for DGR2 is -0.9 in one variant but +0.3 in the other. One possible reason could be the different objective characteristics of the segments to which the test condition was applied.

Table 4: Experiment-to-experiment correlation (%) considering all the content

    Experiment   Audio   NoAudio   Acreo
    Audio        100     93        86
    NoAudio      93      100       91
    Acreo        86      91        100
Figure 4: Correlation between subjective scores for different content obtained from the three experiments. The lowest correlation in each comparison belongs to Documentary (cf. Table 2).
Figure 5: Relationship between the Acreo and UPM experiments
Figure 6: MOS comparison of the PVSs with the largest deviation (combined UPM vs. Acreo)
Figure 7: Observers' vote distribution

Figure 5 shows the correlation between the UPM and Acreo studies. In spite of some PVSs with relatively large MOS differences, the PCC between the two studies was 90%. The repeated measures ANOVA was applied to the two data sets with the same settings as in the previous part, and the result revealed no statistically significant difference in the main effect of the evaluation methodology (p = 0.31). Figure 6 presents the PVSs with the largest MOS deviations between the two data sets. Some (visually) notable differences for some of the PVSs can be observed. However, the pairwise Tukey test, comparing the Acreo data first with the individual UPM experiments and then with the combined UPM data, showed that no single PVS was perceived significantly differently in any of the cases. As an alternative method, the Newman-Keuls test was also performed, which resulted in one PVS (Doc-DRP10) being significantly different between UPM and Acreo. The reason for the differing results is that these tests strike different balances between the power to find an effect and the protection against Type I errors (i.e., finding a difference when none exists). The advantage of the Tukey test is that it keeps the Type I error rate equal to the chosen α, whereas this probability cannot be computed for the Newman-Keuls test. Therefore, we considered the Tukey result more reliable. Considering the distribution of the votes in the two studies, shown in Figure 7, it was observed that the usage of the voting scale differed between the two labs. After applying a linear transformation to align the voting scales of the two labs (Acreo realigned data = (original Acreo data - 0.98) / 0.73), the difference almost vanished, making it possible to combine the results of the two studies into a single dataset.
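For illustration, such a linear realignment can be fitted by least squares as sketched below; the score vectors and the resulting coefficients are illustrative, not the 0.98 and 0.73 values reported above, which come from the actual experimental data.

    import numpy as np

    # Hypothetical per-PVS MOS from the two labs (not the study's data)
    acreo = np.array([1.8, 2.6, 3.1, 3.9, 4.5])
    upm   = np.array([1.2, 2.3, 3.0, 4.0, 4.8])

    # Least-squares linear mapping: upm ~ gain * acreo + offset
    gain, offset = np.polyfit(acreo, upm, deg=1)

    # Realign the Acreo scores onto the UPM scale before pooling the data sets
    acreo_realigned = gain * acreo + offset
    print(f"gain = {gain:.2f}, offset = {offset:.2f}")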
4. DISCUSSION
Statistically speaking, the different experiments gave similar results, so based on the current experiments we cannot conclude that these methods are different. One reason could be that the number of comparisons, in combination with the number of test subjects, makes the statistical power quite low. Although the number of subjects was based on what is recommended and normal for this type of experiment, the number of PVSs was quite high, leading to a large number of comparisons. Having a large number of comparisons increases the risk of making a Type I error, i.e., finding an effect that does not really exist. Considering an estimation based on the Bonferroni correction method [18], the significance level (α) should be divided by the number of comparisons. If we consider the case of planned hypothesis testing, which is to study whether each PVS is rated the same in the two experiments, then the number of comparisons is the same as the number of PVSs. In this case the α value should be divided by 132 (the number of PVSs), which means that for α = 0.05, the p-value per comparison should be less than 0.0004. On the other hand, if the test is not planned, all possible combinations of comparisons should be considered. The number of comparisons then increases to 132x132 = 17424, so that for a Bonferroni correction α should be divided by this number instead, giving a per-comparison α of about 0.000003. We believe the post-hoc test used in the current study (Tukey HSD) is more efficient than that, but this gives the order of magnitude. Furthermore, some suspicious cases were found, as pointed out in the results section: for instance, the Documentary content was rated quite differently, so caution should be taken, in particular when considering different content.
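The per-comparison thresholds quoted above follow directly from dividing α by the number of comparisons, as this short check shows:

    alpha, n_pvs = 0.05, 132
    print(f"planned comparisons:  {alpha / n_pvs:.4f}")     # ~0.0004
    print(f"all pairs considered: {alpha / n_pvs**2:.6f}")  # ~0.000003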
5. CONCLUSIONS
The goal of this study was to investigate a subjective assessment methodology to evaluate sequences with time-varying quality due to adaptation. To this aim, we ran two experiments using a test methodology suggested by Gutierrez et al. [14] for evaluating long sequences (CIETI), which makes it possible to mimic the characteristics of the aforementioned degradations. To study the impact of audio on the evaluation of the test stimuli, one of the experiments was done in the presence of audio. It was observed that for some of the content, especially that including music and dancing people, the presence of audio had a positive effect on the evaluation of the visual quality. This indicates that visual quality degradations may be less annoying when evaluating audiovisual stimuli rather than video-only stimuli. Nevertheless, the high Pearson linear correlation coefficient between the two experiments, as well as the non-significant results obtained from the ANOVA and the post-hoc test, verified that the two experiments can be combined. Moreover, no large difference in the range of the MOS values of the two studies was observed (Figure 7). This finding differs from what was discussed in [15] regarding the impact of changing from video-only stimuli to audiovisual stimuli on our ability to distinguish between HRCs. Comparing the combined result of the CIETI experiments with the results of Acreo, our previous study using the ACR assessment methodology, indicated a relatively high correlation (0.9). The results of the ANOVA and post-hoc test also showed a non-significant difference between the two experimental methodologies. As mentioned, some differing MOS were obtained from the different methodologies, which could also be influenced by the characteristics of the content. The positive impact of audio presence on the evaluation of some of the PVSs was also observed. Nevertheless, the statistical analysis did not show any significant difference between the three experiments. This does not rule out that the audio presence or the different methods can lead to different results; but with the number of subjects involved and the statistical variance in the data, the differences in the MOS values of the same PVSs were not large enough to reach statistical significance. Therefore, this would need to be investigated further with a larger number of test subjects in order to draw statistical inferences about the influence of audio presence and the difference between the evaluation methodologies.
ACKNOWLEDGMENTS
The work at UPM has been partially supported by the Ministerio de Economía y Competitividad of the Spanish Government under projects TEC2010-20412 (Enhanced 3DTV) and TEC2013-48453 (MR-UHDTV). The work at Acreo Swedish ICT AB was supported by VINNOVA (Sweden's innovation agency) and EIT ICT Labs, which are hereby gratefully acknowledged.
REFERENCES
1. B. Lewcio, B. Belmudez, A. Mehmood, M. Wältermann, and S. Möller, "Video Quality in Next Generation Mobile Networks - Perception of Time-varying Transmission," in Proceedings of the IEEE International Workshop on Communications Quality and Reliability (CQR'11), pp. 1-6, May 2011.
2. M. Zink, J. Schmitt, and R. Steinmetz, "Layer-encoded Video in Scalable Adaptive Streaming," IEEE Transactions on Multimedia 7(1), pp. 75-84, 2005.
3. P. Ni, R. Eg, A. Eichhorn, C. Griwodz, and P. Halvorsen, "Flicker Effects in Adaptive Video Streaming to Handheld Devices," in Proceedings of the 19th ACM International Conference on Multimedia, MM '11, pp. 463-472, 2011.
4. A. K. Moorthy, L. Choi, A. C. Bovik, and G. de Veciana, "Video Quality Assessment on Mobile Devices: Subjective, Behavioral and Objective Studies," IEEE Journal of Selected Topics in Signal Processing 6(6), pp. 652-671, 2012.
5. R. K. P. Mok, X. Luo, E. W. W. Chan, and R. K. C. Chang, "QDASH: A QoE-aware DASH System," in Proceedings of the 3rd Multimedia Systems Conference, MMSys '12, pp. 11-22, 2012.
6. L. Yitong, S. Yun, M. Yinian, L. Jing, L. Qi, and Y. Dacheng, "A Study on Quality of Experience for Adaptive Streaming Service," in Proceedings of the IEEE International Conference on Communications (ICC), pp. 682-686, June 2013.
7. M. Zink, O. Künzel, J. Schmitt, and R. Steinmetz, "Subjective Impression of Variations in Layer Encoded Videos," in Proceedings of the 11th International Conference on Quality of Service, pp. 137-154, 2003.
8. M.-N. Garcia, F. De Simone, S. Tavakoli, N. Staelens, S. Egger, K. Brunnström, and A. Raake, "Quality of Experience and HTTP Adaptive Streaming: A Review of Subjective Studies," in Proceedings of the 6th International Workshop on Quality of Multimedia Experience (QoMEX'14), pp. 141-146, Sept 2014.
9. ITU-R, "Methodology for the Subjective Assessment of the Quality of Television Pictures," ITU-R Recommendation BT.500, Jan 2012.
10. ITU-T, "Subjective Video Quality Assessment Methods for Multimedia Applications," ITU-T Recommendation P.910, Apr 2008.
11. N. Staelens, S. Moens, W. Van den Broeck, I. Marien, B. Vermeulen, P. Lambert, R. Van de Walle, and P. Demeester, "Assessing Quality of Experience of IPTV and Video on Demand Services in Real-Life Environments," IEEE Transactions on Broadcasting 56, pp. 458-466, Dec 2010.
12. C. Chen, L. Choi, G. de Veciana, C. Caramanis, R. Heath, and A. Bovik, "Modeling the Time Varying Subjective Quality of HTTP Video Streams with Rate Adaptations," IEEE Transactions on Image Processing 23, pp. 2206-2221, May 2014.
13. S. Tavakoli, J. Gutierrez, and N. Garcia, "Subjective Quality Study of Adaptive Streaming of Monoscopic and Stereoscopic Video," IEEE Journal on Selected Areas in Communications 32, pp. 684-692, April 2014.
14. J. Gutierrez, P. Perez, F. Jaureguizar, J. Cabrera, and N. Garcia, "Subjective Assessment of the Impact of Transmission Errors in 3DTV Compared to HDTV," in Proceedings of the 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video (3DTV-CON), pp. 1-4, May 2011.
15. M. Pinson, M. Sullivan, and A. Catellier, "A New Method for Immersive Audiovisual Subjective Testing," in Proceedings of the 8th International Workshop on Video Processing and Quality Metrics for Consumer Electronics (VPQM), Jan 2014.
16. S. Tavakoli, K. Brunnström, K. Wang, B. Andrén, M. Shahid, and N. Garcia, "Subjective Quality Assessment of an Adaptive Video Streaming Model," in Proceedings of SPIE Image Quality and System Performance XI, vol. 9016, paper 90160K, 2014.
17. VQEG, "Report on the Validation of Video Quality Models for High Definition Video Content," Video Quality Experts Group, available: www.vqeg.org, June 2010.
18. S. E. Maxwell and H. D. Delaney, Designing Experiments and Analyzing Data: A Model Comparison Perspective, 2nd ed., Routledge Academic, 2003.