AT HOME IN THE LAB: ASSESSING AUDIOVISUAL QUALITY OF HTTP-BASED ADAPTIVE STREAMING WITH AN IMMERSIVE TEST PARADIGM

Werner Robitza*, Marie-Neige Garcia†, Alexander Raake†

*Deutsche Telekom AG, †Technische Universität Berlin
Telekom Innovation Laboratories, Berlin, Germany

ABSTRACT

In this paper, we assess the audiovisual quality of HTTP Adaptive Streaming using a novel subjective test design. The goal of this test was to systematically study the impact of both quality variations and stalling events on remembered quality. To gather more ecologically valid results, we wanted to reach a degree of test subject engagement and attention closer to real-life video viewing than what is achieved in traditional lab tests. To this end, we used a novel test design method: the "immersive" paradigm, in which subjects never see the same source stimulus twice. A total of 66 source clips of one minute length, selected from online video services, were shown. Together with qualitative results obtained from questionnaires, we can confirm previously reported effects, such as the impact of quality switching frequency. We also present new findings on the interaction of stalling events and quality drops. Finally, the contribution highlights that a long-duration test of over one hour is feasible using the immersive paradigm while keeping subjects entertained.

Index Terms— Quality of Experience, HTTP Adaptive Streaming, Video Quality, Subjective Experiments

1. INTRODUCTION

HTTP Adaptive Streaming (HAS) has grown to become one of the most frequently used technologies for delivering audiovisual content over TCP/IP-based networks. It relies on HTTP as the data transfer method and thus leverages existing network and especially server-side architecture. Media is transmitted in segments of several seconds in length, which are sequentially downloaded. Depending on the currently available network throughput, the client chooses from multiple representations of the media, which reduces the possibility of buffer depletion, and therefore of visible stalling events. When the client switches from one representation to another, a sudden change in audiovisual quality may be noticeable. Whether such a switch can be perceived at all depends on multiple factors, such as the quality difference between representations or the content itself (both due to its semantic meaning and its visual aspects).
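As a minimal illustration of this client-side adaptation logic (a sketch of the general principle, not the algorithm of any particular player), the following Python snippet selects a representation for the next segment from an estimated throughput and the current buffer level, using the three bitrates of the adaptation set later defined in Section 3.3; the safety margin and buffer threshold are assumptions made for the example.

REPRESENTATIONS_KBPS = [500, 2000, 10000]  # bitrate ladder of this test (QL 1-3)

def choose_representation(throughput_kbps, buffer_s, min_buffer_s=10.0, safety=0.8):
    """Pick the highest representation whose bitrate fits the measured
    throughput (with a safety margin); fall back to the lowest one when
    the buffer runs low, to reduce the risk of stalling."""
    if buffer_s < min_buffer_s:
        return REPRESENTATIONS_KBPS[0]
    affordable = [r for r in REPRESENTATIONS_KBPS if r <= safety * throughput_kbps]
    return max(affordable) if affordable else REPRESENTATIONS_KBPS[0]

# Example: 3 Mbit/s estimated throughput and 20 s of buffered media -> 2000 kbit/s.
print(choose_representation(3000, 20.0))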

Also, while short-term (i.e., around ten seconds) models for instrumental quality prediction exist, the question remains how the overall quality of an entire session (of several minutes in length) is formed. Previous research has not been conclusive on these issues.

In this paper, we present the results of a subjective quality test which specifically aimed at identifying the impact of HAS-typical degradations on perceived audiovisual quality. The test design differs from what is typically recommended for multimedia quality testing, where users only rate shorter stimuli, with repeating contents (ITU-R Rec. BT.500-13 and ITU-T Rec. P.910). Our motivation for deviating from a traditional design is explained in Section 2, together with a review of existing literature. A detailed description of the test follows in Section 3. The results are presented in Section 4, where we emphasize not only the quality ratings themselves, but also the impact of user characteristics on perceived quality. We critically compare the method to more traditional ones and discuss its benefits and drawbacks in Section 5. Section 6 concludes this paper.

2. MOTIVATION

Recent surveys [1, 2] have identified several key factors determining the QoE of HAS services, based on an extensive literature review. The authors report that, for example, gradual variations in quality are preferred over abrupt variations. The frequency of representation switches also negatively impacts the experienced quality. In general, people seem to prefer constant over varying quality, unless the average quality is too low; in that case, any switch to a higher quality is better. It is also generally agreed that any kind of stalling must be avoided. Since research has been mostly conclusive in these regards, we focus our present work on other challenges. The authors of [1] not only list problems regarding test design, collected data and data presentation, but also mention scientific questions that have not yet been addressed. Notably, the "combined quality-impact due to initial delay, stalling events and quality switches that all occur in one sequence" has not been studied so far. Initial delay has been shown to have hardly any impact on QoE, since users have learned to adapt to it and it does not interrupt the flow of a presentation [3, 4].

In the process of developing a QoE monitoring model for HAS, we designed several quality tests, one of which is described in this paper. The monitoring model is developed so as to more faithfully reflect the quality perceived by real users of video services. The test presented in this paper systematically studies the impact of the number, duration and depth of quality drops, the combined effects of stalling and quality switches, the temporal location of stalling events and quality switches, and the impact of a low starting bitrate. At a later stage, its results will be combined with those from similar tests to provide a broader picture. For the design, we had to weigh the number of conditions against subject exhaustion. In the next section, we describe our approach to tackling this issue in detail.

3. SUBJECTIVE TEST DESCRIPTION

3.1. The Immersive Test Design

When designing the test, we faced a few challenges. Our aim was to stick to the traditional laboratory context for testing, so as to eliminate as many confounding variables as possible. To still achieve a certain level of engagement, a first design element was to ensure that users see realistic content, long enough to enable subject immersion and to include some adaptation and stalling events. This required the use of entertaining sequences with a length of at least one minute, rather than samples of a few seconds. A second design element relates to the question of how to reliably test the remembered quality of an entire sequence. The only standardized method aimed at long sequences, SSCQE from ITU-R Rec. BT.500-13, requires the user to focus on both rating and watching at the same time. In another test methodology, users see periods of varying quality and then rate them during a non-degraded period [5]. These approaches deliver more instantaneous feedback on quality variations, but they are intrusive and may prevent the user from fully immersing into the content. Since remembered rather than instantaneous quality was the focus of our tests, these methods were not applicable. For even longer contents, the authors of [6] asked participants to rate the quality after seeing an entire movie. The difference from traditional tests was that subjects were not informed of the quality rating task until they had completed watching the movie. This method has the benefit of subjects not being primed for quality testing, and it has been shown to give significantly different results compared to similar tests done in a lab where subjects knew that they were to rate the quality. However, this method can only be used once per participant, since they could be biased after knowing what to look for. It also does not allow testing many conditions, and hence was unsuitable for the targeted test. We therefore used a variation of the immersive test design, as described in [7].

Its main difference from traditional quality testing lies in the way in which test conditions (Hypothetical Reference Circuits, HRCs) are applied to source sequences (SRCs). In typical tests, for example according to standard protocols (e.g., BT.500-13, P.910), a low number of SRCs s is treated with a similar (or larger) number of HRCs h, resulting in h × s processed (video) sequences (PVSs). A subject then has to rate all PVSs, which results in a repetition of content. In an immersive design, however, there is no such repetition. In [7], the authors describe how to partition the PVSs among the subjects in such a way that every HRC is rated by all subjects, but not all subjects see every PVS. The rationale for such a design is that, without a repetition of content, subjects stay immersed in the viewing/listening task and do not primarily focus on the "technical quality", which is expected to occur when a sequence is seen multiple times and subjects get bored of the repetition. We used 3 SRCs with every given HRC (e.g., HRC A uses SRCs 1+2+3; HRC B uses SRCs 4+5+6, etc.). In our adoption of the immersive design, a subject is presented with all SRCs and HRCs (i.e., all s PVSs). The overall test duration and the number of SRCs per HRC critically limit the number of HRCs that can be tested. In our case, the design goal was to maximize this number.

3.2. Source Content

For our HAS test, we wanted to show sequences of several minutes in length, which are typically not available through commonly used public video databases. Moreover, standardized test sequences are rarely entertaining. We therefore used online video services to find material. Since such video material is already compressed, we only used sequences that were 1) available in UHDTV resolution (3840 × 2160) and 2) encoded at more than 0.05 bits per pixel, a lower-bound value otherwise only used for lower-quality representations. We manually checked that after rescaling to 1920 × 1080 during encoding, the final quality would still be considered excellent. The SRCs were cut from the original material so as to be most natural to watch, for example by ensuring that the end does not cut a sentence in the middle or interrupt a scene. We tried to keep the cut SRCs within the bounds of 1 min ± 10 s (which was not possible for 6 of the SRCs, where it was not logical to cut earlier/later).

3.3. Test Parameters and Procedure

Our test uses a bitrate-scaling adaptation set, with SRC video encoded at a resolution of 1920 × 1080. The three quality levels (QL) are at 10, 2 and 0.5 Mbit/s, respectively. The audio bitrate was kept constant at 128 kbit/s. A total of 66 SRCs of 1 minute length were used. We designed 22 HRCs, which results in 3 different SRCs for each condition. We made sure that the quality levels were perceivably different, and that in each SRC/HRC combination the quality change was visible. The conditions of the test are presented in Fig. 1. A single segment is 5 s long.
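As an illustration of this assignment (our own sketch, not the actual scripts used for the test), the 66 SRCs can be partitioned over the 22 HRCs so that each HRC is applied to exactly three different SRCs and no source content is ever repeated for a subject:

NUM_HRCS = 22
SRCS_PER_HRC = 3

def build_pvs_list():
    """Assign SRC IDs 1..66 to HRCs 1..22, three SRCs per HRC,
    e.g. HRC 1 -> SRCs 1, 2, 3; HRC 2 -> SRCs 4, 5, 6; and so on."""
    pvs = []  # (hrc_id, src_id) pairs; every SRC is used exactly once
    src_id = 1
    for hrc_id in range(1, NUM_HRCS + 1):
        for _ in range(SRCS_PER_HRC):
            pvs.append((hrc_id, src_id))
            src_id += 1
    return pvs

pvs_list = build_pvs_list()
assert len(pvs_list) == 66                      # each subject rates all 66 PVSs
assert len({src for _, src in pvs_list}) == 66  # no source content is repeated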

Fig. 1: Conditions (HRC IDs to the left, grey background = stalling event, numbers in parentheses = duration in seconds)

Fig. 2: MOS for all PVS, in ascending HRC MOS order (x-axis: HRC IDs with their HRC MOS in parentheses: 3 (1.64), 16 (2.09), 10 (2.21), 17 (2.24), 18 (2.36), 12 (2.37), 20 (2.38), 5 (2.54), 19 (2.56), 11 (3.12), 14 (3.23), 6 (3.26), 13 (3.31), 15 (3.4), 8 (3.44), 22 (3.67), 4A (3.77), 2 (3.81), 21 (3.81), 4 (4.1), 9 (4.17), 7 (4.49), 1 (4.74), 14A (4.85); point labels = SRC IDs)

Fig. 1 illustrates the systematic approach to the condition design: we varied the length of individual quality drops, their depth, and the number of quality drops, keeping the same overall length. The conditions include combinations of stalling and quality drops, as well as low-quality starts. Since the patterns partly overlap, it is possible to analyze the perceptual impact of each condition. Degradations were not located in the last 15 s, in order to avoid strong recency effects that could interfere with the systematic questions addressed by the conditions.

PVS Generation: To encode the original sequences and generate the adaptation set, we used ffmpeg with 2-pass constant-bitrate x264 to generate High-profile H.264 video and libfdk-aac to generate AAC-LC audio. The GOP length was 1 s and scene cut detection was disabled. For generating the PVSs, we recombined the 5 s segments into a new sequence according to the adaptation condition. Finally, for sequences with stalling conditions, a spinning "rebuffering" indicator was inserted for the respective amount of time.
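The paper does not give the exact command lines; the following Python sketch shows one plausible way to produce the three representations with ffmpeg under the settings described above (2-pass constant bitrate, High-profile H.264, 1 s GOP, scene-cut detection disabled, AAC-LC via libfdk_aac), assuming 25 fps source material, an ffmpeg build that includes libfdk_aac, and hypothetical file names.

import subprocess

QUALITY_LEVELS_KBPS = {1: 500, 2: 2000, 3: 10000}  # QL 1-3 of the adaptation set
FPS = 25  # assumption; a 1 s GOP then corresponds to a keyframe interval of 25

def encode(src, out, ql):
    bitrate = f"{QUALITY_LEVELS_KBPS[ql]}k"
    common = ["-vf", "scale=1920:1080", "-c:v", "libx264", "-profile:v", "high",
              "-b:v", bitrate, "-minrate", bitrate, "-maxrate", bitrate,
              "-bufsize", bitrate, "-g", str(FPS), "-sc_threshold", "0"]
    # Pass 1: analysis only, audio and output discarded.
    subprocess.run(["ffmpeg", "-y", "-i", src, *common,
                    "-pass", "1", "-an", "-f", "null", "/dev/null"], check=True)
    # Pass 2: final encode with AAC-LC audio at 128 kbit/s.
    subprocess.run(["ffmpeg", "-y", "-i", src, *common,
                    "-pass", "2", "-c:a", "libfdk_aac", "-b:a", "128k", out], check=True)

encode("src01.mp4", "src01_ql3.mp4", ql=3)  # hypothetical file names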

Note that we did not generate freezing pictures without spinning indicators, as those degradations are primarily caused by lags in the playback software, the analysis of which is beyond the scope of this study.

Test Environment: Except for the immersive design and the duration of the stimuli, we conformed to the recommendations for a traditional single-stimulus test as defined in ITU-T Rec. P.910. A 42" professional LCD screen was used as the display, positioned at a distance of 3H (three times the height of the display) from the subjects. The audio was presented on reference-class headphones at 73 dB SPL.

Rating Scales and Questionnaires: After each sequence, we asked subjects to "rate the audiovisual quality" on the Absolute Category Rating (ACR) scale (Bad, Poor, Fair, Good, Excellent), with additional numeric labels from 1–5 next to the categories. To avoid fatigue, the subjects were asked to take two 5-minute breaks after 1/3 and 2/3 of the test. In addition, before and after the test, we let subjects fill out questionnaires

to ask about their experiences with the test itself and their opinion of the degradations they had seen.

4. RESULTS

A total of 30 people participated in the test. We removed 4 subjects, determined by computing, for each subject, the Pearson correlation r between their individual PVS ratings and the corresponding MOS values. If r < 0.7 and unwanted behaviour such as partial use of the scale was detected, the subject's ratings were discarded. The remaining subjects were aged from 21 to 56 years (average 31 years), with 16 females and 10 males, the majority of them (21) being students.
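A minimal sketch of this screening step, assuming a ratings matrix of shape (subjects × PVSs); the 0.7 threshold is the one stated above, the data below are placeholders, and the additional check for partial use of the scale is not shown.

import numpy as np

def screen_subjects(ratings, r_threshold=0.7):
    """ratings: array of shape (n_subjects, n_pvs) with ACR scores 1-5.
    Returns the indices of subjects whose ratings correlate with the
    per-PVS MOS with Pearson r >= r_threshold."""
    mos = ratings.mean(axis=0)  # MOS per PVS over all subjects
    keep = []
    for i, subject_scores in enumerate(ratings):
        r = np.corrcoef(subject_scores, mos)[0, 1]
        if r >= r_threshold:
            keep.append(i)
    return keep

# Placeholder data: 30 subjects, 66 PVSs, random ratings just to run the function.
rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(30, 66))
print(screen_subjects(ratings))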


4.1. Quality Assessment

First, we look at the Mean Opinion Score (MOS, i.e., the average rating per PVS) as a function of the conditions and sources. Fig. 2 shows the PVS MOS (including the 95% confidence interval) for the three SRCs per HRC. Because some SRCs were shorter than one minute, the PVS with HRC 4 and SRC 33 erroneously ended directly after the segment with QL 1; we subsequently denote it as HRC 4A. For the combination of HRC 14 with SRC 25, no stalling was inserted by error, making it equivalent to HRC 1; it is referred to as HRC 14A.

4.1.1. Impact of Quality Drops and Fluctuations

One of the features that distinguish HAS from other streaming methods is the expected variability in audiovisual quality. We included three kinds of quality variations: drops (HRCs 4–10, 13), frequency (11, 12) and low-quality starts with ramp-up (21, 22). The results from this test should therefore also help in finding rules for client-switching behavior. Fig. 3 shows an overview of the relevant HRCs for drops and their frequency. Here, we look at the following parameters: the length of the drop(s), their depth (i.e., how many QLs), the location(s) (beginning vs. end) and the number of drops (w.r.t. frequency).

Length of Drops: HRCs with only one quality drop, but of two levels (QL 3 → 1), are HRCs 4, 6/13 and 8. For these conditions, we expected a negative tendency of the MOS from left to right, because the overall length of the drop increases. However, the MOS slightly increases again for HRC 8. This is due to SRCs 5 and 18 (see Fig. 2), which exhibit lower spatio-temporal complexity, making the quality drop less visible than for SRC 51. For HRCs 7/9 and 5/10, the MOS values decrease with drop length, as expected. The quality shift between these pairs is explained by the "base" QL, which is 1 for HRCs 7/9 and 2 for HRCs 5/10.

Frequency: Comparing HRC 11 with 6 and HRC 12 with 8, despite the overall time spent at each QL being the same (15 s and 25 s), the MOS is negatively impacted by the number of drops. For HRCs 8 and 12, the difference is significant.
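As an illustration of how such per-condition values and comparisons can be computed (a sketch with placeholder ratings; the paper does not state which significance test was used, so a Welch t-test is shown purely as an example):

import numpy as np
from scipy import stats

def mos_with_ci(scores, confidence=0.95):
    """Mean opinion score and half-width of the 95% confidence interval."""
    scores = np.asarray(scores, dtype=float)
    mos = scores.mean()
    half_width = stats.sem(scores) * stats.t.ppf((1 + confidence) / 2, len(scores) - 1)
    return mos, half_width

# Placeholder ratings for two conditions (e.g., HRC 8 vs. HRC 12); real data
# would be the pooled subject ratings of the three PVSs per HRC.
hrc_8 = [4, 3, 4, 3, 4, 4, 3, 4]
hrc_12 = [3, 2, 3, 2, 3, 3, 2, 3]
print(mos_with_ci(hrc_8), mos_with_ci(hrc_12))
print(stats.ttest_ind(hrc_8, hrc_12, equal_var=False))  # Welch's t-test, example only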

Fig. 3: Impact of quality drops with varying position, depth and length (MOS over length per drop in segments; legend: count of drops = 1, 3, 5; depth = 1 or 2 QLs; point labels = HRC IDs)

This confirms results reported in [1], but contradicts another recent study [8], where no significant impact of switching frequency was found. The authors attributed this to their segment lengths being longer (4 and 10 s) than in the similar previous studies they compared their results against. However, in our test the segment length is 5 s, i.e., in a similar range. Our significantly different ratings could instead be explained by the generally lower MOS obtained for these conditions. In other words, frequent switching, even when visible, may have little impact if it happens at high quality levels.

Locations: HRCs 6 and 13 have the same pattern, with only the location of the drop being different. A recency effect has been found in some studies [1], implying that degradations towards the very end of a sequence have a stronger impact on ratings, or, conversely, that good quality at the end makes up for low quality at the beginning. Since we avoided degradations in the last 15 s, we cannot analyze this effect. However, we can see that there is no general location effect ("temporal trend").

Low-bitrate Start with Ramp-up: HRCs 21/22 (MOS 3.81, 3.67) demonstrate the heavy impact of a low-quality start compared to the reference HRC 1 (MOS 4.74). These perceptual degradations correspond to the effect of initial buffering times of as much as 28.5 s and 44.3 s, respectively, according to ITU-T Rec. P.1201 Amd. 2, App. III. It seems preferable to have users wait rather than start playback immediately but at low quality, which confirms results from [9]. However, it should be considered that the MOS values of the two HRCs do not differ significantly, which was unexpected given the longer time spent at low quality levels in HRC 22.

4.1.2. Impact of Stalling

In addition to the pure quality-adaptation HRCs, we combined them with stalling events. The respective HRCs specifically aimed at studying the impact of stalling-event location

(14, 15) and their interaction with quality drops (16–20).

Location: There seems to be no location effect for stalling events alone (HRCs 14/15).

Interaction with Quality: We see a slightly stronger impact of stalling events when they occur within or close to a quality degradation. Specifically, in ascending HRC MOS order, stalling inside a drop (HRC 16) is worse than at the beginning of the drop (17), before the drop (18), and after the drop (20, 19). The presence of stalling at a time when subjects are already focused on low quality therefore has an impact on the overall perceived impairment, but the effect is not statistically significant, except for HRCs 16 vs. 19, where it is most extreme. Generally, stalling is perceived more negatively when there is no adaptivity. When we compare HRCs 14/15 (with no adaptation) to HRC 1, the estimated stalling degradation is 1.3 MOS, whereas in HRCs 16–20, the stalling degradation alone would only amount to 0.74 in HRC 20 (which is like HRC 13 with stalling) and up to 1.28 in HRC 16 (compared to HRC 8).

4.1.3. Content Dependency

As visible in Fig. 2, some HRCs show a large quality impact due to the employed SRCs. On average over all HRCs, the ratings of the highest and lowest rated SRC differ by 0.63 MOS. For example, for HRC 8, the MOS difference between SRC 51 and 35 is 1.46. As mentioned in Section 4.1.1, the quality drop was much more visible for SRC 51 due to its higher spatiotemporal complexity, resulting in a lower rating. Similarly, for HRC 4, SRC 8 produces a much lower MOS due to the scene characteristics (sports). This highlights the importance of content-related criteria when designing quality prediction models. But even when using models that consider visual characteristics, is there a MOS difference between SRCs for the same HRC that cannot be explained by those characteristics alone? To answer this, we calculated the quality scores per segment using the VQM full-reference model from ITU-T Rec. J.144. We obtained the objective score for a PVS by averaging the VQM values over all segments, re-scaling them from [0, 100] to [5, 1], and then subtracting the degradation due to stalling as predicted by ITU-T P.1201 Amd. 2, App. III (equal to 0.86 MOS), yielding the objective score MOS_VQM. Overall, the VQM-based model has an RMSE of 0.69 for all PVSs, with a correlation of 0.82. The low performance can be explained by the model not considering the factors described previously (e.g., the frequency of stalling events). However, if we keep the HRC fixed, we would expect little difference between the scores of the three SRCs, since VQM considers spatiotemporal characteristics. We therefore look for SRCs with similar VQM-based scores but different MOS.
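A minimal sketch of this mapping, with placeholder segment values (the per-segment VQM scores themselves come from an external implementation of ITU-T Rec. J.144; a linear re-scaling from [0, 100] to [5, 1] is assumed here, and the 0.86 MOS stalling penalty is the value given above):

def mos_vqm(segment_vqm_scores, stalling_penalty=0.0):
    """Map per-segment VQM values (0 = best, 100 = worst) to an
    objective score on the 1-5 scale and subtract the stalling degradation."""
    avg_vqm = sum(segment_vqm_scores) / len(segment_vqm_scores)
    # Linear re-scaling from [0, 100] to [5, 1].
    mos = 5.0 - 4.0 * (avg_vqm / 100.0)
    return max(1.0, mos - stalling_penalty)  # clamping at 1.0 is our assumption

# Example: twelve 5 s segments of a PVS with a stalling event (penalty 0.86 MOS
# as predicted by ITU-T P.1201 Amd. 2, App. III); the VQM values are placeholders.
segments = [20, 20, 25, 60, 60, 25, 20, 20, 20, 20, 20, 20]
print(mos_vqm(segments, stalling_penalty=0.86))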

For example, for SRCs 51 and 35 in HRC 8, MOS_VQM ≈ 3.5 in both cases, although the MOS ratings differ significantly. For SRCs 8 and 49 in HRC 4, MOS_VQM ≈ 4.7, failing to express the difference for SRC 8. The latter effect may also be due to the VQM-based model not capturing the strong impact of single quality drops such as in HRC 4, but nonetheless such discrepancies could be an indication of semantic effects of content, where people generally rate a specific SRC lower than others.

4.2. User Factors

From our questionnaire we learned that 64% of users strongly preferred having to wait for initial loading, while the others preferred a low-quality start. To check whether this is reflected in the actual quality ratings, we performed a Wilcoxon rank sum test on the scores of these two groups for the respective HRCs 21 and 22. The null hypothesis, i.e., that the ratings of the two groups do not differ, could not be rejected (W = 742, p = 0.2661). In practice, this means that service providers may significantly increase the experienced quality if users are allowed to set the client's playback behavior to their liking. All subjects agreed that they would mind if the video had to re-buffer during playback, even if the content was entertaining. Similarly, 92% said that they would still be annoyed by bad audiovisual quality. The expected annoyance appears to be higher for stalling, though, as it interrupts the "flow" of the video. This is in line with results from [6], which state that small degradations that do not interrupt the playback may not even be remembered when the content is entertaining. However, too strong degradations may have an even stronger effect if users want to see the content.
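A sketch of such a group comparison in Python (placeholder ratings; scipy.stats.mannwhitneyu computes the equivalent rank-based test, and its U statistic corresponds to the W value reported by R's wilcox.test):

from scipy import stats

# Placeholder ratings of HRCs 21/22 for the two preference groups identified
# in the questionnaire ("prefer waiting" vs. "prefer low-quality start").
prefer_wait = [4, 4, 3, 4, 5, 4, 3, 4, 4, 3]
prefer_low_start = [4, 3, 4, 4, 3, 4]

u_stat, p_value = stats.mannwhitneyu(prefer_wait, prefer_low_start,
                                     alternative="two-sided")
print(u_stat, p_value)  # compare against the reported W = 742, p = 0.2661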

5. DISCUSSION

Using an immersive design without content repetition instead of a full-factorial SRC–HRC design had several advantages. Subjects stayed entertained: 96% of them said that participating in the test was "interesting and fun", 84% claimed to have had no problems concentrating, and only 28% felt the test was too long, despite a duration of over one hour, more than double the recommended time. The main drawback, of course, is that a source content's visual characteristics are heavily reflected in the quality ratings, as some quality switches (especially those involving resolution changes) may simply not be perceivable for a certain SRC. It is therefore absolutely necessary for an experiment designer to carefully select the SRCs and be aware of the (non-)visibility of the conditions that are applied. Alternatively, the full factorial design of SRCs/HRCs could be created and distributed across subjects (as explained in [7]). This may yield more stable ratings, at the expense of requiring many more observers. Another effect is the possible impact of content likability, which may be reflected in the quality scores. A VQM-based model could not explain the significant differences in ratings for some HRCs. Further tests and data analysis are needed to investigate this possible factor. At this point it should be noted that the question posed to the observer also plays a large role: are we still interested in "technical quality" (that is, "quality based on experiencing" as defined in [10]), or in a real measure of Quality of Experience [10]? If it is the latter, the question and wording of the rating scale may have to be adjusted, and the influence of content likability becomes explicitly welcome. In fact, this differs from the intentions of traditional tests, which try to eliminate any possible influence of the content itself. When aiming to predict QoE, it may even be necessary to consider more user factors than typically done, to be able to estimate how much a user values technical quality over the content itself (e.g., "I don't care what the quality is as long as the video is entertaining" versus "If I like the video, I can't stand to see it in bad quality."), and to estimate what the impact of a given content is on the QoE ratings.

We believe that there are still a few challenges to be tackled and open research questions to be answered before QoE for HAS services is fully understood. For an ecologically valid test design, we need to provide realistic stimuli, both in terms of content and length, and overall enjoyable viewing experiences. It would also be preferable to have users rate the QoE within a specific task (such as selecting a YouTube channel and watching a video they would like), rather than putting the human into the role of a passive viewer and a "degradation spotter". A more practical challenge arises when designing representative video streaming tests: where can researchers obtain source material that is free to use, of sufficient length, with pristine original quality, and, last but not least, enjoyable to watch? In the pursuit of a valid test result, we cannot have subjects "watch paint dry" for an extended period of time. A joint effort between QoE research and the entertainment industry would be fruitful for both parties, to develop open databases that can be used for QoE testing.

6. CONCLUSIONS

In this paper, we presented the results of an audiovisual quality test which investigated the impact of quality switches and stalling events in the context of HTTP Adaptive Streaming services. We took care to design as many representative conditions as possible, applying them to realistic source content of 1 min length, which we obtained from online video services. Using an "immersive design" for our tests, subjects never had to see the same content twice, which proved to be an effective means of making subjective tests more enjoyable and more ecologically valid. The results of this test confirm previous findings, namely that the frequency of quality switches negatively impacts the overall quality. A mere average is therefore not good enough as a model for predicting the quality of a longer sequence. It is better to stay at a lower quality when it is expected that another switch to a lower QL may happen soon. Our results also show interaction effects that have not been studied yet:

there is a tendency in the ratings showing that stalling has a bigger impact when it occurs within a low-quality region. It may therefore be strategically better to avoid stalling when already playing out low quality; in other words, it may be better to re-buffer data first and then continue at low quality if needed. Lastly, we found no evidence for temporal effects, either with regard to adaptation or to stalling events. Our future research will not only target audiovisual quality, but also the user's actual experience, taking into account the likability of the content and personal traits. We believe that more meaningful and representative QoE models can be built from such data. Another goal is to combine the results of this contribution with those of further tests, in which we investigate all three factors, namely the impact of initial loading, stalling and adaptivity. Also, the impact of different adaptation sets (e.g., bitrate scaling vs. resolution scaling or framerate scaling) needs to be studied in more depth, and in combination with the aforementioned factors.

References

[1] M.-N. Garcia, F. De Simone, S. Tavakoli, N. Staelens, S. Egger, K. Brunnström, and A. Raake, "Quality of experience and HTTP adaptive streaming: A review of subjective studies," in QoMEX, 2014.

[2] M. Seufert, S. Egger, M. Slanina, T. Zinner, T. Hossfeld, and P. Tran-Gia, "A Survey on Quality of Experience of HTTP Adaptive Streaming," in IEEE Communications Surveys & Tutorials, 2014.

[3] T. Hossfeld, S. Egger, R. Schatz, M. Fiedler, K. Masuch, and C. Lorentzen, "Initial delay vs. interruptions: between the devil and the deep blue sea," in QoMEX, 2012.

[4] M.-N. Garcia, D. Dytko, and A. Raake, "Quality impact due to initial loading, stalling, and video bitrate in progressive download video services," in QoMEX, 2014.

[5] S. Tavakoli, J. Gutierrez, and N. Garcia, "Subjective Quality Study of Adaptive Streaming of Monoscopic and Stereoscopic Video," in IEEE Journal on Selected Areas in Communications, 2014.

[6] N. Staelens, S. Moens, W. Van den Broeck, I. Marin, B. Vermeulen, P. Lambert, R. Van de Walle, and P. Demeester, "Assessing Quality of Experience of IPTV and Video on Demand Services in Real-life Environments," in IEEE Trans. on Broadcasting, 2010.

[7] M. H. Pinson, M. Sullivan, and A. Catellier, "A new method for immersive audiovisual subjective testing," in VPQM, 2014.

[8] S. Egger, B. Gardlo, M. Seufert, and R. Schatz, "The impact of adaptation strategies on perceived quality of HTTP adaptive streaming," in VideoNEXT. ACM, 2014, pp. 31–36.

[9] L. Yitong, S. Yun, M. Yinian, L. Jing, L. Qi, and Y. Dacheng, "A study on Quality of Experience for adaptive streaming service," in IEEE International Conference on Communications Workshops, 2013.

[10] A. Raake and S. Egger, "Quality and quality of experience," in Quality of Experience: Advanced Concepts, Applications and Methods, S. Möller and A. Raake, Eds., chapter 2. Springer, 2014.