PREDICTING AN AVERAGE END-USER'S EXPERIENCE OF VIDEO PLAYBACK

Audrey Younkin, Royce Fernald, Rina Doherty, Eric Salskov, Philip Corriveau
Intel Corporation, Hillsboro OR 97124

ABSTRACT

The end-user experience of a given platform or product with respect to video performance is an increasingly important aspect for engineers and product planners to understand. Testing real-world situations in an objective and repeatable fashion is complex. The video Gross Error Detector (GED) provides a quick and cost-efficient way to evaluate video and predict an average end-user's perception of video smoothness. The video GED allows an automated, quantitative, and reliable measurement of the number of large errors, such as dropped, repeated, or out-of-sequence frames, that may be present in the video program. These errors can be mapped to end-user subjective ratings to estimate the perceptibility and annoyance associated with playback errors. The video GED can be used to monitor errors in video for quality control, to benchmark video processing and algorithms, and can be integrated into the video processing system to optimize algorithms and limit settings.

1. INTRODUCTION

Since the Video Quality Experts Group (VQEG) was established in 1997, reliable objective tools have been under development that take into consideration the impact on the end-user's experience [1]. In video transmission there are many aspects that can be measured, including packet loss, delay, and jitter. These measurements, however, do not describe the end-user experience. Intel Corporation decided that it was time to take the next step and implement a simple tool that was objective and could map end-user experiences to technical requirements. Intel has been at the forefront of introducing perceptually based metrics into the industry. These measurements describe the video performance expected by the average person. In 2005 Intel released dhCAT 1.0, which incorporated perceptual modeling [2]. This tool was designed for the industry to differentiate a platform's performance in ways an end-user could understand. However, the largest gap for engineers, designers, and validation teams was to understand what requirements were needed for an "acceptable" experience and, harder yet, how those requirements could be measured.

The video GED was developed to quantify the playback of a video stream in relation to its target frame rate (e.g. 29.97 frames per second for NTSC video), enabling a solution to the problem statement above. Video delivery errors are perceived as skips or pauses during playback, which cause a degraded experience for the end-user. At times the degradation can be so severe that end-users refer to the experience as a slide show. The video GED is a high-level video performance analysis application that provides objective data on the smoothness of video playback. The tool supports any video playback solution that can be connected to a hardware capture device. It can measure and compare the video performance of various transports, operating systems, streaming applications, media players, and compression formats. Overall, the video GED allows a direct comparison between PC platforms and consumer electronics devices. The implementation of the GED is relatively simple and begins when source material is prepared for a given testing environment. A source file is encoded with a series of visual frame identifiers. These identifiers consist of a series of color blocks that are unique to each frame, so that the video GED can quickly and efficiently determine whether a video stream is rendered intact, in sequence, and at the correct frame rate. This marked file can then be compressed or streamed to understand the playback capability of the system under test. The output file is captured, and the video file is then decoded and analyzed (see Figure 1).
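The paper does not specify the exact layout or color coding of these frame identifiers. As a rough illustration only, the Python sketch below stamps an assumed binary color-block pattern onto each frame and reads it back; the block count, block size, placement, and black/white encoding are all assumptions, not the GED's actual marker format.

```python
import numpy as np

N_BLOCKS = 16      # assumption: 16 blocks encode the frame index in binary
BLOCK_SIZE = 16    # assumption: 16x16-pixel blocks along the top edge of the frame

def mark_frame(frame: np.ndarray, index: int) -> np.ndarray:
    """Stamp a hypothetical binary color-block identifier for `index` onto a copy of `frame`."""
    marked = frame.copy()
    for bit in range(N_BLOCKS):
        value = 255 if (index >> bit) & 1 else 0          # white block = 1, black block = 0
        x0 = bit * BLOCK_SIZE
        marked[:BLOCK_SIZE, x0:x0 + BLOCK_SIZE] = value
    return marked

def read_frame_index(frame: np.ndarray) -> int:
    """Recover the stamped frame index by thresholding the same block positions."""
    index = 0
    for bit in range(N_BLOCKS):
        x0 = bit * BLOCK_SIZE
        if frame[:BLOCK_SIZE, x0:x0 + BLOCK_SIZE].mean() > 127:
            index |= 1 << bit
    return index
```

In this toy encoding, a 720-pixel-wide frame easily accommodates the 16 blocks (256 pixels), and 16 bits is more than enough to index the 240-frame clips used later in the study.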

[Figure 1. Gross Error Detector processing chain: (0) source material → GED markup → (1) marked source material → video encoder → (2) compressed stream → system under test → (3) capture → GED decode → (4) GED analysis.]
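The paper describes the GED analysis only at the block-diagram level. The following sketch shows one plausible way to classify the decoded frame identifiers of a captured clip into repeated, dropped, and out-of-sequence frames; the classification rules are assumptions for illustration, not Intel's published algorithm.

```python
def classify_frame_errors(frame_ids):
    """Count repeated, dropped, and out-of-sequence frames from a list of decoded
    per-frame identifiers, given in the order they were captured."""
    repeated = dropped = out_of_sequence = 0
    previous = None
    for current in frame_ids:
        if previous is not None:
            if current == previous:
                repeated += 1                        # the same frame was shown again
            elif current > previous + 1:
                dropped += current - previous - 1    # identifiers were skipped over
            elif current < previous:
                out_of_sequence += 1                 # identifier moved backwards
        previous = current
    return {"repeated": repeated, "dropped": dropped, "out_of_sequence": out_of_sequence}

# Example: frame 2 is skipped and frame 4 is shown twice.
print(classify_frame_errors([0, 1, 3, 4, 4, 5]))
# {'repeated': 1, 'dropped': 1, 'out_of_sequence': 0}
```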

The stand-alone video GED in itself has very useful objective outputs. However, as platform companies grow to incorporate video as a competitive consumer experience, more testing was required in order to map the objective outputs to real end-user subjective ratings. Subjective testing using the targeted consumer is now mainstream and important, as the industry is shifting from the conventional goals of increasing capacity and performance to enabling an overall increased value of experience to the consumer. Instituting testing methodologies that not only measure technology-based outputs but also provide an estimate of how the user will respond to this technology helps define the overall end-product value statement. The following discusses a study that was specifically aimed at understanding the threshold for end-user annoyance with respect to playback errors. The goal of this experiment was to determine the perceived levels of annoyance associated with playback errors and to correlate these results with the objective outputs of the video GED. This paper discusses the standardized methodology of subjective assessment used for video playback, presents the overall end result, and explains the importance of using these results in an objective tool.

2. METHODOLOGY

2.1. Experimental Design

The study was designed using a methodology standardized by the International Telecommunication Union [3]. A within-subjects design was employed; this type of design ensures that each participant is exposed to all levels of the independent variable (playback errors). The advantage of a within-subjects design, apart from needing fewer subjects than a between-subjects design, is that it reduces subjective error variance [4]. A single stimulus impairment scale (SSIS) design with a hidden reference was used to present the stimuli. The SSIS methodology consists of showing a video clip to a participant without showing a reference clip before every trial. Instead, the reference clip is hidden as a randomized trial to account for the reliability of the subjective results [3]. Ten different clips that span a wide range of content were provided by VQEG, an international standards body. The experimental design consisted of two different types of playback errors: 1) dropped and 2) repeated frames. To understand the full spectrum of annoyance, five different percentages of errors were chosen: 0.04%, 2.4%, 20.8%, 33.3%, and 83.3%. To formulate a balanced test design, a total of 100 video clips were presented to participants (10 video clips x 2 error types x 5 conditions).
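A minimal sketch of how such a balanced design and a pseudo-randomized presentation order could be generated follows; the clip names and random seed are placeholders, not the actual study materials.

```python
import itertools
import random

clips = [f"vqeg_clip_{i:02d}" for i in range(1, 11)]   # placeholder names for the 10 VQEG clips
error_types = ["dropped", "repeated"]
error_levels = [0.04, 2.4, 20.8, 33.3, 83.3]            # percent of frames in error

trials = list(itertools.product(clips, error_types, error_levels))
assert len(trials) == 100                                # 10 clips x 2 error types x 5 levels

random.Random(42).shuffle(trials)                        # pseudo-randomized presentation order
```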

2.2. Non-Expert Participants

Subjective testing is the most reliable means for collecting data to determine the end-user's perceived smoothness of video playback [5]. Recruited participants are not directly concerned with video quality as part of their typical vocation. Viewers' ratings are typically used for threshold setting, determining the pass/fail line, or mapping to objective tools. Subjective assessment continues to provide the most dependable results compared to objective processes [5]. A total of 52 non-expert participants were recruited, with an effort to have an equal number of males and females ranging from 18 to 50 years of age.

2.3. Instrumentation

The source video clips were uncompressed (UYVY) standard definition with a resolution of 720x480 and a frame rate of 29.97 frames per second. Each clip was eight seconds in length, for a total of 240 frames. The source clips were encoded with the GED color blocks. Dropped and repeated errors were inserted into the encoded source clips (1, 10, 50, 80, and 200 errors). A controlled playback system called Video Clarity™ was attached to a calibrated CRT display [6]. The monitor was calibrated using an Eye-One™ device, a small tool that produces an accurate display profile from the CRT screen [7]. This determined the preset level of color temperature as well as the gamma and luminance measurements. Video Clarity™ was required for consistent evaluations because it captures and outputs exactly what it records and would not introduce any playback errors during the testing sessions.

2.4. Subjective Rating Scale

Mean Opinion Score (MOS) testing is an international standard, traditionally used for measuring and quantifying end-users' perception of video quality [3]. The MOS is a numerical representation of the average end-users' perceived quality, calculated as the arithmetic mean of all of the individual participant ratings in each test condition. The quality rating scale consists of a quantitative scale and an associated qualitative characterization ranging from 1 (very annoying) to 5 (imperceptible) (see Table 1).

Table 1. MOS quantitative and qualitative scale mapping

Quantitative Scale | Qualitative Description
5 | Imperceptible
4 | Perceptible, but not annoying
3 | Slightly annoying
2 | Annoying
1 | Very annoying
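As a small worked illustration of the MOS calculation described above (the ratings shown are made up, not data from the study):

```python
from statistics import mean

def mos(ratings):
    """Mean Opinion Score: the arithmetic mean of the 1-5 ratings for one test condition."""
    return mean(ratings)

print(round(mos([5, 4, 4, 5, 3, 4]), 2))  # six hypothetical ratings -> 4.17
```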

2.5. Procedure

All evaluations took place in a semi-anechoic chamber located at the Intel facilities in Hillsboro, Oregon. Participants were asked to sit approximately five times the vertical height of the display away from the monitor, which is in line with the recommended viewing distance [3]. To ensure a proper viewing experience, the viewing angle off the center axis did not exceed 20 degrees. Ergonomically adjustable chairs were provided for each participant to ensure that the line of sight was vertically centered on the display. In addition, the room lighting was set so that a participant's pupil was not dilating and/or constricting between clip presentations and rating screens, and it measured 10 lux. Care was also taken to ensure minimal glare from overhead lights onto the monitors. There were no windows in the room, and the walls were clear of distraction and carpeted to a 50% grey color. All participants went through a visual acuity (Snellen Eye Chart™) and color deficiency (Ishihara Testing Plates™) screening [8, 9]. Participants were instructed on the single stimulus impairment method and the five-point ordinal impairment rating scale. Before each test session, participants were given a training session that showed examples of the best and worst clips to establish a frame of reference and reduce the impact of participants' inherent biases. A practice session was also given to familiarize the participants with the testing setup. A break was scheduled halfway through the testing sessions to account for fatigue. The presentation of the video content was pseudo-randomized to prevent any ordering effects. Pseudo-randomization is a key element of psycho-visual testing that ensures participants do not see the material in a repeated fashion that would allow a learning effect. Participants were reminded that there were no right or wrong answers and that their opinions were highly valued.

3. RESULTS

Data were analyzed using a statistical analysis software package, Statistica version 7.0™ [10]. By combining the dropped and repeated results, a logarithmic curve fit revealed a correlation coefficient of R² = 0.99 (see Figure 2). The logarithmic equation was determined to be y = -0.571 * ln(x) + 4.6836.

Figure 2. MOS log trend for dropped and repeated errors

Converting these numerical values, obtained from the 240-frame clips, into percentages yields a generalized data set. Using the logarithmic equation, the experience threshold was quantified. For example, an imperceptible experience corresponded to less than 0.3% degradation of the overall video clip (see Table 2).

Table 2. End-user experience mapped to percent errors (combined dropped and repeated)

Experience | No. Errors | % Errors
5 | 0.6 | 0.3%
4 | 3.3 | 1.3%
3 | 18.9 | 7.8%
2 | 109.9 | 46.0%
1 | 633.4 | -
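The error counts in Table 2 follow from inverting the fitted equation, i.e. solving y = -0.571 ln(x) + 4.6836 for x at each MOS level and expressing the count as a fraction of the 240-frame clip. A quick sketch reproduces the table; the small differences from Table 2 come from rounding of the published coefficients.

```python
import math

def errors_at_mos(mos_value, slope=-0.571, intercept=4.6836):
    """Invert MOS = slope * ln(errors) + intercept to get the error count at a given MOS level."""
    return math.exp((mos_value - intercept) / slope)

for level in (5, 4, 3, 2, 1):
    n = errors_at_mos(level)
    print(f"MOS {level}: {n:5.1f} errors, {100 * n / 240:5.1f}% of a 240-frame clip")
# MOS 5: approx. 0.6 errors; MOS 4: 3.3; MOS 3: 19.1; MOS 2: 109.9; MOS 1: 633.6
```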

4. DISCUSSION

A very strong, statistically significant correlation was revealed, and the corresponding mathematical equation was employed as the perceptual model used to predict MOS based on the playback errors. This predictive model of end-user perception was then coded into the video GED. The subjective mapping process accounts for different clip lengths and normalizes the errors accordingly. Once the video GED measures the number of dropped frames and repeated frames, it sums the total number of errors. In order to normalize the error count, the total number of frames must be factored into the equation. Then a predictive MOS score can be generated by using the logarithmic equation provided by the non-expert subjective data (see Figure 3).

Figure 3. Predictive MOS process flow:
1. Encode and capture the video clip.
2. Record frame errors: Frame Errors = fn(dropped frames) + fn(repeated frames)
3. Normalize the error count for a clip of any length: Normalized Errors = (240 / frame count) * Frame Errors
4. Calculate the MOS value: MOS = -0.571 * ln(Normalized Errors) + 4.6836
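These steps translate directly into a small function. The sketch below mirrors the published equations; the handling of the zero-error case and the clamp to the 1-5 scale are added assumptions, since the paper does not state how the GED treats those boundaries.

```python
import math

def predict_mos(dropped, repeated, frame_count):
    """Predict MOS from GED error counts, following the Figure 3 process flow."""
    frame_errors = dropped + repeated
    if frame_errors == 0:
        return 5.0                                        # assumption: an error-free clip scores the maximum
    normalized = (240.0 / frame_count) * frame_errors     # normalize to the 240-frame reference length
    mos = -0.571 * math.log(normalized) + 4.6836
    return max(1.0, min(5.0, mos))                        # assumption: clamp to the 1-5 MOS scale

# Example: 3 dropped and 2 repeated frames in a 300-frame clip
print(round(predict_mos(3, 2, 300), 2))  # approx. 3.89, between "slightly annoying" and "perceptible"
```

A pass/fail criterion could then be a simple threshold on the predicted score, for example requiring a predicted MOS of 4 or better before a platform is considered to deliver an acceptable playback experience.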

This process flow has provided a fundamental tool grounded in real end-user data. This novel approach to relating the smoothness of video (for both dropped and repeated frames) to the annoyance threshold of the consumer has not been done before. In addition, the tool can handle a variety of formats and clip lengths, making it applicable to diverse video testing. The video GED provides a systematic methodology for use by the untrained and gives the ability to assess the playback experience via set pass/fail criteria. The video GED provides features that are useful for the early detection of smoothness degradation in a video clip. It answers the questions, "Did the video get there? If so, how much playback information was lost?" This is particularly constructive on entertainment platforms or for use in wireless environments.

Currently the video GED is in the process of being standardized by the Institute of Electrical and Electronics Engineers (IEEE). The standardization process involves an extensive amount of inspection of the draft text to ensure the validity of the method. The GED-based performance metrics have been adopted by IEEE Task Group T in 802.11.2 draft version 1.0, entitled "Recommended Practice for the Evaluation of 802.11 Wireless Performance". The document is presently undergoing internal review and should be ready for a full 802.11 working group ballot by March 2007.

Additional testing is presently underway to better understand the upper portion (3-5) of the MOS scale. It is understood what is annoying and very annoying, but it is more interesting and useful to planning engineers to home in on the minimum number of acceptable playback errors. This next phase of testing also takes into account the temporal placement of the errors and whether it makes a difference in the annoyance levels for the end-user: for example, if the dropped/repeated clusters occur in the initial portion of the video, is the experience less or more annoying than if the error clusters are presented at the end of the clip? The results from this study will help shape the higher end of the predictive MOS ratings and will be ready for reporting before the next VPQM conference.

5. CONCLUSION

The video GED provides a systematic methodology to assess the playback experience via defined pass/fail criteria. There are currently no other subjective mapping tools for video playback that take into account both dropped and repeated frames and can be applied to various lengths of video. The advantage of the video GED is that it counts the frame errors numerically, categorizes them, and feeds these values into an equation formulated from subjective assessment, revealing what an average end-user perceives. The video GED is particularly constructive on entertainment platforms or for use in wireless environments.

6. REFERENCES

[1] P. Corriveau. "Video Quality Experts Group." In H.R. Wu and K.R. Rao (Eds.), Digital Video Image Quality and Perceptual Coding (pp. 325-333). Boca Raton: Taylor & Francis. 2006.
[2] P. Corriveau and R. Doherty. "Novel Methodology for Developing a Perceptual Model of Video Playback Experience." Intel's DTTC Online Journal. 2006. http://dttc.intel.com/secured/2006/PublishOnly_Papers/4534.pdf
[3] International Telecommunication Union. Methodology for the Subjective Assessment of the Quality of Television Pictures. ITU-R Recommendation BT.500-11. 2002.
[4] P. Bibby. C82MST Statistical Methods 2: Within-Subjects Overview Lecture. Department of Psychology, University of Nottingham. 2006.
[5] P. Corriveau. "Video Quality Testing." In H.R. Wu and K.R. Rao (Eds.), Digital Video Image Quality and Perceptual Coding (pp. 125-153). Boca Raton: Taylor & Francis. 2006.
[6] Video Clarity: Tools for Visual Analysis. 2006. http://www.videoclarity.com/html/ClearView.html
[7] The Eye-One Display 2. GretagMacbeth. 2006. http://www.northlightimages.co.uk/article_pages/eye_one_display_2.html
[8] Snellen Eye Chart. 2006. http://www.allegromedical.com/snellen-eye-chart189809.html
[9] Ishihara Testing Plates. 2006. http://www.allegromedical.com/official-ishihara-colorblindness-test-192016.html
[10] StatSoft Statistica (Version 7.0) [Computer software]. Tulsa, OK. 2001.
