Unacceptability of Instantaneous Errors in Mobile Television: From Annoying Audio to Video

Satu Jumisko-Pyykkö
Tampere University of Technology
Institute of Human-Centered Technology
P.O. Box 553
FI-33101 Tampere, FINLAND
[email protected]

Vinod Kumar M.V
Tampere University of Technology
Institute of Signal Processing
P.O. Box 553
FI-33101 Tampere, FINLAND
[email protected]

Jari Korhonen
Nokia Research Center
Multimedia Technologies laboratory
P.O. Box 100
FI-33721 Tampere, FINLAND
[email protected]

ABSTRACT
As in many digital telecommunications systems, the data streams received over Digital Video Broadcasting for Handhelds (DVB-H) may contain bursty transmission errors. The bursty error characteristics affect the end users' perceived audiovisual quality. This study examined the perceived unacceptability of instantaneous but noticeable audio, visual and audiovisual errors. The erroneous streams were generated from four popular television contents by applying three simulated error patterns with different error rates (1.7%, 6.9%, 13.8%) and error burst durations. Instantaneous unacceptability of errors was evaluated by 30 participants with simplified continuous assessment while watching the program content. The results show that with the two lowest error rates the audio errors were more unacceptable than the video errors, and that with the highest error rate the visual and audiovisual errors became the most unacceptable.

Categories and Subject Descriptors
H.5.1 Multimedia Information Systems: Evaluation/methodology.

General Terms
Experimentation, Human Factors

Keywords
Audiovisual quality, transmission errors, perception, audio, video

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MobileHCI’06, September 12–15, 2006, Helsinki, Finland. Copyright 2004 ACM 1-58113-000-0/00/0004…$5.00.

1. INTRODUCTION
Mass mobile services such as mobile television are expected to be part of tomorrow's everyday life. On the way to the end users' devices in all possible locations (buses, cafes, waiting halls), these services are sent via error-prone transmission channels. The transmission may cause quality degradation. To reach the expected popularity, the media services with possible impairments have to be acceptable to the end users. To ensure the acceptance of the signal quality, subjective research methods make it possible to identify the perceptual multimodal acceptability, preferences and critical quality factors [14][18].

In wireless telecommunications, radio interference and physical constraints often cause bursty bit errors in the digital transmission channel. To some degree, errors can be corrected using error correcting codes, but in the presence of long error bursts the error correction schemes may fail. In this case, the system must discard the damaged transport data units and possibly request retransmission. Unfortunately, retransmissions are typically not feasible for mobile broadcast and multicast streaming systems, such as DVB-H and mobile multicast. Therefore, the application should expect some data loss to occur in realistic DVB-H usage scenarios. Data loss implies perceptual quality degradation.

Audiovisual, multimodal perception and its effects on quality are relatively unexplored. In audiovisual perception, the visual and auditory information are integrated into a unified perceptual experience. This experience is more than just the sum of the separate auditory and visual experiences. For example, it seems that people rely on more than one medium at a time [8]; the importance of the sensory channels can depend on the processed contents, context and tasks [10][17]. Understanding multimodal perception may pose new challenges for the much-needed technical optimization. Recent studies on this topic have mostly focused on bandwidth optimization and the testing of compression parameters. In quality optimization studies, the audio-video bit rate allocation for different television materials has been reported [10][17][32]. The published studies on audiovisual transmission errors have focused on setting the minimum user requirements for packet loss [30] and the acceptance threshold for erroneous streams over DVB-H [18].
These studies have approached quality from the point of view of overall perceptual quality. However, there are no publications comparing the unacceptability of instantaneous errors in transmitted audiovisual material. The only studies on the topic have examined audio and video signal losses separately [22][23]. Knowledge of perceived instantaneous errors provides information about the annoyance of errors in different media. In this paper, instantaneous unacceptability is defined as the subjectively perceived annoyance of noticeable unimodal or multimodal errors. This study compares the unacceptability of instantaneous audio, video and audiovisual errors. Simulated errors with different error rates are applied to four different potential television contents over DVB-H. Data is collected with continuous assessment while watching the contents.

The paper is organized as follows. An overview of erroneous streams is given in Section 2. Issues related to audiovisual perception and instantaneous changes are presented in Section 3. Section 4 summarizes the research methods used. The results for the unacceptability of instantaneous errors are presented in Section 5, with a comparison to the overall perceptual quality. Section 6 presents the discussion and concludes the study.

2. ERRORS IN WIRELESS CHANNELS
One of the essential differences between wired and wireless digital telecommunications lies in the different sources of transmission errors. In traditional wired packet-switched networks, packet losses are primarily due to congestion in the network devices. In contrast, wireless telecommunications typically suffer from a high rate of physical transmission errors in the radio channel. Mobile terminals in movement (when located in a vehicle, for example) are especially prone to transmission errors.

2.1 Wireless error characteristics
On the link layer, errors in the physical channel appear as bit errors. Due to the vulnerable nature of a radio channel, most practical wireless systems employ a number of link layer mechanisms to improve the reliability of data transport. These mechanisms can be based on forward error correction (FEC), retransmissions or both. FEC adds redundant information (error correcting codes) to the data stream, which can be used to detect and correct bit errors. When retransmissions are used, damaged data sections are detected and the lost data sections are retransmitted by the sender upon request from the receiver.

Several studies concerning different wireless telecommunications systems and physical environments show that bit errors are usually not distributed uniformly, but tend to be clustered in bursts [31][19]. This bursty nature of errors also has implications for the error patterns observed at the upper layers. Characteristically, FEC cannot recover all heavily clustered bit errors. In this case, the damaged transmission units (packets) have to be discarded entirely or retransmitted. Packet losses are undesirable, because they have direct implications for the perceived audio-visual quality in streaming applications.

Unfortunately, in multicast and broadcast the feedback channel is typically either missing or cannot be used, as this would cause an implosion of feedback messages. This is why retransmission-based error recovery mechanisms are typically feasible for unicast transmission only. The performance of FEC can be improved by using interleaving to distribute the bursty errors more smoothly. In DVB-H, multiprotocol encapsulation with FEC (MPE-FEC) has been adopted to combine interleaving and FEC. MPE-FEC effectively corrects both individual bit errors and short error bursts [7].
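The effect of interleaving on a burst can be illustrated with a small block interleaver. The sketch below is only an illustration of the general principle, not the MPE-FEC procedure: the block size (four codewords of six symbols), the burst position and the function names are assumptions chosen for the example.

```python
def interleave(symbols, rows, cols):
    """Write row by row into a rows x cols block, read column by column."""
    assert len(symbols) == rows * cols
    return [symbols[r * cols + c] for c in range(cols) for r in range(rows)]


def deinterleave(symbols, rows, cols):
    """Invert interleave(): write column by column, read row by row."""
    assert len(symbols) == rows * cols
    return [symbols[c * rows + r] for r in range(rows) for c in range(cols)]


# Four codewords of six symbols each (hypothetical sizes);
# every symbol is tagged with (codeword index, position).
rows, cols = 4, 6
sent = [(cw, i) for cw in range(rows) for i in range(cols)]
channel = interleave(sent, rows, cols)

# A burst erases four consecutive symbols on the channel.
burst = set(range(8, 12))
received = [None if k in burst else s for k, s in enumerate(channel)]
restored = deinterleave(received, rows, cols)

# Count the erased symbols per codeword after deinterleaving.
losses_per_codeword = [
    sum(1 for i in range(cols) if restored[cw * cols + i] is None)
    for cw in range(rows)
]
print(losses_per_codeword)
```

Without interleaving, the same burst would remove four symbols from a single codeword; with it, each codeword loses one symbol, which a code able to correct a single erasure per codeword could still recover.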

2.2 Simulating wireless errors
Due to the bursty nature of errors in a wireless channel, different finite-state stochastic models have been developed to model the behavior of wireless digital transmission. In the well-known Gilbert-Elliott (GE) model [9] there are two states representing two different channel conditions: the good state and the bad state. In the good state, the bit error probability e_g is very low or zero, and in the bad state the bit error probability e_b is high. The average lengths of the error bursts are determined by the transition probabilities between the two states [19]. A simplified GE model assumes e_g to be zero and e_b to be one. More accurate finite-state models have also been proposed, because in many cases the GE model fails to describe the measured real-life error patterns with acceptable accuracy [19]. Most of the practical experiments to measure error patterns have been conducted in Wireless Local Area Networks (WLAN) or cellular packet radio systems, such as GPRS. However, the finite-state models have also been confirmed to be useful for simulating the packet error behavior in DVB-H [24].
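A minimal sketch of a two-state GE channel is given below to make the model concrete. The transition probabilities (`p_gb`, `p_bg`), the sequence length and the function names are illustrative assumptions, not values from this study. With the simplified model (e_g = 0, e_b = 1), the mean burst length is 1/p_bg and the long-run error rate is p_gb / (p_gb + p_bg).

```python
import random


def simulate_gilbert_elliott(n, p_gb, p_bg, e_g=0.0, e_b=1.0, seed=0):
    """Return a list of n error flags (1 = error) from a two-state GE channel.

    p_gb: probability of moving from the good to the bad state per step
    p_bg: probability of moving from the bad to the good state per step
    e_g, e_b: error probabilities in the good and in the bad state
    """
    rng = random.Random(seed)
    bad = False  # start in the good state
    pattern = []
    for _ in range(n):
        error_prob = e_b if bad else e_g
        pattern.append(1 if rng.random() < error_prob else 0)
        # State transition for the next transmission unit.
        if bad:
            bad = rng.random() >= p_bg
        else:
            bad = rng.random() < p_gb
    return pattern


def burst_lengths(pattern):
    """Lengths of the runs of consecutive errors in the pattern."""
    lengths, run = [], 0
    for flag in pattern:
        if flag:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths


# Illustrative parameters only (assumed, not taken from the paper).
pattern = simulate_gilbert_elliott(200_000, p_gb=0.01, p_bg=0.25)
bursts = burst_lengths(pattern)
error_rate = sum(pattern) / len(pattern)
mean_burst = sum(bursts) / len(bursts)
print("error rate:", error_rate)
print("mean burst length:", mean_burst)
```

For these assumed parameters the theoretical values are an error rate of about 0.038 and a mean burst length of 4 units; the simulated figures should lie close to them.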

3. AUDIOVISUAL PERCEPTION OF INSTANTANEOUS CHANGES
3.1 Audiovisual perception of quality
In the perception of audiovisual material, visual and auditory information are integrated into one unified perception in a complex manner. Different sensory modalities can enhance and modify the perceptual impression of the other perceptual channel. The McGurk effect illustrates how mismatched visual and acoustic material is integrated into a unified audiovisual experience [21]. The synchronization of the media, as a process of gluing image and sound together, affects the audiovisual synthesis and builds up the unified multimodal perception [28]. In unsynchronized material, the clarity of the message decreases and the viewer is distracted from the intended content [25].

The characteristics of multimodal perception also appear in audiovisual quality research. It is reported that good audio quality enhances the perceived visual information and vice versa [25][6]. One medium seems to be leading, and its importance can vary according to different contents, contexts and tasks [10][17]. The content dependency in particular is reported in data compression studies with overall retrospective quality evaluation. The importance of audio quality is highest in head-and-shoulder contents, such as news and video conferencing [10][17], and in music videos [17]. Conversely, the relative importance of video quality is emphasized in high-motion sport contents. Animation represents mid-type contents, in which both media are important. It is anticipated that the significance of content dependency would also be revealed when the unacceptability of instantaneous transmission errors is studied, because errors in the dominating modalities would violate the unified multimodal perception.

3.2 Peaks and instantaneous errors
Instantaneous errors are quality impairments appearing as simultaneous or separate temporal discontinuities in the different media. Instantaneous errors are typical in data transmission, representing negative quality peaks and, according to Hands and Avons [11], they are related to low quality ratings. In retrospective overall evaluations, the peak intensity and a location at the end are the factors that most affect the perceived quality. Surprisingly, the duration of the peak impairment is not an important factor. The negative effect of a quality impairment located at the end disappears if continuous assessment is used as the research method in video quality studies. The peak intensity of the error is the strongest predictor of the retrospective overall quality if there is only one error in a temporal location and the duration is varied. In realistic data transmission scenarios, it is expected that several errors can occur, and these results may have to be elaborated.

Pastrana-Vidal et al. [22][23] have studied the effects of sporadic audio and video errors on the perceptual quality of one medium at a time (with short stimuli of 10 s). Their artificially controlled video errors appeared as jerky motion because, when an error occurred, the last picture was played until the new image was reconstructed. In the video study, the minimum length of the error detection threshold was 80 ms, and 200 ms long errors were visible in all contents. In sporadic video frame dropping, several short discontinuities were less preferable than a single long-lasting burst. The audio errors were perceived as silence followed by an abrupt clipping. The auditory detection threshold varied from 1.2 to 6.1 ms depending on the content, but an objective error of 30 ms was audible in all contents. The audio loss was experienced as more annoying in music contents than in speech contents, but the content dependency became irrelevant if the length of the error was 550 ms or greater.

These two studies showed the perceptual detection thresholds and unequivocal thresholds for sporadic audio and video errors. However, if the errors are noticeable, does it also mean that they are unacceptable to the end user? In multimedia transmission, instantaneous quality impairments can occur in video and audio separately or simultaneously. What is the relative annoyance, in terms of unacceptability, of audio errors, video errors and simultaneous audiovisual errors in a realistic transmission simulation? This paper focuses on these questions.

4. RESEARCH METHOD
4.1 Participants
30 participants (equally stratified by age (18-45 years) and gender) took part in the experiment, which was conducted in a laboratory environment at Tampere University of Technology during fall 2005. The share of people categorized as innovators and early adopters according to their attitudes toward technology, and of professional evaluators, defined as people studying, working or otherwise engaged in information technology or multimedia processing or presentation, was restricted to a maximum of 20% [26]. All participants had normal or corrected-to-normal vision, normal color vision and normal hearing.

4.2 Test procedure
The test procedure contained pre-test, test and post-test sessions. In the pre-test session, vision and hearing tests and demographic data collection took place. This was followed by combined training and anchoring, in which participants were shown the extremes of the sample qualities as an example of the quality scale and to familiarize them with the test and the contents used. In the test, simplified continuous assessment was used in parallel with retrospective ratings [Figure 1]. The sample material was shown using the single stimulus method, where clips are viewed one by one and rated independently [14]. During each clip presentation, unacceptable quality was indicated by pressing a button of a game controller. The typical continuous assessment with the use of sliders has been reported to impose a high cognitive load on the participants [20]. The simplified continuous assessment is expected to require less of the participant's attention than using a slider, while still enabling continuous data collection. After the presentation of each clip, participants had five seconds to mark the quality score of the clip on an answer sheet using a discrete, unlabelled scale from 0 to 10 [29], together with the acceptance of the clip (yes/no choice). All clips were played three times, and the positions of the transmission errors varied in each repetition. In the post-test session, qualitative data on experiences of the erroneous streams were gathered. One test session lasted about 1.5 hours.

Figure 1 Experimental set-up: continuous assessment and retrospective ratings

4.3 Selection of Test Material
Four test contents were selected according to their audiovisual characteristics [Table 1], popularity and potential for mobile television. In particular, news, as an official and objective information source, and entertaining sport, as exciting content, were selected because they have potential for television in handhelds [1][27]. Each clip contained a meaningful segment of a TV program, and the start and end points did not cut a sentence. The length of the clips varied from 61 s to 63 s, and it was ensured that at least one error appeared in one of the repetitions with the lowest error rate.

Table 1 Contents and their descriptions

Genre        Content                                Video: spatial (details)  Video: temporal (motion)  Audio
NEWS         Evening news                           High                      Moderate                  Speech
SPORT        Ice Hockey                             High                      High                      Speech
MUSIC VIDEO  Gwen Stefani: What You Waiting For?    High                      High                      Music
CARTOONS     The Simpsons                           Moderate                  Moderate                  Speech

4.4 Material Production Process – Simulations
The selected test materials were encoded using the codecs recommended for the IP datacasting service over DVB-H [5]. Advanced Audio Coding (AAC) [13] was used for audio and H.264/AVC [12] for video encoding. The bitrate, sampling rate and frame rate were selected according to the results of the previous study. The audio was coded as monaural at a bitrate of 32 kbps with a sampling rate of 16 kHz, while the video was coded at a bitrate of 128 kbps and a frame rate of 12.5 frames per second [17]. At least one Instantaneous Decoder Refresh (IDR) frame was inserted per DVB-H time slice to reduce the tuning-in delay at the receiver and to provide better error resiliency in the channel.

As defined in the specifications for encapsulating data in DVB-H given in [3][4][5], the coded audio and video packets were encapsulated into IP datagrams by adding IP/UDP/RTP headers. Both audio and video packets contained one media frame each. To enable synchronized audio-visual streaming, the audio and video packets were first multiplexed, and the resulting service stream was multiplexed with other services to form a single stream of packets containing four different services, each service comprising an audio-visual multiplexed stream. The time-sliced transmission burst interval was set to approximately 1.5 seconds.

The MPE-FEC method is used to provide reliability in the error-prone DVB-H channel. It is computed using a matrix of size 512 x 255. Each cell in the matrix holds one information byte. The first 512 x 191 part of the matrix is filled column-wise with IP datagrams. When an IP packet does not fit completely into the matrix, the remaining cells are not filled. The second 512 x 64 part of the matrix is filled with Reed-Solomon FEC codes computed for each row of the matrix. Further details on the computation procedure can be obtained from [3].

To simulate loss in the DVB-H channel, the results of a field trial carried out in an urban setting with an operable DVB-H system were used as a basis. The receiver in the field trials was located in a car.
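The layout of the MPE-FEC matrix described above can be sketched as follows. The sketch only covers the column-wise placement of datagrams and the resulting sizes; the Reed-Solomon parity computation is omitted, and the 1300-byte datagrams and the function name are hypothetical, chosen only to exercise the layout.

```python
ROWS = 512          # rows in the MPE-FEC matrix, as stated in the text
DATA_COLS = 191     # columns holding IP datagrams
PARITY_COLS = 64    # columns holding Reed-Solomon parity bytes


def fill_application_table(datagrams, rows=ROWS, cols=DATA_COLS):
    """Place datagrams column by column; return (table, datagrams_used).

    table[r][c] is a byte value, or None for a cell left unfilled.
    A datagram that does not fit completely is not placed.
    """
    capacity = rows * cols
    stream = bytearray()
    used = 0
    for dgram in datagrams:
        if len(stream) + len(dgram) > capacity:
            break
        stream.extend(dgram)
        used += 1
    table = [[None] * cols for _ in range(rows)]
    for k, byte in enumerate(stream):
        # Column-wise order: run down a column before moving to the next one.
        table[k % rows][k // rows] = byte
    return table, used


# Hypothetical 1300-byte datagrams, each filled with its own index value.
datagrams = [bytes([i % 256]) * 1300 for i in range(100)]
table, used = fill_application_table(datagrams)
print("data capacity (bytes):", ROWS * DATA_COLS)
print("parity size (bytes):", ROWS * PARITY_COLS)
print("code rate:", DATA_COLS / (DATA_COLS + PARITY_COLS))
print("datagrams placed:", used)
```

With these dimensions, one matrix carries 97792 data bytes protected by 32768 parity bytes, a code rate of 191/255 (about 0.75). Each row forms one codeword, so a lost column costs every row only one byte.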
The results of the field test, which were in the form of an MPE-FEC error pattern (correct and uncorrectable MPE-FEC frames), were used to model further MPE-FEC error patterns using a simplified GE model with error probabilities of zero and one. The field test results were used to train the model and estimate the state transition matrix. The resulting estimated transition matrix was used to generate another MPE-FEC error pattern, with some modifications to achieve the relevant error rates. Three rates (1.7%, 6.9%, 13.8%) of erroneous time slices after FEC decoding were chosen for the simulations.

To generate the error patterns for the transport stream (TS) packets within the uncorrectable MPE-FEC frames, a second simplified GE model was implemented. The result was a TS error pattern that approximated the results of the actual field test. The generated TS packet errors were used to corrupt the coded audiovisual sequences. The error correction operation using MPE-FEC was simulated, and the resulting residual IP packet error pattern was obtained. The residual IP error pattern reflected the uncorrectable errors in the channel.

Simple video error concealment was used: when a picture was lost, all subsequent pictures were replaced by the last correctly received picture in the presentation order until the arrival of the next refresh picture. Errors in video were perceived as discontinuous motion jerks. For audio error concealment, the lost audio frames were replaced by silence, which was then perceived as gaps during the playback.

In the simulated error patterns, the error rates, the number of errors, their durations and their locations varied. Figure 2 shows examples of audio-visual errors for the animation sequence with the 1.7% and 13.8% error rates. Table 2 lists the burst error characteristics for all the sequences for one error. It can be noticed that the error bursts in video are longer than in audio.
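Training a simplified GE model on an observed error pattern, as described above, amounts to counting state transitions. The sketch below shows one way this could be done; it is an assumption about the general approach, not the authors' implementation, and the observed pattern is made up rather than taken from the field trial.

```python
def estimate_transition_matrix(pattern):
    """Estimate P[i][j] = P(next state = j | current state = i).

    pattern is a sequence of 0 (correct frame) and 1 (uncorrectable frame).
    In a simplified GE model the state equals the observed symbol, so the
    estimate is the normalized count of observed transitions.
    """
    counts = [[0, 0], [0, 0]]
    for current, nxt in zip(pattern, pattern[1:]):
        counts[current][nxt] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical observed pattern, not the field-trial data.
observed = [0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0]
P = estimate_transition_matrix(observed)

# Long-run share of erroneous frames implied by the estimated matrix.
stationary_error_rate = P[0][1] / (P[0][1] + P[1][0])
print(P)
print(stationary_error_rate)
```

The last expression shows why the matrix can be modified to reach a target error rate: changing the good-to-bad probability `P[0][1]` shifts the long-run error rate while the bad-to-good probability `P[1][0]`, which governs the burst length, stays the same.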

4.5 Presentation of Test Materials
The tests were conducted in a laboratory environment [15]. The clips were viewed on a Nokia 6630 handset with a player from Nokia. The device was mounted in a stand in a vertical position, with the screen and buttons visible. The stand was adjusted to eye level with a viewing distance of 44 cm, according to the preferred viewing distance reported for similar tasks and screen size [17]. Headphones were used for audio playback, and the audio loudness level was adjusted to 75 dBA. Participants used a game controller (Logitech Dual Action gamepad) to mark unacceptability in a simplified continuous evaluation. A logging program was run on a laptop (Fujitsu Siemens Lifebook, Pentium 3, Windows 2000) to collect the user input. The logging program ran on Python 2.3.5 and used the PyGame 1.6 module for accessing the game controller button events. When the button of the game controller was pressed, the program received a button down event and saved the number of seconds elapsed from the reference time at the beginning of the presentation.
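The timing logic of such a logging program can be sketched as follows. This is not the original program: the class name and its methods are hypothetical, the sketch is written for current Python rather than Python 2.3.5, and the game controller is left out so that the example stays self-contained. In a PyGame event loop, `button_down()` would presumably be called for each `pygame.JOYBUTTONDOWN` event.

```python
import time


class PressLogger:
    """Record button presses as seconds elapsed since the clip started."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable clock, eases testing
        self._reference = None
        self.presses = []

    def start(self):
        """Set the reference time at the beginning of the presentation."""
        self._reference = self._clock()
        self.presses = []

    def button_down(self):
        """Store and return the elapsed time of one button-down event."""
        if self._reference is None:
            raise RuntimeError("start() must be called first")
        elapsed = self._clock() - self._reference
        self.presses.append(elapsed)
        return elapsed


# Usage with a fake clock so that the example is deterministic.
fake_times = iter([100.0, 112.5, 140.25])
logger = PressLogger(clock=lambda: next(fake_times))
logger.start()
logger.button_down()
logger.button_down()
print(logger.presses)
```

Storing times relative to a reference point set at the start of the clip is what allows the presses to be compared later with the known positions of the simulated errors.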

Figure 2: Audio and video errors as a function of time for the animation sequence at the error rates of 1.7% and 13.8%.

Table 2: Number of errors (N), mean durations and standard deviations (SD), in seconds, of burst errors for all error patterns at the different error rates.

                      1.7%                 6.9%                 13.8%
Content      Media    N     Mean (SD)      N     Mean (SD)      N       Mean (SD)
Animation    Audio    0-3   0.33 (0.28)    3-6   0.37 (0.20)    11-14   0.32 (0.19)
             Video    1     1.57 (0.51)    3-4   1.06 (0.54)    7-8     1.61 (0.97)
Music Video  Audio    0-3   0.27 (0.38)    3-7   0.70 (0.17)    11-14   0.31 (0.19)
             Video    2     1.65 (0.38)    2-3   1.21 (0.43)    7-9     1.27 (0.74)
News         Audio    1     0.33 (0.29)    2-6   0.38 (0.20)    11-14   0.32 (0.19)
             Video    1     1.94 (0.45)    2-4   1.08 (0.35)    7-9     1.41 (1.00)
Sports       Audio    0-3   0.34 (0.28)    4-6   0.34 (0.21)    12-15   0.30 (0.18)
             Video    1-2   1.10 (0.34)    2-4   1.06 (0.44)    7-8     1.61 (0.81)

4.6 Data-analysis Methods
Prior to the analysis, the subjective ratings were mapped to the objective errors with the help of some probability estimations. Firstly, the estimation was needed because of the participants' personal reaction times. The reaction time to an expected stimulus is usually no longer than 3 seconds [7], and in subjective experiments the errors are assumed to be expected. Secondly, there was some variation (1-2 s) between the participants in the starting time of the clips, because the logging software for continuous assessment was started manually. The evaluations given for the error ratio of 1.7% were used as a reference to set the mapping time between the objective errors and the ratings. This error ratio was selected because the sequence contained only a few errors (1-2) and the deviation of the evaluations around the objective errors was the most visible in all contents. 90% of all evaluations were located between two seconds before the second in which the objective error started and two seconds after the second in which it ended. In these estimations, chopped audio errors located near each other are counted as one error.
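The mapping between button presses and objective errors can be sketched as below. The two-second margins follow the text; the merging gap of one second for neighbouring audio errors, the function names and the sample data are assumptions made for the illustration.

```python
def merge_close_errors(errors, max_gap=1.0):
    """Merge error intervals (start, end) separated by at most max_gap seconds."""
    merged = []
    for start, end in sorted(errors):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def map_presses_to_errors(presses, errors, before=2.0, after=2.0):
    """Return ({error_index: [press times]}, [unmatched press times])."""
    matched = {i: [] for i in range(len(errors))}
    unmatched = []
    for t in presses:
        for i, (start, end) in enumerate(errors):
            # A press counts for an error if it falls inside the widened window.
            if start - before <= t <= end + after:
                matched[i].append(t)
                break
        else:
            unmatched.append(t)
    return matched, unmatched


# Hypothetical data, in seconds from the start of a clip.
audio_errors = [(10.0, 10.3), (10.8, 11.1), (40.0, 40.4)]
presses = [11.9, 30.0, 41.5]
errors = merge_close_errors(audio_errors)
matched, unmatched = map_presses_to_errors(presses, errors)
print(errors)
print(matched, unmatched)
```

In this example the first two audio gaps are treated as one error, two of the three presses are attributed to an error, and the press at 30.0 s remains unmatched because no error lies within two seconds of it.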

both mean and nominal unacceptability approaches (Figure 3a,b; Wilcoxon: Z=-5.536 p