Audiovisual Quality Estimation for Mobile Video Services

Michal Ries, Bruno Gardlo
Abstract—Provisioning of mobile video services is rather challenging, since in mobile environments bandwidth and processing resources are limited. Audiovisual content is present in most multimedia services; however, the user expectation of perceived audiovisual quality differs for speech and non-speech contents. The majority of recently proposed metrics for audiovisual quality estimation assumes only one continuous medium, either audio or video. In order to accurately predict the audiovisual quality of a multimedia system, it is necessary to apply a metric that takes audio as well as video quality into account simultaneously. When assessing a multi-modal system, one cannot model it only as a simple combination of mono-modal models, because the pure combination of audio and video models does not give a robust perceived-quality metric. We show the importance of taking into account the cross-modal interaction between the audio and video modes, also called the mutual compensation effect. In this contribution we report on measuring the cross-modal interaction and propose a content-adaptive audiovisual metric for video sequences that distinguishes between speech and non-speech audio. Furthermore, the proposed method allows for reference-free audiovisual quality estimation, which reduces computational complexity and extends applicability.

Index Terms—audiovisual quality, multimedia, mobile services.
I. INTRODUCTION
WHILE measurement or estimation of speech quality has been a standard procedure for many decades, quality estimation of video and audio signals is a relatively new field. In mobile multimedia transmissions in particular, mostly defined by simultaneous audio and video transmission, quality estimation is challenging due to the limited data rate and processing resources. One of the challenges in mobile communications is to improve the subjective quality of audio and audio-visual services. Due to advances in audio and video compression and the wide-spread use of standard codecs such as AMR and AAC (audio) and MPEG-4/AVC (video), provisioning of audio-visual services is possible at low bit rates while preserving perceptual quality. The Universal Mobile Telecommunications System (UMTS) Release 4 (implemented by the first UMTS network elements and terminals) provides a maximum data rate of 1 920 kbps shared by all users in a cell, while Release 5 offers up to 14.4 Mbps in the downlink direction for High Speed Downlink Packet Access (HSDPA).

M. Ries is with the Institute of Communications and Radio Frequency Engineering, Vienna University of Technology, Gusshausstrasse 25/389, A-1040 Vienna (e-mail: [email protected]).
B. Gardlo is with the Department of Telecommunications and Multimedia, University of Zilina, Univerzitna 1, 010 26 Zilina (e-mail: [email protected]).
Manuscript received March 15, 2009.
The following audio and video codecs are supported by UMTS video services. For audio they include the AMR speech codec, AAC Low Complexity (AAC-LC) and AAC Long Term Prediction (AAC-LTP) [1]; for video they include H.263, MPEG-4 and MPEG-4/AVC [1]. The appropriate encoder settings for UMTS video services differ for various contents and streaming application settings (resolution, frame and bit rate) [2]. End-user quality is influenced by a number of factors [2], [7], [8], [9], including mutual compensation effects between audio and video [4], [5], content, encoding and network settings, as well as transmission conditions [6]. This mutual compensation effect is very noticeable if, for example, a news speaker's face freezes while the speech continues. In this case it is not perceived as very harmful, since most of the information is in the speech signal. A similar effect occurs in a soccer match where the players freeze but the commentator's speech continues. If, however, there is no speaker involved and only the soccer scene is shown, a freeze of the video is perceived as very harmful and the quality is rated as very low. Thus, audio and video are not only mixed in the multimedia stream; there is even a synergy of the component media (audio and video), as was shown in [2], [7], [9], [10]. Mutual compensation effects cause perceptual differences in video with a dominant voice in the audio track rather than in video with other types of audio [9]. Video content with a dominant voice includes news, interviews, talk shows, and so on [3]. Finally, audio-visual quality estimation models tuned for video content with a dominant human voice perform better than general (universal) models [9], [2]. Therefore, our focus within this work is on the design of audiovisual metrics incorporating audio and video content features simultaneously. We are looking at measures that do not need the original (non-compressed) sequence for the estimation of quality, because this reduces the complexity and at the same time broadens the possibilities for deploying the quality prediction. Furthermore, we investigated novel ensemble-based estimation; it has been shown that such ensemble estimators are more beneficial than their single-classifier counterparts [13].

The paper is organized as follows: In Section II we describe a typical mobile video streaming scenario and a test setup for video quality evaluation. In Section III the extraction of video and audio features is described. The design of the ensemble-based audiovisual estimator is presented in Section IV. Section V presents the performance evaluation of the ensemble-based estimator and a comparison with a state-of-the-art estimator. Section VI contains conclusions and provides an outlook on future work.
Fig. 1. Snapshots of selected sequences for audiovisual test: Video clip (left), Soccer (middle), Video call (right).
II. AUDIOVISUAL QUALITY ASSESSMENT

A. Test Methodology

The proposed test methodology is based on ITU-T P.911 [14] and adapted to our specific purpose and limitations. For this particular application, the most suitable experimental method among those proposed in the ITU-T Recommendation was considered to be ACR, also called the Single Stimulus Method. The ACR method is a category judgement in which the test sequences are presented one at a time and are rated independently on a category scale. Only degraded sequences are displayed, and they are presented in arbitrary order. This method imitates the real-world scenario, because the customers of mobile video services do not have access to the original videos (high-quality versions). On the other hand, ACR introduces a higher variance in the results compared to methods in which the original sequence is also presented and serves as a reference for the test subjects [11]. After each presentation the test subjects were asked to evaluate the overall quality of the sequence shown. In order to measure the perceived quality, a subjective scaling method is required. However, whatever the rating method, this measurement is only meaningful if there actually exists a relation between the characteristics of the presented video sequence and the magnitude and nature of the sensation it causes in the subject. The existence of this relation is assumed. Test subjects evaluated the video quality after each sequence in a prepared form using a five-grade MOS scale: “5–Excellent”, “4–Good”, “3–Fair”, “2–Poor”, “1–Bad”. Higher discriminative power was not required, because our test subjects were used to five-grade MOS scales (from school). According to our previous experience, a five-grade MOS scale offers a better trade-off between the evaluation interval and the reliability of the results [2], [12]. Previous works [8], [12] in this field show that test subjects hesitate to use the entire range of 9- or 11-grade ACR scales. To emulate the real-world conditions of the UMTS video service, all audio and video sequences were played on the UE (Vodafone VPA IV). In this single point the proposed methodology for audiovisual quality testing is not compliant with ITU-T P.911 [14]. Furthermore, since one of our intentions is to study the relation between audio quality and video quality, we decided to conduct all tests with a standard stereo headset. During the training session of three sequences the subjects were allowed to adjust the volume of the headset to a comfortable level. The viewing distance from the phone was not fixed but selected by the test person; we noticed that all subjects were comfortable holding the cell phone at a distance of 20-30 cm.
TABLE I
ENCODING SETTINGS OF VIDEO CLIP TRAINING SEQUENCE.

Resolution        QVGA     VGA      QVGA     VGA      QVGA
Audio Codec       AAC      AAC      AAC      AAC      AAC
Video BR [kbps]   190.28   231.40   173.80   229.80   75.87
Video FR [fps]    12.5     15       12.5     12.5     12.5
Audio BR [kbps]   16       32       32       32       16
Audio SR [kHz]    16       16       16       16       16
TABLE II
ENCODING SETTINGS OF VIDEO CLIP EVALUATION SEQUENCE.

Resolution        QVGA     VGA      QVGA     VGA      QVGA
Audio Codec       AAC      AAC      AAC      AAC      AAC
Video BR [kbps]   186.90   295.36   171.16   269.96   103.10
Video FR [fps]    12.5     15       12.5     12.5     12.5
Audio BR [kbps]   16       32       32       32       16
Audio SR [kHz]    16       16       16       16       16
B. Encoder Settings

All video sequences were encoded using typical settings for the UMTS environment. Due to the limitations of mobile radio resources, bit rates were selected in the range of 59-320 kbps. Only comprehensible audio files were allowed in the set. The test sequences were encoded with the H.264/AVC baseline profile 1b codec. The audio was encoded with the AAC or AMR codec. The encoding parameters were selected according to our former experience described in [2] and [9]. In total, 12 encoding combinations were tested for the training set (see Tables I, III, V) and 13 encoding combinations for the evaluation set (see Tables II, IV, VI). Two sets, an evaluation and a training set, were defined. Both sets consisted of different sequences. In each of the sets we used, for each content type, the same subset of video sequences but slightly differently encoded (see Tables I and II for the encoding settings of the video clip and Tables III and IV for soccer). To evaluate the subjective perceptual audiovisual quality, a group of 15 people for the training set and a group of 16 people for the evaluation set was chosen. The group covered different ages (between 22 and 30), genders, education levels and experience. The sequences were presented in a random order, with the additional condition that the same sequence (even differently degraded) did not appear in succession. Two rounds of each test were taken. The duration of each test round was about 20 minutes. Single-evaluation MOS values with a variance higher than one between round one and round two were excluded. In total, 6% of the single-evaluation MOS values were rejected, maintaining the mean of the set but significantly decreasing the variance.
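The screening rule can be stated compactly. The following is a minimal sketch, assuming ratings are stored as one (subjects x sequences) array per round; the names and array layout are illustrative, not taken from the test software:

import numpy as np

def screen_mos(round1, round2, max_var=1.0):
    """Sketch of the screening rule: a subject's rating of a sequence is
    kept only if the variance between the round-1 and round-2 scores
    does not exceed 1. Inputs: (subjects x sequences) MOS arrays."""
    r = np.stack([round1, round2])                 # shape (2, subjects, sequences)
    keep = r.var(axis=0) <= max_var                # variance of the two rounds
    kept = np.where(keep, r.mean(axis=0), np.nan)  # rejected ratings -> NaN
    mos = np.nanmean(kept, axis=0)                 # per-sequence MOS after rejection
    return mos, keep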
TABLE III
ENCODING SETTINGS OF SOCCER TRAINING SEQUENCE.

Resolution        QVGA     QVGA    QVGA     QVGA     QVGA
Audio Codec       AAC      AAC     AAC      AAC      AAC
Video BR [kbps]   199.14   92.30   181.46   196.98   182.98
Video FR [fps]    15       15      12.5     12.5     15
Audio BR [kbps]   16       16      32       16       32
Audio SR [kHz]    16       16      16       16       16
TABLE IV
ENCODING SETTINGS OF SOCCER EVALUATION SEQUENCE.

Resolution        QVGA     QVGA    QVGA     VGA      QVGA     QVGA     VGA
Audio Codec       AAC      AAC     AAC      AAC      AAC      AAC      AAC
Video BR [kbps]   198.72   93.07   180.90   319.32   195.69   183.32   292.12
Video FR [fps]    15       15      12.5     15       12.5     15       12.5
Audio BR [kbps]   16       16      32       16       16       32       16
Audio SR [kHz]    16       16      16       16       16       16       16
TABLE V
ENCODING SETTINGS OF VIDEO CALL TRAINING SEQUENCE.

Resolution        QVGA     QVGA
Audio Codec       AAC      AMR
Video BR [kbps]   202.84   59.25
Video FR [fps]    12.5     7
Audio BR [kbps]   16       5
Audio SR [kHz]    16       8
For the audiovisual quality tests, three different content types (Video clip, Soccer and Video call) with different perception of the video and audio media were selected. The video snapshots are depicted in Figure 1. The first two sequences, Video clip and Soccer, contain a lot of local and global movement. The main difference between them is in their audio part. In Soccer, the speaker's voice as well as loud support from the audience is present, and the speaker's voice is rather important; especially important are the small moving objects, players and ball. Figure 2 and Figure 3 show the results from the subjective tests. In Video clip, instrumental music with voice is present in the foreground. Figure 4 and Figure 5 show the results from the subjective tests for the video clip files. In Video call, a human voice is the most dominant. Finally, Figure 6 shows the results from the subjective tests for the video call files.
TABLE VI
ENCODING SETTINGS OF VIDEO CALL EVALUATION SEQUENCE.

Resolution        QVGA
Audio Codec       AAC
Video BR [kbps]   191.86
Video FR [fps]    12.5
Audio BR [kbps]   16
Audio SR [kHz]    16
Fig. 2. Measured MOS results for Soccer video training sequences.

Fig. 3. Measured MOS results for Soccer video evaluation sequences.

Fig. 4. Measured MOS results for Video clip training sequences.

Fig. 5. Measured MOS results for Video clip evaluation sequences.

Furthermore, the obtained results for Video call and Soccer show that higher resolution has little or no impact on audiovisual quality. This is influenced by the granularity of the LCD of the test PDA.

C. Prior Art
In former work [2], [9] we investigated audiovisual quality for different content classes, codecs and encoding settings.¹ The obtained subjective video quality results clearly show the presence of the mutual compensation effect.

¹It is worth pointing out that the audio and video sets used in the former work are different from the ones used in the present work.
Fig. 6. Measured MOS results for Video call sequences.
Fig. 7. MOS results for the Video call content - codec combination H.263/AMR.

Fig. 8. MOS results for the Video clip content - codec combination H.263/AAC.
Figures 7, 8 and 9 (the color code serves only for better visualization of the results) show the results of audiovisual quality assessment based on H.263 encoding. In Video call, the audiovisual quality is more influenced by the audio quality than by the video quality, as shown in Figure 7. This effect is described in more detail in [2]. Further investigation within this work shows that it is beneficial to propose one audiovisual model with content-sensitive parameters for various video contents, depending on the presence (Video call) or absence of a dominant human voice (Video clip and Cinema trailer). Therefore, within the new work presented in this contribution, an additional parameter was introduced for detecting speech and non-speech audio content (cf. Section III-B).
III. FEATURE EXTRACTION

The proposed method is focused on reference-free audiovisual quality estimation. The character of the sequence is determined by content-dependent audio and video features between two scene changes. Therefore, the investigation of the audio and video stream was focused on sequence motion features as well as on audio content and quality. The video content significantly influences the subjective video quality [2], [15], and the sequence motion features reflect the video content very well.
Fig. 9. MOS results for the Cinema trailer - codec combination H.263/AAC.
The well-known ITU-T standard P.563 [16] was used for audio quality estimation. ITU-T P.563 is a standard for speech quality evaluation, but in our setup it is also used for the evaluation of general audio quality, since at the time of writing this paper no other standard for no-reference audio quality evaluation was known. Furthermore, a speech/non-speech detector was introduced to account for the different influence of the mutual compensation effect between audio and video in speech and non-speech content. Finally, temporal segmentation was also used as a prerequisite in the process of video quality estimation. For this purpose a scene change detector was designed with an adaptive threshold based on the video dynamics. The scene change detector design is described in detail in [2].
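The actual detector design is specified in [2]; purely as an illustration of an adaptive threshold driven by the video dynamics, a sketch could look as follows (the histogram-difference metric, window length and factor k are our assumptions, not values from [2]):

import numpy as np

def scene_cuts(frames, k=3.0, window=30):
    """Illustrative sketch only: flag a cut when the inter-frame luma
    histogram difference exceeds an adaptive threshold derived from the
    recent dynamics (mean + k*std over a trailing window)."""
    diffs = []
    for a, b in zip(frames[:-1], frames[1:]):      # consecutive luma frames
        ha, _ = np.histogram(a, bins=64, range=(0, 256))
        hb, _ = np.histogram(b, bins=64, range=(0, 256))
        diffs.append(np.abs(ha - hb).sum())
    diffs = np.asarray(diffs, dtype=float)
    cuts = []
    for i, d in enumerate(diffs):
        w = diffs[max(0, i - window):i + 1]        # trailing window of dynamics
        if d > w.mean() + k * w.std():
            cuts.append(i + 1)                     # index of first frame of new shot
    return cuts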
A. Video feature extraction

The focus of our investigation is on the motion features of the video sequences. The motion features can be used directly as an input to estimation formulas or models; both possibilities were investigated in [17], [18] and [2], respectively.
The investigated motion features concentrate on the motion vector (MV) statistics, including the size distribution and the directional features of the motion vectors within one sequence of frames between two cuts. Zero MVs allow for estimating the size of the still regions in the video pictures. That, in turn, allows analyzing the MV features of the regions with movement separately, which makes it possible to distinguish between rapid local movements and global movement. Moreover, the chosen motion features are very sensitive to perceptual quality reduction in the spatial and temporal domains, making them very suitable for reference-free quality estimation, because a higher compression does not necessarily reduce the subjective video quality (e.g. in static sequences). These MV features make it possible to detect rapid local movements or to characterize global movements. The selection of the MV features is based on multivariate statistical analysis; the details can be found in [2]. The following MV and BR features represent the motion characteristics (a short code sketch follows the list):
• Zero MV ratio within one shot (Z): The percentage of zero MVs, i.e. the proportion of the frame that does not change at all (or changes only very slightly) between two consecutive frames, averaged over all frames in the shot. This feature detects the proportion of a still region. A high proportion of still region refers to a very static sequence with a small but significant local movement; the viewer's attention is focused mainly on this small moving region. A low proportion of still region indicates uniform global movement and/or a lot of local movement.

• Mean MV size within one shot (N): The mean size of the non-zero MVs, normalized to the screen width and expressed as a percentage. This parameter determines the intensity of movement within a moving region. Low intensity indicates a static sequence; high intensity within a large moving region indicates a rapidly changing scene.

• Ratio of MV deviation within one shot (S): The percentage ratio of the standard deviation of the MV sizes to the mean MV size within one shot. A high deviation indicates a lot of local movement, while a low deviation indicates global movement.

• Uniformity of movement within one shot (U): The percentage of MVs pointing in the dominant direction (the most frequent direction of MVs) within one shot. For this purpose, the direction resolution is 10°. This feature expresses the proportion of uniform and local movement within one sequence.

• Average BR: This parameter refers to the pure video payload. The BR is calculated as an average over the whole stream. Furthermore, the BR reflects the compression gain in the spatial and temporal domains. Moreover, the encoder performance depends on the motion characteristics: BR reduction causes a loss of spatial and temporal information, which is usually annoying for viewers.
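To make the first four features concrete, the following minimal sketch computes Z, N, S and U for one shot. The input layout (one (dx, dy) row per motion vector, collected between two cuts) and the function name are our own illustration; in practice the MVs are parsed from the encoded stream:

import numpy as np

def motion_features(mv, frame_width, bin_deg=10):
    """Shot-level motion features Z, N, S, U (all in percent).
    mv: array of shape (n_vectors, 2) with (dx, dy) motion vectors of
    all frames between two scene cuts (an assumed layout)."""
    sizes = np.hypot(mv[:, 0], mv[:, 1])
    zero = sizes == 0
    Z = 100.0 * zero.mean()                        # zero-MV ratio
    moving = sizes[~zero]
    if moving.size == 0:                           # fully static shot
        return Z, 0.0, 0.0, 0.0
    N = 100.0 * moving.mean() / frame_width       # mean MV size, % of width
    S = 100.0 * moving.std() / moving.mean()      # deviation-to-mean ratio
    ang = np.degrees(np.arctan2(mv[~zero, 1], mv[~zero, 0])) % 360
    hist, _ = np.histogram(ang, bins=np.arange(0, 361, bin_deg))
    U = 100.0 * hist.max() / hist.sum()           # dominant-direction share
    return Z, N, S, U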
B. Audio feature extraction

Many reliable estimators for audiovisual quality were proposed recently, and some of them became standards [19], [20] and [16]. For our purpose, a reference-free estimation method called "Single ended method for objective speech quality assessment in narrow-band telephony applications" [16] turned out to be very suitable. The 3SQM [16] performs audio quality estimation in two stages: the first stage includes intermediate reference system filtering, signal normalization and voice activity detection. In the second stage, twelve parameters are calculated from the processed input signal. These parameters take into account speech level, noise, delay, repeated frames, disruptions in the pitch period, and artificial components in the speech signal (beeps, clicks). The 12 parameters are then linearly combined to form the final audio quality prediction (on the MOS scale). As previous work has shown [7], mutual compensation effects cause differences in the perception of video content with a dominant voice in the audio track rather than of video with other types of audio [9]. Video content with a dominant voice includes news, interviews, talk shows, and so on. Finally, audiovisual quality estimation models tuned for video content with a dominant human voice perform better than universal models [9]. Therefore, our further investigation was focused on the design of speech detection algorithms suitable for the mobile environment. Due to the low-complexity requirement, our investigation was initially focused on time-domain methods. For this purpose, a pair of audio parameters, kurtosis (\kappa_x) [21] and High Zero Crossing Rate Ratio (HZCRR) [22], extracted from the audio signal turned out to be suitable. The kurtosis of a zero-mean random process x(n) is defined as the dimensionless, scale-invariant quantity²

\kappa_x = \frac{\frac{1}{N}\sum_{n=1}^{N}\left(x(n)-\bar{x}\right)^4}{\left(\frac{1}{N}\sum_{n=1}^{N}\left(x(n)-\bar{x}\right)^2\right)^2}, \qquad (1)
where in our case x(n) represents the n-th sample of the audio signal and \bar{x} its mean. A higher \kappa_x value corresponds to a more peaked distribution of samples, as found in speech signals and depicted in Figure 10, whereas a lower value implies a flatter distribution, as found in other types of audio signals. Therefore, kurtosis was selected as a basis for the detection of speech. However, accurate detection of speech in short-time frames is not always possible by kurtosis alone. The second objective parameter under consideration is the HZCRR, defined as the ratio of the number of frames whose Zero Crossing Rate (ZCR) is greater than 1.5 times the average ZCR in the audio file [22]:

HZCRR_M = \frac{1}{2N}\sum_{n=0}^{N-1}\left[\operatorname{sgn}\left(ZCR(n,M) - 1.5\,\overline{ZCR}\right) + 1\right], \qquad (2)

where ZCR(n, M) is the ZCR of the n-th, length-M frame, N is the total number of frames, and \overline{ZCR} is the average ZCR over the audio file. The per-frame ZCR is given by

ZCR(n, M) = \frac{1}{2(M-1)}\sum_{m=1}^{M-1}\left|\operatorname{sgn}(x(m)) - \operatorname{sgn}(x(m-1))\right|. \qquad (3)
²Note that some texts define kurtosis as \kappa_x = \frac{\frac{1}{N}\sum_{n=1}^{N}(x(n)-\bar{x})^4}{\left(\frac{1}{N}\sum_{n=1}^{N}(x(n)-\bar{x})^2\right)^2} - 3. We shall however follow the definition in [21].
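Both parameters amount to only a few lines of numerical code. The following is a minimal sketch of Eqs. (1)-(3); segmenting the signal into frames by plain reshaping is our simplification, not a detail taken from [22]:

import numpy as np

def kurtosis(x):
    """Kurtosis of Eq. (1) (non-excess definition, following [21])."""
    d = x - x.mean()
    return (d ** 4).mean() / (d ** 2).mean() ** 2

def hzcrr(x, frame_len):
    """HZCRR of Eq. (2): share of frames whose ZCR exceeds 1.5x the
    file-average ZCR."""
    n = len(x) // frame_len
    frames = x[: n * frame_len].reshape(n, frame_len)
    # Eq. (3): per-frame ZCR as the normalized number of sign changes
    zcr = np.abs(np.diff(np.sign(frames), axis=1)).sum(axis=1) / (2 * (frame_len - 1))
    # Eq. (2): 1/(2N) * sum of [sgn(ZCR - 1.5*avg) + 1]
    return np.mean(np.sign(zcr - 1.5 * zcr.mean()) + 1) / 2

For a speech excerpt one would expect a clearly higher kurtosis than for music, in line with the separation of the two distributions in Figure 10.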
Fig. 10. Cumulative density function of kurtosis \kappa_x for speech and non-speech audio samples.

The speech/non-speech decision is based on a log-likelihood ratio (LLR) \Delta, computed over a block of feature vectors F_i as follows:

\Delta = \frac{\sum_{i=1}^{20} \log\left(\frac{1}{\sqrt{(2\pi)^{12}\|\Sigma_s\|}}\exp(g)\right)}{\sum_{i=1}^{20} \log\left(\frac{1}{\sqrt{(2\pi)^{12}\|\Sigma_m\|}}\exp(h)\right)}, \qquad (4)

where g and h are denoted as follows:

g = -\frac{1}{2}(F_i - \mu_s)\Sigma_s^{-1}(F_i - \mu_s)^T, \qquad (5)

h = -\frac{1}{2}(F_i - \mu_m)\Sigma_m^{-1}(F_i - \mu_m)^T. \qquad (6)

If the LLR is greater than the decision threshold, c = c_1 = 2.2 (see Figure 11), we declare the frame as a non-speech frame; otherwise we declare it as a speech frame.
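A minimal sketch of this decision rule could look as follows, assuming F holds the 20 feature vectors of a block, the parameters (\mu_s, \Sigma_s) and (\mu_m, \Sigma_m) of the speech and non-speech models were trained offline, and \|\cdot\| in Eq. (4) is read as the determinant (all of these are our assumptions):

import numpy as np

def is_speech(F, mu_s, Sig_s, mu_m, Sig_m, c=2.2):
    """Sketch of the decision rule of Eqs. (4)-(6).
    F: (20, d) array with one feature vector F_i per row."""
    def log_lik_sum(mu, Sig):
        dev = F - mu                                    # rows: F_i - mu
        g = -0.5 * np.einsum('nd,dk,nk->n', dev, np.linalg.inv(Sig), dev)
        # log of (2*pi)^{-d/2} |Sigma|^{-1/2} exp(g); (2*pi)**d generalizes
        # the (2*pi)^{12} of Eq. (4) to d feature dimensions
        norm = -0.5 * np.log((2 * np.pi) ** F.shape[1] * np.linalg.det(Sig))
        return np.sum(norm + g)
    delta = log_lik_sum(mu_s, Sig_s) / log_lik_sum(mu_m, Sig_m)  # Eq. (4)
    return delta <= c              # LLR > c -> non-speech; otherwise speech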