TOWARDS A CONTENT-BASED PARAMETRIC VIDEO QUALITY MODEL FOR IPTV

M.N. Garcia, R. Schleicher, A. Raake
Quality and Usability Lab, Deutsche Telekom Laboratories, Berlin University of Technology, Berlin, Germany

ABSTRACT

This paper presents a content-based parametric video quality model for IPTV services. The model estimates the quality of High Definition (HD, 1920x1080 pixels) video as perceived by the user, covering a wide quality range from low to high bit-rates. It has been developed for H.264 encoding. Following a modular design, the model uses content-related parameters only when it has access to the encoded video ("content-based" model), and relies on lower-level information, e.g. from packet headers, when access to the payload is not possible ("content-blind" model). In this study, content-dependent parameters were selected from a set of spatio-temporal features derived from the encoded video, such as the AC transform coefficients, the quantization parameters and the motion vectors. The features most appropriate for explaining the content dependency of the video quality were determined using a principal component analysis. These features are included in the content-blind model, leading to a content-based model which performs significantly better on both known and unknown subjective test data.

Index Terms— Video, quality, content, spatio-temporal complexity, parametric, bitstream, non-reference, model, IPTV, HD, H.264

1. INTRODUCTION

In order to achieve a high degree of user satisfaction for video services such as IPTV, perceived video quality needs to be predicted both in the network-planning phase and as part of the service monitoring. Quality assessment can be achieved using subjective tests or instrumental methods that yield estimates of the video quality as perceived by the user. Since subjective tests are time-consuming, costly, and do not allow assessing the quality during real-time service operation, instrumental assessment methods are often preferred in a technical application context. These instrumental methods are typically based on video quality models.

It is well known that the subjective quality of a video depends on its content [1, 2, 3, 4, 5] (in this paper, video content refers to the spatio-temporal complexity of the video, i.e. to the amount of detail and the complexity of the movements in the video). As a consequence, spatio-temporal features should be included in the objective video quality metric to yield more accurate estimates. Which type of spatio-temporal features is employed depends on the strategy followed for developing the video quality metric, on the nature of the signal from which the measures are retrieved, and on the availability of the original (uncompressed, non-transmitted) video signal.

In the literature, two strategies are mainly followed for developing such metrics. In the first one, a common model is used for all contents, but different pre-defined values of the model coefficients are selected depending on content classes, as in [2, 3, 4]. Content classes are estimated by a content classifier, which is trained beforehand on many video sequences based on their spatio-temporal features. The content class of the video under test is then obtained by matching its spatio-temporal features to those of the training database. The second approach consists in incorporating the spatio-temporal features directly into the model. The model is then a function of the spatio-temporal features, as in [6, 7, 8, 1]. This approach can be further split into two, depending on which data the model has been developed from: subjective test data, as in [1], or PSNR values computed in a full-reference mode, as in [6, 7].

In most studies, spatio-temporal features are computed from the video signal itself, i.e. from the original video signal, as in [8, 3], or from the decoded video, as in [1, 2, 4]. Note that when the features are derived from the original signal (whereas the quality metric is situated at the user side or in the network, as is the case for most IPTV service-monitoring solutions), they can be sent to the quality metric via an auxiliary channel as (reduced) reference information. In our case, no auxiliary channel is available. Moreover, computing the features on the decoded picture at the user side is computationally costly and thus not suitable for in-service quality monitoring. Therefore, we decided to compute the spatio-temporal features from the encoded video, as in [6, 7] for MPEG-2, H.264 and DCT-based encoded video. However, contrary to [6, 7], we developed our model based on subjective test
data (HD resolution) and not on PSNR values estimated in a full-reference mode. Indeed, PSNR is known to be inappropriate when used across video contents and codecs [9], while subjective tests are the most valid method for assessing perceived quality. Finally, our metric is embedded into a core model (the parametric video quality model T-V-Model [10]) which covers the whole quality range and all degradation types (compression, information loss with various loss-concealment techniques, etc.). The core model has been developed for IPTV services. In this paper, the content-dependency metric is implemented only for compression artefacts. The subjective tests conducted for developing the content-based video quality metric are detailed in Section 2, followed by an analysis of the test results, discussing the importance and the nature of the quality impact due to the video content. Section 3 describes the spatio-temporal features which can be extracted from the H.264-encoded video and outlines the method we used for selecting the spatio-temporal features most appropriate for explaining the content dependency of the video quality. These features are then employed in the model (Section 4), which is evaluated by comparing its predicted quality values both to the subjective test judgements used for training the model and to unknown subjective judgements.
2. INFLUENCE OF VIDEO CONTENT ON PERCEIVED QUALITY

A subjective video quality test series was conducted to develop the content-aware parametric video quality model. The test was conducted for the HD video format on 10 video contents of 16 s duration each. The 10 video contents used for the tests are representative of various TV programs. They differ in their amount of detail and in the complexity of their structures and movements: A: Movie trailer, B: Interview, C: Soccer, D: Movie, E: Music video, F: Dance, G: Boxing bag, H: Stockholm shooting, I: River, and J: Computer-generated film. The 10 source video files were processed using H.264 encoding at different bit-rates (2, 4 and 16 Mbps). The choice of bit-rates was based on the results of a former study described in [10], which showed that the perceived quality saturates at 16 Mbps for HD and H.264 encoding. Uniform packet loss was inserted at various packet loss rates (0% to 4%). Two types of packet loss concealment were used: freezing or slicing. For slicing, we had one slice per macro-block line, and zero-motion concealment was used as the concealment strategy. Six anchor conditions were present in the test, covering the entire quality range presented in the test, both in terms of quality levels and possible perceptual effects, leading to an overall number of 16 test conditions. An "absolute category rating" procedure was used for collecting the subjective video quality judgements: 24 subjects rated the quality using the continuous 11-point quality scale according to ITU-T Recommendation P.910. The uncompressed original video was used as a hidden reference in the test. In this paper, we analyze only the results for the conditions without transmission errors.

Figure 1 shows the subjective judgements of the video sequences as a function of the bit-rate for the 10 video contents. We observe that the video content clearly has an influence on the perceived quality, leading to a variation of up to 4 points on the 11-point scale for a given bit-rate. This influence is strongest at the lowest bit-rate. The soccer, river, dance, boxing bag and movie videos seem to be particularly sensitive to the bit-rate, which can be explained by the presence of fine and complex spatial structures such as grass and streaming water, or by the presence of details or fast, complex movements in the case of the dance, boxing bag and movie videos. Thus, parameters need to be introduced into our video quality model to adjust the predicted video quality as a function of the content. To this aim, we first identify spatio-temporal descriptors of the content that can be extracted from the bitstream at the user side.

Fig. 1. Mean subjective video quality as a function of the bit-rate for the 10 video contents.
3. VIDEO CONTENT FEATURES

3.1. Spatio-temporal complexity measures

A list of spatio-temporal features relevant for describing the spatio-temporal complexity of video content has been provided in [5]. These measures are summarized below:

- QP: quantization parameter per macro-block, averaged (QPa) over each I-frame. The QP determines how many bits are allocated to each block: if an I-frame contains blocks with high QP values, the quantization is coarse and the I-frame is coarsely encoded, leading to blocking artefacts.
- AC: average (ACa) and standard deviation (ACs) of the 15 AC transform coefficients over the whole I-frame. AC indicates the presence of details and complex structures.

- MV: average (MVa) of the magnitude of the motion vectors, and standard deviation of their horizontal (MVsX) and vertical (MVsY) components over P- and B-frames. MV gives information on the magnitude and the complexity of the movement.

- MBtype: number of macro-blocks per P- and B-frame, per macro-block type. The type of a macro-block depends on the number of motion vectors required for representing it. Macro-block types are linked to the complexity of the movement. Since this parameter set did not show a high correlation with subjective quality, it will not be considered further in the following.

By taking the means and the standard deviations of these measures over the whole video sequence (16 s each), we derive 15 spatio-temporal features: two from QP, four from AC, four from MV and five from MBtype. These 15 spatio-temporal features are computed for each test case (3 bit-rates) and each video content (10 contents), leading to 30 samples for each of the 15 spatio-temporal features. The bit-rate is added as a 16th variable and the mean subjective video quality (VQ) as a 17th variable, in order to indicate which PCA component the video quality is most correlated with.

3.2. Principal Component Analysis

The dimensionality of this space is reduced to three dimensions by performing a principal component analysis (PCA) with VARIMAX rotation on the z-score-normalized samples. We observe that the first, second and third dimensions are highly correlated with, respectively, the bit-rate, spatial-complexity variables such as the AC transform coefficients, and temporal-complexity variables such as the motion vector magnitudes. As a consequence, the dimensions will be called "Bit-rate", "Spatial" and "Temporal", respectively. The bit-rate parameter is highly correlated with the "Bit-rate" dimension and poorly correlated with the "Spatial" and "Temporal" dimensions. We increase this tendency by performing a step-wise two-dimensional rotation around the "Spatial" and "Temporal" dimensions. This procedure yields the highest correlation between the "Bit-rate" dimension and the bit-rate variable at an angle of 193° of rotation around the "Temporal" dimension and an angle of 5° of rotation around the "Spatial" dimension. As a result, the "Bit-rate" dimension explains the bit-rate dependency, and the two other dimensions, independent of the bit-rate, mainly explain the spatial and temporal complexity of the contents. Correlations between the initial spatio-temporal features (i.e. variables) and the three dimensions resulting from the PCA plus VARIMAX plus the subsequent 193° and 5° rotations are shown in Table 1. In this table, a high correlation coefficient between one variable and one dimension means that the variance of this variable is well explained by the dimension.
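To make this analysis step concrete, here is a minimal Python sketch of the z-score normalization, the PCA, and a standard VARIMAX rotation, assuming the 30 x 17 sample matrix has already been assembled (the input file name and the varimax implementation are ours, not the authors' code; the subsequent step-wise 193°/5° rotation is omitted):

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonal VARIMAX rotation (Kaiser) of a variables-x-factors
    loading matrix, using the usual SVD-based update."""
    p, k = loadings.shape
    R = np.eye(k)
    crit_old = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        u, s, vt = np.linalg.svd(
            loadings.T @ (L ** 3 - L @ np.diag((L ** 2).sum(axis=0)) / p))
        R = u @ vt
        crit = s.sum()
        if crit - crit_old < tol:
            break
        crit_old = crit
    return loadings @ R

# X: 30 samples (10 contents x 3 bit-rates) x 17 variables
# (15 spatio-temporal features, the bit-rate, and the mean quality VQ).
X = np.loadtxt("features.csv", delimiter=",")      # hypothetical input file
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # z-score normalization

# PCA via SVD of the normalized samples; keep three components and express
# them as variable-factor correlations (loadings), as reported in Table 1.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
loadings = Vt.T[:, :3] * S[:3] / np.sqrt(Z.shape[0] - 1)
rotated = varimax(loadings)
```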
Variables | Bit-rate | Spatial | Temporal
----------|----------|---------|---------
VQ        |   0.88   |  -0.26  |   0.17
Bitrate   |  -0.97   |  -0.002 |  -0.09
MVa aS    |  -0.32   |   0.26  |  -0.86
MVsX aS   |  -0.13   |   0.19  |  -0.97
MVsY aS   |  -0.09   |   0.30  |  -0.95
MVa sS    |  -0.30   |   0.27  |  -0.76
ACa aS    |  -0.47   |  -0.81  |   0.26
ACa sS    |  -0.31   |  -0.91  |   0.22
ACs aS    |  -0.13   |  -0.97  |   0.27
ACs sS    |   0.13   |  -0.95  |   0.14
QPa aS    |   0.78   |  -0.46  |  -0.05
QPa sS    |  -0.42   |  -0.78  |   0.06
Table 1. Correlation coefficients between the variables (rows) and factors (columns). The notations aS and sS represent, respectively, the average and the standard deviation of the variables over the whole video sequence.

The parameters bitrate (the video bit-rate), ACs aS (the average of ACs over the whole video sequence), and MVsX aS (the average of MVsX over the whole video sequence) have the highest factor loadings on, respectively, the "Bit-rate", "Spatial" and "Temporal" dimensions. Thus, they are the most promising variables to incorporate in the content-based model. We further observe that the quantization parameter averaged over each I-frame and over the whole sequence (QPa aS) is highly correlated with the "Bit-rate" dimension. Since we know from former studies [5] that this variable is a good candidate for explaining part of the quality impact of the content, we also consider it in the modeling. This choice is supported by the higher prediction performance achieved when this parameter is included. From now on, and for the sake of clarity, QPa aS, ACs aS and MVsX aS will be named QP1, AC1 and MV1.

4. VIDEO QUALITY MODEL

Modeling is done using the results of the subjective test presented in Section 2. For this test, the Mean Opinion Score (MOS) was computed as the average over all subjects, converted to the 5-point MOS scale by a linear transformation, and then transformed to the 100-point scale used in the model (see [10] for details).
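As a small worked example of this scale conversion, the sketch below averages the 11-point ratings and applies two linear mappings; the mapping constants are plausible assumptions for illustration only, since the exact transformations are specified in [10]:

```python
import numpy as np

def to_model_scale(ratings_11pt):
    """Convert per-subject ratings on the continuous 11-point scale (0..10)
    to the 100-point scale used by the model, via the 5-point MOS scale.
    Both linear mappings below are illustrative assumptions; the exact
    transformations are given in [10]."""
    mos_11 = float(np.mean(ratings_11pt))   # MOS: average over all subjects
    mos_5 = 1.0 + 4.0 * mos_11 / 10.0       # assumed linear map: 0..10 -> 1..5
    return (mos_5 - 1.0) * 100.0 / 4.0      # assumed linear map: 1..5 -> 0..100

# Example: ratings of 24 subjects for one test condition.
ratings = np.array([7.5, 8.0, 6.5, 7.0, 9.0, 8.5] * 4)
print(to_model_scale(ratings))              # 77.5 on the 100-point scale
```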
4.1. Content-blind and content-based models

The content-blind model has no access to the encoded video, as in the case of encrypted data, and thus uses only low-level information such as the bit-rate. The content-based model has access to the payload and thus uses both low-level information, like the bit-rate, and content-related parameters such as QP1, AC1 and MV1, as described in Section 3. In a least-squares curve-fitting procedure using the subjective video test results detailed in Section 2 as target values, we obtained the following "content-blind" video quality model:
Qv = Qv0 − a1 · exp(a2 · bitrate) + a3    (1)

Following the same fitting procedure, we obtained the following "content-based" video quality model:

Qv = Qv0 − a1 · exp(a2 · bitrate + a4 · QP1) + a3 + a5 · AC1 + a6 · MV1    (2)

In both cases, Qv0 = 100 is the base quality and the ai are the regression coefficients. Note that the three non-constant terms of equation (2) correspond to the three (orthogonal) dimensions of the PCA analysis. We found that the term corresponding to the spatial dimension, a5 · AC1, is negligible and did not improve the model performance. As a consequence, we have set a5 = 0. This result can be explained by the PCA analysis: as can be observed in Table 1, AC1 (ACs aS in Table 1) is highly correlated with the "Spatial" dimension, and QP1 (QPa aS in Table 1) is correlated with the "Bit-rate" dimension but also with the "Spatial" dimension. As a consequence, QP1 in equation (2) partially captures the quality impact of AC1. This assumption is confirmed by forcing a4 to 0 in equation (2): as expected, this results in a non-negligible regression coefficient a5.

4.2. Model evaluation

Model performance comparison is based on the Pearson correlation coefficient R and the root mean square error RMSE. The content-based model (R = 0.97, RMSE = 5.01) performs better than the content-blind model (R = 0.88, RMSE = 9.42) when their predicted quality values are compared with the subjective judgements of the training data set.

Both the content-blind and the content-based video quality models are also evaluated against a subjective test data set not used for training the models, the "test database". It contains the 5 original video contents A, B, C, D and E described in Section 2, as well as the compressed versions of those contents at 4 bit-rates (2, 4, 8 and 16 Mbps), using the H.264 codec. 24 naïve subjects rated both the original and the compressed video sequences using the same test method and scale as described in Section 2. Like the database used for training our models, the test database also contains conditions and ratings for various packet loss rates and packet loss concealment strategies, as well as the same anchor conditions as the training database, but the corresponding ratings are not used in the model evaluation. The MOS values were computed from the subjective ratings following the procedure described in Section 4. As in [11], the MOS values were further mapped onto the predicted values using the cubic polynomial function given by equation (3), in order to deal with potential bias effects in the test results:

MOSp = a · x^3 + b · x^2 + c · x + d    (3)

Here, MOSp are the MOS values predicted either by the content-blind model or by the content-based model. We obtain one set of coefficients (a, b, c and d) for each model type.
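To make the fitting procedure concrete, here is a hedged sketch of the least-squares fit of equations (1) and (2) (with a5 = 0) and of the cubic mapping of equation (3), using scipy.optimize.curve_fit and numpy.polyfit on placeholder data; the coefficient values, starting points and the mapping direction are illustrative assumptions, not the authors' reported results:

```python
import numpy as np
from scipy.optimize import curve_fit

QV0 = 100.0  # base quality Qv0

def content_blind(bitrate, a1, a2, a3):
    """Equation (1): quality predicted from the bit-rate only."""
    return QV0 - a1 * np.exp(a2 * bitrate) + a3

def content_based(X, a1, a2, a3, a4, a6):
    """Equation (2) with a5 = 0: bit-rate term modulated by QP1, plus MV1."""
    bitrate, qp1, mv1 = X
    return QV0 - a1 * np.exp(a2 * bitrate + a4 * qp1) + a3 + a6 * mv1

# Placeholder training data for the 30 conditions (10 contents x 3 bit-rates);
# the real targets are the subjective results of Section 2.
rng = np.random.default_rng(0)
bitrate = np.tile([2.0, 4.0, 16.0], 10)
qp1 = rng.uniform(25.0, 40.0, 30)               # stand-in for QPa aS
mv1 = rng.uniform(0.0, 5.0, 30)                 # stand-in for MVsX aS
q_subj = content_based((bitrate, qp1, mv1),
                       60.0, -0.5, 5.0, 0.02, 1.0) + rng.normal(0.0, 3.0, 30)

cb, _ = curve_fit(content_blind, bitrate, q_subj,
                  p0=(50.0, -0.5, 0.0), maxfev=20000)
cbd, _ = curve_fit(content_based, (bitrate, qp1, mv1), q_subj,
                   p0=(50.0, -0.5, 0.0, 0.01, 0.5), maxfev=20000)
q_pred = content_based((bitrate, qp1, mv1), *cbd)

# Cubic mapping of equation (3), fitted here (one common reading of [11])
# so that the raw model output aligns with the subjective scores before
# R and RMSE are computed.
a, b, c, d = np.polyfit(q_pred, q_subj, 3)
mos_p = np.polyval([a, b, c, d], q_pred)
print("R =", np.corrcoef(mos_p, q_subj)[0, 1])
```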
Fig. 2. Performance of the content-blind and content-based video quality models on unknown data (estimated versus subjective video quality; content-blind: R = 0.91, RMSE = 7.02; content-based: R = 0.93, RMSE = 6.53).

As shown in Figure 2, the content-based model (R = 0.93, RMSE = 6.53) also performs better than the content-blind model (R = 0.91, RMSE = 7.02) when their predicted quality values are compared with the subjective judgements of the unknown data set.

4.3. Discussion

We believe that the model performance can be further improved by adapting the computation of the spatio-temporal features. Indeed, the features are so far averages or standard deviations, taken over the whole sequence, of spatio-temporal measures such as ACa and ACs. As illustrated in Figure 3, this is not appropriate when the video sequence is not homogeneous and contains several scenes with different spatio-temporal characteristics, i.e. with different amounts of detail and movement complexity. Figure 3 shows, for each I-frame of content E encoded at 16 Mbps, the average (ACa) of the AC transform coefficients over the I-frame. Content E contains 4 homogeneous scenes, reflected in the 4 areas circled in red. Low ACa values, as in the 1st and 3rd scenes, mean low spatial complexity, while higher ACa values, as in the 2nd and 4th scenes, represent more spatially complex scenes.
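For concreteness, here is a minimal sketch of how a per-I-frame ACa value could be computed once the 4x4 transform coefficients have been parsed from the bitstream; the input layout and the use of coefficient magnitudes are assumptions, since the extraction details are tool-specific and not given in the paper:

```python
import numpy as np

def aca_per_iframe(iframe_blocks):
    """Average of the 15 AC transform coefficients over one I-frame.

    iframe_blocks: array of shape (num_blocks, 16) with the 4x4 integer
    transform coefficients of each block in zig-zag order; index 0 is the
    DC coefficient, indices 1..15 are the AC coefficients. Both the layout
    and the use of coefficient magnitudes are illustrative assumptions.
    """
    return float(np.abs(iframe_blocks[:, 1:]).mean())

# ACa series over the I-frames of a sequence, as plotted in Figure 3.
# 1920 x 1080 / (4 x 4) = 129600 blocks per frame (placeholder data).
iframes = [np.random.randint(-64, 64, (129600, 16)) for _ in range(10)]
aca_series = [aca_per_iframe(blocks) for blocks in iframes]
```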
Fig. 3. Average of the AC transform coefficients over each I-frame - Content E - 16 Mbps.

For capturing the difference in spatial and temporal complexity between scenes at the video-sequence level, one approach is to first average the spatio-temporal measures over each homogeneous scene, and then average the resulting means, with different weights, over the whole video sequence (a sketch of such a pooling is given below). The use of weights is required in order to take into account the different quality impact of the various scenes. Indeed, based on former studies [12], we can expect that low-quality scenes have a higher impact on the overall quality than high-quality scenes. As we have seen for video contents consisting of either high-complexity scenes only or low-complexity scenes only, high-complexity scenes yield worse quality ratings than low-complexity scenes for a given bit-rate. Therefore, they should receive higher weights.
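Below is a minimal sketch of this scene-weighted pooling, assuming the scene boundaries are already known (e.g. from a scene-cut detector); the power-law weighting that emphasizes high-complexity scenes is an illustrative assumption, not the weighting the authors propose:

```python
import numpy as np

def scene_weighted_feature(per_frame_values, scene_bounds, alpha=2.0):
    """Pool a per-I-frame measure (e.g. ACa) over a non-homogeneous sequence.

    per_frame_values: 1-D array of the measure, one value per I-frame.
    scene_bounds: list of (start, end) I-frame indices of homogeneous scenes.
    alpha: exponent controlling how strongly high-complexity (expected
           low-quality) scenes dominate the pooled value; an assumption.
    """
    scene_means = np.array([per_frame_values[s:e].mean()
                            for s, e in scene_bounds])
    # Higher-complexity scenes are expected to yield worse quality for a
    # given bit-rate, so they receive larger weights.
    weights = scene_means ** alpha
    weights /= weights.sum()
    return float(np.sum(weights * scene_means))

# Example: the four scenes of content E (boundaries are illustrative).
aca = np.random.uniform(5.0, 30.0, 400)      # stand-in for per-I-frame ACa
scenes = [(0, 100), (100, 200), (200, 300), (300, 400)]
print(scene_weighted_feature(aca, scenes))
```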
5. CONCLUSION

We presented a content-based parametric video quality model for IPTV services, which predicts the perceived quality of H.264-encoded HD video sequences at the user side. We were able to identify two spatio-temporal parameters extracted from the encoded video, which have successfully been used in the model for improving its performance. The next step is the improvement of the computation of the spatio-temporal features over non-homogeneous video sequences. We will also extend our analysis to the prediction of the perceived video quality under packet loss and to the Standard Definition (720x576 pixels) video format. In addition, we will detail the comparison with other approaches from the literature in terms of content parameters and video quality model performance.

6. ACKNOWLEDGEMENTS

This study is supported by the T-V-Model project at Deutsche Telekom Laboratories. Special thanks to Peter List for the content features extraction and to Bernhard Feiten for fruitful discussions.

7. REFERENCES

[1] S. Péchard, D. Barba, and P. Le Callet, "Video quality model based on a spatio-temporal features extraction for H.264-coded HDTV sequences," in Proc. of PCS, 2007.

[2] Y. Liu, R. Kurceren, and U. Budhia, "Video classification for video quality prediction," Journal of Zhejiang University Science A, 2006.

[3] M. Ries, C. Crespi, O. Nemethova, and M. Rupp, "Content-based video quality estimation for H.264/AVC video streaming," in Proc. of the Wireless Communications and Networking Conference, 2007.

[4] A. Khan, L. Sun, and E. Ifeachor, "Content clustering based video quality prediction model for MPEG4 video streaming over wireless networks," in Proc. of ICC, 2009.

[5] M.N. Garcia, A. Raake, and P. List, "Towards content-related features for parametric video quality prediction of IPTV services," in Proc. of ICASSP, 2008.

[6] M. Ghanbari, "Video quality measurement," Patent WO 2004/054274, 24 June 2004.

[7] T. Brandao and M.P. Queluz, "No-reference PSNR estimation algorithm for H.264 encoded video sequences," in Proc. of EUSIPCO, 2008.

[8] S. Wolf and M.H. Pinson, "Spatial-temporal distortion metric for in-service quality monitoring of any digital video system," in Proc. of SPIE, MSAII, 1999.

[9] Q. Huynh-Thu and M. Ghanbari, "Scope of validity of PSNR in image/video quality assessment," IEEE Electronics Letters, 2008.

[10] A. Raake, M.N. Garcia, S. Möller, J. Berger, F. Kling, P. List, J. Johann, and C. Heidemann, "T-V-Model: Parameter-based prediction of IPTV quality," in Proc. of ICASSP, 2008.

[11] VQEG, "Final report from the Video Quality Experts Group on the validation of objective models of multimedia quality assessment, phase I," Tech. Rep., VQEG, 2008.

[12] B. Weiss, S. Möller, A. Raake, J. Berger, and R. Ullmann, "Modeling call quality for time-varying transmission characteristics using simulated conversational structures," Acta Acustica united with Acustica, 2009.