Modeling Rate and Perceptual Quality of Scalable Video as Functions of Quantization and Frame Rate and Its Application in Scalable Video Adaptation (Invited Paper)

Yao Wang, Zhan Ma, Yen-Fu Ou
Dept. of Electrical and Computer Engineering, Polytechnic Institute of NYU, Brooklyn, NY 11201, U.S.A.
Email: [email protected], {zma03, you01}@students.poly.edu

Abstract—This paper investigates the impact of frame rate and quantization on the bit rate and perceptual quality of scalable video with temporal and quality scalability. We propose a rate model and a quality model, both in terms of the quantization stepsize and frame rate. The quality model is derived from our earlier quality model in terms of the PSNR of decoded frames and the frame rate. Both models are developed based on the key observation from experimental data that the relative reduction of either rate or quality when the frame rate decreases is quite independent of the quantization stepsize. This observation enables us to express both rate and quality as the product of separate functions of the quantization stepsize and frame rate, respectively. The proposed rate and quality models are analytically tractable, each requiring only two content-dependent parameters. Both models fit the measured data very accurately, with high Pearson correlation. We further apply these models to rate-constrained bitstream adaptation, where the problem is to determine the optimal combination of quality and temporal layers that provides the highest perceptual quality under a given bandwidth constraint.

Index Terms—Rate prediction, video quality metric, scalable video adaptation, scalable video coding (SVC)
I. INTRODUCTION

Scalable video coding (SVC) refers to coding a video into an embedded bit stream that has a high quality when completely decoded, and a lower quality when the bit stream is truncated. When a video is coded into a scalable stream with spatial, temporal, and amplitude¹ scalability, the same video content may be delivered with varying frame rate, frame size, or quantization stepsize, depending on the sustainable transmission rate, display resolution, and battery status (for battery-powered devices) at the receiver. Scalable video is particularly attractive for video multicast, where receivers of the same video often have different sustainable transmission rates with the server and varying decoding and display capabilities. Even for unicast, SVC allows the server to store just one bitstream, but send different portions of the stream to receivers with different bandwidth and energy resources.

¹ Amplitude scalability as defined here is conventionally known as quality or SNR scalability. To avoid confusion with the overall perceptual quality of a video at different resolutions, we use the term amplitude scalability.
To deliver a pre-coded scalable bitstream to heterogeneous receivers with varying bandwidth constraints, either the sender or a transcoder at a proxy needs to extract from the original bitstream certain spatial, temporal, and amplitude layers to meet the bandwidth constraint of a particular receiver (or a group of receivers with similar rate constraints). This problem is generally known as rate-constrained bit stream adaptation. For a given target bit rate, one may choose to extract the layers leading to a high frame rate and large frame size but low quality in each decoded frame (noticeable coding artifacts), or a low frame rate and small frame size but high quality per frame, or other combinations of spatial, temporal, and amplitude resolutions. Different combinations are likely to yield different perceptual quality. A major challenge for deploying scalable video lies in how to perform the adaptation efficiently, while maximizing the perceptual quality. The latest scalable video coding (SVC) standard [1] enables lightweight bitstream manipulation [2] and also provides state-of-the-art coding performance [3], owing to its network-friendly interface design and the efficient compression schemes inherited from H.264/AVC [4]. However, before SVC video can be widely deployed in practical applications, efficient mechanisms for SVC stream adaptation to meet different user constraints need to be developed. Optimal adaptation requires accurate prediction of the perceived quality as well as the total rate at any combination of spatial, temporal, and amplitude (STA) resolutions. Although much work has been done on perceptual quality modeling and on rate modeling for single-layer video or video with amplitude scalability only, the impact of spatial and temporal resolutions, together with amplitude resolution, individually and jointly, on the perceptual quality and rate has not been studied extensively. Recently, several studies have examined the influence of spatial, temporal, and amplitude resolutions, individually or jointly, on the perceptual quality [5], [6], [7], [8]. However, some of these models require many parameters, or have limited accuracy. To the best of our knowledge, none of the prior work on scalable video adaptation has attempted to predict the rates corresponding to different layer combinations. Rather, these studies make use of the actual rates associated
with different layers. Without analytical rate models, the optimal layer combination has to be found through exhaustive search, to see which combination leads to the highest rate-quality slope while meeting the rate constraint. In certain applications, the adaptation decision needs to be made at the receiver and fed back to the server. In such situations, the rates associated with all possible layer combinations have to be delivered to the receiver, requiring extra bandwidth and delay. Having an accurate rate model, together with an accurate quality model, would enable one to determine the optimal STA combination for a given rate constraint efficiently.

In this paper, we focus on modeling the impact of temporal and amplitude resolutions (in terms of frame rate and quantization stepsize, respectively) on both rate and quality. We further apply these models to solve the rate-constrained SVC adaptation problem, assuming the spatial resolution is determined based on other considerations (e.g., the display size of the receiver). We defer the consideration of the spatial resolution to future study.

Our quality model relates the perceptual quality with the quantization stepsize and frame rate. It is derived from our prior work, which uses the product of a metric that assesses the quality of a quantized video at the highest frame rate, based on the PSNR of decoded frames, and a temporal correction factor for quality (TCFQ), which reduces the quality assigned by the first metric according to the actual frame rate. In the quality model proposed here, we replace the first term by a metric that relates the quality of the highest frame rate video with the quantization stepsize. Each term has a single parameter, and the overall model is shown to fit the subjective ratings very well, with an average Pearson correlation of 0.984 over four test sequences.

Our rate model predicts the rate from the quantization stepsize and frame rate. It also uses the product of a metric that describes how the rate changes with the quantization stepsize when the video is coded at the highest frame rate, and a temporal correction factor for rate (TCFR), which corrects the rate predicted by the first metric based on the actual frame rate. As with the quality model, it has only two parameters and fits the measured rates of decoded SVC video from different temporal and amplitude layers very accurately (with an average Pearson correlation of 0.998 over four sequences).

In the remainder of this paper, we present the proposed rate model in Sec. II and the quality model in Sec. III. Using these two models, we address the problem of rate-constrained bit stream adaptation in Sec. IV. Sec. V concludes the paper.

II. RATE MODEL

In this section, we develop a rate model R(q, t), which relates the rate R with the quantization stepsize q and frame rate t. To the best of our knowledge, no prior work has considered the joint impact of frame rate and quantization on the bit rate. However, several prior works have considered rate modeling for non-scalable video, and have proposed models that relate the average bit rate to the quantization stepsize q.
Ding and Liu reported the following model [9],

$R = \frac{\theta}{q^{\gamma}},$  (1)
where θ and γ are model parameters, with 0 ≤ γ ≤ 2. Chiang and Zhang [10] suggested the following model

$R = \frac{A_1}{q} + \frac{A_2}{q^{2}}.$  (2)
This so-called quadratic rate model has been used for rate control in the MPEG-4 reference encoder [11]. We note that by choosing A1 and A2 appropriately, the model in (2) can realize the inverse power model of (1) with any γ ∈ (1, 2). Only the quadratic term was included in the model by Ribas-Corbera and Lei [12], i.e.,

$R = \frac{A}{q^{2}}.$  (3)
More recently, He [13] proposed the ρ-model,

$R(QP) = \theta\,\big(1 - \rho(QP)\big),$  (4)
with ρ denoting the percentage of zero quantized transform coefficients at a given quantization parameter. This model has been shown to have high accuracy for rate prediction. A problem with the ρ-model is that it does not provide an explicit relation between QP and ρ. Therefore, it does not lend itself to theoretical understanding of the impact of QP on the rate.

In our work on rate modeling, we focus on the impact of the frame rate t on the bit rate R, under the same quantization stepsize q, while using prior models to characterize the impact of q on the rate when the video is coded at a fixed frame rate. Towards this goal, we recognize that R(q, t) can be written as

$R(q, t) = R_{max}\, R_q(q; t_{max})\, R_t(t; q),$  (5)
where Rmax = R(qmin, tmax) is the maximum bit rate obtained with a chosen minimal quantization stepsize qmin and a chosen maximum frame rate tmax;

$R_q(q; t_{max}) = \frac{R(q, t_{max})}{R(q_{min}, t_{max})}$

is the normalized rate vs. quantization stepsize (NRQ) under the maximum frame rate tmax, and

$R_t(t; q) = \frac{R(q, t)}{R(q, t_{max})}$

is the normalized rate vs. temporal resolution (NRT) under the same quantization stepsize q. Note that the NRQ function Rq(q; tmax) describes how the rate decreases when the quantization stepsize q increases beyond qmin, under the frame rate tmax, while the NRT function Rt(t; q) characterizes how the rate reduces when the frame rate decreases from tmax, under the same quantization stepsize q. We also call Rt(t; q) the temporal correction factor for rate (TCFR), as it describes how to correct the rate estimate Rmax Rq(q; tmax) based on the actual temporal resolution. As will be shown later by experimental data, the impact of q and t on the bit rate is in fact separable, so that Rt(t; q) can be represented by a function of t only, denoted by Rt(t), and Rq(q; t) by a function of q only, denoted by Rq(q).
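To make the normalization concrete, the following sketch computes the NRT and NRQ points of Eq. (5) from a table of measured layer rates; the rate values used here are placeholders for illustration, not the measurements reported below.

```python
# Sketch: computing NRT and NRQ points from a table of measured rates.
# The rate values below are placeholders, not the measurements reported in this paper.

q_values = [16, 26, 40, 64, 104]        # quantization stepsizes (q_min = 16)
t_values = [1.875, 3.75, 7.5, 15, 30]   # frame rates in Hz (t_max = 30)

# R[(q, t)]: measured bit rate in kbps for each (stepsize, frame rate) layer combination
R = {(q, t): 1000.0 * (16.0 / q) ** 1.2 * (t / 30.0) ** 0.5   # placeholder values
     for q in q_values for t in t_values}

q_min, t_max = min(q_values), max(t_values)

# NRT points: R_t(t; q) = R(q, t) / R(q, t_max), one curve per q
nrt = {q: [R[(q, t)] / R[(q, t_max)] for t in t_values] for q in q_values}

# NRQ points: R_q(q; t) = R(q, t) / R(q_min, t), one curve per t
nrq = {t: [R[(q, t)] / R[(q_min, t)] for q in q_values] for t in t_values}

# If the NRT curves for different q (and the NRQ curves for different t) nearly
# coincide, the impact of q and t on the rate is separable as in Eq. (5).
```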
To see how quantization and frame rate respectively influence the bit rate, we encoded several test videos using the SVC reference software JSVM912 [14] and measured the actual bit rates corresponding to different q and t. Specifically, four video sequences, "akiyo", "city", "crew", and "football", all in CIF (352×288) resolution, are encoded into 5 temporal layers using the dyadic hierarchical prediction structure, with frame rates of 1.875, 3.75, 7.5, 15, and 30 Hz, respectively, and each temporal layer contains 5 CGS layers obtained with quantization parameters (QP) of 44, 40, 36, 32, and 28.² Using the H.264 mapping between q and QP, $q = 2^{(QP-4)/6}$, the corresponding quantization stepsizes are 104, 64, 40, 26, and 16, respectively. The bit rates of all layers are collected and normalized by the rate at the highest frame rate, i.e., tmax = 30 Hz, to find the NRT points Rt(t; q) = R(q, t)/R(q, tmax) for all t and q considered, which are plotted in Fig. 1. As shown in Fig. 1, the NRT curves obtained with different quantization stepsizes overlap with each other and can be captured by a single curve quite well. Similarly, the NRQ curves Rq(q; t) = R(q, t)/R(qmin, t) for different frame rates t are also almost invariant with the frame rate t, as shown in Fig. 2. These observations suggest that the effects of q and t on the bit rate are separable, i.e., the quantization-induced rate variation is independent of the frame rate and vice versa. Therefore, the overall rate modeling problem is divided into two parts: one is to devise an appropriate functional form for Rt(t), so that it can model the measured NRT points for all q in Fig. 1 accurately; the other is to derive an appropriate functional form for Rq(q) that can accurately model the measured NRQ points in Fig. 2 for t = tmax. Note that, in fact, the Rq(q) model fits the NRQ points obtained at all different t. We assume that Rmax can be easily measured by coding a video at the chosen qmin and tmax. Generally, for given qmin and tmax, Rmax depends on the video content. The modeling of the relation of Rmax with the video content is beyond the scope of this paper. The derivation of the models Rq(q) and Rt(t) is explained in detail as follows.

² Different from the JSVM default configuration, which uses different QPs for different temporal layers, the same QP is applied to all temporal layers at each CGS layer.

Fig. 1. Normalized rate vs. temporal resolution (NRT) using different quantization stepsizes (q). Points are measured rates, curves are predicted rates by the model of Eq. (6).

Fig. 2. Normalized rate vs. quantization stepsize (NRQ) using different frame rates t. Points are measured rates, curves are predicted rates by the model of Eq. (7).

A. Model for the Temporal Correction Factor for Rate (TCFR) Rt(t)

As explained earlier, Rt(t) is used to describe the reduction of the normalized bit rate as the frame rate reduces. Therefore, the desired property of the Rt(t) function is that it should be 1 at t = tmax and monotonically reduce to 0 at t = 0. Based on the measurement data in Fig. 1, we choose a power function, i.e.,

$R_t(t) = \left(\frac{t}{t_{max}}\right)^{b}.$  (6)

Figure 1 shows the model curve using this function along with the measured data. The parameter b is obtained by minimizing the squared error between the modeled and measured rates. It can be seen that the model fits the measured data points very well. We also tried some other functional forms, including logarithmic and inverse falling exponential functions, and found that the power function yields the least fitting error.

B. Model for Normalized Rate vs. Quantization Rq(q)

Analogous to the Rt(t) function, Rq(q) is used to describe the reduction of the normalized bit rate as the quantization stepsize increases at a fixed frame rate. The desired property of the Rq(q) function is that it should be 1 at q = qmin and monotonically reduce to 0 as q goes to infinity. Based on the measurement data in Fig. 2, we choose an inverse power function, i.e.,

$R_q(q) = \left(\frac{q}{q_{min}}\right)^{-a}.$  (7)

Figure 2 shows the model curve using this function along with the measured data. It can be seen that the model fits the measured data points very well. The parameter a characterizes how fast the bit rate reduces when q increases. Interestingly, all four test sequences have very similar a values. We also tried some other functional forms, including a falling exponential, and found that the inverse power function yields the least fitting error. We note that the model in (7) is consistent with the model proposed by Ding and Liu [9], i.e., Eq. (1), for non-scalable video, where the corresponding parameter was found to be in the range of 0 to 2.

C. The Overall Rate Model

Combining Eqs. (5), (6), and (7), we propose the following rate model

$R(q, t) = R_{max} \left(\frac{q}{q_{min}}\right)^{-a} \left(\frac{t}{t_{max}}\right)^{b},$  (8)
where qmin and tmax should be chosen based on the underlying application, Rmax is the actual rate when coding the video at qmin and tmax, and a and b are the model parameters. The actual rate data of all test sequences with different combinations of q and t, and the corresponding rates estimated by the proposed model (8), are illustrated in Fig. 3; we note that the model predictions fit the experimental rate points very well. The model parameters, a and b, are obtained by minimizing the root mean squared error (RMSE) between the measured and predicted rates corresponding to all q and t. Table I lists the parameter values. Also listed are the fitting error in terms of the relative RMSE (RMSE/Rmax) and the Pearson correlation (PC) between measured and predicted rates, defined as

$r_{xy} = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n\sum x_i^2 - (\sum x_i)^2}\,\sqrt{n\sum y_i^2 - (\sum y_i)^2}},$  (9)

where xi and yi are the measured and predicted rates, and n is the total number of available samples. We see that the model is very accurate for all four sequences, with very small relative RMSE and very high PC.

Fig. 3. Experimental rate points and predicted rates using the rate model (8).

TABLE I
PARAMETERS FOR THE RATE MODEL AND MODEL ACCURACY

              akiyo     city      crew      football
a             1.213     1.194     1.234     1.128
b             0.473     0.484     0.671     0.739
RMSE/Rmax     1.54%     1.67%     1.25%     1.54%
PC            0.9985    0.9977    0.9989    0.9983

Note that parameter a characterizes how fast the bit rate reduces when q increases, with a larger a indicating a faster drop. Interestingly, all four test sequences have quite similar a values, which implies that a is almost independent of the video content. When we set a = 1.2 for all four sequences, we still get quite accurate rate prediction. In practice, in order to avoid the estimation or specification of the parameter a, it may be preferable to use a fixed value for a. Parameter b indicates how fast the rate drops when the frame rate decreases, with a larger b indicating a faster drop. As expected, the "football" sequence, which has higher motion, has the largest b, and "akiyo" has the smallest. In scalable video adaptation, where a full-resolution scalable stream has already been generated, one can easily derive the model parameters from the rates corresponding to several different (t, q) combinations using least squares fitting. In applications requiring estimation of the model parameters from the original video sequence (e.g., for encoder optimization), it will be important to characterize the relation between a, b, and some content features. Study of the relation between the model parameters and the video content will be a subject of our future research.
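As a concrete illustration of Eqs. (8) and (9), the sketch below fits the parameters a and b to a set of measured layer rates by minimizing the squared error between measured and predicted rates, and then reports the relative RMSE and the Pearson correlation. The rate values in the sketch are placeholders, not the measurements behind Table I.

```python
# Sketch: fitting the rate model R(q, t) = R_max * (q/q_min)^(-a) * (t/t_max)^b  (Eq. 8)
# to measured layer rates, and evaluating the Pearson correlation of Eq. (9).
# The rate values below are placeholders, not the measurements reported in Table I.
import numpy as np
from scipy.optimize import curve_fit

q_min, t_max = 16.0, 30.0
q = np.array([16, 26, 40, 64, 104, 16, 26, 40, 64, 104], dtype=float)
t = np.array([30, 30, 30, 30, 30, 15, 15, 15, 15, 15], dtype=float)
r_measured = np.array([900, 520, 330, 210, 120, 640, 370, 235, 150, 85], dtype=float)  # kbps, placeholder

R_max = r_measured[(q == q_min) & (t == t_max)][0]

def rate_model(qt, a, b):
    qq, tt = qt
    return R_max * (qq / q_min) ** (-a) * (tt / t_max) ** b

# Least squares fit of (a, b): minimizes the squared error between measured and predicted rates
(a, b), _ = curve_fit(rate_model, (q, t), r_measured, p0=(1.2, 0.5))

r_predicted = rate_model((q, t), a, b)
pc = np.corrcoef(r_measured, r_predicted)[0, 1]                       # Pearson correlation, Eq. (9)
rel_rmse = np.sqrt(np.mean((r_measured - r_predicted) ** 2)) / R_max  # RMSE / R_max

print(f"a = {a:.3f}, b = {b:.3f}, PC = {pc:.4f}, RMSE/R_max = {100 * rel_rmse:.2f}%")
```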
III. QUALITY MODEL

There are several published works examining the impact of either frame rate alone, or of both frame rate and quantization artifacts, on the perceptual quality. Quan and Ghanbari [7] consider the impact of both regular and irregular frame drops and examine the jerkiness and jitter effects caused by different levels of strength, duration, and distribution of the temporal impairment. Besides studies of the frame rate impact on perceptual quality, Feghali et al. proposed a video quality metric [6], [8] covering both frame rate and quantization effects. Their metric uses a weighted sum of two terms: one is the PSNR of the sequence interpolated from the original low frame-rate video, the other is the frame-rate reduction. The weight depends on the motion attributes of the sequence. The work in [15] extended that of [8] by employing a different motion feature in the weight. The authors of [5] propose to use computational models that emulate human visual perception based on block fidelity, content richness, spatial-textural and color features, and temporal masking. Although these models have been shown to correlate well with subjective quality, they require significant computation.

Our quality model is extended from our earlier work [16]. Like the rate model, we focus on examining the impact of the frame rate on the quality, under the same quantization stepsize, while using prior models to characterize the impact of the quantization stepsize q on the quality when the video is coded at a fixed frame rate. The proposed model is written generally as
$Q(q, t) = Q_{max}\, Q_q(q; t_{max})\, Q_t(t; q),$  (10)

where Qmax = Q(qmin, tmax),

$Q_q(q; t_{max}) = \frac{Q(q, t_{max})}{Q(q_{min}, t_{max})}$

is the normalized quality vs. quantization stepsize (NQQ) under the maximum frame rate tmax, and

$Q_t(t; q) = \frac{Q(q, t)}{Q(q, t_{max})}$

is the normalized quality vs. temporal resolution (NQT) under the same quantization stepsize q. Note that Qmax Qq(q; tmax) models the impact of quantization on the quality when the video is coded at the highest frame rate tmax, while Qt(t; q) describes how the quality reduces when the frame rate reduces, under the same q. In other words, Qt(t; q) corrects the quality predicted by Qmax Qq(q; tmax) based on the actual frame rate, and for this reason it is also called the temporal correction factor for quality (TCFQ).

To derive the appropriate functional forms for Qq(q; tmax) and Qt(t; q), we conducted subjective tests to obtain mean opinion scores (MOS) for the same test sequences used for deriving the rate model, but the subjective tests were performed only for 64 decoded sequences, at frame rates of 30, 15, 7.5, and 3.75 Hz, and QP equal to 28, 36, 40, and 44 (corresponding to quantization stepsizes of 16, 40, 64, and 104, respectively). The subjective quality assessment was carried out using a protocol similar to ACR-HR (Absolute Category Rating with Hidden Reference) described in [17]. In the test, a subject is shown one video at a time and provides an overall rating after each clip is played completely. The rating scale ranges from 0 (worst) to 100 (best). There are on average 20 ratings for each processed video sequence. Details about the subjective tests can be found in [16].

To see how the normalized quality ratings Qq(q; t) and Qt(t; q) vary with q and t, respectively, Figures 4 and 5 show the measured data from our subjective tests. Unlike the rate data, where the effects of the quantization stepsize q and frame rate t are quite separable, there are noticeable interactions between t and q in their impact on the perceptual quality. This interaction is in fact well known, but not well understood. However, as seen in Fig. 4, the effect of q on the NQT curves Qt(t; q) is inconsistent and relatively small. These variations may also be due in part to viewer inconsistency during the subjective tests. To reduce the model complexity, we choose to model the Qt(t; q) curves by a function of t only, denoted by Qt(t). For the model of Qq(q; tmax), we use only the measured NQQ data at the frame rate tmax.

In [16], we used the inverted exponential function for the NQT function, i.e.,

$Q_t(t) = \frac{1 - e^{-d\, t/t_{max}}}{1 - e^{-d}}.$  (11)

The model curve is shown along with the measured NQT points in Fig. 4. We see that the model is quite accurate.

Fig. 4. Normalized quality against frame rate, for different quantization stepsizes q. Points are measured data, the curves are based on the model in Eq. (11).

To model the variation of the perceptual quality with quantization when the video is coded at the fixed frame rate tmax, in our earlier work [16] we assumed that, under the same quantization stepsize q, the PSNR of decoded frames at frame rate tmax would be similar to the PSNR of decoded frames at a reduced frame rate t, so we used the PSNR computed at frame rate t to estimate the quality of the video coded at tmax. Based on the prior work in [18], we used a sigmoidal function with two parameters to relate the PSNR with the perceptual quality. In the current work, based on the measured NQQ points Qq(q; tmax) shown in Fig. 5, we propose to use an exponential function to capture the quality variation with q at the highest frame rate tmax, i.e.,

$Q_q(q) = e^{c}\, e^{-c\, q/q_{min}},$  (12)

with c as the model parameter. Compared with the original two-parameter sigmoid function proposed in [16], the single-parameter exponential function is simpler and easier to analyze. Comparing the measured and predicted quality ratings shown in Fig. 5, we see that the model captures the quantization-induced quality variation very well at the highest frame rate.
Fig. 5. Normalized quality versus the quantization stepsize for different frame rates t. Points are measured data and the curve is the predicted quality for t = tmax = 30 Hz, using Eq. (12).

Combining Eqs. (10), (11), and (12), the overall video quality model can be expressed as

$Q(q, t) = Q_{max}\, \frac{e^{-c\, q/q_{min}}}{e^{-c}} \cdot \frac{1 - e^{-d\, t/t_{max}}}{1 - e^{-d}}.$  (13)

Note that Qmax is the MOS given for the video at qmin and tmax. Generally, this value can be estimated by some preliminary subjective tests. In our subjective tests, the ratings are given in the range of 0 to 100, but viewers seldom give a rating of 100, even for very high quality video, as is commonly observed in subjective tests. What is surprising and fortunate is that the MOS values for the videos coded at qmin and tmax are very close to each other for all four test sequences, about 89, as shown in Fig. 7. Therefore, we set Qmax to 89 in our model. Note that on the more common MOS scale of 1 to 5, a rating of 89 out of 100 would correspond to a MOS of 0.89 × 4 + 1 = 4.56.

Figure 6 compares the measured quality ratings and those predicted by the model in (13). The parameters c and d are obtained by least squares fitting. Table II summarizes the parameters and the model accuracy in terms of RMSE and Pearson correlation (PC) values for the four sequences. Overall, the proposed model, with only two content-dependent parameters, predicts the MOS very well for the sequences "Akiyo" and "Crew", with a very high PC (> 0.99). The model is less accurate for "Football" and "City", but still has a quite high PC. We would like to point out that the measured MOS data for these two sequences do not follow a consistent trend at some quantization levels, which may be due to the limited number of viewers participating in the subjective tests.

Fig. 6. Video quality model (13) in terms of quantization stepsize and frame rate; the discrete points are the measured MOS data for different quantization stepsizes.

TABLE II
PARAMETERS FOR THE QUALITY MODEL AND MODEL ACCURACY

              akiyo     city      crew      football
c             0.11      0.13      0.17      0.08
d             8.03      7.35      7.34      5.38
RMSE/Qmax     1.55%     1.67%     1.25%     1.54%
PC            0.991     0.967     0.994     0.982

Note that parameter c indicates how fast the quality drops with increasing q, with a larger c suggesting a faster drop. On the other hand, parameter d reveals how fast the quality reduces as the frame rate decreases, with a smaller d corresponding to a faster drop. Our prior work [16], [19] has shown that parameter d is closely related to some motion attributes of the video. Derivation of the model parameters from the original or coded video is a subject of our future study.

We note that the quality model parameters depend very much on the underlying viewers. In our current study, the model is derived based on MOS obtained from a relatively large group of viewers, and hence is meant to characterize an "average viewer". Such models are useful when one designs a video system to optimize the perceptual quality for all potential viewers. For any particular viewer, parameters c and d are likely to depend on the viewer's sensitivities to quantization artifacts and motion jerkiness, respectively. In order to optimize for an individual user's perceptual quality, the model parameters should be determined based on both the video content and some viewer attributes. This is discussed further in Sec. IV.
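For illustration, the sketch below evaluates the quality model of Eq. (13) and fits the parameters c and d to a small set of MOS values by least squares fitting, with Qmax set to 89 as above. The MOS values in the sketch are placeholders rather than our subjective test data.

```python
# Sketch: evaluating the quality model of Eq. (13) and fitting (c, d) to MOS data.
# The MOS values below are placeholders, not the subjective ratings used in this paper.
import numpy as np
from scipy.optimize import curve_fit

q_min, t_max, Q_max = 16.0, 30.0, 89.0

def quality_model(qt, c, d):
    qq, tt = qt
    nqq = np.exp(-c * qq / q_min) / np.exp(-c)                    # Eq. (12), equals 1 at q = q_min
    tcfq = (1.0 - np.exp(-d * tt / t_max)) / (1.0 - np.exp(-d))   # Eq. (11), equals 1 at t = t_max
    return Q_max * nqq * tcfq                                     # Eq. (13)

q = np.array([16, 40, 64, 104, 16, 40, 64, 104], dtype=float)
t = np.array([30, 30, 30, 30, 7.5, 7.5, 7.5, 7.5], dtype=float)
mos = np.array([89, 74, 62, 48, 73, 60, 50, 39], dtype=float)     # placeholder MOS values

# Least squares fit of (c, d) to the MOS data
(c, d), _ = curve_fit(quality_model, (q, t), mos, p0=(0.1, 7.0))
pc = np.corrcoef(mos, quality_model((q, t), c, d))[0, 1]
print(f"c = {c:.3f}, d = {d:.2f}, PC = {pc:.3f}")
```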
Combining the rate and quality models, we draw in Fig. 7 the quality vs. rate curves achievable at different frame rates. We also plot the measured MOS data on the same figure. The model fits the measured data very well for the sequences "Akiyo" and "Crew", but is not as accurate at some frame rates for "Football" and "City", due to slight errors in both the rate and quality predictions. It is clear from this figure that each frame rate is optimal only for a certain rate region. By connecting the top segments over the successive rate regions for each sequence, we effectively obtain the operational rate-quality function of the SVC encoder for that sequence.

Fig. 7. Quality vs. rate at different frame rates. Points are measured data; curves are based on the rate model in Eq. (8) and the quality model in Eq. (13).

IV. RATE-CONSTRAINED BIT STREAM ADAPTATION

In this section, we consider how to apply our proposed rate and quality models to perform rate-constrained SVC bit stream adaptation. Figure 8 provides a system view of the adaptation problem. For each video, a single full-resolution scalable stream is available at a media content server, and it is adapted at a network proxy or gateway in response to the user's channel conditions and viewing preferences. When a user requests the video from the server, the adaptor (sitting at the proxy) determines an appropriate video rate R0 based on the user's channel condition (e.g., R0 can be the sustainable transmission rate for the given channel condition minus all the overheads for channel error correction and packetization). Based on R0 and the user's viewing preference setting (embedded in the user profile sent to the adaptor), the adaptor determines the optimal set of temporal and amplitude layers (and more generally, spatial layers) to extract, so as to provide the best perceptual quality. In Fig. 8, we assume that the adaptor monitors the channel condition based on feedback from the user. (The user may instead inform the adaptor of its desired rate R0 in alternative implementations.) Furthermore, the adaptor determines the quality model parameters based on the user's preference setting, which describes the user's preferred tradeoff among spatial, temporal, and amplitude resolutions. Recall that the parameters c and d in the quality model depend on the viewer's sensitivities to amplitude and temporal resolutions. Note that the parameters of the rate model for each video can be predetermined as discussed in Sec. II and embedded in the full-resolution bitstream. The parameters for the quality model need to be determined based both on the video content and on the viewer preference setting, as discussed in Sec. III. In a simpler implementation, the adaptor may ignore the user's preference setting and use quality model parameters tuned for "average" viewers. Based on the target rate R0 and the model parameters, the adaptor determines the optimal frame rate topt and quantization stepsize qopt, and the corresponding temporal and amplitude layers. Finally, the adaptor extracts these layers from the scalable bit stream and delivers the resulting bit stream to the user.

Fig. 8. Rate-constrained SVC video adaptation.

For a given target rate R0, the adaptation problem can be formulated as the following constrained optimization problem:

Determine t, q to maximize Q(q, t) subject to R(q, t) ≤ R0.  (14)

In the following subsections, we employ the proposed rate and quality models to solve this optimization problem, first assuming the frame rate can take any positive value, and then considering the discrete set of frame rates afforded by the dyadic temporal prediction structure.

A. Optimal solution assuming t and q take continuous values

We first solve the constrained optimization problem in (14) assuming that both the frame rate t and the quantization stepsize q can take on any value in the ranges t ∈ (0, tmax) and q ∈ (qmin, +∞). To simplify the notation, let $\hat{Q} = \frac{Q_{max}}{(1 - e^{-d})\, e^{-c}}$, $\hat{t} = t/t_{max}$, $\hat{q} = q/q_{min}$, $\hat{R} = R/R_{max}$, and $\hat{R}_0 = R_0/R_{max}$; the rate and quality models in (8) and (13) then become, respectively,

$\hat{R}(\hat{q}, \hat{t}) = \hat{q}^{-a}\, \hat{t}^{\,b},$  (15)

$Q(\hat{q}, \hat{t}) = \hat{Q}\, e^{-c\hat{q}} \left(1 - e^{-d\hat{t}}\right).$  (16)

By setting $\hat{R}(\hat{q}, \hat{t}) = \hat{R}_0$ in (15), we obtain

$\hat{q} = \left(\hat{t}^{\,b} / \hat{R}_0\right)^{1/a},$  (17)

which describes the feasible $\hat{q}$ for a given $\hat{t}$ that satisfies the rate constraint R0. Substituting (17) into (16) yields

$Q(\hat{t}) = \hat{Q}\, e^{-\frac{c\, \hat{t}^{\,\psi}}{\hat{R}_0^{1/a}}} \left(1 - e^{-d\hat{t}}\right), \quad \hat{t} \in (0, 1),$  (18)
Fig. 9. Optimal quantization stepsize, frame rate, and the corresponding quality index versus the bit rate R, assuming the quantization stepsize and frame rate can take on any value within their effective ranges.
where ψ = b/a. Equation (18) expresses the achievable quality for different frame rates under the rate constraint R0. Clearly, this function has a unique maximum, which can be found by setting its derivative with respect to $\hat{t}$ to zero. This yields

$\hat{R}_0 = \left(\frac{c\psi\, \hat{t}^{\,\psi-1}\left(1 - e^{-d\hat{t}}\right)}{d\, e^{-d\hat{t}}}\right)^{a}.$  (19)

For any given rate constraint R0, we can solve (19) numerically to determine the optimal frame rate topt. Then, using (17) and (18), we can determine the optimal quantization stepsize qopt and the corresponding maximum quality Qopt at the rate R0. Figure 9 shows topt, qopt, and Qopt as functions of the rate constraint R0. As expected, as the rate increases, topt increases while qopt decreases, and the achievable best quality continuously improves. Notice that topt increases more rapidly for the "football" sequence than for the other sequences, because of its faster motion. Based on the parameters derived from our subjective test data, even at the highest bit rates examined, the optimal frame rate remains below 20 Hz for the other three sequences. Note that had we used a smaller qmin to allow much higher values of Rmax, topt would have increased to 30 Hz beyond some rates.

B. Optimal solution under the dyadic temporal scalability structure

A popular way to implement temporal scalability is through the dyadic hierarchical B-picture prediction structure, by which the frame rate doubles with each additional temporal layer. With 5 temporal layers, the corresponding frame rates are 1.875, 3.75, 7.5, 15, and 30 Hz. From a practical point of view, it is interesting to see what the optimal combinations of frame rate and quantization stepsize are for different bit rates under this structure. To obtain the optimal solution under this scenario, for each given rate we determine the quality values corresponding to all five possible frame rates using (18), and choose the frame rate (and its corresponding quantization stepsize using (17)) that leads to the highest quality. The results are shown in Fig. 10. Because the frame rate t can only increase in discrete steps, the optimal q does not decrease monotonically with the rate. Rather, whenever topt jumps to the next higher value (doubles), qopt first increases to meet the rate constraint and then decreases while t is held constant, as the rate increases. Consistent with the previous results in Fig. 9, for "football" the optimal frame rate transitions to 30 Hz at an intermediate bit rate, whereas for the other sequences the optimal frame rate stays at 15 Hz even at the highest bit rates examined. As mentioned earlier, had we used a lower qmin, we would have seen transitions to 30 Hz beyond some rates. The results in Fig. 10 can be validated by cross checking with Fig. 7. For example, for "Crew", in the rate region below 25.0 kbps, 1.875 Hz leads to the highest quality; in the rate range between 25 and 61 kbps, 3.75 Hz gives the highest quality; between 61 and 253 kbps, 7.5 Hz is the best; and beyond 253 kbps, 15 Hz provides the highest quality. Connecting the top segments for each sequence in Fig. 7 leads to the optimal Q vs. bit rate curve in Fig. 10.

Fig. 10. Optimal quantization stepsize, frame rate, and the corresponding quality index versus the bit rate R, assuming q varies continuously and the frame rate takes the values 1.875, 3.75, 7.5, 15, or 30 Hz enabled by the dyadic hierarchical prediction structure.

In practice, the SVC encoder with CGS quality scalability does not allow the quantization stepsize to change continuously. The finest granularity in quality scalability is a decrement of QP by 1 with each additional quality layer, which means that the quantization stepsize reduces by a factor of 2^(−1/6) with each additional layer. In practice, much coarser granularity is typically used, with a decrement of QP by 2 to 4 being typical. When we constrain q to take only discrete values corresponding to
such QP values, in addition to allowing only dyadic frame rates, one cannot always meet a rate constraint exactly. One can still solve for the optimal t and q for any given rate constraint using the proposed models, by exhaustive search within the finite set of feasible values for t and q.
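To make this adaptation procedure concrete, the sketch below performs the exhaustive search described above over the dyadic frame rates and a discrete QP grid, using the rate model (8) and the quality model (13). The parameters a, b, c, d are the fitted "akiyo" values from Tables I and II, while Rmax is a placeholder value, since the paper does not report it; the search itself is a straightforward reading of problem (14), not part of the SVC standard.

```python
# Sketch: rate-constrained adaptation by exhaustive search over dyadic frame rates
# and a discrete QP grid, using the rate model (8) and the quality model (13).
# a, b, c, d are the "akiyo" values from Tables I and II; R_max (kbps) is a placeholder.
import math

q_min, t_max = 16.0, 30.0
a, b = 1.213, 0.473          # rate model parameters (Table I, "akiyo")
c, d = 0.11, 8.03            # quality model parameters (Table II, "akiyo")
R_max, Q_max = 150.0, 89.0   # R_max is a placeholder value

def rate(q, t):              # Eq. (8)
    return R_max * (q / q_min) ** (-a) * (t / t_max) ** b

def quality(q, t):           # Eq. (13)
    return (Q_max * math.exp(-c * q / q_min) / math.exp(-c)
            * (1.0 - math.exp(-d * t / t_max)) / (1.0 - math.exp(-d)))

def adapt(R0, qp_step=4):
    """Return (t_opt, q_opt, Q_opt) meeting the rate constraint R0, or None if infeasible."""
    frame_rates = [1.875, 3.75, 7.5, 15.0, 30.0]                          # dyadic temporal layers
    stepsizes = [2 ** ((qp - 4) / 6.0) for qp in range(28, 45, qp_step)]  # CGS layers, QP 28..44
    best = None
    for t in frame_rates:
        for q in stepsizes:
            if rate(q, t) <= R0:                 # feasible (t, q) combination
                cand = (quality(q, t), t, q)
                if best is None or cand[0] > best[0]:
                    best = cand
    return None if best is None else (best[1], best[2], best[0])

print(adapt(R0=60.0))   # e.g., a target rate of 60 kbps
```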
V. CONCLUDING REMARKS

In this paper we examine the impact of the frame rate t and quantization stepsize q on the rate and perceptual quality of scalable video. Both models are developed based on the key observation from experimental data that the relative reduction of either rate or quality when the frame rate decreases is quite independent of the quantization stepsize. This observation enables us to express both rate and quality as the product of a function of q and a function of t. The proposed rate and quality models are analytically tractable, each requiring only two content-dependent parameters. The rate model fits the measured rates very accurately, with an average Pearson correlation of 0.998 over the four video sequences. The quality model also matches the MOS from subjective tests very well, with an average Pearson correlation of 0.984. We further apply these models to rate-constrained SVC bitstream adaptation, where the problem is to determine the frame rate and quantization stepsize that lead to the highest perceptual quality for a given target rate. We derive the optimal frame rate topt and quantization stepsize qopt, both as functions of the rate R, first by assuming t can vary continuously, to provide theoretical insights, and then by considering the feasible set of discrete frame rates afforded by the hierarchical temporal prediction structure.

The proposed rate and quality models have other applications beyond SVC bit stream adaptation. One important application is non-scalable encoder optimization, e.g., determining the optimal encoding frame rate for a target bit rate. They can also be used for scalable encoder optimization, e.g., determining the appropriate temporal and amplitude layers to generate and include at different rate ranges. For the proposed models to be adopted in practical applications, one must be able to determine the model parameters easily, either from the original or from the coded sequences. For the rate model, the parameters for each sequence can be easily derived from the actual bit rates of the layers corresponding to selected combinations of q and t, once a complete scalable bit stream has been created. However, to use the rate model for encoder optimization, it is desirable to determine the model parameters from content features (such as motion, contrast, etc.) derived from the original video. Similarly, for the quality model, we will investigate the correlation between the model parameters and content features. Both will be subjects of our future study. We will further investigate how to take into account the viewer's sensitivities to jerkiness and coding artifacts when determining the quality model parameters for individual viewers.
ACKNOWLEDGMENT

This material is based upon work supported in part by the National Science Foundation under Grant No. 0430145.

REFERENCES

[1] G. Sullivan, T. Wiegand, and H. Schwarz, Text of ITU-T Rec. H.264 | ISO/IEC 14496-10:200X/DCOR1 / Amd.3 Scalable video coding, ISO/IEC JTC1/SC29/WG11, MPEG08/N9574, Antalya, TR, Jan. 2008.
[2] Y.-K. Wang, M. Hannuksela, S. Pateux, A. Eleftheriadis, and S. Wenger, "System and transport interface SVC," IEEE Trans. Circuit and Sys. for Video Technology, vol. 17, no. 9, pp. 1149–1163, Sept. 2007.
[3] M. Wien, H. Schwarz, and T. Oelbaum, "Performance analysis of SVC," IEEE Trans. Circuit and Sys. for Video Technology, vol. 17, no. 9, pp. 1194–1203, Sept. 2007.
[4] H.264/AVC, Draft ITU-T Rec. and Final Draft Intl. Std. of Joint Video Spec. (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC), Joint Video Team, Doc. JVT-G050, Mar. 2003.
[5] E. Ong, X. Yang, W. Lin, Z. Lu, and S. Yao, "Perceptual quality metric for compressed videos," in Proc. of ICASSP, vol. 2, Mar. 2005, pp. 581–584.
[6] R. Feghali, D. Wang, F. Speranza, and A. Vincent, "Quality metric for video sequences with temporal scalability," in Proc. of ICIP, vol. 3, Sep. 2005, pp. III-137–140.
[7] H.-T. Quan and M. Ghanbari, "Temporal aspect of perceived quality of mobile video broadcasting," IEEE Trans. on Broadcasting, vol. 54, no. 3, pp. 641–651, Sept. 2008.
[8] R. Feghali, D. Wang, F. Speranza, and A. Vincent, "Video quality metric for bit rate control via joint adjustment of quantization and frame rate," IEEE Trans. on Broadcasting, vol. 53, no. 1, pp. 441–446, Mar. 2007.
[9] W. Ding and B. Liu, "Rate control of MPEG video coding and recoding by rate-quantization modeling," IEEE Trans. Circuit and Sys. for Video Technology, vol. 6, pp. 12–20, Feb. 1996.
[10] T. Chiang and Y.-Q. Zhang, "A new rate control scheme using quadratic rate distortion model," IEEE Trans. Circuit and Sys. for Video Technology, vol. 7, no. 2, pp. 246–250, Feb. 1997.
[11] T. Chiang, H.-J. Lee, and H. Sun, "An overview of the encoding tools in the MPEG-4 reference software," in Proc. of IEEE Intl. Symp. Circuits and Systems, Geneva, Switzerland, May 28–31, 2000.
[12] J. Ribas-Corbera and S. Lei, "Rate control in DCT video coding for low-delay communications," IEEE Trans. Circuit and Sys. for Video Technology, vol. 9, no. 2, pp. 172–185, Feb. 1999.
[13] Z. He and S. K. Mitra, "A novel linear source model and a unified rate control algorithm for H.264/MPEG-2/MPEG-4," in Proc. of Intl. Conf. Acoustics, Speech, and Signal Processing, Salt Lake City, Utah, May 2001.
[14] JSVM software, Joint Scalable Video Model, Joint Video Team, Doc. JVT-X203, Geneva, Switzerland, 29 June – 5 July 2007.
[15] S. H. Jin, C. S. Kim, D. J. Seo, and Y. M. Ro, "Quality measurement modeling on scalable video applications," in Proc. of IEEE Workshop on Multimedia Signal Processing, Oct. 2007, pp. 131–134.
[16] Y.-F. Ou, Z. Ma, and Y. Wang, "A novel quality metric for compressed video considering both frame rate and quantization artifacts," in Proc. of Intl. Workshop Video Processing and Quality Metrics for Consumer Electronics (VPQM), Scottsdale, AZ, Jan. 2009.
[17] ITU-T Rec. P.910: Subjective video quality assessment methods for multimedia applications, 1999.
[18] S. Wolf and M. Pinson, Video quality measurement techniques, NTIA, Tech. Report 02-392, June 2002.
[19] Y.-F. Ou, T. Liu, Z. Zhao, Z. Ma, and Y. Wang, "Modeling the impact of frame rate on perceptual quality of video," in Proc. of Intl. Conf. Image Processing (ICIP), San Diego, CA, Oct. 2008, pp. 689–692.