IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, VOL. 59, NO. 3, MARCH 2012
Novel Rate-Quantization Model-Based Rate Control With Adaptive Initialization for Spatial Scalable Video Coding

Sudeng Hu, Hanli Wang, Member, IEEE, Sam Kwong, Senior Member, IEEE, and C.-C. Jay Kuo, Fellow, IEEE
Abstract—A novel spatial-layer rate-control algorithm is proposed for scalable video coding (SVC). First, by analyzing the relationship among the best initial quantization parameter (Qp), channel bandwidth, and the initial frames' complexity measure, an adaptive Qp-initialization model is introduced to determine the starting Qp value for not only the base layer but also the spatial enhancement layers of SVC. Then, a two-stage Qp-determination scheme is designed to improve the rate-control performance for the spatial-layer SVC with an efficient frame-complexity prediction method and an adaptive model-parameter update technique employed. Experimental results demonstrate the effectiveness of the proposed Qp-initialization scheme and the two-stage Qp-determination algorithm. By comparison with two other benchmark rate-control algorithms, the proposed rate-control algorithm is able to control constrained bit rates accurately with better rate-distortion performance.

Index Terms—Initial quantization parameter, rate control, rate-quantization (R-Q) model, scalable video coding (SVC), spatial scalability.
I. INTRODUCTION
WITH the widespread adoption of advanced Internet and multimedia technologies, the past few decades have witnessed great success in the development of video communication techniques [1] and prosperous applications in industrial networks [2]. As an essential component of a video communication system, video coding plays a key role since it is necessary to compress the visual content in a more efficient manner for storage and transmission. Nowadays, due to the development of mobile devices such as personal digital assistants and smartphones, mobile Internet access has become more and more common, and thus, a demand for mobile video communications has arisen. In practice, a robust and powerful mobile video communication system should consider the diversity of the participants, such as their transmission bandwidth, computational power, and display screen size. To address this issue, scalable video coding (SVC) [3], [4] is a desirable infrastructure for the video codec in video communications. SVC was developed as an extension of H.264/advanced video coding (AVC), which aims to provide bitstream adaptability via temporal, spatial, and quality layers. SVC is able to encode the video signal once but enable decoding from partial streams, depending on the specific application requirements. Therefore, SVC can provide an efficient way to share and utilize the available resources over heterogeneous networks where the downstream client capabilities, system resources, and network conditions are not known in advance.

For spatial scalability, a layered coding approach is applied to encode an input video sequence with different picture sizes (or resolutions), enabling various clients to extract desired subbitstreams depending on their display resolutions, computational capacities, and channel bandwidth. The spatial layer with the smallest picture size is called the base layer (BL), whereas the other layers with larger picture sizes are termed spatial enhancement layers (ELs). The BL offers an H.264/AVC-compatible bitstream, whereas the ELs are coded through interlayer prediction tools to further exploit the redundancies between consecutive spatial layers and thus improve coding efficiency. In addition, other technologies are also proposed to support spatial scalability, such as in [5], where the format-compatible and format-modified discrete cosine transform (DCT) are developed to implement image/video spatial scalability in the DCT-compressed domain.

Manuscript received August 20, 2010; revised March 27, 2011; accepted April 15, 2011. Date of publication May 19, 2011; date of current version October 25, 2011. This work was supported in part by Hong Kong Research Grants Council General Research Fund under Project 9041353 (CityU 115408), by the Program for Professor of Special Appointment (Eastern Scholar), Shanghai Institutions of Higher Learning, by the Program for New Century Excellent Talents in University of China under Project NCET-10-0634, and by the National Basic Research Program (973 Program) of China under Grant 2010CB328101.
S. Hu and S. Kwong are with the Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong (e-mail: [email protected]; [email protected]).
H. Wang is with the Department of Computer Science and Technology and Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai 200092, China (e-mail: [email protected]).
C.-C. J. Kuo is with Ming Hsieh Department of Electrical Engineering and Signal and Image Processing Institute, University of Southern California, Los Angeles, CA 90089-2564 USA (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIE.2011.2157282
For each spatial layer, the hierarchical B-picture prediction structure [6] is usually employed to support temporal scalability, which is different from traditional group-of-picture (GOP) structures and thus leads to different coding features in terms of rate-quantization (R-Q) and distortion-quantization relations. Rate control aims not only at accurate bit rate (BR) regulation but also at rate-distortion optimization (RDO) over the video-transmission system. For example, in [7], a rate-control algorithm is designed to maximize the quality of the overall broadcasting system. In [8], the rate-distortion (RD) performance of the hierarchical B-picture prediction structure for temporal scalability is optimized by developing a set of
0278-0046/$26.00 © 2011 IEEE
Fig. 1. Rate-control architecture in the SVC encoder.
weighting factors for bit allocation. Although SVC bitstreams provide rich functionalities of rate adaptation, there is no guarantee that the generated scalable bitstreams are optimal in the RD sense, which makes rate control necessary for SVC in order to optimize the video quality under the target BR (or bandwidth) constraint.

Generally, rate control is a specific form of control system. Fig. 1 shows the rate-control architecture in the SVC encoder, where the Qp values for each layer (i.e., Qp(0), Qp(1), . . . , Qp(i)) are the input variables, and the output BRs of each layer (i.e., BR(0), BR(1), . . . , BR(i)) are expected to be controlled with these Qp values according to the end-user requirement or network condition. The performance of this rate-control system may be improved by introducing various advanced control solutions, such as the combination of fuzzy control and iterative learning control in [9] and active disturbance rejection control, which outperforms the traditional proportional–integral–derivative control, as discussed in [10]. On the other hand, rate control in SVC has its unique characteristics and challenges. In particular, the layer-based coding structure in SVC makes rate-control design more complex and difficult than for previous encoders. Recently, a number of rate-control approaches have been proposed for SVC, including the temporal-layer rate-control algorithms [11]–[13] and spatial-layer rate-control algorithms [12], [14].

Regarding temporal-layer rate control, Xu et al. [11] proposed a scaling factor-based scheme to assign bit resources to different temporal layers according to their prediction effects in the hierarchical B-picture structure. In [12], a set of empirical weighting factors is proposed to allocate bit resources to different temporal layers, and the linear sum bits R-Q model [15] is applied to determine Qp for each coding unit.
In [13], the relation of distortion dependence among the temporal layers with the hierarchical B-picture structure is investigated, and a multipass temporal-layer rate-control algorithm is developed to achieve the optimal RD performance.
As far as the spatial-layer rate control is concerned, it plays a key role by exploring the correlation of the interlayer R-Q characteristics to improve rate-control performance. In [12], to decouple the Qp interdependence problem [16] between rate control and the RDO process, the mean of absolute difference (MAD) of texture residuals in the lower spatial layers is used, and a switchable MAD prediction scheme is proposed for modeling R-Q relations for the spatial ELs. Liu et al. [14] investigated the dependence relation in terms of rate and distortion between the spatial layers and developed a multipass rate-control scheme for the spatial-layer rate control. For the aforementioned SVC rate-control algorithms, the accuracy of the R-Q model is critical to achieving optimal RD performance. In fact, various R-Q models have been established in the literature for previous video coding standards with a single-layer coding structure, such as [17]–[23], and some of these R-Q models have been employed for SVC rate control. However, the correlation of the R-Q model parameters among the different layers has not been fully investigated.

Another important issue in improving SVC rate-control performance is to determine an appropriate Qp value for coding the first frame of a sequence. Usually, a suitable initial Qp not only benefits the first frame in the sequence but also effectively prevents fluctuation of bit consumption/video quality and thus improves the coding efficiency of the whole sequence. In Joint Video Team (JVT)-G012 [16], a bits per pixel (BPP)-based initial Qp-determination scheme is introduced. However, it does not take video contents into consideration for different sequences. In [24], an empirical model is developed to determine the initial Qp for H.264/AVC. Although it begins to consider the sequence complexity, it limits the investigation only to the first video frame.
As a result, the complexity measure may not be accurate enough to reflect the actual complexity of a number of initial video frames. In addition, for multiple layers in SVC, it is desirable to explore the correlation of the initial Qp values among the coding layers in order to design an efficient Qp-initialization scheme.
HU et al.: NOVEL R-Q MODEL-BASED RATE CONTROL WITH ADAPTIVE INITIALIZATION FOR SPATIAL SVC
In this paper, we mainly focus on rate control for SVC spatial scalability and propose a novel spatial-layer rate-control algorithm based on [23]. The main contributions of this paper can be summarized as follows. First, a novel adaptive Qp-initialization scheme is proposed for each of the spatial coding layers. Based on the analysis of the hierarchical B-picture structure [6], a BL initial Qp-determination method is developed. Then, via investigating the relation in terms of picture sizes and target BRs between the BL and spatial ELs, an efficient initial Qp-determination method is derived for the spatial ELs. Second, by extending the work in [23] into SVC, an efficient texture-complexity measure is designed for the Cauchy distribution-based R-Q model [20], and a two-stage Qp-determination method is presented. Moreover, the correlation of the Cauchy distribution-based R-Q model parameters in the consecutive spatial layers is investigated, and consequently, a novel prediction mechanism to update model parameters is developed to improve the model accuracy for the ELs in SVC.

The rest of this paper is organized as follows. In Section II, the adaptive Qp-initialization scheme is introduced for the SVC spatial layers. Then, the two-stage Qp-determination algorithm and the improved R-Q model with the effective complexity measure and model-parameter update techniques are discussed in Section III. Experimental results are presented in Section IV, and finally, Section V concludes this paper.

II. INITIAL Qp DETERMINATION FOR THE SPATIAL LAYERS

At the beginning of rate control, an initial Qp should be specified to encode the first frame of a video sequence. In general, setting an appropriate initial Qp value not only benefits the first frame of a sequence but also prevents the fluctuation of bit consumption/video quality for a number of succeeding frames.
Usually, the determination of the best initial Qp value mainly depends on two factors: the constrained BR and the complexity of the video sequence. Traditionally, only the constrained BR is considered to set the initial Qp value, such as in the BPP-based initial Qp-determination scheme [16]. To take different video sequences' contents into consideration for initial Qp determination, the complexity information of the first frame is additionally utilized besides the BPP in [24], and the coded video performance with H.264/AVC is further improved by this adaptive Qp-initialization method. Using only the first video frame for the initial Qp calculation has the advantage of introducing no coding delay; however, the complexity measure based on only the first video frame may not be accurate enough to reflect the actual complexity of a number of initial frames, and thus, the accuracy of the Qp initialization may be degraded to some extent. As a rule of thumb, the more initial frames are employed, the more accurately the initial Qp value is derived, but the longer the coding delay becomes.

In the following, a novel adaptive Qp-initialization scheme will be presented for the spatial SVC rate control. First, considering the hierarchical B-picture prediction structure [6] usually used in SVC, a BL initial Qp-determination method is developed, which can achieve a good tradeoff between the accuracy of the best initial Qp value and the latency of coding
delay. Then, the proposed Qp-initialization scheme is extended to multiple spatial layers, based on investigating the relation of image resolution and the target BR between the BL and spatial ELs.

A. BL Qp Initialization

In this paper, the Cauchy distribution-based R-Q model [20] is applied to calculate Qp for a given target BR because of its reported superior performance over other models. The Cauchy distribution-based R-Q model is formulated as

$$R_{tot} = \alpha \cdot Q_{step}^{\beta} + \gamma \tag{1}$$

where Rtot is the number of total bits, including header and texture bits; Qstep is the quantization step size; α is the complexity measure of the residual signal; γ corresponds to the number of header bits; and β is the model parameter associated with the distribution characteristics of DCT coefficients, which is negative. An appropriate initial Qp value should meet the bandwidth requirement at the very beginning of video encoding; otherwise, the bit buffer will overflow or underflow, leading to severe degradation of coding performance. Assume that, during the beginning time ΔT, the number of frames to be coded is N, the channel bandwidth (or BR) is B, and αa represents the average complexity measure of these N frames, which is referred to as the initial complexity in the rest of this paper. We then have

$$B \cdot \Delta T \approx N \cdot \alpha_a \cdot Q_{step}^{\beta} \tag{2}$$

where the number of header bits γ is ignored since it accounts for a very small percentage of the total output bits during the encoding of these initial frames. In SVC, as in H.264/AVC, Qstep doubles in size for every increment of 6 in Qp [19], which can be expressed as

$$Q_{step} = \rho \cdot e^{\zeta \cdot Q_p} \tag{3}$$

where ρ and ζ are constant parameters. By substituting (3) into (2), we have

$$Q_p = \frac{1}{\beta \cdot \zeta} \ln\left(\frac{B}{F_r \cdot \alpha_a}\right) - \frac{\ln(\rho)}{\zeta} \tag{4}$$

where Fr is the frame rate equal to N/ΔT. Based on (4), an adaptive model considering both BR constraint and video content characteristics is proposed to derive the initial Q̃p as

$$\tilde{Q}_p = \phi \cdot \ln\left(\frac{B}{F_r \cdot \alpha_a}\right) + \varphi \tag{5}$$

where φ = 1/(β · ζ) and ϕ = −ln(ρ)/ζ. To further verify the proposed initial Qp model in (5), a number of benchmark video sequences in quarter common intermediate format (QCIF) and common intermediate format (CIF) are investigated with various target BR scenarios. The
Fig. 3. Dyadic hierarchical B-picture structure.
Fig. 2. Relationship between the best initial Qp and bandwidth B. B is set from 40 to 800 kb/s for the QCIF video sequences and from 80 to 1600 kb/s for the CIF video sequences, with Fr = 30.
relationship between the best initial Qp¹ and ln(B) is illustrated in Fig. 2, where every line corresponds to a video sequence. It is shown that these lines (corresponding to different video sequences) are almost parallel to one another, and the distances between these lines are caused by the differences in the initial complexity [i.e., αa in (5)] of these video sequences. Therefore, it is necessary to explore the impact of initial complexity on determining the initial Qp value.

In order to derive a suitable initial Qp, the initial complexity measure αa in (5) should be known before coding the first frame. In SVC, the hierarchical B-picture prediction structure [6] is usually used, where the display order is different from the coding order. As illustrated in Fig. 3, before encoding Frame 1 (the second key frame) in the coding order, the frames of the entire GOP are available. Therefore, to estimate the initial complexity more accurately, we utilize more frames, which results in only a very short coding delay for the first frame. The proposed initial complexity measure is based on the macroblock-based variance (MBV) of the first I frame and the sum of absolute difference (SAD) among specific frames. The MBV and SAD are defined as

$$\mathrm{MBV} = \sum_{k=1}^{N_M} \sum_{i=0}^{15} \sum_{j=0}^{15} \left| I^k(i,j) - I^k_{DC} \right| \tag{6}$$

$$\mathrm{SAD}(n,m) = \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} \left| P^n(x,y) - P^m(x,y) \right| \tag{7}$$

where I^k(i, j) is the pixel value at (i, j) of the kth macroblock (MB); I^k_DC is the average pixel value of the kth MB; NM is the number of MBs in a frame; P^n(x, y) and P^m(x, y) are the pixel values at (x, y) of the nth and mth frames in the display order; and X and Y are the width and height, respectively, of
¹For a given BR B, the initial Qp resulting in the best RD performance is recorded as the best initial Qp.
Fig. 4. Relationship between initial Qp and ln(B/αa ) for different sequences.
a frame. Therefore, the initial complexity measure is defined based on the MBV and SAD as

$$\alpha_a = \frac{\mathrm{MBV} \cdot \left[ \mathrm{SAD}(0,S) + \mathrm{SAD}\left(0,\frac{S}{2}\right) + \mathrm{SAD}\left(\frac{S}{2},S\right) \right]^{\lambda}}{3 \times 10^{8}} \tag{8}$$

where S is the GOP size and λ is the model parameter, which is set to 0.4 based on extensive experimental results in this paper. As shown in Fig. 2, given a bandwidth, the best initial Qp values vary significantly from QCIF sequences to CIF sequences and even among sequences with the same format. However, when the proposed initial complexity measure in (8) is applied with (5), as depicted in Fig. 4, these lines almost converge to a single line, which indicates that the proposed initial complexity measure is accurate enough to reflect the characteristics of different video contents. Using linear regression, the model parameters in (5) are obtained as φ = −7.66 and ϕ = 47.04.

B. Model Extension to Multiple Layers

In SVC, multiple spatial layers can be simultaneously coded. Therefore, it is necessary to extend (5) to determine the initial Qp values for each spatial layer, which can be expressed as

$$\tilde{Q}_p(i) = \phi \cdot \ln\left(\frac{B(i)}{F_r(i) \cdot \alpha_a(i)}\right) + \varphi, \quad i = 0, 1, \ldots \tag{9}$$
where B(i), Fr(i), and αa(i) are the target BR, frame rate, and initial complexity measure, respectively, for the ith layer. In such a way, the initial complexity needs to be investigated for each spatial layer. Because the video content of a lower layer is usually the down-sampled version of a higher layer, there exist strong correlations between the initial complexity measures of the different spatial layers for the same video sequence. For mathematical simplicity, the relation regarding the initial complexity measures of the different spatial layers is modeled as

$$\frac{\alpha_a(i)}{\alpha_a(0)} = \omega \cdot \frac{S(i)}{S(0)} \tag{10}$$

where S(i) represents the image resolution of the ith-layer video and ω is the weighting factor. In order to investigate the parameter ω in (10), we perform extensive experiments with a large number of video sequences, image resolutions of spatial layers, and BRs. The typical results of ω between the BL in QCIF and the EL in CIF are given in Table I. In this paper, ω is set to 0.8. In fact, (10) is used as the sequence-complexity prediction for the ELs. In Table II, the prediction accuracy is further tested for other benchmark video sequences, where the predicted α value is calculated from (10) and the actual α value is calculated from (8) for EL 1. The results show that acceptable sequence-complexity prediction performance can be achieved for the ELs.

TABLE I EXAMPLE OF ω INVESTIGATION BETWEEN THE CIF AND QCIF SPATIAL LAYERS; S(1)/S(0) = 4

TABLE II ACCURACY OF SEQUENCE-COMPLEXITY PREDICTION BETWEEN LAYERS

By combining (9) and (10), the following equation is obtained for calculating the initial Q̃p(i) for the ith EL:

$$\tilde{Q}_p(i) = \tilde{Q}_p(0) + \phi \cdot \ln\left(\frac{B(i) \cdot F_r(0) \cdot S(0)}{\omega \cdot B(0) \cdot F_r(i) \cdot S(i)}\right). \tag{11}$$

Therefore, in the proposed Qp-initialization scheme, it is only necessary to compute the initial complexity of the BL and thus determine Q̃p(0) for the BL. Then, the initial Qp value for each EL is derived according to (11).

III. IMPROVED R-Q MODEL FOR SVC

In [23], an effective complexity measure for the Cauchy distribution-based R-Q model [20] is developed to improve the rate-control performance of H.264/AVC. In the following, we will further extend this complexity measure for SVC rate control and propose a novel prediction mechanism to update model parameter β [in (1)] for the performance enhancement of the Cauchy distribution-based R-Q model.

A. Two-Stage Rate-Control Scheme in SVC

Like most previous video encoders, the goal of the SVC encoder is to achieve the best RD efficiency, i.e., to minimize distortion D subject to constraint Rc on the number of used bits R. This problem is described as

$$\min\{D\}, \quad \text{subject to } R \le R_c. \tag{12}$$
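Before the frame-level problem in (12) is addressed, the encoder needs the starting Qp values from Section II. The following Python sketch, which is only an illustration and not the authors' implementation, strings together (6)-(8), (5), and (11). It assumes that frames are lists of rows of luma samples whose dimensions are multiples of 16, that the deviation in (6) is taken in absolute value, and that B is expressed in kb/s as in Fig. 2. All function names, as well as the final rounding and clipping to the H.264/AVC range [0, 51], are hypothetical additions.

```python
import math

PHI, VARPHI = -7.66, 47.04   # regression parameters reported for (5)
OMEGA = 0.8                  # weighting factor omega in (10)
LAMBDA = 0.4                 # exponent lambda in (8)


def mbv(frame, mb=16):
    """Macroblock-based variance, cf. (6): sum over all MBs of the absolute
    deviation of each pixel from its MB mean (absolute value is assumed)."""
    total = 0.0
    for top in range(0, len(frame), mb):
        for left in range(0, len(frame[0]), mb):
            block = [frame[y][x]
                     for y in range(top, top + mb)
                     for x in range(left, left + mb)]
            dc = sum(block) / len(block)
            total += sum(abs(p - dc) for p in block)
    return total


def sad(frame_n, frame_m):
    """Sum of absolute differences between two frames, cf. (7)."""
    return sum(abs(a - b)
               for row_n, row_m in zip(frame_n, frame_m)
               for a, b in zip(row_n, row_m))


def initial_complexity(frames, gop_size):
    """Initial complexity alpha_a, cf. (8); frames are in display order and
    must contain at least gop_size + 1 pictures."""
    s, half = gop_size, gop_size // 2
    motion = (sad(frames[0], frames[s]) + sad(frames[0], frames[half])
              + sad(frames[half], frames[s]))
    return mbv(frames[0]) * motion ** LAMBDA / 3e8


def initial_qp_base(bandwidth, frame_rate, alpha_a):
    """Unclipped initial Qp of the base layer, cf. (5)."""
    return PHI * math.log(bandwidth / (frame_rate * alpha_a)) + VARPHI


def initial_qp_enhancement(qp_base, b_i, fr_i, s_i, b_0, fr_0, s_0):
    """Unclipped initial Qp of the i-th enhancement layer, cf. (11)."""
    ratio = (b_i * fr_0 * s_0) / (OMEGA * b_0 * fr_i * s_i)
    return qp_base + PHI * math.log(ratio)


def to_valid_qp(qp):
    """Round and clip to [0, 51] (an assumed post-processing step)."""
    return max(0, min(51, round(qp)))
```

For a QCIF base layer at 128 kb/s and a CIF enhancement layer at 512 kb/s with equal frame rates, the call would be `initial_qp_enhancement(qp0, 512, 30, 352 * 288, 128, 30, 176 * 144)`. The argument of the logarithm then reduces to 4/(0.8 · 4) = 1.25, so the enhancement layer starts roughly 1.7 below the base-layer value before clipping.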
The optimization problem in (12) is intensively discussed in [25] and is solved by Lagrangian optimization, which is adopted in the H.264/AVC and SVC encoders to find the best encoding mode regarding RD efficiency. According to the Lagrangian optimization method, the SVC encoder exhaustively searches for the best mode that can produce the minimum RD cost given by

$$J(m, Q_p) = D(m, Q_p) + \lambda_{mode} \cdot \left( R_t(m, Q_p) + R_h(m, Q_p) \right) \tag{13}$$

where D(·) stands for a distortion function between the original and reconstructed video signals; Rt(·) and Rh(·) are the number of texture and header bits, respectively, associated with candidate mode m; and Lagrangian multiplier λmode depends on Qp. The RDO-based mode decision in SVC causes a "chicken-and-egg" dilemma [16] when performing rate control, as it does in H.264/AVC. This is because Qp is required for the λmode calculation before the RDO process; however, the residual signal and its related information, which are usually used to determine Qp, are available only after carrying out RDO. In order to address this Qp interdependence problem between RDO and rate control, the two-stage Qp-determination scheme [23] is applied in this paper. At the first stage, Qp, denoted by Qp1 (with the corresponding quantization step size being Qstep1), is used to perform RDO for all the MBs inside a frame. At the second stage, the other Qp, denoted by Qp2 (with the corresponding quantization step size being Qstep2), is employed for quantizing the residual signal. The rationale behind the two-stage Qp method lies in the fact that a small mismatch of these
Fig. 5. Comparison of the MAD-based and α-based complexity measures. Results for the sequence of (a)–(d) Soccer and (e)–(h) Akiyo. The BL is in the QCIF, and the EL is in the CIF. (a) BL. (b) EL. (c) BL. (d) EL. (e) BL. (f) EL. (g) BL. (h) EL.
two Qp values keeps the coding performance more or less intact [21], [23], although, theoretically speaking, these two Qp values should be equal to obtain the optimal RD performance. Note that the computational overhead introduced by the proposed two-stage Qp scheme is negligible because only an additional quantization procedure with Qp2 is needed, and the quantization complexity is very low compared with that of the entire coding process.

During video coding, to gain a smooth quality for a video sequence, the variation in the Qp values of the consecutive frames is usually limited to a small range. Therefore, for the current frame to be coded, the Qp1 for the first-stage RDO process is predicted from the values of the Qp1 and Qp2 of the previous frame in the same spatial layer as

$$Q_{p1}(k,i) = w_q \cdot Q_{p2}(k,i-1) + (1-w_q) \cdot Q_{p1}(k,i-1) \tag{14}$$

where (k, i) stands for the ith frame in the kth spatial layer and wq is the weighting parameter that is empirically set to 0.7. Regarding the Qp2 for the second-stage quantization, it is calculated based on the Cauchy distribution-based R-Q model [20], given the number of target bits and bit consumption information available during the first-stage RDO process. More specifically, after the first-stage RDO process for the ith frame in the kth spatial layer, we can obtain the number of texture bits (Ct(k, i)) and the number of header bits (Ch(k, i)). According to the Cauchy distribution-based R-Q model in (1), frame-complexity measure α(k, i) for the ith frame in the kth spatial layer can be derived as

$$\alpha(k,i) = C_t(k,i) \cdot \left[ Q_{step1}(k,i) \right]^{-\beta(k,i)}. \tag{15}$$

Therefore, the number of texture bits generated by Qstep2(k, i) at the second stage is

$$R_t(k,i) = \alpha(k,i) \cdot Q_{step2}(k,i)^{\beta(k,i)}. \tag{16}$$

The number of header bits mainly comes from motion vectors, reference index, mode type, etc., which are decided at the first stage by Qstep1. Therefore, it is hardly affected by Qstep2, and the total number of bits generated at the second stage, including both the texture and header bits, can be written as

$$R_{tot}(k,i) = C_h(k,i) + \alpha(k,i) \cdot Q_{step2}(k,i)^{\beta(k,i)}. \tag{17}$$

With (17), we can calculate Qstep2(k, i) and, thus, the corresponding Qp2(k, i) for the second-stage quantization. Based on the aforementioned analysis, it is convenient to utilize the α-based complexity measure with (15) to derive Qstep2. On the other hand, for the sake of accuracy, it has been demonstrated in [23] that the proposed complexity measure is superior to the traditional MAD-based complexity measure, which is usually used by the classical second-order R-Q model as

$$R_{tot} = \alpha_1 \cdot \frac{\mathrm{MAD}}{Q_{step}} + \alpha_2 \cdot \frac{\mathrm{MAD}}{Q_{step}^2} \tag{18}$$

where α1 and α2 are the model parameters. The comparison between the proposed α-based complexity measure and the MAD-based complexity measure is illustrated for SVC in Fig. 5, where two typical test scenarios are presented. The figure shows that the proposed α-based complexity measure is more accurate than the MAD-based complexity measure for modeling SVC R-Q relations.

B. Parameter Update
In the Cauchy distribution-based R-Q model with (1), parameter β is related to the distribution of DCT coefficients and plays a key role in model accuracy. Traditionally, it is limited to a set of predefined constant values, e.g., in [20], β is specified to be {−0.75, −0.8, −0.85} for the I frame, {−1.2, −1.4, −1.6} for the P frame, and {−1.6, −1.8, −2}
for the B frame. However, the distribution of the actual DCT coefficients varies significantly across sequences and even across frames of the same sequence [23]. Therefore, it is desirable to set β adaptively for the video frames according to local characteristics. According to (16), β(k, i) for the ith frame in the kth spatial layer can be obtained after encoding the corresponding frame as

$$\beta(k,i) = \frac{\ln\left( \hat{R}_t(k,i) / C_t(k,i) \right)}{\ln\left( Q_{step2}(k,i) / Q_{step1}(k,i) \right)} \tag{19}$$

where R̂t(k, i) is the number of texture bits actually generated. However, the actual β(k, i) value cannot be derived with (19) since R̂t(k, i) and Qstep2(k, i) are inaccessible until the encoding process of the (k, i)th frame is completed. Thus, it is necessary to design an efficient β-prediction scheme. For the SVC spatial scalability, two kinds of methods can be used for β prediction: one predicts β along the temporal direction, and the other predicts it along the spatial direction. The temporal prediction method is proposed in [23] to predict β for the H.264/AVC rate control. It can be extended for β prediction for SVC as

$$\beta_T(k,i) = w_\beta \cdot \beta(k,i-1) + (1-w_\beta) \cdot \beta_T(k,i-1) \tag{20}$$

where βT(k, i) is the temporal prediction of β for the (k, i)th frame; β(k, i − 1) is obtained by (19) for the previous frame (i.e., the (k, i − 1)th frame); and wβ is the weighting parameter with the typical value of 0.7, based on the experiments.

On the other hand, for a frame belonging to a spatial EL, its β can also be predicted from that of the colocated frame in a lower spatial layer because the corresponding frames in the consecutive spatial layers possess almost the same video content but with different image sizes. In order to investigate the β relationship between the consecutive spatial-layer frames, extensive experiments are performed with a large number of benchmark video sequences. Due to the space limit, only six typical results about the β relationship between the two spatial layers are illustrated in Fig. 6. Based on the results, the relation between β values for consecutive spatial-layer frames can be expressed as

$$\beta_S(k,i) = a \cdot \beta(k-1,i) + b \tag{21}$$

where βS(k, i) is the spatially predicted β value for the (k, i)th frame; β(k − 1, i) is the β value for the (k − 1, i)th frame (i.e., the colocated frame in a lower spatial layer) calculated from (19); and a and b are the model parameters, which are updated by utilizing previously coded frames according to least-square minimization of the error between the actual and predicted β values. More specifically, the update of a and b can be written as

$$a = \frac{\sum_{i=1}^{n} \beta(k,i)\,\beta(k-1,i) - \frac{1}{n} \sum_{i=1}^{n} \beta(k,i) \sum_{i=1}^{n} \beta(k-1,i)}{\sum_{i=1}^{n} \beta^2(k-1,i) - \frac{1}{n} \left( \sum_{i=1}^{n} \beta(k-1,i) \right)^2} \tag{22}$$

$$b = \frac{1}{n} \sum_{i=1}^{n} \beta(k,i) - \frac{a}{n} \sum_{i=1}^{n} \beta(k-1,i) \tag{23}$$
Fig. 6. Quasi-linear relation between βBL and βEL . (a) Akiyo. (b) Football. (c) Silent. (d) Foreman. (e) Container. (f) Coastguard.
where n is the number of previously coded frames. In this paper, the least-square minimization method is applied for parameter update due to its computational simplicity. To compare the accuracy of the temporal and spatial prediction methods for the spatial EL, the prediction error is studied as

$$E_\beta = \frac{1}{N} \sum_{i=1}^{N} \frac{\left| \beta^i - \hat{\beta}^i \right|}{\left| \hat{\beta}^i \right|} \tag{24}$$

where N is the number of evaluation frames, β^i is the predicted value of the ith evaluation frame calculated by (20) or (21), and β̂^i is the actual value of the ith evaluation frame computed with (19). As shown in Table III, the prediction performance of the spatial prediction method is better than that of the temporal prediction method. Therefore, in this paper, the temporal prediction method is applied to predict β for the BL, and the spatial prediction method is used for the spatial ELs.
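The frame-level steps of this section, namely the first-stage Qp prediction in (14), the second-stage step size obtained from (15) and (17), the measured β in (19), and the two β predictors in (20) and (21), can be sketched in Python as follows. This is an illustrative sketch rather than the JSVM implementation: the function names are hypothetical, ρ = 0.625 is an assumed value for the constant in (3), the guards against degenerate inputs are additions, and the fit uses the textbook closed-form least-squares solution.

```python
import math

RHO = 0.625              # assumed value of rho in (3); not given in the text
ZETA = math.log(2) / 6   # Qstep doubles for every increment of 6 in Qp


def qp_to_qstep(qp):
    """Exponential Qp-to-Qstep mapping, cf. (3)."""
    return RHO * math.exp(ZETA * qp)


def qstep_to_qp(qstep):
    """Inverse of (3)."""
    return math.log(qstep / RHO) / ZETA


def first_stage_qp(prev_qp1, prev_qp2, w_q=0.7):
    """Qp1 of the current frame from the previous frame's Qp1 and Qp2, cf. (14)."""
    return w_q * prev_qp2 + (1 - w_q) * prev_qp1


def second_stage_qstep(target_bits, header_bits, texture_bits, qstep1, beta):
    """Qstep2 that meets target_bits according to (15) and (17)."""
    alpha = texture_bits * qstep1 ** (-beta)        # (15)
    budget = max(target_bits - header_bits, 1.0)    # assumed guard
    return (budget / alpha) ** (1.0 / beta)         # (17) solved for Qstep2


def measured_beta(actual_texture_bits, texture_bits, qstep1, qstep2):
    """Beta observed after encoding, cf. (19); None if both step sizes match."""
    if qstep1 == qstep2:
        return None  # (19) is undefined here; the caller keeps the old value
    return (math.log(actual_texture_bits / texture_bits)
            / math.log(qstep2 / qstep1))


def temporal_beta(prev_measured, prev_predicted, w_beta=0.7):
    """Temporal prediction of beta, cf. (20)."""
    return w_beta * prev_measured + (1 - w_beta) * prev_predicted


def fit_spatial(beta_lower, beta_upper):
    """Least-squares a, b such that beta_upper ~ a * beta_lower + b, cf. (21)."""
    n = len(beta_lower)
    sx, sy = sum(beta_lower), sum(beta_upper)
    sxy = sum(x * y for x, y in zip(beta_lower, beta_upper))
    sxx = sum(x * x for x in beta_lower)
    a = (sxy - sx * sy / n) / (sxx - sx * sx / n)   # needs >= 2 distinct x
    b = sy / n - a * sx / n
    return a, b
```

As a small check of the model's behavior, with 1000 texture bits produced at Qstep1 = 10, β = −1, 500 header bits, and a target of 2500 bits, `second_stage_qstep` returns 5: under this β, halving the step size doubles the texture bits. The sketch does not handle frames with zero texture bits, for which (15) gives α = 0.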
IV. EXPERIMENTAL RESULTS

The proposed algorithm is implemented in the SVC reference software joint scalable video model (JSVM) 9.17 [26]. In order to evaluate the rate-control performance of the proposed algorithm, various benchmark video sequences are tested. Some of the typical simulation parameters are summarized in Table IV,
TABLE III β PREDICTION ERROR
TABLE V SUMMARY OF THE BIT-ACHIEVEMENT RESULTS
TABLE IV SUMMARY OF THE SIMULATION PARAMETERS
Fig. 7. Bit achievement with constant target bit setting in both the BL and EL. Target bits are set to 10 000 and 80 000 for each frame in the BL and ELs, respectively.
whereas the other parameters are set to the defaults of the reference software. For performance comparison, the benchmark rate-control algorithm of [12] (denoted by Liu2008 in the rest of this paper) for SVC spatial scalability is implemented. For a fair comparison, the bit-allocation scheme in [12] is adopted for both Liu2008 and the proposed algorithm.
A. Accuracy of Bit Achievement
First, to test the performance of the proposed R-Q model in Section III, constant target bits are set for each frame in the BL and ELs [i.e., 10 000 at the BL (QCIF) and 80 000 at the EL (CIF)]. Liu2008 is also run for comparison, and the bit achievement is presented in Fig. 7. The bits of the proposed algorithm swiftly converge to the target bits of both layers within the first 20 frames; then, both layers
Fig. 8. Comparative results of buffer occupancy. (a) BL. (b) EL.
can obtain relatively precise and stable bit achievement. Liu2008 generally achieves a bit count around the target, but not as accurately or as stably. Second, the bit-allocation scheme is tested, and the accuracy of bit achievement is investigated in terms of the mismatch error Eb between the
TABLE VI SUMMARY OF THE RD PERFORMANCE RESULTS WITH THE {BL AT QCIF, EL AT CIF} SETTING
number of target bits and the number of actual output bits, which is defined as

$$E_b = \frac{1}{N}\sum_{k=1}^{N}\frac{\left|B^{k}_{\mathrm{target}} - B^{k}_{\mathrm{actual}}\right|}{B^{k}_{\mathrm{target}}} \tag{25}$$

where N is the total number of evaluated frames and $B^{k}_{\mathrm{target}}$ and $B^{k}_{\mathrm{actual}}$ are the numbers of target and actual output bits of the kth frame, respectively. In Table V, the BL and ELs are encoded with sequences in QCIF and CIF, respectively. Two sets of target BRs are applied for the experiments, i.e., {BL at 128 kb/s, EL at 512 kb/s} and {BL at 512 kb/s, EL at 1024 kb/s}. As shown in Table V, the proposed method achieves more accurate bit achievement in both the BL and ELs than Liu2008.
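As a concrete reading of (25), the mismatch error is the mean relative deviation between the target and actual bits per frame. The following Python sketch shows the computation; the function name and the sample bit counts are hypothetical and serve only to illustrate the arithmetic.

```python
def bit_mismatch_error(target_bits, actual_bits):
    """Mean relative mismatch between target and actual bits per frame."""
    if len(target_bits) != len(actual_bits) or not target_bits:
        raise ValueError("need equally sized, non-empty sequences")
    total = sum(abs(t - a) / t for t, a in zip(target_bits, actual_bits))
    return total / len(target_bits)


# Illustrative values only: a constant target of 10 000 bits per frame.
targets = [10000, 10000, 10000, 10000]
actuals = [10500, 9800, 10000, 10200]
e_b = bit_mismatch_error(targets, actuals)
# Per-frame relative errors are 0.05, 0.02, 0.00, and 0.02, so the mean is
# about 0.0225, i.e., roughly a 2.25% mismatch.
```

A smaller value indicates that the encoder follows its per-frame bit budget more closely.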
B. Buffer Regulation
To prevent buffer underflow or overflow, an efficient rate-control algorithm should keep the number of output bits and the buffer occupancy at a suitable level. To compare the proposed two-stage Qp algorithm with Liu2008, the performance on buffer status management is investigated with concatenated sequences containing scene changes. The BL of the concatenated sequence consists of the first 50 frames of each of five standard QCIF video sequences, namely, “Mobile,” “Silent,” “Foreman,” “Akiyo,” and “Highway,” and the EL comprises the corresponding CIF frames. The comparative results of buffer occupancy for Liu2008 and the proposed algorithm are illustrated in Fig. 8, where the target BRs are set as {BL at 256 kb/s, EL at 1000 kb/s}. The results show that the proposed algorithm maintains the buffer status more stably than Liu2008. This is because the proposed improved R-Q model and two-stage Qp scheme depict the R-Q relationship more accurately, even for scene-change video contents. In contrast, Liu2008 predicts the complexity of the frame being coded mainly from neighboring frames, and such a prediction scheme may cause severe errors when a scene change occurs.
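The buffer model behind Fig. 8 is not spelled out in this section. As a rough illustration of what a buffer-occupancy trace means, the sketch below assumes a common leaky-bucket virtual buffer: each coded frame adds its bits, and the channel drains a fixed number of bits per frame interval. The function name, the frame rate, the clamping at zero, and the sample bit counts are all assumptions made for illustration, not details from the paper.

```python
def buffer_occupancy(frame_bits, bit_rate, frame_rate, initial=0.0):
    """Trace of a leaky-bucket virtual buffer (assumed model).

    frame_bits : actual bits produced for each coded frame
    bit_rate   : channel rate in bits per second
    frame_rate : frames per second
    """
    drain_per_frame = bit_rate / frame_rate
    level = initial
    trace = []
    for bits in frame_bits:
        # Add the new frame, drain one frame interval, never go below empty.
        level = max(level + bits - drain_per_frame, 0.0)
        trace.append(level)
    return trace


# Illustrative values only: 300 kb/s at 30 frames/s drains 10 000 bits/frame.
trace = buffer_occupancy([12000, 9000, 8000, 15000],
                         bit_rate=300_000, frame_rate=30)
```

Under this assumed model, a frame that consumes far more bits than predicted, as can happen at a scene change, shows up as a jump in the trace, which is the kind of instability the comparison in Fig. 8 is meant to expose.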
TABLE VII SUMMARY OF THE RD PERFORMANCE RESULTS WITH THE {BL AT CIF, EL AT 4CIF} SETTING
Fig. 8 also reflects, through the buffer status, the stability of the generated BR. Owing to the severe content changes in the sequence, the bit stream of Liu2008 is unstable, particularly at the junctions between the different sequences. The proposed algorithm, by contrast, outputs a very stable bit stream after the initial stage. This illustrates the adaptiveness and robustness of the proposed algorithm in BR control.
C. RD Performance
Finally, the RD performance is studied to demonstrate the effectiveness of the proposed R-Q model and the adaptive Qp-initialization scheme. In addition to Liu2008, the rate-control algorithm JVT-W043 [27], which has been implemented in JSVM for BL rate control, is extended to the spatial ELs for comparison. To evaluate the RD performance, two criteria defined in [28] are employed: the average peak signal-to-noise ratio (PSNR) difference (Bjøntegaard Delta-PSNR, BD-PSNR; in decibels) and the average BR difference (Bjøntegaard Delta-BR, BD-BR; in percent). The algorithm JVT-W043 is used as the comparison basis to calculate the BD-PSNR and BD-BR performances for the algorithm Liu2008 and the proposed
algorithm. To evaluate the proposed Qp-initialization scheme, the proposed rate-control algorithm is tested under two conditions. Under the first condition, the Qp-initialization scheme introduced in Section II is not applied, and only the proposed rate-control algorithm with the improved R-Q model described in Section III is used. This configuration is denoted by “ALG1” in the following. Under the second condition, the proposed Qp-initialization scheme is also enabled; this configuration is denoted by “ALG2.” The results are summarized in Tables VI and VII, where “L” stands for layer, with “0” representing the BL and “1” representing the EL; “Avg” indicates the average results; and the video formats are set as {BL at QCIF, EL at CIF} in Table VI and {BL at CIF, EL at 4CIF} in Table VII. Four sets of target BRs are used: {BL at 64, 128, 256, and 512 kb/s, EL at 512, 768, 1024, and 1500 kb/s} in Table VI and {BL at 300, 400, 500, and 600 kb/s, EL at 4, 5, 6, and 7 Mb/s} in Table VII. In Tables VI and VII, besides the BD-PSNR (denoted by PBD) and BD-BR (denoted by RBD) performances, the PSNR and BR (in kilobits per second) results are also listed for each of the test algorithms. The results show that, with the
same Qp-initialization scheme, the proposed algorithm ALG1 outperforms the other two algorithms, JVT-W043 and Liu2008, in RD performance; e.g., as indicated in Table VI, ALG1 achieves BD-PSNR gains of 0.18 and 0.09 dB over Liu2008 for the BL and ELs, respectively. When the proposed Qp-initialization scheme is applied, ALG2 further improves the RD performance over ALG1. For example, in Table VI, ALG2 obtains BD-PSNR results that are 0.27 and 0.15 dB better than those of ALG1 for the BL and the EL, respectively.
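For readers unfamiliar with the BD-PSNR criterion of [28], the following Python sketch outlines the usual computation: each RD curve is approximated by a third-order polynomial of PSNR versus the logarithm of the bit rate, and the two polynomials are averaged over their common rate interval. This is a simplified illustration that assumes exactly four rate points per curve (so the cubic passes through all of them); the function names and the PSNR values are hypothetical and are not results from Tables VI and VII.

```python
import math


def _fit_cubic(xs, ys):
    """Coefficients c0..c3 of the cubic through four (x, y) points."""
    rows = [[1.0, x, x * x, x ** 3, y] for x, y in zip(xs, ys)]
    for col in range(4):
        # Gaussian elimination with partial pivoting.
        pivot = max(range(col, 4), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, 4):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, 5):
                rows[r][c] -= factor * rows[col][c]
    coeffs = [0.0] * 4
    for r in range(3, -1, -1):
        known = sum(rows[r][c] * coeffs[c] for c in range(r + 1, 4))
        coeffs[r] = (rows[r][4] - known) / rows[r][r]
    return coeffs


def _integral(coeffs, lo, hi):
    """Definite integral of the polynomial with the given coefficients."""
    def antiderivative(x):
        return sum(c * x ** (k + 1) / (k + 1) for k, c in enumerate(coeffs))
    return antiderivative(hi) - antiderivative(lo)


def bd_psnr(rates_ref, psnr_ref, rates_test, psnr_test):
    """Average PSNR difference (test minus reference) in decibels."""
    if not all(len(v) == 4 for v in (rates_ref, psnr_ref, rates_test, psnr_test)):
        raise ValueError("this sketch expects exactly four points per curve")
    x_ref = [math.log10(r) for r in rates_ref]
    x_test = [math.log10(r) for r in rates_test]
    lo = max(min(x_ref), min(x_test))
    hi = min(max(x_ref), max(x_test))
    if hi <= lo:
        raise ValueError("the two curves share no common rate interval")
    area_ref = _integral(_fit_cubic(x_ref, psnr_ref), lo, hi)
    area_test = _integral(_fit_cubic(x_test, psnr_test), lo, hi)
    return (area_test - area_ref) / (hi - lo)


# Illustrative values only: the test curve is the reference shifted up 0.5 dB.
rates_kbps = [64, 128, 256, 512]
psnr_reference = [30.0, 33.0, 36.0, 39.0]
psnr_shifted = [p + 0.5 for p in psnr_reference]
gain = bd_psnr(rates_kbps, psnr_reference, rates_kbps, psnr_shifted)
```

Since the second curve is a uniform 0.5-dB shift of the first, the computed gain is 0.5 dB up to rounding. BD-BR is obtained analogously by swapping the roles of the axes, i.e., fitting the logarithm of the rate as a function of PSNR.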
V. CONCLUSION
In this paper, an efficient rate-control algorithm has been presented for the SVC spatial layer. In the proposed rate-control scheme, an adaptive Qp-initialization scheme is first introduced, which considers not only the target BR constraints but also the video sequence content. To decouple the Qp interdependence problem between rate control and the RDO process, a two-stage Qp-determination strategy based on the Cauchy distribution-based R-Q model is proposed for each spatial layer. In the proposed two-stage method, effective methods for complexity prediction and model-parameter update are designed to further improve rate-control performance. Experimental results demonstrate that the proposed rate-control algorithm is superior to the other two rate-control algorithms [27] and [12].

REFERENCES
[1] C.-H. Wu and J. D. Irwin, “Multimedia and multimedia communication: A tutorial,” IEEE Trans. Ind. Electron., vol. 45, no. 1, pp. 4–14, Feb. 1998.
[2] J. Silvestre-Blanes, R. Marau, L. Almeida, and P. Pedreiras, “On-line QoS management for multimedia real-time transmission in industrial networks,” IEEE Trans. Ind. Electron., vol. 58, no. 3, pp. 1061–1071, Mar. 2011.
[3] H. Schwarz, D. Marpe, and T. Wiegand, “Overview of the scalable video coding extension of the H.264/AVC standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 9, pp. 1103–1120, Sep. 2007.
[4] T. Wiegand, G. Sullivan, J. Reichel, H. Schwarz, and M. Wien, “Joint draft 11: Scalable video coding,” Doc. JVT-X201, Geneva, Switzerland, Jul. 2007.
[5] Q. Hu and S. Panchanathan, “Image/video spatial scalability in compressed domain,” IEEE Trans. Ind. Electron., vol. 45, no. 1, pp. 23–31, Feb. 1998.
[6] H. Schwarz, D. Marpe, and T. Wiegand, “Analysis of hierarchical B pictures and MCTF,” in Proc. IEEE ICME, Jul. 2006, pp. 1929–1932.
[7] Z. He and D. O. Wu, “Linear rate control and optimum statistical multiplexing for H.264 video broadcast,” IEEE Trans. Multimedia, vol. 10, no. 7, pp. 1237–1249, Nov. 2008.
[8] S. Hu, H. Wang, S. Kwong, and C.-C. J. Kuo, “Rate control optimization for temporal-layer scalable video coding,” IEEE Trans. Circuits Syst. Video Technol., to be published.
[9] R. Precup, S. Preitl, J. K. Tar, M. L. Tomescu, M. Takacs, P. Korondi, and P. Baranyi, “Fuzzy control system performance enhancement by iterative learning control,” IEEE Trans. Ind. Electron., vol. 55, no. 9, pp. 3461–3475, Sep. 2008.
[10] J. Han, “From PID to active disturbance rejection control,” IEEE Trans. Ind. Electron., vol. 56, no. 3, pp. 900–906, Mar. 2009.
[11] L. Xu, W. Gao, X. Ji, and D. Zhao, “Rate control for hierarchical B-picture coding with scaling-factors,” in Proc. IEEE ISCAS, May 2007, pp. 49–52.
[12] Y. Liu, Z. G. Li, and Y. C. Soh, “Rate control of H.264/AVC scalable extension,” IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 1, pp. 116–121, Jan. 2008.
[13] Y. Cho, J. Liu, D.-K. Kwon, and C.-C. J. Kuo, “H.264/SVC temporal bit allocation with dependent distortion model,” in Proc. IEEE ICASSP, Apr. 2009, pp. 641–644.
[14] J. Liu, Y. Cho, and Z. Guo, “Frame-based bit allocation for spatial scalability in H.264/SVC,” in Proc. ICME, Jun. 2009, pp. 189–192.
[15] Y. Liu, Z. G. Li, and Y. C. Soh, “A novel rate control scheme for low delay video communication of H.264/AVC standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 1, pp. 68–78, Jan. 2007.
[16] Z. Li, F. Pan, K. P. Lim, G. Feng, X. Lin, and S. Rahardja, “Adaptive basic unit layer rate control for JVT,” Doc. JVT-G012-r1, Thailand, Mar. 2003.
[17] T. Chiang and Y.-Q. Zhang, “A new rate control scheme using quadratic rate distortion model,” IEEE Trans. Circuits Syst. Video Technol., vol. 7, no. 1, pp. 246–250, Feb. 1997.
[18] J. R. Corbera and S. Lei, “Rate control in DCT video coding for low-delay communications,” IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 1, pp. 172–185, Feb. 1999.
[19] S. Ma, W. Gao, and Y. Lu, “Rate-distortion analysis for H.264/AVC video coding and its application to rate control,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 12, pp. 1533–1544, Dec. 2005.
[20] N. Kamaci, Y. Altinbasak, and R. M. Mersereau, “Frame bit allocation for the H.264/AVC video coder via Cauchy density-based rate and distortion models,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 8, pp. 994–1006, Aug. 2005.
[21] D. Kwon, M. Shen, and C.-C. J. Kuo, “Rate control for H.264 video with enhanced rate and distortion model,” IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 5, pp. 517–529, May 2007.
[22] J. Dong and N. Ling, “A context-adaptive prediction scheme for parameter estimation in H.264/AVC macroblock layer rate control,” IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 8, pp. 1108–1117, Aug. 2009.
[23] S. Hu, H. Wang, S. Kwong, and T. Zhao, “Frame level rate control for H.264/AVC with novel rate-quantization model,” in Proc. ICME, Jul. 2010, pp. 226–231.
[24] H. Wang and S. Kwong, “Rate-distortion optimization of rate control for H.264 with adaptive initial quantization parameter determination,” IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 1, pp. 140–144, Jan. 2008.
[25] G. J. Sullivan and T. Wiegand, “Rate-distortion optimization for video compression,” IEEE Signal Process. Mag., vol. 15, no. 6, pp. 74–90, Nov. 1998.
[26] “Joint Scalable Video Model JSVM 9.17 Software Package,” CVS server for the JSVM software, Mar. 2009.
[27] A. Leontaris and A. M. Tourapis, “Rate control for the joint scalable video model (JSVM),” Doc. JVT-W043, California, Apr. 2007.
[28] G. Bjontegaard, “Calculation of average PSNR differences between RD-curves,” Doc. VCEG-M33, Austin, Apr. 2001.
Sudeng Hu received the B.Eng. degree from Zhejiang University, Hangzhou, China, in 2007 and the M.Phil. degree from the City University of Hong Kong, Kowloon, Hong Kong, in 2010. He is currently with the Department of Computer Science, City University of Hong Kong. His research interests include image and video compression, rate control, and scalable video coding.
Hanli Wang (M’08) received the B.S. and M.S. degrees in electrical engineering from Zhejiang University, Hangzhou, China, in 2001 and 2004, respectively, and the Ph.D. degree in computer science from the City University of Hong Kong, Kowloon, Hong Kong, in 2007. From 2007 to 2008, he was a Research Fellow with the Department of Computer Science, City University of Hong Kong, and a Visiting Scholar with Stanford University, Palo Alto, CA, invited by Prof. C. K. Chui. From 2008 to 2009, he was a Research Engineer with Precoad Inc., Menlo Park, CA. From 2009 to 2010, he was an Alexander von Humboldt Research Fellow with the University of Hagen, Hagen, Germany. In 2010, he joined the Department of Computer Science and Technology, Tongji University, Shanghai, China, where he is a Professor. His current research interests include digital video coding, image processing, pattern recognition, and video analysis.
Sam Kwong (SM’04) received the B.S. degree in electrical engineering from the State University of New York at Buffalo, Buffalo, in 1983, the M.S. degree in electrical engineering from the University of Waterloo, Waterloo, ON, Canada, in 1985, and the Ph.D. degree from the University of Hagen, Hagen, Germany, in 1996. From 1985 to 1987, he was a Diagnostic Engineer with Control Data Canada. He joined Bell Northern Research Canada as a Member of the Scientific Staff. In 1990, he became a Lecturer with the Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong, where he is currently a Professor with the Department of Computer Science. His research interests include video and image coding and evolutionary algorithms.
C.-C. Jay Kuo (F’99) received the B.S. degree in electrical engineering from the National Taiwan University, Taipei, Taiwan, in 1980, and the M.S. and Ph.D. degrees in electrical engineering from the Massachusetts Institute of Technology, Cambridge, in 1985 and 1987, respectively. He is the Director of the Signal and Image Processing Institute, University of Southern California, Los Angeles, where he is also a Professor of electrical engineering, computer science, and mathematics with the Ming Hsieh Department of Electrical Engineering and the Integrated Media Systems Center. He is the coauthor of about 190 journal papers, 810 conference papers, and ten books. His research interests include digital image/video analysis and modeling, multimedia data compression, communication and networking, and biological signal/image processing. Dr. Kuo is a Fellow of the American Association for the Advancement of Science and the International Society for Optical Engineers.