Fast Mode Decision for Spatial Scalable Video Coding
He Li1, Z. G. Li2, Changyun Wen1,* and Lap-Pui Chau1
1 School of EEE, Nanyang Technological University, Singapore 639798
2 Media Division, Institute for Infocomm Research, Singapore 119613
Abstract—Scalable Video Coding (SVC) is an ongoing standardization effort, and the current working draft (WD) is an extension of H.264/AVC. It provides scalability at the bit stream level with good compression efficiency and allows free combinations of spatial, temporal and SNR scalability. In the WD, an exhaustive search technique is employed to select the best coding mode for each macroblock. This technique achieves the highest possible coding efficiency, but at the cost of high computational complexity. To overcome this, we propose a novel fast mode decision scheme for spatial scalability in SVC. In this scheme, the relationship between the mode distributions of the base layer and the enhancement layers is exploited to reduce the candidate mode set at the enhancement layers. Experimental results show that the proposed scheme significantly reduces computational complexity without any noticeable coding loss.
*Corresponding author.
0-7803-9390-2/06/$20.00 ©2006 IEEE
ISCAS 2006

I. INTRODUCTION
Scalable video coding as proposed in [1] is an extension of H.264/MPEG-4 Advanced Video Coding (H.264/AVC). It allows specific subsets of the video bit stream to be extracted with different transport and presentation properties, e.g. different bit rates and different temporal and spatial characteristics. For spatial scalability in SVC, the base layer contains a reduced-resolution version of each coded frame. The enhancement layers are coded based on predictions formed from the base layer pictures and previously encoded enhancement layer pictures.

SVC has introduced a number of advanced features and demonstrated significant achievements in terms of coding efficiency, robustness to a variety of network channels and conditions, and breadth of application [2]. One of the new features is inter-layer prediction between the base layer and the enhancement layers [3][4]. This feature results in many more macroblock (MB) encoding modes than in other standards. The available MB modes in H.264/AVC include two intra modes (INTRA_4×4 and INTRA_16×16), seven inter modes (MODE_16×16, MODE_16×8, MODE_8×16, MODE_8×8, MODE_8×4, MODE_4×8 and MODE_4×4) and MODE_SKIP. MODE_SKIP is used in this paper to represent both the skip mode and the direct mode in B slices. For enhancement layers in SVC, two more modes, base_layer_mode and qpel_refinement_mode, are added [5]. These two modes indicate that the motion and prediction information, including the partitioning, of the corresponding MB of the base layer is used. For qpel_refinement_mode, a quarter-sample motion vector refinement is additionally transmitted and added to the derived motion vectors. Since both modes are directly predicted from the base layer, we use a single mode type, BL_pred, to represent them in this paper. BL_pred is also used to represent an intra macroblock predicted from the base layer.

To achieve the highest coding efficiency, SVC uses a computationally intensive Lagrangian rate distortion optimization (RDO) technique to decide the coding mode for each MB. That is, all modes are examined and the one with the minimum rate distortion cost is selected as the best mode for each MB. As a result, SVC achieves optimal coding performance at the expense of tremendously increased computational complexity.

A number of fast mode decision methods have been proposed for H.264/AVC; see for example [6][7][8]. They all speed up the encoding process with negligible PSNR loss and bit rate increase for single layer coding. For SVC, however, the selection of the prediction mode at enhancement layers is more challenging. Motivated by the observation that the mode distribution at the base layer is correlated with that at the enhancement layers, we propose an effective fast mode decision scheme for spatial scalable coding. The candidate prediction modes for enhancement layers are limited to a small subset, which reduces the computational complexity. Simulation results illustrate that our algorithm can save nearly 60% of the encoding time on average with negligible PSNR loss and bit rate increase for all the layers.

The rest of this paper is organized as follows. In Section II, the relationship between the mode distributions of the base layer and the enhancement layers is described. In Section III, the entire MB mode decision process is presented. Simulation results are presented in Section IV and concluding remarks are given in Section V.

II. SPATIAL SCALABILITY IN SVC
In SVC, spatial scalable coding of video is considered at multiple resolutions (e.g. QCIF, CIF and 4CIF) with a factor of 2 in horizontal and vertical resolution. An oversampled pyramid representation is used for spatial scalability, where a separate refinement of motion and texture information is deployed for each spatial resolution. The following inter-layer prediction techniques are used in the current SVC scheme [2][3]:
1) Prediction of a macroblock using the up-sampled lower resolution signal.
2) Prediction of motion vectors using the up-sampled lower resolution motion vectors.
3) Prediction of the residual signal using the up-sampled residual signal of the lower resolution layer.
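As an illustration of technique 2 in the list above, the following Python sketch scales the partitions and motion vectors of one base-layer MB by the factor of 2 between layers and then clips them to the four co-located enhancement-layer MBs. It is only a sketch under stated assumptions: the function names, the tuple layout `(x, y, w, h, (mvx, mvy))` and the example motion vectors are hypothetical and are not taken from JSVM, and motion vectors are assumed to be scaled in the same units in which they are stored.

```python
MB = 16  # macroblock size in luma samples


def upsample_partitions(base_parts, factor=2):
    """Scale the partitions of one base-layer MB, given as
    (x, y, w, h, (mvx, mvy)) tuples, by the spatial factor between layers."""
    return [(x * factor, y * factor, w * factor, h * factor,
             (mvx * factor, mvy * factor))
            for x, y, w, h, (mvx, mvy) in base_parts]


def split_into_enhancement_mbs(up_parts, factor=2):
    """Clip the up-sampled partitions to each co-located enhancement-layer MB.

    Returns a dict mapping (mb_x, mb_y) to a list of (w, h, mv) partitions."""
    result = {}
    for mb_y in range(factor):
        for mb_x in range(factor):
            x0, y0 = mb_x * MB, mb_y * MB
            parts = []
            for x, y, w, h, mv in up_parts:
                # Intersection of the scaled partition with this MB.
                ix0, iy0 = max(x, x0), max(y, y0)
                ix1, iy1 = min(x + w, x0 + MB), min(y + h, y0 + MB)
                if ix1 > ix0 and iy1 > iy0:
                    parts.append((ix1 - ix0, iy1 - iy0, mv))
            result[(mb_x, mb_y)] = parts
    return result


# Hypothetical base-layer MB coded in MODE_16x8 (two 16x8 partitions).
base_16x8 = [(0, 0, 16, 8, (3, -1)), (0, 8, 16, 8, (5, 2))]
el = split_into_enhancement_mbs(upsample_partitions(base_16x8))
for position, parts in sorted(el.items()):
    print(position, parts)
```

In this example each of the four enhancement-layer MBs ends up with a single 16×16 partition carrying a doubled motion vector.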
When the base layer represents a layer with half the spatial resolution, the motion vector field, including the MB partitioning, is scaled according to the inter-layer prediction technique, as shown in Figure 1. Take the middle figure as an example. If the corresponding base layer MB is coded in MODE_SKIP, MODE_16×16, MODE_16×8 or MODE_8×16, then MODE_16×16 is used for the current MB at the enhancement layer. Therefore, the intra and inter MBs can be predicted using the corresponding signals of previous layers. Moreover, the motion description of each layer can be used to predict the motion description of the following enhancement layers [6]. In our scheme, the base layer MB mode is used to predict the corresponding enhancement layer MB mode.

Figure 1. Up-sampling of Motion Information

By observing a large number of video sequences, the following relationships can be found:
• If the MB mode at the base layer is INTRA_4×4 or INTRA_16×16, it is probable that the corresponding four MBs at its enhancement layer are intra coded. An intra coded MB at the base layer indicates that no good match can be found in the reference frames. Similarly, at enhancement layers, it is also difficult for the up-sampled MB to find a good match in the reference frames. Table I shows, for three typical video sequences (FOOTBALL, FOREMAN and BUS), the probability that MBs at the enhancement layer are intra coded when the corresponding base layer MB is intra coded. After exhaustive search, the majority of MBs at enhancement layers are coded in BL_pred mode. This indicates that both the motion and texture information of the base layer are used to predict those at enhancement layers.
• If the MB mode at the base layer is MODE_16×16, it is probable that the corresponding MBs at its enhancement layer are coded in MODE_16×16 or MODE_SKIP. This tendency is more pronounced when the quantization parameter is large.

TABLE I. PROBABILITY OF INTRA CODED MBS AT EL
Video Sequence    Probability
FOOTBALL          98.32%
FOREMAN           97.17%
BUS               95.66%

III. PROPOSED FAST MODE DECISION SCHEME
Based on the considerations above, our fast mode decision procedure is given as follows:
• If the estimated MB mode at the base layer is INTRA_4×4, the candidate mode set at its enhancement layer is reduced to BL_pred and INTRA_4×4.
• If the estimated MB mode at the base layer is INTRA_16×16, the candidate mode set at its enhancement layer is reduced to BL_pred and INTRA_16×16.
• If the estimated MB mode at the base layer is MODE_16×16, the candidate mode set at its enhancement layer is reduced to BL_pred, MODE_16×16 and MODE_SKIP.
• If the estimated MB mode at the base layer is MODE_16×8, MODE_8×16 or MODE_8×8, the best two modes are passed to the enhancement layer to build the candidate mode set. These are the modes with the minimum and second-minimum estimated rate distortion cost at the base layer. Together with BL_pred and MODE_SKIP, the candidate mode set is reduced to four modes in this case.
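The four rules above can be condensed into a small lookup. The following Python sketch is an illustration only: the function name `candidate_modes`, the ASCII mode labels (with `x` in place of `×`) and the `best_two` argument are assumptions introduced here, not JSVM identifiers. The rules do not cover every base-layer mode (for example MODE_SKIP or the small partition sizes), so this sketch falls back to the full mode set in those cases, which is also an assumption.

```python
INTRA_MODES = ["INTRA_4x4", "INTRA_16x16"]
INTER_MODES = ["MODE_SKIP", "MODE_16x16", "MODE_16x8", "MODE_8x16",
               "MODE_8x8", "MODE_8x4", "MODE_4x8", "MODE_4x4"]
FULL_SET = ["BL_pred"] + INTER_MODES + INTRA_MODES


def candidate_modes(base_mode, best_two=None):
    """Return the enhancement-layer candidate mode set for one MB.

    base_mode: best mode chosen at the base layer for the co-located MB.
    best_two:  the two base-layer modes with the lowest estimated RD cost;
               only needed for MODE_16x8, MODE_8x16 and MODE_8x8.
    """
    if base_mode in INTRA_MODES:
        return ["BL_pred", base_mode]
    if base_mode == "MODE_16x16":
        return ["BL_pred", "MODE_16x16", "MODE_SKIP"]
    if base_mode in ("MODE_16x8", "MODE_8x16", "MODE_8x8"):
        if best_two is None or len(best_two) != 2:
            raise ValueError("best_two must contain exactly two modes")
        modes = ["BL_pred", "MODE_SKIP"]
        for mode in best_two:
            if mode not in modes:  # avoid duplicates, so at most four modes
                modes.append(mode)
        return modes
    # Assumption: base-layer modes not covered by the rules keep the full set.
    return list(FULL_SET)


print(candidate_modes("INTRA_4x4"))
print(candidate_modes("MODE_16x16"))
print(candidate_modes("MODE_8x8", best_two=["MODE_8x8", "MODE_16x8"]))
```

With this lookup, the enhancement-layer encoder evaluates two to four candidates for the covered base-layer modes instead of the eleven entries of `FULL_SET`.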
A flowchart of the overall process of the proposed scheme is shown in Figure 2. The MB mode at the base layer is estimated using the exhaustive search technique, as in JSVM 2.0. MODE_BL denotes the best mode selected at the base layer for a given MB. MODE_ELpred1 and MODE_ELpred2 are the best two modes found at the base layer, and they form part of the candidate mode set for enhancement layers. They are obtained after rate distortion optimization at the base layer by comparing the rate distortion costs.
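The selection of the best two base-layer modes can be sketched as follows. The sketch assumes the usual Lagrangian cost form J = D + λ·R; the function names, the distortion and rate figures and the value of λ are hypothetical and serve only to show the ranking step, not how JSVM computes its costs.

```python
def rd_cost(distortion, rate_bits, lam):
    """Lagrangian rate distortion cost J = D + lambda * R."""
    return distortion + lam * rate_bits


def best_two_modes(measurements, lam):
    """Rank base-layer modes by RD cost and return the two cheapest.

    measurements maps a mode name to a (distortion, rate_bits) pair
    obtained by actually coding the MB in that mode."""
    costs = {mode: rd_cost(d, r, lam) for mode, (d, r) in measurements.items()}
    ranked = sorted(costs, key=costs.get)
    return ranked[:2], costs


# Hypothetical (distortion, rate) measurements for one base-layer MB.
measurements = {
    "MODE_SKIP": (2600.0, 1),
    "MODE_16x16": (1500.0, 40),
    "MODE_16x8": (1200.0, 62),
    "MODE_8x16": (1250.0, 60),
    "MODE_8x8": (1100.0, 110),
}

best, costs = best_two_modes(measurements, lam=10.0)
mode_bl = best[0]      # best base-layer mode (MODE_BL in the text)
mode_el_pred = best    # the two modes passed on to the enhancement layer
print(mode_bl, mode_el_pred)
```

Because the base layer is searched exhaustively anyway, ranking its costs adds no extra mode evaluations; only a sort over values that are already available.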
As a result, in our scheme, the candidate mode set at enhancement layers is reduced and a reduction in complexity is achieved.

Figure 2. Flowchart of Proposed Fast Mode Decision at EL

IV. SIMULATION RESULTS
The proposed fast mode decision scheme is embedded in the JSVM 2.0 encoder [5]. The test platform is an Intel Pentium IV 1.83 GHz CPU with 256 MB of RAM running the Windows XP Professional operating system. The test conditions are shown in Table II. In our experiments, three standard test sequences, FOREMAN, BUS and FOOTBALL, are tested with QP values ranging from 10 to 40. We only consider the two-layer case, and the same QP value is used for both the base layer and the enhancement layer. The type of all the sequences is IBBBB, and the GOP size is set to 8. All sequences are 32 frames long. They represent sequences with large and medium motion.

The quantities measured in our experiments are the average time saving, the Y-PSNR and the bit rate of the enhancement layer. We use Time Saving (TS) to indicate the average time saving in the encoding process:

$$TS = \frac{T_{JSVM} - T_{proposed}}{T_{JSVM}} \times 100\%$$

where $T_{JSVM}$ and $T_{proposed}$ are the encoding times of the JSVM 2.0 encoder and of the encoder modified according to the proposed fast mode decision method, respectively.

TABLE II. SIMULATION CONDITION
                            FOOTBALL    FOREMAN    BUS
Resolution (base)           QCIF
Resolution (enhancement)    CIF
Frame rate                  15 Hz
Total number of frames      32          32         32
QP value                    10, 20, 30 and 40
Coding options used         Rate distortion optimization is enabled. MV search range is ±32 pels. Reference frame number is equal to 1. MV resolution is 1/4 pel.
Codec                       JSVM 2.0 encoder

TABLE III. SIMULATION RESULT FOR FOREMAN
      JSVM Scheme                       Proposed Scheme
QP    Bit rate (kbits/s)   PSNR (dB)    Bit rate (kbits/s)   PSNR (dB)    TS (%)
40    101.565              30.9086      100.695              30.7742      61.65
30    314.8425             36.4844      317.6475             36.3905      56.03
20    1048.005             41.6713      1037.535             41.7071      45.47
10    3882.203             48.1106      3882.968             48.0637      41.28

TABLE IV. SIMULATION RESULT FOR BUS
      JSVM Scheme                       Proposed Scheme
QP    Bit rate (kbits/s)   PSNR (dB)    Bit rate (kbits/s)   PSNR (dB)    TS (%)
40    267.435              27.5573      267.75               27.4185      62.47
30    805.44               33.4307      812.64               33.4244      51.72
20    2360.07              40.2659      2376.495             40.256       40.54
10    6182.745             48.3774      6200.79              48.3099      37.27

TABLE V. SIMULATION RESULTS FOR FOOTBALL
      JSVM Scheme                       Proposed Scheme
QP    Bit rate (kbits/s)   PSNR (dB)    Bit rate (kbits/s)   PSNR (dB)    TS (%)
40    325.425              27.0371      326.8425             26.96        54.67
30    1125.158             33.5802      1129.44              33.5579      46.88
20    3152.648             41.109       3156.458             41.0981      46.41
10    7081.838             48.7973      7098.113             48.7468      49.70
Tables III to V show the coding results of the original JSVM scheme and our proposed scheme. Only the enhancement layer Y-PSNR and bit rate results are shown in the tables. The results show that the proposed method is very effective in reducing the encoding time. The total encoding time is reduced by about 60% on average. Moreover, our scheme achieves consistent time saving over a large bit rate range with negligible loss in PSNR and negligible increase in bit rate. Figures 3 to 5 present the rate distortion (RD) curves drawn from the data in Tables III to V. They describe the rate distortion relationship for the FOREMAN, BUS and FOOTBALL sequences. From these figures, we can see that there is little difference between the two RD curves in each figure. Comparing Figures 3 and 5, the difference between the two RD curves at high bit rates is larger for the FOOTBALL sequence. This is because FOOTBALL is a fast-motion sequence with fine details, so the motion correlation between the base layer and the enhancement layer is lower than for a low-motion sequence. Overall, our proposed scheme causes almost no degradation in coding efficiency or picture quality.
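The time-saving metric TS and the per-row coding loss discussed above can be computed as in the following Python sketch. The function names and the encoding times are hypothetical (the paper reports only the resulting TS percentages, not raw timings); the bit rate and PSNR figures are the FOREMAN values at QP 40 from Table III.

```python
def time_saving(t_jsvm, t_proposed):
    """TS = (T_JSVM - T_proposed) / T_JSVM * 100, in percent."""
    if t_jsvm <= 0:
        raise ValueError("reference encoding time must be positive")
    return (t_jsvm - t_proposed) / t_jsvm * 100.0


def coding_loss(rate_ref, psnr_ref, rate_new, psnr_new):
    """Return (bit rate change in percent, PSNR change in dB)
    of the proposed scheme relative to the reference encoder."""
    rate_change = (rate_new - rate_ref) / rate_ref * 100.0
    return rate_change, psnr_new - psnr_ref


# Hypothetical timings in seconds (not measurements from the paper).
ts = time_saving(t_jsvm=200.0, t_proposed=80.0)

# FOREMAN at QP 40, enhancement layer, values from Table III.
rate_change, psnr_change = coding_loss(101.565, 30.9086, 100.695, 30.7742)

print(f"TS = {ts:.2f}%")
print(f"bit rate change = {rate_change:+.2f}%, PSNR change = {psnr_change:+.4f} dB")
```

For the FOREMAN row at QP 40 this gives a bit rate change of about -0.86% and a PSNR change of about -0.13 dB.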
Figure 3. RD Curve for FOREMAN
Figure 4. RD Curve for BUS
Figure 5. RD Curve for FOOTBALL

V. CONCLUSION
Current scalable video coding relies on exhaustive search to determine the best mode for each macroblock. This technique has high computational complexity and is therefore time-consuming. To reduce the computational complexity while maintaining high coding efficiency, we proposed an efficient mode decision scheme at the enhancement layer for spatial scalability. With this scheme, the candidate mode set at enhancement layers is reduced. Simulation results show that the proposed method significantly reduces computational complexity with only negligible coding loss. In this paper, only mode types with large partition sizes are considered in the mode reduction method at enhancement layers. For small partition sizes, such as MODE_8×4, MODE_4×8 and MODE_4×4, further investigation is needed.

REFERENCES
[1] H. Schwarz, T. Hinz, H. Kirchhoffer, D. Marpe, and T. Wiegand, “Technical Description of the HHI proposal for SVC CE 1,” ISO/IEC JTC1/WG11, Doc. M11244, Palma de Mallorca, Spain, Oct. 2004.
[2] J. Reichel, H. Schwarz, and M. Wien, “Scalable Video Coding – Working Draft 1,” Joint Video Team (JVT), Doc. JVT-N020, Hong Kong, CN, Jan. 2005.
[3] H. Schwarz, D. Marpe, and T. Wiegand, “Inter-layer Prediction of Motion and Residual Data,” ISO/IEC JTC 1/SC 29/WG 11/M11043, USA, July 2004.
[4] Z. G. Li, X. K. Yang, K. P. Lim, X. Lin, S. Rahardja, and F. Pan, “Scalable Video Coding with Grid Motion Estimation and Compensation,” US Provisional Patent Application (I2R Ref: P2004003-ETPL Ref: I2R/P//1899/PCT), filed 23 June 2004.
[5] J. Reichel, H. Schwarz, and M. Wien, “Joint Scalable Video Model (JSVM) 2.0 Reference Encoding Algorithm Description,” ISO/IEC JTC1/SC29/WG11, Doc. N7084, Busan, Korea, April 2005.
[6] F. Pan, X. Lin, R. Susanto, K. P. Lim, Z. G. Li, G. N. Feng, D. J. Wu, and S. Wu, “Fast Mode Decision Algorithm for Intraprediction in H.264/AVC Video Coding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 7, July 2005.
[7] Q. Dai, D. Zhu, and R. Ding, “Fast Mode Decision for Inter Prediction in H.264,” International Conference on Image Processing, 2004.
[8] J. Lee and B. Jeon, “Fast Mode Decision for H.264 with Variable Motion Block Sizes,” in Proc. Int. Symp. Computer and Information Sciences (ISCIS) 2003, Nov. 2003, pp. 723-730.