H.264/AVC based Stereoscopic Video Coding Scheme using the Statistics of Block Matching
Hany Said, Akbar Sheikh Akbari, and Mohsin Abbas Malik
Abstract— This paper investigates the effect of reference frame architecture on the coding performance of the H.264/AVC based stereoscopic video codec. To achieve this, the block matching amongst the reference frames of the codec at different bitrates is statistically analyzed. Based on the resulting statistics, a reference frame architecture for best coding performance of the codec at low and medium bitrates is proposed. The codec using the proposed reference frame architecture is measured against the same codec employing typical reference frame architectures. The measurements were carried out on three standard multi-view datasets, named Break-dancers, Exit and Race1. Results show that the application of the proposed reference frame architecture significantly improves the coding performance of the codec (by up to 0.6 dB).

Keywords— H.264/AVC, Reference frame architecture, Reference frame reordering, Stereoscopic video codec, Video compression
Manuscript received February 21, 2013. H. Said, A. Sheikh Akbari and M. A. Malik are with the Faculty of Computing, Engineering and Sciences, Staffordshire University, Stafford, UK (phone: 10044-(0)1785-353469; e-mail: [email protected], [email protected], and [email protected]).

I. INTRODUCTION
STEREOSCOPIC video coding has been the subject of increasing interest due to the wide variety of its emerging applications. Storage and transmission of stereoscopic video data over limited-bandwidth channels remain a challenge. Hence, efficient stereoscopic video compression techniques are necessary to overcome these challenges [1, 2].

Many stereoscopic video coding techniques have been reported in the literature. Amongst these, the H.264/AVC-based stereoscopic video codecs are the most common [3-12]. These codecs employ the multiple reference frame feature of H.264/AVC to exploit temporal and inter-camera correlations, and they use various reference frame architectures to efficiently encode P- and B-frames [3-12].

An H.264/AVC multi-view video codec that supports stereoscopic videos was reported in [3]. When coding stereoscopic videos, the prediction structure of the codec contains two temporal, one adjacent spatial and two spatiotemporal reference frames. Superior coding performance to the corresponding simulcast coding, especially for closely located cameras, was reported. Another H.264/AVC based stereoscopic video codec was reported by Aksay et al. in [4]. They studied the concept of mixed-resolution coding of stereoscopic videos both subjectively and objectively. Their proposed codec uses two temporal reference frames to predict the frames in the left view. The frames in the right view are both temporally and spatially down-sampled and then coded using two temporal, one spatial and one spatiotemporal reference frames.

Bouyagoub et al. investigated the impact of camera separation on the coding performance of an H.264/AVC based stereoscopic video codec [5]. Their reference frame architecture contains two temporal and two spatiotemporal reference frames for coding the frames in the left view, and two temporal, one spatial and one spatiotemporal reference frames for coding the frames in the right view (as shown in Figure 1). Another H.264/AVC codec was proposed by Fehn et al. [6]. Their codec uses three temporal reference frames for predicting the left view and two temporal and one spatial reference frames for coding the right view. They investigated the impact of applying different decimation levels to the right view on the performance of an asymmetric codec. Yu et al. used the reference frame architecture proposed in [6] to investigate the impact of the direction of down-sampling (horizontal or vertical) on the performance of an asymmetric coding scheme for mobile 3DTV [7]. Yang et al. studied techniques to optimize the decoding speed of the codec using the skipped macro-blocks of the right frames [8]. They also used the reference frame architecture of Fehn et al.'s codec [6]. Fehn et al.'s reference frame architecture has been used in other reported stereoscopic video codecs, such as those of Cho et al. [9] and Kwon et al. [10].

Ugur et al. reported another stereoscopic video codec in [11]. Their proposed codec uses one temporal reference frame for coding the left view and one temporal and one spatial reference frame for coding the right view. Their research concerned the parallel coding of both views. Another H.264/AVC based stereoscopic codec, reported in [12], uses the same reference frame architecture as [11]. This codec uses a B-prediction structure for coding frames in the right view.
Their investigation covers various types of frame prediction for the right view (P- and B-prediction).

In summary, various reference frame architectures have been employed in the H.264/AVC based stereoscopic video codecs [6-12]. Amongst them, the following two architectures are the most common. Figure 2 shows the first reference frame architecture, which uses three temporal reference frames for coding the left view and two temporal and one spatial reference frames for coding the right view [6-10]. Figure 3 illustrates the second reference frame architecture, which employs one temporal reference frame for coding the left view and one temporal and one spatial reference frame for coding the right view [11-12].

Fig. 1. Reference frame architecture used in Bouyagoub's investigation [5].
Fig. 2. Reference frame architecture used in the Fehn [6], Yu [7], Yang [8], Cho [9] and Kwon [10] investigations.
Fig. 3. Reference frame architecture used in Ugur's investigation [11].

From the literature, it can be seen that various reference frame architectures have been used for coding stereoscopic videos. However, fewer investigations have been reported on reference frame selection and re-ordering. In this paper, the contribution of different reference frames to block prediction (taking into account the spatial and temporal location of the reference frames) is investigated. Based on the statistics of the block matching, a reference frame architecture for best coding performance of the codec at low and medium bitrates is proposed. The performance of the H.264/AVC based stereoscopic codec using the proposed reference frame architecture is assessed at various bitrates against the same codec using the reference frame architectures of other codecs. Results show that the application of the proposed reference frame architecture significantly improves the coding performance of the codec.

The rest of the paper is organized as follows: Section II introduces the multi-view datasets. The statistical analysis of block matching amongst reference frames is detailed in Section III. The H.264/AVC based stereoscopic video codec using the statistics of block matching is presented in Section IV, and finally the paper is concluded in Section V.

II. DATASET DESCRIPTION
Three multi-view video datasets, called Break-dancers, Race1 and Exit, have been used in this investigation. The description of each dataset is provided in Table I. These multi-view videos (MVV) have different characteristics of motion, disparity and scene complexity [13].

TABLE I
DATASETS DESCRIPTION
Dataset name    Provider     Frame resolution / Color sampling ratio    Camera spacing / Frame rate    Camera array property
Breakdancers    Microsoft    1024×768, 4:4:4                            20 cm, 15 fps                  1D/arc
Race1           KDDI         640×480, 4:2:0                             20 cm, 30 fps                  1D/parallel
Exit            MERL         640×480, 4:2:0                             19.5 cm, 25 fps                1D/parallel

Since this investigation targets stereoscopic video transmission for mobile devices, CIF and QVGA size stereo-pair video datasets were generated from all seven adjacent-view stereo-pair combinations of the Microsoft, KDDI and MERL multi-view sequences. To achieve this, all frames of the Break-dancers dataset were filtered using a 5×5 FIR Kaiser low-pass filter (the coefficients of this filter are tabulated in Table II) to mitigate aliasing artifacts. The filtered frames were then down-sampled by a factor of 2 both horizontally and vertically. To preserve the external camera parameters, the resulting frames were cropped from point (Px, Py) = (120, 47) and the corresponding CIF size sequences were generated. The resulting RGB frames were finally converted to YUV in the full color sampling format 4:4:4. The luminance components of the Race1 and Exit datasets were also filtered and down-sampled, generating QVGA size sequences in full color sampling format. Frames of the two views were interleaved using time-first ordering, generating a single sequence [14].

TABLE II
KAISER FIR FILTER COEFFICIENTS
0         0         0.0393    0         0
0         0.0653    0.1077    0.0653    0
0.0393    0.1077    0.1511    0.1077    0.0393
0         0.0653    0.1077    0.0653    0
0         0         0.0393    0         0

III. STATISTICAL ANALYSIS OF BLOCK MATCHING AMONGST REFERENCE FRAMES
A statistical analysis of block matching prediction has been conducted using the H.264/AVC based stereoscopic video codec. All inter-picture coding modes, which include the 16×16, 16×8, 8×16, 8×8, 8×4, 4×8 and 4×4 block sizes, in addition to intra-prediction, were enabled for this analysis. Bitrate control was also enabled to reveal the block matching statistics of the stereoscopic video codec amongst temporal and neighboring reference frames at low and medium transmission bitrates: 64 and 192 kilobits per second (kbps). Two comprehensive sets of statistical measurements of block matching amongst reference frames were carried out using the IPPP configuration. Figures 4-a and 4-b show the reference frame architectures for coding left and right frames, respectively. There are three temporal reference frames, named T0-T2; one spatial reference frame, S0, which is available only when coding the right view; and up to four spatiotemporal reference frames, named M0-M3. In the first set of investigations, the reference frames were sorted in descending coding order, as shown in Figure 5.
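The descending coding order described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: frames are coded in time-first interleaved order (left frame, then right frame, at each time instant), the list holds the seven most recently coded frames (the count read from Figure 5), and the function names `coding_order`, `label` and `descending_ref_list` are hypothetical helpers, not part of any codec software.

```python
from typing import List, Tuple

Frame = Tuple[str, int]  # (view, time index); view is "L" or "R"


def coding_order(num_times: int) -> List[Frame]:
    """Time-first interleaving (assumed): L0, R0, L1, R1, ..."""
    return [(view, t) for t in range(num_times) for view in ("L", "R")]


def label(current: Frame, ref: Frame) -> str:
    """Name a reference frame relative to the current frame.

    T = temporal (same view, earlier time),
    S = spatial (other view, same time),
    M = spatiotemporal (other view, earlier time).
    """
    cur_view, cur_t = current
    ref_view, ref_t = ref
    dt = cur_t - ref_t
    if ref_view == cur_view:
        return f"T{dt - 1}"
    if dt == 0:
        return "S0"
    return f"M{dt - 1}"


def descending_ref_list(current: Frame, max_refs: int = 7) -> List[str]:
    """Reference names in descending coding order; position = List 0 index."""
    order = coding_order(current[1] + 1)
    coded_before = order[: order.index(current)]
    most_recent_first = coded_before[::-1][:max_refs]
    return [label(current, ref) for ref in most_recent_first]


if __name__ == "__main__":
    print("left :", descending_ref_list(("L", 5)))
    print("right:", descending_ref_list(("R", 5)))
```

Under these assumptions, the most recently coded frame always receives index 0, so the nearest temporal frame T0 lands in the second position for both views.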
Fig. 4. Block diagram of reference frames used in the statistical analysis for a) left view and b) right view.

Fig. 5. Reference frame indexing order (descending order) for coding frames in a) left view and b) right view. List 0 indices in a): M0 = 0, T0 = 1, M1 = 2, T1 = 3, M2 = 4, T2 = 5, M3 = 6; in b): S0 = 0, T0 = 1, M0 = 2, T1 = 3, M1 = 4, T2 = 5, M2 = 6.

Tables III-a and III-b show the primary results of the statistical analysis of block matching for Break-dancers at 64 and 192 kbps, respectively. These results represent the average statistics of coding seven pairs of adjacent stereoscopic views (the seven pairs represent all combinations of the eight adjacent views). From Table III, it can be seen that the neighboring reference frames (nearest temporal and spatial) make a significant contribution to block matching.

TABLE III
STATISTICS OF BLOCK MATCHING AMONGST REFERENCE FRAMES (% OF BLOCKS) FOR BREAK-DANCERS SEQUENCES USING THE DESCENDING ORDER FRAME INDEXING AT A) 64 KBPS AND B) 192 KBPS.

(a) 64 kbps
REF      T0       T1      T2      S0       M0       M1      M2      M3
Left     48.23    4.2     3       n/a      40.3     2.5     0.82    0.95
Right    34.63    2.4     1.7     58.7     1.8      0.45    0.35    n/a

(b) 192 kbps
REF      T0       T1      T2      S0       M0       M1      M2      M3
Left     70.4     7.5     4.8     n/a      13.46    2       0.84    1
Right    54       5.1     3.1     35.31    1.6      0.5     0.39    n/a

Rate-distortion optimization in H.264/AVC selects the reference frame with the minimum cost, which is a compromise between the actual bitrate and the residual error due to inter-prediction from the various reference frames. According to the Lagrangian method, the cost function $J(\mathrm{REF} \mid \lambda_{\mathrm{MOTION}})$ is defined by equation (1) [15]:

$$J(\mathrm{REF} \mid \lambda_{\mathrm{MOTION}}) = \mathrm{SAD}\big(s, r(\mathrm{MV}, \mathrm{REF})\big) + \lambda_{\mathrm{MOTION}} \cdot R(\mathrm{MV}, \mathrm{REF}) \qquad (1)$$

where SAD, the Sum of Absolute Differences, is the prediction error between the current block (s) and the corresponding reference block (r), $\lambda_{\mathrm{MOTION}}$ is the Lagrange multiplier, and R is the number of bits required to code both the Motion Vector (MV) and the REference Frame (REF).

In H.264/AVC, the indices of all used reference frames are stored in Buffer List 0, and the overhead cost of signalling the reference frame for each macroblock is related to the index position of the reference frame in Buffer List 0, where fewer bits are used to address the reference frames closer to the head of the list. From Table III, it is obvious that the distribution of the block matching amongst reference frames is inconsistent with the position of the reference frames in the buffer. Reference frame T0, whose index is located in the second position in List 0, was used for predicting the largest share of blocks of both right and left views at 192 kbps and of the left view at 64 kbps. Hence, the current bit allocation for representing each macroblock's reference frame could cost additional bits, which reduces the coding efficiency of the codec. Therefore, reordering the indices of the reference frames according to their contributions to block matching could improve the coding performance of the codec.

In the second set of experiments, the reference frames were indexed according to their contributions to block matching, using the statistics from the first set of experiments together with their spatial position relative to the current frame, as shown in Figure 6. Another comprehensive statistical analysis of block matching was performed using the proposed reference frame indexing. The results for the Break-dancers sequences are tabulated in Table IV. From Table IV, it can be seen that the temporal frames make a higher contribution to block prediction than the spatial frames, and also that the first temporal frame, T0, contributes more than the adjacent spatial reference frame, S0 (used when coding the right frame). This finding matches the fact that temporal correlations are higher than spatial correlations. It suggests that the coding performance of the stereoscopic video codec could be increased by using the proposed reference frame indexing. The nearest temporal reference frame makes a higher contribution to block matching under the proposed reference frame ordering than under the default order.

Table IV also shows a strong relation between target bitrate and inter-picture prediction. The nearest temporal reference frame, T0, has the highest contribution to block matching; however, its percentage relative to the remaining reference frames decreases as the bitrate grows. The contributions of the other temporal frames, T1 and T2, increase with the target bitrate.

IV. H.264/AVC BASED STEREOSCOPIC VIDEO CODEC USING THE STATISTICS OF BLOCK MATCHING
The outcomes of the statistical analysis for the proposed reference frame indexing have been considered both in the selection of the reference frames and in indexing the reference frames in Buffer List 0. Figure 7 shows the proposed reference frame architecture for coding stereoscopic videos at low and medium bitrates. In Figure 7, the number in each block represents its reference frame index in List 0.
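To illustrate why the index position in List 0 matters, the following Python sketch estimates the average number of bits spent on signalling the reference index under the descending order and under the contribution-based order. It rests on assumptions that the paper does not state: the reference index is taken to be coded with the unsigned Exp-Golomb code ue(v), as in CAVLC entropy coding with more than two active references; the left-view 64 kbps shares from Tables III-a and IV-a are treated as selection probabilities; and the list orders are those read from Figures 5-a and 6-a. The names `ue_bits`, `rd_cost` and `expected_ref_idx_bits` are hypothetical helpers.

```python
from typing import Dict, List


def ue_bits(k: int) -> int:
    """Length in bits of the unsigned Exp-Golomb codeword ue(v) for value k."""
    return 2 * ((k + 1).bit_length() - 1) + 1


def rd_cost(sad: float, lam: float, mv_bits: int, ref_idx: int) -> float:
    """Equation (1): J = SAD + lambda_motion * R, where R is taken here as
    the motion-vector bits plus the (assumed ue(v)) reference-index bits."""
    return sad + lam * (mv_bits + ue_bits(ref_idx))


def expected_ref_idx_bits(share: Dict[str, float], list0: List[str]) -> float:
    """Average reference-index bits, weighting each frame by its share."""
    total = sum(share.values())
    return sum(s * ue_bits(list0.index(name)) for name, s in share.items()) / total


# Left view at 64 kbps: block-matching shares (%) from Tables III-a and IV-a.
DESCENDING_SHARE = {"T0": 48.23, "T1": 4.2, "T2": 3.0,
                    "M0": 40.3, "M1": 2.5, "M2": 0.82, "M3": 0.95}
PROPOSED_SHARE = {"T0": 85.98, "T1": 4.4, "T2": 1.92,
                  "M0": 5.59, "M1": 0.78, "M2": 0.6, "M3": 0.73}

# Position in each list is the List 0 index (read from Figs. 5-a and 6-a).
DESCENDING_LIST0 = ["M0", "T0", "M1", "T1", "M2", "T2", "M3"]
PROPOSED_LIST0 = ["T0", "M0", "T1", "T2", "M1", "M2", "M3"]

if __name__ == "__main__":
    before = expected_ref_idx_bits(DESCENDING_SHARE, DESCENDING_LIST0)
    after = expected_ref_idx_bits(PROPOSED_SHARE, PROPOSED_LIST0)
    print(f"descending order: {before:.2f} bits per reference index")
    print(f"proposed order:   {after:.2f} bits per reference index")
```

Under these assumptions the estimate falls from roughly 2.4 to roughly 1.4 bits per signalled reference index. The figure is indicative only, since a real encoder signals the index per partition and may use a different entropy coder.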
Fig. 6. Reference frame indexing order (according to their contribution in block matching) for coding frames in a) left view and b) right view. List 0 indices in a): T0 = 0, M0 = 1, T1 = 2, T2 = 3, M1 = 4, M2 = 5, M3 = 6; in b): T0 = 0, S0 = 1, T1 = 2, T2 = 3, M0 = 4, M1 = 5, M2 = 6.

TABLE IV
STATISTICS OF BLOCK MATCHING AMONGST REFERENCE FRAMES (% OF BLOCKS) FOR BREAK-DANCERS SEQUENCES USING THE PROPOSED FRAME INDEXING AT A) 64 KBPS AND B) 192 KBPS.

(a) 64 kbps
REF      T0       T1      T2      S0       M0      M1      M2      M3
Left     85.98    4.4     1.92    n/a      5.59    0.78    0.6     0.73
Right    72       2.96    1.16    22.57    0.35    0.68    0.28    n/a

(b) 192 kbps
REF      T0       T1      T2      S0       M0      M1      M2      M3
Left     81.47    7.22    3.28    n/a      5.54    0.93    0.66    0.9
Right    66.4     5       2       25       0.9     0.4     0.3     n/a

Fig. 7. Proposed reference frame structure for coding left and right frames.

In order to evaluate the performance of the proposed reference frame architecture, views 3 and 4 of the Race1 and Exit datasets were taken. The H.264/AVC based stereoscopic video codec using the proposed reference frame architecture and three other widely used reference frame architectures [5-11], as shown in Figures 1-3, was applied to these stereoscopic video sequences. The Peak Signal to Noise Ratio (PSNR) was used to assess the quality of the reconstructed luminance components of the decoded sequences. Figures 8-a and 8-b show the resulting PSNR for the Race1 and Exit stereoscopic videos when coding the stereo-pair sequences at different bitrates. From Figure 8, it can be seen that the proposed reference frame architecture and indexing improve the coding performance of the codec compared to the state-of-the-art architectures at all bitrates (by up to 0.6 dB for the Exit sequence and up to 0.47 dB for the Race1 sequence). It can also be seen that Ugur's reference frame architecture, which uses a smaller number of reference frames, gives higher coding performance than the other state-of-the-art reference frame architectures. This can be explained by the fact that Ugur et al.'s codec uses reference frame reordering. This is consistent with the proposed reference frame architecture, which employs frame reordering to improve the coding performance of the codec.

Fig. 8. Coding performance of the stereoscopic video codec (using the proposed reference frame architecture and typical reference frame architectures) for coding a) Race1 and b) Exit stereo-pair sequences. [Plots of PSNR (dB) against bitrate (Kbps, up to 256) for the proposed reference frame architecture, Fehn [6], Bouyagoub [5] and Ugur [11].]

V. CONCLUSIONS
The H.264/AVC based stereoscopic video codec using the proposed reference frame architecture was found to give superior coding performance compared to the same codec using three state-of-the-art reference frame architectures. Two sets of statistical analyses of block matching amongst reference frames were carried out, taking into consideration the spatial and temporal location of the reference frames. Results show that there is a relationship between the order of the reference frames and the coding performance of the codec, and that reordering the reference frames improves the coding performance of the codec.
REFERENCES
[1] L. Sumei, H. Chunping, Y. Yicai, S. Xiaowei, and Y. Lei, "Stereoscopic Video Compression Based on H.264 MVC," 2nd International Congress on Image and Signal Processing, CISP '09, China, 2009, pp. 1-5.
[2] A. Vetro, T. Wiegand, and G. J. Sullivan, "Overview of the Stereo and Multiview Video Coding Extensions of the H.264/MPEG-4 AVC Standard," Proceedings of the IEEE, vol. 99, no. 4, pp. 626-642, Apr. 2011.
[3] C. Bilen, A. Aksay, and G. Akar, "A Multi-View Video Codec Based on H.264," in IEEE International Conference on Image Processing, Atlanta, U.S.A., 2006, pp. 541-544.
[4] A. Aksay, C. Bilen, E. Kurutepe, T. Ozcelebi, G. B. Akar, M. R. Civanlar, and A. M. Tekalp, "Temporal and spatial scaling for stereoscopic video compression," in Proc. European Signal Processing Conference EUSIPCO 2006, Florence, Italy, 2006.
[5] S. Bouyagoub, A. Sheikh Akbari, D. Bull, and N. Canagarajah, "Impact of camera separation on performance of H.264/AVC-based stereoscopic video codec," Electronics Letters, vol. 46, no. 5, pp. 345-346, Mar. 2010.
[6] C. Fehn, P. Kauff, S. Cho, H. Kwon, N. Hur, and J. Kim, "Asymmetric Coding of Stereoscopic Video for Transmission Over T-DMB," in 2007 3DTV Conference, Kos, Greece, 2007, pp. 1-4.
[7] M. Yu, H. Yang, S. Fu, F. Li, R. Fu, and G. Jiang, "New Sampling Strategy in Asymmetric Stereoscopic Video Coding for Mobile Devices," in 2010 International Conference on E-Product E-Service and E-Entertainment, Henan, China, 2010, pp. 1-4.
[8] H. Yang, M. Yu, and G. Jiang, "Decoding and up-sampling optimization for asymmetric coding of mobile 3DTV," TENCON 2009 - 2009 IEEE Region 10 Conference, Singapore, 2009, pp. 1-4.
[9] S. Cho, H. Kwon, N. Hur, J. Kim, and S. I. Lee, "Stereoscopic Video Codec for 3D Video Service over T-DMB," International Conference on Consumer Electronics, ICCE 2007, Digest of Technical Papers, Las Vegas, Nevada, U.S.A., 2007, pp. 1-2.
[10] H. Kwon, S. Cho, N. Hur, J. Kim, and S. I. Lee, "AVC Based Stereoscopic Video Codec for 3D DMB," in 2007 3DTV Conference, Kos Island, Greece, 2007.
[11] K. Ugur, H. Liu, J. Lainema, M. Gabbouj, and H. Li, "Parallel Encoding - Decoding Operation for Multiview Video Coding with High Coding Efficiency," in 2007 3DTV Conference, Kos Island, Greece, 2007, pp. 1-4.
[12] S. A. Fezza, M. C. Larabi, and K. M. Faraoun, "Stereoscopic video coding based on the H.264/AVC standard," International Conference on Machine and Web Intelligence (ICMWI 2010), Algiers, Algeria, 2010, pp. 476-478.
[13] Y. Zhang, S. Kwong, G. Jiang, and H. Wang, "Efficient Multi-Reference Frame Selection Algorithm for Hierarchical B Pictures in Multiview Video Coding," IEEE Trans. on Broadcasting, vol. 57, no. 1, pp. 15-23, Mar. 2011.
[14] Y. Chen, Y.-K. Wang, K. Ugur, M. M. Hannuksela, J. Lainema, and M. Gabbouj, "The Emerging MVC Standard for 3D Video Services," EURASIP Journal on Advances in Signal Processing, vol. 2009, no. 1, Jan. 2009.
[15] S.-H. Jung, W.-J. Park, and T.-Y. Kim, "Fast Reference Frame Selection with Adaptive Motion Search Using RD Cost," in 2012 Spring Congress on Engineering and Technology (S-CET), Xi'an, China, 2012, pp. 1-4.