Utilisation of downsampling for arbitrary views in multi-view video coding

E. Ekmekcioglu, S.T. Worrall and A.M. Kondoz

The effect of downsampling arbitrary views of a multi-view sequence on multi-view coding (MVC) efficiency is explored. A bit rate adaptive approach is proposed in which certain views are downsampled prior to encoding with suitable downscaling ratios. The inter-view references, if any, are downsampled to the same resolution, and the decoded view is upsampled back to the original resolution. Results over several multi-view test sequences indicate that up to 0.9 dB gain or a 20% reduction in bit rate can be achieved, while the computational complexity of the encoder is reduced significantly at the same time.
Introduction: The concept of multi-view coding (MVC) has grown in significance in order to make applications such as 3D TV and free view-point video possible. Owing to the large amounts of video data involved, a variety of techniques have been developed to exploit temporal and spatial redundancies, as well as the inter-view redundancies specific to multi-view sequences. The MVC technique proposed in [1], based on hierarchical B frame prediction in both temporal and view dimensions, forms the basis of the multi-view extension of H.264 [2]. The extension is currently under development, and is known as the Joint Multi-View Video Model (JMVM) [3]. In this Letter, arbitrary views inside a multi-view sequence, regardless of their inter-view dependencies, are downsampled prior to encoding and upsampled to their original resolutions after decoding. The technique is based on a trade-off between two types of distortion: distortion introduced by downsampling and distortion introduced by quantisation. For a fixed bit budget, increasing the amount of downsampling means that finer quantisation can be used. Thus, more information is lost through downsampling, but less is lost through coarse quantisation. Finding the optimum trade-off between the two distortion sources should lead to improved compression efficiency.

Encoder structure and coding conditions: Fig. 1 shows a part of the multi-view encoder structure, for two views using inter-view prediction. As in JMVM [3], hierarchical B frame prediction is used in both temporal and view dimensions. The disparity estimation and compensation processes are also as specified in JMVM [3]. For coding views that use inter-view prediction, the inter-view references are downsampled using the same scaling ratio as the original view before being put in the reference picture buffer. JMVM [3] is used for testing the performance of the proposed method. Binary arithmetic coding is used to encode transform coefficients.
In the tests, the scaling ratios listed in the section on downsampling and upsampling processes are applied to each view separately, and the corresponding downsampled views are encoded at a wide range of bit rates. For each downsampled view, the inter-view references, which are decoded at their original resolution, are downsampled to the same resolution. Three different sets of multi-view sequences are used for the tests. The applied quantisation step size is decreased as the amount of downscaling is increased, in order to maintain a similar total bit budget.
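The two distortion sources traded off against each other can be illustrated with a toy one-dimensional experiment. The Python sketch below is not the JMVM pipeline: the pair-averaging downsampler, the sample-repeating upsampler and the uniform quantiser are simplifying assumptions standing in for the actual filters and the H.264 quantiser, and all function names are hypothetical. It only shows how each source contributes to the error measured against the original signal; it does not model bit rate.

```python
import math


def downsample2(row):
    """Average neighbouring pairs (a crude 0.5 scaling, not the Letter's filters).

    A trailing unpaired sample is dropped.
    """
    return [(row[i] + row[i + 1]) / 2 for i in range(0, len(row) - 1, 2)]


def upsample2(row):
    """Repeat each sample twice (nearest neighbour, not the six-tap filters)."""
    return [value for value in row for _ in range(2)]


def quantise(row, step):
    """Uniform quantiser: snap each sample to the nearest multiple of `step`."""
    return [step * round(value / step) for value in row]


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length signals."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


original = [10, 20, 30, 40]

# Downsampling distortion only: scale down, then back up.
restored = upsample2(downsample2(original))

# Quantisation distortion only: full resolution, coarse step.
coarse = quantise(original, step=16)

print("downsampling only:", psnr(original, restored))
print("quantisation only:", psnr(original, coarse))
```

In the scheme described here, the two effects are combined: a downsampled view has fewer samples to code, so a smaller quantisation step can be afforded at the same bit budget.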
Fig. 1 Proposed encoder structure (for two view case)
Downsampling and upsampling processes: Each view is downsampled prior to encoding with a downscaling ratio selected from a set. The aspect ratio of the original sequences is always maintained, and the downsampled resolutions are divisible by 16 in both dimensions in order to comply with the coding standard. The original views are at Video Graphics Array (VGA) (640 × 480) resolution. A scaling ratio of 0.9, 0.8, 0.7, 0.6, 0.5 or 0.3 can be applied to obtain views of 576 × 432, 512 × 384, 448 × 336, 384 × 288, 320 × 240 and 192 × 144 resolution, respectively. A set of seven filters, designed originally to support the extended range of spatial scaling ratios used in the scalable extension of H.264, is used to downsample the original views. Integer based six-tap filters are applied to the decoded low-resolution views to upsample them back to VGA resolution.
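The resolutions listed above can be checked with a few lines of Python. This is a small sketch, not part of the coder: the function and variable names are hypothetical, and the ratios are stored as integer tenths purely to keep the arithmetic exact.

```python
VGA = (640, 480)                      # original width and height in pixels
RATIOS_TENTHS = (9, 8, 7, 6, 5, 3)    # 0.9, 0.8, 0.7, 0.6, 0.5, 0.3


def scaled_resolution(size, ratio_tenths):
    """Scale both dimensions by ratio_tenths / 10 using integer arithmetic."""
    width, height = size
    return (width * ratio_tenths // 10, height * ratio_tenths // 10)


def is_block_aligned(size, block=16):
    """True if both dimensions are whole multiples of the block size."""
    return all(dimension % block == 0 for dimension in size)


resolutions = {t / 10: scaled_resolution(VGA, t) for t in RATIOS_TENTHS}

for ratio, (width, height) in resolutions.items():
    print(f"{ratio}: {width} x {height}, "
          f"multiple of 16: {is_block_aligned((width, height))}")
```

Every ratio in the set yields a 4:3 picture whose width and height are both multiples of 16.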
Fig. 2 Coding performance of MV coder using several downscaling ratios (0.3, 0.5, 0.6, 0.8) for second view of Breakdancer test sequence
Fig. 3 Coding performance of proposed bit rate adaptive MVC for Ballroom and Akko & Kayo test sequences (average of all views)
Results: The results show that different scaling ratios yield different performance characteristics at different bit rates. Moreover, all views are affected in the same way, regardless of the number of inter-view references used. Fig. 2 shows the coding performance for the second view of the Breakdancer [4] sequence. For the reference method, all sequences are coded at their original resolution, while the other curves show the performance with different scaling ratios. The results show that the optimum trade-off between the two distortion types (downscaling and quantisation) varies with the target bit rate. The best performance at medium and low bit rates, where the quantisation distortion is more significant, is achieved with a 0.6 scaling ratio, for which the coder achieves up to 0.9 dB quality gain or up to a 20% reduction in bit rate for the Breakdancer sequence. Heavy downscaling, such as a 0.3 scaling ratio, is only useful at very low bit rates, where the reconstruction quality is already too low for use in any multi-view application. A reconstruction quality of around 33 dB is satisfactory, while a PSNR of around 35 to 36 dB is regarded as reasonable quality. Much higher qualities are difficult to distinguish. Fig. 3 shows the performance characteristics of the proposed technique on other test sequences (Ballroom and Akko & Kayo). The results are obtained by taking the average of all views inside a test set. 0.6 scaling is used for medium and low bit rate ranges (less than 400 kbit/s), and 0.8 scaling is used for high bit rate ranges (over 400 kbit/s). A visual comparison of two pictures, one from the reference method (at around 155 kbit/s, 34.1 dB) and one from the proposed method (at around 130 kbit/s, 34.5 dB), reveals almost no difference between them, although the proposed method uses considerably fewer bits to encode the picture.
ELECTRONICS LETTERS 28th February 2008 Vol. 44 No. 5
E. Ekmekcioglu, S.T. Worrall and A.M. Kondoz (Centre for Communication Systems Research, University of Surrey, Guildford GU2 7XH, United Kingdom) E-mail:
[email protected]
Conclusions: The proposed technique objectively outperforms the reference method at all bit rates and is capable of switching to the optimum downscaling ratio depending on the operating bit rate in the region of interest (Fig. 2). The scheme takes into account the fact that the optimum trade-off between quantisation distortion and downscaling distortion varies with bit rate. Off-line optimisation of the downscaling ratio is investigated for the whole sequence. At the same subjective quality as the reference technique, the proposed scheme yields a considerable bit rate saving, which can be used for other services. Coding delay, including the effect of periodic ratio updates during encoding, and the construction of downscaling ratio look-up tables will be investigated in future work.
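The bit rate adaptive switching can be sketched as a simple threshold rule. The Python fragment below uses the operating points reported in the Results for the Ballroom and Akko & Kayo sequences (0.6 scaling below 400 kbit/s, 0.8 scaling above). The function name is hypothetical, and assigning exactly 400 kbit/s to the 0.8 ratio is an assumption, since the Letter does not specify the boundary case. Other sequences may need different thresholds.

```python
def select_scaling_ratio(target_kbps, threshold_kbps=400.0):
    """Pick a downscaling ratio for a target bit rate in kbit/s.

    Below the threshold the 0.6 ratio is returned, otherwise 0.8.
    Treating the threshold itself as 'high bit rate' is an assumption.
    """
    if target_kbps <= 0:
        raise ValueError("target bit rate must be positive")
    return 0.6 if target_kbps < threshold_kbps else 0.8


for rate in (130, 400, 800):
    print(rate, "kbit/s ->", select_scaling_ratio(rate))
```

A look-up table with more than two entries would follow the same pattern, with one threshold per ratio change.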
References
1 Müller, K., et al.: ‘Multi-view video coding based on H.264/AVC using hierarchical B-frames’. PCS 2006, China, 2006
2 ITU-T Recommendation H.264 and ISO/IEC 14496-10 AVC: ‘Advanced video coding for generic audio-visual services’, 2003
3 Vetro, A., Su, Y., Kimata, H., and Smolic, A.: ‘Joint multiview video model JMVM 2.0’. ITU-T and ISO/IEC Joint Video Team, Document JVT-U207, November 2006
4 Zitnick, C.L., et al.: ‘High-quality video view interpolation using a layered representation’. ACM SIGGRAPH and ACM Transactions on Graphics, August 2004
© The Institution of Engineering and Technology 2008
1 October 2007
Electronics Letters online no: 20082578
doi: 10.1049/el:20082578