CSVT 2855


Video Coding using Elastic Motion Model and Larger Blocks

Abdullah A. Muhit, Mark R. Pickering, Michael R. Frater, and John F. Arnold

Abstract—Motion-compensated prediction is the key to high-performance video coding. Previous works have explored alternatives to the classical translational motion model in video coding, but the rate-distortion gains have not been significant enough for such approaches to be adopted in mainstream standards. In this paper, we propose a new extended prediction strategy that incorporates non-translational motion prediction. This method uses an elastic motion model with 2-D cosine basis functions to estimate non-translational motion between blocks. To achieve superior performance, the proposed scheme takes advantage of larger blocks with multi-level partitioning. Experimental results show that this combined framework outperforms existing techniques, including those available in the recent H.264 standard.

Index Terms—Elastic motion model, H.264, Super macroblock, Motion estimation and compensation, Video coding.

I. INTRODUCTION

Manuscript received December 16, 2008. This work was supported in part by the UCPRS grant. A. Muhit, M. Pickering, M. Frater and J. Arnold are with the School of Engineering and Information Technology (SEIT), University of New South Wales (UNSW) at the Australian Defence Force Academy (ADFA), Canberra, Australia (e-mail: [email protected]; phone: +612-62688191; fax: +612-62688581). Copyright (c) 2009 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to [email protected].

The key to efficient image-sequence coding lies in the reduction of the temporal redundancies among frames. Motion-compensated prediction tries to achieve this goal. The determination of actual motion is not the fundamental objective here; rather, the aim is to optimize a rate-distortion trade-off. Block matching algorithms (BMA) have proven to be the most suitable in the first-generation video coding framework, surpassing other candidates such as gradient techniques, pel-recursive techniques and frequency-domain techniques [6]. In block matching techniques, each frame is partitioned into blocks and predicted using the best match from a reference frame. The underlying assumption is that each block undergoes independent motion from the reference block. The performance of BMA relies on three factors: (a) the block size and shape, (b) matching precision and (c) the motion model. Research efforts devoted to addressing these issues are quite diverse. In earlier standards such as H.261 [1], a fixed block size of 16×16 was adopted. Later, in H.263 [2], each block was allowed to be split into 8×8 blocks whenever this led to a better rate-distortion outcome. This was influenced by the superior performance of variable-size block matching [4], [5] and locally adaptive multigrid block matching [6] over fixed block sizes. H.264 [3] has adopted an advanced tree-structured block partitioning method down to a block size of 4×4. Recently, larger macro-blocks, also known as super-macroblocks, have been shown to provide higher efficiency for high-definition (HD) videos [24]. Some researchers have explored the possibility of motion-assisted merging [7] or leaf-merging [9] after quadtree-based segmentation, with the aim of reducing the number of segments/blocks that can be predicted using the same motion vector. However, these approaches lead to arbitrarily shaped segments with a minor increase in prediction error. It is also preferable to limit consideration to rectangular or square blocks due to the easier implementation of transform and quantization functions. BMAs were initially designed to estimate linear displacements with single-pixel precision. In [6], it is shown that half-pixel accuracy of motion vectors leads to a significant improvement; this was adopted in H.263, MPEG-1, MPEG-2 and MPEG-4 (Part 2). H.264 implements more precise (quarter-pixel) translational motion vectors. However, there has been no attempt to increase the precision further, since it results in negligible changes in performance [8]. The remaining issue in BMA is the motion model. In typical video sequences, motion occurs for two reasons. The first is global motion or camera motion. The second is the intrinsic motion of the objects in the scene. Hence, independently moving objects in combination with camera motion lead to a complicated motion vector field. A translational motion model has been the most popular until now because of its simplicity and efficiency. However, the model has a number of inherent shortcomings.
Without addressing these problems, further major improvements in the compression performance of video coding are unlikely [7]. The most severe of these shortcomings is the blockwise-constant motion model, in which all the pixels in a block are assigned the same motion vector. The other problem is the scheme's inability to capture camera zoom, rotation and complex motions such as the deformation of objects. Complex motion in blocks is usually approximated using a piecewise local translational model with smaller blocks. However, there is an intrinsic limitation to this approach because blocks smaller than 4×4 are not feasible from a rate-distortion optimization point of view. In a nutshell, the first-generation video coding framework is reaching its performance

saturation point unless a breakthrough is made in one of these three aspects of BMA. In this paper, we tackle the issue of an enhanced motion model for BMAs, since the other issues, block partitioning and matching precision, offer relatively smaller scope for improvement. In recent years, several strategies have been investigated for efficient prediction of complex motion vector fields [7], [10]-[14] under the block matching framework. Seferidis and Ghanbari [10] proposed the generalized block-matching framework, which performs motion estimation using a deformed quadrilateral of the previous block under affine, perspective and bilinear transformations. Nakaya and Harashima [11] took a similar approach and showed a relationship between the transformation functions and the block shape. Li and Forchheimer [12] proposed a comparable approach using an affine motion model, also known as transformed block-based motion compensation. A combination of the block-based and region-based approaches is presented in [7]. Karczewicz et al. [7] report that motion-assisted merging after quadtree segmentation, together with a polynomial motion model, is beneficial for coding efficiency. Zhang et al. [13] presented an approach that uses multiple-level segmentation and affine motion estimation using the Hough transform. They apply global motion estimation and segmentation to separate background and foreground objects. Quadtree decomposition is then used to get a rough segmentation of the motion regions within an object [13]. Affine motion prediction using translational motion vectors is proposed by Kordasiewicz et al. [14] under a mesh-based approach. Sayed and Badawy [15] presented a simplified affine-based motion estimation algorithm and its hardware implementation for video coding. In this paper, we propose a novel technique for extending standard motion estimation algorithms to incorporate motion parameters that describe non-translational motions.
It is based on elastic or non-rigid image registration techniques that are extensively used in medical imaging [16]. Specifically, we apply an elastic motion model with 2-D cosine basis functions to efficiently predict the complex motion vector fields observed in typical video sequences. We also use larger blocks with multiple-level partitioning to improve rate-distortion optimized performance. Larger blocks alone can provide good PSNR gain at very low bitrates [17], [24] even with a simple translational model. However, as the rate increases, larger blocks perform worse than standard blocks. In contrast, an elastic motion model is able to take advantage of larger block sizes and provide significant compression efficiency over a wide range of output bitrates. Although existing approaches rely on smaller blocks for better performance, ours is able to perform better using bigger blocks, thereby reducing the number of blocks, bits for motion vectors and computation time. The proposed scheme is suitable for integration with standard coders using a fully functional rate-distortion optimization framework. This enables the encoder to select the elastic motion prediction only when it is advantageous for cumulative rate-distortion performance. The rest of the paper is organized as follows. In section II,

we discuss the elastic motion model, tree-structured multi-level block partitioning and the proposed algorithm. Experimental results are presented in section III. Finally, we conclude the paper in section IV.

II. EXTENDED MOTION COMPENSATION

A. Elastic Motion Model

Elastic or non-rigid image registration techniques have been extensively used in medical imaging, object tracking, image stabilization and motion analysis applications [16]. Image registration is the task of finding a correspondence function mapping coordinates from a reference image to the coordinates of homologous points in a test image. The basic image registration technique can be applied to the block matching framework without any major modification. Assume two blocks I(x_i, y_i) and I'(x'_i, y'_i) are related by some coordinate transformation of the form:

x'_i = x_i + \sum_{k=1}^{P/2} m_k \varphi_k(x_i, y_i)

y'_i = y_i + \sum_{k=P/2+1}^{P} m_k \varphi_k(x_i, y_i)    (1)

where P is the number of motion parameters, the m_k are the motion parameters and the \varphi_k(\cdot) are basis functions that describe arbitrarily complex mappings between the coordinates of I and I'. Examples of basis functions that have been used in the literature include B-splines, polynomials, harmonic functions, radial basis functions and wavelets [16]. In this paper, we use a set of discrete cosines to capture elastic or non-rigid motion. This choice was based on the requirement to have the most compact parameterization of the non-translational motion: discrete cosines can represent smooth motion fields with a minimum number of coefficients. The choice of discrete cosines is also motivated by their widespread popularity and hardware implementations in various image and video processing applications. The basis functions for the coordinate transform are given by

\varphi_k(x_i, y_i) = \varphi_{k+P/2}(x_i, y_i) = \cos\!\left(\frac{(2x_i+1)\pi u}{2M}\right)\cos\!\left(\frac{(2y_i+1)\pi v}{2N}\right)    (2)

for k = su + v + 1, u, v = 0, \ldots, s-1 and s = \sqrt{P/2}, where M and N are the horizontal and vertical dimensions of the blocks to be registered. An example of the motion fields corresponding to each basis function for such a coordinate transform is shown in Fig. 1. For this example, P = 8, s = 2 and M = N = 4. Having defined the motion model, the next step is to compute the motion parameters that will enable the best prediction of a block from its reference frame. Standard gradient-based image registration techniques are well known and have been used extensively in image processing and



machine vision applications for this purpose (see for example [18]). Gradient-based registration methods typically attempt to minimize the sum of squared differences (SSD) between I and I'. The SSD can be written as

E = \sum_{i=1}^{T} \left[ I'(x'_i, y'_i) - I(x_i, y_i) \right]^2 = \sum_{i=1}^{T} e_i^2    (3)

Here T is the number of pixels in the region where the two images overlap and e_i is the difference in intensity values at pixel i. A commonly used method for minimizing E is the Gauss-Newton gradient-descent non-linear optimization algorithm [18]. In this technique, a first-order Taylor approximation of the SSD is used to linearize the non-linear expression in (3). Therefore, after updating the motion parameters by \Delta\mathbf{m}, the expression becomes:

E' = \sum_{i=1}^{T} \left[ \left( I' + \frac{\partial I'}{\partial \mathbf{m}}\Delta\mathbf{m} \right) - I \right]^2 = \sum_{i=1}^{T} \left[ \frac{\partial I'}{\partial \mathbf{m}}\Delta\mathbf{m} + e_i \right]^2    (4)

Note that we have omitted the functional dependence of I and I' on x_i, y_i and x'_i, y'_i in the remainder of this section for clarity and ease of notation. The partial derivative of E' with respect to the parameter updates is given by

\frac{\partial E'}{\partial \Delta\mathbf{m}} = 2 \sum_{i=1}^{T} \frac{\partial I'}{\partial \mathbf{m}} \left[ \frac{\partial I'}{\partial \mathbf{m}}\Delta\mathbf{m} + e_i \right]    (5)

If we let this partial derivative equal zero then the error (after updating) will be minimized. Hence, the parameter updates are given by

\sum_{i=1}^{T} \left[ \frac{\partial I'}{\partial \mathbf{m}} \right]^2 \Delta\mathbf{m} = - \sum_{i=1}^{T} \frac{\partial I'}{\partial \mathbf{m}} e_i    (6)

Since transforms typically have multiple motion parameters, this equation can be written using matrix notation as

H\Delta\mathbf{m} = \mathbf{b}    (7)

where H is known as the Hessian matrix and \mathbf{b} is a weighted gradient vector. The elements of H and \mathbf{b} are given by

H_{k,l} = \sum_{i=1}^{T} \frac{\partial I'}{\partial m_k} \frac{\partial I'}{\partial m_l} \quad \text{and} \quad b_k = - \sum_{i=1}^{T} \frac{\partial I'}{\partial m_k} e_i

and can be calculated using the chain rule as follows:

\frac{\partial I'}{\partial m_k} = \frac{\partial I'}{\partial x'_i}\frac{\partial x'_i}{\partial m_k} + \frac{\partial I'}{\partial y'_i}\frac{\partial y'_i}{\partial m_k}    (8)

In this chain rule equation, the terms \partial I'/\partial x'_i and \partial I'/\partial y'_i are simply the horizontal and vertical gradients of I'. In elastic registration, the terms \partial x'_i/\partial m_k and \partial y'_i/\partial m_k are equal to the basis functions of the warping function, i.e. \partial x'_i/\partial m_k = \varphi_k(x_i, y_i) and \partial y'_i/\partial m_k = \varphi_k(x_i, y_i). The motion parameters are then updated iteratively as follows

\mathbf{m}^{t+1} = \mathbf{m}^{t} + \Delta\mathbf{m} = \mathbf{m}^{t} + H^{-1}\mathbf{b}    (9)

where t denotes the iteration number and x'_i, y'_i, I', H and \mathbf{b} are recalculated at each iteration using the updated motion parameters. In this technique, the motion parameters are iteratively updated until a minimum of E is found. At each iteration, an estimate of the parameter updates required to minimize E is calculated. These updates are then added to the motion parameters, and a warped version of I' is calculated with the new warping function. This warped version of I' is used in the next iteration and the process continues until some threshold or stopping criterion is reached, e.g. a maximum number of iterations or a minimum change in E. Experimental results show that 10 to 15 iterations are sufficient to calculate the elastic motion parameters for the typical blocks used in video coding. It is worth noting that the choice of an elastic model here is inspired by its greater flexibility in terms of the number of motion parameters that can be calculated without altering the gradient-descent algorithm. Other higher-order models such as the affine (6-parameter), perspective (8-parameter) or polynomial (12-parameter) models are limited to a fixed set of motion parameters. In contrast, the elastic motion model adds greater flexibility in choosing the number of motion parameters according to the block size and the requirements of the application. Moreover, the affine and perspective models were originally designed for global motion estimation, so they are not adequate to capture the complex motions or deformations typically seen in video.
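Putting Eqs. (1)-(9) together, the per-block registration can be sketched in NumPy. This is a minimal illustration under our own simplifications (nearest-neighbour resampling instead of sub-pixel interpolation, a small damping term added to the Hessian for numerical safety, and a fixed iteration count); the function names are ours, not the authors':

```python
import numpy as np

def dct_basis(M, N, P=8):
    """Eq. (2): 2-D cosine basis, one column per (u, v) pair; shape (M*N, P/2)."""
    s = int(round(np.sqrt(P // 2)))
    y, x = np.mgrid[0:N, 0:M]
    cols = [np.cos((2 * x + 1) * np.pi * u / (2 * M)) *
            np.cos((2 * y + 1) * np.pi * v / (2 * N))
            for u in range(s) for v in range(s)]          # k = s*u + v + 1
    return np.stack([c.ravel() for c in cols], axis=1)

def warp(block, m, phi):
    """Eq. (1): displace each pixel by the basis expansion and resample.
    Nearest-neighbour resampling keeps the sketch short; a codec interpolates."""
    N, M = block.shape
    P = m.size
    y, x = np.mgrid[0:N, 0:M]
    xp = x.ravel() + phi @ m[:P // 2]                     # x' = x + sum m_k phi_k
    yp = y.ravel() + phi @ m[P // 2:]                     # y' = y + sum m_k phi_k
    xi = np.clip(np.rint(xp).astype(int), 0, M - 1)
    yi = np.clip(np.rint(yp).astype(int), 0, N - 1)
    return block[yi, xi].reshape(N, M)

def register(ref, cur, P=8, iters=15):
    """Eqs. (3)-(9): Gauss-Newton estimation of the elastic parameters
    that warp the reference block toward the current block."""
    N, M = cur.shape
    phi = dct_basis(M, N, P)
    m = np.zeros(P)
    for _ in range(iters):
        warped = warp(ref, m, phi).astype(float)          # current I'
        gy, gx = np.gradient(warped)                      # image gradients of I'
        # Eq. (8): chain rule -- image gradients times basis functions
        J = np.hstack([gx.ravel()[:, None] * phi,
                       gy.ravel()[:, None] * phi])        # (T, P)
        e = warped.ravel() - cur.ravel().astype(float)    # e_i = I' - I
        H = J.T @ J                                       # Hessian, Eq. (7)
        b = -J.T @ e                                      # weighted gradient vector
        m = m + np.linalg.solve(H + 1e-6 * np.eye(P), b)  # update, Eq. (9)
    return m, warp(ref, m, phi)
```

In the codec, `register` would be invoked once per partition, starting from the reference block located by the integer-pel translational search.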

B. Tree-Structured Multi-level Block Partitioning

In this work, we retain the tree-structured motion estimation framework that has been adopted in H.264 [3]. However, we generalize it to implement larger blocks and multiple levels of segmentation. This structure allows great flexibility in coding both homogeneous and detailed areas of a frame. Larger blocks at lower levels are beneficial for homogeneous parts of the picture with little or no motion, whereas multiple partitioning (resulting in smaller blocks) is essential to describe more active parts of the picture with complex motions. Consider a video frame with resolution w×h. This frame is divided into a number of macro-blocks of size Z·2^{f-g} × Z·2^{f-g} pixels for independent coding. Here, Z is the minimum block size allowed (Z is required to be an integer multiple of 4 due to the 4×4 integer transform and quantization function); f is the total number of partition levels allowed and 0 ≤ g ≤ f is the level of partitioning. For example, H.264 uses Z = 4 and f = 2. At g = 0, a macro-block may be split and motion compensated in four ways: as one Z·2^{f-g} × Z·2^{f-g} block, two Z·2^{f-g} × Z·2^{f-g-1} blocks (horizontal split), two Z·2^{f-g-1} × Z·2^{f-g} blocks (vertical split) or four Z·2^{f-g-1} × Z·2^{f-g-1} blocks (quad split). Further partitioning is allowed only when the quad split is chosen at the previous level. In the latter case, each of the four sub-blocks

may be further partitioned at g = 1. This process continues until the minimum block size Z×Z is reached. The selection of the block size and the number of partitioning levels is an important issue. On one hand, the block size should be sufficiently small to minimize the likelihood that a block contains two or more different motions. On the other hand, the block should be sufficiently large to minimize the overhead associated with the motion vector and partitioning information. Arnold et al. [19] reported that an initial block size of 16×16 pixels is suitable for most video coding applications; this is the same block size used in H.264. Sullivan and Baker [5] show that a variable block size algorithm with an optimized tree structure produces improvements over a fixed block size algorithm. In this paper, we attempt to combine these two characteristics to improve the rate-distortion trade-off. Under this framework, the initial macro-block size can be selected as 128×128, 64×64 or 32×32 instead of 16×16 as specified in H.264. Furthermore, this macro-block is allowed to be split into smaller blocks, e.g. 8×8 or 4×4. This initial selection reduces the bitrate of the frame by coding the homogeneous regions of the frame using fewer blocks. However, when complex motions occur in a macro-block, a multi-level split allows these areas to be represented by a combination of smaller blocks with individual motion vectors. There is also a strong correlation between the block size and the motion model used to represent the motion of that particular block. For instance, although translational motion vectors might be sufficient to represent a 4×4 block quite accurately, they are usually inadequate for representing complex motion in a 16×16 block. This inadequacy motivates us to introduce extended motion models into our scheme. The primary motivation behind smaller blocks is the inherent limitation of the translational motion model.
Hence, when we utilize a higher-order motion model such as the elastic model, it is not necessary to use smaller blocks. Furthermore, smaller blocks are not very feasible with a higher-order motion model from a rate-distortion trade-off point of view, due to the higher amount of motion data required compared to the amount of residual data, which in turn might lead to inferior cumulative rate-distortion performance. Our experiments show that an initial macro-block size of 32×32 is the most suitable for the 8-parameter elastic motion model proposed in this paper. Therefore, the maximum decomposition level is empirically chosen as 2, which results in a smallest block size of 8×8. This selection of block size and partition levels is attractive from a rate-distortion optimization point of view as it reduces the data representing partitioning and motion information. This approach is also beneficial from a computational complexity perspective. The use of larger blocks reduces the total number of blocks to be motion-compensated, thereby reducing the total number of calculations and the computation time. Although larger blocks require slightly more memory, this is no longer infeasible for new-generation handheld devices with increased computation power and memory resources.
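The Z·2^{f-g} size rule can be made concrete with a short helper (our own illustration; Z, f and g are the symbols defined above):

```python
def partition_sizes(Z=8, f=2):
    """Partition shapes available at each splitting level g (0 <= g < f).

    A level-g block of size Z*2^(f-g) may stay whole, split in two
    (horizontal or vertical), or quad-split; only a quad split may be
    partitioned further, until the minimum size Z x Z is reached."""
    assert Z % 4 == 0, "Z must be a multiple of 4 for the 4x4 integer transform"
    sizes = {}
    for g in range(f):
        n = Z * 2 ** (f - g)
        sizes[g] = [(n, n), (n, n // 2), (n // 2, n), (n // 2, n // 2)]
    return sizes

# Proposed scheme: 32x32 macro-blocks, two levels, smallest block 8x8
print(partition_sizes(Z=8, f=2))
# H.264 defaults: Z = 4, f = 2 gives 16x16 macro-blocks down to 4x4
print(partition_sizes(Z=4, f=2)[1][3])   # (4, 4)
```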


C. Enhanced Motion Compensation Algorithm

In this section we describe our new technique that uses an elastic motion model and larger blocks to improve the compression efficiency of inter-frame video coding. Inter-frame coding involves finding the best prediction for a block of pixels in the current frame from a reference frame which has previously been transmitted to the decoder. In a rate-distortion optimized video encoder, the mode choice is performed by selecting the mode which minimizes a Lagrangian cost function J_{mode} given by:

J_{mode} = D_{mode} + \lambda_{mode} R_{mode}    (10)

where D_{mode} is the SSD between the original and reconstructed macroblocks or partitions, \lambda_{mode} is a Lagrangian multiplier and R_{mode} is the bit-rate required to transmit the motion vectors, the transform coefficients of the prediction residual and the macroblock type information. The choice of motion vector for a particular partition is also performed using a Lagrangian cost function given by:

J_{motion} = D_{motion} + \lambda_{motion} R_{motion}    (11)

where D_{motion} is the sum of the absolute differences between the current partition and the partition in the reference frame indicated by the candidate motion vector, R_{motion} is the bit-rate required to transmit the motion vectors and \lambda_{motion} is a Lagrangian multiplier. In this scheme, we utilize H.264/AVC-type Lagrangian multipliers for the rate-distortion optimization [20]. In our proposed algorithm the mode decision process is modified to allow a choice between standard translational motion vectors and elastic motion parameters that describe non-translational motions of pixels from the reference frame. The proposed enhanced motion compensation algorithm is summarized in the following steps:

1. For each 32×32 macro-block:
   1.1 For each partition mode:
      1.1.1 Calculate translational motion vectors with minimum cost using λ_motion
      1.1.2 Calculate rate and distortion for this partition
      1.1.3 Find cost for translational motion compensation using λ_mode
      1.1.4 Calculate elastic motion parameters
      1.1.5 Calculate rate and distortion for elastic motion parameters
      1.1.6 Find cost for elastic motion compensation using λ_mode
      1.1.7 If the cost for elastic motion is less than that for translational motion, select elastic motion compensation for this partition mode
   1.2 Calculate total rate and distortion for this mode
2. Find the best mode with minimum cost using λ_mode
3. If the mode is quad, split and repeat steps 1 and 2 for each 16×16 partition
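The per-partition choice in steps 1.1.3-1.1.7 reduces to comparing two Lagrangian costs of the form J = D + λR from Eq. (10). A schematic sketch, with the motion search and entropy coding left abstract (function name and example numbers are ours):

```python
def select_prediction(D_trans, R_trans, D_elastic, R_elastic, lam_mode):
    """Steps 1.1.3-1.1.7: pick the cheaper of translational and elastic
    motion compensation for one partition using J = D + lambda * R.
    D_* are SSD values and R_* are bit counts for the two candidates."""
    J_trans = D_trans + lam_mode * R_trans
    J_elastic = D_elastic + lam_mode * R_elastic
    if J_elastic < J_trans:
        return "elastic", J_elastic
    return "translational", J_trans

# Elastic wins only when its distortion saving outweighs its parameter cost
mode, cost = select_prediction(D_trans=5000, R_trans=40,
                               D_elastic=3200, R_elastic=120, lam_mode=10.0)
print(mode)   # elastic  (3200 + 1200 < 5000 + 400)
```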

The additional steps 1.1.3-1.1.7 that are required for the proposed enhanced motion compensation algorithm are shown in bold. In step 1.1.4, the elastic motion parameters are determined as described in section II.A. To calculate these parameters, a partition from the reference frame corresponding to the integer-pixel translational motion vectors found in step 1.1.1 is used. This reference partition is then registered to the current partition using the standard gradient-descent algorithm described earlier. The motion parameters found from the registration process are then quantized and used to calculate a warped version of the reference partition, which is used as the motion-compensated prediction. A demonstration of block partitioning and motion modelling is given in figure 3. It shows that in cases where the underlying motion of a region is complex, conventional H.264 applies a piecewise-translational coding approach by dividing the region into multiple smaller blocks with independent motion vectors. In contrast, the proposed scheme is able to efficiently represent the same region using one large block. The elastic motion model generates a smooth motion field which results in similar or better prediction with a reduced bitrate.
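The quantize-and-warp step just described can be sketched on its own. This is a self-contained NumPy illustration with our own function name and a uniform quantizer; nearest-neighbour resampling stands in for the codec's sub-pixel interpolation:

```python
import numpy as np

def elastic_predict(ref_part, m, q_step=0.125):
    """Quantize the elastic parameters and warp the reference partition to
    form the motion-compensated prediction: each pixel is displaced by the
    cosine-basis expansion of Eqs. (1)-(2). q_step = 1/8 matches the
    parameter precision used in this paper."""
    N, M = ref_part.shape
    P = m.size
    s = int(round(np.sqrt(P // 2)))
    mq = np.round(np.asarray(m, float) / q_step) * q_step  # quantized parameters
    y, x = np.mgrid[0:N, 0:M]
    phi = np.stack([np.cos((2 * x + 1) * np.pi * u / (2 * M)) *
                    np.cos((2 * y + 1) * np.pi * v / (2 * N))
                    for u in range(s) for v in range(s)], axis=-1)  # Eq. (2)
    xp = x + phi @ mq[:P // 2]          # Eq. (1), horizontal displacement
    yp = y + phi @ mq[P // 2:]          # Eq. (1), vertical displacement
    xi = np.clip(np.rint(xp).astype(int), 0, M - 1)
    yi = np.clip(np.rint(yp).astype(int), 0, N - 1)
    return ref_part[yi, xi]
```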

D. Encoding Issues

This approach requires only very minor changes to the standard bit-stream syntax. A flag indicates whether the motion vectors are translational only or elastic (using the proposed method). All the motion vectors are coded using Exp-Golomb codes [3]. Translational motion vectors are coded in a manner similar to H.264, with quarter-pixel accuracy. When a block is coded using the proposed method, two sets of motion parameters are coded: translational motion vectors to indicate the integer-pixel location of the prediction block, and elastic motion parameters to define the enhanced prediction. In this case, the translational motion vectors are coded with single-pixel accuracy and the elastic motion parameters are coded with 1/8 precision. Our experiments show that 1/4, 1/8 or 1/16 precision can be used to code the elastic parameters without much disparity in performance. However, 1/8 precision generally yields the best rate-distortion trade-off due to the high sensitivity of the elastic parameters.

E. Computational Complexity

The enhanced prediction presented in this paper comes with some computational burden. We use Big-O notation to describe and compare the computational complexity of the proposed algorithm. The computational cost is mainly incurred by the inverse-compositional gradient-descent optimization [18] used to compute the elastic motion parameters. The complexities of the pre-computation and of each subsequent iteration are O(P²T) and O(PT + P³) respectively, where P is the number of elastic motion parameters and T is the total number of pixels in the block. In our scheme, P = 8 and T ≤ 32×32. Typically it takes about 15 iterations to find the optimal motion parameters. In comparison, the computational cost for affine motion estimation [10]-[14] is similar to that of the proposed scheme except that P = 6, while H.264

motion estimation costs O(T²), where T ≤ 16×16. The execution time due to the additional complexity can be afforded in offline processing and storage applications with relaxed time constraints, yet it is of some concern for real-time applications or devices with very limited resources. It can be noted here that the bulk of the complexity is on the encoder side; hence, the proposed algorithm puts comparatively little load on the decoder. The complexity can be reduced in several ways. Firstly, the complexity of block matching can be lowered using fast block search techniques, e.g. logarithmic search, 3-step search, etc. [6]. Rate-distortion-complexity optimization (RDCO) [25] is another effective way to reduce the complexity of mode selection in the encoder. The complexity of the gradient-descent optimization can be reduced by using fewer iterations per block, introducing a stopping criterion to exit the iterative loop early, or using fast gradient-descent optimization with low bit-depth computation [23]. Our experiments show that reducing the number of iterations per block dramatically speeds up the encoding process without any major degradation in rate-distortion performance.

III. EVALUATION

We evaluate the performance of the proposed formulation targeting different video conferencing, streaming and entertainment applications. The sequences used are listed in Table 1. We consider only the luma (Y) component in the experiments. For video conferencing and streaming applications, the first frame is coded as an I frame and subsequent frames are coded as P frames. Entertainment-quality videos are coded in IBBP format. The motion search range is restricted to ±8 and ±4 pixels for the 1st and 2nd levels of full-search motion estimation, respectively. We compare the performance of our algorithm to that of the following algorithms:

a) H.264 - H.264 implementation (macro-block size of 16 with 2 levels of partitioning). It is selected as the state-of-the-art video coding standard [3].
b) BMA 32 - H.264 with a macro-block size of 32×32 and 2 levels of partitioning. This is used to see the effect of larger blocks on the existing standard.

c) Affine 16 - H.264 enhanced with RD-optimized affine motion compensation. The affine motion model is currently the most popular alternative to the classical translational motion model [10]-[14].

d) Affine 32 - 'BMA 32' enhanced with RD-optimized affine motion compensation. This is to observe the performance of the affine model with larger blocks.

e) Elastic 16 - H.264 enhanced with RD-optimized elastic motion compensation. It is used here to show the effect of an elastic motion model on standard block sizes [22].

A. Rate-distortion comparison

Table 2 shows results for the test sequences under various test conditions selected to represent typical interactive video conferencing and video streaming applications. Video conferencing applications generally support low to medium bit

rates and picture resolutions – QCIF and CIF. Video streaming applications support the same resolutions but at higher bit rates. We have measured performance using Lagrangian RD-optimized encoders. The gain of the proposed algorithm over H.264 is given in the rightmost column. It can be seen that the proposed scheme significantly outperforms the H.264 standard, by 0.5-1.3 dB in most cases. The gain over the other reference models is also evident. In a few cases (especially at high rates), 'Elastic 16' has a small gain over the proposed method. However, the gain is small in comparison with the computational costs involved. Next, we present some selected rate-distortion curves in figure 4. The left column of the figure shows a comparison between H.264, 'Affine 16' and 'Elastic 16'. The objective here is to examine the effect of the affine and (proposed) elastic motion models on standard block sizes. It can be seen that 'Elastic 16' has a gain of 0.5 dB over H.264 at higher bitrates. However, there is no gain at lower rates. Interestingly, the affine motion model achieves virtually no gain with respect to H.264. This is because of the improved block-partitioning mechanism and higher precision of motion vectors used in H.264. It is also not viable for the RD-optimized encoder to choose extra motion parameters for smaller blocks due to their high overhead. Yet, it is evident that an elastic motion model represents the complex motion fields typical of video sequences better than the affine model with standard blocks. In the right column of figure 4, the performance of the proposed method is shown together with H.264, 'BMA 32' and 'Affine 32'. It can be seen that 'BMA 32' achieves gains of 0.5-1 dB at lower rates. At higher rates, the gain diminishes or even becomes a loss compared to H.264. This is because translational motion vectors are not sufficient to capture the motion in larger blocks.
Fewer signalling and motion vector bits are required for larger blocks but, at medium and higher bit rates, this advantage is overshadowed by the requirement to transmit far more residual bits. 'Affine 32' achieves a similar gain at lower rates but none at higher ones. In contrast, we accomplish significant improvement using the proposed algorithm over a wide range of PSNRs and bit-rates. The overall gain is about 0.5-2.5 dB in PSNR and 10-20% in compression efficiency. This implies that an elastic motion model is able to represent the complex motion in large blocks much more effectively. Hence, using a combination of larger blocks and an elastic motion model is very attractive from a rate-distortion point of view. To better understand why and how the performance of the different algorithms is affected, we present some comparative analysis for the Foreman QCIF sequence at 15 Hz with selected QP values in Table 3. Average values of PSNR and bits for the P frames are given for fair comparison, since the I frames are the same for all schemes. All comparisons are made against H.264. It can be seen that the PSNR values achieved by all the schemes generally remain the same at a specific QP. However, the cumulative RD performance is influenced by the savings in the total number of

bits in P frames derived from each method. The proposed method typically saves an average of about 15% of the total bits. 'BMA 32' is able to provide a similar reduction at higher QPs (i.e. lower bit-rates). However, at medium or lower values of QP, it requires about 5% extra bits, which affects its overall performance. 'Affine 16' remains similar to H.264, with a ±1% change in total bits. 'Affine 32' follows 'BMA 32' at higher QPs, but the saving reduces to ±1% at medium or lower QPs. This result implies that the gain at lower PSNR values is mostly due to the larger block size rather than the affine motion model. 'Elastic 16' is able to provide savings of about 7% at higher rates, although it stays unchanged at lower rates. This gain is due to the better prediction of the elastic motion model. When we look at the average bits spent on motion (signalling bits are counted as part of the motion bits) and residuals, a clearer picture emerges. 'BMA 32' achieves about a 53% reduction in motion bits but costs an extra 13% in residual bits, which degrades its performance. 'Affine 16' saves 3% of residual bits at the cost of 6% extra motion bits; hence, its performance remains similar to H.264. 'Affine 32' provides 46% savings in motion bits while costing 8% more residual bits. 'Elastic 16' costs 28% more motion bits but saves 14% of residual bits at higher bit rates, which explains its better performance at higher rates. On the other hand, the proposed technique offers about 45% and 4% savings in motion bits and residuals, respectively, at higher QPs (i.e. lower bit-rates). At higher rates, the savings in motion bits reduce to 21%, with an 8% reduction in residual bits. Our next experiment addresses coding efficiency for entertainment-quality applications, such as DVD-Video systems, SDTV and HDTV. In such applications, sequences are generally encoded at resolutions of 720×576 pixels and higher, at an average bit rate of 3-15 Mbps.
One I-frame is used at the beginning of each sequence, and two B-frames are inserted between each pair of successive P-frames (IBBPBBP format). The remaining coding parameters are kept the same as in the earlier experiments. As shown in Fig. 5, the proposed method achieves a substantial gain (0.5-2 dB) over a wide range of bit rates for these higher-resolution videos as well. These results indicate that the proposed algorithm is suitable for a wide variety of video applications.
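The IBBPBBP structure described above can be generated mechanically: frame 0 is an I-frame, every third frame thereafter is a P-frame, and the rest are B-frames. The helper below is a sketch of ours (the function name is not from any codec API).

```python
# Generate the display-order frame types for an IBBPBBP GOP structure:
# one I-frame at the start, then two B-frames between successive P-frames.

def gop_frame_types(num_frames):
    """Return a list like ['I', 'B', 'B', 'P', 'B', 'B', 'P', ...]."""
    types = []
    for i in range(num_frames):
        if i == 0:
            types.append("I")      # single I-frame at the start of the sequence
        elif i % 3 == 0:
            types.append("P")      # every third frame is predicted (P)
        else:
            types.append("B")      # the two frames in between are bi-predicted
    return types

print("".join(gop_frame_types(7)))  # IBBPBBP
```

Note that real encoders transmit these frames in coding order (each P before the B-frames that reference it); the sketch only covers display order.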

B. Encoding time
In this section, the encoding time of the proposed scheme is compared with that of the other reference models. Our experiments show that our encoding time (relative to H.264) depends on two factors: the resolution of the video and the quantization parameter (QP). The extra encoding time is inversely related to the QP value, and bears a similar inverse relationship to resolution. For example, our algorithm takes 40% more time than H.264 to encode a CIF sequence, whereas it takes only 5-10% more time for an SD sequence (I and P frames only). This is because of the different computational overheads involved in the implementation. The proposed scheme encodes at a very similar speed (±5%) to H.264 for SD and HD sequences when B frames are used. Leaving this issue aside, our method requires 50-75% and 10-15% extra encoding time with respect to 'BMA 32' and 'Affine 32', respectively, while 'Affine 16' and 'Elastic 16' cost 50% and 90% more time than the proposed scheme. As discussed in Section II.E, we used 15 iterations per block in our proposed method across all experiments, for an average improvement of 0.7 dB. The encoding time can be greatly reduced by lowering the number of iterations per block, at a small penalty in PSNR. For example, reducing the iterations to 10 cuts the encoding time by 20% with a PSNR loss of only 0.07 dB; a further reduction to 5 iterations per block costs another 0.07 dB. Therefore, we still retain an average gain of over 0.5 dB when our encoder runs at a speed similar to, or faster than, H.264. This is a simple but very effective way to bound the encoding time and meet the constraints of delay-sensitive applications.
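The iteration cap described above amounts to bounding a per-block gradient-descent refinement. The toy example below illustrates the mechanism only: a 1-D quadratic stands in for the elastic-parameter objective, and all names, step sizes and tolerances are illustrative assumptions, not the paper's actual minimizer.

```python
# Capping gradient-descent iterations trades accuracy for bounded run time.
# Toy stand-in for per-block refinement; not the paper's actual optimizer.

def refine(grad, x0, step, max_iters, tol=1e-9):
    """Gradient descent that stops at max_iters or when the update is tiny."""
    x = x0
    iters_used = 0
    for _ in range(max_iters):
        g = grad(x)
        if abs(step * g) < tol:
            break                 # converged early: update negligible
        x -= step * g
        iters_used += 1
    return x, iters_used

# Minimize f(x) = (x - 3)^2, whose gradient is 2*(x - 3); minimum at x = 3.
grad = lambda x: 2.0 * (x - 3.0)

x5, _ = refine(grad, x0=0.0, step=0.1, max_iters=5)    # coarse: cheaper, less accurate
x15, _ = refine(grad, x0=0.0, step=0.1, max_iters=15)  # matches the 15 iterations used here
```

Halving or thirding `max_iters` bounds the per-block work linearly, at the cost of a slightly less accurate motion fit; this mirrors the 20% time saving for a 0.07 dB loss reported above.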

IV. CONCLUSION

In this paper, we have presented a new video compression technique that takes advantage of larger blocks with multiple-level partitioning and a higher-order elastic motion model. Although neither of these two features is substantial by itself in the context of the state-of-the-art H.264 coder, their combination provides significant improvements in compression efficiency over a wide range of output bit rates. The proposed method surpasses all the other techniques presented in this paper as a result of considerable savings in both motion and residual bits; the amount of improvement is, of course, sequence- and content-dependent. The elastic motion model is able to characterize the complex motion in typical video sequences even when the block size is large, and the use of larger blocks makes it efficient from a rate-distortion optimization point of view. It should also be mentioned that larger blocks greatly reduce the overall computation time of the proposed algorithm: larger blocks mean a smaller number of blocks, and hence fewer gradient-descent minimizations. Although the computational complexity of gradient-descent optimization is not a major concern for off-line processing and storage applications, it is a valid concern for mobile devices and applications with strict delay constraints. However, it has been shown here that the encoding time can be easily regulated by reducing the number of iterations per block when necessary. Future work on this approach will concentrate on determining the optimal basis functions for the elastic motion model, on more efficient prediction and coding of the elastic motion parameters, and on low-complexity implementations for real-time applications.


REFERENCES
[1] "Video Codec for Audiovisual Services at p×64 kbits," ITU-T Recommendation H.261, 1994.
[2] "Video Coding for Low Bitrate Communication Version 1," ITU-T Recommendation H.263, 1995.
[3] ISO/IEC 14496-10 and ITU-T Rec. H.264, Advanced Video Coding, 2003.
[4] M. H. Chan, Y. B. Yu and A. G. Constantinides, "Variable size block matching motion compensation with application to video coding," IEE Proceedings, vol. 137, no. 4, 1990.
[5] G. J. Sullivan and R. L. Baker, "Rate-distortion optimized motion compensation for video compression using fixed or variable size blocks," in Proc. GLOBECOM 1991, pp. 85-90.
[6] F. Dufaux and F. Moscheni, "Motion estimation techniques for digital TV: a review and a new contribution," Proceedings of the IEEE, vol. 83, no. 9, pp. 858-876, June 1999.
[7] M. Karczewicz, J. Nieweglowski and P. Haavisto, "Video coding using motion compensation with polynomial motion vector fields," Signal Processing: Image Commun., vol. 10, pp. 63-91, 1997.
[8] S. L. Lu, "Comparison of motion compensation using different degrees of sub-pixel accuracy for interfield/interframe hybrid coding of HDTV image sequences," in Proc. IEEE ICASSP'92, San Francisco, CA, Mar. 1992, vol. 3, pp. 465-468.
[9] R. Mathew and D. Taubman, "Hierarchical and polynomial motion modeling with quad-tree leaf merging," in Proc. ICIP 2006, pp. 1881-1884.
[10] V. Seferidis and M. Ghanbari, "General approach to block-matching motion estimation," Optical Engineering, vol. 32, no. 7, pp. 1464-1474, 1993.
[11] Y. Nakaya and H. Harashima, "Motion compensation based on spatial transformations," IEEE Trans. Circ. Syst. for Video Tech., vol. 4, no. 3, pp. 339-367, 1994.
[12] H. Li and R. Forchheimer, "A transformed block-based motion compensation technique," IEEE Trans. on Communications, vol. 43, no. 2/3/4, pp. 1673-1676, 1995.
[13] K. Zhang, M. Bober and J. Kittler, "Image sequence coding using multiple-level segmentation and affine motion estimation," IEEE Journal on Selected Areas in Commun., vol. 15, no. 9, pp. 1704-1713, 1997.
[14] R. C. Kordasiewicz, M. D. Gallant and S. Shirani, "Affine motion prediction based on translational motion vectors," IEEE Trans. Circ. Syst. for Video Tech., vol. 17, no. 10, pp. 1388-1394, 2007.
[15] M. Sayed and W. Badawy, "An affine-based algorithm and SIMD architecture for video compression with low bit-rate applications," IEEE Trans. Circ. Syst. for Video Tech., vol. 16, no. 4, pp. 457-471, 2006.
[16] J. Kybic and M. Unser, "Fast parametric elastic image registration," IEEE Trans. Im. Proc., vol. 12, no. 11, pp. 1427-1442, Nov. 2003.
[17] A. A. Muhit, M. R. Pickering, M. R. Frater and J. F. Arnold, "Extended motion compensation using larger blocks and elastic motion model," in Proc. IEEE MMSP 2008, to appear.
[18] S. Baker and I. Matthews, "Lucas-Kanade 20 years on: a unifying framework," International Journal of Computer Vision, vol. 56, no. 3, pp. 221-255, 2004.
[19] J. Arnold, M. Frater and M. Pickering, Digital Television Technology and Standards, John Wiley & Sons Inc., 2007.
[20] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini and G. J. Sullivan, "Rate-constrained coder control and comparison of video coding standards," IEEE Trans. Circuits and Systems for Video Technol., vol. 13, no. 7, pp. 688-703, July 2003.
[21] K. Kannappan, "An interleaved scanline algorithm for 2-D affine transformations of images," in Proc. 35th Midwest Symp. Circuits Syst., vol. 1, pp. 179-182, August 1992.
[22] M. R. Pickering, M. R. Frater and J. F. Arnold, "Enhanced motion compensation using elastic image registration," in Proc. ICIP 2006, pp. 1061-1064.
[23] K. Yang, M. R. Frater, M. R. Pickering and J. F. Arnold, "Generalized framework for reduced precision global motion estimation between digital images," in Proc. IEEE MMSP 2008, to appear.
[24] S. Ma and C.-C. J. Kuo, "High-definition video coding with super-macroblocks," in Proc. SPIE, vol. 6508.
[25] Y. Hu, Q. Li, S. Ma and C.-C. J. Kuo, "Joint rate-distortion-complexity optimization for H.264 motion search," in Proc. IEEE ICME, Toronto, Canada, July 9-12, 2006.


Fig. 1. Motion fields corresponding to each basis function of the elastic warping function when P = 8, s = 2 and M = N = 4.

Fig. 2. Tree-structured multiple level split of a video frame.


Fig. 3. A demonstration of block partitioning and motion modelling using the proposed algorithm. (a) Block partitioning in H.264; (b) Block partitioning in the proposed method – elastic motion model was chosen by the proposed encoder for blocks with black border; (c) 2-D motion vector field of a selected 32×32 region in (a); and (d) Smooth 2-D motion vector field for the same region generated using the proposed method in (b).

[Fig. 4 panels: Mother & Daughter QCIF 10 Hz, Foreman QCIF 15 Hz, Mobile & Calendar CIF 15 Hz and Horse CIF 30 Hz; left-hand panels compare H.264, Affine 16 and Elastic 16, right-hand panels compare H.264, BMA 32, Affine 32 and the proposed method; axes: Rate (kbps) vs. Average PSNR (dB).]
Fig. 4. Selected rate-distortion curves for QCIF and CIF sequences.


[Fig. 5 panels: Susie SD 25 Hz, Foreman SD 25 Hz, Park Run HD 25 Hz and Shields HD 25 Hz; each panel compares H.264 with the proposed method; axes: Rate (Mbps) vs. Average PSNR (dB).]

Fig. 5. Selected rate-distortion curves of SD and HD sequences for entertainment applications.
TABLE 1 LIST OF SEQUENCES


TABLE 2 FIXED BIT-RATE RESULTS FOR VIDEO CONFERENCING & STREAMING APPLICATIONS

TABLE 3 COMPARATIVE ANALYSIS FOR FOREMAN QCIF SEQUENCE AT 15 HZ