Hierarchical and Polynomial Motion Modeling with Quad-Tree Leaf ...

HIERARCHICAL AND POLYNOMIAL MOTION MODELING WITH QUAD-TREE LEAF MERGING

Reji Mathew∗† and David S. Taubman∗

∗ The University of New South Wales, Sydney, Australia
† National ICT Australia

ABSTRACT

Recent work into quad-tree representations [1] has commented on the sub-optimal performance of quad-trees due to not exploiting the dependency between neighboring leaf nodes with different parents. Leaf merging has therefore been proposed to rectify this performance loss. In the context of quad-tree motion models, the performance of leaf merging was recently demonstrated in [2], where an optimally pruned H.264 motion model was taken as the starting point for subsequent merging steps. In this paper we explore the performance of leaf merging over a wider range of conditions by starting with a general tree structure where each node is allowed to represent motion using either polynomial models or a single translational vector. We also explore two motion prediction methods: spatial prediction and hierarchical prediction. Experimental results show the benefit of merging across these broad conditions. In comparison to the merged H.264 model reported in the previous work, substantial gains are evident. Furthermore, we explore applying the merging principles to branch nodes of a quad-tree to achieve efficient hierarchical motion representation.

Index Terms: Video coding, Motion compensation

1. INTRODUCTION

For modeling image features the quad-tree structure is often used, as it allows an image to be recursively divided into smaller regions with each region represented by a suitable model. For image compression, which can be considered a form of image modeling, the quad-tree representation is attractive as it lends itself to direct rate-distortion optimization, allowing a Lagrangian cost function to be globally minimized using tree pruning strategies. In the present paper, we are concerned with the modeling of motion between frames of a video sequence. Ultimately, we are interested in video coding schemes employing motion compensated prediction or motion compensated temporal filtering (MCTF), with an emphasis on scalable video coding.
Typically it is not possible to represent the motion between frames by a single model and therefore a quad-tree structure is employed where smaller, variable size regions or blocks are allowed to take on separate motion models. Recent work into quad-tree representations [1] has commented on the sub-optimal performance of quad-trees due to not exploiting the dependency between neighboring leaf nodes with different parents. Leaf merging has been proposed to rectify this performance loss as it allows joint coding of related nodes. The performance of leaf merging in the context of quad-tree motion models was recently demonstrated in [2]. In that work, an optimally pruned H.264 motion model was taken as the starting point for subsequent merging steps.

1­4244­0481­9/06/$20.00 ©2006 IEEE


In this paper we explore the performance of leaf merging over a wider range of conditions than those investigated in [2]. We pursue this by starting with a general tree structure where each node is allowed to represent motion using either polynomial models or a single translational vector. A greater set of node sizes and decomposition levels for the quad-tree structure is also explored, removing some of the strict limitations imposed in [2]. The previously investigated H.264 [3] model can therefore be considered a subset of the general structure presented in this paper. Although in this study we are not primarily concerned with H.264 compression, we often refer to H.264 cases for comparison purposes.

Polynomial motion models have been the subject of previous research [4], focusing primarily on applying polynomial motion compensation to improve rate-distortion performance. Our motivation in pursuing polynomial models to describe motion is principally to attain a smoother motion representation and to avoid some of the boundary artifacts caused by piecewise constant motion. This goal is furthered by the use of leaf merging.

The hierarchical nature of quad-tree decomposition raises the possibility of hierarchical coding of motion, which is attractive for scalable video coding. In this paper we introduce hierarchical coding of the quad-tree structure and accompanying merge and motion information. Obviously, hierarchical coding involves redundancy, as branch (or intermediate) nodes must be assigned motion parameters even if they will not be used directly for motion compensation. Interestingly, when rate-distortion optimized merging principles are applied to both leaf nodes and branch nodes of a hierarchical representation, the cost of this redundancy is considerably reduced. Preliminary results presented in this paper illustrate both the efficiency of merged hierarchical motion modelling and the potential for scalable motion representation.
One of the most important observations in this paper is that the leaf merging principle provides substantial benefits over a broad range of motion modeling contexts, including both translational and polynomial models, and both hierarchical and spatially predictive (a la H.264) motion coding methods. Section 2 describes the motion models considered in this paper, while Section 3 shows how the merging principle can be applied to these models. These are followed by experimental results in Section 4 and some conclusions. Whereas [2] focused on providing convincing results in the context of H.264, in this work we seek to understand the potential of RD optimized motion merging in a more general context, with applications to wavelet based texture coding and scalable video. While alternate video compression paradigms lie beyond the scope of this paper, we provide evidence for the potential of our proposed motion models in a variety of forms.

ICIP 2006

2. MOTION MODELS

We begin with a hierarchical description of motion based on blocks. At each level $k$ in the hierarchy, the frame is partitioned into blocks $B_b^{(k)}$, each of size $S2^k \times S2^k$. Motion compensation is always performed at the lowest level, $k = 0$, typically with a block size of $S \times S = 4 \times 4$. At this level, motion is modeled as pure translation, with motion vectors $v_b^{(0)}$. At all other levels in the hierarchy, we allow for motion models with 2, 4 or 6 parameters, corresponding to translation, linear and affine flows, writing $\mathcal{M}_b^{(k)}$ for the relevant model. In practice, $\mathcal{M}_b^{(k)}$ is represented through 1, 2 or 3 motion vectors, $v_b^{(k,1)}, v_b^{(k,2)}, v_b^{(k,3)}$, with nominal locations $\eta_b^{(k,1)}, \eta_b^{(k,2)}, \eta_b^{(k,3)}$, as depicted in Fig 1. Motion compensation based on model $\mathcal{M}_b^{(k)}$ is performed by first generating a set of corresponding motion vectors for each descendant at level 0, according to

$$v_{b'}^{(0)} = \sum_{i=1}^{3} \alpha_{b',i}\, v_b^{(k,i)}, \quad \forall B_{b'}^{(0)} \subset B_b^{(k)}$$

where the $\alpha_{b',i}$ sum to 1, and depend on the displacements $\eta_b^{(k,i)} - \eta_{b'}^{(0)}$. Of course, for translational models $\alpha_{b',1} = 1$, while for linear models $\alpha_{b',3} = 0$.

In our work, the model parameters $v_b^{(k,1)}, v_b^{(k,2)}, v_b^{(k,3)}$ are obtained by a weighted least squares fitting procedure, based upon a collection of initial translational motion estimates on blocks of size $S2^j \times S2^j$ for various $j$ in the range 0 to $k - 1$. The weights used in this procedure are selected so as to give relatively greater importance to locations that are more sensitive to motion compensation errors, such as edges and highly textured regions. Lower importance is assigned to locations that can tolerate motion compensation error, such as flat or smooth regions.

A quad-tree is defined using this multi-level hierarchical description of motion, with a leaf node at index $b$ and level $k$ in the tree structure representing the motion model $\mathcal{M}_b^{(k)}$ of the corresponding block. At the topmost level in the tree hierarchy, all nodes are labelled as either leaf or branch nodes. While leaf nodes represent motion models, branch nodes indicate that motion is represented by descendant nodes at lower levels. The immediate descendants of a branch node, located at the next lower level, may in turn be leaves or branches. This tree structure can continue until reaching the lowest level, $k = 0$.

Inspired by the success of the H.264 quad-tree structure, we follow the general principles of H.264 quad-tree decomposition but increase the largest block size from 16 × 16 in the earlier work [2] to 32 × 32 and increase the number of decomposition levels from 3 to 4, thereby still keeping the smallest block size at 4 × 4. This corresponds to increasing the maximum value of $k$ to 3 while maintaining $S = 4$. We still retain the H.264 property of a block being able to be split into either four quads or two rectangles.
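The interpolation of level-0 vectors from a block's affine model, described earlier in this section, can be sketched as follows. This is only an illustration under stated assumptions: the weights are taken to be barycentric coordinates of each sub-block centre with respect to the three nominal locations, which is one choice that sums to 1 and depends only on displacements; the paper does not give the exact form of the weights. The function names, the nominal locations and the test motion field are hypothetical.

```python
def barycentric_weights(eta, x):
    """Weights (a1, a2, a3), summing to 1, of point x w.r.t. three nominal locations.

    Assumption: barycentric coordinates stand in for the paper's alpha weights.
    """
    (x1, y1), (x2, y2), (x3, y3) = eta
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    if det == 0:
        raise ValueError("nominal locations must not be collinear")
    a1 = ((y2 - y3) * (x[0] - x3) + (x3 - x2) * (x[1] - y3)) / det
    a2 = ((y3 - y1) * (x[0] - x3) + (x1 - x3) * (x[1] - y3)) / det
    return a1, a2, 1.0 - a1 - a2


def level0_vectors(eta, v, origin, k, S=4):
    """One translational vector per S x S descendant of a block of size S*2^k.

    eta: three nominal locations (hypothetical placement), v: their motion vectors,
    origin: top-left corner of the block in pixels.
    """
    n = 2 ** k
    field = []
    for r in range(n):
        row = []
        for c in range(n):
            # Centre of the level-0 descendant block at row r, column c.
            centre = (origin[0] + (c + 0.5) * S, origin[1] + (r + 0.5) * S)
            a = barycentric_weights(eta, centre)
            row.append((sum(ai * vi[0] for ai, vi in zip(a, v)),
                        sum(ai * vi[1] for ai, vi in zip(a, v))))
        field.append(row)
    return field


# Hypothetical 16 x 16 block (k = 2) whose three vectors sample an affine field.
eta = [(0.0, 0.0), (16.0, 0.0), (0.0, 16.0)]
v = [(1.0, 2.0), (2.6, 2.0), (1.0, 1.2)]
field = level0_vectors(eta, v, origin=(0.0, 0.0), k=2)
```

With barycentric weights the interpolated vectors reproduce any affine field exactly, so the three vectors carry the six affine parameters.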
With the four-quad or two-rectangle split, a block of size $S2^k \times S2^k$ can be partitioned into four $S2^{k-1} \times S2^{k-1}$ blocks, two $S2^{k-1} \times S2^k$ blocks or two $S2^k \times S2^{k-1}$ blocks, for values of $k$ equal to 3, 2 or 1. In summary, the general quad-tree structure we employ is capable of representing motion using either polynomial models or a single translational vector and incorporates a greater set of node sizes and decomposition levels. The H.264 motion model, which was the focus of the previous work, can be obtained as a special case in which all models are translational, $S = 4$ and the number of levels in the hierarchy is 3.

Fig. 1. Nominal locations within a block used in representing (a) horizontal linear model, (b) vertical linear model and (c) affine model.

In our work we incorporate two motion vector prediction strategies for coding motion. The first is the H.264 spatial motion vector prediction strategy, which is used to code all respective motion vectors of leaf nodes. In this paper we also explore hierarchical motion modelling, which requires a different motion vector prediction strategy. For the hierarchical case, motion vectors of leaf nodes at a particular level are coded with respect to a vector from the immediate parent (branch) node, with the predictors at the highest level taken as (0,0). Note that for hierarchical coding, branch nodes are also required to convey motion information, as the motion vector of a branch node is used as a reference for coding motion vectors of immediate descendants. Although this introduces some redundancy, it enables scalable access to the motion information and, as we will show in the later sections, the redundancy can be reduced by incorporating branch nodes into the merging process.

An appealing attribute of quad-trees is that they allow a Lagrangian cost function of the form $D + \lambda L$ to be globally minimized using tree pruning algorithms. This is possible because distortion $D$ is additive across the leaves and model coding cost $L$ is additive across both leaves and branches in the tree, depending on the motion coding approach. The tree pruning phase therefore seeks to determine the quad-tree structure that minimizes the Lagrangian cost $D + \lambda L$, assuming that no leaf merging will take place. Tree pruning can be globally optimal for hierarchical coding, while it is somewhat greedy for spatially predictive coding.
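The bottom-up pruning that minimizes a Lagrangian cost of the form D + λL can be sketched as below. This is a generic illustration rather than the authors' implementation: it assumes costs are additive and independent across nodes (the situation in which pruning is globally optimal), and the `Node` fields and `branch_bits` value are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    leaf_distortion: float            # D if this node is coded as a leaf
    leaf_bits: float                  # L for its motion model when coded as a leaf
    children: List["Node"] = field(default_factory=list)
    branch_bits: float = 1.0          # hypothetical bits charged when the node is a branch
    is_leaf: bool = True


def prune(node: Node, lam: float) -> float:
    """Return the minimum D + lam * L for the subtree and record the leaf/branch choice."""
    leaf_cost = node.leaf_distortion + lam * node.leaf_bits
    if not node.children:
        node.is_leaf = True
        return leaf_cost
    # Children are optimized first (bottom-up), then compared against keeping a leaf.
    split_cost = lam * node.branch_bits + sum(prune(c, lam) for c in node.children)
    node.is_leaf = leaf_cost <= split_cost
    return min(leaf_cost, split_cost)


def make_tree() -> Node:
    kids = [Node(leaf_distortion=10.0, leaf_bits=8.0) for _ in range(4)]
    return Node(leaf_distortion=100.0, leaf_bits=10.0, children=kids)


low_lambda_tree = make_tree()
low_lambda_cost = prune(low_lambda_tree, lam=1.0)     # distortion dominates: split
high_lambda_tree = make_tree()
high_lambda_cost = prune(high_lambda_tree, lam=10.0)  # rate dominates: keep the leaf
```

When a node is kept as a leaf, the decisions recorded on its descendants are simply ignored. With spatial prediction the bit cost of a node depends on its neighbours, so the same recursion is only a greedy approximation.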

3. MERGING PRINCIPLE

Leaf merging has been proposed to exploit the redundancy that exists between neighboring nodes of different immediate parents. Given a rate-distortion optimally pruned quad-tree, merging is performed by visiting each node in a bottom-up manner and applying a set of rules to identify possible merge targets. For a given leaf block $\bar{b}$, located at $(b, k)$ in the quad-tree, that is $\bar{b} = (b, k)$, we identify a set $T_{\bar{b}}$ of possible merge targets $\bar{b}_i$ such that $\bar{b}_i$ and $\bar{b}$ are neighboring leaf blocks. As in [2], for a leaf block $\bar{b}$, a neighboring leaf block is only considered a valid merge target if either the neighboring block is larger than $\bar{b}$ or, if it is the same size, it belongs to a different immediate parent. This condition ensures that the merging phase does not undo the rate-distortion optimal decisions made during the initial tree pruning phase. A leaf node $\bar{b}$ could potentially merge with at most one of the merge targets $\bar{b}_i$ in the set $T_{\bar{b}}$ to produce a new merged region.

To decide on merging, we first calculate the impact of a merge on the overall Lagrangian cost $D + \lambda L$, taking into account the distortion of the merged region, the motion model coding cost and the cost of signalling merge information. Merging is allowed to take place only if it reduces the overall Lagrangian cost. This ensures that each modification to the original rate-distortion optimally pruned quad-tree will only serve to reduce the overall Lagrangian functional $D + \lambda L$.

In our investigations, for each potential merge between $\bar{b}$ and $\bar{b}_i \in T_{\bar{b}}$, three avenues are explored for calculating the merged motion model, these being a) using the motion model of block $\bar{b}$ for the potential merged region, b) using the motion model of block $\bar{b}_i$ for the potential merged region and c) calculating new models (single vector, linear and affine) for the merged region, fitted using weighted least squares optimization on block motion vectors of all descendant nodes of the region at a given level. In the previous work, the only new model considered for a potential new merged region was an average of the single motion vectors of $\bar{b}$ and $\bar{b}_i$, weighted according to the total area of blocks to which they are assigned. We consider a greater number of new motion models, increasing the possibility of a merge. Since new motion models are formulated using existing motion vectors of descendant nodes, no new exhaustive search needs to be conducted and hence the increase in complexity due to merging is relatively small.

To calculate the cost of a merge, the number of bits required to signal a merge decision and merge direction needs to be determined. As in [2], for a leaf block $\bar{b}$ we first determine the set of possible merge targets $T_{\bar{b}}$. If $T_{\bar{b}}$ is empty then no further action is required. If $T_{\bar{b}}$ contains at least one target then a binary merge flag is required to signal a "merge" decision or a "no merge" decision. The coding of this merge flag can be implemented using an arithmetic coder. Currently, as in [2], we "model" the cost of coding a "merge" decision as $-\log_2 \frac{1}{5}$ and that of coding a "no merge" decision as $-\log_2 \frac{4}{5}$. The difference between these costs is 2 bits. This is a conservative approximation; our tests show that merging occurs more frequently, thereby reducing the real cost of coding the merge flag. If the merge flag signals a "merge" decision then the merge direction (from leaf $\bar{b}$ to either the top, bottom, left or right merge target $\bar{b}_i$) needs to be communicated. The number of bits required is either 0, 1 or 2, corresponding to the number of targets in $T_{\bar{b}}$ being 1, 2 or greater. Due to the rules used for selecting merge targets, the number of targets in $T_{\bar{b}}$ cannot exceed 4.
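The signalling cost described above can be worked through numerically. The sketch below uses the modelled flag costs of −log2(1/5) for "merge" and −log2(4/5) for "no merge", plus 0, 1 or 2 direction bits; the function name and the constant `P_MERGE` are hypothetical, and an actual arithmetic coder would adapt its probabilities rather than use a fixed model.

```python
import math

P_MERGE = 1 / 5  # modelled probability of a "merge" decision


def merge_signalling_bits(num_targets: int, merged: bool) -> float:
    """Modelled bits to signal the merge flag and, if merging, the merge direction."""
    if not 0 <= num_targets <= 4:
        raise ValueError("a node has between 0 and 4 merge targets")
    if num_targets == 0:
        return 0.0  # nothing is signalled when there are no valid targets
    if not merged:
        return -math.log2(1 - P_MERGE)
    # Direction costs 0, 1 or 2 bits for 1, 2 or more than 2 targets.
    direction_bits = 0 if num_targets == 1 else 1 if num_targets == 2 else 2
    return -math.log2(P_MERGE) + direction_bits


flag_gap = merge_signalling_bits(1, True) - merge_signalling_bits(1, False)
```

Here `flag_gap` evaluates to log2(4) = 2 bits, matching the stated difference between the two flag costs.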
These merge coding costs are incorporated into the Lagrangian cost function to determine the benefit of merging. When spatial motion vector prediction is used, as in H.264, to differentially code motion vectors, merging can have an impact on the motion coding cost of subsequent neighboring blocks. This cost was ignored in the previous work; we now incorporate the impact on motion vector prediction at subsequent neighboring blocks into the total cost calculations of a merge. It is important to note that merging may not capture whole regions in an image; visual inspection of the merged results verifies this point. The goal of merging is to correct the sub-optimal rate-distortion properties of an initial quad-tree representation.

The procedure discussed so far for merging can be used for both spatial motion vector prediction (which relates to the H.264 coding scenario) and the hierarchical coding case introduced in this paper. For hierarchical coding, branch nodes are also included in the merging process, with the only change being in the rules identifying possible merge targets. For a branch node $\hat{b} = (b, k)$, possible merge targets that can be included in the set $T_{\hat{b}}$ are neighboring leaf blocks $\bar{b}_i$ that are larger than $\hat{b}$, and neighboring branch blocks $\hat{b}_i$ or leaf blocks $\bar{b}_i$ that are equal in size but belong to a different immediate parent. The cost of a branch node $\hat{b}$ is calculated by assessing the bits required to code a) the branch motion vector and b) the motion of the immediate descendants, since the motion of the children is coded differentially with respect to the parent branch node. All other aspects of merging remain the same, thereby allowing hierarchical coding to be incorporated with minimal changes to the merging algorithm.

To communicate the motion of a merged region, motion information is transmitted for only a single node in the region; we refer to this node as the anchor node.
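The merge-target rules for leaf and branch nodes, and the rule that a merge is accepted only when it lowers the Lagrangian cost, can be sketched as two small predicates. This is a simplified illustration: it assumes square blocks whose size is determined by the level alone, that adjacency has already been established, and that the change in distortion and bits for a candidate merge has been computed elsewhere. The `Block` type and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    level: int      # k; a square block has size S * 2**k
    parent_id: int  # hypothetical identifier of the immediate parent
    is_leaf: bool   # False for a branch node


def is_valid_target(node: Block, neighbour: Block) -> bool:
    """Check whether an adjacent block is a permitted merge target for node."""
    if neighbour.level > node.level:
        # Larger neighbours qualify only if they are leaves.
        return neighbour.is_leaf
    if neighbour.level == node.level:
        if neighbour.parent_id == node.parent_id:
            return False  # same parent: merging would undo the pruning decision
        # Leaves may target leaves only; branches may target leaves or branches.
        return neighbour.is_leaf or not node.is_leaf
    return False  # smaller neighbours are never targets


def should_merge(delta_distortion: float, delta_bits: float, lam: float) -> bool:
    """Accept a merge only if it strictly reduces D + lam * L."""
    return delta_distortion + lam * delta_bits < 0.0


leaf = Block(level=1, parent_id=0, is_leaf=True)
branch = Block(level=1, parent_id=0, is_leaf=False)
```

In this sketch `delta_bits` would include the merge flag and direction bits and, for spatial prediction, the change in coding cost at subsequent neighbouring blocks.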
For both spatially predictive and hierarchical motion coding methods, the same general principle is used to identify anchor nodes within a region, this being that the anchor is the first node in the merged region that is encountered during decoding. For the spatially predictive or H.264 case, this relates to the node which appears first in the raster coding order, regardless of its size. For the hierarchical case introduced in this paper, the maximum block size of all nodes in the region is first determined. Then, from the set of all nodes in the region that correspond to this maximum block size, the anchor node is the block that appears first in the raster coding order. As hierarchical coding is performed top-down, the size of a block, or equivalently the level in the quad-tree, needs to be taken into consideration in determining the anchor node.

One advantage of our node merging scheme is that it is a single-pass (non-iterative) scheme. We investigated multiple-pass merging, where regions and leaf nodes are reconsidered for further merging. We found little gain with such an approach, suggesting that the majority of the rate-distortion governed merging options were incorporated during the first pass.

Fig. 2. Rate-distortion improvements achieved by performing merging on general quad-tree structures for the two cases of spatial and hierarchical motion prediction. (Foreman CIF 30Hz; PSNR (dB) versus rate (k bits/s); curves: general_spt, general_hrc, general_hrc_no_merge, general_spt_no_merge.)

4. EXPERIMENTAL RESULTS

We first present results which show that the merging principle can be successfully applied over a wide range of motion modeling contexts. In Fig 2 we show the benefits of applying merging to our general quad-tree structure, which incorporates both translational and polynomial models, for the two cases of hierarchical and spatial motion vector prediction. The results relate to the first 100 frames of the standard Foreman sequence at CIF resolution and 30 fps. In all the graphs, the rate corresponds to the rate required to code the block mode, motion and any merge information. The distortion or PSNR refers to the Y component of the motion compensated image.
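For reference, the PSNR of a motion compensated luma frame can be computed as in the sketch below. The paper does not spell out its PSNR formula; this uses the standard definition and assumes 8-bit samples (peak value 255). The function name and the sample values are hypothetical.

```python
import math


def psnr_y(reference, predicted, peak=255.0):
    """PSNR in dB between two equally sized luma frames given as lists of rows."""
    squared_errors = [
        (r - p) ** 2
        for ref_row, pred_row in zip(reference, predicted)
        for r, p in zip(ref_row, pred_row)
    ]
    if not squared_errors:
        raise ValueError("frames must not be empty")
    mse = sum(squared_errors) / len(squared_errors)
    # Identical frames have zero error and therefore unbounded PSNR.
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


reference = [[100, 102], [98, 101]]
predicted = [[101, 103], [99, 102]]  # every sample is off by one, so MSE = 1
example_psnr = psnr_y(reference, predicted)
```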
In Fig 2, the graphs labelled "general_hrc" and "general_hrc_no_merge" correspond respectively to the performance of our general quad-tree structure with and without merging for the hierarchical case. The graphs in Fig 2 labelled "general_spt" and "general_spt_no_merge" correspond respectively to the performance of the general quad-tree structure with and without merging for the spatial prediction case. For both hierarchical and spatial motion vector prediction, merging always improves performance. Note that without merging the hierarchical case displays significantly lower rate-distortion performance. This is due to the redundancy associated with intermediate branch nodes; however, by incorporating branch nodes into the merging process, the performance of hierarchical motion representation can be brought close to that achieved by the spatial prediction case with merging. Results obtained for another sequence, Flower Garden, at CIF resolution and 30 fps also showed the same trend, with merging allowing the hierarchical representation to be similar in rate-distortion performance to that attained by the spatial prediction case with merging. We can see therefore that merging opens up possibilities for efficient hierarchical motion modeling, which is attractive for scalable video coding.

Next we compare the performance of merged hierarchical motion modelling with the previously investigated case of applying merging to the specific H.264 model. This comparison is presented in Fig 3 for the Foreman and Flower Garden sequences. In the figures, graphs labelled "H264+merge" relate to applying leaf merging to the H.264 model as reported in [2]. The performance of the H.264 model without merging, labelled "H264", is provided as a reference. The higher performance of our general quad-tree structure with merging, represented by "general_hrc", is clearly evident from the figures.

Fig. 3. Performance comparison of H264 without merge, H264 with merging and the general hierarchical quad-tree with merging. (Panels: Foreman CIF 30Hz and Flower Garden CIF 30Hz; PSNR (dB) versus rate (k bits/s); curves: general_hrc, H264+merge, H264.)

Hierarchical modeling allows hierarchical decoding of motion and this is illustrated in the left section of Fig 4 for the Foreman sequence. For three points on the rate-distortion graph "general_hrc", we show the respective hierarchical decoding paths, from decoding just the top layer, in our case layer 3, to decoding all available layers, that is decoding layers 3 to 0. In Fig 4 (left), the three dashed lines show the rate-distortion characteristics of hierarchical decoding for each of the three cases referred to as A, B and C. Decoding just layer 3 produces the lowest rate and PSNR point; for case A the PSNR achieved at this point is approximately 30 dB. By incorporating further layers, the operating point moves upward to finally reach the maximum rate and PSNR value when all 4 layers are decoded.

Fig. 4. Left: Hierarchical decoding; Right: Gain in performance achieved by incorporating polynomial motion models. (Left panel: Foreman CIF 30Hz, Hierarchical Decoding; curves: general_hrc, case_A, case_B, case_C. Right panel: Flower Garden CIF 30Hz; curves: general_hrc, general_hrc_no_models, H264+merge. Axes: PSNR (dB) versus rate (k bits/s).)
For case A, this corresponds to PSNR values increasing from approximately 30 dB to 32 dB, 37 dB and then 38 dB.

To investigate the advantage of using polynomial motion models on coding efficiency, in the right section of Fig 4 for the Flower Garden sequence, we compare merging with and without using polynomial motion models while keeping all other parameters of our general quad-tree structure unchanged. We examine the hierarchical motion modelling situation and compare the general case of using polynomial models, represented by "general_hrc", with that of using only translational vectors to describe motion, represented by the dashed line graph in the figure labelled "general_hrc_no_models". The "H264+merge" results are shown for reference. The bit-rate saving attained by using polynomial motion models ranges from approximately 10% to 30% with respect to not using the polynomial models. For the Foreman sequence we observe that the advantage of using motion models is lower, typically displaying about 10% saving in bit rate. We observe that polynomial models are advantageous when motion can be well described by linear or affine models, as is the case in Flower Garden.

5. CONCLUSIONS

Leaf merging has been proposed to address the sub-optimal performance of quad-trees. In this paper we have shown that the merging principle can be applied to a wide range of motion modeling contexts, always improving rate-distortion performance. We have presented results showing the gain achieved by applying merging to general quad-tree structures that allow for both translational and polynomial motion models. In particular we have shown that by extending the merging principle to branch nodes, an efficient hierarchical motion representation can be attained. One outcome of such a hierarchical representation is that the quad-tree structure and associated merge and motion information can be hierarchically coded, which is attractive for scalable video coding. The rate-distortion performance achieved by the merged general tree structures considered in this work is significantly better than that achieved previously, where merging was applied directly to an H.264 model. The complexity implications of merging are small in comparison to the initial motion estimation phase, which is required regardless of the application of merging.

6. REFERENCES

[1] R. Shukla, P. Dragotti, M. Do, and M. Vetterli, “Rate-distortion optimized tree-structure compression algorithms for piecewise polynomial images,” IEEE Trans. Image Processing, vol. 14, pp. 343–359, March 2005.
[2] R. D. Forni and D. Taubman, “On the benefits of leaf merging in quad-tree motion models,” Proc. IEEE Int. Conf. Image Processing, vol. 2, pp. 858–861, September 2005.
[3] “Advanced video coding for generic audiovisual services,” Tech. Rep. Rec. H.264, ITU-T, May 2003.
[4] T. Wiegand, E. Steinbach, A. Stensrud, and B. Girod, “Multiple reference picture video coding using polynomial motion models,” Proc. Visual Communications and Image Processing, vol. 3309, pp. 134–145, 1998.
