IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 15, NO. 9, DECEMBER 1997
Optimization of Quadtree Segmentation and Hybrid Two-Dimensional and Three-Dimensional Motion Estimation in a Rate-Distortion Framework

Dimitrios Tzovaras, Stavros Vachtsevanos, and Michael G. Strintzis, Senior Member, IEEE

Abstract: A rate-distortion framework is used to define a very low bit-rate coding scheme based on quadtree segmentation and optimized selection of motion estimators. This technique achieves maximum reconstructed image quality under the constraint of a target bit rate for the coding of the vector field and segmentation information. First, a complete scheme is proposed for hybrid two-dimensional (2-D) and three-dimensional (3-D) motion estimation and compensation. The quadtree object segmentation is optimized for hybrid motion estimation in the rate-distortion sense. This scheme adapts to the depth of the quadtree and the technique used for motion estimation for each leaf of the tree. A more sophisticated technique, adapted to the requirements of a very low bit-rate coder, is also proposed which also considers the transmission of the prediction error corresponding to the particular choice of the motion estimator. Based on these coding schemes, two versions of a very low bit-rate image sequence coder are developed. Experimental results illustrating the performance of the proposed techniques in very low bit-rate image sequence coding application areas are presented and evaluated.

Index Terms: Hybrid 2-D and 3-D motion estimation, quadtree segmentation, rate-distortion theory, very low bit-rate coding.
Manuscript received September 9, 1996; revised June 27, 1997. This work was supported by the EU CEC Projects PANORAMA (Package for New Autostereoscopic Multiview Systems and Applications, ACTS Project 092) and VIDAS (Video Assisted with Audio Coding and Representation, ACTS Project 057). The authors are with the Department of Electrical and Computer Engineering, Information Processing Laboratory, Aristotle University of Thessaloniki, Thessaloniki 54006, Greece. Publisher Item Identifier S 0733-8716(97)07702-0.

I. INTRODUCTION

The transmission of full-motion video through limited-capacity channels is critically dependent on the ability of the compression schemes to achieve target bit rates while still maintaining acceptable visual quality [1]. In order to achieve this, motion estimation and motion-compensated prediction are frequently used so as to reduce temporal redundancies in image sequences [2]. Block-based motion estimation techniques have been extensively studied and applied for very low bit-rate coding [3]. However, the performance of these techniques at such low bit rates is restricted by well-known limitations such as the block and mosquito artifacts. Object-based techniques for image sequence coding [4]–[8] have been proposed to solve these problems. Affine two-dimensional (2-D) and three-dimensional (3-D) motion estimation models are used for motion compensation in object-based techniques. While much attention has been devoted to the coding of the intraframe and prediction error images, the displacement vector fields or the parameters of the motion models are usually
coded losslessly using DPCM/Huffman coding, resulting in limited compression. The reason for this is that digital video coding systems for many applications have at their disposal rates ranging from 1 to 25 Mbits/s. At such rates, only a minor part of the global rate is devoted to the transmission of the motion information; hence, the bit-rate overhead produced by lossless encoding of the vector fields or motion model parameters is negligible.

In many emerging application areas, however, lossy compression of the vector fields is often highly desirable, and sometimes unavoidable. For example, mobile videophone or multimedia transmission channels are often limited to capacities of 4.8–64 kbits/s. In such cases, it is clearly desirable to reduce as much as possible the bit rate needed to transmit the motion vector fields, provided that this reduction does not produce intolerable distortion in the reconstructed image. It is also desirable to allocate the bit rate devoted to the coding of motion fields adaptively, depending on the complexity of the sequence and also on the overall bit-rate availability when the latter varies with time. Furthermore, it is desirable to select both the image segmentation into objects and the motion estimation method for each object adaptively so as to best represent the motion of each part of the image. For example, in a typical videophone sequence, rough object subdivision combined with a block-based or affine 2-D motion estimation model would suffice for the description of the motion of most parts of the foreground object, while much finer object subdivision, perhaps down to the size of single blocks, and more sophisticated 3-D motion models would be best suited for the description of mouth and eye motion. An elegant framework for the definition of such a strategy is provided by the classical rate-distortion-constrained minimization procedure.
This procedure has recently been used in many coding applications including bit allocation for vector quantization [9], wavelet packet image coding [10], quadtree still image coding [11], and generic video compression [12]. In [13], the rate-distortion function was evaluated for image sequence coding under the assumption of Gaussian intensity distribution. Recently, rate-distortion optimization was also used for the development of efficient motion and disparity estimation strategies [14], [15]. In this scheme, a rate-distortion framework is used to define a displacement vector-field estimation technique for use in video coding. The present paper investigates the use of this methodology for quadtree segmentation and hybrid motion field estimation under the constraint of a target bit rate for the coding of the vector information. Quadtree segmentation is performed using rate-distortion criteria, and is fused with motion estimation
by selecting for each node of the quadtree the optimum motion estimator from a predetermined set of candidate motion estimators. As an extension, the rate-distortion optimization scheme was also used to optimize the allocation of the prediction error corresponding to the motion estimation procedures in the transmitted information.

Two possible codecs are also proposed and evaluated experimentally. In the first (rate-distortion optimized hybrid codec, RDHC), the image sequence is divided into groups of frames (GOF) of ten frames each, and rate-distortion optimization is applied to each GOF separately. The first frame of each GOF is transmitted as a still image (intracoded), and the succeeding frames are coded using motion compensation from the reconstructed version of the previous frame (see Fig. 1). The prediction error is not transmitted. In the second (rate-distortion optimized hybrid codec with error transmission, RDHCE), the optimization algorithm is applied to a much longer sequence of (up to 100) frames (see Fig. 2). In this case, the first frame is coded as a still image, and all other frames are coded using rate-distortion optimization of the quadtree segmentation, the motion estimation, and additionally, the prediction error transmission.

The paper is organized as follows. The hybrid technique used for motion estimation is described in Section II, and a brief review is given of each candidate technique. The rate-distortion optimization that identifies the optimal quadtree and the optimal motion estimator for each leaf of the quadtree is described in Section III. In Section IV, the proposed technique is extended to include the transmission of the prediction error corresponding to the motion-compensated estimates. Finally, experimental results demonstrating the performance of the proposed algorithm for the coding of typical videophone and videoconferencing sequences are given in Section V, and conclusions are drawn in Section VI.

Fig. 1. First version (RDHC) of the proposed coding scheme (no option of error transmission).

Fig. 2. Second version (RDHCE) of the proposed coding scheme (error transmission is an option).

Fig. 3. Rate distortion framework for the selection of the optimal motion estimator.

II. HYBRID MOTION ESTIMATION
Several schemes have been proposed in the literature for the coding of videophone or videoconference image sequences [1]–[3]. Motion estimation and compensation is the basic approach used in all of these schemes. Modeling of the motion information by translation, zoom, and pan, or a 3-D rotation and translation, has been used in block-based, affine, and 3-D motion estimators. Experimental results have shown that affine 2-D motion or 3-D motion models may represent efficiently the displacement occurring in typical scenes; however, most parts of the image may be coded very satisfactorily using only translational motion (e.g., the background). Moreover, the complexity of the affine and 3-D motion estimation algorithms is higher than the complexity of the block-based scheme. Based on the above observations, we propose the use of all of these models for the motion-compensated coding of the objects of a scene, within a rate-distortion framework optimizing both the segmentation and the motion estimation (see Fig. 3). The alternatives are as follows.

1) The motion of the object is insignificant. No motion vector is transmitted, and the previous estimate for this frame is considered sufficient.
2) Translational motion is used to compensate the motion of an object. A two-component motion vector is transmitted.
3) An affine 2-D motion model is used to represent the motion of an object. The six model parameters are transmitted.
4) A 3-D motion model represents best the motion of an object in the scene. The eight motion model parameters are transmitted.
5) The 3-D motion corresponding to the same block in the preceding in time frame is used.

In other words, the optimum image segmentation together with the optimum of the above motion estimator candidates are selected so as to minimize a distortion index subject to a ceiling on the available rate. Classical DFD (displaced frame difference) minimization defines the block-based motion estimator.
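As an illustration of the translational candidate, the following is a minimal sketch of exhaustive block matching by DFD minimization. It is not the authors' implementation: the function names `block_dfd` and `block_match`, the integer-pixel search, and the use of the sum of absolute differences as the DFD measure are assumptions made for this example.

```python
def block_dfd(prev, curr, top, left, size, dy, dx):
    """Sum of absolute displaced frame differences for one block.

    prev, curr: 2-D lists of pixel intensities (previous and current frame).
    (top, left): upper left-hand corner of the block in the current frame.
    (dy, dx): candidate displacement into the previous frame.
    The absolute-difference measure is an assumption of this sketch.
    """
    total = 0
    for r in range(size):
        for c in range(size):
            total += abs(curr[top + r][left + c] - prev[top + r + dy][left + c + dx])
    return total


def block_match(prev, curr, top, left, size, search=4):
    """Exhaustive search over integer displacements in [-search, search].

    Returns (dy, dx, dfd) for the displacement with the smallest DFD.
    Candidates that would read outside the previous frame are skipped.
    """
    height, width = len(prev), len(prev[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            r0, c0 = top + dy, left + dx
            if r0 < 0 or c0 < 0 or r0 + size > height or c0 + size > width:
                continue
            cost = block_dfd(prev, curr, top, left, size, dy, dx)
            if best is None or cost < best[2]:
                best = (dy, dx, cost)
    return best
```

In a rate-distortion setting, the DFD returned for every tested displacement would be kept rather than only the minimum, since the optimization later trades distortion against the cost of coding each vector.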
To define the remaining motion estimator candidates, a brief review is given below of the affine and the 3-D motion estimation methods [16], [17].

A. Affine 2-D Motion Estimation

The general representation of an affine transformation is

(1)

or, equivalently

If is the vector of the motion parameters, the following system of equations must be solved for each object in the scene [16]:

or, equivalently

where

and

where is the number of points with coordinates ( , ) in the working object. The solution to the above overdetermined set of equations may be obtained by use of a least squares method, or alternately by the robust least median of squares technique described in detail in [18].

B. 3-D Motion Estimation

In order to identify the objects in the scene, the original image is segmented into areas having uniform motion characteristics. The 3-D motion of each object in the scene is modeled using a six-parameter model. More specifically, we assume that if are the coordinates of a point at time instant , its coordinates at time instant are given by

(2)

where three translational parameters and three rotational parameters are used to describe the motion of the underlying object. The goal of the 3-D motion estimation procedure is to compute the motion parameter vector for each object in the scene. If are the coordinates of the perspective projection on the image plane at time of the 3-D point , then

(3)

From (2) and (3), the 2-D motion vectors that correspond to the pixels of each object are defined by projection of the 3-D motion on the 2-D image plane, as follows:

(4)

If , and the field of view of the camera, i.e., the visual angle corresponding to the whole image, is not very large, the displacement vector may be approximated by

(5)

Furthermore, by making the assumption that the object surface is a plane, i.e.,

and substituting in (4)

(6)

where

(7)

Equation (6) can be evaluated for of the initially estimated 2-D vectors, forming a system of equations and eight unknowns. With this system is overdetermined, and can be solved using least squares methods or, alternately, by the robust least median of squares motion estimation algorithm described in detail in [18].

III. QUADTREE SEGMENTATION USING RATE-DISTORTION OPTIMIZATION

Let a segmentation of the image plane consisting of objects . For each candidate motion estimator , let be the corresponding set of object motion vectors. The general joint vector field estimation and quadtree segmentation algorithm aims to minimize the distortion of the reconstructed image sequence, under a constraint on the rate for the transmission of the vector field and the corresponding segmentation information. This corresponds to the following constrained optimization problem:

(8)

subject to
where is the total number of objects in the image, is the contribution of the decision to the distortion is the contribution of the same to the function, and total rate or cost of the transmission of the motion vectors and the segmentation information. The methodology in [19] permits the transformation of the above into an unconstrained optimization problem. In fact, as shown in [19] (the proof is also contained in [9]), the solution
of the problem of unconstrained minimization of

(9)

is also a solution of (8) if

(10)

The problem, therefore, reduces to ensuring that (10) has a solution for and determining this solution. This was investigated from a general viewpoint in [9], where it was shown that and are monotonic functions of the Lagrange multiplier , which may be interpreted as a quality index, with values ranging from zero (highest rate, lowest distortion) to infinity (lowest rate, highest distortion). Further investigation in [10] proved that the solution of (10) may be obtained using any fast convex algorithm such as the bisection algorithm [20]. One such algorithm, which gave very good results in both [10] and [11], is also adopted in the present paper (Section III-C). The algorithm for the determination of the optimal segmentation and motion estimator consists of the following steps.

A. Initialization

In the present work, the segmentation is assumed to be completely described by a quadtree (i.e., is the th node at scale ) where is the dimension of the image. Associated with each node of the quadtree is a data structure of the form { , , , split } where is the entropy associated with the current node, is the distortion corresponding to the reconstruction of the block associated with the current quadtree node, is the corresponding Lagrangian cost, and split is a bit indicating if the current node is split or not. In the following, for the sake of simplicity, node will be denoted as node .

The proposed joint motion estimation and quadtree segmentation algorithm begins by gathering the motion estimator set dependent values and for each node , to generate the versus values for each for each node. In the following, the two stages of the rate-distortion optimization algorithm are presented, i.e., 1) finding the optimal solution for a given operating slope and 2) determining the optimal slope . Stage 1) of the algorithm is run for a given slope value and could be considered a subroutine called by stage 2) of the algorithm.

Fig. 4. Convergence of the algorithm for the coding of frame 20 of “Claire” (20) at Rbudget = 0.064 bits/pixel.

Fig. 5. Convergence of the algorithm for the coding of frame 14 of “Tunnel” (14) at Rbudget = 0.24 bits/pixel.
B. Finding the Optimal Solution for a Given Operating Slope

For the current , populate all of the nodes of the tree with their minimum Lagrangian costs or, equivalently, when referring to the th node at quadtree scale , i.e.,

The cost is minimized with respect to the choice of motion estimation for node .

Step 1) Initialize . Let be the maximum tree depth. For , if is the value of that minimizes , initialize

(11)

Step 2) Set . If , go to Step 5).

Step 3) If

then set split and

(12)

Step 4) Go to Step 2).

Step 5) Starting from the root node and using in a linked-list-like fashion the node data-structure element split, selected optimally for all the nodes of , construct the optimal quadtree and its associated optimal motion estimator set choice .
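The pruning stage above can be sketched as follows. This is a hedged illustration rather than the paper's exact procedure: the `Node` class, its field names, and the recursive formulation (equivalent to the bottom-up sweep over scales) are assumptions of this example, and the rate of each candidate is assumed to already include any side information such as the split bit and the estimator index.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    # One (rate, distortion) pair per candidate motion estimator for this block.
    candidates: List[Tuple[float, float]]
    # Four children for an internal node, empty list at the maximum depth.
    children: List["Node"] = field(default_factory=list)
    split: bool = False   # True if the node is split in the optimal tree
    best: int = -1        # index of the estimator with minimum Lagrangian cost
    cost: float = 0.0     # minimum Lagrangian cost of the subtree rooted here


def prune(node: Node, lam: float) -> float:
    """Fill in split, best and cost for the subtree, for a fixed slope lam."""
    # Minimum Lagrangian cost J = D + lam * R over the candidate estimators.
    costs = [d + lam * r for r, d in node.candidates]
    node.best = min(range(len(costs)), key=costs.__getitem__)
    node.split = False
    node.cost = costs[node.best]
    if node.children:
        # Compare coding the block as one object against coding its children.
        split_cost = sum(prune(child, lam) for child in node.children)
        if split_cost < node.cost:
            node.split = True
            node.cost = split_cost
    return node.cost


def leaves(node: Node) -> List[Node]:
    """Follow the split flags from the root to collect the optimal leaves."""
    if not node.split:
        return [node]
    result = []
    for child in node.children:
        result.extend(leaves(child))
    return result
```

A small slope favors splitting (distortion dominates the cost), whereas a large slope keeps coarse blocks and cheap estimators, which is the behavior the slope search in Section III-C relies on.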
Fig. 6. (a) Original frame 0 of “Miss America.” (b) Original frame 5 of “Miss America.” (c) Reconstructed frame 5 of “Miss America” from frame 0 using the proposed algorithm at 41.31 dB PSNR. (d) Corresponding prediction error image.
C. Determining the Optimal Slope

First two values of are found so that

Note that the initial segmentation of the first frame of the image sequence selects the whole image to be a single object, while for the subsequent frames, is the optimal segmentation corresponding to the previous in time frame. Similarly, corresponds to the segmentation resulting by full splitting of the quadtree until the minimum allowed object (block) size is reached. For the coding of a sequence of frames, the values of are chosen to be for the initial frame and for subsequent frames, where is the solution of (10) for the previous frame. The bracketing interval is then successively decreased in size by the following procedure.

Step 1) For each object , compute and and the corresponding and .

Step 2) Set

where is a vanishingly small positive number.

Step 3) Compute the minimizing for .

Step 4) If , then stop, and . Else, if , . Go to Step 2). Else . Go to Step 2).

Note that the distortion corresponding to each motion vector in a specific search area is computed only once, at the first iteration of the algorithm. Thus, the computational load of the algorithm consists of updating the entropy of the vector field and finding the minimum .

D. Computation of the Entropy and Distortion Functions

The specific way the vector field affects the quality of the reconstructed image will determine the distortion index . A number of such distortion measures have been proposed in the literature. In case of quadtree-based segmentation, the simplest and most commonly used is the temporally displaced frame difference
where are the upper left-hand corner coordinates of the block corresponding to node , and are the image at time instant and , respectively, is the projected 2-D motion vector corresponding to the motion estimation method used, and are the dimensions of the working object (block).

Also, the transmission cost will depend on the specific method used for the coding of the vector fields. The motion parameter vectors corresponding to either the affine or the 3-D motion estimation methods are first quantized uniformly, and the corresponding entropy is thus computed with respect to the quantized parameters. The distortion is also computed based on the quantized motion parameter vectors. The entropy of the current node is computed by summing the entropy of the already coded motion or motion parameter vectors with the entropy of the split bit and the entropy of the parameter indicating which motion estimator is chosen for the already coded quadtree objects. In the present work, the use of entropy coding (e.g., Huffman or arithmetic coding) is assumed, with an adaptive probability model, for the computation of the entropy of each component of the motion or motion parameter vectors. Thus, the entropy for the coding of the component computed using a specific motion estimator (i.e., the x or y component of the 2-D motion vector field in the case of block-matching motion estimation, or a parameter of the quantized motion vector field in the case of affine or 3-D motion estimation) is computed as

where is the probability that the vector field minimizing the index satisfies , and and are the minimum and maximum allowed values for the specific component of the motion or motion parameter vector. The probability is computed for each operating point of the algorithm using the information of all previously encoded parameters corresponding to the specific motion estimator as follows:

(13)

where if , otherwise. Note that (13) is equivalent to the following efficient formula for the incremental computation of :

(14)

A more computationally efficient approach, which does not involve incremental computation of the probability density of the vector field or the first-order vector field differences, is to assume a model for this probability density function. Specifically, the assumption of a Gauss–Markov random field to describe motion vector differences [21], [22] could be used so as to accelerate the rate-distortion minimization procedure.

Fig. 7. (a) Original frame 0 of “Claire.” (b) Original frame 2 of “Claire.” (c) Reconstructed frame 2 of “Claire” from frame 0 using the proposed algorithm at 40.2 dB PSNR. (d) Corresponding prediction error image.

Fig. 8. Quadtree segmentation corresponding to frame 5 of “Miss America” when coded with frame 0 as reference using the proposed algorithm.

IV. TRANSMISSION OF PREDICTION ERROR INFORMATION
In many applications, the transmission of motion and segmentation information alone is insufficient for the reconstruction of an image sequence with acceptable quality. In that case, the coder must be allowed to transmit the prediction error corresponding to the motion estimator, especially for blocks containing artifacts in the reconstructed image. The optimization technique described in detail in the previous sections is easily extended so as to accommodate the choice of the transmission of prediction error. It will be assumed that the prediction error is coded using DCT transformation and Huffman entropy coding as is the case in JPEG.

Let again a segmentation of the image plane consist of objects . For each candidate motion estimator , let be the corresponding set of object motion vectors, and be the corresponding set of prediction errors. The general joint vector field estimation and quadtree segmentation algorithm aims to minimize the distortion of the reconstructed image sequence, under a constraint on the rate for the transmission of the vector field, the corresponding segmentation, and prediction error information. This corresponds to the following constrained optimization problem:

(15)

subject to

where is the total number of objects in the image, is the contribution of the decision to the distortion function, and is the contribution of the same to the total rate or cost of the transmission of the motion vectors, the segmentation map, and the prediction error information. As discussed in the previous section, the solution of the problem of unconstrained minimization of

(16)

is also a solution of (15) if

(17)

The problem, therefore, reduces to ensuring that (17) has a solution for and determining this solution. The rate-distortion optimization algorithm presented in the previous section is used again for the computation of the optimal segmentation and the corresponding motion estimator for each object. In the case of error transmission, the distortion function used is

where is the decoded prediction error corresponding to pixel .

Fig. 9. MSE versus bit rate (in bits/pixel) for the coding of the fifth frame of “Miss America.”

Fig. 10. Comparison of the proposed rate-distortion optimized hybrid coder (RDHC) with the block matching method (BM) in terms of PSNR versus frame number for the coding of the image sequence “Miss America” at 12 and 24 kbits/s.

Fig. 11. Comparison of the proposed rate-distortion optimized hybrid coder (RDHC) with the block matching method (BM) in terms of PSNR versus frame number for the coding of the image sequence “Claire” at 10 and 24 kbits/s.

Fig. 12. Performance of the proposed rate-distortion optimized hybrid coder with transmission of the prediction error (RDHCE) for the coding of the first 50 frames of the image sequence “Miss America” at 64, 28.8, and 14.4 kbits/s.

Fig. 13. Performance of the proposed rate-distortion optimized hybrid coder with transmission of the prediction error (RDHCE) for the coding of the first 50 frames of the image sequence “Claire” at 64, 28.8, and 14.4 kbits/s.

Fig. 14. Performance of the proposed rate-distortion optimized hybrid coder with transmission of the prediction error (RDHCE) for the coding of the first 50 frames of the image sequence “Salesman” at 64, 28.8, and 14.4 kbits/s.

Fig. 15. Performance of the proposed rate-distortion optimized hybrid coder with transmission of the prediction error (RDHCE) for the coding of the first 50 frames of the image sequence “Foreman” at 64, 28.8, and 14.4 kbits/s.

Fig. 16. Performance of the proposed rate-distortion optimized hybrid coder with transmission of the prediction error (RDHCE) for the coding of the first 20 frames of the image sequence “Tunnel” at 64, 28.8, and 14.4 kbits/s.

V. APPLICATION TO VERY LOW BIT-RATE IMAGE SEQUENCE CODING
A. Computational Complexity of the Proposed Approach The proposed algorithm consists of the initialization and the optimization stages. During the computationally involved initialization stage, all candidate algorithms for motion estimation are tested, and their performance is stored in memory. Note that the distortion function is computed only in the first iteration of the algorithm, and thus the computational load of the remainder of the algorithm reduces to updating of the entropy of the vector field and finding the minimum . Note also that, following the first frame, the search
for λ is confined to narrower intervals, and hence fewer iterations are needed for the completion of the optimization stage. Also, the choice of the segmentation map corresponding to the previous frame as an initial segmentation for the current frame further reduces the computational load of the proposed algorithm.

Fig. 17. Original frames: (a) 5, (b) 10, (c) 15, (d) 20, (e) 25, and (f) 30 of “Salesman.”

Fig. 18. Reconstructed frames (a) 5, (b) 10, (c) 15, (d) 20, (e) 25, and (f) 30 of “Salesman” coded at 64 kbits/s.

The execution time of the encoding phase of the algorithm on an R4400 INDIGO II Silicon Graphics workstation is approximately 1 min for each frame. Most of this time (about 60%) is devoted to the initialization stage where the distortion functions are computed. The remaining 40% is used to complete the optimization procedure. As elaborated in the sequel, the optimization algorithm was run for many values of and target bit rates, and was seen to converge very rapidly, never requiring more than 15 iterations in any of our experiments. This convergence to the desired bit rate for a videoconference scene (“Claire”) and a nonvideoconference scene (“Tunnel”) is depicted in Figs. 4 and 5, respectively. The computational complexity of the decoding phase of the proposed approach is very low and even a software decoder may be implemented in real time. This makes the proposed scheme an attractive candidate for use in asymmetrical coding applications such as multimedia communication, teleshopping and fixed-location-to-mobile or broadcast video communication.
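The iterative search for the slope that meets the rate budget can be illustrated with plain bisection, which Section III names as one admissible fast convex search. The sketch below is an assumption-laden toy rather than the algorithm adopted in the paper: `solve` and `search_slope` are hypothetical names, every block is given a fixed list of (rate, distortion) candidates, and the quadtree structure is omitted so that only the slope iteration is shown.

```python
def solve(blocks, lam):
    """For a fixed slope, pick per block the candidate minimizing D + lam * R.

    blocks: list of candidate lists, each candidate a (rate, distortion) pair.
    Returns the total (rate, distortion) of the selected candidates.
    """
    rate = dist = 0.0
    for candidates in blocks:
        r, d = min(candidates, key=lambda rd: rd[1] + lam * rd[0])
        rate += r
        dist += d
    return rate, dist


def search_slope(blocks, budget, lam_lo=0.0, lam_hi=1e6, tol=1e-6, max_iter=60):
    """Bisection on the slope: lam_lo gives the highest rate, lam_hi the lowest.

    Returns (lam, rate, distortion) for the smallest slope found whose rate
    does not exceed the budget. The initial bracket and tolerance are
    illustrative values, not taken from the paper.
    """
    rate, dist = solve(blocks, lam_hi)
    if rate > budget:
        raise ValueError("budget cannot be met even at the largest slope")
    best = (lam_hi, rate, dist)
    iterations = 0
    while lam_hi - lam_lo > tol and iterations < max_iter:
        mid = 0.5 * (lam_lo + lam_hi)
        rate, dist = solve(blocks, mid)
        if rate > budget:
            lam_lo = mid          # too many bits: increase the slope
        else:
            lam_hi = mid          # feasible: try a smaller slope
            best = (mid, rate, dist)
        iterations += 1
    return best
```

With the wide default bracket this sketch needs about 40 halvings to reach the tolerance. Starting from a narrow bracket around the slope of the previous frame, as the text describes, reduces the iteration count accordingly.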
Fig. 19. Original frames, (a) 5, (b) 10, (c) 15, (d) 20, (e) 25, and (f) 30 of “Claire.”

Fig. 20. Reconstructed frames, (a) 5, (b) 10, (c) 15, (d) 20, (e) 25, and (f) 30 of “Claire” coded at 64 kbits/s.
B. Experimental Results In order to evaluate the performance of the proposed approach for very low bit-rate coding, the algorithm was applied to the typical QCIF sequences “Claire,” “Miss America,” “Salesman,” “Foreman,” and a QCIF version of the MPEG-4 test sequence “Tunnel.” The frame rate of the sequences was 10 frames/s. Objects were defined using the segmentation procedure described in Section III. The construction of the quadtree representation for the first frame of the video sequence may start with the hypothesis that the whole image may be represented
by only one node (root), and proceed with tests deciding if further splitting is necessary, as described in Section III. However, it was found experimentally that in practice it is preferable to start by testing smaller blocks (typically 32 × 32 pixels each) instead of the entire image, so as to expedite the identification of the optimal segmentation. Similar constraints were imposed on the size of the smallest blocks in order to maintain the segmentation overhead information within acceptable limits. The size of the smallest block in our experiments was chosen to be . As noted in Section III, after the first frame, a good choice for the initial
segmentation mask is the segmentation mask corresponding to the immediately preceding in time frame.

Fig. 21. Segmentation and motion estimator index maps of (a), (b) the tenth frame of “Salesman” and (c), (d) the eighth frame of “Foreman.” White color corresponds to no motion and predicted motion, dark gray to affine 2-D motion, light gray to block matching, and black corresponds to 3-D motion.

Fig. 22. Original frames 5 (a), 10 (b) and 15 (c) of “Tunnel.” Reconstructed frames 5 (d), 10 (e), and 15 (f) of “Tunnel” coded at 64 kbits/s.

As a first experiment, the optimized motion estimation and quadtree segmentation algorithm was applied for the coding of specific frames only of the “Miss America” and “Claire” sequences. More specifically, the algorithm was applied between the zeroth and fifth frames of “Miss America” and the
zeroth and second frames of “Claire.” The original zeroth and fifth frames of “Miss America” and the zeroth and second frames of “Claire” are shown in Fig. 6(a) and (b) and Fig. 7(a) and (b), respectively, while the reconstructed ones and the corresponding prediction errors are illustrated in Fig. 6(c) and (d) and Fig. 7(c) and (d), respectively. The resulting quadtree segmentation is also shown in Fig. 8. Fig. 9 shows the MSE
(mean-square error) versus bit rate for the coding of the fifth frame of “Miss America.” This curve was obtained by running the proposed algorithm for various and computing the MSE after the convergence.

Fig. 23. (a) Segmentation map of the fifth frame of “Tunnel” interleaved with the image. (b) Corresponding motion estimator index map. White color corresponds to no motion and predicted motion, dark gray to affine 2-D motion, light gray to block matching, and black corresponds to 3-D motion.

For very low bit-rate video coding applications, the proposed algorithm may also be applied to groups of frames (GOF), with the bit allocation assigned adaptively to each frame of the sequence in order to optimize the transmission bit stream. More specifically, the total bit rate is given for the coding of the whole GOF, and the rate is allocated to each frame according to the frame difference between the current frame and the preceding in time frame. Thus, if is the target bit rate for the coding of the whole GOF and
is the frame difference between frames and , and are the image dimensions, the bit rate is allocated as follows:
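As a minimal sketch of frame-difference-driven allocation, assume a simple proportional rule: each frame receives a share of the GOF budget proportional to its mean absolute frame difference, normalized by the image dimensions. This rule is an assumption made for illustration, and the exact formula used by the coder may differ. The names `frame_difference` and `allocate_gof_bits` are hypothetical, and frames are represented as plain lists of rows.

```python
def frame_difference(curr, prev):
    """Mean absolute difference between two equally sized frames (lists of rows)."""
    rows, cols = len(curr), len(curr[0])
    total = sum(
        abs(c - p)
        for row_c, row_p in zip(curr, prev)
        for c, p in zip(row_c, row_p)
    )
    # Normalize by the image dimensions (rows x cols).
    return total / (rows * cols)


def allocate_gof_bits(frames, total_bits):
    """Split total_bits over frames 1..K-1 in proportion to their frame
    differences. The proportional rule is an assumption, not the paper's formula."""
    fds = [frame_difference(frames[k], frames[k - 1]) for k in range(1, len(frames))]
    if not fds:
        return []
    fd_sum = sum(fds)
    if fd_sum == 0:
        # Static GOF: fall back to a uniform split.
        return [total_bits / len(fds)] * len(fds)
    return [total_bits * fd / fd_sum for fd in fds]


# Usage: three 2 x 2 frames whose successive differences are 1 and 3.
f0 = [[0, 0], [0, 0]]
f1 = [[1, 1], [1, 1]]
f2 = [[4, 4], [4, 4]]
print(allocate_gof_bits([f0, f1, f2], 24000))  # [6000.0, 18000.0]
```

Under this assumed rule, frames that change more relative to their predecessor receive more bits, while the total over the GOF stays fixed.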
In this way, the coding of the motion and segmentation information is optimized for the whole GOF. For the coding of a group of frames, the first version of the optimized motion estimation algorithm (RDHC) was applied to the first ten frames of “Miss America” and “Claire” using motion compensation. Figs. 10 and 11 compare the proposed algorithm with the simple block matching algorithm with block sizes of 8 × 8 and 16 × 16, used in the existing standards (MPEG, H.261), in terms of PSNR versus frame number, for the coding of the first ten frames of the two sequences, respectively. The simple block matching approach consists of absolute displaced frame difference minimization, by searching exhaustively within a search area of −15 to 15 half-pixels in the previous frame, centered at the position of the examined block. In both coders, it was assumed that each frame was predicted using the reconstructed previous frame, and that the prediction error was not transmitted. The target bit rate was selected to be 24 and 12 kbits for “Miss America” and 24 and 10 kbits for “Claire,” respectively, for the coding of the nine frames following the initial frame. As seen, the proposed algorithm performs considerably better than the standard block matching technique, since both segmentation and motion estimation are optimized for a specific bit rate.

The second version of the proposed coder (RDHCE), which includes error transmission, was also used for the coding of the above QCIF sequences. The technique described earlier, of adapting bit allocation to frame differences, is also used in this coder version; however, the transmission of the error allows efficient communication of much longer groups of frames. Three target bit rates were tested: 14.4, 28.8, and 64 kbits/s. Figs. 12–16 illustrate the performance of the proposed scheme in terms of PSNR versus bit rate for the coding of the first 50 frames of “Miss America,” “Claire,” “Salesman,” and “Foreman” and the first 20 frames of “Tunnel.” Note that the coder performance remains stable, providing a high-quality image without the intervention of a new intracoded image before the end of the whole sequence. Figs. 17–20, respectively, show original and reconstructed frames of the sequences “Salesman” and “Claire” coded at 64 kbits/s.

Regarding the performance of the optimization, Fig. 21(a) and (c) shows the resulting quadtree corresponding to the tenth frame of “Salesman” and the eighth frame of “Foreman.” The complexity of the quadtree for the representation of both sequences is relatively high, since “Salesman” is a sequence with large rigid and flexible motion, while in the “Foreman” sequence both object motion and camera motion exist. Fig. 21(b) and (d) presents the motion estimator index maps corresponding to the segmentation maps of Fig. 21(a) and (c). In these figures, each object is colored according to the motion estimator chosen: white corresponds to no motion and predicted motion, dark gray to affine 2-D motion, light gray to block matching, and black to 3-D motion. Table I shows the average percentage of the motion estimator choices for the coding of all of the QCIF image sequences.

TABLE I AVERAGE PERCENTAGE OF SELECTION OF THE CANDIDATE MOTION ESTIMATION METHODS USED FOR THE CODING OF “MISS AMERICA,” “CLAIRE,” “SALESMAN,” “FOREMAN,” AND “TUNNEL”
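The block matching baseline used for comparison (exhaustive minimization of the absolute displaced frame difference around the position of the examined block) can be sketched as follows. This is a minimal illustration under simplifying assumptions: displacements are restricted to integer pixels rather than half-pixels, candidates that fall outside the previous frame are skipped, frames are plain lists of rows, and the names `sad` and `block_match` are hypothetical.

```python
def sad(curr, prev, top, left, dy, dx, block):
    """Sum of absolute displaced frame differences for one block and one
    candidate displacement (dy, dx)."""
    return sum(
        abs(curr[top + i][left + j] - prev[top + dy + i][left + dx + j])
        for i in range(block)
        for j in range(block)
    )


def block_match(curr, prev, top, left, block=8, search=7):
    """Exhaustive search over integer displacements in [-search, search]
    in both directions. Returns (minimum SAD, dy, dx)."""
    rows, cols = len(prev), len(prev[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates whose reference block leaves the previous frame.
            if not (0 <= top + dy and top + dy + block <= rows
                    and 0 <= left + dx and left + dx + block <= cols):
                continue
            cost = sad(curr, prev, top, left, dy, dx, block)
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best


# Usage: the current frame is the previous frame displaced by (1, 2).
prev = [[i * 16 + j for j in range(16)] for i in range(16)]
curr = [
    [prev[i + 1][j + 2] if i + 1 < 16 and j + 2 < 16 else 0 for j in range(16)]
    for i in range(16)
]
print(block_match(curr, prev, top=4, left=4, block=4, search=3))  # (0, 1, 2)
```

Because this search uses a fixed block size and a single motion model, it has no means of trading segmentation depth or estimator choice against the bit budget, which is the degree of freedom the proposed coder exploits.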
In addition to videoconferencing sequences, the algorithm was also tested on the more complicated MPEG-4 test sequence “Tunnel.” The results are comparable to those obtained using videophone-related sequences. Fig. 22 shows original and reconstructed frames of the sequence “Tunnel” coded at 64 kbits/s. The segmentation map and the motion estimator index map corresponding to this sequence are shown in Fig. 23(a) and (b). The performance in terms of PSNR versus bit rate for the coding of the first 20 frames of “Tunnel” is illustrated in Fig. 16.

VI. CONCLUSION

A rate-distortion framework was used to define a very low bit-rate coding scheme based on quadtree segmentation and optimized selection of motion estimators. This technique achieved maximum reconstructed image quality under the constraint of a target bit rate for the coding of the vector field and segmentation information. Joint optimization of the quadtree object segmentation and of the motion estimation method for each leaf of the tree, i.e., each object, was achieved subject to this target bit rate restriction. For experimental evaluation, the proposed algorithm was combined with an appropriate rate control strategy to optimize the coding of the motion vectors corresponding to all of the frames of a group of frames of an image sequence. Experimental results on the coding of typical videophone sequences have demonstrated the performance of the proposed very low bit-rate video coding scheme.

ACKNOWLEDGMENT

The assistance of COST 211ter is also gratefully acknowledged.
REFERENCES

[1] H. Li, A. Lundmark, and R. Forchheimer, “Image sequence coding at very low bit rates: A review,” IEEE Trans. Image Processing, vol. 3, pp. 589–609, Sept. 1994.
[2] H. G. Musmann, P. Pirsch, and H. J. Grallert, “Advances in picture coding,” Proc. IEEE, vol. 73, pp. 523–548, Apr. 1985.
[3] M. I. Sezan and R. L. Lagendijk, Motion Analysis and Image Sequence Processing. Boston, MA: Kluwer, 1993.
[4] H. G. Musmann, M. Hotter, and J. Ostermann, “Object-oriented analysis-synthesis coding of moving images,” Signal Processing: Image Commun., vol. 1, pp. 117–138, Oct. 1989.
[5] M. Hotter, “Optimization and efficiency of an object-oriented analysis-synthesis coder,” Signal Processing: Image Commun., vol. 4, pp. 181–194, Apr. 1994.
[6] N. Grammalidis, S. Malassiotis, D. Tzovaras, and M. G. Strintzis, “Stereo image sequence coding based on 3-D motion estimation and compensation,” Signal Processing: Image Commun., vol. 7, pp. 129–145, Jan. 1995.
[7] D. Tzovaras, N. Grammalidis, and M. G. Strintzis, “3-D motion/disparity segmentation for object-based image sequence coding,” Opt. Eng. (Special Issue on Visual Commun. Image Processing), vol. 35, pp. 137–145, Jan. 1996.
[8] D. Tzovaras, N. Grammalidis, and M. G. Strintzis, “Object-based coding of stereo image sequences using joint 3-D motion/disparity compensation,” IEEE Trans. Circuits Syst. Video Technol., vol. 7, pp. 312–328, Apr. 1997.
[9] Y. Shoham and A. Gersho, “Efficient bit allocation for an arbitrary set of quantizers,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 36, pp. 1445–1453, Sept. 1988.
[10] K. Ramchandran and M. Vetterli, “Best wavelet packet bases in a rate-distortion sense,” IEEE Trans. Image Processing, vol. 2, pp. 160–175, Apr. 1993.
[11] G. J. Sullivan and R. Baker, “Efficient quadtree coding of images and video,” IEEE Trans. Image Processing, vol. 3, pp. 327–331, May 1994.
[12] E. Reusens, “Joint optimization of representation model and frame segmentation for generic video compression,” Signal Process., vol. 46, pp. 105–117, Sept. 1995.
[13] G. Tziritas, “Rate distortion theory for image and video coding,” in 27th Int. Conf. Digital Signal Processing, Limassol, Cyprus, June 1995.
[14] D. Tzovaras and M. G. Strintzis, “Motion estimation using rate distortion theory for very low bit rate image sequence coding,” in Proc. Int. Conf. Telecommun. ’96, Istanbul, Turkey, Apr. 1996.
[15] D. Tzovaras and M. G. Strintzis, “Motion and disparity estimation using rate distortion theory for very low bit rate and multiview image sequence coding,” in VCIP’97, San Jose, CA, Feb. 1997.
[16] G. Wolberg, Digital Image Warping. Los Alamitos, CA: IEEE Computer Soc. Press, 1988.
[17] G. Adiv, “Determining three-dimensional motion and structure from optical flow generated by several moving objects,” IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-7, pp. 384–401, July 1985.
[18] S. S. Sinha and B. G. Schunck, “A two-stage algorithm for discontinuity-preserving surface reconstruction,” IEEE Trans. Pattern Anal. Machine Intell., vol. 14, Jan. 1992.
[19] H. Everett, “Generalized Lagrange multiplier method for solving problems of optimum allocation of resources,” Oper. Res., vol. 11, pp. 399–417, 1963.
[20] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, “Numerical recipes in C: The art of scientific computing,” Tech. Rep., Cambridge Univ. Press, Cambridge, U.K., 1988.
[21] J. Konrad and E. Dubois, “Bayesian estimation of motion vector fields,” IEEE Trans. Pattern Anal. Machine Intell., vol. 14, pp. 910–927, Sept. 1992.
[22] S. Malassiotis and M. G. Strintzis, “Joint motion/disparity MAP estimation for stereo image sequences,” Proc. Inst. Elect. Eng., Vision, Image, Signal Process., vol. 143, pp. 101–108, Apr. 1996.

Dimitrios Tzovaras was born in Ioannina, Greece, in 1970. He received the Electrical Engineering degree from the Aristotle University of Thessaloniki (AUTH), Thessaloniki, Greece, in 1992. He is currently pursuing the Ph.D. degree in the Electrical Engineering Department, AUTH. Since 1993, he has been a Research Assistant at the Information Processing Laboratory, AUTH. His research interests include image compression, monoscopic and stereoscopic image sequence analysis and coding, and multirate signal processing. Mr. Tzovaras is a member of the Technical Chamber of Greece and SPIE.

Stavros Vachtsevanos was born in Thessaloniki, Greece, in 1973. He received the Electrical Engineering degree from the Aristotle University of Thessaloniki (AUTH) in 1997. Since then, he has been a Research Assistant in the Information Processing Laboratory, AUTH. His research interests include image sequence analysis and coding and multimedia data processing.

Michael G. Strintzis (S’68–M’70–SM’79) received the Diploma in electrical engineering from the National Technical University of Athens, Athens, Greece, in 1967, and the M.A. and Ph.D. degrees in electrical engineering from Princeton University, Princeton, NJ, in 1969 and 1970, respectively. He then joined the Electrical Engineering Department, University of Pittsburgh, Pittsburgh, PA, where he served as an Assistant Professor from 1970 to 1976 and Associate Professor from 1976 to 1980. Since 1980, he has been a Professor of Electrical and Computer Engineering, University of Thessaloniki, Thessaloniki, Greece. His current research interests include image coding, image processing, biomedical signal and image processing, and educational technology. Dr. Strintzis was awarded an IEEE Centennial Medal in 1984.