Video Object Segmentation and Tracking for Content-Based Video Coding

J.Y. Zhou*, E.P. Ong#, C.C. Ko*

* Dept. of Electrical Engineering, National University of Singapore, Singapore 119260
# Institute of Microelectronics, 11 Science Park Rd, Science Park II, Singapore 117685

Email: [email protected]

Abstract

In this paper, a novel technique for the segmentation and tracking of moving objects is proposed. The proposed scheme extracts moving objects based on both motion and spatial information. Initially, an intra-frame segmentation is performed using a morphological segmentation algorithm. The video object areas are then merged according to their motion, based on a new criterion. Once the different moving objects have been extracted, their affine motion parameters are estimated. In the tracking stage, the objects are projected into the next frame according to their affine motion parameters. The projected regions may overlap or leave areas uncovered, creating uncertain areas. By using a two-scan Zig-Zag algorithm, the pixels in the uncertain areas are absorbed into one of the neighboring objects. Experimental results demonstrate successful object segmentation and tracking for both moving and stationary backgrounds.

1. Introduction

The MPEG-4 video coding standard provides many new features for future multimedia applications and enables interactivity with objects in video sequences via content-based representation [1]. With the introduction of video objects, automatic object segmentation becomes a prerequisite for video encoding. However, this is a very difficult task because physical objects are normally not homogeneous with respect to features such as color, luminance, or optical flow [1]. In [4], a coding algorithm that combines a spatial analysis of the sequence with motion compensation has been proposed: the spatial analysis provides a general scheme that can deal with any kind of sequence and scene change, while motion information is used to compensate the spatial information. Motion information is generally the most important cue in motion segmentation, since a physical object is often characterized by coherent motion that differs from the motion of the background. Most video object segmentation techniques separate the video object from the background based on optical flow fields or change detection masks (CDM). In [6, 7], different methods are proposed to extract objects from the background using motion information. However, motion estimation is often not very accurate due to occlusion, aperture problems, and sensitivity to noise. In [3, 8], a morphological motion filter algorithm is proposed to resolve the localization of edge pixels along motion boundaries. However, this algorithm partitions the image into different regions according to gray level only; it is therefore sensitive to noise, and its computational load is very high. In [5], a spatial-temporal segmentation algorithm is proposed, but it has many thresholds that must be set manually.

In this paper, we present a new video object extraction and tracking algorithm, described in Section 2. In the first step, the image is partitioned into different regions using a morphological segmentation algorithm [3]. In the second step, a new motion merge criterion is proposed based on the distance between the optical flow motion field and the dominant motion field. Using this criterion, some regions are merged and several video object regions are obtained. In the third step, these object regions are projected into the next frame and then tracked in subsequent frames. Experimental results on the test sequences are given in Section 3.

2. Segmentation and Tracking Algorithm

In the proposed algorithm, four stages produce the final segmentation:
(1) Intra-frame segmentation is performed using a hierarchical morphological segmentation algorithm. In this stage, the image is partitioned into many flat zones with accurate contour information.
(2) Based on the motion criterion, a binary image is derived from the partitioned regions; several object regions are obtained, and their affine model parameters are re-estimated.
(3) In the tracking stage, these video objects are projected into the next frame according to the affine model parameters obtained in stage (2). Uncovered and overlapped regions are regarded as uncertain regions.
(4) To deal with the uncertain regions, a two-scan Zig-Zag algorithm absorbs their pixels into neighboring objects according to the displaced frame difference.

2.1 Intra-frame Segmentation Stage

A hierarchical morphological segmentation that preserves the contours of the objects in the image has been described in [3]. The method is composed of three steps: image simplification using morphological operators, marker extraction, and region growing using a watershed algorithm with a hierarchical queue. The goal of simplification is to make the signal easier to segment: by using a morphological connectivity operator, the image can be segmented into constant gray-level regions with sharp contours corresponding precisely to those of the original signal. In the next step, the interior of each homogeneous region is "marked" with a label, that is, a constant gray-level value unique to that region. Finally, these labeled markers are fed to the watershed algorithm with a hierarchical queue, and a region-growing procedure is performed. After this intra-frame segmentation, the image is partitioned into a number of regions, each carrying a unique label. The intra-frame segmentation results are shown in Fig.5.3 and Fig.6.3. These partitioned regions are fed into the next step, the region-merging stage, to obtain the binary image.
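To make the three steps concrete, the following is a minimal Python sketch of this simplify/mark/grow pipeline using scikit-image. The specific operators (opening and closing by reconstruction), the marker-size threshold, and the Sobel gradient driving the watershed are illustrative assumptions, not the exact choices of [3].

```python
import numpy as np
from skimage import filters, measure, morphology, segmentation

def intra_frame_segmentation(gray, radius=2, min_marker_area=20):
    # Step 1 -- simplification: opening/closing by reconstruction removes
    # small bright/dark details while preserving the remaining contours.
    selem = morphology.disk(radius)
    opened = morphology.reconstruction(
        morphology.erosion(gray, selem), gray, method='dilation')
    simplified = morphology.reconstruction(
        morphology.dilation(opened, selem), opened, method='erosion')

    # Step 2 -- marker extraction: label the flat zones (connected pixels
    # of equal gray level) and keep those large enough to seed a region.
    # background=-1 ensures no gray level is discarded as "background".
    flat_zones = measure.label(simplified, connectivity=1, background=-1)
    markers = np.zeros_like(flat_zones)
    next_label = 1
    for zone in measure.regionprops(flat_zones):
        if zone.area >= min_marker_area:
            markers[flat_zones == zone.label] = next_label
            next_label += 1

    # Step 3 -- region growing: watershed on the gradient floods outward
    # from the markers until every pixel carries a region label.
    gradient = filters.sobel(simplified.astype(float))
    return segmentation.watershed(gradient, markers)
```

In this sketch the hierarchical queue of [3] is replaced by scikit-image's standard watershed flooding; the output is a label map of the same kind, one integer label per region.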

2.2 Region-Merging Stage

Motion information is one of the most important criteria for merging object regions in video object segmentation. The displaced frame difference (DFD) is one possible region-merging criterion; however, when the background is not stationary, we find that the DFD is not a good criterion for the merging stage. In this case, we propose a new motion merge criterion based on optical flow estimation and dominant motion estimation. The steps in region-merging are as follows:
• Estimate the dense optical flow motion field of the whole frame R between two successive frames.
• Estimate the affine model parameters of the dominant motion of the whole frame R between the two successive frames.
• Compare the parametric motion vector with the optical flow vector in each region Ri from the previous step. If the motion criterion (defined below) of a region Ri exceeds a threshold, the region is classified as a Model Failure (MF) region; otherwise, it is labeled a Model Compliant (MC) region.
• Label the MF regions and put those coherent regions that are large enough into a region pool P (all MC regions are regarded as a single region in P).
• Estimate the model parameters of the various object regions in the pool P and pass these parameters to the tracking step.

With this top-down process, the whole scene is partitioned into several objects, each of which complies with a particular motion model. We first adopt the method of [2] to compute a dense optical flow field; in the following, (u(x, y), v(x, y)) denotes the estimated optical flow vector at pixel (x, y). We adopt the 6-parameter affine motion model, defined as:

û(x, y) = a1·x + a2·y + a3,  v̂(x, y) = a4·x + a5·y + a6        (1)

Generally, there may be several objects inside a given support region. Thus, a robust estimation technique [7] is necessary in order to detect the dominant motion and minimize the averaging effects among the motions present. The robust parametric motion estimation is formulated as an optimization problem with the following objective function:

E(Θ) = Σ_{Xi ∈ R} ρ(DFD_Θ(Xi))        (2)

where R is the current support region, ρ(·) is the M-estimator, Θ is the estimated motion parameter vector, and DFD_Θ is the displaced frame difference defined as:

DFD_Θ(Xi, t) = I(Xi, t) − I(Xi + d_Θ(Xi), t + 1)        (3)

where I is the image function, Xi denotes the pixel coordinates, and d_Θ is the parametric motion vector computed from Θ and Xi.
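As a concrete reading of equations (1) and (3), the sketch below evaluates DFD_Θ over one support region by displacing pixel coordinates with the affine model and sampling the next frame. The bilinear sampling via scipy and the float conversion are implementation choices, not details given in the paper.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def affine_flow(theta, xs, ys):
    # Parametric motion vectors of equation (1):
    # u_hat = a1*x + a2*y + a3,  v_hat = a4*x + a5*y + a6
    a1, a2, a3, a4, a5, a6 = theta
    return a1 * xs + a2 * ys + a3, a4 * xs + a5 * ys + a6

def dfd(theta, frame_t, frame_t1, region_mask):
    # Displaced frame difference of equation (3) over one region:
    # DFD_theta(Xi) = I(Xi, t) - I(Xi + d_theta(Xi), t + 1)
    ys, xs = np.nonzero(region_mask)
    u, v = affine_flow(theta, xs.astype(float), ys.astype(float))
    # Bilinear sampling of frame t+1 at the displaced sub-pixel positions.
    warped = map_coordinates(frame_t1.astype(float), [ys + v, xs + u],
                             order=1, mode='nearest')
    return frame_t[ys, xs].astype(float) - warped
```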

We select Tukey's bi-weight function as in [6] and convert the M-estimation problem into an equivalent weighted least-squares problem, in which case the optimal Θ can be solved via an incremental estimation scheme. Using equation (1), we can compute the global motion vector (û(x, y), v̂(x, y)) for each pixel (x, y). The motion criterion is selected as the Euclidean distance between the estimated parametric motion vector and the optical flow vector:

‖d_Θ − d_0‖² = (û(x, y) − u(x, y))² + (v̂(x, y) − v(x, y))²        (4)

This distance is high for pixels belonging to independently moving objects and low for background pixels. We define the motion criterion of a region Ri as the average of ‖d_Θ − d_0‖ over all pixels (x, y) belonging to Ri. Thresholding this criterion yields a binary image, which is then smoothed using a morphological open-close algorithm. At the end of this stage, a label map O representing several object regions is obtained, where O(x, y) = l, l = 1, 2, …, M, M is the number of objects, and there are M sets of motion model parameters {Θi, i = 1, 2, …, M} corresponding to those objects.
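The paper minimizes (2) on intensity differences via an incremental scheme; as a simplified but runnable stand-in, the sketch below fits the dominant affine model directly to the dense optical-flow field by iteratively reweighted least squares with Tukey's bi-weight, then applies the per-region criterion (4) to produce the Model Failure mask. The iteration count, the Tukey constant, and the threshold value are illustrative assumptions.

```python
import numpy as np

def fit_dominant_affine(u, v, n_iters=10, c=4.685):
    # IRLS with Tukey's bi-weight. Simplification: the residual is the
    # flow-fitting error rather than the intensity DFD used in (2).
    h, w = u.shape
    ys, xs = np.mgrid[0:h, 0:w]
    x, y = xs.ravel().astype(float), ys.ravel().astype(float)
    A = np.stack([x, y, np.ones_like(x)], axis=1)   # rows of model (1)
    bu, bv = u.ravel().astype(float), v.ravel().astype(float)
    wts = np.ones_like(x)
    for _ in range(n_iters):
        sw = np.sqrt(wts)
        tu, *_ = np.linalg.lstsq(A * sw[:, None], bu * sw, rcond=None)
        tv, *_ = np.linalg.lstsq(A * sw[:, None], bv * sw, rcond=None)
        res = np.hypot(A @ tu - bu, A @ tv - bv)
        scale = 1.4826 * np.median(np.abs(res)) + 1e-9  # robust (MAD) scale
        r = res / (c * scale)
        wts = np.where(r < 1.0, (1.0 - r ** 2) ** 2, 0.0)  # Tukey bi-weight
    return np.concatenate([tu, tv])  # (a1, a2, a3, a4, a5, a6)

def model_failure_mask(u, v, theta, labels, thresh=1.0):
    # Per-region criterion (4): mean distance between the dominant affine
    # motion and the optical flow; regions above `thresh` are Model Failure.
    h, w = u.shape
    ys, xs = np.mgrid[0:h, 0:w]
    a1, a2, a3, a4, a5, a6 = theta
    dist = np.hypot(a1 * xs + a2 * ys + a3 - u,
                    a4 * xs + a5 * ys + a6 - v)
    mask = np.zeros((h, w), dtype=bool)
    for lab in np.unique(labels):
        region = labels == lab
        if dist[region].mean() > thresh:
            mask[region] = True
    return mask
```

The zero weights assigned by the bi-weight effectively exclude independently moving pixels from the fit, which is what lets the dominant (background) motion emerge even without a prior segmentation.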

2.3 Projection of Video Objects

Let f(x, y, t) and f(x, y, t+1) denote two successive frames of a video sequence at times t and t+1, respectively. From stage 2.2, f(x, y, t) has been segmented into M moving objects Oi, i = 1, 2, …, M, and the motion parameters of each moving object have been estimated with respect to f(x, y, t+1). The correspondence of Oi in frame f(x, y, t+1) can then be obtained by projecting Oi into f(x, y, t+1) using its affine model parameters Θi. Let (dx, dy) denote the displacement in the motion field within the moving object Oi, estimated from Θi. The projection of Oi into f(x, y, t+1) is then determined by:

Pi = {(x + dx, y + dy) : (x, y) ∈ Oi}        (5)

The union of all the projections Pi may not cover the whole frame t+1, and one projection may overlap another due to occlusions (see Figure 3). Hence, the uncovered and overlapped regions are labeled as the uncertain field, meaning that their pixels do not belong to any Pi.
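A minimal sketch of the projection step (5): each object's pixels are displaced by its affine motion, and pixels covered by no projection (uncovered) or by more than one (overlapped) are marked as the uncertain field. Rounding to the nearest pixel and the dictionary-of-parameters interface are assumptions for illustration.

```python
import numpy as np

def project_objects(label_map, thetas):
    # thetas: {object label i: (a1..a6)}. Returns the projected label map
    # for frame t+1, with 0 marking the uncertain field.
    h, w = label_map.shape
    proj = np.zeros((h, w), dtype=int)
    hits = np.zeros((h, w), dtype=int)   # projections covering each pixel
    for i, (a1, a2, a3, a4, a5, a6) in thetas.items():
        ys, xs = np.nonzero(label_map == i)
        # Equation (5): (x, y) -> (x + dx, y + dy), rounded to the grid.
        px = np.rint(xs + a1 * xs + a2 * ys + a3).astype(int)
        py = np.rint(ys + a4 * xs + a5 * ys + a6).astype(int)
        ok = (px >= 0) & (px < w) & (py >= 0) & (py < h)
        proj[py[ok], px[ok]] = i
        hits[py[ok], px[ok]] += 1
    proj[hits != 1] = 0   # uncovered (0 hits) or overlapped (>1 hits)
    return proj
```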

[Figure 3: Video object projection — objects O1–O4 in frame t are projected to P1–P4 in frame t+1, leaving uncovered and overlapping (uncertain) areas.]

[Figure 4: Refining dependent on scanning direction — pixels 1–5 lie between two neighboring projected objects A and B.]

2.4 Refinement and Tracking

The projection map P obtained in the projection stage is refined in this stage to reach the final segmentation of the video objects. The refinement is achieved by absorbing neighboring pixels into one of the M motion objects using the model parameter sets Θi that have been estimated. To re-group the neighboring pixels into motion objects, we use the DFD as the decision metric in this stage; the classification selects the minimum error among several candidates. The refinement procedure is as follows:
1. Determine whether the current pixel p is on the border of objects in the map P. If not, go to step 5.
2. Determine Cp, the set of object labels whose corresponding objects are present in the neighborhood of p.
3. Calculate DFD_Θi using all the Θi, where i ∈ Cp.
4. Assign p to the object i for which DFD_Θi is smallest.
5. Select the next pixel in the scanning direction as the current pixel and go back to step 1.

It should be noted that the scanning direction has a strong impact on the modification of P in this scheme, since the modification takes place along the scanning direction. This can be illustrated by the simple example in Figure 4. Suppose that the scanning proceeds from left to right and there are two neighboring projected objects A and B. In this example, pixel 3 has the chance of being absorbed into B; if it is, then pixels 4 and 5 are also likely to be absorbed into B. On the contrary, pixels 1 and 2 can never be absorbed into B under this scan direction, although they are also neighboring pixels. Thus, the refining process is applied twice, once in each of two opposite directions: the forward scan proceeds row by row from the upper-left to the bottom-right, and the backward scan is its reverse, from the bottom-right to the upper-left. To reduce the effect of noise in this stage, the root-mean-square of the DFD over a 3×3 window Wp centered on p could be adopted in place of DFD_Θ(p), where:

DFD_Θ(p)² = Σ_{r ∈ Wp} DFD_Θ(r)²
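Below is a minimal sketch of this two-scan refinement. It absorbs uncertain (zero-labeled) pixels that border an object, using the plain per-pixel DFD of (3) rather than the 3×3 windowed variant; the 4-neighborhood and nearest-pixel warping are simplifying assumptions.

```python
import numpy as np

def refine_two_scan(proj, frame_t, frame_t1, thetas):
    # proj: projected label map (0 = uncertain); thetas: {label: (a1..a6)}.
    h, w = proj.shape

    def dfd_at(y, x, theta):
        # Per-pixel DFD of equation (3), warped to the nearest pixel.
        a1, a2, a3, a4, a5, a6 = theta
        px = min(max(int(round(x + a1 * x + a2 * y + a3)), 0), w - 1)
        py = min(max(int(round(y + a4 * x + a5 * y + a6)), 0), h - 1)
        return abs(float(frame_t[y, x]) - float(frame_t1[py, px]))

    def scan(order):
        for y, x in order:
            if proj[y, x] != 0:
                continue  # only uncertain pixels are re-assigned
            # Candidate labels: objects present in the 4-neighborhood.
            cands = {proj[ny, nx]
                     for ny, nx in ((y - 1, x), (y + 1, x),
                                    (y, x - 1), (y, x + 1))
                     if 0 <= ny < h and 0 <= nx < w and proj[ny, nx] != 0}
            if cands:  # absorb into the neighbor with the smallest DFD
                proj[y, x] = min(cands, key=lambda i: dfd_at(y, x, thetas[i]))

    order = [(y, x) for y in range(h) for x in range(w)]
    scan(order)        # forward: upper-left to bottom-right
    scan(order[::-1])  # backward: bottom-right to upper-left
    return proj
```

Because each pass reads labels it may itself have just written, absorption propagates along the scan direction, which is exactly why the two opposite passes are needed.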

3. Experimental Results

Some experimental results using the Flower Garden and the Coastguard video sequences are presented in Fig.5 and Fig.6, respectively. For the Flower Garden sequence, the background is not stationary and the tree undergoes large motion from right to left. Fig.5.1 is the first frame of the sequence, while Fig.5.2 shows the result of applying the morphological open-close operator, which produces flat zones. Fig.5.3 gives the intra-frame segmentation result. Fig.5.4 shows the result of the merge stage: a binary image in which white regions represent the moving objects. Fig.5.5 shows the segmentation results, while Fig.5.6 gives the result of projecting the objects, where the gray-level area is the uncertain region. Fig.5.7 to Fig.5.10 show the tracking results. Parts of the sky are grouped together with the twigs because the smoothness constraint leads to inaccurate optical flow estimation in those areas. Fig.6 shows the results for the Coastguard sequence. Because the boat is much smaller than the background, the dominant motion is well estimated, and Fig.6.6 shows a reasonably satisfactory segmentation result.

4. Conclusions

The proposed video object segmentation technique has been shown to segment moving objects successfully from a moving background. The algorithm first performs intra-frame segmentation to obtain the contours of the objects. The object regions are then merged according to a motion criterion, and the binary object image is obtained. To track objects in subsequent frames, the binary objects are projected into the next frame according to their affine model parameters, and a two-scan-direction algorithm absorbs the pixels in the uncovered and overlapped regions into their neighboring objects. Experiments show that the presented technique can segment and track video objects successfully for different types of sequences.

References

[1] MPEG Video Group, "MPEG-4 video verification model version 10.1", ISO/IEC JTC1/SC29/WG11/W3464, Tokyo Meeting, March 1998.
[2] P. Anandan, "A computational framework and an algorithm for the measurement of visual motion", Int. Journal of Computer Vision, Vol. 2, pp. 283-310, 1989.
[3] P. Salembier and M. Pardas, "Hierarchical Morphological Segmentation for Image Sequence Coding", IEEE Trans. on Image Processing, Vol. 3, No. 5, pp. 629-651, September 1994.
[4] J.Y.A. Wang and E.H. Adelson, "Representing moving images with layers", IEEE Trans. on Image Processing, Vol. 3, No. 5, pp. 625-638, September 1994.
[5] J.G. Choi, S.-W. Lee and S.D. Kim, "Spatio-Temporal Video Segmentation Using a Joint Similarity Measure", IEEE Trans. on Circuits and Systems for Video Technology, Vol. 7, No. 2, pp. 279-286, April 1997.
[6] J.M. Odobez and P. Bouthemy, "Robust multiresolution estimation of parametric motion models", Journal of Visual Comm. and Image Represent., Vol. 6, pp. 348-365, 1995.
[7] G.D. Borshukov et al., "Motion segmentation by multistage affine classification", IEEE Trans. on Image Processing, Vol. 6, pp. 1591-1594, 1997.
[8] N.T. Watsuji et al., "Morphological segmentation with motion based feature extraction", Int. Workshop on Coding Techniques for Very Low Bit-Rate Video, Tokyo, Nov. 8-10, 1998.

[Fig.5: Results of the Flower Garden video sequence — Fig.5.1 first frame; Fig.5.2 open-close simplification; Fig.5.3 intra-frame segmentation; Fig.5.4 binary mask; Fig.5.5 inter-frame segmentation; Fig.5.6 projection; Fig.5.7–5.10 tracking results for frames 12, 17, 22 and 33.]

[Fig.6: Results of the Coastguard video sequence — Fig.6.1 first frame; Fig.6.2 open-close simplification; Fig.6.3 intra-frame segmentation; Fig.6.4 rough mask; Fig.6.5 refined mask; Fig.6.6 inter-frame segmentation; Fig.6.7–6.10 tracking results for frames 10, 20, 31 and 42.]