Efficient Spatio-temporal Segmentation for Extracting Moving Objects in Video Sequences

Renjie Li, Songyu Yu, Member, IEEE, and Xiaokang Yang, Senior Member, IEEE

Abstract — Extraction of moving objects is an important and fundamental research topic for many digital video applications. This paper presents an efficient spatio-temporal segmentation scheme to extract moving objects from video sequences. The temporal segmentation yields a temporal mask that indicates moving and static regions for each frame. To localize moving objects, a block-based motion detection method built on a novel feature measure is proposed to detect changed regions. These changed regions are coarse and need accurate spatial compensation, so an edge-based morphological dilation method is presented to expand them anisotropically. Furthermore, to handle the temporarily stopping problem of moving objects, inertia information about the objects is incorporated into the temporal segmentation. The spatial segmentation, based on the watershed algorithm, provides homogeneous regions with closed and precise boundaries; it exploits global information to improve boundary accuracy. To reduce over-segmentation in the watershed stage, a novel mean filter is proposed to suppress spurious minima. A fusion of the spatial and temporal segmentation results then produces complete moving objects faithfully. In contrast to the reference algorithms, the fusion threshold in our scheme is fixed across different sequences. Experiments on typical sequences demonstrate the validity of the proposed scheme.

Index Terms — watershed, dilation, moving object, spatio-temporal segmentation.
I. INTRODUCTION

Extraction of moving objects is an important and fundamental research topic for many areas of digital video applications such as object-based video coding, video surveillance, and video indexing. Object-based video coding is of particular concern in video storage and communication systems. Among object-based coding standards, MPEG-4 considers 2D Video Object Planes (VOPs) to improve coding efficiency and allow object-to-object interactions [1]. Shapes of VOPs may be arbitrary. However, the algorithms to extract VOPs are not specified in MPEG-4, so the problem of how to segment VOPs remains open.

¹ This work was supported in part by the National High-Tech Development Project of P.R. China under Grant No. 2003AA103810 and the National Natural Science Foundation of China under Grant No. 60332030. The authors are with the Institute of Image Communication and Information Processing, Shanghai Jiao Tong University, Shanghai, P.R. China (e-mail: [email protected]; [email protected]; [email protected]).
Various algorithms have been proposed to solve this problem. Background subtraction uses the intensity difference between each frame and a background image to detect moving objects for surveillance [2]. The algorithm in [3] combines watershed segmentation with optical-flow motion estimation to segment objects in video sequences. The algorithm in [4] utilizes texture correlation and a motion model to track moving regions. In [5], an algorithm combining frame difference and an active contour model is applied to real-time tracking of multiple moving objects. In [6], multi-frame simultaneous segmentation is performed by minimizing an energy functional defined in the spatio-temporal domain.

Among video segmentation algorithms, spatio-temporal segmentation is an important method. It takes advantage of the spatio-temporal relationship and can obtain better segmentation results [7]. In general, temporal segmentation localizes moving regions by thresholding, pixel by pixel, the intensity difference of two consecutive frames. This yields a temporal mask that indicates coarse changed regions for moving objects and non-moving regions (background). However, this method is sensitive to noise, and the coarse object regions may have discontinuous boundaries and holes. On the other hand, spatial segmentation can provide accurate region boundaries, but it cannot by itself detect moving objects. The well-known watershed approach is an important region-based tool [9, 10], but the conventional watershed algorithm suffers from over-segmentation. When fusing spatial and temporal segmentation results, the threshold that determines whether a region obtained in spatial segmentation is foreground or background usually varies across sequences [7][8][11][12][16]; it is also difficult to choose a proper fusion threshold even for a single sequence.

This paper proposes an effective spatio-temporal scheme to extract moving objects in video sequences. To localize moving objects, the difference of two consecutive block images in the temporal direction is examined: each frame is decomposed into uniform, non-overlapping blocks. This decomposition improves robustness to noise but reduces sensitivity to motion; to compensate, a novel feature-based measure for each block is proposed. A further problem arises when moving objects move very slowly or stop temporarily; we call this the temporarily stopping problem. In this situation, the above detection method fails to localize moving objects since the frame difference is not significant. We therefore exploit the inertia information of moving objects to overcome this problem. To ease the fusion, the coarse changed regions need accurate expansion, so an edge-based morphological dilation algorithm is presented to achieve anisotropic expansion, followed by hole filling.
The temporal segmentation extracts changed regions that contain the complete moving objects. The spatial segmentation is performed by the watershed algorithm on weighted gradient images. Here, global information is utilized to weight the strengths of the edge points of the desired objects, further improving the accuracy of the spatial segmentation. To reduce over-segmentation, some undesired minima must be suppressed, and a novel mean filter is proposed to smooth the weighted gradient images. A fusion module with a fixed threshold is used to extract the desired moving objects.

The remainder of this paper is organized as follows. Section II presents the proposed scheme: Subsection II-A discusses block-based motion detection, Subsection II-B introduces the edge-based morphological dilation algorithm, Subsection II-C describes the spatial segmentation, and Subsection II-D describes the fusion of spatial and temporal segmentation results and the post-processing. Section III gives experimental results evaluating the proposed scheme, and Section IV concludes the paper.

II. THE PROPOSED SCHEME

The proposed scheme is depicted in Fig. 1. The input is a video sequence and the output is the extracted moving objects. A temporal mask is a binary image that indicates changed regions and static regions. A spatial mask is a label image that indicates all spatial regions with closed and precise boundaries.

Fig. 1. Overview of the proposed spatio-temporal segmentation scheme. Block-based motion detection on frames (t-1) and t yields a coarse mask, refined by edge-based dilation into the temporal mask; gradient-image generation and watershed segmentation on the weighted gradient image yield the spatial mask; fusion of the two masks and post-processing produce the extracted moving objects.

Fig. 2. (a) and (b) show the original frames 3 and 4 of the mother-daughter sequence, respectively. (c) illustrates the decomposition of an image.

A. Block-based Motion Detection

First, each frame is decomposed into M × N uniform and non-overlapping blocks of size B × B, as shown in Fig. 2(c); Fig. 2(a) and (b) show the original images at frames 3 and 4 of the mother-daughter sequence. Let I(x, y, t) be the original frame at time t in a video sequence, where (x, y) denotes a pixel position, and let I_B(m, n, t) be the corresponding decomposed image, where (m, n) denotes a block position. A novel feature-based measure for block (m, n) is defined as

$$I_B(m, n, t) = \mathrm{mean}(m, n, t) + \frac{\alpha}{B^2}\left( N_1(m, n, t) - N_{-1}(m, n, t) \right) \qquad (1)$$

where the positive weighting parameter α is smaller than 1, mean(m, n, t) is the mean gray level of all pixels within block (m, n) at frame t, and N_1(m, n, t) and N_{-1}(m, n, t) are the numbers of pixels within block (m, n) with gray levels greater than and smaller than mean(m, n, t), respectively. The second term in (1) is regarded as the feature. The block decomposition improves robustness to noise, while the feature keeps each block sensitive to motion.
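For concreteness, the following Python sketch (ours, not the authors' implementation) computes the feature image of (1) with NumPy; the function name is an assumption, and the frame dimensions are assumed divisible by B. The defaults B = 2 and alpha = 0.4 follow the experimental settings of Section III.

import numpy as np

def block_feature_image(frame, B=2, alpha=0.4):
    # Sketch of (1): per-block mean plus a brightness-balance feature.
    # frame: grayscale array (H, W); H and W are assumed divisible by B.
    H, W = frame.shape
    blocks = frame.reshape(H // B, B, W // B, B).swapaxes(1, 2)  # (M, N, B, B)
    mean = blocks.mean(axis=(2, 3), keepdims=True)               # mean(m, n, t)
    n1 = (blocks > mean).sum(axis=(2, 3))                        # N_1: above the mean
    n_1 = (blocks < mean).sum(axis=(2, 3))                       # N_-1: below the mean
    return mean[..., 0, 0] + (alpha / B**2) * (n1 - n_1)         # I_B(m, n, t)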
Then, the difference image FD(m, n, t) of two consecutive block images is obtained by

$$FD(m, n, t) = \left| I_B(m, n, t) - I_B(m, n, t-1) \right| \qquad (2)$$

To reduce the computational complexity, FD(m, n, t), which has real values, is quantized to 256 levels. The quantized image FD_q(m, n, t) is obtained by

$$FD_q(m, n, t) = \mathrm{round}\left( FD(m, n, t) / L_q(t) \right) \qquad (3)$$

where round(x) rounds any real number x to the nearest integer and L_q(t) is the quantization step at frame t, given by

$$L_q(t) = FD_{\max}(t) / 256 \qquad (4)$$

where FD_max(t) is the maximum value in the image FD(m, n, t). To reduce noise, the image FD_q(m, n, t) is filtered by a 3 × 3 median filter [13]. Let FD'_q(m, n, t) be the filtered image. Then the image FD'_q(m, n, t) is thresholded by

$$FD_b(m, n, t) = \begin{cases} 1 & \text{if } FD'_q(m, n, t) \ge T(t) \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$
where the threshold T(t) at frame t is determined by
$$T(t) = \mathrm{mean}(t) + \beta \left| g_1(t) - g_0(t) \right| \qquad (6)$$

where β is a positive weighting parameter, g_0(t) and g_1(t) are the positions of the largest peaks in the histogram H_0(t) of the image FD_q(m, n, t) and the histogram H_1(t) of the filtered image FD'_q(m, n, t), respectively, and mean(t) is the mean of the measures of all blocks in the image FD'_q(m, n, t). The second term in (6) is the distance between g_0(t) and g_1(t), as shown in Fig. 3; this distance is mainly caused by noise. Here, the parameter β is usually set to 1. If FD_b(m, n, t) = 1, each pixel within the block is set to 1; otherwise to 0. This yields the binary mask FD_p(x, y, t) in the pixel domain.
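Pulling (2)–(6) together, a minimal NumPy/SciPy sketch of the block-level motion test might look as follows; the function name, the guard against an all-zero difference image, and the use of SciPy's median filter are our assumptions.

import numpy as np
from scipy.ndimage import median_filter

def temporal_block_mask(IB_prev, IB_curr, beta=1.0):
    # IB_prev, IB_curr: feature images of two consecutive frames, as in (1).
    fd = np.abs(IB_curr - IB_prev)                    # (2): block difference
    lq = max(fd.max(), 1e-6) / 256.0                  # (4): quantization step
    fd_q = np.round(fd / lq).astype(np.int64)         # (3): 256 levels
    fd_qf = median_filter(fd_q, size=3)               # 3x3 median filtering
    g0 = np.bincount(fd_q.ravel()).argmax()           # largest peak of H_0(t)
    g1 = np.bincount(fd_qf.ravel()).argmax()          # largest peak of H_1(t)
    T = fd_qf.mean() + beta * abs(int(g1) - int(g0))  # (6): adaptive threshold
    return fd_qf >= T                                 # (5): FD_b(m, n, t)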
Fig. 3. Illustration of the effect of filtering at t=4 in the mother-daughter sequence: the histograms H_0(4) and H_1(4) with their largest-peak positions g_0(4) and g_1(4).

Fig. 4. Motion detection. (a) Extracted moving objects at t=3. (b) Changed regions without compensation at t=4. (c) Compensated temporal mask (T_M = 0.6).
However, when a desired object moves very slowly or stops temporarily, some parts of the object cannot be detected, as shown in Fig. 4(b); indeed, most of the object may be cut off. In this situation, it is difficult to detect the desired objects using only two consecutive frames. To solve this problem, the inertia information of moving objects is used. Let M(x, y, t-1) be the mask of the moving objects extracted at frame (t-1). Let A_M(t-1) be the sum of the areas of all object regions in the mask M(x, y, t-1) and A_p(t) the area of the largest changed region in the mask FD_p(x, y, t). The inertia variable P_M(t) is defined as

$$P_M(t) = A_p(t) / \left( A_M(t-1) + 1 \right) \qquad (7)$$
Given a threshold T_M, if P_M(t) is smaller than T_M, the moving objects are almost stationary between the two consecutive frames, and the mask FD_p(x, y, t) needs to be compensated in order to localize the object regions. Here, an inertia-based compensation method is proposed. The compensated mask FD_c(x, y, t) is obtained by

$$FD_c(x, y, t) = \begin{cases} FD_p(x, y, t) \ \mathrm{or} \ M(x, y, t-1) & \text{if } P_M(t) < T_M \\ FD_p(x, y, t) & \text{otherwise} \end{cases} \qquad (8)$$
When the difference of two successive frames is too small for motion detection, this compensation makes the temporal mask at frame t similar to the mask of the moving objects extracted at frame t-1. Note that when the second frame is processed, i.e., t = 2, the area A_M(1) is set to 0. The threshold T_M is usually smaller than 0.8; to avoid unbounded expansion, T_M should be kept small. Fig. 4 illustrates the proposed motion detection approach. The extracted moving objects at frame 3 of the mother-daughter sequence are shown in Fig. 4(a), and the image FD_p(x, y, 4) is given in Fig. 4(b). Since the movements of some parts of the desired objects are very slight, these parts are cut off. The variable P_M(4) is smaller than the threshold T_M, so the mask FD_p(x, y, 4) is compensated; the compensated mask FD_c(x, y, 4) is given in Fig. 4(c).
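A sketch of the inertia test (7) and the compensation rule (8), again ours rather than the paper's code; connected components are taken from SciPy and region areas are computed with a label histogram.

import numpy as np
from scipy.ndimage import label

def compensate_mask(FD_p, M_prev, T_M=0.6):
    # FD_p: binary changed-region mask at frame t (pixel domain).
    # M_prev: mask M(x, y, t-1) of the objects extracted at frame t-1.
    labels, n = label(FD_p)
    areas = np.bincount(labels.ravel())[1:]          # areas of changed regions
    A_p = areas.max() if n > 0 else 0                # largest changed region
    A_M = int(M_prev.sum())                          # total object area at t-1
    P_M = A_p / (A_M + 1.0)                          # (7): inertia variable
    if P_M < T_M:                                    # objects almost stationary
        return np.logical_or(FD_p, M_prev)           # (8): compensated mask
    return FD_p.astype(bool)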
B. Edge-based Morphological Dilation

The binary mask FD_c(x, y, t) of moving objects may have discontinuous boundaries and holes. To make the changed regions contain the desired moving objects completely, the morphological dilation operation can be used to expand them [14]. Given a binary image I and a structuring element K, both subsets of the 2-space E², the conventional dilation of I by K is defined as

$$I \oplus K = \{ c \mid c = a + b \ \text{for some} \ a \in I \ \text{and} \ b \in K \} \qquad (9)$$

Fig. 5 gives an example of the conventional dilation.
Fig. 5. Illustration of the conventional dilation. (a) The image I. (b) The structuring element K, centered at the position marked 1. (c) Result of the dilation.
As seen from Fig. 5, the image I is expanded uniformly. If the dilation operation were applied directly to the mask FD_c(x, y, t), some undesired regions could be merged into the result, causing errors in the subsequent steps. To avoid this problem, an edge-based dilation approach is proposed to achieve anisotropic expansion. The edge-based dilation of the image I by the structuring element K is defined as

$$D_E = I \cup D \qquad (10)$$

where D is determined by

$$D = \{ c \mid c = a + b \ \text{for some} \ a \in I \ \text{and} \ b \in K \ \text{and} \ Edge(c) \ne 0 \} \qquad (11)$$

where Edge(·) denotes a corresponding edge image. Fig. 6 illustrates the edge-based dilation: Fig. 6(a) gives the edge image, and Fig. 6(b) shows the result of the edge-based dilation of the image I of Fig. 5(a) by the structuring element K of Fig. 5(b).
Fig. 6. Illustration of the edge-based dilation. (a) The edge image. (b) Result of the edge-based dilation.
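In set terms, (10)–(11) restrict the ordinary dilation to points that also lie on edges. A short NumPy/SciPy sketch (our formulation, not the authors' code):

import numpy as np
from scipy.ndimage import binary_dilation

def edge_based_dilation(mask, edge, structure=np.ones((3, 3), dtype=bool)):
    # (11): candidate points reachable by dilation that also lie on an edge.
    D = binary_dilation(mask, structure=structure) & (edge != 0)
    return mask | D                                   # (10): D_E = I ∪ D

Applying the function repeatedly would grow the mask further along the edges; the single-pass form shown matches the definition above.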
The initial edge image G(x, y, t) can be obtained by applying any gradient operator, e.g., the Roberts or Sobel operator, to the current frame I(x, y, t); here, the Sobel operator is used [15]. Before applying the gradient operator, the original image I(x, y, t) is filtered by the 3 × 3 median filter to reduce noise [13]. In general, the gradient magnitudes at object boundaries are greater than those elsewhere, e.g., in the interior of an object, and the image G(x, y, t) contains a large number of points with small magnitudes that need to be eliminated. The threshold is determined from the histogram of G(x, y, t): the value k at the position of the largest histogram peak is selected as the threshold, and pixels with values smaller than or equal to k are set to 0 in G(x, y, t). Let G_T(x, y, t) be the thresholded image; Fig. 7(a) shows the image G_T(x, y, 4) of the mother-daughter sequence. Now let FD_c(t) be the set of all points with value 1 in the binary image FD_c(x, y, t). With I replaced by FD_c(t) in (10) and G_T(x, y, t) serving as the edge image Edge(·), the result FD_d(x, y, t) of the edge-based dilation of FD_c(x, y, t) is obtained according to (10). Fig. 7(b) gives the dilation result for FD_c(x, y, 4) of the mother-daughter sequence, using the structuring element K shown in Fig. 7(d).
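The gradient generation and histogram-peak thresholding just described can be sketched as follows (our code; SciPy's Sobel and median filters stand in for the operators named in the text):

import numpy as np
from scipy.ndimage import median_filter, sobel

def thresholded_gradient(frame):
    f = median_filter(frame.astype(float), size=3)    # 3x3 median pre-filter
    g = np.hypot(sobel(f, axis=0), sobel(f, axis=1))  # Sobel gradient magnitude
    gi = np.round(g).astype(np.int64)
    k = np.bincount(gi.ravel()).argmax()              # position of largest peak
    g[gi <= k] = 0                                    # suppress weak responses
    return g                                          # G_T(x, y, t)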
Fig. 7. (a) The thresholded gradient image at t=4 in the mother-daughter sequence. (b) Result of the edge-based dilation. (c) The temporal mask after hole filling. (d) The structuring element.
Some holes remain in the object regions after the edge-based dilation, as shown in Fig. 6(b) and Fig. 7(b). To improve the effectiveness of the fusion, these holes need to be filled, and a simple and effective method is presented here. The binary complement image FD^b(x, y, t) is obtained by

$$FD^b(x, y, t) = \begin{cases} 1 & \text{if } FD_d(x, y, t) = 0 \\ 0 & \text{otherwise} \end{cases} \qquad (12)$$

The area (number of pixels) of each 4-connected region with pixel values equal to 1 is computed in the image FD^b(x, y, t). The region with the maximum area corresponds to the stationary background, where FD_d(x, y, t) = 0; the remaining, smaller regions are holes inside objects and are eliminated, i.e., their values are set to 0. Let FD^b_f(x, y, t) be the resulting image. The final binary temporal mask MASK(x, y, t) is obtained by

$$MASK(x, y, t) = \begin{cases} 1 & \text{if } FD^b_f(x, y, t) = 0 \\ 0 & \text{otherwise} \end{cases} \qquad (13)$$

Fig. 7(c) shows the final temporal mask.
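A sketch of the hole filling in (12)–(13), assuming SciPy connected-component labeling with explicit 4-connectivity:

import numpy as np
from scipy.ndimage import label

def fill_holes(FD_d):
    comp = (FD_d == 0)                                  # (12): complement FD^b
    four = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])  # 4-connected structure
    labels, n = label(comp, structure=four)
    if n == 0:
        return FD_d.astype(bool)                        # no background at all
    areas = np.bincount(labels.ravel())[1:]             # complement region areas
    background = labels == (areas.argmax() + 1)         # keep largest region only
    return ~background                                  # (13): MASK(x, y, t)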
C. Watershed Segmentation on Weighted Gradient Image

The well-known watershed approach based on immersion simulation is an important region-based tool for spatial segmentation and has been widely used. Here, watershed segmentation [9] on weighted gradient images provides spatial regions with precise and closed boundaries. The algorithm consists of two passes: generation of the weighted gradient image, and watershed segmentation.

The purpose of weighting the gradient image is to reinforce the edge strengths of the desired moving objects by using global information: an edge point on the boundary of a desired moving object is usually connected to more edge points than a false edge point is. The strength of each candidate edge point in the image G_T(x, y, t) is weighted by

$$G'_T(x, y, t) = \left( A_g((x, y), t) / \left( A_g(t) + 1 \right) \right) G_T(x, y, t) \qquad (14)$$

where A_g((x, y), t) is the area of the connected edge region to which the point (x, y) belongs and A_g(t) is the mean area of all connected edge regions in the image G_T(x, y, t). The resulting image G'_T(x, y, t), which has real values, is quantized to 256 levels using the method of Subsection II-A. Let G_q(x, y, t) be the quantized image.

A novel mean filter is proposed to suppress some minima. Consider a 3 × 3 block centered at the point (x, y). Let g(x, y, t) be the mean of the values of all pixels within the block, g_0(x, y, t) the mean of the pixels with values smaller than or equal to g(x, y, t), and g_1(x, y, t) the mean of the pixels with values greater than or equal to g(x, y, t) within the block. If the value of the pixel (x, y) is greater than or equal to g(x, y, t), it is set to g_1(x, y, t); otherwise, to g_0(x, y, t).

After this filtering, the watershed algorithm [9] is used to segment the obtained image. The immersion approach is divided into two steps: an initial sorting step and a flooding step. In the topographic representation of a given image, the numerical value of each pixel stands for the elevation at that point. In the sorting step, the pixels are sorted in increasing order of their elevations. Once the pixels have been sorted, the progressive flooding step starts at the minimum elevation h_min. Each catchment basin is labeled with a unique symbol, and the labeling process continues recursively from h_min to the maximum elevation h_max. Suppose that the flooding has been done up to elevation h, so that every catchment basin whose corresponding minimum is smaller than or equal to h has been labeled. The pixels at altitude h + 1 are then labeled as follows. For each pixel p at altitude h + 1: if none of its neighbors (8-neighborhood) has been labeled, a new catchment basin starts at p and is labeled with a new symbol; if exactly one neighbor is labeled, p receives the label of that neighbor; and if the labeled neighbors carry more than one label, a watershed is built at p, which is labeled with the watershed symbol. After all elevations are processed, regions with closed and accurate boundaries are obtained. Fig. 8(a) gives the spatial segmentation result for the fourth frame, shown in Fig. 2(b).
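Subsection II-C can be sketched end to end as below (our code, not the authors'). The weighting follows (14), the 3 × 3 minima-suppressing mean filter follows the description above, and skimage's watershed (which, given no markers, floods from local minima) stands in for the Vincent-Soille immersion algorithm [9].

import numpy as np
from scipy.ndimage import label
from skimage.segmentation import watershed

def spatial_segmentation(GT):
    # (14): weight each edge point by the area of its connected edge region.
    labels, n = label(GT > 0, structure=np.ones((3, 3)))
    areas = np.bincount(labels.ravel()).astype(float)
    A_mean = areas[1:].mean() if n > 0 else 1.0
    Gw = (areas[labels] / (A_mean + 1.0)) * GT
    Gq = np.round(255.0 * Gw / max(Gw.max(), 1e-6)).astype(np.int32)  # 256 levels

    # Proposed mean filter: push each pixel to the mean of its own side of
    # the local 3x3 mean, merging shallow minima (borders left unfiltered).
    out = Gq.astype(float)
    H, W = Gq.shape
    for x in range(1, H - 1):
        for y in range(1, W - 1):
            blk = Gq[x - 1:x + 2, y - 1:y + 2].astype(float)
            g = blk.mean()
            if Gq[x, y] >= g:
                out[x, y] = blk[blk >= g].mean()   # g_1(x, y, t)
            else:
                out[x, y] = blk[blk <= g].mean()   # g_0(x, y, t)

    return watershed(out)   # label image with closed, precise boundaries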
Fig. 8. (a) Result of the spatial segmentation of the fourth frame in the mother-daughter sequence. (b) Final extracted object.
D. Fusion of Spatial and Temporal Segmentation

Let F(t) be the labeled result of the spatial segmentation; each region in this image is coherent in intensity and motion. Moving objects are discriminated from the background by a fusion module that combines the spatial segmentation result with the temporal mask [16]. Let R_s(i, t) be a spatial region in F(t) and A_s(i, t) its area. The spatial segmentation result F(t) is superposed on the temporal mask MASK(x, y, t), and A_c(i, t) denotes the area of the intersection of the spatial region R_s(i, t) with the temporal mask MASK(x, y, t). The coverage ratio P(i, t) is defined as

$$P(i, t) = A_c(i, t) / A_s(i, t) \qquad (15)$$

If the value of P(i, t) is greater than a given fusion threshold T_s, the region R_s(i, t) is considered foreground; otherwise, background. In general, such a threshold T_s varies across sequences, and it is also difficult to choose a proper threshold even for one sequence. In our scheme, however, the threshold T_s can be fixed across sequences: T_s is always set to 0.99, very close to 1, and does not need to be adjusted. The main reason is that the obtained temporal mask contains the complete desired objects. Some small blobs may remain in the resulting mask; regions with areas smaller than a threshold T_A are considered noise and are eliminated by a size filter. If there is only one moving object in a video clip, the desired moving object corresponds to the largest motion region in the mask. Thus, the final mask M(x, y, t) of moving objects is obtained. Fig. 8(b) shows the extracted moving object for frame 4 of the mother-daughter sequence.
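A sketch of the fusion rule (15) with the fixed threshold T_s = 0.99 (our code; the post-processing size filter is only indicated):

import numpy as np

def fuse(F, MASK, Ts=0.99):
    # F: watershed label image; MASK: binary temporal mask MASK(x, y, t).
    M = np.zeros(F.shape, dtype=bool)
    for i in np.unique(F):
        region = (F == i)
        A_s = region.sum()                 # A_s(i, t): region area
        A_c = (region & MASK).sum()        # A_c(i, t): overlap with the mask
        if A_c / A_s > Ts:                 # (15): coverage ratio test
            M |= region                    # region is foreground
    # Post-processing (size filter with threshold T_A, largest-region
    # selection for single-object clips) would follow here.
    return M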
III. EXPERIMENTAL RESULTS

To verify the effectiveness of the proposed scheme, we compare it with the reference algorithms [7, 8]. In [7] (called the Kim algorithm), the temporal masks are obtained using pairs of frames with a one-frame distance instead of a three-frame distance. We provide initial object masks for the algorithm of [8] (called the Xia algorithm). Experiments were carried out on typical test sequences such as mother-daughter, gardener, and akiyo, all in QCIF format. For our scheme, the thresholds T_M and T_A are empirically set to 0.6 and 50, respectively; the parameters α and β are set to 0.4 and 1, respectively; the block size B is set to 2; and the Sobel gradient operator is used.

Fig. 9 shows selected segmentation results for the mother-daughter sequence. In this case, the fusion thresholds of the Kim and Xia algorithms are set to 0.6 and 0.5, respectively. The objects extracted by the proposed scheme become complete over time: from frame 23 on, the desired objects (the mother and the daughter) are extracted completely. The proposed scheme deals successfully with the temporary stopping of some parts of the moving objects. Furthermore, the parts of the temporal masks exceeding the real boundaries of the desired moving objects are cut off by the fusion. The objects extracted by the Kim algorithm have discontinuous boundaries and some larger holes. We provide the accurate initial mask of moving objects for the Xia algorithm.
Fig. 9. Video segmentation results for the mother-daughter sequence (frames 47, 51, 55, 59, and 63). The leftmost (first) column gives the original video frames; the second, third, and rightmost columns show the results of the proposed, Kim, and Xia algorithms, respectively.
For an objective comparison, a spatial accuracy measure [17] is computed for each frame:

$$S_{error} = \left( \sum_{(x, y)} M_{ext}(x, y) \oplus M_{ref}(x, y) \right) \Big/ \left( \sum_{(x, y)} M_{ref}(x, y) \right) \qquad (16)$$

where M_ext and M_ref are the binary masks of the extracted objects and of the real objects obtained by manual segmentation, respectively, and ⊕ denotes the binary XOR operation. Fig. 11(a) compares the spatial accuracy measures of the three algorithms on the mother-daughter sequence; as can be seen, the measure of the proposed scheme is smaller.

The gardener sequence is a typical sequence with a complicated background; furthermore, the desired truck sometimes moves very slowly or stops temporarily. Fig. 10 gives the corresponding segmentation results. In this case, the fusion threshold of [7] is set to 0.7 to better extract the objects. The truck extracted by the Kim algorithm contains some undesired regions, and some parts of the truck are cut off; in the results of the Xia algorithm, some parts of the desired truck are also cut off. The proposed scheme extracts the moving object more completely. The experiment on the gardener sequence shows that the proposed scheme solves the temporarily stopping problem successfully, and its results look subjectively better than those of the reference algorithms. Still, some parts of the boundaries of the extracted objects deviate from the corresponding real boundaries. This misdetection is mainly caused by the camouflage problem, i.e., the intensities of some parts of the moving object are very similar to those of the background.
Fig. 11(b) shows the corresponding comparison of the spatial accuracy measures of the three algorithms for the gardener sequence. On some frames, the spatial accuracy measure of the proposed scheme is greater, because on these frames some false regions are extracted as parts of the truck by the proposed scheme, which makes the measure somewhat larger. To objectively compare the completeness of the extracted objects, a completeness measure S_compl is defined as

$$S_{compl} = \left( \sum_{(x, y)} M_{ext}(x, y) \cdot M_{ref}(x, y) \right) \Big/ \left( \sum_{(x, y)} M_{ref}(x, y) \right) \qquad (17)$$

where · denotes the AND operation. Fig. 12 compares the completeness measures of the three algorithms for the gardener sequence. As Fig. 12 shows, although the spatial accuracy measure of the proposed scheme is somewhat greater than that of the Xia algorithm on some frames, the objects extracted by the proposed scheme are more complete than those extracted by the Xia algorithm.
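Both evaluation measures reduce to a few NumPy reductions; a minimal sketch (ours), assuming boolean masks:

import numpy as np

def spatial_error(M_ext, M_ref):
    # (16): XOR mismatch area normalized by the reference object area.
    return np.logical_xor(M_ext, M_ref).sum() / M_ref.sum()

def completeness(M_ext, M_ref):
    # (17): fraction of the reference object covered by the extraction.
    return np.logical_and(M_ext, M_ref).sum() / M_ref.sum()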
Fig. 10. Video segmentation results for the gardener sequence (frames 39, 42, 45, 48, and 51). The leftmost (first) column gives the original video frames; the second, third, and rightmost columns show the results of the proposed, Kim, and Xia algorithms, respectively.

Fig. 11. Comparison of the spatial accuracy measures of the three algorithms. (a) Mother-daughter sequence. (b) Gardener sequence.
Fig. 12. Comparison of the completeness measures of the three algorithms for the gardener sequence.
The results of another experiment are shown in Fig. 13. Together, these experiments show that the proposed scheme works well for extracting moving objects from long video sequences.
Fig. 13. Video segmentation results for the akiyo sequence (frames 49, 51, 53, 55, 57, and 59).
IV. CONCLUSION

This paper has presented an efficient spatio-temporal scheme to extract moving objects from video sequences. The proposed scheme deals with the temporarily stopping problem successfully, improves the accuracy and completeness of the segmentation, and uses a fusion threshold that is fixed across different sequences. The issue of how to utilize global motion estimation when computing frame differences for the localization of moving objects remains open; future work will therefore focus on extracting objects from video sequences with moving backgrounds by applying global motion estimation.
REFERENCES
[1] ISO/IEC JTC1/SC29/WG11, "Overview of the MPEG-4 standard," MPEG98/N2323, Dublin, Ireland, Jul. 1998.
[2] A. Elgammal, R. Duraiswami, D. Harwood, and L. S. Davis, "Background and foreground modeling using nonparametric kernel density estimation for visual surveillance," Proc. IEEE, vol. 90, pp. 1151-1163, 2002.
[3] Hsu, W. L. Shih, and C. W. Jeff, "Video segmentation based on watershed algorithm and optical flow motion estimation," Proc. SPIE Int. Soc. Opt. Eng., USA, pp. 56-66, 1999.
[4] B. Bascle and R. Deriche, "Region tracking through image sequences," Proc. IEEE Int. Conf. on Computer Vision, pp. 302-307, 1995.
[5] S. Araki, T. Matsuoaka, and N. Yokoya, "Real time tracking of multiple moving object contours in a moving camera image sequence," IEICE Trans. Inf. & Syst., vol. E83-D, no. 7, 2000.
[6] R. Feghali, "Multi-frame simultaneous motion estimation and segmentation," IEEE Trans. Consumer Electron., vol. 51, pp. 245-248, Feb. 2005.
[7] M. Kim, J. G. Choi, D. Kim, and H. Lee, "A VOP generation tool: Automatic segmentation of moving objects in image sequences based on spatial-temporal information," IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 8, pp. 1216-1226, 1999.
[8] J. Xia and Y. Wang, "A spatio-temporal video analysis system for object segmentation," Proc. ISPA, vol. 2, pp. 18-20, Sept. 2003.
[9] L. Vincent and P. Soille, "Watersheds in digital spaces: an efficient algorithm based on immersion simulations," IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, no. 6, pp. 583-598, Jun. 1991.
[10] Y. Tsaig and A. Averbuch, "Automatic segmentation of moving objects in video sequences: a region labeling approach," IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 7, pp. 597-612, Jul. 2002.
[11] H. Gao, W. C. Siu, and C. H. Hou, "Improved techniques for automatic image segmentation," IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 12, pp. 1273-1280, Dec. 2001.
[12] C. Pan, C. X. Zheng, and H. J. Wang, "Robust color image segmentation based on mean shift and marker-controlled watershed algorithm," Proc. Int. Conf. on Machine Learning and Cybernetics, vol. 5, pp. 2752-2756, Nov. 2003.
[13] I. Pitas and A. N. Venetsanopoulos, Nonlinear Digital Filters: Principles and Applications, Norwell, MA: Kluwer, 1990, pp. 154-156.
[14] R. M. Haralick, S. R. Sternberg, and X. Zhuang, "Image analysis using mathematical morphology," IEEE Trans. Pattern Anal. Mach. Intell., vol. 9, pp. 532-550, Jul. 1987.
[15] L. S. Davis, "A survey of edge detection techniques," Computer Graphics and Image Processing, vol. 4, pp. 248-270, 1975.
[16] J. H. Mao and K. K. Ma, "Semantic spatio-temporal segmentation for extracting video objects," Proc. IEEE Int. Conf. on Multimedia Computing and Systems, vol. 1, pp. 738-743, Jun. 1999.
[17] M. Wollborn and R. Mech, "Refined procedure for objective evaluation of video object generation algorithms," Doc. ISO/IEC JTC1/SC29/WG11 M3448, Mar. 1998.

Renjie Li received the B.S. and M.S. degrees in electronics engineering from Hebei University of Technology, Tianjin, P.R. China, in 2000 and 2003, respectively. He is currently working toward the Ph.D. degree in communication and electronic systems at the Institute of Image Communication and Information Processing, Shanghai Jiao Tong University, Shanghai, P.R. China. His research interests include DTV, video communication, image and video processing, and pattern recognition.
Songyu Yu graduated from Shanghai Jiao Tong University in 1963. As a professor and chairman, he is now with the Institute of Image Communication and Information Processing, Shanghai Jiao Tong University. His general interests include image processing, image communication, and DTV.
Xiaokang Yang (M’00) received the B.Sci. degree from Xiamen University, Xiamen, China, in 1994, the M. Eng. degree from Chinese Academy of Sciences, Beijing, China, in 1997, and the Ph.D. degree from Shanghai Jiaotong University, Shanghai, P.R. China, in 2000. He is currently a professor in the Institute of Image Communication and Information Processing, Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China. From April 2002 to October 2004, he was a research scientist in the Institute for Infocomm Research, Singapore. His current research interests include scalable video coding, video transmission over networks, video quality assessment, digital television, and pattern recognition.