Unsupervised Segmentation of Stereoscopic Video Objects: Investigation of Two Depth-Based Approaches

Klimis S. Ntalianis, Anastasios D. Doulamis, Nikolaos D. Doulamis and Stefanos D. Kollias
National Technical University of Athens, Electrical and Computer Engineering Department, 15773, Athens, Greece
E-mail: [email protected]

Abstract

Two unsupervised video object segmentation techniques are proposed in this paper and compared in terms of computational cost and segmentation quality. Both methods are based on the exploitation of depth information. In particular, a depth segments map is initially estimated by analyzing a stereoscopic pair of frames and applying a segmentation algorithm. Next, in the first approach, "Constrained Fusion of Color Segments" (CFCS), color segmentation is performed on one of the channels of the stereo pair and video objects are extracted by fusing color segments according to depth similarity. In the second method, an active contour is automatically initialized onto the boundary of each depth segment, according to a fitness function that considers different color areas and preserves the shapes of the depth segments' boundaries. The active contour then moves on a grid to extract the video object. Experiments on real stereoscopic sequences demonstrate the speed and accuracy of the proposed schemes.

1. Introduction

The bursting increase in the popularity of multimedia applications has led to the emergence of new standards such as MPEG-4 and MPEG-7. The MPEG-4 standard introduced the concept of video objects (VOs), where a VO corresponds to a semantically meaningful object such as a human, a car or a house. This idea would greatly benefit the implementation of systems for content-based indexing, directive video browsing and summarization, object-based coding and transmission, and efficient rate control. On the other hand, the visual part of MPEG-4 concerning the development of automatic content segmentation algorithms did not result in a globally applicable scheme. This is due to the fact that semantic objects usually consist of multiple regions with different color, texture and motion; although humans can fuse the suitable regions almost effortlessly, the same task is difficult for a machine to perform and still remains a challenging problem.

Currently, in 2D video, low-level features such as color, texture and motion information are used as content descriptors [1]. In stereoscopic video, however, where depth information can be reliably estimated, the problem of content description can be addressed more effectively, since video objects usually consist of regions located at the same depth plane [2].

Several supervised techniques have been proposed in the literature for VO segmentation. Characteristic examples include: (A) the approach in [3], where multidimensional analysis of several image features is performed by a spatially constrained fuzzy C-means algorithm; (B) the work in [4], where the VOP is detected using a combination of human assistance and the morphological watershed segmentation algorithm; and (C) the work in [5], where an active contour is manually initialized around the VO and moves towards the direction of energy minimization. However, it is difficult to include supervised segmentation techniques in real-time applications, due to time constraints.

On the other hand, some unsupervised schemes have also been presented. In [6] a multiscale gradient algorithm followed by a watershed transformation is used for color segmentation, while motion parameters of each region are estimated and regions with coherent motion are merged to form the object. In [7] a binary model of the VOP is derived using the Canny operator; the model is initialized and updated according to motion information, derived by the incorporation of Change Detection Masks (CDMs) or morphological filtering. Another scheme is presented in [8], where temporal and spatial segmentation is performed: temporal segmentation is based on intensity differences between successive frames, including a statistical hypothesis test, while spatial segmentation is performed by the watershed algorithm. The aforementioned techniques produce satisfactory results in cases where background and foreground VOs have different motion and the color segments of the foreground have coherent motion. However, in real sequences there are several cases where motion information is not adequate for semantic segmentation. These cases include: (A) shots with very little motion; (B) scenes consisting of multiple objects having different motion parameters; (C) cases where the color segments of the VOP do not have coherent motion; and (D) scenes where the background and the VOP have similar motion.

In this paper two efficient unsupervised video object segmentation approaches are proposed and their performances are extensively compared. Both methods are based on the exploitation of depth information. In particular, a depth segments map is initially produced for each stereoscopic pair of frames, using the technique described in [2], in which a multiresolution version of the RSST segmentation algorithm, called M-RSST, is implemented. Then, in the first approach, "Constrained Fusion of Color Segments" (CFCS), a color segments map is created by applying the M-RSST to one of the channels of the stereo pair. In this case video object extraction is accomplished by fusing color segments according to their degree of overlap with the depth segments. The second method is based on active contours. First, the points of the active contour are selected in an unsupervised manner on the boundary of each depth segment, by means of a polygonal shape approximation technique that also takes color variations into consideration. Second, each point of the active contour is associated with an attractive edge point, according to a nearest-neighbor-based method as described in [9]. Finally, a greedy algorithm is incorporated and the active contour evolves from its initial state to its final minimal-energy position, extracting the video objects.

Experimental results on real-life stereoscopic sequences indicate the prominent performance of the proposed schemes as video object segmentation tools. Furthermore, comparison tables exhibit the advantages and disadvantages of each scheme regarding computational cost and segmentation accuracy.

2. The CFCS Approach

Let us assume that, after using the technique described in [2], a depth segments map with $K_d$ depth segments, denoted as $S_i^d$, $i=1,2,\ldots,K_d$, is produced. Let us also assume that the M-RSST segmentation algorithm of [2] is applied to one of the two channels of a stereoscopic pair of frames, producing $K_c$ color segments, denoted as $S_i^c$, $i=1,2,\ldots,K_c$. Segments $S_i^c$ and $S_i^d$ satisfy the conditions:

$$S_i^c \cap S_k^c = \emptyset \quad \text{for any } i,k = 1,2,\ldots,K_c,\; i \neq k \qquad (1)$$

$$S_i^d \cap S_k^d = \emptyset \quad \text{for any } i,k = 1,2,\ldots,K_d,\; i \neq k \qquad (2)$$

Then the color/depth segmentation masks, which are given as the union of all color/depth segments, can be denoted by $G^c$ and $G^d$, where:

$$G^c = \bigcup_{i=1}^{K_c} S_i^c, \qquad G^d = \bigcup_{i=1}^{K_d} S_i^d \qquad (3)$$

Now, assuming two different segmentation masks over the same lattice, say $G^c$ and $G^d$, an operator $p(\cdot)$ can be defined, which projects a segment of the first segmentation mask, i.e., $S_i^c$, onto the second mask, i.e., $G^d$:

$$p(S_i^c, G^d) = S_k^d \quad \text{such that} \quad S_k^d = \arg\max_{j=1,\ldots,K_d} a(S_j^d \cap S_i^c) \qquad (4)$$

where $a(\cdot)$ returns the number of pixels of a segment. Equation (4) indicates that the projection of a color segment, say the $i$th ($S_i^c$), onto the depth segmentation mask $G^d$ returns the depth segment $S_k^d$, among all $K_d$ available, with which the respective color segment $S_i^c$ has the maximum area of intersection. Using the projection operator, each color segment of $G^c$ is associated with one depth segment, and all color segments associated with the same depth segment compose a group. More specifically, $K_d$ classes are created, say $C_i$, $i=1,2,\ldots,K_d$, each of which corresponds to a depth segment $S_i^d$ and contains all color segments that returned $S_i^d$ during projection, i.e.,

$$C_i = \{S_j^c : p(S_j^c, G^d) = S_i^d\}, \quad i = 1,2,\ldots,K_d \qquad (5)$$

Using the classes $C_i$, the final video object segmentation mask is obtained by fusing all color segments that belong to the same class $C_i$. Let us denote by $G$ this final segmentation mask, consisting of $K = K_d$ segments, say $S_i$, $i=1,2,\ldots,K$. These segments are generated as:

$$S_i = \bigcup_{S_g^c \in C_i} S_g^c, \quad i = 1,2,\ldots,K \qquad (6)$$

while the final video object mask $G$ is given as the union over all segments $S_i$, $i=1,2,\ldots,K$, that is:

$$G = \bigcup_{i=1}^{K} S_i \qquad (7)$$

Equation (7) denotes that color segments are fused together into $K = K_d$ new segments according to depth constraints.
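For concreteness, the following minimal Python/NumPy sketch applies the projection operator of Eq. (4) and the fusion of Eqs. (5)-(7) to a pair of label maps. It is only an illustration: the paper's own implementation was written in ANSI C, and the function and array names here (as well as the assumption of non-negative integer labels) are ours.

```python
import numpy as np

def fuse_color_segments(color_labels, depth_labels):
    """Fuse color segments into K = K_d segments according to Eqs. (4)-(7).

    color_labels and depth_labels are integer label maps defined over the same
    H x W lattice; each distinct value identifies one color segment S_i^c or
    one depth segment S_i^d (non-negative integer labels are assumed).
    """
    fused = np.empty_like(depth_labels)
    for c in np.unique(color_labels):
        region = (color_labels == c)                  # pixels of color segment S_c^c
        overlaps = np.bincount(depth_labels[region])  # a(S_j^d ∩ S_c^c) for every j
        fused[region] = overlaps.argmax()             # projection p(S_c^c, G^d), Eq. (4)
    return fused                                      # classes C_i fused as in Eqs. (5)-(7)
```

Each pixel of the returned map carries the index of the depth segment onto which its color segment was projected, so the regions of equal value form the $K = K_d$ fused segments of Eq. (7).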

3. The Active Contour Scheme

An active contour (also known as a "snake") is a parametric curve, represented by a vector $v(s,t) = (x(s,t), y(s,t))$, that moves on an image plane to minimize the energy functional:

$$E^*_{snake} = \int_0^1 E_{snake}(v(s))\,ds = \int_0^1 \left[ E_{int}(v(s)) + E_{image}(v(s)) + E_{con}(v(s)) \right] ds \qquad (8)$$

where $E_{int}$ represents the internal deformation energy (stretching and bending), $E_{image}$ expresses the image forces and $E_{con}$ corresponds to the external forces. In the proposed scheme, the energy being minimized is formulated as:

$$E^*_{snake} = -(w_1 E_{int} + w_2 E_{edge\_map}) \qquad (9)$$

where Eedge_map is an external energy originating from a simplified edge map and Eint is the internal energy. More details for calculating these energies can be found in [9].
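As a rough illustration of the external term, the sketch below builds one plausible "simplified edge map" from a grayscale frame. The actual edge map construction and the exact energy definitions are those of [9], so this normalized gradient magnitude should be read as an assumption, not the paper's formula.

```python
import numpy as np

def simplified_edge_map(gray):
    """A normalized gradient-magnitude map used as a stand-in edge map.

    The exact construction of the simplified edge map behind E_edge_map is
    given in [9]; a plain finite-difference gradient magnitude, scaled to
    [0, 1], is used here only for illustration.
    """
    gray = np.asarray(gray, dtype=float)
    gy, gx = np.gradient(gray)              # finite-difference derivatives along y and x
    magnitude = np.hypot(gx, gy)
    return magnitude / (magnitude.max() + 1e-12)
```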

3.1 Unsupervised Active Contour Initialization and Convergence

After formulating the energy to be minimized, unsupervised initialization of an active contour onto each of the depth segments is accomplished. Towards this direction the following two steps are performed: (a) determination of a specific order for the boundary points of the depth segment under consideration; (b) selection of the initial active contour points from the set of ordered boundary points.

The boundary points of a depth segment are those points located at the border of the depth segment. However, these points do not satisfy any ordering and thus cannot be straightforwardly used for active contour initialization. For this reason an order is established by finding the neighboring points of each boundary point. Towards this direction, let us denote by Pi = (xi, yi) an initially selected random boundary point, named the "current point". Then, starting from this point and moving in a clockwise manner, an order of the boundary points is created. In particular the following steps are performed iteratively: (i) adaptation of a 3x3 mask, centered at the "current point"; (ii) detection of the depth boundary points inside the mask; (iii) detection of the nearest neighbor using the city block distance; (iv) the nearest neighbor becomes the "current point". The process terminates when, starting from point (xi, yi), it reaches this point again. Finally, the set of ordered boundary points of the depth segment is denoted as SP = {P1, P2, ..., PN}, where P1 is the initial point and PN the final point. The points of set SP are then linked to each other with a line, according to the order, resulting in the polygon P1P2...PNP1.

However, this polygon cannot be used as the initial active contour, since it would induce great computational complexity during convergence. For this reason only a small part of these points is selected to form the active contour. In our scheme a merging technique is incorporated: we start from the polygon P1P2...PNP1 and, by discarding points according to color homogeneity and shape distance criteria, we arrive at the initial active contour. More specifically, let SP = {P1, P2, ..., PN} denote the ordered set of points comprising a constructed polygon like the one depicted in Figure 1. In order to discard polygon points it is essential to define error criteria that measure the quality of fitness of the polygon. In this direction, and without loss of generality, if we discard point P2 then the distance d2 = |P2 - K2| can be used as the approximation error of the curve, where the vector P2 - K2 starts from point P2 and is perpendicular to the vector P3 - P1:

$$(P_2 - K_2)^T (P_3 - P_1) = 0 \qquad (10)$$

Then, in the case under study, the adopted fitness criterion is the maximal error dmax:

$$d_{max} = \max_{2 \le i \le N-1} |P_i - K_i| \qquad (11)$$

More specifically, the following steps are performed iteratively towards selection of the initial points:

Step 1. For each point Pi of set SP, find the mean color difference MCi and calculate the mean city block distance Di from its two adjacent points Pi-1, Pi+1.
Step 2. Select the point of set SP with the smallest MCi value.
Step 3. For the selected point, say Pi, calculate di.
Step 4. If di <= dmax and Di <= dmax, discard Pi and repeat the process from Step 1; otherwise, select the point of set SP that possesses the next smaller MC value and repeat the process from Step 3. If no point satisfies the "di <= dmax" condition, terminate the process.

After termination of the process, the remaining points constitute the initial points of the active contour.
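A compact sketch of this point selection is given below. It is a simplified rendering of Steps 1-4 under stated assumptions: the `mean_color_diff` callable stands in for the MCi computation (not reproduced here), the role of the distance Di is folded into the single threshold d_max, and the boundary ordering of the previous paragraph is assumed to have been done already.

```python
import numpy as np

def perp_distance(p_prev, p, p_next):
    """Distance d_i = |P_i - K_i| of P_i from the chord P_(i-1)P_(i+1), cf. Eq. (10)."""
    p_prev, p, p_next = (np.asarray(q, dtype=float) for q in (p_prev, p, p_next))
    chord = p_next - p_prev
    rel = p - p_prev
    # |2D cross product| / |chord| = perpendicular distance from the line P_(i-1)P_(i+1)
    return abs(chord[0] * rel[1] - chord[1] * rel[0]) / (np.linalg.norm(chord) + 1e-12)

def select_initial_points(points, mean_color_diff, d_max):
    """Greedy merging of the ordered boundary polygon (Steps 1-4, simplified).

    points: ordered boundary points [(x, y), ...] of one depth segment.
    mean_color_diff: hypothetical callable i -> MC_i for the current polygon.
    Points are discarded while the approximation error stays within d_max.
    """
    pts = list(points)
    while len(pts) > 3:
        order = sorted(range(len(pts)), key=mean_color_diff)   # smallest MC_i first
        for i in order:
            d_i = perp_distance(pts[i - 1], pts[i], pts[(i + 1) % len(pts)])
            if d_i <= d_max:        # fitness criterion of Eq. (11)
                pts.pop(i)          # discard the point and re-rank the rest
                break
        else:                       # no remaining point satisfies d_i <= d_max
            break
    return pts                      # initial active contour points
```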
In our experiments a small number of points is kept (about 2% of all boundary points) to reduce the computational cost while providing a well-positioned initial active contour for video object segmentation. Finally, the active contour converges to the boundary of the video object using the greedy algorithm proposed in [10].
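The sketch below outlines one greedy iteration in the spirit of [10]. The continuity, curvature and edge terms are the generic ones of the greedy snake literature and stand in for the exact energies of Eq. (9) defined in [9]; the weights, window size and point ordering are illustrative assumptions.

```python
import numpy as np

def greedy_snake_step(points, edge_map, alpha=1.0, beta=1.0, gamma=1.2, win=1):
    """One greedy update of all contour points over a small search window.

    Each point moves, within its (2*win+1) x (2*win+1) neighborhood, to the
    pixel minimizing a weighted sum of continuity, curvature and (negative)
    edge-strength terms. Returns the new points and how many of them moved.
    """
    pts = [np.asarray(p, dtype=int) for p in points]
    n = len(pts)
    h, w = edge_map.shape
    mean_dist = np.mean([np.linalg.norm(pts[i] - pts[i - 1]) for i in range(n)])
    moved = 0
    for i in range(n):
        prev_pt, next_pt = pts[i - 1], pts[(i + 1) % n]
        best, best_energy = pts[i], np.inf
        for dy in range(-win, win + 1):
            for dx in range(-win, win + 1):
                cand = pts[i] + np.array([dy, dx])
                if not (0 <= cand[0] < h and 0 <= cand[1] < w):
                    continue
                e_cont = abs(mean_dist - np.linalg.norm(cand - prev_pt))      # even spacing
                e_curv = float(np.sum((prev_pt - 2 * cand + next_pt) ** 2))   # smoothness
                e_edge = -edge_map[cand[0], cand[1]]                          # attraction to edges
                energy = alpha * e_cont + beta * e_curv + gamma * e_edge
                if energy < best_energy:
                    best, best_energy = cand, energy
        moved += int(not np.array_equal(best, pts[i]))
        pts[i] = best
    return pts, moved
```

Iterating this step until only a small fraction of points moves yields the converged contour.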

4. Experimental Results

In this section the performance of the proposed unsupervised video object segmentation schemes is investigated, and comparisons in terms of accuracy and execution time are provided. The left and right channels of a stereoscopic pair are depicted in Figures 2(a) and 2(b) respectively, where three video objects can be observed (two foreground objects and the background) and the content is very complex.

Since the proposed schemes are based on depth information, a depth segments map is first produced for the examined stereo pair using a multiresolution implementation of the RSST algorithm, as described in [2]. The depth segments map is depicted in Figure 3(a); each depth segment roughly approximates a video object. Afterwards, according to the CFCS approach, a color segments mask is produced by applying the M-RSST to the frame of Figure 2(a). The color segmentation result is depicted in Figure 3(b), where each color segment is an area surrounded by a white contour. Next, color segments are fused according to their depth similarity. For this reason color segments are projected onto the depth segments map. The projection process is presented in Figure 4(a), where the white lines correspond to the contours of the color segments, while each region with a different color represents a depth segment. In Figure 4(b) fusion of color segments belonging to the same depth segment is performed, and the three extracted video objects are depicted in Figures 4(c) and 4(d).

According to the second approach, an active contour is initialized on the boundary of each depth segment to extract the contained VO. Initialization was accomplished using the technique described in subsection 3.1, where dmax was set equal to 3. In particular, in the case of the left video object (man) 918 boundary points were detected, 15 of which were kept to comprise the initial active contour, while in the case of the right video object (woman) 867 points were found, 21 of which were kept. The initial active contours for the two foreground video objects can be seen in Figure 5(a). The active contours converge after 77 iterations for the left object and 64 iterations for the right object. Convergence results are presented in Figure 5(b), while the extracted video objects can be seen in Figures 5(c) and 5(d) for the foreground and background video objects.

In order to evaluate the segmentation accuracy of the proposed methods, three measures are used: (a) the Spatial Quality Measure (SQM) proposed in [11], where the quality of segmentation masks is evaluated by means of algorithmically computed figures of merit, weighted to take into account the visual relevance of segmentation errors; (b) measure CP, which expresses the percentage of the perfect video object mask that is covered by an estimated mask and is calculated as:

$$CP = \frac{\text{Number of estimated pixels belonging to the video object}}{\text{Number of pixels of the video object}} \qquad (12)$$

and (c) measure AP, which expresses the percentage of the estimated mask that belongs to the video object and is calculated as:

$$AP = \frac{\text{Number of estimated mask pixels belonging to the video object}}{\text{Number of pixels of the estimated mask}} \qquad (13)$$
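Both measures translate directly into code. The sketch below assumes the estimated and perfect masks are boolean arrays of the same size; the SQM of [11] is not reproduced here.

```python
import numpy as np

def mask_measures(estimated_mask, perfect_mask):
    """CP and AP of Eqs. (12) and (13) for two binary (boolean) masks."""
    est = np.asarray(estimated_mask, dtype=bool)
    ref = np.asarray(perfect_mask, dtype=bool)
    common = np.logical_and(est, ref).sum()   # estimated pixels inside the video object
    cp = 100.0 * common / ref.sum()           # coverage of the perfect mask, Eq. (12)
    ap = 100.0 * common / est.sum()           # accuracy of the estimated mask, Eq. (13)
    return cp, ap
```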

All measures assume the existence of a perfect mask. In this paper perfect masks of the foreground video objects have been produced manually and are depicted in Figures 6(a) and 6(b). Furthermore, execution times are also measured for the proposed methods. Experiments were performed using a Pentium 1 GHz processor, 256 MB RAM and an ANSI C implementation of the various schemes. For the CFCS technique the execution time includes color segmentation (M-RSST) and segment fusion, while the execution time for the active contour scheme includes active contour initialization, convergence and VO extraction. Tables I and II contain the results for the two foreground video objects.

As can be observed from these tables, the segment fusion (CFCS) scheme generally performs very well, providing accurate results. However, it requires higher execution time than the active contour technique, depending on the complexity of the frame content (more complex content leads to longer execution time), and its performance deteriorates in case of inaccurate depth segmentation. On the other hand, the active contour scheme achieves an average execution time improvement of 62% compared to the CFCS method. Moreover, the provided segmentation results are of good quality and comparable to those of the CFCS technique. However, the scheme is sensitive to initialization and its performance can deteriorate when several initial points lie far from the VO boundary.

5. References

[1] B. Furht, S. W. Smoliar and H. Zhang, Video and Image Processing in Multimedia Systems, Kluwer Academic Publishers, 1995.
[2] N. Doulamis, A. Doulamis, Y. Avrithis, K. Ntalianis and S. Kollias, "Efficient Summarization of Stereoscopic Video Sequences," IEEE Trans. CSVT, Vol. 10, No. 4, pp. 501-517, June 2000.
[3] R. Castagno, T. Ebrahimi and M. Kunt, "Video Segmentation Based on Multiple Features for Interactive Multimedia Applications," IEEE Trans. CSVT, Vol. 8, No. 5, pp. 562-571, 1998.
[4] C. Gu and M.-C. Lee, "Semiautomatic Segmentation and Tracking of Semantic Video Objects," IEEE Trans. CSVT, Vol. 8, No. 5, pp. 572-584, 1998.
[5] C. Xu and J. L. Prince, "Snakes, Shapes and Gradient Vector Flow," IEEE Trans. Image Proc., Vol. 7, No. 3, pp. 359-369, March 1998.
[6] D. Wang, "Unsupervised Video Segmentation Based on Watersheds and Temporal Tracking," IEEE Trans. CSVT, Vol. 8, No. 5, pp. 539-546, 1998.
[7] T. Meier and K. Ngan, "Video Segmentation for Content-Based Coding," IEEE Trans. CSVT, Vol. 9, No. 8, pp. 1190-1203, 1999.
[8] M. Kim, J. G. Choi, D. Kim, H. Lee, M. H. Lee, C. Ahn and Y.-S. Ho, "A VOP Generation Tool: Automatic Segmentation of Moving Objects in Image Sequences Based on Spatio-Temporal Information," IEEE Trans. CSVT, Vol. 9, No. 8, pp. 1216-1226, 1999.
[9] K. S. Ntalianis, A. D. Doulamis, N. D. Doulamis and S. D. Kollias, "Non-Sequential Video Structuring Based on Video Object Linking: An Efficient Tool for Video Browsing and Indexing," in Proc. of ICIP, Thessaloniki, Greece, October 2001.
[10] D. J. Williams and M. Shah, "A Fast Algorithm for Active Contours and Curvature Estimation," CVGIP: Image Understanding, Vol. 55, No. 1, pp. 14-26, January 1992.
[11] P. Villegas, X. Marichal and A. Salcedo, "Objective Evaluation of Segmentation Masks in Video Sequences," in Proc. of WIAMIS, Berlin, Germany, May-June 1999.

Figure 1: Polygonal approximation: the merging scheme.

Figure 2: (a) Left channel and (b) Right channel.

Figure 3: (a) Depth segments map and (b) Color segmentation of Figure 2(a).

Figure 4: (a) Projection of the color segments mask onto the depth segments map, (b) Fusion of color segments belonging to the same depth segment, (c) Foreground video objects and (d) Background video object.

Figure 5: (a) Initial active contours, (b) Final active contours, (c) Foreground video objects and (d) Background video object.

Figure 6: (a) Left object (man) reference mask and (b) Right object (woman) reference mask.

TABLE I: Evaluation results for the left video object (MAN)

Mask                 | Mask Pixels | Mask Pixels belonging to VOP | CP (%) | AP (%) | SQM (dB) | Execution Time (sec)
Perfect Mask         | 29,409      | 29,409                       | 100    | 100    | 0        | -
Fusion Mask          | 31,496      | 29,325                       | 99.71  | 93.11  | -4.37    | 2.89
Active Contour Mask  | 29,803      | 27,597                       | 93.84  | 92.6   | -4.29    | 1.13

TABLE II: Evaluation results for the right video object (WOMAN)

Mask                 | Mask Pixels | Mask Pixels belonging to VOP | CP (%) | AP (%) | SQM (dB) | Execution Time (sec)
Perfect Mask         | 27,867      | 27,867                       | 100    | 100    | 0        | -
Fusion Mask          | 28,975      | 27,724                       | 99.49  | 95.68  | -2.33    | 2.72
Active Contour Mask  | 29,764      | 26,287                       | 94.33  | 88.32  | -7.3     | 0.99
