Video Object Segmentation by Motion-based Sequential Feature Clustering

Mei Han, Wei Xu, and Yihong Gong
NEC Laboratories America, Cupertino, California, USA

[email protected], [email protected], [email protected]

ABSTRACT


Segmentation of video foreground objects from the background has many important applications, such as human-computer interaction, video compression, and multimedia content editing and manipulation. Most existing methods work on image pixels or color segments, which is computationally expensive. Some methods require extensive manual input, static cameras, and/or rigid scenes. In this paper we propose a fully automatic foreground segmentation method based on sequential clustering of sparse image features. The sparseness makes the method computationally efficient. We use both edge and corner points extracted from each video frame. A joint spatio-temporal linear regression method is developed to compute sparse motion layers of M consecutive frames jointly under a temporal consistency constraint. Once the sparse motion layers have been identified for each frame, the corresponding dense motion layers are created using the Markov Random Field (MRF) model. The MRF model assigns the rest of the image pixels to the motion layers by considering both the color attributes and the spatial relations between each pixel and its surrounding edge/corner points. Experimental evaluations on videos taken by webcams show the effectiveness of the proposed method.

Categories and Subject Descriptors
I.4.6 [Image Processing and Computer Vision]: Segmentation – Pixel Classification

General Terms
Algorithms

Keywords
Object Segmentation, Feature Extraction, Feature Clustering, Linear Regression, Markov Random Field

1. INTRODUCTION

Segmentation of foreground objects from the background has many applications in human-computer interaction, video compression, object tracking, and multimedia content editing and manipulation. With the prevalence of broadband Internet, multimedia-enabled personal computers, and 3G cellphones, home users can easily establish video connections with friends through which they can see each other's faces and objects of interest. Such developments have opened up new opportunities for various value-added services and applications, and we believe that foreground object segmentation techniques have great potential as an enabling tool for tasks such as bandwidth reduction, privacy protection, and personalized video content editing and hallucination.

To date, several research areas are related to the task of foreground object segmentation. Video matting is a classic inverse problem in computer vision that involves the extraction of foreground objects and the alpha mattes that describe their opacity from image sequences [26, 18, 24, 20, 1, 5, 6, 21]. Apostoloff and Fitzgibbon [1] presented a matting approach for natural scenes assuming the camera was static and the background was known. Wang and Cohen [26] described a unified segmentation and matting approach based on Belief Propagation, which iteratively estimates the opacity value of every pixel in the image using a small sample of foreground and background pixels marked by the user. Li et al. [18] used a 3D graph cut-based segmentation followed by a tracking-based local refinement to obtain a binary segmentation of the video objects, and adopted coherent matting [23] as a prior to produce the alpha matte of the objects; the method is computationally expensive and often needs the user's input to fine-tune the results. Rother et al. [20] extended the graph-cut based methods to iteratively extract foreground objects given limited user interaction. Most other matting methods assume that a trimap, which segments the image into three regions (foreground, background, and unknown), is provided by the user. Ruzon and Tomasi [21] analyzed the foreground and background color distributions to build a probabilistic model for inferring the labeling of the unknown regions. Chuang et al. [6] proposed a Bayesian matting approach that formulated the problem in a well-defined Bayesian network and solved it using the maximum a posteriori (MAP) technique; they further extended this method to videos based on optical flow computation [5]. However, it is a heavy burden to provide trimap labels periodically for long video sequences.




Sun et al. [24] presented a Poisson matting approach that estimates the matte from image gradients by solving Poisson equations using boundary information from the trimap. Some automatic techniques have been proposed based on stereo input [17, 7]. Kolmogorov et al. [17] presented algorithms that fuse color, contrast, and stereo matching information to infer layers accurately and efficiently. Criminisi and Blake [7] described an efficient algorithm for synthesizing virtual views from two input images taken with stereo cameras. These methods require synchronized stereo input, which provides valuable depth information from stereo matching.

Motion-based segmentation approaches perform motion estimation and cluster pixels or color segments into regions of coherent motion. Many approaches solve the problem through an Expectation-Maximization (EM) process [9] to estimate the parametric motion models and their supporting regions [12, 2, 8, 25, 13]. There is also much work on grouping pixels or segments into layers based on the affinity of local measurements [27, 22, 14, 29]. However, spatial or temporal clues alone are not sufficient to distinguish objects, because different objects may share similar colors or motions. There have been research efforts that strive to combine spatial and temporal features for improved segmentation [16, 15, 28]. Ke and Kanade [15] described a factorization method that performs rigid layer segmentation in a subspace, exploiting the fact that all the layers share the same camera motion. Wang and Ji [28] presented a dynamic conditional random field model that combines both intensity and motion cues to achieve segmentation. Most of these methods work at the pixel or color segment level, where the computational cost is high.

In this paper we propose a fully automatic foreground object segmentation method that aims at applications for 3G cellphone and home broadband Internet users. In particular, we are interested in applications in which the user takes video images of his/her face or other objects of interest using a cellphone or a webcam, and wants to send the video clip to another person with the background either eliminated or hallucinated. We assume that the video clip can contain both rigid and non-rigid objects; that there always exist motions caused by either moving cameras or moving objects; and that the video capturing/transmission device has quite limited computation and storage resources. These assumptions account for most of the common usage conditions for this type of application.

For the applications described above, our foreground object segmentation method must be able to handle any type of object, and must be very efficient in computation and storage. We take the following approaches to overcome these challenges. First, we assume that the input video sequence is composed of two major motion layers which correspond to the foreground and the background, respectively. Although such an assumption limits our ability to model complex scenes, it is sufficient to model the most common usage patterns for cellphone and home Internet users. Second, we compute sparse motion layers first, using sparse image features such as edge and corner points extracted from each video frame. This approach dramatically reduces the computational cost because, compared to dense motion layers, it involves far fewer pixels in the costly computation and clustering of optical flows.

A joint spatio-temporal linear regression method is developed to compute the sparse motion layers of M consecutive frames jointly under the temporal consistency constraint. This method aims to generate more reliable and temporally smoother motion layers for the entire input video sequence. Third, once the two sparse motion layers have been identified for the edge and corner points, we create the corresponding dense motion layers using the Markov Random Field (MRF) model. The MRF model assigns the rest of the pixels to either of the motion layers by considering both the color attributes and the spatial relations between each pixel and its surrounding edge/corner points. Taking the above approaches together, we strive to accomplish the task of foreground object segmentation with low computational cost and high segmentation accuracy.

The rest of the paper is organized as follows. Section 2 presents a detailed description of each main component of the proposed foreground segmentation method. Section 3 provides the experimental results and discussions. Section 4 concludes the paper.

2. THE PROPOSED METHOD

We assume that there are only two motion layers within all video frames, foreground and background, and that the two motion layers follow affine transformations. Generally speaking, identification of motion layers is a computationally intensive task because it involves both the computation and the clustering of optical flows for all image pixels. To provide a computationally affordable solution for mobile and home Internet users, we conduct the task of motion layer identification in two steps. For M consecutive video frames, we first extract sparse feature points such as edge and corner points, and identify two sparse motion layers from the optical flows of these feature points through a joint spatio-temporal linear regression. Because edge and corner points are the most reliable features for the computation of optical flows, this approach not only dramatically reduces the computational cost, but also has the potential to generate more accurate motion layers. The joint spatio-temporal linear regression strives to generate more reliable and temporally smoother motion layers by enforcing temporal consistencies among adjacent video frames. The enforcement of temporal consistency is achieved by performing the joint linear regression on M consecutive frames under the constraint that each feature point and all its corresponding feature points in the subsequent frames must belong to the same motion layer. Once the two sparse motion layers have been identified for the edge and corner points, we create the corresponding dense motion layers using the Markov Random Field (MRF) model. The MRF model clusters the rest of the image pixels into either of the motion layers by considering both the color attributes and the spatial relations between each pixel and its surrounding edge/corner points.

In the remainder of this section, we first describe the extraction of sparse features, and then present the idea of computing sparse motion layers using simple linear regression. Next we explain how the simple linear regression is extended to a joint spatio-temporal linear regression with the temporal consistency constraint. Finally, we describe our method of creating dense motion layers using the MRF model.


2.1 Extraction of Sparse Features

We extract both corner and edge points from each video frame for optical flow and sparse motion layer computation. We use the Canny edge detector [4] to extract such feature points from each image. The covariance matrix is computed for each feature point to determine whether it is an edge or a corner point:

\text{feature} = \begin{cases} \text{edge}, & \text{if } \sigma_1 > \alpha\,\sigma_2 \text{ and } \sigma_2 < \beta \\ \text{corner}, & \text{otherwise} \end{cases}      (1)

where \sigma_1 and \sigma_2 are the eigenvalues of the covariance matrix, and \alpha and \beta are parameters. We use the Lucas-Kanade method [19] to compute the optical flow values of the features. For each edge feature we compute its normal direction (\delta x, \delta y) from the covariance matrix and project its optical flow onto this direction, i.e., we keep only the normal optical flow in the affine parameter computation.

Motion-based methods are generally sensitive to the accuracy of the computed optical flow values. The proposed method is robust to correspondence errors for the following reasons: (a) The sparse features we use are corner and edge features, for which corresponding feature points are generally easier and more reliable to find; we use the normal flow for edge feature points to overcome the aperture problem. (b) We need to fit only two affine motion models onto a large number (around 4000-6000) of motion vectors, so the method is relatively robust to outliers, especially under our assumption that the scene is approximated by two major layers whose depths are distinguishable either by camera motion or by foreground object motion. (c) We use multiple frames to classify features into foreground and background layers, which also helps in dealing with correspondence errors. Given the number of affine motion models to fit (two in our case), the number of available motion vectors (4000-6000), and the temporal constraints, our method is not sensitive to the accuracy of the correspondence computation.
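As an illustration of this feature-extraction step, the following Python sketch collects Canny edge pixels, separates edge points from corner points using the eigenvalues of the local gradient covariance matrix in the spirit of Eq.(1), and computes pyramidal Lucas-Kanade flow with normal-flow projection for edge points. The thresholds alpha and beta, the window size, and all function names are illustrative assumptions, not the authors' settings.

```python
import cv2
import numpy as np

def extract_sparse_features(gray, alpha=5.0, beta=1e4, win=5):
    """Collect Canny edge pixels and label each as an edge or a corner from
    the eigenvalues of the local gradient covariance matrix (Eq. 1 style).
    alpha, beta and the window size are illustrative, not the paper's values.
    The loop is written for clarity rather than speed."""
    edge_map = cv2.Canny(gray, 50, 150)
    ys, xs = np.nonzero(edge_map)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    h = win // 2
    pts, is_edge, normals = [], [], []
    for x, y in zip(xs, ys):
        if not (h <= x < gray.shape[1] - h and h <= y < gray.shape[0] - h):
            continue
        wx = gx[y - h:y + h + 1, x - h:x + h + 1].ravel()
        wy = gy[y - h:y + h + 1, x - h:x + h + 1].ravel()
        C = np.array([[wx @ wx, wx @ wy],
                      [wx @ wy, wy @ wy]])
        (s2, s1), vecs = np.linalg.eigh(C)          # eigenvalues in ascending order
        pts.append((x, y))
        is_edge.append(s1 > alpha * s2 and s2 < beta)
        normals.append(vecs[:, 1])                  # dominant gradient direction
    return np.float32(pts), np.array(is_edge), np.float32(normals)

def sparse_flow(prev_gray, cur_gray, pts, is_edge, normals):
    """Lucas-Kanade optical flow for the feature points; edge points keep only
    the normal component of their flow (aperture problem)."""
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, cur_gray, pts.reshape(-1, 1, 2), None)
    flow = nxt.reshape(-1, 2) - pts
    for i in np.nonzero(is_edge)[0]:
        n = normals[i] / (np.linalg.norm(normals[i]) + 1e-8)
        flow[i] = (flow[i] @ n) * n                 # project onto the edge normal
    return flow, status.ravel() == 1
```

At the 4000-6000 feature points per frame reported below, a vectorized version of the per-point loop would be preferable, but the structure of the computation is the same.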

2.2 Sparse Motion Layer Computation Using Linear Regression

Given a set of n feature points (x_i, y_i) and their optical flow values between two frames (\delta x_i, \delta y_i), i = 1, ..., n, our task is to compute the two motion layers that correspond to the foreground F and the background B, respectively, and to classify the n feature points into either of the two classes F and B. Since the two motion layers follow affine transformations, the problem becomes the discovery of two sets of affine parameters A_1 = (a_1, b_1, c_1, d_1, e_1, f_1) and A_2 = (a_2, b_2, c_2, d_2, e_2, f_2) that satisfy the following equations:

\text{If } (x_i, y_i) \in F: \quad a_1 x_i + b_1 y_i + c_1 = \delta x_i, \qquad d_1 x_i + e_1 y_i + f_1 = \delta y_i
\text{If } (x_i, y_i) \in B: \quad a_2 x_i + b_2 y_i + c_2 = \delta x_i, \qquad d_2 x_i + e_2 y_i + f_2 = \delta y_i      (2)

We have devised an iterative linear regression method that computes the two affine motion models and classifies the n feature points simultaneously. The method attempts to find the two sets of affine parameters A_1, A_2 and the two sets of feature points F, B that minimize the error function:

E(A_1, A_2, F, B) = \sum_{(x_i, y_i) \in F} \left\| \begin{pmatrix} \delta x_i \\ \delta y_i \end{pmatrix} - \begin{pmatrix} a_1 & b_1 & c_1 \\ d_1 & e_1 & f_1 \end{pmatrix} \begin{pmatrix} x_i \\ y_i \\ 1 \end{pmatrix} \right\|^2 + \sum_{(x_j, y_j) \in B} \left\| \begin{pmatrix} \delta x_j \\ \delta y_j \end{pmatrix} - \begin{pmatrix} a_2 & b_2 & c_2 \\ d_2 & e_2 & f_2 \end{pmatrix} \begin{pmatrix} x_j \\ y_j \\ 1 \end{pmatrix} \right\|^2

Using the above error function, our method obtains the optimal solution through an iterative Estimation-Classification (EC) process. The process starts by randomly clustering the feature points into two sets. At the E-stage, two affine motion models are estimated from these two feature sets. At the C-stage, each feature point is compared with the two affine motion models obtained at the E-stage and is classified into the model that produces the smaller residual for that point. This EC process is repeated until the two affine motion models converge. The outline of the algorithm is as follows:

1. Initialization: Randomly cluster the feature points into two sets F, B.
2. Estimation: Compute the least-squares solution of the affine model for each set of feature points using the linear regression algorithm.
3. Classification: Classify each feature point to the affine model with the smaller residual. If the smaller residual is above a threshold, the corresponding feature point is put into a garbage set and skips the next iteration of computation.
4. Repeat the above E-C steps until the two affine motion models converge.
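A minimal NumPy sketch of the Estimation-Classification loop above is shown next; the garbage-set threshold, the iteration cap, and the random-initialization details are illustrative assumptions rather than the authors' exact settings.

```python
import numpy as np

def fit_affine(pts, flow):
    """Least-squares affine model: flow ~ A @ [x, y, 1]^T, with A of shape 2x3."""
    X = np.hstack([pts, np.ones((len(pts), 1))])      # n x 3
    A, *_ = np.linalg.lstsq(X, flow, rcond=None)      # 3 x 2 solution
    return A.T                                        # 2 x 3

def residuals(pts, flow, A):
    """Squared regression residue of each point against one affine model."""
    X = np.hstack([pts, np.ones((len(pts), 1))])
    return np.sum((flow - X @ A.T) ** 2, axis=1)

def ec_two_layers(pts, flow, n_iter=30, garbage_thresh=4.0, seed=None):
    """Iterative Estimation-Classification clustering of feature points into
    two affine motion layers (Section 2.2). Labels 0/1 index the two layers,
    -1 marks the garbage set."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=len(pts))        # random initialization
    for _ in range(n_iter):
        models = []
        for l in (0, 1):                               # E-stage
            idx = labels == l
            if idx.sum() < 3:                          # degenerate cluster guard
                idx = rng.random(len(pts)) < 0.5
            models.append(fit_affine(pts[idx], flow[idx]))
        r = np.stack([residuals(pts, flow, A) for A in models], axis=1)
        new_labels = np.argmin(r, axis=1)              # C-stage
        new_labels[r.min(axis=1) > garbage_thresh] = -1  # garbage set
        if np.array_equal(new_labels, labels):
            break                                      # models converged
        labels = new_labels
    return labels, models
```

As noted in Section 3.4, the regression typically converges within 10 to 20 such iterations.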


2.3 Joint Spatio-Temporal Linear Regression

We extend the motion layer computation by linear regression from two frames to several frames so that we can exploit temporal consistency to achieve smoother and more reliable motion layers. By temporal consistency, we mean that if a feature point in frame k belongs to the foreground, then its corresponding point in frame k+1 should also belong to the foreground, and vice versa. Since our motion layer computation is based on affine motion models, it works best when the camera is moving, or the foreground and background objects have independent motions. This is not always the case between two frames, but a few frames usually provide enough motion information to distinguish the foreground and background layers. With this extended linear regression framework, motion layers for frames that could not otherwise be accurately computed due to the lack of motion information can also be accurately estimated.

To enforce temporal consistency in the motion layer computation, we perform the linear regression on M consecutive frames jointly, and explicitly require that, within this joint linear regression, each feature point X_i and all of its corresponding feature points in the subsequent M − 1 frames belong to the same motion layer. The correspondences between feature points are determined from the optical flows computed between pairs of frames.

Let X_{ik} = [x_{ik}, y_{ik}]^T denote the i-th feature point in frame k, \nabla X_{ik} = [\delta x_{ik}, \delta y_{ik}]^T be X_{ik}'s optical flow between frames k and k−1, l \in \{1, 2\} be the motion layer label (l = 1, 2 represent the foreground and background layers, respectively), and l_{X_{ik}} represent the motion layer label of X_{ik}. A_{lk}, the matrix of affine parameters for motion layer l at frame k, is defined as:

A_{lk} = \begin{pmatrix} a_{lk} & b_{lk} & c_{lk} \\ d_{lk} & e_{lk} & f_{lk} \end{pmatrix}      (3)

Because we have M − 1 pairs of frames, we need to compute a total of 2·(M − 1) affine motion models. Instead of computing these affine motion models separately, we compute them jointly by optimizing the following constrained linear regression problem:

J_E = \sum_{k=2}^{M} \sum_{l=1}^{2} \sum_{X_{ik}:\, l_{X_{ik}} = l} \left\| \nabla X_{ik} - A_{lk} \begin{pmatrix} X_{ik} \\ 1 \end{pmatrix} \right\|^2      (4)

\text{subject to } l_{X_{ik}} = l_{X_{jk-1}} \text{ if } X_{ik} \mapsto X_{jk-1}, \text{ for all } i;\ k = 2, \ldots, M,

where the third summation in Eq.(4) sums the regression residues of all the feature points X_{ik} whose motion layer labels equal l, and X_{ik} \mapsto X_{jk-1} means that the two points are corresponding feature points between frames k and k − 1. The joint linear regression thus seeks the set of 2·(M − 1) affine motion models A = \{A_{lk} \mid l = 1, 2;\ k = 2, \ldots, M\}, as well as the motion layer label of each feature point X_{ik}, that minimize the accumulated residues across the M − 1 pairs of frames under the given set of constraints. The constraint set demands that each feature point X_{ik} and all its corresponding feature points in the rest of the M − 1 frames be linked together for the assignment of the motion layer labels. This framework not only enforces temporal consistency in the motion layer estimation process, but also makes the estimated motion layers more accurate, especially for frames that do not have sufficient object/camera motion. The algorithm for obtaining the optimal solution is very similar to the one described in Section 2.2, except that all the constraints must be satisfied during the iterative EC process. The computational cost is proportional to the number of frames M used in the joint linear regression. In our implementation, we set M to five.
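One way to realize the constraint in Eq.(4) is to link corresponding feature points across the M frames into chains and to classify each chain as a whole by its accumulated residual against the per-frame-pair affine models. The sketch below assumes such chains and models have already been built from the pairwise optical flow (the chain data structure and names are hypothetical); in the full EC scheme this chain-level classification alternates with re-estimating the 2·(M−1) affine models from the points currently assigned to each layer, exactly as in Section 2.2 but with the label shared along each chain.

```python
import numpy as np

def classify_chains(chains, models):
    """Assign one motion-layer label per correspondence chain.

    chains : list of chains; each chain is a list of (pt, flow) tuples for
             k = 2..M, where pt is the chain's point X_ik in frame k and flow
             is its optical flow between frames k and k-1.
    models : dict {(k, l): A_lk}, a 2x3 affine matrix per frame pair k and
             layer l in {0, 1}.
    """
    labels = []
    for chain in chains:
        cost = np.zeros(2)
        for k, (pt, flow) in enumerate(chain, start=2):
            x = np.append(pt, 1.0)                     # homogeneous point [x, y, 1]
            for l in (0, 1):
                cost[l] += np.sum((flow - models[(k, l)] @ x) ** 2)
        labels.append(int(np.argmin(cost)))            # the whole chain shares a label
    return np.array(labels)
```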

2.4 Dense Foreground Layer Creation Using MRF

Section 2.3 generates two sparse motion layers from the sparse feature points extracted from the video frames. Because motion layer computation requires the most computational effort within our foreground segmentation framework, performing this task in a sparse feature space with fast linear regression dramatically reduces the computational cost. Once the sparse motion layers are obtained, our next task is to turn them into dense motion layers in which every pixel in the frame gets a label. We adopt the Markov Random Field (MRF) model to propagate the labels provided by the sparse motion layers to the rest of the pixels in the frame.

The MRF, also called an undirected graphical model, is used in statistical machine learning to model data sets that possess strong dependencies among individual data points. To utilize the model, we first define the following two model components: (1) construct a graph G and a set of cliques C(G) representing the dependencies among individual data points, and (2) define a set of potential functions \psi(c; \Lambda) over the cliques c \in C(G) of the graph G, where \Lambda is the parameter set of the MRF model. Given the above model components, the joint probability distribution over a frame I and its label set L is defined by:

P_\Lambda(I, L) = \frac{1}{Z_\Lambda} \prod_{c \in C(G)} \psi(c; \Lambda)      (5)

where Z_\Lambda = \sum_{I, L} \prod_{c \in C(G)} \psi(c; \Lambda) is the partition function that normalizes the probability distribution. Generally, a potential function takes the form

\psi(c; \Lambda) = \exp[\lambda_c f(c)]      (6)

where f(c) is some real-valued feature function over clique values and \lambda_c is the weight given to that particular feature function. Substituting Eq.(6) back into Eq.(5), we have

P_\Lambda(I, L) = \frac{1}{Z_\Lambda} \exp\Big[\sum_{c \in C(G)} \lambda_c f(c)\Big]      (7)

In our MRF model, the graph G is constructed such that each node in G corresponds to a pixel, each edge connects only two immediately neighboring nodes, and there are no other types of edges in the graph. Two types of cliques are defined on the graph: the single-node clique composed of a node i only, and the 4-neighbor clique composed of the four neighbors N(i) of a node i. Furthermore, each node i (i.e., pixel i) is represented by the following five attributes: (1) the coordinates X_i = (x_i, y_i) in frame I, (2) the color triplet col_i, (3) the feature point indicator e_i, which equals one if pixel i is a feature point and zero otherwise, (4) the normal direction d_i if pixel i is an edge feature point, and (5) the regression residue r_i, defined as:

r_i(l_i) = \begin{cases} 0, & \text{if } e_i = 0 \\ \left\| \nabla X_i - A_{l_i} \begin{pmatrix} X_i \\ 1 \end{pmatrix} \right\|^2, & \text{otherwise} \end{cases}      (8)

where l_i is the motion layer label for X_i.

The definition of the potential functions \psi for graph G is the most important part of the MRF model design, because these potential functions largely determine the accuracy of our approximation to the true joint distribution. They can be thought of as compatibility functions: a good potential function assigns smaller penalties to clique settings that are highly compatible under the given distribution. With this in mind, we introduce three potential functions that strive to accomplish the following objectives:

1. Minimize the total regression residue between all the pixels and the two motion layers, which can be achieved by minimizing

f_1(i) = r_i(l_i)      (9)

defined on single-node cliques.

2. Assign the same label to pixels with similar colors. This can be accomplished by minimizing the following potential function defined on 4-neighbor cliques:

f_2(i) = \sum_{j \in N(i)} \delta(l_i \neq l_j) \exp(-\|col_i - col_j\|^2 / \sigma^2)      (10)

where \delta(\text{cond}) equals one if cond is true and zero otherwise, and \sigma is a constant.

3. Assign the same label to a pair of neighboring edge points having the same normal direction that is perpendicular to the line connecting the two edge points. This can be realized by minimizing the following potential function defined on 4-neighbor cliques:

f_3(i) = \sum_{j \in N(i)} \delta(l_i \neq l_j)\, e_i e_j\, \|d_{ij} \times d_i\|\, \|d_{ij} \times d_j\|      (11)

where d_{ij} is the unit vector pointing from point i to point j, and x \times y is the cross product of the vectors x and y.

Substituting the above potential functions into Eq.(7), we have

P_\Lambda(I, L) = \frac{1}{Z_\Lambda} \exp\Big\{\sum_{i \in I} [\lambda_1 f_1(i) + \lambda_2 f_2(i) + \lambda_3 f_3(i)]\Big\}      (12)
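To make the clique structure concrete, the following sketch evaluates the three potential terms for one pixel and one of its 4-neighbors; the per-pixel record layout (dictionary keys) and the value of sigma are assumptions for illustration, not the paper's data structures.

```python
import numpy as np

def clique_potentials(label_i, label_j, pixel_i, pixel_j, sigma=10.0):
    """Per-pixel potential terms of Eqs. (9)-(11) for pixel i and one
    4-neighbor j. Each pixel record is assumed to be a dict with keys
    'xy' (coordinates), 'col' (color triplet), 'e' (feature-point flag),
    'd' (unit edge normal) and 'r' (pair of regression residues of Eq. 8,
    one per motion layer); sigma is an assumed constant."""
    # f1: regression residue of pixel i under its candidate label (Eq. 9)
    f1 = pixel_i['r'][label_i]
    # f2: differently labeled neighbors with similar colors are penalized (Eq. 10)
    diff = np.asarray(pixel_i['col'], float) - np.asarray(pixel_j['col'], float)
    f2 = float(label_i != label_j) * np.exp(-(diff @ diff) / sigma ** 2)
    # f3: differently labeled neighboring edge points whose shared normal is
    #     perpendicular to the line joining them are penalized (Eq. 11)
    f3 = 0.0
    if pixel_i['e'] and pixel_j['e'] and label_i != label_j:
        dij = np.asarray(pixel_j['xy'], float) - np.asarray(pixel_i['xy'], float)
        dij /= np.linalg.norm(dij) + 1e-8
        ci = abs(dij[0] * pixel_i['d'][1] - dij[1] * pixel_i['d'][0])  # ||dij x di||
        cj = abs(dij[0] * pixel_j['d'][1] - dij[1] * pixel_j['d'][0])  # ||dij x dj||
        f3 = ci * cj
    return f1, f2, f3
```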

We use the Gibbs sampling method of [10] to iteratively obtain the label set L that maximizes the joint probability P_\Lambda(I, L). We set the weights \Lambda through experiments, using \lambda_1 = 2, \lambda_2 = 3, \lambda_3 = 2 in our implementation. Compared with other popular inference frameworks, such as graph cut or belief propagation (BP), we choose Gibbs sampling for the following reasons: (a) Although graph cut based algorithms can give the exact MAP result in polynomial time, their time complexity is too high for our application. Boykov et al. [3] gave a review of using min-cut type algorithms to solve MRFs. The computational complexity of graph cut is O(mn^2), where m and n are the numbers of edges and nodes in the graph, respectively. In contrast, Gibbs sampling has a complexity of O(mt), where t is the number of iterations. As described in that review, graph cut based algorithms usually take several seconds to process an image of a size similar to ours. (b) Belief propagation works well for graphs without many loops, especially short loops [11]. Since there are many short loops in our MRF, belief propagation may not be the best choice for our model. (c) Gibbs sampling is much easier to implement for validating our idea. In our implementation, the MRF has a pyramidal structure. We conduct Gibbs sampling only at the top level to avoid major local optima, which needs 50-100 iterations; since the image size at the top level is very small, the computation is affordable. For the lower levels, we use deterministic greedy updates to find the solution, which need several sweeps to converge. Based on our particular implementation, we believe that Gibbs sampling is faster than BP, and we expect the solution from our implementation to have similar or even better quality than a BP solution in this particular problem setting.

After obtaining the two dense motion layers for each video frame, we need to determine which one corresponds to the foreground. The heuristic we use is that the foreground layer is generally more compact than the background one. This is especially true when the user establishes a video connection with a friend over the Internet using a webcam, or through the wireless communication network using a cellphone. In such cases, a typical scene configuration is that the foreground contains either a human face or some objects of interest, and the rest of the image belongs to the background. Therefore, we compute the location variation of both motion layers and take the one with the smaller variation as the foreground layer. Our experimental evaluations have shown that foreground determination using this simple heuristic is very effective and produces correct results for most testing video sequences.
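Building on the clique_potentials() sketch above, a single-level Gibbs-style sweep together with the compactness heuristic for picking the foreground might look as follows; the conditional-sampling form, the flat (non-pyramidal) structure, and the neighbor bookkeeping are illustrative simplifications of the actual implementation, while the weights follow the reported values λ1 = 2, λ2 = 3, λ3 = 2.

```python
import numpy as np

def gibbs_sweep(labels, pixels, neighbors, rng, lam=(2.0, 3.0, 2.0)):
    """One Gibbs sweep over all pixels: resample each label from the
    conditional distribution induced by the weighted potentials of Eq. (12).
    'pixels' holds the per-pixel records of the previous sketch and
    'neighbors[i]' lists the 4-neighbors of pixel i."""
    l1, l2, l3 = lam
    for i in rng.permutation(len(labels)):
        energy = np.zeros(2)
        for cand in (0, 1):                      # candidate label for pixel i
            f1 = pixels[i]['r'][cand]            # single-node term, counted once
            f2 = f3 = 0.0
            for j in neighbors[i]:               # pairwise terms over 4-neighbors
                _, p2, p3 = clique_potentials(cand, labels[j], pixels[i], pixels[j])
                f2 += p2
                f3 += p3
            energy[cand] = l1 * f1 + l2 * f2 + l3 * f3
        p1 = np.exp(-energy[1]) / (np.exp(-energy[0]) + np.exp(-energy[1]))
        labels[i] = int(rng.random() < p1)       # sample the new label
    return labels

def pick_foreground(labels, coords):
    """Paper's heuristic: the layer whose pixel coordinates have the smaller
    spread (location variation) is taken as the foreground."""
    spread = [np.var(coords[labels == l], axis=0).sum() for l in (0, 1)]
    return int(np.argmin(spread))
```

A driver consistent with the description above would initialize the labels from the sparse layer assignments where available, run a few dozen sweeps at the coarsest pyramid level with a generator such as np.random.default_rng(0), and then refine greedily at the finer levels.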

3. EXPERIMENTAL RESULTS

The proposed foreground segmentation method has been tested on real videos taken under different lighting conditions and camera motions. In this section we first show two examples captured by a low-cost Creative webcam. The frame resolution is 640 by 480, and the frame rate is 6 fps. The quality of the webcam is close to that of many cellphone video cameras. We allow the webcam to move during video shooting and do not require that either the foreground or the background be known or static. We also show one example which is the right view from a stereo setup (the sequence is from http://research.microsoft.com/vision/cambridge/i2i/DSWeb.htm), where the person is close to the camera and there are moving people in the background, which violates the affine motion assumptions.

3.1 Calendar sequence

This sequence was taken from a rigid scene with a moving camera. The scene is composed of a desktop calendar as the foreground object and a relatively flat background. Figure 1 (a) shows the 1st, 6th, 11th, 16th, and 21st frames of the sequence. Due to the low quality of the webcam, shape distortions can be clearly observed in these images. Figure 1 (b) presents the foreground layer extracted by the proposed method. From these images it is clear that, except for the mis-classification of some pixels along the boundaries of the calendar, the foreground object has been correctly extracted in its entirety throughout the whole sequence.

Figure 1: Calendar sequence: (a) Original video frames where distortions exist around image boundaries. (b) Foreground layers extracted by the proposed method.

3.2 Human sequence

The second sequence was taken from a person moving and talking in front of a camera while holding the camera himself. The camera was shaking randomly with the person's movement. Most features on the face were undergoing non-rigid motions. There are areas in the background with colors that are almost identical to the person's hair. Moreover, the hair has a very irregular shape, with some portions sticking out considerably (the top and right-ear portions). The background is not flat and most of it is textureless. Despite these difficulties, it is clear from Figure 2 that our foreground segmentation method was able to extract the person and his hair with relatively high accuracy. Again, there exist a few mis-classifications along the boundary of the hair.

3.3 Stereo sequence (only right view)

This sequence is one of the two stereo videos used for depth computation; we used the right view only. It shows a moving person who is very close to the camera, with non-rigid facial motion, large depth variations on her face, and a large range of body movements. The background is dynamic, with two people moving, as seen in Figure 3 (a). The results in Figure 3 (b) reveal that the proposed method is capable of dealing with affine model violations, non-rigid objects, non-rigid motions, and outliers in motion computation. Compared with the results from the stereo method [17], there exist more errors around the background area where people are passing, since we do not have depth information from stereo matching.

3.4 Improvement by temporal constraints

To reveal how much improvement the joint spatio-temporal linear regression method makes, we first show two pairs of segmentation results on sparse features side by side in Figure 4. Results in (a) and (c) are generated by the simple linear regression approach on a single frame, while results in (b) and (d) are generated by the joint spatio-temporal linear regression method. The image pairs (a)-(b) and (c)-(d) show the results of the same frames, from the Calendar and the Human sequences, respectively. Feature points in the two motion layers are displayed in red and green. Motion layers generated by the single-frame linear regression are fragmentary and contain many mis-segmentations. In contrast, motion layers generated by the joint spatio-temporal linear regression are complete, accurate, and remarkably surpass the results of the simple linear regression in all respects. Figure 5 compares the foreground objects extracted by the single-frame linear regression (Figure 5 (a)) and by the joint spatio-temporal linear regression (Figure 5 (b)). The proposed method is able to overcome the errors caused by blurred images due to sudden camera motion, temporarily static cameras, and drastic facial movements, because it considers multiple frames in the linear regression process (we used 5 frames in this experiment).

The computational cost of the joint spatio-temporal linear regression is approximately O(NL), where N is the total number of feature points used in the joint linear regression and L is the number of iterations of the EC process (see Section 2.2 for details). Our experiments have shown that the regression converges within 10 to 20 iterations. We can achieve five frames per second on a PC with a Pentium IV 2.0 GHz CPU.

The proposed method aims at applications for mobile and home broadband Internet users, so it must be as inexpensive as possible in terms of computational requirements. For comparison, we implemented the widely used RANSAC (Random Sample Consensus) approach and conducted the following experiment. With 4000-6000 feature points per frame, we ran RANSAC 100 times with a sampling size of 100-200 feature points for each run. The execution time is around 600-900 ms/frame, which is about 4-5 times slower than our joint spatio-temporal linear regression method, with no remarkable improvement in segmentation results.

4. CONCLUSION

In this paper we proposed a foreground object segmentation method that aims at applications for 3G cellphone and home broadband Internet users. Compared with many existing methods, which either require large amounts of human input or assume rigid objects, our method is fully automatic and is able to handle any type of object. To accomplish the task of foreground object segmentation with low computational cost and high segmentation accuracy, we conduct the motion layer segmentation in two steps. For M consecutive video frames, we first extract sparse feature points such as edge and corner points, and compute two sparse motion layers from the optical flows of these feature points. The joint spatio-temporal linear regression method is developed to generate more reliable and temporally smoother motion layers by enforcing temporal consistencies among adjacent video frames. Once the two sparse motion layers have been identified for the edge and corner points, we create the corresponding dense motion layers using the Markov Random Field (MRF) model. The MRF model clusters the rest of the image pixels into either of the motion layers by considering both the color attributes and the spatial relations between each pixel and its surrounding edge/corner points. Experimental results on webcam videos are promising. The method is suitable for home Internet users where computation power is limited, and its results can also replace the human input in most image matting approaches to generate the initial trimaps.

5. REFERENCES

[1] N. Apostoloff and A. W. Fitzgibbon. Bayesian video matting using learnt image priors. In CVPR04, pages I: 407–414, 2004.
[2] S. Ayer and H. Sawhney. Layered representation of motion video using robust maximum-likelihood estimation of mixture models and MDL encoding. In ICCV95, pages 777–784, 1995.
[3] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. PAMI, 26(9), September 2004.
[4] J. Canny. A computational approach to edge detection. PAMI, 8(6), 1986.
[5] Y. Y. Chuang, A. Agarwala, B. Curless, D. H. Salesin, and R. Szeliski. Video matting of complex scenes. In ACM SIGGRAPH 2002, pages II: 243–248, 2002.
[6] Y. Y. Chuang, B. Curless, D. H. Salesin, and R. Szeliski. A Bayesian approach to digital matting. In CVPR01, pages II: 264–271, 2001.
[7] A. Criminisi and A. Blake. The SPS algorithm: patching figural continuity and transparency by split-patch search. In CVPR, pages I: 342–349, 2004.
[8] T. J. Darrell and A. P. Pentland. Cooperative robust estimation using layers of support. PAMI, 17(5):474–487, May 1995.
[9] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(B):1–38, 1977.
[10] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on PAMI, 6:721–741, 1984.
[11] A. T. Ihler, J. W. Fisher, and A. S. Willsky. Message errors in belief propagation. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 609–616. MIT Press, Cambridge, MA, 2005.
[12] A. D. Jepson and M. J. Black. Mixture models for optical flow computation. In CVPR93, 1993.
[13] A. D. Jepson, D. J. Fleet, and M. J. Black. A layered motion representation with occlusion and compact spatial support. In ECCV02, pages I: 692 ff., 2002.
[14] N. Jojic and B. J. Frey. Learning flexible sprites in video layers. In CVPR01, pages I: 199–206, 2001.
[15] Q. Ke and T. Kanade. A subspace approach to layer extraction. In CVPR01, pages I: 255–262, 2001.
[16] S. Khan and M. Shah. Object based segmentation of video using color, motion and spatial information. In CVPR01, pages II: 746–751, 2001.
[17] V. Kolmogorov, A. Criminisi, A. Blake, G. Cross, and C. Rother. Bi-layer segmentation of binocular stereo video. In CVPR, page II: 1186, 2005.
[18] Y. Li, J. Sun, and H. Shum. Video object cut and paste. In ACM SIGGRAPH 2005, 2005.
[19] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In IJCAI81, pages 674–679, 1981.
[20] C. Rother, V. Kolmogorov, and A. Blake. GrabCut – interactive foreground extraction using iterated graph cuts. In ACM SIGGRAPH 2004, pages 309–314, 2004.
[21] M. A. Ruzon and C. Tomasi. Alpha estimation in natural images. In CVPR00, pages I: 18–25, 2000.
[22] J. Shi and J. Malik. Motion segmentation and tracking using normalized cuts. In ICCV98, pages 1154–1160, 1998.
[23] H. Shum, J. Sun, S. Yamazaki, Y. Li, and C. Tang. Pop-up light field: An interactive image-based modeling and rendering system. ACM Transactions on Graphics, 23(2):143–162, 2004.
[24] J. Sun, J. Y. Jia, C. K. Tang, and H. Y. Shum. Poisson matting. In ACM SIGGRAPH 2004, pages 315–321, 2004.
[25] P. H. S. Torr, R. Szeliski, and P. Anandan. An integrated Bayesian approach to layer extraction from image sequences. PAMI, 23(3):297–303, March 2001.
[26] J. Wang and M. Cohen. An iterative optimization approach for unified image segmentation and matting. In ICCV05, 2005.
[27] J. Y. A. Wang and E. H. Adelson. Representing moving images with layers. IP, 3(5):625–638, September 1994.
[28] Y. Wang and Q. Ji. A dynamic conditional random field model for object segmentation in image sequences. In CVPR05, pages I: 264–270, 2005.
[29] J. Xiao and M. Shah. Motion layer extraction in the presence of occlusion using graph cuts. In CVPR04, pages II: 972–979, 2004.


Figure 2: Human sequence: (a) Original video frames with non-rigid movements. (b) Foreground layers extracted by the proposed method.

Figure 3: Stereo sequence: (a) Original video frames involving foreground human motion, non-rigid facial movements, and moving people in the background. (b) Foreground layers extracted by the proposed method.


Figure 4: Comparisons of motion layer segmentation results: (a) and (c) Motion layers generated by the single frame linear regression. (b) and (d) Motion layers generated by the joint spatio-temporal linear regression.

Figure 5: Comparisons of foreground object extraction results: (a) Objects extracted by the single frame linear regression. (b) Objects extracted by the joint spatio-temporal linear regression.

