Video Object Segmentation Via Dense Trajectories


Lin Chen, Jianbing Shen, Senior Member, IEEE, Wenguan Wang, and Bingbing Ni

Abstract—In this paper, we propose a novel approach to segment moving objects in video by utilizing improved point trajectories. First, point trajectories are densely sampled from the video and tracked through optical flow, which provides information about long-term temporal interactions among objects in the video sequence. Second, a novel affinity measurement method that considers both the global and local information of point trajectories is proposed to cluster trajectories into groups. Finally, we propose a new graph-based segmentation method that adopts both the local and global motion information encoded by the tracked dense point trajectories. The proposed approach achieves good performance on trajectory clustering, and it also obtains accurate video object segmentation results on both the Moseg dataset and our new dataset containing more challenging videos.

Index Terms—Dense trajectories, energy optimization, global motion information, point trajectory clustering, video object segmentation.

I. INTRODUCTION

Video object segmentation is a problem that provides a prerequisite for a wide range of applications, such as 3D reconstruction, action recognition, video retrieval, and scene understanding. There exist many approaches to obtain foreground objects from video sequences, including interactive methods (which require manual initialization) and fully automatic methods. Semi-supervised methods [10], [12], [20] require users to annotate a few frames and then propagate these annotations to other frames. Automatic algorithms [2], [9], [21], [24], [28], [30], [31] can process a large number of video sequences without human interaction. Many approaches achieve video object segmentation [24], [30] by using object proposals, and other approaches (e.g., [31]) analyze local motion cues in a small number of consecutive frames. Most of these automatic methods assume that objects keep moving throughout the entire video, and they only consider motion information over short temporal windows. As shown in Fig. 1(b), an obvious drawback of these methods is that they cannot deal with objects that move in a discontinuous manner.

Manuscript received January 13, 2015; revised June 13, 2015 and August 04, 2015; accepted September 17, 2015. Date of publication September 23, 2015; date of current version November 13, 2015. This work was supported in part by the National Basic Research Program of China (973 Program) under Grant 2013CB328805, by the National Natural Science Foundation of China under Grant 61272359, by the Fok Ying-Tong Education Foundation for Young Teachers, and by the Specialized Fund for Joint Building Program of Beijing Municipal Education Commission. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Vasileios Mezaris. L. Chen, J. Shen, and W. Wang are with the Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, Beijing 100081, China (e-mail: [email protected]). B. Ni is with the School of Electronic Information and Electrical Engineering, Shanghai Jiaotong University, Shanghai 200240, China (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TMM.2015.2481711

Fig. 1. Video object segmentation results of different methods. (a) Original video. (b) DAGVOS [30] using object proposals. (c) Our method analyzing long-term motion information.

Many methods [5], [6], [17], [18], [29] that exploit point trajectories with long-term motion cues have been proposed in recent years to solve the above-mentioned problem. Most of these approaches generate trajectories by tracking points over frames and calculate an affinity matrix to cluster the tracked trajectories. Video segmentation is then performed using the trajectory clustering results. The point trajectories, which follow the same object over a large number of successive frames, help to extract objects with discontinuous motion patterns. Trajectories encode motion features and can further exploit cues that indicate occlusion. However, most trajectory-based methods [5], [17] only consider the cues in the common frames of trajectory pairs to calculate the affinity matrix and miss the global information of trajectories over their life spans. Other methods [6], [18], [29] consider trajectory labels in each frame or in short sequences, but ignore the fact that erroneous trajectory labels interfere with segmentation results. Besides, all of these methods cannot directly distinguish foreground objects from background.


To address these issues, we propose a video object segmentation approach based on dense point trajectories. First, we densely sample points at multiple spatial scales and use optical flow to track them over frames. To calculate the affinity matrix of the tracked trajectories, we consider both local and global information. Then, we propose a segmentation method that adopts the local and global motion information of the video through Gaussian Mixture Model (GMM) modeling, which reduces the negative effect of wrongly assigned trajectory labels and naturally handles objects moving in a discontinuous way. Finally, we propose an approach to automatically distinguish foreground objects from background based on the trajectory clustering results. Our source code will be publicly available online at http://github.com/shenjianbing/trajectseg.

Our main contributions are summarized as follows.
• An affinity measure is proposed to calculate the affinity matrix of dense point trajectories, which considers both local and global information. This affinity measure can distinguish different moving objects and enhance the integrity of each object.
• We propose an efficient video object segmentation approach that uses the local and global motion information expressed by dense point trajectories. The approach reduces the interference of erroneous trajectory labels in individual frames and makes the segmentation consistent with the object as a whole.

The rest of the paper is organized as follows. Section II discusses related work. Section III details the proposed point trajectory clustering method, which includes point trajectory generation, affinity matrix calculation, and point trajectory clustering. Section IV introduces the proposed video object segmentation method using the clustered trajectories. Experiments and evaluations are given in Section V. Section VI concludes the paper.

II. RELATED WORK

Video object segmentation can be achieved with user initialization. Some video segmentation methods [10], [25] obtain the object region by accurately marking the object in the first frame and then tracking it to segment the object in the rest of the sequence. Other methods [12], [20] require object annotations in key frames. Although the above supervised approaches generally achieve good performance, their segmentation results may be sensitive to human interaction. On the other hand, many video applications require processing a large number of video sequences, which makes manual annotation infeasible.

Automatic methods have also been proposed for video object segmentation. Classic background subtraction methods [13], [14] maintain a background appearance model for each pixel and treat fast-moving pixels as foreground. However, these methods rely on a strict assumption that the camera moves slowly and steadily. Recently, video object segmentation methods based on object proposals were developed, which generate proposals per frame and then select primary object proposals to build object and background models. [22], [23] provided an explicit notion of what a generic object looks like.
In particular, [23] could extract object proposals from video frames. [24] detected the primary object with a pool of object proposals in video, and then applied spectral clustering to obtain foreground/background partitions. [7] utilized the relationships between object proposals in adjacent frames to account for their temporal order. [30] adopted optical flow to track the evolution of object shape and constructed a directed acyclic graph (DAG) for solving the primary object proposal selection problem. However, these methods explore motion information over a short term and do not perform well when objects remain static in some frames. This problem can be solved by analyzing motion information over a longer period. Many video object segmentation methods [5], [6], [17], [18], [29] based on long-term motion information carried by point trajectories have been proposed in recent years. These approaches generate point trajectories and then cluster the trajectories by calculating their affinity matrix. Segmentation results are obtained with the clustered trajectories as prior information. However, none of the above trajectory-based video object segmentation methods can directly determine which partition belongs to the foreground objects.

Point trajectories follow an object over many consecutive frames and provide valuable cues for video segmentation. The classic point tracking method Kanade-Lucas-Tomasi (KLT) [3] results in overly sparse point trajectories. [5], [6] obtained semi-dense trajectories by sampling points at the original image scale and tracked these points to the next frame with dense optical flow [15]. In order to capture more movements, we adopt dense trajectories [27] by sampling points densely at multiple spatial scales. [16] ran clustering on each video frame. [17] designed a track clustering cost function that incorporates occlusion reasoning and adopted the sequential tree-reweighted message passing algorithm [26] to optimize it. [5], [6], [17] took advantage of the information in the common frames of trajectory pairs to calculate their affinity. Our approach is inspired by these methods and considers the global as well as the local information of trajectories to obtain the affinity matrix, which offers more accurate clustering results.

Clustered trajectories with assigned labels are regarded as prior information to obtain a dense video segmentation. [17] segmented the video sequence into spatio-temporal over-segmented regions that preserve object boundaries by merging regions under the constraint of point trajectories. [29] proposed constrained Gabriel graphs as per-frame superpixel region maps and computed a dense video segmentation on the Gabriel superpixels. [18] turned point trajectory clusters into dense regions with a hierarchical variational approach. [6] adopted a Potts model to obtain dense maps from sparse trajectory labels. Most of the above methods consider trajectory labels in each frame or in short neighboring frames as prior information to generate the segmentation. We note that pixels share similar colors along a trajectory. Besides, point trajectories also reflect the spatio-temporal continuity of moving objects. Hence, our video object segmentation method not only explores the prior information of trajectory labels in each frame, but also trains a GMM for each category of trajectories to obtain the prior information. Motivated by GrabCut [4], we assume that most of the regions on the video boundaries belong to the background.


Fig. 2. Process of our algorithm. (a) Input video. (b) Trajectories clustered by our affinity measure method. (c) Video partitions using both local and global motion information extracted from trajectories. (d) Foreground object determined from partition results. (a) Input frames. (b) Trajectory labels. (c) Partition results. (d) Segmentation results.

Thus, we define background trajectories and foreground trajectories to achieve foreground object segmentation.

III. TRAJECTORY CLUSTERING

Trajectory labels intuitively indicate object regions and provide valuable prior information for video object segmentation (Section IV). Our point trajectory clustering approach consists of three steps: point trajectory generation, affinity matrix calculation, and clustering. To obtain dense trajectories, points are sampled at multiple spatial scales and tracked to the next frame using optical flow. The global and local information of trajectories is utilized to calculate the affinity matrix. Besides, trajectories that do not share common frames are connected through clustering with landmark-based spectral clustering (LSC) [8]. The trajectory clustering results of our method are shown in Fig. 2(b).

A. Trajectory Generation

Point trajectories take advantage of the long-term motion information of successive frames in a video and provide valuable cues for video segmentation. The way that trajectories are generated has a great influence on the clustering results. The classic point tracking method Kanade-Lucas-Tomasi (KLT) [3] only generates sparse tracks and cannot capture small movements. Recent methods such as [5], [6] adopt semi-dense trajectories and achieve better results. We present a dense trajectory generation approach that densely samples points on a grid, typically spaced by 5 pixels, and tracks them to adjacent frames using optical flow [15]. For an input video sequence, tracking should be stopped when a point becomes occluded; otherwise, a trajectory would be related to different objects. To judge whether a point trajectory is occluded, we consider three conditions; if any of them is not satisfied, occlusion is declared. First, a point $p_t$ in the $t$-th frame and its tracked point $p_{t+1}$ in the next frame should have similar color. Second, the spatial distance between $p_t$ and $p_{t+1}$ should be smaller than a threshold. Third, the forward flow vector at $p_t$ and the backward flow vector at $p_{t+1}$ should point in opposite directions. When occlusion occurs, the corresponding areas in the next frame are not covered by trajectory points; to fill these empty areas, new trajectories are created on them. Based on the above conditions, a point trajectory is defined as

$T_i = \{p^i_{t_i}, p^i_{t_i+1}, \ldots, p^i_{t_i+L_i-1}\}, \quad i = 1, \ldots, N \quad (1)$

where $L_i$ is the length of trajectory $T_i$, $p^i_t$ is the location of the $i$-th trajectory's point in the $t$-th frame, and $N$ is the total number of trajectories of the video sequence.
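The forward-backward consistency test described above can be made concrete with a few lines of code. The following is a minimal sketch of the occlusion check, assuming the two frames and the forward/backward optical flow fields are available as NumPy arrays; the threshold values and the function name are illustrative choices, not taken from the paper.

```python
import numpy as np

def track_survives(frame_t, frame_t1, p_t, p_t1, fwd_flow, bwd_flow,
                   color_thresh=30.0, dist_thresh=20.0, fb_thresh=1.0):
    """Return True if a tracked point passes the three checks of Sec. III-A:
    similar color, bounded displacement, and forward/backward flow consistency.
    All thresholds here are illustrative, not the paper's values."""
    x0, y0 = p_t
    x1, y1 = p_t1
    # 1) color similarity between the point and its tracked position
    color_diff = np.linalg.norm(frame_t[y0, x0].astype(float)
                                - frame_t1[y1, x1].astype(float))
    if color_diff > color_thresh:
        return False
    # 2) bounded spatial displacement between consecutive frames
    if np.hypot(x1 - x0, y1 - y0) > dist_thresh:
        return False
    # 3) forward flow at p_t and backward flow at p_{t+1} should cancel out
    if np.linalg.norm(fwd_flow[y0, x0] + bwd_flow[y1, x1]) > fb_thresh:
        return False
    return True
```

A trajectory is terminated as soon as this check fails, and new trajectories are spawned on the uncovered areas of the next frame.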

B. Affinities Between Trajectories

We now put forward a new affinity measure that considers both the local and global information of point trajectories. Local information is computed from the temporally overlapping part of a trajectory pair, while global information is obtained from a point trajectory over its whole life span. According to (1), each point trajectory spans a certain temporal sub-sequence of the video. Our approach calculates the affinities of trajectory pairs that share common frames. The affinities define the edge weights of a graph with trajectories as vertices. An $N \times N$ affinity matrix $A$ corresponds to this graph, and the element $A(i, j)$ indicates the affinity of a pair of trajectories $T_i$ and $T_j$. We comprehensively consider both the local and global motion information of point trajectories: as given in (2), the affinity $A(i, j)$ is computed from the global affinity $A_g(i, j)$ and the local affinity $A_l(i, j)$.


Fig. 3. Point trajectory clustering results for videos from the Moseg dataset and our dataset. In each group, first row: original frames; second row: point trajectory clustering results by Brox and Malik [5]; third row: point trajectory clustering results by Fragkiadaki et al. [29]; bottom row: our results. The first three video groups (cars05, cars09, people01) are taken from the Moseg dataset, and the other three groups (squirrel, kangaroo, people) are taken from our video dataset. (a) Cars05. (b) Cars09. (c) People01. (d) Squirrel. (e) Kangaroo. (f) People.

The local affinity is computed over the common frames of trajectory pairs and is used to distinguish different moving objects. Video contains complex motion information, and each object has its own motion pattern. To distinguish different objects, we emphasize the importance of the local information of trajectories. However, we cannot simply follow the Gestalt principle [1], which just assigns high affinities to trajectories with similar motion. For example, in Fig. 3(e), the two kangaroos share the same motion direction, but they are different objects. In Fig. 4(e), a dog walks at first and then sits down for a long time. According to the above analysis, we compute the local affinity of a trajectory pair from motion speed and spatial distance, as given in (3), where one factor is the speed affinity and the other is the spatial-distance affinity.

The speed difference and the speed affinity are negatively correlated. We adopt the derivative of a trajectory as its velocity. In order to distinguish different objects, the maximum speed difference between a pair of trajectories over their common frames is utilized to calculate their speed affinity in (4). However, when only a single common point is considered, trajectory noise would be magnified and serious over-segmentation would be produced. To solve this problem, we introduce a temporal window in (5), which approximates the velocity at time $t$ using a number of forward common frames; this window size is fixed empirically in our experiments. This dynamic motion model allows the motion speed of an object to change over time. To balance the velocity of objects, we introduce an adaptive bandwidth parameter. For a slow-moving object, this parameter should be large to distinguish it from the background.


Fig. 4. Video object segmentation results for videos in Moseg dataset and our dataset. In each group, first row: original frames; second row: segmentation results by FastMotionSeg [31]; third row: segmentation results of DAGVOS [30]; bottom row: our results. The first three video groups (Marple03, Marple05, cars06) are taken from Moseg dataset, and other three groups (hippo, dog, tiger) are taken from our video dataset. (a) Marple03. (b) Marple05. (c) Cars06. (d) Hippo. (e) Dog. (f) Tiger.

For a fast-moving object, the parameter should be small to avoid splitting the object into regions. This parameter is estimated from the optical flow weight and is computed for each trajectory over a local neighborhood. The spatial affinity is defined in (6), where the distance term is the spatial distance between $T_i$ and $T_j$ in frame $t$ and a weighting parameter adjusts the influence of the spatial affinity.

The local affinity relies on the maximum motion dissimilarity of point trajectories in their common frames. For an object whose body parts move in different directions or at different speeds, over-segmentation is hard to avoid. In Fig. 3(d), the head and tail of the squirrel move differently in the first few frames. In Fig. 3(e), the legs of the kangaroos always move faster and more frequently than their bodies. To reduce over-segmentation of moving objects, we adopt a global affinity, given in (7), which considers the motion intensity of a point trajectory over its whole lifetime. In (7), a weighting parameter adjusts the influence of the global affinity, and the optical flow weight of a trajectory, defined in (8), is obtained from the optical flow weight matrix, which also denotes the motion intensity of each pixel in the video.
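To make the structure of the affinity computation concrete, the sketch below assembles a pairwise trajectory affinity from a speed term and a spatial-distance term over the common frames, plus a global motion-intensity term over the whole life spans. Since equations (2)-(8) are not reproduced above, the Gaussian-kernel forms, the product combination, the trajectory representation, and all parameter values are assumptions made for illustration rather than the paper's exact formulation.

```python
import numpy as np

def trajectory_affinity(traj_a, traj_b, flow_weight_a, flow_weight_b,
                        sigma_speed=1.0, sigma_dist=30.0, sigma_global=1.0):
    """Illustrative pairwise affinity combining local and global cues.
    traj_a / traj_b: dicts with 'start' (first frame index) and 'points'
    ((L, 2) arrays of x-y locations). flow_weight_a/b: mean optical-flow
    magnitude along each trajectory. Kernel forms and parameters are assumed."""
    # temporal overlap (common frames) of the two trajectories
    s = max(traj_a['start'], traj_b['start'])
    e = min(traj_a['start'] + len(traj_a['points']),
            traj_b['start'] + len(traj_b['points']))
    if e - s < 2:
        return 0.0                      # no usable common frames
    pa = traj_a['points'][s - traj_a['start']:e - traj_a['start']]
    pb = traj_b['points'][s - traj_b['start']:e - traj_b['start']]
    # local cue 1: maximum speed difference over the common frames
    va, vb = np.diff(pa, axis=0), np.diff(pb, axis=0)
    speed_diff = np.max(np.linalg.norm(va - vb, axis=1))
    a_speed = np.exp(-(speed_diff ** 2) / (2 * sigma_speed ** 2))
    # local cue 2: average spatial distance between the trajectories
    dist = np.mean(np.linalg.norm(pa - pb, axis=1))
    a_dist = np.exp(-(dist ** 2) / (2 * sigma_dist ** 2))
    # global cue: difference in motion intensity over the whole life spans
    g_diff = abs(flow_weight_a - flow_weight_b)
    a_global = np.exp(-(g_diff ** 2) / (2 * sigma_global ** 2))
    return a_speed * a_dist * a_global
```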


C. Clustering Implementation

The LSC [8] algorithm is introduced to obtain trajectory clusters. It explores long-range connections between trajectories, so that even trajectories that are not in the same neighborhood can be connected. We calculate the eigenvectors and eigenvalues of the normalized affinity matrix $D^{-1/2} A D^{-1/2}$, where $D$ is the diagonal degree matrix whose entries are the row sums of $A$, and $A$ is the affinity matrix obtained in Section III-B. The eigenvectors whose eigenvalues are greater than 0.85 are chosen to construct a new matrix in (9), whose columns are the selected eigenvectors. Sparse coding is a technique that finds a set of basis vectors and a representation matrix by factorizing the data matrix, as in (10). We transform this matrix factorization into the optimization problem (11), where the basis is obtained by taking the cluster centers of k-means on the eigenvector matrix. The optimization problem (11) is then regarded as a constrained linear regression and solved by computing the coefficient matrix $Z$. We can then simply compute the graph matrix as in (12), with $Z$ normalized by its row sums. The Singular Value Decomposition (SVD) of the normalized $Z$ is given in (13), which yields the singular values together with the left and right singular vectors. Each row of the resulting singular vector matrix is regarded as a data point, and the clustering results are obtained with the k-means algorithm. A post-processing step based on the discontinuity embedding of [29] is further applied to obtain the trajectory labels and the label set of trajectory clusters. We summarize our trajectory clustering algorithm with pseudo-code in Algorithm 1.

Algorithm 1 Dense trajectory clustering
Input: Input video.
Output: Point trajectories; trajectory labels.
1: Calculate the optical flow of the video with the method of [15].
2: Obtain point trajectories by sampling points in the first frame at multiple spatial scales and tracking them to the next frame with optical flow.
3: Get the affinity matrix with (2).
4: Perform eigenvalue decomposition on the normalized affinity matrix.
5: Cluster with LSC to obtain the trajectory labels.
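For readers who want a runnable reference point, the following sketch performs a plain normalized spectral clustering of the trajectory affinity matrix (normalized affinity, leading eigenvectors, k-means on the embedded rows). It stands in for the full LSC-based pipeline of this section: the landmark selection and sparse-coding steps of LSC [8] are omitted, and the eigenvalue threshold of 0.85 is the only value taken from the text; everything else is an assumption of the sketch.

```python
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def cluster_trajectories(A, n_clusters=5, eig_thresh=0.85):
    """Cluster trajectories from an N x N symmetric affinity matrix A.
    Plain normalized spectral clustering, used here as a stand-in for LSC."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    # symmetrically normalized affinity D^{-1/2} A D^{-1/2}
    A_norm = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # leading eigenvectors of the normalized affinity
    vals, vecs = eigsh(A_norm, k=min(20, A.shape[0] - 1), which='LA')
    keep = vals > eig_thresh          # keep eigenvectors with large eigenvalues
    X = vecs[:, keep] if keep.any() else vecs[:, -n_clusters:]
    # each row of X is an embedded trajectory; group the rows with k-means
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    return labels
```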

IV. VIDEO OBJECT SEGMENTATION BY TRAJECTORY CLUSTERS

Video object segmentation methods usually utilize low-level cues such as color and motion information. In this section, we further utilize the global video motion information expressed by the point trajectory clusters to generate partition results, as shown in Fig. 2(c). Next, in order to identify foreground objects from the partitions, we introduce an effective hypothesis: a partition with a large area near the frame boundaries should be regarded as background. Based on this hypothesis, we create a rectangular border and count the number of trajectories of each label lying within it. The label that corresponds to the maximum number of trajectories in the border is regarded as a background trajectory label. In this way, we obtain a background trajectory label set as well as a foreground trajectory label set, and further find the video objects from the partitions automatically [Fig. 2(d)].

A. Obtaining Partitions of Video

The proposed segmentation framework obtains partitions of the frame domain with the prior information offered by the point trajectory clustering results. Each pixel $p$ in the $t$-th frame can take a label $f_p$. The goal is to find a labeling $f$ of all pixels that represents the partitions by minimizing an energy of the form

$E(f) = \sum_{p} \big[ U_{tr}(f_p) + U_{ap}(f_p) \big] + \sum_{(p,q) \in \mathcal{N}_s} V_{s}(f_p, f_q) + \sum_{(p,q) \in \mathcal{N}_t} V_{t}(f_p, f_q) \quad (14)$
where $\mathcal{N}_s$ indicates the spatial neighborhood and $\mathcal{N}_t$ indicates the temporal neighborhood. To ensure the temporal continuity of the segmentation, we consider the preceding and following frames when setting the six-neighborhood of a pixel and implement the segmentation over three consecutive frames. This energy function consists of two unary terms, $U_{tr}$ and $U_{ap}$, and two pairwise terms, $V_s$ and $V_t$. The trajectory term $U_{tr}$ encourages a pixel to belong to the partition whose label is that of the trajectory crossing the pixel in the $t$-th frame. The unary appearance term $U_{ap}$ tends to label pixels with the prior models trained from the global motion information of all trajectories. The term $U_{tr}$ is based on our trajectory clustering results and penalizes a pixel whose label disagrees with its trajectory label; it is defined as

$U_{tr}(f_p) = \begin{cases} 0, & \text{if } f_p \text{ equals the label of the trajectory crossing } p \\ \beta, & \text{otherwise} \end{cases} \quad (15)$

where $\beta$ is a positive weight that we set empirically in our experiments. The unary term $U_{ap}$ defines the cost of classifying pixel $p$ into a trajectory cluster according to their appearance similarity. To model the partition appearance, the color cues of trajectories are extracted to construct Gaussian Mixture Models (GMM). We compute a GMM for each category of trajectories over the whole video.


The unary appearance term is then computed as in (16), where the cost depends on the probability of assigning label $f_p$ to pixel $p$ under the estimated GMMs. This term significantly improves segmentation performance. To obtain more accurate segments, we train the model on the color information of each kind of trajectory over the whole video. This operation not only reduces the interference of erroneous trajectory labels, but also utilizes the appearance information of the whole video sequence for training the GMM. Besides, the temporal consistency of the segmentation is enhanced by considering all trajectories in the video.
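As a concrete illustration of the appearance term, the sketch below fits one color GMM per trajectory cluster on the colors sampled along that cluster's trajectories over the whole video, and evaluates a per-pixel cost for every label. The use of scikit-learn, the number of mixture components, and the negative log-likelihood form of the cost are assumptions made for the sketch; the paper's exact definition of (16) is not reproduced here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_cluster_gmms(colors_per_cluster, n_components=5):
    """Fit one color GMM per trajectory cluster.
    colors_per_cluster: dict {label: (M, 3) array of RGB samples gathered
    along all trajectories of that cluster over the whole video}.
    n_components is an assumed value, not the paper's setting."""
    return {label: GaussianMixture(n_components=n_components).fit(c)
            for label, c in colors_per_cluster.items()}

def appearance_cost(frame, gmms):
    """Per-pixel appearance cost for every label: negative log-likelihood
    under the corresponding cluster GMM (an assumed form of the unary term)."""
    h, w, _ = frame.shape
    pixels = frame.reshape(-1, 3).astype(float)
    cost = np.zeros((h, w, len(gmms)))
    for k, (label, gmm) in enumerate(sorted(gmms.items())):
        cost[:, :, k] = -gmm.score_samples(pixels).reshape(h, w)
    return cost
```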

The pairwise terms $V_s$ and $V_t$ are adopted to measure the similarity between pixels and to encourage spatial and temporal smoothness. Two neighboring pixels with similar color are more likely to be assigned the same label, as expressed by

$V(f_p, f_q) = [f_p \neq f_q] \cdot g(p, q) \quad (17)$

where $[\cdot]$ is a label indicator function that equals 1 if pixels $p$ and $q$ have different labels and 0 otherwise, and the function $g$ is defined as

$g(p, q) = \dfrac{1}{\|c_p - c_q\| + \varepsilon} \quad (18)$

where $\|c_p - c_q\|$ is the Euclidean color distance between the two pixels $p$ and $q$, and $\varepsilon$ is a small constant (typically 0.001) that prevents division by zero. The energy optimization method of [11] is adopted to optimize (14) and obtain the partition results.

B. Determining Object From Partitions

In order to determine the object from the partition results, we define background trajectories and foreground trajectories as follows. A rectangular border covering the boundary region of the video is created, and for each frame we count the number of trajectories of each label within it. The label with the maximal count belongs to the background label set, as formalized in (19) and (20), where the count is the number of trajectories of a given label inside the rectangular border of the $t$-th frame and duplicate labels collected over frames are removed. A trajectory with a background label is regarded as a background trajectory; the remaining trajectories are foreground trajectories, and their labels form the foreground label set. Then we check the trajectory labels on each region of the partition: a region containing more foreground labels than background labels is regarded as a foreground region. In this way, the final video object segmentation is generated.
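The border-based foreground/background decision can be sketched as follows: count trajectory labels inside a rectangular band along the frame boundary, treat the most frequent labels as background, and then classify each partition region by a majority vote of the trajectory labels it contains. The band width and the helper names are illustrative; they are not values specified in the paper.

```python
import numpy as np
from collections import Counter

def background_labels(traj_points_per_frame, frame_shape, band=20):
    """Collect, per frame, the trajectory label that occurs most often inside a
    band of `band` pixels along the frame boundary; the union of these
    per-frame winners is treated as the background label set."""
    h, w = frame_shape
    bg = set()
    for points in traj_points_per_frame:       # list of (x, y, label) per frame
        counts = Counter(label for x, y, label in points
                         if x < band or x >= w - band or y < band or y >= h - band)
        if counts:
            bg.add(counts.most_common(1)[0][0])
    return bg

def label_region(region_traj_labels, bg_labels):
    """Mark a partition region as foreground if it contains more foreground
    trajectory labels than background ones (majority vote)."""
    n_bg = sum(1 for l in region_traj_labels if l in bg_labels)
    n_fg = len(region_traj_labels) - n_bg
    return 'foreground' if n_fg > n_bg else 'background'
```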

We summarize our video object segmentation algorithm with pseudo-code in Algorithm 2.

Algorithm 2 Video object segmentation
Input: Input video.
Output: Video object segmentation results.
1: Obtain point trajectories and the corresponding trajectory labels with Algorithm 1.
2: Compute a GMM for each category of point trajectories.
3: Calculate the background and foreground label sets by (20).
4: for each frame do
5: Compute the unary terms with (15) and (16).
6: Obtain the pairwise terms through (17).
7: Minimize (14) to obtain the partition of the frame.
8: Create the object from the partition and the foreground/background label sets.
9: end for

V. EXPERIMENTS AND EVALUATIONS

A. Datasets and Evaluation Methods

The proposed method is evaluated on the Moseg dataset [5], which provides annotations for 26 sequences taken from detective movies or the Hopkins dataset. We chose the first 50 frames of each video to test our approach and the comparison algorithms; for videos with fewer than 50 frames, we tested our method on all frames. Although the Moseg dataset spans a range of difficulties, the types of scenes are limited and the motion patterns of the objects are simple. To more thoroughly evaluate our method and establish a benchmark for future work, we introduce a new dataset that includes 16 videos, taken from the Berkeley video dataset and the YouTube-Objects dataset [19], covering 12 categories in total. The selected video sequences range in length from 32 to 242 frames and exhibit major challenges in motion segmentation such as complex background (e.g., squirrel, tiger), object interaction (e.g., kangaroo), occlusion (e.g., people), and object deformation (e.g., dog). Several frames of each sequence are annotated as ground truth.

The proposed approach is able to accurately cluster point trajectories and segment the foreground objects in video sequences in an unsupervised way. In our experiments, we first test our point trajectory clustering method. Even though it is not the final goal, we evaluate the effectiveness of our approach by testing the clustered point trajectories with the evaluation tool delivered with the Moseg dataset. The evaluation tool yields five evaluation measures. Density is the coverage percentage of labels on the video sequence. Clustering error is the percentage of wrong labels in the total number of trajectory labels. Per-region clustering error is the average error of each region. The over-segmentation measure evaluates the number of clusters needed to fit the ground truth. Extracted objects is the number of objects with less than 10% wrong trajectory labels.

We further test our segmentation results on the Moseg dataset and our dataset.


TABLE I
Results on the Moseg dataset. Our method has the highest density, the largest number of extracted objects, and the lowest clustering error.

TABLE II
Results for all 50 frames of our dataset. Our method achieves the highest density, the largest number of extracted objects, and the lowest clustering error.

In order to comprehensively evaluate the dense segmentation, the F-measure, which comprehensively considers both precision and recall, is introduced to offer a reliable segmentation evaluation:

$F = \dfrac{2\,|S \cap G|}{|S| + |G|} \quad (21)$

where $S$ corresponds to the video segmentation result and $G$ indicates the ground truth.
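A minimal sketch of this evaluation, assuming the segmentation and the ground truth are binary masks; the Dice-style form above is our reading of (21), and it is equivalent to 2PR/(P+R).

```python
import numpy as np

def f_measure(seg_mask, gt_mask):
    """F-measure between a binary segmentation mask S and ground truth G,
    computed as 2|S∩G| / (|S| + |G|), which equals 2PR / (P + R)."""
    s = seg_mask.astype(bool)
    g = gt_mask.astype(bool)
    inter = np.logical_and(s, g).sum()
    denom = s.sum() + g.sum()
    return 2.0 * inter / denom if denom > 0 else 1.0
```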

B. Trajectory Clustering Evaluation

In the proposed algorithm, both the global and local information of point trajectories is utilized to cluster the trajectories into groups. Using the code obtained from the corresponding authors, we compare our results with two state-of-the-art methods, [5] and [29]. Fig. 3 shows the comparison between our algorithm and the previous methods [5], [29]. The first three groups are taken from the Moseg dataset, and the other groups are chosen from our dataset. The performance of the previous techniques [5], [29] is not satisfactory, and some foreground objects cannot even be correctly marked. For example, in Fig. 3(a), the two cars are assigned the same label by [5], and many pixels near the left car are misclassified with erroneous labels by [29]. A similar situation also occurs in Fig. 3(b). It is difficult for the previous methods [5], [29] to precisely mark objects when the objects have similar or complex motion patterns. In contrast, the proposed framework adopts dense trajectories and a global as well as local affinity, which allows it to classify the trajectories on the objects into the correct clusters across the whole video. In Fig. 3(c), obvious over-segmentation on the legs of the person is generated by [5]. Although [29] marks most parts of the person, our approach produces more correct labels on the moving object. That is because our point trajectory affinity is calculated by considering both global and local information, which produces more satisfactory results.

We also present quantitative evaluation results of our method compared with [5], [29] on the Moseg dataset in Table I. The results clearly demonstrate that our method obtains the highest density and the largest number of extracted objects, with the lowest clustering error. Although the videos from the Moseg dataset span a range of difficulties, the dataset is limited in the number of object categories and the simplicity of the object motion patterns. In order to better illustrate the effectiveness of our method, we test the proposed method on our new dataset.

A qualitative comparison of different trajectory clustering methods on three typical videos is shown in Fig. 3(d)-(f). Fig. 3(d) gives a difficult example in which a squirrel moves only in the first few frames and remains still for a long period of time. The methods [5], [29] face difficulty in this situation; their results exhibit obvious over-segmentation. Benefiting from long-term motion analysis and dense trajectory extraction, our method is able to mark the whole object in its entirety. Although the squirrel is static in most frames, it is still successfully separated from the background by the proposed framework. Our method shows superiority under complex object motion [e.g., Fig. 3(e)] and large object deformation [e.g., Fig. 3(f)], as we utilize a discriminative affinity measurement. Additionally, in Fig. 3(f), our method also performs well when a jeep gradually emerges, and it tells apart two cyclists with similar movements. For this scene, [5] treats the two persons as one object, and [29] shows obvious clustering errors at the head and shoulder of the left person. The quantitative evaluation on our dataset is given in Table II. As can be seen, our approach achieves the best performance, with a clustering error as low as 2.23%, which indicates that our clustering results are more precise and more responsive to the objects. Similar conclusions can be drawn from the per-region clustering error. Our density and number of extracted objects are well above or on a par with those of the other methods, which indicates the effectiveness of the proposed point trajectory clustering framework. Overall, the better performance of our method owes to our dense trajectories and to the affinity measurement that considers both the global and local information of trajectories.

C. Video Object Segmentation Evaluation

Based on our trajectory clustering results, our framework produces both spatially and temporally coherent object segmentation results for videos. We train GMMs on the color cues of each category of trajectories over their life spans to obtain prior information, further enhancing spatio-temporal continuity. Furthermore, this operation also reduces the impact of erroneous trajectory labeling in each frame and improves the accuracy of the segmentation. We compare our segmentation approach with the state-of-the-art approaches FastMotionSeg [31] and DAGVOS [30] on the Moseg dataset [5] and our dataset.

Fig. 4 shows the qualitative results for the videos of the Moseg dataset [(a)-(c)] and our dataset [(d)-(f)]. It can be observed that our method is able to segment objects with blurred background (e.g., Marple03, hippo) and foreground/background color overlap (e.g., Marple05, dog), and it also produces accurate segmentation even when the foreground objects have fast motion patterns (e.g., cars06) or a complex background (e.g., tiger).


Fig. 5. F-measure results and comparison with state-of-the-art methods: FastMotionSeg [31] and DAGVOS [30].

One reason for the superior performance compared with the video segmentation methods FastMotionSeg and DAGVOS is that our method fully explores the object motion information by analyzing the trajectories of objects. Point trajectories locate object regions in the video and provide valuable prior information for segmentation. The segmentation results of FastMotionSeg are not accurate in many videos in Fig. 4. For example, in the cars06 group, some fragments are generated and parts of the background regions are misclassified as foreground. In the result of FastMotionSeg on the hippo group, many pixels within the hippo are wrongly assigned to the background and the segmentation boundaries are not smooth. This method only uses motion information estimated from two consecutive frames, which is sometimes unreliable and misleads the final segmentation result. Additionally, this method cannot work well in many scenes for objects without distinct motion relative to the background. The proposal-based method DAGVOS combines both motion and appearance information to measure the objectness of a proposal and utilizes object proposals for inferring the object. Its results on some examples are satisfactory, but this method cannot handle objects with appearance or motion similar to the background. From the Marple03 and Marple05 groups, we can find that a lot of pixels are given incorrect labels by DAGVOS.

For quantitative evaluation, we introduce the F-measure, which considers precision and recall simultaneously and offers a reliable segmentation evaluation. The F-measure of our method compared with FastMotionSeg and DAGVOS for each video from the Moseg dataset as well as our dataset is depicted in Fig. 5. It can be seen that our method achieves the best F-measure on both datasets, significantly outperforming the other methods. The proposed approach analyzes motion information over whole video sequences with point trajectories. The point trajectories follow objects and transfer motion information across long runs of consecutive frames. The trajectories intuitively indicate object regions and provide valuable cues for segmentation, resulting in better performance.

VI. CONCLUSION

We present a video object segmentation approach that considers the global motion information of the video with dense point trajectories. In order to make full use of the motion information and to cover the video sufficiently, dense point trajectories are obtained by sampling points densely at multiple spatial scales and tracking them with dense optical flow. Besides, a new affinity measurement method considering the global as well as the local information of point trajectories is proposed to calculate the affinity matrix. A dense partition is generated by graph cut, with the prior information obtained through a GMM for each category of point trajectories. The proposed method is also able to automatically identify the foreground objects from the partition results. For an accurate evaluation, we compare the proposed method with state-of-the-art approaches on the Moseg dataset and our challenging dataset. Our algorithm is evaluated from two aspects: the performance of the trajectory clustering algorithm and the performance of the video segmentation algorithm. Quantitative and qualitative experiments demonstrate that our method achieves promising performance in both respects.

REFERENCES

[1] M. Wertheimer, “Laws of organization in perceptual forms,” in A Source Book of Gestalt Psychology. New York, NY, USA: Harcourt, 1938, pp. 71–88.
[2] H. Jiang, G. Zhang, H. Wang, and H. Bao, “Spatio-temporal video segmentation of static scenes and its applications,” IEEE Trans. Multimedia, vol. 17, no. 1, pp. 3–15, Jan. 2015.
[3] J. Shi and C. Tomasi, “Good features to track,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recog., Jun. 1994, pp. 593–600.
[4] C. Rother, V. Kolmogorov, and A. Blake, “GrabCut: Interactive foreground extraction using iterated graph cuts,” ACM Trans. Graph., vol. 23, no. 3, pp. 309–314, 2004.
[5] T. Brox and J. Malik, “Object segmentation by long term analysis of point trajectories,” in Proc. ECCV, 2010, pp. 282–295.
[6] P. Ochs, J. Malik, and T. Brox, “Segmentation of moving objects by long term video analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 6, pp. 1187–1200, Jun. 2014.
[7] T. Ma and L. Latecki, “Maximum weight cliques with mutex constraints for video object segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2012, pp. 670–677.
[8] X. Chen and D. Cai, “Large scale spectral clustering with landmark-based representation,” in Proc. AAAI, 2011, pp. 313–318.
[9] W. Wang, J. Shen, X. Li, and F. Porikli, “Robust video object cosegmentation,” IEEE Trans. Image Process., vol. 24, no. 10, pp. 3137–3148, Oct. 2015.


[10] D. Tsai, M. Flagg, A. Nakazawa, and J. M. Rehg, “Motion coherent tracking using multi-label MRF optimization,” Int. J. Comput. Vis., vol. 100, no. 2, pp. 190–202, 2012.
[11] Y. Boykov and V. Kolmogorov, “An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 9, pp. 1124–1137, Sep. 2004.
[12] B. L. Price, B. S. Morse, and S. Cohen, “LIVEcut: Learning-based interactive video segmentation by evaluation of multiple propagated cues,” in Proc. IEEE 12th Int. Conf. Comput. Vis., Sep.–Oct. 2009, pp. 779–786.
[13] S. Brutzer, B. Hoferlin, and G. Heidemann, “Evaluation of background subtraction techniques for video surveillance,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2011, pp. 1937–1944.
[14] O. Barnich and M. Van Droogenbroeck, “ViBe: A universal background subtraction algorithm for video sequences,” IEEE Trans. Image Process., vol. 20, no. 6, pp. 1709–1724, Jun. 2011.
[15] T. Brox and J. Malik, “Large displacement optical flow: Descriptor matching in variational motion estimation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 3, pp. 500–513, Mar. 2011.
[16] G. J. Brostow and R. Cipolla, “Unsupervised Bayesian detection of independent motion in crowds,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recog., Jun. 2006, vol. 1, pp. 594–601.
[17] J. Lezama, K. Alahari, J. Sivic, and I. Laptev, “Track to the future: Spatio-temporal video segmentation with long-range motion cues,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2011, pp. 3369–3376.
[18] P. Ochs and T. Brox, “Object segmentation in video: A hierarchical variational approach for turning point trajectories into dense regions,” in Proc. IEEE Int. Conf. Comput. Vis., Nov. 2011, pp. 1583–1590.
[19] A. Prest, C. Leistner, and J. Civera, “Learning object class detectors from weakly annotated video,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2012, pp. 3282–3289.
[20] J. Yuen, B. Russell, C. Liu, and A. Torralba, “LabelMe video: Building a video database with human annotations,” in Proc. IEEE Int. Conf. Comput. Vis., Sep.–Oct. 2009, pp. 1451–1458.
[21] W. Wang, J. Shen, and L. Shao, “Consistent video saliency using local gradient flow optimization and global refinement,” IEEE Trans. Image Process., vol. 24, no. 11, pp. 4185–4196, Nov. 2015.
[22] J. Carreira and C. Sminchisescu, “Constrained parametric min-cuts for automatic object segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2010, pp. 3241–3248.
[23] I. Endres and D. Hoiem, “Category independent object proposals,” in Proc. ECCV, 2010, pp. 575–588.
[24] Y. Lee, J. Kim, and K. Grauman, “Key-segments for video object segmentation,” in Proc. IEEE Int. Conf. Comput. Vis., Nov. 2011, pp. 1995–2002.
[25] P. Chockalingam, N. Pradeep, and S. Birchfield, “Adaptive fragments-based tracking of non-rigid objects using level sets,” in Proc. IEEE Int. Conf. Comput. Vis., Sep.–Oct. 2009, pp. 1530–1537.
[26] V. Kolmogorov, “Convergent tree-reweighted message passing for energy minimization,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 10, pp. 1568–1583, Oct. 2006.
[27] H. Wang, A. Klaser, and C. Schmid, “Action recognition by dense trajectories,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2011, pp. 3169–3176.


[28] W. Wang, J. Shen, and F. Porikli, “Saliency-aware geodesic video object segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2015, pp. 3395–3402.
[29] K. Fragkiadaki, G. Zhang, and J. Shi, “Video segmentation by tracing discontinuities in a trajectory embedding,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2012, pp. 1846–1853.
[30] D. Zhang, O. Javed, and M. Shah, “Video object segmentation through spatially accurate and temporally dense extraction of primary object regions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2013, pp. 628–635.
[31] A. Papazoglou and V. Ferrari, “Fast object segmentation in unconstrained video,” in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1777–1784.

Lin Chen is currently working toward the M.S. degree in computer science at the Beijing Institute of Technology, Beijing, China. His current research interests include video object segmentation algorithms using motion trajectories.

Jianbing Shen (M’11–SM’12) received the Ph.D. degree in computer science from Zhejiang University, Hangzhou, China, in 2007. He is currently a Full Professor with the School of Computer Science, Beijing Institute of Technology, Beijing, China. He has authored or coauthored about 50 journal and conference papers, appearing in publications that include the IEEE CVPR, IEEE ICCV, and various IEEE Transactions. Prof. Shen is on the Editorial Board of Neurocomputing. He has also been the recipient of flagship honors from various institutions, including the Fok Ying Tung Foundation from the Ministry of Education, the Program for Beijing Excellent Youth Talents from the Beijing Municipal Education Commission, and the Program for New Century Excellent Talents in University from the Ministry of Education.

Wenguan Wang is currently working toward the Ph.D. degree in computer science at the Beijing Institute of Technology, Beijing, China. His current research interests include video saliency and segmentation using optimization.

Bingbing Ni received the Ph.D. degree from the National University of Singapore (NUS), Singapore, in 2011. He was a Research Intern with Microsoft Research Asia, Beijing, China, in 2009. He was a Software Engineer Intern with Google Inc., Mountain View, CA, USA, in 2010. He is currently a Research Scientist with the Advanced Digital Sciences Center, Singapore. His current research interests include computer vision, machine learning, and multimedia.