Semantic Multi-body Motion Segmentation Cosimo Rubino
Marco Crocco
Vittorio Murino
Alessio Del Bue
Visual Geometry and Modelling Lab - VGM Pattern Analysis and Computer Vision - PAVIS Istituto Italiano di Tecnologia - IIT Via Morego 30, 16163 Genova, Italy
[email protected]
Abstract

This paper presents a method to deal with the multi-body segmentation problem using a set of 2D point matches between two views. The key feature of our approach is the explicit inclusion of higher-level semantic information, as given by general-purpose object detectors, to boost the segmentation of the moving objects. In the classical formulation of the problem, only 2D matched points between views are used to identify independently moving objects, based on the principle that a set of points belonging to a moving object satisfies some given multi-view relations (e.g. multi-body epipolar constraints). We improve and speed up this process by including the information that a set of 2D matches may belong to the same object, given the output of a detector. As such, instead of sampling points uniformly with a RANSAC-based strategy, the selection of the matches is driven by the position and confidence score of the object detectors. Evaluation on challenging synthetic and real datasets shows a remarkable improvement with respect to previous approaches, regarding both the number of iterations required to segment a scene and the effectiveness of the segmentation itself, often making the difference between a satisfactory segmentation and almost complete failure.
1. Introduction

One of the main goals in computer vision is to recover the composition of a scene by understanding the 3D spatial configuration of objects using geometric and algebraic relations among several uncalibrated views. In particular, motion is a strong cue for identifying dynamic elements in the scene, and multi-view relations have been used in the past to segment moving objects. Given a set of matched 2D feature points among views, standard feature-based approaches attempt to fit multiple motion models to the subsets of points that best comply with multi-view geometric relations. The standard strategy is to perform a random sampling and consensus procedure (RANSAC) [17] where the different models are selected sequentially. However, as the number of independent motions in the scene increases, the required number of random samples increases dramatically, thus reducing the applicability of most approaches. This is also because the probability of sampling a set of points from a single moving object decreases when multiple objects are present [30, 34]. The problem becomes crucial with large numbers of outliers and motions: when fitting a single motion, the points associated to the other motions become structured outliers with high residual errors.

In order to overcome this limitation, we introduce a novel approach which uses semantic information about the composition of a scene to improve classical multi-body segmentation based solely on 2D features. Inspired by recent results that leverage semantics to improve Structure from Motion [2, 38] and 3D segmentation [21], this paper provides an effective approach to include priors over the scene composition in multiple views. In particular, to greatly improve efficiency and robustness, we introduce a sampling function that is driven by the response of general-purpose object detectors. We show that it is possible to improve the performance of a motion segmentation task by injecting the semantic notion of an object given a set of images. In particular, we present a framework to introduce the detection score maps into a two-view motion segmentation problem by explicitly modelling the interaction between the object position and the 2D features matched in both views. This increases the probability of sampling pairs of features associated to a single object, and thus the robustness to a high number of multiple motions in the scene. We show that such a novel sampling strategy is crucial when dealing with more realistic scenes where several dynamic objects are present together with high ratios of outliers.
1.1. Previous work

Motion segmentation is a necessary task enabling higher-level applications in several fields, and for this reason it has attracted considerable interest in the literature. Here we make a distinction between algorithms that use pair-wise relations (i.e. point matches only) and those that use multiple images (i.e. image point trajectories). Starting from the latter, multiple-view constraints have been used to directly solve the multi-body segmentation problem by exploiting the image geometry and mixing algebraic/statistical tools [25, 12, 18, 32]. For instance, the main idea of [29] is to enforce two-view constraints between consecutive frames and to use a model selection strategy to perform the segmentation. However, this method can deliver under- and over-segmentations if the model selection is not properly fed with clean data. In [25] the authors tried to solve the problems that affect [29] by employing a splitting and merging strategy after the segmentation and the reconstruction. This requires a manual adjustment of the parameters in order to deal with different imaging conditions. As reviewed in [39], several types of information have been used to solve the motion segmentation problem, such as image differences, statistics, wavelets, optical flow, layers and manifold clustering, to cite a few. A few algorithms [39, 23, 13, 40] reported low misclassification rates, which testifies to the steady progress in this field, but the major issue of point-trajectory based algorithms is the presence of missing data, which is common in every realistic scenario. For this reason, many algorithms have recently been tuned to solve the motion segmentation problem with interrupted image trajectories [33, 15, 4, 28].

In the case of two views, there is an extensive literature and evaluation regarding single motion fitting techniques [9]. In particular, for multi-body motion segmentation, one of the first approaches presented is sequential RANSAC, which consists in iteratively applying the RANSAC algorithm [17] and removing the inliers belonging to an estimated model. This can lead to a non-optimal estimation with data that may belong to similar models [30, 34]; in particular, it performs poorly when the motions have a degree of dependency. A similar approach for multiple model fitting is the MultiRANSAC algorithm [42]. The strength of this method is that it works in a parallel framework without removing features sequentially; on the other hand, the weakness is that the number of model instances is required in advance. More efficient methods also start with random sample generation [41, 30, 5], but convergence degrades dramatically as the outlier ratio and the number of motions increase. Other approaches introduce guided sampling techniques in order to speed up the convergence of the multiple fitting and to avoid misclassification of the inlier and outlier samples [26, 36, 7, 37], but these approaches are negatively affected by strong outlier ratios that can misguide the initial samples and the following guided sampling. More recently, a new set of approaches provides a more robust fitting by automatically estimating the inlier scales associated to each motion model [34, 27, 24, 8, 35], while still being restricted by random sampling functions.
1.2. Paper contributions

Differently from the mentioned approaches to two-view motion segmentation, we introduce semantics into the problem using the output of off-the-shelf object detectors. To this end, we clearly define a strategy for using the classifiers' score maps to create a semantic sampling function. This restricts the possible set of sampled groups of features and implicitly models the relation between objects and 2D feature points in the scene. As a result, the problem becomes more tractable even with strong ratios of outliers and several moving objects. In particular, we demonstrate that the semantic sampling function dramatically decreases the number of iterations required to segment the multiple motions in the scene.

The rest of the paper will first introduce in Sec. 2 the method for constructing the sampling function obtained from a general-purpose object detector. Sec. 3 will present both synthetic and real results evaluated on the KITTI dataset [20] and the HOPKINS155 dataset [31]. Finally, Sec. 4 will draw conclusions and discuss future work.
2. Image features and object detector relations

Feature-based motion segmentation algorithms require a set of 2D image correspondences between two views. These are in general obtained using standard feature detectors and matching algorithms [1] that provide the 2D positions of the matches as a set of pairs $S = \{(\mathbf{x}_{p1}, \mathbf{x}_{p2})\}$ with $p = 1 \dots P$, for the first and second image respectively. These matched pairs are distributed over the whole image and might belong to moving objects in the scene. At this stage, the 2D image points provide only local information, and any higher-level association to the objects contained in the scene is irremediably lost. To recover from this semantic loss, we advocate the use of an object detector to re-establish the semantic information of a set of points belonging to an object.

Sliding-window object detectors provide a score over the image as the output of the classifier. These values are related to the confidence of the classifier in the presence of a given object class such as, for instance, a pedestrian or a car [14]. In general, the score values are directly associated to a single pixel, or to a multi-resolution pyramid if the classifier acts at different scales [16]. As an example, Figure 1a shows a representation of the score maps of a car object detector at different scales. In order to use the whole information at multiple scales, we resize each score map to
the maximum size using bicubic interpolation [3], so that all the score maps have the same number of pixels. After this registration step, we keep for each pixel the maximum value over the score maps at the different scales, which reveals whether a specific pixel position elicits a strong response of the detector at any scale (Fig. 1c). Let us call this final score map $\Phi_1$ and $\Phi_2$ for the first and second image respectively. Now, given the matched 2D correspondences, we can assign to each feature a score map value $w_{pf} = \Phi_f(\mathbf{x}_{pf})$ with $f = 1, 2$. This score provides information about the likelihood of a correspondence pair of points belonging to an object. Moreover, close points with a high detection score are likely to belong to the same object. For this reason, we encode this semantic information into a sampling function that increases the probability of choosing matches belonging to the same object. Notice that we do not use the detected bounding boxes given by the final classifier output to cluster the set of points together. This is because the score can give richer evidence of the presence of an object without forcing a hard decision over the bounding box position.

Figure 1. The figure shows a) the score maps at different scales as obtained by a car object detector [16], b) the interpolated score maps and c) the final score maps that will be used to obtain the sampling function.
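For concreteness, the multi-scale fusion can be sketched in a few lines of Python. This is an illustrative reimplementation under our reading of the text: the function name and the use of OpenCV for bicubic resizing are our own choices, not the authors' code.

```python
import cv2
import numpy as np

def fuse_score_maps(score_maps):
    """Fuse the per-scale detector score maps into a single map Phi:
    every map is upsampled to the finest resolution with bicubic
    interpolation, then the per-pixel maximum over scales is kept."""
    h = max(m.shape[0] for m in score_maps)
    w = max(m.shape[1] for m in score_maps)
    resized = [cv2.resize(m.astype(np.float32), (w, h),
                          interpolation=cv2.INTER_CUBIC)
               for m in score_maps]
    return np.max(np.stack(resized, axis=0), axis=0)

# Usage: phi1 = fuse_score_maps(maps_view1); the feature score is then
# w_p1 = phi1[int(y), int(x)] for a match located at (x, y) in view 1.
```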
2.1. Computation of the sampling function

The crucial part of this stage is to cluster the 2D features associated to each object, given the point positions $\mathbf{x}_{pf}$ and the associated score values $w_{pf}$ in the two views. The aim here is to cluster together points that are spatially close and likely to belong to the same object. The method applied to group this set of data is the classical mean-shift clustering algorithm [11], which is well suited to non-parametric density functions such as the ones we may obtain from the classifier output. Moreover, to exploit the negative scores given by the classifier, which correspond to regions where objects are not detected, we extend the mean-shift iterations similarly to [10]: points with negative weights are pushed away toward the image borders, thereby defining a macro-region separated from the more spatially localized clusters that are related to stronger responses of the object detectors.

After this clustering stage, we perform a cluster matching between views based on the number of point matches shared by each pair of clusters. In detail, let us define $c_{h1}$ and $c_{k2}$ as the $h$-th and $k$-th clusters provided by the initial mean-shift procedure in views 1 and 2 respectively, with $h = 1 \dots H$ and $k = 1 \dots K$. Notice that the total numbers of clusters $H$ and $K$ can in general differ between views 1 and 2: for instance, two objects may appear as overlapping in only one of the views and their feature points may consequently be grouped into a single cluster. Each cluster in each view can be considered as a subset of the feature points $\mathbf{x}_{pf}$. Given this, we can define $H \times K$ subsets $s_{h,k}$ of feature point pairs $(\mathbf{x}_{p1}, \mathbf{x}_{p2})$ in the following way:

$$s_{h,k} = \{(\mathbf{x}_{p1}, \mathbf{x}_{p2}) \mid (\mathbf{x}_{p1} \in c_{h1}) \wedge (\mathbf{x}_{p2} \in c_{k2})\} \qquad (1)$$

where $\wedge$ denotes the logical AND operator. In order to choose the most likely cluster match $(m, n)$ among views, we search for the maximum number of feature point matches related to the cluster pair:

$$(m, n) = \arg\max_{(h,k)} |s_{h,k}| \qquad (2)$$
where $|\cdot|$ denotes the cardinality of a set. Now, we need to prune out point matches that are not included in the clusters $(c_{m1}, c_{n2})$, since some of the matches may not belong to the two clusters. This further stage enforces consistency between the clusters (given by the object detector response) and the point correspondences in the two images. This point-to-object correspondence in two views is crucial in order to sample points likely to belong to a single object detected in the two frames. In particular, we define the two pruned clusters $\tilde{c}_{m1}$, $\tilde{c}_{n2}$ in the following way:

$$\tilde{c}_{m1} = \{\mathbf{x}_{p1} \mid (\mathbf{x}_{p1}, \mathbf{x}_{p2}) \in s_{m,n}\} \qquad (3)$$

$$\tilde{c}_{n2} = \{\mathbf{x}_{p2} \mid (\mathbf{x}_{p1}, \mathbf{x}_{p2}) \in s_{m,n}\}. \qquad (4)$$
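The matching and pruning of Eqs. (1)-(4) amount to a histogram over cluster-index pairs followed by an argmax. A minimal sketch follows, assuming each match has already been assigned a mean-shift cluster index in each view; all names are illustrative.

```python
import numpy as np

def match_and_prune_clusters(labels1, labels2, H, K):
    """Select the cluster pair (m, n) sharing the most point matches
    (Eq. 2) and prune matches whose endpoints do not fall in both
    clusters (Eqs. 3-4). labels1[p], labels2[p] are the mean-shift
    cluster indices of match p in views 1 and 2."""
    labels1, labels2 = np.asarray(labels1), np.asarray(labels2)
    counts = np.zeros((H, K), dtype=int)          # counts[h, k] = |s_{h,k}|, Eq. (1)
    np.add.at(counts, (labels1, labels2), 1)
    m, n = np.unravel_index(np.argmax(counts), counts.shape)   # Eq. (2)
    keep = np.flatnonzero((labels1 == m) & (labels2 == n))     # Eqs. (3)-(4)
    return m, n, keep
```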
In order to obtain a sampling distribution from the pruned clusters, we compute the centroid and covariance matrix of two multivariate Gaussians given the points in $\tilde{c}_{m1}$, $\tilde{c}_{n2}$ and their score weights $w_{pf}$ given by the detector. Notice, however, that at this stage some points inside the clusters might still have associated negative weights, since such points can still converge via mean shift to a cluster. In order to perform the fitting, these point scores are zeroed such that:

$$v_{pf} = \max(0, w_{pf}). \qquad (5)$$

The weighted centroid and covariance matrix are therefore calculated as:

$$\mu_{m1} = \frac{\sum_{p \in P_{m1}} v_{p1}\,\mathbf{x}_{p1}}{\sum_{p \in P_{m1}} v_{p1}} \qquad (6)$$

$$\Sigma_{m1} = \frac{\sum_{p \in P_{m1}} v_{p1}}{\left(\sum_{p \in P_{m1}} v_{p1}\right)^{2} - \sum_{p \in P_{m1}} v_{p1}^{2}} \cdot \sum_{p \in P_{m1}} v_{p1} (\mathbf{x}_{p1} - \mu_{m1})(\mathbf{x}_{p1} - \mu_{m1})^{\top} \qquad (7)$$
where $P_{m1}$ is the subset of indices $p$ defined as:

$$P_{m1} = \{p \mid \mathbf{x}_{p1} \in \tilde{c}_{m1}\} \qquad (8)$$

An analogous definition holds for the pruned cluster in the second view, with parameters $\mu_{n2}$, $\Sigma_{n2}$, $P_{n2}$. A normal distribution of mean $\mu_{kf}$ and covariance matrix $\Sigma_{kf}$ is then associated to each cluster in each view. Finally, a sampling map for the pair of clusters $(m, n)$ is defined as the product of the two related normal distributions:

$$\Psi(\mathbf{x}_{p1}, \mathbf{x}_{p2}) = \mathcal{N}(\mathbf{x}_{p1} \mid \mu_{m1}, \Sigma_{m1})\, \mathcal{N}(\mathbf{x}_{p2} \mid \mu_{n2}, \Sigma_{n2}) \qquad (9)$$
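Equations (5)-(9) translate directly into code. Below is a minimal NumPy/SciPy sketch; function and variable names are illustrative, not from the authors' implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_weighted_gaussian(pts, w):
    """Weighted mean and covariance of 2D points (Eqs. 5-7).
    pts: (P, 2) array of coordinates; w: (P,) detector scores."""
    v = np.maximum(w, 0.0)                       # Eq. (5): zero out negative scores
    mu = (v[:, None] * pts).sum(0) / v.sum()     # Eq. (6): weighted centroid
    d = pts - mu
    # Eq. (7): unbiased weighted covariance
    cov = v.sum() / (v.sum() ** 2 - (v ** 2).sum()) * ((v[:, None] * d).T @ d)
    return mu, cov

def sampling_function(x1, x2, mu1, cov1, mu2, cov2):
    """Psi(x_p1, x_p2), Eq. (9): product of the two per-view Gaussians."""
    return (multivariate_normal.pdf(x1, mu1, cov1) *
            multivariate_normal.pdf(x2, mu2, cov2))
```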
2.2. Model fitting with RANSAC

The sampling function defined in the previous section guides the selection of points, and the related matched pairs, that are likely to belong to a detected object. Of course, the object detector does not encode the notion of motion between two frames (i.e. static detected objects belong to the background), so the next stage is devoted to the selection of the multiple motions given the proposed sampling function. To this end, we define a sequential RANSAC algorithm where at each run a motion model, consisting of a fundamental matrix, is fitted to a minimal subset of points sampled according to the sampling function. Here we fix the inlier threshold $t$ to a constant value, but more elaborate algorithms such as [34, 27, 24, 8, 35] can be used to avoid this manual step. After the selection of the best motion model in the first iteration, the inlier matches are removed from the set of feature point pairs $S$, obtaining a reduced set $S_{next}$. Consequently, all the clusters previously found are updated by removing the inliers contained in each cluster:

$$c_{h1} \leftarrow c_{h1} \cap S^{1}_{next} \qquad (10)$$

$$c_{k2} \leftarrow c_{k2} \cap S^{2}_{next} \qquad (11)$$

with $k = 1 \dots K$ and $h = 1 \dots H$. The subsets of points $S^{f}_{next}$ are defined as:

$$S^{f}_{next} = \{\mathbf{x}_{pf} \mid (\mathbf{x}_{p1}, \mathbf{x}_{p2}) \in S_{next}\}. \qquad (12)$$
At the next iteration, we select the new pair $(m, n)$ with the maximum number of intersecting points from the updated set of clusters, and a new sampling function is evaluated according to Eqs. (6)-(9). Since all the inliers related to the previously estimated motion models have been removed, the new multivariate Gaussians defining the sampling function will enforce the sampling of matches more consistent with the remaining objects. The maximum number of iterations is fixed to $\max(H, K)$, i.e. the maximum number of clusters between the two views as evaluated by mean shift. Note, however, that the sequential RANSAC might include inliers belonging to different clusters but sharing the same motion. This might reduce the overall number of fitted motion models, so the value $\max(H, K)$ has to be considered as an upper bound on the iterations through the clusters. Algorithm 1 summarizes the full scheme of our method.

Algorithm 1 Semantically driven sequential RANSAC
Require: set of 2D image correspondences $S = \{(\mathbf{x}_{p1}, \mathbf{x}_{p2}) \mid p = 1 \dots P\}$ and score maps $\Phi_1$ and $\Phi_2$ from the object detector.
Ensure: multiple motion models and inlier assignment to each motion model.
1: Find the $H$ and $K$ clusters $c_{h1}$ and $c_{k2}$ via the mean-shift procedure (Section 2.1).
2: Assign $S_{next} \leftarrow S$.
3: for $i = 1$ to $\max(K, H)$ do
4:   Select the pair of clusters $(m, n)$ according to the highest number of matches in $S_{next}$ (Eqs. 1, 2).
5:   Compute the sampling function associated to clusters $(m, n)$: $\Psi = \mathcal{N}(\mathbf{x}_{p1} \mid \mu_{m1}, \Sigma_{m1})\,\mathcal{N}(\mathbf{x}_{p2} \mid \mu_{n2}, \Sigma_{n2})$.
6:   Fit the motion model with RANSAC using the sampling function $\Psi$.
7:   Divide the set $S_{next}$ into inliers and outliers according to a given threshold $t$: $S_{next} = S^{in}_{next} \cup S^{out}_{next}$.
8:   Remove the RANSAC inliers from the whole set of feature pairs $S_{next}$: $S_{next} \leftarrow S^{out}_{next}$.
9:   Update the clusters $c_{h1} \leftarrow c_{h1} \cap S^{1}_{next}$, $c_{k2} \leftarrow c_{k2} \cap S^{2}_{next}$ for $k = 1 \dots K$, $h = 1 \dots H$.
10:  if $S^{out}_{next} = \emptyset$ then
11:    EXIT
12:  end if
13: end for
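A condensed Python sketch of the loop, restricted for brevity to a single cluster pair, could look as follows. The full method re-selects the best cluster pair and refits $\Psi$ after every removal; the algebraic epipolar residual below stands in for a proper Sampson distance, and all names are illustrative.

```python
import cv2
import numpy as np

def semantic_sequential_ransac(pts1, pts2, psi, max_models, iters=500, t=1.0):
    """Sequential RANSAC in which the 8-point minimal sets are drawn
    according to the semantic sampling weights psi (one weight per
    match) instead of uniformly. pts1, pts2: (P, 2) float32 arrays."""
    remaining = np.arange(len(pts1))
    models = []
    for _ in range(max_models):
        p = psi[remaining] / psi[remaining].sum()      # renormalise over survivors
        best_inl, best_F = np.array([], dtype=int), None
        for _ in range(iters):
            sample = np.random.choice(remaining, size=8, replace=False, p=p)
            F, _ = cv2.findFundamentalMat(pts1[sample], pts2[sample], cv2.FM_8POINT)
            if F is None:
                continue
            # Algebraic epipolar residual |x2^T F x1| for the surviving matches.
            x1 = np.c_[pts1[remaining], np.ones(len(remaining))]
            x2 = np.c_[pts2[remaining], np.ones(len(remaining))]
            r = np.abs(np.sum(x2 * (x1 @ F.T), axis=1))
            inl = remaining[r < t]
            if len(inl) > len(best_inl):
                best_inl, best_F = inl, F
        models.append((best_F, best_inl))
        remaining = np.setdiff1d(remaining, best_inl)  # fit-and-remove step
        if len(remaining) < 8:
            break
    return models
```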
3. Experiments

In this section we evaluate the advantage of using the proposed sampling function compared to a standard random sampling function, on both synthetic and real data.
3.1. Synthetic setup

We simulate a set of $N$ objects with a cubic shape, rotating and translating in 3D, and then orthographically projected onto an image frame of size 640 × 480 pixels. Each object is associated to a set of $I_n$ feature points, with $n = 1 \dots N$, uniformly distributed inside the object boundaries. The number of feature points per object (i.e. the inliers) $I_n$ is randomly sampled within the interval $[15, 30]$. The number of background points is set equal to ten times the maximum number of points per object. To simulate mismatches in the matching algorithm, we introduce an increasing percentage of outliers. Moreover, we also simulate the 2D score map of an object detector by associating a truncated multivariate normal distribution to each object centroid and a uniform negative value in the remaining part of the image, related to the background. Notice that 2D points belonging to the background but close to an object centroid will be weighted with a positive score, thus reproducing the likely behavior of an object detector (see Fig. 2 for a visual representation).

Table 1. Mean ratio of inliers correctly matched in the synthetic tests for 3 and 5 motions and an increasing percentage of outliers. We report results for our proposed algorithm (OR), J-Linkage (JL), multiGS (GS) and sequential RANSAC (SR).

| Mismatching ratio | # motions | Iterations | OR | JL | GS | SR |
|---|---|---|---|---|---|---|
| 0% | 3 | 10 | .99 | .45 | .09 | .17 |
| 0% | 3 | 100 | .99 | .98 | .88 | .81 |
| 0% | 3 | 1000 | .99 | 1.00 | .99 | .99 |
| 0% | 5 | 10 | .97 | .31 | .08 | .10 |
| 0% | 5 | 100 | .99 | .88 | .64 | .31 |
| 0% | 5 | 1000 | .99 | 1.00 | .95 | .93 |
| 10% | 3 | 10 | .89 | .47 | .08 | .08 |
| 10% | 3 | 100 | .89 | .88 | .71 | .24 |
| 10% | 3 | 1000 | .90 | .90 | .89 | .80 |
| 10% | 5 | 10 | .88 | .26 | .06 | .05 |
| 10% | 5 | 100 | .90 | .79 | .48 | .15 |
| 10% | 5 | 1000 | .90 | .91 | .86 | .36 |
| 30% | 3 | 10 | .72 | .39 | .06 | .05 |
| 30% | 3 | 100 | .72 | .72 | .44 | .11 |
| 30% | 3 | 1000 | .74 | .74 | .72 | .24 |
| 30% | 5 | 10 | .70 | .20 | .05 | .04 |
| 30% | 5 | 100 | .70 | .63 | .31 | .08 |
| 30% | 5 | 1000 | .72 | .73 | .65 | .12 |
Figure 2. Example of two images used for the synthetic setup. The red points in each image represent keypoints belonging either to the background or to 5 moving objects. The green lines display the point matches across the two views. The heatmap values at each pixel correspond to the score map given by a simulated detector response.
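A minimal sketch of the simulated score map described above is given below, assuming a Gaussian bump per object over a negative background; the parameter values (sigma, negative level, truncation threshold) are illustrative guesses, not taken from the paper.

```python
import numpy as np

def simulated_score_map(h, w, centroids, sigma=40.0, neg=-0.2):
    """Simulated detector response: a truncated Gaussian bump at each
    object centroid over a uniformly negative background."""
    yy, xx = np.mgrid[0:h, 0:w].astype(np.float64)
    phi = np.full((h, w), neg)
    for cx, cy in centroids:
        g = np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2.0 * sigma ** 2))
        g = np.where(g > 0.05, g, neg)   # truncate weak tails to the background level
        phi = np.maximum(phi, g)
    return phi

# e.g. phi = simulated_score_map(480, 640, centroids=[(100, 200), (420, 310)])
```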
We compare the performance of three competing methods with the proposed one (OR): sequential RANSAC (SR), which uses the sequential fitting-and-removing procedure, J-Linkage (JL) [30] and multiGS (GS) [6]. The performance of each method is evaluated as:

$$\frac{\sum_{n=1}^{N} |I_{n}^{est} \cap I_{n}|}{\sum_{n=1}^{N} |I_{n}|} \qquad (13)$$

where $I_n$ is the actual set of inliers for motion model $n$ and $I_n^{est}$ is the corresponding estimated set. We applied the four methods under a set of different conditions, namely varying the number of objects (3 and 5), the ratio of mismatches in the feature point pairs (0%, 10%, 30%), and the total number of iterations (10, 100 and 1000). By iteration we mean a single model fitting with the fundamental matrix. For each scenario we performed 100 trials and report in Table 1 the mean performance of each method according to Eq. (13).

Table 1 shows that the number of true inliers recovered by our method is in general higher than or similar to that of the competing methods. The advantage is particularly clear when the number of iterations is small: with 10 iterations, the gap between our approach and the best competing one is almost a factor of two (e.g. 72% vs. 39% for 30% outliers and 3 objects), showing how much the semantic information can boost the performance. This behaviour is even more evident when the number of objects grows to 5. It is also clear that sequential RANSAC shows all its limits, while the J-Linkage results are comparable when the mismatching ratio is low but degrade more as the number of outliers increases. When the number of iterations becomes higher, our approach behaves similarly to the best competing methods, since the effect of the semantic sampling function is less evident in such cases.

Table 2 displays the processing time for the four algorithms, tested with Matlab on a PC with a four-core CPU at 2.6 GHz and 8 GB of RAM. It can be noted that J-Linkage and our proposed approach have rather similar timings. Sequential RANSAC is on the contrary the fastest algorithm, but in general delivers poor results. Notice that for a small number of iterations the bottleneck of our algorithm is the mean-shift procedure, which is not optimized in our current Matlab implementation; however, code optimization and the use of more advanced mean-shift algorithms [19] can reduce the computational load by orders of magnitude.
Table 2. Average computation time versus number of iterations for our proposed algorithm (OR), J-Linkage (JL), multiGS (GS) and sequential RANSAC (SR).

| # iterations | OR | JL | GS | SR |
|---|---|---|---|---|
| 10 | 7.6 s | 6.6 s | 0.04 s | 0.03 s |
| 100 | 9.6 s | 6.8 s | 0.35 s | 0.10 s |
| 1000 | 9.0 s | 9.5 s | 11.8 s | 0.77 s |
3.2. Real scenario

In the real scenario, we compute the score maps using a car object detector [16] applied to pairs of images selected from the KITTI dataset [20], which consists of 29 test sequences in an urban environment, and from the 38 traffic sequences of the HOPKINS155 dataset [31]. Images are taken from a camera mounted on a moving car (KITTI) or from a hand-held camera (HOPKINS155). The motion model used for the real tests is a perspective fundamental matrix estimated using the eight-point algorithm [22] (a minimal sketch is given below). Notice again that the score maps contain both positive and negative scores: the adaptation of the mean-shift algorithm makes the clustering more efficient thanks to an initial removal of many features belonging to the background. Fig. 3 shows the score map generated for an image selected from the KITTI dataset. It can be seen that the highest scores are correctly generated at the car locations in the image. However, the modes of the score map are not exactly centered on the objects, giving only a rough estimate of their locations. Moreover, some positive scores exist in regions where there are no objects. A similar behaviour is observed in the HOPKINS155 dataset (Fig. 4 (e)). Despite this, we can use the information provided by the score map to create the sampling prior for the subsequent motion segmentation stage with RANSAC.
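Both the per-feature scoring and the model fit on a $\Psi$-sampled minimal set reduce to a few lines. This is an illustrative sketch with hypothetical variable names, using OpenCV's normalized eight-point solver.

```python
import cv2
import numpy as np

def feature_scores(phi, pts):
    """w_pf = Phi_f(x_pf): look up the fused score map at each feature.
    phi: (H, W) score map; pts: (P, 2) array of (x, y) coordinates."""
    xy = np.rint(pts).astype(int)
    return phi[xy[:, 1], xy[:, 0]]                # row index is y, column is x

def fit_motion_model(pts1, pts2, idx):
    """Fundamental matrix from a minimal set of 8 correspondences
    drawn according to Psi (idx: indices of the sampled matches)."""
    F, _ = cv2.findFundamentalMat(np.float32(pts1[idx]), np.float32(pts2[idx]),
                                  cv2.FM_8POINT)
    return F                                      # 3x3 matrix, or None if degenerate
```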
Figure 3. Score map for a test image from the KITTI dataset, given by a car detector.
As for the synthetic tests, the proposed method is compared with the JL, GS and SR methods, keeping the number of iterations equal to $10^4$. Such a high number, in comparison with the synthetic tests, was chosen considering the more challenging conditions of a real dataset. A second reason was to avoid a positive bias toward our method, which is very effective with just a few iterations, as seen in the synthetic tests. The computation times for the four methods were 20 s (OR), 67 s (JL), 48.5 s (GS) and 6.5 s (SR) for the HOPKINS155 images, and 34.7 s (OR), 141 s (JL), 40 s (GS) and 9.3 s (SR) for the KITTI images. It is worth noting that the computation of the score map and the mean-shift clustering does not impact much on the overall computation time of our method, which remains lower with respect to JL and GS; SR is obviously faster, but within the same order of magnitude.

Results are displayed in Fig. 4 for the HOPKINS155 dataset and in Fig. 5 for the KITTI dataset, colouring the overlaid 2D points according to the estimated motion model they belong to. As can be seen, the proposed method achieves a very good performance on the image pair from the HOPKINS155 dataset (Fig. 4), where almost all the points belonging to the two moving cars are perfectly assigned to two different models, both distinct from the background. In contrast, the other three methods struggle to recover any consistent motion model. Results for the image pair from the KITTI dataset are slightly less precise for our method, with a couple of outliers on the background and a few outliers on the left car (Fig. 5). This is due to the fact that the motion of the two cars is extremely similar to the background motion, and the point exchange between the two motion models yields a very small penalty in the fitting process, making this test a challenging benchmark for any motion segmentation algorithm. Despite these imperfections, the improvement with respect to the three literature methods, which provide unacceptable results, is remarkable. We repeated the tests with lower and higher numbers of iterations ($10^3$, $10^5$), obtaining qualitatively similar results: very good model estimation for our method and poor or unacceptable results for the three literature methods.
4. Conclusions

This paper presented a novel approach to motion segmentation that promotes the use of semantic information to improve the performance of sampling-and-consensus approaches. In particular, the high-level information given by an object detector is used to design an efficient sampling function that allows fast recovery of the multiple motion models. In this regard, we demonstrated that the proposed approach needs fewer iterations to discover the models in comparison with highly optimized methods that do not use any semantic information. Furthermore, on challenging real datasets, the proposed method sharply outperforms the literature ones, which are in some cases unable to recover a reasonable segmentation even with a considerable number of iterations.
Figure 4. Multi-body segmentation results from a pair of images from the HOPKINS155 dataset. The colors of the feature points overlaid on the images denote the estimated inliers for each motion model for the proposed method (a), J-Linkage (b), multiGS (c) and sequential RANSAC (d). In subfigure (e) the disparity map, overlaid with the classifier score map, is displayed.
These initial results can certainly be improved by including other semantic information specific to the object class used. For instance, the façade of a building or a street floor might be better fitted by a homography model rather than a full fundamental matrix model. It would also be interesting to investigate how object-to-object relations (such as car and street) might further boost the performance of multi-body segmentation, especially for the cluster matching between two views.
5. Acknowledgements We thank M. San Biagio, S. Martelli and M. Zanotto for their support in object detection implementation and testing. We acknowledge Tat-Jun Chin for making available the code used for comparison.
Figure 5. Multi-body segmentation results from a pair of images from the KITTI dataset. The colors of the feature points overlaid on the images denote the estimated inliers for each motion model for the proposed method (a), J-Linkage (b), multiGS (c) and sequential RANSAC (d).

References

[1] H. Aanæs, A. L. Dahl, and K. S. Pedersen. Interesting interest points. International Journal of Computer Vision,
97(1):18–35, 2012. [2] S. Y. Bao and S. Savarese. Semantic structure from motion. In CVPR, 2011, pages 2025–2032, 2011. [3] W. Burger and M. Burge. Digital Image Processing: An Algorithmic Introduction Using Java. Texts in Computer Science. Springer, 2009. [4] A. Cheriyadat and R. Radke. Non-negative matrix factorization of partial track data for motion segmentation. In ICCV 2009, pages 865–872, 2009. [5] T.-J. Chin, H. Wang, and D. Suter. Robust fitting of multiple structures: The statistical learning approach. In ICCV 2009, pages 413–420, 2009. [6] T.-J. Chin, J. Yu, and D. Suter. Accelerated hypothesis generation for multi-structure robust fitting. In ECCV 2010, pages 533–546. Springer, 2010. [7] T.-J. Chin, J. Yu, and D. Suter. Accelerated hypothesis generation for multistructure data via preference analysis. PAMI, IEEE Trans. on, 34(4):625–638, 2012.
[8] J. Choi and G. Medioni. StaRSaC: Stable random sample consensus for parameter estimation. In CVPR 2009, pages 675–682, 2009. [9] S. Choi, T. Kim, and W. Yu. Performance evaluation of RANSAC family. In BMVC 2009, 2009. [10] R. T. Collins. Mean-shift blob tracking through scale space. In CVPR 2003, volume 2, pages II–234, 2003. [11] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. PAMI, IEEE Trans. on, 24(5):603–619, 2002. [12] J. Costeira and T. Kanade. A multi-body factorization method for motion analysis. In ICCV 1995, pages 1071–1076, June 1995. [13] N. da Silva and J. Costeira. The normalized subspace inclusion: Robust clustering of motion subspaces. In ICCV 2009, pages 1444–1450, Sept 2009. [14] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR 2005, volume 1, pages 886–893, 2005. [15] E. Elhamifar and R. Vidal. Sparse subspace clustering. In CVPR 2009, pages 2790–2797, June 2009. [16] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, IEEE Trans. on, 32(9):1627–1645, 2010. [17] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, 1981. [18] A. Fitzgibbon and A. Zisserman. Multibody structure and motion: 3-D reconstruction of independently moving objects. In ECCV 2000, volume 1842, pages 891–906, 2000. [19] D. Freedman and P. Kisilev. Fast mean shift by compact density representation. In CVPR 2009, pages 1818–1825, 2009. [20] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR 2012, 2012. [21] C. Häne, C. Zach, A. Cohen, R. Angst, and M. Pollefeys. Joint 3D scene reconstruction and class segmentation. In CVPR 2013, pages 97–104, 2013. [22] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision, volume 2. Cambridge Univ Press, 2000. [23] F. Lauer and C. Schnörr. Spectral clustering of linear subspaces for motion segmentation. In ICCV 2009, pages 678–685, Sept 2009. [24] K. H. Lee and S. W. Lee. Deterministic fitting of multiple structures using iterative MaxFS with inlier scale estimation. In ICCV 2013, December 2013. [25] K. Ozden, K. Schindler, and L. Van Gool. Multibody structure-from-motion in practice. PAMI, IEEE Trans. on, 32(6):1134–1141, June 2010. [26] T. T. Pham, T.-J. Chin, J. Yu, and D. Suter. The random cluster model for robust geometric fitting. In CVPR 2012, pages 710–717, 2012.
[27] R. Raguram and J.-M. Frahm. RECON: Scale-adaptive robust estimation via residual consensus. In ICCV 2011, pages 1299–1306, 2011. [28] S. Rao, R. Tron, R. Vidal, and Y. Ma. Motion segmentation via robust subspace separation in the presence of outlying, incomplete, or corrupted trajectories. In CVPR 2008, pages 1–8, June 2008. [29] K. Schindler, D. Suter, and H. Wang. A model-selection framework for multibody structure-and-motion of image sequences. IJCV, 79(2), 2008. [30] R. Toldo and A. Fusiello. Robust multiple structures estimation with J-Linkage. In ECCV 2008, pages 537–547, 2008. [31] R. Tron and R. Vidal. A benchmark for the comparison of 3-D motion segmentation algorithms. In CVPR 2007, pages 1–8, June 2007. [32] R. Vidal and R. Hartley. Three-view multibody structure from motion. PAMI, IEEE Trans. on, 30(2):214–227, Feb. 2008. [33] R. Vidal, R. Tron, and R. Hartley. Multiframe motion segmentation with missing data using PowerFactorization and GPCA. IJCV, 79(1):85–105, Aug. 2008. [34] H. Wang, T.-J. Chin, and D. Suter. Simultaneously fitting and segmenting multiple-structure data with outliers. PAMI, IEEE Trans. on, 34(6):1177–1192, 2012. [35] H. Wang and D. Suter. Robust adaptive-scale parametric model estimation for computer vision. PAMI, IEEE Trans. on, 26(11):1459–1474, 2004. [36] H. S. Wong, T.-J. Chin, J. Yu, and D. Suter. Dynamic and hierarchical multi-structure geometric model fitting. In ICCV 2011, pages 1044–1051, 2011. [37] H. S. Wong, T.-J. Chin, J. Yu, and D. Suter. Efficient multi-structure robust fitting with incremental top-k lists comparison. In ACCV 2010, pages 553–564. Springer, 2011. [38] J. Xiao, A. Owens, and A. Torralba. SUN3D: A database of big spaces reconstructed using SfM and object labels. In ICCV 2013, pages 1625–1632, 2013. [39] L. Zappella, X. Lladó, E. Provenzi, and J. Salvi. Enhanced local subspace affinity for feature-based motion segmentation. Pattern Recognition, 44:454–470, 2011. [40] L. Zappella, E. Provenzi, X. Lladó, and J. Salvi. Adaptive motion segmentation algorithm based on the principal angles configuration. In ACCV 2010, volume 6494/2011, pages 15–26, 2010. [41] W. Zhang and J. Košecká. Nonparametric estimation of multiple structures with outliers. In Dynamical Vision, pages 60–74. Springer, 2007. [42] M. Zuliani, C. S. Kenney, and B. Manjunath. The multiRANSAC algorithm and its application to detect planar homographies. In ICIP 2005, volume 3, pages III–153, 2005.