International Journal of Computer Vision manuscript No. (will be inserted by the editor)
A New Approach To Two-View Motion Segmentation Using Global Dimension Minimization
Bryan Poling · Gilad Lerman
Received: date / Accepted: date
Abstract We present a new approach to rigid-body motion segmentation from two views. We use a previously developed nonlinear embedding of two-view point correspondences into a 9-dimensional space and identify the different motions by segmenting lower-dimensional subspaces. In order to overcome nonuniform distributions along the subspaces, whose dimensions are unknown, we suggest the novel concept of global dimension and its minimization for clustering subspaces with some theoretical motivation. We propose a fast projected gradient algorithm for minimizing global dimension and thus segmenting motions from 2 views. We develop an outlier detection framework around the proposed method, and we present state-of-the-art results on outlier-free and outlier-corrupted two-view data for segmenting motion.

Keywords Global Dimension · Empirical Dimension · Subspace Clustering · Hybrid-Linear Modeling · Motion Segmentation · Outliers · Robust Statistics

Supp. webpage: http://math.umn.edu/~lerman/gdm
Bryan Poling
206 Church St. SE, Minneapolis, MN 55455
Tel.: +612-625-5099
E-mail: [email protected]

Gilad Lerman
206 Church St. SE, Minneapolis, MN 55455
Tel.: +612-624-5541
E-mail: [email protected]

1 Introduction

A classic problem in computer vision is that of feature-based motion segmentation from two views. In this problem one has two images, taken at different times, of a 3D
scene. The scene is assumed to consist of multiple, independently moving rigid bodies. The goal is to identify the different moving objects and estimate a motion model for each one of them. For this purpose, one automatically tracks the locations of visually-interesting “features” in the scene, which are visible in both views (e.g., via a Lucas-Kanade-type algorithm [4]). Each feature is represented as a pair of 2-vectors, holding the image coordinates of the feature in the two different views; such a pair is referred to as a point correspondence. The mathematical problem of feature-based, two-view motion segmentation is to both segment the point correspondences according to the rigid objects to which they belong, and estimate a motion model for each object.

A basic strategy to solve the feature-based, two-view motion segmentation problem is to first cluster point correspondences and then estimate the single-body motions within clusters (well-known methods for single-body motion estimation are described in [22, 30]). This procedure was suggested in [17], while clustering point correspondences with K-means or spectral clustering, and in [40], while alternating between clustering and motion segmentation via an EM procedure. Both clustering strategies of [17] and [40] are based primarily on spatial separation between the clusters; however, different clusters in this setting may intersect each other (e.g., when motions share a symmetry). Due to this problem, some algebraic methods have been developed for directly solving for the motion parameters, while eliminating the clustering of point correspondences [46, 35]. Another solution is to segment feature trajectories by taking into account their geometric structures, which may be different than spatial separation (the feature trajectories in 2-views are the 4-dimensional vectors concatenating the 2 point correspondences of the same feature from 2 views).

Costeira and Kanade [14] showed that under the affine camera model, feature trajectories in n-views (for n ≥ 2
these are vectors of length 2n) within each rigid body lie on an affine subspace of dimension at most 3. This observation has given rise to several feature-based motion segmentation schemes for n-views, which are based on clustering subspaces; we refer to such clustering as Hybrid Linear Modeling (HLM). Many algorithms have been suggested for solving the HLM problem, for example, the K-flats (KF) algorithm or any of its variants [39, 9, 42, 23, 49], methods based on direct matrix factorization [8, 14, 24, 25], Generalized Principal Component Analysis (GPCA) [45, 32, 33], Local Subspace Affinity (LSA) [47], RANSAC (for HLM) [48], Agglomerative Lossy Compression (ALC) [31], Spectral Curvature Clustering (SCC) [12], Sparse Subspace Clustering (SSC) [15, 16], Local Best-Fit Flats (LBF and its spectral version SLBF) [50, 51] and Low-rank Representation (LRR) [29, 28]. Some theoretical guarantees for particular HLM algorithms appear in [11, 2, 26, 37, 38, 3]. Two recent reviews on HLM are by Vidal [44] and Aldroubi [1].

For the more general and realistic model of the perspective camera, it can be shown that feature trajectories from two views lie on quadratic surfaces of dimension at most 3 (in R4) (see §2). Arias-Castro et al. [2] suggested clustering the quadratic surfaces of point correspondences (in R4) using Higher Order Spectral Clustering (HOSC) for manifold clustering. They demonstrated competitive results on the outlier-free database of [35], when assuming that the clusters are of dimension 2. However, their results are not competitive for incorporating dimension 3 and they did not provide any numerical evidence that the dimension of the surfaces was 2 and not 3.

A different approach for clustering these particular quadratic surfaces can be obtained by embedding point correspondences into “quadratic coordinates” and then clustering subspaces. More precisely, if a point correspondence ((x, y), (x′, y′)) is mapped into (x, y, 1) ⊗ (x′, y′, 1) ∈ R9, where ⊗ denotes the Kronecker product, then these quadratic surfaces are mapped into linear subspaces of dimensions at most 8, which are determined by the fundamental matrices [22, 30, 10] of the different motions. Chen et al. [10] have used this idea for clustering such quadratic mappings of point correspondences by the Spectral Curvature Clustering (SCC) algorithm [12] (they showed that instead of performing the actual mapping, one can apply the kernel trick). They claimed that other HLM algorithms (at that time) did not work well for such embedded data. The drawback of applying SCC to this quadratic mapping of point correspondences in R9 is that SCC does not work well with subspaces of mixed dimensions, and the subspace dimensions must be known a priori. Unfortunately, the subspaces in this application have mixed and unknown dimensions (see §2). What makes SCC successful for this application is the fact that it takes into account some global information of the subspaces (i.e., for d-dimensional sub-
spaces it uses affinities based on arbitrary d + 2 points, and in particular, far-away points). This helps SCC deal with nonuniform sampling along subspaces with local structure very different than the global one (see §2). On the other hand, local methods (e.g., [51, 13]) often do not work well in this setting.

The purpose of this paper is to develop an HLM algorithm that can successfully cluster the quadratically-embedded point correspondences in R9. In particular, it exploits the global structure of the underlying subspaces, i.e., their “dimensions”. We remark that earlier works [5, 18, 20, 21] used dimension estimators to segment data clusters according to their intrinsic dimension ([5] and [18] used box counting estimators of some fractal dimensions and [20, 21] used the statistical estimator of [27]). However, their methods do not distinguish well between subspaces of the same dimension. Here, on the other hand, we aim to minimize “dimensions” within tentative clusters, instead of estimating them. Thus, we take into account the effect of tentative clusters on these “dimensions” and try to optimize accordingly the appropriate choice of clusters. For this purpose, we propose a class of empirical dimension estimators, and a corresponding notion of global dimension for a mixture of subspaces (a function of the estimated dimensions of its constituent parts). We propose the global dimension minimization (GDM) algorithm, which is a fast projected gradient method aiming to minimize the global dimension among all data partitions. We also build an outlier detection framework into this development to allow for corrupted data sets. We demonstrate state-of-the-art results for two-view motion segmentation (via quadratic embedding), both in the outlier-free and outlier-corrupted cases. We even show that these results are competitive with the state-of-the-art results for multiple views, i.e., using all frames of a video sequence (obtained under the affine camera model). To motivate the use of global dimension, we prove that for special settings and choice of parameters, the global dimension is minimized by the correct partition of the data (representing the underlying subspaces). We then discuss what to do in more general settings.

The paper is organized as follows: §2 briefly explains how the problem of 2-view motion segmentation can be formulated as a problem in HLM; §3 introduces global dimension and explains why its minimization can solve the HLM problem under some conditions; §4 develops a fast projected gradient method for minimizing global dimension; §5 develops an outlier detection/rejection framework for global dimension minimization; §6 demonstrates numerical results on real-world 2-view data sets for both outlier-removed and outlier-corrupted data; finally, §7 concludes this work. The appendix contains proofs of the key results in the paper.
2 Formulating 2-View Motion Segmentation as a Problem in HLM

One way of formulating the motion segmentation problem in terms of HLM is by exploiting the Affine Motion Subspace. Costeira and Kanade [14] demonstrated that when a set of features all come from a single rigid body, then under the assumptions of the affine camera model, the corresponding feature trajectories lie in an affine subspace of dimension 3 or less. One can use this fact to partition the set of features by clustering their trajectories into subspaces. This is a popular formulation of the segmentation problem, even when dealing with only two views of a scene. The formulation involving the affine motion subspace has the advantage that the feature trajectories tend to be nicely distributed in their respective subspaces, and the different subspaces all have nearly the same dimensions. This formulation has the drawback that it requires an affine camera model. The affine camera assumption breaks down when viewing objects close to the camera, or when looking at objects at significantly different ranges. The consequence of this is that the trajectories from a rigid body do not lie within a subspace, but rather in a manifold which is only locally approximated by a subspace of dimension at most 3.

When dealing with 2-view segmentation, there is another approach, based on a more general camera model, which avoids this problem of distortion. This approach assumes a perspective camera, and relies on the fundamental matrix [22] for a rigid body. Indeed, if $F = (F_{i,j})_{i,j=1}^{3}$ is the fundamental matrix for a rigid body, and $x_h = (x, y, 1)^T$ and $x'_h = (x', y', 1)^T$ together form a point correspondence (in standard homogeneous coordinates) from that body, then
$$x_h'^{\,T} F x_h = 0,$$

which is algebraically equivalent to

$$\mathrm{vec}(F) \cdot v = 0, \qquad (1)$$
Fig. 1: Two views of a 3D scene with features overlaid (left), and the nonlinearly embedded point correspondences in R9, projected onto the 3-dimensional subspace spanned by their 3rd, 4th and 5th principal components (right). (Color figure online)
where

$$\mathrm{vec}(F) = (F_{11}, F_{12}, F_{13}, F_{21}, F_{22}, F_{23}, F_{31}, F_{32}, F_{33})^T$$

and

$$v = (xx', x'y, x', xy', yy', y', x, y, 1)^T = (x, y, 1)^T \otimes (x', y', 1)^T.$$

We refer to the vector v as the nonlinear or Kronecker embedding of a point correspondence (recall that ⊗ is the Kronecker product). The vectors obtained through this nonlinear embedding for feature points on the same rigid object lie in a linear subspace of R9 of dimension at most 8. Indeed, (1) says that there is a vector vec(F) ∈ R9 orthogonal to all of the feature trajectories in this set (it also shows that the linear embedding (x, y, x′, y′) lies on a 3-dimensional quadratic manifold). However, the subspace dimension can decrease due to two different reasons. First of all, if there are very few points (per motion), then they may span a lower-dimensional subspace. The second cause is degeneracy in the 3D configuration of the features. If all world points and both camera centers live on a ruled quadratic surface¹, then their corresponding subspace has dimension 7 or less. In particular, if all world points (but not necessarily the camera centres) are coplanar, the corresponding subspace will have dimension no larger than 6 (see [22, pg. 296]). Therefore, to make use of this embedding, the hybrid-linear modeling algorithm being employed must be tolerant of subspaces of mixed dimension. Since the perspective camera assumption is accurate in a much broader range of situations than the affine camera model, subspaces are more apparent with the nonlinear embedding than with the linear embedding. However, the nonlinear embedding distorts the original sampling and results in lower-dimensional structures (of dimension at most 3) within the higher-dimensional subspaces (of typical dimensions 6, 7 or 8), which is a serious obstacle for many HLM algorithms, especially ones using local spatial information.

1 A surface S is ruled if through every point of S there exists a straight line that lies on S.
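For concreteness, the embedding is immediate to compute with a Kronecker product. The snippet below is our own illustrative sketch (not the authors' code); note that the coordinate ordering produced by np.kron may differ from the ordering displayed above by a fixed permutation, which does not change the linear-subspace structure.

```python
import numpy as np

def kronecker_embedding(x, y, xp, yp):
    """Map a point correspondence ((x, y), (x', y')) to its 9-dimensional embedding."""
    return np.kron(np.array([x, y, 1.0]), np.array([xp, yp, 1.0]))

# A correspondence from a rigid body satisfies vec(F) . v = 0, cf. equation (1):
v = kronecker_embedding(0.2, -0.1, 0.25, -0.05)
print(v.shape)  # (9,)
```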
3 Global Dimension
From here on, we will be considering 2-view motion segmentation under the perspective camera model, i.e., using the Kronecker embedding. We will present a global HLM method, which is well-suited for handling the data which results from this embedding. We begin our development by providing some intuitive motivation for our approach. Imagine that we have access to an oracle, who for any set of vectors in RD can provide for us a good, robust estimate
of the dimensionality of the set². Now, suppose we have a set of vectors which are sampled from a hybrid-linear distribution. Consider a general partition of the data set, and define the “vector of set dimensions” for that partition to be the vector of oracle-provided approximate dimensions of each respective set in the partition. Our inspiration is the observation that for most partitions one may happen upon, each set in the partition will typically contain points from many of the underlying subspaces. The associated vector of set dimensions will contain relatively large numbers, and the p-norm of this vector will be large. The p-norm of the vector of set dimensions will be referred to as the global dimension of the partition. The best way to make the global dimension small, it would seem, is to try to decrease all of the elements of the vector of set dimensions by grouping together vectors that come from common subspaces. This notion will be made precise and we will show, in fact, that under certain conditions, the natural partition of the data set (the one where point assignment agrees with subspace affiliation) is a global minimizer of the global dimension function. Our approach to HLM will be to find the partition of a data set that yields the lowest possible global dimension.

In this section, we develop the global dimension objective function in two parts. In §3.1 we suggest a new class of dimension estimators that can perform the role of the oracle in our discussion above. In §3.2 we define global dimension and explain why we expect its minimizer to reveal the clusters corresponding to the underlying subspaces. A fast algorithm for this minimization will be later described in §4.
3.1 On Empirical Dimension
We present here a class of dimension estimators depending on a parameter ε ∈ (0, 1]. For u = (u_1, . . . , u_k) and any p > 0, we use the notation ‖u‖_p to mean (u_1^p + . . . + u_k^p)^{1/p} (even for p = ε < 1, where ‖·‖_p is not a norm). For a given set of N vectors in R^D, {v_i}_{i=1}^N, we denote by σ = (σ_1, σ_2, . . . , σ_{N∧D})^T the vector of singular values of the D × N data matrix A (the matrix whose columns are the data vectors). For ε ∈ (0, 1] the empirical dimension, denoted by d̂_ε(v_1, v_2, ..., v_N) (or simply d̂_ε), is defined by

$$\hat{d}_\varepsilon(v_1, v_2, \ldots, v_N) := \frac{\|\sigma\|_\varepsilon}{\|\sigma\|_{\varepsilon/(1-\varepsilon)}}. \qquad (2)$$
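The following is a minimal NumPy sketch of (2) for a data matrix whose columns are the points; the function name and interface are ours (illustrative, not the authors' code), and it assumes ε ∈ (0, 1).

```python
import numpy as np

def empirical_dimension(X, eps=0.5):
    """Empirical dimension (2) of the columns of the D x N matrix X, for eps in (0, 1)."""
    s = np.linalg.svd(X, compute_uv=False)      # singular values of the data matrix
    delta = eps / (1.0 - eps)
    return np.sum(s**eps)**(1.0 / eps) / np.sum(s**delta)**(1.0 / delta)

# Example: points spread over a 3-subspace of R^9 give a value close to (but not above) 3.
X = np.vstack([np.random.randn(3, 500), np.zeros((6, 500))])
print(empirical_dimension(X, eps=0.35))
```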
When ε = 1, this is sometimes called the “effective rank”³ of the data matrix [43].

2 In a noiseless case this would return the dimension of the linear span of the set of vectors.
3 “Effective rank” is sometimes defined differently. See [36].

The following theorem explains why d̂_ε is a good estimator for dimension. Put simply, it says two things. First, if we rotate and/or uniformly scale our set of vectors by a non-zero amount, then the empirical dimension of the set does not change. Second, in the absence of noise, empirical dimension never exceeds true dimension, but it approaches true dimension in the limit (as the number of measurements goes to infinity) for spherically symmetric distributions. From now on we refer to d-dimensional subspaces as d-subspaces.

Theorem 1 For ε ∈ (0, 1], d̂_ε possesses the following properties:
1. d̂_ε is invariant under dilations (i.e., scaling).
2. d̂_ε is invariant under orthogonal transformations.
3. If {v_i}_{i=1}^N are contained in a d-subspace of R^D, then d̂_ε ≤ d.
4. If {v_i}_{i=1}^N are i.i.d. samples from a sub-Gaussian probability measure, which is spherically symmetric within a d-subspace⁴ and non-degenerate⁵, then lim_{N→∞} d̂_ε(v_1, . . . , v_N) = d with probability 1.

4 A measure is spherically symmetric within a d-subspace if it is supported on this subspace and invariant to rotations within this subspace.
5 A measure is non-degenerate on a subspace if it does not concentrate mass on any proper subspace. In our setting the measure is also assumed to be spherically symmetric, and this assumption is equivalent to assuming the measure does not concentrate at the origin.

To gain some intuition into the definition of empirical dimension, consider taking a large set of samples from a spherically symmetric distribution supported by a d-subspace. Call the covariance matrix for this distribution Q. As the number of samples becomes large, the empirical covariance matrix approaches Q, which has the first d elements on the main diagonal all equal (call the value α²), and 0's everywhere else. The empirical dimension of the set of vectors involves the singular values of the data matrix, which are approaching the square roots of the eigenvalues of Q. Hence, as the number of samples increases, we get:
$$\hat{d}_\varepsilon(v_1, \ldots, v_N) \to \frac{\|(\alpha, \alpha, \ldots, \alpha, 0, \ldots, 0)\|_\varepsilon}{\|(\alpha, \alpha, \ldots, \alpha, 0, \ldots, 0)\|_{\varepsilon/(1-\varepsilon)}} = \frac{d^{1/\varepsilon}\,\alpha}{d^{(1-\varepsilon)/\varepsilon}\,\alpha} = d. \qquad (3)$$
Thus, for any value of ε in (0, 1], the empirical dimension approaches the true dimension of the set as the number of measurements increases. If we look at a distribution that is not spherically symmetric, but still supported by a d-subspace, then empirical dimension tends to under-estimate the true dimension of the distribution, even as the number of samples approaches infinity. This is actually desirable behavior. If we take a spherically symmetric distribution in a d-subspace and imagine the process of collapsing it in one direction until it lies in a (d − 1)-subspace, then true dimension behaves discontinuously. The true dimension of a large set of samples will
equal d until the collapsing is complete; at that point the dimension will instantly drop to d − 1. Empirical dimension smoothly drops from d to d − 1 during this collapsing process. It is in this setting that we see the necessity of the parameter ε. This parameter controls how quickly the empirical dimension drops from d to d − 1 in this process. More generally, a low value of ε results in a “strict” dimension estimator (meaning that it will not under-estimate dimension easily, even when distributions are asymmetric). When ε is large (approaching 1), empirical dimension is a lenient dimension estimator. It is much more tolerant of noise, but it may consequently under-estimate the dimension of highly asymmetric distributions. The trade-off is that when dealing with noisy data or distributions only approximately supported by linear subspaces, a stricter estimator can mistakenly interpret noise or distortion as energy in new directions, thereby causing an over-estimate of dimension. Numerical experiments (e.g., Fig. 2) have shown that values of ε between 0.3 and 0.7 seem to provide reasonable estimators, which tend to agree with our intuitive notion of dimension.
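A quick way to reproduce the qualitative behavior described here (and illustrated in Fig. 2) is to collapse a Gaussian cloud and track the empirical dimension for several values of ε. This is our own illustrative script and it reuses the empirical_dimension helper sketched above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 2000))           # spherical point cloud in R^3

for t in [1.0, 0.5, 0.25, 0.1, 0.0]:         # gradually collapse the third direction
    Xt = X.copy()
    Xt[2, :] *= t
    dims = [round(empirical_dimension(Xt, e), 2) for e in (0.35, 0.50, 0.65, 0.90)]
    print(f"scale {t:4.2f}: empirical dims for eps = 0.35/0.50/0.65/0.90 -> {dims}")
```

Small ε stays near 3 until the collapse is nearly complete (a strict estimator), while large ε decreases smoothly (a lenient estimator).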
Fig. 2: [Plot of empirical dimension vs. time for the collapsing point cloud; curves show the true dimension and the empirical dimension for ε = 0.35, 0.50, 0.65, 0.90.] The experiment mentioned above is illustrated. A normally-distributed point cloud is created in R3 and is slowly collapsed into a plane and then a line. One can see that if ε is close to 0, empirical dimension more closely tracks true dimension, resulting in a strict dimension estimator. If ε is close to 1, empirical dimension changes more smoothly, resulting in a lenient estimator. (Color figure online)

In our application, since the perspective camera model is reasonably accurate (as opposed to the affine model), the nonlinear embedding of point correspondences results in subspaces with rather negligible distortion. We can thus afford a low value of ε. In fact, this is needed because the data vectors are frequently distributed in very non-isotropic ways with this embedding. Thus, to avoid underestimating dimension, we choose ε = 0.35, which lies just slightly above the lowest value we confirmed for ε (0.3). Notice that this value is not “tuned” to individual data sets, but is chosen based on the properties of the application as a whole and the nature of the embedding.

3.2 On Global Dimension

Assume we are provided a data set X in R^D (in our application D = 9 with the nonlinear embedding) and a partition of it Π = (Π_1, Π_2, ..., Π_k) for some k ∈ N (i.e., {Π_i}_{i=1}^k are disjoint subsets of X whose union is X). We also assume that X lies on a union of K subspaces and denote the “correct” (or natural) partition of the data (where each subset contains only points from a single underlying subspace) by ΠNat. For a fixed ε ∈ (0, 1], {d̂_{ε,i}}_{i=1}^k are the empirical dimensions of the sets {Π_i}_{i=1}^k. We seek to minimize a function based on these dimensions to recover ΠNat. To this end we define global dimension (GD). When thinking of this function, we take the set of data vectors to be fixed and given, and we view GD as a function of partitions, Π, of the set of data vectors. For a fixed p ∈ (0, ∞) (we discuss the meaning of p later) we define GD as follows:

$$GD(\Pi) = \left\| \left( \hat{d}_{\varepsilon,1}, \hat{d}_{\varepsilon,2}, \ldots, \hat{d}_{\varepsilon,K} \right)^T \right\|_p = \left( \sum_{i=1}^{K} \hat{d}_{\varepsilon,i}^{\,p} \right)^{1/p}. \qquad (4)$$
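For a hard partition given by integer labels, (4) can be evaluated directly. The sketch below is our own illustrative code (ε and p defaults are the values assumed later in the paper) and reuses the empirical_dimension helper from §3.1.

```python
import numpy as np

def global_dimension(X, labels, eps=0.35, p=15.0):
    """Global dimension (4) of the hard partition of the columns of X induced by `labels`.
    Assumes the empirical_dimension helper defined earlier."""
    dims = np.array([empirical_dimension(X[:, labels == k], eps)
                     for k in np.unique(labels)])
    return np.sum(dims**p)**(1.0 / p)
```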
Our strategy for recovering ΠNat will be to try and find the partition of the data set that minimizes GD(Π ). Intuitively, by trying to minimize the p-norm of the vector of set dimensions, we are looking for a partition where all of the set dimensions are small. Imagine trying to minimize this objective function by hand, and starting with a partition close to, but not equal to ΠNat . If there is a point assigned to the wrong cluster, then removing it from the set it is currently assigned to should result in a significant drop in the dimension of that particular set. Re-assigning that point to the correct set, on the other hand, will have little impact on the dimension of the target set because the point will lie approximately in the span of other points already in the set. Thus, such a change would cause a significant drop in one of the set dimensions, without disturbing the other sets, and the global dimension will decrease. This would suggest that amongst partitions that are close to it, ΠNat yields the lowest global dimension. Additionally, if one considers a “random”, or usual partition, then each set in that partition will tend to contain vectors from many different subspaces. Each set will have a large dimension, and the global dimension will exceed that of ΠNat . This would suggest that minimizing global dimension may be a reasonable objective if we want to recover ΠNat . Unfortunately, there can exist certain special partitions of a data set that result in low global dimension (in some cases even lower than that of ΠNat ). For example, let us
choose p = 1, so that the global dimension of a partition is simply the sum of the dimensions of its constituent parts. Now consider 3 lines in the plane, and a data set consisting of many points sampled from each line. In this case ΠNat will consist of three sets. Each set will contain only points from a single line. The dimension of each set in ΠNat is 1. Hence, GD(ΠNat) = 3. On the other hand, if we consider the “degenerate” partition, that simply puts all points in a single set, then since we are in R2, the dimension of that set, and hence the global dimension of the partition, is 2.

The above example is actually rather special. Consider the same data set, but set p to a large value instead of 1. When p is large, the global dimension approximately returns the largest value from {d̂_i}_{i=1}^K. Now consider minimizing this quantity, subject to the constraint that the partition contains no more than 3 sets. Minimizing global dimension in this setting penalizes partitions consisting of fewer, higher-dimensional sets instead of multiple, more balanced sets. Specifically, the global dimension of the degenerate partition is again approximately 2, while the global dimension of ΠNat is approximately 1, since that is the maximum dimension of its constituent sets. In fact, as shown in the next theorem, using large p effectively resolves the issue of special partitions yielding lower global dimension than ΠNat.

We will consider the setting where we have K distinct linear subspaces of R^D, each of dimension d < D. Call these subspaces {L_k}_{k=1}^K. Assume we have a collection of non-degenerate measures {µ_k}_{k=1}^K supported by these subspaces (so that µ_k is supported by L_k, k = 1, 2, ..., K). Let {v_n}_{n=1}^N be a set consisting of N_k i.i.d. points from each µ_k (so N = N_1 + N_2 + ... + N_K). We require that N_k > d for each k so that each subspace is adequately represented in the data set. Let GD_True be global dimension for a fixed parameter p, defined using true dimension as the “dimension estimator” for a set. That is, GD_True(Π) = ‖(d_True(Π_1), ..., d_True(Π_K))‖_p, where the sets Π_k are the constituent sets of the partition Π and d_True(•) returns the true dimension of its parameter set. Then, we get the following result:

Theorem 2 Let {L_k}_{k=1}^K, {µ_k}_{k=1}^K, and {v_n}_{n=1}^N satisfy the conditions above. If

$$p > \ln(K) / (\ln(d + 1) - \ln(d)), \qquad (5)$$
then amongst all partitions of {v_n}_{n=1}^N into K or fewer sets, the natural partition is almost surely (w.r.t. {µ_k}_{k=1}^K) the unique minimizer of GD_True.

The weakness of the above theorem is that it requires all of the intrinsic subspaces to have the same dimension. In practice, the global dimension objective function appears to be rather robust to subspaces with mixed dimensions. If there is a large difference in dimension between two subspaces in a dataset, then the minimum of global dimension tends to be very near ΠNat, the only difference being that a
few points from the higher-dimensional set are re-assigned to lower-dimensional sets to balance out the set dimensions.

Theorem 2 gives us a quantitative way of selecting an appropriate value of p for our application. Specifically, if we identify the largest number of clusters we will need to address (K) and an upper bound for d, then the right-hand side (RHS) of (5) gives us a lower bound on the value of p. As long as p is larger than this bound, Theorem 2 ensures that the natural partition (uniquely) minimizes global dimension. Table 1 exemplifies the values of the RHS of (5) for different values of d and K. We do not want to choose p extravagantly large because of the potential for numerical issues when taking large powers. In our setting, we want to be able to handle up to 4 sets and we will use one less than the ambient dimension as an upper bound for the intrinsic dimension (d = 8). According to Theorem 2 we should select p ≥ 11.77. In all of our experiments we set p = 15 to give us a safety margin. We did some tests on one of our motion segmentation databases (outlier-free RAS) to see how sensitive the minimizer of global dimension is to p in practice. We found that values as low as p = 10 result in nearly identical performance to p = 15, and we don't start to see significant degradation in results in the other direction until p = 25.

Table 1: Values of the RHS of (5) for various values of K and d. Theorem 2 ensures that ΠNat is the unique minimizer of GD when p is larger than these values.

         d = 8    d = 7    d = 6    d = 5    d = 4
K = 2     5.89     5.19     4.50     3.80     3.11
K = 3     9.33     8.23     7.13     6.03     4.92
K = 4    11.77    10.38     8.99     7.60     6.21
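The bound in (5) is easy to evaluate; for instance, the following small helper (our own illustrative code) reproduces the K = 4, d = 8 entry of Table 1.

```python
import math

def p_lower_bound(K, d):
    """Right-hand side of (5): p must exceed this value for Theorem 2 to apply."""
    return math.log(K) / (math.log(d + 1) - math.log(d))

print(round(p_lower_bound(4, 8), 2))   # 11.77, matching Table 1
```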
4 A Fast Algorithm for Minimizing Global Dimension

Global dimension is defined on the set of partitions of a data set. With a discrete domain, finding ways of quickly minimizing the objective function is non-trivial. In this section we briefly introduce a method, which we will call Global Dimension Minimization (GDM), for doing exactly this. GDM is based on the gradient projection method [6, §2.3]. In order to apply a gradient-based method, we need to re-formulate the problem so that we have a smooth objective function over a convex domain. To do this we employ the notion of fuzzy assignment. Rather than trying to assign each data point a label, identifying it with a single cluster, we allow each point to be associated with every cluster simultaneously, in varying amounts. Specifically, we assign each data point v_j a probability vector where the i'th coordinate holds the strength of v_j's affiliation with cluster i. Assuming we have a data set of N points in R^D, and we seek K clusters,
we need N probability vectors of length K to encode the soft partition of the data. This membership information will be stored in a membership matrix, M, where each column is a probability vector. Element (i, j) of the matrix M holds the strength of v_j's affiliation with cluster i.

The next step is to extend the definition of global dimension so that it is defined on soft partitions in a meaningful way. In its original formulation, to evaluate the global dimension of a partition, we would break up the data set into parts, based on the partition, and estimate the dimension of each part using empirical dimension. To extend this to soft partitions, we estimate the dimension of the k'th set in a partition by scaling each data point by its respective affiliation strength to set k (v_n is multiplied by M_{(k,n)}). We then use empirical dimension to estimate the dimension of the scaled set. In essence, each point is now included in each dimension estimate. However, if a point is scaled so that it lies near the origin when considering a given set, it has little impact on the estimated dimension of that set. In fact, if we look at the global dimension of a soft partition that assigns each data point entirely to a single set (M has only 1's and 0's in it), then the global dimension of that soft partition, using our new definition, agrees with the global dimension of the corresponding “hard partition”, using our original definition. Thus, this change is a reasonable extension of the original definition to soft partitions. Our extended definition of global dimension is:

$$GD = \left\| \left( \hat{d}_\varepsilon^1, \hat{d}_\varepsilon^2, \ldots, \hat{d}_\varepsilon^K \right) \right\|_p, \quad \text{where} \quad \hat{d}_\varepsilon^k = \hat{d}_\varepsilon\big(M_{(k,1)} v_1, M_{(k,2)} v_2, \ldots, M_{(k,N)} v_N\big). \qquad (6)$$
With this modified formulation, global dimension is an almost-everywhere differentiable function defined over the Cartesian product of N K-dimensional probability simplexes. One can check that this is a convex domain (the product of convex sets is convex). A natural approach to minimizing a problem of this sort is the gradient projection method [6, §2.3]. In this method, we begin at some initial state, compute the gradient of the objective function, take a step in the direction opposite the gradient, and then project our new state back into the domain of optimization. This is repeated until our state converges.

The gradient of global dimension can be computed, but we need some notation first. For k = 1, . . . , K, we denote by A_k the D-by-N matrix whose j'th column equals M_{(k,j)} v_j for j = 1, 2, ..., N (i.e., A_k is the data matrix scaled according to the weights for cluster k). Let A_k = U_k Σ_k (V_k)^T be the thin SVD of A_k, and let σ^k denote the vector of elements from the diagonal of Σ_k. Let δ = ε/(1 − ε). Define
$$D_k = \left( \frac{\|\sigma^k\|_\varepsilon^{1-\varepsilon}\, \|\sigma^k\|_\delta}{\|\sigma^k\|_\delta^{2}} \right) \cdot (\Sigma_k)^{\varepsilon-1} - \left( \frac{\|\sigma^k\|_\varepsilon\, \|\sigma^k\|_\delta^{1-\delta}}{\|\sigma^k\|_\delta^{2}} \right) \cdot (\Sigma_k)^{\delta-1}. \qquad (7)$$
Theorem 3 The derivative of global dimension w.r.t. an arbitrary element of the membership matrix M is given by:

$$\frac{\partial GD}{\partial M_{(k,n)}} = (V_k)_{(n,:)}\, \big(\hat{d}_\varepsilon^k\big)^{p-1}\, \big\|\big(\hat{d}_\varepsilon^1, \hat{d}_\varepsilon^2, \ldots, \hat{d}_\varepsilon^K\big)\big\|_p^{1-p}\, D_k\, (U_k)^T A_{(:,n)}. \qquad (8)$$
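The sketch below is a direct NumPy transcription of (7) and (8), computing the full K × N gradient of GD with respect to M. It is our own illustrative code; it assumes the scaled data matrices have no zero singular values (a practical implementation would guard small σ before taking negative powers).

```python
import numpy as np

def gd_gradient(A, M, eps=0.35, p=15.0):
    """Gradient of global dimension w.r.t. the K x N membership matrix M (eqs. (7)-(8)).
    A is the D x N (unscaled) data matrix whose columns are the embedded points."""
    K, N = M.shape
    delta = eps / (1.0 - eps)
    dims, rows = [], []
    for k in range(K):
        Ak = A * M[k]                                   # scale column n by M[k, n]
        U, s, Vt = np.linalg.svd(Ak, full_matrices=False)
        ne = np.sum(s**eps)**(1.0 / eps)                # ||sigma^k||_eps
        nd = np.sum(s**delta)**(1.0 / delta)            # ||sigma^k||_delta
        dims.append(ne / nd)                            # empirical dimension of cluster k
        Dk = (ne**(1 - eps) * nd / nd**2) * s**(eps - 1) \
             - (ne * nd**(1 - delta) / nd**2) * s**(delta - 1)   # diagonal of D_k in (7)
        # entry n equals (row n of V_k) * diag(Dk) * U_k^T a_n, computed for all n at once
        rows.append(np.einsum('rn,r,rn->n', Vt, Dk, U.T @ A))
    dims = np.array(dims)
    scale = dims**(p - 1) * np.sum(dims**p)**((1.0 - p) / p)     # (d_k)^{p-1} * ||d||_p^{1-p}
    return scale[:, None] * np.vstack(rows)                      # K x N gradient matrix
```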
A proof of Theorem 3 is included in the appendix. This theorem allows us to evaluate the gradient vector of global dimension. As was mentioned before, in an iteration of the gradient projection method we take a step in the direction opposite the gradient. Computing a good step size is frequently a challenging task, but here we are fortunate. Our domain has a meaningful natural scale, since it is formed as a product of probability simplexes. Intuitively, our step size should be large enough to move us across the entire space in a reasonable number of steps, but small enough that any individual membership vector can move only a fraction of the way across its own simplex in one step. In practice, we scale each step so that the membership vectors most affected by the step move a distance of 0.3 on average. This seems to work well in general. Finally, one can check that projecting onto the domain of optimization can be accomplished by individually projecting each column of M onto the standard K-dimensional probability simplex.

We have outlined a projected gradient descent method for minimizing global dimension. The above method forms the core of the GDM algorithm. However, since the global dimension function is non-convex (and hence may contain multiple local minima), it is important to achieve reasonably good initialization. Our initialization strategy is inspired by ALC [31]. We start with a “trivial” partition where each point is in its own set, and we randomly select many pairs of sets in the partition. For each pair, we hypothetically merge the two sets and measure the resulting global dimension. We select the pair that results in the lowest global dimension when merged and we effect that merge. We then repeat the process iteratively until we have the desired number of sets in our partition (in each step the number of sets in the partition decreases by 1).

After initialization, the projected gradient descent algorithm is run until convergence (or for a fixed, but large number of iterations). Thresholding is performed to recover a “hard partition” from our soft partition (point j is assigned to cluster i if M_{(i,j)} is the largest element from column j of M). After this is done, we perform a final genetic stage to clean up small errors which may have occurred in any of the previous stages. This is done by taking each point and hypothetically re-assigning it to each different cluster (while all other point assignments are kept fixed) and retaining the assignment that results in the lowest global dimension. This is repeated a few times or until no single-point re-assignments reduce global dimension.
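The projection step mentioned above is not spelled out in the text; a standard sorting-based Euclidean projection onto the probability simplex can be used. The following is our own illustrative sketch for a single membership vector.

```python
import numpy as np

def project_to_simplex(y):
    """Euclidean projection of y onto {x : x_i >= 0, sum_i x_i = 1}."""
    u = np.sort(y)[::-1]                          # sort in decreasing order
    css = np.cumsum(u)
    idx = np.arange(1, y.size + 1)
    rho = np.nonzero(u - (css - 1.0) / idx > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(y - theta, 0.0)

# After a gradient step, every column of M is projected back onto its simplex, e.g.:
# M = np.apply_along_axis(project_to_simplex, 0, M - step * grad)
```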
This primarily helps with placing points that lie near the intersections of different subspaces (their fuzzy assignments may associate them almost equally to 2 different subspaces, making them difficult to place). Finally, we run this entire process several times and return the best partition of all runs (as measured by global dimension).
Algorithm 1 GDM Algorithm for HLM Input: X = {x1 , x2 , · · · , xN } ⊆ RD : data, K: number of clusters, p: global dimension parameter, ε: empirical dimension parameter, n1 , n2 , n3 : number of iterations (default: n1 = n3 = 10, n2 = 30) Output: A partition, Π , of X into K disjoint clusters for i = 1 : n1 do • Π := Partition of X where each point is in its own set. while number of sets in Π greater than K do • Randomly choose several pairs of sets. • For each pair, measure the effect on global dimension if the pair is merged. • Merge the pair of sets which results in the lowest global dimension. end while • Convert Π to a soft partition, encoded in membership matrix M. for j = 1 : n2 do • Compute gradient of global dimension, ∇GD. • ρ = average magnitude of largest 10% of columns of ∇GD. • Take a step in direction −1 ∗ ∇GD of length .3/ρ. • Project each column of M onto the standard k-dimensional probability simplex. end for • Convert M back to a “hard partition”, Π , by thresholding. for j = 1 : n3 do for n = 1 : N do • Check if re-assigning point n to some other cluster decreases global dimension. • If so, re-assign point n to that cluster. end for end for end for • return partition from all runs with lowest global dimension.
4.1 Complexity of GDM

A thorough analysis of the computational complexity is not included here; this is a short summary of the computational aspects involved. The main numerical component of GDM is computing ∇GD. For a single iteration its complexity is O(K · N · D²). Our choice of ρ requires a sorting procedure and is thus of order O(N · log(N)) operations for a single iteration. The initialization of the algorithm via the ALC-type procedure [31] requires O(n₁ · N · log(N) · D²) operations. Also, the last genetic step has complexity O(n₁ · n₃ · K · D² · N²) (without taking advantage of incremental SVD). In theory, we can make the algorithm linear in the number of points N, by randomly initializing it, removing the genetic “clean-up” step and changing how we select our step size⁶. We have good numerical evidence, even with large N, that this can result in good accuracy and speed for artificial data. Regardless, for the values of N in our application the algorithm is sufficiently fast and these additional steps help improve accuracy, especially for points which are nearby several clusters (whose percentage is not negligible when N is small).

6 This is assuming that we will not require more iterations to get close enough to the minimum that we can apply thresholding. In our experiments the number of needed iterations does not appear to grow with N, but we do not have any results to guarantee this.

5 Detecting and Rejecting Outliers with GDM

In practice, it turns out that the GDM algorithm described above is naturally robust to a small number of outliers (in that they do not tend to affect the classification of inliers), but no instruments were put in place for explicitly detecting or rejecting these outlying points. In this section, we introduce a modification to GDM that allows for explicit outlier detection and rejection. The guiding intuition is that an outlier has the property that if the true hybrid-linear structure is reflected in a partition, then no matter which group we assign the outlier to, it causes a significant increase in the empirical dimension of that group. This, in turn, results in a significant increase in global dimension. In other words, if we have a partition that reflects the true hybrid-linear structure of the data set, then there is no good place to put an outlier. If the algorithm was given the option of paying a fixed, low price for the right to ignore a given point, it would make sense for it to exercise this option on outliers, and only segment inliers.

We propose modifying the global dimension objective function, and the accompanying variational development, in the following way:

$$GD(M) = \alpha \|M_{1,:}\|_1 + \left\| \left( \hat{d}_{\varepsilon,2}, \hat{d}_{\varepsilon,3}, \ldots, \hat{d}_{\varepsilon,K+1} \right)^T \right\|_p, \qquad (9)$$

where

$$\hat{d}_{\varepsilon,k} = \hat{d}_\varepsilon\big( M_{k,1} v_1, M_{k,2} v_2, \ldots, M_{k,N} v_N \big). \qquad (10)$$

This modification adds an additional “cluster” to the problem (call it cluster 1), and we treat it differently than the others. Clusters 2 through K + 1 contribute to the global dimension in the same way that they did in the original development. Cluster 1 contributes to the cost function the sum of the membership strengths of all data points to this cluster. This is the “fuzzy assignment” version of the following notion: we allow the algorithm to pay a fixed price, α, for the right to ignore any particular data point (not assign it to any true cluster).
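A minimal sketch of the modified objective (9)-(10), with row 0 of the membership matrix playing the role of the outlier group; this is our own illustrative code, and ε, p, and the interface are assumptions.

```python
import numpy as np

def global_dimension_outliers(A, M, alpha, eps=0.35, p=15.0):
    """Outlier-augmented global dimension (9): M is (K+1) x N, row 0 is the outlier group.
    Membership entries are nonnegative, so sum(M[0]) equals the 1-norm of that row."""
    delta = eps / (1.0 - eps)
    dims = []
    for k in range(1, M.shape[0]):               # clusters 2..K+1 in the paper's numbering
        s = np.linalg.svd(A * M[k], compute_uv=False)
        dims.append(np.sum(s**eps)**(1.0 / eps) / np.sum(s**delta)**(1.0 / delta))
    dims = np.array(dims)
    return alpha * np.sum(M[0]) + np.sum(dims**p)**(1.0 / p)
```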
5.1 Modification to GDM

The proposed modification to the objective function only trivially changes the state space (now it is the product of N (K + 1)-dimensional probability simplexes, as opposed to K-dimensional simplexes). Thus, our method of projecting states onto the convex domain is effectively the same. The change to the objective function does mean that we must re-evaluate the gradient of global dimension. The computation is very similar to the unmodified version, and the result is:

$$\frac{\partial GD}{\partial m_n^1} = \alpha\, m_n^1 \qquad (11)$$

and, for all k > 1,

$$\frac{\partial GD}{\partial m_n^k} = (V_k)_{(n,:)}\, \hat{d}_{\varepsilon,k}^{\,p-1} \left( \hat{d}_{\varepsilon,2}^{\,p} + \ldots + \hat{d}_{\varepsilon,K+1}^{\,p} \right)^{\frac{1}{p}-1} D_k\, U_k^T v_n, \qquad (12)$$

where the notation and constants are as defined in §4. Thus, the necessary modifications to GDM are:

1. Update the evaluation of the objective function GD according to (9).
2. Update the initialization of the state vector to include an outlier group.
3. Update the state projection routine to accommodate the additional dimensions in the domain.
4. Update the evaluation of ∇GD according to (11) and (12).
5.2 Practical Implementations of Outlier Rejection

We have described an idea for how to handle outliers, but it introduces a new parameter, α. It is not immediately clear how one should choose this parameter, and how sensitive the results will be to it. In theory one would need to choose an outlier cost, α, that is not so high that nothing is ever assigned to the outlier group, but not so low that large quantities of inliers are assigned to this group. The appropriate values would likely depend on multiple quantities, like intrinsic dimension, noise level, and distortion of the underlying subspaces. These are quantities that can vary not just between applications, but also from data set to data set for a single application. Applying the suggested modification exactly as proposed (and trying to “tune” this parameter) would therefore lead to an unreliable and unpredictable algorithm. We refer to this approach as GDM-Naive, and Figure 3 illustrates why this method is unsound. Instead, we propose two variations of this method, which lead to more reliable solutions.

1. GDM Known-Fraction: Run the proposed algorithm with a fixed, low value of α (we use α = 0.01) but stop
before the threshold step. Rank the data points according to their membership strengths to the outlier group. Remove a pre-set fraction of the data set (the part that most strongly affiliates with the outlier group). Continue with the classic (non-outlier) version of the variational algorithm on the surviving points only⁷; this provides the inlier segmentation. The points that were removed are labelled outliers.

2. GDM Model-Reassign: Run method 1 above (GDM Known-Fraction). Fit subspaces of appropriate dimension (round the empirical dimension) to each set in the resulting partition. Re-assign all points (including those that were decided to be outliers) according to their distances from each subspace. Call a point an outlier if it is more than some fixed distance, κ, from all of the subspaces.

7 We could skip this step and segment directly from the fuzzy assignment that we already have. Refining the membership matrix after removing the outliers is done to repair whatever damage the outliers may have done to the membership matrix before thresholding.

Each of the proposed methods handles the task of selecting α, but introduces a new parameter. For method 1, this is the percentage of the data set to throw out. For method 2, the new parameter is the maximum distance a point can be from a subspace to be considered an inlier. Both of these parameters are more natural than selecting α. In a noisy environment, one may have an idea, based on experiments, of what percentage of the data set will be outliers, or what the inlier modelling error tends to be. Additionally, when using the “Model-Reassign” method, one could find the average and variance of the residuals, µ and σ², respectively, when fitting subspaces to the inlier clusters. These quantities can be used to come up with a reasonable value of κ for a given application (µ + rσ for some r). One could also find these values on a per-cluster basis and have a different outlier threshold for each cluster.
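A rough sketch of the Model-Reassign step under the assumptions stated above (linear subspaces through the origin, and a single global threshold κ = µ + rσ); the function name and the parameter r are ours, and the empirical_dimension helper from §3.1 is assumed.

```python
import numpy as np

def model_reassign(X, clusters, eps=0.35, r=3.0):
    """X: D x N data; clusters: list of index arrays from GDM Known-Fraction.
    Returns labels (cluster index per point, -1 for outliers) and the threshold kappa."""
    bases, residuals = [], []
    for idx in clusters:
        Xc = X[:, idx]
        d = max(1, int(round(empirical_dimension(Xc, eps))))   # round the empirical dimension
        U = np.linalg.svd(Xc, full_matrices=False)[0][:, :d]   # basis of the fitted subspace
        residuals.append(np.linalg.norm(Xc - U @ (U.T @ Xc), axis=0))
        bases.append(U)
    res = np.concatenate(residuals)
    kappa = res.mean() + r * res.std()                         # kappa = mu + r * sigma
    dist = np.stack([np.linalg.norm(X - U @ (U.T @ X), axis=0) for U in bases])
    labels = dist.argmin(axis=0)
    labels[dist.min(axis=0) > kappa] = -1                      # farther than kappa from every subspace
    return labels, kappa
```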
6 Results on Real-World Data

6.1 Performance in the Absence of Outliers

We tested the GDM algorithm on 2 motion segmentation databases. First, we used the outlier-free RAS database [35, 10] and compared with many leading methods in 2-view segmentation. We noticed that some of the HLM methods performed better when using the linearly embedded point correspondences than with the nonlinear embedding. Therefore, in Table 2 we present each of the competing HLM algorithms twice. Where “Linear” appears, the algorithm was run on the feature trajectories in R4. Where “Nonlinear” appears, the algorithm was run on the Kronecker products (in R9) of the standard homogeneous coordinates of
each feature correspondence. Figure 5 presents more details on the performance of the HLM methods with the nonlinear embedding, and Table 3 gives the average runtimes of these methods. The other HLM methods we included are SCC [12], MAPA [13], SSC [15], SLBF [51], and LRR [29]. We also included two other successful methods for two-views (for which code was available online): RAS [35] and HOSC [2]. Algorithm parameters and our experiment procedure are detailed in §8.4.
Fig. 3: Graphical depiction of the problem with GDM-Naive. [Three panels show the probability simplex of fuzzy assignment vectors, with quantization regions 1-3, clusters A and B, and the outlier group, for large, medium, and low α. Large α: nothing ends up in the outlier group. Medium α: only outliers end up in the outlier group. Low α: some inliers end up in the outlier group.] Each triangle represents the probability simplex containing the fuzzy assignment vectors for a fictitious data set. The fuzzy assignment for each point is plotted after many iterations of GDM. Points in red are inliers and points in green are outliers. The quantization regions (for the threshold step) are numbered 1-3. One can see that if α is not chosen correctly, points can be quantized into the wrong cluster. On the other hand, the outlier ranking of a point (the number of points closer to the outlier corner) is a more stable quantity. (Color figure online)

Fig. 4: Clustering by (a) SCC, (b) GDM, and (c) MAPA on file 6 of the outlier-free RAS database. (Color figure online)

Fig. 5: [Plot of % of sequences vs. classification error % for GDM, SCC, MAPA, SSC, SLBF, and LRR, all with the nonlinear embedding.] GDM is compared against other HLM methods on the nonlinear 2-view embedding of the outlier-free RAS database. (Color figure online)
Table 3: Average runtimes (per file) of HLM-based methods on non-linearly embedded (outlier-free) RAS data.

Method              GDM    SCC    MAPA   SSC    SLBF   LRR
Runtime (seconds)   12.7   2.3    5.6    89.5   4.0    0.8
From Table 2 and Figure 5, we can see that GDM performs very competitively on this database. There is only a single file (#8) on which GDM exhibits significant error. This file contains features from two bent magazines as well
as a rigid background. Since the bent magazines are clearly non-rigid, our model assumptions are not met (see Fig. 6). There were two methods in the comparison that had a lower average misclassification error than GDM (“SCC Linear” and “SLBF Linear”). This is because they perform significantly better on file (#8). Both of these are spectral methods, accompanied by the linear embedding, and are therefore better able to handle the manifold structure that results from the non-rigidity of the objects in this file. Amongst the other files however, GDM performs better on average than both of these two methods (see the last column of Table 2). Comparing just the HLM-based methods on the nonlinearly-embedded data, GDM performs better than any other method, with the most perfect classifications and the fewest number of files with significant errors. Figure 5 more clearly emphasizes this superb performance amongst methods using the nonlinear embedding. We also performed experiments on the Hopkins155 database [41]. For 2-view segmentation we extracted the first and last frame of each sequence and performed 2-view
segmentation on the nonlinear embedding (in R9) of the data. For comparison, we demonstrate the results of some other HLM algorithms on this embedded data: MAPA [13], SCC-MS [12, 51] and SLBF-MS [51]. We also supply results for a few state-of-the-art HLM methods on the full n-view feature trajectories. For these n-view results we chose in this table the best methods on Hopkins155 we are aware of, which do not require careful tuning with parameters: SSC [15] and SLBF-MS [51]. We also include the reference (REF) results [41]. REF finds the best linear models (via least squares approximation) for each cluster of embedded points (given the ground truth segmentation), and then finds new clusters by assigning points to the models they best agree with. For GDM on this database, it was necessary to increase the number of random initializations (n1 in Algorithm 1) to achieve reliable convergence (we changed it from 10 to 30). From Table 4 we see that GDM outperforms the other 2-view methods (although SCC matches or nearly matches its performance in some categories). We remark that we also tested a genetic algorithm for minimizing the global dimension and it achieved even more accurate results; however, we do not include it here since it is not as fast as GDM.

Table 2: Misclassification Rates (given as % Error) on the outlier-free RAS database.

Method/Embedding    File 1      2      3      4      5      6      7      8      9     10     11     12     13   Average  Average w/o File #8
GDM Nonlinear         0.85   0.00   1.57   0.65   0.00   0.00   0.00  12.76   0.00   0.00   0.00   0.00   0.00     1.22     0.26
SCC Linear            0.85   0.00   1.18   0.65   0.00   1.37   0.00   1.42   0.39   0.00   0.00   1.01   0.00     0.53     0.45
SCC Nonlinear         0.85   0.00  24.41   0.00   0.00  19.18   0.00   0.00   0.00  13.97   5.36   0.84   1.10     5.05     5.48
MAPA Linear           0.85   3.65   1.18   0.65   0.00  13.70  15.97   1.29   0.00   0.00   0.00   0.67   3.30     3.17     3.33
MAPA Nonlinear        0.85  20.55  21.65   0.65   0.00  21.92   6.25   7.73   0.00  13.97   1.43   0.34   3.30     7.59     7.57
SSC Linear            1.69  18.26   0.79   1.94   0.00   0.00   6.25  32.22   0.00   0.00  14.64   1.35   4.40     6.27     4.11
SSC Nonlinear         1.27   0.00  22.44   0.65   0.00  21.92   0.00   9.02   0.00  13.97   9.29  12.12   6.59     7.48     7.35
SLBF Linear           0.85   0.46   1.18   0.65   0.00   0.00   0.00   0.26   0.00   0.00   0.00   0.67   4.40     0.65     0.68
SLBF Nonlinear        0.85   0.00   5.12   1.94   0.00  19.18   0.00  10.57   0.00  13.97   0.00   1.68  14.29     5.20     4.75
LRR Linear            5.08  24.66   1.18   2.58   2.38   2.74   0.00  29.12   0.00   0.00   8.93  14.81  18.68     8.47     6.75
LRR Nonlinear         1.27   9.13   2.76   1.94   0.00   0.00   3.47   3.61   0.00   0.00   9.64  18.18   2.20     4.02     4.05
RAS                  11.65   0.00   2.56   9.68  16.19  26.03  26.74  11.21   3.28  13.97   3.21   2.36   6.59    10.27    10.19
HOSC d=2              0.85   0.00  24.41   1.61   0.00   0.00   0.00  22.94   0.00   0.00   0.00   0.00   2.20     4.00     2.42
HOSC d=3              1.27  23.74  24.41   3.23   0.00  19.18  12.15  19.59  23.75   0.00   1.43   1.01  17.58    11.33    10.65

Fig. 6: File 8 in the RAS database: (a) frame 1, (b) frame 2. This is a problematic file because the two magazines in the scene appear to undergo a non-rigid transformation between the two frames. Point correspondences are colored according to ground-truth segmentation. (Color figure online)
It is also interesting to note that our results for 2-views are comparable to the reference results with n-views. That is, the results of GDM are the best one can expect with pure linear modeling given many views and assuming an affine camera model. GDM for n-views gave comparable results and we thus did not include it. On the other hand, both SLBF-MS and SSC-N are able to obtain better results with n-views and this may be because their machinery of spectral clustering (together with good choices of spectral weights) allows them to take into account some of the manifold structure and nearness of points (information beyond linear modeling).

6.2 Performance in the Presence of Outliers

We tested the methods suggested in §5.2 on the outlier-corrupted RAS database [35]. The performance of classic GDM (no outlier rejection machinery) is also presented on this database, as is the performance of GDM on the corresponding outlier-free database (for comparison purposes). We also show results from three competing methods for segmenting motion with outliers: RAS [35], HOSC [2], and LRR [29, 28] with outlier rejection performed by identifying the largest columns of E, as suggested in [28, pg. 9]. The details of this experiment, including parameter values, are given in §8.4.

It is non-trivial to fairly compare different algorithms in the presence of outliers. Each method generally has at least one parameter for controlling how it handles outliers. This parameter balances the desire for a high outlier detection rate with a desire for a low false alarm rate (these two quantities are invariably correlated). Using any popular metric for evaluating segmentation accuracy (like misclassification rate for true inliers⁸), the performance of each algorithm will depend substantially on its outlier handling parameter. In general terms, if an algorithm is allowed to discard points as outliers more freely, then the accuracy on the surviving points will improve. Thus, if one method is more conservative than another in discarding points as outliers, the results

8 “True inliers” are points that are inliers according to ground truth.
Table 4: The mean and median percentage of misclassified points for two-motions and three-motions in the Hopkins 155 database, with comparisons to state-of-the-art n-view methods. Winning results amongst the 2-view methods are bold-faced in each category.

2-motion
                              Checker          Traffic          Articulated      All
                              Mean   Median    Mean   Median    Mean   Median    Mean   Median
2-view  GDM                    2.79   0.00      1.78   0.00      2.66   0.00      2.51   0.00
        MAPA                  12.85  14.07      6.49   6.93      7.15   5.33     10.69  10.03
        SCC (d=7)              2.79   0.00      1.97   0.00      3.42   0.00      2.64   0.00
        SLBF (d=6)             8.18   1.39      3.98   0.53      4.73   0.40      6.78   1.11
n-view  SLBF-MS (2F,3)         1.28   0.00      0.21   0.00      0.94   0.00      0.98   0.00
        SSC-N (4K,3)           1.29   0.00      0.29   0.00      0.97   0.00      1.00   0.00
        REF                    2.76   0.49      0.30   0.00      1.71   0.00      2.03   0.00

3-motion
                              Checker          Traffic          Articulated      All
                              Mean   Median    Mean   Median    Mean   Median    Mean   Median
2-view  GDM                    5.37   3.23      4.23   2.69      5.32   5.32      5.14   3.13
        MAPA                  21.89  19.49     13.15  13.04      9.04   9.04     19.41  18.09
        SCC (d=7)              8.05   5.85      4.67   5.45      5.85   5.85      7.25   5.45
        SLBF (d=6)            14.08  12.80      7.93   6.75      4.79   4.79     12.32   9.57
n-view  SLBF-MS (2F,3)         3.33   0.39      0.24   0.00      2.13   2.13      2.64   0.22
        SSC-N (4K,3)           3.22   0.29      0.53   0.00      2.13   2.13      2.62   0.22
        REF                    6.28   5.06      1.30   0.00      2.66   2.66      5.08   2.40
will likely be skewed in favor of one method over the other. It is therefore important when looking at segmentation accuracy to think in terms of accuracy for a given true positive rate (TPR) and false positive rate (FPR):
$$\mathrm{TPR} = \frac{\#\text{ of outliers that were identified as outliers}}{\#\text{ of outliers in dataset}} \cdot 100,$$

$$\mathrm{FPR} = \frac{\#\text{ of inliers that were identified as outliers}}{\#\text{ of inliers in dataset}} \cdot 100.$$
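In code, the two rates are straightforward to compute from boolean outlier labels (an illustrative helper of ours, not from the paper).

```python
import numpy as np

def tpr_fpr(predicted_outlier, true_outlier):
    """predicted_outlier, true_outlier: boolean arrays over all points."""
    tpr = 100.0 * np.sum(predicted_outlier & true_outlier) / np.sum(true_outlier)
    fpr = 100.0 * np.sum(predicted_outlier & ~true_outlier) / np.sum(~true_outlier)
    return tpr, fpr
```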
There are two aspects of these algorithms we wish to compare. The first is outlier detection performance (how good is each method at distinguishing between inliers and outliers). The second is segmentation performance, where we evaluate how good each method is at segmenting motions in the presence of outliers. To compare the outlier detection performance of multiple methods, a common tool is the ROC curve, which parametrically plots the TPR vs. FPR as a function of the outlier parameter for a method. A “random classifier” that randomly labels points as inliers or outliers will have an ROC curve lying along the line TPR = FPR. An ideal classifier will follow the line TPR = 1. Hence, methods can be compared by seeing which ROC curve is highest over the broadest range of FPRs (or over the FPRs one is interested in). The ROC curves for GDM (using the Model-Reassign outlier detection method and varying κ), LRR (by varying λ ), RAS (by varying “outlierFraction”), and HOSC (by varying α), are presented in Fig. 7.
GDM was again run using the nonlinear embedding of the data. HOSC was run with the linear embedding and LRR was run with the nonlinear embedding, since these were the cases that yielded the best performance in the outlier-free tests for each algorithm (see §8.4 for more details). From Fig. 7 we can see that GDM is very competitive at detecting outliers on this database. At low FPRs GDM yields comparatively excellent performance. At higher FPRs HOSC has a moderate advantage in outlier detection vs. GDM. However, it will be seen later (Table 5) that HOSC is not competitive at segmentation in the presence of outliers. Furthermore, the presented HOSC results were prepared using d = 2 (see §8.4), instead of d = 3 as argued for by its authors. Using d = 3 gave worse results and made the algorithm take an extremely long time to execute.

The TPR and FPR for a robust segmentation algorithm cannot generally be controlled independently or arbitrarily. Thus, for a comparison of segmentation accuracy, one must select “reasonable” parameters for each method, which correspond to the same general region of ROC space. It should be understood that, since the TPR and FPR cannot be controlled exactly for each method, any such comparison is inherently unfair, and by manipulating outlier parameters the results can be skewed somewhat in any direction. For the purpose of fairly comparing GDM with other methods, we must select only one of the suggested outlier detection schemes for GDM (“GDM - Known Fraction” or “GDM - Model-Reassign”). To effectively use “GDM - Known Fraction”, one must either know roughly what fraction of his or her data are going to be outliers, or be in a situation where over-rejecting points as outliers is acceptable (you can then over-estimate the outlier fraction).
Table 5: Misclassification rates (given as % error) of inliers on the RAS database. All but 'GDM - clean' are misclassification rates when run on the outlier-corrupted datasets. 'GDM - clean' gives the performance of the unmodified GDM algorithm when run on the outlier-removed datasets (included as a reference).

                                                      File Number
Method                  1      2      3      4      5      6      7      8      9      10     11     12     13     Average  Average w/o File #8
GDM - Model-Reassign    2.97   0.00   4.33   1.29   0.95   0.00   0.00  12.63   0.00   6.62   0.00   2.02  17.58     3.72      2.98
GDM - Classic           0.85   0.00   1.57  32.26   0.00   0.00   0.00  22.94   0.00   0.00  22.14   8.75  16.48     8.08      6.84
RAS                    19.49   5.02   1.97   5.81  15.71  23.29  25.00  11.86   2.32  13.97  12.14  18.18  21.98    13.60     13.74
LRR                     4.24  20.55  22.83   7.10   7.14   8.22  18.75  34.54   2.32  27.21   8.57  11.78  25.27    15.27     13.67
HOSC (d=2)             11.02  22.37  16.54  33.55  10.95   2.74  11.11  11.34   3.09  13.97  36.79  66.67   8.79    19.15     19.80
GDM - clean             0.85   0.00   1.57   0.65   0.00   0.00   0.00  12.76   0.00   0.00   0.00   0.00   0.00     1.22      0.26
Table 6: True Positive Rate (TPR) and False Positive Rate (FPR) for each method in our segmentation comparison in Table 5. GDM - Model-Reassign, RAS, LRR, and HOSC were each tuned to achieve a false positive rate in the range of 0.01 to 0.08.

Method                  TPR    FPR
GDM - Model-Reassign    0.56   0.01
GDM - Classic           NA     NA
RAS                     0.74   0.08
LRR                     0.49   0.04
HOSC                    0.71   0.06
GDM - clean             NA     NA
[Fig. 7 plots ROC curves (true positive rate vs. false positive rate) for GDM, HOSC, LRR, and RAS, together with the TPR = FPR line of a random classifier.]

Fig. 7: The outlier detection performance of GDM (Model-Reassign) is compared against other motion segmentation methods on the outlier-corrupted RAS database. (Color figure online)
Since this is not usually the case, we will consider the results of “GDM - Model-Reassign” when comparing with other methods.
In Table 5 we present a file-by-file comparison of segmentation accuracy for the aforementioned methods using parameters that place the FPR of each method in the range of 0.01 to 0.08. Table 6 reports the average TPR and FPR for each of these methods.
One can see from Table 5 that “GDM - Model-Reassign” yields an overall improvement in segmentation accuracy (vs. “GDM - Classic”) in the presence of outliers. There are several files where the outliers cause the classic GDM method to misclassify large fractions of the data sets (files 4, 8, and 11 have inlier misclassification rates over 20%). On these files the error rates of “GDM - Model-Reassign” are dramatically lower. There are some files where the outlier detection framework appears to hurt performance, but in most of these cases the degradation is slight. The results for GDM are better in most cases (and on average) than those of the competing methods, although there are a few files where GDM is outperformed by a small margin. In contrast with the strong outlier detection performance of HOSC discussed earlier, the segmentation capabilities of HOSC appear very intolerant to outliers (if even a few outliers slip through, segmentation performance suffers).
7 Conclusions

We presented a new approach to 2-view motion segmentation, which is also a general method for HLM. Its development was motivated by the main obstacle to recovering multiple subspaces within the nonlinear embedding of point correspondences into R^9, namely, the nonuniform distributions along subspaces (of unknown dimensions). The idea was to minimize a single global quantity, the global dimension. Unlike KSCC [10], which also exploits global information, this approach does not make an a-priori assumption on the
dimensions of the underlying subspaces. We formulated a fast method to minimize this global dimension, which we referred to as GDM. We demonstrated state-of-the-art results of GDM for 2-view motion segmentation. We carefully explained the meaning of the two main parameters in our algorithm, p and ε, and the trade-offs they express. We gave a theoretical basis for selecting an appropriate value of p. Needless to say, these parameters are fixed throughout the paper. We described a preliminary theory which motivated the notion of global dimension, and we justified why it makes sense as an objective function in our application. Finally, we presented an outlier detection/rejection framework for GDM. We explored two complementary implementations of this framework, and we presented results demonstrating that it is competitive at handling outliers in this application.
8 Appendix

8.1 Proof of Theorem 1

We prove the four properties of the statement of the theorem. For simplicity we assume that D < N; that is, the number of data points is greater than the dimension of the ambient space. This is the usual case in many applications.

Proof of Property 1: Clearly, scaling all data vectors by α ≠ 0 results in scaling all the singular values of the corresponding data matrix by |α|. Furthermore, this results in scaling by |α| both the numerator and denominator of the expression for the empirical dimension for any ε > 0. Therefore, the empirical dimension is invariant to this scaling.

Proof of Property 2: The singular values of a matrix (in particular the data matrix) are invariant to any orthogonal transformation of this matrix, and thus the empirical dimension is invariant to such a transformation.

Proof of Property 3: If $\{v_i\}_{i=1}^N$ are contained in a d-subspace, then, since these form the columns of A, rank(A) ≤ d. Since U and V are orthogonal, rank(A) = rank(Σ). In particular, A has at most d non-zero singular values. Let σ be the vector of singular values of A, and let $1_\sigma$ be the indicator vector of σ ($1_\sigma$ has a 1 in each coordinate where σ has a non-zero element, and 0's in all other coordinates). The generalized Hölder inequality [19, pg. 10] states that if
\[ p_1, p_2 \in (0, \infty] \quad\text{and}\quad \frac{1}{p_1} + \frac{1}{p_2} = \frac{1}{r}, \tag{13} \]
then
\[ \| f_1 f_2 \|_r \leq \| f_1 \|_{p_1} \| f_2 \|_{p_2} \tag{14} \]
for any functions $f_1$ and $f_2$. To apply this result to vectors, we view them as functions over the set {1, 2, ..., D} with counting measure. Let $p_1 = 1$, $p_2 = \varepsilon/(1-\varepsilon)$, and $r = \varepsilon$. Also let $f_1 = 1_\sigma$ and $f_2 = \sigma$. These values satisfy (13). We therefore get
\[ \frac{\|\sigma\|_\varepsilon}{\|\sigma\|_{\varepsilon/(1-\varepsilon)}} \leq \| 1_\sigma \|_1 = (\#\text{ of non-zero sing. values of } A) \leq d. \tag{15} \]

Proof of Property 4: By hypothesis, the data vectors $\{v_i\}_{i=1}^N$ are i.i.d. and sampled according to a probability measure µ, where µ is sub-Gaussian, non-degenerate, and spherically symmetric in a d-subspace of $\mathbb{R}^D$. We define the n-th data matrix
\[ A_n = \begin{bmatrix} v_1 & v_2 & v_3 & \cdots & v_n \end{bmatrix}. \]
Then $\Sigma_n := \frac{1}{n} A_n A_n^T$ is the n-th sample covariance matrix of our data set. Also, let v be a random variable with probability measure µ. Then $\Sigma := E[v v^T]$ is the covariance matrix of the distribution. A consequence of µ being spherically symmetric in a d-subspace is that, after an appropriate rotation of space, Σ is diagonal with a fixed constant in d of its diagonal entries and 0 in all other locations. We are trying to prove a result about empirical dimension, which is scale invariant and invariant under rotations of space. Because of these two properties we can assume that the appropriate rotation and scaling has been done, so that Σ is diagonal with value 1 in d diagonal entries and 0 in all others. Without any loss of generality, we assume that the first d diagonal entries are the non-zero ones.

Let $\sigma_n = (\sigma_{n,1}, \sigma_{n,2}, ..., \sigma_{n,D})^T$, $n \geq D$, denote the vector of singular values of the matrix $A_n$. Our first task will be to show that $\sigma_n / \sqrt{n}$ converges in probability (as $n \to \infty$) to the vector
\[ (\underbrace{1, 1, ..., 1}_{d}, 0, ..., 0)^T. \tag{16} \]
To accomplish our task, we will first relate $\sigma_n$ to the vector of singular values of $\Sigma_n$, and then use a result showing that $\Sigma_n$ converges to Σ as $n \to \infty$. It is clear that the vector of singular values of $\Sigma_n$, which we will denote by ψ, is given by
\[ \psi = \frac{1}{n} \left( \sigma_{n,1}^2, \sigma_{n,2}^2, ..., \sigma_{n,D}^2 \right)^T. \tag{17} \]

Next, we will need the following result regarding covariance estimation. This is Corollary 5.50 of [43], adapted to be consistent with our notation.

Lemma 1 (Covariance Estimation): Consider a sub-Gaussian distribution in $\mathbb{R}^D$ with covariance matrix Σ. Let γ ∈ (0, 1) and t ≥ 1. If $n > C(t/\gamma)^2 D$, then with probability at least $1 - 2e^{-t^2 D}$ we have $\|\Sigma_n - \Sigma\|_2 \leq \gamma$, where $\|\cdot\|_2$ denotes the spectral norm (i.e., the largest singular value of the matrix). The constant C depends only on the sub-Gaussian norm of the distribution.

In our problem, we are applying this lemma to the distribution µ. Let γ ∈ (0, 1) be given. If
\[ n > C(t/\gamma)^2 D, \tag{18} \]
then $\|\Sigma_n - \Sigma\|_2 \leq \gamma$ with probability at least $1 - 2e^{-t^2 D}$. The 2-norm of the difference of two matrices bounds the differences of their individual singular values. We will use the following result to make this precise:

Lemma 2 [7]: Let $\sigma_i(\cdot)$ denote the i-th largest singular value of an arbitrary m-by-n matrix. Then $|\sigma_i(B + E) - \sigma_i(B)| \leq \|E\|_2$ for each i.

Because Σ is diagonal with only values 1 and 0 on the diagonal, the singular values of Σ are simply these diagonal values. We will use $1_{i \in 1:d}$ to denote the i-th singular value of Σ. Setting $B = \Sigma_n$ and $E = \Sigma - \Sigma_n$ in Lemma 2, we get
\[ \|\Sigma_n - \Sigma\|_2 \leq \gamma \;\Rightarrow\; \left| \tfrac{1}{n}\sigma_{n,i}^2 - 1_{i \in 1:d} \right| \leq \|\Sigma_n - \Sigma\|_2 \leq \gamma, \quad \text{for each } i. \]
This implies that
\[ \frac{\sigma_{n,i}}{\sqrt{n}} \in \begin{cases} \left[ \sqrt{1-\gamma},\, \sqrt{1+\gamma} \right], & \text{if } i \leq d; \\ \left[ 0,\, \sqrt{\gamma} \right], & \text{if } i > d. \end{cases} \tag{19} \]
Notice that as γ → 0, $\sigma_{n,i}/\sqrt{n}$ approaches $1_{i \in 1:d}$. Specifically, for any desired tolerance η > 0 and any desired certainty ξ, n can be chosen large enough that, with probability greater than ξ, $\left| 1_{i \in 1:d} - \sigma_{n,i}/\sqrt{n} \right| < \eta$ simultaneously for each i. It follows from this that the vector $\sigma_n/\sqrt{n}$ converges in probability to (16) as $n \to \infty$.

Finally,
\[ \hat{d}_{\varepsilon,n} = \frac{\|\sigma_n\|_\varepsilon}{\|\sigma_n\|_{\varepsilon/(1-\varepsilon)}} = \frac{\frac{1}{\sqrt{n}}\|\sigma_n\|_\varepsilon}{\frac{1}{\sqrt{n}}\|\sigma_n\|_{\varepsilon/(1-\varepsilon)}} = \frac{\left\| \sigma_n/\sqrt{n} \right\|_\varepsilon}{\left\| \sigma_n/\sqrt{n} \right\|_{\varepsilon/(1-\varepsilon)}}. \]
Thus, $\hat{d}_{\varepsilon,n}$ is a continuous function of the vector $\sigma_n/\sqrt{n}$. Hence, since $\sigma_n/\sqrt{n}$ converges to (16) as $n \to \infty$, $\hat{d}_{\varepsilon,n}$ converges in probability to
\[ \frac{\|(1, 1, ..., 1, 0, ..., 0)\|_\varepsilon}{\|(1, 1, ..., 1, 0, ..., 0)\|_{\varepsilon/(1-\varepsilon)}} = \frac{d^{1/\varepsilon}}{d^{(1-\varepsilon)/\varepsilon}} = d^{\frac{1}{\varepsilon} - \frac{1-\varepsilon}{\varepsilon}} = d. \tag{20} \]
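To make Properties 3 and 4 concrete, the following minimal NumPy sketch (ours, not from the original paper; the function name is hypothetical) computes the empirical dimension from the singular values of a data matrix and checks numerically that it never exceeds, and converges to, the subspace dimension d for spherically symmetric data on a d-subspace:

```python
import numpy as np

def empirical_dimension(A, eps=0.5):
    """Empirical dimension d_eps = ||sigma||_eps / ||sigma||_{eps/(1-eps)},
    where sigma are the singular values of the D-by-N data matrix A."""
    sigma = np.linalg.svd(A, compute_uv=False)
    delta = eps / (1.0 - eps)
    num = np.sum(sigma ** eps) ** (1.0 / eps)
    den = np.sum(sigma ** delta) ** (1.0 / delta)
    return num / den

# Sample N points i.i.d. from a spherically symmetric (Gaussian) distribution
# supported on a d-dimensional subspace of R^D, then observe that
# d_eps <= d (Property 3) and d_eps -> d as N grows (Property 4).
rng = np.random.default_rng(0)
D, d = 9, 3
basis = np.linalg.qr(rng.standard_normal((D, d)))[0]   # orthonormal basis of a d-subspace
for N in (50, 500, 5000):
    A = basis @ rng.standard_normal((d, N))            # D-by-N data matrix
    print(N, empirical_dimension(A, eps=0.5))          # approaches d = 3 from below
```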
8.2 Proof of Theorem 2

Recall that $\Pi_{Nat}$ denotes the natural partition of the data set. First, we notice that $GD(\Pi_{Nat}) = \|(d_1, d_2, ..., d_K)\|_p$, where $d_k$ is the true dimension of set k of the partition. Notice that $d_k$ cannot exceed d since $\mu_k$ is supported by $L_k$, a d-subspace. Furthermore, since $\mu_k$ does not concentrate mass on subspaces, it is a probability 0 event that all $N_k$ points from $L_k$ lie in a proper subspace of $L_k$. Thus, for the natural partition, $d_k$ is almost surely d, for each k. Hence, $GD(\Pi_{Nat})$ is almost surely $\|(d, d, ..., d)\|_p = (K d^p)^{1/p} = K^{1/p} d$.

Next, we will find a lower bound for the global dimension of any non-natural partition of the data, and show that if p meets the hypothesis criteria, the lower bound we get is greater than $K^{1/p} d = GD(\Pi_{Nat})$. To accomplish this we need the following lemma.

Lemma 1 If $\Pi \neq \Pi_{Nat}$ then Π almost surely has one set with dimension at least d + 1.

Before proving the lemma, observe that a consequence is that if $\Pi \neq \Pi_{Nat}$, then with probability 1:
\[ GD(\Pi) \geq \|(?, ..., ?, d+1, ?, ..., ?)\|_p \geq d + 1. \tag{21} \]
Then, from our hypothesis:
\[ p > \frac{\ln(K)}{\ln(d+1) - \ln(d)} \;\Longrightarrow\; \left( \frac{d+1}{d} \right)^p > K \;\Longrightarrow\; d + 1 > K^{1/p} d. \tag{22} \]
Hence,
\[ GD(\Pi) \geq d + 1 > K^{1/p} d = GD(\Pi_{Nat}). \tag{23} \]
Thus, if we show Lemma 1, the proof of the theorem follows. To prove Lemma 1 we require a simpler lemma.

Lemma 2 If a set Q in Π has fewer than d points from a subspace $L_i$, then either Q has dimension at least d + 1, or adding another point from $L_i$ to Q (an R.V. X with probability measure $\mu_i$, independent from all other samples) will almost surely increase the dimension of Q by 1.

Proof If dim(Q) ≤ d then Q has dimension strictly less than the ambient space ($\mathbb{R}^D$). Observe that span(Q) is a linear subspace of $\mathbb{R}^D$ which a.s. does not contain $L_i$: we cannot have proper containment, since $\dim(L_i) = d \geq \dim(Q)$; moreover, we have fewer than d points from $L_i$ in Q, and each other point in Q lies in $L_i$ with probability 0 (all the $\mu_i$ do not concentrate mass on subspaces), so span(Q) a.s. does not equal $L_i$. Therefore, if we intersect $L_i$ with span(Q) we get a proper subspace of $L_i$; call it $\bar{L}$. We note that $\mu_i(\bar{L}) = 0$ since $\mu_i$ does not concentrate on subspaces. Thus, since X has probability measure $\mu_i$, X a.s. lies outside the intersection of $L_i$ and span(Q). It follows that if we add X to Q, the dimension of Q a.s. increases by 1.

Now we prove Lemma 1. We will assume all sets in Π have dimension less than d + 1 and pursue a contradiction. By hypothesis, our set $\{v_n\}_{n=1}^N$ contains at least d + 1 points from each subspace $L_i$. Since $\Pi \neq \Pi_{Nat}$, there is some subspace $L^*$ whose points are assigned to 2 or more distinct sets in Π. Let $v^*$ be a point from $L^*$. Now, choose d points from each $L_i$ and denote this collection of Kd points $\{y_1, y_2, ..., y_{Kd}\}$. When making this selection, ensure that $v^*$ is not chosen and that, of the points selected from $L^*$, not all of them are assigned to the same set in Π as $v^*$. Notice that Π induces a partition on $\{y_1, y_2, ..., y_{Kd}\}$.

Select any point $y_i$ and remove it from the set $\{y_1, y_2, ..., y_{Kd}\}$. Since we are assuming that each set in Π has dimension less than d + 1, Lemma 2 implies that the set in Π to which $y_i$ belongs will have its dimension decrease by 1. Now select another point $y_j$ and remove it. Lemma 2 still applies, and so the set to which $y_j$ belonged will have its dimension decrease by 1. We can repeat this until all Kd points have been removed. Since each removal decreases the dimension of some set in Π by 1, it follows that before any removals the sum of the dimensions of all sets in Π was at least Kd. Since each of the K sets in Π had dimension d or less, we conclude that in fact each set must have had dimension exactly d.

Now, consider our set $\{y_1, y_2, ..., y_{Kd}\}$ and add in $v^*$. By our choice of $v^*$, Lemma 2 implies that its addition a.s. increases the dimension of its target set in Π by 1 (to d + 1). Adding in all remaining points from $\{v_n\}_{n=1}^N$ will only increase the dimensions of the sets in Π. Thus, we almost surely have a set of dimension at least d + 1 in Π, contradicting our hypothesis.
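As a quick numerical illustration of Theorem 2 (ours, not from the original paper; it uses the empirical dimension as a computable stand-in for the exact dimension of each set), one can compare the global dimension of the natural partition with that of a partition that mixes points from different subspaces. For example, with K = 3 clusters of dimension d = 3, the theorem asks for p > ln(3)/(ln 4 − ln 3) ≈ 3.82, so p = 4 suffices:

```python
import numpy as np

def empirical_dimension(A, eps=0.5):
    # Empirical dimension ||sigma||_eps / ||sigma||_{eps/(1-eps)} (as in the sketch above).
    s = np.linalg.svd(A, compute_uv=False)
    delta = eps / (1.0 - eps)
    return np.sum(s ** eps) ** (1.0 / eps) / np.sum(s ** delta) ** (1.0 / delta)

def global_dimension(clusters, p=4.0, eps=0.5):
    # GD of a hard partition: the l_p norm of the per-cluster (empirical) dimensions.
    dims = np.array([empirical_dimension(A_k, eps) for A_k in clusters])
    return np.sum(dims ** p) ** (1.0 / p)

rng = np.random.default_rng(1)
D, d, K, N = 9, 3, 3, 300
bases = [np.linalg.qr(rng.standard_normal((D, d)))[0] for _ in range(K)]
data = [B @ rng.standard_normal((d, N)) for B in bases]   # K clusters, each in its own d-subspace

gd_natural = global_dimension(data)    # roughly K**(1/p) * d = 3**0.25 * 3 (about 3.9)
mixed = [np.hstack([data[0][:, :N//2], data[1][:, :N//2]]),   # swap halves of clusters 0 and 1
         np.hstack([data[0][:, N//2:], data[1][:, N//2:]]),
         data[2]]
gd_mixed = global_dimension(mixed)     # larger: the mixed sets span roughly 2d dimensions
print(gd_natural, gd_mixed)
```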
8.3 Proof of Theorem 3

Recall that the soft partition is stored in a membership matrix M. Specifically, the (k, n)'th element of M, denoted $m_k^n$, holds the “probability” that vector $v_n$ belongs to cluster k. Thus, each column of M forms a probability vector. Hence, global dimension is a real-valued function of the matrix M. We will think of the membership matrix as being vectorized, so that the domain of optimization can be thought of as a subset of $\mathbb{R}^{NK}$. However, we will not explicitly vectorize the membership matrix. Thus, when we talk about the gradient of global dimension, we are referring to another K-by-N matrix, where the (k, n)'th element is the derivative of global dimension w.r.t. $m_k^n$. To differentiate global dimension we must be able to differentiate the singular values of a matrix w.r.t. each element of that matrix. A treatment of this is available in [34].

To begin, recall the definition of GD:
\[ GD = \left\| \begin{pmatrix} \hat{d}_\varepsilon^1 \\ \hat{d}_\varepsilon^2 \\ \vdots \\ \hat{d}_\varepsilon^K \end{pmatrix} \right\|_p = \left( (\hat{d}_\varepsilon^1)^p + (\hat{d}_\varepsilon^2)^p + ... + (\hat{d}_\varepsilon^K)^p \right)^{1/p}. \tag{24} \]
We will denote the thin SVD (only D columns of U and V are used) of $A_k$ by
\[ A_k = U_k \Sigma_k V_k^T. \tag{25} \]
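Before differentiating, it may help to see how GD is evaluated for a given membership matrix. The sketch below (ours, not the authors' implementation; it assumes, consistently with (33) below, that the k-th cluster matrix is the data matrix with its n-th column scaled by $m_k^n$) computes the per-cluster empirical dimensions and their $\ell_p$ norm:

```python
import numpy as np

def cluster_matrices(A, M):
    """Weighted cluster matrices A_k from the D-by-N data matrix A and the K-by-N
    membership matrix M, taking A_k(:, n) = M[k, n] * A[:, n] (consistent with (33))."""
    return [A * M[k][None, :] for k in range(M.shape[0])]

def global_dimension_soft(A, M, p=4.0, eps=0.5):
    """Global dimension of a soft partition: l_p norm of the per-cluster empirical dimensions."""
    delta = eps / (1.0 - eps)
    dims = []
    for A_k in cluster_matrices(A, M):
        s = np.linalg.svd(A_k, compute_uv=False)
        dims.append(np.sum(s ** eps) ** (1.0 / eps) / np.sum(s ** delta) ** (1.0 / delta))
    return np.sum(np.array(dims) ** p) ** (1.0 / p)
```

Since each column of M must remain a probability vector, the natural constraint set for a projected gradient method on this objective is a product of probability simplices (one simplex per data point).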
Also, we will let $\sigma_j^i$ refer to the (j, j)'th element of $\Sigma_i$. Then, using the chain rule:
\[ \frac{\partial GD}{\partial m_k^n} = \frac{\partial GD}{\partial \hat{d}_\varepsilon^1} \frac{\partial \hat{d}_\varepsilon^1}{\partial m_k^n} + \frac{\partial GD}{\partial \hat{d}_\varepsilon^2} \frac{\partial \hat{d}_\varepsilon^2}{\partial m_k^n} + ... + \frac{\partial GD}{\partial \hat{d}_\varepsilon^K} \frac{\partial \hat{d}_\varepsilon^K}{\partial m_k^n}. \tag{26} \]
From (24) we can compute $\partial GD / \partial \hat{d}_\varepsilon^i$ rather easily:
\[ \frac{\partial GD}{\partial \hat{d}_\varepsilon^i} = \frac{1}{p} \left( (\hat{d}_\varepsilon^1)^p + (\hat{d}_\varepsilon^2)^p + ... + (\hat{d}_\varepsilon^K)^p \right)^{\frac{1}{p}-1} p \, (\hat{d}_\varepsilon^i)^{p-1} = (\hat{d}_\varepsilon^i)^{p-1} \left( (\hat{d}_\varepsilon^1)^p + (\hat{d}_\varepsilon^2)^p + ... + (\hat{d}_\varepsilon^K)^p \right)^{\frac{1}{p}-1}. \tag{27} \]
Next, we expand the other components of (26):
\[ \frac{\partial \hat{d}_\varepsilon^i}{\partial m_k^n} = \sum_{j=1}^{D} \frac{\partial \hat{d}_\varepsilon^i}{\partial \sigma_j^i} \frac{\partial \sigma_j^i}{\partial m_k^n}. \tag{28} \]
We now use the definition of $\hat{d}_\varepsilon^i$ to compute the first factor of each term as follows (recall that δ is shorthand for ε/(1−ε)):
\[ \frac{\partial \hat{d}_\varepsilon^i}{\partial \sigma_j^i} = \frac{ \|\sigma^i\|_\delta \, \frac{\partial}{\partial \sigma_j^i} \left( (\sigma_1^i)^\varepsilon + ... + (\sigma_D^i)^\varepsilon \right)^{1/\varepsilon} - \|\sigma^i\|_\varepsilon \, \frac{\partial}{\partial \sigma_j^i} \left( (\sigma_1^i)^\delta + ... + (\sigma_D^i)^\delta \right)^{1/\delta} }{ \|\sigma^i\|_\delta^2 } \]
\[ = \frac{ \|\sigma^i\|_\delta \left( (\sigma_1^i)^\varepsilon + ... + (\sigma_D^i)^\varepsilon \right)^{\frac{1-\varepsilon}{\varepsilon}} (\sigma_j^i)^{\varepsilon-1} }{ \|\sigma^i\|_\delta^2 } - \frac{ \|\sigma^i\|_\varepsilon \left( (\sigma_1^i)^\delta + ... + (\sigma_D^i)^\delta \right)^{\frac{1-\delta}{\delta}} (\sigma_j^i)^{\delta-1} }{ \|\sigma^i\|_\delta^2 } \]
\[ = \frac{1}{\|\sigma^i\|_\delta^2} \left( \|\sigma^i\|_\delta \|\sigma^i\|_\varepsilon^{1-\varepsilon} (\sigma_j^i)^{\varepsilon-1} \right) - \frac{1}{\|\sigma^i\|_\delta^2} \left( \|\sigma^i\|_\varepsilon \|\sigma^i\|_\delta^{1-\delta} (\sigma_j^i)^{\delta-1} \right) = C_1^i (\sigma_j^i)^{\varepsilon-1} - C_2^i (\sigma_j^i)^{\delta-1}, \tag{29} \]
where
\[ C_1^i = \frac{\|\sigma^i\|_\varepsilon^{1-\varepsilon} \|\sigma^i\|_\delta}{\|\sigma^i\|_\delta^2}, \qquad C_2^i = \frac{\|\sigma^i\|_\varepsilon \|\sigma^i\|_\delta^{1-\delta}}{\|\sigma^i\|_\delta^2}. \tag{30} \]
Next, we must evaluate the second factor in each term of (28). Recall that $\sigma_j^i$ is the j'th largest singular value of the matrix $A_i$. To achieve the next step, we must observe that each singular value of $A_i$ depends, in general, on each element of the matrix $A_i$. We can then compute the derivative of each element of the matrix $A_i$ w.r.t. each membership variable $m_k^n$. We will denote the (α, β)'th element of the matrix $A_i$ by $A_{i(\alpha,\beta)}$. Using the chain rule:
\[ \frac{\partial \sigma_j^i}{\partial m_k^n} = \sum_{\beta=1}^{N} \sum_{\alpha=1}^{D} \frac{\partial \sigma_j^i}{\partial A_{i(\alpha,\beta)}} \frac{\partial A_{i(\alpha,\beta)}}{\partial m_k^n}. \tag{31} \]
A powerful result [34, eqn. 7] allows us to express the partial derivative of each singular value $\sigma_j^i$ w.r.t. a given matrix element in terms of the already-known SVD of $A_i$:
\[ \frac{\partial \sigma_j^i}{\partial A_{i(\alpha,\beta)}} = U_{i(\alpha,j)} V_{i(\beta,j)}. \tag{32} \]
The second factor in each term of (31) can be evaluated directly from the definition of $A_k$:
\[ \frac{\partial A_{i(\alpha,\beta)}}{\partial m_k^n} = \begin{cases} 0, & \text{if } n \neq \beta \text{ or if } i \neq k; \\ v_n \cdot \hat{e}_\alpha, & \text{if } n = \beta \text{ and } i = k, \end{cases} \tag{33} \]
where $\hat{e}_\alpha$ denotes the α'th standard basis vector (1 in position α and 0's everywhere else).

We are now in a position to work backwards and construct the partial derivative of GD w.r.t. $m_k^n$. In what follows, $\delta_{ik}$ is equal to 1 if i = k and is 0 otherwise (this is not to be confused with the un-subscripted δ, which is shorthand for ε/(1−ε)). Also, for notational convenience, we use Matlab notation to represent a row or column of a matrix ($B_{(w,:)}$ and $B_{(:,w)}$, respectively). We first compute $\partial \sigma_j^i / \partial m_k^n$ as follows:
\[ \frac{\partial \sigma_j^i}{\partial m_k^n} = \sum_{\alpha=1}^{D} \frac{\partial \sigma_j^i}{\partial A_{i(\alpha,n)}} \frac{\partial A_{i(\alpha,n)}}{\partial m_k^n} = \sum_{\alpha=1}^{D} U_{i(\alpha,j)} V_{i(n,j)} \left( v_n \cdot \hat{e}_\alpha \right) \delta_{ik} = V_{i(n,j)} \left( U_{i(:,j)} \cdot v_n \right) \delta_{ik}. \tag{34} \]
Then from (28), we get
\[ \frac{\partial \hat{d}_\varepsilon^i}{\partial m_k^n} = \sum_{j=1}^{D} \frac{\partial \hat{d}_\varepsilon^i}{\partial \sigma_j^i} \frac{\partial \sigma_j^i}{\partial m_k^n} = \sum_{j=1}^{D} \left( C_1^i (\sigma_j^i)^{\varepsilon-1} - C_2^i (\sigma_j^i)^{\delta-1} \right) V_{i(n,j)} \left( U_{i(:,j)} \cdot v_n \right) \delta_{ik}. \tag{35} \]
Now we can write:
\[ \frac{\partial \hat{d}_\varepsilon^i}{\partial m_k^n} = C_1^i \left( \sum_{j=1}^{D} (\sigma_j^i)^{\varepsilon-1} V_{i(n,j)} \left( U_{i(:,j)} \cdot v_n \right) \right) \delta_{ik} - C_2^i \left( \sum_{j=1}^{D} (\sigma_j^i)^{\delta-1} V_{i(n,j)} \left( U_{i(:,j)} \cdot v_n \right) \right) \delta_{ik}. \tag{36} \]
We now simplify the components of (36). After some manipulation, and using the notation
\[ (\Sigma_i)^{\varepsilon-1} = \begin{pmatrix} (\sigma_1^i)^{\varepsilon-1} & & 0 \\ & \ddots & \\ 0 & & (\sigma_D^i)^{\varepsilon-1} \end{pmatrix}, \tag{37} \]
we can write
\[ \sum_{j=1}^{D} (\sigma_j^i)^{\varepsilon-1} V_{i(n,j)} \left( U_{i(:,j)} \cdot v_n \right) = V_{i(n,:)} (\Sigma_i)^{\varepsilon-1} (U_i)^T v_n. \tag{38} \]
Similarly, we can simplify part of the second term of (36):
\[ \sum_{j=1}^{D} (\sigma_j^i)^{\delta-1} V_{i(n,j)} \left( U_{i(:,j)} \cdot v_n \right) = V_{i(n,:)} (\Sigma_i)^{\delta-1} (U_i)^T v_n. \tag{39} \]
Substituting (38) and (39) into (36), we get
\[ \frac{\partial \hat{d}_\varepsilon^i}{\partial m_k^n} = \left[ C_1^i V_{i(n,:)} (\Sigma_i)^{\varepsilon-1} (U_i)^T v_n - C_2^i V_{i(n,:)} (\Sigma_i)^{\delta-1} (U_i)^T v_n \right] \delta_{ik}. \]
With this expression we are ready to evaluate (26) as follows:
\[ \frac{\partial GD}{\partial m_k^n} = \sum_{i=1}^{K} \frac{\partial GD}{\partial \hat{d}_\varepsilon^i} \frac{\partial \hat{d}_\varepsilon^i}{\partial m_k^n} = \sum_{i=1}^{K} (\hat{d}_\varepsilon^i)^{p-1} \left( (\hat{d}_\varepsilon^1)^p + ... + (\hat{d}_\varepsilon^K)^p \right)^{\frac{1}{p}-1} \left[ C_1^i V_{i(n,:)} (\Sigma_i)^{\varepsilon-1} (U_i)^T v_n - C_2^i V_{i(n,:)} (\Sigma_i)^{\delta-1} (U_i)^T v_n \right] \delta_{ik} \]
\[ = (\hat{d}_\varepsilon^k)^{p-1} \left( (\hat{d}_\varepsilon^1)^p + ... + (\hat{d}_\varepsilon^K)^p \right)^{\frac{1}{p}-1} \left[ C_1^k V_{k(n,:)} (\Sigma_k)^{\varepsilon-1} (U_k)^T v_n - C_2^k V_{k(n,:)} (\Sigma_k)^{\delta-1} (U_k)^T v_n \right] \]
\[ = (\hat{d}_\varepsilon^k)^{p-1} \left\| \left( \hat{d}_\varepsilon^1, ..., \hat{d}_\varepsilon^K \right) \right\|_p^{1-p} V_{k(n,:)} \left[ C_1^k (\Sigma_k)^{\varepsilon-1} - C_2^k (\Sigma_k)^{\delta-1} \right] (U_k)^T v_n = (\hat{d}_\varepsilon^k)^{p-1} \left\| \left( \hat{d}_\varepsilon^1, ..., \hat{d}_\varepsilon^K \right) \right\|_p^{1-p} V_{k(n,:)} D_k (U_k)^T v_n, \tag{40} \]
where
\[ D_k = C_1^k (\Sigma_k)^{\varepsilon-1} - C_2^k (\Sigma_k)^{\delta-1}. \tag{41} \]
We re-write (40) as follows:
\[ \frac{\partial GD}{\partial m_k^n} = (\hat{d}_\varepsilon^k)^{p-1} \left\| \left( \hat{d}_\varepsilon^1, ..., \hat{d}_\varepsilon^K \right) \right\|_p^{1-p} V_{k(n,:)} D_k (U_k)^T A_{(:,n)}. \tag{42} \]
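The closed form (40)–(42) translates readily into code. The sketch below (ours, not the authors' released GDM implementation; it again assumes $A_k(:, n) = m_k^n v_n$, consistent with (33)) evaluates GD and its K-by-N gradient with respect to the membership matrix:

```python
import numpy as np

def gd_and_gradient(A, M, p=4.0, eps=0.5):
    """Global dimension GD(M) and its K-by-N gradient dGD/dm_k^n, following (24)-(42).
    A is the D-by-N data matrix; M is the K-by-N membership matrix (columns on the simplex).
    Assumes A_k(:, n) = M[k, n] * A[:, n], consistent with (33)."""
    D, N = A.shape
    K = M.shape[0]
    delta = eps / (1.0 - eps)
    d_hat = np.zeros(K)
    grad = np.zeros((K, N))
    per_cluster = []
    for k in range(K):
        A_k = A * M[k][None, :]                                    # weighted cluster matrix
        U, s, Vt = np.linalg.svd(A_k, full_matrices=False)         # thin SVD
        norm_eps = np.sum(s ** eps) ** (1.0 / eps)
        norm_delta = np.sum(s ** delta) ** (1.0 / delta)
        d_hat[k] = norm_eps / norm_delta                           # empirical dimension of cluster k
        C1 = norm_eps ** (1.0 - eps) * norm_delta / norm_delta ** 2    # (30)
        C2 = norm_eps * norm_delta ** (1.0 - delta) / norm_delta ** 2  # (30)
        per_cluster.append((U, s, Vt, C1, C2))
    gd = np.sum(d_hat ** p) ** (1.0 / p)                           # (24)
    for k in range(K):
        U, s, Vt, C1, C2 = per_cluster[k]
        Dk = C1 * s ** (eps - 1.0) - C2 * s ** (delta - 1.0)       # diagonal of (41)
        coeff = d_hat[k] ** (p - 1.0) * gd ** (1.0 - p)
        # (42): dGD/dm_k^n = coeff * V_k(n,:) D_k (U_k)^T A(:, n), evaluated for all n at once
        grad[k] = coeff * np.einsum('jn,j,jn->n', Vt, Dk, U.T @ A)
    return gd, grad
```

Note that $(\sigma_j)^{\varepsilon-1}$ and $(\sigma_j)^{\delta-1}$ blow up as $\sigma_j \to 0$, so a practical implementation would presumably need to guard against vanishing singular values; such safeguards are omitted from this sketch. A gradient step with this gradient, followed by projecting each column of M back onto the probability simplex, gives one iteration of a projected-gradient scheme along the lines of GDM.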
8.4 Experiment Setup

For our comparison on the outlier-free RAS database, we include the following methods: GDM, SCC [12] from www.math.umn.edu/~lerman/scc, MAPA [13] from www.math.duke.edu/~glchen/mapa.html, SSC [15] (version 1.0 based on CVX) from www.vision.jhu.edu/code, SLBF (& SLBF-MS) [51] from www.math.umn.edu/~lerman/lbf, LRR [29] from sites.google.com/site/guangcanliu, RAS [35] (obtained directly from the authors), and HOSC [2] from www.math.duke.edu/~glchen/hosc.html. For each method in our comparisons (outlier-free and our tests with outliers) the implementation of each algorithm is that of the original authors. Most of these codes were found on the respective authors' websites, although some codes were obtained from the authors directly when they could not be found online.

As a matter of good testing methodology, we ran each method 10 times on each file. This is because we want to avoid capturing any fluke occurrences of any method, but instead seek the "usual case" results (this is important for repeatability of the results). Of the 10 runs for a given file and method, the median error is reported. For deterministic methods, we get the same exact results for each run. GDM involves randomness, but the average standard deviation of the misclassification errors was 0.73%, meaning that it behaved very consistently in the experiment.

GDM was run with n1 = 10. The other parameters (ε and p) are fixed throughout all experiments and are addressed earlier. SCC was run with d = 3 for the linearly embedded data (this was found to give the best results), and d = 7 for the nonlinearly embedded data (as recommended in [10]). MAPA was run without any special parameters. SSC was run with no data projection (because of the low ambient dimension to start with), the affine constraint enabled (we tried it both ways and this gave better results), optimization method = "Lasso" (the default for the authors' code), and parameter lambda = 0.001 (found through trial and error). SLBF was run with d = 3 for the linearly embedded data and d = 6 for the nonlinearly embedded data, and σ was set to 20,000 for both cases (d and σ were selected by trial and error to give the best results). LRR was run with λ = 100 for the linear case and λ = 10000 for the non-linear case (these seemed to give the best results). RAS proved rather sensitive to its main parameter ("angleTolerance"), and no single value gave good across-the-board results. We ran with all default parameters and many other combinations. The results presented were generated using angleTolerance = 0.22 and boundaryThreshold = 5, as this combination gave the best results from our tests (better than the algorithm's defaults). HOSC was run with η automatically selected by the algorithm from the range [0.0001, 0.1]. The parameter "knn" was set to 20, and the default "heat" kernel was chosen. The algorithm was tried with d set to 2 and 3. Both of these cases are presented. d = 2 gave better results, but the authors of HOSC argue for using d = 3 in this setting.

For our comparison on the outlier-free Hopkins 155 database, the algorithms that were selected for the comparison were run once on each of the 155 data files. The mean and median performance for each category is reported. GDM was run with n1 = 30 to improve reliability. All other parameters were left fixed, and (as before) the nonlinearly embedded data was used. Each competing 2-view method was run on the non-linearly embedded data with the same parameters that gave the best performance on the RAS database. The competing n-view methods have their parameters given in the results tables.

For our outlier comparison on the corrupted RAS database, we ran GDM with n1 = 30 (same as for the Hopkins 155 database). For the naive approach, we used α = 0.02. For "GDM - Known Fraction" we rejected 20% of the dataset. For "GDM - Model-Reassign" we used κ = 0.05. "GDM - Classic" was the same algorithm as in the outlier-free comparisons and so had no extra parameters. RAS was run with angleTolerance = 0.22 and boundaryThreshold = 5 (same as in the outlier-free tests). We ran LRR with λ = 0.1 and outlierThreshold = 0.138 (these gave the best results of the combinations we tried). HOSC was run with d = 2 (which gave the best results in the outlier-free case) and α = 0.11.

The code for GDM can be found on our supplemental webpage. For each of the algorithms used in our comparisons, we have made an effort to provide (on the supplemental webpage) the code or a link to where the code can be found.

Acknowledgements This work was supported by NSF grants DMS-09-15064 and DMS-09-56072. GL was partially supported by the IMA during their annual program on the mathematics of information (2011-2012), and BP benefited from participating in parts of this program and even presented an initial version of this work at an IMA seminar in Spring 2012. We thank the anonymous reviewers for their thoughtful comments and Tom Lou for his helpful suggestions in regards to our algorithm for minimizing global dimension. A very preliminary version of this work was submitted to CVPR 2012; we thank one of the anonymous reviewers for some insightful comments that made us modify the GDM algorithm and its theoretical support.
References

1. Aldroubi, A.: A review of subspace segmentation: Problem, nonlinear approximations, and applications to motion segmentation. ISRN Signal Processing 2013(Article ID 417492), 1-13 (2013). DOI 10.1155/2013/417492
2. Arias-Castro, E., Chen, G., Lerman, G.: Spectral clustering based on local linear approximations. Electron. J. Statist. 5, 1537-1587 (2011)
3. Arias-Castro, E., Lerman, G., Zhang, T.: Spectral clustering based on local PCA. ArXiv e-prints (2013)
4. Baker, S., Matthews, I.: Lucas-Kanade 20 years on: A unifying framework. International Journal of Computer Vision 56(1), 221-255 (2004)
5. Barbará, D., Chen, P.: Using the fractal dimension to cluster datasets. In: KDD, pp. 260-264 (2000)
6. Bertsekas, D.: Nonlinear Programming. Optimization and Neural Computation Series. Athena Scientific (1995)
7. Bhatia, R.: Matrix Analysis. Graduate Texts in Mathematics Series. Springer Verlag (1997)
8. Boult, T.E., Brown, L.G.: Factorization-based segmentation of motions. In: Proceedings of the IEEE Workshop on Visual Motion, pp. 179-186 (1991)
9. Bradley, P., Mangasarian, O.: k-plane clustering. J. Global Optim. 16(1), 23-32 (2000)
10. Chen, G., Atev, S., Lerman, G.: Kernel spectral curvature clustering (KSCC). In: Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on Computer Vision, pp. 765-772. Kyoto, Japan (2009). DOI 10.1109/ICCVW.2009.5457627
11. Chen, G., Lerman, G.: Foundations of a multi-way spectral clustering framework for hybrid linear modeling. Found. Comput. Math. 9(5), 517-558 (2009). DOI 10.1007/s10208-009-9043-7
12. Chen, G., Lerman, G.: Spectral curvature clustering (SCC). Int. J. Comput. Vision 81(3), 317-330 (2009)
13. Chen, G., Maggioni, M.: Multiscale geometric and spectral analysis of plane arrangements. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2011)
14. Costeira, J., Kanade, T.: A multibody factorization method for independently moving objects. International Journal of Computer Vision 29(3), 159-179 (1998)
15. Elhamifar, E., Vidal, R.: Sparse subspace clustering. In: Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 09), pp. 2790-2797 (2009)
16. Elhamifar, E., Vidal, R.: Sparse subspace clustering: Algorithm, theory, and applications. Pattern Analysis and Machine Intelligence, IEEE Transactions on PP(99), 1-1 (2013). DOI 10.1109/TPAMI.2013.57
17. Feng, X., Perona, P.: Scene segmentation from 3d motion. In: Computer Vision and Pattern Recognition, 1998. Proceedings. 1998 IEEE Computer Society Conference on, pp. 225-231 (1998). DOI 10.1109/CVPR.1998.698613
18. Gionis, A., Hinneburg, A., Papadimitriou, S., Tsaparas, P.: Dimension induced clustering. In: KDD, pp. 51-60 (2005)
19. Grafakos, L.: Classical and Modern Fourier Analysis. Pearson/Prentice Hall (2004)
20. Haro, G., Randall, G., Sapiro, G.: Stratification learning: Detecting mixed density and dimensionality in high dimensional point clouds. Neural Information Processing Systems (2006)
21. Haro, G., Randall, G., Sapiro, G.: Translated Poisson mixture model for stratification learning. Int. J. Comput. Vision 80(3), 358-374 (2008)
22. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521623049 (2000)
23. Ho, J., Yang, M., Lim, J., Lee, K., Kriegman, D.: Clustering appearances of objects under varying illumination conditions. In: Proceedings of International Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 11-18 (2003)
24. Kanatani, K.: Motion segmentation by subspace separation and model selection. In: Proc. of 8th ICCV, vol. 3, pp. 586-591. Vancouver, Canada (2001)
25. Kanatani, K.: Evaluation and selection of models for motion segmentation. In: 7th ECCV, vol. 3, pp. 335-349 (2002)
26. Lerman, G., Zhang, T.: Robust recovery of multiple subspaces by geometric l_p minimization. Ann. Statist. 39(5), 2686-2715 (2011). DOI 10.1214/11-AOS914
27. Levina, E., Bickel, P.J.: Maximum likelihood estimation of intrinsic dimension. In: L.K. Saul, Y. Weiss, L. Bottou (eds.) Advances in Neural Information Processing Systems 17, pp. 777-784. MIT Press, Cambridge, MA (2005)
28. Liu, G., Lin, Z., Yan, S., Sun, J., Yu, Y., Ma, Y.: Robust recovery of subspace structures by low-rank representation. Pattern Analysis and Machine Intelligence, IEEE Transactions on 35(1), 171-184 (2013). DOI 10.1109/TPAMI.2012.88
29. Liu, G., Lin, Z., Yu, Y.: Robust subspace segmentation by low-rank representation. In: ICML (2010)
30. Ma, Y.: An Invitation to 3-D Vision: From Images to Geometric Models. Interdisciplinary Applied Mathematics: Imaging, Vision, and Graphics. Springer (2004)
31. Ma, Y., Derksen, H., Hong, W., Wright, J.: Segmentation of multivariate mixed data via lossy coding and compression. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(9), 1546-1562 (2007)
32. Ma, Y., Yang, A.Y., Derksen, H., Fossum, R.: Estimation of subspace arrangements with applications in modeling and segmenting mixed data. SIAM Review 50(3), 413-458 (2008)
33. Ozay, N., Sznaier, M., Lagoa, C., Camps, O.: GPCA with denoising: A moments-based convex approach. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 3209-3216 (2010). DOI 10.1109/CVPR.2010.5540075
34. Papadopoulo, T., Lourakis, M.I.A.: Estimating the Jacobian of the singular value decomposition: Theory and applications. In: Proc. European Conf. on Computer Vision, ECCV 00, pp. 554-570. Springer (2000)
35. Rao, S.R., Yang, A.Y., Sastry, S.S., Ma, Y.: Robust algebraic segmentation of mixed rigid-body and planar motions from two views. Int. J. Comput. Vision 88(3), 425-446 (2010). DOI 10.1007/s11263-009-0314-1
36. Roy, O., Vetterli, M.: The effective rank: A measure of effective dimensionality. In: European Signal Processing Conference (EUSIPCO), pp. 606-610 (2007)
37. Soltanolkotabi, M., Candès, E.J.: A geometric analysis of subspace clustering with outliers. Ann. Stat. 40(4), 2195-2238 (2012). DOI 10.1214/12-AOS1034
38. Soltanolkotabi, M., Elhamifar, E., Candes, E.: Robust subspace clustering. ArXiv e-prints (2013)
39. Tipping, M., Bishop, C.: Mixtures of probabilistic principal component analysers. Neural Computation 11(2), 443-482 (1999)
40. Torr, P.H.S.: Geometric motion segmentation and model selection. Phil. Trans. Royal Society of London A 356, 1321-1340 (1998)
41. Tron, R., Vidal, R.: A benchmark for the comparison of 3-D motion segmentation algorithms. In: Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on, pp. 1-8 (2007). DOI 10.1109/CVPR.2007.382974
42. Tseng, P.: Nearest q-flat to m points. Journal of Optimization Theory and Applications 105, 249-252 (2000). DOI 10.1023/A:1004678431677
43. Vershynin, R.: Introduction to the non-asymptotic analysis of random matrices. In: Compressed Sensing, pp. 210-268. Cambridge Univ. Press, Cambridge (2012)
44. Vidal, R.: Subspace clustering. Signal Processing Magazine, IEEE 28(2), 52-68 (2011). DOI 10.1109/MSP.2010.939739
45. Vidal, R., Ma, Y., Sastry, S.: Generalized principal component analysis (GPCA). IEEE Transactions on Pattern Analysis and Machine Intelligence 27(12) (2005)
46. Vidal, R., Ma, Y., Soatto, S., Sastry, S.: Two-view multibody structure from motion. International Journal of Computer Vision 68(1), 7-25 (2006)
47. Yan, J., Pollefeys, M.: A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In: ECCV, vol. 4, pp. 94-106 (2006)
48. Yang, A.Y., Rao, S.R., Ma, Y.: Robust statistical estimation and segmentation of multiple subspaces. In: CVPRW '06: Proceedings of the 2006 Conference on Computer Vision and Pattern Recognition Workshop, p. 99. IEEE Computer Society, Washington, DC, USA (2006). DOI 10.1109/CVPRW.2006.178
49. Zhang, T., Szlam, A., Lerman, G.: Median K-flats for hybrid linear modeling with many outliers. In: Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on Computer Vision, pp. 234-241. Kyoto, Japan (2009). DOI 10.1109/ICCVW.2009.5457695
50. Zhang, T., Szlam, A., Wang, Y., Lerman, G.: Randomized hybrid linear modeling by local best-fit flats. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 1927-1934 (2010). DOI 10.1109/CVPR.2010.5539866
51. Zhang, T., Szlam, A., Wang, Y., Lerman, G.: Hybrid linear modeling via local best-fit flats. International Journal of Computer Vision 100, 217-240 (2012). DOI 10.1007/s11263-012-0535-6