
Large-scale Binary Quadratic Optimization Using Semidefinite Relaxation and Applications

Peng Wang, Chunhua Shen, Anton van den Hengel

arXiv:1411.7564v1 [cs.CV] 27 Nov 2014

Abstract—In computer vision, many problems such as image segmentation, pixel labelling, and scene parsing can be formulated as binary quadratic programs (BQPs). For submodular problems, cut-based methods can be employed to solve large-scale instances efficiently. General non-submodular problems, however, are significantly more challenging: finding a solution to a problem large enough to be of practical interest typically requires relaxation. Two standard relaxation methods are widely used for solving general BQPs—spectral methods and semidefinite programming (SDP)—each with its own advantages and disadvantages. Spectral relaxation is simple and easy to implement, but its bound is loose. Semidefinite relaxation has a tighter bound, but its computational complexity is high, especially for large-scale problems. In this work, we present a new SDP formulation for BQPs with two desirable properties. First, it has a relaxation bound similar to that of conventional SDP formulations. Second, compared with conventional SDP methods, the new formulation leads to a significantly more efficient and scalable dual optimization approach, with the same degree of complexity as spectral methods. We then propose two solvers for the dual problem, based on quasi-Newton and smoothing Newton methods respectively. Both are significantly more efficient than standard interior-point methods. In practice, the smoothing Newton solver is faster for dense or medium-sized problems, while the quasi-Newton solver is preferable for large sparse/structured problems. Our experiments on several computer vision applications, including clustering, image segmentation, co-segmentation and registration, show the potential of our SDP formulation for solving large-scale BQPs.

I. INTRODUCTION

Binary quadratic programs (BQPs) are a class of combinatorial optimization problems with binary variables, a quadratic objective function and linear/quadratic constraints. They appear in a wide variety of computer vision applications, such as image segmentation, pixel labelling, image restoration and scene parsing. Moreover, maximum a posteriori (MAP) inference problems for Markov random fields (MRFs) can be formulated as BQPs as well. For MRF-MAP inference problems with submodular pairwise terms, the corresponding BQP can be solved exactly and efficiently using graph cuts or other techniques such as linear programming and message passing. However, solving general BQPs is NP-hard, which means that polynomial-time algorithms for solving these problems exactly are unlikely to exist (see [1] for polynomially solvable subclasses of BQPs). Alternatively, approximation algorithms can be used to produce a feasible solution close to the original optimum in polynomial time. To accept such an approximation we require a guarantee that the divergence between the solutions of the two problems is bounded, so the quality of the approximation depends upon the tightness of the bound. Developing an efficient approximation algorithm with a tight relaxation bound that achieves a good solution (particularly for large problems) is thus of great practical importance.

Spectral methods are effective for many computer vision applications, such as data clustering, image segmentation [2], [3], motion segmentation [4] and many other MRF applications [5]. The optimization in spectral methods eventually reduces to the computation of a few leading eigenvectors. Nevertheless, spectral methods may produce loose relaxation bounds in many cases [6], [7], [8]. Moreover, the quadratic-programming formulation inherent in spectral methods makes it difficult to incorporate certain types of additional constraints [5]. SDP methods have also been used to solve problems like image segmentation [9], restoration [10], [11], subgraph matching [12], co-segmentation [13] and general MRFs [14]. Typically, SDP-based methods offer better approximations than spectral methods. It is widely accepted that interior-point methods [15], [16] are very robust and accurate for general SDP problems up to moderate size.

The authors are with the School of Computer Science, University of Adelaide, Australia; and the ARC Centre of Excellence for Robotic Vision. Correspondence should be addressed to C. Shen ([email protected]).


However, once the matrix dimension n or the number of constraints m grows to several thousand, interior-point methods become inefficient, owing to dense linear algebra operations of order n and the construction/factorization of the m × m dense Schur complement matrix. This high computational demand prohibits the application of SDP methods to large-scale problems.

The objective of this paper is to develop new SDP approximation algorithms for BQPs with solution quality similar to the standard SDP formulation and efficiency comparable to spectral methods. Our main contributions are as follows:

1) A new SDP formulation (referred to as SDCut) is proposed for binary quadratic programs. By virtue of a Frobenius-norm term in the objective function, the Lagrangian dual problem can be simplified to one with simple constraints and without positive semidefinite matrix variables, which is much easier to optimize. The proposed SDP formulation offers bounds comparable to the typical SDP relaxation, and therefore provides better solutions than spectral relaxation.

2) Two algorithms are proposed to solve the dual problem, based on quasi-Newton (referred to as SDCut-QN) and smoothing Newton (referred to as SDCut-SN) methods respectively. The sparse or low-rank structure of specific problems is also exploited to speed up the computation. The proposed solvers require much less computation and memory than standard interior-point methods. In particular, the quasi-Newton solver has a lower computational cost per iteration but needs more iterations to converge; conversely, the smoothing Newton solver converges quadratically at the cost of a higher per-iteration complexity. In practice, we find that SDCut-SN is faster for dense or medium-sized problems, while SDCut-QN is more efficient for large-scale sparse/structured problems.

3) The efficiency and flexibility of the proposed algorithms are demonstrated by applying them to a variety of computer vision problems. The SDCut formulation allows multiple additional linear and quadratic constraints, which enables a broader set of applications than spectral methods.

Related work We refer to [17], [18], [19] for surveys of general approaches to BQPs. A large number of methods have been proposed for BQPs arising from MRF-MAP inference problems, such as graph cuts [20], [21], [22], message passing [23], [24], [25], and combinatorial methods [26] (see [27], [28], [29], [30] for comparative studies). Different relaxation schemes have also been studied for these inference problems, for instance, linear programming relaxation [23], [25], [31], [32], quadratic programming relaxation [33], second-order cone relaxation [34], spectral relaxation [5] and SDP relaxation [35]. Primal-dual interior-point approaches [36], [37], [38], [39] are widely considered the standard solvers for general SDP problems, and implementations can be found in SeDuMi [40], SDPT3 [41] and MOSEK [42]. Due to the poor scalability of interior-point methods, a number of scalable approaches have been proposed for specific types of SDP problems. Relying on the assumption that the SDP program admits a low-rank solution X ∈ S^n_+, the original convex problem in X can be reformulated as a non-convex one in Y ∈ R^{n×r}, where X = YY⊤ denotes a low-rank factorization of X. Nonlinear programming methods [43], [44], [45] have been proposed for large-scale SDP problems based on this low-rank factorization.
As the optimal r is unknown, these algorithms need to solve a sequence of non-convex problems with fixed rank r, and may therefore converge to an undesirable local optimum. Based on multiplicative weight updates [46], approximation algorithms were developed in [47], [48], [49], [50], [51] to solve certain SDP problems inexactly. The approximate solution is updated by a rank-one matrix in each iteration and finally satisfies all constraints up to a prescribed tolerance. However, these methods may require many iterations to converge to an accurate solution. For semidefinite least-squares (SDLS) problems, some recent work proposes to optimize the Lagrangian dual problem, which has a continuously differentiable objective function, using projected gradient [52], quasi-Newton [53], and Newton-like methods [54], [55]. Augmented Lagrangian methods [56], [57] and their variant, the alternating direction method of multipliers (ADMM) [58], [59], have also been developed for general or specific SDP problems. It is known that the augmented Lagrangian method for convex problems can be considered a gradient descent method for the corresponding dual problem [60], which unfortunately may converge slowly.

In this work, we consider approximation algorithms for BQPs based on SDP relaxation. Unlike the aforementioned methods, we do not directly solve the original SDP relaxation of the BQP. Instead, a Frobenius-norm term is added to the objective function, leading to a simplified Lagrangian dual problem. The proposed SDP formulation yields a solution close to that of the original SDP problem.


Our method is motivated by the work of Shen et al. [61], which presented a fast dual SDP approach to Mahalanobis metric learning. They, however, focused on learning a metric for nearest-neighbour classification. Here, in contrast, we are interested in discrete combinatorial optimization problems arising in computer vision. Krislock et al. [62] have independently formulated a similar SDP for the MaxCut problem, which is simpler than the problems that we solve here; moreover, they focus on globally solving the MaxCut problem using branch-and-bound. Preliminary results of this paper were published in [63].

Notation A matrix is denoted by a bold capital letter (X) and a column vector by a bold lower-case letter (x). R^n denotes the space of real-valued n × 1 vectors; R^n_+ and R^n_− represent the non-negative and non-positive orthants of R^n respectively. S^n denotes the space of n × n symmetric matrices, and S^n_+ represents the corresponding cone of positive semidefinite (p.s.d.) matrices. For two vectors, x ≤ y indicates the element-wise inequality. diag(X) denotes the main diagonal vector of a matrix X ∈ S^n, and diag(v) forms an n × n diagonal matrix whose main diagonal is v. The trace of a matrix is denoted by trace(·) and the rank by rank(·). ‖·‖₁ and ‖·‖₂ denote the ℓ₁ and ℓ₂ norms of a vector respectively. ‖X‖²_F = trace(XX⊤) = trace(X⊤X) is the Frobenius norm. The inner product of two matrices is defined as ⟨X, Y⟩ = trace(X⊤Y). X ◦ Y denotes the Hadamard product and X ⊗ Y the Kronecker product of X and Y. I_n indicates the n × n identity matrix. 0 and 1 denote all-zero and all-one column vectors of appropriate dimension. ∇f(·) and ∇²f(·) stand for the first- and second-order derivatives of a function f(·). The eigen-decomposition of X ∈ S^n is written X = P diag(λ) P⊤ or X = P diag(λ(X)) P⊤, where λ or λ(X) is the vector of eigenvalues and the columns of P are the corresponding eigenvectors.

II. BQPs AND THEIR RELAXATION

Here we consider binary quadratic programs of the following form:

    min_{x ∈ {−1,1}^n}  x⊤Ax,                                                   (1a)
    s.t.  c_i⊤x = b_i, i ∈ I_leq;   x⊤B_j x = b_j, j ∈ I_qeq,                   (1b)
          c_i⊤x ≤ b_i, i ∈ I_lin;   x⊤B_j x ≤ b_j, j ∈ I_qin,                   (1c)

where A ∈ S^n; B_j ∈ S^n, j ∈ I_qeq ∪ I_qin; c_i ∈ R^n, i ∈ I_leq ∪ I_lin; and b ∈ R^m, m = |I_leq| + |I_qeq| + |I_lin| + |I_qin|. I_leq, I_qeq, I_lin and I_qin contain non-overlapping indexes corresponding to linear/quadratic equality and linear/quadratic inequality constraints, respectively. Note that linear terms can be absorbed into the quadratic functions by replacing x, A and B_j with x̄ = [1; x], Ā = [0, a⊤; a, A] and B̄_j = [0, d_j⊤; d_j, B_j], where a, d_j ∈ R^n, j ∈ I_qeq ∪ I_qin. Without loss of generality, linear terms are thus neglected in what follows.
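As a quick sanity check on this homogenization, the following small numpy sketch (variable names are ours, for illustration only) verifies that x̄⊤Āx̄ = x⊤Ax + 2a⊤x, so a linear term h⊤x is encoded by setting a = h/2:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n)); A = (A + A.T) / 2   # symmetric quadratic term
a = rng.standard_normal(n)                           # linear-term coefficients
x = rng.choice([-1.0, 1.0], size=n)

x_bar = np.concatenate(([1.0], x))                   # x_bar = [1; x]
A_bar = np.block([[np.zeros((1, 1)), a[None, :]],    # A_bar = [0, a^T; a, A]
                  [a[:, None],       A]])

# x_bar^T A_bar x_bar = x^T A x + 2 a^T x, so a = h/2 encodes a linear term h^T x
assert np.isclose(x_bar @ A_bar @ x_bar, x @ A @ x + 2 * a @ x)
```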

In general, BQPs are NP-hard (see [1] for exceptions).

A. Spectral Relaxation

Spectral methods are extensively used in computer vision and are usually cheap to compute, as only a few leading eigenpairs are required. Two classic spectral clustering approaches, RatioCut and NCut [2], are as follows:

    RatioCut:  min_{x ∈ R^n}  x⊤Lx,   s.t. x⊤1 = 0, x⊤x = n,                    (2)
    NCut:      min_{x ∈ R^n}  x⊤L̃x,   s.t. x⊤c = 0, x⊤x = n,                    (3)

where L̃ = D^{−1/2} L D^{−1/2} and c = D^{1/2} 1. The solutions of RatioCut and NCut are the eigenvectors corresponding to the second smallest eigenvalues of L and L̃ respectively.

Although appealingly simple to implement, spectral relaxation often yields poor solution quality: there is no guarantee on the bound of its solution with respect to the optimum of (1). The poor bound of spectral relaxation has been verified by a variety of authors [6], [7], [8].


Furthermore, it is difficult to generalize spectral methods to BQPs with arbitrary additional constraints. The works in [3], [5] and [64] proposed spectral methods with multiple linear equality constraints:

    min_{x ∈ R^n}  x⊤Ax,   s.t. C⊤x = b, x⊤x = n,                               (4)

where C ∈ R^{n×m} and m is the number of linear equality constraints; b ∈ R^m, with b = 0 in [3]. The spectral methods proposed by Wang and Davidson [65] (referred to as SMQC in this paper) and by Maji et al. [66] (referred to as BNCut) consider BQPs with one additional quadratic constraint:

    min_{x ∈ R^n}  x⊤Ax,   s.t. x⊤Bx ≤ b, x⊤x = n,                              (5)

where B ∈ S^n, with rank(B) = 1 in BNCut [66]. Note that both BNCut [66] and SMQC [65] can incorporate only one additional quadratic constraint and offer sub-optimal solutions. Moreover, SMQC is computationally more expensive than standard spectral methods, as it needs to perform full eigen-decompositions. Despite these efforts, it remains difficult for spectral methods to accommodate linear inequality constraints or more than one quadratic constraint.

For spectral methods like NCut [2] and SMAC [64], only the top r (r ≪ n) eigenpairs are required. For sparse matrices with t non-zero entries per row on average, or for rank-t matrices (t ≪ n), Lanczos methods can compute the partial eigen-decomposition in O(nr² + ntr) × #Lanczos-iterations flops and O(nr + nt) bytes. For dense matrices, the computational complexity is O(n³) and the memory requirement O(n²). In contrast, SMQC [65] needs to compute all eigenpairs to search for the best result, so it requires O(n³) flops and O(n²) bytes regardless of the matrix structure.

B. SDP Relaxation

In SDP relaxation, the space x ∈ {−1,1}^n is lifted to the space of X ∈ S^n_+ representing xx⊤. Note that the quadratic functions in BQPs can be expressed as linear functions of X = xx⊤, and the constraint x ∈ {−1,1}^n is relaxed to diag(X) = 1, which is also linear in X. The SDP relaxation of the BQP (1) can thus be expressed in the following general form:

    min_{X ∈ S^n_+}  p(X) := ⟨X, A⟩,   s.t. ⟨B_i, X⟩ = b_i, i ∈ I_eq;  ⟨B_i, X⟩ ≤ b_i, i ∈ I_in,      (6)

where I_eq and I_in contain non-overlapping indexes for equality and inequality constraints respectively, and m := |I_eq| + |I_in|. With the additional non-convex constraint rank(X) = 1, the above SDP problem is equivalent to the original BQP (1). In general, SDP relaxation is tighter than spectral relaxation. In particular, it is proved in [67] that for the SDP relaxation of MaxCut problems the expected value of the solution is tightly bounded (at most 14% suboptimal). In contrast to spectral relaxation, SDP relaxation can explicitly incorporate many linear and quadratic (in)equality constraints. SDP problems can be solved by interior-point methods, which are implemented in standard convex optimization toolboxes, for example SeDuMi [40], SDPT3 [41] and MOSEK [42]. The most significant drawback of interior-point methods is their poor scalability for large n or m: at each iteration, a dense linear system of size m must be solved, using O(m³ + mn³ + m²n²) flops, and storing the primal n × n dense matrix variable and the m × m dense Schur matrix requires O(m² + n²) bytes.

III. SDCUT FORMULATION

Now we consider a specific class of SDP problems in which the trace of the p.s.d. matrix variable is fixed, that is, X ∈ Θ_η := {X ∈ S^n_+ | trace(X) = η > 0}. {0,1}-quadratic programs, and {±1}-quadratic programs with a fixed number of non-zero variables, can be relaxed to this class of SDP problems. Any X ∈ Θ_η satisfies ‖X‖_F ≤ η, with rank(X) = 1 ⇔ ‖X‖_F = η (see [68] and Appendix A). This gives a geometric interpretation of the rank-one constraint dropped by SDP relaxation: the rank-one constraint restricts X ∈ Θ_η to lie on the (non-convex) Euclidean sphere of centre 0 and radius η in S^n, while the SDP formulation relaxes this to the corresponding convex ball.
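The following small numpy check (ours, for illustration only) makes this geometry concrete: a rank-one matrix in Θ_η attains ‖X‖_F = η, while a higher-rank matrix with the same trace lies strictly inside the ball:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
eta = float(n)

x = rng.choice([-1.0, 1.0], size=n)
X1 = np.outer(x, x)                                  # rank one, trace = n
print(np.trace(X1), np.linalg.norm(X1, 'fro'))       # both equal eta

Y = rng.standard_normal((n, 3))
X2 = Y @ Y.T                                         # p.s.d., rank 3
X2 *= eta / np.trace(X2)                             # rescale: trace(X2) = eta
print(np.trace(X2), np.linalg.norm(X2, 'fro'))       # Frobenius norm < eta
```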


A new SDP formulation for BQPs is proposed in this work in place of (6):

    min_{X ∈ S^n_+}  p_γ(X) := ⟨X, A⟩ + (1/2γ)(‖X‖²_F − η²),
    s.t.  ⟨B_i, X⟩ = b_i, i ∈ I_eq;   ⟨B_i, X⟩ ≤ b_i, i ∈ I_in,                  (7)

where γ > 0 is a penalty parameter. Compared to (6), the new formulation (7) adds to the objective function a Frobenius-norm term in X. The reason for choosing this formulation is two-fold: 1) the solution quality of (7) can be made as close to that of (6) as desired by taking γ sufficiently large; 2) a simple dual formulation can be derived from (7), in which the p.s.d. constraint is eliminated.

Proposition 1. The following results hold: (i) ∀ε > 0, ∃γ > 0 such that |p(X⋆) − p(X⋆_γ)| ≤ ε, where X⋆ denotes the optimal solution of (6) and X⋆_γ that of (7) for the given γ. (ii) For γ₂ > γ₁ > 0, we have p(X⋆_{γ₁}) ≥ p(X⋆_{γ₂}), where X⋆_{γ₁} and X⋆_{γ₂} are the optimal solutions of (7) for γ₁ and γ₂ respectively.

Proof. These results follow from the properties of X ∈ Θ_η (see Appendix A).

These results show that the solution quality of (7) can be monotonically improved towards that of (6) by making γ sufficiently large. Next, we show that the Lagrangian dual of (7) has a much simpler form.

Proposition 2. The dual problem of (7) can be simplified to

    max_{u ∈ R^m}  d_γ(u) := −(γ/2)‖Π_{S^n_+}(C(u))‖²_F − u⊤b − η²/(2γ),   s.t. u_i ≥ 0, i ∈ I_in,      (8)

where C(u) = −A − Σ_{i=1}^m u_i B_i. The relationship between the primal optimal solution X⋆ and the dual optimal solution u⋆ is:

    X⋆ = γ Π_{S^n_+}(C(u⋆)).                                                    (9)

Proof. The Lagrangian of the primal problem (7) is:

    L(X, u, Z) = ⟨X, A⟩ − ⟨X, Z⟩ + (1/2γ)(‖X‖²_F − η²) + Σ_{i=1}^m u_i (⟨X, B_i⟩ − b_i),      (10)

where u ∈ R^{|I_eq|} × R^{|I_in|}_+ are the multipliers for the m linear equality/inequality constraints on X in the primal, and Z ∈ S^n_+ are the multipliers for the primal constraint X ∈ S^n_+. Since the primal problem (7) is convex and both the primal and dual problems are feasible, strong duality holds. The primal optimal solution X⋆ is a minimizer of L(X, u⋆, Z⋆), so ∇_X L(X⋆, u⋆, Z⋆) = 0 and hence X⋆ = γ(Z⋆ − A − Σ_{i=1}^m u⋆_i B_i) = γ(Z⋆ + C(u⋆)). Substituting X⋆ into the Lagrangian (10), we obtain the dual problem:

    max_{u ∈ R^m, Z}  −(γ/2)‖Z + C(u)‖²_F − u⊤b − η²/(2γ),   s.t. u_i ≥ 0, i ∈ I_in; Z ∈ S^n_+.      (11)

As the dual (11) still contains a p.s.d. variable, it might seem that only interior-point algorithms can solve it directly. Fortunately, the p.s.d. matrix variable Z can be eliminated as follows. For fixed u, the dual (11) reduces to min_{Z ∈ S^n_+} ‖Z + C(u)‖²_F, whose solution is Z = Π_{S^n_+}(−C(u)) by Theorem A6 in the Appendix. Substituting this Z into (11) and into the expression for X⋆ yields the simplified dual (8) and Equation (9).

The simplified Lagrangian dual problem (8) is explicitly convex (maximizing a concave objective over a convex feasible set) and has simple bound constraints. There is no p.s.d. matrix variable in the dual, and the number of dual variables equals the number of equality and inequality constraints in the primal, that is, m. Furthermore, the dual objective function d_γ(·) has the following important properties.

Proposition 3. d_γ(·) is continuously differentiable but not necessarily twice differentiable, and its gradient is given by

    ∇d_γ(u) = −γ Φ[Π_{S^n_+}(C(u))] − b,                                        (12)

where Φ : S^n → R^m denotes the linear transformation Φ[X] := [⟨B₁, X⟩, · · · , ⟨B_m, X⟩]⊤.

Proof. This result can be proved in different ways; in Appendix C we verify it based on results on separable spectral functions [69].
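To make (8) and (12) concrete, here is a minimal numpy sketch of the dual objective and gradient evaluation (our own illustration, using a dense eigen-decomposition and an explicit list of constraint matrices; the paper's solvers use more economical representations):

```python
import numpy as np

def psd_project(M):
    """Projection onto the p.s.d. cone: keep the non-negative eigenpairs."""
    lam, P = np.linalg.eigh(M)
    pos = lam > 0
    return (P[:, pos] * lam[pos]) @ P[:, pos].T

def dual_obj_grad(u, A, Bs, b, gamma, eta):
    """d_gamma(u) and its gradient, following (8) and (12); Bs is the list of
    constraint matrices B_i and b the vector of right-hand sides."""
    C = -A - sum(ui * Bi for ui, Bi in zip(u, Bs))
    Cplus = psd_project(C)                       # Pi_{S^n_+}(C(u))
    d = -0.5 * gamma * np.sum(Cplus ** 2) - u @ b - eta ** 2 / (2 * gamma)
    grad = -gamma * np.array([np.sum(Bi * Cplus) for Bi in Bs]) - b
    return d, grad
```

The projection Π_{S^n_+} simply truncates the negative eigenvalues, and this eigen-decomposition is where essentially all of the computational effort concentrates.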


Algorithm 1 SDCut-QN: Solving (8) using quasi-Newton methods.
Input: A, Φ, b, γ, u⁰, K_max, τ > 0.
Step 1: Solve the dual using L-BFGS-B:
  for k = 0, 1, 2, . . . , K_max do
    Step 1.1: Compute ∇d_γ(u^k) and update H.
    Step 1.2: Compute the descent direction ∆u = −H∇d_γ(u^k).
    Step 1.3: Find a step size ρ, and set u^{k+1} = u^k + ρ∆u.
    Step 1.4: Exit if (d_γ(u^{k+1}) − d_γ(u^k))/max{|d_γ(u^{k+1})|, |d_γ(u^k)|, 1} ≤ τ.
  end
Step 2: u⋆ = u^{k+1}, X⋆ = γΠ_{S^n_+}(C(u⋆)).
Step 3: x⋆ = Round(X⋆).
Output: x⋆, u⋆, upper-bound p(x⋆x⋆⊤) and lower-bound d_γ(u⋆).

Proposition 4. ∀u ∈ R^{|I_eq|} × R^{|I_in|}_+ and ∀γ > 0, d_γ(u) yields a lower bound on the optimum of the BQP (1).

Proof. This proposition also follows from the properties of X ∈ Θ_η (see Appendix A for details).

IV. SOLVING THE DUAL PROBLEM

Based on Proposition 3, first-order methods (for example, gradient descent or quasi-Newton), which require only evaluations of the objective function and its gradient, can be applied directly to (8). Standard Newton methods are more difficult to use, however, as they require second-order derivatives. In the following two sections we present two algorithms for solving the dual (8), based on quasi-Newton and smoothing Newton methods respectively.

A. Quasi-Newton Methods

One major advantage of quasi-Newton methods over Newton methods is that the inverse of the Hessian matrix is approximated directly from successive gradient vectors, so there is no need to explicitly compute the Hessian matrix or its inverse. Solving (8) with a quasi-Newton method is simple: we use L-BFGS-B, which needs only the evaluation of the objective function and its gradient. The quasi-Newton algorithm for (8) (referred to as SDCut-QN) is summarized in Algorithm 1. In Step 1, the dual problem (8) is solved using L-BFGS-B, which only requires a callback computing the dual objective (8) and its gradient (12). At each iteration, a descent direction ∆u is computed from the gradient ∇d_γ(u) and the approximate inverse Hessian H ≈ (∇²d_γ(u))^{−1}, and a step size ρ is found by line search. The algorithm stops when the difference between the last two dual objective values falls below a preset tolerance. After solving the dual, the primal optimal variable X⋆ is recovered from the dual optimal u⋆ via Equation (9) in Step 2. Finally, in Step 3, X⋆ is discretized to produce a feasible binary solution x⋆; the discretization method depends on the specific application and is discussed separately later. We then have an upper bound and a lower bound (see Proposition 4) on the optimum ξ⋆ of the original BQP (1): p(x⋆x⋆⊤) ≥ ξ⋆ ≥ d_γ(u⋆). These two values, and the gap between them, are used as measures of solution quality in the experiments.

B. Smoothing Newton Methods

It is well known (see [70], for example) that, since d_γ(u) is concave, the dual problem (8) is equivalent to finding u⋆ ∈ D such that ⟨u − u⋆, −∇d_γ(u⋆)⟩ ≥ 0, ∀u ∈ D (a variational inequality), where D := R^{|I_eq|} × R^{|I_in|}_+ denotes the feasible set of the dual problem. Thus (8) is also equivalent to finding a root of the following non-smooth equation:

    F(u) := u − Π_D(u + ∇d_γ(u)) = 0,   u ∈ R^m,                                (13)


Algorithm 2 SDCut-SN: Solving (8) using smoothing Newton methods.
Input: A, Φ, b, γ, u⁰, ε⁰, K_max, τ > 0, r ∈ (0, 1), ρ ∈ (0, 1).
Step 1: Solve the dual using smoothing Newton methods:
  for k = 0, 1, 2, . . . , K_max do
    Step 1.1: ε̄ ← ε^k or r ε^k.
    Step 1.2: Solve the following linear system up to a certain accuracy:
        E(ε^k, u^k) + ∇E(ε^k, u^k)[∆ε^k; ∆u^k] = [ε̄; 0].                       (19)
    Step 1.3: Line search: l = 0;
        while ‖E(ε^k + ρ^l ∆ε^k, u^k + ρ^l ∆u^k)‖²₂ ≥ ‖E(ε^k, u^k)‖²₂ do l = l + 1;
        ε^{k+1} = ε^k + ρ^l ∆ε^k, u^{k+1} = u^k + ρ^l ∆u^k.
    Step 1.4: If |d_γ(u^{k+1}) − d_γ(u^k)|/max{|d_γ(u^{k+1})|, |d_γ(u^k)|, 1} ≤ τ, break.
  end
Step 2: u⋆ = u^{k+1}, X⋆ = γΠ_{S^n_+}(C(u⋆)).
Step 3: x⋆ = Round(X⋆).
Output: x⋆, u⋆, upper-bound p(x⋆x⋆⊤) and lower-bound d_γ(u⋆).

where Π_D(v)_i := v_i if i = 1, . . . , p and Π_D(v)_i := max(0, v_i) if i = p+1, . . . , m can be considered a metric projection from R^m onto R^p × R^q_+ (with p = |I_eq| and q = |I_in|). Because both of the metric projections Π_D and Π_{S^n_+} are continuous but not continuously differentiable, F(u) is also continuous but not continuously differentiable at u ∈ R^m. As classic Newton methods require the Hessian at each iteration, they cannot be applied directly to (13). Motivated by the work in [55], smoothing Newton methods, which rely on smooth approximation functions, are used here to solve such non-smooth equations. The smoothing functions for Π_D and Π_{S^n_+} are, respectively:

    Π̃_D(ε, v)_i := v_i if i = 1, . . . , p;  φ(ε, v_i) if i = p+1, . . . , m,   (ε, v) ∈ R × R^m,      (14)
    Π̃_{S^n_+}(ε, X) := P diag(φ(ε, λ)) P⊤,   (ε, X) ∈ R × S^n,                  (15)

where X = P diag(λ) P⊤ is the eigen-decomposition of X, φ(ε, λ) := [φ(ε, λ₁), · · · , φ(ε, λ_n)]⊤, and φ(ε, v) is the Huber smoothing function that we adopt here for smoothing max(0, v):

    φ(ε, v) := v,                 if v > 0.5ε;
               (v + 0.5ε)²/(2ε),  if −0.5ε ≤ v ≤ 0.5ε;
               0,                 if v < −0.5ε.                                  (16)
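A direct numpy transcription of (14)–(16), vectorized over eigenvalues (our own sketch):

```python
import numpy as np

def phi(eps, v):
    """Huber smoothing of max(0, v), cf. (16); exact at eps = 0."""
    v = np.asarray(v, dtype=float)
    if eps == 0:
        return np.maximum(0.0, v)
    out = np.where(v > 0.5 * eps, v, 0.0)
    mid = np.abs(v) <= 0.5 * eps
    return np.where(mid, (v + 0.5 * eps) ** 2 / (2 * eps), out)

def smoothed_psd_projection(eps, X):
    """Pi-tilde_{S^n_+}(eps, X) from (15): smoothed eigenvalue thresholding."""
    lam, P = np.linalg.eigh(X)
    return (P * phi(eps, lam)) @ P.T
```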

Note that at ε = 0, φ(ε, v) = max(0, v), Π̃_D(ε, v) = Π_D(v) and Π̃_{S^n_+}(ε, X) = Π_{S^n_+}(X). φ, Π̃_D and Π̃_{S^n_+} are Lipschitz continuous on R, R × R^m and R × S^n respectively, and they are continuously differentiable when ε ≠ 0. We thus have a smoothing function for F(·):

    F̃(ε, u) := u − Π̃_D(ε, u − γΦ[Π̃_{S^n_+}(ε, C(u))] − b),   (ε, u) ∈ R × R^m.      (17)

Similarly to Π̃_D and Π̃_{S^n_+}, F̃(ε, u) = F(u) at ε = 0; F̃ is Lipschitz continuous for any (ε, u) ∈ R × R^m and continuously differentiable when ε ≠ 0. Solving the non-smooth equation F(u) = 0 is therefore equivalent to solving the smoothed equation:

    E(ε, u) := [ε; F̃(ε, u)] = 0,   (ε, u) ∈ R × R^m.                            (18)

Inexact Newton methods are used to optimize the dual problem (8): at each iteration, the Newton linear system E(ε, u) + ∇E(ε, u)[∆ε; ∆u] = 0 is solved approximately. The resulting inexact smoothing Newton method (referred to as SDCut-SN) is shown in Algorithm 2. In Step 1.2, the linear system (19) is solved inexactly: we apply classic conjugate gradient (CG) methods when |I_in| = 0 and biconjugate gradient stabilized (BiCGStab) methods [71] when |I_in| ≠ 0; the computational details are given in the appendix. In Step 1.3, we search along the direction [∆ε^k; ∆u^k] for a step size ρ^l that reduces the norm of E(ε, u).


C. Speeding Up the Computation

In this section, the eigen-decomposition of C(u), which is the computational bottleneck of our algorithms, is accelerated using several techniques.

Low-rank solution In practice, we find that the final p.s.d. solution typically has low rank, and that r = rank(Π_{S^n_+}(C(u))) usually decreases sharply, such that r ≪ n for most descent iterations of both algorithms. Indeed, it is known (see [72] and [73]) that any SDP problem with m linear constraints has an optimal solution X⋆ ∈ S^n_+ such that rank(X⋆)(rank(X⋆) + 1)/2 ≤ m, so the rank of X⋆ is roughly bounded by √(2m). Lanczos methods can then be used to compute Π_{S^n_+}(C(u)) efficiently, as only the positive eigenvalues of C(u) and the corresponding eigenvectors are needed. Lanczos methods rely only on an implementation of the matrix-vector product C(u)d = (−A − Σ_{i=1}^m u_i B_i)d, d ∈ R^n. This simple interface allows us to exploit specific structure in the coefficient matrices A and B_i, i = 1, · · · , m.

Specific problem structure In the following discussion, we assume that most of the matrices B_i, i = 1, 2, . . . , m, contain only O(1) non-zero elements (for example, consider the constraints diag(X) = 1) and that the rest are O(1) low-rank matrices, such that the transformations Φ[X] := [⟨B₁, X⟩, · · · , ⟨B_m, X⟩]⊤ and Ψ(u) := Σ_{i=1}^m u_i B_i can be computed in O(m + n) flops. This assumption holds for all the applications evaluated in this paper. In many cases A is sparse or structured, so that it can be stored in O(nt) bytes and the matrix-vector product Ad, d ∈ R^n, computed in O(nt) time; t ≪ n denotes the number of non-zero entries per row for sparse matrices, or the rank for low-rank (one type of structured) matrices. For C(u) with this structure, each iteration of Lanczos factorization requires O(nr² + ntr) flops and O(nr + nt) bytes, given that the number of Lanczos basis vectors is set to a small multiple (1 ∼ 3) of r.

Warm start A further trick can be used to accelerate the Lanczos computation. As with any iterative method, a good initial point is crucial for fast convergence. In quasi-Newton and smoothing Newton methods, the step ∆u = u^{k+1} − u^k tends to shrink over the descent iterations, so C(u^{k+1}) and C(u^k) tend to have similar eigenstructures. This motivates using a random linear combination of the eigenvectors of C(u^k) as the starting point of the Lanczos process for C(u^{k+1}).
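As an illustration of this matrix-free projection (our own sketch, assuming for simplicity that the B_i are the diagonal indicators from diag(X) = 1, so that Σ_i u_i B_i = diag(u)), SciPy's ARPACK wrapper eigsh accepts a LinearOperator and a warm-start vector v0; k (< n) would be set slightly above the expected rank and increased if all returned eigenvalues are positive:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

def psd_part_lanczos(A, u, k, v0=None):
    """Positive eigenpairs of C(u) = -A - sum_i u_i B_i without forming C.
    Assumes the B_i are the diagonal indicators of diag(X) = 1, so that
    sum_i u_i B_i = diag(u); A is a scipy sparse matrix."""
    n = A.shape[0]

    def matvec(d):
        d = np.asarray(d).ravel()
        return -(A @ d) - u * d                 # C(u) d in O(nt) flops

    C = LinearOperator((n, n), matvec=matvec)
    lam, V = eigsh(C, k=k, which='LA', v0=v0)   # k largest (algebraic) eigenpairs
    pos = lam > 0
    return lam[pos], V[:, pos]

# Warm start: seed the next call with a random combination of the previous
# eigenvectors, since C(u) changes little between descent iterations:
#   v0_next = V @ np.random.default_rng(0).standard_normal(V.shape[1])
```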
D. Convergence Speed, Computational Complexity and Memory Requirements

This section discusses the convergence speed, computational complexity and memory requirements of SDCut-QN and SDCut-SN. Recall that n is the size of the primal matrix variable X and m the number of dual variables in u. The matrices A and C(u) = −A − Σ_{i=1}^m u_i B_i are considered separately as dense, sparse or structured.

SDCut-QN Generally speaking, the superlinear convergence rate of quasi-Newton methods is established under the condition that the objective function is at least twice differentiable (see [74], [75], [76] for detailed analyses). The dual objective (8) is not necessarily twice differentiable, so the theoretical convergence speed of SDCut-QN is unknown; in practice, SDCut-QN usually converges within 200 iterations. At each iteration of L-BFGS-B, the computational complexity and memory requirement of L-BFGS-B itself are both O(m). The main computational cost in evaluating the dual objective and its gradient is the projection Π_{S^n_+}(C(u)), which amounts to computing all positive eigenvalues of C(u) and the corresponding eigenvectors. In general, the eigen-decomposition of a dense matrix requires O(n³) flops and O(n²) bytes. The overall computational cost of SDCut-QN per descent iteration is therefore O(m + n³) flops for dense matrices and O(m) + O(nr² + ntr) × #Lanczos-iterations flops for sparse or structured matrices. As for memory, SDCut-QN needs O(m + n²) bytes for dense matrices and O(m + nr + nt) bytes for sparse or structured matrices.

SDCut-SN Based on the analysis in [55], the smoothing Newton method SDCut-SN is quadratically convergent, and it normally needs fewer than 30 iterations in practice. There are two computationally intensive parts of SDCut-SN: the CG-type solves of the linear system (19), and the full eigen-decomposition of C(u). In Appendix D we show that the Jacobian-vector product requires O(m + n²r) flops at each CG iteration, where r = rank(Π_{S^n_+}(C(u))). Since all eigenpairs of C(u) are needed to form the Jacobian implicitly, the eigen-decomposition costs O(n³) regardless of the structure of C(u). The overall computational cost of SDCut-SN is O(n³) + O(m + n²r) × #CG-iterations flops per descent iteration, and its memory requirement is O(m + n²) bytes.


In summary, the computational costs and memory requirements of both SDCut-QN and SDCut-SN are linear in m, so our methods are far more scalable in m than interior-point methods (which need cubic flops and quadratic bytes in m). In the dimension n of the p.s.d. matrix, our methods are also more scalable than interior-point methods and comparable to spectral methods; for sparse/structured matrices, in particular, the computational complexity of SDCut-QN is linear in n. As SDCut-SN cannot exploit sparse/structured matrices, it needs more time per descent iteration than SDCut-QN in such cases, but it converges in far fewer iterations. Experiments are therefore needed to compare the speeds of SDCut-SN and SDCut-QN in different settings.

V. APPLICATIONS

In this section, SDCut-QN and SDCut-SN are evaluated on several computer vision applications and compared with spectral methods, graph-cut based methods and interior-point SDP methods. We use the L-BFGS-B implementation of [77] for the optimization in SDCut-QN. ARPACK, which implements a variant of Lanczos methods called the implicitly restarted Lanczos method (IRLM), is employed for the eigen-decomposition of sparse or structured matrices; the DSYEVR function in LAPACK [78] is used for dense matrices. The code is written in Matlab, with some key subroutines implemented in C/MEX. All experiments were run on a 2.7GHz Intel CPU. The maximum numbers of descent iterations of SDCut-QN and SDCut-SN are set to 500 and 50 respectively. As shown in Algorithm 1 and Algorithm 2, the same stopping criterion is used for SDCut-QN and SDCut-SN, with the tolerance τ set to 10⁷ × (machine precision). The dual variables for equality constraints (i ∈ I_eq) are initialized to 0, and those for inequalities (i ∈ I_in) to a small positive number. The selection of the parameter γ is discussed in the next section.

A. Graph Bisection

1) Formulation: Graph bisection seeks to separate the vertices of a weighted graph into two disjoint sets of equal cardinality while minimizing the total weight of the edges cut. The problem can be formulated as:

    min_{x ∈ {−1,+1}^n}  x⊤Lx,   s.t. x⊤1 = 0,                                  (20)

where L = D − W is the graph Laplacian matrix, W is the weighted affinity matrix, and D = diag(W1) is the degree matrix. Lifting x ∈ {−1, 1}^n to X = xx⊤ gives the following SDP relaxation of the graph bisection problem (20):

    min_{X ∈ S^n_+}  ⟨X, −W⟩,   s.t. diag(X) = 1, ⟨X, 1·1⊤⟩ = 0.                (21)

To obtain a discrete result from the optimal X, we adopt the randomized rounding method of [67]: a score vector is sampled from a Gaussian distribution with mean 0 and covariance X, and the discrete vector x ∈ {−1, 1}^n is obtained by thresholding the score vector at its median value. This process is repeated several times, and the final solution is the one with the smallest objective value.

2) Experiments: To show that the proposed SDP methods achieve better solution quality than spectral methods, we compare the graph-bisection results of RatioCut, NCut¹ and SDCut-QN on two artificial 2-dimensional data sets. The similarity matrix W is computed from the Euclidean distance between points i and j (1 ≤ i < j ≤ n): W_ij = exp(−d(i, j)²/σ²) if d(i, j) < r, and 0 otherwise, where d(·, ·) is the Euclidean distance and σ is set to 0.1 times the maximum distance. As shown in Fig. 1, the first data set (top row) contains two point sets of different densities, and the second contains an outlier. RatioCut and NCut fail to produce satisfactory results on both data sets, presumably owing to the poor approximation of spectral relaxation. In contrast, SDCut-QN achieves the desired results on both.

Secondly, to demonstrate the impact of the parameter γ, we test SDCut-QN and SDCut-SN on a random graph with γ ranging from 10² to 10⁴. The graph is generated with 1000 vertices, and every edge is assigned a weight uniformly sampled from [0, 1].

¹http://www.cis.upenn.edu/∼jshi/software/
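A compact sketch of the randomized rounding described above (ours; the small jitter on the diagonal is added only so the Cholesky factorization tolerates a rank-deficient X):

```python
import numpy as np

def randomized_round(X, A, trials=100, seed=0):
    """Rounding of [67]: sample scores ~ N(0, X), threshold at the median,
    keep the binary vector with the smallest objective x^T A x."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    L = np.linalg.cholesky(X + 1e-8 * np.eye(n))  # jitter for rank-deficient X
    best_x, best_val = None, np.inf
    for _ in range(trials):
        score = L @ rng.standard_normal(n)        # score ~ N(0, X)
        x = np.where(score > np.median(score), 1.0, -1.0)
        val = x @ A @ x
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val
```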


Fig. 1: Results for 2D points bisection (panels, left to right: original data, NCut, RatioCut, SDCut-QN). The thresholds are set to the median of the score vectors for NCut and RatioCut. The two classes of points are shown as red '+' and blue '◦' respectively. SDCut-QN succeeds in clustering the points as desired, while both RatioCut and NCut fail in these two cases.

The resulting affinity matrices are dense, and the DSYEVR routine in LAPACK is used for eigen-decomposition. Fig. 2 shows the upper-bounds, lower-bounds, numbers of iterations and running times of SDCut-QN and SDCut-SN for different values of γ. Several observations can be made: 1) as γ increases, upper-bounds decrease and lower-bounds increase, implying a tighter relaxation; 2) both SDCut-QN and SDCut-SN take more iterations to converge for larger γ; 3) SDCut-SN always uses fewer iterations than SDCut-QN. These observations agree with the analysis in Sect. IV-D: a larger γ yields better solution quality at the cost of slower convergence. To balance solution quality against convergence speed, we set γ = 10³ ∼ 10⁴ in the following experiments.

Finally, experiments are performed to evaluate two further factors affecting the running time of our methods: the sparsity of the affinity matrix W and the matrix size n. Numerical results for dense and sparse affinity matrices are shown in Table I and Table II respectively. The sparse affinity matrices are generated from random graphs with 8-neighbour connectivity, and the size of W is varied from 200 to 5000. ARPACK is used by SDCut-QN for partial eigen-decomposition in the sparse problems, and DSYEVR in all other cases. For both SDCut-QN and SDCut-SN, the number of iterations does not grow significantly with the matrix size n; the running time nevertheless grows with n, since an eigen-decomposition of an n × n matrix must be computed at each iteration of both methods. The second-order method SDCut-SN always needs significantly fewer iterations than the first-order method SDCut-QN. For dense affinity matrices, SDCut-SN runs consistently faster than SDCut-QN. For sparse affinity matrices, in contrast, SDCut-SN is faster only on small problems and becomes slower once n ≥ 2000, because the Lanczos method used by SDCut-QN for partial eigen-decomposition scales much better on large sparse matrices than the dense factorization (DSYEVR) used by SDCut-SN for full eigen-decomposition. The upper- and lower-bounds yielded by our methods are similar to those of the interior-point methods, while NCut and RatioCut run much faster but give significantly worse upper-bounds.

B. Constrained Image Segmentation

1) Formulation: In graph-based segmentation, an image is represented by a weighted graph G(V, E) with n vertices corresponding to pixels and edges encoding feature similarities between pixel pairs. A binary vector x ∈ {−1, 1}^n is optimized to cut the graph into two balanced disjoint groups with minimal total edge weight. Image segmentation (without additional constraints) can also be expressed as (21), where the affinity matrix W is constructed from the color similarities and spatial adjacencies between pixels:

    W_ij = exp(−‖f_i − f_j‖²₂/σ_f² − d(i, j)²/σ_d²)  if d(i, j) < r,  and 0 otherwise,      (22)
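A straightforward (if naive, O(n²)) construction of this sparse affinity matrix might read as follows — our own sketch; a k-d tree would normally replace the inner radius search:

```python
import numpy as np
import scipy.sparse as sp

def build_affinity(feats, coords, r, sigma_f, sigma_d):
    """Sparse W from (22): color + spatial Gaussian affinity within radius r.
    feats: n x c color features; coords: n x 2 pixel/superpixel positions."""
    n = feats.shape[0]
    rows, cols, vals = [], [], []
    idx = np.arange(n)
    for i in range(n):
        d2 = np.sum((coords - coords[i]) ** 2, axis=1)
        for j in np.flatnonzero((d2 < r ** 2) & (idx != i)):
            f2 = np.sum((feats[i] - feats[j]) ** 2)
            rows.append(i); cols.append(j)
            vals.append(np.exp(-f2 / sigma_f ** 2 - d2[j] / sigma_d ** 2))
    return sp.csr_matrix((vals, (rows, cols)), shape=(n, n))
```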


n, m        Metric       SDCut-QN  SDCut-SN  SeDuMi  SDPT3    MOSEK   NCut   RatioCut
200, 201    Time         0.7s      0.6s      10.4s   7.0s     5.5s    0.2s   0.2s
            Iterations   67.7      11.0      −       −        −       −      −
            Upper-bound  1.03      1.04      1.04    1.03     1.04    1.82   4.61
            Lower-bound  −0.63     −0.63     −0.58   −0.58    −0.58   −      −
            Gap          1.65      1.67      1.61    1.60     1.61    −      −
500, 501    Time         1.9s      1.8s      01m21s  33.9s    36.0s   0.3s   0.4s
            Iterations   43.2      9.7       −       −        −       −      −
            Upper-bound  2.94      2.96      2.93    2.92     2.93    4.01   9.23
            Lower-bound  −0.31     −0.31     −0.20   −0.20    −0.20   −      −
            Gap          3.26      3.28      3.13    3.12     3.13    −      −
1000, 1001  Time         22.6s     13.0s     08m21s  01m34s   02m36s  0.5s   0.9s
            Iterations   39.9      9.0       −       −        −       −      −
            Upper-bound  5.06      5.10      5.07    5.04     5.04    6.10   13.28
            Lower-bound  −0.19     −0.19     0.02    −18.58   0.02    −      −
            Gap          5.25      5.29      5.05    23.61    5.02    −      −
2000, 2001  Time         01m54s    54.3s     55m45s  12m37s   22m25s  2.1s   2.9s
            Iterations   34.9      9.0       −       −        −       −      −
            Upper-bound  8.02      7.99      7.94    7.96     7.95    9.00   20.85
            Lower-bound  −0.18     −0.18     0.21    −26.16   0.21    −      −
            Gap          8.19      8.17      7.73    34.13    7.74    −      −
5000, 5001  Time         20m39s    11m05s    14h55m  01h44m   04h40m  24.2s  15.4s
            Iterations   27.1      8.1       −       −        −       −      −
            Upper-bound  13.89     13.87     13.78   13.77    15.60   14.91  33.46
            Lower-bound  −0.32     −0.32     0.51    −41.36   2.66    −      −
            Gap          14.22     14.19     13.27   55.13    12.95   −      −

TABLE I: Numerical results for graph bisection with dense affinity matrices. All results are averages over 10 random graphs; Gap is the difference between the corresponding upper-bound and lower-bound. SDP-based methods (the left five columns) achieve better upper-bounds than the spectral methods (NCut and RatioCut). SDCut-SN uses fewer iterations than SDCut-QN and is the fastest of the five SDP-based methods.

n, m        Metric       SDCut-QN  SDCut-SN  SeDuMi  SDPT3    MOSEK   NCut   RatioCut
200, 201    Time         6.0s      0.6s      9.8s    7.3s     3.5s    0.1s   0.1s
            Iterations   76.5      11.0      −       −        −       −      −
            Upper-bound  −0.57     −0.57     −0.57   −0.57    −0.57   8.38   −0.48
            Lower-bound  −1.32     −1.32     −1.28   −1.28    −1.28   −      −
            Gap          0.75      0.75      0.71    0.71     0.71    −      −
500, 501    Time         12.3s     3.1s      01m36s  54.0s    40.5s   0.1s   0.2s
            Iterations   65.3      11.0      −       −        −       −      −
            Upper-bound  0.65      0.64      0.65    0.64     0.64    19.20  0.73
            Lower-bound  −0.41     −0.41     −0.30   −0.30    −0.30   −      −
            Gap          1.06      1.05      0.94    0.94     0.94    −      −
1000, 1001  Time         28.5s     24.0s     11m36s  02m12s   02m43s  0.1s   0.3s
            Iterations   73.3      11.8      −       −        −       −      −
            Upper-bound  1.35      1.35      1.35    1.35     1.34    28.32  1.41
            Lower-bound  0.25      0.25      0.46    −28.50   0.46    −      −
            Gap          1.10      1.10      0.89    29.84    0.88    −      −
2000, 2001  Time         01m12s    02m38s    42m19s  11m04s   23m12s  0.3s   0.5s
            Iterations   72.5      12.5      −       −        −       −      −
            Upper-bound  2.43      2.43      2.41    2.41     2.41    41.18  2.51
            Lower-bound  1.01      1.01      1.40    −38.98   1.40    −      −
            Gap          1.42      1.42      1.01    41.39    1.01    −      −
5000, 5001  Time         04m43s    26m19s    15h48m  01h40m   05h18m  1.2s   0.9s
            Iterations   90.3      13.2      −       −        −       −      −
            Upper-bound  4.00      3.99      3.95    3.95     3.95    64.98  4.02
            Lower-bound  2.24      2.24      3.12    −60.04   3.12    −      −
            Gap          1.77      1.76      0.83    63.99    0.83    −      −

TABLE II: Numerical results for graph bisection with sparse affinity matrices. All results are averages over 10 random graphs; Gap is the difference between the corresponding upper-bound and lower-bound. The upper-bounds achieved by SDP-based methods are close to each other and significantly better than those of the spectral methods (NCut and RatioCut). SDCut-SN uses far fewer iterations than SDCut-QN, and is faster for small problems (n ≤ 1000), while SDCut-QN is faster for large problems (n ≥ 2000).

f_i and f_j are the color features of pixels i and j, and d(i, j) is their spatial distance. Bottom-up graph-partitioning methods typically require an additional means of incorporating prior knowledge, which can be encoded as constraints on problem (21), such as partial grouping constraints [3] and histogram constraints [79].

Partial grouping constraints Here prior knowledge takes the form of sets of pixels labelled as foreground and background respectively. In our methods, the partial grouping constraints on x are formulated as: (t_f⊤Px)² ≥ κ²‖t_f⊤P‖₁², (t_b⊤Px)² ≥ κ²‖t_b⊤P‖₁² and ((t_f − t_b)⊤Px)² ≥ κ²‖(t_f − t_b)⊤P‖₁², where t_f, t_b ∈ {0, 1}^n are the indicator vectors of the labelled foreground and background pixels; P = D^{−1}W is the normalized affinity matrix, which acts as a smoothing term [3]; and κ ∈ (0, 1] expresses the degree of belief in the partial grouping constraints. Using X to represent xx⊤, the partial grouping constraints become: ⟨P⊤t_f t_f⊤P, X⟩ ≥ κ²‖t_f⊤P‖₁², ⟨P⊤t_b t_b⊤P, X⟩ ≥ κ²‖t_b⊤P‖₁² and ⟨P⊤(t_f − t_b)(t_f − t_b)⊤P, X⟩ ≥ κ²‖(t_f − t_b)⊤P‖₁².
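These rank-one constraint matrices are easy to assemble; below is a small sketch (ours) that also flips the '≥' into the '≤' direction, assuming the solver accepts constraints in ⟨B, X⟩ ≤ b form:

```python
import numpy as np

def grouping_constraint(P, t, kappa):
    """One partial-grouping constraint <P^T t t^T P, X> >= kappa^2 ||t^T P||_1^2,
    returned as (-B, -bound) so it reads <B', X> <= b' for the solver.
    P: normalized affinity D^{-1} W; t: {0,1} indicator vector."""
    v = P.T @ t                          # (t^T P)^T
    B = np.outer(v, v)                   # rank-one constraint matrix
    bound = (kappa * np.abs(v).sum()) ** 2
    return -B, -bound
```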


Fig. 2: Results for graph bisection with different values of the parameter γ (axes: upper-bound/lower-bound and number of iterations versus γ). The results are averaged over 10 random graphs. The upper-bounds and lower-bounds achieved by SDCut-QN are shown (those of SDCut-SN are very similar and thus omitted). The relaxation becomes tighter (that is, upper-bounds and lower-bounds move closer) for larger γ. The number of iterations of both SDCut-SN and SDCut-QN grows with γ.

Histogram constraints Given a target color distribution for the foreground region, a quadratic constraint on x can be enforced to upper-bound the Euclidean distance between the segment's color histogram and the target histogram: Σ_{i=1}^K (⟨t_i, x+1⟩/⟨1, x+1⟩ − q_i)² ≤ σ², where q = [q₁, q₂, . . . , q_K]⊤ is the target color histogram; t₁, t₂, . . . , t_K ∈ {0, 1}^n are the indicator vectors of the color bins; K is the number of histogram bins; and σ is the prescribed upper bound on the Euclidean distance. Equivalently, the constraint on x can be expressed on X̄ = [1, x⊤; x, xx⊤] as ⟨X̄, [1⊤B1, 1⊤B; B1, B]⟩ ≤ 0, where B = Σ_{i=1}^K (t_i t_i⊤ − q_i(t_i 1⊤ + 1 t_i⊤)) + (Σ_{i=1}^K q_i² − σ²)·1·1⊤.

Both the partial grouping constraints and the histogram constraints are quadratic in x, and cannot easily be encoded in classic spectral clustering methods. Biased normalized cut (BNCut) [66] is an extension of NCut [2] that incorporates a single set of labelled (foreground) pixels; its result is a weighted combination of the eigenvectors of the normalized Laplacian matrix, with weights that must be tuned manually. In our experiments, the parameters of BNCut are set as suggested in [66]. Wang and Davidson [65] proposed a constrained spectral clustering method (SMQC) which explicitly encodes one quadratic constraint; however, its complexity is much greater than that of standard spectral methods, since two full eigen-decompositions are needed to find feasible solutions. A limitation of both BNCut and SMQC is that only one quadratic constraint can be incorporated, and generalizing them to multiple quadratic constraints is difficult (if not impossible). In contrast, our SDP methods accommodate many quadratic constraints. For graph cuts methods, the constraints discussed above are usually incorporated into the unary potentials; however, it is known that the unary term cannot encode an appearance model exactly [79], especially when the color distributions of foreground and background overlap significantly.

2) Experiments: We use the rounding method of [67] to generate the final binary vector x from X. Fig. 3 illustrates image segmentation with partial grouping constraints. Test images are from the Berkeley segmentation dataset [80]; images are converted to the Lab color space and over-segmented into SLIC superpixels using the VLFeat toolbox [81]. As shown in the top row, 10 foreground pixels and 10 background pixels are annotated by red and blue markers respectively. The segmentation results of BNCut and SDCut-QN are shown in the second and third rows (the results of SeDuMi and SDPT3 are similar to those of SDCut-QN and thus omitted). BNCut does not extract the foreground accurately, as it incorporates only a single set of grouping pixels (foreground); our methods accommodate multiple sets of grouping pixels and extract the foreground more accurately. Table III compares the CPU time and upper-bounds of BNCut, SDCut-QN, SeDuMi and SDPT3, averaged over the five images shown in Fig. 3 (ARPACK is used by SDCut-QN for partial eigen-decomposition). All five images are over-segmented into 760 superpixels, so the numbers of SDP variables are identical. BNCut is much faster than the SDP-based methods, but yields worse upper-bounds. SDCut-QN achieves objective values similar to those of SeDuMi and SDPT3, yet is over 10 times faster.
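The matrix B in the histogram constraint above is cheap to assemble; a direct numpy transcription (our own sketch) is:

```python
import numpy as np

def histogram_constraint(ts, q, sigma):
    """B for the lifted histogram constraint <X_bar, [1'B1, 1'B; B1, B]> <= 0,
    with X_bar representing [1, x^T; x, xx^T]. ts: K x n bin indicators;
    q: target histogram; sigma: allowed Euclidean distance."""
    K, n = ts.shape
    one = np.ones(n)
    B = np.zeros((n, n))
    for i in range(K):
        ti = ts[i]
        B += np.outer(ti, ti) - q[i] * (np.outer(ti, one) + np.outer(one, ti))
    B += (q @ q - sigma ** 2) * np.outer(one, one)
    return B
```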
Fig. 4 and Table IV show the results for image segmentation with histogram constraints. Our methods are compared with graph cuts, a spectral method (SMQC), and SDP methods based on interior-point algorithms.

Fig. 3: Image segmentation with partial grouping constraints (rows: original images, SDCut-QN, BNCut). The top row shows the original images with labelled foreground (red markers) and background (blue markers) pixels. SDCut-QN achieves significantly better results than BNCut.

Methods      SDCut-QN  SeDuMi   SDPT3    BNCut
Time         23.7s     6m12s    5m29s    0.26s
Upper-bound  −116.10   −116.30  −116.32  −112.55

TABLE III: Numerical results for image segmentation with partial grouping constraints. Time and upper-bound are means over the five images in Fig. 3. SDCut-QN runs 10 times faster than SeDuMi and SDPT3, and offers a similar upper-bound.

The affinity matrix W is sparse, so ARPACK is used by SDCut-QN for eigen-decomposition. For graph cuts, the prior knowledge of the appearance distribution is encoded in the unary terms: ϕ_i = −ln(Pr(f_i | fore)/Pr(f_i | back)), i = 1, 2, . . . , n, where Pr(f_i | fore) and Pr(f_i | back) are the probabilities of the color of the i-th pixel under the foreground and background models respectively. The unary terms are shown in the second row of Fig. 4; they are unreliable where the color distributions of foreground and background overlap — in the first image, for example, the white collar of the person in the foreground has unary terms similar to those of the white wall in the background. The fourth row of Fig. 4 shows that these unsatisfactory unary terms significantly degrade the graph cuts segmentations. SMQC is a spectral clustering method that can encode a single quadratic constraint; its real-valued and binary results are shown in rows 6 and 7 of Fig. 4 and are similar to the SDCut-QN results in rows 8 and 9. The average F-measure of all evaluated methods is reported in Table IV: our methods achieve the best results and graph cuts the worst. Regarding running time, SDCut-SN is faster than all other SDP-based methods (SDCut-QN, SeDuMi, SDPT3 and MOSEK); as expected, it needs roughly 1/6 as many iterations as SDCut-QN to converge. SDCut-QN and SDCut-SN have slightly worse upper- and lower-bounds than the interior-point methods. Table IV also shows that SMQC is faster than our methods on these instances; however, SMQC does not scale to large problems, since it must compute full eigen-decompositions. We therefore test SDCut-QN and SMQC on problems with a larger number of superpixels (9801); as shown in Fig. 5, SMQC then takes far longer than SDCut-QN.

C. Image Co-segmentation

1) Formulation: Image co-segmentation partitions the same object out of multiple images simultaneously; its advantage over single-image segmentation is that the object's appearance is shared across the images. Co-segmentation optimizes two criteria: (i) color and spatial consistency within each image, and (ii) separability of foreground and background across images, measured with discriminative features such as SIFT. Joulin et al. [13] cast co-segmentation as discriminative clustering:

    min_{x ∈ {−1,1}^n}  x⊤Ax,   s.t. |x⊤t_i| ≤ κ n_i, ∀i = 1, . . . , s,        (23)

where s is the number of images, n_i is the number of pixels of the i-th image, and n = Σ_{i=1}^s n_i. t_i ∈ {0, 1}^n is the indicator vector of image i, so that 1⊤t_i = n_i. The matrix A is constructed as A_b + (µ/n)A_w, where A_w = I_n − D^{−1/2}WD^{−1/2} is the intra-image affinity term and A_b = κ_k(I − 1·1⊤/n)(nκ_k I_n + K)^{−1}(I − 1·1⊤/n) is the inter-image discriminative clustering cost matrix.

Fig. 4: Image segmentation with histogram constraints (coarse over-segmentation; around 726 superpixels). From top to bottom: original images, ground-truth (GT), superpixels, unary terms for graph cuts, results for graph cuts, results for SMQC (score vector and binary) and for SDCut-QN (score vector and binary). Results for the other SDP-based methods are similar to those of SDCut-QN and thus omitted. Graph cuts tends to mix foreground and background regions of similar color; SDCut-QN achieves the best segmentation results.

Methods      SDCut-QN  SDCut-SN  SeDuMi  SDPT3   MOSEK   GC     SMQC
Time         32.2s     12.6s     08m16s  03m28s  02m40s  0.2s   5.1s
Iterations   149.8     25.6      −       −       −       −      −
F-measure    0.905     0.904     0.902   0.901   0.901   0.722  0.832
Upper-bound  1.34      1.34      1.31    1.31    1.29    −      1.40
Lower-bound  −1.62     −1.62     −1.53   −1.53   −1.53   −      −
Gap          2.96      2.96      2.84    2.83    2.82    −      −

TABLE IV: Numerical results for image segmentation with histogram constraints, averaged over the eight images shown in Fig. 4. Gap is the difference between the corresponding upper-bound and lower-bound. SDCut-SN uses fewer iterations than SDCut-QN and runs faster than all other SDP-based methods. Graph cuts (GC) and SMQC achieve worse F-measures than the SDP-based methods.

κ_k and µ are parameters, set as in [13]. W is a block-diagonal matrix whose i-th block is the affinity matrix (22) of the i-th image, and D = diag(W1). K is the kernel matrix based on the χ²-distance between the SIFT features of pairs of pixels; an incomplete Cholesky decomposition is used to compute a low-rank approximation of K. κ ∈ (0, 1) is a parameter, and the constraints in (23) serve to avoid trivial solutions. The SDP relaxation of problem (23) is:

    min_{X ∈ S^n_+}  ⟨X, A⟩,   s.t. diag(X) = 1, ⟨X, t_i t_i⊤⟩ ≤ κ²n_i², ∀i = 1, . . . , s.      (24)

Joulin et al. [13] used a low-rank factorization method [82] (referred to as LowRank) to solve the associated SDP program. LowRank finds a locally optimal factorization X = YY⊤, where Y ∈ R^{n×r} and r ≪ n. To obtain the final solution, a score vector is computed from the eigen-decomposition of X, and the binary solution is obtained by thresholding the score vector.
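For reference, the construction of A described above is short enough to state directly; the following sketch (ours) forms it densely on a small problem, whereas the paper works with the incomplete Cholesky factor of K rather than a dense inverse:

```python
import numpy as np

def coseg_matrix(W, K, kappa_k, mu):
    """A = A_b + (mu/n) A_w as in [13], formed densely for a small problem
    (the paper uses an incomplete Cholesky factorization of K instead)."""
    n = W.shape[0]
    d = W.sum(axis=1)
    Dm12 = np.diag(1.0 / np.sqrt(d))
    A_w = np.eye(n) - Dm12 @ W @ Dm12                 # intra-image term
    Pi = np.eye(n) - np.ones((n, n)) / n              # centering projector
    A_b = kappa_k * Pi @ np.linalg.inv(n * kappa_k * np.eye(n) + K) @ Pi
    return A_b + (mu / n) * A_w
```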

Fig. 5: Image segmentation with histogram constraints (fine over-segmentation; 9801 superpixels). Panels: (a) superpixels, (b) SDCut-QN score, (c) SDCut-QN result (23m21s), (d) SMQC score, (e) SMQC result (04h09m). SDCut-QN and SMQC produce similar segmentations, but SDCut-QN runs 10 times faster than SMQC.

Fig. 6: Image co-segmentation on the Weizman horses and MSRC datasets (rows: original images, LowRank score vectors, SDCut-QN score vectors). The other methods produce similar segmentation results.

2) Experiments: The Weizman horses² and MSRC³ datasets are used for the image co-segmentation experiments. There are 6 ∼ 10 images in each of four classes: car-front, car-back, face and horse. Each image is over-segmented into 400 ∼ 700 SLIC superpixels, which increases the dimension n of the p.s.d. matrix variable to 4000 ∼ 7000. We compare SDCut-QN and SDCut-SN with the low-rank factorization method (LowRank) and the interior-point methods (SeDuMi, SDPT3 and MOSEK). The matrix A in (23) decomposes into a sparse part (A_w) and a structured part (A_b), so ARPACK can be used by SDCut-QN for partial eigen-decomposition, while SDCut-SN uses DSYEVR for full eigen-decomposition. As Table V shows, SDCut-QN takes 10 times more iterations than SDCut-SN, yet still runs faster, especially on the largest problem ("face"): SDCut-QN exploits the specific structure of A and thus has a much lower per-iteration cost than SDCut-SN. SDCut-QN also runs about 5 times faster than LowRank. All methods give similar upper-bounds on the primal objective, and the score vectors shown in Fig. 6 confirm that they achieve similar visual results.

D. Graph Matching

1) Formulation: In the following graph matching experiments, K source points must be matched to L target points, where K < L. The matching should maximize the local feature similarities between matched pairs and also the structural similarity between the source and target graphs.

²http://www.msri.org/people/members/eranb/
³http://www.research.microsoft.com/en-us/projects/objectclassrecognition/


Data (n, m)             Method    Time    Iter.  Upper-bound  Lower-bound  Gap
car-back (4012, 4018)   SDCut-QN  06m08s  140    12.71        -7.41        20.12
                        SDCut-SN  09m59s  15     12.71        -7.41        20.12
                        SeDuMi    07h02m  -      12.74        -7.30        20.04
                        SDPT3     02h51m  -      12.59        -60.80       73.40
                        MOSEK     02h54m  -      12.74        -7.30        20.04
                        LowRank   28m44s  -      12.64        -            -
car-front (4017, 4023)  SDCut-QN  07m32s  188    8.16         -7.67        15.83
                        SDCut-SN  11m25s  16     8.16         -7.67        15.83
                        SeDuMi    07h04m  -      8.16         -7.56        15.72
                        SDPT3     02h10m  -      8.04         -64.58       72.63
                        MOSEK     02h54m  -      8.16         -7.56        15.72
                        LowRank   59m47s  -      8.61         -            -
face (6684, 6694)       SDCut-QN  08m18s  164    12.65        -9.73        22.38
                        SDCut-SN  43m57s  16     12.65        -9.73        22.38
                        SeDuMi    > 24hrs -      -            -            -
                        SDPT3     09h21m  -      13.12        -80.29       93.41
                        MOSEK     12h06m  -      12.96        -9.53        22.49
                        LowRank   40m56s  -      20.53        -            -
horse (4587, 4597)      SDCut-QN  06m15s  167    14.78        -6.83        21.61
                        SDCut-SN  17m01s  16     14.78        -6.83        21.61
                        SeDuMi    11h03m  -      14.76        -6.69        21.45
                        SDPT3     02h42m  -      15.71        -61.24       76.95
                        MOSEK     04h14m  -      14.76        -6.69        21.45
                        LowRank   42m14s  -      15.77        -            -

TABLE V: Numerical results for image co-segmentation. Gap is defined as the difference between the upper-bound and lower-bound. All the evaluated methods are SDP based and achieve similar upper-bounds and lower-bounds. SDCut-QN runs significantly faster than the other methods, although it needs more iterations to converge than SDCut-SN.

D. Graph Matching

1) Formulation: In the following experiments for graph matching, K source points must be matched to L target points, where K < L. The matching should maximize the local feature similarities between matched pairs and also the structural similarity between the source and target graphs. The problem is expressed as the following BQP:

$$\min_{x \in \{0,1\}^{KL}} \; h^\top x + x^\top H x, \quad \text{s.t.} \quad \sum_{j} x_{ij} = 1, \;\; \forall i = 1, \dots, K, \tag{25a}$$

$$\sum_{i} x_{ij} \le 1, \;\; \forall j = 1, \dots, L, \tag{25b}$$

where $x_{ij} := x_{(i-1)L+j}$ equals 1 if the source point i is matched to the target point j, and 0 otherwise. $h \in \mathbb{R}^{KL}$ records the local feature similarity between each pair of source-target points; $H_{(i-1)L+j,(k-1)L+l} = \exp(-(d(i,k) - d(j,l))^2/\sigma^2)$ encodes the structural consistency of source points i, k and target points j, l. In this case, the problem is a {0, 1}-quadratic program, rather than a {±1}-quadratic program. Based on (25a), x contains exactly K non-zero entries, so $\eta = \mathrm{trace}([1, x^\top; x, xx^\top]) = K + 1$. Schellewald et al. [12] use X to represent $[1, x^\top; x, xx^\top]$ and express the SDP relaxation of (25) as:

$$\min_{X \in \mathcal{S}^{KL+1}_+} \; \langle X, [0, 0.5h^\top; 0.5h, H] \rangle \tag{26a}$$

$$\text{s.t.}\;\; X_{1,1} = 1, \tag{26b}$$

$$2X_{i,i} = X_{i,1} + X_{1,i}, \;\; \forall i = 2, \dots, KL+1, \tag{26c}$$

$$\sum_{j=1}^{L} X_{ij,ij} = 1, \;\; \forall i = 1, \dots, K, \tag{26d}$$

$$X_{ij,ik} + X_{ik,ij} = 0, \;\; \forall i = 1, \dots, K, \; \forall j, k = 1, \dots, L, \; j \ne k, \tag{26e}$$

$$X_{ij,kj} + X_{kj,ij} = 0, \;\; \forall i, k = 1, \dots, K, \; i \ne k, \; \forall j = 1, \dots, L, \tag{26f}$$

where $X_{ij,kl} := X_{(i-1)L+j+1,(k-1)L+l+1}$. Constraints (26c) arise from the fact that $x_i = x_i^2$; constraints (26d) arise from (25a); constraints (26e) and (26f) avoid undesirable solutions that match one point to multiple points. The binary solution x is obtained by solving the following linear program:

$$\max_{x \in \mathbb{R}^{KL}} \; x^\top \mathrm{diag}(X^\star), \quad \text{s.t.}\;\; x \ge 0, \;\; \text{(25a)}, \;\; \text{(25b)}, \tag{27}$$

which is guaranteed to have integer solutions [12]. SMAC [64], a spectral method with affine constraints, is also evaluated for graph matching.
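To make the construction above concrete, here is a minimal sketch that builds h and H from point coordinates and rounds an SDP solution via the LP (27) using scipy. The Gaussian form of H follows the formula above; the sign conventions for h, the precomputed feature_sim matrix, sigma and the solver choice are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def build_h_H(src, tgt, feature_sim, sigma=1.0):
    """src: K x d source points; tgt: L x d target points.
    feature_sim: K x L local feature similarities (assumed precomputed)."""
    K, L = src.shape[0], tgt.shape[0]
    h = -feature_sim.reshape(K * L)     # minimization => negate similarities
    d_src = cdist(src, src)             # d(i, k), source pairwise distances
    d_tgt = cdist(tgt, tgt)             # d(j, l), target pairwise distances
    # H[(i-1)L+j, (k-1)L+l] = exp(-(d(i,k) - d(j,l))^2 / sigma^2)
    diff = d_src[:, None, :, None] - d_tgt[None, :, None, :]
    H = np.exp(-(diff ** 2) / sigma ** 2).reshape(K * L, K * L)
    return h, H

def round_lp(diag_X, K, L):
    """Solve the LP (27); diag_X is the x-part of diag(X*) from (26).
    The solution is integral [12], so rounding is a formality."""
    A_eq = np.kron(np.eye(K), np.ones((1, L)))   # (25a): sum_j x_ij = 1
    A_ub = np.kron(np.ones((1, K)), np.eye(L))   # (25b): sum_i x_ij <= 1
    res = linprog(-diag_X, A_ub=A_ub, b_ub=np.ones(L),
                  A_eq=A_eq, b_eq=np.ones(K), bounds=(0, 1))
    return res.x.round().astype(int).reshape(K, L)
```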

which is guaranteed to have integer solutions [12]. SMAC [64] is also evaluated for graph matching, which is a spectral method with affine constrains. 2) Experiments: Firstly, three data sets are generated to evaluate the proposed methods visually and numerically. For 2dimensional and 3-dimensional data, 30 target points are generated from a uniform distribution firstly, and 15 source points are selected randomly and rotated and translated by a random similarity transformation y = Rx + t with additive Gaussian noise. For the Stanford bunny data4 , 50 points are randomly sampled and similar transformation and noise are applied. The matrix H in (26) is not sufficiently sparse, so DSYEVR is used for both SDCut-QN and SDCut-SN for eigen-decomposition. 4 http://www.gvu.gatech.edu/people/faculty/greg.turk/bunny/bunny.html


[Fig. 7 panels: matching results on the 2d-toy (top row), 3d-toy (middle row) and bunny (bottom row) data; axis ticks omitted.]

Data     K×L      n     m       SDCut-QN       SDCut-SN       SeDuMi       SDPT3        MOSEK
2d-toy   15 × 30  451   10141   6.2s/136iter   3.2s/14iter    01h06m       09m06s       22m17s
3d-toy   15 × 30  451   10141   4.2s/93iter    3.1s/12iter    58m20s       09m04s       13m48s
Bunny    50 × 50  2501  125051  03m60s/54iter  03m34s/12iter  Out of mem.  Out of mem.  Out of mem.

Fig. 7: Graph matching results. For the 2d (top row) and 3d (middle row) artificial data, 15 source points are to be matched to a subset of 30 target points. For the bunny data (bottom row), there are 50 source points and 50 target points. Gap is the difference between the corresponding upper-bound and lower-bound. All evaluated methods produce the same matching results and thus the same upper-bounds. SDCut-SN is the fastest and SDCut-QN the second fastest. SeDuMi achieves the smallest gaps for the 2d-toy and 3d-toy data. Interior-point methods run out of memory on the bunny data, as there are over 10^5 constraints.

From Fig. 7, we can see that the source and target points are matched correctly in all three data sets. SDCut-SN takes far fewer iterations and less running time to converge than SDCut-QN, and both run much faster than interior-point methods. For the bunny data with 125051 constraints, interior-point methods exceed the 100GB memory limit.

Table VI shows the numerical results for different problem sizes: n ranges from 201 to 3201 and m from 3011 to 192041. The maximum numbers of iterations for SDCut-QN and SDCut-SN are restricted to 500 and 50 respectively. SDCut-SN and SDCut-QN achieve exactly the same upper-bounds as interior-point methods and slightly worse lower-bounds. As for the running time, SDCut-SN takes far fewer iterations to converge and is moderately faster (within a factor of two) than SDCut-QN. Both methods run significantly faster than interior-point methods: taking the case K × L = 25 × 50 as an example, SDCut-SN and SDCut-QN converge in around 4 minutes, while interior-point methods do not converge within 24 hours. Furthermore, interior-point methods run out of the 100GB memory limit when the number of primal constraints m exceeds 10^5. SMAC is also evaluated in this experiment; it provides worse upper-bounds and error ratios.

VI. CONCLUSION

In this paper, we have presented a new semidefinite formulation (SDCut) for BQPs. SDCut produces a similar lower bound to the conventional SDP formulation, and is therefore tighter than spectral relaxation. Two algorithms are proposed, based on quasi-Newton methods (SDCut-QN) and smoothing Newton methods (SDCut-SN). Both are more efficient than classic interior-point algorithms. To be specific, SDCut-SN is faster than SDCut-QN for small to medium sized problems. If the matrix to be eigen-decomposed, C(u), has a special structure (for example, sparse or low-rank) such that matrix-vector products can be computed efficiently, SDCut-QN is much more scalable to large problems. The proposed algorithms have been applied to several computer vision tasks, which demonstrates their flexibility in accommodating different types of constraints. Experiments also show the computational efficiency and good solution quality of SDCut. We have made the code available online5.

Acknowledgements

This work was in part supported by ARC Future Fellowship FT120100969 and ARC project DP120101172. A. van den Hengel's participation in this work was also supported by the Data to Decisions Cooperative Research Centre.

APPENDIX A
THE SPHERICAL CONSTRAINT

In this section, we explore a property of the set $\Theta_\eta := \{X \in \mathcal{S}^n_+ \mid \mathrm{trace}(X) = \eta > 0\}$, which is an intersection of the p.s.d. cone and a linear subspace. For this set, we have the following results.

5 http://cs.adelaide.edu.au/~chhshen/projects/BQP/


K×L (n, m)            Method    Time     Iter.  Error  Upper-bound  Lower-bound  Gap
10×20 (201, 3011)     SDCut-QN  5.8s     262.6  1/100  -1.30×10^-1  -1.31×10^-1  1.07×10^-3
                      SDCut-SN  1.7s     34.2   1/100  -1.30×10^-1  -1.31×10^-1  1.15×10^-3
                      SeDuMi    02m17s   -      1/100  -1.30×10^-1  -1.31×10^-1  1.73×10^-3
                      SDPT3     45.0s    -      1/100  -1.30×10^-1  -1.30×10^-1  7.65×10^-4
                      MOSEK     30.7s    -      1/100  -1.30×10^-1  -1.30×10^-1  7.65×10^-4
                      SMAC      0.1s     -      1/100  -1.27×10^-1  -            -
15×30 (451, 10141)    SDCut-QN  22.5s    359.7  1/150  -3.77×10^-2  -3.81×10^-2  4.29×10^-4
                      SDCut-SN  11.2s    35.7   1/150  -3.77×10^-2  -3.81×10^-2  4.22×10^-4
                      SeDuMi    01h34m   -      1/150  -3.77×10^-2  -3.78×10^-2  1.73×10^-4
                      SDPT3     15m48s   -      1/150  -3.77×10^-2  -3.79×10^-2  1.86×10^-4
                      MOSEK     30m38s   -      1/150  -3.77×10^-2  -3.78×10^-2  1.73×10^-4
                      SMAC      0.3s     -      6/150  -2.01×10^-2  -            -
20×40 (801, 24021)    SDCut-QN  01m27s   405.2  1/200  4.01×10^-2   3.93×10^-2   8.11×10^-4
                      SDCut-SN  51.2s    41.7   1/200  4.01×10^-2   3.93×10^-2   8.33×10^-4
                      SeDuMi    17h48m   -      1/200  4.01×10^-2   3.99×10^-2   2.32×10^-4
                      SDPT3     02h09m   -      1/200  4.01×10^-2   3.98×10^-2   2.75×10^-4
                      MOSEK     04h39m   -      1/200  4.01×10^-2   3.99×10^-2   2.32×10^-4
                      SMAC      0.2s     -      6/200  4.29×10^-2   -            -
25×50 (1251, 46901)   SDCut-QN  04m05s   384.0  0/250  1.04×10^-1   1.03×10^-1   4.08×10^-4
                      SDCut-SN  03m50s   41.0   0/250  1.04×10^-1   1.03×10^-1   4.23×10^-4
                      SeDuMi    > 24hrs  -      -      -            -            -
                      SDPT3     > 24hrs  -      -      -            -            -
                      MOSEK     > 24hrs  -      -      -            -            -
                      SMAC      0.3s     -      3/250  1.06×10^-1   -            -
30×60 (1801, 81031)   SDCut-QN  14m43s   500.0  2/300  1.59×10^-1   1.58×10^-1   1.07×10^-3
                      SDCut-SN  10m20s   50.0   2/300  1.59×10^-1   1.58×10^-1   1.07×10^-3
                      SeDuMi    > 24hrs  -      -      -            -            -
                      SDPT3     > 24hrs  -      -      -            -            -
                      MOSEK     > 24hrs  -      -      -            -            -
                      SMAC      0.4s     -      4/300  1.60×10^-1   -            -
40×80 (3201, 192041)  SDCut-QN  03h02m   500.0  1/400  2.63×10^-1   2.61×10^-1   2.15×10^-3
                      SDCut-SN  02h26m   50.0   1/400  2.63×10^-1   2.61×10^-1   2.09×10^-3
                      SeDuMi    Out of mem.
                      SDPT3     Out of mem.
                      MOSEK     Out of mem.
                      SMAC      1.2s     -      9/400  2.66×10^-1   -            -

TABLE VI: Numerical results for graph matching, averaged over 10 random graphs. Gap is the difference between the corresponding upper-bound and lower-bound. For the fourth and fifth problem sizes, the interior-point methods SeDuMi, SDPT3 and MOSEK do not converge within 24 hours; for the last one, with around 2 × 10^5 constraints, they run out of the 100GB memory limit. SDCut-SN uses fewer iterations than SDCut-QN and is the fastest of all SDP based methods. All SDP based methods achieve the same upper-bounds and error rates; the lower-bounds of SDCut-SN and SDCut-QN are slightly worse than those of the interior-point methods. SMAC provides worse upper-bounds and error rates than the SDP-based methods.

Theorem A5 (The spherical constraint). For $X \in \Theta_\eta$, we have the inequality $\|X\|_F \le \eta$, in which the equality holds if and only if $\mathrm{rank}(X) = 1$.

Proof. The proof here is an extension of the one in [68]. For a matrix $X \in \Theta_\eta$, $\|X\|_F^2 = \mathrm{trace}(XX^\top) = \|\lambda(X)\|_2^2 \le \|\lambda(X)\|_1^2$. Because $X \in \mathcal{S}^n_+$, we have $\lambda(X) \ge 0$ and $\|\lambda(X)\|_1 = \mathrm{trace}(X)$. Therefore

$$\|X\|_F = \|\lambda(X)\|_2 \le \|\lambda(X)\|_1 = \eta. \tag{A28}$$

Because $\|x\|_2 = \|x\|_1$ holds if and only if only one element in $x$ is non-zero, equality in (A28) holds if and only if $X$ has exactly one non-zero eigenvalue, i.e., $\mathrm{rank}(X) = 1$.
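A quick numerical check of Theorem A5 (a sketch for intuition only, not part of the paper's experiments): random rank-1 members of Θ_η attain ‖X‖_F = η, while higher-rank members fall strictly below it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eta = 20, 5.0

# Rank-1 member of Theta_eta: X = eta * v v^T with ||v||_2 = 1.
v = rng.normal(size=n); v /= np.linalg.norm(v)
X1 = eta * np.outer(v, v)
print(np.trace(X1), np.linalg.norm(X1, "fro"))   # both equal eta

# Full-rank member: random PSD matrix rescaled to have trace eta.
B = rng.normal(size=(n, n)); X = B @ B.T
X *= eta / np.trace(X)
print(np.trace(X), np.linalg.norm(X, "fro"))     # trace = eta, ||X||_F < eta
```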

A. Proof of Proposition 1

Proof. (i) Firstly, we have the following inequalities:

$$p(X^\star_\gamma) \ge p(X^\star) \ge p_\gamma(X^\star) \ge p_\gamma(X^\star_\gamma), \tag{A29}$$

where the first and third inequalities are based on the definitions in (6) and (7), and the second one is based on $\|X\|_F \le \eta$ (see Theorem A5). Then we have $|p(X^\star) - p(X^\star_\gamma)| = p(X^\star_\gamma) - p(X^\star) \le p(X^\star_\gamma) - p_\gamma(X^\star_\gamma) = \frac{1}{2\gamma}(\eta^2 - \|X^\star_\gamma\|_F^2) \le \frac{\eta^2}{2\gamma}$. Setting $\gamma$ to $\eta^2/(2\epsilon)$ then gives $|p(X^\star) - p(X^\star_\gamma)| \le \epsilon$.

(ii) By the definition of $X^\star_{\gamma_1}$ and $X^\star_{\gamma_2}$, it is clear that $p_{\gamma_1}(X^\star_{\gamma_1}) \le p_{\gamma_1}(X^\star_{\gamma_2})$ and $p_{\gamma_2}(X^\star_{\gamma_2}) \le p_{\gamma_2}(X^\star_{\gamma_1})$. Then we have $p_{\gamma_1}(X^\star_{\gamma_1}) - \frac{\gamma_2}{\gamma_1} p_{\gamma_2}(X^\star_{\gamma_1}) = (1 - \frac{\gamma_2}{\gamma_1}) \cdot p(X^\star_{\gamma_1}) \le p_{\gamma_1}(X^\star_{\gamma_2}) - \frac{\gamma_2}{\gamma_1} p_{\gamma_2}(X^\star_{\gamma_2}) = (1 - \frac{\gamma_2}{\gamma_1}) \cdot p(X^\star_{\gamma_2})$. Because $\gamma_2/\gamma_1 > 1$, the factor $(1 - \gamma_2/\gamma_1)$ is negative, so $p(X^\star_{\gamma_1}) \ge p(X^\star_{\gamma_2})$.


B. Proof of Proposition 4

Proof. Firstly, it is known that the optimum of the original SDP problem (6) is a lower-bound on the optimum of the BQP (1) (denoted by $\xi^\star$): $p(X^\star) \le \xi^\star$. Then according to (A29), we have $p_\gamma(X^\star_\gamma) \le p(X^\star)$. Finally, by weak duality, the dual objective value is never larger than the primal objective value over the feasible set (see for example [83]): $d_\gamma(u) \le p_\gamma(X^\star_\gamma)$, where $u \in \mathbb{R}^p \times \mathbb{R}^q_+$, $\gamma > 0$. In summary, we have: $d_\gamma(u) \le p_\gamma(X^\star_\gamma) \le p(X^\star) \le \xi^\star$, $\forall u \in \mathbb{R}^p \times \mathbb{R}^q_+$, $\forall \gamma > 0$.

APPENDIX B
EUCLIDEAN PROJECTION ONTO THE P.S.D. CONE

Theorem A6. The Euclidean projection of a symmetric matrix $X \in \mathcal{S}^n$ onto the positive semidefinite cone $\mathcal{S}^n_+$ is given by

$$\Pi_{\mathcal{S}^n_+}(X) := \arg\min_{Y \in \mathcal{S}^n_+} \|Y - X\|_F^2 = P \,\mathrm{diag}(\max(0, \lambda))\, P^\top, \tag{A30}$$

where $X = P\,\mathrm{diag}(\lambda)\,P^\top$ is the eigen-decomposition of $X$.

Proof. This result is well-known and its proof can be found in [84] or Section 8.1.1 of [83].

Although (A30) is an SDP problem, it has an explicit solution. This is important for simplifying our SDP dual formulation.

APPENDIX C
DERIVATIVES OF SEPARABLE SPECTRAL FUNCTIONS

A spectral function $F(X): \mathcal{S}^n \to \mathbb{R}$ is a function which depends only on the eigenvalues of a symmetric matrix $X$, and can be written as $f(\lambda(X))$ for some symmetric function $f: \mathbb{R}^n \to \mathbb{R}$, where $\lambda(X)$ denotes the vector of eigenvalues of $X$. A function $f(\cdot)$ is symmetric if $f(x) = f(Ux)$ for any permutation matrix $U$ and any $x$ in the domain of $f(\cdot)$. Such symmetric functions and the corresponding spectral functions are called separable when $f(x) = \sum_{i=1}^n g(x_i)$ for some function $g: \mathbb{R} \to \mathbb{R}$. It is known [85], [86], [69] that a spectral function has the following properties.

Theorem A7. A separable spectral function $F(\cdot)$ is k-times (continuously) differentiable at $X \in \mathcal{S}^n$ if and only if its corresponding function $g(\cdot)$ is k-times (continuously) differentiable at $\lambda_i$, $i = 1, \dots, n$, and the first- and second-order derivatives of $F(\cdot)$ are given by

$$\nabla F(X) = P\,\mathrm{diag}(\nabla g(\lambda_1), \nabla g(\lambda_2), \dots, \nabla g(\lambda_n))\,P^\top, \tag{A31}$$

$$\nabla^2 F(X)(H) = P\,(\Omega(\lambda) \circ (P^\top H P))\,P^\top, \;\; \forall H \in \mathcal{S}^n, \tag{A32}$$

where

$$[\Omega(\lambda)]_{ij} := \begin{cases} \dfrac{\nabla g(\lambda_i) - \nabla g(\lambda_j)}{\lambda_i - \lambda_j} & \text{if } \lambda_i \ne \lambda_j, \\ \nabla^2 g(\lambda_i) & \text{if } \lambda_i = \lambda_j, \end{cases} \quad i, j = 1, \dots, n,$$

and $X = P\,\mathrm{diag}(\lambda)\,P^\top$ is the eigen-decomposition of $X$.
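As an illustration of Theorem A7, and of Theorem A6 as a special case, here is a minimal numpy sketch (an assumption-laden illustration, not the paper's code): with $g(x) = \frac{1}{2}(\max(0,x))^2$, the gradient formula (A31) reproduces the projection onto the p.s.d. cone, as in Proposition 3 below.

```python
import numpy as np

def spectral_gradient(X, grad_g):
    """Gradient of a separable spectral function, eq. (A31)."""
    lam, P = np.linalg.eigh(X)
    return P @ np.diag(grad_g(lam)) @ P.T

# g(x) = 0.5 * max(0, x)^2, so grad_g(x) = max(0, x), and (A31) gives
# the Euclidean projection onto the p.s.d. cone (Theorem A6).
rng = np.random.default_rng(0)
B = rng.normal(size=(6, 6)); X = (B + B.T) / 2      # random symmetric matrix
proj = spectral_gradient(X, lambda x: np.maximum(0, x))

# Sanity check: proj is at least as close to X as a random PSD matrix.
C = rng.normal(size=(6, 6)); Y = C @ C.T
assert np.linalg.norm(proj - X) <= np.linalg.norm(Y - X)
```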

A. Proof of Proposition 3

Proof. Define $\zeta(X) := \frac{1}{2}\|\Pi_{\mathcal{S}^n_+}(X)\|_F^2 = \frac{1}{2}\sum_{i=1}^n (\max(0, \lambda_i(X)))^2$, where $\lambda(X)$ is the vector of eigenvalues of $X \in \mathcal{S}^n$. $\zeta(X)$ is a separable spectral function associated with the function $g(x) = \frac{1}{2}(\max(0, x))^2$, where $x \in \mathbb{R}$. $\zeta: \mathcal{S}^n \to \mathbb{R}$ is continuously differentiable but not necessarily twice differentiable at $X \in \mathcal{S}^n$, as its corresponding function $g: \mathbb{R} \to \mathbb{R}$ has the same properties at $\lambda_i(X)$, $i = 1, \dots, n$. In fact, $\Pi_{\mathcal{S}^n_+}(X)$ is differentiable when $X$ has no zero eigenvalue and not differentiable when $X$ is singular. We also have $\nabla\zeta(X) = \Pi_{\mathcal{S}^n_+}(X)$.

APPENDIX D
SOLVING THE LINEAR SYSTEM (19)

The linear system (19) can be decomposed into two parts:

$$(19) \Leftrightarrow \begin{bmatrix} 1 & 0 \\ (\nabla_\epsilon \tilde{F})(\epsilon_k, u_k) & (\nabla_u \tilde{F})(\epsilon_k, u_k) \end{bmatrix} \begin{bmatrix} \Delta\epsilon_k \\ \Delta u_k \end{bmatrix} = -\begin{bmatrix} \epsilon_k \\ \tilde{F}(\epsilon_k, u_k) \end{bmatrix} + \begin{bmatrix} \bar{\epsilon} \\ 0 \end{bmatrix}, \tag{A33a}$$


$$\Leftrightarrow \begin{cases} \Delta\epsilon_k = \bar{\epsilon} - \epsilon_k, & \text{(A33b)} \\ \nabla_u \tilde{F}(\epsilon_k, u_k)(\Delta u_k) = -\tilde{F}(\epsilon_k, u_k) - \nabla_\epsilon \tilde{F}(\epsilon_k, u_k)(\Delta\epsilon_k), & \text{(A33c)} \end{cases}$$

where $\nabla_\epsilon \tilde{F}$ and $\nabla_u \tilde{F}$ denote the partial derivatives of $\tilde{F}$ with respect to $\epsilon$ and $u$ respectively. One can first obtain the value of $\Delta\epsilon_k$ from (A33b) and then solve the linear system (A33c) using CG-like algorithms.

Since the Jacobian matrix $\nabla_u \tilde{F}(\epsilon_k, u_k) \in \mathbb{R}^{m \times m}$ is nonsymmetric when inequality constraints exist, biconjugate gradient stabilized (BiCGStab) methods [71] are used for (A33c) when $m_i \ne 0$, and classic conjugate gradient methods are used when $m_i = 0$. The computational bottleneck of CG-like algorithms is the Jacobian-vector product at each iteration; below we discuss its computational complexity in our specific cases. Firstly, we give the partial derivatives of the smoothing functions $\phi(\epsilon, v): \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, $\tilde{\Pi}_D(\epsilon, v): \mathbb{R} \times \mathbb{R}^m \to \mathbb{R}^m$ and $\tilde{\Pi}_{\mathcal{S}^n_+}(\epsilon, X): \mathbb{R} \times \mathcal{S}^n \to \mathcal{S}^n$:

$$\nabla_\epsilon \phi(\epsilon, v) = \begin{cases} 0.125 - 0.5(v/\epsilon)^2 & \text{if } -0.5\epsilon \le v \le 0.5\epsilon, \\ 0 & \text{otherwise}, \end{cases} \tag{A34a}$$

$$\nabla_v \phi(\epsilon, v) = \begin{cases} 1 & \text{if } v > 0.5\epsilon, \\ 0.5 + v/\epsilon & \text{if } -0.5\epsilon \le v \le 0.5\epsilon, \\ 0 & \text{if } v < -0.5\epsilon, \end{cases} \tag{A34b}$$

$$[\nabla_\epsilon \tilde{\Pi}_D(\epsilon, v)]_i = \begin{cases} 0 & \text{if } i = 1, \dots, p, \\ \nabla_\epsilon \phi(\epsilon, v_i) & \text{if } i = p+1, \dots, m, \end{cases} \tag{A35a}$$

$$[\nabla_v \tilde{\Pi}_D(\epsilon, v)]_{ii} = \begin{cases} 1 & \text{if } i = 1, \dots, p, \\ \nabla_{v_i} \phi(\epsilon, v_i) & \text{if } i = p+1, \dots, m, \end{cases} \tag{A35b}$$

$$\nabla_\epsilon \tilde{\Pi}_{\mathcal{S}^n_+}(\epsilon, X) = P\,\mathrm{diag}(\nabla_\epsilon \phi(\epsilon, \lambda))\,P^\top, \tag{A36a}$$

$$\nabla_X \tilde{\Pi}_{\mathcal{S}^n_+}(\epsilon, X)(H) = P\,(\Omega(\epsilon, \lambda) \circ (P^\top H P))\,P^\top, \tag{A36b}$$

where $X = P\,\mathrm{diag}(\lambda)\,P^\top$ is the eigen-decomposition of $X$, $\nabla_\epsilon \phi(\epsilon, \lambda) := [\nabla_\epsilon \phi(\epsilon, \lambda_i)]_{i=1}^n$, and $\Omega(\epsilon, \lambda): \mathbb{R} \times \mathbb{R}^n \to \mathcal{S}^n$ is defined as

$$[\Omega(\epsilon, \lambda)]_{ij} := \begin{cases} \dfrac{\phi(\epsilon, \lambda_i) - \phi(\epsilon, \lambda_j)}{\lambda_i - \lambda_j} & \text{if } \lambda_i \ne \lambda_j, \\ \nabla_{\lambda_i} \phi(\epsilon, \lambda_i) & \text{if } \lambda_i = \lambda_j, \end{cases} \quad i, j = 1, \dots, n. \tag{A37}$$

Equations (A36a) and (A36b) are derived based on Theorem A7. Then we have the partial derivatives of $\tilde{F}(\epsilon, u): \mathbb{R} \times \mathbb{R}^m \to \mathbb{R}^m$ with respect to $\epsilon$ and $u$:

$$\nabla_\epsilon \tilde{F}(\epsilon, u) = -\nabla_\epsilon \tilde{\Pi}_D(\epsilon, w) - \nabla_w \tilde{\Pi}_D(\epsilon, w)(\nabla_\epsilon w) = -\nabla_\epsilon \tilde{\Pi}_D(\epsilon, w) + \nabla_w \tilde{\Pi}_D(\epsilon, w)\, \gamma\Phi\big(P\,\mathrm{diag}(\nabla_\epsilon \phi(\epsilon, \lambda))\,P^\top\big), \tag{A38a}$$

$$\nabla_u \tilde{F}(\epsilon, u)(h) = h - \nabla_w \tilde{\Pi}_D(\epsilon, w)(\nabla_u w) = h - \nabla_w \tilde{\Pi}_D(\epsilon, w)\Big(h + \gamma\Phi\big(P\,(\Omega(\epsilon, \lambda) \circ (P^\top \Psi[h] P))\,P^\top\big)\Big), \tag{A38b}$$

where $w := u - \gamma\Phi\big(\tilde{\Pi}_{\mathcal{S}^n_+}(\epsilon, C(u))\big) - b$, $C(u) := -A - \Psi[u]$, $\Phi(X) := [\langle B_1, X \rangle, \cdots, \langle B_m, X \rangle]^\top$ and $\Psi(u) := \sum_{i=1}^m u_i B_i$. $C(u) = P\,\mathrm{diag}(\lambda)\,P^\top$ is the eigen-decomposition of $C(u)$.

In general, computing (A38a) and (A38b) needs $O(mn^2 + n^3)$ flops. However, based on the observations that most of the $B_i$, $i = 1, \dots, m$, contain only $O(1)$ elements and that $r = \mathrm{rank}(\Pi_{\mathcal{S}^n_+}(C(u))) \ll n$, the computation cost can be reduced dramatically. Firstly, the super-sparse $B_i$s reduce the cost of computing $\Phi$ and $\Psi$ from $O(mn^2)$ to $O(m + n)$. Secondly, note that $[\Omega(\epsilon, \lambda)]_{ij} = 0$, $\forall \lambda_i, \lambda_j < 0$. Given $r \ll n$ and $\epsilon$ small enough, the matrix $\Omega$ contains non-zero elements only in its first $r$ columns and rows. Thus the matrix multiplications in (A38a), (A38b) and (17) can be computed in $O(n^2 r)$ flops rather than the usual $O(n^3)$ flops.
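The complexity argument above translates directly into a matrix-free solver. Below is a minimal sketch of the $O(n^2 r)$ Jacobian-vector product and its use inside BiCGStab via a scipy LinearOperator. Everything here is illustrative: Phi, Psi, dPiD and gamma stand in for the paper's $\Phi$, $\Psi$, the diagonal of $\nabla_w \tilde{\Pi}_D$ and the penalty parameter, and the $\epsilon \to 0$ limit of $\Omega$ is used for brevity.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, bicgstab

def lowrank_omega_product(P, lam, M):
    """P (Omega o (P^T M P)) P^T in O(n^2 r) flops, using that Omega vanishes
    on the block where both eigenvalues are non-positive (shown here in the
    epsilon -> 0 limit of (A37); zero eigenvalues are treated as negative)."""
    pos = lam > 0
    Pp, Pm = P[:, pos], P[:, ~pos]                 # r and (n - r) eigenvectors
    lp, lm = lam[pos], lam[~pos]
    MPp = M @ Pp                                   # cheap if M is sparse
    Spp = Pp.T @ MPp                               # r x r block: Omega_{++} = 1
    Smp = Pm.T @ MPp                               # (n-r) x r block
    W = (lp[None, :] / (lp[None, :] - lm[:, None])) * Smp  # Omega_{-+} o S_{-+}
    return Pp @ Spp @ Pp.T + Pm @ W @ Pp.T + Pp @ W.T @ Pm.T

def make_jacobian(P, lam, dPiD, Phi, Psi, gamma, m):
    """LinearOperator realizing the Jacobian-vector product (A38b):
    h -> h - dPiD * (h + gamma * Phi(P (Omega o (P^T Psi[h] P)) P^T))."""
    def jvp(h):
        inner = h + gamma * Phi(lowrank_omega_product(P, lam, Psi(h)))
        return h - dPiD * inner
    return LinearOperator((m, m), matvec=jvp)

# Usage sketch for (A33c): J = make_jacobian(P, lam, dPiD, Phi, Psi, gamma, m)
# du, info = bicgstab(J, rhs)   # rhs: right-hand side of (A33c)
```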


In summary, the computation cost of the right-hand side of (A33c) and of the Jacobian-vector product (A38b) can be reduced from $O(mn^2 + n^3)$ to $O(m + n^2 r)$ in our cases.

REFERENCES

[1] D. Li, X. Sun, S. Gu, J. Gao, and C. Liu, "Polynomially solvable cases of binary quadratic programs," in Optimization and Optimal Control, 2010, pp. 199–225.
[2] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, 2000.
[3] S. X. Yu and J. Shi, "Segmentation given partial grouping constraints," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 2, pp. 173–183, 2004.
[4] F. Lauer and C. Schnorr, "Spectral clustering of linear subspaces for motion segmentation," in Proc. IEEE Int. Conf. Comp. Vis., 2009.
[5] T. Cour and J. Bo, "Solving Markov random fields with spectral relaxation," in Proc. Int. Workshop Artificial Intell. & Statistics, 2007.
[6] S. Guattery and G. Miller, "On the quality of spectral separators," SIAM J. Matrix Anal. Appl., vol. 19, pp. 701–719, 1998.
[7] K. J. Lang, "Fixing two weaknesses of the spectral method," in Proc. Adv. Neural Inf. Process. Syst., 2005, pp. 715–722.
[8] R. Kannan, S. Vempala, and A. Vetta, "On clusterings: Good, bad and spectral," J. ACM, vol. 51, pp. 497–515, 2004.
[9] M. Heiler, J. Keuchel, and C. Schnorr, "Semidefinite clustering for image segmentation with a-priori knowledge," in Proc. DAGM Symp. Pattern Recogn., 2005, pp. 309–317.
[10] J. Keuchel, C. Schnoerr, C. Schellewald, and D. Cremers, "Binary partitioning, perceptual grouping and restoration with semidefinite programming," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 11, pp. 1364–1379, 2003.
[11] C. Olsson, A. Eriksson, and F. Kahl, "Solving large scale binary quadratic problems: Spectral methods vs. semidefinite programming," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2007, pp. 1–8.
[12] C. Schellewald and C. Schnörr, "Probabilistic subgraph matching based on convex relaxation," in Proc. Int. Conf. Energy Minimization Methods in Comp. Vis. & Pattern Recogn., 2005, pp. 171–186.
[13] A. Joulin, F. Bach, and J. Ponce, "Discriminative clustering for image co-segmentation," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2010.
[14] P. Torr, "Solving Markov random fields using semi-definite programming," in Proc. Int. Workshop Artificial Intell. & Statistics, 2007.
[15] F. Alizadeh, "Interior point methods in semidefinite programming with applications to combinatorial optimization," SIAM Journal on Optimization, vol. 5, no. 1, pp. 13–51, 1995.
[16] Y. Nesterov, A. Nemirovskii, and Y. Ye, Interior-point polynomial algorithms in convex programming. SIAM, 1994, vol. 13.
[17] G. Kochenberger, J.-K. Hao, F. Glover, M. Lewis, Z. Lü, H. Wang, and Y. Wang, "The unconstrained binary quadratic programming problem: a survey," Journal of Combinatorial Optimization, vol. 28, no. 1, pp. 58–81, 2014.
[18] D. Pisinger, "The quadratic knapsack problem: a survey," Discrete Applied Mathematics, vol. 155, no. 5, pp. 623–648, 2007.
[19] N. Van Thoai, "Solution methods for general quadratic programming problem with continuous and binary variables: Overview," in Advanced Computational Methods for Knowledge Engineering. Springer, 2013, pp. 3–17.
[20] Y. Boykov, O. Veksler, and R. Zabih, "Fast approximate energy minimization via graph cuts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 11, pp. 1222–1239, 2001.
[21] V. Kolmogorov and R. Zabih, "What energy functions can be minimized via graph cuts?" IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 2, pp. 147–159, 2004.
[22] V. Kolmogorov and C. Rother, "Minimizing nonsubmodular functions with graph cuts: a review," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 7, pp. 1274–1279, 2007.
[23] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky, "MAP estimation via agreement on trees: message-passing and linear programming," IEEE Trans. Inf. Theory, vol. 51, no. 11, pp. 3697–3717, 2005.
[24] P. F. Felzenszwalb and D. P. Huttenlocher, "Efficient belief propagation for early vision," Int. J. Comp. Vis., vol. 70, no. 1, pp. 41–54, 2006.
[25] V. Kolmogorov, "Convergent tree-reweighted message passing for energy minimization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 10, pp. 1568–1583, 2006.
[26] L. Otten and R. Dechter, "Anytime AND/OR depth-first search for combinatorial optimization," AI Communications, vol. 25, no. 3, pp. 211–227, 2012.
[27] J. H. Kappes, B. Andres, F. A. Hamprecht, C. Schnorr, S. Nowozin, D. Batra, S. Kim, B. X. Kausler, J. Lellmann, N. Komodakis et al., "A comparative study of modern inference techniques for discrete energy minimization problems," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2013, pp. 1328–1335.
[28] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, and C. Rother, "A comparative study of energy minimization methods for Markov random fields with smoothness-based priors," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 6, pp. 1068–1080, 2008.
[29] M. P. Kumar, V. Kolmogorov, and P. H. Torr, "An analysis of convex relaxations for MAP estimation of discrete MRFs," J. Mach. Learn. Res., vol. 10, pp. 71–106, 2009.
[30] V. Kolmogorov and C. Rother, "Comparison of energy minimization algorithms for highly connected graphs," in Proc. Eur. Conf. Comp. Vis., 2006, pp. 1–15.
[31] A. Globerson and T. Jaakkola, "Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations," in Proc. Adv. Neural Inf. Process. Syst., 2007.
[32] D. Sontag, D. K. Choe, and Y. Li, "Efficiently searching for frustrated cycles in MAP inference," in Proc. Uncertainty in Artificial Intell., 2012.
[33] P. Ravikumar and J. Lafferty, "Quadratic programming relaxations for metric labeling and Markov random field MAP estimation," in Proc. Int. Conf. Mach. Learn., 2006, pp. 737–744.
[34] M. P. Kumar, P. H. Torr, and A. Zisserman, "Solving Markov random fields using second order cone programming relaxations," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., vol. 1, 2006, pp. 1045–1052.
[35] P. H. Torr, "Solving Markov random fields using semidefinite programming," in Proc. Int. Workshop Artificial Intell. & Statistics, 2003.


[36] F. Alizadeh, J.-P. A. Haeberly, and M. L. Overton, "Primal-dual interior-point methods for semidefinite programming: convergence rates, stability and numerical results," SIAM Journal on Optimization, vol. 8, no. 3, pp. 746–768, 1998.
[37] E. D. Andersen, C. Roos, and T. Terlaky, "On implementing a primal-dual interior-point method for conic quadratic optimization," Mathematical Programming, vol. 95, no. 2, pp. 249–277, 2003.
[38] Y. E. Nesterov and M. J. Todd, "Primal-dual interior-point methods for self-scaled cones," SIAM Journal on Optimization, vol. 8, no. 2, pp. 324–364, 1998.
[39] Y. Ye, M. J. Todd, and S. Mizuno, "An O(√nL)-iteration homogeneous and self-dual linear programming algorithm," Mathematics of Operations Research, vol. 19, no. 1, pp. 53–67, 1994.
[40] J. F. Sturm, "Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones," Optimizat. Methods & Softw., vol. 11, pp. 625–653, 1999.
[41] K. C. Toh, M. Todd, and R. H. Tütüncü, "SDPT3: a MATLAB software package for semidefinite programming," Optimizat. Methods & Softw., vol. 11, pp. 545–581, 1999.
[42] The MOSEK optimization toolbox for MATLAB manual. Version 7.0 (Revision 139), MOSEK ApS, Denmark.
[43] S. Burer and R. D. Monteiro, "A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization," Mathematical Programming, vol. 95, no. 2, pp. 329–357, 2003.
[44] M. Journée, F. Bach, P.-A. Absil, and R. Sepulchre, "Low-rank optimization on the cone of positive semidefinite matrices," SIAM Journal on Optimization, vol. 20, no. 5, pp. 2327–2351, 2010.
[45] R. Frostig, S. Wang, P. S. Liang, and C. D. Manning, "Simple MAP inference via low-rank relaxations," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 3077–3085.
[46] S. Arora, E. Hazan, and S. Kale, "The multiplicative weights update method: a meta-algorithm and applications," Theory of Computing, vol. 8, no. 1, pp. 121–164, 2012.
[47] ——, "Fast algorithms for approximate semidefinite programming using the multiplicative weights update method," in Foundations of Computer Science, 2005. FOCS 2005. 46th Annual IEEE Symposium on. IEEE, 2005, pp. 339–348.
[48] S. Arora and S. Kale, "A combinatorial, primal-dual approach to semidefinite programs," in Proceedings of the thirty-ninth annual ACM Symposium on Theory of Computing. ACM, 2007, pp. 227–236.
[49] E. Hazan, "Sparse approximate solutions to semidefinite programs," in LATIN 2008: Theoretical Informatics. Springer, 2008, pp. 306–316.
[50] D. Garber and E. Hazan, "Approximating semidefinite programs in sublinear time," in Proc. Adv. Neural Inf. Process. Syst., 2011, pp. 1080–1088.
[51] S. Laue, "A hybrid algorithm for convex semidefinite optimization," in Proc. Int. Conf. Mach. Learn., 2012, pp. 177–184.
[52] S. Boyd and L. Xiao, "Least-squares covariance matrix adjustment," SIAM J. Matrix Anal. Appl., vol. 27, no. 2, pp. 532–546, 2005.
[53] J. Malick, "A dual approach to semidefinite least-squares problems," SIAM J. Matrix Anal. Appl., vol. 26, no. 1, pp. 272–284, 2004.
[54] H. Qi and D. Sun, "A quadratically convergent Newton method for computing the nearest correlation matrix," SIAM Journal on Matrix Analysis and Applications, vol. 28, no. 2, pp. 360–385, 2006.
[55] Y. Gao and D. Sun, "Calibrating least squares covariance matrix problems with equality and inequality constraints," SIAM J. Matrix Anal. Appl., vol. 31, pp. 1432–1457, 2009.
[56] J. Malick, J. Povh, F. Rendl, and A. Wiegele, "Regularization methods for semidefinite programming," SIAM Journal on Optimization, vol. 20, no. 1, pp. 336–356, 2009.
[57] X.-Y. Zhao, D. Sun, and K.-C. Toh, "A Newton-CG augmented Lagrangian method for semidefinite programming," SIAM J. Optim., vol. 20, no. 4, pp. 1737–1765, 2010.
[58] Z. Wen, D. Goldfarb, and W. Yin, "Alternating direction augmented Lagrangian methods for semidefinite programming," Mathematical Programming Computation, vol. 2, no. 3-4, pp. 203–230, 2010.
[59] Q. Huang, Y. Chen, and L. Guibas, "Scalable semidefinite relaxation for maximum a posterior estimation," in Proc. Int. Conf. Mach. Learn., 2014.
[60] R. T. Rockafellar, "A dual approach to solving nonlinear programming problems by unconstrained optimization," Mathematical Programming, vol. 5, no. 1, pp. 354–373, 1973.
[61] C. Shen, J. Kim, and L. Wang, "A scalable dual approach to semidefinite metric learning," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2011, pp. 2601–2608.
[62] N. Krislock, J. Malick, and F. Roupin, "Improved semidefinite bounding procedure for solving max-cut problems to optimality," Math. Program. Ser. A, 2013, published online 13 Oct. 2012 at http://doi.org/k2q.
[63] P. Wang, C. Shen, and A. van den Hengel, "A fast semidefinite approach to solving binary quadratic problems," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn. IEEE, 2013, pp. 1312–1319.
[64] T. Cour, P. Srinivasan, and J. Shi, "Balanced graph matching," in Proc. Adv. Neural Inf. Process. Syst., 2006, pp. 313–320.
[65] X. Wang and I. Davidson, "Flexible constrained spectral clustering," in Proc. ACM Int. Conf. Knowledge Discovery & Data Mining. ACM, 2010, pp. 563–572.
[66] S. Maji, N. K. Vishnoi, and J. Malik, "Biased normalized cuts," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2011, pp. 2057–2064.
[67] M. X. Goemans and D. Williamson, "Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming," J. ACM, vol. 42, pp. 1115–1145, 1995.
[68] J. Malick, "The spherical constraint in boolean quadratic programs," J. Glob. Optimization, vol. 39, no. 4, pp. 609–622, 2007.
[69] H. S. Sendov, "The higher-order derivatives of spectral functions," Linear Algebra and its Applications, vol. 424, no. 1, pp. 240–281, 2007.
[70] P. T. Harker and J.-S. Pang, "Finite-dimensional variational inequality and nonlinear complementarity problems: a survey of theory, algorithms and applications," Mathematical Programming, vol. 48, no. 1-3, pp. 161–220, 1990.
[71] H. A. Van der Vorst, "Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems," SIAM J. Scientific & Statistical Computing, vol. 13, no. 2, pp. 631–644, 1992.


[72] A. I. Barvinok, "Problems of distance geometry and convex properties of quadratic maps," Discrete & Computational Geometry, vol. 13, no. 1, pp. 189–202, 1995.
[73] G. Pataki, "On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues," Mathematics of Operations Research, vol. 23, no. 2, pp. 339–358, 1998.
[74] C. G. Broyden, J. E. Dennis, and J. J. Moré, "On the local and superlinear convergence of quasi-Newton methods," IMA Journal of Applied Mathematics, vol. 12, no. 3, pp. 223–245, 1973.
[75] J. Dennis and J. J. Moré, "A characterization of superlinear convergence and its application to quasi-Newton methods," Mathematics of Computation, vol. 28, no. 126, pp. 549–560, 1974.
[76] L. Qi, "On superlinear convergence of quasi-Newton methods for nonsmooth equations," Operations Research Letters, vol. 20, no. 5, pp. 223–228, 1997.
[77] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal, "Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization," ACM Trans. Mathematical Software, vol. 23, no. 4, pp. 550–560, 1997.
[78] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammerling, A. McKenney et al., LAPACK Users' Guide. SIAM, 1999, vol. 9.
[79] L. Gorelick, F. R. Schmidt, Y. Boykov, A. Delong, and A. Ward, "Segmentation with non-linear regional constraints via line-search cuts," in Proc. Eur. Conf. Comp. Vis. Springer, 2012, pp. 583–597.
[80] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., vol. 2, 2001, pp. 416–423.
[81] A. Vedaldi and B. Fulkerson, "VLFeat: An open and portable library of computer vision algorithms," http://www.vlfeat.org/, 2008.
[82] M. Journée, F. Bach, P.-A. Absil, and R. Sepulchre, "Low-rank optimization on the cone of positive semidefinite matrices," SIAM Journal on Optimization, vol. 20, no. 5, pp. 2327–2351, 2010.
[83] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[84] N. J. Higham, "Computing a nearest symmetric positive semidefinite matrix," Linear Algebra and its Applications, vol. 103, pp. 103–118, 1988.
[85] A. S. Lewis, "Derivatives of spectral functions," Mathematics of Operations Research, vol. 21, no. 3, pp. 576–588, 1996.
[86] A. S. Lewis and H. S. Sendov, "Twice differentiable spectral functions," SIAM Journal on Matrix Analysis and Applications, vol. 23, no. 2, pp. 368–386, 2001.
