© 2015 Society for Industrial and Applied Mathematics
Downloaded 11/23/15 to 130.132.173.161. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php
SIAM J. OPTIM. Vol. 25, No. 4, pp. 2385–2407
A SOLVER FOR NONCONVEX BOUND-CONSTRAINED QUADRATIC OPTIMIZATION∗

HASSAN MOHY-UD-DIN† AND DANIEL P. ROBINSON‡

Abstract. We present a new algorithm for nonconvex bound-constrained quadratic optimization. In the strictly convex case, our method is equivalent to the state-of-the-art algorithm by Dostál and Schöberl [Comput. Optim. Appl., 30 (2005), pp. 23–43]. Unlike their method, however, we establish a convergence theory for our algorithm that holds even when the problems are nonconvex. This is achieved by carefully addressing the challenges associated with directions of negative curvature, in particular, those that may naturally arise when applying the conjugate gradient algorithm to an indefinite system of equations. Our presentation and analysis deal explicitly with both lower and upper bounds on the optimization variables, whereas the work by Dostál and Schöberl considers only strictly convex problems with lower bounds. To handle this generality, we introduce the reduced chopped gradient that is analogous to the reduced free gradient previously used. The reduced chopped gradient leads to a new condition that is used to determine when optimization over a given subspace should be terminated. This condition, although not equivalent, is motivated by a similar condition used by Dostál and Schöberl. Numerical results illustrate the superior performance of our method over commonly used solvers that employ gradient projection steps and subspace acceleration.

Key words. quadratic optimization, nonconvex, bound constraint, conjugate gradient, active set

AMS subject classifications. 49M15, 49M37, 58C15, 65H10, 65K05, 68Q25, 90C30, 90C60

DOI. 10.1137/15M1022100
1. Introduction. We consider the bound-constrained quadratic problem (BCQP) given by

(1.1)    minimize_{x ∈ Rⁿ}  q(x) := ½ xᵀHx − cᵀx   subject to   x ∈ Ω := {x ∈ Rⁿ : l ≤ x ≤ u},
where q is the quadratic objective function, H is a symmetric n × n Hessian matrix, c ∈ Rⁿ is a column vector, and the ith components of l and u satisfy li ∈ R ∪ {−∞}, ui ∈ R ∪ {∞}, and li < ui. There is no loss of generality in assuming that l < u since if li = ui for some i, any solution x∗ to (1.1) must satisfy x∗i = li. Consequently, the ith variable may be fixed and a reduced, but equivalent, optimization problem solved. We do not assume that H is positive semidefinite, i.e., we do not assume that problem (1.1) is a convex optimization problem. Consequently, finding the global solution is NP-hard [9, 21] and a local minimizer is not necessarily a global minimizer.

The design of efficient algorithms for finding local minimizers for (1.1) is an important task. Such problems are often the basic subproblem used by other methods for minimizing general nonlinear functions, possibly subject to nonlinear constraints. Popular examples are augmented Lagrangian based methods [7, 1, 28, 11, 10], which

∗Received by the editors May 19, 2015; accepted for publication (in revised form) October 5, 2015; published electronically November 19, 2015. http://www.siam.org/journals/siopt/25-4/M102210.html
†Department of Diagnostic Radiology, Yale University, New Haven, CT 06511 ([email protected]).
‡Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21218 ([email protected]). This author was supported in part by National Science Foundation grant DMS-1217153.
Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.
must solve a sequence of bound-constrained nonlinear optimization problems. Each problem in this sequence may itself be solved, for example, with a trust-region method that solves a sequence of potentially nonconvex BCQPs. Nonconvex BCQPs also arise as stand-alone problems, such as for contact and rigid body problems in computational mechanics [33], journal bearing lubrication and flow [32], elastic-plastic torsion problems [5], and modeling the circulation of ocean currents [36]. In summary, new algorithms that efficiently and reliably solve nonconvex BCQPs are important, especially in the large-scale setting, which makes the algorithm that we present an important contribution.

1.1. Literature review and our contributions. A great amount of research has focused on solving BCQPs. In the large-scale setting, the most successful methods calculate directions that aim to identify variables that are active at a solution (i.e., the variables that are equal to either their lower or their upper bound at the solution) and directions that accelerate convergence by using those active set predictions. A well-known strategy for identifying the active set is to use gradient projection iterations [35, 39]. In the strictly convex case, however, other more efficient strategies based on projected matrix splitting iterations (the projected gradient iteration is a special case) have been successfully used [29, 37, 19]. All these methods use the active set identified by the cheaply computed projection steps to formulate a reduced space subproblem. The reduced space problem is typically (approximately) optimized by performing conjugate gradient (CG) iterations [27]. Deciding when to terminate optimization in the reduced space is a crucial aspect, with various heuristics being employed. For example, a heuristic based on the maximum number of CG iterations and the inspection of consecutive active set predictions was used in [37]; a related strategy was used in [19].
To overcome the numerical inefficiencies associated with using a heuristic approach to terminate the reduced space subproblem, a collection of work has aimed at developing practical subspace termination conditions that are not heuristic [23, 24, 22, 15, 4, 14, 16, 18]. Although the precise definitions of the termination conditions differ between these manuscripts (this has important theoretical and practical consequences), the key idea in all cases is to compare a measure of optimality within the reduced space to a measure of optimality over the complementary space. It appears that the first work to introduce these conditions is Friedlander and Martínez [23]. This work, as well as the follow-up papers [24, 22], considers convex BCQPs. For this class of problems, they establish finite termination without having to make a nondegeneracy assumption on the optimization problem. Later, Dostál [15] presented a related method for strictly convex BCQPs. He established finite termination without the need for a nondegeneracy assumption, introduced the concept of a proportional iteration, and presented a new release condition on active variables. Subsequent work by Bielschowsky et al. [4] presented the first method from this class of algorithms that considered nonconvex BCQPs. The properties required to hold while minimizing over the reduced space were quite general for their method, but few details were given on how one might implement CG as the subproblem solver. In the follow-up work by Diniz-Ehrhardt et al. [14], far greater detail was given in terms of incorporating CG iterations, which is crucial since CG can fail to converge (or even be well defined) when the reduced space subproblem is nonconvex. A second contribution in [14] is a nonmonotone step acceptance strategy that was shown to be numerically beneficial. We note that neither of these two papers, which aimed at solving nonconvex BCQPs, ensured convergence to second-order optimal points. Next, Dostál [16]
presented an important algorithm for strictly convex BCQPs based on the concept of proportioning. The main contributions were a rate-of-convergence result in terms of the spectral condition number of H and the use of a fixed step length instead of a backtracking procedure. However, a finite termination result was only proved for problems that satisfied strict complementarity. This weakness was addressed by Dostál and Schöberl [18] by introducing the reduced free gradient, again directed at solving strictly convex BCQPs. The reduced free gradient was an important ingredient in the formulation of a new condition for determining when to terminate the reduced space subproblems. These contributions allowed Dostál and Schöberl to prove finite convergence without the need of a nondegeneracy assumption, as well as to maintain the previously established rate-of-convergence and use of a fixed step length.

We present a method for minimizing nonconvex BCQPs that extends the ideas introduced by Dostál and Schöberl [18] for strictly convex BCQPs. Our presentation and analysis deal explicitly with both lower and upper bounds on the optimization variables, whereas the work by Dostál and Schöberl considers only lower bounds. To handle this generality, we introduce the reduced chopped gradient that is analogous to the previously used reduced free gradient. The reduced chopped gradient leads to a new condition that is used to determine when optimization over a given subspace should be terminated. This condition, although not equivalent, is motivated by a similar condition used by Dostál and Schöberl. We also introduce a new iteration type (a saddle point iteration) that allows us to establish convergence to points that satisfy certain approximate second-order optimality conditions. Numerical results illustrate the superior performance of our method over commonly used solvers that employ gradient projection steps and subspace acceleration.

1.2. Notation and preliminaries. Our method generates a sequence {xk} of iterates. The quadratic objective function is denoted by q : Rⁿ → R and its gradient by g(x) := ∇x q(x) = Hx − c, and to simplify notation we use qk := q(xk) and gk := g(xk). The ith component of a vector v is written as [v]i, and the subvector of v with components in a given index set S is written as [v]S. The two-norm of v is denoted by ‖v‖, and [v]⁺ := max{v, 0} and [v]⁻ := min{v, 0} represent the positive and negative parts of v, respectively, with the maximum and minimum taken componentwise. A first-order KKT point x for problem (1.1) satisfies

(1.2)    [ min{x − l, 0} ;
           max{x − u, 0} ;
           min{[g(x)]⁺, x − l} ;
           max{[g(x)]⁻, x − u} ] = 0.
For our purposes, a more convenient way of defining a first-order KKT point may be formulated based on the index sets

(1.3)    Al(x) := {i : xi = li},  Au(x) := {i : xi = ui},  A(x) := Al(x) ∪ Au(x),  and  F(x) := {1, 2, . . . , n} \ A(x).
Using these sets, we define the free gradient ϕ and chopped gradient β as follows:

(1.4)    [ϕ(x)]i := { [g(x)]i if i ∈ F(x);  0 if i ∈ A(x); }
         [β(x)]i := { 0 if i ∈ F(x);  [g(x)]i⁻ if i ∈ Al(x);  [g(x)]i⁺ if i ∈ Au(x). }

Although it follows from (1.2) that x is a first-order KKT point for problem (1.1) if and only if ϕ(x) + β(x) = 0, note that ‖ϕ(x) + β(x)‖ is not an appropriate measure to use within an algorithm. For example, consider the problem of minimizing x²/2 subject to x ≥ 1, which has the unique minimizer x∗ = 1. Then, the sequence xk = 1 + 1/k satisfies limk→∞ xk = 1 = x∗, but limk→∞ ‖ϕ(xk) + β(xk)‖ = limk→∞ xk = 1 ≠ 0. As a substitute, we prefer to use the reduced free gradient
(1.5)    ϕ(x, α) := { min{(x − l)/α, ϕ(x)} if ϕ(x) ≥ 0;  max{(x − u)/α, ϕ(x)} if ϕ(x) < 0, }

and the reduced chopped gradient

(1.6)    β(x, α) := { min{(x − l)/α, β(x)} if β(x) ≥ 0;  max{(x − u)/α, β(x)} if β(x) < 0, }
whose definitions should be interpreted componentwise and are well defined for all α > 0. In particular, it is easy to show that

(1.7)    PΩ(x − αg(x)) = x − α(ϕ(x, α) + β(x, α)),

where we have used the operator that projects vectors from Rⁿ onto the set Ω, defined by

(1.8)    PΩ(x) := max{l, min{x, u}}.

We may now define the vector

(1.9)    ν(x) := x − PΩ(x − g(x)) = ϕ(x, 1) + β(x, 1)
that will be used to signal convergence to a first-order KKT point. (It has already been established that the norm of ν serves as an appropriate measure of first-order optimality [8].) Note that for the sequence {xk} used in the above example, we now have limk→∞ ‖ν(xk)‖ = limk→∞ [xk − 1] = 0, as should be the case for a function that measures proximity to a first-order KKT point. In this paper, we are interested in convergence to points that satisfy the following second-order conditions.

Definition 1.1 (an approximate second-order KKT point). The vector x is an approximate second-order KKT point for the tolerance τstop > 0 if and only if

(1.10)    ‖ν(x)‖ ≤ τstop

and

(1.11)    λmin(HFF) ≥ −τstop   or   F = ∅

with F = F(x) and HFF the submatrix of H with row/column indices from F.
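To make the preceding definitions concrete, the following minimal numpy sketch (ours, not the authors' code; the function names are our own) evaluates the free and chopped gradients (1.4), the projection (1.8), and the criticality measure (1.9). The small loop reproduces the discussion above: along xk = 1 + 1/k we have ‖ν(xk)‖ → 0 while ‖ϕ(xk) + β(xk)‖ → 1.

```python
import numpy as np

def free_chopped(x, g, l, u):
    """Free gradient phi and chopped gradient beta of (1.4), with the
    index sets of (1.3) taken from exact equality tests (adequate for
    this toy illustration)."""
    at_lower, at_upper = (x == l), (x == u)
    free = ~(at_lower | at_upper)                    # F(x)
    phi = np.where(free, g, 0.0)
    beta = np.zeros_like(g)
    beta[at_lower] = np.minimum(g[at_lower], 0.0)    # [g]^- on A_l(x)
    beta[at_upper] = np.maximum(g[at_upper], 0.0)    # [g]^+ on A_u(x)
    return phi, beta

def project(x, l, u):
    """P_Omega of (1.8): componentwise clipping onto [l, u]."""
    return np.maximum(l, np.minimum(x, u))

def nu(x, g, l, u):
    """First-order optimality measure nu(x) of (1.9)."""
    return x - project(x - g, l, u)

# minimize x^2/2 subject to x >= 1, so g(x) = x and x* = 1:
l, u = np.array([1.0]), np.array([np.inf])
for k in (1, 10, 100):
    xk = np.array([1.0 + 1.0 / k])
    phi, beta = free_chopped(xk, xk, l, u)   # ||phi + beta|| = xk -> 1
    res = nu(xk, xk, l, u)                   # ||nu|| = xk - 1 -> 0
```

At the minimizer itself both measures vanish; the point of the example is that only ν decays along a convergent infeasible-limit sequence.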
Condition (1.10) says that x is an approximate first-order KKT point, while condition (1.11) places a restriction on the curvature of the submatrix of H corresponding to the free variables. These conditions are natural relaxations of the second-order sufficient optimality conditions in the case that strict complementarity holds. From a computational perspective, this is as much as one should expect since verifying the second-order sufficient conditions for problem (1.1) when strict complementarity does not hold is NP-hard.

1.3. Algorithm overview. Our solver is based on the CG method [27]. In particular, CG is used to reduce the objective function within subspaces defined by variables predicted to be inactive (free) at the solution. Throughout our presentation, iterations that accept the CG step will be called CG iterations and are described in detail in section 2. We mention that unlike traditional gradient projection methods [25, 31, 13, 3, 38, 35, 37], we use an adaptive condition to determine when the CG iterations within the subspace should be terminated, in a manner consistent with that introduced in [18] for strictly convex problems.

Like all active-set methods, our method requires a mechanism for adjusting the active set. This is important since the active set defines—through its complementary set of free variables—the reduced space explored by CG. Three mechanisms allow for the size of the active set to increase. For the first one, we check whether the computed CG step would produce an infeasible point. If this is deemed to be the case, then the first variable that would become violated is added to the active set. This is then immediately followed by projecting the free gradient onto the feasible region, which allows for the possibility that many additional variables may be added to the active set. When the new point is computed in this way, we say it is an expansion iteration since the active set expanded.
A second way that the active set may be increased is when a direction of negative curvature is produced by CG. In this case, the next iterate is obtained by minimizing the objective function along the negative curvature direction while enforcing feasibility. The resulting step has the effect of adding extra variables to the active set. As in the previous case, this is then followed by projecting the free gradient onto the feasible region. When the updated point is obtained in this manner, we say it is a negative curvature CG iteration. The third way that the active set may be increased is when an approximate first-order KKT point is obtained that is not an approximate second-order KKT point, i.e., condition (1.10) is satisfied, but (1.11) is not. In this case, our algorithm uses a direction of negative curvature to add variables to the active set in the same way as for negative curvature CG iterations; these are called saddle point iterations.

A single mechanism is used to remove elements from the active set. Roughly, we remove indices from the active set when a certain measure of optimality on the reduced space is dwarfed by a certain measure of optimality for the original problem (1.1), which indicates that variables should be freed. When this occurs, our algorithm computes the next iterate by minimizing the objective function along the reduced chopped gradient direction; we call these proportioning iterations since they attempt to make the optimality measures in the free and active spaces proportional, i.e., they attempt to satisfy (2.2).

2. The CG algorithm on a reduced space. An important aspect of our method is the use of CG to minimize the objective function over a reduced space. In particular, for a generic vector x̄, we would like to compute iterates that converge to
the solution of

(2.1)    minimize_{x ∈ Rⁿ}  q(x) = ½ xᵀHx − cᵀx   subject to   [x]A = [x̄]A

with A = A(x̄). In [6, Algorithm 5.4.2], it is shown how such an algorithm can be obtained by first converting problem (2.1) into an unconstrained quadratic problem defined over a reduced set of variables. Next, they apply the standard CG algorithm to the reduced problem. Finally, they show that the calculations may be arranged so that they are performed in the full space. The resulting method, written in terms of our notation, is given by Algorithm 1.

Algorithm 1. The CG algorithm for the reduced space problem (2.1).
1: Input: x̄, A = A(x̄), and F = F(x̄).
2: Set x0 ← x̄, g0 ← Hx0 − c, and s0 ← ϕ(x0).
3: loop (until convergence)
4:   Set αj ← gjᵀϕ(xj) / (sjᵀHsj).
5:   Set xj+1 ← xj − αj sj.
6:   Set gj+1 ← gj − αj Hsj.
7:   Set γj ← gj+1ᵀϕ(xj+1) / (gjᵀϕ(xj)).
8:   Set sj+1 ← ϕ(xj+1) + γj sj.
9:   Set j ← j + 1.

As mentioned in section 1.3, steps generated by Algorithm 1 form the basis of our method. Since it is well known that the sequence of CG iterates defined by Algorithm 1 minimizes q over a corresponding sequence of expanding (Krylov) subspaces within the space of free variables, we choose to compute them only when

(2.2)
β(xk, ᾱ)ᵀβ(xk) ≤ Γ ϕ(xk, ᾱ)ᵀϕ(xk)

is satisfied for some Γ > 0 and ᾱ ∈ (0, 2‖H‖⁻¹]. Satisfaction of (2.2) indicates that the possible reduction of q in the space of free variables is substantial relative to the potential reduction obtained by allowing active variables to become free. We note that a condition similar to (2.2) was first used by [18], although their variant did not involve the reduced chopped gradient.

3. The algorithm. Our method for solving problem (1.1) is stated as Algorithm 2. The sequence {xk} of solution estimates is generated within the main loop that begins in step 3. During each iteration, one of four computational blocks occurs to compute xk+1.

Scenario 1 (negative curvature CG iteration). ‖ν(xk)‖ > τstop, (2.2) holds, and skᵀHsk ≤ 0. These conditions can be interpreted to mean, respectively, that xk is not an approximate first-order KKT point, that optimization over the current subspace should be continued, and that the CG search direction is not a direction of positive curvature for H. In this case, we compute the largest step length αfeas along −sk that remains feasible (see step 23), and if this value is infinity, we declare that the problem is unbounded below in step 22. Otherwise, the step xk − αfeas sk adds at least one variable to the active set. To accelerate active set identification, we then compute xk+1 as a projected gradient step in the space of free variables in step 24. Since the active set has changed, we then reinitialize the CG process by setting the next search direction to be ϕ(xk+1) in step 25.
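The reduced gradients (1.5)–(1.6) and the test (2.2) can be sketched as follows. This is an illustrative helper of ours with hypothetical names, not the authors' implementation; it assumes the componentwise reading of (1.5)–(1.6) stated above.

```python
import numpy as np

def reduced(v, x, l, u, alpha):
    """Reduce a gradient piece v per (1.5)-(1.6): positive components
    are capped by (x - l)/alpha and negative ones by (x - u)/alpha, so
    a step of length alpha along -v cannot leave [l, u]."""
    return np.where(v >= 0.0,
                    np.minimum((x - l) / alpha, v),
                    np.maximum((x - u) / alpha, v))

def face_test(x, phi, beta, l, u, alpha_bar, Gamma):
    """Condition (2.2): keep optimizing over the current face while the
    reduced chopped-gradient measure is at most Gamma times the
    free-gradient measure."""
    return reduced(beta, x, l, u, alpha_bar) @ beta <= \
           Gamma * (reduced(phi, x, l, u, alpha_bar) @ phi)
```

Infinite bounds are handled automatically: with ui = ∞ the cap (x − u)/α is −∞, so negative components pass through unchanged, and symmetrically for li = −∞.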
Scenario 2. ‖ν(xk)‖ > τstop, (2.2) holds, and skᵀHsk > 0. The conditions of this scenario mean, respectively, that xk is not an approximate first-order KKT point, further optimization on the current face should be performed, and the CG search direction is a direction of positive curvature for H. In this case, we continue the CG iteration by defining the step length in step 27 as motivated by Algorithm 1. Since the resulting step may produce an infeasible iterate, we have an explicit check in step 28 that leads to two possible sub-scenarios.

Scenario 2a (CG iteration). If the CG step is feasible, then the standard CG updates (see Algorithm 1) are made in steps 29–30. Note that the next search direction sk+1 is precisely the next CG direction.

Scenario 2b (expansion iteration). If the full CG step is not feasible, the maximum step that remains feasible is taken and then followed by a projected gradient step in the space of free variables (see steps 32–34). In this case, the next search direction sk+1 is set to reinitialize CG since the active set has changed.

Scenario 3 (proportioning iteration). ‖ν(xk)‖ > τstop and (2.2) does not hold. This means that xk is not an approximate first-order KKT point and that optimization on the current face would likely make little progress. In particular, since (2.2) does not hold, there exists at least one component of the reduced chopped gradient β(xk, ᾱ) that is nonzero, and we compute the step length αmin that minimizes the objective function along the direction −β(xk, ᾱ) (see steps 37–40). Since the resulting step length αmin may produce an infeasible iterate, we also compute the maximum feasible step length αfeas and define the final step length αk to be the minimum of the two (see step 41). If αk = ∞, then the problem is unbounded and we return this fact in step 43. Otherwise, the step length αk is used to define the next iterate xk+1, and the next search direction is set to reinitialize CG since the active set has changed (see step 44).

Scenario 4 (saddle point iteration). ‖ν(xk)‖ ≤ τstop. In this scenario xk is an approximate first-order KKT point. Therefore, it makes sense to consider if xk satisfies the approximate second-order KKT conditions given by Definition 1.1. In our algorithm, this amounts to either (see step 5) verifying that (1.11) is satisfied, in which case we terminate in step 7 with xk as an approximate second-order KKT point, or calculating a nonzero direction sk of negative curvature for the reduced Hessian matrix HFF in the sense that it satisfies

(3.1)
[sk]A = 0   and   [sk]Fᵀ HFF [sk]F ≤ −ητstop ‖[sk]F‖²
for some η ∈ (0, 1) with F = F(xk). In the large-scale setting, this can be achieved via matrix-free Lanczos iterations [30]. We then proceed exactly as in Scenario 1. The following remark is critical to the analysis (cf. Case 1 in Lemma 3.10) and aids in our general understanding of Algorithm 2.

Remark 3.1. The CG steps do not change the active set (i.e., A(xk+1) = A(xk) for k ∈ SCG), while the expansion steps, negative curvature CG steps, and saddle point steps all strictly increase the size of the active set (i.e., A(xk) ⊂ A(xk+1) for k ∈ SEXP ∪ SNEG ∪ SSAD).

3.1. Well-posedness. Several results must be established for Algorithm 2 to be well-posed. Our first result states that every iterate is feasible, which will be used in the remainder of the paper without explicit reference.

Lemma 3.2. Every iterate xk is feasible, i.e., xk ∈ Ω for all k ≥ 0.

Proof. This follows from the construction of Algorithm 2.
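Before stating the full method, the reduced-space CG process of Algorithm 1, which underlies the CG iterations of Algorithm 2, can be rendered in a few lines of numpy. This is an illustrative sketch of ours, not the authors' code, and it assumes H restricted to the free variables is positive definite; handling the indefinite case is precisely the job of Algorithm 2.

```python
import numpy as np

def reduced_cg(H, c, x_bar, l, u, tol=1e-10, max_iter=100):
    """Sketch of Algorithm 1: CG applied to (2.1) on the face
    [x]_A = [x_bar]_A, carried out in the full space by repeatedly
    projecting gradients onto the free variables via phi."""
    free = (x_bar > l) & (x_bar < u)            # F(x_bar)
    phi = lambda g: np.where(free, g, 0.0)      # free gradient (1.4)
    x, j = x_bar.astype(float).copy(), 0
    g = H @ x - c
    s = phi(g)
    while j < max_iter and np.linalg.norm(phi(g)) > tol:
        Hs = H @ s
        alpha = (g @ phi(g)) / (s @ Hs)                 # step 4
        x = x - alpha * s                               # step 5
        g_new = g - alpha * Hs                          # step 6
        gamma = (g_new @ phi(g_new)) / (g @ phi(g))     # step 7
        s = phi(g_new) + gamma * s                      # step 8
        g, j = g_new, j + 1                             # step 9
    return x
```

Because the active components of every search direction are zero, the iterates never move off the face [x]A = [x̄]A, so the computation is equivalent to standard CG on the reduced problem.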
Algorithm 2. Method for solving problem (1.1).
1: Input: x0 ∈ [l, u], {Γ, τstop} > 0, ᾱ ∈ (0, 2‖H‖⁻¹], and η ∈ (0, 1).
2: Set g0 ← Hx0 − c and s0 ← ϕ(x0).
3: for k = 0, 1, 2, . . . do
4:   if ‖ν(xk)‖ ≤ τstop then
5:     Either compute a direction sk satisfying (3.1) or decide that (1.11) holds.
6:     if (1.11) holds then
7:       return approximate second-order KKT point xk.
8:     else    (saddle point iteration)
9:       if skᵀϕ(xk) < 0 then
10:        Set sk ← −sk.
11:      Set αfeas ← max{α : xk − αsk ∈ Ω}.
12:      if αfeas = ∞ then
13:        return problem (1.1) is unbounded along the ray {xk − αsk : α ≥ 0}.
14:      Set xk+1/2 ← xk − αfeas sk and gk+1/2 ← gk − αfeas Hsk.
15:      Set xk+1 ← PΩ(xk+1/2 − ᾱϕ(xk+1/2)).
16:      Set gk+1 ← Hxk+1 − c and sk+1 ← ϕ(xk+1).
17:  else
18:    if (2.2) holds then
19:      if skᵀHsk ≤ 0 then    (negative curvature CG iteration)
20:        Set αfeas ← max{α : xk − αsk ∈ Ω}.
21:        if αfeas = ∞ then
22:          return problem (1.1) is unbounded on the set {xk − αsk : α ≥ 0}.
23:        Set xk+1/2 ← xk − αfeas sk and gk+1/2 ← gk − αfeas Hsk.
24:        Set xk+1 ← PΩ(xk+1/2 − ᾱϕ(xk+1/2)).
25:        Set gk+1 ← Hxk+1 − c and sk+1 ← ϕ(xk+1).
26:      else
27:        Set αcg ← gkᵀϕ(xk)/(skᵀHsk) and αfeas ← max{α : xk − αsk ∈ Ω}.
28:        if αcg ≤ αfeas then    (CG iteration)
29:          Set xk+1 ← xk − αcg sk and gk+1 ← gk − αcg Hsk.
30:          Set γk ← gk+1ᵀϕ(xk+1)/(gkᵀϕ(xk)) and sk+1 ← ϕ(xk+1) + γk sk.
31:        else    (expansion iteration)
32:          Set xk+1/2 ← xk − αfeas sk and gk+1/2 ← gk − αfeas Hsk.
33:          Set xk+1 ← PΩ(xk+1/2 − ᾱϕ(xk+1/2)).
34:          Set gk+1 ← Hxk+1 − c and sk+1 ← ϕ(xk+1).
35:    else    (proportioning iteration)
36:      Set sk ← β(xk, ᾱ).
37:      if skᵀHsk > 0 then
38:        Set αmin ← (skᵀgk)/(skᵀHsk).
39:      else
40:        Set αmin ← ∞.
41:      Set αfeas ← max{α : xk − αsk ∈ Ω} and αk ← min{αmin, αfeas}.
42:      if αk = ∞ then
43:        return problem (1.1) is unbounded along the ray {xk − αsk : α ≥ 0}.
44:      Set xk+1 ← xk − αk sk, gk+1 ← gk − αk Hsk, and sk+1 ← ϕ(xk+1).
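The eigenpair computation in step 5 can be illustrated as follows. We use a dense `numpy.linalg.eigh` factorization as a stand-in for the matrix-free Lanczos process [30] that the paper recommends in the large-scale setting; this is a sketch of ours, not the authors' code, and the function name is hypothetical.

```python
import numpy as np

def negative_curvature_direction(H, free, eta, tau_stop):
    """Step 5 of Algorithm 2: return a direction s_k satisfying (3.1),
    or None when (1.11) can be certified (lambda_min(H_FF) is not too
    negative, or F is empty)."""
    F = np.flatnonzero(free)
    if F.size == 0:
        return None                           # F empty: (1.11) holds
    lam, V = np.linalg.eigh(H[np.ix_(F, F)])  # eigenvalues in ascending order
    if lam[0] >= -eta * tau_stop:
        return None                           # leftmost eigenvalue certifies (1.11)
    s = np.zeros(H.shape[0])
    s[F] = V[:, 0]   # [s]_A = 0; Rayleigh quotient equals lam[0] < -eta*tau_stop
    return s
```

With an exact eigensolve, returning None when λmin(HFF) ≥ −ητstop is safe because η < 1 implies λmin(HFF) ≥ −τstop, i.e., (1.11); the inexact case is exactly what Assumption 3.4 and Lemma 3.5 below address.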
The next lemma contains estimates that will be used in the analysis.

Lemma 3.3. The following estimates hold for any α > 0 and 1 ≤ i ≤ n:
(i) If [β(xk, α)]i > 0, then [β(xk)]i ≥ [β(xk, α)]i > 0.
(ii) If [β(xk, α)]i < 0, then [β(xk)]i ≤ [β(xk, α)]i < 0.
(iii) ‖β(xk, α)‖² ≤ β(xk, α)ᵀβ(xk).
(iv) β(xk, α)ᵀgk = β(xk, α)ᵀβ(xk).
(v) If [ϕ(xk, α)]i > 0, then [ϕ(xk)]i ≥ [ϕ(xk, α)]i > 0.
(vi) If [ϕ(xk, α)]i < 0, then [ϕ(xk)]i ≤ [ϕ(xk, α)]i < 0.
(vii) ‖ϕ(xk, α)‖² ≤ ϕ(xk, α)ᵀϕ(xk) ≤ ‖ϕ(xk)‖².

Proof. Parts (i) and (ii) follow from (1.6), and (iii) follows from parts (i) and (ii). To prove part (iv), we first notice that it is sufficient to prove that [gk]i = [β(xk)]i for all i satisfying [β(xk, α)]i ≠ 0. If i is such that [β(xk, α)]i ≠ 0, it follows from parts (i) and (ii) that [β(xk)]i ≠ 0, which combined with (1.4) shows that [gk]i = [β(xk)]i, as desired. Parts (v)–(vii) follow using similar arguments as those for parts (i)–(iii).

We now show that the computation required at an approximate first-order KKT point is possible. We do not give a specific method for estimating the leftmost eigenpair of a matrix but rather assume that there exists an iterative method that can compute estimates of a desired accuracy. We formally state this assumption next and note that it is assumed to hold throughout the remainder of the paper although no explicit reference is given.

Assumption 3.4. For any ε > 0, we can find an eigenpair estimate (s, λ) satisfying

(3.2)
|λ − λmin(HFF)| ≤ ε   and   ([s]Fᵀ HFF [s]F) / ‖[s]F‖² = λ,
where F = F(xk) and λmin(HFF) denotes the leftmost eigenvalue of HFF.

Lemma 3.5. If Assumption 3.4 holds, then step 5 of Algorithm 2 is well defined.

Proof. Let us define ε = τstop(1 − η) > 0, where η and τstop are defined in Algorithm 2. It then follows from Assumption 3.4 that we may compute a pair (s, λ) that satisfies (3.2). If λ ≥ −ητstop, then (3.2) and the definition of ε give

λmin(HFF) ≥ λ − ε ≥ −ητstop − τstop(1 − η) = −τstop,

which verifies that (1.11) holds, meaning that the conditions in step 5 are met. Otherwise, if we define the vector sk as [sk]F := [s]F and [sk]A := 0, then we can see from λ < −ητstop and (3.2) that sk satisfies condition (3.1), meaning that the requirements in step 5 are met.

To make the referencing of the iteration types in Algorithm 2 easier, we define

SSAD := {k : xk+1 is computed by a saddle point iteration in step 15};
SCG := {k : xk+1 is computed by a CG iteration in step 29};
SNEG := {k : xk+1 is computed by a negative curvature CG iteration in step 24};
SEXP := {k : xk+1 is computed by an expansion iteration in step 33};
SPRO := {k : xk+1 is computed by a proportioning iteration in step 44}.

We now state a descent condition that is satisfied for certain subsets of these iteration types. This result is used in Lemma 3.7 and also plays a crucial role in proving
monotonicity of the cost function along the iterates generated by Algorithm 2 (cf. Lemma 3.12).

Lemma 3.6. The following descent condition holds:

skᵀgk ≥ { ‖ϕ(xk, ᾱ)‖² > 0 if k ∈ SCG ∪ SNEG ∪ SEXP;  ‖β(xk, ᾱ)‖² > 0 if k ∈ SPRO. }

Proof. Suppose that k ∈ SCG ∪ SNEG ∪ SEXP. It follows from the structure of Algorithm 2 that either sk = ϕ(xk) or sk = ϕ(xk) + γk−1 sk−1. In the former case, we have from the definition of ϕ(xk) that skᵀgk = ϕ(xk)ᵀgk = ‖ϕ(xk)‖². In the latter case, we have skᵀgk = (ϕ(xk) + γk−1 sk−1)ᵀgk = ‖ϕ(xk)‖² + γk−1 sk−1ᵀgk = ‖ϕ(xk)‖², where the last equality uses the well-known property that the gradient at the current point is orthogonal to the previous CG directions. Combining these results with Lemma 3.3(vii) shows that skᵀgk = ‖ϕ(xk)‖² ≥ ‖ϕ(xk, ᾱ)‖². This part of the proof is complete once we establish ϕ(xk, ᾱ) ≠ 0. For a proof by contradiction, suppose that ϕ(xk, ᾱ) = 0. Since k ∈ SCG ∪ SNEG ∪ SEXP, it follows from step 18 that the inequality (2.2) must hold, which implies that β(xk, ᾱ)ᵀβ(xk) = 0, and then Lemma 3.3(iii) shows that β(xk, ᾱ) = 0. Therefore, we have that ν(xk) = 0, which means that k ∈ SSAD, a contradiction. Thus, we must conclude that ϕ(xk, ᾱ) ≠ 0.

If k ∈ SPRO, then we have sk = β(xk, ᾱ). It follows from this fact, Lemma 3.3(iv), and Lemma 3.3(iii) that skᵀgk = β(xk, ᾱ)ᵀgk = β(xk, ᾱ)ᵀβ(xk) ≥ ‖β(xk, ᾱ)‖². Also, since k ∈ SPRO, condition (2.2) must not have held when tested in step 18, which means that β(xk, ᾱ)ᵀβ(xk) > Γϕ(xk, ᾱ)ᵀϕ(xk) ≥ 0, where the last inequality follows from Lemma 3.3(vii). Since this implies that β(xk, ᾱ) ≠ 0, the proof for this case is complete.

We now show that the claims of unboundedness in Algorithm 2 are justified.

Lemma 3.7. If Algorithm 2 terminates in step 13, 22, or 43, then the objective function is unbounded along the feasible ray {xk − αsk : α ≥ 0}.

Proof.
If step 13 is reached, then a direction of negative curvature sk was computed in step 5 to satisfy (3.1) (if (1.11) was satisfied in step 5, then termination would have occurred in step 7). We now observe that the sign of sk may be switched in step 10 to ensure that −sk is also a direction of nonascent. With these properties holding for sk, it follows that if αfeas = ∞ in step 12, meaning that the feasible region does not restrict movement along the direction −sk, then problem (1.1) is unbounded along the feasible ray as claimed. Next, suppose that step 22 is reached so that k ∈ SNEG. By construction, the direction sk satisfies skᵀHsk ≤ 0 since the test in step 19 must have tested true, and from Lemma 3.6 we have that skᵀgk > 0. Therefore, if step 21 tests true, problem (1.1) is unbounded along the feasible ray as claimed in step 22. Finally, suppose that step 43 is reached (i.e., that k ∈ SPRO), which means that αk = ∞ in step 42, which in tandem with step 41 means that αmin = αfeas = ∞. Combining this (specifically αmin = ∞) with the if statement at step 37 gives that skᵀHsk ≤ 0. Moreover, we have from Lemma 3.6 and k ∈ SPRO that skᵀgk > 0. Since we already knew that skᵀHsk ≤ 0, it follows that if step 42 tests true, then problem (1.1) is unbounded along the feasible ray as claimed in step 43.

3.2. Convergence. Our convergence analysis shows that Algorithm 2 terminates in a finite number of iterations for any positive stopping tolerance τstop or generates a sequence of iterates along which the objective function converges to negative infinity. Therefore, most of our results establish conditions that hold when our algorithm does not terminate finitely and ultimately are used to reach a contradiction.
Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.
A SOLVER FOR NONCONVEX BCQP
2395
Downloaded 11/23/15 to 130.132.173.161. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php
It is convenient to define (3.3)
  Qk(α) := q(xk − αsk)  and  ΔQk(α) := Qk(0) − Qk(α) = α skᵀgk − ½ α² skᵀHsk

to be the value of the objective function at xk − αsk and its associated change. We may now prove a minimum decrease in the objective function when a proportioning step is computed. This decrease is the key result used to establish convergence of our method.

Lemma 3.8. If k ∈ SPRO and Algorithm 2 does not terminate in step 43, then

  ΔQk(αk) ≥ ¼ ᾱ β(xk, ᾱ)ᵀβ(xk) > ¼ ᾱ Γ ϕ(xk, ᾱ)ᵀϕ(xk) ≥ 0.

Proof. Since k ∈ SPRO, we know from Algorithm 2 that condition (2.2) does not hold and that sk ← β := β(xk, ᾱ). In view of steps 37–44, we consider three cases.
Case 1. skᵀHsk > 0 and αk = αmin = (skᵀgk)/(skᵀHsk) ≤ αfeas. We then have that

  (3.4)  ΔQk(αk) = αmin gkᵀβ − (αmin²/2) βᵀHβ
  (3.5)    = (gkᵀβ)²/(βᵀHβ) − ½ (gkᵀβ)²/(βᵀHβ)² βᵀHβ = ½ (gkᵀβ)²/(βᵀHβ)
  (3.6)    ≥ ½ (gkᵀβ)²/(‖H‖‖β‖²) ≥ ½ (β(xk)ᵀβ)²/(‖H‖ βᵀβ(xk)) ≥ ¼ ᾱ βᵀβ(xk),

where (3.4) follows from (3.3), (3.5) from the value of αmin and algebraic simplification, and (3.6) from the Cauchy–Schwarz inequality, Lemma 3.3(iv), Lemma 3.3(iii), and ᾱ ∈ (0, 2‖H‖⁻¹].
Case 2. skᵀHsk > 0 and αk = αfeas < αmin = (skᵀgk)/(skᵀHsk). We then have that

  (3.7)  ΔQk(αk) = αfeas gkᵀβ − (αfeas²/2) βᵀHβ
  (3.8)    ≥ αfeas gkᵀβ − (αfeas/2) gkᵀβ
  (3.9)    = ½ αfeas βᵀβ(xk) ≥ ¼ ᾱ βᵀβ(xk),

where (3.7) follows from (3.3), (3.8) from the bound on αfeas assumed in this case and sk = β, and (3.9) from Lemma 3.3(iv) and the fact that αfeas ≥ ᾱ, which can be seen to hold as follows. From (1.7) we have with A = A(xk) that

  [PΩ(xk − ᾱg(xk))]_A = [xk − ᾱ(ϕ(xk, ᾱ) + β)]_A = [xk − ᾱβ]_A,

which, in combination with [β]_F = 0 for F = F(xk), shows that xk − ᾱβ is feasible. Thus, it follows that αfeas ≥ ᾱ, as claimed.
Case 3. skᵀHsk ≤ 0. Since by assumption Algorithm 2 does not return in step 43, we know that αk = αfeas < αmin = ∞. We then have

  (3.10)  ΔQk(αk) = αfeas gkᵀβ − (αfeas²/2) βᵀHβ
  (3.11)    ≥ ½ αfeas βᵀβ(xk) ≥ ¼ ᾱ βᵀβ(xk),
where (3.10) follows from (3.3), and (3.11) follows from sk = β, skᵀHsk ≤ 0, Lemma 3.3(iv), and αfeas ≥ ᾱ, as shown in Case 2.
It follows from the above three cases (specifically, from (3.6), (3.9), and (3.11)) that ΔQk(αk) ≥ ¼ ᾱ βᵀβ(xk). The desired result follows from this inequality and the fact that (2.2) does not hold, as stated in the first line of the proof.
The next lemma hints at the usefulness of the previous result.

Lemma 3.9. If there exists a subsequence K ⊆ {1, 2, . . .} such that

  (3.12)  lim_{k∈K} ϕ(xk, ᾱ)ᵀϕ(xk) = lim_{k∈K} β(xk, ᾱ)ᵀβ(xk) = 0,
then lim_{k∈K} ν(xk) = 0.
Proof. It follows from (3.12) and Lemma 3.3(iii) that lim_{k∈K} β(xk, ᾱ) = 0, which implies with (1.6) that lim_{k∈K} β(xk, 1) = 0. Similarly, it follows from (3.12) and Lemma 3.3(vii) that lim_{k∈K} ϕ(xk, ᾱ) = 0, which implies in view of (1.5) that lim_{k∈K} ϕ(xk, 1) = 0. Based on the definition of ν in (1.9), we have proved that lim_{k∈K} ν(xk) = 0, as claimed.
We now show that there must be an infinite number of proportioning steps if our algorithm does not terminate finitely.

Lemma 3.10. If Algorithm 2 does not terminate finitely, then |SPRO| = ∞.

Proof. To reach a contradiction, suppose that there exists k̄ so that k ∉ SPRO for all k > k̄. We consider two cases and reach a contradiction in both, which will complete our proof.
Case 1. |SEXP ∪ SNEG ∪ SSAD| = ∞. In this case, it follows from Remark 3.1 that there exists some k > k̄ such that A(xk) = {1, 2, . . . , n}, so that in particular we have 0 = ϕ(xk) = ϕ(xk, 1). If ν(xk) ≤ τstop, then the if statement in step 4 will test true. Since F(xk) = ∅, condition (1.11) is satisfied, so that Algorithm 2 would terminate in step 7. This contradicts the assumption of this lemma, which means that ν(xk) > τstop, so that step 18 is reached. Now, if β(xk, ᾱ)ᵀβ(xk) ≠ 0, then it follows that (2.2) does not hold, which in light of step 18 means that k ∈ SPRO. This contradicts the definition of k̄, which means that β(xk, ᾱ)ᵀβ(xk) = 0. It follows from this fact and Lemma 3.3(iii) that 0 = β(xk, ᾱ) = β(xk, 1). Thus, we have that ν(xk) = ‖ϕ(xk, 1)‖ + ‖β(xk, 1)‖ = 0, which contradicts ν(xk) > τstop. We must now conclude that this case cannot occur.
Case 2. |SEXP ∪ SNEG ∪ SSAD| < ∞. In this case, there exists k̂ > k̄ such that k ∈ SCG for all k ≥ k̂. It follows from standard properties of the CG algorithm that 0 = lim_{k→∞} ϕ(xk) = lim_{k→∞} ϕ(xk, 1) = lim_{k→∞} ϕ(xk, ᾱ). (In fact, in exact arithmetic we have that ϕ(xk) = 0 for all sufficiently large k.)
Also, since k ∈ SCG for all k ≥ k̂, we have that (2.2) holds for all k ≥ k̂ (see step 18). Combining these two facts means that lim_{k→∞} β(xk, ᾱ)ᵀβ(xk) = 0, which combined with Lemma 3.3(iii) allows us to deduce that lim_{k→∞} β(xk, ᾱ) = 0 and in turn that lim_{k→∞} β(xk, 1) = 0. Therefore, we have that lim_{k→∞} ν(xk) = lim_{k→∞} [‖ϕ(xk, 1)‖ + ‖β(xk, 1)‖] = 0, which means that step 4 will test true for all sufficiently large k. In particular, since an assumption of this lemma is that Algorithm 2 does not terminate in a finite number of iterations, we have that k ∈ SSAD for all sufficiently large k, which contradicts the definition of k̂. Thus, we must conclude that this case cannot occur.
Our next goal is to show that the objective function is monotonically decreasing. To this end, the following result is crucial since it establishes the decrease of the objective function along certain projected gradient steps in the reduced space.
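The projected step analyzed next has the form y(α) = PΩ(x − αϕ(x)); for a box Ω = [l, u], the projection PΩ is just componentwise clipping. A hedged numeric sketch with generic random data (using the full gradient in place of the paper's reduced free gradient ϕ, and a convex H for simplicity), illustrating that a step length α ≤ ‖H‖⁻¹ cannot increase q:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
A = rng.standard_normal((n, n))
H = A.T @ A + np.eye(n)              # symmetric positive definite for simplicity
c = rng.standard_normal(n)
l, u = -np.ones(n), np.ones(n)       # box constraints

def q(x):
    return 0.5 * x @ H @ x + c @ x

def proj(x):
    return np.clip(x, l, u)          # P_Omega: componentwise clipping onto [l, u]

x = proj(10.0 * rng.standard_normal(n))
g = H @ x + c                        # gradient at x
alpha = 1.0 / np.linalg.norm(H, 2)   # step length in (0, ||H||^{-1}]
y = proj(x - alpha * g)              # projected gradient step
decrease = q(x) - q(y)               # nonnegative for this choice of alpha
```

The guarantee mirrors the lemma that follows: for α ≤ ‖H‖⁻¹ the decrease is at least proportional to the squared step, so the objective never increases.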
Lemma 3.11. If x ∈ Ω, then the vector y(α) := PΩ(x − αϕ(x)) satisfies q(x) − q(y(α)) ≥ ½ α‖ϕ(x, α)‖² for all α ∈ (0, ‖H‖⁻¹]. Moreover, for ᾱ ∈ [0, 2‖H‖⁻¹], we have q(x) − q(y(ᾱ)) ≥ 0.

Proof. We observe from the definition of A = A(x), F = F(x), and ϕ(x), x ∈ Ω, and (1.7) that

  y(α) = PΩ(x − αϕ(x)) = [ [PΩ(x − αϕ(x))]_A ; [PΩ(x − αϕ(x))]_F ]
       = [ [PΩ(x)]_A ; [PΩ(x − αg(x))]_F ]
       = [ [x]_A ; [x − α(ϕ(x, α) + β(x, α))]_F ] = [ [x]_A ; [x − αϕ(x, α)]_F ],

from which it trivially follows that

  y(α) − x = [ 0 ; −α[ϕ(x, α)]_F ].

Using this fact, the definition of ϕ(x), Lemma 3.3(vii), the Cauchy–Schwarz inequality, and α ∈ (0, ‖H‖⁻¹], it follows that

         q(x) − q(y(α)) = −g(x)ᵀ(y(α) − x) − ½ (y(α) − x)ᵀH(y(α) − x)
  (3.13)   = αϕ(x)ᵀϕ(x, α) − ½ α² ϕ(x, α)ᵀH_FF ϕ(x, α)
           ≥ α‖ϕ(x, α)‖² − ½ α² ‖H_FF‖ ‖ϕ(x, α)‖² ≥ α‖ϕ(x, α)‖² (1 − ½ α‖H‖) ≥ ½ α‖ϕ(x, α)‖²,

which proves the first part of the lemma. Moreover, if ᾱ ∈ [0, 2‖H‖⁻¹], then it follows from (3.13) that q(x) − q(y(ᾱ)) ≥ 0, as claimed.
We now prove monotonicity of the objective values over the sequence of iterates.

Lemma 3.12. The iterates {xk} in Algorithm 2 satisfy q(xk+1) ≤ q(xk).

Proof. If k ∈ SCG, then the desired result follows from standard results for the CG method, which acts by minimizing the objective function over an increasing sequence of Krylov subspaces within the reduced space. If k ∈ SEXP, then step 28 ensures that the CG step was computed but the step was infeasible. Since we must also have skᵀHsk > 0 (step 19 must have tested false), we know that q(xk+1/2) < q(xk) with xk+1/2 defined in step 32. We then use Lemma 3.11 and ᾱ ∈ (0, 2‖H‖⁻¹] to conclude that xk+1 computed in step 33 satisfies q(xk+1) ≤ q(xk+1/2). Thus, we have q(xk+1) ≤ q(xk+1/2) < q(xk). If k ∈ SPRO, then it follows from Lemma 3.6 that gkᵀsk > 0. Since this means that −sk is a direction of strict descent, the update in step 44 ensures that q(xk+1) < q(xk). If k ∈ SNEG, then it follows as in the second paragraph of the proof of Lemma 3.7 that −sk is a strict descent direction and satisfies skᵀHsk ≤ 0. As for the case k ∈ SEXP, we then have that q(xk+1/2) < q(xk) with xk+1/2 defined in step 23 and that q(xk+1) ≤ q(xk+1/2) with xk+1 computed in step 24, which again gives q(xk+1) ≤ q(xk+1/2) < q(xk). Finally, if k ∈ SSAD, it follows from the condition checked in step 9 that −sk is a nonascent direction. Moreover, sk is a negative curvature direction since (3.1) holds (see step 5), because otherwise we would have terminated in step 7. From these properties of sk, it is clear that q(xk+1/2) < q(xk) with xk+1/2 defined in step 14,
and from Lemma 3.11 and ᾱ ∈ (0, 2‖H‖⁻¹] that xk+1 computed in step 15 satisfies q(xk+1) ≤ q(xk+1/2). Thus, we have q(xk+1) ≤ q(xk+1/2) < q(xk).
We may now state our main convergence result.

Theorem 3.13. If Algorithm 2 does not terminate finitely, then {q(xk)} → −∞.

Proof. For a proof by contradiction, suppose that lim_{k→∞} q(xk) ≠ −∞. As a consequence of Lemma 3.12, we may then conclude that there exists qlow satisfying

  (3.14)  lim_{k→∞} q(xk) = qlow > −∞,

where we used the supposition that Algorithm 2 did not terminate finitely. This latter fact also allows us to use Lemma 3.10 and conclude that |SPRO| = ∞. We now use Lemma 3.12 and (3.3) to conclude that, for any p ≥ 0, we have

  q(x0) − q(xp+1) = Σ_{k=0}^{p} [q(xk) − q(xk+1)] ≥ Σ_{k∈SPRO, k≤p} [q(xk) − q(xk+1)]
                  = Σ_{k∈SPRO, k≤p} [q(xk) − q(xk − αk sk)] = Σ_{k∈SPRO, k≤p} ΔQk(αk) ≥ 0.

Taking the limit as p → ∞ and using |SPRO| = ∞ and (3.14) shows that

  q(x0) − qlow = lim_{p→∞} [q(x0) − q(xp+1)] ≥ Σ_{k∈SPRO} ΔQk(αk),

and since ΔQk(αk) ≥ 0 for all k (see Lemma 3.12), we have lim_{k∈SPRO} ΔQk(αk) = 0. Combining this limit with Lemma 3.8 shows that

  lim_{k∈SPRO} β(xk, ᾱ)ᵀβ(xk) = lim_{k∈SPRO} ϕ(xk, ᾱ)ᵀϕ(xk) = 0,
which in turn may be paired with Lemma 3.9 to deduce lim_{k∈SPRO} ν(xk) = 0. This limit means, in light of step 4, that k ∈ SSAD for all sufficiently large k ∈ SPRO. This is a contradiction since the sets SPRO and SSAD are disjoint. Therefore, we must conclude that lim_{k→∞} q(xk) = −∞.

4. Numerical results. We evaluate our MATLAB implementation of Algorithm 2 on both convex and nonconvex problems of varying difficulty. For comparison, we have also written our own MATLAB projected gradient method (PGM) with subspace acceleration that was developed in [34, 37]. A problem was considered to be successfully solved by Algorithm 2 if either a feasible ray of unbounded descent was obtained (see lines 13, 22, and 43) or an approximate second-order KKT point was computed (see line 7) with tolerance τstop = 10⁻⁵. For the remaining control parameters, we chose the values η = 0.5, ᾱ = ½‖H‖⁻¹, and Γ = 1. (It was shown in [18] that a value ᾱ ≈ 2‖H‖⁻¹ tends to perform best on problems for which H is positive definite.) The choice of Γ can have a large effect on practical performance through its influence on condition (2.2). It is clear that condition (2.2) is more easily satisfied for larger values of Γ. Thus, a large value of Γ makes it "harder" to compute proportioning iterations since they will only be computed once an accurate minimizer over the current set of free (inactive) variables is obtained. The other extreme is not necessarily any better. Specifically, if Γ is chosen very small, condition (2.2) would be "easier" to satisfy and may lead to the
computation of many proportioning steps. In contrast to the previous scenario, Algorithm 2 would keep stepping off of bounds (proportioning iterations) even when the iterates are far from a minimizer of the reduced space problem. The choice Γ = 1 was made based on empirical evidence. Our MATLAB implementation of Algorithm 2 used a simple procedure for computing saddle point iterations (see step 8), when required. Specifically, we used the built-in MATLAB function eig to compute the eigenvectors/eigenvalues of the reduced Hessian matrix, i.e., the submatrix of H whose rows and columns are given by the current set of free variables, and then set sk to the eigenvector corresponding to the leftmost eigenvalue. Although this procedure would not be practical in the large-scale case, it was not a problem for the preliminary tests we performed. Moreover, we explain later that saddle point iterations are rarely encountered, which makes these results a reliable representation of the performance of our method. A fully developed code should of course allow for leftmost eigenvalue estimates to be obtained via Lanczos iterations, for example. Each iteration of PGM consists of an active set prediction step followed by a subspace acceleration step. Specifically, the active set prediction step consists of performing a backtracking search along the projected gradient path [6], with an initial step length defined by a Barzilai–Borwein calculation [12, 20]. The subspace step is then computed by first setting the variables predicted to be active by the projected gradient step to their lower or upper bounds. We then attempt to (approximately) minimize the quadratic objective function q over the complementary set of free variables. When this reduced problem is strictly convex, its unique minimizer may be found by solving a (reduced) system of linear equations, which we do with the CG method [2].
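The saddle point computation just described, taking the eigenvector of the leftmost eigenvalue of the reduced Hessian, can be sketched as follows. This is a dense numpy analogue of the MATLAB eig call with an illustrative free set (a large-scale code would estimate the direction by Lanczos iterations instead):

```python
import numpy as np

H = np.diag([3.0, 1.0, -0.5])       # full Hessian; one negative eigenvalue
free = [0, 2]                       # illustrative set of free (inactive) variables
Hff = H[np.ix_(free, free)]         # reduced Hessian over the free variables

w, V = np.linalg.eigh(Hff)          # eigh returns eigenvalues in ascending order
leftmost = w[0]
s_free = V[:, 0]                    # eigenvector for the leftmost eigenvalue
curvature = s_free @ Hff @ s_free   # Rayleigh quotient equals the leftmost eigenvalue
```

When the leftmost eigenvalue is negative, the returned eigenvector is a direction of negative curvature in the reduced space, which is exactly what the saddle point iteration needs.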
Since the solution (when it exists) to this reduced problem may violate some of the bounds on the free variables, we next project it onto the feasible region. This feasible projected point is then compared to the projected gradient iterate, and the better of the two, as measured by the objective function value, is taken as the next iterate. (For additional details concerning PGM, see [34, 37].) We note that PGM used the same termination tests and tolerance τstop as Algorithm 2, except that PGM checks for first-order KKT points, whereas Algorithm 2 checks for second-order KKT points. Finally, for practical purposes, we initialized the maximum number of allowed CG iterations for solving the reduced linear system of equations to the value 100 and the stopping tolerance for the CG method to 10⁻⁴. However, once the first consecutive pair of iterates had the same active set, we reset these values to max{100, 2n} and 10⁻¹⁰, respectively. This choice was made for numerical efficiency, in particular for reducing the overall number of matrix-vector products, which is the key performance measure. Of course, other ad hoc heuristics could be used, but the overall performance is unlikely to differ greatly from our strategy over a random set of test problems. We also stress that Algorithm 2 does not require any such heuristic since it uses condition (2.2) during every iteration to determine what iteration type is appropriate.

4.1. Convex problems. We first tested Algorithm 2 on randomly generated strictly convex problems with varying numbers of variables (n) and condition numbers for H (Hcond). Specifically, we created combinations from n ∈ {10², 10³} and Hcond ∈ {10², 10⁴, 10⁶, 10⁸}. Each instance of H was generated by MATLAB's sprandsym routine, and the remaining problem data c, l, and u were chosen so that the numbers of lower-active, upper-active, and inactive primal variables at the unique solution were roughly equal. The results are given in Table 1.
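A dense numpy stand-in for the sprandsym-based generator described above can be sketched as follows. This is a hypothetical construction (a log-spaced spectrum in a random orthogonal basis yields a prescribed condition number); the paper's exact recipe for c, l, and u is not reproduced here:

```python
import numpy as np

def random_spd(n, cond, rng):
    """Random symmetric positive definite matrix with condition number `cond`."""
    eigvals = np.logspace(0.0, np.log10(cond), n)     # spectrum from 1 to cond
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal basis
    return Q @ np.diag(eigvals) @ Q.T

rng = np.random.default_rng(1)
H = random_spd(50, 1e4, rng)
achieved_cond = np.linalg.cond(H)        # should match the requested value
eig_min = np.linalg.eigvalsh(H).min()    # smallest eigenvalue, here about 1
```

Fixing the spectrum directly, rather than sampling entries, is what makes the condition number exact up to rounding.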
Table 1
Results from Algorithm 2 on 50 randomly generated strictly convex problems for different choices of problem size n and condition number Hcond.

   n     Hcond |    CG           |   Expansion    | Proportioning
               |  Mean     s.d.  |  Mean    s.d.  |  Mean    s.d.
  10²    10²   |   29.3     3.9  |  10.2    2.0   |   2.9    0.6
  10²    10⁴   |   90.4    16.0  |  14.5    3.2   |   4.0    1.0
  10²    10⁶   |   98.5    33.4  |  14.0    3.5   |   3.7    1.2
  10²    10⁸   |   67.1    18.6  |  13.8    3.0   |   3.4    1.4
  10³    10²   |   39.3     2.8  |  32.9    3.3   |   4.0    0.5
  10³    10⁴   |  294.0    26.2  |  56.7    6.7   |   5.8    0.8
  10³    10⁶   | 1060.0   163.0  |  80.0    8.4   |   6.4    1.0
  10³    10⁸   | 1100.0   165.0  |  94.3   10.1   |   6.2    1.5
Table 1 reports the mean (Mean) and standard deviation (s.d.) of various performance statistics, computed from solving 50 random problems of the specified size and condition number. In particular, we provide values for the numbers of CG iterations, expansion iterations, and proportioning iterations. For these results, all problems were successfully solved before the maximum allowed value of 100,000 matrix-vector products was reached. We note that the numbers of negative curvature CG and saddle point iterations are not included since they do not occur for strictly convex problems. These results show that Algorithm 2 is robust and requires a modest number of iterations. We can also see that every iteration type is routinely encountered, with a majority of them being CG iterations. This performance is rather ideal, since it indicates that the active set at the solution is quickly identified and then followed by CG iterations that converge to the solution. In Table 2, we compare Algorithm 2 with PGM on the same set of test problems used to create Table 1. For PGM we enforced an upper bound of 100,000 matrix-vector products, as we did for Algorithm 2. We state the percentage of the 50 test problems that were successfully solved before the maximum allowed number of matrix-vector products was reached. The means and standard deviations are computed over those problems that were successfully solved. We do not include values for the number of iterations, since the predominant cost for both methods is the number of matrix-vector products computed. These results show that Algorithm 2 is far more efficient than PGM and that this distinction becomes more pronounced as the condition number of H increases. This superior performance appears to be a consequence of the efficient strategy used by Algorithm 2 for determining when the CG iterations should be stopped and the active set changed.
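The Barzilai–Borwein initial step length used by PGM's projected-gradient search (described above) admits a one-line formula: α_BB = (sᵀs)/(sᵀy) with s = x_k − x_{k−1} and y = g_k − g_{k−1}. For a quadratic, y = Hs, so α_BB is the inverse of a Rayleigh quotient of H and lies between the inverse extreme eigenvalues. A minimal sketch on toy data:

```python
import numpy as np

H = np.diag([2.0, 10.0])
c = np.zeros(2)

def grad(x):
    return H @ x + c

x_prev = np.array([1.0, 1.0])
x_curr = np.array([0.5, 0.2])

s = x_curr - x_prev              # iterate difference
y = grad(x_curr) - grad(x_prev)  # gradient difference; equals H s for a quadratic
alpha_bb = (s @ s) / (s @ y)     # BB1 step; lies in [1/lambda_max, 1/lambda_min]
```

This spectral step is what gives the projected-gradient prediction phase its good practical scaling without any explicit eigenvalue computation.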
We also note that the standard deviation for PGM is often very large, whereas the standard deviation for Algorithm 2 is rather small. This evidence further suggests that PGM is less robust and susceptible to great variation in performance. This observation also holds for the remaining numerical experiments, but we do not repeat it below. To understand how Algorithm 2 performs as the norm of H varies, we chose problem size n = 10³ and condition number Hcond = 10⁴ and then randomly generated 50 strictly convex problems for various values of ‖H‖₂. The results from this experiment are provided in Table 3. They show that the performance of Algorithm 2 is only slightly affected by the value of ‖H‖₂. This is consistent with our previous results, which showed that most of the computational work involves CG iterations after the optimal active set is quickly identified.
Table 2
Comparison of PGM and Algorithm 2 on the same 50 randomly generated strictly convex problems used to create Table 1.

   n     Hcond |         PGM Hx Prods.         |   Algorithm 2 Hx Prods.
               | Solved     Mean        s.d.   | Solved    Mean      s.d.
  10²    10²   |  100%      104.0        12.7  |  100%     54.6       6.4
  10²    10⁴   |  100%      525.0       478.0  |  100%    125.0      21.0
  10²    10⁶   |  100%      645.0       591.0  |  100%    132.0      38.3
  10²    10⁸   |  100%      874.0      1710.0  |  100%    100.0      22.4
  10³    10²   |  100%      230.0        45.0  |  100%    111.0       5.9
  10³    10⁴   |  100%     4640.0      4420.0  |  100%    415.0      29.7
  10³    10⁶   |   84%    29500.0     24300.0  |  100%   1230.0     168.0
  10³    10⁸   |   60%    44600.0     35000.0  |  100%   1300.0     173.0
Table 3
Results from Algorithm 2 on 50 randomly generated strictly convex problems with n = 10³ and Hcond = 10⁴ for various choices of ‖H‖₂.

  ‖H‖₂ |    CG          |   Expansion   | Proportioning |   Hx Prods.
       |  Mean    s.d.  |  Mean   s.d.  |  Mean   s.d.  |   Mean    s.d.
  10²  | 294.0    24.6  |  57.5   6.7   |   5.8   0.9   |  416.0    29.4
  10⁴  | 294.0    23.0  |  56.5   5.7   |   5.6   0.7   |  415.0    29.8
  10⁶  | 290.0    25.5  |  56.6   6.3   |   5.9   0.9   |  411.0    30.1
  10⁸  | 293.0    21.9  |  57.6   7.6   |   6.0   0.7   |  417.0    28.5
Next, we solved the convex problems identified in the CUTEst [26] test set that have at least n = 50 optimization variables. The initial point for all problems is the default value supplied by CUTEst. The results recorded in Table 4 show that both algorithms performed well on this set, with the only failure occurring for PGM on problem BIGGSB1. A detailed inspection of that problem revealed that it is highly dual-degenerate. One may also observe that Algorithm 2 has a tendency to require fewer matrix-vector products than PGM, and in many cases the reduction is dramatic.

4.2. Nonconvex problems. In this section we consider nonconvex problems. Our first experiment was on randomly generated nonconvex problems, which we created in a similar fashion as described in section 4.1. Each instance of H was generated by the MATLAB sprandsym routine without requiring the matrix to be positive definite, and the remaining problem data c, l, and u were chosen so that the numbers of lower-active, upper-active, and inactive primal variables were roughly equal at one of the solutions (since H is now indefinite, the problems are nonconvex and generally have multiple local solutions). To ensure that the problems have well-defined solutions, we adjusted l and u to ensure that they satisfied l ≥ −5 and u ≤ 5. The results are given in Table 5, where we have now introduced the mean and standard deviation of the number of negative curvature CG iterations. We also note that Algorithm 2 successfully solved all the test problems. We can see from Table 5 that the number of iterations generally increases as the dimension and condition number increase, as expected. It is, however, encouraging that the increase in the number of iterations is relatively mild as the condition number of H increases. We can also see that negative curvature CG steps are frequently used, which makes them an important aspect of our algorithm.
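The negative curvature CG iterations counted in Table 5 arise when CG, applied to the indefinite reduced system, generates a search direction p with pᵀHp ≤ 0; the standard safeguard is to test this curvature before the CG step length is formed. A hedged sketch of a generic truncated CG with such a test (not the paper's Algorithm 2):

```python
import numpy as np

def cg_curvature(H, g, max_iter=100, tol=1e-10):
    """CG for H d = g (i.e., minimizing 0.5 d^T H d - g^T d); stop early and
    return the current search direction if nonpositive curvature is found."""
    n = len(g)
    d = np.zeros(n)
    r = g.copy()                      # residual g - H d with d = 0
    p = r.copy()                      # first search direction
    for _ in range(max_iter):
        curv = p @ H @ p              # curvature along the search direction
        if curv <= 0:
            return d, p, True         # negative-curvature direction detected
        alpha = (r @ r) / curv
        d = d + alpha * p
        r_new = r - alpha * (H @ p)
        if np.linalg.norm(r_new) <= tol:
            return d, None, False     # converged
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return d, None, False

H = np.diag([4.0, 1.0, -2.0])         # indefinite reduced Hessian
g = np.array([1.0, 1.0, 1.0])
d, p_neg, found = cg_curvature(H, g)
curv = p_neg @ H @ p_neg if found else None
```

On this indefinite example CG takes one ordinary step and then detects nonpositive curvature, returning a direction that can be exploited for descent rather than an (undefined) CG step.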
The reader may notice that we intentionally did not provide the number of saddle
Table 4
Comparison of PGM and Algorithm 2 on convex CUTEst problems.

  Prob         n    | PGM Iterations  Hx Prods. | Alg. 2 Iterations  Hx Prods.
  BIGGSB1    5000   |    fail     fail  |  33411    33417
  BQPGABIM     50   |       3       48  |     15       21
  BQPGASIM     50   |       2       37  |     17       23
  CHENHARK   5000   |       7    29700  |   2704     2711
  CVXBQP1   10000   |       7    29700  |   2704     2711
  JNLBRNG1  10000   |      19     1762  |    505      548
  JNLBRNG2  10000   |      12     1187  |    795      827
  JNLBRNGA  10000   |      18     1701  |    406      409
  JNLBRNGB  10000   |       9    16210  |   2644     2694
  NOBNDTOR   5476   |      25     1323  |    279      283
  OBSTCLAE  10000   |      24      917  |   2455     4727
  OBSTCLAL  10000   |      23      817  |    240      255
  OBSTCLBL  10000   |      17      627  |    856     1577
  OBSTCLBM  10000   |      11      512  |    329      566
  OBSTCLBU  10000   |      19      692  |    443      718
  PENTDI     5000   |       1        3  |      1        3
  TORSION1   5476   |      26      839  |    213      215
  TORSION2   5476   |      15      699  |    379      499
  TORSION3   5476   |      13      238  |     91       93
  TORSION4   5476   |      14      335  |    376      642
  TORSION5   5476   |       8       89  |     41       44
  TORSION6   5476   |       9      181  |    279      529
  TORSIONA   5476   |      25      756  |    209      211
  TORSIONB   5476   |      14      522  |    307      389
  TORSIONC   5476   |      13      240  |     92       94
  TORSIOND   5476   |      14      336  |    453      801
  TORSIONE   5476   |       8       89  |     37       39
  TORSIONF   5476   |       9      182  |    280      530
Table 5
Algorithm 2 on 50 randomly generated nonconvex problems for each choice of problem size n and condition number Hcond.

   n     Hcond |    CG          | Neg. Curv. CG  |   Expansion    | Proportioning
               |  Mean    s.d.  |  Mean    s.d.  |  Mean    s.d.  |  Mean    s.d.
  10²    10²   |   17.5    5.6  |  22.9    4.3   |   8.2    3.4   |   3.8    1.2
  10²    10⁴   |   63.6   23.4  |  21.3    5.8   |  15.7    4.5   |   5.0    1.7
  10²    10⁶   |   86.3   39.1  |  18.3    4.7   |  17.5    6.2   |   5.0    2.1
  10²    10⁸   |   59.8   30.4  |  16.5    5.5   |  15.8    5.9   |   4.2    1.9
  10³    10²   |   44.4   12.1  | 118.0   10.3   |  18.3    7.3   |   9.0    2.4
  10³    10⁴   |  299.0   61.1  | 148.0   11.4   |  68.2    9.1   |  10.5    2.6
  10³    10⁶   | 1020.0  242.0  | 151.0   15.2   | 103.0   14.0   |  10.7    3.0
  10³    10⁸   | 1080.0  222.0  | 153.0   13.8   | 121.0   17.4   |  11.4    3.7
point iterations since they occurred infrequently. In fact, the maximum mean value encountered over the various combinations of problem sizes n and condition numbers Hcond was approximately 0.15%. We may thus conclude that although first-order KKT points that are not second-order KKT points are rarely encountered, it does
Table 6
Comparison of PGM and Algorithm 2 on the same 50 randomly generated nonconvex problems used to create Table 5.

   n     Hcond |         PGM Hx Prods.         |   Algorithm 2 Hx Prods.
               | Solved     Mean        s.d.   | Solved    Mean      s.d.
  10²    10²   |  100%      139.0       208.0  |  100%     85.7       9.2
  10²    10⁴   |   98%     4510.0      6570.0  |  100%    145.0      23.6
  10²    10⁶   |   94%    19600.0     19000.0  |  100%    165.0      44.4
  10²    10⁸   |   92%    20500.0     24200.0  |  100%    131.0      34.8
  10³    10²   |  100%    28800.0     13800.0  |  100%    328.0      30.9
  10³    10⁴   |   98%    11500.0     11400.0  |  100%    744.0      70.8
  10³    10⁶   |   98%    69800.0     19700.0  |  100%   1540.0     262.0
  10³    10⁸   |   50%    65700.0     21000.0  |  100%   1640.0     237.0
Table 7
Results of Algorithm 2 on 50 randomly generated nonconvex problems with n = 10³ and Hcond = 10⁴ for various values of ‖H‖₂.

  ‖H‖₂ |    CG          | Neg. Curv. CG |   Expansion   | Proportioning |   Hx Prods.
       |  Mean    s.d.  |  Mean   s.d.  |  Mean   s.d.  |  Mean   s.d.  |   Mean    s.d.
  10²  | 396.0    62.1  | 186.0   15.8  |  79.7   14.3  |  12.2   3.0   |  941.0    78.9
  10⁴  | 468.0    56.9  | 195.0   11.6  |  79.4   14.6  |  11.2   3.1   | 1030.0    74.3
  10⁶  | 532.0    57.3  | 192.0   14.7  |  84.2   17.7  |  11.6   2.7   | 1100.0    70.4
  10⁸  | 642.0   122.0  | 188.0   14.7  |  86.5   17.2  |  12.4   2.8   | 1210.0   132.0
happen and Algorithm 2 handles it appropriately. Overall, these results are similar to the performance in the convex case: Algorithm 2 is robust and requires a modest number of iterations, and every iteration type is frequently used, with the majority being of the CG type. As already observed in the convex case, this performance is rather ideal since it indicates that the active set at the solution is quickly identified, which is then followed by CG iterations on the reduced space. We next compare Algorithm 2 with PGM on the same set of test problems used to create Table 5. From these results (see Table 6), we can see that Algorithm 2 is more robust than PGM. PGM clearly struggles with larger condition numbers as well as larger problem sizes. Of course, the difficulty of PGM in handling larger problems is mostly a consequence of our imposed upper limit of 100,000 matrix-vector products. In particular, if this limit were increased, the percentage of problems solved by PGM would increase, but this still does not bode well for PGM. This is especially true in view of the performance of Algorithm 2, which for the largest problems (n = 10³) and largest condition number (Hcond = 10⁸) tested required a mean of only 1640 matrix-vector products. (We remind the reader that the total number of matrix-vector products is the principal measure of performance for both PGM and Algorithm 2.) Over the entire range of problem sizes and condition numbers, it is also clear that Algorithm 2 significantly outperforms PGM. Next, to get a sense of how Algorithm 2 performs on nonconvex problems as the norm of H varies, we chose problem size n = 10³ and condition number Hcond = 10⁴ and then randomly generated 50 nonconvex problems for various values of ‖H‖₂. The results from this experiment are provided in Table 7.
They show that the performance of Algorithm 2 slowly degrades (in terms of matrix-vector products) as ‖H‖₂ increases, at least for the chosen problem size and condition number. This is consistent with our previous results. Finally, we compared PGM to Algorithm 2 on BCQPs derived from the Hock–Schittkowski set of nonlinear test problems. One BCQP was derived from each
nonlinear test problem in the following manner. First, we defined H and g as the Hessian and gradient of the nonlinear objective function evaluated at either x = 0, if the nonlinear problem had no general constraints, or at the default initial value provided by CUTEst otherwise. The choice x = 0 ensured that if the objective function happened to be quadratic and the only constraints were simple bounds on the optimization variables, then we solved the QP problem provided by CUTEst. Second, any bounds on the optimization variables provided in the problem formulation were maintained. Additional bounds were added (if needed) to ensure that −10⁴ ≤ l ≤ u ≤ 10⁴, so that each problem had at least one bounded local solution. Third, we computed the eigenvalues of the matrix H and discarded any problem whose most negative eigenvalue was greater than −10⁻⁸; thus, the Hessian matrix for each problem tested had at least one negative eigenvalue. Finally, we discarded problems HS25 and HS41 because the starting point was already a first-order solution. This resulted in the 32 nonconvex test problems listed in Table 8. We note that the naming convention in Table 8 is to use the Hock–Schittkowski problem name suffixed by "-BCQP"; e.g., HS7-BCQP is the BCQP derived from the nonlinear Hock–Schittkowski test problem HS7 as described above. The results in Table 8 show that Algorithm 2 rarely does worse than PGM, and when it does, it is only by a few matrix-vector products. For the majority of the problems, Algorithm 2 performs substantially better, with the most significant gains on problems HS7-BCQP, HS57-BCQP, HS85-BCQP, HS101-BCQP, HS102-BCQP, HS103-BCQP, HS105-BCQP, and HS114-BCQP. This provides additional evidence of the efficiency and reliability of Algorithm 2.
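The derivation recipe above can be sketched in a few lines. This is a hypothetical reconstruction (illustrative names and a toy objective, not the authors' test harness): build (H, g, l, u) from a nonlinear objective at a reference point, clip the bounds into [−10⁴, 10⁴], and keep the problem only if H has an eigenvalue at or below −10⁻⁸:

```python
import numpy as np

def derive_bcqp(hess, grad, x0, l, u, cap=1e4, neg_tol=-1e-8):
    """Sketch of the construction described above: return (H, g, l, u),
    or None when H has no sufficiently negative eigenvalue (discard)."""
    H = hess(x0)
    g = grad(x0)
    l = np.clip(l, -cap, cap)            # enforce -cap <= l
    u = np.clip(u, -cap, cap)            # enforce u <= cap
    if np.linalg.eigvalsh(H).min() > neg_tol:
        return None                      # not (sufficiently) nonconvex: discard
    return H, g, l, u

# toy nonlinear objective f(x) = x0^2 + x0*x1 - x1^2
hess = lambda x: np.array([[2.0, 1.0], [1.0, -2.0]])
grad = lambda x: np.array([2.0 * x[0] + x[1], x[0] - 2.0 * x[1]])
out = derive_bcqp(hess, grad, np.zeros(2),
                  l=np.array([-1e6, -2.0]), u=np.array([3.0, 1e6]))
```

Here the toy Hessian has eigenvalues ±√5, so the problem is kept, and the infinite-looking bounds are clipped to ±10⁴.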
Such superior performance on these constructed nonconvex BCQPs suggests that Algorithm 2 may be an effective subproblem solver for optimization algorithms designed for general nonlinear problems that solve a sequence of (potentially) nonconvex BCQPs, e.g., augmented Lagrangian methods.

5. Conclusions and final remarks. We presented a new algorithm for solving BCQPs. We proved that, from an arbitrary starting point, our method will either generate a sequence of iterates for which the objective function converges to negative infinity or terminate finitely at an approximate second-order KKT point. For strictly convex problems with only lower bounds on the optimization variables, our method is equivalent to the method of Dostál and Schöberl [18]. Therefore, in this case, our method enjoys the convergence results established for their method. The numerical results presented in section 4.1 show the superiority of our method over a commonly used projected gradient algorithm that uses subspace acceleration. Since our method is also able to find local solutions to nonconvex problems, we performed numerical experiments on such problems in section 4.2. These results, as in the convex case, show that our new method is substantially more efficient and reliable than the projected gradient algorithm with subspace acceleration. The effectiveness of our method seems to be related to the use of condition (2.2) to determine when minimization over the current active set should be terminated. A second important feature appears to be the use of projected gradient steps in the reduced space, as opposed to projected gradient steps in the full space. A similar strategy was used by Dostál and Schöberl and appears to help prevent the zig-zag effect that often occurs for methods that compute active-set estimates via a sequence of gradient projection iterations. Overall, the numerical results are promising.
Four additional improvements can be incorporated into our method. First, although the statement and analysis of our algorithm were performed for a fixed step
Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.
Table 8
Comparison of PGM and Algorithm 2 on nonconvex QPs derived from the Hock–Schittkowski test problems.

                        PGM                   Algorithm 2
Problem        Iterations  Hx Prods.    Iterations  Hx Prods.
HS7-BCQP              101        202             2          6
HS19-BCQP               2          5             2          6
HS24-BCQP               4          8             2          6
HS29-BCQP               3          6             1          4
HS33-BCQP               2          4             1          4
HS36-BCQP               2          4             3          8
HS37-BCQP               2          4             1          4
HS40-BCQP               3          6             1          4
HS44-BCQP               5         11             3          6
HS45-BCQP               2          5             2          6
HS55-BCQP               2          4             2          5
HS56-BCQP               3          6             1          4
HS57-BCQP           12188      24376             2          6
HS59-BCQP               2          5             1          4
HS63-BCQP               3          6             2          6
HS68-BCQP               5         16             7         14
HS69-BCQP               7         30             8         14
HS70-BCQP               5         18             5         10
HS71-BCQP               2          4             2          5
HS78-BCQP               3          6             2          6
HS81-BCQP               4         14             4          8
HS84-BCQP               2          4             3          8
HS85-BCQP             300        603             5         12
HS93-BCQP               3         10             5         12
HS101-BCQP             12         55            11         17
HS102-BCQP             11         51             9         14
HS103-BCQP             11         51             9         14
HS104-BCQP              2         10             5          9
HS105-BCQP            589       1183             6         13
HS108-BCQP              5         11             4          9
HS111-BCQP              2          4             5         12
HS114-BCQP           2859       5719             6         13
length ᾱ ∈ (0, 2‖H‖⁻¹], both can easily be modified to handle the case when ᾱ is chosen differently from iteration to iteration, provided it always satisfies ᾱ ∈ [α_min, α_max] for some 0 < α_min ≤ α_max ≤ 2‖H‖⁻¹. This added flexibility could be used to adaptively choose ᾱ based on more refined strategies such as Barzilai–Borwein calculations. Second, one could include a control parameter that determines whether the user wishes to have second-order optimality verified, or whether verification of first-order conditions is sufficient. Third, it is clear that preconditioned CG iterations can be used to further accelerate the minimization of each reduced-space problem. Moreover, it is possible to use other forms of preconditioning beyond the subspace minimization procedure, as described in [17, section 5.10]. Finally, we note that further computational gains are possible by altering the definition of the proportioning step. In the form presented
in this paper, the proportioning iteration involves a step that leaves all bounds that are active but have an approximate Lagrange multiplier of the wrong sign. Since this happens only when condition (2.2) is violated, we know that this direction must be nonzero. However, it does not always make sense to step off of every variable whose approximate multiplier is of the wrong sign just because (2.2) is violated, since some of the signs might be only slightly incorrect. Rather, it makes sense to step off of only those variables whose Lagrange multiplier violates its sign by an amount that is substantial compared to β(x_k, ᾱ)ᵀβ(x_k) (see (2.2)). This strategy was implemented and produced even better results than those presented in this paper; those results were not used since convergence with this strategy is not immediate, although it is likely to hold. We suspect that such an analysis is not difficult but would complicate what is currently a simple analysis.

Acknowledgments. The authors would like to thank the associate editor and the two anonymous referees, whose constructive comments and suggestions were truly helpful and improved the presentation of the paper. We also appreciate the timely submission of their reports.

REFERENCES

[1] R. Andreani, E. G. Birgin, J. M. Martínez, and M. L. Schuverdt, On augmented Lagrangian methods with general lower-level constraints, SIAM J. Optim., 18 (2007), pp. 1286–1309.
[2] R. Barrett, M. W. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. Van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, SIAM, Philadelphia, 1994.
[3] D. P. Bertsekas, Projected Newton methods for optimization problems with simple constraints, SIAM J. Control Optim., 20 (1982), pp. 221–246.
[4] R. H. Bielschowsky, A. Friedlander, F. Gomes, J. Martínez, and M.
Raydan, An adaptive algorithm for bound constrained quadratic minimization, Investigación Oper., 7 (1997), pp. 67–102.
[5] L. A. Caffarelli and A. Friedman, The free boundary for elastic-plastic torsion problems, Trans. Amer. Math. Soc., 252 (1979), pp. 65–97.
[6] A. R. Conn, N. I. Gould, and P. L. Toint, Trust Region Methods, Vol. 1, SIAM, Philadelphia, 2000.
[7] A. R. Conn, N. I. M. Gould, and P. L. Toint, A globally convergent augmented Lagrangian algorithm for optimization with general constraints and simple bounds, SIAM J. Numer. Anal., 28 (1991), pp. 545–572.
[8] A. R. Conn, N. I. M. Gould, and P. L. Toint, Trust-Region Methods, SIAM, Philadelphia, 2000.
[9] B. Contesse, Une caractérisation complète des minima locaux en programmation quadratique, Numer. Math., 34 (1980), pp. 315–332.
[10] F. E. Curtis, N. I. M. Gould, H. Jiang, and D. P. Robinson, Adaptive augmented Lagrangian methods: Algorithms and practical numerical experience, Optim. Methods Softw., (2015), pp. 1–30, DOI:10.1080/10556788.2015.1071813.
[11] F. E. Curtis, H. Jiang, and D. P. Robinson, An adaptive augmented Lagrangian method for large-scale constrained optimization, Math. Program., (2013), pp. 1–45.
[12] Y.-H. Dai and R. Fletcher, Projected Barzilai-Borwein methods for large-scale box-constrained quadratic programming, Numer. Math., 100 (2005), pp. 21–47.
[13] R. S. Dembo and U. Tulowitzki, On the Minimization of Quadratic Functions Subject to Box Constraints, Department of Computer Science, Yale University, 1984.
[14] M. Diniz-Ehrhardt, Z. Dostál, M. Gomes-Ruggiero, J. Martínez, and S. A. Santos, Nonmonotone strategy for minimization of quadratics with simple constraints, Appl. Math., 46 (2001), pp. 321–338.
[15] Z. Dostál, Box constrained quadratic programming with proportioning and projections, SIAM J. Optim., 7 (1997), pp. 871–887.
[16] Z. Dostál, A proportioning based algorithm for bound constrained quadratic programming with the rate of convergence, Numer. Algorithms, 34 (2003), pp. 293–302.
[17] Z. Dostál, Optimal Quadratic Programming Algorithms: With Applications to Variational Inequalities, Springer Optim. Appl. 23, Springer, New York, 2009.
[18] Z. Dostál and J. Schöberl, Minimizing quadratic functions subject to bound constraints with the rate of convergence and finite termination, Comput. Optim. Appl., 30 (2005), pp. 23–43.
[19] L. Feng, V. Linetsky, J. L. Morales, and J. Nocedal, On the solution of complementarity problems arising in American options pricing, Optim. Methods Softw., 26 (2011), pp. 813–825.
[20] R. Fletcher, On the Barzilai-Borwein method, in Optimization and Control with Applications, Springer, New York, 2005, pp. 235–256.
[21] A. Forsgren, P. Gill, and W. Murray, On the identification of local minimizers in inertia-controlling methods for quadratic programming, SIAM J. Matrix Anal. Appl., 12 (1991), pp. 730–746.
[22] A. Friedlander, J. Mario Martínez, and M. Raydan, A new method for large-scale box constrained convex quadratic minimization problems, Optim. Methods Softw., 5 (1995), pp. 57–74.
[23] A. Friedlander and J. M. Martínez, On the numerical solution of bound constrained optimization problems, RAIRO Oper. Res., 23 (1989), pp. 319–341.
[24] A. Friedlander and J. M. Martínez, On the maximization of a concave quadratic function with box constraints, SIAM J. Optim., 4 (1994), pp. 177–192.
[25] A. A. Goldstein, Convex programming in Hilbert space, Bull. Amer. Math. Soc., 70 (1964), pp. 709–710.
[26] N. I. M. Gould, D. Orban, and P. L. Toint, CUTEst: A Constrained and Unconstrained Testing Environment with Safe Threads, Technical report, Rutherford Appleton Laboratory, Chilton, UK, 2013.
[27] M. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems, J. Res. Nat. Bur. Standards, 49 (1952), pp. 409–436.
[28] M. Kočvara and M. Stingl, PENNON: A code for convex nonlinear and semidefinite programming, Optim. Methods Softw., 18 (2003), pp. 317–333.
[29] M. Kočvara and J. Zowe, An iterative two-step algorithm for linear complementarity problems, Numer. Math., 68 (1994), pp. 95–106.
[30] C. Lanczos, An Iteration Method for the Solution of the Eigenvalue Problem of Linear Differential and Integral Operators, U.S. Government Press Office, 1950.
[31] E. S. Levitin and B. T. Polyak, Constrained minimization methods, USSR Comput. Math. Math. Phys., 6 (1966), pp. 1–50.
[32] Y. Lin and C. Cryer, An alternating direction implicit algorithm for the solution of linear complementarity problems arising from free boundary problems, Appl. Math. Optim., 13 (1985), pp. 1–17.
[33] P. Lötstedt, Numerical simulation of time-dependent contact and friction problems in rigid body mechanics, SIAM J. Sci. Statist. Comput., 5 (1984), pp. 370–393.
[34] J. J. Moré and G. Toraldo, Algorithms for bound constrained quadratic programming problems, Numer. Math., 55 (1989), pp. 377–400.
[35] J. J. Moré and G. Toraldo, On the solution of large quadratic programming problems with bound constraints, SIAM J. Optim., 1 (1991), pp. 93–113.
[36] U. Oreborn, A Direct Method for Sparse Nonnegative Least Squares Problems, Ph.D. thesis, Linköping University, Linköping, Sweden, 1986.
[37] D. P. Robinson, L. Feng, J. M. Nocedal, and J.-S. Pang, Subspace accelerated matrix splitting algorithms for asymmetric and symmetric linear complementarity problems, SIAM J. Optim., 23 (2013), pp. 1371–1397.
[38] E. K. Yang and J. W. Tolle, A class of methods for solving large, convex quadratic programs subject to box constraints, Math. Program., 51 (1991), pp. 223–228.
[39] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal, Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization, ACM Trans. Math. Software, 23 (1997), pp. 550–560.