SECOND-ORDER ORTHANT-BASED METHODS WITH ENRICHED HESSIAN INFORMATION FOR SPARSE $\ell_1$-OPTIMIZATION
arXiv:1407.1096v1 [math.OC] 4 Jul 2014
J.C. DE LOS REYES‡, E. LOAYZA‡ AND P. MERINO‡

Abstract. We present a second-order algorithm for solving optimization problems involving the sparsity-enhancing $\ell_1$-norm. An orthantwise-direction strategy is used in the spirit of [1]. The main idea of our method consists in computing the descent directions by incorporating second-order information both of the regular term and (in a weak sense) of the $\ell_1$-norm. The weak second-order information behind the $\ell_1$-term is incorporated via a partial Huber regularization. One of the main features of our algorithm is a faster identification of the active set. We also prove that our method is equivalent to a semismooth Newton algorithm applied to the optimality condition, under a specific choice of the (SSN) defining constants. We present several computational experiments to show the efficiency of our approach compared to other state-of-the-art algorithms.
1. Introduction

Optimization problems involving sparse vectors have important applications in fields like image restoration, machine learning and data classification, among many others [9, 19]. One of the most common mechanisms for enhancing sparsity consists in the use of the $\ell_1$-norm of the vector in the cost function. Although this approach helps in getting sparse solutions, the design of solution algorithms becomes challenging due to the non-differentiability involved.

Most of the algorithms developed to solve optimization problems with the $\ell_1$-norm consider in fact only first-order information, primal and/or dual, which seems natural due to the presence of the nondifferentiable term (see e.g. [19] and the references therein). More recently, some authors have investigated the use of second-order information in cases where the objective function is composed of a regular convex function and the $\ell_1$-norm of the design variable [1, 3, 2]. By using second-order information of the regular part, faster algorithms have been obtained. In the pioneering work [1] a limited-memory BFGS method was considered in connection with so-called orthantwise directions. In [3] and [2] these types of directions were considered in connection with Newton and semismooth Newton type updates, respectively.

The contribution of this paper arises from the question of whether some useful information can also be extracted from the special structure of the $\ell_1$-norm in order to design a second-order method. Our algorithm aims to give an answer to that question. The main idea of our approach consists in incorporating second-order information "hidden" in the $\ell_1$-norm.

2010 Mathematics Subject Classification. 49M15, 65K05, 90C53, 49J20, 49K20.
Roughly speaking, although the $\ell_1$-norm is not differentiable in a classical sense, it has two derivatives in a distributional sense. By using a partial Huber regularization, we are able to extract that information for the update of the second-order matrix, while keeping the same orthantwise descent-type directions. This leads to an algorithm that enables a fast identification of the active set and, therefore, a rapid decrease of the cost.

Our contribution also encompasses the study of the relation of our method to semismooth Newton methods (SSN). We prove that, under a specific choice of the defining constants of the semismooth Newton algorithm, our updates turn out to be equivalent to the (SSN) ones. In this manner, important convergence results are inherited from the generalized Newton framework. In addition, on the basis of this equivalence, an adaptive algorithmic choice of the regularization parameter is proposed.

Let us remark that one of our main motivations for this work is the solution of sparse PDE-constrained optimization problems. In this field, the importance of sparsity lies in the fact that such structure allows localizing the action of the controls, since sparse control functions have small support [20, 11, 5]. Our main concern in this respect is that, whereas semismooth Newton methods applied to the optimality systems of elliptic problems work fine (see [20]), their application to time-dependent problems requires solving the full optimality system at once, which is computationally very costly.

The outline of our paper is as follows. In Section 2 the main idea of our approach is described and the resulting orthantwise enriched second-order algorithm is presented. The main theoretical properties of our algorithm are studied in Section 3. In Section 4 we show that our method is equivalent to a semismooth Newton method under a specific choice of the defining parameters. Finally, in Section 5 we exhaustively compare the behaviour of our algorithm with respect to other state-of-the-art methods.

1.1. Problem formulation. Let $f : \mathbb{R}^m \to \mathbb{R}$ be a twice continuously differentiable function and $\beta$ a positive real number. We are interested in the numerical solution of the unconstrained optimization problem
(P)
\[ \min_{x\in\mathbb{R}^m}\ \varphi(x) := f(x) + \beta\|x\|_1, \]
where $\|\cdot\|_1$ corresponds to the standard $\ell_1$-norm in $\mathbb{R}^m$. Throughout this paper we assume that $f$ is convex and satisfies the following condition: there exist two positive constants $c$ and $C$ such that
(1)
\[ c\|d\|_2^2 \le d^\top \nabla^2 f(x)\, d \le C\|d\|_2^2, \quad \text{for all } d\in\mathbb{R}^m,\ x\in\mathbb{R}^m. \]
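As a quick sanity check of (1), consider the strongly convex quadratic $f(x) = \frac12 x^\top Q x + q^\top x$ with $Q$ symmetric positive definite: then $\nabla^2 f \equiv Q$ and (1) holds with $c = \lambda_{\min}(Q)$ and $C = \lambda_{\max}(Q)$. The following snippet is a minimal numerical illustration of ours (the particular matrix is an arbitrary example, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
Q = M @ M.T + 5.0 * np.eye(5)        # symmetric positive definite by construction

eigvals = np.linalg.eigvalsh(Q)      # Hessian of f(x) = 0.5 x^T Q x + q^T x is Q
c, C = eigvals[0], eigvals[-1]       # constants realizing condition (1)

d = rng.standard_normal(5)
quad = d @ Q @ d
assert c * (d @ d) - 1e-9 <= quad <= C * (d @ d) + 1e-9
print(f"c = {c:.4f}, C = {C:.4f}")
```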
There are some interesting special choices for the cost function $f$. We mention, for instance:
• $f(x) = \|Ax - y\|_2^2$, where $A$ is a matrix in $\mathbb{R}^{n\times m}$. This sparse linear regression problem is known as the LASSO [14, 21].
• $f(x) = \|A^{-1}x - y\|_2^2 + \frac{\alpha}{2}\|x\|_2^2$. This particular loss function appears in linear-quadratic PDE-constrained optimization problems after a discretization of the partial differential operator. The function also involves an additional $\ell_2$ Tikhonov regularization term.
• $f(x) = -\frac{1}{N}\sum_{j=1}^{N} \log \frac{\exp(x_{y_j}^\top z_j)}{\sum_{i\in C}\exp(x_i^\top z_j)}$. This function appears in voice recognition problems, where $N$ is the number of samples used for the recognition, $C$ denotes the set of all class labels, $y_j$ is the label associated with training point $j$, $z_j$ is the feature vector and $x_i$ is the parameter sub-vector of class label $i$. The function represents the normalized sum of the negative log-likelihood of each data point being placed in the correct class [6, 15].

It is well known that for any $\beta > 0$ problem (P) has a unique solution, which in the following will be denoted by $\bar x$. Depending on the parameter $\beta$, the solution $\bar x$ tends to be more or less sparse. In fact, if $\beta \ge \|\nabla f(0)\|_\infty$, then $\bar x$ is identically zero (see for instance [20]). First-order optimality conditions for problem (P) can be obtained by convex analysis and can be stated as $0 \in \partial\varphi(\bar x)$, where $\partial\varphi(x)$ denotes the subdifferential of $\varphi$ at $x$. Moreover, the latter inclusion is equivalent to the relations:
(2a) $0 = \nabla_i f(\bar x) + \beta$, for $i \in \bar P$,
(2b) $0 = \nabla_i f(\bar x) - \beta$, for $i \in \bar N$,
(2c) $0 \in [\nabla_i f(\bar x) - \beta,\ \nabla_i f(\bar x) + \beta]$, for $i \in \bar A$,
where the index sets $\bar P$, $\bar N$ and $\bar A$ are defined by
(3) $\bar P = \{i : \bar x_i > 0\}, \quad \bar N = \{i : \bar x_i < 0\}, \quad \bar A = \{i : \bar x_i = 0\}.$
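As an illustration, conditions (2a)-(2c) can be checked componentwise once $\nabla f(\bar x)$ is available. The following minimal sketch uses our own naming and tolerance, and is not part of the paper:

```python
import numpy as np

def satisfies_optimality(x, grad_f, beta, tol=1e-10):
    """Check (2a)-(2c): 0 belongs to the subdifferential of phi at x."""
    pos = x > 0          # indices in P-bar
    neg = x < 0          # indices in N-bar
    act = x == 0         # indices in A-bar
    return (np.all(np.abs(grad_f[pos] + beta) <= tol)       # (2a)
            and np.all(np.abs(grad_f[neg] - beta) <= tol)   # (2b)
            and np.all(np.abs(grad_f[act]) <= beta + tol))  # (2c)
```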
2. The orthant-wise enriched second-order method (OESOM)

In this section we present the main steps of the second-order algorithm we propose for solving problem (P). Since our method is based on the use of orthant directions, let us start by introducing them in a formal manner. We define the orthant directions associated with a given vector $x\in\mathbb{R}^m$ as follows:
(4) $z_i = \begin{cases} \operatorname{sign}(x_i) & \text{if } x_i \neq 0,\\ 1 & \text{if } x_i = 0 \text{ and } \nabla_i f(x) < -\beta,\\ -1 & \text{if } x_i = 0 \text{ and } \nabla_i f(x) > \beta,\\ 0 & \text{otherwise}, \end{cases} \qquad i = 1,\dots,m.$
These directions actually correspond to the minimum-norm subgradient element [19, Ch. 11]. A characterization of the orthant defined by $z$ is given by:
(5)
\[ \Omega := \{d : \operatorname{sign}(d) = \operatorname{sign}(z)\}. \]
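As a concrete illustration of (4) and (5), here is a minimal NumPy sketch; the function names are ours and the code is not the authors' implementation:

```python
import numpy as np

def orthant_direction(x, grad_f, beta):
    """Minimum-norm subgradient sign pattern z of eq. (4)."""
    z = np.sign(x).astype(float)
    at_zero = (x == 0)
    z[at_zero & (grad_f < -beta)] = 1.0
    z[at_zero & (grad_f > beta)] = -1.0
    # components with x_i = 0 and |grad_f_i| <= beta keep z_i = 0
    return z

def in_orthant(d, z):
    """Test whether d lies in the orthant Omega of eq. (5)."""
    return bool(np.all(np.sign(d) == np.sign(z)))
```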
Following [3] we then consider the descent-type direction:
(6) $\widetilde{\nabla}_i \varphi(x) = \begin{cases} \nabla_i f(x) + \beta\operatorname{sign}(x_i) & \text{if } x_i \neq 0,\\ \nabla_i f(x) + \beta & \text{if } x_i = 0 \text{ and } \nabla_i f(x) < -\beta,\\ \nabla_i f(x) - \beta & \text{if } x_i = 0 \text{ and } \nabla_i f(x) > \beta,\\ 0 & \text{otherwise}. \end{cases}$
To ensure that the iterates remain in the same orthant, an additional projection step is required. The corresponding projection onto the active orthant is defined as:
(7) $\mathcal{P}(x)_i = \begin{cases} x_i & \text{if } \operatorname{sign}(x_i) = \operatorname{sign}(z_i),\\ 0 & \text{otherwise}. \end{cases}$
The use of this type of direction in connection with second-order updates was originally proposed in [1]. The resulting orthant-wise quasi-Newton second-order matrix was constructed with the sole contribution of the regular part. In [3], instead of using the second-order matrix for the whole direction, an additional Newton subspace correction step was performed. The efficiency of both methods was experimentally compared in [3].

As mentioned in the introduction, although the $\ell_1$-norm is not differentiable in a classical sense, it is twice differentiable in a distributional sense. The second distributional derivative is given by Dirac's delta function:
\[ \delta(x) = \begin{cases} +\infty & \text{if } x = 0,\\ 0 & \text{otherwise}. \end{cases} \]
Since this weak derivative lives only at one single point, its role is usually ignored. In our case, however, we are precisely interested in obtaining a large number of zero entries in the solution vector and, therefore, this information may become valuable.

To make this second-order information usable, let us consider a Huber regularization of the $\ell_1$-norm given by $\|x\|_{1,\gamma} := \sum_{i=1}^m h_\gamma(x_i)$, where
(8) $h_\gamma(x_i) = \begin{cases} \gamma \frac{x_i^2}{2} & \text{if } |x_i| \le \frac{1}{\gamma},\\[2pt] |x_i| - \frac{1}{2\gamma} & \text{if } |x_i| > \frac{1}{\gamma}, \end{cases}$
for $\gamma > 0$. The first derivative is then given by
\[ \nabla_i\|x\|_{1,\gamma} = \frac{\gamma x_i}{\max(1, \gamma|x_i|)}, \]
i.e., the gradient of the Huber regularization is a vector whose components are described by the formula above. Since the first derivative is a semismooth function, it also possesses a slant derivative, given by
\[ \frac{\partial}{\partial x_j}\big(\nabla_i\|x\|_{1,\gamma}\big) = \Gamma_{ij} = \begin{cases} 0 & \text{if } i\neq j,\\[2pt] \dfrac{\gamma}{\max(1,\gamma|x_i|)} - \dfrac{\gamma^2 x_i\, \chi_i \operatorname{sign}(x_i)}{\max(1,\gamma|x_i|)^2} & \text{otherwise}, \end{cases} \]
where
\[ \chi_i := \begin{cases} 0 & \text{if } \gamma|x_i| < 1,\\ 1 & \text{otherwise}, \end{cases} \]
corresponds to the generalized derivative of the max function. Analyzing by cases, we obtain:

Case I. $\gamma|x_i| \le 1$: $\Gamma_{ii} = \gamma$.

Case II. $\gamma|x_i| > 1$: $\Gamma_{ii} = \dfrac{\gamma}{\gamma|x_i|} - \dfrac{\gamma^2 x_i^2}{\gamma^2 x_i^2 |x_i|} = 0.$

Thus, the diagonal matrix $\Gamma$ can be rewritten in the following simplified form:
(9) $\Gamma_{ii} = \begin{cases} \gamma & \text{if } \gamma|x_i| \le 1,\\ 0 & \text{elsewhere}. \end{cases}$
By incorporating this matrix into the second-order update, together with the orthantwise descent direction, we get the following system:
(10) $\big(B^k + \beta\Gamma^k\big) d^k = -\widetilde{\nabla}\varphi(x^k),$
where $B^k$ stands for the Hessian of $f$ or a quasi-Newton approximation of it (e.g. the BFGS matrix). Similarly to [1, 3], we consider the projected line-search rule
(11)
\[ \varphi\big[\mathcal{P}(x^k + s_k d^k)\big] \le \varphi(x^k) + \widetilde{\nabla}\varphi(x^k)^\top \big[\mathcal{P}(x^k + s_k d^k) - x^k\big], \]
used in a backtracking procedure for choosing $s_k$. The resulting algorithm is then given through the following steps.

Algorithm 1 Orthantwise Enriched Second Order Method (OESOM)
1: Initialize $x^0$ and $B^0$.
2: repeat
3:   Compute the matrix $\Gamma^k$ using (9).
4:   Compute the descent direction $\widetilde{\nabla}\varphi(x^k)$ using (6).
5:   Compute $d^k$ by solving the linear system (10).
6:   Compute $x^{k+1} = \mathcal{P}(x^k + s_k d^k)$, with the line-search step $s_k$ computed by (11).
7:   Update the second-order matrix.
8:   $k \leftarrow k+1$.
9: until the stopping criterion is satisfied
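To fix ideas, the following Python sketch mirrors Algorithm 1 under our own naming, with the exact Hessian standing in for the quasi-Newton matrix $B^k$ and reusing `orthant_direction` from the sketch after (5). It is a minimal illustration, not the authors' implementation:

```python
import numpy as np

def pseudo_gradient(x, g, beta):
    """Orthantwise descent direction of eq. (6)."""
    pg = np.where(x != 0, g + beta * np.sign(x), 0.0)
    pg = np.where((x == 0) & (g < -beta), g + beta, pg)
    pg = np.where((x == 0) & (g > beta), g - beta, pg)
    return pg

def project(x, z):
    """Projection onto the active orthant, eq. (7)."""
    return np.where(np.sign(x) == np.sign(z), x, 0.0)

def oesom(f, grad, hess, x0, beta, gamma=1e4, tol=1e-8, max_iter=500):
    """Minimal sketch of Algorithm 1 (OESOM)."""
    x = np.asarray(x0, dtype=float).copy()
    phi = lambda v: f(v) + beta * np.linalg.norm(v, 1)
    for _ in range(max_iter):
        g = grad(x)
        pg = pseudo_gradient(x, g, beta)
        if np.linalg.norm(pg) < tol:          # (6) vanishes exactly at solutions of (P)
            break
        z = orthant_direction(x, g, beta)     # eq. (4), defined in the sketch above
        Gamma = np.diag(np.where(gamma * np.abs(x) <= 1.0, gamma, 0.0))   # eq. (9)
        d = np.linalg.solve(hess(x) + beta * Gamma, -pg)                  # eq. (10)
        s, x_new = 1.0, project(x + d, z)
        # backtracking until the projected line-search rule (11) holds
        while phi(x_new) > phi(x) + pg @ (x_new - x) and s > 1e-12:
            s *= 0.5
            x_new = project(x + s * d, z)
        x = x_new
    return x
```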
3. Properties of the algorithm

In this section we focus on the properties of the algorithm presented above. We verify that the directions of (OESOM) are in fact descent directions and investigate the behaviour of the active set along the algorithm iterations.

We start by verifying a bound that plays an important role in the subsequent analysis. From the assumptions on the function $f$, the second-order matrix $B^k$ and the vector $\widetilde{\nabla}\varphi(x^k)$ are bounded. Therefore, equation (10) implies the existence of some positive constant $C$, independent of $k$, such that
(12) $|d_j^k| = \frac{1}{B_{jj}^k + \gamma}\left|\sum_{\ell\neq j} B_{j\ell}^k d_\ell^k + \widetilde{\nabla}_j\varphi(x^k)\right| \le \frac{C}{\gamma}$
for those $j$ such that $|x_j^k| \le 1/\gamma$. At the $k$-th step of the algorithm, we define the active set by
(13) $\mathcal{A}^k := \{i : z_i^k = 0\}.$
Moreover, we define the following index set of components of the current solution that remain in the orthant defined by $z^k$:
\[ \mathcal{H}^k := \{i : \operatorname{sign}(x_i^k + s d_i^k) = \operatorname{sign}(z_i^k)\}. \]
With this definition we can express the $(k+1)$-th iterate through the following update formula:
(14) $x_i^{k+1} = \mathcal{P}[x^k + s d^k]_i = x_i^k + \tilde d_i^k,$
where
(15) $\tilde d_i^k = \begin{cases} s d_i^k & \text{if } i \in \mathcal{H}^k,\\ -x_i^k & \text{elsewhere}. \end{cases}$
In the following we argue that $\tilde d$ is a descent direction, i.e. $\varphi(x^k + \tilde d^k) < \varphi(x^k)$. We start by proving two technical lemmas.

Lemma 1. Let $x = x^k$ be the $k$-th iterate of the algorithm and $z = z^k$ its associated orthant direction. Let $\gamma$ be sufficiently large and $s$ sufficiently small such that
(16) $\operatorname{sign}(x_i + s d_i) = \operatorname{sign}(x_i), \quad \text{for all } i : |x_i| \ge \frac{1}{\gamma}.$
Then,
(17) $(\nabla f(x) + \beta z)^\top \tilde d < 0,$
where $z \in \partial\|\cdot\|_1$ is defined by (4).

Proof. For the convenience of the reader, in this proof we omit the superindex $k$ on some variables. Let us define the index sets
(18) $I = \bar P \cup \bar N,$
(19) $A_- = \bar A \cap \{i : \nabla_i f(x) < -\beta\},$
(20) $A_+ = \bar A \cap \{i : \nabla_i f(x) > \beta\},$ and
(21) $A_0 = \bar A \cap \{i : |\nabla_i f(x)| \le \beta\}.$
Let $\tilde d$ and $d$ be the vectors defined by (15) and (10), respectively. Let us write the quantity $(\nabla f(x) + \beta z)^\top \tilde d$ as follows:
(22)
\[
\begin{aligned}
(\nabla f(x) + \beta z)^\top \tilde d &= s(\nabla f(x) + \beta z)^\top d - \sum_{i\notin\mathcal H^k} (\nabla_i f(x) + \beta z_i)(x_i + s d_i)\\
&= -s(\nabla f(x) + \beta z)^\top (B^k + \Gamma^k)^{-1}\widetilde{\nabla}\varphi(x) - \sum_{i\notin\mathcal H^k} (\nabla_i f(x) + \beta z_i)(x_i + s d_i).
\end{aligned}
\]
Taking into account that if $i\in\mathcal H^k\cap A_0$, then $0 = \operatorname{sign}(z_i) = \operatorname{sign}(x_i + s d_i) = \operatorname{sign}(s d_i)$, and from the definition of $\widetilde{\nabla}\varphi(x)$ in (6), the first term in (22) can be expressed as:
(23)
\[
\begin{aligned}
-s(\nabla f(x)+\beta z)^\top (B^k + \Gamma^k)^{-1}\widetilde{\nabla}\varphi(x) &= -s\,\widetilde{\nabla}\varphi(x)^\top (B^k + \Gamma^k)^{-1}\widetilde{\nabla}\varphi(x) + \sum_{i\in A_0} (\nabla_i f(x) + \beta z_i)\, s d_i\\
&= -s\,\widetilde{\nabla}\varphi(x)^\top (B^k + \Gamma^k)^{-1}\widetilde{\nabla}\varphi(x) + \sum_{i\notin\mathcal H^k,\ i\in A_0} (\nabla_i f(x) + \beta z_i)\, s d_i.
\end{aligned}
\]
Moreover, in view of the positive definiteness of the matrix $(B^k + \Gamma^k)^{-1}$, we can estimate (22) using (23), which results in
(24)
\[
\begin{aligned}
(\nabla f(x) + \beta z)^\top \tilde d &\le -sc\,\|\widetilde{\nabla}\varphi(x)\|_2^2 + \sum_{i\notin\mathcal H^k,\ i\in A_0} (\nabla_i f(x) + \beta z_i)\, s d_i - \sum_{i\notin\mathcal H^k} (\nabla_i f(x) + \beta z_i)(x_i + s d_i)\\
&= -sc\,\|\widetilde{\nabla}\varphi(x)\|_2^2 - \sum_{i\notin\mathcal H^k,\ i\notin A_0} (\nabla_i f(x) + \beta z_i)(x_i + s d_i),
\end{aligned}
\]
where the last sum results from the fact that $x_i + s d_i = s d_i$ for $i\in A_0$. We analyze the last term in (24) separately on the index sets $A_-$, $A_+$, where it becomes negative. Indeed, if $i\in A_-$ then $\nabla_i f(x) + \beta < 0$ and $z_i = 1$ by (4). Moreover, since $i\notin\mathcal H^k$ we have that $\operatorname{sign}(x_i + s d_i) \neq \operatorname{sign}(z_i) = 1$, which implies
\[ (\nabla_i f(x) + \beta z_i)(x_i + s d_i) = (\nabla_i f(x) + \beta)(x_i + s d_i) > 0. \]
Analogously, if $i\in A_+$ then $(\nabla_i f(x) + \beta z_i)(x_i + s d_i) > 0$, thus we get
(25) $\displaystyle \sum_{i\notin\mathcal H^k,\ i\notin A_0} (\nabla_i f(x) + \beta z_i)(x_i + s d_i) > \sum_{i\notin\mathcal H^k,\ i\in I} (\nabla_i f(x) + \beta z_i)(x_i + s d_i).$
It should be noticed that for $i\notin\mathcal H^k$ with $x_i > 0$ we have $x_i + s d_i < 0$, and thus $|x_i| < s|d_i|$. The same argument applies if $x_i < 0$. Therefore, we obtain the
following estimation
\[ (\nabla f(x) + \beta z)^\top \tilde d < -sc\,\|\widetilde{\nabla}\varphi(x)\|_2^2 - \sum_{i\notin\mathcal H^k,\ i\in I} (\nabla_i f(x) + \beta z_i)(x_i + s d_i), \]
where, by (16) and (12), every index in the remaining sum satisfies $0 < |x_i| < s|d_i| \le sC/\gamma$; hence, for $\gamma$ sufficiently large and $s$ sufficiently small, the right-hand side is negative and (17) follows.

Lemma 2. Let $x$, $z$ and $\tilde d$ be as in Lemma 1. Then
(26) $z^\top \tilde d = \|x + \tilde d\|_1 - \|x\|_1.$

Proof. Using the same notation as in the previous lemma, we obtain that
\[
\begin{aligned}
\|x + \tilde d\|_1 - \|x\|_1 - z^\top \tilde d &= \sum_i \big(|x_i + \tilde d_i| - |x_i| - z_i \tilde d_i\big)\\
&= \sum_{i\in\mathcal H^k} \big(|x_i + s d_i| - |x_i| - z_i s d_i\big) - \sum_{i\notin\mathcal H^k} \big(|x_i| - z_i x_i\big)\\
&= \sum_{i\in\mathcal H^k} \big(\operatorname{sign}(z_i)(x_i + s d_i) - |x_i| - z_i s d_i\big)\\
&= \sum_{i\in\mathcal H^k} \big(z_i(x_i + s d_i) - |x_i| - z_i s d_i\big) = \sum_{i\in\mathcal H^k} \big(z_i x_i - |x_i|\big) = 0.
\end{aligned}
\]
Theorem 1 (Descent direction). The direction $\tilde d$ defined by (15) is a descent direction for $\varphi$ at $x$, i.e.,
(27) $\varphi(x + \tilde d) < \varphi(x),$
for sufficiently small step size $s$.
Proof. Using Lemmas 1 and 2 and the Taylor expansion of the function $f$ at $x$ we have
\[
\begin{aligned}
\varphi(x + \tilde d) &= f(x + \tilde d) + \beta\|x + \tilde d\|_1\\
&= f(x) + \nabla f(x)^\top \tilde d + o(\|\tilde d\|_2) + \beta\|x + \tilde d\|_1\\
&= \varphi(x) + \nabla f(x)^\top \tilde d + \beta\big(\|x + \tilde d\|_1 - \|x\|_1\big) + o(\|\tilde d\|_2)\\
&= \varphi(x) + \nabla f(x)^\top \tilde d + \beta z^\top \tilde d + o(\|\tilde d\|_2)\\
&< \varphi(x) + o(\|\tilde d\|_2),
\end{aligned}
\]
and by definition of $\tilde d$, $o(\|\tilde d\|_2) = o(s)$ since $d$ is bounded. Finally we deduce $\varphi(x + \tilde d) < \varphi(x)$ and therefore $\tilde d$ is a descent direction.

In the following theorem we show that the identification of the active sets performed by (OESOM) is monotone in a neighbourhood of the solution $\bar x$. The size of the neighbourhood turns out to be strongly dependent on the regularization parameter $\gamma$. Theorem 2 states the monotonicity of the active sets under a strict complementarity condition.

Theorem 2 (Monotonicity of the active set). Let us assume that the sequence $\{x^k\}_{k\in\mathbb{N}}$ generated by algorithm (OESOM) converges to the solution $\bar x$. If strict complementarity holds at $x^k$, i.e. $\{i : |\nabla_i f(x^k)| = \beta,\ x_i^k = 0\} = \emptyset$, then, for $k$ sufficiently large,
(28) $\mathcal{A}^k \subseteq \mathcal{A}^{k+1}.$
Proof. Let us suppose that $z_i^k = 0$ for some index $i$. In order to prove (28) we must verify two properties: i) $x_i^{k+1} = 0$ and ii) $|\nabla_i f(x^{k+1})| \le \beta$. Property i) follows directly by taking into account that $z_i^k = 0$ and the projection (7). Let us check property ii). By applying a Taylor expansion of $\nabla f$ at $x^{k+1} = x^k + \tilde d^k$, for some $\theta\in(0,1)$, we get
(29) $\nabla f(x^{k+1}) = \nabla f(x^k) + \nabla^2 f\big(x^k + (1-\theta)\tilde d^k\big)^\top \tilde d^k.$
By applying our strict complementarity hypothesis to (29) we obtain
(30) $\displaystyle |\nabla_i f(x^{k+1})| < \beta + \Big|\sum_j \big[\nabla^2 f\big(x^k + (1-\theta)\tilde d^k\big)\big]_{ij}\, \tilde d_j^k\Big|.$
First, we analyze the last sum in (30). In order to simplify the notation, we define the matrix $H^k$ with entries $H_{ij}^k = \big[\nabla^2 f\big(x^k + (1-\theta)\tilde d^k\big)\big]_{ij}$. By construction $\tilde d_i^k = 0$ and we get the estimate
(31) $\displaystyle \Big|\sum_{j\neq i} H_{ij}\tilde d_j^k\Big| \le \|H\|_2 \sum_{j\neq i} |\tilde d_j^k| \le \|H\|_2 \Bigg( \sum_{\substack{j\neq i\\ |x_j^k|\le 1/\gamma}} |\tilde d_j^k| + \sum_{\substack{j\neq i\\ |x_j^k| > 1/\gamma}} |\tilde d_j^k| \Bigg),$
where the first sum in the last inequality can be bounded by $C/\gamma$ using (12), for the indices $j\in\mathcal H^k$. For those indices $j\notin\mathcal H^k$ it holds that $|\tilde d_j^k| = |x_j^k| \le 1/\gamma$.
Therefore, there exists a constant $C > 0$ such that
(32) $\displaystyle \Big|\sum_{j\neq i} H_{ij}\tilde d_j^k\Big| \le \frac{C}{\gamma} + \|H\|_2 \sum_{\substack{j\neq i\\ |x_j^k| > 1/\gamma}} |\tilde d_j^k|.$
On the other hand, since $x^k \to \bar x$ as $k\to\infty$, we have that $\tilde d^k = x^{k+1} - x^k \to 0$ as $k\to\infty$. Consequently, for any $\varepsilon > 0$ there exists $k_0$ such that for any $k \ge k_0$ the relations (12), (30) and (32) lead to the following bound
\[ |\nabla_i f(x^{k+1})| < \beta + \frac{C}{\gamma} + \varepsilon\|H\|_2. \]
Since $\varepsilon > 0$ is arbitrary and taking $\gamma$ large enough, condition ii) holds, which together with i) implies (28).

Remark 2. If we assume that $f$ is separable, its Hessian is a diagonal matrix, thus
\[ \sum_{j\neq i} H_{ij}^k \tilde d_j^k = 0, \]
which, together with the assumptions of Theorem 2, implies directly that $|\nabla_i f(x^{k+1})| = |\nabla_i f(x^k)| \le \beta$ and therefore $\mathcal A^k \subseteq \mathcal A^{k+1}$.

4. Interpretation as a semismooth Newton method

In a recent paper, Byrd et al. [2] introduced a general semismooth Newton framework for the solution of problem (P). By choosing the algorithm parameters differently, several well-known algorithms (FISTA, OWL, BAS) may be derived from this rather general setting. In this section we show that our algorithm can also be cast as a semismooth Newton algorithm by choosing appropriate parameter values. Moreover, inspired by this interpretation, an adaptive semismooth Newton strategy is built upon (OESOM).

Let us start by recalling the (SSN) framework proposed in [2]. A simple reformulation of (2) leads to the equivalent operator equation $F(x) = 0$, where
(33) $F_i(x) = \max\big(\min\big(\tau(\nabla_i f(x) + \beta),\ x_i\big),\ \tau(\nabla_i f(x) - \beta)\big), \quad i = 1,\dots,m.$
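In code, the residual (33) is a componentwise max/min expression. A minimal sketch with our own naming; by the equivalence with (2), $F(x) = 0$ characterizes the solution of (P):

```python
import numpy as np

def ssn_residual(x, grad_f, beta, tau):
    """Residual F(x) of eq. (33); x solves (P) exactly when F(x) = 0."""
    return np.maximum(np.minimum(tau * (grad_f + beta), x),
                      tau * (grad_f - beta))
```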
Since the max and min functions are semismooth, a Newton-type update can be obtained as
(34) $G(x^k)\, d = -F(x^k), \qquad x^{k+1} = x^k + d,$
where $G$ stands for the generalized Jacobian of $F$. By defining the following index sets:
\[
\begin{aligned}
\mathcal N^k &:= \big\{ i : x_i^k \le \tau\big(\nabla_i f(x^k) - \beta\big) \big\},\\
\mathcal A^k &:= \big\{ i : \tau\big(\nabla_i f(x^k) - \beta\big) \le x_i^k \le \tau\big(\nabla_i f(x^k) + \beta\big) \big\},\\
\mathcal P^k &:= \big\{ i : x_i^k \ge \tau\big(\nabla_i f(x^k) + \beta\big) \big\},
\end{aligned}
\]
the Newton updates can also be written in the following form:
(35a) $e_i^\top d = -x_i^k,$ for $i \in \mathcal A^k \setminus (\mathcal N^k \cup \mathcal P^k)$,
(35b) $\nabla^2_{i:} f(x^k)\, d = -\big(\nabla_i f(x^k) + \beta\big),$ for $i \in \mathcal P^k \setminus \mathcal A^k$,
(35c) $\nabla^2_{i:} f(x^k)\, d = -\big(\nabla_i f(x^k) - \beta\big),$ for $i \in \mathcal N^k \setminus \mathcal A^k$,
(35d) $\big(\delta_i \nabla^2_{i:} f(x^k) + (1-\delta_i) e_i^\top\big)\, d = -\tau\big(\nabla_i f(x^k) - \beta\big),$ for $i \in \mathcal N^k \cap \mathcal A^k$,
(35e) $\big(\delta_i \nabla^2_{i:} f(x^k) + (1-\delta_i) e_i^\top\big)\, d = -\tau\big(\nabla_i f(x^k) + \beta\big),$ for $i \in \mathcal P^k \cap \mathcal A^k$,
(35f) $x^{k+1} = x^k + d,$
where $\nabla^2_{i:} f(x)$ stands for the $i$-th row of the Hessian and $e_i$ is the $i$-th canonical vector of $\mathbb{R}^m$. The orthant-wise method proposed by Byrd et al. [2] turns out to be equivalent to the semismooth Newton updates under the special choice of $\tau$ sufficiently small and $\delta_i = 0$ for all $i \in (\mathcal N^k \cap \mathcal A^k) \cup (\mathcal P^k \cap \mathcal A^k)$.

In the case of our algorithm, we obtain for $\gamma$ large enough and the choice $\tau = 1/(\gamma+1)$ that
\[ \operatorname{sign}\big(x_i^k - \tau\big(\nabla_i f(x^k) + \operatorname{sign}(x_i^k)\beta\big)\big) = \operatorname{sign}(x_i^k) \quad \forall i : x_i^k \neq 0. \]
Indeed, if $\operatorname{sign}(x_i^k) = -1$, then $x_i^k - \tau\big(\nabla_i f(x^k) - \beta\big) < 0$, which implies that $i \in \mathcal N^k \setminus \mathcal A^k$. In a similar way, if $x_i^k > 0$ then $i \in \mathcal P^k \setminus \mathcal A^k$. This also implies (see [2]) that if $i \in \mathcal A^k$, then $x_i^k = 0$. Concerning the updates:
• If $i \in \mathcal A^k \setminus (\mathcal N^k \cup \mathcal P^k)$, then by (35a) $d_i^k = -x_i^k = 0$. On the other hand, in our method this case corresponds to $z_i^k = 0$, which thanks to the projection formula implies that $x_i^{k+1} = x_i^k = 0$.
• If $\operatorname{sign}(x_i^k) = -1$, then $i \in \mathcal N^k \setminus \mathcal A^k$ and the (SSN) update corresponds to $\nabla^2_{i:} f(x^k)\, d^k = -\big(\nabla_i f(x^k) - \beta\big)$, which is similar to the update (10) of the (OESOM) algorithm.
• If $\operatorname{sign}(x_i^k) = 1$, then $i \in \mathcal P^k \setminus \mathcal A^k$ and $\nabla^2_{i:} f(x^k)\, d^k = -\big(\nabla_i f(x^k) + \beta\big)$, which corresponds to the update (10) of our algorithm.
• If $i \in \mathcal N^k \cap \mathcal A^k$, then our algorithm yields the update
(36) $\big(\nabla^2_{i:} f(x^k) + \gamma e_i^\top\big)\, d^k = -\big(\nabla_i f(x^k) - \beta\big),$
which corresponds exactly to the (SSN) update (35d) with the choice $\delta_i = 1/(\gamma+1)$. Indeed,
\[
\big(\delta_i \nabla^2_{i:} f(x^k) + (1-\delta_i) e_i^\top\big)\, d = -\tau\big(\nabla_i f(x^k) - \beta\big)
\;\Longleftrightarrow\;
\Big(\tfrac{1}{\gamma+1}\nabla^2_{i:} f(x^k) + \tfrac{\gamma}{\gamma+1} e_i^\top\Big) d = -\tfrac{1}{\gamma+1}\big(\nabla_i f(x^k) - \beta\big),
\]
which coincides with (36). The case $i \in \mathcal P^k \cap \mathcal A^k$ follows in a similar way. Consequently, the updates of (OESOM) and the (SSN) method coincide for the special choice $\tau = \delta = \frac{1}{\gamma+1}$.

From the semismooth interpretation an adaptive strategy for the regularization parameter can also be devised. Indeed, we may choose $\tau$ such that
\[ \operatorname{sign}\big(x_i^k - \tau\big(\nabla_i f(x^k) + \operatorname{sign}(x_i^k)\beta\big)\big) = \operatorname{sign}(x_i^k), \quad \forall i : x_i^k \neq 0 \]
is satisfied. This leads to the condition
\[ \tau < \frac{|x_i^k|}{|\nabla_i f(x^k) + \operatorname{sign}(x_i^k)\beta|}, \quad \forall i : x_i^k \neq 0. \]
Considering the choice of the (OESOM) method, $\tau = \delta = 1/(\gamma+1)$, an adaptive choice of $\gamma$ is given by
\[ \gamma > \frac{|\nabla_i f(x^k) + \operatorname{sign}(x_i^k)\beta|}{|x_i^k|} - 1, \quad \forall i : x_i^k \neq 0, \]
or, more conservatively,
(37) $\displaystyle \gamma = \max_{\{i:\, x_i^k \neq 0\}} \frac{|\nabla_i f(x^k) + \operatorname{sign}(x_i^k)\beta|}{|x_i^k|}.$
The adaptive (OESOM) algorithm will then be given through the following steps.

Algorithm 2 Adaptive Orthantwise Enriched Second Order Method
1: Initialize $x^0$ and $B^0$.
2: repeat
3:   Choose the regularization parameter $\gamma$ using (37).
4:   Compute the matrix $\Gamma^k$ using (9).
5:   Compute the descent direction $\widetilde{\nabla}\varphi(x^k)$ using (6).
6:   Compute $d^k$ by solving the linear system (10).
7:   Compute $x^{k+1} = \mathcal{P}(x^k + s_k d^k)$, with the line-search step $s_k$ computed by (11).
8:   Update the second-order matrix.
9:   $k \leftarrow k+1$.
10: until the stopping criterion is satisfied
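A minimal sketch of the conservative choice (37), under our own naming (the safeguard `gamma_min` is our addition, not part of the paper). In the (OESOM) sketch given after Algorithm 1, this value would simply replace the fixed `gamma` at the beginning of each iteration, as in step 3 above:

```python
import numpy as np

def adaptive_gamma(x, grad_f, beta, gamma_min=1.0):
    """Adaptive Huber parameter gamma according to (37)."""
    nz = x != 0
    if not np.any(nz):
        return gamma_min                     # safeguard when all components vanish
    ratios = np.abs(grad_f[nz] + beta * np.sign(x[nz])) / np.abs(x[nz])
    return max(gamma_min, float(ratios.max()))
```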
5. Numerical experiments

We study the numerical performance of (OESOM) compared to (OWL) [1] and (NW–CG) [3] for different $\ell_1$-penalized optimization problems. In particular, we turn our attention to the numerical solution of PDE-constrained optimization problems. The experiments below show that (OESOM) is competitive compared to other state-of-the-art algorithms, and in many cases is able to converge faster than the others. Furthermore, there are some special cases in which our algorithm succeeds in computing the approximate solution of the problem whereas the other algorithms are not able to.

In the experiments carried out we consider the (OWL) algorithm without limited-memory updates, as originally designed, but with the full BFGS matrix for the Hessian approximation. We also consider the (NW–CG) algorithm using the BFGS approximation and not the exact Hessian information. We take the number of iterations as a measure of efficiency, in order to avoid coding-optimization issues that arise when execution time is used to compare performance. This is particularly relevant in PDE-constrained optimization problems due to the high computing cost per iteration, since at every step two partial differential equations, the state and the adjoint equation, must be solved to compute descent directions and perform objective function evaluations. In each experiment, we specify the parameters and function values. Finally, we mention that for the numerical approximation of the partial differential equations involved we used a finite difference discretization scheme.

5.1. Randomly generated linear-quadratic optimization problems. Similarly to [2], we consider the following experiment: we randomly generate 1000 linear-quadratic optimization problems of the form
\[ (\mathrm P)\qquad \min_{u\in\mathbb{R}^m}\ \frac12 u^\top Q u + q^\top u + \beta\|u\|_1, \]
where $Q$ is generated by the MATLAB function sprandsym with type 1, whose outcome is a symmetric positive definite matrix with a sparsity rate $d$. That is, the number of nonzero entries of the matrix is approximately $d\cdot m^2$, where $m$ is the size of the matrix. In addition, the condition number is equal to $rc^{-1}$. In our experiments we selected $d = 25\%$; on half of the problems we consider a moderate condition number $10^4$, while for the remaining half we take a higher condition number equal to $10^7$. The vector $q$ has randomly generated entries whose magnitudes are of the order of the eigenvalues of $Q$. Furthermore, approximately half of the entries of $q$ are negative while the others are positive. The parameter $\beta$ was also randomly generated in the interval $[1.8, n/3]$. The condition $\|q\|_\infty < \beta$ is satisfied; therefore, the exact solution is zero in all cases.

The goal of this experiment is to count the number of times each algorithm fails to reach the exact solution: when an algorithm is not able to compute the exact solution within 6000 iterations, it is considered a failure. We run this experiment for a fixed value of the regularization parameter $\gamma = 10^4$ used by the (OESOM) algorithm. The results obtained for this experiment are summarized in Table 1, where it can be verified that the (OESOM) algorithm successfully computes the exact solution in all cases.
Figure 1. Number of iterations: OESOM (left), OWL (center), NW–CG (right).
Number of failures per algorithm:

Condition number of Q   OESOM   NW–CG   OWL
Moderate                    0       0     0
High                        0     260     2
Total                       0     260     2

Table 1. Failures out of a set of 1000 randomly generated problems.
Figure 1 shows the total number of iterations for those problems for which the three algorithms successfully compute the null solution. From these figures we can observe that the algorithms (OESOM) and (OWL) exhibit a similar performance, whereas the (NW–CG) algorithm is less competitive for this set of problems.

Table 2 statistically summarizes the obtained results. The global performance of the algorithms is shown in terms of the mean and the variance of the number of iterations for those problems where the three algorithms succeed in computing the solution. We notice that the mean of (OWL) is slightly lower than that of the other algorithms in this experiment. However, its variance is larger compared to (OESOM). In particular, the results of the (NW–CG) algorithm are affected by those problems in which the algorithm has difficulties converging.

Algorithm   Mean      Variance
OESOM        4.2970      0.4252
NW–CG       69.3149   9.6458e+04
OWL          3.7154      0.7394

Table 2. Global performance of the algorithms for Experiment 1.
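For reference, this is how one of the quadratic test problems above would be fed to the `oesom` sketch given after Algorithm 1; `Q`, `q`, `beta` and the dimension `m` are assumed to have been generated as described (with `Q` taken here as a dense NumPy array):

```python
import numpy as np

# f(u) = 0.5 u^T Q u + q^T u; Q, q, beta, m assumed given as described above.
f    = lambda u: 0.5 * u @ (Q @ u) + q @ u
grad = lambda u: Q @ u + q
hess = lambda u: Q

u_star = oesom(f, grad, hess, x0=np.ones(m), beta=beta)
print(np.allclose(u_star, 0.0))      # expected True whenever ||q||_inf < beta
```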
5.2. PDE-constrained optimization problems. The following example is taken from [20, Ex. 3] and consists in the solution of the elliptic optimal control
problem:
\[ \min_{(y,u)}\ \frac12\|y - y_d\|_{L^2(D)}^2 + \frac{\alpha}{2}\|u\|_{L^2(D)}^2 + \beta\|u\|_{L^1(D)} \]
subject to
\[ (\mathrm{OCP})\qquad -\nu\Delta y = u + f \ \text{ in } D := (0,1)\times(0,1), \qquad y = 0 \ \text{ on } \partial D, \]
where the following parameter values are chosen: $\alpha = 0.0002$, $\beta = 0.0094$, and the functions $y_d := \sin(4\pi x)\cos(8\pi y)\exp(2\pi x)$ and $f := 0$. By using a first-discretize-then-optimize approach, we transform this problem into a finite-dimensional optimization problem of the form (P). The computed optimal solution is shown in Figure 2.

Figure 2. Optimal control $u$.

As stopping criterion we use the following condition
|ϕ(xk ) − ϕ(xk−1 )| < η, |ϕ(xk )|
with tolerance $\eta = 10^{-5}$. Moreover, we select $\gamma = 10^4$ as the Huber regularization parameter. The graphics below illustrate the performance of the algorithms for $\nu = 0.1$. The size of the active sets at each iteration is presented in Figures 3a and 3b, and the decay of the objective function is shown in Figure 3c. We notice that the (OESOM) algorithm appears to be faster at identifying the active sets. This feature plays a significant role in the optimization process due to the sparse structure of the control shown in Figure 2. The fast identification of the active sets is reflected in the decay of the value of the objective function depicted in Figure 3c for the first 50 iterations, where the (OESOM) algorithm attains a smaller value of the objective function in fewer iterations than its competitors. These results are summarized in Table 3.
(a) Null entries of $z^k$. (b) Null entries of $x^k$. (c) Decrease of the objective function.
Algorithm   Number of iterations   Objective function    Size of the active set $|\mathcal A^k|$
OESOM                22            1.236488867697212     582
NW–CG                48            1.236484675634829     591
OWL                 275            1.236492847100781     588

Table 3. Experiment 2: performance of the algorithms.
Numerical results for this problem are strongly affected by the selection of the diffusion parameter $\nu$ and the $\ell_1$-weighting parameter $\beta$. For values of $\nu$ close to one the three algorithms show a similar performance, whereas for smaller values of $\nu$ they differ, as we can see in the graphics below. A similar situation occurs when $\beta$ is changed up to its critical value $\beta_0$, at which the solution vanishes. In order to get a global picture of the performance of the three different algorithms, we perform 90000 runs of the three algorithms on a grid of different values of $\nu$ and $\beta$. In Figure 3 we present the number of iterations for each algorithm for different values of $\beta \in [0, 0.1]$ and a fixed value of $\nu = 0.1$.
Figure 3. Number of iterations per experiment.

For each combination of $(\nu, \beta) \in [0.0186, 0.6065] \times [0.0024, 0.6]$, on a grid of 62500 points, Figure 4 shows which of the three algorithms (OESOM (red), NW–CG (purple) and OWL (yellow)) needs fewer iterations for computing the solution. The white regions represent those combinations of parameters $(\nu, \beta)$ for which our algorithm shares the fastest convergence with one of the two other algorithms. The black curve represents the critical value $\beta_0 = \beta_0(\nu) = \frac{1}{\nu^2}\|A^{-\top}(A^{-1} f - \nu z)\|_\infty$, where $A$ stands for the discrete Laplacian. Here, $f$ and $z$ are the linear interpolations of the functions described in the first part of the experiment.
Figure 4. Regions of performance of the algorithms for the combinations of parameters $(\beta, \nu)$.

Unlike the other methods, ours depends on the Huber regularization parameter $\gamma$. As expected, the selection of this parameter is crucial to its performance. Although the theory establishes convergence of solutions when $\gamma \to \infty$, in practice this parameter must be kept bounded; otherwise we face numerical
ill-conditioning. The numerical evidence suggests that the regularization parameter $\gamma$ can be chosen adaptively at every step of the algorithm, as suggested in (37).
(a) Zero entries of $z^k$. (b) Zero entries of $x^k$. (c) Decrease of the objective function. (d) $\|d^k\|$.
Figure 5. Behaviour of (OESOM) for different choices of the regularization parameter.

Figures 5a and 5b show the active sets of $z^k$ and $u^k$, respectively, and Figures 5c and 5d the decay of the objective function and the norm of the directions for different values of $\gamma$. As mentioned above, we also include the adaptive regularization scheme in the comparison of the performance of the (OESOM) algorithm. As we can see, using the adaptive regularization the performance of the method is quite similar to the best performance obtained with the fixed value $\gamma = 10^4$. However, the problem is better conditioned, since the values of the parameter are lower than $10^4$ in most of the cases.

6. Conclusions

By using the distributional information of the $\ell_1$-norm we are able to obtain extra second-order information on the objective function, which is incorporated in the computation of the descent direction. This information turns out to be important for a faster identification of the active set.
The proposed algorithm has been proved to be equivalent to a semismooth Newton method with the choice $\tau = \delta = \frac{1}{\gamma+1}$ for the (SSN) defining constants. Since (SSN) is a fast local method, such convergence properties are also inherited by our approach. Moreover, thanks to this interpretation, an adaptive update strategy for the regularization parameter has been proposed.

Finally, the performance of the proposed algorithm turns out to be competitive with respect to other state-of-the-art algorithms. Specifically, we exhaustively compared the efficiency of (OESOM) with respect to (OWL) and (NW–CG), for a large set of randomly generated linear-quadratic problems and elliptic PDE-constrained optimization problems.

References
[1] G. Andrew and J. Gao. Scalable training of L1-regularized log-linear models. In Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML), 2007.
[2] R. Byrd, G. Chin, J. Nocedal, and F. Oztoprak. A family of second-order methods for convex L1-regularized optimization. Technical report, Optimization Center, Northwestern University, 2012.
[3] R. Byrd, G. Chin, J. Nocedal, and Y. Wu. Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1), 2011.
[4] R. Byrd, G. M. Chin, W. Neveitt, and J. Nocedal. On the use of stochastic Hessian information in unconstrained optimization. SIAM J. Optim., 21(3):977–995, 2011.
[5] E. Casas, C. Ryll, and F. Tröltzsch. Sparse optimal control of the Schlögl and FitzHugh–Nagumo systems. Computational Methods in Applied Mathematics, 13(4):415–442, 2013.
[6] M. Collins and T. Koo. Discriminative reranking for natural language parsing. Computational Linguistics, 31(1):25–70, 2005.
[7] J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, pages 1470–1480, 1972.
[8] K. Fountoulakis and J. Gondzio. A second-order method for strongly convex $\ell_1$-regularization problems. arXiv:1306.5386v3, 2013.
[9] H. Fu, M. K. Ng, M. Nikolova, and J. L. Barlow. Efficient minimization methods of mixed l2-l1 and l1-l1 norms for image restoration. SIAM J. Sci. Comput., 27(6):1881–1902, 2006.
[10] G. H. Golub and C. F. Van Loan. Matrix Computations, volume 3. JHU Press, 2012.
[11] R. Herzog, G. Stadler, and G. Wachsmuth. Directional sparsity in optimal control of partial differential equations. SIAM Journal on Control and Optimization, 50(2):943–963, 2012.
[12] S.-I. Lee, H. Lee, P. Abbeel, and A. Y. Ng. Efficient L1 regularized logistic regression. In Proceedings of the National Conference on Artificial Intelligence, volume 21, page 401. AAAI Press / MIT Press, 2006.
[13] R. Malouf. A comparison of algorithms for maximum entropy parameter estimation. In Proceedings of the 6th Conference on Natural Language Learning (Volume 20), pages 1–7. Association for Computational Linguistics, 2002.
[14] N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, pages 246–270, 2009.
[15] T. P. Minka. A comparison of numerical optimizers for logistic regression. Unpublished draft, 2003.
[16] J. Nocedal and S. J. Wright. Numerical Optimization. Springer-Verlag, New York, 1999.
[17] M. Schmidt, G. Fung, and R. Rosales. Fast optimization methods for L1 regularization: A comparative study and two new approaches. In Machine Learning: ECML 2007, pages 286–297. Springer, 2007.
[18] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 134–141. Association for Computational Linguistics, 2003.
[19] S. Sra, S. Nowozin, and S. J. Wright. Optimization for Machine Learning. MIT Press, 2012.
[20] G. Stadler. Elliptic optimal control problems with L1-control cost and applications for the placement of control devices. Comput. Optim. Appl., 44(2):159–181, 2009.
[21] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.
[22] F. Tröltzsch. Optimal Control of Partial Differential Equations, volume 112 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, 2010. Theory, methods and applications; translated from the 2005 German original by Jürgen Sprekels.
[23] J. H. Wilkinson. The Algebraic Eigenvalue Problem, volume 87. Clarendon Press, Oxford, 1965.
[24] G. Yuan, K. Chang, C. Hsieh, and C. Lin. A comparison of optimization methods and software for large-scale l1-regularized classification. Journal of Machine Learning Research, 11:3183–3234, 2010.

‡ Centro de Modelización Matemática (MODEMAT), EPN Quito, Ladrón de Guevara E11-253, Quito, Ecuador