Learning with Optimal Interpolation Norms: Properties and Algorithms


arXiv:1603.09273v1 [math.OC] 30 Mar 2016

Patrick L. Combettes Sorbonne Universités – UPMC Univ. Paris 06 UMR 7598, Laboratoire Jacques-Louis Lions, F-75005 Paris, France [email protected]

Andrew M. McDonald University College London, Department of Computer Science London WC1E 6BT, UK [email protected]

Charles A. Micchelli State University of New York, The University at Albany Department of Mathematics and Statistics, Albany, NY 12222, USA charles [email protected]

Massimiliano Pontil Istituto Italiano di Tecnologia, 16163 Genoa, Italy [email protected] and University College London, Department of Computer Science London WC1E 6BT, UK

Abstract

We analyze a class of norms defined via an optimal interpolation problem involving the composition of norms and a linear operator. This framework encompasses a number of norms which have been used as regularizers in machine learning and statistics. In particular, these include the latent group lasso, the overlapping group lasso, and certain norms used for learning tensors. We establish basic properties of this class of regularizers and we provide the dual norm. We present a novel stochastic block-coordinate version of the Douglas-Rachford algorithm to solve regularization problems with these regularizers. A unique feature of the algorithm is that it yields iterates that converge to a solution in the case of nonsmooth losses and random block updates. Finally, we present numerical experiments on the latent group lasso.


1 Introduction

In machine learning and statistics, regularization is a standard tool used to promote structure in the solutions of an optimization problem in order to improve learning. Of particular interest has been structured sparsity, where regularizers provide a means to promote specific sparsity patterns of regression vectors or spectra of matrices. Important examples include the group lasso [37] and related norms [12, 23, 38], spectral regularizers for low rank matrix learning and multitask learning [1, 13, 31], multiple kernel learning [5, 25], and regularizers for learning tensors [29, 34], among others.

We introduce a general formulation which captures many of these norms and allows us to construct more via an optimal interpolation problem involving simpler norms and a linear operator. Specifically, the norm of w ∈ R^d is defined as

  ‖w‖ = inf_{Bv=w} |||F(v)|||,   (1.1)

where ||| · ||| is a monotone norm (e.g., an ℓp norm), F is a vector-valued mapping whose components are also norms, and B is a d × N real matrix. As we shall see, this concise formulation encompasses many classes of regularizers, either via the norm (1.1) or the corresponding dual norm, which we investigate in the paper. An important example is provided by the latent group lasso [12], in which ||| · ||| is the ℓ1 norm and the components of F are ℓ2 norms. Furthermore, while in this paper we focus on norms, the construction (1.1) provides a scheme for a broader range of regularizers of the form

  ϕ(w) = inf_{Bv=w} h(F(v)),   (1.2)

where the components of F and h are convex functions that satisfy certain properties. Such functions generalize the notion of infimal convolution (see [6, Proposition 12.35]) and are found in signal recovery formulations, e.g., [7].

Optimal interpolation and, in particular, the problem of finding a minimal norm interpolant to a finite set of points has a long history in approximation theory; see, e.g., [8] and the references therein. In machine learning, this problem has been investigated in [3] and [24]. Special cases of our setting based on (1.1) appear in [20, 34], where ||| · ||| is an ℓp norm and the components of F are ℓ2 norms. On the other hand, the atomic norms considered in [9] correspond, in the case of a finite number of atoms, to setting F = Id and ||| · ||| = ‖ · ‖_1 in (1.1). The general fact that the construction (1.1) preserves convexity is known. However, to the best of our knowledge, the observation that optimal interpolation can be used as a framework to generate norms is novel. On the numerical front, we shall take advantage of the fact that the mapping F and the operator B can be composed of a large number of “simple” components to create complex structures, leading to efficient algorithms to solve problems involving regularizers of the form (1.1).

In summary, the two main contributions of this paper are the following:

• We introduce a class of convex regularizers which encompasses many norms used in machine learning and statistics in order to enforce properties such as structured sparsity. We provide basic properties (cf. Proposition 3.1 and Theorem 3.5) and examples of this construction. In particular, these include the overlapping group lasso, the latent group lasso, and various norms used in tensor learning problems.

• We present a general method to solve optimization problems involving regularizers of the form (1.1) which is based on the recent stochastic block-coordinate Douglas-Rachford iterative framework of [10]. Unlike existing methods, this approach guarantees convergence of the iterates to a solution even when none of the functions present in the model need to be differentiable and when random block updates are performed.

The paper is organized as follows. In Section 2 we review our notation. In Section 3 we introduce the general class of regularizers and establish some of their basic properties. In Section 4 we give a number of examples of norms in this framework. In Section 5 we provide a random block-coordinate algorithm to solve learning problems involving these regularizers. In Section 6 we report on numerical experiments with this algorithm using the latent group lasso.

2 Notation and Background

We introduce our notation and recall basic concepts from convex analysis used throughout the paper; for details see [6] and [28]. Let R^d_+ and R^d_{++} be the positive and strictly positive d-dimensional orthant, respectively, and set R^d_− = −R^d_+. Given w ∈ R^d and J ⊂ {1, . . . , d}, w|_J ∈ R^d is the vector whose components agree with those of w on J and are zero elsewhere, and the support of w is supp(w) = {i : 1 ≤ i ≤ d, w_i ≠ 0}. The dual norm of a norm ‖ · ‖ : R^d → R_+ evaluated at u ∈ R^d is

  ‖u‖_∗ = max{ ⟨w | u⟩ : ‖w‖ ≤ 1 },   (2.1)

where ⟨· | ·⟩ is the canonical scalar product of R^d. For every p ∈ [1, +∞], the ℓp norm on R^d is denoted by ‖ · ‖_p. The corresponding dual norm is ‖ · ‖_{p,∗} = ‖ · ‖_q, where 1/p + 1/q = 1.

Let ϕ : R^d → ]−∞, +∞]. The domain of ϕ is dom ϕ = {w ∈ R^d : ϕ(w) < +∞} and the conjugate of ϕ is ϕ^∗ : R^d → [−∞, +∞] : u ↦ sup_{w∈R^d} (⟨w | u⟩ − ϕ(w)). Γ_0(R^d) denotes the set of lower semicontinuous convex functions from R^d to ]−∞, +∞] with nonempty domain. Finally, the proximity operator of a function ϕ ∈ Γ_0(R^d) at w ∈ R^d is the unique solution, denoted by prox_ϕ w, of the optimization problem

  minimize_{u∈R^d}  ϕ(u) + (1/2)‖w − u‖_2^2.   (2.2)
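For later reference, the proximity operators of the scaled ℓ1 and ℓ2 norms, which reappear in Sections 5 and 6, admit well-known closed forms. The following NumPy sketch records them (the function names are ours; the formulas are standard consequences of (2.2)):

```python
import numpy as np

def prox_l1(w, gamma):
    # Soft-thresholding: proximity operator of gamma * ||.||_1, applied componentwise.
    return np.sign(w) * np.maximum(np.abs(w) - gamma, 0.0)

def prox_l2(w, gamma):
    # Block soft-thresholding: proximity operator of gamma * ||.||_2 (the whole vector
    # is shrunk toward the origin and set to zero if its norm does not exceed gamma).
    nrm = np.linalg.norm(w)
    return np.zeros_like(w) if nrm <= gamma else (1.0 - gamma / nrm) * w
```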

3 A Class of Norms

We establish the mathematical foundation of our framework. We begin with a scheme to construct convex functions on R^d.

Proposition 3.1 Let m, d, and N be strictly positive integers. Let K = R^m_−, let B ∈ R^{d×N}, and let F : R^N → R^m be K-convex in the sense that, for every α ∈ ]0, 1[ and every (u, v) ∈ R^N × R^N,

  F(αu + (1 − α)v) − αF(u) − (1 − α)F(v) ∈ K.   (3.1)

Let h : R^m → ]−∞, +∞] be a proper convex function such that, for every (x, y) ∈ ran F × ran F,

  x − y ∈ K  ⇒  h(x) ≤ h(y).   (3.2)

Define, for every w ∈ R^d,

  ϕ(w) = inf_{v∈R^N, Bv=w} h(F(v)).   (3.3)

Then h ◦ F and ϕ are convex.

Proof. It is enough to show that h ◦ F is convex, as this will imply that ϕ is likewise by [6, Proposition 12.34(ii)] or [28, Theorem 5.7]. Let α ∈ ]0, 1[ and let u and v be points in R^N. Combining (3.1) and (3.2) yields

  h(F(αu + (1 − α)v)) ≤ h(αF(u) + (1 − α)F(v)).   (3.4)

Therefore, by convexity of h, we obtain

  (h ◦ F)(αu + (1 − α)v) ≤ α(h ◦ F)(u) + (1 − α)(h ◦ F)(v),   (3.5)

which establishes the convexity of h ◦ F.

We are now ready to define the class of norms induced from optimal interpolation, which will be the main focus of the present paper.

Assumption 3.2 Let m, d, and (p_j)_{1≤j≤m} be strictly positive integers, set N = ∑_{j=1}^{m} p_j, and let v = (v_j)_{1≤j≤m} denote a generic element of R^N where, for every j ∈ {1, . . . , m}, v_j ∈ R^{p_j}. Furthermore:

(i) F : v ↦ (f_j(v_j))_{1≤j≤m} where, for every j ∈ {1, . . . , m}, f_j is a norm on R^{p_j} with dual norm f_{j,∗}.

(ii) ||| · ||| is a norm on R^m which is monotone in the sense that, for every (x, y) ∈ R^m_+ × R^m_+,

  x − y ∈ R^m_−  ⇒  |||x||| ≤ |||y|||,   (3.6)

and ||| · |||_∗ denotes its dual norm.

(iii) For every j ∈ {1, . . . , m}, B_j ∈ R^{d×p_j}, and B = [B_1 · · · B_m] has full rank.

Let, for every w ∈ R^d,

  ‖w‖ = min_{v∈R^N, Bv=w} |||F(v)||| = min_{v∈R^N, Bv=w} |||(f_1(v_1), . . . , f_m(v_m))|||.   (3.7)

Proposition 3.3 Consider the setting of Assumption 3.2. Then ||| · ||| ◦ F is a norm on R^N and its dual norm at t ∈ R^N is

  (||| · ||| ◦ F)_∗(t) = |||(f_{1,∗}(t_1), . . . , f_{m,∗}(t_m))|||_∗.   (3.8)

Proof. Let v and t be in R^N. We first deduce from Assumption 3.2(i) that, for every α ∈ R,

  |||F(αv)||| = |||(f_1(αv_1), . . . , f_m(αv_m))|||
             = |||(|α|f_1(v_1), . . . , |α|f_m(v_m))|||
             = |α| |||(f_1(v_1), . . . , f_m(v_m))|||
             = |α| |||F(v)|||   (3.9)

and that

  |||F(v)||| = 0 ⇔ F(v) = 0
             ⇔ (∀j ∈ {1, . . . , m}) f_j(v_j) = 0
             ⇔ (∀j ∈ {1, . . . , m}) v_j = 0
             ⇔ v = 0.   (3.10)

Let us now check the triangle inequality. By Assumption 3.2(i), F(v + t) − F(v) − F(t) ∈ R^m_−. Hence, we derive from (3.6) that

  |||F(v + t)||| ≤ |||F(v) + F(t)||| ≤ |||F(v)||| + |||F(t)|||.   (3.11)

Finally, to establish (3.8), observe that

  ⟨v | t⟩ = ∑_{j=1}^{m} ⟨v_j | t_j⟩ ≤ ∑_{j=1}^{m} f_j(v_j) f_{j,∗}(t_j) ≤ |||F(v)||| · |||(f_{1,∗}(t_1), . . . , f_{m,∗}(t_m))|||_∗.   (3.12)

Since for every t we can choose v so that both inequalities are tight, the result follows.

Remark 3.4 The optimization problem underlying (3.7) is simply that of minimizing the norm ||| · ||| ◦ F over the finite-dimensional affine subspace C = {v ∈ R^N : Bv = w}. It therefore possesses a solution, namely the minimum norm element in C. Thus, the function ‖ · ‖ in (3.7) is defined via a minimal norm interpolation process.

In the next result we establish that the construction described in Assumption 3.2 does provide a norm, and we compute its dual norm.

Theorem 3.5 Consider the setting of Assumption 3.2. Then the following hold:

(i) ‖ · ‖ is a norm.

(ii) The dual norm of ‖ · ‖ at u ∈ R^d is

  ‖u‖_∗ = |||(f_{1,∗}(B_1^⊤ u), . . . , f_{m,∗}(B_m^⊤ u))|||_∗.   (3.13)

Proof. (i): We first note that, since ran B = R^d, dom ‖ · ‖ = R^d. Next, we derive from (3.7) that, for every w ∈ R^d and every α ∈ R,

  ‖αw‖ = |α| ‖w‖.   (3.14)

On the other hand, it is clear that F satisfies (3.1), that ||| · ||| satisfies (3.2), and that (3.7) assumes the same form as (3.3). Hence, by Proposition 3.1, the function ‖ · ‖ is convex. In view of (3.14), we therefore have ‖w + t‖ ≤ ‖w‖ + ‖t‖ for every (w, t) ∈ R^d × R^d. Finally, suppose that ‖w‖ = 0 and set C = {v ∈ R^N : Bv = w}. Then C is a closed affine subspace and inf_{v∈C} |||F(v − 0)||| = ‖w‖ = 0. Thus 0 is at zero distance from C and therefore 0 ∈ C, that is, there exists v ∈ C such that |||F(v)||| = 0, from which we deduce that w = Bv = B0 = 0.

(ii): Let u ∈ R^d. Then

  ‖u‖_∗ = max{ ⟨w | u⟩ : w ∈ R^d, ‖w‖ ≤ 1 }
        = max{ ⟨w | u⟩ : w ∈ R^d, min_{Bv=w} |||F(v)||| ≤ 1 }
        = max{ ⟨Bv | u⟩ : v ∈ R^N, |||F(v)||| ≤ 1 }
        = max{ ⟨v | B^⊤ u⟩ : v ∈ R^N, |||F(v)||| ≤ 1 }
        = (||| · ||| ◦ F)_∗(B^⊤ u),   (3.15)

where the last step follows from Proposition 3.3. We conclude by applying (3.8).

A special case of Theorem 3.5 is the following corollary, which appears in [20, Theorem 7].

Corollary 3.6 Let p, q, r, and s be numbers in [1, +∞] such that 1/p + 1/q = 1 and 1/r + 1/s = 1. Then the function ‖ · ‖ defined at w ∈ R^d by

  ‖w‖ = inf_{Bv=w} ‖(‖v_1‖_p, . . . , ‖v_m‖_p)‖_r   (3.16)

is a norm and the dual norm at u ∈ R^d is given by

  ‖u‖_∗ = ‖(‖B_1^⊤ u‖_q, . . . , ‖B_m^⊤ u‖_q)‖_s.   (3.17)

Proof. This construction is a special case of Assumption 3.2 with f_1 = · · · = f_m = ‖ · ‖_p and ||| · ||| = ‖ · ‖_r. The function ||| · ||| is an absolute norm, that is, ||| · ||| = ||| · ||| ◦ Λ for every Λ = diag(λ_1, . . . , λ_m) such that (∀j ∈ {1, . . . , m}) λ_j ∈ {−1, 1}. By [17, Theorem 5.4.19], a norm is monotone if and only if it is absolute. The result follows by applying Theorem 3.5.

Remark 3.7 Suppose that, in Proposition 3.1, F is the identity mapping and h an arbitrary norm. Then, arguing as in the proof of Theorem 3.5, we obtain that the function ϕ is a norm and that its dual norm at u ∈ R^d is ϕ_∗(u) = |||B^⊤ u|||.
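The dual norm (3.17) is easy to evaluate directly, since it only requires the adjoints B_j^⊤ and two nested ℓ-norm computations. A small NumPy sketch (assuming the B_j are given as explicit matrices; the function name is ours):

```python
import numpy as np

def dual_norm_317(u, Bs, q, s):
    # Evaluates (3.17): the l_s norm of the vector of l_q norms of B_j^T u.
    # Bs is a list of d x p_j matrices; q and s may be np.inf.
    inner = np.array([np.linalg.norm(B.T @ u, ord=q) for B in Bs])
    return np.linalg.norm(inner, ord=s)
```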

Remark 3.8 Any norm ‖ · ‖ can trivially be written in the form of (3.3) by simply letting ||| · ||| = ‖ · ‖, F be the identity operator, and B be the identity matrix. However, we are interested in exploiting the structure of the construction (3.3) in cases in which the norms ||| · ||| and (f_j)_{1≤j≤m} are chosen from a “simple” class and give rise, via the optimal interpolation problem (3.3), to a “complex” norm ‖ · ‖ which we shall use as a regularizer in a statistical learning setting. In particular, when using proximal splitting methods, the computation of prox_{‖·‖} will typically not be easy, whereas that of prox_{|||·|||} and prox_F will be. This will be exploited in Section 5 to devise efficient splitting algorithms to solve problems involving ‖ · ‖ which use the functions (f_j)_{1≤j≤m} and the operators (B_j)_{1≤j≤m} separately.

4 Examples

In this section we observe that the construction presented in Section 3 encompasses, in a unified framework, a number of existing regularizers. For simplicity, we focus on the norms captured by Corollary 3.6. Our main aim here is not to derive new regularizers but, rather, to show that our analysis captures existing ones and to derive their dual norms.

4.1 Latent Group Lasso

The first example we consider is known as the latent group lasso (LGL), or group lasso with overlap [12]. Let G = {G_1, . . . , G_m}, where the sets G_j ⊂ {1, . . . , d} satisfy ∪_{j=1}^{m} G_j = {1, . . . , d}. Define the set Z_G = {(z_j)_{1≤j≤m} : z_j ∈ R^d, supp(z_j) ⊂ G_j}. The latent group lasso penalty is defined, for w ∈ R^d, as

  ‖w‖_LGL = min{ ∑_{j=1}^{m} ‖z_j‖_p : z ∈ Z_G, ∑_{j=1}^{m} z_j = w }.   (4.1)

The optimal interpolation problem (4.1) seeks a decomposition of the vector w in terms of vectors z_1, . . . , z_m whose support sets are restricted to the corresponding groups of variables G_j. If the groups overlap, then the decomposition is not necessarily unique, and the variational formulation selects those z_j for which the aggregate ℓp norm is minimal. On the other hand, if the groups are disjoint, that is, {G_1, . . . , G_m} forms a partition of {1, . . . , d}, the latent group lasso coincides with the “standard” group lasso [37], which is defined, for w ∈ R^d, as

  ‖w‖_GL = ∑_{j=1}^{m} ‖w|_{G_j}‖_p.   (4.2)

Indeed, in this case there is a unique feasible point for the optimization problem (4.1), which is given by z_j = w|_{G_j}.

Proposition 4.1 For every j ∈ {1, . . . , m}, set p_j = |G_j|. The latent group lasso penalty (4.1) is a norm of the form (3.7) with f_j = ‖ · ‖_p, ||| · ||| = ‖ · ‖_1, and d × p_j matrices B_j = [e_i | i ∈ G_j], where

e_i is the i-th standard unit vector in R^d. Furthermore, the dual norm at w ∈ R^d is given by

  ‖w‖_{LGL,∗} = max_{1≤j≤m} ‖w|_{G_j}‖_q.   (4.3)

Proof. The expression for the primal norm follows immediately by the change of variable z_j = B_j v_j, which also eliminates the support constraint. The dual norm follows from (3.13), noting that ||| · |||_∗ is the max norm, f_{j,∗} = ‖ · ‖_q for all j, and B_j^⊤ w is the restriction of w to the coordinates in G_j, so that ‖B_j^⊤ w‖_q = ‖w|_{G_j}‖_q.

The case p = +∞ is especially interesting and it has been studied in [19]. We note that although in the general case the norm cannot be computed in closed form, in special cases which exhibit additional structure it can be computed by an algorithm. An important example is provided by the (k, p)-support norm [22], in which the groups consist of all subsets of {1, . . . , d} of cardinality no greater than k, for some k ∈ {1, . . . , d}. The case p = 2 has been studied in [2] and [21].

4.2 Overlapping Group Lasso

An alternative generalization of the group lasso is the overlapping group lasso [15, 38], which we write as ‖ · ‖_OGL. It has the same expression as the group lasso in (4.2), except that we drop the restriction that the set of groups {G_1, . . . , G_m} form a partition of the index set {1, . . . , d}. That is, we require only that the union of the sets G_j be equal to {1, . . . , d}. Our next result establishes that the overlapping group lasso function is captured by a dual norm of the form (3.13).

Proposition 4.2 Let p_j = |G_j|, j ∈ {1, . . . , m}. The overlapping group lasso penalty (4.2) is a norm of the form (3.13) for f_{j,∗} = ‖ · ‖_p, ||| · |||_∗ = ‖ · ‖_1, and the d × p_j matrices B_j = [e_i | i ∈ G_j]. Furthermore, the dual norm is given, for w ∈ R^d, by

  ‖w‖_{OGL,∗} = inf{ max_{1≤j≤m} ‖v_j‖_q : ∑_{j=1}^{m} B_j v_j = w }.   (4.4)

Proof. The first assertion follows from equation (3.13) by noting that ‖B_j^⊤ u‖_p = ‖u|_{G_j}‖_p. The second assertion follows from the definition of the primal norm in (3.7), using the fact that ||| · ||| is the ℓ∞-norm and the components of F are ℓq-norms.

The case p = +∞ corresponds to the iCAP penalty in [38]. We may also consider other choices for the matrices B_j. For example, an appropriate choice gives various total variation penalties [26]. A further extension is obtained by choosing m = 1 and B_1 to be the incidence matrix of a graph, a setting which is relevant in particular in online learning over graphs [16]. In particular, for p = 1, this corresponds to the fused lasso penalty [33].


4.3 Θ-Norms

In [21], the authors consider a family of norms which is parameterized by a convex set; see also [4, 23]. Specifically, let Θ ⊂ R^d_{++} be nonempty, convex, and bounded. Then the expressions

  ‖w‖_Θ = inf_{θ∈Θ} √⟨w | diag(θ)^{−1} w⟩   (4.5)

and

  ‖u‖_{Θ,∗} = sup_{θ∈Θ} √⟨u | diag(θ) u⟩   (4.6)

define norms, which we refer to as the primal and dual norms, respectively. This family includes the ℓp-norms and the k-support norm discussed above. We now consider the case in which Θ is a polyhedron.

Proposition 4.3 Let Θ be a nonempty subset of R^d_{++} such that Θ̄ = conv{θ^(1), . . . , θ^(m)}, where (∀j ∈ {1, . . . , m}) θ^(j) ∈ R^d_+. Then the norms defined in (4.5) and (4.6) can be written in the form (3.7) and (3.13), respectively, with ||| · ||| = ‖ · ‖_1, f_j = ‖ · ‖_2 for every j, and B_j = diag(√θ^(j)).

Proof. For every j ∈ {1, . . . , m}, set B_j = diag(√θ^(j)). Letting f_{j,∗} be the ℓ2-norm on R^d and ||| · |||_∗ be the ℓ∞-norm on R^m, the dual norm defined by (3.13) is given by

  ‖u‖_∗ = max_{1≤j≤m} ‖B_j u‖_2
        = max_{1≤j≤m} √⟨u | diag(θ^(j)) u⟩
        = max_{θ∈Θ̄} √⟨u | diag(θ) u⟩   (4.7)
        = sup_{θ∈Θ} √⟨u | diag(θ) u⟩,   (4.8)

where equality (4.7) uses the fact that a linear function on a compact convex set attains its maximum at an extreme point of the set [28, Corollary 32.3.2]. It follows that the dual norms coincide, and consequently the same holds for the primal norms.
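Because the supremum in (4.6) is attained at a vertex of Θ̄, the dual Θ-norm of a polyhedral Θ reduces to a finite scan, as in this brief sketch (assuming the vertices θ^(j) are available explicitly; the function name is ours):

```python
import numpy as np

def theta_dual_norm(u, vertices):
    # (4.6) for polyhedral Theta: <u | diag(theta) u> is linear in theta,
    # so the supremum is attained at one of the vertices theta^(j).
    return max(np.sqrt(u @ (theta * u)) for theta in vertices)
```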

4.4 Polyhedral Norms

Polyhedral norms [11] are characterized by having as their unit ball a polyhedron, that is, an intersection of closed halfspaces. Specifically, let m be an even positive integer and let E = {b_1, . . . , b_m} ⊂ R^d be such that span(E) = R^d and −b_j ∈ E for every j ∈ {1, . . . , m}. Let C = conv(E) and let ‖ · ‖_PH be the polyhedral norm with unit ball C. We have the following characterization of a polyhedral norm [18, Section 1.34]:

  ‖w‖_PH = min{ ∑_{j=1}^{m} |v_j| : v ∈ R^m, ∑_{j=1}^{m} v_j b_j = w }.   (4.9)

Using this expression we have the following result, which follows directly from Remark 3.7.

Proposition 4.4 The polyhedral norm on R^d defined by (4.9) can be written in the form of (3.7) for p_1 = · · · = p_m = 1, B_j = [b_j], ||| · ||| = ‖ · ‖_1, and F the identity mapping.

Recall that the polar of a set C is the set C^⊙ = {u ∈ R^d : sup_{w∈C} ⟨u | w⟩ ≤ 1}. The dual norm ‖ · ‖_{PH,∗} is also polyhedral, and its unit ball is the convex hull of a finite set E^⊖ = {b_1^⊖, . . . , b_r^⊖}, where r is an even positive integer. Indeed, the dual norm is given by

  ‖u‖_{PH,∗} = max{ ⟨u | w⟩ : w ∈ C } = max_{1≤j≤m} ⟨u | b_j⟩.   (4.10)

It follows that ‖w‖_PH can also be written as a maximum of scalar products over the set of extreme points of the unit ball of the dual norm, that is, ‖w‖_PH = max{ ⟨w | b⟩ : b ∈ E^⊖ }, where E^⊖ is the set of extreme points of the dual unit ball C^⊙. Geometrically, the unit balls are related by polarity.
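The characterization (4.9) is a linear program once the absolute values are split into positive and negative parts, so the polyhedral norm can be evaluated with an off-the-shelf LP solver. A sketch with SciPy (the function name and the matrix layout of E are our assumptions):

```python
import numpy as np
from scipy.optimize import linprog

def polyhedral_norm(w, E):
    # Evaluates (4.9): min sum_j |v_j| subject to sum_j v_j b_j = w.
    # E is a d x m matrix whose columns are the b_j; write v = v_plus - v_minus
    # with v_plus, v_minus >= 0, so that |v_j| = v_plus_j + v_minus_j.
    d, m = E.shape
    c = np.ones(2 * m)
    A_eq = np.hstack([E, -E])
    res = linprog(c, A_eq=A_eq, b_eq=w, bounds=[(0, None)] * (2 * m))
    return res.fun

# The dual norm (4.10) is simply max_j <u | b_j>, i.e. (E.T @ u).max().
```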

4.5 Tensor Norms

Recently a number of regularizers have been proposed to learn low rank tensors. We discuss two of them, which are based on the nuclear norm of the matricizations of a tensor. To describe this setting, we introduce some notions from multilinear algebra. Let W ∈ R^{d_1×···×d_m} be an m-way tensor, that is, W = (W_{j_1,...,j_m} | 1 ≤ j_1 ≤ d_1, . . . , 1 ≤ j_m ≤ d_m). A mode-j fiber is a vector composed of elements of W obtained by fixing all indices but the one corresponding to the j-th mode. For every j ∈ {1, . . . , m}, let q_j = ∏_{k≠j} d_k. The mode-j matricization M_j(W) of a tensor W, also denoted by W^(j), is the d_j × q_j matrix obtained by arranging the mode-j fibers of W such that each of them forms a column of M_j(W). Note that M_j : R^{d_1×···×d_m} → R^{d_j×q_j} is a linear operator. By way of example, a 3-mode tensor W ∈ R^{3×4×2} admits the matricizations W^(1) ∈ R^{3×8}, W^(2) ∈ R^{4×6}, and W^(3) ∈ R^{2×12}. Note also that the adjoint of M_j, namely the operator M_j^⊤ : R^{d_j×q_j} → R^{d_1×···×d_m}, is the reverse matricization along mode j.

Recall that the nuclear norm (or trace norm) of a matrix, ‖ · ‖_nuc, is the sum of its singular values. Its dual norm ‖ · ‖_sp is given by the largest singular value. The overlapped nuclear norm [29, 35] is defined as the sum of the nuclear norms of the mode-j matricizations, namely

  ‖W‖_over = ∑_{j=1}^{m} ‖M_j(W)‖_nuc.   (4.11)
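In code, a mode-j matricization is a transpose followed by a reshape, and (4.11) is then a sum of matrix nuclear norms. A short NumPy sketch (the column ordering of the fibers follows NumPy's default layout, which may differ from other conventions but leaves the nuclear norm unchanged):

```python
import numpy as np

def matricize(W, j):
    # Mode-j matricization M_j(W): bring mode j to the front and flatten the rest.
    return np.moveaxis(W, j, 0).reshape(W.shape[j], -1)

def overlapped_nuclear_norm(W):
    # (4.11): sum over modes of the nuclear norms of the matricizations.
    return sum(np.linalg.svd(matricize(W, j), compute_uv=False).sum()
               for j in range(W.ndim))

W = np.random.randn(3, 4, 2)
print([matricize(W, j).shape for j in range(3)])   # [(3, 8), (4, 6), (2, 12)]
```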

The next result, which is easily proved, captures the overlapped nuclear norm in our framework.

Proposition 4.5 The overlapped nuclear norm (4.11) can be written in the form (3.13) for ||| · |||_∗ = ‖ · ‖_1, f_{j,∗} = ‖ · ‖_nuc, and B_j^⊤ = M_j. Furthermore, the dual norm of a tensor U is

  ‖U‖_{over,∗} = inf{ max_{1≤j≤m} ‖V_j‖_sp : ∑_{j=1}^{m} M_j^⊤ V_j = U }.   (4.12)

Let α_1, . . . , α_m be strictly positive reals. The scaled latent nuclear norm is defined by the variational problem

  ‖W‖_lat = inf{ ∑_{j=1}^{m} (1/α_j) ‖V_j‖_nuc : ∑_{j=1}^{m} M_j^⊤ V_j = W }.   (4.13)

The case α_j = 1 is presented in [34] and the case α_j = √d_j in [36].

Proposition 4.6 The latent nuclear norm (4.13) can be written in the form (3.7) for ||| · ||| = ‖ · ‖_1, f_j = ‖ · ‖_nuc/α_j, and B_j = M_j^⊤. Furthermore, the dual norm of a tensor U is

  ‖U‖_{lat,∗} = max_{1≤j≤m} α_j ‖M_j(U)‖_sp.   (4.14)

Proof. Immediate, upon identification of ||| · |||, F, and the matrices B_j, and noting that f_{j,∗} = α_j ‖ · ‖_sp.

5 Random Block-Coordinate Algorithm

The purpose of this section is to address some of the numerical aspects associated with the class of norms introduced in Assumption 3.2. Since such norms are nonsmooth convex functions, they could in principle be handled via their proximity operators in the context of proximal splitting algorithms [6]. However, the proximity operator of the composite norm ‖ · ‖ in (3.7) is usually too complicated for this direct approach to be viable. We circumvent this problem by formulating the problem in such a way that it involves only the proximity operators of the functions (f_j)_{1≤j≤m}, which will typically be available in closed form. The main features and novelty of the algorithmic approach we propose are the following:

• It can handle general nonsmooth formulations: the functions present in the model need not be differentiable.

• It adapts the recent approach proposed in [10] to devise a block-coordinate algorithm which allows us to select arbitrarily the block functions (f_j)_{1≤j≤m} to be activated over the course of the iterations. This makes the method amenable to the processing of very large data sets in a flexible manner by adapting the computational load of each iteration to the available computing resources.

• The computations are broken down into the evaluation of simple proximity operators of the functions f_j and of those appearing in the loss function, while the linear operators are applied separately.

• It guarantees the convergence of the iterates to a solution of the minimization problem under consideration; convergence of the function values is also guaranteed.


We are not aware of alternative frameworks which achieve these properties simultaneously. In particular, it is noteworthy that the convergence of the iterates is achieved although only a portion of the proximity operators are activated at each iteration.

We consider a supervised learning problem in which a vector w ∈ R^d is to be inferred from n input-output data points (a_i, β_i)_{1≤i≤n}, where a_i ∈ R^d and β_i ∈ R. A common formulation for such problems is the regularized convex minimization problem

  minimize_{w∈R^d}  ∑_{i=1}^{n} ℓ_i(⟨w | a_i⟩, β_i) + λ‖w‖,   (5.1)

where ℓ_i : R^2 → R is a loss function (for simplicity we take it to be real-valued, but extended functions can also be considered) such that, for every β ∈ R, ℓ_i(·, β) is convex, λ ∈ ]0, +∞[ is a regularization parameter, and ‖ · ‖ assumes the general form (3.7). For simplicity, we focus in this section and the subsequent experiments on the case when ||| · ||| = ‖ · ‖_1. This choice covers many instances of practical interest in machine learning. We can rewrite (5.1) as

  minimize_{w∈R^d, v_1∈R^{p_1}, . . . , v_m∈R^{p_m}, ∑_{j=1}^{m} B_j v_j = w}  ∑_{i=1}^{n} ℓ_i(⟨w | a_i⟩, β_i) + λ ∑_{j=1}^{m} f_j(v_j).   (5.2)

We can therefore solve the optimization problem

  minimize_{v_1∈R^{p_1}, . . . , v_m∈R^{p_m}}  ∑_{i=1}^{n} ℓ_i(⟨∑_{j=1}^{m} B_j v_j | a_i⟩, β_i) + λ ∑_{j=1}^{m} f_j(v_j),   (5.3)

and derive from one of its solutions (v_j)_{1≤j≤m} a solution w = ∑_{j=1}^{m} B_j v_j to (5.2). To make the structure of this problem more apparent, let us introduce the functions

  Φ : (v_1, . . . , v_m) ↦ λ ∑_{j=1}^{m} f_j(v_j)   (5.4)

and

  Ψ : (η_1, . . . , η_n) ↦ ∑_{i=1}^{n} ψ_i(η_i),   (5.5)

where, for every i ∈ {1, . . . , n},

  ψ_i : η_i ↦ ℓ_i(η_i, β_i).   (5.6)

Let us also designate by A ∈ R^{n×d} the matrix whose rows are (a_i)_{1≤i≤n} and define

  L = [L_1 · · · L_m] ∈ R^{n×N},   (5.7)

where, for every j ∈ {1, . . . , m},

  L_j = A B_j ∈ R^{n×p_j}.   (5.8)

Then L = AB, where B = [B_1 · · · B_m], and we can rewrite problem (5.3) as

  minimize_{v∈R^N}  Φ(v) + Ψ(Lv).   (5.9)

Note that in this concise formulation the functions Φ ∈ Γ_0(R^N) and Ψ ∈ Γ_0(R^n) are nonsmooth. Now let us introduce the functions

  F : (v, s) ↦ Φ(v) + Ψ(s)  and  G = ι_V,   (5.10)

where V = gra L = {(v, s) ∈ R^N × R^n : Lv = s} is the graph of L, and ι_V is the indicator function of V. Using the variable 𝐱 = (v, s), we see that (5.9) reduces to the problem

  minimize_{𝐱∈R^{N+n}}  F(𝐱) + G(𝐱)   (5.11)

involving the sum of two functions in Γ_0(R^{N+n}), which can be solved with the Douglas-Rachford algorithm [6, Section 27.2]. Let 𝐱_0 ∈ R^{N+n}, 𝐲_0 ∈ R^{N+n}, 𝐳_0 ∈ R^{N+n}, let γ ∈ ]0, +∞[, and let (µ_k)_{k∈N} be a sequence in ]0, 2[ such that inf_{k∈N} µ_k > 0 and sup_{k∈N} µ_k < 2. The Douglas-Rachford algorithm

  for k = 0, 1, . . .
    𝐱_{k+1} = prox_{γG} 𝐲_k
    𝐳_{k+1} = prox_{γF}(2𝐱_{k+1} − 𝐲_k)
    𝐲_{k+1} = 𝐲_k + µ_k(𝐳_{k+1} − 𝐱_{k+1})   (5.12)

produces a sequence (𝐱_k)_{k∈N} which converges to a solution to (5.11) [6, Corollary 27.4]. Moreover,

  prox_F : (v, s) ↦ (prox_Φ v, prox_Ψ s),   (5.13)

and

  prox_G : (x, y) ↦ (v, Lv),  where  v = x − L^⊤(Id + LL^⊤)^{−1}(Lx − y),   (5.14)

is the projection operator onto V. Hence, upon setting R = L^⊤(Id + LL^⊤)^{−1}, we can rewrite (5.12) as

  for k = 0, 1, . . .
    v_{k+1} = x_k − R(Lx_k − y_k)
    s_{k+1} = Lv_{k+1}
    z_{k+1} = prox_{γΦ}(2v_{k+1} − x_k)
    t_{k+1} = prox_{γΨ}(2s_{k+1} − y_k)
    x_{k+1} = x_k + µ_k(z_{k+1} − v_{k+1})
    y_{k+1} = y_k + µ_k(t_{k+1} − s_{k+1}),   (5.15)

where we have set 𝐱_k = (v_k, s_k) ∈ R^N × R^n, 𝐲_k = (x_k, y_k) ∈ R^N × R^n, and 𝐳_k = (z_k, t_k) ∈ R^N × R^n. It follows from the above result that (v_k)_{k∈N} converges to a solution to (5.9).

Let us now express (5.15) in terms of the original variables (v_j)_{1≤j≤m} of (5.3). To this end set, for every j ∈ {1, . . . , m},

  R_j = L_j^⊤(Id + LL^⊤)^{−1} = L_j^⊤( Id + ∑_{j=1}^{m} L_j L_j^⊤ )^{−1}.   (5.16)

Moreover, let us denote by x_{j,k} ∈ R^{p_j} the j-th component of x_k and by z_{j,k} ∈ R^{p_j} the j-th component of z_k. Furthermore, we denote by η_{i,k} ∈ R the i-th component of y_k and by σ_{i,k} ∈ R the i-th component of s_k. Thus, (5.15) becomes

  for k = 0, 1, . . .
    for j = 1, . . . , m
      v_{j,k+1} = x_{j,k} − R_j(L_j x_{j,k} − y_k)
      z_{j,k+1} = prox_{γλf_j}(2v_{j,k+1} − x_{j,k})
      x_{j,k+1} = x_{j,k} + µ_k(z_{j,k+1} − v_{j,k+1})
    s_{k+1} = ∑_{j=1}^{m} L_j v_{j,k+1}
    for i = 1, . . . , n
      τ_{i,k+1} = prox_{γψ_i}(2σ_{i,k+1} − η_{i,k})
      η_{i,k+1} = η_{i,k} + µ_k(τ_{i,k+1} − σ_{i,k+1}).   (5.17)
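For concreteness, here is a minimal NumPy sketch of the fully updated iteration (5.15), i.e. with every proximity operator evaluated at each step; prox_Phi and prox_Psi denote the proximity operators of γΦ and γΨ and must be supplied (for instance, assembled from the closed forms recorded after (2.2)). It is a sketch under those assumptions, not the randomized scheme (5.18) below.

```python
import numpy as np

def douglas_rachford_graph(L, prox_Phi, prox_Psi, mu=1.99, n_iter=2000):
    # Douglas-Rachford applied to F(v, s) = Phi(v) + Psi(s) plus the indicator of
    # the graph of L, written in the component form of (5.15).  prox_Phi and
    # prox_Psi are assumed to be the proximity operators of gamma*Phi and gamma*Psi.
    n, N = L.shape
    R = L.T @ np.linalg.inv(np.eye(n) + L @ L.T)   # R = L^T (Id + L L^T)^{-1}
    x, y = np.zeros(N), np.zeros(n)                # components of the DR variable y_k
    v = np.zeros(N)
    for _ in range(n_iter):
        v = x - R @ (L @ x - y)                    # projection onto gra L, first block
        s = L @ v                                  # second block of the projection
        z = prox_Phi(2 * v - x)
        t = prox_Psi(2 * s - y)
        x = x + mu * (z - v)
        y = y + mu * (t - s)
    return v                                       # (v_k) converges to a solution of (5.9)
```

A solution of (5.1) is then recovered as w = ∑_j B_j v_j, where (v_j) are the blocks of the returned vector.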

In large-scale problems, a drawback of this approach is that m + n proximity operators must be evaluated at each iteration, which can lead to impractical implementations in terms of computations and/or memory requirements. The analysis of [10, Corollary 5.5] shows that the proximity operators in (5.17) can be sampled by sweeping through the indices in {1, . . . , m} and {1, . . . , n} randomly while preserving the convergence of the iterates. This results in partial updates of the variables, which leads to significantly lighter iterations and remarkable flexibility in the implementation of the algorithm. Thus, a variable v_{j,k} is updated at iteration k depending on whether a random activation variable ε_{j,k} takes on the value 1 or 0 (each component η_{i,k} of the vector y_k is randomly updated according to the same strategy). The method resulting from this approach is presented in the next theorem.

Theorem 5.1 Let D = {0, 1}^{m+n} \ {0}, let γ ∈ ]0, +∞[, let (µ_k)_{k∈N} be a sequence in ]0, 2[ such that inf_{k∈N} µ_k > 0 and sup_{k∈N} µ_k < 2, let (v_{j,0})_{1≤j≤m} and (x_{j,0})_{1≤j≤m} be in R^N, let y_0 = (η_{i,0})_{1≤i≤n} ∈ R^n, and let (ε_k)_{k∈N} = (ε_{1,k}, . . . , ε_{m+n,k})_{k∈N} be identically distributed D-valued random variables such that, for every i ∈ {1, . . . , m + n}, Prob[ε_{i,0} = 1] > 0. Iterate

  for k = 0, 1, . . .
    for j = 1, . . . , m
      v_{j,k+1} = v_{j,k} + ε_{j,k}( x_{j,k} − R_j(L_j x_{j,k} − y_k) − v_{j,k} )
      x_{j,k+1} = x_{j,k} + ε_{j,k} µ_k( prox_{γλf_j}(2v_{j,k+1} − x_{j,k}) − v_{j,k+1} )
    s_{k+1} = ∑_{j=1}^{m} L_j v_{j,k+1}
    for i = 1, . . . , n
      η_{i,k+1} = η_{i,k} + ε_{m+i,k} µ_k( prox_{γψ_i}(2σ_{i,k+1} − η_{i,k}) − σ_{i,k+1} ).   (5.18)

Suppose that the random vectors (ε_k)_{k∈N} and (x_k, y_k)_{k∈N} are independent. Then, for every j ∈ {1, . . . , m}, (v_{j,k})_{k∈N} converges almost surely to a vector v_j and w = ∑_{j=1}^{m} B_j v_j is a solution to (5.1).

Proof. It follows from [10, Corollary 5.5] that (v_{1,k}, . . . , v_{m,k})_{k∈N} converges almost surely to a solution to (5.3). In turn, w solves (5.1).

Remark 5.2 The matrices (R_j)_{1≤j≤m} of (5.16) are computed off-line only once and they intervene in algorithm (5.18) only via matrix-vector multiplications.

Remark 5.3 It follows from the result of [10] that, under suitable qualification conditions, the conclusions of Theorem 5.1 remain true for a general choice of the functions f_j ∈ Γ_0(R^{p_j}) and ψ_i ∈ Γ_0(R), and also when B does not have full rank. This allows us to solve more general versions of (5.2) in which the regularizer is not a norm but a function of the form (1.2).

6 Numerical Experiments

In this section we present numerical experiments applying the random sweeping stochastic block algorithm outlined in Section 5 to sparse problems in which the regularization penalty is a norm fitting our interpolation framework, as described in Assumption 3.2. The goal of these experiments is to show concrete applications of the class of norms discussed in the paper and to illustrate the behavior of the proposed random block-iterative proximal splitting algorithm. Let us stress that these appear to be the first numerical experiments on this kind of iteration-convergent, block-coordinate method for completely nonsmooth optimization problems.

We focus on binary classification with the hinge loss and a latent group lasso penalty [12]. Each data matrix A ∈ R^{n×d} and weight vector w is generated with i.i.d. Gaussian entries, each row a_i of A is normalized to have unit ℓ2 norm, and w is also normalized in the same manner after applying a sparsity constraint. The n observations are generated as β_i = sign(⟨a_i | w⟩), and a random subset of them have their sign reversed, where the noise level determines the size of the subset. We apply the latent group lasso chain penalty, whereby the groups define contiguous sequences of indices with a specified overlap. The matrix A is chosen to be of size 100 × 1000, the true model vector w ∈ R^{1000} has sparsity 5%, and we apply 25% classification noise. We set the relaxation parameter µ to 1.99, the proximal parameter γ to 0.01, and the regularization parameter λ to 0.1, and convergence is measured as the percentage of change in the iterates. The latent group lasso chains are set to length 10, with an overlap of 3 (a code sketch of this protocol is given below).

For the implementation of the algorithm, we require the proximity operators of the functions f_j and ψ_i in Theorem 5.1. In our case, these are the ℓ2-norm and the hinge loss, which have straightforward proximity operators. In large-scale applications it is impossible to activate all the functions and all the blocks due to computing and memory limitations. Our random sweeping algorithm (5.18) gives us the possibility of activating only some of them by toggling the activation variables ε_{j,k} and ε_{m+i,k}. The vectors are updated only when the corresponding activation variable is equal to 1; otherwise it is equal to 0 and no update takes place. In our experiments we always activate the latter, as they are associated with the set of training points, which is small in sparsity regularization problems. On the other hand, only a fraction α of the former are activated, which is achieved by sampling, at each iteration k, a subset of ⌊mα⌋ of the components j ∈ {1, . . . , m} uniformly without replacement. In light of Theorem 5.1, α can be very small while still guaranteeing convergence of the iterates.

It is natural to ask to what extent these selective updates slow down the algorithm with respect to the hypothetical fully updated version in which sufficient computational power and memory would be available.
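The data-generation protocol just described can be sketched as follows (a non-authoritative reconstruction with NumPy; the exact seeding and normalization details of the original experiments are not specified in the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sparsity, noise = 100, 1000, 0.05, 0.25

# Gaussian design matrix with unit-norm rows.
A = rng.standard_normal((n, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)

# Sparse ground-truth vector, normalized after applying the sparsity constraint.
w = rng.standard_normal(d)
w[rng.permutation(d)[int(sparsity * d):]] = 0.0
w /= np.linalg.norm(w)

# Sign observations with a fraction `noise` of the labels flipped.
beta = np.sign(A @ w)
flip = rng.permutation(n)[:int(noise * n)]
beta[flip] = -beta[flip]

# Overlapping chain groups of length 10 with overlap 3 (stride 7).
groups = [list(range(start, min(start + 10, d))) for start in range(0, d - 3, 7)]
```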

Table 1: Time, iterations, and normalized iterations to convergence for hinge loss classification with the latent group lasso.

  activation rate   time (s)   actual iterations   normalized iterations
  1.0               28         2027                2027
  0.9               29         2292                2062
  0.8               29         2605                2084
  0.7               31         3082                2157
  0.6               34         3520                2112
  0.5               35         4015                2008
  0.4               42         5153                2061
  0.3               44         6723                2017
  0.2               62         9291                1858
  0.1               93         16474               1647

For this purpose, Table 1 presents the time, the number of iterations, and the number of iterations normalized by the activation rate for the hinge loss and latent group lasso penalty, for α ∈ {0.1, 0.2, . . . , 1.0}. Scaling the iterations by the activation rate allows for a fair comparison between regimes, since the computational requirement per iteration of the algorithm is proportional to the number of activated blocks. We observe that while the absolute number of iterations increases as the activation rate decreases, scaling these by α reveals that the computational cost is remarkably stable across the different regimes. It follows that using a small value of α does not negatively affect the computational performance, and this can be beneficial when applied to penalties with a large number of blocks. We stress that in large-scale problems in which the matrices B_j do not fit in a single computer's memory, the standard optimization algorithm (full activation rate) is not suitable, whereas our random sweeping procedure can be easily tailored to the available computing power.

Interestingly, Table 1 indicates that the normalized number of iterations is not affected and in fact slightly improves as the activation rate decreases. To illustrate this further, Table 2 shows the same metrics for the k-support norm of [2], where A is 10 × 15 and k = 3, for α ∈ {0.01, 0.05, 0.1, 0.2, . . . , 1.0}. As outlined therein, the number of groups is given by the binomial coefficient (d choose k), hence even for this modest setting the number of groups is m = 455, which considerably exceeds both d and n. We again observe that performance is consistent over the range of α.

Figure 1 (top) illustrates the objective values for classification with the hinge loss and a latent group lasso chain penalty, and Figure 1 (bottom) the distance to the limiting solution for various activation rates. The iterations were normalized by the activation rate. We note that the paths are similar for all activation rates, and the convergence is similarly fast for small α. This reinforces our findings that choosing an activation rate lower than 100% does not lead to any deterioration in performance, and in some cases we can even make computational gains.

Finally, we note that the random sweeping algorithm is robust to the value of γ, which only affects the speed of convergence. We found that in general choosing γ in {0.01, 0.1, 1} gave good

Table 2: Time, iterations, and normalized iterations to convergence for hinge loss classification with the k-support norm, k = 3, m = 455.

  activation rate   time (s)   actual iterations   normalized iterations
  1.0               29         1195                1195
  0.9               27         1222                1099
  0.8               25         1284                1027
  0.7               23         1417                992
  0.6               24         1563                937
  0.5               21         1810                905
  0.4               19         2139                855
  0.3               18         2763                829
  0.2               21         3986                797
  0.1               21         7168                716
  0.05              16         13874               693
  0.01              35         52701               527

performance, in particular when γ and λ were related by only a few orders of magnitude.

We also consider a standard lasso regression problem [32] with the square loss function. In this case we found that, in spite of its greater generality, our random sweeping algorithm remains competitive with FISTA and the forward-backward (F-B) algorithms, which are both applicable only to the case of a differentiable loss function with Lipschitz continuous gradient. In addition, these experiments indicate that the iterates of our algorithm converge faster as far as the distance to the solution is concerned. For the regression problems we let A be 200 × 500 and set λ = γ = 0.1. The observations b ∈ R^n are generated as b = Aw + r, where the components of r are i.i.d. Gaussian noise with a standard deviation of 0.25.

Figure 2 illustrates the objective values for square loss regression with the ℓ1 penalty, and Table 3 reports running time, iterations, and normalized iterations for the same. For Figure 2 the iterations (x axis) are normalized by the corresponding value of the activation rate, as the amount of work done at each iteration is proportional to the number of blocks which are activated. As can be seen from the plots, our method works well for various ranges of activation rates and remains competitive even for small activation rates. Furthermore, our method remains competitive with FISTA and the forward-backward splitting algorithms despite its greater generality and the fact that, unlike the former, it guarantees the convergence of the iterates. What is most remarkable in the experiments is that the iterates of our algorithm converge faster to the solution of the underlying optimization problem. This highlights the fact that, in problems in which convergence of the iterates is the main focus, looking only at the convergence of the objective values may be misleading.


[Figure 1: Objective for hinge loss classification with the latent group lasso (left), and distance to solution for the same (right). Curves for activation rates R-S 1, R-S 0.7, R-S 0.4, and R-S 0.1, plotted against normalized iterations.]

References

[1] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73:243–272, 2008.

[2] A. Argyriou, R. Foygel, and N. Srebro. Sparse prediction with the k-support norm. Advances in Neural Information Processing Systems 25, pages 1466–1474, 2012.

[3] A. Argyriou, C. A. Micchelli, and M. Pontil. On spectral learning. Journal of Machine Learning Research, 11:935–953, 2010.

[4] F. R. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4:1–106, 2011.

Table 3: Time, iterations, and normalized iterations to convergence for regression with square loss and ℓ1-norm penalty.

  activation rate   time (s)   iterations   iterations (normalized)
  1.0               47         1181         1181
  0.9               55         1315         1184
  0.8               37         1387         1109
  0.7               35         1487         1041
  0.6               44         1630         978
  0.5               39         1870         935
  0.4               30         2214         885
  0.3               36         2844         853
  0.2               35         4041         808
  0.1               34         7582         758
  0.05              38         14434        722
  0.01              69         56382        564
  0.005             90         85174        426

[5] F. R. Bach, G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. Proceedings of the 21st International Conference on Machine Learning, pages 6–15, 2004.

[6] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2011.

[7] S. R. Becker and P. L. Combettes. An algorithm for splitting parallel sums of linearly composed monotone operators, with applications to signal recovery. Journal of Nonlinear and Convex Analysis, 15:137–159, 2014.

[8] W. Cheney and W. Light. A Course in Approximation Theory. American Mathematical Society, Providence, RI, 2000.

[9] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12:805–849, 2012.

[10] P. L. Combettes and J.-C. Pesquet. Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping. SIAM Journal on Optimization, 25:1221–1248, 2015.

[11] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms, Volume II. Springer, 1993.

[12] L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. Proceedings of the 26th International Conference on Machine Learning, pages 433–440, 2009.

[13] M. Jaggi and M. Sulovsky. A simple algorithm for nuclear norm regularized problems. Proceedings of the 27th International Conference on Machine Learning, pages 471–478, 2010.


[Figure 2: Objectives for square loss regression with ℓ1 penalty (left), and distance to final solution for same (right). Curves for F-B, FISTA, and R-S with activation rates 1, 0.7, 0.4, and 0.1, plotted against normalized iterations.]

[14] G. J. O. Jameson. Summing and Nuclear Norms in Banach Space Theory. Cambridge University Press, 1987.

[15] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12:2297–2334, 2011.

[16] M. Herbster and G. Lever. Predicting the labelling of a graph via minimum p-seminorm interpolation. In Proceedings of the 22nd Conference on Learning Theory, 2009.

[17] R. A. Horn and C. R. Johnson. Topics in Matrix Analysis. Cambridge University Press, 2005.

[18] M.-C. Körner. Minisum Hyperspheres. Springer, 2011.


[19] J. Mairal and B. Yu. Supervised feature selection in graphs with path coding penalties and network flows. Journal of Machine Learning Research, 14:2449–2485, 2013.

[20] A. Maurer and M. Pontil. Structured sparsity and generalization. Journal of Machine Learning Research, 13:671–690, 2012.

[21] A. M. McDonald, M. Pontil, and D. Stamos. Spectral k-support norm regularization. Advances in Neural Information Processing Systems, pages 3644–3652, 2014.

[22] A. M. McDonald, M. Pontil, and D. Stamos. Fitting spectral decay with the k-support norm. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 2016.

[23] C. A. Micchelli, J. M. Morales, and M. Pontil. Regularizers for structured sparsity. Advances in Computational Mathematics, 38:455–489, 2013.

[24] C. A. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6:1099–1125, 2005.

[25] C. A. Micchelli and M. Pontil. Feature space perspectives for learning the kernel. Machine Learning, 66:297–319, 2007.

[26] C. A. Micchelli, L. Shen, and Y. Xu. Proximity algorithms for image models: denoising. Inverse Problems, 27:045009, 2011.

[27] G. Obozinski, L. Jacob, and J.-P. Vert. Group lasso with overlaps: the latent group lasso approach. Technical report, HAL-Inria, 2011.

[28] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.

[29] B. Romera-Paredes, H. Aung, N. Bianchi-Berthouze, and M. Pontil. Multilinear multitask learning. Proceedings of the 30th International Conference on Machine Learning, pages 1444–1452, 2013.

[30] W. Rudin. Functional Analysis. McGraw Hill, 1991.

[31] N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. Advances in Neural Information Processing Systems 17, pages 1329–1336, 2005.

[32] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, 58:267–288, 1996.

[33] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B, 67:91–108, 2005.

[34] R. Tomioka and T. Suzuki. Convex tensor decomposition via structured Schatten norm regularization. In Advances in Neural Information Processing Systems 25, pages 1331–1339, 2013.

[35] R. Tomioka, T. Suzuki, K. Hayashi, and H. Kashima. Statistical performance of convex tensor decomposition. In Advances in Neural Information Processing Systems 23, pages 972–980, 2011.

[36] K. Wimalawarne, M. Sugiyama, and R. Tomioka. Multitask learning meets tensor factorization: task imputation via convex optimization. In Advances in Neural Information Processing Systems 26, pages 2825–2833, 2014.

[37] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49–67, 2006.

[38] P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37:3468–3497, 2009.


