Universal Primal-Dual Proximal-Gradient Methods

Alp Yurtsever (ALP.YURTSEVER@EPFL.CH), Quoc Tran-Dinh (QUOC.TRANDINH@EPFL.CH), Volkan Cevher (VOLKAN.CEVHER@EPFL.CH)
Laboratory for Information and Inference Systems (LIONS), EPFL, Switzerland

Technical report no. LIONS-EPFL-2015a of the Laboratory for Information and Inference Systems (LIONS), EPFL, Switzerland. Copyright 2015 by A. Yurtsever, Q. Tran-Dinh and V. Cevher.

Abstract
We propose a primal-dual algorithmic framework for a prototypical constrained convex minimization problem. The new framework aims to trade off the computational difficulty between the primal and the dual subproblems. We achieve this in our setting by replacing the standard proximal mapping computations with linear minimization oracles in the primal space, which has been the hallmark of the scalable Frank-Wolfe-type algorithms. Our analysis extends Nesterov's universal gradient methods to the primal-dual setting in a nontrivial fashion, and provides optimal convergence guarantees for the objective residual as well as the feasibility gap without having to know the smoothness structure of the problem. As a result, we obtain stronger primal-dual convergence results than the existing Frank-Wolfe-type algorithms for important machine learning problems involving sparsity and low-rankness, and we also illustrate them numerically.

1. Introduction
This paper constructs an algorithmic framework to obtain numerical solutions to the following prototypical constrained convex optimization problem:
    f⋆ := min { f(x) : Ax − b ∈ K, x ∈ X },    (1)
where f : R^p → R ∪ {+∞} is a convex function, A ∈ R^{n×p}, b ∈ R^n, and X and K are nonempty, closed, and convex sets in R^p and R^n, respectively. While the formulation may appear too specific, it is flexible enough to capture many interesting learning problems in a unified fashion, from matrix completion to sparse regularization, and from support vector machines to convex relaxations of decision problems based on submodular minimization (Wainwright, 2014; Cevher et al., 2014; Jaggi, 2013).

Surprisingly, the majority of the primal-dual algorithmic schemes that obtain numerical solutions to (1) are proximal-type methods; cf. (Chambolle & Pock, 2011; Juditsky & Nemirovski, 2013) and the references therein. By proximal-type methods, we mean the class of algorithms that iteratively apply the proximal operator prox_f of f to obtain numerical solutions:
    prox_f(x) := arg min_{z ∈ R^p} { f(z) + (1/2)‖z − x‖² }.    (2)
Indeed, there are several algorithmic variants that obtain optimal convergence guarantees when prox_f is tractable, e.g., (Tran-Dinh & Cevher, 2014). However, as the dimensions of learning problems become stupendous, the proximal tractability assumption becomes severely limiting. This fact has increased the popularity of Frank-Wolfe-type algorithms, which leverage linear minimization oracles over the domain. By a linear minimization oracle, we mean
    lmo_X(y) := arg max_{x ∈ X} ⟨y, x⟩,
which is arguably much cheaper to process as compared to (2) (Juditsky & Nemirovski, 2013; Jaggi, 2013). While Frank-Wolfe-type algorithms require O(1/ε) iterations to guarantee an ε-primal objective residual/duality gap, they are limited to differentiable objectives in (1). These methods are also preferred because their primal solution is constructed as a weighted convex combination of the algorithmic iterates, which is meaningful when we seek sparse or low-rank solutions.

In this work, we provide a new primal-dual framework that trades off the computational difficulty of the primal subproblems against the difficulty of optimization in the dual subproblems by replacing the proximal mapping with linear minimization oracles. Unlike Frank-Wolfe-type methods, our approach can also handle nonsmooth objectives and complex constraint structures in (1) more efficiently. Since the dual problem has a special structure, we can always assume that its (sub)gradient mapping is Hölder continuous with a positive Hölder constant and a Hölder smoothness order. As a result, we can tailor the universal (accelerated) gradient algorithms recently developed by Nesterov (Nesterov, 2014) to solve the dual problem.
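To make the cost contrast concrete, the following small NumPy sketch (our illustration only; the paper's own experiments use a Matlab implementation) compares an instance of the proximal operator (2), namely soft-thresholding for f = ρ‖·‖₁, with the linear minimization oracle over the ℓ1-ball X = {x : ‖x‖₁ ≤ κ}, which only needs the largest-magnitude entry.

    import numpy as np

    def prox_l1(x, rho=1.0):
        # Proximal operator of rho*||.||_1: soft-thresholding, an instance of (2).
        return np.sign(x) * np.maximum(np.abs(x) - rho, 0.0)

    def lmo_l1_ball(y, kappa=1.0):
        # Linear minimization oracle over X = {x : ||x||_1 <= kappa}:
        # arg max_{x in X} <y, x> is a signed, scaled coordinate vector.
        i = np.argmax(np.abs(y))
        x = np.zeros_like(y)
        x[i] = kappa * np.sign(y[i])
        return x

    y = np.random.randn(5)
    print(prox_l1(y), lmo_l1_ball(y, kappa=2.0))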
Our approach has several key advantages compared to existing methods. First, we preserve the structure of the original problem (1), as opposed to smoothing it, which enables us to recover the primal solution as a weighted combination of the primal subproblem solutions at each iteration, à la Frank-Wolfe. Second, the algorithms do not require a priori knowledge of the Hölder constant or of the Hölder degree to operate. Third, our analysis leads to optimal convergence rate guarantees in the sense of first-order black-box models (Nemirovskii & Yudin, 1983). Finally, our approach covers both subgradient and accelerated gradient methods in a unified fashion. Specifically,
(a) We propose new universal primal-dual gradient methods to solve (1) that leverage linear minimization oracles, or sharp-operators, to do away with the proximal operators in the primal subproblems.
(b) We introduce a new variant of the accelerated universal dual gradient method that requires fewer proximal operator evaluations in the dual subproblems compared to (Nesterov, 2014). Our algorithm can be viewed as its FISTA-type counterpart (Beck & Teboulle, 2009b).
(c) We extend the convergence analysis of Nesterov's universal gradient methods to the primal-dual setting and rigorously analyze the convergence rate and the worst-case complexity of these algorithms, both on the objective residual |f(x^k) − f⋆| and on the primal feasibility gap dist(Ax^k − b, K).
(d) We illustrate how to analytically eliminate the line-search steps required in the universal gradient methods for important learning objectives. We also show how our framework solves nonsmooth optimization problems, such as the Basis Pursuit formulation, where Frank-Wolfe-type algorithms are not applicable, with Frank-Wolfe-like iterations.
Paper organization: Section 2 briefly recalls the primal-dual formulation of problem (1) with some standard assumptions and characterizes its optimality condition. Section 3 defines the universal gradient mapping and its properties. Section 4 presents the primal-dual universal proximal-gradient algorithms (both unaccelerated and accelerated variants) and analyzes their convergence. Section 5 provides numerical illustrations, followed by our conclusions. All the technical proofs and details can be found in the supplementary document.
Notation and terminology: For notational simplicity, we work on the R^p/R^n spaces with the standard Euclidean norms. We denote by dist(u, X) the Euclidean distance from u to a convex set X. For a convex function f, we use ∇f both for a subgradient and for the gradient of f, and f* for its Fenchel conjugate. For a convex set X, we denote by δ_X its indicator function and by s_X its support function.
2. Primal-dual preliminaries
It will be convenient to reformulate (1) as
    f⋆ = min_{x ∈ X, r ∈ K} { f(x) : Ax − r = b }.    (3)
Let z := [x, r] and Z := X × K. Then D := {z = (x, r) ∈ Z : Ax − r = b} is the primal feasible set.
Dual problem: The Lagrange function associated with the linear constraint Ax − r = b is defined as
    L(x, r, λ) := f(x) + ⟨λ, Ax − r − b⟩.    (4)
Using L, we can define the dual function d of (3) as
    d(λ) := min_{x ∈ X, r ∈ K} { f(x) + ⟨λ, Ax − r − b⟩ },    (5)
where λ is the dual variable. Due to the separability in x and r, we can write d as
    d(λ) = d_x(λ) + d_r(λ),    (6)
where both components d_x and d_r are given explicitly by
    d_x(λ) := min_{x ∈ X} { f(x) + ⟨λ, Ax − b⟩ },
    d_r(λ) := min_{r ∈ K} ⟨λ, −r⟩ = − sup_{r ∈ K} ⟨λ, r⟩.    (7)
Now, we can define the dual problem of (3) as
    d⋆ := max_{λ ∈ R^n} d(λ) = max_{λ ∈ R^n} { d_x(λ) − sup_{r ∈ K} ⟨λ, r⟩ }.    (8)
The dual function is concave but generally nonsmooth.
Algorithmic assumptions: To characterize the primal-dual relation between (1) and (8), we require the following:
Assumption A.1. The function f is proper, closed, and convex. The constraint sets X and K are nonempty, closed, and convex. The solution set X⋆ of (1) is nonempty. Either Z is polyhedral or the Slater condition (9) holds. We say that (1) satisfies the Slater condition if
    ri(Z) ∩ {(x, r) : Ax − r = b} ≠ ∅,    (9)
where ri stands for the relative interior (Rockafellar, 1970).
Strong duality: Under Assumption A.1, the solution set Λ⋆ of the dual problem (8) is also nonempty and bounded. Moreover, strong duality holds, i.e., f⋆ = d⋆. From classical duality theory, we have d(λ) ≤ f(x) for any (x, r) ∈ D and λ ∈ R^n. Hence, in this case, we can define a convex primal-dual gap function H as
    H(w) := f(x) − d(λ) ≥ 0,  ∀(x, r) ∈ D, ∀λ ∈ R^n,    (10)
where w := [x, r, λ]. Clearly, H(w⋆) = 0 (zero duality gap) for any primal-dual solution w⋆ = [x⋆, r⋆, λ⋆] ∈ X⋆ × K⋆ × Λ⋆. In addition, w⋆ is a saddle point of the Lagrange function, i.e., L(x⋆, r⋆, λ) ≤ L(x⋆, r⋆, λ⋆) = f⋆ = d⋆ ≤ L(x, r, λ⋆) for all (x, r) ∈ Z and λ ∈ R^n.
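As a small numerical sanity check of the gap function (10), the following toy NumPy example (constructed by us, not taken from the paper) evaluates the dual function (5) in closed form for f(x) = (1/2)‖x‖², X = R^p and K = {0}, and verifies weak duality d(λ) ≤ f(x) on a feasible point.

    import numpy as np

    np.random.seed(0)
    n, p = 3, 6
    A = np.random.randn(n, p)
    b = np.random.randn(n)

    def f(x):                      # primal objective f(x) = 0.5*||x||^2
        return 0.5 * x @ x

    def d(lam):                    # dual function (5) with X = R^p, K = {0}:
        # min_x 0.5*||x||^2 + <lam, A x - b> is attained at x = -A^T lam
        return -0.5 * np.linalg.norm(A.T @ lam) ** 2 - lam @ b

    x_feas = np.linalg.pinv(A) @ b          # a feasible point: A x = b
    lam = np.random.randn(n)
    assert d(lam) <= f(x_feas) + 1e-12      # weak duality: d(lambda) <= f(x) on D
    print(f(x_feas) - d(lam))               # a nonnegative primal-dual gap H(w)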
Approximate solutions: Our goal is to approximately solve (1) in the following sense:
Definition 2.1. Given an accuracy level ε > 0, a point x_ε ∈ X is said to be an ε-solution of (1) if
    |f(x_ε) − f⋆| ≤ ε  and  dist(Ax_ε − b, K) ≤ ε.    (11)
Here, we call |f(x_ε) − f⋆| the [absolute] primal objective residual and dist(Ax_ε − b, K) the primal feasibility gap. The condition x_ε ∈ X is in general not restrictive since, in many cases, X is a simple set (e.g., a box, a simplex, or a cone) so that the projection onto X can be carried out exactly.
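A direct transcription of Definition 2.1 into code may be useful when benchmarking. Here is a hypothetical helper (names are ours), assuming f⋆ is known or estimated and that a projection onto K is available to evaluate the distance.

    import numpy as np

    def is_eps_solution(x, f, f_star, A, b, proj_K, eps):
        # Checks the two criteria of Definition 2.1:
        # |f(x) - f*| <= eps  and  dist(Ax - b, K) <= eps.
        residual = abs(f(x) - f_star)
        u = A @ x - b
        feasibility_gap = np.linalg.norm(u - proj_K(u))
        return residual <= eps and feasibility_gap <= eps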
3. Universal gradient mappings
This section defines the universal gradient mapping and its properties. As the reader will recognize, our algorithmic framework applies the key ideas of the universal gradient methods in (Nesterov, 2014) to the dual problem (8). However, we also develop a primal scheme to construct the primal sequence {x̄^k}.

3.1. Dual reformulation
As we also introduce a new universal accelerated gradient method, we first modify our notation in (8) for generality:
    G⋆ := min_{λ ∈ R^n} { G(λ) := g(λ) + h(λ) },    (12)
where g and h are redefined as follows:
    g(λ) := max_{x ∈ X} { ⟨b − Ax, λ⟩ − f(x) } ≡ −d_x(λ),
    h(λ) := sup_{r ∈ K} ⟨λ, r⟩ = s_K(λ) ≡ −d_r(λ).    (13)
Problem (12) is a composite convex minimization problem, and G⋆ = −d⋆. Since g and h defined by (13) are generally nonsmooth, the fast proximal-gradient algorithm (FISTA) (Beck & Teboulle, 2009a) is no longer applicable. We note that g defined in (12) can be expressed as g(λ) = ⟨b, λ⟩ + f*(−A^T λ), where f* is the Fenchel conjugate of f. If we define
    x^♯(s) ∈ arg max_{x ∈ X} { ⟨s, x⟩ − f(x) },    (14)
then the solution x⋆(λ) of the subproblem defining g in (13) is given by x⋆(λ) = x^♯(−A^T λ), which can be multivalued. We call x^♯ the sharp-operator of f. If X ≡ R^p, then x^♯(s) = ∇f*(s), the [sub]gradient of f*. An oracle call to G requires the sharp-operator x^♯ and the linear minimization in h.

3.2. Properties of the dual objective terms
Let us discuss the properties of g and h defined by (13).
(a) Sparsity/low-rankness: If K := {r ∈ R^n : ‖r‖ ≤ κ} for a given κ ≥ 0 and a given norm ‖·‖, then h(λ) = κ‖λ‖_*, where ‖·‖_* is the dual norm of ‖·‖. For instance, if K := {r ∈ R^n : ‖r‖₁ ≤ κ}, then h(λ) = κ‖λ‖_∞. While the ℓ1-norm induces sparsity of x, computing h requires only the maximum absolute element of λ. If K := {r ∈ R^{q₁×q₂} : ‖r‖_* ≤ κ} (the nuclear-norm ball), then h(λ) = κ‖λ‖, the spectral norm. The nuclear norm induces low-rankness of x; computing h in this case leads to finding the top singular value of λ, which is efficient.
(b) Cone constraints: If K is a cone, then h becomes the indicator function δ_{K⋆} of its dual cone K⋆. Hence, we can handle inequality constraints and positive semidefinite constraints in (1). For instance, if K ≡ R^n_+, then h(λ) = δ_{R^n_−}(λ), the indicator function of R^n_− := {λ ∈ R^n : λ ≤ 0}. If K = S^p_+, then h(λ) := δ_{S^p_−}(λ), the indicator function of the negative semidefinite matrix cone.
(c) Separable structures: If X and f are separable, i.e., X := ∏_{i=1}^p X_i and f(x) := Σ_{i=1}^p f_i(x_i), then the evaluation of g and its derivatives can be decomposed into p subproblems, which can be evaluated in parallel. When X is absent, evaluating g requires computing the Fenchel conjugate f* of f. In many cases, when f is given explicitly, g can also be computed explicitly, e.g., for the quadratic loss.

3.3. Hölder continuity of the universal gradient
Let ∇g(·) be a subgradient mapping of g defined by (13). Clearly, this subgradient can be computed explicitly as
    ∇g(λ) := b − A x⋆(λ),    (15)
where x⋆(λ) = x^♯(−A^T λ) is the sharp-operator defined in (14). Clearly, if either X is bounded and f is continuous on X, or f is strongly convex, then x⋆(·) exists. Hence, ∇g(·) is well-defined on R^n. Next, we define
    M_ν = M_ν(g) := sup_{λ, λ̃ ∈ R^n, λ ≠ λ̃} ‖∇g(λ) − ∇g(λ̃)‖ / ‖λ − λ̃‖^ν,    (16)
where ν ≥ 0 is the Hölder smoothness order/degree. As indicated in (Nesterov, 2014), the parameter M_ν depends on ν. We are interested in the case ν ∈ [0, 1], and especially in the two extremal cases, where we either have a Lipschitz-continuous gradient of g, corresponding to ν = 1, or a bounded subgradient of g, corresponding to ν = 0. We require the following condition throughout the paper:
Assumption A.2. M̂(g) := inf_{0 ≤ ν ≤ 1} M_ν(g) < +∞.
The question is: "Is Assumption A.2 reasonable?" We consider the following two special cases. First, if X is bounded and g is only subdifferentiable, then ∇g(·) is also bounded. Indeed, by (15) we have
    ‖∇g(λ)‖ ≤ D_X^A := sup{ ‖Ax − b‖ : x ∈ X }.
Hence, we can choose ν = 0 and M̂_ν(g) = 2 D_X^A < ∞. Second, if f is uniformly convex with convexity parameter μ_f > 0 and degree q ≥ 2, i.e., ⟨∇f(x) − ∇f(x̃), x − x̃⟩ ≥ μ_f ‖x − x̃‖^q for any x, x̃ ∈ R^p, then g defined by (13) satisfies (16) with ν := 1/(q − 1) and M̂_ν(g) := ( μ_f^{-1} ‖A‖² )^{1/(q−1)} < +∞. In particular, if q = 2, i.e., f is μ_f-strongly convex, then ν = 1 and M_ν(g) := μ_f^{-1} ‖A‖², which is the Lipschitz constant of the gradient ∇g.
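To illustrate the oracles behind one evaluation of G, here is a hedged NumPy sketch (our illustration; all function names are ours) for the sparsity example in (a): f ≡ 0 on X = {x : ‖x‖₁ ≤ ρ}, so the sharp-operator (14) reduces to a linear minimization oracle, and K = {r : ‖r‖₁ ≤ κ}, so h(λ) = κ‖λ‖_∞ by (13).

    import numpy as np

    def sharp_operator_l1(s, rho):
        # x_sharp(s) in arg max_{||x||_1 <= rho} <s, x>   (f = 0 on X), cf. (14)
        i = np.argmax(np.abs(s))
        x = np.zeros_like(s)
        x[i] = rho * np.sign(s[i])
        return x

    def g_and_grad(lmbd, A, b, rho):
        # g(lambda) = max_{x in X} <b - A x, lambda>  and  grad g(lambda) = b - A x*(lambda), cf. (13), (15)
        x_star = sharp_operator_l1(-A.T @ lmbd, rho)
        grad = b - A @ x_star
        return grad @ lmbd, grad

    def h(lmbd, kappa):
        # h(lambda) = sup_{||r||_1 <= kappa} <lambda, r> = kappa * ||lambda||_inf
        return kappa * np.linalg.norm(lmbd, np.inf)

    np.random.seed(1)
    A, b = np.random.randn(4, 10), np.random.randn(4)
    val, grad = g_and_grad(np.random.randn(4), A, b, rho=1.0)
    print(val, h(grad, kappa=0.5))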
3.4. The proximal-gradient step for the dual problem
Given λ̂^k ∈ R^n and M_k > 0, we define
    Q_{M_k}(λ; λ̂^k) := g(λ̂^k) + ⟨∇g(λ̂^k), λ − λ̂^k⟩ + (M_k/2)‖λ − λ̂^k‖²
as a quadratic surrogate of the function g. Then, we consider the following subproblem:
    λ^{k+1} := arg min_{λ ∈ R^n} { Q_{M_k}(λ; λ̂^k) + h(λ) } ≡ prox_{M_k^{-1} h}( λ̂^k − M_k^{-1} ∇g(λ̂^k) ),    (17)
where prox_φ is the proximal operator of φ defined by (2).
For a given accuracy ε > 0, we define
    M̄ := [ (1 − ν)/(1 + ν) · (1/ε) ]^{(1−ν)/(1+ν)} M_ν^{2/(1+ν)}.    (18)
We need to choose the parameter M_k > 0 such that Q_{M_k} is an upper surrogate of g, i.e., g(λ) ≤ Q_{M_k}(λ; λ^k) for the relevant λ ∈ R^n. If ν and M_ν are known, then we can set M_k := M̄ defined by (18); in this case Q_{M̄} is an upper surrogate of g. In general, we do not know ν and M_ν, and M_k must be determined via a line-search procedure.

4. Universal proximal-gradient methods
We first apply the universal proximal-gradient method to solve the dual problem (12) and propose a primal scheme to construct {x̄^k} for approximating x⋆. Then, we develop an accelerated scheme based on the FISTA algorithm (Beck & Teboulle, 2009a) and construct a corresponding primal sequence {x̄^k} for approximating x⋆.

4.1. Universal primal-dual proximal-gradient scheme
The dual step of our algorithm is simply the universal proximal-gradient method of (Nesterov, 2014), while the new primal step allows us to approximate the solution of (1).
(a) The dual proximal-gradient step: The first main step of our universal primal-dual proximal-gradient algorithm is the dual step for updating λ^k. At each iteration k ≥ 0, we update λ^{k+1} from λ^k as
    λ^{k+1} := prox_{M_k^{-1} h}( λ^k − M_k^{-1} ∇g(λ^k) ),    (19)
where M_k is determined by a line-search procedure.
Line-search: We choose an estimate M_{0,0} for M̄ defined by (18). Then, at each iteration k ≥ 1, we set M_{k,0} := 0.5 M_{k−1} and update M_{k,i} := 2^i M_{k,0} until
    g(λ^{k+1}) ≤ Q_{M_{k,i}}(λ^{k+1}; λ^k) + ε/2.    (20)
Then, we choose M_k := M_{k,i_k}, where i_k := i is the smallest integer such that (20) holds.
We consider two averaging sequences {λ̄^k} and {Ḡ_k} defined as
    λ̄^k := (1/S_k) Σ_{i=0}^{k} (1/M_i) λ^{i+1}  and  Ḡ_k := (1/S_k) Σ_{i=0}^{k} (1/M_i) G(λ^{i+1}),    (21)
where S_k := Σ_{i=0}^{k} 1/M_i. Then, we have the following result, whose proof is in the supplementary document.
Theorem 4.1. Let {λ^k} be the sequence generated by (19)-(20). Then, for any λ ∈ R^n, the sequences {λ̄^k} and {Ḡ_k} defined by (21) satisfy
    G(λ̄^k) − G(λ) ≤ Ḡ_k − G(λ) ≤ M̄ ‖λ^0 − λ‖² / (k + 1) + ε/2,    (22)
where M̄ is defined by (18).
(b) The primal weighted-averaging step: While the convergence in Theorem 4.1 is guaranteed for the dual objective function G, we need to recover an approximate solution for the primal problem (1). For this purpose, we define the following weighted-averaging sequence:
    x̄^k := S_k^{-1} Σ_{i=0}^{k} w_i x⋆(λ^i),  S_k := Σ_{i=0}^{k} w_i,  w_i := M_i^{-1},    (23)
where x⋆(λ^i) is the primal subproblem solution (cf. (14)) at the given λ^i, for i = 0, …, k. We have the following guarantee on the primal sequence {x̄^k}, whose proof can be found in the supplementary document.
Theorem 4.2. Let {x̄^k} be the sequence defined by (23). Then, we have the following estimates:
    −‖λ⋆‖ dist(Ax̄^k − b, K) ≤ f(x̄^k) − f⋆ ≤ 4M̄ ‖λ^0‖‖λ^0 − λ⋆‖/(k + 1) + ‖λ^0‖ √( 2M̄ ε/(k + 1) ) + ε/2,    (24)
    dist(Ax̄^k − b, K) ≤ 4M̄ ‖λ^0 − λ⋆‖/(k + 1) + √( 2M̄ ε/(k + 1) ),    (25)
where M̄ is defined by (18), λ⋆ ∈ Λ⋆ is an arbitrary dual solution, and ε is the desired accuracy level.
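Before the formal statement in Algorithm 1, the following is a minimal NumPy sketch (ours, not the authors' Matlab code) of the dual step (19), the line-search (20), and the primal weighted averaging (23), for the special case K = {0} (so h ≡ 0 and the dual prox step is a plain gradient step) and f ≡ 0 on an ℓ1-ball, i.e., an ℓ1-constrained feasibility instance; all names and the stopping heuristic are our own assumptions.

    import numpy as np

    def lmo_l1(s, rho):
        # sharp-operator for f = 0 on X = {||x||_1 <= rho}, cf. (14)
        i = np.argmax(np.abs(s)); x = np.zeros_like(s); x[i] = rho * np.sign(s[i]); return x

    def uniproxgrad(A, b, rho, eps=1e-3, M0=1.0, iters=200):
        # A hedged sketch of (19)-(23) when K = {0}: h = 0, so prox_{M^{-1}h} is the identity.
        lam = np.zeros(b.size)
        M = M0
        x_bar, S = np.zeros(A.shape[1]), 0.0
        for _ in range(iters):
            x_star = lmo_l1(-A.T @ lam, rho)          # primal subproblem, cf. (26)
            grad = b - A @ x_star                     # grad g(lambda_k), cf. (15)
            g_lam = grad @ lam
            M = 0.5 * M                               # M_{k,0} := 0.5 * M_{k-1}
            while True:                               # line-search on condition (20)
                lam_new = lam - grad / M              # dual step (19) with h = 0
                x_new = lmo_l1(-A.T @ lam_new, rho)
                g_new = (b - A @ x_new) @ lam_new     # exact g(lambda_new)
                Q = g_lam + grad @ (lam_new - lam) + 0.5 * M * np.linalg.norm(lam_new - lam) ** 2
                if g_new <= Q + 0.5 * eps:
                    break
                M *= 2.0
            w = 1.0 / M                               # primal weighted averaging (23)
            S += w
            x_bar = x_bar + (w / S) * (x_star - x_bar)
            lam = lam_new
        return x_bar, lam

    np.random.seed(0)
    A = np.random.randn(10, 30)
    x_true = np.zeros(30); x_true[:3] = [1.0, -1.0, 0.5]
    b = A @ x_true
    x_bar, lam = uniproxgrad(A, b, rho=np.abs(x_true).sum())
    print(np.linalg.norm(A @ x_bar - b))   # the feasibility gap should shrink with the iterations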
(c) The full algorithm: By combining the dual proximal-gradient step (19) and the primal averaging step (23), we obtain the universal primal-dual proximal-gradient algorithm (UniProxGrad) specified in Algorithm 1.

Algorithm 1 (Universal Primal-Dual Proximal-Gradient)
Initialization:
  1. Choose an initial point λ^0 ∈ R^n and ε > 0.
  2. Estimate a value M_{−1} := M such that 0 < M ≤ M̄.
  3. Set S_{−1} := 0 and x̄^{−1} = 0^p.
For k = 0 to k_max do
  1. Compute the primal solution x⋆(λ^k) of
       min_{x ∈ X} { f(x) + ⟨A^T λ^k, x⟩ }.    (26)
  2. Form ∇g(λ^k) := b − A x⋆(λ^k).
  3. Set M_{k,0} := 0.5 M_{k−1}.
  4. Line-search: For i = 0 to i_max, perform:
     4.a. Compute the trial point λ_{k,i} := prox_{M_{k,i}^{-1} h}( λ^k − M_{k,i}^{-1} ∇g(λ^k) ).    (27)
     4.b. If the line-search condition g(λ_{k,i}) ≤ Q_{M_{k,i}}(λ_{k,i}; λ^k) + ε/2 holds, then set i_k := i and terminate the line-search loop.
     4.c. Otherwise, set M_{k,i+1} := 2 M_{k,i}.
     End of line-search.
  5. Set λ^{k+1} := λ_{k,i_k} and M_k := M_{k,i_k}.
  6. Compute w_k := 1/M_k, S_k := S_{k−1} + w_k, and γ_k := w_k/S_k.
  7. Compute x̄^k := (1 − γ_k) x̄^{k−1} + γ_k x⋆(λ^k).
End for
Output: Return the primal approximation x̄^k for x⋆.

Complexity per iteration: First, computing x⋆(λ^k) at Step 1 requires the solution of (26). If X and f have special structure, computing x⋆(λ^k) can be done efficiently (e.g., in closed form). Second, the line-search procedure requires the solution λ_{k,i} of (27) and the evaluation of g(λ_{k,i}). The total computational cost therefore depends on the proximal operator of h and the evaluations of g. As we show below, each iteration requires approximately 2 oracle queries of g on average.
(d) The worst-case analytical complexity: Following the analysis of (Nesterov, 2014), the worst-case complexity of Algorithm 1 is
    k_max := ⌊ 2 ‖λ^0 − λ⋆‖² inf_{0 ≤ ν ≤ 1} ( M_ν/ε )^{2/(1+ν)} ⌋,    (28)
which is not optimal for ν > 0 but is optimal for ν = 0. At each iteration k, the line-search procedure at Step 4 requires additional evaluations of g. The total number N₁(k) of oracle queries (including evaluations of the function G and its gradient) up to iteration k is bounded by
    N₁(k) ≤ 2(k + 1) − log₂(M) + inf_{0 ≤ ν ≤ 1} { (1−ν)/(1+ν) · log₂( (1−ν)/((1+ν)ε) ) + 2/(1+ν) · log₂(M_ν) }.    (29)
This shows that N₁(k) ≈ 2(k + 1), i.e., on average, approximately 2 oracle queries are required at each iteration k.
Finally, we determine the number of iterations k and the accuracy level of the dual problem so that we obtain an ε-solution of the primal problem (1). Since the dual solution set Λ⋆ of (8) is nonempty, convex, and bounded, we define
    D_{Λ⋆}(λ^0) := inf_{λ⋆ ∈ Λ⋆} ‖λ⋆ − λ^0‖,    (30)
the distance from λ^0 to the dual solution set Λ⋆. We also set D_d := D_{Λ⋆}(λ^0) and D_d^0 := D_{Λ⋆}(0^n). Assuming that λ^0 = 0^n, it follows from Theorem 4.2 that
    −‖λ⋆‖ dist(Ax̄^k − b, K) ≤ f(x̄^k) − f⋆ ≤ ε/2  and  dist(Ax̄^k − b, K) ≤ 4M̄ ‖λ⋆‖/(k + 1) + √( 2M̄ ε/(k + 1) ).
Using these bounds, we can show that the number of iterations k_max needed to obtain an ε-solution x̄^k of (1), i.e., |f(x̄^k) − f⋆| ≤ ε and dist(Ax̄^k − b, K) ≤ ε, is
    O( (D_d^0)² inf_{0 ≤ ν ≤ 1} ( M_ν/ε )^{2/(1+ν)} ).
This worst-case complexity is optimal for ν = 0. Details for obtaining these results can be found in the supplementary document.

4.2. Accelerated universal PDPG algorithm
We now develop an accelerated scheme for solving (12).
(a) Accelerated universal dual proximal-gradient step: The main dual step of our algorithm updates λ^{k+1} and λ̃^{k+1} from λ̂^k and λ̃^k at each iteration k ≥ 0 as
    λ̂^k := (1 − τ_k) λ^k + τ_k λ̃^k,
    λ^{k+1} := prox_{M_k^{-1} h}( λ̂^k − M_k^{-1} ∇g(λ̂^k) ),
    λ̃^{k+1} := λ̃^k − (1/τ_k)( λ̂^k − λ^{k+1} ),    (31)
where M_k is obtained by the line-search procedure specified below, and the step size τ_k ∈ [0, 1]. At the initial iteration k := 0, we set λ̃^0 := λ^0 and τ_0 := 1.
Next, we simplify the scheme (31) and derive an update rule for τ_k in the following lemma, whose proof can be found in the supplementary document.
Lemma 4.3. The scheme (31) can be restated as follows:
    λ^{k+1} := prox_{M_k^{-1} h}( λ̂^k − M_k^{-1} ∇g(λ̂^k) ),
    t_{k+1} := (1/2)[ 1 + (1 + 4 t_k²)^{1/2} ],
    λ̂^{k+1} := λ^{k+1} + ((t_k − 1)/t_{k+1})( λ^{k+1} − λ^k ),    (32)
where λ̂^0 = λ^0 and t_0 := 1. The parameter M_k is determined by the following line-search condition:
    g(λ^{k+1}) ≤ g(λ̂^k) + ⟨∇g(λ̂^k), λ^{k+1} − λ̂^k⟩ + (M_k/2)‖λ^{k+1} − λ̂^k‖² + ε/(2 t_k),    (33)
with M_k ≥ M_{k−1} for k ≥ 0.
This dual scheme is of the FISTA form (Beck & Teboulle, 2009a), except for the line-search step. The following theorem shows the convergence rate on G; its proof can be found in the supplementary document.
Theorem 4.4. The sequence {λ^k} updated by (32) guarantees the following estimate:
    G(λ^k) − G⋆ ≤ 4M̄ ‖λ^0 − λ⋆‖² / (k + 1)^{(1+3ν)/(1+ν)} + M̄ ε (k + 1)^{(1−ν)/(1+ν)} / ( M_0 (k + 1) ),    (34)
where M̄ is defined by (18) and ν ∈ [0, 1].
(b) Primal weighted-averaging step: Our primal sequence {x̄^k} is updated as
    x̄^k := Ŝ_k^{-1} Σ_{i=0}^{k} ŵ_i x⋆(λ̂^i),  Ŝ_k := Σ_{i=0}^{k} ŵ_i,  ŵ_i := t_i/M_i.    (35)
We can now guarantee the convergence of {x̄^k} to x⋆ as in the following theorem, whose proof can be found in the supplementary document.
Theorem 4.5. Let {x̄^k} be the sequence defined by (35). Then, we have the following estimates:
    −‖λ⋆‖ dist(Ax̄^k − b, K) ≤ f(x̄^k) − f⋆ ≤ 16M̄ ‖λ^0‖‖λ^0 − λ⋆‖ / (k + 2)^{(1+3ν)/(1+ν)} + 2‖λ^0‖ √( 2M̄ ε / (k + 2)^{(1+3ν)/(1+ν)} ) + ε/2,    (36)
    dist(Ax̄^k − b, K) ≤ 16M̄ ‖λ^0 − λ⋆‖ / (k + 2)^{(1+3ν)/(1+ν)} + √( 8M̄ ε / (k + 2)^{(1+3ν)/(1+ν)} ),    (37)
where M̄ is defined by (18), λ⋆ ∈ Λ⋆ is an arbitrary dual solution, and ε is the desired accuracy level.
(c) The full algorithm: Combining the accelerated dual proximal-gradient scheme (32) and the primal step (35), we obtain the full algorithm (AccUniProxGrad) presented in Algorithm 2.

Algorithm 2 (Accelerated Universal Primal-Dual Proximal-Gradient)
Initialization:
  1. Choose an initial point λ^0 ∈ R^n and ε > 0.
  2. Estimate a value M_{−1} such that 0 < M_{−1} ≤ M̄.
  3. Set Ŝ_{−1} := 0 and x̄^{−1} = 0^p.
  4. Set λ̂^0 := λ^0 and t_0 := 1.
For k = 0 to k_max do
  1. Compute the primal solution x⋆(λ̂^k) of
       min_{x ∈ X} { f(x) + ⟨A^T λ̂^k, x⟩ }.    (38)
  2. Form ∇g(λ̂^k) := b − A x⋆(λ̂^k).
  3. Set M_{k,0} := M_{k−1}.
  4. Line-search: For i = 0 to i_max, perform:
     4.a. Compute the trial point λ_{k,i} := prox_{M_{k,i}^{-1} h}( λ̂^k − M_{k,i}^{-1} ∇g(λ̂^k) ).    (39)
     4.b. If the line-search condition g(λ_{k,i}) ≤ Q_{M_{k,i}}(λ_{k,i}; λ̂^k) + ε/(2 t_k) holds, then set i_k := i and terminate the line-search loop.
     4.c. Otherwise, set M_{k,i+1} := 2 M_{k,i}.
     End of line-search.
  5. Set λ^{k+1} := λ_{k,i_k} and M_k := M_{k,i_k}.
  6. Compute t_{k+1} := 0.5[ 1 + √(1 + 4 t_k²) ].
  7. Compute λ̂^{k+1} := λ^{k+1} + ((t_k − 1)/t_{k+1})( λ^{k+1} − λ^k ).
  8. Compute ŵ_k := t_k/M_k, Ŝ_k := Ŝ_{k−1} + ŵ_k, and γ_k := ŵ_k/Ŝ_k.
  9. Compute x̄^k := (1 − γ_k) x̄^{k−1} + γ_k x⋆(λ̂^k).
End for
Output: Return the primal approximation x̄^k for x⋆.

The complexity per iteration of Algorithm 2 remains essentially the same as in Algorithm 1. However, as we show below, on average this algorithm requires only about 1 oracle query of g per iteration.
(d) The worst-case analytical complexity: The worst-case analytical complexity of Algorithm 2 to achieve |f(x̄^k) − f⋆| ≤ ε and dist(Ax̄^k − b, K) ≤ ε is given by
    k_max := ⌊ [ 4√2 ‖λ⋆‖ ]^{2(1+ν)/(1+3ν)} inf_{0 ≤ ν ≤ 1} ( M_ν/ε )^{2/(1+3ν)} ⌋.    (40)
This worst-case complexity is known to be optimal in the sense of first-order black-box models for ν = 1. The line-search procedure at Step 4 of Algorithm 2 also terminates after a finite number of iterations. Similarly to Algorithm 1, each iteration of Algorithm 2 requires 1 gradient query of g and i_k function evaluations of g. Hence, the number of oracle queries in Algorithm 2 satisfies
    N₂(k) ≤ (k + 1) + (1−ν)/(1+ν) · [ log₂(k + 1) − log₂(ε) ] + 2/(1+ν) · log₂(M_ν) − log₂(M).    (41)
Roughly speaking, Algorithm 2 requires approximately one oracle query per iteration on average. Details for obtaining these results can be found in the supplementary document.
Remark 4.6 (Strong convexity adaptation). If g is strongly convex with a known convexity parameter μ_g > 0, then we can adjust the update rule for t_k in Algorithm 2 in order to obtain a faster convergence rate. This result can be found in the extended version of this work.
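As an implementation aid, here is a minimal sketch (ours, not the authors' Matlab code) of one accelerated dual update of scheme (32) together with the line-search condition (33). The callables g, grad_g and prox_h are assumed to be supplied by the user for the problem at hand; prox_h(point, t) is assumed to compute prox_{t·h}(point).

    import numpy as np

    def acc_dual_step(lam, lam_hat, t, M, eps, g, grad_g, prox_h):
        # One pass of scheme (32): line-search on (33), prox step, then momentum update.
        gh, dgh = g(lam_hat), grad_g(lam_hat)
        while True:
            lam_new = prox_h(lam_hat - dgh / M, 1.0 / M)        # prox_{M^{-1} h} step
            diff = lam_new - lam_hat
            if g(lam_new) <= gh + dgh @ diff + 0.5 * M * diff @ diff + 0.5 * eps / t:
                break
            M *= 2.0                                            # M_k is never decreased in Algorithm 2
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))        # FISTA-type parameter update
        lam_hat_new = lam_new + ((t - 1.0) / t_new) * (lam_new - lam)
        return lam_new, lam_hat_new, t_new, M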
5. Numerical experiments
We demonstrate the algorithms on two selected examples revolving around quantum tomography and matrix completion applications. We implement all algorithms, including Algorithms 1 and 2, in Matlab. Hence, the purpose of the examples is not to argue about the scalability of the proposed approaches, but to demonstrate the rate (and timing) improvements over Frank-Wolfe-like algorithms.
Avoiding line-search with an analytic step-size: We also include a natural variant of each of our algorithms. We denote by [Acc]UniProxGrad the standard variant with line-search, and dub α-[Acc]UniProxGrad the new variant with analytic step-size α. We compute the analytic step-size α_k = 1/M_k based on the specific structure of G. For instance, if G(λ) := (1/2)‖λ‖₂² + ⟨b, λ⟩ + κ‖A^T λ‖_∞, then α_k is given explicitly by
    α_k := ( √(P² + 4 δ_k) − P ) / ( 2 ‖∇G(λ̂^k)‖₂² ),
where P := ‖∇G(λ̂^k)‖₂² + 2κ‖A^T ∇G(λ̂^k)‖_∞ − 2⟨λ̂^k + b, ∇G(λ̂^k)⟩. See the supplementary document for the derivation, including the matrix cases. We compare our algorithms with the Frank-Wolfe (FW) algorithm and its line-search variant whenever it is applicable.

5.1. Quantum tomography with Pauli operators
We consider the following convex program used in quantum tomography (Gross et al., 2010):
    φ⋆ := min_{X ∈ S^p_+} { φ(X) := (1/2)‖A(X) − b‖₂² : tr(X) = 1 },    (42)
where S^p_+ is the set of Hermitian positive semidefinite matrices and tr(X) is the trace of X. Each component of A has a special Kronecker product structure based on the Pauli measurements. Our implementation of the linear operators specifically exploits this structure.
We recast (42) into (1) by setting r = A(X) − b. Then each iteration of our algorithms requires one application of A and one of its adjoint A*. Computing the sharp-operator x^♯ requires a top eigenvector e₁ of A*(λ), while evaluating g corresponds to computing just the top eigenvalue σ₁ of A*(λ), e.g., via a power method. Hence, each line-search step is expected to add only a fractional time cost to our method over the standard Frank-Wolfe approach as the dimensions grow.
We use synthetic data with 10 qubits, corresponding to a low-rank matrix of size 2^10 × 2^10, and take 7·2^10 Pauli measurements. We test 6 algorithms consisting of 4 variants of our algorithms and two variants of the Frank-Wolfe algorithm (with step-size γ_k = 2/(k+2) and with line-search (Jaggi, 2013)). The first two plots in Figure 1 show the convergence of the absolute primal objective residual φ(X̄^k) − φ⋆ with respect to the iteration count and the computational time [in seconds]. The last plot in Figure 1 reports the relative distance to the true parameter X^♮. We observe from these plots that UniProxGrad performs similarly to the line-search FW variant in terms of iterations; however, its accuracy is slightly lower than that of FW. We note that the accuracy level ε of UniProxGrad must be set a priori, and changing this level is also crucial for trading off the computation in UniProxGrad. While it is hard to distinguish the performance of UniProxGrad and FW in this example, AccUniProxGrad significantly outperforms the other methods both in terms of iterations and time.

5.2. Matrix completion
Let Ω ⊂ N²_{pq} := {1, …, p} × {1, …, q} be a subset of indices and let M_Ω be the given observed sub-matrix. The canonical setting of matrix completion applications (Candès & Recht, 2012) is
    φ⋆ := min_{X ∈ R^{p×q}} { φ(X) := ‖X‖_* : P_Ω(X) = M_Ω },    (43)
where P_Ω is the projection operator, i.e., (P_Ω(X))_{ij} = X_{ij} for (i, j) ∈ Ω and (P_Ω(X))_{ij} = 0 otherwise, and ‖·‖_* is the matrix nuclear norm. In the noisy case, we may relax the condition P_Ω(X) = M_Ω to ‖P_Ω(X) − M_Ω‖₁ ≤ τ to obtain the robust version
    min_{X ∈ R^{p×q}} { ‖X‖_* : ‖P_Ω(X) − M_Ω‖₁ ≤ τ },    (44)
for a given noise level τ ≥ 0. Clearly, Frank-Wolfe-type algorithms are not efficient for solving (43) and (44), while our methods handle both in an efficient, unified fashion. When an upper bound on φ⋆ is available, (43) can be relaxed to
    min_{X ∈ R^{p×q}} { (1/2)‖P_Ω(X) − M_Ω‖₂² : ‖X‖_* ≤ φ⋆ },    (45)
which is suitable for Frank-Wolfe-type algorithms.
Figure 2 first compares the algorithms on the formulation (45) with synthetic data, where φ⋆ and X⋆ are computed with an off-the-shelf solver (e.g., cvx). We perform 30 Monte-Carlo simulations and report the averaged results. The new algorithms feature a distinct performance improvement over the Frank-Wolfe-type methods. Figures 3 and 4 illustrate the flexibility of our approach on (43) and (44); note that Frank-Wolfe-type methods are not directly applicable here. While the number of iterations required to solve these formulations increases, they obviate parameter knowledge or can handle outliers.
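Both experiments only require a leading eigenpair or singular pair per oracle call, plus (for (44)) the prox of the support function of an ℓ1-ball, which reduces to one Euclidean projection by Moreau's decomposition prox_{t·s_K}(λ) = λ − t·proj_K(λ/t). The following hedged NumPy sketch (our illustration; the paper's experiments use Matlab with a power method, and the ℓ1-ball projection below is a standard sort-based routine included only for self-containment) shows these three building blocks.

    import numpy as np

    def sharp_spectrahedron(S):
        # arg max { <S, X> : X >= 0, tr(X) = 1 } = v v^H, v a top eigenvector of (S + S^H)/2 -- used for (42)
        w, V = np.linalg.eigh(0.5 * (S + S.conj().T))
        v = V[:, -1]
        return np.outer(v, v.conj())

    def sharp_nuclear_ball(S, rho):
        # arg max { <S, X> : ||X||_* <= rho } = rho * u1 v1^T from the top singular pair -- used for (45)
        U, s, Vt = np.linalg.svd(S, full_matrices=False)
        return rho * np.outer(U[:, 0], Vt[0, :])

    def proj_l1_ball(v, kappa):
        # Euclidean projection onto {r : ||r||_1 <= kappa} (standard sort-based method).
        if np.abs(v).sum() <= kappa:
            return v.copy()
        u = np.sort(np.abs(v))[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u * np.arange(1, v.size + 1) > (css - kappa))[0][-1]
        theta = (css[rho] - kappa) / (rho + 1.0)
        return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

    def prox_support_l1_ball(lam, t, kappa):
        # prox_{t*h}(lam) for h(lam) = sup_{||r||_1 <= kappa} <lam, r>, via Moreau decomposition -- used for (44)
        return lam - t * proj_l1_ball(lam / t, kappa)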
Figure 1. The convergence of {φ(X̄^k) − φ⋆} in the 6 algorithms with respect to iterations and time, and the solution quality (relative error ‖X̄^k − X^♮‖/‖X^♮‖).
Figure 2. The performance of the 6 algorithms with respect to iterations and time, and the solution quality, for (45).
Figure 3. The performance of our algorithms with respect to iterations and time, and the solution quality, for (43).
Figure 4. The performance of our algorithms with respect to iterations and time, and the solution quality, for (44).

6. Conclusion
We have proposed a new universal primal-dual proximal-gradient framework to obtain computational trade-offs in constrained optimization. Our technical contribution extends Nesterov's universal gradient scheme (Nesterov, 2014) to the constrained setting in a nontrivial fashion, which requires the introduction of a new primal scheme and a new convergence analysis for the primal sequences. Our accelerated universal gradient method also adopts the FISTA idea (Beck & Teboulle, 2009a) and enhances it with a new line-search procedure. The hallmarks of our approach include its optimal worst-case complexity and its flexibility in handling nonsmooth objectives and complex constraints, compared to existing primal-dual algorithms as well as Frank-Wolfe-type algorithms, while essentially preserving their low per-iteration cost.
References
Bauschke, H. H. and Combettes, P. L. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer-Verlag, 2011.
Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1):183–202, 2009a.
Beck, A. and Teboulle, M. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Trans. Image Process., 18(11):2419–2434, 2009b.
Candès, E. and Recht, B. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119, 2012.
Cevher, V., Becker, S., and Schmidt, M. Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics. IEEE Signal Processing Magazine, 31(5):32–43, 2014.
Chambolle, A. and Pock, T. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.
Gross, D., Liu, Y.-K., Flammia, S. T., Becker, S., and Eisert, J. Quantum state tomography via compressed sensing. Physical Review Letters, 105(15):150401, 2010.
Jaggi, M. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. JMLR W&CP, 28(1):427–435, 2013.
Juditsky, A. and Nemirovski, A. Solving variational inequalities with monotone operators on domains given by linear minimization oracles. Tech. report, http://arxiv.org/abs/1312.1073, 2013.
Nemirovskii, A. and Yudin, D. Problem Complexity and Method Efficiency in Optimization. Wiley Interscience, 1983.
Nesterov, Y. Dual extrapolation and its applications to solving variational inequalities and related problems. Math. Program., 109(2–3):319–344, 2007.
Nesterov, Y. Primal-dual subgradient methods for convex problems. Math. Program., 120(1, Ser. B):221–259, 2009.
Nesterov, Y. Universal gradient methods for convex optimization problems. Math. Program., xx:1–24, 2014.
Rockafellar, R. T. Convex Analysis, volume 28 of Princeton Mathematics Series. Princeton University Press, 1970.
Tran-Dinh, Q. and Cevher, V. Constrained convex minimization via model-based excessive gap. In Proc. Neural Information Processing Systems Foundation conference (NIPS 2014), pp. 1–9, Montreal, Canada, December 2014.
Wainwright, M. J. Structured regularizers for high-dimensional problems: Statistical and computational issues. Annual Review of Statistics and its Applications, 1:233–253, 2014.
Supplementary document: Universal Primal-Dual Proximal-Gradient Methods

A. The key estimate of the proximal-gradient step
If the function g defined by (13) satisfies (16) with M_ν < +∞ as in Assumption A.2, then the following lemma provides the key properties for constructing universal gradient algorithms.
Lemma A.1 ((Nesterov, 2014)). Under Assumption A.2, the following statements hold. For any λ, λ̃ ∈ R^n, we have
    g(λ̃) ≤ g(λ) + ⟨∇g(λ), λ̃ − λ⟩ + ( M_ν/(1 + ν) ) ‖λ̃ − λ‖^{1+ν}.    (46)
Moreover, for any λ, λ̃ ∈ R^n and δ > 0, if we choose
    M ≥ [ (1 − ν)/(1 + ν) · (1/δ) ]^{(1−ν)/(1+ν)} M_ν^{2/(1+ν)},    (47)
then
    g(λ̃) ≤ g(λ) + ⟨∇g(λ), λ̃ − λ⟩ + (M/2)‖λ̃ − λ‖² + δ/2.    (48)
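A quick numerical illustration of Lemma A.1 (our own toy check, not part of the paper): for the nonsmooth g(λ) = c‖λ‖₁, subgradient differences are bounded by 2c√d, so ν = 0 and M₀ = 2c√d is a valid Hölder constant in (16); choosing M by (47) then makes the inexact quadratic bound (48) hold at arbitrary points.

    import numpy as np

    np.random.seed(0)
    dim, c, delta = 5, 1.0, 0.1           # toy nonsmooth g(lambda) = c*||lambda||_1, target inexactness delta
    nu = 0.0                              # bounded-subgradient case
    M0 = 2.0 * c * np.sqrt(dim)           # valid Hoelder constant: ||grad g(a) - grad g(b)|| <= 2*c*sqrt(dim)

    def g(lam):      return c * np.linalg.norm(lam, 1)
    def grad_g(lam): return c * np.sign(lam)          # a subgradient of g

    # Choice (47), which here reduces to M = M0^2 / delta.
    M = ((1 - nu) / ((1 + nu) * delta)) ** ((1 - nu) / (1 + nu)) * M0 ** (2 / (1 + nu))

    for _ in range(1000):                 # the inexact quadratic bound (48) holds at random points
        a, z = np.random.randn(dim), np.random.randn(dim)
        lhs = g(z)
        rhs = g(a) + grad_g(a) @ (z - a) + 0.5 * M * np.linalg.norm(z - a) ** 2 + 0.5 * delta
        assert lhs <= rhs + 1e-12

    print("inequality (48) verified with M =", M)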
Clearly, (48) provides an approximate quadratic upper bound for g. However, it depends on the choice of δ and on the smoothness parameter ν. If ν = 1, then M can be set to the Lipschitz constant M₁ and is independent of δ.
The algorithms developed in this paper are based on the proximal-gradient step (17) applied to the dual objective function G:
    λ^{k+1} := prox_{M_k^{-1} h}( λ̂^k − M_k^{-1} ∇g(λ̂^k) ) = arg min_{λ ∈ R^n} { Q_{M_k}(λ; λ̂^k) + h(λ) },
where Q_{M_k} is the quadratic surrogate of g defined by
    Q_{M_k}(λ; λ̂^k) := g(λ̂^k) + ⟨∇g(λ̂^k), λ − λ̂^k⟩ + (M_k/2)‖λ − λ̂^k‖².    (49)
This step guarantees the following estimate.
Lemma A.2. Let Q_{M_k} be the quadratic model of g defined by (49). If λ^{k+1} defined by (17) satisfies the inequality
    g(λ^{k+1}) ≤ Q_{M_k}(λ^{k+1}; λ̂^k) + δ_k/2    (50)
for some δ_k > 0, then
    G(λ^{k+1}) ≤ g(λ̂^k) + ⟨∇g(λ̂^k), λ − λ̂^k⟩ + h(λ) + δ_k/2 + (M_k/2)[ ‖λ − λ̂^k‖² − ‖λ − λ^{k+1}‖² ],  ∀λ ∈ R^n.    (51)
Clearly, the condition (50) holds whenever M_k ≥ M̄ and δ_k = ε is fixed, due to Lemma A.1, where M̄ is defined by (18). If ν and M_ν are known, then we can set M_k := M̄ defined by (18) for all k ≥ 0; in this case the condition (50) is automatically satisfied. In general, we do not know ν and M_ν, and then M_k must be determined via a line-search procedure on the condition (50). Due to the boundedness of M̄ in (18), the line-search procedure terminates after a finite number of function evaluations (see Lemma B.1).
Proof of Lemma A.2. We note that the optimality condition of (17) is
    0 ∈ ∇g(λ̂^k) + M_k( λ^{k+1} − λ̂^k ) + ∂h(λ^{k+1}),
which can be written as λ̂^k − λ^{k+1} ∈ M_k^{-1}( ∇g(λ̂^k) + ∂h(λ^{k+1}) ). Let ∇h(λ^{k+1}) ∈ ∂h(λ^{k+1}) be a subgradient of h at λ^{k+1}. Then, we have
    λ̂^k − λ^{k+1} = (1/M_k)[ ∇g(λ̂^k) + ∇h(λ^{k+1}) ].    (52)
Now, using (52), we can derive: 1 1 ˆ k }2 }λ ´ λk`1 }2 ´ }λ ´ λ 2 2 1 ˆ k ´ λk`1 , λ ´ λk`1 y ´ 1 }λ ˆ k ´ λk`1 }2 (52) ˆ k q ` ∇hpλk`1 q, λ ´ λk`1 y ´ 1 }λ ˆ k ´ λk`1 }2 “ xλ x∇gpλ “ 2 Mk 2 „ 1 Mk 2 ˆ ˆ ˆ “´ x∇gpλk q, λk`1 ´ λk y ` }λk`1 ´ λk } Mk 2 1 1 ˆ k q, λ ´ λ ˆ k y. ` x∇hpλk`1 q, λ ´ λk`1 y ` x∇gpλ (53) Mk Mk
∆rk pλq :“
By using the condition (50), we get: ˆ k q ` x∇gpλ ˆ k q, λk`1 ´ λ ˆ k y ` Mk }λk`1 ´ λ ˆ k }2 ` δk , gpλk`1 q ď gpλ 2 2 which implies: „ 1 Mk 2 ˆ ˆ ˆ ˆ k q ´ gpλk`1 q ` δk . }λk`1 ´ λk } ď gpλ ´ x∇gpλk q, λk`1 ´ λk y ` Mk 2 2
(54)
By the convexity of h, we have: x∇hpλk`1 q, λ ´ λk`1 y ď hpλq ´ hpλk`1 q.
(55)
Substituting (54) and (55) into (53), we obtain: „ 1 δk 1 1 ˆ ˆ k q, λ ´ λ ˆky rhpλq ´ hpλk`1 qs ` ∆rk pλq ď gpλk q ´ gpλk`1 q ` ` x∇gpλ Mk 2 Mk Mk „ δk 1 1 ˆ ˆ ˆ gpλk q ` x∇gpλk q, λ ´ λk y ` hpλq ` ´ rgpλk`1 q ` hpλk`1 qs . “ Mk 2 Mk ˆ k }2 and Gp¨q “ gp¨q ` hp¨q to obtain (51). It remains to use ∆rk pλq :“ 12 }λ ´ λk`1 }2 ´ 21 }λ ´ λ
B. The finite termination of the line-search procedure The proximal-gradient step (17) requires to satisfy the line-search condition (50). The following lemma guarantees that the line-search procedure in Algorithms 1 and 2 terminates after a finite number of iterations. Lemma B.1. The line-search procedure in Algorithms 1 and 2 terminates after a finite number ik of iterations. Proof. Under Assumption A.2, Mν defined in Lemma A.1 is finite. When δk “ ą 0 is fixed as in Algorithm 1, the upper ” ı 1´ν 2 Ď . Ď “ 1´ν 1`ν Mν1`ν defined by (18) is also finite. Moreover, the condition (50) holds whenever Mk,i ě M bound M p1`νq i i Ď {M qu ` 1 Since Mk,i :“ 2Mk,i´1 “ 2 Mk,0 ě 2 M , the linesearch procedure is terminated after at most ik :“ tlog2 pM iterations. Now, weashow that the line-search procedure in Algorithm 2 is also finite. By the updating rule of tk , we have tk`1 :“ 0.5p1 ` 1 ` 4t2k q ď 0.5p1 ` p1 ` 2tk qq “ tk ` 1. By induction and t0 “ 1, we have tk ď k ` 1. Since, by the definition Ď , δk “ and tk ď k ` 1, we can show that: of M tk 1´ν „ 1´ν „ 1´ν „ 2 2 2 tk 1`ν k ` 1 1`ν 1 ´ ν 1 1`ν 1`ν 1`ν Ď M “ Mν ď Mν ď Mν1`ν . 1 ` ν δk
(56)
Ď . However, since Mk,i “ 2i Mk,0 ě 2i M , by using (56), Next, we note that the condition (50) holds whenever Mk,i ě M it is sufficient to show that: „ 1´ν 2 k ` 1 1`ν i 2Mě Mν1`ν .
ˆ
“ k`1 ‰ 1´ν 1`ν
2 1`ν
˙
This condition leads to i ě log2 ´ log2 pM q. Hence, at the iteration k, we require at most ik :“ Mν Z ˆ 2 ˙^ ` ˘ 1`ν log2 k`1 ` log2 MνM ` 1 line-search iterations, which is finite.
C. Convergence analysis of the universal primal-dual proximal-gradient algorithm We analyze the convergence of the the universal primal-dual proximal-gradient algorithm (Algorithm 1) by proving Theorem 4.1 and Theorem 4.2. C.1. The proof of Theorem 4.1: Convergence rate of the dual objective function G Ď defined by (18), since the line-search is successful as shown in Lemma B.1, the condition (50) of Lemma A.2 is For M Ď . Then, from (51), we have: satisfied for i “ 0 to i “ k with Mi ď 2M 1 1 1 1 rgpλi q ` x∇gpλi q, λ´λi y ` hpλqs ` Gpλi`1 q ď ` }λ´λi }2 ´ }λ´λi`1 }2 Mi Mi 2Mi 2 2 1 1 1 ď Gpλq ` ` }λ´λi }2 ´ }λ´λi`1 }2 , Mi 2Mi 2 2
(57)
where the last inequality holds due to the convexity of g: gpλi q ` x∇gpλi q, λ´λi y ď gpλq, and G :“ g ` h. Summing up this inequality from i “ 0 to i “ k, we have: « ff k k ÿ ÿ 1 1 ” ı 1 1 Gpλi`1 q ď Gpλq ` ` }λ ´ λ0 }2 ´ }λ ´ λk`1 }2 . M M 2 2 2 i i i“0 i“0 ¯ k :“ Since G
1 Sk
řk
1 i“0 Mi Gpλi`1 q
and Sk :“
řk
1 i“0 Mi ,
the last inequality implies:
¯ k ď Gpλq ` 1 }λ ´ λ0 }2 ` , @λ P Rn . G 2Sk 2
(58)
¯ k we have Gpλ ¯kq ď G Ď for i “ 0 to k, we have ¯ k . Finally, since Mi ď 2M By convexity of G and the definition (21) of λ pk`1q ˝ Sk ě 2M Ď . Substituting this estimate into (58), we obtain (22).
C.2. The proof of Theorem 4.2: Convergence rate of the primal objective residual and the primal feasibility gap We first note that gpλi q “ ´f px‹ pλi qq ` xλi , b ´ Ax‹ pλi qy and ∇gpλi q “ b ´ Ax‹ pλi q. Hence, we can derive: gpλi q ` x∇gpλi q, λ ´ λi y “ ´f px‹ pλi qq ´ xAx‹ pλi q ´ b, λy.
(59)
By summing up (57) from i “ 0 to i “ k, for any λ P Rn , and then using (59), we obtain: ¯ k :“ Sk G (59)
“
k k ÿ ÿ ‰ Sk 1 1 1 “ 1 Gpλi`1 q ď gpλi q ` x∇gpλi q, λ ´ λi y ` hpλq ` ` }λ0 ´ λ}2 ´ }λk`1 ´ λ}2 Mi Mi 2 2 2 i“0 i“0 k ÿ ‰ Sk 1 1 “ ´ f px‹ pλi qq ´ xAx‹ pλi q ´ b, λy ` hpλq ` ` }λ0 ´ λ}2 . M 2 2 i i“0
This inequality leads to: k ‰ 1 ÿ 1 “ 1 ¯ k ` hpλq. f px‹ pλi qq ` xAx‹ pλi q ´ b, λy ď ` }λ0 ´ λ}2 ´ G Sk i“1 Mi 2 2Sk
(60)
řk ¯ k , we have f p¯ Now, by convexity of f and the definition (23) of x xk q ď S1k i“0 M1i f px‹ pλi qq and řk 1 1 ‹ xk ´ b, λy. Using these expressions and (60), we can show that: i“0 Mi xAx pλi q ´ b, λy “ xA¯ Sk f p¯ xk q ` xA¯ xk ´ b, λy ´ hpλq ď
¯ k ` 1 }λ ´ λ0 }2 . ´G 2 2Sk
(61)
Since Gpλi q “ ´dpλi q by the definition (12), we have Gpλi q “ ´dpλi q ě ´d‹ “ ´f ‹ for any i “ 0, ¨ ¨ ¨ , k. Hence, 1 ‹ ‹ ¯ k :“ 1 řk G i“0 Mi Gpλi`1 q ě ´f “ ´d . We can estimate: Sk ¯ k ď d‹ “ f ‹ ď Lpx, r, λ‹ q “ f pxq ` xλ‹ , Ax ´ b ´ ry, ´G ¯ k and (61), we obtain: for any x P X and r P K. Combining this inequality with x “ x xA¯ xk ´ b, λy ´ hpλq ď
1 ` xλ‹ , A¯ xk ´ b ´ ry ` }λ ´ λ0 }2 . 2 2Sk
Since xλ‹ , ry ´ hpλq “ xλ‹ , ry ´ suprPK xλ, ry ě minrPK xλ‹ ´ λ, ry for all r P K, the last inequality leads to: min txA¯ xk ´ b ´ r, λ ´ λ‹ yu ´ rPK
1 }λ ´ λ0 }2 ď , @λ P Rn . 2Sk 2
(62)
¯ k prq :“ A¯ Let u xk ´b´r. Since (62) holds for any λ P Rn , we should have: " ** " 1 min maxn x¯ }λ ´ λ0 }2 ď . uk prq, λ ´ λ‹ y ´ rPK λPR 2Sk 2 Solving the concave maximization problem on the left-hand side directly, we obtain: Sk }¯ uk prq}2 ` 2x¯ uk prq, λ0 ´ λ‹ y ď . ‹
‹
(63) 2
‹
Since x¯ uk prq, λ0 ´ λ y ď }¯ uk prq}}λ0 ´ λ }, if we denote”t :“ }¯ uk prq}, then (63) leads to Sk tı ´ 2}λ0 ´ λ }t ´ ď 0. a 1 Therefore, we have }A¯ xk ´ b ´ r} “ }¯ uk prq} “ t ď Sk }λ0 ´ λ‹ } ` }λ0 ´ λ‹ }2 ` Sk for some r P K. Hence, we can show that: „ b 1 min }A¯ xk ´ b ´ r} “ dist pA¯ xk ´ b, Kq ď }¯ uk prq} ď }λ0 ´ λ‹ } ` }λ0 ´ λ‹ }2 ` Sk . rPK Sk a ? ‹ We note that Sk ě 2k`1 }λ0 ´ λ‹ }2 ` Sk ď 2}λ0 ´ λ‹ } ` Sk , the last estimate leads to (25). Ď and }λ0 ´ λ } ` M
¯ k ď d‹ “ f ‹ , we obtain from (61) that: To prove the left-hand side of (24), we follow the proof of (61). Note that ´G f p¯ xk q ´ f ‹ ` xA¯ xk ´ b, λy ´ hpλq ´
1 }λ ´ λ0 }2 ď , @λ P Rn . 2Sk 2
(64)
Since this inequality holds for any λ P Rn , by using the fact that: ! ) S 1 k }A¯ xk ´ b ´ r}2 ` xλ0 , A¯ xk ´ b ´ ry maxn xA¯ xk ´ b ´ r, λy ´ }λ ´ λ0 }2 “ λPR 2Sk 2 Sk ě }A¯ xk ´ b ´ r}2 ´ }λ0 }}A¯ xk ´ b ´ r}, @r P K. 2 Using this estimate into (64), we finally get: f p¯ xk q ´ f ‹ `
Sk }A¯ xk ´ b ´ r}2 ´ }λ0 }}A¯ xk ´ b ´ r} ď , @r P K, 2 2
which leads to: f p¯ xk q ´ f ‹ ď }λ0 }}A¯ xk ´ b ´ r} `
Sk ´ }A¯ xk ´ b ´ r}2 ď }λ0 }}A¯ xk ´ b ´ r} ` , @r P K. 2 2 2
Substituting r “ argminrPK }A¯ xk ´b´r} into the last estimate, then combining the result and (25), we get the right-hand side of (24). To prove the left-hand side of (24), we note that: f ‹ “ d‹ ď Lpx, r, λ‹ q “ f pxq ` xλ‹ , Ax ´ b ´ ry ď f pxq ` }λ‹ }}Ax ´ b ´ r}, for any x P X and r P K. Since this inequality holds for any r P K, we have: ´}λ‹ } min }Ax ´ b ´ r} “ ´}λ‹ }dist pAx ´ b, Kq ď f pxq ´ f ‹ , rPK
which is in fact the left-hand side of (24).
˝
C.3. The worst-case complexity analysis of Algorithm 1 We now analyze the worst-case complexity of Algorithm 1. Indeed, by the same argument as in (Nesterov, 2014), we can Ďε }λ0 ´λ‹ }2 Ď 2M ‹ 2 ¯ k q ´ G‹ ď ε if M . Using the definition (18) of show that Gpλ k`1 }λ0 ´ λ } ď 2 due to (22), i.e., k ` 1 ě Ďε , we have: M „ 2 „ 1´ν Mν 1`ν 1 ´ ν 1`ν k`1ě2 }λ0 ´ λ‹ }2 . 1`ν ı 1´ν ” Y 2 ] ` M ˘ 1`ν 1`ν ‹ 2 ν Since 1´ν ď 1 for ν P r0, 1s, by Assumption A.2, this inequality leads to k “ 2}λ ´ λ } inf 0 0ďνď1 1`ν as shown in (28). Next, we estimate the total number of oracle quires in Algorithm 1. At each iteration, Algorithm 1 requires one gradient evaluation of g and ik function evaluation of g. Hence, the total number of oracle quires up to the iteration k is N1 pkq :“ řk j“0 pij ` 1q. However, since ij ´ 1 “ log2 pMj {Mj´1 q, we have: N1 pkq :“
k ÿ
pij ` 1q “ 2pk ` 1q ` log2 pMk q ´ log2 pM´1 q “ 2pk ` 1q ` log2 pMk q ´ log2 pM q.
j“0
Ď to obtain (29). It remains to use Mk ď M Finally, we assume that λ0 :“ 0n , then the estimates in Theorem 4.2 leads to: d Ď Ď ‹ 2M 4M ´}λ }dist pA¯ xk ´ b, Kq ď f p¯ x q´f ď and dist pA¯ xk ´ b, Kq ď }λ } ` . (65) 2 k`1 k`1 „ b Ď Ď M M In order to guarantee dist pA¯ xk ´b, Kq ď and |f p¯ xk q ´ f ‹ | ď , we must require }λ‹ } 4k`1 }λ‹ } ` 2k`1 ď , k
‹
‹
which leads to: Ď }λ‹ }2 4M ? k`1ě p3 ´ 5q
[ ñ
kmax :“
4}λ‹ }2 ? inf p3 ´ 5q 0ďνď1
ˆ
Mν
2 _ ˙ 1`ν
.
Hence, the worst-case complexity´to obtain an -solution of (1) in the sense of Definition 2.1 such that |f p¯ xk q ´ f ‹ | ď 2 ¯ ` k ˘ ` M ˘ 1`ν ‹ 2 and dist A¯ x ´ b, K ď is O }λ } inf 0ďνď1 ν , which is optimal if ν “ 0.
D. Convergence analysis of the accelerated universal primal-dual proximal-gradient algorithm We now analyze the convergence of the accelerated universal primal-dual proximal-gradient algorithm, Algorithm 2 both in terms of the dual objective function G and the primal objective residual and the primal feasibility gap. D.1. Key estimates The following lemma provides the key estimate to prove the convergence of (31). ! ) ˜kq Lemma D.1. Let pλk , λ be the sequence generated by (31). Then, for any λ P Rn , the following estimate holds: kě0
Gpλk`1 q ´ Gpλq `
“ ‰ Mk τk2 Mk τk2 ˜ ˜ k ´ λ}2 ` δk . }λk`1 ´ λ}2 ď p1 ´ τk q Gpλk q ´ Gpλq ` }λ 2 2 2
(66)
ˆ k q ` x∇gpλ ˆ k q, λ ´ λ ˆ k y ď gpλq and G :“ g ` h, the key inequality (51) leads to: Proof. Since gpλ ‰ δk Mk “ ˆ ` }λk ´ λ}2 ´ }λk`1 ´ λ}2 2 2 δk ˆ k ´ λk`1 , λ ˆ k ´ λy ´ Mk }λk`1 ´ λ ˆ k }2 . “ Gpλq` ` Mk xλ 2 2
Gpλk`1 q ď Gpλq`
(67)
Now, using (67) with λ “ λk and subtracting Gpλq from both sides we get: δ k Mk ˆ k }2 ` Mk xλ ˆ k ´ λk`1 , λ ˆ k ´ λk y. ´ }λk`1 ´ λ (68) 2 2 By multiplying (68) by p1 ´ τk q and (67) by τk with τk P p0, 1q and summing the results up, we obtain: “ ‰ ˆ k q´Gpλq ` δk ´ Mk }λk`1 ´ λ ˆ k }2 ` Mk xλ ˆ k ´λk`1 , λ ˆ k ´ p1 ´ τk qλk ´ τk λy. (69) Gpλk`1 q´Gpλq ď p1 ´ τk q Gpλ 2 2 ˆ k ´λk`1 q. Putting these ˆ k and λ ˜ k , we have τk λ ˜ k :“ λ ˆ k ´ p1 ´ τk qλk and λ ˜ k`1 :“ λ ˜ k ´ 1 pλ By the update rules (31) for λ τk equalities into (69), we get: 2“ ‰ “ ‰ ˜ k ´λ}2 ´}λ ˜ k ´ 1 pλ ˆ k ´λk`1 q´λ}2 ˆ k q´Gpλq ` δk ` Mk τk }λ Gpλk`1 q´Gpλq ď p1 ´ τk q Gpλ 2 2 τk 2“ ‰ “ ‰ δk M τ ˆ k q´Gpλq ` ` k k }λ ˜ k ´λ}2 ´}λ ˜ k`1 ´λ}2 , (70) “ p1 ´ τk q Gpλ 2 2 which is in fact the estimate (66). ( ˆk, λ ˜ k q be a sequence generated by (31) and τ0 :“ 1. Let ě 0 be fixed, δk :“ τk and Lemma D.2. Let pλk , λ ř k Sˆk :“ i“0 M1i τi . If tτk ukě0 satisfies the condition: Gpλk`1 q´Gpλq ď Gpλk q´Gpλq`
2 τk2 “ τk´1 p1 ´ τk q,
(71)
then we have the following estimate: Gpλk q ´ G‹ ď
” ı M 2 2 Mk´1 τk´1 Mk´1 k´1 τk´1 }λ0 ´ λ‹ }2 ` Sˆk´1 ď }λ0 ´ λ‹ }2 ` . 2 2 2 M0
(72)
Proof. Using (66) with G‹ “ Gpλ‹ q, λ “ λ‹ and δi :“ τi for i ě 0, we have: Gpλi`1 q ´ G‹ ` From (71): τi2 “
2 τi`1 p1´τi`1 q
“ ‰ Mi τi2 ˜ Mi τi2 ˜ }λi`1 ´ λ‹ }2 ď p1 ´ τi q Gpλi q ´ G‹ ` }λi ´ λ‹ }2 ` τi . 2 2p1 ´ τi q
iq and the fact that Mi`1 ě Mi , we have Mi τi2 p1´τ ď Mi τ 2 i
1 2 Mi´1 τi´1
(73)
for i ě 1. Substituting this
‹
relation and noting that Gpλi q ´ G ě 0, it follows from (73) that: ‰ 1 “ ‰ 1 1 “ 1 ˜ i`1 ´ λ‹ }2 ď ˜ i ´ λ‹ }2 ` . Gpλi`1 q ´ G‹ ` }λ Gpλi q ´ G‹ ` }λ 2 2 Mi τi 2 Mi´1 τi´1 2 2Mi τi Summing up this inequality from i “ 1 to k ´ 1 and then using (73) with i “ 0, we obtain: k´1 ÿ 1 ‰ 1 ‰ p1 ´ τ0 q “ ‹ ‹ ˜ 0 ´ λ‹ }2 ´ 1 }λ ˜ k ´ λ‹ }2 ` Gpλ q ´ G ` Gpλ q ´ G ď } λ . 1 k 2 Mk´1 τk´1 M0 τ02 2 2 2 i“0 Mi τi ˜0 “ λ ˆ 0 “ λ0 . Using these relations and Sˆk´1 :“ řk´1 1 , the last estimate leads to: Since τ0 “ 1, we have λ i“0 Mi τi ” ı 2 Mk´1 τk´1 Gpλk q ´ G‹ ď }λ0 ´ λ‹ }2 ` Sˆk´1 , 2 which is the first inequality of (72).
1
“
Finally, we note that Mi ě M0 for i ě 0, and the condition (71) leads to τ1i “ τ12 ´ τ 21 . Hence, i i´1 « ff « ˙ff ˆ k´1 k´1 k´1 ÿ 1 ÿ 1 ÿ 1 1 1 1 1 Sˆk´1 “ ď 1` “ 1` “ . 2 ´ τ2 2 M τ M τ M τ M τ i i 0 i 0 0 i i´1 k´1 i“0 i“1 i“1 Using this estimate into the first inequality of (72), we obtain: Gpλk q ´ G‹ ď which is indeed the second inequality of (72).
2 Mk´1 τk´1 Mk´1 }λ0 ´ λ‹ }2 ` , 2 2 M0
(74)
D.2. The proof of Lemma 4.3: Accelerated dual universal proximal-gradient scheme ˆ k ´ λk`1 q “ tk pλ ˆ k ´ λk`1 q. We also ˜k ´ λ ˜ k`1 “ 1 pλ Let tk :“ τk´1 , then t0 “ τ0´1 “ 1. From (31), we have λ τk ´1 1 ˆ k “ p1 ´ τk qλk ` τk λ ˜ k , which leads to λ ˜ k “ rλ ˆ k ´ p1 ´ τk qλk s “ tk rλ ˆ k ´ p1 ´ t qλk s. Combining these have λ k τk expressions, we get: ˆ k ´ λk`1 q “ λ ˜k ´ λ ˜ k`1 “ tk rλ ˆ k ´ p1 ´ t´1 qλk s ´ tk`1 rλ ˆ k`1 ´ p1 ´ t´1 qλk`1 s. tk pλ k k`1 This expression can be simplified as: ˆ k`1 “ tk λk`1 ` tk`1 p1 ´ t´1 qλk`1 ´ tk p1 ´ t´1 qλk “ ptk ` tk`1 ´ 1qλk`1 ´ ptk ´ 1qλk . tk`1 λ k`1 k ˆ k`1 “ λk`1 ` Hence λ
tk ´1 tk`1 pλk`1
´ λk q, which is the third step of (32).
Next, from the condition (71) and tk “ exactly the second step of (32).
1 τk
we have t2k`1 ´ tk`1 ´ t2k “ 0. Hence, tk`1 “
1 2
” ı a 1 ` 1 ` 4t2k , which is ˝
D.3. The proof of Theorem 4.4: Convergence rate of the dual objective residual Gpλk q ´ G‹ ´ ¯ a Since tk`1 “ 21 1 ` 1 ` 4t2k ě tk ` 12 . By induction and t0 “ 1, we have tk ě t0 ` k2 “ k`2 2 . Moreover, tk`1 ď tk ` 1, which leads to tk ď k ` 1 ă k ` 2 by induction. From the line-search condition (33), we have: 1´ν
Ď Mk´1 ď 2M t
k´1
1´ν 1`ν Ď Ď . M ă 2pk ` 1q 1`ν M “ 2tk´1
1´ν Ďε and τk´1 “ t´1 ď Substituting Mk´1 by its upper bound 2pk ` 1q 1`ν M k´1
2 k`1
(75)
into (72), we obtain (34).
˝
D.4. The proof of Theorem 4.5: Convergence rate of the primal objective residual and the primal feasibility gap Using Lemma A.2 and subtracting G‹ , we have: “ ‰ ˆ k ´λk`1 , λ ˆ k ´λy´ Mk }λk`1 ´ λ ˆ k }2 . ˆ k q`x∇gpλ ˆ k q, λ´ λ ˆ k y`hpλq´G‹ ` τk `Mk xλ Gpλk`1 q´G‹ ď gpλ 2 2 On the other hand, by (68), we have:
(76)
Mk δk ˆ k }2 ` Mk xλ ˆ k ´ λk`1 , λ ˆ k ´ λk y. ´ }λk`1 ´ λ (77) 2 2 Similarly to the proof of (69), by multiplying (77) by p1 ´ τk q and (76) by τk and summing the results up, then dividing both sides by Mk τk2 , we obtain: Gpλk`1 q ´ G‹ ď Gpλk q ´ G‹ `
‰ p1 ´ τk q “ ‰ ‰ 1 “ 1 “ ˆ ˆ k q, λ´ λ ˆ k y ` hpλq ´ G‹ Gpλk`1 q ´ G‹ ď Gpλk q ´ G‹ ` gpλk q ` x∇gpλ 2 2 M k τk Mk τk Mk τk ‰ 1“ ˜ ˜ k`1 ´λ}2 ` . ` }λk ´λ}2 ´ }λ 2 2Mk τk
(78)
kq 2 Since τk2 “ τk´1 p1 ´ τk q due to (71) and Mk ě Mk´1 , we have p1´τ ď Mk´11τ 2 . Then, using Gpλk q ´ G‹ ě 0, we Mk τk2 k´1 can derive: ‰ “ ‰ ‰ 1 1 “ ˆ 1 “ ˆ i q, λ´ λ ˆ i y ` hpλq ´ G‹ Gpλi`1 q ´ G‹ ď Gpλi q ´ G‹ ` gpλi q`x∇gpλ 2 2 Mi τi Mi´1 τi´1 Mi τi ‰ 1“ ˜ ˜ i`1 ´λ}2 ` . ` }λi ´λ}2 ´ }λ (79) 2 2Mi τi
Summing up this inequality from i “ 1 to k, and then adding the result to (78) with k “ 0, we obtain: k k ‰ 1 ´ τ0 “ ‰ ÿ ‰ ÿ 1 “ ˆ 1 1 “ ‹ ‹ ‹ ˆ ˆ Gpλ q ´ G ď Gpλ q ´ G ` gp λ q ` x∇gp λ q, λ ´ λ y ` hpλq ´ G ` k`1 0 i i i Mk τk2 M0 τ02 M τ 2 M i i i τi i“0 i“0 ‰ 1“ ˜ 2 2 ˜ ` }λ 0 ´ λ} ´ }λk`1 ´ λ} . 2
˜ 0 “ λ0 . The last inequality can be simplified as: We note that τ0 “ 1 and λ k k ÿ ‰ ‰ ÿ 1 1 “ 1 1 “ ˆ ‹ 2 ˆ i q, λ ´ λ ˆ i y ` hpλq ´ G‹ . ď Gpλ q ´ G gpλi q ` x∇gpλ ` }λ ´ λ} ` k`1 0 2 Mk τk 2 i“0 Mi τi 2 Mi τi i“0
(80)
1 ˆ i q ` x∇gpλ ˆ i q, λ ´ λ ˆ i y “ ´f px‹ pλ ˆ i qq ´ xAx‹ pλ ˆ i q ´ b, λy. With Sˆk :“ řk ¯ As (59), we have gpλ i“0 Mi τi and xk :“ ř ř k k 1 1 1 1 ‹ ˆ ‹ ˆ ¯ ¯ ˆk ˆk i“0 Mi τi x pλi q defined by (35), using the convexity of f , we have f pxk q ď S i“0 Mi τi f px pλi qq and xAxk ´ S ř k 1 ˆ i q ´ b, λy. Using these expressions into (80) and noting that Gpλk`1 q ´ G‹ ě 0, we xAx‹ pλ b, λy “ 1 ˆk S
i“0 Mi τi
obtain: ¯ k q`xAx ¯ k ´b, λy´hpλq`G‹ ď ´ f px
‰ “ 1 1 1 Gpλk`1 q´G‹ ` ` }λ0 ´λ}2 ď ` }λ0 ´λ}2 . 2 ˆ ˆ 2 2 Sk Mk τk 2Sk 2Sˆk
(81)
¯ k q ` xλ‹ , Ax ¯k ´ On the other hand, similarly to the proof of Theorem 4.2, we have ´Gpλk`1 q ď ´G‹ “ d‹ “ f ‹ ď f px ‹ ‹ ¯ ¯ b ´ ry, which leads to ´f pxk q ´ xλ , Axk ´ b ´ ry ď G . Combing this expression and the last inequality, we obtain: ¯ k ´ b ´ r, λ ´ λ‹ y ´ xAx
1 }λ0 ´ λ}2 ď . ˆ 2 2Sk
(82)
By the same argument as the proof of Theorem 4.2, we finally obtain: ¯ k ´ b, Kq ď dist pAx
2}λ0 ´ λ‹ } ` Sˆk
c
. ˆ Sk
For the objective residual, we have the following estimate by using (81) and G‹ “ ´f ‹ : c 2}λ0 }}λ0 ´ λ‹ } ‹ ‹ ¯ ¯ ` . ´}λ }dist pAxk ´ b, Kq ď f pxk q ´ f ď ` }λ0 } 2 Sˆk Sˆk 2 Finally, since τk2 “ p1 ´ τk qτk´1 by (71), we have 1´ν 1`ν
2tk
Ď and Ď ď 2pk ` 2q 1´ν 1`ν M M Sˆk :“
k`2 2
1 τk
“
1 τk2
´
1 . 2 τk´1
(83)
(84)
Ďτ “ Using this relation, M0 ď Mi ď Mk ď 2M k
ď tk ă k ` 2 for i “ 0, . . . , k, we can show that:
1`3ν k k ÿ ÿ `1 pk ` 2q 1`ν 1 1 1 ˘ı 1 ” t2k ě ě ´ . “ 1 ` ě 1´ν 2 Ďτ τ 2 Ďτ Ď Mi τi τi2 τi´1 Ď 2M 2M 8M 2pk ` 2q 1`ν M k i k i“0 i“0 i“1
k ÿ
Substituting this estimate into (83) and (84), we obtain (36) and (37), respectively.
(85) ˝
D.5. The worst-case complexity analysis We analyze the worst-case complexity of Algorithm 2 in terms of the primal objective objective function f . From (36) of ¯ k ´ b, Kq ď f px ¯ k q ´ f ‹ ď 2 . Now, using (37) of Theorem 4.5, to Theorem 4.5 and λ0 :“ 0n , we have ´}λ‹ }dist pAx ‹ ¯ k q ´ f | ď and dist pAx ¯ k ´ b, Kq ď , we must have: guarantee |f px d Ď Ď 16M 8M ‹ 1`3ν }λ } ` 1`3ν ď ‹ . }λ } 1`ν 1`ν pk`2q pk`2q The inequality on the right hand side leads to: Ď }λ‹ }2 32M k`2ě „
1`ν 1`3ν
.
ı 1´ν ” Ď and considering the fact that 1´ν 1`ν ď 1 for ν P r0, 1s, we find the maximum number Using the definition (18) of M 1`ν of iterations that satisfies the above inequality is: [ _ 2 ˆ ˙ 1`3ν ? M ν ‹ 2p1`νq , kmax :“ r4 2}λ }s 1`3ν inf 0ďνď1
which is indeed (40). Next, we consider the number of oracle queries in Algorithm 2. At each iteration $k$, Algorithm 2 requires one gradient evaluation of $g$ and $i_k$ function evaluations of $g$. Hence, the total number of oracle queries up to iteration $k$ is $N_2(k) := \sum_{j=0}^{k}(i_j + 1)$. Since $i_j := \log_2(M_j/M_{j-1})$, we have:
$$N_2(k) = (k+1) + \log_2(M_k) - \log_2(M_{-1}) = (k+1) + \log_2(M_k) - \log_2(M).$$
Using the same argument as in the proof of Lemma B.1, we have $M_k \le \bar{M}(k) \le \bigl[\frac{k+1}{\epsilon}\bigr]^{\frac{1-\nu}{1+\nu}}M_\nu^{\frac{2}{1+\nu}}$. Hence, we obtain:
$$N_2(k) \le (k+1) + \frac{1-\nu}{1+\nu}\bigl[\log_2(k+1) - \log_2(\epsilon)\bigr] + \frac{2}{1+\nu}\log_2(M_\nu) - \log_2(M),$$
which is indeed (41).
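For readers who want to see these worst-case bounds evaluated numerically, here is a hedged sketch of (40) and (41). The quantities $\|\lambda^\star\|$, $M_\nu$, the initial estimate $M$ (that is, $M_{-1}$), and the accuracy $\epsilon$ are problem dependent; the values below are placeholders, and $M_\nu$ is treated as a constant in $\nu$ purely for simplicity.

```python
import math

# Hedged numerical illustration of the worst-case bounds (40) and (41) derived above.
# All numeric values are placeholders, not quantities from the paper's experiments.

def k_max(lam_star_norm, M_nu, eps, num_grid=101):
    """Evaluate (40): floor of the infimum over nu in [0,1] of
       [4*sqrt(2)*||lambda*||]^(2(1+nu)/(1+3nu)) * (M_nu/eps)^(2/(1+3nu))."""
    best = None
    for j in range(num_grid):
        nu = j / (num_grid - 1)
        val = (4.0 * math.sqrt(2.0) * lam_star_norm) ** (2.0 * (1.0 + nu) / (1.0 + 3.0 * nu)) \
              * (M_nu / eps) ** (2.0 / (1.0 + 3.0 * nu))
        best = val if best is None else min(best, val)
    return math.floor(best)

def N2_upper_bound(k, nu, M_nu, M_init, eps):
    """Evaluate the right-hand side of (41)."""
    return (k + 1) + (1.0 - nu) / (1.0 + nu) * (math.log2(k + 1) - math.log2(eps)) \
           + 2.0 / (1.0 + nu) * math.log2(M_nu) - math.log2(M_init)

print(k_max(lam_star_norm=5.0, M_nu=10.0, eps=1e-2))            # iteration bound for a hypothetical problem
print(N2_upper_bound(k=100, nu=1.0, M_nu=10.0, M_init=1.0, eps=1e-2))
```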
E. The dual averaging subgradient method

The dual averaging subgradient algorithm (Nesterov, 2007; 2009) is perhaps the most efficient variant in the class of subgradient methods for solving nonsmooth convex optimization problems. In general, the dual problem (12) is convex but nonsmooth. We can then apply the dual averaging subgradient algorithm to solve this problem. We review this method here and then show that the Frank-Wolfe algorithm (Jaggi, 2013) can be viewed as a variant of the dual averaging subgradient algorithm.

E.1. The dual averaging subgradient method for the dual problem

We consider a special case of (1), where $\mathcal{K} = \{0^n\}$. In this case, the dual component $h(\lambda)$ defined by (13) vanishes, i.e., $h(\lambda) = 0$. Hence, the dual problem (12) reduces to:
$$G^\star := \min_{\lambda\in\mathbb{R}^n} G(\lambda), \tag{86}$$
where $G(\lambda) := \max_{x}\{\langle\lambda, b - Ax\rangle - f(x) : x \in \mathcal{X}\}$ is the dual function. Let $p(\lambda) : \mathbb{R}^n \to \mathbb{R}_+$ be a proximity function (or prox-function), which is strongly convex with convexity parameter $\mu_p = 1$. For any $\beta > 0$, we define the following mapping from the dual space $\mathbb{R}^n$ to the primal space $\mathbb{R}^n$:
$$\pi_\beta(s) := \arg\max_{\lambda}\{\langle s, \lambda\rangle - \beta p(\lambda)\} \equiv \arg\min_{\lambda}\{\langle -s, \lambda\rangle + \beta p(\lambda)\}. \tag{87}$$
Let us recall the dual averaging subgradient scheme of (Nesterov, 2009) applied to (86):
$$\begin{cases} s_{k+1} := s_k - \gamma_k\nabla G(\lambda_k) \\ \lambda_{k+1} := \pi_{\beta_k}(s_{k+1}), \end{cases} \tag{88}$$
where $\pi_\beta$ is the mapping defined by (87), $\beta_{k+1} \ge \beta_k$, and $s_0 := 0^n$. The step size $\gamma_k > 0$ will be appropriately defined below. Let $\mathcal{F}_R := \{\lambda \in \mathbb{R}^n : p(\lambda) \le R\}$ be the sub-level set of $p$. Then, according to Theorem 1 in (Nesterov, 2009), we have:
$$\Delta_R(k) := \max_{\lambda\in\mathcal{F}_R}\sum_{i=0}^{k}\gamma_i\langle\nabla G(\lambda_i), \lambda_i - \lambda\rangle \le \beta_{k+1}R + \frac{1}{2}\sum_{i=0}^{k}\frac{\gamma_i^2}{\beta_i}\|\nabla G(\lambda_i)\|^2. \tag{89}$$
As proved in (Nesterov, 2009), if we choose $\gamma_k := 1$ and $\beta_{k+1} := \beta_k + \eta^2\beta_k^{-1}$ with $\beta_0 = \eta$ for some $\eta > 0$, then:
$$G(\bar{\lambda}_k) - G^\star_R \le \Bigl(\gamma R + \frac{M_G}{2\eta}\Bigr)\frac{0.5 + \sqrt{k+1}}{k+1},$$
where $M_G := \sup_{\lambda\in\mathcal{F}_R}\|\nabla G(\lambda)\|$ and $G^\star_R := \min_{\lambda}\{G(\lambda) : \lambda \in \mathcal{F}_R\}$. Hence, the worst-case complexity of the dual averaging subgradient scheme is $\mathcal{O}\Bigl(\bigl(\tfrac{\gamma R + M_G/(2\eta)}{\epsilon}\bigr)^2\Bigr)$, which is also optimal.
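To make the scheme (87)-(88) concrete, the following sketch instantiates it with the Euclidean prox-function $p(\lambda) = (1/2)\|\lambda\|^2$ (so that $\pi_\beta(s) = s/\beta$) on a toy instance of (1) with $\mathcal{K} = \{0^n\}$, a linear objective $f(x) = \langle c, x\rangle$, and $\mathcal{X}$ the $\ell_1$-ball. The data $A$, $b$, $c$, the choice $\gamma_k = 1$, and the $\beta_k$ recursion follow our reading of the source and should be treated as illustrative assumptions rather than the paper's experimental setup.

```python
import numpy as np

# Minimal sketch of the dual averaging subgradient scheme (88) applied to the dual (86),
# with p(lambda) = (1/2)||lambda||^2 so that pi_beta(s) = s / beta, on a toy problem:
# f(x) = <c, x>, X = {x : ||x||_1 <= kappa}, K = {0}. Everything below is illustrative.

rng = np.random.default_rng(0)
n, p, kappa = 20, 50, 1.0
A, b, c = rng.standard_normal((n, p)), rng.standard_normal(n), rng.standard_normal(p)

def x_star(lam):
    """lmo: argmax over the l1-ball of <lambda, b - A x> - <c, x>, i.e. of <-(A^T lam) - c, x>."""
    s = -(A.T @ lam) - c
    i = int(np.argmax(np.abs(s)))
    x = np.zeros(p)
    x[i] = kappa * np.sign(s[i])
    return x

eta = 1.0
s, beta = np.zeros(n), eta
lam = np.zeros(n)                              # lambda_0 = pi_{beta_0}(s_0) = 0
for k in range(500):
    grad_G = b - A @ x_star(lam)               # subgradient of G at lambda_k
    s = s - 1.0 * grad_G                       # gamma_k = 1
    lam = s / beta                             # lambda_{k+1} = pi_{beta_k}(s_{k+1})
    beta = beta + eta ** 2 / beta              # beta_{k+1} = beta_k + eta^2 / beta_k
print("illustrative residual ||b - A x*(lambda)||:", np.linalg.norm(b - A @ x_star(lam)))
```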
As indicated in (Nesterov, 2007; 2009), the convergence rate of the dual averaging subgradient algorithm depends on the assumptions on G:
• If $\nabla G$ is bounded, then one can show that the convergence rate of $G(\lambda_k) - G^\star$ is $\mathcal{O}(1/\sqrt{k})$.
• If $\nabla G$ is Lipschitz continuous, then the convergence rate of $G(\lambda_k) - G^\star$ improves to $\mathcal{O}(1/k)$.

E.2. Frank-Wolfe's algorithm vs. the dual averaging subgradient method

The Frank-Wolfe algorithm as revisited in (Jaggi, 2013) aims at solving the following constrained convex problem:
$$\varphi^\star := \min_{x}\{\varphi(x) : x \in \mathcal{X}\}, \tag{90}$$
where $\mathcal{X}$ is a nonempty, closed, convex and bounded set, and $\varphi$ is a smooth convex function (with Lipschitz continuous gradient). The FW algorithm generates a sequence $\{\bar{x}_k\}$ using:
$$\text{Frank-Wolfe:}\quad\begin{cases} \hat{x}_{k+1} := \arg\min_{x}\{\langle\nabla\varphi(\bar{x}_k), x\rangle : x \in \mathcal{X}\} \\ \bar{x}_{k+1} := \bar{x}_k + \gamma_k(\hat{x}_{k+1} - \bar{x}_k), \end{cases} \tag{91}$$
where $\gamma_k \in (0, 1]$ is the step size. We can either set $\gamma_k := \frac{2}{k+2}$ or determine it via a line-search procedure.
Let us introduce $r := x$. Then we can rewrite (90) in the form of (1):
$$\varphi^\star = \min_{x,r}\{\varphi(r) : x - r = 0,\ x \in \mathcal{X}\}. \tag{92}$$
The dual function (12) for (92) is then given by
$$G(\lambda) := \varphi^*(\lambda) + \sup_{x\in\mathcal{X}}\langle -\lambda, x\rangle,$$
where $\varphi^*$ is the Fenchel conjugate of $\varphi$ and $\sup_{x\in\mathcal{X}}\langle\cdot, x\rangle = s_{\mathcal{X}}(\cdot)$ is the support function of $\mathcal{X}$. By definition, we can estimate
$$\nabla G(\lambda) = \nabla\varphi^*(\lambda) - x^\star(\lambda), \qquad x^\star(\lambda) := \arg\min_{x\in\mathcal{X}}\langle\lambda, x\rangle.$$
If we define $\bar{x}_k := \nabla\varphi^*(\bar{\lambda}_k)$, then $\bar{\lambda}_k = \nabla\varphi(\bar{x}_k)$. Let $\hat{x}_{k+1} := x^\star(\bar{\lambda}_k)$; then $\hat{x}_{k+1} = x^\star(\nabla\varphi(\bar{x}_k))$. Hence, $\nabla G(\bar{\lambda}_k) = -(\hat{x}_{k+1} - \bar{x}_k)$. Using these expressions, we can equivalently reformulate (91) as:
$$\text{Frank-Wolfe:}\quad\begin{cases} \bar{x}_{k+1} := \bar{x}_k - \gamma_k\nabla G(\bar{\lambda}_k) \\ \bar{\lambda}_{k+1} := \nabla\varphi(\bar{x}_{k+1}). \end{cases} \tag{93}$$
Since $\nabla\varphi$ is Lipschitz continuous with a Lipschitz constant $L_\varphi > 0$, by the Baillon-Haddad theorem (Bauschke & Combettes, 2011), the conjugate $\varphi^*$ is strongly convex with convexity parameter $\mu_{\varphi^*} := L_\varphi^{-1}$. Hence, we can use $\varphi^*$ to define the following mapping:
$$\pi_\beta(s) = \arg\min_{\lambda}\{\langle -s, \lambda\rangle + \beta\varphi^*(\lambda)\},$$
which maps the primal variable $x$ to the dual variable $\lambda$. Using $\pi_\beta$, we easily obtain $\bar{\lambda}_{k+1} = \pi_1(-\bar{x}_{k+1})$, and hence (93) is exactly the dual averaging subgradient algorithm applied to the dual problem of (92). Following (Nesterov, 2007; 2009), if $\nabla G$ is bounded, then the convergence rate of $G(\lambda_k) - G^\star$ is $\mathcal{O}(1/\sqrt{k})$; if $\nabla G$ is Lipschitz continuous, the convergence rate improves to $\mathcal{O}(1/k)$.
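The equivalence between (91) and (93) can be checked numerically whenever $\nabla\varphi$ and $\nabla\varphi^*$ are available in closed form, e.g., for $\varphi(x) = (1/2)\|x - c\|^2$, where $\nabla\varphi(x) = x - c$ and $\nabla\varphi^*(\lambda) = \lambda + c$. The sketch below runs both recursions from the same starting point over the $\ell_1$-ball and confirms that the iterates coincide; the data are placeholders.

```python
import numpy as np

# Numerical check that the Frank-Wolfe recursion (91) and its dual-averaging form (93)
# produce identical iterates when phi(x) = (1/2)||x - c||^2 and X is the l1-ball.

rng = np.random.default_rng(2)
p, kappa = 30, 2.0
c = rng.standard_normal(p)

def lmo_l1(g, radius):
    i = int(np.argmax(np.abs(g)))
    v = np.zeros_like(g)
    v[i] = -radius * np.sign(g[i])
    return v

x_fw = np.zeros(p)   # iterates of (91)
x_da = np.zeros(p)   # iterates of (93)
for k in range(100):
    gamma = 2.0 / (k + 2.0)
    # (91): x_hat := argmin_{x in X} <grad phi(x_fw), x>, then a convex combination step
    x_hat = lmo_l1(x_fw - c, kappa)
    x_fw = x_fw + gamma * (x_hat - x_fw)
    # (93): lambda_bar = grad phi(x_da); grad G(lambda_bar) = grad phi*(lambda_bar) - x_star(lambda_bar)
    lam = x_da - c
    grad_G = (lam + c) - lmo_l1(lam, kappa)   # x_star(lam) = argmin_{x in X} <lam, x>
    x_da = x_da - gamma * grad_G
print("max |x_fw - x_da| after 100 iterations:", np.max(np.abs(x_fw - x_da)))
```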
F. The implementation details

In this section, we specify key steps of Algorithms 1 and 2 for two important applications. We also provide an analytical step size that guarantees the line-search condition without any function evaluation.

F.1. Constrained convex optimization involving a quadratic cost

Our first application can be cast in the following form:
$$\min_{x}\bigl\{(1/2)\|A(x) - b\|^2 : x \in \mathcal{X}\bigr\}, \tag{94}$$
where $\mathcal{X}$ is a simple set, e.g., a simplex or a norm ball. By using a slack variable $r = A(x) - b$, we can write (94) as:
$$\min_{x,r}\bigl\{(1/2)\|r\|^2 : A(x) - r = b,\ x \in \mathcal{X}\bigr\}. \tag{95}$$
Clearly, the dual components $g$ and $h$ defined in (13) can be expressed as:
$$g(\lambda) := \max_{x\in\mathcal{X}}\bigl\{\langle A^T\lambda, x\rangle + \langle b, \lambda\rangle\bigr\} \quad\text{and}\quad h(\lambda) := \max_{r}\bigl\{\langle\lambda, r\rangle - (1/2)\|r\|^2\bigr\} = (1/2)\|\lambda\|^2.$$
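As a concrete illustration of these dual components, the sketch below evaluates $g$, $h$ and a (sub)gradient of $G = g + h$ when $\mathcal{X}$ is the $\ell_1$-ball of radius $\kappa$, one of the cases discussed below; the data $A$, $b$, $\kappa$ are placeholders chosen only for the example.

```python
import numpy as np

# Sketch of evaluating the dual components of (95) when X = {x : ||x||_1 <= kappa}:
# g(lambda) = kappa*||A^T lambda||_inf + <b, lambda>, h(lambda) = (1/2)||lambda||^2,
# and a (sub)gradient of G = g + h obtained from the maximizer x*(lambda).

rng = np.random.default_rng(4)
n, p, kappa = 10, 30, 1.0
A, b = rng.standard_normal((n, p)), rng.standard_normal(n)

def x_star(lam):
    s = A.T @ lam
    i = int(np.argmax(np.abs(s)))
    x = np.zeros(p)
    x[i] = kappa * np.sign(s[i])          # maximizer of <A^T lam, x> over the l1-ball
    return x

def G_and_grad(lam):
    g = kappa * np.max(np.abs(A.T @ lam)) + b @ lam
    h = 0.5 * lam @ lam
    grad = A @ x_star(lam) + b + lam      # (sub)gradient of g plus gradient of h
    return g + h, grad

val, grad = G_and_grad(rng.standard_normal(n))
print(val, np.linalg.norm(grad))
```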
We consider two cases:

• If $\mathcal{X} := \{x : \|x\| \le \kappa\}$, then $g(\lambda) = \kappa\|A^T\lambda\|_* + \langle b, \lambda\rangle$, where $\|\cdot\|_*$ is the dual norm. We can also compute $x^\star(\lambda)$ and $r^\star(\lambda) = \lambda$ explicitly in this case. Here are two examples:
  – If $\|\cdot\|$ is the $\ell_1$-norm, then $x^\star(\lambda)$ has a single nonzero entry at the index $i_{\max}$ of the largest absolute entry of $A^T\lambda$, i.e., $i_{\max}$ is an index with $\max_{1\le i\le p}|(A^T\lambda)_i| = |(A^T\lambda)_{i_{\max}}|$, and $x^\star(\lambda)_i = 0$ otherwise.
  – If $\|\cdot\|$ is the nuclear norm, then $x^\star(\lambda) = e_1e_1^T$, where $e_1$ is the singular vector corresponding to the largest singular value $\sigma_1$ of $A^T\lambda$.
• If $\mathcal{X} := \{x : \mathrm{tr}(x) = 1,\ x \in \mathcal{S}^p_+\}$, then $g(\lambda) = \sigma_{\max}(A^T\lambda) + \langle b, \lambda\rangle$, where $\sigma_{\max}(A^T\lambda) \equiv \sigma_1$ is the largest eigenvalue of $A^T\lambda$. We can compute $x^\star(\lambda) = e_1e_1^T$, where $e_1$ is the eigenvector corresponding to $\sigma_1$.

Hence, the evaluation of $g$, $h$ and their universal gradients can be done explicitly.

Computing an analytical step size: Now, we consider the line-search procedure in Algorithms 1 and 2. Since $h$ is quadratic, we can use a universal gradient step instead of the universal proximal-gradient step, i.e.:
$$\lambda_{k+1} := \hat{\lambda}_k - \alpha_k\nabla G(\hat{\lambda}_k), \quad\text{where } \alpha_k := \frac{1}{M_k}.$$
The line-search condition on $G$ in both algorithms can be expressed as:
$$G(\lambda_{k+1}) = G(\hat{\lambda}_k - \alpha_k\nabla G(\hat{\lambda}_k)) \le G(\hat{\lambda}_k) - \frac{\alpha_k}{2}\|\nabla G(\hat{\lambda}_k)\|^2 + \frac{\delta_k}{2}. \tag{96}$$
Since $G(\lambda) = (1/2)\|\lambda\|^2 + \langle b, \lambda\rangle + \kappa\|A^T\lambda\|_*$, we can upper bound $G(\hat{\lambda}_k - \alpha_k\nabla G(\hat{\lambda}_k))$ by:
$$G(\hat{\lambda}_k - \alpha_k\nabla G(\hat{\lambda}_k)) \le U(\alpha_k) := G(\hat{\lambda}_k) - \alpha_k\langle\hat{\lambda}_k + b, \nabla G(\hat{\lambda}_k)\rangle + \frac{\alpha_k^2}{2}\|\nabla G(\hat{\lambda}_k)\|^2 + \alpha_k\kappa\|A^T\nabla G(\hat{\lambda}_k)\|_*.$$
The condition (96) holds if $U(\alpha_k) = G(\hat{\lambda}_k) - \frac{\alpha_k}{2}\|\nabla G(\hat{\lambda}_k)\|^2 + \frac{\delta_k}{2}$. Solving this equation for $\alpha_k$, we obtain explicitly:
$$\alpha_k := \frac{\sqrt{P^2 + 4\delta_k\|\nabla G(\hat{\lambda}_k)\|^2} - P}{2\|\nabla G(\hat{\lambda}_k)\|^2}, \quad\text{with } P := \|\nabla G(\hat{\lambda}_k)\|^2 + 2\kappa\|A^T\nabla G(\hat{\lambda}_k)\|_* - 2\langle\hat{\lambda}_k + b, \nabla G(\hat{\lambda}_k)\rangle,$$
which has been used in our numerical experiments. We call the algorithms using this step size $\alpha$-UniProxGrad and $\alpha$-AccUniProxGrad in the main text.

F.2. Constrained convex optimization involving a norm cost

Now, we consider the second application, which is reformulated as:
$$\min_{x}\{\|x\| : A(x) - b \in \mathcal{K}\}, \tag{97}$$
where $\mathcal{K}$ is a simple set, e.g., $\mathcal{K} = \{0^n\}$ or $\mathcal{K}$ is a norm ball. Similarly to the previous case, we can write (97) as:
$$\min_{x,r}\{\|x\| : A(x) - r = b,\ r \in \mathcal{K}\}. \tag{98}$$
Clearly, the dual components $g$ and $h$ defined in (13) can be expressed as:
$$g(\lambda) := \max_{x}\bigl\{\langle A^T\lambda, x\rangle - \|x\|\bigr\} + \langle b, \lambda\rangle = \begin{cases}\langle b, \lambda\rangle & \text{if } \|A^T\lambda\|_* \le 1, \\ +\infty & \text{otherwise,}\end{cases} \quad\text{and}\quad h(\lambda) := \max_{r\in\mathcal{K}}\{\langle -\lambda, r\rangle\} = s_{\mathcal{K}}(-\lambda).$$
We consider two cases:
• If $\mathcal{K} := \{r : \|r\| \le \kappa\}$, then $h(\lambda) = \kappa\|\lambda\|_*$, where $\|\cdot\|_*$ is the dual norm. We can also compute $x^\star(\lambda)$ and $r^\star(\lambda)$ explicitly as above.
• If $\mathcal{K} := \{0^n\}$, then $h(\lambda) = 0$.

We can also apply the same trick as above to compute the analytical step size $\alpha_k$. We note, in addition, that $\min_x\|x\|$ is equivalent to $\min_x(1/2)\|x\|^2$. Hence, we can also write (97) equivalently as:
$$\min_{x,r}\bigl\{(1/2)\|x\|^2 : A(x) - r = b,\ r \in \mathcal{K}\bigr\}. \tag{99}$$
We consider the corresponding dual component $g$ given by $g(\lambda) = \psi(A^T\lambda)$, where:
$$\psi(s) := \max_{x}\bigl\{\langle s, x\rangle - (1/2)\|x\|^2\bigr\}.$$
We note that $\langle s, x\rangle - (1/2)\|x\|^2 \le \|s\|_*\|x\| - (1/2)\|x\|^2$. Let $t := \|x\|$ and consider $\max_{t\ge0}\{\|s\|_*t - (1/2)t^2\}$. We can see that $t^\star = \|s\|_*$ is the solution of this maximization problem; hence, $\|x\| = \|s\|_*$ at the optimum. We consider two cases (see also the sketch below):

• If $\|\cdot\|$ is the $\ell_1$-norm, then $x^\star(s)$ has a single nonzero entry at the index $i_{\max}$ of the largest absolute entry of $s$, i.e., $i_{\max}$ is an index with $\max_{1\le i\le p}|s_i| = |s_{i_{\max}}|$, and $x^\star(s)_i = 0$ otherwise.
• If $\|\cdot\|$ is the nuclear norm, then $x^\star(s) = e_1e_1^T$, where $e_1$ is the singular vector corresponding to the largest singular value $\sigma_1$ of $s$.
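To make the two sharp-point computations that recur in F.1 and F.2 concrete, here is a hedged sketch of the $\ell_1$-norm and nuclear-norm cases. The exact scaling and sign of $x^\star(\cdot)$ depend on the precise set and norm in use (e.g., the radius $\kappa$), so the sketch only illustrates the one-sparse and rank-one structure of the solutions.

```python
import numpy as np

# Hedged sketches of the two recurring "sharp" solutions in F.1 and F.2: the
# largest-absolute-entry solution for the l1-norm and the top singular pair for the
# nuclear norm. Rescale the outputs (e.g. by kappa) as required by the specific set X.

def sharp_l1(s):
    """Return a vector supported on the index of the largest absolute entry of s."""
    i_max = int(np.argmax(np.abs(s)))
    x = np.zeros_like(s)
    x[i_max] = np.sign(s[i_max])          # unit magnitude; rescale if X is a kappa-ball
    return x

def sharp_nuclear(S):
    """Return the rank-one matrix built from the top singular vectors of S."""
    U, _, Vt = np.linalg.svd(S, full_matrices=False)
    return np.outer(U[:, 0], Vt[0, :])    # rank-one e1 e1^T-type solution; scale as required

s = np.array([0.3, -2.0, 1.1])
print(sharp_l1(s))                                                            # -> [ 0. -1.  0.]
print(np.linalg.matrix_rank(sharp_nuclear(np.random.default_rng(3).standard_normal((4, 5)))))  # -> 1
```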