arXiv:1702.05423v1 [math.OC] 17 Feb 2017

Accelerated Primal-Dual Proximal Block Coordinate Updating Methods for Constrained Convex Optimization

Yangyang Xu*    Shuzhong Zhang†

February 20, 2017

* [email protected]. Department of Mathematics, University of Alabama.
† [email protected]. Department of Industrial & Systems Engineering, University of Minnesota.

Abstract

Block coordinate update (BCU) methods enjoy low per-update computational complexity because each update involves only one or a few block variables among a possibly large number of blocks. They are also easily parallelized and thus have been particularly popular for solving problems involving large-scale datasets and/or variables. In this paper, we propose a primal-dual BCU method for solving linearly constrained convex programs with multi-block variables. The method is an accelerated version of a primal-dual algorithm proposed by the authors, which applies randomization in selecting the block variables to update and establishes an O(1/t) convergence rate under a weak convexity assumption. We show that the rate can be accelerated to O(1/t^2) if the objective is strongly convex. In addition, if one block variable is independent of the others in the objective, we show that the algorithm can be modified to achieve a linear rate of convergence. The numerical experiments show that the accelerated method performs stably with a single set of parameters, while the original method needs to tune the parameters for different datasets in order to achieve a comparable level of performance.

Keywords: primal-dual method, block coordinate update, alternating direction method of multipliers (ADMM), accelerated first-order method.

Mathematics Subject Classification: 90C25, 95C06, 68W20.

1 Introduction

Motivated by the need to solve large-scale optimization problems and by the increasing capability of parallel computing, block coordinate update (BCU) methods have become particularly popular in recent years due to their low per-update computational complexity, low memory requirements, and their potential in distributed computing environments. In the context of optimization, BCU first

appeared in the form of block coordinate descent (BCD) type algorithms, which can be applied to unconstrained smooth problems or to problems with separable nonsmooth terms in the objective (possibly with separable constraints). More recently, it has been developed for problems with nonseparable nonsmooth terms and/or constraints within a primal-dual framework. In this paper, we consider the following linearly constrained multi-block structured optimization model:

  min_x  f(x) + Σ_{i=1}^M g_i(x_i),   s.t.  Σ_{i=1}^M A_i x_i = b,   (1)

where x is partitioned into disjoint blocks (x_1, x_2, ..., x_M), f is a smooth convex function with Lipschitz continuous gradient, and each g_i is proper, closed, convex, and possibly non-differentiable. Note that g_i can include an indicator function of a convex set X_i, and thus (1) can implicitly include certain separable block constraints in addition to the nonseparable linear constraint. Many applications arising in statistical and machine learning, image processing, and finance can be formulated in the form of (1), including basis pursuit [7], constrained regression [23], support vector machines in dual form [10], and portfolio optimization [28], to name a few. Towards finding a solution of (1), we first present an accelerated proximal Jacobian alternating direction method of multipliers (Algorithm 1) and then generalize it to an accelerated randomized primal-dual block coordinate update method (Algorithm 2). Assuming strong convexity of the objective, we establish O(1/t^2) convergence rates for the proposed algorithms by adaptively setting the parameters, where t is the total number of iterations. In addition, if we further assume smoothness and full rank of part of the data, we obtain linear convergence of a modified method (Algorithm 3).

1.1 Related methods

Our algorithms are closely related to randomized coordinate descent methods, primal-dual coordinate update methods, and accelerated primal-dual methods. In this subsection, we briefly review these three classes of methods and discuss their relations to our algorithms.

Randomized coordinate descent methods  In the absence of linear constraints, Algorithm 2 specializes to randomized coordinate descent (RCD), which was first proposed in [31] for smooth problems and later generalized in [27, 38] to nonsmooth problems. RCD converges sublinearly with rate O(1/t), which can be accelerated to O(1/t^2) for weakly convex problems, and it achieves a linear rate for strongly convex problems. By choosing multiple block variables at each iteration, [37] parallelized the RCD method and showed the same convergence results for the parallelized version. This corresponds to setting m > 1 in Algorithm 2, which allows parallel updates of the selected x-blocks.


Primal-dual coordinate update methods  In the presence of linear constraints, coordinate descent methods may fail to converge to a solution of the problem because, with all but one block fixed, the selected block variable may be uniquely determined by the linear constraint. To perform coordinate updates on the linearly constrained problem (1), one effective approach is to update both primal and dual variables. Under this framework, the alternating direction method of multipliers (ADMM) is one popular choice. Originally, ADMM [14, 17] was proposed for two-block structured problems with separable objective (obtained by setting f = 0 and M = 2 in (1)), for which its convergence and convergence rate have been well established (see, e.g., [2, 13, 22, 29]). However, directly extending ADMM to the multi-block setting such as (1) may fail to converge; see [6] for a divergence example of ADMM even for solving a linear system of equations. Much effort has been devoted to establishing the convergence of multi-block ADMM under stronger assumptions (see, e.g., [4, 6, 16, 25, 26]), such as strong convexity or orthogonality conditions on the linear constraint. Without additional assumptions, modifications are necessary for ADMM applied to multi-block problems to be convergent; see, for example, [12, 19, 20, 40]. Very recently, [15] proposed a randomized primal-dual coordinate (RPDC) update method. Applied to (1), RPDC is a special case of Algorithm 2 with fixed parameters, and it was shown to converge with rate O(1/t) under a weak convexity assumption. Beyond optimization, primal-dual coordinate (PDC) update methods have also appeared for solving fixed-point or monotone inclusion problems [9, 34-36]. For these problems, however, the PDC methods are only shown to converge; no convergence rate estimates are known unless additional assumptions, such as strong monotonicity, are made.

Accelerated primal-dual methods  It is possible to accelerate the convergence rate from O(1/t) to O(1/t^2) for gradient-type methods. The first acceleration result was shown by Nesterov [30] for smooth unconstrained problems; the technique has since been generalized to accelerate gradient-type methods on possibly nonsmooth convex programs [1, 32]. Primal-dual methods for linearly constrained problems can also be accelerated by similar techniques. Under a weak convexity assumption, the augmented Lagrangian method (ALM) is accelerated in [21] from an O(1/t) to an O(1/t^2) convergence rate by applying a technique similar to that in [1] to the multiplier update, and [39] accelerates the linearized ALM using a technique similar to that in [32]. Assuming strong convexity of the objective, [18] accelerates the ADMM, and the assumption is weakened in [39] to strong convexity of one component of the objective. For bilinear saddle-point problems, various primal-dual methods can be accelerated if either the primal or the dual problem is strongly convex [3, 5, 11]. Without strong convexity, partial acceleration is still possible, with the improved rate depending on certain other quantities; see, e.g., [8, 33].


1.2 Contributions of this paper

We accelerate the proximal Jacobian ADMM [12] and generalize it to an accelerated primal-dual coordinate updating method for linearly constrained multi-block structured convex programs whose objective contains a nonseparable smooth function. With parameters fixed across all iterations, the generalized method reduces to that in [15] and enjoys an O(1/t) convergence rate under a mere weak convexity assumption. By adaptively setting the parameters at different iterations, we show that the accelerated method has an O(1/t^2) convergence rate if the objective is strongly convex. In addition, if one block variable is independent of all others in the objective (but coupled through the linear constraint) and the corresponding component function is smooth, we modify the algorithm by treating that independent variable differently and establish a linear convergence result. Numerically, we test the accelerated method on quadratic programming and compare it to the (non-accelerated) RPDC method in [15]. The results demonstrate that the accelerated method performs efficiently and stably with the parameters set automatically in accordance with the analysis, while the RPDC method needs to tune its parameters for different data in order to attain comparable performance.

1.3 Nomenclature and basic facts

Notations.  We let x_S denote the subvector of x with blocks indexed by S; namely, if S = {i_1, ..., i_m}, then x_S = (x_{i_1}, ..., x_{i_m}). Similarly, A_S denotes the submatrix of A with columns indexed by S, and g_S denotes the sum of the component functions indexed by S. We reserve I for the identity matrix and use ‖·‖ for the Euclidean norm. Given a symmetric positive semidefinite (PSD) matrix W and a vector v of appropriate size, we define ‖v‖²_W = v^⊤ W v. We also denote

  F(x) = f(x) + g(x),   Φ(x̂, x, λ) = F(x̂) − F(x) − ⟨λ, Ax̂ − b⟩.   (2)

Preparations.  Since (1) has only linear constraints, a point x^* is a solution of (1) if and only if the Karush-Kuhn-Tucker (KKT) conditions hold, i.e., there exists λ^* such that

  0 ∈ ∂F(x^*) − A^⊤ λ^*,   (3a)
  Ax^* − b = 0.   (3b)

Together with the convexity of F, (3) implies

  Φ(x, x^*, λ^*) ≥ 0,  ∀x.   (4)

We will use the following lemmas as basic facts; their proofs can be found in [15].

Lemma 1.1  For any vectors u, v and any symmetric PSD matrix W of appropriate sizes, it holds that

  u^⊤ W v = (1/2)[ ‖u‖²_W − ‖u − v‖²_W + ‖v‖²_W ].   (5)

Lemma 1.2  Given a function φ, a point x, and a random vector x̂, if for any λ (which may depend on x̂) it holds that

  E Φ(x̂, x, λ) ≤ E φ(λ),   (6)

then for any γ > 0 we have

  E[ F(x̂) − F(x) + γ‖Ax̂ − b‖ ] ≤ sup_{‖λ‖≤γ} φ(λ).

Lemma 1.3  Suppose E[ F(x̂) − F(x^*) + γ‖Ax̂ − b‖ ] ≤ ε. Then

  E‖Ax̂ − b‖ ≤ ε/(γ − ‖λ^*‖),   and   −‖λ^*‖ ε/(γ − ‖λ^*‖) ≤ E[ F(x̂) − F(x^*) ] ≤ ε,

where (x^*, λ^*) satisfies the optimality conditions in (3), and we assume ‖λ^*‖ < γ.

Outline.  The rest of the paper is organized as follows. Section 2 presents the accelerated proximal Jacobian ADMM and its convergence results. Section 3 proposes an accelerated primal-dual block coordinate update method with convergence analysis. Section 4 assumes more structure on problem (1) and modifies the algorithm of Section 3 to achieve linear convergence. Numerical results are provided in Section 5. Finally, Section 6 concludes the paper.

2 Accelerated proximal Jacobian ADMM

In this section, we propose an accelerated proximal Jacobian ADMM for solving (1). At each iteration, the algorithm updates all M block variables in parallel by minimizing a linearized proximal approximation of the augmented Lagrangian function, and then it renews the multiplier. Specifically, it iteratively performs the updates

  x_i^{k+1} = argmin_{x_i} ⟨∇_i f(x^k) − A_i^⊤(λ^k − β_k r^k), x_i⟩ + g_i(x_i) + (1/2)‖x_i − x_i^k‖²_{P_i^k},  i = 1, ..., M,   (7a)
  λ^{k+1} = λ^k − ρ_k r^{k+1},   (7b)

where β_k and ρ_k are scalar parameters, P^k = blkdiag(P_1^k, ..., P_M^k) is a block diagonal matrix, and r^k = Ax^k − b denotes the residual. Note that (7a) consists of M independent subproblems, which can be solved in parallel.

Algorithm 1 summarizes the proposed method. It reduces to the proximal Jacobian ADMM in [12] if β_k, ρ_k, and P^k are fixed for all k and there is no nonseparable function f. We will show that adapting the parameters to the iteration number can accelerate the convergence of the algorithm.

Algorithm 1: Accelerated proximal Jacobian ADMM for (1)
1. Initialization: choose x^1, set λ^1 = 0, and let r^1 = Ax^1 − b.
2. for k = 1, 2, ... do
3.   Choose parameters β_k, ρ_k and a block diagonal matrix P^k.
4.   Let x^{k+1} ← (7a) and λ^{k+1} ← (7b) with r^{k+1} = Ax^{k+1} − b.
5.   if a certain stopping criterion is satisfied then
6.     Return (x^{k+1}, λ^{k+1}).
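To make the update pattern concrete, the following is a minimal sketch of Algorithm 1 applied to a nonnegative quadratic program, min (1/2)x^⊤Qx + c^⊤x s.t. Ax = b, x ≥ 0, with f taken as the quadratic part and each g_i the indicator of the nonnegative orthant, so that (7a) has a closed form. The schedule loosely follows (13); the particular choice of the proximal weight and the function name are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def accel_prox_jacobian_admm(Q, c, A, b, n_iter=1000, beta0=1.0):
    """Sketch of Algorithm 1 on the nonnegative QP min 0.5 x'Qx + c'x s.t. Ax=b, x>=0.
    Schedule roughly follows (13): beta_k = rho_k = (k+1)*beta0 and
    P^k = ((k+1)*eta + L_f) * I, where eta below is a crude (assumed) choice."""
    n = Q.shape[0]
    Lf = np.linalg.norm(Q, 2)                        # Lipschitz constant of grad f
    eta = beta0 * np.linalg.norm(A, 2) ** 2 + 1e-3   # makes P - beta0*A'A positive definite
    x = np.zeros(n)
    lam = np.zeros(A.shape[0])
    r = A @ x - b
    for k in range(1, n_iter + 1):
        beta_k = rho_k = (k + 1) * beta0
        Pk = (k + 1) * eta + Lf                      # diagonal proximal weight (scalar * I)
        grad = Q @ x + c                             # gradient of f at x^k
        # (7a): all blocks updated in parallel; with P^k diagonal this is a projected step
        x = np.maximum(0.0, x - (grad - A.T @ (lam - beta_k * r)) / Pk)
        r = A @ x - b
        lam = lam - rho_k * r                        # (7b): multiplier update
    return x, lam
```

The only difference from the fixed-parameter proximal Jacobian ADMM of [12] is that β_k, ρ_k, and the proximal weight grow with k, which is what yields the O(1/t^2) rate under strong convexity.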

2.1 Technical assumptions

Throughout the analysis in this section, we make the following assumptions.

Assumption 1  There exists (x^*, λ^*) satisfying the KKT conditions in (3).

Assumption 2  ∇f is Lipschitz continuous with modulus L_f.

Assumption 3  The functions f and g are convex with moduli µ_f ≥ 0 and µ_g ≥ 0, respectively, and µ = µ_f + µ_g > 0.

The first two assumptions are standard, and the third is used to show the O(1/t^2) convergence rate, where t is the number of iterations. With only weak convexity, Algorithm 1 can be shown to converge at the rate O(1/t) with parameters fixed over all iterations, and the order 1/t is optimal, as shown in the very recent work [24].

2.2 Convergence results

In this subsection, we show the O(1/t^2) convergence rate of Algorithm 1. First, we establish a result for running one iteration of Algorithm 1.

Lemma 2.1 (One-iteration analysis)  Under Assumptions 2 and 3, let {(x^k, λ^k)} be the sequence generated by Algorithm 1. Then for any k and any (x, λ) such that Ax = b, it holds that

  Φ(x^{k+1}, x, λ) ≤ (1/(2ρ_k))[ ‖λ − λ^k‖² − ‖λ − λ^{k+1}‖² + ‖λ^k − λ^{k+1}‖² ] − β_k‖r^{k+1}‖²
    + (β_k/2)[ ‖A(x^{k+1} − x)‖² − ‖A(x^k − x)‖² + ‖A(x^{k+1} − x^k)‖² ]
    − (1/2)[ ‖x^{k+1} − x‖²_{P^k} − ‖x^k − x‖²_{P^k} + ‖x^{k+1} − x^k‖²_{P^k} ]
    + (L_f/2)‖x^{k+1} − x^k‖² − (µ_f/2)‖x^k − x‖² − (µ_g/2)‖x^{k+1} − x‖².   (8)

Using the above lemma, we are able to prove the following theorem.

Theorem 2.2  Under Assumptions 2 and 3, let {(x^k, λ^k)} be the sequence generated by Algorithm 1. Suppose that the parameters satisfy

  0 < ρ_k ≤ 2β_k,   P^k ⪰ β_k A^⊤A + L_f I,  ∀k ≥ 1,   (9)

and that there exists a number k_0 such that for all k ≥ 1,

  (k + k_0 + 1)/ρ_k ≤ (k + k_0)/ρ_{k−1},   (10)
  (k + k_0 + 1)(P^k − β_k A^⊤A − µ_f I) ⪯ (k + k_0)(P^{k−1} − β_{k−1} A^⊤A + µ_g I).   (11)

Then, for any (x, λ) satisfying Ax = b, we have

  Σ_{k=1}^t (k + k_0 + 1) Φ(x^{k+1}, x, λ) + Σ_{k=1}^t ((k + k_0 + 1)/2)(2β_k − ρ_k)‖r^{k+1}‖²
    + ((t + k_0 + 1)/2)‖x^{t+1} − x‖²_{P^t − β_t A^⊤A + µ_g I}
  ≤ ((k_0 + 2)/(2ρ_1))‖λ − λ^1‖² + ((k_0 + 2)/2)‖x^1 − x‖²_{P^1 − β_1 A^⊤A − µ_f I}.   (12)

In the next theorem, we provide a set of parameters that satisfies the conditions in Theorem 2.2 and establish the O(1/t^2) convergence rate.

Theorem 2.3 (Convergence rate of order 1/t^2)  Under Assumptions 1 through 3, let {(x^k, λ^k)} be the sequence generated by Algorithm 1 with parameters set to

  β_k = ρ_k = (k + 1)β,   P^k = (k + 1)P + L_f I,  ∀k ≥ 0,   (13)

where P is a block diagonal matrix satisfying 0 ≺ P − βA^⊤A ⪯ (µ/2)I. Then

  max{ β‖r^{t+1}‖², ‖x^{t+1} − x^*‖²_{P − βA^⊤A} } ≤ 2/((t + 1)(t + k_0 + 1)) · φ(x^*, λ^*),   (14)

where

  k_0 = 1 + 2(L_f − µ_f)/µ,   (15)

and

  φ(x, λ) = ((k_0 + 2)/(4β))‖λ‖² + ((k_0 + 2)/2)‖x^1 − x‖²_{2(P − βA^⊤A) + (L_f − µ_f)I}.

In addition, letting γ = max{2‖λ^*‖, 1 + ‖λ^*‖},

  T = t(t + 2k_0 + 3)/2,   x̄^{t+1} = (1/T) Σ_{k=1}^t (k + k_0 + 1) x^{k+1},

we have

  |F(x̄^{t+1}) − F(x^*)| ≤ (1/T) max_{‖λ‖≤γ} φ(x^*, λ),   (16a)
  ‖Ax̄^{t+1} − b‖ ≤ (1/(T max{1, ‖λ^*‖})) max_{‖λ‖≤γ} φ(x^*, λ).   (16b)

3 Accelerating randomized primal-dual block coordinate updates

In this section, we generalize Algorithm 1 to a randomized setting where the user may choose to update only a subset of blocks at each iteration. Instead of updating all M block variables, we randomly choose a subset of them to renew at each iteration. Depending on the number of processors (nodes or cores), we can choose a single block variable or multiple block variables for each update.

3.1 The algorithm

Our algorithm is an accelerated version of the randomized primal-dual coordinate update (RPDC) method recently proposed in [15]. At each iteration, it performs a block proximal gradient update on a subset of randomly selected primal variables while keeping the remaining ones fixed, followed by an update of the multipliers. Specifically, at iteration k it selects an index set S_k ⊂ {1, ..., M} of cardinality m and performs the updates

  x_i^{k+1} = argmin_{x_i} ⟨∇_i f(x^k) − A_i^⊤(λ^k − β_k r^k), x_i⟩ + g_i(x_i) + (η_k/2)‖x_i − x_i^k‖²,  if i ∈ S_k,
  x_i^{k+1} = x_i^k,  if i ∉ S_k,   (17a)
  r^{k+1} = r^k + Σ_{i∈S_k} A_i(x_i^{k+1} − x_i^k),   (17b)
  λ^{k+1} = λ^k − ρ_k r^{k+1},   (17c)

where β_k, ρ_k, and η_k are algorithm parameters whose values will be determined later. Note that we use (η_k/2)‖x_i − x_i^k‖² in (17a) for simplicity; it can be replaced by a PSD-matrix-weighted squared norm as in (7a), and our convergence results still hold.

Algorithm 2 summarizes the above method. If the parameters β_k, ρ_k, and η_k are fixed across all iterations (i.e., constant parameters), the algorithm reduces to the RPDC method in [15]. By adapting these parameters to the iteration number, we will show that Algorithm 2 enjoys a faster convergence rate than RPDC when the problem is strongly convex.

Algorithm 2: Accelerated randomized primal-dual block coordinate update method for (1)
1. Initialization: choose x^1, set λ^1 = 0, and let r^1 = Ax^1 − b.
2. for k = 1, 2, ... do
3.   Select S_k ⊂ {1, 2, ..., M} uniformly at random with |S_k| = m.
4.   Choose parameters β_k, ρ_k, and η_k.
5.   Let x^{k+1} ← (17a) and λ^{k+1} ← (17c).
6.   if a certain stopping criterion is satisfied then
7.     Return (x^{k+1}, λ^{k+1}).
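For illustration, here is a minimal sketch of Algorithm 2 on the same nonnegative QP, using the fixed-parameter choice of Theorem 3.2 (i.e., the RPDC special case) so that the random block selection and the updates (17a)-(17c) are easy to follow. The block partition, the value m = 1, and the crude bound used for L_m are our own assumptions.

```python
import numpy as np

def rpdc_qp(Q, c, A, b, blocks, n_iter=1000, beta=1.0, seed=0):
    """Sketch of Algorithm 2 with fixed parameters (the RPDC special case) on the
    nonnegative QP: f(x) = 0.5 x'Qx + c'x, g_i = indicator of x_i >= 0.
    `blocks` is a list of index arrays; here m = 1 block is renewed per iteration."""
    rng = np.random.default_rng(seed)
    n, M = Q.shape[0], len(blocks)
    theta = 1.0 / M                                   # theta = m / M with m = 1
    L_m = max(np.linalg.norm(Q[np.ix_(S, S)], 2) for S in blocks)  # crude partial Lipschitz bound
    rho = theta * beta                                # Theorem 3.2 requires 0 < rho <= theta*beta
    eta = L_m + beta * np.linalg.norm(A, 2) ** 2      # Theorem 3.2 requires eta >= L_m + beta*||A||^2
    x = np.zeros(n); lam = np.zeros(A.shape[0]); r = A @ x - b
    for _ in range(n_iter):
        S = blocks[rng.integers(M)]                   # draw one block uniformly at random
        grad_S = Q[S] @ x + c[S]                      # partial gradient of f
        step = (grad_S - A[:, S].T @ (lam - beta * r)) / eta
        x_new_S = np.maximum(0.0, x[S] - step)        # (17a): prox step, projected onto x >= 0
        r = r + A[:, S] @ (x_new_S - x[S])            # (17b)
        x[S] = x_new_S
        lam = lam - rho * r                           # (17c)
    return x, lam
```

In the experiments of Section 5, all 40 blocks are renewed at every iteration (m = M), which amounts to replacing the single random draw above by a loop over all blocks; the accelerated variant further replaces β, ρ, η by the schedule (22).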

3.2 Convergence results

In this subsection, we establish convergence results for Algorithm 2 under Assumptions 1 and 3, together with the following partial-gradient Lipschitz continuity assumption.

Assumption 4  For any S ⊂ {1, ..., M} with |S| = m, ∇_S f is Lipschitz continuous with a uniform constant L_m.

Note that if ∇f is Lipschitz continuous with constant L_f, then L_m ≤ L_f. Similar to the analysis in Section 2, we first establish a result for running one iteration of Algorithm 2. Throughout this section, we denote θ = m/M.

Lemma 3.1 (One-iteration analysis)  Under Assumptions 3 and 4, let {(x^k, λ^k)} be the sequence generated by Algorithm 2. Then for any x such that Ax = b, it holds that

  E[ Φ(x^{k+1}, x, λ^{k+1}) + (β_k − ρ_k)‖r^{k+1}‖² + (µ_g/2)‖x^{k+1} − x‖² ]
  ≤ (1 − θ) E[ Φ(x^k, x, λ^k) + β_k‖r^k‖² + (µ_g/2)‖x^k − x‖² ] + (L_m/2) E‖x^{k+1} − x^k‖² − (θµ_f/2) E‖x^k − x‖²
    + (β_k/2) E[ ‖A(x^{k+1} − x)‖² − ‖A(x^k − x)‖² + ‖A(x^{k+1} − x^k)‖² ]
    − (η_k/2) E[ ‖x^{k+1} − x‖² − ‖x^k − x‖² + ‖x^{k+1} − x^k‖² ].   (18)

When µ_f = µ_g = 0 (i.e., (1) is only weakly convex), Algorithm 2 has an O(1/t) convergence rate with β_k, ρ_k, η_k fixed over all iterations. This can be shown from (18) and has been established in [15]. For ease of reference, we state the result below without proof.

Theorem 3.2 (Un-accelerated convergence)  Under Assumptions 1 and 4, let {(x^k, λ^k)} be the sequence generated by Algorithm 2 with β_k = β, ρ_k = ρ, η_k = η for all k, satisfying

  0 < ρ ≤ θβ,   η ≥ L_m + β‖A‖₂²,

where ‖A‖₂ denotes the spectral norm of A. Then

  E[ F(x̄^t) − F(x^*) ] ≤ (1/(1 + θt)) max_{‖λ‖≤γ} φ(x^*, λ),   (19a)
  E‖Ax̄^t − b‖ ≤ (1/((1 + θt) max{1, ‖λ^*‖})) max_{‖λ‖≤γ} φ(x^*, λ),   (19b)

where (x^*, λ^*) satisfies the KKT conditions in (3), γ = max{2‖λ^*‖, 1 + ‖λ^*‖}, and

  x̄^t = (x^{t+1} + θ Σ_{k=1}^t x^k)/(1 + θt),   φ(x, λ) = (1 − θ)(F(x^1) − F(x)) + (η/2)‖x^1 − x‖² + θ‖λ‖²/(2ρ).

When F is strongly convex, the above O(1/t) convergence rate can be accelerated to O(1/t^2) by adaptively changing the parameters at each iteration. The following theorem is our main result. It shows the O(1/t^2) convergence rate under certain conditions on the parameters. Based on this theorem, we will then give a set of parameters that satisfies these conditions and thus provides an implementable way to choose the parameters.

Theorem 3.3  Under Assumptions 1, 3 and 4, let {(x^k, λ^k)} be the sequence generated by Algorithm 2 with parameters satisfying the following conditions for a certain number k_0:

  θ(k + k_0 + 1) ≥ 1,  ∀k ≥ 2,   (20a)
  (β_{k−1} − ρ_{k−1})(k + k_0) ≥ (1 − θ)(k + k_0 + 1)β_k,  ∀k ≥ 2,   (20b)
  (θ(k + k_0 + 1) − 1)/ρ_{k−1} ≥ (θ(k + k_0 + 2) − 1)/ρ_k,  ∀ 2 ≤ k ≤ t − 1,   (20c)
  (θ(t + k_0 + 1) − 1)/ρ_{t−1} ≥ (t + k_0 + 1)/ρ_t,   (20d)
  β_k(k + k_0 + 1) ≥ β_{k−1}(k + k_0),  ∀k ≥ 2,   (20e)
  (k + k_0 + 1)(η_k − L_m)I ⪰ β_k(k + k_0 + 1)A^⊤A,  ∀k ≥ 1,   (20f)
  (k + k_0)η_{k−1} + µ_g(θ(k + k_0 + 1) − 1) ≥ (k + k_0 + 1)(η_k − θµ_f),  ∀k ≥ 2.   (20g)

Then for any (x, λ) such that Ax = b, we have

  (t + k_0 + 1) EΦ(x^{t+1}, x, λ) + Σ_{k=2}^t (θ(k + k_0 + 1) − 1) EΦ(x^k, x, λ)
  ≤ (1 − θ)(k_0 + 2) E[ Φ(x^1, x, λ^1) + β_1‖r^1‖² + (µ_g/2)‖x^1 − x‖² ] + ((k_0 + 2)/2)(η_1 − θµ_f) E‖x^1 − x‖²
    + ((θ(k_0 + 3) − 1)/(2ρ_1)) E‖λ^1 − λ‖² − ((t + k_0 + 1)/2) E‖x^{t+1} − x‖²_{(µ_g + η_t)I − β_t A^⊤A}.   (21)

Specifying parameters that satisfy (20), we obtain the O(1/t^2) convergence rate of Algorithm 2.

Theorem 3.4 (Accelerated convergence)  Under Assumptions 1, 3 and 4, let {(x^k, λ^k)} be the sequence generated by Algorithm 2 with parameters taken as

  ρ_k = θµ(θk + 2 + θ)/(2ρ(6 − 5θ)‖A‖₂²)  for 1 ≤ k ≤ t − 1,   ρ_t = ρ_{t−1}(t + k_0 + 1)/(θ(t + k_0 + 1) − 1),   (22a)
  β_k = µ(θk + 2 + θ)/(2ρ‖A‖₂²),  ∀k ≥ 1,   (22b)
  η_k = (µ/2)(θk + 2 + θ) + L_m,  ∀k ≥ 1,   (22c)

where ρ ≥ 1 and

  k_0 = 2/θ + 2(L_m + µ_g)/(θµ).   (23)

Then

  E[ F(x̄^{t+1}) − F(x^*) ] ≤ (1/T) max_{‖λ‖≤γ} φ(x^*, λ),   E‖Ax̄^{t+1} − b‖ ≤ (1/(T max{1, ‖λ^*‖})) max_{‖λ‖≤γ} φ(x^*, λ),   (24)

where γ = max{2‖λ^*‖, 1 + ‖λ^*‖},

  x̄^{t+1} = [ (t + k_0 + 1)x^{t+1} + Σ_{k=2}^t (θ(k + k_0 + 1) − 1)x^k ] / T,
  φ(x, λ) = (1 − θ)(k_0 + 2)[ F(x^1) − F(x) + β_1‖r^1‖² + (µ_g/2)‖x^1 − x‖² ] + ((k_0 + 2)/2)(η_1 − θµ_f)‖x^1 − x‖² + ((θ(k_0 + 3) − 1)/(2ρ_1))‖λ‖²,

and T = (t + k_0 + 1) + Σ_{k=2}^t (θ(k + k_0 + 1) − 1). In addition,

  E‖x^{t+1} − x^*‖² ≤ 2φ(x^*, λ^*) / { (t + k_0 + 1)[ ((ρ − 1)µ/(2ρ))(θt + θ + 2) + µ + µ_g + L_m ] }.
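In an implementation, the schedule (22)-(23) can be computed on the fly at each iteration. The helper below is a sketch of one way to do so; the function name and the default choice ρ = 1 are illustrative assumptions.

```python
def adaptive_parameters(k, t, theta, mu, L_m, mu_g, A_norm, rho=1.0):
    """Sketch of the parameter schedule (22)-(23) of Theorem 3.4.
    Returns (beta_k, rho_k, eta_k) for iteration k out of t total iterations."""
    k0 = 2.0 / theta + 2.0 * (L_m + mu_g) / (theta * mu)             # (23)
    beta_k = mu * (theta * k + 2 + theta) / (2 * rho * A_norm ** 2)  # (22b)
    eta_k = 0.5 * mu * (theta * k + 2 + theta) + L_m                 # (22c)
    if k <= t - 1:
        rho_k = theta / (6 - 5 * theta) * beta_k                     # (22a), first case
    else:
        beta_tm1 = mu * (theta * (t - 1) + 2 + theta) / (2 * rho * A_norm ** 2)
        rho_tm1 = theta / (6 - 5 * theta) * beta_tm1
        rho_k = rho_tm1 * (t + k0 + 1) / (theta * (t + k0 + 1) - 1)  # (22a), last iteration
    return beta_k, rho_k, eta_k
```

Note that ρ_k = θβ_k/(6 − 5θ) for all k ≤ t − 1, so only the last iteration uses a different multiplier step size.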

4 Linearly convergent primal-dual method

In this section, we assume some additional structure on (1) and show that a linear rate of convergence is possible. If there is no linear constraint, Algorithm 2 reduces to the RCD method proposed in [31]. It is well known that RCD converges linearly if the objective is strongly convex. However, in the presence of linear constraints, mere strong convexity of the primal objective only ensures smoothness of the Lagrangian dual function, not its strong concavity. Hence, in general, we cannot expect linear convergence by assuming only strong convexity of the primal objective. To ensure linear convergence of both the primal and dual variables, we need additional assumptions.

Throughout this section, we suppose that at least one block variable is absent from the nonseparable part of the objective, namely f. For convenience, we rename this block variable y, and the corresponding component function and constraint coefficient matrix h and B. Specifically, we consider the problem

  min_{x,y}  f(x_1, ..., x_M) + Σ_{i=1}^M g_i(x_i) + h(y),   s.t.  Σ_{i=1}^M A_i x_i + By = b.   (25)

Towards a solution of (25), we modify Algorithm 2 by updating the y-variable after the x-update. Since there is only a single y-block, to balance the x and y updates we do not renew y at every iteration but instead update it with probability θ = m/M. Hence, roughly speaking, the x and y variables are updated with the same frequency. The method is summarized in Algorithm 3.

Algorithm 3: Randomized primal-dual block coordinate update for (25)
1. Initialization: choose (x^1, y^1), set λ^1 = 0, and choose parameters β, ρ, η_x, η_y, m. Let r^1 = Ax^1 + By^1 − b and θ = m/M.
2. for k = 1, 2, ... do
3.   Select an index set S_k ⊂ {1, ..., M} uniformly at random with |S_k| = m.
4.   Keep x_i^{k+1} = x_i^k for all i ∉ S_k and update

       x_i^{k+1} = argmin_{x_i} ⟨∇_i f(x^k) − A_i^⊤(λ^k − βr^k), x_i⟩ + g_i(x_i) + (η_x/2)‖x_i − x_i^k‖²,  if i ∈ S_k.   (26)

5.   Let r^{k+1/2} = r^k + Σ_{i∈S_k} A_i(x_i^{k+1} − x_i^k).
6.   With probability 1 − θ keep y^{k+1} = y^k, and with probability θ let y^{k+1} = ỹ^{k+1}, where

       ỹ^{k+1} = argmin_y h(y) − ⟨B^⊤(λ^k − βr^{k+1/2}), y⟩ + (η_y/2)‖y − y^k‖².   (27)

     Let r^{k+1} = r^{k+1/2} + B(y^{k+1} − y^k).
7.   Update the multiplier by

       λ^{k+1} = λ^k − ρ r^{k+1}.   (28)

8.   if a certain stopping criterion is satisfied then Return (x^{k+1}, y^{k+1}, λ^{k+1}).
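The only structural difference from Algorithm 2 is the probabilistic y-step. Below is a minimal sketch of that step, assuming for concreteness that h is quadratic, h(y) = (1/2)y^⊤Hy + q^⊤y (an assumption made only so that (27) has a closed-form minimizer; any smooth strongly convex h works in the analysis).

```python
import numpy as np

rng = np.random.default_rng(0)

def algorithm3_y_step(y, lam, r_half, B, H, q, beta, eta_y, theta):
    """Probabilistic y-update of Algorithm 3: keep y with probability 1 - theta,
    otherwise solve (27) exactly (closed form because h is assumed quadratic)."""
    if rng.random() >= theta:
        return y.copy()                                    # y^{k+1} = y^k
    n_y = y.size
    # stationarity of (27): (H + eta_y I) y~ = eta_y y^k + B'(lam - beta r^{k+1/2}) - q
    rhs = eta_y * y + B.T @ (lam - beta * r_half) - q
    return np.linalg.solve(H + eta_y * np.eye(n_y), rhs)   # y~^{k+1}
```

Drawing the Bernoulli variable independently of S_k keeps the x- and y-blocks updated at the same frequency θ, as described above.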

4.1 Technical assumptions

Assume h is differentiable. Then any primal-dual solution pair (x^*, y^*, λ^*) of (25) satisfies the KKT conditions

  0 ∈ ∂F(x^*) − A^⊤λ^*,   (29a)
  ∇h(y^*) − B^⊤λ^* = 0,   (29b)
  Ax^* + By^* − b = 0.   (29c)

Besides Assumptions 3 and 4, we make two additional assumptions.

Assumption 5  There exists (x^*, y^*, λ^*) satisfying the KKT conditions in (29).

Assumption 6  The function h is strongly convex with modulus ν, and its gradient ∇h is Lipschitz continuous with constant L_h.

The strong convexity of F and h implies

  F(x^{k+1}) − F(x^*) − ⟨∇̃F(x^*), x^{k+1} − x^*⟩ ≥ (µ/2)‖x^{k+1} − x^*‖²,   (30a)
  ⟨y^{k+1} − y^*, ∇h(y^{k+1}) − ∇h(y^*)⟩ ≥ ν‖y^{k+1} − y^*‖²,   (30b)

where ∇̃F(x^*) is a subgradient of F at x^*.

4.2 Convergence analysis

Similar to Lemma 3.1, we first establish a result for running one iteration of Algorithm 3. It can be proved by arguments similar to those used to show Lemma 3.1.

Lemma 4.1 (One-iteration analysis)  Under Assumptions 3, 4, and 6, let {(x^k, y^k, λ^k)} be the sequence generated by Algorithm 3. Then for any k and any (x, y) such that Ax + By = b, it holds that

  E[ F(x^{k+1}) − F(x) + ⟨y^{k+1} − y, ∇h(y^{k+1})⟩ − ⟨λ, Ax^{k+1} + By^{k+1} − b⟩ ] + (β − ρ) E‖r^{k+1}‖²
    + E[ ⟨x^{k+1} − x, (η_x I − βA^⊤A)(x^{k+1} − x^k)⟩ + ⟨y^{k+1} − y, (η_y I − βB^⊤B)(y^{k+1} − y^k)⟩ ]
    + (1/ρ) E⟨λ^{k+1} − λ, λ^{k+1} − λ^k⟩ − (L_m/2) E‖x^{k+1} − x^k‖² + (θµ_f/2) E‖x^k − x‖² + (µ_g/2) E‖x^{k+1} − x‖²
  ≤ (1 − θ) E[ F(x^k) − F(x) + ⟨y^k − y, ∇h(y^k)⟩ − ⟨λ, Ax^k + By^k − b⟩ ] + β(1 − θ) E‖r^k‖²
    + (1 − θ)(µ_g/2) E‖x^k − x‖² + ((1 − θ)/ρ) E⟨λ^k − λ, λ^k − λ^{k−1}⟩
    + β E⟨A(x^{k+1} − x), B(y^{k+1} − y^k)⟩ + β(1 − θ) E⟨B(y^k − y), A(x^{k+1} − x^k)⟩.   (31)

For ease of notation, we denote z = (x, y) and

  P = η_x I − βA^⊤A,   Q = η_y I − βB^⊤B.

Let

  Ψ(z^k, z^*) = F(x^k) − F(x^*) − ⟨∇̃F(x^*), x^k − x^*⟩ + ⟨y^k − y^*, ∇h(y^k) − ∇h(y^*)⟩,   (32)

and also let

  ψ(z^k, z^*; P, Q, β, ρ, c, τ) = (1 − θ) EΨ(z^k, z^*) + (β(1 − θ)/2) E‖r^k‖² + (1/2) E‖x^k − x^*‖²_P
    + ((µ_g − θµ)/2) E‖x^k − x^*‖² + (1/2) E‖y^k − y^*‖²_Q + (β(1 − θ)/(2τ)) E‖B(y^k − y^*)‖²
    + (1/(2ρ)) E[ ‖λ^k − λ^*‖² − (1 − θ)‖λ^{k−1} − λ^*‖² + (1/θ)‖λ^k − λ^{k−1}‖² ].   (33)

The following theorem is key to establishing the linear convergence of Algorithm 3.

Theorem 4.2  Under Assumptions 3 through 6, let {(x^k, y^k, λ^k)} be the sequence generated by Algorithm 3 with ρ = θβ. Let 0 < α < θ and γ = max{ 8‖A‖₂²/(αµ), 8‖B‖₂²/(αν) }. Choose δ, κ > 0 such that

  [[ 1 − (1 − θ)(1 + δ),  (1 − θ)(1 + δ) ], [ (1 − θ)(1 + δ),  κ − (1 − θ)(1 + δ) ]]  ⪰  (1/2)[[ θ,  1 − θ ], [ 1 − θ,  1/θ − (1 − θ) ]],   (34)

and positive numbers η_x, η_y, c, τ_1, τ_2, β such that

  P ⪰ β(1 − θ)τ_2 A^⊤A + L_m I,   (35a)
  Q ⪰ 8cQ^⊤Q + 4cρ²(1 − θ)(1 + 1/δ) B^⊤BB^⊤B + βτ_1 B^⊤B.   (35b)

Then it holds that

  (1 − α) EΨ(z^{k+1}, z^*) + [β(1 − θ)/2 + 1/γ] E‖r^{k+1}‖² − cρ²[κ + 2(1 − θ)(1 + 1/δ)] E‖B^⊤r^{k+1}‖² − 2c(β − ρ)² E‖B^⊤r^{k+1}‖²
    + E[ (1/2)‖x^{k+1} − x^*‖²_P + (αµ/4)‖x^{k+1} − x^*‖² + (µ_g/2)‖x^{k+1} − x^*‖² − (β/(2τ_1))‖A(x^{k+1} − x^*)‖² ]
    + E[ (1/2)‖y^{k+1} − y^*‖²_Q + (3αν/4)‖y^{k+1} − y^*‖² − 4cL_h²‖y^{k+1} − y^*‖² ]
    + [1/(2ρ) + (c/2)σ_min(BB^⊤)] E[ ‖λ^{k+1} − λ^*‖² − (1 − θ)‖λ^k − λ^*‖² + (1/θ)‖λ^{k+1} − λ^k‖² ]
  ≤ ψ(z^k, z^*; P, Q, β, ρ, c, τ_2).   (36)

Using Theorem 4.2, a linear convergence rate of Algorithm 3 follows.

Theorem 4.3 (Linear convergence)  Under Assumptions 3 through 6, let {(x^k, y^k, λ^k)} be the sequence generated by Algorithm 3 with ρ = θβ. Let 0 < α < θ and γ = max{ 8‖A‖₂²/(αµ), 8‖B‖₂²/(αν) }. Assume that B has full row rank and max{‖A‖₂, ‖B‖₂} ≤ 1. Choose δ, κ, η_x, η_y, c, β, τ_1, τ_2 satisfying (34) and (35), and, in addition,

  (αµ)/2 + θµ > β/τ_1,   (37a)
  (3αν)/4 > 4cL_h² + β(1 − θ)/(2τ_2),   (37b)
  1/γ > cρ²[κ + 2(1 − θ)(1 + 1/δ)] + 2c(β − ρ)².   (37c)

Then

  ψ(z^{k+1}, z^*; P, Q, β, ρ, c, τ_2) ≤ (1/η) ψ(z^k, z^*; P, Q, β, ρ, c, τ_2),   (38)

where

  η = min{ (1 − α)/(1 − θ),
     1 + [ 2/γ − 2cρ²(κ + 2(1 − θ)(1 + 1/δ)) − 4c(β − ρ)² ] / (β(1 − θ)),
     1 + [ 3αν/4 − 4cL_h² − β(1 − θ)/(2τ_2) ] / (η_y/2 + β(1 − θ)/(2τ_2)),
     1 + [ αµ/2 + θµ − β/τ_1 ] / (η_x + µ_g − θµ),
     1 + cρσ_min(BB^⊤) }  > 1.

Remark 4.1  We can always rescale A, B, and b without essentially altering the linear constraints; hence the assumption max{‖A‖₂, ‖B‖₂} ≤ 1 incurs no loss of generality. From (38), it is easy to see that (x^k, y^k) converges to (x^*, y^*) R-linearly in expectation. In addition, note that

  ‖λ^{k+1} − λ^*‖² − (1 − θ)‖λ^k − λ^*‖² + (1/θ)‖λ^{k+1} − λ^k‖²
    = θ‖λ^{k+1} − λ^*‖² + 2(1 − θ)⟨λ^{k+1} − λ^*, λ^{k+1} − λ^k⟩ + (1/θ − 1 + θ)‖λ^{k+1} − λ^k‖²
    ≥ ( θ − (1 − θ)²/(1/θ − 1 + θ) )‖λ^{k+1} − λ^*‖²
    = θ/(1/θ − 1 + θ) · ‖λ^{k+1} − λ^*‖².

Hence, (38) also implies R-linear convergence of λ^k to λ^* in expectation.

5 Numerical experiments

In this section, we test Algorithm 2 on the nonnegative quadratic program

  min_x F(x) = (1/2)x^⊤Qx + c^⊤x,   s.t.  Ax = b, x ≥ 0,   (39)

and compare it to the RPDC method proposed in [15]. Note that when applied to (39), RPDC can be regarded as a special case of Algorithm 2 with nonadaptive parameters.

[Figure 1 about here: four columns of panels, one for each β ∈ {1, 10, 100, 1000}; in each column the top panel plots the distance of the objective to the optimal value and the bottom panel the violation of feasibility, both against the number of epochs, for RPDC and Algorithm 2.]

Figure 1: Results by Algorithm 2 and the RPDC method in [15] with different penalty parameter β for solving (39) with problem size n = 2000 and p = 200. Top row: difference of the objective value to the optimal value |F(x^k) − F(x^*)|; bottom row: violation of feasibility ‖Ax^k − b‖.

The data were generated randomly as follows. We let Q = HDH^⊤ ∈ R^{n×n}, where H is a randomly generated (Gaussian) orthogonal matrix and D is a diagonal matrix with d_ii = 1 + 99(i − 1)/(n − 1), i = 1, ..., n. Hence the smallest and largest singular values of Q are 1 and 100, respectively, and the objective of (39) is strongly convex with modulus 1. The components of c follow the standard Gaussian distribution, and those of b follow the uniform distribution on [0, 1]. We let A = [B, I] ∈ R^{p×n} to guarantee the existence of feasible solutions, where B was generated according to the standard Gaussian distribution. In addition, we normalized A so that it has unit spectral norm.

In the test, we fixed n = 2000 and varied p over {200, 500}. For both Algorithm 2 and RPDC, we evenly partitioned x into 40 blocks (each consisting of 50 coordinates) and set m = 40, i.e., all blocks are updated at each iteration. The parameters of Algorithm 2 were set according to (22), and those of RPDC were set based on Theorem 3.2 with ρ = β and η = 100 + β for all k, where β varied over {1, 10, 100, 1000}. Figures 1 and 2 plot the objective values and feasibility violations produced by Algorithm 2 and RPDC. From these results, we see that Algorithm 2 performed well on both datasets with a single set of parameters, while the performance of RPDC was severely affected by the penalty parameter.
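As a concrete illustration, the following minimal sketch reproduces this data-generation procedure (our reading of the description above; the random seed and any further details used by the authors are unknown, so treat them as assumptions).

```python
import numpy as np

def generate_qp_data(n=2000, p=200, seed=0):
    """Sketch of the random test data of Section 5 for problem (39)."""
    rng = np.random.default_rng(seed)
    # Q = H D H' with H orthogonal (from a Gaussian matrix) and d_ii = 1 + 99(i-1)/(n-1)
    H, _ = np.linalg.qr(rng.standard_normal((n, n)))
    d = 1.0 + 99.0 * np.arange(n) / (n - 1)
    Q = (H * d) @ H.T                      # eigenvalues range from 1 to 100
    c = rng.standard_normal(n)
    b = rng.uniform(0.0, 1.0, size=p)
    B = rng.standard_normal((p, n - p))
    A = np.hstack([B, np.eye(p)])          # A = [B, I] guarantees a feasible point exists
    A /= np.linalg.norm(A, 2)              # normalize A to unit spectral norm
    return Q, c, A, b
```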

[Figure 2 about here: same panel layout as Figure 1, for p = 500.]

Figure 2: Results by Algorithm 2 and the RPDC method in [15] with different penalty parameter β for solving (39) with problem size n = 2000 and p = 500. Top row: difference of the objective value to the optimal value |F(x^k) − F(x^*)|; bottom row: violation of feasibility ‖Ax^k − b‖.

6 Conclusions

In this paper, we propose an accelerated proximal Jacobian ADMM and generalize it to an accelerated randomized primal-dual coordinate updating method for solving linearly constrained multi-block structured convex programs. We show that if the objective is strongly convex, then the methods achieve an O(1/t^2) convergence rate, where t is the total number of iterations. In addition, if one block variable is independent of the others in the objective and its part of the objective function is smooth, we modify the primal-dual coordinate updating method to achieve linear convergence. Numerical experiments on quadratic programming demonstrate the efficacy of the newly proposed methods.

References

[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
[2] D. Boley. Local linear convergence of the alternating direction method of multipliers on quadratic or linear programs. SIAM Journal on Optimization, 23(4):2183-2207, 2013.
[3] K. Bredies and H. Sun. Accelerated Douglas-Rachford methods for the solution of convex-concave saddle-point problems. arXiv preprint arXiv:1604.06282, 2016.
[4] X. Cai, D. Han, and X. Yuan. The direct extension of ADMM for three-block separable convex minimization models is convergent when one function is strongly convex. Optimization Online, 2014.
[5] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120-145, 2011.
[6] C. Chen, B. He, Y. Ye, and X. Yuan. The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Mathematical Programming, 155(1-2):57-79, 2016.
[7] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129-159, 2001.
[8] Y. Chen, G. Lan, and Y. Ouyang. Optimal primal-dual methods for a class of saddle point problems. SIAM Journal on Optimization, 24(4):1779-1814, 2014.
[9] P. L. Combettes and J.-C. Pesquet. Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping. SIAM Journal on Optimization, 25(2):1221-1248, 2015.
[10] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.
[11] C. Dang and G. Lan. Randomized methods for saddle point computation. arXiv preprint arXiv:1409.8625, 2014.
[12] W. Deng, M.-J. Lai, Z. Peng, and W. Yin. Parallel multi-block ADMM with O(1/k) convergence. arXiv preprint arXiv:1312.3040, 2013.
[13] W. Deng and W. Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Journal of Scientific Computing, 66(3):889-916, 2015.
[14] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17-40, 1976.
[15] X. Gao, Y. Xu, and S. Zhang. Randomized primal-dual proximal block coordinate updates. arXiv preprint arXiv:1605.05969, 2016.
[16] X. Gao and S. Zhang. First-order algorithms for convex optimization with nonseparate objective and coupled constraints. Optimization Online, 3:5, 2015.
[17] R. Glowinski and A. Marrocco. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires. ESAIM: Mathematical Modelling and Numerical Analysis, 9(R2):41-76, 1975.
[18] T. Goldstein, B. O'Donoghue, S. Setzer, and R. Baraniuk. Fast alternating direction optimization methods. SIAM Journal on Imaging Sciences, 7(3):1588-1623, 2014.
[19] B. He, L. Hou, and X. Yuan. On full Jacobian decomposition of the augmented Lagrangian method for separable convex programming. SIAM Journal on Optimization, 25(4):2274-2312, 2015.
[20] B. He, M. Tao, and X. Yuan. Alternating direction method with Gaussian back substitution for separable convex programming. SIAM Journal on Optimization, 22(2):313-340, 2012.
[21] B. He and X. Yuan. On the acceleration of augmented Lagrangian method for linearly constrained optimization. Optimization Online, 2010.
[22] B. He and X. Yuan. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700-709, 2012.
[23] G. M. James, C. Paulson, and P. Rusmevichientong. Penalized and constrained regression. Technical report, 2013.
[24] H. Li and Z. Lin. Optimal nonergodic O(1/k) convergence rate: When linearized ADM meets Nesterov's extrapolation. arXiv preprint arXiv:1608.06366, 2016.
[25] M. Li, D. Sun, and K.-C. Toh. A convergent 3-block semi-proximal ADMM for convex minimization problems with one strongly convex block. Asia-Pacific Journal of Operational Research, 32(04):1550024, 2015.
[26] T. Lin, S. Ma, and S. Zhang. On the global linear convergence of the ADMM with multiblock variables. SIAM Journal on Optimization, 25(3):1478-1497, 2015.
[27] Z. Lu and L. Xiao. On the complexity analysis of randomized block-coordinate descent methods. Mathematical Programming, 152(1-2):615-642, 2015.
[28] H. Markowitz. Portfolio selection. The Journal of Finance, 7(1):77-91, 1952.
[29] R. D. Monteiro and B. F. Svaiter. Iteration-complexity of block-decomposition algorithms and the alternating direction method of multipliers. SIAM Journal on Optimization, 23(1):475-507, 2013.
[30] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27(2):372-376, 1983.
[31] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.
[32] Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125-161, 2013.
[33] Y. Ouyang, Y. Chen, G. Lan, and E. Pasiliao Jr. An accelerated linearized alternating direction method of multipliers. SIAM Journal on Imaging Sciences, 8(1):644-681, 2015.
[34] Z. Peng, T. Wu, Y. Xu, M. Yan, and W. Yin. Coordinate friendly structures, algorithms and applications. Annals of Mathematical Sciences and Applications, 1(1):57-119, 2016.
[35] Z. Peng, Y. Xu, M. Yan, and W. Yin. ARock: an algorithmic framework for asynchronous parallel coordinate updates. SIAM Journal on Scientific Computing, 38(5):A2851-A2879, 2016.
[36] J.-C. Pesquet and A. Repetti. A class of randomized primal-dual algorithms for distributed optimization. arXiv preprint arXiv:1406.6404, 2014.
[37] P. Richtárik and M. Takáč. Parallel coordinate descent methods for big data optimization. Mathematical Programming, pages 1-52, 2012.
[38] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1-38, 2014.
[39] Y. Xu. Accelerated first-order primal-dual proximal methods for linearly constrained composite convex programming. arXiv preprint arXiv:1606.09155, 2016.
[40] Y. Xu. Hybrid Jacobian and Gauss-Seidel proximal block coordinate update methods for linearly constrained convex programming. arXiv preprint arXiv:1608.03928, 2016.

A Technical proofs: Section 2

In this section, we give the detailed proofs of the lemmas and theorems in Section 2.

A.1

Proof of Lemma 2.1

From (7a), we have the optimality conditions k+1 k k 0 ∈ ∇i f (xk ) − A> ) + Pik (xk+1 − xki ), i = 1, . . . , M, i (λ − βk r ) + ∂gi (xi i k+1 ) ∈ ∂g(xk+1 ), ˜ or equivalently there is ∇g(x k+1 ˜ 0 = ∇f (xk ) − A> (λk − βk rk ) + ∇g(x ) + P k (xk+1 − xk ).

Hence, for any x such that Ax = b, D E k+1 ˜ 0 = xk+1 − x, ∇f (xk ) − A> (λk − βk rk ) + ∇g(x ) + P k (xk+1 − xk ) D E µg ≥ xk+1 − x, ∇f (xk ) + g(xk+1 ) − g(x) + kxk+1 − xk2 2 D E D E k+1 k k+1 k k+1 + x − x, P (x − x ) − A(x − x), λk − βk rk ≥

Lf k+1 µf k µg f (xk+1 ) − f (x) − kx − xk k2 + kx − xk2 + g(xk+1 ) − g(x) + kxk+1 − xk2 2 2 2 E D E D k+1 k k+1 k k+1 k k (40) + x − x, P (x − x ) − Ax − b, λ − βk r

where the first inequality uses the strong convexity of g, and the second inequality is from Ax = b and the following arguments using gradient Lipschitz continuity and strong convexity of f : D E D E D E xk+1 − x, ∇f (xk ) = xk+1 − xk , ∇f (xk ) + xk − x, ∇f (xk ) Lf k+1 µf k kx − xk k2 + f (xk ) − f (x) + kx − xk2 2 2 Lf k+1 µf k = f (xk+1 ) − f (x) − kx − xk k2 + kx − xk2 . 2 2 Rearranging (40) gives (by noting F = f + g) D E Φ(xk+1 , x, λ) = F (xk+1 ) − F (x) − Axk+1 − b, λ D E D E D E ≤ Axk+1 − b, λk − βk rk − Axk+1 − b, λ − xk+1 − x, P k (xk+1 − xk ) ≥ f (xk+1 ) − f (xk ) −

+

Lf k+1 µf k µg kx − xk k2 − kx − xk2 − kxk+1 − xk2 . 2 2 2

Using the fact λk+1 = λk − ρk (Axk+1 − b), we have D E E 1 D k Axk+1 − b, λk − λ = λ − λk+1 , λk − λ ρk h i (5) 1 = kλ − λk k2 − kλ − λk+1 k2 + kλk − λk+1 k2 . 2ρk 21

(41)

(42)

In addition, D E Axk+1 − b, −βk rk D E = − βk Axk+1 − b, rk − rk+1 + rk+1 D E = − βk krk+1 k2 + βk A(xk+1 − x), A(xk+1 − xk ) i βk h (5) kA(xk+1 − x)k2 − kA(xk − x)k2 + kA(xk+1 − xk )k2 (43) = − βk krk+1 k2 + 2

Substituting (42) and (43) into (41) and also applying (5) to the cross term xk+1 − x, P k (xk+1 − xk ) gives the inequality in (8).

A.2

Proof of Theorem 2.2

First note that kλk − λk+1 k2 = ρ2k krk+1 k2 .

(44)

Since P k  βk A> A + Lf I, it holds that kxk+1 − xk k2P k ≥ βk kA(xk+1 − xk )k2 + Lf kxk+1 − xk k2 .

(45)

In addition, from the assumption in (10), we have t X k + k0 + 1 h k=1

2ρk

kλ − λk k2 − kλ − λk+1 k2

i t

=

X k0 + 2 t + k0 + 1 kλ − λ1 k2 − kλ − λt+1 k2 + 2ρ1 2ρt k=2





k + k0 + 1 k + k0 − 2ρk 2ρk−1

k0 + 2 kλ − λ1 k2 . 2ρ1



kλ − λk k2 (46)

Finally, note that t X

=



 1  βk  kA(xk+1 − x)k2 − kA(xk − x)k2 − kxk+1 − xk2P k − kxk − xk2P k 2 2 k=1  µf k µg − kxk+1 − xk2 − kx − xk2 2 2 k0 + 2 1 t + k0 + 1 t+1 kx − xk2P 1 −β1 A> A−µf I − kx − xk2P t −βt A> A+µg I 2 2 t   X + (k + k0 + 1)kxk − xk2P k −βk A> A−µf I − (k + k0 )kxk − xk2P k−1 −βk−1 A> A+µg I (k + k0 + 1)

k=2 (11)



t + k0 + 1 t+1 k0 + 2 1 kx − xk2P 1 −β1 A> A−µf I − kx − xk2P t −βt A> A+µg I . 2 2

(47)

Now multiplying k + k0 + 1 to both sides of (8) and adding it over k, we obtain (12) by using (44) through (47). 22

A.3

Proof of Theorem 2.3

From the choice of k0 , it follows that µ (2k + k0 + 1) I + (Lf − µf )I = (k + k0 )µI, ∀k. 2 Since P − βA> A  µ2 I, the above equation implies (2k + k0 + 1)(P − βA> A) + (Lf − µf )I  (k + k0 )µI, which is equivalent to h i h i (k + k0 + 1) (k + 1)P − (k + 1)βA> A + (Lf − µf )I  (k + k0 ) kP − kβA> A + (Lf + µg )I . Hence, the condition in (11) holds. In addition, it is easy to see that all conditions in (9) and (10) also hold. Therefore, we have (12), which, by taking parameters in (13) and x = x∗ , reduces to t X

(k + k0 + 1)Φ(xk+1 , x∗ , λ) +

k=1

t X (k + k0 + 1)(k + 1) k=1

2

βkrk+1 k2

t + k0 + 1 t+1 + kx − x∗ k2(t+1)(P −βA> A)+(Lf +µg )I 2 k0 + 2 k0 + 2 1 ≤ kλk2 + kx − x∗ k22(P −βA> A)+(Lf −µf )I , 4β 2

(48)

where we have used the fact λ1 = 0. Letting λ = λ∗ , we have from (4) and (48) that (by dropping nonnegative terms on the left hand side):



t + k0 + 1 t+1 (t + k0 + 1)(t + 1) βkrt+1 k2 + kx − x∗ k2(t+1)(P −βA> A)+(Lf +µg )I 2 2 k0 + 2 ∗ 2 k0 + 2 1 kλ k + kx − x∗ k22(P −βA> A)+(Lf −µf )I , 4β 2

which indicates (14). In addition, from the convexity of F , we have from (48) that for any λ, k0 + 2 k0 + 2 1 t(t + 2k0 + 3) Φ(¯ xt+1 , x∗ , λ) ≤ kλk2 + kx − x∗ k22(P −βA> A)+(Lf −µf )I , 2 4β 2 which together with Lemmas 1.2 and 1.3 implies (16).

B Technical proofs: Section 3

In this section, we give the proofs of the lemmas and theorems in Section 3.

B.1

Proof of Lemma 3.1

The x-update in (17a) can be written more compactly as k > k k xk+1 Sk = arg minh∇Sk f (x ) − ASk (λ − βk r ), xSk i + gSk (xSk ) + xSk

ηk kxSk − xkSk k2 . 2

(49)

We have the optimality condition k+1 k+1 k k k ˜ 0 = ∇Sk f (xk ) − A> Sk (λ − βk r ) + ∇gSk (xSk ) + ηk (xSk − xSk ).

˜ S (xk+1 ) is one subgradient of gS at xk+1 . Hence, for any x, it holds that where ∇g k k Sk Sk D

E k+1 k+1 k k > k k ˜ xk+1 Sk − xSk , ∇Sk f (x ) − ASk (λ − βk r ) + ∇gSk (xSk ) + ηk (xSk − xSk ) = 0.

(50)

m Recall θ = M . Note that xk+1 = xki , ∀i 6∈ Sk ; therefore, i D E k > k k E xk+1 − x , ∇ f (x ) − A (λ − β r ) S S k S k k k D Sk E k+1 k k k k = E xSk − xSk + xSk − xSk , ∇Sk f (xk ) − A> (λ − β r ) k S D E k D E k+1 k k > k k = E x − x , ∇f (x ) − A (λ − βk r ) + θE xk − x, ∇f (xk ) − A> (λk − βk rk )  

k+1 Lm k+1 k 2 k > k k k+1 k kx −x k + x − x , −A (λ − βk r ) ≥ E f (x ) − f (x ) − 2 h

i µf k +θE f (xk ) − f (x) + kx − xk2 + xk − x, −A> (λk − βk rk ) 2 h

k+1 i Lm k+1 = E f (x ) − f (x) + x − x, −A> (λk − βk rk ) − Ekxk+1 − xk k2 2 h

i θµf Ekxk − xk2 −(1 − θ)E f (xk ) − f (x) + xk − x, −A> (λk − βk rk ) + 2 h

 i Lm − = E f (xk+1 ) − f (x) + xk+1 − x, −A> λk+1 − λk+1 + λk − βk (rk+1 − rk+1 + rk ) Ekxk+1 − xk k2 2 h

i θµf −(1 − θ)E f (xk ) − f (x) + xk − x, −A> (λk − βk rk ) + Ekxk − xk2 2 h



= E f (xk+1 ) − f (x) − A(xk+1 − x), λk+1 + (βk − ρk ) A(xk+1 − x), rk+1

i Lm −βk A(xk+1 − x), A(xk+1 − xk ) − Ekxk+1 − xk k2 2 h



i θµf −(1 − θ)E f (xk ) − f (x) − A(xk − x), λk + βk A(xk − x), rk + Ekxk − xk2 , (51) 2

where in the inequality, we have used the strong convexity of f and Lipschitz continuity of ∇f , and the last equality follows from the definition of rk and the update of λk . Also, from the strong convexity of g, we have D E k+1 ˜ E xk+1 Sk − xSk , ∇gSk (xSk )   X µg kxk+1 − xi k2  ≥ E gSk (xk+1 Sk ) − gSk (xSk ) + 2 i i∈Sk




 X µg µg µg k = E g(xk+1 ) − g(xk ) + gSk (xkSk ) − gSk (xSk ) + kxk+1 − xk2 − kxk − xk2 + kx − xi k2  2 2 2 i i∈Sk    µ µ θµ g g g k+1 k k k+1 2 k 2 k 2 = E g(x ) − g(x ) + θ g(x ) − g(x) + kx − xk − kx − xk + kx − xk 2 2 2i i h h µg µg (52) = E g(xk+1 ) − g(x) + kxk+1 − xk2 − (1 − θ) g(xk ) − g(x) + kxk − xk2 . 2 2 In addition, note that E D E D k+1 k k+1 k+1 k , η (x − x ) = E x − x, η (x − x ) . E xk+1 − x Sk k Sk k Sk Sk

(53)

Plugging (51) through (53) into (50) and rearranging terms gives h i



µg E F (xk+1 ) − F (x) − A(xk+1 − x), λk+1 + (βk − ρk ) A(xk+1 − x), rk+1 + kxk+1 − xk2 2i h



µg Lm ≤ (1 − θ)E F (xk ) − F (x) − A(xk − x), λk + βk A(xk − x), rk + kxk − xk2 + Ekxk+1 − xk k2 2 2 h



i θµf Ekxk − xk2 + E βk A(xk+1 − x), A(xk+1 − xk ) − ηk xk+1 − x, xk+1 − xk . − 2 Since Ax = b, the above inequality reduces to (18) by using (5).

B.2

Proof of Theorem 3.3

We first establish a few inequalities below. They can be shown through simple arithmetic operations, and thus we omit the proofs. Proposition B.1 If (20e) holds, then t  β (t + k + 1) X βk (k + k0 + 1)  t 0 E kA(xk+1 − x)k2 − kA(xk − x)k2 ≤ EkA(xt+1 − x)k2 . (54) 2 2 k=1

Proposition B.2 If (20f) and (20g) hold, then t X βk (k + k0 + 1) k=1

2

EkA(xk+1 − xk )k2 −

t X k + k0 + 1 k=1

2

(ηk − Lm )Ekxk+1 − xk k2

t  X k + k0 + 1  − E ηk kxk+1 − xk2 − (ηk − θµf )kxk − xk2 2 k=1  t X µg θ(k + k0 + 1) − 1 µg (t + k0 + 1) t+1 2 − Ekx − xk − Ekxk − xk2 2 2 k=2 (η1 − θµf )(k0 + 2) (µg + ηt )(t + k0 + 1) ≤ Ekx1 − xk2 − Ekxt+1 − xk2 . 2 2

25

(55)

Proposition B.3 If (20c) and (20d) hold, then  t + k0 + 1 E kλt+1 − λk2 − kλt − λk2 + kλt+1 − λt k 2ρt t  X θ(k + k0 + 1) − 1  k − E kλ − λk2 − kλk−1 − λk2 + kλk − λk−1 k2 2ρk−1 k=2 θ(k0 + 3) − 1 Ekλ1 − λk2 . ≤ 2ρ1 −

(56)

Now we are ready to prove Theorem 3.3. Proof. [Proof of Theorem 3.3] Multiplying k + k0 + 1 to both sides of (18) and summing it up from k = 1 through t gives h i µg (t + k0 + 1)E Φ(xt+1 , x, λt+1 ) + (βt − ρt )krt+1 k2 + kxt+1 − xk2 2 t i X  h µ g + θ(k + k0 + 1) − 1 E Φ(xk , x, λk ) + kxk − xk2 2 k=2

t X

 (βk−1 − ρk−1 )(k + k0 ) − (1 − θ)(k + k0 + 1)βk Ekrk k2 k=2 h i µg ≤ (1 − θ)(k0 + 2)E Φ(x1 , x, λ1 ) + β1 kr1 k2 + kx1 − xk2 (57) 2 t  X βk (k + k0 + 1)  + E kA(xk+1 − x)k2 − kA(xk − x)k2 + kA(xk+1 − xk )k2 2 k=1 t  X k + k0 + 1  E ηk kxk+1 − xk2 − (ηk − θµf )kxk − xk2 + (ηk − Lm )kxk+1 − xk k2 . − 2 +

k=1

From the update of λ in (17c), we have hλk+1 − λ, Axk+1 − bi = −

1 k+1 hλ − λ, λk+1 − λk i, ρk

(58)

and thus (t + k0 + 1)hλt+1 − λ, Axt+1 − bi +

t X

 θ(k + k0 + 1) − 1 hλk − λ, Axk − bi

k=2 t X θ(k + k0 + 1) − 1

t + k0 + 1 t+1 hλ − λ, λt+1 − λt i − hλk − λ, λk − λk−1 i ρt ρk−1 k=2  t + k0 + 1 t+1 2 t = − kλ − λk − kλ − λk2 + kλt+1 − λt k 2ρt t  X θ(k + k0 + 1) − 1  k − kλ − λk2 − kλk−1 − λk2 + kλk − λk−1 k2 . 2ρk−1 = −

k=2

26

(59)

Adding (59) to (57) and rearranging terms yields (t + k0 + 1)EΦ(xt+1 , x, λ) +

t X

 θ(k + k0 + 1) − 1 EΦ(xk , x, λ)

k=2 t X  +(t + k0 + 1)(βt − ρt )Ekrt+1 k2 + (βk−1 − ρk−1 )(k + k0 ) − (1 − θ)(k + k0 + 1)βk Ekrk k2 k=2 i h µg 1 1 (60) ≤ (1 − θ)(k0 + 2)E Φ(x , x, λ ) + β1 kr1 k2 + kx1 − xk2 2 t  X βk (k + k0 + 1)  + E kA(xk+1 − x)k2 − kA(xk − x)k2 + kA(xk+1 − xk )k2 2 k=1 t  X k + k0 + 1  − E ηk kxk+1 − xk2 − (ηk − θµf )kxk − xk2 + (ηk − Lm )kxk+1 − xk k2 2 k=1  t X µg θ(k + k0 + 1) − 1 µg (t + k0 + 1) t+1 2 − Ekx − xk − Ekxk − xk2 2 2 k=2  t + k0 + 1 t+1 2 t − E kλ − λk − kλ − λk2 + kλt+1 − λt k 2ρt t  X θ(k + k0 + 1) − 1  k E kλ − λk2 − kλk−1 − λk2 + kλk − λk−1 k2 . − 2ρk−1 k=2

Therefore, we obtain (21) by substituting (54) through (56) into (60) and also noting the summation in the second line of (60) is nonnegative from the condition in (20b). 

B.3

Proof of Theorem 3.4

We first show that the parameters given in (22) satisfy the conditions in (20). Note that (23) implies k0 ≥ 4θ , and thus (20a) must hold. Also, it is easy to see that (20d) holds with equality A> A from the second equation of (22a). Since I  kAk 2 , we can easily have (20f) by plugging in βk and 2 ηk defined in (22b) and (22c) respectively. To verify (20c), we plug in ρk defined in the first equation of (22a), and it is equivalent to requiring for any 2 ≤ k ≤ t − 1 that θ(k + k0 + 1) − 1 θ(k + k0 + 2) − 1 θ(k0 + 1) − 3 θ(k0 + 1) − 3 ≥ ⇐⇒ 1 + ≥1+ . θ(k − 1) + 2 + θ θk + 2 + θ θk + 2 θk + 2 + θ The second inequality obviously holds, and thus we have (20c). Plugging in the formula of βk , (20e) is equivalent to (θk + 2 + θ)(k + k0 + 1) ≥ (θk + 2)(k + k0 ) which holds trivially, and thus (20e) follows.

27

Note that ρk = becomes

θ 6−5θ βk

⇐⇒ ⇐⇒ ⇐⇒ ⇐= ⇐⇒

for k ≤ t − 1 from their definition given in (22a) and (22b). Hence, (20b) 6 − 6θ βk−1 (k + k0 ) ≥ (1 − θ)(k + k0 + 1)βk , ∀k ≥ 2 6 − 5θ 6 (θk + 2)(k + k0 ) ≥ (k + k0 + 1)(θk + 2 + θ), ∀k ≥ 2 6 − 5θ (k + k0 + 1)(θk + 2 + θ) 6 ≥ , ∀k ≥ 2 6 − 5θ (k + k0 )(θk + 2) 6 (k0 + 3)(3θ + 2) ≥ 6 − 5θ (k0 + 2)(2θ + 2) ( 3 + 3)(3θ + 2) 6 ≥ 3θ 6 − 5θ ( θ + 2)(2θ + 2) 12(3 + 2θ) ≥ 3(6 − 5θ)(3θ + 2)

(61)

⇐⇒ 36 + 24θ ≥ 36 + 24θ − 45θ2 , where the sufficient condition in (62) uses the fact k0 ≥ 3θ , and the last inequality apparently holds. Hence, (20b) is satisfied. Finally, we show (20g). Plugging in ηk , we have that (20g) is equivalent to µ   (k + k0 ) (θk + 2) + Lm + µg θ(k + k0 + 1) − 1 2  µ (θk + 2 + θ) + Lm − θµf , ∀k ≥ 2 ≥ (k + k0 + 1) 2 µθ µ ⇐⇒ (k + k0 + 1) ≥ (θk + 2) + Lm + µg , ∀k ≥ 2 2 2 µθ (k0 + 1) ≥ µ + µg + Lm ⇐⇒ 2 2 2(Lm + µg ) ⇐⇒ k0 + 1 ≥ + , θ θµ where the last inequality is implied by (23). Therefore, (20g) holds, and we have verified all conditions in (20). Therefore, we have the inequality in (21) that, as λ1 = 0, reduces to (t + k0 + 1)EΦ(xt+1 , x, λ) +

t X

 θ(k + k0 + 1) − 1 EΦ(xk , x, λ)

k=2

h

≤ (1 − θ)(k0 + 2)E F (x1 ) − F (x) + β1 kr1 k2 + +

i k +2 µg 1 0 2 kx − xk2 + (η1 − θµf )Ekx1 − xk (62) 2 2

θ(k0 + 3) − 1 t + k0 + 1 Ekλk2 − Ekxt+1 − xk2(µg +ηt )I−βt A> A . 2ρ1 2

For ρ ≥ 1, we have >

(µg + ηt )I − βt A A 



 (ρ − 1)µ (θt + θ + 2) + µg + Lm I. 2ρ 28

Letting x = x∗ and using the convexity of F , we have from (62) and the above inequality that 

 1 E F (¯ xt+1 ) − F (x∗ ) − λ, A¯ xt+1 − b ≤ Eφ(x∗ , λ), ∀λ, T

(63)

which together with Lemmas 1.2 and 1.3 with γ = max(2kλ∗ k, 1 + kλ∗ k) indicates (24). In addition, note

µ t+1 kx − x∗ k2 . 2 Hence, letting (x, λ) = (x∗ , λ∗ ) in (62) and using (4), we have   t + k0 + 1 (ρ − 1)µ (θt + θ + 2) + µ + µg + Lm Ekxt+1 − x∗ k2 ≤ φ(x∗ , λ∗ ), 2 2ρ Φ(xt+1 , x∗ , λ∗ ) ≥

(64)

and the proof is completed.

C Technical proofs: Section 4

In this section, we provide the proofs of the lemmas and theorems in Section 4.

C.1

Proof of Lemma 4.1

Similar to (51), we have D E k > k k E xk+1 − x , ∇ f (x ) − A (λ − β r ) S S k S k k k h Sk



k+1 k+1 ≥ E f (x ) − f (x) − A(x − x), λk+1 + (β − ρ) A(xk+1 − x), rk+1

i Lm −β A(xk+1 − x), A(xk+1 − xk ) + B(y k+1 − y k ) − Ekxk+1 − xk k2 2 h i θµf



−(1 − θ)E f (xk ) − f (x) − A(xk − x), λk + β A(xk − x), rk + Ekxk − xk2 . (65) 2 From (27), the optimality condition for y˜k+1 is 1

∇h(˜ y k+1 ) − B > λk + βB > rk+ 2 + ηy (˜ y k+1 − y k ) = 0. Since Prob(y k+1 = y˜k+1 ) = θ,

Prob(y k+1 = y k ) = 1 − θ,

we have D E 1 E y k+1 − y, ∇h(y k+1 ) − B > λk + βB > rk+ 2 + ηy (y k+1 − y k ) D E 1 = (1 − θ)E y k − y, ∇h(y k ) − B > λk + βB > rk+ 2 ,

29

(66)

or equivalently, D E E y k+1 − y, ∇h(y k+1 ) − B > λk+1 + (β − ρ)B > rk+1 − βB > B(y k+1 − y k ) + ηy (y k+1 − y k ) D E D E = (1 − θ)E y k − y, ∇h(y k ) − B > λk + βB > rk + β(1 − θ)E B(y k − y), A(xk+1 − xk ) . (67) Plugging (65), (52) and (53) with ηk = ηx into (50) with βk = β and summing together with (67), we have, by rearranging the terms, that h i E F (xk+1 ) − F (x) + hy k+1 − y, ∇h(y k+1 )i + hxk+1 − x, −A> λk+1 i + hy k+1 − y, −B > λk+1 i θµf µg Lm − Ekxk+1 − xk k2 + Ekxk − xk2 + Ekxk+1 − xk2 2 2 2 h i +(β − ρ)E hxk+1 − x, A> rk+1 i + hy k+1 − y, B > rk+1 i D E D E +E xk+1 − x, (ηx I − βA> A)(xk+1 − xk ) + E y k+1 − y, (ηy I − βB > B)(y k+1 − y k ) h i ≤ (1 − θ)E F (xk ) − F (x) + hy k − y, ∇h(y k )i + hxk − x, −A> λk i + hy k − y, −B > λk i h i (1 − θ)µg Ekxk − xk2 + β(1 − θ)E hxk − x, A> rk i + hy k − y, B > rk i + D2 E D E +βE A(xk+1 − x), B(y k+1 − y k ) + β(1 − θ)E B(y k − y), A(xk+1 − xk ) . Now use (58) and the fact that Ax + By = b to have the desired result.

C.2

Proof of Theorem 4.2

Before proving Theorem 4.2, we establish a few inequalities. First, using Young’s inequality, we have the following results. Lemma C.1 For any τ1 , τ2 > 0, it holds that τ1 1 kA(xk+1 − x∗ )k2 + kB(y k+1 − y k )k2 , 2τ1 2 1 τ 2 hB(y k − y ∗ ), A(xk+1 − xk )i ≤ kB(y k − y ∗ )k2 + kA(xk+1 − xk )k2 . 2τ2 2 hA(xk+1 − x∗ ), B(y k+1 − y k )i ≤

(68) (69)

In addition, we are able to bound the λ-term by y-term and the residual r. The proofs are given in Appendix C.4 and C.5. Lemma C.2 For any δ > 0, we have EkB > (λk+1 − λ∗ )k2 − (1 − θ)(1 + δ)EkB > (λk − λ∗ )k2   ≤ 4E L2h ky k+1 − y ∗ k2 + kQ(y k+1 − y k )k2 + 2(β − ρ)2 EkB > rk+1 k2  1  +2ρ2 (1 − θ)(1 + )E kB > rk+1 k2 + kB > B(y k+1 − y k )k2 . δ 30

(70)

Lemma C.3 Assume (34). Then kB > (λk+1 − λ∗ )k2 − (1 − θ)(1 + δ)kB > (λk − λ∗ )k2 + κkB > (λk+1 − λk )k2  1 σmin (BB > )  k+1 kλ − λ∗ k2 − (1 − θ)kλk − λ∗ k2 + kλk+1 − λk k2 , ≥ 2 θ

(71)

where σmin (BB > ) denotes the smallest singular value of BB > . Now we are ready to show Theorem 4.2. Proof. [Proof of Theorem 4.2] Letting (x, y, λ) = (x∗ , y ∗ , λ∗ ) in (31), plugging (29) into it, noting Ax∗ + By ∗ = b, and using (5), we have  1  EΨ(z k+1 , z ∗ ) + (β − ρ)Ekrk+1 k2 + E kxk+1 − x∗ k2P − kxk − x∗ k2P + kxk+1 − xk k2P 2  Lm θµf 1  k+1 + E ky − y ∗ k2Q − ky k − y ∗ k2Q + ky k+1 − y k k2Q − Ekxk+1 − xk k2 + Ekxk − x∗ k2 2 2 2  µg 1  + Ekxk+1 − x∗ k2 + E kλk+1 − λ∗ k2 − kλk − λ∗ k2 + kλk+1 − λk k2 2 2ρ  1−θ  k ≤ (1 − θ)EΨ(z k , z ∗ ) + β(1 − θ)Ekrk k2 + E kλ − λ∗ k2 − kλk−1 − λ∗ k2 + kλk − λk−1 k2 (72) 2ρ



µg (1 − θ) Ekxk − x∗ k2 , +βE A(xk+1 − x∗ ), B(y k+1 − y k ) + β(1 − θ)E B(y k − y ∗ ), A(xk+1 − xk ) + 2 where Ψ is defined in (32). Adding

θ k 2ρ Ekλ

− λ∗ k2 to both sides of (72) and using (28) gives

 1  EΨ(z k+1 , z ∗ ) + (β − ρ)Ekrk+1 k2 + E kxk+1 − x∗ k2P − kxk − x∗ k2P + kxk+1 − xk k2P 2 θµ µg Lm f − Ekxk+1 − xk k2 + Ekxk − x∗ k2 + Ekxk+1 − x∗ k2 2 2 2  1  k+1 ∗ 2 k ∗ 2 k+1 + E ky − y kQ − ky − y kQ + ky − y k k2Q 2  ρ 1 1  1 + E kλk+1 − λ∗ k2 − (1 − θ)kλk − λ∗ k2 + kλk+1 − λk k2 − ( − 1)Ekrk+1 k2 2ρ θ 2 θ ≤ (1 − θ)EΨ(z k , z ∗ ) + β(1 − θ)Ekrk k2 (73)  1  k 1 1 ρ + E kλ − λ∗ k2 − (1 − θ)kλk−1 − λ∗ k2 + kλk − λk−1 k2 − ( − (1 − θ))Ekrk k2 2ρ θ 2 θ



µg (1 − θ) +βE A(xk+1 − x∗ ), B(y k+1 − y k ) + β(1 − θ)E B(y k − y ∗ ), A(xk+1 − xk ) + Ekxk − x∗ k2 . 2 Adding (68), (69), and (70) multiplied by the number c > 0 to (73), and also noting ρ = θβ, we

31

have  β(1 − θ) 1  Ekrk+1 k2 + E kxk+1 − x∗ k2P − kxk − x∗ k2P + kxk+1 − xk k2P 2 2 θµ µg Lm f − Ekxk+1 − xk k2 + Ekxk − x∗ k2 + Ekxk+1 − x∗ k2 2 2 2  1  k+1 + E ky − y ∗ k2Q − ky k − y ∗ k2Q + ky k+1 − y k k2Q 2    2 k+1  ρ2 1 > k+1 k 2 ∗ 2 k+1 k 2 − y )k −4cE Lh ky − y k + kQ(y − y )k + (1 − θ)(1 + )kB B(y 2   δ 1 1 + E kλk+1 − λ∗ k2 − (1 − θ)kλk − λ∗ k2 + kλk+1 − λk k2 2ρ h θ i EΨ(z k+1 , z ∗ ) +

+cE kB > (λk+1 − λ∗ )k2 − (1 − θ)(1 + δ)kB > (λk − λ∗ )k2 + κkB > (λk+1 − λk )k2 1 −cκEkB > (λk+1 − λk )k2 − 2cρ2 (1 − θ)(1 + )EkB > rk+1 k2 − 2c(β − ρ)2 EkB > rk+1 k2 δ   β(1 − θ − θ2 ) 1 τ1 k ∗ k 2 ≤ (1 − θ)EΨ(z , z ) + Ekr k + βE kA(xk+1 − x∗ )k2 + kB(y k+1 − y k )k2 2 2τ1 2   1 1 + E kλk − λ∗ k2 − (1 − θ)kλk−1 − λ∗ k2 + kλk − λk−1 k2 2ρ θ   µg (1 − θ) τ 1 2 k ∗ 2 k+1 k 2 kB(y − y )k + kA(x − x )k + Ekxk − x∗ k2 . (74) +β(1 − θ)E 2τ2 2 2 From (35a) and (35b), we have 1 k+1 τ2 Lm k+1 kx − xk k2P ≥ β(1 − θ) kA(xk+1 − xk )k2 + kx − xk k2 , 2 2 2

(75)

and 1 k+1 ky − y k k2Q 2 1 βτ1 ≥ 4ckQ(y k+1 − y k )k2 + 2cρ2 (1 − θ)(1 + )kB > B(y k+1 − y k )k2 + kB(y k+1 − y k )k2 ,(76) δ 2 and from (30), it follows that Ψ(z k+1 , z ∗ ) ≥ (1 − α)Ψ(z k+1 , z ∗ ) +

αµ k+1 kx − x∗ k2 + ανky k+1 − y ∗ k2 . 2

(77)

In addition, note that krk+1 k2 = kAxk+1 + By k+1 − (Ax∗ + By ∗ )k2 2 k+1 ≤ 2kAk − x∗ k2 + 2kBk22 ky k+1 − y ∗k2  αµ2 kx αν k+1 ≤ γ kxk+1 − x∗ k2 + ky − y ∗ k2 . 4 4 Adding (75) through (78) and (71) multiplierd by c to (74) and rearranging terms gives (36).

32

(78) 

C.3

Proof of Theorem 4.3

First, from 0 < α < θ, the full row-rankness of B, and (37), it is easy to see that η > 1. Second, we note the following inequalities: Ψ-term: r-term: ≥ ≥ x-term: ≥ ≥ y-term: ≥ λ-term: ≥

(1 − α)Ψ(z k+1 , z ∗ ) ≥ η(1 − θ)Ψ(z k+1 , z ∗ ),   β(1 − θ) 1  k+1 2 1  2 2 + kr k − cρ κ + 2(1 − θ)(1 + ) + 2c(β − ρ) kB > rk+1 k2 2 γ δ    β(1 − θ) 1  k+1 2 1 2 2 + kr k − cρ κ + 2(1 − θ)(1 + ) + 2c(β − ρ) krk+1 k2 2 γ δ β(1 − θ) k+1 2 η kr k , 2  αµ µ  1 k+1 β g kx − x∗ k2P + + kxk+1 − x∗ k2 − kA(xk+1 − x∗ )k2 2 4 2 2τ1  αµ µ  1 k+1 β g kx − x∗ k2P + + kxk+1 − x∗ k2 − kxk+1 − x∗ k2 2 4 2 2τ1   µg − θµ k+1 1 k+1 ∗ 2 ∗ 2 η kx − x kP + kx −x k , 2 2 1 k+1 3αν k+1 ky − y ∗ k2Q + ky − y ∗ k2 − 4cL2h ky k+1 − y ∗ k2 2 4  1 k+1 β(1 − θ) k+1 ∗ 2 ∗ 2 η ky − y kQ + ky −y k , 2 2τ2    c 1 k+1 1 > k+1 ∗ 2 k ∗ 2 k 2 + σmin (BB ) kλ − λ k − (1 − θ)kλ − λ k + kλ −λ k 2ρ 2 θ   η 1 k+1 k+1 ∗ 2 k ∗ 2 k 2 kλ − λ k − (1 − θ)kλ − λ k + kλ −λ k . 2ρ θ

Summing the above inequalities and using (36) gives (38) and completes the proof.

C.4

Proof of Lemma C.2

Let ˜ k+1 = λk − ρ(Axk+1 + B y˜k+1 − b), λ which together with (66) implies ˜ k+1 = ∇h(˜ B>λ y k+1 ) + Q(˜ y k+1 − y k ) + (β − ρ)B > (Axk+1 + B y˜k+1 − b).

(79)

Then

= ≤

EkB > (λk+1 − λ∗ )k2 ˜ k+1 − λ∗ )k2 + (1 − θ)EkB > (λk − λ∗ − ρ(Axk+1 + By k − b))k2 θEkB > (λ ˜ k+1 − λ∗ )k2 + (1 − θ)(1 + δ)EkB > (λk − λ∗ )k2 + ρ2 (1 − θ)(1 + 1 )EkB > (Axk+1 + By k − b)k2 θEkB > (λ δ 33

(79)

=

≤ =



θEk∇h(˜ y k+1 ) − ∇h(y ∗ ) + Q(˜ y k+1 − y k ) + (β − ρ)B > (Axk+1 + B y˜k+1 − b)k2 1 +(1 − θ)(1 + δ)EkB > (λk − λ∗ )k2 + ρ2 (1 − θ)(1 + )EkB > (Axk+1 + By k − b)k2 δ 2θEk∇h(˜ y k+1 ) − ∇h(y ∗ ) + Q(˜ y k+1 − y k )k2 + 2θ(β − ρ)2 EkB > (Axk+1 + B y˜k+1 − b)k2 1 +(1 − θ)(1 + δ)EkB > (λk − λ∗ )k2 + ρ2 (1 − θ)(1 + )EkB > (Axk+1 + By k − b)k2 δ 2Ek∇h(y k+1 ) − ∇h(y ∗ ) + Q(y k+1 − y k )k2 − 2(1 − θ)Ek∇h(y k ) − ∇h(y ∗ )k2 +2(β − ρ)2 EkB > (Axk+1 + By k+1 − b)k2 − 2(1 − θ)(β − ρ)2 EkB > (Axk+1 + By k − b)k2 1 +(1 − θ)(1 + δ)EkB > (λk − λ∗ )k2 + ρ2 (1 − θ)(1 + )EkB > (Axk+1 + By k − b)k2 δ   E 4L2h ky k+1 − y ∗ k2 + 4kQ(y k+1 − y k )k2 + (1 − θ)(1 + δ)EkB > (λk − λ∗ )k2  1  +2ρ2 (1 − θ)(1 + )E kB > rk+1 k2 + kB > B(y k+1 − y k )k2 + 2(β − ρ)2 EkB > rk+1 k2 , δ

which completes the proof.

C.5

Proof of Lemma C.3

Note that

=

= (34)

≥ = ≥ =

kB > (λk+1 − λ∗ )k2 − (1 − θ)(1 + δ)kB > (λk − λ∗ )k2 + κkB > (λk+1 − λk )k2 (1 − (1 − θ)(1 + δ))kB > (λk+1 − λ∗ )k2 + 2(1 − θ)(1 + δ)hB > (λk+1 − λ∗ ), B > (λk+1 − λk )i +(κ − (1 − θ)(1 + δ))kB > (λk+1 − λk )k2 " #> " #" # B > (λk+1 − λ∗ ) (1 − (1 − θ)(1 + δ))I (1 − θ)(1 + δ)I B > (λk+1 − λ∗ ) B > (λk+1 − λk ) (1 − θ)(1 + δ)I (κ − (1 − θ)(1 + δ))I B > (λk+1 − λk ) " #> " #" # 1 B > (λk+1 − λ∗ ) θI (1 − θ)I B > (λk+1 − λ∗ ) 2 B > (λk+1 − λk ) (1 − θ)I ( 1θ − (1 − θ))I B > (λk+1 − λk ) " #> " #" # 1 λk+1 − λ∗ θBB > (1 − θ)BB > λk+1 − λ∗ 2 λk+1 − λk (1 − θ)BB > ( 1θ − (1 − θ))BB > λk+1 − λk " #> " #" # σmin (BB > ) λk+1 − λ∗ θI (1 − θ)I λk+1 − λ∗ 2 λk+1 − λk (1 − θ)I ( 1θ − (1 − θ))I λk+1 − λk   σmin (BB > ) 1 k+1 k+1 ∗ 2 k ∗ 2 k 2 kλ − λ k − (1 − θ)kλ − λ k + kλ −λ k , 2 θ

which completes the proof.

34