Distributed Semi-Stochastic Optimization with Quantization Refinement


arXiv:1603.06306v1 [math.OC] 21 Mar 2016

Neil McGlohon and Stacy Patterson

Abstract— We consider the problem of regularized regression in a network of communication-constrained devices. Each node has local data and objectives, and the goal is for the nodes to optimize a global objective. We develop a distributed optimization algorithm that is based on recent work on semi-stochastic proximal gradient methods. Our algorithm employs iteratively refined quantization to limit message size. We present theoretical analysis and conditions under which the algorithm achieves a linear convergence rate. Finally, we demonstrate the performance of our algorithm through numerical simulations.

I. INTRODUCTION

We consider the problem of distributed optimization in a network where communication is constrained, for example, a wireless sensor network. In particular, we focus on problems where each node has local data and objectives, and the goal is for the nodes to learn a global objective that includes this local information. Such problems arise in networked systems applications such as estimation, prediction, resource allocation, and control.

Recent works have proposed distributed optimization methods that reduce communication by using quantization. For example, in [1], the authors propose a distributed algorithm to solve unconstrained problems based on a centralized inexact proximal gradient method [2]. In [3], the authors extend their work to constrained optimization problems. In these algorithms, the nodes compute a full gradient step in each iteration, requiring quantized communication between every pair of neighboring nodes. Quantization has also been applied in distributed consensus algorithms [4], [5], [6] and in distributed subgradient methods [7].

In this work, we address the specific problem of distributed regression with regularization over the variables across all nodes. Applications of our approach include distributed compressed sensing, LASSO, group LASSO, and regression with Elastic Net regularization, among others. Our approach is inspired by [1], [3]. We seek to further reduce per-iteration communication by using an approach based on a stochastic proximal gradient algorithm, which requires communication between only a small subset of nodes in each iteration. In general, stochastic gradient methods may suffer from slow convergence, so any per-iteration communication savings could be counteracted by an increased number of iterations. Recently, however, several works have proposed semi-stochastic gradient methods [8], [9], [10]. To reduce the variance of the iterates generated by a stochastic approach, these algorithms periodically incorporate a full gradient computation. It has been shown that these algorithms achieve a linear rate of convergence to the optimal solution.

We propose a distributed algorithm for regularized regression based on the centralized semi-stochastic proximal gradient method of [10]. In most iterations, only a subset of nodes need communicate. We further reduce communication overhead by employing quantized messaging. Our approach reduces both the length of messages sent between nodes and the total number of messages needed to converge to the optimal solution. The detailed contributions of our work are as follows:
• We extend the centralized semi-stochastic proximal gradient algorithm to include errors in the gradient computations and show the convergence rate of this inexact algorithm.
• We propose a distributed optimization algorithm, based on this centralized algorithm, that uses iteratively refined quantization to limit message size.
• We show that our distributed algorithm is equivalent to the centralized algorithm, where the errors introduced by quantization can be interpreted as inexact gradient computations. We further design quantizers that guarantee a linear convergence rate to the optimal solution.
• We demonstrate the performance of the proposed algorithm in numerical simulations.

The remainder of this paper is organized as follows. In Section II, we present the centralized inexact proximal gradient algorithm and give background on quantization. In Section III, we give the system model and problem formulation. Section IV details our distributed algorithm. Section V provides theoretical analysis of our proposed algorithm. Section VI presents our simulation results, and we conclude in Section VII.

*This work was funded in part by NSF grants 1553340 and 1527287. N. McGlohon and S. Patterson are with the Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA. [email protected], [email protected]

II. PRELIMINARIES

A. Inexact Semi-Stochastic Proximal Gradient Algorithm

We consider an optimization problem of the form
\[
\underset{x \in \mathbb{R}^P}{\text{minimize}} \quad G(x) = F(x) + R(x), \tag{1}
\]
where $F(x) = \frac{1}{N}\sum_{i=1}^{N} f_i(x)$, and the following assumptions are satisfied.

Assumption 1: Each $f_i(x)$ is differentiable, and its gradient $\nabla f_i(x)$ is Lipschitz continuous with constant $L_i$, i.e., for all $x, y \in \mathbb{R}^P$,
\[
\|\nabla f_i(x) - \nabla f_i(y)\| \le L_i \|x - y\|. \tag{2}
\]

Assumption 2: The function $R(x)$ is lower semicontinuous, convex, and its effective domain, $\mathrm{dom}(R) := \{x \in \mathbb{R}^P \mid R(x) < +\infty\}$, is closed.

Assumption 3: The function $G(x)$ is strongly convex with parameter $\mu > 0$, i.e., for all $x, y \in \mathrm{dom}(R)$ and for all $\xi \in \partial G(x)$,
\[
G(y) - G(x) - \tfrac{1}{2}\mu\|y - x\|^2 \ge \xi^T (y - x), \tag{3}
\]
where $\partial G(x)$ is the subdifferential of $G$ at $x$. This strong convexity may come from either $F(x)$ or $R(x)$ (or both).

Problem (1) can be solved using a stochastic proximal gradient algorithm [11] where, in each iteration, a single $\nabla f_\ell$ is computed for a randomly chosen $\ell \in \{1, \ldots, N\}$, and the iterate is updated accordingly as
\[
x^{(t+1)} = \mathrm{prox}_{\eta R}\big(x^{(t)} - \eta^{(t)} \nabla f_\ell(x^{(t)})\big).
\]
Here, $\mathrm{prox}_{\eta R}(\cdot)$ is the proximal operator
\[
\mathrm{prox}_{\eta R}(v) = \arg\min_{y \in \mathbb{R}^P} \tfrac{1}{2}\|y - v\|^2 + \eta R(y).
\]
While stochastic methods offer the benefit of reduced per-iteration computation over standard gradient methods, the iterates may have high variance. These methods typically use a decreasing step size $\eta^{(t)}$ to compensate for this variance, resulting in slow convergence. Recently, Xiao and Zhang proposed a semi-stochastic proximal gradient algorithm, Prox-SVRG, that reduces the variance by periodically incorporating a full gradient computation [10]. This modification allows Prox-SVRG to use a constant step size, and thus Prox-SVRG achieves a linear convergence rate.

We extend Prox-SVRG to include a zero-mean error in the gradient computation. Our resulting algorithm, Inexact Prox-SVRG, is given in Algorithm 1. The algorithm consists of an outer loop in which the full gradient is computed and an inner loop in which the iterate is updated based on both the stochastic and full gradients.

Algorithm 1 Inexact Prox-SVRG
  Initialize: $\tilde{x}^{(0)} = 0$
  for $s = 0, 1, 2, \ldots$ do
    $\tilde{g}^{(s)} = \nabla F(\tilde{x}^{(s)})$
    $x^{(s_0)} = \tilde{x}^{(s)}$
    for $t = 0, 1, 2, \ldots, T-1$ do
      Choose $\ell$ uniformly at random from $\{1, \ldots, N\}$.
      $v^{(s_t)} = \nabla f_\ell(x^{(s_t)}) - \nabla f_\ell(\tilde{x}^{(s)}) + \tilde{g}^{(s)} + e^{(s_t)}$
      $x^{(s_{t+1})} = \mathrm{prox}_{\eta R}\big(x^{(s_t)} - \eta v^{(s_t)}\big)$
    end for
    $\tilde{x}^{(s+1)} = \frac{1}{T}\sum_{t=1}^{T} x^{(s_t)}$
  end for

The following theorem states the convergence behavior of Algorithm 1.

Theorem 1: Let $\{\tilde{x}^{(s)}\}_{s \ge 0}$ be the sequence generated by Algorithm 1, with $0 < \eta < \frac{1}{4L}$, where $L = \max_i L_i$. Assume that the functions $R$, $G$, and $f_i$, $i = 1, \ldots, N$, satisfy Assumptions 1, 2, and 3, and that the errors $e^{(s_t)}$ are zero-mean and uncorrelated with the iterates $x^{(s_t)}$ and their gradients $\nabla f_i(x^{(s_t)})$. Let $x^\star = \arg\min_x G(x)$, and let $T$ be such that
\[
\alpha = \frac{1}{\mu\eta(1 - 4L\eta)T} + \frac{4L\eta(T+1)}{(1 - 4L\eta)T} < 1.
\]
Then,
\[
\mathbb{E}\big[G(\tilde{x}^{(s)}) - G(x^\star)\big] \le \alpha^s \left( G(\tilde{x}^{(0)}) - G(x^\star) + \beta \sum_{i=1}^{s} \alpha^{-i} \Gamma^{(i)} \right),
\]
where $\beta = \frac{\eta}{T(1 - 4L\eta)}$ and $\Gamma^{(i)} = \sum_{t=0}^{T-1} \mathbb{E}\|e^{(i_t)}\|^2$.

The proof is given in the appendix. From this theorem, we can derive conditions for the algorithm to converge to the optimal $x^\star$. Let the sequence $\{\Gamma^{(s)}\}_{s \ge 0}$ decrease linearly at a rate $\kappa$. Then
1) If $\kappa < \alpha$, then $\mathbb{E}\big[G(\tilde{x}^{(s)}) - G(x^\star)\big]$ converges linearly with a rate of $\alpha$.
2) If $\alpha < \kappa < 1$, then $\mathbb{E}\big[G(\tilde{x}^{(s)}) - G(x^\star)\big]$ converges linearly with a rate of $\kappa$.
3) If $\kappa = \alpha$, then $\mathbb{E}\big[G(\tilde{x}^{(s)}) - G(x^\star)\big]$ converges with a rate in $O(s\alpha^s)$.
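To make the structure of Algorithm 1 concrete, the following Python sketch runs the outer/inner loops on a small synthetic problem. It is an illustrative sketch rather than the authors' implementation: the quadratic local losses, the $\ell_1$ regularizer, the step-size choice, and the injected noise level `noise_std` are all assumptions made for demonstration.

```python
import numpy as np

def prox_l1(v, thresh):
    """Proximal operator of thresh * ||.||_1 (soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def inexact_prox_svrg(grads, full_grad, prox, x0, eta, T, S, noise_std=0.0, seed=0):
    """Sketch of Algorithm 1 (Inexact Prox-SVRG).

    grads     : list of callables, grads[i](x) = grad f_i(x)
    full_grad : callable, full_grad(x) = (1/N) * sum_i grad f_i(x)
    prox      : callable, prox(v, eta) = prox_{eta R}(v)
    noise_std : std. dev. of the injected zero-mean error e^(s_t) (assumption)
    """
    rng = np.random.default_rng(seed)
    N = len(grads)
    x_tilde = x0.copy()
    for _ in range(S):                          # outer loop
        g_tilde = full_grad(x_tilde)            # full gradient, computed once per outer loop
        x = x_tilde.copy()
        inner_iterates = []
        for _ in range(T):                      # inner loop
            l = rng.integers(N)                 # component chosen uniformly at random
            e = noise_std * rng.standard_normal(x.shape)   # zero-mean gradient error
            v = grads[l](x) - grads[l](x_tilde) + g_tilde + e
            x = prox(x - eta * v, eta)
            inner_iterates.append(x)
        x_tilde = np.mean(inner_iterates, axis=0)          # average of the inner iterates
    return x_tilde

# Example: f_i(x) = 0.5*||A_i x - b_i||^2 with R(x) = lam*||x||_1 (all values illustrative).
rng = np.random.default_rng(1)
N, P, lam = 20, 10, 0.01
A = [rng.standard_normal((5, P)) for _ in range(N)]
x_data = rng.standard_normal(P)
b = [a @ x_data for a in A]
grads = [lambda x, a=a, bb=bb: a.T @ (a @ x - bb) for a, bb in zip(A, b)]
full_grad = lambda x: sum(g(x) for g in grads) / N
L = max(np.linalg.norm(a, 2) ** 2 for a in A)              # Lipschitz constant of each grad f_i
x_hat = inexact_prox_svrg(grads, full_grad, lambda v, eta: prox_l1(v, eta * lam),
                          np.zeros(P), eta=1.0 / (5 * L), T=2 * N, S=30, noise_std=1e-3)
```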

B. Subtractively Dithered Quantization

We employ a subtractively dithered quantizer to quantize values before transmission. We use a subtractively dithered quantizer rather than a non-subtractively dithered quantizer because the quantization error of the subtractively dithered quantizer is not correlated with its input. We briefly summarize the quantizer and its key properties below.

Let $z$ be a real number to be quantized into $n$ bits. The quantizer is parameterized by an interval size $U$ and a midpoint value $\bar{z} \in \mathbb{R}$. Thus the quantization interval is $[\bar{z} - U/2, \bar{z} + U/2]$, and the quantization step size is $\Delta = \frac{U}{2^n - 1}$. We first define the uniform quantizer,
\[
q(z) \triangleq \bar{z} + \mathrm{sgn}(z - \bar{z}) \cdot \Delta \cdot \left( \left\lfloor \frac{|z - \bar{z}|}{\Delta} \right\rfloor + \frac{1}{2} \right). \tag{4}
\]
In subtractively dithered quantization, a dither $\nu$ is added to $z$, the resulting value is quantized using the uniform quantizer and then transmitted, and the recipient subtracts $\nu$ from this value. The subtractively dithered quantized value of $z$, denoted $\hat{z}$, is thus
\[
\hat{z} = Q(z) \triangleq q(z + \nu) - \nu. \tag{5}
\]
Note that this quantizer requires both the sender and the recipient to use the same value of $\nu$, for example, by using the same pseudorandom number generator. The following theorem describes the statistical properties of the quantization error.

Theorem 2 (See [12]): Let $z \in [\bar{z} - U/2, \bar{z} + U/2]$ and $\hat{z} = Q(z)$, for $Q(\cdot)$ in (5). Further, let $\nu$ be a real number drawn uniformly at random from the interval $(-\Delta/2, \Delta/2)$. The quantization error $\varepsilon(z) \triangleq z - \hat{z}$ satisfies the following:
1) $\mathbb{E}[\varepsilon(z)] = \mathbb{E}[\nu] = 0$.
2) $\mathbb{E}[\varepsilon(z)^2] = \mathbb{E}[\nu^2] = \frac{\Delta^2}{12}$.
3) $\mathbb{E}[z\varepsilon(z)] = \mathbb{E}[z]\,\mathbb{E}[\varepsilon(z)] = 0$.
4) For $z_1$ and $z_2$ in the interval $[\bar{z} - U/2, \bar{z} + U/2]$, $\mathbb{E}[\varepsilon(z_1)\varepsilon(z_2)] = \mathbb{E}[\varepsilon(z_1)]\,\mathbb{E}[\varepsilon(z_2)] = 0$.

With some abuse of notation, we also write $Q(v)$ where $v$ is a vector. In this case, the quantization operator is applied to each component of $v$ independently, using a vector-valued midpoint and the same scalar-valued interval bounds.
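As a concrete illustration of (4)–(5), the sketch below implements an $n$-bit subtractively dithered quantizer and empirically checks the zero-mean, variance, and uncorrelatedness properties of Theorem 2. The class name, the shared-seed mechanism for the dither, and the test values are assumptions; only the quantization rule itself follows the definitions above.

```python
import numpy as np

class SubtractivelyDitheredQuantizer:
    """Sketch of the n-bit subtractively dithered quantizer of Section II-B.

    Sender and recipient construct this object with the same seed so that they
    draw identical dither values nu (the shared-dither requirement noted above).
    """
    def __init__(self, n_bits, midpoint, interval, seed=0):
        self.delta = interval / (2 ** n_bits - 1)      # step size Delta = U / (2^n - 1)
        self.mid = np.asarray(midpoint, dtype=float)   # midpoint z_bar
        self.rng = np.random.default_rng(seed)

    def _uniform_quantize(self, z):
        # q(z) = z_bar + sgn(z - z_bar) * Delta * (floor(|z - z_bar| / Delta) + 1/2), cf. (4)
        d = z - self.mid
        return self.mid + np.sign(d) * self.delta * (np.floor(np.abs(d) / self.delta) + 0.5)

    def quantize(self, z):
        """Q(z) = q(z + nu) - nu, cf. (5); applied componentwise for vector inputs."""
        nu = self.rng.uniform(-self.delta / 2, self.delta / 2, size=np.shape(z))
        return self._uniform_quantize(np.asarray(z, dtype=float) + nu) - nu

# Sanity check of Theorem 2: the error is (approximately) zero-mean with variance
# Delta^2 / 12 and is uncorrelated with the input.
q = SubtractivelyDitheredQuantizer(n_bits=8, midpoint=0.0, interval=2.0, seed=42)
z = np.random.default_rng(7).uniform(-1.0, 1.0, size=100_000)
err = z - q.quantize(z)
print(err.mean(), err.var(), q.delta ** 2 / 12, np.mean(z * err))
```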

III. PROBLEM FORMULATION

We consider a system model similar to that in [1]. The network is a connected graph of $N$ nodes in which inter-node communication is limited to the local neighborhood of each node. The neighbor set $N_i$ consists of node $i$'s neighbors and node $i$ itself. The neighborhoods correspond to a fixed undirected graph $G = (V, E)$, and we denote by $D$ the maximum degree of $G$.

Each node $i$ has a state vector $x_i$ of dimension $m_i$. The state of the system is $x = [x_1^T\ x_2^T\ \ldots\ x_N^T]^T$. We let $x_{N_i}$ denote the concatenation of the states of all nodes in $N_i$. For ease of exposition, we define the selecting matrices $A_i$, $i = 1, \ldots, N$, where $x_{N_i} = A_i x$, and the matrices $B_{ij}$, $i, j = 1, \ldots, N$, where $x_j = B_{ij} x_{N_i}$. These matrices each have an $\ell_2$-norm of 1. Every node $i$ has a local objective function over the states in $N_i$. The distributed optimization problem is thus
\[
\underset{x \in \mathbb{R}^P}{\text{minimize}} \quad G(x) = F(x) + R(x), \tag{6}
\]
where $F(x) = \frac{1}{N}\sum_{i=1}^{N} f_i(x_{N_i})$. We assume that Assumptions 1 and 3 are satisfied. Further, we require that the following assumptions hold.

Assumption 4: For all $i$, $\nabla f_i(x_{N_i})$ is linear or constant. This implies that, for a zero-mean random variable $\nu$, $\mathbb{E}[\nabla f_i(x_{N_i} + \nu)] = \nabla f_i(x_{N_i})$.

Assumption 5: The proximal operation $\mathrm{prox}_R(x)$ can be performed by each node locally, i.e., $\mathrm{prox}_R(x) = [\mathrm{prox}_R(x_1)^T\ \mathrm{prox}_R(x_2)^T\ \ldots\ \mathrm{prox}_R(x_N)^T]^T$.

We note that Assumption 5 holds for standard regularization functions, including those used in LASSO ($\|x\|_1$), group LASSO where each $x_i$ forms its own group, and Elastic Net regularization ($\lambda_1\|x\|_1 + \frac{\lambda_2}{2}\|x\|_2^2$).

In the next section, we present our distributed implementation of Prox-SVRG to solve Problem (6).
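As an illustration of Assumption 5, the Elastic Net proximal operator has a componentwise closed form (soft thresholding followed by shrinkage), so applying it block-by-block at the nodes gives the same result as applying it to the full stacked state. The sketch below checks this; the parameter values are arbitrary.

```python
import numpy as np

def prox_elastic_net(v, eta, lam1, lam2):
    """prox_{eta R}(v) for R(x) = lam1*||x||_1 + (lam2/2)*||x||_2^2 (componentwise)."""
    return np.sign(v) * np.maximum(np.abs(v) - eta * lam1, 0.0) / (1.0 + eta * lam2)

# Blockwise application by each node equals application to the stacked vector x.
rng = np.random.default_rng(0)
blocks = [rng.standard_normal(10) for _ in range(4)]            # x_1, ..., x_4
stacked = np.concatenate(blocks)
per_node = np.concatenate([prox_elastic_net(x_i, 0.1, 0.5, 1.0) for x_i in blocks])
assert np.allclose(per_node, prox_elastic_net(stacked, 0.1, 0.5, 1.0))
```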

IV. ALGORITHM

Our distributed algorithm is given in Algorithm 2.

Algorithm 2 Inexact semi-stochastic gradient descent, as executed by node $i$
  Parameters: inner loop size $T$, step size $\eta$
  Initialize: $\tilde{x}_i^{(0)} = 0$, $\hat{\tilde{x}}_i^{(-1)} = 0$, $\widehat{\nabla f}_i^{(-1)} = 0$
  for $s = 0, 1, \ldots$ do
    Update quantizer parameters:
      $U_{a,i}^{(s)} = C_a \kappa^{(s+1)/2}$, midpoint $x_{a,i}^{(s)} = \hat{\tilde{x}}_i^{(s-1)}$
      $U_{b,i}^{(s)} = C_b \kappa^{(s+1)/2}$, midpoint $\nabla f_{b,i}^{(s)} = \widehat{\nabla f}_i^{(s-1)}$
    Quantize local variable and send to all $j \in N_i$: $\hat{\tilde{x}}_i^{(s)} = Q_{a,i}^{(s)}(\tilde{x}_i^{(s)}) = \tilde{x}_i^{(s)} + a_i^{(s)}$
    Compute: $\nabla f_i^{(s)} = \nabla f_i(\hat{\tilde{x}}_{N_i}^{(s)})$
    Quantize gradient and send to all $j \in N_i$: $\widehat{\nabla f}_i^{(s)} = Q_{b,i}^{(s)}(\nabla f_i^{(s)}) = \nabla f_i^{(s)} + b_i^{(s)}$
    Compute: $\tilde{h}_i^{(s)} = \frac{1}{N}\sum_{j \in N_i} B_{ij} \widehat{\nabla f}_j^{(s)}$
    Compute: $v_{ij}^{(s)} = -B_{ij}\widehat{\nabla f}_j^{(s)} + \tilde{h}_i^{(s)}$ for all $j \in N_i$
    Update quantizer parameters:
      $U_{c,i}^{(s)} = C_c \kappa^{(s+1)/2}$, midpoint $x_{c,i}^{(s)} = \hat{\tilde{x}}_i^{(s)}$
      $U_{d,i}^{(s)} = C_d \kappa^{(s+1)/2}$, midpoint $\nabla f_{d,i}^{(s)} = \widehat{\nabla f}_i^{(s)}$
    $x_i^{(s_0)} = \tilde{x}_i^{(s)}$
    for $t = 0, 1, \ldots, T-1$ do
      Randomly pick $\ell \in \{1, 2, \ldots, N\}$
      if $i \in N_\ell$ then
        Quantize local variable and send to $\ell$: $\hat{x}_i^{(s_t)} = Q_{c,i}^{(s_t)}(x_i^{(s_t)}) = x_i^{(s_t)} + c_i^{(s_t)}$
        if $i = \ell$ then
          Compute: $\nabla f_i^{(s_t)} = \nabla f_i(\hat{x}_{N_i}^{(s_t)})$
          Quantize gradient and send to all $j \in N_i$: $\widehat{\nabla f}_i^{(s_t)} = Q_{d,i}^{(s_t)}(\nabla f_i^{(s_t)}) = \nabla f_i^{(s_t)} + d_i^{(s_t)}$
        end if
        Update local variable: $x_i^{(s_{t+1})} = \mathrm{prox}_{\eta R}\big(x_i^{(s_t)} - \eta(B_{i\ell}\widehat{\nabla f}_\ell^{(s_t)} + v_{i\ell}^{(s)})\big)$
      else
        Update local variable: $x_i^{(s_{t+1})} = \mathrm{prox}_{\eta R}\big(x_i^{(s_t)} - \eta \tilde{h}_i^{(s)}\big)$
      end if
    end for
    $\tilde{x}_i^{(s+1)} = \frac{1}{T}\sum_{t=1}^{T} x_i^{(s_t)}$
  end for

In each outer iteration $s$, node $i$ quantizes its iterate $\tilde{x}_i^{(s)}$ and its gradient $\nabla f_i^{(s)}$ and sends them to all of its neighbors. These values are quantized using two subtractively dithered quantizers, $Q_{a,i}^{(s)}$ and $Q_{b,i}^{(s)}$, whereby the sender (node $i$) sends an $n$-bit representation and the recipient reconstructs the value from this representation and subtracts the dither. The midpoints for $Q_{a,i}^{(s)}$ and $Q_{b,i}^{(s)}$ are set to the quantized values from the previous iteration; thus, the recipients already know these midpoints. The quantized values (after the dither is subtracted) are denoted by $\hat{\tilde{x}}_i^{(s)}$ and $\widehat{\nabla f}_i^{(s)}$, and the quantization errors are $a_i^{(s)}$ and $b_i^{(s)}$, respectively.

For every iteration $s$ of the outer loop of the algorithm, there is an inner loop of $T$ iterations. In each inner iteration, a single node $\ell$, chosen at random, computes its gradient. To do this, node $\ell$ and its neighbors exchange their states $x_i^{(s_t)}$ and gradients $\nabla f_i^{(s_t)}$. These values are quantized using two subtractively dithered quantizers, $Q_{c,i}^{(s_t)}$ and $Q_{d,i}^{(s_t)}$, whose midpoints are $\hat{\tilde{x}}_i^{(s)}$ and $\widehat{\nabla f}_i^{(s)}$. Each node sends these values to its neighbors before the inner loop, so all nodes are aware of the midpoints. The quantized values (after the dither is subtracted) are denoted by $\hat{x}_i^{(s_t)}$ and $\widehat{\nabla f}_i^{(s_t)}$, and their quantization errors are $c_i^{(s_t)}$ and $d_i^{(s_t)}$, respectively.

The quantization interval bounds $U_{a,i}^{(s)}$, $U_{b,i}^{(s)}$, $U_{c,i}^{(s)}$, and $U_{d,i}^{(s)}$ are initialized to $C_a$, $C_b$, $C_c$, and $C_d$, respectively, and in each iteration the bounds are multiplied by $\kappa^{1/2}$. Thus the quantizers are refined in each iteration.
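A minimal sketch of this quantization-refinement bookkeeping for a single node's state quantizer $Q_{a,i}$ is shown below: the interval shrinks by a factor of $\kappa^{1/2}$ each outer iteration, and the midpoint is the previously transmitted quantized value, so the recipient already knows it. The constants mirror those used in Section VI; the stand-in iterate and the helper function are assumptions made for illustration.

```python
import numpy as np

def dithered_quantize(z, midpoint, interval, n_bits, rng):
    """One n-bit subtractively dithered quantization of vector z, cf. (4)-(5)."""
    delta = interval / (2 ** n_bits - 1)
    nu = rng.uniform(-delta / 2, delta / 2, size=np.shape(z))
    d = (z + nu) - midpoint
    return midpoint + np.sign(d) * delta * (np.floor(np.abs(d) / delta) + 0.5) - nu

rng = np.random.default_rng(0)
C_a, kappa, n_bits, m_i = 50.0, 0.97, 13, 10          # constants mirroring Section VI
x_tilde_hat_prev = np.zeros(m_i)                      # quantized iterate at s = -1 is 0

for s in range(5):                                    # a few outer iterations
    U_a = C_a * kappa ** ((s + 1) / 2)                # shrinking interval U_{a,i}^{(s)}
    midpoint = x_tilde_hat_prev                       # midpoint = previously quantized iterate
    x_tilde = midpoint + 0.1 * rng.standard_normal(m_i)   # stand-in for the true iterate
    x_tilde_hat = dithered_quantize(x_tilde, midpoint, U_a, n_bits, rng)
    x_tilde_hat_prev = x_tilde_hat                    # becomes the next iteration's midpoint
```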

The quantizers limit the length of a single variable transmission to $n$ bits. In the outer loop of the algorithm, each node $i$ sends its local variable, consisting of $m_i$ quantized components, to every neighbor. It also sends its gradient, consisting of $|N_i| m_i$ quantized components, to every neighbor. Thus the number of bits exchanged by all nodes in the outer loop is $n\sum_{i=1}^{N} \big(|N_i| m_i + |N_i|^2 m_i\big)$. In each inner iteration, only the nodes $j \in N_\ell$ exchange messages. Each node $j$ quantizes its $m_j$ state variables and sends them to node $\ell$, which yields a transmission of $n\sum_{j \in N_\ell} m_j$ bits in total. In turn, node $\ell$ quantizes its gradient and sends it to all of its neighbors, which requires $n|N_\ell|^2 m_\ell$ bits in total. Thus, in each inner iteration, $n\big(|N_\ell|^2 m_\ell + \sum_{j \in N_\ell} m_j\big)$ bits are transmitted. The total number of bits transmitted in a single outer iteration is therefore
\[
n\left( \sum_{i=1}^{N} |N_i| m_i (1 + |N_i|) + \sum_{t=0}^{T-1} \Big( |N_\ell|^2 m_\ell + \sum_{j \in N_\ell} m_j \Big) \right).
\]
Let $D = \max_i |N_i|$ and $m = \max_i m_i$. An upper bound on the number of bits transmitted by the algorithm in each outer iteration is $nm(N + T)(D + D^2)$.
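As a quick check of this bound, the snippet below evaluates the exact per-outer-iteration count and the bound $nm(N + T)(D + D^2)$ for a uniform network in which every node has $|N_i| = D$ and $m_i = m$; the parameter values mirror the simulation in Section VI.

```python
# Per-outer-iteration communication for a uniform network: |N_i| = D and m_i = m for all i.
N, D, m, T, n = 40, 9, 10, 80, 13                   # values mirroring Section VI (T = 2N)

outer = sum(D * m * (1 + D) for _ in range(N))      # states + gradients exchanged in the outer loop
inner = sum(D ** 2 * m + D * m for _ in range(T))   # node l's gradient + its neighbors' states
exact = n * (outer + inner)

bound = n * m * (N + T) * (D + D ** 2)
print(exact, bound, exact <= bound)                 # with uniform sizes the bound is tight
```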

V. ALGORITHM ANALYSIS

We now present our analysis of Algorithm 2. First, we show that the algorithm is equivalent to Algorithm 1, where the quantization errors are encapsulated in the error term $e^{(s_t)}$, and we give an explicit expression for this error term.

Lemma 1: Algorithm 2 is equivalent to the Inexact Prox-SVRG method in Algorithm 1, with
\[
e^{(s_t)} = A_\ell^T\big(\nabla f_\ell(\hat{x}_{N_\ell}^{(s_t)}) - \nabla f_\ell(x_{N_\ell}^{(s_t)})\big) + A_\ell^T d_\ell^{(s_t)}
- A_\ell^T\big(\nabla f_\ell(\hat{\tilde{x}}_{N_\ell}^{(s)}) - \nabla f_\ell(\tilde{x}_{N_\ell}^{(s)})\big) - A_\ell^T b_\ell^{(s)}
+ \frac{1}{N}\sum_{i=1}^{N} A_i^T\big(\nabla f_i(\hat{\tilde{x}}_{N_i}^{(s)}) - \nabla f_i(\tilde{x}_{N_i}^{(s)})\big) + \frac{1}{N}\sum_{i=1}^{N} A_i^T b_i^{(s)}.
\]
Further, $\mathbb{E}\|e^{(s_t)}\|^2$ is upper-bounded by
\[
\mathbb{E}\|e^{(s_t)}\|^2 \le 2L^2\sum_{j \in N_\ell}\mathbb{E}\|c_j^{(s_t)}\|^2 + 2L^2\sum_{j \in N_\ell}\mathbb{E}\|a_j^{(s)}\|^2
+ \mathbb{E}\|d_\ell^{(s_t)}\|^2 + 2\mathbb{E}\|b_\ell^{(s)}\|^2 + \frac{2}{N^2}\sum_{i=1}^{N}\mathbb{E}\|b_i^{(s)}\|^2.
\]

Proof: The error $e^{(s_t)}$ is
\[
e^{(s_t)} = A_\ell^T\widehat{\nabla f}_\ell^{(s_t)} - A_\ell^T\widehat{\nabla f}_\ell^{(s)} + \frac{1}{N}\sum_{i=1}^{N} A_i^T\widehat{\nabla f}_i^{(s)}
- \left( A_\ell^T\nabla f_\ell(x_{N_\ell}^{(s_t)}) - A_\ell^T\nabla f_\ell(\tilde{x}_{N_\ell}^{(s)}) + \frac{1}{N}\sum_{i=1}^{N} A_i^T\nabla f_i(\tilde{x}_{N_i}^{(s)}) \right)
\]
\[
= A_\ell^T\big(\nabla f_\ell(\hat{x}_{N_\ell}^{(s_t)}) - \nabla f_\ell(x_{N_\ell}^{(s_t)})\big) + A_\ell^T d_\ell^{(s_t)}
- A_\ell^T\big(\nabla f_\ell(\hat{\tilde{x}}_{N_\ell}^{(s)}) - \nabla f_\ell(\tilde{x}_{N_\ell}^{(s)})\big) - A_\ell^T b_\ell^{(s)}
+ \frac{1}{N}\sum_{i=1}^{N} A_i^T\big(\nabla f_i(\hat{\tilde{x}}_{N_i}^{(s)}) - \nabla f_i(\tilde{x}_{N_i}^{(s)})\big) + \frac{1}{N}\sum_{i=1}^{N} A_i^T b_i^{(s)}.
\]
We note that all quantization errors are zero-mean. Further, by Assumption 4, $\mathbb{E}[\nabla f_i(x + \delta)] = \nabla f_i(x)$ for a zero-mean random variable $\delta$. Therefore, $\mathbb{E}[e^{(s_t)}] = 0$.

We now show that $e^{(s_t)}$ is uncorrelated with $x^{(s_t)}$ and the gradients $\nabla f_\ell(x_{N_\ell}^{(s_t)})$, $\ell = 1, \ldots, N$. Clearly, $x^{(s_t)}$ and $\nabla f_\ell(x_{N_\ell}^{(s_t)})$ are uncorrelated with the terms of $e^{(s_t)}$ containing $d_\ell^{(s_t)}$, $b_\ell^{(s)}$, and $b_i^{(s)}$. In accordance with Assumption 4, the gradients $\nabla f_\ell$ and $\nabla f_i$ are either linear or constant. If they are constant, then $\nabla f_\ell(\hat{x}_{N_\ell}^{(s_t)}) - \nabla f_\ell(x_{N_\ell}^{(s_t)}) = 0$ and $\nabla f_i(\hat{\tilde{x}}_{N_i}^{(s)}) - \nabla f_i(\tilde{x}_{N_i}^{(s)}) = 0$, so the terms in $e^{(s_t)}$ containing these differences are also 0. If they are linear, e.g., $\nabla f_\ell(z) = Hz + h$ for an appropriately sized matrix $H$ and vector $h$ (possibly 0), then
\[
\nabla f_\ell(\hat{x}_{N_\ell}^{(s_t)}) - \nabla f_\ell(x_{N_\ell}^{(s_t)}) = \big(H(x_{N_\ell}^{(s_t)} + c^{(s_t)}) + h\big) - \big(Hx_{N_\ell}^{(s_t)} + h\big) = Hc^{(s_t)}.
\]
By Theorem 2, $c^{(s_t)}$ is uncorrelated with $x^{(s_t)}$. It is clearly also uncorrelated with $\nabla f_\ell(x_{N_\ell}^{(s_t)})$. Similar arguments can be used to show that $x^{(s_t)}$ and $\nabla f_\ell(x_{N_\ell}^{(s_t)})$ are uncorrelated with the remaining terms in $e^{(s_t)}$.

With respect to $\mathbb{E}\|e^{(s_t)}\|^2$, we have
\[
\mathbb{E}\|e^{(s_t)}\|^2 = \mathbb{E}\Big\| A_\ell^T\big(\nabla f_\ell(\hat{x}_{N_\ell}^{(s_t)}) - \nabla f_\ell(x_{N_\ell}^{(s_t)})\big)
- A_\ell^T\big(\nabla f_\ell(\hat{\tilde{x}}_{N_\ell}^{(s)}) - \nabla f_\ell(\tilde{x}_{N_\ell}^{(s)})\big)
+ \frac{1}{N}\sum_{i=1}^{N} A_i^T\big(\nabla f_i(\hat{\tilde{x}}_{N_i}^{(s)}) - \nabla f_i(\tilde{x}_{N_i}^{(s)})\big) \Big\|^2
+ \mathbb{E}\Big\| A_\ell^T d_\ell^{(s_t)} + \frac{1}{N}\sum_{i=1}^{N} A_i^T b_i^{(s)} - A_\ell^T b_\ell^{(s)} \Big\|^2.
\]
The first term on the right-hand side can be bounded using the fact that $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$, as
\[
\le 2\mathbb{E}\Big\| A_\ell^T\big(\nabla f_\ell(\hat{x}_{N_\ell}^{(s_t)}) - \nabla f_\ell(x_{N_\ell}^{(s_t)})\big) \Big\|^2
+ 2\mathbb{E}\Big\| A_\ell^T\big(\nabla f_\ell(\hat{\tilde{x}}_{N_\ell}^{(s)}) - \nabla f_\ell(\tilde{x}_{N_\ell}^{(s)})\big)
- \frac{1}{N}\sum_{i=1}^{N} A_i^T\big(\nabla f_i(\hat{\tilde{x}}_{N_i}^{(s)}) - \nabla f_i(\tilde{x}_{N_i}^{(s)})\big) \Big\|^2.
\]
We now bound the first term in this expression:
\[
2\mathbb{E}\Big\| A_\ell^T\big(\nabla f_\ell(\hat{x}_{N_\ell}^{(s_t)}) - \nabla f_\ell(x_{N_\ell}^{(s_t)})\big) \Big\|^2
\le 2\mathbb{E}\big( L_\ell^2\|\hat{x}_{N_\ell}^{(s_t)} - x_{N_\ell}^{(s_t)}\|^2 \big)
\le 2L^2\sum_{j \in N_\ell}\mathbb{E}\|c_j^{(s_t)}\|^2,
\]
where the first inequality follows from Assumptions 1 and 4 and the fact that $\|A_\ell\| = 1$, and the second inequality follows from the independence of the quantization errors (Theorem 2). Next, we bound the second term:
\[
2\mathbb{E}\Big\| A_\ell^T\big(\nabla f_\ell(\hat{\tilde{x}}_{N_\ell}^{(s)}) - \nabla f_\ell(\tilde{x}_{N_\ell}^{(s)})\big)
- \frac{1}{N}\sum_{i=1}^{N} A_i^T\big(\nabla f_i(\hat{\tilde{x}}_{N_i}^{(s)}) - \nabla f_i(\tilde{x}_{N_i}^{(s)})\big) \Big\|^2
= 2\mathbb{E}\Big\| A_\ell^T\big(\nabla f_\ell(\hat{\tilde{x}}_{N_\ell}^{(s)}) - \nabla f_\ell(\tilde{x}_{N_\ell}^{(s)})\big)
- \mathbb{E}\Big[ A_\ell^T\big(\nabla f_\ell(\hat{\tilde{x}}_{N_\ell}^{(s)}) - \nabla f_\ell(\tilde{x}_{N_\ell}^{(s)})\big) \Big] \Big\|^2
\]
\[
\le 2\mathbb{E}\Big\| A_\ell^T\big(\nabla f_\ell(\hat{\tilde{x}}_{N_\ell}^{(s)}) - \nabla f_\ell(\tilde{x}_{N_\ell}^{(s)})\big) \Big\|^2
\le 2\mathbb{E}\big( L_\ell^2\|\hat{\tilde{x}}_{N_\ell}^{(s)} - \tilde{x}_{N_\ell}^{(s)}\|^2 \big)
\le 2L^2\sum_{j \in N_\ell}\mathbb{E}\|a_j^{(s)}\|^2,
\]
where the first inequality uses the fact that, for a random variable $\upsilon$, $\mathbb{E}\|\upsilon - \mathbb{E}\upsilon\|^2 = \mathbb{E}\|\upsilon\|^2 - \|\mathbb{E}\upsilon\|^2 \le \mathbb{E}\|\upsilon\|^2$. The remaining inequalities follow from Assumptions 1 and 4, the fact that $\|A_\ell\| = 1$, and the independence of the quantization errors. Finally, again from the independence of the quantization errors, we have
\[
\mathbb{E}\Big\| A_\ell^T d_\ell^{(s_t)} + \frac{1}{N}\sum_{i=1}^{N} A_i^T b_i^{(s)} - A_\ell^T b_\ell^{(s)} \Big\|^2
\le \mathbb{E}\|A_\ell^T d_\ell^{(s_t)}\|^2 + \mathbb{E}\Big\| \frac{1}{N}\sum_{i=1}^{N} A_i^T b_i^{(s)} - A_\ell^T b_\ell^{(s)} \Big\|^2
\le \mathbb{E}\|d_\ell^{(s_t)}\|^2 + 2\mathbb{E}\|b_\ell^{(s)}\|^2 + \frac{2}{N^2}\sum_{i=1}^{N}\mathbb{E}\|b_i^{(s)}\|^2.
\]
Combining these bounds, we obtain the desired result,
\[
\mathbb{E}\|e^{(s_t)}\|^2 \le 2L^2\sum_{j \in N_\ell}\mathbb{E}\|c_j^{(s_t)}\|^2 + 2L^2\sum_{j \in N_\ell}\mathbb{E}\|a_j^{(s)}\|^2
+ \mathbb{E}\|d_\ell^{(s_t)}\|^2 + 2\mathbb{E}\|b_\ell^{(s)}\|^2 + \frac{2}{N^2}\sum_{i=1}^{N}\mathbb{E}\|b_i^{(s)}\|^2.
\]

We next show that if all of the values fall within their respective quantization intervals, then the error term $\Gamma^{(s)}$ decreases linearly with rate $\kappa$, and thus the algorithm converges to the optimal solution linearly with rate $\kappa$.

Theorem 3: Given $p$, if for all $1 \le s \le p - 1$ the values of $\tilde{x}_i^{(s)}$, $\nabla f_i^{(s)}$, $x^{(s_t)}$, and $\nabla f_i^{(s_t)}$ fall inside the respective quantization intervals of $Q_{a,i}^{(s)}$, $Q_{b,i}^{(s)}$, $Q_{c,i}^{(s_t)}$, and $Q_{d,i}^{(s_t)}$, then $\Gamma^{(k)} \le C\kappa^k$, where
\[
C = \frac{DTm}{12(2^n - 1)^2}\left( 2L^2(C_a + C_c) + 2\Big(\frac{N+1}{N}\Big)C_b + C_d \right),
\]
with $D = \max_i |N_i|$ and $m = \max_i m_i$. It follows that, for $\alpha < \kappa < 1$,
\[
\mathbb{E}\big[G(\tilde{x}^{(s)}) - G(x^\star)\big]
\le \kappa^s\left( G(\tilde{x}^{(0)}) - G(x^\star) + \beta C\Big(\frac{1}{1 - \frac{\alpha}{\kappa}}\Big) \right).
\]

Proof: First, we note that, by Theorem 2 and the update rule for the quantization intervals, we have
\[
\mathbb{E}\|a_i^{(s)}\|^2 \le \frac{m}{12}\left(\frac{U_{a,i}^{(s)}}{2^n - 1}\right)^2 \le \frac{m}{12(2^n - 1)^2}\,C_a\kappa^s, \qquad
\mathbb{E}\|b_i^{(s)}\|^2 \le \frac{Dm}{12}\left(\frac{U_{b,i}^{(s)}}{2^n - 1}\right)^2 \le \frac{Dm}{12(2^n - 1)^2}\,C_b\kappa^s,
\]
\[
\mathbb{E}\|c_i^{(s_t)}\|^2 \le \frac{m}{12}\left(\frac{U_{c,i}^{(s)}}{2^n - 1}\right)^2 \le \frac{m}{12(2^n - 1)^2}\,C_c\kappa^s, \qquad
\mathbb{E}\|d_i^{(s_t)}\|^2 \le \frac{Dm}{12}\left(\frac{U_{d,i}^{(s)}}{2^n - 1}\right)^2 \le \frac{Dm}{12(2^n - 1)^2}\,C_d\kappa^s.
\]
We use these inequalities to bound $\mathbb{E}\|e^{(s_t)}\|^2$:
\[
\mathbb{E}\|e^{(s_t)}\|^2 \le 2L^2 D\frac{m}{12(2^n - 1)^2}C_c\kappa^s + 2L^2 D\frac{m}{12(2^n - 1)^2}C_a\kappa^s + \frac{Dm}{12(2^n - 1)^2}C_d\kappa^s
+ 2\frac{Dm}{12(2^n - 1)^2}C_b\kappa^s + \frac{2}{N}\frac{Dm}{12(2^n - 1)^2}C_b\kappa^s
= \frac{Dm}{12(2^n - 1)^2}\left( 2L^2(C_a + C_c) + 2\Big(\frac{N+1}{N}\Big)C_b + C_d \right)\kappa^s.
\]
Summing over $t = 0, \ldots, T-1$, we obtain
\[
\Gamma^{(s)} = \sum_{t=0}^{T-1}\mathbb{E}\|e^{(s_t)}\|^2 \le C\kappa^s,
\]
where
\[
C = \frac{DTm}{12(2^n - 1)^2}\left( 2L^2(C_a + C_c) + 2\Big(\frac{N+1}{N}\Big)C_b + C_d \right).
\]
Applying Theorem 1, with $\kappa > \alpha$, we have
\[
\mathbb{E}\big[G(\tilde{x}^{(s)}) - G(x^\star)\big]
\le \alpha^s\left( G(\tilde{x}^{(0)}) - G(x^\star) + \beta\sum_{i=1}^{s}\alpha^{-i}C\kappa^i \right)
\le \kappa^s\left( G(\tilde{x}^{(0)}) - G(x^\star) + C\beta\sum_{i=1}^{s}\kappa^{-(s-i)}\alpha^{s-i} \right)
\]
\[
\le \kappa^s\left( G(\tilde{x}^{(0)}) - G(x^\star) + C\beta\,\frac{1 - \big(\frac{\alpha}{\kappa}\big)^s}{1 - \frac{\alpha}{\kappa}} \right)
\le \kappa^s\left( G(\tilde{x}^{(0)}) - G(x^\star) + C\beta\Big(\frac{1}{1 - \frac{\alpha}{\kappa}}\Big) \right).
\]

While we do not yet have theoretical guarantees that all values will fall within their quantization intervals, our simulations indicate that it is always possible to find parameters $C_a$, $C_b$, $C_c$, and $C_d$ for which all values lie within their quantization intervals in all iterations. Thus, in practice, our algorithm achieves a linear convergence rate. We anticipate that it is possible to develop a programmatic approach, similar to that in [1], to identify values for $C_a$, $C_b$, $C_c$, and $C_d$ that guarantee linear convergence. This is a subject of current work.

VI. NUMERICAL EXAMPLE

This section illustrates the performance of Algorithm 2 by solving a distributed linear regression problem with Elastic Net regularization. We randomly generate a $d$-regular graph with $N = 40$ nodes and uniform degree 8, i.e., $|N_i| = 9$ for all $i$. We set each subsystem size $m_i$ to 10. Each node has a local function $f_i(x_{N_i}) = \|H_i x_{N_i} - h_i\|^2$, where $H_i$ is an $80 \times 90$ random matrix. We generate $h_i$ by first generating a random vector $x$ and then computing $h_i = H_i x$. The global objective function is
\[
G(x) = \frac{1}{N}\sum_{i=1}^{N} f_i(x_{N_i}) + \lambda_1\|x\|_1 + \frac{\lambda_2}{2}\|x\|_2^2.
\]
The simulation was implemented in Matlab, and the optimal value $x^\star$ was computed using CVX. We set the total number of inner iterations to $T = 2N$ and use the step size $\eta = 0.1/L$. With these values, $\alpha < 1$, as required by Theorem 1. We set $\kappa = 0.97$, which ensures that $\kappa > \alpha$. We use the quantization parameters $C_a = 50$, $C_b = 300$, $C_c = 50$, $C_d = 400$. With these parameters, the algorithm's values always fell within their quantization intervals.
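A rough Python sketch of this problem setup is given below; the paper's experiments were implemented in Matlab with CVX, so the ring-based 8-regular neighborhoods, the noise-free data generation, and the values of λ1 and λ2 here are assumptions made only to illustrate the construction.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, deg, lam1, lam2 = 40, 10, 8, 0.1, 1.0      # lam1, lam2 are illustrative choices

# 8-regular neighborhoods on a ring (a stand-in for the paper's random d-regular graph);
# each neighborhood includes the node itself, so |N_i| = 9.
neighbors = [sorted({(i + k) % N for k in range(-deg // 2, deg // 2 + 1)}) for i in range(N)]

# Local data: f_i(x_{N_i}) = ||H_i x_{N_i} - h_i||^2 with H_i of size 80 x 90,
# and h_i generated from a common random vector x_true.
x_true = rng.standard_normal(N * m)
H = [rng.standard_normal((80, 9 * m)) for _ in range(N)]
h = [H[i] @ np.concatenate([x_true[j * m:(j + 1) * m] for j in neighbors[i]]) for i in range(N)]

def G(x):
    """Global objective (1/N) sum_i f_i(x_{N_i}) + lam1*||x||_1 + (lam2/2)*||x||_2^2."""
    total = 0.0
    for i in range(N):
        x_Ni = np.concatenate([x[j * m:(j + 1) * m] for j in neighbors[i]])
        total += np.sum((H[i] @ x_Ni - h[i]) ** 2)
    return total / N + lam1 * np.sum(np.abs(x)) + 0.5 * lam2 * np.sum(x ** 2)

print(G(np.zeros(N * m)), G(x_true))   # x_true has zero data-fit error, only the regularizer
```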

Fig. 1 shows the performance of the algorithm where the number of bits $n$ is 11, 13, and 15, as well as the performance of the algorithm without quantization. In these results, $x^{(s)}$ …

[Fig. 1. $\|x^{(s)} - x^\star\|$ for $n = 11$, $n = 13$, $n = 15$, and for the algorithm with no quantization error.]

APPENDIX

… if $F(x)$ ($R(x)$) is strongly convex, then $\mu_F$ ($\mu_R$) is its strong convexity parameter; if $F(x)$ ($R(x)$) is only convex, then $\mu_F$ ($\mu_R$) is 0. For any $x \in \mathrm{dom}(R)$ and any $v \in \mathbb{R}^P$, define
\[
x^{+} = \mathrm{prox}_{\eta R}(x - \eta v), \qquad h = \tfrac{1}{\eta}(x - x^{+}), \qquad \Delta = v - \nabla F(x),
\]
where $0 < \eta$ …
