On Nonconvex Decentralized Gradient Descent
arXiv:1608.05766v3 [math.OC] 13 Jun 2017
Jinshan Zeng and Wotao Yin
Abstract—Consensus optimization has received considerable attention in recent years. A number of decentralized algorithms have been proposed for convex consensus optimization. However, our understanding of the behaviors of these algorithms on nonconvex consensus optimization is more limited. When we lose convexity, we cannot hope that our algorithms always return global solutions, though they sometimes still do. Somewhat surprisingly, the decentralized consensus algorithms DGD and Prox-DGD retain most of the other properties that are known in the convex setting. In particular, when diminishing (or constant) step sizes are used, we can prove convergence to a (or a neighborhood of a) consensus stationary solution and obtain guaranteed rates of convergence. It is worth noting that Prox-DGD can handle nonconvex nonsmooth functions as long as their proximal operators can be computed. Such functions include SCAD and the ℓq quasi-norms, q ∈ [0, 1). Similarly, Prox-DGD can handle a constraint to a nonconvex set with an easy projection. To establish these properties, we introduce a completely new line of analysis, as well as modify existing proofs that were used in the convex setting.
Index Terms—Nonconvex decentralized computing, consensus optimization, decentralized gradient descent method, proximal decentralized gradient descent
I. INTRODUCTION

We consider an undirected, connected network of n agents and the following consensus optimization problem defined on the network:

minimize_{x ∈ R^p} f(x) ≜ ∑_{i=1}^n f_i(x),   (1)
where f_i is a differentiable function known only to agent i. We also consider the consensus optimization problem in the following differentiable+proximable∗ form:

minimize_{x ∈ R^p} s(x) ≜ ∑_{i=1}^n (f_i(x) + r_i(x)),   (2)
where f_i and r_i are differentiable and proximable functions, respectively, known only to agent i. Each function r_i is possibly non-differentiable or nonconvex, or both. The models (1) and (2) find applications in decentralized averaging, learning, estimation, and control. Some specific examples include: (i) distributed compressed sensing and machine learning problems, where f_i is the data-fidelity term, which is often differentiable, and r_i is a sparsity-promoting regularizer such as the ℓq (quasi-)norm with 0 ≤ q ≤ 1 [21], [27];

J. Zeng is with the College of Computer Information Engineering, Jiangxi Normal University, Nanchang, Jiangxi 330022, China (email: [email protected]).
W. Yin is with the Department of Mathematics, University of California, Los Angeles, CA 90095, USA (email: [email protected]).
∗ We call a function proximable if its proximal operator prox_{αf}(y) ≜ argmin_x αf(x) + (1/2)‖x − y‖² is easy to compute.

(ii) optimization problems with per-agent constraints,
where f_i is a differentiable objective function of agent i and r_i is the indicator function of the constraint set of agent i, that is, r_i(x) = 0 if x satisfies the constraint and ∞ otherwise [7], [20].
When the f_i's are convex, the existing algorithms include the (sub)gradient methods [6], [8], [16], [25], [28], [40], [45], and the primal-dual domain methods such as the decentralized alternating direction method of multipliers (D-ADMM) [34], [35], [7] and EXTRA [36], [37]. However, when the f_i's are nonconvex, few algorithms have convergence guarantees. Some existing results include [3], [4], [13], [23], [24], [38], [39], [19], [41], [42], [47]. In spite of the algorithms and their analysis in these works, the convergence of the simple algorithm Decentralized Gradient Descent (DGD) [28] under nonconvex f_i's is still unknown. Furthermore, although DGD is slower than D-ADMM and EXTRA on convex problems, DGD is simpler and thus easier to extend to a variety of settings such as [31], [44], [26], [15], where online processing and delay tolerance are considered. Therefore, we expect our results to motivate future adoptions of nonconvex DGD.
This paper studies the convergence of two algorithms: DGD for solving problem (1) and Prox-DGD for solving problem (2). In each DGD iteration, every agent locally computes a gradient and then updates its variable by combining the average of its neighbors' variables with a negative gradient step. In each Prox-DGD iteration, every agent locally computes a gradient of f_i and a proximal map of r_i, as well as exchanges information with its neighbors. Both algorithms can use either a fixed step size or a sequence of decreasing step sizes. When the problem is convex and a fixed step size is used, DGD does not converge to a solution of the original problem (1) but to a point in its neighborhood [45]. This motivates the use of decreasing step sizes such as in [8], [16].
Assuming the f_i's are convex and have Lipschitz continuous and bounded gradients, [8] shows that decreasing step sizes α_k = 1/√k lead to a convergence rate O(ln k/√k) of the running best of objective errors. [16] uses nested loops and shows an outer-loop convergence rate O(1/k²) of objective errors, utilizing Nesterov's acceleration, provided that the inner loop performs substantial consensus computation. Without a substantial inner loop, their single-loop algorithm using the decreasing step sizes α_k = 1/k^{1/3} has a reduced rate O(ln k/k).
The objective of this paper is two-fold: (a) we aim to show that, other than losing global optimality, most existing convergence results of DGD and Prox-DGD that are known in the convex setting remain valid in the nonconvex setting, and (b) to achieve (a), we illustrate how to tailor nonconvex analysis tools for decentralized optimization. In particular, our asymptotic exact and inexact consensus results require new treatments because they are special to decentralized algorithms.
The analytic results of this paper can be summarized as
follows. (a) When a fixed step size α is used and properly bounded, the DGD iterates converge to a stationary point of a Lyapunov function. The difference between each local estimate of x and the global average of all local estimates is bounded, and the bound is proportional to α. (b) When a decreasing step size α_k = O(1/(k + 1)^ε) is used, where 0 < ε ≤ 1 and k is the iteration number, the objective sequence converges, the iterates of DGD are asymptotically consensual (i.e., become equal to one another), and they achieve this at the rate of O(1/(k + 1)^ε). Moreover, we show the convergence of DGD to a stationary point of the original problem, and derive the convergence rates of DGD with different ε for objective functions that are convex. (c) The convergence analysis of DGD can be extended to the algorithm Prox-DGD for solving problem (2). However, when the proximable functions r_i's are nonconvex, the mixing matrix is required to be positive definite and a smaller step size is also required. (Otherwise, the mixing matrix can be non-definite.)
The detailed comparisons between our results and the existing results on DGD and Prox-DGD are presented in Tables I and II. The global objective error rate in these two tables refers to the rate of {f(x̄^k) − f(x_opt)} or {s(x̄^k) − s(x_opt)}, where x̄^k = (1/n)∑_{i=1}^n x^k_{(i)} is the average of the kth iterate and x_opt is a global solution. The comparisons beyond DGD and Prox-DGD are presented in Section IV and Table III.
New proof techniques are introduced in this paper, particularly in the analysis of convergence of DGD and Prox-DGD with decreasing step sizes. Specifically, the convergence of the objective sequence and the convergence to a stationary point of the original problem with decreasing step sizes are justified by introducing a Lyapunov function and several new lemmas (cf. Lemmas 9, 12, and the proof of Theorem 2).
Moreover, we estimate the consensus rate by introducing an auxiliary sequence and then showing that both sequences have the same rates (cf. the proof of Proposition 3). All these proof techniques are new and distinguish our paper from the existing works such as [8], [16], [28], [3], [13], [23], [39], [42].
The rest of this paper is organized as follows. Section II describes the problem setup and reviews the algorithms. Section III presents our assumptions and main results. Section IV discusses related works. Section V presents the proofs of our main results. We conclude this paper in Section VI.
Notation: Let I denote the identity matrix of size n×n, and let 1 ∈ R^n denote the vector of all 1's. For a matrix X, X^T denotes its transpose, X_{ij} denotes its (i, j)th component, and ‖X‖ ≜ √⟨X, X⟩ = √(∑_{i,j} X_{ij}²) is its Frobenius norm, which simplifies to the Euclidean norm when X is a vector. Given a symmetric, positive semidefinite matrix G ∈ R^{n×n}, we let ‖X‖²_G ≜ ⟨X, GX⟩ be the induced semi-norm. Given a function h, dom(h) denotes its domain.

II. PROBLEM SETUP AND ALGORITHM REVIEW
Consider a connected undirected network G = {V, E}, where V is a set of n nodes and E is the edge set. Any edge
(i, j) ∈ E represents a communication link between nodes i and j. Let x_{(i)} ∈ R^p denote the local copy of x at node i. We reformulate the consensus problem (1) into the equivalent problem:

minimize_x 1^T f(x) ≜ ∑_{i=1}^n f_i(x_{(i)}),   (3)
subject to x_{(i)} = x_{(j)}, ∀(i, j) ∈ E,

where x ∈ R^{n×p} and f(x) ∈ R^n with
x ≜ [x_{(1)}, x_{(2)}, . . . , x_{(n)}]^T ∈ R^{n×p},  f(x) ≜ [f_1(x_{(1)}), f_2(x_{(2)}), . . . , f_n(x_{(n)})]^T ∈ R^n.

In addition, the gradient of f(x) is

∇f(x) ≜ [∇f_1(x_{(1)}), ∇f_2(x_{(2)}), . . . , ∇f_n(x_{(n)})]^T ∈ R^{n×p}.
The ith rows of the matrices x and ∇f(x), and of the vector f(x), correspond to agent i. The analysis in this paper applies to any integer p ≥ 1. For simplicity, one can let p = 1 and treat x and ∇f(x) as vectors (rather than matrices).
The algorithm DGD [28] for (3) is described as follows: Pick an arbitrary x^0. For k = 0, 1, . . . , compute

x^{k+1} ← W x^k − α_k ∇f(x^k),   (4)
where W is a mixing matrix and α_k > 0 is a step-size parameter.
Similarly, we can reformulate the composite problem (2) in the following equivalent form:

minimize_x ∑_{i=1}^n (f_i(x_{(i)}) + r_i(x_{(i)})),   (5)
subject to x_{(i)} = x_{(j)}, ∀(i, j) ∈ E.
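To fix ideas, the DGD update (4) can be simulated on a small example. The following sketch (pure Python, with a hypothetical 3-agent path graph, quadratic local objectives, and a Metropolis mixing matrix; all numbers are illustrative choices, not from the paper) shows the typical fixed-step behavior: the average of the local copies drives ∑_i ∇f_i to zero, while the copies themselves remain only nearly consensual.

```python
# Hypothetical toy instance of the DGD update (4): n = 3 agents on a
# path graph, local objectives f_i(x) = (x - b_i)^2 / 2, so that
# grad f_i(x) = x - b_i and L_f = 1.  W is the Metropolis mixing matrix
# of the path: symmetric, doubly stochastic, null(I - W) = span{1}.

W = [[2/3, 1/3, 0.0],
     [1/3, 1/3, 1/3],
     [0.0, 1/3, 2/3]]
b = [0.0, 1.0, 2.0]      # local data; the global minimizer is mean(b) = 1
alpha = 0.1              # fixed step size, below (1 + lambda_n(W)) / L_f

x = [0.0, 0.0, 0.0]
for _ in range(300):
    mixed = [sum(W[i][j] * x[j] for j in range(3)) for i in range(3)]
    x = [mixed[i] - alpha * (x[i] - b[i]) for i in range(3)]   # update (4)

avg = sum(x) / 3
spread = max(abs(xi - avg) for xi in x)
# avg approaches mean(b) = 1, while spread stays at an O(alpha) level,
# consistent with a consensual bound proportional to alpha.
```

With these illustrative numbers the iterates settle near (0.77, 1.00, 1.23): the average is exactly the minimizer of ∑_i f_i, but each local copy is off by an amount proportional to α.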
Let r(x) ≜ ∑_{i=1}^n r_i(x_{(i)}). The algorithm Prox-DGD can be applied to the above problem (5).
Prox-DGD: Take an arbitrary x^0. For k = 0, 1, . . . , perform

x^{k+1} ← prox_{α_k r}(W x^k − α_k ∇f(x^k)),   (6)

where the proximal operator is

prox_{α_k r}(x) ≜ argmin_{u ∈ R^{n×p}} α_k r(u) + (1/2)‖u − x‖².   (7)

III. ASSUMPTIONS AND MAIN RESULTS
This section presents all of our main results.
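As a concrete companion to the results that follow, the Prox-DGD update (6)-(7) can be instantiated with r_i(x) = λ|x|, whose proximal map is the familiar soft-thresholding operator. The sketch below is pure Python on a hypothetical 3-agent path network; the graph, data, and constants are made up for illustration.

```python
# Hypothetical toy instance of Prox-DGD (6)-(7): 3 agents on a path
# graph, f_i(x) = (x - b_i)^2 / 2 and r_i(x) = lam * |x|.  The proximal
# map of t * |.| is soft-thresholding, an easy closed form.

def soft_threshold(v, t):
    # prox_{t|.|}(v) = argmin_u t*|u| + (u - v)^2 / 2
    return max(abs(v) - t, 0.0) * (1.0 if v >= 0 else -1.0)

W = [[2/3, 1/3, 0.0], [1/3, 1/3, 1/3], [0.0, 1/3, 2/3]]
b = [0.0, 1.0, 2.0]
lam, alpha = 0.1, 0.1

x = [0.0, 0.0, 0.0]
for _ in range(300):
    mixed = [sum(W[i][j] * x[j] for j in range(3)) for i in range(3)]
    grad_step = [mixed[i] - alpha * (x[i] - b[i]) for i in range(3)]
    x = [soft_threshold(g, alpha * lam) for g in grad_step]    # (6)-(7)

avg = sum(x) / 3
# The average settles near the minimizer of sum_i (x - b_i)^2/2 + lam*|x|
# (here 0.9), while the copies remain only nearly consensual.
```

The same pattern works for any proximable r_i: only the `soft_threshold` line changes, e.g., to a projection when r_i is the indicator function of an easy set.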
TABLE I
COMPARISONS ON DIFFERENT ALGORITHMS FOR CONSENSUS SMOOTH OPTIMIZATION PROBLEM (1).
[Table I compares DGD [45] and DGD (this paper) under a fixed step size, and D-NG [16] and DGD (this paper) under decreasing step sizes; its columns record whether the f_i's must be convex only or may be (non)convex, the Lipschitz assumption on ∇f_i, the step-size choices, and the resulting global objective error rates.]
Definition (Kurdyka-Łojasiewicz (KŁ) property). A proper lower semi-continuous function h is said to satisfy the KŁ property at a point x∗ ∈ dom(∂h) if there exist η ∈ (0, +∞], a neighborhood U of x∗, and a continuous concave function ϕ : [0, η) → [0, +∞) such that (i) ϕ(0) = 0; (ii) ϕ is continuously differentiable on (0, η) with ϕ′ > 0; and (iii) for all x in U ∩ {x : h(x∗) < h(x) < h(x∗) + η}, the KŁ inequality holds:

ϕ′(h(x) − h(x∗)) · dist(0, ∂h(x)) ≥ 1.   (8)
λ_1(W) = 1 > λ_2(W) ≥ · · · ≥ λ_n(W) > −1.
Proper lower semi-continuous functions that satisfy the KŁ inequality at each point of dom(∂h) are called KŁ functions.
Assumption 1 (Objective). The objective functions f_i : R^p → R ∪ {+∞}, i = 1, . . . , n, satisfy the following:
(1) f_i is Lipschitz differentiable with constant L_{f_i} > 0.
(2) f_i is proper (i.e., not everywhere infinite) and coercive.
The sum ∑_{i=1}^n f_i(x_{(i)}) is L_f-Lipschitz differentiable with L_f ≜ max_i L_{f_i}. In addition, each f_i is lower bounded, following Part (2) of the above assumption.
Assumption 2 (Mixing matrix). The mixing matrix W = [w_{ij}] ∈ R^{n×n} has the following properties:
(1) (Graph) If i ≠ j and (i, j) ∉ E, then w_{ij} = 0; otherwise, w_{ij} > 0.
(2) (Symmetry) W = W^T.
(3) (Null space property) null{I − W} = span{1}.
(4) (Spectral property) I ⪰ W ≻ −I.
By Assumption 2, a solution x_opt to problem (3) satisfies (I − W)x_opt = 0. Due to the symmetry of W, its eigenvalues are real and can be sorted as in the display above.

† Whole
sequence convergence from any starting point is referred to as “global convergence” in the literature. Its limit is not necessarily a global solution.
Let ζ be the second largest magnitude eigenvalue of W . Then ζ = max{|λ2 (W )|, |λn (W )|}.
(9)
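A standard way to build a W satisfying Assumption 2 is the Metropolis rule w_ij = 1/(1 + max{deg_i, deg_j}) on the edges, with the diagonal absorbing the remainder of each row. The sketch below (pure Python; the 4-cycle graph is a made-up example) constructs such a W and estimates ζ in (9) by power iteration on W − (1/n)11^T, whose spectral radius equals ζ.

```python
# Metropolis mixing matrix for an undirected graph, per Assumption 2:
# w_ij = 1 / (1 + max(deg_i, deg_j)) on edges, 0 off the graph, and the
# diagonal chosen so each row sums to one.  zeta in (9) is the spectral
# radius of W - (1/n) 1 1^T, estimated here by power iteration.

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]   # a 4-cycle, for illustration
n = 4
deg = [0] * n
for i, j in edges:
    deg[i] += 1
    deg[j] += 1

W = [[0.0] * n for _ in range(n)]
for i, j in edges:
    W[i][j] = W[j][i] = 1.0 / (1 + max(deg[i], deg[j]))
for i in range(n):
    W[i][i] = 1.0 - sum(W[i])              # symmetric, doubly stochastic

def mat_vec(v):
    # apply W - (1/n) 1 1^T, i.e. W with the eigenvalue at 1 deflated
    avg = sum(v) / n
    return [sum(W[i][j] * v[j] for j in range(n)) - avg for i in range(n)]

v = [1.0, -1.0, 0.5, -0.5]                 # arbitrary non-constant start
for _ in range(200):
    w = mat_vec(v)
    norm = sum(c * c for c in w) ** 0.5
    v = [c / norm for c in w]
zeta = sum(c * c for c in mat_vec(v)) ** 0.5
# For the 4-cycle, the eigenvalues of W are {1, 1/3, 1/3, -1/3}, so
# zeta = max{|lambda_2|, |lambda_n|} = 1/3.
```

Note that λ_n(W) = −1/3 here: this W is admissible for DGD, but not for the nonconvex-r_i case of Prox-DGD, which requires λ_n(W) > 0.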
B. Convergence results of DGD
We consider the convergence of DGD with both a fixed step size and a sequence of decreasing step sizes.
1) Convergence results of DGD with a fixed step size: The convergence result of DGD with a fixed step size (i.e., α_k ≡ α) is established based on the Lyapunov function [45]:

L_α(x) ≜ 1^T f(x) + (1/(2α))‖x‖²_{I−W}.   (10)

It is worth reminding that convexity is not assumed.
Theorem 1 (Global convergence). Let {x^k} be the sequence generated by DGD (4) with the step size 0 < α < (1 + λ_n(W))/L_f. Let Assumptions 1 and 2 hold. Then {x^k} has at least one accumulation point x∗, and any such point is a stationary point of L_α(x). Furthermore, the running best rates‡ of the sequences§ {‖x^{k+1} − x^k‖²} and {‖∇L_α(x^k)‖²} are o(1/k). In addition, if L_α satisfies the KŁ property at an accumulation point x∗, then {x^k} globally converges to x∗.
Remark 1. Let x∗ be a stationary point of L_α(x), and thus

0 = ∇f(x∗) + α^{−1}(I − W)x∗.   (11)
Since 1^T(I − W) = 0, (11) yields 0 = 1^T ∇f(x∗), indicating that x∗ is also a stationary point of the separable function ∑_{i=1}^n f_i(x∗_{(i)}). Since the rows of x∗ are not necessarily

‡ Given a nonnegative sequence a_k, its running best sequence is b_k = min{a_i : i ≤ k}. We say a_k has a running best rate of o(1/k) if b_k = o(1/k).
§ These quantities naturally appear in the analysis, so we keep the squares.
identical, we cannot say x∗ is a stationary point of Problem (3). However, the differences between the rows of x∗ are bounded, following our next result, adapted from [45]:
Proposition 1 (Consensual bound on x∗). For each iteration k, define x̄^k ≜ (1/n)∑_{i=1}^n x^k_{(i)}. Then, it holds for each node i that

‖x^k_{(i)} − x̄^k‖ ≤ αD/(1 − ζ),   (12)

where D is a universal bound of ‖∇f(x^k)‖ defined in Lemma 6 below, and ζ is the second largest magnitude eigenvalue of W specified in (9). As k → ∞, (12) yields the consensual bound

‖x∗_{(i)} − x̄∗‖ ≤ αD/(1 − ζ),

where x̄∗ ≜ (1/n)∑_{i=1}^n x∗_{(i)}.
In Proposition 1, the consensual bound is proportional to the step size α and inversely proportional to the gap between the largest and the second largest magnitude eigenvalues of W.
Let us compare the DGD iteration with the iteration of centralized gradient descent (14) for f(x). Averaging the rows of (4) yields the following comparison:

DGD averaged: x̄^{k+1} ← x̄^k − α ((1/n)∑_{i=1}^n ∇f_i(x^k_{(i)})).   (13)
Centralized:  x̄^{k+1} ← x̄^k − α ((1/n)∑_{i=1}^n ∇f_i(x̄^k)).   (14)

Apparently, DGD approximates centralized gradient descent by evaluating ∇f_i at the local variables x^k_{(i)} instead of the global average. We can estimate the error of this approximation as

‖(1/n)∑_{i=1}^n ∇f_i(x^k_{(i)}) − (1/n)∑_{i=1}^n ∇f_i(x̄^k)‖ ≤ (1/n)∑_{i=1}^n ‖∇f_i(x^k_{(i)}) − ∇f_i(x̄^k)‖ ≤ αDL_f/(1 − ζ).

Unlike the convex analysis in [45], it is impossible to bound the difference between the sequences of (13) and (14) without convexity, because the two sequences may converge to different stationary points of L_α.
Remark 2. The KŁ assumption on L_α in Theorem 1 can be satisfied if each f_i is a sub-analytic function. Since ‖x‖²_{I−W} is obviously sub-analytic and the sum of two sub-analytic functions remains sub-analytic, L_α is sub-analytic if each f_i is so. See [43, Section 2.2] for more details and examples.
Proposition 2 (KŁ convergence rates). Let the assumptions of Theorem 1 hold. Suppose that L_α satisfies the KŁ inequality at an accumulation point x∗ with ψ(s) = cs^{1−θ} for some constant c > 0. Then, the following convergence rates hold:
(a) If θ = 0, x^k converges to x∗ in finitely many iterations.
(b) If θ ∈ (0, 1/2], ‖x^k − x∗‖ ≤ C_0 τ^k for all k ≥ k∗, for some k∗ > 0, C_0 > 0, and τ ∈ [0, 1).
(c) If θ ∈ (1/2, 1), ‖x^k − x∗‖ ≤ C_0 k^{−(1−θ)/(2θ−1)} for all k ≥ k∗, for certain k∗ > 0, C_0 > 0.
Note that the rates in parts (b) and (c) of Proposition 2 are of the eventual type.
Using fixed step sizes, our results are limited because the stationary point x∗ of L_α is not a stationary point of the original problem; we only have a consensual bound on x∗. To address this issue, the next subsection uses decreasing step sizes and presents better convergence results.
2) Convergence of DGD with decreasing step sizes: The positive consensual error bound in Proposition 1, which is proportional to the constant step size α, motivates the use of properly decreasing step sizes α_k = O(1/(k + 1)^ε), for some 0 < ε ≤ 1, to diminish the consensual bound to 0. As a result, any accumulation point x∗ becomes a stationary point of the original problem (3). To analyze DGD with decreasing step sizes, we add the following assumption.
Assumption 3 (Bounded gradient). For any k, ∇f(x^k) is uniformly bounded by some constant B > 0, i.e., ‖∇f(x^k)‖ ≤ B.
Note that the bounded gradient assumption is a regular assumption in the convergence analysis of decentralized gradient methods (see [3], [4], [13], [23], [24], [38], [39], [19], [42] for example), even in the convex setting [16] and also [8], though it is not required for centralized gradient descent. We take the step size sequence

α_k = 1/(L_f (k + 1)^ε), 0 < ε ≤ 1,   (15)

throughout the rest of this section. (The numerator 1 can be replaced by any positive constant.) By iteratively applying iteration (4), we obtain the following expression:

x^k = W^k x^0 − ∑_{j=0}^{k−1} α_j W^{k−1−j} ∇f(x^j).   (16)
Proposition 3 (Asymptotic consensus rate). Let Assumptions 2 and 3 hold. Let DGD use (15). Let x̄^k ≜ (1/n)11^T x^k. Then ‖x^k − x̄^k‖ converges to 0 at the rate of O(1/(k + 1)^ε).
According to Proposition 3, the iterates of DGD with decreasing step sizes can reach consensus asymptotically (compared to a nonzero bound in the fixed step size case in Proposition 1). Moreover, with a larger ε, faster decaying step sizes generally imply a faster asymptotic consensus rate. Note that (I − W)x̄^k = 0 and thus ‖x^k‖²_{I−W} = ‖x^k − x̄^k‖²_{I−W}. Therefore, the above proposition implies the following result.
Corollary 1. Apply the setting of Proposition 3. Then ‖x^k‖²_{I−W} converges to 0 at the rate of O(1/(k + 1)^{2ε}).
Corollary 1 shows that the sequence {x^k}, measured in the (I − W) semi-norm, decays to 0 at a sublinear rate. For any global consensual solution x_opt to problem (3), we have ‖x^k − x_opt‖²_{I−W} = ‖x^k‖²_{I−W}; so, if {x^k} does converge to x_opt, then their distance in this semi-norm decays at O(1/k^{2ε}).
Theorem 2 (Convergence). Let Assumptions 1, 2 and 3 hold. Let DGD use step sizes (15). Then
(a) {L_{α_k}(x^k)} and {1^T f(x^k)} converge to the same limit;
(b) lim_{k→∞} 1^T ∇f(x^k) = 0, and any limit point of {x^k} is a stationary point of problem (3);
(c) In addition, if there exists an isolated accumulation point, then {x^k} converges.
In the proof of Theorem 2, we will establish

∑_{k=0}^∞ (α_k^{−1}(1 + λ_n(W)) − L_f) ‖x^{k+1} − x^k‖² < ∞,

which implies that the running best rate of the sequence {‖x^{k+1} − x^k‖²} is o(1/k^{1+ε}).
Theorem 2 shows that the objective sequence converges, and any limit point of {x^k} is a stationary point of the original problem. However, there is no result on the convergence rate of the objective sequence to an optimal value, and it is generally difficult to obtain such a rate without convexity. Although our primary focus is nonconvexity, next we assume convexity and present the objective convergence rate, which has an interesting relation with ε.
For any x ∈ R^{n×p}, let f̄(x) ≜ ∑_{i=1}^n f_i(x_{(i)}). Even if the f_i's are convex, the solution to (3) may be non-unique. Thus, let X∗ be the set of solutions to (3). Given x^k, we pick the solution x_opt = Proj_{X∗}(x^k) ∈ X∗. Also let f_opt = f̄(x_opt) be the optimal value of (1). Define the ergodic objective

f̄^K ≜ (∑_{k=0}^K α_k f̄(x̄^{k+1})) / (∑_{k=0}^K α_k),   (17)

where x̄^{k+1} = (1/n)11^T x^{k+1}. Obviously,

f̄^K ≥ min_{k=1,...,K+1} f̄(x̄^k).   (18)
Proposition 4 (Convergence rates under convexity). Let Assumptions 1, 2 and 3 hold. Let DGD use step sizes (15). If λ_n(W) > 0 and each f_i is convex, then {f̄^K} defined in (17) converges to the optimal objective value f_opt at the following rates:
(a) if 0 < ε < 1/2, the rate is O(1/K^ε);
(b) if ε = 1/2, the rate is O(ln K/√K);
(c) if 1/2 < ε < 1, the rate is O(1/K^{1−ε});
(d) if ε = 1, the rate is O(1/ln K).
The convergence rates established in Proposition 4 are almost as good as O(1/√K) when ε = 1/2. As ε goes to either 0 or 1, the rates become slower, so ε = 1/2 may be the optimal choice in terms of the convergence rate. However, by Proposition 3, a larger ε implies a faster consensus rate. Therefore, there is a tradeoff in choosing an appropriate ε in the practical implementation of DGD.

C. Convergence results of Prox-DGD
Similarly, we consider the convergence of Prox-DGD with both a fixed step size and decreasing step sizes. The iteration (6) can be reformulated as

x^{k+1} = prox_{α_k r}(x^k − α_k ∇L_{α_k}(x^k)),   (19)

based on which we define the Lyapunov function

L̂_{α_k}(x) ≜ L_{α_k}(x) + r(x),

where we recall that L_{α_k}(x) = ∑_{i=1}^n f_i(x_{(i)}) + (1/(2α_k))‖x‖²_{I−W}. Then (19) is clearly the forward-backward splitting (a.k.a. prox-gradient) iteration for minimize_x L̂_{α_k}(x). Specifically, (19) first performs a gradient descent step on the differentiable function L_{α_k}(x) and then computes the proximal map of r(x). To analyze Prox-DGD, we revise Assumption 1 as follows.
Assumption 4 (Composite objective). The objective function of (5) satisfies the following:
(1) Each f_i is Lipschitz differentiable with constant L_{f_i} > 0.
(2) Each (f_i + r_i) is proper, lower semi-continuous, and coercive.
As before, ∑_{i=1}^n f_i(x_{(i)}) is L_f-Lipschitz differentiable for L_f ≜ max_i L_{f_i}.
1) Convergence results of Prox-DGD with a fixed step size: Based on the above assumptions, we can establish the global convergence of Prox-DGD as follows.
Theorem 3 (Global convergence of Prox-DGD). Let {x^k} be the sequence generated by Prox-DGD (6), where the step size α satisfies 0 < α < (1 + λ_n(W))/L_f when the r_i's are convex, and 0 < α < λ_n(W)/L_f when the r_i's are not necessarily convex (this case requires λ_n(W) > 0). Let Assumptions 2 and 4 hold. Then {x^k} has at least one accumulation point x∗, and any accumulation point is a stationary point of L̂_α(x). Furthermore, the running best rates¶ of the sequences {‖x^{k+1} − x^k‖²} and {‖g^{k+1}‖²} (where g^{k+1} is defined in Lemma 18) are both o(1/k). In addition, if L̂_α satisfies the KŁ property at an accumulation point x∗, then {x^k} converges to x∗.
The rate of convergence of Prox-DGD can also be established by leveraging the KŁ property.
Proposition 5 (Rate of convergence of Prox-DGD). Under the assumptions of Theorem 3, suppose that L̂_α satisfies the KŁ inequality at an accumulation point x∗ with ψ(s) = c_1 s^{1−θ} for some constant c_1 > 0. Then the following hold:
(a) If θ = 0, x^k converges to x∗ in finitely many iterations.
(b) If θ ∈ (0, 1/2], ‖x^k − x∗‖ ≤ C_1 τ^k for all k ≥ k∗, for some k∗ > 0, C_1 > 0, and τ ∈ [0, 1).
(c) If θ ∈ (1/2, 1), ‖x^k − x∗‖ ≤ C_1 k^{−(1−θ)/(2θ−1)} for all k ≥ k∗, for certain k∗ > 0, C_1 > 0.
2) Convergence of Prox-DGD with decreasing step sizes: In Prox-DGD, we also use the decreasing step sizes (15). To investigate its convergence, the bounded gradient Assumption 3 is revised as follows.
Assumption 5 (Bounded composite subgradient). For each i, ∇f_i is uniformly bounded by some constant B_i > 0, i.e., ‖∇f_i(x)‖ ≤ B_i for any x ∈ R^p. Moreover, ‖ξ_i‖ ≤ B_{r_i} for any ξ_i ∈ ∂r_i(x) and x ∈ R^p, i = 1, . . . , n.
Let B̄ ≜ ∑_{i=1}^n (B_i + B_{r_i}). Then ∇f(x) + ξ (where ξ ∈ ∂r(x) for any x ∈ R^{n×p}) is uniformly bounded by B̄. Note that the same assumption is used to analyze the convergence

¶ A nonnegative sequence a_k induces its running best sequence b_k = min{a_i : i ≤ k}; therefore, a_k has a running best rate of o(1/k) if b_k = o(1/k).
of the distributed proximal-gradient method in the convex setting [6], [8], and it is also widely used to analyze the convergence of nonconvex decentralized algorithms such as those in [23], [24].
In light of Lemma 19 below, the claims in Proposition 3 and Corollary 1 also hold for Prox-DGD.
Proposition 6 (Asymptotic consensus and rate). Let Assumptions 2 and 5 hold. In Prox-DGD, use the step sizes (15). Then

‖x^k − x̄^k‖ ≤ C‖x^0‖ζ^k + B̄ ∑_{j=0}^{k−1} α_j ζ^{k−1−j},
and ‖x^k − x̄^k‖ converges to 0 at the rate of O(1/(k + 1)^ε). Moreover, let x∗ be any global solution of problem (5). Then ‖x^k − x∗‖²_{I−W} = ‖x^k‖²_{I−W} = ‖x^k − x̄^k‖²_{I−W} converges to 0 at the rate of O(1/(k + 1)^{2ε}).
For any x ∈ R^{n×p}, define s̄(x) = ∑_{i=1}^n (f_i(x_{(i)}) + r_i(x_{(i)})). Let X† be the set of solutions of (5), x_opt = Proj_{X†}(x^k) ∈ X†, and s_opt = s̄(x_opt) be the optimal value of (5). Define

s̄^K = (∑_{k=0}^K α_k s̄(x̄^{k+1})) / (∑_{k=0}^K α_k).   (20)
Theorem 4 (Convergence and rate). Let Assumptions 2, 4 and 5 hold. In Prox-DGD, use the step sizes (15). Then
(a) {L̂_{α_k}(x^k)} and {∑_{i=1}^n (f_i(x^k_{(i)}) + r_i(x^k_{(i)}))} converge to the same limit;
(b) ∑_{k=0}^∞ (α_k^{−1}(1 + λ_n(W)) − L_f)‖x^{k+1} − x^k‖² < ∞ when the r_i's are convex, and ∑_{k=0}^∞ (α_k^{−1}λ_n(W) − L_f)‖x^{k+1} − x^k‖² < ∞ when they are not necessarily convex (this case requires λ_n(W) > 0);
(c) if {ξ^k} satisfies ‖ξ^{k+1} − ξ^k‖ ≤ L_r‖x^{k+1} − x^k‖ for each k > k_0, for some constant L_r > 0 and a sufficiently large integer k_0 > 0, then

lim_{k→∞} 1^T(∇f(x^k) + ξ^{k+1}) = 0,

where ξ^{k+1} ∈ ∂r(x^{k+1}) is the subgradient determined by the proximal operator (7), and any limit point is a stationary point of problem (5);
(d) in addition, if there exists an isolated accumulation point, then {x^k} converges;
(e) furthermore, if the f_i's and r_i's are convex and λ_n(W) > 0, then the claims on the rates of {f̄^K} in Proposition 4 hold for the sequence {s̄^K} defined in (20).
Theorem 4(b) implies that the best running rate of {‖x^{k+1} − x^k‖²} is o(1/k^{1+ε}). The additional condition imposed on {ξ^k} in Theorem 4(c) is a type of restricted continuity regularity of the subgradient ∂r with respect to the generated sequence, which may hold for a class of proximable functions, as studied in [46]. If ∂r is locally Lipschitz continuous in a neighborhood of a limit point, then this condition can generally be satisfied, since {x^k} is asymptotically regular and thus x^k lies in such a neighborhood of this limit point when k is sufficiently large. Theorem 4(e) gives the convergence rates of Prox-DGD in the convex setting.
IV. RELATED WORKS AND DISCUSSIONS
We summarize some recent nonconvex decentralized algorithms in Table III. Most of them apply to either the smooth optimization problem (1) or the composite optimization problem (2) and use diminishing step sizes. Although (1) is a special case of (2) via letting r_i(x) = 0, there are still differences in both algorithm design and theoretical analysis. Therefore, we discuss the two groups separately.
We first discuss the algorithms for (1). In [39], the authors proved the convergence of perturbed push-sum‖ for nonconvex (1) under some regularity assumptions. They also introduced random perturbations to help avoid local minima. The network considered in [39] is time-varying and directed, and specific column stochastic matrices and diminishing step sizes are used. Their algorithm is an extension of the DGD with diminishing step sizes of this paper. The convergence results for the deterministic perturbed push-sum algorithm obtained in [39] are similar to those of DGD developed in this paper under similar assumptions (see Theorem 2 above and [39, Theorem 3]). (However, in this paper, we obtain the asymptotic consensus and the convergence to a stationary point of DGD via a Lyapunov function and by developing several new results, such as Lemma 12 for the convergence of the so-called weakly summable sequences.) The proofs in [39] are mainly based on [30, Theorem 2.7.3]. In [13], a primal-dual approximate gradient algorithm called ZENITH was developed for (1). The convergence of ZENITH was given in terms of the expectation of the constraint violation under the Lipschitz differentiability assumption and other assumptions.
Table III includes three algorithms for solving the composite problem (2) that are related to ours. All of them deal only with convex r_i (whereas r_i in this paper can also be nonconvex). In [24], the authors proposed NEXT, based on the previous successive convex approximation (SCA) technique.
The iterates of NEXT include two stages: a local SCA stage to update local variables and a consensus update stage to fuse the information between agents. While NEXT has results similar to those of Prox-DGD using diminishing step sizes, Prox-DGD is much simpler than NEXT, because NEXT needs to update five different sequences at each iteration. Another interesting algorithm is decentralized Frank-Wolfe (DeFW), proposed in [42] for nonconvex, smooth, constrained decentralized optimization, where a bounded convex constraint set is imposed. There are three steps at each iteration of DeFW: average gradient computation, local variable evaluation by Frank-Wolfe, and information fusion between agents. In [42], the authors established convergence results similar to those of Prox-DGD under diminishing step sizes. The stochastic version of DeFW has also been developed in [19] for high-dimensional convex sparse optimization. The last one is the projected stochastic gradient algorithm (Proj SGD) [3] for constrained, nonconvex, smooth consensus optimization. It has two steps at each iteration: a projected stochastic gradient step to update local variables and a consensus step to exchange the information

‖ The original form of this algorithm, push-sum, was proposed in [17] for the average consensus problem. It was modified and analyzed in [29] for the convex consensus optimization problem over time-varying directed graphs.
between local agents. The mixing matrix used in this algorithm is random and row stochastic, but its expectation is column stochastic. Asymptotic consensus and convergence to the set of Karush-Kuhn-Tucker points were proved under diminishing step sizes, a smooth objective function, certain mean and variance restrictions on the stochastic direction, and other assumptions on the mixing matrices and the constraint set.
Based on the above discussion, the convergence results of DGD and Prox-DGD with diminishing step sizes in this paper are comparable with most of the existing ones, which involve more complicated methods. However, we allow nonconvex nonsmooth r_i and are able to obtain estimates of the asymptotic consensus rates. We also establish global convergence using a fixed step size, while this is only found in ZENITH.

V. PROOFS

In this section, we present the proofs of our main theorems and propositions.
A. Proof of Theorem 1
The sketch of the proof is as follows: DGD is interpreted as the gradient descent algorithm applied to the Lyapunov function L_α, following the argument in [45]; then, the properties of sufficient descent, lower boundedness, and bounded gradients are established for the sequence {L_α(x^k)}, giving subsequence convergence of the DGD iterates; finally, whole sequence convergence of the DGD iterates follows from the KŁ property of L_α.
Lemma 1 (Gradient descent interpretation). The sequence {x^k} generated by the DGD iteration (4) is the same as the sequence generated by applying gradient descent with the fixed step size α to the objective function L_α(x).
A proof of this lemma is given in [45], and it is based on reformulating (4) as the iteration

x^{k+1} = x^k − α(∇f(x^k) + α^{−1}(I − W)x^k) = x^k − α∇L_α(x^k).   (21)
Although the sequence {x^k} generated by the DGD iteration (4) can be interpreted as a centralized gradient descent sequence for the function L_α(x), it is different from gradient descent applied to the original problem (3).
Lemma 2 (Sufficient descent of {L_α(x^k)}). Let Assumptions 1 and 2 hold, and set the step size 0 < α < (1 + λ_n(W))/L_f. Then, for all k ∈ N,

L_α(x^{k+1}) ≤ L_α(x^k) − (1/2)(α^{−1}(1 + λ_n(W)) − L_f)‖x^{k+1} − x^k‖².   (22)

Proof. From x^{k+1} = x^k − α∇L_α(x^k), it follows that

⟨∇L_α(x^k), x^{k+1} − x^k⟩ = −(1/α)‖x^{k+1} − x^k‖².   (23)
Since ∑_{i=1}^n ∇f_i(x_{(i)}) is L_f-Lipschitz, ∇L_α is Lipschitz with the constant L∗ ≜ L_f + α^{−1}λ_max(I − W) = L_f + α^{−1}(1 − λ_n(W)), implying

L_α(x^{k+1}) ≤ L_α(x^k) + ⟨∇L_α(x^k), x^{k+1} − x^k⟩ + (L∗/2)‖x^{k+1} − x^k‖².   (24)

Combining (23) and (24) yields (22).
Lemma 3 (Boundedness). Under Assumptions 1 and 2, if 0 < α < (1 + λ_n(W))/L_f, then the sequence {L_α(x^k)} is lower bounded, and the sequence {x^k} is bounded, i.e., there exists a constant B > 0 such that ‖x^k‖ < B for all k.
Proof. The lower boundedness of {L_α(x^k)} is due to the lower boundedness of each f_i, as each f_i is proper and coercive (Assumption 1, Part (2)). By Lemma 2 and the choice of α, L_α(x^k) is nonincreasing and upper bounded by L_α(x^0) < +∞. Hence, 1^T f(x^k) ≤ L_α(x^0), which implies that x^k is bounded due to the coercivity of 1^T f(x) (Assumption 1, Part (2)).
From Lemmas 2 and 3, we immediately obtain the following lemma.
Lemma 4 (ℓ2-summability and asymptotic regularity∗∗). It holds that ∑_{k=0}^∞ ‖x^{k+1} − x^k‖² < +∞ and that ‖x^{k+1} − x^k‖ → 0 as k → ∞.
From (21), the result below directly follows: Lemma 5 (Gradient bound). k∇Lα (xk )k ≤ α−1 kxk+1 −xk k. Based on the above lemmas, we get the global convergence of DGD. Proof of Theorem 1. By Lemma 3, the sequence {xk } is bounded, so there exist a convergent subsequence and a limit point, denoted by {xks }s∈N → x∗ as s → +∞. By Lemmas 2 and 3, Lα (xk ) is monotonically nonincreasing and lower bounded, and therefore kxk+1 − xk k → 0 as k → ∞. Based on Lemma 5, k∇Lα (xk )k → 0 as k → ∞. In particular, k∇Lα (xks )k → 0 as s → ∞. Hence, we have ∇Lα (x∗ ) = 0. The running best rate of the sequence {kxk+1 − xk k2 } follows from [10, Lemma 1.2] or [18, Theorem 3.3.1]. By Lemma 5, the running best rate of the sequence {k∇Lα (xk )k2 } is o( k1 ). Similar to [2, Theorem 2.9], we can claim the global convergence of the considered sequence {xk }k∈N under the KŁ assumption of Lα . Next, we derive a bound on the gradient sequence {∇f (xk )}, which is used in Proposition 1.
Lemma 6. Under Assumption 1, there exists a point $y^*$ satisfying $\nabla f(y^*) = 0$, and the following bound holds:
\[
\|\nabla f(x^k)\| \le D \triangleq L_f(B + \|y^*\|), \quad \forall k \in \mathbb{N}, \tag{25}
\]
where $B$ is the bound on $\|x^k\|$ given in Lemma 3.

Proof. By the lower boundedness assumption (Assumption 1 Part (2)), a minimizer of $\mathbf{1}^T f(y)$ exists; let $y^*$ be one. Then, by the Lipschitz differentiability of each $f_i$ (Assumption 1 Part (1)), we have $\nabla f(y^*) = 0$. Hence, for any $k$,
\[
\|\nabla f(x^k)\| = \|\nabla f(x^k) - \nabla f(y^*)\| \le L_f\|x^k - y^*\| \le L_f(B + \|y^*\|),
\]
where the last inequality uses Lemma 3.
B. Proof for Proposition 2

Proof. Note that
\[
\|\nabla L_\alpha(x^{k+1})\| \le \|\nabla L_\alpha(x^{k+1}) - \nabla L_\alpha(x^k)\| + \|\nabla L_\alpha(x^k)\| \le L^*\|x^{k+1}-x^k\| + \alpha^{-1}\|x^{k+1}-x^k\| = \big(\alpha^{-1}(2-\lambda_n(W)) + L_f\big)\|x^{k+1}-x^k\|,
\]
where the second inequality holds by Lemma 5 and the Lipschitz continuity of $\nabla L_\alpha$ with constant $L^* = L_f + \alpha^{-1}(1-\lambda_n(W))$. Thus, $\{x^k\}$ satisfies the so-called relative error condition listed in [2]. Moreover, by Lemmas 2 and 3, $\{x^k\}$ also satisfies the so-called sufficient decrease and continuity conditions listed in [2]. Under these three conditions and the KŁ property of $L_\alpha$ at $x^*$ with $\psi(s) = cs^{1-\theta}$, and following the proof of [2, Lemma 2.6], there exists $k_0 > 0$ such that for all $k \ge k_0$,
\[
2\|x^{k+1}-x^k\| \le \|x^k - x^{k-1}\| + \frac{cb}{a}\Big[\big(L_\alpha(x^k)-L_\alpha(x^*)\big)^{1-\theta} - \big(L_\alpha(x^{k+1})-L_\alpha(x^*)\big)^{1-\theta}\Big], \tag{26}
\]
where $a \triangleq \frac{1}{2}\big(\alpha^{-1}(1+\lambda_n(W)) - L_f\big)$ and $b \triangleq \alpha^{-1}(2-\lambda_n(W)) + L_f$. Then, an easy induction yields
\[
\sum_{t=k_0}^{k} \|x^{t+1}-x^t\| \le \|x^{k_0} - x^{k_0-1}\| + \frac{cb}{a}\Big[\big(L_\alpha(x^{k_0})-L_\alpha(x^*)\big)^{1-\theta} - \big(L_\alpha(x^{k+1})-L_\alpha(x^*)\big)^{1-\theta}\Big].
\]
Following a derivation similar to the proof of [1, Theorem 5], we can estimate the rate of convergence of $\{x^k\}$ in the different cases of $\theta$.

C. Proof for Proposition 3

In order to prove Proposition 3, we also need the following lemmas.

Lemma 7 ([28, Proposition 1]). Let $W^k \triangleq W \cdots W$ be the $k$-th power of $W$ for any $k \in \mathbb{N}$. Under Assumption 2, it holds that
\[
\Big\|W^k - \frac{1}{n}\mathbf{1}\mathbf{1}^T\Big\| \le C\zeta^k \tag{27}
\]
for some constant $C > 0$, where $\zeta$ is the second largest magnitude eigenvalue of $W$ as specified in (9).

Lemma 8 ([32, Lemma 3.1]). Let $\{\gamma_k\}$ be a scalar sequence. If $\lim_{k\to\infty}\gamma_k = \gamma$ and $0 < \beta < 1$, then $\lim_{k\to\infty}\sum_{l=0}^{k}\beta^{k-l}\gamma_l = \frac{\gamma}{1-\beta}$.

Proof of Proposition 3. By the recursion (16), note that
\[
x^k - \bar{x}^k = \Big(W^k - \frac{1}{n}\mathbf{1}\mathbf{1}^T\Big)x^0 - \sum_{j=0}^{k-1}\alpha_j\Big(W^{k-1-j} - \frac{1}{n}\mathbf{1}\mathbf{1}^T\Big)\nabla f(x^j). \tag{28}
\]
Further, by Lemma 7 and Assumption 3, we obtain
\[
\|x^k - \bar{x}^k\| \le \Big\|W^k - \frac{1}{n}\mathbf{1}\mathbf{1}^T\Big\|\,\|x^0\| + \sum_{j=0}^{k-1}\alpha_j\Big\|W^{k-1-j} - \frac{1}{n}\mathbf{1}\mathbf{1}^T\Big\|\,\|\nabla f(x^j)\| \le C\Big(\|x^0\|\zeta^k + B\sum_{j=0}^{k-1}\alpha_j\zeta^{k-1-j}\Big). \tag{29}
\]
Furthermore, by Lemma 8 and the step sizes (15), we get $\lim_{k\to\infty}\|x^k - \bar{x}^k\| = 0$.

Let $b_k \triangleq (k+1)^{-\epsilon}$. To show the rate of $\|x^k - \bar{x}^k\|$, we only need to show that
\[
\lim_{k\to\infty} b_k^{-1}\|x^k - \bar{x}^k\| \le C^*
\]
for some $0 < C^* < \infty$. Let $j_k' \triangleq [k - 1 + 2\log_\zeta(b_k^{-1})]$, where $[x]$ denotes the integer part of $x$ for any $x \in \mathbb{R}$. Note that
\[
b_k^{-1}\|x^k - \bar{x}^k\| \le C b_k^{-1}\Big(\|x^0\|\zeta^k + B\sum_{j=0}^{k-1}\alpha_j\zeta^{k-1-j}\Big) = C\|x^0\| b_k^{-1}\zeta^k + CB b_k^{-1}\sum_{j=0}^{j_k'}\alpha_j\zeta^{k-1-j} + CB b_k^{-1}\sum_{j=j_k'+1}^{k-1}\alpha_j\zeta^{k-1-j} \triangleq T_1 + T_2 + T_3, \tag{30}
\]
where the first inequality holds because of (29). In the following, we estimate the three terms on the right-hand side of (30), respectively.

First, by the definition of $j_k'$, for any $j \le j_k'$ we have
\[
b_k^{-1}\zeta^{\frac{k-1-j}{2}} \le b_k^{-1}\zeta^{\frac{k-1-j_k'}{2}} \le 1.
\]
Thus,
\[
T_2 \le CB\sum_{j=0}^{j_k'}\alpha_j\zeta^{(k-1-j)/2}. \tag{31}
\]
Second, for $j_k' < j \le k-1$,
\[
b_k^{-1}\alpha_j \le \frac{(k+1)^\epsilon}{L_f(j_k'+1)^\epsilon} \le \frac{(k+1)^\epsilon}{L_f\big(k-1+2\epsilon\log_\zeta(k+1)\big)^\epsilon}, \quad \text{and also} \quad b_k^{-1}\alpha_j \ge \frac{(k+1)^\epsilon}{L_f(k+1)^\epsilon} = \frac{1}{L_f},
\]
so that
\[
\lim_{k\to\infty} b_k^{-1}\alpha_j = \frac{1}{L_f}. \tag{32}
\]
Therefore, there exists a $k^*$ such that for all $k \ge k^*$ and any $j_k' < j \le k-1$,
\[
b_k^{-1}\alpha_j \le \frac{2}{L_f}, \tag{34}
\]
and thus
\[
T_3 \le \frac{2CB}{L_f}\sum_{j=j_k'+1}^{k-1}\zeta^{k-1-j}. \tag{37}
\]
Furthermore, note that
\[
\lim_{k\to\infty} b_k^{-1}\zeta^{k/2} = 0 \tag{33}
\]
and, for sufficiently large $k$,
\[
b_k^{-1}\zeta^{k/2} \le 1. \tag{35}
\]
The above two inequalities imply that for sufficiently large $k$,
\[
T_1 \le C\|x^0\|\zeta^{k/2}. \tag{36}
\]
From (31), (36) and (37), we get
\[
b_k^{-1}\|x^k - \bar{x}^k\| \le C\|x^0\|\zeta^{k/2} + CB\sum_{j=0}^{j_k'}\alpha_j\zeta^{(k-1-j)/2} + \frac{2CB}{L_f}\sum_{j=j_k'+1}^{k-1}\zeta^{k-1-j}. \tag{38}
\]
By Lemma 8 and (38), there exists a $C^* > 0$ such that
\[
\lim_{k\to\infty} b_k^{-1}\|x^k - \bar{x}^k\| \le C^*. \tag{39}
\]
We have completed the proof of this proposition.
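Lemma 8, which drives the consensus estimates above, can be checked numerically. The sketch below uses an illustrative sequence $\gamma_l = \gamma + \frac{1}{l+1}$ and constants of our own choosing; it is not taken from the paper.

```python
import numpy as np

# Lemma 8: if gamma_l -> gamma and 0 < beta < 1, then
# sum_{l=0}^k beta^(k-l) * gamma_l  ->  gamma / (1 - beta).
beta, gamma = 0.6, 2.0
k = 5000
gammas = gamma + 1.0 / (np.arange(k + 1) + 1.0)   # gamma_l = gamma + 1/(l+1) -> gamma
weights = beta ** (k - np.arange(k + 1))          # geometric weights beta^(k-l)
s = float(weights @ gammas)

assert abs(s - gamma / (1 - beta)) < 1e-2         # close to 2 / (1 - 0.6) = 5
```

The early terms are damped geometrically, so only the tail of $\{\gamma_l\}$, which is close to $\gamma$, contributes; this is exactly why the consensus error inherits the rate of $\alpha_k$ in Proposition 3.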
D. Proof for Theorem 2

To prove Theorem 2, we first note that, similarly to (21), the DGD iterates under decreasing step sizes can be rewritten as
\[
x^{k+1} = x^k - \alpha_k \nabla L_{\alpha_k}(x^k), \tag{40}
\]
where $L_{\alpha_k}(x) = \mathbf{1}^T f(x) + \frac{1}{2\alpha_k}\|x\|_{I-W}^2$. We also need the following lemmas.

Lemma 9 ([33]). Let $\{v_t\}$ be a nonnegative scalar sequence such that $v_{t+1} \le (1+b_t)v_t - u_t + c_t$ for all $t \in \mathbb{N}$, where $b_t \ge 0$, $u_t \ge 0$ and $c_t \ge 0$ with $\sum_{t=0}^\infty b_t < \infty$ and $\sum_{t=0}^\infty c_t < \infty$. Then the sequence $\{v_t\}$ converges to some $v \ge 0$ and $\sum_{t=0}^\infty u_t < \infty$.

Lemma 10. Let $\alpha_k$ satisfy (15). Then it holds that
\[
\alpha_{k+1}^{-1} - \alpha_k^{-1} \le 2\epsilon L_f (k+1)^{\epsilon-1}.
\]

Proof. We first prove that
\[
(1+x)^\epsilon - 1 \le 2\epsilon x, \quad \forall x \in [0,1]. \tag{41}
\]
Let $g(x) = (1+x)^\epsilon - 1 - 2\epsilon x$. Then its derivative $g'(x) = \epsilon(1+x)^{\epsilon-1} - 2\epsilon < 0$ for all $x \in [0,1]$, which implies $g(x) \le g(0) = 0$ for any $x \in [0,1]$; that is, the inequality (41) holds. Therefore,
\[
\alpha_{k+1}^{-1} - \alpha_k^{-1} = L_f\big((k+2)^\epsilon - (k+1)^\epsilon\big) = L_f(k+1)^\epsilon\Big(\big(1+\tfrac{1}{k+1}\big)^\epsilon - 1\Big) \le 2\epsilon L_f(k+1)^{\epsilon-1}, \tag{42}
\]
where the last inequality holds by (41).

Note that the term $(\alpha_{k+1}^{-1} - \alpha_k^{-1})\|x^{k+1}\|_{I-W}^2$ appears on the right-hand side of the inequality (48) developed below. In order to apply Lemma 9 and then show the convergence of $\{L_{\alpha_k}(x^k)\}$, we need the following lemma to guarantee that this term is summable.

Lemma 11. Let Assumptions 1, 2, and 3 hold. In DGD, use the step sizes $\alpha_k$ in (15). Then $\{(\alpha_{k+1}^{-1} - \alpha_k^{-1})\|x^{k+1}\|_{I-W}^2\}$ is summable, i.e., $\sum_{k=0}^\infty (\alpha_{k+1}^{-1} - \alpha_k^{-1})\|x^{k+1}\|_{I-W}^2 < \infty$.

Proof. Note that
\[
\|x^{k+1}\|_{I-W}^2 = \|x^{k+1} - \bar{x}^{k+1}\|_{I-W}^2 \le (1-\lambda_n(W))\|x^{k+1} - \bar{x}^{k+1}\|^2. \tag{43}
\]
By Lemma 10,
\[
(\alpha_{k+1}^{-1} - \alpha_k^{-1})\|x^{k+1}\|_{I-W}^2 \le 2\epsilon L_f(k+1)^{\epsilon-1}(1-\lambda_n(W))\|x^{k+1} - \bar{x}^{k+1}\|^2. \tag{44}
\]
Furthermore, by (44) and Proposition 3, the sequence $\{(\alpha_{k+1}^{-1} - \alpha_k^{-1})\|x^{k+1}\|_{I-W}^2\}$ converges to 0 at the rate of $O(1/(k+1)^{1+\epsilon})$, which implies that it is $\ell_1$-summable, i.e., $\sum_{k=0}^\infty (\alpha_{k+1}^{-1} - \alpha_k^{-1})\|x^{k+1}\|_{I-W}^2 < \infty$.

Lemma 12 (Convergence of weakly summable sequences). Let $\{\beta_k\}$ and $\{\gamma_k\}$ be two nonnegative scalar sequences such that
(a) $\gamma_k = \frac{1}{(k+1)^\epsilon}$ for some $\epsilon \in (0,1]$, $k \in \mathbb{N}$;
(b) $\sum_{k=0}^\infty \gamma_k\beta_k < \infty$;
(c) $|\beta_{k+1} - \beta_k| \lesssim \gamma_k$, where "$\lesssim$" means that $|\beta_{k+1} - \beta_k| \le M\gamma_k$ for some constant $M > 0$.
Then $\lim_{k\to\infty}\beta_k = 0$.

We call a sequence $\{\beta_k\}$ satisfying conditions (a) and (b) of Lemma 12 a weakly summable sequence, since it is not necessarily summable itself but becomes summable after multiplication by the non-summable, diminishing sequence $\{\gamma_k\}$. In general it is impossible to claim that $\beta_k$ converges to 0. However, if the distance between successive elements of $\{\beta_k\}$ is of the same order as the multiplied sequence $\gamma_k$, then we can claim the convergence of $\beta_k$. A special case with $\epsilon = 1/2$ has been observed in [9].

Proof. By condition (b), we have
\[
\sum_{i=k}^{k+k'}\gamma_i\beta_i \to 0 \tag{45}
\]
as $k \to \infty$, for any $k' \in \mathbb{N}$. In the following, we show $\lim_{k\to\infty}\beta_k = 0$ by contradiction. Assume this is not the case, i.e., $\beta_k \nrightarrow 0$ as $k \to \infty$; then $\limsup_{k\to\infty}\beta_k \triangleq C^* > 0$. Thus, for every $N > k_0$, there exists a $k > N$ such that $\beta_k > \frac{C^*}{2}$. Let
\[
k' \triangleq \Big[\frac{C^*}{4M}(k+1)^\epsilon\Big],
\]
where $[x]$ denotes the integer part of $x$ for any $x \in \mathbb{R}$. By condition (c), i.e., $|\beta_{j+1}-\beta_j| \le M\gamma_j$ for any $j \in \mathbb{N}$, we have
\[
\beta_{k+i} \ge \frac{C^*}{4}, \quad \forall i \in \{0,1,\dots,k'\}. \tag{46}
\]
Hence,
\[
\sum_{j=k}^{k+k'}\gamma_j\beta_j \ge \frac{C^*}{4}\sum_{j=k}^{k+k'}\gamma_j \ge \frac{C^*}{4}\int_k^{k+k'}(x+1)^{-\epsilon}\,dx = \begin{cases} \frac{C^*}{4(1-\epsilon)}\big((k+k'+1)^{1-\epsilon} - (k+1)^{1-\epsilon}\big), & \epsilon \in (0,1), \\ \frac{C^*}{4}\big(\ln(k+k'+1) - \ln(k+1)\big), & \epsilon = 1. \end{cases} \tag{47}
\]
Note that when $\epsilon \in (0,1)$, the term $(k+k'+1)^{1-\epsilon} - (k+1)^{1-\epsilon}$ is monotonically increasing with respect to $k$, which implies that $\sum_{j=k}^{k+k'}\gamma_j\beta_j$ is lower bounded by a positive constant when $\epsilon \in (0,1)$. When $\epsilon = 1$, noting the specific form of $k'$, we have
\[
\ln(k+k'+1) - \ln(k+1) = \ln\Big(1+\frac{k'}{k+1}\Big) = \ln\Big(1+\frac{C^*}{4M}\Big),
\]
which is a positive constant. As a consequence, $\sum_{j=k}^{k+k'}\gamma_j\beta_j$ does not go to 0 as $k \to \infty$, which contradicts (45). Therefore, $\lim_{k\to\infty}\beta_k = 0$.

Proof of Theorem 2. We first develop the following inequality:
\[
L_{\alpha_{k+1}}(x^{k+1}) \le L_{\alpha_k}(x^k) + \frac{1}{2}(\alpha_{k+1}^{-1} - \alpha_k^{-1})\|x^{k+1}\|_{I-W}^2 - \frac{1}{2}\big(\alpha_k^{-1}(1+\lambda_n(W)) - L_f\big)\|x^{k+1}-x^k\|^2, \tag{48}
\]
and then claim the convergence of the sequences $\{L_{\alpha_k}(x^k)\}$, $\{\mathbf{1}^T f(x^k)\}$ and $\{x^k\}$ based on this inequality.

(a) Development of (48): From $x^{k+1} = x^k - \alpha_k\nabla L_{\alpha_k}(x^k)$, it follows that
\[
\langle \nabla L_{\alpha_k}(x^k), x^{k+1}-x^k\rangle = -\frac{\|x^{k+1}-x^k\|^2}{\alpha_k}. \tag{49}
\]
Since $\sum_{i=1}^n \nabla f_i(x_{(i)})$ is $L_f$-Lipschitz, $\nabla L_{\alpha_k}$ is Lipschitz with the constant $L_k \triangleq L_f + \alpha_k^{-1}\lambda_{\max}(I-W) = L_f + \alpha_k^{-1}(1-\lambda_n(W))$, implying
\[
L_{\alpha_k}(x^{k+1}) \le L_{\alpha_k}(x^k) + \langle\nabla L_{\alpha_k}(x^k), x^{k+1}-x^k\rangle + \frac{L_k}{2}\|x^{k+1}-x^k\|^2 = L_{\alpha_k}(x^k) - \frac{1}{2}\big(\alpha_k^{-1}(1+\lambda_n(W)) - L_f\big)\|x^{k+1}-x^k\|^2. \tag{50}
\]
Moreover,
\[
L_{\alpha_{k+1}}(x^{k+1}) = L_{\alpha_k}(x^{k+1}) + \frac{1}{2}(\alpha_{k+1}^{-1} - \alpha_k^{-1})\|x^{k+1}\|_{I-W}^2. \tag{51}
\]
Combining (50) and (51) yields (48).

(b) Convergence of the objective sequence: By Lemma 11 and Lemma 9, (48) yields the convergence of $\{L_{\alpha_k}(x^k)\}$ and
\[
\sum_{k=0}^\infty \big(\alpha_k^{-1}(1+\lambda_n(W)) - L_f\big)\|x^{k+1}-x^k\|^2 < \infty. \tag{52}
\]
By the specific form (15) of $\alpha_k$, we have
\[
\alpha_k^{-1}(1+\lambda_n(W)) - L_f = \alpha_k^{-1}\big(1+\lambda_n(W) - L_f\alpha_k\big) \ge \alpha_k^{-1}\Big(1+\lambda_n(W) - \frac{1}{(k_0+1)^\epsilon}\Big) \tag{53}
\]
for all $k > k_0$, where $k_0 = \big[(1+\lambda_n(W))^{-1/\epsilon}\big]$, i.e., the integer part of $(1+\lambda_n(W))^{-1/\epsilon}$. This together with (52) implies that $\|x^{k+1}-x^k\|^2$ converges to 0 at the rate of $o(k^{-\epsilon})$ and $\{x^k\}$ is asymptotically regular. Moreover, notice that
\[
\alpha_k^{-1}\|x^k\|_{I-W}^2 = \alpha_k^{-1}\|x^k - \bar{x}^k\|_{I-W}^2 \le (1-\lambda_n(W))L_f(k+1)^\epsilon\|x^k - \bar{x}^k\|^2.
\]
By Proposition 3, the term $\alpha_k^{-1}\|x^k\|_{I-W}^2$ converges to 0 as $k \to \infty$. As a consequence,
\[
\lim_{k\to\infty}\mathbf{1}^T f(x^k) = \lim_{k\to\infty}\Big(L_{\alpha_k}(x^k) - \frac{\|x^k\|_{I-W}^2}{2\alpha_k}\Big) = \lim_{k\to\infty}L_{\alpha_k}(x^k).
\]

(c) Convergence to a stationary point: Let $\overline{\nabla f}(x^k) \triangleq \frac{1}{n}\mathbf{1}\mathbf{1}^T\nabla f(x^k)$. Note that
\[
\|\bar{x}^{k+1} - \bar{x}^k\| = \Big\|\frac{1}{n}\mathbf{1}\mathbf{1}^T(x^{k+1}-x^k)\Big\| \le \|x^{k+1}-x^k\|. \tag{54}
\]
Thus, (52), (53) and (54) yield
\[
\sum_{k=0}^\infty \alpha_k^{-1}\|\bar{x}^{k+1} - \bar{x}^k\|^2 < \infty. \tag{55}
\]
By the iteration (4) of DGD, we have
\[
\bar{x}^{k+1} - \bar{x}^k = -\alpha_k\,\overline{\nabla f}(x^k). \tag{56}
\]
Plugging (56) into (55) yields
\[
\sum_{k=0}^\infty \alpha_k\|\overline{\nabla f}(x^k)\|^2 < \infty. \tag{57}
\]
Moreover,
\[
\big|\|\overline{\nabla f}(x^{k+1})\|^2 - \|\overline{\nabla f}(x^k)\|^2\big| \le \|\overline{\nabla f}(x^{k+1}) - \overline{\nabla f}(x^k)\| \cdot \big(\|\overline{\nabla f}(x^{k+1})\| + \|\overline{\nabla f}(x^k)\|\big) \le 2B\|\overline{\nabla f}(x^{k+1}) - \overline{\nabla f}(x^k)\| \le 2B\|\nabla f(x^{k+1}) - \nabla f(x^k)\| \le 2BL_f\|x^{k+1}-x^k\|, \tag{58}
\]
where the second inequality holds by the bounded gradient assumption (Assumption 3), the third inequality holds by the specific form of $\overline{\nabla f}(x^k)$, and the last inequality holds by the Lipschitz continuity of $\nabla f$. Note that
\[
\|x^{k+1}-x^k\| = \|x^{k+1} - \bar{x}^{k+1} + \bar{x}^{k+1} - \bar{x}^k + \bar{x}^k - x^k\| \le \|x^{k+1} - \bar{x}^{k+1}\| + \|\bar{x}^k - x^k\| + \alpha_k\|\overline{\nabla f}(x^k)\| \lesssim \alpha_k, \tag{59}
\]
where the first inequality holds by the triangle inequality and (56), and the last inequality holds by Proposition 3 and the boundedness assumption on $\nabla f$. Thus, (58) and (59) imply
\[
\big|\|\overline{\nabla f}(x^{k+1})\|^2 - \|\overline{\nabla f}(x^k)\|^2\big| \lesssim \alpha_k. \tag{60}
\]
By the specific form (15) of $\alpha_k$, (57), (60) and Lemma 12, it holds that
\[
\lim_{k\to\infty}\|\overline{\nabla f}(x^k)\|^2 = 0, \tag{61}
\]
and, as a consequence,
\[
\lim_{k\to\infty}\mathbf{1}^T\nabla f(x^k) = 0. \tag{62}
\]
Furthermore, by the coercivity of each $f_i$ and the convergence of $\{\mathbf{1}^T f(x^k)\}$, $\{x^k\}$ is bounded. Therefore, there exists a convergent subsequence of $\{x^k\}$. Let $x^*$ be any limit point of $\{x^k\}$. By (61) and the continuity of $\nabla f$, it holds that $\mathbf{1}^T\nabla f(x^*) = 0$. Moreover, by Proposition 3, $x^*$ is consensual. As a consequence, $x^*$ is a stationary point of problem (3). In addition, if $x^*$ is isolated, then by the asymptotic regularity of $\{x^k\}$ (Lemma 4), $\{x^k\}$ converges to $x^*$.
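The behavior established in Theorem 2 and Proposition 3 can be observed on a toy instance. Everything below is an illustrative assumption of ours, not the paper's experiment: a 4-agent ring network, scalar quadratics $f_i(x) = \frac{1}{2}(x-c_i)^2$, and the step-size rule $\alpha_k = \frac{1}{L_f(k+1)^\epsilon}$ from (15).

```python
import numpy as np

n = 4
I = np.eye(n)
W = 0.5 * I + 0.25 * (np.roll(I, 1, axis=0) + np.roll(I, -1, axis=0))  # ring mixing matrix
c = np.array([1.0, -2.0, 3.0, 0.5])   # f_i(x) = 0.5*(x - c_i)^2, so grad f_i(x) = x - c_i
Lf, eps = 1.0, 0.6                    # step sizes alpha_k = 1/(Lf*(k+1)^eps), as in (15)

x = np.zeros(n)
for k in range(10000):
    alpha = 1.0 / (Lf * (k + 1) ** eps)
    x = W @ x - alpha * (x - c)       # DGD iteration (4) with diminishing steps

consensus_err = np.linalg.norm(x - x.mean())
assert consensus_err < 0.05               # consensus error vanishes (Proposition 3)
assert abs(x.mean() - c.mean()) < 1e-6    # average reaches the stationary point of sum f_i
```

Here the unique stationary point of $\sum_i f_i$ is the mean of the $c_i$, and the residual disagreement between agents decays with $\alpha_k$, matching the $O((k+1)^{-\epsilon})$ consensus rate.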
E. Proof for Proposition 4

To prove Proposition 4, we still need the following lemmas.

Lemma 13 (Accumulated consensus of iterates). Under the conditions of Proposition 3, we have
\[
\sum_{k=0}^K \alpha_k\|x^{k+1} - \bar{x}^{k+1}\| \le D_1 + D_2\sum_{k=0}^K \alpha_k^2, \tag{63}
\]
where $D_1 = \frac{C\|x^0\|\zeta}{2(1-\zeta)}$, $D_2 = C\big(\frac{\|x^0\|\zeta}{2} + \frac{B}{1-\zeta}\big)$, and $B$ is specified in Assumption 3.

Proof. By (29),
\[
\sum_{k=0}^K \alpha_k\|x^{k+1} - \bar{x}^{k+1}\| \le C\|x^0\|\sum_{k=0}^K \alpha_k\zeta^{k+1} + CB\sum_{k=0}^K\sum_{j=0}^k \zeta^{k-j}\alpha_k\alpha_j. \tag{64}
\]
In the following, we estimate these two terms on the right-hand side of (64), respectively. Note that
\[
\sum_{k=0}^K \alpha_k\zeta^k \le \frac{1}{2}\sum_{k=0}^K \alpha_k^2 + \frac{1}{2}\sum_{k=0}^K \zeta^{2k} \le \frac{1}{2(1-\zeta)} + \frac{1}{2}\sum_{k=0}^K \alpha_k^2 \tag{65}
\]
and
\[
\sum_{k=0}^K\sum_{j=0}^k \zeta^{k-j}\alpha_k\alpha_j \le \frac{1}{2}\sum_{k=0}^K\sum_{j=0}^k \zeta^{k-j}(\alpha_k^2+\alpha_j^2) \le \frac{1}{1-\zeta}\sum_{k=0}^K \alpha_k^2. \tag{66}
\]
Plugging (65) and (66) into (64) yields (63).

Besides Lemma 13, we also need the following two lemmas, which have appeared in the literature (cf. [8]).

Lemma 14 ([8]). Let $\gamma_k = \frac{1}{k^\epsilon}$ for some $0 < \epsilon \le 1$. Then the following hold:
(a) if $0 < \epsilon < 1/2$, $\frac{1}{\sum_{k=1}^K\gamma_k} \le \frac{1-\epsilon}{K^{1-\epsilon}-1} = O\big(\frac{1}{K^{1-\epsilon}}\big)$ and $\frac{\sum_{k=1}^K\gamma_k^2}{\sum_{k=1}^K\gamma_k} \le \frac{1-\epsilon}{1-2\epsilon}\cdot\frac{K^{1-2\epsilon}-2\epsilon}{K^{1-\epsilon}-1} = O\big(\frac{1}{K^{\epsilon}}\big)$;
(b) if $\epsilon = 1/2$, $\frac{1}{\sum_{k=1}^K\gamma_k} = O\big(\frac{1}{\sqrt{K}}\big)$ and $\frac{\sum_{k=1}^K\gamma_k^2}{\sum_{k=1}^K\gamma_k} \le \frac{1+\ln K}{2(K^{1/2}-1)} = O\big(\frac{\ln K}{\sqrt{K}}\big)$;
(c) if $1/2 < \epsilon < 1$, $\frac{1}{\sum_{k=1}^K\gamma_k} \le \frac{1-\epsilon}{K^{1-\epsilon}-1} = O\big(\frac{1}{K^{1-\epsilon}}\big)$ and $\frac{\sum_{k=1}^K\gamma_k^2}{\sum_{k=1}^K\gamma_k} \le \frac{1-\epsilon}{2\epsilon-1}\cdot\frac{2\epsilon - 1/K^{2\epsilon-1}}{K^{1-\epsilon}-1} = O\big(\frac{1}{K^{1-\epsilon}}\big)$;
(d) if $\epsilon = 1$, $\frac{1}{\sum_{k=1}^K\gamma_k} \le \frac{1}{\ln K} = O\big(\frac{1}{\ln K}\big)$ and $\frac{\sum_{k=1}^K\gamma_k^2}{\sum_{k=1}^K\gamma_k} \le \frac{1-1/K}{\ln(K+1)-\ln 2} = O\big(\frac{1}{\ln K}\big)$.

Lemma 15 ([8, Proposition 3]). Let $h: \mathbb{R}^d \to \mathbb{R}$ be a convex, continuously differentiable function whose gradient is Lipschitz continuous with constant $L_h$. Then for any $x, y, u \in \mathbb{R}^d$,
\[
h(u) \ge h(x) + \langle\nabla h(y), u-x\rangle - \frac{L_h}{2}\|x-y\|^2.
\]

Proof of Proposition 4. To prove this proposition, we first develop the following inequality:
\[
L_{\alpha_k}(x^{k+1}) - L_{\alpha_k}(u) \le \frac{1}{2\alpha_k}\big(\|x^k - u\|^2 - \|x^{k+1}-u\|^2\big) \tag{67}
\]
for any $u \in \mathbb{R}^{n\times p}$. By Lemma 15, we have
\[
L_{\alpha_k}(u) \ge L_{\alpha_k}(x^{k+1}) + \langle\nabla L_{\alpha_k}(x^k), u - x^{k+1}\rangle - \frac{L^*}{2}\|x^{k+1}-x^k\|^2, \tag{68}
\]
where $L^* = L_f + \alpha_k^{-1}(1-\lambda_n(W))$, and by (21), we have $\nabla L_{\alpha_k}(x^k) = \alpha_k^{-1}(x^k - x^{k+1})$. Then (68) implies
\[
L_{\alpha_k}(u) \ge L_{\alpha_k}(x^{k+1}) + \alpha_k^{-1}\langle x^k - x^{k+1}, u - x^{k+1}\rangle - \frac{L^*}{2}\|x^{k+1}-x^k\|^2. \tag{69}
\]
Note that by the specific form of $\alpha_k$ ($= \frac{1}{L_f(k+1)^\epsilon}$), there exists an integer $k_0 > 0$ such that $L^* \le \alpha_k^{-1}$ for all $k > k_0$. Actually, for simplicity of the proof, we can take $\alpha_k < \frac{\lambda_n(W)}{L_f}$ starting from the first step so that $L^* \le \alpha_k^{-1}$ holds from the initial step. Thus, (69) implies
\[
L_{\alpha_k}(u) \ge L_{\alpha_k}(x^{k+1}) + \alpha_k^{-1}\langle x^k - x^{k+1}, u - x^{k+1}\rangle - \frac{1}{2\alpha_k}\|x^{k+1}-x^k\|^2. \tag{70}
\]
Recall that for any two vectors $a$ and $b$, it holds that $2\langle a, b\rangle - \|a\|^2 = \|b\|^2 - \|a-b\|^2$. Therefore,
\[
L_{\alpha_k}(u) \ge L_{\alpha_k}(x^{k+1}) + \frac{1}{2\alpha_k}\big(\|u - x^{k+1}\|^2 - \|u-x^k\|^2\big),
\]
which gives the basic inequality (67).

Note that the optimal solution $x_{\mathrm{opt}}$ is consensual and thus $\|x_{\mathrm{opt}}\|_{I-W}^2 = 0$. Therefore, $L_{\alpha_k}(x_{\mathrm{opt}}) = \bar{f}(x_{\mathrm{opt}}) = f_{\mathrm{opt}}$. By (67), we have
\[
\alpha_k\big(L_{\alpha_k}(x^{k+1}) - f_{\mathrm{opt}}\big) \le \frac{1}{2}\big(\|x^k - x_{\mathrm{opt}}\|^2 - \|x^{k+1}-x_{\mathrm{opt}}\|^2\big).
\]
Summing the above inequality over $k = 0, 1, \dots, K$ yields
\[
\sum_{k=0}^K \alpha_k\big(L_{\alpha_k}(x^{k+1}) - f_{\mathrm{opt}}\big) \le \frac{1}{2}\|x^0 - x_{\mathrm{opt}}\|^2. \tag{71}
\]
Moreover, noting that $L_{\alpha_k}(\bar{x}^{k+1}) = \bar{f}(\bar{x}^{k+1})$ and by the convexity of $L_{\alpha_k}$,
\[
L_{\alpha_k}(x^{k+1}) \ge \bar{f}(\bar{x}^{k+1}) + \langle\nabla L_{\alpha_k}(\bar{x}^{k+1}), x^{k+1} - \bar{x}^{k+1}\rangle \ge \bar{f}(\bar{x}^{k+1}) - B\|x^{k+1} - \bar{x}^{k+1}\|, \tag{72}
\]
where the second inequality holds by the bounded gradient assumption (cf. Assumption 3). Plugging (72) into (71) yields
\[
\sum_{k=0}^K \alpha_k\big(\bar{f}(\bar{x}^{k+1}) - f_{\mathrm{opt}}\big) \le \frac{1}{2}\|x^0 - x_{\mathrm{opt}}\|^2 + B\sum_{k=0}^K \alpha_k\|x^{k+1} - \bar{x}^{k+1}\|. \tag{73}
\]
By the definition (17) of $\bar{f}^K$, (73) implies
\[
\big(\bar{f}^K - f_{\mathrm{opt}}\big)\sum_{k=0}^K \alpha_k \le \frac{1}{2}\|x^0 - x_{\mathrm{opt}}\|^2 + B\sum_{k=0}^K \alpha_k\|x^{k+1} - \bar{x}^{k+1}\| \tag{74}
\]
\[
\le D_3 + D_4\sum_{k=0}^K \alpha_k^2, \tag{75}
\]
where $D_3 = \frac{1}{2}\|x^0 - x_{\mathrm{opt}}\|^2 + BD_1$ and $D_4 = BD_2$, with $D_1$ and $D_2$ specified in Lemma 13; the second inequality holds by Lemma 13. As a consequence,
\[
\bar{f}^K - f_{\mathrm{opt}} \le \frac{D_3 + D_4\sum_{k=0}^K \alpha_k^2}{\sum_{k=0}^K \alpha_k}. \tag{76}
\]
Furthermore, by Lemma 14, we get the claims of this proposition.
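The orders in Lemma 14, which control the rate bound (76), can be probed numerically. The horizons $K$ and the values of $\epsilon$ below are arbitrary illustrative choices.

```python
import numpy as np

def ratio(eps, K):
    """sum_{k=1}^K gamma_k^2 / sum_{k=1}^K gamma_k for gamma_k = k^(-eps)."""
    k = np.arange(1, K + 1, dtype=float)
    g = k ** (-eps)
    return (g ** 2).sum() / g.sum()

# The ratio vanishes as K grows for every eps in (0, 1], as Lemma 14 states.
for eps in (0.25, 0.5, 0.75, 1.0):
    assert ratio(eps, 10**6) < ratio(eps, 10**4) < ratio(eps, 10**2)

# For eps = 1/2, the decay respects the explicit bound of case (b).
K = 10**6
assert ratio(0.5, K) < (1 + np.log(K)) / (2 * (np.sqrt(K) - 1))
```

The slowest decay occurs at $\epsilon = 1$, where the ratio shrinks only like $1/\ln K$, which is why $\epsilon = 1/2$ (up to the logarithmic factor) gives the best rate in Proposition 4.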
F. Proofs for Theorem 3 and Proposition 5

In order to prove Theorem 3, we need the following lemmas.

Lemma 16 (Sufficient descent of $\{\hat{L}_\alpha(x^k)\}$). Let Assumptions 2 and 4 hold. Results are given in two cases below.
C1: the $r_i$'s are convex. Set $0 < \alpha < \frac{1+\lambda_n(W)}{L_f}$. Then
\[
\hat{L}_\alpha(x^{k+1}) \le \hat{L}_\alpha(x^k) - \frac{1}{2}\big(\alpha^{-1}(1+\lambda_n(W)) - L_f\big)\|x^{k+1}-x^k\|^2, \quad \forall k \in \mathbb{N}. \tag{77}
\]
C2: the $r_i$'s are not necessarily convex (in this case, we assume $\lambda_n(W) > 0$). Set $0 < \alpha < \frac{\lambda_n(W)}{L_f}$. Then
\[
\hat{L}_\alpha(x^{k+1}) \le \hat{L}_\alpha(x^k) - \frac{1}{2}\big(\alpha^{-1}\lambda_n(W) - L_f\big)\|x^{k+1}-x^k\|^2, \quad \forall k \in \mathbb{N}. \tag{78}
\]

Proof. Recall from Lemma 2 that $\nabla L_\alpha(x)$ is $L^*$-Lipschitz continuous with $L^* = L_f + \alpha^{-1}(1-\lambda_n(W))$, and thus
\[
\hat{L}_\alpha(x^{k+1}) - \hat{L}_\alpha(x^k) = L_\alpha(x^{k+1}) - L_\alpha(x^k) + r(x^{k+1}) - r(x^k) \le \langle\nabla L_\alpha(x^k), x^{k+1}-x^k\rangle + \frac{L^*}{2}\|x^{k+1}-x^k\|^2 + r(x^{k+1}) - r(x^k). \tag{79}
\]

C1: From the convexity of $r$, (7), and (19), it follows that
\[
0 = \xi^{k+1} + \frac{1}{\alpha}\big(x^{k+1} - x^k + \alpha\nabla L_\alpha(x^k)\big), \quad \xi^{k+1} \in \partial r(x^{k+1}).
\]
This and the convexity of $r$ further give
\[
r(x^{k+1}) - r(x^k) \le \langle\xi^{k+1}, x^{k+1}-x^k\rangle = -\frac{1}{\alpha}\|x^{k+1}-x^k\|^2 - \langle\nabla L_\alpha(x^k), x^{k+1}-x^k\rangle.
\]
Substituting this inequality into (79) and then expanding $L^* = L_f + \alpha^{-1}(1-\lambda_n(W))$ yield
\[
\hat{L}_\alpha(x^{k+1}) - \hat{L}_\alpha(x^k) \le \Big(\frac{L^*}{2} - \frac{1}{\alpha}\Big)\|x^{k+1}-x^k\|^2 = -\frac{1}{2}\big(\alpha^{-1}(1+\lambda_n(W)) - L_f\big)\|x^{k+1}-x^k\|^2.
\]
Sufficient descent requires the last coefficient to be negative, thus $0 < \alpha < \frac{1+\lambda_n(W)}{L_f}$.

C2: From (7) and (19), it follows that the function $r(u) + \frac{1}{2\alpha}\|u - (x^k - \alpha\nabla L_\alpha(x^k))\|^2$ reaches its minimum at $u = x^{k+1}$. Comparing the values of this function at $x^{k+1}$ and $x^k$ yields
\[
r(x^{k+1}) - r(x^k) \le \frac{1}{2\alpha}\|x^k - (x^k - \alpha\nabla L_\alpha(x^k))\|^2 - \frac{1}{2\alpha}\|x^{k+1} - (x^k - \alpha\nabla L_\alpha(x^k))\|^2 = -\frac{1}{2\alpha}\|x^{k+1}-x^k\|^2 - \langle\nabla L_\alpha(x^k), x^{k+1}-x^k\rangle.
\]
Substituting this inequality into (79) and expanding $L^*$ yield
\[
\hat{L}_\alpha(x^{k+1}) - \hat{L}_\alpha(x^k) \le \Big(\frac{L^*}{2} - \frac{1}{2\alpha}\Big)\|x^{k+1}-x^k\|^2 = -\frac{1}{2}\big(\alpha^{-1}\lambda_n(W) - L_f\big)\|x^{k+1}-x^k\|^2.
\]
Hence, sufficient descent requires $0 < \alpha < \frac{\lambda_n(W)}{L_f}$, which is possible only if $\lambda_n(W) > 0$.

Based on Lemmas 16–18, we can easily prove Theorem 3 and Proposition 5.

G. Proofs for Theorem 4 and Proposition 6

Based on the iteration (6) of Prox-DGD, we derive the following recursion for the iterates of Prox-DGD, which is similar to (16).

Lemma 19 (Recursion of $\{x^k\}$). For any $k \in \mathbb{N}$,
\[
x^k = W^k x^0 - \sum_{j=0}^{k-1}\alpha_j W^{k-1-j}\big(\nabla f(x^j) + \xi^{j+1}\big), \tag{81}
\]
where $\xi^{j+1} \in \partial r(x^{j+1})$ is the one determined by the proximal operator (7), for any $j = 0, \dots, k-1$.

Proof. By the iteration (6) and the proximal operator (7), we have
\[
0 = \xi^{k+1} + \alpha_k^{-1}\big(x^{k+1} - Wx^k + \alpha_k\nabla f(x^k)\big), \tag{82}
\]
where $\xi^{k+1} \in \partial r(x^{k+1})$, and thus
\[
x^{k+1} = Wx^k - \alpha_k\big(\nabla f(x^k) + \xi^{k+1}\big). \tag{83}
\]
By (83), we can easily derive the recursion (81).

Proof of Proposition 6. The proof of this proposition is similar to that of Proposition 3; one only needs to note that the subgradient term $\nabla f(x^j) + \xi^{j+1}$ is uniformly bounded by the constant $\bar{B}$ for any $j$. Thus, we omit it here.

To prove Theorem 4, we still need the following lemmas.

Lemma 20. Let Assumptions 2 and 4 hold. In Prox-DGD, use the step sizes (15). Results are given in two cases below.
C1: the $r_i$'s are convex. For any $k \in \mathbb{N}$,
\[
\hat{L}_{\alpha_{k+1}}(x^{k+1}) \le \hat{L}_{\alpha_k}(x^k) + \frac{1}{2}(\alpha_{k+1}^{-1} - \alpha_k^{-1})\|x^{k+1}\|_{I-W}^2 - \frac{1}{2}\big(\alpha_k^{-1}(1+\lambda_n(W)) - L_f\big)\|x^{k+1}-x^k\|^2. \tag{84}
\]
C2: the $r_i$'s are not necessarily convex. For any $k \in \mathbb{N}$,
\[
\hat{L}_{\alpha_{k+1}}(x^{k+1}) \le \hat{L}_{\alpha_k}(x^k) + \frac{1}{2}(\alpha_{k+1}^{-1} - \alpha_k^{-1})\|x^{k+1}\|_{I-W}^2 - \frac{1}{2}\big(\alpha_k^{-1}\lambda_n(W) - L_f\big)\|x^{k+1}-x^k\|^2. \tag{85}
\]

Proof. The proof of this lemma is similar to that of Lemma 16, noting that
\[
\hat{L}_{\alpha_{k+1}}(x^{k+1}) = \hat{L}_{\alpha_k}(x^k) + \big(\hat{L}_{\alpha_{k+1}}(x^{k+1}) - \hat{L}_{\alpha_k}(x^{k+1})\big) + \big(\hat{L}_{\alpha_k}(x^{k+1}) - \hat{L}_{\alpha_k}(x^k)\big)
\]
and
\[
\hat{L}_{\alpha_{k+1}}(x^{k+1}) - \hat{L}_{\alpha_k}(x^{k+1}) = \frac{1}{2}(\alpha_{k+1}^{-1} - \alpha_k^{-1})\|x^{k+1}\|_{I-W}^2,
\]
while the term $\hat{L}_{\alpha_k}(x^{k+1}) - \hat{L}_{\alpha_k}(x^k)$ can be estimated similarly to the proof of Lemma 16.

Lemma 21. Let Assumptions 2, 4 and 5 hold. In Prox-DGD, use the step sizes (15). If, further, each $f_i$ and $r_i$ are convex, then for any $u \in \mathbb{R}^{n\times p}$, we have
\[
\hat{L}_{\alpha_k}(x^{k+1}) - \hat{L}_{\alpha_k}(u) \le \frac{1}{2\alpha_k}\big(\|x^k - u\|^2 - \|x^{k+1}-u\|^2\big). \tag{86}
\]

Proof. By Lemma 15, we have
\[
L_{\alpha_k}(u) \ge L_{\alpha_k}(x^{k+1}) + \langle\nabla L_{\alpha_k}(x^k), u - x^{k+1}\rangle - \frac{L^*}{2}\|x^{k+1}-x^k\|^2,
\]
where $L^* = L_f + \alpha_k^{-1}(1-\lambda_n(W))$, and by the convexity of $r$, we have
\[
r(u) \ge r(x^{k+1}) + \langle\xi^{k+1}, u - x^{k+1}\rangle, \tag{87}
\]
where $\xi^{k+1} \in \partial r(x^{k+1})$ is the one determined by the proximal operator (7). By (83), it follows that
\[
\xi^{k+1} = \alpha_k^{-1}(x^k - x^{k+1}) - \nabla L_{\alpha_k}(x^k). \tag{88}
\]
Plugging (88) into (87), and then summing the two inequalities above, yields
\[
\hat{L}_{\alpha_k}(u) \ge \hat{L}_{\alpha_k}(x^{k+1}) + \alpha_k^{-1}\langle x^k - x^{k+1}, u - x^{k+1}\rangle - \frac{L^*}{2}\|x^{k+1}-x^k\|^2. \tag{89}
\]
Similarly to the rest of the proof of inequality (67), we can prove this lemma based on (89).

Proof of Theorem 4. Based on Lemma 20 and Lemma 21, we can prove Theorem 4. The proof of Theorem 4(a)–(c) is similar to that of Theorem 2, while the proof of Theorem 4(d) is very similar to that of Proposition 4; thus, the details of the proof of this theorem are omitted.
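A minimal runnable sketch of Prox-DGD in the convex case C1 can make the iteration (6) concrete. Everything below is an illustrative assumption of ours: $r_i = \lambda\|\cdot\|_1$, whose proximal operator is soft-thresholding, quadratic $f_i$, a ring network, and the step sizes (15).

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1 (used here since each r_i is the l1 norm)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(0)
n, p = 4, 3
I = np.eye(n)
W = 0.5 * I + 0.25 * (np.roll(I, 1, axis=0) + np.roll(I, -1, axis=0))  # ring mixing matrix
c = rng.standard_normal((n, p))       # f_i(x) = 0.5*||x - c_i||^2, grad f_i(x) = x - c_i
lam, Lf, eps = 0.1, 1.0, 0.6

x = np.zeros((n, p))
for k in range(20000):
    alpha = 1.0 / (Lf * (k + 1) ** eps)                        # diminishing steps (15)
    x = soft_threshold(W @ x - alpha * (x - c), alpha * lam)   # Prox-DGD iteration (6)

# Centralized solution of min_x sum_i (0.5*||x - c_i||^2 + lam*||x||_1):
x_opt = soft_threshold(c.mean(axis=0), lam)
assert np.linalg.norm(x - x_opt) < 0.1    # all agents are near the consensus minimizer
```

For this separable quadratic-plus-$\ell_1$ instance the centralized minimizer has the closed form $\mathrm{prox}_{\lambda\|\cdot\|_1}(\bar{c})$, so the residual measures both the consensus error and the optimization error jointly, as in Theorem 4.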
VI. CONCLUSION

In this paper, we study the convergence behavior of the algorithm DGD for smooth, possibly nonconvex consensus optimization, considering both fixed and decreasing step sizes. When using a fixed step size, we show that the iterates of DGD converge to a stationary point of a Lyapunov function, which approximates a stationary point of the original problem. Moreover, we estimate the distance between each local iterate and the global average, which is proportional to the step size and inversely proportional to the gap between the largest and the second largest magnitude eigenvalues of the mixing matrix. This motivates us to study DGD with decreasing step sizes. When using decreasing step sizes, we show that the iterates of DGD reach consensus asymptotically at a sublinear rate and converge to a stationary point of the original problem. We also estimate the convergence rates of the objective sequence in the convex setting under different diminishing step size strategies. Furthermore, we extend these convergence results to Prox-DGD, which is designed to minimize the sum of a differentiable function and a proximable function; both functions can be nonconvex. If the proximable function is convex, a larger fixed step size is allowed. These results are obtained by applying both existing and new proof techniques.

ACKNOWLEDGMENTS

The work of J. Zeng has been supported in part by the NSF grants 61603162 and 11501440. The work of W. Yin has been supported in part by the NSF grant ECCS-1462398 and ONR grants N000141410683 and N000141210838.

REFERENCES

[1] H. Attouch and J. Bolte, On the convergence of the proximal algorithm for nonsmooth functions involving analytic features, Math. Program., 116: 5-16, 2009.
[2] H. Attouch, J. Bolte and B. Svaiter, Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods, Math. Program., Ser. A, 137: 91-129, 2013.
[3] P. Bianchi and J. Jakubowicz, Convergence of a multi-agent projected stochastic gradient algorithm for nonconvex optimization, IEEE Trans. Automatic Control, 58(2): 391-405, 2013.
[4] P. Bianchi, G. Fort and W. Hachem, Performance of a distributed stochastic approximation algorithm, IEEE Trans. Information Theory, 59(11): 7405-7418, 2013.
[5] J. Bolte, A. Daniilidis and A. Lewis, The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems, SIAM Journal on Optimization, 17(4): 1205-1223, 2007.
[6] A. Chen and A. Ozdaglar, A fast distributed proximal gradient method, in Proc. 50th Allerton Conf. Commun., Control Comput., Monticello, IL, Oct. 2012, pp. 601-608.
[7] T. Chang, M. Hong and X. Wang, Multi-agent distributed optimization via inexact consensus ADMM, IEEE Trans. Signal Process., 63(2): 482-497, 2015.
[8] A. Chen, Fast Distributed First-Order Methods, Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 2012.
[9] Y. T. Chow, T. Wu and W. Yin, Cyclic coordinate update algorithms for fixed-point problems: analysis and applications, UCLA CAM 16-78, 2016.
[10] W. Deng, M. Lai, Z. Peng and W. Yin, Parallel multi-block ADMM with o(1/k) convergence, Journal of Scientific Computing, DOI 10.1007/s10915-016-0318-2, 2016.
[11] E. Hazan, K. Y. Levy and S. Shalev-Shwartz, On graduated optimization for stochastic nonconvex problems, in Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016.
[12] M. Hardt, B. Recht and Y. Singer, Train faster, generalize better: stability of stochastic gradient descent, in Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016.
[13] D. Hajinezhad, M. Hong and A. Garcia, ZENTH: a zeroth-order distributed algorithm for multi-agent nonconvex optimization, technical report.
[14] M. Hong, Z. Luo and M. Razaviyayn, Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems, ICASSP 2015.
[15] S. Hosseini, A. Chapman and M. Mesbahi, Online distributed optimization on dynamic networks, IEEE Trans. Automatic Control, 61(11): 3545-3550, 2016.
[16] D. Jakovetic, J. Xavier and J. Moura, Fast distributed gradient methods, IEEE Trans. Automatic Control, 59: 1131-1146, 2014.
[17] D. Kempe, A. Dobra and J. Gehrke, Gossip-based computation of aggregate information, in Proc. 44th Annual IEEE Symposium on Foundations of Computer Science, pp. 482-491, 2003.
[18] K. Knopp, Infinite Sequences and Series, Courier Corporation, 1956.
[19] J. Lafond, H. Wai and E. Moulines, D-FW: communication efficient distributed algorithms for high-dimensional sparse optimization, ICASSP 2016.
[20] S. Lee and A. Nedic, Distributed random projection algorithm for convex optimization, IEEE J. Sel. Topics Signal Process., 7(2): 221-229, 2013.
[21] Q. Ling and Z. Tian, Decentralized sparse signal recovery for compressive sleeping wireless sensor networks, IEEE Trans. Signal Process., 58(7): 3816-3827, 2010.
[22] S. Łojasiewicz, Sur la géométrie semi- et sous-analytique, Ann. Inst. Fourier (Grenoble), 43(5): 1575-1595, 1993.
[23] P. D. Lorenzo and G. Scutari, NEXT: in-network nonconvex optimization, IEEE Trans. Signal and Information Processing over Networks, 2(2): 120-136, 2016.
[24] P. D. Lorenzo and G. Scutari, Distributed nonconvex optimization over time-varying networks, ICASSP 2016.
[25] I. Matei and J. Baras, Performance evaluation of the consensus-based distributed subgradient method under random communication topologies, IEEE J. Sel. Top. Signal Process., 5: 754-771, 2011.
[26] H. McMahan and M. Streeter, Delay-tolerant algorithms for asynchronous distributed online learning, in Advances in Neural Information Processing Systems (NIPS), 2014.
[27] G. Mateos, J. Bazerque and G. Giannakis, Distributed sparse linear regression, IEEE Trans. Signal Process., 58(10): 5262-5276, 2010.
[28] A. Nedic and A. Ozdaglar, Distributed subgradient methods for multi-agent optimization, IEEE Trans. Automatic Control, 54(1): 48-61, 2009.
[29] A. Nedic and A. Olshevsky, Distributed optimization over time-varying directed graphs, IEEE Trans. Automatic Control, 60(3): 601-615, 2015.
[30] M. Nevelson and R. Z. Khasminskii, Stochastic Approximation and Recursive Estimation, translated from the Russian by the Israel Program for Scientific Translations, translation edited by B. Silver, American Mathematical Society, 1973.
[31] M. Raginsky, N. Kiarashi and R. Willett, Decentralized online convex programming with local information, in 2011 American Control Conference, San Francisco, CA, USA, 2011.
[32] S. Ram, A. Nedic and V. Veeravalli, Distributed stochastic subgradient projection algorithms for convex optimization, J. Optim. Theory Appl., 147: 516-545, 2010.
[33] H. Robbins and D. Siegmund, A convergence theorem for nonnegative almost supermartingales and some applications, in Proc. Optim. Methods Stat., pp. 233-257, 1971.
[34] I. Schizas, A. Ribeiro and G. Giannakis, Consensus in ad hoc WSNs with noisy links, part I: distributed estimation of deterministic signals, IEEE Trans. Signal Process., 56(1): 350-364, 2008.
[35] W. Shi, Q. Ling, K. Yuan, G. Wu and W. Yin, On the linear convergence of the ADMM in decentralized consensus optimization, IEEE Trans. Signal Process., 62(7): 1750-1761, 2014.
[36] W. Shi, Q. Ling, G. Wu and W. Yin, EXTRA: an exact first-order algorithm for decentralized consensus optimization, SIAM Journal on Optimization, 25(2): 944-966, 2015.
[37] W. Shi, Q. Ling, G. Wu and W. Yin, A proximal gradient algorithm for decentralized composite optimization, IEEE Trans. Signal Process., 63(22): 6013-6023, 2015.
[38] T. Tatarenko and B. Touri, On local analysis of distributed optimization, ACC 2016.
[39] T. Tatarenko and B. Touri, Non-convex distributed optimization, arXiv preprint, arXiv:1512.00895.
[40] J. Tsitsiklis, D. Bertsekas and M. Athans, Distributed asynchronous deterministic and stochastic gradient optimization algorithms, IEEE Trans. Automatic Control, AC-32(9): 803-812, 1986.
[41] H. Wai, T. Chang and A. Scaglione, A consensus-based decentralized algorithm for nonconvex optimization with application to dictionary learning, ICASSP 2015.
[42] H. Wai, A. Scaglione, J. Lafond and E. Moulines, A projection-free decentralized algorithm for non-convex optimization, GlobalSIP 2016.
[43] Y. Xu and W. Yin, A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion, SIAM Journal on Imaging Sciences, 6: 1758-1789, 2013.
[44] F. Yan, S. Sundaram, S. Vishwanathan and Y. Qi, Distributed autonomous online learning: regrets and intrinsic privacy-preserving properties, IEEE Trans. Knowledge and Data Engineering, 25(11): 2483-2493, 2013.
[45] K. Yuan, Q. Ling and W. Yin, On the convergence of decentralized gradient descent, SIAM Journal on Optimization, 26(3): 1835-1854, 2016.
[46] J. Zeng, S. Lin and Z. Xu, Sparse regularization: convergence of iterative jumping thresholding algorithm, IEEE Trans. Signal Process., 64(19): 5160-5118, 2016.
[47] M. Zhu and S. Martinez, An approximate dual subgradient algorithm for multi-agent non-convex optimization, IEEE Trans. Automatic Control, 58(6): 1534-1539, 2013.