2015 IEEE 54th Annual Conference on Decision and Control (CDC) December 15-18, 2015. Osaka, Japan

Augmented Distributed Gradient Methods for Multi-agent Optimization Under Uncoordinated Constant Stepsizes

Jinming Xu, Shanying Zhu, Yeng Chai Soh and Lihua Xie

Abstract—We consider distributed optimization problems in which a number of agents seek the optimum of a global objective function through merely local information sharing. The problem arises in various application domains, such as resource allocation, sensor fusion and distributed learning. In particular, we are interested in scenarios where agents use uncoordinated (different) constant stepsizes for local optimization. According to most existing works, using this kind of stepsize rule for the update, which is necessary in asynchronous scenarios, leads to a gap (error) between the estimated result and the exact optimum. To deal with this issue, we develop a new augmented distributed gradient method (termed Aug-DGM) based on consensus theory. The proposed algorithm not only allows for uncoordinated stepsizes but also, most importantly, is able to seek the exact optimum even with constant stepsizes, assuming that the global objective function has a Lipschitz gradient. A simple numerical example is provided to illustrate the effectiveness of the algorithm.

This work was supported by Nanyang Technological University's Research Scholarship Programme and partially by the Building Efficiency and Sustainability in the Tropics (SinBerBEST) Programme, which is funded by Singapore's National Research Foundation and led by the University of California, Berkeley (UCB) in collaboration with Singapore universities. The authors are with EXQUISITUS, the School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798. E-mails: [email protected] and {SYZHU,EYCSOH,ELHXIE}@ntu.edu.sg.

I. INTRODUCTION

Distributed optimization has received renewed attention from the control and machine learning communities due to its wide applications in areas such as formation control [1], resource allocation [2], sensor fusion in wireless sensor networks [3]–[5] and distributed learning [6], just to name a few. The technique used to solve such problems is distributed in nature and thus lends itself to large-scale problems, as it only requires local resources (e.g., sensing, communication and computation) to achieve global results.

In the existing literature, (sub)gradient-based methods are widely employed to solve large-scale optimization problems in a distributed way. In particular, Tsitsiklis, Bertsekas et al. [7] first studied distributed gradient-like optimization algorithms in which a group of processors perform computations and exchange messages asynchronously, intending to minimize a common function. In the context of distributed computation, consensus theory is particularly suitable for the distributed implementation of algorithms, as it allows agents to obtain global results by merely taking local actions [8]. In line with these works, Nedic and Ozdaglar [9] applied consensus theory to optimization problems where each agent only knows its own cost and aims to minimize the sum of all costs through cooperation with others. In addition, a dual subgradient averaging method has been proposed for solving the same problem, which is shown to have better convergence results in terms of network scaling [10]. Extensions have also been made to constrained cases [11], [12], to cases where only noisy gradients are available [13], [14], and to directed graphs [15]. A common issue of the abovementioned methods is that they require decaying stepsizes and the assumption of bounded (sub)gradients to achieve the exact optimum. On the other hand, dual decomposition is widely employed to solve large-scale optimization problems that are separable in the dual domain [16], [17]. Building on this method, distributed versions of the Alternating Direction Method of Multipliers (ADMM [18]) have been proposed for solving the same problem [19], [20]. However, this kind of technique depends heavily on the (coupling) structure of the problem, restricting its application to dynamic networks.

Our focus in this paper is on the heterogeneity of the agents involved in the computation. In particular, we consider the case where agents use uncoordinated (different) constant stepsizes for local optimization. In distributed algorithms, due to the lack of a global clock, agents operate according to their own local clocks, leading to uncoordinated execution of the algorithm. This is analogous to using different stepsizes in the average sense, especially when it comes to random networks [21], [22]. Thus, it is useful and essential to study the convergence properties of distributed algorithms employing uncoordinated constant stepsizes. There are some related works dedicated to asynchrony issues. Specifically, a distributed Newton-Raphson algorithm is proposed in [23], [24], and the convergence properties of its asynchronous computation are investigated; it requires the cost function to be continuous up to the second derivatives. In [21], an asynchronous broadcast-based algorithm is designed to handle random link failures as well as uncoordinated updates over random networks. However, it requires the stepsizes to follow a certain predefined decaying rule and the Poisson rates of activation to be the same for all agents in order to ensure convergence to the exact optimum.

In this paper, we propose a new augmented version of distributed gradient methods by introducing an extra step for consensus on the gradients of the objective functions. The proposed distributed algorithm allows for uncoordinated stepsizes for local optimization and, in contrast to most existing works, is guaranteed to converge to the exact optimum even with constant stepsizes. Note that our assumptions are similar to those of [7] and [14], except that we drop the strong convexity assumption; they are quite different from those of most (sub)gradient-based methods [9], [25], [26], where the (sub)gradients are usually assumed to be bounded, which is quite restrictive in unconstrained optimization problems.


It is also important to note that, although our approach has a similar form to the ones proposed in [23], [26]–[28], it differs from them in nature in that our assumptions (cf. Assumptions 2, 3, 4) are different and, most importantly, our algorithm is discrete and has the potential to be applied to dynamic networks, e.g., random networks.

Notations: We denote by $x_i$ the $i$-th component of a vector $x$. A variable $x$ without subscript is viewed as the collection of the components $x_i$, written as $[x_1, x_2, x_3, \ldots]^T$. In addition, we use $\circ$ to denote the component-wise multiplication of two column vectors, $\mathbf{1}$ the all-ones vector, and $\bar{x} = \frac{\mathbf{1}\mathbf{1}^T}{m}x$ the vector whose elements are the average of all components of $x$.

II. PROBLEM STATEMENT

We consider the problem in which a network of $m$ agents cooperatively minimizes the following function:
$$F(\theta) = \sum_{i=1}^{m} f_i(\theta)$$
where $\theta \in \mathbb{R}^d$ is the global variable to be optimized. This is equivalent to solving the following problem [28]:
$$\min_{x \in \mathbb{R}^{md}} f(x) = \sum_{i=1}^{m} f_i(x_i) \quad \text{s.t. } x_i = x_j,\ \forall i, j \in \mathcal{V} = \{1, 2, \ldots, m\} \tag{1}$$
where $x_i \in \mathbb{R}^d$ is the local estimate of agent $i$ of the global optimum $\theta^*$, $x = [x_1, x_2, \ldots, x_m]^T \in \mathbb{R}^{m \times d}$ is the collection of the estimates of all agents, and $f_i : \mathbb{R}^d \to \mathbb{R}$ is the local objective function known only to agent $i$.

Remark 1: In the sequel, for brevity, we only consider the case $d = 1$, as the analysis for the case $d \neq 1$ is similar except that we need to deal with Kronecker products; the results developed in this paper can be extended to the multi-dimensional case without much effort.

For problem (1) to be feasible, we make the following assumption on the existence of an optimal solution:

Assumption 1: There exists an optimum $x^* = \mathbf{1} \otimes \theta^*$ of problem (1) such that $f^* := f(x^*) = \min_{\theta \in \mathbb{R}^d} F(\theta)$.

We assume agents communicate data through a fixed and connected network represented by a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where each vertex $v_i \in \mathcal{V}$ denotes an agent and each edge $e_{ij} \in \mathcal{E}$ a communication link associated with a positive weight $w_{ij}$. Moreover, we make the following assumptions on the weight matrix as well as the cost functions.

Assumption 2: The weight matrix $W = \{w_{ij}\}$ associated with the communication graph satisfies $\mathbf{1}^T W = \mathbf{1}^T$, $W\mathbf{1} = \mathbf{1}$ and $\eta = \rho(W - \frac{\mathbf{1}\mathbf{1}^T}{m}) < 1$ (see [29] for details on the design of the weight matrix).

Assumption 3: Each objective function $f_i$ is continuously differentiable and has a Lipschitz gradient, i.e.,
$$\|g_i(x_i) - g_i(x_i')\| \le L_i \|x_i - x_i'\|, \quad \forall x_i, x_i' \in \mathbb{R}$$
where $L_i$ is the Lipschitz constant, and $g_i(x_i)$ and $g_i(x_i')$ are the gradients of $f_i$ evaluated at $x_i$ and $x_i'$, respectively.

Assumption 4: Each objective function $f_i$ is convex and coercive, i.e., $\|x_i\| \to \infty$ implies $f_i(x_i) \to \infty$.

Remark 2: It follows immediately from Assumptions 3 and 4 that the global function $f$ is convex, coercive and has a Lipschitz gradient with Lipschitz constant $L = \max_i\{L_i\}$.

To solve problem (1) exactly, we propose a new augmented distributed gradient method (Aug-DGM), as detailed in Algorithm 1. The proposed distributed algorithm can be rewritten in a compact form as follows:
$$x(k+1) = W[x(k) - \gamma \circ y(k)] \tag{2a}$$
$$y(k+1) = W[y(k) + \Delta g(k)] \tag{2b}$$
where $y(k)$ is the introduced auxiliary variable, $\Delta g(k) = g(x(k+1)) - g(x(k))$ is the incremental change of the gradients, and $\gamma$ is the vector of stepsizes, which can be different for different agents.

Algorithm 1 Aug-DGM for Fixed Connected Networks
1: Initialization: $\forall$ agent $i \in \mathcal{V}$: $x_i(0)$ arbitrarily assigned while $y_i(0) = g_i(x_i(0))$.
2: Local Optimization: $\forall$ agent $i \in \mathcal{V}$, compute:
$$s_i(k) = x_i(k) - \gamma_i \cdot y_i(k), \qquad x_i(k+1) = s_i(k) + \sum_{j \in \mathcal{N}_i} w_{ij}(s_j(k) - s_i(k)) \tag{3}$$
3: Dynamic Average Consensus: $\forall$ agent $i \in \mathcal{V}$, compute:
$$q_i(k) = y_i(k) + g_i(x_i(k+1)) - g_i(x_i(k)), \qquad y_i(k+1) = q_i(k) + \sum_{j \in \mathcal{N}_i} w_{ij}(q_j(k) - q_i(k)) \tag{4}$$
4: Set $k \to k+1$ and go to Step 2.
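To make the update concrete, here is a minimal NumPy sketch of iteration (2a)–(2b). It is our own illustration rather than the authors' code: the helper names are hypothetical, and `metropolis_weights` is just one standard way to construct a weight matrix satisfying Assumption 2 (cf. [29]).

```python
# Illustrative sketch of Aug-DGM, equations (2a)-(2b); not the authors' code.
import numpy as np

def metropolis_weights(adj):
    """Build a symmetric doubly stochastic W (one choice satisfying Assumption 2)."""
    m = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if adj[i, j] and i != j:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

def aug_dgm(grad, x0, W, gamma, iters):
    """Run Aug-DGM: grad maps the stacked estimates x to the stacked local
    gradients g(x); gamma holds the (possibly different) per-agent stepsizes."""
    x = x0.copy()
    y = grad(x)                        # initialization: y(0) = g(x(0))
    g_old = y.copy()
    for _ in range(iters):
        x = W @ (x - gamma * y)        # (2a): gamma * y is component-wise
        g_new = grad(x)
        y = W @ (y + g_new - g_old)    # (2b): dynamic average consensus on gradients
        g_old = g_new
    return x, y
```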

To quantify the variation of the stepsizes used by the agents, we introduce the following parameter.

Definition 1 (Heterogeneity of Stepsizes (HoS)): Let $\gamma$ be the vector of stepsizes chosen by the agents. Then the HoS is defined as $\Delta_\gamma = \|\tilde{\gamma}\| / \|\bar{\gamma}\|$, where $\bar{\gamma} = \frac{\mathbf{1}\mathbf{1}^T}{m}\gamma$ and $\tilde{\gamma} = \gamma - \bar{\gamma}$.

The following lemma shows an important conservation property of the above distributed algorithm, which is obtained immediately by multiplying both sides of (2b) by $\frac{\mathbf{1}\mathbf{1}^T}{m}$ from the left and then summing over $k$.

Lemma 1 (Conservation Property [30]): Let $\bar{g}(k) = \frac{\mathbf{1}\mathbf{1}^T}{m} g(x(k))$ and $y(0) = g(x(0))$. Then $\bar{y}(k) = \bar{g}(k)$, $\forall k \ge 0$.

III. CONVERGENCE ANALYSIS

In this section, we study the convergence properties of the proposed Aug-DGM algorithm (2). We first consider scalar sequences and give the following result.

Lemma 2: Let $\{\upsilon(k)\}_{k \ge 0}$ and $\{w(k)\}_{k \ge 0}$ be two positive scalar sequences such that for all $k \ge 0$
$$\upsilon(k+1) \le \eta\,\upsilon(k) + w(k) \tag{5}$$
where $\eta \in (0,1)$ is the decaying factor. Let $\Upsilon(k) = \sqrt{\sum_{i=0}^{k} \|\upsilon(i)\|^2}$ and $\Omega(k) = \sqrt{\sum_{i=0}^{k} \|w(i)\|^2}$ be the signal energies from $0$ to $k$. Then, we have
$$\Upsilon(k) \le \alpha\,\Omega(k) + \epsilon$$
where $\alpha = \frac{\sqrt{2}}{1-\eta}$ and $\epsilon = \sqrt{\frac{2}{1-\eta^2}}\,\upsilon(0)$.

Proof: See Appendix.

To facilitate the subsequent analysis, let us consider the following auxiliary sequence, which runs in analogy with (2a):
$$\bar{x}(k+1) = \bar{x}(k) - \overline{\gamma \circ y}(k) \tag{6}$$
where $\bar{x}(k) = \frac{\mathbf{1}\mathbf{1}^T}{m} x(k)$ and $\overline{\gamma \circ y}(k) = \frac{\mathbf{1}\mathbf{1}^T}{m}(\gamma \circ y(k))$. Also, let $\bar{y}(k) = \frac{\mathbf{1}\mathbf{1}^T}{m} y(k)$ and $\tilde{y}(k) = y(k) - \bar{y}(k)$. We have
$$\overline{\gamma \circ y}(k) = \bar{\gamma} \circ \bar{y}(k) + \overline{\tilde{\gamma} \circ \tilde{y}}(k) \tag{7}$$
Thus, taking the norm of both sides and knowing that
$$\left\|\overline{\tilde{\gamma} \circ \tilde{y}}(k)\right\| \le \frac{1}{\sqrt{m}} \|\tilde{\gamma}\| \, \|\tilde{y}(k)\| \tag{8}$$
by the Cauchy-Schwarz inequality, we obtain
$$\frac{1}{\sqrt{m}}\left(\|\bar{\gamma}\| \, \|\bar{y}(k)\| - \|\tilde{\gamma}\| \, \|\tilde{y}(k)\|\right) \le \left\|\overline{\gamma \circ y}(k)\right\| \tag{9}$$

Then, we present our next important lemma.

Lemma 3: Consider the algorithm (2) and suppose Assumptions 2 and 3 hold. Let $X(k) = \sqrt{\sum_{i=0}^{k} \|\tilde{x}(i)\|^2}$, $Y(k) = \sqrt{\sum_{i=0}^{k} \|\tilde{y}(i)\|^2}$ and $Z(k) = \sqrt{\sum_{i=0}^{k} \|\overline{\gamma \circ y}(i)\|^2}$ be the signal energies from $0$ to $k$, and let $\gamma_{\max} = \max_i\{\gamma_i\}$, $\beta = \gamma_{\max} L$ and $\eta' = \eta + \beta(1 + \Delta_\gamma)$. If $\beta < \frac{(1-\eta)^2}{(1+\Delta_\gamma)(2\eta^3 + 2\eta^2 - \eta + 1)}$ such that $\rho_1\rho_2 < 1$ and $\eta' < 1$, then we have
$$X(k) \le \frac{\rho_1\alpha_2 + \alpha_1}{1 - \rho_1\rho_2} Z(k) + \frac{\epsilon_1 + \rho_1\epsilon_2}{1 - \rho_1\rho_2} \tag{10a}$$
$$Y(k) \le \frac{\rho_2\alpha_1 + \alpha_2}{1 - \rho_1\rho_2} Z(k) + \frac{\epsilon_2 + \rho_2\epsilon_1}{1 - \rho_1\rho_2} \tag{10b}$$
where $\rho_1 = \frac{\sqrt{2}\eta\Delta_\gamma}{1-\eta}$, $\alpha_1 = \frac{\sqrt{2}\eta\gamma_{\max}(1+\Delta_\gamma)}{1-\eta}$, $\epsilon_1 = \frac{\sqrt{2}\|\tilde{x}(0)\|}{\sqrt{1-\eta^2}}$, and $\rho_2 = \frac{\sqrt{2}\eta L(1+\Delta_\gamma)}{1-\eta'}$, $\alpha_2 = \frac{\sqrt{2}\eta(1+\eta)L}{1-\eta'}$, $\epsilon_2 = \frac{\sqrt{2}\|\tilde{y}(0)\|}{\sqrt{1-\eta'^2}}$.

Proof: See Appendix.
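Before moving to the main result, here is a quick numerical sanity check of Definition 1 and Lemma 1 (ours, not part of the paper), reusing `metropolis_weights` and `aug_dgm` from the sketch above. The ring network, the quadratic objectives $f_i(x) = \frac{a_i}{2}(x - b_i)^2$ and all numbers are hypothetical test data.

```python
# Compute the HoS (Definition 1) and verify y_bar(k) = g_bar(k) (Lemma 1).
m = 5
a = np.array([1.0, 1.5, 2.0, 1.2, 1.8])     # curvatures of the local quadratics
b = np.array([0.3, -0.5, 0.8, 0.1, -0.2])   # minimizers of the local quadratics
grad = lambda x: a * (x - b)                # stacked local gradients g_i(x_i)
adj = np.eye(m, k=1) + np.eye(m, k=-1)
adj[0, -1] = adj[-1, 0] = 1                 # ring graph
W = metropolis_weights(adj)

gamma = np.array([0.05, 0.10, 0.08, 0.12, 0.06])   # uncoordinated stepsizes
gamma_bar = np.full(m, gamma.mean())
hos = np.linalg.norm(gamma - gamma_bar) / np.linalg.norm(gamma_bar)

x, y = aug_dgm(grad, np.zeros(m), W, gamma, iters=50)
print(hos)                                   # Delta_gamma of these stepsizes
print(np.isclose(y.mean(), grad(x).mean()))  # conservation property: True
```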

Now, we are ready to present our main result.

Theorem 1: Consider the distributed algorithm (2) with $y(0) = g(x(0))$ and suppose Assumptions 1, 2, 3 and 4 hold. Then there exists a positive number $\gamma^*(\eta, \Delta_\gamma, L)$ such that if $\gamma_i < \gamma^*$, $\forall i \in \mathcal{V}$, we have $\lim_{k \to \infty} f(x(k)) = f(\mathbf{1} \otimes \theta^*) = f^*$, where $f^*$ is the optimal value of problem (1).

Proof: Consider the sequence (6). Since $f$ has a Lipschitz gradient by Assumption 3 and Remark 2, we have for all $x, x' \in \mathbb{R}^m$
$$f(x') \le f(x) + g(x)^T (x' - x) + \frac{L}{2}\|x' - x\|^2$$
Let $\Delta\bar{x}(k) = \bar{x}(k+1) - \bar{x}(k) = -\overline{\gamma \circ y}(k)$. Plugging $x' = \bar{x}(k+1)$ and $x = \bar{x}(k)$ into the above relation yields
$$\begin{aligned} f(\bar{x}(k+1)) &\le f(\bar{x}(k)) + g(\bar{x}(k))^T \Delta\bar{x}(k) + \frac{L}{2}\|\Delta\bar{x}(k)\|^2 \\ &\le f(\bar{x}(k)) + g(x(k))^T \Delta\bar{x}(k) + \frac{L}{2}\|\Delta\bar{x}(k)\|^2 + (g(\bar{x}(k)) - g(x(k)))^T \Delta\bar{x}(k) \\ &\le f(\bar{x}(k)) - \bar{y}(k)^T\,\overline{\gamma \circ y}(k) + \frac{L}{2}\|\Delta\bar{x}(k)\|^2 + (g(\bar{x}(k)) - g(x(k)))^T \Delta\bar{x}(k) \end{aligned} \tag{11}$$
where for the last inequality we have used the fact that $g(x(k))^T \Delta\bar{x}(k) = \bar{g}(k)^T \Delta\bar{x}(k) = \bar{y}(k)^T \Delta\bar{x}(k)$; the first equality follows from the relation $a^T\bar{b} = \bar{a}^T\bar{b}$, $\forall a, b \in \mathbb{R}^m$, while the second is due to the conservation property given in Lemma 1.

Let us first bound the second term. Using (7), we have
$$\begin{aligned} \bar{y}(k)^T\,\overline{\gamma \circ y}(k) &= \frac{\sqrt{m}}{\|\bar{\gamma}\|}\left(\overline{\gamma \circ y}(k) - \overline{\tilde{\gamma} \circ \tilde{y}}(k)\right)^T \overline{\gamma \circ y}(k) \\ &\ge \frac{\sqrt{m}}{\|\bar{\gamma}\|}\left(\left\|\overline{\gamma \circ y}(k)\right\|^2 - \left\|\overline{\tilde{\gamma} \circ \tilde{y}}(k)\right\| \left\|\overline{\gamma \circ y}(k)\right\|\right) \\ &\ge \frac{\sqrt{m}}{\|\bar{\gamma}\|}\left\|\overline{\gamma \circ y}(k)\right\|^2 - \frac{\|\tilde{\gamma}\|}{\|\bar{\gamma}\|}\|\tilde{y}(k)\| \left\|\overline{\gamma \circ y}(k)\right\| \\ &\ge \frac{1}{\gamma_{\max}}\left\|\overline{\gamma \circ y}(k)\right\|^2 - \Delta_\gamma \|\tilde{y}(k)\| \left\|\overline{\gamma \circ y}(k)\right\| \end{aligned} \tag{12}$$
where we have employed relation (8) to obtain the third inequality and the definition of the HoS (see Definition 1) for the last one.

Then, let us consider the last deviation term. By Assumption 3 and Remark 2, we obtain
$$(g(\bar{x}(k)) - g(x(k)))^T \Delta\bar{x}(k) \le L \|\tilde{x}(k)\| \, \|\Delta\bar{x}(k)\| \tag{13}$$
Combining (11), (12) and (13) leads to
$$f(\bar{x}(k+1)) \le f(\bar{x}(k)) - \left(\frac{1}{\gamma_{\max}} - \frac{L}{2}\right)\|\Delta\bar{x}(k)\|^2 + \left(\Delta_\gamma \|\tilde{y}(k)\| + L\|\tilde{x}(k)\|\right)\|\Delta\bar{x}(k)\| \tag{14}$$
Summing the above inequality over $k$ from $0$ to $t$, we have
$$f(\bar{x}(t+1)) \le f(\bar{x}(0)) - \left(\frac{1}{\gamma_{\max}} - \frac{L}{2}\right)\sum_{k=0}^{t}\|\Delta\bar{x}(k)\|^2 + \Delta_\gamma \sum_{k=0}^{t}\|\tilde{y}(k)\|\,\|\Delta\bar{x}(k)\| + L\sum_{k=0}^{t}\|\tilde{x}(k)\|\,\|\Delta\bar{x}(k)\| \tag{15}$$
Then, using the Cauchy-Schwarz inequality and recalling that $\Delta\bar{x}(k) = -\overline{\gamma \circ y}(k)$, we obtain
$$f(\bar{x}(t+1)) \le f(\bar{x}(0)) - \left(\frac{1}{\gamma_{\max}} - \frac{L}{2}\right)Z^2(t) + \Delta_\gamma Y(t)Z(t) + LX(t)Z(t) \tag{16}$$
Suppose all the assumptions of Lemma 3 hold and $\beta < \frac{(1-\eta)^2}{(1+\Delta_\gamma)(2\eta^3 + 2\eta^2 - \eta + 1)}$ such that $\rho_1\rho_2 < 1$ and $\eta' < 1$. Then, invoking Lemma 3, we have
$$f(\bar{x}(t+1)) \le f(\bar{x}(0)) - \mu Z^2(t) + \nu Z(t) \tag{17}$$
where
$$\mu = \frac{1}{\gamma_{\max}} - \frac{L}{2} - \frac{L(\alpha_1 + \rho_1\alpha_2) + \Delta_\gamma(\alpha_2 + \rho_2\alpha_1)}{1 - \rho_1\rho_2}, \qquad \nu = \frac{L(\epsilon_1 + \rho_1\epsilon_2) + \Delta_\gamma(\epsilon_2 + \rho_2\epsilon_1)}{1 - \rho_1\rho_2} \tag{18}$$
Let $u(t) = f(\bar{x}(t)) - f^*$. Then, since $u(t) \ge 0$, $\forall t \ge 0$, (17) can be rewritten as
$$-\mu Z^2(t) + \nu Z(t) + u(0) \ge 0 \tag{19}$$


Additionally, it is not difficult to show that $\mu > 0$ when $0 < \beta < \frac{b - \sqrt{b^2 - 4ac}}{2a}$, where
$$\begin{cases} a = (1-\eta^2)(1-\eta)(1+\Delta_\gamma) \\ b = 4\eta(\eta^2+1)\Delta_\gamma^2 + (4\eta^3 - 4\eta^2 + 6\eta + 2)\Delta_\gamma + 4\eta^3 + 5\eta^2 - 4\eta + 3 \\ c = 2(1-\eta)^2 \end{cases} \tag{20}$$
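For reference, the bound in (20) is straightforward to evaluate numerically. The helper below (a hypothetical name; ours, not the paper's) returns the largest admissible $\beta = \gamma_{\max} L$ for given $\eta$ and $\Delta_\gamma$.

```python
# Evaluate the stepsize bound (20): mu > 0 when 0 < beta < (b - sqrt(b^2 - 4ac))/(2a).
import math

def beta_bound(eta, hos):
    a_ = (1 - eta**2) * (1 - eta) * (1 + hos)
    b_ = (4 * eta * (eta**2 + 1) * hos**2
          + (4 * eta**3 - 4 * eta**2 + 6 * eta + 2) * hos
          + 4 * eta**3 + 5 * eta**2 - 4 * eta + 3)
    c_ = 2 * (1 - eta)**2
    return (b_ - math.sqrt(b_**2 - 4 * a_ * c_)) / (2 * a_)

print(beta_bound(0.5, 0.2))   # admissible beta for eta = 0.5, Delta_gamma = 0.2
```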

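Continuing the toy setup from the earlier snippets, one can also observe the behavior claimed in Theorem 1: with small enough, uncoordinated, constant stepsizes, every agent's estimate converges to the exact optimum, which for the quadratic example is $\theta^* = \sum_i a_i b_i / \sum_i a_i$. This is our own illustration, not the paper's numerical example.

```python
# Illustrating Theorem 1 on the quadratic toy problem defined earlier.
theta_star = (a @ b) / a.sum()                 # exact optimum of F
x, _ = aug_dgm(grad, np.zeros(m), W, gamma, iters=3000)
print(np.allclose(x, theta_star, atol=1e-6))   # expected: True (exact convergence)
```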
