Distributed Multi-Agent Optimization via Dual Decomposition

HÅKAN TERELIUS

Master's Degree Project
Stockholm, Sweden 2010

XR-EE-RT 2010:013
Abstract

In this master's thesis, a new distributed multi-agent optimization algorithm is introduced. The algorithm is based upon the dual decomposition of the optimization problem, together with the subgradient method for finding the optimal dual solution. The convergence of the new optimization algorithm is proved for communication networks with bounded time-varying delays and noisy communication. Further, an explicit bound on the convergence rate is given, which shows the dependency on the network parameters. Finally, the new optimization algorithm is also compared to an earlier known primal decomposition algorithm through extensive numerical simulations.
Acknowledgements

There are a number of people who have been invaluable to me throughout my work on this thesis. This work began as a research project when I visited the California Institute of Technology in Pasadena, USA, during the spring of 2009. I would like to thank the entire Control and Dynamical Systems department at Caltech for introducing me to the world of research, and for turning my visit into a pleasure.

My supervisor Prof. Richard M. Murray at Caltech, for inviting me to Caltech, for always giving me new ideas and suggestions, and for being inspiring and encouraging every time I ran into his office in despair. Dr. Ufuk Topcu, for always having time for me, guiding me through the problems and helping me in every possible way. Your time has truly been invaluable.

Back in Sweden, I continued working on this project at the Automatic Control Laboratory, Royal Institute of Technology (KTH), Stockholm. I would also like to thank everyone at the department for giving me such a warm welcome that not even the water in Trosa canal could make me feel cold. Prof. Henrik Sandberg, for inspiring discussions, support and a lot of patience throughout my work on this thesis. Prof. Karl Henrik Johansson, for giving me the best gift a student can receive: the inspiration, encouragement and joy of doing research.

I also want to show my appreciation for the financial support I have received from Stiftelsen Frans von Sydows Hjälpfond during my studies at the Royal Institute of Technology, which has led to this master's thesis. Finally, I would like to express my gratitude to my family and friends, for supporting me throughout my visit at Caltech and the work on my master's thesis. Thank you!

Håkan Terelius
Stockholm, August 2010
Contents

Abstract
Acknowledgements
Contents
Notation

1 Introduction
  1.1 Motivation
  1.2 Problem Formulation
  1.3 Outline

2 Preliminaries
  2.1 Mathematical Optimization
  2.2 Convex Optimization
      2.2.1 Concave Functions
  2.3 Subgradients
  2.4 Subgradient Methods
      2.4.1 Choosing Step Sizes
      2.4.2 Projected Subgradient Methods
  2.5 Decomposition Methods
      2.5.1 Primal Decomposition
      2.5.2 Dual Decomposition
  2.6 Graph Theory
  2.7 Distributed Optimization
      2.7.1 Centralized Optimization
      2.7.2 Decentralized Optimization
  2.8 Average Consensus Problem

3 Primal Consensus Algorithm
  3.1 Problem Definition
  3.2 Computation Model
  3.3 Convergence Analysis

4 Dual Decomposition Algorithm
  4.1 Dual Problem Definition
  4.2 Centralized Algorithm
  4.3 Computation Model
  4.4 Decentralized Algorithms
      4.4.1 Halting Algorithm
      4.4.2 Constant-Delays Algorithm
      4.4.3 Time-Varying Delays Algorithm
  4.5 Convergence Analysis
      4.5.1 Constant-Delays Algorithm
      4.5.2 Time-Varying Delays Algorithm
      4.5.3 Noisy Communication Channels
  4.6 Communication Considerations
      4.6.1 Measuring Communication
      4.6.2 Primal Consensus Algorithm
      4.6.3 Dual Decomposition Algorithm
  4.7 Quadratic Objective Functions
      4.7.1 Simple Quadratic Example

5 Numerical Results
  5.1 Common Model
  5.2 Evaluating the Convergence Rate
  5.3 Comparing Primal and Dual Algorithms
      5.3.1 Connected Graph
      5.3.2 Line Graph
      5.3.3 Ring Graph
      5.3.4 Complete Graph
  5.4 Dual Decomposition Algorithm
      5.4.1 Without Bounded Subgradients
      5.4.2 Noisy Subgradients
      5.4.3 Halting Algorithm

6 Conclusions and Remarks
  6.1 Conclusions
  6.2 Summary of Contributions
  6.3 Future Research

A Distributed Consensus Sum Algorithm

Bibliography
List of Figures

2.1 A convex function f, with the chord between the two points x and y.
2.2 A convex function f, with subgradients at the points A and B.
2.3 An undirected graph with 7 nodes and 7 edges.
2.4 A directed graph with 7 nodes and 9 edges.
2.5 Communication graph for a centralized optimization problem.
2.6 Communication graph for a decentralized optimization problem.

4.1 Communication graph for the centralized optimization algorithm.
4.2 Communication graph for the decentralized optimization algorithm.
4.3 The intuition behind the delays in the convergence results.
4.4 Communication graph for the simple quadratic example.
4.5 The spectral radius of the dynamics matrix, for the quadratic optimization problem.

5.1 Connected communication graph for the first simulation.
5.2 The convergence rate's dependency on the step size, for the first simulation.
5.3 The convergence rate for the first simulation, using the optimal step sizes.
5.4 Communication graph for the second simulation, with a line topology.
5.5 The convergence rate's dependency on the step size, for the line graph.
5.6 The convergence rate for the second simulation, using the optimal step sizes.
5.7 Communication graph for the third simulation, with a ring topology.
5.8 The convergence rate's dependency on the step size, for the ring graph.
5.9 The convergence rate for the third simulation, using the optimal step sizes.
5.10 Communication graph for the fourth simulation, with a complete graph topology.
5.11 The convergence rate's dependency on the step size, for the complete graph.
5.12 The convergence rate for the fourth simulation, using the optimal step sizes.
5.13 Convergence rate with unbounded subgradients, and a small step size.
5.14 Convergence rate with unbounded subgradients, and the optimal step size.
5.15 Convergence rate with unbounded subgradients, and a large step size.
5.16 Convergence rate with noisy communication channels.
5.17 The convergence rate's dependency on the step size, for the Halting Algorithm after T updates.
5.18 The convergence rate's dependency on the step size, for the Halting Algorithm after T time steps.
5.19 Convergence rate for the Halting Algorithm.
List of Algorithms

2.1 Primal Decomposition Algorithm
2.2 Dual Decomposition Algorithm
3.1 Primal Consensus Algorithm
4.1 Centralized Optimization Algorithm
4.2 Halted Decentralized Algorithm
4.3 Constant-Delays Distributed Algorithm
4.4 Time-Varying Delay, Distributed Algorithm
4.5 Simple Quadratic Example Algorithm
A.1 Leader Election Algorithm
A.2 Computing 1/N
Notation

R            The set of real-valued scalars.
C            The set of complex-valued scalars.
(·)_i        The i:th element of the vector.
(·)_{i,j}    The (i,j):th element of the matrix.
(·)_{i,·}    The i:th row of the matrix.
(·)_{·,j}    The j:th column of the matrix.
(·)^T        The transpose of a matrix or vector.
x            The n-dimensional real-valued optimization variable.
x^i          The i:th node's estimate of the optimization variable.
λ            The Lagrange dual optimization variables.
O(·)         Landau notation; describes the limiting behavior of a function as the argument tends towards infinity.
ρ(·)         The spectral radius of the matrix.
1            The vector with each element equal to one.
||·||_2      The Euclidean norm of the vector.
d(·,·)       The distance between a point and a set.
Chapter 1
Introduction

“To raise new questions, new possibilities, to regard old problems from a new angle, requires creative imagination and marks real advance in science.”
Albert Einstein, 1879–1955.
1.1 Motivation
In the increasingly connected world we live in, the ability to compute an optimal distributed decision has gained a lot of attention in recent years [1, 2, 3, 4, 5, 6]. Distributed optimization problems appear in a broad range of practical applications; for example, the Internet consists of many users with different objectives [7]. The users are referred to as “agents”, and the interconnected network of agents is called a “multi-agent system”. The multi-agent systems considered here share an important feature: they consist of agents making local decisions while trying to coordinate the overall objective with the other agents in the system. Continuing with the Internet as an example, each connected agent gains a utility from using the network, but since the capacity of the network is limited, each agent also has a cost associated with the usage of the network. Thus, each agent can formulate an optimization problem of maximizing its utility while minimizing its cost. Further, from a global perspective, we would like to maximize the total utility of all agents, while minimizing the total cost for all agents.

Another application that has received a lot of attention is distributed cooperative control of unmanned vehicles, and especially formation control of such vehicles [8, 9, 10, 11]. For example, using several micro-satellites has the potential for greater functionality and reliability than an individual satellite. Also, formation control of a group of unmanned aircraft or underwater vehicles can greatly increase the efficiency of surveillance and search-and-rescue operations. Common to all these examples is that the optimal control algorithms should be completely distributed, relying only on local information.
In this thesis, we tackle the underlying distributed optimization problem that is common to all these applications.
1.2 Problem Formulation
We consider an optimization problem, where a convex and potentially non-smooth objective function should be minimized. The objective function consists of N terms, each term being a function of the optimization variable x ∈ R^n,

\[ \operatorname*{minimize}_{x \in \mathbb{R}^n} \; \sum_{i=1}^{N} f^i(x). \]
Further, a connected graph consisting of N agents is associated with the optimization problem, and each agent is associated with one of the terms in the objective function. It is assumed that the term f^i of the objective function is agent i's own local objective function, and that only agent i has the necessary knowledge to compute it. The goal in distributed multi-agent optimization is to solve this minimization problem in a distributed fashion, where the nodes cooperatively find the optimal solution without any central coordinator.
1.3 Outline
The outline of this master's thesis is as follows. In Chapter 2 we introduce the mathematical foundations on which this thesis is built. In Chapter 3 we give a brief summary of the distributed multi-agent optimization algorithm developed by Nedić and Ozdaglar [2]. In Chapter 4 we develop a novel distributed optimization algorithm based on the dual decomposition principle. We further analyze its convergence rate under time-varying delays and noisy communication, and we also study the communication requirements of the algorithm. In Chapter 5 we present numerical simulations for the considered algorithms, focusing in particular on their convergence rates. Finally, in Chapter 6 we summarize the results of this thesis, and give a list of possible future extensions to the work presented herein.
Chapter 2
Preliminaries

“To those who do not know mathematics it is difficult to get across a real feeling as to the beauty, the deepest beauty, of nature ... If you want to learn about nature, to appreciate nature, it is necessary to understand the language that she speaks in.”
Richard P. Feynman, 1967.
In this chapter, we present the mathematical foundations on which the following chapters are built. Many books and research papers have been devoted to these subjects, and it is unfortunately not possible to cover them here in all of their glory, but references are provided to some of the materials that have been very useful.
2.1 Mathematical Optimization
In mathematics, an optimization problem is the problem of finding an optimal solution among all feasible solutions. The standard form of an optimization problem can be written as [12, 13]

\[
\begin{aligned}
& \operatorname*{minimize}_{x \in \mathbb{R}^n} && f(x), \\
& \text{subject to} && c_i(x) \le b_i, \quad i = 1, \ldots, m.
\end{aligned} \tag{2.1}
\]

Thus, the standard problem is to choose the optimization variable x = (x_1, …, x_n) ∈ R^n in such a way that the scalar-valued objective function f : R^n → R attains its minimal value over the feasible domain. The feasible domain is determined by the constraint functions c_i : R^n → R, and can be defined as

\[ D = \{ x \mid c_i(x) \le b_i, \; i = 1, \ldots, m \}. \]

The points x ∈ D are said to be feasible points. The optimal value f* of problem (2.1) is the greatest lower bound of f over the feasible domain,

\[ f^* = \inf \{ f(x) \mid x \in D \}. \]
The optimal set X_opt of problem (2.1) is the set of feasible points that attain the optimal value,

\[ X_{\text{opt}} = \{ x \in D \mid f(x) = f^* \}. \]

If the optimal set is non-empty, then the optimization problem is said to be solvable, and any vector in the optimal set is called an optimal solution (or even global optimal solution), denoted by x*. Thus, for any other feasible point z ∈ D we must have f(x*) ≤ f(z).

The Euclidean distance between two points x, y ∈ R^n is defined as

\[ \|x - y\|_2 = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }. \]

Further, the distance between a point x ∈ R^n and a set X ⊆ R^n is defined as the minimum distance from x to any point in X,

\[ d(x, X) = \inf_{x' \in X} \|x - x'\|_2. \]

A vector x^0 ∈ D is called a local optimal solution if it minimizes f in a neighborhood around x^0, i.e., if there exists an r > 0 such that x^0 solves the optimization problem (2.2),

\[
\begin{aligned}
& \operatorname*{minimize}_{x \in \mathbb{R}^n} && f(x), \\
& \text{subject to} && c_i(x) \le b_i, \quad i = 1, \ldots, m, \\
&&& \|x - x^0\|_2 \le r.
\end{aligned} \tag{2.2}
\]

An optimization problem like (2.1) but without any constraint functions is called an unconstrained optimization problem, and the feasible domain is then the entire space R^n. Similarly, if we want to emphasize that an optimization problem is constrained, then we simply call it a constrained optimization problem.
2.2 Convex Optimization
A special and very important class of optimization problems (2.1) are those where both the objective function and the constraint functions are convex.

Definition 2.1. A function f : R^n → R is said to be convex if it satisfies the inequality

\[ f(\alpha x + (1-\alpha) y) \le \alpha f(x) + (1-\alpha) f(y) \tag{2.3} \]

for all x, y ∈ R^n and all α ∈ R, 0 ≤ α ≤ 1 (Fig. 2.1).

It is worth noticing that, in particular, all linear functions are convex, since they satisfy (2.3) with equality. Also, a quadratic function f(x) = x^T Q x + c^T x + b is convex if and only if Q ⪰ 0, i.e., Q is positive semidefinite.
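For instance, the convexity of a quadratic function can be checked numerically through the eigenvalues of the symmetric part of Q. A minimal added sketch (the example matrices are arbitrary illustrations):

import numpy as np

def is_convex_quadratic(Q, tol=1e-10):
    """f(x) = x^T Q x + c^T x + b is convex iff the symmetric part of Q
    is positive semidefinite (all eigenvalues nonnegative)."""
    Qs = (Q + Q.T) / 2.0  # x^T Q x depends only on the symmetric part of Q
    return bool(np.all(np.linalg.eigvalsh(Qs) >= -tol))

print(is_convex_quadratic(np.array([[2.0, 0.5], [0.5, 1.0]])))   # True
print(is_convex_quadratic(np.array([[1.0, 0.0], [0.0, -1.0]])))  # False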
[Figure 2.1: A convex function f. The chord between (x, f(x)) and (y, f(y)) lies above the function graph between those two points.]
Definition 2.2. A set C ⊆ R^n is said to be convex if the line segment between any two points in C also lies in C [14]. Thus, for any points x_1, x_2 ∈ C, and for 0 ≤ α ≤ 1,

\[ \alpha x_1 + (1-\alpha) x_2 \in C. \]

Notice that if the constraint function c_i(x) is a convex function, then the set of points satisfying the condition {x | c_i(x) ≤ b_i} is a convex set. Further, the intersection of convex sets is also convex; hence if all constraint functions c_i are convex, then so is the feasible domain D.

The main reason that convex optimization has received much attention in the literature is the property given in the following theorem.

Theorem 2.1. For a convex optimization problem, any local optimal solution is also a global optimal solution.

Proof. Suppose that x^0 is a local optimal solution, i.e., there exists an r > 0 such that

\[ f(x^0) \le f(x) \tag{2.4} \]

for all x such that ‖x − x^0‖_2 ≤ r and c_i(x) ≤ b_i, i = 1, …, m. Assume that x̂ is a global optimal solution with f(x̂) < f(x^0). Consider the line segment between x^0 and x̂, with the point x given by x = αx̂ + (1−α)x^0, 0 < α < 1. By the convexity assumption (2.3) on the constraint function c_i, we have

\[ c_i(x) \le \alpha c_i(\hat x) + (1-\alpha) c_i(x^0) \le \alpha b_i + (1-\alpha) b_i = b_i. \]

Hence, x is also a feasible point. Further, by the convexity assumption (2.3) on the objective function, we have

\[ f(x) \le \alpha f(\hat x) + (1-\alpha) f(x^0) < f(x^0). \]

Let α = r / (2‖x̂ − x^0‖_2); notice that 0 < α < 1 since ‖x̂ − x^0‖_2 > r, and thus ‖x − x^0‖_2 < r. This contradicts (2.4), and hence the assumption that f(x̂) < f(x^0) must be false. ∎
The implication of this theorem is that, for convex optimization problems, it is enough to find a local optimal solution, which is in general a much easier problem than finding a global optimal solution. Many efficient algorithms exist with the purpose of finding a local optimal solution [12].
2.2.1 Concave Functions
A class of functions that are closely related to the convex functions are the concave functions.

Definition 2.3. A function f is said to be concave if −f is convex.

The optimization problem (2.1) can be rewritten as an equivalent maximization problem,

\[
\begin{aligned}
& \operatorname*{maximize}_{x \in \mathbb{R}^n} && -f(x), \\
& \text{subject to} && c_i(x) \le b_i, \quad i = 1, \ldots, m.
\end{aligned}
\]

Thus, we realize that a convex minimization problem is equivalent to a concave maximization problem, and we will therefore refer to both of these as convex optimization problems.
2.3 Subgradients
The concept of subgradients (also subderivative and subdifferential) is a generalization of the gradient to non-differentiable convex continuous functions [12, 15, 14].
Definition 2.4. A vector g ∈ R^n is a subgradient to f : X ⊆ R^n → R at the point x ∈ X if, for all other points z ∈ X, the following holds:

\[ f(z) \ge f(x) + g^T (z - x). \tag{2.5} \]

Further, the subdifferential of the function f at x ∈ X is the set of subgradients to f at x, and it is denoted by ∂f(x). Thus,

\[ \partial f(x) = \bigl\{ g : f(z) \ge f(x) + g^T (z - x) \;\; \forall z \in X \bigr\}. \tag{2.6} \]

Similarly, for a concave function f : X ⊆ R^n → R, we call g a subgradient at x ∈ X if

\[ f(z) \le f(x) + g^T (z - x) \tag{2.7} \]

holds for all z ∈ X.

Remark. If f is a convex function, and differentiable at x, then there is exactly one subgradient of f at x, and it coincides with the gradient (Fig. 2.2).
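As a concrete illustration, consider the absolute value f(x) = |x| on R. Its subdifferential is

\[
\partial f(x) = \begin{cases} \{-1\} & x < 0, \\ [-1, 1] & x = 0, \\ \{+1\} & x > 0, \end{cases}
\]

since at x = 0 every g ∈ [−1, 1] satisfies |z| ≥ g z for all z. At the kink, the subdifferential is thus an entire interval rather than a single gradient.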
[Figure 2.2: Subgradients to the convex function f. There is a unique (sub-)gradient at point A, and several subgradients at point B.]
2.4 Subgradient Methods
The subgradient method is a simple first-order algorithm for minimizing a non-differentiable convex function [16]. It is based on the well-known gradient descent method, originally proposed by Cauchy already in 1847 [17, 12], but extended to non-differentiable functions by replacing the gradient with a subgradient to the function. Since the subgradient method is a first-order method, it can have a slow convergence rate compared to more advanced methods, such as Newton's method or interior-point methods for constrained optimization. However, it does have some advantages; in particular, it can be used for non-differentiable functions and it has a lower memory requirement. Another reason that we are interested in the subgradient method is that it enables us to solve the optimization problem in a distributed fashion, as we shall see later.

Consider the unconstrained optimization problem of minimizing f(x), where f : R^n → R is a convex function. The subgradient method solves this optimization problem by the simple iterative algorithm

\[ x(t+1) = x(t) - \alpha_t g(t). \tag{2.8} \]

Here, x(t) denotes the estimate of the solution at step t, g(t) ∈ ∂f(x(t)) is a subgradient to f at x(t), and α_t is given by a step size rule, which we will discuss next.
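A minimal sketch of the iteration (2.8) in Python; the ℓ1 objective and the diminishing step size below are illustrative assumptions, not an example from the thesis. The best iterate is remembered, since the function value need not decrease monotonically:

import numpy as np

def subgradient_method(f, subgrad, x0, step_size, T):
    """Minimize a convex, possibly non-differentiable, f via (2.8):
    x(t+1) = x(t) - alpha_t * g(t), remembering the best iterate found."""
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    for t in range(T):
        g = subgrad(x)               # any g in the subdifferential at x
        x = x - step_size(t, g) * g  # the subgradient step
        if f(x) < f_best:            # f(x(t)) may increase; keep the best
            x_best, f_best = x.copy(), f(x)
    return x_best, f_best

# Example: f(x) = ||x||_1; np.sign(x) is a valid subgradient everywhere.
f = lambda x: np.abs(x).sum()
subgrad = lambda x: np.sign(x)
x_best, f_best = subgradient_method(
    f, subgrad, x0=[3.0, -2.0],
    step_size=lambda t, g: 1.0 / (1.0 + t),  # square summable, not summable
    T=500)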
2.4.1 Choosing Step Sizes
The convergence of the subgradient method in equation (2.8) depends on the choice of step sizes α_t, and there are several different schemes used. To analyze the convergence rates we need three assumptions:

Assumption 2.1. There exists at least one finite minimizer of f, denoted by x*. Let X_opt denote the optimal set, and let f* = f(x*) denote the optimal value of f.

Assumption 2.2. The norm of the subgradients is uniformly bounded by G,

\[ \|g(t)\|_2 \le G \quad \forall t. \]

Assumption 2.3. The distance from the initial point to the optimal set is bounded by R,

\[ d(x(0), X_{\text{opt}}) \le R. \]

For the normal gradient method the function value decreases at each iteration, but that is not necessarily the case for the subgradient method. Instead, it is the distance to the optimal set that decreases. Let x* be an optimal solution; then

\[ 0 \le \|x(t+1) - x^*\|_2^2 = \|x(t) - \alpha_t g(t) - x^*\|_2^2 = \|x(t) - x^*\|_2^2 - 2\alpha_t g(t)^T (x(t) - x^*) + \alpha_t^2 \|g(t)\|_2^2. \tag{2.9} \]

Since g(t) ∈ ∂f(x(t)) is a subgradient to the function f, the subgradient definition (2.5) implies that −g(t)^T(x(t) − x*) ≤ −(f(x(t)) − f*), and thus

\[ 0 \le \|x(t+1) - x^*\|_2^2 \le \|x(t) - x^*\|_2^2 - 2\alpha_t \bigl( f(x(t)) - f^* \bigr) + \alpha_t^2 \|g(t)\|_2^2. \]

This is a recursive expression in terms of the norm ‖x(t) − x*‖_2^2, and by expanding the relation until the initial step t = 0, we get

\[ 0 \le \|x(0) - x^*\|_2^2 - 2 \sum_{i=0}^{t} \alpha_i \bigl( f(x(i)) - f^* \bigr) + \sum_{i=0}^{t} \alpha_i^2 \|g(i)\|_2^2. \tag{2.10} \]
Since the function value f(x(i)) does not necessarily decrease at each iteration, we evaluate the method by remembering the best solution found until step t. Thus, f_best(t) is defined as

\[ f_{\text{best}}(t) = \min_{i=0,\ldots,t} f(x(i)), \]

and with this definition, we have

\[ \sum_{i=0}^{t} \alpha_i f(x(i)) \ge f_{\text{best}}(t) \sum_{i=0}^{t} \alpha_i. \]

Using this relation, and rearranging the terms in (2.10), yields

\[ f_{\text{best}}(t) - f^* \le \frac{\|x(0) - x^*\|_2^2 + \sum_{i=0}^{t} \alpha_i^2 \|g(i)\|_2^2}{2 \sum_{i=0}^{t} \alpha_i}. \]

Notice that the optimal point x* can be chosen arbitrarily in X_opt, and thus ‖x(0) − x*‖_2 can be replaced with the distance to the optimal set, d(x(0), X_opt). Substituting the bounds from Assumptions 2.2 and 2.3 yields the main convergence result for the subgradient method,

\[ f_{\text{best}}(t) \le f^* + \frac{R^2 + G^2 \sum_{i=0}^{t} \alpha_i^2}{2 \sum_{i=0}^{t} \alpha_i}. \tag{2.11} \]
We can now give bounds on the convergence rate for the following step size rules.

Constant step size. The simplest choice is to use a constant step size α_t = α, independent of t. In this case, the convergence result (2.11) becomes

\[ f_{\text{best}}(t-1) \le f^* + \frac{R^2 + G^2 \alpha^2 t}{2 \alpha t}. \tag{2.12} \]

Thus, the subgradient method will converge as R²/(2αt) to within G²α/2 of optimality. In other words, it will converge to a neighborhood around the optimal solution whose size is proportional to the step size, but with a speed that is inversely proportional to the step size.

The result can be further strengthened by instead considering the average objective function value

\[ f_{\text{avg}}(t-1) = \frac{1}{t} \sum_{i=0}^{t-1} f(x(i)). \]

Again, rearranging the terms in (2.10) shows that the average value satisfies the same bound as the best objective function value above,

\[ f_{\text{avg}}(t-1) \le f^* + \frac{R^2 + G^2 \alpha^2 t}{2 \alpha t}. \tag{2.13} \]

Constant step length. Instead of keeping the step size constant, letting α_t = γ/‖g(t)‖_2 ensures that the step length is constant. Here, γ > 0 is the length of the step in each iteration. The convergence result (2.11) becomes

\[ f_{\text{best}}(t-1) \le f^* + \frac{R^2 + \gamma^2 t}{2 \gamma t / G}. \tag{2.14} \]

Thus, the subgradient method will converge as GR²/(2γt) to within Gγ/2 of optimality.

Square summable, but not summable step sizes. Both of the previously discussed step size rules only guarantee convergence to a neighborhood around the optimal value, but now we analyze a step size rule that can guarantee convergence to the optimal value. Let the step sizes satisfy

\[ \alpha_t \ge 0, \qquad \sum_{t=0}^{\infty} \alpha_t^2 < \infty, \qquad \sum_{t=0}^{\infty} \alpha_t = \infty. \]

Consider the convergence result (2.11): the numerator is finite, but the denominator tends towards infinity. Thus, this choice of step sizes guarantees convergence for the subgradient method, i.e.,

\[ f_{\text{best}}(t) \to f^* \tag{2.15} \]

as t → ∞.
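For reference, the three rules translate into one-line schedules that plug into the subgradient-method sketch earlier in this section; the constants are arbitrary illustrative choices:

import numpy as np

def constant_size(t, g):        # alpha_t = alpha, cf. bound (2.12)
    return 0.1

def constant_length(t, g):      # alpha_t = gamma / ||g(t)||_2, cf. (2.14)
    return 0.1 / np.linalg.norm(g)

def square_summable(t, g):      # alpha_t = a / (b + t), guarantees (2.15)
    return 1.0 / (1.0 + t)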
2.4.2 Projected Subgradient Methods
Subgradient methods can also be extended to constrained optimization problems. The projected subgradient method solves the constrained optimization problem (2.1) with the iterations

\[ x(t+1) = P\bigl( x(t) - \alpha_t g(t) \bigr), \]

where P is the Euclidean projection onto the convex feasible set defined by the constraints c_i(x) ≤ b_i, i = 1, …, m. These iterations should be compared to the ordinary subgradient iterations (2.8). Notice that the optimal set of the optimization problem is a subset of the feasible set defined by the constraint functions. Thus, the Euclidean projection onto the convex feasible set can only decrease the distance to the optimal solution; hence we have

\[ \|x(t+1) - x^*\|_2^2 = \bigl\| P\bigl( x(t) - \alpha_t g(t) \bigr) - x^* \bigr\|_2^2 \le \|x(t) - \alpha_t g(t) - x^*\|_2^2. \]
By updating expression (2.9), it can be seen that the convergence result (2.11) for the ordinary subgradient method also holds for the projected subgradient method. Hence, the convergence bounds given in (2.12), (2.14) and (2.15) also hold for the projected subgradient method.
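A minimal sketch of one projected step, assuming for illustration that the feasible set is the Euclidean ball ‖x‖_2 ≤ r, whose projection has a simple closed form:

import numpy as np

def project_onto_ball(x, r):
    """Euclidean projection onto {x : ||x||_2 <= r}."""
    n = np.linalg.norm(x)
    return x if n <= r else (r / n) * x

def projected_subgradient_step(x, g, alpha, r):
    """One iteration x(t+1) = P(x(t) - alpha_t * g(t))."""
    return project_onto_ball(x - alpha * g, r)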
2.5 Decomposition Methods
In mathematics, decomposition is the general concept of solving a problem by breaking it up into smaller subproblems, which can be solved independently, and then assembling the solution to the original problem from the solutions of the subproblems [18]. Decomposition methods have received a lot of attention for two particular reasons:

• If the complexity of the algorithm grows faster than linearly in the number of subproblems, then decomposing the problem into several subproblems can result in a significant performance gain. For example, consider the problem of inverting the block diagonal matrix (2.16) with Gauss-Jordan elimination,

\[ A = \begin{bmatrix} A_1 & 0 & \cdots & 0 \\ 0 & A_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A_m \end{bmatrix}. \tag{2.16} \]

Assume that each block A_i is an n × n matrix, and thus A is an nm × nm matrix. Inverting A directly with the general Gauss-Jordan elimination would require O((nm)³) operations [19], but the inverse can also be decomposed as in (2.17),

\[ A^{-1} = \begin{bmatrix} A_1 & 0 & \cdots & 0 \\ 0 & A_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A_m \end{bmatrix}^{-1} = \begin{bmatrix} A_1^{-1} & 0 & \cdots & 0 \\ 0 & A_2^{-1} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A_m^{-1} \end{bmatrix}. \tag{2.17} \]

By this decomposition technique, the problem of inverting the matrix A can be solved by inverting the submatrices A_1, …, A_m of size n × n, which requires m·O(n³) operations in total. A small sketch of this appears after this list.

• If the subproblems can be solved independently of each other, then the decomposition technique enables us to solve the subproblems in parallel. Traditionally, computers have been designed for serial computation, where each task is solved one after the other. Parallelization has for a long time mainly been used in high-performance computing, where large supercomputers have been built up from thousands of smaller and cheaper processors than what otherwise would have been possible [20, 21]. However, interest in it has grown lately due to the shift from single-core to multi-processor computer architectures, and ever faster networks [22, 23]. Thus, the ability to decompose a problem into many smaller subproblems that can be solved in parallel has never been more important. Decomposition methods can also improve the reliability of a large system [24]. Another aspect of parallelization that we will be more interested in is the application to multi-agent systems, where the decomposition methods yield distributed optimization algorithms.

A problem is called trivially parallelizable if the subproblems can be solved completely independently of each other. Consider for example the following optimization problem,

\[ \operatorname*{minimize}_{x_1, \ldots, x_n} \; f(x_1, \ldots, x_n) = \operatorname*{minimize}_{x_1, \ldots, x_n} \; f^1(x_1) + \cdots + f^n(x_n). \]

This problem is trivially parallelizable, since the subproblems

\[ \operatorname*{minimize}_{x_i} \; f^i(x_i) \quad \forall i \]

can be solved independently of each other. Inverting the block diagonal matrix (2.16) is another trivially parallelizable example. More interesting situations occur when there is a coupling between the subproblems, and that is the situation which decomposition methods are trying to solve.
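The operation-count argument of (2.16)–(2.17) can be verified directly; a minimal NumPy sketch, where the random diagonally shifted blocks are placeholders for illustration:

import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(0)
blocks = [rng.standard_normal((3, 3)) + 3 * np.eye(3) for _ in range(4)]

A = block_diag(*blocks)                      # the nm x nm matrix of (2.16)
A_inv_direct = np.linalg.inv(A)              # costs O((nm)^3) operations
A_inv_blocks = block_diag(*[np.linalg.inv(B) for B in blocks])  # m * O(n^3)

assert np.allclose(A_inv_direct, A_inv_blocks)   # same inverse, as in (2.17)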
2.5.1 Primal Decomposition
The simplest decomposition method is called Primal Decomposition, because the optimization algorithm manipulates the primal variables [18]. Consider the following unconstrained optimization problem,

\[ \operatorname*{minimize}_{x_1, x_2, y} \; f(x_1, x_2, y) = \operatorname*{minimize}_{x_1, x_2, y} \; f^1(x_1, y) + f^2(x_2, y), \tag{2.18} \]

where there is a coupling between the two subproblems f^1 and f^2 in the variable y. The variable y is commonly called the complicating variable, since it complicates the decomposition.

The primal decomposition method works by fixing the complicating variables, in this case y, and then solving the now decoupled subproblems. Thus, define

\[ \phi^1(y) = \min_{x_1} f^1(x_1, y), \qquad \phi^2(y) = \min_{x_2} f^2(x_2, y), \]

and

\[ \phi(y) = \phi^1(y) + \phi^2(y). \tag{2.19} \]

The optimization problem (2.18) can then be expressed, with (2.19), as

\[ \operatorname*{minimize}_{x_1, x_2, y} \; f(x_1, x_2, y) = \operatorname*{minimize}_{y} \; \phi(y). \]

This is called the master problem in primal decomposition. Notice that if the original problem is convex, then so is the master problem. The master problem can then be solved by any local optimization algorithm, for example the subgradient method. If g^1(y) is a subgradient to φ^1(y) and g^2(y) is a subgradient to φ^2(y), then g^1(y) + g^2(y) is a subgradient to φ(y). The subgradient method solves the master problem by the iteration

\[ y(t+1) = y(t) - \alpha_t \bigl( g^1(y(t)) + g^2(y(t)) \bigr). \tag{2.20} \]

Thus, the primal decomposition method together with the subgradient method can be used directly, as Algorithm 2.1, to solve the optimization problem (2.18).

Algorithm 2.1: Primal Decomposition Algorithm
Input: Initial estimate y(0)
Output: Estimate y(T) at time T
1 for t = 0 to T−1 do
    // Solve the subproblems, can be done in parallel
2   Solve φ^1(y(t)), and return g^1(y(t))
3   Solve φ^2(y(t)), and return g^2(y(t))
    // Update the complicating variable
4   y(t+1) = y(t) − α_t (g^1(y(t)) + g^2(y(t)))
5 end
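A minimal sketch of Algorithm 2.1 for the toy problem f^1(x_1, y) = x_1² + (x_1 − y)² and f^2(x_2, y) = (x_2 − 2)² + (y − x_2)², an illustrative choice that is not the thesis's example. Each subproblem is solved in closed form, and the derivative of φ^i at the minimizer serves as the subgradient:

def solve_phi1(y):
    """phi1(y) = min_x1 [x1^2 + (x1 - y)^2]; minimizer x1 = y/2.
    Returns the subgradient g1(y) = 2*(y - x1) = y."""
    x1 = y / 2.0
    return 2.0 * (y - x1)

def solve_phi2(y):
    """phi2(y) = min_x2 [(x2 - 2)^2 + (y - x2)^2]; minimizer x2 = (2+y)/2.
    Returns the subgradient g2(y) = 2*(y - x2) = y - 2."""
    x2 = (2.0 + y) / 2.0
    return 2.0 * (y - x2)

y, alpha = 0.0, 0.25
for t in range(100):
    g1 = solve_phi1(y)          # the subproblems could run in parallel
    g2 = solve_phi2(y)
    y = y - alpha * (g1 + g2)   # master update (2.20)

print(y)  # approaches y* = 1, the minimizer of phi(y) = y^2/2 + (y-2)^2/2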
2.5.2 Dual Decomposition
The second method that we will consider is the dual decomposition method, named after its use of Lagrange dual variables (or Lagrange multipliers) [18]. Consider again the unconstrained optimization problem (2.18), but this time, convert it into an equivalent constrained optimization problem by introducing two copies y_1, y_2 of the complicating variable y. The problem can be expressed as

\[
\begin{aligned}
& \operatorname*{minimize}_{x_1, x_2, y_1, y_2} && f(x_1, x_2, y_1, y_2) = f^1(x_1, y_1) + f^2(x_2, y_2), \\
& \text{subject to} && y_1 = y_2.
\end{aligned} \tag{2.21}
\]

Notice how this simple reformulation of the optimization problem makes the objective function trivially parallelizable, but at the cost of adding the consistency constraint that requires the two copies to be equal. We will proceed by applying the method of Lagrange multipliers [12, 25, 17]. Introduce the dual variable (or Lagrange multiplier) λ, and define the Lagrange function as

\[ L(x_1, x_2, y_1, y_2, \lambda) = f^1(x_1, y_1) + f^2(x_2, y_2) + \lambda^T (y_1 - y_2). \tag{2.22} \]

Further, the Lagrange dual function q is defined as the minimum value of the Lagrange function over all primal variables x_1, x_2, y_1, y_2, i.e.,

\[ q(\lambda) = \inf_{x_1, x_2, y_1, y_2} L(x_1, x_2, y_1, y_2, \lambda). \tag{2.23} \]

Notice that the Lagrange dual function is concave, even if the problem (2.21) is not convex, since it is the point-wise infimum of a family of functions affine in λ. An important property of the Lagrange dual function is that it provides a lower bound on the optimal value, as is shown in Theorem 2.2.

Theorem 2.2 (Boyd [12]). Let f* be the optimal value of problem (2.21), and further let q be the Lagrange dual function defined in (2.23). Then, for any λ, we have the lower bound q(λ) ≤ f*.

Proof. This can be realized by considering any feasible point (x̂_1, x̂_2, ŷ, ŷ) of (2.21),

\[ L(\hat x_1, \hat x_2, \hat y, \hat y, \lambda) = f^1(\hat x_1, \hat y) + f^2(\hat x_2, \hat y) + \lambda^T (\hat y - \hat y) = f(\hat x_1, \hat x_2, \hat y, \hat y). \]

Hence

\[ q(\lambda) = \inf_{x_1, x_2, y_1, y_2} L(x_1, x_2, y_1, y_2, \lambda) \le L(\hat x_1, \hat x_2, \hat y, \hat y, \lambda) = f(\hat x_1, \hat x_2, \hat y, \hat y). \]

Since (x̂_1, x̂_2, ŷ, ŷ) was an arbitrary feasible point, this holds in particular for the optimal solution of (2.21), thus giving us q(λ) ≤ f*. ∎

Since the Lagrange dual function q(λ) gives us a lower bound on the optimal value f*, a natural problem is to maximize the lower bound. This leads us to the Lagrange dual problem,

\[ \operatorname*{maximize}_{\lambda} \; q(\lambda). \tag{2.24} \]
Let q* denote the optimal value of the convex optimization problem (2.24). Theorem 2.2 implies that q* ≤ f*, which is also referred to as weak duality. However, if q* = f*, then strong duality is said to hold. Strong duality does not hold in general, but, for example, Slater's condition [12, 25] guarantees that strong duality holds for the convex optimization problem (2.21).

Theorem 2.3 (Slater's condition). Consider the primal optimization problem

\[
\begin{aligned}
& \operatorname*{minimize}_{x \in \mathbb{R}^n} && f(x), \\
& \text{subject to} && c_i(x) \le b_i, \quad i = 1, \ldots, m, \\
&&& h_j(x) = 0, \quad j = 1, \ldots, k,
\end{aligned}
\]

where all functions are convex. The feasible domain of this problem is

\[ D = \{ x \mid c_i(x) \le b_i, \; h_j(x) = 0, \;\; i = 1, \ldots, m, \; j = 1, \ldots, k \}. \]

Slater's condition states that strong duality holds if there exists an interior point x ∈ D with c_i(x) < b_i, i = 1, …, m.

Let us now continue with the dual decomposition, which, in contrast to the primal decomposition, works by fixing the dual variables λ. Define

\[ \phi^1(\lambda) = \inf_{x_1, y_1} \bigl( f^1(x_1, y_1) + \lambda^T y_1 \bigr), \qquad \phi^2(\lambda) = \inf_{x_2, y_2} \bigl( f^2(x_2, y_2) - \lambda^T y_2 \bigr). \]

The Lagrange dual function can then be written as q(λ) = φ^1(λ) + φ^2(λ), and the corresponding maximization problem is the master problem in dual decomposition. Once again we can solve the master problem with the subgradient method. Notice that the subgradient to −φ^1 is −y_1 and the subgradient to −φ^2 is y_2; hence the subgradient to −q is y_2 − y_1, where y_1 and y_2 are obtained as the solutions to φ^1 and φ^2 respectively. The corresponding subgradient update rule is

\[ \lambda(t+1) = \lambda(t) - \alpha_t \bigl( y_2(t) - y_1(t) \bigr). \]

Thus, combining the dual decomposition with the subgradient method yields Algorithm 2.2, which can be used to solve optimization problem (2.21).

Algorithm 2.2: Dual Decomposition Algorithm
Input: Initial estimate λ(0)
Output: Estimate λ(T) at time T
1 for t = 0 to T−1 do
    // Solve the subproblems, can be done in parallel
2   Solve φ^1(λ(t)), and return y_1(t)
3   Solve φ^2(λ(t)), and return y_2(t)
    // Update the dual variable
4   λ(t+1) = λ(t) − α_t (y_2(t) − y_1(t))
5 end
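A minimal sketch of Algorithm 2.2, reusing the same illustrative quadratics as in the primal decomposition sketch, now with two local copies y_1, y_2 of the complicating variable:

def solve_phi1(lmbda):
    """phi1(lambda) = inf_{x1,y1} [x1^2 + (x1-y1)^2 + lambda*y1].
    Setting the gradients to zero gives x1 = -lambda/2 and y1 = -lambda."""
    return -lmbda             # the minimizing y1

def solve_phi2(lmbda):
    """phi2(lambda) = inf_{x2,y2} [(x2-2)^2 + (y2-x2)^2 - lambda*y2].
    Setting the gradients to zero gives x2 = 2 + lambda/2 and y2 = 2 + lambda."""
    return 2.0 + lmbda        # the minimizing y2

lmbda, alpha = 0.0, 0.2
for t in range(100):
    y1 = solve_phi1(lmbda)    # the subproblems could run in parallel
    y2 = solve_phi2(lmbda)
    lmbda = lmbda - alpha * (y2 - y1)   # dual update

print(lmbda, y1, y2)  # lambda* = -1, and the copies agree: y1 = y2 = 1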
2.6 Graph Theory
In this section, we introduce some notation and basic concepts from graph theory that will be useful for describing distributed multi-agent systems. A more rigorous description of this fascinating topic can be found in [26]. A graph G(V, E) consists of a set of nodes (or vertices), denoted by V, and a set of edges E ⊆ V × V. We usually denote the number of nodes in a graph G by N_G = |V|, or simply N if the graph can be understood from the context. Further, we usually label the nodes with numbers from 1 to N, thus V = {1, …, N}. We can further divide the graphs into two families. First, the undirected graphs are the family of graphs where the edges are unordered pairs of vertices; hence (i, j) and (j, i) are considered to be the same edge, and (i, j) ∈ E if and only if (j, i) ∈ E. All other graphs are said to be directed graphs, and the edge (i, j) is said to be directed from node i to node j. Two nodes that have an edge between them are said to be adjacent. The neighbors of a node i are those nodes which have an incoming edge from node i. We denote the neighbors of node i by N_i, thus

N_i = {j | (i, j) ∈ E}.

A path in G is a list of distinct vertices {v_0, v_1, …, v_n} such that (v_i, v_{i+1}) ∈ E, i = 0, 1, …, n−1. The number of edges in the path is the length of the path. The path is said to go from v_0 to v_n. A graph is strongly connected if there exists a path from every vertex to every other vertex in the graph. The distance dist(v_0, v_n) between two nodes v_0 and v_n is the length of the shortest path between them, and ∞ if there is no such path. Further, the diameter of a graph is the greatest distance between any two nodes in the graph. A cycle is a path {v_0, v_1, …, v_n} with the additional requirement that the edge (v_n, v_0) ∈ E exists. A graph without cycles is said to be acyclic. A graph where the lengths of all cycles have a common factor k > 1 is said to be k-periodic, or aperiodic if no such k exists.
An edge (i, i) is called a loop, and a graph without any loops is said to be simple. Notice that a strongly connected graph with at least one loop is aperiodic. A tree is an undirected, connected graph with no cycles. In a tree, any two vertices are connected by a unique simple path, and a tree with N nodes has exactly N − 1 edges. The usual way of picturing an undirected graph is by drawing a circle for each node, and joining two of the circles by a line if the corresponding nodes form an edge (Fig. 2.3).
[Figure 2.3: An undirected graph with 7 nodes and 7 edges.]

Similarly, a directed graph is pictured by drawing a circle for each node, and joining two circles with an arrow pointing from i to j if (i, j) forms an edge of the graph (Fig. 2.4).

Remark. When we are considering a multi-agent system we will usually think about the nodes as agents, and an edge between two nodes represents a communication link between the two agents. We will therefore also refer to the nodes as agents.
[Figure 2.4: A directed graph with 7 nodes and 9 edges.]
2.7 Distributed Optimization
As we have mentioned before, the topic of this thesis is distributed optimization, which means that we have a set V of N agents that want to cooperatively solve an optimization problem. There are several different possible ways to express the optimization problem, but we will focus on the case where the objective function f is already expressed as a sum of individual objective functions for the agents. Thus, we associate the local objective function f^i with agent i, for each agent i ∈ V. The optimization problem that the agents are trying to solve is

\[ \operatorname*{minimize}_{x \in \mathbb{R}^n} \; \sum_{i=1}^{N} f^i(x); \tag{2.25} \]
thus, the local objective functions are coupled through the optimization variables. The agents can, for example, be computers in a computer network, or vehicles trying to stay in a certain formation. It is also assumed that only agent i has full knowledge about the function f^i; thus the agents need to communicate with each other in order to solve the problem. The problem is further restricted by a set of communication edges E, where an agent i can only communicate with agent j if there is an edge (i, j) ∈ E. In order for the problem to be solvable, it is assumed that the graph G(V, E) is strongly connected. We further make the distinction between two classes of distributed optimization algorithms, centralized and decentralized, where we will be more interested in the latter.
2.7.1 Centralized Optimization
In centralized optimization there exists a special coordinator agent that is responsible for coordinating all other agents (Fig. 2.5). In view of the decomposition methods, the coordinator solves the master problem, while the other agents solve the subproblems. Consider the Primal Decomposition Algorithm 2.1. There, agent 1 would solve the subproblem φ^1(y(t)) and send the subgradient g^1(y(t)) to the coordinator. Similarly, agent 2 would solve the subproblem φ^2(y(t)) and send the subgradient g^2(y(t)) to the coordinator. The coordinator would then be able to update the complicating variable y, and return the updated variable y(t+1) to both agents. Similarly, for the Dual Decomposition Algorithm 2.2, agents 1 and 2 would solve the subproblems φ^1(λ(t)) and φ^2(λ(t)) respectively. They then transmit the local copies of the complicating variables, y_1(t) and y_2(t), to the coordinator, who is then able to update the dual variable λ(t+1).
[Figure 2.5: Communication graph for a centralized optimization problem, where there is a special coordinator agent.]
2.7.2 Decentralized Optimization
In decentralized optimization, there is no central coordinator, which means that all agents should be considered equivalent (Fig. 2.6). Compared to centralized optimization, this means that even the master problem has to be solved distributedly. Thus, decentralized optimization is in general a more difficult problem than the centralized optimization problem, and we will therefore focus on the decentralized optimization problem. Considering both the centralized and decentralized optimization problems for the primal and dual decomposition methods respectively, we have, in total, four different approaches. The centralized primal decomposition method, described in the previous section, is an implementation of the ordinary subgradient method, and we will be content with the description of this method given so far. In Chapter 3, an earlier known decentralized primal decomposition method is presented. In Chapter 4 we first, briefly, discuss the centralized Dual Decomposition Algorithm, and then
move on to the main contribution of this thesis: a decentralized Dual Decomposition Algorithm.
[Figure 2.6: Communication graph for a decentralized optimization problem, where all agents are equal.]
2.8 Average Consensus Problem
The average consensus problem is the distributed computational problem of finding the average of a set of initial values. The problem of distributed averaging comes up in many applications such as decentralized optimization, coordination of multi-agent systems, estimation and distributed data fusion [27, 5, 28, 6, 9].

Consider a connected network of N nodes, each node holding an initial value x^i(0) ∈ R at time t = 0. The goal is to compute the average of these values, (1/N) ∑_{i=1}^{N} x^i(0), with only local communication. Thus, a node i is only allowed to communicate with its neighbors N_i. Further, the goal is also that the nodes should reach consensus on the average value, i.e., every node's value should converge to the average of the initial values,

\[ \lim_{t \to \infty} x^i(t) = \frac{1}{N} \sum_{j=1}^{N} x^j(0) \quad \forall i \in V. \tag{2.26} \]
A standard algorithm to solve the average consensus problem is with the following distributed linear iterations,

\[ x^i(t+1) = \sum_{j=1}^{N} W_{i,j}\, x^j(t), \quad i = 1, \ldots, N, \tag{2.27} \]

where t = 0, 1, … are the time steps and W ∈ R^{N×N} is the weight matrix. To enforce that each node only uses local information we have the requirement that W_{i,j} = 0 if j ∉ N_i and j ≠ i.
Now, let x(t) be defined as the column vector with elements x^i(t),

\[ x(t) = \bigl[ x^1(t) \;\; x^2(t) \;\; \cdots \;\; x^N(t) \bigr]^T, \]

then the update (2.27) can be simplified for the entire network as x(t+1) = W x(t). Expanding this equation recursively implies that x(t) = W^t x(0); hence a necessary and sufficient condition for the convergence of this algorithm is that

\[ \lim_{t \to \infty} W^t = \frac{\mathbf{1}\mathbf{1}^T}{N}, \tag{2.28} \]

where 1 denotes the vector consisting of only ones. Boyd [27] showed the following theorem.

Theorem 2.4. Equation (2.28) holds if and only if the following three conditions hold:
(i) 1^T W = 1^T,
(ii) W 1 = 1,
(iii) ρ(W − 11^T/N) < 1.

Before we continue, we will need another definition.

Definition 2.5.
• A vector v ∈ R^n is said to be a stochastic vector if the components v_i are nonnegative and ∑_{i=1}^{n} v_i = 1.
• A matrix is said to be a stochastic matrix if all of its rows are stochastic vectors.
• A matrix is said to be a doubly stochastic matrix if all of its rows and columns are stochastic vectors.

Now, if the elements of the weight matrix W are nonnegative, then the first two conditions of Theorem 2.4 are equivalent to W being a doubly stochastic matrix. Further, if the weights satisfy {i, j} ∈ E ⇒ W_{i,j} > 0, then the final condition is equivalent to the graph being strongly connected and aperiodic.
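A minimal sketch of the iteration (2.27), using so-called Metropolis weights, a standard construction (see e.g. [27]) that yields a symmetric doubly stochastic W satisfying the conditions of Theorem 2.4 on a connected, aperiodic graph; the ring topology and initial values are illustrative choices:

import numpy as np

def metropolis_weights(edges, N):
    """Build a doubly stochastic weight matrix from an undirected edge list."""
    deg = np.zeros(N, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    W = np.zeros((N, N))
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    np.fill_diagonal(W, 1.0 - W.sum(axis=1))   # rows and columns sum to 1
    return W

edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]   # a ring with N = 5
W = metropolis_weights(edges, N=5)

x = np.array([1.0, 4.0, 2.0, 8.0, 5.0])   # initial values x^i(0)
for t in range(200):
    x = W @ x                              # x(t+1) = W x(t)

print(x)   # every entry approaches the average, 4.0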
Chapter 3
Primal Consensus Algorithm

“The difficulty lies, not in the new ideas, but in escaping the old ones, which ramify, for those brought up as most of us have been, into every corner of our minds.”
John M. Keynes, 1935.
In this chapter we will review a distributed optimization method developed by A. Nedić and A. Ozdaglar [2, 3, 4]. This is the main method against which we will compare the optimization algorithm that we develop in the next chapter. The distributed method is designed to let a multi-agent system find the optimal solution to the optimization problem of minimizing a sum of convex, not necessarily differentiable, objective functions. Every agent will try to minimize its own objective function, while only exchanging information with its closest neighbors over a time-varying topology. Compared to the decomposition methods discussed in Sections 2.5.1 and 2.5.2, this method manipulates the primal variables, and the method can be viewed as a combination of the consensus algorithm, described in Section 2.8, and the subgradient method from Section 2.4. We will therefore denote this algorithm the Primal Consensus Algorithm.
3.1 Problem Definition
Consider a network of N agents trying to minimize a common additive cost function. The agents want to cooperatively solve the unconstrained optimization problem

\[ \operatorname*{minimize}_{x \in \mathbb{R}^n} \; \sum_{i=1}^{N} f^i(x), \tag{3.1} \]
where f^i : R^n → R is a convex function known only by agent i. Every agent will maintain its own estimate x^i ∈ R^n of the optimal solution x ∈ R^n, and only update it from the local subgradient g^i of f^i and from information received from its neighbors.
We assume that all agents update their estimates simultaneously at the discrete times t = 0, 1, …. Let x^i(t) ∈ R^n denote agent i's estimate of the optimal solution at time t. The update rule that we consider is a combination of the average consensus update (2.27) and the subgradient method update (2.8). At each time step, every agent computes a weighted average of its neighbors' current estimates, and then takes a step in the direction of the negative of its local subgradient,

\[ x^i(t+1) = \sum_{j=1}^{N} W_{j,i}(t)\, x^j(t) - \alpha_t^i g^i(t). \tag{3.2} \]

Algorithm 3.1: Primal Consensus Algorithm
Input: Initial estimates x^i(0)
Output: Estimate x^i(T) at time T
// The following algorithm is executed on each agent i.
1 for t = 0 to T−1 do
2   Compute a subgradient g^i(t) to f^i at the point x^i(t).
    // Update the local estimate
3   x^i(t+1) = ∑_{j=1}^{N} W_{j,i}(t) x^j(t) − α_t^i g^i(t)
4   Transmit the new primal variable x^i(t+1) to all neighbors.
5   Receive the primal variables x^j(t+1) from the neighbors.
6 end

To follow the notation used by A. Nedić and A. Ozdaglar [3, 2], the matrix W should be compared to the transpose of the consensus weight matrix. Further, α_t^i > 0 is the step size used by agent i at time t, and g^i(t) is the local subgradient to f^i at agent i's estimate x^i(t). Notice that the weights W_{j,i} are time dependent, and that this can correspond to a time-varying communication topology. If the weight W_{j,i}(t) is nonzero, then that corresponds to agent i using agent j's estimate at time t; hence the directed communication edge (j, i) is used during that time. This leads us to the definition of a time-dependent directed graph (V, E_t), with the edge set given by

E_t = {(j, i) | W_{j,i}(t) ≠ 0},

representing the communication at time t.
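A minimal simulation sketch of Algorithm 3.1, assuming for simplicity a fixed (not time-varying) doubly stochastic weight matrix on a 5-node ring, and the illustrative scalar objectives f^i(x) = |x − a_i|, whose sum is minimized at the median of the a_i:

import numpy as np

N, T, alpha = 5, 3000, 0.01
a = np.array([1.0, 4.0, 2.0, 8.0, 5.0])        # f^i(x) = |x - a_i|

# Fixed doubly stochastic ring weights: 1/3 to self and to each neighbor
# (a simple choice consistent with the assumptions of this chapter).
W = np.zeros((N, N))
for i in range(N):
    W[i, i] = W[i, (i + 1) % N] = W[i, (i - 1) % N] = 1.0 / 3.0

x = np.zeros(N)                                 # agents' estimates x^i(t)
for t in range(T):
    g = np.sign(x - a)                          # local subgradients g^i(t)
    x = W.T @ x - alpha * g                     # update (3.2), all agents at once

print(x)   # every estimate ends up near the median of a, i.e. x* = 4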
3.2 Computation Model
It is necessary to impose some rules on the information exchange model in order to guarantee the convergence of the estimates. This section is used to introduce and explain the assumptions that are necessary for the convergence analysis in the next section. 24
The first assumption states that each agent gives a significant weight to its own estimate as well as to all other estimates that are available to it. Also, the weight is zero for all estimates that are unavailable.

Assumption 3.1 (Weights Rule). The following is true for all t:
(a) There exists a real number η with 0 < η < 1 such that
  (i) W_{i,i}(t) ≥ η for all i and t;
  (ii) W_{j,i}(t) ≥ η for all i, j, t when (j, i) ∈ E_t;
  (iii) W_{j,i}(t) = 0 otherwise.
(b) The transpose of the weight matrix, W(t)^T, is stochastic.

It is necessary for every agent to be able to influence every other agent; hence we assume that every pair of agents is able to influence each other an infinite number of times, possibly through a path of other agents.

Assumption 3.2 (Connectivity). The graph (V, E_∞) is strongly connected, where the edge set is defined as

E_∞ = {(j, i) | (j, i) ∈ E_t for infinitely many t}.

To further strengthen the Connectivity Assumption 3.2, we impose an upper bound on the intercommunication interval for all edges in E_∞.

Assumption 3.3 (Bounded Intercommunication Interval). There exists an integer B ≥ 1 such that

\[ (j, i) \in E_\infty \;\Rightarrow\; (j, i) \in \bigcup_{k=t}^{t+B-1} E_k \quad \text{for all } t. \]

Similar to the average consensus problem, we require the weight matrix to be doubly stochastic at every time step to ensure that all agents' estimates converge to the same value.

Assumption 3.4 (Doubly Stochastic Weights). The weight matrix W(t) is doubly stochastic for all t.

In order to satisfy the Doubly Stochastic Weights Assumption, the agents need to coordinate their choices of weights at each update. This can be a problem when the network topology is changing rapidly, and thus also the weights are changing. The following procedure, together with Assumptions 3.5 and 3.6, will in fact guarantee that the weight matrix is doubly stochastic.
Assumption 3.5 (Simultaneous Information Exchange). The agents exchange information simultaneously,

\[ (j, i) \in E_t \;\Rightarrow\; (i, j) \in E_t. \]

Let us introduce new weights P_{j,i}, called planned weights, that are used to determine the actual weights W_{j,i}. Let each agent i choose the planned weights P_{j,i}(t) that it would like to use if it receives an estimate from agent j during time step t. If, at time t, agent i is able to communicate with agent j, then it will send both its own estimate x^i(t) as well as the planned weight P_{j,i}(t) to agent j. By the Simultaneous Information Exchange Assumption, if they are able to communicate then agent i will also receive the estimate x^j(t) and the planned weight P_{i,j}(t) from agent j. Let the actual weights used by the agents be defined as

\[
W_{j,i}(t) = \begin{cases} 1 - \sum_{k \ne i} W_{k,i}(t) & \text{if } i = j; \\ \min\bigl( P_{i,j}(t), P_{j,i}(t) \bigr) & \text{if } i \ne j \text{ and } (j, i) \in E_t; \\ 0 & \text{otherwise.} \end{cases} \tag{3.3}
\]
Assumption 3.6 (Symmetric Weights). The following is true for all t:
(a) There exists a real number η with 0 < η < 1 such that
  (i) P_{j,i}(t) ≥ η for all i, j, t;
  (ii) ∑_{j=1}^{N} P_{j,i}(t) = 1.
(b) The actual weights are chosen according to equation (3.3).

We will now prove that this procedure guarantees that the weight matrix is doubly stochastic.

Proposition 3.1. Assume that the Simultaneous Information Exchange Assumption 3.5 and the Symmetric Weights Assumption 3.6 hold. Then so do the Weights Rule Assumption 3.1 and the Doubly Stochastic Weights Assumption 3.4.

Proof. First, notice that ∑_{k≠i} W_{k,i}(t) ≤ ∑_{k≠i} P_{k,i}(t) ≤ 1 − η; thus W_{i,i}(t) = 1 − ∑_{k≠i} W_{k,i}(t) ≥ 1 − (1 − η) = η. Second, min(P_{i,j}(t), P_{j,i}(t)) ≥ η; thus W_{j,i}(t) = min(P_{i,j}(t), P_{j,i}(t)) ≥ η if (j, i) ∈ E_t, and W_{j,i}(t) = 0 otherwise. Next, consider the column sum ∑_k W_{k,i}(t) = W_{i,i}(t) + ∑_{k≠i} W_{k,i}(t) = 1 − ∑_{k≠i} W_{k,i}(t) + ∑_{k≠i} W_{k,i}(t) = 1; hence W(t)^T is stochastic and the Weights Rule Assumption is satisfied. Finally, the Simultaneous Information Exchange Assumption together with equation (3.3) tells us that W_{j,i}(t) = W_{i,j}(t) = min(P_{i,j}(t), P_{j,i}(t)) if (j, i), (i, j) ∈ E_t, and W_{j,i}(t) = W_{i,j}(t) = 0 otherwise. Thus, W(t) is symmetric, and W(t)^T being stochastic implies that W(t) is doubly stochastic. ∎
In particular, we will consider the subgradient method with a constant step size.

Assumption 3.7 (Constant Step Size). The step size is constant and common to all agents, α_t^i = α.

In the analysis, we will consider a related model where the agents cease to compute the subgradients g^i(t) at some time t̂, but continue exchanging their estimates according to the consensus part of the algorithm.

Assumption 3.8 (Stopped Model). There exists a time t̂ such that

\[ g^i(t) = \begin{cases} 0 & \text{if } t \ge \hat t; \\ \text{a subgradient to } f^i \text{ at } x^i(t) & \text{otherwise}, \end{cases} \]

for all i.

Finally, we need some assumptions for the subgradient part, similar to Assumptions 2.1, 2.2 and 2.3 used in Section 2.4 about the subgradient method.

Assumption 3.9 (Subgradient Assumption). The following is true:
(i) The optimal set X_opt is nonempty.
(ii) The subgradients are uniformly bounded by G,
\[ \|g^i\|_2 \le G \]
for all subgradients g^i ∈ ∂f^i(x) and all i and x.
(iii) The distance from the initial points to the optimal set is bounded by R,
\[ d\bigl( x^i(0), X_{\text{opt}} \bigr) \le R \quad \text{for all } i. \]
(iv) The initial values are bounded by αG,
\[ \|x^i(0)\|_2 \le \alpha G \]
for all i, where α is the step size, and G is the bound on the subgradients.
3.3 Convergence Analysis
By recursively expanding the update rule (3.2), it is possible to express the estimate x^i(t+1) in terms of x^1(s), …, x^N(s), for any s ≤ t, as

\[
\begin{aligned}
x^i(t+1) = {} & \sum_{j=1}^{N} [W(s)W(s+1)\cdots W(t-1)W(t)]_{j,i}\, x^j(s) \\
& - \sum_{j=1}^{N} [W(s+1)\cdots W(t-1)W(t)]_{j,i}\, \alpha_s^j\, g^j(s) \\
& - \cdots \\
& - \sum_{j=1}^{N} [W(t-1)W(t)]_{j,i}\, \alpha_{t-2}^j\, g^j(t-2) \\
& - \sum_{j=1}^{N} [W(t)]_{j,i}\, \alpha_{t-1}^j\, g^j(t-1) - \alpha_t^i\, g^i(t).
\end{aligned} \tag{3.4}
\]

Since the products W(s)W(s+1)⋯W(k) appear several times in the expression, let us define the matrix Φ(t, s), for any t ≥ s, as

\[ \Phi(t, s) = W(s)W(s+1)\cdots W(t-1)W(t). \]

With this definition, the expression for the estimate x^i(t+1) in equation (3.4) can be simplified as

\[
\begin{aligned}
x^i(t+1) = {} & \sum_{j=1}^{N} [\Phi(t, s)]_{j,i}\, x^j(s) - \sum_{j=1}^{N} [\Phi(t, s+1)]_{j,i}\, \alpha_s^j\, g^j(s) \\
& - \cdots \\
& - \sum_{j=1}^{N} [\Phi(t, t-1)]_{j,i}\, \alpha_{t-2}^j\, g^j(t-2) \\
& - \sum_{j=1}^{N} [\Phi(t, t)]_{j,i}\, \alpha_{t-1}^j\, g^j(t-1) - \alpha_t^i\, g^i(t).
\end{aligned} \tag{3.5}
\]
We are now going to study the properties of the Φ matrices, but first recall that the longest possible path in a graph with N nodes consists of N − 1 edges. Hence the Connectivity Assumption 3.2 implies that there is a path in E∞ between any two nodes of length at most N − 1. Further, the Bounded Intercommunication Interval Assumption 3.3 gives us an upper bound B on the time it takes the estimate xi to ¯ = (N − 1)B influence state xj if the edge (i, j) belongs to E∞ . Thus, let us define B as the upper bound on the time it takes any node’s estimate to influence any other node’s estimate. The asymptotic behavior of the matrices Φ(t, s) is characterised in the following lemma, as t → ∞ and for any fixed s. 28
3.3. CONVERGENCE ANALYSIS
Lemma 3.2 (Lemma 4, [2]). Under the Weights Rule 3.1, Connectivity 3.2 and Bounded Intercommunication Interval 3.3 assumptions the following is true, ¯ (a) The limit Φ(s) = limt→∞ Φ(t, s) exists for each s. ¯ (b) The limit matrix Φ(s) has identical columns and the columns are stochastic, ¯ Φ(s) = φ(s)1T , where φ(s) is a stochastic vector. (c) The columns converge to φ(s) with a geometric rate, ¯ t−s 1 + η −B ¯ B¯ B 1 − η [Φ(t, s)]i,j − [φ(s)]i ≤ 2 ¯
1 − ηB
∀i, j and t ≥ s
If we also assume that the matrices W are doubly stochastic then we have the following corollary to Lemma 3.2 Corollary 3.3 (Proposition 1,2, [2]). Let the Weights Rule 3.1, Connectivity 3.2, Bounded Intercommunication Interval 3.3 and Doubly Stochastic Weights 3.4 assumptions hold (or Connectivity 3.2, Bounded Intercommunication Interval 3.3, Simultaneous Information Exchange 3.5 and Symmetric Weights 3.6 assumptions hold). Then the following is true, ¯ (a) The limit matrices Φ(s) = limt→∞ Φ(t, s) are doubly stochastic and correspond to a uniform steady state distribution for all s, 1 ¯ Φ(s) = 11T N (b) The entries [Φ(t, s)]j,i converge to
1 N
∀s
(3.6)
as t → ∞ with a geometric rate,
¯ −B t−s ¯ ¯ [Φ(t, s)]i,j − 1 ≤ 2 1 + η 1 − ηB B ¯ B N 1−η
∀i, j and t ≥ s
Proof. Recall Proposition 3.1, it states that the Weights Rule and Doubly Stochastic Weights assumptions are implied by the Simultaneous Information Exchange and Symmetric Weights assumptions. Thus, we can proceed with the former assumptions. (a) The assumption that W (t) is doubly stochastic for all t implies that the Φ(t, s) matrices are also doubly stochastic, since that the product of two doubly stochastic matrices is also doubly stochastic. From Lemma 3.2 we know that the columns are identical, and since every row sums to one we have φ(s) = and
1 1 N
1 ¯ Φ(s) = 11T . N 29
CHAPTER 3. PRIMAL CONSENSUS ALGORITHM
(b) This follows directly from Lemma 3.2, with [φ(s)]i =
1 N.
The Φ matrices we have just studied determine how the consensus part of the algorithm propagate, and we will now turn to the subgradient part. In particular, with the Constant Step Size Assumption 3.7, the iterates in equation (3.5) becomes xi (t + 1) =
N X
[Φ(t, s)]j,i xj (s) − α
t X
r=s+1
j=1
N X
[Φ(t, r)]j,i g j (r − 1) − αg i (t). (3.7)
j=1
To further analyze the convergence, we will consider the related Stopped Model Assumption 3.8, where the agents cease to compute the subgradients at some time tˆ. Let x ˆ denote the iterates for the stopped model. Notice that x ˆi (t) = xi (t) for t ≤ tˆ, and for t > tˆ we have x ˆi (t) =
N X
[Φ(t − 1, 0)]j,i xj (0) − α
tˆ X
r=1
j=1
N X
[Φ(t − 1, r)]j,i g j (r − 1) ,
(3.8)
j=1
where we also let s = 0. Using Corollary 3.3 it is evident that the limit limt→∞ x ˆi (t) exists, and is independent of i. However it does depend on the parameter tˆ, thus let us define the limit as y(tˆ) = lim x ˆi (t). t→∞
By using the relation (3.6) from Corollary 3.3 we can express the limit as
N tˆ N X X 1 j 1 X xj (0) − α y(tˆ) = g (r − 1) . N j=1 N r=1 j=1
Rewriting it as a recursive equation in tˆ yields the expression y(tˆ + 1) = y(tˆ) −
N α X g j (tˆ). N j=1
(3.9)
Notice the similarity with the subgradient method update in (2.8). However, the vector g j (tˆ) is a subgradient to f j at xi (tˆ), not at y(tˆ), but it can be used as an approximation of the subgradient at y(tˆ), as is done in the following lemma. Lemma 3.4 (Lemma 5, [2]). Let the sequence {y(t)} be generated by the iteration (3.9), and let the sequence {xj (t)} be generated by (3.7). Let {dj (t)} denote the sequence of subgradients to f j at y(t). For any x ∈ Rn and all t ≥ 0 we have ||y(t + 1) − x||22 ≤ ||y(k) − x||22 + −
N 2α X j g (t) + dj (t) y(t) − xj (t) 2 2 2 N j=1 N 2α α2 X j 2 [f (y(t)) − f (x)] + 2 g (t) . 2 N N j=1
30
3.3. CONVERGENCE ANALYSIS
Corollary 3.5. In particular, for any optimal solution x∗ ∈ Rn to (3.1) we have, ||y(t + 1) − x∗ ||22 ≤ ||y(k) − x∗ ||22 + −
N 2α X j g (t) + dj (t) y(t) − xj (t) 2 2 2 N j=1 N 2α α2 X j 2 [f (y(t)) − f ∗ ] + 2 g (t) . 2 N N j=1
In the following proposition we will give the main convergence result for the Primal Consensus Algorithm. It consists of both a uniform bound on the difference between y(t) and xi (t), and upper bounds on the objective function for the average estimates yavg (t) and xiavg (t). yavg (t) =
t−1 1X y(k), t k=0
xiavg (t) =
t−1 1X xi (k). t k=0
The main convergence result is Theorem 3.6 (Proposition 3, [2]). Let Connectivity, Bounded Intercommunication Interval, Simultaneous Information Exchange, Symmetric Weights, Constant Step Size and the Subgradient assumptions hold. We then have (a) For every agent i, a uniform upper bound on y(t) − xi (t) 2 is given by
y(t) − xi (t) ≤ 2αGC1 2
where C1 = 1 +
∀t ≥ 0 ¯
N 1
1 − (1 − η B¯ ) B¯
1 + η −B 1 − η B¯
(b) An upper bound on the objective function for the average values f (yavg (t)) is given by N R2 + G2 α2 Ct ∀t ≥ 1, (3.10) f (yavg (t)) ≤ f ∗ + 2αt and an upper bound on the objective function for each agent i is given by f (xiavg (t)) ≤ f ∗ +
N R2 αG2 C + + 2αN G2 C1 2αt 2
∀t ≥ 1,
where C = 1 + 8N C1 . This shows that the error between y(t) and xi (t) is bounded by a constant that depends on the step size α. The second part shows how the objective function converges to a neighborhood around the optimal value when a constant step size is used. The result is similar to, and should be compared with the pure subgradient method in equation (2.13), favg (t − 1) ≤ f ∗ + 31
R 2 + G2 α 2 t . 2αt
CHAPTER 3. PRIMAL CONSENSUS ALGORITHM
If we instead consider the minimal function value, instead of the average, then we get the following trivial corollary. Corollary 3.7. Let fbest (t) be defined as fbest (t) = min f (y(t)), i=0,...,t
then, since the objective function f is convex, fbest (t − 1) ≤ f (yavg (t)) ≤ f ∗ +
32
N R2 + G2 α2 Ct . 2αt
(3.11)
Chapter 4
Dual Decomposition Algorithm “Nothing tends so much to the advancement of knowledge as the application of a new instrument. The native intellectual powers of men in different times are not so much the causes of the different success of their labours, as the peculiar nature of the means and artificial resources in their possession.” Sir Humphry Davy, 1812.
In this chapter we describe a new distributed optimization algorithm, based on the dual decomposition principle in Section 2.5.2. The distributed optimization method is designed in the spirit of the Primal Consensus Algorithm from Chapter 3, where a multi-agent system tries to solve the optimization problem of minimizing a sum of convex objective functions. There are, however, some significant differences between the models, as we will see in Section 4.3. From the decomposition method we develop both centralized and decentralized algorithms in Section 4.2 and 4.4 respectively. In particular, we are able to prove the convergence of the algorithms with timevarying delays, and noisy communication channels. The proofs are presented in Section 4.5. We further investigate some aspects of the communication model, and especially some ways of limiting the communication in Section 4.6. Finally, in Section 4.7 we explore how the algorithm behaves for quadratic cost functions.
4.1
Dual Problem Definition
The optimization problem we are considering is similar to the one in the previous chapter. Consider a multi-agent network consisting of N agents, trying to minimize a common additive cost function, minimize n x∈X⊆R
N X i=1
33
f i (x),
(4.1)
CHAPTER 4. DUAL DECOMPOSITION ALGORITHM
where f i : Rn → R is a convex function only known by agent i. Notice that we introduced the constraint set X ⊆ Rn , where we assume X is convex and has a nonempty interior so that Slater’s condition guarantees strong duality. We will now apply the dual decomposition method from Section 2.5.2 on this problem, but first, let us partition the state x ∈ Rn as follows. Let
x1 x2 .. .
x=
xN
,
where xi ∈ Rni for i = 1, . . . , N and with N i=1 ni = n. Notice that some ni could be zero. This partitioning can be done arbitrary, but the idea is to associate part xi with agent i. In many cases there are a natural partitioning, where xi is agent i’s internal state, for example in the problem with vehicle formations xi could contain vehicle i’s position and velocity. The optimization problem in (4.1) can now be rewritten as P
minimize n x∈X⊆R
N X
f i (x1 , x2 , . . . , xN ).
(4.2)
i=1
Let each agent maintain its own estimate of the optimization variable xi , thus would denote part j of agent i’s estimate. The optimization problem (4.1) can be further rewritten as
xij
minimize
x1 ,...,xN ∈X⊆Rn
subject to
N X
f i (xi1 , xi2 , . . . , xiN ),
i=1 x1 =
x2
= ··· =
(4.3)
xN ,
such that the coupling between the objective functions is in the consistency constraints. We now apply the method of Lagrange multipliers. Introduce the dual variables λij for i = 1, . . . , N and j = 1, . . . , N , where λij is associated with the constraint xii = xji . It will be useful for us to denote the collection of dual variables λij by λ, and in particular we consider λ to be the vector formed by concatenating the dual variables, λ11 . .. λ1N λ21 λ = . . .. λ 2N . . .
λN N
34
4.1. DUAL PROBLEM DEFINITION
We can now proceed by constructing the Lagrange function as L(x1 , x2 , . . . , xN , λ) =
N X
N X N X
f i (xi1 , xi2 , . . . , xiN ) +
i=1
λTij xii − xji ,
(4.4)
i=1 j=1
and the Lagrange dual function is, q(λ) =
inf
x1 ,x2 ,...,xN ∈X
L(x1 , x2 , . . . , xN , λ).
(4.5)
Remark. Notice that the Lagrange function in independent of the dual variable λii for all i, since λii accounts for the difference xii − xii = 0. Thus, λii could be removed from the expression, but instead, for simplicity, we define λii = 0. From here on, it is implicitly assumed that λij denotes a dual variable such that i 6= j. Let us now rewrite the Lagrange dual function by introducing the subproblems φi (λ), φi (λ) = inf f i (xi1 , xi2 , . . . , xiN ) + xi ∈X
N X
λTij xii −
j=1
N X
λTji xij .
(4.6)
j=1
Notice that each function φi only depends upon the dual variable λ, and it is the infimum over only the local variable xi . Thus, agent i is able to compute φi (λ), independent of all other agents, provided that it receives the dual variable λ. The Lagrange dual function can now be written as q(λ) =
N X
φi (λ),
(4.7)
i=1
and the corresponding Lagrange dual problem is maximize q(λ) = maximize λ
λ
N X
φi (λ).
(4.8)
i=1
Using the subgradient method to solve the dual problem yields the iterations λ(t + 1) = λ(t) + αt g(t),
(4.9)
where g(t) is a subgradient to q at λ(t). Remark. When compared to (2.8) we have a plus sign in (4.9) since the Lagrange dual problem is concave, instead of being convex. Let us now consider the subgradient update (4.9) for a single component λij of the dual variable. The subgradient to q with respect to λij is xii − xji , thus the subgradient update is
λij (t + 1) = λij (t) + αt xii (t) − xji (t) , where xi (t) is given by the solution to (4.6) at λ(t). 35
(4.10)
CHAPTER 4. DUAL DECOMPOSITION ALGORITHM
4.2
Centralized Algorithm
In this section we will describe how the optimization problem (4.1) can be solved by a distributed, centralized, algorithm, as described in Section 2.7.1. In the dual problem definition we saw how the problem could be solved by the subgradient method, with the iterations in (4.10). The idea behind the centralized algorithm is to let the coordinator agent compute the updates to the dual variables λ, and let each agent solve its corresponding optimization problem φi .
Central coordinator λ
1
λ x1
x2
2
λ
x3
3
λ
x4
4
λ
x5
5
Figure 4.1: Communication graph for the Centralized Optimization Algorithm. The agents send their primal variables xi to the coordinator, and the coordinator responds with the updated dual variables λ. Consider a network consisting of N computational agents, each associated with a subproblem φi , and one extra central coordinator agent, whose responsible it is to update the dual variables (Fig. 4.1). Algorithm 4.1: Centralized Optimization Algorithm Input: The coordinator has an initial estimate λ(0). Output: Estimate of λ(T ) at time T 1 for t=0 to T-1 do 2 Coordinator sends λ(t) to all agents. 3 for Agent i = 1 to N do // Subproblems can be solved in parallel. 4 Agent i solves φi (λ(t)). 5 Primal variable xi (t) is transmitted back to the coordinator. 6 end // The coordinator updates the dual variables. 7 8
λij (t + 1) = λij (t) + αt xii (t) − xji (t) end
Let us analyze the algorithm. Slater’s condition guarantees that the optimal value for the Lagrange dual problem (4.8) is equal to the optimal value for the 36
4.3. COMPUTATION MODEL
primal problem (4.1), i.e., q ∗ = f ∗ . Further, since we solve the Lagrange dual problem using the subgradient method, the convergence results from Section 2.4 applies. Thus, under the assumptions that there exist at least one finite maximizer λ∗ to q, the norm of the subgradients is uniformly bounded, ||g(t)||2 ≤ G, and the initial distance to the optimal set is also bounded, ||λ(0) − λ∗ ||2 ≤ R, we have R2 + G2 ti=0 αi2 qbest (t) ≥ q − . P 2 ti=0 αi P
∗
(4.11)
Further, for a constant step size αt = α, we have qbest (t − 1) ≥ q ∗ −
R 2 + G2 α 2 t . 2αt
(4.12)
The drawbacks of this method is the requirement of the central coordinator, which is infeasible for many applications. The central coordinator also makes the optimization algorithm vulnerable to a single point of failure. Thus, in the next sections we develop a decentralized distributed optimization algorithm.
4.3
Computation Model
We now proceed to a decentralized algorithm that would better capture the model used for the Primal Consensus Algorithm. In this section we will introduce and explain the model used for the decentralized algorithm, and compare it against the model used in the previous chapter. Once again, we will consider a network of N interconnected agents, trying to solve the optimization problem (4.1). All agents perform their computations and transmissions synchronously, at the discrete times t = 0, 1, . . .. We start off with the ordinary subgradient method’s assumptions on the Lagrange dual function. Assumption 4.1 (Subgradient 1: Existence of Minimizer). There exists at least one finite maximizer of q, denoted by λ∗ . Let Λopt denote the set of maximizers λ∗ to q. Also, let q ∗ = q(λ∗ ) denote the optimal value of q. From Slater’s condition, we know that q ∗ = f ∗ , and thus, it is equivalent to solve the Lagrange dual problem instead of the original problem (4.1). Assumption 4.2 (Subgradient 2: Bounded Subgradients). Let g(t) be a subgradient to q, i.e., g(t) ∈ ∂q(λ(t)), then the norm of the subgradients g(t) is uniformly bounded by G, ||g(t)||2 ≤ G ∀t. This assumption is especially valid if the primal variables are constrained to a compact set X, since the subgradient g(t) is constructed from the primal variables as xii (t) − xji (t). 37
CHAPTER 4. DUAL DECOMPOSITION ALGORITHM
Assumption 4.3 (Subgradient 3: Bounded Initial Distance). The distance from the initial point, λ(0), to the optimal set is bounded by R, d (λ(0), Λopt ) ≤ R. The next assumption is the most significant change from the Primal Consensus Algorithm. The Primal Consensus Algorithm was truly distributed, in the sense that each agent only needed to exchange information with its direct neighbors, without any knowledge of anything further away. In the decentralized algorithms for the dual problem we need agent i to be able to send a message to agent j, even if j is not a neighbor to agent i. This property is a common requirement in network architectures, however, it is outside the scope of this thesis, and has already been extensively studied [5, 29]. We will therefore resort to the following assumption: Assumption 4.4 (Routing Protocol). The agents utilize a routing protocol such that an agent can send a package of information to any other agent in the network, even if they are not neighbors. In that case, the routing protocol uses multi-hops to reach the destination, i.e., the package is sent through a path of adjacent agents in the network. However, it is not assumed that the package arrives immediately at the destination, instead, it is normally assumed that it takes one time step for a package to transfer between any two neighbors, and thus that the time it takes a package to reach j from i is equal to the distance from i to j in the communication graph. To further strengthen the Routing Protocol Assumption, we continue by assuming that there will always eventually be a path between any two agents. Also, no packages will be dropped, instead they remain in the network until that they are delivered. The assumption can be compared to the Connectivity Assumption 3.2, that the graph (V, E∞ ) is strongly connected. Assumption 4.5 (Reliable Network). Every package sent on the network will eventually arrive to its destination. The Routing Protocol Assumption and the Reliable Network Assumption motivates the two following network models. First, assume that the network topology is static, which is modeled by defining a constant time delay between any pair of agents. Let δij ≥ 0 denote the time-delay it takes a package sent from agent i to be delivered to agent j. Further, let dij = δij + δji denote the round-trip time between agent i and agent j, i.e., the time it takes agent j to respond to agent i. In the second model, it is assumed that the network topology can change, and even that the network can temporarily be disconnected. In this model, let δij (t) denote the time-varying delay from agent i to agent j. Thus, a package sent by agent i at time t will arrive to agent j at time t + δij (t). Similarly, the round-trip time with time-varying delays is dij (t) = δij (t) + δji (t + δij (t)). In the following assumptions it is assumed that the round-trip time is bounded. 38
4.3. COMPUTATION MODEL
Assumption 4.6 (Bounded Time-Delays). In the static network model, there exists an upper bound D on the round-trip time, dij = δij + δji ≤ D
∀i, j ∈ N .
Assumption 4.7 (Bounded Time-Varying Delays). In the dynamic network model, there exists an upper bound D on the round-trip time, dij (t) = δij (t) + δji (t + δij (t)) ≤ D
∀i, j ∈ N and t = 0, 1, . . . .
¯ in the Primal Consensus Algorithm, is an upper bound on the Recall that B, time it takes any agent i to influence any other agent j. Thus, the bound D on the ¯ round-trip time should be compared against 2B. Next, the agents are required to apply the updated information simultaneously. Thus, if agent i computes and transmits an update to agent j at time t, then he is enforced to wait until agent j receives the information at time t + δij until he can use that information himself. However, this implies that agent i needs to know when agent j receives the information, which leads to the following assumption: Assumption 4.8 (Locally Known Time-Delays). In the static network model, the time delay δij is known by both agent i and agent j. In the dynamic network model, the time-varying delay δij (t) is known by agent i and j at time t + δij (t). In the communication models considered so far, it has been assumed that the agents can transfer real numbers without any errors. In practice, however, this is not a realistic assumption [30, 31]. Instead, the communication channels are limited to a finite alphabet, and the real numbers are quantized at transmission. The quantization is captured by an extended model compared to the Primal Consensus Algorithm, with an added noise to the subgradients. Introduce the noisy subgradients g˜(t) as g˜(t) = g(t) + ε(t), (4.13) where g(t) is, as before, a subgradient to q at λ(t), and where ε(t) is a noise vector. The subgradient method (4.9) is modified to use the noisy subgradients, λ(t + 1) = λ(t) + αt g˜(t). Assumption 4.9 (Bounded Noise). Let the noisy subgradients be given by (4.13). Then, there exist an upper bound E on the norm of the noise vectors, ||ε(t)||2 ≤ E
∀t.
(4.14)
In order to handle the noisy subgradients, the class of objective functions q has to be further restricted. Two different models that accomplish this are considered. First, if the dual variables are known to be constrained to a compact set, and secondly, if the objective function has a sharp maxima [30]. 39
CHAPTER 4. DUAL DECOMPOSITION ALGORITHM
Assumption 4.10 (Compact Constraint Set). The dual variables λ are known to be constrained to a compact set. In particular, the distance to the optimal set is bounded. Thus,there exist a constant L such that, for any λ, d (λ, Λopt ) ≤ L. Remark. Independent of the original problem, the dual problem is a maximization problem where the dual variables, λ, are taken in the entire Rn domain. There is thus no direct way to enforce the Compact Constraint Set Assumption, however, the dual variables can often be interpreted as a disagreement cost, and can thereby be bounded for a specific application. Assumption 4.11 (Sharp Maxima). The concave objective function q has a sharp set of maxima. This means that q(λ) decreases at least linearly as λ moves away from the optimal set Λopt , i.e., for some constant µ > 0 q(λ) ≤ q ∗ − µd (λ, Λopt ) . Remark. The Bounded Subgradient Assumption essentially implies that the dual function decreases at most linearly when λ moves away from the optimal set. Thus, the Sharp Maxima Assumption implies that the dual function is bounded both from above and below by linear functions.
4.4
Decentralized Algorithms
The goal of this section is to develop decentralized algorithms to solve the optimization problem in (4.1). The idea is similar to the centralized algorithm described in Section 4.2, where each agent i solves the subproblems φi (λ), but where the dual variable update (4.10) is decentralized. In the centralized algorithm, the central coordinator was responsible for updating and distributing the dual variables λ. Consider the dual variable λij , notice that it is only used by agent i and agent j, and further that it is only necessary to know their optimization variables, xi and xj , in order to update λij . This observation is the key idea behind our decentralized algorithms. In all of the following algorithms, we will consider a network of N agents trying to solve the optimization problem (4.1), where f i is the local objective function known only by agent i. Further, we assume that the Routing Assumption 4.4 as well as the Reliable Network Assumption 4.5 holds. Thus, it is assumed that any agent i can send a package of information to any other agent j, and it is guaranteed that this package will arrive within a finite time.
4.4.1
Halting Algorithm
This is perhaps the simplest decentralized algorithm, and it will be useful to compare against the more advanced algorithms introduced in the following sections. 40
4.4. DECENTRALIZED ALGORITHMS
Let the Bounded Time-Delays Assumption 4.6 hold, and further, assume that the bound on the delays D is known by all agents. Let each agent i solve its optimization problem φi at the discrete times t = 0, D + 1, 2(D + 1), 3(D + 1), . . ., and extracting the primal variable xi (t). It then broadcasts the primal variable to all other agents j, and halts its computations while it is waiting for the updated dual variables. Agent j receives the primal variable from agent i at time t + δij , and whenever an agent receives the primal variable of another agent, it computes an update to the dual variable as
λij (t + D + 1) = λij (t) + αt xii (t) − xji (t) . The updated dual variable is then transmitted back to i, thus, i will receive λij (t + D + 1) at time t + δij + δji = t + dij , which is guaranteed to be before the next computational round at time t + D + 1. Thus, at the end of time t + D, agent i has received the updated dual variable λij (t + D + 1) from all other agents j, and can perform its next computational round, calculating xi (t + D + 1) from λ(t + D + 1). Algorithm 4.2: Halted Decentralized Algorithm Input: Initial estimates λij (0). Output: Estimate of λ(T (D + 1)) at time T (D + 1) // The following algorithm is executed on each agent i. 1 for t=0 to T-1 do 2 Solve φi (λ(t(D + 1))). 3 Transmit the primal variable xii (t(D + 1)) to all other agents. 4 for τ = 0 to D do // Update dual variables 5 foreach Incoming primal variable xjj (t(D + 1)) do 6 Compute λji ((t + 1)(D + 1)) =
λji (t(D + 1)) + αt(D+1) xjj (t(D + 1)) − xij (t(D + 1)) 7 8 9 10 11 12 13
Transmit the dual variable λji ((t + 1)(D + 1)) to agent j. end foreach Incoming dual variable λij ((t + 1)(D + 1)) do Save the dual variable until next computational round. end end end
In conclusion, each agent will solve its optimization problem φi only at the discrete times t = 0, D + 1, 2(D + 1), 3(D + 1), . . ., and spend the remaining rounds waiting for the dual variables to be updated, which was previously done by the central coordinator in just one time step. Thus, this algorithm performs the same computations as the centralized algorithm, but D +1 times slower. The convergence rate, with a constant step size, can thus be written as 41
CHAPTER 4. DUAL DECOMPOSITION ALGORITHM
qbest (t − 1) ≥ q ∗ −
R2 + G2 α2 t/(D + 1) . 2αt/(D + 1)
(4.15)
Remark. The assumption that every agent knows the time-delay bound D might be considered as a violation to our desire that the agents only use local information, since D is the maximum round-trip time between any pair of agents in the network, not just the ones concerning agent i.
4.4.2
Constant-Delays Algorithm
We are now going to speed up the Halting Algorithm by letting each agent solve its local optimization problem φi at every time step. However, since the time it takes to update the dual variable λij is dij , the agents can not necessarily use the dual update from the last round, instead it has to resort to the dual variable computed from the primal variables that where known dij time steps ago. Remark. Anyone familiar with dynamical systems should immediately react to the delays, since they are a common source of instability. In the next section we prove the convergence of the algorithm, under the Bounded Time-Delays Assumption 4.6. Let the Locally Known Time-Delays Assumption 4.8 holds, this implies that agent i knows when a package sent to agent j will arrive, and thus that if agent i waits δij time steps before using the information sent to agent j, then they are able to use the same information at the same time. It is now sound to define λij (t) as the dual variable known to both agent i and agent j at time t. Thus, both agents will use λij (t) when they solve their local optimization problems, φi and φj , at time t. Agent i then transmit xii (t) to agent j, but agent j does not receive that information until time t + δij . However, since δij is known to both agents, agent j can use his estimate xji (t) when he updates λij . After another δji time steps, the new dual variable is known to agent i too, and can then be used by both agents at time t + dij + 1. Thus, the new dual variable is in fact λij (t + dij + 1). Figure 4.2 illustrates the packages send during one time step, for the Constant-Delays Algorithm. In summary, the dual variables are updated according to
λij (t + dij + 1) = λij (t + dij ) + αt xii (t) − xji (t) .
(4.16)
Notice that φi is solved at each time step, and that agent i receives a new dual variable λij at every time step, but that the dual variable is delayed dij steps compared to the centralized algorithm. The pseudocode is given in Algorithm 4.3. Remark. Notice that it takes dij time steps until agent i receives its first update on the dual variable λij , which means that the first dij + 1 dual variables will be equal, λij (0) = λij (1) = · · · = λij (dij ). 42
4.4. DECENTRALIZED ALGORITHMS
x1(t)
1
2 x1(t-1)
λ14(t+1)
5 λ14(t+2)
3
4
Figure 4.2: Communication graph for the decentralized optimization algorithm. The packages sent at time t, between agent 1 and agent 4 in order to update λ14 are shown. At time t, agent 1 send its primal variable x1 (t) to agent 2, while agent 2 transmits the primal variable x1 (t − 1) it received in the last time step to agent 4. Agent 4 computes the new dual variable λ14 (t + 2) that can be used 2 steps later, and transmits it to agent 5. Finally, agent 5 transmits the dual variable λ14 (t + 1), that it received from agent 4 in the last time step, to agent 1. The delay between agent 1 and agent 4 is δ14 = δ41 = 1.
Algorithm 4.3: Constant-Delays Distributed Algorithm Input: Initial estimates λij (0). Output: Estimate of λij at time T // The following algorithm is executed on each agent i. 1 for t=0 to T-1 do 2 Solve φi (λ(t)). 3 Transmit the primal variable xii (t) to the other agents. // Update dual variables 4 foreach Incoming primal variable xjj (t − δji ) do 5 Compute
λji (t + δij ) = λji (t + δij − 1) + αt−δji xjj (t − δji ) − xij (t − δji ) 6 7 8 9
Transmit the dual variable λji (t + δij ) to agent j. end Receive next dual variable λ(t + 1) end
43
CHAPTER 4. DUAL DECOMPOSITION ALGORITHM
4.4.3
Time-Varying Delays Algorithm
The final algorithm that we consider is just a small modification of the last algorithm with constant time-delays. This algorithm is trying to capture the more general setting, when the graph topology is time-variant, and thus that the time it takes a package to be delivered from agent i to agent j can be different from iteration to iteration. Let the Bounded Time-Varying Delays Assumption 4.7 and the Locally Known Time-Delays Assumption 4.8 hold. Thus, a package sent from agent i to agent j at time t will arrive at time t + δij (t). Remark. Notice that we do not assume that the packages necessarily arrive in the same order as they where transmitted in. It is entirely possible that δij (t) > δij (t + 1) + 1, as long as dij (t) ≤ D. This means in particular that we need to handle multiple estimates of xii arriving during one time step. The proposed algorithm handles these by combining all estimates into one update to the dual variable, as λij (t + 1) = λij (t) +
D X d=0
(
α(t−d) xii (t − d) − xji (t − d) 0
if dij (t − d) = d; otherwise.
(4.17)
In other words, the dual variable used at time t + 1 is computed from the primal variables whose round-trip was completed during time step t. When implementing the dual variable update (4.17), we must also take into account that updates computed at different times by agent j can arrive during one time i. Because of this, we only let agent j compute the update step at agent αt xii (t) − xji (t) , and transmit the update instead of the entire new dual variable. Agent i can then assembly the next dual variable himself. Let µij (t) denote the update information, µij (t) = αt xii (t) − xji (t) . The dual variable update (4.17) can then be rewritten as λij (t + 1) = λij (t) +
X
µij (tˆ),
tˆ : tˆ+dij (tˆ)=t
where the sum is over all updates that complete their round-trip at time t. The pseudocode is given in Algorithm 4.4.
44
(4.18)
4.5. CONVERGENCE ANALYSIS
Algorithm 4.4: Time-Varying Delay, Distributed Algorithm Input: Initial estimates λij (0). Output: Estimate of λij at time T // The following algorithm is executed on each agent i. 1 for t=0 to T-1 do 2 Solve φi (λ(t)). 3 Transmit the primal variable xii (t) to the other agents. // Compute updates for dual variables 4 foreach Incoming primal variable xjj (tˆ) do // tˆ denotes the time when the primal variable was sent from agent j. 5 Compute µji (tˆ) = αtˆ xjj (tˆ) − xij (tˆ) Transmit the update µji (tˆ) to agent j. end // Compute dual variables from updates Receive dual updates µij (tˆ). X µij (tˆ) λij (t + 1) = λij (t) +
6 7
8 9
tˆ : tˆ+dij (tˆ)=t
λji (t + 1) = λji (t) +
10
X
µji (tˆ)
tˆ : tˆ+dji (tˆ)=t 11
end
4.5
Convergence Analysis
In this section we will focus on constructing bounds on the convergence rate, similar to those of the ordinary subgradient method (2.11). The Constant-Delays Algorithm is analyzed for arbitrary step sizes αt , while the Time-Varying Delays Algorithm is analyzed for constant step sizes αt = α, t = 0, 1, . . .. Finally, the noisy subgradient model, introduced in Assumption 4.9, is used to analyze the Constant-Delays Algorithm.
4.5.1
Constant-Delays Algorithm
The analysis will focus on the evolution of the dual variables λ(t). Recall that λ(t) is the vector formed by concatenating the dual variables λij (t) for i, j ∈ N . Further, g(t) is a subgradient to the Lagrange dual function, q, evaluated at λ(t). It is useful to view the subscript ij as a general vector decomposition operator, 45
CHAPTER 4. DUAL DECOMPOSITION ALGORITHM
applicable for other vectors than λ as well,
xij . . .
x11 . ..
xN N
= xij .
ij
The dual variable update in (4.16) can then be written as λij (t + dij + 1) = λij (t + dij ) + αt gij (t),
(4.19)
where gij (t) = xii (t) − xji (t). Instead of focusing on the dual variable λ(t), we shift our focus to the subgradient g(t) of q, evaluated at λ(t). In order to express the previous relation in terms of the ¯ entire subgradient vector g(t), define the new vector λ(t) by ¯ ij (t) = λij (t + dij ). λ Thus,
λ11 (t + d11 ) .. .
¯ λ(t) =
λij (t + dij ) .. .
.
λN N (t + dN N ) The update rule (4.19) can then be rewritten, for the entire concatenated vector, as ¯ + 1) = λ(t) ¯ + αt g(t). λ(t
(4.20)
Remark. Notice the similarity between this expression and the ordinary subgradient method iteration (2.8), however, remember that g(t) is a subgradient at λ(t) and ¯ not at λ(t). ¯ with the following vector projection Pd . Let Pd (x) Next, let us relate λ(t) to λ(t) be defined by ( xij if dij = d; [Pd (x)]ij = 0 otherwise. In other words, Pd (x) selects those components of x whose round-trip time is equal to d. Notice two important properties of P under the Bounded Time-Delays Assumption 4.6. First, for any vector x D X
Pd (x) = x.
d=0
46
(4.21)
4.5. CONVERGENCE ANALYSIS
¯ Second, we can express λ(t) from λ(t) as D X
¯ − d)) = λ(t). Pd (λ(t
(4.22)
d=0
¯ + 1) − λ∗ , The idea behind the convergence proof is to consider the norm λ(t 2 similar to the convergence proof in Section 2.4.
Lemma 4.1. The following is true under the Bounded Initial Distance Assumption 4.3, ¯ d λ(0), Λopt ≤ R. ¯ we have Proof. By the definition of λ, ¯ ij (0) = λij (dij ). λ Recall that the first time λij is updated is at time dij + 1, thus λij (dij ) = λij (0), ¯ and hence λ(0) = λ(0). The lemma now follows from the Bounded Initial Distance Assumption. Lemma 4.2. The following is true under the Bounded Subgradients Assumption 4.2, ¯ ¯ ≤ αt G. λ(t + 1) − λ(t) 2
Proof. By the definition of the update rule (4.20), we have ¯ + 1) − λ(t) ¯ = αt g(t). λ(t Thus, by the Bounded Subgradients Assumption, ¯ ¯ = αt ||g(t)|| ≤ αt G. λ(t + 1) − λ(t) 2 2
¯ In the next lemma we give an upper bound on the difference between λ(t) and λ(t). Lemma 4.3. The following is true under the Bounded Subgradients Assumption 4.2 and the Bounded Time-Delays Assumption 4.6, t−1 X ¯ αd . λ(t) − λ(t) ≤ G 2
47
d=t−D
CHAPTER 4. DUAL DECOMPOSITION ALGORITHM
¯ − λ(t) can be written as Proof. Recall that λ(t) ¯ − λ(t) = λ(t) ¯ − λ(t)
D X
¯ − d) = Pd λ(t
d=0
D X
¯ − λ(t ¯ − d) . Pd λ(t)
d=1
¯ − λ(t ¯ − d) with a telescoping sum, we have By replacing the term λ(t) ¯ − λ(t) = λ(t) =
D X
d−1 X
Pd i=0 d=1 D d−1 XX
!
¯ − i) − λ(t ¯ − i − 1) λ(t
¯ − i) − λ(t ¯ − i − 1) , Pd λ(t
d=1 i=0
since Pd is a linear function. Now, by changing the order in this double sum, we have ¯ − λ(t) = λ(t)
D−1 X
D X
¯ − i) − λ(t ¯ − i − 1) . Pd λ(t
i=0 d=i+1
By the triangle inequality, we have D−1 D X X ¯ ¯ − i) − λ(t ¯ − i − 1) . Pd λ(t λ(t) − λ(t) ≤ 2 i=0 d=i+1 2
Further, since the projections Pd are disjoint for different d, we have D X X D ¯ − i) − λ(t ¯ − i − 1) ¯ − i) − λ(t ¯ − i − 1) ≤ Pd λ(t Pd λ(t d=i+1 d=0 2 2 ¯ ¯ − i − 1) . = λ(t − i) − λ(t 2
But using Lemma 4.2 yields ¯ ¯ − i − 1) ≤ Gαt−i−1 . λ(t − i) − λ(t 2
Finally, putting everything together gives us D−1 t−1 X X ¯ Gαt−i−1 = G αd . λ(t) − λ(t) ≤ 2 i=0
d=t−D
Remark. If t < D, then λ(t) should be written as λ(t) = bound becomes
Pt
¯
d=0 Pd (λ(t − d)),
and the
t−1 t−1 X X ¯ Gαt−i−1 = G αd . λ(t) − λ(t) ≤ 2
i=0
d=0
Instead of handing this case separately, it is easier to define αt = 0 if t < 0. 48
4.5. CONVERGENCE ANALYSIS
Lemma 4.4. The following is true under the Bounded Subgradients Assumption 4.2 and the Bounded Time-Delays Assumption 4.6,
t−1 X
¯ − λ∗ ≤ G2 g(t)T λ(t)
αd + q(λ(t)) − q ∗ .
(4.23)
d=t−D
Proof. Adding 0 = g(t)T (λ(t) − λ(t)) to the left-hand side of equation (4.23) yields
¯ − λ(t) + g(t)T (λ(t) − λ∗ ) ¯ − λ∗ = g(t)T λ(t) g(t)T λ(t) By Cauchy-Schwarz inequality,
¯ − λ(t) . ¯ − λ(t) ≤ ||g(t)|| · λ(t) g(t)T λ(t) 2 2
Since g(t) is a subgradient at λ(t), the subgradient definition (2.7) implies that g(t)T (λ(t) − λ∗ ) ≤ q(λ(t)) − q ∗ . Thus,
¯ − λ∗ ≤ ||g(t)|| · λ(t) ¯ − λ(t) + q(λ(t)) − q ∗ . g(t)T λ(t) 2 2
Finally, using Lemma 4.3 and the Bounded Subgradients Assumption 4.2, we get T
g(t)
t−1 X
¯ − λ∗ ≤ G 2 λ(t)
αd + q(λ(t)) − q ∗ .
d=t−D
The function value q(λ(t)) does not have to be monotonically increasing, just as the ordinary subgradient method. Therefore, the algorithm is evaluated as the maximum value over a period of T iterations. Let us now state and prove the main convergence theorem for the ConstantDelays Algorithm, which gives an upper bound on the difference between q ∗ − maxt=0,...,T q(λ(t)). Theorem 4.5. Let the Existence of Minimizer Assumption 4.1, Bounded Subgradients Assumption 4.2, Bounded Initial Distance Assumption 4.3 and the Bounded Time-Delays Assumption 4.6 hold. Then, the maximum value satisfies the following bound, max q(λ(t)) ≥ q ∗ −
R 2 + G2
PT
t=0
2
t=0,...,T
49
αt2 + 2αt
PT
t=0 αt
Pt−1
d=t−D
αd
.
(4.24)
CHAPTER 4. DUAL DECOMPOSITION ALGORITHM
¯ Proof. Let λ∗ be any optimal point to q, and let λ(t) be given by the iterations (4.20). Consider the following relation 2
2
¯ ) + αT g(T ) − λ∗ ¯ + 1) − λ∗ = λ(T 0 ≤ λ(T
2
2
2 ¯ ¯ ) − λ∗ + α2 ||g(T )||2 . ) − λ∗ + 2αT g(T )T λ(T = λ(T T 2 2
2 ¯ − λ∗ , and can be expanded until t = 0, This is a recursive equation in λ(t) 2
yielding 2
¯ − λ ∗ + 0 ≤ λ(0) 2
T X
¯ − λ∗ + 2αt g(t)T λ(t)
t=0
T X
αt2 ||g(t)||22 .
t=0
Using equation (4.23) from Lemma 4.4 gives us t−1 T T 2 X X X ¯ αt2 ||g(t)||22 2αt G2 αd + q(λ(t)) − q ∗ + 0 ≤ λ(0) − λ∗ + 2
t=0
t=0
d=t−D
X t−1 T T 2 X X ¯ 2αt 2αt G2 αd + max q(λ(t)) − q ∗ ≤ λ(0) − λ ∗ + 2 t=0
+
T X
t=0,...,T
d=t−D
t=0
αt2 ||g(t)||22 .
t=0
Thus, 2 P P P ¯ αd + Tt=0 αt2 ||g(t)||22 λ(0) − λ∗ + Tt=0 2αt G2 t−1 d=t−D 2 max q(λ(t)) ≥ q ∗ − . PT t=0 2αt
t=0,...,T
¯ Notice that λ∗ is an arbitrary optimal point, thus λ(0) − λ∗ can be replaced with 2
¯ d λ(0), Λopt . Finally, using Lemma 4.1 and Bounded Subgradients Assumption 4.2 gives us ∗
max q(λ(t)) ≥ q −
R2 +
PT
t=0 2αt G
t=0,...,T
= q∗ −
R 2 + G2
PT
t=0
2
2 Pt−1 d=t−D PT t=0 2αt
αt2 + 2αt
PT
t=0 αt
αd +
Pt−1
PT
2 2 t=0 αt G
d=t−D αd
.
The convergence rate in Theorem 4.5 can be compared with the results in equation (2.11) for the ordinary subgradient method. The results can be further analyzed for different step sizes, in particular for constant step sizes and square summable but not summable. 50
4.5. CONVERGENCE ANALYSIS
Corollary 4.6. For a constant step size αt = α, t = 0, 1, . . ., the result in Theorem 4.5 becomes R2 + G2 α2 T (1 + 2D) max q(λ(t)) ≥ q ∗ − . (4.25) t=0,...,T −1 2αT Comparing the results in Corollary 4.6 against the ordinary subgradient method with a constant step size in (2.12), we notice the factor (1+2D). Thus, they converge with the same speed, but the Constant-Delays Algorithm converges to a neighborhood around the optimal solution that is (1+2D) times larger. The intuition behind the ordinary subgradient method’s neighborhood is that the algorithm overshoots the optimal point with half the step length (Fig. 4.3a). The intuition behind the factor (1 + 2D), can also be explained, since the algorithm uses information from D steps ago, it can continue past the optimal point for an additional D steps (Fig. 4.3b).
(a) Without delays
(b) With delays
Figure 4.3: The intuitive explanation of the convergence results with delays, Corollary 4.6. With a constant step size, the algorithm converges to a neighborhood around the optimal value whose radius is half the step length (a). In (b), the dashed lines shows the delayed messages. With delays, the algorithm could potentially continue an additional D steps past the optimal point before turning around.
Corollary 4.7. Let the step size αt satisfy ≥α ≥0 α Pt ∞ t+1 2 t=0 αt < ∞; P∞ t=0 αt
t = 0, 1, . . . ;
= ∞.
Then, the result in Theorem 4.5 becomes ∗
max q(λ(t)) ≥ q −
R2 + G2 (1 + 2D) 2
t=0,...,T
as t → ∞. 51
PT
PT
t=0 αt
2 t=0 αt−D
→ q∗
CHAPTER 4. DUAL DECOMPOSITION ALGORITHM
4.5.2
Time-Varying Delays Algorithm
We now turn to the analysis of the Time-Varying Delays Algorithm. The algorithm provides a way of updating the dual variables according to equation (4.18),
αtˆ xii (tˆ) − xji (tˆ) .
X
λij (t + 1) = λij (t) +
(4.26)
tˆ : tˆ+dij (tˆ)=t
The idea is to proceed in a similar way as with the analysis of the ConstantDelays Algorithm. However, there are some complications arising from the fact that several gradient updates can arrive at a single time instance. Define the time-dependent vector-projection Pt,d , for a vector x, as (
[Pt,d (x)]ij =
xij 0
if dij (t) = d; otherwise.
In other words, Pt,d (x) selects those components of x who has a round-trip time equal to d, if it is sent during time step t. Notice that for a fixed t, Pt,d has the same properties as Pd in the preceding section. Thus, under the Bounded Time-Varying Delays Assumption 4.7, for any vector x, and any fixed time t, D X
Pt,d (x) = x.
d=0
¯ + 1) = λ(t) ¯ + αt g(t) for Remark. It is not possible to express the update rule as λ(t ¯ any state λ, since the update can contain gradient information from many different time steps. With the aid of the vector projection Pt,d it is possible to express the update rule (4.26), for the entire state vector λ, as λ(t + 1) = λ(t) + α
D X
Pt−d,d (g(t − d)) ,
(4.27)
d=0
where a constant step size αt = α is assumed. Lemma 4.8. The following is true under the Bounded Subgradients Assumption 4.2 and the Bounded Time-Varying Delays Assumption 4.7, ||λ(t + 1) − λ(t)||2 ≤ α(D + 1)G. Proof. From the update rule (4.27), we have λ(t + 1) − λ(t) = α
D X d=0
52
Pt−d,d (g(t − d)) .
4.5. CONVERGENCE ANALYSIS
Thus, D X Pt−d,d (g(t − d)) . ||λ(t + 1) − λ(t)||2 = α d=0
2
By the triangle inequality, we further have D D D X X X ||Pt−d,d (g(t − d))||2 ≤ ||g(t − d)||2 . Pt−d,d (g(t − d)) ≤ d=0
2
d=0
d=0
Finally, by using the Bounded Subgradients Assumption, we have ||λ(t + 1) − λ(t)||2 ≤ α
D X
||g(t − d)||2 ≤ α(D + 1)G.
d=0
Lemma 4.9. The following is true under the Bounded Subgradients Assumption 4.2 and the Bounded Time-Varying Delays Assumption 4.7, g(t)
T
D X
Pt,d (λ(t + d) − λ(t)) ≤ αG2 D(D + 1)
d=0
for all t ≥ 0. Proof. Notice that t is fixed, and thus, this lemma is similar to Lemma 4.3. Using Cauchy-Schwarz inequality, we have g(t)
T
D X d=0
D X Pt,d (λ(t + d) − λ(t)) ≤ ||g(t)||2 · Pt,d (λ(t + d) − λ(t)) . d=0
Replacing λ(t + d) − λ(t) with the telescoping sum
2
Pd
i=1 λ(t + i) − λ(t + i − 1)
yields
! D D d X X X λ(t + i) − λ(t + i − 1) Pt,d (λ(t + d) − λ(t)) = Pt,d i=1 d=1 d=0 2 2 D d X X = Pt,d (λ(t + i) − λ(t + i − 1)) . d=1 i=1
2
Changing the order of summation, and using the triangle inequality gives us D X d D X D X X Pt,d (λ(t + i) − λ(t + i − 1)) = Pt,d (λ(t + i) − λ(t + i − 1)) i=1 d=i d=1 i=1 2 2 D X D X ≤ Pt,d (λ(t + i) − λ(t + i − 1)) . i=1
53
d=i
2
CHAPTER 4. DUAL DECOMPOSITION ALGORITHM
Further, D D X D D X X X Pt,d (λ(t + i) − λ(t + i − 1)) Pt,d (λ(t + i) − λ(t + i − 1)) ≤ i=1
d=i
i=1
2
=
D X
d=0
2
||λ(t + i) − λ(t + i − 1)||2
i=1
By using Lemma 4.8, we have D D X X Pt,d (λ(t + i) − λ(t + i − 1)) ≤ Dα(D + 1)G, i=1
d=i
2
and with the Bounded Subgradients Assumption, g(t)T
D X
Pt,d (λ(t + d) − λ(t)) ≤ αG2 D(D + 1).
d=0
Because the gradient updates can arrive out-of-order, we evaluate the algorithm on the related “stopped model”, introduced in the Primal Consensus Algorithm, Assumption 3.8. Thus, let g(t) be given by (
g(t) =
a subgradient to q at λ(t) if t ≤ T ; 0 otherwise.
Let us now state and prove the main convergence theorem for the Time-Varying Delays Algorithm: Theorem 4.10. Let the Existence of Minimizer Assumption 4.1, Bounded Subgradients Assumption 4.2, Bounded Initial Distance Assumption 4.3 and the Bounded Time-Varying Delays Assumption 4.7 hold. Then, the maximum value satisfies the following bound max
t=0,...,T −1
q(λ(t)) ≥ q ∗ −
R2 + α2 D(D + 1)2 G2 + 3T α2 (D + 1)2 G2 2αT
(4.28)
Proof. Let λ∗ be an arbitrary optimal point to q, and let λ(t) be given by the iterations (4.27). Consider the following relation 0 ≤ ||λ(T + 1) −
λ∗ ||22
2 D X ∗ = λ(T ) + α PT −d,d (g(T − d)) − λ d=0
= ||λ(T ) − λ∗ ||22 + 2α
D X d=0
2
2 D X PT −d,d (g(T − d))T (λ(T ) − λ∗ ) + α2 PT −d,d (g(T − d)) . d=0
54
2
4.5. CONVERGENCE ANALYSIS
This is a recursive equation in ||λ(t) − λ∗ ||22 , and can be expanded until t = 0, thus yielding 0 ≤ ||λ(0) − λ∗ ||22 + 2α
T X D X
Pt−d,d (g(t − d))T (λ(t) − λ∗ )
t=0 d=0
(4.29)
2 D T X X Pt−d,d (g(t − d)) . + α2 t=0
d=0
2
Let us evaluate this expression on the “stopped model”, at the time T + D. This means that all subgradient updates up until time step T has had time to complete their round-trip, and no update is transmitted after time step T . Consider the expression TX +D X D
Pt−d,d (g(t − d))T (λ(t) − λ∗ ).
t=0 d=0
By using the fact that g(t) = 0 for t > T , and reindexing the sum with t¯ = t − d, we get T X D X
Pt¯,d g(t¯)
T
(λ(t¯ + d) − λ∗ ).
t¯=0 d=0
Consider now the effect the projection Pt¯,d has on the scalar product. Especially, it is symmetric in the operands, and can thus be moved to the second operand, i.e., T X D X
g(t¯)T Pt¯,d λ(t¯ + d) − λ∗ .
t¯=0 d=0
The subgradient g(t) can be moved outside the second summation, and by both adding and subtracting λ(t¯) from the projection, we get T X
g(t¯)T
t¯=0
D X
Pt¯,d λ(t¯ + d) − λ(t¯) + Pt¯,d λ(t¯) − λ∗
.
d=0
Lemma 4.9 can be applied to the first part of this sum, and for the second part we have, g(t¯)T
D X
Pt¯,d λ(t¯) − λ∗ = g(t¯)T λ(t¯) − λ∗ ≤ q(λ(t¯)) − q ∗ ,
d=0
since g(t) is a subgradient at λ(t). Thus, D X
Pt−d,d (g(t − d))T (λ(t) − λ∗ ) ≤ αG2 D(D + 1) + q(λ(t¯)) − q ∗ .
d=0
Returning to (4.29), consider the last part α2
TX +D X D t=0
d=0
2 Pt−d,d (g(t − d)) . 2
55
CHAPTER 4. DUAL DECOMPOSITION ALGORITHM
Recall that α us α2
PD
d=0 Pt−d,d (g(t
D TX +D X t=0
d=0
− d)) = λ(t + 1) − λ(t), and using Lemma 4.8 gives
2 Pt−d,d (g(t − d)) ≤ (T + D + 1)α2 (D + 1)2 G2 . 2
Assembling the results attained so far yields 0 ≤ ||λ(0) − λ∗ ||22 +2α
T X
αG2 D(D + 1) + q(λ(t¯)) − q ∗ +(T +D+1)α2 (D+1)2 G2 .
t¯=0
¯ − λ∗ can be replaced Notice that λ∗ is an arbitrary optimal point, thus λ(0) 2 ¯ with d λ(0), Λopt . Finally, using the Bounded Initial Distance Assumption, and evaluating the maximal value maxt=0,...,T q(λ(t)) yields R2 + 2α2 (T + 1)G2 D(D + 1) + (T + D + 1)α2 (D + 1)2 G2 2α(T + 1) 2 2 2 2 R + α D(D + 1) G + 3(T + 1)α2 (D + 1)2 G2 ≥ q∗ − . 2α(T + 1)
max q(λ(t)) ≥ q ∗ −
t=0,...,T
Remark. The bound in Theorem 4.10 is conservative, since the main quantity that is bounded is the step length at each step. Notice now, that with time-varying delays, D + 1 updates could arrive at one time instance, and hence, the maximal step length is D + 1 times larger than the maximal update. But when updates arrive simultaneously, then that reduces the maximal step length at some other time instance, which is not taken into consideration in this proof.
4.5.3
Noisy Communication Channels
In the final analysis in this section, we evaluate the Constant-Delays Algorithm on the noisy subgradient model. The noisy subgradients model (4.13) states that the subgradients g(t) are affected by an additive noise vector ε(t). Thus, let us modify the update rule (4.20), used by the Constant-Delays Algorithm, to use the noisy subgradients. We have ¯ + 1) = λ(t) ¯ + αt (g(t) + ε(t)) . λ(t
(4.30)
Lemma 4.11. The following is true under the Bounded Subgradients Assumption 4.2 and the Bounded Noise Assumption 4.9, ¯ ¯ ≤ αt (G + E) λ(t + 1) − λ(t) 2
56
4.5. CONVERGENCE ANALYSIS
Proof. By the definition of the update rule (4.30), we have ¯ ¯ = αt ||g(t) + ε(t)|| ≤ αt (||g(t)|| + ||ε(t)|| ) . λ(t + 1) − λ(t) 2 2 2 2
Thus, by the Bounded Subgradients Assumption and the Bounded Noise Assumption, ¯ ¯ ≤ αt (G + E) . λ(t + 1) − λ(t) 2
¯ Further, the bound on the λ(t) and λ(t) in Lemma 4.3 de difference between ¯ ¯ pended on the step lengths λ(t + 1) − λ(t) . Thus, the following Lemma is a 2 generalization of Lemma 4.3. Lemma 4.12. The following is true under the Bounded Subgradients Assumption 4.2, Bounded Time-Delays Assumption 4.6 and the Bounded Noise Assumption 4.9, t−1 X ¯ αd . λ(t) − λ(t) ≤ (G + E) 2
d=t−D
Proof. The proof is essentially the same as for Lemma 4.3. We have D−1 X ¯ ¯ − i) − λ(t ¯ − i − 1) . λ(t) − λ(t) ≤ λ(t 2
2
i=0
Using Lemma 4.11 yields t−1 X ¯ αd . λ(t) − λ(t) ≤ (G + E) 2
d=t−D
Similarly, the following lemma is an extension of Lemma 4.4. Lemma 4.13. The following is true under the Bounded Subgradients Assumption 4.2, Bounded Time-Delays Assumption 4.6 and the Bounded Noise Assumption 4.9,
t−1 X
¯ − λ∗ ≤ G(G + E) (g(t) + ε(t))T λ(t)
¯ αd + q(λ(t)) − q ∗ + E λ(t) − λ∗
d=t−D
Proof. By the Cauchy–Schwarz inequality,
¯ − λ∗ ≤ ||g(t)|| · λ(t) ¯ − λ(t) (g(t) + ε(t))T λ(t) 2 T
∗
2
¯ + g(t) (λ(t) − λ ) + ||ε(t)||2 · λ(t) − λ ∗ . 2
57
2
CHAPTER 4. DUAL DECOMPOSITION ALGORITHM
Recall that g(t) is a subgradient to q at λ(t), thus g(t)T (λ(t) − λ∗ ) ≤ q(λ(t)) − q ∗ . Using the Bounded Subgradients Assumption, Bounded Noise Assumption and Lemma 4.12 yields
t−1 X
¯ − λ∗ ≤ G(G + E) (g(t) + ε(t))T λ(t)
¯ − λ ∗ αd + q(λ(t)) − q ∗ + E λ(t)
d=t−D
2
The convergence rate theorem for the Constant-Delays Algorithm can now be extended to handle the noisy subgradients model (4.30). However, the class of dual objective functions q needs to be restricted to either the Compact Constraint Set Assumption, or the Sharp Maxima Assumption. Theorem 4.14. Let the Existence of Minimizer Assumption 4.1, Bounded Subgradients Assumption 4.2, Bounded Initial Distance Assumption 4.3, Bounded TimeDelays Assumption 4.6, Bounded Noise Assumption 4.9 and the Compact Constraint Set Assumption 4.10 hold. Then, the maximum value satisfies the following bound, max q(λ(t)) ≥ q ∗ −
R2 +
PT
t=0
αt2 (G + E)2 + 2αt G(G + E) 2
t=0,...,T
Pt−1
d=t−D αd + 2αt EL
PT
t=0 αt
(4.31) ¯ Proof. Let λ∗ be any optimal point to q, and let λ(t) be given by the iterations (4.30). Consider the following relation 2
2
¯ ¯ 0 ≤ λ(T + 1) − λ∗ = λ(T ) + αT (g(T ) + ε(T )) − λ∗ 2
2
2 ¯ ¯ ) − λ∗ + α2 ||g(T ) + ε(T )||2 . = λ(T ) − λ∗ + 2αT (g(T ) + ε(T ))T λ(T T 2 2
By Lemma 4.13, 2 2 ¯ ¯ ) − λ∗ + αT2 ||g(T ) + ε(T )||22 λ(T + 1) − λ∗ ≤ λ(T 2 2 TX −1 ¯ + 2αT G(G + E) αd + q(λ(T )) − q ∗ + E λ(T ) − λ ∗ . 2
d=T −D
Because λ∗ is an arbitrary optimal point, this especially holds for
¯ λ∗ = arg min λ(T ) − λ , 2
λ∈Λopt
58
.
4.5. CONVERGENCE ANALYSIS
¯ ¯ ), Λopt . Thus, we have the inequality ) − λ∗ is equivalent to d λ(T hence, λ(T 2
¯ + 1), Λopt d λ(T
2
2
¯ ¯ ), Λopt + 1) − λ∗ ≤ d λ(T ≤ λ(T
2
2
TX −1
+ 2αT G(G + E)
+ αT2 ||g(T ) + ε(T )||22
¯ ), Λopt . (4.32) αd + q(λ(T )) − q ∗ + Ed λ(T
d=T −D
This is now a recursive relation in terms of the distance to the optimal set Λopt . Expanding this until T = 0 gives us
¯ 0 ≤ d λ(0), Λopt
2
+
T X
αt2 ||g(t) + ε(t)||22
t=0
+2
T X
αt G(G + E)
t=0
t−1 X
¯ αd + q(λ(t)) − q ∗ + Ed λ(t), Λopt .
d=t−D
Using Lemma 4.1, the Bounded Subgradients Assumption, the Bounded Noise Assumption and the Compact Constraint Set Assumption gives us 0 ≤ R2 +
T X
αt2 (G + E)2 + 2
T X
αt G(G + E)
t=0
t=0
t−1 X
αd + q(λ(t)) − q ∗ + EL .
d=t−D
Thus, max q(λ(t)) ≥ q ∗ −
R2 +
PT
t=0
αt2 (G + E)2 + 2αt G(G + E) 2
t=0,...,T
Pt−1
d=t−D αd + 2αt EL
PT
t=0 αt
Corollary 4.15. For a constant step size αt = α, t = 0, 1, . . ., the result in Theorem 4.14 becomes max
t=0,...,T −1
q(λ(t)) ≥ q ∗ −
R2 + α2 T (G + E) (G + E + 2GD) + EL. 2αT
Finally, we consider the restriction to Lagrange dual functions q with a Sharp Maxima. Theorem 4.16. Let the Existence of Minimizer Assumption 4.1, Bounded Subgradients Assumption 4.2, Bounded Initial Distance Assumption 4.3, Bounded TimeDelays Assumption 4.6, Bounded Noise Assumption 4.9 and the Sharp Maxima Assumption 4.11 hold. If µ > E, then, the maximum value satisfies the following bound, max q(λ(t)) ≥ q ∗ −
t=0,...,T
R2 + (G + E)2
PT
t=0
1−
59
E µ
αt2 + 2αt
P
T t=0 2αt
Pt−1
d=t−D αd
.
(4.33)
.
CHAPTER 4. DUAL DECOMPOSITION ALGORITHM
¯ Proof. Let λ∗ be any optimal point to q, and let λ(t) be given by the iterations (4.30). Continue from expression (4.32) in the previous proof, and use the triangle inequality ¯ ) + d (λ(T ), Λopt ) . ¯ ), Λopt ≤ λ(T ) − λ(T d λ(T 2
Further, by the Sharp Maxima Assumption, we have for some constant µ > 0 ¯ ) − 1 (q(λ(T )) − q ∗ ) . ¯ ), Λopt ≤ λ(T ) − λ(T d λ(T 2 µ
Thus, by also using Lemma 4.12, we have
¯ + 1), Λopt d λ(T
2
2
TX −1
¯ ), Λopt ≤ d λ(T
+ αT2 ||g(T ) + ε(T )||22 + 2αT (q(λ(T )) − q ∗ )
TX −1
1 αd + E (G + E) αd − (q(λ(T )) − q ∗ ) + 2αT G(G + E) µ d=T −D d=T −D
¯ ), Λopt = d λ(T
2
+ αT2 ||g(T ) + ε(T )||22 + 2αT 1 −
E µ
(q(λ(T )) − q ∗ )
+ 2αT (G + E)2
TX −1
αd .
d=T −D
Assume that µ > E, and thus that 1 − until T = 0 yields
¯ 0 ≤ d λ(0), Λopt
2
+
T X
E µ
> 0. Expanding the recursive relation
T X
αt2 ||g(t) + ε(t)||22 +
t=0
t=0 T X
+
2αt 1 −
E µ
2αt (G + E)2
t=0
(q(λ(t)) − q ∗ ) t−1 X
αd .
d=t−D
Using Lemma 4.1, the Bounded Subgradients Assumption and the Bounded Noise Assumption, the lower bound on the maximal value can then be written as max q(λ(t)) ≥ q ∗ −
R2 + (G + E)2
t=0,...,T
PT
t=0
1−
E µ
αt2 + 2αt
Pt−1
d=t−D αd
P
T t=0 2αt
.
Corollary 4.17. For a constant step size αt = α, t = 0, 1, . . ., the result in Theorem 4.16 becomes max
t=0,...,T −1
q(λ(t)) ≥ q ∗ −
R2 + α2 T (G + E)2 (1 + 2D)
2αT 1 − 60
E µ
.
4.6. COMMUNICATION CONSIDERATIONS
4.6
Communication Considerations
In this section, we analyze the communication for the algorithms, and emphasize the different requirements. Especially, we define a measure of the communication, and use it in order to make a fair comparison.
4.6.1
Measuring Communication
Networks of wireless sensors are becoming increasingly popular [32]. Since each transmission drains battery of the deployed sensors, and hence decreases the lifetime of the sensor network, it is important to reduce the amount of communication. In fact, a wireless sensor node often expends most energy in data communications. Let us consider a network G(V, Et ) of N agents, interconnected by the edges Et at time step t. One metric that is commonly used is to count the number of transmissions in the network, however, in our case, the communication is synchronized among all agents, and virtually all edges are used at every time step. Thus, that metric would give us the cost |Et | at each step t, for all algorithms. Notice also that the dimension of the dual variables, λ, is much larger than the dimension of the primal variables, x, dim(λ) =
N N X X
dim(λij ) =
N N X X
dim(xi ) = N dim(x).
i=1 j=1
i=1 j=1
Thus, since the package sizes can vary a lot, a natural way to define the cost is as the number of bits transfered. Assuming that every real number is encoded in the same way, we define the cost c to transmit a vector x of real numbers over an edge as c(x) = dim(x). Also, we will count the cost of the communication per time step, assuming that both algorithms are restricted to the same topology Et .
4.6.2 Primal Consensus Algorithm
Recall that in the Primal Consensus Algorithm, each agent sends its estimate x^i(t) to all of its neighbors at each time step. Even though the Simultaneous Information Exchange Assumption 3.5 assumes that the graph is undirected, we still consider the graph as a directed graph, such that the number of edges, |E_t|, counts each edge in each direction. Recall further that all local estimates x^i have the same dimension, dim(x^i) = dim(x). Thus, at each time step t, one local estimate is transferred over each edge in E_t, yielding the total cost for the Primal Consensus Algorithm as
\[
|E_t| \dim(x). \tag{4.34}
\]
4.6.3 Dual Decomposition Algorithm
One of the main differences in the computational model between the Primal Consensus Algorithm and the Dual Decomposition Algorithm was the Routing Protocol Assumption 4.4. This increases the complexity of the algorithm, but, depending on the particular routing algorithm, can reduce the amount of communication. Because of this, the cost of the Dual Decomposition Algorithm is harder to analyze, since it depends on the routing algorithm as well as on the structure of the problem instance. For the Dual Decomposition Algorithm, the communication cost consists of two contributions: distributing the primal variables x_ii(t), so that the dual variables can be computed, and accumulating the dual variables λ_ji. Let us analyze these contributions separately. For simplicity, we assume that the network E = E_t is time-independent and strongly connected. Then, another way of calculating the communication cost for a single time instance is to calculate the total cost of delivering an estimate generated at one time instance, since an equal amount of communication is generated at each time step.

Primal variables. Since each primal variable x_ii(t) should be distributed to all other agents j ≠ i, this task is quite easy for the routing protocol. It can be accomplished by letting each node i flood the network with the primal variables: whenever an agent receives an estimate of x_ii that it has not received before, it simply retransmits it to all of its neighbors. With this simple routing protocol, each estimate x_ii(t) will be transmitted over each edge exactly once, thus the communication cost is
\[
\sum_{i=1}^{N} \dim(x_{ii}) |E| = \dim(x) |E|.
\]
However, a reasonable assumption is that each agent only receives an estimate x_ii(t) once, and not from all of its neighbors. In this case, each time the estimate is transferred, it is thus assumed that the recipient receives that estimate for the first time. Since there are only N agents, each estimate is only transferred N − 1 times. With a static, strongly connected network, each agent except i receives exactly one estimate of x_ii at each time step; hence, the communication cost is
\[
\sum_{i=1}^{N} \dim(x_{ii}) (N - 1) = (N - 1) \dim(x).
\]
Dual variables. In contrast to the primal variables, there are many more dual variables, but each dual variable λ_ij is computed at node j, and only needs to be transferred to node i. With the simplest possible routing algorithm, where each agent floods the network with the dual variables, each estimate λ_ij(t) will still travel across every edge; thus, the communication cost for the dual variables is
\[
\sum_{i=1}^{N} \sum_{j=1}^{N} \dim(\lambda_{ij}) |E| = \dim(\lambda) |E|.
\]
Let us now consider some more intelligent routing protocols. First, with the assumption that no agent receives the same estimate twice, we similarly have the communication cost
\[
\sum_{i=1}^{N} \sum_{j=1}^{N} \dim(\lambda_{ij}) (N - 1) = \dim(\lambda) (N - 1).
\]
The same bound on the communication cost is attained if we assume that the routing protocol is capable of sending each estimate on a shortest path from agent j to agent i, since the shortest path is bounded by N − 1 edges. Notice the very important property that the local objective function φ^i(λ) only depends on the dual variables λ_ji, j ∈ N, and the sum Σ_{j=1}^N λ_ij, not the individual values λ_ij, j ∈ N. Since all the dual variables λ_ji are calculated at agent i, it only needs to receive the sum of the dual variables Σ_{j=1}^N λ_ij. The general problem of calculating the distributed sum of some initial values is also discussed in Appendix A. We shall now consider some assumptions about the routing protocol that allow the agents to benefit from this property. Let us consider all dual variables directed towards a fixed agent i, i.e., λ_ij for all j. Assume that during one time instance, an agent k only transmits the dual variables directed towards agent i over a single edge. Thus, during a single time instance, an agent will only try to use a single path to reach a particular agent i. Further, assume that no cycles appear in the transmissions of the dual variables, i.e., a dual variable λ_ij(t) once transmitted by agent k never returns to agent k. Under these assumptions, an agent k can compute the sum of all dual variables λ_ij directed towards agent i, and transfer this sum instead of the individual dual variables. Further, these assumptions guarantee that agent i receives all dual variables, summed, exactly once. Notice that this procedure also works for the network with time-varying delays, where the updates µ are summed instead, provided that the assumptions are still satisfied. The cost of transporting the dual variables λ_ij to agent i is thus dim(λ_ij) for each agent except i, calculated per time step. The total cost of transporting the dual variables is then
\[
\sum_{i=1}^{N} \dim(\lambda_{ij}) (N - 1) = (N - 1) \sum_{i=1}^{N} \dim(x_i) = (N - 1) \dim(x).
\]
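As a concrete illustration of this aggregation, the following Python sketch (with hypothetical helper names, assuming a static, strongly connected graph and communication organized along a BFS tree towards agent i) sums the dual variables λ_ij leaf-to-root, so that agent i receives only the sum after N − 1 scalar transmissions:

from collections import deque

def bfs_tree_towards(root, neighbors):
    # Parent pointers of a BFS tree rooted at `root`; `neighbors` maps
    # each agent to its list of neighbors.
    parent, order = {root: None}, [root]
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v not in parent:
                parent[v] = u
                order.append(v)
                queue.append(v)
    return parent, order

def aggregate_duals(root, neighbors, lam):
    # lam[j] holds lambda_{root j} at agent j (j != root). Each agent
    # forwards a single partial sum to its parent, so agent `root`
    # obtains sum_j lambda_{root j} after N - 1 transmissions.
    parent, order = bfs_tree_towards(root, neighbors)
    partial = {u: lam.get(u, 0.0) for u in order}
    for u in reversed(order):                 # process children before parents
        if parent[u] is not None:
            partial[parent[u]] += partial[u]  # the one message sent by u
    return partial[root]

# Example: a star with centre 1; agents 2 and 3 hold lambda_{12}, lambda_{13}.
print(aggregate_duals(1, {1: [2, 3], 2: [1], 3: [1]}, {2: 0.5, 3: -0.2}))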
As we have just seen, the total communication cost for the Dual Decomposition Algorithm, when adding the cost for the primal and dual variables, is
\[
2(N - 1) \dim(x), \tag{4.35}
\]
but it is highly dependent upon the routing protocol. Notice that the communication cost is independent of the network topology. Also, for an undirected, connected graph, the number of edges |E_t| is at least N − 1. Thus, when considering the edges with direction, where each undirected edge is counted twice, it has at least 2(N − 1) edges. Thus, the communication cost for the Primal Consensus Algorithm is at least 2(N − 1) dim(x); see also the sketch at the end of this section.

The last communication consideration that we want to emphasize is related to the problem structure. Recall the optimization problem in equation (4.2), and that each local objective function can be written as f^i(x_1, x_2, …, x_N). Notice here that all objective functions f^i are assumed to depend on all of the optimization variables x_1, x_2, …, x_N, but in reality, many interesting distributed optimization problems have sparse structure in the dependencies. For example, in vehicle formations, it is common that a vehicle's decision only depends on its nearest neighbors, and not on all other vehicles. Assume that f^i does not depend upon x_j, i.e., f^i(x_1, …, x_{j−1}, x_{j+1}, …, x_N). Then the variable x_ij does not exist; hence, there is no need for the dual variable λ_ji, accounting for the difference between x_jj and x_ij. The implication for the communication is that the primal variable x_jj does not have to be sent to agent i, and the dual variable λ_ji does not have to be sent back to agent j, further reducing the communication cost for the dual decomposition algorithms. Notice that the routing protocol is required in order to take advantage of this problem structure.

Remark. In this analysis, the communication overhead necessary for the routing protocol has been neglected.
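To make the costs (4.34) and (4.35) concrete, the following Python sketch evaluates them for the four topologies simulated later in Chapter 5, with N = 15 agents and dim(x) = N (the edge counts are taken from those examples):

def primal_cost(directed_edges, dim_x):
    # Per-step cost (4.34) of the Primal Consensus Algorithm.
    return directed_edges * dim_x

def dual_cost(N, dim_x):
    # Per-step cost (4.35) of the Dual Decomposition Algorithm.
    return 2 * (N - 1) * dim_x

N = dim_x = 15                       # one scalar per agent (Section 5.1)
print(primal_cost(2 * 31, dim_x))    # connected graph (Sec. 5.3.1): 930
print(primal_cost(2 * 14, dim_x))    # line graph: 420
print(primal_cost(2 * 15, dim_x))    # ring graph: 450
print(primal_cost(2 * 105, dim_x))   # complete graph: 3150
print(dual_cost(N, dim_x))           # always 420, topology-independent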
4.7 Quadratic Objective Functions
An important class of convex objective functions are the quadratic functions, i.e., f^i(x) : R^n → R can be written as
\[
f^i(x) = x^T A_i x + b_i^T x + c_i,
\]
for some symmetric, positive definite, n × n matrix A_i ≻ 0, n-dimensional vector b_i and scalar c_i. The corresponding subgradient g^i(x) can then be written as
\[
g^i(x) = 2 A_i x + b_i.
\]
Notice that the subgradient does not satisfy the Bounded Subgradients Assumption unless the optimization variables x are constrained to a bounded set X, since ||g^i(x)||_2 → ∞ as ||x||_2 → ∞.
Consider the corresponding Lagrange dual function φ^i(λ),
\[
\varphi^i(\lambda) = \inf_{x \in X} \left( x^T A_i x + b_i^T x + c_i + \sum_{j=1}^{N} \lambda_{ij}^T x_i - \sum_{j=1}^{N} \lambda_{ji}^T x_j \right).
\]
Define the local concatenated dual variable λ^i as
\[
\lambda^i = \begin{pmatrix} -\lambda_{1i} \\ \vdots \\ -\lambda_{(i-1)i} \\ \sum_{j=1}^{N} \lambda_{ij} \\ -\lambda_{(i+1)i} \\ \vdots \\ -\lambda_{Ni} \end{pmatrix},
\]
then, the Lagrange dual function can be written as
\[
\varphi^i(\lambda) = \inf_{x \in X} \left( x^T A_i x + (b_i + \lambda^i)^T x + c_i \right).
\]
The optimal solution to this local subproblem is given by
\[
x_i(t) = -\frac{1}{2} A_i^{-1} \left( b_i + \lambda^i(t) \right),
\]
provided that the solution is feasible, i.e., x_i(t) ∈ X. Observe that the minimizing x_i(t) of φ^i is a linear function of the dual variables λ^i(t), and further, the dual variables are updated with the linear relation
\[
\lambda_{ij}(t + d_{ij} + 1) = \lambda_{ij}(t + d_{ij}) + \alpha_t \left( x_{ii}(t) - x_{ji}(t) \right).
\]
Thus, the state evolution for the Dual Decomposition Algorithm is a linear system.
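A minimal Python/numpy sketch of one synchronous round of these two updates (hypothetical variable names; the delays and the feasibility check x_i(t) ∈ X are omitted for brevity):

import numpy as np

def local_solve(A_i, b_i, lam_i):
    # Unconstrained minimizer of x^T A_i x + (b_i + lam_i)^T x + c_i,
    # i.e. x = -1/2 A_i^{-1} (b_i + lam_i).
    return -0.5 * np.linalg.solve(A_i, b_i + lam_i)

def dual_update(lam_ij, alpha, x_ii, x_ji):
    # Subgradient step on the dual variable lambda_ij.
    return lam_ij + alpha * (x_ii - x_ji)

# Tiny example with a diagonal A_i:
A = np.diag([2.0, 1.0])
b = np.array([1.0, -1.0])
lam = np.zeros(2)
x = local_solve(A, b, lam)               # [-0.25, 0.5]
lam = dual_update(lam, 0.05, x, np.zeros(2))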
4.7.1 Simple Quadratic Example
Consider the simplified optimization problem
\[
\min_{x \in \mathbb{R}} \; f^1(x) + f^2(x),
\]
where f^1 and f^2 are quadratic functions, f^i(x) = a_i x^2 + b_i x + c_i, with a_i > 0. Introduce two local estimates of the primal variable, x_1 and x_2, and let λ be the dual variable corresponding to the constraint x_1 = x_2. Assume that the round-trip time between the two nodes is d; thus, the Constant-Delays Algorithm is given by Algorithm 4.5.
Algorithm 4.5: Simple Quadratic Example Algorithm.
Input: Initial estimate λ(0).
Output: Estimate of λ at time T.
1 for t = 0 to T − 1 do
      // Agent 1 solves φ^1(λ(t)).
2     x_1(t) = −(1/(2a_1)) (b_1 + λ(t)).
      // Agent 2 solves φ^2(λ(t)).
3     x_2(t) = −(1/(2a_2)) (b_2 − λ(t)).
4     Agent 1 sends x_1(t) to agent 2.
      // Agent 2 updates the dual variable λ.
5     λ(t + d + 1) = λ(t + d) + α (x_1(t) − x_2(t)).
6     Agent 2 sends λ(t + d + 1) to agent 1.
7 end
Figure 4.4: Communication graph for the simple quadratic example. Agent 1 sends x_1(t) to agent 2, and agent 2 sends λ(t + d + 1) back to agent 1.
Let us write the state evolution as a linear expression. Define the lifted state vector z(t) as
\[
z(t) = \begin{pmatrix} x_1(t) \\ x_2(t) \\ \lambda(t+d) \\ \lambda(t+d-1) \\ \lambda(t+d-2) \\ \vdots \\ \lambda(t+1) \\ \lambda(t) \end{pmatrix}.
\]
The state evolution can then be written as z(t + 1) = Az(t) + C, with
\[
A = \begin{pmatrix}
0 & 0 & 0 & 0 & \cdots & 0 & -\frac{1}{2a_1} \\
0 & 0 & 0 & 0 & \cdots & 0 & \frac{1}{2a_2} \\
\alpha & -\alpha & 1 & 0 & \cdots & 0 & 0 \\
0 & 0 & 1 & 0 & \cdots & 0 & 0 \\
0 & 0 & 0 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & 1 & 0
\end{pmatrix},
\qquad
C = \begin{pmatrix} -\frac{b_1}{2a_1} \\ -\frac{b_2}{2a_2} \\ 0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}.
\]
This system converges if and only if the spectral radius of the dynamics matrix A is less than one. The characteristic equation of A can be expressed, by using the Laplace expansion [33] for the first three rows of the determinant det(A − ξI) = 0, as
\[
\xi^{d+2} (1 - \xi) + \xi \alpha \left( \frac{1}{2a_1} + \frac{1}{2a_2} \right) = 0.
\]
Notice that the eigenvalues, and hence the spectral radius, only depend on the delay d and the expression α(1/(2a_1) + 1/(2a_2)). In Fig. 4.5 we notice that, given the constants a_1 and a_2, the step size α can be chosen small enough such that the spectral radius is less than one, thus making the system convergent.

Remark. This result is not implied by the theorems in Section 4.5, since the subgradients are not bounded when the cost functions are quadratic.
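The stability region in Fig. 4.5 can also be checked numerically. The sketch below (Python/numpy) builds the lifted matrix A above for a given delay d and computes its spectral radius; it is only a sanity check of the construction, not part of the thesis' toolchain:

import numpy as np

def lifted_matrix(alpha, a1, a2, d):
    # Dynamics matrix for z = (x1, x2, lambda(t+d), ..., lambda(t)).
    n = d + 3
    A = np.zeros((n, n))
    A[0, n - 1] = -1.0 / (2 * a1)   # x1(t+1) computed from lambda(t)
    A[1, n - 1] = 1.0 / (2 * a2)    # x2(t+1) computed from lambda(t)
    A[2, 0], A[2, 1], A[2, 2] = alpha, -alpha, 1.0   # dual update row
    for k in range(3, n):
        A[k, k - 1] = 1.0           # shift the stored dual variables
    return A

def spectral_radius(M):
    return np.abs(np.linalg.eigvals(M)).max()

# The system converges iff the spectral radius is below one, and the
# stable range of alpha shrinks as the delay d grows:
for alpha in (0.05, 0.2, 0.35):
    print(alpha, spectral_radius(lifted_matrix(alpha, a1=1.0, a2=1.0, d=5)))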
Figure 4.5: The spectral radius of the dynamics matrix, as a function of the expression α(1/(2a_1) + 1/(2a_2)), and for delays d = 0 through 10.
Chapter 5
Numerical Results

"Argument is conclusive... but... it does not remove doubt, so that the mind may rest in the sure knowledge of the truth, unless it finds it by the method of experiment. For if any man who never saw fire proved by satisfactory arguments that fire burns, his hearer's mind would never be satisfied, nor would he avoid the fire until he put his hand in it that he might learn by experiment what argument taught."
Roger Bacon, 13th century.
To compare the Primal Consensus Algorithm and the Dual Decomposition Algorithm, extensive simulations have been performed. The simulations were carried out in the numerical computing environment MATLAB [34], together with the optimization toolboxes YALMIP [35] and SeDuMi [36].
5.1 Common Model
In order to compare the two algorithms, the convergence rate should be compared for the same optimization problem. Thus, we want to consider a class of optimization problems that satisfies both the assumptions 3.1–3.9, for the Primal Consensus Algorithm, and the assumptions 4.1–4.8, for the Dual Decomposition Algorithm. We consider a network (V, E) consisting of N agents. For simplicity, we limit ourselves to networks with time-invariant topology, i.e., the edge set E is constant throughout the simulation. It is assumed that the network is strongly connected, and further, to accommodate the Primal Consensus Algorithm, we require the network to be undirected. For the Dual Decomposition Algorithm, we define the delay δ_ij between agent i and agent j as dist(i, j) − 1 in the communication graph. This means that neighbors can communicate directly with each other, but it takes one extra time step for each intermediate agent on the shortest path between two nodes. In particular, this means that the bound on the round-trip times, D, is less than or equal to 2(N − 2). The weight matrix, used by the Primal Consensus Algorithm, is determined from the planned weights P_{j,i} according to equation (3.3), and the planned weight
P_{j,i} is 1/(|N_i| + 1) if (j, i) ∈ E and 0 otherwise. The local objective functions used for the simulations are quadratic functions, restricted to a compact convex set X. The reason for using this class of objective functions is primarily that they are easy to work with, and efficient to compute, while still being non-trivial. Thus, the local objective function f^i(x) can be written as
\[
f^i(x) = x^T A_i x + b_i^T x + c_i.
\]
This function is convex if and only if the matrix A_i is positive semidefinite, i.e., all eigenvalues are non-negative. The parameters A_i, b_i and c_i are chosen randomly: b_i and c_i are drawn from a normal distribution, and A_i is generated as a random symmetric matrix, with eigenvalues drawn from a uniform distribution on the interval [0, U], for some positive number U. In order to have a uniform upper bound on the subgradients, the optimization variables are restricted to a compact convex set X, where X is determined by
\[
x \in X \iff a \le x_i \le b \quad \forall i,
\]
for some constants a and b. Further, the initial values (x(0) for the Primal Consensus Algorithm, and λ(0) for the Dual Decomposition Algorithm) are all set to zero. Finally, we assume that each node's state is a single scalar, dim(x_i) = 1; hence the entire state dimension is dim(x) = N.
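A random problem instance of this kind can be generated, for example, along the following lines (Python/numpy sketch; the ceiling U, the box bounds and the seed are arbitrary assumptions in the spirit of the description above):

import numpy as np

rng = np.random.default_rng(0)

def random_quadratic(n, U):
    # f(x) = x^T A x + b^T x + c, with A symmetric and eigenvalues
    # drawn uniformly from [0, U], so that f is convex.
    Q, _ = np.linalg.qr(rng.normal(size=(n, n)))     # random orthogonal basis
    A = Q @ np.diag(rng.uniform(0.0, U, size=n)) @ Q.T
    return A, rng.normal(size=n), rng.normal()

N = 15                                # dim(x) = N, one scalar per agent
problems = [random_quadratic(N, U=1.0) for _ in range(N)]
a_box, b_box = -10.0, 10.0            # X = {x : a <= x_i <= b for all i}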
5.2 Evaluating the Convergence Rate
In order to compare the convergence rates, we must define how each algorithm is evaluated at each time step. Consider first the Primal Consensus Algorithm, and recall that each agent has its own estimate of the state vector x^i(t). Similar to the convergence proof, we evaluate the algorithm on the average estimate
\[
x_{\mathrm{avg}}(t) = \frac{1}{N} \sum_{i=1}^{N} x^i(t).
\]
Thus, the function value at time step t is evaluated as
\[
f(t) = \sum_{i=1}^{N} f^i\left(x_{\mathrm{avg}}(t)\right).
\]
As a last step, we consider the best value up until time t as
\[
f_{\mathrm{best}}(t) = \min_{i=0,\dots,t} f(i).
\]
The Dual Decomposition Algorithm, on the other hand, has two different functions that can be evaluated: both the primal objective function f, and the Lagrange
dual function q. Since there is only one estimate of the dual variables λ(t) at time t, it is easy to evaluate the Lagrange dual function,
\[
q(t) = \sum_{i=1}^{N} \varphi^i(\lambda(t)).
\]
Recall that the primal problem is a minimization problem, and the dual problem is a maximization problem; thus, it is easier to compare the primal solutions of the two algorithms against each other. Each agent can compute its own estimate of the primal variables x^i(t) from the dual variables λ(t); thus, we can evaluate the primal problem f(t) on the average of the agents' local estimates x_avg(t) = (1/N) Σ_{i=1}^N x^i(t), similar to the Primal Consensus Algorithm. We also define the best values for the dual decomposition problem as
\[
f_{\mathrm{best}}(t) = \min_{i=0,\dots,t} f(i), \qquad q_{\mathrm{best}}(t) = \max_{i=0,\dots,t} q(i).
\]
Another reason to compare the two algorithms through the primal variables is that these variables can have an important interpretation for the application. For example, the primal variables could represent the position of a vehicle, while the dual variables represent the cost of moving in a certain direction.

Both algorithms' convergence rates depend on the step size rule, and this must be taken into account to produce a fair comparison. We limit both algorithms to a constant step size α_t = α, and then try to find an optimal, constant, step size for each algorithm. Thus, for a given time interval 0, …, T, we try to find the best step size α for each algorithm, such that f_best(T) is minimal. Fortunately, this problem is almost convex in the step size (Fig. 5.2, 5.5, 5.8, 5.11); thus, the optimal step size can be found numerically.
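The bookkeeping of f_best and the numerical step-size search can be sketched as follows (Python; run_algorithm is a hypothetical stand-in for either algorithm, returning the evaluated values f(0), …, f(T) for a given constant step size):

import numpy as np

def f_best_trace(f_values):
    # f_best(t) = min over i <= t of f(i), the running best value.
    return np.minimum.accumulate(np.asarray(f_values))

def best_constant_step(run_algorithm, alphas, T=1000):
    # Grid search for the constant step size minimizing f_best(T);
    # feasible because f_best(T) is almost convex in alpha.
    scores = [f_best_trace(run_algorithm(alpha, T))[-1] for alpha in alphas]
    return alphas[int(np.argmin(scores))]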
5.3 Comparing Primal and Dual Algorithms
In this section, we compare the convergence rates of the Primal Consensus Algorithm 3.1 and the Dual Decomposition Algorithm with constant delays 4.3.
5.3.1 Connected Graph
We start with a randomly chosen, strongly connected, network consisting of 15 agents (Fig. 5.1). The model is otherwise chosen according to the Common Model stated above. Notice that the network consists of 31 undirected edges, and hence, counted with direction, it has 62 directed edges. Further, the diameter of the
Figure 5.1: The connected communication graph for the first comparison between the Primal Consensus Algorithm, and the Dual Decomposition Algorithm. The network consists of 15 agents, and 31 undirected edges.
network is 4; thus, the maximal delay between any two agents is 3. Hence, the upper bound on the round-trip time is D = 6. Before comparing the convergence rates of the two algorithms, the optimal step sizes have to be found. We evaluate the best function value f_best(T) at time step T = 1000, for different step sizes (Fig. 5.2), and then choose the optimal step size. The Primal Consensus Algorithm uses the step size α = 0.00069, and the Dual Decomposition Algorithm uses the step size α = 0.0568. The convergence rate for the best function value f_best(t) is evaluated on the time interval t = 0, …, 2000 (Fig. 5.3, Table 5.1), with the optimal step sizes chosen at time step t = 1000. Notice that the Primal Consensus Algorithm reaches its neighborhood around the optimal value at time step 1000, and does not significantly improve the best function value during the next 1000 steps. This is the tradeoff in the step size: a smaller step size would yield a slower convergence, but to a smaller neighborhood, thus giving a worse function value at time step 1000, but a better solution at time step 2000. The Dual Decomposition Algorithm, on the other hand, is close to the optimal value already after 1000 steps, and has essentially reached the optimal value after 2000 steps. Finally, compare the communication used by each algorithm. The communication per time step for the Primal Consensus Algorithm is analyzed in equation (4.34), |E| dim(x). With 62 directed edges, and dim(x) = N = 15, the communication
Figure 5.2: The best function value f_best(T) for both the Primal Consensus Algorithm and the Dual Decomposition Algorithm, as a function of the step sizes. The function value is evaluated at time step T = 1000, and the optimal step sizes are α = 0.00069 for the Primal Consensus Algorithm, and α = 0.0568 for the Dual Decomposition Algorithm.
Optimal value   f*              -4625.84
Primal value    f_best(1000)    -4625.00
Primal value    f_best(2000)    -4625.05
Dual value      f_best(1000)    -4625.72
Dual value      f_best(2000)    -4625.84

Table 5.1: The best function values for the Primal Consensus Algorithm and the Dual Decomposition Algorithm, evaluated at time steps 1000 and 2000.
cost for the Primal Consensus Algorithm is 930 per step. Recall also that the communication cost for the Dual Decomposition Algorithm is independent of the network topology, and given by equation (4.35), 2(N − 1) dim(x). With dim(x) = N = 15, this equals 420 per step for the Dual Decomposition Algorithm. Thus, for this problem instance, the Dual Decomposition Algorithm converges faster, to a smaller neighborhood around the optimal value, and with less communication than the Primal Consensus Algorithm.
Figure 5.3: Simulating the convergence rate for f_best(t) on the time interval t = 0, …, 2000, with the optimal step sizes chosen at time step t = 1000. The Primal Consensus Algorithm has reached a neighborhood around the optimal value, while the Dual Decomposition Algorithm converges to the optimal value.
5.3.2 Line Graph
In this section, we consider a network with a special line topology, where all the agents are distributed on a line (Fig. 5.4). The network consists of 15 agents, and they are connected by only 14 undirected edges. An important property of the family of line graphs is that they have the largest diameter among all connected graphs. A line graph with N agents is a connected graph with N − 1 undirected edges, and also diameter N − 1. Thus, the line graph in Fig. 5.4 has diameter 14, and hence the upper bound on the round-trip time is D = 26. The optimization problem is again chosen as described in the Common Model, and the optimal step sizes are determined by evaluating the best function value f_best(T) at time step T = 1000 (Fig. 5.5). The Primal Consensus Algorithm uses the step size α = 0.00042, while the Dual Decomposition Algorithm uses the step size α = 0.0290.
Figure 5.4: The connected communication graph for the second comparison between the Primal Consensus Algorithm, and the Dual Decomposition Algorithm. The network consists of 15 agents, distributed on a single line with 14 undirected edges.
With the optimal step sizes computed, the convergence rate for the best function value f_best(t) is evaluated on the time interval t = 0, …, 2000 (Fig. 5.6, Table 5.2).

Optimal value   f*              -6296.28
Primal value    f_best(1000)    -6290.65
Primal value    f_best(2000)    -6291.17
Dual value      f_best(1000)    -6272.35
Dual value      f_best(2000)    -6272.35

Table 5.2: The best function values for the Primal Consensus Algorithm and the Dual Decomposition Algorithm, evaluated at time steps 1000 and 2000.

The results here are almost the opposite of the previous Connected Graph example. The Primal Consensus Algorithm converges to a value much closer to the optimum, compared to the Dual Decomposition Algorithm. The Dual Decomposition Algorithm, on the other hand, reaches a neighborhood around the optimal value before step 1000, and does not improve after that. Again, this depends on the step sizes used, so a smaller step size would give a worse value after 1000 steps, but a better value after 2000 steps. Notice the characteristic staircase appearance in the graph; it is caused by oscillations in the underlying function values f(t), which are due to the delays in the network. The communication cost for the Primal Consensus Algorithm is proportional to the number of edges in the network, and the line graph (together with all tree graphs) belongs to the class of undirected connected graphs with the fewest number of edges. For a tree, the number of undirected edges in a network with N agents is |E| = N − 1; thus the communication cost for our undirected graph is |E| dim(x) = 2(N − 1)N = 420 per step, the exact same cost as for the Dual Decomposition Algorithm. Thus, for this problem instance on the line graph, the Primal Consensus
Figure 5.5: The best function value f_best(T) for both the Primal Consensus Algorithm and the Dual Decomposition Algorithm, as a function of the step sizes. The function value is evaluated at time step T = 1000, and the optimal step sizes are α = 0.00042 for the Primal Consensus Algorithm, and α = 0.0290 for the Dual Decomposition Algorithm.
Figure 5.6: Simulating the convergence rate for f_best(t) on the time interval t = 0, …, 2000, with the optimal step sizes chosen at time step t = 1000. The Primal Consensus Algorithm attains a lower value than the Dual Decomposition Algorithm. The staircase characteristic of the Dual Decomposition Algorithm is typical of oscillations in the function value f(t), caused by the delays.
Algorithm converges to a smaller neighborhood around the optimal value, and with the same communication cost as the Dual Decomposition Algorithm.
5.3.3 Ring Graph
In this section, we consider a network with ring topology, where all agents are distributed on a single cycle (Fig. 5.7). The network consists of 15 agents, and they are connected by 15 undirected edges. Notice that this graph topology can be attained from the line graph by adding only a single edge to the network; however, this single edge causes the diameter of the graph to shrink from 14 to 7, and the upper bound on the round-trip time is D = 12 instead of 26. The optimization problem is again chosen as described in the Common Model,
Figure 5.7: The connected communication graph for the third comparison between the Primal Consensus Algorithm, and the Dual Decomposition Algorithm. The network consists of 15 agents, distributed on a single cycle with 15 undirected edges.
and the optimal step sizes are determined by evaluating the best function value f_best(T) at time step T = 1000 (Fig. 5.8). The Primal Consensus Algorithm uses the step size α = 0.000498, while the Dual Decomposition Algorithm uses the step size α = 0.0407. With the optimal step sizes computed, the convergence rate for the best function value f_best(t) is evaluated on the time interval t = 0, …, 2000 (Fig. 5.9, Table 5.3).

Optimal value   f*              -7649.96
Primal value    f_best(1000)    -7647.01
Primal value    f_best(2000)    -7647.17
Dual value      f_best(1000)    -7649.46
Dual value      f_best(2000)    -7649.96

Table 5.3: The best function values for the Primal Consensus Algorithm and the Dual Decomposition Algorithm, evaluated at time steps 1000 and 2000.

These results are similar to those of the Connected Graph, where the Primal Consensus Algorithm reaches a neighborhood around the optimal value before step 1000, while the Dual Decomposition Algorithm converges to the optimal value. With only one more edge than the line graph, the ring graph has 15
Figure 5.8: The best function value f_best(T) for both the Primal Consensus Algorithm and the Dual Decomposition Algorithm, as a function of the step sizes. The function value is evaluated at time step T = 1000, and the optimal step sizes are α = 0.000498 for the Primal Consensus Algorithm, and α = 0.0407 for the Dual Decomposition Algorithm.
undirected edges, and thus 30 directed edges. The communication cost for the Primal Consensus Algorithm is 450 per step, compared to 420 per step for the Dual Decomposition Algorithm.
Figure 5.9: Simulating the convergence rate for f_best(t) on the time interval t = 0, …, 2000, with the optimal step sizes chosen at time step t = 1000. The Primal Consensus Algorithm reaches a neighborhood around the optimal value, while the Dual Decomposition Algorithm converges to the optimal value.
5.3.4 Complete Graph
In this section, we consider a fully connected network, where every agent is adjacent to every other agent (Fig. 5.10). The network consists of 15 agents, and they are connected by 105 undirected edges. The upper bound on the round-trip time is D = 0, and hence, the Dual Decomposition Algorithm reduces to a plain subgradient method on the dual variables. The optimization problem is again chosen as described in the Common Model, and the optimal step sizes are determined by evaluating the best function value f_best(T) at time step T = 1000 (Fig. 5.11). The Primal Consensus Algorithm uses the step size α = 0.000698, while the Dual Decomposition Algorithm uses the step size α = 0.1244. With the optimal step sizes computed, the convergence rate for the best function value f_best(t) is evaluated on the time interval t = 0, …, 2000 (Fig. 5.12, Table 5.4).
Figure 5.10: The completely connected communication graph for the fourth comparison between the Primal Consensus Algorithm, and the Dual Decomposition Algorithm. The network consists of 15 agents, where every pair of agents is adjacent.

Optimal value   f*              -5582.53
Primal value    f_best(1000)    -5582.42
Primal value    f_best(2000)    -5582.42
Dual value      f_best(1000)    -5582.53
Dual value      f_best(2000)    -5582.53

Table 5.4: The best function values for the Primal Consensus Algorithm and the Dual Decomposition Algorithm, evaluated at time steps 1000 and 2000.
Again, these results are similar to those of the Connected Graph and the Ring Graph, where the Primal Consensus Algorithm reaches a neighborhood around the optimal value before step 1000, while the Dual Decomposition Algorithm converges to the optimal value. Since every pair of agents is connected, this graph has the highest possible number of edges; thus the communication cost for the Primal Consensus Algorithm is also the highest for this graph. The cost for the Primal Consensus Algorithm is 3150 per step, while for the Dual Decomposition Algorithm the cost is 420 per step. While the line graph had the highest possible delay, and also the lowest possible communication cost for the Primal Consensus Algorithm, the complete graph is the
Figure 5.11: The best function value f_best(T) for both the Primal Consensus Algorithm and the Dual Decomposition Algorithm, as a function of the step sizes. The function value is evaluated at time step T = 1000, and the optimal step sizes are α = 0.000698 for the Primal Consensus Algorithm, and α = 0.1244 for the Dual Decomposition Algorithm.
opposite, with the lowest delay and highest communication cost for the Primal Consensus Algorithm. Thus, as we would expect from the previous examples, the Dual Decomposition Algorithm once again converges faster, to a smaller neighborhood around the optimal value, and with a much smaller communication cost than the Primal Consensus Algorithm.
Figure 5.12: Simulating the convergence rate for f_best(t) on the time interval t = 0, …, 2000, with the optimal step sizes chosen at time step t = 1000. The Primal Consensus Algorithm reaches a neighborhood around the optimal value, while the Dual Decomposition Algorithm converges to the optimal value.
5.4 Dual Decomposition Algorithm
In this section, we examine some properties of the Dual Decomposition Algorithm with numerical simulations. In particular, we focus on the restriction of the primal variables to a compact set X, on noisy subgradients, and finally, we compare the Halted Decentralized Algorithm 4.2 against the Constant-Delays Distributed Algorithm 4.3.
5.4.1 Without Bounded Subgradients
The Bounded Subgradients Assumption 4.2, for the Dual Decomposition Algorithm, requires that the subgradients g(t) ∈ ∂q(λ(t)) are uniformly bounded by G. This property is satisfied in particular if the primal variables are constrained to a compact set X, as assumed in the Common Model (Section 5.1). However, this implies that
we can assert that the sought optimal solution is confined to the compact set, and thus, the convergence rate (4.25) depends on the size of the compact set X. It would be interesting to weaken the Bounded Subgradients Assumption, and in this section we compare the Constant-Delays Algorithm with and without the Bounded Subgradients Assumption. The following simulations are conducted on the same model used by the Connected Graph (Section 5.3.1), but while the bounded-subgradients version has the primal variables confined to a compact set X, the unbounded version lets the primal variables take any value in R^n.
Figure 5.13: The objective function value for the Dual Decomposition Algorithm, with and without the bounded subgradients assumption. When a small step size is used (α = 0.04), both versions converge towards the optimal value, and they are indistinguishable from each other.

In this simulation we are not primarily interested in the optimal value, but instead in the underlying behavior of the algorithm. Therefore, we evaluate the function value f(t) at every time step, instead of taking the minimum f_best(t) = min_{i=0,…,t} f(i). First, with a small step size such as α = 0.04, both versions converge steadily
towards the optimal value, and they are almost indistinguishable from each other (Fig. 5.13). Recall that the optimal step size for the bounded Dual Decomposition Algorithm was α = 0.0568, when evaluated after 1000 steps. With this step size, the function value starts to oscillate, but both versions still converge towards the optimal value, and are also indistinguishable from each other (Fig. 5.14).
Figure 5.14: The objective function value for the Dual Decomposition Algorithm, with and without the bounded subgradients assumption. When the optimal step size, evaluated after 1000 steps, is used (α = 0.0568), both versions still converge towards the optimal value, and they are indistinguishable from each other. However, the function value has started to oscillate, but it is still stable.

Further increasing the step size will cause the oscillations to grow, and this is also when we start to see a significant difference between the bounded and unbounded versions of the algorithm (Fig. 5.15). From the convergence analysis of the Dual Decomposition Algorithm, and in particular equation (4.25), we know that the bounded version will converge to a neighborhood around the optimal solution, where the size of the neighborhood depends on the bound on the subgradients.
However, if we let the bound on the subgradients tend towards infinity, then the convergence results do not provide any bound on the function value, and this is exactly what the figure shows. The bounded version converges to a neighborhood around the optimal solution, while the unbounded version diverges.
Figure 5.15: The objective function value for the Dual Decomposition Algorithm, with and without the bounded subgradients assumption. When a large step size is used (α = 0.061), the difference between the two versions becomes apparent. While both oscillate heavily, the unbounded version diverges, while the bounded version stays in a neighborhood around the optimal solution.
5.4.2 Noisy Subgradients
In Section 4.5.3, the Constant-Delays Algorithm was analyzed when the subgradients were subject to noisy communication channels. Thus, the dual variables are updated according to equation (4.30), where ε(t) is a stochastic noise vector, uniformly bounded by the constant E, i.e., ||ε(t)||_2 ≤ E.
The convergence was proved under two different assumptions: either that the dual variables are constrained to a compact set, or that the Lagrange dual function has a sharp maximum. Neither of these conditions holds in general for the quadratic functions used in the simulations, but we can still explore the behavior of the algorithm when the communication channels are noisy (Fig. 5.16). The model from the Connected Graph (Section 5.3.1) is reused in this simulation, and the step size α = 0.052 is used for both versions of the algorithm. The simulation shows the Constant-Delays Algorithm, with and without noise, where each element in the stochastic noise vector ε(t) is drawn at random from the interval [−0.04, 0.04].
Figure 5.16: The objective function value for the Dual Decomposition Algorithm, with and without noisy communication channels. The noise is a stochastic variable, bounded by 0.04, affecting the subgradients.

The simulation shows that the algorithm converges to a neighborhood around the optimal value, even with the added noise. However, the neighborhood around the optimal value is larger for the noisy version, and the convergence is not as smooth as for the noiseless version.
5.4.3 Halting Algorithm
In Section 4.4.1, a simple distributed version was introduced, based on the idea of halting the algorithm until all dual variables have been updated. In this section, we compare the Halting Algorithm against the Constant-Delays Algorithm, in order to see if the Constant-Delays Algorithm provides a significant improvement.
Figure 5.17: The best function value f_best(T) for both the Constant-Delays Algorithm and the Halting Algorithm, as a function of the step sizes. The function value is evaluated after 1000 dual variable updates; thus, T = 1000 for the Constant-Delays Algorithm, and T = 7000 for the Halting Algorithm.

The simulations are performed on the same model used in the Connected Graph (Section 5.3.1), where the upper bound on the round-trip time is D = 6. First, we should choose the optimal step size for each algorithm, since they might be different. Evaluating each algorithm after 1000 updates of the dual variables shows that the Halting Algorithm can use a significantly larger step size (Fig. 5.17). However, since the Halting Algorithm only updates the dual variables every D + 1 steps, instead of on every step, the dual variables are only updated 1000/(D + 1) times during the time period in which the Constant-Delays Algorithm updates the dual variables 1000 times. Thus,
evaluating both algorithms after 1000 time steps gives a significantly different result (Fig. 5.18). The Halting Algorithm can still use a much larger step size, but it does not have enough time to converge to the optimal value in the first 1000 steps.
Figure 5.18: The best function value f_best(T) for both the Constant-Delays Algorithm and the Halting Algorithm, as a function of the step sizes. The function value is evaluated at time step T = 1000 for both algorithms, and the optimal step sizes are α = 0.0568 for the Constant-Delays Algorithm, and α = 0.1012 for the Halting Algorithm.

These results are what we would expect: the delays cause the Constant-Delays Algorithm to become unstable for a smaller step size than what is possible without any delay. On the other hand, the Halting Algorithm is slower, since it only updates the variables every D + 1 steps. The optimal step sizes after 1000 steps are α = 0.1012 for the Halting Algorithm, and α = 0.0568 for the Constant-Delays Algorithm. Simulating both algorithms with these step sizes shows that the Constant-Delays Algorithm is significantly faster than the Halting Algorithm (Fig. 5.19) for this problem setup. Notice that for the Complete Graph topology (Section 5.3.4), both algorithms would implement the pure subgradient algorithm on the dual variables, and hence, the results would be identical between the two algorithms on that topology.
Figure 5.19: Simulating the convergence rate for f_best(t) on the time interval t = 0, …, 2000, with the optimal step sizes chosen at time step t = 1000. The Constant-Delays Algorithm converges significantly faster than the Halting Algorithm.
Chapter 6
Conclusions and Remarks

"Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning."
Sir Winston Churchill, November 1942.
6.1 Conclusions
In this master thesis, the previously known Primal Consensus Algorithm (Chapter 3) is compared against the new Dual Decomposition Algorithm (Chapter 4). Both algorithms try to solve the same decentralized optimization problem, but the convergence proofs require slightly different computational models, and their performance also depends on these assumptions. The following list contains the significant differences between the computational models used by the two algorithms.

Connected Network. Both the Primal Consensus Algorithm and the Dual Decomposition Algorithm allow the communication topology to be time-varying, and even temporarily disconnected, as long as there is an upper bound on the time it takes any agent to influence any other agent in the network. However, the Primal Consensus Algorithm also assumes that the network is undirected, which is not necessary for the Dual Decomposition Algorithm.

Routing Protocol. The Dual Decomposition Algorithm uses point-to-point communication, and assumes that there is a routing protocol implemented in the network which provides this functionality, independent of the network topology and its time variations. The Primal Consensus Algorithm only uses local communication, and since each agent only communicates with its direct neighbors, no explicit routing protocol is necessary.
Delays. The routing protocol, used by the Dual Decomposition Algorithm, creates a point-to-point channel between every pair of agents in the network. To capture the network topology, a delay is introduced between agents that are not adjacent. Thus, the Dual Decomposition Algorithm can readily handle an extra time delay over single edges, while the Primal Consensus Algorithm assumes that there are no delays between adjacent agents.

Reliable Network. The Primal Consensus Algorithm assumes that each undirected edge is reliable; thus, either an edge exists, or it does not exist. The Dual Decomposition Algorithm allows an edge to fail in one direction, but assumes that the routing protocol will resend the information until it succeeds, and further, that no information will ever be lost in the network.

Bounded Subgradients. Both algorithms require the subgradients to be bounded, but since they are working with different functions, these assumptions are not equivalent. The Primal Consensus Algorithm assumes that the subgradients of the local objective functions f^i are uniformly bounded, while the Dual Decomposition Algorithm assumes that the subgradients of the Lagrange dual function q are uniformly bounded (which, in particular, is true if the primal variables are constrained to a compact set X).

The most significant difference between the two algorithms is the routing protocol required by the Dual Decomposition Algorithm. It implies a more complex network infrastructure, capable of reliably transmitting messages between any pair of agents. The Primal Consensus Algorithm instead favors simplicity, where only local communication is necessary. However, the routing protocol enables the Dual Decomposition Algorithm to work on directed networks, as well as on networks with delays in the communication. Further, the routing protocol allows the Dual Decomposition Algorithm to reduce the total communication cost. The communication cost for the Dual Decomposition Algorithm, with the routing protocol, is 2(N − 1) dim(x), while the communication cost for the Primal Consensus Algorithm is |E_t| dim(x). Notice that the communication cost for the Dual Decomposition Algorithm is independent of the network topology, while the communication cost for the Primal Consensus Algorithm is proportional to the number of edges in the network. Recall also that for a connected, undirected network, where each edge is counted once in each direction, the number of edges |E_t| is bounded by 2(N − 1) ≤ |E_t| ≤ N(N − 1). In particular, the communication cost for the Dual Decomposition Algorithm is a lower bound on the communication cost for the Primal Consensus Algorithm.

Let us now focus our attention on the convergence rates of the algorithms (Table 6.1), where a network of N agents is trying to solve either the primal minimization problem (3.1) or the dual maximization problem (4.8). It is assumed
that all algorithms use a constant step size α, and that the initial distance to the optimal set is bounded by R_p for the primal problem, and by R_d for the dual problem. Further, G_p is an upper bound on the subgradients for the primal problem, and similarly, G_d is the upper bound on the subgradients for the dual problem. Also, B̄ is the Primal Consensus Algorithm's upper bound on the time it takes any node's estimate to influence any other node's estimate, introduced in Section 3.3, while D is the Dual Decomposition Algorithm's upper bound on the round-trip time, introduced in Section 4.3.
Subgradient method on the primal variables (2.12):
\[
f_{\mathrm{best}}(t-1) \le f^* + \frac{R_p^2 + G_p^2 \alpha^2 t}{2\alpha t}
\]

Primal Consensus Algorithm (3.11):
\[
f_{\mathrm{best}}(t-1) \le f^* + \frac{N R_p^2 + G_p^2 \alpha^2 C t}{2\alpha t},
\quad \text{where } C = 1 + 8N C_1 \text{ and } C_1 = 1 + \frac{N}{1 - (1 - \eta^{\bar B})^{1/\bar B}} \cdot \frac{1 + \eta^{-\bar B}}{1 - \eta^{\bar B}}
\]

Subgradient method on the dual variables (4.12):
\[
q_{\mathrm{best}}(t-1) \ge q^* - \frac{R_d^2 + G_d^2 \alpha^2 t}{2\alpha t}
\]

Halting Algorithm (4.15):
\[
q_{\mathrm{best}}(t-1) \ge q^* - \frac{R_d^2 + G_d^2 \alpha^2 t/(D+1)}{2\alpha t/(D+1)}
\]

Constant-Delays Algorithm (4.25):
\[
q_{\mathrm{best}}(t-1) \ge q^* - \frac{R_d^2 + G_d^2 \alpha^2 t (1 + 2D)}{2\alpha t}
\]

Time-Varying Delays Algorithm (4.28):
\[
q_{\mathrm{best}}(t-1) \ge q^* - \frac{R_d^2 + \alpha^2 D (D+1)^2 G_d^2 + 3 t \alpha^2 (D+1)^2 G_d^2}{2\alpha t}
\]

Table 6.1: The convergence rates for the analyzed optimization algorithms, using a constant step size.

All of the convergence results have a similar form: they converge to a neighborhood around the optimal value, whose size is proportional to the step size α, with an error term that is proportional to 1/(αt). From the analysis, it is clear that the centralized subgradient method on the primal variables is faster than the decentralized Primal Consensus Algorithm, and similarly, the centralized subgradient method on the dual variables is faster than any of the decentralized dual algorithms. Thus, the price we have to pay if we want to solve the optimization problem in a decentralized manner is a slower convergence. Notice that the Halting Algorithm converges to the same neighborhood around the optimal value as the centralized subgradient algorithm, but with a slower convergence speed, while the Constant-Delays Algorithm on the other hand converges with the same speed as the centralized subgradient algorithm, but to a larger neighborhood around the optimal value. The numerical simulations (Section 5.4.3) show,
however, that the Constant-Delays Algorithm in practice performs a lot better than the Halting Algorithm. When comparing the Time-Varying Delays Algorithm against the Constant-Delays Algorithm, it comes as no surprise that the former has a worse convergence rate, since it handles the extended model, where the delays are time dependent. However, we believe that the bound on its convergence rate is more conservative than the bound for the Constant-Delays Algorithm, because the bound is computed per time step, and with time-varying delays the updates initiated at D consecutive steps could in the worst case arrive at the same time step. The bound is constructed by assuming that the worst possible update occurs at every time step, even though that prevents the worst update from occurring in the following steps. The most difficult part is to compare the primal algorithms against the dual algorithms, since they are essentially solving different optimization problems. Even though the expressions for the convergence rates are similar, the constants are different, and the constants depend on the problem instance. The numerical simulations show that the Dual Decomposition Algorithm often outperforms the Primal Consensus Algorithm, but it has trouble with sparse networks of high diameter. In summary, there is not a clear winner for all cases, as the simulations show. However, if the added complexity in the communication network, implied by the routing protocol, is not an obstruction, then the Dual Decomposition Algorithm has shown some promising results. Even if the Dual Decomposition Algorithm does not always provide an advantage, it is a valuable tool that should be seriously considered when deciding upon a decentralized optimization algorithm, and evaluated for the particular problem instance. It is also possible to run the Primal Consensus Algorithm and the Dual Decomposition Algorithm in parallel, which has the advantage that the Primal Consensus Algorithm provides an upper bound on the optimal value, while the Dual Decomposition Algorithm provides a lower bound on the optimal value. Thus, the gap between the two algorithms' objective function values is an indication of the current estimation error.
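A sketch of that stopping test (Python; primal_step and dual_step are hypothetical stand-ins returning the current objective values of the two algorithms; strong duality of the convex problem makes q* = f*):

def optimize_with_gap(primal_step, dual_step, tol, max_iter):
    # The primal value upper-bounds f*, the dual value lower-bounds it,
    # so their gap bounds the current estimation error.
    f_best, q_best = float("inf"), float("-inf")
    for t in range(max_iter):
        f_best = min(f_best, primal_step(t))   # upper bound on f*
        q_best = max(q_best, dual_step(t))     # lower bound on f*
        if f_best - q_best <= tol:
            break
    return f_best, q_best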
6.2 Summary of Contributions
In this master thesis, the dual decomposition method and the subgradient method are used to construct a distributed, decentralized optimization algorithm. The main contribution of this thesis is to prove the convergence of the decentralized optimization algorithm on time-varying networks, with communication delays and noisy communication channels. Further, the convergence rate is explicitly expressed in terms of the network parameters. The convergence rate of this new algorithm is compared to previously known algorithms, with extensive numerical simulations. Besides the convergence rate, the communication cost has been of primary interest, and it is shown how the communication cost can be kept at a minimum level.
6.3 Future Research
During my work on this thesis, I have come across several possible interesting extensions that could be made.

Constraint Coupling. In this thesis, we have focused on an optimization problem where the coupling is in the objective functions (4.2),
\[
\min_{x \in X \subseteq \mathbb{R}^n} \; \sum_{i=1}^{N} f^i(x_1, x_2, \dots, x_N).
\]
Another common optimization problem is one where the coupling is in the constraint functions,
\[
\begin{aligned}
\min_{x \in X \subseteq \mathbb{R}^n} \quad & \sum_{i=1}^{N} f^i(x_i), \\
\text{subject to} \quad & c_j(x_1, x_2, \dots, x_N) \le b_j, \quad j = 1, \dots, m.
\end{aligned}
\]
These problems occur, for example, when a set of independent multi-agents is using a common, limited resource. The dual decomposition method already handles constraint coupling by introducing the Lagrange dual variables, so it should be a straightforward procedure to add another set of dual variables for these constraints. A future research topic is to study the convergence rate for this class of optimization problems, from a distributed computational point of view.

Asynchronous Multi-Agents. The models considered in this thesis have assumed that all agents are computing and communicating at synchronous time steps. If the multi-agents are non-homogeneous, or if the computational resources required to solve their local optimization problems are significantly different, then this assumption can be very restrictive. An interesting extension is to develop, and analyze, an asynchronous version of these distributed optimization algorithms.

Package Loss. In the Reliable Network Assumption 4.5, it is assumed that every package sent on the network will eventually arrive at its destination. A common practice in network control is to drop packages to avoid network congestion. Thus, a possible extension is to analyze the behavior of the algorithms when packages can be lost in the network. Further, in the Time-Varying Delays, Distributed Algorithm, packages can arrive at the destination in a different order than the one in which they were transmitted, and it would be interesting to analyze if the latest generated information could supersede older information.
Removing Artificial Delays. In the Dual Decomposition Algorithm, the updates are used simultaneously by the agents. Consider for example the dual variable λ_ij(t), computed by agent j at time step t − δ_ji − 1. Recall that λ_ij(t) is not used by any of the agents before time step t; thus, agent j knows the value of λ_ij(t) for δ_ji time steps without using it. Also, when the dual variable λ_ij(t) is updated at time step t − δ_ji − 1, agent j uses its estimate from time step t − δ_ji − δ_ij − 1, instead of its latest primal variable estimate. The reason for introducing these artificial delays was so that the update would be a pure subgradient to the problem, thereby enabling us to prove the convergence of the algorithm. Another possibility would be to let agent j use its current estimate when updating the dual variable, and also use the dual variable immediately, before agent i receives the update, but it remains to analyze the convergence of this version of the algorithm.

Interpret the Dual Variables. A common interpretation of the primal variables is as a resource allocation, while the dual variables are interpreted as the prices of the resources. An interesting research project could be to extend this interpretation, and especially to try to use the interpretation of the variables to design a network topology for the optimization problem.

Optimal State Partitioning. Recall that each element in the optimization variable x is associated with a single agent, yielding a state partitioning of x into the components x_1, x_2, …, x_N. After introducing local estimates x^1, x^2, …, x^N of the optimization variable, the conditions imposed were that x_i^i = x_i^j for all i = 1, …, N and j = 1, …, N. An interesting project could be to study how different partitions of the optimization variable x affect the convergence rate, and thus, how to find the optimal partitioning of the optimization variable. Further, the conditions x_i^i = x_i^j were chosen to satisfy the requirement that x_i^1 = x_i^2 = ⋯ = x_i^N; however, these conditions could have been chosen in another way, for example x_i^j = x_i^{j+1} for all i = 1, …, N and j = 1, …, N − 1. Thus, the next step would be to find the optimal set of conditions that yields the fastest convergence rate.

Unconstrained Quadratic Functions. In Section 4.7, it was shown that the Dual Decomposition Algorithm does converge for a very simple two-agent system, even with quadratic objective functions that do not satisfy the bounded subgradients assumption, provided that the step size is chosen small enough. It would be interesting to see if this result could be generalized to an arbitrary, strongly connected network, as long as the objective functions are quadratic and the step size is small enough.
These extensions are left as exercises for the interested reader or, if no one reads this thesis, as topics for my own future research.
Appendix A
Distributed Consensus Sum Algorithm

In Section 4.6.3 we came across the distributed problem of computing the sum of a set of values. We argued that different schemes could be used if the network topology were known; however, the general problem with only local knowledge is much harder. Here we propose a general algorithm that solves this problem, although its performance is so poor that it should never be used in practice!
Problem definition

We want to solve the distributed sum problem in a similar way to the Average Consensus Algorithm described in Section 2.8. Thus, consider a strongly connected network of $N$ agents, with each agent holding an initial value $x^i(0) \in \mathbb{R}$ at time $t = 0$. The goal is to compute the sum of these initial values, $\sum_{i=1}^{N} x^i(0)$, with only local communication. Further, the goal is also that the agents should reach consensus on the sum; thus, every agent's estimate should converge to the sum of the initial values,
$$\lim_{t \to \infty} x^i(t) = \sum_{j=1}^{N} x^j(0) \quad \forall i \in V.$$
We also assume that all agents are identical, except for their unique identities, and that the computational resources available to each agent are modest.
Algorithm

First, notice that if the number of agents $N$ were known, then the problem could easily be solved by using the Average Consensus Algorithm to find the average of the initial values, and then multiplying by $N$ to get the sum of the initial values. However, in our model $N$ is considered to be global knowledge, and can thus not be used directly by the agents.
The idea behind our algorithm is to run three sub-algorithms in parallel: the first two are used to compute $1/N$, and the third is the ordinary Average Consensus Algorithm. Lynch [5] has described these sub-algorithms individually.
Algorithm 1: Leader Election

The first algorithm is used to decide on one special agent, called a leader. In order for this algorithm to work, we need to assume that each agent has a unique identifier $id^i$, and that the identifiers are well-ordered, i.e., they can be compared, for example numerically or lexicographically. Let us define the maximum identity as $id_{max} = \max_i id^i$. The goal of the Leader Election Algorithm is then to reach consensus on the value of $id_{max}$, and the agent whose identity is $id_{max}$ is called the leader.

The algorithm proceeds as follows: each agent keeps a record of the largest identity it has heard of, $id^i_{max}$, and whenever it receives a message about an identity greater than its own estimate, it updates its estimate and broadcasts the identity to all of its neighbors; see Algorithm A.1.

Recall that the diameter of a graph is the greatest distance between any two nodes, and let $d$ denote the diameter. Since we have assumed that the graph is strongly connected, $d$ is finite. Consider Algorithm A.1, and how the maximal identity $id_{max}$ propagates outwards from the leader in rings: after the first step, all nodes within distance 1 from the leader have received its identity; after the second step, all nodes within distance 2 have received it; and so on. Thus, after $d$ steps, $d$ being the maximal distance in the graph, all agents will have received the maximum identity.

Algorithm A.1: Leader Election Algorithm.
Data: Each agent starts with its own $id^i$.
Result: After $d$ steps of the outer loop, every agent has $id^i_{max} = id_{max}$.
// The following algorithm is executed on each agent.
$id^i_{max} \leftarrow id^i$
Send $id^i_{max}$ to all neighbors.
forever do
    foreach incoming $id$ do
        if $id > id^i_{max}$ then
            $id^i_{max} \leftarrow id$
            Send $id^i_{max}$ to all neighbors.
        end
    end
end
Remark. The natural stopping criterion for the algorithm is to stop after $d$ steps, but for the same reason that we consider the number of agents $N$ to be global knowledge, so is the diameter $d$ of the graph. However, this is not a problem, since we will let the algorithm run indefinitely, knowing that it will find a leader within a finite time.
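To make the flooding mechanism concrete, the following Python sketch simulates Algorithm A.1 under a synchronous round model. The graph representation, the function name and the round-based message passing are illustrative assumptions and not part of the thesis; only the max-identity flooding rule is taken from Algorithm A.1.

def leader_election(neighbors, ids, rounds):
    """Simulate Algorithm A.1: flood the maximum identity over a
    strongly connected directed graph. neighbors[i] lists the agents
    that agent i can send to; ids[i] is agent i's unique identifier.
    Any number of rounds >= the graph diameter d suffices."""
    n = len(ids)
    id_max = list(ids)                                  # each agent's estimate of id_max
    outbox = {i: list(neighbors[i]) for i in range(n)}  # initial broadcast of own id
    for _ in range(rounds):
        inbox = {i: [] for i in range(n)}
        for i, targets in outbox.items():               # deliver pending messages
            for j in targets:
                inbox[j].append(id_max[i])
        outbox = {i: [] for i in range(n)}
        for i in range(n):
            heard = max(inbox[i], default=id_max[i])
            if heard > id_max[i]:                       # heard a larger id: adopt it
                id_max[i] = heard                       # and re-broadcast to neighbors
                outbox[i] = list(neighbors[i])
    return id_max

# Example: a directed 4-cycle has diameter 3, so 3 rounds suffice.
neighbors = {0: [1], 1: [2], 2: [3], 3: [0]}
print(leader_election(neighbors, ids=[3, 9, 1, 7], rounds=3))   # [9, 9, 9, 9]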
Algorithm 2: Computing 1/N

The second part of the algorithm is to combine the Leader Election Algorithm with an Average Consensus Algorithm to compute the inverse of the number of agents, $\frac{1}{N}$. Let $W_1$ be a nonnegative average consensus weight matrix satisfying the conditions of Theorem 2.4. Notice the important property of $W_1$ that it preserves the sum of a vector,
$$\mathbf{1}^T x = \mathbf{1}^T W_1 x \quad \forall x \in \mathbb{R}^N.$$
Let each node start with an estimate $n^i(0)$ of $1/N$, and run an Average Consensus Algorithm over these values, with the weight matrix given by $W_1$. We will now introduce two important properties: first how the initial values $n^i(0)$ are chosen, and then a small modification to the consensus algorithm. The initial value $n^i(0)$ is chosen by the rule
$$n^i(0) = \begin{cases} 1 & \text{if } id^i_{max} = id^i; \\ 0 & \text{otherwise.} \end{cases}$$
We will refer to an agent with the property $id^i_{max} = id^i$ as a believed leader; the sum $\mathbf{1}^T n(0)$ is then equal to the number of believed leaders. In particular, if both algorithms start at the same time, then the number of believed leaders is equal to $N$.

Since the Leader Election Algorithm is run in parallel, we add the following modification to the consensus algorithm: every time a believed leader receives information that it is in fact not the leader, it subtracts 1 from its estimate $n^i$.

Consider the sum $\mathbf{1}^T n$, and notice that it has the invariant property of counting the number of believed leaders, since the consensus update does not change the sum. But we know that after $d$ steps there will be only one believed leader, and after that the sum will be constantly 1. Now, the Average Consensus Algorithm ensures that all values eventually become equal; thus they converge to $1/N$, as desired. The combined algorithm is specified in Algorithm A.2.

Remark. It might seem an easier option to use the leader as a central agent: everyone sends their value to the leader, the leader computes the sum, and then it broadcasts the sum back to the agents. However, such a scheme forces different agents to behave differently, while in our case all agents perform the same operations.
Algorithm A.2: Computing 1/N.
Data: Each agent starts with its own $id^i$.
Result: After $d$ steps of the outer loop, every agent has $id^i_{max} = id_{max}$, and the sum $\mathbf{1}^T n = 1$ stays constant. The only update that remains is the average consensus update on $n$.
// The following algorithm is executed on each agent.
$id^i_{max} \leftarrow id^i$
$n^i \leftarrow 1$
Send $id^i_{max}$ to all neighbors.
forever do
    foreach incoming $id$ do  // No ids will be sent after step $d$.
        if $id > id^i_{max}$ then
            if $id^i_{max} = id^i$ then  // Satisfied exactly once for every agent except the leader.
                $n^i \leftarrow n^i - 1$
            end
            $id^i_{max} \leftarrow id$
            Send $id^i_{max}$ to all neighbors.
        end
    end
    Compute an average consensus update for $n^i$.
end
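As a complement to the pseudocode, here is a Python sketch of Algorithm A.2 in the same simplified synchronous model as the previous sketch: the per-round exchange of identities and the Metropolis-style example weights are assumptions made for illustration, while the thesis only requires a nonnegative, sum-preserving average consensus matrix $W_1$.

import numpy as np

def compute_inv_n(W1, ids, steps):
    """Sketch of Algorithm A.2: run max-id flooding and a sum-preserving
    average consensus on n in parallel, over the undirected links given
    by the positive off-diagonal entries of W1. A believed leader that
    learns of a larger id subtracts 1 from its estimate, so 1^T n always
    equals the number of believed leaders, eventually 1; the consensus
    updates then drive every n_i towards 1/N."""
    N = len(ids)
    id_max = list(ids)
    n = np.ones(N)                          # every agent starts as a believed leader
    links = [[j for j in range(N) if j != i and W1[i, j] > 0] for i in range(N)]
    for _ in range(steps):
        new_id_max = list(id_max)
        for i in range(N):
            heard = max((id_max[j] for j in links[i]), default=id_max[i])
            if heard > id_max[i]:
                if id_max[i] == ids[i]:     # a believed leader learns it is not:
                    n[i] -= 1               # keeps 1^T n = number of believed leaders
                new_id_max[i] = heard
        id_max = new_id_max
        n = W1 @ n                          # sum-preserving consensus update
    return n

# Example: 5 agents on an undirected ring with Metropolis-style weights.
N = 5
W1 = np.zeros((N, N))
for i in range(N):
    W1[i, (i - 1) % N] = W1[i, (i + 1) % N] = 1 / 3    # 1 / (max degree + 1)
    W1[i, i] = 1 / 3
print(compute_inv_n(W1, ids=[4, 17, 8, 2, 11], steps=200))  # each entry ≈ 0.2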
Algorithm 3: Average Consensus

The final step is, as described before, to calculate the average of the initial values $x^i(0)$. Let us denote agent $i$'s estimate of the average value by $y^i$; thus $y^i(0) = x^i(0)$ and $y(t+1) = W_2 y(t)$, where $W_2$ is also an average consensus weight matrix. This is the third algorithm, executed in parallel with the others. We thus have the results
$$\lim_{t \to \infty} n^i(t) = \frac{1}{N}, \qquad \lim_{t \to \infty} y^i(t) = \frac{1}{N} \sum_{j=1}^{N} x^j(0).$$
Finally, by defining $x^i(t) = y^i(t) / n^i(t)$ for $t > 0$, we get
$$\lim_{t \to \infty} x^i(t) = \sum_{j=1}^{N} x^j(0),$$
as desired.
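Continuing the Python sketches above, the three sub-algorithms combine into the full distributed sum algorithm as follows. The function compute_inv_n, the matrix W1 and the example values are the illustrative assumptions introduced earlier; $W_2$ is here taken equal to $W_1$ purely for simplicity.

def distributed_sum(W1, W2, ids, x0, steps):
    """Run Algorithm A.2 and an independent average consensus on the
    initial values in parallel; the quotient y_i / n_i then converges
    to the sum of the initial values at every agent."""
    n = compute_inv_n(W1, ids, steps)       # n_i -> 1/N
    y = np.array(x0, dtype=float)
    for _ in range(steps):
        y = W2 @ y                          # y_i -> (1/N) * sum_j x_j(0)
    return y / n                            # x_i = y_i / n_i -> sum_j x_j(0)

print(distributed_sum(W1, W1, ids=[4, 17, 8, 2, 11],
                      x0=[1.0, 2.0, 3.0, 4.0, 5.0], steps=200))
# each entry ≈ 15.0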
Remark. Notice how an error propagates from $n^i(t)$ to $x^i(t)$. Let $n^i(t) = \frac{1}{N} + \varepsilon$; then $x^i(t) \approx (N - \varepsilon N^2)\, y^i(t)$, so a small error $\varepsilon$ in $n^i(t)$ is amplified by a factor of order $N^2$, and thus the convergence can be very slow. For example, with $N = 100$, an error of $\varepsilon = 10^{-4}$ already perturbs the estimate of the sum by roughly 1%.
Bibliography

[1] B. Johansson, On Distributed Optimization in Networked Systems. PhD thesis, Royal Institute of Technology (KTH), Dec. 2008. TRITA-EE 2008:065.

[2] A. Nedić and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," Tech. Rep. 2575, MIT LIDS, 2007.

[3] A. Nedić and A. Ozdaglar, "On the rate of convergence of distributed subgradient methods for multi-agent optimization," in Proceedings of the 46th IEEE Conference on Decision and Control, pp. 4711–4716, Dec. 2007.

[4] A. Nedić and A. Ozdaglar, Convex Optimization in Signal Processing and Communications, ch. 10: Cooperative distributed multi-agent optimization. Cambridge University Press, Dec. 2009.

[5] N. A. Lynch, Distributed Algorithms. Morgan Kaufmann, 1996.

[6] J. N. Tsitsiklis, Problems in Decentralized Decision Making and Computation. PhD thesis, Department of EECS, MIT, Nov. 1984.

[7] F. Paganini, J. Doyle, and S. Low, "Scalable laws for stable network congestion control," in Proceedings of the 40th IEEE Conference on Decision and Control, vol. 1, pp. 185–190, 2001.

[8] J. A. Fax, Optimal and Cooperative Control of Vehicle Formations. PhD thesis, California Institute of Technology, Nov. 2001.

[9] R. Olfati-Saber and R. M. Murray, "Consensus problems in networks of agents with switching topology and time-delays," IEEE Transactions on Automatic Control, vol. 49, pp. 1520–1533, Sep. 2004.

[10] R. Olfati-Saber and R. M. Murray, "Distributed cooperative control of multiple vehicle formations using structural potential functions," in IFAC World Congress, 2002.

[11] V. D. Blondel, J. M. Hendrickx, A. Olshevsky, and J. N. Tsitsiklis, "Convergence in multiagent coordination, consensus, and flocking," in Proceedings of the Joint 44th IEEE Conference on Decision and Control and European Control Conference, pp. 2996–3000, 2005.

[12] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.

[13] D. P. Bertsekas et al., Convex Analysis and Optimization. Athena Scientific, 2003.

[14] J.-B. Hiriart-Urruty and C. Lemaréchal, Fundamentals of Convex Analysis. Springer-Verlag, 2001.

[15] S. Boyd and L. Vandenberghe, "Subgradients," lecture notes, 2007.

[16] S. Boyd and A. Mutapcic, "Subgradient methods," lecture notes, 2007.

[17] J. A. Snyman, Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-Based Algorithms. Springer, 2005.

[18] S. Boyd, L. Xiao, A. Mutapcic, and J. Mattingley, "Notes on decomposition methods," lecture notes, 2007.

[19] G. Strang, Introduction to Linear Algebra. SIAM, 2003.

[20] R. Buyya, "The design of PARAS microkernel," 2000.

[21] K. Asanovic et al., "The landscape of parallel computing research: A view from Berkeley," Tech. Rep. UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec. 2006.

[22] B. Barney, "Introduction to parallel computing."

[23] L. Chai, Q. Gao, and D. K. Panda, "Understanding the impact of multi-core architecture in cluster computing: A case study with Intel dual-core system," in IEEE International Symposium on Cluster Computing and the Grid, pp. 471–478, 2007.

[24] D. Li and Y. Y. Haimes, "A decomposition method for optimization of large-system reliability," IEEE Transactions on Reliability, vol. 41, pp. 183–188, June 1992.

[25] M. Avriel, Nonlinear Programming: Analysis and Methods. Courier Dover Publications, 2003.

[26] R. Diestel, Graph Theory (Graduate Texts in Mathematics). Springer, Aug. 2005.

[27] L. Xiao and S. Boyd, "Fast linear iterations for distributed averaging," Systems and Control Letters, vol. 53, pp. 65–78, 2003.

[28] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, "Randomized gossip algorithms," IEEE Transactions on Information Theory, vol. 52, no. 6, pp. 2508–2530, 2006.

[29] V. D. Park and M. S. Corson, "A highly adaptive distributed routing algorithm for mobile wireless networks," in Proceedings of INFOCOM '97, Sixteenth Annual Joint Conference of the IEEE Computer and Communications Societies, pp. 1405–1413, IEEE Computer Society, 1997.

[30] A. Nedić and D. P. Bertsekas, "The effect of deterministic noise in subgradient methods," Mathematical Programming, 2008.

[31] P. Frasca, R. Carli, F. Fagnani, and S. Zampieri, "Average consensus on networks with quantized communication," International Journal of Robust and Nonlinear Control, vol. 19, no. 16, pp. 1787–1816, 2009.

[32] I. F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, "Wireless sensor networks: a survey," Computer Networks, vol. 38, no. 4, pp. 393–422, 2002.

[33] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 2009.

[34] The MathWorks Inc., "MATLAB version 7.5.0," 2007.

[35] J. Löfberg, "YALMIP: A toolbox for modeling and optimization in MATLAB," in Proceedings of the CACSD Conference, Taipei, Taiwan, 2004.

[36] J. Sturm, "Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones," Optimization Methods and Software, vol. 11–12, pp. 625–653, 1999. Version 1.3 available from http://sedumi.ie.lehigh.edu.