UNIVERSIDADE DE LISBOA
INSTITUTO SUPERIOR TÉCNICO
Distributed and robust network localization algorithms
Cláudia Alexandra Magalhães Soares
Supervisor: Doutor João Pedro Castilho Pereira Santos Gomes
Thesis approved in public session to obtain the PhD Degree in Electrical and Computer Engineering
Jury final classification: Pass with Distinction and Honour
Provisional Thesis
December 2015
Abstract
Signal processing over networks has been a broad and active topic in the signal processing community in recent years. Networks of agents typically rely on known node positions, even if the main goal of the network is not localization. A network of agents may comprise a large set of miniature, low-cost, low-power autonomous sensing nodes. In this scenario it is generally unsuitable, or even impossible, to accurately deploy all nodes at predefined locations within the network operation area. GPS is also ruled out for indoor applications, or due to cost and energy consumption constraints. Moreover, mobile agents need localization for, e.g., motion planning or formation control, and GPS might not be available in many environments. Real-world conditions imply noisy environments, and network operation calls for fast and reliable estimation of the agents’ locations. Galvanized by the compelling applications and by the difficulty of the problem itself, researchers have devoted considerable work to locating the nodes of a network. Some develop centralized methods, while others pursue distributed, scalable solutions, either by developing approximations or by tackling the nonconvex problem directly, sometimes combining both approaches. With the growing size of networks of devices constrained in energy expenditure and computational power, the need for simple, fast, and distributed algorithms for network localization spurred the work presented in this thesis. Here, we approach the problem starting from minimal data collection, aggregating only range measurements and a few landmark positions, to deliver a good approximate solution that can then be fed to our fast yet simple maximum-likelihood method, which returns highly accurate solutions. We explore tailored solutions, resorting to optimization and probability tools that sustain performance under noise and in unstructured environments. Thus, the main contributions of this thesis are:
• Distributed localization algorithms characterized by their simplicity but also by strong guarantees;
• Analyses of convergence, iteration complexity, and optimality bounds for the designed procedures;
• Novel majorization approaches tailored to the specific problem structure.
Keywords

Distributed algorithms, convex relaxations, nonconvex optimization, maximum likelihood estimation, distributed iterative agent localization, robust estimation, noisy range measurements, network localization, majorization-minimization, optimal gradient methods.
Resumo

Signal processing over networks has been a broad and prolific topic within the scientific community in recent years. Networks of agents generally rely on knowledge of the positions of their nodes, even in situations where localization is not the main goal of the operation. A network of agents may be composed of a large set of autonomous, low-cost, low-power nodes. In this scenario it is generally unsuitable, or even impossible, to place the nodes at predefined locations within the operation area. The use of GPS is also ruled out for applications inside buildings, or due to its cost or energy requirements. Moreover, mobile agents need to know their location in order to, for example, plan their motion or perform formation control, and resources such as GPS may not be accessible. The real world further implies noisy environments, and the purpose of the network demands fast and reliable estimation of the positions of the various agents. Researchers in the field have thus devoted work to localizing the nodes of the network, galvanized by the relevant applications but also by the challenge posed by the problem. Some lines of work have focused on developing centralized methods, while others pursue distributed, scalable solutions, either by developing approximations or by directly tackling the nonconvex problem, at times combining both approaches. With the growth in size of networks of devices scarce in energy and computational resources, the need for simple, fast, and distributed algorithms drove the work presented in this thesis.

We approach the problem starting from a minimal data set, aggregating only range measurements and the positions of a few landmarks, and deliver an approximate solution of medium accuracy which can subsequently be fed to our fast yet simple maximum-likelihood method, returning solutions of very high accuracy. To meet these needs we explore tailored solutions, resorting to optimization and probability techniques that boost accuracy and speed even in the presence of noise and unstructured environments. Thus, the main contributions of this thesis are:

• Distributed localization algorithms, characterized by their simplicity but also by strong convergence guarantees;
• Analyses of convergence, iteration complexity, and optimality bounds for the procedures in question;
• Novel majorization approaches tailored to the specific structure of the problem.

Keywords

Distributed algorithms, convex relaxations, nonconvex optimization, maximum likelihood estimation, distributed localization of agent networks, robust estimation, noisy range measurements, network localization, majorization-minimization, optimal gradient methods.
Contents

1 Introduction
   1.1 Motivation and related work
      1.1.1 Scalability and networks
      1.1.2 Robustness and harsh environments
   1.2 Objectives and contributions
   1.3 Agent localization on a network

2 Distributed network localization without initialization: Tight convex underestimator-based procedure
   2.1 Contributions
   2.2 Related work
   2.3 Convex underestimator
   2.4 Distributed sensor network localization
      2.4.1 Gradient and Lipschitz constant of f̂
      2.4.2 Parallel method
      2.4.3 Asynchronous method
   2.5 Analysis
      2.5.1 Quality of the convexified problem
      2.5.2 Parallel method: convergence guarantees and iteration complexity
      2.5.3 Asynchronous method: convergence guarantees and iteration complexity
   2.6 Numerical experiments
      2.6.1 Assessment of the convex underestimator performance
      2.6.2 Performance of distributed optimization algorithms
      2.6.3 Performance of the asynchronous algorithm
   2.7 Proofs
      2.7.1 Convex envelope
      2.7.2 Lipschitz constant of ∇φ_{Bij}
      2.7.3 Auxiliary Lemmas
      2.7.4 Theorems
   2.8 Summary and further extensions
      2.8.1 Heterogeneous data fusion application

3 Distributed network localization with initialization: Nonconvex procedures
   3.1 Related work
   3.2 Distributed Majorization-Minimization with quadratic majorizer
      3.2.1 Contributions
      3.2.2 Problem reformulation
      3.2.3 Majorization-Minimization
      3.2.4 Distributed sensor network localization
      3.2.5 Experimental results
      3.2.6 Summary
   3.3 Majorization-Minimization with convex tight majorizer
      3.3.1 Majorization function
      3.3.2 Experimental results on majorization function quality
      3.3.3 Distributed optimization of the proposed majorizer using ADMM
      3.3.4 Experimental setup
      3.3.5 Proof of majorization function properties
      3.3.6 Proof of Proposition 9
      3.3.7 Proof of (3.31)
      3.3.8 Summary
   3.4 Sensor network localization: a graphical model approach
      3.4.1 Uncertainty models
      3.4.2 Optimization problem
      3.4.3 Combinatorial problem
      3.4.4 Related work
      3.4.5 Contributions
      3.4.6 Algorithms
      3.4.7 Experimental results
      3.4.8 Summary

4 Robust algorithms for sensor network localization
   4.1 Related work and contributions
   4.2 Discrepancy measure
   4.3 Convex underestimator
      4.3.1 Approximation quality of the convex underestimator
   4.4 Numerical experiments
   4.5 Summary

5 Conclusions and perspectives
   5.1 Distributed network localization without initialization
   5.2 Addressing the nonconvex problem
      5.2.1 With more computations we can do better
      5.2.2 Network of agents as a graphical model
   5.3 Robust network localization
   5.4 In summary
List of Figures

2.1 Convex envelope for one-dimensional example
2.2 One-dimensional example of the quality of the approximation of the true nonconvex cost
2.3 Two-dimensional star network to assess the quality of optimality bounds
2.4 Proximal minimization evolution for the toy problem
2.5 Proximal minimization cost evolution for the toy problem
2.6 Network 1. Topology with 4 anchors and 10 sensors. Anchors are marked with blue squares and sensors with red stars.
2.7 Network 2. Topology with 4 anchors and 50 sensors. Anchors are also marked with blue squares and sensors with red stars.
2.8 Relaxation quality experiment for different noise levels
2.9 Relaxation quality experiment for high power noise
2.10 Estimates for the location of the sensor nodes (network with 10 agents).
2.11 Performance comparison: Algorithm 1 vs. projection method.
2.12 Performance comparison: Algorithm 1 vs. ESDP method.
2.13 Performance of the asynchronous algorithm.
3.1 Nonconvex reformulation illustration
3.2 Evolution of cost and average error per sensor with communications, for Algorithm 4 and the benchmark, under low power noise.
3.3 Evolution of cost and average error per sensor with communications, for Algorithm 4 and the benchmark, under medium power noise.
3.4 Evolution of cost and average error per sensor with communications, for Algorithm 4 and the benchmark, under high power noise.
3.5 Tightness evaluation for the proposed majorizer in (3.15).
3.6 Evaluation of majorizer performance for different initializations.
3.7 Performance comparison: Algorithm 7 vs. SGO; noiseless range measurements.
3.8 Performance comparison: Algorithm 7 vs. SGO; noisy range measurements.
3.9 Performance comparison: Algorithm 7 vs. SGO; noisy range measurements, random anchors.
3.10 Performance comparison: Algorithm 7 vs. SGO; with increasing measurement noise.
3.11 Performance comparison: Algorithm 7 vs. SGO; accuracy and communications.
3.12 Performance comparison: Algorithm 7 vs. SGO; increasing parameter value.
3.13 Average cost over Monte Carlo trials.
3.14 Mean positioning error per sensor over Monte Carlo trials.
3.15 Rank of the solution matrix E in the tested Monte Carlo trials.
4.1 Comparison of nonconvex cost functions.
4.2 Quality of the proposed relaxation (4.4).
4.3 Illustration of the relaxation quality of (4.4).
4.4 Estimates for sensor positions for the three discrepancy functions.
4.5 Average positioning error vs. the value of the Huber function parameter.
List of Tables

2.1 Bounds on the optimality gap for the example in Figure 2.2
2.2 Bounds on the optimality gap for the 2D example in Figure 2.3
2.3 Number of communications per sensor for the results in Fig. 2.12
3.1 Mean positioning error, with measurement noise
3.2 Squared error dispersion over Monte Carlo trials for Figure 3.7.
3.3 Squared error dispersion over Monte Carlo trials for Figure 3.8.
3.4 Squared error dispersion over Monte Carlo trials for Figure 3.9.
3.5 Squared error dispersion over Monte Carlo trials for Figure 3.10.
3.6 Cost values per sensor
4.1 Bounds on the optimality gap for the example in Figure 4.3
4.2 Average positioning error per sensor (MPE/sensor), in meters
4.3 Average positioning error per sensor (MPE/sensor), in meters, for the biased experiment
List of Algorithms

1 Parallel method
2 Asynchronous method
3 Asynchronous update at each node i
4 Distributed nonconvex localization algorithm
5 Minimization of the tight majorizer in (3.19)
6 Nesterov’s optimal method for (3.35)
7 Step 2 of Algorithm 5 using ADMM: position updates
8 Distributed monotonic spanning tree-based algorithm
9 Coordinate descent algorithm
1 Introduction
Contents
   1.1 Motivation and related work
      1.1.1 Scalability and networks
      1.1.2 Robustness and harsh environments
   1.2 Objectives and contributions
   1.3 Agent localization on a network
Networks of agents are becoming ubiquitous. From environmental and infrastructure monitoring to surveillance and healthcare, networked extensions of the human senses in contemporary technological societies are improving our quality of life, our productivity, and our safety. Applications of such networks recurrently need to be aware of node positions to fulfill their tasks and deliver meaningful information. Nevertheless, locating the nodes is not trivial: these small, low-cost, low-power devices are deployed in large numbers, often with imprecise prior knowledge of their locations, and might be equipped with minimal processing capabilities. Such limitations call for localization algorithms which are scalable, fast, and parsimonious in their communication and computational requirements.
1.1 Motivation and related work
The network localization problem is not new (see, e.g., Bulusu et al. [1], published in 2000); nevertheless, the scientific community is still striving for a usable, practicable solution to the problem of spatially localizing a network of agents. Nowadays, with the increasing number of networked devices running localization-dependent applications, scalability imposes itself as a real and urgent need. Solutions based on centralized semidefinite programming relaxations, as in [2], are thus not suitable for problems with hundreds of nodes, and many distributed approaches, like [3], demand that each node solve a semidefinite program at each algorithm iteration, a requirement difficult to meet on the small, low-power, inexpensive hardware usually deployed in applications. There is a need for simple, distributed, and fast localization methods, easy to understand by a non-specialist engineer, but with no concessions in terms of convergence and expected behavior. This thesis provides such methods, achieving strong accuracy and convergence rates while demanding only simple arithmetic operations. The signal processing community has also approached the network localization problem with noisy range measurements under the assumption that the noise is Gaussian, as did Biswas et al. [2]. Nevertheless, in many applications, on top of the (indeed Gaussian) measurement noise, we find outlier data, due to, e.g., environmental conditions, or malfunctioning or malicious nodes. This problem, of great practical impact, is also recognized as interesting in the literature (see, e.g., Ihler et al. [4] or, more recently, Simonetto et al. [3]). Two noteworthy works address this largely unexplored problem: Korkmaz et al. [5] provide a distributed nonconvex procedure using the Huber M-estimator, which is, however, highly dependent on the initialization, and Oğuz-Ekim et al. [6] propose an approach which does not need initialization information but is centralized, assumes a Laplacian noise model, and does not scale well with the size of the network.
Our work bridges this gap by providing an estimate that does not depend on the initialization, does not assume an outlier distribution, and is lightweight in its computational needs. As the present work spans such a varied set of tools and flavors of the sensor network localization problem, a more in-depth treatment of related work is included in each chapter. The remainder of this section briefly discusses some aspects that should be taken into account when devising a system that effectively delivers localization information to a network of agents.
1.1.1 Scalability and networks
With agents operating in networks and processing increasing amounts of data, today’s computational paradigm is shifting from centralized to distributed operation. Self-localization of agents is one of the basic requirements of such networked environments, enabling other applications. This paradigm shift raises questions such as: is it possible to pursue parallel, and even asynchronous, optimization algorithms and, at the same time, ensure their optimality? Considering that communication expends more energy than computation, fewer algorithm iterations mean not only a faster response, but also longer battery life and thus extended operation. The simplicity of the agents might mean that computation is also a limited resource, and that only simple operations are available. Moreover, the spread of this type of solution across applications imposes simplicity constraints on the procedures to be implemented, and the avoidance of tuning parameters, so they can be intuitively and swiftly interpreted by a non-specialist.
1.1.2 Robustness and harsh environments
Agents deployed in unstructured or uncontrolled environments must often cope with sporadic but strong measurement impairments that are difficult to characterize probabilistically, and which can greatly perturb the accuracy of regular algorithms. Hence, in addition to the scalability and communication performance issues, we must work toward robustness to outlier measurements. With this concern in mind, this thesis contains some work in progress in the area of robust network localization, also aiming at simple, efficient, and understandable solutions.
1.2 Objectives and contributions
The focus of this work is to develop approaches to the network localization problem that lead to efficient, simple, and intuitive distributed algorithms, always seeking a rigorous performance analysis and provable convergence guarantees. Whenever we design a convexified proxy for the problem, we try to produce a bound on the optimality gap of the resulting estimate, also in order to understand how we can better tailor the method to the specific problem, making the most of the problem structure. When designing nonconvex refinement solutions, we aim at both performance and guaranteed convergence. The broad goals of this thesis are to:
• Study optimization strategies to ensure accuracy and simplicity;
• Consider probabilistic tools to analyze performance in highly unstructured procedures.

Thus, the main contributions of this thesis are the following:

• A distributed network localization algorithm that requires neither parameter tuning nor initialization in the vicinity of the solution. This algorithm comes in two flavors: a synchronous, parallel one and an asynchronous one. Convergence guarantees, optimality bounds, and iteration complexity are provided for both. The main body of this work was accepted for publication in the IEEE Transactions on Signal Processing [7].
• A distributed network localization algorithm with no parameter tuning, directly addressing the nonconvex maximum-likelihood estimation problem for Gaussian noise. We provide convergence guarantees to a stationary point, capitalizing on the properties of the Majorization-Minimization (MM) framework. This work was presented at GlobalSIP 2014 [8].
• A novel, tight majorization function specially crafted for the nonconvex maximum-likelihood problem for Gaussian noise, whose preliminary experimental results show a substantial improvement in performance over the general-purpose quadratic majorizer. The working draft for this paper is [9].
• A probabilistic approach to the problem, relying on the framework of graphical models. We re-derive the standard LP relaxation for the maximum a posteriori problem and, capitalizing on this reformulation, we propose an SDP relaxation which is tighter and shows better results in the localization problem. We also propose a descent method requiring no initialization, based on least squares and the majorization-minimization framework, which is more accurate than a coordinate descent method.
• A novel convex relaxation for a robust formulation of the discrepancy function arising from the maximum-likelihood problem for Gaussian noise. Instead of considering the square of the difference between acquired measurements and distances of estimated points, we consider the Huber M-estimator. This leads to an interesting improvement in performance under Gaussian and outlier noise, while preserving the maximum-likelihood properties within a prescribed region. The working draft of this paper is [10].

A common denominator of the contributions in this thesis is the exploitation of problem structure, crafting solutions which are intuitive, simple, and robust.
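The robust contribution above replaces the squared residual with the Huber M-estimator, which keeps the quadratic (maximum-likelihood) behavior for small residuals and grows only linearly beyond a prescribed region. A minimal sketch of this standard loss; the residual values and the parameter value R = 1 are illustrative, not taken from the thesis:

```python
def huber(r, R=1.0):
    """Huber M-estimator: quadratic for |r| <= R, linear beyond."""
    a = abs(r)
    return 0.5 * r * r if a <= R else R * (a - 0.5 * R)

# A small residual is penalized exactly like the (scaled) squared loss,
# while a large residual (a likely outlier) contributes only linearly.
small, large = huber(0.5), huber(10.0)   # 0.125 and 9.5
```

For comparison, the squared loss 0.5 r² would assign 50 to the residual of 10, so a single outlier would dominate the cost; the Huber loss caps its influence while leaving well-behaved measurements untouched.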
1.3 Agent localization on a network
We represent mathematically the network of agents as an undirected graph $G = (V, E)$, where the node set $V = \{1, 2, \ldots, n\}$ designates agents with unknown positions. An edge $i \sim j \in E$
between sensors $i$ and $j$ means there is a noisy range measurement between nodes $i$ and $j$, known to both, and that $i$ and $j$ can communicate with each other. Anchors are elements with known positions, collected in the set $\mathcal{A} = \{1, \ldots, m\}$. For each agent $i \in V$, we define $\mathcal{A}_i \subset \mathcal{A}$ as the subset of anchors with a quantified range measurement to agent $i$. The set $N_i$ collects the neighboring agents of node $i$. Let $\mathbb{R}^p$ be the space of interest ($p = 2$ for planar networks, and $p = 3$ otherwise). We denote by $x_i \in \mathbb{R}^p$ the position of agent $i$, and by $d_{ij}$ the noisy range measurement between agents $i$ and $j$, available at both $i$ and $j$. Anchor positions are denoted by $a_k \in \mathbb{R}^p$. We let $r_{ik}$ denote the noisy range measurement between agent $i$ and anchor $k$, available at agent $i$. The distributed network localization problem addressed in this thesis consists in estimating the agents’ positions $x = \{x_i : i \in V\}$ from the available measurements $\{d_{ij} : i \sim j\} \cup \{r_{ik} : i \in V, k \in \mathcal{A}_i\}$, through collaborative message passing between neighboring agents in the communication graph $G$. Under the assumption of zero-mean, independent and identically distributed, additive Gaussian measurement noise, the maximum-likelihood estimator for the agent positions is the solution of the optimization problem

$$\operatorname*{minimize}_{x} \; f(x), \tag{1.1}$$

where

$$f(x) = \sum_{i \sim j} \frac{1}{2}\bigl(\|x_i - x_j\| - d_{ij}\bigr)^2 \;+\; \sum_{i} \sum_{k \in \mathcal{A}_i} \frac{1}{2}\bigl(\|x_i - a_k\| - r_{ik}\bigr)^2.$$
Problem (1.1) is nonconvex and NP-hard for generic network configurations [11]¹. Problem (1.1) is similar to the multidimensional scaling (MDS) problem presented by Costa et al. in [13]. In classical MDS, one must have access to distance measurements between all nodes, or be able to estimate them. To circumvent this in a sparse network, such as the geometric topology of sensor networks, the authors write the MDS problem as

$$\operatorname*{minimize}_{x} \; \sum_{i \in V} \sum_{j > i} w_{ij}\bigl(\|x_i - x_j\| - d_{ij}\bigr)^2 \;+\; \sum_{k \in \mathcal{A}} w_{ik}\bigl(\|x_i - a_k\| - r_{ik}\bigr)^2,$$

where the weights $w_{ij}$ are zero whenever there is no measurement between nodes $i$ and $j$, and otherwise positive (chosen by the user), reflecting how accurate the range measurements $d_{ij}$ are.
¹In [12] the authors prove that highly dense networks in $\mathbb{R}^2$ (with edge set cardinality $|E| \geq 2|V| + \frac{|V|(|V|+1)}{2}$) can be localized in polynomial time. This density would correspond to an average node degree $\langle k \rangle \geq 5 + |V|$, which is not realistic in practice.
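The cost in (1.1) is simple to evaluate from the measurement sets; the sketch below computes $f$ for a toy planar network (all positions and measurement values are made up for illustration, and the weighted MDS variant above would only replace the 1/2 factors by the user-chosen $w_{ij}$):

```python
import numpy as np

def ml_cost(x, a, edges, d, anchor_links, r):
    """Maximum-likelihood cost f(x) of problem (1.1).

    x: dict agent -> position, a: dict anchor -> position,
    edges: pairs (i, j) with range measurement d[i, j],
    anchor_links: pairs (i, k) with range measurement r[i, k]."""
    f = 0.0
    for i, j in edges:
        f += 0.5 * (np.linalg.norm(x[i] - x[j]) - d[i, j]) ** 2
    for i, k in anchor_links:
        f += 0.5 * (np.linalg.norm(x[i] - a[k]) - r[i, k]) ** 2
    return f

# Toy planar network: two agents and one anchor (hypothetical values).
x = {1: np.array([0.0, 0.0]), 2: np.array([1.0, 0.0])}
a = {1: np.array([0.0, 1.0])}
d = {(1, 2): 1.0}    # noiseless agent-agent measurement: zero residual
r = {(1, 1): 1.2}    # noisy agent-anchor measurement: residual of 0.2
cost = ml_cost(x, a, [(1, 2)], d, [(1, 1)], r)   # 0.5 * 0.2**2 = 0.02
```

Each term contributes only when the estimated inter-point distance disagrees with the corresponding measurement, which is why the single noisy anchor link accounts for the entire cost here.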
2 Distributed network localization without initialization: Tight convex underestimator-based procedure

Contents
   2.1 Contributions
   2.2 Related work
   2.3 Convex underestimator
   2.4 Distributed sensor network localization
      2.4.1 Gradient and Lipschitz constant of f̂
      2.4.2 Parallel method
      2.4.3 Asynchronous method
   2.5 Analysis
      2.5.1 Quality of the convexified problem
      2.5.2 Parallel method: convergence guarantees and iteration complexity
      2.5.3 Asynchronous method: convergence guarantees and iteration complexity
   2.6 Numerical experiments
      2.6.1 Assessment of the convex underestimator performance
      2.6.2 Performance of distributed optimization algorithms
      2.6.3 Performance of the asynchronous algorithm
   2.7 Proofs
      2.7.1 Convex envelope
      2.7.2 Lipschitz constant of ∇φ_{Bij}
      2.7.3 Auxiliary Lemmas
      2.7.4 Theorems
   2.8 Summary and further extensions
      2.8.1 Heterogeneous data fusion application
After deploying a network of agents, we would like to estimate the optimal configuration given a set of acquired range measurements and a (small) set of reference positions. As we have seen, solving (1.1) is, in general, NP-hard, and any polynomial-time algorithm can only aspire to deliver a local minimizer; the returned estimate will be initialization-dependent, but at this point of the network’s operation any meaningful initialization might be impossible. For such scenarios, where there is no possibility of producing a valuable hint of the true agent positions, one might turn to an approximation of problem (1.1) that can be globally minimized and, at the same time, captures the “main shape” of the original problem, in particular the location of its global minimizer. Such an estimate can have low precision, but it may be sufficient for some practical purposes. Even when it is not, it provides invaluable information to initialize a descent algorithm on the nonconvex problem (1.1) to refine the solution. The work described in this chapter was published in the IEEE Transactions on Signal Processing. Some new, unsubmitted material was added to this thesis, as signaled in the text.
2.1 Contributions
We propose a convex underestimator of the maximum-likelihood cost for the sensor network localization problem (1.1), based on the convex envelopes of its terms. We also obtain a simple bound for the optimality gap given an estimate produced by the algorithm. We present an optimal synchronous and parallel algorithm to minimize this convex underestimator, with proven convergence guarantees. We also propose an asynchronous variant of this algorithm, prove that it converges almost surely, and analyze its iteration complexity. Moreover, we demonstrate the superior performance of our algorithms through computer simulations; we compared several aspects of our method with [3], [6], and [14], and our approach always yields better performance metrics. When compared with the method in [3], which operates under the same conditions, our method outperforms it by one order of magnitude in accuracy and in communication volume.
2.2 Related work
Reference [15] proposes a parallel distributed algorithm to minimize a discrepancy function based on squared distances, which is known to amplify the effect of large-variance noise. Also, each element in the sensor network must solve a second-order cone program at each algorithm iteration, which can be a demanding task for the simple hardware used in such networks. Furthermore, the formal convergence properties of the algorithm are not established. The work in [16] considers network localization outside a maximum-likelihood framework. The approach is not parallel, operating sequentially through layers of nodes: neighbors of anchors estimate their positions and become anchors themselves, making it possible in turn for their neighbors to estimate their positions, and so on. Position estimation is based on planar geometry-based heuristics. In [17], the authors propose an algorithm with assured asymptotic convergence, but the solution is computationally complex since a triangulation set must be calculated, and matrix operations are pervasive. Furthermore, in order to attain good accuracy, a large number of range measurement rounds must be acquired, one per iteration of the algorithm, thus increasing energy expenditure. On the other hand, the algorithm presented in [18], based on the nonlinear Gauss-Seidel framework, has a pleasingly simple implementation, combined with convergence guarantees inherited from the framework. Notwithstanding, this algorithm is sequential, i.e., nodes perform their calculations in turn, not in a parallel fashion. This entails the existence of a network-wide coordination procedure to precompute the processing schedule upon startup, or whenever a node joins or leaves the network. The sequential nature of the work in [18] was superseded by the work in [3], which puts forward a parallel method based on two consecutive relaxations of the maximum-likelihood estimator in (1.1). The first relaxation is a semidefinite program with a rank relaxation, while the second is an edge-based relaxation, best suited for the Alternating Direction Method of Multipliers (ADMM). The main drawbacks are the amount of communication required to manage the local copies of the ADMM edge variables and the prohibitive complexity of the problem at each node. In fact, each one of the simple sensing units must solve a semidefinite program at each ADMM iteration, and after the update, copies of the edge variables must be exchanged with each neighbor. A simpler approach was devised in [14] by extending the source localization Projection Onto Convex Sets algorithm in [19] to the problem of sensor network localization. The proposed method is sequential, activating nodes one at a time according to a predefined cyclic schedule; thus, it does not take advantage of the parallel nature of the network and imposes a stringent timetable for individual node activity.
2.3 Convex underestimator
Problem (1.1) can be written as
$$\underset{x}{\text{minimize}} \quad \sum_{i\sim j} \frac{1}{2}\, d_{S_{ij}}^2(x_i - x_j) + \sum_{i} \sum_{k \in A_i} \frac{1}{2}\, d_{S_{a_{ik}}}^2(x_i), \qquad (2.1)$$
where $d_C^2(x)$ represents the squared Euclidean distance of point $x$ to the set $C$, i.e.,
$$d_C^2(x) = \inf_{y \in C} \|x - y\|^2.$$
Figure 2.1: Illustration of the convex envelope for intersensor terms of the nonconvex cost function (2.1). The squared distance to the ball $B_{ij}$ (dotted line) is the convex hull of the squared distance to the sphere $S_{ij}$ (dashed line). In this one-dimensional example the value of the range measurement is $d_{ij} = 0.5$.

The sets $S_{ij}$ and $S_{a_{ik}}$ are defined as the spheres generated by the noisy measurements $d_{ij}$ and $r_{ik}$:
$$S_{ij} = \{z : \|z\| = d_{ij}\}, \qquad S_{a_{ik}} = \{z : \|z - a_k\| = r_{ik}\}.$$
Nonconvexity of (2.1) follows from the nonconvexity of the building block
$$\frac{1}{2}\, d_{S_{ij}}^2(z) = \frac{1}{2} \inf_{\|y\| = d_{ij}} \|z - y\|^2. \qquad (2.2)$$
A simple convexification consists in replacing it by
$$\frac{1}{2}\, d_{B_{ij}}^2(z) = \frac{1}{2} \inf_{\|y\| \le d_{ij}} \|z - y\|^2, \qquad (2.3)$$
where $B_{ij} = \{z \in \mathbb{R}^p : \|z\| \le d_{ij}\}$ is the convex hull of $S_{ij}$. Actually, (2.3) is the convex envelope¹ of (2.2). This fact is illustrated in Figure 2.1 with a one-dimensional example; a formal proof for the generic case is given in Section 2.7.1. The terms of (2.1) associated with anchor measurements are similarly relaxed as
$$d_{B_{a_{ik}}}^2(z) = \inf_{\|y - a_k\| \le r_{ik}} \|z - y\|^2, \qquad (2.4)$$
where the set $B_{a_{ik}}$ is the convex hull of $S_{a_{ik}}$: $B_{a_{ik}} = \{z \in \mathbb{R}^p : \|z - a_k\| \le r_{ik}\}$. Replacing the nonconvex terms in (2.1) by (2.3) and (2.4), we obtain the convex problem
$$\underset{x}{\text{minimize}} \quad \hat f(x) = \sum_{i\sim j} \frac{1}{2}\, d_{B_{ij}}^2(x_i - x_j) + \sum_i \sum_{k \in A_i} \frac{1}{2}\, d_{B_{a_{ik}}}^2(x_i). \qquad (2.5)$$
The function in Problem (2.5) is an underestimator of (2.1), but it is not the convex envelope of the original function. We argue that in our application of sensor network localization it is generally a very good approximation whose sub-optimality can be quantified, as discussed in Section 2.5.1.

¹The convex envelope (or convex hull) of a function $\gamma$ is its best possible convex underestimator, i.e., $\operatorname{conv} \gamma(x) = \sup\{\eta(x) : \eta \le \gamma,\ \eta \text{ is convex}\}$, and is hard to determine in general.
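The relaxation (2.2) to (2.3) is easy to probe numerically in the one-dimensional setting of Figure 2.1. The sketch below (function names are ours, not from the thesis software) evaluates both squared distances and checks the envelope property: the ball term never exceeds the sphere term, and the two agree outside the ball.

```python
# 1D illustration of the relaxation (2.2) -> (2.3): squared distance to the
# sphere S = {y : |y| = d} versus its convex envelope, the squared distance
# to the ball B = {y : |y| <= d}.

def d2_sphere(z: float, d: float) -> float:
    """Squared distance of z to the sphere {y : |y| = d}."""
    return (abs(z) - d) ** 2

def d2_ball(z: float, d: float) -> float:
    """Squared distance of z to the ball {y : |y| <= d} (zero inside)."""
    return max(abs(z) - d, 0.0) ** 2

d = 0.5  # range measurement, as in Figure 2.1
for z in [-1.0, -0.2, 0.0, 0.3, 0.5, 0.8]:
    assert d2_ball(z, d) <= d2_sphere(z, d)       # underestimator everywhere
    if abs(z) >= d:
        assert d2_ball(z, d) == d2_sphere(z, d)   # tight outside the ball
```

The flat region of `d2_ball` inside the ball is exactly where the convexified and original costs differ, which is what the optimality-gap analysis of Section 2.5.1 exploits.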
The cost function (2.5) also appears in [14], albeit via a different reasoning; our convexification mechanism seems more intuitive. But the striking difference with respect to [14] is how (2.5) is exploited to generate distributed solution methods. Whereas [14] lays out a sequential block-coordinate approach, we show that (2.5) is amenable to distributed solutions either via the fast Nesterov gradient method (for synchronous implementations) or via exact/inexact randomized block-coordinate methods (for asynchronous implementations).
2.4 Distributed sensor network localization
We propose two distributed algorithms: a synchronous one, where nodes work in parallel, and an asynchronous, gossip-like algorithm, where each node starts its processing step according to some probability distribution. Both algorithms entail computing the gradient of the cost function and its Lipschitz constant. In order to achieve this it is convenient to rewrite Problem (2.5) as
$$\underset{x}{\text{minimize}} \quad \frac{1}{2}\, d_{B}^2(Ax) + \sum_i \sum_{k \in A_i} \frac{1}{2}\, d_{B_{a_{ik}}}^2(x_i), \qquad (2.6)$$
where $A = C \otimes I_p$, $C$ is the arc-node incidence matrix of $\mathcal{G}$, $I_p$ is the identity matrix of size $p$, and $B$ is the Cartesian product of the balls $B_{ij}$ corresponding to all the edges in $\mathcal{E}$. We denote the two terms in (2.6) as
$$g(x) = \frac{1}{2}\, d_B^2(Ax), \qquad h(x) = \sum_i h_i(x_i), \quad \text{where } h_i(x_i) = \sum_{k \in A_i} \frac{1}{2}\, d_{B_{a_{ik}}}^2(x_i).$$
Problems (2.5) and (2.6) are equivalent since $Ax$ is the vector $(x_i - x_j : i \sim j)$, and the function $g(x)$ in (2.6) can be written as
$$g(x) = \frac{1}{2}\, d_B^2(Ax) = \inf_{y \in B} \frac{1}{2}\|Ax - y\|^2 = \inf_{\|y_{ij}\| \le d_{ij}} \frac{1}{2}\sum_{i\sim j} \|x_i - x_j - y_{ij}\|^2.$$
As all the terms are non-negative and the constraint set is a Cartesian product, we can exchange the infimum with the summation, resulting in
$$g(x) = \frac{1}{2}\sum_{i\sim j} \inf_{\|y_{ij}\| \le d_{ij}} \|x_i - x_j - y_{ij}\|^2 = \sum_{i\sim j} \frac{1}{2}\, d_{B_{ij}}^2(x_i - x_j),$$
which is the corresponding term in (2.5).
2.4.1 Gradient and Lipschitz constant of $\hat f$

To simplify notation, let us define the functions
$$\varphi_{B_{ij}}(z) = \frac{1}{2}\, d_{B_{ij}}^2(z), \qquad \varphi_{B_{a_{ik}}}(z) = \frac{1}{2}\, d_{B_{a_{ik}}}^2(z).$$
Now we call on a key result from convex analysis (see [20, Prop. X.3.2.2, Th. X.3.2.3]): the function in (2.3), $\varphi_{B_{ij}}(z) = \frac{1}{2} d_{B_{ij}}^2(z)$, is convex and differentiable, and its gradient is
$$\nabla \varphi_{B_{ij}}(z) = z - P_{B_{ij}}(z), \qquad (2.7)$$
where $P_{B_{ij}}(z)$ is the orthogonal projection of point $z$ onto the closed convex set $B_{ij}$:
$$P_{B_{ij}}(z) = \underset{y \in B_{ij}}{\operatorname{argmin}} \|z - y\|.$$
Further, the function $\varphi_{B_{ij}}$ has a Lipschitz continuous gradient with constant $L_{\varphi} = 1$, i.e.,
$$\|\nabla \varphi_{B_{ij}}(x) - \nabla \varphi_{B_{ij}}(y)\| \le \|x - y\|. \qquad (2.8)$$
We show (2.8) in Section 2.7.2. Let us define the function $\varphi_B$, obtained by summing all functions $\varphi_{B_{ij}}$. Then $g(x) = \varphi_B(Ax)$. From this relation, and using (2.7), we can compute the gradient of $g(x)$:
$$\nabla g(x) = A^\top \nabla \varphi_B(Ax) = A^\top (Ax - P_B(Ax)) = \mathbf{L}x - A^\top P_B(Ax), \qquad (2.9)$$
where the second equality follows from (2.7) and $\mathbf{L} = A^\top A = L \otimes I_p$, with $L$ being the Laplacian matrix of $\mathcal{G}$. This gradient is Lipschitz continuous, and we can obtain an easily computable Lipschitz constant $L_g$ as follows:
$$\begin{aligned}
\|\nabla g(x) - \nabla g(y)\| &= \|A^\top (\nabla \varphi_B(Ax) - \nabla \varphi_B(Ay))\| \\
&\le |||A|||\, \|Ax - Ay\| \le |||A|||^2\, \|x - y\| \\
&= \lambda_{\max}(A^\top A)\, \|x - y\| \overset{(a)}{=} \lambda_{\max}(L)\, \|x - y\| \le \underbrace{2\delta_{\max}}_{L_g} \|x - y\|,
\end{aligned} \qquad (2.10)$$
where $|||A|||$ is the maximum singular value norm; equality (a) is a consequence of Kronecker product properties. In (2.10) we denote the maximum node degree of $\mathcal{G}$ by $\delta_{\max}$. A proof of the bound $\lambda_{\max}(L) \le 2\delta_{\max}$ can be found in [21].²
The gradient of $h$ is $\nabla h(x) = (\nabla h_1(x_1), \ldots, \nabla h_n(x_n))$, where the gradient of each $h_i$ is
$$\nabla h_i(x_i) = \sum_{k \in A_i} \nabla \varphi_{B_{a_{ik}}}(x_i). \qquad (2.11)$$

²A tighter bound would be $\lambda_{\max}(L) \le \max_{i\sim j}\{\delta_i + \delta_j - c(i,j)\}$, where $\delta_i$ is the degree of node $i$ and $c(i,j)$ is the number of vertices adjacent to both $i$ and $j$ [22, Th. 4.13]; nevertheless, $2\delta_{\max}$ is easier to compute in a distributed way.
The gradient of $h$ is also Lipschitz continuous. The constants $L_{h_i}$ for $\nabla h_i$ follow from
$$\|\nabla h_i(x_i) - \nabla h_i(y_i)\| \le \sum_{k \in A_i} \|\nabla \varphi_{B_{a_{ik}}}(x_i) - \nabla \varphi_{B_{a_{ik}}}(y_i)\| \le |A_i|\, \|x_i - y_i\|, \qquad (2.12)$$
where $|C|$ is the cardinality of set $C$. We now have an overall constant $L_h$ for $\nabla h$:
$$\|\nabla h(x) - \nabla h(y)\| = \sqrt{\sum_i \|\nabla h_i(x_i) - \nabla h_i(y_i)\|^2} \le \sqrt{\sum_i |A_i|^2 \|x_i - y_i\|^2} \le \underbrace{\max(|A_i| : i \in \mathcal{V})}_{L_h} \|x - y\|. \qquad (2.13)$$
We are now able to write $\nabla \hat f$, the gradient of our cost function, as
$$\nabla \hat f(x) = \mathbf{L}x - A^\top P_B(Ax) + \begin{pmatrix} \sum_{k \in A_1} x_1 - P_{B_{a_{1k}}}(x_1) \\ \vdots \\ \sum_{k \in A_n} x_n - P_{B_{a_{nk}}}(x_n) \end{pmatrix}. \qquad (2.14)$$
A Lipschitz constant $L_{\hat f}$ is, thus,
$$L_{\hat f} = 2\delta_{\max} + \max(|A_i| : i \in \mathcal{V}). \qquad (2.15)$$
This constant is easy to precompute through in-network processing by, e.g., a diffusion algorithm (cf. [23, Ch. 9] for more information). Although we restricted ourselves to a fixed (time-invariant) topology, it is easy to show, by taking the worst-case scenario, that the computation of a Lipschitz constant is also possible for time-varying topologies. For this worst-case constant we replace the maximum node degree, $\delta_{\max}$, and the maximum number of anchors connected to a single node, $\max(|A_i| : i \in \mathcal{V})$, by the corresponding values for the complete network, resulting in $L_{\hat f} = 2(|\mathcal{V}| - 1) + |\mathcal{A}|$. The algorithm would be slower, but the constant would remain valid for any topology. In summary, we can compute the gradient of $\hat f$ using Equation (2.14) and a Lipschitz constant by (2.15), which leads us to the algorithms described in Sections 2.4.2 and 2.4.3 for minimizing $\hat f$.
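As a concrete illustration, the gradient (2.14) and the bound (2.15) reduce to a few ball projections per node. The sketch below is a centralized simulation of this computation; the data layout and all function names are ours, not part of the thesis software.

```python
import numpy as np

def proj_ball(z, center, radius):
    """Orthogonal projection of z onto the ball {y : ||y - center|| <= radius}."""
    v = z - center
    n = np.linalg.norm(v)
    return z.copy() if n <= radius else center + (radius / n) * v

def grad_fhat(x, edges, d, anchors):
    """Gradient (2.14) of the relaxed cost f-hat, assembled per node.

    x: {node: position array}; edges: [(i, j)] with ranges d[(i, j)];
    anchors: {node: [(anchor_position, range), ...]} (lists may be empty)."""
    g = {i: np.zeros_like(xi) for i, xi in x.items()}
    for (i, j) in edges:
        z = x[i] - x[j]
        p = proj_ball(z, np.zeros_like(z), d[(i, j)])
        g[i] += z - p            # node i's share of Lx - A^T P_B(Ax)
        g[j] -= z - p            # node j sees the symmetric term
    for i, alist in anchors.items():
        for (a, rik) in alist:   # anchor terms of (2.14)
            g[i] += x[i] - proj_ball(x[i], np.asarray(a, dtype=float), rik)
    return g

def lipschitz_bound(degrees, anchors):
    """Lipschitz constant (2.15): 2 * delta_max + max_i |A_i|."""
    return 2 * max(degrees) + max(len(a) for a in anchors.values())
```

With consistent (noise-free) data the gradient vanishes, as expected from the zero set of (2.5).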
2.4.2 Parallel method

Since $\hat f$ has a Lipschitz continuous gradient, we can follow Nesterov's optimal method [24]. Our approach is detailed in Algorithm 1. In Step 7, $c_{(i\sim j,i)}$ is the entry $(i\sim j, i)$ of the arc-node incidence matrix $C$, and $\delta_i$ is the degree of node $i$.
Algorithm 1 Parallel method
Input: $L_{\hat f}$; $\{d_{ij} : i\sim j \in \mathcal{E}\}$; $\{r_{ik} : i \in \mathcal{V}, k \in \mathcal{A}\}$
Output: $\hat x$
1: $k = 0$;
2: each node $i$ chooses random $x_i(0) = x_i(-1)$;
3: while some stopping criterion is not met, each node $i$ do
4: $\quad k = k + 1$;
5: $\quad w_i = x_i(k-1) + \frac{k-2}{k+1}\bigl(x_i(k-1) - x_i(k-2)\bigr)$;
6: $\quad$ node $i$ broadcasts $w_i$ to its neighbors;
7: $\quad \nabla g_i(w_i) = \delta_i w_i - \sum_{j \in N_i} w_j + \sum_{j \in N_i} c_{(i\sim j,i)} P_{B_{ij}}(w_i - w_j)$;
8: $\quad \nabla h_i(w_i) = \sum_{k \in A_i} \bigl(w_i - P_{B_{a_{ik}}}(w_i)\bigr)$;
9: $\quad x_i(k) = w_i - \frac{1}{L_{\hat f}}\bigl(\nabla g_i(w_i) + \nabla h_i(w_i)\bigr)$;
10: end while
11: return $\hat x = x(k)$
Parallel nature of Algorithm 1. The updates in Step 9 of the algorithm require the computation of the gradient of the cost with respect to the position of node $i$. This corresponds to the $i$-th entry of $\nabla \hat f$, given in (2.14). The last summand in (2.14) is simply $\nabla h(x)$, and the $i$-th entry of $\nabla h(x)$ is given in (2.11). This can easily be computed independently by each node. The $i$-th entry of $\mathbf{L}x$ can be computed by node $i$ from its current position estimate and the position estimates of its neighbors; in particular, it holds that $(\mathbf{L}x)_i = \delta_i x_i - \sum_{j \in N_i} x_j$. The less obvious parallel term is $A^\top P_B(Ax)$. We start the analysis with the concatenated projections $P_B(Ax) = \{P_{B_{ij}}(x_i - x_j)\}_{i\sim j \in \mathcal{E}}$. Each one of these projections depends only on the edge terminals and the noisy measurement $d_{ij}$. The product with $A^\top$ will collect, at the entries corresponding to each node, the sum of the projections relative to the edges where it intervenes, with a positive or negative sign depending on the arbitrary edge direction agreed upon at the onset of the algorithm. More specifically, $(A^\top P_B(Ax))_i = \sum_{j \in N_i} c_{(i\sim j,i)} P_{B_{ij}}(x_i - x_j)$, as presented in Step 7 of Algorithm 1.
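Algorithm 1 can be simulated centrally on a toy instance. The sketch below runs the extrapolation and gradient steps (Steps 5 and 9) on a hypothetical one-dimensional network with two sensors and two anchors and noise-free ranges; the instance, step count, and names are ours, chosen only for illustration.

```python
import numpy as np

# Toy 1D instance: anchors at 0.0 and 3.0; noise-free ranges:
# sensor-sensor 1.0, and each sensor measures range 1.0 to its anchor.
d12 = 1.0
anchor = [(0.0, 1.0), (3.0, 1.0)]   # (anchor position, range) per sensor

def proj(z, lo, hi):                # 1D ball projection = clipping to [lo, hi]
    return min(max(z, lo), hi)

def fhat(x):
    """Relaxed cost (2.5) for the toy instance."""
    z = x[0] - x[1]
    c = 0.5 * (z - proj(z, -d12, d12)) ** 2
    for i, (a, r) in enumerate(anchor):
        c += 0.5 * (x[i] - proj(x[i], a - r, a + r)) ** 2
    return c

def grad(x):
    """Per-node gradient, cf. (2.14)."""
    g = np.zeros(2)
    e = (x[0] - x[1]) - proj(x[0] - x[1], -d12, d12)   # edge term
    g[0] += e
    g[1] -= e
    for i, (a, r) in enumerate(anchor):
        g[i] += x[i] - proj(x[i], a - r, a + r)        # anchor terms
    return g

L = 2 * 1 + 1                       # bound (2.15): max degree 1, one anchor each
x_prev = x = np.array([5.0, -4.0])  # arbitrary initialization (Step 2)
for k in range(1, 1001):
    w = x + (k - 2) / (k + 1) * (x - x_prev)   # extrapolation (Step 5)
    x_prev, x = x, w - grad(w) / L             # gradient step (Step 9)
```

For this instance the relaxed cost has a unique minimizer at positions (1, 2), and the iterates approach it at the $O(k^{-2})$ rate discussed in Section 2.5.2.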
2.4.3 Asynchronous method

The method described in Algorithm 1 is fully parallel but still depends on some synchronization between all the nodes, so that their updates of the gradient are consistent. This requirement can be inconvenient in some applications of sensor networks; to circumvent it, we present a fully asynchronous method, achieved by means of a broadcast gossip scheme (cf. [25] for an extended survey of gossip algorithms). Nodes are equipped with independent clocks ticking at random times (say, as Poisson point processes). When node $i$'s clock ticks, it performs the update of its variable $x_i$ and broadcasts the update to its neighbors. Let the order of node activation be collected in $\{\xi_k\}_{k \in \mathbb{N}}$, a sequence of
Algorithm 2 Asynchronous method
Input: $L_{\hat f}$; $\{d_{ij} : i\sim j \in \mathcal{E}\}$; $\{r_{ik} : i \in \mathcal{V}, k \in \mathcal{A}\}$
Output: $\hat x$
1: each node $i$ chooses random $x_i(0)$;
2: $k = 0$;
3: while some stopping criterion is not met, each node $i$ do
4: $\quad k = k + 1$;
5: $\quad$ if $\xi_k = i$ then
6: $\quad\quad x_i(k) = \operatorname{argmin}_{w_i} \hat f(x_1(k-1), \ldots, w_i, \ldots, x_n(k-1))$
7: $\quad$ else
8: $\quad\quad x_i(k) = x_i(k-1)$
9: $\quad$ end if
10: end while
11: return $\hat x = x(k)$
independent random variables taking values on the set $\mathcal{V}$, such that
$$P(\xi_k = i) = P_i > 0. \qquad (2.16)$$
Then, the asynchronous update of variable $x_i$ on node $i$ can be described as in Algorithm 2. To compute the minimizer in Step 6 of Algorithm 2 it is useful to recast Problem (2.6) as
$$\underset{x}{\text{minimize}} \quad \sum_i \Bigl( \frac{1}{4} \sum_{j \in N_i} d_{B_{ij}}^2(x_i - x_j) + \frac{1}{2} \sum_{k \in A_i} d_{B_{a_{ik}}}^2(x_i) \Bigr), \qquad (2.17)$$
where the factor $\frac{1}{4}$ accounts for the duplicate terms when considering summations over nodes instead of over edges. By fixing the neighbor positions, each node solves a single-source localization problem; this setup leads to the problem
$$\underset{x_i}{\text{minimize}} \quad \hat f_{sl_i}(x_i) := \frac{1}{4} \sum_{j \in N_i} d_{Bs_{ij}}^2(x_i) + \frac{1}{2} \sum_{k \in A_i} d_{B_{a_{ik}}}^2(x_i), \qquad (2.18)$$
where $Bs_{ij} = \{z \in \mathbb{R}^p : \|z - x_j\| \le d_{ij}\}$. We call the reader's attention to the fact that the function in (2.18) is continuous and coercive; thus, the optimization problem (2.18) has a solution. We solve (2.18) at each node by employing Nesterov's optimal accelerated gradient method, as described in Algorithm 3. The asynchronous method proposed in Algorithm 2 converges to the set of minimizers of the function $\hat f$, as established in Theorem 2 in Section 2.5. We also propose an inexact version in which nodes do not solve Problem (2.18) but instead take just one gradient step; that is, we simply replace Step 6 in Algorithm 2 by
$$x_i(k) = x_i(k-1) - \frac{1}{L_{\hat f}} \nabla_i \hat f(x(k-1)), \qquad (2.19)$$
where $\nabla_i \hat f(x_1, \ldots, x_n)$ is the gradient with respect to $x_i$, and assume
$$P(\xi_k = i) = \frac{1}{n}. \qquad (2.20)$$
The convergence of the resulting algorithm is established in Theorem 3, Section 2.5.
Algorithm 3 Asynchronous update at each node $i$
Input: $\xi_k$; $L_{\hat f}$; $\{d_{ij} : j \in N_i\}$; $\{r_{ik} : k \in A_i\}$
Output: $x_i(k)$
1: if $\xi_k \ne i$ then
2: $\quad x_i(k) = x_i(k-1)$;
3: $\quad$ return $x_i(k)$;
4: end if
5: choose random $z(0) = z(-1)$;
6: $l = 0$;
7: while some stopping criterion is not met do
8: $\quad l = l + 1$;
9: $\quad w = z(l-1) + \frac{l-2}{l+1}\bigl(z(l-1) - z(l-2)\bigr)$;
10: $\quad \nabla \hat f_{sl_i}(w) = \frac{1}{2}\sum_{j \in N_i} \bigl(w - P_{Bs_{ij}}(w)\bigr) + \sum_{k \in A_i} \bigl(w - P_{B_{a_{ik}}}(w)\bigr)$;
11: $\quad z(l) = w - \frac{1}{L_{\hat f}} \nabla \hat f_{sl_i}(w)$;
12: end while
13: return $x_i(k) = z(l)$
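The inexact asynchronous variant, i.e., Algorithm 2 with update (2.19) and uniform activation (2.20), can be sketched in a few lines: each wake-up performs a single coordinate gradient step. The toy instance below (two sensors, anchors at 0 and 3, all ranges 1.0) and all names are ours, not one of the thesis test networks.

```python
import random

d12 = 1.0
ball = {0: (-1.0, 1.0), 1: (2.0, 4.0)}    # 1D anchor balls B_a per sensor

def proj(z, lo, hi):
    return min(max(z, lo), hi)

def fhat(x):
    """Relaxed cost (2.5) for the toy instance."""
    z = x[0] - x[1]
    c = 0.5 * (z - proj(z, -d12, d12)) ** 2
    for i in (0, 1):
        c += 0.5 * (x[i] - proj(x[i], *ball[i])) ** 2
    return c

def grad_i(x, i):
    """Gradient of f-hat w.r.t. x_i alone, as in the inexact update (2.19)."""
    j = 1 - i
    g = (x[i] - x[j]) - proj(x[i] - x[j], -d12, d12)
    g += x[i] - proj(x[i], *ball[i])
    return g

L = 3.0                                   # Lipschitz bound (2.15)
random.seed(1)
x = [5.0, -4.0]
for _ in range(5000):
    i = random.randrange(2)               # uniform activation, cf. (2.20)
    x[i] -= grad_i(x, i) / L              # one gradient step per wake-up
```

Here the minimizer of the relaxed cost is unique, at positions (1, 2), so the random single-step iterates settle there, in line with Theorem 3.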
2.5 Analysis
A relevant question regarding Algorithms 1 and 2 is whether they will return a good solution to the problem they are designed to solve, after a reasonable amount of computations. Sections 2.5.2 and 2.5.3 address convergence issues of the proposed methods, and discuss some of the assumptions on the problem data. Section 2.5.1 provides a formal bound for the gap between the original and the convexified problems.
2.5.1 Quality of the convexified problem
While evaluating any approximation method, it is important to know how far the approximate optimum is from the original one. In this Section we will focus on this analysis. It was already noted in Section 2.3 that $\varphi_{B_{ij}}(z) = \varphi_{S_{ij}}(z)$ for $\|z\| \ge d_{ij}$; where the functions differ, for $\|z\| < d_{ij}$, we have that $\varphi_{B_{ij}}(z) = 0$. The same applies to the terms related to anchor measurements. The optimal value of the function $f$, denoted by $f^\star$, is bounded by
$$\hat f^\star = \hat f(x^\star) \le f^\star \le f(x^\star),$$
where $x^\star$ is the minimizer of the convexified problem (2.5) and $\hat f^\star = \inf_x \hat f(x)$ is the minimum of the function $\hat f$. With these inequalities we can compute a bound for the optimality
Figure 2.2: One-dimensional example of the quality of the approximation of the true nonconvex cost $f(x)$ by the convexified function $\hat f(x)$ in a star network. Here the node positioned at $x = 3$ has 3 neighbors.

gap, after (2.5) is solved, as
$$\begin{aligned}
f^\star - \hat f^\star &\le f(x^\star) - \hat f^\star \\
&= \sum_{i\sim j \in \mathcal{E}} \frac{1}{2}\Bigl(d_{S_{ij}}^2(x_i^\star - x_j^\star) - d_{B_{ij}}^2(x_i^\star - x_j^\star)\Bigr) + \sum_{i \in \mathcal{V}} \sum_{k \in A_i} \frac{1}{2}\Bigl(d_{S_{a_{ik}}}^2(x_i^\star) - d_{B_{a_{ik}}}^2(x_i^\star)\Bigr) \\
&= \sum_{i\sim j \in \mathcal{E}_2} \frac{1}{2}\, d_{S_{ij}}^2(x_i^\star - x_j^\star) + \sum_{i \in \mathcal{V}} \sum_{k \in A_{2i}} \frac{1}{2}\, d_{S_{a_{ik}}}^2(x_i^\star).
\end{aligned} \qquad (2.21)$$
In Equation (2.21) we denote the set of edges where the distance between the estimated positions is smaller than the distance measurement by $\mathcal{E}_2 = \{i\sim j \in \mathcal{E} : d_{B_{ij}}^2(x_i^\star - x_j^\star) = 0\}$, and similarly $A_{2i} = \{k \in A_i : d_{B_{a_{ik}}}^2(x_i^\star) = 0\}$. Inequality (2.21) suggests a simple method to compute a bound for the optimality gap of the solution returned by the algorithms:
1. Compute the optimal solution $x^\star$ using Algorithm 1 or 2;
2. Select the terms of the convexified problem (2.5) which are zero;
3. Add the nonconvex costs of each of these edges, as in (2.21).
Our bound is tighter than the one (available a priori) obtained by applying [26, Th. 1], which is
$$f^\star - \hat f^\star \le \sum_{i\sim j \in \mathcal{E}} \frac{1}{2}\, d_{ij}^2 + \sum_{i \in \mathcal{V}} \sum_{k \in A_i} \frac{1}{2}\, r_{ik}^2. \qquad (2.22)$$
For the one-dimensional example of the star network costs depicted in Figure 2.2, the bounds in (2.21) and (2.22), averaged over 500 Monte Carlo trials, are presented in Table 2.1. The true average gap $f^\star - \hat f^\star$ is also shown. In the Monte Carlo trials we sampled a zero-mean Gaussian random variable with $\sigma = 0.25$ and obtained a noisy range measurement as described later in (2.31).
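The three-step a posteriori bound procedure above can be sketched as a single routine; the data layout and names are ours. It sums the nonconvex (sphere) costs over the terms whose convexified (ball) cost vanishes at the returned estimate, exactly as in (2.21).

```python
import numpy as np

def gap_bound(xhat, edges, d, anchor_terms, tol=1e-9):
    """A posteriori optimality-gap bound (2.21).

    xhat: {node: position array}; edges: [(i, j)] with ranges d[(i, j)];
    anchor_terms: [(node, anchor_position, range)]."""
    bound = 0.0
    for (i, j) in edges:
        dist = np.linalg.norm(xhat[i] - xhat[j])
        if dist <= d[(i, j)] + tol:                  # ball term zero: edge in E_2
            bound += 0.5 * (dist - d[(i, j)]) ** 2   # sphere cost (1/2) d_S^2
    for (i, a, r) in anchor_terms:
        dist = np.linalg.norm(xhat[i] - np.asarray(a, dtype=float))
        if dist <= r + tol:                          # k in A_{2i}
            bound += 0.5 * (dist - r) ** 2
    return bound
```

Terms where the estimated distance exceeds the measurement contribute nothing, since there the convexified and original costs coincide.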
Table 2.1: Bounds on the optimality gap for the example in Figure 2.2

$f^\star - \hat f^\star$    Equation (2.21)    Equation (2.22)
0.0367                      0.0487             3.0871
Figure 2.3: 2D network example to assess the quality of the bound in Equation (2.21). Blue squares stand for anchors, while the red star is a sensor with unknown position.
A two-dimensional example was also produced to check whether the bound is informative in 2D as well. Our bound is of the same order of magnitude as the true optimality gap, whereas the bound in (2.22) is two orders of magnitude greater. For the simple example in Figure 2.3 we obtain the results in Table 2.2. These results show the tightness of the convexified function and how loose the bound (2.22) is when applied to our problem.
2.5.2 Parallel method: convergence guarantees and iteration complexity

As Problem (2.6) is convex and the cost function has a Lipschitz continuous gradient, Algorithm 1 is known to converge at the optimal rate $O(k^{-2})$ [24], [27]:
$$\hat f(x(k)) - \hat f^\star \le \frac{2 L_{\hat f}}{(k+1)^2}\, \|x(0) - x^\star\|^2.$$
Table 2.2: Bounds on the optimality gap for the 2D example in Figure 2.3

$f^\star - \hat f^\star$    Equation (2.21)    Equation (2.22)
4.5801                      6.0899             384.1226
2.5.3 Asynchronous method: convergence guarantees and iteration complexity

To state the convergence properties of Algorithm 2 we only need Assumption 1; it is used to prove coerciveness of the relaxed cost in (2.5).

Assumption 1. There is at least one anchor linked to some sensor, and the graph $\mathcal{G}$ is connected (there is a path between any two sensors).

This assumption generally holds, as one needs $p+1$ anchors to eliminate translation, rotation, and flip ambiguities when performing localization in $\mathbb{R}^p$, which exceeds the assumption's requirement. We present two convergence results, Theorem 2 and Theorem 3, and the iteration complexity analysis for Algorithm 2 in Proposition 4. Proofs of the Theorems are detailed in Section 2.7. The following Theorem establishes the almost sure (a.s.) convergence of Algorithm 2.

Theorem 2 (Almost sure convergence of Algorithm 2). Let $\{x(k)\}_{k \in \mathbb{N}}$ be the sequence of points produced by Algorithm 2, or by Algorithm 2 with the update (2.19), and let $X^\star = \{x^\star : \hat f(x^\star) = \hat f^\star\}$ be the set of minimizers of the function $\hat f$ defined in (2.5). Then it holds that
$$\lim_{k \to \infty} d_{X^\star}(x(k)) = 0 \quad \text{a.s.} \qquad (2.23)$$
In words, with probability one, the iterates $x(k)$ will approach the set $X^\star$ of minimizers of $\hat f$; this does not imply that $\{x(k)\}_{k \in \mathbb{N}}$ will converge to one single $x^\star \in X^\star$, but it does imply that $\lim_{k \to \infty} \hat f(x(k)) = \hat f^\star$, since $X^\star$ is a compact set, as proved in Section 2.7, Lemma 5.
Theorem 3 (Almost sure convergence to a point). Let $\{x(k)\}_{k \in \mathbb{N}}$ be a sequence of points generated by Algorithm 2, with the update (2.19) in Step 6, and let all nodes start computations with uniform probability. Then, with probability one, there exists a minimizer of $\hat f$, denoted by $x^\star \in X^\star$, such that
$$x(k) \to x^\star. \qquad (2.24)$$
This result not only tells us that the iterates of Algorithm 2 with the modified Step 6 stated in Equation (2.19) converge to the solution set, but also guarantees that they will not keep jumping around the solution set $X^\star$ (unlikely to occur in Algorithm 2, but not ruled out by the analysis). One of the practical benefits of Theorem 3 is that the stopping criterion can safely probe the stability of the estimates along iterations. To the best of our knowledge, this strong type of convergence (the whole sequence converges to a point in $X^\star$) was not previously established in the context of randomized approaches for convex functions with Lipschitz continuous gradients, though it was derived previously for randomized proximal-based minimizations of a large number of convex functions, cf. [28, Proposition 9].

Proposition 4 (Iteration complexity for Algorithm 2). Let $\{x(k)\}_{k \in \mathbb{N}}$ be a sequence of points generated by Algorithm 2, with the update (2.19) in Step 6, and let the nodes be activated with equal probability. Choose $0 < \epsilon < \hat f(x(0)) - \hat f^\star$ and $\rho \in (0, 1)$. There exists a constant $b(\rho, x(0))$ such that
$$P\bigl(\hat f(x(k)) - \hat f^\star \le \epsilon\bigr) \ge 1 - \rho \qquad (2.25)$$
for all
$$k \ge K = \frac{2 n\, b(\rho, x(0))}{\epsilon} + 2 - n. \qquad (2.26)$$
The constant $b(\rho, x(0))$ can be computed from inequality (19) in [29]; it depends only on the initialization and the chosen $\rho$. We remind the reader that $n$ is the number of sensor nodes. Proposition 4 says that, with high probability, the function value $\hat f(x(k))$ for all $k \ge K$ will be at a distance no larger than $\epsilon$ from the optimum, and that the number of iterations $K$ depends inversely on the chosen $\epsilon$.

Proof of Proposition 4. As $\hat f$ is differentiable and has a Lipschitz continuous gradient, the result trivially follows from [29, Th. 2].

A natural question to pose is whether the strong convergence properties of the inexact version still apply to the exact version. Actually, we can disprove this with a small toy example.

A toy example. The explanation for this counter-intuitive phenomenon lies in the lack of uniqueness of minimizers of the function in (2.18), as this function is not necessarily strictly convex at all iterations. The ambiguity in selecting minimizers across iterations may generate oscillations. Consider a network localization problem in $\mathbb{R}$ with two anchors placed at 0 and 3, and two nodes placed at 1 and 2. Assume that node 1 measures its distance to anchor 1 with no noise, node 2 measures its distance to anchor 2 with no noise, and nodes 1 and 2 measure their mutual distance (with noise) as 1.2. The problem we face is the minimization of
$$f(x_1, x_2) = \frac{1}{2}\, d_B^2(x_1 - x_2) + \frac{1}{2}\, d_{A_1}^2(x_1) + \frac{1}{2}\, d_{A_2}^2(x_2), \qquad (2.27)$$
where $A_1 = [-1, 1]$, $A_2 = [2, 4]$, and $B = [-1.2, 1.2]$. Consider the initialization $x_2(0) = 2.6$ and assume that the nodes minimize (2.27) alternately,
$$x_1(k+1) = \underset{x_1}{\operatorname{argmin}}\, f(x_1, x_2(k)), \qquad x_2(k+1) = \underset{x_2}{\operatorname{argmin}}\, f(x_1(k+1), x_2), \qquad (2.28)$$
for $k = 0, 1, \ldots$. It is straightforward to check that the assignments $x_1(1) = 1.2$, $x_2(k) = 2.05 + 0.05(-1)^k$ for $k \ge 1$, and $x_1(k) = 1 - 0.1/k$ for $k \ge 2$ obey (2.28), i.e., they are valid algorithm outputs. In this example $x_1(k)$ converges whereas $x_2(k)$ oscillates (the example can be adjusted so that both oscillate in the optimal set). Note, however, that $x_1(k)$ and $x_2(k)$ are optimal for (2.27) as soon as $k \ge 2$ (of course, for larger networks, optimality cannot be certified at a single node as in this simple scenario). The subproblem faced by node 2 depends only on $x_1(k+1)$. Thus, selecting one minimizer, when there are many, corresponds to establishing a "rule" for each given $x_1(k+1)$. The example shows that not every rule will lead to strong convergence.
Figure 2.4: Proximal minimization evolution for the toy problem: iterates $x_1(k)$ (lower curve, blue) and $x_2(k)$ (upper curve, red) for $k = 1, \ldots, 30$.

Proximal minimization. A possible approach to circumvent the non-uniqueness of minimizers in (2.18) is to add a proximal term, as this makes the function strictly convex. In the context of the toy problem, this translates into replacing (2.28) with
$$x_1(k+1) = \underset{x_1}{\operatorname{argmin}}\, f(x_1, x_2(k)) + \frac{c}{2}(x_1 - x_1(k))^2 \qquad (2.29)$$
and
$$x_2(k+1) = \underset{x_2}{\operatorname{argmin}}\, f(x_1(k+1), x_2) + \frac{c}{2}(x_2 - x_2(k))^2 \qquad (2.30)$$
for some $c > 0$, possibly time-varying. Problems (2.29)-(2.30) now have unique solutions at all iterations. However, the proximal terms tend to slow down convergence. With the same initialization as above ($x_2(0) = 2.6$, $x_1(1) = 1.2$) and $c = 1$, Figure 2.4 shows the first 30 iterations of (2.29)-(2.30), and Figure 2.5 shows the corresponding cost function values (2.27). We see that optimality is not reached after 30 iterations (recall that, for (2.28), it is attained at the 2nd iterate, with zero cost).

Other approaches. Another option would be to set a systematic rule for selecting minimizers whenever there are many, for example always picking the one with the lowest norm. Intuitively, this
Figure 2.5: Function values f (x1 (k), x2 (k)), cf. (2.27), corresponding to the iterates in Figure 2.4.
should stabilize the iterations, but implementing such a rule would substantially complicate the numerical solution of the inner problems (2.18) (the theoretical analysis also seems very challenging). In sum, given that oscillations of the iterates of the exact version of Algorithm 2 are rarely observed in practice (the example above is highly artificial), it is unclear whether alternative approaches that secure strong convergence are worth pursuing from a practical standpoint. Note that our inexact version, which guarantees strong convergence, is both simple to implement and to certify theoretically.
2.6 Numerical experiments
In this Section we present experimental results that demonstrate the superior performance of our methods when compared with four state-of-the-art algorithms: Euclidean Distance Matrix (EDM) completion, presented in [6]; Semidefinite Program (SDP) relaxation and Edge-based Semidefinite Program (ESDP) relaxation, both implemented in [3]; and a sequential projection method (PM) in [14], which optimizes the same convex underestimator as the present work, with a different algorithm. The first two methods (EDM completion and SDP relaxation) are centralized, whereas the ESDP relaxation and PM are distributed.

Setup. We conducted simulations with two uniquely localizable geometric networks with sensors distributed in a two-dimensional square of unit area, with 4 anchors in the corners. Network 1, depicted in Figure 2.6, has 10 sensor nodes with an average node degree³ of 4.3, while network 2, shown in Figure 2.7, has 50 sensor nodes and an average node degree of 6.1. The ESDP method was only evaluated on network 1 due to simulation time constraints, since it involves solving an SDP at each node and at each iteration. The noisy range measurements are generated according to
$$d_{ij} = \bigl|\, \|x_i^\star - x_j^\star\| + \nu_{ij} \,\bigr|, \qquad r_{ik} = \bigl|\, \|x_i^\star - a_k\| + \nu_{ik} \,\bigr|, \qquad (2.31)$$
where $x_i^\star$ is the true position of node $i$, and $\{\nu_{ij} : i\sim j \in \mathcal{E}\} \cup \{\nu_{ik} : i \in \mathcal{V}, k \in A_i\}$ are independent Gaussian random variables with zero mean and standard deviation $\sigma$. The accuracy of the algorithms is measured by the original nonconvex cost in (1.1) and by the Root Mean Squared Error (RMSE) per sensor, defined as
$$\text{RMSE} = \sqrt{\frac{1}{n}\Bigl(\frac{1}{M} \sum_{m=1}^{M} \|x^\star - \hat x(m)\|^2\Bigr)}, \qquad (2.32)$$
where $M$ is the number of Monte Carlo trials performed.

³To characterize the networks used, we resort to the concepts of node degree $k_i$, which is the number of edges connected to node $i$, and average node degree $\langle k \rangle = \frac{1}{n}\sum_{i=1}^{n} k_i$.
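The measurement model (2.31) and the error metric (2.32) can be sketched directly; the generator seed, the $\sigma$ value, and the function names below are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.05

def noisy_range(p, q):
    """One range measurement per (2.31): |true distance + Gaussian noise|."""
    return abs(np.linalg.norm(np.asarray(p) - np.asarray(q))
               + rng.normal(0.0, sigma))

def rmse(x_true, x_hats):
    """Per-sensor RMSE per (2.32), over M Monte Carlo estimates x_hats.

    x_true: (n, p) array of true positions; x_hats: list of (n, p) estimates."""
    n = x_true.shape[0]
    M = len(x_hats)
    mean_sq = sum(np.sum((x_true - xh) ** 2) for xh in x_hats) / M
    return float(np.sqrt(mean_sq / n))
```

The absolute value in `noisy_range` mirrors (2.31), keeping measurements non-negative even when the noise draw is large and negative.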
Figure 2.6: Network 1. Topology with 4 anchors and 10 sensors. Anchors are marked with blue squares and sensors with red stars.
Figure 2.7: Network 2. Topology with 4 anchors and 50 sensors. Anchors are also marked with blue squares and sensors with red stars.
Figure 2.8: Relaxation quality: Root mean square error comparison of EDM completion in [6], SDP relaxation in [3] and the disk relaxation (2.5); measurements were perturbed with noise with different values for the standard deviation σ. The disk relaxation approach in (2.5) improved on the RMSE values of both EDM completion and SDP relaxation for all noise levels, even though it does not rely on the SDP machinery. The performance gap to EDM completion is substantial.
2.6.1 Assessment of the convex underestimator performance
The first experiment aimed at comparing the performance of the convex underestimator (2.5) with two other state-of-the-art convexifications. For the proposed disk relaxation (2.5), Algorithm 1 was stopped when the gradient norm $\|\nabla \hat f(x)\|$ reached $10^{-6}$, while both EDM completion and SDP relaxation were solved with the default SeDuMi solver [30] with eps $= 10^{-9}$, so that algorithm properties did not mask the real quality of the relaxations. Figures 2.8 and 2.9 report the results of the experiment with 50 Monte Carlo trials over network 2 and measurement noise with $\sigma \in \{0.01, 0.05, 0.1, 0.3\}$; so, we had a total of 200 runs, equally divided among the 4 noise levels. In Figure 2.8 we can see that the disk relaxation in (2.5) has better performance for all noise levels. Figure 2.9 depicts the results of optimizing the three convex functions for the same problems as RMSE vs. execution time, which reflects, albeit imperfectly, the complexities of the considered algorithms. The convex surrogate (2.5) used in the present work, combined with our methods, is faster by at least one order of magnitude. We tested all convex relaxations for robustness to sensors outside the convex hull of the anchors, and they all performed worse in such conditions. This type of behavior has been previously noted by several authors. The noise-free network with 10 nodes and 4 anchors is depicted in Figure 2.6. Notice that there are some sensors placed near the boundary of the anchors' convex hull. The result of 40 Monte Carlo runs is shown in Figure 2.10. This plot is illustrative of the behavior of the tested algorithms: both Algorithm 1 and the centralized version of [3] are somewhat better at more interior nodes, like 9 and 7, and perform less well at the more peripheral nodes near the boundary of the anchors' convex hull, like 5 and 14.
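As a generic illustration of the gradient-norm stopping rule used above, the loop below sketches a plain gradient method halted when $\|\nabla \hat f(x)\|$ falls below a tolerance (the helper and the toy cost are ours, not thesis code):

```python
def minimize_gd(grad, x0, step, tol=1e-6, max_iter=100000):
    """Plain gradient descent, stopped when the gradient norm reaches tol,
    mirroring the stopping rule applied to Algorithm 1 in this experiment."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        if sum(gi * gi for gi in g) ** 0.5 <= tol:
            break
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

# Toy smooth convex cost f(x) = 0.5*||x - c||^2, whose gradient is x - c.
c = (0.3, -0.7)
sol = minimize_gd(lambda x: [xi - ci for xi, ci in zip(x, c)], (0.0, 0.0), step=0.5)
```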
Figure 2.9: Relaxation quality: Comparison of the best achievable root mean square error versus overall execution time of the algorithms. Measurements were contaminated with noise with σ = 0.1. Although disk relaxation (2.5) has a distributed implementation, running it sequentially can be faster by one order of magnitude than the centralized methods.
Figure 2.10: Estimates for the location of the sensor nodes, based on 40 Monte Carlo trials with measurement noise σ = 0.01, for network 1, shown in Figure 2.6. Red dots express the output of Algorithm 1, blue circles indicate the estimates of the centralized version of [3], and yellow stars represent the EDM completion algorithm in [6].
Figure 2.11: Performance of the proposed method in Algorithm 1 and of the Projection method presented in [14]. The stopping criterion for both algorithms was a relative improvement of 10−6 in the estimate. The proposed method uses fewer communications to achieve better RMSE for the tested noise levels. Our method outperforms the projection method with one fourth the number of communications for a noise level of 0.01.
2.6.2 Performance of distributed optimization algorithms
To measure the performance of the presented Algorithm 1 in a distributed setting, we compared it with the state-of-the-art method in [14] and the distributed algorithm in [3]. The results are shown, respectively, in Figures 2.11 and 2.12. The experimental setups were different, since the authors proposed different stopping criteria for their algorithms; in order to do a fair comparison, we ran our algorithm with the specific criterion set by each benchmark method. Also, to compare with the distributed ESDP method in [3], we had to use a smaller network of 10 sensors because of simulation time constraints: as the ESDP method entails solving an SDP problem at each node, the simulation time becomes prohibitively large, at least using a general-purpose solver. The number of Monte Carlo trials was 32, with 3 noise levels, leading to 96 realizations of the noisy measurements. In the experiment illustrated in Figure 2.11, the stopping criterion for both the projection method and the presented method was the relative improvement of the solution; we stress that this is not a distributed stopping criterion, and we adopted it just for the sake of algorithm comparison. We can see that the proposed method fares better not only in RMSE but, foremost, in communication cost. The experiment comprised 120 Monte Carlo trials and two noise levels.

Table 2.3: Number of communications per sensor for the results in Fig. 2.12

    ESDP method    Algorithm 1
    21600          2000
From the analysis of both Figure 2.12 and Table 2.3 we can see that the ESDP method is one order of magnitude worse in RMSE than Algorithm 1, while using one order of magnitude more communications.
Figure 2.12: Performance of the proposed method in Algorithm 1 and of the ESDP method in [3]. The stopping criterion for both algorithms was the number of algorithm iterations. The performance advantage of the proposed method in Algorithm 1 is even more remarkable when considering the number of communications presented in Table 2.3.
Figure 2.13: Final cost of the parallel Algorithm 1 and its asynchronous counterpart in Algorithm 2 with an exact update for the same number of communications. Results for the asynchronous version degrade less than those of the parallel one as the noise level increases. The stochastic Gauss-Seidel iterations prove to be more robust to intense noise.
2.6.3 Performance of the asynchronous algorithm
A second set of experiments examined the performance of the parallel and asynchronous flavors of our method, presented respectively in Algorithms 1 and 2, the latter with exact updating. The metric was the value of the convex cost function $\hat f$ in (2.5) evaluated at each algorithm's estimate of the minimum. For fairness, both algorithms were allowed to run until they reached a preset number of communications. In Figure 2.13 we present the effectiveness of both algorithms in optimizing the disk relaxation cost (2.5) with the same amount of communications. We chose the random variables $\xi_k$, representing the sequence of updating nodes in the asynchronous version of our method, to be uniformly distributed. Again, we ran 50 Monte Carlo trials, each with 3 noise levels, thus leading to 150 samplings of the noise variables in (2.31).
The more robust behavior of the asynchronous version is a phenomenon empirically observed in other optimization algorithms when comparing deterministic and randomized versions. In [31, Section 6.3.5] the authors prove that, for a fixed-point algorithm with given properties and "with bounded communication delays, the convergence rate is geometric and, under certain conditions, it is superior to the convergence rate of the corresponding synchronous iteration". Also, in [28] the author states that, for the problem considered in the cited paper, "the randomized order provides a worst-case performance advantage over the cyclic order". Our numerical experiments suggest a similar behavior of our Algorithms 1 and 2, but we do not have further theoretical support for these observations.
2.7 Proofs

2.7.1 Convex envelope
We show that the function in (2.3) is the convex envelope of the function in (2.2). Refer to $\alpha$ as the function in (2.2) and $\beta$ as the function in (2.3). We show that $\alpha^{\star\star} = \beta$, where $f^\star$ denotes the Fenchel conjugate of a function $f$, cf. [20, Cor. 1.3.6, p. 45, v. 2]. We start by computing $\alpha^\star$:
\begin{align*}
\alpha^\star(s) &= \sup_z \; s^\top z - \alpha(z) \\
&= \sup_z \; s^\top z - \inf_{\|y\| = d_{ij}} \frac{1}{2}\|z - y\|^2 \\
&= \sup_z \sup_{\|y\| = d_{ij}} \; s^\top z - \frac{1}{2}\|z - y\|^2 \\
&= \sup_{\|y\| = d_{ij}} \sup_z \; s^\top z - \frac{1}{2}\|z - y\|^2 \\
&= \sup_{\|y\| = d_{ij}} \frac{1}{2}\|s\|^2 + s^\top y \\
&= \frac{1}{2}\|s\|^2 + d_{ij}\|s\|.
\end{align*}
Thus, $\alpha^\star$ is the sum of two closed convex functions: $\alpha^\star = g + h$, where $g(s) = \frac{1}{2}\|s\|^2$ and $h(s) = d_{ij}\|s\|$. Note that $h(s) = \sigma_{B(0, d_{ij})}(s)$, where $\sigma_C(s) = \sup\{s^\top x : x \in C\}$ denotes the support function of a set $C$. Thus, using [20, Th. 2.3.1, p. 61, v. 2], we have
$$\alpha^{\star\star}(z) = \inf_{z_1 + z_2 = z} g^\star(z_1) + h^\star(z_2).$$
Since $g^\star(z_1) = \frac{1}{2}\|z_1\|^2$ [20, Ex. 1.1.3, p. 38, v. 2] and $h^\star(z_2) = i_{B_{ij}}(z_2)$ [20, Ex. 1.1.5, p. 39, v. 2], where $i_C(x) = 0$ if $x \in C$ and $i_C(x) = +\infty$ if $x \notin C$ denotes the indicator of a set $C$, we conclude that
\begin{align*}
\alpha^{\star\star}(z) &= \inf_{z_1 + z_2 = z} \frac{1}{2}\|z_1\|^2 + i_{B_{ij}}(z_2) \\
&= \inf_{z_2 \in B_{ij}} \frac{1}{2}\|z - z_2\|^2 \\
&= \beta(z).
\end{align*}
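This result admits a quick numeric sanity check using the closed-form expressions $\alpha(z) = \frac{1}{2}(\|z\| - d_{ij})^2$ and $\beta(z) = \frac{1}{2}\max(\|z\| - d_{ij}, 0)^2$ that follow from the definitions (an illustrative sketch, not thesis code):

```python
import math

d = 1.0  # stands in for the range measurement d_ij

def alpha(z):
    # alpha(z) = (1/2) inf_{||y|| = d} ||z - y||^2 = (1/2)(||z|| - d)^2
    return 0.5 * (math.hypot(*z) - d) ** 2

def beta(z):
    # beta(z) = (1/2) inf_{||y|| <= d} ||z - y||^2 = (1/2) max(||z|| - d, 0)^2
    return 0.5 * max(math.hypot(*z) - d, 0.0) ** 2

inside, outside = (0.3, 0.2), (1.5, -2.0)
assert beta(inside) <= alpha(inside) and beta(outside) <= alpha(outside)  # underestimator
assert beta(inside) == 0.0                           # flat inside the ball
assert abs(beta(outside) - alpha(outside)) < 1e-12   # agrees with alpha outside it
```

The check shows the defining behavior of the envelope: it vanishes inside the ball $B_{ij}$ and coincides with $\alpha$ outside it.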
2.7.2 Lipschitz constant of $\nabla\phi_{B_{ij}}$

We prove the inequality in (2.8):
$$\left\|\nabla\phi_{B_{ij}}(x) - \nabla\phi_{B_{ij}}(y)\right\| \le \|x - y\| \tag{2.33}$$
where $\nabla\phi_{B_{ij}}(z) = z - P_{B_{ij}}(z)$, and $P_{B_{ij}}(z)$ is the projector onto $B_{ij} = \{z \in \mathbb{R}^p : \|z\| \le d_{ij}\}$. Squaring both sides of (2.33) gives the equivalent inequality
$$2(P(x) - P(y))^\top (x - y) - \|P(x) - P(y)\|^2 \ge 0 \tag{2.34}$$
where, to simplify notation, we let $P(z) := P_{B_{ij}}(z)$. Inequality (2.34) can be rewritten as
$$(P(x) - P(y))^\top (x - y) + (P(x) - P(y))^\top (P(y) - y) + (P(x) - P(y))^\top (x - P(x)) \ge 0. \tag{2.35}$$
By the properties of projectors onto closed convex sets, $(z - P(z))^\top (w - P(z)) \le 0$ for any $w \in B_{ij}$ and any $z$, cf. [20, Th. 3.1.1, p. 117, v. 1]. Thus, the last two terms on the left-hand side of (2.35) are nonnegative. Moreover, the first term is nonnegative due to [20, Prop. 3.1.3, p. 118, v. 1]. Inequality (2.35) is proved.
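The nonexpansiveness in (2.33) can also be probed numerically on random pairs of points (a sketch with our own helper names, not thesis code):

```python
import math, random

d = 1.0  # ball radius d_ij

def proj_ball(z):
    """Projector onto B_ij = {z : ||z|| <= d}."""
    nz = math.hypot(*z)
    return z if nz <= d else tuple(zi * d / nz for zi in z)

def grad_phi(z):
    """Gradient of phi_{B_ij}: z - P_{B_ij}(z)."""
    return tuple(zi - pi for zi, pi in zip(z, proj_ball(z)))

random.seed(0)
for _ in range(1000):
    x = (random.uniform(-3, 3), random.uniform(-3, 3))
    y = (random.uniform(-3, 3), random.uniform(-3, 3))
    gx, gy = grad_phi(x), grad_phi(y)
    # (2.33): the map z -> z - P(z) has Lipschitz constant at most 1
    assert math.hypot(gx[0] - gy[0], gx[1] - gy[1]) <= math.hypot(x[0] - y[0], x[1] - y[1]) + 1e-12
```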
2.7.3 Auxiliary Lemmas

In this Section we establish basic properties of Problem (2.6) in Lemma 5, and also two technical Lemmas, instrumental to prove our convergence results in Theorem 2.

Lemma 5 (Basic properties). Let $\hat f$ be as defined in (2.5). Then the following properties hold:
1. $\hat f$ is coercive;
2. $\hat f^\star \ge 0$ and $X^\star \ne \emptyset$;
3. $X^\star$ is compact.

Proof. 1. By Assumption 1 there is a path from each node $i$ to some node $j$ which is connected to an anchor $k$. If $\|x_i\| \to \infty$ then there are two cases: (1) there is at least one edge $t \sim u$ along the path from $i$ to $j$ where $\|x_t\| \to \infty$ and $\|x_u\| \not\to \infty$, and so $\mathrm{d}^2_{B_{tu}}(x_t - x_u) \to \infty$; (2) $\|x_u\| \to \infty$ for all $u$ in the path between $i$ and $j$; in particular $\|x_j\| \to \infty$, and so $\mathrm{d}^2_{B_{a_{jk}}}(x_j) \to \infty$. In both cases $\hat f \to \infty$; thus, $\hat f$ is coercive.

2. Function $\hat f$ defined in (2.5) is a sum of squares; it is a continuous, convex, real-valued function, lower bounded by zero, so the infimum $\hat f^\star$ exists and is nonnegative. To prove this infimum is attained and $X^\star \ne \emptyset$, we consider the set $T = \{x : \hat f(x) \le \alpha\}$; $T$ is a sublevel set of a continuous, coercive function and, thus, it is compact. As $\hat f$ is continuous, by the Weierstrass Theorem the value $p = \inf_{x \in T} \hat f(x)$ is attained; the equality $\hat f^\star = p$ is evident.
3. $X^\star$ is a sublevel set of a continuous, coercive function and, thus, compact.

Lemma 6. Let $\{x(k)\}_{k \in \mathbb{N}}$ be the sequence of iterates of Algorithm 2, or of Algorithm 2 with the update (2.19), and let $\nabla \hat f(x(k))$ be the gradient of the function $\hat f$ evaluated at each iterate. Then,
1. $\sum_{k \ge 1} \|\nabla \hat f(x(k))\|^2 < \infty$, a.s.;
2. $\nabla \hat f(x(k)) \to 0$, a.s.

Proof. Let $\mathcal{F}_k = \sigma(x(0), \dots, x(k))$ be the sigma-algebra generated by all the algorithm iterations until time $k$. We are interested in $E[\hat f(x(k)) \mid \mathcal{F}_{k-1}]$, the expected value of the cost at the $k$th iteration, given the knowledge of the past $k-1$ iterations. Firstly, let us examine the function $\phi : \mathbb{R}^p \to \mathbb{R}$, the slice of $\hat f$ along a coordinate direction, $\phi(y) = \hat f(x_1, \dots, x_{i-1}, y, x_{i+1}, \dots, x_n)$. As $\hat f$ has Lipschitz continuous gradient with constant $L_{\hat f}$, so will $\phi$: $\|\nabla\phi(y) - \nabla\phi(z)\| \le L_{\hat f}\|y - z\|$ for all $y$ and $z$, and, thus, it will inherit the property
$$\phi(y) \le \phi(z) + \langle \nabla\phi(z), y - z \rangle + \frac{L_{\hat f}}{2}\|y - z\|^2. \tag{2.36}$$
Inequality (2.36) is known as the Descent Lemma [32, Prop. A.24]. The minimizer of the quadratic upper bound in (2.36) is $z - \frac{1}{L_{\hat f}}\nabla\phi(z)$, which can be plugged back into (2.36), obtaining
$$\phi^\star \le \phi\!\left(z - \frac{1}{L_{\hat f}}\nabla\phi(z)\right) \le \phi(z) - \frac{1}{2L_{\hat f}}\|\nabla\phi(z)\|^2. \tag{2.37}$$
In the sequel, for a given $x = (x_1, \dots, x_n)$, we let $\hat f_i^\star(x_{-i}) = \inf\{\hat f(x_1, \dots, x_{i-1}, z, x_{i+1}, \dots, x_n) : z\}$. Going back to the expectation $E[\hat f(x(k)) \mid \mathcal{F}_{k-1}] = \sum_{i=1}^n P_i \hat f_i^\star(x_{-i}(k-1))$, we can bound it from above, recurring to (2.37), by
\begin{align*}
\sum_{i=1}^n P_i \left( \hat f(x(k-1)) - \frac{1}{2L_{\hat f}}\|\nabla_i \hat f(x(k-1))\|^2 \right)
&= \hat f(x(k-1)) - \frac{1}{2L_{\hat f}} \sum_{i=1}^n P_i \|\nabla_i \hat f(x(k-1))\|^2 \\
&\stackrel{(a)}{\le} \hat f(x(k-1)) - \frac{P_{\min}}{2L_{\hat f}} \|\nabla \hat f(x(k-1))\|^2, \tag{2.38}
\end{align*}
where we used $0 < P_{\min} \le P_i$ for all $i \in \{1, \dots, n\}$ in (a). To alleviate notation, let $g(k) = \nabla \hat f(x(k))$; we then have
$$\|g(k)\|^2 = \sum_{i \le k} \|g(i)\|^2 - \sum_{i \le k-1} \|g(i)\|^2,$$
and, adding $\frac{P_{\min}}{2L} \sum_{i \le k-1} \|g(i)\|^2$ to both sides of the inequality in (2.38), we find that
$$E[Y_k \mid \mathcal{F}_{k-1}] \le Y_{k-1}, \tag{2.39}$$
where $Y_k = \hat f(x(k)) + \frac{P_{\min}}{2L} \sum_{i \le k-1} \|g(i)\|^2$. Inequality (2.39) defines the sequence $\{Y_k\}_{k \in \mathbb{N}}$ as a supermartingale. As $\hat f(x)$ is always nonnegative, $Y_k$ is also nonnegative and so [33, Corollary 27.1] $Y_k \to Y$, a.s. In words, the sequence $Y_k$ converges almost surely to an integrable random variable $Y$. This entails that $\sum_{k \ge 1} \|g(k)\|^2 < \infty$, a.s., and so $g(k) \to 0$, a.s.

The previous arguments show that Lemma 6 holds for Algorithm 2. To show that Lemma 6 also holds for Algorithm 2 with the update (2.19) it suffices to redefine
$$\hat f_i^\star(x_{-i}) := \hat f\!\left(x_1, \dots, x_i - \frac{1}{L_{\hat f}}\nabla_i \hat f(x), \dots, x_n\right).$$
As the second inequality in (2.37) shows, we have the bound
$$\hat f_i^\star(x_{-i}(k-1)) \le \hat f(x(k-1)) - \frac{1}{2L_{\hat f}}\left\|\nabla_i \hat f(x(k-1))\right\|^2$$
and the rest of the proof holds intact.

Lemma 7. Let $\{x(k)\}_{k \in \mathbb{N}}$ be one of the sequences generated with probability one according to Lemma 6. Then,
1. The function value decreases to the optimum: $\hat f(x(k)) \downarrow \hat f^\star$;
2. There exists a subsequence of $\{x(k)\}_{k \in \mathbb{N}}$ converging to a point in $X^\star$: $x(k_l) \to y$, $y \in X^\star$.

Proof. As $\hat f$ is coercive, the sublevel set $X_{\hat f} = \{x : \hat f(x) \le \hat f(x(0))\}$ is compact and, because $\hat f(x(k))$ is nonincreasing, all elements of $\{x(k)\}_{k \in \mathbb{N}}$ belong to this set. From the compactness of $X_{\hat f}$ we have that there is a convergent subsequence $x(k_l) \to y$. We evaluate the gradient at this accumulation point, $\nabla \hat f(y) = \lim_{l \to \infty} \nabla \hat f(x(k_l))$, which, by assumption, vanishes, and we therefore conclude that $y$ belongs to the solution set $X^\star$. Moreover, the function value at this point is, by definition, the optimal value.
2.7.4 Theorems

Equipped with the previous lemmas, we are now ready to prove the Theorems stated in Section 2.5.

Proof of Theorem 2. Suppose the distance does not converge to zero. Then, there exist an $\epsilon > 0$ and some subsequence $\{x(k_l)\}_{l \in \mathbb{N}}$ such that $\mathrm{d}_{X^\star}(x(k_l)) > \epsilon$. But, as $\hat f$ is coercive (by Lemma 5), continuous, and convex, and its gradient, by Lemma 6, vanishes, then by Lemma 7 there is a subsequence of $\{x(k_l)\}_{l \in \mathbb{N}}$ converging to a point in $X^\star$, which is a contradiction.

Proof of Theorem 3. Fix an arbitrary point $x^\star \in X^\star$. We start by proving that the sequence of squared distances to $x^\star$ of the estimates produced by Algorithm 2, with the update defined in Equation (2.19), converges almost surely; that is, the sequence $\{\|x(k) - x^\star\|^2\}_{k \in \mathbb{N}}$ is convergent with probability one. We have
$$E[\|x(k) - x^\star\|^2 \mid \mathcal{F}_{k-1}] = \sum_{i=1}^n \frac{1}{n} \left\| x(k-1) - \frac{1}{L_{\hat f}} g_i(k-1) - x^\star \right\|^2 \tag{2.40}$$
where $g_i(k-1) = (0, \dots, 0, \nabla_i \hat f(x(k-1)), 0, \dots, 0)$ and $\mathcal{F}_k = \sigma(x(1), \dots, x(k))$ is the sigma-algebra generated by all iterates until time $k$. Expanding the right-hand side of (2.40) yields
$$\|x(k-1) - x^\star\|^2 + \frac{1}{nL_{\hat f}^2}\left\|\nabla \hat f(x(k-1))\right\|^2 - \frac{2}{nL_{\hat f}} (x(k-1) - x^\star)^\top \nabla \hat f(x(k-1)).$$
Since $(x(k-1) - x^\star)^\top \nabla \hat f(x(k-1)) = (x(k-1) - x^\star)^\top \left( \nabla \hat f(x(k-1)) - \nabla \hat f(x^\star) \right) \ge 0$, we conclude that
$$E[\|x(k) - x^\star\|^2 \mid \mathcal{F}_{k-1}] \le \|x(k-1) - x^\star\|^2 + \frac{1}{nL_{\hat f}^2}\left\|\nabla \hat f(x(k-1))\right\|^2.$$
Now, as proved in Lemma 6, the sum $\sum_k \|\nabla \hat f(x(k))\|^2$ converges almost surely. Thus, invoking the result in [34], we get that $\|x(k) - x^\star\|^2$ converges almost surely. We can now invoke the technique at the end of the proof of [28, Prop. 9] to conclude that $x(k)$ converges to some optimal point $x^\star$.
2.8 Summary and further extensions

Experiments in Section 2.6 show that our method is superior to the state of the art in all measured indicators. While the comparison with the projection method published in [14] is favorable to our proposal, it should be further noted that the projection method has a different nature from ours: it is sequential, and such algorithms will always have a larger computation time than parallel ones, since nodes run in sequence; moreover, this computation time grows with the number of sensors, while parallel methods retain similar speed no matter how many sensors the network has. When comparing with a distributed and parallel method similar to Algorithm 1, like the ESDP method in [3], we see a one-order-of-magnitude improvement in RMSE for one order of magnitude fewer communications of our method: and this score is achieved with a simpler, easy-to-implement algorithm, performing simple computations at each node that are well suited to the kind of hardware commonly found in sensor networks. Also, unlike SDP methods, our proposal operates directly on positions, which need not be recovered from an SDP estimate.
There are some important questions not addressed here. For example, it is not clear what influence the number of anchors and their spatial distribution have on the performance of the proposed and state-of-the-art algorithms. Also, an exhaustive study of the impact of varying topologies and numbers of sensors could lead to interesting results. Some preliminary experiments show that all convex relaxations experience some performance degradation when tested for robustness to sensors outside the convex hull of the anchors. This issue has been noted by several authors, but a more exhaustive study exceeds the scope of this thesis. Still, with the data presented here one can already grasp the advantages of our fast and easily implementable distributed method, where the optimality gap of the solution can also be easily quantified, and which offers two implementation flavors for different localization needs.
2.8.1 Heterogeneous data fusion application

A spin-off of the presented method was developed and already submitted for publication. The problem in (2.5) can be thought of as a minimization of the squared discrepancy between a data model and measured data. In this perspective, we envisioned an extension of the present work by fusing the range measurements with angle information. This can be done by considering a new edge set $E_u$, containing pairs of nodes with a measured angle between them, and the squared distance $\mathrm{d}^2_{\ell_{tv}}(\cdot)$ to a line $\ell_{tv}$, passing through the origin and defined by the unit vector $u_{tv}$. The problem is, then,
$$\operatorname*{minimize}_x \; \sum_{i \sim j \in E} \frac{1}{2\sigma^2} \mathrm{d}^2_{B_{ij}}(x_i - x_j) + \sum_{i \sim j \in E_u} \frac{1}{2\sigma_\ell^2} \mathrm{d}^2_{\ell_{tv}}(x_t - x_v) + \sum_i \left( \sum_{k \in A_i} \frac{1}{2\sigma^2} \mathrm{d}^2_{B_{a_{ik}}}(x_i) + \sum_{k \in A_{u_i}} \frac{1}{2\sigma_\ell^2} \mathrm{d}^2_{\ell_{a_{ik}}}(x_i) \right), \tag{2.41}$$
where $A_{u_i}$ is the set of anchors with angle measurements related to node $i$, $\sigma$ is the standard deviation of the Gaussian noise term in (2.31), and $\sigma_\ell$ is the standard deviation of the noise in the angle measurement statistics (under the simplifying assumption that the measured angle has Gaussian statistics). The simulation and real-data results are very encouraging, and the possibility of fusing two different types of information in the same minimization offers new flexibility to localization.
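The new line term $\mathrm{d}^2_{\ell}(\cdot)$ is just a squared point-to-line distance; for a line through the origin with unit direction $u$ it reduces to $\|x - (u^\top x)u\|^2$, as the sketch below illustrates (hypothetical helper, not thesis code):

```python
import math

def sq_dist_to_line(x, u):
    """Squared distance from point x to the line through the origin spanned by unit vector u."""
    dot = sum(xi * ui for xi, ui in zip(x, u))
    return sum((xi - dot * ui) ** 2 for xi, ui in zip(x, u))

# Bearing at 45 degrees: points on the line have zero residual,
# a perpendicular offset contributes its squared length.
u = (1 / math.sqrt(2), 1 / math.sqrt(2))
assert abs(sq_dist_to_line((2.0, 2.0), u)) < 1e-12
assert abs(sq_dist_to_line((1.0, -1.0), u) - 2.0) < 1e-12
```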
35
2. Distributed network localization without initialization: Tight convex underestimator-based procedure
36
3 Distributed network localization with initialization: Nonconvex procedures

Contents
3.1 Related work
3.2 Distributed Majorization-Minimization with quadratic majorizer
    3.2.1 Contributions
    3.2.2 Problem reformulation
    3.2.3 Majorization-Minimization
    3.2.4 Distributed sensor network localization
    3.2.5 Experimental results
    3.2.6 Summary
3.3 Majorization-Minimization with convex tight majorizer
    3.3.1 Majorization function
    3.3.2 Experimental results on majorization function quality
    3.3.3 Distributed optimization of the proposed majorizer using ADMM
    3.3.4 Experimental setup
    3.3.5 Proof of majorization function properties
    3.3.6 Proof of Proposition 9
    3.3.7 Proof of (3.31)
    3.3.8 Summary
3.4 Sensor network localization: a graphical model approach
    3.4.1 Uncertainty models
    3.4.2 Optimization problem
    3.4.3 Combinatorial problem
    3.4.4 Related work
    3.4.5 Contributions
    3.4.6 Algorithms
    3.4.7 Experimental results
    3.4.8 Summary
Now imagine we have some prior knowledge of where our agents are located. This knowledge can come from deployment instructions, from a previous run of a convexified algorithm like the one presented in Chapter 2, or, maybe, from known (or estimated) positions at a previous moment in time. Imagine you want an accurate estimate of your network configuration, but you need it fast and with a simple and stable implementation. This Chapter addresses this scenario, providing results concerning MM with a quadratic majorizer, MM with a tighter majorizer, and a graphical model approach to the problem. The work described in the next Section was partially presented at the 2014 IEEE GlobalSIP conference.
3.1 Related work
As we have seen previously, distributed and maximum-likelihood (thus nonconvex) approaches to the sensor network localization problem are much less common than centralized or relaxation-based approaches, despite this computational paradigm being better suited to the problem at hand. The work in [15] proposes a parallel distributed algorithm. However, it adopts a discrepancy function between squared distances which, unlike the ones used in maximum-likelihood (ML) methods, is known to amplify measurement errors and outliers; the convergence properties of the algorithm are not studied theoretically. The work in [16] also considers network localization outside an ML framework. The approach proposed in [16] is not parallel, operating sequentially through layers of nodes: neighbors of anchors estimate their positions and become anchors themselves, making it possible in turn for their neighbors to estimate their positions, and so on. Position estimation is based on planar geometry-based heuristics. In [17], the authors propose an algorithm with assured asymptotic convergence, but the solution is computationally complex, since a triangulation set must be calculated and matrix operations are pervasive. Furthermore, in order to attain good accuracy, a large number of range measurement rounds must be acquired, one per iteration of the algorithm, thus increasing energy expenditure. The algorithm presented in [18] is a nonlinear Gauss-Seidel approach: only one node works at a time and solves a source localization problem with neighbors playing the role of anchors. The nodes activate sequentially in a round-robin scheme; thus, the time to complete just one cycle becomes proportional to the network size. Parallel algorithms (the ones we are interested in, both in Chapter 2 and in the present one) avoid this issue altogether, as all nodes operate simultaneously; moreover, adding or deleting a node raises no special synchronization concern. The work presented in [35] puts forward a two-stage algorithm which is parallel: in a first consensus phase, a Barzilai-Borwein (BB) step size is calculated, followed by a local gradient computation phase. It is known that BB steps do
not necessarily decrease the objective function; as discussed in [36], an outer globalization scheme involving line searches is needed to ensure its stability. However, line searches are cumbersome to implement in a distributed setting and are, in fact, absent in [35]. Further, the algorithm requires the step size to be computed via consensus, and thus the number of consensus rounds needed is a parameter to tune.
3.2 Distributed Majorization-Minimization with quadratic majorizer
We propose a simple, stable and distributed algorithm which directly optimizes the nonconvex maximum likelihood criterion for sensor network localization, with no need to tune any free parameter. We reformulate the problem to obtain a gradient Lipschitz cost; by shifting to this cost function we enable a Majorization-Minimization (MM) approach based on quadratic upper bounds that decouple across nodes; the resulting algorithm happens to be distributed, with all nodes working in parallel. Our method inherits the stability of MM: each communication cuts down the cost function. Numerical simulations indicate that the proposed approach tops the performance of state of the art algorithms, both in accuracy and communication cost. The algorithm we present has an astonishingly simple implementation which is both parallel and stable, with no free parameters. In Section 3.4.7 we will compare experimentally the performance of our method with the distributed, parallel, state of the art method in [35].
3.2.1 Contributions
We tackle the nonconvex problem in (1.1) directly, with a simple and efficient algorithm which: 1. is parallel; 2. does not involve any free parameter definition; 3. is proven not to increase the value of the cost function at each iteration (thus, stable); 4. has better performance in positioning error and cost value than a state of the art method, while requiring fewer communications. The first and second claims are addressed in Section 3.2.4, the third in Section 3.4.7.B and the last one in Section 3.4.7, dedicated to numerical experiments.
3.2.2 Problem reformulation

We can reformulate Problem (1.1) as
$$\begin{array}{cl}
\operatorname*{minimize}\limits_{x_i,\, y_{ij},\, w_{ik}} & \displaystyle\sum_{i \sim j} \frac{1}{2}\|x_i - x_j - y_{ij}\|^2 + \sum_i \sum_{k \in A_i} \frac{1}{2}\|x_i - a_k - w_{ik}\|^2 \\
\text{subject to} & \|y_{ij}\| = d_{ij}, \quad \|w_{ik}\| = r_{ik}.
\end{array} \tag{3.1}$$
Figure 3.1: Illustration of the reformulation in (3.1) of Problem (1.1). The sphere $S_{ij}$ of radius $d_{ij}$ is defined as $\{y \in \mathbb{R}^p : \|y\| = d_{ij}\}$, and the squared distance to it is $\mathrm{d}^2_{S_{ij}}(x_i - x_j) = \min\{\|x_i - x_j - y_{ij}\|^2 : \|y_{ij}\| = d_{ij}\}$.

This reformulation is illustrated in Figure 3.1. We now rewrite (3.1) as
$$\begin{array}{cl}
\operatorname*{minimize}\limits_{x_i,\, y_{ij},\, w_{ik}} & \frac{1}{2}\|Ax - y\|^2 + \displaystyle\sum_i \frac{1}{2}\|x_i \otimes 1 - \alpha_i - w_i\|^2 \\
\text{subject to} & \|y_{ij}\| = d_{ij}, \quad \|w_{ik}\| = r_{ik},
\end{array} \tag{3.2}$$
with concatenated vectors $x = (x_i)_{i \in V}$, $y = (y_{ij})_{i \sim j}$, $\alpha_i = (a_{ik})_{k \in A_i}$, and $w_i = (w_{ik})_{k \in A_i}$. In (3.2), the symbol $1$ stands for the vector of ones. Matrix $A$ is the result of the Kronecker product of the arc-node incidence matrix $C$ (each edge is arbitrarily assigned a direction by the two incident nodes) with the identity matrix $I_p$: $A = C \otimes I_p$. Problem (3.2) is equivalent to
$$\begin{array}{cl}
\operatorname*{minimize}\limits_{x_i,\, y_{ij},\, w_{ik}} & \frac{1}{2}\left\| \begin{bmatrix} A & -I & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ w \end{bmatrix} \right\|^2 + \frac{1}{2}\|Ex - \alpha - w\|^2 \\
\text{subject to} & \|y_{ij}\| = d_{ij}, \quad \|w_{ik}\| = r_{ik},
\end{array}$$
where $\alpha = (\alpha_i)_{i \in V}$, $w = (w_i)_{i \in V}$, and $E$ is a matrix of zeros and ones, selecting the entries in $\alpha$ and $w$ corresponding to each sensor node. We now collect all the optimization variables in $z = (x, y, w)$ and rewrite our problem as
$$\begin{array}{cl}
\operatorname*{minimize}\limits_z & \frac{1}{2}\left\| \begin{bmatrix} A & -I & 0 \end{bmatrix} z \right\|^2 + \frac{1}{2}\left\| \begin{bmatrix} E & 0 & -I \end{bmatrix} z - \alpha \right\|^2 \\
\text{subject to} & z \in Z,
\end{array}$$
where $Z = \{z = (x, y, w) : \|y_{ij}\| = d_{ij},\ i \sim j,\ \|w_{ik}\| = r_{ik},\ i \in V,\ k \in A_i\}$. Problem (3.2) can thus be written as
$$\operatorname*{minimize}_z \; f(z) = \frac{1}{2} z^\top M z - b^\top z \tag{3.3}$$
$$\text{subject to } z \in Z, \tag{3.4}$$
for $M$ and $b$ defined as
$$M = M_1 + M_2, \qquad
M_1 = \begin{bmatrix} A & -I & 0 \end{bmatrix}^\top \begin{bmatrix} A & -I & 0 \end{bmatrix}, \qquad
M_2 = \begin{bmatrix} E & 0 & -I \end{bmatrix}^\top \begin{bmatrix} E & 0 & -I \end{bmatrix}, \qquad
b = \begin{bmatrix} E & 0 & -I \end{bmatrix}^\top \alpha. \tag{3.5}$$
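As a concrete illustration of the construction $A = C \otimes I_p$, the sketch below builds the incidence matrix for a toy network and checks that $Ax$ stacks the differences $x_i - x_j$, one per edge (assumed toy values, not the thesis code):

```python
import numpy as np

p = 2                       # ambient dimension
n = 3                       # sensors
edges = [(0, 1), (1, 2)]    # measured edges, each arbitrarily oriented

# Arc-node incidence matrix C: one row per edge, +1 on the tail, -1 on the head.
C = np.zeros((len(edges), n))
for e, (i, j) in enumerate(edges):
    C[e, i], C[e, j] = 1.0, -1.0

A = np.kron(C, np.eye(p))   # A = C (x) I_p

x = np.arange(n * p, dtype=float)          # stacked positions (x_0, x_1, x_2)
diffs = (A @ x).reshape(len(edges), p)
# Row e of A x is x_i - x_j for edge e = (i, j).
assert np.allclose(diffs[0], x[0:2] - x[2:4])
assert np.allclose(diffs[1], x[2:4] - x[4:6])
```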
3.2.3 Majorization-Minimization

To solve Problem (3.3) in a distributed way we must deal with the complicating off-diagonal entries of $M$ that couple the sensors' variables. We emphasize a simple but key fact:

Remark 8. The function optimized in Problem (3.3) is quadratic in $z$ and, thus, has a Lipschitz continuous gradient [32], i.e., $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$, for some $L$ and all $x, y$.

From this property of the function $f$ we can obtain the upper bound (also found in [32]) $f(z) \le f(z^t) + \langle \nabla f(z^t), z - z^t \rangle + \frac{L}{2}\|z - z^t\|^2$, valid for any point $z^t$, and use it as a majorizer in the Majorization-Minimization framework [37]. This majorizer decouples the variables and allows for a distributed solution. This happens because the quadratic term involves a diagonal matrix and, so, there are no off-diagonal terms to couple the sensors' position variables. Our algorithm is simply
$$z^{t+1} = \operatorname*{argmin}_{z \in Z} \; f(z^t) + \langle \nabla f(z^t), z - z^t \rangle + \frac{L}{2}\|z - z^t\|^2. \tag{3.6}$$
The solution of (3.6) is the projected gradient iteration [32]
$$z^{t+1} = P_Z\!\left(z^t - \frac{1}{L}\nabla f(z^t)\right), \tag{3.7}$$
where $P_Z(z)$ is the projection of the point $z$ onto $Z$. This projection has a closed-form expression,
$$P_Z(z) = \begin{bmatrix} x \\ P_Y(y) \\ P_W(w) \end{bmatrix}, \qquad
P_Y(y) = \left(\frac{y_{ij}}{\|y_{ij}\|}\, d_{ij}\right)_{i \sim j}, \qquad
P_W(w) = \left(\frac{w_{ik}}{\|w_{ik}\|}\, r_{ik}\right)_{i \in V,\, k \in A_i}.$$
The gradient in (3.7) can be easily computed as the affine function $\nabla f(z) = Mz - b$. See the recent work [38] for interesting convergence properties of the recursion (3.7); in particular, we emphasize that the cost function is nonincreasing per iteration. We now compute a Lipschitz constant $L$ for the gradient of the quadratic function in Problem (3.3), such that it is easy to estimate in a distributed way:
$$L = \lambda_{\max}(M) \le \lambda_{\max}(M_1) + \lambda_{\max}(M_2) = \lambda_{\max}(AA^\top + I) + \lambda_{\max}(EE^\top + I) \le \lambda_{\max}(A^\top A) + \lambda_{\max}(EE^\top) + 2 \le 2\delta_{\max} + \max_{i \in V} |A_i| + 2, \tag{3.8}$$
where $\lambda_{\max}$ denotes the largest eigenvalue, $|A_i|$ is the cardinality of the set $A_i$, and $\delta_{\max}$ is the maximum node degree of the network. We note that $\lambda_{\max}(A^\top A)$ is the maximum eigenvalue of the Laplacian matrix of the graph $G$; the proof that it is upper-bounded by $2\delta_{\max}$ can be found in [21] and was discussed in Section 2.4.1. This Lipschitz constant can be computed in a distributed way by, e.g., a diffusion algorithm (cf. [23, Ch. 9]).
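The bound (3.8) can be verified numerically on a small instance. The sketch below assembles $M$ for a toy 3-node path network with two anchor links (illustrative values, not from the thesis experiments) and checks $\lambda_{\max}(M) \le 2\delta_{\max} + \max_i |A_i| + 2$:

```python
import numpy as np

p, n = 2, 3
edges = [(0, 1), (1, 2)]            # sensor-sensor edges
anchor_links = {0: 1, 1: 0, 2: 1}   # |A_i|: number of anchors heard by node i

C = np.zeros((len(edges), n))
for e, (i, j) in enumerate(edges):
    C[e, i], C[e, j] = 1.0, -1.0
A = np.kron(C, np.eye(p))

# E selects, for every anchor link of node i, the block x_i.
rows = sum(anchor_links.values())
E = np.zeros((rows * p, n * p))
r = 0
for i in range(n):
    for _ in range(anchor_links[i]):
        E[r * p:(r + 1) * p, i * p:(i + 1) * p] = np.eye(p)
        r += 1

# M = M1 + M2 as in (3.5)
G1 = np.hstack([A, -np.eye(A.shape[0]), np.zeros((A.shape[0], E.shape[0]))])
G2 = np.hstack([E, np.zeros((E.shape[0], A.shape[0])), -np.eye(E.shape[0])])
M = G1.T @ G1 + G2.T @ G2

delta_max = 2                        # maximum node degree (node 1)
bound = 2 * delta_max + max(anchor_links.values()) + 2
assert np.linalg.eigvalsh(M).max() <= bound + 1e-9   # (3.8) holds
```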
Algorithm 4 Distributed nonconvex localization algorithm
Input: $x^0$; $L$; $\{d_{ij} : j \in N_i\}$; $\{r_{ik} : k \in A_i\}$
Output: $\hat x$
1: set $y^0_{ij} = P_{Y_{ij}}(x^0_i - x^0_j)$, with $Y_{ij} = \{y : \|y\| = d_{ij}\}$, and $w^0_{ik} = P_{W_{ik}}(x^0_i - a_k)$, with $W_{ik} = \{w : \|w\| = r_{ik}\}$
2: $t = 0$
3: while some stopping criterion is not met, each node $i$ do
4:    $x^{t+1}_i = b_i x^t_i + \frac{1}{L}\sum_{j \in N_i}\left(x^t_j + C_{(i \sim j, i)}\, y^t_{ij}\right) + \frac{1}{L}\sum_{k \in A_i}\left(w^t_{ik} + a_{ik}\right)$
5:    for all neighboring $j$, compute $y^{t+1}_{ij} = P_{Y_{ij}}\!\left(\frac{L-1}{L}\, y^t_{ij} + \frac{1}{L}\, C_{(i \sim j, i)}\left(x^t_i - x^t_j\right)\right)$
6:    for each of the connected anchors $k \in A_i$, compute $w^{t+1}_{ik} = P_{W_{ik}}\!\left(\frac{L-1}{L}\, w^t_{ik} + \frac{1}{L}\left(x^t_i - a_{ik}\right)\right)$
7:    broadcast $x^{t+1}_i$ to the neighbors
8:    $t = t + 1$
9: end while
10: return $\hat x_i = x^t_i$
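In stacked form, one step of the recursion (3.7) underlying Algorithm 4 can be sketched as follows. This is a minimal numeric illustration with toy values; the helper names `mm_step` and `proj_spheres` are ours, not from the thesis:

```python
import numpy as np

def proj_spheres(v, radii, p=2):
    """Project each p-dimensional block of v onto its sphere of the given radius."""
    out = v.copy()
    for b, rad in enumerate(radii):
        blk = v[b * p:(b + 1) * p]
        nrm = np.linalg.norm(blk)
        if nrm > 0:
            out[b * p:(b + 1) * p] = blk * (rad / nrm)
    return out

def mm_step(x, y, w, A, E, alpha, L, d, r):
    """One projected-gradient step z -> P_Z(z - (1/L) grad f(z)), in block form."""
    B = A.T @ A + E.T @ E
    x_new = (np.eye(len(x)) - B / L) @ x + (A.T @ y + E.T @ (w + alpha)) / L
    y_new = proj_spheres((L - 1) / L * y + (A @ x) / L, d)
    w_new = proj_spheres((L - 1) / L * w + (E @ x - alpha) / L, r)
    return x_new, y_new, w_new

# Toy instance: a single sensor ranging to one anchor at the origin, range r = 1.
A = np.zeros((0, 2))                  # no sensor-sensor edges
E = np.eye(2)                         # one anchor link selects the sensor position
alpha = np.zeros(2)                   # anchor position a = (0, 0)
L = 3.0                               # 2*delta_max + max|A_i| + 2 = 3
x, y, w = np.array([2.0, 0.0]), np.zeros(0), np.array([1.0, 0.0])
for _ in range(200):
    x, y, w = mm_step(x, y, w, A, E, alpha, L, d=[], r=[1.0])
# x converges to the closest point at range 1 from the anchor, i.e., (1, 0).
```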
3.2.4  Distributed sensor network localization
At this point, the recursion in Eq. (3.7) is already distributed, as detailed below. From (3.7) we will obtain the update rules for the variables x, y and w. For this we write matrix M as follows:

M = [ A^T A + E^T E   −A^T   −E^T
      −A               I      0
      −E               0      I ],   (3.9)

and denote B = A^T A + E^T E. Then, each block of z is updated according to

x^{t+1} = ( I − (1/L) B ) x^t + (1/L) A^T y^t + (1/L) E^T (w^t + α),   (3.10)
y^{t+1} = P_Y( ((L−1)/L) y^t + (1/L) A x^t ),   (3.11)
w^{t+1} = P_W( ((L−1)/L) w^t + (1/L) E x^t − α/L ),   (3.12)
where Y and W are the constraint sets associated with the acquired measurements between sensors, and between anchors and sensors, respectively, and N_i is the set of the neighbors of node i. We observe that each block of z = (x, y, w) at iteration t + 1 will only need local neighborhood information, as clarified in Algorithm 4. Each node i will update the current estimate of its own position, each one of the y_ij for all the incident edges i∼j, and the anchor terms w_ik, if any. The symbol C_{(i∼j,i)} denotes the arc-node incidence matrix entry relative to edge i∼j (row index) and node i (column index). The constant in step 4 of Algorithm 4 is defined as

b_i = (L − δ_i − |A_i|)/L.

3.2.5  Experimental results
We present numerical experiments to ascertain the performance of the proposed Algorithm 4, both in accuracy and in communication cost. For a fixed graph, accuracy will be measured by 1) the mean positioning error, defined as

MPE = (1/M) Σ_{m=1}^{M} Σ_{i=1}^{n} ‖x̂_i(m) − x*_i‖,   (3.13)
3.2 Distributed Majorization-Minimization with quadratic majorizer
Table 3.1: Mean positioning error, with measurement noise σ

σ      Proposed method   BB method
0.01   0.0053            0.0059
0.05   0.0143            0.0154
0.10   0.0210            0.0221
where M is the total number of Monte Carlo trials, x̂_i(m) is the estimate generated by an algorithm at Monte Carlo trial m, and x*_i is the true position of node i; and 2) by evaluating the cost function in (1.1), averaged over the Monte Carlo trials, as in (3.65). In the previous Chapter we used the RMSE as a performance measure. Both MPE and RMSE characterize the localization error, albeit in different ways: RMSE penalizes larger discrepancies, while MPE weights outliers less. As we are dealing with the nonconvex maximum-likelihood cost directly, a discrepancy in the estimate may be due to the presence of measurement noise (which shifts the cost minimum), to possible ambiguities in the network, but also to the existence of different local minima attracting the measured algorithm. The fact that an algorithm converged to a local minimum different from the global optimum should be penalized; nevertheless, the resulting individual distance ‖x̂_i(m) − x*_i‖ can be large, and may not adequately represent the overall performance of the algorithm in question. Communication cost will be measured taking into account that each iteration of Algorithm 4 involves communicating pn real numbers. We will compare the performance of the proposed method with the Barzilai-Borwein algorithm in [35], whose communication cost per iteration is n(2T + p), where T is the number of consensus rounds needed to estimate the Barzilai-Borwein step size. We use T = 20 as in [35]. The setup for the experiments is a geometric network with 50 sensors randomly distributed in the two-dimensional square [0, 1] × [0, 1], with an average node degree of about 6, and 4 anchors placed at the vertices of this square. The network remains fixed during all the Monte Carlo trials. Both algorithms are initialized by a convex approximation method; the initialization will hopefully hand the nonconvex refinement algorithms a point near the basin of attraction of the true minimum.
For this purpose we generate noisy range measurements according to d_ij = | ‖x*_i − x*_j‖ + ν_ij | and r_ik = | ‖x*_i − a_k‖ + η_ik |, where {ν_ij : i∼j ∈ E} ∪ {η_ik : i ∈ V, k ∈ A_i} are independent Gaussian random variables with zero mean and standard deviation σ. We conducted 100 Monte Carlo trials for each standard deviation σ ∈ {0.01, 0.05, 0.1}. If we spread the sensors over a square area with 1 km sides, this means measurements are affected by noise with standard deviations of 10 m, 50 m, and 100 m. In terms of mean positioning error the proposed algorithm fares better than the benchmark: Table 3.1 shows the mean error defined in (3.13) after the algorithms have stabilized or reached a maximum iteration number. In the simulated setup, we improve the accuracy of gradient descent with Barzilai-Borwein steps by about 1 m per sensor, even for high-power noise. Figure 3.2 depicts the averaged evolution of the error per sensor of both algorithms as a function of the volume of accumulated communications, and also
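The measurement model above can be sketched in a few lines; the node positions, σ, and the seed below are illustrative, not taken from the simulated network.

```python
import numpy as np

# Sketch: generating one noisy range measurement d_ij = | ||x_i - x_j|| + nu |
# as in the Monte Carlo setup, with nu ~ N(0, sigma^2).

rng = np.random.default_rng(0)

def noisy_range(xi, xj, sigma, rng):
    # absolute value keeps the simulated range nonnegative
    return abs(np.linalg.norm(xi - xj) + rng.normal(0.0, sigma))

xi, xj = np.array([0.2, 0.3]), np.array([0.7, 0.9])
d = noisy_range(xi, xj, sigma=0.05, rng=rng)
```

The same generator produces the anchor ranges r_ik by replacing x_j with the anchor position a_k.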
[Figure 3.2(a): MPE per sensor vs. communications per sensor, for gradient descent with BB steps and the proposed method.]
(a) The proposed method outperforms the competing algorithm, both in accuracy and in communication cost. Our proposed method improves on the state-of-the-art method in [35] by about 60 cm in mean positioning error per sensor, delivering a consistent and stable progression of the error of the estimates.
[Figure 3.2(b): average cost per sensor vs. communications per sensor, for gradient descent with BB steps and the proposed method.]
(b) The final costs are 1.7392 × 10⁻⁴ for the BB method and 1.5698 × 10⁻⁴ for the proposed method: a small difference in cost that translates into a considerable difference in error, as depicted in Figure 3.2(a) and Table 3.1.
Figure 3.2: Comparison of the evolution of cost and average error per sensor with communications, for Algorithm 4 and the benchmark. Noisy distance measurements with σ = 0.01, representing 10 m for a square with 1 km sides. The proposed method shows a faster and smoother progression, while the benchmark oscillates, always above the proposed method.
the evolution of the cost for low-power noise. Gradient descent with Barzilai-Borwein steps shows an irregular pattern for the error, only vaguely matching the variation in the corresponding cost (Figure 3.2(b)), thus leaving some uncertainty regarding when to stop the algorithm and which estimate to keep. The presented method reaches the final cost value per sensor much faster and more steadily than the benchmark for medium-low measurement noise. In fact, our method takes under one order of magnitude fewer communications than the benchmark to approach the minimum cost value (match the cost at about 1500 communications with 15000 for the benchmark). The most realistic case of medium noise power led to the results presented in Figure 3.3. The characteristic irregularity of the BB method continues to fail in delivering better solutions on average than our
[Figure 3.3(a): MPE per sensor vs. communications per sensor, for gradient descent with BB steps and the proposed method.]
(a) For medium noise power the comparison of the algorithms' performance follows the one under low noise power. The accuracy gain is more than 1 m per sensor.
[Figure 3.3(b): average cost per sensor vs. communications per sensor, for gradient descent with BB steps and the proposed method.]
(b) Under medium noise the proposed method also reaches a smaller value for the average cost per sensor: 0.0031, vs. 0.0032 for the BB method.
Figure 3.3: Comparison of the evolution of cost and average error per sensor with communications, for Algorithm 4 and the benchmark, under medium-power noise. Distance measurements contaminated with noise with σ = 0.05, representing 50 m for a square with 1 km sides. The proposed method continues to outperform the benchmark, and evolves much more predictably than the BB method.
stable, guaranteed method. The error curves in Figure 3.3(a) are increasing because the error is not the quantity being directly optimized, and the medium-high noise power in the measurement data shifts the optimal points of the cost function relative to the nominal positions. Under high noise power, our method surpasses the performance of the benchmark in cost function terms, as shown in Figure 3.4(b), not only in convergence speed, but also in the final value reached. Again, our method expends almost one order of magnitude fewer communications to achieve its plateau, which is itself, on average, better than the alternative method (compare the performance at 700 communications with the one at 7000 for the benchmark).
[Figure 3.4(a): MPE per sensor vs. communications per sensor, for gradient descent with BB steps and the proposed method.]
(a) The proposed algorithm outperforms the benchmark in error, under high noise power, by more than 1 m, when considering a square deployment area with 1 km sides.
[Figure 3.4(b): average cost per sensor vs. communications per sensor, for gradient descent with BB steps and the proposed method.]
(b) Under heavy noise the proposed method reaches a smaller value for the average cost per sensor: 0.0096, vs. 0.0099 for the BB method.
Figure 3.4: Comparison of the evolution of cost and average error per sensor with communications, for Algorithm 4 and the benchmark, under high-power noise. Distance measurements contaminated with noise with σ = 0.1, representing 100 m for a square with 1 km sides.
3.2.6  Summary
The monotonicity of the proposed method is a strong feature for applications of sensor network localization. Our method proves to be not only fast and resilient, but also simple to implement and deploy, with no free parameters to tune. The steady accuracy gain over the competing method also makes it usable in contexts with different noise powers. The presented method can be useful both as a refinement algorithm and as a tracking method, e.g., for mobile robot formations where position estimates computed at a given time step are used as the initialization for the next one. An asynchronous flavor of the algorithm would be, as far as we know, restricted to a broadcast gossip scheme, following a block-coordinate descent model. This line of research is in progress.
3.3  Majorization-Minimization with convex tight majorizer
A quadratic majorizer such as the one used in Section 3.2 is a common choice in the MM framework. As one would expect, preliminary simulation results show that using a tighter majorizer improves localization performance. In the following Sections we describe a particularly tight convex majorization function and point out some directions of research towards a distributed method to optimize it.
3.3.1  Majorization function
Commonly, MM techniques resort to quadratic majorizers which, albeit easy to minimize, show a considerable mismatch with most cost functions (in particular, with f in (1.1)). To overcome this problem, we introduce a key novel majorizer. It is specifically adapted to f, tighter than a quadratic, convex, and easily optimizable. Before proceeding it is useful to rewrite (1.1) as

f(x) = Σ_{i∼j} f_ij(x_i, x_j) + Σ_{i} Σ_{k∈A_i} f_ik(x_i),

where f_ij(x_i, x_j) = φ_{d_ij}(x_i − x_j) and f_ik(x_i) = φ_{r_ik}(x_i − a_k), both defined in terms of the basic building block

φ_d(u) = (‖u‖ − d)².   (3.14)

3.3.1.A  Majorization function for (3.14)
Let v ∈ R^p be given, assumed nonzero. We provide a majorizer Φ_d(· | v) for φ_d in (3.14) which is tight at v, i.e., φ_d(u) ≤ Φ_d(u | v) for all u and φ_d(v) = Φ_d(v | v).

Proposition 9. Let

Φ_d(u | v) = max{ g_d(u), h_d(v^T u/‖v‖ − d) },   (3.15)

where

g_d(u) = (‖u‖ − d)₊²,   (3.16)

(r)₊² = (max{0, r})², and

h_R(r) = 2R|r| − R²   if |r| ≥ R,
         r²           if |r| < R,   (3.17)

is the Huber function of parameter R. Then, the function Φ_d(· | v) is convex, is tight at v, and majorizes φ_d.

Proof. See Section 3.3.6.

Further, we propose the following Conjecture:

Conjecture 10. The majorizer Φ_d(· | v) in (3.15) is a tight convex majorizer of the nonconvex function φ_d(·) in (3.14), i.e., for all convex functions ψ : R^p → R such that φ_d(x) ≤ ψ(x) ≤ Φ_d(x | v) for all x ∈ R^p, we have ψ(x) = Φ_d(x | v).
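The building blocks (3.14)-(3.17) can be checked numerically. The sketch below implements them for a scalar argument (p = 1), where v/‖v‖ reduces to the sign of v; the sampled grid and the values d = 0.5, v = 0.1 are illustrative.

```python
import numpy as np

# Numerical sketch of the majorizer in (3.15)-(3.17), for scalar u (p = 1).
# phi_d is the building block (3.14); g_d and the Huber function h_R follow
# (3.16)-(3.17); Phi_d takes the pointwise maximum as in (3.15).

def phi_d(u, d):
    return (abs(u) - d) ** 2

def g_d(u, d):
    return max(0.0, abs(u) - d) ** 2

def h_R(r, R):
    return 2 * R * abs(r) - R ** 2 if abs(r) >= R else r ** 2

def Phi_d(u, v, d):
    # v is nonzero; in one dimension v/|v| is just the sign of v
    return max(g_d(u, d), h_R((v / abs(v)) * u - d, d))

d, v = 0.5, 0.1
us = np.linspace(-3, 1, 401)  # grid over which majorization can be verified
```

On this grid, Φ_d(u | v) dominates φ_d(u) everywhere and coincides with it at u = v, matching Proposition 9.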
[Figure 3.5: plot of the cost φ_d(u), the proposed majorizer Φ_d(u|v), and the quadratic majorizer Q_d(u|v) over u ∈ [−3, 1].]
Figure 3.5: Nonconvex cost function (black, dash-dot) in (3.14) against the proposed majorizer (red, solid) in (3.15) and a vanilla quadratic majorizer (blue, dashed) in (3.18), for d = 0.5 and v = 0.1. The proposed convex majorizer is a much more accurate approximation.

The tightness of the proposed majorization function is illustrated in Figure 3.5, in which we depict, for a one-dimensional argument u, d = 0.5 and v = 0.1: the nonconvex cost function in (3.14), the proposed majorizer in (3.15), and a quadratic majorizer

Q_d(u | v) = ‖u‖² + d² − 2d v^T u/‖v‖,   (3.18)
obtained through routine manipulations of (3.14), e.g., expanding the square and linearizing ‖u‖ at v, which is common in MM approaches (cf. [5, 6] for quadratic majorizers applied to the sensor network localization problem and [39] for an application in robust MDS). Clearly, the proposed convex majorizer is a better approximation to the nonconvex cost function². As an expected corollary, it also outperforms the quadratic majorizer in accuracy when embedded in the MM framework, as shown in the experimental results of Section 3.3.2. The proof of Conjecture 10 is being addressed.

3.3.1.B  Majorization function for the sensor network localization problem
We remind the reader of the maximum-likelihood estimation problem for sensor network localization defined in (1.1).

² The fact that both majorizers have a coincident minimum is an artifact of this toy example, and does not hold in general.
Now, for given x[l], consider the function

F(x | x[l]) = Σ_{i∼j} F_ij(x_i, x_j) + Σ_{i} Σ_{k∈A_i} F_ik(x_i),   (3.19)

where

F_ij(x_i, x_j) = Φ_{d_ij}(x_i − x_j | x_i[l] − x_j[l])   (3.20)

and

F_ik(x_i) = Φ_{r_ik}(x_i − a_k | x_i[l] − a_k).   (3.21)
Given Proposition 9, it is clear that F(· | x[l]) majorizes f and is tight at x[l]. Moreover, it is convex, as a sum of convex functions.
3.3.2  Experimental results on majorization function quality
To initialize the algorithms we take the true sensor positions x* = {x*_i : i ∈ V} and perturb them by adding independent zero-mean Gaussian noise, according to

x_i[0] = x*_i + η_i,   (3.22)

where η_i ∼ N(0, σ²_init I_p) and I_p is the identity matrix of size p × p. The parameter σ_init is detailed
ahead. We compare the performance of our proposed majorizer in (3.19) with a standard one built out of quadratic functions, e.g., the one used in [6]. We submitted a simple source localization problem, with one sensor and 4 anchors, to two MM algorithms, each associated with one of the majorization functions. They ran for a fixed number of 30 iterations. At each Monte Carlo trial, the true sensor positions were corrupted by zero-mean Gaussian noise, as in (3.22), with standard deviation σ_init ∈ [0.01, 1]. The range measurements are taken to be noiseless, i.e., σ = 0 in (3.64), in order to create an idealized scenario for direct comparison of the two approaches. The evolution of the RMSE as a function of the initialization noise intensity is illustrated in Figure 3.6. There is a clear advantage in using this majorization function when the initialization falls within a radius of the true location of up to 30% of the square size.
3.3.3  Distributed optimization of the proposed majorizer using ADMM
At the l-th iteration of the nonconvex minimization algorithm, the convex function in (3.19) must be minimized. We now show how this optimization problem can be solved collaboratively by the network in a distributed, parallel manner. We propose a first distributed algorithm to tackle problem (1.1). Starting from an initialization x[0] for the unknown sensors' positions x, it generates a sequence of iterates (x[l])_{l≥1} which, hopefully, converges to a solution of (1.1). We apply the majorization-minimization (MM) framework [37] to (1.1): at each iteration l, we minimize (3.19), a majorizer of f tight at the current
[Figure 3.6: log-log plot of RMSE vs. initialization noise intensity σ_init, for the quadratic majorizer Q and the proposed majorizer F.]
Figure 3.6: RMSE vs. σ_init, the intensity of the initialization noise in (3.22). The range measurements are noiseless: σ = 0 in (3.64). Anchors are at the unit square corners. The proposed majorizer (red, solid) outperforms the quadratic majorizer (blue, dashed) in accuracy.

Algorithm 5 Minimization of the tight majorizer in (3.19)
Input: x[0]
Output: x[L]
1: for l = 0 to L − 1 do
2:   x[l + 1] = argmin_x F(x | x[l])
3: end for
4: return x[L]
iterate x[l], to obtain the next iterate x[l + 1]. The algorithm is outlined in Algorithm 5 for a fixed number of iterations L. Here, F(· | x[l]) denotes a majorizer of f (i.e., f(x) ≤ F(x | x[l]) for all x) which is tight at x[l] (i.e., f(x[l]) = F(x[l] | x[l])). The majorizer is detailed in Section 3.3.1.B. Note that f(x[l + 1]) ≤ f(x[l]), that is, f is monotonically decreasing along iterations, an important property of the MM framework. Algorithm 5 is a distributed algorithm because, as we shall see, the minimization of the upper bounds F can be achieved in a distributed manner.

3.3.3.A  Problem reformulation
In the distributed algorithm, each working node will operate on local copies of the estimated positions of its neighbors and of itself, so it is convenient to introduce new variables. Let V_i = {j : j ∼ i} denote the neighbors of sensor i. We also define the closed neighborhood
V̄_i = V_i ∪ {i}. For each i, we duplicate x_i into new variables y_ji, j ∈ V̄_i, and z_ik, k ∈ A_i. This choice of notation is not fortuitous: the first subscript reveals which physical node will store the variable, in our proposed implementation; thus, x_i and z_ik are stored at node i, whereas y_ji is managed by node j. We write the minimization of (3.19) as the optimization problem

minimize    F(y, z)
subject to  y_ji = x_i,  j ∈ V̄_i
            z_ik = x_i,  k ∈ A_i,   (3.23)

where y = {y_ji : i ∈ V, j ∈ V̄_i}, z = {z_ik : i ∈ V, k ∈ A_i}, and

F(y, z) = Σ_{i∼j} ( F_ij(y_ii, y_ij) + F_ij(y_ji, y_jj) ) + 2 Σ_{i} Σ_{k∈A_i} F_ik(z_ik).   (3.24)
In passing from (3.19) to (3.23) we used the identity F_ij(x_i, x_j) = (1/2) F_ij(y_ii, y_ij) + (1/2) F_ij(y_ji, y_jj), due to y_ji = x_i. Also, for convenience, we have rescaled the objective by a factor of two.

3.3.3.B  Algorithm derivation
Problem (3.23) is in the form

minimize    F(y, z) + G(x)
subject to  A(y, z) + Bx = 0,   (3.25)
where F is the convex function in (3.24), G is the identically zero function, A is the identity operator, and B is a matrix whose rows belong to the set {−e_i^T : i ∈ V}, where e_i is the i-th column of the identity matrix of size |V|. For a connected network B is full column rank, so the problem is suited to the Alternating Direction Method of Multipliers (ADMM). See [40] and references therein for more details on this method, and [41–46] for applications of ADMM in distributed optimization settings. Let λ_ji be the Lagrange multiplier associated with the constraint y_ji = x_i and λ = {λ_ji} the collection of all such multipliers. Similarly, let μ_ik be the Lagrange multiplier associated with the constraint z_ik = x_i and μ = {μ_ik}. The ADMM framework generates a sequence (y(t), z(t), x(t), λ(t), μ(t))_{t≥1} such that

(y(t+1), z(t+1)) = argmin_{y,z} L_ρ(y, z, x(t), λ(t), μ(t))   (3.26)
x(t+1) = argmin_x L_ρ(y(t+1), z(t+1), x, λ(t), μ(t))   (3.27)
λ_ji(t+1) = λ_ji(t) + ρ ( y_ji(t+1) − x_i(t+1) )   (3.28)
μ_ik(t+1) = μ_ik(t) + ρ ( z_ik(t+1) − x_i(t+1) ),   (3.29)
where L_ρ is the augmented Lagrangian, defined as

L_ρ(y, z, x, λ, μ) = F(y, z) + Σ_{i} Σ_{j∈V̄_i} ( λ_ji^T (y_ji − x_i) + (ρ/2) ‖y_ji − x_i‖² ) + Σ_{i} Σ_{k∈A_i} ( μ_ik^T (z_ik − x_i) + (ρ/2) ‖z_ik − x_i‖² ).   (3.30)
Here, ρ > 0 is a pre-chosen constant. In our implementation, we let node i store the variables x_i, y_ij, λ_ij, λ_ji, for j ∈ V̄_i, and z_ik, μ_ik, for k ∈ A_i. Note that a copy of λ_ij is maintained at both nodes i and j (this is to avoid extra communication steps). For t = 0, we can set λ(0) and μ(0) to a pre-chosen constant (e.g., zero) at all nodes. Also, we assume that, at the beginning of the iterations (i.e., for t = 0), node i knows x_j(0) for j ∈ V_i (this can be accomplished, e.g., by having each node i communicate x_i(0) to all its neighbors). This property will be preserved for all t ≥ 1 in our algorithm, via communication steps. We now show that the minimizations in (3.26) and (3.27) can be implemented in a distributed manner and require low computational cost at each node.

3.3.3.C  ADMM: Solving Problem (3.26)
As shown in Section 3.3.7, the augmented Lagrangian in (3.30) can be written as

L_ρ(y, z, x, λ, μ) = Σ_{i} ( Σ_{j∈V̄_i} L_ij(y_ii, y_ij, x_j, λ_ij) + Σ_{k∈A_i} L_ik(z_ik, x_i, μ_ik) ),   (3.31)
where

L_ij(y_ii, y_ij, x_j, λ_ij) = F_ij(y_ii, y_ij) + λ_ij^T (y_ij − x_j) + (ρ/2) ‖y_ij − x_j‖²

and

L_ik(z_ik, x_i, μ_ik) = 2 F_ik(z_ik) + μ_ik^T (z_ik − x_i) + (ρ/2) ‖z_ik − x_i‖².

In (3.31) we let F_ii ≡ 0. It is clear from (3.31) that Problem (3.26) decouples across sensors i ∈ V, since we are optimizing only over y and z. Further, at each sensor i, it decouples into two types of subproblems: one involving the variables y_ij, j ∈ V̄_i, given by

minimize_{y_ij, j∈V̄_i}  Σ_{j∈V̄_i} L_ij(y_ii, y_ij, x_j, λ_ij),   (3.32)

and |A_i| subproblems of the form

minimize_{z_ik}  L_ik(z_ik, x_i, μ_ik),   (3.33)
involving the variable z_ik, k ∈ A_i. Note that the problems related to anchors are simpler and, since there are usually few anchors in a network, they do not occur frequently.

A – Solving Problem (3.32)

First, note that node i can indeed address Problem (3.32), since all the data defining it are available at node i: it stores λ_ji(t), j ∈ V̄_i, and it knows x_j(t) for all neighbors j ∈ V_i (this holds trivially for t = 0 by construction, and is preserved by our approach, as shown ahead).
To alleviate notation we now suppress the indication of the working node i, i.e., variable y_ij is simply written y_j. Problem (3.32) can be written as

minimize_{y_j, j∈V̄_i}  Σ_{j∈V_i} ( F_ij(y_i, y_j) + (ρ/2) ‖y_j − γ_ij‖² ) + (ρ/2) ‖y_i − γ_ii‖²,   (3.34)

where γ_ij = x_j − λ_ij/ρ. We make the crucial observation that, for fixed y_i, the problem is separable in the remaining variables y_j, j ∈ V_i. This motivates writing (3.34) as the master problem

minimize_{y_i}  H(y_i) = Σ_{j∈V_i} H_ij(y_i) + (ρ/2) ‖y_i − γ_ii‖²,   (3.35)

where

H_ij(y_i) = min_{y_j} F_ij(y_i, y_j) + (ρ/2) ‖y_j − γ_ij‖².   (3.36)
We now state important properties of H_ij.

Proposition 11. Define H_ij as in (3.36). Then:

1. Optimization problem (3.36) has a unique solution y_j for any given y_i, henceforth denoted y_j*(y_i);

2. Function H_ij is convex and differentiable, with gradient

∇H_ij(y_i) = ρ ( y_j*(y_i) − γ_ij );   (3.37)

3. The gradient of H_ij is Lipschitz continuous with parameter ρ, i.e., ‖∇H_ij(u) − ∇H_ij(v)‖ ≤ ρ ‖u − v‖ for all u, v ∈ R^p.

Proof.
1. Recall from (3.20) that F_ij(y_i, y_j) = Φ_d(y_i − y_j | v), where d = d_ij and v = x_i[l] − x_j[l]. We have H_ij(y_i) = Θ(y_i − γ_ij), where

Θ(w) = min_u Φ_d(u | v) + (ρ/2) ‖u − w‖².   (3.38)

Moreover, u* solves (3.38) if and only if y_j* = y_i − u* solves (3.36). Now, the cost function in (3.38) is clearly continuous, coercive (i.e., it converges to +∞ as ‖u‖ → +∞) and strictly convex, the two last properties arising from the quadratic term. Thus, it has a unique solution;

2. The function Θ is the Moreau-Yosida regularization of the convex function Φ_d(· | v) [20, XI.3.4.4]. As Θ is known to be convex and H_ij is the composition of Θ with an affine map, H_ij is convex. It is also known that the gradient of Θ is ∇Θ(w) = ρ (w − u*(w)),
where u*(w) is the unique solution of (3.38) for a given w. Thus,

∇H_ij(y_i) = ∇Θ(y_i − γ_ij) = ρ ( y_i − γ_ij − u*(y_i − γ_ij) ).

Unwinding the change of variable, i.e., using y_j*(y_i) = y_i − u*(y_i − γ_ij), we obtain (3.37);

3. Follows from the well-known fact that the gradient of Θ is Lipschitz continuous with parameter ρ.
As a consequence, we obtain several nice properties of the function H.

Theorem 12. Function H in (3.35) is strongly convex with parameter ρ, i.e., H − (ρ/2) ‖·‖² is convex. Furthermore, it is differentiable with gradient

∇H(y_i) = ρ Σ_{j∈V_i} ( y_j*(y_i) − γ_ij ) + ρ (y_i − γ_ii).   (3.39)

The gradient of H is Lipschitz continuous with parameter L_H = ρ (|V_i| + 1).

Proof. Since H is a sum of convex functions, it is convex. It is strongly convex with parameter ρ due to the presence of the strongly convex term (ρ/2) ‖y_i − γ_ii‖². As a sum of differentiable functions, it is differentiable, and the given formula for the gradient follows from Proposition 11. Finally, since H is the sum of |V_i| + 1 functions with Lipschitz continuous gradient with parameter ρ, the claim is proved.

The properties established in Theorem 12 show that the optimization problem (3.35) is suited for Nesterov's optimal method for the minimization of strongly convex functions with Lipschitz continuous gradient [27, Theorem 2.2.3]. The resulting method is outlined in Algorithm 6, which is guaranteed to converge to the solution of (3.35).

Algorithm 6 Nesterov's optimal method for (3.35)
1: ŷ_i(0) = y_i(0)
2: for s ≥ 0 do
3:   y_i(s+1) = ŷ_i(s) − (1/L_H) ∇H(ŷ_i(s))
4:   ŷ_i(s+1) = y_i(s+1) + ((√L_H − √ρ)/(√L_H + √ρ)) (y_i(s+1) − y_i(s))
5: end for

B – Solving problem (3.36)

It remains to show how to solve (3.36) at a given sensor node.
Any off-the-shelf convex solver, e.g., based on interior-point methods, could handle it. However, we present a simpler method that avoids the expensive matrix operations typical of interior-point methods by taking advantage of the problem structure at hand. This is important in sensor networks, where the sensors have stringent computational resources.
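As a concrete reference for the outer engine used throughout this section, the iteration of Algorithm 6 can be sketched on a toy problem. The quadratic H below is a stand-in chosen only so the minimizer is known in closed form; in the text, the gradient oracle would be (3.39).

```python
import numpy as np

# Sketch of Algorithm 6 (Nesterov's optimal method for strongly convex
# functions with Lipschitz gradient), applied to a toy quadratic
# H(y) = 0.5 * y' Q y - b' y, whose minimizer Q^{-1} b is known.
# Q, b, and the iteration count are illustrative.

def nesterov(grad_H, y0, L_H, rho, iters):
    y, y_hat = y0.copy(), y0.copy()
    q = (np.sqrt(L_H) - np.sqrt(rho)) / (np.sqrt(L_H) + np.sqrt(rho))
    for _ in range(iters):
        y_next = y_hat - grad_H(y_hat) / L_H       # step 3: gradient step
        y_hat = y_next + q * (y_next - y)          # step 4: momentum step
        y = y_next
    return y

Q = np.array([[3.0, 0.0], [0.0, 1.0]])  # strong convexity rho = 1, L_H = 3
b = np.array([3.0, -2.0])
y = nesterov(lambda u: Q @ u - b, np.zeros(2), L_H=3.0, rho=1.0, iters=100)
```

The linear convergence rate guaranteed by [27, Theorem 2.2.3] makes a fixed, modest iteration budget sufficient in practice.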
First, as shown in the proof of Proposition 11, it suffices to focus on solving (3.38) for a given w: solving (3.36) amounts to solving (3.38) for w = y_i − γ_ij to obtain u* = u*(w) and setting y_j*(y_i) = y_i − u*.
Note from (3.15) that Φ_d(· | v) only depends on v/‖v‖, so we can assume, without loss of generality, that ‖v‖ = 1. From (3.15), we see that Problem (3.38) can be rewritten as

minimize    r + (ρ/2) ‖u − w‖²
subject to  g_d(u) ≤ r
            h_d(v^T u − d) ≤ r,   (3.40)

with optimization variable (u, r). The Lagrange dual (cf., for example, [20]) of (3.40) is given by

maximize    ψ(ω)
subject to  0 ≤ ω ≤ 1,   (3.41)

where ψ(ω) = inf{Ψ(ω, u) : u ∈ R^n} and

Ψ(ω, u) = (ρ/2) ‖u − w‖² + ω g_d(u) + (1 − ω) h_d(v^T u − d).   (3.42)
We propose to solve the dual problem (3.41), which involves the single variable ω, by bisection: we maintain an interval [a, b] ⊂ [0, 1] (initially, [a, b] = [0, 1]); we evaluate the derivative ψ̇(c) at the midpoint c = (a + b)/2; if ψ̇(c) > 0, we set a = c, otherwise b = c; the scheme is repeated until the uncertainty interval is sufficiently small.
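The bisection scheme itself is a few lines. In the sketch below, `psi_dot` is a hypothetical stand-in for the derivative (3.45); in the actual algorithm it would be evaluated by solving the inner problem for u*(ω). The toy concave dual is illustrative only.

```python
# Sketch of the bisection scheme for the dual problem (3.41): psi is concave
# on [0, 1], so its derivative decreases and crosses zero at the maximizer.

def bisect_dual(psi_dot, tol=1e-8):
    a, b = 0.0, 1.0                    # initially [a, b] = [0, 1]
    while b - a > tol:
        c = (a + b) / 2
        if psi_dot(c) > 0:
            a = c                      # maximizer lies to the right
        else:
            b = c
    return (a + b) / 2

# toy concave dual psi(w) = -(w - 0.3)^2, with derivative -2(w - 0.3)
omega = bisect_dual(lambda w: -2 * (w - 0.3))
```

Since the interval halves at every step, the cost is logarithmic in the desired accuracy, regardless of the inner solver.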
To make this approach work, we must first prove that the dual function ψ is indeed differentiable in the open interval Ω = (0, 1) and find a convenient formula for its derivative. We will need the following useful result from convex analysis.

Lemma 13. Let X ⊂ R^n be an open convex set and Y ⊂ R^p a compact set. Let F : X × Y → R. Assume that F(x, ·) is lower semi-continuous for all x ∈ X and that F(·, y) is concave and differentiable for all y ∈ Y. Let f : X → R, f(x) = inf{F(x, y) : y ∈ Y}. Assume that, for any x ∈ X, the infimum is attained at a unique y*(x) ∈ Y. Then, f is differentiable everywhere and its gradient at x ∈ X is given by

∇f(x) = ∇F(x, y*(x)),   (3.43)

where ∇ refers to differentiation with respect to x.

Proof. This is essentially [20, VI.4.4.5], after one changes concave for convex, lower semi-continuous for upper semi-continuous, and inf for sup.

Now, view Ψ in (3.42) as defined on Ω × R^n. It is clear that Ψ(ω, ·) is lower semi-continuous for all ω (in fact, continuous) and Ψ(·, u) is concave (in fact, affine) and differentiable for all u. In fact, some even nicer properties hold.
Lemma 14. Let ω ∈ Ω. The function Ψ_ω = Ψ(ω, ·) is strongly convex with parameter ρ and differentiable everywhere, with gradient

∇Ψ_ω(u) = ρ (u − w) + 2ω (u − π(u)) + (1 − ω) ḣ_d(v^T u − d) v,   (3.44)

where π(u) denotes the projection of u onto the closed ball of radius d centered at the origin. Furthermore, the gradient of Ψ_ω is Lipschitz continuous with parameter ρ + 2.

Proof. We start by noting that g_d in (3.16) can be written as g_d(u) = d_C²(u), where C is the closed ball of radius d centered at the origin and d_C denotes the distance to the closed convex set C. It is known that g_d is convex and differentiable, that its gradient is given by ∇g_d(u) = 2 (u − π(u)), and that this gradient is Lipschitz continuous with parameter 2 [20, X.3.2.3]. Also, the function h_d in (3.17) is convex and differentiable. Thus, the function Ψ_ω is convex (resp. differentiable) as a sum of three convex (resp. differentiable) functions. It is strongly convex with parameter ρ due to the first term (ρ/2) ‖· − w‖². The gradient in (3.44) is clear. Finally, from |ḣ_d(r) − ḣ_d(s)| ≤ 2 |r − s| for all r, s, there holds, for any u_1, u_2,

|ḣ_d(v^T u_1 − d) − ḣ_d(v^T u_2 − d)| ≤ 2 |v^T (u_1 − u_2)| ≤ 2 ‖u_1 − u_2‖,

where ‖v‖ = 1 and the Cauchy-Schwarz inequality was used in the last step. We conclude from (3.44) that, for any u_1, u_2,

‖∇Ψ_ω(u_1) − ∇Ψ_ω(u_2)‖ ≤ (ρ + 2ω + 2(1 − ω)) ‖u_1 − u_2‖,

i.e., the gradient of Ψ_ω is Lipschitz continuous with parameter ρ + 2.

Using Lemma 14, we see that the infimum of Ψ_ω is attained at a single u*(ω), since Ψ_ω is a continuous, strongly convex function. The derivative of ψ in (3.41) relies on u*(ω), as seen in Lemma 15.

Lemma 15. Function ψ in (3.41) is differentiable and its derivative is

ψ̇(ω) = g_d(u*(ω)) − h_d( v^T u*(ω) − d ).   (3.45)
Proof. We begin by bounding the norm of u*(ω). From the necessary stationarity condition ∇Ψ_ω(u*(ω)) = 0 and (3.44), we conclude

(ρ + 2ω) u*(ω) = ρ w + 2ω π(u*(ω)) − (1 − ω) ḣ_d(v^T u*(ω) − d) v.   (3.46)

Since |ḣ_d(t)| ≤ 2d for all t (see (3.17)), ‖π(u)‖ ≤ d for all u, ‖v‖ = 1, and 0 ≤ ω ≤ 1, we can bound the norm of the right-hand side of (3.46) by ρ ‖w‖ + 4d. Thus,

‖u*(ω)‖ ≤ (1/(ρ + 2ω)) (ρ ‖w‖ + 4d) ≤ (1/ρ) (ρ ‖w‖ + 4d) = ‖w‖ + 4d/ρ.
Introduce the compact set U = {u ∈ R^n : ‖u‖ ≤ ‖w‖ + 4d/ρ}. The previous analysis has shown that the dual function in (3.41) can also be represented as ψ(ω) = inf{Ψ(ω, u) : u ∈ U}, i.e., we can restrict the search to U and view Ψ as defined on Ω × U. We can thus invoke Lemma 13 to conclude that ψ is differentiable and that (3.45) holds.

Finding u*(ω). To obtain u*(ω) we must minimize Ψ_ω. But, given its properties in Lemma 14, the simple optimal Nesterov method, described in Algorithm 6, is also applicable here.
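The gradient oracle one would feed to that Nesterov iteration is exactly (3.44), built from the ball projection π(u) and the Huber derivative ḣ_d. The sketch below assembles it; the values of w, v (with ‖v‖ = 1, as assumed above), d, ρ and ω are illustrative.

```python
import numpy as np

# Sketch of the gradient (3.44) of Psi_omega:
#   rho*(u - w) + 2*omega*(u - pi(u)) + (1 - omega)*hdot(v'u - d)*v,
# where pi projects onto the ball of radius d and hdot is the derivative
# of the Huber function h_d in (3.17).

def proj_ball(u, d):
    n = np.linalg.norm(u)
    return u if n <= d else d * u / n

def hdot(r, R):
    # Huber derivative: 2r in the quadratic zone, +/- 2R outside
    return 2 * r if abs(r) < R else 2 * R * np.sign(r)

def grad_Psi(u, w, v, d, rho, omega):
    return (rho * (u - w)
            + 2 * omega * (u - proj_ball(u, d))
            + (1 - omega) * hdot(v @ u - d, d) * v)

w = np.array([0.8, -0.2])
v = np.array([1.0, 0.0])   # unit norm, as assumed
u = np.array([1.5, 0.4])
g = grad_Psi(u, w, v, d=0.5, rho=2.0, omega=0.4)
```

A finite-difference check of this oracle against the objective (3.42) is a cheap way to validate an implementation before plugging it into Algorithm 6.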
C – Solving problem (3.33)

Note that node i stores x_i(t) and μ_ik(t), k ∈ A_i; thus, it can indeed address Problem (3.33). Problem (3.33) is similar to (in fact, much simpler than) (3.32), and following the previous steps leads to the same Nesterov optimal method. We omit this straightforward derivation.

3.3.3.D  ADMM: Solving Problem (3.27)
Looking at (3.30), it is clear that Problem (3.27) also decouples across nodes. Furthermore, at node i a simple unconstrained quadratic problem with respect to x_i must be solved, whose closed-form solution is

x_i(t+1) = (1/(|V̄_i| + |A_i|)) [ Σ_{j∈V̄_i} ( (1/ρ) λ_ji(t) + y_ji(t+1) ) + Σ_{k∈A_i} ( (1/ρ) μ_ik(t) + z_ik(t+1) ) ].   (3.47)
For node $i$ to carry out this update, it first needs to receive $y_{ji}(t+1)$ from its neighbors $j \in V_i$. This requires a communication step.

3.3.3.E ADMM: Implementing (3.28) and (3.29)
Recall that the dual variable $\lambda_{ji}$ is maintained at both nodes $i$ and $j$. Node $i$ can carry out the update of $\lambda_{ji}(t+1)$ in (3.28), for all $j \in V_i$, since the needed data are available (recall that $y_{ji}(t+1)$ is available from the previous communication step). To update $\lambda_{ij}(t+1) = \lambda_{ij}(t) + \rho(y_{ij}(t+1) - x_j(t+1))$, node $i$ needs to receive $x_j(t+1)$ from its neighbors $j \in V_i$. This requires a communication step.

3.3.3.F Summary of the distributed algorithm
Our ADMM-based algorithm stops after a fixed number of iterations, denoted $T$. Algorithm 7 outlines the procedure derived in Sections 3.3.3.C and 3.3.3.D, and corresponds to step 2 of the ADMM-based algorithm (Algorithm 5). Note that, in order to implement step 5 of Algorithm 7, one must adapt Algorithm 6 to the problem at hand.
3. Distributed network localization with initialization: Nonconvex procedures
Algorithm 7 Step 2 of Algorithm 5 using ADMM: position updates
Input: x[l]
Output: x[l + 1]
1: for t = 0 to T − 1 do
2:   for each node i ∈ V in parallel do
3:     Solve Problem (3.32) by minimizing H in (3.35) with Alg. 6 to obtain y_ij(t + 1), j ∈ V_i
4:     for k = 1 to |A_i| do
5:       Solve Problem (3.33) to obtain z_ik(t + 1)
6:     end for
7:     Send y_ij(t + 1) to neighbor j ∈ V_i
8:     Compute x_i(t + 1) from (3.47)
9:     Send x_i(t + 1) to all j ∈ V_i
10:    Update {λ_ji(t + 1), μ_ik(t + 1), j ∈ V_i, k ∈ A_i} as in (3.28) and (3.29)
11:   end for
12: end for
13: return x[l + 1] = x(T)
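The position update in step 8 of Algorithm 7, i.e., the closed form (3.47), averages the scaled duals with the edge and anchor copies held at node $i$. A sketch, with illustrative variable names (not the thesis implementation):

```python
import numpy as np

def position_update(lmbda, y, mu, z, rho):
    """Sketch of the per-node closed form (3.47); names are illustrative.

    lmbda, y : dicts over neighbors j in V_i holding lambda_ji(t), y_ji(t+1)
    mu, z    : dicts over anchors k in A_i holding mu_ik(t), z_ik(t+1)
    """
    total = sum(lmbda[j] / rho + y[j] for j in y)
    total = total + sum(mu[k] / rho + z[k] for k in z)
    return total / (len(y) + len(z))

rho = 2.0
lmbda = {1: np.array([0.1, 0.0]), 2: np.array([0.0, 0.2])}
y = {1: np.array([0.5, 0.5]), 2: np.array([0.4, 0.6])}
mu = {7: np.array([0.0, 0.0])}
z = {7: np.array([0.6, 0.4])}
x_new = position_update(lmbda, y, mu, z, rho)
```

The update needs only locally stored quantities plus the $y_{ji}(t+1)$ received in the communication step, consistent with the distributed nature of the algorithm.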
Communication load. Algorithm 7 contains two communication steps: step 7 and step 9. At step 7 each node $i$ sends $|V_i|$ vectors in $\mathbb{R}^p$, each to one neighboring sensor, and at step 9 a vector in $\mathbb{R}^p$ is broadcast to all nodes in $V_i$. This results in $2TL|V_i|$ communications of $\mathbb{R}^p$ vectors for node $i$ over the whole algorithm. For comparison, SGO in [18] has node $i$ send $T|V_i|$ vectors in $\mathbb{R}^p$ over $T$ iterations. The increase in communications is the price to pay for the parallel nature of the ADMM-based algorithm.
3.3.4 Experimental setup
Unless otherwise specified, the generated geometric networks comprise 4 anchors and 50 sensors, with an average node degree, i.e., $\frac{1}{|V|}\sum_{i \in V}|V_i|$, of about 6. In all experiments the sensors are distributed uniformly at random on a $1 \times 1$ square, and the anchors are placed, unless otherwise stated, at the four corners of the unit square (following [18]), namely at (0, 0), (0, 1), (1, 0) and (1, 1). These properties require a communication range of about R = 0.24. Since localizability is an issue when assessing the accuracy of sensor network localization algorithms, the networks used are first checked to be generically globally rigid, so that a small disturbance in the measurements does not create placement ambiguities. To detect generic global rigidity, we used the methodologies in [47, Section 2]. The results for the proposed algorithm consider L = 40 MM iterations, unless otherwise stated.

3.3.4.A ADMM and SGO: RMSE vs. initialization noise
Two sets of experiments were conducted to compare the RMSE performance of SGO in [18] and the proposed Algorithm 5, termed DCOOL-NET, as a function of the initialization quality (i.e., $\sigma_{\text{init}}$ in (3.22)). In the first set, range measurements are noiseless (i.e., $\sigma = 0$ in (3.64)), whereas in the second set we consider noisy range measurements ($\sigma > 0$).
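These comparisons run on the random geometric networks described in Section 3.3.4. Their generation can be sketched as below; the rigidity check via [47] is omitted, so this is only the sampling step, with the parameter values taken from the text.

```python
import numpy as np

def random_geometric_network(n=50, radius=0.24, seed=0):
    # Sample sensor positions uniformly in the unit square and connect pairs
    # within communication range; rigidity checking (as in [47]) is omitted.
    rng = np.random.default_rng(seed)
    positions = rng.uniform(0.0, 1.0, size=(n, 2))
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if np.linalg.norm(positions[i] - positions[j]) <= radius]
    anchors = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    return positions, edges, anchors

positions, edges, anchors = random_geometric_network()
avg_degree = 2 * len(edges) / len(positions)  # each edge contributes to two nodes
```

With $R = 0.24$ the average degree lands near the value of 6 reported in the setup, up to boundary effects.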
Figure 3.7: RMSE vs. $\sigma_{\text{init}}$, the intensity of initialization noise in (3.22). The range measurements are noiseless: $\sigma = 0$ in (3.64). Anchors are at the unit square corners. Proposed DCOOL-NET (red, solid) and SGO (blue, dashed) attain comparable accuracy.

Table 3.2: Squared error dispersion over Monte Carlo trials for Figure 3.7.

  σ_init   DCOOL-NET   SGO
  0.01     0.0002      0.0007
  0.10     0.0638      0.1290
  0.30     0.2380      0.3400

Noiseless range measurements. In this setup 300 Monte Carlo trials were run. As the measurements are accurate ($\sigma = 0$ in (3.64)), one would expect not only insignificant RMSE values, but also considerable agreement between all the Monte Carlo trials on the solution for sufficiently close initializations. Figure 3.7 confirms that both DCOOL-NET and SGO achieve small position errors, and their accuracies are comparable. As stated before, SGO also has low computational complexity, in fact lower than DCOOL-NET (although DCOOL-NET is fully parallel across nodes, whereas SGO operates by activating the nodes sequentially, implying some high-level coordination). Table 3.2 shows the squared error dispersion over all Monte Carlo trials, i.e., the standard deviation of the data $\{SE_m : m = 1, \dots, M\}$, $SE_m = \|\hat{x}_m - x^\star\|^2$, for both algorithms. We see that DCOOL-NET exhibits a more stable performance, in the sense that it has a lower squared error dispersion.

Noisy range measurements. We set $\sigma = 0.12$ in the noise model (3.64). Figure 3.8 shows that
Figure 3.8: RMSE vs. σinit , the intensity of initialization noise in (3.22). The range measurements are noisy: σ = 0.12 in (3.64). Anchors are at the unit square corners. Proposed DCOOL-NET (red, solid) outperforms SGO (blue, dashed) in accuracy.
Table 3.3: Squared error dispersion over Monte Carlo trials for Figure 3.8.

  σ_init   DCOOL-NET   SGO
  0.00     0.0118      0.0783
  0.01     0.0121      0.0775
  0.10     0.0727      0.1610
  0.30     0.2490      0.3320
DCOOL-NET fares better than SGO: the gap between the two algorithms' performances is now quite significant. The squared error dispersion over all Monte Carlo trials for both algorithms is given in Table 3.3. As before, we see that DCOOL-NET is more reliable, in the sense that it exhibits a lower variance of estimates across Monte Carlo experiments. We also considered placing the anchors randomly within the unit square, instead of at the corners. This is a more realistic and challenging setup, where the sensors are no longer necessarily located inside the convex hull of the anchors. The corresponding results are shown in Figure 3.9 and Table 3.4, for 250 Monte Carlo trials. Again, DCOOL-NET achieves better accuracy. Comparing the dispersions in Tables 3.3 and 3.4 also reveals that the reliability gap between SGO and our algorithm is now wider.
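The dispersion reported in Tables 3.2 and 3.3 is simply the standard deviation of the per-trial squared errors $SE_m$. A sketch on synthetic estimates (the data here are made up to exercise the metric, not results from the text):

```python
import numpy as np

def se_dispersion(estimates, x_true):
    # Standard deviation of SE_m = ||x_hat_m - x_star||^2 over M trials;
    # estimates has shape (M, n, p), x_true has shape (n, p)
    se = np.array([np.linalg.norm(xh - x_true) ** 2 for xh in estimates])
    return se.std()

# Two synthetic estimators with different trial-to-trial spread
rng = np.random.default_rng(1)
x_true = rng.uniform(0.0, 1.0, size=(5, 2))
stable = x_true + 0.01 * rng.standard_normal((300, 5, 2))
erratic = x_true + 0.10 * rng.standard_normal((300, 5, 2))
assert se_dispersion(stable, x_true) < se_dispersion(erratic, x_true)
```

A lower dispersion means the estimator lands near the same solution in every trial, which is the sense in which DCOOL-NET is called more stable.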
Figure 3.9: RMSE vs. $\sigma_{\text{init}}$, the intensity of initialization noise in (3.22). The range measurements are noisy: $\sigma = 0.12$ in (3.64). Anchors were randomly placed in the unit square. Proposed DCOOL-NET (red, solid) outperforms SGO (blue, dashed) in accuracy.

Table 3.4: Squared error dispersion over Monte Carlo trials for Figure 3.9.

  σ_init   DCOOL-NET   SGO
  0.00     0.0097      0.0712
  0.01     0.0099      0.0709
  0.10     0.1550      0.3160
  0.30     0.4350      0.8440
  0.50     0.8330      1.3000

3.3.4.B ADMM and SGO: RMSE vs. measurement noise
To evaluate the sensitivity of both algorithms to the intensity of noise in the range measurements (i.e., $\sigma$ in (3.64)), 300 Monte Carlo trials were run for $\sigma$ = 0.01, 0.1, 0.12, 0.15, 0.17, 0.2, 0.3. Both algorithms were initialized at the true sensor positions, i.e., $\sigma_{\text{init}} = 0$ in (3.22), and ADMM performs L = 100 iterations³. Figure 3.10 and Table 3.5 summarize the computer simulations for this setup. As before, ADMM consistently achieves better accuracy and stability.

3.3.4.C ADMM and SGO: RMSE vs. communication cost
We assessed how the RMSE varies with the communication load incurred by both algorithms. We considered the general setup described in Section 3.3.4. The results are displayed in Figure 3.11.

³ This is to guarantee that, in practice, ADMM indeed attains a fixed point; the results barely changed for L = 40.
Figure 3.10: RMSE vs. $\sigma$, the intensity of measurement noise in (3.64). No initialization noise: $\sigma_{\text{init}} = 0$ in (3.22). Anchors are at the unit square corners. Proposed ADMM (red, solid) outperforms SGO (blue, dashed) in accuracy.

Table 3.5: Squared error dispersion over Monte Carlo trials for Figure 3.10.

  σ      DCOOL-NET   SGO
  0.01   0.0002      0.0016
  0.10   0.0177      0.0688
  0.12   0.0218      0.0702
  0.15   0.0326      0.0921
  0.17   0.0394      0.0993
  0.20   0.0525      0.1090
  0.30   0.1020      0.1630
We see an interesting tradeoff: SGO converges much more quickly than ADMM (in terms of communication rounds), and attains a lower RMSE sooner. However, ADMM can improve its accuracy through more communications, whereas SGO remains trapped in a suboptimal solution.

3.3.4.D ADMM: RMSE vs. parameter ρ
The parameter $\rho$ appears in the augmented Lagrangian discussed in Section 3.3.3 and is user-selected. As such, it is important to study the sensitivity of ADMM to this parameter choice. For this purpose, we tested several values of $\rho$ between 1 and 200. For each choice, 300 Monte Carlo trials were performed using noisy measurements and initializations. Figure 3.12 portrays RMSE against $\rho$ for L = 40 iterations of ADMM. There is no wide variation, especially for values of
Figure 3.11: RMSE versus total number of two-dimensional vectors communicated in the network. The range measurements are noiseless: σ = 0 in (3.64). Initialization is noisy: σinit = 0.1 in (3.22). Anchors are at the unit square corners. Proposed ADMM (red, solid) outperforms SGO (blue, dashed) in accuracy, at the expense of more communications.
$\rho$ over 30, which offers some confidence in the algorithm's resilience to this parameter, a pivotal feature from a practical standpoint. An analytical approach for selecting the optimal $\rho$ is, however, beyond the scope of this work and is postponed to future research. Note that adaptive schemes to adjust $\rho$ do exist for centralized settings, e.g., [40], but seem impractical for distributed setups as they require global computations.
3.3.5 Proof of majorization function properties
We now prove Proposition 9. We write $\Phi_d(u)$ instead of $\Phi_d(u\,|\,v)$, and we let $\langle x, y\rangle = x^\top y$.

Convexity. Note that $g_d$ is convex as the composition of the convex, non-decreasing function $(\cdot)_+^2$ with the convex function $\|\cdot\| - d$. Also, $h_d(\langle v/\|v\|, \cdot\rangle - d)$ is convex as the composition of the convex Huber function $h_d(\cdot)$ with the affine map $\langle v/\|v\|, \cdot\rangle - d$. Finally, $\Phi_d$ is convex as the pointwise maximum of two convex functions.
Figure 3.12: RMSE vs. $\rho$. The range measurements are noisy: $\sigma = 0.05$ in (3.64). Initialization is noisy: $\sigma_{\text{init}} = 0.1$ in (3.22). Anchors are at the unit square corners.

Tightness. It is straightforward to check that $\phi_d(v) = \Phi_d(v)$ by examining separately the three cases $\|v\| < d$, $d \le \|v\| < 2d$ and $\|v\| \ge 2d$.

Majorization. We must show that $\Phi_d(u) \ge \phi_d(u)$ for all $u$. First, consider $\|u\| \ge d$. Then $g_d(u) = \phi_d(u)$, and it follows that $\Phi_d(u) = \max\{g_d(u),\, h_d(\langle v/\|v\|, u\rangle - d)\} \ge \phi_d(u)$. Now consider $\|u\| < d$ and write $u = R\hat{u}$, where $R = \|u\| < d$ and $\|\hat{u}\| = 1$. It is straightforward to check that, in terms of $R$ and $\hat{u}$, we have $\phi_d(u) = (R - d)^2$ and $\Phi_d(u) = h_d(R\langle\hat{v}, \hat{u}\rangle - d)$, where $\hat{v} = v/\|v\|$. Thus, we must show that $h_d(R\langle\hat{v}, \hat{u}\rangle - d) \ge (R - d)^2$. Motivated by the definition of the Huber function $h_d$ in two branches, we divide the analysis into two cases.

Case 1: $|R\langle\hat{v}, \hat{u}\rangle - d| \le d$. In this case, $h_d(R\langle\hat{v}, \hat{u}\rangle - d) = (R\langle\hat{v}, \hat{u}\rangle - d)^2$. Noting that $|\langle\hat{v}, \hat{u}\rangle| \le 1$, there holds $(R\langle\hat{v}, \hat{u}\rangle - d)^2 \ge \inf\{(Rz - d)^2 : |z| \le 1\} = (R - d)^2$, where the fact that $R < d$ was used to compute the infimum over $z$ (attained at $z = 1$).

Case 2: $|R\langle\hat{v}, \hat{u}\rangle - d| > d$. In this case, $h_d(R\langle\hat{v}, \hat{u}\rangle - d) = 2d|R\langle\hat{v}, \hat{u}\rangle - d| - d^2$. Thus, $h_d(R\langle\hat{v}, \hat{u}\rangle - d) \ge d^2 \ge (d - R)^2$, where the last inequality follows from $0 \le R < d$.
3.3.7 Proof of (3.31)
We show how to rewrite (3.30) as (3.31). First, note that $F(y, z)$ in (3.24) can be rewritten as
$$F(y, z) = \sum_i \sum_{j \in V_i} F_{ij}(y_{ii}, y_{ij}) + 2\sum_i \sum_{k \in A_i} F_{ik}(z_{ik}). \qquad (3.48)$$
Here we used the fact that $F_{ij}(y_{ji}, y_{jj}) = F_{ji}(y_{jj}, y_{ji})$, which follows from $d_{ij} = d_{ji}$ and $\Phi_d(u\,|\,v) = \Phi_d(-u\,|\,-v)$; see (3.15). In addition, there holds
$$\sum_i \sum_{j \in V_i} \lambda_{ji}^\top(y_{ji} - x_i) + \frac{\rho}{2}\|y_{ji} - x_i\|^2 = \sum_j \sum_{i \in V_j} \lambda_{ij}^\top(y_{ij} - x_j) + \frac{\rho}{2}\|y_{ij} - x_j\|^2 = \sum_i \sum_{j \in V_i} \lambda_{ij}^\top(y_{ij} - x_j) + \frac{\rho}{2}\|y_{ij} - x_j\|^2. \qquad (3.49)$$
The first equality follows from interchanging $i$ with $j$. The second follows from noting that $i \in V_j$ if and only if $j \in V_i$. Using (3.48) and (3.49) in (3.30) gives (3.31).
3.3.8 Summary
We presented a convex majorizer crafted to be a tight fit to the sensor network localization problem (1.1). We developed a distributed, fully parallel algorithm to optimize the convex majorizer, based on the ADMM. This choice allowed for the distribution of the problem, but at the expense of an impractical communication load. This behavior can be explained by the increase in the number of variables when adding edge variables to the equivalent problems, but mainly by the fact that the node subproblems do not have closed-form exact solutions, so ADMM has to compensate for the deviations of the partial iterative solutions with more communication rounds. We are currently establishing the proof of tightness in $\mathbb{R}^n$ for $n > 1$, as stated in Conjecture 10, and investigating a proximal method to efficiently minimize each majorizer in a distributed fashion, also allowing for gossip-like asynchronous solutions.
3.4 Sensor network localization: a graphical model approach
This section focuses on the sensor network localization problem when one has access to the mean and variance of normally distributed priors on the sensor positions. In this setting we do not need landmarks or anchors to resolve rotation, translation, or flip ambiguities; this solution is therefore appropriate when anchors are hard to determine but, at deployment time, we have some notion of where the sensors were dropped and how they spread. The problem is cast in the formalism of probabilistic graphical models, and the optimization problem to obtain the MAP (maximum a posteriori) estimate of the sensor positions is stated. The proposed goals concentrate on suboptimal approximation methods for the derived combinatorial problem. In general, the deployment of the sensors in the terrain is not done accurately, but sometimes it is possible to delimit regions with some probability of containing each sensor. Many sensor networks can also acquire noisy distance measurements between neighboring nodes, thus obtaining data to estimate their true positions. Under such conditions, each node's position can
be seen as a random variable whose distribution depends on the distribution of the noisy measurements, the prior on its own position, and the distributions of the neighboring nodes' positions. Here, the probabilistic graphical models framework may capture this complex set of dependencies between random variables and enable the use of general purpose algorithms for performing inference. The graphical model for the sensor network coincides with the measurement model.
3.4.1 Uncertainty models
In order to establish the graphical model formalism for our problem, we restate several objects already defined, now framed in the probabilistic setting. Range measurements are contaminated by zero-mean independent Gaussian noise, so the distance measurement between node $t$ and node $u$ can be expressed as
$$d_{tu} = \|x_t^\star - x_u^\star\| + \nu_{tu}, \qquad \nu_{tu} \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma), \qquad (3.50)$$
where $x_t^\star$ is the true position of node $t$. A set of measurements corresponding to a subset of edges $I \subset E$ is denoted by $d_I$ and, in the same way, a set of positions corresponding to a subset of nodes $V' \subseteq \mathcal{V}$ is denoted by $x_{V'}$. The probability distribution of $\nu_{tu}$ is denoted by $p_\nu(\nu_{tu})$. The noisy range measurement acquired between sensor $t$ and anchor $k \in A_t$ is modelled as
$$r_{tk} = \|x_t^\star - a_k\| + \nu_{tk}, \qquad (3.51)$$
where $a_k$ is the anchor position and $\nu_{tk}$ is a random variable with probability distribution $p_\nu(\nu_{tk})$, the same as in (3.50). It is assumed that $d_{tu} = d_{ut}$ and that the position variables $x$ are independent of the random variables $\nu_E = \{\nu_{tu} : t \sim u \in E\}$ and $\nu_V = \{\nu_{tk} : t \in V, k \in A_t\}$. Additionally, it is presumed that each sensor position $x_t$ has a prior distribution $p_t(x_t) = \mathcal{N}(x_t; \mu_t, R_t)$, and each $x_t$ is independent of $x_{V - \{t\}}$. The joint distribution is thus
$$p(x, d_E) = p(d_E\,|\,x)\,p(x) = \prod_{t \sim u} p(d_{tu}\,|\,x_t, x_u)\,\prod_t p_t(x_t)\,\prod_t \prod_{k \in A_t} p(r_{tk}\,|\,x_t), \qquad (3.52)$$
and the a posteriori distribution is proportional to the joint distribution in (3.52), i.e.,
$$p(x\,|\,d_E) \propto \prod_{t \sim u} p_\nu\big(\|x_t - x_u\| - d_{tu}\big)\,\prod_t p_t(x_t)\,\prod_t \prod_{k \in A_t} p_\nu\big(\|x_t - a_k\| - r_{tk}\big), \qquad (3.53)$$
where we explicitly wrote the conditional probabilities in terms of $p_\nu$.
3.4.2 Optimization problem
As defined in the previous section, all probability distributions are Gaussian. To find the maximum a posteriori (MAP) estimate of the sensor positions, an optimization problem is cast by taking the negative logarithm of Eq. (3.53), thus obtaining
$$\underset{x}{\text{minimize}} \;\; \sum_{t \sim u} \theta_{tu}(x_t, x_u) + \sum_t \theta_t(x_t), \qquad (3.54)$$
where the pairwise potentials are
$$\theta_{tu}(x_t, x_u) = \frac{1}{\sigma^2}\big(\|x_t - x_u\| - d_{tu}\big)^2,$$
and the single-node potentials are
$$\theta_t(x_t) = \frac{1}{\sigma^2}\sum_{k \in A_t}\big(\|x_t - a_k\| - r_{tk}\big)^2 + (x_t - \mu_t)^\top R_t^{-1}(x_t - \mu_t).$$
Problem (3.54) is known to be NP-hard for generic graphs, as stated earlier.
3.4.3 Combinatorial problem
We discretize the 95% confidence region of each prior distribution, collecting an alphabet of candidate node positions $X_t = \{\alpha_t^{(1)}, \dots, \alpha_t^{(n_t)}\}$ for each sensor node $t$, where $n_t$ is the cardinality of the alphabet, and we formulate a combinatorial problem over the collection of such alphabets as
$$\underset{\{x_t \in X_t\}}{\text{minimize}} \;\; \sum_{t \sim u} \theta_{tu}(x_t, x_u) + \sum_t \theta_t(x_t). \qquad (3.55)$$
To rewrite the problem over binary variables, we translate this functional form into matrix form; we define the matrix $\Theta_{tu}$ as the evaluation of the pairwise potential $\theta_{tu}(x_t, x_u)$ on all points of the intervening nodes' alphabets,
$$\Theta_{tu} := \begin{bmatrix} \theta_{tu}(\alpha_t^{(1)}, \alpha_u^{(1)}) & \theta_{tu}(\alpha_t^{(1)}, \alpha_u^{(2)}) & \cdots & \theta_{tu}(\alpha_t^{(1)}, \alpha_u^{(n_u)}) \\ \theta_{tu}(\alpha_t^{(2)}, \alpha_u^{(1)}) & & & \vdots \\ \vdots & & \ddots & \vdots \\ \theta_{tu}(\alpha_t^{(n_t)}, \alpha_u^{(1)}) & & \cdots & \theta_{tu}(\alpha_t^{(n_t)}, \alpha_u^{(n_u)}) \end{bmatrix}, \qquad (3.56)$$
and the vector $\theta_t$ of all evaluations of the node potential function $\theta_t(x_t)$ over the alphabet $X_t$,
$$\theta_t := \begin{bmatrix} \theta_t(\alpha_t^{(1)}) \\ \theta_t(\alpha_t^{(2)}) \\ \vdots \\ \theta_t(\alpha_t^{(n_t)}) \end{bmatrix}.$$
We also specify the set $\Delta_t := \big\{e_t : e_t \in \{0,1\}^{n_t},\; e_t^\top \mathbf{1} = 1\big\}$; it is now possible to rewrite the problem over the binary variables $e_t$ as
$$\underset{\{e_t \in \Delta_t\}}{\text{minimize}} \;\; \sum_{t \sim u} e_t^\top \Theta_{tu}\, e_u + \sum_t e_t^\top \theta_t. \qquad (3.57)$$
The formulation in Problem (3.57) is well known in the probabilistic graphical models literature.
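The equivalence between selecting alphabet elements in (3.55) and the binary form (3.57) is easy to check on a toy instance; the alphabets and parameter values below are made up for illustration.

```python
import numpy as np

# Toy alphabets for two neighboring nodes t and u (positions are arbitrary)
d, sigma = 0.5, 0.1
alpha_t = [np.array([0.0, 0.0]), np.array([0.2, 0.1])]
alpha_u = [np.array([0.5, 0.0]), np.array([0.4, 0.4])]

theta = lambda xt, xu: (np.linalg.norm(xt - xu) - d) ** 2 / sigma ** 2
Theta_tu = np.array([[theta(a, b) for b in alpha_u] for a in alpha_t])  # (3.56)

e_t = np.array([0, 1])  # indicator selecting alpha_t[1]
e_u = np.array([1, 0])  # indicator selecting alpha_u[0]
assert np.isclose(e_t @ Theta_tu @ e_u, theta(alpha_t[1], alpha_u[0]))
```

The bilinear form $e_t^\top \Theta_{tu} e_u$ simply picks one entry of the potential table, which is why (3.57) reproduces (3.55) exactly on the vertices of the simplices $\Delta_t$.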
3.4.4 Related work

3.4.4.A Linear relaxation and tree-reweighted message passing algorithms
The problem of minimizing (3.55) is widely attacked by means of a linear programming (LP) relaxation (see, e.g., Wainwright et al. [48]). This approach minimizes (3.55) over the local marginal polytope, relaxing the integer constraints to non-negativity constraints. The relaxation is tight for tree-structured graphs. Nevertheless, the number of variables is very large and the method does not scale well. To address this issue, other approaches maximize the dual of (3.55); tree-reweighted message passing algorithms solve a dual problem determined by a convex combination of trees.

3.4.4.B Dual decomposition
Komodakis et al. [49] proposed a Lagrangian relaxation of the MAP problem related to the optimization technique of dual decomposition. Rather than minimizing (3.55) directly, the problem is decomposed into a set of subproblems which are easier to solve. As the subproblems emerge from dualization, the sum of their minima is a lower bound on the optimal value of (3.55). One can apply different decompositions, resulting in distinct relaxations; if the minimization of (3.55) is decomposed into a set of trees, this relaxation is equivalent to the LP relaxation. A major issue arising from the dual nature of this algorithm is how to recover a primal solution.
Remarks.
1. Problem (3.54) is very hard to solve. A possible approach is to discretize the 6-sigma ellipsoid defined by the prior distributions, as mentioned in Section 3.4.3, obtaining a combinatorial problem for which it is possible to design linear and semidefinite relaxations (see Wainwright et al. [48]) that approximate the optimal solution.
2. Dealing with graphical models with cycles, like the ones generally arising from geometric networks, is also a difficult task. In fact, there are no convergence guarantees for sum-product updates on such topologies.
3. As observed in Ihler et al. [4], only a coarse discretization of the 2D or 3D space leads to a computationally effective problem. Nevertheless, the obtained result can provide an initialization to be refined by local optimization methods as explored, e.g., in Soares et al. [8].
3.4.5 Contributions
1. Design an effective convex approximation to Problem (3.54), following the approach sketched in Remark 1;
2. Formulate an iterative scheme for monotonically decreasing the cost function by performing inference on judiciously chosen spanning trees of the graph, thus tackling the issue noted in Remark 2;
3. Provide numerical results assessing the value of the approaches taken in this section.
3.4.6 Algorithms

3.4.6.A Linear and semidefinite relaxations
The work of Wainwright and Jordan [48] establishes a linear relaxation of Problem (3.57), which we derive here in a different way. The cited work [48] proves that the relaxation is exact for tree-structured graphs. We begin by defining
$$\Theta := \begin{bmatrix} 0 & \Theta_{12} & \Theta_{13} & \cdots & \Theta_{1n} \\ \Theta_{21} & 0 & \Theta_{23} & \cdots & \Theta_{2n} \\ \vdots & & \ddots & & \vdots \\ \Theta_{n1} & & \cdots & & 0 \end{bmatrix}, \qquad e := \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}, \qquad \theta := \begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix},$$
where $\Theta_{tu}$ is as in (3.56) if the edge $t \sim u$ belongs to the edge set $E$, and the zero matrix otherwise. We reformulate Problem (3.57) as
$$\begin{array}{ll} \text{minimize} & \operatorname{Tr}\left(\begin{bmatrix} 0 & \theta^\top/2 \\ \theta/2 & \Theta \end{bmatrix} E\right) \\[2mm] \text{subject to} & E = \begin{bmatrix} 1 \\ e \end{bmatrix}\begin{bmatrix} 1 \\ e \end{bmatrix}^\top, \quad \{e_t \in \Delta_t\}, \end{array} \qquad (3.58)$$
and rewrite the constraints so as to isolate the nonconvexity in a rank constraint. To do so, we write the variable $E$ as
$$E = \begin{bmatrix} 1 & E_{21}^\top \\ E_{21} & E_{22} \end{bmatrix},$$
where $E_{21} = e$ and $E_{22} = ee^\top$. The equivalent constraints are
$$E = E^\top, \quad E \ge 0, \quad E_{11} = 1, \quad \operatorname{diag}\big((E_{22})_{ii}\big) = (E_{21})_i, \quad (E_{22})_{ij}\mathbf{1} = (E_{21})_i, \quad \mathbf{1}^\top (E_{21})_i = 1, \quad \operatorname{Rank}(E) = 1. \qquad (3.59)$$
To obtain the linear relaxation of (3.58) we drop the rank constraint in (3.59). As stated before, this linear program is exact only for tree-structured graphs, as shown in [48]. For graphs with cycles we propose an SDP reformulation of (3.58), with the constraints in (3.59) except the first, $E = E^\top$, which we replace with
$$E \succeq 0. \qquad (3.60)$$
One could expect this stronger constraint to yield a more accurate result on graphs with cycles, at the expense of the increase in computational cost incurred when passing from the linear to the semidefinite problem.

3.4.6.B Distributed tree-based inference
The second contribution of this work is an iterative scheme to monotonically decrease the cost function. We know that inference on trees is exact; in fact, the methods referred to in Section 3.4.4 work with a dual problem precisely to access this important property. Our approach also performs inference on trees, but retains a primal nature. At each step we choose a spanning tree of the geometric measurement graph and perform inference on it. To choose the edges that go into the spanning tree, for each edge $t \sim u$ we find the vectors $a_t$ and $a_u$ such that a separable matrix majorizes the edge potential matrix $\Theta_{tu}$ as tightly as possible. Here, a separable matrix is one in which each alphabet element of each node contributes a constant part to the entries of $\Theta_{tu}$. Mathematically, for each edge $t \sim u$ we solve the problem
$$\begin{array}{ll} \underset{a_t,\, a_u}{\text{minimize}} & \|a_t \mathbf{1}^\top + \mathbf{1} a_u^\top - \Theta_{tu}\|_F \\ \text{subject to} & a_t \mathbf{1}^\top + \mathbf{1} a_u^\top \ge \Theta_{tu} \\ & e_t^\top a_t + e_u^\top a_u = e_t^\top \Theta_{tu}\, e_u, \end{array} \qquad (3.61)$$
where $\|\cdot\|_F$ denotes the Frobenius norm. The values of $e_t$ and $e_u$ are given as an initialization. The first constraint in (3.61) ensures that the separable approximation to the edge potential matrix lies above it, whereas the second guarantees that it is tight at the initialization point. The optimal value of Problem (3.61) represents the cost of breaking the edge $t \sim u$, and $a_t$ and $a_u$ are the vectors to add to the node potentials of nodes $t$ and $u$, respectively, in case the edge $t \sim u$ is not present in the spanning tree. Mathematically, we construct a sequence of problems, solvable in polynomial time, that majorize (3.57):
$$\underset{\{e_t \in \Delta_t\}}{\text{minimize}} \;\; \hat{f}(e) = \sum_{t \sim u \in T} e_t^\top \Theta_{tu}\, e_u + \sum_t e_t^\top \hat{\theta}_t, \qquad (3.62)$$
where
$$\hat{\theta}_t = \theta_t + \sum_{\substack{t \sim v \in E \\ t \sim v \notin T}} (a_t)_{t \sim v}. \qquad (3.63)$$
We build the spanning tree with the edges that are most expensive to break, thus retaining those which are less separable, and perform exact inference on the resulting tree. The method thus builds the maximum spanning tree $T$ of the measurement graph $G$ by breaking the cheapest edges. Many algorithms can compute minimum spanning trees, even in a distributed way (see, for example, the work of Gallager et al. [50]). The maximum spanning tree is obtained by invoking the chosen minimum spanning tree algorithm with the edge weights $w'_{t \sim u} =$
$\max\{w_{i \sim j} : i \sim j \in E\} - w_{t \sim u} + 1$.

Algorithm 8 Distributed monotonic spanning tree-based algorithm
Input: Initialization e
Output: Estimate ê
1: while some stopping criterion is not met do
2:   for t ∼ u ∈ E do
3:     Solve Problem (3.61) for edge t ∼ u
4:     w_{t∼u} = optimal value of Problem (3.61)
5:     (a_t, a_u)_{t∼u} = (a_t, a_u) optimal points of Problem (3.61)
6:   end for
7:   Compute the maximum spanning tree T of G, feeding the edge weights w'_{t∼u} = max{w_{i∼j} : i ∼ j ∈ E} − w_{t∼u} + 1 to a (distributed) minimum spanning tree algorithm
8:   Increment the node potentials θ̂_t of broken edges as in (3.63)
9:   Perform exact inference on T to solve Problem (3.62), obtaining a new estimate e
10: end while
11: return ê = e

We note that using an upper bound on $\max\{w_{i \sim j} : i \sim j \in E\}$ also works and spares the distributed computation of the maximum. Algorithm 8 monotonically decreases the cost in (3.57) at each iteration. These properties are inherited from the Majorization-Minimization framework (see Hunter et al. [37] for an in-depth treatment). We stress that Algorithm 8 does not prescribe any particular method to perform exact inference on each tree $T$; this flexibility is also found in Komodakis et al. [49]. To obtain a distributed algorithm, we can use a message passing max-sum method, which is distributed across nodes. As with all Majorization-Minimization algorithms, Algorithm 8 requires an initialization. Our strategy is to run an initial iteration of Algorithm 8 where, instead of solving Problem (3.61), we solve the least squares problem
$$\underset{a_t,\, a_u}{\text{minimize}} \;\; \|a_t \mathbf{1}^\top + \mathbf{1} a_u^\top - \Theta_{tu}\|_F^2$$
for each edge $t \sim u$. As in (3.61), we seek the best fit between a separable matrix and the edge potential matrix $\Theta_{tu}$; but since the separable matrix is used only as an initializing step, it neither majorizes the edge potential matrix nor obeys the tightness requirement.

3.4.6.C Distributed nature of the algorithm
Algorithm 8 is distributed because, at each edge, the intervening nodes can agree on which one performs Step 3; the computing node, say $t$, will then communicate with its neighbor $u$ to pass $w_{t \sim u}$ and $a_u$, thus enabling the distributed computation of the maximum spanning tree $T$. Inference in Step 9 is also distributed, as mentioned earlier.
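The unconstrained least-squares initialization of Section 3.4.6.B, $\min \|a_t \mathbf{1}^\top + \mathbf{1} a_u^\top - \Theta_{tu}\|_F^2$, is a small linear problem; a sketch using a generic least-squares routine (the constrained problem (3.61) would additionally need a conic or QP solver, omitted here; the function name is ours):

```python
import numpy as np

def separable_ls_fit(Theta):
    # Unconstrained fit min ||a_t 1^T + 1 a_u^T - Theta||_F^2, stacked as a
    # linear system; unique only up to the shift (a_t + c, a_u - c), which
    # lstsq resolves by returning a minimum-norm solution.
    nt, nu = Theta.shape
    A = np.zeros((nt * nu, nt + nu))
    for i in range(nt):
        for j in range(nu):
            A[i * nu + j, i] = 1.0       # coefficient of a_t[i]
            A[i * nu + j, nt + j] = 1.0  # coefficient of a_u[j]
    sol = np.linalg.lstsq(A, Theta.ravel(), rcond=None)[0]
    return sol[:nt], sol[nt:]

# A perfectly separable matrix is recovered with zero residual
at_true, au_true = np.array([1.0, 2.0]), np.array([0.5, -0.5, 1.5])
Theta = np.outer(at_true, np.ones(3)) + np.outer(np.ones(2), au_true)
at, au = separable_ls_fit(Theta)
assert np.allclose(np.outer(at, np.ones(3)) + np.outer(np.ones(2), au), Theta)
```

The residual of this fit plays the role of the edge weight: perfectly separable edges are cheap to break, so they are the first to leave the spanning tree.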
3.4.7 Experimental results
In this section we present numerical experiments assessing the quality of the proposed algorithms in the context of the localization problem.
Methods. We conducted simulations with several uniquely localizable geometric networks with 5 sensors randomly distributed in a two-dimensional square of size $1 \times 1$. The discretization was random, with 13 elements in each alphabet. The noisy range measurements are generated according to
$$d_{ij} = \big|\,\|x_i^\star - x_j^\star\| + \nu_{ij}\,\big|, \qquad r_{ik} = \big|\,\|x_i^\star - a_k\| + \nu_{ik}\,\big|, \qquad (3.64)$$
where $x_i^\star$ is the true position of node $i$, and $\{\nu_{ij} : i \sim j \in E\} \cup \{\nu_{ik} : i \in V, k \in A_i\}$ are independent Gaussian random variables with zero mean and standard deviation $\sigma$. The accuracy of the algorithms is measured by the original nonconvex cost value in (3.55) and by the mean positioning error, defined as
$$\text{MPE} = \frac{1}{M}\sum_{m=1}^{M}\sum_{i=1}^{n}\|\hat{x}_i(m) - x_i^\star\|, \qquad (3.65)$$
where $M$ is the total number of Monte Carlo trials, $\hat{x}_i(m)$ is the estimate generated by an algorithm in Monte Carlo trial $m$, and $x_i^\star$ is the true position of node $i$.

3.4.7.A Linear and semidefinite relaxations
The first experiment aimed at comparing the performance of the linear and semidefinite relaxations. In Figure 3.13 we can observe the average nonconvex cost over the Monte Carlo trials,
Figure 3.13: Average cost over Monte Carlo trials.
stabilizing at 1.745 for the LP and 1.478 for the SDP relaxation. Thus, we obtain an improvement of 15% by using the proposed tighter relaxation. The mean positioning error depicted in
Figure 3.14: Mean positioning error per sensor over Monte Carlo trials.
Figure 3.14 also shows that, as expected, a tighter relaxation can perform better in terms of accuracy. The rank of the solution matrix in the experiments also shows the superiority of our
relaxation in accuracy: for all trials, the SDP solution has rank 1, proving that the solution of the SDP problem in (3.58), with the restriction (3.60), is also the solution to the nonconvex, combinatorial problem (3.57). The LP relaxation, on the other hand, is very loose in most of the trials, as seen in Figure 3.15. The price to pay for the accuracy gains is a noticeable increase in execution time, from less than a second for the LP to several minutes for the SDP.

Figure 3.15: Rank of the solution matrix E in the tested Monte Carlo trials.
Algorithm 9 Coordinate descent algorithm
Input: Initialization x
Output: Estimate x̂
1: while some stopping criterion is not met do
2:   for t ∈ V do
3:     Compute the cost C for all elements of the alphabet X_t, with x_{V−{t}} held fixed.
4:     x_t = argmin_y C(x_1, · · · , x_{t−1}, y, x_{t+1}, · · · , x_n)
5:   end for
6: end while
7: return x̂ = x

Table 3.6: Cost values per sensor

LP       MM+LS    MM+LP    CD+LP
0.3617   0.0358   0.0788   0.0719

3.4.7.B Distributed majorization-minimization
In a second experiment, we compared the performance of our Algorithm 8 with a vanilla coordinate descent procedure, described in Algorithm 9. We ran our MM algorithm initialized as explained at the end of Section 3.4.6.B (MM+LS), and both our MM algorithm and the coordinate descent method initialized with an LP estimate (MM+LP and CD+LP). Measurements were contaminated with white Gaussian noise with standard deviation σ = 0.01. The resulting cost values per sensor are shown in Table 3.6. Here we can see that all refinement strategies improve on the initialization score, but our combination of Majorization-Minimization with least-squares initialization (MM+LS) decreases the cost by one order of magnitude.
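For concreteness, the sweep of Algorithm 9 can be sketched in a few lines of Python; the toy cost and alphabets below are our own stand-ins for C and the alphabets X_t, not the localization cost:

```python
# Minimal sketch of Algorithm 9: coordinate descent over finite alphabets.
# cost() and the alphabets are illustrative stand-ins, not the thesis cost.

def coordinate_descent(x, alphabets, cost, sweeps=10):
    """x: list of current symbols; alphabets[t]: candidate values for node t."""
    for _ in range(sweeps):                    # stopping criterion: fixed sweeps
        changed = False
        for t in range(len(x)):                # for t in V
            # evaluate the cost for every symbol in X_t, others held fixed
            best = min(alphabets[t], key=lambda y: cost(x[:t] + [y] + x[t + 1:]))
            if best != x[t]:
                x[t], changed = best, True
        if not changed:                        # no coordinate moved: stop early
            break
    return x

# Toy quadratic cost over a 3-node chain with discrete positions.
cost = lambda x: (x[0] - x[1]) ** 2 + (x[1] - x[2]) ** 2 + (x[0] - 1) ** 2
print(coordinate_descent([2, 0, 2], [[0, 1, 2]] * 3, cost))  # → [1, 1, 1]
```

Each coordinate update is exact over its alphabet, so the cost is monotonically nonincreasing across sweeps, which is the property the comparison in Table 3.6 exercises.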
3.4.8
Summary
We addressed the sensor network localization problem with a graphical model approach, proposing an SDP relaxation which outperforms the standard LP approximation. We also proposed a distributed iterative algorithm to estimate the solution of the problem by means of the Majorization-Minimization framework, judiciously choosing spanning trees where inference can be performed exactly. This method outperformed the coordinate descent method initialized with the LP estimate.
4 Robust algorithms for sensor network localization
Contents
4.1 Related work and contributions
4.2 Discrepancy measure
4.3 Convex underestimator
4.3.1 Approximation quality of the convex underestimator
4.4 Numerical experiments
4.5 Summary
In practice, network applications have to deal with failing nodes, malicious attacks, or nodes that somehow face highly corrupted data — generally classified as outliers. This calls for robust, uncomplicated, and efficient methods. We propose a dissimilarity model for network localization which is robust to high-power noise, but also discriminative in the presence of regular Gaussian noise. We capitalize on the known properties of the Huber M-estimator penalty function to obtain a robust, but nonconvex, problem, and devise a convex underestimator, tight in each function term, that can be minimized in polynomial time. Simulations show the performance advantage of using this dissimilarity model both in the presence of outliers and under regular Gaussian noise: our proposal consistently achieves about half the positioning error of the L1-norm alternative.
4.1
Related work and contributions
Some approaches to robust localization rely on identifying outliers from regular data; outliers are then removed from the estimation of sensor positions. The work in [4] formulates the network localization problem as an inference problem in a graphical model. To approximate an outlier process, the authors add a high-variance Gaussian to the Gaussian mixtures and employ nonparametric belief propagation to approximate the solution. In the same vein, [51] employs the EM algorithm to jointly estimate outliers and sensor positions. Recently, the work [52] tackled robust localization with estimation of positions, mixture parameters, and an outlier noise model for unknown propagation conditions. Alternatively, methods may perform a soft rejection of outliers, still allowing them to contribute to the solution. In [6] a maximum-likelihood estimator for Laplacian noise was derived and subsequently relaxed to a convex program by linearization and dropping a rank constraint. The authors in [39] present a robust multidimensional scaling based on the least-trimmed-squares criterion, minimizing the squares of the smallest residuals. In [5] the authors use the Huber loss [53] composed with a discrepancy between measurements and estimated distances, in order to achieve robustness to outliers. The resulting cost is nonconvex, and it is optimized by means of the Majorization-Minimization technique. The cost function we present incorporates outliers into the estimation process and does not assume any outlier model. We capitalize on the robust estimation properties of the Huber function but, unlike [5], we do not address the nonconvex cost directly. Instead, we produce a convex relaxation which numerically outperforms other natural formulations of the problem. We present a tight convex underestimator to each term of the robust discrepancy measure for sensor network localization. Further, we analyze its tightness and compare it with other discrepancy measures and appropriate relaxations. Our approach assumes no specific outlier model, and all measurements contribute to the estimate. Numerical simulations illustrate the quality of the convex underestimator.
4.2
Discrepancy measure
The maximum-likelihood estimator for the sensor positions, with additive i.i.d. Gaussian noise contaminating the range measurements, is the solution of the optimization problem

minimize_x f_G(x),

where

f_G(x) = \sum_{i \sim j} \frac{1}{2} \big( \|x_i - x_j\| - d_{ij} \big)^2 + \sum_{i} \sum_{k \in A_i} \frac{1}{2} \big( \|x_i - a_k\| - r_{ik} \big)^2

is the cost in (1.1). However, outlier measurements will heavily bias the solutions of the optimization problem, since their magnitude is amplified by the quadratic h_Q(t) = t^2 in each outlier term. From robust estimation we know some alternatives that perform soft rejection of outliers, namely the L1 loss h_{|\cdot|}(t) = |t| or the Huber loss

h_R(t) = \begin{cases} t^2 & \text{if } |t| \le R, \\ 2R|t| - R^2 & \text{if } |t| \ge R. \end{cases} \qquad (4.1)
The Huber loss joins the best of two worlds: it is robust for large values of the argument, like the L1 loss, and for reasonable noise levels it behaves quadratically, thus leading to the maximum-likelihood estimator adapted to regular Gaussian noise. Figure 4.1 depicts a one-dimensional example of these different costs. We can observe in this simple example the main properties of the different cost functions, in terms of adaptation to low/medium-power Gaussian noise and to high-power outlier spikes. Using (4.1), we can write our modified robust localization problem as

minimize_x f_R(x), \qquad (4.2)

where

f_R(x) = \sum_{i \sim j} \frac{1}{2} h_{R_{ij}}\big( \|x_i - x_j\| - d_{ij} \big) + \sum_{i} \sum_{k \in A_i} \frac{1}{2} h_{R_{ik}}\big( \|x_i - a_k\| - r_{ik} \big). \qquad (4.3)

This function is nonconvex and, in general, difficult to minimize. We shall provide a convex underestimator that tightly bounds each term of (4.3), thus leading to better estimation results than other relaxations which are not tight [3].
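As a quick illustration (our own sketch, not part of the thesis algorithms), the Huber loss (4.1) and its two regimes can be coded directly; note the continuity at |t| = R, where both branches equal R²:

```python
def huber(t, R):
    """Huber loss h_R(t) of (4.1): quadratic near zero, linear in the tails."""
    return t * t if abs(t) <= R else 2 * R * abs(t) - R * R

R = 1.0
print(huber(0.5, R))          # quadratic regime: 0.25
print(huber(3.0, R))          # linear regime: 2*R*|t| - R^2 = 5.0
print(huber(1.0, R) == R * R) # continuous at the knee: True
```

The linear tails are what caps the influence of an outlier residual, while the quadratic bowl keeps the estimator efficient under regular Gaussian noise.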
4.3
Convex underestimator
To convexify f_R we can replace each term by its convex hull, as depicted in Figure 4.2. Here we observe that the high-power behavior is maintained, whereas the medium/low-power behavior is only
altered in the convexified area. We define the convex costs by composing each of the convex functions h with the nondecreasing function (t)_+ = \max\{0, t\}, which, in turn, transforms the discrepancies δ_{ij}(x) = \|x_i - x_j\| - d_{ij} and δ_{ik}(x_i) = \|x_i - a_k\| - r_{ik}. As (δ_{ij}(x))_+ and (δ_{ik}(x_i))_+ are nondecreasing and each of the functions h is convex,

\hat{f}_R(x) = \sum_{i \sim j} \frac{1}{2} h\big( (\|x_i - x_j\| - d_{ij})_+ \big) + \sum_{i} \sum_{k \in A_i} \frac{1}{2} h\big( (\|x_i - a_k\| - r_{ik})_+ \big) \qquad (4.4)

is also convex.

Figure 4.1: The different cost functions under consideration: the maximum-likelihood independent white Gaussian noise term f_Q(x_i, x_j) = (\|x_i - x_j\| - d_{ij})^2 has the steepest tails, which act as outlier amplifiers; the L1 loss f_{|\cdot|}(x_i, x_j) = |\|x_i - x_j\| - d_{ij}|, associated with impulsive noise, fails to model the Gaussianity of regular operating noise; and, finally, the Huber loss f_R(x_i, x_j) = h_R(\|x_i - x_j\| - d_{ij}) combines robustness to high-power outliers with adaptation to medium-power Gaussian noise.
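The underestimating property of the composition in (4.4) can be checked numerically term by term; the following sketch uses our own function names and sample points:

```python
def huber(t, R):
    """Huber loss h_R(t) of (4.1)."""
    return t * t if abs(t) <= R else 2 * R * abs(t) - R * R

def term(dist, d, R):
    """Original nonconvex term h_R(||xi - xj|| - d)."""
    return huber(dist - d, R)

def term_hat(dist, d, R):
    """Convexified term h_R((||xi - xj|| - d)_+), as in (4.4)."""
    return huber(max(0.0, dist - d), R)

d, R = 1.0, 0.5  # illustrative range measurement and Huber parameter
for dist in [0.0, 0.5, 1.0, 1.5, 3.0]:
    assert term_hat(dist, d, R) <= term(dist, d, R)       # underestimator
    if dist >= d:
        assert term_hat(dist, d, R) == term(dist, d, R)   # tight where dist >= d
print("underestimator verified on sample points")
```

For dist < d the clipped argument is zero, so the convexified term flattens exactly over the convexified region visible in Figure 4.2, and matches the original term everywhere else.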
4.3.1
Approximation quality of the convex underestimator
The quality of the convexified quadratic problem was addressed in Section 2.5.1, which we summarize here for the reader's convenience and extend to the two other convex problems.
Figure 4.2: All functions \hat{f} are tight underestimators of the functions f in Figure 4.1. They are the convex envelopes and, thus, the best convex approximations to each one of the original nonconvex cost terms. The convexification is performed by restricting the arguments of f to be nonnegative.

The optimal value of the nonconvex f, denoted by f^\star, is bounded by \hat{f}^\star = \hat{f}(x^\star) \le f^\star \le f(x^\star), where x^\star is the minimizer of the convex underestimator \hat{f}, and f^\star = \min_x f(x) is the minimum of the function f. A bound for the optimality gap is, thus, f^\star - \hat{f}^\star \le f(x^\star) - \hat{f}^\star. It is evident that in all cases (quadratic, Huber, and absolute value) \hat{f} is equal to f when \|x_i - x_j\| \ge d_{ij} and \|x_i - a_k\| \ge r_{ik}. When the function terms differ, say for all edges i \sim j \in E_2 \subset E, we have (\|x_i - x_j\| - d_{ij})_+ = 0, and similarly for the anchor terms, leading to

f_Q^\star - \hat{f}_Q^\star \le \sum_{i \sim j \in E_2} \frac{1}{2} \big( \|x_i^\star - x_j^\star\| - d_{ij} \big)^2 \qquad (4.5)

f_{|\cdot|}^\star - \hat{f}_{|\cdot|}^\star \le \sum_{i \sim j \in E_2} \frac{1}{2} \big| \|x_i^\star - x_j^\star\| - d_{ij} \big| \qquad (4.6)

f_R^\star - \hat{f}_R^\star \le \sum_{i \sim j \in E_2} \frac{1}{2} h_{R_{ij}}\big( \|x_i^\star - x_j^\star\| - d_{ij} \big) \qquad (4.7)

where E_2 = \{ i \sim j \in E : \|x_i^\star - x_j^\star\| < d_{ij} \}.
Table 4.1: Bounds on the optimality gap for the example in Figure 4.3

Cost             f⋆ − fˆ⋆   Eqs. (4.5)-(4.7)   Eqs. (4.8)-(4.10)
Quadratic        3.7019     5.5250             11.3405
Absolute value   1.1416     1.1533             3.0511
Robust Huber     0.1784     0.1822             0.4786
These bounds are an optimality gap guarantee available after the convexified problem is solved; they tell us how low our estimates can bring the original cost. Our bounds are tighter than the ones available a priori from applying [26, Th. 1], which are

f_Q^\star - \hat{f}_Q^\star \le \sum_{i \sim j} \frac{1}{2} d_{ij}^2 \qquad (4.8)

f_{|\cdot|}^\star - \hat{f}_{|\cdot|}^\star \le \sum_{i \sim j} \frac{1}{2} d_{ij} \qquad (4.9)

f_R^\star - \hat{f}_R^\star \le \sum_{i \sim j} \frac{1}{2} h_{R_{ij}}(d_{ij}) \qquad (4.10)

For the one-dimensional example of the star network costs depicted in Figure 4.3, the bounds in (4.5)-(4.7) and (4.8)-(4.10), averaged over 500 Monte Carlo trials, are presented in Table 4.1. The true average gap f⋆ − fˆ⋆ is also shown. In the Monte Carlo trials we sampled a set of zero-mean Gaussian random variables with σ = 0.04 for the baseline Gaussian noise and obtained noisy range measurements as in (4.11). One of the measurements is then corrupted by a zero-mean random variable with σ = 4, modelling outlier noise. These results show the tightness of the convexified function under such noisy conditions and also demonstrate the behaviour of the a priori bounds in (4.8)-(4.10).
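The a posteriori bound (4.7) and the a priori bound (4.10) can be compared on a toy configuration; in this sketch the node position, the ranges, and the parameter R are made-up values, and x plays the role of the minimizer of the convexified problem:

```python
def huber(t, R):
    """Huber loss h_R(t) of (4.1)."""
    return t * t if abs(t) <= R else 2 * R * abs(t) - R * R

# Made-up 1-D star network: a node at x with three neighbors.
x, neighbors = 3.0, [2.0, 4.0, 5.0]
d = [1.2, 0.8, 2.5]   # measured ranges (illustrative values)
R = 0.5

# A posteriori bound (4.7): sum only over edges where the estimated
# distance falls short of the measurement (the set E2).
apost = sum(0.5 * huber(abs(x - n) - dij, R)
            for n, dij in zip(neighbors, d) if abs(x - n) < dij)

# A priori bound (4.10): sum h_R(d_ij) over all edges.
aprio = sum(0.5 * huber(dij, R) for dij in d)

assert apost <= aprio   # the a posteriori bound is never looser
print(apost, aprio)
```

The gap between the two printed numbers mirrors the gap between the middle and right columns of Table 4.1: the a posteriori bound only pays for the edges in E_2, while the a priori bound charges every edge in full.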
4.4
Numerical experiments
We assess the performance of the three considered loss functions through simulation. The experimental setup consists of a uniquely localizable geometric network deployed in a square area with a side of 1 km, with four anchors (blue squares in Figure 4.4) located at the corners and ten sensors (red stars). Measurements are also visible as dotted green lines. The average node degree of the network is 4.3. The regular noisy range measurements are generated according to

d_{ij} = \big| \|x_i^\star - x_j^\star\| + \nu_{ij} \big|, \qquad r_{ik} = \big| \|x_i^\star - a_k\| + \nu_{ik} \big|, \qquad (4.11)

expressed in km, where x_i^\star is the true position of node i, and \{\nu_{ij} : i \sim j \in E\} \cup \{\nu_{ik} : i \in V, k \in A_i\} are independent Gaussian random variables with zero mean and standard deviation 0.04, corresponding to an uncertainty of about 40 m. Node 7 is malfunctioning, and all measurements related to it are perturbed with Gaussian noise with standard deviation 4, corresponding to an uncertainty of 4 km. The convex optimization problems were solved with cvx [54]. We ran 100 Monte Carlo trials, sampling both regular and outlier noise.
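The measurement model (4.11), including the malfunctioning node 7, is straightforward to simulate; the positions and graph below are random stand-ins for the actual deployment:

```python
import math
import random

random.seed(0)
# Random stand-ins for the true 2-D positions (the thesis uses a 1 km square).
pos = {i: (random.random(), random.random()) for i in range(10)}
edges = [(i, j) for i in range(10) for j in range(i + 1, 10)
         if random.random() < 0.3]

SIGMA_REG, SIGMA_OUT, FAULTY = 0.04, 4.0, 7   # regular noise, outlier noise, node 7

def measure(i, j):
    """Noisy range per (4.11); edges touching the faulty node get outlier noise."""
    true = math.dist(pos[i], pos[j])
    sigma = SIGMA_OUT if FAULTY in (i, j) else SIGMA_REG
    return abs(true + random.gauss(0.0, sigma))

d = {(i, j): measure(i, j) for (i, j) in edges}
print(len(d), "measurements generated")
```

The absolute value in `measure` matches (4.11): a range measurement is never negative, even when a high-power outlier sample would push it below zero.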
Table 4.2: Average positioning error per sensor (MPE/sensor), in meters

fˆ|·|    fˆQ      fˆR
59.50    32.16    31.06

Table 4.3: Average positioning error per sensor (MPE/sensor), in meters, for the biased experiment

fˆ|·|    fˆQ      fˆR
80.98    58.31    47.08
The performance metric used to assess accuracy is the average positioning error defined in (3.65). In Figure 4.4 we can observe that the clouds of estimates from f_R and f_Q gather around the true positions, except for the malfunctioning node 7. Note the spread of blue dots in the surroundings of the edges connecting node 7, indicating that f_R better preserves the nodes' ability to localize themselves, despite their confusing neighbor, node 7. This intuition is confirmed by the analysis of the data in Table 4.2, which shows that, even with only one disrupted sensor, our robust cost can reduce the error per sensor by 1.1 meters. Also, as expected, the malfunctioning node cannot be reliably located by any of the methods. The sensitivity to the value of the Huber parameter R in (4.1) is moderate, as shown in Figure 4.5. In fact, the error per sensor of the proposed estimator is the smallest for all tested values of the parameter. We observe that the error increases when R approaches the standard deviation of the regular Gaussian noise, meaning that the Huber loss gets closer to the L1 loss and, thus, is no longer adapted to the regular noise (R = 0 corresponds exactly to the L1 loss); in the same way, as R increases, so does the quadratic section, and the estimator becomes less robust to outliers, so, again, the error increases. Another interesting experiment is to see what happens when the faulty sensor produces measurements with consistent errors, or bias. We therefore ran 100 Monte Carlo trials in the same setting, but with the measurements of node 7 consistently at 10% of the real distance to each neighbor. The average positioning error per sensor is shown in Table 4.3. Here we observe a significant performance gap between the alternative costs, and our formulation proves to be, by far, superior.
4.5
Summary
We proposed an easy-to-motivate and effective dissimilarity model, which accounts for outliers without prescribing a model for outlier noise. This dissimilarity model was convexified by means of the convex envelopes of its terms, leading to a problem with a unique minimum value attainable in polynomial time. Further, we studied the optimality gap of the discrepancies, both a priori and after obtaining an estimate, thus providing bounds for the suboptimality of the convexification: guarantees useful in practice.

Different types of algorithms can be designed to attack the discrepancy measure presented in this work, since the function is continuous and convex (in the previous section the optimization problem of minimizing (4.4) was solved using the cvx general-purpose convex solver). Due to the distributed nature of networks of sensors or, generically, agents, we aim at investigating a distributed minimization of the proposed robust loss. There are also several nice properties regarding distributed operation: the adjustable Huber parameter is local to each edge and, if desired, can be dynamically adjusted to the local environmental noise conditions in a distributed manner.
Figure 4.3: One-dimensional example of the quality of the approximation of the true nonconvex costs f(x) by the convexified functions fˆ(x) in a star network: (a) quadratic cost; (b) absolute value cost; (c) robust Huber cost. The node positioned at x = 3 has 3 neighbors.
Figure 4.4: Estimates of the sensor positions for the three loss functions. We plotted the results of minimizing the L1 loss f_{|·|} (yellow), the quadratic loss f_Q (blue), and the proposed robust estimator with the Huber loss f_R (magenta). It is noticeable that the L1 loss is not able to correctly estimate positions whose measurements are corrupted with Gaussian noise. The outlier measurements at node 7 have more impact on the dispersion of blue dots than of magenta dots around its neighbors.
Figure 4.5: Average positioning error versus the value of the Huber function parameter R. The accuracy is maintained over a wide range of parameter values. We stress that the error will increase greatly as R → 0 and R → ∞, since these situations correspond to the L1 and L2 cases, respectively.
5 Conclusions and perspectives
Contents
5.1 Distributed network localization without initialization
5.2 Addressing the nonconvex problem
5.2.1 With more computations we can do better
5.2.2 Network of agents as a graphical model
5.3 Robust network localization
5.4 In summary
In this thesis we presented a flow of methods, from the initialization-free convex relaxation method, which only requires knowledge of noisy measurements and a few anchor locations, to more precise algorithms that, also given a good initial guess, provide highly accurate estimates of the positions of the nodes. We also addressed localization in harsh environments prone to outliers, which call for especially robust algorithms to overcome the negative influence of corrupted or malicious data.
5.1
Distributed network localization without initialization
We presented a simple, fast, and convergent relaxation method for synchronous and asynchronous time models. From the analysis of the problem, we uncovered key properties which allow a synchronous-time, distributed gradient algorithm with an optimal convergence rate. We also presented an asynchronous randomized method, better suited for unstructured and large-scale networks. We proved not only almost sure convergence of the cost value, but also almost sure convergence to a point, which is, as far as we know, a novel result for distributed gradient algorithms in general. This stronger convergence result has a significant impact on real-time applications, because nodes can safely probe the amount of change in the estimates to decide when to stop computing. The methods were published in the IEEE Transactions on Signal Processing. Extending this work, we interpreted each term of the cost as a discrepancy measure between a model and the noisy measurement, and generalized it to include heterogeneous measurements in the same cost. In particular, we fused range and angle information, obtaining very interesting results, which have already been submitted for publication.
5.2
Addressing the nonconvex problem
In some applications it is fundamental to obtain a very accurate estimate of the positions of the agents. Sometimes the convex approximation discussed above does not achieve these tight precision requirements. In such cases one should address the nonconvex estimator problem directly, whenever armed with a good starting point — for example, the convex approximation solution. With this need in mind, we presented a simple, distributed, and efficient algorithm, proven to converge (more precisely, every limit point of the algorithm is a stationary point of the cost function) and requiring no parameter tuning. The method turns out to be a member of the majorization-minimization family, where the majorization function is a quadratic.

An alternative to the majorization-minimization framework, initialized with, e.g., the estimate
from the methods described in Chapter 2, could be a homotopy continuation method (see Allgower [55] for in-depth information on homotopy methods). The downside of such methods is that they can become very difficult when applied to nonconvex functions, e.g., if we maintain the accessibility condition that all isolated solutions can be reached. This might lead to bifurcations of the method and, thus, to a combinatorial problem. Nevertheless, Moré and Wu [56] propose a smoothing Gaussian kernel, also used in the paper by Destino and Abreu [57] for localization by continuation. The first work addresses a squared discrepancy with squared distances, whereas the second applies the same smoothing kernel to the maximum-likelihood formulation. Experimentally, our method shows substantial performance improvements over the state of the art and, adding the convergence properties, the algorithm stands in the small club of distributed, nonconvex, and provably convergent maximum-likelihood estimators for the network localization problem, given a good starting point. We presented the method at IEEE GlobalSIP 2014.
5.2.1
With more computations we can do better
In order to be less dependent on the initialization, we aimed at a tighter convex majorization function as the tool for a majorization-minimization algorithm. The estimate produced by the MM procedure with this novel, tighter approximation was experimentally verified to improve root mean squared error by more than one order of magnitude. This extraordinary result encourages us to pursue a minimization algorithm tailored to this tighter majorizer. A first attempt was to follow the Alternating Direction Method of Multipliers (ADMM) strategy but, even though we obtained a distributed method with far better accuracy than the benchmark, we were not satisfied with the amount of communication expended in the process, which is also commonly observed in distributed methods using ADMM. Moreover, as the ADMM subproblems at each node did not have closed-form solutions, the overall estimate degraded very sharply with a small degradation in the solutions of the subproblems at each node, thus leading to a less interesting performance than that of our previously mentioned work. Our next step is to devise a proximal algorithm that takes into account the non-differentiability of our novel majorizer.
5.2.2
Network of agents as a graphical model
Another perspective on the localization problem is to consider the positions of the nodes as random variables and the measurement network as an undirected graphical model. This perspective was explored in Section 3.4, and the known linear relaxation of the resulting combinatorial problem was re-derived. The resulting formulation of the problem led to a novel SDP relaxation, tighter than the linear one and thus obtaining better experimental results in terms of root mean squared error. A faster and more accurate descent algorithm, requiring no initialization and attacking the nonconvex cost, was also presented. This novel method improved on the error not only of the linear relaxation but also of a vanilla coordinate descent initialized with the estimate from the linear relaxation.
5.3
Robust network localization
Sometimes nodes malfunction or behave maliciously, and their collected measurements behave like outliers. Despite the practical pertinence of this problem, research on this topic is still meager. To bridge this gap, we designed a soft outlier-rejection approach, considering the known outlier-rejecting penalties of the L1 norm and the Huber function. Nevertheless, using these penalties leads to very difficult, nonconvex problems. We convexified these penalties, and the approximate estimator for the Huber function performed far better in a scenario with malfunctioning nodes than the approach presented in Chapter 2 and discussed in Section 5.1. The results are very encouraging, and our next step is to shape an algorithm that optimizes the tight convex approximations in a way that is distributed, fast, and simple to implement.
5.4
In summary
Throughout this work we were dedicated to achieving useful estimates of the agent positions given a sparse noisy range measurement network and a small set of reference nodes. The methods for network localization presented here are defined by the following principles:
• Full network localization solutions, from uninformed agent deployment to a refined estimate;
• Emphasis on scalable, distributed solutions;
• Novel, tighter approximations to the maximum-likelihood cost function, leading to estimates that are more resilient to noise;
• Intuitive derivations and simple-to-implement algorithms;
• Fast and reliable estimates;
• Provable convergence of the algorithms.
Open perspectives of work include deepening our approaches, but also considering new settings, such as:
• Online optimization variants of the proposed algorithms;
• Applying our approaches to the mobile setting, by introducing dynamics in the problem formulation;
• Considering that the noise variance is not constant but rather a function of the measured distance;
• Determining the optimal placement of anchors and sensor nodes;
• Approaching the robust estimation topic with a Laplacian noise model. Here we can write the Laplacian distribution as a mixture of Gaussians with the same mean but different variances, as described in Girosi [58]. In this setting, the expectation-maximization algorithm could be employed.
Bibliography

[1] N. Bulusu, J. Heidemann, and D. Estrin, "GPS-less low-cost outdoor localization for very small devices," Personal Communications, IEEE, vol. 7, no. 5, pp. 28-34, 2000.
[2] P. Biswas, T.-C. Lian, T.-C. Wang, and Y. Ye, "Semidefinite programming based algorithms for sensor network localization," ACM Transactions on Sensor Networks (TOSN), vol. 2, no. 2, pp. 188-220, 2006.
[3] A. Simonetto and G. Leus, "Distributed maximum likelihood sensor network localization," Signal Processing, IEEE Transactions on, vol. 62, no. 6, pp. 1424-1437, Mar. 2014.
[4] A. Ihler, J. W. Fisher III, R. Moses, and A. Willsky, "Nonparametric belief propagation for self-localization of sensor networks," Selected Areas in Communications, IEEE Journal on, vol. 23, no. 4, pp. 809-819, Apr. 2005.
[5] S. Korkmaz and A.-J. van der Veen, "Robust localization in sensor networks with iterative majorization techniques," in Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, Apr. 2009, pp. 2049-2052.
[6] P. Oguz-Ekim, J. Gomes, J. Xavier, and P. Oliveira, "Robust localization of nodes and time-recursive tracking in sensor networks using noisy range measurements," Signal Processing, IEEE Transactions on, vol. 59, no. 8, pp. 3930-3942, Aug. 2011.
[7] C. Soares, J. Xavier, and J. Gomes, "Simple and fast convex relaxation method for cooperative localization in sensor networks using range measurements," Signal Processing, IEEE Transactions on, vol. 63, no. 17, pp. 4532-4543, Sept. 2015.
[8] ——, "Distributed, simple and stable network localization," in Signal and Information Processing (GlobalSIP), 2014 IEEE Global Conference on, Dec. 2014, pp. 764-768.
[9] ——, "DCOOL-NET: Distributed cooperative localization for sensor networks," submitted, http://arxiv.org/abs/1211.7277.
[10] C. Soares and J. Gomes, "Robust dissimilarity measure for network localization," arXiv preprint arXiv:1410.2327, 2014.
[11] J. Aspnes, D. Goldenberg, and Y. R. Yang, "On the computational complexity of sensor network localization," in Algorithmic Aspects of Wireless Sensor Networks. Springer, 2004, pp. 32-44.
[12] P. Biswas and Y. Ye, "Semidefinite programming for ad hoc wireless sensor network localization," in Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks. ACM, 2004, pp. 46-54.
[13] J. Costa, N. Patwari, and A. Hero III, "Distributed weighted-multidimensional scaling for node localization in sensor networks," ACM Transactions on Sensor Networks (TOSN), vol. 2, no. 1, pp. 39-64, 2006.
[14] M. Gholami, L. Tetruashvili, E. Strom, and Y. Censor, "Cooperative wireless sensor network positioning via implicit convex feasibility," Signal Processing, IEEE Transactions on, vol. 61, no. 23, pp. 5830-5840, Dec. 2013.
[15] S. Srirangarajan, A. Tewfik, and Z.-Q. Luo, "Distributed sensor network localization using SOCP relaxation," Wireless Communications, IEEE Transactions on, vol. 7, no. 12, pp. 4886-4895, Dec. 2008.
[16] F. Chan and H. So, "Accurate distributed range-based positioning algorithm for wireless sensor networks," Signal Processing, IEEE Transactions on, vol. 57, no. 10, pp. 4100-4105, Oct. 2009.
[17] U. Khan, S. Kar, and J. Moura, "DILAND: An algorithm for distributed sensor localization with noisy distance measurements," Signal Processing, IEEE Transactions on, vol. 58, no. 3, pp. 1940-1947, Mar. 2010.
[18] Q. Shi, C. He, H. Chen, and L. Jiang, "Distributed wireless sensor network localization via sequential greedy optimization algorithm," Signal Processing, IEEE Transactions on, vol. 58, no. 6, pp. 3328-3340, June 2010.
[19] D. Blatt and A. Hero, "Energy-based sensor network source localization via projection onto convex sets," Signal Processing, IEEE Transactions on, vol. 54, no. 9, pp. 3614-3619, Sept. 2006.
[20] J.-B. Hiriart-Urruty and C. Lemaréchal, Convex Analysis and Minimization Algorithms. Springer-Verlag, 1993.
[21] F. R. Chung, Spectral Graph Theory. American Mathematical Soc., 1997, vol. 92.
[22] R. B. Bapat, Graphs and Matrices. Springer, 2010.
[23] M. Mesbahi and M. Egerstedt, Graph Theoretic Methods in Multiagent Networks. Princeton University Press, 2010.
[24] Y. Nesterov, "A method of solving a convex programming problem with convergence rate O(1/k^2)," in Soviet Mathematics Doklady, vol. 27, no. 2, 1983, pp. 372-376.
[25] D. Shah, Gossip Algorithms. Now Publishers Inc, 2009.
[26] M. Udell and S. Boyd, "Bounding duality gap for problems with separable objective," online, 2014.
[27] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.
[28] D. Bertsekas, "Incremental proximal methods for large scale convex optimization," Mathematical Programming, vol. 129, pp. 163-195, 2011.
[29] Z. Lu and L. Xiao, "On the complexity analysis of randomized block-coordinate descent methods," arXiv preprint arXiv:1305.4723, 2013.
[30] J. Sturm, "Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones," Optimization Methods and Software, vol. 11-12, pp. 625-653, 1999, version 1.05 available from http://fewcal.kub.nl/sturm.
[31] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1989.
[32] D. P. Bertsekas, Nonlinear Programming. Athena Scientific, 1999.
[33] J. Jacod and P. Protter, Probability Essentials. Springer, 2003, vol. 1.
[34] H. Robbins and D. Siegmund, "A convergence theorem for non negative almost supermartingales and some applications," in Herbert Robbins Selected Papers. Springer, 1985, pp. 111-135.
[35] G. Calafiore, L. Carlone, and M. Wei, "Distributed optimization techniques for range localization in networked systems," in Decision and Control (CDC), 2010 49th IEEE Conference on, Dec. 2010, pp. 2221-2226.
[36] M. Raydan, "The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem," SIAM Journal on Optimization, vol. 7, no. 1, pp. 26-33, 1997.
[37] D. R. Hunter and K. Lange, "A tutorial on MM algorithms," The American Statistician, vol. 58, no. 1, pp. 30-37, Feb. 2004.
[38] A. Beck and Y. Eldar, "Sparsity constrained nonlinear optimization: Optimality conditions and algorithms," SIAM Journal on Optimization, vol. 23, no. 3, pp. 1480-1509, 2013. [Online]. Available: http://dx.doi.org/10.1137/120869778
Bibliography
[39] P. Forero and G. Giannakis, “Sparsity-exploiting robust multidimensional scaling,” Signal Processing, IEEE Transactions on, vol. 60, no. 8, pp. 4118 –4134, Aug. 2012. [40] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statisR tical learning via the alternating direction method of multipliers,” Foundations and Trends
in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011. [41] I. Schizas, A. Ribeiro, and G. Giannakis, “Consensus in ad hoc WSNs with noisy links — part i: Distributed estimation of deterministic signals,” Signal Processing, IEEE Transactions on, vol. 56, no. 1, pp. 350 –364, Jan. 2008. [42] H. Zhu, G. Giannakis, and A. Cano, “Distributed in-network channel decoding,” Signal Processing, IEEE Transactions on, vol. 57, no. 10, pp. 3970 –3983, Oct. 2009. [43] P. Forero, A. Cano, and G. Giannakis, “Consensus-based distributed support vector machines,” The Journal of Machine Learning Research, vol. 11, pp. 1663–1707, 2010. [44] J. Bazerque and G. Giannakis, “Distributed spectrum sensing for cognitive radio networks by exploiting sparsity,” Signal Processing, IEEE Transactions on, vol. 58, no. 3, pp. 1847 –1862, Mar. 2010. [45] T. Erseghe, D. Zennaro, E. Dall’Anese, and L. Vangelista, “Fast consensus by the alternating direction multipliers method,” Signal Processing, IEEE Transactions on, vol. 59, no. 11, pp. 5523 –5537, Nov. 2011. [46] J. Mota, J. Xavier, P. Aguiar, and M. Puschel, “Distributed basis pursuit,” Signal Processing, IEEE Transactions on, vol. 60, no. 4, pp. 1942 –1956, Apr. 2012. [47] B. D. O. Anderson, I. Shames, G. Mao, and B. Fidan, “Formal theory of noisy sensor network localization,” SIAM Journal on Discrete Mathematics, vol. 24, no. 2, pp. 684–698, 2010. [48] M. Wainwright and M. Jordan, “Graphical models, exponential families, and variational inR in Machine Learning, vol. 1, no. 1-2, pp. 1–305, 2008. ference,” Foundations and Trends
[49] N. Komodakis, N. Paragios, and G. Tziritas, “Mrf energy minimization and beyond via dual decomposition,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 33, no. 3, pp. 531–552, March 2011. [50] R. G. Gallager, P. A. Humblet, and P. M. Spira, “A distributed algorithm for minimum-weight spanning trees,” ACM Transactions on Programming Languages and systems (TOPLAS), vol. 5, no. 1, pp. 66–77, 1983. [51] J. Ash and R. Moses, “Outlier compensation in sensor network self-localization via the EM algorithm,” in Acoustics, Speech, and Signal Processing, 2005. Proceedings. (ICASSP ’05). IEEE International Conference on, vol. 4, March 2005, pp. iv/749–iv/752 Vol. 4. 96
Bibliography
[52] F. Yin, A. Zoubir, C. Fritsche, and F. Gustafsson, “Robust cooperative sensor network localization via the EM criterion in LOS/NLOS environments,” in Signal Processing Advances in Wireless Communications (SPAWC), 2013 IEEE 14th Workshop on, June 2013, pp. 505–509. [53] P. J. Huber, “Robust estimation of a location parameter,” The Annals of Mathematical Statistics, vol. 35, no. 1, pp. 73–101, 1964. [54] M. Grant and S. Boyd, “CVX: Matlab software for disciplined convex programming, version 1.21,” http://cvxr.com/cvx, Apr. 2011. [55] E. L. Allgower and K. Georg, Numerical continuation methods: an introduction.
Springer
Science & Business Media, 2012, vol. 13. [56] J. J. Moré and Z. Wu, “Global continuation for distance geometry problems,” SIAM Journal on Optimization, vol. 7, no. 3, pp. 814–836, 1997. [57] G. Destino and G. Abreu, “On the maximum likelihood approach for source and network localization,” Signal Processing, IEEE Transactions on, vol. 59, no. 10, pp. 4954 –4970, Oct. 2011. [58] F. Girosi, “Models of noise and robust estimates,” 1991.
97