Statistical Processing in Sensor Networks: Estimation and Inference by Ram Rajagopal
A thesis submitted in partial satisfaction of the requirements for the degree of Master of Arts in Statistics in the Graduate Division of the University of California, Berkeley
Committee in charge:
Professor Martin J. Wainwright, Chair
Professor Bin Yu
Professor Pravin Varaiya

Fall 2009
The thesis of Ram Rajagopal, titled Statistical Processing in Sensor Networks: Estimation and Inference, is approved:
University of California, Berkeley
Statistical Processing in Sensor Networks: Estimation and Inference
Copyright 2009 by Ram Rajagopal
To my mother and father: for sharing with me their love for life.
Contents

List of Figures

List of Tables

1 Introduction
   1.1 Architectures for statistical inference
   1.2 Thesis organization
       1.2.1 Chapter 2: Network-based consensus averaging with general noisy channels
       1.2.2 Chapter 3: Quantile Estimation in a Data Communication-Constrained Setting
       1.2.3 Chapter 4: Network Structure Inference from Multicast Delay
   1.3 Summary of contributions

2 Network-based consensus averaging with general noisy channels
   2.1 Introduction
   2.2 Problem set-up
       2.2.1 Consensus matrices and stochastic updates
       2.2.2 Communication and noise models
   2.3 Statement of results and consequences
       2.3.1 Consistency
       2.3.2 Asymptotic normality
       2.3.3 Scaling and graph topology
       2.3.4 Illustrative simulations
   2.4 Proof of Theorem 2.3.1
       2.4.1 Proof of Theorem 2.3.1(a) and (b)
       2.4.2 Proof of Theorem 2.3.1(c)
       2.4.3 ODE method for mean paths
   2.5 Proof of Theorem 2.3.2
   2.6 Discussion
   2.7 Proofs
       2.7.1 Proof of Lemma 2.3.1

3 Quantile Estimation in a Data Communication-Constrained Setting
   3.1 Introduction
   3.2 Problem Set-up and Decentralized Algorithms
       3.2.1 Centralized Quantile Estimation
       3.2.2 Distributed Quantile Estimation
       3.2.3 Protocol specification
       3.2.4 Convergence results
       3.2.5 Comparative Analysis
       3.2.6 Simulation example
   3.3 Some Extensions
       3.3.1 Different levels of feedback
       3.3.2 Extensions to noisy links
   3.4 Discussion
   3.5 Proofs
       3.5.1 Proof of Theorem 3.2.1
       3.5.2 Proof of Theorem 3.2.2
       3.5.3 Proof of Theorem 3.3.1

4 Network Structure Inference from Multicast Delay
   4.1 Introduction
       4.1.1 Basic Definitions
       4.1.2 Our Results
       4.1.3 Discussion
   4.2 Phylogenetic Reconstruction Techniques
       4.2.1 Basics
       4.2.2 Distorted Metric Algorithms
   4.3 Routing Tree Reconstruction
       4.3.1 Variance Metric
       4.3.2 Inferring the routing tree
   4.4 Edge Delay Inference
       4.4.1 Additive Functions
       4.4.2 Delay-based metrics
       4.4.3 Algorithm for Moment Inference
   4.5 Analysis of the ER Algorithm
   4.6 Discussion
   4.7 Algorithm Details
       4.7.1 Examples of Regular Delay Distributions
       4.7.2 DMR Algorithm

5 Contributions and suggested directions
   5.1 Contributions
   5.2 Suggested directions
       5.2.1 Statistical hierarchical processing
       5.2.2 Nonparametric sequential statistical methods
       5.2.3 Statistical methods to detect and be robust to point changes

Bibliography
List of Figures

1.1 Typical monitoring architecture choices.
2.1 Illustration of the distributed protocol. Each node j ∈ V maintains an estimate θ(j). At each round, for a fixed reference node ℓ ∈ V, each neighbor i ∈ N(ℓ) sends the message F(θ(i), ξ(i, ℓ)) along the edge i → ℓ.
2.2 Comparison of empirical simulations to theoretical predictions for the ring graph in panel (a). (b) Sample path plots of log MSE versus log iteration number: as predicted by the theory, the log MSE scales linearly with log iterations. (c) Plot of number of iterations (vertical axis) required to reach a fixed level of MSE versus the graph size (horizontal axis). For the ring graph, this quantity scales quadratically in the graph size, consistent with Corollary 2.3.1. Solid line shows theoretical prediction.
2.3 Comparison of empirical simulations to theoretical predictions for the four nearest-neighbor lattice (panel (a)). (b) Sample path plots of log MSE versus log iteration number: as predicted by the theory, the log MSE scales linearly with log iterations. (c) Plot of number of iterations (vertical axis) required to reach a fixed level of MSE versus the graph size (horizontal axis). For the lattice graph, this quantity scales linearly in the graph size, consistent with Corollary 2.3.1. Solid line shows theoretical prediction.
2.4 Comparison of empirical simulations to theoretical predictions for the bipartite expander graph in panel (a). (b) Sample path plots of log MSE versus log iteration number: as predicted by the theory, the log MSE scales linearly with log iterations. (c) Plot of number of iterations (vertical axis) required to reach a fixed level of MSE versus the graph size (horizontal axis). For an expander, this quantity remains essentially constant with the graph size, consistent with Corollary 2.3.1. Solid line shows theoretical prediction.
3.1 Sensor network for quantile estimation with m sensors. Each sensor is permitted to transmit a 1-bit message to the fusion center; in turn, the fusion center is permitted to broadcast k bits of feedback.
3.2 (a) Convergence of θn to θ∗ with m = 11 nodes and quantile level α∗ = 0.3. (b) Log-log plots of the variance against m for both algorithms (log(m)-bf and 1-bf) with constant step sizes, together with the theoretically predicted rate. (c) Log-log plots of log(m)-bf with constant step size versus the 1-bf algorithm with decaying step size.
3.3 (a) Plots of the asymptotic variance κ(α∗, Qℓ) defined in equation (3.3.8) versus the number of levels ℓ in a uniform quantizer, corresponding to log2(2ℓ) bits of feedback, for a sensor network with m = 4000 nodes. The plots show the asymptotic variance rescaled by the centralized gold standard, so that it starts at π/2 for ℓ = 2, and decreases towards 1 as ℓ is increased towards m/2. (b) Plots of the asymptotic variances Vm and V1 defined in equation (3.3.13) as the feedforward noise parameter is increased from 0 towards 1/2.
4.1 Unidentifiability of mean delay: if one were to replace d1 with d1 + µ and d2, d3 with d2 − µ, d3 − µ for µ > 0 (assuming µ can be chosen so that all delays remain positive), then the distribution of delays at a, b would be unchanged. This example also shows that one cannot deduce the delays on all edges given total delays at all leaves.
4.2 Algorithm Additive Function Inference.
4.3 Edge weight inference.
4.4 Algorithm Symmetric Edge Reconstruction.
4.5 The Hi's are centered sums of delays on the corresponding paths.
4.6 Algorithm Edge Reconstruction.
4.7 Illustration of routine Mini Contractor.
4.8 Illustration of routine Extender.
4.9 Algorithm Mini Contractor.
4.10 Algorithm Extender.

List of Tables

3.1 Description of the log(m)-bf algorithm.
3.2 Description of the 1-bf algorithm.
3.3 Description of the general algorithm, with log2(2ℓ) bits of feedback.
Acknowledgments

My interest in Statistics was kindled by the research in Signal Processing I carried out for my Ph.D. Berkeley provided the perfect environment to learn more about statistics, and to work on problems related both to my original area and to statistics itself. The thesis owes a lot to interactions with Prof. Martin Wainwright, and to his encouragement and support. I would also like to thank Prof. Pravin Varaiya for allowing me to explore statistical science and formally take various classes in the Statistics department, and for serving on the committee for the thesis. I would like to thank Prof. Bin Yu for serving on my committee and for her talk on statistics in sensor networks, which inspired part of the work. She and Prof. Gang Liang, from UC Irvine, were kind enough to discuss an early draft of the material in Chapter 4. I am very grateful to Shankar Bhamidi and Sebastien Roch for the wonderful collaboration that resulted in Chapter 4. Profs. John Rice, David Brillinger and David Aldous inspired me and encouraged me to pursue various courses and methods in Statistics. Guilherme Rocha, Xuanlong Nguyen, and Carol and Gregorio Caetano have been constant companions in long conversations about statistics, its philosophy and its methods.
Chapter 1
Introduction

Modern adaptive control and signal processing systems operate on non-conventional sensing and control architectures. Traditionally, such systems were built using a centralized processing architecture, in which various sensors measured the relevant variables and were connected to a single central processing unit through a wired connection. A single clock drives the computation, and data is assumed to be communicated from the sensor to the processor without any added noise. More recently, a novel type of sensor data processing architecture, the sensor network, has been proposed for monitoring large distributed systems, such as urban traffic or weather. The architecture relies on sensor nodes that are capable of sensing and of computing locally before transmitting the information over a wireless channel. Transmission can be local, to nearby sensing units, or centralized, to a single data fusion unit. Each sensing unit is operated by batteries. Usually, data transmission consumes significantly more power than local computation. Such an architecture imposes constraints on how inference and estimation should be performed on the data. Data transmission constraints, for example, impose the need for intelligent local processing of information. Moreover, communication imperfections or local computation requirements, such as data quantization, require careful design of inference and estimation algorithms. The aim of this thesis is to study the design of inference and estimation algorithms in sensor networks, and their performance under data transmission constraints. We focus on three specific statistical problems, and exhibit strategies that achieve close to optimal performance, where performance is carefully defined for each application. The remainder of
Figure 1.1: Typical monitoring architecture choices.
the chapter describes the architecture in more detail, introduces the statistical problems and identifies the main contributions of the thesis.

1.1 Architectures for statistical inference

Figure 1.1 displays the various components of a sensor network. There are two typical processing configurations for a sensor network: distributed and decentralized processing [Tsitsiklis, 1993]. In decentralized networks, sensors process the data locally and transmit information summaries, so that global decisions or estimates are made by a fusion center, or central node, which then transmits the results to all nodes if required. Savings are achieved since summaries are transmitted in place of all the raw data. Distributed processing networks, on the other hand, decentralize the decision process itself. Each node exchanges information locally with its neighbors and computes its own estimate of the decision. The aim is to avoid latency and the communication overhead required to establish a fusion-center-based network. Distributed processing can also be used to reduce the computation burden in a data-centralized application. For example, identifying failed sensors in a network is a hypothesis testing problem where the number of hypotheses is exponential in the number of nodes. The sensor network architecture imposes several constraints on any statistical methodology: memory at the nodes is limited, the available energy is limited, and communicating information is costly. The availability of such sensor nodes allows a much higher sampling of data, and a desirable objective is to use low-cost local computation to mitigate the constraints.
Developing statistical methodologies then requires a careful tradeoff between statistical accuracy demands, implementation complexity and communication limitations. One way to address these needs simultaneously is by exploiting the network structure in an appropriate way. One of the main claims of this thesis is that for various problems, carefully constructed methods can achieve near-optimal performance. Optimality is defined with respect to a data-centralized implementation, where implementation complexity and communication limitations are not an issue. Given the constraints and benefits of sensor networks, and the demands of various types of applications, there are particular classes of statistical techniques that are well suited for realistic usage (a minimal sketch of the sequential idea follows this list):

Sequential. The real-time nature of monitoring problems and the limited memory in the sensing devices imply that any local computations should preferably be sequential. In a sequential estimation or decision method, a partial decision is available after each new piece of information is received. Moreover, decisions are updated according to rules that only depend on short summaries of the data seen until that point. Sequentially performed estimation or stochastic optimization forms the class of stochastic approximation methods [Benveniste et al., 1990; Kushner and Yin, 1997]. An important concern is the speed at which estimates converge to a true value, under communication and noise constraints, when calculations are performed sequentially.

Non-parametric. The types of processes monitored by sensor networks vary at many different time-scales, from slow to fast. Moreover, in many applications, although standard behavior can be captured using a parametric model, unexpected behavior does not satisfy such models. One example is urban traffic, where the congestion dynamics are hard to capture with parameterized stochastic models. Instead, decision making processes require certain properties of measured distributions of interest, such as the median. Implementation and analysis of the performance of these non-parametric estimators for networks is of interest.

Approximations. Obtaining optimal solutions for estimation and decision problems, in a decentralized or distributed scenario, is a computationally hard problem [Tsitsiklis, 1993]. In fact, even deciding whether an optimal approach exists can be intractable. Instead, if approximation methods are used, we can obtain deployable algorithms. One can pursue the development of approximate statistical decision methods. In this case, various important statistical methodologies are approximated so that they operate in decentralized, distributed or computation-constrained architectures. Performance guarantees are provided for the obtained solutions. Approximation quality is measured with respect to the performance of a centralized optimal or best algorithm.
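As a minimal sketch of the sequential idea above (my own illustrative example, not an algorithm from the thesis; the step-size schedule and quantile level are hypothetical choices), a Robbins-Monro-type update keeps only the current estimate in memory and refines it with each new observation:

```python
import numpy as np

def sequential_quantile(stream, alpha=0.5, step=lambda n: 1.0 / (n + 1)):
    """Sequentially track the alpha-quantile of a data stream.

    Only the current estimate is stored, so memory use is constant:
    after each observation x the estimate moves up (weight alpha) if
    x exceeds it, and down (weight 1 - alpha) otherwise.
    """
    theta = 0.0
    for n, x in enumerate(stream, start=1):
        theta += step(n) * (alpha - (1.0 if x <= theta else 0.0))
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(loc=2.0, scale=1.0, size=50000)
    print(sequential_quantile(data))   # should approach the true median, 2.0
```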
1.2 Thesis organization

The thesis is organized in chapters that address three specific problems: decentralized mean consensus averaging (Chapter 2), distributed quantile estimation (Chapter 3) and network structure inference from additive measurements (Chapter 4). The three problems address different aspects of estimation in sensor networks. The objective is to obtain provably consistent algorithms with error performance guarantees. Furthermore, in the second problem we compute the relative performance with respect to the best expected performance. The remainder of this section summarizes each problem and the respective contributions.
1.2.1 Chapter 2: Network-based consensus averaging with general noisy channels
Chapter 2, in collaboration with Martin Wainwright, focuses on the consensus averaging problem on graphs under general imperfect communications. In a consensus averaging problem the objective is for every node to have an estimate of the average of a single measured value across nodes. We study a particular class of distributed consensus algorithms based on damped updates, and using the ordinary differential equation method, we prove that the updates converge almost surely to the consensus average for various models of perturbation of data exchanged between nodes. The convergence is not asymptotic in the size of the network. Our analysis applies to various types of stochastic disturbances to the updates, including errors in update calculations, dithered quantization and imperfect data exchange among nodes. Under a suitable stability condition, we prove that the error is asymptotically Gaussian, and we show how the asymptotic covariance is specified by the graph Laplacian. For additive perturbations, we show how the scaling of the asymptotic MSE is controlled by the spectral gap of the Laplacian.
1.2.2 Chapter 3: Quantile Estimation in a Data Communication-Constrained Setting

Many modern engineering systems require solving problems of statistical inference in a decentralized manner, in contrast to the centralized approach taken in classical statistical theory. The α-quantile of a distribution is the value θ∗ such that the probability that the random variable is less than θ∗ is α; when α = 0.5, we obtain the median. Empirical estimates of quantiles from observed data capture well the properties of general distributions. Chapter 3, in collaboration with Martin Wainwright and Pravin Varaiya, considers the following problem of decentralized statistical inference: given i.i.d. samples from an unknown distribution, estimate an arbitrary quantile subject to limits on the number of bits exchanged. We analyze a standard fusion-based architecture, in which each of m sensors transmits a single bit to the fusion center, which in turn is permitted to send some number k of bits of feedback. Supposing that each of the m sensors receives n observations, the optimal centralized protocol yields mean-squared error decaying as O(1/[nm]). We develop and analyze the performance of various decentralized protocols in comparison to this centralized gold standard. First, we describe a decentralized protocol based on k = log(m) bits of feedback that is strongly consistent, and achieves the same asymptotic MSE as the centralized optimum. Second, we describe and analyze a decentralized protocol based on only a single bit (k = 1) of feedback. For step sizes independent of m, it achieves an asymptotic MSE of order $O(1/[n\sqrt{m}])$, whereas for step sizes decaying as $1/\sqrt{m}$, it achieves the same O(1/[nm]) decay in MSE as the centralized optimum. Our theoretical results are complemented by simulations, illustrating the tradeoffs between these different protocols.
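To make the fusion-based architecture concrete, here is a minimal sketch (my own simplification, not the thesis protocol: it assumes the fusion center can broadcast its current estimate back to the sensors, roughly in the spirit of the log(m)-bit feedback scheme) of how 1-bit messages from the sensors can drive a stochastic-approximation update of a quantile at the fusion center:

```python
import numpy as np

def fusion_quantile(samples, alpha=0.3, step=lambda n: 1.0 / (n + 1)):
    """Decentralized quantile estimation sketch.

    samples: array of shape (n_rounds, m); row n holds the observation of
    each of the m sensors at round n.  Each sensor sends the single bit
    b_i = 1{X_i <= theta_n}; the center averages the bits, nudges its
    estimate toward the alpha-quantile, and (by assumption) broadcasts
    the updated estimate back to the sensors.
    """
    theta = 0.0
    for n, row in enumerate(samples, start=1):
        bits = (row <= theta).astype(float)        # one bit per sensor
        theta = theta + step(n) * (alpha - bits.mean())
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    m, n_rounds = 11, 5000
    obs = rng.exponential(scale=1.0, size=(n_rounds, m))
    est = fusion_quantile(obs, alpha=0.3)
    print(est, -np.log(1 - 0.3))   # estimate vs. true 0.3-quantile of Exp(1)
```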
1.2.3 Chapter 4: Network Structure Inference from Multicast Delay
Chapter 4, in collaboration with Shankar Bhamidi and Sebastien Roch, uses computational phylogenetic techniques to solve a central problem in inferential network monitoring. More precisely, we design a novel algorithm for multicast-based delay inference, that is, the problem of reconstructing delay characteristics of a network from end-to-end delay measurements on network paths. Our inference algorithm is based on additive metric techniques used in phylogenetics. It runs in polynomial time and requires a sample of size only O(log n). We also show how to recover the topology of the routing tree.
1.3 Summary of contributions

Some of the main contributions of the thesis are:

• Algorithm, consistency and asymptotic normality for distributed noisy consensus averaging;

• Algorithm, consistency and asymptotic normality for decentralized noisy quantile estimation;

• Algorithm and consistency for delay distribution reconstruction in communication trees.
Chapter 2

Network-based consensus averaging with general noisy channels

2.1 Introduction
Consensus problems, in which a group of nodes want to arrive at a common decision in a distributed manner, have a long history, dating back to seminal work from over twenty years ago [deGroot, 1974; Borkar and Varaiya, 1982; Tsitsiklis, 1984]. A particular type of consensus estimation is the distributed averaging problem, in which a group of nodes want to compute the average (or more generally, a linear function) of a set of values. Due to its applications in sensor and wireless networking, this distributed averaging problem has been the focus of substantial recent research. The distributed averaging problem can be studied either in continuous time [Olfati-Saber et al., 2007], or in the discrete-time setting (e.g., [Kempe et al., 2003; Xiao and Boyd, 2004; Boyd et al., 2006; Aysal et al., 2007a; Dimakis et al., 2008]). In both cases, there is now a fairly good understanding of the conditions under which various distributed averaging algorithms converge, as well as the rates of convergence for different graph structures.

The bulk of early work on consensus has focused on the case of perfect communication between nodes [Denantes et al., 2008]. Given that noiseless communication may be an unrealistic assumption for sensor networks, a more recent line of work has addressed the issue of noisy communication links. With imperfect observations, many of the standard consensus protocols might fail to reach an agreement. Xiao et al. [Xiao et al., 2007] observed this phenomenon, and opted to instead redefine the notion of agreement, obtaining a protocol that allows nodes to reach a steady-state agreement, whereby all nodes are able to track but need not obtain consensus agreement. Schizas et al. [Schizas et al., 2008] study distributed algorithms for optimization, including the consensus averaging problem, and establish stability under noisy updates, in that the iterates are guaranteed to remain within a ball of the correct consensus, but do not necessarily achieve exact consensus. Kashyap et al. [Kashyap et al., 2007] study consensus updates with the additional constraint that the value stored at each node must be integral, and establish convergence to quantized consensus. Fagnani and Zampieri [Fagnani and Zampieri, 2007] study the case of packet-dropping channels, and propose various updates that are guaranteed to achieve consensus. Picci and Taylor [Picci and Taylor, 2007] show almost sure convergence of consensus updates when the communication is noiseless, but the connectivity is random, such as in packet-dropping channel models. Aysal et al. [Aysal et al., 2007b,a] use probabilistic forms of quantization to develop algorithms that achieve quantized average consensus (i.e., consensus on a quantized value that is shown to be close to the true average, instead of the actual average itself). Aysal et al. [Aysal et al., 2008] propose a broadcasting-based consensus algorithm that does not preserve the sum, but converges faster than standard consensus algorithms. Yildiz and Scaglione [Yildiz and Scaglione, 2008] suggest coding strategies to deal with quantization noise and rate-constrained channels, and establish MSE convergence as the number of nodes in the network goes to infinity. In related work, Carli et al. [Carli et al., 2007] propose a sum-preserving difference update to handle quantization and communication noise.

In the current chapter, we address the discrete-time average consensus problem in general fixed networks, modeling the communication between neighboring nodes as a general stochastic channel. Our main contributions are to propose and analyze simple distributed protocols that are guaranteed to converge almost surely to the exact consensus mean, and, under suitable stability conditions, whose $\sqrt{n}$-rescaled error is asymptotically normal with covariance controlled by the graph structure. These exactness guarantees are obtained using protocols with decreasing step sizes, which smooth out the noise factors. The framework described here is based on the classic ordinary differential equation method [Ljung, 1977], which allows us to explicitly identify the deterministic limit, and moreover to establish asymptotic normality. This framework allows for the analysis of several different and important scenarios, namely:

• Noisy storage: stored values at each node are corrupted by noise, with known covariance structure.

• Noisy transmission: messages across each edge are corrupted by noise, with known covariance structure.

• Bit-constrained channels: dithered quantization is applied to messages prior to transmission.

In this chapter, we analyze protocols that can achieve arbitrarily small mean-squared error (MSE) for distributed averaging with noise. Closely related to part of our results is past work by Hatano et al. [Hatano et al., 2005], who analyze averaging under additive noise and use a Lyapunov function approach to prove almost sure convergence to the so-called consensus subspace (i.e., $\mathcal{C} = \mathrm{span}(\vec{1})$). Concurrent work by Kar and Moura [Kar and Moura, 2008a,b], brought to our attention after initial submission of this work, allows for more general models (including Markovian dynamics), and also uses the Lyapunov function approach to prove almost sure convergence of the updates to a random element of the consensus subspace. This chapter studies a more restricted class of models with independent noise, and as a preliminary result, we establish almost sure convergence of the updates to a deterministic constant—namely, the exact mean in the consensus problem, or a biased mean, depending on the noise model. We use the ordinary differential equation or ODE method [Ljung, 1977] to approximate the long-term behavior of the mean evolution, another technique from stochastic approximation theory. The main contribution of our chapter is to show that under appropriate stability conditions, the error in the estimate (after the usual $\sqrt{n}$-rescaling) is asymptotically normal, and to demonstrate explicitly how its asymptotic covariance matrix is controlled by the graph topology. This analysis reveals how different graph structures—ranging from ring graphs at one extreme to expander graphs at the other—lead to different variance scaling behaviors, as determined by the eigenspectrum of the graph Laplacian [Chung, 1991]. Our theory predicts that the number of iterations required to achieve δ-error will scale very differently in the network size for different graph topologies, ranging from quadratic scaling in network size for ring graphs to constant scaling (independence of network size) for expander graphs. In addition, our simulation results show excellent agreement with the theoretical predictions.

The remainder of this chapter is organized as follows. We begin in Section 2.2 by describing the distributed averaging problem in detail, and defining the class of stochastic algorithms studied in this chapter. In Section 2.3, we state our main results on the almost-sure convergence and asymptotic normality of our protocols, and illustrate some of their consequences for particular classes of graphs. In particular, we illustrate the sharpness of our theoretical predictions by comparing them to simulation results, on various classes of graphs. Section 2.4 is devoted to the proofs of our main results, and we conclude the chapter with discussion in Section 2.6.
Comment on notation: Throughout this chapter, we use the following standard asymptotic notation: for functions f and g, the notation f (n) = O(g(n)) means that f (n) ≤ Cg(n) for some constant C < ∞; the notation f (n) = Ω(g(n)) means that f (n) ≥ C 0 g(n) for some constant C 0 > 0, and f (n) = Θ(g(n)) means that f (n) = O(g(n)) and f (n) = Ω(g(n)). We use the index n to refer to discrete iterations of the gossip algorithm and the variable t to refer to continuous time. The | · | operator refers to cardinality when applied to sets.
2.2 Problem set-up
In this section, we describe the distributed averaging problem, and specify the class of stochastic algorithms studied in this chapter.
2.2.1 Consensus matrices and stochastic updates

Consider a set of m = |V| nodes, each representing a particular sensing and processing device. We model this system as an undirected graph G = (V, E), with processors associated with nodes of the graph, and the edge set E ⊂ V × V representing pairs of processors that can communicate directly. For each node i ∈ V, we let N(i) := {j ∈ V | (i, j) ∈ E} be its neighborhood set. Suppose that each vertex i makes a real-valued measurement x(i), and consider the goal of computing the average $\bar{x} = \frac{1}{m}\sum_{i \in V} x(i)$. We assume that $|x(i)| \leq x_{\max}$ for all i ∈ V, as dictated by physical constraints of sensing. For iterations n = 0, 1, 2, …, let $\theta^n = \{\theta^n(i), i \in V\}$ represent an m-dimensional vector of estimates. Solving the distributed averaging problem amounts to having $\theta^n$ converge to $\theta^* := \bar{x}\,\vec{1}$, where $\vec{1} \in \mathbb{R}^m$ is the vector of all ones. Various algorithms for distributed averaging [Olfati-Saber et al., 2007; Boyd
Figure 2.1: Illustration of the distributed protocol. Each node j ∈ V maintains an estimate θ(j). At each round, for a fixed reference node ` ∈ V , each neighbor i ∈ N (`) sends the message F (θ(i), ξ(i, `)) along the edge i → `.
et al., 2006] are based on symmetric consensus matrices $L \in \mathbb{R}^{m \times m}$ with the properties
\begin{align}
L(i,j) \neq 0 \quad &\text{only if } (i,j) \in E, \tag{2.2.1a}\\
L\vec{1} &= \vec{0}, \quad\text{and} \tag{2.2.1b}\\
L &\succeq 0. \tag{2.2.1c}
\end{align}
The simplest example of such a matrix is the graph Laplacian, defined as follows. Let $A \in \mathbb{R}^{m \times m}$ be the adjacency matrix of the graph G, i.e. the symmetric matrix with entries
\[
A_{ij} \;=\; \begin{cases} 1 & \text{if } (i,j) \in E\\ 0 & \text{otherwise,}\end{cases} \tag{2.2.2}
\]
and let $D = \mathrm{diag}\{d_1, d_2, \ldots, d_m\}$, where $d_i := |N(i)|$ is the degree of node i. Assuming that the graph is connected (so that $d_i \geq 1$ for all i), the graph Laplacian is given by
\[
L(G) \;=\; D - A. \tag{2.2.3}
\]
Our analysis applies to the graph Laplacian, as well as to various weighted forms of graph Laplacian matrices [Chung, 1991], as long as they satisfy the properties in equation (2.2.1). For d-regular graphs, it can be shown directly that the normalized graph Laplacian satisfies these properties.
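A small numerical check of these definitions (a sketch; the ring-graph example and graph size are my own illustrative choices): the Laplacian L(G) = D − A annihilates the all-ones vector and, for a connected graph, has λ1 = 0 and λ2 > 0.

```python
import numpy as np

def graph_laplacian(edges, m):
    """Build L(G) = D - A for an undirected graph on m nodes."""
    A = np.zeros((m, m))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    D = np.diag(A.sum(axis=1))          # degrees d_i = |N(i)|
    return D - A

if __name__ == "__main__":
    m = 8
    ring = [(i, (i + 1) % m) for i in range(m)]    # cycle graph C_m
    L = graph_laplacian(ring, m)
    eigvals = np.sort(np.linalg.eigvalsh(L))
    print(np.allclose(L @ np.ones(m), 0.0))        # property L * 1 = 0
    print(eigvals[0], eigvals[1])                  # lambda_1 = 0, lambda_2 > 0
```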
Given a fixed choice of consensus matrix L, we consider the following family of updates, generating the sequence $\{\theta^n,\; n = 0, 1, 2, \ldots\}$ of m-dimensional vectors. The updates are designed to respect the neighborhood structure of the graph G, in the sense that at each iteration, the estimate $\theta^{n+1}(v_r)$ at a receiving node $v_r \in V$ is a function only of the estimates $\{\theta^n(v_t),\; v_t \in N(v_r)\}$ associated with transmitting nodes $v_t$ in the neighborhood of node $v_r$. (In fact, our analysis is easily generalized to the case where $\theta^{n+1}(v_r)$ depends only on vertices $v_t \in N'(v_r)$, where $N'(v_r)$ is a (possibly random) subset of the full neighborhood set $N(v_r)$. However, to bring our results into sharp focus, we restrict attention to the case $N'(v_r) = N(v_r)$.) In order to model noise and uncertainty in the storage and communication process, we introduce random variables $\xi(v_t, v_r)$ associated with the transmission link from $v_t$ to $v_r$; we allow for the possibility that $\xi(v_t, v_r) \neq \xi(v_r, v_t)$, since the noise structure might be asymmetric. With this set-up, we consider algorithms that generate a stochastic sequence $\{\theta^n,\; n = 0, 1, 2, \ldots\}$ in the following manner:

1. At time step n = 0, initialize $\theta^0(v) = x(v)$ for all $v \in V$.

2. For time steps n = 0, 1, 2, …, each node $v_t \in V$ computes the random variables
\[
Y^{n+1}(v_r, v_t) \;=\; \begin{cases} F(\theta^n(v_t), \xi^{n+1}(v_t, v_r)) & \text{if } (v_t, v_r) \in E\\ 0 & \text{otherwise,}\end{cases} \tag{2.2.4}
\]
where F is the communication-noise function defining the model.

3. Generate the estimate $\theta^{n+1} \in \mathbb{R}^m$ as
\[
\theta^{n+1} \;=\; \theta^n + \epsilon_n \big[ -(L \circ Y^{n+1})\,\vec{1}\, \big], \tag{2.2.5}
\]
where $\circ$ denotes the Hadamard (elementwise) product between matrices, and $\epsilon_n > 0$ is a decaying step size parameter.

See Figure 2.1 for an illustration of the message-passing update of this protocol. In this chapter, we focus on step size parameters $\epsilon_n$ that scale as $\epsilon_n = \Theta(1/n)$. On an elementwise basis, the update (2.2.5) takes the form
\[
\theta^{n+1}(v_r) \;=\; \theta^n(v_r) - \epsilon_n \Big[ L(v_r, v_r)\, F(\theta^n(v_r), \xi^{n+1}(v_r, v_r)) + \sum_{v_t \in N(v_r)} L(v_r, v_t)\, F(\theta^n(v_t), \xi^{n+1}(v_t, v_r)) \Big].
\]
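To make the update (2.2.5) concrete, the following is a minimal simulation sketch under the additive node-based noise model introduced in the next subsection; the graph, noise level, and step-size offset are illustrative choices rather than the exact experimental settings of Section 2.3.4.

```python
import numpy as np

def noisy_consensus(L, x, n_iters=20000, sigma=0.3, seed=0):
    """Run theta^{n+1} = theta^n - eps_n * (L o Y^{n+1}) 1 with node-based noise.

    Under node-based noise the message from node t at round n is
    theta^n(t) + xi^n(t), so the update reduces to
        theta^{n+1} = theta^n - eps_n * L @ (theta^n + xi^n).
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(x, dtype=float).copy()
    target = theta.mean()
    for n in range(1, n_iters + 1):
        eps = 1.0 / (n + 100)                         # decaying step size
        xi = sigma * rng.standard_normal(len(theta))  # one noise term per node
        theta = theta - eps * (L @ (theta + xi))
        if n % 5000 == 0:
            print(n, np.mean((theta - target) ** 2))  # MSE to the true average
    return theta

if __name__ == "__main__":
    m = 20
    A = np.zeros((m, m))
    for i in range(m):                                # ring graph C_m
        A[i, (i + 1) % m] = A[(i + 1) % m, i] = 1.0
    L = np.diag(A.sum(axis=1)) - A
    L = L / np.sort(np.linalg.eigvalsh(L))[1]         # rescale so lambda_2 = 1
    x0 = np.random.default_rng(1).uniform(0.0, 10.0, size=m)
    noisy_consensus(L, x0)
```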
2.2.2 Communication and noise models

It remains to specify the form of the function F that controls the communication and noise model in the local computation step in equation (2.2.4).
Noiseless real number model: The simplest model, as considered by the bulk of past work on distributed averaging, assumes noiseless communication of real numbers. This model is a special case of the update (2.2.4) with $\xi^n(v_t, v_r) = 0$, and
\[
F(\theta^n(v_t), \xi^{n+1}(v_t, v_r)) \;=\; \theta^n(v_t). \tag{2.2.6}
\]

Additive edge-based noise model (AEN): In this model, the values stored in a node can be observed without noise, so that $F(\theta^n(v_r), \xi^{n+1}(v_r, v_r)) = \theta^n(v_r)$. The term $\xi^n(v_t, v_r)$ is a zero-mean additive random noise variable associated with the transmission $v_t \to v_r$, and the communication function takes the form
\[
F(\theta^n(v_t), \xi^{n+1}(v_t, v_r)) \;=\; \theta^n(v_t) + \xi^{n+1}(v_t, v_r). \tag{2.2.7}
\]
We assume that the random variables $\xi^{n+1}(v_t, v_r)$ and $\xi^{n+1}(v_t', v_r)$ are independent for distinct edges $(v_t', v_r)$ and $(v_t, v_r)$, and identically distributed with zero mean and variance $\sigma^2 = \mathrm{Var}(\xi^{n+1}(v_t, v_r))$.

Additive node-based noise model (ANN): In this model, the function F takes the same form (2.2.7) as in the edge-based noise model. In particular, using values stored in the node itself is noise-free, so $F(\theta^n(v_r), \xi^{n+1}(v_r, v_r)) = \theta^n(v_r)$. However, the key distinction is that for each $v_t \in V$, we assume that
\[
\xi^{n+1}(v_t, v_r) \;=\; \xi^{n+1}(v_t) \qquad \text{for all } v_r \in N(v_t), \tag{2.2.8}
\]
where $\xi^{n+1}(v_t)$ is a single noise variable associated with node $v_t$, with zero mean and variance $\sigma^2 = \mathrm{Var}(\xi^n(v_t))$. Thus, the random variables $\xi^{n+1}(v_t, v_r)$ and $\xi^{n+1}(v_t, v_r')$ are identical for all edges out-going from the transmitting node $v_t$.

Bit-constrained communication (BC): Suppose that the channel from node $v_t$ to $v_r$ is bit-constrained, so that one can transmit at most B bits, and that the message is subjected to random dithering. Under these assumptions, the communication function F takes the form
\[
F(\theta(v_t), \xi(v_t, v_r)) \;=\; Q_B\big(\theta(v_t) + \xi(v_t, v_r)\big), \tag{2.2.9}
\]
where $Q_B(\cdot)$ represents the B-bit quantization function with maximum value M, and $\xi(v_t, v_r)$ is the random dithering. We assume that the random dithering is applied prior to transmission across the channel out-going from vertex $v_t$, so that $\xi(v_t, v_r) = \xi(v_t)$ is the same random variable across all neighbors $v_r \in N(v_t)$. The specification of the iteration uses $F(\theta(v_r), \xi(v_r, v_r)) = Q_B(\theta(v_r) + \xi(v_r, v_r))$, so the node value is quantized in its self-update as well. The dithering variables $\xi^n(v_t)$ are independent and bounded by $\Delta$, with variance $\sigma^2$. Initial values are bounded a priori, $\theta^0(v_t) \leq B - \Delta$. Similar quantization models were considered in Aysal et al. [Aysal et al., 2008], for consensus updates without decreasing step sizes, and in Kar et al. [Kar and Moura, 2008a], for decreasing step sizes. As explained in the introduction, our contribution is to show almost sure convergence with a different method, and to show asymptotic normality of the updates.
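The following sketch collects the three communication functions F in code form (the dither distribution, quantizer grid, and parameter values are illustrative assumptions, not specified in the text):

```python
import numpy as np

def F_noiseless(theta_t, xi=None):
    """Noiseless real-number model (2.2.6): the stored value is sent exactly."""
    return theta_t

def F_additive(theta_t, xi):
    """AEN/ANN models (2.2.7): stored value plus zero-mean additive noise.
    For AEN, xi is drawn independently for every edge; for ANN, the same xi
    is reused on every edge leaving the transmitting node."""
    return theta_t + xi

def F_bit_constrained(theta_t, xi, B=4, M=10.0):
    """BC model (2.2.9): dither first, then quantize onto a 2**B-level grid in [-M, M]."""
    levels = np.linspace(-M, M, 2 ** B)
    dithered = np.clip(theta_t + xi, -M, M)
    return levels[np.argmin(np.abs(levels - dithered))]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta = 3.14159
    print(F_noiseless(theta),
          F_additive(theta, rng.normal(0, 0.1)),
          F_bit_constrained(theta, rng.uniform(-0.5, 0.5)))
```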
2.3 Statement of results and consequences

In this section, we state our main results concerning the stochastic behavior of the sequence $\{\theta^n\}$ generated by the updates (2.2.5). We then illustrate their consequences for the specific communication and noise models described in Section 2.2.2, and conclude with a discussion of the behavior for specific graph structures.

Consider the factor $L \circ Y$ that drives the updates (2.2.5). An important element of our analysis is the conditional covariance of this update factor, denoted by $\Sigma = \Sigma_\theta$ and given by
\[
\Sigma_\theta \;:=\; \mathbb{E}\Big[ (L \circ Y)\,\vec{1}\,\vec{1}^T (L \circ Y)^T \,\Big|\, \theta \Big] - L\theta\,(L\theta)^T, \tag{2.3.1}
\]
where
\[
Y(v_r, v_t) \;=\; \begin{cases} F(\theta(v_t), \xi(v_t, v_r)) & \text{if } (v_t, v_r) \in E\\ 0 & \text{otherwise.}\end{cases} \tag{2.3.2}
\]
A little calculation shows that the (i, j)th element of this matrix is given by
\[
\Sigma_\theta(i, j) \;=\; \sum_{k,\ell=1}^{m} L(i,k)\,L(j,\ell)\; \mathbb{E}\big[ Y(i,k)Y(j,\ell) - \theta(k)\theta(\ell) \,\big|\, \theta \big]. \tag{2.3.3}
\]
Moreover, the eigenstructure of the consensus matrix L plays an important role in our analysis. Since it is symmetric and positive semidefinite, we can write
\[
L \;=\; U J U^T, \tag{2.3.4}
\]
where U is an m × m orthogonal matrix with columns defined by unit-norm eigenvectors of L, and $J := \mathrm{diag}\{\lambda_1(L), \ldots, \lambda_m(L)\}$ is a diagonal matrix of eigenvalues, with
\[
0 = \lambda_1(L) < \lambda_2(L) \leq \ldots \leq \lambda_m(L). \tag{2.3.5}
\]
It is convenient to let $\widetilde{U}$ denote the m × (m − 1) matrix with columns defined by eigenvectors associated with positive eigenvalues of L—that is, excluding the column $U_1 = \vec{1}/\|\vec{1}\|_2$ associated with the zero eigenvalue $\lambda_1(L) = 0$. With this notation, we have
\[
\widetilde{J} \;=\; \mathrm{diag}\{\lambda_2(L), \ldots, \lambda_m(L)\} \;=\; \widetilde{U}^T L\, \widetilde{U}. \tag{2.3.6}
\]
We demonstrate two theorems: strong consistency and asymptotic normality. Strong consistency for analogous models has been shown in Kar and Moura [2008a] and Kar and Moura [2008b]. The proof of the theorem in this chapter contrasts with those results by computing a differential equation driving the mean path. Our main result is the asymptotic normality of the procedure.
2.3.1 Consistency

Theorem 2.3.1(a) asserts that the sequence $\{\theta^n\}$ is a strongly consistent estimator of the average for the ANN and BC models. As opposed to weak consistency, this result guarantees that for almost any realization of the algorithm, the associated sample path converges to the exact consensus solution. For the AEN model, the solution converges to a biased estimator of the true mean.

Theorem 2.3.1. Consider the random sequence $\{\theta^n\}$ generated by the update (2.2.5) for a communication function F of the form in equation (2.2.7), equation (2.2.8) or equation (2.2.9), a consensus matrix L, and step size parameter $\epsilon_n = \Theta(1/n)$. Let $\theta^* = \bar{x}\,\vec{1}$. For all initial node value vectors $\theta^0$:

(a) The sequence $\{\theta^n\}$ is a strongly consistent estimator of $\theta^*$ under the ANN model (2.2.8), meaning that $\theta^n \to \theta^*$ almost surely (a.s.);

(b) The sequence $\{\theta^n\}$ is a biased estimator of $\theta^*$ under the AEN model (2.2.7), with $\theta^n \to \hat{\theta}^*$ almost surely (a.s.), where $\hat{\theta}^* = \theta^* + \eta$, and $\eta$ is the realization of a zero-mean random variable with variance $\gamma\sigma^2$, where
\[
\gamma \;=\; \sum_{r=1}^{\infty} \frac{1}{m}\,\frac{1}{r^2} \sum_{i=1}^{m}\sum_{j=1}^{m} L(i,j)^2;
\]

(c) The sequence $\{\theta^n\}$ is a strongly consistent estimator of $\theta^*$ under the BC model (2.2.9).

Remarks: The essential condition on the communication function used in the AEN and ANN models is that
\[
\mathbb{E}\big[ F(\theta(v_t), \xi(v_t, v_r)) \,\big|\, \theta(v_t) \big] \;=\; \theta(v_t). \tag{2.3.7}
\]
A more general condition for the communication function is that
\[
\mathbb{E}\big[ F(\theta(v_t), \xi(v_t, v_r)) \,\big|\, \theta(v_t) \big] \;=\; r(\theta(v_t)), \tag{2.3.8}
\]
where r is a function such that the ODE $\dot{\theta} = -L\, r(\theta)$ has $\theta = \theta^*\vec{1}$ as the unique asymptotically stable equilibrium. One simple sufficient condition is that r be a monotonically nondecreasing function. The assertions of the theorem continue to hold, except that the matrix $J = r'(\theta^*)\, L$, so the asymptotic variance is scaled by $1/r'(\theta^*)$. One could perhaps choose functions r that accelerate the mean-behavior convergence and reduce the asymptotic variance of the method. Notice that a large $r'(\theta^*)$ factor is helpful in such a situation, but this also implies that more power is being used by the system. If power is constrained (e.g. $r'(\theta^*) = 1$), the only gain is in mean behavior.
2.3.2 Asymptotic normality

Theorem 2.3.2 establishes that for appropriate choices of consensus matrices, the rate of MSE convergence is of order 1/n for the ANN and BC models, since the $\sqrt{n}$-rescaled error converges to a non-degenerate Gaussian limit. For the AEN model, the asymptotic normality is observed in an (m − 1)-dimensional subspace, since the estimation is biased. Such a rate is to be expected in the presence of sufficient noise, since the number of observations received by any given node (and hence the inverse variance of the estimate) scales as n. The solution of the Lyapunov equation (2.3.11) specifies the precise form of this asymptotic covariance, which (as we will see) depends on the graph structure.

Theorem 2.3.2. Consider the random sequence $\{\theta^n\}$ generated by the update (2.2.5) for a communication function F of the form in equation (2.2.7), equation (2.2.8) or equation (2.2.9), a consensus matrix L, and step size parameter $\epsilon_n = \Theta(1/n)$. If the second smallest eigenvalue of the consensus matrix L satisfies $\lambda_2(L) > 1/2$, then

(a) For the ANN model (2.2.8) and BC model (2.2.9):
\[
\sqrt{n}\,(\theta^n - \theta^*) \;\xrightarrow{d}\; N\left( 0,\; U \begin{bmatrix} 0 & 0\\ 0 & \widetilde{P}\end{bmatrix} U^T \right), \tag{2.3.9}
\]

(b) For the AEN model (2.2.7):
\[
\sqrt{n}\,\widetilde{U}\widetilde{U}^T(\theta^n - \theta^*) \;\xrightarrow{d}\; N\left( 0,\; U \begin{bmatrix} 0 & 0\\ 0 & \widetilde{P}\end{bmatrix} U^T \right), \tag{2.3.10}
\]
where the (m − 1) × (m − 1) matrix $\widetilde{P}$ is the solution of the continuous-time Lyapunov equation
\[
\Big(\widetilde{J} - \tfrac{I}{2}\Big)\widetilde{P} + \widetilde{P}\Big(\widetilde{J} - \tfrac{I}{2}\Big)^T \;=\; \widetilde{\Sigma}_{\theta^*}, \tag{2.3.11}
\]
where $\widetilde{J}$ is the diagonal matrix (2.3.6), and $\widetilde{\Sigma}_{\theta^*} = \widetilde{U}^T \Sigma_{\theta^*} \widetilde{U}$ is the transformed version of the conditional covariance (2.3.1) evaluated at $\theta = \theta^* = \bar{x}\,\vec{1}$.

Theorem 2.3.2 makes some more specific predictions for different communication models, as we describe here. Under the same conditions as Theorem 2.3.2, we define the average
mean-squared error as
\[
\mathrm{AMSE}(L; \theta^*) \;:=\; \frac{1}{m}\,\mathrm{trace}\big(\widetilde{P}(\theta^*)\big), \tag{2.3.12}
\]
corresponding to the asymptotic error variance, averaged over nodes of the graph. For the AEN model, it captures the asymptotic distance to the consensus subspace.

Corollary 2.3.1 (Asymptotic MSE for different communication models). Given a consensus matrix L with second-smallest eigenvalue $\lambda_2(L) > \frac{1}{2}$, the sequence $\{\theta^n\}$ is a strongly consistent estimator of the average $\theta^*$, with asymptotic MSE characterized as follows:

(a) For the additive edge-based noise (AEN) model (2.2.7):
\[
\mathrm{AMSE}(L; \theta^*) \;\leq\; \frac{\sigma^2}{m} \sum_{i=2}^{m} \frac{\max_{j=1,\ldots,m} \sum_{k \neq j} L^2(j,k)}{2\lambda_i(L) - 1}. \tag{2.3.13}
\]

(b) For the additive node-based noise (ANN) model (2.2.8) and the bit-constrained (BC) model (2.2.9):
\[
\mathrm{AMSE}(L; \theta^*) \;=\; \frac{\sigma^2}{m} \sum_{i=2}^{m} \frac{[\lambda_i(L)]^2}{2\lambda_i(L) - 1}, \tag{2.3.14}
\]
where the variance term $\sigma^2$ is given by the quantization noise $\mathbb{E}\big[ Q_B(\theta + \xi)^2 - \theta^2 \,\big|\, \theta \big]$ for the BC model, and by the noise variance $\mathrm{Var}(\xi(i))$ for the ANN model.

Proof. The essential ingredient controlling the asymptotic MSE is the conditional covariance matrix $\Sigma_{\theta^*}$, which specifies $\widetilde{P}$ via the Lyapunov equation (2.3.11). For analyzing model AEN, it is useful to first establish the following auxiliary result. For each i = 1, …, m − 1, we have
\[
\widetilde{P}_{ii} \;\leq\; \frac{|||\Sigma_{\theta^*}|||_2}{2\lambda_{i+1}(L) - 1}, \tag{2.3.15}
\]
where $|||\Sigma_{\theta^*}|||_2 = |||\Sigma|||_2$ is the spectral norm (the maximum eigenvalue, for a positive semidefinite symmetric matrix). To see this fact, note that
\[
\widetilde{U}^T \Sigma\, \widetilde{U} \;\preceq\; \widetilde{U}^T \big[ |||\Sigma|||_2\, I \big] \widetilde{U} \;=\; |||\Sigma|||_2\, I.
\]
Since $\widetilde{P}$ satisfies the Lyapunov equation, we have
\[
\Big(\widetilde{J} - \tfrac{I}{2}\Big)\widetilde{P} + \widetilde{P}\Big(\widetilde{J} - \tfrac{I}{2}\Big)^T \;\preceq\; |||\Sigma|||_2\, I.
\]
Note that the diagonal entries of the matrix $(\widetilde{J} - \tfrac{I}{2})\widetilde{P} + \widetilde{P}(\widetilde{J} - \tfrac{I}{2})^T$ are of the form $(2\lambda_{i+1} - 1)\widetilde{P}_{ii}$. The difference between the RHS and LHS matrices constitutes a positive semidefinite matrix, which must have a non-negative diagonal, implying the claimed inequality (2.3.15). In order to use the bound (2.3.15), it remains to compute or upper bound the spectral norm $|||\Sigma|||_2$, which is most easily done using the elementwise representation (2.3.3).

(a) For the AEN model (2.2.7), we have
\[
\mathbb{E}\big[ Y(i,k)Y(j,\ell) - \theta(k)\theta(\ell) \,\big|\, \theta \big] \;=\; \mathbb{E}\big[ \xi(i,k)\,\xi(j,\ell) \big]. \tag{2.3.16}
\]
Since we have assumed that the random variables $\xi(i,k)$ on each edge (i, k) are i.i.d., with zero mean and variance $\sigma^2$, we have
\[
\mathbb{E}\big[ Y(i,k)Y(j,\ell) - \theta(k)\theta(\ell) \,\big|\, \theta \big] \;=\; \begin{cases} \sigma^2 & \text{if } (i,k) = (j,\ell) \text{ and } i \neq k\\ 0 & \text{otherwise.}\end{cases}
\]
Consequently, from the elementwise expression (2.3.3), we conclude that $\Sigma$ is diagonal, with entries
\[
\Sigma(j,j) \;=\; \sigma^2 \sum_{k \neq j} L^2(k,j),
\]
so that $|||\Sigma|||_2 = \sigma^2 \max_{j=1,\ldots,m} \sum_{k \neq j} L^2_{jk}$, which establishes the claim (2.3.13).

(b) For the BC model (2.2.9), we have
\[
\mathbb{E}\big[ Y(i,k)Y(j,\ell) - \theta(k)\theta(\ell) \,\big|\, \theta \big] \;=\; \begin{cases} \sigma^2_{\mathrm{qnt}} & \text{if } i = j \text{ and } k = \ell\\ 0 & \text{otherwise,}\end{cases} \tag{2.3.17}
\]
where $\sigma^2_{\mathrm{qnt}} := \mathbb{E}\big[ Q_B(\theta + \xi)^2 - \theta^2 \,\big|\, \theta \big]$ is the quantization noise.
Therefore, we have $\Sigma(\theta^*) = \sigma^2_{\mathrm{qnt}}\, L^2$, and using the fact that $\widetilde{U}$ consists of eigenvectors of L (and hence also of $L^2$), the Lyapunov equation (2.3.11) takes the form
\[
\Big(\widetilde{J} - \tfrac{I}{2}\Big)\widetilde{P} + \widetilde{P}\Big(\widetilde{J} - \tfrac{I}{2}\Big)^T \;=\; \sigma^2_{\mathrm{qnt}}\, (\widetilde{J})^2,
\]
which has the explicit diagonal solution $\widetilde{P}$ with entries $\widetilde{P}_{ii} = \frac{\sigma^2_{\mathrm{qnt}}\, \lambda^2_{i+1}(L)}{2\lambda_{i+1}(L) - 1}$. Computing the asymptotic MSE $\frac{1}{m}\sum_{i=1}^{m-1} \widetilde{P}_{ii}$ yields the claim (2.3.14). The proof of the same claim for the ANN model is analogous.
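As a numerical companion to Corollary 2.3.1 (a sketch; the ring graph and noise level are illustrative choices), the exact expression (2.3.14) for the ANN/BC case can be evaluated directly from the spectrum of the consensus matrix:

```python
import numpy as np

def amse_ann(L, sigma2):
    """Asymptotic MSE (2.3.14) for the ANN/BC models:
       (sigma^2 / m) * sum_{i >= 2} lambda_i^2 / (2*lambda_i - 1),
    valid when lambda_2(L) > 1/2."""
    m = L.shape[0]
    lam = np.sort(np.linalg.eigvalsh(L))[1:]    # drop lambda_1 = 0
    assert lam[0] > 0.5, "requires lambda_2(L) > 1/2"
    return sigma2 / m * np.sum(lam ** 2 / (2.0 * lam - 1.0))

if __name__ == "__main__":
    m = 16
    A = np.zeros((m, m))
    for i in range(m):                          # ring graph C_m
        A[i, (i + 1) % m] = A[(i + 1) % m, i] = 1.0
    L = np.diag(A.sum(axis=1)) - A
    R = L / np.sort(np.linalg.eigvalsh(L))[1]   # rescaled Laplacian (2.3.18)
    print(amse_ann(R, sigma2=0.1))
```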
2.3.3 Scaling and graph topology

We can obtain further insight by considering how the bounds of Corollary 2.3.1 behave as a function of the graph topology and consensus matrix L. For a fixed graph G, consider the graph Laplacian L(G) defined in equation (2.2.3). It is easy to see that L(G) is always positive semidefinite, with minimal eigenvalue $\lambda_1(L(G)) = 0$, corresponding to the constant vector. For a connected graph, the second smallest eigenvalue of L(G) is strictly positive [Chung, 1991]. Therefore, given an undirected graph G that is connected, the most straightforward manner in which to obtain a consensus matrix L satisfying the conditions of Corollary 2.3.1 is to rescale the graph Laplacian L(G), as defined in equation (2.2.3), by its second smallest eigenvalue $\lambda_2(L(G))$, thereby forming the rescaled consensus matrix
\[
R(G) \;:=\; \frac{1}{\lambda_2(L(G))}\, L(G), \tag{2.3.18}
\]
with $\lambda_2(R(G)) = 1 > \frac{1}{2}$. With this choice of consensus matrix, let us consider the implications of Corollary 2.3.1(b), in application to the additive node-based noise (ANN) model, for various graphs. Define the normalized trace of the Laplacian
\[
\alpha(L(G)) \;:=\; \frac{\mathrm{trace}(L(G))}{m}. \tag{2.3.19}
\]
For a normalized graph Laplacian, we have $\alpha(L(G)) = 1$. Otherwise, for any (unnormalized) graph Laplacian, we have $\alpha(L(G)) \leq (m - 1)$ and $\alpha(L(G)) \geq \lambda_2(L(G))(m - 1)/m$. Finally, for graphs with bounded degree d, we have $\alpha(L(G)) \leq d$. (See Appendix 2.7.1 for proofs of these properties.) With this definition, we have the following simple lemma,
showing that, up to constants, the scaling behavior of the asymptotic MSE is controlled by the second smallest eigenvalue $\lambda_2(L(G))$.

Lemma 2.3.1. For any connected graph G, using the rescaled Laplacian consensus matrix (2.3.18), the asymptotic MSE for the ANN model (2.2.8) satisfies the bounds
\[
\frac{\sigma^2\, \alpha(L(G))}{2\lambda_2(L(G))} \;\leq\; \mathrm{AMSE}(R(G); \theta^*) \;\leq\; \frac{\sigma^2\, \alpha(L(G))}{\lambda_2(L(G))}, \tag{2.3.20}
\]
where $\lambda_2(L(G))$ is the second smallest eigenvalue of the graph Laplacian.

We provide the proof of this claim in Appendix 2.7.1. Combined with known results from spectral graph theory [Chung, 1991], Lemma 2.3.1 allows us to make specific predictions about the number of iterations required, for a given graph topology of a given size m, to reduce the asymptotic MSE to any δ > 0. Recall that Theorem 2.3.1(b) guarantees that the asymptotic MSE per node scales as $\frac{1}{n}\,\mathrm{AMSE}(R(G); \theta^*)$. Using this fact and Lemma 2.3.1, we have
\[
n \;=\; \Theta\left( \frac{\sigma^2 d}{\lambda_2(L(G))}\, \frac{1}{\delta} \right), \tag{2.3.21}
\]
for a graph with maximum degree d. If a normalized Laplacian is used in the update, we have the same scaling of n, but with the term d rescaled to 1. It is interesting to note that this scaling (2.3.21) is similar to, but different from, the scaling of noiseless updates [Boyd et al., 2006; Dimakis et al., 2008], where the MSE is (with high probability) upper bounded by δ for $n = \Theta\big( \frac{\log(1/\delta)}{-\log(1 - \lambda_2(L(G)))} \big)$. When the spectral gap $\lambda_2(L(G))$ shrinks as the graph size grows, then this scaling can be expressed as
\[
n \;=\; \Theta\left( \frac{\log(1/\delta)}{\lambda_2(L(G))} \right). \tag{2.3.22}
\]
Therefore, we pay a price for noise in the updates (the factor log(1/δ) versus 1/δ), but the graph topology enters the bounds in the same way—namely, in the form of the spectral gap λ2 (L(G)).
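The spectral-gap scaling behind (2.3.21) and (2.3.22) is easy to check numerically (a sketch; the torus version of the lattice and the specific sizes are my own choices): the predicted iteration count grows like 1/λ2(L(G)), quadratically for rings and roughly linearly for lattices.

```python
import numpy as np

def lambda2(A):
    """Second-smallest eigenvalue of the Laplacian D - A."""
    L = np.diag(A.sum(axis=1)) - A
    return np.sort(np.linalg.eigvalsh(L))[1]

def ring(m):
    A = np.zeros((m, m))
    for i in range(m):
        A[i, (i + 1) % m] = A[(i + 1) % m, i] = 1.0
    return A

def lattice(side):
    """Four nearest-neighbor lattice with wraparound (torus) on side*side nodes."""
    m = side * side
    A = np.zeros((m, m))
    for r in range(side):
        for c in range(side):
            i = r * side + c
            for j in (r * side + (c + 1) % side, ((r + 1) % side) * side + c):
                A[i, j] = A[j, i] = 1.0
    return A

if __name__ == "__main__":
    for m in (16, 36, 64, 100):
        side = int(round(m ** 0.5))
        # iterations to fixed MSE scale like 1/lambda_2 (up to constants)
        print(m, 1.0 / lambda2(ring(m)), 1.0 / lambda2(lattice(side)))
```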
2.3.4 Illustrative simulations

We illustrate the predicted scaling (2.3.21) and the role of the Laplacian eigenspectrum via simulations on different classes of graphs. For all experiments reported here, we set the step size parameter $\epsilon_n = \frac{1}{n + 100}$. The additive offset serves to ensure stability of the updates in very early rounds, due to the possibly large gain specified by the rescaled Laplacian (2.3.18). We performed experiments for a range of graph sizes, for the additive node noise (ANN) model (2.2.8), with noise variance $\sigma^2 = 0.1$ in all cases. For each graph size m, we measured the number of iterations n required to reach a fixed level δ of mean-squared error.

Cycle graph: Consider the ring graph $C_m$ on m vertices, as illustrated in Figure 2.2(a). For this example, we use the normalized Laplacian, since all nodes have the same degree, and consequently we have $\alpha(L(G)) = 1$. Panel (b) provides a log-log plot of the MSE versus the iteration number n; each trace corresponds to a particular sample path. Notice how the MSE over each sample converges to zero. Moreover, since Theorem 2.3.1 predicts that the MSE should drop off as 1/n, the linear rate shown in this log-log plot is consistent with the theory. Figure 2.2(c) plots the number of iterations (vertical axis) required to achieve a given constant MSE versus the size of the ring graph (horizontal axis). For the ring graph, it can be shown (see Chung [Chung, 1991]) that the second smallest eigenvalue scales as $\lambda_2(L(C_m)) = \Theta(1/m^2)$, which implies that the number of iterations to achieve a fixed MSE for a ring graph with m vertices should scale as $n = \Theta(m^2)$. Consistent with this prediction, the plot in Figure 2.2(c) shows a quadratic scaling; in particular, note the excellent agreement between the theoretical prediction and the data.

Lattice model: Figure 2.3(a) shows the two-dimensional four nearest-neighbor lattice graph with m vertices, denoted $F_m$. For this example, we also use the normalized graph Laplacian, so $\alpha(L(G)) = 1$. Again, panel (b) corresponds to a log-log plot of the MSE versus the iteration number n, with each trace corresponding to a particular sample path, again showing a linear rate of convergence to zero. Panel (c) shows the number of iterations required to achieve a constant MSE as a function of the graph size. For the lattice, it is known [Chung, 1991] that $\lambda_2(L(F_m)) = \Theta(1/m)$, which implies that the critical number of iterations should scale as $n = \Theta(m)$. Note that panel (c) shows linear scaling, again consistent with the theory.
Figure 2.2: Comparison of empirical simulations to theoretical predictions for the ring graph in panel (a). (b) Sample path plots of log MSE versus log iteration number: as predicted by the theory, the log MSE scales linearly with log iterations. (c) Plot of number of iterations (vertical axis) required to reach a fixed level of MSE versus the graph size (horizontal axis). For the ring graph, this quantity scales quadratically in the graph size, consistent with Corollary 2.3.1. Solid line shows theoretical prediction.
Expander graphs: Consider a bipartite graph $G = (V_1, V_2, E)$, with $m = |V_1| + |V_2|$ vertices and edges joining only vertices in $V_1$ to those in $V_2$, and constant degree d; see Figure 2.4(a) for an illustration with d = 3. A bipartite graph of this form is an expander [Alon, 1986; Alon and Spencer, 2000; Chung, 1991] with parameters $\alpha, \delta \in (0, 1)$ if, for all subsets $S \subset V_1$ of size $|S| \leq \alpha|V_1|$, the neighborhood set of S—namely, the subset
\[
N(S) \;:=\; \{ t \in V_2 \mid (s, t) \in E \text{ for some } s \in S \},
\]
has cardinality $|N(S)| \geq \delta d |S|$. Intuitively, this property guarantees that each subset of $V_1$, up to some critical size, “expands” to a relatively large number of neighbors in $V_2$. (Note that the maximum size of $|N(S)|$ is $d|S|$, so that δ close to 1 guarantees that the neighborhood size is close to its maximum, for all possible subsets S.) Expander graphs have a number of interesting theoretical properties, including the property that $\lambda_2(L(K_m)) = \Theta(1)$—that is, a bounded spectral gap [Alon, 1986; Chung, 1991]. In order to investigate the behavior of our algorithm for expanders, we construct a random bipartite graph as follows: for an even number of nodes m, we split them into two subsets $V_i$, i = 1, 2, each of size m/2. We then fix a degree d, construct a random matching on $d\,m/2$ nodes, and use it to connect the vertices in $V_1$ to those in $V_2$. This procedure forms
Figure 2.3: Comparison of empirical simulations to theoretical predictions for the four nearest-neighbor lattice (panel (a)). (b) Sample path plots of log MSE versus log iteration number: as predicted by the theory, the log MSE scales linearly with log iterations. (c) Plot of number of iterations (vertical axis) required to reach a fixed level of MSE versus the graph size (horizontal axis). For the lattice graph, this quantity scales linearly in the graph size, consistent with Corollary 2.3.1. Solid line shows theoretical prediction.
a random bipartite d-regular graph; using the probabilistic method, it can be shown to be an edge-expander with probability 1 − o(1) as the graph size tends to infinity [Alon, 1986; Feldman et al., 2007]. Since the graph is d-regular, the normalized Laplacian can be used for consensus, and $\alpha(L(G)) = 1$. Given the constant spectral gap $\lambda_2(L(K_m)) = \Theta(1)$, the scaling in the number of iterations to achieve constant MSE is $n = \Theta(1)$. This theoretical prediction is compared to simulation results in Figure 2.4; note how the number of iterations soon settles down to a constant, as predicted by the theory.
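The random construction just described can be sketched as follows (a simplification of the text's matching construction: here d independent random perfect matchings between V1 and V2 are superposed, and any repeated edges are merged):

```python
import numpy as np

def random_bipartite_regular(m, d=3, seed=0):
    """Random bipartite (approximately) d-regular graph on m nodes.

    V1 = {0,...,m/2-1}, V2 = {m/2,...,m-1}; each of d random matchings
    pairs V1 with a permutation of V2.  Returns the adjacency matrix.
    """
    assert m % 2 == 0
    rng = np.random.default_rng(seed)
    half = m // 2
    A = np.zeros((m, m))
    for _ in range(d):
        perm = rng.permutation(half)
        for i in range(half):
            j = half + perm[i]
            A[i, j] = A[j, i] = 1.0       # repeated edges are merged
    return A

if __name__ == "__main__":
    for m in (20, 80, 320):
        A = random_bipartite_regular(m, d=3)
        L = np.diag(A.sum(axis=1)) - A
        # spectral gap stays roughly constant as m grows
        print(m, np.sort(np.linalg.eigvalsh(L))[1])
```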
2.4
Proof of Theorem 2.3.1
We now turn to the proof of Theorem 2.3.1. The basic idea is to relate the behavior of the stochastic recursion (2.2.5) to an ordinary differential equation (ODE), and then use the ODE method [Ljung, 1977] to analyze its properties. The ODE involves a function t 7→ θt ∈ Rm , with its specific structure depending on the communication and noise model under consideration. For the AEN and ANN models, the relevant ODE is given by dθt dt
= −Lθt .
(2.4.1)
25
(a)
(b)
(c)
Figure 2.4: Comparison of empirical simulations to theoretical predictions for the bipartite expander graph in panel (a). (b) Sample path plots of log MSE versus log iteration number: as predicted by the theory, the log MSE scales linearly with log iterations. (c) Plot of number of iterations (vertical axis) required to reach a fixed level of MSE versus the graph size (horizontal axis). For an expander, this quantity remains essentially constant with the graph size, consistent with Corollary 2.3.1. Solid line shows theoretical prediction.
For the BC model, the approximating ODE is given by
dθt = −L CM (θt ) dt
u with CM (u) : = −M +M
if |u| < M if u ≤ −M
(2.4.2)
if u ≥ +M .
In both cases, the ODE must satisfy the initial condition θ0 (v) = x(v). A combined Lyapunov-ODE method Theorem can be used to prove this approximation.
2.4.1
Proof of Theorem 2.3.1(a) and (b)
We start with statement (a). Consider a reduced form of the iteration by projecting the error to the orthonormal eigenvector basis of L = U JU T , and using in the ANN model ξ n+1 (vt , vr ) = ξ n+1 (vt ): h i U T (θn+1 − θ∗ ) = U T (θn − θ∗ ) + n U T −L (θn − θ∗ ) − L ξ n+1 ~1 , = en + n −Jen − U T L ξ n+1 , = en + n −Jen − JU T ξ n+1 .
(2.4.3)
26 If we take J(1, 1) = 0, and set the corresponding normalized eigenvector U1 = ~1/k~1k2 in U , we find that en+1 (1) = en (1) = e1 (1) = 0. The remaining vector e˜ satisfies the iteration: h i e˜n+1 = e˜n + n −Jee˜n − Jeξ˜n+1
(2.4.4)
where ξ˜ is a random variable independent over n, defined as the last m − 1 elements of e s) = U T ξ. The variance is σ 2 , the same for ξ, as the basis is orthonormal. Notice that J(s, λr (L) > 0 (denote by λs ). It can be directly written for each element: e˜n (s) =
n Y
(1 − r λs ) e˜0 (s) −
r=1
n Y n X
(1 − r λs ) k λs ξ˜k+1 (s)
(2.4.5)
k=1 r=k
Let rs ≥ λr be the first such r. Then as n → ∞: rY s −1
(1 − r λs )
n Y
(1 − r λs ) ≤
r=rs
r=1
rY s −1 r=1
n X λs (1 − r λs ) exp − r r=r
(
) → 0,
s
so the deterministic initial condition vanishes. The second term has expected value zero. Without loss of generality, consider the Laplacian to be normalized, λr < 1, else a so and using independence and , the variance is λ2s σ 2
n X k=1
2k
n Y
2
(1 − r λs ) ≤
λ2s σ 2
"r −1 r −1 s s Y X k=1 r=k
r=k
+
n X k=rs
2k
n Y
n X λs (1 − r λs ) exp −2 r r=rs
(
2
) +
(1 − r λs )2 → 0.
r=k
The first term in the sum converges to zero as n → ∞ using a similar argument as before. The second term can also be verified to converge to 0. Therefore, we have shown that e˜n → 0 a.s., and thus U T (θn+1 − θ∗ ) → 0 a.s. Since U is an orthonormal matrix, θn+1 → θ∗ a.s. For part (b), the above analysis holds with some important adjustments. The noise P 2 L ξ n+1 ~1 has zero mean, the variance for term i is σ(i) = j L(i, j) and the noise is independent for each i. Regarding the projection, equation (2.4.3), it applies, but the e T L ξ n+1 ~1 projection U T L ξ n+1 ~1 should be divided into two parts. The first part, U has bounded variance and the argument is analogous for part (a), with obvious changes.
27 √ The main difference is that now (~1T / m) L ξ n+1 ~1 does not project to zero. Instead we obtain an iteration en+1 (1) = en (1) + n ξ˜n+1 (1) P where ξ˜n+1 (1) has the variance σ 2 i,j L(i, j)2 /m. Also, e1 (1) = 0, so n
e (1) =
n X
k ξ˜k+1 (1).
k=1
The proof concludes by noting that the above variable is zero mean and bounded variance, and is identified as ξ.
2.4.2
Proof of Theorem 2.3.1(c)
For the BC model, the analysis is somewhat more involved, since the quantization function saturates the output at ±M . We start the proof as before by projecting the iteration using U T , and a single noise vector since we used quantized updates. In the direction of the √ eigenvector U1 = ~1/k~1k2 , γn+1 = U1T θn+1 = U1T θn = U1T θ0 = m θ∗ . Let The remaining vector: e T θn+1 = U e T θn + n U e T −L QB (θn + ξ n+1 ) , so U e T QB (θ∗~1 + U e θ˜n + ξ n+1 ), θ˜n+1 = θ˜n − n JeU √ since θn = U θ˜n , and θ˜n (1) = mθ∗ , but U1 = ~1/k~1k2 . We are now interested in showing that θ˜n → 0 w.p. 1. We use the following well-known perturbed Lyapunov argument from Kushner and Yin (2003, pag. 145, Theorem 1), with a proof presented in the appendix. This theorem has been the essential argument in most of the noisy consensus approaches till date, but without being explicitly identified. Theorem 2.4.1. Let θn+1 = θn + n Yn . Let V (·) be a real-valued function such that: A1 V (·) ≥ 0 is continuous, twice continuously differentiable with bounded mixed second partial derivatives, V (0) = 0 and E[V (θ0 )] < ∞. A2 For each > 0, let there be δ > 0 : V (θ) ≥ δ for |θ| ≥ , and δ does not decrease as increases.
28 A3 Let g¯(θn ) = E[Yn |θ0 , Y1 , ..., Yn−1 ] and f (θ) = −Vθ (θ)T g¯(θ). For each > 0, there is δ1, such that −f (θ) ≤ −δ1, for |θ| ≥ . A4 E[||Yn ||22 |θ0 , Y1 , ..., Yn−1 ] ≤ K2 f (θn ) when |θn | ≥ K0 , and
E
"∞ X
# 2r |Yr |2 I(|θr |
≤ K0 ) < ∞.
r=1
Then θn → 0 w.p.1. Essentially the theorem requires the existence of a global Lyapunov function for the reduced iteration, with a at most quadratic growth rate. We propose the natural function ˜ = θ˜T θ. ˜ Since θ0 is bounded, condition A1 is satisfied for V . Condition A2 is satisfied V (θ) e T CM (θ∗~1 + U e θ˜n ), where the saturation by setting δ = m2 . In our case g¯(θ˜n ) = −JeU ˜ = 2θ˜ is the gradient, so function is applied element-wise. Vθ˜(θ) ˜ = 2θ˜T (JeU ˜ e T CM (θ∗~1 + U e θ)). f (θ) e θ˜ < M , To show A3 we consider three choices for the above quantity. First, if −M < θ∗~1+ U ˜ = 2θ˜T Jeθ, ˜ which satisfies A3, e T is orthogonal to ~1 vector, we have that f (θ) then since U ˜ e T CM (θ∗~1 + U e θ), since Je is a strictly positive diagonal matrix. In fact, by setting a ˜ =U ˜ = 2θ˜T Jea ˜ > . From the definition f (θ) ˜, so the only possibility to violate A3 is if a ˜ = 0 for |θ| ˜ = α~1. e , it is clear that the only possibility of obtaining the zero vector is if CM (θ∗~1+ U e θ) of U e θ˜ < M (already considered), Only three possibilities satisfy this requirement −M < θ∗~1 + U e θ˜ ≥ M ~1 and θ∗~1 + U e θ˜ ≤ −M ~1. Let us consider the case θ∗~1 + U e θ˜ ≥ M ~1, in which θ∗~1 + U e θ˜ ≥ (M − θ∗ )~1 (element wise), and case α = M satisfies the requirement. So if there is a U θ˜ 6= 0, A3 is violated. But, projecting summing the inequalities (i.e projecting on the space e θ˜ ≥ (M − θ∗ )m, but ~1T U e θ˜ = 0, so we of ~1) should still satisfy the vector inequality, ~1T U e r)2 and K0 = 2σ 2 . A4 can have a contradiction. Thus A3 is satisfied. Set K2 = maxr J(r, be directly verified using e θ˜n + ξ n+1 )T U e (J) e 2U e T QB (θ∗~1 + U e θ˜n + ξ n+1 )|·], E[||Yn ||2 |θ0 , Y1 , ..., Yn−1 ] = E[QB (θ∗~1 + U e θ˜n + e θ˜n + ξ n+1 ) − 2∆ ≤ QB (θ∗~1 + U e θ˜n + ξ n+1 ) ≤ θ∗~1 + U the fact |ξ n+1 | ≤ ∆ and θ∗~1 + U ξ n+1 + 2∆, where 2∆ is the quantization step size.
29
2.4.3
ODE method for mean paths
The mean path behavior for (a) and (b) can be inferred from the ODE method. The following result connects the discrete-time stochastic process {θn } to the deterministic ODE solution: Lemma 2.4.1. The ODEs (2.4.1) and (2.4.2) with initial condition θ0 (v) = x(v) each have θ∗ = x~1 as their unique stable fixed point. Moreover, for all δ > 0, we have n = 0, P lim sup kθ − θtn k > δ n→∞
for tn =
Pn
1 k=1 k ,
(2.4.6)
which implies that θn → θ∗ almost surely. Remark: Equation (2.4.6) relates the stochastic path θn of the algorithm to the deterministic path θtn , resulting from sampling the ODE solution θt at times tn . Since θt → θ∗ , the amost sure convergence of θn to θ∗ follows from path convergence. Proof. For item (a), we prove this lemma by using the ODE method and stochastic approximation— in particular, Theorem 1 from Kushner and Yin [Kushner and Yin, 1997], which connects stochastic recursions of the form (2.2.5) to the ordinary differential equation dθt /dt = Eξ [− (L Y (θt , ξ)) | θt ]. Using the definition of Y in terms of F , for the AEN and ANN models, we have Eξ [F (θ(v), ξ(v, vr )) | θ(v)] = θ(v), from which we conclude that with the stepsize choice n = Θ(1/n), we have Eξ [− (L Y (θt , ξ)) | θt ] = −Lθt . By our assumptions on the eigenstructure of L, the system dθt /dt = −L θt is globally asymptotically stable, with a line of fixed points {θ ∈ Rm | Lθ = 0}. Given the initial condition θ0 (v) = x(v), we conclude that θ∗ = x~1 is the unique asymptotically fixed point of the ODE, so that the claim (2.4.6) follows from Kushner and Yin [Kushner and Yin, 1997].
30 For item (c), We start the proof as before by projecting the iteration using U T : U T θn+1 = U T θn + n U T −L g(θn + ξ n+1 ) ,
For the dithered quantization model (2.2.9), we have Eξ [− (L Y (θt , ξ)) | θt ] = −L CM (θt ), where CM (·) is the saturation function (2.4.2). We now claim that θ∗ is also the unique asymptotically stable fixed point of the ODE dθt /dt = −L CM (θt ) subject to the initial condition θ0 (v) = x(v). Consider the eigendecomposition L = U JU T , where J = diag{0, λ2 (L), . . . , λm (L)}. Define the rotated variable γt : = U T θt , so that the ODE (2.4.2) can be re-written as dγt (1)/dt = 0
(2.4.7a)
dγt (k)/dt = −λk (L)UkT CM (U γt ),
for k = 2, . . . , m,
(2.4.7b)
where Uk denotes the k th column of U . Note that U1 = ~1/k~1k2 , since it is associated with the eigenvalue λ1 (L) = 0. Consequently, the solution to equation (2.4.7a) takes the form γt (1) = U1T θ0 = with unique fixed point γ ∗ (1) =
√
m x, where x : =
√
m x,
1 m
Pm
i=1 x(i)
(2.4.8) is the average value,
A fixed point γ ∗ ∈ Rm for equations (2.4.7b) requires that UkT CM (U γ ∗ ) = 0, for k = 2, . . . , m. Given that the columns of U form an orthogonal basis, this implies that CM (U γ ∗ ) = α~1 for some constant α ∈ R, or equivalently (given the connection U γ ∗ = θ∗ ) CM (θ∗ ) = α~1.
(2.4.9)
Given the piecewise linear nature of the saturation function, this equality implies either that the fixed point satisfies the elementwise inequality θ∗ > M (if α = M ); or the elementwise inequality θ∗ < −M (if α = −M ); or as the final option, the θ∗ = α when α ∈ (−M, +M ).
31 But from equation (2.4.8), we know that γ ∗ (1) = have γ ∗ (1) =
~1T √ m
√
m x ∈ [−M
√
√ m, +M m]. But we also
θ∗ by definition, so that putting together the pieces yields
−M
1 2
by assumption, the asymptotic normality (2.5.1) applies to this reduced
iteration, so that we can conclude that √
d n (β n − β ∗ ) → N (0, Pe)
where Pe solves the Lyapunov equation I e e e I T e eT Σ ∗U e. J− P +P J − = U θ 2 2 We conclude by noting that the asymptotic covariance of θn is related to that of β n by the relation
P
0 0 U, = UT 0 Pe
(2.5.7)
33 from which Theorem 2.3.1(b) follows.
2.6
Discussion
In this chapter, we analyzed the convergence and asymptotic behavior of distributed averaging algorithms on graphs with general noise models. Using suitably damped updates, we showed that it is possible to obtain exact consensus, as opposed to approximate or near consensus, even in the presence of noise. We guaranteed almost sure convergence of our algorithms under fairly general conditions, and moreover, under suitable stability conditions, we showed that the error is asymptotically normal, with a covariance matrix that can be predicted from the structure of the underlying graph. We provided a number of simulations that illustrate the sharpness of these theoretical predictions. One interesting consequence is that in the presence of noise, the number of iterations required to achieve log(1/δ) 1 σ2 d . This rate should be contrasted with the rate Θ an error of δ is Θ λ2 (L(G)) δ λ2 (L(G)) achievable by standard gossip algorithms under perfect (noiseless) communication. This comparison shows that there is some loss in convergence rates due to to noisiness, but the influence of the graph structure—namely, via the spectral gap λ2 (L(G))—is similar. Finally, although the current chapter has focused exclusively on the averaging problem, the methods of analysis in this chapter are applicable to other types of distributed inference problems, such as computing quantiles or order statistics, as well as computing various types of M -estimators. Obtaining analogous results for more general problems of distributed statistical inference is an interesting direction for future research.
2.7 2.7.1
Proofs Proof of Lemma 2.3.1
We observe that trace(L(G)) =
P
i∈V
di , where di are the positive diagonal elements
of L(G) For the normalized Laplacian, we have trace(L(G)) = m. A direct computation shows that the normalized Laplacian is a valid consensus matrix only when all nodes have the same degrees. Since the smallest eigenvalue is 0, and the others are positive, we have P the bound (m − 1)λ2 (L(G)) ≤ i∈V di . Define α(L(G)) : =
trace(L(G)) , m
34 and notice that for normalized graphs, α(L(G)) = 1. Furthermore, for any graph α(L(G)) ≥ λ2 (L(G)) (m − 1)/m. Also, the simple upper bound α(L(G)) ≤ (m − 1) holds. Using these facts, we establish Lemma 2.3.1 as follows. Recall that by construction, we have R(G) =
L(G) λ2 (L(G)) ,
so that the second smallest eigenvalue of R(G) is λ2 (R(G)) = 1, and
the remaining eigenvalues are greater than or equal to one. Applying Corollary 2.3.1 to the ANN model, we have ∗
AMSE(L; θ ) = =
m σ2 X [λi (R(G))]2 , m 2λi (R(G)) − 1 i=2 m X σ2 [λi (L(G))]2 m λ2 (L(G)) 2λi (L(G)) − λ2 (L(G)) i=2
≥ =
σ2 2λ2 (L(G)) m σ 2 α(L(G)) . 2λ2 (L(G))
trace(L(G))
In the other direction, using the fact that λ2 (R(G)) ≥ 1 and the bound x ≥ 1. we have ∗
AMSE(L; θ ) =
m σ2 X [λi (R(G))]2 , m 2λi (R(G)) − 1 i=2
≤ = = as claimed.
σ2 m
trace(R(G))
σ2 trace(L(G)) λ2 (L(G)) m σ 2 α(L(G)) , λ2 (L(G))
x2 2x−1
≤ x for
35
Chapter 3
Quantile Estimation in a Data Communication-Constrained Setting
3.1
Introduction
Whereas classical statistical inference is performed in a centralized manner, many modern scientific problems and engineering systems are inherently decentralized : data are distributed, and cannot be aggregated due to various forms of communication constraints. An important example of such a decentralized system is a sensor network [Chong and Kumar, 2003]: a set of spatially-distributed sensors collect data about the environmental state (e.g., temperature, humidity or light). Typically, these networks are based on ad hoc deployments, in which the individual sensors are low-cost, and must operate under very severe power constraints (e.g., limited battery life). In statistical terms, such communication constraints imply that the individual sensors cannot transmit the raw data; rather, they must compress or quantize the data—for instance, by reducing a continuous-valued observation to a single bit—and transmit only this compressed representation back to the fusion center. By now, there is a rich literature in both information theory and statistical signal processing on problems of decentralized statistical inference. A number of researchers, dating
36 back to the seminal paper of Tenney and Sandell [Tenney and Sandell, 1981], have studied the problem of hypothesis testing under communication-constraints; see the survey papers [Tsitsiklis, 1993; Veeravalli et al., 1993; Blum et al., 1997; Viswanathan and Varshney, 1997; Chamberland and Veeravalli, 2004] and references therein for overviews of this line of work. The hypothesis-testing problem has also been studied in the information theory community, where the analysis is asymptotic and Shannon-theoretic in nature [Amari and Han, 1989; Han and Kobayashi, 1989]. A parallel line of work deals with problem of decentralized estimation. Work in signal processing typically formulates it as a quantizer design problem and considers finite sample behavior [Ayanoglu, 1990; Gubner, 1993]; in contrast, the information-theoretic approach is asymptotic in nature, based on rate-distortion theory [Zhang and Berger, 1988; Han and Amari, 1998]. In much of the literature on decentralized statistical inference, it is assumed that the underlying distributions are known with a specified parametric form (e.g., Gaussian). More recent work has addressed nonparametric and data-driven formulations of these problems, in which the decision-maker is simply provided samples from the unknown distribution [Nguyen et al., 2005; Luo, 2005; Han et al., 1990]. For instance, Nguyen et al. [Nguyen et al., 2005] established statistical consistency for non-parametric approaches to decentralized hypothesis testing based on reproducing kernel Hilbert spaces. Luo [Luo, 2005] analyzed a non-parametric formulation of decentralized mean estimation, in which a fixed but unknown parameter is corrupted by noise with bounded support but otherwise arbitrary distribution, and shown that decentralized approaches can achieve error rates that are order-optimal with respect to the centralized optimum. This Chapter addresses a different problem in decentralized non-parametric inference— namely, that of estimating an arbitrary quantile of an unknown distribution. Since there exists no unbiased estimator based on a single sample, we consider the performance of a network of m sensors, each of which collects total of n observations in a sequential manner. Our analysis treats the standard fusion-based architecture, in which each of the m sensors transmits information to the fusion center via a communication-constrained channel. More concretely, at each of the n observation rounds, each sensor is allowed to transmit a single bit to the fusion center, which in turn is permitted to send some number k bits of feedback. For a decentralized protocol with k = log(m) bits of feedback, we prove that the algorithm achieves the order-optimal rate of the best centralized method (i.e., one with access to the full collection of raw data). We also consider a protocol that permits only a single
37 bit of feedback, and establish that it achieves the same rate. This single-bit protocol is advantageous in that, with for a fixed target mean-squared error of the quantile estimate, it yields longer sensor lifetimes than either the centralized or full feedback protocols. The remainder of the Chapter is organized as follows. Section 3.2 describes the required background on quantile estimation, and optimal rates in the centralized setting. We then describe two algorithms for solving the corresponding decentralized version, and provide an asymptotic characterization of their performance. These theoretical results are complemented with empirical simulations. Section 3.5 contains the proofs of our main results, and we conclude in Section 3.4 with a discussion.
3.2
Problem Set-up and Decentralized Algorithms
In this section, we begin with some background material on (centralized) quantile estimation, before introducing our decentralized algorithms, and stating our main theoretical results.
3.2.1
Centralized Quantile Estimation
We begin with the classical background on the problem of quantile estimation, and refer the interested reader to Serfling [Serfling, 1980] for further details. Given a real-valued random variable X, let F (x) : = P[X ≤ x] be its cumulative distribution function (CDF), which is non-decreasing and right-continuous. For any 0 < α < 1, the αth -quantile of X is defined as F −1 (α) = θ(α) : = inf {x ∈ R | F (x) ≥ α}. Moreover, if F is continuous at α, then we have α = F (θ(α)). As a particular example, for α = 0.5, the associated quantile is simply the median. Now suppose that for a fixed level α∗ ∈ (0, 1), we wish to estimate the quantile θ∗ = θ(α∗ ). Rather than impose a particular parameterized form on F , we work in a nonparametric setting, in which we assume only that the distribution function F is differentiable, so that X has the density function pX (x) = F 0 (x) (w.r.t Lebesgue measure), and moreover that pX (x) > 0 for all x ∈ R. In this setting, a standard estimator for θ∗ is the sample quantile ξN (α∗ ) : = FN−1 (α∗ ) where FN denotes the empirical distribution function based on i.i.d. samples (X1 , . . . , XN ). Under the conditions given above, it can be a.s.
shown [Serfling, 1980] that ξN (α∗ ) is strongly consistent for θ∗ (i.e., ξN → θ∗ ), and more-
38
Figure 3.1. Sensor network for quantile estimation with m sensors. Each sensor is permitted to transmit a 1-bit message to the fusion center; in turn, the fusion center is permitted to broadcast k bits of feedback.
over that asymptotic normality holds √
d
N (ξN − θ∗ ) → N
α∗ (1 − α∗ ) , 0, p2X (θ∗ )
(3.2.1)
so that the asymptotic MSE decreases as O(1/N ), where N is the total number of samples. Although this 1/N rate is optimal, the precise form of the asymptotic variance (3.2.1) need not be in general; see Zielinski [Zielinski, 2004] for in-depth discussion of the optimal asymptotic variances that can be obtained with variants of this basic estimator under different conditions.
3.2.2
Distributed Quantile Estimation
We consider the standard network architecture illustrated in Figure 3.1. There are m sensors, each of which has a dedicated two-way link to a fusion center. We assume that each sensor i ∈ {1, . . . , m} collects independent samples X(i) of the random variable X ∈ R with distribution function F (θ) : = P[X ≤ θ]. We consider a sequential version of the quantile estimation problem, in which sensor i receives measurements Xn (i) at time steps n = 0, 1, 2, . . ., and the fusion center forms an estimate θn of the quantile. The key condition—giving rise to the decentralized nature of the problem—is that communication between each sensor and the central processor is constrained, so that the sensor cannot simply relay its measurement X(i) to the central location, but rather must perform local computation, and then transmit a summary statistic to the fusion center. More concretely, we impose the following restrictions on the protocol. First, at each time step n = 0, 1, 2, . . ., each sensor i = 1, . . . , m can transmit a single bit Yn (i) to the fusion center. Second, the fusion center can broadcast k bits back to the sensor nodes at each time step. We analyze two distinct protocols, depending on whether k = log(m) or k = 1.
39
3.2.3
Protocol specification
For each protocol, all sensors are initialized with some fixed θ0 . The algorithms are specified in terms of a constant K > 0 and step sizes n > 0 that satisfy the conditions ∞ X
n = ∞
and
n=0
∞ X
2n < ∞.
(3.2.2)
n=0
The first condition ensures infinite travel (i.e., that the sequence θn can reach θ∗ from any starting condition), whereas the second condition (which implies that n → 0) is required for variance reduction. A standard choice satisfying these conditions—and the one that we assume herein—is n = 1/n. With this set-up, the log(m)-bit scheme consists of the steps given in Table 3.1.
Although the most straightforward feedback protocol is to
Algorithm: Decentralized quantile estimation with log(m)-bit feedback Given K > 0 and variable step sizes n > 0: (a) Local decision: each sensor computes the binary decision Yn+1 (i) ≡ Yn+1 (i; θn ) : = I(Xn+1 (i) ≤ θn ),
(3.2.3)
and transmits it to the fusion center. (b) Parameter update: the fusion center updates its current estimate θn+1 of the quantile parameter as follows: Pm ∗ i=1 Yn+1 (i) θn+1 = θn + n K α − (3.2.4) m (c) Feedback: the fusion broadcasts the m received bits {Yn+1 (1), . . . , Yn+1 (m)} back to the sensors. Each sensor can then compute the updated parameter θn+1 .
Table 3.1: Description of the log(m)-bf algorithm.
broadcast back the m received bits {Yn+1 (1), . . . , Yn+1 (m)}, as described in step (c), in fact it suffices to transmit only the log(m) bits required to perfectly describe the binomial P random variable m i=1 Yn+1 (i) in order to update θn . In either case, after the feedback step, P each sensor knows the value of the sum m i=1 Yn+1 (i), which (in conjunction with knowledge of m, α∗ and n ) allow it to compute the updated parameter θn+1 . Finally, knowledge of θn+1 allows each sensor to then compute the local decision (3.2.3) in the following round.
40 Algorithm: Decentralized quantile estimation with 1-bit feedback Given Km > 0 (possibly depending on number of sensors m) and variable step sizes n > 0: (a) Local decision: each sensor computes the binary decision Yn+1 (i) = I(Xn+1 (i) ≤ θn )
(3.2.5)
and transmits it to the fusion center. (b) Aggregate decision and parameter update: The fusion center computes the aggregate decision Pm ∗ i=1 Yn+1 (i) Zn+1 = I ≤α , (3.2.6) m and uses it update the parameter according to θn+1 = θn + n Km (Zn+1 − β)
(3.2.7)
where the constant β is chosen as bmα∗ c
β =
X i=0
m (α∗ )i (1 − α∗ )m−i . i
(3.2.8)
(c) Feedback: The fusion center broadcasts the aggregate decision Zn+1 back to the sensor nodes (one bit of feedback). Each sensor can then compute the updated parameter θn+1 .
Table 3.2: Description of the 1-bf algorithm.
The 1-bit feedback scheme detailed in Table 3.2 is similar, except that it requires broadcasting only a single bit (Zn+1 ), and involves an extra step size parameter Km , which is specified in the statement of Theorem 3.2.2. After the feedback step of the 1-bf algorithm, each sensor has knowledge of the aggregate decision Zn+1 , which (in conjunction with n and the constant β) allow it to compute the updated parameter θn+1 . Knowledge of this parameter suffices to compute the local decision (3.2.5).
3.2.4
Convergence results
We now state our main results on the convergence behavior of these two distributed protocols. In all cases, we assume the step size choice n = 1/n. Given fixed α∗ ∈ (0, 1),
41 we use θ∗ to denote the α∗ -level quantile (i.e., such that P(X ≤ θ∗ ) = α∗ ); note that our assumption of a strictly positive density guarantees that θ∗ is unique. Theorem 3.2.1 (m-bit feedback). For any α∗ ∈ (0, 1), consider a random sequence {θn } generated by the m-bit feedback protocol. Then (a) For all initial conditions θ0 , the sequence θn converges almost surely to the α∗ -quantile θ∗ . (b) Moreover, if the constant K is chosen to satisfy pX (θ∗ ) K > 12 , then √
∗
d
n (θn − θ ) → N
K 2 α∗ (1 − α∗ ) 1 0, 2KpX (θ∗ ) − 1 m
! ,
(3.2.9)
1 so that the asymptotic MSE is O( mn ).
Remarks: After n steps of this decentralized protocol, a total of N = nm observations have been made, so that our discussion in Section 3.2.1 dictates (see equation (3.2.1)) that the 1 optimal asymptotic MSE is O( nm ). Interestingly, then, the m-bit feedback decentralized
protocol is order-optimal with respect to the centralized gold standard. Before stating the analogous result for the 1-bit feedback protocol, we begin by introducing some useful notation. First, we define for any fixed θ ∈ R the random variable m
m
i=1
i=1
1 X 1 X Y¯ (θ) : = Y (i; θ) = I(X(i) ≤ θ). m m Note that for each fixed θ, the distribution of Y¯ (θ) is binomial with parameters m and F (θ). It is convenient to define the function bmyc
Gm (r, y) : =
X i=0
m i r (1 − r)m−i , i
(3.2.10)
with domain (r, y) ∈ [0, 1] × [0, 1]. With this notation, we have P(Y¯ (θ) ≤ y) = Gm (F (θ), y). Again, we fix an arbitrary α∗ ∈ (0, 1) and let θ∗ be the associated α∗ -quantile satisfying P(X ≤ θ∗ ) = α∗ .
42 Theorem 3.2.2 (1-bit feedback). Given a random sequence {θn } generated by the 1-bit feedback protocol, we have a.s.
(a) For any initial condition, the sequence θn −→ θ∗ . √ (b) Suppose that the step size Km is chosen such that Km >
2πα∗ (1−α∗ ) √ , 2pX (θ∗ ) m
or equivalently
such that ∂G 1 m γm (θ∗ ) : = Km (r; α∗ ) r=α∗ pX (θ∗ ) > , ∂r 2 then √
∗
! 2 G (α∗ , θ ∗ ) 1 − G (α∗ , θ ∗ ) Km m m 0, 2γm (θ∗ ) − 1
d
n (θn − θ ) → N
(3.2.11)
(3.2.12)
(c) If we choose a constant step size Km = K, then as n → ∞, the asymptotic variance behaves as
"
# p K 2 2πα∗ (1 − α∗ ) p , √ 8KpX (θ∗ ) m − 4 2πα∗ (1 − α∗ ) so that the asymptotic MSE is O n√1m . (d) If we choose a decaying step size Km = 1 m
"
then
# p 2πα∗ (1 − α∗ ) p , 8KpX (θ∗ ) − 4 2πα∗ (1 − α∗ )
so that the asymptotic MSE is O
3.2.5
√K , m
K2
1 nm
(3.2.13)
(3.2.14)
.
Comparative Analysis
It is interesting to compare the performance of each proposed decentralized algorithm to the centralized performance. Considering first the m-bf scheme, suppose that we set K = 1/pX (θ∗ ). Using the formula (3.2.9) from Theorem 3.2.1, we obtain that the asymptotic variance of the m-bf scheme with this choice of K is given by
α∗ (1−α∗ ) 1 , p2X (θ∗ ) mn
thus matching
the asymptotics of the centralized quantile estimator (3.2.1). In fact, it can be shown that the choice K = 1/pX (θ∗ ) is optimal in the sense of minimizing the asymptotic variance for our scheme, when K is constrained by the stability criterion in Theorem 3.2.1. In practice, however, the value pX (θ∗ ) is typically not known, so that it may not be possible to
43
(a)
(b)
(c)
Figure 3.2. Convergence of θn to θ∗ with m = 11 nodes, and quantile level α∗ = 0.3. (b) Log-log plots of the variance against m for both algorithms (log(m)-bf and 1-bf) with constant step sizes, and theoretically-predicted rate. (b) Log-log plots of the variance against m for log(m)-bf and 1-bf algorithms with constant step size. (c) Log-log plots of log(m)-bf with constant step size versus 1-bf algorithm with decaying step size.
implement exactly this scheme. An interesting question is whether an adaptive scheme could be used to estimate pX (θ∗ ) (and hence the optimal K simultaneously), thereby achieving this optimal asymptotic variance. We leave this question open as an interesting direction for future work. p ¯ = K/ 2πα∗ (1 − α∗ ) Turning now to the algorithm 1-bf, if we make the substitution K in equation (3.2.14), then we obtain the asymptotic variance ¯ 2 α∗ (1 − α∗ ) 1 π K ¯ X (θ∗ ) − 1 m . 2 2Kp
(3.2.15)
¯ = 1/pX (θ∗ ). Since the stability criterion is the same as that for m-bf, the optimal choice is K Consequently, while the (1/[mn]) rate is the same as both the centralized and decentralized m-bf protocols, the pre-factor for the 1-bf algorithm is
π 2
≈ 1.57 times larger than the
optimized m-bf scheme. However, despite this loss in the pre-factor, the 1-bf protocol has substantial advantages over the m-bf; in particular, the network lifetime scales as O(m) compared to O(m/ log(m)) for the log(m)-bf scheme.
3.2.6
Simulation example
We now provide some simulation results in order to illustrate the two decentralized protocols, and the agreement between theory and practice. In particular, we consider the quantile estimation problem when the underlying distribution (which, of course, is unknown to the algorithm) is uniform on [0, 1] random. In this case, we have pX (x) = 1 uniformly
44 for all x ∈ [0, 1], so that taking the constant K = 1 ensures that the stability conditions in both Theorem 3.2.1 and 3.2.2 are satisfied. We simulate the behavior of both algorithms for α∗ = 0.3 over a range of choices for the network size m. Figure 3.2(a) illustrates several sample paths of m-bit feedback protocol, showing the convergence to the correct θ∗ . For comparison to our theory, we measure the empirical variance by averaging the error √ √ eˆn = n(θn − θ∗ ) over L = 20 runs. The normalization by n is used to isolate the effect of increasing m, the number of nodes in the network. We estimate the variance by running algorithm for n = 2000 steps, and computing the empirical variance of eˆn for time steps n = 1800 through to n = 2000. Figure 3.2(b) shows these empirically computed variances, and a comparison to the theoretical predictions of Theorems 3.2.1 and 3.2.2 for constant step size; note the excellent agreement between theory and practice. Panel (c) shows the √ comparison between the log(m)-bf algorithm, and the 1-bf algorithm with decaying 1/ m step size. Here the asymptotic MSE of both algorithms decays like 1/m for log m up to roughly 500; after this point, our fixed choice of n is insufficient to reveal the asymptotic behavior.
3.3
Some Extensions
In this section we consider some extensions of the algorithms and analysis from the preceding sections, including variations in the number of feedback bits, and the effects of noise.
3.3.1
Different levels of feedback
We first consider the generalization of the preceding analysis to the case when the fusion center communicates some number of bits between 1 and m. The basic idea is to apply a quantizer with 2` levels, corresponding to log2 (2`) bits, on the update of the stochastic gradient algorithm. Note that the extremes ` = 1 and ` = 2m−1 correspond to the previously studied protocols. Given 2` levels, we partition the real line as − ∞ = s−` < s−`+1 < . . . < s`−1 < s` = +∞,
(3.3.1)
45 where the remaining breakpoints {sk } are to be specified. With this partition fixed, we define a quantization function Q` Q` (X) : = rk
if X ∈ (sk , sk+1 ] for k = −`, . . . , ` − 1,
(3.3.2)
where the 2` quantized values (r−` , . . . , r`−1 ) are to be chosen. In the setting of the algorithm to be proposed, the quantizer is applied to binomial random variables X with parameters (m, r). Recall the function Gm (r, x), as defined in equation (3.2.10), corresponding to the probability P[X ≤ mx]. Let us define a new function Gm,` , corresponding to the expected value of the quantizer when applied to such a binomial variate, as follows
Gm,` (r, x) : =
`−1 X
rk {Gm (r, x − sk ) − Gm (r, x − sk+1 )} .
(3.3.3)
k=−`
With these definitions, the general log2 (2`) feedback algorithm takes the form shown in Table 3.3. In order to understand the choice of the offset parameter β defined in equation (3.3.7), we compute the expected value of the quantizer function, when θn = θ∗ , as follows Pm `−1 i h X Y¯ (θ∗ ) ∗ ∗ ∗ ∗ i=1 Yn+1 (i) rk P (α − sk+1 ) < | θn = θ = ≤ (α − sk ) E Q` α − m m k=−`
=
`−1 X
rk [Gm (F (θ∗ ), α∗ − sk ) − Gm (F (θ∗ ), α∗ − sk+1 )]
k=−`
= Gm,` (F (θ∗ ), α∗ ). The following result, analogous to Theorem 3.2.2, characterizes the behavior of this general protocol: Theorem 3.3.1 (General feedback scheme). Given a random sequence {θn } generated by the general log2 (2`)-bit feedback protocol, there exist choices of partition {sk } and quantization levels {rk } such that: a.s.
(a) For any initial condition, the sequence θn −→ θ∗ . (b) There exists a choice of decaying step size (i.e., Km
√1 ) m
such that the asymptotic
46 Algorithm: Decentralized quantile estimation with log2 (2`)-bits feedback Given Km > 0 (possibly depending on number of sensors m) and variable step sizes n > 0: (a) Local decision: each sensor computes the binary decision Yn+1 (i) = I(Xn+1 (i) ≤ θn )
(3.3.4)
and transmits it to the fusion center. (b) Aggregate decision and parameter update: The fusion center computes the quantized aggregate decision variable Pm ∗ i=1 Yn+1 (i) Zn+1 = Q` α − , (3.3.5) m and uses it update the parameter according to θn+1 = θn + n Km (Zn+1 − β)
(3.3.6)
where the constant β is chosen as β : = Gm,` (F (θ∗ ), α∗ ).
(3.3.7)
(c) Feedback: The fusion center broadcasts the aggregate quantized decision Zn+1 back to the sensor nodes, using its log2 (2`) bits of feedback. The sensor nodes can then compute the updated parameter θn+1 .
Table 3.3: Description of the general algorithm, with log2 (2`) bits of feedback.
variance of the protocol is given by ∗
κ(α∗ ,Q` ) mn ,
κ(α , Q` ) : = 2π
P`−1
where the constant has the form
2 k=−` rk ∆Gm (sk , sk+1 ) − P`−1 k=−` rk ∆m (sk , sk+1 )
β2
,
(3.3.8)
with ∆Gm (sk , sk+1 ) = Gm (F (θ∗ ), α∗ − sk ) − Gm (F (θ∗ ), α∗ − sk+1 ), and ! ms2k+1 ms2k ∆m (sk , sk+1 ) = exp − ∗ − exp − ∗ . 2α (1 − α∗ ) 2α (1 − α∗ ) We provide a formal proof of Theorem 3.3.1 in Section 3.5. Figure 3.3(a) illustrates how the constant factor κ, as defined in equation (3.3.8) decreases as of levels ` in an uniform
47 quantizer is increased. Note In order to provide comparison with results from the previous section, let us see how the two extreme cases (1 bit and m feedback) can be obtained as special case. For the 1-bit case, the quantizer has ` = 1 levels with breakpoints s−1 = −∞, s0 = 0, s1 = +∞, and quantizer outputs r−1 = 0 and r1 = 1. By making the appropriate substitutions, we obtain: ∆Gm (s0 , s1 ) − β 2 , ∆m (s0 , s1 ) ∆Gm (s0 , s1 ) = Gm,` (F (θ∗ ), α∗ )
κ(α∗ , Q1 ) = 2π
β 2 = Gm,` (F (θ∗ ), α∗ )2 , and
∆m (s0 , s1 )) = 1.
By applying the central limit theorem, we conclude that ∆Gm (s0 , s1 ) − β 2 = Gm,` (F (θ∗ ), α∗ )(1 − Gm,` (F (θ∗ ), α∗ )) → 1/4, as established earlier. Thus κ(α∗ , Q1 ) → π/2 as m → ∞, recovering the result of Theorem 3.2.2. Similarly, the results for m-bf can be recovered by setting the parameters rk−` = α∗ −
k , m
for k = 0, ..., M,
si = ri .
3.3.2
and (3.3.10)
Extensions to noisy links
We now briefly consider the effect of communication noise on our algorithms. There are two types of noise to consider: (a) feedforward, meaning noise in the link from sensor node to fusion center, and (b) feedback, meaning noise in the feedback link from fusion center to the sensor nodes. Here we show that feedforward noise can be handled in a relatively straightforward way in our algorithmic framework. On the other hand, feedback noise requires a different analysis, as the different sensors may loose synchronicity in their updating procedure. Although a thorough analysis of such asynchronicity is an interesting topic for future research, we note that assuming noiseless feedback is not unreasonable, since the fusion center typically has greater transmission power. Focusing then on the case of feedforward noise, let us assume that the link between each sensor and the fusion center acts as a binary symmetric channel (BSC) with probability ∈ [0, 12 ). More precisely, if a bit x ∈ {0, 1} is transmitted, then the received bit y has the
48
(a)
(b) ∗
Figure 3.3. (a) Plots of the asymptotic variance κ(α , Q` ) defined in equation (3.3.8) versus the number of levels ` in a uniform quantizer, corresponding to log2 (2`) bits of feedback, for a sensor network with m = 4000 nodes. The plots show the asymptotic variance rescaled by the centralized gold standard, so that it starts at π/2 for ` = 2, and decreases towards 1 as ` is increased towards m/2. (b) Plots of the asymptotic variances Vm () and V1 () defined in equation (3.3.13) as the feedforward noise parameter is increased from 0 towards 21 .
(conditional) distribution
P(y | x) =
1 − if x = y
(3.3.11)
if x 6= y.
With this bit-flipping noise, the updates (both equation (3.2.4) and (3.2.7)) need to be modified so as to correct for the bias introduced by the channel noise. If α∗ denotes the desired quantile, then in the presence of BSC() noise, both algorithms should be run with the modified parameter α e() : = (1 − 2)α∗ + .
(3.3.12)
Note that α e() ranges between α∗ (for the noiseless case = 0), to a quantity arbitrarily close to
1 2,
as the channel approaches the extreme of pure noise ( =
1 2 ).
The following
lemma shows that for all < 12 , this adjustment (3.3.12) suffices to correct the algorithm. Moreover, it specifies how the resulting asymptotic variance depends on the noise parameter: Proposition 3.3.1. Suppose that each of the m feedforward links from sensor to fusion
49 center are modeled as i.i.d. BSC channels with probability ∈ [0, 21 ). Then the m-bf or 1-bf algorithms, with the adjusted α e(), are strongly consistent in computing the α∗ -quantile. Moreover, with appropriate step size choices, their asymptotic MSEs scale as 1/(mn) with respective pre-factors given by K2 α e() (1 − α e()) 2K(1 − 2)pX (θ∗ ) − 1 " # p K 2 2π α e()(1 − α e()) p V1 () : = . 8K(1 − 2)pX (θ∗ ) − 4 2π α e()(1 − α e())
Vm () : =
(3.3.13a) (3.3.13b)
In both cases, the asymptotic MSE is minimal for = 0. Proof:
If sensor node i transmits a bit Yn+1 (i) at round n + 1, then the fusion center
receives the random variable Yen+1 (i) = Yn+1 (i) ⊕ Wn+1 , where Wn+1 is Bernoulli with parameter , and ⊕ denotes addition modulo two. Since Wn+1 is independent of the transmitted bit (which is Bernoulli with parameter F (θn )), the received value Yen+1 (i) is also Bernoulli, with parameter ∗ F (θn ) = (1 − F (θn )) + (1 − ) F (θn ) = + (1 − 2) F (θn ).
(3.3.14)
Consequently, if we set α e() according to equation (3.3.12), both algorithms will have their unique fixed point when F (θ) = α∗ , so will compute the α∗ -quantile of X. The claimed form of the asymptotic variances follows from by performing calculations analogous to the proofs of Theorems 3.2.1 and 3.2.2. In particular, the partial derivative with respect to θ now has a multiplicative factor (1 − 2), arising from equation (3.3.14) and the chain rule. To establish that the asymptotic variance is minimized at = 0, it suffices to note that the derivative of the MSE with respect to is positive, so that it is an increasing function of . Of course, both the algorithms will fail, as would be expected, if = 1/2 corresponding to pure noise. However, as summarized in Proposition 3.3.1, as long as < 21 , feedforward noise does not affect the asymptotic rate itself, but rather only the pre-factor in front of the 1/(mn) rate. Figure 3.3(b) shows how the asymptotic variances Vm () and V1 () as is
50 increased towards = 21 .
3.4
Discussion
In this Chapter, we have proposed and analyzed different approaches to the problem of decentralized quantile estimation under communication constraints. Our analysis focused on the fusion-centric architecture, in which a set of m sensor nodes each collect an observation at each time step. After n rounds of this process, the centralized oracle would be able to estimate an arbitrary quantile with mean-squared error of the order O(1/(mn)). In the decentralized formulation considered here, each sensor node is allowed to transmit only a single bit of information to the fusion center. We then considered a range of decentralized algorithms, indexed by the number of feedback bits that the fusion center is allowed to transmit back to the sensor nodes. In the simplest case, we showed that an log m-bit feedback algorithm achieves the same asymptotic variance O(1/(mn)) as the centralized estimator. More interestingly, we also showed that that a 1-bit feedback scheme, with suitably designed step sizes, can also achieve the same asymptotic variance as the centralized oracle. We also showed that using intermediate amounts of feedback (between 1 and m bits) does not alter the scaling behavior, but improves the constant. Finally, we showed how our algorithm can be adapted to the case of noise in the feedforward links from sensor nodes to fusion center, and the resulting effect on the asymptotic variance. Our analysis in this Chapter has focused only on the fusion center architecture illustrated in Figure 3.1. A natural generalization is to consider a more general communication network, specified by an undirected graph on the sensor nodes. One possible formulation is to allow only pairs of sensor nodes connected by an edge in this communication graph to exchange a bit of information at each round. In this framework, the problem considered in this Chapter effectively corresponds to the complete graph, in which every node communicates with every other node at each round. This more general formulation raises interesting questions as to the effect of graph topology on the achievable rates and asymptotic variances.
3.5
Proofs
In this section, we turn to the proofs of Theorem 3.2.1 and 3.2.2, which exploit results from the stochastic approximation literature [Kushner and Yin, 1997; Benveniste et al.,
51 1990]. In particular, both types of parameter updates (3.2.4) and (3.2.7) can be written in the general form θn+1 = θn + n H(θn , Yn+1 ),
(3.5.1)
where Yn+1 = (Yn+1 (1), . . . Yn+1 (m)). Note that the step size choice n = 1/n satisfies the conditions in equation (3.2.2). Moreover, the sequence (θn , Yn+1 ) is Markov, since θn and Yn+1 depend on the past only via θn−1 and Yn . We begin by stating some known results from stochastic approximation, applicable to such Markov sequences, that will be used in our analysis. In addition to these assumptions, convergence requires an additional attractiveness condition. For each fixed θ ∈ R, let µθ ( · ) denote the distribution of Y conditioned on θ. A key quantity in the analysis of stochastic approximation algorithms is the averaged function Z h(θ) : =
H(θ, y)µθ (dy) = E [H(θ, Y ) | θ] .
(3.5.2)
We assume (as is true for our cases) that this expectation exists. Now the differential equation method dictates that under suitable conditions, the asymptotic behavior of the update (3.5.1) is determined essentially by the behavior of the ODE
dθ dt
= h(θ(t)).
Almost sure convergence: Suppose that the following attractiveness condition h(θ) [θ − θ∗ ] < 0
for all θ 6= θ∗
(3.5.3)
is satisfied. If, in addition, the variance R(θ) : = Var[H(θ; Y ) | θ] is bounded, then we are a.s.
are guaranteed that θn → θ∗ (see §5.1 in [Benveniste et al., 1990]). Asymptotic normality: In our updates, the random variables Yn take the form Yn = g(Xn , θn ) where the Xn are i.i.d. random variables. Suppose that the following stability condition is satisfied: γ(θ∗ ) : = −
1 dh ∗ (θ ) > . dθ 2
(3.5.4)
Then we have √
n (θn − θ ) → N 0, ∗
d
R(θ∗ ) 2γ(θ∗ ) − 1)
(3.5.5)
52 See §3.1.2 in [Benveniste et al., 1990] for further details.
3.5.1
Proof of Theorem 3.2.1
(a) The m-bit feedback algorithm is a special case of the general update (3.5.1), with n = n1 1 Pm and H(θn , Yn+1 ) = K α∗ − m i=1 Yn+1 (i; θn ) . Computing the averaged function (3.5.2), we have m
"
1 X h(θ) = KE α − Yn+1 (i) | θn m
#
∗
i=1
= K (α∗ − F (θn )) , where F (θn ) = P(X ≤ θn ). We then observe that θ∗ satisfies the attractiveness condition (3.5.3), since [θ − θ∗ ] h(θn ) = K [θ − θ∗ ] [α∗ − F (θn )] < 0 for all θ 6= θ∗ , by the monotonicity of the cumulative distribution function. Finally, we compute the conditional variance of H as follows: Pm ∗ i=1 Yn+1 (i) R(θn ) = K Var α − | θn m K2 K2 F (θn ) [1 − F (θn )] ≤ , = m 4m 2
(3.5.6)
using the fact that H is a sum of m Bernoulli variables that are conditionally i.i.d. (given θn ). Thus, we can conclude that θn → θ∗ almost surely. ∗ ∗ (b) Note that γ(θ∗ ) = − dh dθ (θ ) = KpX (θ ) >
1 2,
so that the stability condition (3.5.4)
holds. Applying the asymptotic normality result (3.5.5) with the variance R(θ∗ ) =
K2 ∗ m α (1−
α∗ ) (computed from equation (3.5.6)) yields the claim.
3.5.2
Proof of Theorem 3.2.2
This argument involves additional analysis, due to the aggregate decision (3.2.6) taken by the fusion center. Since the decision Zn+1 is a Bernoulli random variable; we begin by computing its parameter. Each transmitted bit Yn+1 (i) is Ber(F (θn )), where we recall the
53 notation F (θ) : = P(X ≤ θ). Using the definition (3.2.10), we have the equivalences P(Zn+1 = 1) = Gm (F (θn ), α∗ )
(3.5.7a)
β = Gm (α∗ , α∗ ) = Gm (F (θ∗ ), α∗ ).
(3.5.7b)
We start with the following result. Lemma 3.5.1. For fixed x ∈ [0, 1], the function f (r) : = Gm (r, x) is non-negative, differentiable and monotonically decreasing. Proof: Non-negativity and differentiability are immediate. To establish monotonicity, note P that f (r) = P( m i=1 Yi ≤ xm), where the Yi are i.i.d. Ber(r) variates. Consider a second Pm P 0 0 Ber(r ) sequence Yi0 with r0 > r. Then the sum m i=1 Yi , i=1 Yi stochastically dominates so that f (r) < f (r0 ) as required. To establish almost sure convergence, we use a similar approach as in the previous theorem. Using the equivalences (3.5.7), we compute the function h as follows h(θ) = Km E [Zn+1 − β | θ] = Km [Gm (F (θ), α∗ ) − Gm (F (θ∗ ), α∗ )] . Next we establish the attractiveness condition (3.5.3). In particular, for any θ such that F (θ) 6= F (θ∗ ), we calculate that h(θ) [θ − θ∗ ] is given by n o Km Gm (F (θn ), α∗ ) − Gm (F (θ∗ ), α∗ ) [θn − θ∗ ] < 0, where the inequality follows from the fact that Gm (r, x) is monotonically decreasing in r for each fixed x ∈ [0, 1] (using Lemma 3.5.1), and that the function F is monotonically increasing. Finally, computing the variance R(θ) : = Var [H(θ, Y ) | θ], we have 2 R(θ) = Km Gm (F (θ), α∗ ) [1 − Gm (F (θ), α∗ )] ≤
2 Km , 4
since (conditioned on θ), the decision Zn+1 is Bernoulli with parameter Gm (F (θ); α∗ ). Thus, we can conclude that θn → θ∗ almost surely. (b) To show asymptotic normality, we need to verify the stability condition. By chain rule,
54
we have
h ∗ dθ (θ )
∗ ) m = Km ∂G (r, α ∂r
r=F (θ)
pX (θ). From Lemma 3.5.1, we have
0, so that the stability condition holds as long as γm (θ∗ ) >
1 2
∂Gm ∗ ∂r (F (θ), α )
1 (independent of n), of consecutive central moments. That is, we assume there are characteristic moments h i we(j) = E (de − E[de ])j , for all e ∈ E and 2 ≤ j ≤ J. Our goal is to estimate these moments within a fixed accuracy. More formally, we make the following assumption. We first need a definition. Definition 4.1.1 (Regular Families). Let ε > 0 and J ≥ 2 be fixed. Let Q = {Qθ }θ∈Θ be a family of distributions on R parametrized by θ ∈ Θ where Θ is a subset of an Euclidean space. Let w(j) (θ) {2≤j≤J} be the first J − 1 central moments of Qθ . We say that the family Q is (ε, J)-regular if there exists a map Ψ from RJ−1 to Θ and a δ > 0 such that if (j) the vector w ˆ = w ˆ satisfies {2≤j≤J} (j) ˆ − w(j) (θ) ≤ δ w for all 2 ≤ j ≤ J, then
Qθ − QΨ(w)
ˆ 1 ≤ ε. In Appendix 4.7.1, we give simple examples of regular families. Assumption 4.1.1 (Regularity and Boundedness). Let ε > 0 and J ≥ 2 be fixed (independent of n). We assume that all edge delay distributions are from a fixed (ε, J)-regular family of distributions. Furthermore, we assume that the delays are uniformly bounded, namely there is a constant M > 0 independent of n such that for all e ∈ E, de ∈ [0, M ]. This framework is simple enough to be tractable yet general enough to accommodate large classes of distributions: parametrized distributions, e.g., beta distributions; and nonparametrized distributions, e.g., discretized distributions on {0, 1, . . . , M }. Further we need the following assumption. Assumption 4.1.2 (Lower Bound on Second Moment). We assume that there is a constant f > 0 (independent of n) such that for all e ∈ E, we(2) ≥ f.
63 To sum up, the multicast inference problem is defined as follows. Definition 4.1.2 (Multicast Inference Problem, Moment Version). Let ε > 0 and J ≥ 2 be n o (j) fixed. The multicast inference problem consists in the following. Let T and we
{e∈E,2≤j≤J}
be any tree (with internal degrees at least 3) and set of central moments on edges. Given samples of delays at the leaves, we are required to: 1. Tree Reconstruction. Recover T . n o (j) 2. Moment Estimation. Estimate all characteristic moments we
{e∈E,2≤j≤J}
within
ε. Remark 4.1.1. As noted by Lo Presti et al. [Presti et al., 2002], the means of the edge delay distributions are, in general, unidentifiable. See Figure 4.1 for an illustration. In particular, one cannot hope to recover the deterministic transmission delay on each link. But, as noted in Presti et al. [2002], this is not a major issue. Indeed, in practice, one is only interested in the variable portion of the delay, that is, the portion resulting from traffic. To restore identifiability, Lo Presti et al. proceed by subtracting the lowest observed delay on each receiver, in order to remove the (estimated) deterministic component of the delay. They further assume that the variable portion of the delay “starts at 0.” We also make this last assumption (see our examples of regular delay distributions in Appendix 4.7.1). However, instead of subtracting the minimum observed delay (which may be unreliable on a large network), we use central moments—which are not affected by the deterministic transmission delay.
4.1.2
Our Results.
Our main result is the following theorem. Theorem 4.1.1 (Main Result). Let ε > 0 and J ≥ 2 be fixed. Let Assumptions 4.1.1 and 4.1.2 hold. Then, there is a polynomial-time algorithm which solves the multicast inference problem with high probability using k = O (poly(log n)) samples. See Theorems 4.3.1, 4.5.1, and 4.5.2 below for more precise statements. The proofs of the main theorems rely on the important notion of a tree metric from phylogenetics. Roughly speaking, a tree metric is a metric on the leaves of a tree such
64
Figure 4.1. Unidentifiability of Mean Delay: If one were to replace d1 with d1 +µ and d2 , d3 with d2 − µ, d3 − µ for µ > 0 (assuming µ can be chosen so that all delays remain positive) then the distribution of delays at a, b would be unchanged. This example also shows that one cannot deduce the delays on all edges given total delays at all leaves.
that the distance between any two leaves can be written as a sum of edge weights on the corresponding path. (See Section 4.2 for definitions.) There are two components to our algorithm: 1. Topology reconstruction: The reconstruction of the routing tree can be achieved by adapting known phylogenetic reconstruction algorithms—once the proper delaybased metric is defined. This result is proved in Section 4.3. The relevant phylogenetic background is introduced in Section 4.2. 2. Moment estimation on edges: Most of the technical work of this chapter is in deriving and analyzing a metric-based algorithm for inferring edge delay distributions (Theorems 4.5.1 and 4.5.2). For this purpose, a) we relax the notion of a tree metric to allow nonnegative edge weights, b) we define appropriate delay-based metrics, and c) we show how to estimate these metrics. The analysis relies on large deviations arguments. As far as we are aware, our algorithm is the first multicast inference algorithm to be both provably efficient and consistent. Previous work concerned mostly non-rigorous techniques such as maximum pseudo-likelihood and EM algorithms. See Castro et al. [2004] for details. An exception is the independent, unpublished work of Liang et al. [Liang et al., 2007] which uses techniques similar to ours in the related context of multicast packet drop inference.
65
4.1.3
Discussion
Validity of assumptions. The multicast delay process defined in Section 4.1.1 relies on two basic assumptions about routing and traffic which makes its analysis possible: temporal and spatial independence. In reality, of course, both assumptions are violated to some extent. Lo Presti et al. [Presti et al., 2002] (see also C´aceres et al. [1999]) studied the effect of these violations empirically and concluded that the multicast delay process is a useful first approximation to the underlying complex process. We briefly summarize their findings. Temporal dependence—delays at a given link being correlated at different points in time— is common in communication networks. But, as it turns out, its impact is rather mild for our purposes. Indeed the type of inference procedure studied in Presti et al. [2002] (as well as in the current chapter) does not actually require independence in time but only ergodicity—a much weaker assumption; more precisely, the estimator in Presti et al. [2002] (and in the current chapter) is consistent as long as the delay process is ergodic. Hence, the temporal dependencies impact only the convergence rate of the inference procedure. Lo Presti et al. showed empirically that, although this effect cannot be ignored, it is rather mild. Quantifying exactly the effect of temporal correlations on the theoretical convergence rate of an estimator is non-trivial. As for spatial correlations—dependencies in delays on neighboring links—Lo Presti et al. found that they can produce a systematic bias in the estimation. However, they showed empirically that the bias is a small, second-order effect, possibly—they argue—because the diversity of traffic on the network results only in localized, short-term correlations in delays. They also point out that very little is known about the precise structure of such spatial correlations in real networks, making it hard to derive a good model for them. Another assumption implicit in our model is that the process, including the routing tree itself, remains homogeneous over time. In fact, there are sporadic large-scale changes in the network. These explain why a low sample complexity is critical for an inference procedure to be useful in practice. Minimizing the sample complexity is the main focus of this chapter. Related results. The multicast delay inference problem was formalized by Lo Presti et al. in Presti et al. [2002]. In that paper, the authors give a procedure to infer a discretized delay distribution on each link, given the routing tree topology. Their algorithm is based on an ad-hoc fixed point equation that is solved by least squares. Moreover, these authors show
66 that their estimator is asymptotically normal with a variance-covariance matrix depending implicitly on the delay characteristics. More explicit formulas are given in the limit of small delays. The algorithm is tested on small networks and the dependence on the size is not given. More recently, Ni and Tatikonda [Ni and Tatikonda, 2006, 2007, 2008; Ni et al., 2008]—in work subsequent to ours [Bhamidi et al., 2006]—used phylogenetic techniques to recover the routing tree topology in this context. Similarly to the current chapter, they use distancebased techniques. The basic algorithm they consider is the well-known Neighbor-Joining (NJ) algorithm which they apply to various tree metrics, for instance, the delay variance metric (as we do here). They also deal with trees of internal degrees higher than 3 by introducing a variant of NJ called Rooted Neighbor-Joining (RNJ) [Ni and Tatikonda, 2008] (based on a technique equivalent to what is known in phylogenetics as the Farris transform [Farris, 1973]). They show more precisely that RNJ is a consistent estimator of the routing tree, but no convergence rate is given. Note, however, that RNJ has in fact a high sample complexity due to its reliance on the diameter of the tree. See, e.g., [Atteson, 1999]. See also our discussion about diameter v. depth in Section 4.2.2. Here, we make use of state-of-art phylogenetic reconstruction techniques to derive a low sample complexity algorithm for routing tree reconstruction. We also show how to infer delay distributions. A technique to infer discrete delays was also subsequently obtained by Ni and Tatikonda [Ni and Tatikonda, 2007] (although no convergence rate is provided). A related network tomography problem is the so-called multicast link loss inference problem, where one observes packet losses at the receivers of a multicast routing tree—instead of delays—and seeks to infer the routing tree and packet drop probabilities on the links. This problem was formalized in C´ aceres et al. [1999] where a maximum-likelihood estimation procedure was analyzed. In C´ aceres et al. [1999], the network topology is assumed known. In more recent independent work, Liang et al. [Liang et al., 2007] (unpublished) applied phylogenetic techniques to the inference of the routing topology in this context. Indeed, the multicast link loss problem is in essence a special case of the standard model of DNA evolution used in biology. Similarly to the current chapter, Liang et al. use distance-based techniques. More precisely, they give a computationally efficient reconstruction algorithm with sample complexity O(b−2 log n) where b (possibly depending on n) is a lower bound on the link loss probability. Ni and Tatikonda [Ni and Tatikonda, 2006, 2007, 2008; Ni et al., 2008] (see above) also considered the link loss inference problem.
The remainder of the chapter is organized as follows. We start with some phylogenetic background in Section 4.2. Our results concerning topology reconstruction can be found in Section 4.3. We then present and analyze our delay inference algorithm in Section 4.4.
4.2 Phylogenetic Reconstruction Techniques
In this section, we summarize and adapt to our setting the DMR algorithm of [Daskalakis et al., 2009].
4.2.1 Basics
We begin with a few basic notions from phylogenetics.

Tree metrics. In phylogenetics, the notion of a tree metric is useful for reconstructing the topology of phylogenies. We use the notation R++ = {x ∈ R : x > 0}.

Definition 4.2.1 (Tree Metric). Let L be a finite set with cardinality n. A function W : L × L → R+ defines a (nondegenerate) tree metric if the following holds. There exist a tree T = (V, E) with leaf set L and a weight function w : E → R++ such that
\[
W(a, b) = \sum_{e \in P_{ab}} w_e
\]
for all a, b ∈ L, where P_{ab} is the path between a and b in T.

Tree metrics are usually estimated from samples of the tree process at the leaves. In that context, Azuma's inequality is useful (see, e.g., [Motwani and Raghavan, 1995]).

Lemma 4.2.1 (Azuma-Hoeffding Inequality). Suppose X = (X_1, ..., X_k) are independent random variables taking values in a set S, and f : S^k → R is any t-Lipschitz function: |f(x) − f(y)| ≤ t whenever x and y differ at just one coordinate. Then, for all λ > 0,
\[
\mathbb{P}\big[ f(X) - \mathbb{E}[f(X)] \geq \lambda \big] \leq \exp\left( - \frac{\lambda^2}{2 t^2 k} \right),
\]
and
\[
\mathbb{P}\big[ f(X) - \mathbb{E}[f(X)] \leq -\lambda \big] \leq \exp\left( - \frac{\lambda^2}{2 t^2 k} \right).
\]

Bipartitions. A useful combinatorial description of a tree T = (V, E) is obtained by noticing that each edge e ∈ E of the tree naturally corresponds to a partition of the leaves L into two subsets (that is, the leaves on either "side" of e). Such partitions are called bipartitions and they characterize the tree: it is easy to generate all bipartitions corresponding to a given tree, and on the other hand, there is a simple efficient iterative procedure to recover a tree from the set of all of its bipartitions. See [Felsenstein, 2004; Semple and Steel, 2003] for details.
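To make the edge-bipartition correspondence concrete, here is a short illustrative Python sketch (ours, not part of the thesis; the tree representation is an arbitrary choice) that lists, for each edge, the set of leaves on one side of that edge; the other side is the complement.

from collections import defaultdict


def leaf_bipartitions(edges, leaves):
    """For each edge (u, v), return the leaves on the u-side once that edge is removed."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    def side(root, blocked):
        # Depth-first search that never crosses the removed edge (root, blocked).
        stack, seen, found = [root], {root, blocked}, set()
        while stack:
            x = stack.pop()
            if x in leaves:
                found.add(x)
            for y in adj[x] - seen:
                seen.add(y)
                stack.append(y)
        return frozenset(found)

    return {(u, v): side(u, v) for u, v in edges}


# Toy routing tree rooted at r with leaves a, b, c, d.
edges = [("r", "x"), ("x", "a"), ("x", "b"), ("r", "y"), ("y", "c"), ("y", "d")]
print(leaf_bipartitions(edges, leaves={"a", "b", "c", "d"}))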
4.2.2 Distorted Metric Algorithms
Classical distance-based reconstruction algorithms (that is, those methods based on tree metrics), such as UPGMA [Sneath and Sokal, 1973] or Neighbor-Joining (NJ) [Saitou and Nei, 1987], typically make use of all pairwise distances between leaves. This leads to difficulties because "long" distances are more "noisy" and require a large number of samples to be accurately estimated. For instance, in the phylogenetic context, the widely used NJ algorithm is computationally efficient, but it is known to require exponentially many samples—even for simple linear trees [Lacey and Chang, 2006]. An important breakthrough was made in Erdős et al. [1999], where it was shown that it is in fact enough to use "short" distances to fully recover the tree under reasonable assumptions.

To help understand this result, we need a notion of tree "depth." Given an edge e ∈ E, the chord depth of e is the length (in graph distance) of the shortest path between two leaves on which e lies, that is,
\[
\Delta(e) = \min \{ d(u, v) : u, v \in L, \ e \in P_{uv} \},
\]
where d is the graph distance on T. We define the chord depth of a tree T to be the maximum chord depth in T,
\[
\Delta(T) = \max \{ \Delta(e) : e \in E \}.
\]
(Note that, unlike Daskalakis et al. [2009], we use the graph distance in the definition of chord depth. Because of our assumptions (see below), the graph and weighted distances are the same up to a constant factor. Note also that we are using a different definition than Erdős et al. [1999], but again the difference is only a constant factor.) It is easy to show that ∆(T) ≤ log₂ n if the degree of all internal nodes is at least 3 (argue by contradiction). In a nutshell, the key insight behind the results in Erdős et al. [1999] is that the diameter and the depth of a tree behave very differently: even though the diameter can be as large as O(n), the depth is always O(log n); in other words, each edge lies on a "short" path between two leaves. Using clever combinatorial arguments, Erdős et al. [1999] showed that one can reconstruct trees with far fewer samples by ignoring those distances corresponding to paths longer than O(log n).

More recently, Daskalakis et al. [2009] relaxed some of the assumptions in Erdős et al. [1999]. In particular, they gave a reconstruction algorithm based on short distances allowing internal degrees bigger than 3—which is particularly relevant in the networking context. Their algorithm, which we will call the DMR algorithm, reconstructs all bipartitions using only distances smaller than a threshold of order O(log n). To check that the algorithm works, one only needs to show that such distances are accurately estimated for a given number of samples. In the tomography setting, the DMR algorithm will allow us to reconstruct the routing tree using as few as poly log n samples (see next section). The details of the algorithm are sketched in Appendix 4.7.2.

We now state a corollary of Daskalakis et al. [2009] that will be useful to us. We first need the following definition, which formalizes the idea that short distances are accurately estimated (and that long distances can in some sense be ignored).

Definition 4.2.2 (Distorted Metric [Mossel, 2007; King et al., 2003]). Let T = (V, E) be a tree with leaf set L and edge weight function w : E → R++. Let W : L × L → R+ be the corresponding tree metric. Fix τ̃, M̃ > 0. We say that Ŵ : L × L → (0, +∞] is a (τ̃, M̃)-distorted metric for T, or a (τ̃, M̃)-distortion of W, if:

1. (Symmetry) Ŵ is symmetric, that is, for all u, v ∈ L, Ŵ(u, v) = Ŵ(v, u);

2. (Distortion) Ŵ is accurate on "short" distances, that is, for all u, v ∈ L, if either W(u, v) < M̃ + τ̃ or Ŵ(u, v) < M̃ + τ̃, then |Ŵ(u, v) − W(u, v)| < τ̃.

Let f, g > 0 be bounds on the edge weights, that is, f ≤ w_e ≤ g for all e ∈ E. We say that such an edge weight function satisfies the (f, g)-condition.

Theorem 4.2.1 (DMR Algorithm [Daskalakis et al., 2009]). Let 0 < f < g < +∞, α̃ < 1/6, and β̃ > 2. There is a polynomial-time algorithm A such that, for all trees T = (V, E) with edge weight function w satisfying the (f, g)-condition and all (α̃f, β̃g∆(T))-distortions Ŵ of W (where W is the tree metric corresponding to w), A applied to Ŵ returns T.

Note that the previous theorem is a deterministic statement about distorted metrics. We show how to estimate such a distorted metric from random samples with high probability in Section 4.3.2.
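For intuition about the chord depth used above, the following illustrative Python sketch (ours; it relies on the networkx package and a made-up caterpillar tree) computes ∆(e) for every edge directly from the definition and contrasts ∆(T) with the diameter.

import itertools

import networkx as nx


def chord_depths(tree, leaves):
    """Delta(e) = min{d(u, v): u, v leaves, e on P_uv}, with d the graph distance."""
    depth = {frozenset(e): float("inf") for e in tree.edges}
    for u, v in itertools.combinations(leaves, 2):
        path = nx.shortest_path(tree, u, v)            # unit edge weights
        d_uv = len(path) - 1
        for e in zip(path, path[1:]):                  # edges on the path P_uv
            key = frozenset(e)
            depth[key] = min(depth[key], d_uv)
    return depth, max(depth.values())                  # per-edge depths and Delta(T)


# A "caterpillar" tree: the diameter grows linearly in n, but Delta(T) stays small.
m = 10
T = nx.path_graph(m)                                   # spine 0, ..., m-1
T.add_edges_from((i, ("leaf", i)) for i in range(m))   # one pendant leaf per spine node
leaves = [x for x in T.nodes if T.degree(x) == 1]
_, depth_T = chord_depths(T, leaves)
print("diameter =", nx.diameter(T), ", Delta(T) =", depth_T)   # e.g. 11 vs. 3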
4.3 Routing Tree Reconstruction
The goal of this section is to reconstruct efficiently the topology of the routing tree using Theorem 4.2.1.
4.3.1 Variance Metric
From Definition 4.2.1, one can define a tree metric by first choosing a tree—in our case, the routing tree—and then defining a weight function on its edges. Any positive quantity can serve as a weight. The important point is that one must be able to estimate the resulting tree metric from samples at the leaves. This governs the choice of the weight function. Let T = (V, E) be the (unknown) routing tree with leaf set L and consider the choice of weights
\[
w^{(2)}_e = \mathrm{Var}[d_e], \qquad \text{for all } e \in E,
\]
and the corresponding tree metric
\[
W^{(2)}(a, b) \equiv \sum_{e \in P_{ab}} \mathrm{Var}[d_e],
\]
for all a, b ∈ L. Our first task is to check that this metric can be estimated from samples at the leaves. Let a, b be leaves and consider the quantity δ^(2)_ab ≡ Var[D_a − D_b] (where recall from (4.1.1) that D_u is the delay at u). The delays D_a and D_b are observed at the leaves a and b respectively, and therefore the variance of D_a − D_b can be easily estimated. Moreover, we claim that the equality δ^(2)_ab = W^(2)(a, b) holds. Indeed, denote by γ_ab the common ancestor of a and b, that is, the node at which all three paths P_ab, P_0a, and P_0b intersect (where we assume a, b ≠ 0). Then, by independence of the edge delays, we have
\[
\delta^{(2)}_{ab} = \mathrm{Var}[D_a - D_b]
= \mathrm{Var}\Big[ \sum_{e \in P_{a\gamma_{ab}}} d_e - \sum_{e \in P_{\gamma_{ab} b}} d_e \Big]
= \sum_{e \in P_{a\gamma_{ab}}} \mathrm{Var}[d_e] + \sum_{e \in P_{\gamma_{ab} b}} \mathrm{Var}[d_e]
= W^{(2)}(a, b).
\]
Therefore, we can estimate W^(2) by estimating δ^(2) at the leaves. To estimate δ^(2)_ab from k samples, we use the standard unbiased estimator for the variance of D_a − D_b,
\[
\hat\delta^{(2)}_{ab} = \frac{1}{k-1} \sum_{i=1}^{k} \Big[ (D_a^i - D_b^i) - \hat\delta^{(1)}_{ab} \Big]^2,
\]
where
\[
\hat\delta^{(1)}_{ab} = \frac{1}{k} \sum_{i=1}^{k} (D_a^i - D_b^i).
\]
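As an illustration (a sketch of ours, not code from the thesis; the toy tree, the uniform edge delays, and the sample size are made-up assumptions), the following Python snippet simulates bounded edge delays on a three-edge routing tree, forms δ̂^(1)_ab and δ̂^(2)_ab exactly as above, and compares δ̂^(2)_ab with the sum of edge variances along P_ab.

import numpy as np

rng = np.random.default_rng(0)

# Toy routing tree 0 -> 1 -> {a, b}; each edge delay is uniform on [0, M_e],
# so Var[d_e] = M_e**2 / 12, and delays at the leaves are sums of edge delays.
M = {"0-1": 4.0, "1-a": 2.0, "1-b": 3.0}
k = 200_000
d_01 = rng.uniform(0, M["0-1"], size=k)
d_1a = rng.uniform(0, M["1-a"], size=k)
d_1b = rng.uniform(0, M["1-b"], size=k)
D_a, D_b = d_01 + d_1a, d_01 + d_1b                      # observed leaf delays

diff = D_a - D_b
delta1_hat = diff.mean()                                 # \hat\delta^{(1)}_{ab}
delta2_hat = ((diff - delta1_hat) ** 2).sum() / (k - 1)  # \hat\delta^{(2)}_{ab}

# The shared edge 0-1 cancels in D_a - D_b, so only edges on P_ab contribute.
W2_ab = M["1-a"] ** 2 / 12 + M["1-b"] ** 2 / 12
print(delta2_hat, W2_ab)                                 # the two values are close for large k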
Below, we will need to show that δ̂^(2)_ab is well concentrated around δ^(2)_ab, which follows from the Azuma-Hoeffding inequality (see Lemma 4.2.1). The next lemmas provide the necessary Lipschitz condition.

Lemma 4.3.1. Suppose X = (X_1, ..., X_k) are independent random variables taking values in [−B, B] with k ≥ 2. Then, the variance estimator
\[
s^2_X = \frac{1}{k-1} \sum_{i=1}^{k} (X_i - \bar X)^2 = \frac{1}{k(k-1)} \sum_{i < j} (X_i - X_j)^2,
\]
where X̄ is the sample average, is a Lipschitz function of (X_1, ..., X_k) in the sense of Lemma 4.2.1, with constant of order B²/k (one checks directly that changing a single X_i changes s²_X by at most 8B²/k).
Proposition 4.5.1. Let a, b ∈ L be leaves at graph distance at most 2∆, where ∆ = ∆(T) is the depth of T, and fix j ∈ N. We have the following (where the constants depend on J and M only):

1. There exists a constant C such that, for all λ > 0,
\[
\mathbb{P}\Big[ \big| \hat\delta^{(j)}_{ab} - \mathbb{E}\big[\hat\delta^{(j)}_{ab}\big] \big| > \lambda \Big] \leq 2 \exp\left( - \frac{\lambda^2 k}{C \Delta^{2j-1}} \right). \tag{4.5.1}
\]

2. There exists a constant C' such that
\[
\mathbb{E}\Big[ \big| \hat\delta^{(1)}_{ab} - \delta^{(1)}_{ab} \big|^j \Big] \leq C' \frac{\Delta^j}{k^{j/2}}, \tag{4.5.2}
\]
where δ^(1)_ab = E[D_a − D_b].

3. There exists a constant C'' such that, if k ≥ ∆²,
\[
\Big| \mathbb{E}\big[\hat\delta^{(j)}_{ab}\big] - \delta^{(j)}_{ab} \Big| \leq C'' \frac{M^{2j} \Delta^{j+1}}{\sqrt{k}}. \tag{4.5.3}
\]

4. If further
\[
C'' \frac{M^{2j} \Delta^{j+1}}{\sqrt{k}} \leq \lambda,
\]
then we have
\[
\mathbb{P}\Big[ \big| \hat\delta^{(j)}_{ab} - \delta^{(j)}_{ab} \big| > 2\lambda \Big] \leq 2 \exp\left( - \frac{\lambda^2 k}{C \Delta^{2j-1}} \right). \tag{4.5.4}
\]
Proof. 1. We use Azuma's inequality (see Lemma 4.2.1). Let
\[
K_i = (D_a^i - D_b^i) - \hat\delta^{(1)}_{ab},
\]
where D_u^i is the i-th delay sample at node u. Because |P_ab| ≤ 2∆ and d_e ∈ [0, M] for all e, it follows that |K_i| ≤ 4M∆. Then let
\[
L = \frac{1}{k} \sum_{i=1}^{k} (K_i)^j,
\]
and let L' be the same quantity when an arbitrary d_e^i is perturbed by δ with |δ| ≤ M (where d_e^i is the i-th delay sample on edge e). Without loss of generality, assume the perturbation is in the first sample. Then,
\[
L' = \frac{1}{k} \left[ \Big( K_1 + \frac{k-1}{k}\,\delta \Big)^j + \sum_{i=2}^{k} \Big( K_i - \frac{\delta}{k} \Big)^j \right]. \tag{4.5.5}
\]
Now expanding (4.5.5), we get
\[
|L - L'| \leq \frac{1}{k} \left[ 2^j (4M\Delta)^{j-1} M + (k-1)\, 2^j (4M\Delta)^{j-1} \frac{M}{k} \right] \leq C \frac{\Delta^{j-1}}{k},
\]
for some constant C depending on M, J. Noting that L depends on at most 2∆k random variables d_e^i, we get the result by an application of Azuma's inequality (for a different C).

2. Note that
\[
L = \delta^{(1)}_{ab} - \hat\delta^{(1)}_{ab}
\]
is a (2M∆/k)-Lipschitz function of {D_a^i − D_b^i}_{i∈[k]}; thus we have, by Azuma's inequality,
\[
\mathbb{P}\Big[ \big| \delta^{(1)}_{ab} - \hat\delta^{(1)}_{ab} \big| > \lambda \Big] \leq 2 \exp\left( - \frac{k \lambda^2}{8 M^2 \Delta^2} \right).
\]
Now we use the fact that, for a positive random variable Y,
\[
\mathbb{E}\big[ Y^j \big] = j \int_0^{\infty} \lambda^{j-1}\, \mathbb{P}(Y > \lambda)\, d\lambda.
\]
If Y = |δ^(1)_ab − δ̂^(1)_ab| and ψ = k/(8M²∆²), we have
\[
\mathbb{E}\big[ Y^j \big] \leq \psi^{-j/2} \int_0^{+\infty} y^{j/2 - 1} e^{-y}\, dy = \left( \frac{8 M^2 \Delta^2}{k} \right)^{j/2} C'.
\]
That proves 2 (for a different C').

3. We have
\[
\hat\delta^{(j)}_{ab} = \frac{1}{k} \sum_{i=1}^{k} \Big( D_a^i - D_b^i - \hat\delta^{(1)}_{ab} \Big)^j
= \frac{1}{k} \sum_{i=1}^{k} \Big( \big(D_a^i - D_b^i - \delta^{(1)}_{ab}\big) + \big(\delta^{(1)}_{ab} - \hat\delta^{(1)}_{ab}\big) \Big)^j.
\]
Now expand using the binomial theorem and take expectations to get
\[
\Big| \mathbb{E}\big[\hat\delta^{(j)}_{ab}\big] - \delta^{(j)}_{ab} \Big|
\leq \frac{1}{k} \sum_{i=1}^{k} \sum_{h=0}^{j-1} \binom{j}{h} \Big| \mathbb{E}\Big[ \big(D_a^i - D_b^i - \delta^{(1)}_{ab}\big)^h \big(\hat\delta^{(1)}_{ab} - \delta^{(1)}_{ab}\big)^{j-h} \Big] \Big|
\leq C'' (4M\Delta)^j \max_{0 \leq h \leq j-1} \mathbb{E}\Big[ \big| \delta^{(1)}_{ab} - \hat\delta^{(1)}_{ab} \big|^{j-h} \Big].
\]
Note that, by k ≥ ∆², it follows that the maximum is attained at h = j − 1 in (4.5.2).

4. This follows from 1. and 3.

We then get the main theorem in the symmetric case. Recall that J = O(1) and that, in general, ∆ = O(log n), where n is the number of leaves.
Theorem 4.5.1. Let ε > 0 be arbitrarily small. If k = ω(∆^{2J²} log n), then after an application of SymER, one has
\[
\mathbb{P}\Big[ \big| \hat\delta^{(j)}_{e} - \delta^{(j)}_{e} \big| \leq \varepsilon, \ \forall e \in E, \ \forall 1 \leq j \leq J \Big] \geq 1 - o(1), \tag{4.5.6}
\]
as n → +∞. The algorithm runs in time O(∆^J n²).

Proof. Let (a, b) ∈ L × L be called a short pair if a, b are at graph distance at most 2∆. Denote by S the set of all short pairs. Let
\[
\sigma_j = \max_{(a,b) \in S} \big| \widehat W^{(j)}(a, b) - W^{(j)}(a, b) \big|,
\]
and
\[
\Sigma_j = \max_{1 \leq i \leq j} \sigma_i.
\]
It follows immediately from the application of the AFI algorithm that
\[
\max_{e \in E} \big| \hat w^{(j)}_e - w^{(j)}_e \big| \leq 2 \sigma_j.
\]
Therefore, it suffices to prove Σ_J = o(1), with high probability as n tends to +∞. Further, assume we have a uniform bound
\[
\max_{1 \leq j \leq J} \max_{(a,b) \in S} \big| \hat\delta^{(j)}_{ab} - \delta^{(j)}_{ab} \big| \leq \tau^*.
\]
Recall that
\[
\widehat W^{(j)}(a, b) = \hat\delta^{(j)}_{ab} - \widehat F_j(a, b),
\qquad \text{where} \qquad
\widehat F_j(a, b) = \sum_{(x, y) \in D_j(a, b)} \binom{j}{x, y} \prod_{i=1}^{\alpha} \hat w^{(x_i)}_{e_i} \prod_{i=1}^{\beta} (-1)^{y_i} \hat w^{(y_i)}_{f_i}.
\]
Note that F̂_j(a, b) has at most ∆^j terms (including the multinomial factor). Therefore, since the function
\[
h(x) = \prod_{j=1}^{J} x_j
\]
is continuously differentiable with bounded derivatives in [−M^J, M^J], there is C (depending on M, J) such that
\[
\sigma_j \leq \tau^* + C \Delta^j (2 \Sigma_{j-1}),
\]
for small Σ_{j−1}. Then we have
\[
\Sigma_J \leq \tau^* C^* \Delta^{J^2/2},
\]
for some C* > 0 depending on J, M, where we used σ_2 ≤ τ*. So it suffices to have τ* = (ω_n ∆^{J²/2})^{−1}, where ω_n → +∞ as n → +∞ arbitrarily slowly.

By the last part of Proposition 4.5.1, using a union bound over the O(n²) short pairs of leaves, it follows that k = C' ω_n ∆^{2J²} log n samples are enough to guarantee
\[
\mathbb{P}\Big[ \big| \hat\delta^{(j)}_{ab} - \delta^{(j)}_{ab} \big| \leq \big(\omega_n \Delta^{J^2/2}\big)^{-1}, \ \forall 1 \leq j \leq J, \ \forall \text{ short pairs } a, b \Big] \geq 1 - o(1),
\]
for some C' depending on J, M.

As for the computational complexity of the algorithm, assume first that the tree is represented in such a way that finding the set of edges on the path between two leaves a, b at distance O(∆) takes time O(∆) (this is easy in a rooted tree). Note that for each j, a, b the sum
\[
\widehat F_j(a, b) = \sum_{(x, y) \in D_j(a, b)} \binom{j}{x, y} \prod_{i=1}^{\alpha} \hat w^{(x_i)}_{e_i} \prod_{i=1}^{\beta} (-1)^{y_i} \hat w^{(y_i)}_{f_i}
\]
can be computed in time ∆^J. Since there are O(n²) pairs of leaves, the total complexity is O(∆^J n²).

Similarly, in the general case, we get:

Proposition 4.5.2. Let a, b, c ∈ L be at graph distance less than 2∆, where ∆ = ∆(T) is the depth of T. Fix j ∈ N. We have the following (where the constants depend on J and M only):

1. There exists a constant C such that, for all λ > 0,
\[
\mathbb{P}\Big[ \big| \hat\phi^{(j)}_{ab|c} - \mathbb{E}\big[\hat\phi^{(j)}_{ab|c}\big] \big| > \lambda \Big] \leq 2 \exp\left( - \frac{\lambda^2 k}{C \Delta^{2j-1}} \right). \tag{4.5.7}
\]

2. There exists a constant C' such that, if k ≥ ∆²,
\[
\Big| \mathbb{E}\big[\hat\phi^{(j)}_{ab|c}\big] - \phi^{(j)}_{ab|c} \Big| \leq C' \frac{(M\Delta)^j}{\sqrt{k}}. \tag{4.5.8}
\]

3. If further
\[
C' \frac{(M\Delta)^j}{\sqrt{k}} \leq \lambda,
\]
then we have
\[
\mathbb{P}\Big[ \big| \hat\phi^{(j)}_{ab|c} - \phi^{(j)}_{ab|c} \big| > 2\lambda \Big] \leq 2 \exp\left( - \frac{\lambda^2 k}{C \Delta^{2j-1}} \right). \tag{4.5.9}
\]
Proof Sketch: The proof is very similar to that of Proposition 4.5.1. We only give a sketch. To prove 1., it is enough to consider four separate cases depending on the path segment (corresponding to H₁, H₂, H₃ and H₄ in Figure 4.5) in which we make the perturbation. To prove 2., note that we can write
\[
\hat\phi^{(j)}_{ab|c} = \frac{1}{k} \sum_{i=1}^{k} (X_i + \mathrm{I})^{j-1} (Y_i + \mathrm{II}), \tag{4.5.10}
\]
with X_i = (D_a^i − D_b^i) − δ^(1)_ab, Y_i = (D_a^i − D_c^i − δ^(1)_ac) + (D_b^i − D_c^i − δ^(1)_bc), and
\[
\mathrm{I} = \delta^{(1)}_{ab} - \hat\delta^{(1)}_{ab},
\qquad
\mathrm{II} = \big(\delta^{(1)}_{ac} - \hat\delta^{(1)}_{ac}\big) + \big(\delta^{(1)}_{bc} - \hat\delta^{(1)}_{bc}\big).
\]
Also note that φ^(j)_{ab|c} = E[X_i^{j−1} Y_i], ∀i. Use the binomial theorem to expand the expression in (4.5.10) and write it as
\[
\hat\phi^{(j)}_{ab|c} = \frac{1}{k} \sum_{i=1}^{k} X_i^{j-1} Y_i + R,
\]
where the error term is
\[
R = \mathrm{II}\, \frac{1}{k} \sum_{i=1}^{k} (X_i + \mathrm{I})^{j-1}
+ \frac{1}{k} \sum_{i=1}^{k} Y_i \sum_{l=1}^{j-1} \binom{j-1}{l} \mathrm{I}^{\,l} X_i^{j-1-l}.
\]
Now use the fact that |X_i| ≤ 4M∆, |Y_i| ≤ 8M∆, and Part 2. of Proposition 4.5.1 to conclude that
\[
\mathbb{E}\big[ |R| \big] \leq C' \frac{(M\Delta)^j}{\sqrt{k}}.
\]
Part 3. now follows by combining Parts 1. and 2.
Theorem 4.5.2. Let ε > 0 be arbitrarily small. If k = ω(∆^{2J²} log n), then after an application of ER, one has
\[
\mathbb{P}\Big[ \big| \hat\delta^{(j)}_{e} - \delta^{(j)}_{e} \big| \leq \varepsilon, \ \forall e \in E, \ \forall 1 \leq j \leq J \Big] \geq 1 - o(1), \tag{4.5.11}
\]
as n → +∞. The algorithm runs in time O(∆^J n²).

Proof. The proof is identical to that of Theorem 4.5.1.
4.6 Discussion
There are several directions in which we can extend the results in this chapter. We have assumed that delays are finitely supported. This assumption is not essential. Unbounded distributions for which similar concentration inequalities can be obtained lead to the same results. For example, using [Ledoux, 2001, Proposition 4.18], one can treat the case of Exponential and Gamma delays. Moreover, it is an interesting problem, from a practical point of view, to improve the dependence of our results on J. It is somewhat intriguing that the reconstruction of the topology of the tree required the joint distributions on pairs of leaves whereas the reconstruction of delays (in the asymmetric case) required the joint distributions on triples of
leaves. A similar situation holds in phylogenetics [Chang, 1996]. It could be interesting to prove that this is indeed necessary in some sense. Throughout, the model was assumed to be static. In real-life networks, characteristics of the network change over time. One could try to adapt our algorithm to a more dynamic setting. See for example Cao et al. [2000] for a discussion of temporal issues.
4.7 Algorithm Details

4.7.1 Examples of Regular Delay Distributions
Below, we give two typical examples of families of distributions covered by our results. The first example is a set of continuous distributions with few parameters. The second example is a general discrete distribution. The latter is the main focus of Presti et al. [2002].

Uniform distributions. Let Q = {Q_θ}_{θ∈Θ} be the family of distributions where Q_θ is uniform on [0, θ], with Θ = [θ̲, θ̄] for some 0 < θ̲ < θ̄ < +∞. Let ŵ^(2) be the estimated variance and define
\[
\hat\theta^2 = \Psi(\hat w^{(2)}) =
\begin{cases}
\underline\theta^2, & \text{if } 12 \hat w^{(2)} < \underline\theta^2, \\
\bar\theta^2, & \text{if } 12 \hat w^{(2)} > \bar\theta^2, \\
12 \hat w^{(2)}, & \text{otherwise}.
\end{cases}
\]
Assume |ŵ^(2) − w^(2)| ≤ δ ≡ εθ̲²/12. From θ² − θ̂² = (θ − θ̂)(θ + θ̂), it follows easily that |θ − θ̂| ≤ εθ̲/2. Note that
\[
\| Q_\theta - Q_{\hat\theta} \|_1 = \int_0^{\bar\theta} \left| \frac{\mathbf{1}_{x \leq \theta}}{\theta} - \frac{\mathbf{1}_{x \leq \hat\theta}}{\hat\theta} \right| dx,
\]
and, assuming w.l.o.g. that θ > θ̂ (the other case is symmetric),
\[
\int_0^{\bar\theta} \left| \frac{\mathbf{1}_{x \leq \theta}}{\theta} - \frac{\mathbf{1}_{x \leq \hat\theta}}{\hat\theta} \right| dx
= \hat\theta \left| \frac{1}{\theta} - \frac{1}{\hat\theta} \right| + (\theta - \hat\theta)\, \frac{1}{\theta}
\leq \frac{2}{\underline\theta}\, |\theta - \hat\theta| \leq \varepsilon.
\]
Therefore, Q is (ε, 2)-regular for any ε > 0.
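For concreteness, a minimal Python sketch of the map Ψ (our own illustration; the function and variable names are ours): it clips the moment-matched value 12ŵ^(2) to the a priori range [θ̲², θ̄²] before taking the square root.

import math


def psi(w2_hat, lo, hi):
    """Return theta_hat, where theta_hat**2 = Psi(w2_hat) and Var[Unif(0, t)] = t**2 / 12."""
    theta2 = 12.0 * w2_hat
    theta2 = min(max(theta2, lo ** 2), hi ** 2)        # clip to [lo**2, hi**2]
    return math.sqrt(theta2)


# True theta = 3 has variance 0.75; a noisy variance estimate maps back close to 3.
print(psi(0.7, lo=1.0, hi=10.0))                       # ~2.898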
Bounded discrete distributions. Let M be a positive integer and let [M] = {0, 1, ..., M}. Also, let 0 < θ̲ < 1 and
\[
\Theta = \Big\{ \theta = (\theta_0, \theta_1, \ldots, \theta_M) : 0 \leq \theta_i \leq 1, \ \forall i \in [M], \ \theta_0 > \underline\theta, \text{ and } \sum_{i \in [M]} i \theta_i \in [M] \Big\}.
\]
Denote by Q = {Q_θ}_{θ∈Θ} the family of distributions on [M] such that X ∼ Q_θ means
\[
\mathbb{P}[X = i] = \theta_i, \qquad \forall i \in [M].
\]
The assumption on the mean of X in the definition of Θ greatly simplifies the calculations below. It is a reasonable approximation in the standard practical case where Q is a discretization of continuous densities with a large number of bins M. The assumption on θ_0 simply indicates that the distribution has been translated to "start at 0." Define μ = E[X], where X ∼ Q_θ, and let θ' = (θ'_{−M}, θ'_{−M+1}, ..., θ'_M), where θ'_{i−μ} = θ_i for all i ∈ [M] and θ'_i = 0 otherwise. Note that the following holds:
\[
\sum_{i=-M}^{M} i^j \theta'_i = w^{(j)}(\theta), \qquad \forall j \in [2M + 1],
\]
or, in matrix form, Λθ' = w. From the Vandermonde structure of Λ it follows easily that det Λ ≥ 1, that is, Λ^{−1} exists, and furthermore ‖Λ^{−1}‖₁ is a strictly positive constant depending on θ̲ and M. Let ŵ be the estimate of w and let θ̂' = Λ^{−1}ŵ. Then, it follows that for any ε > 0 there is δ > 0 such that
\[
\| \hat\theta' - \theta' \|_1 \leq \| \Lambda^{-1} \|_1\, \| w - \hat w \|_1 \leq \varepsilon,
\]
whenever ‖w − ŵ‖_∞ ≤ δ. Assume further that ε < θ̲/2; then we can recover an estimate θ̂ of θ from θ̂' such that ‖θ̂ − θ‖₁ ≤ ε. Indeed, our assumptions above allow us to infer a distribution centered at 0, which we then translate to start at 0. Therefore, Q is (ε, 2M − 1)-regular. Note that, strictly speaking, one should force all components of θ̂ to be in [0, 1] and renormalize appropriately. Details are omitted.
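The moment-matching step Λθ' = w can be sketched as follows (an illustration under our own indexing j = 0, ..., 2M, which is an assumption and not necessarily the thesis's exact convention): build the Vandermonde-type matrix with entries i^j over the centered support and invert it to recover the probabilities from the moments.

import numpy as np

M = 2
support = np.arange(-M, M + 1)                           # centered support -M, ..., M
theta_prime = np.array([0.1, 0.2, 0.4, 0.2, 0.1])        # a mean-zero distribution

# Lambda[j, i] = support[i]**j for j = 0, ..., 2M (a square Vandermonde-type system).
Lam = np.vander(support, N=2 * M + 1, increasing=True).T
w = Lam @ theta_prime                                    # exact moments; in practice, estimates

theta_hat = np.linalg.solve(Lam, w)                      # = Lambda^{-1} w
print(np.round(theta_hat, 6))                            # recovers theta_prime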
4.7.2 DMR Algorithm
We shall now provide an outline of the DMR algorithm. The general DMR algorithm actually allows the user to build a "forest" when the number of samples is too small. We will not use this feature here and we therefore simplify the algorithm accordingly. The input to the algorithm is a (τ̃, M̃)-distorted metric Ŵ on n leaves. In particular, we assume that the values τ̃ and M̃ are known to the algorithm. We denote the true tree by T = (V, E). Take α, α' > 0 and 0 < β, β' < 1 such that
\[
6 < \alpha' + 3 < \alpha < (\tilde\alpha)^{-1},
\]
and
\[
(\tilde\beta)^{-1} \widetilde M + \tilde\tau < \beta \widetilde M < \tfrac{1}{2}\big[ \beta' \widetilde M - 3 \tilde\tau \big].
\]
(Here it is assumed that M̃ = ω(τ̃).) The details of the subroutines Mini Contractor and Extender can be found in Figures 4.9 and 4.10. The reader is referred to Daskalakis et al. [2009] for a detailed explanation of the algorithm—which is somewhat involved. In a nutshell, for each pair of leaves u, v that are not "too far": 1) the algorithm finds all edges sitting on the path between u and v (as illustrated in Figure 4.7); 2) then it derives the bipartitions corresponding to these edges by "extending" the bipartitions in a small ball around u, v (as illustrated in Figure 4.8).

• Pre-Processing: Proximity Test. Build the graph Ĥ_β = (V̂_β, Ê_β), where V̂_β = L and (u, v) ∈ Ê_β ⟺ Ŵ(u, v) < βM̃;

• Main Loop.
  – For all pairs of leaves u, v ∈ V̂_β such that (u, v) ∈ Ê_β:
    ∗ Mini Reconstruction. Compute {ψ_j(u, v)}_{j=1}^{r(u,v)} := Mini Contractor(Ĥ_β; u, v);
    ∗ Bipartition Extension. Compute {ψ̄_j(u, v)}_{j=1}^{r(u,v)} := Extender(Ĥ_β, {ψ_j(u, v)}_{j=1}^{r(u,v)}; u, v);
  – Deduce the tree T̂ from {ψ̄_j(u, v)}_{j=1}^{r(u,v)};
Figure 4.7: Illustration of routine Mini Contractor.
Figure 4.8: Illustration of routine Extender.
• Output. Return the resulting tree T̂.
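The pre-processing step is simple enough to state in code. The sketch below is ours (names such as W_hat are assumptions, not thesis notation made executable); it builds the proximity graph Ĥ_β by keeping only pairs of leaves whose estimated distance falls below the threshold βM̃.

import itertools

import networkx as nx


def proximity_graph(leaves, W_hat, beta, M_tilde):
    """Keep an edge (u, v) only when the estimated distance W_hat(u, v) < beta * M_tilde."""
    H = nx.Graph()
    H.add_nodes_from(leaves)
    for u, v in itertools.combinations(leaves, 2):
        if W_hat(u, v) < beta * M_tilde:               # only "short", hence reliable, distances
            H.add_edge(u, v)
    return H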
Algorithm Mini Contractor
Input: Graph Ĥ_β; Leaves u, v;
Output: Bipartitions {ψ_j(u, v)}_{j=1}^{r(u,v)};

• Ball. Let
  B̂^(0)_{β'}(u, v) := {w ∈ Ĥ_β : Ŵ(u, w) ∨ Ŵ(v, w) < β'M̃};

• Intersection Points. For all w ∈ B̂^(0)_{β'}(u, v), estimate the point of intersection between u, v, w (distance from u), that is,
  Φ̂_w := ½ ( d̂(u, v) + d̂(u, w) − d̂(v, w) );

• Long Edges. Set S := B̂^(0)_{β'}(u, v) − {u}, x_{−1} = u, j := 0;
  – Until S = ∅:
    ∗ Let x_0 = arg min{Φ̂_w : w ∈ S} (break ties arbitrarily);
    ∗ If Φ̂_{x_0} − Φ̂_{x_{−1}} ≥ α'τ̃, create a new edge by setting ψ_{j+1}(u, v) := {B̂^(0)_{β'}(u, v) − S, S} and let C_{j+1} := {x_0}, j := j + 1;
    ∗ Else, set C_j := C_j ∪ {x_0};
    ∗ Set S := S − {x_0}, x_{−1} := x_0;

• Output. Return the bipartitions {ψ_j(u, v)}_{j=1}^{r(u,v)}.

Figure 4.9: Algorithm Mini Contractor.
Algorithm Extender
Input: Graph Ĥ_β; Bipartitions {ψ_j(u, v)}_{j=1}^{r(u,v)}; Leaves u, v;
Output: Bipartitions {ψ̄_j(u, v)}_{j=1}^{r(u,v)};

• For j = 1, . . . , r(u, v) (unless r(u, v) = 0):
  – Initialization. Denote by ψ_j^(u)(u, v) the vertex set containing u in the bipartition ψ_j(u, v), and similarly for v; initialize the extended partition ψ̄_j^(u)(u, v) := ψ_j^(u)(u, v), ψ̄_j^(v)(u, v) := ψ_j^(v)(u, v);
  – Modified Graph. Let K be Ĥ_β where all edges between ψ_j^(u)(u, v) and ψ_j^(v)(u, v) have been removed;
  – Extension. For all w ∈ V̂_β − (ψ_j^(u)(u, v) ∪ ψ_j^(v)(u, v)), add w to the side of the partition it is connected to in K (by definition of K, each w as above is connected to exactly one side);

• Return the bipartitions {ψ̄_j(u, v)}_{j=1}^{r(u,v)}.

Figure 4.10: Algorithm Extender.
Chapter 5
Contributions and suggested directions

In this chapter we summarize the main contributions of the thesis and present some suggested directions for future research.
5.1 Contributions
In this thesis we studied three different problems in estimation and structural inference in sensor networks. Chapter 2 studies the consensus averaging problem on graphs under general imperfect communication. We demonstrate that damped updates involving messages from neighbors with properly chosen weights can lead to procedures that achieve strong consistency and good asymptotic performance, requiring only local distributed computation. The main result shows that the MSE is controlled by the spectral gap of the Laplacian. Chapter 3 proposes various methods for computing quantiles of the sensed data in a communication-efficient and sequential way, without requiring any prior parameterization of the statistical distribution. The method is based on sequential stochastic approximation theory and is simple to implement in practice. One side benefit is that accurate performance estimates can be computed. We propose the method both for power-constrained two-way networks, such as embedded infrastructure sensors, and for one-way power-constrained networks, such as a mobile user who uses existing quantile estimates for his own decision making. Chapter 4 uses computational phylogenetic techniques to solve the multicast network
delay inference problem. The proposed algorithm relies again on local computation to reconstruct both network communication delay statistics and the network structure itself. The good performance of the algorithm permits its use in various practical applications.
5.2 Suggested directions
We have shown the benefits of using a statistically principled approach for designing algorithms for statistical processing in sensor networks. Our solutions relied on three conceptual principles: hierarchical processing structures, where decisions from lower layers can be locally computed and passed to higher layers; balancing between computation and estimation, by considering limited data transmission and accounting for the resulting errors in the estimation process; and methodologies that seek approximate optimality in place of exact optimality, thereby reducing computational complexity. In the remainder of this section we discuss some opportunities for future methodological research.
5.2.1 Statistical hierarchical processing
New sensing and control systems are addressing large applications where estimation and decision making need to be decentralized. For example, in urban traffic, various local decisions are made in traffic signal controllers based on inferences about queuing delay and route demands. Addressing coordination has been a long-standing problem in such systems, mainly due to two issues: measurement and inference of the system state is complex, as the systems have many distributed components; and the effects of the statistical model on the control system have not been studied deeply in such scenarios. For example, if a statistical model is slow to track sudden changes in the behavior of traffic, will the traffic light controls respond properly? Or should the controls themselves account for statistical uncertainties, and use more adaptable estimators? The study of tradeoffs such as these for large social-physical systems is an emerging area of research, with various important statistical problems.
5.2.2 Nonparametric sequential statistical methods
Nonparametric statistical methods have enjoyed a resurgence in recent years due to the large amounts of data available in concrete applications. Most methodologies have focused on the fundamental task of block inference of unobserved random variables. In many situations, such as those present in sensor networks, inference needs to be performed in a sequential fashion. This imposes another level of tradeoffs, as early stopping can be understood as a form of regularization as well. Moreover, limited communication bandwidth imposes distribution and decentralization requirements on any proposed solution. Investigating algorithms for sequential nonparametric inference under communication constraints is a promising area of research.
5.2.3 Statistical methods to detect, and be robust to, point changes
A central problem in large monitoring and control systems is the detection of spatially and temporally distributed abrupt changes. Hidden variables can be physical, such as the flow on an edge of the network, or more abstract, such as a change in the aggregate preferences of drivers that use the roads. Typically there are multiple correlated change events in a network. For example, in the multicast network inference problem, part of the network structure might experience a change due to reconfiguration of the network itself. Detecting such changes and re-estimating the relevant quantities should require a smaller sample complexity than an initial complete estimation. We are currently pursuing the formulation of such problems in the context of sensor networks and the inference mechanisms proposed here. Another important characteristic of statistical methodologies that can integrate monitoring and control of large systems is robustness to point perturbations. For example, certain sensors might fail unexpectedly, or fail and remain undetected, or sudden events in the system might not reflect true behavior at smaller timescales. We view a point change as a fixed, pointwise perturbation of a variable. Other types of perturbation arise from the nature of sensor networks, such as communication link failures or localized sudden changes in measurements, which do not necessarily reflect a longer-term change in the dynamics. Understanding the sensitivity of methods to point perturbations, and considering robustness issues in the context of a sensor network, is an interesting research direction as well.
Bibliography N. Alon. Eigenvalues and expanders. Combinatorica, 6(2):83–96, 1986. N. Alon and J. Spencer. The Probabilistic Method. Wiley Interscience, New York, 2000. S. Amari and T. S. Han. Statistical inference under multiterminal rate restrictions: A differential geometric approach. IEEE Trans. Info. Theory, 35(2):217–227, March 1989. K. Atteson. The performance of neighbor-joining methods of phylogenetic reconstruction. Algorithmica, 25(2-3):251–278, 1999. ISSN 0178-4617. E. Ayanoglu. On optimal quantization of noisy sources. IEEE Trans. Info. Theory, 36(6): 1450–1452, 1990. T. C. Aysal, M. Coates, and M. Rabbat. Distributed average consensus using probabilistic quantization. In IEEE Workshop on Stat. Sig. Proc., Madison, WI, August 2007a. T. C. Aysal, M. Coates, and M. Rabbat. Rates of convergence for distributed average consensus with probabilistic quantization. In Proc. Allerton Conference on Communication, Control, and Computing, 2007b. T. C. Aysal, M. J. Coates, and M. G. Rabbat. Distributed average consensus with dithered quantization. IEEE Trans. Signal Processing, 56(10):4905–4918, October 2008. A. Benveniste, M. Metivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, New York, NY, 1990. S. Bhamidi, R. Rajagopal, and S. Roch. Network delay inference from additive metrics. Preprint. Available at Arxiv: math.PR/0604367, 2006. R. S. Blum, S. A. Kassam, and H. V. Poor. Distributed detection with multiple sensors: Part II — advanced topics. Proceedings of the IEEE, 85:64–79, 1997.
97 V. Borkar and P. Varaiya. Asymptotic agreement in distributed estimation. IEEE Trans. Auto. Control, 27(3):650–655, 1982. S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006. P. Buneman. The recovery of trees from measures of dissimilarity. In Mathematics in the Archaelogical and Historical Sciences, pages 187–395. Edinburgh University Press, Edinburgh, 1971. Ram´ on C´ aceres, N. G. Duffield, Joseph Horowitz, and Donald F. Towsley. Multicast-based inference of network-internal loss characteristics. IEEE Trans. Inform. Theory, 45(7): 2462–2480, 1999. ISSN 0018-9448. Jin Cao, Drew Davis, Scott Vander Wiel, and Bin Yu. Time-varying network tomography: router link data. J. Amer. Statist. Assoc., 95(452):1063–1075, 2000. ISSN 0162-1459. R. Carli, F. Fagnani, P. Frasca, T. Taylor, and S. Zampieri. Average consensus on networks with transmission noise or quantization. In Proceedings of European Control Conference, 2007. Rui Castro, Mark Coates, Gang Liang, Robert Nowak, and Bin Yu. Network tomography: recent developments. Statist. Sci., 19(3):499–517, 2004. ISSN 0883-4237. J. F. Chamberland and V. V. Veeravalli. Asymptotic results for decentralized detection in power constrained wireless sensor networks. IEEE Journal on Selected Areas in Communication, 22(6):1007–1015, August 2004. Joseph T. Chang. Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Math. Biosci., 137(1):51–73, 1996. ISSN 0025-5564. C. Chong and S. P. Kumar. Sensor networks: Evolution, opportunities, and challenges. Proceedings of the IEEE, 91:1247–1256, 2003. F.R.K. Chung. Spectral Graph Theory. American Mathematical Society, Providence, RI, 1991.
Constantinos Daskalakis, Elchanan Mossel, and Sébastien Roch. Phylogenies without branch bounds: Contracting the short, pruning the deep. To appear in RECOMB'09. Preprint available as arXiv:0801.4190v1, 2009. M. H. deGroot. Reaching a consensus. Journal of the American Statistical Association, 69(345):118–121, March 1974. P. Denantes, F. Benezit, P. Thiran, and M. Vetterli. Which distributed averaging algorithm should I choose for my sensor network? In The 27th IEEE Conference on Computer Communications (INFOCOM 2008), pages 986–994, 2008. A. G. Dimakis, A. Sarwate, and M. J. Wainwright. Geographic gossip: Efficient averaging for sensor networks. IEEE Trans. Signal Processing, 53:1205–1216, March 2008. Péter L. Erdős, Michael A. Steel, László A. Székely, and Tandy J. Warnow. A few logs suffice to build (almost) all trees. I. Random Structures Algorithms, 14(2):153–184, 1999. ISSN 1042-9832. F. Fagnani and S. Zampieri. Average consensus with packet drop communication. SIAM J. on Control and Optimization, 2007. To appear. J. S. Farris. A probability model for inferring evolutionary trees. Syst. Zool., 22(4):250–256, 1973. J. Feldman, T. Malkin, R. A. Servedio, C. Stein, and M. J. Wainwright. LP decoding corrects a constant fraction of errors. IEEE Trans. Information Theory, 53(1):82–89, January 2007. J. Felsenstein. Inferring Phylogenies. Sinauer, New York, New York, 2004. J. A. Gubner. Decentralized estimation and quantization. IEEE Trans. Info. Theory, 39(4):1456–1459, 1993. J. Han, P. K. Varshney, and V. C. Vannicola. Some results on distributed nonparametric detection. In Proc. 29th Conf. on Decision and Control, pages 2698–2703, 1990. T. S. Han and S. Amari. Statistical inference under multiterminal data compression. IEEE Trans. Info. Theory, 44(6):2300–2324, October 1998.
99 T. S. Han and K. Kobayashi. Exponential-type error probabilities for multiterminal hypothesis testing. IEEE Trans. Info. Theory, 35(1):2–14, January 1989. Y. Hatano, A. K. Das, and M. Mesbahi. Agreement in presence of noise: pseudogradients on random geometric networks. In Proceedings of the 44th IEEE Conference on Decision and Control, and the European Control Conference 2005, December 2005. S. Kar and J. M. F. Moura. Distributed average consensus in sensor networks with quantized inter-sensor communication. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2008. (ICASSP 2008), pages 2281–2284, 2008a. S. Kar and J. M. F. Moura. Distributed consensus algorithms in sensor networks: Link failures and channel noise. Technical Report arXiv:0711.3915v2 [cs.IT], 2008b. A. Kashyap, T. Basar, and R. Srikant. Quantized consensus. Automatica, 43:1192–1203, 2007. D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computation of aggregate information. Proc. 44th Ann. IEEE FOCS, pages 482–491, 2003. Valerie King, Li Zhang, and Yunhong Zhou. On the complexity of distance-based evolutionary tree reconstruction. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (Baltimore, MD, 2003), pages 444–453, New York, 2003. ACM. H. J. Kushner and G. G. Yin. Stochastic Approximation Algorithms and Applications. Springer-Verlag, New York, NY, 1997. Michelle R. Lacey and Joseph T. Chang. A signal-to-noise analysis of phylogeny estimation by neighbor-joining: insufficiency of polynomial length sequences. Math. Biosci., 199(2): 188–215, 2006. ISSN 0025-5564. Michel Ledoux. The concentration of measure phenomenon, volume 89 of Mathematical Surveys and Monographs. American Mathematical Society, Providence, RI, 2001. ISBN 0-8218-2864-9. G. Liang, E. Mossel, and B. Yu. Network topology inference through end-to-end measurements. 2007.
L. Ljung. Analysis of recursive stochastic algorithms. IEEE Transactions in Automatic Control, 22:551–575, 1977. Z.Q. Luo. Universal decentralized estimation in a bandwidth-constrained sensor network. IEEE Trans. Info. Theory, 51(6):2210–2219, 2005. E. Mossel. Distorted metrics on trees and phylogenetic forests. IEEE/ACM Trans. Comput. Bio. Bioinform., 4(1):108–116, 2007. Elchanan Mossel and Sébastien Roch. Learning nonsingular phylogenies and hidden Markov models. Ann. Appl. Probab., 16(2):583–614, 2006. ISSN 1050-5164. Rajeev Motwani and Prabhakar Raghavan. Randomized algorithms. Cambridge University Press, Cambridge, 1995. ISBN 0-521-47465-5. X. Nguyen, M. J. Wainwright, and M. I. Jordan. Nonparametric decentralized detection using kernel methods. IEEE Trans. Signal Processing, 53(11):4053–4066, November 2005. J. Ni and S. Tatikonda. A Markov random field approach to multicast-based network inference problems. In Proceedings of the IEEE International Symposium on Information Theory, pages 2769–2773, 2006. J. Ni and S. Tatikonda. Explicit link parameter estimators based on end-to-end measurements. In Forty-Fifth Annual Allerton Conference, 2007. J. Ni and S. Tatikonda. Network tomography based on additive metrics. In Proceedings of the 42nd Annual Conference on Information Sciences and Systems, pages 1149–1154, 2008. Jian Ni, Haiyong Xie, S. Tatikonda, and Y.R. Yang. Network routing topology inference from end-to-end measurements. INFOCOM 2008. The 27th Conference on Computer Communications. IEEE, pages 36–40, April 2008. ISSN 0743-166X. doi: 10.1109/INFOCOM.2008.16. R. Olfati-Saber, J. A. Fax, and R. M. Murray. Consensus and cooperation in networked multi-agent systems. Proceedings of the IEEE, 95(1):215–233, 2007. G. Picci and T. Taylor. Almost sure convergence of random gossip algorithms. In 46th IEEE Conference on Decision and Control, 2007, pages 282–287, 2007.
101 Francesco Lo Presti, N. G. Duffield, Joe Horowitz, and Don Towsley. Multicast-based inference of network-internal delay distributions. IEEE/ACM Trans. Netw., 10(6):761– 775, 2002. ISSN 1063-6692. doi: http://dx.doi.org/10.1109/TNET.2002.805026. N. Saitou and M. Nei. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4(4):406–425, 1987. I. D. Schizas, A. Ribeiro, and G. B. Giannakis. Consensus in ad hoc WSNs with noisy links: Part I distributed estimation of deterministic signals. IEEE Transactions on Signal Processing, 56(1):350–364, 2008. Charles Semple and Mike Steel. Phylogenetics, volume 24 of Oxford Lecture Series in Mathematics and its Applications. Oxford University Press, Oxford, 2003. ISBN 0-19850942-1. R. J. Serfling. Approximation Theorems of Mathematical Statistics. Wiley Series in Probability and Statistics. Wiley, 1980. Peter H. A. Sneath and Robert R. Sokal. Numerical taxonomy. W. H. Freeman and Co., San Francisco, Calif., 1973. The principles and practice of numerical classification, A Series of Books in Biology. R. R. Tenney and N. R. Jr. Sandell. Detection with distributed sensors. IEEE Trans. Aero. Electron. Sys., 17:501–510, 1981. J. Tsitsiklis. Problems in decentralized decision-making and computation. PhD thesis, Department of EECS, MIT, 1984. J. N. Tsitsiklis. Decentralized detection. In Advances in Statistical Signal Processing, pages 297–344. JAI Press, 1993. Y. Vardi. Network tomography: estimating source-destination traffic intensities from link data. J. Amer. Statist. Assoc., 91(433):365–377, 1996. ISSN 0162-1459. V. V. Veeravalli, T. Basar, and H. V. Poor. Decentralized sequential detection with a fusion center performing the sequential test. IEEE Trans. Info. Theory, 39(2):433–442, 1993. R. Viswanathan and P. K. Varshney. Distributed detection with multiple sensors: Part i—fundamentals. Proceedings of the IEEE, 85:54–63, January 1997.
102 L. Xiao and S. Boyd. Fast linear iterations for distributed averaging. Systems & Control Letters, 52:65–78, 2004. L. Xiao, S. Boyd, and S.-J. Kim. Distributed average consensus with least-mean-square deviation. Journal of Parallel and Distributed Computing, 67(1):33–46, 2007. M. E. Yildiz and A. Scaglione. Coding with side information for rate constrained consensus. IEEE Trans. on Signal Processing, 2008. Z. Zhang and T. Berger. Estimation via compressed information. IEEE Trans. Info. Theory, 34(2):198–211, 1988. R. Zielinski. Optimal quantile estimators: Small sample approach. Technical report, Inst. of Math. Pol. Academy of Sci., 2004.