Fundamentals of Decentralized Optimization in Autonomic Systems

Tomasz Nowicki, Mark S. Squillante, Chai Wah Wu
Mathematical Sciences Department
IBM Thomas J. Watson Research Center
Yorktown Heights, NY 10598, USA

1 Introduction

An autonomic system is a complex information system comprised of many interconnected components, operating at different time scales in a largely independent fashion, that manage themselves to satisfy high-level system management requirements and specifications [5]. This includes providing the self-∗ properties of self-configuring, self-repairing, self-organizing and self-protecting. A fundamental problem for achieving the goals of self-management concerns the general optimization framework that provides the underlying foundation and supports the design, architecture and algorithms employed throughout the system. Given the increasing complexity of current and future information systems, a decentralized approach is a natural way to design and implement autonomic systems that provide self-∗ properties. On the other hand, a centralized approach with complete knowledge of all constituent system components has the potential to provide significant improvements over a decentralized approach, in the same way that solutions to global optimization problems (if attainable) are often superior to the corresponding locally optimal solutions. This fundamental tradeoff between centralized and decentralized approaches arises in a wide range of applications, and its resolution is especially important to achieving the goals of self-management in autonomic systems.

Our study focuses on this fundamental problem within the context of the decentralized optimization aspects of self-management, as a step toward providing a scientific basis for information systems that provide self-∗ properties. Specifically, we consider a hierarchical and decentralized framework for optimal self-management in which a complex information system is partitioned into multiple application environments, AE, each of which has an application manager, AM, that controls and optimizes the resource management and operations within the AE. The collection of AMs is in turn controlled and optimized by a central manager, CM, that allocates the system resources among the AEs. The system hierarchy can contain many levels, i.e., each AE can itself include a manager controlling and optimizing subenvironments within the AE. It is the goal of the CM to optimize the overall objective function for the entire information system. This objective function can be based on any combination of measures (e.g., availability, costs, overheads, penalties, performance, reliability, revenues, risk, robustness, utility [6]) and depends upon the resources, workloads and other parameters of the information system. This objective is also based on the self-∗ property of interest, each of which should involve optimization in an important role. Independent of the details of the objective function, a central question concerns whether there is any loss in optimality when self-management is carried out via a decentralized approach. One purpose of this paper therefore is to provide conditions under which a decentralized optimization framework is as good as a centralized framework. In particular, we show that there is no loss of quality in the optimal self-management of complex information systems when a decentralized approach is used, and we provide a foundation for the decentralized approach to designing and implementing autonomic systems with self-∗ properties. Another purpose of our study is to investigate in more detail the interactions between system components at different levels of this hierarchical decentralized framework for optimal self-management. Specifically, we consider a negotiation scheme in which additional information is passed between the CM and the AEs in order to significantly increase the efficiency with which the optimization algorithms compute the optimal solution. We then exploit a representative example of our general mathematical framework to investigate other fundamental properties of decentralized optimal self-management in practice, including phase transitions, chaotic behavior, stability and computational complexity.

2 Optimal Total Cost

To formally express the problem in the standard form of optimization problems, we consider without any loss of generality minimizing a cost function (since maximizing f is equivalent to minimizing −f). A cost function fi(xi, ri, ui) is associated with each AEi, where xi is the set of variables that can be changed in AEi, ri is the set of resources allocated to AEi, and ui is the set of external variables that affect AEi. Examples of ri include the set of servers assigned to AEi, or the amount of disk space or processing time made available to AEi. Examples of ui include the current workload of AEi or conditions imposed by external events such as failures. The set of variables xi must also satisfy the set of constraints Ci, i.e., xi ∈ Ci is the feasible region of operation for AEi. In general, Ci will depend on ri and ui. Examples of Ci include conditions imposed by AEi, such as an upper limit on the percentage of requests having end-to-end response times that exceed some threshold when the resources of AEi are organized in a multi-tier architecture. The total cost function for the entire system is given by h(f1(x1, r1, u1), . . . , fn(xn, rn, un)), where h aggregates the cost of each AEi into a single total cost. Examples of h of interest include summation (SUM), the maximum function (MAX), and the minimum function (MIN). The total set of resources in the system is finite, and the set of resources assigned to the AEs is required to satisfy a constraint: (r1, . . . , rn) ∈ R. Examples of this constraint include bounds on the amount of disk space or number of servers, or in the case when the servers are organized in multiple tiers, bounds on the end-to-end response or processing time. By allowing elements of R to represent a strict subset of the resources, the CM can reserve a set of available resources for direct allocation to any AEi, rather than having resources moved from some AEj to AEi. The goal of the autonomic system is then to globally minimize the total cost function h subject to constraints:

    hc = min_{xi, ri} h(f1(x1, r1, u1), . . . , fn(xn, rn, un))        (1)

    s.t. (r1, . . . , rn) ∈ R,  xi ∈ Ci(ri, ui).

The value hc is the optimal cost of the system (which in general depends on the set of external variables u1, . . . , un) in a centralized framework, as it is the globally minimal cost among all feasible resource allocations ri and all feasible sets of variables xi. The cost hc can be computed by an optimization algorithm which has knowledge about the operations of all AEi, including the cost functions fi. We now seek conditions under which hc can be obtained using a hierarchical, decentralized framework.

3 Decentralized Optimization

For each AEi, the corresponding AMi minimizes the cost function for AEi by solving the optimization problem:

    gi(ri, ui) = min_{xi} fi(xi, ri, ui)        (2)

    s.t. xi ∈ Ci(ri, ui),

where ri are the set of resources allocated by the CM to AEi. In turn, the CM determines the resource allocation by solving the optimization problem:

    hd = min_{ri} h(g1(r1, u1), . . . , gn(rn, un))        (3)

    s.t. (r1, . . . , rn) ∈ R.

Notice the decentralized nature of this scheme. Each AEi optimizes the cost within its environment and passes this optimal cost to the CM. In particular, there is no need for the CM to know the form of the cost function fi. Additional information can be sent from each AEi to the CM to aid in the optimization of the total cost; see §4.

For vectors in IR^n, let ≥ be the partial order generated by the positive orthant, i.e., x ≥ y if xi ≥ yi for all i.

Definition 1 A function g : IR^n → IR^m is called order-preserving with respect to ≥ (OPGT) if g(x) ≥ g(y) whenever x ≥ y.

Examples of OPGT functions are SUM, MAX and MIN. Suppose the external variables ui are constant or slowly varying compared to the optimization, so that we can ignore them for the moment; see §5 where this is addressed.

Theorem 1 If the aggregation function h is OPGT, then hc = hd, i.e., decentralized optimal self-management is as good as centralized optimal self-management.

Proof: Clearly hd ≥ hc. Let x∗i and r∗i be the optimal sets of variables and resource allocations such that h(f1(x∗1, r∗1, u1), . . . , fn(x∗n, r∗n, un)) = hc, while satisfying (r∗1, . . . , r∗n) ∈ R and x∗i ∈ Ci(r∗i, ui). Then by definition gi(r∗i, ui) ≤ fi(x∗i, r∗i, ui), and from the OPGT property of h we have:

    hd ≤ h(g1(r∗1, u1), . . . , gn(r∗n, un)) ≤ h(f1(x∗1, r∗1, u1), . . . , fn(x∗n, r∗n, un)) = hc.

Hence hd = hc.
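To see the theorem in action, here is a minimal sketch (our own illustration, not code from the paper) that brute-forces a tiny instance with two AEs, discrete tuning knobs, and SUM aggregation, and checks that the decentralized optimum hd matches the centralized optimum hc. The cost functions and parameters are hypothetical.

import itertools

# Toy instance: 2 AEs share 4 identical servers; x is a discrete tuning knob.
# f(i, x, r): hypothetical per-AE cost, decreasing in resources r and knob x.
def f(i, x, r):
    return (i + 1.0) / (r * (1 + x))

X = [0, 1, 2]                                  # feasible knob settings C_i
R = [(r1, 4 - r1) for r1 in range(1, 4)]       # allocations (r1, r2), each AE >= 1 server

def g(i, r):                                   # Eq. (2): per-AE optimum for a given r
    return min(f(i, x, r) for x in X)

# Centralized optimum h_c: minimize jointly over (x1, x2, r1, r2), Eq. (1).
hc = min(f(0, x1, r1) + f(1, x2, r2)
         for (r1, r2) in R for x1 in X for x2 in X)

# Decentralized optimum h_d: CM minimizes the aggregated per-AE optima, Eq. (3).
hd = min(g(0, r1) + g(1, r2) for (r1, r2) in R)

assert abs(hc - hd) < 1e-12                    # SUM is order-preserving, so hc == hd
print(hc, hd)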

4 Hierarchical Negotiation

In general, continuous optimization algorithms for solving Eq. (3) perform much better if, in addition to the ability to evaluate the objective function h̃(r1, . . . , rn) = h(g1(r1, u1), . . . , gn(rn, un)), the gradient ∇h̃ of the objective function is also available. Note that ∇h̃ = Σi ∇ih · ∂gi/∂ri, with ∂gi/∂rj = 0 for i ≠ j. Assuming the constraints xi ∈ Ci(ri, ui) are written as ci(xi, ui) = ri or as ci(xi, ui) ≤ ri, then −∂gi/∂ri are the Lagrange multipliers (LMs) in solving Eq. (2); refer to [1]. Thus, by having each AMi send to the CM both gi(ri, ui) and the corresponding LMs, the gradient ∇h̃ can be efficiently computed by the CM. In this case, the (logical) negotiation scheme between the CM and the AEs is as follows:

1. The CM sends ri to each AEi.

2. Depending on the architecture of the system, the CM might also send the set of external variables ui to each AEi. In other cases, the external variables ui are readily available to each AEi.

3. Each AEi computes gi(ri, ui) and sends it to the CM along with the corresponding LMs.

4. The CM uses this information to compute h̃ and ∇h̃ and to find the next resource allocation (r1, . . . , rn).

5. This is iterated until a suitable resource allocation is found or the algorithm converges.

In the beginning, each AEi does not have to compute gi(ri, ui) very accurately, e.g., it need not run too many iterations within AEi to compute gi(ri, ui). For environments where derivatives are not available or cannot be computed efficiently, the above negotiation approach can be used together with derivative-free optimization (DFO) methods [2] to realize similar implementation benefits. In particular, when DFO is used to compute gi, the trust-region radius and the (internal) trust-region model used in computing gi can be sent to the CM instead of the LMs. This information can be sent to the CM in a compact form and used in an efficient manner analogous to the five-step negotiation scheme above.
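The following is a minimal sketch of this negotiation loop under assumptions of ours (one continuous resource per AE, SUM aggregation, closed-form per-AE subproblems, and a crude projected-gradient step at the CM); none of these specifics are prescribed by the paper.

import numpy as np

# Hypothetical per-AE subproblem with a closed form: g_i(r) = w_i / r,
# so the reported sensitivity is dg_i/dr = -w_i / r^2 (the "LM" information).
w = np.array([4.0, 1.0, 2.0])

def ae_solve(i, r):
    """AM_i solves Eq. (2) for its allocation r and reports (g_i, dg_i/dr_i)."""
    return w[i] / r, -w[i] / r**2

def repair(r, total=3.0, floor=0.1):
    # Crude feasibility repair onto {r > 0, sum r = total}; a real CM would
    # project exactly or use a constrained solver.
    r = np.maximum(r, floor)
    return r * (total / r.sum())

r = np.full(3, 1.0)                    # initial allocation, sum = 3
for it in range(200):                  # steps 1-5 of the negotiation scheme
    vals, grads = zip(*(ae_solve(i, r[i]) for i in range(3)))   # steps 1-3
    h_tilde = sum(vals)                                         # step 4: evaluate h~
    r = repair(r - 0.1 * np.array(grads))                       # step 4: gradient step
print(h_tilde, r)                      # step 5: converged allocation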

5 Example

Let us now consider a representative example in which a set of M heterogeneous computing servers, S1, . . . , SM, and a set of routers are used by a common service provider to host a set of N client environments, AE1, . . . , AEN. The router assigned to AEi immediately routes all incoming requests to one of the servers allocated to and under the control of AMi. A service-level agreement (SLA) is created for each AE to define the corresponding quality-of-service (QoS) requirements and the revenues (respectively, penalties) for satisfying (respectively, failing to satisfy) these requirements. To simplify the exposition, we consider SLAs with a single QoS class within each AE. Self-optimization in such an autonomic system includes the allocation of servers among the set of AEs, the routing of requests within each AE, and the scheduling of requests at each server within an AE, all in order to minimize the global objective function based on the collection of SLAs. Specifically, each AMi solves the optimization problem in Eq. (2) and the CM solves the optimization problem in Eq. (3), where ri are the set of servers and the router allocated by the CM to AEi, (r1, . . . , rN) ∈ R, ui are the set of workload characteristics for AEi, and xi ∈ Ci(ri, ui) are the set of router and per-server scheduling variables that can be changed in AEi. With our focus here on single-class QoS requirements, there is no need to set or adjust any scheduling variables at each server, and thus xi consists solely of a vector of the proportional weights for routing requests among the set of servers allocated to AEi. The set R is the set of all possible N-way partitions of the set of servers {S1, . . . , SM}, together with a router for each AE.

The workloads ui can be accurately modeled as stochastic processes that vary over time [3]. To achieve the global objectives under such nonstationary behavior, the resource allocation decisions are made periodically at time epochs tℓ, ℓ = 0, 1, . . .. The time scales at which these scheduling decisions are made depend upon several factors, including the delays, overheads and constraints involved in making changes to decision variables, the QoS requirements of each AEi, and the properties of the underlying stochastic processes. The optimization problems in (2) and (3) are then solved at each scheduling epoch tℓ, based on measurements collected during previous scheduling intervals τk ≡ [tk, tk+1), k = 0, . . . , ℓ−1, in order to determine the optimal variables x∗i and r∗i that should be deployed during the next scheduling interval τℓ.

The details of the cost functions in Eqs. (2) and (3), as well as the corresponding constraints Ci(ri, ui), depend upon the specific client environments being served. We shall thus focus on a typical scenario in which the QoS requirements are based on the response times of client requests. In particular, fi(xi, ri, ui) = fi(E[Ti(xi, ri, ui)]), where E[Ti(xi, ri, ui)] denotes the expectation of the stochastic response time process for AEi given the allocation of resources ri, the routing and scheduling variables xi, and the workload ui. The aggregate cost function is given by h(g1(r1, u1), . . . , gN(rN, uN)) = Σ_{i=1}^{N} gi(ri, ui), although MAX and MIN could be used instead of SUM in a similar fashion.

Now consider each AEi during any scheduling interval τℓ in which the corresponding workload processes ui are stationary. Every AMi determines the optimal routing variables x∗i ∈ Ci by solving the optimization problem with fi(E[Ti(xi, ri, ui)]) substituted in Eq. (2), and the CM determines the optimal allocation of servers (r∗1, . . . , r∗N) ∈ R by solving the optimization problem with Σ_{i=1}^{N} gi(ri, ui) substituted in Eq. (3). Assume for clarity of presentation that fi(·) and h(·) are linear functions with slopes of unity. Then the router variables for each AEi are obtained by minimizing the expected response time within AEi, and the CM allocates servers among the set of AEi in order to minimize the overall sum of the corresponding expected response times.

The optimization problem of each AMi has been considered in [4] within the context of closed-form approximations based on heavy-traffic stochastic-process limits [3] that accurately model the per-server response time processes in each AEi under general conditions in an on-line fashion.


We therefore exploit the results derived in [4], which show that the RHS of (2) for every AEi is given by

    min_{xi} Σ_{Sj ∈ ri} [ 1/µi,j + (λ²σ²A xi,j + 1 − xi,j + C²Si,j) / (2(µi,j − λ xi,j)) ] xi,j

    s.t. Σ_{Sj ∈ ri} xi,j = 1,  xi,j ≥ 0,  λ xi,j < µi,j,

where µ⁻¹i,j and C²Si,j are the mean and squared coefficient of variation of the service time process on server Sj ∈ ri, λ⁻¹ and σ²A are the mean and variance of the overall interarrival time process for AEi, and xi,j is the proportional weight for routing requests to server Sj ∈ ri. The corresponding CM problem consists of solving for the set of servers (r∗1, . . . , r∗N) ∈ R in the following optimization problem:

    min_{ri} Σ_{i=1}^{N} min_{xi} Σ_{Sj ∈ ri} [ 1/µi,j + (λ²σ²A xi,j + 1 − xi,j + C²Si,j) / (2(µi,j − λ xi,j)) ] xi,j.

We have implemented the foregoing example and have conducted many numerical experiments. Using this framework, we find that even though the total amount of computation is larger for the decentralized approach, the optimization is distributed, and the work performed by the CM is in general less than having the CM perform centralized global optimization.
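As an illustration of how an AMi might solve the routing problem above numerically, here is a sketch using SciPy with hypothetical server parameters (the paper does not describe its implementation, so all names and values below are our assumptions):

import numpy as np
from scipy.optimize import minimize

# Hypothetical parameters for one AE with 3 allocated servers.
mu   = np.array([10.0, 6.0, 4.0])   # service rates mu_{i,j}
cs2  = np.array([1.0, 0.5, 2.0])    # squared CVs of service times C^2_{S_{i,j}}
lam, sigA2 = 8.0, 0.02              # arrival rate and interarrival-time variance

def expected_response(x):
    # Reconstructed heavy-traffic objective: sum_j (1/mu_j + num_j/den_j) * x_j.
    num = lam**2 * sigA2 * x + 1.0 - x + cs2
    den = 2.0 * (mu - lam * x)
    return np.sum((1.0 / mu + num / den) * x)

cons = ({'type': 'eq', 'fun': lambda x: x.sum() - 1.0},)          # sum_j x_j = 1
bnds = [(0.0, min(1.0, m / lam - 1e-6)) for m in mu]              # 0 <= x_j, lam*x_j < mu_j

x0 = np.full(3, 1.0 / 3.0)
res = minimize(expected_response, x0, bounds=bnds, constraints=cons, method='SLSQP')
print(res.x, res.fun)               # optimal routing weights and expected response time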

6 Additional Fundamentals

The foregoing mathematical framework makes it possible for us to further explore several fundamental research issues concerning the decentralized approach to designing and implementing complex information systems with self-∗ properties. The control, decision-making and optimization mechanisms cannot always be continuous (due to granularity) and may include some time-delay dependencies. One of the reasons is that any change requires time and resources. For example, switching a server from one application environment to another, even if done in the same physical domain, requires some clean-up and quarantine time (due to, e.g., privacy restrictions in SLAs). The time delays can also be caused by the different time scales of the workloads, as well as by the operations of several applications within an environment. It is known that even very simple (e.g., linear) models which are only piecewise continuous or contain a feedback element may exhibit chaotic behavior, in the sense of being difficult to predict and qualitatively very sensitive to initial or control conditions. This chaotic behavior may appear in some regimes of parameters; however, the sets of vulnerable parameters may also be very complex, so excluding an envelope (e.g., closure or convex hull) of them might exclude an overly large portion of the parameter space. In special cases chaos can be controllable; for example, many stochastically stable systems exhibit individual chaotic trajectories, but with very well-behaved distributions or moments. The transitions from a deterministic regime, where all trajectories are predictable at all times, to a stochastic regime, where most of the trajectories are predictable over long intervals of time, may go through all kinds of uncontrollable evolutions. It is therefore essential for any given control system to determine the types of possible asymptotic behavior, the stability of such behavior under small perturbations of the system (a robustness question), and to conceive of mechanisms exposing the type of behavior the system is currently in. Some of the research questions that we are currently pursuing include:

1. What is the stability of the algorithm when each AEi operates at a different time scale and optimizes Eq. (2) at a different speed and accuracy?

2. What is the potential for, and impact of, phase transitions and chaotic behavior in the decentralized system?

3. How do stochasticity and the time-varying nature of the external variables ui affect the optimality and efficiency of the optimization algorithm?

4. How do the computational complexity and efficiency differ between an algorithm computing hc in Eq. (1) and an algorithm computing hd in Eq. (3)? Is the global minimum (versus a local minimum) harder to obtain in Eq. (3) than in Eq. (1), or vice versa?

References

[1] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 2nd edition, 1999.

[2] A. R. Conn, N. I. M. Gould, and P. L. Toint. Trust-Region Methods. SIAM, 2000.

[3] D. Gamarnik, Y. Lu, and M. S. Squillante. Foundations of stochastic modeling and analysis for self-∗ properties in autonomic computing systems. Technical report, IBM Research Division, March 2004.

[4] X. Guo, Y. Lu, and M. S. Squillante. Optimal stochastic routing in distributed parallel queues. Technical report, IBM Research Division, November 2003.

[5] J. O. Kephart and D. M. Chess. The vision of autonomic computing. Computer, 36(1):41-52, 2003.

[6] W. E. Walsh, G. Tesauro, J. O. Kephart, and R. Das. Utility functions in autonomic systems. Technical report, IBM Research Division, January 2004.


Self Managed Systems - A Control Theory Perspective

Richard Taylor and Chris Tofts
Model Based Analysis Group
HP Laboratories
{r.taylor, chris.tofts}@hp.com

Abstract

Self-managed systems are essentially closed loop control systems. When designing such systems we should ensure that they do not exhibit fundamentally bad properties (too slow convergence, oscillation, chaotic behaviour, stuck modes), just as for any control system. How can we perform these checks in the presence of arbitrary (Turing complete) control functions? We argue that the space of control functions and compositions should be restricted to those with known 'good' properties, and we demonstrate such a space within cellular automata.

1 Introduction

As systems grow more complex, their control and configuration requirements frequently exceed the comprehension and ability of their operators. Indeed, there are circumstances, the classic being modern fighter aircraft, where the system is designed ab initio to be uncontrollable by a human operator. The desire to move control from the human to the system is rational. Humans have a limited response rate, often make mistakes, are expensive to maintain, and are difficult to train. In fact, given all of these limitations, it is unclear why we have retained humans in so many control loops for so long. The main reason for maintaining humans in management (or control) situations is their flexibility. Any fixed control system will only respond in the way in which it is designed; it cannot be aware of wider goals, which can often be in conflict with its current control response. In almost all complex systems there are unforeseen situations to which a closed control system will not have a designed response. It is important to remember that self-managed systems are actually closed loop control systems.

They are, by definition, systems that make observations about their current state (which can obviously include environmental inputs), have the ability to control some (rarely all) aspects of their behaviour with the aim of reaching some 'good' (or even optimal) state, and then repeat. The previous statement is simply the description of closed loop control. However, with such control systems there is a clear requirement to demonstrate that the control achieves the designer's intent. This is of paramount importance when the intention is to leave the system running without human intervention for any extended period (an extended period in this setting can be very short; for a fighter aircraft, about 2 seconds). The rest of this paper will present a quick review of the observations from control theory of a 'good' control system, discuss how this might be demonstrated in a complex self-managed system, and conclude with practical examples based on the exploitation of known cellular automata behaviour.

2 Control Theory Perspective

The structure and analysis of the feedback control system (Figure 1) has been understood and applied for well over a hundred years, and control systems analysis and design is an integral part of engineering within the modern world. There is a well-known list of properties that are essential for a control system to be considered 'good'. These are the results of centuries of engineering practice and have generally been incorporated as the result of a disaster. A nonexhaustive list will certainly include:

1. convergence on the required solution over the whole input space;

2. adequate responsiveness;

3. stability of solution on stable inputs, with no unnecessary oscillations;

4. no snap-over on small changes of the inputs.

Figure 1: There are two forms of control system - open and closed loop. We have a system which we wish to control (with a desired outcome of y(t)) based upon some reference trajectory, r(t). In an open loop system, the controller bases the behaviour of the system on an idealised model, taking the reference trajectory and generating a control or actuation u(t). In most systems, however, disturbances to the system (d(t)) can cause the output or behaviour of the system to deviate significantly from the reference (or desired) trajectory. In addition, inaccuracies in the model we use to predict the effect of u(t) on the system can also cause significant deviations. In this case, a feedback loop (shown as a dotted line) is added to the system, enabling measurements taken by a sensor from the controlled system s(t) to be subtracted from the reference trajectory r(t) to provide an error e(t). The controller can then make decisions based upon the difference between the current state of the system and its desired state (plus extensions of such direct measurement, including first and second derivatives).

In practice, most complex systems make use of some form of feedback, as well as multiple sensors and actuators. For small numbers of inputs and response controls this is a well-understood problem, albeit one in which the control system must be describable as a set of coupled differential equations. However, the question arises as to what should replace this approach to deriving control-system properties in a complex setting, in particular where the control system uses coupling that cannot easily be analysed using differential equations, or indeed computations that cannot be described in this setting. There is a fundamental question as to whether we should employ control systems that cannot be analysed, and whether we should restrict ourselves to mathematically analysable settings. There are further problems of potential 'chaos', of higher-order factors neglected in the original model, and of the consequences of approximation. For instance, the Lotka-Volterra equations of population dynamics are not stable as a discrete probabilistic system and require the introduction of refuges. A presentation of Lotka-Volterra in this style would look like the following:

    ∆prey = bin(prey, birth) − bin(prey · pred, encounter)
    ∆pred = bin(prey · pred, encounter) − bin(pred, death)

where bin(n, p) is a sample from the binomial distribution with n samplings and p probability of success. With the above set of equations the only certainly stable point is (0, 0). The cycle point is stable in expectation, but is not guaranteed to be so; consequently this system will eventually end up at (0, 0). There are two basic solution approaches to this problem. The first is to verify that any particular control system has the required properties by employing the appropriate mathematics. In the case of control systems whose nodes employ Turing-complete levels of computation, this is well known to be a difficult problem. The second is to limit the set of measure-response functions in the control system so that they are known to have the required properties in any setting. In the next section we outline how this can be achieved for cellular automata. As an example of control on a cellular automaton, consider the problem of maintaining a pattern within the game of Life, given an arbitrary starting position and the ability either to set a limited number of cells, or only to set cells occasionally.

3 Examples - the basic cellular automata Cellular automata make fascinating abstract, as well as practical example of systems that might be considered to have ‘emergent’ properties. Cellular automata (CA) are discrete dynamic sys-

6

tems (space, time and state values), with determinism and with local interaction. A CA is a finite dimensional lattice of sites whose values are restricted to a finite set of integers. The value of any site at any time step is determined as a function of the values of neighbouring sites at the previous time step and itself. In one simple example of Life, space is represented by a uniform mesh of ‘processing’ cells. Each cell within that space is connected to its eight nearest (surrounding) neighbours. Every cell executes a program that exchanges a token (dead or alive) with these eight neighbours, and then alters its own token based upon the number of ‘live’ neighbours that surround it. Too many neighbours and the cell ‘dies’ (over crowding), too few and the cell also dies (loneliness), a critical number and a cell is ‘born’ (i.e. change token from dead to alive). Such a simple system, replicated across a large continuous surface gives rise to apparently complex behaviour from simple seeds, as groups of cells are born, expand and die. Regular patterns of creation and dispersal appear to occur spontaneously, and in many systems, stable recurring patterns become established. It is the observation that recurrent, stable ‘self organised’ structures have emerged from apparently trivial ‘rules’ of computation and communication that have made CA of great interest as one model of parallel and distributed computation. Stability through self organisation on a massive scale has excited many researchers with an interest in scalable, robust parallel systems. Indeed practical and robust implementations of load balancing and fault kernels for operating systems, as well as telecommunications switching systems have been based on simple, well understood automata designs. Observations on the rate of dispersion of system perturbations (through load or fault conditions), combined with the required behaviour of system boundaries, for example, make it possible to construct reliable statements about the upper and lower bounds of system performance and degradation. Unfortunately, life is not as rosy as one might think. If conventional computing and communication systems are used to construct CA and their resultant behaviour examined, a simple qualitative distinction can be made between them (sometimes referred within the research community as the class of the automata)

1. evolve to a homogeneous state

2. evolve to simple, separated periodic structures

3. generate ‘chaotic’ aperiodic patterns

4. generate complex patterns of localised structures
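The update rule of Life described above, as a minimal sketch (our own, using the standard birth-on-three, survive-on-two-or-three thresholds):

import numpy as np

def life_step(grid: np.ndarray) -> np.ndarray:
    """One synchronous update of Conway's Life on a toroidal grid of 0/1 tokens."""
    # Count the eight surrounding neighbours by summing shifted copies of the grid.
    n = sum(np.roll(np.roll(grid, dy, 0), dx, 1)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0))
    born    = (grid == 0) & (n == 3)               # critical number: a cell is born
    survive = (grid == 1) & ((n == 2) | (n == 3))  # over-/under-crowding otherwise kills
    return (born | survive).astype(grid.dtype)

# A 'blinker' oscillates with period 2: a simple stable recurring pattern.
g = np.zeros((5, 5), dtype=int)
g[2, 1:4] = 1
print(life_step(life_step(g)).tolist() == g.tolist())   # True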

While the forward problem (the description of properties of a given CA, such as reversibility, invariants and criticality) is well served by observation-based analysis, the reverse problem (the development of techniques that will generate a rule or set of rules reproducing quantitative observations) remains primitive in all but the simplest subset of systems (mainly, but not exclusively, a subset of class 1 and 2 automata). The bulk of 'successful' reverse analyses make use of a combination of probabilistic search strategies with forward experimental checking. For large systems, especially those of classes 3 and 4, the limitations are obvious. Validation of design through simulation and/or construction is not practicable for large systems, which in the real world are potentially expensive both in construction costs and in the financial and social implications of error. Successful commercial systems that do rely on CA stability typically live in the class 1 and 2 domains [4], where stability may often (but not always) be predicted analytically. In practice the implications of these observations are stark. The consequences of introducing 'complex' behaviour into networks of simple, communicating devices can be catastrophic. Domino-style failure modes in a number of high-profile systems, including the large-scale telecommunications failure in the US East coast network and the widespread power failure in the US North East, can be attributed in part to the failure of designers to understand, for (reasonably simple) communication systems, the instabilities of these self-organised networks. Modern communications construction (at least for fixed infrastructure) mandates that designs can be demonstrated to dampen the oscillations that invariably follow failure or excessive load conditions.


4 Example - Ant algorithms

As an example of the difficulty of validating particular algorithms, see [1, 3, 5, 6]. These papers present verification of self-organised control strategies in ant colonies for:

• task allocation;
• synchronisation;
• sorting.

The work required to capture each of these algorithms is extensive, and proving the convergence points for each of these algorithms is, from personal experience, extremely time consuming. While it was possible in each of these cases to prove some properties about the robustness of the stable states, it was still virtually impossible to prove anything about convergence rates, either in general or for specific cases. Consequently, the application of these algorithms could be advocated in settings where their slow convergence would be of little economic consequence. However, it would still be difficult to advocate their use in either safety-critical or commercially critical settings. An example demonstrating the difficulty of proving convergence rates for a sequential algorithm of this type can be found in [2].
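As a flavour of the kind of self-organised strategy these papers analyse, the following is a generic response-threshold model of task allocation (our own illustrative sketch, not the algorithm from [5]); even for so simple a rule, the convergence rate of the resulting division of labour is hard to prove:

import random

random.seed(1)
NUM_ANTS, STEPS = 50, 2000
thresholds = [random.uniform(0.1, 2.0) for _ in range(NUM_ANTS)]
work_done = [0] * NUM_ANTS
stimulus = 0.0                      # unattended-work signal, e.g. brood needing care

for _ in range(STEPS):
    stimulus += 1.0                 # work accumulates each step
    for i, th in enumerate(thresholds):
        # Standard response-threshold rule: engage with prob s^2 / (s^2 + theta^2).
        if random.random() < stimulus**2 / (stimulus**2 + th**2):
            work_done[i] += 1
            stimulus = max(stimulus - 0.5, 0.0)

# A stable division of labour emerges: low-threshold ants do most of the work.
keen = min(range(NUM_ANTS), key=lambda i: thresholds[i])
lazy = max(range(NUM_ANTS), key=lambda i: thresholds[i])
print(f"lowest-threshold ant worked {work_done[keen]} steps, highest {work_done[lazy]}")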

5 Conclusions

However tempting the 'emulate anything' theorem makes it, the presence of Turing-complete systems in a closed loop control system is extremely dangerous. It is very difficult to demonstrate the absence of unsatisfactory control behaviour in systems that employ such measure-response functions. Restricted models of computation can give 'good' solutions by construction, even with distributed solutions, as the example from cellular automata demonstrated. Given the cost of validating control systems that are constructed ad hoc from an arbitrary underlying computation framework, and the risk that such a validation may well be incomplete, should this be a domain of control? Given the need for large numbers of control systems, and the speed with which the demands on the designer vary, it is clearly imperative that we derive as wide a class as possible of compositions and control functions that are inherently 'good', instead of relying on the skills of a human designer and the observation of successful behaviour on a limited number of simulation exercises.

References

[1] N. R. Franks and C. Tofts. Doing the Right Thing: Ants, Bees and Naked Mole Rats. Trends in Evolution and Ecology, 7:346-349, 1992.

[2] L. A. Goldberg, W. E. Hart and D. B. Wilson. Analysis of a Simple Learning Algorithm: Learning Foraging Thresholds for Lizards. Journal of Theoretical Biology, 197:361-369, 1999.

[3] M. J. Hatcher, N. R. Franks and C. Tofts. The Autosynchronisation of Leptothorax acervorum: Theory, Experiment and Testability. Journal of Theoretical Biology, 157:71-82, 1992.

[4] R. Taylor. Non-linear Structural Replication in Locally Connected Computing Structures. In New Developments in Neural Computing, ed. J. G. Taylor and C. L. T. Mannion, Adam Hilger, 1989.

[5] C. Tofts. Algorithms for Task Allocation in Ants: A Study of Temporal Polyethism: Theory. Bulletin of Mathematical Biology, 5:891-918, 1993.

[6] C. Tofts. Using Process Algebra to Describe Social Insect Behaviour. Transactions on Simulation, 1994. (This should really have been titled "Using process algebra to prove properties of social insect behaviour".)


Modelling System Behaviour by Abstract State Machines

Artur Andrzejak
ZIB Berlin
Takustrasse 7, D-14195 Berlin, Germany
[email protected]

ABSTRACT

Using a language derived from the theory of Abstract State Machines, we are able to concisely describe both the actual and desired behaviour of systems and their components. The resulting specifications may be utilized in a variety of ways: for modelling and monitoring actual system state; as an executable description of intended behaviour; for static verification and testing of this behaviour; to evaluate a running system against a model; and as input for an automatic decision system that plans and adapts system actions according to states in the model. In addition to describing complete programs, the language (AsmL) can also indicate actions or workflows by specifying only the pre and post conditions, leaving room for adaptive decision making. By integrating specification, executable code and system state monitoring into a single model, we provide a framework for self-managing behaviours in devices and components at different scales.

1. INTRODUCTION

Modelling and monitoring system attributes, system state space, possible behaviours and desired/undesired conditions is essential to automatic system management. While a subset of these capabilities may suffice for most systems, the lack of a framework and accompanying tools for modelling, implementing, and integrating system behaviours may severely hamper the development of the system, due to reduced flexibility in modelling and restricted support for verification and testing of the implementation. Because developing such a framework is immensely challenging in itself, most solutions to automatic system management are proprietary, with unnecessarily high implementation costs and low reliability. One of the essential features of such modelling frameworks is the ability to specify relevant system features. We must be able to describe:

A. hierarchies of system components/resources,
B. attributes of components/resources,
C. constraints defining legal system states,
D. conditional actions and their compositions,
E. effects of executing actions on the system state,
F. desired system states and optimization goals.

The model specification should be complemented by a set of tools that facilitate tasks such as static verification of system behaviour against imposed constraints, monitoring of running systems and mapping of system state into the model, checking the validity of the current state, and executing the behavioural specification as a program (the last feature is particularly interesting for workflows, which are usually "programs" of limited complexity). Additional tools from the kernel of Autonomic Computing, such as Constraint Satisfaction Problem solvers and heuristic decision engines (e.g., using Bayesian networks or rule systems), should also be included. Such tools are designed to automatically find suitable sequences of actions in order to bring a system into a desired state, move it out of a forbidden state, and optimize the system for specific targets.

In this paper we evaluate the applicability of AsmL [1], a specification language developed at Microsoft Research, as the basis for a framework of the type described above. AsmL is derived from the theory of Abstract State Machines, in which each step of the machine computes a set of updates to the machine's variables. AsmL has a rich set of data structures, the ability to execute, integration into the .NET framework (including .NET reflection) and existing test and conformance tools, yet it is also relatively simple. The language is fully object-oriented as well, and it provides complex data types such as sets, finite mappings, sequences and structures. AsmL can be translated into C# and executed on the .NET platform (letting programs interoperate with external software). Reflection in .NET allows an AsmL specification to be translated into input for external tools (such as Constraint Satisfaction Problem solvers) at runtime, depending on the system state. Microsoft also provides test generator and conformance checker tools, which let developers test specific actions as well as verify system state against the AsmL model at runtime.


This paper proposes an object-oriented extension to AsmL. The extension includes specification conventions for object-oriented systems, as well as additional interfaces to external OO tools for supporting adaptive (system state-dependent) decision making.

2. RELATED WORK

Scientific and business workflows have recently received a lot of attention in the web services and Grid computing communities. Among the emerging standards for business process execution languages, BPEL4WS [2] is the most popular one. Triana [7] is an example of an environment which employs its own XML-based language, as well as BPEL4WS, for managing serial and distributed scientific workflows. These languages are more specialized for business and/or scientific processes than the presented approach. Further, they do not allow pre- and postconditions, focusing instead on the description of actions. Policies have been widely used in network management [3]. They govern functions such as firewalling, encryption or proxy-caching. A policy-based approach for automatic configuration of resources in data centers is presented in [5]. Specification languages for software design, such as VDM or Z, have been widely studied but are rarely used in practice. The design-by-contract approach [4] has gained more popularity; it applies pre- and post-conditions to enforce correctness of interfaces and component behaviour.

3. MODELLING APPROACH

We take advantage of the built-in OO capabilities of AsmL and express resources (hardware, software) and actions or workflows as classes. Modelling the system is thus conceptually similar to object-oriented design. This similarity is illustrated by the following example of a system that consists of a single host and two dependent actions: installation of Java software and installation of the Eclipse IDE.

3.1 MODELLING THE RESOURCES

We model hierarchies of resource types as classes derived from the abstract class Resource, and actual resources as instances of these types. The attributes of a resource are represented by the member variables of the respective object. The state of a resource is the value of the variable. This approach satisfies requirements A and B from the introduction. Requirement C is covered by AsmL's ability to constrain the set of allowed resource states via the keyword constraint. The following shows the AsmL for our example:

abstract class Resource

class Software extends Resource
    var name as String
    var version as String
    var targetOS as Set of String
    var hdRequired as Integer         // in MByte
    constraint Size(targetOS) > 0

class Host extends Resource
    var installedOS as Set of String
    var installedSoftware as Set of Software
    var RAM as Integer                // in MByte
    var freeHd as Integer             // in MByte
    constraint Size(installedOS) > 0

3.2 MODELLING BEHAVIOUR

Actions (atomic or composed) are associated with the resources they act upon. Technically, each action is instantiated as an object which holds references to the objects of the corresponding resources. Each action has two mandatory methods: pre() and post(). The former specifies the preconditions necessary for an action to execute (and returns true iff they are satisfied), and the latter describes the conditions that should hold after the execution. We employ two kinds of action specification: implicit actions (discussed below) are completely described by their pre() and post() conditions, while explicit actions are additionally required to implement a method execute() that contains a detailed "program" for achieving a desired state (i.e., one allowed by post()). Thus pre() and execute() together satisfy requirement D above, while post() fulfills requirement E.

interface ImplicitAction
    pre() as Boolean
    post() as Boolean

interface ExplicitAction extends ImplicitAction
    execute() as Boolean

The following AsmL code shows an explicit action that installs Java on a host. The keyword step enforces sequential execution and indicates that all statements in a step-block can be evaluated in parallel but must be finished at the end of the block.

class InstallJava implements ExplicitAction
    var host as Host
    var java as Software
    var sysHandle as RealSystemHandle

    pre() as Boolean
        return ((exists OS in host.installedOS where OS in java.targetOS)
                and (java.hdRequired < host.freeHd))

    post() as Boolean
        return (exists soft in host.installedSoftware
                where soft.name eq java.name and soft.version >= java.version)

    execute() as Boolean
        step
            if post() or (not pre()) then
                skip
                WriteLine("InstallJava not executed")
            else if sysHandle.reallyDo then
                step sysHandle.issueCommand(host, "javaInstall.bat")
                step sysHandle.updateModel(host)
            else
                add java to host.installedSoftware
        step
            return post()

The inclusion of sysHandle.reallyDo demonstrates how the specification and "implementation" can be combined in a single file. If the value of this Boolean variable is true (implementation mode), then an external Java installation script is executed (by simply calling the installer), and the attribute variables of the host model host are updated (after the script, because of step) according to the state of the real host. In the other mode (specification or modelling), a reference to java is simply added to the variable host.installedSoftware, which records the set of software installed on the host. The class RealSystemHandle is responsible for the communication with the "real world" and is simplified in this example:

class RealSystemHandle
    var reallyDo as Boolean
    issueCommand(res as Resource, scriptName as String)
    updateModel(res as Resource)

The second explicit action in our example attempts to install the Eclipse IDE on the host. The precondition of this action is that Java is already installed. Here we employ the convention that each "atomic" constraint in pre() and post() is formulated as a separate method named pre_*() or post_*(). This is helpful in debugging a workflow execution, and it also facilitates the extraction of the pre and postconditions of an action as input for external tools. For this purpose we use .NET reflection to discover methods with the pre_ and post_ prefixes, then parse our specification source code to extract the discovered constraints.

class InstallEclipse implements ExplicitAction
    var host as Host
    var eclipse as Software
    var sysHandle as RealSystemHandle

    pre() as Boolean
        return (pre_targetOS() and pre_enoughHd() and pre_javaVersion())

    pre_targetOS() as Boolean
        return (exists OS in host.installedOS where OS in eclipse.targetOS)

    pre_enoughHd() as Boolean
        return (eclipse.hdRequired < host.freeHd)

    pre_javaVersion() as Boolean
        return (exists soft in host.installedSoftware
                where soft.name eq "java" and soft.version >= "1.4.3")

    post() as Boolean
        return (post_eclipseInstalled())

    post_eclipseInstalled() as Boolean
        return (exists soft in host.installedSoftware
                where soft.name eq eclipse.name and soft.version >= eclipse.version)

    execute() as Boolean
        step
            if post() or (not pre()) then
                skip
                WriteLine("InstallEclipse not executed")
            else if sysHandle.reallyDo then
                step sysHandle.issueCommand(host, "eclipseInstallation.bat")
                step sysHandle.updateModel(host)
            else
                step add eclipse to host.installedSoftware
        step
            return post()

The defined actions and resources can then be used in the following code fragment, which instantiates the resources (host, Java and Eclipse objects) and attempts to execute both installation steps:

Main()
    sysHandle = new RealSystemHandle(false)
    // create a host instance and initialize
    var myComputer = new Host({"WinXP"}, {}, 512, 15000)
    // create software instances as constants
    java = new Software("java", "1.4.3",
        {"linux", "Solaris", "HPUX", "Win2000", "WinXP"}, 80)
    eclipseIDE = new Software("eclipse", "3.0",
        {"linux", "Solaris", "HPUX", "Win2000", "WinXP"}, 300)
    step
        var installJava = new InstallJava(myComputer, java, sysHandle)
        installJava.execute()
    step
        var installEclipse = new InstallEclipse(myComputer, eclipseIDE, sysHandle)
        installEclipse.execute()
    step
        if installEclipse.post() then
            WriteLine("Eclipse installed.")
        else
            WriteLine("Eclipse installation failed.")

3.3 IMPLICIT ACTIONS

The pre and postconditions are specified in terms of the attributes of the involved resources, which comprise the sets of necessary and resulting system component states, respectively. From this perspective, the states described by post() can be seen as a specification of the goal to be reached by an action, an implicit description of the desired action -- implicit, because it lacks the method execute(). In this way we partially satisfy requirement F (the optimization goal description is still missing). Such implicit action specifications leave room for implementing self-managing system behaviour and automatic planning of execution steps. In the first phase of a particular self-managing activity, the pre and postconditions of the available explicit actions are extracted (as described in Section 3.2) in order to construct a pool of executable "atomic" actions. In the next phase, the current system state and the desired goal state (the post() of an implicit action) are used as input to an external "reasoning" tool. A resource that recovers from the execution failure of an action might serve as an example: here the current resource status is given by the real resource state, which is mapped to the model, and the post() of the failed action is the goal specification. In a successive phase the external tool computes the sequence of atomic actions, which are subsequently executed via .NET reflection. Of course, during the execution new failures can occur, which induces "recursive" application of the process described above.
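As a concrete (and purely illustrative) rendering of this planning phase, the following Python sketch encodes extracted pre/post conditions as predicates and effects over a set-of-facts state, and chains them by forward search. The encoding and names are our assumptions, not part of AsmL or the paper's tools:

from typing import Callable, FrozenSet

State = FrozenSet[str]          # a system state as a set of facts, e.g. "java"

# Pool of "atomic" actions extracted from explicit actions: (name, pre, effect).
ACTIONS = [
    ("InstallJava",    lambda s: "os" in s,                 lambda s: s | {"java"}),
    ("InstallEclipse", lambda s: "java" in s and "os" in s, lambda s: s | {"eclipse"}),
]

def plan(state: State, goal: Callable[[State], bool], depth: int = 5):
    """Forward search: find an action sequence whose cumulative effects reach goal()."""
    if goal(state):
        return []
    if depth == 0:
        return None
    for name, pre, effect in ACTIONS:
        if not pre(state):
            continue
        nxt = effect(state)
        if nxt == state:        # no progress; avoid trivial loops
            continue
        rest = plan(nxt, goal, depth - 1)
        if rest is not None:
            return [name] + rest
    return None

# Goal: the post() of the implicit action "Eclipse must be installed".
print(plan(frozenset({"os"}), lambda s: "eclipse" in s))
# -> ['InstallJava', 'InstallEclipse']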

3.4 TOOL SUPPORT

AsmL comes with several tools: an execution framework (compiler, debugger, Visual Studio IDE integration), a test case generator, and a conformance tester. In our context the most important component is the execution framework. Using this tool we are able to integrate the model specification and the implementation (sysHandle.reallyDo is true) within the execute() method. This allows us to avoid the separation of the specification from the program, eliminating a source of duplication, inconsistencies, and additional development effort. The AsmL test generator supports test case-based model checking. After the ranges of the input values are specified and the relevant model attributes indicated by so-called "filters", the tool generates a Finite State Machine and runs the model with random inputs from the specified ranges. The conformance tester enables runtime verification of the implementation against the AsmL model. By binding model classes to implementation classes, execution steps of the implementation that do not adhere to the model are automatically detected. In the above example, we can run our code in the "model mode" (sysHandle.reallyDo is false) and another instance of the code in the "implementation mode" (which implies that the states of resources are derived from the real system). This lets us determine whether the real system behaves as expected. (However, this scenario can currently be used only for testing purposes, as the set of input values is restricted and an inconsistency causes termination.) In general, model classes can be bound to classes in any .NET assembly, which may be derived from a variety of implementation languages. We are currently working on additional tools for supporting self-managing behaviour, including programs for extracting the constraints from pre() and post() and selecting them according to the model state, a framework for mapping the state of a real system to the model attributes, and a heuristic decision support engine.

4. CONCLUSION

Modelling system behaviour using AsmL allows us to take advantage of built-in data types, existing tools, and the interoperability of .NET. We have proposed a simple object-oriented extension for AsmL modelling that fulfills the specification language requirements enumerated in the introduction and offers enhanced tool support. Our future work will focus on the self-management tools mentioned in Section 3.4, a graphical editor for designing AsmL-specified workflows, and real-world workflow examples such as server farm configuration [5] and protein threading in bioinformatics [6].

5. REFERENCES

1. M. Barnett, W. Grieskamp, L. Nachmanson, W. Schulte, N. Tillmann, and M. Veanes: Model-Based Testing with AsmL.NET, 1st European Conference on Model-Driven Software Engineering, Dec. 11-12, 2003.
2. Business Process Execution Language for Web Services (BPEL4WS), Version 1.1, http://www.siebel.com/bpel, 2003.
3. R. Haas, P. Droz, and B. Stiller: Autonomic service deployment in networks, IBM Systems Journal, Vol. 42, No. 1, 2003.
4. Bertrand Meyer: Eiffel: The Language. Object-Oriented Series, Prentice Hall, New York, NY, 1992.
5. A. Sahai, S. Singhal, R. Joshi, and V. Machiraju: Resource Management in Utility Computing Environments, HP Technical Report HPL-2003-176, Aug. 2003.
6. M. Shah, S. Passovets, D. Kim, K. Ellrott, L. Wang, I. Vokler, P. LoCascio, D. Xu, and Y. Xu: A Computational Pipeline for Protein Structure Prediction and Analysis at Genome Scale, Third IEEE Symposium on BioInformatics and BioEngineering (BIBE'03), March 10-12, 2003.
7. M. Shields and I. Taylor: Programming Scientific and Distributed Workflow with Triana Services, Global Grid Forum 10, Berlin, 2004.


Open issues in self-inspection and self-decision mechanisms for supporting complex and heterogeneous information systems

Michele Colajanni
University of Modena
Modena, Italy
[email protected]

Mauro Andreolini
University of Roma Tor Vergata
Roma, Italy
[email protected]

Riccardo Lancellotti
University of Modena
Modena, Italy
[email protected]

Abstract

Self-* properties seem an inevitable means to manage the increasing complexity of networked information systems. The implementation of these properties implies sophisticated software and decision supports. Most research results have focused on the former aspect, with many proposals for passing from traditional to reflective middleware. In this paper we focus instead on the supports for the run-time decisions that any self-* software should take, independently of the underlying software used to achieve some self-* properties. We highlight the problems of self-inspection and self-decision models and mechanisms that have to operate in real time and in extremely heterogeneous environments. Without an adequate solution to these inspection and decision problems, self-* systems have no chance of real applicability to complex and heterogeneous information systems.

1 Introduction

Self-* systems seem the inevitable answer to the continuously increasing complexity of networked information systems. Let us define a complex and heterogeneous information system (CHIS) as a system with multiple application classes and multiple Service Level Objectives (SLOs), such as performance, availability, security, energy saving, and costs. These SLOs may be contradictory and, even worse, difficult or impossible to quantify with the same measurable metrics. Many researchers have addressed the issues of designing and implementing software that coordinates interactive networked applications. For example, CORBA [1], J2EE [3] and .NET [2] hide from the programmer many complicated details of the underlying software and hardware platforms, thereby increasing portability and facilitating maintenance. They provide an abstract interface that masks from the application the low-level details of the operating system and network layer, and guarantees interoperability among application components through standard interfaces. On the other hand, these infrastructures lack the necessary support for the dynamic aspects of today's new computational needs. Hiding the underlying details has many advantages, but a certain degree of awareness is necessary for scalability, QoS, and adaptability to the context and conditions of highly dynamic environments. For these reasons, a desirable middleware for implementing complex policies related to the above mentioned aspects should provide an adequate mix of transparency and control depending on the applications. Many researchers think that providing the software with adequate flexibility requires the passage from conventional middleware to some form of adaptive middleware. For example, the reflective middleware model (e.g., DynamicTAO [4], OpenORB [9]) is implemented as a collection of concurrent objects that can react dynamically to changes in the underlying platform and to external requirements through migration, enabling of dynamic interaction patterns, reconfigurations, and insertion and removal of components. In this way, it is possible to select protocols, algorithms, policies and any other mechanism to optimize system performance for unpredictable contexts and situations. However, it is important to remark that self-* properties and reflective middleware are not synonymous. Actually, some partially self-* systems have been implemented without recurring to reflective middleware. As an important example of self-oriented software that does not seem based on reflective middleware, we can cite the IBM WebSphere Application Server, which in its present version is considered predictive, that is, at the third level of the IBM scale of Autonomic Computing, ranging from "basic" (level 1) to "autonomic" (level 5).

Independently of the software supports used to achieve self-* properties, there is an underlying (and less investigated) level that must help the middleware decide whether or not to trigger adapting actions at run-time.

Self-inspection. Self-inspection refers to the ability to automatically capture all information about the internal state, and also to adapt the monitoring system to internal and external conditions. The support to self-adaptive applications for a CHIS must be well developed in the following parts: monitoring, measurement, comparison, and information retrieval from other sources, including resource utilization monitors.

Self-decision. This is the capacity of taking autonomous decisions according to some SLO rules and to a measure of the internal system state that is obtained from the previously described self-inspection component.

We think that self-inspection and self-decision properties are among the most important issues that should be considered to implement really operative self-* systems. Self-* capabilities for inspection, decision (and, possibly, adaptation) are desirable in a middleware system, but they must be disciplined much better than the supports of conventional middleware, because dynamic and autonomous modifications can result in unpredictable system behavior and possible breakdown. These risks limit the applicability of self-* systems as the basis of a complex and heterogeneous information system.

2 Self-inspection

Operating any distributed information system without accurate statistics is not desirable. In general, it is not easy to find the right combination between data sources providing low volume, coarse-grained, non-application-specific data, and data sources providing high volume, fine-grained, and application-specific information. These issues are even more complex in self-* systems that are governed by the imposed SLOs. Their policies should use system-wide and component status information to take the appropriate actions and to react to events, but this kind of valuable information is not directly available. Distributed monitors usually yield raw, OS-level data (e.g., CPU and disk utilizations, network throughput) or application-level data (e.g., request throughput) that has to be aggregated to infer conclusions about the specific subsystem. Even worse, self-* systems should be able to weigh different measurements into a homogeneous indicator that is used to quickly estimate the status of one or more components and to take actions accordingly. Similarly to a neural network that has to decide which weights to increase or decrease after an error signal, a distributed system relies on a number of interactive sub-parts that together result in a global phenomenon. In this context, one interesting challenge is how to quickly transform heterogeneous and distributed measurements into a homogeneous indicator of performance or status. These indicators should


help to take run-time decisions that allow the system to perform sufficiently well. Due to the complexity and heterogeneity of the models, we cannot expect optimality and we should not search for it. It is much more important to escape from worst cases and to respect the SLOs. The literature helps only partially. Conversion from a multi-objective to a single objective is often done by computing a weighted sum of the different metrics, as shown in [10, 6, 7]. In these works, either a simple weighted arithmetic mean or a simple weighted geometric mean is used to aggregate individual ratings of system features. The logical relationships among features and the distinction between mandatory, desirable, and optional selection criteria are not incorporated in these early models. Even more sophisticated hierarchical models [8] do not aim to capture and combine the dynamics of transient phenomena fairly accurately. The large majority of statistical models provide off-line data analyses. We are studying models that distinguish two main classes of reaction to external forces: resources degrading gracefully (e.g., CPU) and resources degrading suddenly (e.g., thread pools, memory, process descriptors, number of connections). Resources degrading gracefully cause a smooth deterioration of system performance, while those degrading suddenly may have tragic consequences on the SLOs and even on the availability of the whole system. To avoid abrupt performance degradation, a possible solution is to adjust the weights according to the availability of suddenly degrading resources. In this way, resource scarcity is gradually signaled by an increasing performance indicator. This could lead to novel self-inspection supports that combine heterogeneous sources of information by working on the distance from the maximum capacity of each critical resource and by focusing on performance trends more than on instantaneous values, possibly combined with some past measures. The goals of these and other models for self-inspection run-time support should be clear: to extract from many heterogeneous and raw data "the information" that is really valuable to the self-decision support of a CHIS in order to activate the right component.
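To make the weighting idea concrete, the following minimal Python sketch (our own illustration; the formula, names and numbers are assumptions, not a published model) amplifies the weight of each metric as the corresponding resource approaches its maximum capacity, so that suddenly degrading resources dominate the indicator before they saturate:

```python
def composite_indicator(metrics, capacities, base_weights):
    """Aggregate heterogeneous measurements into one scalar indicator.
    Each metric's contribution is its utilization times a scarcity
    factor 1/headroom, so the indicator rises sharply as a critical
    resource nears its maximum capacity."""
    score = 0.0
    for name, value in metrics.items():
        utilization = value / capacities[name]
        headroom = max(1e-9, 1.0 - utilization)  # distance from max capacity
        score += base_weights[name] * utilization / headroom
    return score

# A thread pool at 95% of capacity dominates a CPU at 60%:
metrics    = {"cpu": 0.60, "threads": 190.0}
capacities = {"cpu": 1.00, "threads": 200.0}
weights    = {"cpu": 1.0,  "threads": 1.0}
print(composite_indicator(metrics, capacities, weights))  # ~20.5, driven by threads
```

Trend information (e.g., smoothing utilizations over past measures) could be layered on top of the same scheme.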

3 Self-Decision

There is an open debate whether a CHIS with multiple SLOs may be characterized by steady-state behaviors or just by an aperiodic behavior. In the latter case, it appears dangerous to predict future conditions by crunching current information, and almost impossible to take any valuable adaptive decision. However, it is premature to draw a pessimistic view, such as that a CHIS is characterized by unstable aperiodic behavior resembling chaos, only because any human product is not (and for a long time will not be) helped by the natural features typical of chaotic systems. We think that associating self-* properties with nervous systems or nature-like behaviors is a dangerous hype. Human products cannot embed all known (and unknown) forces that govern natural beings, such as selection and evolution, and the time scale is completely different. Maybe one day in the future this will be the reality. But we need to support a CHIS tomorrow, not in an unforeseen future. Hence, we have to consider what it is possible to do now. On the negative hand, we undoubtedly have to forget about the linearity assumptions at the basis of many previous models. We also have to exclude all optimization models and algorithms that do not provide an answer in reasonable or real time. On the positive hand, self-decision run-time supports for a CHIS can rely on at least two important facts. First, any CHIS consists of layers, components, subsystems and hierarchies, hence traditional divide-et-impera approaches remain the most valuable source of solutions. Second, and even more important, a CHIS must tend not to optimization but to something we will call "good enough quality". Lloyd suggests that the combination of dynamical systems theory and information theory could be used to formulate a solution for the control of complex adaptive systems. However, control of


complex, nonlinear systems requires insight and intuition [5]. But what happens if the decision algorithm does not have enough time to learn? The time to learn, the time to reach another stable state, the time necessary for optimization, is often neglected by previous theories. Is another theory necessary? Can existing theories be extended or combined? If we try to adopt previous theories that tend to optimization, that assume there is enough time to reach a steady state, that have many try-and-drop possibilities, that have natural selection, we do not have much hope of building a sufficiently reactive self-* system. Fortunately, in most instances, a CHIS does not require optimization methods that are interested in finding the best solution possible. A self-* system supporting a CHIS can be fully satisfied by an acceptable state that escapes a critical situation. This goal is not so difficult to meet, since most real systems are largely over-provisioned. However, the full requirements and implications of good enough quality remain to be explored.

4 Conclusions

Integrating self-* properties into current systems is one of the future challenges of distributed computing. The claim of this position paper is that the real applicability of self-* properties to complex and heterogeneous information systems requires not only sophisticated software supports (such as reflective middleware), but also new insights and models for self-inspection and self-decision that can support real-time adaptation. We presented some open issues that have to be addressed. It is still unclear how to aggregate several heterogeneous measurements into a homogeneous indicator of performance or status. Given the highly dynamic nature of a CHIS, fast and "good enough quality" decisions are often preferred to slow and optimal solutions. However, the full requirements and implications of "good enough quality" in most contexts remain to be explored.

References

[1] The Object Management Group. http://www.omg.org/.
[2] Microsoft .NET Information. http://www.microsoft.com/net/.
[3] Java 2 Platform, Enterprise Edition (J2EE). http://java.sun.com/j2ee/.
[4] F. Kon, M. Roman, P. Liu, J. Mao, T. Yamane, L. C. Magalhaes, and R. H. Campbell. Monitoring, Security, and Dynamic Configuration with the dynamicTAO Reflective ORB. In Proceedings of the IFIP/ACM International Conference on Distributed Systems Platforms and Open Distributed Processing (Middleware'2000), Apr. 2000.
[5] Learning How to Control Complex Systems. http://www.santafe.edu/sfi/publications/Bulletins/bulletin-spr95/10control.html.
[6] J. R. Miller. Professional Decision-Making: a procedure for evaluating complex alternatives. Praeger, New York, NY, 1970.
[7] J. D. Sable. System evaluation methodology. Technical Report AUER-1834-TR-2, INFROM Data Management System Study, Auerbach, 1970.
[8] S. Y. W. Su, J. Dujmovic, D. S. Batory, S. B. Navathe, and R. Elnicki. A cost-benefit decision model: analysis, comparison and selection of data management. ACM Trans. on Database Systems (TODS), 12(3):472–520, Sept. 1987.
[9] The Community OpenORB project. http://openorb.sourceforge.net/.
[10] D. R. J. White, D. L. Scott, and R. N. Schulz. POED – A method of Evaluating System Performance. IEEE Trans. Eng. Manage., pages 177–182, Dec. 1963.


Position Paper: “Self-” Properties in Distributed K-ary Structured Overlay Networks ∗ †

Luc Onana Alima1,2, Seif Haridi1,2, Ali Ghodsi1, Sameh El-Ansary2 and Per Brand2
1 KTH, Royal Institute of Technology, 2 SICS, Swedish Institute of Computer Science, Kista, Sweden

Abstract

As can be seen today, there is clear evidence that computing systems are becoming more and more complex. This complexity decomposes into several aspects that stem mainly from the large scale and the high dynamism of the operating environments of these systems. Due to this large scale and high dynamism, most computing systems of today and of the future will be characterized by their “self-” properties, for instance self-organization, self-repair and self-management. The current trend in constructing such complex systems consists of building overlay networks as a substrate on top of which novel large-scale applications can be built. The overlay networks can be structured or unstructured. In this position paper, we first briefly present a general principle for building structured overlay networks. Second, we discuss techniques for achieving effective self-organization and self-repair in such overlay networks. Following that, we raise some questions that we think should be considered when designing systems with “self-” properties.

1. Introduction

The exponential growth of the Internet has made it possible to connect billions of machines scattered around the globe and to share computing resources such as processing power, storage and content. In order to effectively exploit these resources, the trend is to use the Internet as it was originally intended, that is, as a symmetric network through which machines share resources as equals. With this in mind, a number of novel distributed systems and applications characterized by the large scale and high dynamism of

∗ This work was funded by the European PEPITO project IST-2001-33234 and Vinnova GES3 project in Sweden.
† Currently on sabbatical leave at the National University of Singapore.

their operating environment are being built. In these distributed systems, participating nodes directly share resources as equals in a peer-to-peer fashion. We call them peer-to-peer (P2P) systems. The high dynamism in these systems is mainly due to two reasons. First, there is the need for freedom: peers should be able to freely join or leave the system. Second, peers can fail at any time. To cope with this dynamism, these systems should be at least self-organizing and self-healing, i.e. the system self-reconfigures, without external (or manual) intervention, to legitimate configurations when participating nodes join, leave or fail. The current trend in constructing such complex systems consists of building application-independent overlay networks as a substrate on top of which novel large-scale applications can be built. Two main design approaches can be identified for building such substrates. On the one hand, there are unstructured overlay networks, in which peers are connected in an uncontrolled fashion. Unstructured overlay networks provide flexibility in search. For example, arbitrary queries can be handled easily. However, these overlay networks do not scale well, as they typically use flooding for search. Furthermore, they do not guarantee that an item inserted in the system will be located when needed. On the other hand, there are structured overlay networks, where peers are organized in a controlled manner. Structured overlay networks provide a location-independent routing mechanism as the basic component on top of which high-level services can be built. However, arbitrary queries are not “naturally” supported. In this paper our focus is on structured overlay networks [18, 1, 17]. The basic service that these structured overlay networks provide is location-independent routing. On top of this, higher-level services such as a Distributed Hash


Table (DHT), location-independent one-to-one communication (point-to-point), one-to-many communication such as broadcast [9] and multicast [4], and object replication and caching under various consistency models can be built. The performance of the above-mentioned services highly depends on the ability of the overlay network to self-organize and self-repair when peers join, leave or fail. The questions then arise: how is the overlay network going to be built and maintained? How do the high-level services (or applications) benefit from the “self-” properties provided by the underlying substrate? Could we quantify the guarantees offered by systems with “self-” properties? The list is not exhaustive. The rest of the paper is organized as follows. In Section 2, we briefly present the k−ary tree embedding framework for building structured overlay networks. Section 3 presents some techniques for achieving self-organization at the substrate level. For each of them, pros and cons are given. In Section 4, we summarize the paper and highlight some of the questions we consider important to be addressed when designing complex systems with “self-” properties.

2. Embedding of k−ary Trees

In the past three years, various structured peer-to-peer overlay networks have been proposed [18, 1, 17, 16, 11, 14, 15]. Typically, these overlay networks are built such as to ensure a logarithmic diameter under normal system operation, while maintaining at each peer a routing table of logarithmic or constant size. Each structured overlay network is built using a well-chosen virtual identifier space onto which peers and items are mapped. A close analysis of the existing structured overlay networks with logarithmic diameter shows that there is a fundamental element behind almost all these overlay networks: the embedding of k-ary trees. Briefly, for each virtual identifier, there is an associated virtual k−ary tree that spans the whole identifier space, and whose height is the logarithm of the system size. This observation is natural, as it is a well-known fact that logarithmic search goes hand in hand with tree structures. We propose the distributed k−ary principle [5, 1, 2] as a general tool for building, understanding and analyzing structured peer-to-peer overlay networks. The topology of most of the existing overlay networks of logarithmic diameter can be derived using the distributed k−ary principle.

The principle is to let each virtual identifier be the root of a virtual k−ary tree (actually a rooted DAG) that spans the whole identifier space. Hence, any node that joins the system is the root of such a virtual k−ary tree. Assuming a well-chosen system size, the virtual space is systematically and recursively divided such as to guarantee a k-ary tree of height logk(N), where N is the size of the identifier space. Routing in such an overlay network is interval routing [19].
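As a concrete illustration of the recursive division (a toy sketch of our own; the function names are not from the DKS papers), the identifier space can be split into k equal intervals per level, giving the logk(N) height mentioned above:

```python
import math

def kary_intervals(start, size, k):
    """Split the identifier interval [start, start+size) into k equal
    sub-intervals, one per child of the current virtual tree node."""
    step = size // k
    return [(start + i * step, start + (i + 1) * step) for i in range(k)]

def tree_height(N, k):
    """Height of the embedded k-ary tree over an identifier space of
    size N (assumes N is a power of k); interval routing resolves one
    level per hop."""
    return round(math.log(N, k))

# Example: N = 4096 identifiers, k = 4.
print(tree_height(4096, 4))        # 6
print(kary_intervals(0, 4096, 4))  # [(0, 1024), (1024, 2048), (2048, 3072), (3072, 4096)]
```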

3. Techniques for Achieving Self-organization

Self-organization in structured overlay networks should be considered both at the underlying overlay layer and at the services (or applications) layer. For the sake of simplicity, in the present paper we only focus on self-organization at the substrate level, namely, how to cope with joins and departures of nodes. Techniques for achieving self-organization in overlay networks can be categorized in several ways. In this section we give a brief informal presentation of those that we consider relevant for this position paper.

3.1. Periodic Stabilization


We call periodic stabilization the technique that consists of running, periodically, separate routines for correcting the routing information that each node maintains. Most of the existing peer-to-peer infrastructures use this technique. For instance, it is used in systems such as Chord [18], CAN [16] and Pastry [17]. The idea here is that each peer periodically checks its neighbors, to detect any change that occurs in the vicinity of the checking node. In Chord, this is done by periodically running the stabilize and the fix finger sub-routines. This technique has the advantage that changes can be detected quickly. However, the cost of doing this periodic checking is not well understood. An immediate observation that one can make is that in systems using this technique, there is unnecessary bandwidth consumption when it is frequently used but the dynamism in the system is low.
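Schematically, the technique looks like the following Python sketch (a toy of ours, loosely modeled on the Chord stabilize/fix fingers loop named above; the Node methods are placeholders), which makes the cost structure visible: the probes run every τ seconds whether or not anything changed:

```python
import time

class Node:
    """Placeholder node; a real implementation would contact its successor
    and refresh finger-table entries against the live overlay."""
    def stabilize(self):       print("verify/repair successor pointer")
    def fix_next_finger(self): print("refresh one routing-table entry")

def periodic_stabilization(node, tau, rounds):
    """Every tau seconds, unconditionally run the correction routines;
    under low dynamism most of these probes find nothing to fix."""
    for _ in range(rounds):
        node.stabilize()
        node.fix_next_finger()
        time.sleep(tau)

periodic_stabilization(Node(), tau=1.0, rounds=3)
```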

3.2. Adaptive Stabilization

As mentioned in the previous subsection, periodic stabilization induces unnecessary bandwidth consumption in periods of low dynamism. To overcome this problem, an alternative approach is what we


call adaptive stabilization, in which the rate of stabilization is tuned depending on some observed conditions or parameters, as suggested in [13]. In [13], what we here call adaptive stabilization is termed self-tuning, and requires some estimate of the system size and the failure rate. Intuitively, the adaptive stabilization technique might help reduce unnecessary bandwidth consumption. However, it is not yet clear what parameters are to be observed to effectively tune the probing rate. More importantly, how to make these observations is currently not well understood, given the large-scale nature and the high dynamism of the targeted systems. Nevertheless, the research on adaptive stabilization shows the importance of building systems that self-adapt to observed and current behaviors. Correction-on-use combined with correction-on-change, presented in the following subsections, provides this self-adaptation at a low cost.
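One plausible shape for such a tuning rule (an illustration of ours, not the published self-tuning formula of [13]) scales the probing rate with the estimated system size and failure rate:

```python
def tuned_probe_rate(size_estimate, failure_rate, c=2.0):
    """Probes per unit time: grow with estimated churn so repairs keep up,
    shrink when the system is quiet. The constant c and the linear form
    are illustrative assumptions."""
    return c * size_estimate * failure_rate

print(tuned_probe_rate(1000, 0.001))  # quiet system: 2.0 probes per unit time
print(tuned_probe_rate(1000, 0.05))   # churny system: 100.0 probes per unit time
```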

3.3. Correction-on-use

Periodic stabilization is expensive and induces unnecessary bandwidth consumption. To overcome this problem, an alternative approach, correction-on-use, is proposed in [1]. The idea here is to take advantage of the use of the overlay network in order to let it self-organize in the face of changes. When a peer n joins the system, it receives approximate routing information that is not necessarily accurate. This routing information becomes accurate over time as the system is used. To achieve this convergence, two ideas are used: (i) whenever a peer receives a message from another peer, the receiving peer adapts itself to account for the presence of the sender; (ii) whenever a peer n sends a message to another peer n′, peer n embeds some information about its current “local view” of the network. The receiving peer n′ can then precisely determine whether the sender n had a correct view at sending time. If not, a bad-pointer notification is sent back to peer n. The notification message carries the identifier of a candidate peer for correction. Upon receipt of such a notification, the sender peer n corrects itself. If the ratio of the use (traffic injected into the system) over the dynamism of the system is high enough, the overlay network converges to a legitimate configuration. The main advantage of correction-on-use is that it completely eliminates unnecessary bandwidth consumption. Each peer pays for what it needs. However, if the ratio of the traffic injected into the system over the dynamism is not sufficiently high, the convergence of the overlay network to a legitimate configuration is slowed down.
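The two ideas can be sketched in a few lines of Python (a toy model of ours; "slots" stand in for routing-table entries and strings for node identifiers, none of which are the DKS data structures):

```python
from dataclasses import dataclass

@dataclass
class View:
    slot: int    # which routing-table entry the sender used
    target: str  # who the sender believes occupies that entry

class Peer:
    def __init__(self, name, table):
        self.name, self.table = name, dict(table)

    def receive(self, sender, view: View):
        # (i) adapt to account for the sender's presence (elided here);
        # (ii) audit the sender's piggybacked local view.
        correct = self.table.get(view.slot)
        if correct != view.target:
            sender.bad_pointer(view.slot, candidate=correct)

    def bad_pointer(self, slot, candidate):
        # upon notification, the sender corrects its stale entry
        self.table[slot] = candidate

a = Peer("A", {0: "B_old"})                 # A holds a stale pointer
b = Peer("B", {0: "B"})
b.receive(a, View(slot=0, target="B_old"))  # A's traffic triggers the correction
print(a.table)                              # {0: 'B'}
```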

3.4. Correction-on-change

Correction-on-use is really the antipode of periodic stabilization. No extra cost is paid if the system ever enters a period with no dynamism. Furthermore, in the case of continuous dynamism, if the traffic injected into the system is high enough, the cost of correction is still kept low. However, the assumption of high enough traffic might be difficult to satisfy in some use scenarios. To remove the assumption of high enough traffic needed in correction-on-use, we proposed a complementary technique, which we call correction-on-change [8]. Here, the idea is to correct the overlay network whenever a change is detected due to a join, leave or failure. Combining correction-on-change with correction-on-use gives the system high robustness while reducing unnecessary bandwidth consumption. The resulting correction technique has the advantage that when there are no changes in the system, no extra cost is paid. Our current experimental studies show that the combination of correction-on-use and correction-on-change provides the system with strong robustness. Unnecessary bandwidth consumption is eliminated, because no maintenance message is sent out when there is no dynamism. The increased robustness comes from the fact that whenever a node n joins or fails, all nodes that depend on node n are pro-actively notified, while periodic stabilization or correction-on-use will not necessarily do so immediately. Furthermore, unlike periodic stabilization, correction-on-change's performance does not depend on some stabilization rate.

4. Conclusions and Open Questions

In this position paper, we briefly presented a framework for structured overlay networks and discussed techniques for achieving “self-” properties in these systems. Currently, it is not clear how the “self-” properties of the structured overlay network affect the higher-level services built on top of it. Furthermore, it is not clear whether the high-level services need to integrate additional mechanisms for achieving their own “self-” properties. Our experience in providing broadcast [10, 7] and multicast [3] services in the DKS system shows that the high-level services are affected by the “self-” properties of the underlying infrastructure. Moreover, it seems that the high-level services need to integrate mechanisms for their own “self-” properties.

19

Systems with “self-” properties are convergent or stabilizing systems [6, 12]. It is important to find suitable techniques for disturbance containment in systems with “self-” properties. This issue has already been considered in the context of self-stabilizing systems. A question worth posing is how the techniques developed in the context of self-stabilizing systems could be applicable in complex systems such as peer-to-peer systems. In the self-stabilizing context, the system is guaranteed to converge within a finite number of state transitions, and that is what systems with “self-” properties actually aim at. So, techniques developed for self-stabilization will probably prove useful for building systems with “self-” properties. A final question that we plan to investigate concerns the guarantees that one can expect from systems with “self-” properties. Could we quantify these guarantees?

References

[1] L. O. Alima, S. El-Ansary, P. Brand, and S. Haridi, DKS(N, k, f): A Family of Low Communication, Scalable and Fault-Tolerant Infrastructures for P2P Applications, The 3rd International Workshop CCGRID2003 (Tokyo, Japan), May 2003.
[2] L. O. Alima, S. El-Ansary, A. Ghodsi, P. Brand, and S. Haridi, Four Design Principles for Structured Overlay Networks, Tech. Report ISRN KTH/IMIT/LECS/R-03/01–SE, Kista, Sweden, 2003.
[3] L. O. Alima, A. Ghodsi, P. Brand, and S. Haridi, Multicast in DKS(N, k, f) Overlay Networks, 7th International Conference on Principles of Distributed Systems (OPODIS) (La Martinique, France), December 2003.
[4] L. O. Alima, A. Ghodsi, P. Brand, and S. Haridi, Multicast in DKS(N, k, f) overlay networks, The 7th International Conference on Principles of Distributed Systems (OPODIS'2003) (Berlin), Springer-Verlag, 2004.
[5] L. O. Alima, A. Ghodsi, and S. Haridi, A Framework for Building Structured Peer-To-Peer Overlay Networks, Tech. Report TR-2004-09, SICS, May 2004.
[6] E. W. Dijkstra, Self-stabilizing systems in spite of distributed control, Communications of the ACM 17 (1974), 643–644.
[7] S. El-Ansary, L. O. Alima, P. Brand, and S. Haridi, Efficient Broadcast in Structured P2P Networks, 2nd International Workshop on Peer-to-Peer Systems (IPTPS '03), February 2003.
[8] A. Ghodsi, L. O. Alima, P. Brand, and S. Haridi, Increasing Robustness while Minimizing Bandwidth Consumption in Structured Overlay Networks, Tech. Report ISRN KTH/IMIT/LECS/R-03/07–SE, Kista, Sweden, 2003.
[9] A. Ghodsi, L. O. Alima, S. El-Ansary, P. Brand, and S. Haridi, Self-correcting broadcast in distributed hash tables, Parallel and Distributed Computing and Systems (PDCS'2003) (Calgary), ACTA Press, 2003.
[10] A. Ghodsi, L. O. Alima, S. El-Ansary, P. Brand, and S. Haridi, Self-Correcting Broadcast in Distributed Hash Tables, 15th IASTED International Conference on Parallel and Distributed Computing and Systems (Marina del Rey, CA, USA), November 2003.
[11] F. Kaashoek and D. R. Karger, Koorde: A simple degree-optimal distributed hash table, Proceedings of the Second International Workshop on Peer-to-Peer Systems, IPTPS, 2003.
[12] E. Laszlo, Basic constructs of systems philosophy, Systematics 10 (1972), 40–54.
[13] R. Mahajan, M. Castro, and A. Rowstron, Controlling the cost of reliability in peer-to-peer overlays, LNCS 2735, Proceedings of the Second International Workshop IPTPS 2003 (Berkeley), Springer, 2003.
[14] D. Malki, M. Naor, and D. Ratajczak, Viceroy: A scalable and dynamic emulation of the butterfly, Proceedings of the 21st ACM Symposium on Principles of Distributed Computing, 2002.
[15] M. Naor and U. Wieder, A simple fault tolerant distributed hash table, Proceedings of the Second International Workshop on Peer-to-Peer Systems, IPTPS, 2003.
[16] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, A Scalable Content Addressable Network, Tech. Report TR-00-010, Berkeley, CA, 2000.
[17] A. Rowstron and P. Druschel, Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems, Lecture Notes in Computer Science 2218 (2001).
[18] I. Stoica, R. Morris, D. Karger, M. Kaashoek, and H. Balakrishnan, Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications, ACM SIGCOMM 2001 (San Diego, CA), August 2001, pp. 149–160.
[19] G. Tel, Introduction to Distributed Algorithms, Cambridge University Press, 1994, ISBN 0 521 47069 2.


Principles of Locality-Aware Networks for Locating Nearest Copies of Data

Ittai Abraham∗ and Dahlia Malkhi†

∗ The Hebrew University of Jerusalem, Israel. [email protected]
† The Hebrew University of Jerusalem, Israel. [email protected]

Introduction

Building self-maintaining overlay networks for locating information in a manner that exhibits locality-awareness is crucial for the viability of large internets. It means that costs are proportional to the actual distance of interacting parties, and in many cases, that load may be contained locally. This paper presents a step-by-step decomposition of several locality-aware networks that support distributed content-based location services. It explains their common principles and their variations with simple and clear intuition on analysis. Section 2 describes a novel technique for robustifying locality-aware overlays.

Problem statement. This paper considers the problem of forming a self-organizing, self-maintaining overlay network that locates objects (possibly replicated) placed in arbitrary network locations. Recent studies of scalable content exchange networks, e.g., [6], indicate that up to 80% of Internet searches could be satisfied by local hosts within one's own organization. Therefore, in order for the network to remain viable, it is crucial to consider locality awareness from the outset when designing scalable, decentralized network tools. More formally, consider that the network constitutes a metric space, with a cost function c(x, y) denoting the "distance" from x to y. Let s = x0, x1, ..., xk = t be the path taken by the search from a source node s to the object residing on a target node t. The main design goal to achieve is constant stretch, namely, that the ratio (c(x0, x1) + ... + c(x(k−1), xk)) / c(s, t) is bounded by a (small) constant. Another important goal is to keep node degree low, so as to prevent costly reconfigurations when nodes join and depart. Thus, trivial solutions that connect all nodes to each other are inherently precluded.

Bounded-stretch solutions. The problem of forming overlay routing networks was considered by several recent works in the context of networks that are searchable for content. Many of the prevalent overlay networks were formed for routing search queries in peer-to-peer applications, and exhibit no locality awareness. There are several known schemes that provide locality awareness, including [12, 13, 15, 9, 3]. All of these solutions borrow heavily from the PRR scheme [12], yet they vary significantly in their assumptions and properties. Some of these solutions are designed for a uniform density space [9]. Others work for a class of metric spaces whose growth rate is bounded both from above and from below [12, 13, 15, 4], while others yet cope with an upper bound only on the growth rate [3]. There is also variability in the guarantee provided on the stretch: in [9], there is no bound on stretch (except the network diameter); in [12], the stretch is an expected constant, a rather large one which depends on the growth bound; and in [3], the stretch can be set arbitrarily small (1 + ε). Diversity is manifested also in the node degree of the schemes.

A step-by-step deconstruction. This paper offers a deconstruction of the principles that underlie these locality-aware schemes step by step, and indicates how and where they differ. It demonstrates the principles of locality awareness in a simplistic, yet reasonable (see [4]) network model, namely, a network with power law latencies. In our belief, the simplicity and the intuitive analysis may lead to improved practical deployments of locality-aware schemes. For clarity, our exposition describes the design of


an N-node network. It should be clear, however, that this network design is intended to be self-maintaining and incremental. In particular, it readily allows nodes to arrive and depart with no centralized control whatsoever. Some additional issues, such as dynamic maintenance, are provided in an accompanying technical report [1].

1 Locality-aware solutions

Preliminaries. The set of nodes within distance r from x is denoted N(x, r). We assume a network model with power law latencies, |N(x, r)| = Γr^2, for some known constant Γ. For convenience, we define neighborhoods A_k(x) = N(x, 2^k), with radius a_k = 2^k. Thus, we have that |A_k(x)| = Γ4^k. For the purpose of forming a routing structure among nodes, nodes need to have addresses and links. We refer to a routing entity of a node as a router, and say that the node hosts the router. Thus, each node u hosts an assembly of routers. Each router u.r has an identifier denoted u.r.id, and a level u.r.level. Identifiers are chosen uniformly at random. The radix for identifiers is selected for convenience to be 4. This is done so that a neighborhood of radius 2^k shall contain in expectation a constant number of routers with a particular length-k identifier prefix. Indeed, the probability of finding a router with a specific level and a particular prefix of length k is 1/4^k. According to our density assumption, a neighborhood of diameter 2^k has Γ4^k nodes. Hence, such a neighborhood contains in expectation Γ routers matching a length-k prefix. Assume a network of N nodes, and let M = log_4 N. Identifier strings are composed of M digits. The level is a number between 1 and M. A level-k router has links allowing it to 'fix' its k'th identifier digit. Routers are interconnected in a butterfly-like network, such that level-k routers are linked only to level-(k+1) and level-(k−1) routers. Let d be a k-digit identifier. Denote d[j] as the prefix of the j most-significant digits, and denote d_j as the j'th digit of d. A concatenation of two strings d, d′ is denoted by d||d′.
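The expectation calculation above is easy to check numerically (a small sketch of ours, under the stated density assumption |N(x, r)| = Γr^2):

```python
def expected_prefix_matches(Gamma, k):
    """A radius-2^k neighborhood holds Gamma * 4^k nodes; each hosts a
    router matching a given length-k prefix (at a given level) with
    probability 4^-k, so the expected match count is exactly Gamma."""
    nodes_in_neighborhood = Gamma * 4**k
    p_match = 4.0 ** (-k)
    return nodes_in_neighborhood * p_match

for k in (1, 5, 10):
    print(k, expected_prefix_matches(Gamma=3, k=k))  # ~3.0 for every k
```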

Step 1: Geometric routing. The first step builds geometric routing, whose characteristic is that the routing steps toward a target increase geometrically in distance. This is achieved by having large flexibility in the choice of links at the beginning of a route, and narrowing it down as the route progresses. More specifically, each router r of level k, hosted by a node v, has four neighbor links, denoted L(b), b ∈ {0..3}. Each one of the links L(b) is selected as the closest node within C_b(r), where C_b(r) = {u ∈ V | ∃s, u.s.id[k] = v.r.id[k−1]||b, u.s.level = k+1}. The link L(b) 'fixes' the k'th digit to b, namely, it connects to the closest node that has a level-(k+1) router whose identifier matches v.r.id[k−1]||b. Geometric routing alone yields a cost which is proportional to the network diameter. The designs in [9, 4] make use of it to bound their routing costs by the network diameter.
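The digit-fixing skeleton of such a route is easy to see in isolation (a toy sketch of ours: it traces the identifier prefixes a route fixes, one radix-4 digit per hop, ignoring the "closest node" selection that gives the scheme its locality):

```python
def to_digits(x, M, base=4):
    """Most-significant-first base-`base` digits of identifier x."""
    return [(x // base**(M - 1 - i)) % base for i in range(M)]

def from_digits(digits, base=4):
    value = 0
    for d in digits:
        value = value * base + d
    return value

def digit_fixing_route(source_id, target_id, M, base=4):
    """Hop k fixes the k'th digit of the current identifier to the
    target's k'th digit; after M hops the target prefix is complete."""
    current = to_digits(source_id, M, base)
    target = to_digits(target_id, M, base)
    hops = []
    for k in range(M):
        current[k] = target[k]
        hops.append(from_digits(current, base))
    return hops

print(digit_fixing_route(source_id=27, target_id=201, M=4))  # [219, 203, 203, 201]
```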

Step 2: Shadow routers. The next step is unique to the design of LAND in [3]. Its goal is to turn the expectation of geometric routing into a worst-case guarantee. This is done while increasing node degree only by a constant expected factor. The technique to achieve this is for nodes to emulate links that are missing in their close vicinity as shadow nodes. In this way, the choice of links enforces a distance upper bound on each stage of the route, rather than probabilistically maintaining it. If no suitable endpoint is found for a particular link, it is emulated by a shadow node. The idea of bounding the distance of links is very simple: if a link does not exist within a certain desired distance, it is emulated as a shadow router. More precisely, for any level 1 ≤ k ≤ M, let r be a level-k router hosted by node v (this could itself be a shadow router, as described below). For b ∈ [0..3], if C_b(v) contains no node within distance 2^k, then node v emulates a level-(k+1) shadow router s that acts as the v.r.L(b) endpoint. Router s's id is s.id = v.r.id[k−1]||b and its level is (k+1). Since a shadow router also requires its own neighbor links, it may be that the j'th neighbor link of a shadow router s does not exist in C_j(s) within distance 2^(k+1). In such a case, v also emulates a shadow router that acts as the s.L(j) endpoint. Emulation continues recursively until all links of all the shadow routers emulated by v are found (or until the limit of M levels is reached). With shadow routers, we have a deterministic bound of 2^k on the k'th hop of a path, and a bound of ∑_{i=1..k} 2^i ≤ 2^(k+1) on the total distance of a k-hop path. A different concern we have now is that a node might need to emulate many shadow routers, thus increasing the node degree. Using a standard argument


on branching processes, we may obtain that hosting shadow routers increases a node's degree only by an expected constant factor. Shadow emulation of nodes is employed in LAND [3]. In all other algorithms, e.g., [12, 13, 15], a node's out-degree is a priori set so that the stretch bound holds with high probability (but is not guaranteed). Hence, there is a subtle tradeoff between guaranteed out-degree and guaranteed stretch. We believe that it is better to design networks whose outliers are in terms of out-degree than in terms of stretch. Additionally, fixing a deterministic upper bound on link distances results in a simpler analysis than working with links whose expected distance is bounded.

Step 3: Publish links. The final step in our deconstruction describes how to bring down routing costs from being proportional to the network diameter (which could be rather large) to being related directly to the actual distance of the target. This is done via a technique suggested by Plaxton et al. in [12], that makes use of short-cut links that increase the node degree by a constant factor. With a careful choice of the short-cut links, as suggested by Abraham et al. in [3], this guarantees an optimal stretch. The technique that guarantees a constant stretch is to 'publish' references to an object in a slightly bigger neighborhood than the regular links distance. The intuition on how to determine the size of the enlarged publishing neighborhood is as follows. The route that locates obj on t from s starts with the source s, and hops through nodes x1 . . . xk until a reference to obj is found on xk. The length of the route from s to xk is bounded by a_(k+1). The distance from xk to t is bounded (by the triangle inequality) by a_(k+1) + c(s, t). In order to achieve a stretch bound close to 1, we should therefore guarantee that a reference to obj is found on xk, where a_k is proportional to εc(s, t). This will yield a total route distance proportional to (1 + ε)c(s, t). Therefore, by selecting the range of publish links to cover xk, the stretch of any search path is bounded by 1 + ε. The total number of outgoing links per node increases only by an expected constant factor. The increased neighborhood for publishing provides a tradeoff between out-degree and stretch. Setting it large, so as to provide an optimal stretch bound, is unique to the design of LAND [3]. The designs in [12, 13, 15] fix the size of publish neighborhoods independently of the network density growth.

This yields a stretch bound that depends on the density growth rate of the network.
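A back-of-the-envelope check of the stretch argument (our own arithmetic sketch; the factor 2 between a_k and a_(k+1) follows the radii defined above, and the constants absorbed into ε are illustrative):

```python
def stretch_bound(eps, cst):
    """Route to x_k costs at most a_{k+1}; the leg x_k -> t costs at most
    a_{k+1} + c(s,t) by the triangle inequality. Publishing far enough
    that a_k ~ eps * c(s,t) makes every term except c(s,t) be O(eps)."""
    a_k = eps * cst
    a_k1 = 2 * a_k                 # a_{k+1} = 2 * a_k
    total = a_k1 + (a_k1 + cst)    # route to x_k, then x_k -> t
    return total / cst             # = 1 + 4*eps

print(stretch_bound(eps=0.05, cst=100.0))  # 1.2, i.e. 1 + O(eps)
```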

2 Solutions that are both Locality-aware and Robust

Previous lookup solutions achieved either fault tolerance [7, 11, 14] or provably good locality properties [12, 3], but not both. In this section we sketch a lookup network that has, with high probability, low stretch even in the presence of a failure model where all nodes may have a constant probability of failure. In terms of fault tolerance, the main drawback of PRR-like networks [12] is in their routing flexibility. In [5] it is shown that while hypercube and ring geometries have about (log n)! different routes from a given source to a given target, PRR-like networks have only one! Thus the basic architecture of [12, 15, 8, 13, 3] is fragile and must be augmented with some form of robustness. The first overlay network that has both a provably low latency for paths and a high fault tolerance was presented by the authors in FTLAND [2]. FTLAND achieves the combination of these two properties by augmenting the basic LAND architecture of Abraham et al. [3] with novel, locality-aware fault tolerance techniques. The techniques are based on the goal of dramatically increasing the routing flexibility to (log n)^(log n) while still maintaining a provably good proximity selection mechanism.

Technical approach. In order to have fault tolerance, a node must increase the number of outgoing links it may use for routing. Doing so naively, e.g., as in [10, 11, 7], by simply replicating each link to log n suitable destinations instead of one, compromises locality. More specifically, in PRR-like networks, hops have geometrically increasing distances. It is imperative that the i'th hop has distance at most a^i (where a denotes a base that is typically a parameter of the construction). However, if the closest link happens to be down and a replacement link is used, there is no guarantee on the distance, and locality is lost. In FTLAND, every node hosts O(log n) routers (instead of one) at each level. As before, routers are interconnected in a butterfly style, where level-k routers have outgoing links only to level-(k+1) routers. However, each router has w.h.p. O(log n) outgoing links (instead of an expected constant) for each desired destination. Since each node in the


network has log n routers at each level, whose identifiers are independently and uniformly selected, a router finds all O(log n) replicated destinations at a distance no greater than the distance to the closest router in the LAND scheme. Hence, locality is preserved when using any of these links. The total number of links increases by a poly-log factor (for each of the O(log n) levels there are O(log n) routers in place of one, each of which has O(log n) replicated links w.h.p.). Routing over the (log n)^(log n) possibilities is done deterministically, with no backtracking. At each hop, one live link is followed, and with high probability, it can lead to the target. Dealing with failures can be done in a very lazy manner, since the network can maintain a successful, locality-aware service in the face of a linear fraction of unavailabilities. This property is crucial for coping well with churn, as a sustained quality of service is guaranteed through transitions. It also serves well to cope with transient disconnections and temporary failures, since there is no need for the network to reconfigure itself in response to small changes. In [2], the following is proven:

Theorem 1 There exists a scheme that requires O(log^3 n) links and w.h.p. routes on paths of stretch 1 + ε even if every node in the network has a constant probability of failure.

References

[1] I. Abraham and D. Malkhi. Principles of locality-aware networks for locating nearest copies of data. Technical Report Leibnitz Center TR 2003-84, School of Computer Science and Engineering, The Hebrew University, 2003.
[2] I. Abraham and D. Malkhi. A robust low stretch lookup network. Technical Report Leibnitz Center TR 2003, School of Computer Science and Engineering, The Hebrew University, 2004.
[3] I. Abraham, D. Malkhi, and O. Dobzinski. LAND: Stretch (1 + ε) locality aware networks for DHTs. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA 04), 2004.
[4] A. Goel, H. Zhang, and R. Govindan. Incrementally improving lookup latency in distributed hash table systems. In ACM Sigmetrics, 2003.
[5] K. Gummadi, R. Gummadi, S. Gribble, S. Ratnasamy, S. Shenker, and I. Stoica. The impact of DHT routing geometry on resilience and proximity. In Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications, pages 381–394. ACM Press, 2003.
[6] K. P. Gummadi, R. J. Dunn, S. Saroiu, S. D. Gribble, H. M. Levy, and J. Zahorjan. Measurement, modeling, and analysis of a peer-to-peer file-sharing workload. In Proceedings of the nineteenth ACM symposium on Operating systems principles, pages 314–329. ACM Press, 2003.
[7] K. Hildrum and J. Kubiatowicz. Asymptotically efficient approaches to fault-tolerance in peer-to-peer networks. In Proceedings of the 17th International Symposium on DIStributed Computing (DISC 2003), 2003.
[8] K. Hildrum, J. D. Kubiatowicz, S. Rao, and B. Y. Zhao. Distributed object location in a dynamic network. In Proceedings of the Fourteenth ACM Symposium on Parallel Algorithms and Architectures, pages 41–52, Aug 2002.
[9] X. Li and C. G. Plaxton. On name resolution in peer-to-peer networks. In Proceedings of the 2nd ACM Workshop on Principles of Mobile Commerce (POMC), pages 82–89, October 2002.
[10] D. Malkhi, M. Naor, and D. Ratajczak. Viceroy: A scalable and dynamic emulation of the butterfly. In Proceedings of the 21st ACM Symposium on Principles of Distributed Computing (PODC '02), pages 183–192, 2002.
[11] M. Naor and U. Wieder. A simple fault tolerant distributed hash table. In Proceedings of the 2nd International Workshop on Peer-to-Peer Systems (IPTPS '03), 2003.
[12] C. Plaxton, R. Rajaraman, and A. Richa. Accessing nearby copies of replicated objects in a distributed environment. In Proceedings of the Ninth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA 97), pages 311–320, 1997.
[13] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), pages 329–350, 2001.
[14] J. Saia, A. Fiat, S. Gribble, A. R. Karlin, and S. Saroiu. Dynamically fault-tolerant content addressable networks. In Proceedings of the First International Workshop on Peer-to-Peer Systems, 2002.
[15] B. Y. Zhao, L. Huang, J. Stribling, S. C. Rhea, A. D. Joseph, and J. Kubiatowicz. Tapestry: A resilient global-scale overlay for service deployment. IEEE Journal on Selected Areas in Communications, 2003.


Experience with a physics-style approach for the study of self properties in structured overlay networks ∗

Sameh El-Ansary1, Erik Aurell1,2, Per Brand1 and Seif Haridi1,3
1 Distributed Systems Lab, SICS-Swedish Institute of Computer Science, Sweden
2 Department of Physics, KTH-Royal Institute of Technology, Sweden
3 IMIT, KTH-Royal Institute of Technology, Sweden
{sameh,eaurell,perbrand,seif}@sics.se †

Abstract

This paper gives a brief summary of our experience in applying a physics-style approach for analyzing the behavior of structured overlay networks that deploy self-organization and self-repair policies. Such systems are not always simple to model analytically, and simulation at the scales of interest can be prohibitive. Physicists deal with scale by characterizing a system using intensive variables, i.e. variables that are size-independent. The approach proved substantially useful when applied to satisfiability theory, and the hope is that it can be as useful in the field of large-scale distributed systems. We report here our finding of one simple self-organization-related intensive variable, and a more complex self-repair-related intensive variable.

1 Introduction

A number of structured P2P overlays [6, 5, 7, 1], aka Distributed Hash Tables (DHTs)1, were recently designed relying on self-organization and self-repair strategies.

Self-organization: In a structured overlay network, nodes are supposed to self-organize in a graph with a diameter and an outgoing arity of nodes that are both of a logarithmic order of the number of nodes2. Moreover, every node in the graph is responsible for the storage of data items, and storage is uniformly distributed among nodes in a self-organizing fashion as well. Nodes balance storage load between themselves without any central coordination.

Self-repair: i) Repair of routing tables. To maintain the graph in an optimal state despite change of

∗ Currently on sabbatical leave at the National University of Singapore.
† This work is funded by the Swedish funding agency VINNOVA, PPC project, the European IST-FET PEPITO and the EU 6th FP EVERGROW project.
1 We will use both terms interchangeably.
2 Actually, some systems like [3, 4] can provide a logarithmic order diameter with a constant order routing table.

membership (joins/failures), each node needs to follow some maintenance policy for keeping its routing table (the outgoing edges) up-to-date. ii) Repair of storage. Upon a join, a node might need to transfer some data items to the new node. Upon a graceful departure, a node needs to hand over its stored items to another node. Upon ungraceful failures, data items are lost, and thus a replication policy that both respects and makes use of the organization is applied.

Dominance of self-organization in DHT literature. Our general observation on the literature of structured overlay networks is that the self-organization aspect is dominant compared to self-repair. The typical case for a paper introducing a DHT system would be to show the structure of the routing table, how the overlay graph will be constructed, and protocols for joins and leaves; when it comes to self-repair, the discussion gets relatively superficial. We find arguments on the level of: "Periodic maintenance of routing tables will ensure its correctness", "Data items have to be republished periodically by the upper layer", etc. In the best case, a simulation is given showing that under particular stabilization rates, the network can operate. We attribute this phenomenon to two factors: i) the novelty of such systems and the requirement to establish the new concepts first before deep analysis is performed; ii) once we have a system exhibiting a self-repair property, an analytical model that describes the behavior of the system can range in degree of difficulty from not-so-clear to very-complex-to-analyze. Therefore, we see the compelling need for more studies of an analytical nature that can tell us whether those novel approaches are really useful or over-hyped.

2 The physics-style approach

Motivation. Having observed that analytical models are not always trivial to formulate given a



system applying a given self-repair policy, simulation therefore seemed to be the practical tool for analyzing such systems. However, scales of interest could be prohibitive for simulation purposes. At this point, a physics-style approach started to be of interest, since physicists are accustomed to reasoning about large natural systems.

Figure 1: The average lookup length as a function of ρ and N.

Figure 2: Data collapse of the average lookup length as a function of ρ and N compared to 0.5 log2 ρ.

How do physicists deal with scale? The first level of analysis in a physical system of many components is to try to separate intensive and extensive variables. Extensive variables are those that eventually become proportional to the size of the system, such as total energy. Intensive variables, such as density, temperature and pressure, on the other hand, become independent of system size. A description in terms of intensive variables only is a great step forward, as it holds regardless of the size of the system, if sufficiently large. Further steps in a physics-style analysis may include identifying phases, in each of which all intensive variables vary smoothly, and where the characteristics of the system remain the same. Was the approach useful in the computer science arena? A physics-style approach was carried over to satisfiability theory more than ten years ago. KSAT is the problem of determining whether a conjunction of M clauses, each one a disjunction of K literals out of N variables, can be satisfied. Both M and N are extensive variables, while α = M/N, the average number of clauses per variable, is an intensive variable. For large N, instances of KSAT fall into either the SAT or the UNSAT phase depending on whether α is larger or smaller than a threshold αc(K) [2]. Without question, statistical mechanics has proved to be very useful on very challenging problems in theoretical computer science, and it can be hoped that this will also be the case in the analysis and design of distributed systems.

3 Intensive variables of structured overlay networks

Having explained the importance of identifying intensive variables in describing characteristics that hold irrespective of size, the question that we are trying to investigate is: "Is it possible to find intensive variables to describe the characteristics of structured overlay networks that deploy self-organization and self-repair policies?"

3.1 A self-organization-related intensive variable

To establish our methodology, we started first with analyzing a self-organization aspect of a DHT, namely, the effect of the density of nodes in the identifier space. We used Chord as an example system on which to conduct our experiments. The analysis followed the three following methodological steps:

Step 1: Nomination of intensive variables. Let N be the size of the identifier space and P be the population, i.e. the number of nodes that are uniformly distributed in the identifier space. We define the density ρ to be the ratio P/N, with a maximum value of 1 for a fully populated system. Our question is: "Is ρ an intensive variable?"

Step 2: Looking for characteristic behavior. A key quantity of interest in a DHT system is the average lookup path length. Therefore, studying the effect of the density on the average lookup path length should represent a characteristic behavior.

Step 3: Simulation.

Experiments set-up. Let Chord(P, N) be an optimal Chord graph, where all the fingers of all nodes are correctly assigned. For all N ∈ {2^7, 2^8, .., 2^14},



for all P ∈ {0.1 × N, 0.2 × N, .., 1.0 × N}, we generate Chord(P, N), inject P^2 uniformly distributed lookups, and record the average lookup length over the P^2 lookups, denoted L(P, N) or equivalently L(ρ, N). This procedure is repeated 10 times, with different random seeds, and the results are averaged.

Results. Figure 1 shows the behavior of the path length as a function of the density and the size of the identifier space. The curves are, to a first approximation, vertically shifted by the same distance, while the values of N used are exponentially spaced. This means that the dependence on N alone (constant P) is logarithmic. Indeed, it was noted in the Chord papers that the average path length is 0.5 × log P. However, we can make an additional observation by looking at the data collapse obtained in figure 2 by subtracting L(1, N) from every respective curve L(ρ, N) and comparing to 0.5 log2 ρ. From the data collapse, we can clearly see that L(ρ, N) = 0.5 log2 ρ + f(ρ), where the function f is a decreasing function. That is, for any given number of nodes, the average lookup length increases when they are placed in a smaller identifier space, and the relative effect is the same irrespective of the system size; therefore ρ is an intensive variable. It is a curious fact that the function f(ρ) alluded to above is decreasing, because if the P populated nodes are regularly spaced in the circular address space, the average path length is exactly 0.5 log2 P, in other words larger. Hence, we have as a result that randomization improves the performance of a P2P system built on a DHT, even in a static situation, with no peers leaving or joining the system.

Figure 3: The average distance from optimal network δ as a function of the speculated intensive variable β (the ratio of the average time between change events µ and the average time between stabilization events τ).

Figure 4: Data collapse of figure 3 obtained by using β′.
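The data-collapse test itself is mechanical, as the following toy Python sketch of ours shows (the model L used here merely has the claimed functional form and stands in for measured averages):

```python
import math

def collapse(L, rhos, Ns):
    """For each system size N, subtract L(1, N) from L(rho, N); if rho is
    intensive, the shifted curves coincide across all N."""
    return {N: [round(L(r, N) - L(1.0, N), 6) for r in rhos] for N in Ns}

def L_model(rho, N, f=lambda r: 0.3 * (1.0 - r)):
    """Toy lookup-length model of the form reported above."""
    return 0.5 * math.log2(rho * N) + f(rho)

print(collapse(L_model, rhos=[0.1, 0.5, 1.0], Ns=[2**7, 2**14]))
# both sizes yield the same shifted curve -> the curves collapse
```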

3.2 A self-repair-related intensive variable

Step 1: Nomination of intensive variables. Let τ be the time between two stabilization actions of a certain node. Let µ be the average time between two perturbation events (joins or failures) while the network is in a stable state, that is, while the number of nodes is varying around a certain population P0. We are interested in understanding the interaction between perturbation and stabilization as two opposite forces, the former pulling the network towards suboptimal performance and the latter bringing it back to the optimal state. However, the behavior of that interaction is not known. Taking µ as the magnitude of perturbation and τ as the magnitude of stabilization, we need to answer the following question: "Is β = µ/τ an intensive variable?"

Step 2: Looking for characteristic behavior. In this investigation, we did not start with the average lookup length as the indicator for a characteristic behavior; we needed a more descriptive metric, and thus we used what we call the "distance from optimal network δ", which is computed as follows:

δ = ( ∑_{i∈P} ∑_{j=1..log N} [Edge^i_j ≠ OptimalEdge^i_j] ) / (P log N)    (1)

where Edge^i_j is the j'th (1 ≤ j ≤ log N) outgoing edge of a node i ∈ P and OptimalEdge^i_j is the optimal value for that edge. P log N is the total number of edges (P nodes, log N edges per node, where N is the size of the identifier space). Informally, δ is the number of "wrong/outdated" edges over the total number of edges.
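Equation (1) translates directly into code; the following small Python sketch of ours computes δ for a toy network (the data layout is an assumption for illustration):

```python
def distance_from_optimal(edges, optimal_edges):
    """delta of equation (1): the fraction of routing-table entries that
    differ from their optimal values, over all P nodes and log N edges."""
    wrong = total = 0
    for node, node_edges in edges.items():
        for j, edge in enumerate(node_edges):
            wrong += (edge != optimal_edges[node][j])
            total += 1
    return wrong / total

# Toy example: 2 nodes with 3 edges each; one stale pointer -> delta = 1/6.
edges   = {"n1": ["a", "b", "c"], "n2": ["d", "e", "stale"]}
optimal = {"n1": ["a", "b", "c"], "n2": ["d", "e", "f"]}
print(distance_from_optimal(edges, optimal))  # 0.1666...
```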


Step 3: Simulation.

Experiments set-up. We let P0 nodes form a network and we wait until δ is equal to 0.0, i.e. the network graph is optimal. We then let the network operate under specified values of µ and τ for 50 turnovers (a turnover is the replacement of P nodes with another P). During this experiment we record δ frequently and average it over the whole experiment, denoted ⟨δ⟩.

Effect of β on a fixed P0. For one value of P0, we examine various values of β by fixing τ and varying µ. We then repeat the whole experiment with a different τ and vary µ such that the same values of β are conserved.

Effect of β under different values of P0. We repeat the above procedure under different values of P0.

Results. As shown in figure 3, all the curves of a given P0 are superimposed. This means that for a network undergoing changes around an average size P0, irrespective of how fast the stabilization (τ) or the perturbation (µ) is, as long as their ratio (β) is the same, the average distance from optimal network (⟨δ⟩) is the same. To see the behavior irrespective of the size, we need to perceive the obtained results differently. The stabilization as defined above is a "node-level" event, while the perturbation is a "whole-graph-level" event. That is, β is defined as the ratio of the perturbation of the system (µ) to the stabilization of each node (τ). Therefore, if we were to compare the behavior of two network sizes under the same values of τ and µ, the network with the larger size would have the same perturbation but higher stabilization, since the number of nodes is larger. Therefore, to have a fairer comparison, we define β′ as the ratio of the perturbation of the system (µ) to the stabilization of the system (τ/P0), i.e. β′ = µ/(τ/P0). The β′ re-plot of figure 3 is shown in figure 4, where the data collapse shows that all the system sizes behave the same and therefore β′ is an intensive variable.

4 Conclusion and future work

We have reported in this paper our progress in investigating whether a physics-style analytical approach can give more understanding of the performance of structured overlay networks. The approach mainly necessitates the description of the characteristics of the system using variables that do not depend on the size, known as intensive variables. Using this approach, we have shown that: i) the density of nodes in an identifier space, a self-organization-related variable, is an intensive variable that describes a characteristic behavior of a network irrespective of its size; ii) the ratio of perturbation to stabilization β, a self-repair-related variable, governs the absolute number of wrong pointers in an overlay graph, irrespective of its size. In the continuation of this work, we intend to do the following: perform the same experiment with a wider spectrum of numbers to have more statistically accurate results; use the characteristic behaviors to provide a more adaptive nature to the current DHT algorithms; and search for more intensive variables and possible phase transitions.

References

[1] Luc Onana Alima, Sameh El-Ansary, Per Brand, and Seif Haridi, DKS(N, k, f): A Family of Low Communication, Scalable and Fault-Tolerant Infrastructures for P2P Applications, The 3rd International Workshop on Global and Peer-To-Peer Computing on Large Scale Distributed Systems (CCGRID2003) (Tokyo, Japan), May 2003, http://www.ccgrid.org/ccgrid2003.
[2] S. Kirkpatrick and B. Selman, Critical behaviour in the satisfiability of random boolean expressions, Science 264 (1994), 1297–1301.
[3] D. Malkhi, M. Naor, and D. Ratajczak, Viceroy: A scalable and dynamic emulation of the butterfly, In Proceedings of the 21st ACM Symposium on Principles of Distributed Computing (PODC '02), August 2002.
[4] Moni Naor and Udi Wieder, Novel architectures for p2p applications: the continuous-discrete approach, In Proceedings of SPAA 2003, 2003.
[5] Antony Rowstron and Peter Druschel, Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems, Lecture Notes in Computer Science 2218 (2001), citeseer.nj.nec.com/rowstron01pastry.html.
[6] I. Stoica, R. Morris, D. Karger, M. Kaashoek, and H. Balakrishnan, Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications, ACM SIGCOMM 2001 (San Diego, CA), August 2001, pp. 149–160.
[7] Ben Y. Zhao, John D. Kubiatowicz, and Anthony D. Joseph, Tapestry: An Infrastructure for Fault-tolerant Wide-area Location and Routing, U. C. Berkeley Technical Report UCB//CSD-01-1141, April 2000.


Self-scaling Networks for Content Distribution Pascal A. Felber, Ernst W. Biersack Institut EURECOM, 06904 Sophia Antipolis, France {felber,erbi}@eurecom.fr

Abstract— Peer-to-peer networks have often been touted as the ultimate solution to scalability. Although cooperative techniques have initially been used almost exclusively for content lookup and sharing, one of the most promising applications of the peer-to-peer paradigm is to capitalize on the bandwidth of client peers to quickly distribute large content and withstand flash-crowds (i.e., a sudden increase in popularity of some online content). Cooperative content distribution is based on the premise that the capacity of a network is as high as the sum of the resources of its nodes: the more peers in the network, the higher its aggregate bandwidth, and the better it can scale and serve new peers. Such networks can thus spontaneously adapt to the demand by taking advantage of available resources. In this paper, we evaluate the use of peer-to-peer networks for content distribution under various system assumptions, such as peer arrival rates, bandwidth capacities, cooperation strategies, or peer lifetimes. We specifically try to answer the question: "Do the self-scaling and self-organizing properties of cooperative networks pave the way for cost-effective, yet highly efficient and robust content distribution?"

I. Introduction

Cooperative content distribution networks are inherently self-scalable, in that the bandwidth capacity of the system increases as more peers arrive: each new peer requests service from, but also provides service to, the other peers. The network can thus spontaneously adapt to the demand by taking advantage of the resources provided by every peer. As an example of the self-scaling properties of cooperative content distribution, consider the situation where a server must replicate a critical file to a large number of clients, e.g., an antivirus update, to all 100,000 machines of a large company. Given a file size of 4 MB, and a server (client) bandwidth capacity of 100 Mb/s (10 Mb/s) with 90% link utilization, a classical client/server distribution protocol would distribute the file by iteratively serving groups of 10 simultaneous clients, each group taking u = 32 Mb / 9 Mb/s ≈ 3.55 seconds. Updating 100,000 clients would thus necessitate (100,000/10) · u, i.e., almost 10 hours. In contrast, cooperative distribution leverages the bandwidth of the nodes that have already obtained the file, thus dynamically increasing the service capacity of the system as the file propagates to the clients. As each client that has already received the file can serve another client while the server updates 10 new clients, we can compute the number of clients updated at time t as n(t) = 2n(t − u) + 10 = 10 · 2^(t/u) − 10. Updating 100,000 clients would thus necessitate less than 1 minute, as can be observed in Figure 1. The exponential increase of peer-to-peer distribution provides a sharp contrast with the linear progression of traditional client/server distribution, and illustrates the self-scaling property of cooperative networks.
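The arithmetic behind this example can be checked in a few lines of Python (a back-of-the-envelope sketch of the model above, not part of the paper):

```python
import math

# Example parameters: 4 MB file, server 100 Mb/s, clients 10 Mb/s, 90% utilization.
u = 32 / 9.0                     # seconds per batch: 32 Mb at 9 Mb/s effective per client
clients = 100_000

# Client/server: one batch of 10 clients every u seconds.
t_client_server = clients / 10 * u                  # ~35,500 s, almost 10 hours

# Cooperative: n(t) = 2 n(t - u) + 10, i.e. n(t) = 10 * 2^(t/u) - 10.
t_cooperative = u * math.log2(clients / 10 + 1)     # ~47 s, under one minute

print(f"client/server: {t_client_server / 3600:.1f} h")
print(f"cooperative:   {t_cooperative:.0f} s")
```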

II. Cooperative Content Distribution

In order to maximize the participation of each of the peers in the network, large content is typically split into many blocks (or "chunks") that are directly exchanged between the peers, a technique also known as "swarming." The large number and small size of the chunks are key to quickly creating enough diversity in the network for each of the peers to be useful to some other peers. Cooperative networks are usually built incrementally, with joining peers dynamically connecting to existing peers to eventually create complex mesh topologies. In practice, a peer usually knows only a subset of other peers, and actively trades with an even smaller subset. In addition to the actual structure of the mesh (i.e., which and how many neighbors each peer has), two factors are crucial to the global effectiveness of the content distribution process:
• Peer selection strategy: which among our neighboring peers will we actively trade with, i.e., serve or request chunks from?
• Chunk selection strategy: which chunks will we preferably serve to, or request from, other peers?
The popular BitTorrent [1] tool, which we have studied extensively in [2], empirically selects the peers that offer the best upload and download rates to trade with ("tit-for-tat" strategy). When a new peer joins the system, it initially requests random chunks in order to quickly receive some data and become useful to the system; thereafter, it requests the rarest chunks among those owned by its neighbors, because rare chunks have a higher "trading value" than common chunks. The main focus of our study is to better understand the potential and the limitations of cooperative networks for content distribution. In particular, we evaluate several peer and chunk selection strategies to determine which ones perform best in various deployment scenarios. For the purpose of our evaluation, we only study the extreme case where each peer knows all other peers (fully-connected mesh) and can potentially trade with any of those peers during its lifetime, although we impose a limit on the number of simultaneous active connections. This assumption allows us to observe the asymptotic behavior of the various cooperative strategies.

A. Deployment Scenarios

In our study, we specifically focus on two deployment scenarios that correspond to real-world applications of cooperative content distribution. In the first scenario, we assume that some critical content needs to be quickly replicated on a large number of machines within the private network of a large company. This essentially corresponds to a push model where all the peers


are known beforehand and distribution stops once the content has been fully replicated on all the machines, which typically have similar connectivity (homogeneous bandwidth). The second scenario of interest corresponds to the traditional Internet flash-crowd phenomenon, where a large number of clients access almost simultaneously some large popular content. This corresponds to a pull model with continuous arrival of the peers. Distribution continues over several peer "generations," with some peers arriving well after the first peers have already left. The clients typically have heterogeneous bandwidth capacities, ranging from dial-up modems to broadband access (asymmetric and symmetric).

B. Notation

We denote by C the set of all chunks in the file being distributed, and by Di and Mi the sets of chunks that peer i has already downloaded and is still missing, respectively (with Mi ∪ Di = C and Mi ∩ Di = ∅). Similarly, di ≜ |Di|/|C| and mi ≜ |Mi|/|C| correspond to the proportions of chunks that peer i has already downloaded and is still missing, respectively. The function U(a, b) returns a random number uniformly distributed in the interval [a, b].

C. Peer Selection

The peer selection strategy defines "trading relationships" between peers and affects the way the network self-organizes. In our simplified model, we assume that all the peers know one another. When a peer has some chunks available and some free uplink bandwidth capacity, it uses a peer selection strategy to locally determine which other peer it will serve next. In this paper, we propose and evaluate the following peer selection strategies:
• Random: a peer is selected at random. This strategy is expected to achieve good diversity in peer connectivity.
• Least missing: preference is given to the peers that have many chunks, i.e., we serve in priority peer j with dj ≥ di, ∀i. This strategy is inspired by the SRPT (shortest remaining processing time) scheduling policy that is known to minimize the service time of jobs [3].
• Most missing: preference is given to the peers that have few chunks (newcomers), i.e., we serve in priority peer j with dj ≤ di, ∀i. The rationale behind this strategy is to evenly spread chunks among all peers to allow them to quickly serve other peers.
• Adaptive-missing: peers that have many chunks serve peers that have few chunks, and vice versa, with more randomness introduced when downloads tend to be half complete. A peer i will serve in priority the peer j with the lowest rank rj, computed as:

    rjRnd = U(0, 1)
    rjDet = dj if di ≥ 0.5, and mj if di < 0.5
    f = (1 − |2di − 1|)²
    rj = f · rjRnd + (1 − f) · rjDet

where rjRnd and rjDet are the random and deterministic ranks of peer j, respectively, and f ∈ [0, 1] is a weight factor that controls randomness and is maximal when peer i is exactly half-way through the download. This strategy is expected to give good chances to newcomers without artificially slowing down peers that are almost complete.
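As an illustration, here is a direct transcription of the adaptive-missing rank into Python (a sketch; the surrounding peer bookkeeping is assumed):

```python
import random

def adaptive_missing_rank(d_i, d_j):
    """Rank that serving peer i assigns to candidate peer j (lower = served first).
    d_i, d_j are the download completion fractions of peers i and j."""
    r_rnd = random.uniform(0.0, 1.0)              # random rank component
    r_det = d_j if d_i >= 0.5 else 1.0 - d_j      # deterministic rank (m_j = 1 - d_j)
    f = (1.0 - abs(2.0 * d_i - 1.0)) ** 2         # randomness weight, maximal at d_i = 0.5
    return f * r_rnd + (1.0 - f) * r_det

# Peer i serves the candidate with the lowest rank:
# next_peer = min(candidates, key=lambda j: adaptive_missing_rank(d[i], d[j]))
```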

Although not shown in this paper because of space constraints, we have also experimented with randomized variants of least missing and most missing, as well as additional strategies that take into account the free bandwidth capacities of the peers.

D. Chunk Selection

The chunk selection strategy specifies which chunks should preferably be traded between the peers. Chunk selection can be performed by the receiver (which requests specific chunks from its neighbors) or by the sender (which decides which chunk it will send next on an active connection). With both interaction models, obviously, the chosen chunk must be held by the sender and not by the receiver. In our simplified model, we assume that every peer knows the list of chunks held by its neighbors (i.e., all peers, given the fully-connected mesh topology) and that the chunk selection strategy is applied on the sender's side. In this paper, we evaluate the following chunk selection strategies:
• Random: the sending peer i selects a chunk c ∈ (Di ∩ Mj) at random among those that it holds and the receiving peer j needs. This strategy ensures good diversity of the traded chunks.
• Rarest: the sending peer i selects the rarest chunk c ∈ (Di ∩ Mj) among those that it holds and the receiving peer j needs. Rarity is computed from the number of instances of each chunk held by the peers known to the sender. This strategy is expected to maximize the number of copies of the rarest chunk in the system.

III. Simulation and Evaluation

For the purpose of evaluating cooperative content distribution, we have developed a simulator that models various types of peer-to-peer networks and allows us to observe step by step the distribution of large files among all peers in the system, according to several metrics. Although we have taken extra care to reproduce realistic operating conditions, we have nevertheless made some assumptions in order to simplify and speed up the simulations. In particular, we do not consider failures (peer or network) nor link congestion in any of the experiments, and we do not favor long-running connections over short connections as real systems usually do. Due to space constraints, we only present here selected results of the simulations of extreme scenarios (little heterogeneity, limited server bandwidth) that best exhibit the differences between the various aforementioned strategies; more moderate scenarios have shown the same general trends, albeit with lower intensity.

A. Methodology and Setup

Our simulator is essentially event-driven, with events being scheduled and mapped to real time with millisecond precision. The transmission delay of each chunk is computed dynamically according to the link capacities (minimum of the sender uplink and receiver downlink) and the number of simultaneous transfers on the links (bandwidth is equally split between concurrent connections).



Fig. 1. Scalability of cooperative content distribution: the number of clients that successfully receive a file increases linearly with client/server distribution, and exponentially with cooperative distribution.

Fig. 2. Completion times for the random chunk selection strategy, with simultaneous arrivals, homogeneous and symmetric bandwidth, and selfish peers.

Once a peer i holds at least one chunk, it becomes a potential server. It first sorts its neighboring peers according to the specified peer selection strategy. It then iterates through the sorted list until it finds a peer j that (1) needs some chunks from Di (Di ∩ Mj ≠ ∅), (2) is not already being served by peer i, and (3) is not overloaded. We say that a peer is overloaded if it has reached its maximum number of connections and has less than 128 kb/s bandwidth capacity left. Peer i then applies the specified chunk selection strategy to choose the best chunk to send to peer j. Peer i repeats this whole process until it becomes overloaded or finds no other peer to serve.

Our simulator allows us to specify several parameters that define its general behavior and operating conditions. The most important ones relate to the content being transmitted (file size, chunk size), the peer properties (arrival rates, bandwidth capacities, lifetimes, number of simultaneous active connections), and global simulation parameters (number of initial servers or "origin peers," simulation duration, peer selection strategy, chunk selection strategy). Table I summarizes the values of the main parameters used in our simulations.

TABLE I. Parameters used in the simulations.
- Chunk size: 256 kB
- File size: 200 chunks (i.e., 51.2 MB)
- Peer arrival rate: simultaneous (push): 5000 peers at t0; continuous (flash-crowd): Poisson with rate λ = 1/(2.5 s)
- Peer bandwidth (downlink/uplink): homogeneous, symmetric: 100% of peers at 128/128 kb/s; homogeneous, asymmetric: 100% of peers at 512/128 kb/s; heterogeneous, asymmetric: 50% of peers at 512/128 kb/s
- Peer lifetime: selfish: disconnects when complete; altruistic: remains 5 minutes online
- Active connections per peer: 5 inbound and 5 outbound
- Number of origin peers: 1 (bandwidth: 128/128 kb/s)
- Duration of simulation: 12 h or more
- Peer selection strategy: varies
- Chunk selection strategy: varies

B. Simultaneous Arrivals

The chunk selection strategy can have a significant impact on the effectiveness of cooperative content distribution, especially when considering selfish peers. As shown in Figure 2, several of the peer selection strategies need a long time to replicate the file on all clients.


Fig. 3. Download progress for the random peer selection strategy, with the random chunk selection strategy, simultaneous arrivals, homogeneous and symmetric bandwidth, and selfish peers.

First consider that the transmission of all 200 chunks of the file over a 128 kb/s connection requires (200 · 256 · 8 kb) / (128 kb/s) = 3200 seconds, i.e., slightly less than one hour. If we could construct a linear chain, with each client receiving the file from the previous peer in the chain and serving it simultaneously to the next one, we could theoretically approach this asymptotic limit. In practice, because we only consider the transmission of complete chunks and we share bandwidth capacities between several connections, we expect to experience lower efficiency.

We can explain the low performance of the least missing peer selection strategy by the fact that the server will initially only serve the same 5 peers that are closest to completion. These peers will in priority exchange chunks with each other and then slowly propagate some chunks to the other peers, which remain mostly idle because they have no rare chunks to trade. As completed peers immediately leave the system, we essentially have one server (the initial peer) that iteratively serves batches of 5 peers at a time, which explains the low efficiency of the least missing strategy. One should note, however, that this strategy minimizes the download time of the first complete peer.

At the other extreme, the most missing peer selection strategy tries to make all clients progress simultaneously, thus making them quickly and equally useful to others. This results in a better utilization of the available resources. By "artificially" delaying the departure of the peers, we always keep a large service capacity and ensure that all peers complete at approximately the same time. In the case of simultaneous arrivals, we can observe that the most missing strategy minimizes the download time of the last complete peer.

The random peer selection strategy is expected to let all peers progress at approximately the same rate, and thus to behave roughly like the most missing strategy. We observe, however, that only one third of the peers complete simultaneously and the rest essentially follow the same pattern as the least missing strategy. This problem can be tracked down to the random chunk selection. Indeed, the chunks that were injected first in the system exist in many instances, while the chunks injected later are very rare, with the server doing nothing to correct this imbalance. Most of the peers quickly reach near completion, as shown in Figure 3, but many require much more time to obtain the few missing chunks (often just one) that are only held by the origin server.

The adaptive missing strategy is interesting because it seems to inherit some of the good properties of each of the extreme least missing and most missing strategies. It initially quickly and evenly replicates blocks in the system and, at the same time, does not artificially prevent near-complete peers from finishing their download.

When switching to the rarest chunk selection strategy, we observe in Figure 4 significant performance improvements, particularly for the random peer strategy, which becomes as efficient as most missing, and the least missing strategy, which shows a seven-fold improvement. In contrast to the random chunk selection strategy, we do not experience the pathological situation where the origin sequentially serves the rare missing chunks to almost-complete peers.



Fig. 4. Completion times for the rarest chunk selection strategy, with simultaneous arrivals, homogeneous and symmetric bandwidth, and selfish peers.


C. Continuous Arrivals

In the case of continuous arrivals and asymmetric bandwidth (512/128 kb/s ADSL) with moderately altruistic peers, we observe in Figure 5 that the random and adaptive missing peer selection strategies keep up with the arrival rate of the clients, with the latter looking empirically better initially. The most missing strategy delays the completion of a first batch of clients, before following the same slope as the arrivals but with notable steps. Finally, the least missing strategy shows an odd behavior: the number of complete peers is slow to "take off," then makes a big step to overtake all other strategies, then stalls again for a longer period of time before another, even higher step, and so on. To better understand this behavior, consider that the origin peer will iteratively serve groups of 5 peers until they complete their download. The peers of a group will exchange chunks with each other in priority, but also slowly propagate some chunks to other less-complete peers, which will quickly disseminate them among all remaining peers (indeed, they cannot serve more-complete peers, as the least missing strategy would require, because they only have blocks that the more-complete peers also hold). Therefore, we have few peers that complete very fast, and a large majority of peers that progress slowly but steadily and eventually complete all together.

We can better understand the behavior of the peer selection strategies by considering the chunk capacity of the system with respect to time, shown in Figure 6. The random and adaptive missing strategies maintain a nearly constant number of chunks in the system. We can note that the latter looks more efficient than the former in this deployment scenario, as it achieves the same completion rate with a lower average chunk capacity. The most missing strategy creates a higher chunk capacity by delaying peers until the first batch completes, which corresponds to the sharp drop of chunk capacity. Thereafter, the capacity oscillates with a constant period, driven by the batches of peers that progress and complete together. Finally, the least missing strategy exhibits the highest volatility in chunk capacity. The system traverses phases during which it builds an extremely large chunk capacity, and then completely empties it by letting almost all peers terminate simultaneously. Interestingly, the frequency and amplitude of the oscillations increase over time. This corresponds to the steps that we have observed in Figure 5.


Fig. 5. Completion times for continuous arrivals, with the rarest chunk selection strategy, homogeneous and asymmetric bandwidth, and altruistic peers.



Fig. 6. Chunk capacity of the system, with the rarest chunk selection strategy, homogeneous and symmetric bandwidth, and altruistic peers.

IV. Conclusions and Open Issues

The main objective of this paper was to assess the potential of, and make a case for, cooperative content distribution. Based on our preliminary study, it appears that the self-scaling and self-organizing properties of peer-to-peer networks do indeed offer the technical capabilities to quickly and efficiently distribute large or critical content to huge populations of clients. Cooperative distribution techniques capitalize on the bandwidth of every peer to dramatically increase the service capacity of the system. The efficiency of these techniques does, however, depend on many factors. In particular, the chunk and peer selection strategies directly impact the delay experienced by the clients and the global throughput of the system. We did not clearly identify a "best" strategy, as each of them offers various trade-offs and may prove most adequate for specific deployment scenarios. Further investigations will be necessary to answer the many open questions raised by our study. In particular, we took into account neither failures nor the churn of the system, and it is not clear how such networks behave in the face of malicious or uncooperative clients.

References

[1] B. Cohen, "Incentives to build robustness in BitTorrent," Tech. Rep., http://bitconjurer.org/BitTorrent/bittorrentecon.pdf, May 2003.
[2] M. Izal, G. Urvoy-Keller, E. W. Biersack, P. A. Felber, A. Al Hamra, and L. Garces-Erice, "Dissecting BitTorrent: Five months in a torrent's lifetime," in Proceedings of the 5th Passive and Active Measurement Workshop, Apr. 2004.
[3] L. E. Schrage, "A proof of the optimality of the shortest remaining service time discipline," Operations Research, vol. 16, pp. 670–690, 1968.


Autonomous Systems in the Internet: A Potential Subject for Studying Self-* Aspects
Thomas Erlebach
Computer Engineering and Networks Laboratory (TIK)
ETH Zürich, CH-8092 Zürich, Switzerland
E-mail: [email protected]

Abstract An autonomous system (AS) is a subnetwork of the Internet under separate administrative control. On the AS level, the Internet can be viewed as a graph with a node for every AS and an edge between two nodes if the corresponding ASs have at least one physical link between them. In recent years, several studies have investigated different aspects of this AS network, e.g. the economic relationships between ASs and their impact on routing. In this paper, we discuss some of the research results obtained in this area. We also attempt to view some aspects of the AS network with an eye towards self-* properties.

1 The Internet – A Network of Autonomous Systems

The Internet can be viewed as a network of autonomous systems. An autonomous system (AS) is a subnetwork under separate administrative control and can consist of tens to thousands of routers and hosts. Examples of ASs are networks of big companies or universities, national research networks, local or national Internet service providers (ISPs), or international backbone providers. Currently, there are more than 15,000 ASs on the Internet. Each AS is connected to one or several other ASs with direct, physical links. Therefore, the AS network can be represented by a graph, with a node for every AS and an edge between two nodes if and only if the corresponding ASs have a direct link. This graph is called the AS graph.

The routing of traffic on the Internet is hierarchical. Inside each AS, the routing can be done in any proprietary way, determined by the administrator of that AS. Usually, a link-state routing protocol such as OSPF (open shortest path first) is used: Each node in the AS has knowledge about the current state of all links and nodes inside the AS, and state updates are distributed in the AS when some state changes (e.g. a link goes down). Such a routing mechanism is not scalable. It is feasible inside an AS, but not on a network of the size of the Internet. Therefore, a different routing protocol is necessary for inter-autonomous system routing: Between ASs, BGP (border gateway protocol) routing is used. BGP is a standardized protocol [7] based on the exchange of network reachability information: Each AS announces routes to certain ranges of destination addresses to its neighbors. If an AS A receives an announcement from neighboring AS B saying that B has a route to a range R of destination addresses, then AS A may decide to route traffic for a destination x ∈ R to B. Each announcement also contains AS path information, i.e. the announcement says through which path in the AS graph the respective destination addresses can be reached. When a packet travels from its source to its destination in the Internet, the ASs that it traverses are determined by BGP routing, and the path inside each AS is determined by the intra-autonomous system routing protocol employed by that autonomous system. Typically, an AS does not announce all its routes to every neighbor; BGP policies restrict the set of neighbors to which a route is announced. Therefore, BGP policies have a significant impact on Internet routing. The policies are in turn affected by the economic relationships between ASs.


2 Discovering the AS Network

As the current Internet is the result of distributed growth without central control, it is not easy to obtain a "map" of the AS network, i.e. information about all direct connections that exist between ASs. Therefore, the main approach to get such data is to extract information from the Internet through measurements or by reading out routing tables of selected BGP routers. The latter approach has been followed by the University of Oregon Route Views Project (http://antc.uoregon.edu/route-views/): Their router maintains peering sessions with BGP routers in the ASs of selected service providers to gather information about the global routing system from the perspectives of several different backbones and locations around the Internet. Using their data, one can reconstruct AS graphs by including all edges that appear on AS paths found in the BGP routing tables. However, it is not clear how many edges are actually overlooked by this approach, and other studies have complemented the information from Route Views with additional data from Routing Registries or Looking Glass sites. The goal of obtaining a 100% accurate AS graph is still elusive, however, and one can only hope that the available data provides an acceptable approximation of the real AS graph. Various studies have investigated properties of the available AS graphs. An interesting characteristic that has been observed in at least some of the available data is that the degree sequence of the AS network seems to follow a power law [5].
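A minimal sketch of the reconstruction just described (our illustration; names are assumptions): every edge that appears on some AS path in a BGP table is added to the AS graph.

```python
def as_graph_from_paths(as_paths):
    """Build an undirected AS-graph edge set from BGP AS paths.
    as_paths: iterable of AS-number sequences, e.g. [(7018, 701, 3356), ...]"""
    edges = set()
    for path in as_paths:
        for a, b in zip(path, path[1:]):
            if a != b:                        # skip repeats caused by AS-path prepending
                edges.add((min(a, b), max(a, b)))
    return edges
```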

3 Economic Relationships between ASs

Mainly due to the economic contracts between ASs, there are different types of relationships between neighboring ASs that need to be considered. These relationships affect the routing in the Internet, because routing policies depending on the economic relationships determine which routes an AS announces to each of its neighbors. Therefore, it is important to take them into account when trying to analyze the Internet. As proposed by Gao [6], one can classify the relationships into the main types

of customer-provider relationships, peer-to-peer relationships, and sibling relationships. A customer-provider relationship between ASs A and B means that A pays B for access to the Internet. In this case, B will announce all its routes to A, but A will announce to B only its own routes and those of its customers. A peer-to-peer relationship between ASs A and B means that A and B have an agreement to exchange traffic to their mutual advantage; neither of them pays the other, and they announce to each other only their own routes and routes to their customers. Finally, if A and B are siblings, they will exchange all routes; this may happen, for example, if ASs A and B are owned by the same company. The reason for the policies described above is that an AS does not want to route traffic that goes from one provider or peer to a different provider or peer, because such traffic would exploit the infrastructure of the AS without creating any benefit for it. The effect of the routing policies is that a packet transmitted through the Internet can follow a certain AS path only if the path satisfies the following constraints [6]:
• If one considers only the customer-provider and peer-to-peer edges on the path, then the path must first traverse customer-provider edges in the direction from customer to provider, then zero or one peer-to-peer edges, and then customer-provider edges in the direction from provider to customer.
• Sibling edges can occur anywhere in the path.
A path satisfying these constraints is called valid; a sketch of the validity check is shown below. If one is interested in analyzing the routing on the Internet, it is therefore important to know the relationships between the ASs. As this information is not readily available, it has been proposed to infer it from BGP routing data (i.e. from AS paths) [6, 8]. Subramanian et al. have suggested ignoring sibling edges and formalized this inference problem as the Type-of-Relationship (ToR) problem [8]: Given an undirected graph and a set of paths in the graph, classify the edges into customer-provider and peer-to-peer relationships in such a way that as many of the given paths as possible become valid. They proposed a heuristic algorithm for the problem, and left its complexity as an open question.
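The validity constraints above translate directly into a linear-time check. The sketch below is our own illustration; the edge labels 'c2p', 'p2c', 'peer', and 'sib' are assumed names, with 'c2p' meaning the edge is traversed from customer to provider.

```python
def is_valid_path(path, rel):
    """Valley-free check: customer-to-provider edges, then at most one
    peer-to-peer edge, then provider-to-customer edges; siblings anywhere."""
    descending = False                 # True once the path has passed its peak
    for a, b in zip(path, path[1:]):
        r = rel[(a, b)]                # relationship seen in direction a -> b
        if r == 'sib':
            continue                   # sibling edges may occur anywhere
        if r == 'c2p':
            if descending:
                return False           # cannot climb again after the peak
        elif r == 'peer':
            if descending:
                return False           # at most one peer-to-peer edge, at the peak
            descending = True
        else:                          # 'p2c'
            descending = True
    return True
```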



Figure 1: Example AS graph.


Figure 2: Two disjoint valid s-t-paths. et al. [2] resolved the complexity of the ToR problem and proved it to be NP-hard in general, while a special case can be solved efficiently: If the question is whether there is a classification that makes all given paths valid, then an answer can be found in linear time. They also proposed new algorithms for the ToR problem, some of them with provable approximation guarantees. However, it is not clear whether maximizing the number of valid paths is the right optimization criterion. This model favors solutions that do not classify any edges as peerto-peer edges, a clear deviation from reality. Furthermore, the existence of sibling edges is ignored. Thus it remains an interesting task to devise a better model for the problem of classifying the economic relationships between autonomous systems. When one has annotated the AS graph with information about the economic relationships (e.g., using one of the algorithms from [6, 8, 4, 2]), one can use this information to study routing properties of the Internet further. For example, one might be interested in the maximum number of disjoint valid paths between two ASs, or in the minimum number of ASs that must fail so that two given ASs become disconnected (a node separator). Figure 1 shows a small AS graph in which customer-provider relationships are represented as directed edges from customer to provider. The maximum number of

node-disjoint valid paths between s and t is two in this example, and the minimum number of nodes different from s and t that must fail to disconnect s and t is three. One such set of two node-disjoint s-t paths is displayed in Figure 2, where also a node separator of size 3 is indicated by shaded nodes. The complexity and approximability of the problems of computing many disjoint valid paths and small separators were studied in [3]. The problem of computing a maximum number of node-disjoint valid s-t-paths was shown to be NP-hard and a 2approximation algorithm was presented for it. The problem of computing a minimum number of nodes that must fail in order to disconnect s and t with respect to valid paths was shown to be NP-hard as well, and again a 2-approximation algorithm was proposed. It is interesting to note that in the model with valid paths, the minimum size of a node separator can be as large as twice the maximum number of disjoint valid paths; in the standard model of undirected or directed paths in graphs, these two numbers are always the same. The corresponding problems for edge-disjoint paths and edge separators were studied in [3], too. Here, it was shown that a smallest edge separator can be found in polynomial time, while the problem of computing a maximum number of edge-disjoint valid s-t-paths is NP-hard and can again be approximated only within a factor of 2. Finally, further results for acyclic AS graphs (i.e. AS graphs where there is no cycle of directed customer-provider edges) were presented.

4 Self-* Aspects of the AS Network

Recently, there has been an increasing interest in the study of so-called self-* properties of networks: self-configuring, self-organizing, self-managing, self-repairing, etc. We think that it is interesting to look at the AS network with an eye towards self-* aspects. Which self-* properties does the AS network have, and which would be desirable? What do self-* properties actually mean for the AS network? How can we analyze the self-* properties of the AS network? First, one can view the AS network as a self-organizing network, because each AS can decide locally to which other ASs it connects and which


routing policies it enforces. Essentially the only global "rule" is the BGP protocol, which every AS must adhere to. Therefore, the AS network can be seen as a complex network that is created by individual agents using local interactions, not by centralized engineering decisions. The local decisions of individual ASs about how they connect to other ASs determine the global structure of the AS network. It would be interesting to gain a better understanding of how the local decisions affect the resulting global structure. Various models of network evolution have already been proposed (see, e.g., [1]) and represent first steps in this direction.

Furthermore, BGP routing may be considered to be self-repairing to a certain extent. Routes in the network change dynamically, without human intervention. When a link goes down, route withdrawal messages and new route announcements are communicated by BGP routers, and new routes that avoid the failed link are established in the BGP routing tables. There are still some problematic aspects of this mechanism (slow convergence time, instability), but essentially it represents a self-repairing mechanism for the maintenance of valid routing information. Improving the speed of convergence and the stability of BGP routing is a current research topic. Large ISPs frequently report that substantial efforts by human experts are required to monitor the network and to try to fix it in case of unexpected or erroneous behavior. For example, misconfigurations, attacks, or overload situations in one part of the network can have disastrous effects on the whole network, and human intelligence is required to track down the causes and to devise countermeasures in short time. In a self-managing and self-repairing network, one would expect the network to deal with such disruptions autonomously.

Another issue of self-organizing networks that arises in AS networks is the difficulty of "mapping" the network (gaining full knowledge about its structure) and monitoring its behavior. As there is no central authority that has complete information about the network, one can only use different kinds of measurements to discover such information. This seems to be a general property of self-organizing networks whose structure is determined by local decisions of interacting agents.

In the discussion above, we have used terms such as "self-organizing," "self-repairing," and "self-managing" in an informal way, without giving precise definitions. It would be an interesting direction for future research to formalize these notions, to identify the desirable self-* properties of the AS network and other networks, and to derive algorithms (protocols) and complexity results for the problem of achieving them.

Acknowledgement The author would like to thank Danica Vukadinović for helpful discussions.

References

[1] A. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286:509–512, October 1999.
[2] G. Di Battista, M. Patrignani, and M. Pizzonia. Computing the types of the relationships between autonomous systems. In Proceedings of INFOCOM'03, 2003.
[3] T. Erlebach, A. Hall, A. Panconesi, and D. Vukadinović. Cuts and disjoint paths in the valley-free path model. TIK-Report 180, Computer Engineering and Networks Laboratory (TIK), ETH Zürich, 2003.
[4] T. Erlebach, A. Hall, and T. Schank. Classifying customer-provider relationships in the Internet. In Proceedings of the IASTED International Conference on Communications and Computer Networks, pages 538–545, 2002.
[5] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the Internet topology. In SIGCOMM'99, 1999.
[6] L. Gao. On inferring Autonomous System relationships in the Internet. IEEE/ACM Transactions on Networking, 9(6):733–745, 2001.
[7] Y. Rekhter and T. Li. A Border Gateway Protocol 4 (BGP-4). RFC 1771, IETF, March 1995.
[8] L. Subramanian, S. Agarwal, J. Rexford, and R. Katz. Characterizing the Internet hierarchy from multiple vantage points. In Proceedings of INFOCOM'02, 2002.


Spatial Computing: a Recipe for Self-organization in Distributed Computing Scenarios
Franco Zambonelli, Marco Mamei
DISMI - Università di Modena e Reggio Emilia – Reggio Emilia – ITALY
{ mamei.marco, franco.zambonelli }@unimo.it

Abstract Here we discuss the role of "space" in modern distributed computing, and analyze how spatial abstractions promise to be necessary ingredients of novel, self-organization-based approaches to engineering distributed systems.

1. Introduction

Distributed computing is getting more and more decentralized, dynamic, and intertwined with the physical world. These characteristics appear very clearly in a variety of emerging scenarios, such as peer-to-peer (P2P) networks, mobile ad-hoc networks and sensor networks. As a consequence, the complexity of these scenarios is such that they can no longer be managed with traditional approaches to distributed systems engineering. There is a need for approaches that support distributed systems in autonomously self-configuring and self-adapting their activities in their operational environment. A variety of diverse approaches exploiting some form of self-organization are emerging. The question of whether there is a single unifying approach, applicable with little or no adaptation to scenarios as diverse as worldwide P2P networks and local networks of embedded sensors, is still open. In this position paper, without claiming to answer the above question, we try to identify the important role that spatial abstractions will likely play in that process. The key point we consider is that the adoption of self-organization approaches in distributed systems engineering will likely be rooted in spatial computing models, in which the activities of application components are abstracted as taking place in some sort of abstract space.

2. Space in Modern Distributed Computing

Early research in parallel and distributed computing promoted a transparent distributed computing approach, in which the presence of an underlying distributed network was totally hidden from application components [ChiC91]. In the 90s, the emergence of Internet computing outlined the need for network-aware computing models, in which application components were made aware of the distributed and decentralized nature of their operational environment [Wal97]. Now, it appears that the concept of network-aware computing needs to evolve into a concept of space-aware computing. In the rest of this section, we briefly discuss the key characteristics of a variety of modern computing scenarios, distinguishing between micro, medium and global-scale scenarios. In particular, we emphasize how the concept of space plays an important role in them, a role strictly related to self-organization.

2.1 The Micro Scale

Research on smart dust [Pis00] and sensor networks [Est02] focuses on small computer-based components that can be deployed in an environment and can coordinate their actions (i.e., local sensing and effecting of specific environmental conditions) with the goal of enriching our physical world with specific "smart" functionalities. Potential applications range from simple monitoring activities to smart materials and self-assembly of computational beings. Whatever applications one envisions, the key characteristics that distinguish micro-scale computer applications from traditional distributed computing systems are, in our opinion, the following three:
• Large scale: the number of components involved in a distributed application will be dramatically high and hardly controllable. It is neither possible to enforce a strict configuration of components nor to control their behavior at execution time at a fine-grained level.
• Network dynamics: component activities take place in a network whose structure derives from an almost random deployment process, and that is likely to change over time with unpredictable dynamics (due to failure and mobility of components).
• Space dependencies: most important in our discussion, the activities of these components are strongly related to their position in the physical space.



This is because each component is only capable of sensing and affecting the physical environment in a local neighborhood. The first two characteristics call for self-organizing and self-adapting models. The last characteristic outlines that, at this scale, spatial concepts become an intrinsic part of the application conceptual space. In any case, the two aspects cannot be considered in isolation, and the implementation of spatial activities (e.g., the definition of a specific spatial region of interest to be monitored) must be rooted in self-organizing mechanisms, so as to be adaptive and robust.

2.2 The Medium Scale

As the ubiquitous and pervasive computing scenario becomes a reality, our world will soon be densely populated by local ad-hoc networks (e.g., the ensemble of Bluetooth-enabled devices we could carry or find in our cars) and network-based furniture (e.g., Web-enabled fridges and ovens able to interact with each other and effectively support our cooking activities), just to mention a few examples. The above types of networks, although formed by different types of computer-based devices, share with those at the micro scale the same issues as far as the development and management of distributed applications is concerned. In fact:
• Large size: even if not all these systems will actually be of extremely large size, and some could possibly be centrally controlled, it is simply not commercially and economically viable to deploy applications for everyday popular use that require explicit configuration and explicit tuning of operational parameters.
• Network dynamics: most of these networks will be wireless, with structures varying dynamically on the basis of the relative positions of devices (the persons in an ad-hoc network can move around in an environment, and the position of home furniture can change as needed), and characterized by the dynamic arrival and dismissal of nodes.
• Space dependencies: since our activities as humans are strongly space-dependent, so will be the activities of these computer-enriched objects. For example, a cell phone could automatically connect to the hi-fi stereo of our car once in there, and turn silent once in a meeting room.
Thus, also in this case, we can affirm that the capability of self-organization will be primarily driven by the system's need to perceive its position in a physical space and to act on this basis.

2.3 The Global Scale

In Internet computing, there is a need to access data and services according to a variety of patterns, independently of the availability/location of specific servers. Again, this scenario can be characterized in terms of:
• Large size: the Internet and the Web, besides being of a very large size, are also intrinsically decentralized, and it is simply impossible to control their structure and the data and services available.
• Network dynamics: not only do nodes come and go, and new nodes can be added at any time; data and services also typically come and go at any time in an unpredictable way, leading to a dynamic scenario.
• Space dependencies: the concept of space is at the basis of current proposals in world-wide P2P computing [Rat01]. In these approaches the P2P network is actually structured as a metric space, in which objects can be mapped into a specific position of the network, and in which processes and messages can effectively navigate to reach a specific position of the network. In other words, the position of data and services in a network becomes "space dependent", and the activities of application components are strongly related to positioning themselves and navigating in that space.
It is worth noting that new P2P systems like [Rat01], starting from the key motivation of promoting adaptive self-organizing data access in large dynamic networks, promote a radical conceptual shift from network-aware computing (centered around navigation in a network, i.e., explicitly accessing network nodes) to space-aware computing (centered around navigation in a virtual metric space). Differently from the micro-scale and medium-scale scenarios, for which the notion of space enters into play because of the physical characterization of the activities in such scenarios, the notion of space enters the global scenario for the very sake of promoting self-organization. In our opinion, this is strong evidence that spatial concepts are not incidentally involved in self-organization and are instead basic ingredients of a self-organization recipe for modern distributed computing.

3. Spatial Mechanisms of Self-Organization

In this section, we show that the basic self-organization mechanisms that have been exploited so far in distributed computing (from the micro to the global scale) can be easily interpreted and mapped into very similar spatial concepts. We classify these mechanisms according to a sort of "space-oriented" stack of levels (see Table 1). The analysis at the different levels helps understanding that a variety of apparently very different mechanisms are indeed grounded on similar principles. Thus, it is likely that a unifying model for self-organizing computing, possibly leading to a single programming model, can be identified.

The lowest "physical level" enables a component to start interacting, in a dynamic and spontaneous way, with other components in the system. This is a very basic expression of self-organization which is a prerequisite to support more complex forms of self-organization at the higher levels. The basic communication mechanism is broadcast (i.e., communicate with whoever is available). Radio broadcast is used in micro-scale and medium-scale systems. Different forms of TCP/IP broadcast (or dynamic lookup) are also used at the global scale.

The "structure level" is the level at which some sort of spatial structure emerges from the components' activities. Clearly, the fact that a system is able to make some sort of spatial structure dynamically emerge is a very important expression of self-organization. Moreover, maintaining the spatial structure despite network dynamics enables an application to be shielded from those dynamics and to preserve its functioning. Also at this level, the mechanisms to structure the space are very similar across different scenarios. Micro-scale systems typically structure the space according to component positions in the physical space, by exploiting mechanisms of geographical self-localization. Medium-scale systems, in addition to mechanisms of geographical localization, often exploit logical spatial structures reflecting some sort of abstract spatial relationship of the physical world (e.g., rooms in a building). Global-scale systems exploit overlay networks built over a physical communication network. Although early approaches (e.g., Gnutella) give no metric structure to such an overlay space, more recent approaches [Rat01] tend to build metric overlay spaces.

The "navigation level" concerns the basic mechanisms that components exploit to orient their activities in the spatial structure (i.e., to actually "use" the spatial structure). If the spatial structure has no well-defined metric, the only navigational approaches are flooding and gossiping. However, if some sort of metric structure is defined at the structure level (as, e.g., in the geographical spatial structures of sensor networks or in metric overlay networks), navigation approaches typically amount to following the metrics defined at the structure level. For instance, navigation can imply the capability of components to reach specific points (or to direct messages and data) in the space based on simple geometric considerations, as in geographical routing. Starting from this basic navigational capability, it is also possible to enrich the structure of the space by propagating additional information to describe specific features of the space itself. Typical mechanisms exploited to create this information are computational fields and pheromones. Despite the different inspiration of the two approaches, we emphasize that they can be modeled in a uniform way, i.e., in terms of time-varying properties defined over a space [MamZ03]. Using this additional information, a component can navigate the space by following the gradient of a specific computational field or by following a specific pheromone scent.

At the "application level", navigational mechanisms are exploited by application components to interact and organize their activities. Applications can be conveniently built on the following self-organizing feedback loop:
• Components navigate in the network (i.e., act on the basis of the locally perceived structure and properties of the space).
• Components, at the same time, modify the existing structure due to the evolution of their activities.
Depending on the types of structures propagated in the space, and on the way components react to them, different phenomena of self-organization can be achieved (e.g., self-assembly in modular robots, phenomena mimicking the behavior of social insects, phenomena mimicking the behavior of granular media, as well as a variety of social phenomena). However, despite these successful examples, the ultimate goal of a uniform modeling approach capable of effectively capturing the basic properties of self-organizing computing, and possibly leading to practical and useful general-purpose modeling and programming tools, is still far off.
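To make the computational-field idea concrete, here is a minimal sketch (our own illustration, not from the paper): a hop-count field is propagated from a source over an arbitrary network graph, and components navigate by descending its gradient.

```python
from collections import deque

def propagate_field(graph, source):
    """Breadth-first propagation of a hop-count field from a source node.
    graph: dict mapping each node to an iterable of its neighbors."""
    field = {source: 0}
    frontier = deque([source])
    while frontier:
        n = frontier.popleft()
        for m in graph[n]:
            if m not in field:
                field[m] = field[n] + 1   # field value grows with distance from the source
                frontier.append(m)
    return field

def follow_gradient(graph, field, start):
    """Greedy descent: repeatedly hop to the lowest-valued neighbor
    until the source (field value 0) is reached."""
    node = start
    while field[node] > 0:
        node = min(graph[node], key=field.get)
    return node
```

Because each node's BFS field value exceeds that of at least one neighbor by exactly one, greedy descent always terminates at the source, which is the essence of gradient-based navigation.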


Table 1. Mechanisms of self-organization in modern distributed computing scenarios.

Scales: MICRO (nano networks, sensor networks, smart dust, computational self-assembly, modular robots); MEDIUM (home networks, MANETs, pervasive environments, mobile robotics); GLOBAL (Internet, Web, P2P networks, multiagent systems).

"Application" level (exploiting the spatial organization achieved by the lower levels to achieve, in a self-organizing and adaptive way, specific application goals):
• Micro: Spatial Queries; Spatial Self-Organization and Differentiation; Spatial Displacement of Activities; Motion Coordination; Pattern formation. DATA: environmental data.
• Medium: Discovery of Services; Coordination and Distribution of Task and Spatial Displacement Activities; Motion coordination; Pattern formation. DATA: local resources and environmental data.
• Global: P2P Queries as Spatial Queries in the Overlay; Motion Coordination on the Overlay; Pattern formation (e.g., for monitoring). DATA: files, services, knowledge.

"Navigation" level (dealing with the mechanisms exploited by the entities living in the space to direct activities and movements in that space):
• Micro: Flooding; Gossiping (random navigation); Geographical Routing (selecting and reaching specific physical coordinates); Directed Diffusion (navigation following sorts of computational fields); Stigmergy (navigation following pheromone gradients).
• Medium: Flooding; Computational fields; Multi-hop routing based on Spanning Trees; Pattern-matching based and Localized Tuple-based systems.
• Global: Gossiping (random navigation); Metric-based navigation (moving towards specific coordinates in the abstract space); Stigmergy (navigation following pheromone gradients distributed in the overlay network).

"Structure" level (dealing with the mechanisms and policies related to shaping a metric space and letting components find their position in that space):
• Micro: Self-localization (beacon-based triangulation).
• Medium: Self-localization (Wi-Fi or RFID triangulation); Definition and Maintenance of a Spanning Tree (as a sort of navigable overlay).
• Global: Establishment and Maintenance of an Overlay Network (for P2P systems); Referral Networks and e-Institutions (for multiagent systems).

"Physical" level (dealing with the low-level communication mechanisms and services necessary to get into existence in a network):
• Micro: Radio Broadcast; Radar-like localization.
• Medium: Radio Broadcast; RF-ID identification.
• Global: TCP broadcast; IP identification; Directed TCP/UDP messages; Location-dependent Directory services.




4. Research Agenda

Spatial computing can be an effective approach towards the identification of general and widely applicable approaches to self-organization in distributed systems engineering. However, to fulfil its promise, several questions need to be answered: (i) Is the layered model depicted in Table 1 meaningful? It is a first attempt, and we are aware that not everything fits perfectly in it. Nevertheless, our opinion is that, once all the layers are properly defined, they will support a better engineering of such systems, promoting separation of concerns and clearly identifying the duties of the different levels. (ii) Besides the fact that spatial concepts seem promising, is it actually possible to identify a unifying model for a large variety of self-organization phenomena on this ground? (iii) If such a unifying model can be found, can it be translated into a reasonably limited set of programming abstractions and lead to the identification of a methodology for developing and studying self-organizing distributed systems? (iv) A variety of self-organization phenomena, not analysed in this paper, deal with concepts that can hardly be intuitively mapped into spatial concepts. Would exploring some sort of spatial mapping still be useful and practical? Would it carry advantages? (v) Most of all, is such research fueled by enough concrete applications? In this regard, we have developed a system called TOTA [MamZ04]. It is a novel middleware infrastructure, with an associated programming model, focused on building overlay distributed data structures over dynamic networks. Such distributed data structures can be employed to realize a suitable overlay space able to support specific applications. The approach has been successfully used to achieve self-organization and self-adaptation in a variety of distributed applications.

well-defined metric, the only navigational approaches are flooding and gossiping. However, if some sort of metric structure is defined at the structure level (as, e.g., in the geographical spatial structures of sensor networks or in metric overlay networks) navigation approaches typically relate in following the metrics defined at the structure level. For instance, navigation can imply the capability of components of reaching specific points (or of directing messages and data) in the space based on simple geometric considerations as in geographical routing. Starting from the basic navigational capability, is also possible to enrich the structure of the space by propagating additional information to describe specific features of the space itself. Typical mechanisms exploited to create this information are computational fields and pheromones. Despite the different inspiration of the two approaches, we emphasize that they can be modeled in a uniform way, i.e., in terms of time-varying properties defined over a space [MamZ03]. Using this additional information a component can navigate the space by following the gradient of a specific computational field or by following a specific pheromone scent. At the “application level”, navigational mechanisms are exploited by application components to interact and organize their activities. Applications can be conveniently built on the following self-organizing feedback loop: x Components navigate in the network (i.e., acting on the basis of the locally perceived structure and properties of the space) x Components, at the same time, modify existing structure due to the evolution of their activities. Depending on the types of structures propagated in the space, and on the way components react to them, different phenomena of self-organization can be achieved (e.g. self-assembly in modular robots, phenomena mimicking the behavior of social insects, phenomena mimicking the behavior of granular media, as well as a variety of social phenomena). However, despite these successful examples, the ultimate goal of a uniform modeling approach capable of effectively capturing the basic properties of self-organizing computing, and possibly leading to practical and useful general-purpose modeling and programming tools, is far from close.
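To make the gradient-based navigation just described concrete, here is a minimal sketch (ours, not from the paper): a hop-count computational field is propagated from a source node, and a component navigates by greedily descending the field. The example graph, node names, and BFS-based propagation are illustrative assumptions.

```python
from collections import deque

def build_field(neighbors, source):
    """Breadth-first propagation of a hop-count computational field:
    each node stores its distance (in hops) from the field's source."""
    field = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in field:
                field[nxt] = field[node] + 1
                queue.append(nxt)
    return field

def navigate(neighbors, field, start):
    """Greedy descent: repeatedly move to the neighbor with the
    lowest field value until the source (field value 0) is reached."""
    path, node = [start], start
    while field[node] > 0:
        node = min(neighbors[node], key=field.get)
        path.append(node)
    return path

# Illustrative network: adjacency lists for six nodes.
neighbors = {
    "A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D", "E"],
    "D": ["B", "C", "F"], "E": ["C", "F"], "F": ["D", "E"],
}
field = build_field(neighbors, source="F")
print(navigate(neighbors, field, start="A"))  # ['A', 'B', 'D', 'F']
```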

4. Research Agenda
Spatial computing can be an effective approach towards the identification of general and widely applicable approaches to self-organization in distributed systems engineering. However, to fulfil its promises, several questions need to be answered: (i) Is the layered model depicted in Table 1 meaningful? It is a first attempt, and we are aware that not everything fits perfectly into it. Nevertheless, our opinion is that, once all the layers are properly defined, they will support a better engineering of such systems, promoting separation of concerns and clearly identifying the duties of the different levels. (ii) Besides the fact that spatial concepts seem promising, is it actually possible to identify a unifying model for a large variety of self-organization phenomena on this ground? (iii) If such a unifying model can be found, can it be translated into a reasonably limited set of programming abstractions and lead to the identification of a methodology for developing and studying self-organizing distributed systems? (iv) A variety of self-organization phenomena, not analysed in this paper, deal with concepts that can hardly be mapped intuitively onto spatial concepts. Would exploring some sort of spatial mapping still be useful and practical? Would it carry advantages? (v) Most of all, is such research fueled by enough concrete applications? In this regard, we have developed a system called TOTA [MamZ04]. It is a novel middleware infrastructure, with an associated programming model, focused on building overlay distributed data structures over dynamic networks. Such distributed data structures can be employed to realize a suitable overlay space able to support specific applications. The approach has been successfully applied to achieve self-organization and self-adaptation in a variety of distributed applications.

References
[ChiC91] R. Chin, S. Chanson, "Distributed Object-Based Programming Systems", ACM Computing Surveys, Vol. 23, No. 1, March 1991.
[Est02] D. Estrin, D. Culler, K. Pister, G. Sukjatme, "Connecting the Physical World with Pervasive Networks", IEEE Pervasive Computing, 1(1):59-69, 2002.
[MamZ03] M. Mamei, F. Zambonelli, "Co-Fields: a Unifying Approach to Swarm Intelligence", LNCS, No. 2677, April 2003.
[MamZ04] M. Mamei, F. Zambonelli, "Programming Pervasive and Mobile Computing Applications with the TOTA Middleware", 2nd IEEE Conference on Pervasive Computing and Communications, Orlando (FL), IEEE CS Press, March 2004.
[Pis00] K. Pister, "On the Limits and Applicability of MEMS Technology", Defense Science Study Group Report, Institute for Defense Analysis, Alexandria (VA), 2000.
[Rat01] S. Ratsanamy, P. Francis, M. Handley, R. Karp, "A Scalable Content-Addressable Network", ACM SIGCOMM Conference 2001, Aug. 2001.
[Wal97] J. Waldo et al., "A Note on Distributed Computing", Mobile Object Systems, LNCS 1222, Springer Verlag (D), pp. 49-64, February 1997.

Biologically Inspired Communication Network Control

Masayuki Murata
Graduate School of Information Science and Technology, Osaka University
[email protected]

1. Introduction
The research group of Osaka University in the field of information technologies and bioinformatics engineering started a project entitled "New Information Technologies for Building a Networked Symbiosis Environment" in 2002 [1]. The aim of our project is to develop new information systems based on biologically inspired approaches, using the knowledge gained through analyzing the behavior of various living organisms; the author's group concentrates on biologically inspired communication network control. However, we have noticed that a direct application of those studies to our work is still immature. In the meantime, we have decided to use well-established, biologically inspired mathematical models and apply them to control methods for communication networks, especially for newly emerging networks like P2P (peer-to-peer) networks, (mobile) wireless ad hoc networks, and sensor networks.

Historically, in computer networks, including the Internet, it has been assumed that static nodes (i.e., routers) perform packet forwarding based on the routing protocol. Later, new infrastructures like IntServ and DiffServ were developed, though they have still not spread into use as expected. The adequate operation of those networks requires careful network planning in order to satisfy their objectives: QoS (Quality of Service) guarantees in IntServ and QoS differentiation in DiffServ, respectively. On the other hand, the above-mentioned newly emerging networking technologies have a quite different structure from traditional networks; the nodes themselves may move, and on-demand search is necessary for finding shared information in P2P networks and peer locations in ad hoc networks. For those networks, we consider the following three characteristics of network control to be mandatory: 1) Expandability (or scalability): We need to cope with the growing number of nodes and end users, and a wide variety of devices attached to the network. More importantly, the number of nodes and terminals can never be predicted in advance. This means that the conventional network planning method becomes meaningless. 2) Mobility: In addition to the users' mobility, we should also consider the mobility of network nodes in ad hoc and sensor networks. This implies that stable packet forwarding by nodes cannot be expected. 3) Diversity: We need to support a wide variety of network devices generating traffic of a quite different nature from existing network applications, implying that a single, universal network infrastructure like IntServ or DiffServ has no means to meet the different demands of newly emerging network applications.

The only solution meeting the above characteristics seems to be that end hosts must be equipped with adaptability to the current network status, for finding peers and/or for controlling congestion. For this reason, the biologically inspired approach is promising, since it is known to have the excellent feature of adaptability, though it is rather slow in adapting to environmental changes. Of course, biologically inspired approaches to information technologies are not new, but most of them have concentrated on the optimization problems of network controls. In contrast, we focus on the adaptability, robustness, and self-organization properties of the biological system. In particular, the main purpose of our project is to learn from the symbiotic nature of the biological system [2]. In this article, we introduce a new network control method inspired by biology, in order to explain why we need self-organized control in sensor networks. We are now developing an experimental system using a developer's toolkit for sensor networks. Based on our experiences, we also report the lessons learned in the actual implementation of the biologically inspired approach. Lastly, we present the future direction of our project.

2. Self-Organized Sensor Networks: An Example

2.1. Requirements on Sensor Network Design
With the development of low-cost microsensor equipment that has the capability of wireless communication, sensor network technology has attracted the attention of many researchers and developers [3]. By deploying a large number of multifunctional sensors in a monitored region and composing a sensor network of them, one can remotely obtain information on the behavior, condition, and position of elements in the region, using wireless channels. Sensor nodes are distributed in a region in an uncontrolled and unorganized way to decrease the installation cost and eliminate the need for careful planning. Thus, the method used to gather sensed information should be scalable in the number of sensor nodes; robust to the failure and disruption of sensor nodes; adaptable to the addition, removal, and movement of sensor nodes; inexpensive in power consumption; and fully distributed and self-organizing, without a centralized control mechanism. Several research efforts have developed schemes for data fusion in sensor networks (see [4]). However, they require so-called global information, such as the number of sensor nodes in the whole region, the optimal number of clusters, the locations of all sensor nodes, and the residual energy of all sensor nodes. Consequently, they need an additional (and possibly expensive and unscalable) communication protocol to collect and share the global information. Thus, it is difficult for them to adapt to the dynamic addition, removal, and movement of sensor nodes.

2.2. Our Approach
We have proposed an efficient scheme for data fusion in sensor networks where a large number of sensor nodes are deployed [5]; in such networks, nodes are introduced at random, occasionally die, and sometimes change their locations. We consider an application in which sensed information is periodically propagated from the edge of a sensor network to the sink node. We do not assume that all sensor nodes are visible to each other, as other research work does. An administrator does not need to configure sensor nodes before deployment.

In periodic data fusion, power consumption can be effectively reduced by reducing the amount of data to send, avoiding unnecessary data emission, and turning off unused components of a sensor node between data emissions. As an example, such data fusion can be attained by the following strategy on a sensor network where sensor nodes organize, in a distributed manner, a tree whose root is the sink node. First, the leaves (i.e., the sensor nodes most distant from the sink node) simultaneously emit their sensed information to their parent nodes at a regular interval. The parent nodes, which are closer to the sink node, receive information from their children. They aggregate the received information with locally sensed information to reduce the amount of data to send. Then, they emit it at a timing that is synchronized with the other sensor nodes at the same level in the tree. Likewise, sensed information is propagated and aggregated toward the sink node. As a result, we observe a concentric circular wave of information propagation centered at the sink node.

To accomplish this synchronized data fusion without any centralized control, however, each sensor node should independently determine the cycle and the timing at which it emits a message advertising its sensed information, based only on locally available information. Ideal synchronization could be attained by configuring sensor nodes prior to deployment, provided that the clocks of the sensor nodes are completely synchronized, that the sensor nodes are placed at appropriate locations, and that they maintain their clocks throughout their lifetime. However, we cannot realistically expect such ideal conditions. This is the reason we introduce the biologically inspired self-organizing approach to the sensor network.

Self-organized and fully distributed synchronization can be found in nature, as is widely known in the literature. For example, fireflies flash independently, at their own intervals, when they are apart from each other. However, when a firefly meets a group, it adjusts an internal timer to flash at the same rate as its neighbors, stimulated by their flashes. Consequently, fireflies in a group flash in synchrony. Mutual synchronization in a biological system is modeled as pulse-coupled oscillators [6]. In [7], the authors proposed a management policy distribution protocol based on firefly synchronization theory. The protocol is based on gossip protocols to achieve weak consistency of information among nodes. The rate of updates is synchronized in a network through pulse-coupled interactions. They verified that their protocol is scalable in the number of nodes in terms of the average update latency. They attempted to distribute a management policy, whereas our application is designed to collect sensed information at a sink node. By adopting the pulse-coupled oscillator model, we can obtain a fully distributed, self-organizing, robust, adaptable, scalable, and energy-efficient scheme for data fusion in wireless sensor networks. By observing the signals that neighboring sensor nodes emit, each sensor node independently determines the cycle and the timing at which it emits a message, so as to achieve synchronization with those neighboring sensors and thus draw a concentric circle.
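As a rough illustration of the pulse-coupled oscillator model cited above ([6]), the following is a minimal simulation sketch; it is our own toy example, not the paper's scheme. Each oscillator's phase advances at a fixed rate, and hearing a neighbor's "flash" nudges its state upward; the coupling strength, firing threshold, and all-to-all coupling are simplifying assumptions.

```python
import random

EPSILON = 0.05      # coupling strength (assumed value)
THRESHOLD = 1.0     # an oscillator fires when its state reaches this
STEP = 0.01         # phase advance per tick

def simulate(n=10, ticks=5000):
    # Each oscillator starts at a random phase in [0, 1).
    state = [random.random() for _ in range(n)]
    fire_times = []
    for t in range(ticks):
        fired = []
        for i in range(n):
            state[i] += STEP                  # free-running phase advance
            if state[i] >= THRESHOLD:
                fired.append(i)
        for i in fired:
            state[i] = 0.0                    # reset after firing
            fire_times.append((t, i))
            for j in range(n):                # neighbors are nudged upward
                if j != i and state[j] > 0.0:
                    state[j] = min(THRESHOLD, state[j] + EPSILON)
    return fire_times

# After enough ticks the firing instants cluster together, mirroring
# how fireflies in a group come to flash in synchrony.
print(simulate()[-10:])
```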

2.3. Our Experiences
By conducting simulation experiments under several modeling assumptions, we confirmed that our proposed method works well, which means that the mathematical model is suitable for achieving self-organized, scalable control for data fusion. Furthermore, we implemented the method using MOTE, a developer's toolkit for sensor networks [8], to verify it in a real environment. However, we found that our method did not work properly without careful tuning of the protocol in the implementation. The most difficult problem was due to the unreliable nature of the wireless links. In the case of fireflies, the light can reach other fireflies without attenuation. On a wireless link, by contrast, messages sometimes reach the neighboring nodes and sometimes do not. The determination of a sensor node's level relies on whether the node can receive the control message from nodes at the upper level, originated by the sink node. However, a sensor node sometimes receives unexpected messages from another node that was originally recognized to be located far away. Our solution is to filter out those messages, treating them as exceptional. With this change, we obtained the results that we expected. The remaining problem is that a simple filtering method still does not work correctly under unreliable radio conditions, e.g., in a room where a reflected wave carries the message over a long distance. It is clear that the biologically inspired approach alone cannot solve our problem, and that other, robust protocols should be incorporated into the entire data fusion network. In our current case, a robust tree construction method is at least necessary.
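A minimal sketch of the kind of filtering described above (our own illustration, not the authors' code): a node accepts a level-advertisement message only if it comes from a plausibly adjacent level. The level fields and the adjacency rule are assumptions.

```python
def accept_level_message(own_level, msg_level):
    """Accept a control message only if its sender sits at a level
    adjacent to ours; messages from 'far' levels (e.g., carried by a
    reflected wave) are treated as exceptional and dropped."""
    if own_level is None:          # level not yet assigned: accept anything
        return True
    return abs(own_level - msg_level) <= 1

def update_level(own_level, msg_level):
    """A node's level is one more than the smallest upper level heard."""
    if not accept_level_message(own_level, msg_level):
        return own_level
    candidate = msg_level + 1
    return candidate if own_level is None else min(own_level, candidate)

level = None
for heard in [2, 3, 9, 2]:        # 9 is an 'unexpected' distant message
    level = update_level(level, heard)
print(level)                       # 3: the distant advertisement was ignored
```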

3. Biologically Inspired Symbiotic Networks
Essentially, network users are competitive in the sense that they want to dominate network resources in order to maximize their individual QoS during communication. It therefore becomes important to achieve a fair share of the network resources among active users. Mathematical ecology tells us that such a system is stable if the effect of self-inhibitive action is larger than the effect of inhibitive action by others. This implies that if we successfully incorporate such a mechanism into the communication network field, we can establish a stably operated and fairly shared network, which we call the symbiotic network. As noted in [9], the ad hoc network itself may be viewed as a symbiotic network. However, our aim is to establish this in a more advanced way.

TCP is a good example for considering network symbiosis. Of course, TCP is implemented in a distributed manner, and it is self-adaptive to network congestion, because each TCP sender determines its window size according to the network congestion status, which it infers from returned ACKs. This is an essential property originating from the design principle of the Internet known as the "end-to-end principle" [10]. However, as is widely known, TCP (especially TCP Reno, currently the most widely used version) is too aggressive, in the sense that it increases its window size (equivalently, its packet transmission rate) continuously until it experiences packet loss due to buffer overflow; as a result, other existing TCP connections are also damaged and throughput performance becomes quite low, especially in high-speed networks [11]. We are now developing a more elegant solution to this problem. The key idea is that the TCP connection itself estimates the available bandwidth and increases the window size gradually, up to the available bandwidth [12]. Packet losses can then be avoided to a large extent, which is very important, especially in high-speed networks where packet losses decrease the window size and degrade the throughput. We model the TCP window size changes by a logistic curve, through which we can draw on the discussion of system stability in the biology field.
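A logistic curve for window growth can be written as dw/dt = a·w·(1 − w/K), where w is the window size, a the growth rate, and K the estimated available bandwidth (in packets). The short sketch below is our illustration of that behavior, with assumed parameter values rather than the authors': growth slows smoothly as the window approaches the available bandwidth, instead of overshooting into loss.

```python
def logistic_window(w0=1.0, growth=0.5, capacity=100.0, dt=0.1, steps=200):
    """Euler integration of dw/dt = growth * w * (1 - w/capacity):
    window growth slows as the estimated available bandwidth is reached."""
    w = w0
    trace = [w]
    for _ in range(steps):
        w += dt * growth * w * (1.0 - w / capacity)
        trace.append(w)
    return trace

trace = logistic_window()
# Rapid early growth, then smooth saturation near capacity (100).
print(round(trace[50], 1), round(trace[-1], 1))
```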

Another example can be found in wireless ad hoc networks, including the above-mentioned sensor networks. The medium access control (ALOHA or CSMA/CA) used in ad hoc networks is competitive, and rate control in the data link layer is necessary for keeping the throughput at its maximum, because of the bifurcation property of such systems [13]. We are now developing various network control methods that have a symbiotic nature.

The other direction of our research is the study of adaptive complex networks. As a result of symbiotic control in each communication layer, we consider that the network would become an adaptive complex system. The rationale lies in the recent study [2], where the authors point out that the power-law property observed in biological systems is a result of the adaptive nature of those systems. It is now widely known that the Internet exhibits the power-law property in various aspects, including the AS-level topology and packet-level traffic behavior. The author feels that similar arguments can be applied to network systems, and that an adaptive and robust network can be built by virtue of the insights obtained from biology, but that this requires further research.

References
[1] "New Information Technologies for Building a Networked Symbiosis Environment," available at http://www-nishio.ist.osaka-u.ac.jp/COE/english/index.html
[2] A. Kashiwagi et al., "Experimental molecular evolution showing flexibility of fitness leading to coexistence and diversification in biological system," in Proc. of Bio-ADIT, 2004.
[3] I. Akyildiz et al., "Wireless sensor networks: A survey," Computer Networks Journal, Vol. 38, 2002.
[4] K. Dasgupta, L. Kalpakis, P. Namjoshi, "An efficient clustering-based heuristic for data gathering and aggregation in sensor networks," in Proc. of the IEEE WCNC, 2003.
[5] N. Wakamiya, M. Murata, "Scalable and robust scheme for data fusion in sensor networks," in Proc. of Bio-ADIT, 2004.
[6] X. Guardiola et al., "Synchronization, diversity, and topology of networks of integrate and fire oscillators," The American Physical Society, Physical Review E 62, pp. 5565-5569, 2002.
[7] I. Wokoma, I. Liabotis, O. Prnjat, L. Sacks, I. Marshall, "A weakly coupled adaptive gossip protocol for application level active networks," in Proc. of the IEEE 3rd International Workshop on Policies for Distributed Systems and Networks, 2002.
[8] "NEST Project," available at http://webs.cs.berkeley.edu/.
[9] R. Gedge, "Symbiotic networks," BT Technology Journal, Vol. 21, pp. 67-73, 2003.
[10] J.H. Saltzer, D.P. Reed, D.D. Clark, "End-to-end arguments in system design," ACM Trans. on Computer Systems, 1984.
[11] S. Floyd, "HighSpeed TCP for large congestion windows," RFC 3649, 2003.
[12] C.L.T. Man, G. Hasegawa, M. Murata, "A new available bandwidth measurement technique for service overlay networks," in Proc. of IFIP/IEEE MMNS, pp. 436-448, 2003.
[13] A. Kumar, A. Karik, "Performance analysis of wireless ad-hoc networks," Handbook of Ad Hoc Wireless Networks, CRC Press, 2002.

Statistical Monitoring + Predictable Recovery = Self-*

Armando Fox and Emre Kıcıman, Stanford University
David Patterson, Randy Katz, Michael Jordan, Ion Stoica, University of California, Berkeley

April 29, 2004

Abstract

It is by now motherhood-and-apple-pie that complex distributed Internet services form the basis not only of e-commerce but increasingly of mission-critical network-based applications. What is new is that the workload and internal architecture of three-tier enterprise applications presents a unique opportunity to use statistical approaches to anomaly detection and localization to keep these systems running. We propose three specific extensions to prior work in this area. First, we propose anomaly detection and pattern mining not only for time-based operational statistics such as response time, but also for structural behaviors of the system—what parts of the system, in what combinations, are being exercised in response to different kinds of external stimuli. Second, rather than building baseline models a priori, we extract them by observing the behavior of the system over a short period of time during normal operation. Third, we combine these detection and analysis techniques with low-cost, predictable control points that can be activated in response to a suspected-anomalous event; these control points are designed so that the cost of activating them is low enough to tolerate the inevitable false positives that result from the application of statistical techniques. We explain why the assumptions necessary for this to work can be addressed by new systems research, report on some early successes using the approach, describe benefits of the approach that make it competitive as a path toward self-managing systems, and outline some research challenges.

Recovery as Rapid Adaptation

A "five nines" availability service (99.999% uptime) can be down only five minutes a year. Putting a human in the critical path to recovery would expend that entire budget on a single incident, hence the increasing interest in self-managing or so-called "autonomic" systems. Although there is extensive literature on statistics-based change point detection [2], some kinds of partial failures, or "brown-outs" in which only part of a service malfunctions, cannot be easily detected by such techniques. To compound the problem, in many Internet systems today, failure detection time is now in the critical path, being a major component of overall recovery time [?].

We believe a promising direction is to start thinking not in terms of normal operation vs. recovery, but in terms of constant and rapid adaptation to external conditions, including sudden workload changes, inevitable hardware and software failures, and human operator errors. In particular, we propose the broad application of techniques from statistical learning theory (SLT) to observe and track structural behaviors of the system, and to rapidly detect potential problems such as the example above. SLT techniques for classification, prediction, feature selection, clustering, sequential decision-making, novelty detection, trend analysis, and diagnosis are already being used in bioinformatics, information retrieval, spam filtering and intrusion detection. We believe the emergence of componentized, request-reply enterprise applications makes the time ripe for aggressive application of SLT to dependability of these systems.

We assume typical request-reply based Internet services, with separate session state [11] used to synthesize more complex interactions from a sequence of otherwise stateless request-reply pairs. Past approaches to statistical monitoring of such services have primarily relied on a priori construction of a system model for fault detection and analysis; this construction is tedious and error-prone, and will likely remain so as our services continue to evolve in the direction of heterogeneous systems of black boxes, with subsystems such as Web servers, application logic servers, and databases being supplied by different vendors and evolving independently. We propose instead to build and periodically update the baseline model by observing the system's own "normal" behavior. The approach can be summarized as follows:

1. Ensure the system is in a state in which it is mostly doing the right thing most of the time, according to simple and well-understood external indicators.
2. Collect observations about the system's behavior during this time to build one or more baseline models of behavior. These models may capture either time-series behaviors of particular parameters or structural behaviors of the system.
3. If "anomalous" behaviors relative to any of these models are observed, automatically trigger simple corrective actions. If repeated simple corrective actions do not cause the anomaly to go away, notify a human. These corrective actions are designed to be safe and low-cost since false positives are a fact of life with statistical approaches.
4. Periodically, go back to step 2 to update the model.

Each of steps 1–3 corresponds to an assumption, as follows.

A1. Large number of independent requests. If most users' interactions with the service are independent of each other (as they usually are for Internet services), and if we assume bugs are the exception rather than the norm, such a workload gives us the basis to make "law of large numbers" arguments supporting the use of statistical techniques to extract the model from the behavior of the system itself. Also, a large number of users per unit time means that large fractions of the service's functionality are exercised in a relatively short period of wall-clock time, providing hope that the model can be created and maintained online, while the system is running.

A2. Modular architecture for observation points. To use statistical or data-mining techniques, we need a representation of the data observations the model will operate on ("concepts" in the terminology of data mining) and a way to capture those observations. A modular service design, such as the componentized design induced by Java 2 Enterprise Edition (J2EE) or CORBA, allows us to crisply define a single user's time-bounded request-reply interaction with the service as a collection of discrete service elements or subsystems that participated in that interaction. Furthermore, in such systems we can add the observation points to the middleware, avoiding modifications to each application, as we have done for the JBoss open source application server [4]. Note that it is OK if the behaviors observed at different observation points are correlated with each other, or completely uncorrelated to any interesting failure: feature-selection algorithms can identify the subset of features most predictive of anomalies from a much larger collection of features. Lastly, collecting these observations must not materially interfere with service performance.

A3. Simple and predictable control points. If the model's predictions and analyses are to be used to effect service repair when an anomaly indicating a potential failure is detected, there must be a safe, predictable, and relatively non-disruptive way to do so. Safe means that correct application semantics are not jeopardized by actuating the control point. Predictable means that the cost of actuating the control point must be well known. Non-disruptive means that the result of activating a control point will be no worse than a minor and temporary effect on performance. These properties are particularly important when statistical techniques are used because those techniques will inevitably generate false positives. If we know that the only effect of acting on a false positive is a temporary and small decrease in performance, we can quantify the cost of "blindly" acting on false positives; this enhances the appeal of automated statistical techniques, since many techniques' sensitivity can be tuned to trade off false positive rates vs. false negative (miss) rates.

While assumption A1 is true for the services in question, and we have made progress on A2 by instrumenting the middleware of framework-intensive applications, the bigger challenge is adding application-generic control points that are predictable, safe and non-disruptive. We now describe how we have approached this challenge in two early case studies: one based on time-series models and another based on structural models.
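Before the case studies, here is a compact sketch of the four-step loop described above, as we read it; this is our own illustration with placeholder hooks (observe, models, corrective_action, notify_human), not the authors' implementation.

```python
import time

def monitoring_loop(observe, models, corrective_action, notify_human,
                    max_retries=3, interval_s=1.0):
    """Steps 2-4 of the approach: observe, compare against baseline
    models, trigger cheap corrective actions, escalate on repetition.
    All four callables are placeholders to be bound to a real system."""
    retries = 0
    while True:
        sample = observe()                       # step 2: collect observations
        anomalous = any(m.is_anomalous(sample) for m in models)
        if anomalous:
            if retries < max_retries:
                corrective_action()              # step 3: safe, low-cost action
                retries += 1
            else:
                notify_human(sample)             # repeated anomaly: escalate
                retries = 0
        else:
            for m in models:
                m.update(sample)                 # step 4: refresh the baseline
            retries = 0
        time.sleep(interval_s)
```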

Time Series Models

Time-series models capture aperiodic or quasi-periodic patterns in a service's temporal behavior that cannot be easily characterized by a statistic or a small set of parameters. For example, the memory used by a server-like process typically grows until garbage collection occurs, then falls abruptly. We do not know the period of this pattern, or indeed whether it is periodic; but we would expect that multiple servers running the same logic under reasonable load balancing should behave about the same—the relative frequencies of garbage-collection events at various timescales should be comparable across all the replicas. We successfully used this method to detect anomalies in replicas of SSM, our session state management subsystem [11]. Each replica reports the values of several resource-usage and forward-progress metrics once per second, and these time series are fed to the Tarzan algorithm [9], which discretizes the samples to obtain binary strings and counts the relative frequencies of all substrings within these strings. Normally, these relative frequencies are about the same across all replicas, even if the garbage-collection cycles are out of phase or their periods vary.¹ If the relative frequencies of more than 2/3 of these metrics on some replica differ from those of the other replicas, that replica is immediately rebooted.

This works because SSM is deliberately optimized for fast reboot: it does not preserve replica state across reboots, and since some overprovisioning due to replication is inherent in its design, this control action is safe, predictable and non-disruptive. (That is, SSM is a crash-only subsystem [3].) The net effect is that SSM as a system has no concept of "recovery" vs. "normal" behavior; since periodic reboots are normal and incur little performance cost, the system is "always recovering" by adapting to changing external conditions through a simple composition of mechanisms.

¹ Classical time-series methods are less effective when the signal period varies.
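To give a flavor of the Tarzan-style analysis above ([9]), the following sketch is our simplified illustration, not the published algorithm: it discretizes each metric's time series into a binary string around its median and compares relative substring frequencies between a replica and its peers. The window length, distance measure, and deviation threshold are assumed values.

```python
from collections import Counter
from statistics import median

def discretize(series):
    """Binarize a time series around its median (1 = above, 0 = not)."""
    m = median(series)
    return "".join("1" if x > m else "0" for x in series)

def substring_freqs(bits, w=4):
    """Relative frequencies of all length-w substrings of the string."""
    counts = Counter(bits[i:i + w] for i in range(len(bits) - w + 1))
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

def deviates(replica, peers, w=4, threshold=0.15):
    """Flag a replica whose substring-frequency profile differs from the
    average profile of its peers by more than `threshold` (L1 distance)."""
    mine = substring_freqs(discretize(replica), w)
    peer_profiles = [substring_freqs(discretize(p), w) for p in peers]
    keys = set(mine) | {k for p in peer_profiles for k in p}
    avg = {k: sum(p.get(k, 0.0) for p in peer_profiles) / len(peer_profiles)
           for k in keys}
    distance = sum(abs(mine.get(k, 0.0) - avg[k]) for k in keys)
    return distance > threshold

# Two healthy replicas with a sawtooth (GC-like) pattern, one stuck flat.
healthy = [i % 10 for i in range(200)]
stuck = [5] * 200
print(deviates(stuck, [healthy, healthy]))   # True: its profile differs
```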

Structural Models

Structural models capture control-flow behavior of an application, rather than temporal behavior. One example of a structural model is a path—the inter-component dynamic call tree resulting from a single request-reply interaction. We modified JBoss to dynamically collect such call trees for all incoming requests; these are then treated as parse trees generated by a probabilistic context-free grammar (PCFG) [12]. Later on, when a path is seen that corresponds to a low-probability parse tree, the corresponding user request is flagged as anomalous. In our initial testing, this approach detects over 90% of various injected faults with false positive rates around 3% (see Figure 1). While our diagnosis results are not as good, with our decision trees identifying the correct cause of the anomaly only 50–60% of the time, it is striking that the technique performs as well as it does with no prior knowledge of the application's structure or semantics.

When anomalies are detected and localized to specific EJBs, we selectively "microreboot" the suspected-faulty EJBs without causing unavailability of the entire application. This integration work is still in progress [4], but we have demonstrated that EJB microreboots are predictable and non-disruptive, and there is reason to believe they are safe because J2EE constrains application structure in a way that makes persistent state management explicit—most EJBs are stateless, and we are modifying JBoss to externalize the session state into SSM, which is itself optimized for safe and non-disruptive fast reboot.

Figure 1: Detection rate vs. false positive rate for PCFG-based path-shape analysis of PetStore 1.3 running on our modified JBoss server. Relying on HTTP error logs would reduce the detection rate to about 78%. Compared to the uninstrumented application, our throughput is 17% less, request latency is about 40 ms more, and analysis of several thousand paths takes a few seconds, suggesting that the approach is feasible as an online technique.
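The PCFG-style path scoring can be illustrated roughly as follows; this is a simplified sketch of the general idea in [12], not the authors' implementation. It estimates the probability of each parent-to-children expansion from observed call trees, then scores new paths by the product of their expansion probabilities, flagging those below a threshold. The tree representation, smoothing floor, and threshold are assumptions.

```python
from collections import Counter

def expansions(tree):
    """Yield (component, tuple_of_callees) pairs for every node of a
    call tree given as (name, [subtrees])."""
    name, children = tree
    yield name, tuple(child[0] for child in children)
    for child in children:
        yield from expansions(child)

def train(paths):
    """Estimate P(callees | component) from observed call trees."""
    counts = Counter(e for t in paths for e in expansions(t))
    totals = Counter()
    for (head, _), c in counts.items():
        totals[head] += c
    return {e: c / totals[e[0]] for e, c in counts.items()}

def score(tree, probs, floor=1e-6):
    """Probability of a path: product of its expansion probabilities;
    unseen expansions get a small floor probability."""
    p = 1.0
    for e in expansions(tree):
        p *= probs.get(e, floor)
    return p

normal = ("web", [("app", [("db", [])])])
probs = train([normal] * 50)
odd = ("web", [("app", [])])          # app never observed calling nothing
print(score(normal, probs) > 1e-3)    # True: familiar path shape
print(score(odd, probs) < 1e-3)       # True: anomalously shaped path
```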

Research Challenges

Cheap recovery also enables other online operations to be recast as adaptation. For example, online repartitioning of a cluster-based hash table can be achieved by taking one replica offline (which looks like a failure and does not affect correctness), cloning it, and bringing both copies back online [8]. The resulting inconsistency looks the same as existing failure cases that are already handled by normal-case mechanisms, hence no new machinery is required to implement growing, partitioning, or rebalancing as online operations analogous to "failure and recovery".

Most existing implementations of SLT algorithms are offline; our proposal may motivate SLT practitioners to focus on online and distributed algorithms. Our experiments show that even an unoptimized offline implementation of PCFG analysis can process thousands of paths in a few seconds. This in turn motivates the need for generic data collection and management architectures for statistically-monitored systems: even a simple (11K lines of code) application we instrumented produces up to 40 observations per user request, with 1,000 to 10,000 requests per second being representative of Internet services. Scalable abstractions for sliding data windows, sampling, fusion of results from different SLT models, etc. will have to be provided, as well as easy ways to create observation and control points without requiring intrusive modifications to every application.

Related Work

Anomaly detection has been used to infer errors in systems code [5], debug Windows Registry problems [14], detect possible violation of runtime variable assignment invariants [7], and discover source code bugs by distributed assertion sampling [10]. The latter is particularly illustrative of SLT's ability to mine large quantities of observations for interesting patterns that can be directly related to dependability. System parameter tuning and automatic resource provisioning have also been tackled using PCFG-based approaches [1] and closed-loop control theory [13], although such approaches generally cannot detect functional or structural deviations in system behavior unless they manifest as performance anomalies. The Recovery-Oriented Computing project [6] has argued that fast recovery is good for its own sake, but in the context of SLT, fast recovery is essential because it gives us an inexpensive way to deal with false positives. As such, ROC is a key enabler for this approach.

Conclusion

Our ability to design and deploy large complex systems has outpaced our ability to deterministically predict their behavior except at the coarsest grain. We believe statistical approaches, which can find patterns and detect deviations in data whose semantics are initially unknown, will be a powerful tool not only for monitoring and online adaptation of these systems but for helping us better understand their structure and behavior. Our encouraging initial results are just proofs-of-concept that invite much deeper exploration of the approach. Generalizing our approach has already revealed research challenges in data management scalability (collecting and logging hundreds of thousands of observations per second), algorithm design (distributed online SLT algorithms with provable bounds that allow incremental model updates), and multilevel learning (understanding which models should be assigned high confidence when their classifications differ). A generic platform for pervasive integration of SLT methods, themselves the subject of broad and vigorous research, would hasten the adoption of SLT into dependable systems, which we believe would in turn provide a new scientific foundation for the construction of self-managing systems.

Acknowledgments

The ideas in this paper have benefited from advice and discussion with George Candea, Timothy Chou, Moises Goldszmidt, Joseph L. Hellerstein, Ben Ling, Matthew Merzbacher, and Chris Overton, and from the comments of the anonymous reviewers.

References
[1] Paul Barham, Rebecca Isaacs, Richard Mortier, and Dushyanth Narayanan. Magpie: real-time modelling and performance-aware systems. In Proc. 9th Workshop on Hot Topics in Operating Systems, Lihue, HI, June 2003.
[2] Michèle Basseville and Igor V. Nikiforov. Detection of Abrupt Changes—Theory and Application. Prentice-Hall Inc., Englewood Cliffs, NJ, 1993.
[3] George Candea and Armando Fox. Crash-only software. In Proc. 9th Workshop on Hot Topics in Operating Systems, Lihue, HI, June 2003.
[4] George Candea, Pedram Keyani, Emre Kiciman, Steve Zhang, and Armando Fox. JAGR: An autonomous self-recovering application server. In Proc. 5th International Workshop on Active Middleware Services, Seattle, WA, June 2003.
[5] Dawson Engler, David Yu Chen, Seth Hallem, Andy Chou, and Benjamin Chelf. Bugs as deviant behavior: A general approach to inferring errors in systems code. In Proc. 18th ACM Symposium on Operating Systems Principles, pages 57–72, Lake Louise, Canada, Oct 2001.
[6] David A. Patterson et al. Recovery-Oriented Computing: motivation, definition, techniques, and case studies. Technical Report CSD-02-1175, University of California at Berkeley, 2002.
[7] Sudheendra Hangal and Monica Lam. Tracking down software bugs using automatic anomaly detection. In Proceedings of the International Conference on Software Engineering, May 2002.
[8] Andy C. Huang and Armando Fox. A persistent hash table with cheap recovery: A step towards self-managing state stores. In preparation.
[9] E. Keogh, S. Lonardi, and W. Chiu. Finding surprising patterns in a time series database in linear time and space. In Proc. of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 550–556, Edmonton, Alberta, Canada, Jul 2002.
[10] Ben Liblit, Alex Aiken, Alice X. Zheng, and Michael I. Jordan. Bug isolation via remote program sampling. In Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation, San Diego, California, June 9–11 2003.
[11] Benjamin C. Ling, Emre Kıcıman, and Armando Fox. Session state: Beyond soft state. In First USENIX/ACM Symposium on Networked Systems Design and Implementation (NSDI 04), San Francisco, CA, March 2004.
[12] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA, 1999.
[13] S. Parekh, N. Gandhi, J.L. Hellerstein, D. Tilbury, T.S. Jayram, and J. Bigus. Using control theory to achieve service level objectives in performance management. Real Time Systems Journal, 23(1–2), 2002.
[14] Yi-Min Wang, Chad Verbowski, and Daniel R. Simon. Persistent-state checkpoint comparison for troubleshooting configuration failures. In Proc. International Conference on Dependable Systems and Networks, San Francisco, CA, June 2003.















[The following pages of the proceedings did not survive text extraction; only scattered diagram labels are recoverable: "Server", "Preprocessing unit", "Timeout Oracle", "Variable selection / statistical survey". The text resumes mid-paper below.]
βc), the formation of locally aligned patches can be observed.

To analyze the behavior of our cellular automaton model we consider the time evolution of a statistical ensemble of systems. For technical details we refer to Refs. [6] and [8]. In a mean-field description a central role is played by the average occupation numbers $f_i(\mathbf{r},t) \equiv \langle s_i(\mathbf{r},t)\rangle$. It is assumed that at each time step, just before interaction, the probability distribution is completely factorized over channels $(\mathbf{r}, \mathbf{c}_i)$, so that the probability to find a microstate $\{s_i(\mathbf{r})\}$ at time $t$ is given by
$$\prod_{\mathbf{r}} \prod_{i=1}^{4} \left[f_i(\mathbf{r},t)\right]^{s_i(\mathbf{r})} \left[1 - f_i(\mathbf{r},t)\right]^{1 - s_i(\mathbf{r})}.$$
We denote the factorized average by $\langle\cdots\rangle_{\mathrm{MF}}$. Replacing $\langle\cdots\rangle$ by $\langle\cdots\rangle_{\mathrm{MF}}$, i.e., neglecting all correlations between occupation numbers, we obtain a closed evolution equation for $f_i(\mathbf{r},t)$: the nonlinear Boltzmann equation,
$$f_i(\mathbf{r} + \mathbf{c}_i,\, t+1) = f_i(\mathbf{r},t) + I_i(\mathbf{r},t). \qquad (2)$$
Here the term $I_i(\mathbf{r},t) \equiv \langle \sigma_i(\mathbf{r},t) - s_i(\mathbf{r},t)\rangle_{\mathrm{MF}}$, taking values between −1 and 1, equals the average change in the occupation number of channel $(\mathbf{r}, \mathbf{c}_i)$ during interaction. Simulations of the (deterministic) Boltzmann equation starting from various initial conditions, which provide the only sources of random effects, nicely mimic the cellular automaton snapshots [11]. Thus, it seems reasonable to expect valuable insights into the automaton dynamics from a stability analysis of the nonlinear Boltzmann equation.

Linear stability analysis. We have focused on the linear stability of the spatially homogeneous and stationary solution $f_i(\mathbf{r},t) = \bar f = \bar\rho/4$ of Eq. (2) with respect to fluctuations $\delta f_i(\mathbf{r},t) = f_i(\mathbf{r},t) - \bar f$. Therefore, we linearize Eq. (2), perform a Fourier transformation, $\delta f_i(\mathbf{k},t) = \sum_{\mathbf{r}} e^{-i\mathbf{k}\cdot\mathbf{r}}\, \delta f_i(\mathbf{r},t)$, and obtain a linear system
$$\delta f_i(\mathbf{k},\, t+1) = \sum_{j=1}^{4} \Gamma_{ij}(\mathbf{k})\, \delta f_j(\mathbf{k},t).$$
The mean-field or Boltzmann propagator $\Gamma(\mathbf{k})$ describes how a small perturbation around a spatially uniform state evolves in time. It is given by
$$\Gamma_{ij}(\mathbf{k}) = e^{-i\mathbf{k}\cdot\mathbf{c}_i}\left[\delta_{ij} + \sum_{p=0}^{4} e^{i\mathbf{k}\cdot\mathbf{c}_p}\, \Omega^{p}_{ij}\right],$$
with $\mathbf{c}_0 \equiv 0$ and $\Omega^{p}_{ij} = \partial I_i(\mathbf{r},t)/\partial f_j(\mathbf{r} + \mathbf{c}_p, t)\big|_{\bar f}$. Statements about stability are now reduced to statements about the matrices $\Gamma(\mathbf{k}, \beta, \bar\rho)$. A detailed stability analysis of the model can be found in [6].

Phase transition. The mean-field stability analysis illustrates the nature of the phase transition observed in cellular automaton simulations, which can be interpreted as emergent behavior. We have compared the results of our stability analysis with the computer simulations. Simulations exhibit an abrupt change in $\bar\mu$ at $\beta \approx 0.7$, which agrees well with the prediction $\beta_c = 0.67$ obtained from our stability analysis. Note that the mean-field approximation, obtained by neglecting all correlations, is sufficient to explain the emergent behavior in the automaton simulations, because all correlations produced in the interaction step are immediately destroyed by the deterministic migration.
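As an illustration of how such a stability criterion can be checked numerically, here is a small sketch of ours: it builds Γ(k) for the four-velocity square lattice and reports the spectral radius, instability corresponding to a dominant eigenvalue of magnitude larger than one. The Ω^p matrices used here are random placeholders, NOT the model's actual derivatives, which must be computed from the interaction rule as in [6].

```python
import numpy as np

# Four unit velocities on the square lattice, plus the rest vector c0 = 0.
C = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1]])  # c0..c4

def propagator(k, omega):
    """Boltzmann propagator Gamma(k) for a 4-channel LGCA:
    Gamma_ij(k) = exp(-i k.c_i) * (delta_ij + sum_p exp(i k.c_p) Omega^p_ij),
    where omega[p] is the 4x4 matrix of derivatives dI_i/df_j(r + c_p)."""
    inner = np.eye(4, dtype=complex)
    for p in range(5):
        inner += np.exp(1j * np.dot(k, C[p])) * omega[p]
    gamma = np.zeros((4, 4), dtype=complex)
    for i in range(4):
        gamma[i, :] = np.exp(-1j * np.dot(k, C[i + 1])) * inner[i, :]
    return gamma

def spectral_radius(k, omega):
    return max(abs(np.linalg.eigvals(propagator(k, omega))))

# Placeholder Omega^p matrices, just to exercise the machinery.
rng = np.random.default_rng(0)
omega = 0.05 * rng.standard_normal((5, 4, 4))

# Scan wave numbers along the x axis; instability <=> radius > 1.
for kx in np.linspace(0.0, np.pi, 5):
    r = spectral_radius(np.array([kx, 0.0]), omega)
    print(f"k = ({kx:.2f}, 0): spectral radius = {r:.3f}")
```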

Outlook. The general advantage of using individual-based approaches (particularly cellular automaton models) is that microscopic interaction mechanisms can be directly formulated as corresponding cellular automaton rules and the resulting macroscopic behavior can be analyzed. It turns out that in many important biological systems the coarse-grained perspective of the cellular automaton model covers the essential aspects of the cell interaction behavior. So what can we learn for applications in dynamic networks? First of all, dynamic cell networks provide a source of microscopic interactions that can be exploited for technological applications in which macroscopic behavior appears as a result of self-organization [12]. For example, the idea of adhesive interaction might be used to design new self-organized distributed libraries, in which similar information should arrange itself in clusters. For this application, the adhesivity of cells has to be translated into similarity of information. Furthermore, the idea of stability analysis, which we have demonstrated in this paper for a cellular automaton model of cellular swarming, makes it possible to link particular microscopic interaction rules with specific emergent macroscopic behaviors. Besides the characterization of emergent behavior (in the swarming model, the phase transition visible as the formation of aligned patches), it should, for example, enable the determination of the speed of information spread in a dynamic network context.

References

[1] J. D. Murray, Mathematical Biology, Springer, New York, 2002.
[2] M. S. Steinberg, Reconstruction of tissues by dissociated cells, Science, 141 (3579): 401-408, 1963.
[3] D'Arcy W. Thompson, On Growth and Form, Cambridge University Press, Cambridge, 1917.
[4] A. M. Turing, The chemical basis of morphogenesis, Phil. Trans. R. Soc. London, 237: 37-72, 1952.
[5] N. F. Britton, Reaction-Diffusion Equations and Their Applications to Biology, Academic Press, London, 1986.
[6] H. Bussemaker, A. Deutsch and E. Geigant, Mean-field analysis of a dynamical phase transition in a cellular automaton model for collective motion, Phys. Rev. Lett., 78: 5018, 1997.
[7] U. Börner, A. Deutsch, H. Reichenbach and M. Bär, Rippling patterns in aggregates of myxobacteria arise from cell-cell collisions, Phys. Rev. Lett., 89 (7): 8101, 2002.
[8] A. Deutsch and S. Dormann, Cellular Automaton Modeling of Biological Pattern Formation, Birkhäuser, Boston, 2004.
[9] U. Frisch, B. Hasslacher and Y. Pomeau, Lattice-gas automata for the Navier-Stokes equation, Phys. Rev. Lett., 56 (14): 1505-1508, 1986.
[10] A. T. Lawniczak, D. Dab, R. Kapral and J. P. Boon, Reactive lattice gas automata, Physica D, 47: 132, 1991.
[11] A. Czirók, A. Deutsch and M. Wurzel, Individual-based models of cohort migration in cell cultures, in: W. Alt, M. Chaplain, M. Griebel and J. Lenz (eds.), Models of Polymer and Cell Dynamics, Birkhäuser, Basel, 2003.
[12] A. Deutsch, N. Ganguly, G. Canright, M. Jelasity and K. Engoe-Monsen, Models for advanced services in ad-hoc and p2p networks, deliverable D08 of the EU-RTD BISON project, www.cs.unibo.it/bison/deliverables/D08.pdf

The Emergent Thinker

Sergio Camorlinga
Dept. of Computer Science, University of Manitoba, Winnipeg, MB R3T 2N2, Canada
Email: [email protected]

Ken Barker
Dept. of Computer Science, University of Calgary, Calgary, AB T2N 1N4, Canada
Email: [email protected]

Abstract. This paper introduces the Emergent Thinker, an area-wide logical computing entity, named after a philosopher, that continuously analyses information and has emergent computed solutions for new and/or living requests. The Complex Adaptive System (CAS) emergent computation model and the CAS propagation model are proposed as mechanisms to achieve the Emergent Thinker. The Thinker is proposed as an alternative approach to new and existing design and implementation challenges in systems research.

I. INTRODUCTION
Systems research has produced solutions to many of its design and implementation challenges. These solutions are usually based on algorithms and models with predetermined, predefined centralized and distributed techniques. However, these models and algorithms are limited¹ when subjected to computing environments that are dynamic, self-organized, ad hoc (e.g., in terms of connectivity, operation, etc.) and decentralized, like those found in peer-to-peer systems, pervasive computing environments, some grid systems (e.g., grids with no common domain across the Internet), etc. It is clear that different approaches are required for these environments.

Complex adaptive systems (CAS) are characterized by a large number of members with simple functions and limited communication among them. The emergence of swarm intelligence [11] from the activities of simple members boasts autonomy and self-sufficiency, which allows them to adapt quickly to changing environmental conditions. The emergent global outcome may be applied to solve particular distributed-system problems. Emergent outcomes (i.e., self-* properties) providing solutions to specific problems have been documented in research work published previously; several examples are available in the literature² [1]-[10].

A common feature of previous work is the existence of simple agents³ with local functions⁴ and a simple communication mechanism⁵ that is either direct or indirect. From this, we propose a basic model that generalizes previous work. We call it the CAS Emergent Computation Model (CEC). See Figure 1.

Figure 1: CAS Emergent Computation Model

In the CEC model, agents follow simple rules to affect their states and/or environment, generating an emergent pattern formation⁶ that produces a system-wide result (Figure 1). The system-wide result is interpreted as an emergent computation (i.e., a self-* property) that solves a particular distributed-system problem (e.g., aggregation, resource allocation, classification, assignment, path selection, decision, etc.). A model hypothesis is that all computation to be provided can be obtained by emergent computations of simple activities.

¹ Some workarounds are sometimes implemented to alleviate their limitations.
² Some authors do not call their mechanisms CAS-based. However, in essence they are, if we assume that these mechanisms have similar characteristics.
³ We use the terms 'agent', 'individual' and 'member' interchangeably throughout the paper.
⁴ We use the terms 'function', 'property', 'action' and 'activity' interchangeably throughout the paper.
⁵ The communication mechanism exists within the domain environment.
⁶ We use 'emergent pattern formation' to mean pattern formations in agents, the domain environment, or both.
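As a concrete illustration of the CEC idea (our sketch, not the authors' implementation), the following toy example shows agents that follow a single local rule, pairwise averaging with a random peer, from which a system-wide result, the global mean, emerges; this solves the "aggregation" problem mentioned above without any agent ever computing the global answer.

```python
import random

# Minimal CEC-style sketch: each agent holds a local value and follows
# one simple rule -- average with a randomly chosen peer. No agent ever
# sees the whole system, yet every local value converges to the global
# mean: the system-wide result "emerges" as the computation.
random.seed(1)
values = [random.uniform(0, 100) for _ in range(50)]   # agents' local states
true_mean = sum(values) / len(values)

for step in range(2000):
    i, j = random.sample(range(len(values)), 2)        # local interaction only
    values[i] = values[j] = (values[i] + values[j]) / 2

print(f"true mean {true_mean:.4f}; every agent now holds a value in "
      f"[{min(values):.4f}, {max(values):.4f}]")
```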

Furthermore, the way the actions are propagated will have an impact on the perturbations (i.e., properties' feedback, environment feedback), which subsequently affect the local agents' properties, which in turn adapt the agents' behaviors accordingly. If these perturbations are too high, they can force the system into chaos. The CPM is a fertile area of experimental research that can lead to an understanding of the fundamental linkages between local properties and emergent, global self-* properties, and eventually to the theory and mathematical formulations that represent them.

A basic question⁷ is to identify the relationship that exists between the emergent computation and the local agents' properties or actions. Focusing on the cause-effect relationships, which are dynamic and non-linear, could lead us to necessary and sufficient conditions for obtaining emergent computations (i.e., self-* properties). Further, it helps us to understand how a specific property produces an emergent outcome. All of this would allow us to control and manage the agents' properties to obtain desired global outcomes. This leads to a key question: how can we identify these relationships? We propose the CAS Propagation Model.

With an understanding of the CAS Emergent Computation model (Figure 1), the CEC model is extended to develop what we call the 'Emergent Thinker' (ET). This is possible because we are then able to control and manage the emergent computations provided by the ET. The Emergent Thinker is named after a philosopher that continuously analyses information and has emergent computed solutions for new and/or living⁸ requests.

II. CAS PROPAGATION MODEL
The CAS Propagation Model (CPM) is based on the simple idea of amplification by propagation. The way the different agents connect has a direct consequence on an action's impact on the other agents and on the system as a whole. In the past, different approaches have been used to interconnect CAS agents (e.g., fully connected, small-world pattern, random, small set of neighbors, local paths, etc.). These connectivity approaches generate different global outcomes from the same local properties. We can attribute this to the way the local properties are propagated across the domain and consequently amplified into different global, emergent self-* properties. The main point the CPM makes is about propagation and its effect on the amplification of the agents' actions, which eventually gives an emergent result, called a global self-* property (Figure 2).
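The CPM's central claim, that the same local rule amplifies differently under different connectivity, is easy to observe in a toy experiment. The sketch below is our illustration (topologies and parameters are arbitrary choices): a piece of information propagates under an identical local rule over a ring of local neighbors versus a random graph with long-range links.

```python
import random

def ring(n, k=2):
    """Each node linked to its k nearest neighbors on either side."""
    return {i: [(i + d) % n for d in range(-k, k + 1) if d != 0]
            for i in range(n)}

def random_graph(n, deg=4, seed=0):
    rnd = random.Random(seed)
    return {i: rnd.sample([j for j in range(n) if j != i], deg)
            for i in range(n)}

def rounds_to_spread(adj, n, seed=0):
    """Identical local rule (tell one random neighbor per round);
    only the connectivity differs between the two experiments."""
    rnd = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n:
        rounds += 1
        for i in list(informed):
            informed.add(rnd.choice(adj[i]))
    return rounds

n = 200
print("local ring topology:", rounds_to_spread(ring(n), n), "rounds")
print("random long links:  ", rounds_to_spread(random_graph(n), n), "rounds")
```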

Figure 2: CAS Propagation Model

III. THE EMERGENT THINKER
The Emergent Thinker (ET) is an area-wide logical computing entity (Figure 3) based solely on CAS models and algorithms to provide function services. It uses the CAS computing model (Figure 1) as its building block. It has CAS domain environments, implemented in decentralized contexts (e.g., P2P environments, pervasive environments, multi-processor computers, grids, etc.), providing emergent function services. The emergent function services are equivalent to the function calls provided by a regular operating system, but with a completely different functionality and purpose. The emergent computation engines are located wherever the emergent service is required (i.e., pervasive engines). This is possible because the services or functions (i.e., self-* properties) are continuously emerging from the system itself. These engines are access points to observable and/or interpreted global properties.

⁷ Other questions include message exchanges, speed, property sets, feedbacks, the edge of chaos, limit theory, etc.

⁸ Living requests are those computations that the ET continuously works on.


Figure 3: The Emergent Thinker

IV. EXTENDED WORK
Current work assumes logically isolated CAS domains. The CAS domains are logically isolated in the sense that they compute independently from each other, although they might be running on the same physical hardware. Future work will analyze and establish the fundamental bases for interrelating these CAS domains at different levels and in different forms.

V. CONCLUSION
Systems research presents an opportunity for complex systems to provide innovative schemes and models. These schemes and models can provide emergent solutions (i.e., self-* properties) to a variety of design and implementation challenges in large distributed systems, some of which already exist while others have not been conceived before.

We have proposed a generic CAS emergent computing model. This model is then extended to become the Emergent Thinker, which provides a different approach to distributed systems computing through the use of self-* properties. The CAS propagation model is proposed as a means to understand and lay out the fundamentals of the CAS emergent computing model.

ACKNOWLEDGMENTS
We want to thank Peter Graham, John Anderson and Jeff Diamond from the University of Manitoba for their participation in useful discussions about emergent computations. We also thank the St. Boniface Research Centre and the Medical Informatics Group for sponsoring this research work.

REFERENCES
[1] White T, "Swarm Intelligence and Problem Solving in Telecommunications", Canadian AI Magazine, No. 41, pp. 14-16, Spring 1997.
[2] Dorigo M, Maniezzo V, Colorni A, "The Ant System: Optimization by a Colony of Cooperating Agents", IEEE Transactions on Systems, Man and Cybernetics, Vol. 26, No. 1, pp. 29-41, 1996.
[3] Babaoglu O, Meling H, Montresor A, "Anthill: A Framework for the Development of Agent-Based Peer-To-Peer Systems", University of Bologna Technical Report UBLCS-2001-09, September 2002.
[4] Montresor A, "Anthill: A Framework for the Design and Analysis of Peer-To-Peer Systems", 4th European Research Seminar on Advances in Distributed Systems (ERSADS 2001), Bertinoro, Italy, May 2001. See http://www.cs.unibo.it/projects/anthill/papers/anthill-01.pdf
[5] Montresor A, Meling H, Babaoglu O, "Messor: Load-Balancing through a Swarm of Autonomous Agents", Proceedings of the 1st International Workshop on Agents and Peer-To-Peer Computing, Bologna, Italy, pp. 125-137, July 2002.
[6] Camorlinga S, Barker K, Anderson J, "Multiagent Systems for Resource Allocation in Peer-to-Peer Systems", Proceedings of the Winter International Symposium on Information and Communication Technologies (WISICT 2004), Cancun, Mexico, pp. 173-178, January 2004.
[7] Bonabeau E and Meyer C, "Swarm Intelligence: A New Way to Think About Business", Harvard Business Review, pp. 106-114, May 2001.
[8] Solnon C, "Ants Can Solve Constraint Satisfaction Problems", IEEE Transactions on Evolutionary Computation, Vol. 6, No. 4, pp. 347-357, August 2002.
[9] Bourjot C, Chevrier V, and Thomas V, "A New Swarm Mechanism Based on Social Spiders Colonies: From Web Weaving to Region Detection", Web Intelligence and Agent Systems: An International Journal (WIAS), Vol. 1, No. 1, pp. 47-64, 2003.
[10] Jelasity M, Kowalczyk W, and Van Steen M, "An Approach to Massively Distributed Aggregate Computing on Peer-to-Peer Networks", Proceedings of the 12th Euromicro Conference on Parallel, Distributed and Network-based Processing (PDP 2004), La Coruña, Spain, February 2004.
[11] Bonabeau E and Theraulaz G, "Swarm Smarts", Scientific American, Vol. 282, No. 3, pp. 54-61, March 2000.


The Emergence of Stability in Diverse Supply Chains

Owen Densmore
The Santa Fe Institute Business Network, ValueNet Team

The distribution of products from manufacturer to distributor to wholesaler and finally to retailer (the supply chain) exhibits surprisingly erratic behavior, popularly termed the Bullwhip Effect. Recently a group at the Santa Fe Institute's Business Network formed a team to study the Bullwhip Effect via John Sterman's classic Operations Research game, the Beer Game. One of our goals was to discover mechanisms that would "dampen" the chaotic behavior, self-organizing the system into a stable one. Two mechanisms were investigated: increasing the visibility up and down the supply chain for each member of the chain, and converting the chain from a linear form to a network or mesh form. Both mechanisms produced a simple self-organized network in which improvements in individual choices dampened the variations in the overall chain.

History
While investigating the dynamics of supply chains in the 1980s, researchers were surprised to find that presumably stable commodities exhibited chaotic inventory properties. Demand for these products, rather than being constant, varied considerably, and the associated supply (inventory at the manufacturer and warehouses) fell into uncontrolled, erratic behavior [4].

MIT's John Sterman [5] invented a simple supply chain board game, called the Beer Game, in which four players manage inventory in a four-level supply chain: Beer Factory, Distributor, Wholesaler and Retailer. Each turn of the game represents one week's ordering and receiving of inventory into stock. A two-week supply queue and a one-week ordering queue exist between each pair of players, introducing delay and uncertainty. The "customer" for this supply chain has the same simple behavior in every play of the game: buy 4 barrels of beer each week for four weeks, then buy 8 barrels from then on, thus introducing a simple step function for customer demand. Players attempt to minimize a cost function charging $0.50 per barrel for storage and a $2.00 per barrel penalty for having inventory reduced to zero and thus not being able to fulfill orders (understock). Sterman's seminal contribution was to quantify typical human behavior, rather than attempting to "solve" the problem of optimizing the supply chain. Players try to minimize their costs but typically oscillate, in panic, between having too much inventory and not having enough. Sterman found a set of equations that accurately mimic this behavior.
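A compact simulation of this setup is sketched below. The queue lengths, demand step and cost constants follow the description above, but the ordering rule is only a Sterman-style anchoring-and-adjustment heuristic with illustrative parameters of ours (alpha, target, the 0.5 gain), not Sterman's fitted behavioral model; replacing the smoothed forecast with the visible downstream demand reproduces the "visibility" experiment discussed later.

```python
# Toy Beer Game with a Sterman-style anchoring-and-adjustment ordering
# rule. Parameter values are illustrative, not Sterman's estimates.
HOLD, UNDERSTOCK = 0.50, 2.00   # $/barrel/week storage and penalty costs

def simulate(weeks=52, alpha=0.3, target=12):
    stages = 4                       # Retailer, Wholesaler, Distributor, Factory
    inventory = [12.0] * stages
    forecast = [4.0] * stages        # exponentially smoothed demand estimate
    pipeline = [[4.0, 4.0] for _ in range(stages)]  # two-week supply queue
    cost = 0.0
    for week in range(weeks):
        demand = 4.0 if week < 4 else 8.0   # the classic step in demand
        for s in range(stages):
            inventory[s] += pipeline[s].pop(0)       # receive shipment
            inventory[s] -= demand                   # fill downstream orders
            cost += HOLD * max(inventory[s], 0) \
                  + UNDERSTOCK * max(-inventory[s], 0)
            forecast[s] += alpha * (demand - forecast[s])
            # anchor on the forecast, adjust toward the inventory gap
            order = max(0.0, forecast[s] + 0.5 * (target - inventory[s]))
            pipeline[s].append(order)
            demand = order           # this stage's order is upstream demand
    return cost

print(f"total supply-chain cost over a year: ${simulate():,.2f}")
```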

Figure 1: Inventory Volatility Landscape

A classic study [2] of this behavior looked into Pampers, a disposable diaper for babies produced by Procter & Gamble. Presumably the number of babies and their daily requirements for clean diapers would provide a nearly constant demand, and the associated supply chain would be quite stable and predictable. Studies showed, however, that the Pampers supply chain exhibited highly erratic, chaotic behavior.


The ValueNet Project

The Santa Fe Institute (SFI) has a Business Network (BusNet) composed of over 50 corporate partners interested in applying complexity techniques to their businesses. Two recent improvements in supply chain technologies prompted BusNet members to ask whether these could reduce the chaotic behavior within supply chains. These were 1) Radio Frequency ID tags (RFID) and 2) improved Internet communications.

An initial group of over 20 interested parties met twice to decide upon a project [3]. The project selected was to use an existing RePast model of the Beer Game and modify it in two areas. First, to model the impact of RFID and its software infrastructure, the model was modified to allow agents to see further down the supply chain than just the current incoming orders. Second, to consider the impact of the Internet, the linear supply chain model was replaced with a "mesh" network with multiple factories, distributors, warehouses and retailers.

The project was carried out by five members of the ValueNet team, joined by a sixth member in the latter stages of development to help add sophisticated visualization techniques. The results were presented to the SFI BusNet over a period of three SFI biannual meetings.

Visibility (RFID)
The initial Beer Game hid information from the players by placing orders on cards upside down on a playing table. This was simulated by having the RePast agents use a queue between themselves, with only the end of the queue visible. Agents were free to keep as much "local knowledge" as they wished. This included their pending incoming supply orders "in the pipe" and their current inventory for fulfilling incoming demand orders. Using the human behavior model discovered by Sterman, this results in extreme volatility for certain parameter settings.

Figure 2: The effect of visibility on volatility.


To model the impact RFID technology introduces into supply chains, it was decided to parameterize how far down the supply chain the agents could see. Thus, for example, the Factory agent could see its incoming orders at any level, even all the way to the Retailer. The result was dramatic: with vision increased just one level, the volatility within the supply chain dampened quickly, self-organizing into a simple steady-state, constant-order supply chain. In the figure above, the top three graphs, labeled "None" (for no additional visibility), show the standard Beer Game volatility over a run of 100 weeks. The bottom three, labeled "Adjacent", show the result of increasing visibility just to the adjacent agent. Note the dampening, reaching constant order rates near week 80, and the much-reduced cost values for the four agents.

Mesh (Internet)
The classic Beer Game uses a linear supply chain, consisting of just one of each agent type (Factory, Distributor, Wholesaler and Retailer). The ValueNet team decided that a more "modern" supply chain would use the Internet to access many vendors. Thus a Beer Factory would use multiple Distributors, which in turn would use multiple Warehouses, and so on.

In our initial mesh study, the agents simply fulfilled their inventory requirements uniformly among their providers, with no bargaining or auctions and indeed with no price differences among them. As in the visibility study, the volatility in the linear supply chain decreased, self-organizing into a steady-state, constant-order supply chain. Below, the top diagram shows a mesh-networked supply chain with two of each agent type serving a single customer. The bottom three graphs show the dampening effects of the mesh network, producing a stable supply chain at around week 70.

Figure 3: The effect of a mesh supply chain on volatility

Summary
The classic Beer Game, with the Sterman human behavior model, provides an interesting environment for investigating self-organization within supply chains. Two such investigations, one on increased visibility into the demand for products and the other on a more general network topology, dampened the volatility of the supply chain.

In terms of Self-Organization, the key feature here appears to be that the addition of greater diversity (increased visibility, mesh network topology) within the supply chain promotes a more stable environment. This view of Self-Organization was nicely articulated by John Holland when he posed


the question: How is it possible to buy a sandwich in Manhattan with fresh lettuce, only hours old? It would be impossible to organize this by a city planning office. But instead, individual agents, with sufficient local knowledge and resources, perform this miracle, solving an NP-complete problem "well enough". The key is providing enough local knowledge and resources, but not too much and not too constrained.


Workshop Discussion Material


The state space for the Beer Game can be approximated by collecting 24 state variables, such as orders in transit, internal state variables, inventory and control parameters. Presuming orders in [0, 40), inventory in [-110, +140), and two internal order-state variables for each component, the state-space size is roughly 1.074 × 10^39 (1,073,741,824,000,000,000,000,000,000,000,000,000,000). This is conservative, using only observed variation. A state-based analysis could either use the state variables to define a landscape and look for minima, or analyze the state space for basins of attraction.

The initial ValueNet exploration did not attempt to analyze the Beer Game from the standpoint of Self-Organization. Rather it used tools of the Complexity community to simulate the supply chain, and to investigate two modifications to it. This section presents material for discussing the Beer Game via Self-Organization.

A thermodynamic analysis of the Beer Game would consider the customer demand as the energy input, the delivered goods as the work of the system, and the total cost (inventory and backorders) as the generated heat.

Exploring the literature on Self-Organization [1] reveals multiple formal approaches, each of which defines a measure for organization and uses it to study the increase or decrease of a system's organization. Statistical entropy provides a probabilistic measure. Chaos theory uses basins-of-attraction arguments to discuss reducing access to restricted parts of the system's phase space, thus adding structure. (Note that this does not mean optimal, merely more stable.)

Finally, a statistical entropy analysis shows the quiescent states as having minimal entropy, and thus being "organized". More interesting would be to include the impact of the two experiments on the entropy of the system over time, as an indicator that they lead a priori to Self-Organization. A simple NetLogo model of the Beer Game with Visibility added is available for our use at: http://backspaces.net/Models/beergame.html
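As an illustration of the entropy measure (our sketch; the bin width and the toy time series are arbitrary choices), one can compare the Shannon entropy of the order-rate distribution in the volatile phase against the quiescent steady state:

```python
import math
from collections import Counter

def shannon_entropy(samples, bin_width=2):
    """Entropy (bits) of the empirical distribution of binned order rates."""
    bins = Counter(int(x // bin_width) for x in samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in bins.values())

# Illustrative order-rate series: volatile early weeks vs steady state.
volatile = [4, 9, 1, 14, 2, 11, 0, 16, 3, 12]
steady = [8, 8, 8, 8, 8, 8, 8, 8, 8, 8]
print("volatile phase:", shannon_entropy(volatile), "bits")  # high entropy
print("steady state:  ", shannon_entropy(steady), "bits")    # 0: 'organized'
```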

Less popular is an attempt to analyze a system for potential improvements using these measures, i.e. for prediction. This holds true for Complexity in general: it is very difficult to tell what rules should be used to achieve a desired result. For example, it is difficult to derive a constitution that would produce a desired set of social behaviors.

References
[1] Heylighen, F. The Science of Self-Organization and Adaptivity. In: The Encyclopedia of Life Support Systems. EOLSS Publishers, 2002.
[2] Lee, H. L., Padmanabhan, V. and Whang, S. The Bullwhip Effect in Supply Chains. Sloan Management Review, pages 93-102, 1997.
[3] Macal, North, MacKerrow, Danner, Densmore, Kim. Information Visibility and Transparency in Supply Chains. Lake Arrowhead Conference on Human Complex Systems, March 2003.
[4] North, M. J., Macal, C. M. The Beer Dock: Three and a Half Implementations of the Beer Distribution Game. SwarmFest 2002.
[5] Sterman, J. D. Modeling Managerial Behavior: Misperceptions of Feedback in a Dynamic Decision Making Experiment. Management Science, 35(3), 321-339, 1989.

The Sterman human behavior model has two rules: the demand-forecasting rule and the inventory-maintenance rule. The Visibility experiment modifies the forecasting rule by using the downstream demand as the prediction. Note that this demand is not itself optimal, but is in turn conditioned by knowledge of its own downstream demand, and so on. The Mesh experiment modifies the inventory rule by spreading the orders evenly among the multiple suppliers. The Beer Game stabilizes when it reaches a point of constant cost for each agent. It need not be optimal, however (i.e., have the desired inventory level); the rules simply produce a steady solution, escaping volatility.


Self-Organization and Volunteering: Engineering in Very Large Scale Sharing Networks

Peter Triantafillou
Computer Technology Institute and University of Patras

ABSTRACT

Our position is that a key to research efforts on ensuring high performance in very large scale sharing networks is the idea of volunteering: recognizing that such networks are comprised of largely heterogeneous nodes in terms of their capacity and behaviour, and that, in many real-world manifestations, a few nodes carry the bulk of the request-service load. In this paper we outline how we employ volunteering as the basic idea with which we develop altruism-endowed self-organizing sharing networks, to help solve two open problems in large-scale peer-to-peer networks: (i) to develop an overlay topology structure that enjoys better performance than DHT-structured networks and, specifically, offers O(log log N) routing performance in a network of N nodes, instead of O(log N); and (ii) to efficiently process complex queries, and range queries in particular.

1. INTRODUCTION
Recent research in P2P networks has largely focused on structuring efforts for the network overlay so as to ensure fast lookup/routing. This goal is accomplished by maintaining a structure in the system, emerging from the way that peers define their neighbors. These systems are usually built around the notion of a distributed hash table (DHT). Prominent examples of such architectures include Chord [3], CAN [1], Pastry [2], Tapestry [4], etc. DHT-based networks can route in O(log N) hops in the steady-state case, in a network of N peers. In unstructured networks [5], [6], [7], the related overheads are much higher and no performance guarantees can be given.

A key characteristic of large-scale sharing networks is the existence of altruistic and selfish nodes. The great majority of peers have been proven to be free riders [8]. Other reports offer evidence that half of the peer population changes almost every half hour [9]. In DHT overlays, given such a highly dynamic environment, routing performance degrades to (at best, if the topology does not break) O(N) hops. Furthermore, even the O(log N) hops achieved in steady state assuming "stable node behaviour" may not be good enough; after all, these are overlay hops, each translated into multiple physical network hops, involving perhaps low-bandwidth nodes (e.g., with 56 Kbps lines). This fact is cited by work in unstructured P2P networks to justify the inadequacies of structured networks [7].

The good news is that a non-negligible percentage of the peers were proven to be altruistic. In Gnutella, 1% (10%) of peers were found to serve about 40% (90%) of the total download requests [8]. In MojoNation, more than 1-2% of all users stayed connected almost all the time [9]. Unfortunately, however, there is a great lack of research exploiting the heterogeneities in power/capacity and in behaviour (altruistic vs. selfish) among peer nodes, especially in the structured world.

Finally, one of the biggest shortcomings of DHTs is that they only support single-identity, exact-match queries. This has led researchers to investigate how they could enhance P2P systems to answer more complex queries.

In this overview we outline solutions for developing altruism-endowed self-organizing sharing networks that can be used to define novel structured overlay network architectures to: (i) speed up the fundamental operations in such networks (that is, routing and location operations), and (ii) efficiently support range queries. Given the prominence of DHT-based architectures, we will assume a popular DHT (Chord) as our substrate, hoping that in this way our work can leverage existing results.

2. AESOP: VOLUNTEERS AND P2P NETWORK ARCHITECTURES

An AESOP (Altruism-Endowed Self-Organizing Peers) sharing network consists of three components:

- AltSeAl, a monitoring/accounting software layer that identifies and validates altruistic peers;
- PLANES, a layered overlay network architecture, based on altruistic nodes volunteering to perform several key tasks and to carry a great part of the total load created by the requests of all nodes in the network;
- SeAled PLANES, a layer that facilitates the collaboration of the previous two components.


2.1 SeAl: Identifying and Validating the True Colors of Volunteers
The Auditing/Accounting Layer. The basic idea is that all transactions between peers result in the creation of tokens (called "Transaction Receipts" or TRs) that can be used much like "favors" in real life: the peers rendering favors (i.e., sharing resources) gain the right to ask the peers receiving favors to somehow pay them back in the future, or get "punished" otherwise. All of these operations are performed transparently to the user. Nodes keep track of the favors they render or receive (i.e., store the corresponding TRs) in two favor lists: the "Favors-Done" (F_d) and "Favors-Owed" (F_o) lists. Moreover, nodes in AltSeAl are characterized by their "altruism score". This is simply a function of F_d and F_o; for example, we can consider |F_d| - |F_o| or |F_d| / |F_o| as possible altruism score functions.

If node n_1 shares a resource r_1 and node n_2 accesses it, the favor-lists mechanism enables n_1 to selectively redirect a subsequent incoming request for r_1 to n_2. AltSeAl nodes autonomously and independently set an upper (score_max) and a lower (score_min) threshold value for their score. When they rate higher than score_max, they always redirect incoming requests (if possible), while never redirecting when rating lower than score_min. In all other cases, nodes decide with a tunable probability whether to serve or redirect.

Favors and Complaints. In the previous scenario, if n_2 serves the redirected request, then the corresponding favor is marked as paid back. Otherwise, n_1 may choose to use the corresponding TR, i.e. TR(n_1, n_2, r_1), as a means of accusing n_2 of acting selfishly. This is accomplished in the following manner. AltSeAl uses a DHT overlay of its own to store "complaints". n_1 sends TR(n_1, n_2, r_1) to the appropriate node on this DHT, say n_3 (found by hashing the TR itself). n_3 then acts as an arbitrator between n_1 and n_2; it can ask both nodes to verify TR(n_1, n_2, r_1) and have n_2 pay back the corresponding favor. If the verification succeeds but n_2 still refuses to play fair, n_3 stores TR(n_1, n_2, r_1) for other nodes to know. If the verification fails, n_3 may choose to similarly "complain" about the perjurer peer.

To keep storage requirements constant as the system evolves, we use an aging scheme for stored TRs. Also, a selfish node may choose to collect its "complaints", serve an equal number of requests, and restore its fame.

TRs are constructed using strong (e.g., public-key) cryptographic primitives, while nodes in AltSeAl are equipped with a public/private key pair and are identified by a digest of their public key, which is also used to verify TRs. Thus, nodes can't fake TRs or deny the validity of a TR unless they change their ID (key pair). Furthermore, AltSeAl deploys a feedback mechanism rather than a penalizing one: requests are queued and served in a prioritized manner, while the actual resources allocated to serving these requests (e.g., bandwidth, storage, etc.) vary based on the overall "score" of the served peers. Moreover, peers commence their lifecycle in the system with the worst possible score, thus having no incentive to change their ID or mount a Sybil-class attack [10]. Our detailed experimentation has revealed that SeAl performs its tasks swiftly, incurring only minor latency, network bandwidth and storage overheads.
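The bookkeeping just described is easy to sketch in code. The following is our illustration of the favor lists, the |F_d| - |F_o| score, and the score_min/score_max redirect policy; class and method names are ours, not AltSeAl's actual API.

```python
import random

class AltSeAlNode:
    """Sketch of AltSeAl-style favor bookkeeping (illustrative names)."""
    def __init__(self, node_id, score_min=-5, score_max=5):
        self.node_id = node_id
        self.favors_done = []    # F_d: TRs for favors rendered
        self.favors_owed = []    # F_o: TRs for favors received
        self.score_min, self.score_max = score_min, score_max

    def score(self):
        # one of the two suggested score functions: |F_d| - |F_o|
        return len(self.favors_done) - len(self.favors_owed)

    def serve(self, requester, resource):
        tr = (self.node_id, requester, resource)   # TR(n1, n2, r1)
        self.favors_done.append(tr)
        return tr

    def should_redirect(self, p=0.5):
        """Above score_max always redirect; below score_min always serve;
        in between, redirect with a tunable probability p."""
        if self.score() > self.score_max:
            return True
        if self.score() < self.score_min:
            return False
        return random.random() < p

n1 = AltSeAlNode("n1")
tr = n1.serve("n2", "r1")
print("TR:", tr, "| score:", n1.score(), "| redirect?", n1.should_redirect())
```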

2.2 PLANES: Volunteers for Improved Network Architectures
Assuming knowledge of altruistic peers, PLANES exploits them to create an overlay structure that significantly speeds up the fundamental network operations. The following figure explains PLANES.

Figure 1: A simple example of PLANES with 300 nodes organized in 6 clusters

PLANES creates C clusters, each of size S nodes, with each cluster organized in a DHT (e.g., Chord) ring. In an N-node network, we can define S = z log N and C = N/S, for integer z. Since each Chord cluster in PLANES has log N nodes (for z = 1), routing within it requires only O(log log N) hops. For N = 1,000,000 and z = 1, there must be A = 50,000 (i.e., 5%) altruistic nodes, and routing requires about 5-6 hops, a speedup by a factor of about 3 compared to a plain Chord DHT network. Note that increasing the value of z reduces the required number of altruists linearly, while increasing hop counts only logarithmically. Furthermore, in the highly dynamic case where regular DHT performance degrades to O(N), this architecture ensures speedups of orders of magnitude!
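The hop-count arithmetic above is easy to reproduce (assuming base-2 logarithms, which the text does not state explicitly):

```python
import math

# Reproducing the PLANES hop-count arithmetic (z = 1, N = 1,000,000).
N, z = 10**6, 1
S = z * math.log2(N)     # cluster size ~ log N (about 20 nodes)
C = N / S                # clusters = altruists needed (one leader each)
print(f"altruists needed: {C:,.0f}  (~{100 * C / N:.0f}% of all nodes)")
print(f"intra-cluster routing: O(log log N) ~ {math.log2(S):.1f} hops")
print(f"plain Chord routing:   O(log N)     ~ {math.log2(N):.1f} hops")
```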

2.3 SeAled PLANES: Putting it All Together
AESOP requires the capability to find the IDs of a number of altruistic peers, asynchronously with respect to when they assumed such a status. In the context of AltSeAl, altruism can be expressed using the "altruism score"; for example, only peers with |F_d| / |F_o| > 2 may be deemed altruists.

Proofs of altruism are also needed. In AltSeAl, TRs of favors rendered or paid back (white records) can serve this purpose. Using these TRs, further auditing is possible, validating a peer's claims of altruism. Finally, AESOP needs a structure to manage altruists. We define a second DHT-based overlay for the altruists: the AltDHT. As soon as a node n is proved to be an altruist, the AddToAltDHT(n) routine is called. This routine:
- Is directed to the node n', responsible for maintaining the "complaints" for n.
- n' audits n, retrieving its white records and checking the locally stored black records for it.
- If the audit is successful (i.e., n's altruism score is higher than the system's altruism threshold), n' computes an altruist ID (e.g., by hashing the concatenation of the string "Altruist" and n's ID) and, using the DHT node-addition protocol, adds n to the AltDHT. If the audit fails, n' returns an error to n.
- Whenever a node is promoted to the AltDHT, it assumes special responsibilities (e.g., routing).
- When a peer loses its altruist status (e.g., its altruism score drops below the corresponding threshold), it is removed from the AltDHT using the DHT's node-deletion protocol.

Note that peers have a natural incentive not to cheat by staying in the AltDHT when they no longer wish to be altruists, since they receive extra load. In addition, peers in the AltDHT can perform random audits: periodically, they choose a random ID n from those in the AltDHT and calculate its altruism score. Peers that are discovered to cheat can be ejected!

Now, discovering altruists is straightforward. To discover k altruists we can, for example, compute a random ID, use the AltDHT to locate the node responsible for it, and then follow k successor pointers.

3. THE RANGEGUARD: VOLUNTEERS AND EFFICIENT RANGE QUERY PROCESSING
We assume that data stored in the P2P network is structured in a (k+m)-attribute relation R(a_1, ..., a_k, b_1, ..., b_m), where a_i, b_i are the attributes of R, with every tuple t in R being uniquely identified by a primary key key(t). This key can be either one of the attributes of the tuple, or can be calculated otherwise (e.g., based on the values of one or more of the attributes of t). Furthermore, the attributes a_i are used as single-attribute indices of t, with each a_i being characterized by the domain {v_min(a_i), v_max(a_i)} of its values. The proposed architecture can also support multi-attribute range queries, but this is outside the scope of this paper.

In the proposed framework we will use Chord, because of its simplicity, its popularity within the P2P community (which has led to its use in a number of P2P systems), but also because of its good performance (logarithmic in terms of routing hops and routing state), robustness, and fault-tolerance.

3.1 Data Insertion
A tuple t with key key(t) is inserted in the network and stored, using Chord's consistent hashing, at peer succ(key(t)). In addition, for every index attribute a_i(t) of t, with value v_i(t), we store in the network an index tuple I_i(t): {v_i(t), key(t)}. Therefore, for each tuple inserted in the network, we also store k index tuples. These index tuples are stored at nodes succ(h_i(v_i(t))), using the following mod-2^m order-preserving hash function (OPHF):

h_i(v_i) = ( s_0 + ((v_i - v_min(a_i)) / (v_max(a_i) - v_min(a_i))) · 2^m ) mod 2^m

where s_0 = hash(attribute_name), and hash() is the base hash function used by the underlying DHT (e.g., SHA-1). Deletions and updates of the original tuples are broadcast to the peers holding the index tuples. Note that, since the quantities v_min(a_i) and v_max(a_i) may differ for each attribute a_i, there is (possibly) a different h_i() for every attribute.

3.2 Query Processing
Given a range query (v_low(a_i), v_high(a_i)) for an attribute a_i, we proceed as follows: we apply our OPHF to v_low and send the query to peer p_l: succ(h_i(v_low)). By design (because of the order-preserving hashing), the requested index tuples are stored at the peers falling between p_l: succ(h_i(v_low)) and p_h: succ(h_i(v_high)), inclusive. Thus, every peer p_j, l ≤ j ≤ h, sends back to the requester any locally present index tuples that satisfy the query, and forwards the query to its successor node. This results in faster query processing.

If r values fall in the given range, at most r successor pointers will be followed, bringing the total hop count to O(log N + r): O(log N) hops to reach node p_l, plus r hops to reach all the other nodes. This compares very well to the O(r log N) hops that would be required to access each individual node in any basic DHT architecture. However, despite the above speedup, this method has a worst-case performance of O(N) routing hops, when r = O(N). Thus, we enhance our architecture using a number of special, powerful peers: the RangeGuards.
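A sketch of the OPHF in code (ours; m = 32 and the SHA-1 truncation are illustrative choices) shows the key property exploited by the query algorithm: relative to the shift s_0, larger attribute values map to larger ring positions, so a range maps to one contiguous arc of successors.

```python
import hashlib

M = 2 ** 32   # identifier space 2^m; m = 32 is an illustrative choice

def base_hash(s):
    """Stand-in for the DHT's base hash function (e.g. SHA-1), mod 2^m."""
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % M

def ophf(v, attr, vmin, vmax):
    """Order-preserving hash: shift by s0 = hash(attribute_name), then
    map the attribute domain [vmin, vmax] linearly onto the ring."""
    s0 = base_hash(attr)
    return int(s0 + (v - vmin) / (vmax - vmin) * M) % M

# Relative to the shift s0, larger values land at larger ring positions,
# so the index tuples for a range (vlow, vhigh) occupy one contiguous
# arc (possibly wrapping the ring), walkable via successor pointers.
s0 = base_hash("age")
for v in (18, 30, 65):
    print(v, "->", (ophf(v, "age", 0, 120) - s0) % M)
```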

3.3 The RangeGuard: Exploiting Node Heterogeneities
We form a second OP Chord ring, the RangeGuard ring, above the normal Chord ring, composed of powerful nodes burdened with extra routing state and functionality: the RangeGuards (Figure 1). Each such node is responsible for storing the index tuples placed at the nodes between its predecessor RangeGuard and itself.

Figure 1. The RangeGuard architecture

Each RG maintains routing information for both the Chord ring and the RangeGuard ring. Additionally, there is a direct link from each peer to the next RG in the normal Chord ring. In this setting, nodes in the lower Chord ring probe their RG (e.g., as part of the standard Chord stabilization process) and automatically send to it any updates regarding the index tuples they store.

3.4 Query Processing Using RangeGuards
With RangeGuards in the scene, a query (v_low(a_i), v_high(a_i)) on attribute a_i will be sent from the requesting node directly (in 1 hop) to the corresponding RG. After this point, the RangeGuards assume responsibility for gathering the requested information, using the OPHF algorithm described earlier, except that now all operations take place on the RangeGuard ring (Figure 2). With data placement on the lower ring being reflected on the RangeGuard ring, the requested index tuples will reside between RG_l: succ'(h_i(v_low)) and RG_h: succ'(h_i(v_high)), where succ'() is the Chord successor function for the RangeGuard ring.

Figure 2. Range query processing with and without the RangeGuard

This algorithm requires 1 routing hop to reach the RangeGuard ring, and as many routing hops as there are RangeGuards between RG_l and RG_h. Since, however, there will probably be far fewer RangeGuards in the system than there are nodes, and RangeGuards are more powerful (with respect to computing capacity and network bandwidth) than the average node in the system, this algorithm is significantly more efficient than the one presented earlier. Specifically, for M = √N RangeGuards, worst-case hop-count efficiency becomes O(√N).

4. CONCLUSIONS AND FUTURE WORK
We have outlined some of our efforts towards utilizing documented node behavior to offer traditional engineering qualities in very large scale data and resource sharing networks, through the creation of altruism-endowed self-organizing overlays.

5. REFERENCES
[1] Ratnasamy S., Francis P., Handley M., Karp R., Shenker S. A scalable Content-Addressable Network. In Proceedings of SIGCOMM 2001 (2001).
[2] Rowstron A., Druschel P. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In Middleware, vol. 2218 of Lecture Notes in Computer Science, Springer (2001), 329-350.
[3] Stoica I., Morris R., Karger D., Kaashoek K. F., Balakrishnan H. Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of SIGCOMM 2001 (2001), ACM Press, 149-160.
[4] Zhao Y. B., Kubiatowicz J., Joseph A. Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Tech. Rep. UCB/CSD-01-1141, University of California at Berkeley, Computer Science Department (2001).
[5] Gnutella, http://gnutella.wego.com/
[6] Limewire, http://www.limewire.org/
[7] Chawathe Y., Ratnasamy S., Breslau L., Lanham N., Shenker S. Making Gnutella-like P2P systems scalable. In Proceedings of SIGCOMM 2003 (2003).
[8] Adar E., Huberman B. A. Free riding on Gnutella. Technical report, Xerox PARC (2000).
[9] Wilcox-O'Hearn B. Experiences deploying a large-scale emergent network. In Proceedings of IPTPS '02 (2002).
[10] Douceur J. R. The Sybil attack. In Proceedings of IPTPS '02 (2002).


Evolutionary Games: What is the algorist's perspective?∗

Spyros Kontogiannis¹,² and Paul Spirakis¹
¹ Computer Technology Institute, Riga Feraiou 61, 26221 Patras, Greece. {kontog,spirakis}@cti.gr
² Department of Computer Science, University of Ioannina, 45110 Ioannina, Greece.

Abstract. Evolutionary Game Theory is the study of strategic interactions among large populations of agents who base their decisions on simple, myopic rules. A major goal of the theory is to determine broad classes of decision procedures which both provide plausible descriptions of selfish behaviour and include appealing forms of aggregate behaviour. For example, properties such as the correlation between strategies’ growth rates and payoffs, the connection between stationary states and Nash equilibria and global guarantees of convergence to equilibrium, are widely studied in the literature. In this paper we discuss some computational aspects of the theory, which we prefer to view more as Game Theoretic Aspects of Evolution than Evolutionary Game Theory, since the term “evolution” actually refers to strategic adaptation of individuals’ behaviour through a dynamic process and not the traditional evolution of populations. We consider this dynamic process as a self-organization procedure, which under certain conditions leads to some kind of stability and assures robustness against invasion.

1 Introduction

Classical game theory deals with a rational individual who is engaged in a given interaction (a “game”) with other individuals (its co-players) and has to adopt a strategy (for selecting among a set of allowable actions) that maximizes its own payoff. Of course each player’s payoff is dependent on the other players’ strategies for choosing their own actions. Evolutionary game theory deals with the entire (typically large) population of players, where all players are programmed to adopt some strategy. Strategies with high payoff (given the current state of the population) are expected to spread within the population (by learning, copying successful strategies, inheriting strategies, or even by infection). The frequencies of the strategies in the population thus change according to their payoffs, which in turn depend on all the players’ strategies and thus their frequencies in the population. The subject of evolutionary game theory is exactly the study of the dynamics of this feedback loop. A very good presentation of evolutionary game dynamics is in [3]. For a more thorough study the reader is referred to [1].

Numerous paradigms for modeling individual choice in a large population have been proposed in the literature. For example, if each agent chooses its own strategy so as to optimize its own payoff (the "one against all others" scenario) given the current population state (ie, the other agents' strategies), then the aggregate behaviour is described by the best-response dynamics [4]. If instead, at each step, an arbitrary user exchanges its strategy for any other strategy with a strictly better (but not necessarily the best) payoff, then the aggregate behaviour is described by the better-response dynamics, or Nash dynamics [6]. In the case where pairs of players are chosen at random and engage in a bimatrix game (the "one against one" scenario) whose payoff matrix determines, according to some rule, the gains of the strategies adopted by the two players, we refer to imitation dynamics, the most popular version of which is the replicator dynamics [8]. The proposed dynamical systems for describing evolutionary games, despite their appealing and highly intuitive definitions, have some strong weaknesses. For example, the best/better-response dynamics admit multiple solution trajectories from a single initial point, whose terminal (rest) points vary significantly in their aggregate performance. On the other hand, the replicator dynamics admit trajectories whose rest points are not necessarily Nash equilibria (eg, when all the users of a population choose exactly the same strategy, even if this strategy is strictly dominated, there is nothing to imitate; this is a general weakness of the imitation dynamics). A typical way out of such situations is to introduce small amounts of noise into the underlying models. For example, we may add (small) payoff perturbations to the optimization dynamic models, or we may add occasional arbitrary behaviour to the model of replication. Of course, such modifications create new difficulties: for the optimization models the rest points may now lie slightly away from the Nash equilibria, while the replication models may admit oscillation phenomena.
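To make the better-response (Nash) dynamics concrete, here is a toy sketch of ours on a two-link congestion game, a setting where strictly improving moves are guaranteed to terminate in a pure Nash equilibrium (the game and parameters are our illustrative choices):

```python
import random

# Better-response (Nash) dynamics sketch: 6 players, 2 links, and each
# player's cost is the load on its chosen link. Congestion games like
# this are potential games, so any sequence of strictly improving
# unilateral moves must terminate in a pure Nash equilibrium.
random.seed(0)
n_players = 6
choice = [random.randrange(2) for _ in range(n_players)]

def load(link):
    return sum(1 for c in choice if c == link)

moves = 0
while True:
    mover = next((i for i in range(n_players)
                  if load(1 - choice[i]) + 1 < load(choice[i])), None)
    if mover is None:
        break                    # no improving move: pure Nash equilibrium
    choice[mover] = 1 - choice[mover]
    moves += 1

print(f"reached a PNE after {moves} moves; loads = {load(0)}, {load(1)}")
```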

Since evolutionary game theory is a dynamical process (hopefully) ending up in some equilibrium, which may also demonstrate robustness against invasion or infection (stability), we consider it a self-organization process in a very large population of entities that adopt strategies representing computer programs trying to prevail in the market, competing network protocols, or viruses trying to spread all over the Internet.

∗ This work was partially supported by the EU within the Future and Emerging Technologies Programme under contract IST-2001-331116 (FLAGS) and within the 6th Framework Programme under contract 001907 (DELIS).


Thus the prime concern of an algorist is to determine the principal rules governing (global or local) convergence to stable states, to propose computationally efficient algorithms for constructing such stable states, to be able to compare the trajectories of two seemingly different laws of motion, or even to describe how the underlying infrastructure (eg, the network infrastructure in which a virus might spread) is involved in the evolution of a population. We note that our perspective concerns strategic evolution, ie, the evolution in time of the vector of the strategies' frequencies in the whole population, and not traditional evolution (under the biological viewpoint), where the sub-populations actually evolve in size.

2 Definitions and Notation

For any n ∈ IN, let [n] ≡ {1, 2, ..., n}, while P_n ≡ { z ∈ [0,1]^n : Σ_{j∈[n]} z_j = 1 } is the simplex of probability n-vectors. Assume a set N ≡ [n] of selfish, non-cooperative players, and n action sets (S_i ≡ {s_{i,1}, ..., s_{i,m_i}})_{i∈N} for them, where m_i = |S_i|. Each player i ∈ N may adopt either a pure strategy s_{i,j} ∈ S_i (ie, a fixed allowable action), or a mixed strategy p_i ∈ P_{m_i} ≡ { (z(s_{i,j}))_{j∈[m_i]} ∈ [0,1]^{m_i} : Σ_{j=1}^{m_i} z(s_{i,j}) = 1 } (ie, a probability distribution over its own action set). For simplicity of notation, let ∀i ∈ N, ∀p_i ∈ P_{m_i}, ∀j ∈ [m_i], p_i(s_{i,j}) = p_i(j). A set p = (p_1, ..., p_n) ∈ ×_{i∈N} P_{m_i} ≡ P of (in general mixed) strategies for all the users of the game is called a (mixed) strategies profile. For the special case where p corresponds to a set s ∈ ×_{i∈N} S_i ≡ S of pure strategies for all the players, we have a pure strategies profile, or a configuration, which is also represented by s. ∀i ∈ N, we denote by p_{-i} ≡ (p_1, ..., p_{i-1}, ·, p_{i+1}, ..., p_n) the mixed strategies profile of all the players but player i, while we define the operation p_{-i} ⊕ z = (p_1, ..., p_{i-1}, z, p_{i+1}, ..., p_n), ∀z ∈ P_{m_i}, ∀p_{-i} ∈ ×_{j≠i} P_{m_j} ≡ P_{-i}. Abusing notation a little, we will occasionally allow a pure strategy s_{i,j} ∈ S_i to be combined by ⊕ with p_{-i}, although we should instead use the corresponding (mixed) strategy z = e_j ∈ P_{m_i}, where e_j is the vector with 1 in its j-th position and 0 everywhere else. Consider now that each player i ∈ N has its own payoff function U_i : S → IR (this function can also be represented as a 2-dimensional matrix whose rows are labeled by the actions of S_i and whose columns are labeled by the combined actions of the remaining users, ie, elements of ×_{r≠i} S_r ≡ S_{-i}). We extend the utility functions to the domain P as follows:

∀i ∈ N, ∀p ∈ P:  U_i(p) = Σ_{s∈S} P(s, p) · U_i(s),

where P(s, p) ≡ Π_{i∈N} p_i(s_i) is the occurrence probability of configuration s under the mixed strategies profile p.

Nash Equilibrium (NE) [5]. A mixed strategies profile p ∈ P is a NE iff each player's strategy is a best response to the other players' strategies. That is, ∀i ∈ N,

p_i ∈ arg max_{z ∈ P_{m_i}} U_i(p_{-i} ⊕ z) ≡ BR_i(p_{-i}).  (1)

p is a strict NE iff ∀i ∈ N, {p_i} = BR_i(p_{-i}).

The Replicator Dynamics. Consider a (very large) population of individuals, each having exactly the same set N = [n] of possible types. Let x ∈ P_n be the population state, ie, the vector of proportions (or frequencies) of the n different types in the whole population¹. Consider the dynamical system where in each step two individuals of the population are chosen randomly and then engage in a symmetric bimatrix game whose (common) payoff function is described by the n × n matrix U. Then (Ux)_i is the expected payoff of a type-i user involved in a new game, while x^T U x is the average (expected) payoff of a random user (ie, the average payoff in the system) involved in a new game, wrt the current population state x. Suppose that the type-frequencies vector x is a vector of differentiable functions of time t (this requires that the population is infinitely large, or that the x_i's express the expected frequencies in an ensemble of populations) and postulate a law of motion x(t). Suppose that, for any type of strategies, its proportion in the next generation is related to the proportion of the same type in the current generation according to the following equation, called the replicator equation [8]:

∀i ∈ N:  ẋ_i = x_i ( (Ux)_i - x^T U x ).  (2)

Observe that this frequency-dependent fitness rule introduces a strategic aspect to evolution: more successful strategies (ie, those having an expected payoff strictly larger than the average expected payoff) increase their proportions in the population, while those with less than average expected payoff lose some of theirs. Note that the total change in frequencies is exactly 0; ie, although the frequencies of the types change, the population size remains the same. Assume now that the strategies possibly adopted by the individuals of the population represent m different mixed strategies on the set of n different types, (p(i) ∈ P_n)_{i∈[m]}. The expected payoff of a type p(i) against another type p(j) is p(i)^T U p(j), while for a given frequencies vector x ∈ P_m, the average (expected) payoff within the population is p(x)^T U p(x), where p(x) ≡ Σ_{i∈[m]} x_i p(i). The analogue of (2) for the law of motion of mixed strategies can be written as

∀i ∈ [m]:  ẋ_i = x_i (p(i) - p(x))^T U p(x).  (3)

         Heads   Tails
Heads    0, 1    1, 0
Tails    1, 0    0, 1

Fig. 1. The Matching Pennies game: the row player bets on different outcomes showing up, while the column player bets on the same outcomes showing up. For this game there is no PNE.

¹ Recall that x corresponds to (more than one) pure strategies profile for the individuals.
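Equation (2) is straightforward to integrate numerically. The sketch below uses forward Euler on the Hawk-Dove game (our illustrative choice of U, with V = 2 and C = 4), whose unique ESS is the mixed state (1/2, 1/2):

```python
import numpy as np

# Forward-Euler integration of the replicator equation (2),
# x_i' = x_i((Ux)_i - x^T U x), for the Hawk-Dove game (V=2, C=4).
U = np.array([[-1.0, 2.0],
              [0.0, 1.0]])

x = np.array([0.9, 0.1])      # initial population state (mostly Hawks)
dt = 0.01
for _ in range(5000):
    fitness = U @ x           # (Ux)_i: expected payoff of each type
    avg = x @ fitness         # x^T U x: population-average payoff
    x += dt * x * (fitness - avg)
    x /= x.sum()              # guard against numerical drift off the simplex

print("rest point:", x.round(4))   # converges to the ESS (0.5, 0.5)
```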

Evolutionary Stable Strategy (ESS) [7]. A (in general mixed) strategy p ∈ P_n is said to be evolutionarily stable (an ESS) if, for any other strategy q ∈ P_n \ {p}, the induced replicator equation describing the dynamics of the population consisting of these two types only (the proportion of residents using p is 1 - r and the proportion of invaders using q is r, for some 0 < r ≤ ε(q) ≪ 1) leads to the elimination of the invaders, as long as their initial frequency is sufficiently small. That is, p is an ESS iff ∀q ∈ P_n \ {p} the following two conditions hold:

(I) Equilibrium:  q^T U p ≤ p^T U p
(II) Stability:   q^T U p = p^T U p ⇒ q^T U q < p^T U q   (4)

Condition (I) demands that p is a Nash equilibrium of the bimatrix game determining the law of motion (no invader can do better than the resident against the resident), while condition (II) states that, in case an invader does equally well as the resident against the resident, then the resident must be strictly better than the invader against the invader. The following theorem provides a simple (yet inefficient, due to the possibly too many (mixed) strategies for the individuals) test of whether a specific strategy p is an ESS for a replicator equation:

Theorem 1 (Hofbauer, Schuster and Sigmund [2]). The strategy p is an ESS iff Π_{i∈[m]} x_i^{p(i)} is a strict local Lyapunov function² for the replicator equation, or equivalently, iff p^T U q > q^T U q for all q ≠ p in some neighbourhood of p.

² A function F(x) is a Lyapunov function if Ḟ(x) ≥ 0 for all x. It is a strict Lyapunov function if, additionally, equality holds only when x is a rest point.
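Conditions (I) and (II) also suggest a brute-force numerical test that samples mixed strategies q from the simplex. This is our sketch of such a check (a sampling heuristic, not the ESS-detection algorithm asked for below):

```python
import numpy as np

def is_ess(U, p, trials=10000, tol=1e-9, seed=0):
    """Sampling-based check of ESS conditions (I) and (II); a heuristic
    over random q from the simplex, not a decision procedure."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        q = rng.dirichlet(np.ones(len(p)))
        if np.allclose(q, p, atol=1e-4):
            continue                            # skip q too close to p
        qUp, pUp = q @ U @ p, p @ U @ p
        if qUp > pUp + tol:
            return False                        # (I) violated: p is not a NE
        if abs(qUp - pUp) <= tol and q @ U @ q >= p @ U @ q - tol:
            return False                        # (II) violated: q invades
    return True

# Hawk-Dove (V = 2, C = 4) again: p = (1/2, 1/2) is its unique ESS.
U = np.array([[-1.0, 2.0], [0.0, 1.0]])
print(is_ess(U, np.array([0.5, 0.5])))          # True (up to sampling)
```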

3 Computational aspects of game-theoretic evolution

Equilibrium Selection. The most popular notion of stability in game theory is the Nash equilibrium (NE) [5]. Although the notion of NE is quite natural, if we restrict ourselves to pure strategies we may not be able to reach a pure Nash equilibrium (PNE) at all; for example, in the Matching Pennies game shown in Fig. 1 there is no PNE. More importantly, even if we have PNE in a game, we may face a situation where multiple PNE exist, with different payoffs for the players and quite different aggregate performances. The problem then is, for a rational player, how to decide which of the several NE is the "right" one to settle upon. In this direction, numerous refinements of the space of (P)NE have been proposed in the literature. In fact, there are so many refinements that practically every NE may be shown to be the "right choice" of a proper refinement! There is a strong hope that evolutionary game theory will assist with this kind of choice. The reason is the systematic way in which evolution is modeled, as a kind of reasonable game between an individual and some other individuals (or even against all other individuals). The point is to be able to construct computationally efficient algorithms for quantifying the values of all the NE, or even for solving the corresponding optimization problem in the space of NE wrt a given objective function. Recall that for an evolutionary game the reachable NE are a subset of the rest points of the dynamics of this game, which in turn are the endpoints of (continuous in the limit) trajectories starting from some strategy initially adopted by (ie, prevailing among) the users. This should not be confused with the (computationally hard) task of determining the worst/best NE in a traditional game. Additionally, it would be extremely interesting to devise an algorithm for detecting the existence of an ESS in an evolutionary game, one that takes as input only the n × n matrix U determining the payoffs of the game and responds with YES/NO, or, even better, constructs an ESS in case of existence. Recall that an ESS is a NE of the corresponding bimatrix game with an additional stability property, which does not necessarily hold in a NE. Indeed, one can construct simple examples where a small payoff matrix admits no ESS at all (although the game has at least one NE). A similar computational problem is the description of a computationally efficient evolutionary process (ie, the proper rule for evolution) that leads as fast as possible to a rest point which is also a NE for the corresponding traditional game, or even an ESS for the evolutionary dynamics.

Bounded vs. Unbounded Rationality. A central assumption of traditional game theory is that each player adopts a kind of selfish behaviour, exploiting complete knowledge of the other players' decisions. For example, in order to be able to assign a cardinal utility function to individual players, one typically assumes that each player has a well-defined, consistent set of preferences over the set of "lotteries" over the outcomes which may result from individual choice. Since the number of different lotteries over outcomes is uncountably large, this requires each player to have a well-defined, consistent set of uncountably many preferences, which is typically considered infeasible. In many cases this is not a feasible situation, either because it is not computationally efficient to handle such an amount of information, or because each player only knows the actual strategies of the players in its own neighborhood. Despite the fact that traditional game theory does not deal with such situations, there is a strong hope that evolutionary game theory will eventually manage to successfully describe and predict the behaviour of such players, since it is better equipped to handle these weaker rationality assumptions while causing the same effects on players' behaviours in the long run as complete-knowledge traditional games. A typical example of such a scenario is when we wish to cut off the iteratively dominated strategies of the players. Depending on the model of evolution, we are able in some cases to ensure that all those strategies that a rational individual should never play in a traditional game will eventually vanish in the evolutionary version of the game. The computational point of view demands, again, the design of algorithms dealing with bounded rationality and achieving in polynomial time (possibly good approximations of) the same outcome as that expected in the corresponding traditional game.

Trajectory Prediction. Due to its nature, evolutionary game theory explicitly models the dynamics present in interactions among individuals in a population. One might try to capture the dynamics of the decision-making process in traditional game theory by modeling the game in its extensive (rather than its strategic) form. However, for most games of reasonable complexity, the extensive form of the game quickly becomes unmanageable. Moreover, in the extensive form of a game, traditional game theory represents an individual's strategy as a specification of what choice that individual would make at each information set in the game. A selection of strategy then corresponds to a selection, prior to game play, of what that individual will do at any possible stage of the game. This representation of strategy selection clearly presupposes hyperrational players and fails to represent the process by which a player observes its opponents' behaviours, learns from these observations, and makes a choice (best/better response, replication, imitation, etc.) based on what it has learned so far. We would like to be able to decide in polynomial time whether two seemingly different evolutionary dynamics actually produce the same trajectories, up to some kind of homeomorphism. This would enable a classification of evolution schemes into broader categories according to their orbital characteristics. Such a classification for replicator dynamics exists only when there are 2 or 3 distinct types of individuals in the population.

Imposing structural properties of a game into the evolutionary dynamics. Our last question deals with some new ways of interaction in the evolutionary dynamics of a game that also depict the special structure of the corresponding traditional game. For example, when a virus spreads in a network, the architecture of the network itself and the starting points of the virus in it should somehow affect the success of the virus. The game-theoretic models of evolution proposed so far in the literature mainly focus on the case where the individuals in a population collide with each other in a random fashion, i.e., the underlying "interaction" infrastructure is represented by a clique. What if this is not the case, and we instead have some special graph representing the interactions? We need new evolutionary models to capture such cases, models that somehow encode the structure of this graph in the dynamics via elementary properties (e.g., the connectivity or the expansion of the graph).

References

1. Cressman, R. Evolutionary Dynamics and Extensive Form Games. MIT Press, 2003.
2. Hofbauer, J., Schuster, P., Sigmund, K. A note on evolutionary stable strategies and game dynamics. Journal of Theoretical Biology, 81:609–612, 1979.
3. Hofbauer, J., Sigmund, K. Evolutionary game dynamics. Bulletin of the American Mathematical Society, 40(4):479–519, 2003.
4. Matsui, A., Gilboa, I. Social stability and equilibrium. Econometrica, 59:859–867, 1991.
5. Nash, J. F. Noncooperative games. Annals of Mathematics, 54:289–295, 1951.
6. Rosenthal, R. W. A class of games possessing pure-strategy Nash equilibria. International Journal of Game Theory, 2:65–67, 1973.
7. Maynard Smith, J., Price, G. The logic of animal conflict. Nature, 246:15–18, 1973.
8. Taylor, P. D., Jonker, L. Evolutionary stable strategies and game dynamics. Mathematical Biosciences, 40:145–156, 1978.


Grassroots Self-Management: A Modular Approach∗
Márk Jelasity, Alberto Montresor, Ozalp Babaoglu
Department of Computer Science, University of Bologna, Italy
e-mail: jelasity,montreso,[email protected]

Abstract

Traditionally, autonomic computing is envisioned as replacing the human factor in the deployment, administration and maintenance of computer systems that are ever more complex. Partly to ensure a smooth transition, the design philosophy of computer systems remains essentially the same; only autonomic components are added to implement functions such as monitoring, error detection, repair, etc. In this position paper we outline an alternative approach, which we call "grassroots self-management". While this approach is by no means a solution to all problems, we argue that recent results from fields such as agent-based computing, the theory of complex systems and complex networks can be efficiently applied to achieve important autonomic computing goals, especially in very large and dynamic environments. Unlike in traditional compositional design, desired properties like self-healing and self-optimization are not programmed explicitly but rather "emerge" from the local interactions among the system components. Such solutions are potentially more robust to failures, more scalable, and extremely simple to implement. We discuss the practicality of grassroots autonomic computing through the examples of data aggregation, topology management and load balancing in large dynamic networks.

1 Introduction

The desire to build fault-tolerant computer systems with an intuitive and efficient user interface has always been part of the research agenda of computer science. Still, the current scale and heterogeneity of computer systems is becoming alarming, especially because our everyday life has come to depend on such systems to an increasing degree. There is a general feeling in the research community that coping with this new situation, which emerged as a result of Moore's Law, the widespread adoption of the Internet, and computing becoming pervasive in general, requires radically new approaches to achieving seamless and efficient functioning of computer systems. Accordingly, more and more effort is devoted to tackling the problem of self-management. One of the most influential and widely publicized approaches is IBM's autonomic computing initiative, launched in 2001 [8].

∗ This work was partially supported by the Future & Emerging Technologies unit of the European Commission through Projects BISON (IST-2001-38923) and DELIS (IST-001907).


The term "autonomic" is a biological analogy referring to the autonomic nervous system. The function of this system in our body is to control "routine" tasks like blood pressure, hormone levels, heart rate, breathing rate, etc. At the same time, our conscious mind can focus on high-level tasks like planning and problem solving. The idea is that autonomic computing should do just the same: computer systems would take care of routine tasks themselves, while system administrators and users would focus on the actual task instead of spending most of their time troubleshooting and tweaking their systems.

Since the original initiative, the term has been adopted by the wider research community, although it is still strongly associated with IBM and, more importantly, with IBM's specific approach to autonomic computing. This is somewhat unfortunate, because the term autonomic would allow for a much deeper and more far-reaching interpretation, as we explain soon. In short, we should take seriously not only what the autonomic nervous system does but also how it does it. We believe that the remarkably successful self-management of the autonomic nervous system, and of biological organisms in general, lies exactly in the way they achieve this functionality. Ignoring the exact mechanisms and stopping at the shallow analogy at the level of function description misses some important possibilities and lessons that computer science can learn.

The traditional approach to autonomic computing is to replace human system administrators with software or hardware components that continuously monitor some subsystem assigned to them, forming so-called control loops [8] which involve monitoring, knowledge-based planning and execution. Biological systems, however, achieve self-management and control through entirely different, often fully distributed and emergent ways of processing information. In other words, the usual biological interpretation of self-management involves no manager and managed entities. There is often no subsystem responsible for self-healing or self-optimization; these properties simply follow from some simple local behavior of the components, typically in a highly non-trivial way. The term "self" is meant truly in a grassroots sense, and we believe that this fact might well be the reason for many desirable properties like extreme robustness and adaptivity, with the additional benefit of a typically very simple implementation.

There are a few practical obstacles in the way of the deployment of grassroots self-management. One is the entirely different and somewhat unnatural way of thinking

and the relative lack of understanding of the principles of self-organization and emergence [10]. Accordingly, trust delegation represents a problem: psychologically it is more relaxing to have a single point of control, an explicit controlling entity. In the case of the autonomic nervous system we cannot do anything but trust it, although probably many people would prefer to have more control, especially when things go wrong. Indeed, the tendency in engineering is to try to isolate and create central units that are responsible for a function. A good example is the car industry, which gradually places more and more computers into our cars that explicitly control the different functions, thereby replacing old and proven mechanisms that were based on some, in a sense, self-optimizing mechanism (like the carburetor), and so also sacrificing the self-healing and robustness features of these functions to some degree.

To exploit the power and simplicity of emergent behavior, yet to ensure that these mechanisms can be trusted and incorporated in systems in an informed manner, we believe that a modular paradigm is required. The idea is to identify a collection of simple and predictable services as building blocks and to combine them into arbitrarily complex functions and protocols. Such a modular approach presents several attractive features. Developers will be allowed to plug different components implementing a desired function into existing or new applications, being certain that the function will be performed in a predictable and dependable manner. Research may be focused on the development of simple and well-understood building blocks, with a particular emphasis on important properties like robustness, scalability, self-organization and self-management.

The goal of this position paper is to promote this idea by describing our preliminary experiences in this direction. Our recent work has resulted in a collection of simple and robust building blocks, which include data aggregation [6, 9], membership management [5], topology construction [4] and load balancing [7]. Our building blocks are typically no more complicated than a cellular automaton or a swarm model, which makes them ideal objects for research. Practical applications based on them can also benefit from a potentially more stable foundation and predictability, a key concern in fully distributed systems. Most importantly, they exhibit the desirable self-∗ properties naturally, without dedicated system components. In the rest of the paper, we briefly describe these components.

2 A Collection of Building Blocks

Under the auspices of the BISON project [1], our recent activity has been focused on the identification and development of protocols for several simple basic functions. The components produced so far can be informally subdivided into two broad categories: overlay protocols and functional protocols. An overlay protocol is aimed at maintaining application-layer, connected communication topologies over a set of distributed nodes. These topologies may constitute the basis for functional protocols, whose task is to compute a specific function over the data maintained at nodes. Our current bag of protocols includes: (i) protocols for organizing and managing structured topologies like superpeer-based networks, grids and tori (T-MAN [4]); (ii) protocols for building unstructured networks based on the random topology (NEWSCAST [5]); (iii) protocols for the computation of a large set of aggregate functions, including maximum and minimum, average, sum, product, geometric mean, variance, etc. [6, 9]; and (iv) a load balancing protocol [7].

The relationships between overlay and functional protocols may assume several different forms. Topologies may be explicitly designed to optimize the performance of a specific functional protocol (this is the case of NEWSCAST [5], used to maintain a random topology for aggregation protocols). Or, a functional protocol may be needed to implement a specific overlay protocol (in superpeer networks, aggregation can be used to identify the set of superpeers).

All the protocols we have developed so far are based on the gossip-based paradigm [2, 3]. Gossip-style protocols are attractive since they are extremely robust to both computation and communication failures. They are also extremely responsive and can adapt rapidly to changes in the underlying communication structure, just by their nature, without extra measures.

The skeleton of a generic gossip-based protocol is shown in Figure 1. Each node possesses a local state and executes two different threads. The active one periodically initiates an information exchange with a randomly selected peer node, by sending a message containing the local state and waiting for a response from the selected node. The passive one waits for messages sent by an initiator and replies with its local state. Method UPDATE builds a new local state based on the previous local state and the state received during the information exchange. The output of UPDATE depends on the specific function implemented by the protocol. The local states at the two peers after an information exchange are not necessarily the same, since UPDATE may be nondeterministic or may produce different outputs depending on which node is the initiator.

    (a) active thread:
        do forever
            wait(T time units)
            p ← GETPEER()
            send s to p
            sp ← receive(p)
            s ← UPDATE(s, sp)

    (b) passive thread:
        do forever
            sp ← receive(*)
            send s to sender(sp)
            s ← UPDATE(s, sp)

Figure 1: The skeleton of a gossip-based protocol. Notation: s is the local state, sp is the state of the peer p.

Even though our system is not synchronous, it is convenient to talk about cycles of the protocol, which are simply consecutive wall-clock intervals during which every node has its chance of performing an actively initiated information exchange. In the following we describe the components; Figure 2 illustrates the dependence relations between them, as described in the text.

Figure 2: Dependence relations between the components mentioned in the paper (load balancing, T-Man, aggregation, newscast).
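As an illustration only, the skeleton of Figure 1 could be rendered in Python roughly as follows; the class and method names are ours, network I/O is replaced by direct method calls, and UPDATE is left abstract since it is protocol-specific.

    import random
    import time

    class GossipNode:
        def __init__(self, state, peers, period=1.0):
            self.state = state      # the local state s
            self.peers = peers      # candidate peers (source for GETPEER)
            self.period = period    # the cycle length T

        def get_peer(self):
            # Stand-in for GETPEER(): a uniformly random peer.
            return random.choice(self.peers)

        def update(self, s, sp):
            # Stand-in for UPDATE(s, sp); overridden per protocol.
            raise NotImplementedError

        def active_cycle(self):
            # Active thread: periodically push state and pull a reply.
            while True:
                time.sleep(self.period)
                p = self.get_peer()
                sp = p.on_receive(self.state)   # "send s to p; receive sp"
                self.state = self.update(self.state, sp)

        def on_receive(self, sp):
            # Passive thread body: reply with the local state, then update.
            s = self.state
            self.state = self.update(self.state, sp)
            return s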

2.1 Newscast

In NEWSCAST [5], the state of a node is given by a partial view, which is a set of peer descriptors with a fixed size c. A peer descriptor contains the address of the peer, along with a timestamp corresponding to the time when the descriptor was created. Method GETPEER returns an address selected randomly among those in the current partial view. Method UPDATE merges the partial views of the two nodes involved in an exchange and keeps the c freshest descriptors, thereby creating a new partial view. New information enters the system when a node sends its partial view to a peer: in this step, the node always inserts its own, newly created descriptor into the partial view. Old information is gradually and automatically removed from the system and gets replaced by new information. This feature allows the protocol to "repair" the overlay topology by forgetting dead links, which by definition do not get updated because their owner is no longer active.

In NEWSCAST, the overlay topology is defined by the content of the partial views. We have shown in [5] that the resulting topology has a very low diameter and is very close to a random graph with out-degree c. According to our experimental results, choosing c = 20 is already sufficient for very stable and robust connectivity. We have also shown that, within a single cycle, the number of exchanges per node can be modeled by a random variable with the distribution 1 + Poisson(1). The implication of this property is that no node is more important (or overloaded) than others.
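For concreteness, the Newscast UPDATE step could look as follows in Python; this is a sketch under our own representation of descriptors as (address, timestamp) pairs.

    def newscast_update(view_a, view_b, c):
        # Merge two partial views, keep one descriptor per address
        # (the freshest), then keep the c freshest descriptors overall.
        freshest = {}
        for address, timestamp in view_a + view_b:
            if address not in freshest or timestamp > freshest[address]:
                freshest[address] = timestamp
        merged = sorted(freshest.items(), key=lambda d: d[1], reverse=True)
        return merged[:c]

Before sending its view, a node would also append its own newly created descriptor, e.g. view.append((my_address, now())), which is how fresh information enters the system.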

2.2 T-Man

Another component is T-MAN [4], a protocol for creating a large set of topologies. The idea behind the protocol is very similar to that of NEWSCAST. The difference is that instead of using the creation date (freshness) of descriptors, T-MAN applies a ranking function that ranks any set of nodes according to increasing distance from a base node. Method GETPEER returns neighbors with a bias towards closer ones and, similarly, UPDATE keeps peers that are closer, according to the ranking. Figure 3 illustrates the protocol as it constructs a torus topology. In [4] it was shown that the protocol converges in logarithmic time even for network sizes as large as 2^20, and for other topologies as well, including the ring and binary tree topologies. With the appropriate ranking function, T-MAN can also be applied to sort a set of numbers. T-MAN relies on another component for generating an initial random topology, which is later evolved into the desired one; in our case this service is provided by NEWSCAST.
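The corresponding T-MAN UPDATE differs from Newscast's only in the ordering criterion. Below is a sketch with an illustrative ranking function of our own, for a ring over integer node identifiers; it is not the paper's implementation.

    def tman_update(view_a, view_b, c, rank):
        # Keep the c nodes closest to the base node, according to
        # rank(node) -> distance (smaller means closer).
        candidates = set(view_a) | set(view_b)
        return sorted(candidates, key=rank)[:c]

    def ring_distance(base, n_nodes):
        # Example ranking function for a ring topology.
        return lambda node: min((node - base) % n_nodes,
                                (base - node) % n_nodes)

    # e.g. tman_update(va, vb, 20, ring_distance(base=0, n_nodes=2**20))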

2.3 Gossip-Based Aggregation

In the case of gossip-based aggregation [6, 9], the state of a node is a numeric value. In a practical setting, this value can be any attribute of the environment, such as the load or the storage capacity. The task of the protocol is to calculate an aggregate value over the set of all numbers stored at nodes. Although several aggregate functions may be computed by our protocol, in this paper we concentrate on the average function.

In order to work, this protocol needs an overlay protocol that provides an implementation of method GETPEER. Here we assume that this service is provided by NEWSCAST, but any other overlay could be used. To compute the average, method UPDATE(a, b) must return (a + b)/2. After one state exchange, the sum of the values maintained by the two nodes does not change, since they have just balanced their values. So the operation does not change the global average either; it only decreases the variance over all the estimates in the system.

In [6] it was shown that if the communication topology is not only connected but also sufficiently random, at each cycle the empirical variance computed over the set of values maintained by nodes is reduced by a factor whose expected value is 2√e. Most importantly, this result is independent of the size of the network, showing the extreme scalability of the protocol. In addition to being fast, our aggregation protocol is also very robust. Node failures may perturb the final result, as the values stored in crashed nodes are lost; but both analytical and empirical studies have shown that this effect is generally marginal [9]. As long as the overlay network remains connected, link failures do not modify the final value; they only slow down the aggregation process.
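The averaging step itself is one line; the following sketch also shows its variance-shrinking effect on a toy population. Pairings are chosen at random here, which only approximates the peer selection that NEWSCAST would provide.

    import random
    import statistics

    def avg_update(a, b):
        # UPDATE(a, b) for the average: both peers adopt the mean,
        # preserving the global sum and shrinking the variance.
        m = (a + b) / 2.0
        return m, m

    values = [random.random() for _ in range(1000)]
    for cycle in range(10):
        random.shuffle(values)
        for i in range(0, len(values) - 1, 2):
            values[i], values[i + 1] = avg_update(values[i], values[i + 1])
        print(cycle, statistics.pvariance(values))   # variance drops each cycle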

2.4 A Load-Balancing Protocol

The problem of load balancing is similar, to a certain extent, to the problem of aggregation. Each node has a certain amount of load, and the nodes are allowed to transfer portions of their load between themselves. The goal is to reach a state where each node has the same amount of load. To this end, nodes can make decisions for sending or receiving load based only on locally available information. Differently from aggregation, however, the amount of load that can be transferred in a given cycle is bounded: the transfer of a unit of load may be an expensive operation.

Figure 3: Illustrative example of T-MAN constructing a torus over 50 × 50 = 2500 nodes, starting from a uniform random topology with c = 20 (snapshots after 3, 5, 8 and 15 cycles). For clarity, only the nearest 4 neighbors (out of 20) of each node are displayed.

In our present discussion, we use the term quota to identify this bound, and we denote it by Q. Furthermore, we assume that the quota is the same at each node. A simple, yet far from optimal, idea for a completely decentralized algorithm could be based on the aggregation mechanism illustrated above. Periodically, each node contacts a random node among its neighbors. The loads of the two nodes are compared; if they differ, a quantity q of load units is transferred from the node with more load to the node with less load, where q is bounded by the quota Q and by the quantity of load units needed to balance the nodes. If the network is connected, this mechanism will eventually balance the load among all nodes. Nevertheless, it fails to be optimal with respect to load transfers. The reason is simple: if the loads of two nodes are both higher than the average load, transferring load units from one to the other is useless. Instead, they should contact nodes whose load is smaller than the average, and perform the transfer with them.

Our load-balancing algorithm is based exactly on this intuition. The nodes obtain an estimate of the current average load through the aggregation protocol described above. This estimate is the target load; based on its value, a node may decide if it is overloaded, underloaded, or balanced. Overloaded nodes contact their underloaded neighbors in order to transfer their excess load, and underloaded nodes contact their overloaded neighbors to perform the opposite operation. Nodes that have reached the target load stop participating in the protocol. Although this was a simplified description, it is easy to see that this protocol is optimal with respect to load transfer, because each node transfers exactly the amount of load needed to reach its target load. As we show in [7], the protocol is also optimal with respect to speed under some conditions on the initial load distribution.
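A sketch of the per-contact decision rule just described; the function name and the single shared target (the aggregation-based average estimate) are our own simplifications.

    def transfer_amount(load, neighbor_load, target, Q):
        # Amount this node should send to the neighbor (negative means
        # receive). Transfers happen only between an overloaded and an
        # underloaded node, and never exceed the quota Q.
        excess = load - target
        neighbor_excess = neighbor_load - target
        if excess > 0 and neighbor_excess < 0:
            return min(excess, -neighbor_excess, Q)
        if excess < 0 and neighbor_excess > 0:
            return -min(-excess, neighbor_excess, Q)
        return 0.0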

3 Conclusions

In this paper, we presented examples of simple protocols that exhibit self-managing properties without any explicit management components or control loops; in short, without increased complexity. We argued that a modular approach might be the way towards efficient deployment of such protocols in large distributed systems. To validate our ideas, we briefly presented gossip-based protocols as possible building blocks: topology and membership management (T-MAN and NEWSCAST), aggregation, and load balancing.

References

[1] The BISON Project. http://www.cs.unibo.it/bison.
[2] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry. Epidemic algorithms for replicated database management. In Proceedings of the 6th Annual ACM Symposium on Principles of Distributed Computing (PODC'87), pages 1–12, Vancouver, August 1987. ACM.
[3] P. T. Eugster, R. Guerraoui, A.-M. Kermarrec, and L. Massoulié. From epidemics to distributed computing. IEEE Computer. To appear.
[4] M. Jelasity and O. Babaoglu. T-Man: Fast gossip-based construction of large-scale overlay topologies. Technical Report UBLCS-2004-7, University of Bologna, Department of Computer Science, Bologna, Italy, May 2004.
[5] M. Jelasity, W. Kowalczyk, and M. van Steen. Newscast computing. Technical Report IR-CS-006, Vrije Universiteit Amsterdam, Department of Computer Science, Amsterdam, The Netherlands, November 2003.
[6] M. Jelasity and A. Montresor. Epidemic-style proactive aggregation in large overlay networks. In Proceedings of the 24th International Conference on Distributed Computing Systems (ICDCS 2004), pages 102–109, Tokyo, Japan, 2004. IEEE Computer Society.
[7] M. Jelasity, A. Montresor, and O. Babaoglu. A modular paradigm for building self-organizing peer-to-peer applications. In Engineering Self-Organising Systems, number 2977 in Lecture Notes in Artificial Intelligence, pages 265–282. Springer, 2004.
[8] J. O. Kephart and D. M. Chess. The vision of autonomic computing. IEEE Computer, 36(1):41–50, January 2003.
[9] A. Montresor, M. Jelasity, and O. Babaoglu. Robust aggregation protocols for large-scale overlay networks, 2004. Available as http://www.cs.unibo.it/techreports/2003/2003-16.pdf; to appear in the proceedings of DSN 2004.
[10] J. M. Ottino. Engineering complex systems. Nature, 427:399, January 2004.

Self-made systems
Maarten van Steen, Spyros Voulgaris, Elth Ogston, Frances Brazier
Vrije Universiteit Amsterdam

1 Introduction

Self-* systems are designed to adapt to a changing environment such that specific properties are automatically restored when that environment disturbs the normal behavior. We also see that designs generally have several global parameters, which are subsequently configured for specific applications or domains. The choice of a parameter value often affects the emergent properties of a self-* system. Parameters may need to be carefully tuned in order to obtain the desired emergent behavior. Ideally, the parameter value can be found by means of a feedback loop based on monitoring system behavior. This approach effectively transforms a parameter into just another system variable and has also been referred to as self-tuning in [4]. We take the position that in designing self-* systems we need to strive for parameterless designs, which we denote as self-made systems.¹ In this paper, we discuss two different examples of systems that we are currently developing to see how parameter choices affect prominent emergent behavior and how their influence can be minimized.

¹ Main Entry: self-made. Function: adjective. Description: made such by one's own actions; especially: having achieved success or prominence by one's own efforts ("a self-made man").

2 Unstructured overlays

Consider a dynamically changing collection of nodes N that jointly maintain an overlay network. Each node n has a list V_n of c (neighbor, hop count) pairs, referred to as its view. As N may change over time, nodes communicate to update their view. To this end, each node n repeatedly executes the following exchange protocol (let N_t denote the set of nodes at the current time t):

1. Randomly select a peer m with (m, k) ∈ V_n. If m ∉ N_t, repeat with V_n ← V_n \ {(m, k)}.
2. V_n ← V_n ∪ {(n, 0)}.
3. Send V_n to m; receive V_m from m; set V_m ← {(p, k + 1) : (p, k) ∈ V_m}.
4. V_n ← V_n ∪ V_m, such that no node is listed more than once.
5. V_n ← V_n restricted to the c entries with the lowest hop count.

The contacted node m executes all but the first step as well.
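A compact Python sketch of one active round of this protocol; the dictionary representation of views, the send_and_receive network helper, and the tie rule in step 4 (keep the lower hop count) are assumptions of ours, and error handling is elided.

    import random

    def exchange(view, n, live_nodes, c, send_and_receive):
        # view: dict {node: hop_count} of size at most c; n: this node.
        # Step 1: pick a live peer, discarding entries for dead nodes.
        m = random.choice(list(view))
        while m not in live_nodes:
            del view[m]
            m = random.choice(list(view))
        # Step 2: insert ourselves with hop count 0.
        view[n] = 0
        # Step 3: swap views; received hop counts are incremented.
        reply = send_and_receive(m, view)
        view_m = {p: k + 1 for p, k in reply.items()}
        # Step 4: merge so that no node is listed more than once
        # (keeping the lower hop count is one reasonable tie rule).
        for p, k in view_m.items():
            if p not in view or k < view[p]:
                view[p] = k
        # Step 5: keep the c entries with the lowest hop count.
        return dict(sorted(view.items(), key=lambda e: e[1])[:c])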





93



In the original version of this protocol, called Newscast [3], each node executes the exchange protocol once every ∆T time units. As it turns out, with c = 20, networks as large as 100,000 nodes remain connected, regardless of the initial topology. Moreover, these networks have been demonstrated to be highly robust.

There are several design parameters that influence the emergent behavior of the protocol, of which the two most prominent ones are the view size c and the cycle length ∆T. (A discussion and evaluation of other parameters can be found in [2].) The choice of c affects the connectivity of the network, as well as properties such as clustering and others related to complex networks [1], but has not been found critical in the sense that small changes lead to very different behavior.

More interesting, in this respect, are the conflicting factors that affect the choice of the cycle length. The cycle length determines the rate at which views are exchanged, and thus the speed at which changes in the set of nodes are detected. In other words, a small cycle length is required to keep a rapidly changing set of nodes up to date, whereas it would be overkill, in terms of processor and network resources, for a set of nodes changing at a significantly lower rate. In the extreme scenario of a non-changing overlay, exchanging lists is merely useless. Moreover, the value for ∆T has to be chosen such that all nodes can execute and complete the exchange protocol. That is, ∆T is bound to a minimum value that depends on the slowest node and the minimum internode communication speed across all pairs of nodes.

Figure 1: In-degree distribution when 10% of the nodes run (a) twice as fast, and (b) ten times as fast as the rest, respectively.

An alternative is therefore to adapt ∆T so that it may vary in the course of time, and moreover is no longer a global value but simply local to each node. We are thus confronted with designing a system in which each node should be allowed to decide locally how often it initiates the exchange protocol, and when and how it changes its view-exchange rate. Turning the cycle length from a global parameter into a local one has obvious benefits, but it also has severe effects on the emergent behavior of the overall system. For example, without taking any countermeasures the network will rapidly partition into many clusters. A promising solution is to modify the merge operation such that properties such as connectivity become invariant [6]. Even in this case, though, turning the cycle length into a local parameter can have undesirable effects on the quality of the overlay formed. We observed that non-uniform cycle lengths among nodes result in an unbalanced in-degree distribution. Figure 1 shows the in-degree distribution for an experiment where 10% of the nodes run at

a faster speed than the rest, namely 2 times faster in 1(a), and 10 times faster in 1(b). Fast nodes tend to have respectively 2 or 10 times higher in-degree than the nodes running at normal speed.

Let us take a closer look at this example. In general, fast nodes run at a speed speed_f, and slow ones run at speed_s. Since the in-degree of a node increases by one each time it shuffles, the expected in-degree ratio between fast and slow nodes will be proportional to their shuffling speed ratio, that is,

    indegree_f / indegree_s = speed_f / speed_s.

Also, the sum of in-degrees of all nodes is equal to the total number of outgoing links, that is, N·c. If the fraction of fast nodes is f and, therefore, the fraction of slow ones is 1 − f, we have:

    f · indegree_f + (1 − f) · indegree_s = c.

From these formulas we can compute the expected in-degrees of fast and slow nodes to be indegree_f = speed_f · A and indegree_s = speed_s · A, where

    A = c / (f · speed_f + (1 − f) · speed_s).

Our formula suggests that with c = 20, the in-degrees of a fast and a slow node are 36.36 and 18.18 for 1(a), respectively, and 105.26 and 10.53 for 1(b).
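A quick numeric check of these formulas (plain Python) reproduces the values quoted above:

    def expected_indegrees(c, f, speed_f, speed_s):
        # A = c / (f*speed_f + (1-f)*speed_s); in-degree = speed * A
        A = c / (f * speed_f + (1 - f) * speed_s)
        return speed_f * A, speed_s * A

    print(expected_indegrees(20, 0.10, 2, 1))    # (36.36..., 18.18...), case (a)
    print(expected_indegrees(20, 0.10, 10, 1))   # (105.26..., 10.52...), case (b)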

Furthermore, we have found that by simply overdimensioning the view size, while exchanging as few as only 2 randomly selected items from the cache, many desirable properties can be retained. This approach shifts the problem to membership management, where the joining or leaving of a node should at the very least restore the original graph. Where joining appears to be relatively simple, parameterless scalable detection of failing nodes remains a challenge. We are currently investigating under which circumstances the joining of a node can trigger enough events to also detect failed nodes. Such a scheme would prevent having to use a separate heartbeat protocol (with its inevitable probe interval).

3 Decentralized data clustering

As another example, consider a network of agents, each representing a data item. Agents proactively construct links whose length reflects the semantic proximity of their respective data items. This approach effectively leads to a collection of graphs, each graph connecting semantically related agents and thus forming a data cluster. The details of this decentralized data clustering scheme are described in [5], where we also demonstrate that the quality of clustering is competitive with well-known centralized approaches.

In the original algorithm, we used two parameters to steer the clustering. First, we feed the agents with the maximum length of a good link, λ. Links above this length are considered bad and as such should not be used to place two agents in the same cluster. Second, as in many data clustering approaches, data clusters were not allowed to grow beyond a certain maximum size s. Such a maximum is necessary to prevent ending up with only a single giant cluster. Again, we see examples of application-dependent, global parameters that should be avoided.

Eliminating a fixed value for λ turned out to be relatively easy in the case where an agent's data represents a point in a 2-dimensional Euclidean space (the situation we have investigated extensively so far). The essence of our approach is that we let agents learn an appropriate value for λ, as follows:

0. init: λ ← 0.
1. recording phase: Watch a series of 50 links, recording the shortest length d. If during the recording phase a link is seen that is shorter than λ, go to step 0. Otherwise, go to step 2.
2. update phase: When a link is rejected (i.e., its two agents are not put into the same cluster on account of their link), λ ← λ + d/100. Go to step 1 when a link is encountered with length l ≤ λ, or when λ ≥ d. Restart step 2 when a link is encountered with length l < d, setting d ← l.
3. match found: Whenever a new match with length l is made, set λ ← l and go to step 1.

With these adjustment rules, we have been able to let agents discover the correct value for λ. As a result, λ is no longer a design parameter but can be considered as another system variable that is optimized during runtime, in this case by means of a simple learning procedure.

Removing the maximum cluster size is a much more difficult problem. Rather than pessimistically preventing clusters from growing too large, we have chosen to allow clusters to grow to a point where it may be necessary to split them again. To this end, the links in a cluster are ordered by their length. A crucial observation is that when a cluster should be split, this series will generally show a pronounced gap. (Note that such a gap will generally occur at the beginning of a series.) To increase the accuracy of gap detection, each time a link is added between two agents that leads to a cycle, we remove the longest link in that cycle, effectively aiming at the construction of a minimal spanning tree. As a result, the removal of any link will split a cluster into two.

The gap can be found by considering the second derivative of the series: f(x) = y2 − 2y1 + y0, where y0, y1, y2 are consecutive lengths and x is the position in the series of y0. However, taking a constant-valued threshold to determine whether or not a gap is large appears to be dependent on the data. To reduce this dependency, we compute the standard deviation σ of the series f(x). This value should account for the "normal" variation that we could expect in the link lengths of a cluster. Figure 2 shows how this approach works in the case of a good and a bad cluster, respectively.

Figure 2: Example of detecting (a) good and (b) bad clusters using a "cluster separator" γ.

From this information we then decide on a reasonable value γ for taking further decisions on whether a gap actually indicates that the corresponding link should be removed from the cluster (thus always splitting the original cluster into two parts). For our data sets, we experimentally found that setting γ = 7σ does a wonderful job. However, although we can show that our approach is less sensitive to the type of data that is being clustered than many others, it is also clear that it cannot be easily generalized to handle arbitrary data sets. In this case, we are gradually reaching a point at which we will have to conclude that a fully self-made system for decentralized data clustering may be impossible. Instead, we will have to separate application domains and find reasonable criteria within each domain for splitting clusters.
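A sketch of the gap test just described; the list representation and function name are ours, and the rule simply flags the first position whose second derivative exceeds γ = 7σ.

    import statistics

    def find_gap(sorted_lengths, gamma_factor=7.0):
        # Second derivative of the length series: f(x) = y2 - 2*y1 + y0.
        ys = sorted_lengths
        f = [ys[i + 2] - 2 * ys[i + 1] + ys[i] for i in range(len(ys) - 2)]
        if len(f) < 2:
            return None
        gamma = gamma_factor * statistics.stdev(f)   # "cluster separator"
        for x, fx in enumerate(f):
            if fx > gamma:
                return x   # split the cluster at this link position
        return None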

4 Conclusions

These two examples each illustrate the importance of striving for parameterless designs, but also that the road to such designs is often not evident. In general, we doubt that it is possible to develop completely self-made systems, as in the case of decentralized data clustering, where the semantics of the data may need to be taken into consideration. However, the first example shows that initial design decisions can be replaced by alternatives that lead to an improvement of the original system.

5 Acknowledgments

The work described in this paper is carried out in close collaboration with others. Special thanks go to Daniela Gavidia Simonetti, Márk Jelasity, Wojtek Kowalczyk, and Benno Overeinder.

References

[1] R. Albert and A.-L. Barabasi. "Statistical Mechanics of Complex Networks." Reviews of Modern Physics, 74(1):47–97, Jan. 2001.
[2] M. Jelasity, W. Kowalczyk, and M. van Steen. "Newscast Computing." Technical Report IR-CS-006, Vrije Universiteit Amsterdam, Department of Computer Science, 2003.
[3] M. Jelasity and M. van Steen. "Large-Scale Newscast Computing on the Internet." Technical Report IR-503, Vrije Universiteit, Department of Computer Science, Oct. 2002.
[4] R. Mahajan, M. Castro, and A. Rowstron. "Controlling the Cost of Reliability in Peer-to-Peer Overlays." In Second Int'l Workshop on Peer-to-Peer Systems, volume 2735 of Lect. Notes Comp. Sc., pp. 21–32. Springer-Verlag, Berlin, Feb. 2003.
[5] E. Ogston, B. Overeinder, M. van Steen, and F. Brazier. "A Method for Decentralized Clustering in Large Multi-Agent Systems." In Proc. Second Int'l Joint Conf. on Autonomous Agents and Multiagent Systems, July 2003. ACM Press, New York, NY.
[6] A. Stavrou, D. Rubenstein, and S. Sahu. "A Lightweight, Robust P2P System to Handle Flash Crowds." IEEE J. Selected Areas Commun., 22(1):6–17, Jan. 2004.

HP Labs' Complex Adaptive Systems Group Research Overview
Andrew Byde, Dave Cliff & Matthew Williamson

The Complex Adaptive Systems (CAS) group at HP Labs, Bristol [1], was formed in November 2001 to study the science and engineering of complex, dynamic, parallel, distributed, adaptive systems, insofar as they are relevant to HP's present and future business strategy. Such systems can exhibit highly desirable characteristics of resilient self-organisation, self-regulation, self-healing, and adaptation over multiple spatial and temporal scales, and can also be used for complex optimisation and automated-design tasks. CAS aims to grow world-class expertise for HP in the theory and practice of artificial systems that exhibit these desirable characteristics. Of our current projects, those that are particularly relevant to the question of self-maintaining systems are the project investigating market-based resource allocation for Utility Data Centres, and the projects exploring throttling of malicious code (e.g. viruses) and unwanted data (e.g. spam).

Virus Throttling

We are investigating benign and biologically-inspired methods for ameliorating the effects of "malware" such as computer viruses and worms. The basic problem is well known: telling the difference between malicious and benign messages/behaviour is very difficult for computers. The current typical systemic mode of response to infection is for the computer to prevent the spread of anything whose signature is known, and to wait for humans to recognize and describe the signatures of any new threats that emerge. The evident problem with this approach is that the human response loop in question works on timescales of days or weeks, far too long to be effective. By the time a system manager has discovered why her computers have gone down and an effective patch or signature to distribute to others has been designed, the virus has typically done its work already, spreading itself to thousands or millions of other computers. What is needed is a way for computers to "look after themselves".

The "throttling" solution to this problem is simple but effective [2]. The approach relies on the observation that the normal patterns of network traffic (messages, packets) on many protocols are quite different from the traffic generated by a spreading virus, with the virus contacting many different machines at a high rate. To limit propagation, a rate-limiter or virus throttle is enabled that does not affect normal traffic but quickly slows and stops viral traffic. The approach prevents an infected machine from spreading the virus further, although it does not prevent the machine from being infected in the first place. Thus the method limits attacks at the system level, not at the individual machine level, by restricting computers so that they can only spread the infection at an extremely low rate. This directly addresses the two ways that viruses cause damage: fewer machines spreading the virus will reduce the number of machines infected and reduce the traffic generated by the virus.

The throttle limits the rate at which a machine can interact with different machines. A machine is determined to be "different" if its address is not contained in a short history list maintained by the throttle. If the address is in the list, the message is passed without delay; but if it is not, the message is queued. The queue is serviced regularly (e.g. once per second), removing messages, sending them and updating the history. The queue mechanism thus ensures, for example, that the machine can interact with at most one new machine per second.

Since most normal traffic is at a low rate and to a slowly varying set of machines, this rate limiter (when set with suitable values for parameters such as the rate limit and the size of the history list) will not affect normal traffic greatly. However, if a virus were to attempt to spread faster than allowed, it would be forced to spread only at the allowed maximum rate. In practice, worms and viruses attempt to connect to hundreds of machines per second, so the pending queue gets large very quickly. This can be easily detected and the further propagation of the virus stopped completely. These techniques have been implemented for both IP traffic [3], [2] and for email [4]. Work is currently underway applying the idea to instant messaging [5].
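A sketch of such a throttle in Python; the class and method names are ours, and the delivery call is a stub for the real network send.

    from collections import deque

    class VirusThrottle:
        def __init__(self, history_size=4):
            self.history = deque(maxlen=history_size)  # recently seen addresses
            self.pending = deque()                     # delayed messages

        def send(self, address, message):
            if address in self.history:
                self.deliver(address, message)          # known address: no delay
            else:
                self.pending.append((address, message)) # new address: queue it

        def tick(self):
            # Serviced regularly (e.g. once per second): release one queued
            # message; a rapidly growing queue signals virus-like behaviour.
            if self.pending:
                address, message = self.pending.popleft()
                self.history.append(address)
                self.deliver(address, message)
            return len(self.pending)   # queue length as a detection signal

        def deliver(self, address, message):
            pass   # stand-in for the real network send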

Spam Control

Spam or junk mail is a growing problem for today's enterprises. MessageLabs (an Internet email handler) measures that 55% of all mail is spam and predicts that this will grow to around 70% by the end of 2004 [6], [7]. The most obvious effect of this junk mail is to clutter users' inboxes, but it has another effect: the mail infrastructure cannot keep up with the increase in mail and is becoming overloaded. This results in poor service (e.g. transit delays) for email. Part of the reason for this is the extra processing that is now required to detect viruses and spam in email. The "scanner" is the bottleneck in the mail server, and if an underpowered machine is used, delays will result. As mail volumes increase, machines rapidly become underpowered!

We [8] have analysed large volumes of email traffic and deduced that it is possible to accurately predict, before a mail is completely received, whether or not it is likely to be junk (spam, virus or undeliverable). This prediction can then be used to prioritise good mail through the mail server, so that transit delays are reduced and the quality of service improved. The effects can be large: for mail that would otherwise be delayed by over 4 hours, the prioritisation scheme gives average delays of only 22 seconds. The prediction is based on sending history: a mail server on the Internet tends to send the same sort of mail, i.e. one sending junk mail will likely continue to do so, and vice versa. As messages are received and scanned, the accuracy of the prediction improves, although only a small number of messages (fewer than 10) are required for the prediction to converge.

This scheme gives resilience back to the mail server, allowing it to cope with large mail loads while maintaining good service. By maintaining the prediction, the mail server optimises its own resource allocation depending on the traffic flowing through it. In addition, the mail server can be provisioned to carry the volume of good mail, and any increases in the volume of spam mail will not impact the operation of the server. We are currently analysing how this mechanism could be used to reduce the overall volume of spam processed, as well as to ensure that good mail is processed promptly.
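A sketch of a sender-history predictor of this flavour; the data structure and the 0.5 decision threshold are our own illustrative choices, not those of [8].

    class SenderHistory:
        def __init__(self, threshold=0.5):
            self.stats = {}            # sending host -> (junk_count, total)
            self.threshold = threshold

        def predict_junk(self, host):
            junk, total = self.stats.get(host, (0, 0))
            return total > 0 and junk / total > self.threshold

        def record(self, host, was_junk):
            # Called after the scanner has classified a received message.
            junk, total = self.stats.get(host, (0, 0))
            self.stats[host] = (junk + int(was_junk), total + 1)

    # Mail predicted to be good is put on a fast queue; the rest waits.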


Market-based Control of Utility Data Centres

Another focus of the HP Labs CAS group that is of particular relevance to the issue of self-managed systems is our ongoing investigation into the possibility that a managed IT service provider, who seeks to run multiple demand-varying services on a common infrastructure such as HP's Utility Data Centre (UDC) product, might choose to organize the allocation of resources between competing services as a market, so-called market-based control (MBC).

In this framework, once service contracts have been signed between the service customer (e.g. an enterprise wishing to outsource payroll management) and the computing fabric owner (who will run the relevant applications), the customer need not concern themselves with resource requirements, and the fabric owner need only attempt to maximize the return on their Service Level Agreement (SLA) by varying the allocation of resources given to each service. This can be done in many ways; our proposal is to assign an autonomous management agent to each service, and to give the agent internal currency in proportion to the performance of its service with respect to the relevant SLA. This currency can then be redeemed in an internal market for the various compute resources that the fabric owner has, such as storage, CPU cycles, bandwidth, etc. In this way, the fabric provider only makes budgeting decisions for the various service management agents, and lets the market take care of assigning resources where they are needed most, as measured by the agents' willingness to pay for them.

We have studied experimental simulations of markets of this sort for regulating resource allocation between idealized job-processing workflows, and have found that the pricing mechanism can indeed be an effective method of self-management [9]. Agents predict the effect that various resource levels would have on their workflow's queues, and the consequent impact on their income, and thus their willingness to pay for those resource levels. In general, when correctly set up, the system can manage the distribution of resources between jobs as well as any other policy. In some circumstances, for example, we find that prices oscillate in such a way as to implement time-sharing of scarce indivisible resources. However, it must be said that the systemic behaviour is (not surprisingly) highly dependent on the value-prediction algorithms at the management agents' disposal: for example, in [9] we describe how a less precise algorithm for predicting job processing times at various resource levels leads to greater efficiency for the system as a whole. The hypothesised reason, that the less precise algorithm has a beneficial damping effect on system volatility, indicates that autonomous multi-agent systems such as the one described probably need to be designed and tested at the system level, not as groups of individual parts.
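One very simple internal market of this kind is a proportional-share allocation, sketched below; the HP Labs work does not prescribe this particular rule, so treat it purely as an illustration of currency-mediated resource assignment.

    def allocate(capacity, bids):
        # Each service agent receives resource in proportion to the
        # internal currency it bids for that resource.
        total = sum(bids.values())
        if total == 0:
            return {agent: 0.0 for agent in bids}
        return {agent: capacity * bid / total for agent, bid in bids.items()}

    # e.g. allocate(64.0, {"payroll": 30.0, "web": 70.0})
    # -> {'payroll': 19.2, 'web': 44.8} units of CPU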

In this framework, once service contracts have been signed between the service customer (e.g. an enterprise wishing to outsource payroll management) and the computing fabric owner (who will run the relevant applications), the customer need not concern themselves with resource requirements, and the fabric owner need only attempt to maximise the return on each Service Level Agreement (SLA) by varying the allocation of resources given to each service. This can be done in many ways; our proposal is to assign an autonomous management agent to each service, and to give the agent internal currency in proportion to the performance of its service with respect to the relevant SLA. This currency can then be redeemed in an internal market for the various compute resources the fabric owner holds, such as storage, CPU cycles and bandwidth. In this way, the fabric provider only makes budgeting decisions for the various service management agents, and lets the market take care of assigning resources where they are needed most, as measured by the agents’ willingness to pay for them. We have studied experimental simulations of markets of this sort for regulating resource allocation between idealised job-processing workflows, and have found that the pricing mechanism can indeed be an effective method of self-management [9]. Agents predict the effect that various resource levels would have on their workflow’s queues, the consequent impact on their income, and thus their willingness to pay for those resource levels. In general, when correctly set up, the system can manage the distribution of resources between jobs as well as any other policy. In some circumstances, for example, we find that prices oscillate in such a way as to implement time-sharing of scarce indivisible resources. However, the systemic behaviour is (not surprisingly) highly dependent on the value-prediction algorithms at the management agents’ disposal: in [9], for example, we describe how a less precise algorithm for predicting job processing times at various resource levels leads to greater efficiency for the system as a whole. The hypothesised reason – that the less precise algorithm has a beneficial damping effect on system volatility – indicates that autonomous multi-agent systems such as this probably need to be designed and tested at the system level, not as groups of individual parts.
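As an illustration of the budget-and-market scheme just described, the sketch below implements a simple proportional-share market in which each service agent bids up to its budget according to the value it predicts a resource has for its SLA income. The bidding rule, names and numbers are illustrative assumptions, not the simulation of [9].

def allocate(bids, capacity):
    # Split a divisible resource between agents in proportion to their bids.
    total = sum(bids.values())
    if total == 0:
        return {agent: 0.0 for agent in bids}
    return {agent: capacity * b / total for agent, b in bids.items()}

def make_bid(budget, predicted_value):
    # An agent bids what the resource is predicted to be worth to its SLA
    # income, capped by the budget granted by the fabric owner.
    return min(budget, predicted_value)

# Hypothetical example: two service agents competing for 100 CPU shares.
budgets = {"payroll": 40.0, "web": 25.0}
values = {"payroll": 30.0, "web": 50.0}  # predicted SLA value of the resource
bids = {a: make_bid(budgets[a], values[a]) for a in budgets}
print(allocate(bids, capacity=100.0))  # payroll ~54.5 shares, web ~45.5

Note that the fabric owner steers the system only through the budgets: raising a service’s budget raises the ceiling on its bids, and the market does the rest.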

But the system-level design and testing of autonomous multi-agent systems typically requires skilled human practitioners, and in many instances is more of an art than a science: in the absence of established rigorous engineering techniques, trial-and-error methods prevail.

For this reason, the HP Labs CAS group has made significant research investments in developing automated methods for the design and optimisation of autonomous multi-agent systems, focusing particularly on distributed market-based resource allocation and load balancing in large-scale distributed computer systems such as UDCs. Specifically, we have explored the use of evolutionary computation techniques such as genetic algorithms (GAs) to automatically optimise trader-agents and market mechanisms that could be used in MBC systems. To date, much of our work has concentrated on GA-optimisation of markets populated by software agents running the “ZIP” trader-agent algorithm [10], which was developed initially at HP Labs and subsequently demonstrated by researchers at IBM to outperform human traders [11]. Two desirable system-level behaviours of the agent-based markets in an MBC system are that the agents’ transaction prices rapidly converge on the underlying equilibrium price, and that the convergence is stable [10]. The price dynamics of markets populated by ZIP traders are determined by the values of eight real-valued control parameters, so any one ZIP-trader market can be characterised as a point in an 8-dimensional real space. In early work [12], we demonstrated that a simple GA could find values for these 8-dimensional ZIP-market vectors that were better than the manually-set values originally chosen by the designer of the ZIP algorithm, and analysis of the evolutionary dynamics of the GA system showed that this was a non-trivial search problem. Following IBM’s demonstration of the superiority of ZIP traders (and of another trader-agent algorithm known as “MGD”) over human traders [11], the IBM research team claimed it likely that, in future, economically significant online auction markets (such as those operated by international equity and derivative exchanges) might be depopulated, with the current human traders being replaced by automated software-agent traders. This prompted us to explore the possibility that, in markets where it is known a priori that all the traders are software agents and no humans are present, new forms of market mechanism (i.e. the rules that govern the behaviours of and interactions between the trader-agents) might be discovered that are better, in some sense, than the traditional mechanisms, which are usually online reimplementations of market mechanisms originally designed by humans and for humans.
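A minimal sketch of the GA set-up described above follows. Individuals are 8-dimensional real vectors of ZIP control parameters, and evaluate_market stands in for running a ZIP-trader market simulation and scoring the speed and stability of price convergence, as in [12]; the population size, selection scheme and mutation rate are assumptions for illustration.

import random

DIM = 8            # a ZIP-trader market is a point in 8-dimensional real space
POP, GENS = 30, 50

def evaluate_market(vec):
    # Placeholder fitness. In [12] this step runs a ZIP-trader market
    # simulation and scores how quickly and stably transaction prices
    # converge on the equilibrium price; here we use a dummy surrogate.
    return -sum((x - 0.5) ** 2 for x in vec)

def mutate(vec, sigma=0.05):
    # Gaussian mutation, clipped to keep parameters in [0, 1].
    return [min(1.0, max(0.0, x + random.gauss(0.0, sigma))) for x in vec]

def crossover(a, b):
    # Single-point crossover of two parameter vectors.
    cut = random.randrange(1, DIM)
    return a[:cut] + b[cut:]

population = [[random.random() for _ in range(DIM)] for _ in range(POP)]
for _ in range(GENS):
    population.sort(key=evaluate_market, reverse=True)
    elite = population[: POP // 3]        # truncation selection
    offspring = [mutate(crossover(random.choice(elite), random.choice(elite)))
                 for _ in range(POP - len(elite))]
    population = elite + offspring
best = max(population, key=evaluate_market)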

For the most economically significant form of auction mechanism (the so-called Continuous Double Auction, or CDA, used in almost all of the world’s international financial exchanges), economists remain unable to explain precisely which aspects of the auction mechanism contribute to the observable dynamics of markets organised according to it. This gives some indication that designing new mechanisms is a non-trivial task, even for a skilled economist, and can again often involve trial-and-error design methods where applicable theoretical or analytic results on which to base a design are lacking. As an alternative to manual design, we have explored the automated design of new agent-based market mechanisms, again using a GA, this time to search a continuous space of possible auction mechanisms that includes the CDA but also peculiar hybrid mechanisms – readily implementable as online exchanges or marketplaces, but unlike any traditional market mechanism. To our surprise, when attempting to find mechanisms that gave the most rapid and most stable convergence of transaction prices to the underlying equilibrium price, it was particular instances of these hybrid markets (and not the CDA) that the GA identified as best. Although originally motivated by the attempt to design better internal agent-based markets for MBC systems applicable to UDCs, this work has now attracted significant attention from the world of international equity traders and exchanges. Although the original work concentrated on ZIP-trader markets, one of our subsequent studies [13] established that the GA could find non-traditional hybrid market mechanisms that were better than the comparable traditional mechanisms regardless of the nature of the trader-agents in those markets: whether the traders are human or any form of artificial agent, the GA-designed markets are better.
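The continuous mechanism space can be pictured with a single hybridisation parameter. The sketch below is our simplified reading of the idea, not the exact parameterisation of [13, 14]: one value q in [0, 1] gives the probability that the next quote comes from a seller, so the midpoint loosely resembles the symmetry of the CDA, the endpoints resemble one-sided auctions, and intermediate values are the hybrid mechanisms a GA can search over.

import random

def next_quoter(q, buyers, sellers):
    # Choose which side of the market issues the next quote under the
    # hybrid parameter q; q = 0.5 alternates symmetrically as in a CDA.
    side = sellers if random.random() < q else buyers
    return random.choice(side)

# A GA can treat q as one more gene, searching for the value that gives
# the fastest and most stable convergence of transaction prices.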

We are now actively exploring the application of these results to the automatic design and optimisation of new trader-agents and market-mechanisms for MBC of UDCs and similar distributed large-scale computing systems.

Conclusion

This paper has given an overview of selected research by the HP Labs CAS group that is relevant to self-star computing. Further details are available on our website [1].

References

1. http://www.hpl.hp.com/research/bicas/
2. Matthew M. Williamson. “Throttling viruses: Restricting propagation to defeat malicious mobile code”. In Proceedings of the ACSAC Security Conference, pages 61–68, Las Vegas, Nevada, December 2002. HP Labs Technical Report HPL-2002-172.
3. Jamie Twycross and Matthew M. Williamson. “Implementing and testing a virus throttle”. In Proceedings of the 12th USENIX Security Symposium, pages 285–294, Washington DC, August 2003. USENIX. HP Labs Technical Report HPL-2003-103.
4. Matthew M. Williamson. “Design, implementation and test of an email virus throttle”. In Proceedings of the ACSAC Security Conference, Las Vegas, Nevada, December 2003. HP Labs Technical Report HPL-2003-118.
5. Matthew M. Williamson, Alan Parry and Andrew Byde. “Virus Throttling for Instant Messaging”. To appear at the Virus Bulletin Conference 2004, Chicago, IL.
6. MessageLabs. MessageLabs Monthly View, November 2003. Published on the MessageLabs site, http://www.messagelabs.com.
7. MessageLabs. “Spam and viruses hit all time highs in 2003”, December 2003. Published on the MessageLabs site, http://www.messagelabs.com.
8. D. Twining, M. Williamson, M. Mowbray and M. Rahmouni. “Email Prioritization: reducing delays on legitimate mail caused by junk mail”. To appear at the USENIX Annual Technical Conference, June 2004, Boston, MA. HP Labs Technical Report HPL-2004-5.
9. A. Byde, M. Salle and C. Bartolini. “Market-Based Resource Allocation for Utility Data Centers”. HP Labs Technical Report HPL-2003-188.
10. D. Cliff. “Minimal-intelligence agents for bargaining behaviours in market environments”. HP Labs Technical Report HPL-97-91.
11. R. Das, J. Hanson, J. Kephart and G. Tesauro. “Agent-human interactions in the continuous double auction”. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-01), 2001.
12. D. Cliff. “Evolutionary optimization of parameter sets for adaptive software-agent traders in continuous double-auction markets”. Presented at the Artificial Societies and Computational Markets (ASCMA98) workshop at the Second International Conference on Autonomous Agents, May 1998. HP Labs Technical Report HPL-2001-99.
13. A. Byde. “Applying Evolutionary Game Theory to Auction Mechanism Design”. Presented at the ACM Conference on E-Commerce, 2003. HP Labs Technical Report HPL-2002-321.
14. D. Cliff. “Explorations in evolutionary design of online auction market mechanisms”. Journal of Electronic Commerce Research and Applications, 2(2):162–175, 2003.
