0 Game Theoretic Iterative Partitioning for Dynamic Load Balancing in Distributed Network Simulation

arXiv:1111.0875v1 [cs.DC] 3 Nov 2011

ADITYA KURVE, CHRISTOPHER GRIFFIN, DAVID J. MILLER, and GEORGE KESIDIS, The Pennsylvania State University

High fidelity simulation of large-sized complex networks can be realized on a distributed computing platform that leverages the combined resources of multiple processors or machines. In a discrete event driven simulation, the assignment of logical processes (LPs) to machines is a critical step that affects the computational and communication burden on the machines, which in turn affects the simulation execution time of the experiment. We study a network partitioning game wherein each node (LP) acts as a selfish player. We derive two local node-level cost frameworks which are feasible in the sense that the aggregate state information required to be exchanged between the machines is independent of the size of the simulated network model. For both cost frameworks, we prove the existence of stable Nash equilibria in pure strategies. Using iterative partition improvements, we propose game theoretic partitioning algorithms based on the two cost criteria and show that each descends in a global cost. To exploit the distributed nature of the system, the algorithm is distributed, with each node's decision based on its local information and on a few global quantities which can be communicated machine-to-machine. We demonstrate the performance of our partitioning algorithm on an optimistic discrete event driven simulation platform that models an actual parallel simulator.

Categories and Subject Descriptors: I.6.8 [Simulation and Modeling]: Types of Simulation—Distributed

General Terms: Algorithms, Design, Theory, Performance

Additional Key Words and Phrases: dynamic load balancing, partitioning, Nash equilibrium, iterative improvement, optimistic synchronization

ACM Reference Format: Kurve, A., Griffin, C., Miller, D. J., and Kesidis, G. 2011. Game Theoretic Iterative Partitioning for Dynamic Load Balancing in Distributed Network Simulation. ACM Trans. Model. Comput. Simul. 0, 0, Article 0 (0), 25 pages. DOI = 10.1145/0000000.0000000 http://doi.acm.org/10.1145/0000000.0000000

1. INTRODUCTION

This work was funded in part by National Science Foundation CNS grant 0831068.
Authors' addresses: A. Kurve ([email protected]), Department of Electrical Engineering; C. Griffin ([email protected]), Applied Research Laboratory; D. J. Miller ([email protected]), Department of Electrical Engineering; G. Kesidis ([email protected]), Departments of Electrical Engineering and Computer Science & Engineering, The Pennsylvania State University, University Park, PA 16802.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected].
© 0 ACM 1049-3301/0/-ART0 $10.00

The rapid growth of large-scale communication networks such as the Internet and the on-line social networks it supports has brought an increased emphasis on the need to model and simulate them in order to understand their macroscopic behavior and to design and test defenses against network attacks, e.g., denial-of-service and Border Gateway Protocol (BGP) attacks [Sriram et al. 2006]. However, the sheer size of such networks entails formidable computational challenges for detailed modeling of

DOI 10.1145/0000000.0000000 http://doi.acm.org/10.1145/0000000.0000000 ACM Transactions on Modeling and Computer Simulation, Vol. 0, No. 0, Article 0, Publication date: 0.


every single element, such as a complex router or a simple data buffer. One solution is to scale down the topology, i.e., abstract or ignore some parameters that presumably do not affect the goal of the experiment [Psounis et al. 2003]. However, some studies show that certain parameters that are seemingly inconsequential can, in reality, have a large impact on the simulation output. For example, [Chertov and Fahmy 2011] studied the need to model forwarding devices such as switches or routers and the sensitivity of the experimental results to their presence in the model. An alternative is to parallel-distribute the simulation tasks across multiple machines and, in this way, exploit their combined resources. A challenge here is to maintain synchronization among collaborating machines in order to preserve the causality of processed events.

Load balancing is a critical component in any distributed processing application. In the context of distributed simulations, inequitable load balancing leads to large synchronization overhead due to rollbacks, thus preventing the exploitation of the complete simulation power of parallel machines/processors. Also, if the machines are remotely connected, e.g., via Ethernet, then there may be communication delays between processes that reside on different machines, further increasing the likelihood of rollbacks and, thus, the need for additional simulation execution time. In our model, we ignore any potential rollback-delays incurred due to limited availability of hardware resources.

Most parallel simulation mechanisms use the concept of a component logical process (LP). Each LP has a set of local variables that define its state. The state of the simulation at a given time is defined by the combined state of all LPs at that time. LPs communicate with each other via event messages. For the specific case of network simulation, it is desirable and natural to model each element of the network with a single LP [Nicol and Fujimoto 1994].
LPs only communicate with their “neighboring” LPs in the simulated network topology, thus allowing us to represent the LPs and their interdependencies with a graphical model. In our problem setup, we look at the task of load balancing a network simulation across multiple machines as a graph partitioning problem. This problem is different from the classical graph partitioning problem in three ways:

— The node and edge weights of the graph, which reflect the computation and potential rollback-delay costs, respectively, are not known prior to the simulation.
— The weights are dynamic because they depend on an event generation model which is random in most cases. Hence, an optimal partition at any given point in time will fail to remain so subsequently.
— The parallel configuration of the hardware emphasizes the need for an iterative and distributed partition refinement mechanism which is also feasible in terms of the synchronization overhead cost.

Most existing heuristics for graph partitioning are either centralized in nature or derivatives of centralized methods. Thus, there is scope to explore techniques that exploit the distributed nature of the hardware. Game theory allows us to study the effect of local behavior, under competition and/or cooperation, on the combined outcome of a distributed system. Specifically, it is interesting to formulate cost functions that allow for stable equilibria with desirable “social welfare”, interpreted in our case as a good simulated-network partition among parallel machines. In this work, we formulate the partitioning problem as a game played between the nodes or LPs. In order to define criteria for node transfer, we propose and compare two cost frameworks. Via a numerical study, we show that these two cost criteria result in disparate equilibrium points.
Using this insight, we propose a distributed algorithm where machines exchange nodes using knowledge of the node costs, i.e., where they play the game on behalf of the nodes that currently belong to their partition. The machine utilities rely on machine-level aggregate state information which is feasibly exchanged during the parallel simulation. This algorithm applies iterative improvements to an existing partition, rather than a complete refresh, and hence is suitable for a dynamic graph of LPs. The iterative refinement consists of distributed and sequential steps, as shown in Figure 1. In our simulations, we compared the two methods in terms of the total simulation execution time for an experiment. We observed that one local cost framework performs better than the other in terms of simulation execution time, as it tends to converge to better solutions for both global costs. Finally, we demonstrate the algorithm on a software-based simulator which is an archetype of an optimistic discrete-event driven simulator. The advantages of using this testing environment are flexibility due to independence from the constraints of hardware, the generic nature of the simulation scenario, and a shorter testing time that saves us from the intricacies of any specific parallel simulator software.

The organization of the paper is as follows. In Section 2, we describe the background and relevant literature. In Section 3, we formalize our problem with an example feasible node-level cost function requiring only machine-level overhead (see Section 4.5), formulate our game, and prove the existence of Nash equilibria in pure strategies. Section 4 explains the proposed algorithm based on the cost framework. Some methods to improve the locally optimal equilibrium points are also discussed in this section. In Section 5, we discuss an alternate feasible cost framework that also guarantees stable Nash equilibria in pure strategies. We also perform a comparative numerical study between the two frameworks on random graphs. In Section 6, we describe a generic network simulator platform. We give simulation results for two different models of random graphs: a preferential-attachment type and another derived from a geometric model. We conclude in Section 7.

2. BACKGROUND

Networks in the form of social connections, information-carrying links, collaborative associations, and reputation systems are ubiquitous. Large and complex networks result from the association of entities distinct in character, preferences, and inherent design. Network modeling is essential for understanding the behavior of a combined system of heterogeneous individual nodes that interact with each other. It is used, e.g., to test routing and forwarding protocols and defenses against network attacks, and to study community structures in social networks. A typical communication network may contain tens of thousands of nodes with a large number of links. The nodes in reality might be whole routers, CPUs, or switches, individual router ports, firewalls, or end systems. [Sriram et al. 2006] studied the large-scale impact of BGP peering session attacks that can cause cascading failures, permanent route oscillations, or gradually degrading behavior of the routers, using an Autonomous System (AS) level topology which was down-sampled from a typical AS-level topology consisting of 23,000 ASes and 96,000 BGP peering links [UCLA Internet Topology Data]. There is a fidelity/complexity tradeoff when simulating such a network. Some methods scale down the network under study [Psounis et al. 2003] by omitting some intricate details and cleverly choosing the parameters that are expected to maximally affect the simulation results, so as to minimize the loss of fidelity. Recent papers such as [Gupta et al. 2008] introduce new methods to represent groups of nodes by a single node, while [Carl and Kesidis 2008] studies path-preserving scaled-down topologies for large-scale testing of routing protocols. [Dimitropoulos et al. 2009] suggests the use of annotations to a simple unweighted, undirected graph to represent the original network.
However, the reliability of scaled-down methodologies depends largely on the assumption of low sensitivity of the outcome of the experiment to certain microscopic factors which may be at best crudely modeled by scale-down. An important macroscopic behavior of a

Fig. 1. Iterative improvement via distributed synchronous updates

network might be due (in an a priori unknown way) to a microscopic behavior ignored in the simulated model, thus resulting in inaccurate simulation results. With the advancement of distributed processing systems, computing power has increased and we can thus feasibly represent networks using more refined models. Simultaneously, the theory of distributed simulation has developed along with practical implementations of simulators such as PARSEC [Bagrodia et al. 1998]. The insight that allows one to parallelize a simulation is that not every event affects all the events that follow in time. In the case of network simulation, one can represent each node in a network by a logical process (LP). An LP is an object that consists of a set of variables known as its “state”, along with functions specific to the type of node being modeled. It normally has an event list containing time-stamped events to be executed. The methods to synchronize event execution between LPs running in parallel can be classified into two major types: conservative [Chandy and Misra 1981] and optimistic, e.g., Time Warp [Jefferson 1985]. In the conservative methods, LPs strictly follow time-causality, i.e., all events are processed in the strict order of their time stamps. Each LP tries to ensure, before processing an event A, that no other event, say B, with time stamp less than that of A, will arrive at the LP subsequent to the processing of event


A. This time causality is maintained by assuming that the graph topology of LPs is fixed and known beforehand. LPs communicate via messages, and each message-carrying link ensures that messages are sent in the order of their time stamps. Synchronizing null messages are exchanged between LPs to assure each other that no event with time stamp less than a specified time will be sent. As opposed to conservative mechanisms, optimistic synchronization allows for non-causal execution: an LP is allowed to process events ahead of time without any kind of assurance. Each LP maintains its own local time and processes the event with the lowest time stamp in its event list. If an LP receives an event time-stamped t, less than its local time, it rolls back in time to t. Optimistic methods may be able to exploit parallelism more than conservative methods; however, there is a chance that the system might be burdened by excessive rollbacks.

Load balancing is a critical component of distributed processing applications. In the context of distributed simulations, inequitable load balancing leads to large synchronization overhead due to rollbacks, thus limiting the full use of the power of parallel machines. In this work, we look at load balancing as a graph partitioning problem by considering the network model under simulation as a graph of logical processes or LPs.

The graph partitioning problem arises in many different scenarios and is known to be NP-complete [Garey and Johnson 1979]. It is formally defined as follows. Let G = (V, E) be an undirected graph where V is the set of nodes and E is the set of edges. Suppose the nodes and the edges are weighted. Let w_i represent the weight of the ith node and let c_{ij} represent the weight of the undirected edge {i, j}.
Then the K-way graph partitioning problem aims to find K subsets V_1, V_2, ..., V_K such that V_i ∩ V_j = ∅ for all i ≠ j, ∪_{i=1}^{K} V_i = V, the load is balanced, i.e., ∑_{j ∈ V_i} w_j ≈ (1/K) ∑_{j ∈ V} w_j for each i, and the sum of the weights of edges whose incident vertices belong to different subsets is minimized. Heuristics to solve the graph partitioning problem primarily make use of spectral bisection methods [Pothen et al. 1990] or multilevel coarsening and refinement techniques [Karypis and Kumar 1996]. Spectral bisection methods compute the eigenvector associated with the second-smallest eigenvalue, known as the Fiedler vector, of a modified adjacency matrix of the graph. These methods give by far the best results; however, finding the Fiedler vector is computationally very expensive. For “geometric” graphs, in which coordinates are associated with the vertices, randomized geometric methods are available that are quicker than spectral bisection methods. Multilevel partitioning algorithms are by far the most popular techniques. The idea is to coarsen the graph, i.e., locally, at different places in the graph, collapse a connected component into a single node, until the graph is reduced to a small number of such nodes that can be partitioned easily (using brute-force methods). The subsequent uncoarsening stage is accompanied by refinement of the partition. The coarsening phase is implemented by matching random edges or by finding highly connected sets of nodes. The partitioning phase may use spectral bisection, the K-L method, or the graph growing partitioning (GGP) algorithm. The K-L method [Kernighan and Lin 1970] is used for step-by-step refinement to identify groups of nodes that can be exchanged between two partitions in order to improve the overall partition. All the previously discussed methods are motivated by a global criterion used to define an optimal partitioning and hence are most suitable for centralized computational implementations involving access to global state information.
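For concreteness, the two quantities in the K-way objective above can be evaluated directly for any candidate assignment. The following is a minimal sketch; the graph data and function names are illustrative, not from the paper:

```python
# Evaluate a candidate K-way partition: the total weight of cut edges
# (to be minimized) and the per-part node-weight loads (to be balanced).

def cut_weight(edges, assign):
    """Sum of weights c_ij over edges {i, j} whose endpoints lie in different parts."""
    return sum(c for (i, j), c in edges.items() if assign[i] != assign[j])

def part_loads(node_w, assign, K):
    """Total node weight assigned to each of the K parts."""
    loads = [0.0] * K
    for i, w in node_w.items():
        loads[assign[i]] += w
    return loads

# Toy example: a 4-cycle with unit node and edge weights, split into two halves.
node_w = {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0}
edges = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (3, 0): 1.0}
assign = {0: 0, 1: 0, 2: 1, 3: 1}

print(cut_weight(edges, assign))      # 2.0: edges {1,2} and {3,0} are cut
print(part_loads(node_w, assign, 2))  # [2.0, 2.0]: load is perfectly balanced
```

Any heuristic discussed above, centralized or not, is ultimately scored by these two quantities.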
A static graph is more amenable to such partitioning methods than a dynamic one. Moreover, if the partitioning algorithm itself is executed on parallel machines, parallel implementation of the above-mentioned algorithms is not straightforward. The main contribution of this paper is a local method of load balancing for performing distributed and dynamic graph partitioning. We give two local cost frameworks, both of which aim to minimize the total simulation execution time, albeit descending



in two different global cost functions. Both these methods are distributed in that they work with only local information available at each LP, in addition to aggregate global “state” information which is independent of the graph size. Our problem is closely related to the class of multi-agent optimization problems considered in, e.g., [Lobel et al. 2010]. To study the performance of such a local cost function, whose minimization is consistent with the least global simulation execution time, we appeal to game theory. An extensive game theoretic study of the load balancing problem appears in [Nisan 2007], in particular Chapter 20. In this paper, we use a similar formulation of the problem, but we also consider the inter-machine potential rollback-delay “costs”, reflected in edge weights in a graph cut. We also use insights gained from our theoretical results to design an iterative partitioning scheme to deal with dynamically changing load, by using the node-level costs derived from the game formulation as criteria for node transfer. The work in [Nandy and Loucks 1993] is the closest to ours, with some major differences: they proposed an iterative method based on the gain of each node; however, their gain only minimizes the cost of the cut in the graph, without taking into account the computational burden cost of the node. Moreover, a forced convergence is employed, allowing each node to migrate only once. In our work, convergence is a natural outcome with a guaranteed locally optimal partition refinement. We support our claims with a theoretical study of the problem. In our previous work [Kurve et al. 2011], we considered a machine-level decision framework wherein the idea of a “sparse cut” is used by each machine to find clusters of nodes suitable for exchange.
In this work, we explore the idea of the LPs (simulated nodes) as players and propose an algorithm that uses node-level cost functions that result in efficient “social welfare”, interpreted as minimal simulation execution time in our case. We are particularly interested in the class of games known as “potential games” [Monderer and Shapley 1996], which guarantee a descent in a global cost function (indicative of the system performance) with each decision made at the local (machine or node) level. Note that although the decisions are made at the node level, the synchronization overhead is constant, irrespective of the number of nodes.

3. PROBLEM SETUP AND NODE COST FUNCTION

First, we clarify some terms that we will commonly use in the following.

— Network Model: This is the simulation model of the network under study. The nodes might represent routers, entire autonomous systems, buffers, clients, servers, or any data processing or storage unit.
— Machines: This refers to the computers (hosts or processors) that are part of the simulator hardware.
— Logical Process (LP) or (Simulated) Node: The fundamental unit of execution, which shares a machine with other LPs running in parallel. Every LP has a set of variables that define its state and communicates with other LPs via messages. Each node in the simulated graph represents an LP; hence we use the words node and LP interchangeably.

Consider an undirected, unweighted graph G = (V, E) representing the network model to be simulated. As described previously, the graph can be interpreted as a model of a network of LPs with weights assigned to the nodes and edges of G. Thus, we want first to estimate the computational load generated by each node and the amount of communication between the logical processes associated with each node, to assign node weights and edge weights, respectively, to the graph. Second, we wish to find a distributed technique to equitably load-balance among the machines while also taking into consideration the amount of inter-machine communication, the latter reflecting the risk of rollback. We assume that the simulation program involves a mechanism


that estimates the node weights (b_i) and the edge weights (c_{ij}) based on the list of future events. The graph G is partitioned among K machines or fewer since, in some cases where the cost of potential rollback-delay is high, partitioning the workload among fewer than K machines might be optimal, i.e., some machines may not be assigned any LPs. Let b_i represent the computational load of the ith node. Let c_{ij} denote the cost of communicating over the edge {i, j}, representing the average amount of traffic between node i and node j. We assume a connected network model graph; otherwise, one can easily convert a disconnected graph into a connected one by adding edges of weight zero between the disconnected subgraphs.

3.1. A Local Cost Function

A natural way of formulating a game for graph partitioning consists of treating machines as players, with each machine bidding for the nodes that help to maximize its utility or minimize its cost [Kurve et al. 2011]. This can be formulated as a combinatorial auction game. Combinatorial auctions have been studied widely [Cramton et al. 2006]; however, we are not aware of any work that uses them for graph partitioning. Treating the nodes as players who may choose from among partitions (machines) to minimize their own costs is an alternative approach to formulating the partitioning problem as a game. This is our approach. In the following, we specify suitable cost functions that guarantee a Nash equilibrium and give us a locally optimum partition in the sense of a global potential (Lyapunov) function approximating total simulation time.

Suppose there are N nodes and K partitions. Let r_i ∈ {1, 2, ..., K} be the partition chosen by the ith node. Let the normalized capacity or speed of the kth machine be

    w_k = s_k / ∑_{j=1}^{K} s_j,

where s_j is the speed of the jth machine. Let us assign the following cost function to the ith node:

    C_i(r_i, r_{-i}) = (b_i / w_{r_i}) ∑_{j : r_j = r_i, j ≠ i} b_j + (μ/2) ∑_{j : r_j ≠ r_i} c_{ij},    (1)

where r_{-i} denotes the vector of assignments of all the nodes in r except that of the ith node, and μ denotes the relative weight given to the inter-machine potential rollback-delay cost. This can vary across simulation platforms; for example, if the participating machines are remotely connected, then this cost will be higher than if they are locally connected. In the function above, the computational cost of a node intuitively depends on two factors: the existing load on the assigned machine, i.e., ∑_{j : r_j = r_i, j ≠ i} b_j, and the computational load that the node will bring to the machine, i.e., b_i. Supposing that the computational load generated by the node is zero (b_i = 0), then the computational part of the cost should be zero; in this way, multiplication by b_i makes sense. This cost incentivizes the node to choose a machine that has relatively less existing load, thus encouraging load balancing at the system level. For example, if μ = 0, then a node i which is currently assigned to machine r_i would choose to relocate to machine


r_i^* only if

    (1 / w_{r_i}) ∑_{j : r_j = r_i, j ≠ i} b_j > (1 / w_{r_i^*}) ∑_{j : r_j = r_i^*} b_j,    (2)
and in this way the load balancing mechanism is implicitly manifested in the local cost function. The second term in the sum represents the weight of the edges that connect the ith node with nodes in other partitions. This term incentivizes the node to choose a partition with which it is well connected.

3.2. Nash Equilibrium

A strategy profile r^* = (r_1^*, r_2^*, ..., r_N^*) is a Nash equilibrium if and only if

    C_i(r_i^*, r_{-i}^*) ≤ C_i(r_i, r_{-i}^*)  ∀ r_i ∈ {1, 2, ..., K}, ∀ i.    (3)
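Equations (1) and (3) lend themselves to a direct numerical check. The sketch below is a toy instance with hypothetical data (not from the paper; machines are indexed from 0 for convenience): it computes the node cost of (1), tests condition (3), and runs simple sequential best responses until no node wants to move.

```python
def node_cost(i, r, b, c, w, mu):
    """C_i(r_i, r_-i) of Eq. (1): (b_i / w_{r_i}) * (load sharing i's machine)
       + (mu / 2) * (weight of i's edges cut by the partition)."""
    load = sum(b[j] for j in b if j != i and r[j] == r[i])
    cut = sum(cij for (u, v), cij in c.items()
              if (u == i or v == i) and r[u] != r[v])
    return b[i] / w[r[i]] * load + mu / 2.0 * cut

def is_nash(r, b, c, w, mu, K):
    """Condition (3): no node can lower its cost by unilaterally moving."""
    return all(node_cost(i, r, b, c, w, mu) <=
               node_cost(i, {**r, i: k}, b, c, w, mu)
               for i in b for k in range(K))

# Toy instance: 4 unit-load nodes on a path, 2 equal-speed machines.
b = {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0}
c = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0}
w = [0.5, 0.5]
mu, K = 1.0, 2

r = {0: 0, 1: 0, 2: 0, 3: 0}   # start with everything on machine 0
while not is_nash(r, b, c, w, mu, K):
    for i in b:                 # sequential best responses
        costs = [node_cost(i, {**r, i: k}, b, c, w, mu) for k in range(K)]
        r[i] = costs.index(min(costs))
print(r)  # {0: 1, 1: 1, 2: 0, 3: 0}: a balanced split cutting one edge
```

The loop terminates on this instance because each improving move strictly lowers the aggregate cost, previewing the potential-function argument used in Theorem 3.1 and Section 4.3.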

Thus, at a Nash equilibrium, every node, say node i, will not be able to improve its cost by unilaterally changing its current processor r_i^*, i.e., provided that the decisions of all other nodes are given by the assignment vector r_{-i}^*.

THEOREM 3.1. For the game described above, a Nash equilibrium exists in pure strategies.

PROOF. To construct a proof of Theorem 3.1, consider the following combinatorial optimization problem:

    min_r C_0(r),

where

    C_0(r) := ∑_i C_i(r_i, r_{-i}) = ∑_i [ (b_i / w_{r_i}) ∑_{j : r_j = r_i, j ≠ i} b_j + (μ/2) ∑_{j : r_j ≠ r_i} c_{ij} ].

As the sum of the costs of all nodes for a particular assignment vector, C_0 is also referred to as the social welfare of the system in the game theoretic literature. Since this problem is combinatorial in nature, at least one globally optimum (minimum) solution exists for this problem. We denote such an assignment as r̂ = (r̂_1, r̂_2, ..., r̂_N). We will prove, by contradiction, that r̂ is also a Nash equilibrium for our game, i.e., the Nash equilibrium is C_0-“efficient”. We assume that r̂ is not a Nash equilibrium. Hence we can find at least one node, say node l, which can improve its local cost given by (1) by changing its currently assigned processor k_1 to processor k_2, resulting in a new assignment vector r^*. We find what effect this has on the function C_0(r). Specifically, we show that if a node, say l, is moved from processor k_1 to k_2, the new global cost is C_0(r^*) = C_0(r̂) + 2(C_l(r^*) − C_l(r̂)), thus C_0(r^*) − C_0(r̂) = 2(C_l(r^*) − C_l(r̂)) < 0, i.e., C_0(r^*) < C_0(r̂), contradicting our assumption that r̂ is the globally minimum solution. Please refer to Appendix C for the complete proof.

4. ALGORITHM DESIGN AND ANALYSIS

We will now describe our distributed iterative partition refinement algorithm based on the local node-level cost framework. But prior to that, we need to initialize the node-to-machine assignment.

4.1. Initial Partitioning

We argue that, due to the initial uncertainty of node and edge weights and their dynamic nature, an optimum partition is not possible a priori. As a result, we employ a simple


initial partitioning method in which each machine chooses an initial “focal node” from among the nodes of the graph and then expands hop-by-hop to include neighboring nodes. To avoid contention between partitions, we require that each machine wait a random amount of time after every hop and check for a semaphore before claiming ownership of new nodes. Unit edge and node weights are assumed during initial partitioning. The choice of the focal nodes is important to ensure a high probability of a good initial partition. As will be described later, the iterative partition refinement algorithm converges to a local minimum. Hence, a good initial partition might improve the chances of converging to the global minimum or a good local minimum. Please refer to Appendix A for details of our initial partitioning algorithm.

4.2. Iterative Partition Refinement
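One turn of the refinement described in this section can be sketched as follows: the machine taking its turn scans the nodes it owns, picks the one whose cost would drop the most under reassignment, and transfers it. The data structures and helper names here are hypothetical, and `node_cost(i, r)` is assumed to implement the node-level cost (1):

```python
def most_dissatisfied(my_nodes, r, K, node_cost):
    """Find this machine's most dissatisfied node: the one with the largest
       cost reduction available by moving to its best alternative machine."""
    best_node, best_dest, best_gain = None, None, 0.0
    for i in my_nodes:
        costs = [node_cost(i, {**r, i: k}) for k in range(K)]
        gain = costs[r[i]] - min(costs)      # the node's dissatisfaction
        if gain > best_gain:
            best_node, best_dest, best_gain = i, costs.index(min(costs)), gain
    return best_node, best_dest, best_gain

def take_turn(machine_id, owned, r, K, node_cost):
    """One turn: transfer the most dissatisfied node, or forsake the turn
       (returning 0.0) if no owned node can improve its cost."""
    i, k, gain = most_dissatisfied(owned[machine_id], r, K, node_cost)
    if gain > 0.0:
        owned[machine_id].remove(i)
        owned[k].add(i)    # in the full protocol this is a ReceiveNodeTrigger
        r[i] = k
    return gain            # every machine returning 0.0 signals convergence
```

Machines take such turns sequentially; convergence of the resulting dynamics is argued in Section 4.3.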

The refinement algorithm given in Figure 1 demonstrates the practical utility of the game theoretic framework described previously. During the refinement step, each machine takes turns to decide the “most dissatisfied” node in its partition and transfers its ownership to a new machine. The “most dissatisfied” node is the one which would benefit the most, in terms of its cost, by changing its machine. The cost function used here is the same as given in (1). Each partition maintains a list of the nodes currently owned by it, along with their current costs and their costs if they were to change to every other machine. Hence C_i(k) represents the cost of the ith node if it were to be assigned to the kth machine, and C_i(r_i) represents the current cost of the ith node. We define the dissatisfaction of the ith node as

    ℑ(i) = C_i(r_i) − min_k C_i(k).    (4)

Hence the most dissatisfied node in a partition is the one with the maximum value of ℑ. If ℑ = 0 for the most dissatisfied node, then the partition forsakes its turn. When all the partitions have ℑ = 0 for their most dissatisfied nodes, the algorithm has converged to a local minimum, as will be proved in the following.

4.3. Convergence of Algorithm

THEOREM 4.1. The iterative partition refinement algorithm converges.

PROOF. The function C_0(r) defined in Theorem 3.1 will be shown to be a potential function. Suppose the lth node is transferred from its currently allocated machine r_l to a new machine r_l^*. Call the new assignment vector r^*. Then, obviously, C_l(r^*) − C_l(r) < 0. From Theorem 3.1, replacing the global optimum assignment vector r̂ with any arbitrarily chosen assignment vector r:

    C_0(r^*) − C_0(r) = (C_l(r^*) − C_l(r))
        + [ ∑_{i ≠ l : r_i^* = r_l} C_i(r^*) − ∑_{i ≠ l : r_i = r_l} C_i(r) ]
        + [ ∑_{i ≠ l : r_i^* = r_l^*} C_i(r^*) − ∑_{i ≠ l : r_i = r_l^*} C_i(r) ]
        + [ ∑_{i ≠ l : r_i^* ≠ r_l, r_i^* ≠ r_l^*} C_i(r^*) − ∑_{i ≠ l : r_i ≠ r_l, r_i ≠ r_l^*} C_i(r) ]    (5)
      = 2 (C_l(r^*) − C_l(r)) < 0.

Hence, for every iterative step the potential function C_0(r) decreases, and we know that, due to the bounded nature of the combinatorial optimization problem, there exists an assignment vector r that will yield a local minimum value of C_0 which is greater than

repeat
    Wait until trigger is received
    if ReceiveNodeTrigger then
        Add the new node to the list
        Update cost functions and recalculate $\Im$ for all nodes in my partition
    end if
    if RegularUpdateTrigger then
        Update cost functions for the new assignment and recalculate $\Im$ for all nodes in my partition
    end if
    if TakeMyTurnTrigger then
        Transfer the “most dissatisfied” node to its destination machine
        Send ReceiveNodeTrigger to the destination machine
        Send RegularUpdateTrigger to all other machines
        Send TakeMyTurnTrigger to the next machine
        Update cost functions for the new assignment and recalculate $\Im$ for all nodes in my partition
    end if
until forever

Fig. 2. Iterative Partition Refinement Algorithm
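The turn-based refinement of Fig. 2 can be condensed into a small sketch. The following is a minimal, centralized Python rendition (the paper's version is distributed and trigger-driven); the cost function follows our reading of (1), with `mu` weighting the potential rollback-delay term, and all function names are ours:

```python
def node_cost(i, k, assign, b, w, c, mu):
    # Cost of node i if it were owned by machine k, holding all other
    # assignments fixed: weighted same-machine load plus (mu/2) times the
    # rollback-delay weight across the cut (our reading of cost (1)).
    n = len(b)
    load = sum(b[j] for j in range(n) if j != i and assign[j] == k)
    cut = sum(c[i][j] for j in range(n) if j != i and assign[j] != k)
    return (b[i] / w[k]) * load + (mu / 2.0) * cut

def global_cost(assign, b, w, c, mu):
    # Potential function C0: the sum of all node costs.
    return sum(node_cost(i, assign[i], assign, b, w, c, mu)
               for i in range(len(b)))

def refine(assign, b, w, c, mu, K):
    # Machines take turns transferring their "most dissatisfied" node,
    # i.e. the node maximizing C_i(r_i) - min_k C_i(k), until every
    # machine's best dissatisfaction is zero (cf. Fig. 2).
    n = len(b)
    while True:
        moved = False
        for m in range(K):                  # round-robin turns
            best = (1e-9, None, None)       # (dissatisfaction, node, target)
            for i in (x for x in range(n) if assign[x] == m):
                costs = [node_cost(i, k, assign, b, w, c, mu)
                         for k in range(K)]
                d = costs[m] - min(costs)
                if d > best[0]:
                    best = (d, i, costs.index(min(costs)))
            if best[1] is not None:
                assign[best[1]] = best[2]   # transfer ownership
                moved = True
        if not moved:
            return assign
```

Each accepted transfer strictly lowers the mover's own cost, and under this cost form the potential `global_cost` descends with it, mirroring Theorem 4.1.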

or equal to the global minimum value, which is strictly bounded. Hence an achievable lower bound for the potential function exists, and we can conclude that the algorithm converges.

4.4. Local and Global Minima

The potential function $C_0(\mathbf{r})$ may not be convex, so the algorithm may converge only to a local minimum, as dictated by the properties of the graph. The algorithm does not guarantee optimal performance in the sense of $C_0$, and in some cases the gap between the global and local optima might be significant. As in other problems of a similar nature, a meta-heuristic approach such as (distributed) simulated annealing [Kirkpatrick et al. 1983], [Rao and Kesidis 2004] can be used to get a good approximation to the global optimum. Simulated annealing has been studied for graph partitioning [Bertsimas and Tsitsiklis 1993] and shown to improve cost by approximately 5% for the examples considered. Restricting our search to the “most dissatisfied node” constrains the strategy space. If we consider coordinated play by a set of nodes/players, instead of an individual node, we may be able to push the algorithm out of a poor local minimum of $C_0$. So, we would like to consider more than one node for transfer during a single iteration. However, in this case, the search space increases exponentially with the number of nodes that are jointly considered for transfer. To address this issue, we can explore the use of a “sparse cut” criterion [Kurve et al. 2011] to narrow our search for the “most dissatisfied” set of nodes.

4.5. Algorithm Complexity

Every node transfer by a machine necessitates re-computation of the cost function for each node and for each partition. This may be computationally onerous if we calculate the cost function naively. But we can optimize our computational effort by using the fact that the transfers are spatially localized, in the sense that the potential rollback-delay cost of only those nodes which are connected to the node involved in the transfer will be affected. Also, the computational cost of each node can be calculated using a common variable array containing the current aggregate weights of each partition. We can see that the computational effort scales linearly with the number of nodes associated with a machine. To optimize further, we can allow simultaneous transfer of nodes by more than one machine if they are distant in the graph and if the transfers are between disjoint pairs of machines. Note that such asynchronous transfers might not guarantee a descent in the global cost. Furthermore, hierarchical search techniques can be employed to find the “most dissatisfied” node and arbitrate the transfer of nodes. A hierarchy of machines helps to reduce the communication overhead for coordination between the machines. A key simplification of the algorithm considered herein is obtained by noting that to calculate $\min_k C_i(k)$, a machine needs to know only the sum of the weights of the nodes at the other machines ($\sum_{j: r_j = k, \, j \neq i} b_j$) and not the allocation of each node, i.e., the communication overhead is at the machine level, instead of the node level. Note that the number of packets exchanged to synchronize this “sum of weights” information defining the aggregate “state” scales linearly with the number of machines and remains unchanged with the number of nodes, i.e., the synchronization overhead is independent of the size of the simulated network graph.

5. AN ALTERNATE COST FRAMEWORK

For purposes of comparison, let us consider an alternate local cost function for the graph partitioning problem:
$$\tilde{C}_i(r_i, r_{-i}) = \frac{b_i^2}{w_{r_i}^2} + \frac{2 b_i}{w_{r_i}^2} \sum_{\substack{j: r_j = r_i \\ j \neq i}} b_j - \frac{2 b_i}{w_{r_i}} \sum_{j} b_j + \frac{\mu}{2} \sum_{j: r_j \neq r_i} c_{ij}. \quad (6)$$

Note that this cost function also has the property of machine-to-machine level overhead, similar to the cost function described previously. Next, we state a centralized graph partitioning problem that reasonably models the total simulation execution time by minimizing the load variance across machines as well as the potential inter-machine rollback-delay cost. Let $X$ be a $K \times |V|$ matrix such that $x_{ki} = 1$ if node $i$ belongs to machine $k$ and $x_{ki} = 0$ otherwise. We require $\sum_k x_{ki} = 1 \; \forall i = 1, 2, \ldots, |V|$, i.e.,
$$x_{ki} = 1 \text{ when } r_i = k. \quad (7)$$
Let $w_k$ be the normalized capacity of the $k$th machine, so that $\sum_k w_k = 1$. Then, the centralized LP assignment problem is:
$$\min_X \tilde{C}_0(X) = \sum_{k=1}^{K} \left[ \left( \frac{\sum_{j \in V} x_{kj} b_j}{w_k} - \sum_j b_j \right)^2 + \frac{\mu}{2} \sum_{i,j} c_{ij} \, x_{ki} (1 - x_{kj}) \right] \quad (8)$$
subject to
$$\sum_k x_{kj} = 1 \; \forall j \quad \text{and} \quad x_{kj} \in \{0, 1\} \; \forall k, j.$$
The above standard formulation of the graph partitioning problem [Van Den Bout and Thomas Miller III 1990] is a quadratic integer programming problem whose convexity depends on the network graph; in most cases it will not be convex. We can think of decomposing this problem into a set of $K$ subproblems, each of which is solved by a single partition. However, with the constraints $\sum_k x_{kj} = 1 \; \forall j$, such a decomposition is


difficult to realize. So, instead, we study the effect of sequential node-by-node transfer on (8).

THEOREM 5.1. For the local node cost function (6), Nash equilibria exist at the local minima of the centralized cost function (8).

PROOF. The proof is straightforward and follows directly from the definitions of the centralized cost and node cost functions. Suppose that node $i$ is moved from machine $m$ to machine $n$, resulting in the new assignment matrix $\hat{X}$; thus $\tilde{C}_0(\hat{X})$ is the new value of the cost function. We can directly show:
$$\tilde{C}_0(X) - \tilde{C}_0(\hat{X}) = \left( \frac{b_i^2}{w_m^2} + \frac{2 b_i}{w_m^2} \sum_{\substack{j: r_j = m \\ j \neq i}} b_j - \frac{2 b_i}{w_m} \sum_j b_j + \frac{\mu}{2} \sum_{j: r_j \neq m} c_{ij} \right) - \left( \frac{b_i^2}{w_n^2} + \frac{2 b_i}{w_n^2} \sum_{\substack{j: r_j = n \\ j \neq i}} b_j - \frac{2 b_i}{w_n} \sum_j b_j + \frac{\mu}{2} \sum_{j: r_j \neq n} c_{ij} \right) \quad (9)$$
$$= \tilde{C}_i(m, r_{-i}) - \tilde{C}_i(n, r_{-i}). \quad (10)$$
Thus, we can define a new cost function for each node as given by (6). Note that at each of the locally optimal points of (8), none of the nodes can improve its cost (6) given that the assignments of all other nodes remain constant. Hence, the assignment vectors at these points are Nash equilibria for this game. And since the node decisions always perform descent on (8), convergence to one of the equilibrium points is guaranteed.

5.1. Comparison of the Two Cost Frameworks

The algorithm design and analysis for the second cost framework are similar to those of the first one, as given in Section 4, except that we replace $C_i$ with $\tilde{C}_i$ in the definition of node dissatisfaction given by (4). It is interesting to see how the two local cost-based algorithms fare in terms of optimality with respect to the global cost functions $C_0$ and $\tilde{C}_0$. We performed a simple numerical study on random graphs that represented a network model under simulation, using Netlogo [Tisue and Wilensky 2004], a multi-agent simulator, to generate random graphs of 230 nodes (LPs) to be divided into 5 partitions. The degree of each node varied randomly from 3 to 6. We randomly generated node and edge weights, each with mean 5. The normalized machine speeds ($w_k$) were 0.1, 0.2, 0.3, 0.3, 0.1, and the relative weight of potential rollback-delay was $\mu = 8$. We then performed initial partitioning followed by separate runs of the iterative improvement described in Section 4, one for each cost criterion, until convergence. For a fair comparison between the two frameworks, the initial node-to-machine assignment and the sequence in which the machines took turns were the same for both runs, the only change being the cost criterion employed. Convergence was implied when there was no further improvement in the respective global cost. Table I shows the results for 5 different random realizations of the graph. The values of $C_0$ and $\tilde{C}_0$ given in the table are at convergence. We observed that using one of the cost frameworks did not guarantee a descent in the global cost of the other framework, and vice-versa, thus corroborating our claim that the two cost frameworks are inherently different. Moreover, we can see that for the 5 different realizations of the network graph, the first framework performed better in terms of both global costs $C_0$ and $\tilde{C}_0$. We subsequently performed a batch simulation with 50 random realizations of the graph of the simulated network, and 10 different initial partitions for each realization. We also varied the relative weight $\mu$ and the normalized

Table I. Comparison of the two cost frameworks ($C_0$ and $\tilde{C}_0$)

                     Using $C_i(r_i, r_{-i})$ for              Using $\tilde{C}_i(r_i, r_{-i})$ for
Random Graph         iterative refinement                      iterative refinement
Realization
Trial Number         $C_0$     $\tilde{C}_0$   Iterations      $C_0$     $\tilde{C}_0$   Iterations
                                               to converge                               to converge
1                    457134    4363            42              463130    6692            40
2                    461704    4826            99              471539    9405            79
3                    456260    2920            29              461614    5300            24
4                    472157    2451            15              475814    3976            8
5                    456336    3322            69              463559    6390            51

machine speeds $w_k$. It was observed that in 49 of the 50 runs, the $C_i$ framework performed better in terms of both global costs, whereas in 1 run out of 50, the $\tilde{C}_i$ framework performed better, but even then only with respect to its own global cost function $\tilde{C}_0$. Suppose we define a $C_0$-discrepancy ($\tilde{C}_0$-discrepancy) as an iterative step that increases $C_0$ ($\tilde{C}_0$) while using $\tilde{C}_i$ ($C_i$) as the local criterion for node transfer. Over the 50 runs, we observed that the average number of $C_0$-discrepancies was about 0.2, whereas the average number of $\tilde{C}_0$-discrepancies was about 5.2. It is apparent from these observations that, typically, $C_i$ allows for more “breadth” of search than $\tilde{C}_i$ and thus converges to better solutions for both global costs. Note that $\tilde{C}_i$ needs to explicitly consider load balancing among machines due to its direct relationship to $\tilde{C}_0$, which is a Lagrangian cost with an explicit load balancing constraint. By contrast, moves using $C_i$ are “less constrained” but achieve load balance eventually due to the implicit and latent incentive in the definition of $C_i$ given by (1). Also, as $\mu$ was increased, the number of runs in which the $\tilde{C}_i$ framework performed better increased, but again only in terms of its own global cost ($\tilde{C}_0$). This is due to the fact that the two cost frameworks differ only in terms of the computational load balancing criterion, and an increase in $\mu$ gives less relative weight to the computational load. However, we cannot conclude that the first framework will also perform better in terms of the actual simulation execution time, since $C_0$ and $\tilde{C}_0$ are only approximate representations of it.

6. SIMULATION EXECUTION TIME BASED SOFTWARE MODEL OF THE SIMULATOR

We designed a model of a simulator using Netlogo [Tisue and Wilensky 2004] that mimics an actual parallel simulator. Mathematical models of parallel simulators [Agrawal and Chakradhar 1992] can be used to predict their performance with respect to factors such as load balancing and communication [Xu and Chung 2004]. Our model is based on simulating the LPs as nodes with individual characteristics and then observing the system-level performance of the combined graph of nodes in terms of the simulator execution time. The simulator is based on the causal nature of event generation and the event processing time, which determine the total execution time required to complete the simulation (henceforth referred to as simulation time). The motivating factors to develop such a model were:

(1) To abstract the simulation model across different scenarios as an event generation model defined by the initial event list and the cause-effect relationships between different events. This allows us to check the partitioning algorithm on different scenarios with varying load generation models without specifying the scenarios explicitly.
(2) To abstract the hardware so as to test with different numbers and speeds of machines. An actual distributed simulator will be tuned to work on given specifications of the hardware system, and migrating between different hardware specifications could involve a large cost factor.
(3) To make the evaluation process independent of any parallel simulator software, thus saving the time required to understand the finer details of a parallel simulator software tool.
(4) To focus our attention on the simulation time and the synchronization overhead, which depend largely on the event generation and processing model.

Our goal was to design a software-based program that mimics a discrete event driven optimistic simulator. The input to the program is the partition (LP-to-machine assignment) and the output of the program is the simulation time. The simulation time depends on many factors, such as the event generation model, the event processing model, the initial event list, etc. These are modeled as software parameters in the program.
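The program inputs and “software parameters” just described can be grouped into a single structure. The following Python sketch uses hypothetical field names (the paper does not fix them); it is only meant to show how the partition input is separated from the event-model and hardware parameters:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class SimulatorModelParams:
    # Input to the program: the LP-to-machine assignment under test.
    assignment: List[int]
    # Hardware abstraction: normalized machine capacities w_k.
    machine_speeds: List[float]
    # Event processing model: wall-clock ticks needed per event type.
    process_ticks: Dict[str, int]
    # Event generation model: initial event list as (lp, time stamp, type).
    initial_events: List[Tuple[int, int, str]]
    # Maximum stored history length available for rollbacks.
    history_length: int = 100
    # Event-transfer delays in wall-clock ticks.
    inter_machine_delay: int = 5
    intra_machine_delay: int = 1
```

The simulation time reported by the model is then a deterministic function of one such parameter object, which is what makes it usable as a test bed for the partitioning algorithm.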

Fig. 3. Software based model of a simulator

The simulation of a communication scenario can be described in terms of its equivalent event generation model. We can consider the simulation to be a deterministic process if we know certain parameters, e.g.:

(1) The machine load and speed, which give us the time interval between two instruction cycles.
(2) The function that each logical process performs for each event and the number of instruction cycles needed for each function.
(3) How events are spawned one after the other from the parent event. (Events trigger the execution of functions at each LP.)
(4) The nature of the parallelism (conservative or optimistic).
(5) The maximum length of the memory that stores the past history needed in case of a rollback.
(6) The delay in wall-clock time for event transfer between two LPs (inter-machine and intra-machine).
(7) The assignment of logical processes to machines (which is the input to the program).

6.1. Limited Scope Flooded Packet-flow Model

We studied a simple limited scope flooded packet-flow model using the generic framework of the simulator model. See Appendix B for a discussion of the generic concepts in discrete event simulation and the generic model that is used. In this model, packets are generated at random times by randomly chosen LPs, and these packets flood the network for a limited number of hops, i.e., each node that receives such a packet forwards it to all its neighbors that have not yet received it, provided the hop count given by event-count is not zero. Thus, we have just three types of events: “process-forward”, “process-only” and “rollback” (the default type). In our experiments, we randomly initialize the LPs that generate such a thread of events in such a way that the computation and potential rollback-delay costs are dynamic during the course of the simulation. To be precise, we

Table II. Local variables of an LP

local-time            Contains the time stamp of the event being processed
state                 Local variables of the LP
event-list            List containing the thread number of each event
event-time            List containing the time stamp of each event
event-type            List containing the type of each event
event-tick            List containing the waiting time in wall-clock units before execution of each event
event-count           List containing the hop count of each event thread
event-list-history    List containing the thread numbers of the processed events
event-time-history    List containing the time stamps of the processed events
event-type-history    List containing the types of the processed events
event-tick-history    List containing the waiting times in wall-clock units before execution of the processed events
event-count-history   List containing the hop counts of the event threads of the processed events
busy-tick             The time remaining in wall-clock units to finish processing the current event
status?               A boolean variable indicating whether the LP is busy processing an event or idle
my-machine            Current machine assigned to the LP
sim-time              Time in wall-clock ticks needed for processing the current event

Table III. Global variables

global-time             Lowest time stamp across all LPs and the events in their event-lists
partition-refine-freq   The time in wall-clock ticks between two refinements

generate “hot spots” of traffic, i.e., clusters of nodes that generate large amounts of traffic over a short period of (simulation) time. The locations of these hot spots change regularly. In our experiments, we vary the frequency of partition refinements and observe the time needed for the simulation. Each partition refinement consists of sequential and coordinated iterative node transfers by the machines until convergence. The node and edge weights are updated before each refinement step. In our model, machine speed is defined by a simple inverse relation, i.e., the speed of a machine is inversely proportional to the number of LPs residing on it. The node weight represents the computation load generated by the node, or LP. It varies with time and is equal to the size of the event list at that time. The weight of the edge between node $i$ and node $j$ is equal to the sum of the number of events in $i$ and $j$ that generate events in $j$ and $i$, respectively. We used two different models of random graphs to represent the network simulation under study. The first model is a preferential attachment model, which is scale-free and, according to some studies, can be used to model the Internet topology at the level of Autonomous Systems (AS) [Bu and Towsley 2002]. For such a model, we can observe in Figure 7 the decrease in the total simulation time as we increase the frequency of refinement. The second model also incorporates geometric information, i.e., each node has associated coordinates in two dimensions: while forming links, each node randomly chooses another from among the set of its 15 closest nodes (in terms of distance with respect to the coordinate axes). Figure 8 shows similar results for this model. We also observed that the first local cost framework works better than the second. At the same time, we also kept track of the load generated at the machines. The load is the average length of the event lists of the LPs on a machine, where the average is calculated across the LPs residing on that machine. Figures 9 and 10 show the machine loads as they vary with the simulation time (in wall-clock ticks) with and without partition refinements. The load with regular refinements certainly looks more balanced.

7. SUMMARY AND FUTURE WORK

We studied distributed load balancing for distributed network simulation, considering in particular the case where the load generated during simulation may be dynamic. For such a scenario, we proposed iterative coordinated sequential partition refinement by the participating machines, based on a local cost at the node level with feasible machine-to-machine communication overhead. We proposed and compared two such game-theoretic cost frameworks, each aiming to model the simulator execution time. We modeled, using Netlogo, an archetype of an optimistic event-driven parallel simulator, and ran our distributed partitioning algorithm on randomly generated graphical models of LPs (preferential attachment and specialized geometric models). We observed that one framework performs better than the other, and we have plausibly explained why this is the case. In the future, we will explore methods to overcome poor local minima, e.g., by transferring clusters (groups of connected nodes) instead of single nodes, in an attempt to find Pareto (instead of just Nash) equilibria.

for each event in event-list-history {
    if event-time > time of current-event then
        Restore event data event-list, event-time, event-type, event-tick, event-count
        Delete event data from event history lists
    else if current-event exists in event-list-history of neighbor
        Send rollback event to the neighbor
    end if
}
status? ← busy
busy-time ← number of nodes with my-machine equal to that of myself × get process time(event type of current-event)
Put all data of current-event in event history lists

Fig. 4. Process noncausal event()

for each event in event-list-history {
    if event-time > time of current-event then
        Restore event data event-list, event-time, event-type, event-tick, event-count
        Delete event data from event history lists
        for each neighboring node {
            if current-event exists in event-list of neighbor
                Delete the event
            else if current-event exists in event-list-history of neighbor
                Send rollback event to the neighbor
            end if
        }
    end if
}

Fig. 5. Process rollback event()

APPENDIX A: Initial Partitioning Algorithm

In this appendix, we describe the initial partitioning algorithm. As mentioned previously, the choice of focal nodes is important to get a good initial partition. The focal nodes are chosen so that they are at maximum geodesic distance from each other. Ideally, we should find focal nodes such that each one of them is at least $2 N_{|V|/K}$ geodesic distance away from the others, where $N_{|V|/K}$ is the mean number of hops that cover $|V|/K$


Initialize global variables.
ask nodes {
    Initialize local variables event-list, event-time, event-type, event-tick, event-count.
}
sim-time ← 0
partition-refine-freq ← curr-partition-refine-freq
while event-list of all nodes not empty do
    ask nodes {
        Remove events from event-list-history with event-time < global-time
        Remove rollback events from event-list-history if event-time > local-time
        if status? = idle then
            current-event ← event in event-list with lowest time stamp and event-tick = 0
            if current-event ≠ rollback and time stamp of current-event ≥ local-time then
                status? ← busy
                busy-time ← number of nodes with my-machine equal to that of myself × get process time(event type of current-event)
                Put all data of current-event in event history lists
            else if current-event ≠ rollback and time stamp of current-event < local-time
                Process noncausal event()
            else if current-event = rollback then
                Process rollback event()
            end if
        else
            busy-time ← busy-time − 1
            if busy-time = 0
                status? ← idle
                if event-count of current-event ≠ 0 then
                    for each neighbor {
                        if current-event not present in event list or history of events of neighbor then
                            Add event to neighbor's event data
                        end if
                    }
                end if
            end if
        end if
        for each event in event-list {
            if event-tick ≠ 0 then
                event-tick ← event-tick − 1
            end if
        }
    }
    global-time ← minimum of local-time and time stamps of events in event-list across all nodes
    if remainder of sim-time / partition-refine-freq = 0
        Measure node weights and edge weights of the LP graphical model and refine the partition
    end if
end while

Fig. 6. Simulator pseudo-code for limited scope flooded packet flow
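Two bookkeeping steps from Fig. 6, the global-time update at the bottom of each pass and the history flush at the top, can be sketched directly. In this minimal Python rendition an LP is a plain dict whose keys follow Tables II and III; the function names are ours:

```python
def global_virtual_time(lps):
    # global-time = minimum over all LPs' local times and the time stamps
    # of all pending (unprocessed) events in their event lists.
    times = []
    for lp in lps:
        times.append(lp["local-time"])
        times.extend(lp["event-time"])
    return min(times)

def flush_history(lp, gvt):
    # Drop processed events whose time stamp is below the global time:
    # an LP can never be rolled back to a state earlier than global-time.
    keep = [t >= gvt for t in lp["event-time-history"]]
    for key in ("event-list-history", "event-time-history",
                "event-type-history"):
        lp[key] = [x for x, ok in zip(lp[key], keep) if ok]
```

Keeping the history trimmed to the interval above the global time is what bounds the rollback memory of each LP.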


Fig. 7. Iterative refinements and simulation time (Preferential Attachment Model)

Fig. 8. Iterative refinements and simulation time (Specialized Geometric Model)

nodes. In this case, we should know the properties of the underlying graph in order to calculate $N_{|V|/K}$. We find this value for the example of an Erdős–Rényi random graph model.

THEOREM A.1. For an Erdős–Rényi random graph $G = (V, E)$ with $p$ the probability of existence of an edge between any two nodes, if $N_k$ and $N_{k-1}$ are the expected numbers of nodes collected into a cluster by the $k$th hop and the $(k-1)$th hop respectively, then the expected number of nodes in the cluster by the $(k+1)$th hop is:
$$N_{k+1} = N_k + (|V| - N_k)\left(1 - (1 - p)^{N_k - N_{k-1}}\right) \text{ for } k > 0, \qquad N_{k+1} = 1 \text{ for } k = 0.$$

PROOF. Suppose that after the $k$th hop the graph is divided into two sets of nodes: the set $A$ of nodes obtained by the $k$th hop and the set $A'$ of nodes not yet obtained. Let $|A| = N_k$ and $|A'| = |V| - N_k$. Denote the set of nodes newly acquired in the $k$th hop by $B$, where $|B| = N_k - N_{k-1}$. For any node $a \in A'$: $P(a \text{ is not connected to any node in } B) = (1 - p)^{N_k - N_{k-1}}$ and $P(a \text{ is connected to at least one node in } B) = 1 - (1 - p)^{N_k - N_{k-1}}$.
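The recursion in Theorem A.1 is easy to evaluate numerically; a short sketch (the function name is ours):

```python
def expected_cluster_sizes(n, p, hops):
    # Expected cluster sizes N_1, N_2, ... for hop-by-hop expansion from a
    # single focal node in an Erdos-Renyi graph G(n, p), per Theorem A.1:
    #   N_{k+1} = N_k + (n - N_k) * (1 - (1 - p)^(N_k - N_{k-1})).
    sizes = [1.0]   # N_1 = 1: the focal node itself (the k = 0 case)
    prev = 0.0      # N_0 = 0: nothing is covered before the focal node
    for _ in range(hops):
        newly = (n - sizes[-1]) * (1 - (1 - p) ** (sizes[-1] - prev))
        prev = sizes[-1]
        sizes.append(sizes[-1] + newly)
    return sizes
```

A table produced this way can be used to estimate the hop radius at which a cluster is expected to cover $|V|/K$ nodes, i.e., to estimate $N_{|V|/K}$.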


Fig. 9. No iterative refinement after initial partitioning

Fig. 10. Iterative refinement after every 500 ticks of wall-clock time

The number of nodes acquired during the $(k+1)$th hop is binomial, with $(|V| - N_k)$ trials and probability of success $1 - (1 - p)^{N_k - N_{k-1}}$. Hence the expected number of nodes acquired during the $(k+1)$th hop is $(|V| - N_k)(1 - (1 - p)^{N_k - N_{k-1}})$.

More practically, we would like to have the focal nodes as far as possible (in geodesic distance) from each other. This will ensure that the partitions formed after hop-by-hop expansion are somewhat equal in the number of nodes per partition. Suppose $F = \{f_1, f_2, \ldots, f_K\}$ is the set of focal nodes. Hence we would like to have
$$F = \underset{H \subseteq V \text{ s.t. } |H| = K}{\arg\max} \;\; \min_{h, l \in H: \, l \neq h} d_G(h, l), \quad (11)$$
where $d_G$ is the geodesic distance between two nodes. We can attempt to find such nodes using heuristics. For example, start by assigning an arbitrary set of distinct nodes to the machines. In round-robin fashion, each machine takes a turn at finding a node, from among the neighbors of its current focal node, that increases the minimum of its geodesic distances to the focal nodes of the other machines. This becomes the new focal node for that machine. The process is iterated until no further improvement is possible. We repeat this process over multiple initializations of the focal node set, and the best set of focal nodes is identified. In the next phase, starting at


the focal nodes, the partitions try to collect in nodes in their neighborhood, thus expanding their clusters. We can use random waiting times between two successive hops, and semaphores, to deal with contention issues, i.e., when two or more machines try to claim ownership of the same node.

APPENDIX B: Background on Optimistic Parallel Simulation

A.1. Basic Concepts of Discrete Optimistic Simulation

We now briefly describe certain generic elements of optimistic discrete event-driven simulation. The time of execution of events is in terms of simulation time, while the total time required for the simulation is in terms of wall-clock time. Each LP maintains a set of variables that change when an event is processed. The values of these variables at a given time define the state of the LP at that time. An LP maintains a local (simulation) time variable that contains the time stamp of the event currently being processed (if busy) or of the most recent event processed (if idle). Events are time-stamped with their time of execution (in simulation time). Each LP maintains a list of events to be processed. Optimistic synchronization allows an LP to execute events ahead of time, i.e., to advance its local time to the time stamps of “future” events. If any non-causality is detected, i.e., if an event with a time stamp lower than the local time of the LP arrives, then the LP rolls back its local time. Rolling back means restoring a state prior to the time stamp of the event that triggered the rollback. In order to roll back to a prior state, an LP must store the past history of states and events. The combined system of LPs maintains a global variable called global-time, which is equal to the minimum over the local times of all the LPs and the time stamps of all the events in the event-lists of the LPs. Hence the global time is indicative of the progress of the simulation.

A.2. Event Generation and Processing Model

In our model, each thread of events has a unique number, which is stored in the event-list variable list. This helps us track the spread of packets in the limited scope flooded packet-flow scenario. The time stamps of the events are stored in the event-time variable list. The time stamp is the simulation time at which the event is supposed to occur. The event-type variable tells the LP what its reaction to the event should be; generally, there is a specified procedure call for every type of event. A rollback event is one of the types by default. In order to model the potential rollback-delay (in wall-clock time) between two LPs, we use the variable event-tick, which is programmed to a value by the LP that generates the event, depending on the potential rollback-delay. At every tick of wall-clock time, it decrements by one unit until it reaches zero. An event can be processed by an LP only if its corresponding event-tick variable is zero. The variables event-list-history, event-time-history, event-type-history, event-tick-history and event-count-history are the lists containing the data of the events already processed by the LP. In the case of a rollback, the LP restores its prior state from these variables. The LP regularly flushes out from the history the data of events with time stamps less than the global time, because the LP will never have to roll back to a state prior to the global time. Note that in such a model of the simulator, the functions that actually process events are generic. All that is needed is the time in wall-clock ticks, stored in the variable sim-time, required for processing the event, and the neighboring nodes where new events will be created as a result of processing this event. The sim-time is a function of the speed of the machine on which the LP resides, given by my-machine, which in turn depends on the number of LPs that reside on that machine.
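The event-tick gating described above can be sketched directly. Here an LP is again a plain dict whose keys follow Table II; the function names are ours:

```python
def tick(lp):
    # One tick of wall-clock time: each pending event's event-tick counter
    # decrements toward zero (modeling transfer/rollback delay between LPs).
    lp["event-tick"] = [max(0, t - 1) for t in lp["event-tick"]]

def next_eligible(lp):
    # Index of the pending event with the lowest time stamp among those
    # whose event-tick counter has reached zero; None if no event is ready.
    ready = [(ts, i) for i, (ts, tk)
             in enumerate(zip(lp["event-time"], lp["event-tick"])) if tk == 0]
    return min(ready)[1] if ready else None
```

An idle LP would call `next_eligible` each wall-clock tick, so an event created with a large event-tick value only becomes processable after the corresponding delay has elapsed.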

APPENDIX C: Proof of Theorem 3.1

We can express $C_0(\mathbf{r})$ as the aggregation of four partial sums: i) the cost of node $l$ itself; ii) the sum of the costs of nodes that belong to processor $k_1$; iii) the sum of the costs of nodes that belong to processor $k_2$; and iv) the sum of the costs of those nodes that belong neither to processor $k_1$ nor to processor $k_2$. In the three steps given below, we show that i), ii) and iii) decrease as node $l$ changes its assignment to minimize its (node-level) cost, while iv) is left unchanged. Thus, $C_0(\mathbf{r})$ decreases, contradicting our assumption that $\hat{\mathbf{r}}$ is the global optimum (minimum) solution of $C_0$. We can divide the objective function as follows:
$$C_0(\mathbf{r}) = \left( \frac{b_l}{w_{r_l}} \sum_{\substack{j: r_j = r_l \\ j \neq l}} b_j + \frac{\mu}{2} \sum_{j: r_j \neq r_l} c_{lj} \right) + \sum_{\substack{i: i \neq l \\ r_i = k_1}} \left( \frac{b_i}{w_{r_i}} \sum_{\substack{j: r_j = r_i \\ j \neq i}} b_j + \frac{\mu}{2} \sum_{j: r_j \neq r_i} c_{ij} \right) + \sum_{\substack{i: i \neq l \\ r_i = k_2}} \left( \frac{b_i}{w_{r_i}} \sum_{\substack{j: r_j = r_i \\ j \neq i}} b_j + \frac{\mu}{2} \sum_{j: r_j \neq r_i} c_{ij} \right) + \sum_{\substack{i: i \neq l, \, r_i \neq k_1 \\ r_i \neq k_2}} \left( \frac{b_i}{w_{r_i}} \sum_{\substack{j: r_j = r_i \\ j \neq i}} b_j + \frac{\mu}{2} \sum_{j: r_j \neq r_i} c_{ij} \right).$$
Thus,
$$C_0(\mathbf{r}) = C_l(\mathbf{r}) + \sum_{\substack{i: i \neq l \\ r_i = k_1}} C_i(\mathbf{r}) + \sum_{\substack{i: i \neq l \\ r_i = k_2}} C_i(\mathbf{r}) + \sum_{\substack{i: i \neq l, \, r_i \neq k_1 \\ r_i \neq k_2}} C_i(\mathbf{r}).$$

The first term is the cost of node l, the second term is sum of the costs of the nodes that are assigned to the former machine of node l (except node l), the third term is the sum of the costs of the nodes that belong to the prospective machine of node l (except node l) and the fourth term is the sum of the costs of the machines that belong neither to the current machine k1 of node l nor to the prospective machine k2 of node l. ∗ ), where ri∗ = rˆi ∀ i 6= l. Let the new assignment vector be r∗ = (r1∗ , r2∗ , ..., rN   X  X C0 (r∗ ) − C0 (ˆr) = Cl (r∗ ) − Cl (ˆr) + Ci (r∗ ) − Ci (ˆr) i:i6=l ri∗ =ˆ rl

+

X

i:i6=l rˆi =ˆ rl

  X Ci (r∗ ) − Ci (ˆr) +

i:i6=l ri∗ =rl∗

X

Ci (r∗ ) −

i:i6=l,ri∗ 6=rˆl ri∗ 6=rl∗

i:i6=l rˆi =rl∗

X

Ci (ˆr)



.

i:i6=l,ˆ ri =ˆ rl rˆi 6=rl∗

(12) STEP 1: By assumption, we know that Cl (r∗ ) − Cl (ˆ r) < 0. STEP 2: We can also deduce that X i:i6=l,ri∗ 6=rˆl ri∗ 6=rl∗

Ci (r∗ ) −

X

Ci (ˆr) = 0,

i:i6=l,ˆ ri =ˆ rl rˆi 6=rl∗

because this term represents the sum of the change in the cost values of nodes which belong to neither of the two machines involved in the transfer of node l. STEP 3: ACM Transactions on Modeling and Computer Simulation, Vol. 0, No. 0, Article 0, Publication date: 0.

0:22

A. Kurve et al.

Next we show that X

Ci (r∗ ) −

i:i6=l ri∗ =ˆ rl

X

 X  X Ci (ˆr) + Ci (r∗ ) − Ci (ˆr) < 0.

i:i6=l rˆi =ˆ rl

i:i6=l ri∗ =rl∗

i:i6=l rˆi =rl∗

First, X i:i6=l ri∗ =ˆ rl

Ci (ˆr) =

i:i6=l rˆi =ˆ rl

bi wri∗

X i:i6=l ri∗ =ˆ rl

=

X

Ci (r∗ ) −

X

j:rj∗ =ri j6=i

bi wrˆl

X i:i6=l ri∗ =ˆ rl

µ X cij bj + 2 ∗ ∗ ∗

! −

j:rj 6=ri

X j:rj∗ =ˆ rl j6=i

bi X µ X bj + cij wrˆi 2

X i:i6=l rˆi =ˆ rl

µ X bj + cij 2 ∗

j:ˆ rj =ˆ ri j6=i

! X



j:rj 6=rˆl

i:i6=l rˆi =ˆ rl

!

j:ˆ rj 6=rˆi

! bi X µ X bj + cij . wrˆl 2 j:ˆ rj 6=rˆl

j:ˆ rj =ˆ rl j6=i

Note that the two sets {i : i 6= l, rˆi = rˆl } and {i : i 6= l, ri∗ = rˆl } are equal because all other assignments except rl are the same in the new assignment vector r∗ . Thus, X X Ci (r∗ ) − Ci (ˆr) = i:i6=l ri∗ =ˆ rl

i:i6=l rˆi =ˆ rl

bi wrˆl

X i:i6=l ri∗ =ˆ rl

X

=

i:i6=l ri∗ =ˆ rl

X j:rj∗ =ˆ rl j6=i

bi X µ X µ X bj − cij − cij bj + wrˆl 2 ∗ 2 j:rj 6=rˆl

j:ˆ rj =ˆ rl j6=i

!

j:ˆ rj 6=rˆl

!   µ X X X bi  X cij − cij . bj − bj + wrˆl 2 ∗ ∗ j:rj 6=rˆl

j:ˆ rj =ˆ rl j6=i

j:rj =ˆ rl j6=i

j:ˆ rj 6=rˆl

For i 6= l, {j : rj∗ = rˆl , j 6= i} represents the set of nodes except the lth and ith node which are currently assigned to rˆl th machine and {j : rˆj = rˆl , j 6= i} represents the set of nodes except the lth and ith node whose new assignment is the rˆl th machine. So, for i 6= l, {j : rj∗ = rˆl , j 6= i} \ {j : rˆj = rˆl , j 6= i} = l. This implies X X bj − bj = −bl . j:rj∗ =ˆ rl j6=i

j:ˆ rj =ˆ rl j6=i

Also, {j : rj∗ 6= rˆl } \ {j : rˆj 6= rˆl } = l. Thus, we can write X X cij − cij = cil , j:rj∗ 6=rˆl

j:ˆ rj 6=rˆl

which implies, X i:i6=l ri∗ =ˆ rl



Ci (r ) −

X i:i6=l rˆi =ˆ rl

Ci (ˆr) =

X i:i6=l ri∗ =ˆ rl

!  b  µ i + cil . −bl wrˆl 2

ACM Transactions on Modeling and Computer Simulation, Vol. 0, No. 0, Article 0, Publication date: 0.

Game Theoretic Iterative Partitioning for Dynamic Load Balancing in Distributed Network Simulation0:23

Using a similar approach, we can show: X



Ci (r ) −

i:i6=l ri∗ =rl∗

X

Ci (ˆr) =

i:i6=l rˆi =rl∗

X i:i6=l ri∗ =rl∗

!  b  µ i bl − cil . wrl∗ 2

So, X

Ci (r∗ ) −

i:i6=l ri∗ =ˆ rl

 X  X Ci (ˆr) + Ci (r∗ ) − Ci (ˆr) =

X i:i6=l rˆi =ˆ rl

i:i6=l ri∗ =rl∗

 b  µ i + cil −bl wrˆl 2

X i:i6=l ri∗ =ˆ rl

(13)

i:i6=l rˆi =rl∗

! +

X i:i6=l ri∗ =rl∗

!  b  µ i − cil . bl wrl∗ 2

Now consider X

Cl (r∗ ) − Cl (ˆ r) = bl

i:ri∗ =rl∗ i6=l

X bi bi µ X µ X cil − bl + − cil . wrl∗ 2 ∗ ∗ wrˆl 2 i:ˆ ri =ˆ rl i6=l

i:ri 6=rl i6=l

i:ˆ ri 6=rˆl i6=l

We know that, X

X

cil −

i:ri∗ 6=rl∗ i6=l

X

cil =

cil +

ri∗ =ˆ rl i6=l

i:ˆ ri 6=rˆl i6=l

X

cil −

i:ri∗ 6=rl∗ ri∗ 6=rˆl i6=l

X

cil −

rˆi =rl∗ i6=l

X

cil .

(14)

i:ˆ ri 6=rˆl rˆi 6=rl∗ i6=l

However, X

cil =

i:ri∗ 6=rl∗ ri∗ 6=rˆl i6=l

X

cil .

i:ˆ ri 6=rˆl rˆi 6=rl∗ i6=l

Thus we see: X i:ri∗ 6=rl∗ i6=l

cil −

X i:ˆ ri 6=rˆl i6=l

cil =

X

cil −

ri∗ =ˆ rl i6=l

⇒ Cl (r∗ ) − Cl (ˆ r) = −bl

X i:ˆ ri =ˆ rl i6=l

X

cil

rˆi =rl∗ i6=l

X bi µ X µ X bi + cil + bl − cil . wrˆl 2 ∗ wrl∗ 2 ∗ ∗ ∗ ri =ˆ rl i6=l

i:ri =rl i6=l

rˆi =rl i6=l

But we also know that {i : rˆi = rˆl , i 6= l} = {i : ri∗ = rˆl , i 6= l} and {ˆ ri = rl∗ , i 6= l} = ∗ = rl , i 6= l}. So,

{ri∗

ACM Transactions on Modeling and Computer Simulation, Vol. 0, No. 0, Article 0, Publication date: 0.

0:24

A. Kurve et al.

Cl (r∗ ) − Cl (ˆ r) = −bl

X i:ri∗ =ˆ rl i6=l

=

X i:i6=l ri∗ =ˆ rl

=

X

 b  µ i −bl + cil wrˆl 2 Ci (r∗ ) −

i:i6=l ri∗ =ˆ rl

X

X bi µ X µ X bi − + cil + bl cil wrˆl 2 ∗ wrl∗ 2 ∗ ∗ ∗ ∗ ri =ˆ rl i6=l

! +

X i:i6=l ri∗ =rl∗

i:ri =rl i6=l

 b  µ i bl − cil wrl∗ 2

ri =rl i6=l

!

 X  X Ci (ˆr) + Ci (r∗ ) − Ci (ˆr) .

i:i6=l rˆi =ˆ rl

i:i6=l ri∗ =rl∗

The last step follows from (13). Thus we have shown: X  X  X X Ci (r∗ ) − Ci (ˆr) + Ci (r∗ ) − Ci (ˆr) = Cl (r∗ ) − Cl (ˆ r) < 0 i:i6=l ri∗ =ˆ rl

i:i6=l rˆi =ˆ rl

i:i6=l ri∗ =rl∗

(15)

i:i6=l rˆi =rl∗

(16)

i:i6=l rˆi =rl∗
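The identity derived in STEP 3 — that a unilateral move by node $l$ changes the global cost by exactly twice the change in node $l$'s own cost — can be checked numerically. The sketch below evaluates $C_0$ and $C_l$ directly on a toy instance; all loads, machine speeds and communication costs are hypothetical illustrative values, not taken from the paper's experiments.

```python
import itertools

# Hypothetical toy instance: 5 nodes (LPs), 3 machines.
b = [2.0, 1.0, 3.0, 1.5, 2.5]   # node computational loads b_i
w = [1.0, 2.0, 1.5]             # machine speeds w_k
mu = 0.7                        # communication/computation trade-off weight
N = len(b)

# Symmetric communication costs c_ij with c_ii = 0.
c = [[0.0] * N for _ in range(N)]
for i, j in itertools.combinations(range(N), 2):
    c[i][j] = c[j][i] = 0.1 * (i + j + 1)

def C_node(i, r):
    # Node-level cost: weighted load of co-located nodes plus half the
    # communication cost to nodes on other machines.
    load = (b[i] / w[r[i]]) * sum(b[j] for j in range(N) if j != i and r[j] == r[i])
    cut = (mu / 2.0) * sum(c[i][j] for j in range(N) if r[j] != r[i])
    return load + cut

def C_global(r):
    return sum(C_node(i, r) for i in range(N))

r_hat = [0, 0, 1, 2, 1]   # assignment before the move
l, k2 = 0, 2              # node l moves from machine r_hat[l] = 0 to machine 2
r_star = list(r_hat)
r_star[l] = k2

delta_global = C_global(r_star) - C_global(r_hat)
delta_node = C_node(l, r_star) - C_node(l, r_hat)
assert abs(delta_global - 2.0 * delta_node) < 1e-12
```

Because the communication costs are symmetric, this identity holds for any unilateral move, whether or not it lowers node $l$'s cost; when the move does strictly lower it, the global cost drops by twice that amount, which is the contradiction used in the proof.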

This completes the proof of STEP 3. Hence, using Steps 1, 2 and 3, $C_0(\mathbf{r}^*) - C_0(\hat{\mathbf{r}}) = 2\big(C_l(\mathbf{r}^*) - C_l(\hat{\mathbf{r}})\big) < 0$. This contradicts our assumption that $\hat{\mathbf{r}}$ is a globally optimal solution of $C_0$. Hence $\hat{\mathbf{r}}$ is also a Nash equilibrium in pure strategies. $\blacksquare$

ELECTRONIC APPENDIX

The electronic appendix for this article can be accessed in the ACM Digital Library.

