DSSimulator: Achieving Million Node Simulation of Distributed Systems
Jawwad Shamsi, Monica Brockmeyer
Department of Computer Science, Wayne State University, Detroit, MI 48202, USA
jshamsi, [email protected]
Keywords: Overlay Networks, Distributed Systems, Scalability, Route Caching, Internet topology.
Abstract Simulation of distributed systems is different from the simulation of networking applications, requiring a more scalable and application-oriented model. The widespread use of distributed applications imposes a requirement for a simulation platform on which these applications can be easily deployed and tested. We introduce an object-oriented model which provides a scalable and flexible platform for simulating distributed applications. It avoids the complexities of packet-level features and implements breadth first search as the channel computation technique for end-to-end message delivery. We show, by deploying several distributed applications, that our scheme can successfully simulate over 1 million nodes on a single workstation with a 1.8 GHz CPU and 1 GB RAM.

1 INTRODUCTION
(This material is based upon work supported by the National Science Foundation under grant ANI-0347222.)

An overlay system is a distributed system in which an application is deployed on top of an existing underlying network, such as the Internet. Recent technological advancements have resulted in widespread and extensive usage of overlay systems, which now provide one of the most popular platforms for the deployment of distributed applications. Evaluation of overlay systems, e.g., peer-to-peer systems, caching systems, wide-area replication systems, and content distribution systems, is currently a significant challenge. Such systems may have thousands to tens of thousands of interacting nodes deployed in the Internet. Their evaluation requires observation under realistic conditions, including a realistic topology, realistic latency and bandwidth conditions, and dynamic failure conditions. Of course, the most realistic evaluation environment is achieved by actual deployment in the Internet. Many efforts have been made to provide a distributed Internet testbed to deploy and evaluate these emerging ideas, such as PlanetLab [5]. However, the presently available distributed
test beds are not yet sufficiently large to permit testing of tens of thousands of overlay nodes. Further, real-world deployment of test protocols risks undesirable consequences in the Internet. Also, the tools to administer and deploy protocols in these environments are still evolving and do not yet permit easy deployment on more than tens of nodes. Moreover, since the Internet is highly dynamic and subject to sudden changes in behavior, the evaluation environment is not under the researcher's control (or even complete understanding). As a result, it may be difficult to identify and understand the phenomena that affect the performance of the protocol under evaluation. Cluster-based deployment may mitigate the complexity of administration and deployment, but the network is generally highly over-provisioned and uniform in its performance characteristics, failing to reflect the complexities of the underlying Internet. To remedy these challenges, most researchers complement real-world deployment (which serves as a sanity check) with simulation. Simulation permits user control and understanding of the test environment. Networking applications are frequently evaluated using packet-level simulators, such as ns2 [4], which provide simulation of many network characteristics, such as propagation, transmission, and queuing delays, and detailed simulation of transmission and routing protocols. While such detail adds to the fidelity of the simulation environment with respect to the real Internet, it comes at the cost of scalability, due to the large memory required to implement routing tables (generally O(n^2) with respect to the total network size) and CPU processing time, which also increases quadratically [3] [7]. As a consequence, ns and similar tools generally permit simulation on the order of a few hundred nodes. Parallel simulation approaches may mitigate this challenge somewhat, but require a significant hardware investment.
Our approach is based on the observation that many distributed systems, including overlay applications, do not require simulation of packet-level features and routing characteristics; their main requirement is to simulate the application over a large number of nodes at the application layer of the OSI model. These challenges have emphasized the need for an application-oriented simulation platform for
distributed systems. Rather than simulate at the packet level, it suffices to model end-to-end characteristics, such as latency, bandwidth, message loss, and failure rates. However, in order to achieve realistic approximations of these characteristics, they must be modeled with respect to an underlying target topology, in this case the Internet. In this paper, we investigate these challenges and present a new model for the simulation of large-scale distributed systems, including overlay applications. Our model implements a flexible approach and provides support for various features dedicated to overlay networks. DSSimulator avoids implementing routes on a per-hop basis, and stores channel characteristics using an adjacency list of nodes and edges. Channels are computed using breadth first search on the adjacency list and are also stored in a cache for repeated use. This avoids repetitive processing, gives faster execution, and reduces the memory requirement to O(n + e) for the adjacency list and O(m^2) for the cache, where m is the number of overlay nodes. Different channel characteristics, such as latency, bandwidth and failure rate, can be stored in the adjacency list and cache. We also implement a user-defined node- and edge-level failure model for DSSimulator and show that our model can simulate different kinds of distributed applications of over 1 million nodes on a 1.8 GHz Intel Xeon machine with 1 GB RAM. As an application of DSSimulator, we have implemented and validated a protocol for tactical construction of overlay networks [9]. The remainder of this paper is organized as follows. Section 2 describes related work; section 3 introduces our model; section 4 presents the implementation of our design; followed by performance evaluation in section 5 and conclusion and future work in section 6.

2 RELATED WORK
Much of the current and active research on network simulation is focused on simulating network- and packet-level characteristics, while our goal is to achieve greater scalability for application-level simulation. ns2 [4] is a popular and powerful discrete-event tool, widely used for simulation of TCP, routing, and multicast protocols over wired and wireless (local and satellite) networks. Since ns2 implements packet-level simulation, the time it takes to run a simulation, the computing power, and the memory required are proportional to the number of packets generated. Thus a user experiences severe performance degradation as the number of nodes increases. In practice, the number of nodes which can be simulated effectively by ns is limited to a few thousand at most. pdns [8] is the parallel version of ns2. It has been successfully tested on a total of 128 processors, achieving simulations of as many as 500,000 nodes. The Scalable Simulation Framework is introduced in [2]. It uses the Domain Modeling Language to simulate large-scale networks. The parallel version (DaSSF) can be used to simulate a few hundred thousand nodes on a parallel platform. The above-mentioned simulation platforms are suitable for network- or packet-level simulation, where the main goal is to simulate the application/protocol under different network characteristics, such as variable QoS, variable packet sizes, per-hop routing and network-level characteristics; additionally, they suffer from scalability problems. The parallel versions of these models solve the scalability problem to some extent, but require substantial system resources. In contrast, our approach focuses on simulating the protocol at the application layer of the OSI model and achieves scalability by eliminating the details not necessary to simulate end-to-end channel characteristics. Some research is closer to ours in that some network- and packet-level details are abstracted away or modeled analytically rather than fully simulated. ModelNet [10] is a cluster-based emulation environment for wide-area systems. It provides a flexible platform which permits tuning the emulation between greater detail and higher scalability. Our approach is similar in that we eliminate network-level details by providing a higher level of abstraction for simulation. Huang [3] introduces abstraction techniques for multicast systems, implementing a hybrid approach in which an analytical model is used to avoid node-by-node or link-by-link packet transmission. Our approach is simpler than theirs, resulting in significantly greater scalability. Additionally, we propose channel caching to avoid repetitive computation, which is likely to further increase performance. Riley [7] presents a neighbor-index vector routing technique to compute routes on an underlying substrate network and to cache these routes for later use. GTNetS [6] is implemented using this technique, providing an efficient, parallel simulation platform.
While their approach is similar to ours in that they pre-compute and cache channel characteristics, they compute a full routing path, which requires log2 N bits, and they actually simulate routing using the calculated neighbor-index vector. In contrast, we store only the necessary end-to-end characteristics (generally latency and success probability) and do not explicitly route messages. While it appears to be common practice to write custom discrete-event simulators to evaluate large-scale distributed systems, the issues of which abstractions are appropriate and how best to achieve those abstractions seem to have received little attention in the literature.
3 SIMULATION MODEL
We propose DSSimulator, a scalable discrete-event network simulator, to provide large-scale simulation of distributed systems. Because we make certain simplifying assumptions in our calculation of end-to-end channel properties, the application must satisfy several assumptions. These include:
• The network topology is static. That is, once the topology is created, no new nodes or routes are added to the network. However, nodes and edges are allowed to fail, and they possess varying failure rates. Note that this does not prevent new applications from being deployed or new nodes from joining an application-level overlay.
• The network path from sender to receiver always remains the same.
• Applications do not access route information and only need channel aggregates for end-to-end message delivery.
Our model is implemented in C++ and uses an object-oriented paradigm. Each node is assigned a unique numerical identifier (similar to an IP address in the Internet), which identifies it to other nodes. Two different types of nodes can be deployed in the simulator: an underlay (network) node and an overlay (application) node. An overlay node is a derived class of the underlay node and inherits all of its characteristics. Since the nodes are C++ objects, a user can deploy additional services in the overlay and can also create different types of application nodes, each of which is a derived class of the overlay node and inherits all of its characteristics. To achieve high scalability, DSSimulator avoids per-hop routing and network edge-level characteristics. The path from the application source node to the application destination node (called a channel) is computed using a breadth first search algorithm and is used for end-to-end message delivery. A UDP-style connectionless messaging system is implemented, in which a message from sender to receiver is not guaranteed to be delivered. However, unlike UDP, application-level messages are neither fragmented nor routed, and irrespective of the size of the message, the complete message is delivered in a single packet.
This assumption provides simplicity and helps us gain high scalability using minimal system resources; further, it is unlikely to affect the performance of the distributed application. All communication among nodes is established through the exchange of system-defined messages. Each system-defined message is a C++ object and contains the identifiers of the sender and recipient nodes. Each message is marked with a time stamp at which the message is supposed to be executed, and also contains a pointer of type void, which can point to any user-specified message. A user can create an application-specific message and encapsulate it within the system message. Messages are identified using the message identifier field. The core of the model is implemented by the simulator object, which maintains references to all the nodes and implements an adjacency list and cache for message routing.
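The paper does not give the class definitions themselves; a minimal sketch of the system message and the node hierarchy described above, with hypothetical names and fields that are not the authors' actual API, might look like:

```cpp
#include <cstdint>
#include <vector>

// System-defined message: identifies sender and recipient nodes, carries
// the simulated time at which it must be executed, a message identifier
// for classification, and an opaque pointer to a user-specified payload.
struct SystemMessage {
    uint32_t sender;      // numerical node identifier (IP-like)
    uint32_t recipient;
    double   timestamp;   // simulated time at which to execute
    int      messageId;   // classifies the message inside execute()
    void*    payload;     // encapsulated application-specific message
};

// Underlay (network) node: the base class.
class UnderlayNode {
public:
    explicit UnderlayNode(uint32_t id) : id_(id) {}
    virtual ~UnderlayNode() = default;
    uint32_t id() const { return id_; }
    // Called by the simulator when a message addressed to this node
    // is popped from the global queue.
    virtual void execute(const SystemMessage& /*m*/) {}
private:
    uint32_t id_;
};

// Overlay (application) node derives from the underlay node, inheriting
// its characteristics, and keeps a list of its overlay neighbors
// (as described for overlay support in section 4.5).
class OverlayNode : public UnderlayNode {
public:
    using UnderlayNode::UnderlayNode;
    void addNeighbor(uint32_t n) { neighbors_.push_back(n); }
    const std::vector<uint32_t>& neighbors() const { return neighbors_; }
private:
    std::vector<uint32_t> neighbors_;
};
```

Application node types would then be further derived classes of OverlayNode, as the text describes.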
The simulator also maintains a global timer, and all messages are stored in a queue. In the following sections we explain our model in more detail.

4 DESIGN AND IMPLEMENTATION
We use the Georgia Tech transit-stub topology generator [1] to generate the underlying physical topology. An adjacency list containing network nodes, edges, and edge characteristics, including latency, is created by parsing the topology file. Once the list is created, the breadth first search algorithm can be used to compute the application-layer channel and its characteristics. When a node wants to communicate with another node, it computes the channel latency, which the simulator uses to determine when the message is delivered to the destination node.

4.1 Message creation
A user can create a custom message. Messages can be classified using a message identifier. Each message contains a time stamp value at which the message is required to be processed. When a node wants to send a message to another node, it gets the current value of the system timer and computes the latency of the channel to the destination node. The message is then time stamped with the desired value of the timer, which is the sum of the current value of the timer and the channel latency.

4.2 Message routing and delivery
The simulator maintains a global priority queue in which messages are stored in increasing order of their time stamp values, so the topmost message in the queue has the lowest time stamp. To provide efficient and scalable simulation, the simulator uses smart routing: instead of actually routing the packet to the destination, the sending node computes the latency of the path to the destination node and drops the message in the global queue. The queue is checked periodically for messages.
When the queue is examined, the time stamp of the topmost message is compared with the current value of the simulation timer; if the timer's value is less, it is advanced to the time stamp of the message. This allows faster functioning of the simulator, and since time itself is simulated, advancing the timer does not affect the results. The message is then popped out of the queue. Each node has a method execute. When the simulator object pops a message out of the queue, it checks the message's destination and calls the execute method of the destination node. The identifier field of the message is used to classify the message inside the execute method, and the message is processed accordingly. The whole process of message insertion and execution is illustrated in Figure 1.
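The smart-routing delivery loop described above can be sketched as follows. This is an illustrative reconstruction under assumed names (Msg, EventLoop, deliver), not the authors' actual code; it uses a std::priority_queue ordered so the lowest time stamp is on top, and jumps the simulated timer forward to each message's time stamp:

```cpp
#include <functional>
#include <queue>
#include <vector>

// A stripped-down message: just the simulated execution time and the
// destination node identifier.
struct Msg {
    double timestamp;    // simulated time at which to execute
    int    destination;  // destination node identifier
};

// Comparator so the priority queue keeps the smallest timestamp on top.
struct LaterFirst {
    bool operator()(const Msg& a, const Msg& b) const {
        return a.timestamp > b.timestamp;
    }
};

class EventLoop {
public:
    // The sender computes the channel latency and stamps the message
    // with (current time + latency) before dropping it in the queue.
    void send(int dest, double now, double channelLatency) {
        queue_.push(Msg{now + channelLatency, dest});
    }

    // Drain the queue in time stamp order, advancing the simulated
    // timer; 'deliver' stands in for calling the destination node's
    // execute() method.
    void run(const std::function<void(const Msg&)>& deliver) {
        while (!queue_.empty()) {
            Msg m = queue_.top();
            queue_.pop();
            if (timer_ < m.timestamp) timer_ = m.timestamp;  // jump ahead
            deliver(m);
        }
    }

    double timer() const { return timer_; }

private:
    std::priority_queue<Msg, std::vector<Msg>, LaterFirst> queue_;
    double timer_ = 0.0;
};
```

Jumping the timer directly to the next message's time stamp, rather than ticking through intermediate instants, is what lets the simulator skip idle simulated time entirely.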
4.3 Route caching
For fast and efficient operation, the simulator maintains a cache in which it stores the characteristics of application-level channels, including latency and failure probability. This eliminates the need to compute the route repeatedly: the latency of a cached channel can be fetched directly from the cache. If the channel is not in the cache, the latency and other channel characteristics are calculated from the adjacency list and then stored in the cache, so they can be fetched in the future. The size of the cache is user defined, to balance the tradeoff between scalability and performance. We do not currently implement a cache replacement algorithm.
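The cache lookup might be sketched as follows (class and method names are hypothetical, not the authors' API). On a miss, the channel latency is computed by a breadth first search over the adjacency list, as the paper describes, and the result is stored for repeated use:

```cpp
#include <cstdint>
#include <map>
#include <queue>
#include <utility>
#include <vector>

// Route cache backed by a breadth-first search over the adjacency list.
// The BFS finds a fewest-hop path; the channel latency is the sum of
// edge latencies along that path.
class ChannelCache {
public:
    // adjacency: for each node, a list of (neighbor, edge latency) pairs.
    explicit ChannelCache(std::vector<std::vector<std::pair<int, double>>> adj)
        : adj_(std::move(adj)) {}

    // Return the channel latency src -> dst, computing it with BFS on a
    // cache miss and caching the result for future calls.
    double latency(int src, int dst) {
        auto key = std::make_pair(src, dst);
        auto it = cache_.find(key);
        if (it != cache_.end()) return it->second;  // cache hit
        double lat = bfsLatency(src, dst);
        cache_[key] = lat;                          // fill cache
        return lat;
    }

private:
    double bfsLatency(int src, int dst) const {
        std::vector<double> dist(adj_.size(), -1.0);  // -1 = unvisited
        std::queue<int> q;
        dist[src] = 0.0;
        q.push(src);
        while (!q.empty()) {
            int u = q.front();
            q.pop();
            if (u == dst) break;
            for (auto [v, w] : adj_[u]) {
                if (dist[v] < 0.0) {
                    dist[v] = dist[u] + w;  // accumulate edge latency
                    q.push(v);
                }
            }
        }
        return dist[dst];  // -1.0 if dst is unreachable
    }

    std::vector<std::vector<std::pair<int, double>>> adj_;
    std::map<std::pair<int, int>, double> cache_;
};
```

Because the paper assumes a static topology in which the sender-to-receiver path never changes, a cached channel never becomes stale, which is what makes this cache safe without invalidation.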
[Figure 1 - Message insertion and execution]

4.4 System support messages
Besides implementing support for user-defined messages, our model implements the following system messages.
• PING: enables a ping from a node to any other node.
• TRACEROUTE: traces the route from a node to any other node.

4.5 Support for overlay networks
To provide support for overlay networks, each overlay node maintains a C++ data structure that stores a list of its overlay neighbors. Additionally, our model contains a feature to check whether the overlay is partitioned. It does so by performing a breadth first search along the neighbors of each overlay node.

4.6 Failure model
DSSimulator also implements node and edge failures. In addition to calculating and storing channel latency, the channel's probability of success is also calculated and stored in the adjacency list and in the cache. The user specifies the maximum and minimum success probabilities of nodes and edges in the network through a definition file. The channel is checked for failure before sending the message (i.e., before inserting the message in the queue), whereas the destination node is checked for failure at the time of message execution.
Channel failure: When the adjacency list is created, the success probability of each edge is assigned by generating a random number between MIN_EDGE_PROB and MAX_EDGE_PROB (defined by the user). The success probabilities of the edges are then stored in the adjacency list along with the edge latencies. When a node sends a message to another node, the simulator computes the latency and the success probability of the channel, or retrieves them from the cache. The channel success probability is computed as the product of the success probabilities of all the intermediate edges. The computed probability is then compared with a randomly generated number to determine the liveness of the channel.
Node failure: Each node is assigned a success probability by generating a random number between MIN_NODE_PROB and MAX_NODE_PROB (defined by the user). This information is stored in an array whose size is equal to the total number of nodes in the network. When a message is to be executed, the destination node is checked to determine whether it is alive or failed: the destination node's success probability is compared with a randomly generated number. A node can also be made to fail explicitly.

5 PERFORMANCE AND EVALUATION
We identified the following questions for the performance evaluation.
• What is the scalability of our platform?
• What is the effect of caching the channel latencies?
• What are the memory requirements for a large set of nodes?
• What is the performance of our model? How many messages can be executed in a small unit of time?
We chose an Intel 1.8 GHz Xeon workstation with 1 GB RAM, and selected six different but large underlay topologies of 64,372, 218,960, 358,092, 488,012, 520,492 and 1,008,504 nodes. We used the Georgia Tech topology generator to generate underlay networks of up to 520,492 nodes, which realistically model the Internet topology. For underlay networks beyond this size, we experienced failures of the topology generation tool. Therefore our topology of 1,008,504 nodes is comprised of smaller underlay networks with artificial edges added. While this topology does not model the Internet, it does not affect the results of our experiment with respect to the capabilities of the simulation. To evaluate the performance of our model, we conducted a diverse set of experiments. In our first set of experiments we focused on evaluating the scalability of our system. We deployed a simple ping
[Figure 3 - Gain ratio of using cache in terms of number of messages per second]
protocol as an overlay application. The size of the overlay is varied from 1,500 nodes to 21,000 nodes. In each experiment we issued 400 random ping queries from one overlay node to another, and noted the memory usage using the top command in Linux (Figure 2).
[Figure 4 - Messages executed per second for flooding experiment]
In the second set of experiments, we focused on evaluating the effect of caching the channel latency. For this purpose we modified the first set of experiments in the following way.
• Each experiment consists of 800 ping messages in total: each of the 400 ping messages from the first set of experiments was executed twice.
• During the first 400 messages the cache was disabled, so every channel was computed from the adjacency list and also stored in the cache. During the second 400 messages the cache was enabled, so the channel was fetched directly from the cache.
[Figure 5 - Messages executed per second for 6,000 overlay nodes and 64,372 underlay nodes (flooding)]
The following are the observations from our experiment.
• The memory usage is constant for each underlay, irrespective of the size of the overlay. This is probably due to the fact that storing the topology dominates the memory use. This also shows the stability of our model.
• The memory usage for 64,372, 218,960, 358,092, 488,012, 520,492 and 1,008,504 nodes is 1.6%, 5%, 14.4%, 19.6%, 20.9% and 40.2% respectively. This shows that the memory usage is approximately linear, up to a small constant. It is important to note that the memory requirement of our model consists of O(n + e) for the adjacency list and O(m^2) for the cache. Considering that the number of underlay nodes is likely to be far greater than the number of overlay nodes, the memory requirement of our model can be generalized as O(n + e), which is validated by our experiment.
[Figure 2 - Percentage of memory vs. size of underlay]
[Figure 6 - Messages executed per second for ping experiment when the channel is fetched only from the cache]
We noted the time taken for each subset of experiments and calculated the ratio of the time taken by the first subset over the time taken by the second subset; we call this the caching gain. Figure 3 illustrates the caching gain for different underlay sizes, plotted against different sizes of the overlay. It is interesting to note that the caching gain increases with the size of the underlay and is on the order of thousands. The highest caching gain is achieved when the size of the underlay is 1,008,504 nodes and the overlay is 15,000 nodes. Unlike other underlay sizes, the caching gain for the underlay with 1,008,504 nodes varies to a greater degree, because of the larger size of its adjacency list. In the third set of experiments we deployed a peer-to-peer flooding-based resource lookup scheme as an overlay application. We deployed 6,000 overlay nodes and varied the size of the underlay from 64,372 nodes to 1,008,504 nodes. Each overlay node has 14 random neighbors, and each experiment consists of 200 random queries with a TTL value of 5. We noted the message execution rate for different underlay networks and observed that the rate decreases with an increase in the size of the underlay (Figure 4). This is due to the fact that as the size of the underlay grows, the size of the adjacency list grows, requiring more time to compute the channel latency. Additionally, the cache hit rate decreases with an increase in the number of underlay nodes. To observe the behavior of the cache hit rate, we repeated the flooding-based experiment for 64,372 underlay nodes with 500 and 1,000 lookup queries. We observed that as the number of queries increases, the cache hit rate increases, which results in a higher message execution rate. Figure 5 illustrates the results.
In another set of experiments, we repeated the ping experiments from the second set and computed the message execution rate when the channel is fetched only from the cache (Figure 6). While this set of experiments involved ping messages, so the message execution rate cannot be compared with the message execution rate in resource lookup, it is interesting to note that the number of messages executed per second, when the channels are fetched only from the cache, appears independent of the number of underlay nodes and holds at a constant rate.

6 CONCLUSION AND FUTURE WORK
In this paper, we have presented an object-oriented scheme for the simulation of distributed systems. Our model provides a scalable, efficient and flexible platform for the simulation of many distributed applications, including overlay systems. It implements a user-specified node and edge failure model and uses channel caching and smart routing to achieve high scalability and performance using minimal system resources. We have successfully simulated 1 million nodes on a 1.8 GHz machine with 1 GB RAM. Having implemented the foundations for a scalable platform, we
would like to extend our research in the following directions.
• Provide support for dynamic channel characteristics, such as variable edge latencies.
• Implement bandwidth characteristics.
• Implement a cache replacement algorithm.

References
[1] Calvert K., Doar M., and Zegura E. "Modeling Internet topology". IEEE Communications Magazine, June 1997.
[2] Cowie J., Liu H., Liu J., Nicol D., and Ogielski A. "Towards realistic million-node Internet simulations". In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), 1999.
[3] Huang P., Estrin D., and Heidemann J. "Enabling large-scale simulations: selective abstraction approach to the study of multicast protocols". In Proceedings of the International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, July 1998.
[4] "ns2", http://www.isi.edu/nsnam/ns/.
[5] Peterson L., Anderson T., Culler D., and Roscoe T. "A blueprint for introducing disruptive technology into the Internet". In Proceedings of the ACM HotNets-I Workshop, Princeton, October 2002.
[6] Riley G. "The Georgia Tech Network Simulator". In Proceedings of the ACM SIGCOMM Workshop on Models, Methods and Tools for Reproducible Network Research, 2003, pp. 5-12.
[7] Riley G. F., Ammar M. H., and Fujimoto R. "Stateless Routing in Network Simulations". In Proceedings of the 8th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 2000, pp. 524-531.
[8] Riley G. F., Fujimoto R. M., and Ammar M. H. "A generic framework for parallelization of network simulations". In Proceedings of the 7th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 1999, pp. 128-135.
[9] Shamsi J., Abebe L., and Brockmeyer M. "TACON: Tactical Construction of Overlay Networks". Submitted to ICDCS 2005.
[10] Vahdat A., Yocum K., Walsh K., Mahadevan P., Kostic D., Chase J., and Becker D. "Scalability and Accuracy in a Large-Scale Network Emulator". In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI), December 2002.