An Adaptive Synchronization Technique for Parallel Simulation of Networked Clusters

Ayose Falcón
Paolo Faraboschi
Daniel Ortega
Hewlett-Packard Laboratories {ayose.falcon, paolo.faraboschi, daniel.ortega}@hp.com

Abstract

Computer clusters are a very cost-effective approach to High Performance Computing, but simulating a complete cluster is still an open research problem. The obvious approach, parallelizing individual node simulators, is complex and slow. Combining individual parallel simulators implies synchronizing their progress of time. This can be accomplished with a variety of parallel discrete event simulation techniques, but unfortunately any straightforward approach introduces a synchronization overhead causing up to two orders of magnitude of slowdown with respect to the simulation speed of an individual node. In this paper we present a novel adaptive technique that automatically adjusts the synchronization boundaries. By dynamically relaxing accuracy over the least interesting computational phases, we dramatically increase performance with a marginal loss of precision. For example, in the simulation of an 8-node cluster running NAMD (a parallel molecular dynamics application), we show an acceleration factor of 26x over the deterministic "ground truth" simulation, at less than a 1% accuracy error.
1. Introduction

A computer cluster is a group of tightly coupled computers that work together as though they were a single computer. Clusters are used to improve performance and availability in a way that, because they are based on industry-standard commercial off-the-shelf (COTS) components, is typically more cost-effective than an ad-hoc solution. Clusters are used extensively in the High Performance Computing (HPC) field. In June 1997, the first clustered computer entered the TOP500 [18] list. Five years later, in June 2002, 16.20% of the TOP500 computers were clusters. In the June 2007 edition of the TOP500 list, merely 10 years later, three out of four (precisely, 74.60%) of all TOP500 computers are clusters. Building a cluster out of single systems is a relatively easy task when compared to building a parallel computer from
scratch. Nevertheless, building a simulator for a cluster has proven to be of comparable complexity to building a simulator for a parallel computer. There are very few simulators for parallel machines, and even fewer for full cluster computers that include both the functional and the timing simulation of the complete system. Simulators of parallel machines are often ad-hoc implementations of a specific parallel machine: they are the result of many engineering-years of development, which makes them very hard to retarget to other systems, including clusters. We believe that constructing simulators for clusters out of individual full-system simulators should be as easy as building clusters out of individual computers. This shift in perspective is one of the objectives of this paper. Most current full-system simulators model networking by providing a software proxy that channels packets from the simulated network to the external world. Combining several full-system simulators is as easy as providing a software switch that routes packets between the cluster machines and the outside world. Several system simulators, such as Simics Central [12] and AMD's SimNow™ [1], come with such functionality already built in. Similarly, many virtual machines and emulators, such as VMware [19] or QEMU [2], also embed similar "virtual networks" to enable the networking of multiple instances of individual machines, often modeled after the VDE toolkit [6], which provides a generic layer for emulated networks. The techniques described above provide the functional means to route packets from one simulated node to another. However, they do not provide any mechanism to ensure that the simulated times of the sender and the receiver are consistent with each other. This is the fundamental problem of Parallel Discrete Event Simulation (PDES), and some level of time synchronization is a necessary step to ensure any form of simulation accuracy.
Providing time synchronization is what turns a loosely combined set of parallel full-system simulators into a cluster simulator. The challenges of this task involve controlling the time flow in a cohesive way, while still allowing for fast simulation and parallel execution. Here, we present a novel technique that enables combining multiple parallel full-system simulators into a "cluster simulator" capable of running standard distributed applications (such as MPI-based ones) on an unmodified OS, with accurate timing and fast simulation speed. The main contribution of this paper is the "adaptive quantum synchronization" algorithm, which automatically adjusts the global synchronization accuracy based on a dynamic measurement of the network traffic. As fewer packets flow by, we can relax (enlarge) the inter-node synchronization quantum, and vice versa. The rest of the paper is organized as follows. Section 2 analyzes related work, Section 3 presents our technique, Section 4 describes the simulation and evaluation methodology, Section 5 shows experimental results, and finally, Section 7 concludes the paper.
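The adaptive idea just outlined can be sketched in a few lines. This is only an illustration of the concept, not the authors' implementation: the class name, the doubling/halving policy, and the bounds are all hypothetical.

```python
# Hypothetical sketch of adaptive quantum synchronization: enlarge the
# quantum when little traffic is observed (accuracy matters less), and
# shrink it when traffic is heavy (stragglers become likely). The policy
# and constants are illustrative, not taken from the paper.

class AdaptiveQuantum:
    def __init__(self, min_q=1_000, max_q=1_000_000):
        self.min_q = min_q      # smallest quantum (simulated time units)
        self.max_q = max_q      # largest quantum
        self.quantum = min_q

    def next_quantum(self, packets_last_quantum):
        if packets_last_quantum == 0:
            # No traffic: relax synchronization by doubling the quantum.
            self.quantum = min(self.quantum * 2, self.max_q)
        else:
            # Traffic observed: tighten synchronization, scaling the
            # reduction with the amount of traffic, saturating at min_q.
            self.quantum = max(self.quantum // (2 * packets_last_quantum),
                               self.min_q)
        return self.quantum
```

Any policy with the same shape (monotonically relaxing under silence, tightening under traffic) would illustrate the point equally well.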
2. Related Work

Most parallel simulators developed in the past target parallel shared-memory machines. One of the first and most successful parallel simulators is the Wisconsin Wind Tunnel (WWT) [16], which runs a parallel shared-memory program on a parallel computer (CM-5). The WWT uses execution-driven, distributed, discrete-event simulation to calculate program time: it divides time into lock-step quanta to ensure that all events originating on a remote node that affect a node in the current quantum are known at the quantum's beginning. With this approach, the WWT achieves accurate, reproducible results without approximations. Burger and Wood [4] propose trading accuracy for performance in the context of the WWT simulator. The tradeoff is selected globally, and is accomplished by changing the timing model. Tango [5] is another popular shared-memory simulator of parallel computers, which exploits direct execution on the host machine. Tango spawns an event-generation process for each node on the host machine, but serializes all memory-system simulation in one central simulation process. The field of Parallel Discrete Event Simulation has abundant literature, starting from the seminal works of Chandy and Misra on "conservative" simulation [13] and Fujimoto [9] on "optimistic" (checkpoint-and-rollback) simulation. In more recent work, such as [11], PDES is used in a novel way, neither conservatively nor optimistically, but statistically. Each node runs independently and synchronizes at certain points by exchanging statistical information regarding the possible events that should have been communicated. This statistical information is enough to compute time progression and provides a good balance between speed and accuracy. As we described in the introduction, many researchers have addressed the problem of extending system simulation (such as Simics [12]) into some form of cluster emulation, as in [7] and [3].
However, they mostly target only network functional simulation and do not really address network timing issues or node synchronization in a parallel and distributed environment.
3. Adaptive Synchronization

Full-system simulation of a complete computing node is a major challenge, but it has already been addressed in many different ways and is outside the scope of this paper. For the purpose of our current research, the building block for the cluster simulator is a full-system simulator that includes models for the CPU, memory, network cards, disks, and other devices. Our full-system simulator employs a decoupled design. One component is responsible for the functional simulation, which emulates the behavior of the target machine (running the OS with the application) and models a large set of common devices. The other component is the timing simulation, which is responsible for assessing the target performance (i.e., speed) by modeling the latency of each of the functions of the emulated devices, such as instructions, the path to memory, or disk and network interface accesses. Full-system node simulators typically include network card models that live at the boundary of the simulated world. These models act as bridges between the simulated and the real world by providing a proxy functionality that channels network packets back and forth. A network "proxy" greatly enhances the functional features of the simulator by allowing external communication of the OS and applications. Nevertheless, this network communication is outside the simulator's control, and there is no good way of attaching timing models to it. Fortunately, for our purposes, by combining the network functional simulation of all the simulated nodes, we can expand the simulated world to include the communication that happens within the cluster. Instead of bridging simulated packets directly to the external world, we bridge them to a centralized "network controller", responsible for routing packets to and from the simulated nodes (not unlike the functionality offered by the VDE "switch" [6]).
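A toy sketch of such a network controller follows. It is not the paper's implementation: the class, the flat per-link latency, and the `attach`/`route` interface are invented for illustration; any topology model could replace `compute_latency`.

```python
# Illustrative sketch of a centralized network controller: a link-layer
# (MAC-to-MAC) switch that forwards packets between simulated nodes and
# stamps each packet with a simulated delivery time. The flat latency is
# an assumed model, not the paper's.

class NetworkController:
    def __init__(self, link_latency=10):
        self.link_latency = link_latency  # simulated time units, assumed flat
        self.ports = {}                   # MAC address -> attached node

    def attach(self, mac, node):
        # Like a switch learning which port a MAC address lives on.
        self.ports[mac] = node

    def compute_latency(self, src_mac, dst_mac):
        # Stand-in for an arbitrary network/switch/router timing model.
        return self.link_latency

    def route(self, src_mac, dst_mac, payload, send_time):
        # Forward the packet and tell the destination node at what
        # simulated time it should be delivered.
        delivery_time = send_time + self.compute_latency(src_mac, dst_mac)
        self.ports[dst_mac].enqueue(payload, delivery_time)
        return delivery_time
```

The destination node is assumed to expose an `enqueue(payload, delivery_time)` hook; that interface is hypothetical as well.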
The network controller acts as a functional network simulator and behaves like a perfect link-layer (MAC-to-MAC) network switch. Within the network controller, adding a timing component is a straightforward task: we can model any kind of network/switch/router topology by making packets take more or less (simulated) time to reach their endpoints. Figure 1 shows the combination of several full-system node simulators together with a network simulator that behaves like the network controller we have just described. With the network controller functionality up and running, we still have one missing piece for our node-combining approach to work: the synchronization of the simulated nodes. Notice that even without synchronizing the nodes' simulated times, the functional simulation of the cluster would still behave correctly for most applications. As long as an application does not rely on the nodes being isochronous, which is true of most distributed programs, the functional behavior is independent of a possible skew in node
Figure 1. Components of a cluster simulator

timing. However, the simulated time would be indeterminable, since each node would be running at its own speed. The rate at which a node's simulated time advances depends on many factors, such as the type of application it is running and the complexity of its simulation. The speed of the simulator also depends on external factors, such as the type of host on which it runs and that host's load. To an observer in the real world, the clocks of the simulated machines would not only be skewed with respect to each other, but would also have dynamically changing speeds. Nevertheless, since a bad clock should not change the behavior of a distributed application, nothing prevents the cluster application from proceeding correctly.¹ Figure 2 shows an example of what happens during the round trip of a network communication between unsynchronized nodes. Node 1 sends a network packet to Node 2 at its own time ta, arriving at local time tc for Node 2 (all times are local to their respective nodes). After some time spent processing the packet, Node 2 answers Node 1 with a packet sent at time td, arriving at Node 1 at time tb. Since time progresses forward in both nodes in parallel, we can be sure that ta < tb and tc < td. The functional causality of the application is maintained by the data flow, regardless of the skew in clock times. Unfortunately, the timing causality may be broken. Let's assume that the latency of the network for the first packet is tn: the packet that leaves Node 1 at ta should reach Node 2 at ta + tn. If we tag the network packet with the originating timestamp, once the packet arrives at Node 2 we have three possible scenarios. (1) The simulated arrival time at Node 2 (tc) is exactly (ta + tn).
In this case we have complete accuracy, but it also means we have been particularly lucky, because the probability that two parallel simulations advance at exactly the same simulation speed is tiny. (2) If the time is smaller (tc < ta + tn), Node 2 has not yet reached the simulated time at which the packet should be delivered. Thus, the simulator may hold the packet and schedule its arrival with perfect timing. (3) If the destination node has already gone past the packet delivery time (tc > ta + tn), Node 2 has simulated too fast and has "missed" the delivery of the packet. Because we cannot deliver a packet in the past, the only possibility we have is to schedule the packet immediately and lose some accuracy, because the packet will not be able to affect the events that have occurred since tc, as it should have. We call these packets stragglers.

¹The macroscopic behavior is likely to be correct, regardless of the clock skews of the individual nodes, but finer-grain functionality may still be affected. An example of this is a packet retransmission due to a slow machine acknowledging its arrival late. We assume this rarely happens.

Figure 2. Communication between time-skewed nodes

Figure 3 shows four of the situations that may happen when simulation speeds differ in a quantum-synchronized system. In each quadrant, real-world time flows from top to bottom, and the horizontal lines represent the beginning and end of a quantum (assumed to be 10 simulated time units). The vertical bars represent simulated time in the two nodes, and they stop when simulation reaches the next quantum. The arrows indicate a packet flowing between the two nodes, and all four scenarios are related to a single packet roundtrip (e.g., what a 'ping' would do). In figure (a), both nodes run at the same simulation speed: this is the ideal situation that rarely happens, and it yields the expected packet roundtrip time (in the example, 6 time units). In figure (c), Node 1 runs slower than Node 2, hence its simulation time advances more slowly and the packet roundtrip appears shorter than in the ideal case (3 vs. 6 time units). In this case, we could delay the delivery of the packet until Node 1 reaches the correct time. Even if we don't do that, the accuracy loss may still be acceptable with a reasonably short quantum.
In figure (b), Node 1 runs much faster than Node 2, and when the packet comes back from Node 2 it has a timestamp in the past, so it becomes a straggler. We can deliver the packet right away, but the latency appears longer than in the ideal case (7 vs. 6 time units), and we potentially break the time causality of the execution. In figure (d), we have a pathological variant of figure (b): in addition to generating a straggler, we have no way of delivering it to Node 1, because Node 1 has already reached the end of the quantum. In this case the only option for the network controller is to queue the packet for delivery at the next quantum, with a resulting increase in the visible roundtrip latency (8 vs. 6 time units). This phenomenon gets worse for longer quanta: if we had a quantum of 100, the visible latency could be as high as 98!

Figure 3. Four scenarios in quantum-synchronized systems: (a) normal case, nodes simulate at similar speeds (roundtrip = 6); (b) Node 1 slightly faster, latency appears longer and the packet may be a "straggler" breaking time causality (roundtrip = 7); (c) Node 1 slightly slower, latency appears shorter unless we delay the delivery of the packet to the right time (roundtrip = 3); (d) Node 1 reaches the quantum before packet arrival, the packet is a "straggler", the controller queues it to the next quantum, and the latency snaps to the next quantum (roundtrip > 8).

Depending on the quantity of stragglers and their total delay time, the simulation accuracy diminishes. If we decided to completely remove any form of synchronization, not only would the accuracy be poor, but we would also have no way of measuring it: all node clocks would differ, and there would be no way of estimating the global time or the delivery error of packets. In the following sections, we describe the mechanisms we use to synchronize the individual simulated clocks and to keep an accounting of the global time. However, it is important to remember that no matter what we do, we may still have accuracy errors related to stragglers. The key challenge in increasing accuracy without incurring the cost of excessive synchronization is to find a way to quickly detect the stragglers and adjust the simulation for them, which is the key contribution of this paper. Synchronizing clocks among the different simulators is akin to controlling time advancement in Parallel Discrete Event Simulation (PDES), a field of active research for several decades, which we use as a basis to explain our approach.
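The timestamp check at the heart of the straggler problem can be sketched as follows. This is a simplified illustration, not the paper's code: the `receive` function and the `node`/`packet` field names are invented for the sketch.

```python
import heapq

# Sketch of how a receiving node could classify an incoming packet by
# comparing the tagged send time plus network latency (ta + tn) against
# its own simulated clock (tc). Names are illustrative assumptions.

def receive(node, packet, network_latency):
    due = packet.send_time + network_latency   # ta + tn
    if due >= node.now:                        # scenarios (1) and (2)
        # The node has not passed the delivery time yet: schedule the
        # packet at exactly the right simulated instant.
        heapq.heappush(node.event_queue, (due, packet))
        return "scheduled"
    # Scenario (3): the node is already past the delivery time. Deliver
    # immediately, accepting the accuracy loss -- a straggler.
    heapq.heappush(node.event_queue, (node.now, packet))
    return "straggler"
```

An adaptive scheme would monitor how often this function returns `"straggler"` and tighten the quantum accordingly.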
Discrete Event Simulation (DES) models a set of state variables which have discrete transitions in response to events. These events are processed one at a time, each affecting the state variables and potentially scheduling more events. Parallel Discrete Event Simulation (PDES) partitions the state space among the multiple processing units.
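The sequential DES core just described can be sketched in a few lines (a generic textbook illustration, not the paper's simulator; the `Simulator` class and its interface are assumptions):

```python
import heapq
from itertools import count

# Minimal discrete event simulation core: events live in a time-ordered
# queue; processing an event advances the clock and may schedule further
# events. A counter breaks ties so the heap never compares the actions.

class Simulator:
    def __init__(self):
        self._events = []
        self._tie = count()
        self.now = 0

    def schedule(self, time, action):
        heapq.heappush(self._events, (time, next(self._tie), action))

    def run(self):
        while self._events:
            self.now, _, action = heapq.heappop(self._events)
            action(self)      # the action may call self.schedule(...)
        return self.now
```

PDES partitions exactly this event queue and state across nodes, which is where the difficulty described next arises.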
Each node processes events and communicates with the rest to schedule events that may affect them. The main difficulty lies in determining the next event, since the first event in a local list may be preceded by events arriving from other nodes. In our scenario, events are network packets, and nodes are the different full-system simulators that process those network packets. The name straggler comes from the PDES literature. There are two main implementations of PDES simulators: conservative and optimistic. The optimistic approach assumes that stragglers are rare, and provides a checkpointing (fast) and rollback (slow) mechanism for those occasions when stragglers do happen. By rolling back to a previously saved checkpoint, we can recover a coherent state and then reprocess the packet delivery in the correct timing sequence. If recovery happens infrequently, parallel simulation proceeds assuming no straggler will ever appear, and we can achieve a substantial performance gain. In the conservative approach, no event gets locally processed until it is safe to do so. A basic implementation of conservative PDES assumes that all nodes operate in lock-step mode [16], advancing through a set of discrete simulation quanta (Q). Safeness is assured if Q