Distributed High-Performance Simulation using Time Warp and Java Matthew C. Lowry
[email protected] Peter J. Ashenden
[email protected] Ken A. Hawick
[email protected]
Technical Report DHPC-084 Department of Computer Science, The University of Adelaide, South Australia, 5005
Abstract
The Time Warp mechanism is a protocol for synchronising a distributed computation of message-passing processes. Since its introduction in 1985 it has received attention specifically as a mechanism for synchronising a distributed discrete event simulation. The Time Warp mechanism is conceptually simple and has many attractive features. However its performance can be poor. As a result many refinements and optimisations for enabling good, consistent performance have appeared in the literature. In this report the Time Warp mechanism is discussed and numerous refinements are surveyed. Issues surrounding the operation of a distributed simulation using Time Warp are discussed. The design of a Java-based simulation kernel that addresses these issues and can act as a testbed for Time Warp refinements is presented. Experience from a prototype implementation of the design is related, and some test results are presented.
1 Introduction
This document is the product of a research project conducted by the primary author while engaged in postgraduate coursework. The project centred around designing and implementing a system in the Java environment for conducting parallel discrete event simulation. The system, known as FATWa, employs the Time Warp mechanism for synchronisation. It is intended to support experimentation with the assorted variations and refinements that exist for the basic Time Warp mechanism. The motivation of the project is discussed in detail in Section 1.1. The field of parallel discrete event simulation (PDES) receives attention in Sections 1.2 and 1.3. The detailed discussion of the Time Warp mechanism can be found in Sections 2 and 3, while issues related to implementing FATWa are discussed in Sections 4 and 5. The experiments conducted with FATWa are presented and analysed in Section 6. Finally conclusions are drawn in Section 7.
1.1 Motivation
The primary motivation of the research reported in this document is to investigate the Time Warp mechanism as a method for achieving high performance simulation. In particular, the mechanism is to be implemented in the Java environment in a fashion that allows experimentation with the configurable aspects of the mechanism. It has been shown [refs] that the performance of a Time Warp system is heavily dependent on a large range of implementation-configurable aspects of the mechanism. Although speed can be achieved from a monolithic single-process simulation system, a scheme that is general and scalable (with respect to the size of the simulation) will require the exploitation of parallelism. It is possible to obtain a simple form of parallelism through concurrently executing multiple independent runs of a given simulation. However the requirement for independence can be unacceptable. It is often the case that the results of one run are used to set the initial conditions of subsequent runs in a series of tests. Hence a general system for high speed simulation must be able to deliver sufficiently fast turn-around for a single simulation as well as
high throughput for a series of simulations. In [26] Lubachevsky observes that beyond concurrently executing separate runs, there are two forms of parallelism that could be exploited by a simulation system. Functional parallelism achieves speedup by performing the various tasks required of a simulator in parallel. This contrasts with exploiting "space-time" parallelism through concurrently executing different components of a simulation and/or different time segments of a component. Functional parallelism has the advantage of hiding the exploitation of parallelism from the simulation. The simulation programmer working within such a system programs to a sequential model. However this suffers from the severe drawback of exploiting parallelism that is inherent in the simulator, rather than in the simulation being executed. For this reason functional parallelism cannot be considered a general and scalable scheme for achieving fast turn-around from a simulation system. Consequently a general scheme is necessary for dividing a simulation along the dimensions of "space" or time or both. The scheme must allow the divisions to be arbitrarily distributed over a set of processors so that they can be executed in parallel. The simulation model known as discrete event simulation is well suited for execution in parallel; however, distributing parts of the simulation introduces a need for synchronisation. The Time Warp mechanism, introduced by Jefferson in [17], is a general mechanism for synchronising a distributed computation that is particularly well suited for simulation. It is an implementation of the notion of virtual time, also introduced in [17], which can be considered an abstraction for the time coordinate of a distributed computation. It appears to have the same properties as real time within the computation, but can be manipulated by the implementer. This is discussed further in Section 2. An important aspect of the Time Warp mechanism is that its performance is highly sensitive to the implementation-dependent configuration of numerous parameters. Furthermore, a plethora of variants to the basic mechanism introduced by Jefferson exist. As a result a general system for Time Warp simulation that allows controlled experimentation with these parameters and variants is highly desirable, and a primary goal in the design of the FATWa system was to support such experimentation. The Java development environment was chosen since it provides a convenient virtual machine environment in which to distribute a computation over an arbitrary, possibly heterogeneous, set of physical processors. As such it is an appropriate environment to support the generality required in the FATWa system. It is
unfortunate that due to the complexity of the various factors affecting the performance of a Time Warp system most research into the mechanism to date has adopted a narrow focus in either the parameters investigated or the range of applications the results applied to. Hence a general Time Warp simulation system that supports arbitrary distribution of a simulation and extensive configuration of the Time Warp mechanism will be a useful tool for investigating the applicability of the mechanism as a method of synchronising a distributed simulation.
1.2 Parallel Discrete Event Simulation
Given the motivation presented previously, it will be useful at this point to briefly discuss the PDES paradigm as an approach to simulation. This discussion is focused on the aspects of the general DES paradigm that make it an attractive approach for achieving high-performance parallel simulation. For an extensive survey article the reader is directed to the excellent paper by Fujimoto [14]. Firstly, there is the issue of what advantages are offered by the DES paradigm over the alternative time-driven paradigm of simulation. Given the Newtonian terms in which we consider the world it is intuitive to model a system in a time-driven fashion. For example, the well known "N-body problem" can be modeled via a set of differential equations. These express state changes in the system (i.e. changes in the position and velocity of bodies) as functions of the passage of time. Indeed, the required set of equations are the same differential equations that Newton derived in his Laws of Motion. Although a time-driven paradigm may be the intuitive paradigm from a simulation programmer's perspective, it often leads to inefficient computer simulation. The reason is the necessity to divide the simulation time into time slices that are executed in series. For each time slice the state changes that occur in the system are computed, so to achieve an acceptable degree of accuracy from the simulation these time slices must be small relative to the total length of simulated time. The result is that in a given time slice there will be very few state changes. So in general the execution of a time-driven simulation is dominated by determining that nothing interesting has just happened.
The DES paradigm avoids this inefficiency by being event-driven rather than time-driven. Instead of expressing the behaviour of a system through state changes as a function of time, behaviour is expressed as the state changes and effect events that result from a given cause event. The modeled system changes state instantaneously at the point in simulation time at which an event occurs, remaining constant between events. Hence executing a discrete event model takes the form of generating an event dependency graph; i.e. generating the full set of events that result from the initial causes in the system [26]. In general such an approach should be more efficient than blindly traversing the time coordinate of a simulation searching for interesting occurrences to simulate. The second issue raised by the motivation presented in Section 1.1 is that of the particular suitability of models in the DES paradigm for execution in parallel. As with all instances of parallelising a sequential computational model, a fundamental need for synchronisation to ensure correct results is introduced. In the case of PDES this synchronisation takes the form of the causality constraint. This constraint requires that every event in the simulation be executed before any other event that is dependent upon it (directly or indirectly). Here dependence occurs whenever the outcome of an event is determined by one or more elements of system state that are modified by another event. By observing this constraint a simulation will compute the logically correct chains of cause and effect for the model. Ensuring correct results in this fashion gives the PDES paradigm an advantage over parallelising the time-driven paradigm, which in general requires that concurrently executing components of a model be barrier-synchronised at each time slice. In a PDES model concurrent components only need to be synchronised if they are bound by dependent events. The PDES paradigm is further advantaged by a simple restriction that can be placed upon models constructed within it that makes the task of determining event dependence quite easy. In general such dependence analysis is a difficult problem. However if the state of a model can be partitioned into components such that a given event accesses state elements belonging to exactly one component then dependence can be segregated to individual components. For an event occurring at a given component the only events it could be dependent upon are other events that occur within that component. Hence a PDES system that observes this restriction can ensure that global causality is correct by ensuring local causality is correct. In other words, each component of the system can be executed in parallel with correct results provided each component
executes its events in the correct simulation time order. The restriction discussed above brings a large degree of tractability to the problem of PDES synchronisation. It has thus become entrenched in the PDES paradigm to the extent that it is often considered necessary. As a result it is common to find it as an assumption in discussion of the PDES paradigm. However, it is not strictly necessary. It is usual to term the components of a PDES model logical processes (LPs), making the paradigm closely aligned with the general notion of a distributed computation as a set of processes interacting solely by message passing. It is adopting this process-oriented paradigm that allows the Time Warp mechanism, which is a general synchronisation mechanism for distributed computations [17], to be applied to PDES. It is important to recognise that forcing a process-oriented paradigm upon a PDES model is usually not a burden upon the simulation programmer; indeed it can be useful. Complex systems are commonly modelled as interacting sets of autonomous subsystems, a systems analysis technique congruent with process-oriented analysis. Another common and congruent analysis technique is object-oriented analysis, which can easily subsume the process-oriented paradigm by making each model object identified by the analysis a logical process in a simulation. However further discussion of this issue is beyond the scope of this report.
1.3 Using A PDES System
This section contains a brief description of a PDES system from a simulation programmer's perspective. The intent is to give the reader some idea of how a user would construct a simulation and execute it with such a system. This is to illuminate the requirements placed on the programmer by the issues discussed in the previous section, and the requirements placed on a PDES system to provide a simple and convenient programming environment. For the purposes of an example the reader is asked to consider the simulation of a small set of ants' nests that are competing for resources in a limited habitat. The first step of the programmer's task is to analyse the system into a set of processes (or "objects" if this is a more convenient way to look at it). The processes must form a complete partitioning of the system, as discussed
in the previous section. There can be no overlapping of system state between processes, and they interact when an event at one process causes an event to be scheduled at another process. With realistic simulation it is generally not hard to delineate self-contained entities, both active and passive. With the current example there would be processes to model individual ants, one per ant in the simulation. There would also need to be a process for each ant nest in the simulation, and the habitat can be modelled as a grid of regions with a process handling each region. The habitat also needs to contain predator processes - say spider or ant-lion processes. Perhaps less obviously there should also be a process to model weather. Once the system has been partitioned into processes the programmer must specify the behaviour of each process. Given the event-driven nature of the simulation this specification must be in the form of a cause-to-effect mapping. For each event that could possibly occur at each process there must be specified (i) any effect the event has on the state of the process and (ii) any subsequent events that are caused. These effect events must be scheduled on the processes where they occur before the cause event can be considered to have been executed. It is always possible that there are no effect events caused. This must happen at least once before a simulation can terminate. Conversely it can be the case that an event has no effect on the state of a process. Because the processes in the simulation can only interact by exchanging events, if one process requires knowledge of the state of a second it must schedule a "query" event at that process. When executed this event causes the appropriate "response" event to be scheduled at the first process. Since the query event is "read-only" it will not modify the state of the second process. In the current ants' nest example, some of the events that the programmer might discern and define include:
Terrain interaction events. When an ant is exploring its habitat it must be able to interact with the process that models the area of ground it is exploring. There need to be query and response events that allow an ant to find food, leave pheromone trails, detect other ants' pheromone trails, and so forth.
Weather and environmental events. If a weather process is included in the model, then this process will need to send "it is raining", "it has stopped raining", etc. events to terrain processes.
Ant interaction events.
The detail with which the behaviour of an ant is simulated will depend on the range and detail of events that may occur to an ant. So events are required to model the way a worker or soldier ant can detect queen ant pheromones and make decisions (i.e. schedule future events) on that basis. Ants also need to interact with their nest - they need to be able to schedule "dig new tunnel" events on their nest process, and so forth.
Ant creation events. To model the growth of a nest a queen ant process needs to be able to create new ant processes as she lays eggs. So there needs to be a "lay egg" event that will create the new ant process. The initial state of this process would be "egg", so there needs to be a "pupate" event scheduled at the new ant so that it can mature.
Having defined the processes that make up a simulation and the events that can occur at them, all that remains for the programmer to do is initiate the simulation with the desired initial conditions. This is done by creating the processes that are present at the beginning of the simulation and scheduling initial events on them. What processes should be considered "initial" is the programmer's prerogative. In the current example, a programmer could initiate the simulation by creating the full set of ant, ant nest and habitat processes, but this is unnecessary. It would be equally valid, for example, to create nest processes and then schedule at time zero of the simulation an "initialise" event which causes the nest to create its default complement of ants. From the above brief description of the steps involved in creating a simulation in the PDES paradigm the API that must be provided can be seen. A process-level API is required that will:
Use a "process-event" callback provided by the programmer to invoke the process's simulation behaviour.
Provide a "schedule-event" interface that accepts an event as a parameter and schedules the event at the appropriate process. The event must be correctly scheduled and executed regardless of the simulation time it is scheduled for.
Provide a "create-new" interface that accepts a newly created process
as a parameter and incorporates the new process into the simulation.
While the programmer primarily employs the process-level API, there still remains a need to bootstrap the simulation when it is initiated. For this some kind of driver process is required that does not participate in the simulation, but can create the initial processes and events. To achieve this the required API is merely the same "create-new" and "schedule-event" interfaces required at the process level. From the above discussion it can be seen that the programming environment provided by the PDES paradigm is quite simple to work within. The programmer is concerned solely with defining the simulation in terms of processes and events. It is assumed that the underlying simulation system will take care of the delivery of events between processes, and ensure that all events are executed in the correct order. These two tasks are the primary responsibility of FATWa or any such PDES simulation system. The system must achieve these tasks without compromising the abstract simulation environment seen by the user. Hence it is vitally important that a synchronisation scheme such as Time Warp can achieve the second task without exposing the user to any details of the scheme, since the user sees their simulation as implicitly sequential.
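To make the shape of such a process-level API concrete, the following Java sketch illustrates the three interfaces described above. All names (SimProcess, Event, Kernel) are hypothetical illustrations for this discussion, not the actual FATWa API.

// Hypothetical sketch of a process-level PDES API; names are illustrative only,
// not the actual FATWa interfaces. Each type would normally live in its own file.
abstract class Event {
    final long receiveTime;                       // simulation time at which the event occurs
    Event(long receiveTime) { this.receiveTime = receiveTime; }
}

interface Kernel {
    void schedule(SimProcess target, Event e);    // deliver an event to a process
    void register(SimProcess p);                  // incorporate a new process into the simulation
}

abstract class SimProcess {
    private Kernel kernel;                        // bound by the kernel on registration

    // "process-event" callback: invoked by the kernel in receive-time order.
    abstract void processEvent(Event e);

    // "schedule-event" interface: schedule an effect event at another process.
    final void scheduleEvent(SimProcess target, Event e) { kernel.schedule(target, e); }

    // "create-new" interface: add a newly created process to the simulation.
    final void createNew(SimProcess p) { kernel.register(p); }

    final void bind(Kernel k) { this.kernel = k; }
}

A driver would use the same register and schedule operations to create the initial processes and to seed them with events at simulation time zero.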
2 Time Warp As A Mechanism For Parallel Discrete Event Simulation
This section focuses discussion on the Time Warp synchronisation mechanism that is central to the FATWa system. The notion of virtual time that underlies Time Warp is first presented in Section 2.1, followed by a discussion of the mechanism itself in Section 2.2. However the mechanism as it was originally presented did not always exhibit stable and efficient behaviour. The large range of refinements and optimisations that have since been developed to improve efficiency and stability are discussed in Section 2.3. The issue of GVT algorithms receives detailed attention in Section 2.4. The notion of GVT, introduced in Section 2.2, is fundamental to the operation of Time Warp. Finally in Section 2.5 a brief survey of existing experimental Time Warp systems is given.
2.1 The Virtual Time Model
The virtual time model was proposed by Jefferson in [17]. It was introduced as a new paradigm for organising a distributed computation that would underlie and support its internal synchronisation. The intent was to provide a
flexible abstraction over real time in much the same way that virtual memory provides a convenient and flexible abstraction over real memory [17]. Jefferson defined a virtual time system as a distributed system which is coordinated by a virtual clock ticking virtual time. Virtual time replaces real time as the temporal coordinate of the system. From the programmer's perspective virtual time always progresses forward (or at least never backwards). However the virtual clock is implemented by a set of local clocks at each process that are synchronised to produce the correct virtual time semantics. For the implementor the local clocks can jump forwards or backwards, affording a great degree of flexibility, provided that the semantics of virtual time are met. For the purposes of the model a distributed system is considered to be a set of separate processes that can interact solely by message passing. A process can perform three operations: the sending of a message, the receiving of a message, and the modification of the process's internal state. A process is free to send a message at any time to any other process it can name. There
is no concept of communication channels as is present in some distributed system models. In a virtual time system all messages are stamped with four values: the sender, send time, receiver, and receive time. The two times given are the points in virtual time when the message is semantically sent and received, regardless of the real times at which the message is transmitted. A virtual time system is subject to two fundamental semantic rules:
The virtual send time of a message must be less than its virtual receive time.
The virtual time of each event in a process must be less than the virtual time of the next event.
Here an event is defined as a set of actions a process has taken at a given point in virtual time. These rules reflect the nature of virtual time as a replacement for real time since they embody an arrow of causality or flow of information that is always forward in time. Notably they are identical to the two conditions Lamport gives for a correct system of logical clocks [21]. A chain of causality is defined as a sequence of events where subsequent events are either (a) subsequent events at a given process or (b) a pair of events corresponding to the sending and receipt of the same message at separate processes. This is congruent with the notion of event dependence discussed in Section 1.2, and the same as Lamport's "happens before" relation [21]. From this it follows that the primary constraint on a virtual time system is that if event B is caused by event A, then the execution of A must be completed before the execution of B is commenced to ensure that the computation of the system is semantically correct.
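The four stamps and the first semantic rule map directly onto a simple data type. The following Java sketch (field and class names are assumptions made for illustration) shows one possible representation.

// Sketch of a virtual-time message; names are illustrative only.
final class VtMessage {
    final int sender;          // identity of the sending process
    final long sendTime;       // virtual time at which the message is sent
    final int receiver;        // identity of the receiving process
    final long receiveTime;    // virtual time at which the message is received

    VtMessage(int sender, long sendTime, int receiver, long receiveTime) {
        // First rule: the virtual send time must be less than the virtual receive time.
        if (sendTime >= receiveTime) {
            throw new IllegalArgumentException("send time must precede receive time");
        }
        this.sender = sender;
        this.sendTime = sendTime;
        this.receiver = receiver;
        this.receiveTime = receiveTime;
    }
}

The second rule is a property of process execution rather than of the message itself: each process must execute its events in increasing local clock order.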
2.2 The Basic Time Warp Mechanism
The constraint on a virtual time system discussed previously is the same as the causality constraint required of a discrete event simulation. As is the case with PDES, for a virtual time system it is necessary and sufficient for each process in the system to process all messages it receives in correct receive time order. Achieving this synchronisation is the motivation behind the
Time Warp mechanism presented by Jefferson in [17], which is summarised here. Given the congruence of virtual time with simulation time in the PDES paradigm, the two can be considered interchangeable for the purposes of the following discussion. The PDES notion corresponding to message receipt is the act of a process receiving a simulation event. Similarly modifying internal state corresponds to processing a simulation event. For the purposes of the mechanism Jefferson terms the virtual receive time of a message the timestamp at which the message must be processed. A process may only execute the effects of a message with the same timestamp as its local clock. Each process in the simulation attempts to advance its local clock and process incoming messages. However the local clocks of the processes in the system will inevitably vary over some range of times. Thus it is not obvious how a process might ensure that, of all the messages it has queued, the "next" message is among them. By this is meant a message that can safely be executed, specifically one for which no other message with a smaller timestamp could possibly arrive. This is the central problem in implementing virtual time that is solved by Time Warp. The Time Warp mechanism consists of two parts: the local and global control mechanisms. Together these implement virtual time, and they are discussed separately below.
2.2.1 Local Control Mechanism
The Time Warp mechanism is an optimistic mechanism, employing a lookahead-rollback strategy. This contrasts with the many conservative schemes that exist which adopt a block-resume strategy. Conservative schemes are usually based on the Chandy-Misra mechanism presented in [9]. This is a fairly limited mechanism that requires static FIFO communication channels between processes. Each process employs an algorithm to determine which channel the "next" message will arrive on, and if a message is not already present in the queue then the process blocks awaiting its arrival. The Time Warp mechanism adopts a strategy of optimistically executing incoming messages whether or not they can be guaranteed to be in the correct order. If a process receives a "straggler" (i.e. a message in its local past) then the process employs a rollback mechanism to revoke its speculative execution. The new message can then be incorporated into the process's
incoming message stream in the correct order. The issue of rolling back execution in a distributed system is far from trivial since any messages the process has sent during the revoked execution must somehow be "unsent."
[Figure 1: Profile of a Time Warp process. The figure shows the input queue of received messages (ordered by receive time, with send times recorded), the current state and local time, the state queue of checkpoints, and the output queue of sent messages (ordered by send time, with receive times recorded).]
Implementing rollback is the task of the local control mechanism in Time Warp. This is done using a scheme known as antimessaging, which requires that processes in a Time Warp system have the run-time representation depicted in Figure 1. This shows that a process must maintain three queues of objects. These are:
1. A state queue containing saved copies of the process's recent state. These copies are frequently referred to as checkpoints. The current state of the process can be considered as occupying the head of this queue, however this is not strictly necessary for the mechanism. Part of the process's state is the value of its local clock.
2. An input queue containing all the messages recently received by the process, ordered by their (virtual) receive time. Part of the state of the process is a pointer into the input queue to indicate which message is the next one to be processed. In general the queue will contain both
messages ahead of this pointer that are yet to be processed (i.e. have a higher timestamp than the process's local clock) and messages that have already been processed.
3. An output queue containing all the messages recently sent by the process, ordered by their send times. Part of the state of the process is a pointer into the output queue to indicate the most recently sent message. Since this will always be the head of the queue this may seem irrelevant, but the pointer becomes important when the state of the process is checkpointed. Figure 1 depicts these output queue pointers saved in checkpointed states.
Figure 1 shows a process that has just completed processing a message with timestamp 162. This is reflected by its local clock having a value of 162 and there being an output message generated with a send time of 162. The message in the input queue with timestamp 166 is the next message that will be processed. The figure also shows how the rate at which a process checkpoints its state can vary. A checkpoint is possible after the processing of every input message, but can be conducted less frequently. This is one of the many details the Time Warp mechanism leaves as an implementation policy. To see the operation of the Time Warp rollback mechanism consider the example of a straggler with timestamp 148 arriving at the process depicted in Figure 1. The result is depicted in Figure 2. The straggler has been inserted into the input queue in its proper place. To obtain proper results the execution of the process has been reversed so that the straggler is the next message to be processed. To achieve this the checkpoints in the state queue have been employed. The state chosen is the one with the largest clock value that is still sufficiently early to restore the input queue the required distance. Figure 2 also shows the sending of antimessages to cancel messages the process has sent during the computation that was reversed. For each "positive" message a corresponding antimessage is sent out to signal to the recipient that it should reverse any action it took as a result of the original positive message. This may force the recipient of the antimessage to engage in a rollback itself, possibly causing a chain of rollbacks and further antimessage sending. The antimessage annihilates the effects, both direct and indirect, of the positive message.
[Figure 2: Time Warp process after rolling back. The straggler (receive time 148) has been inserted into the input queue in receive time order, an earlier checkpoint has been restored from the state queue, and antimessages have been sent for the output messages generated by the rolled-back computation.]
In Jefferson's original presentation positive and antimessages are created in pairs (a la matter and antimatter), with the antimessage stored in the output queue. Sending the antimessage destroys the pair, so the Time Warp system maintains the attractive property of the algebraic sum of all messages being zero. There are three cases to consider when an antimessage arrives at a process:
The original message has been processed, and is in the receiver's local past. As mentioned previously, the receiver must itself roll back so as to reverse its execution and enforce the proper annihilation semantics of the antimessage.
The original message has been received, but not yet processed. In this case the receiver can simply remove the positive message from its input queue to achieve the required semantics.
The original message has not yet been received. Since the virtual time model does not assume FIFO communications between processes, this is a possibility that must be dealt with. The process should enqueue the antimessage, and eventually the positive message will arrive to annihilate it. If the antimessage becomes the "next" message in the queue it can be executed with a "null operation." When the positive message arrives it can annihilate the antimessage in the process's local past and the correct semantics will be observed without the need for a rollback.
This rollback mechanism is extremely robust and has many attractive features, such as being fully distributed. The antimessage system employs the same communication infrastructure as the processes do during normal operation, and does not assume the computation is halted during rollback. It does not even require the rollback to be carried out atomically with respect to the activity of other processes, only with respect to the process rolling back. That is, other processes are free to send messages to the process that is rolling back; only the act of receipt of these messages must be deferred. No matter how indirect the effect of incorrect computation, it will be correctly revoked by antimessages, and cycles of antimessaging are not significant. Although the "domino effect" may result in an avalanche of rollbacks, in the worst case all processes in the system will roll back to the timestamp of the straggler that triggered the avalanche. This is because a rollback will never generate an antimessage with a timestamp less than that of the target time of the rollback. An important feature of the Time Warp local control mechanism is that its synchronisation overhead is only the cost of checkpointing, rolling back, and sending antimessages. The cost of the computation that is discarded during a rollback cannot be considered an overhead. This lookahead computation is speculative, and under the same circumstances a conservative synchronisation scheme would force processes to be idle, and thus wasteful in the same fashion. However the speculative computation of a Time Warp process may be correct, in which case it will have successfully increased the overall progress of the system.
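The run-time representation of Figure 1 and the three antimessage cases above can be summarised in a short Java sketch. It builds on the VtMessage sketch given earlier; the structure, the use of the receive time as a lookup key, and all names are simplifying assumptions made for illustration, not the FATWa design.

// Simplified sketch of a Time Warp logical process and its antimessage handling.
// Matching an antimessage to its positive twin by receive time alone is a
// simplification; names and structure are illustrative only.
import java.util.Map;
import java.util.TreeMap;

class TimeWarpLP {
    private final TreeMap<Long, VtMessage> inputQueue = new TreeMap<>();   // keyed by receive time
    private final TreeMap<Long, VtMessage> outputQueue = new TreeMap<>();  // keyed by send time
    private final TreeMap<Long, byte[]> stateQueue = new TreeMap<>();      // checkpoints keyed by local time
    private long localClock = Long.MIN_VALUE;

    void receiveAntimessage(VtMessage anti) {
        VtMessage positive = inputQueue.get(anti.receiveTime);
        if (positive != null && anti.receiveTime <= localClock) {
            // Case 1: the positive message has been processed; roll back, then annihilate.
            rollbackTo(anti.receiveTime);
            inputQueue.remove(anti.receiveTime);
        } else if (positive != null) {
            // Case 2: received but not yet processed; annihilate in place.
            inputQueue.remove(anti.receiveTime);
        } else {
            // Case 3: the antimessage arrived first; enqueue it until the positive copy arrives.
            inputQueue.put(anti.receiveTime, anti);
        }
    }

    private void rollbackTo(long time) {
        // Restore the newest checkpoint strictly earlier than the rollback target.
        Map.Entry<Long, byte[]> checkpoint = stateQueue.lowerEntry(time);
        if (checkpoint != null) {
            localClock = checkpoint.getKey();
        }
        // Discard output with send times at or after the target; sending the
        // corresponding antimessages through the transport is omitted here.
        outputQueue.tailMap(time, true).clear();
    }
}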
2.2.2 Global Control Mechanism
The local control mechanism of Time Warp implements the virtual time model. It ensures that the messages exchanged within the system are processed in the correct order, regardless of out-of-order receipt. However it leaves a critical issue unanswered. Amidst all the rollback activity, how is progress in the system as a whole detected and measured? There are two requirements. One is the detection of the termination of the computation. The other is to determine when the computation has progressed sufficiently that the saved state checkpoints and messages can safely be discarded to reclaim memory. While the first requirement is for strict correctness, the second has important implications for the viability of a Time Warp system. Some actions of a Time Warp system may have external effects that are irrevocable; Jefferson gives the examples of dispensing cash or launching a missile. Since the missile cannot be rolled back, the Time Warp system must internally execute the launch but defer the external launch until it can be sure that no antimessage will arrive to roll it back. To this end the Global Virtual Time, or GVT, of the system at a given point in real time is defined as the minimum of (1) all process local times and (2) all unprocessed message send times. Hence GVT is the earliest time to which any rollback in the system could possibly occur. It can never decrease, and assuming the system is capable of continually processing messages it must eventually increase. As such it represents a commitment horizon which gradually moves forward in virtual time, delineating computation that is guaranteed to be correct. Using GVT the two requirements discussed above can be satisfied. The first is achieved by making +∞ the end of the virtual time line. Processes advance their local clock to +∞ when they have no more messages to process and are idle. If an idle process receives a message it can wake up and roll back to the time of the message. Hence the computation will be complete (i.e. all processes idle and no unprocessed messages) if and only if the GVT of the system is +∞. GVT also solves the second problem of when to commit external events and discard old messages and state checkpoints. The act of reclaiming memory in a Time Warp system is similar to garbage collection and is usually termed fossil collection. When the commitment horizon defined by GVT sweeps past an irrevocable action or a given piece of old data then a process can safely take the action or discard the data.
GVT provides a solution to two other important issues: error handling and taking global snapshots of a Time Warp system. Firstly, error conditions should be treated like irrevocable actions; they should not be raised until GVT has passed and there is no possibility of the condition being revoked. Secondly, a consistent global snapshot is possible by sending control messages that prompt each process to take a local snapshot at a given point in virtual time. Processes can "unsnapshot" themselves if they roll back, and eventually when GVT passes the snapshot point the collective local snapshots will form a consistent global snapshot. An important feature of the definition of GVT given by Jefferson is the second condition concerning the send times of unprocessed messages. This is necessary since Jefferson's mechanism has flow control that allows a receiver to return a message to its sender. This will cause the sender to roll back to the time at which it sent the message, and force it to attempt to resend the message. However if this never occurs then it is sufficient to redefine the second condition to be "the minimum of receive times of messages not yet inserted into the input queue of their receivers". This includes both messages in transit, and messages buffered at the receiver but not yet examined (due to a rollback in progress, for example). The difference is that if messages are never returned they can only induce rollbacks by arriving as stragglers. Another vital feature of GVT is its monotonicity. This means that it is not necessary to determine the exact GVT at a given point in time to employ that value for the tasks discussed previously. Any lower bound on the GVT of a system, although suboptimal, is valid for these purposes. This is important because implementing precise GVT determination is difficult. Defining GVT in terms of an instantaneous (in real time) snapshot of the system is not operational as in general such snapshots cannot be taken. This is an important issue given further discussion in Section 2.4.
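Because any lower bound on GVT is usable, the reduction itself is trivial once consistent per-process reports are available. The Java fragment below, with illustrative names, shows the computation under the redefined second condition; it assumes the reported values were gathered consistently, which is the hard part addressed in Section 2.4.

import java.util.List;

// Illustrative GVT lower-bound reduction: the minimum over all process local clocks
// and over the receive times of messages sent but not yet inserted into an input queue.
final class GvtEstimator {
    static long lowerBound(List<Long> localClocks, List<Long> unreceivedMessageTimes) {
        long gvt = Long.MAX_VALUE;
        for (long t : localClocks) {
            gvt = Math.min(gvt, t);
        }
        for (long t : unreceivedMessageTimes) {
            gvt = Math.min(gvt, t);
        }
        return gvt;   // any such bound, even if conservative, is valid for fossil collection
    }
}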
2.3 Refinements and Improvements
Since Jefferson's original presentation of the Time Warp mechanism many refinements have appeared in the literature. These optimisations can significantly improve the performance of the mechanism. They can be considered as
falling into two distinct categories. There are two specific optimisations that allow Time Warp processes to exploit more parallelism than is available in the basic mechanism. These are discussed in Section 2.3.1. The many other refinements that have appeared can all be placed in a category of schemes to reduce the operational overhead incurred by the Time Warp mechanism. These are collectively discussed in Section 2.3.2.
2.3.1 Introducing Laziness
One notable shortcoming of Jefferson's original mechanism is that it does not exploit parallelism available within a process. When a straggler arrives the messages that have been optimistically processed ahead of it must be cancelled. However it may be the case that subsequent messages and the straggler access disjoint portions of the process's state. Hence the subsequent messages are independent of the straggler, and need not be cancelled. The two "laziness" refinements discussed below address this issue.
Lazy Cancellation
This is a refinement that can be thought of as repairing incorrect computation, rather than discarding it as the original mechanism does. This is achieved by deferring the sending of antimessages during a rollback. When the execution of the process resumes, the output messages it generates are compared with those speculatively generated prior to the rollback. If they are the same then no action is taken. Only if a message is not regenerated after the rollback is an antimessage sent to annihilate the original, and the new message sent to replace it [23]. The original strategy is usually termed aggressive cancellation in contrast to this approach. In [23] Lin and Lazowska observe that lazy cancellation can improve performance, since a correct message may have been sent prematurely for the wrong reason. Lazy cancellation will detect this and not cancel it. However it can also degrade performance since incorrect computation is not cancelled as soon as is the case with aggressive cancellation. This allows the effects of the incorrect computation to spread further than might otherwise be the case. Because premature computation that turns out to be correct is preserved, the critical path of a computation can be exceeded by employing lazy cancellation [23]. The critical path of a PDES computation can be determined
from an event dependency graph where nodes are weighted with the time taken to execute the event. It is the maximal weighted path through the graph, and is the lower bound on the execution time of a conservative synchronisation scheme that has perfect knowledge of event dependency. The vital factor is the probability that a given straggler will actually affect the results of the messages that were rolled back to accommodate the straggler. If this probability is low then lazy cancellation can be expected to outperform aggressive cancellation; even at moderate probabilities the schemes should perform comparably. Importantly there are pathological cases that can be constructed where aggressive cancellation fails to complete the simulation, yet lazy cancellation succeeds. Lin and Lazowska [23] cite personal correspondence with Jefferson that argues lazy cancellation should be considered the "correct" cancellation mechanism in Time Warp for this reason.
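The heart of lazy cancellation is the comparison of regenerated output against the output saved before the rollback. The following Java sketch illustrates that comparison; it assumes VtMessage (from the earlier sketch) defines value equality, and send and sendAntimessage stand for transport operations of the surrounding process, so all names are illustrative.

import java.util.Map;

// Lazy cancellation sketch: an old output message is cancelled only if
// re-execution fails to regenerate an identical one.
abstract class LazyCancellingProcess {
    abstract void send(VtMessage m);              // assumed transport operation
    abstract void sendAntimessage(VtMessage m);   // assumed transport operation

    // Called for each output message regenerated while re-executing after a rollback.
    void emit(VtMessage regenerated, Map<Long, VtMessage> preRollbackOutput) {
        VtMessage old = preRollbackOutput.remove(regenerated.receiveTime);
        if (old != null && old.equals(regenerated)) {
            return;                                // identical to what was already sent: keep it
        }
        if (old != null) {
            sendAntimessage(old);                  // superseded: cancel the earlier message ...
        }
        send(regenerated);                         // ... and transmit the newly generated one
    }
    // Messages remaining in preRollbackOutput once re-execution has passed their
    // send times were never regenerated and must then be cancelled with antimessages.
}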
Lazy Reevaluation
This is a refinement related to lazy cancellation that also attempts to preserve as much correct speculative computation as possible when a rollback occurs. However this scheme operates on the state queue of a Time Warp process [14]. When a process rolls back it does not discard checkpoints as it traverses its queue to find one that is sufficiently early. Rather it keeps them and compares them with the state checkpoints it generates after the rollback. If they are identical then the process can conclude that its state was not affected in a lasting way by the message, and it can jump forward in execution to the checkpoint it made before rolling back. In this way recomputation is avoided if the straggler did not have a substantial effect on the process. This may happen in the case of the read-only query events discussed in Section 1.3, or if the straggler contains information that is immediately superseded by a message already in the process's input queue. Although the overhead of implementing this scheme adds to the complexity of a Time Warp system, it can improve performance, especially if query events are common. Notably query events are often employed in the PDES paradigm. The scheme works in conjunction with lazy cancellation: lazy reevaluation avoids unnecessary recomputation, while lazy cancellation avoids cancelling the effects of the original computation that turn out to be replicated in the recomputation.
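Lazy reevaluation can be sketched as a checkpoint comparison made while re-executing after the straggler has been processed; the representation of state as a byte array and all names below are illustrative assumptions.

import java.util.Arrays;

// Lazy reevaluation sketch: if the state recomputed after the straggler matches a
// checkpoint saved before the rollback, the process can jump forward to the point
// at which that checkpoint was originally taken instead of re-executing further.
final class LazyReevaluation {
    static boolean canJumpForward(byte[] recomputedState, byte[] preRollbackCheckpoint) {
        // Identical states imply the straggler had no lasting effect on the process.
        return Arrays.equals(recomputedState, preRollbackCheckpoint);
    }
}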
2.3.2 Reducing Operational Overhead
Broadly speaking the Time Warp mechanism incurs two forms of overhead: space overhead and time overhead. The time overhead consists of the time required to perform the local and global functions of the mechanism. This is the time taken to perform rollbacks to cancel incorrect computation and the time taken to perform GVT determination. The space overhead incurred is that of the objects stored by each process in its input, output, and state queues. It is important to recognise that a conservative synchronisation scheme will also need to maintain a future event queue, although not output and state queues. Hence the overhead of Time Warp above and beyond a conservative scheme is the state and output queues, those messages in input queues that have been optimistically generated, and input messages that have been processed and retained to accommodate rollbacks. The schemes discussed below all try to reduce the time, space, or combined operational overhead of Time Warp.
Past Object Reclamation
The most obvious space overhead in a Time Warp system is the collection of old input, output, and state objects that accumulate in the three queues that the system must maintain. As was discussed in Section 2.2.2 a Time Warp system must compute lower bounds on GVT to allow it to safely reclaim this storage space. The issue of determining GVT is raised here since operation of a GVT algorithm is itself an important overhead of the Time Warp mechanism. However discussion of actual algorithms is deferred to Section 2.4. GVT algorithms generally require that (i) processes maintain certain information (such as message counts) as part of their state, and (ii) one or more processes act as initiators and collectors of control messages during operation of the algorithm. These requirements cause space overheads for storing the information and time overheads for processing the control messages. However a primary motivation for determining GVT is to reclaim memory used by fossils; i.e. objects in the queues of processes that are no longer required. Hence choosing a GVT algorithm involves a trade-off between the time overhead of the algorithm and its effectiveness at reducing the overall space overhead of a system. Another strategy that exists for reclaiming space from past objects (specifically state checkpoints) is known as pruneback [31]. Here a process that requires space can select and reclaim a checkpointed state from its state
queue. If a rollback subsequently occurs that would have required the state then the process suffers a performance degradation. This is because the rollback must now be to a point in virtual time earlier than necessary. Notably this strategy is semantically equivalent to periodic state saving, which is discussed below under State Saving Overhead.
Future Object Reclamation
The other obvious space overhead of a Time Warp simulation is in the form of future objects. These are the events queued by processes that have timestamps greater than GVT. Fossil collection cannot reclaim space from these objects, yet they may form the bulk of the space overhead of a Time Warp system. It may be the case that the memory available to a Time Warp system is exhausted, yet a GVT determination fails to enable fossil collection of sufficient space to allow further progress of GVT. The result would be a system unable to continue execution if future objects could not be reclaimed. Jefferson's original mechanism allowed for flow control whereby the receiver of a message could refuse to process it and return it to its sender. The result is the sender rolling back to the time it sent the message. Generally the motivation on the receiver's part for such action would be having insufficient memory to accept the incoming message. Note that it need not be the actual arriving message that is returned; it could be some other enqueued message with a higher timestamp. Either way the rollback that is caused at the sending process also invalidates a collection of future objects, resulting in reduced memory consumption by that process as well. An important implication of this strategy is the operational definition of GVT that it mandates. As was discussed in Section 2.2.2 a process must include in its local time calculation the send times of messages it has transmitted that have not yet been processed by their receivers. This is to accommodate the possibility of message return and rollback. Hence Jefferson's GVT algorithm mandates message acknowledgement within a system employing it, yet the communications interface visible to a Time Warp process is assumed to be unacknowledged. The issue of message acknowledgement in GVT algorithms is a difficult one, and discussion can be found in Section 2.4. Gafni has presented a generalisation [16] of Jefferson's flow control that gives a more complete solution to managing future objects. In Gafni's scheme any future object can be reclaimed. If a process has exhausted its available memory then it selects and reclaims the future object with the highest
timestamp. If this object is an input message then the message is returned as per Jefferson's scheme. If it is an output message or state checkpoint then the process voluntarily rolls itself back to the timestamp of the object so as to reclaim it. Gafni's protocol has in turn been specialised for use on a shared-memory Time Warp system where all processes are contending for a single pool of storage space. This scheme is known as cancelback [18]. It assumes that claiming storage for a given object always succeeds, but as a result there may not be enough memory to proceed. To deal with this situation the system first invokes fossil collection. If this fails to obtain sufficient memory then Gafni's protocol is employed, but not necessarily at the process that originally caused the memory exhaustion. In this way it is possible to select the globally optimal object for reclamation. Operation of cancelback is completely dependent on a single-pool shared-memory architecture. Its operation generally implies the operation of another refinement known as direct cancellation, discussed further on. It is highly space efficient, although it does incur a significant time overhead. This issue is addressed in [10], together with some unrealistic assumptions that are made in [18]. There are also some empirical performance results presented in [10]. It was found that a well-tuned cancelback protocol could perform to within 10% of the ideal, infinite memory scenario while only using 2-3 times the storage space of a sequential equivalent. Finally there exists a scheme known as artificial rollback [24]. Here it is observed that the end effect of the message-return/Gafni-cancelback protocols is to cause a rollback. So in the artificial rollback strategy any process, upon observing low amounts of free memory, can voluntarily choose to roll back computation that is particularly advanced. The semantics of all these strategies are the same; they differ in their triggers and patterns of operation. However artificial rollback is more general in that it does not depend on any hardware architecture.
Limiting Optimism
A significant issue in the operation of the Time Warp mechanism is that of optimistic computation and associated rollbacks swamping correct computation. As a result many schemes have emerged that add conservatism to Time Warp in an attempt to throttle the most optimistic of computation. Also, extreme optimism tends to generate large quantities of future objects (cf. the foregoing discussion). Hence placing some limit on the extent of the
optimism of a Time Warp system may be desirable. One approach is usually known as moving time windows (MTW) or "bounded Time Warp" [14]. A threshold is set a certain distance past GVT; processes will not cross the threshold, deeming computation that far forward as too speculative and likely to be rolled back. A simple scheme can employ fixed-size windows, while more complicated variants allow processes to dynamically alter their window of optimism. The problem with these schemes is that they can inhibit useful speculative computation. One scheme that addresses this issue is the "Filter" algorithm [30], which also operates on the input queue of processes. In this scheme outgoing events have attached to them a list of assumptions that were made during their generation. This can enable receiving processes to filter their input queue for messages that will definitely be cancelled at a later stage. Another approach, which operates on the output queue of a process, is known as breathing time warp [37]. Here the processes of the system move in a roughly synchronous fashion through multiple phases of operation. One phase is normal Time Warp operation. A process may decide to "hold its breath," or withhold the transmission of its output messages, if it has advanced far beyond GVT. This is done on the premise that since it is far in front of GVT it is likely to receive stragglers forcing it to roll back the messages it is generating. When the process is rolled back it can "garbage collect" the invalidated messages directly. Eventually GVT advances and the process can
flush the buffered messages that have survived. The result of this scheme is a significant reduction in the amount of message passing performed by a system. Transmission of both incorrect messages and subsequent antimessages is avoided. This makes it especially attractive in situations with relatively low message passing bandwidth, where saturating the communications subsystem can result in large latencies.
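In its simplest form a moving time window reduces to a single check before a process executes its next message. The sketch below captures the idea; the class name, the fixed window size, and the way the GVT estimate is obtained are all assumptions made for illustration.

// Moving time window sketch: refuse to execute events more than 'window' virtual
// time units beyond the current GVT estimate; names and policy are illustrative.
final class OptimismThrottle {
    private final long window;

    OptimismThrottle(long window) { this.window = window; }

    boolean mayExecute(long nextEventTime, long gvtEstimate) {
        // Beyond the window the event is deemed too speculative; wait for GVT to advance.
        return nextEventTime <= gvtEstimate + window;
    }
}

An adaptive variant would adjust the window at run time, for example in response to the observed rollback rate.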
Message Preemption
In a Time Warp system with message preemption the arrival of a straggler immediately causes the process to halt work on its current message and begin rollback proceedings. This eliminates the period of time an antimessage may otherwise be forced to wait before it can trigger a rollback. If the average execution time for a message is long or the frequency of rollback is high then preemption can be expected to provide a significant improvement in performance. Unfortunately message preemption does require processes to have a fairly
intimate relationship with the underlying communications infrastructure supporting them. This is contrary to the very high degree of independence from computing architectures that the basic Time Warp mechanism possesses. The mechanism only assumes a simple asynchronous send/receive message communications interface, so preemption must be considered an implementation-dependent optimisation. However for an implementation with close ties to its communications subsystem, message preemption is highly desirable.
Direct Cancellation
One of the attractive features of the Time Warp mechanism is that it uses the same method of interprocess communication to achieve both forward computation and rollback. From the perspective of whatever communications infrastructure supports a Time Warp system there is no difference between a positive message and an antimessage. For a given process to perform a rollback there are no requirements for interaction with the rest of the system beyond those needed to perform normal forward execution. However this feature is a disadvantage if the average time taken to process a message is small relative to the time to transmit it. In this case, antimessages will chase incorrect computation at a rate that is only marginally faster than the spread of the incorrect computation. This has been characterised as a "dog chasing its own tail" effect. An optimisation available to shared-memory multi-processor implementations is to bypass the antimessage system and directly cancel incorrect messages. When an event causes the scheduling of another, the first obtains a pointer to the second [13]. Thus if the first is rolled back the second can be located and cancelled, as per an antimessage, but much more quickly. This improves performance by cancelling incorrect computation faster, but clearly it is only an option for Time Warp systems that run on appropriate shared memory hardware.
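A shared-memory sketch of the pointer structure behind direct cancellation is given below; the class name, the cancellation flag, and the simple marking policy are illustrative assumptions rather than the scheme of [13].

import java.util.ArrayList;
import java.util.List;

// Direct cancellation sketch: each event remembers references to the events it
// scheduled, so a rollback can mark them cancelled in place instead of sending
// antimessages through the communication subsystem.
class ScheduledEvent {
    final long receiveTime;
    private final List<ScheduledEvent> caused = new ArrayList<>();
    volatile boolean cancelled = false;

    ScheduledEvent(long receiveTime) { this.receiveTime = receiveTime; }

    void recordEffect(ScheduledEvent effect) {
        caused.add(effect);          // keep the causal pointer at scheduling time
    }

    void cancelDirectEffects() {
        for (ScheduledEvent e : caused) {
            e.cancelled = true;      // located via the pointer; the owning process skips or rolls back
        }
    }
}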
State Saving Overhead
The cost, both in time and space, of continually saving the states of processes is a very significant overhead for the Time Warp mechanism. Many proposals have appeared for reducing this overhead, and some are discussed here. A common approach is infrequent or periodic state saving (PSS). Here both the time and space overhead are reduced by simply doing the checkpointing less often. A process may choose to checkpoint only after every N events (N > 1), and may even dynamically alter this value during execution.
However the cost is an increase in the time cost of performing rollbacks. The reason is that to accommodate a straggler a process may be forced to roll back further than the timestamp of the straggler. Fortunately the process does not need to send antimessages for the output messages unnecessarily rolled back. Instead it enters a "coast-forward" phase similar to lazy cancellation. When the input messages between the restored state and the straggler are re-executed they will generate the same results. This is depicted in Figure 3. Importantly the coast-forward phase is transparent to the rest of the rollback operation. Thus the assumption that a straggler never causes a rollback to a time earlier than its timestamp can be maintained.
[Figure 3: Rollback with periodic state saving. A straggler forces an extra rollback, past the point strictly required, to the last checkpointed event; the events between that checkpoint and the straggler are then re-executed in a coast-forward phase along the virtual time axis.]
In [25] an analytical model for PSS is presented together with a derived algorithm that finds the optimal checkpoint interval for an individual process. It makes a trade-off between the savings from fewer checkpoints and the cost of the extra computation incurred. Empirical results have shown that there is a feedback loop in operation, with checkpoint frequency affecting rollback behaviour and in turn affecting the optimal checkpoint frequency. The algorithm is iterative to accommodate this. In [28] a slightly different model is derived. The result is also an iterative algorithm for selecting the optimal checkpoint interval, but employing a different formula to the one derived in [25]. Unfortunately both these algorithms are costly for processes in that they require statistics to be obtained and complicated formulae evaluated. This is addressed in [12] where a new model for the cost trade-off of PSS is derived. This is done specifically to obtain a simple heuristic for adapting
checkpoint interval. Another adaption algorithm is presented in [34], however the formula it derives and uses can be transformed into one very similar to that in [28]. An alternative to PSS is incremental state saving (ISS). This reduces the space cost of state saving by only copying the parts of the process' state that have changed since the last checkpoint. However this has time costs in both identifying the altered parts of the state and the computation required to restore state. Since the scheme essentially saves inverses of applied state changes, these inverse functions must be applied sequentially to restore a state. This is signi cantly more expensive then simply reassigning a pointer to restore an entire state copy. In [28] analytical models of both PSS and ISS are compared. The two models compete to reduce the total overhead incurred in state saving, and the results indicate that generally PSS outperforms ISS. The exception is when rollbacks are frequent and rarely more then one or two events in distance. Notably it has been demonstrated in other areas (for example reversable execution in debuggers) that ISS is a costly operation. To be used eectively in a Time Warp system the strategy would require hardware support to eliminate the software cost. To this end Fujimoto et al [15] have proposed a \rollback chip." This is essentially a memory management unit for dedicated Time Warp process state storage. It uses a cache-like incremental saving algorithm to provide essentially overhead-free checkpointing, rollback, and fossil collection as primitive operations on the memory it manages.
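As a concrete illustration of periodic state saving, the following minimal sketch (the state copy, the Checkpoint class, and the replay step are invented placeholders, not FATWa or literature code) checkpoints only every N events and, on rollback, restores the most recent checkpoint at or before the straggler's timestamp so that the caller can coast forward over the intervening events without resending their output.

import java.util.ArrayDeque;
import java.util.Deque;

// Sketch only: assumes the caller re-executes ("coasts forward") the events
// between the restored checkpoint and the straggler without sending output.
class PssProcess {
    static class Checkpoint {
        final long virtualTime;
        final Object stateCopy;   // a deep copy of the process state
        Checkpoint( long t, Object s ) { virtualTime = t; stateCopy = s; }
    }

    private final Deque<Checkpoint> checkpoints = new ArrayDeque<Checkpoint>();
    private final int interval;   // checkpoint every 'interval' events
    private int eventsSinceCheckpoint = 0;

    PssProcess( int interval ) { this.interval = interval; }

    // Called after each event is processed.
    void maybeCheckpoint( long virtualTime, Object currentStateCopy ) {
        if ( ++eventsSinceCheckpoint >= interval ) {
            checkpoints.push( new Checkpoint( virtualTime, currentStateCopy ) );
            eventsSinceCheckpoint = 0;
        }
    }

    // On a straggler with timestamp t, discard checkpoints taken after t and
    // return the newest remaining one (possibly earlier than t, forcing an
    // extra rollback that the coast-forward phase then makes up).
    Checkpoint rollbackTo( long stragglerTime ) {
        while ( !checkpoints.isEmpty() && checkpoints.peek().virtualTime > stragglerTime ) {
            checkpoints.pop();
        }
        return checkpoints.peek();   // null if no sufficiently old checkpoint remains
    }
}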
2.4 Determining GVT
The issue of precisely how GVT is to be determined is not given proper treatment in Jefferson's original presentation of the Time Warp mechanism. He acknowledges that a definition in terms of an instantaneous global snapshot is not operational, and gives a relaxed definition in terms of a distributed snapshot. This is one in which each process takes a local snapshot at times that will be distributed across an interval of real time. The relaxed definition requires message acknowledgement, since it adds the timestamps of all messages sent but not acknowledged to the GVT determination. In this way messages in transit during the snapshot are accounted for. While not explicitly acknowledged by Jefferson, determining the GVT of a distributed simulation is an instance of computing a global state function from a snapshot of a distributed computation. However global functions such as GVT have the important property of progressing monotonically. Thus it is not necessary to construct an instantaneous global snapshot from a distributed one. Rather, a meaningful [8] distributed snapshot can be used to compute a lower bound on the function. A meaningful snapshot is one suitable for inferring a result for a global state function that was correct at some point before or during the interval of the distributed snapshot. This allows a lower bound to be obtained without the expense of reconstructing a global state. In [8] an elegant algorithm is presented for efficiently obtaining such a meaningful snapshot. Importantly, the algorithm does not require the computation to be frozen: it operates concurrently with, and transparently to, the computation. The algorithm assumes static FIFO communication channels, and involves propagating markers through them. An initiator spontaneously performs a local snapshot and places a marker in all its output channels. When a marker arrives at a process, the process immediately snapshots itself and places markers in all of its output channels. A marker arriving at a process that has already taken its snapshot is simply ignored. The collective local snapshots of the system are forwarded to the initiator for processing. The local snapshots are spread over time, and hence do not form a global instantaneous snapshot, but a lower bound computed from the distributed snapshot is sufficient for functions such as GVT. The algorithm ensures that the snapshot it takes is causally consistent [27]. This means that for every event happening within or before the snapshot, all its cause events also occur within or before the snapshot. Such consistent snapshots will give the correct results for monotonic functions. The algorithm in [8] was extended in [20] to accommodate non-FIFO, non-static communication channels. The extended algorithm piggybacks a single flag bit onto all messages passed between processes.
This bit replaces the marker of a FIFO channel: instead of placing a marker in its channels, a process flips the marker bit on all its outgoing messages. The two sets of messages with alternate bit settings are the same two sets of messages delineated by a marker in a FIFO channel. The drawback of this algorithm is that it requires a complete message history to be included in each snapshot. To determine the messages that were in transit during the real time interval of the snapshot, the difference of the sent and received message sets of corresponding processes is computed. The problem of recording complete message histories is addressed in [27]. Here Mattern observes that the messages in transit are precisely those messages that have not had their bit flipped, yet arrive at a process that has already flipped its bit and taken its local snapshot. The simple protocol of forwarding all such messages to the process that is accumulating local snapshots will allow that process to determine the set of in-transit messages. The problem then becomes one of termination detection, i.e. determining when the last in-transit message has been detected and forwarded. This is achieved by piggybacking a three-state (rather than two-state) flag onto messages. The algorithm can then execute repeatedly to delineate multiple phases of message passing, and catch all in-transit messages. In [27] Mattern also presents a variant of the above algorithm optimised for GVT determination. In this algorithm there is no need for in-transit messages to be forwarded to a collector process. Instead processes circulate vector counters of sent and received messages. This allows them to establish a lower bound on the timestamps of any in-transit messages, which is sufficient for a GVT calculation. Mattern's algorithm is very attractive in that it does not require processes to maintain message histories beyond a counter, and does not require the forwarding of copies of messages. Furthermore it does not place an undue overhead (in terms of time or space) on a Time Warp system employing it. A related approach is the Tomlinson-Garg algorithm [38]. It also involves vector counters being maintained and exchanged by processes. As with Mattern's scheme the counters are used to establish the existence (or rather non-existence) of messages that are in transit during the operation of the algorithm. However the algorithm operates in a very different manner from Mattern's scheme. The Tomlinson-Garg algorithm is motivated by minimising the latency of GVT determination. The latency of a GVT algorithm is defined as the real time interval between a given GVT being exceeded and all processes in the system becoming aware of that fact. To minimise latency the algorithm is controlled by a centralised manager that selects a target GVT which is broadcast to all processes. When a process' local time attains this target it informs the manager. But rather than count messages and exchange vectors of these message counts as in Mattern's algorithm, this algorithm counts relevant rollbacks. These are rollbacks from after the target GVT to before it.
Vectors of counts of these rollbacks are sent to the manager together with the message informing the manager of the rollback. These vectors allow the manager to establish when the last such rollback has occurred. The Tomlinson-Garg algorithm suffers the drawback of not being able to be triggered spontaneously by any process, as is the case with Mattern's. The advantage of such an ability is that any process in the system, upon dropping below a given threshold of free storage space, can trigger a round of GVT determination and fossil collection. In contrast the Tomlinson-Garg algorithm requires the manager to select appropriate target times, and does not operate on demand. A trade-off is required between the reduced storage requirements afforded by frequent target times and the message passing overhead this incurs.
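The common structure of all these schemes can be illustrated with a deliberately naive lower-bound computation: GVT is bounded below by the minimum, over all processes, of the local virtual time and the timestamps of messages that may still be in transit. The interface and class names below are invented for illustration; real algorithms such as Mattern's or Tomlinson-Garg avoid gathering this information centrally at a single instant.

import java.util.Collection;

// Naive, centralised illustration of the GVT lower bound only.
class GvtEstimator {
    interface ProcessView {
        long localVirtualTime();                   // the process' LVT
        Collection<Long> unackedSendTimestamps();  // timestamps of possibly in-transit messages
    }

    static long lowerBound( Collection<ProcessView> processes ) {
        long gvt = Long.MAX_VALUE;
        for ( ProcessView p : processes ) {
            gvt = Math.min( gvt, p.localVirtualTime() );
            for ( long ts : p.unackedSendTimestamps() ) {
                gvt = Math.min( gvt, ts );         // account for messages in transit
            }
        }
        return gvt;   // storage with timestamps below this value can be fossil collected
    }
}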
Message Acknowledgement Revisited.
The issue of message acknowledgement in association with GVT determination was touched upon in Sections 2.2.2 and 2.3.2. Discussion now returns to this topic. It was previously mentioned that Jefferson's original proposal for GVT determination required message acknowledgement. This solved the problem of in-transit messages and provided a flow control mechanism. Each process simply incorporated the timestamps of all its unacknowledged messages into its local virtual time determination. The minimum of a set containing exactly one local time value from each process is a valid lower bound on GVT. Various algorithms based on this approach exist [3, 4, 22]. They all involve a central manager deciding to poll the processes of the system for their local time values, collecting the responses, and broadcasting the result back to the processes. The event passing API provided by a Time Warp system to a simulation process does not provide an acknowledged message-passing paradigm. Processes assume that event delivery is reliable, and do not block to await any confirmation. This gives processes in a system the maximum opportunity for concurrent execution. So introducing an underlying requirement for message acknowledgement to allow a GVT algorithm to accommodate in-transit messages is unattractive: it places constraints on the implementation of a Time Warp system that are not essential to the provision of the Time Warp API. This conflict is avoided by Mattern's algorithm, which obviates the need for acknowledgement. However it can be the case that the communications infrastructure that supports a Time Warp implementation does not provide reliable delivery.
If this is the case then message acknowledgement must be employed within the system regardless of the GVT algorithm. This is the motivation behind the "passive response GVT" or pGVT algorithm [11]. In this algorithm the loss of a control message does not have a significant impact. Processes make local time determinations as they please, and forward the results to the centralised manager as they see fit. The manager passively accumulates the local time updates, and periodically broadcasts GVT updates. The attraction of the pGVT algorithm is that despite the existence of the centralised manager the decision making is completely distributed. Processes with a local time close to GVT will send updates regularly, since it is updates from these processes that allow GVT to advance. Conversely a process that is speculatively executing far ahead of GVT, and hence making no impact on its advance, will only rarely send updates. In this way GVT can advance expeditiously without the overhead of continually polling processes that will not affect the calculation. In [11] benchmark results are presented which strongly favour the pGVT algorithm. The total simulation time and memory consumption of pGVT outperformed both the Lin-Lazowska (acknowledgement-based) [22] and the Tomlinson-Garg (no acknowledgement) [38] algorithms.
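A rough sketch of the passive-manager idea is given below; the class and method names are invented, and the broadcast mechanism and fault handling are elided. The key point is that each report only raises the recorded bound for one process, so a lost or delayed report merely delays GVT advance rather than making the estimate incorrect.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only; not the pGVT algorithm of [11] in detail.
class PassiveGvtManager {
    private final Map<String, Long> lastReported = new ConcurrentHashMap<String, Long>();

    // Called whenever a process chooses to report a lower bound on its
    // contribution to GVT (local virtual time and unaccounted sends).
    void report( String processName, long localTimeBound ) {
        lastReported.merge( processName, localTimeBound, Math::max );
    }

    // Periodically compute the estimate to broadcast back to the processes.
    long currentEstimate() {
        long gvt = Long.MAX_VALUE;
        for ( long t : lastReported.values() ) {
            gvt = Math.min( gvt, t );
        }
        return gvt;
    }
}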
2.5 Existing Time Warp Systems
Attention is now briefly turned to some existing simulation systems that employ the Time Warp mechanism.

TWOS
An early implementation of the Time Warp mechanism was the Time Warp Operating System (TWOS), developed at the Jet Propulsion Laboratory at Caltech. It performs the usual functions of an operating system, such as I/O and memory management, but it executes distributed applications using the Time Warp mechanism for synchronisation. It began development in 1983, before the publication of the mechanism. Initially simulation applications were implemented in the system, but it has also been used for research into the use of Time Warp in fields such as database management. See [19] for a general discussion paper on the system. It is still being maintained and developed; a recent version can be found at http://ficus-www.cs.ucla.edu/project-members/reiher/Time Warp.html.

SPEEDES
SPEEDES stands for Synchronous Parallel Environment for Emulation and Discrete Event Simulation. SPEEDES began development in 1990, also at the Jet Propulsion Laboratory, as a system for experimenting with a particular synchronisation scheme [36]. It later grew to become a testbed for PDES synchronisation schemes, including Time Warp. It has the attractive feature of supporting real-time monitoring of a simulation's performance by a user. A user employs the system by writing a configuration file for the system's main program. The configuration file specifies simulation processes and events, which are also provided by the user. However it does not provide the simple API discussed in the previous section: it requires a user to be aware of the internal workings of the system, and the user is forced to write significant code to support the operation of the system. It is written in C++ and runs on networks of UNIX workstations or the JPL Mark III Hypercube supercomputer. Since SPEEDES began supporting Time Warp it has been used extensively to investigate the mechanism, and results from studies using SPEEDES have made a significant contribution to the field.

WARPED
The WARPED system is a Time Warp simulation kernel written in C++ [32], and is freely available. It can be obtained via anonymous FTP from the WARPED homepage at http://www.ececs.uc.edu/~paw/warped/. As with FATWa, the system is intended as a distributed, portable testbed for Time Warp related algorithms. It has been employed on a variety of hardware, including networks of workstations, multiprocessor machines, and a Cray T3E supercomputer. WARPED was used as the starting point for the design of FATWa. The design of WARPED and its similarity to FATWa is discussed in Section 4.
3 Issues Related to Designing a Distributed Time Warp System

One of the most attractive features of the basic Time Warp mechanism is that it is essentially independent of whatever hardware may underlie its implementation. The model that the mechanism works with is extremely simple. It is assumed that Time Warp processes reside within an amorphous process space. The mapping of Time Warp processes to the underlying notion of a process (operating system process or thread or physical processor) is an implementation policy. Importantly, the only assumption made about the support provided to a process is that of a simple asynchronous message passing interface. For the sake of efficiency reliable delivery is assumed, so a process need not wait for acknowledgement [17]. It is assumed the underlying communications subsystem can buffer incoming messages and provide them on demand for placement in the process' input queue. A consequence of this simplicity is that the Time Warp mechanism can be implemented on almost any computing platform. The simple message passing model, being asynchronous and giving no guarantee beyond reliability, is eminently suitable for modern distributed computing environments. Although an amorphous process space is assumed, any structure that may be present is transparent to the mechanism. Time Warp can be implemented on hardware ranging from a LAN of single-processor PCs to a shared-memory or message-passing multiprocessor, or even a WAN of such multiprocessors. Although structure is transparent to the operation of the mechanism, it does have profound effects on performance. If the space is physically partitioned with high latencies between distant processes then the long-latency communication may hamper exploitation of parallelism amongst "close" (i.e. highly dependent) processes. Dealing with the issues of structure within the process space is left entirely as an implementation policy. This is discussed in Section 3.1. A further implementation-specific issue is the possible requirement for scheduling N logical Time Warp processes onto M physical processing elements where M < N. This and related issues are discussed in Section 3.2. In Section 3.3 the issue of load balancing receives attention. Finally in Section 3.4 the notion of a scope hierarchy is discussed.
3.1 Partitioning the Process Space
As mentioned previously, Jefferson's original specification of the Time Warp mechanism did not introduce any notion of structure or form to the process space. However it is inevitable that the structure of the hardware underlying some Time Warp implementations will have a visible impact in the implemented process space. For example, in a system operating over a cluster of workstations with many processes per workstation, the message passing latency between nodes will generally be at least an order of magnitude greater than between processes on the same node. This contradicts the implication of an amorphous process space that all message passing latencies will derive from the same distribution. This dependence of message passing latency on the particular processes that are sending and receiving is one effect of process space structure. Another visible effect is closely related and would also be present in the previous example. This effect is the constraint placed on the scheduling algorithm employed in a Time Warp system with more logical processes than physical processing elements. In the case of a workstation cluster a process must "reside" on exactly one workstation, and can only be scheduled on the physical processor(s) of that workstation. As a result it can be the case that the globally optimal process to schedule for execution does not reside at the processing element that is available. To illuminate these issues the reader is asked to reconsider the ant's nest example introduced in Section 1.3. Consider the situation of a simulation run involving four nests in a 2x2 habitat grid, one nest per habitat element. The process that models each habitat element, together with its nest process and associated ant processes, resides on its own workstation. The four workstations are connected with off-the-shelf network technology, and as a result message passing latency across the network is far longer than within a workstation. Now consider the situation of a small group of worker ants from one nest exploring the terrain of another nest in an attempt to find food. The ants must begin exchanging occasional query/response message pairs with the terrain process on a different workstation from the one on which they reside. Because of the high latency involved this places a limit on the execution rate of the ant processes involved, and they begin to fall behind others on their local workstation. When the simulated ants return from their foraging expedition they will once again begin exchanging messages with their local terrain process.
This will result in the terrain process being forced to roll back to the simulation time to which the expeditioners have lagged. This in turn forces all the processes (nest and ant) dependent on the terrain process to be rolled back. Effectively the entire set of processes at one workstation has been forced to execute more slowly when only a few of them are actually engaging in long-latency message passing. Notably the ant processes performing the long-latency message passing are not even exchanging messages with other local ant processes when returning from their expedition; despite this, other ants are rolled back due to their indirect dependence through the terrain process. Clearly this effect is undesirable. To make matters worse the effect is symmetric, since the terrain process that the ants are making the expedition to will also be lagged by the sequence of long-latency messages it exchanges. The result is that the execution of the entire set of processes on two workstations is retarded. However this dependence is as much a product of the manner in which the simulation was constructed as it is of the long communication latency. Careful analysis would show that the events exchanged by the terrain process and the ants local to the expeditioners should be independent of the straggler events exchanged by the terrain process and the expeditioners upon their return. This independence could have been expressed explicitly in the simulation model by having a much finer terrain grid. The implication of this example is that refinements such as lazy cancellation and reevaluation are necessary if a Time Warp system is to be capable of exploiting parallelism not explicitly expressed by the programmer in the form of separate processes. The basic mechanism effectively exploits parallelism between processes: in the example the expeditioning ants did not cause a lag on their nest mates during the expedition since they did not exchange messages during the expedition. But as the example shows, by not exploiting the parallelism available due to independence of events at a single process the basic mechanism can perform poorly. An important factor to recognise from the foregoing example is the precursor to the poor performance suffered by the basic mechanism. This is the visibility to the Time Warp system, in the process space, of underlying hardware structure, specifically the high latency of message passing between groups of processes. In general this structure will manifest itself as a partitioning of the process space. In the example it was a four-fold partitioning on the basis of the workstations that underlay the process space. This grouping is transparent to the operation of the basic Time Warp mechanism since it only recognises "global" (i.e. the entire process space) and "local" (i.e. single process) levels of scope.
The two laziness refinements can ameliorate the direct effect of the partitioning; however, they do so indirectly, by exploiting an extra form of parallelism. There are more subtle effects caused by the partitioning, and these are discussed in the following sections.
3.2 Scheduling
The issue of scheduling is not given recognition by the Time Warp mechanism. It is perfectly feasible to execute multiple Time Warp processes on a single physical processing element; however, such a hardware-dependent issue is purposely ignored by the mechanism. The choice of policy to govern scheduling is left entirely to an implementation that requires it. Furthermore, other policy decisions left to a Time Warp implementation can affect scheduling. For example, if some form of bounded Time Warp scheme (see Section 2.3.2) is employed then some processes will voluntarily disqualify themselves from scheduling. From the perspective of the basic Time Warp mechanism any scheduling policy present is immaterial, since any process that executes "too much" (i.e. too far forward in virtual time) will be rolled back. However this fact belies the vital importance of scheduling in a practical Time Warp implementation. Performing a rollback may reverse the effects of incorrect speculative execution, but rollback and reexecution compete with correct execution. Furthermore, the occurrence of a rollback indicates that incorrect speculative computation has previously occurred, possibly at the expense of correct computation. Some level of rollback activity is unavoidable in an optimistic Time Warp system. However a practical Time Warp implementation must take steps to reduce the rate and length of rollbacks in a manner that balances the competing demands of correct and possibly incorrect speculative computation. Failure to do so leads to behaviour known as rollback thrashing. The most straightforward and intuitive approach to scheduling Time Warp processes is to do so on the basis of their local virtual times, an approach called least-timestamp-first (LTSF) scheduling. The process with the lowest such time (or perhaps the lowest "next" message timestamp) is selected for execution by the scheduler.
This strategy focuses solely on advancing the minimum local time of the processes under its scope. Hence if this strategy is applied globally to a Time Warp system then it is GVT which is advanced by each round of scheduling, and good performance can be expected. However if this strategy is not applied globally then the results can be suboptimal [29]. When the process space is partitioned and the LTSF strategy applied independently to each partition, the most useful process to execute will not always be selected at each partition. This can be true when the critical path of a simulation is moving between a large number of processes, and those currently off the critical path can perform useful lookahead computation. It will also be the case when, over a small time slice of the simulation, there are multiple independent critical subpaths. An LTSF scheduler will only give preferential treatment to the subpath that contributes to the global critical path. Hence it can be concluded that, in general, an algorithm that focuses naively on advancing the minimum virtual time of the processes within its scope is not optimal for a general distributed system. This can be seen in the particular case of the ant's nest example presented in the previous section. In this situation an LTSF scheduler would have given maximum priority to the ant processes that were foraging at a remote terrain process. However this would be counterproductive and could lead to rollback thrashing, since the ant processes were necessarily lagged behind others on their node due to high-latency communication. Many other algorithms are possible that base their operation on metrics other than a process' local virtual time. In [5] results are presented comparing, among others, algorithms that work on the input queue size of an object. Processes that have a large queue of future input events are those that have the most work to do, and are given higher scheduling priority. A notable result from this paper is that applying a scheduling penalty to a process that sends antimessages can significantly improve performance. The rationale for such a policy is that sending an antimessage indicates that a process was further ahead in virtual time than its compatriots and was forced to roll back. In other words, the process had previously been allowed to execute too much. The common drawback of the strategies discussed above is their assumption of global scope for the operation of the algorithm. This makes them feasible only for shared-memory implementations where a scheduler can directly inspect processes and have global control over the allocation of processes to processors.
In a general distributed environment such assumptions cannot be made, and a scheduling strategy must also operate in a distributed manner. A distributed Time Warp scheduler has a difficult job. Firstly, the algorithm must deal with the issue discussed previously: it can only have direct influence over a subset of the processes in a system. However the choices an individual scheduler makes have global implications. When a process A rolls back, this is an indication that its local scheduler has failed previously: it scheduled A for execution too early. However the antimessages sent by A as a result of the rollback may cause rollbacks at processes controlled by different schedulers. The error made by A's scheduler effectively causes other schedulers to have made an error. This can be a source of rollback cascades and thrashing. Since rollbacks are the indication of a failure in scheduling, a scheduler should aim to minimise their occurrence. With this in mind, Palaniswamy and Wilsey have proposed a paradigm called parameterised Time Warp (PTW) [29]. Here an adaptive control mechanism is associated with each process to monitor its behaviour and dynamically alter such run-time parameters as the process' scheduling priority. One of the aims of the control mechanism is to minimise the number of rollbacks a process performs. The notable aspect of this scheme is that the scheduling policy is completely delegated to processes. They select their own execution priority, and this can be employed by a simple priority-based scheduler that has scope over any subset of processes. An essential feature of scheduling in a distributed Time Warp system is its close relationship to communication. A scheduler operating on a given partition of a distributed system must observe the arrival of stragglers in its partition to identify its own failures, and the arrival of antimessages to identify failures at other partitions. In this regard scheduling is strongly influenced by the issue of load balancing, which is discussed below.
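As a concrete illustration of the least-timestamp-first policy discussed above, the sketch below keeps the runnable processes of one partition in a priority queue ordered by local virtual time; the SchedulableProcess interface and method names are invented. As the discussion notes, applying this independently in each partition is not necessarily globally optimal.

import java.util.Comparator;
import java.util.PriorityQueue;

// Illustration only; idling, preemption, and interaction with rollback are ignored.
class LtsfScheduler {
    interface SchedulableProcess {
        long localVirtualTime();    // or the timestamp of its next unprocessed event
        void executeNextEvent();
    }

    private final PriorityQueue<SchedulableProcess> ready =
        new PriorityQueue<SchedulableProcess>(
            Comparator.comparingLong( SchedulableProcess::localVirtualTime ) );

    void makeReady( SchedulableProcess p ) { ready.add( p ); }

    // One scheduling round: run the ready process furthest behind in virtual time.
    void scheduleOne() {
        SchedulableProcess p = ready.poll();
        if ( p == null ) return;
        p.executeNextEvent();
        ready.add( p );   // re-insert with its (possibly advanced) local virtual time
    }
}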
3.3 Load Balancing and Process Migration
In general, load balancing is a vital factor in the performance of a distributed computation. Unless the computation is highly regular, dynamic run-time load balancing is usually required to obtain high efficiency.
Despite this, it may be the case that the optimal distribution of processes in a computation changes so rapidly that the benefit of attempting to constantly balance load may not be worth the overhead. Also, a general principle that must be observed is the trade-off between optimising communication load and computation load. If a balancing algorithm attempts to distribute the computational load of a system as evenly as possible it may cause undue amounts of communication. Conversely, an algorithm that attempts to place groups of strongly interacting processes at one node may cause undue computational load. Hence a good load balancing algorithm must compromise between these competing factors. However, from the discussion in the foregoing sections it can be seen that there are load balancing issues specific to the Time Warp mechanism. In Section 3.1 the ant's nest example demonstrated how the logical structure of a simulation can cause processes in separate partitions to become highly dependent on each other. When this occurs the performance of the processes, and indeed the entire simulation, can be degraded due to the relatively high communication latencies between partitions. In this example the foraging ant processes could have been migrated to the same node as the terrain process. Provided the overhead of the migration was low (relative to the amount of processing required), a significant performance benefit could be expected. However a load balancing algorithm would need to be capable of rapidly identifying the condition of strong dependence on a remote process and promptly causing the migration to exploit the potential benefit. Furthermore, it would be required to correctly identify that it was the ant processes that should migrate to the remote terrain process, not the other way around. This could be achieved by observing that the ant processes were not dependent on other processes local to them, while the terrain process was dependent on other processes on its node. A number of approaches to dynamic load balancing in Time Warp systems have been proposed. In [33] Reiher and Jefferson introduce a metric called effective utilisation which is based on the notion of "effective work", i.e. work that is eventually committed. Based on this metric they propose an algorithm for migrating processes from processors with high utilisation to those with low utilisation. In [6] Burdorf and Marti propose an algorithm based on the motivation of avoiding rollbacks. To this end it periodically computes the mean and variance of local times across all processes, and attempts to minimise the variance by migrating processes with particularly low local times to processors that have a high mean local time. In [35] Schlagenhaft et al propose a cluster-based algorithm which employs a virtual time progress metric.
This metric is the rate in real time at which processes advance their local virtual time. As in the Burdorf-Marti scheme, load imbalance is reflected by variance in the metric, which the algorithm attempts to minimise by migrating clusters of processes. In [39] three similar algorithms are presented that are based on computational weight, the cumulative execution time of a process' future input events. Notably the algorithms in this paper also incorporate heuristics that observe the communication patterns of processes. They attempt to avoid causing large amounts of interpartition communication while distributing on the basis of computation (cf. the general caveat discussed at the beginning of this section). The scheme in [2] is also based on computational weight. However in this paper the weight is determined by the execution time of all events (positive and anti) processed between iterations of the algorithm, rather than the expected execution time of future events at each iteration. A load balancing issue that has received little attention is that of "background execution" of a Time Warp program, i.e. a system that targets networks of multi-user workstations and can accommodate dynamic changes to the external workload of the workstations. This issue is discussed in [7], which presents an algorithm with two facets. One is a dynamic load balancing scheme based on a metric called "processor advance time", the amount of real time required for a processor to complete a unit of virtual time. This metric automatically accommodates the level of external (i.e. non-Time Warp) load present at a processor, and will migrate processes away from a heavily loaded processor regardless of where the load comes from. Notably, the schemes in [6] and [35] will also accommodate external load, but not explicitly. The metrics in these schemes, based on local virtual time advancement, will notice any lag in the virtual times of processes at heavily loaded processors, and migrate processes to less loaded processors for this reason. However the novel aspect of the algorithm in [7] is its extra facet, which allows the pool of available processors between which processes may migrate to be dynamically altered. The algorithm uses allocation and deallocation thresholds to periodically add or remove processors from the set available for receiving processes. This allows the algorithm to effectively accommodate "spikes" of external workload.
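The flavour of these metric-based schemes can be sketched with a simple helper that computes, per partition, the virtual-time progress rate and nominates the partition lagging furthest behind the mean as a candidate for rebalancing. All names and the threshold test are invented; actual algorithms such as those in [35] and [7] are considerably more involved.

import java.util.Map;

// Sketch of a virtual-time-progress style balancing decision; cluster selection
// and the actual migration mechanism are omitted.
class ProgressBalancer {
    // progressRate maps partition name to virtual time advanced per second of real time.
    static String slowestLaggingPartition( Map<String, Double> progressRate, double imbalanceFactor ) {
        double mean = 0.0;
        for ( double r : progressRate.values() ) mean += r;
        mean /= Math.max( 1, progressRate.size() );

        String slowest = null;
        double slowestRate = Double.MAX_VALUE;
        for ( Map.Entry<String, Double> e : progressRate.entrySet() ) {
            if ( e.getValue() < slowestRate ) {
                slowest = e.getKey();
                slowestRate = e.getValue();
            }
        }
        // Only worth acting on if the slowest partition lags the mean significantly.
        return ( slowest != null && slowestRate < mean * imbalanceFactor ) ? slowest : null;
    }
}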
3.4 Intermediate Levels of Scope
A common theme in the foregoing discussion is the presence of the two levels of algorithmic scope explicitly acknowledged by the basic Time Warp mechanism. These are the global and local (or process) levels of scope, within which the basic mechanism is wholly defined. The predominance of these levels of scope can be seen in the optimisations and GVT algorithms that have subsequently been developed. Most of the optimisations discussed in Section 2.3 are local optimisations, i.e. they operate independently on each process in a system. However some are explicitly global in their scope of operation. For example the cancelback storage reclamation protocol inspects the entire pool of queue objects in a system and selects the globally optimal object for reclamation. In a similar fashion Mattern's GVT algorithm [27] operates at both levels of scope, and mirrors the Time Warp mechanism in being defined in terms of global and local behaviour. On the one hand an initiator process acts as a global coordinator for the algorithm: it spontaneously begins the execution of the algorithm, observes its completion, and broadcasts the result to all processes. On the other hand the algorithm defines the behaviour, both during and between rounds of GVT calculation, that is required of each process (including the initiator). Furthermore the scheduling algorithms discussed in Section 3.2 also exhibit this local/global dichotomy. For example a least-timestamp-first scheduler can provide optimal performance if it has global scope, but if applied independently to smaller clusters of processes it can behave counter-productively. Other approaches such as the parameterised Time Warp scheme [29] delegate scheduling entirely to individual processes. Despite the predominance of the global and local scopes in Time Warp, many algorithms employ what can be termed the "cluster" level of scope (not to be confused with "Clustered Time Warp" [1], a hybrid scheme in which clusters of processes execute sequentially and synchronisation between clusters is achieved through Time Warp). This can take the form of a partition level of scope, as discussed previously. This is the level of scope imposed upon a Time Warp system due to physical partitioning of the process space amongst physical processors. It is seen in scheduling and load balancing algorithms which operate on partitions to schedule processes within a partition or migrate processes between them. Process clustering can also be seen in the subpartition clusters employed by the load balancing scheme in [7]. Here highly dependent processes are placed in small clusters which are atomically migrated between partitions by the algorithm. Furthermore superpartition clustering, where partitions are clustered and treated collectively, may be appropriate to the operation of an algorithm and to the representation of some supercomputing architectures. For example the SGI PowerChallenge architecture, consisting of clusters of shared-memory multiprocessors with a message-passing interconnect, could be effectively given software representation by partition and superpartition levels of scope. A GVT algorithm could then exploit this structure during its operation. This might involve less high-latency communication between superpartitions and more low-latency intrapartition communication. As a more concrete example, the WARPED system introduced in Section 2.5 employs clusters of processes; within a cluster the use of the MPI interface is bypassed, allowing faster communication.
Figure 4: The Tree Structure of Clustering Scope Levels

This hierarchy forms a tree structure, which is depicted in Figure 4. The structure is implicit, in one form or another, in all algorithms concerned with the Time Warp mechanism that have been discussed thus far. Clearly this hierarchy can be extended arbitrarily. However the utility of extra levels of scope between global and local, other than the three present in Figure 4, is questionable. The partition level, and the levels above and below it, appears to be the simplest paradigm that unifies the scope models of the vast majority of Time-Warp-related algorithms. Generally these algorithms are interested in the major external characteristics of a Time Warp process, such as its virtual time and its communication patterns. These characteristics can be accumulated using simple functions: arithmetic minimum in the case of virtual time and set union in the case of communication patterns. This allows each level of scope to present its parent level with external characteristics that correspond to the cumulative characteristics of the level's components. The particular implications of this, and the opportunities it may present, have yet to be investigated. The tree structure, and algorithms for operating on it, are well known. However this has not been acknowledged, let alone exploited, by the Time Warp research community.
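A small sketch of the accumulation idea is given below; the ScopeNode class is invented for illustration (no such generic class exists in FATWa) and simply shows how a node in the scope tree could summarise its children for its parent, using minimum for virtual time and set union for communication partners.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration of accumulating external characteristics up a scope tree.
class ScopeNode {
    private final List<ScopeNode> children = new ArrayList<ScopeNode>();
    private long ownVirtualTime = Long.MAX_VALUE;                  // at a leaf: the process' LVT
    private final Set<String> ownPartners = new HashSet<String>(); // at a leaf: processes communicated with

    void addChild( ScopeNode c ) { children.add( c ); }

    void setLeafCharacteristics( long lvt, Set<String> partners ) {
        ownVirtualTime = lvt;
        ownPartners.addAll( partners );
    }

    // The external characteristics this level presents to its parent.
    long minVirtualTime() {
        long min = ownVirtualTime;
        for ( ScopeNode c : children ) min = Math.min( min, c.minVirtualTime() );
        return min;
    }

    Set<String> communicationPartners() {
        Set<String> union = new HashSet<String>( ownPartners );
        for ( ScopeNode c : children ) union.addAll( c.communicationPartners() );
        return union;
    }
}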
4 High-Level Design of the FATWa System

The FATWa system is a collection of Java classes that can be employed to program and execute a discrete event simulation over multiple Java Virtual Machines (JVMs). These JVMs can be distributed over any number of physical hosts. The FATWa system is intended to support investigation of the wide range of issues raised in the previous sections of this report. A review of these issues is now given.
The Simple PDES API.
In Section 1.3 the requirements for an API for PDES simulation were discussed. By way of example it was demonstrated that the minimal API consists of an operation to obtain the "next" simulation event, an operation to send simulation events in response, and an operation for creating new simulation processes. These three operations are required at the process level, while the latter two alone are required by a driver for initiating a simulation. It was then shown that the minimalist model of distributed computation assumed by the Time Warp mechanism is completely congruent with the PDES paradigm, and that the virtual time model can provide the required synchronisation. Consequently there is a strong design motivation for providing a simple API to the user of the FATWa system. Ideally the user's view of the API will revolve around the three operations listed above.

The Large Range of Optimisations to the Basic Mechanism.

In Section 2 the basic Time Warp mechanism was introduced. It was shown that a large range of optimisations and refinements are available (if not required) to achieve good performance from the mechanism. Hence a goal in designing FATWa has been to provide a testbed for these algorithms. Ideally they should be able to be implemented with a minimum of side-effects, allowing controlled experimentation and comparison.

The Partitioning of the Process Space.

In Section 3 the issues surrounding the implementation of a distributed Time Warp system were discussed. In particular the issues of distributed scheduling and dynamic load balancing were raised as important issues requiring investigation. The design of FATWa is intended to provide a testbed for these classes of algorithms as well as those discussed in Section 2.3. In Section 2.5 the WARPED simulation package was introduced, and it was mentioned that this system had been taken as the starting point for the design of FATWa. Both systems attempt to present a simple API to users through a Time Warp process class that they extend to make simulation process classes. Both systems hide the complexity of implementing the Time Warp local control mechanism behind classes corresponding to the fundamental Time Warp objects of processes, state objects, and events. This is discussed below in Section 4.1. The FATWa classes that correspond to these Time Warp objects are discussed in Section 4.2. The two other classes fundamental to the operation of Time Warp are discussed in Sections 4.3 and 4.4. A summary of FATWa classes can be found in Section 4.6. A major difference between the two systems is the manner in which the modularity discussed previously is achieved. In WARPED the preprocessor facilities of the C++ language are extensively exploited to achieve this modularity: by defining the appropriate symbols, alternative GVT algorithms and so forth are included or excluded from the system at compile time. Since this is not possible in Java the FATWa design was forced to adopt a different approach. This is discussed in Section 4.5.
4.1 Overview of FATWa Classes
The user of FATWa sees the classes depicted in Figure 5. The figure shows how the user's driver class creates and initiates an OverSeer object. The OverSeer manages simulation process objects which the driver also creates, together with their associated state objects and initial events. The figure also shows how the user defines simulation processes, state objects, and events by extending the FATWa classes BasicObject, BasicState, and BasicEvent respectively. The internal overview of the FATWa classes is quite different, as depicted in Figure 6. This figure shows how a single instance of the OverSeer class distributes the simulation over multiple JVMs through the VMGroup class. This class manages the simulation processes that are present in a particular JVM.
Figure 5: A User's Overview of FATWa classes

The simulation processes are not constructed by the VMGroup objects, but by the user's driver initially and by existing processes during the simulation. A newly constructed instance of a simulation object (a subclass of BasicObject) is passive; it must be registered with its local VMGroup to become an active process participating in the simulation. From an internal perspective the OverSeer, VMGroup, and BasicObject classes are the three fundamental classes in FATWa, and they are discussed in the following three sections. Figure 6 also shows modules attached to the three fundamental classes. These are discussed in detail in Section 4.5.
4.2 Simulation Processes
All simulation processes in a FATWa simulation must be objects of classes that extend the FATWa class BasicObject. This class is the software representation of the local level of scope, as discussed in Section 3.4. It is the responsibility of this class to implement the local control aspect of the Time Warp mechanism, and to support any part of an algorithm that operates at a local level.
Figure 6: An Internal Overview of FATWa classes

Since simulation processes extend (directly or indirectly) the BasicObject class, it is through this class that FATWa provides the process-level PDES API. As discussed previously, the minimalist PDES API consists of only three operations: "process next event", "send event", and "create new process". These three operations are provided by three protected methods in BasicObject:
abstract protected void processNext( BasicEvent )
The BasicObject class is defined as abstract, with users forced to implement this method. It is the method called by BasicObject code when the user's subclass code is required to process a single simulation event (the "next" event). The event itself is provided as a parameter to the method.
protected void sendEvent( BasicEvent )
This method allows the user to send a simulation event. Since all user-defined simulation events are subclasses of the BasicEvent class, this method will accept any simulation event constructed by user code, and deliver the event to the process named as the receiver of the event.

protected void registerObject( BasicObject )
This method can accept any newly constructed simulation process, and incorporate it into the running simulation.
Through these three methods FATWa provides the minimalist PDES API and maintains the illusion of an amorphous Time Warp process space. A simulation process need only be able to name another process as the receiver of an event in order to send an event to it, as per the Time Warp specification [17]. For BasicObjects to act successfully as Time Warp processes they must be able to maintain run-time queues of events and state checkpoints. Furthermore the event and state objects must contain certain attributes to enable the operation of the mechanism (see Section 2.2). These attributes are defined as attributes of the BasicEvent class in the case of events, and the BasicState class in the case of process state objects. Since all events are BasicEvent subclass instances, the input and output event queues of a process can be maintained as queues of BasicEvents. Similarly the user must extend the BasicState class when defining the state object of a simulation process. The state checkpoint queue of a process is maintained as a queue of BasicStates. A brief summary of the properties of BasicEvent and BasicState is given below:
BasicEvent
This class is the base for all simulation events. It contains the basic attributes required of all messages in a Time Warp system: sender, virtual send time, receiver, and virtual receive time. It also contains other attributes to enable the operation of FATWa, for example a sequence number that ensures unique matching of an antimessage to its positive counterpart. The class also provides a method that produces and returns an antimessage equivalent for a positive event. Since the data associated with a user event is irrelevant to the operation of the antimessage mechanism, the antimessage produced is actually an instance of BasicEvent that is tagged with the appropriate information to find and annihilate the original.
BasicState
This class is the base for all simulation process state objects. It contains the basic attributes required of the states of all processes in a Time Warp system: a local virtual time counter, the input queue position, and the output queue position. Other attributes required for the implementation, such as the counter for the sequence numbers used to tag outgoing events, are also present. An important property of the state of a Time Warp process is that it must be regularly checkpointed. As a result the BasicState class is abstract, with the user forced to implement a deep clone method. To ensure that all the internal attributes are correctly cloned by the subclass method, the baseClonify( BasicState ) method is provided by the BasicState class. It accepts the newly constructed clone (which will be an instance of a subclass of BasicState) and copies over the BasicState attributes.
The BasicObject, BasicState, and BasicEvent classes collectively comprise the simulation process programming environment. Following is the form taken by a typical process in a FATWa simulation.

class SimProcess extends BasicObject {

    SimProcess( ... ) {
        super( "My Name", new SimState() );
        // The BasicObject constructor requires a String
        // name and the initial simulation state
    }

    // Following is the only method that must be present in
    // the user's class
    public void processNext( BasicEvent in_event ) {
        SimEvent my_event = (SimEvent) in_event;
        // For convenience the parameter can be downcast
        // to the user's subclass
        SimState my_state = (SimState) this.state;
        // When this method is called the superclass
        // attribute state will refer to the current state
        // (which can also be downcast)
        switch ( my_event.type ) {
        case 1:
            my_state.my_attribute = ... ;
            // Modifying state due to incoming event
            sendEvent( new SimEvent( ... ) );
            // Sending an effect event due to incoming event
            registerObject( new SimProcess() );
            // Creating a new process due to incoming event
            ...
        case 2:
            ...
        }
    }
}
This template demonstrates the congruence of the programming environment provided by FATWa with the PDES paradigm of specifying a simulation as a cause-to-effect mapping. The bulk of the process class code is taken up by the switch statement which provides the map from input cause event to effects. As the example code demonstrates, these effects can include state changes, further events, and the creation of new processes. The form for a state class such as SimState is as follows:

class SimState extends BasicState {

    ... user's state attributes ...

    // The user must provide a deep clone method
    // for the state attributes above
    public BasicState aclone() {
        SimState new_clone = new SimState();
        baseClonify( new_clone );
        // Use superclass method to copy over attributes
        // internal to FATWa
        ... copy over user's attributes ...
        return (BasicState) new_clone;
    }
}
The event class SimEvent would be defined in the following simple form:

class SimEvent extends BasicEvent {

    ... data attributes for event ...

    SimEvent( String receiver, int recv_time ) {
        super( receiver, recv_time );
        // These parameters are required for the super constructor
        ...
    }

    public boolean sameEvent( BasicEvent other ) {
        if ( !super.sameEvent( other ) ) return false;
        if ( !( other instanceof SimEvent ) ) return false;
        // Might be asked to compare this event with one of
        // a different class
        if ( ((SimEvent) other).attribute1 != this.attribute1 ) return false;
        ... compare this class's attributes ...
    }
}
From these templates it can be seen that the programming environment provided by FATWa largely shields the user from both implementation details and the Time Warp mechanism itself. There are three major implementation issues the user must be aware of. One is the need to implement a deep clone method for state objects; a second is to implement a comparison method for events. The third, not shown by the previous templates, is the need for all FATWa simulation processes to be fully serializable. While all the base classes implement the Serializable interface, users must ensure that all attributes they add in subclasses are also fully serializable. Provided they do so, the JDK serialization mechanism ensures that the full set of objects referenced by a process will be serialized and migrated with it. Furthermore the requirements of the Time Warp mechanism are present in the programming environment in two significant forms. One is the form of the constructors of BasicObject and BasicEvent. The constructor for BasicObject requires the provision of a name to identify the process and an initial state object, as per the specification of the Time Warp mechanism. The BasicEvent constructor requires the provision of a receiver's name and receive time (timestamp), again as per the requirements of the Time Warp mechanism. The sender's name and send time, also required for all Time Warp messages, are provided by FATWa. An alternative, which has not been explored, is to avoid String names constructed by the user and to have processes identified by handles generated by the system. These handles would be instances of a name class which could also provide copy and comparison methods. This approach would remove from the user the onus of ensuring the uniqueness of process names, and would avoid other problems associated with employing user-generated String names. The other noticeable intrusion of the Time Warp mechanism is the requirement for user code to access process state only through the BasicObject.state attribute. This ensures that all state information will be correctly managed through checkpointing and rollback. However the user is mostly unhindered in their specification of a PDES simulation within the programming environment provided by the base FATWa classes BasicObject, BasicState, and BasicEvent.
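If the unexplored handle-based alternative were pursued, a process identifier class along the following purely hypothetical lines could replace user-chosen String names; nothing like it exists in the current FATWa design, and the class name and counter scheme are inventions for illustration.

import java.io.Serializable;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical only; in a distributed setting the identifier would also need a
// JVM or partition component to remain unique across virtual machines.
final class ProcessHandle implements Serializable, Comparable<ProcessHandle> {
    private static final AtomicLong nextId = new AtomicLong( 0 );
    private final long id;

    private ProcessHandle( long id ) { this.id = id; }

    // The system, not the user, mints handles, guaranteeing local uniqueness.
    static ProcessHandle newHandle() { return new ProcessHandle( nextId.getAndIncrement() ); }

    @Override public boolean equals( Object o ) {
        return ( o instanceof ProcessHandle ) && ( (ProcessHandle) o ).id == id;
    }
    @Override public int hashCode() { return Long.hashCode( id ); }
    @Override public int compareTo( ProcessHandle other ) { return Long.compare( id, other.id ); }
    @Override public String toString() { return "process#" + id; }
}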
4.3 Virtual Machine Groups
The diagram in Figure 6 shows the VMGroup occupying a managerial position over simulation processes. However the role of this class is completely transparent to the user, as can be seen from Figure 5. The user sees an OverSeer managing their simulation, yet most of the managerial duties are delegated to VMGroups. For each JVM that simulation processes are executing on, a single instance of the VMGroup class is constructed. VMGroups are active, threaded managers: VMGroup extends java.lang.Thread and is started upon initialisation. The VMGroup objects collectively interact to provide a physical partition level of scope (as defined in Section 3.4), and manage the entire process space of a simulation. However the simulation programmer is presented with the illusion of an amorphous process space, and is largely unconcerned with this structure. The primary managerial task of VMGroup objects is to maintain a table of all the simulation processes present in the JVM. As mentioned in the previous section, a process must be registered with its local VMGroup object for the process to participate in a simulation. To this end the VMGroup class provides a registerObject( BasicObject, String ) method. This method is called by BasicObject code on behalf of user code when a new simulation process is registered. Before the process can be added to the VMGroup's table of local processes its name must be checked: duplicate process names are not allowed, nor are certain special names (such as "BasicObject" and other class names). Initially the process is nascent, so it is "activated" by the VMGroup through a BasicObject method that causes a new java.lang.Thread to begin executing the process. The second parameter of VMGroup.registerObject is implicitly null for user-created processes. However internally it is used to specify migration of a simulation process from one JVM to another. This aspect of the method operates in tandem with a symmetrical deregisterObject( BasicObject, String ) method. Together these two methods allow run-time migration of simulation processes between the JVMs managed by VMGroups. The VMGroup class also plays the primary clerical role in the FATWa system. Implicit in the Time Warp process space is the ability of a process to "inject" a message into the space and have it arrive, some indeterminate but finite time later, at its destination process. Since VMGroup objects collectively implement the Time Warp process space, they must provide this universal communications subsystem. In FATWa this is achieved through a public method VMGroup.deliverThis( BasicMesg ). Simply by ensuring that all objects in a FATWa system have a reference to their local VMGroup object, all objects are ensured the ability to send messages. The BasicMesg class is the base class for all messages (simulation event or otherwise) that are exchanged within FATWa. BasicMesg objects have only two attributes: two Strings specifying the sending and receiving entities.
by FATWa system objects when they exchange messages. To effect delivery of messages each VMGroup object in a simulation maintains references to all the other VMGroups in the simulation. Furthermore each VMGroup maintains references, directly or indirectly, to all entities local to its JVM. Thus when presented with a message to deliver, a VMGroup can engage a set of routing heuristics and decide what to do with the message. This may be to directly deliver the message to the recipient, or to forward the message to another local entity or remote VMGroup which is "closer" to the receiver.
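The routing decision itself is small; the following sketch shows the shape it might take. The accessor getReceiver and the helpers isLocal, lookupLocal and groupFor are assumptions introduced for the example, while deliverThis and CommsAgent.sendThis are the operations named in the text.

    // A sketch of the routing heuristic a VMGroup might apply on delivery.
    // getReceiver, isLocal, lookupLocal and groupFor are assumed helpers.
    public void deliverThis( BasicMesg msg ) {
        String receiver = msg.getReceiver();
        if ( isLocal( receiver ) ) {
            // The receiver is registered in this JVM: hand the message over directly.
            lookupLocal( receiver ).accept( msg );
        } else {
            // Otherwise forward via the pluggable CommsAgent to the VMGroup
            // believed to host the receiver (sendThis(BasicMesg, String)).
            commsAgent.sendThis( msg, groupFor( receiver ) );
        }
    }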
Figure 7: Interpartition Communication through the CommsAgent Class (VMGroup and CommsAgent objects within each Java Virtual Machine, connecting the simulation processes of the partitioned process space)

To improve the modularity of the FATWa design the CommsAgent class is employed by VMGroup objects to deliver objects between JVMs. Since there are numerous possibilities for the manner in which messages or simulation processes can move between JVMs, it is desirable to make such remote communication completely modular. To this end the method CommsAgent.sendThis(BasicMesg,String) allows a VMGroup to request the delivery of the given message to the VMGroup specified in the second parameter. Similarly a second method CommsAgent.transportObject(BasicObject,String) migrates the given simulation process to another VMGroup. This arrangement is depicted in Figure 7. The figure shows a 4-way partitioned process space, as would exist for the example ant simulation discussed in Section 3.1. In the figure it can be seen how the four VMGroup objects in the simulation, through their CommsAgent server objects, form a fully connected point-to-point communication topology. In this fashion the universal communications subsystem required of an implemented Time Warp process space is achieved.
4.4 The Overseer Process
The logical role of the OverSeer class is to embody the global level of scope for the Time Warp mechanism. On a practical level the class' role is to bootstrap a FATWa simulation, and provide the user with a run-time point of contact with the simulation. To initiate a FATWa simulation a user's driver class constructs an OverSeer object. The OverSeer provides the API that allows the driver to construct and register initial simulation processes, then send them initial events. The method OverSeer.registerObject has identical behaviour to its VMGroup counterpart. The BasicObject.registerObject counterpart is not provided, since the user must be able to specify the initial distribution of processes to partitions. However OverSeer.sendEvent has identical behaviour to its BasicObject counterpart. As mentioned in the previous section the OverSeer class delegates most of its managerial duty to the VMGroup class. This detail is hidden from the user, who only interacts with the OverSeer and BasicObject classes. However the user is still required to specify the partitioning of the process space for a given simulation. Before initial processes can be registered the OverSeer must be given a partition specification object. The PartitionSpec class provides attributes for specifying information such as logical partition names and their corresponding physical host names, as well as the partition that is local to the OverSeer object. The OverSeer does not interact directly with all of the VMGroup objects that form the partitioned process space, but rather indirectly through a single VMGroup that is local to the OverSeer. This arrangement is shown in Figure 8. The diagram shows how the user's driver constructs an OverSeer object and provides it with a PartitionSpec object. This prompts the OverSeer to use the information in this object to construct the required VMGroup objects on all the JVMs that will be hosting simulation processes.
Figure 8: Phase 1 of Partitioning the Process Space (the Driver constructs an OverSeer and provides a Partition Specification; construction hooks create VMGroups on JVM Host 1, JVM Host 2, JVM Host 3, ...)
The local VMGroup can be directly constructed, as is shown in Figure 8. However to construct VMGroup objects on JVMs other than the one on which it resides, the OverSeer employs the InitiatorHook class. This can be seen in Figures 6 and 8. The InitiatorHook class uses the Java RMI (remote method invocation) mechanism to allow the OverSeer to bootstrap the FATWa system from one JVM onto many in a concerted and consistent fashion. As Figure 8 shows, the PartitionSpec object provided to the OverSeer is passed on to each InitiatorHook and VMGroup. This allows each VMGroup to establish connections to all other VMGroup objects, and ensures the partitioned process space is correctly formed. The details of this phase are hidden from the user, and it is conducted as soon as a PartitionSpec has been provided. Once the process space has been constructed, the user's driver can construct initial simulation processes and send them their initial events. This is depicted in Figure 9. The diagram shows a driver constructing the processes and events and injecting them into the process space using the API provided by the OverSeer class. The OverSeer passes all the processes and events on to its local VMGroup object, which employs its normal object migration and message delivery scheme to spread the processes across the partitions. In this fashion the driver of a simulation can construct the idealised partitioned process space depicted in Figure 10.
Figure 9: Phase 2 of Partitioning the Process Space (the Driver constructs initial processes and initial events and injects them through the Overseer; its local VMGroup spreads them to the VMGroups on JVM Host 1, JVM Host 2, JVM Host 3, ...)
Figure 10: User's ideal view of a partitioned process space (an OverSeer and simulation processes in a process space spanning several Java Virtual Machines)

This diagram corresponds to that in Figure 7, and shows the user's idealised view of the same process space. The user sees a process space that is partitioned across JVMs in the manner they have specified, with an OverSeer object residing on one of the JVMs and managing the simulation.
The OverSeer class effectively hides the details of how this is achieved from the user. Drivers of FATWa simulations take the form of the following template:

    class SimDriver {
        public static void main( String[] args ) {
            PartitionSpec pspec = new PartitionSpec();
            // ... specify partitioning ...
            OverSeer os = new OverSeer( pspec );
            boolean result = os.partitionSpace( pspec );
            // Return value indicates true for success
            if ( result == false ) {
                System.exit(-1);
            }
            // ...
            os.registerObject( new SimProcess( ... ), "Partition 1" );
            os.registerObject( new SimProcess( ... ), "Partition 2" );
            // ... initial objects are created ...
            os.sendEvent( new SimEvent( ... ) );
            // ... initial events are sent ...
            // Simulation is now executing
            // ... driver can continue interacting with OverSeer ...
            os.shutDown();
        }
    }
The template above shows the driver explicitly calling for a halt to the simulation. This is not strictly necessary, since the FATWa system automatically performs shut-down operations (closing log files, collating statistics, etc.) when the GVT of the system reaches +∞.
4.5 The FATWa Module Interfaces
To aid the modularity of the FATWa system, interfaces are defined to allow plug-in modules to be added to the system. Figure 6 on page 47 shows how the three major classes in FATWa, discussed in the previous three sections, all support modules. For each class there is a defined module interface to which plug-ins must conform. The interfaces define methods which allow the host class to (a) query a module and (b) stimulate the module when something "interesting" happens. The definition of interesting changes from host class to host class, however there is much common ground between all three interfaces. Thus there is an interface hierarchy with the FATWaModule interface at its root. Three direct subinterfaces, BasicObjectModule, VMGroupModule, and OverSeerModule, provide the host-class-specific methods. The FATWaModule interface mandates two query methods of all modules. These allow the host class to obtain a name from all modules, as well as a value indicating how frequently a module would like to be spontaneously stimulated by the host class. A module may require spontaneity to allow it to trigger the operation of an algorithm. For example Mattern's GVT algorithm requires a spontaneous "initiator process," and this could be implemented by a spontaneous module which starts the algorithm. However a module can, of course, choose not to exhibit any spontaneous activity. The name required of each module is necessary since modules, like all entities in a FATWa system, can use the universal communications subsystem provided by VMGroups. The class ModuleMesg extends BasicMesg and defines two extra attributes that specify the source and destination modules. The names of the source and destination host objects are provided by the BasicMesg attributes. By extending the ModuleMesg class, the instances of any number of module classes throughout a simulation can exchange messages. For this reason the FATWaModule interface mandates that all modules have a method which accepts a ModuleMesg and takes appropriate action. While all modules need to take notice of an arriving ModuleMesg, what else may be interesting to a given module will vary. So while the BasicObjectModule interface defines a wide range of occurrences that may be of interest, many modules will be interested in only one or two of these. The result is that modules will frequently take the form of many methods consisting of a null body. If this fact is recognised by a Java compiler or JVM class loader then the overhead of a host class stimulating a disinterested module may be effectively diminished. Furthermore if a module interface is expanded to include additional interesting occurrences, an existing module can easily be upgraded to the new interface by implementing null-operation method bodies.
By defining a sufficiently broad range of interesting occurrences, the module interfaces can support a wide range of algorithms. As an example, Mattern's GVT algorithm has been implemented using the module interfaces (see Section 5.4 for details). Furthermore the FATWa system uses modules internally to monitor and accumulate statistics concerning their host object. The module interfaces allow GVT algorithms, for example, to be compared in a controlled fashion. Various algorithms can be added and removed in a side-effect free fashion. It is also possible to implement modules that monitor the performance of other modules. The module interfaces provide the high degree of flexibility motivating the FATWa design as well as a high degree of ease of use.
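A sketch of what this hierarchy might look like follows. The interface names are those of the FATWa design; the individual method names and signatures are assumptions made for illustration.

    // The interface names come from the FATWa design; the methods are illustrative.
    interface FATWaModule {
        String getModuleName();                  // every module must identify itself
        long getSpontaneityInterval();           // how often the host should spontaneously stimulate it (0 = never)
        void receiveModuleMesg( ModuleMesg m );  // all modules must accept module-to-module messages
    }

    interface BasicObjectModule extends FATWaModule {
        // "Interesting" occurrences at the simulation-process level; most modules
        // implement only one or two of these and leave the rest as null bodies.
        void onEventSent( BasicEvent ev );
        void onEventReceived( BasicEvent ev );
        void onRollback( int toVirtualTime );
    }

    interface VMGroupModule extends FATWaModule {
        void onInterPartitionMessage( BasicMesg m );
    }

    interface OverSeerModule extends FATWaModule {
        void onGvtUpdate( int gvt );
    }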
4.6 Summary of FATWa Classes and Interfaces
The following table summarises the function of the classes and interfaces of the FATWa design described previously.
BasicObject -- Provides the superclass for simulation processes in a FATWa simulation. Implements the local level of scope in a Time Warp system.
BasicState -- The superclass for the state objects of simulation processes.
BasicEvent -- The superclass for all events exchanged by simulation processes.
BasicObjectModule -- The interface for modules attached to simulation processes.
VMGroup -- Manages partitions (i.e. JVMs) on which simulations execute. Provides a partition level of scope for load balancing algorithms, etc.
CommsAgent -- Class employed by VMGroup objects to communicate with one another.
VMGroupModule -- The interface for modules attached to VMGroups.
OverSeer -- Oversees an entire simulation and provides the run-time point of contact between a simulation program and a simulation driver class.
InitiatorHook -- Enables an OverSeer object to construct a partitioned Time Warp process space.
PartitionSpec -- A class for describing the physical partitioning of the process space of a simulation.
OverSeerModule -- The interface that must be satisfied by modules attached to the OverSeer object of a simulation.
BasicMesg -- The base class for all messages exchanged in FATWa (including BasicEvent).
ModuleMesg -- The base class for all messages exchanged by modules.
5 Low-Level Implementation Issues in the FATWa System

The design described in the previous section has been implemented, and a number of issues have emerged as a result. These issues primarily consist of performance considerations and restrictions encountered in the Java environment. FATWa was implemented using the JDK 1.2 API, and exploits features not present in earlier versions. The following four subsections treat in more detail the three primary classes and the interface system discussed in the earlier subsections. Important contrasts in the manner in which the FATWa and WARPED systems implement the same design features are also discussed.
5.1 Simulation Processes
Section 4.2 described the BasicObject class as being responsible for the local level of scope in a Time Warp system. It implements the local control aspect of the Time Warp mechanism and provides a PDES API to users. It was shown how the PDES-programming user is to a large extent shielded from the implementation details of the BasicObject class and the Time Warp mechanism. However some features of the JDK 1.2 API force the FATWa system to impose itself upon a user. One way is to force the user to implement a deep clone method for their process state classes. The Time Warp mechanism requires a process to regularly checkpoint its state, hence a clone operation is required for all classes acting as process state objects in FATWa. However the shallow clone method provided by java.lang.Object is not acceptable, since it would allow modifications to a process's current state to be seen by previously cloned checkpoints. The requirement for a deep clone forces the user to take into account the structure of their state object. For example if they use a Collection or Map class from the java.util package they must remember to explicitly clone each element (key and value in the case of the Map classes) as well as the utility object itself. If the user fails to provide the proper deep clone behaviour in their method then the simulation may or may not suffer causality errors; unfortunately the system would be unable to detect such an occurrence if it happened. Furthermore if the user fails to employ the BasicState.baseClonify method (or manually copy across the base information) the system may behave incorrectly, resulting in a causality error or system failure.
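To illustrate the obligation this places on the user, a deep clone for a hypothetical state class might look as follows; the class and its attributes are invented, and the argument taken by BasicState.baseClonify is an assumption.

    // A hypothetical state class showing the deep-clone obligation.
    import java.util.ArrayList;
    import java.util.List;

    class NodeState extends BasicState {
        int packetsForwarded;                      // primitives are copied by value
        List pendingPackets = new ArrayList();     // a java.util container needs explicit copying

        public Object clone() {
            NodeState copy = new NodeState();
            baseClonify( copy );                   // assumed to copy the BasicState bookkeeping
            copy.packetsForwarded = this.packetsForwarded;
            // The container itself must be duplicated, otherwise a checkpoint would
            // see later mutations of the live list. (String elements are immutable;
            // mutable elements would have to be cloned individually as well.)
            copy.pendingPackets = new ArrayList( this.pendingPackets );
            return copy;
        }
    }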
There are means by which it would be possible to avoid forcing the user to provide deep clone methods. One possibility is to employ the Java serialization mechanism, since this has similar semantics to a deep clone. This would involve constructing output and input object streams plugged together into a "cloning pipeline." A static class could provide a cloning service that operated on any object this way. However anecdotal evidence suggests that this scheme can be over one hundred times slower than manual cloning, and in the context of a Time Warp system this is unacceptable. Another possibility would be to employ the reflection API in JDK 1.2 to allow a BasicState method to inspect and clone its subclass' attributes. However this approach would probably also incur unacceptable overheads, although this is yet to be confirmed. Unfortunately, forcing deep cloning upon a user would appear unavoidable in the Java environment if acceptable performance is to be achieved. Notably the WARPED system suffers similar problems, but not due to performance considerations. Despite having direct access to the memory space containing states and events, the WARPED system requires the user to provide a deep clone method. There is no recourse to a serialization or reflection approach at all in C++. Furthermore the user of WARPED is forced to implement state allocation and deallocation methods in their process classes. These are required to enable handling of state objects, since there is no automatic garbage collection. Importantly, if users do not implement these correctly then the simulation will proceed correctly, but with memory leakage. Another way the implementation requirements of the FATWa system force themselves upon a user is to require a comparator method for all simulation events. This is to allow the operation of the lazy cancellation mechanism. This requirement is similar to cloning for state objects, and likewise requires a user to implement the correct comparison semantics or else the simulation may fail. In the same way that the JDK's default cloning operation is insufficient, the default comparator Object.equals(Object) does not provide the correct results. As with cloning, the system would be unable to detect the method failing to correctly compare two events. Likewise this problem appears unavoidable in the Java environment: the only recourse, the reflection API, would again involve too great an overhead.
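The comparator obligation can be illustrated in the same spirit; the event class and its fields below are invented, and comparison of the inherited BasicEvent attributes is elided.

    // A hypothetical event class showing the comparator needed for lazy cancellation.
    class PacketEvent extends BasicEvent {
        String payload;
        int hopCount;

        public boolean equals( Object other ) {
            if ( !(other instanceof PacketEvent) ) {
                return false;
            }
            PacketEvent that = (PacketEvent) other;
            // Two events are "the same" only if their simulation-visible content
            // matches; reference equality is not enough. The inherited BasicEvent
            // attributes (receiver, timestamps) would also need to be compared.
            return this.hopCount == that.hopCount
                && this.payload.equals( that.payload );
        }
    }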
Notably the WARPED system requires users to implement serialization and deserialization methods for their event classes, since this facility is not automatically provided in the C++ environment. However a WARPED user is not required to implement a comparator method. A vital implementation issue for the BasicObject class is the efficiency with which it manages its Time Warp run-time structures. Indeed the fundamental purpose of the class could be construed as the manipulation of the input, output, and state queues of the simulation process. The three structures are referred to as queues, and ideally behave as such. An object will enter the end of the queue when created, move through the queue, and leave at its head when fossil collected. However the requirements of a rollback include searching from the end of the state queue for the checkpoint with the highest virtual time that is still below the time of the rollback. In this respect the behaviour is more stack-like than queue-like. Manipulating the input queue involves occasionally inserting a straggler into the queue in correct timestamp order, or removing an event from the middle of the queue due to an antimessage. Also in the case of a rollback, output queue events must be moved to a secondary queue to allow the operation of the lazy re-evaluation algorithm; there may be sorting involved if there are already elements in this lazy re-evaluation queue. In these respects the queues' behaviour is more list-like. While the general behaviour is queue-like, clearly there are many specific operations that could be better supported by a specialised data structure. To this end the WARPED system exploits C++ preprocessing to obtain modularity with these data structures. A uniform queue interface is defined, allowing alternatives for input, output, and state queues to be evaluated. The current implementation of the FATWa design does not support modularity of its run-time structures. It simply employs the JDK utility class java.util.Vector for maintaining its four queues (the three basic Time Warp queues and the lazy re-evaluation queue). The methods of this class and its array-based nature are well suited to the ideal FIFO behaviour of these queues. However the other operations such as sorting and searching are not specifically supported by this class and have been implemented within the BasicObject class. It would nevertheless be possible to employ a specialised queue management class that provided the operations discussed above with highly optimised implementations. Examination of the manipulation patterns of the run-time structures reveals many opportunities to specialise general queue handling techniques for a Time Warp process. As a simple example, rollbacks are generally short in terms of both the length of virtual time and the number of checkpoints that must be removed from the state queue.
Hence the searching performed during rollback for an appropriate state object can be optimised to search from the end of the queue. On a more subtle level it would be possible for a specialised class to predict rollback lengths and cache a reference into the state queue, allowing faster searching. The efficiency of such refinements within the Java environment remains to be investigated. Such an investigation would require the careful characterisation of the operations required by a Time Warp process. Since manipulation of a run-time structure is generally an "interesting" occurrence, the FATWa module interfaces allow this characterisation to be carried out using a module. A minor issue with regard to implementing Time Warp processes in the Java environment is that of random number generators. A random number generator used by processes in a Time Warp system must be capable of being rolled back with the process. Thus the generator must be incorporated into the state of the process and be private to the process. The default random number generator provided in the Java environment is not sufficient for this purpose. Although its seed may be reset, it is impossible to obtain the current seed and hence checkpoint the generator's state. FATWa includes a wrapper class that solves this problem, at the expense of requiring the construction of a new java.util.Random instance each time it is used. Clearly a specialised Time Warp random number generator which explicitly allows its state to be checkpointed and restored would be more desirable. Finally, an important aspect of implementing the virtual time model which WARPED addresses and FATWa does not is the data type used for virtual time. Jefferson's original presentation assumed virtual times were real numbers. However operationally a floating-point data type or even an integer data type will suffice. There are two requirements for a virtual time system. Firstly, it must be able to assign a total ordering to all virtual times it encounters. Secondly, the special time +∞ must exist. Furthermore, from an implementation standpoint the system must be able to perform assignment and addition/subtraction operations on virtual times. The current FATWa implementation simply employs signed integers for its virtual time data type. While this suffices for a prototype system, WARPED allows users to provide their own virtual time data type. This is necessary because some classes of simulation application require a time type larger than 32 bits. As a further example, in VHDL simulations time is a structured quantity with separate simulation time and cycle components.
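A generator whose entire state is an explicit seed would avoid the wrapper's overhead. The following sketch uses the 48-bit linear congruential recurrence that java.util.Random is built on; the class itself and its interface are invented for illustration.

    // A sketch of a checkpointable random number generator: the seed is the
    // complete state, so cloning it with the process state checkpoints it.
    import java.io.Serializable;

    class CheckpointableRandom implements Serializable, Cloneable {
        private long seed;

        CheckpointableRandom( long seed ) { this.seed = seed; }

        // One step of the 48-bit linear congruential recurrence.
        int nextInt( int bound ) {
            seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
            int bits = (int) (seed >>> 17);        // high-order 31 bits, always non-negative
            return bits % bound;                   // simple (slightly biased) reduction
        }

        public Object clone() {
            return new CheckpointableRandom( seed );   // the seed alone reproduces the generator
        }
    }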
5.2 Virtual Machine Groups
On a practical level the VMGroup class' role in implementing a partitioned process space is primarily to provide efficient communication between simulation processes, regardless of their locations. Section 4.3 discussed how the collection of VMGroup objects that forms a simulation employs a CommsAgent class. It is the responsibility of this class to physically perform interpartition communication. Various techniques are feasible for passing messages and simulation processes between the JVMs that host partitions of a simulation, hence it is desirable to test the variety of available schemes for their suitability. The simple interface through which the CommsAgent and VMGroup interact gives the CommsAgent replaceability on the same level as a module. The diagram in Figure 6 on page 47 depicts this. The easiest scheme for achieving cross-partition communication is to employ the Java Remote Method Invocation (RMI) mechanism. The current implementation of the FATWa design uses RMI to achieve a simple, albeit inefficient, CommsAgent class. The RMI mechanism provides a service that ensures the correct transport of messages and processes between JVMs, provided all the classes used in a simulation are fully serializable. However there are alternative mechanisms that could offer significantly more efficient operation. Aside from more efficient pure Java alternatives to the RMI mechanism, it may be desirable to overtly employ native code to achieve greater efficiency in appropriate situations. The CommsAgent class segregates such details from the managerially-oriented VMGroup class. For example only the CommsAgent class needs to be a "remote server" from the standpoint of the RMI mechanism; hence it is not necessary for the VMGroup class to extend an RMI server class from the java.rmi package, or otherwise provide RMI server behaviour. Aside from the issue of communication latency, the other vital issue in the performance of VMGroup objects is the routing scheme they employ. The current FATWa implementation employs a highly naive routing system that does not guarantee delivery if processes migrate between JVMs at any time other than at the startup of the simulation. Successfully accommodating the migration of processes within the universal communications subsystem provided by VMGroup objects is necessary if FATWa is to support dynamic load balancing. While this issue has not yet been explored in the current FATWa implementation, it is expected that existing protocols could easily be adapted.
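With RMI the remote face of the CommsAgent reduces to a small interface; the interface name below is invented, while sendThis and transportObject are the operations described in Section 4.3.

    // A sketch of how an RMI-based CommsAgent might be declared.
    import java.rmi.Remote;
    import java.rmi.RemoteException;

    interface RemoteCommsAgent extends Remote {
        // Deliver a message to the VMGroup named in the second parameter.
        void sendThis( BasicMesg msg, String destinationGroup ) throws RemoteException;
        // Migrate a simulation process to the named VMGroup.
        void transportObject( BasicObject proc, String destinationGroup ) throws RemoteException;
    }
    // The implementing class, not the VMGroup, would extend
    // java.rmi.server.UnicastRemoteObject and so carry all RMI server behaviour.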
5.3 The Overseer Process
The current FATWa implementation does not provide a significant degree of feedback from simulations to driver classes. While a driver can introduce new processes and messages to a simulation, it cannot receive feedback. As a result the primary issues encountered by the current OverSeer implementation concern the manner in which a simulation is initiated. There is the issue of how the initial processes of a simulation are spread from the driver's local partition, where they are constructed, to the partitions where they will initially reside. This can easily be dealt with since the facilities for such migration are already present in the VMGroup class. Similarly the facilities for delivering initial events to a simulation process are also present. Hence this aspect of initiation can be dealt with by simply handing all initial processes and events to the VMGroup object local to the OverSeer. However there remains the issue of how an OverSeer can bootstrap a simulation from its JVM onto all the other participating JVMs. In Section 4.4 the InitiatorHook class was introduced as the means by which this is achieved. The goal of the bootstrapping process is to construct VMGroup objects on each JVM. In particular the VMGroups must all be aware of each other, and be able to establish the communications subsystem depicted in Figure 7. Since this must be achieved before initial processes and events can be distributed, the two-phase initiation procedure described in Section 4.4 is necessary. The InitiatorHook class provides RMI server behaviour that allows the OverSeer class to send PartitionSpec objects to an instance on each of the JVMs participating in the simulation. In response the InitiatorHook objects construct appropriately initialised VMGroup objects. To ensure that the fully connected communications topology seen in Figure 7 is correctly established, this first phase of simulation initiation is itself a biphase procedure in which all VMGroups are constructed before their CommsAgent objects establish contact with each other in a synchronous fashion. An effect of adopting this strategy is a noticeable time overhead in the construction of the process space.
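The biphase nature of this first phase suggests a remote interface along the following lines; the interface and method names are assumptions, not the documented InitiatorHook API.

    // A sketch of the bootstrap hook as an RMI remote interface.
    import java.rmi.Remote;
    import java.rmi.RemoteException;

    interface RemoteInitiatorHook extends Remote {
        // Step one: receive the partition specification and construct the local VMGroup.
        void constructGroup( PartitionSpec spec ) throws RemoteException;
        // Step two: once every VMGroup exists, have the local CommsAgent open
        // connections to its peers, completing the fully connected topology.
        void connectGroup() throws RemoteException;
    }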
5.4 The FATWa Module Interfaces
The utility of the approach taken in defining module interfaces to support "plug-in" algorithms has to some extent been validated with the current FATWa implementation. The calculation of GVT updates was discussed in Section 2.4 as a vital aspect of the operation of a Time Warp system. Many alternative algorithms have been proposed, so FATWa should support comparison of the latency and overhead of these algorithms. To this end none of the FATWa classes concern themselves with the calculation. Rather, the OverSeer object expects to receive periodic updates from an attached GVT module, which are broadcast to simulation processes. The GVT algorithm due to Mattern has been implemented in a simple, concise, and efficient manner using the module interfaces. The algorithm is presented in [27] in the form "When a process receives a message it performs: ... pseudocode ... When a process sends a message it performs: ... pseudocode ... Et cetera." These operational triggers are precisely the interesting occurrences defined by the BasicObjectModule interface. Thus the behaviour required of a process to implement Mattern's algorithm was provided by a module that took a concise form strongly resembling that in which the algorithm is presented. The algorithm also requires an initiator process which initiates a calculation and accumulates the results. Rather than implement this behaviour in a module attached to all processes, it was implemented as an OverSeer module, since this is the appropriate place for code of a "single instance per simulation" nature. Furthermore the algorithm was presented employing an inefficient message-passing scheme, with more efficient schemes left to the reader. To increase the efficiency and parallelism of the algorithm a VMGroup module is defined, segregating this aspect of the algorithm from the simulation process and initiator behaviour. This is an example of exploiting the clustering, partition level of scope discussed in Section 3.4. By implementing Mattern's GVT algorithm as modules it can operate transparently to the underlying simulation kernel. The algorithm requires all Time Warp messages (i.e. events) exchanged by processes to be tagged with a small value. This is achieved by defining a MatternBasicEvent class that extends BasicEvent and provides the necessary attribute. This class can then be used as the base event class in any simulation employing the algorithm without affecting the rest of the FATWa system.
Unfortunately this is an example of an algorithm implemented through the module interfaces being visible to users of the system. Although transparent to the operation of the FATWa kernel, this is not always the case with users. This is undesirable, and there are alternatives. It would be possible to employ a "wrapper class" that inherited from, and presented a stable interface to, underlying classes such as MatternBasicEvent. Through recompilation a hierarchy of algorithm support classes could be layered between the underlying simulation kernel and the user. However this approach would require significant redesign of FATWa, and remains to be pursued. The module interfaces have also been employed in the current FATWa implementation to provide some degree of instrumentation and process characterisation. This is a primary motivation behind FATWa, and has been raised in the foregoing discussion as an important research issue. In general the statistics and data logging required to satisfy this goal revolve around the interesting occurrences of the module interfaces. For example the frequency and length of rollbacks could easily be determined by a module. Furthermore complex statistics that require regular sampling of a value could be obtained without alteration to the BasicObject class. The communication patterns of a process could be monitored and logged by a module. Also the communication patterns between partitions, including which particular processes are communicating across partition boundaries, could be determined by VMGroup modules. However these possibilities have not yet been sufficiently explored. The full range of interesting occurrences that should be present in a module interface to support the most comprehensive range of algorithms and instrumentation remains an untreated issue.
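As a concrete illustration of this instrumentation style, a rollback-statistics module might be written as follows; the hook and accessor names follow the hypothetical BasicObjectModule sketch in Section 4.5 rather than the actual interface.

    // A sketch of an instrumentation module recording rollback frequency and length.
    class RollbackStatsModule implements BasicObjectModule {
        private int rollbackCount = 0;
        private long totalRolledBack = 0;     // accumulated rollback length in virtual time
        private int lastEventTime = 0;

        public String getModuleName() { return "RollbackStats"; }
        public long getSpontaneityInterval() { return 0; }       // never needs spontaneous stimulation
        public void receiveModuleMesg( ModuleMesg m ) { }        // uninteresting: null body
        public void onEventSent( BasicEvent ev ) { }             // uninteresting: null body

        public void onEventReceived( BasicEvent ev ) {
            lastEventTime = ev.getReceiveTime();                 // assumed accessor
        }

        public void onRollback( int toVirtualTime ) {
            rollbackCount++;
            totalRolledBack += lastEventTime - toVirtualTime;
        }
    }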
6 Experimental Design and Results

Only a limited degree of instrumentation has been implemented using the FATWa module interfaces. However the statistical modules that have been implemented have provided some interesting results, which are presented below. A small, simple network simulation consisting of numerous terminals communicating through a handful of routing nodes was used as the test workload. The terminals exchange request and response packets, with the nodes passively forwarding packets to each other and to terminals. Performance was evaluated using the total time of execution and the peak memory consumption of the JVMs, as reported by monitor modules attached to VMGroup objects. Experiments involved adding a simple scheduling module to processes. This module used a very simple "throttling" approach to scheduling: it enforced a minimum inter-event time (i.e. an absolute speed limit on the rate at which events were processed), and a fixed rollback penalty as discussed in Section 3.2. Other statistics, for example rollback lengths (in terms of events rolled back), were also available from individual processes; however these were not pursued.
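A throttling module of the kind described might look like the sketch below; it reuses the hypothetical BasicObjectModule hooks from Section 4.5, so it indicates the shape of the scheduler rather than the actual experiment code.

    // A sketch of a throttling scheduling module: a minimum inter-event time plus
    // a fixed wall-clock penalty after every rollback.
    class ThrottlingModule implements BasicObjectModule {
        private final long minInterEventMillis;
        private final long rollbackPenaltyMillis;
        private long earliestNextEvent = 0;

        ThrottlingModule( long minInterEventMillis, long rollbackPenaltyMillis ) {
            this.minInterEventMillis = minInterEventMillis;
            this.rollbackPenaltyMillis = rollbackPenaltyMillis;
        }

        public String getModuleName() { return "Throttle"; }
        public long getSpontaneityInterval() { return 0; }
        public void receiveModuleMesg( ModuleMesg m ) { }
        public void onEventSent( BasicEvent ev ) { }

        public void onEventReceived( BasicEvent ev ) {
            pauseUntil( earliestNextEvent );      // enforce the absolute speed limit
            earliestNextEvent = System.currentTimeMillis() + minInterEventMillis;
        }

        public void onRollback( int toVirtualTime ) {
            earliestNextEvent = System.currentTimeMillis() + rollbackPenaltyMillis;
        }

        private void pauseUntil( long wallClockMillis ) {
            long wait = wallClockMillis - System.currentTimeMillis();
            if ( wait > 0 ) {
                try { Thread.sleep( wait ); } catch ( InterruptedException ignored ) { }
            }
        }
    }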
Figure 11: Exp. A: Total execution time (secs) against minimum inter-event time (msecs) when well partitioned (individual results, mean, and one standard deviation)
Figure 12: Exp. A: Total execution time (secs) against minimum inter-event time (msecs) when poorly partitioned (individual results, mean, and one standard deviation)

Two experiments were conducted. The first involved splitting the simulation between two single-processor machines on an Ethernet LAN. Runs were performed twice, once with a good partitioning and once with a bad partitioning. The simulation was constructed in such a way that it could be efficiently partitioned in two, with both the computational and communication load balanced between the two halves. A bad partitioning was obtained by perverting this good partitioning to create a moderate computational imbalance. Also some terminals were moved between partitions so that they were no longer local to their "local node" (i.e. the node that handled their traffic); hence the bad partitioning also induced a higher rate of interpartition communication. The results are shown in Figures 11 and 12. The most notable aspect of the results is the high degree of variance. The figures both have lines delineating the mean values, and those at one standard deviation from the mean. The results show that by applying a greater degree of throttling to processes this variance was brought under control. This is particularly clear from the bad partitioning results. When processes were not throttled sufficiently, performance could be very poor, although it could also be very good. Increasing the throttling of processes generally increased the execution time, but avoided the poor performance. It was assumed that
the poor performance was due to rollback thrashing. In the poor partitioning case, the effect of high interpartition communication latency discussed in Section 3.1 is probably responsible. The second experiment involved executing a single-partition simulation on two different machines: a SPARC Ultra two-way symmetric multiprocessor (SMP) and a similar four-way SPARC SMP. For this experiment both the minimum inter-event time and rollback penalty parameters were varied to generate a two-dimensional surface. Furthermore both the execution time and peak memory consumption were compared. Unfortunately it was not possible to perform a sufficient number of runs to obtain statistically significant results. Despite this the results obtained do show interesting features. Figures 13 and 14 show results for the two-way machine, while Figures 15 and 16 show corresponding results for the four-way machine. In each case a single JVM was running as a single Unix process; however the Solaris JVM implementation that was used was capable of having multiple threads scheduled simultaneously onto the physical processors. The trend vaguely evident in Figure 13 is not statistically significant, however there does appear to be a correlation between the time taken and the memory consumed. Comparing Figures 13 and 14 shows that whenever a simulation executed rapidly, its peak memory consumption was high, and conversely high execution time is associated with low memory consumption. This is confirmed by the results in Figures 15 and 16, which show that the four-way SMP machine afforded generally lower execution times and higher memory consumption for the same workload. However the memory consumption of a FATWa partition is not due only to current simulation objects. It also includes those objects that have been fossil collected by FATWa but have yet to be garbage collected by the JVM. More rapid execution would result in a higher fraction of memory consumed this way. Hence more investigation will be required to draw conclusions concerning execution time and memory consumption.
Figure 13: Exp. B: Total execution time (secs) on the 2-way SMP, against minimum inter-event time and rollback penalty
Figure 14: Exp. B: Peak memory consumption (Kbytes) on the 2-way SMP, against minimum inter-event time and rollback penalty
Figure 15: Exp. B: Total execution time (secs) on the 4-way SMP, against minimum inter-event time and rollback penalty
Figure 16: Exp. B: Peak memory consumption (Kbytes) on the 4-way SMP, against minimum inter-event time and rollback penalty
7 Conclusion

It was demonstrated in Section 1 that the PDES paradigm is a simple distributed computational model. It involves logical processes with no shared memory, interacting solely via message passing; i.e. the basic distributed computation model. The value of the model as a route to high-performance simulation was discussed. In particular its superiority to time-driven simulation models for distributed execution was elucidated. In Section 2 the Time Warp mechanism was introduced. It was shown how the virtual-time-based model of distributed computation is congruent with the PDES model. The virtual time of Time Warp is congruent with simulation time in PDES, as are the two models' notions of a process. Furthermore the Lamport-clock constraints on virtual time enforce the local causal constraint in PDES. Hence it was concluded that the Time Warp mechanism is eminently suitable for implementing a PDES system. It was asserted that the performance of the Time Warp mechanism has often been found to be poor, but that algorithms exist to improve this. Many of these algorithms were surveyed. Two particular schemes for increasing the parallelism available for exploitation by a Time Warp system were discussed. Also numerous schemes for reducing the operational overhead were discussed. Issues surrounding the practicalities of GVT determination were discussed, and a number of GVT algorithms were introduced. In Section 3 an example was presented to demonstrate issues surrounding a distributed implementation of a Time Warp system. The undesirable effects on performance that result from a physical partitioning of a Time Warp process space were discussed. Scheduling was discussed in this distributed setting, where multiple schedulers exist and each has purview over a subset of the processes in the system. It was concluded that least timestamp first scheduling (LTFS) is suboptimal in this context. Dynamic load balancing in a distributed Time Warp system was discussed, and some of the various metrics that have been proposed as the basis for migration heuristics were introduced. The specific issue of opportunistic background execution of a distributed Time Warp system, and the requirements it introduces for load balancing, received attention. Lastly a simple tree hierarchy of scope levels was proposed as a unifying model for Time Warp algorithms. While it was established that the model does unify the operational scope levels present in a wide range of Time Warp related algorithms, the implications of the model have yet to be explored.
In Section 4 a design for a Time Warp system in Java was presented. The design attempted to present a simple PDES programming environment to the user, and to hide the details of the Time Warp implementation as much as possible. The design was centred around three major classes. One implemented Time Warp processes; the user would extend this class, together with associated state and event classes, to obtain Time Warp simulation processes. A second major class implemented a distributed Time Warp process space, providing partitioning and communication for processes. This part of the system is hidden from the user. To provide abstraction of the partitioning a third class acts as a global overseer, allowing the user to interact with the system as a single entity regardless of distribution. The design also addressed the issues of distributed scheduling and load balancing raised in Section 3. Also present was a framework for achieving the high degree of algorithmic modularity given as a primary motivation for the design. Interfaces allowed modules to be constructed and attached to host objects in the system. In this way alternative algorithms could be implemented, incorporated into the system, and evaluated in a controlled fashion. In Section 5 the prototype implementation of the design of Section 4 was discussed. The issues of achieving the cloning of state objects and the comparison of simulation events were raised as problems that warrant further exploration. In particular, it was concluded that use of the reflection API in JDK 1.2 to automatically perform these operations would involve too much overhead. However this has not been given adequate investigation, and there may be schemes available that perform at an adequate level relative to manual cloning and the requirements of a Time Warp system. Importantly, both cloning and comparison require the traversal of the transitive closure of an object; the difference is the operation applied during the traversal. It is the traversal that is difficult to perform automatically, and if this could be achieved then FATWa would be able to provide an almost pure PDES API with no impact from the Time Warp mechanism. Another issue discussed in Section 5 was how FATWa could be used to investigate alternative data structures for managing the run-time structures of a Time Warp process. Also identified as a significant issue for further investigation were alternatives to RMI for communication between JVMs. In Section 6 some preliminary results obtained from the FATWa system with a simple test simulation were presented. Although few conclusions could be drawn from the results, they did validate the FATWa implementation. In summary, the primary goals of the project were met. In particular a
Java Time Warp simulation system was designed and implemented to support experimentation with Time Warp algorithms. The three-level operational hierarchy, and its associated module interfaces, allowed algorithms to be incorporated into the system in a side-effect free fashion. Also the system successfully presented a simple PDES API to users, and to the greatest extent possible shielded the user from the details of the Time Warp implementation. It was found that some aspects of the Java programming environment were detrimental to this task. However Java also simplified many of the tasks associated with implementing a distributed Time Warp system. In general Java was found to provide a simplifying design and implementation environment. Furthermore, while not a goal of the project, a model was defined for unifying the operation of Time Warp related algorithms. As a result of the project, promising directions for future work in the implementation and evaluation of algorithms, as well as in the exploration of the unifying model, have been afforded.
References

[1] Avril, H., and Tropper, C. Clustered time warp and logical simulations. In Proceedings of the 9th Workshop on Parallel and Distributed Simulation (PADS '95) (July 1995), pp. 112-119.
[2] Avril, H., and Tropper, C. The dynamic load balancing of clustered time warp for logical simulations. In Proceedings of the 10th Workshop on Parallel and Distributed Simulation (PADS '96) (July 1996), pp. 20-27.
[3] Baldwin, R., Chung, M. J., and Chung, Y. Overlapping window algorithm for computing GVT in Time Warp. In Proceedings of the 11th International Conference on Distributed Computing Systems (1991), pp. 534-541.
[4] Bellenot, S. Global virtual time algorithms. In Proceedings of the SCS Multiconference on Distributed Simulation (Jan. 1990), Society for Computer Simulation, pp. 122-127.
[5] Burdorf, C., and Marti, J. Non-preemptive Time Warp scheduling algorithms. Operating Systems Review 24, 2 (Apr. 1990), 7-18.
[6] Burdorf, C., and Marti, J. Load balancing strategies for Time Warp on multi-user workstations. The Computer Journal 36, 2 (1993), 168-176.
[7] Carothers, C. D., and Fujimoto, R. M. Background execution of Time Warp programs. In Proceedings of the 10th Workshop on Parallel and Distributed Simulation (PADS '96) (July 1996), pp. 12-19.
[8] Chandy, K. M., and Lamport, L. Distributed snapshots: Determining global states of distributed systems. ACM Transactions on Computer Systems 3, 1 (Feb. 1985), 63-75.
[9] Chandy, K. M., and Misra, J. Asynchronous distributed simulation via a sequence of parallel computations. Communications of the ACM 24, 11 (Apr. 1981), 198-206.
[10] Das, S. R., and Fujimoto, R. M. A performance study of the cancelback protocol for Time Warp. In Proceedings of the 7th Workshop on Parallel and Distributed Simulation (PADS '93) (July 1993), pp. 135-142.
[11] D'Souza, L. M., Fan, X., and Wilsey, P. A. pGVT: An algorithm for accurate GVT estimation. In Proceedings of the 8th Workshop on Parallel and Distributed Simulation (PADS '94) (July 1994), pp. 102-109.
[12] Fleischmann, J., and Wilsey, P. A. Comparative analysis of periodic state saving techniques in Time Warp simulators. In Proceedings of the 9th Workshop on Parallel and Distributed Simulation (PADS '95) (June 1995), pp. 50-58.
[13] Fujimoto, R. M. Time Warp on a shared memory multiprocessor. Transactions of the Society for Computer Simulation (July 1989), 211-239.
[14] Fujimoto, R. M. Parallel discrete event simulation. Communications of the ACM 33, 10 (Oct. 1990), 30-53.
[15] Fujimoto, R. M., Tsai, J., and Gopalakrishnan, G. C. Design and evaluation of the Rollback Chip: Special purpose hardware for Time Warp. IEEE Transactions on Computers 41, 1 (Jan. 1992), 68-82.
[16] Gafni, A. Rollback mechanisms for optimistic distributed simulation systems. In Proceedings of the SCS Multiconference on Distributed Simulation (July 1988), vol. 19(3), pp. 61-67.
[17] Jefferson, D. Virtual time. ACM Transactions on Programming Languages and Systems 7, 3 (July 1985), 405-425.
[18] Jefferson, D. Virtual time II: The Cancelback protocol for storage management in distributed simulation. In Proceedings of the 9th Annual ACM Symposium on Principles of Distributed Computing (Aug. 1990), pp. 75-90.
[19] Jefferson, D., Beckman, B., Wieland, F., Blume, L., Loreto, M. D., Hontalas, P., LaRouche, P., Sturdevant, K., Tupman, J., Warren, V., Wedel, J., Younger, H., and Bellenot, S. Distributed simulation and the Time Warp operating system. In Proceedings of the 11th Annual ACM Symposium on Operating System Principles (Nov. 1987), pp. 77-93.
[20] Lai, T. H., and Yang, T. H. On distributed snapshots. Information Processing Letters 25 (May 1987), 153-158.
[21] Lamport, L. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM 21, 7 (July 1978), 558-565.
[22] Lin, Y.-B., and Lazowska, E. D. Determining the global virtual time in a distributed simulation. In Proceedings of the International Conference on Parallel Processing (1990), pp. 201-209.
[23] Lin, Y.-B., and Lazowska, E. D. A study of Time Warp rollback mechanisms. ACM Transactions on Modeling and Computer Simulation 1, 1 (Jan. 1991), 51-72.
[24] Lin, Y.-B., and Preiss, B. R. Optimal memory management for Time Warp parallel simulation. ACM Transactions on Modeling and Computer Simulation 1, 4 (Oct. 1991), 283-307.
[25] Lin, Y.-B., Preiss, B. R., Loucks, W. M., and Lazowska, E. D. Selecting the checkpoint interval in Time Warp simulation. In Proceedings of the 7th Workshop on Parallel and Distributed Simulation (PADS '93) (July 1993), pp. 3-10.
[26] Lubachevsky, B. D. Relaxation for massively parallel discrete event simulation. In Performance Evaluation of Computer and Communication Systems, vol. 729 of Lecture Notes in Computer Science. Springer Verlag, 1993, pp. 307-329.
[27] Mattern, F. Efficient algorithms for distributed snapshots and global virtual time approximation. Journal of Parallel and Distributed Computing 18, 4 (Aug. 1993), 423-434.
[28] Palaniswamy, A., and Wilsey, P. A. An analytical comparison of periodic checkpointing and incremental state saving. In Proceedings of the 7th Workshop on Parallel and Distributed Simulation (PADS '93) (July 1993), pp. 127-134.
[29] Palaniswamy, A., and Wilsey, P. A. Parameterized Time Warp (PTW): An integrated adaptive solution to optimistic PDES. Journal of Parallel and Distributed Computing 37, 2 (Sept. 1996), 134-145.
[30] Prakash, A. Filter: An algorithm for reducing cascaded rollbacks in optimistic distributed simulation. In Proceedings of the 24th Annual Simulation Symposium (1991), pp. 123-132.
[31] Preiss, B. R., and Loucks, W. M. Memory management techniques for Time Warp on a distributed memory machine. In Proceedings of the 9th Workshop on Parallel and Distributed Simulation (PADS '95) (June 1995), pp. 30-39.
[32] Radhakrishnan, R., Martin, D. E., Chetlur, M., Rao, D. M., and Wilsey, P. A. An object-oriented Time Warp kernel. In Proceedings of the International Symposium on Computing in Object-Oriented Parallel Environments (ISCOPE '98) (Dec. 1998), pp. 13-23.
[33] Reiher, P. L., and Jefferson, D. Virtual time based dynamic load management in the Time Warp operating system. In Proceedings of the 4th Workshop on Parallel and Distributed Simulation (PADS '90) (1990), pp. 103-111.
[34] Ronngren, R., and Ayani, R. Adaptive checkpointing in Time Warp. In Proceedings of the 8th Workshop on Parallel and Distributed Simulation (PADS '94) (July 1994), pp. 110-117.
[35] Schlagenhaft, R., Ruhwandl, M., Sporrer, C., and Bauer, H. Dynamic load balancing of a multi-cluster simulation on a network of workstations. In Proceedings of the 9th Workshop on Parallel and Distributed Simulation (PADS '95) (1995), pp. 175-180.
[36] Steinman, J. S. Breathing Time Warp. In Proceedings of the SCS Western Simulation Multi-conference (1991), vol. 23, pp. 109-118.
[37] Steinman, J. S. Breathing Time Warp. In Proceedings of the 7th Workshop on Parallel and Distributed Simulation (PADS '93) (July 1993), pp. 109-118.
[38] Tomlinson, A. I., and Garg, V. K. An algorithm for minimally latent global virtual time. In Proceedings of the 7th Workshop on Parallel and Distributed Simulation (PADS '93) (July 1993), pp. 35-42.
[39] Wilson, L. F., and Nicol, D. M. Experiments in automated load balancing. In Proceedings of the 10th Workshop on Parallel and Distributed Simulation (PADS '96) (July 1996), pp. 4-11.