Distributed Simulation with a Transputer Version of the Time Warp Operating System

Stephen Turner, Monique Damitio and Stephen Trivett
Department of Computer Science, University of Exeter, Exeter EX4 4PT, England

© Crown Copyright 1994 Defence Research Agency, Farnborough, Hampshire, GU14 6TD, U.K.

Abstract. Computer-based discrete event simulation is an important design and analysis tool in many different application areas. Traditionally, discrete event simulation has been performed in a sequential manner, but the size and complexity of many of today's simulation models demand a move towards parallel execution. Distributed simulation explores the potential parallelism inherent in many simulation applications by modelling events as time-stamped messages which are exchanged between the logical processes that represent the physical objects of the application. This paper describes the porting of the Time Warp Operating System (TWOS) onto a network of transputers. TWOS is a special purpose operating system, designed to support distributed simulation, originally developed at the Jet Propulsion Laboratory. The paper discusses the way in which particular features of the transputer make it suitable for distributed simulation and presents experimental results which demonstrate the speed-ups obtainable with TWOS on a network of T805 transputers. Current work in porting TWOS onto the T9000 series of transputers is also discussed.
1. Introduction
Simulation is a method of studying the behaviour of large systems by conducting experiments on a computer-based model rather than on the system itself. It is an important tool in the design of new systems such as industrial plants, road traffic networks or computer architectures, as it provides the designer with a way of modelling and analyzing the behaviour of such systems before their physical construction takes place. Simulation also has many applications in the area of defence: battlefield simulations and the simulation of logistics are two important examples.

In order to model such a physical system, it is necessary to identify a set of variables which describe the system state. A discrete event simulation is one in which events (changes in state) occur only at specified points in time. This is in contrast to a continuous simulation, in which the state variables evolve continuously. The most important principle that is encountered in discrete event simulation is the causality principle: "The future cannot affect the past". On a sequential machine, causality is preserved by having a single set of future events, and selecting the event with the lowest simulation time for processing at each stage.

However, simulation models usually include many structural details of the physical system and often require a great deal of computing time to obtain the required results. Since many of the physical systems we wish to model are inherently parallel, this clearly
suggests a move from sequential to parallel execution. Distributed simulation explores this potential parallelism by partitioning the state into a set of logical processes (LPs) that represent the physical objects of the application. These LPs are distributed over the processors of a parallel computer and events are modelled as time-stamped messages which are exchanged between the LPs. As there is no longer a single set of future events, preserving causality is the main problem in distributed simulations, and this is converted to an event synchronization problem. There are two basic approaches to solving this in the design of distributed simulations: the conservative approach [5] and the optimistic approach [13].

The conservative approach strictly avoids the possibility of any causality error ever occurring. This means that each LP must deal with event messages in order of nondecreasing time-stamp. The basic problem conservative mechanisms must solve is determining when it is safe to process an event. LPs containing no safe events must block: this can lead to deadlock if appropriate precautions are not taken. In the Chandy-Misra-Bryant (CMB) algorithm [5], null messages are used to avoid deadlock. The principal drawbacks of this method are the overhead of the null messages, the fixed connectivity required between LPs and the restriction on the amount of parallelism that may be exploited.

In contrast to conservative mechanisms, optimistic methods detect and recover from causality errors [9], rather than strictly avoiding them. The Time Warp mechanism, based on the Virtual Time paradigm [13], is the most well-known optimistic protocol. This mechanism allows an LP to simulate as far forward in time as possible, with no regard for the risk of having its simulated past affected. If its past is changed (due to interaction with an LP further behind in simulation time), it must be able to roll back in time, and then be allowed to continue along new execution paths. An event that causes rollback may require the mechanism to perform two actions: restore the state of the LP and cancel all intermediate side effects by "unsending" previously sent messages. Rolling back the state is accomplished by periodically saving the LP's state, and restoring an old state vector on rollback. "Unsending" a previously sent message is accomplished by sending an anti-message that annihilates the original when it reaches its destination. This anti-message may force another LP on another processor to roll back. In order to be able to send anti-messages, each LP must keep a negative copy of the messages it has already sent. Each LP also has an input queue containing the messages it has received. These messages may already have been processed, but they are not deleted from the queue because it may be necessary to roll back and reprocess them.

In simulations using the Time Warp mechanism, virtual time is synonymous with simulation time, and the smallest time-stamp among all unprocessed event messages is called Global Virtual Time [2] or GVT (see figure 1). No event with a time-stamp smaller than GVT will ever be rolled back, so storage used by such events (e.g., saved states) can be discarded. Such an event is said to be a committed event. GVT is also used for normal termination detection.
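To make the message/anti-message relationship concrete, the sketch below shows one conventional way of representing Time Warp messages in C. It is a minimal illustration of the general technique, not TWOS's actual data structures: the type and field names are invented for the example.

```c
/* Illustrative sketch only: a Time Warp message and the annihilation
 * test, with all names invented for the example (these are not
 * TWOS's own types).  A message and its anti-message are identical
 * except for the sign field; when a matching pair meets in a queue,
 * both are deleted.                                                  */
#include <string.h>

struct tw_msg {
    int    sender, receiver;   /* object identifiers               */
    double send_time;          /* virtual time at the sender       */
    double recv_time;          /* time-stamp at the receiver       */
    int    sign;               /* +1 = message, -1 = anti-message  */
    char   data[64];           /* application event data           */
};

/* True if b would annihilate a: same message content, opposite sign. */
int annihilates(const struct tw_msg *a, const struct tw_msg *b)
{
    return a->sign == -b->sign &&
           a->sender == b->sender &&
           a->receiver == b->receiver &&
           a->send_time == b->send_time &&
           a->recv_time == b->recv_time &&
           memcmp(a->data, b->data, sizeof a->data) == 0;
}
```

On receipt of a message, the input queue would first be searched for a matching message of opposite sign; if one is found, the pair is annihilated, otherwise the message is enqueued (possibly triggering a rollback if its time-stamp lies in the receiver's simulated past).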
The Time Warp Operating System (TWOS) [15] is a complete implementation of the Time Warp mechanism, designed to support discrete event simulation on parallel architectures. Simulation designers can define many different types of object (logical process). For each object, there is a module of application code that will be executed at run-time. The management and execution of these objects is handled completely by the Time Warp Operating System. This system was developed at the Jet Propulsion Laboratory (JPL), originally for the Caltech Hypercube.
[Figure 1: Virtual Time paradigm. Three objects advance along the virtual time axis; for each object, events before its Local Virtual Time (LVT) have already been processed, messages beyond it have not, and GVT marks the smallest time-stamp among all unprocessed messages.]
It has been ported onto other parallel machines such as the BBN Butterfly GP-1000. There is also a version which executes on a network of Sun workstations [3].

The Time Warp Sequential Simulator (TWSIM) is a sequential program that provides the user with a way of sequentially running applications intended for the Time Warp Operating System. TWSIM has exactly the same user interface as TWOS and produces the same simulation results. It is used to debug simulations, to generate certain statistics useful in running TWOS, and to provide a basis for speed-up comparisons. The event set mechanism in TWSIM is based on the splay tree, one of the most efficient methods of organising the set of events [14] in a sequential simulator.

This paper describes the transputer [11] implementation of TWSIM and TWOS. The research objectives of the project are discussed in the next section. After outlining some of the design considerations in section 3, important aspects of the transputer implementation are described in section 4. Results of timing experiments carried out using a network of T805 transputers are then given in section 5. Section 6 discusses the porting of TWOS onto the T9000 series of transputers and outlines future developments. The final section presents the conclusions of this research.
2. Purpose of Research
The Time Warp Operating System (TWOS) is recognised by many researchers in the area of distributed simulation as being one of the most efficient implementations of the optimistic approach. The specific objectives in implementing a version of TWOS on transputers were as follows:

1. To evaluate the suitability of transputers as a platform for optimistic simulation based on the Time Warp mechanism. This would include not only the current generation of transputers (T805), but also the newly available T9000 series [17].

2. To provide a benchmark against which other approaches to discrete event simulation could be evaluated. This would include both optimistic and conservative methods, and variations of these such as the Breathing Time Buckets (BTB) algorithm [22].
3. To provide a parallel discrete event simulation system which could be used not only as an educational tool but also as a production quality simulator for a variety of application areas.

This paper reports on the first of these objectives: the suitability of transputer networks as a platform for the Time Warp Operating System. The use of TWOS as a benchmark in evaluating the performance of other simulators is described in [8], which gives a comparison of the performance of TWOS and a simulator [4] developed by the Defence Research Agency (Malvern) based on the Breathing Time Buckets algorithm. TWOS is also being used in a number of simulation projects: examples include the simulation of parallel computer architectures, Petri nets and cellular communications.
3. Design Considerations
TWSIM and TWOS are designed to run on a variety of different architectures. Each underlying hardware platform presents the implementor with certain advantages and disadvantages. Although the current generation of transputers (T805) is slower in terms of raw processing power than the processors in more recent workstations, this disadvantage is offset by the fact that support for concurrency is provided on the transputer by hardware [18] rather than by software. Moreover, in a network of transputers, message passing is via high speed links rather than a local area network such as Ethernet.

A decision was taken to base the transputer implementation of TWOS on the Sun-4 version, rather than some other version such as that for the BBN Butterfly, for two reasons. First, both the Sun and transputer networks have distributed memory and would therefore be similar in their need to acknowledge messages to enable the correct calculation of Global Virtual Time [2]. Secondly, the Sun version was available, complete and known to work. This meant that tests could be carried out on both the Sun and transputer versions, to provide benchmarking and confidence checks.

TWOS allows any object to send messages to any other object with no constraints on their location. On a network of Suns, this is provided by means of the socket mechanism, and the objects running on each processor have direct access to the file server through the Unix operating system. By contrast, a transputer can communicate directly with only its four neighbouring transputers, and only the root transputer has direct access to the host computer's file system. At the time this work was carried out, it was felt that Tiny [6, 7] offered the best solution to this problem, as it provided a minimal, fast, deadlock-free message router.

As there will be many objects (logical processes) on each processor, a context switching mechanism is required. In TWOS, the scheduling policy adopted is that each processor always runs the object with the lowest virtual time first. This policy prevents any one object from racing too far ahead of the other objects on the same processor. On other machines, pre-emptive scheduling has sometimes been implemented. This means that when a message arrives from an object on another processor, it can interrupt the currently executing object, and the receiving object may be executed instead if the simulation time of the message is less than that of the interrupted object. Currently, pre-emptive scheduling is not implemented in the transputer version.
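As an illustration of this policy, the sketch below selects the next object to run by scanning a node's scheduler queue for the lowest local virtual time. It is a minimal sketch with invented types; TWOS's own scheduler queue is more elaborate.

```c
/* Illustrative sketch of lowest-virtual-time-first scheduling; the
 * object type and its linked-list queue are invented for the example
 * and are not TWOS's own structures.                                 */
struct object {
    double         lvt;    /* local virtual time of the object's next event */
    struct object *next;   /* link in this node's scheduler queue           */
};

/* Return the runnable object furthest behind in virtual time,
 * or NULL if the queue is empty.                                     */
struct object *select_next(struct object *queue)
{
    struct object *best = queue, *p;
    for (p = queue; p != NULL; p = p->next)
        if (p->lvt < best->lvt)
            best = p;
    return best;
}
```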
4. The Transputer Implementation of TWOS
The transputer versions of TWSIM and TWOS were implemented under the ANSI C standard, using the Inmos D4214B ANSI C Toolset [12].
[Figure 2: Mapping between hardware and software for n+1 transputers. The host connects to the root transputer; each transputer i runs a Tiny router and Time Warp node i.]
4.1. Message passing

The Tiny [7] message passing system was used to provide full connectivity between tasks, with messages addressed by task name rather than by channel. A Time Warp simulation is configured as a number of Time Warp nodes with one Time Warp node per transputer (see figure 2). Tiny is only used when messages are exchanged between Time Warp nodes on different transputers: thus, each Time Warp node is regarded by Tiny as a unique task. On the root transputer (transputer 0), this task also includes a "host" process which communicates with the host machine. This is responsible for handling log file data and error messages from the Time Warp nodes.

TWOS uses asynchronous messages to allow the sender to continue execution as soon as the message data has been copied. The blocking message functions are used to initiate each send or receive operation, as these wait for message termination before returning to the caller. To prevent deadlock and to optimize performance, a message handler was implemented. Its purpose is to ensure that messages are delivered as quickly as possible. The message handler is composed of four processes running in parallel at high priority (see figure 3). The processes TinyRecv and TinySend are responsible respectively for receiving messages from and sending messages to the network. BuffRecv and BuffSend are responsible for storing message addresses into a buffer as soon as a message is received from another node or has to be sent off-node. This implementation allows the Time Warp node to deal with a received message only when it is ready to do so, and not to block on the sending of a message. Note that no ordering of messages is required within the message handler, as this is performed within the Time Warp node both for receiving and sending.
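The decoupling provided by BuffRecv and BuffSend amounts to a bounded FIFO of message addresses between the network processes and the Time Warp node. The sketch below shows that buffer in isolation; on the transputer the four handler processes run in parallel at high priority, which is not reproduced here, and all names are invented for the example.

```c
/* Illustrative sketch only: the bounded buffer of message addresses
 * at the heart of the message handler.  On the transputer, BuffRecv
 * and the Time Warp node run as separate processes; that concurrency
 * is omitted here and all names are invented for the example.        */
#include <stddef.h>

#define BUFF_SLOTS 64

struct msg_buffer {
    void  *slot[BUFF_SLOTS];   /* message addresses, not copies */
    size_t head, tail, count;
};

/* BuffRecv side: store the address of a newly arrived message.
 * Returns 0 on success, -1 if the buffer is currently full.          */
int buff_put(struct msg_buffer *b, void *msg)
{
    if (b->count == BUFF_SLOTS)
        return -1;
    b->slot[b->tail] = msg;
    b->tail = (b->tail + 1) % BUFF_SLOTS;
    b->count++;
    return 0;
}

/* Time Warp node side: take the oldest pending message address,
 * or NULL if nothing is waiting, so the node never blocks on input.  */
void *buff_get(struct msg_buffer *b)
{
    void *msg;
    if (b->count == 0)
        return NULL;
    msg = b->slot[b->head];
    b->head = (b->head + 1) % BUFF_SLOTS;
    b->count--;
    return msg;
}
```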
4.2. Context Switching

In TWSIM, as in most sequential simulators, there is a single set of future events. This means that the next object to be executed is determined by the event in this set with the lowest time-stamp. Since no object can generate an event in the past, once an object has been selected for processing, it runs to the completion of that event.
[Figure 3: Internal organisation of the message handler. TinyRecv and TinySend exchange messages with the Tiny network, while BuffRecv and BuffSend buffer message addresses between them and the Time Warp node's get msg and send msg operations.]
The above is not true in TWOS, as a node can send a message which may be in the simulated past of an object on another node. If this occurs, the receiving node may need to roll back objects. For this reason it is necessary to check objects regularly to see which object should be executed next. Each Time Warp node in TWOS has its own local scheduler queue, containing all the objects handled by that node. The object with the lowest virtual time should be the next object to be executed on that node.

In order to prevent greedy objects from hogging the processor, a context switching mechanism exists in TWOS so that calls to its operating system services will stop the current object. Once the object has been stopped, TWOS takes over and first fulfils the service requested. It then checks the incoming messages before deciding which object should be run next. This may be the object which was previously being executed or a different object altogether, if a message has been received with a lower time-stamp.

In the Sun version of TWOS, context switching is performed by software, by saving the machine registers on the object's stack and reinstating these when the object is restarted. This could be implemented in a similar way on the transputer, but it would involve the unnecessary manipulation of memory and workspace pointers. A better method is to use the transputer's in-built support for concurrent processes.
The main problem with this scheme is that the transputer keeps its own scheduler queue and these processes would be time-sliced by its microcoded scheduler. With the Time Warp mechanism, it is necessary to keep the scheduling of objects strictly under the control of the Time Warp Operating System, as a particular scheduling algorithm is required (lowest virtual time first). However, manipulating the scheduling queue is dangerous on the T805 transputer, since there is no easy way of preventing the link hardware from modifying the queue at the same time.

A simple solution to this problem involves the use of the two transputer instructions runp and stopp [18]. Each object is started as a transputer process, but when an object calls a TWOS service, its workspace pointer is saved in its state and the TWOS kernel on that node is run by executing runp, which reinstates the TWOS workspace pointer. The object then uses stopp, which stops the current process (the object) without putting that process back on the transputer scheduler queue. After fulfilling the service request, the TWOS kernel decides which object to run according to its lowest virtual time first algorithm. Instead of trying to manipulate the transputer scheduler queue, the TWOS kernel saves a pointer to its own workspace, and runs the object process using the transputer instruction runp. The TWOS kernel then stops itself by executing the stopp instruction. The main advantage of this mechanism is that each object can be run in a way which is natural to the transputer, without any manipulation of the transputer scheduler queue.

As pre-emptive scheduling is not implemented in the transputer version of TWOS, the user is relied upon to insert "twinterrupt" calls in the application where appropriate: these act as dummy calls for an operating system service and allow a context switch to be performed if necessary.

4.3. Alarms

In TWOS, the calculation of GVT occurs periodically, initiated by Time Warp node 0 after a given interval of time. In the Sun version, an alarm utility is used which raises a signal when the time has expired. The signal handler for that particular signal then sets a flag, which is checked at a certain point in the main TWOS loop. The Inmos ANSI C Toolset does not have an alarm facility, but it does provide the user with the ability to raise a signal. Moreover, it is possible to make a process wait for a set period of time, which allows a process to be started as an alarm. The process simply raises a signal when the time has expired, which is trapped in the same way as before.
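A sketch of such an alarm is given below, using only ANSI C signal handling. The choice of signal and the timed wait are port-specific: GVT_SIGNAL and wait_interval() are hypothetical placeholders, not the actual names used in the implementation.

```c
/* Illustrative sketch of the GVT alarm, assuming only ANSI C signal
 * handling.  GVT_SIGNAL and wait_interval() are placeholders: the
 * actual signal number and timed wait are supplied by the port.      */
#include <signal.h>

#define GVT_SIGNAL SIGINT            /* whichever signal the port dedicates */

extern void wait_interval(long ms);  /* hypothetical timed wait             */

static volatile sig_atomic_t gvt_alarm = 0;

static void alarm_handler(int sig)
{
    (void)sig;
    signal(GVT_SIGNAL, alarm_handler);  /* re-arm (ANSI C may reset it) */
    gvt_alarm = 1;                      /* only set a flag here         */
}

/* Runs as a separate process: wait for the interval, raise the signal. */
void alarm_process(long interval_ms)
{
    signal(GVT_SIGNAL, alarm_handler);
    for (;;) {
        wait_interval(interval_ms);
        raise(GVT_SIGNAL);
    }
}

/* Checked at a fixed point in the main TWOS loop on node 0. */
void check_gvt_alarm(void)
{
    if (gvt_alarm) {
        gvt_alarm = 0;
        /* initiate a GVT calculation here */
    }
}
```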
5. Experimental Results
A comparison was initially made with the Sun-4 version of TWOS, although this was more for the purpose of validating the implementation than as a comparison of performance. As a measure of confidence in the transputer implementation, statistics for the number of committed events and committed messages were compared between the Sun and transputer versions, and also between the sequential and parallel simulators for each application run. Initially some event discrepancies were found, which were mainly caused by the fact that the transputer is a "little-endian" machine whereas the Sun and BBN Butterfly are "big-endian". In TWOS, there are "fast" macros for comparing simulation times, etc., which work by comparing bit patterns, and these needed to be rewritten for the transputer.

In the first set of experiments, the transputer version of TWOS was compared with the Sun-4 version running on IPC and ELC Sparcstations using the colliding Pucks simulation (see section 5.1).
[Figure 4: Comparison of Sun-4 and Transputer Versions (run time in seconds against number of processors).]
The run times in seconds are shown in figure 4 for a small simulation involving 28 objects: 8 sectors, 10 cushions and 8 pucks. The two values plotted on the vertical axis are those for the sequential simulator TWSIM. It can be seen that the Sun-4 version of TWSIM is slightly faster than the transputer version, and this can be explained by the relative speeds of the processors. On both platforms, TWSIM is faster than TWOS on a single processor, because of the overheads of the Time Warp mechanism. With the transputer version of TWOS, an improvement is obtained as more processors are used, whereas with the Sun version, the two-processor version is less efficient than that for a single processor because message passing now has to take place over Ethernet.

Three standard benchmark applications are provided with the TWOS system and these ran on the transputer version without modification:

- Pucks [10], a simulation of two-dimensional colliding disks,
- Warpnet [19], a computer network simulation,
- STB88 [24], a distributed combat simulation.

In the following subsections, results are given comparing the performance of the transputer version of TWOS with that of TWOS on the BBN Butterfly GP-1000 (using data provided by JPL). It should be noted that the figures for the BBN Butterfly are for TWOS with dynamic load management (see section 6.2), whereas this feature has not yet been implemented in the transputer version. The transputers used in these experiments are 25 MHz T805 transputers, each with 4 MByte of 4-cycle memory and a link speed of 20 Mbits/sec, organized as a rectangular grid. The GP-1000 consists of Motorola 68020 processors, each with 4 MBytes of memory.

For each of the benchmark applications, three graphs are shown, as recommended in [16]. The first gives the timing curves, which show run time plotted against the number of processors. The two points on the vertical axis are the run times for the sequential simulator TWSIM.
[Figure 5: Total run times for Pucks application (run time in seconds against number of processors, for the transputer and BBN versions).]

[Figure 6: Speed-up for Pucks application.]

[Figure 7: ERBOs for Pucks application.]
The second graph shows speed-up, which is obtained by dividing the run time for TWSIM by the run time for TWOS (speed-up calculated relative to the one-processor TWOS time would look more impressive, but would be misleading [25]). The third graph shows ERBOs (Events Rolled Back Over), which is calculated by subtracting the number of events committed from the number of events completed.

5.1. Results for Pucks application

Pucks is a discrete event simulation of two-dimensional disks moving on a flat surface (the table). The pucks collide with each other and with stationary cushions on the border of the table. It is assumed that there is total conservation of energy and no table friction, so that the pucks remain in constant motion. The design and implementation of Pucks is described in [1]. There are three object types: sectors, cushions and pucks. A sector represents a portion of the table and regulates the movement and interaction of pucks within its boundaries. A cushion object calculates and schedules collisions of pucks with the cushion that borders the table. A puck object calculates and schedules collisions with other pucks within the same sector.

Figures 5 to 7 show a comparison of the performance of the transputer version of TWOS with that of the BBN Butterfly. These experiments were carried out using a simulation of 128 sectors, 48 cushions and 128 pucks (304 objects in total). With only 4 MBytes of memory per processor on both the transputer network and the BBN Butterfly, it was not possible to run TWOS with this configuration of Pucks on either a single processor or on two processors. It was possible to obtain results for TWSIM with only 4 MBytes, because state saving is unnecessary in a sequential simulator. (Although 16 MByte TRAMs were available to us, these had 3-cycle memory and so the run times were not directly comparable.)

It can be seen that Pucks has a very regular set of performance curves for both machines. Although the run times for the transputer are consistently less than those for the Butterfly, the speed-up figures are not as good because of the smaller run time for TWSIM. Slightly more rollbacks occurred in the transputer version, which is probably due to the differences in message latency, but this overhead is overcome by the increase in processing power.

5.2. Results for Warpnet application

Warpnet is a simulation of a distributed computer network in which a number of computers send messages to each other via data transfer lines. The network being simulated generally does not have complete connectivity, so that data must be sent through a number of intermediate computers before arriving at its destination. Warpnet uses a dynamic routing scheme to avoid using data transfer lines that are extremely busy, and each computer in the network must therefore be updated on changes in the line loads. To simulate a network, Warpnet makes each computer in the topology an object. To simulate the passing of messages between computers in the network, Warpnet uses TWOS to pass event messages between the objects representing those computers. There are two types of object: warpnet nodes and warpinit objects (to initialize the network).

Figures 8 to 10 show a comparison of the performance of the transputer version of TWOS with that of the BBN Butterfly. These experiments were carried out using a simulation of 137 warpnet node objects and 32 warpinit objects (169 objects in total). The exact topology of the simulated computer network is not important apart from the fact that it has only partial connectivity [19].
[Figure 8: Total run times for Warpnet application (run time in seconds against number of processors, for the transputer and BBN versions).]

[Figure 9: Speed-up for Warpnet application.]

[Figure 10: ERBOs for Warpnet application.]
Warpnet also shows a regular set of curves, although the run times for the transputer are slightly greater than those for the BBN Butterfly, even for the sequential simulator TWSIM. This can be explained by the fact that the calculations for Warpnet do not involve floating-point arithmetic, which is faster on the T805 than on the Butterfly. As it is possible to run Warpnet on one and two processors, a full set of figures is shown for the transputer (although data for TWOS on one Butterfly processor is not available). It can be seen that the run time for TWOS on two transputers is approximately half of that on one transputer, which demonstrates the efficiency of the communication routines and the suitability of transputers for distributed simulation.

With Warpnet, there is a large variation in the time taken to process the application code for individual event messages, but the average event processing time is much higher than that required by some of the other simulation applications written for TWOS. With large grain events, the TWOS system overhead is only a small fraction of the event processing time and is therefore insignificant. As expected, simulations with large grain events will give a good speed-up, and this is shown in figure 9.

5.3. Results for STB88 application

STB88 is a deterministic ground combat simulation, consisting of two opposing armies, "blue" and "red". Each army can be either an aggressor or a defender: the goal of an aggressor is to capture territory by exploiting weaknesses in the opposing side, while that of the defender is first to maintain its current position and secondly to capture territory. The simulation may be set up so that either or both sides are initially aggressors (at least one side must be an aggressor for the simulation to progress). Each side is organized into "corps" and "divisions", with the corps being above the divisions in the military hierarchy. The divisions are the fighting elements, with the corps units deciding where and when their subordinate divisions will attack or defend. An attack is made through a sector, with each corps unit managing three sectors. A division may be placed in any of the sectors belonging to its corps unit, or held in reserve to the rear of the battle. When two enemy divisions are within a critical distance of each other, an engagement begins and the divisions perform an attrition calculation (this is based on the Lanchester model in [23]). When a division has lost enough weapons to fall below a critical value, it is destroyed and the territory may be captured by the other side.

Combat simulations are among the most complex of discrete event simulations, because of their irregularity and the computationally expensive calculations that need to be performed. There are four object types in the simulation: corps, division, grid and distributor. The battlefield is divided into cells, with a grid object for each cell which keeps track of the names, locations and velocities of all units within its cell. There is a distributor object associated with a corps unit which is responsible for performing the mapping between the divisions and the grid objects.

Figures 11 to 13 show a comparison of the performance of the transputer version of TWOS with that of the BBN Butterfly. These experiments were carried out using a simulation of 36 corps (18 blue and 18 red), 306 divisions (18 × 7 = 126 blue and 18 × 10 = 180 red), 20 grid objects and 18 distributors, making a total of 380 objects.
As with Pucks, it was not possible to run TWOS with this configuration of STB88 on a single processor or on two processors. Two different ways of assigning objects to processors are shown in the figures. The first, referred to as "standard", uses the configuration file supplied by JPL. A corps object and its divisions are assigned in a cyclic manner to the individual processors.
[Figure 11: Total run times for STB88 application (run time in seconds against number of processors, for the 'Transputer.Grouping', 'Transputer.Standard' and 'BBN' configurations).]

[Figure 12: Speed-up for STB88 application.]

[Figure 13: ERBOs for STB88 application.]
However, this has the effect that on 8 processors all the blue corps are placed on the same processor, and on 11 processors all the red ones. On 16 processors, the blue corps will be divided between just two processors. Since corps intelligence involves receiving messages from subordinate divisions about the status of ongoing battles, placing too many corps on the same processor can cause this processor to become a bottleneck. The poor results for 8, 11 and 16 processors with the standard configuration file can be seen in the performance curves. The processor on which the corps objects are placed will have longer message queues and will have more events to process. This in turn will generate rollbacks as objects advance their simulation times at different rates.

The second way of assigning objects, referred to as "grouping", assigns every division to the same processor as its corps, with the corps objects assigned to different processors in a cyclic manner. A significant improvement in performance can be seen in the graphs. Note that although the BBN Butterfly uses the standard configuration file, dynamic load management (see section 6.2) is implemented, which will cause the migration of work from any processor which is a bottleneck. However, it can be seen that even dynamic load management is not completely successful in the case of 8 processors, and that the initial allocation of objects can still be important in achieving good results.
6. Current Work and Future Developments

6.1. The T9000 Implementation

The T9000 series of transputers offers a number of improvements over the T805. In addition to being a much faster processor, global connectivity is provided by means of virtual channels [17]. On the T9000, these are multiplexed by hardware onto the physical links by the on-chip virtual channel processor. In conjunction with the C104 router, this provides virtual channel routing by hardware. Porting TWOS onto the T9000 series therefore involves replacing the current software router, Tiny, with code that makes use of the virtual channel routing mechanism. Virtual channels may be declared between the message handlers of every pair of Time Warp nodes (see figures 2 and 3). The only significant modification required is to the TinyRecv and TinySend processes in the message handler. The Time Warp node number is now used to select the appropriate virtual channel rather than as a task name for addressing messages. With Tiny, message types allow the identification of special messages to and from the host process on the root transputer. The same effect may be achieved on the T9000 by declaring additional virtual channels.

Another feature of the T9000 which will improve the efficiency of the implementation of TWOS is the provision of instructions to manipulate the transputer's own scheduling queue (it is very difficult to do this safely on the T805). In particular, an atomic operation is provided to change the front and back registers of an active list as a pair. This will allow the easy implementation of a pre-emptive lowest virtual time first scheduling algorithm. On the Hypercube, it was found that pre-emptive scheduling improved the performance of TWOS by about 10%.

6.2. Dynamic Load Management

According to Reiher et al. [20], simulations run under TWOS can only achieve their peak performance if the load on the processors of the parallel machine is balanced. Otherwise, some processors will spend part of the run not contributing to the computation, effectively wasting some processing power. There are two basic methods for balancing load with the TWOS system. Static load management attempts to make a
good initial assignment of objects to Time Warp nodes, such that most nodes have an equal share of the work for most of the simulation. Dynamic load management monitors the course of the run and shifts objects or phases (see below) from node to node to equalize load. Reiher and Jefferson [21] have shown how the dynamic load management facility produces good performance improvements.

To balance the load on TWOS dynamically, it is necessary to determine periodically the load on each node of the network. The basic metric used to calculate the relative load on the nodes in a TWOS run is effective utilization, which is, informally, the proportion of time a node spends performing work that is committed. Periodically, one node, the load master, requests that all nodes calculate their local load and return the information to the load master. It then passes the information back to all the nodes. This is called the dynamic load management cycle and is initiated by an alarm. Each node then has to determine whether it will migrate work or not, and, if so, to where. An overloaded node has to choose the local object whose migration will best balance the two nodes' utilizations. Because objects can be very large, TWOS tries to avoid migrating whole objects. This relies on temporal decomposition, which divides objects along time boundaries: each part of this division is called a phase. After selecting an object to move, TWOS must next determine where to split it. The final step consists of moving the selected phase. As most of this code is machine independent, implementation of dynamic load management on transputers should be relatively straightforward.
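As an illustration of the metric, the sketch below computes effective utilization from per-node counters and picks the most utilized node; the structure, field names and selection rule are invented for the example, and TWOS's actual bookkeeping is more involved.

```c
/* Illustrative sketch of the effective-utilization metric; the
 * structure, field names and selection rule are invented for the
 * example and do not reproduce TWOS's actual bookkeeping.            */
struct node_load {
    double committed_work;   /* time spent executing events that      */
                             /* were eventually committed             */
    double elapsed;          /* total time over the measurement cycle */
};

/* Proportion of its time a node spent performing committed work. */
double effective_utilization(const struct node_load *n)
{
    return (n->elapsed > 0.0) ? n->committed_work / n->elapsed : 0.0;
}

/* The load master might pair the most and least utilized nodes as
 * the source and destination of the next phase migration.            */
int most_utilized(const struct node_load *nodes, int count)
{
    int i, best = 0;
    for (i = 1; i < count; i++)
        if (effective_utilization(&nodes[i]) >
            effective_utilization(&nodes[best]))
            best = i;
    return best;
}
```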
7. Conclusions
This paper has reported on the implementation of a version of the Time Warp Operating System on transputers. The main problems to be solved were the implementation of low-level message passing routines and the provision of a context switching mechanism based on the lowest virtual time first scheduling policy. Experimental results have been presented for three benchmark applications. These demonstrate the performance of the system and show that good speed-ups can be obtained, particularly in the case of simulation applications with larger grain events. The current generation of transputers has been shown to be very suitable as a platform for distributed simulation.

The new T9000 transputer, as well as being a much faster processor, offers a number of additional features which should significantly increase the performance of the implementation. These include hardware virtual channel routing and improved access to and control of the transputer's own scheduler, which will allow the easy implementation of pre-emptive scheduling. It is hoped to report on a T9000 implementation of TWOS in the near future.

Acknowledgements

We would like to acknowledge the help of the High Performance Computing Group at the Jet Propulsion Laboratory (California) who developed TWOS, especially Peter Reiher (now at the Department of Computer Science, University of California, Los Angeles) and Steve Bellenot (now at the Mathematics Department, Florida State University, Tallahassee), and also the US Army Concepts Analysis Agency for making this system available. This research was supported by the UK Defence Research Agency (Malvern) through DRA research contracts 2041/038/RSRE and 2041/042/CSM.
References
[1] B. Beckman, M. Di Loreto, K. Sturdevant, P. Hontalas, L. Van Warren, L. Blume, D. Jefferson, and S. Bellenot. Distributed Simulation and Time Warp: Design of Colliding Pucks. In Proc. SCS Distributed Simulation Conference, pages 56–60, 1988.
[2] S. Bellenot. Global Virtual Time Algorithms. In Proc. SCS Distributed Simulation Conference, pages 122–127, 1990.
[3] S. Bellenot. A Network Version of the Time Warp Operating System. In Proc. Workshop on Cluster Computing, 1992.
[4] C. J. M. Booth, M. J. Kirton, and K. R. Milner. Experiences in Implementing the Breathing Time Buckets Algorithm on a Transputer Array. In Proc. IASTED Conf. on Modelling and Simulation, pages 274–277, 1993.
[5] K. M. Chandy and J. Misra. Distributed Simulation: A Case Study in Design and Verification of Distributed Programs. IEEE Trans. Software Engineering, SE-5(5):440–452, 1979.
[6] L. J. Clarke. Tiny Version 2 Release 1 for Inmos ANSI C Toolset: C Interface, 1991.
[7] L. J. Clarke. Tiny Version 2 Release 1 for Inmos ANSI C Toolset: Overview, 1991.
[8] M. Damitio, S. J. Turner, C. J. M. Booth, M. J. Kirton, K. R. Milner, and P. R. Hoare. Comparing the Breathing Time Buckets Algorithm and the Time Warp Operating System on a Transputer Architecture. In Proc. SCS European Simulation Multiconference, pages 141–145, 1994.
[9] R. M. Fujimoto. Parallel Discrete Event Simulation. Comm. ACM, 33(10):30–53, 1990.
[10] P. Hontalas, B. Beckman, M. Di Loreto, L. Blume, P. Reiher, K. Sturdevant, L. Van Warren, J. Wedel, F. Wieland, and D. Jefferson. Performance of the Colliding Pucks Simulation on the Time Warp Operating System. In Proc. SCS Distributed Simulation Conference, pages 3–7, 1989.
[11] Inmos. Transputer Reference Manual. Prentice Hall, 1988.
[12] Inmos. ANSI C Toolset Reference Manual, 1990.
[13] D. R. Jefferson. Virtual Time. ACM Trans. on Programming Languages and Systems, 7(3):404–425, 1985.
[14] D. W. Jones. An Empirical Comparison of Priority Queue and Event Set Implementations. Comm. ACM, 29(4):300–311, 1986.
[15] JPL. Time Warp Operating System User's Manual. Jet Propulsion Laboratory, 1991.
[16] JPL. Time Warp Operating System Internals Manual. Jet Propulsion Laboratory, 1992.
[17] M. D. May, P. W. Thompson, and P. H. Welch. Networks, Routers and Transputers. IOS Press, 1993.
[18] D. A. P. Mitchell, J. A. Thompson, G. A. Manson, and G. R. Brookes. Inside the Transputer. Blackwell Scientific Publications, 1990.
[19] M. Presley, M. Ebling, F. Wieland, and D. Jefferson. Benchmarking the Time Warp Operating System with a Computer Network Simulation. In Proc. SCS Distributed Simulation Conference, pages 8–13, 1989.
[20] P. Reiher, S. Bellenot, and D. Jefferson. Temporal Decomposition of Simulations under the Time Warp Operating System. In Proc. SCS Parallel and Distributed Simulation Conference, pages 47–54, 1991.
[21] P. Reiher and D. Jefferson. Dynamic Load Management in the Time Warp Operating System. Trans. Society for Computer Simulation, 7(2):91–120, 1990.
[22] J. S. Steinman. SPEEDES: Synchronous Parallel Environment for Emulation and Discrete Event Simulation. In Proc. Parallel and Distributed Simulation Conference, pages 95–103, 1991.
[23] J. G. Taylor. Lanchester-Type Models of Warfare. Technical report, US Naval Postgraduate School, Monterey, CA, 1980.
[24] F. Wieland, L. Hawley, A. Feinberg, M. Di Loreto, L. Blume, J. Ruffles, P. Reiher, B. Beckman, P. Hontalas, S. Bellenot, and D. Jefferson. The Performance of a Distributed Combat Simulation with the Time Warp Operating System. Concurrency: Practice and Experience, 1(1):35–50, 1989.
[25] F. Wieland, P. Reiher, and D. Jefferson. Experiences in Parallel Performance Measurement: The Speedup Bias. In Proc. 3rd Symp. on Experiences with Distributed and Multiprocessor Systems, 1992.