Programming Research Group
CONSERVATIVE DISCRETE-EVENT SIMULATIONS ON BULK SYNCHRONOUS PARALLEL ARCHITECTURES
Radu Calinescu
PRG-TR-16-95
Oxford University Computing Laboratory
Wolfson Building, Parks Road, Oxford OX1 3QD

Conservative Discrete-Event Simulations on Bulk Synchronous Parallel Architectures
Radu Calinescu
April 1995

Abstract
All the parallel discrete-event simulation algorithms developed so far have been designed to suit a specific parallel model (e.g., a PRAM model, an MP-RAM model, etc.). This paper presents several versions of conservative parallel discrete-event simulation algorithms developed around a unifying model for general purpose parallel computer design and programming, namely the Bulk Synchronous Parallel (BSP) model. The new algorithms are analysed in terms of the BSP model parameters, and the effectiveness of simulators based on these algorithms is evaluated. The performance achieved even for a loosely coupled distributed system is comparable with that reported in previous research work, while the generality of the BSP model provides portability to the new approaches.
Keywords: Bulk Synchronous Parallel Computers, Conservative Parallel Simulation Algorithms, Discrete-Event Dynamical Systems, General Purpose Parallel Computing
This work was supported by a Soros/Foreign and Commonwealth Office scholarship.
Contents

1 Introduction                                              3
2 Conservative parallel discrete-event simulation           4
3 The bulk synchronous parallel model                       9
4 Conservative BSP simulation algorithms                   11
  4.1 General considerations                               11
  4.2 A Chandy-Misra algorithm with termination            12
  4.3 A BSP deadlock avoidance algorithm                   18
  4.4 A BSP deadlock detection and recovery algorithm      19
5 Practical implementation and results                     22
6 Conclusions                                              29
1 Introduction

Discrete-event simulation represents the main tool for the design and testing of many systems that are too complex to be mathematically modelled. While the principles of discrete-event simulation were stated during the 1970s [Fishman78], and a substantial number of simulation languages [Markowitz63, Pristker74] and packages [Birtwistle79] were devised even earlier, the sequential approach to discrete-event simulation proved to yield rather poor results with respect to the time required to simulate even a medium-sized system. The parallel discrete-event simulation (PDES) algorithms devised to overcome the limitations of the sequential event-list-based algorithm [Zeigler76] simulate parts of the original system in parallel on many processors. However, this means that different processors may be at different points in the simulation time and that causality errors might occur. Two main classes of PDES algorithms are known so far, the distinction being made [Fujimoto90] according to whether or not the parallel algorithm completely avoids causality errors. The parallel algorithms in the former class are said to be conservative and are based on the Chandy-Misra distributed simulation algorithm [Misra86]. The second class comprises algorithms that allow causality errors and recover from them using a rollback mechanism [Fujimoto90]. The attempts to analyse and predict the behaviour of the two distinct PDES policies [Fujimoto90, Nicol90, Preiss92] have not decisively favoured either of them. This is due to the fact that, while the conservative strategies do not fully exploit the inherent simulation parallelism, the optimistic ones require maintaining additional information about the system's previous states for the rollback mechanism, as well as the periodic computation of the so-called global virtual time [Jefferson85]. A completely different, "don't care" design philosophy has been proposed in [Booth95] as a means of better exploiting the parallelism of PDES. Unlike the correct strategies described above, this new one pays attention neither to the avoidance of causality errors nor to the recovery from such errors; instead, it automatically discards any interaction between parallel simulated subsystems that breaches the causality principle. However, it remains to be determined what type of system one might want to simulate without requiring a faithful reproduction of the real system's behaviour. Nevertheless, whether they belong to one category or the other, all the algorithms proposed so far have been designed for a specific parallel architecture, be it a shared memory machine [Reed88, Konas92], a message-passing parallel computer [Cai90, Alonso93], or a workstation network [Groselj91]. This paper proposes the Bulk Synchronous Parallel (BSP) model [Valiant90,
McColl94] as a generic target for conservative discrete-event simulations. After providing a brief background on conservative PDES in the next section and a description of the BSP model in section 3, the paper introduces several BSP variants of conservative simulation algorithms. The first approach considered is a classic Chandy-Misra simulation algorithm with a termination mechanism that makes it usable for the simulation of general feedforward systems. The other algorithms tackle the cyclic system simulation problem using null messages and a BSP synchronous deadlock detection and recovery policy, respectively. Based on the BSP model parameters [Valiant90], an analysis of the above-mentioned algorithms is also provided in section 4. Finally, the results obtained for the implementation of these algorithms on a cluster of SUN workstations by means of the Oxford BSP Library [Miller94] are presented in section 5 and are interpreted in terms of the analysis developed in section 4.
2 Conservative parallel discrete-event simulation

The systems considered for discrete-event simulation are those that can be modelled as networks composed of several physical processes (PPs) which operate autonomously and interact with each other through messages [Misra86]. Such a system changes its state at discrete points in time, namely when a new message, corresponding to the occurrence of an event in the real system, is exchanged between two processes. There are three types of PP: source processes, which can only generate new messages; sink processes, which model parts of the system that only receive incoming messages; and server processes, which are able to receive, process and send messages to other PPs. Examples of systems that can be modelled in this way include communication systems, computer systems, manufacturing systems, etc. In order to simulate the behaviour of the real system, each message transmitted between two processes is assigned a timestamp corresponding to the moment when the event associated with that message would occur in the real system. The sequential simulation algorithm is straightforward: the simulator maintains an event list containing all the messages generated so far, and in every simulation step the event with the smallest timestamp is extracted from the list and processed, possibly leading to new events that are added to the list. The simulation clock is then adjusted to the timestamp value of the simulated event. Initially, the event list comprises only messages generated by source PPs and, as the simulation advances, new messages can be added and/or deleted in any step. Finally, the simulation ends either
when the event list becomes empty, or when the simulation clock reaches an a priori fixed stop value. Although attractive for its simplicity, the sequential event list approach proved to be inadequate for parallelisation. Therefore, an alternative strategy has been adopted for parallel discrete-event simulations. This strategy requires that each PP is simulated by a corresponding logical process (LP) that communicates through timestamped messages with other LPs. Each LP maintains status information for the modelled PP (e.g., the value of the local simulation time, etc.), and processes the incoming messages. As different LPs can be at different points in the simulation time, and a given message can be modified or cancelled by an earlier one, precautions must be taken when deciding to process an existing message; otherwise, a causality error might occur and lead to false simulation results. A sufficient constraint for preventing causality errors is [Fujimoto90] to ensure that each LP processes the incoming messages in nondecreasing timestamp order. Unfortunately, in order to respect this constraint, a given LP has to know the timestamp of at least one event for each incoming channel before processing any message, so it may waste time waiting for a message even though its message queue is not empty. Moreover, if the system contains cycles, the simulation may deadlock. Despite these obvious limitations, the choice to process only events that are safe (i.e., events such that no event with a smaller timestamp can arrive at that LP) is adopted by a whole class of parallel simulation algorithms, namely the conservative PDES algorithms. This is due to the fact that the alternative strategy, used by the optimistic approaches, has its own major drawbacks. Indeed, if one decides to process an incoming message hoping that no message with an earlier timestamp will arrive afterwards, one must be able to recover from the causality errors that might occasionally arise. The rollback mechanism used to overcome such situations is often time consuming, due to the antimessages that must be sent to cancel the effect of wrongly sent messages, and it also requires a record of the system's behaviour down to the global virtual time of the whole system (i.e., down to a lower bound for the timestamps of the unprocessed messages in the system). This means that, even if no rollback actually occurs, the simulation will involve both time and space overheads. The conservative approaches, on the other hand, are all based on the Chandy-Misra distributed discrete-event simulation algorithm [Misra86]. The algorithm is given in figure 1 and uses the following notations:

p represents the number of processors;
n represents the number of LPs;
/* Initialisation */
for k = 1 to p do in parallel
    for all lp_i in {LPs simulated by processor k} do
        lp_i.t = 0
        for j = 1 to lp_i.chanNo do
            lp_i.ck_j = 0
        endfor
    endfor
endfor

/* Simulation */
for k = 1 to p do in parallel
    while min{ lp_i.t : lp_i in {LPs simulated by processor k} } != stop_time do
        for all lp_i in {LPs simulated by processor k} do
            for all received messages with timestamp less than lp_i.t do
                process message and update message queue
                send generated messages, if any
            endfor
            for j = 1 to lp_i.chanNo do
                receive messages from channel j
                update lp_i.ck_j
            endfor
            lp_i.t = min{ lp_i.ck_j : j = 1..lp_i.chanNo }
        endfor
    endwhile
endfor

Figure 1: The Chandy-Misra algorithm. For the sake of readability, only the case when LP_i is a server is presented.
Figure 2: LP1 and LP2 both send messages to LP3; the last message from LP1 has timestamp 200, while LP2 has already sent a message with timestamp 250. If LP1 sends its last message at 200, lp_3.t will never advance beyond this value.
lp_i.t : R+ denotes the current simulation time of LP_i, i : 1..n;
lp_i.chanNo : N represents the number of incoming channels of LP_i, i : 1..n;
lp_i.ck_j : R+ represents the timestamp of the last message received by LP_i, i : 1..n, through channel j, j : 1..lp_i.chanNo;
0..stop_time is the interval over which the simulation must be performed.
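To make the notation concrete, the following C sketch (illustrative only; the type and function names are assumptions, not those of the implementation described in section 5) shows one way of representing the per-LP state and of advancing the simulation clock to the minimum of the channel clocks, as in figure 1:

    #include <float.h>

    #define MAX_CHAN 8                  /* assumed bound on incoming channels */

    typedef struct {
        double t;                       /* lp.t: current simulation time      */
        int    chanNo;                  /* lp.chanNo: incoming channel count  */
        double ck[MAX_CHAN];            /* lp.ck[j]: clock of channel j       */
    } LogicalProcess;

    /* Advance lp->t to the minimum channel clock. */
    static void update_simulation_clock(LogicalProcess *lp)
    {
        double min_ck = DBL_MAX;
        for (int j = 0; j < lp->chanNo; j++)
            if (lp->ck[j] < min_ck)
                min_ck = lp->ck[j];
        lp->t = min_ck;
    }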
However, the Chandy-Misra algorithm is prone to deadlocks when used to simulate cyclic networks of PPs. In fact, a deadlock may occur even in the case of a non-cyclic network. For instance, in figure 2 LP3 will never process the message with timestamp 250, because LP1 sends its last message at lp_1.t = 200. There are two ways to solve the deadlock problem. The first is to simply avoid any deadlock by sending additional status messages that have no counterpart in the real system, and the second is to allow deadlocks and then to detect and recover from them. The deadlock avoidance approaches are mainly variants of the null message algorithm described in [Misra86]: in each iteration and for each outgoing channel, the timestamp of the earliest message that can appear on that channel is computed (e.g., for a server process in a queueing network this is the sum of the current simulation time and the service time of the server). If this value is greater than the timestamp of the last message sent along the channel, a null message carrying the new value is sent. Clearly, an LP that receives a null message uses it only to update the corresponding channel clock. Unfortunately, beyond being unable to solve the deadlock for a cycle where no increments can be added to the LPs' simulation time when computing the null message
timestamp, this approach yields unsatisfactory solutions for many simple problems, such as the one presented in figure 3.

Figure 3: LP1 sends a message with timestamp 1000 into a cycle formed by LP2 and LP3, each with a service time of 10. The ratio between the number of null messages and the number of useful messages can be unsatisfactorily high even for very simple systems.

In this case, no fewer than 100 null messages must be sent by both LP2 and LP3 before LP2 can declare the real message received from LP1 safe. Moreover, even if one chooses to send null messages only on demand [Misra86], no improvement results in a case like that of figure 3. On the other hand, the deadlock detection and recovery approaches do not worry about deadlock occurrences, but identify them and recover by applying a specific procedure. Some of the most significant approaches of this kind are presented below. In [Misra83] a special type of message, called a marker, is used to detect deadlocks and to carry information for the recovery phase. The marker comprises a flag for each LP in the system; an LP that receives the marker in a given step sets its corresponding flag if it has not received or sent any real message since the last visit of the marker, and then sends the marker to a neighbouring LP. Moreover, the marker stores the timestamp of the earliest unprocessed event in the system, and every idle LP visited by the marker updates this information. If eventually all the flags get set at a certain simulation step, deadlock is declared, and all the local simulation clocks are safely advanced to the timestamp of the earliest unprocessed message in the system. The same policy can also be applied separately to each cycle in the network. A different technique has been proposed in [Reed88] for shared memory simulations. Here, each processor sets a flag in the global memory when it is unable to process any message in the current simulation step. A guardian processor is then used to monitor the global system state and to bring the processors to a synchronisation barrier when it detects a potential deadlock. Finally, the deadlock recovery algorithm is invoked to overcome the blockage. As false deadlocks may be detected due to this centralised control policy, the guardian processor uses a backoff algorithm that verifies a potential deadlock whenever the cost of an unnecessary call of the recovery routine is supposed
to be unacceptably high. A hybrid deadlock avoidance / deadlock detection and recovery approach has been devised in [Cai90]. The key idea of this approach is to use an additional type of null message (the so-called carrier-null message) to reduce the null message traffic. This goal is achieved by using the carrier-null message to transport supplementary information (i.e., the earliest time of a real message existing along the supervised loop, together with route information) among LPs. While the route information is used to dynamically detect dependency loops in the network, the timing information is used to update the clock values of the LPs that form the loop accordingly. Based on the remark that the original carrier-null method is not applicable to systems whose communication graphs contain nested cycles, a generalisation of the method has been developed in [Wood94]. Here, a priori knowledge of the topology of the system communication graph is used to identify the strongly connected components, or meta-subcircuits, of the communication graph. Then, carrier-null messages transmitted along these meta-subcircuits search for the timestamp of the earliest message in each meta-subcircuit, taking into account only the non-cyclic LP inputs (i.e., the incoming channels that bring messages from outside the meta-subcircuit). A carrier message is discarded and replaced by a fresh one if it reaches an LP whose simulation time is greater than the carrier creation time. Finally, whenever a carrier message succeeds in visiting a whole meta-subcircuit, the cyclic input channel clocks of the LPs it subsequently reaches are updated to the timestamp provided by the carrier message; this timestamp represents the so-called meta-subcircuit non-cyclic ceiling.
3 The bulk synchronous parallel model

The existence of a standard model is the only way to fully establish parallel computing as a viable alternative to sequential computing. The BSP model proposed in [Valiant90] and further developed in [Valiant93], [McColl93a] and [McColl94] provides such a unifying framework for the design and programming of general purpose parallel computers. A bulk-synchronous parallel computer consists of: a set of processor-memory pairs; a communication network for point-to-point message delivery; and a mechanism for efficient barrier synchronisation of all processors or of a subset of processors.
No specialised broadcasting or combining facilities are available. The performance of a BSP computer is fully characterised by the quadruple <p, s, l, g>, where

p is the number of processors;
s represents the processor speed, i.e., the number of basic operations executed on local data per second;
l represents the minimal number of time steps between successive synchronisation operations, or the synchronisation periodicity;
g is the ratio between the total number of local operations performed by all processors in one second and the total number of words delivered by the communication network in one second.

While the parameter l is a measure of the network latency, the parameter g is related to the time required to complete a so-called h-relation, i.e., a routing problem in which each processor has at most h packets to send to various processors in the network, and at most h packets to receive; in practice, g is the value such that g·h is an upper bound on the number of steps required to perform an h-relation. A BSP computation consists of a sequence of supersteps; in every superstep, the processors can execute operations on locally held data and/or initiate read/write requests for non-local data. However, the non-local memory accesses initiated during a superstep take effect only when all the processors reach the barrier synchronisation that ends that superstep. In order to analyse the complexity of a BSP algorithm, one has to take into account the complexity of the supersteps that compose that algorithm. The cost of a superstep S depends on the synchronisation cost (l), on the maximum number of local computation steps executed by any processor during S (w), and on the maximum number of messages sent and received by any processor during S (h_s and h_r, respectively):

    cost(S) = l + w + g · max{h_s, h_r}    (1)

Equivalent results can be obtained by adopting other expressions for the cost of a superstep, for instance max{l, w + g·h_s, w + g·h_r} [Gerbessiotis94] or max{l, w, g·h_s, g·h_r} [McColl93a]. It is clear that the performance of any BSP algorithm will depend not only on the problem size and on the number of processors, but also on the BSP parameters l and g. Moreover, as the same implementation of a BSP algorithm can be executed on different target machines, the two parameters can be used [McColl94] to identify the characteristics of the target
machine and to dynamically tune the BSP program for best results. Thus, if g approaches 1, the BSP computer closely resembles a shared memory parallel system. If, on the other hand, g has a high value, approximately g operations on locally held data must be performed for every non-local memory access so that the communication overheads do not dominate the computation costs. As for the synchronisation periodicity l, it should not dominate the cost equation (1) either, so a certain degree of parallel slackness is required for high values of l (i.e., the program must be written for a number of virtual processors exceeding p).
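As a small illustration of how equation (1) can be used when reasoning about a BSP algorithm, the following C function (purely illustrative) computes the cost of a superstep from the BSP parameters l and g and the per-superstep work and communication volumes:

    /* Cost of a superstep S according to equation (1):
     *   cost(S) = l + w + g * max(h_s, h_r)                              */
    static double superstep_cost(double l, double g,
                                 double w, double h_s, double h_r)
    {
        double h = (h_s > h_r) ? h_s : h_r;    /* size of the h-relation  */
        return l + w + g * h;
    }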
4 Conservative BSP simulation algorithms

This section presents several BSP algorithms for conservative discrete-event simulation. Although it is usually very difficult to analyse the performance of PDES algorithms other than through experiments [Fujimoto90, Reed88, Cai90] or under restrictive conditions [Nicol90], we shall see in this section that the BSP model proves to be an extremely powerful framework from this point of view. However, in the analysis of the algorithms introduced in this section we shall only consider the simulation of queueing networks, because their extensive use [Reed88, Lubachevski89, Madisetti91, etc.] in PDES experiments recommends them as a possible benchmarking criterion. Nevertheless, many of the results obtained in this section are applicable, to a certain extent, to the parallel simulation of any discrete-event dynamic system.
4.1 General considerations
A basic part of any PDES scheme is the inter-process message exchange. As in a BSP superstep one can only operate on locally held data, any LP can process in a given superstep only messages received in previous supersteps. Therefore, a proper buffering technique must be provided between any two communicating LPs. If we consider, for instance, an ordinary producer-consumer system, the first choice would be to use a simple buffer between the two processes. However, this is not a viable solution for a BSP implementation. Indeed, if the producer is allowed to completely fill the buffer in one superstep, it will have nothing to do in the next superstep, because the buffer cannot be simultaneously read locally by the consumer process and written remotely by the producer process, even though the buffer will be partially empty at some moment during the second superstep. So, if a single buffer is placed between the two processes and used in a straightforward way, the two processes will actually
operate in alternating supersteps and no parallelism will be achieved. In fact, the performance will be even worse than for a sequential simulation, because additional time will be spent synchronising the two processes. Therefore, a more sensible approach is to halve the buffer and to alternately use the first half for transmission and the second half for reception in (say) the odd supersteps, and the first half for reception and the second half for transmission in the even supersteps. Of course, this is the same as using two different buffers between any two communicating LPs.

Clearly, the inter-process buffer size will influence the performance of the simulation. However, the first impression, that a bigger buffer would provide more computation in every superstep and would thus increase the overall performance, is not necessarily true. To give a counterexample, let us consider the simple n-server queueing network shown in figure 4.

Figure 4: A cyclic network of n servers (server-1, server-2, ..., server-n). If the message population M of the network equals the inter-server buffer size, it is possible that only a single server is active in any superstep, and the simulation is no longer parallel.

If the number of messages in the network is comparable with the inter-server buffer size, it is very likely that in any superstep some of the servers are idle. If, on the other hand, the message population is significantly greater than the buffer size, most LPs will not be idle in any superstep. However, it is worth noticing that discrete-event simulation often requires only small amounts of computational time for the actual message processing. Therefore, the efficiency of a BSP simulation will depend heavily on the degree of parallel slackness that can be achieved and used to counterbalance the communication and synchronisation costs.
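A minimal C sketch of the double-buffering scheme just described is given below; the type and constant names are assumptions made for this example, not those of the actual simulators:

    #define BUF_SIZE 30                 /* inter-LP buffer size b (assumed)      */

    typedef struct { double timestamp; int data; } Message;

    typedef struct {
        Message half[2][BUF_SIZE];      /* the two buffers between a pair of LPs */
        int     count[2];               /* number of messages in each half       */
    } Channel;

    /* In superstep s the sender writes (remotely) into half s % 2, while the
     * receiver reads half (s + 1) % 2, which was filled in the previous
     * superstep and is therefore stable local data.                             */
    static Message *send_half(Channel *c, int s) { return c->half[s % 2]; }
    static Message *recv_half(Channel *c, int s) { return c->half[(s + 1) % 2]; }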
4.2 A Chandy-Misra algorithm with termination
As shown in section 2, the Chandy-Misra algorithm may deadlock even for acyclic networks. In this section we provide a BSP algorithm for the simulation of general feedforward networks, i.e., for networks whose communication
graph can be organised in successive layers, with arcs only from one layer to the following ones (i.e., a dag). The simulation part of the algorithm is presented in figure 5. Beyond the notations introduced in section 2, we use lp_i.ready : {true, false} to denote a flag that starts with the value false and is switched to true when lp_i.t, i : 1..n, reaches stop time, and lp_i.message_queue to denote the queue of incoming unprocessed messages. This queue can be organised in nondecreasing timestamp order, in which case the insertion of new messages is the most costly queue operation, or in random order, in which case the most costly queue operation is the extraction of the earliest event. Both in the analysis of the algorithm and in the implementation described in section 5, the former solution was chosen. The other notations are self-explanatory. In order to ensure the termination of the simulation, an additional type of message, called a termination message, is used. This type of message has no counterpart in the real system, has its timestamp set to a value OVER outside the simulation interval 0..stop_time, and induces no significant overhead, because the number of such messages in a given simulation is a constant equal to the number of channels in the system. Indeed, as shown in figure 5, each LP sends a termination message through all its output channels exactly when it reaches stop time.
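One possible encoding of the termination message, assuming (as in the experiments of section 5) a simulation interval of 50000 time units, is a sentinel timestamp outside the simulation interval; the names below are illustrative only:

    #define STOP_TIME 50000.0                /* end of the simulation interval  */
    #define OVER      (STOP_TIME + 1.0)      /* sentinel outside 0..stop_time   */

    /* A message whose timestamp equals OVER carries no event; it only tells the
     * receiving LP that this channel will deliver nothing further.             */
    static int is_termination(double timestamp) { return timestamp == OVER; }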
Theorem 1 The simulation is correct at every point and ends with all LP simulation clocks at stop time for any feedforward system.
Proof. As the first part of the theorem is proved in [Misra86], we only
have to prove that the simulation ends with lp_i.t = stop time for every i : 1..n. We shall prove this part of the theorem by induction on the number of layers in the communication graph of the system. Clearly, if there is only one layer, the system comprises only independent source LPs. Then, the simulation will finish in one superstep, because the number of messages generated by a source LP in the finite interval 0..stop_time is finite, and there is no output buffer size to restrict the number of messages generated in this first superstep. If there are n + 1 layers in the communication graph, n ≥ 1, the processes corresponding to the first layer of this graph are source LPs, and in every superstep they can all generate a number of messages greater than or equal to the size of their output buffers. As the number of messages generated by
for k = 1 to p do in parallel
    while (∃ i : 1..n : ¬lp_i.ready) do
        start superstep
        for all lp_i in {LPs simulated by processor k} do
            if ¬lp_i.ready then
                /* receive incoming messages */
                for j = 1 to lp_i.chanNo do
                    for all m in input_buffer_j do
                        if m.timestamp ≠ OVER then
                            insert m in lp_i.message_queue
                        endif
                    endfor
                    update lp_i.ck_j to the timestamp of the last message in input_buffer_j,
                        or to stop_time if the last message is a termination message
                endfor
                /* update simulation clock */
                lp_i.t = min{ lp_i.ck_j : j = 1..lp_i.chanNo }
                /* process safe messages and send output messages */
                while (∃ m in lp_i.message_queue : m.timestamp < lp_i.t) ∧
                      (∀ b in output buffers : b is not full) do
                    extract m from lp_i.message_queue
                    process m
                    update lp_i.message_queue and/or store resulting message with
                        timestamp < stop_time in non-local output buffer (i.e., send message)
                endwhile
                if (lp_i.t = stop_time) ∧ (lp_i.message_queue is empty) ∧
                   (∀ b in output buffers : b is not full) then
                    for all b in output buffers do
                        store termination message in b
                    endfor
                    lp_i.ready = true
                endif
            endif
        endfor
        end superstep
    endwhile
endfor

Figure 5: The simulation part of the BSP Chandy-Misra algorithm with termination.
a source LP in 0..stop_time is finite, a superstep s will exist when all the LPs from the first layer have finished generating their messages. Then, by superstep s + 1, all source processes will have sent their termination messages and will have finished the simulation at stop time. This implies that in superstep s + 2 the servers in layer 2 will receive their last incoming messages and will start to act like source LPs. So, after a finite number of supersteps, the system will start to behave like an n-layer feedforward system. Then, by the induction hypothesis, all the LPs will finish their simulation at stop time. □

Figure 6: A tandem network whose n LPs are distributed in the straightforward way among the p processors (n/p consecutive LPs per processor P1, P2, ..., Pp).
It is worth noticing that when an LP has two or more input channels, there is a theoretical risk that some of the processes incident to this LP send a very limited number of messages (or messages with very low timestamps) to it. In that case, the message queue of such an LP grows steadily with each superstep; in order to avoid the risk of running out of memory in a practical implementation of the algorithm, one must use an appropriate strategy for handling this situation. An example of such a strategy is, for instance, to temporarily "block" the input channels whose channel clocks are far ahead in the simulation time whenever the size of the message queue exceeds a fixed threshold. However, in most real systems one may want to simulate, such situations are extremely unlikely to appear. To analyse the algorithm, let us first consider a tandem network whose n LPs are assigned in the straightforward way to the p available processors (figure 6). As each server LP has only one input channel, all messages received in a superstep are safe and can be processed in the same order in the following superstep. Therefore, in every superstep an LP processes and/or sends a number of messages equal to the inter-LP buffer size b, so the cost of a superstep S is
    cost(S) = l + ⌈n/p⌉ · b + g · b    (2)
If the total number of messages generated by the source is M, then the simulation takes n − 1 + M/b supersteps. However, taking into account that
usually M/b ≫ n − 1, the cost of the whole simulation is of the order

    O((M/b) · l + (M/p) · n + M · g)    (3)

Figure 7: The leftmost subtree that contains the LPs to be simulated by the first processor.
This result shows that for a BSP computer characterised by the quadruple <p, s, l, g>, the usage of more than p_max = min{n·b/l, n/g} processors can bring no gain over the usage of p_max processors. Although the synchronisation cost can be hidden in (3) by choosing an appropriate value for b (i.e., b > l·p/n), the number of processors that one may want to use is definitely bounded by n/g. Obviously, this conclusion is applicable to any pipeline-style BSP algorithm that implies an O(n/p) computational cost in every superstep. For a tree network whose non-leaf LPs have k output channels, each server LP still has a single incoming channel, so once more the messages need no ordering before processing. As in a superstep any non-leaf LP sends exactly b messages to one of its k descendant LPs and between 0 and b − 1 messages to each of the other k − 1 descendant LPs, the average computational cost per superstep corresponds to the processing of (b − 1)/2 messages by each LP and is ⌈n/p⌉ · (b − 1)/2. As for the communication cost, it is clearly bounded by the cost of an incomplete ⌈n/p⌉·k·b-relation. However, one can use a strategy of LP distribution among processors that leads to a worst case corresponding to an O(k·b·log_k n)-relation. Without going into details, this strategy consists of the following steps. To establish the LPs to be simulated by the first processor (P1), consider (figure 7)
the leftmost subtree of height q + 1, where q is uniquely defined by

    ∃ i : 1..k−1 :  i · (k^q − 1)/(k − 1)  <  n/p  ≤  (i + 1) · (k^q − 1)/(k − 1) + 1    (4)

Then, if the second part of (4) is satisfied with equality, assign to P1 all the LPs in the selected subtree. Otherwise, assign to P1 the LPs in the subsubtrees 1, 2, ..., i of the considered subtree (i.e., i · (k^q − 1)/(k − 1) LPs), as well as n/p − i · (k^q − 1)/(k − 1) LPs chosen by recursively applying the same strategy to subsubtree i + 1 (see figure 7). Both the first application of this partitioning technique and each recursive usage of it lead to at most k − 1 communication channels from an LP that is not assigned to P1 to an LP assigned to P1. As the procedure cannot be applied more times than the number of levels in the whole tree, in the worst case P1 has to realise a (k − 1)·b·log_k n-relation in each superstep. The same strategy can be successively applied to establish the LPs to be simulated by P2 to Pp, leading to a superstep communication cost of O(g·k·b·log_k n). Then, the cost of a superstep is
    cost(S) = O(l + (n·b)/(2p) + g·k·b·log_k n)    (5)
In the average case, the source LP generates k·b/2 messages per superstep, so the simulation takes log_k n + 2M/(k·b) = O(2M/(k·b)) supersteps. Thus, the cost of the simulation will be

    O((2M/(k·b)) · l + (M·n)/(k·p) + 2·M·g·log_k n)    (6)
As for the tandem network case, the expression of the complexity shows that the speedup of the parallel simulation increases with the number of processors up to p = p_max = min{n·b/(2·l), n/(2·g·k·log_k n)}, and remains almost unchanged for more than p_max processors. It is worth noticing that in this case the communication cost imposes a significantly lower limit than in the tandem network case. Nevertheless, a mapping of the LPs onto the p processors that implies the realisation of a cheaper h-relation is possible in many particular cases. Finally, for a general feedforward network where each LP has up to k input channels and up to k output channels, the computational part of the simulation increases due to the requirement to store the incoming messages in nondecreasing timestamp order. As the messages received along an input channel have increasing timestamps, this requirement can be met by simply merging the content of the k input buffers with the message queue
content. If we neglect the message queue length and we consider that all the k input buffers are full, an optimal way to merge their content is to first merge pairs of b messages in 2·b steps, then to merge pairs of sorted sets of 2·b messages in 4·b steps, and so on, until the procedure ends with the merging of two sets of k·b/2 messages each. The whole operation takes k·b·log_2 k steps and represents the most costly part of a superstep computation. As no simplification can be assumed for this case, the cost of a superstep must charge the realisation of a ⌈n/p⌉·k·b-relation for the communication:

    cost(S) = O(l + (n/p)·k·b·log_2 k + (n/p)·g·k·b)    (7)
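The basic step of this pairwise merging scheme is an ordinary merge of two timestamp-sorted arrays; a self-contained C sketch (with illustrative names) is:

    typedef struct { double timestamp; int data; } Message;

    /* Merge two timestamp-sorted arrays a (na items) and b (nb items) into out;
     * repeating this pairwise yields the merging scheme described above.        */
    static int merge_sorted(const Message *a, int na,
                            const Message *b, int nb, Message *out)
    {
        int i = 0, j = 0, n = 0;
        while (i < na && j < nb)
            out[n++] = (a[i].timestamp <= b[j].timestamp) ? a[i++] : b[j++];
        while (i < na) out[n++] = a[i++];
        while (j < nb) out[n++] = b[j++];
        return n;                        /* total number of merged messages */
    }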
As O(M/(k·b)) is a sensible estimate of the number of supersteps required to perform a typical simulation, the complexity of the algorithm is given in this case by

    O((M/(k·b)) · l + (M·n/p) · log_2 k + (M·n/p) · g)    (8)

Therefore, in this case the communication cost does not impose an upper bound on the number of processors that can be used to efficiently perform the simulation as long as g < log_2 k. Moreover, for an appropriate value of b (i.e., b > l·p/(k·n·log_2 k)), the synchronisation cost does not dominate in (8) either. However, it is important to emphasise that the condition g < log_2 k is unlikely to be attainable in practice, and that even if this condition is fulfilled, the parallel algorithm could lead to poor speedups when compared to the classical sequential algorithm (i.e., the event-list-based algorithm described in section 2).
4.3 A BSP deadlock avoidance algorithm
The null message strategy for deadlock avoidance described in section 2 can be used in conjunction with the previous BSP algorithm for the simulation of networks comprising cycles. In fact, a single straightforward modification must be made to the algorithm presented in subsection 4.2: in every superstep, and for each LP and outgoing channel, a null message must be sent along that channel if the timestamp of the earliest possible real message for the channel is greater than the timestamp of the last message (null or real) sent through it (or greater than 0 if no such message exists). Obviously, the timestamp of the null message must be identical to that of the earliest possible real message. Being only a BSP transcription of the classical null message approach, this algorithm inherits all its drawbacks (e.g., a high ratio between the number of null messages and the number of useful messages).
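The modification can be sketched in C as follows; the data layout and the send routine are assumptions introduced only for this example:

    #define MAX_OUT 8                        /* assumed bound on outgoing channels      */

    typedef struct {
        double t;                            /* current simulation time                 */
        double lookahead;                    /* e.g. the server's service time          */
        int    outChanNo;                    /* number of outgoing channels             */
        double last_sent[MAX_OUT];           /* timestamp of last message per channel   */
    } ServerLP;

    /* At the end of a superstep, send a null message on every outgoing channel
     * whose last sent timestamp lags behind the earliest possible real message
     * time; send_null is an assumed helper that stores the null message in the
     * remote input buffer of the destination LP.                                 */
    static void send_null_messages(ServerLP *lp,
                                   void (*send_null)(int channel, double timestamp))
    {
        double earliest = lp->t + lp->lookahead;
        for (int c = 0; c < lp->outChanNo; c++)
            if (earliest > lp->last_sent[c]) {
                send_null(c, earliest);
                lp->last_sent[c] = earliest;
            }
    }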
The most significant overhead induced by the introduction of null messages is the increase in the number of supersteps. If we consider, for instance, a simple cyclic network (figure 4), then no real message is processed by a server that sends a null message in the corresponding superstep. If a large fraction of the servers (or all of them) send null messages in a superstep, that superstep is mostly (or only) used for deadlock avoidance, and the overall simulation cost may increase drastically. The buffer size has a more complex role here: whereas a large buffer size still decreases the number of supersteps, because more real messages can be transmitted per superstep, it can also increase the need for null messages if in more supersteps some LPs run out of safe real messages.
4.4 A BSP deadlock detection and recovery algorithm
Although the approaches falling into this category usually use a special type of circulating message to identify deadlocks and to gather information for the recovery phase (see section 2 for details), such a mechanism would be inappropriate for a BSP algorithm. Indeed, due to the restriction of using only local data in a superstep, this technique would require a number of supersteps equal to the number of LPs in the considered cycle in order to detect a deadlock. Moreover, there is no guarantee that a previously visited idle LP will not become active while the special message is moving around the cycle, so precautions must be taken to avoid false deadlock detections. Another inconvenience arises in the case of a real deadlock, when during the detection supersteps most LPs are idle and a lot of time is spent on inter-superstep synchronisation. Therefore, our BSP deadlock detection and recovery algorithm (figure 8) resembles the algorithm introduced in [Reed88] and described in section 2. Practically, each processor is responsible for detecting a local deadlock over the LP subset it simulates and for providing information on the earliest message in the corresponding message queues. However, while the algorithm devised in [Reed88] may still detect a false deadlock, our algorithm detects only real deadlocks. Two different techniques can be used to inform the guardian processor about local deadlock occurrences and to advance the simulation clocks in case of a deadlock. The first is to directly store the adequate information in the guardian processor's local memory and in the other processors' local memories, respectively. This technique requires only two supersteps to identify and break the deadlock: one to notify the guardian processor about the local deadlocks, and a second to perform a jump in the simulation time of all LPs, according to the timestamp of the earliest message in the system. However, p-relations are to be used in each of the two supersteps,
for k = 1 to p do in parallel
    while (∃ i : 1..n : ¬lp_i.ready) do
        start superstep
        /* simulate the LPs in the local subset of processor k */
        . . .
        /* detect local deadlock */
        deadlock_k = true
        next_timestamp_k = +∞
        for all lp_i in {LPs simulated by processor k} do
            if (lp_i processed no message during the current superstep) then
                if ¬(lp_i.message_queue is empty) ∧
                   head(lp_i.message_queue).timestamp + lookahead_i < next_timestamp_k then
                    next_timestamp_k = head(lp_i.message_queue).timestamp + lookahead_i
                else if (lp_i.message_queue is empty) ∧
                        lp_i.t + lookahead_i < next_timestamp_k then
                    next_timestamp_k = lp_i.t + lookahead_i
                endif
            else
                deadlock_k = false
            endif
        endfor
        if deadlock_k then
            send <DEADLOCK, next_timestamp_k> to the guardian processor
        endif
        if (k = guardian processor) ∧ (∀ k : 1..p : deadlock_k) then
            for i = 1 to n do
                set lp_i.t to min{ next_timestamp_k : k = 1..p, next_timestamp_k ≠ +∞ }
            endfor
        endif
        end superstep
    endwhile
endfor

Figure 8: The BSP deadlock detection and recovery algorithm. If the system is deadlocked, the earliest moment when a given LP will produce a message is given by the sum of the timestamp of the earliest message in its queue and a specific lookahead (e.g., its service time).
and, taking into account that almost no computation is done, the cost of the whole operation will be

    2 · (l + p·g)    (9)
The second technique is to use combining and broadcasting, respectively, to achieve the same goal. Supposing that 1 < k ≤ p messages are sent to, respectively sent by, a given processor in a superstep, the whole operation will take 2·⌈log_k p⌉ supersteps and will cost

    2 · ⌈log_k p⌉ · (l + k·g)    (10)
Clearly, for k = p, the second approach reduces to the former one.
Theorem 2 The algorithm correctly detects any deadlock and recovers from it.
Proof. In this proof we suppose that the two-superstep technique is used. Then, the guardian processor declares deadlock in superstep s + 1 iff all the processors found local deadlocks in superstep s. Obviously, this happens if and only if no message was processed in superstep s, because all LP simulation clocks had values lower than the earliest message timestamp. This also means that no message was sent in superstep s, so no channel clock is to be updated in superstep s + 1. Therefore, if no external mechanism is used, no LP simulation clock is updated in superstep s + 1, so no message is processed in superstep s + 1 either. Obviously, the blockage would continue in any subsequent superstep, so the system is indeed deadlocked iff the guardian processor declares deadlock. As the guardian processor advances all the simulation clocks beyond the timestamp of the earliest event in the system, at least this earliest event will be processed in superstep s + 2, so the deadlock is broken. Moreover, the clock updating value is chosen such that no new message with a timestamp less than this value will be produced, so all the existing messages with smaller timestamps are safe. □
Unfortunately, the overheads introduced by the deadlock detection and recovery mechanism are not limited to the occurrences of real deadlocks. Indeed, a large number of local deadlocks may be reported, especially when each processor simulates a small number of LPs; this is one more case in which parallel slackness has a beneficial influence on the performance of the algorithm. Besides, the crude variant of the algorithm that was presented in
this section only deals with global deadlocks. This could obviously be inappropriate if the network comprises many independent cycles. In such a case, it is possible that some cycles are deadlocked for certain periods of time while the simulation advances in other parts of the system. Nevertheless, in order to overcome this flaw, a more refined algorithm can use information on the network topology when applying the same strategy for deadlock detection and recovery.
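A rough C sketch of the guardian's part of the two-superstep technique is given below; the variable names are hypothetical, and the remote stores that fill the two arrays would be performed with the library primitives described in section 5:

    #include <math.h>                        /* INFINITY */

    #define MAX_PROCS 14

    /* Written remotely by the other processors during the detection superstep. */
    static int    deadlock_flag[MAX_PROCS];  /* 1 if processor k found a local deadlock */
    static double next_timestamp[MAX_PROCS]; /* earliest possible event time on k       */

    /* Run by the guardian in the following superstep: if every processor reported
     * a local deadlock, return the value to which all LP clocks may safely be
     * advanced; otherwise return a negative value meaning "no global deadlock".  */
    static double recovery_time(int nprocs)
    {
        double t = INFINITY;
        for (int k = 0; k < nprocs; k++) {
            if (!deadlock_flag[k])
                return -1.0;
            if (next_timestamp[k] < t)
                t = next_timestamp[k];
        }
        return t;
    }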
5 Practical implementation and results

The algorithms presented in section 4 were implemented and tested using the primitives provided by the Oxford BSP Library [Miller94]. The BSP library has been developed at Oxford Parallel (the Parallel Computing Centre of Oxford University) and offers a basis for designing portable applications across a wide range of parallel platforms. The library uses a static single program multiple data (SPMD) BSP programming model, each parallel process executing the same program. All processes execute the same sequence of supersteps and must reach the end of a superstep before any of them can proceed to the next. However, within the supersteps each process may take its own execution path. As in any multiple data programming model, each process operates on its own data space; still, remote access to non-local data is available through a pair of library functions. Unlike accesses to local memory, remote data access is asynchronous, i.e., a request for a non-local data item is guaranteed to be satisfied only at the end of the current superstep. The library comprises the following basic primitives:

bsp_start(max_procs, num_procs, my_proc_id), used to initiate a BSP application executed on up to max_procs processes;
bsp_finish(), for ending a BSP application;
bsp_sstep(sstep_no) and bsp_sstep_end(sstep_no), used to delimit the beginning and the end (i.e., the synchronisation point) of a superstep;
bsp_fetch(from_proc_id, from_data, to_data, no_bytes), for remote data fetching;
bsp_store(to_proc_id, from_data, to_data, no_bytes), for non-local data storing.

These routines are callable from Fortran, C, Pascal, or any other programming language whose implementation respects the same linkage and static memory model.
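To illustrate how these primitives fit together, the following minimal C skeleton runs two supersteps in which every process stores its local clock into process 0's memory; the header name and the exact C binding of the calls (e.g., whether results are returned through pointers) are assumptions, so the sketch should be checked against the library documentation:

    #include <stdio.h>
    #include "bsp.h"                    /* Oxford BSP Library header (name assumed) */

    #define MAXP 14

    int main(void)
    {
        int nprocs, pid;
        double local_clock = 0.0;
        double clocks[MAXP];            /* on process 0: one slot per process       */

        bsp_start(MAXP, &nprocs, &pid); /* up to MAXP SPMD processes                */

        bsp_sstep(1);                   /* superstep 1                              */
        /* ... simulate the locally held LPs here, advancing local_clock ...        */
        bsp_store(0, &local_clock, &clocks[pid], sizeof(double));
        bsp_sstep_end(1);               /* barrier: the store is now visible on 0   */

        bsp_sstep(2);                   /* superstep 2                              */
        if (pid == 0)
            for (int i = 0; i < nprocs; i++)
                printf("clock of process %d: %f\n", i, clocks[i]);
        bsp_sstep_end(2);

        bsp_finish();
        return 0;
    }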
Moreover, highly efficient and scalable implementations of the library have been developed at Oxford Parallel for a large range of distributed memory systems, shared memory machines, and workstation networks.

The simulators were developed in C and were tested on a network of 14 SUN workstations. However, the same C code can be executed on any parallel platform for which a BSP library implementation exists. Each simulator was tested on several queueing networks comprising source processes, M/M/1 servers [Cohen82] and sink processes. In order to obtain the message interarrival times and the service times, a generator of random numbers with negative exponential distribution was built using the algorithm presented in [Neelamkavil87].

The simulator based on the Chandy-Misra algorithm with termination was tested on tandem queueing networks and on general feedforward networks. In order to study the role of parallel slackness, two network sizes were considered. Thus, for the tandem network case, networks comprising 17 and 42 LPs respectively were simulated, while for the general feedforward case the simulator was tested using networks with the topology presented in figure 9, for k = 4 and k = 12. For any multi-output LP in a general feedforward network, each outgoing message was sent through an output channel chosen at random using a uniform distribution. Mean message inter-generation times between 10 and 15 were set for the source LPs, while the server LPs were assigned mean service times ranging from 5 to 15.

Figure 9: The topology of the general feedforward queueing networks used to test the simulator based on the Chandy-Misra algorithm with termination. The experiments were carried out for networks with k = 4 and k = 12.

The speedups obtained for the 17 LP tandem network for 3 different inter-process buffer sizes are presented in table 1. The number of supersteps required for each buffer size is shown in table 2.
As expected, by increasing the buffer size one can linearly decrease the number of supersteps and hence the synchronisation costs. According to the analysis of the algorithm (see subsection 4.2), the fact that the simulation scales up to higher values of p when the buffer size is increased means that the maximum number of processors one can efficiently use is limited by the synchronisation cost. It is interesting to notice that for tandem networks the parallel algorithm for p = 1 is faster than the event list sequential algorithm (table 3). This is due to the fact that the classical sequential algorithm spends extra time maintaining a global event list, while the parallel algorithm uses an individual message list for each logical process.

buf. size |  p=1   p=2   p=3   p=4   p=5   p=6   p=7
       3  | 1.00  1.46  1.52  1.61  1.40  1.52  1.40
      10  | 1.00  1.77  2.21  2.48  2.44  2.82  3.13
      30  | 1.00  1.80  2.51  2.94  3.35  4.38  4.03

buf. size |  p=8   p=9   p=10  p=11  p=12  p=13  p=14
       3  | 1.40  1.26  1.23  1.20  1.07  0.99  1.10
      10  | 3.06  3.13  2.88  2.71  2.71  2.66  2.71
      30  | 4.27  5.51  5.51  5.34  5.02  5.02  4.75

Table 1: The speedups obtained for a 17 LP tandem network. The simulation was carried out for 50000 time units and approximately 68K messages were processed during the simulation.

buffer size | number of supersteps
          3 | 1704
         10 | 513
         30 | 173

Table 2: The number of supersteps required for the simulation decreases linearly when the buffer size is increased, but does not change with the tandem network size.
Better results were obtained when the parallel slackness was increased, i.e., when a tandem network with 42 LPs was simulated (table 4). This roughly threefold increase in parallel slackness led, for instance, to an increase of the speedup from 4.03 to 5.7 for p = 7 and buffer size = 30. A similar improvement is also reflected in the comparison between the cost of the parallel algorithm for p = 1 and the cost of the sequential event list approach (table 5). As the number of supersteps remained unchanged (table 2), the gain in speedup is entirely due to the relative decrease of the communication cost with respect to the computation cost. The results obtained for the 16 LP feedforward network of figure 9 (k = 4) are presented in table 6. It is easy to notice that the performance is inferior to that obtained for the tandem simulations; the cause of this lower efficiency is the presence of multi-input LPs, which are not able to immediately process all the incoming messages.

algorithm                                   CPU time
sequential event list approach              31165408
parallel algorithm, p=1, buffer size=3      16416010
parallel algorithm, p=1, buffer size=10     14466088
parallel algorithm, p=1, buffer size=30     17132648

Table 3: For tandem networks, the parallel algorithm for p = 1 is faster than the sequential event list algorithm.

buf. size |  p=1   p=2   p=3   p=4   p=5   p=6   p=7
       3  | 1.00  1.55  2.14  2.47  2.43  2.75  2.62
      10  | 1.00  1.66  2.63  3.19  3.45  4.02  4.38
      30  | 1.00  1.73  2.78  3.58  4.16  5.12  5.70

buf. size |  p=8   p=9   p=10  p=11  p=12  p=13  p=14
       3  | 2.52  2.46  2.37  2.49  2.38  2.34  2.40
      10  | 4.02  4.22  4.27  4.75  4.95  4.81  5.18
      30  | 5.83  6.56  6.46  7.50  7.50  7.36  8.13

Table 4: The speedups obtained for a 42 LP tandem network. The simulation was carried out for 50000 time units and approximately 164K messages were processed during the simulation.
algorithm CPU time sequential event list approach 97129448 parallel algorithm, p=1, buer size=3 39448422 parallel algorithm, p=1, buer size=10 34248630 parallel algorithm, p=1, buer size=30 42064984 Table 5: The ratio between the simulation time for the sequential algorithm and the simulation time for the parallel algorithm (p = 1) increases with the system size. buf. p size 1 2 3 4 3 1 1.58 1.34 1.76 10 1 1.45 1.57 2.64 30 1 1.66 1.87 3.73 buf. p size 8 9 10 11 3 1.46 1.07 0.97 0.99 10 2.86 2.28 2.38 2.18 30 5.11 4.59 4.42 4.45
5 6 7 1.30 1.72 1.63 2.33 2.44 2.53 3.51 4.20 4.16 12 13 14 0.89 0.92 1.02 2.30 2.25 1.95 4.52 4.42 4.82
Table 6: The speedups obtained for the 16 LP feedforward network from gure 10 (k = 4). The presence of multi-input LPs led to a performance inferior to that presented in table 1. the message queue at the end of a superstep is signi cantly reduced. This explains the supra-linear decrease in the supersteps number with the increase of buer size (table 7). This time, the sequential event list algorithm and the parallel algorithm for p=1 needed comparable amounts of CPU time (table 8). Taking into account all the speedup results presented so far, as well as the one corresponding to the simulation of a 40 LP feedforward queueing network (table 9), one can notice that the speedup tends to increase with p up to a given number of processes and then remains constant or slowly decreases. This almost general pattern is in uenced by two parameters, the parallel slackness and the inter-process buer size: the increase of the values of these two parameters has a bene cial eect both on the slope 26
This almost general pattern is influenced by two parameters, the parallel slackness and the inter-process buffer size: increasing the values of these two parameters has a beneficial effect both on the slope of the speedup curve and on the number of processes for which the maximum speedup is obtained (i.e., on the scalability of the simulation). This pattern agrees with the analysis of the algorithm (subsection 4.2), which also explains the presence of an upper bound on the number of processors that one can efficiently use to perform the simulation.

The simulators based on the deadlock avoidance and on the deadlock detection and recovery algorithms were used to simulate cyclic queueing networks comprising (figure 10) a source process that injects messages into the network, a number of servers, and a sink process that receives a small percentage (e.g., 7%) of the messages processed by the last server in the pipeline, the other messages processed by server_{n-2} being returned to the first server. The performance of the null message approach was highly influenced by the ratio between the message inter-generation time of the source and the service time of the servers. This major drawback is illustrated in table 10. However, no significant difference was noticed in the performance of the deadlock detection and recovery approach under similar conditions; this is explained by the fact that the recovery mechanism allows appropriate jumps in the simulation time regardless of the ratio between the source inter-generation time and the server service times.

buffer size | number of supersteps
          3 | 833
         10 | 193
         30 | 57

Table 7: For general feedforward networks, the number of supersteps showed a supra-linear decrease when the buffer size was increased.

algorithm                                   CPU time
sequential event list approach               6883058
parallel algorithm, p=1, buffer size=3       7999680
parallel algorithm, p=1, buffer size=10      5099796
parallel algorithm, p=1, buffer size=30      5933096

Table 8: For general feedforward networks, the efficiency of the parallel algorithm for p = 1 is comparable to that of the sequential algorithm.
buf. size |  p=1   p=2   p=3   p=4   p=5   p=6   p=7
       3  | 1.00  1.17  1.22  3.23  2.75  2.82  2.61
      10  | 1.00  1.48  1.74  3.81  4.58  4.78  5.00
      30  | 1.00  1.68  2.08  3.90  4.92  5.64  5.93

buf. size |  p=8   p=9   p=10  p=11  p=12  p=13  p=14
       3  | 2.34  2.20  1.86  1.96  1.86  1.92  1.92
      10  | 5.23  4.58  5.23  5.23  5.22  5.00  5.24
      30  | 6.40  6.43  6.40  6.85  6.72  6.81  6.87

Table 9: The speedups obtained for the 42 LP feedforward network of figure 9 (k = 12).
Figure 10: The cycle network used to test the simulators based on the deadlock avoidance and deadlock detection and recovery algorithms: source -> server1 -> server2 -> ... -> server_{n-2}; 7% of the messages processed by server_{n-2} are sent to the sink, while the remaining 93% are returned to server1.
source inter-generation time : server service time | real messages : null messages
                                               10:1 | 6:1
                                              100:1 | 2:1
                                             1000:1 | 2:3

Table 10: The percentage of null messages is highly dependent on the ratio between the message inter-generation time of the source and the server service time.
Although it did not lead to high performance, the deadlock detection and recovery approach proved to be superior to the null message one in all the tests. A comparison supporting this remark is presented in table 11. Taking into account that a local deadlock detection signal leads to a communication overhead comparable to that of transmitting a null message, and comparing the numbers of null messages and of local deadlock signals, it is easy to understand why the detection and recovery simulator needed smaller amounts of CPU time. It is also important to emphasise that the inter-process buffer size plays a different role for cyclic networks than for tandem or feedforward networks. Indeed, in this case a larger buffer allows more messages to be processed per superstep (i.e., decreases the synchronisation overheads), while a smaller buffer tends to keep more LPs busy in every superstep, and thus decreases the need for null messages and local deadlock signals, respectively. This explains why, especially for small networks, the best performance is not obtained for the largest buffer size, but for medium buffer sizes.
6 Conclusions

The results presented in the previous section show that significant speedups can be obtained for acyclic discrete-event system simulations using the BSP programming model. Moreover, the BSP model was used to develop an analysis of the conservative simulation algorithms that predicts and explains the simulation results, an issue rarely addressed in other parallel simulation work reported so far. Although the performance achieved for cyclic queueing network simulations was not very encouraging, it is comparable with that reported in previous research work [Reed88, Fujimoto90]. Besides, due to the generality of the BSP model, this implementation has the advantage of being transparently applicable to any target parallel machine. This attempt to provide a general framework for the design and implementation of distributed discrete-event simulators represents a counterpart of the significant research efforts directed towards the development of a unifying theory for distributed simulation [Bagrodia91, Radyia94]. On the other hand, the analysis of the BSP discrete-event simulation algorithms presented in section 4, and the remarks about the influence of different parameters (e.g., inter-process buffer size, parallel slackness) on the performance of the BSP approaches, can be used to build simulators that dynamically adjust their parameters to produce the best results. This goal may be achieved using, for instance, the GL programming language presented in [McColl93b].
n 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42
buf. size 3 3 3 3 3 10 10 10 10 10 30 30 30 30 30 3 3 3 3 3 10 10 10 10 10 30 30 30 30 30
p 1 2 4 8 14 1 2 4 8 14 1 2 4 8 14 1 2 4 8 14 1 2 4 8 14 1 2 4 8 14
real msgs 47K 47K 47K 47K 47K 47K 47K 47K 47K 47K 47K 47K 47K 47K 47K 110K 110K 110K 110K 110K 110K 110K 110K 110K 110K 110K 110K 110K 110K 110K
DA DDR CPU null CPU local time msgs ssteps time dlocks ssteps 14216098 8K 1669 12816154 2 1237 10282922 8K 1669 8982974 75 1237 10666240 8K 1669 9099636 181 1237 12332840 8K 1669 8566324 732 1237 16949322 8K 1669 13066144 1265 1237 13382798 10K 1136 11149554 2 668 8532992 10K 1136 6749730 164 668 8216338 10K 1136 6066424 365 668 7566364 10K 1136 4466488 1615 668 11099556 10K 1136 5983094 3K 668 14899404 13K 1134 12749490 2 662 9399624 13K 1134 7449702 212 662 5883098 13K 1134 7833020 875 662 4866427 13K 1134 7749690 2.5K 662 6766396 13K 1134 10899564 4.8K 662 33765316 29K 1680 29682146 2 1275 20567843 29K 1680 18049278 290 1275 14999400 29K 1680 12432836 817 1275 14199432 29K 1680 11749530 2463 1275 16982654 29K 1680 12232844 7630 1275 34598616 42K 1564 27382238 2 1172 21499140 42K 1564 17032652 548 1172 14399424 42K 1564 10966228 2232 1172 13349466 42K 1564 10232924 5608 1172 15049398 42K 1564 12049518 11K 1172 38381798 47K 1564 23432396 2 1172 23132408 47K 1564 14366092 982 1172 15149394 47K 1564 10366252 3116 1172 13866112 47K 1564 10049598 7376 1172 15532752 47K 1564 11582870 14K 1172
Table 11: A comparison between the deadlock avoidance (DA) and the deadlock detection and recovery (DDR) algorithms; cycle networks with 16 and 42 LPs were simulated using the two techniques. The number of null messages ("null msgs" in the table) used by the DA simulator is signi cantly greater than the number of local deadlocks ("local dlocks") corresponding to the DDR-based simulation. 30
load balancing strategy for runtime reassignments of LPs to processors, or the use of a priori information on the communication network topology to assign subsets of tightly coupled LPs to the same processor. Finally, it is worth noting that, in order to obtain an overall picture of the BSP discrete-event simulation paradigm, further research work must be dedicated to the design of optimistic BSP algorithms for discrete-event simulation.
Acknowledgements The author would like to thank Dr W F McColl and Oxford Parallel who kindly provided both advice and access to the Oxford BSP Library and to adequate computing facilities.
References

[Alonso93] Alonso J.M. et al., Conservative parallel discrete-event simulation in a transputer based multicomputer. In: Grebe R. et al., Transputer Applications and Systems '93, IOS Press, 1993, pp. 636-650.

[Bagrodia91] Bagrodia R. et al., A unifying framework for distributed simulation. In: ACM Transactions on Modeling and Computer Simulation, vol. 1, no. 4, October 1991, pp. 348-385.

[Birtwistle79] Birtwistle G.M. et al., DEMOS: A System for Discrete Event Simulation, Macmillan Press, New York, 1979.

[Bisseling93] Bisseling R.H., McColl W.F., Scientific computation on bulk synchronous parallel architectures. Technical Report 836, Department of Mathematics, University of Utrecht, December 1993.

[Booth95] Booth C.J.M., Roberts J.B.G., Discrete event simulation on parallel and distributed architectures. In: Distributed vs Parallel: Convergence or Divergence?, Proceedings of the PPECC Workshop, Abingdon, UK, 14-15 March 1995, pp. 29-30.

[Cai90] Cai W., Turner S.J., An algorithm for distributed discrete-event simulation: the "carrier null message" approach. In: Distributed Simulation, Proceedings of the 1990 SCS Multiconference on Distributed Simulation, January 1990, pp. 3-8.

[Cohen82] Cohen J.W., The Single Server Queue, North-Holland Publishing Company, 1982.

[Fishman78] Fishman G.S., Principles of Discrete Event Simulation, John Wiley, New York, 1978.

[Fujimoto90] Fujimoto R.M., Parallel discrete event simulation. In: Communications of the ACM, vol. 33, no. 10, October 1990, pp. 30-53.

[Gerbessiotis94] Gerbessiotis A.V., Valiant L.G., Direct bulk-synchronous parallel algorithms. In: Journal of Parallel and Distributed Computing, vol. 22, no. 2, August 1994, pp. 251-267.

[Groselj91] Groselj B., Tropper C., The distributed simulation of clustered processes. In: Distributed Computing (1991), vol. 4, pp. 111-121.

[Jefferson85] Jefferson D.R., Virtual time. In: ACM Transactions on Programming Languages and Systems, vol. 7, no. 3, July 1985, pp. 404-425.

[Konas92] Konas P., Pen-Chung Y., Synchronous parallel discrete-event simulation on shared memory multiprocessors. In: Proceedings of the 1992 SCS Western Simulation MultiConference and Distributed Simulation, 20-22 January 1992, Newport Beach, California, pp. 12-21.

[Lubacevski89] Lubacevski B.D., Efficient distributed event-driven simulations of multiple-loop networks. In: Communications of the ACM, vol. 32, no. 1, January 1989, pp. 111-131.

[Madisetti91] Madisetti V.K. et al., Asynchronous algorithms for the parallel simulation of event-driven dynamical systems. In: Computer Simulation, vol. 1, no. 3, July 1991, pp. 244-274.

[Markowitz63] Markowitz H.M. et al., SIMSCRIPT, A Simulation Programming Language, Prentice Hall, 1963.

[McColl93a] McColl W.F., General purpose parallel computing. In: Gibbons A.M., Spirakis P. (eds.), Lectures on Parallel Computation. Proc. 1991 ALCOM Spring School on Parallel Computation, volume 4 of Cambridge International Series on Parallel Computation, Cambridge University Press, Cambridge, UK, 1993, pp. 337-391.

[McColl93b] McColl W.F., GL: An architecture independent programming language for scalable parallel computing. Technical Report 93-072-39025-1, NEC Research Institute, Princeton, NJ, 1993.

[McColl94] McColl W.F., BSP programming. In: Blelloch G., Simon I. (eds.), Proc. 13th IFIP World Computer Congress, vol. I, Elsevier, 1994, pp. 539-546.

[Miller94] Miller R., Reed J.L., The Oxford BSP Library: Users' Guide Version 1.0, Oxford Parallel Technical Report, Oxford University Computing Laboratory, 1994.

[Misra83] Misra J., Detecting termination of distributed computations using markers. In: Proceedings of the 2nd ACM Principles of Distributed Computing, ACM, New York, pp. 290-293.

[Misra86] Misra J., Distributed discrete-event simulation. In: Computing Surveys, vol. 18, no. 1, March 1986, pp. 39-65.

[Neelamkavil87] Neelamkavil F., Computer Simulation and Modelling, John Wiley and Sons, 1987.

[Nicol90] Nicol D.M., Analysis of synchronisation in massively parallel discrete-event simulations. In: Proceedings of the Second ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP), Seattle, Washington, 14-16 March 1990, ACM Press, 1990, pp. 89-98.

[Preiss92] Preiss B.R. et al., On the trade-off between time and space in optimistic parallel discrete-event simulation. In: Proceedings of the 1992 SCS Western Simulation MultiConference and Distributed Simulation, 20-22 January 1992, Newport Beach, California, pp. 32-42.

[Pristker74] Pristker A.A.B., The GASP IV Simulation Language, John Wiley, New York, 1974.

[Radiya94] Radiya A., Sargent R.G., A logic-based foundation of discrete event modeling and simulation. In: ACM Transactions on Modeling and Computer Simulation, vol. 4, no. 1, January 1994, pp. 3-51.

[Reed88] Reed D.A. et al., Parallel discrete-event simulation using shared memory. In: IEEE Transactions on Software Engineering, vol. 14, no. 4, April 1988, pp. 541-553.

[Valiant90] Valiant L.G., A bridging model for parallel computation. In: Communications of the ACM, vol. 33, August 1990, pp. 103-111.

[Valiant93] Valiant L.G., Why BSP computers? In: Proceedings of the Seventh International Parallel Processing Symposium, Newport, USA, IEEE Computer Society Press, 1993, pp. 2-5.

[Wood94] Wood K.R., Turner S.J., A generalised carrier-null method for conservative parallel simulation. In: Proceedings of the 8th Workshop on Parallel and Distributed Simulation, IEEE Computer Society Press, 1994, pp. 50-57.

[Zeigler76] Zeigler B.P., Theory of Modelling and Simulation, John Wiley, New York, 1976.