Tutorial on
Parallel and Distributed Simulation of Petri Nets
Performance Tools '95, September 20, 1995, Heidelberg, Germany

A. Ferscha
Institut für Angewandte Informatik und Informationssysteme, Universität Wien
Lenaugasse 2/8, A-1080 Vienna, AUSTRIA
Tel.: +43 (1) 408 63 66 18, Fax: +43 (1) 408 04 50
Email: [email protected]

This text, tutorial slides and related material is available at http://www.ani.univie.ac.at/ferscha/Ferscha.html
Contents

1 Petri Nets
2 Simulation Modeling and Analysis
  2.1 Accelerating Simulations
    2.1.1 Levels of Parallelism/Distribution
  2.2 Logical Process Simulation
    2.2.1 Synchronous LP Simulation
    2.2.2 Asynchronous LP Simulation
3 Parallel Simulation of TTPNs
4 Distributed Simulation of TTPNs
  4.1 Event Driven Simulation Engines
  4.2 Partitioning
  4.3 Communication Interfaces
  4.4 Chandy-Misra-Bryant Communication Interfaces
    4.4.1 Deadlock Avoidance
    4.4.2 Deadlock Detection/Recovery
    4.4.3 Conservative Time Windows
  4.5 Time Warp Communication Interfaces
    4.5.1 Rollback and Annihilation Mechanisms
    4.5.2 Optimistic Time Windows
    4.5.3 The Limited Memory Dilemma
    4.5.4 Incremental and Interleaved State Saving
    4.5.5 Fossil Collection
    4.5.6 Freeing Memory by Returning Messages
  4.6 Adaptive Simulation Protocols
    4.6.1 Adaptive Memory Management
    4.6.2 Probabilistic Optimism
5 Sources of Literature and Further Reading
Abstract
The dynamics of discrete event systems can be effectively described and analyzed using the timed Petri net formalism. The aim of this tutorial is to comprehensively present the achievements attained in accelerating Petri net executions by using parallel or distributed multiprocessing environments. The basic problem is to generate concurrent Petri net executions ensuring correctness in the sense that the partial ordering of transition firings produced is consistent with the total event ordering that would be produced by a (hypothetical) sequential execution. Two lines of thought have been followed: in parallel simulations transition firings evolve as governed by a SIMD iteration mechanism, while distributed simulations aim at a proper synchronization of firings in spatially different net parts to avoid timing inconsistencies and alterations of the execution behavior. In both cases, structural properties of the underlying Petri net can be efficiently used to simplify and/or accelerate concurrent execution implementations.
1 Petri Nets

General Petri Nets (PNs) [54] can be classified with respect to the complexity of structure they employ. With decreasing restrictiveness in allowing structural constructs, the modeling power of the resulting PN class increases, and so does its analysis complexity. While analytical quantitative evaluation methods are feasible and tractable for less expressive PN classes and mainly exponential timing, simulation tends to be the only applicable method for analyzing general PN classes and versatile timing. Since simulation can become quite aggressive in its use of computational resources, parallel and distributed simulation techniques used with multiprocessors or distributed computing environments are a promising approach for the execution of complex simulation models. In this review we shall give a brief background on PNs and how they describe time dynamic discrete event systems. We show how qualitative properties of PNs are obtained, and how they can be used to realize their parallel or distributed execution. Since realizations of these concurrent executions are always implementation (software) and technology (hardware) dependent, we shall avoid presenting absolute execution performance here, and will study just the sources of parallelism which make speedup possible.
Petri Nets  A Petri net (PN) is usually denoted by a tuple (P, T, F, W, μ^(0)), where P is the set (p1, p2, ..., p_P) of places, graphically represented as circles, T = (t1, t2, ..., t_T) is the set of transitions, drawn as bars, and F ⊆ (P × T) ∪ (T × P) defines an input-output relation to and from transitions, graphically represented by a set of directed arcs. (The introduction of inhibitor arcs or transition priorities, raising the expressive power of PNs to that of Turing machines, is straightforward and will not be presented here.) Seen as a graph, a PN is therefore bipartite over P and T, with (a set of) input arcs I ⊆ (P × T) pointing from P to T and output arcs O ⊆ (T × P) pointing from T to P (equivalently, these relations can be seen as functions over P × T and T × P). The set of input places of t ∈ T is denoted by I(t); analogously O(t) is its set of output places. I and O are used in a similar way for places. W: F → ℕ⁺ assigns weights w((p,t)) to arcs (p,t) ∈ F to denote their multiplicity. μ^(0) is a marking (vector) generated by the marking function M: P → ℕ₀, expressing for every p the number of tokens μ^(0)(p) initially assigned to it. The dynamic behavior of a PN is described in terms of two rules:
(i) (enabling rule) A transition t ∈ T is enabled in some marking μ iff each of its input places holds a "sufficient" amount of tokens, i.e. iff ∀p ∈ I(t): μ(p) ≥ w((p,t)) in μ. E(μ) is the set of all transitions enabled in μ. Every t ∈ E(μ) may or may not fire.

(ii) (firing rule) When a transition t ∈ T fires in μ, it creates μ' by removing a certain amount of tokens from its input places and depositing a certain amount of tokens in its output places: ∀p ∈ I(t) ∪ O(t): μ'(p) = μ(p) − w((p,t)) + w((t,p)). The firing of t in the marking μ^(i) (reached after i other firings) is denoted by μ^(i) →t μ^(i+1).

Let w⁺(t,p) = w((t,p)) be a shorthand for the multiplicity of the arc pointing from t to p, and w⁻(t,p) = w((p,t)) respectively. Clearly, a(t,p) = w⁺(t,p) − w⁻(t,p) is the "amount of change" in the number of tokens in p caused by the firing of t. The set of all a(t,p) (t ∈ T, p ∈ P) defines the "amount of change" for the PN in the incidence matrix A = [a(t,p)]. Now the enabling of t ∈ T can be verified by the condition w⁻(t,p) ≤ μ(p) ∀p ∈ I(t), and consecutive firings can be described as
    μ^(i) = μ^(0) + A^T · Σ_{k=1}^{i} u_k                                          (1)
We read μ^(i) here as a |P| × 1 column vector showing the current number of tokens in p_j in its j-th component (μ_j^(i) = μ^(i)(p_j)). A^T is the transpose of the |T| × |P| matrix A, and u_k is a |T| × 1 column vector having 1 in a single component and 0 in all others. Assume the firing vector u_k has the 1 in row l. Then the operation A^T u_k "selects" the l-th column of A^T (row of A), containing the "amount of change" caused by firing t_l, and the change in the marking can be simply determined by adding the vector a_l to μ. Repeated transition firings can be represented by cumulating firing vectors u_k into a firing count vector x = Σ_{k=1}^{i} u_k, i.e. x counts in its l-th row the number of firings of t_l ∈ T that must occur to generate μ^(i). Consequently, μ^(j) − μ^(i) = Δμ = A^T x is the marking difference imposed by firing transitions with respect to x, starting from μ^(i) and ending up in μ^(j). A vector x with strictly positive integer components that solves 0 = A^T x does not impose any difference on any μ when executing the corresponding firings (since Δμ = 0). x is called a T-invariant of the PN, expressing that by firing each t_l ∈ T for x_l times starting in μ, the PN ends up in μ again, given that all the firings encoded in x are actually realizable. A dual interpretation exists for a |P| × 1 vector y, which by its components y_l "weighs" the changes caused by firing t_l ∈ T (encoded in row l of A) in such a way that the overall change zeroes out (0 = Ay). An integer solution y to Ay = 0 is called a P-invariant, expressing that the weighted number of tokens Σ_{i=1}^{|P|} μ_i^(j) y_i in the respective places p_i is invariant over any reachable marking μ^(j). In other words, the firing of transitions does not change this weighted token count. The set of places p_i for which y_i > 0 is called the support of the invariant, and a support is said to be minimal if no proper subset of it is itself a support. A minimal support P-invariant can always be found if a P-invariant exists, and any linear combination of P-invariants is still a P-invariant. Many properties of PNs (useful in the context of parallel and distributed simulation) can be verified using invariants; they may, however, not exist for the particular PN under study. Practically, vectors y can be found by rewriting Ay = 0 as

    [ A11 (r × (|P|−r))        A12 (r × r)       ]   [ Y1 ((|P|−r) × 1) ]
    [ A21 ((|T|−r) × (|P|−r))  A22 ((|T|−r) × r) ] · [ Y2 (r × 1)       ]  =  0      (2)

where A12 is a submatrix of A having full rank r, and [Y1 Y2] is the corresponding decomposition of Y. Then from A11 Y1 + A12 Y2 = 0 (the linearly independent part of (2)) we get Y2 = −A12⁻¹ A11 Y1, which, by combining Y1 with the left-hand and right-hand side, gives
    [ Y1 ((|P|−r) × 1) ]   [ I ((|P|−r) × (|P|−r))     ]
    [ Y2 (r × 1)       ] = [ −A12⁻¹ A11 (r × (|P|−r))  ] · Y1 ((|P|−r) × 1)          (3)

Here I is a (|P|−r) × (|P|−r) identity matrix (Y1 = I Y1). Thus we get all possible P-invariants as vectors Y with integer coefficients. A marking μ^(j) is said to be reachable from marking μ^(i) if there exists a sequence of transitions σ = (t_k, t_l, ..., t_m) such that μ^(i) →t_k μ^(i+1) →t_l μ^(i+2) ... →t_m μ^(j), or μ^(i) →σ μ^(j) for short. A necessary (but unfortunately not sufficient) condition for reachability is obtained directly from the fact that if μ^(0) →σ μ^(i), then a firing count vector x with A^T x = μ^(i) − μ^(0) = Δμ must exist, and B^T Δμ = 0 must hold, where the columns of B are the P-invariants of the net (multiply (1) by B^T to verify this). If, on the other hand, Δμ can be constructed as a linear combination of the columns of B, then μ^(0) is not reachable to μ^(i). The set of all markings reachable from μ^(i) is conveniently denoted by RS(μ^(i)), and we must recognize that |RS(μ^(i))| can be exponential as a function of |P| and the number of tokens in μ^(0). Consider a few other properties of PNs: We call a PN k-bounded if in every μ ∈ RS(μ^(0)) the number of tokens in each p_i is less than or equal to k. A 1-bounded PN is safe; a PN that is not k-bounded for any finite k is unbounded. Most practical PNs reveal k-boundedness, or even the stronger structural boundedness which guarantees μ_i ≤ k for any finite μ^(0). A PN is structurally bounded iff there exists a vector y of positive integers such that Ay ≤ 0 (the converse might not hold). A PN is conservative over a set of places P' ⊆ P if the number of tokens is "conserved" over P' despite transition firings. As we can conclude from (2), a necessary and sufficient condition for conservativeness is the existence of a P-invariant y, Ay = 0, with positive components referring to the places p ∈ P'. A PN is persistent over a subset of transitions T' ⊆ T if any t ∈ T' can lose its enabling only by its own firing, but not by the firing of some other transition t' ∈ T. Finally, a PN is live over a subset of transitions T' ⊆ T if for every t ∈ T' and any marking μ ∈ RS(μ^(0)) there exists a marking μ' ∈ RS(μ) such that the enabling condition holds for t in μ' (note that a variety of other definitions of liveness is possible and useful [54]).
[Figure: PN1 with places p1–p8 and transitions t1–t6; the stochastic transitions carry the firing rates λ3 = 0.5, λ4 = 1.0, λ5 = 2.0, λ6 = 2.0.]
Figure 1: PN1: Communicating Processes with Resource Contention

A marking μ^(i) is called a home state if for every μ^(j) ∈ RS(μ^(i)) there exists a firing sequence σ with μ^(j) →σ μ^(i). Certainly all these properties can be verified based on an explicit knowledge of RS(μ^(0)), but because of the complexity of its construction one relies on the (polynomial time) structural analysis presented above.
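To make the structural analysis concrete, the following sketch computes P-invariants as the integer null space of the incidence matrix and applies the state equation (1) to a marking. It is a minimal illustration written for this text, not taken from the tutorial; the tiny two-transition cycle used as input and the function names are assumptions.

```python
import numpy as np
from sympy import Matrix

def p_invariants(A):
    """P-invariants are integer solutions y of A y = 0
    (A is the |T| x |P| incidence matrix, rows = transitions)."""
    null_basis = Matrix(A).nullspace()           # rational basis of {y : A y = 0}
    invariants = []
    for v in null_basis:
        scale = np.lcm.reduce([term.q for term in v])   # clear fractions -> integer vector
        invariants.append(np.array([int(term * scale) for term in v]))
    return invariants

def next_marking(mu, A, firing_counts):
    """State equation (1): mu^(i) = mu^(0) + A^T x."""
    return mu + A.T @ firing_counts

if __name__ == "__main__":
    # Hypothetical 2-place cycle: t1 moves a token p1 -> p2, t2 moves it back.
    A = np.array([[-1, +1],     # t1
                  [+1, -1]])    # t2
    print(p_invariants(A))                          # e.g. [array([1, 1])]: mu(p1)+mu(p2) constant
    print(next_marking(np.array([1, 0]), A, np.array([1, 0])))   # after firing t1 once: [0 1]
```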
Example  The net PN1 in Figure 1 models the behavior of a system of four computational (c-)processes (p5), operating in parallel to each other and to one I/O process (p6). Two c-processes, after having done local computations (t5), compete with another c-process pair and the I/O process for two shared communication devices (e.g. channels, p7) to be able to communicate (t3). The local computation of the I/O process is conditioned on the availability of data produced by one of the other processes, as represented by a token in p8. We refer to the net without place p8 as PN1, and with it as PN1'. Neglecting for the moment the annotation of the net and the different transition representations, we obtain the following properties: PN1 is 4-bounded, not conservative over P, but conservative over (p2, p4, p6) and (p3, p4, p7). It is persistent over (t3, t4, t5, t6), while t1 and t2 are not persistent: firing t2 in μ = (2, 1, 1, 0, 0, 0, 1)^T will disable t1 and vice versa. The initial marking μ^(0) = [0, 0, 0, 0, 4, 1, 2]^T is a home state. We find the minimal support P-invariants y1 = [1, 0, 2, 0, 1, 0, 0]^T, y2 = [0, 0, 1, 1, 0, 0, 1]^T and y3 = [0, 1, 0, 1, 0, 1, 0]^T, saying that irrespective of any transition firing μ1^(i) + 2μ3^(i) + μ5^(i) = 4, μ3^(i) + μ4^(i) + μ7^(i) = 2, and μ2^(i) + μ4^(i) + μ6^(i) = 1 in any reachable marking μ^(i) ∈ RS(μ^(0)). Two T-invariants exist: x1 = [1, 0, 1, 0, 2, 0] and x2 = [0, 1, 0, 1, 0, 1], expressing that if in μ^(i) there exists a firing sequence σ' that (exclusively) fires t1, t3 and twice t5, yielding μ^(i+4) via μ^(i) →σ' μ^(i+4), or a sequence σ'' containing t2, t4 and t6 such that μ^(i) →σ'' μ^(i+3), then μ^(i) = μ^(i+4) or μ^(i) = μ^(i+3), respectively. PN1', in contrast to PN1, is structurally unbounded due to the unboundedness of p8, and a home state does not exist. While having the same set of P-invariants, it has only one T-invariant, x1 = [1, 2, 1, 2, 2, 2], which is intuitive due to the asynchronous coupling of t5 and t6 via p8. The most noticeable extension to "standard" PNs is the introduction of inhibitor arcs, i.e. arcs from places to transitions inverting the enabling rule (for enabling t the incident place must not hold tokens). This possibility to test whether a condition does not hold in order to fire a transition makes the expressive power of PNs equivalent to that of Turing machines. The same can be achieved by relating priorities to transitions and considering only enabled transitions at the highest priority level for firing.
Figure 2: Structural Restrictions (a)–(d) that classify PNs
Figure 3: Firing in Place-Timed PNs (PTPNs) (t1 fires at time t; the token then resides in the timed place for the holding time τ and enables t2, which fires at time t + τ)
PN classes  Restricting the input-output relation F in a general PN to allow for transitions only constructs like those in Figure 2 (a) (|I(t)| = |O(t)| = 1 ∀t ∈ T) yields a class of nets known as state machines (SMs). Since transitions can neither create nor destroy tokens, transitions can be removed from an SM, yielding the classical representation of a finite state automaton, which an SM essentially is. SMs can model concurrency and conflict (choice), but not synchronization. If F is restricted for places to structures like in Figure 2 (b) (|I(p)| = |O(p)| = 1 ∀p ∈ P), then the PN is called a marked graph (MG). Places are redundant in MGs, and their removal yields what is well known as series-parallel (task-)graphs. MGs can model concurrency and synchronization, but no conflict. The extension to MGs that allows multiple inputs to places but only a single output from each place, as in Figure 2 (c) (|O(p)| = 1 ∀p ∈ P), is commonly called the class of conflict free nets (CFNs). A PN that allows both constructs in Figure 2 (a) and (b), but not the construct in (d) (O(pi) ∩ O(pj) ≠ ∅ ⟹ |O(pi)| = |O(pj)| = 1 ∀pi, pj ∈ P, i ≠ j), is called a free choice net (FCN), which unifies (and extends) the power of SMs and MGs, but does not allow situations called confusion. In an FCN, if two transitions have an input place in common, there is always a "free choice" of which one to fire, i.e. no marking can exist in which one is enabled and the other is not. If there are markings possible such that the two transitions are simultaneously enabled, and others such that they are only enabled in isolation, then the firing evidently depends on the evolution of the net; we have "confusion" (PN1 exhibits confusion around p7). Finally, we refer to a PN that does not impose any of the restrictions in Figure 2 as a general net (GN). A PN is called strongly connected if |I(t)|, |O(t)|, |I(p)|, |O(p)| ≥ 1 ∀t ∈ T, ∀p ∈ P, irrespective of the specific class the PN belongs to. Most models in practice yield strongly connected PNs, i.e. neither source/sink places nor source/sink transitions exist. Both PN1 and PN1' obey strong connectedness.
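The structural conditions above translate directly into simple checks on the input and output sets. The following sketch is an illustrative classifier written for this text only; the net representation via I/O dictionaries is an assumption, not a fixed format.

```python
def classify(places, transitions, I, O):
    """Classify a PN given I[x], O[x]: sets of input/output nodes of place or transition x."""
    sm  = all(len(I[t]) == 1 and len(O[t]) == 1 for t in transitions)   # state machine
    mg  = all(len(I[p]) == 1 and len(O[p]) == 1 for p in places)        # marked graph
    cfn = all(len(O[p]) == 1 for p in places)                           # conflict free net
    # free choice: places sharing an output transition must have exactly one output
    fcn = all(len(O[p]) == 1 and len(O[q]) == 1
              for p in places for q in places
              if p != q and O[p] & O[q])
    if sm:  return "SM"
    if mg:  return "MG"
    if cfn: return "CFN"
    if fcn: return "FCN"
    return "GN"

# Example: a two-place, two-transition cycle is both an SM and an MG; the SM test fires first.
places, transitions = {"p1", "p2"}, {"t1", "t2"}
I = {"t1": {"p1"}, "t2": {"p2"}, "p1": {"t2"}, "p2": {"t1"}}
O = {"t1": {"p2"}, "t2": {"p1"}, "p1": {"t1"}, "p2": {"t2"}}
print(classify(places, transitions, I, O))   # -> "SM"
```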
Timing  A PN executes transition firings instantaneously, i.e. no time is consumed, which is certainly sufficient for reasoning about the quality of system behaviors (causalities, synchronization, etc.). To make the PN formalism adequate also for quantitative (i.e. performance) analysis, finite timing of activities can be expressed by associating a time concept with places, transitions, arcs, tokens, or any combination of them. Popular approaches are to assign firing times to transitions [64], representing the time interval during which a corresponding activity is going on, or holding times to places [69], modeling the time that expires between the occurrence of events (an arriving token has to reside in the place for the holding time before it can enable a transition). Several equivalences among timing notions have been proven [25]. Figure 3 shows the firing semantics in a Place-Timed PN (PTPN), where τ: P → ℝ assigns holding times τ_i to places p_i ∈ P, whereas Figure 4 shows the firing in a Transition-Timed PN (TTPN), where τ: T → ℝ assigns firing delays τ_i to transitions t_i ∈ T. Figure 5 demonstrates how PTPNs can be transformed to TTPNs of equivalent modelling power. Probably the most popular time extension to PNs are Stochastic Petri Nets (SPNs), assigning exponentially distributed firing rates to transitions for continuous time (CT) systems, or geometrically distributed firing rates in the case of discrete time (DT) systems. In both cases an isomorphism among the markings RS(μ^(0)) and the states of a Markov chain (MC) can be proven, which allows models to be evaluated in terms of CTMC or DTMC steady state or transient analysis. PN1 is a Generalized SPN (GSPN) [2], allowing a combination of nontimed (bars) and stochastically timed transitions (empty boxes), while still supporting embedded MC analysis. The same is true for Deterministic and Stochastic PNs (DSPNs) [3], where even deterministically timed transitions are allowed, but only in such a way that at most one of them can be enabled in any reachable marking. Generalized Timed PNs (GTPNs) [40] associate deterministic timing with transitions, but geometric firing times can be expressed by means of conflicting transitions in self-loops.
Figure 4: Firing in Transition-Timed PNs (TTPNs) (t1 fires at time t, the arriving token is reserved by t2, and t2 fires at time t + τ)
Figure 5: Equivalence Transformation from PTPN to TTPN

We shall consider mostly transition timing when talking about timed PNs (TPNs), and follow the convention of drawing deterministically timed transitions as thick bars, stochastically timed transitions as empty boxes and nontimed transitions as thin bars. (Both PN1 and PN1' are GSPNs; the firing rates are annotated at the stochastic transitions.)
Timed Enabling and Firing Semantics  Important for modeling, analysis and simulation is how the timing concept attached to a PN modifies its enabling and firing rules: If ∀p ∈ I(t): μ(p) ≥ c · w((p,t)) with c > 1 in μ, then t is said to be multiply enabled at degree c. If t can only "fire" (serve) one enabling at a time, we talk about single server (SS) (enabling) semantics, and about infinite server (IS) semantics if any number of enablings can be served at a time. If t1 in the SPN in Figure 6 followed SS, then the occurrence time ot(t1(i)) of t1 with the i-th token out of the four in p1 would be ot(t1(i)) = ot(t1(i−1)) + X_i, where X_i ~ exp(λ1) is an exponential variate. If t1 applied IS, then ot(t1(i)) = X_i ∀i, expressing a notion of parallelism among the tokens. The latter is useful in our interpretation of PN2, where tokens model machines subject to failure (t1) and repair (t2). Note that for memoryless distributions, a transition t with IS semantics can be modeled as an appropriately "faster" SS transition by choosing a marking dependent rate, e.g. μ1 · λ1 for t1. In a PN with IS, SS can always be modeled by assigning to transitions a loop-back place marked with a single token. Both the enabling and the firing can be timed in a TPN. Timing the firing is straightforward. If the enabling is timed, then for SMs, FCNs and GNs (but not for MGs and CFNs) two questions arise related to timing: (i) should it be allowed to preempt the enabling, and if so, (ii) how should intermediate "work" be memorized. In a policy with non-preemptive enabling semantics, once t becomes enabled it is decided whether it should fire. If t fires, it removes tokens from I(t) (start-firing event), "hides" tokens during the firing period, and after that releases tokens to O(t) (end-firing event). This policy is known as preselection (PS), since in cases of conflict among t and t' one of them is preselected for firing. Atomic firing (AF) refers to a policy where t, once it becomes enabled, has its firing delayed to the instant when the enabling time τ(t) has expired. If t is still enabled at that time, it fires exactly like in a PN. For the case where t loses its enabling during that period, either a new, resampled τ(t), or the non-expired part of τ(t) (age memory) is used for the eventual new enabling of t. While with PS conflicts (among timed transitions) are resolved by predefined mechanisms like random switches or priorities (from "outside"), AF resolves (timed) conflicts always and entirely by the race condition (from "inside"). Real systems with race conditions, like timers controlling activities, cannot be modeled by TPNs with PS. Opposed to timing the enabling, timing the firing does not impose any difficulty (it is always non-preemptive). The final adaptation of the enabling and firing rules for PNs to TPNs is that we have to replace the "may-or-may-not fire" rule by a "must fire" rule in order to make timing meaningful.
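As a small illustration of the difference between the two enabling semantics, the sketch below samples occurrence times for a multiply enabled exponential transition under SS and IS. It is written for this text only; the rate and degree values are arbitrary assumptions.

```python
import random

def occurrence_times_ss(degree, rate, start=0.0):
    """Single server: enablings are served one after the other,
    ot(t(i)) = ot(t(i-1)) + X_i with X_i ~ exp(rate)."""
    times, t = [], start
    for _ in range(degree):
        t += random.expovariate(rate)
        times.append(t)
    return times

def occurrence_times_is(degree, rate, start=0.0):
    """Infinite server: all enablings are served in parallel, ot(t(i)) = X_i."""
    return sorted(start + random.expovariate(rate) for _ in range(degree))

random.seed(1)
print(occurrence_times_ss(4, rate=2.0))   # strictly increasing cumulative sums
print(occurrence_times_is(4, rate=2.0))   # four independent samples
```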
Figure 6: PN2: A TPN with Model Parallelism (places p1, p2; transitions t1 with rate λ1 and t2 with rate λ2)
Class                      Definition                                                      Modeling Power
State Machines (SM)        |I(t)| = |O(t)| = 1 ∀t ∈ T                                      concurrency, conflict
Marked Graphs (MG)         |I(p)| = |O(p)| = 1 ∀p ∈ P                                      concurrency, synchronization
Conflict Free nets (CFN)   |O(p)| = 1 ∀p ∈ P                                               concurrency, synchronization, conflict excluded
Free Choice nets (FCN)     O(pi) ∩ O(pj) ≠ ∅ ⟹ |O(pi)| = |O(pj)| = 1 ∀pi, pj ∈ P, i ≠ j    concurrency, conflict, synchronization
General nets (GN)          no restriction                                                  concurrency, conflict, synchronization, confusion

Class                               Type of Enabling Delay τ(ti)                                                        Analysis Techniques
nontimed PN                         τ(ti) = 0                                                                           no quantitative analysis possible
Timed MG                            τ(ti) ∈ {ℕ, ℝ, geom(λ), exp(λ), discr.-general, cont.-general}                      recurrence equations
Timed GN                            τ(ti) ∈ {ℕ, ℝ, geom(λ), exp(λ), discr.-general, cont.-general}                      simulation
Discr./Cont. Time Stochastic PN     τ(ti) ~ geom(λ) resp. exp(λ) ∀ti ∈ T                                                isomorphic to DTMC/CTMC, standard steady-state/transient analysis
Discr./Cont. Time Generalized SPN   τ(ti) ∈ {0, geom(λ)} resp. {0, exp(λ)} ∀ti ∈ T, "memoryless" RET                    isomorphic to DTMC/CTMC, standard steady-state/transient analysis
Non-memoryless RET D/C GSPNs        as with D/C GSPNs, but firing policy with absence of memory at regeneration points  semi-Markov (DTMC/CTMC) analysis on embedded MC
Deterministic and Stochastic PN     τ(ti) ∈ {0, ℝ, exp(λ)}, at most 1 determ. transition enabled in each marking        steady-state analysis by method of supplementary variables
GN                                  τ(ti) = f(·) (user defined, arbitrary timing)                                       simulation
Figure 7: Modeling Power versus Complexity of Solution
Discrete Event Systems  The execution of a TPN can now be interpreted as a time dynamic discrete event system in various different ways. If oriented along the places p ∈ P, the token arrival and departure events (in the case of the PS policy also the token reservation events) characterize the dynamic behavior of a system. More natural is the transition oriented view, where transition firings are related to event occurrences, while places represent the conditions that need to hold for the event to occur; e.g. in Figure 6 the firing of t1 represents the occurrence of a "machine failure" event, etc.
Modelling Power vs. Analysis Complexity  Both the net class as well as the timing notion in PNs have an impact on their qualitative and quantitative analysis. E.g., working with PN models for realistically sized systems one frequently arrives at the limits of practical tractability of the analytical evaluation. Therefore, and due to the limitations of modeling power of net classes with tractable analytical solutions, simulation often remains the only practical technique to obtain performance indices. The tables in Figure 7 summarize this situation. Taking just the class of timed Petri nets having an underlying stochastic process which is Markov [53], semi-Markov [1] or semi-regenerative [22] as an argument, analysis requires knowledge of the infinitesimal generator Q (to solve e.g. πQ = 0 subject to Σ_{s_i ∈ RS(μ^(0))} π_i = 1), i.e. the enumeration of all states RS(μ^(0)) reachable from the initial marking μ^(0). RS(μ^(0)), however, is potentially exponential in the number of places |P| and the number of tokens in the initial marking, yielding a space and a computational problem for analytical evaluation. The complexity of generating the state space is reflected in the computational complexity of deciding, given a PN and a marking μ, whether ∃σ s.t. μ^(0) →σ μ (the "reachability problem"). Although decidable [51], this problem is known to be exp-space-complete for symmetric PNs, PSPACE-complete if |I(t)| = |O(t)| = c ∀t ∈ T and for safe PNs, NP-complete for PNs without cycles and for CFNs, and polynomial only for bounded CFNs, for MGs, and for live, bounded, strongly connected FCNs.
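For the Markovian case, the steady-state equation πQ = 0 with Σ π_i = 1 mentioned above is a small linear-algebra exercise once RS(μ^(0)) and Q are available. The sketch below solves it for a hypothetical 3-state generator; the matrix values are invented for illustration.

```python
import numpy as np

def steady_state(Q):
    """Solve pi Q = 0 with sum(pi) = 1 by replacing one balance equation
    with the normalization condition."""
    n = Q.shape[0]
    A = Q.T.copy()                 # columns of Q^T are the balance equations pi Q = 0
    A[-1, :] = 1.0                 # replace the last equation by sum(pi) = 1
    b = np.zeros(n)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

# Hypothetical CTMC generator over 3 reachable markings (rows sum to 0).
Q = np.array([[-2.0,  2.0,  0.0],
              [ 1.0, -3.0,  2.0],
              [ 0.0,  1.0, -1.0]])
print(steady_state(Q))             # stationary distribution pi = [1/7, 2/7, 4/7]
```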
But simulation, too, is not devoid of shortcomings. Existing (sequential) Petri net simulators tend to have excessive computational requirements for realistically large nets. The hope of parallel (and distributed) simulation techniques is to alleviate these shortcomings by applying multiple processors to reduce simulation time, or in other words, to preserve modeling power while at the same time improving the complexity of the analysis. In the sequel of this presentation we shall systematically look at possibilities of improving the complexity of quantitative analysis by employing parallel processing facilities.
2 Simulation Modeling and Analysis

Basically every simulation model is a specification of a physical system (or at least some of its components) in terms of a set of states and events. Performing a simulation thus means mimicking the occurrence of events as they evolve in time and recognizing their effects as represented by states. Future event occurrences induced by states have to be planned (scheduled). In a discrete simulation the occurrence of an event is instantaneous and fixed to a selected point in time. Two kinds of discrete simulation have emerged that can be distinguished with respect to the way simulation time is progressed. In time driven discrete simulation, simulated time is advanced in time steps (or ticks) of constant size Δ; in other words, the observation of the simulated dynamic system is discretized into unitary time intervals. The choice of Δ trades simulation accuracy against elapsed simulation time: ticks short enough to guarantee the required precision generally imply longer simulation time. Intuitively, for event structures irregularly dispersed over time, the time-driven concept generates inefficient simulation algorithms. Event driven discrete simulation discretizes the observation of the simulated system at event occurrence instants. We shall refer to this kind of simulation as discrete event simulation (DES) subsequently. A DES, when executed sequentially, repeatedly processes the occurrence of events in simulated time (often called "virtual time", VT) by maintaining a time ordered event list (EVL) holding timestamped events scheduled to occur in the future, a (global) clock indicating the current time, and state variables S = (s1, s2, ..., sn) defining the current state of the system (see Figure 8). A simulation engine (SE) drives the simulation by continuously taking the first event out of the event list (i.e. the one with the lowest timestamp), simulating the effect of the event by changing the state variables and/or scheduling new events in EVL, possibly also removing obsolete events. This is performed until some pre-defined end time is reached, or there are no further events to occur.
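The simulation engine loop just described maps almost one to one onto a priority queue. The following sketch, written for this tutorial text rather than taken from it, keeps the EVL as a heap of (timestamp, event) pairs; the event handler interface is an assumption.

```python
import heapq

class SimulationEngine:
    """Sequential discrete event simulation engine: EVL, virtual time VT, state S."""
    def __init__(self, state, end_time):
        self.state, self.end_time = state, end_time
        self.vt = 0.0
        self.evl = []                       # heap of (timestamp, sequence_no, event)
        self._seq = 0

    def schedule(self, timestamp, event):
        heapq.heappush(self.evl, (timestamp, self._seq, event))
        self._seq += 1

    def run(self):
        while self.evl:
            ts, _, event = heapq.heappop(self.evl)   # event with lowest timestamp
            if ts > self.end_time:
                break
            self.vt = ts                             # advance virtual time
            event(self)                              # may change state and schedule new events

# Usage: a single recurring event that increments a counter every 2 time units.
def tick(engine):
    engine.state["count"] += 1
    engine.schedule(engine.vt + 2.0, tick)

se = SimulationEngine({"count": 0}, end_time=10.0)
se.schedule(0.0, tick)
se.run()
print(se.vt, se.state)    # 10.0 {'count': 6}
```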
Figure 8: Simulation Engine for Discrete Event Simulation

As an example (see Figure 9), assume a physical system of two machines participating in a manufacturing process. In a preprocessing step, one machine produces two subparts A1 and A2 of a product A, both of which can be assembled concurrently. Part A1 requires a single assembly step, whereas A2 takes a nonpredictable number of assembly steps. Once one A1 and one A2 are assembled (irrespective of the machine that assembled them), one piece of A is produced. The system is modelled in terms of a Petri net (PN): transition t1 models the preprocessing step and the forking of the independent postprocessing steps t2 and t3. Machines in the preprocessing phase are represented by tokens in p1, finished parts A1 by tokens in p5, and finished or "still in assembly" parts A2 by tokens in p4. Once there is at least one token in p5 and at least one token in p4, the assembly process stops or repeats with equal probability (conflict among t4 and t5). Once the assembly process terminates, yielding one A, one machine is released (t5). The time behavior of the physical system is modelled by associating timing information with transitions (τ(t1) = 3, τ(t2) = τ(t3) = 2 and τ(t4) = τ(t5) = 0). This means that a transition ti that became enabled by the arrival of tokens in its input places at time t, and remained enabled (by the presence of tokens in the input places) during [t, t + τ(ti)), fires at time t + τ(ti) by removing tokens from its input places and depositing tokens in ti's output places. The initial state of the system is represented by the marking of the PN where place p1 has 2 tokens (for the 2 machines), and no tokens are available in any other place. Both the time driven and the event driven DES of the PN are illustrated in Figure 9.
[Figure 9 shows the example TPN (τ(t1) = 3.0, τ(t2) = τ(t3) = 2.0, τ(t4) = τ(t5) = 0.0) together with two step-by-step traces, "Time Driven Simulation" and "Event Driven Simulation": for each step the virtual time VT, the state vector S = (p1, p2, p3, p4, p5) and the EVL contents are listed.]
Figure 9: A Sample Simulation Model described as a Timed Petri Net

The time driven DES increments VT (denoted by a watch symbol in the table) by one time unit each step, and collects the state vector S as observed at that time. Due to the time resolution and the non time consuming state changes of the system, not all the relevant information could be collected with this simulation strategy. The event driven DES employs a simulation engine as in Figure 8 and exploits a natural correspondence among event occurrences in the physical system and transition firings in the PN model by relating them: whenever an event occurs in the real (physical) system, a transition is fired in the model. The event list hence carries transitions and the time instants at which they will fire, given that the firing is not preempted by the firing of another transition in the meantime. The state of the system is represented by the current PN marking (S), which is changed by the processing of an event, i.e. the firing of a transition: the transition with the smallest timestamp is withdrawn from the event list, and S is changed according to the corresponding token moves. The new state, however, can enable new transitions (in some cases maybe even disable enabled transitions), such that EVL has to be corrected accordingly: newly enabled transitions are scheduled with their firing time to occur in the future by inserting them into EVL (while disabled transitions are removed). Finally the VT is set to the timestamp (ts) of the transition just fired. Related now to the example in Figure 9 we have: Before the first step of the simulation starts, VT is set to 0 and transition t1 is scheduled twice for firing at time 0 + τ(t1) = 3 according to the initial state S = (2, 0, 0, 0, 0). There are two identical event entries with identical timestamps in the EVL, announcing two event occurrences at that time. In the first simulation step, one of these events (since both have identical, lowest timestamps) is taken out of EVL arbitrarily, and the state is changed to S = (1, 1, 1, 0, 0), since firing t1 removes one token from p1 and generates one token for both p2 and p3. Finally VT is adjusted. The new marking now enables transitions t2 and t3, both with firing time 2. Hence, new event tuples ⟨t2 @ VT + τ(t2)⟩ and ⟨t3 @ VT + τ(t3)⟩ are generated and scheduled, i.e. inserted into EVL in increasing order of timestamp. In step 2, again, the event with the smallest timestamp is taken from EVL and processed in the same manner, etc.
2.1 Accelerating Simulations
In the example of Figure 9, there are situations where several transitions have identical smallest timestamps, e.g. in step 5 where all scheduled transitions have identical end firing time instants. This is not an exceptional situation but appears whenever (i) two or more events (potentially) can occur at the same time but are mutually exclusive in their occurrence, or (ii) (actually) do occur simultaneously in the physical system. The latter distinction is very important with respect to the construction of parallel or distributed simulation engines: t2 and t3 are scheduled to fire at time 5 (their enabling lasted for the whole period of their firing time τ(t2) = τ(t3) = 2), where the firing of one of them will not interfere with the firing of the other one. t2 and t3 are said to be concurrent events since their occurrences are not interrelated. Obviously t2 and t3 could be simulated in parallel, say t2 by some processor P1 and t3 by another processor P2. As an improvement of the sequential simulation, on the other hand, they could both be removed from EVL in a single simulation step. The situation is somewhat different with t4 and t5, since the occurrence of one of them will disable the other one: t4 and t5 are said to be conflicting events. The effect of simulating one of them would (besides changing the state) also be to remove the other one from EVL. t4 and t5 are mutually exclusive and preclude parallel simulation. Before following the idea of simulating a single simulation model (like the example PN) in parallel, we will first take a more systematic look at the possibilities to accelerate the execution of simulations using P processors.
[Figure 10 arranges the options as a tree of levels of parallelism/distribution in discrete event simulation: Application-Level (independent replications of the same PN on different processors); Subroutine-Level (distribute support functions, e.g. random number generation, event handling); Component-Level (utilize parallelism among simulation model components); Event-Level with either centralized synchronization (parallel simulation) or decentralized synchronization (distributed simulation), the latter offering "a spectrum of options" between Chandy-Misra-Bryant (conservative: deadlock detection, deadlock recovery, bounded lag) and Time Warp (optimistic: aggressive cancellation, lazy cancellation, direct cancellation).]
Figure 10: Options to Parallelize/Distribute Discrete Event Simulations
2.1.1 Levels of Parallelism/Distribution
What are the options for accelerating discrete event simulation executions? Figure 10 shows that application-level distribution is feasible if large parameter spaces have to be searched (execute the PN with several initial markings). Subroutine-level distribution is feasible if iteration dependencies (output values of replication i = input parameters for replication i + 1) hinder application-level distribution, but it is limited in the degree of exploitable parallelism by the number of simulation support subroutines. Component-level distribution exploits the natural parallelism of simulation model components. Most challenging from a scientific viewpoint is parallelization at the event level, where basically two strategies of event list (EVL) management are feasible. In a centralized EVL management approach, a master process maintains the EVL and the virtual time (VT) progression, while workers execute concurrent events. This approach is particularly appropriate for UMA and COMA architectures with the EVL organized as a shared data structure, and often makes use of a centralized control (synchronization) unit. A decentralized EVL management strategy allows the concurrent simulation of events with different timestamps, thus necessitating a local synchronization protocol. We shall distinguish parallel simulation, which follows a SIMD operated synchronization, from distributed simulation (or Logical Process Simulation), which operates completely asynchronously.
2.2 Logical Process Simulation
Common to all simulation strategies with distribution at the event level is their aim to divide a global simulation task into a set of communicating logical processes (LPs), trying to exploit the parallelism inherent among the respective model components with the concurrent execution of these processes. We can thus view a logical process simulation (LP simulation) as the cooperation of an arrangement of interacting LPs, each of them simulating a subspace of the space-time which we will call an event structure region. In our example a region would be a spatial part of the PN topology. Generally a region is represented by the set of all events in a sub-epoch of the simulation time, or the set of all events in a certain subspace of the simulation space.
[Figure 11 shows LP_1, ..., LP_P attached to a communication system; each LP_i consists of a communication interface CI_i, a simulation engine SE_i and a region R_i (simulation sub-model).]
Figure 11: Architecture of a Logical Process Simulation

The basic architecture of an LP simulation can be viewed as in Figure 11: A set of LPs is devised to execute event occurrences synchronously or asynchronously in parallel. A communication system (CS) provides the possibility for LPs to exchange local data, but also to synchronize local activities. Every LPi has assigned a region Ri as part of the simulation model, upon which a simulation engine SEi operating in event driven mode (Figure 8) executes local (and generates remote) event occurrences, thus progressing a local clock (local virtual time, LVT). Each LPi (SEi) has access only to a statically partitioned subset of the state variables Si ⊆ S, disjoint from the state variables assigned to other LPs. Two kinds of events are processed in each LPi: internal events, which have causal impact only on Si ⊆ S, and external events, which also affect Sj ⊆ S (i ≠ j), the local states of other LPs. A communication interface CIi attached to the SE takes care of the propagation of effects causal to events to be simulated by remote LPs, and of the proper inclusion of causal effects into the local simulation as produced by remote LPs. The main mechanism for this is the sending, receiving and processing of event messages piggybacked with copies of the sender's LVT at the sending instant. Basically two classes of CIs have been studied for LP simulation, taking either a conservative or an optimistic position with respect to the advancement of event executions. Both are based on the sending of messages carrying causality information that has been created by one LP and affects one or more other LPs. On the other hand, the CI is also responsible for preventing global event causality violations. In the first case, the conservative protocol, the CI triggers the SE in a way which prevents causality errors from ever occurring (by blocking the SE if there is the chance to process an 'unsafe' event, i.e. one for which causal dependencies are still pending). In the optimistic protocol, the CI triggers the SE to redo the simulation of an event should it detect that premature processing of local events is inconsistent with causality conditions produced by other LPs. In both cases, messages are invoked and collected by the CIs of LPs, and their propagation consumes real time dependent on the technology the communication system is based on. For the representation and advancement of simulated time (VT) in an LP simulation we can devise two possibilities [61]: a synchronous LP simulation implements VT as a global clock, which is either represented explicitly as a centralized data structure, or implicitly implemented by a time-stepped execution procedure, the key characteristic being that each LP (at any point in real time) faces the same VT. This restriction is relaxed in an asynchronous LP simulation, where every LP maintains a local VT (LVT) with generally different clock values at a given point in real time.
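A minimal data-structure view of this architecture, written only for illustration here (all names are assumptions), could look as follows; the conservative or optimistic behavior is left to the communication interface.

```python
from dataclasses import dataclass, field
import heapq

@dataclass
class EventMessage:
    place: str          # target place in the receiving LP's region
    tokens: int         # number of tokens carried
    timestamp: float    # copy of the sender's LVT at the sending instant

@dataclass
class LogicalProcess:
    region: dict                     # Ri: the local part of the TPN (places, transitions, arcs)
    state: dict                      # Si: markings of the local places only
    lvt: float = 0.0                 # local virtual time
    evl: list = field(default_factory=list)   # heap of (timestamp, transition)

    def schedule(self, ts, transition):
        heapq.heappush(self.evl, (ts, transition))

    def handle_internal(self, ts, transition):
        """Internal event: affects only the local state Si."""
        self.lvt = ts
        # ... fire the transition, update self.state, schedule successor events ...

    def handle_external(self, msg: EventMessage):
        """External event: a token message produced by a remote LP is merged
        into the local event structure at LVT = msg.timestamp."""
        self.state[msg.place] = self.state.get(msg.place, 0) + msg.tokens
```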
2.2.1 Synchronous LP Simulation
In a time-stepped LP simulation [61], all the LPs' local clocks are kept at the same value at every point in real time, i.e. every local clock evolves over a sequence of discrete values (0, Δ, 2Δ, 3Δ, ...). In other words, simulation proceeds according to a global clock, since all local clocks appear to be just copies of the global clock value. Every LP must process all events in the time interval [iΔ, (i + 1)Δ) (time step i) before any of the LPs is allowed to begin processing events with occurrence time (i + 1)Δ and after. This strategy considerably simplifies the implementation of correct simulations by avoiding problems of deadlock and of possibly overwhelming message traffic and/or memory requirements, as will be seen with the synchronization protocols for asynchronous simulation. Moreover, it can efficiently use the barrier synchronization mechanisms available in almost every parallel processing environment. The imbalance of work across the LPs in certain time steps, on the other hand, naturally leads to idle times and thus represents a source of inefficiency. Both centralized and decentralized approaches to implementing global clocks have been followed. In [73], a centralized implementation with one dedicated processor controlling the global clock is proposed. To overcome stepping the time at instances where no events are occurring, algorithms have been developed that determine for every LP at what point in time the next interaction with another LP shall occur. Once the minimum timestamp of possible next external events is determined, the global clock can be advanced by Δ(S), i.e. an amount which depends on the particular state S. For a distributed implementation of a global clock [61], a structured (hierarchical) LP organization can be used [23] to determine the minimum next event time. A parallel min-reduction operation can bring this timestamp to the root of a process tree [9], which can then propagate it down the tree. Another possibility is to apply a distributed snapshot algorithm in order to avoid the bottleneck of a centralized global clock coordinator. Combinations of synchronous LP simulation with event-driven global clock progression have also been studied. Although the global clock is advanced to the minimum next event time as in the event driven scheme, LPs are only allowed to simulate within a Δ-tick of time, called a bounded lag by Lubachevsky [48] or a Moving Time Window by [70].
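The barrier-plus-min-reduction pattern underlying such a global clock can be sketched in a few lines of threaded Python. This is an illustrative toy written for this text (shared-memory threads standing in for processors, arbitrary per-LP event times), not an implementation from the tutorial.

```python
import threading

NUM_LPS = 4
barrier = threading.Barrier(NUM_LPS)
next_event_time = [0.0] * NUM_LPS      # each LP publishes its earliest pending event time
global_clock = [0.0]                   # shared global virtual time
END_TIME = 20.0

def lp(rank):
    import random
    random.seed(rank)
    pending = random.uniform(0.0, 5.0)           # earliest local event (toy model)
    while True:
        next_event_time[rank] = pending
        barrier.wait()                           # everyone has published its minimum
        if rank == 0:                            # min-reduction done by one LP here
            global_clock[0] = min(next_event_time)
        barrier.wait()                           # new global clock visible to all
        now = global_clock[0]
        if now > END_TIME:
            break                                # all LPs leave in the same round
        if pending <= now:                       # process local events due at this step
            pending = now + random.uniform(0.5, 3.0)

threads = [threading.Thread(target=lp, args=(r,)) for r in range(NUM_LPS)]
for t in threads: t.start()
for t in threads: t.join()
print("simulation ended at global clock", global_clock[0])
```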
2.2.2 Asynchronous LP Simulation
Asynchronous LP simulation relies on the presence of events occurring at different simulated times that do not affect one another. Concurrent processing of those events thus effectively reduces the simulation execution time relative to sequential execution. The critical problem, however, which asynchronous LP simulation poses is the chance of causality errors. Indeed, an asynchronous LP simulation ensures correctness if the (total) event ordering as produced by a sequential DES is consistent with the (partial) event ordering as generated by the distributed execution. Jefferson [44] recognized this problem to be the inverse of Lamport's logical clock problem [45], i.e. providing clock values for events occurring in a distributed system such that all events appear ordered in logical time. It is intuitively convincing and has been shown in [52] that no causality error can ever occur in an asynchronous LP simulation if and only if every LP adheres to processing events in nondecreasing timestamp order only (the local causality constraint (lcc) as formulated in [35]). Although sufficient, it is not always necessary to obey the lcc, because two events occurring within one and the same LP may be concurrent (independent of each other) and could thus be processed in any order. The two main categories of mechanisms for asynchronous LP simulation already mentioned adhere to the lcc in different ways: conservative methods strictly avoid lcc violations, even if there is some nonzero probability that an event ordering mismatch will not occur, whereas optimistic methods hazardously take the chance of processing events even if there is a nonzero probability of an event ordering mismatch.
3 Parallel Simulation of TTPNs

Parallel discrete event simulation (PDES) of TPNs aims to exploit the specific features offered by SIMD operated environments, especially the hardware support for fast communication in a static, regular interconnection network of processors controlled by a centralized control unit. Not only is the one-to-all broadcast operation possible in O(log N) time, N being the number of processors, but so are the reduction of data values d_i from the N processors with respect to any binary associative operator ⊕ (e.g. +, min, etc.), and, most importantly, the computation of the N partial products ⊕_{j=1}^{i} d_j, i = 1...N (parallel prefix operation).
Time Stepping  Obviously, TPNs with DT and PS find a straightforward parallel execution implementation, since start-firing and end-firing events can only occur at instants kΔt, k = 0, 1, ... of simulated time. Denote the state of such a TPN by (μ, ρ), where μ is the marking and ρ is a |T| × 1 vector holding for every t ∈ T the remaining time to its next event (start-firing or end-firing); then the algorithm at time kΔt executes events related to the t_i ∈ T for which ρ_i = 0 in parallel (as long as there are t_i ∈ T with ρ_i = 0). After that, for all ρ_i > 0 it sets ρ_i = ρ_i − Δt and steps to time (k + 1)Δt. Clearly, time gaps containing no event occurrences can be "jumped over" by employing a min-reduction over ρ to determine the earliest next event time. In principle, the same execution strategy applies to TPNs with AF and/or CT, but it is likely to be less effective due to the handling of descheduling with AF, and the far smaller probability of two or more events occurring at exactly the same instant of time (parallelism) with CT. Time stepping will perform best for TPNs with a very high concentration of events at certain points in virtual time, and finds an efficient implementation for SMs, MGs, and CFNs. A distributed conflict resolution scheme, necessary for FCNs and GNs, can severely complicate the implementation and degrade performance.
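A sequential sketch of this (μ, ρ) iteration, including the min-reduction jump over empty gaps, is given below; in a SIMD setting the loop over transitions would be the data-parallel part. The net encoding and the callback name are assumptions made for this text.

```python
import math

def time_stepped_run(mu, rho, execute_event, end_time):
    """(mu, rho) execution of a discrete-time TPN with preselection semantics.
    rho[t] holds the remaining time to the next event of transition t
    (math.inf if none is scheduled); execute_event(t, mu, rho) applies t's
    due event, updates the marking and (re)schedules affected transitions."""
    clock = 0.0
    while clock <= end_time:
        # execute (conceptually in parallel) all events that are due now
        while any(r == 0.0 for r in rho.values()):
            for t in [t for t, r in rho.items() if r == 0.0]:
                mu = execute_event(t, mu, rho)
        # min-reduction over rho: jump over time gaps that contain no event
        step = min(rho.values())
        if math.isinf(step):
            break                          # nothing left to simulate
        for t in rho:
            rho[t] -= step                 # inf - step stays inf
        clock += step
    return mu, clock
```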
Figure 12: PN3: MG with Stochastic Holding Times (places p1–p4 with holding rates λ1–λ4, transitions t1, t2, t3)
Recurrence Equations  A more general way to execute certain classes of TPNs concurrently was developed by Baccelli and Canales [6], describing in terms of recurrence equations how the TPN dynamically evolves. Consider exponential holding times associated with the places p_i, denoted h_i(1), h_i(2), ..., and (for simplicity) zero enabling/firing times in the MG in Figure 12. Then the occurrence time of the k-th firing of t_i ∈ T, ot_i(k) = ot(t_i(k)), can be written as: ot_1(k) = h_1(k) ⊗ ot_3(k−1), ot_2(k) = h_3(k) ⊗ ot_3(k−1), ot_3(k) = h_2(k) ⊗ ot_1(k) ⊕ h_4(k) ⊗ ot_2(k), where ⊕ denotes max as an associative infix operator, and ⊗ denotes the + operator. This set of equations, together with ot_1(0) = ot_2(0) = ot_3(0) = 0, defines the evolution of PN3, and can, with ot_3(k) = h_2(k) ⊗ h_1(k) ⊗ ot_3(k−1) ⊕ h_4(k) ⊗ h_3(k) ⊗ ot_3(k−1), be written in matrix form as

    ot(k) = ot(k−1) ⊗ A_1(k) ⊗ (E ⊕ A_0(k)),                                        (4)

where A_0(k) collects the holding times h_2(k), h_4(k) of the places unmarked in μ^(0), A_1(k) collects the holding times h_1(k), h_3(k) of the places carrying one initial token, ⊕ and ⊗ are the operations of the semiring (ℝ, ⊕, ⊗) with null element ε = −∞ for ⊕ and unit element e = 0 for ⊗, and E is the identity matrix in (ℝ, ⊕, ⊗). Indeed, (4) is a special form of

    ot(k) = (ot(k−1) ⊗ A_1(k) ⊕ ... ⊕ ot(k−M) ⊗ A_M(k)) ⊗ (E ⊕ A_0(k) ⊕ A_0(k)² ⊕ ... ⊕ A_0(k)^L),   (5)

where the A_n(k) are matrices representing with their elements (i,j) the holding time of the k-th token in the place p ∈ P with O(t_i) = I(t_j) = p that holds n tokens in μ^(0), L is the maximum number of places on a directed cycle of the MG not having a token in μ^(0), and M is the maximum number of tokens in any p ∈ P in μ^(0). By appropriately rewriting (5) and using the associativity of ⊗ in (ℝ, ⊕, ⊗), the underlying MG can be executed either in a spatially decomposed way (parallelism from the MG size), or in a temporally decomposed way by executing matrix multiplications that simulate the evolution of the MG in distinct epochs of time, the results of which are eventually combined by a parallel prefix with ⊗. Baccelli and Canales [6] achieved excellent performance for MGs with DT or CT for both transitions and places, and FIFO token arrival/consumption. The extension of the approach to higher net classes is possible in principle, but not within (ℝ, ⊕, ⊗).
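The max-plus recurrence for PN3 is easy to execute directly. The sketch below iterates it using explicit max/+ operations rather than a semiring matrix product; the exponential rates are an assumption (they are not all given in the text) and the function name is invented for illustration.

```python
import random

def simulate_pn3(num_firings, rates, seed=0):
    """Iterate the recurrence for PN3:
       ot1(k) = h1(k) + ot3(k-1)
       ot2(k) = h3(k) + ot3(k-1)
       ot3(k) = max(h2(k) + ot1(k), h4(k) + ot2(k))
    rates[i] is the exponential rate of the holding time of place p_(i+1)."""
    random.seed(seed)
    ot1 = ot2 = ot3 = 0.0
    trace = []
    for k in range(1, num_firings + 1):
        h = [random.expovariate(r) for r in rates]   # h[0..3] = h1(k)..h4(k)
        ot1 = h[0] + ot3
        ot2 = h[2] + ot3
        ot3 = max(h[1] + ot1, h[3] + ot2)
        trace.append((ot1, ot2, ot3))
    return trace

# Assumed rates for the four places; prints the k-th firing times of t1, t2, t3:
print(simulate_pn3(3, rates=[1.0, 2.0, 1.0, 2.0])[-1])
```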
4 Distributed Simulation of TTPNs

4.1 Event Driven Simulation Engines
As opposed to PDES, distributed discrete event simulations (DDES) execute TPNs as a spatially decomposed event system dispersed over processors that operate asynchronously in parallel (though not necessarily; note the implementation of a DDES scheme for MGs on a SIMD platform by Sellami et al. [68]). As such, DDES of TPNs is more general than the time stepping approach, since transition firings occurring at different instants of virtual time are also considered for parallel execution. However, DDES potentially induces higher overheads for event list management and complex synchronization protocols not present in PDES. The basic idea of a DDES is to separate topological TPN parts to be simulated by logical processes (LPs). A synchronization protocol provides for a proper synchronization of intra and inter LP event relations with respect to simulated time and causal interdependencies. A DDES of a TPN is organized as in Figure 13 (see also [18]), where LPs comprise:
[Figure 13 shows PN2 partitioned into two logical processes LP_1 and LP_2, each holding one place and one transition of PN2 as its region R_i, with its own SE_i and CI_i attached to a communication subsystem.]
Figure 13: Logical Process simulation of TPNs
R   a spatial region of the TPN: Ri = (Pi ⊆ P, Ti ⊆ T, Fi ⊆ F) (all event occurrences in Ri will be simulated by LPi),
SE  an event driven simulation engine maintaining an event list (EVL), a local virtual time (LVT), and a static subset of the state variables Si ⊆ S defining the state of Ri at any time LVT (we will study cases with Si = {μj | pj ∈ Pi}), and
CI  a communication interface triggering the local SE and synchronizing with other LPs. CI implements a synchronization protocol based either on the Chandy-Misra-Bryant scheme (conservative) or on Time Warp (optimistic). (Both protocols will be described in detail in the sequel.)

At simulated time LVT, the EVL in SEi holds pending future event occurrences, organized in increasing timestamp order. In the case of e.g. AF semantics, these are tuples ⟨tj @ ot(tj)⟩ denoting the scheduled firing of tj at time ot(tj). Once SEi gets permission from CIi to fire the first (lowest occurrence time) transition tj in state Si, it (i) removes ⟨tj @ ot(tj)⟩ from EVL, (ii) modifies the state variables according to Si →tj Si', (iii) schedules new tuples ⟨tk @ ot(tk)⟩ for all tk ∈ E(Si'), and (iv) removes tuples ⟨tl @ ot(tl)⟩ which have lost their enabling in Si' (only in the AF policy). Note that in target environments lacking a shared address space, for performance reasons it is very important to be able to verify tk ∈ E(Si) without having to consult data residing in remote LPs (implicitly assumed here with the notation), thus restricting the freedom in partitioning.
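Steps (i)–(iv) of this local firing procedure can be sketched compactly. The sketch below assumes AF semantics, a heap-based EVL of (ot, transition) pairs, and model-specific callbacks (fire, enabled, sample_delay) whose names are invented for this text.

```python
import heapq

def fire_next(lp, fire, enabled, sample_delay):
    """One local simulation step of SE_i under the AF policy.
    lp.evl is a heap of (ot, transition); fire/enabled/sample_delay are
    model-specific callbacks (placeholder names for this sketch)."""
    ot, tj = heapq.heappop(lp.evl)                    # (i) remove <tj @ ot(tj)> from EVL
    lp.lvt = ot
    lp.state = fire(tj, lp.state)                     # (ii) Si --tj--> Si'
    scheduled = {t for _, t in lp.evl}
    for tk in enabled(lp.state):                      # (iii) schedule newly enabled transitions
        if tk not in scheduled:
            heapq.heappush(lp.evl, (lp.lvt + sample_delay(tk), tk))
    lp.evl = [(t_ot, t) for t_ot, t in lp.evl         # (iv) remove transitions that lost their
              if t in enabled(lp.state)]              #      enabling in Si' (AF only)
    heapq.heapify(lp.evl)
```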
4.2 Partitioning
Thomas and Lazowska [71, 72] have proposed Ri = (p ∈ P) (P-LPs) or Ri = (t ∈ T) (T-LPs), setting up |P| + |T| LPs. The "Transition Firing Protocol" (TFP) operates in two phases: (i) verify tk ∈ E(μ), tk residing in LP_tk, and (ii) fire tk. A fairly communication intensive double-handshake protocol is used for (i), involving the announcement of available tokens by P-LPs, the requesting of a certain amount of tokens necessary for enabling by T-LPs, the granting of the requested amount of tokens by P-LPs, and the confirmation of absorption of that amount of tokens again by T-LPs. Basically TFP implements a (distributed) token competition resolution policy, but not a conflict resolution mechanism in the sense of the GSPN definition, where user defined random switches for the selection among competing transitions can be specified. Phase (ii) implements the removal (deposit) of tokens from (to) P-LPs. P-LPs and T-LPs employ different SEs, which in combination behave conservatively (P-LPs only serve the earliest competing requests). Ammar and Deng [5] allow a totally general decomposition of TPNs into regions, but with redundant proxy representations of places that are cut away from their output transitions. A Time Warp based communication interface managing five different types of messages is used to maintain consistency in the "overlapped" state representation (note that here the Si are not a pairwise disjoint partition of μ).
communication interface managing ve dierent types of messages is used to maintain consistency in the \overlapped" state representation (note that here Si is not a pairwise disjoint partition of ). It has been seen by Thomas [71], Nicol and Roy [60] and Chiola and Ferscha [18], that for performance reasons the generality of Ri should be limited in such a way that con icting transitions together with all their input places should always reside in the same LP, i.e. for tk 2 Ti , tk 2 E(Si ) can always be veri ed without communication. A minimum region partitioning and grain packing strategy was proposed in [19], that uses two sources of partitioning information: (i) the TPN topology (P; T; F), and (ii) structural properties of the TPN in combination with (0) (if available). Consider the following relations among transitions: SC ti ; tj 2 T are in structural con ict (ti SC tj ), i I(ti ) \ I(tj ) 6= ; ti 0 might cause that t 62 E(0 ). EC ti ; tj 2 T are in eective con ict (ti EC tj ), i ti ; tj 2 E(), and ! j SC is necessary but not sucient for EC. Note that in PN1 (t1 SC t2), but in the interpretation as a GSPN with AF, the probability for (t1 EC t2 ) is zero for all (i) 2 RS([0; 0; 0; 0; 4;1; 2]T ). ti 0 might cause that CC ti ; tj 2 T are causally connected (ti CC tj ), i ti 2 E() and tj 62 E(), ! tj 2 E(0 ). ME ti ; tj 2 T are mutually exclusive, (ti ME tj ), i 6 9 s.t. ti; tj 2 E(). i.e. they cannot be simultaneously enabled simultaneously ME is symmetric and naturally not re exive. A sucient conditions for (ti ME tj ) is that the number of tokens in a P-invariant out of which ti and tj share places prohibits a simultaneous enabling. Another sucient conditions for (ti ME tj ) is that 9tk : k > j such that 8p 2 tk p 2 ti [ tj ^ w(tk ; p) max(w(ti ; p); w(tj ; p)). CN ti ; tj 2 T are concurrent (ti CN tj ), if they are neither EC, nor CC, nor ME. CN is symmetric and re exive, but not transitive. ti ; tj 2 T are in symmetric SC (ti SSC tj ) i ((ti SC tj ) _ (tj SC ti)) ^ :(ti ME tj ). Note that (ti SC ti ), (ti SSC ti ). The transitivity and re exivity of SSC allows to partition T into equivalence classes of potentially con icting transitions by computing the transitive closures for t 2 T with respect to SSC, (denoted by ECS(t), the extended con ict set of t), which gives the minimun region partitioning (MRP) of TPN. (We neglect the possibility of indirect con ict for simplicity here). ECS(t) can be computed in polynomial time in a pre-partitioning analysis for GNs, and is trivially obtained for SMs, MGs, CFNs and FCNs. The MRP for PN1 is R1 = ((p1 ; p7; p2); (t1; t2 ); F1), R2 = ((p3 ); (t3); F2), R3 = ((p4 ); (t4); F3), R4 = ((p5); (t5 ); F4), R5 = ((p6); (t6 ); F5); for PN2 it is depicted in Figure 13. Starting from the automatically generated MRP, regions of larger \grain size" can be obtained by collapsing Ris using \grain packing" heuristics. A few arguments are: (i) ME transitions do not bear any parallelism, s.t. it might not make sense to separate them into dierent regions: consider PN1 having just a single shared communication device, i.e. (0) = [0; 0; 0; 0; 4; 1;1]T . Then together with the P-invariant y2 = [0; 0; 1; 1; 0;0;1]T we have 3(i) + 4(i) + 7(i) = 1, which is a sucient condition for (t3 ME t4 ), and we would \pack" R2 and R3 together. Moreover this is sucient for (t1 ME t3 ) and (t2 ME t4 ) and we could collapse even R1, R2 and R3. 
(ii) From μ^(0) = [0,0,0,0,4,1,2]^T (two communication devices), together with y_3 = [0,1,0,1,0,1,0]^T, we have μ_2^(i) + μ_4^(i) + μ_6^(i) = 1, which is a sufficient condition for (t_4 CC t_6) (even (t_4 ME t_6)), s.t. the corresponding regions R_3 and R_5 could be collapsed. (iii) From μ^(0) = [0,0,0,0,4,1,2]^T it is trivially seen that (t_5 CN t_6), i.e. executing firings of t_5 and t_6 can indeed exploit model parallelism. Note that also (t_4 CN t_5), (t_3 CN t_6) and (t_3 CN t_4) in μ^(0) = [0,0,0,0,4,1,2]^T. As a principle, grain packing should be done so as to separate CN transitions.
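The extended conflict sets that define the MRP can be computed by a straightforward closure over the SSC relation. The following Python fragment is a minimal illustrative sketch and not the pre-partitioning analysis of [19]; the data layout (input_places, me_pairs) and the PN1-like toy data at the end are assumptions made for the example only.

def minimum_region_partitioning(transitions, input_places, me_pairs=frozenset()):
    """transitions: iterable of transition names.
    input_places[t]: set of input places of transition t.
    me_pairs: set of frozensets {ti, tj} known to be mutually exclusive
              (e.g. derived from P-invariants); such pairs are not packed together.
    Returns the list of extended conflict sets (= minimum regions over T)."""
    ts = list(transitions)
    parent = {t: t for t in ts}

    def find(t):                                  # union-find with path compression
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for i, ti in enumerate(ts):
        for tj in ts[i + 1:]:
            shared_input = bool(input_places[ti] & input_places[tj])   # ti SC tj / tj SC ti
            if shared_input and frozenset((ti, tj)) not in me_pairs:   # ti SSC tj
                union(ti, tj)                     # transitive closure of SSC

    classes = {}
    for t in ts:
        classes.setdefault(find(t), set()).add(t)
    return list(classes.values())

# PN1-like toy data (hypothetical arc sets): t1 and t2 share input place p7,
# all other transitions have private input places, so the MRP separates them.
print(minimum_region_partitioning(
    ["t1", "t2", "t3", "t4", "t5", "t6"],
    {"t1": {"p1", "p7"}, "t2": {"p2", "p7"}, "t3": {"p3"},
     "t4": {"p4"}, "t5": {"p5"}, "t6": {"p6"}}))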
4.3 Communication Interfaces
The purpose of the CI is to synchronize the transition firings local to the LP's region with firings in remote regions (LPs). The main mechanism for this is to communicate messages among the LPs carrying tokens along with their creation time. These tokenmessages are of the form m = ⟨#, i, ts⟩, where # is the number of tokens to be deposited in place p_i in the receiver LP, respecting the timestamp ts, i.e. a copy of the sender's LVT at the sending instant. Once a tokenmessage is received by the CI of an LP, it integrates the token into the local event structure at LVT = ts to preserve global causalities. Both classical synchronization protocols can be applied for this integration: While Chandy-Misra-Bryant (CMB)
based CIs (CI^CMB) would prevent the local SE from simulating local transition firings surpassing a LVT horizon for which the LP has been guaranteed not to receive a tokenmessage with ts less than that horizon (by simply blocking SE), a Time Warp based CI (CI^TW) relaxes this restriction, allowing SE to progress the local simulation as far into the simulated future as there are scheduled events in EVL. Once a straggler tokenmessage (i.e. one with a timestamp less than the current LVT) is received, the CI^TW recovers from the causality violation by undoing both local and consequential remote "overoptimistic" simulations in the rollback procedure. Data and control structures for CI^CMB and CI^TW were worked out in detail in [18]. CMB based CIs rely on model specific information called lookahead, which can be obtained from a future list [56] of presampled random variates for transition firing times τ(t). Assume for example that in PN1 exponential variates have been drawn as τ(t_4) and τ(t_6); then the lookahead la(t_2, (t_6, p_2)) of transition t_2 imposed on the arc (t_6, p_2) is la(t_2, (t_6, p_2)) = τ(t_4) + τ(t_6), because in μ^(0) = [0,0,0,0,4,1,2]^T both t_4 and t_6 are persistent. This means that once t_2 is fired at LVT, then eventually a token will be generated on arc (t_6, p_2) with timestamp LVT + la(t_2, (t_6, p_2)), and no token with a smaller timestamp will be seen on (t_6, p_2). Exactly this information needs to be propagated among LPs using CMB nullmessages to unblock an LP, i.e. to help an LP determine when it is safe to process an event in the local EVL. A restriction imposed by a CMB CI on the partitioning is that a nonzero time increment must be passed along with every tokenmessage leaving the LP, in order to avoid deadlock due to cyclic waiting for unblocking. A CI based on conservative time windows (to be explained in detail later) has been proposed in [55], which lets SE execute all events e with ot(e) < VT + ω without incurring any further communication among LPs, where VT is the (global) virtual time present in every LP after a barrier synchronization, and ω is the precomputed (global) time window. With respect to partitioning, this work avoids making use of information other than the static net topology (P, T, F), but periodically attempts to dynamically redistribute LPs after monitoring the intensity and distribution of the workload. The static premapping is based on a sophisticated heuristic for weighted directed graph partitioning. Time Warp based CIs bear the danger of aggressive memory consumption and the chance of rollback thrashing due to "overoptimism". Adaptive memory management and lazy cancellation rollback can improve the shortcomings of this strategy (a detailed view of the optimizations of CMB and Time Warp that can be applied to CIs is given in [28]). Time Warp will perform best if the tokenmessages exchanged among LPs are not widely dispersed over time, because otherwise rollback probabilities become unacceptably high. The impact on grain packing is thus to try to balance the average time increment per tokenmessage, which cannot be derived trivially from the net structure.
Figure 14: Performance of CI^TW on the CM-5 (left panel: execution time and its decomposition into SE, communication, rollback and blocking; right panel: communication load generated by the CI in messages and antimessages; shown for the partitionings [1246][35], [1235][46] and [1234][56])
Example  The performance of a DDES of PN1 as executed on the CM-5 with a Time Warp based, lazy cancellation CI is shown in Figure 14. Let π_1 = [1246][35] be a shorthand for the partitioning R_1 = ((p_1, p_2, p_4, p_6, p_7), (t_1, t_2, t_4, t_6), F_1), R_2 = ((p_3, p_5), (t_3, t_5), F_2), i.e. LP_1 is assigned t_1, t_2, t_4, t_6 with the respective input places and LP_2 is assigned t_3 and t_5. Similarly let π_2 = [1235][46] and π_3 = [1234][56]. The potential gain from π_1 and π_2 is (t_3, t_5 CN t_4, t_6), i.e. t_3, t_5 and t_4, t_6 are pairwise CN in μ^(0) = [0,0,0,0,4,1,2]^T. The difference between π_1 and π_2 is that in π_2 ME transitions on the same T-invariant x_2 = [0,1,0,1,0,1] are separated from each other: t_2 with (t_2 ME t_4, t_6) is separated from t_4, t_6 (the transitions in x_1 = [1,0,1,0,2,0], in the case of π_1, are not pairwise ME). A consequence of this is that rollback can never occur in LP_2 with π_2 (see the middle columns in Figure 14). Moreover, no antimessage will ever be generated for (sent to) LP_2. The hope of π_3 is (t_3 CN t_6) and (t_4 CN t_5). Note that π_2, despite the violation of the grain packing
argument (i), but due to avoiding the communication induced by rollback (while suffering from blocking instead), outperforms π_1 and π_3 on the CM-5 for the particular μ^(0). The situation would change with a different μ^(0) (e.g. more than one I/O process in PN1), and might be different on a platform with a smaller communication-to-computation speed ratio. We shall look at the two possibilities for implementing communication interfaces in more detail.
4.4 Chandy-Misra-Bryant Communication Interfaces
LP simulations following a conservative strategy date back to the original works by Chandy and Misra [15] and Bryant [13], and are often referred to as the Chandy-Misra-Bryant (CMB) protocols. As described by [52], in CMB the causality of events across LPs is preserved by sending timestamped (external) event messages of type ⟨ee@t⟩, where ee denotes the event and t is a copy of the LVT of the sending LP at (@) the instant when the message was created and sent. t = ts(ee) is also called the timestamp of the event. A logical process following the conservative protocol (subsequently denoted by LP^CMB) is allowed to process safe events only, i.e. events up to a LVT for which the LP has been guaranteed not to receive (external event) messages with a timestamp t < LVT (a timestamp "in the past"). Moreover, all events (internal and external) must be processed in chronological order. This guarantees that the message stream produced by an LP^CMB is in turn in chronological order, and a communication system (Figure 11) preserving the order of messages sent from LP_i^CMB to LP_j^CMB (FIFO) is sufficient to guarantee that no out-of-chronological-order message can ever arrive in any LP_i^CMB (necessary for correctness). A conservative LP simulation can thus be seen as the set of all LPs, LP^CMB = ∪_k LP_k^CMB, together with a set of directed, reliable, FIFO communication channels CH = ∪_{k,i (k≠i)} ch_{k,i} = (LP_k, LP_i) that constitute the Graph of Logical Processes GLP^CMB = (LP, CH). (It is important to note that GLP^CMB has a static topology, which, compared to optimistic protocols, prohibits dynamic (re-)scheduling of LPs in a set of physical processors.)
Figure 15: Architecture of a Conservative Logical Process

The communication interface CI^CMB of an LP_k^CMB on the input side maintains one input buffer IB[i] and a channel (or link) clock CC[i] for every channel ch_{i,k} ∈ CH pointing to LP_k^CMB (Figure 15). IB[i] intermediately stores arriving messages in FIFO order, whereas CC[i] holds a copy of the timestamp of the message at the head of IB[i]; initially CC[i] is set to zero. LVTH = min_i CC[i] is the time horizon up until which LVT is allowed to progress by simulating internal or external events, since no external event can arrive with a timestamp smaller than LVTH. The CI now triggers the SE to conduct event processing just like a (sequential) event driven SE (Figure 8) based on (internal) events in the EVL, but also to process (external) events from the corresponding IBs, respecting chronological order and only up until LVT meets LVTH. During this, SE might have produced future events for remote LPs. For each of those, a message is constructed by adding a copy of LVT to the event, and deposited into FIFO output buffers OB[i] to be picked up there and delivered by the communication system.
program LP^CMB(R_k)
S1      LVT = 0; EVL = {}; S = initialstate();
S2      for all CC[i] do CC[i] = 0 od;
S3      for all ie_i caused by S do chronological insert(⟨ie_i @ occurrence time(ie_i)⟩, EVL) od;
S4      while LVT <= endtime do
S4.1      for all IB[i] do await not empty(IB[i]) od;
S4.2      for all CC[i] do CC[i] = ts(first(IB[i])) od;
S4.3      LVTH = min_i CC[i];
S4.4      min channel index = i | CC[i] == min channel clock;
S4.5      if ts(first(EVL)) <= LVTH
          then /* select first internal event */ e = remove first(EVL);
S4.6      else /* select first external event */ e = remove first(IB[min channel index]);
          end if;
          /* now process the selected event */
S4.7      LVT = ts(e);
          if not nullmessage(e) then
S4.7.1      S = modified by occurrence of(e);
S4.7.2      for all ie_i caused by S do chronological insert(⟨ie_i @ occurrence time(ie_i)⟩, EVL) od;
S4.7.3      for all ie_i preempted by S do remove(ie_i, EVL) od;
S4.7.4      for all ee_i caused by S do deposit(⟨ee_i @ LVT⟩, corresponding(OB[j])) od;
          end if;
S4.11     for all empty(OB[i]) do deposit(⟨0 @ LVT + lookahead(ch_{k,i})⟩, OB[i]) od;
S4.12     for all OB[i] do send out contents(OB[i]) od;
        od while;
Figure 16: Conservative LP Simulation Algorithm Sketch.

The CI maintains individual output buffers OB[i] for every outgoing channel ch_{k,l} ∈ CH to subsequent LPs LP_l. The basic algorithm is sketched in Figure 16. Given now that within the horizon LVTH neither internal nor external events are available to process, LP_k^CMB blocks processing and idles until new messages are received that potentially widen the time horizon. Two key problems appear with this policy of "blocking-until-safe-to-process", namely deadlock and memory overflow, as explained with Figure 17: each LP is waiting for a message to arrive, however, from an LP that is blocked itself (deadlock). Moreover, the cyclic waiting of the LPs involved in deadlock leaves events unprocessed in their input buffers, the amount of which can grow unpredictably even in the absence of deadlock (memory overflow). Several methods have been proposed to overcome the vulnerability of the CMB protocol to deadlock, falling into the two principal categories of deadlock avoidance and deadlock detection/recovery.
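The following Python fragment is a compact, illustrative rendering of the CI^CMB/SE interplay of Figure 16; the class layout, the process_event() hook and the message encoding are assumptions of this sketch and not part of any published implementation.

import heapq
import itertools
from collections import deque

class ConservativeLP:
    """One CMB logical process; messages are (timestamp, event) pairs, and a None
    event encodes a nullmessage."""

    def __init__(self, in_channels, out_channels, lookahead):
        self.lvt = 0.0
        self.evl = []                                   # heap of (time, seq, event)
        self.seq = itertools.count()                    # tie breaker for the heap
        self.ib = {c: deque() for c in in_channels}     # FIFO input buffer per channel
        self.ob = {c: [] for c in out_channels}         # output buffer per channel
        self.lookahead = lookahead                      # dict: channel -> lookahead

    def schedule(self, time, event):                    # used by process_event()
        heapq.heappush(self.evl, (time, next(self.seq), event))

    def emit(self, channel, event):                     # used by process_event()
        self.ob[channel].append((self.lvt, event))

    def step(self, process_event):
        """One iteration of loop S4; returns 'blocked' or the new LVT."""
        if any(len(q) == 0 for q in self.ib.values()):  # S4.1: all clocks must be known
            return "blocked"
        cc = {c: q[0][0] for c, q in self.ib.items()}   # S4.2: channel clocks
        lvth = min(cc.values())                         # S4.3: safe time horizon

        if self.evl and self.evl[0][0] <= lvth:         # S4.5: first internal event is safe
            t, _, ev = heapq.heappop(self.evl)
        else:                                           # S4.6: earliest external event
            t, ev = self.ib[min(cc, key=cc.get)].popleft()

        self.lvt = t                                    # S4.7
        if ev is not None:                              # nullmessages only advance LVT
            process_event(self, ev)                     # may call schedule() and emit()

        for ch, buf in self.ob.items():                 # S4.11: nullmessage promises
            if not buf:
                buf.append((self.lvt + self.lookahead[ch], None))
        return self.lvt                                 # S4.12: the caller flushes self.ob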
Figure 17: Deadlock and Memory Overflow
4.4.1 Deadlock Avoidance
Deadlock as in Figure 17 can be prevented by modifying the communication protocol based on the sending of nullmessages [52] of the form ⟨0@t⟩, where 0 denotes a nullevent (an event without effect). A nullmessage is not related to the simulated model and serves synchronization purposes only. Essentially it is sent on every output channel as a promise not to send any other message with a smaller timestamp in the future. It is launched whenever an LP processed an event that did not generate an event message for some corresponding target LP. The receiver LP can use this implicit information to extend its LVTH and thereby become unblocked. In our example (Figure 17), after the LP in the middle would have broadcast ⟨0@19⟩ to the neighboring LPs, both of them would have the chance to progress their LVT up until time 19, and in turn issue new event messages expanding the LVTHs of other LPs, etc. The nullmessage based protocol can be guaranteed to be deadlock free as long as there are no closed cycles of channels for which a message traversing the cycle cannot increment its timestamp. This implies that simulation models whose event structure cannot be decomposed into regions such that for every directed channel cycle there is at least one LP putting a nonzero time increment on traversing messages cannot be simulated using CMB with nullmessages. Although the protocol extension is straightforward to implement, it can put a dramatic nullmessage overhead burden on the performance of the LP simulation. Optimizations of the protocol to reduce the frequency and amount of nullmessages, e.g. sending them only on demand (upon request), delayed until some timeout, or only when an LP becomes blocked, have been proposed [52]. An approach where additional information (essentially the routing path as observed during traversal) is attached to the nullmessage is the carrier nullmessage protocol [14]. One problem that still remains with conservative LPs is the determination of when it is safe to process an event. The degree to which LPs can look ahead and predict future events plays a critical role in this safety verification and, as a consequence, for the performance of conservative LP simulations. In the example in Figure 17, if the LP with LVT 19 could know that processing the next event will certainly increment LVT to 22, then nullmessages ⟨0@22⟩ (a so called lookahead of 3) could have been broadcast as a further improvement on the LVTH of the receivers. Lookahead must come directly from the underlying simulation model and enhances the prediction of future events, which is, as seen, necessary to determine when it is safe to process an event. The ability to exploit lookahead from FCFS queueing network simulations was originally demonstrated by Nicol [56], the basic idea being that the simulation of a job arriving at a FCFS queue will certainly increment LVT by the service time, which can already be determined upon arrival, e.g. by random variate presampling, since the number of queued jobs is known and preemption is not possible.
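As a small illustration of the nullmessage idea, the fragment below (names are ours) computes the timestamps an LP can promise on its output channels when it becomes blocked, optionally improved by a model derived lookahead such as the guaranteed LVT increment of the next event or the presampled FCFS service time mentioned above.

def nullmessage_timestamps(lvt, out_channels, event_lookahead=0.0):
    """lvt: current local virtual time of the blocked LP.
    out_channels: names of the outgoing channels.
    event_lookahead: model specific guarantee that any future outgoing message will
    carry a timestamp at least this far ahead of LVT (0.0 if nothing is known).
    Returns the nullmessage <0@t> to emit per channel."""
    promise = lvt + event_lookahead
    return {ch: ("null", promise) for ch in out_channels}

# For the middle LP of Figure 17: LVT = 19, model lookahead 3 -> promise <0@22>.
print(nullmessage_timestamps(19, ["left", "right"], event_lookahead=3))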
Figure 18: LP Simulation of PN1 (Region 1 / Simulation Engine 1: place P1 and transition T1 with τ_1 = exp(λ); Region 2 / Simulation Engine 2: place P2 and transition T2 with τ_2 = exp(λ), λ = 2; future lists: T1: 0.37, 0.17, 0.22, 0.34, 0.93, 0.65; T2: 0.51, 0.39, 0.42, 0.05, 0.88, 0.17)
Example: Conservative LP Simulation of a PN with Model Parallelism  Assume PN1 to represent a physical system consisting of three machines, each either being in operation or being maintained. The PN model comprises balanced firing delays (τ(T1) ~ exp(0.5), τ(T2) ~ exp(0.5)), i.e. the time spent operating is approximately the same as the time spent being maintained. Related to those firing delays and the number of machines represented by circulating tokens, a certain amount of model parallelism can be exploited when partitioning the net into two LPs (Figure 18), such that the individual PN regions of LP_1 and LP_2 are R_1 = ({T1}, {P1}, {(P1, T1)}, τ(T1) ~ exp(λ = 0.5)) and R_2 = ({T2}, {P2}, {(P2, T2)}, τ(T2) ~ exp(λ = 0.5)).
Step   VT     S       EVL                            T
0      0.00   (2,1)   T1@0.17; T1@0.37; T2@0.51      |
1      0.17   (1,2)   T1@0.37; T2@0.51; T2@0.56      T1
2      0.37   (0,3)   T2@0.51; T2@0.56; T2@0.79      T1
3      0.51   (1,2)   T2@0.56; T1@0.73; T2@0.79      T2
4      0.56   (2,1)   T1@0.73; T2@0.79; T1@0.90      T2
5      0.73   (1,2)   T2@0.78; T2@0.79; T1@0.90      T1

Table 1: Sequential DES of a PN with Model Parallelism
Figure 19: Model Parallelism Observed in the PN Execution

Let the future list [56], a sequence of exponentially distributed random firing times (random variates), for T1 and T2 be as in the table of Figure 18. The sequential simulation would then sequence the variates according to their resulting scheduling in virtual time when simulating the timed behavior of the PN, as in Table 1. This sequencing stems from the policy of always using the next free variate from the future list to schedule the occurrence of the next event in EVL. To explain model parallelism, observe that the firing of a scheduled transition (internal event) always generates an external event, namely a message carrying a token as the event description (tokenmessage) and a timestamp equal to the local virtual time LVT of the sending LP. On the other hand, the receipt of an event message (external event) always causes a new internal event in the receiving LP, namely the scheduling of a new transition firing in the local EVL. By just looking at the PN model and the variates sampled in the future list (Figure 18), we observe that the first occurrence of T1 and the first occurrence of T2 could be simulated in a time overlapped way. This is explained as follows (Figure 19): Both T1 and T2 have infinite server firing semantics, i.e. whenever a token arrives in P1 or P2, T1 (or T2) is enabled with a scheduled firing at LVT plus the transition's next future variate. There are constantly M = 3 tokens in the PN model, therefore the maximum degree of enabling is M for both T1 and T2. Considering now the initial state S = (2, 1) (two tokens in P1 and one in P2), one occurrence of T1 is scheduled for time t_T1 = 0.17, and another one for t'_T1 = 0.37. One occurrence of T2 is scheduled for time t_T2 = 0.51. The next variate for T1 is 0.22, the one for T2 is 0.39. A token can be expected in P1 at min(0.51, 0.39, 0.42) = 0.39 at the earliest, leading to a new (the third) scheduling of T1 at 0.39 + 0.22 = 0.61 at the earliest, maybe later. Consequently the first occurrence of T1 must be at t(T1^1) = 0.17, and the second occurrence of T1 must be at t(T1^2) = 0.37. The first occurrence of T2 can be either the one scheduled at 0.51, or the one invoked by the first occurrence of T1 at 0.17 + 0.39 = 0.56, or the one invoked by the second occurrence of T1 at 0.37 + 0.42 = 0.79. Clearly, the first occurrence of T2 must be at t(T2^1) = 0.51, and the second occurrence of T2 must be at t(T2^2) = 0.17 + 0.39 = 0.56, etc. Since T1^1 → T2^2 with t(T2^1) < t(T2^2), and T2^1 → T1^3 with t(T1^1) < t(T1^3), T1^1 and T2^1 do not interfere with each other and can therefore be simulated independently (T1^i → T2^j denotes the direct scheduling causality of the i-th occurrence of T1 onto the j-th occurrence of T2). As was seen, PN1 provides inherent model parallelism. In order to exploit this model parallelism in a CMB simulation, the PN model is decomposed into the two regions R_1 and R_2, which are assigned to two LPs LP_1 and LP_2, such that GLP = ({LP_1, LP_2}, {ch_{1,2}, ch_{2,1}}), where the channels ch_{1,2} and ch_{2,1} substitute the PN arcs (T1, P2) and (T2, P1), respectively.
Table 2: Parallel Conservative LP Simulation of a PN with Model Parallelism (step-by-step trace, steps 0-8, of the event list EVL, output buffer OB, blocking indicator B, input buffer IB, LVT and local marking for LP_1 and LP_2)
Both ch_{1,2} and ch_{2,1} carry messages containing tokens that were generated by the firing of a transition in a remote LP. Consequently, ch_{1,2} propagates a message of the form m = ⟨1; P2; t⟩ from LP_1 to LP_2 on the occurrence of a firing of T1, in order to deposit 1 (first component of m) token into place P2 (second component of m) at time t (third component). The timestamp t is produced as a copy of the LVT of LP_1 at the instant of that firing of T1 that produced the token. A CMB parallel execution of the LP simulation model developed above, since operating in a synchronous way in two phases (first simulate one event locally, then transmit messages), generates the trace in Table 2. In step 0, both LPs use precomputed random variates from their individual future lists and schedule events. In step 1, no event processing can happen due to LVTH = 0.0; the LPs are blocked (see the indication in the B column). Generally, in such a situation every LP_i computes the lookahead la(ch_{i,j}) imposed on its individual output channels j. In the example we have la(ch_{i,j}) = min( min_{k=1..S_i}(st_k) - LVT_i , min_{k=1..(M-S_i)} fl_k ), where st_k is the scheduled occurrence time of the k-th entry in EVL, fl_k is the k-th free variate in the future list, and M is the maximum enabling degree (tokens in the PN model). For example, the lookahead in LP_1 in the state of step 1 imposed on the channel to LP_2 is la(ch_{1,2}) = 0.17, whereas la(ch_{2,1}) = 0.39. la is now attached to the LP's LVT, giving the timestamps for the nullmessage ⟨0; P2; 0.17⟩ sent from LP_1 to LP_2, and ⟨0; P1; 0.39⟩ sent from LP_2 to LP_1. The latter, when arriving at LP_1, unblocks SE_1, such that the first event out of EVL_1 can be processed, generating the event message ⟨1; P2; 0.17⟩. This message, however, as received by LP_2, still cannot unblock LP_2 since it carries the same timestamp as the previous nullmessage; also the local lookahead cannot be improved and ⟨0; P1; 0.51⟩ is resent. It takes another iteration to finally unblock LP_2, which can then process its first event in step 5, etc. It is easily seen from the example that the CMB protocol (for this particular example) forces a 'logical' barrier synchronization whenever the sequential DES (see the trace in Table 1) switches from processing a T1 related event to a T2 related one and vice versa (at VT 0.17, 0.51, 0.73, etc.). In the diagram in Figure 19, these are the points where the arrow denoting a token move from T1 (T2) to T2 (T1) has the opposite direction to the previous one.
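The lookahead expression used in the example can be evaluated directly from the local EVL and the future list. The sketch below is our illustrative formulation of it; it reproduces the values la(ch_{1,2}) = 0.17 and la(ch_{2,1}) = 0.39 quoted above for step 1.

def pn_lookahead(lvt, scheduled_times, free_variates, max_enabling_degree):
    """lvt: the LP's local virtual time.
    scheduled_times: occurrence times st_k of the S_i entries currently in EVL.
    free_variates: remaining presampled variates of the region's transition.
    max_enabling_degree: M, the number of tokens circulating in the model."""
    s_i = len(scheduled_times)
    bound_from_evl = min(scheduled_times) - lvt if scheduled_times else float("inf")
    free = free_variates[: max_enabling_degree - s_i]       # variates not yet bound
    bound_from_future = min(free) if free else float("inf")
    return min(bound_from_evl, bound_from_future)

# Step 1 of the trace (LVT = 0.0 in both LPs):
la_ch12 = pn_lookahead(0.0, [0.17, 0.37], [0.22, 0.34, 0.93, 0.65], 3)   # -> 0.17
la_ch21 = pn_lookahead(0.0, [0.51], [0.39, 0.42, 0.05, 0.88, 0.17], 3)   # -> 0.39
print(la_ch12, la_ch21)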
4.4.2 Deadlock Detection/Recovery
An alternative to the Chandy-Misra-Bryant protocol that avoids nullmessages has also been proposed by Chandy and Misra [16], allowing deadlocks to occur but providing a mechanism to detect them and recover from them. Their algorithm runs in two phases: (i) a parallel phase, in which the simulation runs until it deadlocks, and (ii) a phase interface, which initiates a computation allowing some LP to advance its LVT. They prove that in every parallel phase at least one event will be processed, generating at least one event message, which will also be propagated before the next deadlock. A central controller is assumed in their algorithm, thus violating a distributed computing principle. To avoid a single resource (the controller) becoming a communication performance bottleneck during deadlock detection, any general distributed termination detection algorithm [50] or distributed deadlock detection algorithm [17] could be used instead. In an algorithm described by Misra [52], a special message called the marker circulates through GLP to detect and correct deadlock. A cyclic path traversing all ch_{i,j} ∈ CH is precomputed and LPs are initially
colored white. An LP that receives the marker takes the color white and is supposed to route it along the cycle in finite time. Once an LP has either received or sent an event message since passing the marker on, it turns red. The marker identifies deadlock if the last N LPs visited were all white. Deadlock is properly detected as long as for any ch_{i,j} ∈ CH all messages sent over ch_{i,j} arrive at LP_j in the time order in which they were sent by LP_i. If the marker also carries the next event times of the visited white LPs, it knows upon detection of deadlock the smallest next event time as well as the LP in which this event is supposed to occur. To recover from deadlock, this LP is invoked to process its first event. Obviously, message lengths in this algorithm grow proportionally to the number of nodes in GLP. Bain and Scott [10] propose an algorithm for demand driven, deadlock free synchronization in conservative LP simulation that avoids message lengths growing with the size of GLP. If an LP wants to process an event with timestamp t, but is prohibited from doing so because CC[j] < t for some j, then it sends time requests containing the sender's process id and the requested time t to all predecessor LPs with this property. (The predecessors, however, may have already advanced their LVT in the meantime.) Predecessors are supposed to inform the sender LP when they can guarantee that they will not emit an event message at a time lower than the requested time t. Three types of replies are used to avoid repeated polling in the presence of cycles: a yes indicates that the predecessor has reached the requested time, a no indicates that it has not (in which case another request must be made), and a ryes ("reflected yes") indicates that it has conditionally reached t. Ryes replies, together with a request queue maintained in every LP, essentially have the purpose of detecting cycles and minimizing the number of subsequent requests sent to predecessors. If the process id and time of a received request match any request already in the request queue, a cycle is detected and ryes is replied. Otherwise, if the LP's LVT equals or exceeds the requested time, a yes is replied, whereas if the LP's LVT is less than the requested time, the request is enqueued in the request queue and request copies are recursively sent to the receiver's predecessors with CC[i] < t, etc. The request is complete when all channels have responded and the request has reached the head of the request queue. At this time the request is removed from the request queue and a reply is sent to the requesting LP. The reply to the successor from which the request was received is no (ryes) if any request to a predecessor was answered with no (ryes), otherwise yes is sent. If a no was received by an LP initiating a request, the LP has to restart the time request with lower channel clocks. The time-of-next-event algorithm as proposed by Groselj and Tropper [38] assumes more than one LP mapped onto a single physical processor and computes the greatest lower bound of the timestamps of the event messages expected to arrive next at all empty links of the LPs located at that processor. It thus helps to unblock LPs within one processor, but does not prevent deadlocks across processors. The lower bound algorithm is an instance of the single source shortest path problem.
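A minimal sketch of the marker mechanism described above (the coloring rule and the recovery bookkeeping carried by the marker) could look as follows; the class interface and the step granularity are assumptions of this illustration, not Misra's original formulation.

class MarkerLP:
    """Colored LP for the marker based deadlock detection described above."""

    def __init__(self, name, next_event_time):
        self.name = name
        self.white = True                        # red = event message activity seen
        self.next_event_time = next_event_time

    def on_event_message(self):
        self.white = False                       # sent or received an event message

    def on_marker(self, marker):
        marker.record(self)                      # report current color and next event time
        self.white = True                        # then turn white and pass the marker on

class Marker:
    def __init__(self, num_lps):
        self.num_lps = num_lps
        self.white_streak = 0                    # consecutive white LPs visited
        self.min_time = float("inf")             # smallest next event time on the streak
        self.min_lp = None

    def record(self, lp):
        if lp.white:
            self.white_streak += 1
            if lp.next_event_time < self.min_time:
                self.min_time, self.min_lp = lp.next_event_time, lp.name
        else:
            self.white_streak = 0                # streak and recovery candidate start over
            self.min_time, self.min_lp = float("inf"), None

    def deadlock_detected(self):
        # recovery: invoke min_lp to process its event scheduled at min_time
        return self.white_streak >= self.num_lps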
4.4.3 Conservative Time Windows
Conservative LP simulations as presented above are distributed in nature, since LPs can operate in a totally asynchronous way. One way to make these algorithms more synchronous, in order to gain from the availability of fast synchronization hardware in multiprocessors, is to introduce a window W_i in simulated time for each LP_i, such that events within this time window are safe (events in W_i are independent of events in W_j, i ≠ j) and can be processed concurrently across all LP_i [48], [57]. A conservative time window (CTW) parallel LP simulation synchronously operates in two phases. In phase (i) (window identification), for every LP_i a chronological set of events W_i is identified such that every event e ∈ W_i is causally independent of any e' ∈ W_j, j ≠ i. Phase (i) is concluded by a barrier synchronization over all LPs. In phase (ii) (event processing), every LP_i processes the events e ∈ W_i sequentially in chronological order. Again, phase (ii) is concluded by a barrier synchronization. Since the algorithm iteratively lock-steps over the two consecutive phases, the hope of gaining speedup over a purely sequential DES heavily depends on the efficiency of the synchronization operation on the target architecture, but also on the event structure in the simulation model. Different windows will generally cover event sets of different cardinality, and some windows may remain empty after the identification phase of a cycle; in this case the corresponding LPs idle for that cycle. A considerable overhead can be imposed on the algorithm by the identification of when it is safe to process an event within LP_i (window identification phase). Lubachevsky [48] proposes to reduce the complexity of this operation by restricting the lag of the LP simulation, i.e. the difference in occurrence time of events being processed concurrently is bounded from above by a known finite constant (bounded lag protocol). With this restriction, and assuming a "reasonable" amount of dispersion of events in space and time, the execution of the algorithm on N processors in parallel will process one event in O(log N) time on average.
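In a shared memory setting the two-phase CTW cycle can be phrased as a barrier synchronized loop, as in the following sketch; the LP interface (identify_safe_window(), process()) and the fixed number of cycles are assumptions made for illustration.

from threading import Barrier, Thread

def ctw_worker(lp, barrier, num_cycles):
    for _ in range(num_cycles):
        window = lp.identify_safe_window()      # phase (i): locally safe (ts, event) pairs
        barrier.wait()                          # all windows fixed before processing starts
        for ts, event in sorted(window):        # phase (ii): process in chronological order
            lp.process(ts, event)
        barrier.wait()                          # complete the cycle before re-identifying

def run_ctw(lps, num_cycles):
    barrier = Barrier(len(lps))
    threads = [Thread(target=ctw_worker, args=(lp, barrier, num_cycles)) for lp in lps]
    for t in threads:
        t.start()
    for t in threads:
        t.join()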
4.5 Time Warp Communication Interfaces
Optimistic LP simulation strategies, in contrast to conservative ones, do not strictly adhere to the local causality constraint lcc (see Section 2.2.2), but allow the occurrence of causality errors and provide a mechanism to recover from lcc violations. In order to avoid blocking and safe-to-process determination, which are serious performance pitfalls in the conservative approach, an optimistic LP progresses simulation (and thereby advances LVT) as far into the simulated future as possible, without warranty that the set of generated (internal and external) events is consistent with lcc, and regardless of the possibility of the arrival of an external event with a timestamp in the local past. Pioneering work in optimistic LP simulation was done by Jefferson and Sowizral [43, 44] with the definition of the Time Warp (TW) mechanism, which like the Chandy-Misra-Bryant protocol uses the sending of messages for synchronization. Time Warp employs a rollback (in time) mechanism to take care of proper synchronization with respect to lcc. If an external event arrives with a timestamp in the local past, i.e. out of chronological order (straggler message), then the Time Warp scheme rolls back to the most recently saved state in the simulation history consistent with the timestamp of the arriving external event, and restarts simulation from that state on as a matter of lcc violation correction. Rollback, however, requires a record of the LP's history with respect to the simulation of internal and external events. Hence, an LP^TW has to keep sufficient internal state information, say a state stack SS, which allows a past state to be restored. Furthermore, it has to administrate an input queue IQ and an output queue OQ for storing messages received and sent. For reasons to be seen, this logging of the LP's communication history must be done in chronological order. Since the arrival of event messages in increasing timestamp order cannot be guaranteed, two different kinds of messages are necessary to implement the synchronization protocol: first the usual external event messages (m+ = ⟨ee@t, +⟩), where again ee is the external event and t is a copy of the sender's LVT at the sending instant, which we will subsequently call positive messages. Opposed to those are messages of type (m- = ⟨ee@t, -⟩), called negative messages or antimessages, which are transmitted among LPs as a request to annihilate a prematurely sent positive message containing ee for which it has meanwhile turned out that it was computed based on a causally erroneous state.
Figure 20: Architecture of an Optimistic Logical Process

The basic architecture of an optimistic LP employing the Time Warp rollback mechanism is outlined in Figure 20. External events are brought to some LP_k by the communication system in much the same way as
in the conservative protocol. Messages, however, are not required to arrive in the sending order (FIFO) in the optimistic protocol, which weakens the hardware requirements for executing Time Warp. Moreover, the separation of arrival streams is also not necessary, and so there is only a single IB and a single OB (assuming that the routing path can be deduced from the message itself). The communication related history of LP_k is kept in IQ and OQ, whereas the state related history is maintained in the SS data structure. All those together represent CI_k; SE_k is an event driven simulation engine equivalent to the one in LP^CMB. The triggering of SE by the CI is sketched in the basic algorithm for LP^TW in Figure 21. The LP mainly loops (S3) over four parts: (i) an input synchronization with other LPs (S3.1), (ii) local event processing (S3.2 - S3.8), (iii) the propagation of external effects (S3.9) and (iv) the (global) confirmation of locally simulated events (S3.10 - S3.11). Parts (ii) and (iii) are almost the same as seen with LP^CMB. The input synchronization (rollback and annihilation) and confirmation (GVT) parts, however, are the key mechanisms in optimistic LP simulation.
4.5.1 Rollback and Annihilation Mechanisms
The input synchronization of LP^TW (the rollback mechanism) relates arriving messages to the current value of the LP's LVT and reacts accordingly (see Figure 22). A message affecting the LP's "local future" is moved from the IB to the IQ respecting the timestamp order, and the encoded external event will be processed as soon as LVT advances to that time. ⟨ee3@7, +⟩ in the IB in Figure 20 is an example of such an unproblematic message (LVT = 6). A message timestamped in the "local past", however, is an indicator of a causality violation due to tentative event processing. The rollback mechanism (S3.1.1) in this case restores the most recent lcc-consistent state by reconstructing S and EVL in the simulation engine from copies attached to the SS in the communication interface. Also LVT is warped back to the timestamp of the straggler message. This so far has compensated the local effects of the lcc violation; the external effects are annihilated by sending an antimessage for every previously sent output message (in the example ⟨ee6@6, -⟩ and ⟨ee1@7, -⟩ are generated and sent out, while at the same time ⟨ee6@6, +⟩ and ⟨ee1@7, +⟩ are removed from OQ). Finally, if a negative message is received (e.g. ⟨ee2@6, -⟩), it is used to annihilate the dual positive message (⟨ee2@6, +⟩) in the local IQ. Two cases for negative messages must be distinguished: (i) If the dual positive message is present in the receiver's IQ, then this entry is deleted as an annihilation. This can be easily done if the positive message has not yet been processed, but requires rollback if it was. (ii) If a dual positive message is not present (this case can only arise if the communication system does not deliver messages in a FIFO fashion), then the negative message is inserted in IQ (irrespective of its relation to LVT) to be annihilated later by the (delayed) positive message still in traffic. As is seen now, the rollback mechanism requires a periodic saving of the states of SE (LVT, S and EVL) in order to be able to restore a past state (S3.7), and a logging of output messages in OQ to be able to undo propagated external events (S3.8). Since antimessages can also cause rollback, there is the chance of rollback chains, even recursive rollback if the cascade unrolls sufficiently deep on a directed cycle of GLP. The protocol, however, guarantees, although consuming considerable memory and communication resources, that any rollback chain eventually terminates, whatever its length or recursive depth.
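The input-side dispatch just described (and summarized in Figure 22 below) can be condensed into a few lines; the following Python sketch assumes an LP object exposing lvt, iq, rollback() and insert_chronological(), which are illustrative names rather than an existing interface.

def handle_arriving_message(lp, msg):
    """msg = (event, timestamp, sign); sign is +1 for a positive message and -1 for an
    antimessage. lp.iq is assumed to be a list of such triples."""
    event, ts, sign = msg
    dual = (event, ts, -sign)
    if ts < lp.lvt:                            # timestamp in the local past
        # straggler positive message, or antimessage whose positive dual was processed
        if (sign > 0 and dual not in lp.iq) or (sign < 0 and dual in lp.iq):
            lp.rollback(ts)                    # restore state, send own antimessages
    if dual in lp.iq:                          # irrespective of the relation to LVT
        lp.iq.remove(dual)                     # annihilate the message pair
    else:
        lp.insert_chronological(msg)           # keep the message for (re)processing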
Lazy Cancellation  In the original Time Warp protocol as described above, an LP receiving a straggler message initiates the sending of antimessages immediately when executing the rollback procedure. This behavior is called aggressive cancellation. As a performance improvement over aggressive cancellation, the lazy cancellation policy does not send an antimessage m- for m+ immediately upon receipt of a straggler. Instead, it delays its propagation until the resimulation after rollback has progressed to LVT = ts(m+), producing m'+ ≠ m+. If the resimulation produces m'+ = m+, no antimessage has to be sent at all [36]. Lazy cancellation thus avoids the unnecessary cancelling of correct messages, but has the liability of additional memory and bookkeeping overhead (potential antimessages must be maintained in a rollback queue) and of delaying the annihilation of actually wrong simulations. Lazy cancellation can also be based on the utilization of lookahead available in the simulation model. If a straggler m+ with ts(m+) < LVT is received, then obviously antimessages do not have to be sent for messages m with timestamp ts(m+) ≤ ts(m) < ts(m+) + la. Moreover, if ts(m+) + la ≥ LVT, even rollback does not need to be invoked. As opposed to lookahead computation in the CMB protocol, lazy cancellation can exploit implicit lookahead, i.e. it does not require its explicit computation. The traces in Table 3 represent the behavior of the optimistic protocol with lazy cancellation message annihilation in a parallel LP simulation of PN1. (The trace table is to be read in the same way as the one in Table 2, except that there is a rollback indicator column RB instead of a blocking column B.) In step 2, for
program LP^TW(R_k)
S1      GVT = 0; LVT = 0; EVL = {}; S = initialstate();
S2      for all ie_i caused by S do chronological insert(⟨ie_i @ occurrence time(ie_i)⟩, EVL) od;
S3      while GVT <= endtime do
S3.1      for all m ∈ IB do
S3.1.1      if ts(m) < LVT   /* m potentially affects the local past */
            then if (positive(m) and dual(m) ∉ IQ) or (negative(m) and dual(m) ∈ IQ)
                 then /* rollback */
                      restore earliest state before(ts(m));
                      generate and sendout(antimessages);
                      LVT = earliest state timestamp before(m);
                 endif;
            endif;
S3.1.2      /* irrespective of how m is related to LVT */
            if dual(m) ∈ IQ then remove(dual(m), IQ);   /* annihilate */
            else chronological insert(⟨external event(m) @ ts(m), sign(m)⟩, IQ);
            endif;
          od;
S3.2      if ts(first(EVL)) <= ts(first nonnegative(IQ))
          then e = remove first(EVL);            /* select first internal event */
          else e = first nonnegative(IQ);        /* select first external event */
          endif;
          /* now process the selected event */
S3.3      LVT = ts(e);
S3.4      S = modified by occurrence of(e);
S3.5      for all ie_i caused by S do chronological insert(⟨ie_i @ occurrence time(ie_i)⟩, EVL) od;
S3.6      for all ie_i preempted by S do remove(ie_i, EVL) od;
S3.7      log new state(⟨LVT, S, copy of(EVL)⟩, SS);
S3.8      for all ee_i caused by S do
               deposit(⟨ee_i @ LVT, +⟩, OB);
               chronological insert(⟨ee_i @ LVT, +⟩, OQ);
          od;
S3.9      send out contents(OB);
S3.10     GVT = advance GVT();
S3.11     fossil collection(GVT);
        od while;
Figure 21: Optimistic LP Simulation Algorithm Sketch.
Table 3: Parallel Optimistic LP Simulation of a PN with Model Parallelism (step-by-step trace, steps 0-5, of the input buffer IB, LVT, local marking, event list EVL, output buffer OB and rollback indicator RB for LP_1 and LP_2)
ts(m) >= LVT (arriving message in the local future):
  m+ arriving, dual m- exists in IQ:                       annihilate dual m-
  m+ arriving, dual m- does NOT exist:                     chronological insert (m+, IQ)
  m- arriving, dual m+ exists in IQ (not yet processed):   annihilate dual m+
  m- arriving, dual m+ does NOT exist:                     chronological insert (m-, IQ)
ts(m) < LVT (arriving message in the local past):
  m+ arriving, dual m- exists in IQ:                       annihilate dual m-
  m+ arriving, dual m- does NOT exist:                     rollback + chronological insert (m+, IQ)
  m- arriving, dual m+ exists in IQ (already processed):   rollback + annihilate dual m+
  m- arriving, dual m+ does NOT exist:                     chronological insert (m-, IQ)

Figure 22: The Time Warp Message based Synchronization Mechanism
example, LP_2 receives the straggler ⟨1; P2; 0.17⟩ at LVT = 0.51. Message annihilation and rollback can be avoided due to the exploitation of the lookahead from the next random variate in the future list, 0.39: the effect of the straggler lies in the future of LP_2 (at 0.17 + 0.39 = 0.56). It has been shown [42] that Time Warp with lazy cancellation can produce so called "supercritical speedup", i.e. surpass the simulation's critical path, by the chance of having wrong computations produce correct results. By immediately discarding rolled back computations this chance is lost for the aggressive cancellation policy. A performance comparison of the two, however, depends on the simulation model. Analysis by Reiher and Fujimoto [65] shows that lazy cancellation can arbitrarily outperform aggressive cancellation and vice versa, i.e. one can construct extreme cases for lazy and aggressive cancellation such that if one policy executes the simulation in a given time using N processors, the other can take up to N times as long. Nevertheless, empirical evidence is reported to be "slightly" in favor of lazy cancellation for certain simulation applications.
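A hedged sketch of the lazy cancellation decision: antimessages for rolled back output are withheld while the resimulation runs, and are released only for messages that the resimulation did not reproduce by the time it passed their timestamps. The function and data layout are illustrative assumptions, not a published interface.

def lazily_cancel(pending, regenerated, resim_lvt):
    """pending: rolled-back output messages (event, ts) awaiting a cancel decision.
    regenerated: messages produced so far by the resimulation after rollback.
    resim_lvt: how far the resimulation has progressed.
    Returns (antimessages to send now, messages whose decision is still deferred)."""
    to_cancel, still_pending = [], []
    for msg in pending:
        _, ts = msg
        if msg in regenerated:            # reproduced: never send an antimessage
            continue
        if ts <= resim_lvt:               # passed ts without reproducing the message
            to_cancel.append((msg, "-"))
        else:
            still_pending.append(msg)     # decision deferred until resimulation reaches ts
    return to_cancel, still_pending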
Lazy Reevaluation  Much like lazy cancellation delays the annihilation of external effects upon receiving a straggler at LVT, lazy reevaluation delays the discarding of entries on the state stack SS. Should the recomputation after rollback to a time t < LVT reach a state that exactly matches one logged in SS, and should the IQ be the same as at that state, then the LP can immediately jump forward to LVT, the time before the rollback occurred. Thus, lazy reevaluation prevents the unnecessary recomputation of correct states and is therefore promising in simulation models where events do not modify states ("read-only" events). A serious liability of this optimization is again the additional memory and bookkeeping overhead, but also (and mainly) the considerable complication of the Time Warp code [35]. To verify the equivalence of IQs, the protocol must draw and log copies of the IQ in every state saving step (S3.7). In a weaker lazy reevaluation strategy one could allow jumping forward only if no message has arrived since the rollback.
Lazy Rollback  The difference in virtual time between the straggler m, i.e. ts(m), and its actual effect at time ts(m) + la(ee) ≤ LVT can again be jumped over, saving the computation time for the resimulation of the events in [ts(m), ts(m) + la(ee)). Here la(ee) is the lookahead imposed by the external event ee carried by m.
Breaking/Preventing Rollback Chains  Besides postponing erroneous message and state annihilation until it turns out that they are not reproduced in the repeated simulation, other techniques have been studied to break cascades of rollbacks as early as possible. Prakash and Subramanian [62], comparably to the carrier nullmessage approach, attach a limited amount of state information to messages to prevent recursive rollbacks in cyclic GLPs. This information allows LPs to filter out messages based on preempted (obsolete) states that will eventually be annihilated by chasing antimessages currently in transit. Related to the (conservative) bounded lag algorithm, Lubachevsky, Shwartz and Weiss have developed a filtered rollback protocol [49] that allows the lag bound to be crossed optimistically, but only up to a time window's upper edge. Causality violations can then only affect the time period between the window edge and the lag bound, thus limiting the (relative) length of rollback chains. The SRADS protocol by Dickens and Reynolds [26], although allowing optimistic simulation progression, prohibits the propagation of uncommitted events to other LPs. Therefore, rollback can only be local to some LP and cascades of rollback can never occur. Madisetti, Walrand and Messerschmitt, with their protocol called Wolf calls, freeze the spatial spreading of uncommitted events within so called spheres of influence, defined as the set of LPs that can be influenced by a message from LP_i within a bounded interval after ts(m), respecting computation and communication times. The Wolf algorithm ensures that the effects of an uncommitted event generated by LP_i are limited to a sphere of a computable (or selectable) radius around LP_i, and the number of broadcasts necessary for a complete annihilation within the sphere is bounded by a computable (or choosable) number of steps B (B being provably smaller than for the standard Time Warp protocol).
4.5.2 Optimistic Time Windows
A similar idea of "limiting the optimism" to overcome rollback overhead potentials is to advance computations by "windows" moving over simulated time. In the original work of Sokol, Briscoe and Wieland [70], the moving time window (MTW) protocol, neither internal nor external events e with ts(e) > t + Δ are allowed to be simulated in the time window [t, t + Δ); they are postponed to the next time window [t + Δ, t + 2Δ). Two events e and e' timestamped ts(e) and ts(e') can therefore only be simulated in parallel if |ts(e) - ts(e')| < Δ. Naturally, the protocol favors simulation models with a low variation of event occurrence distances relative to the window size. Compared to a time-stepped simulation, MTW does not await the completion of all events e with t ≤ ts(e) < t + Δ, which would cause idle processors at the end of each time window, but invokes an attempt to move the window as soon as the number of events left to be executed falls below a certain threshold. In order to keep the time window moving, LPs are polled for the timestamps of their earliest next events t_i(e) (polling takes place simultaneously with event processing) and the window is advanced to [min_i t_i(e), min_i t_i(e) + Δ). (The next section will show the equivalence of the window lower edge determination to GVT computation.) Obviously the advantage of MTW and related protocols is the potentially effective implementation as a parallel LP simulation, either on a SIMD architecture or in a MIMD environment where the reduction operation min_i t_i(e) can be computed utilizing synchronization hardware. Points of criticism are the assumption of an approximately uniform distribution of event occurrence times in space and the ignorance with respect to potentially "good" optimism beyond the upper window edge. Furthermore, a natural difficulty is the determination of a window size Δ admitting enough events to make the simulation efficient. The latter is addressed by the adaptive Time Warp concurrency control algorithm (ATW) proposed by Ball and Hoyt [11], which allows the window size Δ(t) to be adapted at any point t in simulation time. ATW aims to temporarily suspend event processing if it has observed a certain amount of lcc violations in the past. In this case the LP concludes that it progresses LVT too fast compared to its predecessor LPs and therefore stops LVT advancement for a time period called the blocking window (BW). BW is determined based on the minimum of a function describing wasted computation in terms of time spent in a (conservatively) blocked mode, or in a fault recovery mode as induced by the Time Warp rollback mechanism.
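The window advance of MTW reduces to a global minimum over the per-LP next event times; the sketch below shows this with a plain min(), standing in for whatever reduction hardware or collective operation the platform provides. Names and the window representation are assumptions of this illustration.

def advance_window(next_event_times, delta):
    """next_event_times: earliest unprocessed event time reported by each LP.
    delta: the window size. Returns the new window [lower, upper)."""
    lower = min(next_event_times)
    return lower, lower + delta

def in_window(ts, window):
    lower, upper = window
    return lower <= ts < upper            # events beyond the upper edge are postponed

# Example: three LPs report 4.2, 3.8 and 5.0; with delta = 1.0 the window is [3.8, 4.8).
print(advance_window([4.2, 3.8, 5.0], 1.0))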
4.5.3 The Limited Memory Dilemma
All arguments on the execution of the Time Warp protocol so far have assumed the availability of a sufficient amount of free memory to record the internal and external effect history for pending rollbacks, and all arguments were related to time complexity. Indeed, Time Warp with certain memory management strategies to be described in the sequel can be proven to work correctly when executed with O(M^seq) memory, where M^seq is the number of memory locations utilized by the corresponding sequential DES. Opposed to that, the CMB protocol may require O(k M^seq) space, but may also use less storage than the sequential simulation, depending on the simulation model (it can even be proven that simulation models exist such that the space complexity of CMB is O((M^seq)^k)). Time Warp always consumes more memory than sequential simulation [47], and a memory limitation imposes a performance decrease on Time Warp: providing just the minimum of memory necessary may cause the protocol to execute fairly slowly, so that the memory/performance tradeoff becomes an issue.
Figure 23: Interleaved State Saving

Memory management in Time Warp follows two goals: (i) to make the protocol operable on real multiprocessors with bounded memory, and (ii) to make the execution of Time Warp performance efficient by providing "sufficient" memory. An infrequent or incremental saving of history information can in some cases prevent aggressive memory consumption, maybe more effectively than one of the techniques presented for limiting the optimism in Time Warp. Once, despite the application of those techniques, the available memory is exhausted, fossil collection can be applied as a technique to recover memory used for history recording that will definitely not be used anymore, due to an assured lower bound on the timestamp of any possible future rollback (GVT). Finally, if even fossil collection fails to recover enough memory to proceed with the protocol, additional memory can be freed by returning messages from the IQ (message sendback, cancelback) or by invoking an artificial rollback reducing the space used for storing the OQ.
4.5.4 Incremental and Interleaved State Saving
Minimizing the storage space required for simulation models with complex sets of state variables S_i ⊆ S (S_i being the subset stored and maintained by LP_i) that do not extensively change their values as LVT progresses can be accomplished effectively by saving only the variables s_j ∈ S_i affected by a state change. This is mainly an implementation optimization of step S3.4 in the algorithm in Figure 21. This incremental state saving can also improve the execution complexity of step S3.7, since generally less data has to be copied into the log record. Obviously the same strategy could be followed for the EVL, or for the IQ in a lazy reevaluation protocol. Alternatively, imposing a condition upon step S3.7:

S3.7  if (step count modulo χ) == 0 then log new state(⟨LVT, S, copy of(EVL)⟩, SS);

could be used to interleave the saved states and thus, on average, reduce the storage requirement to 1/χ of the noninterleaved case. Both optimizations, however, increase the execution complexity of rollback. In incremental state saving protocols, desired states have to be reconstructed from increments, following a path further back into the simulated past than required by the rollback itself. The same is true for interleaved state saving, where the most recent saved state older than the straggler must be searched for, a reexecution up until the timestamp of the straggler (coast forward) must be started, which is a clear waste of CPU cycles since it just reproduces states that have already been computed but were not saved, and finally the straggler integration and the usual reexecution are necessary (Figure 23). The tradeoff between state saving costs and the coast forward overhead has been studied (as reported by [59] in reference to Lin and Lazowska), based on the expected event processing time E[exec(e)] and the state saving cost, giving bounds on the optimal interleaving factor χ
in terms of the ratio of the state saving cost to the event processing cost and the average number of rollbacks observed with χ = 1. The result expresses that an overestimation of χ is more severe to performance than an underestimation by the same (absolute) amount. In a study of the optimal checkpointing interval [46], explicitly considering state saving and restoration costs while assuming that χ affects neither the number of rollbacks nor the number of rolled back events, an algorithm is developed that, integrated into the protocol, "on-the-fly" and within a few iterations automatically adjusts χ toward its optimum. It has been shown that some χ > 1, though increasing the rollback overhead, can reduce the overall execution time.
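Interleaved state saving and the coast forward search for a restart state can be sketched as follows; chi corresponds to the interleaving factor χ of the text, while the rest of the interface is an assumption of this illustration.

class InterleavedStateLog:
    def __init__(self, chi):
        self.chi = chi
        self.step_count = 0
        self.log = []                                  # list of (lvt, saved state)

    def after_event(self, lvt, state_snapshot):
        self.step_count += 1
        if self.step_count % self.chi == 0:            # the condition added to step S3.7
            self.log.append((lvt, state_snapshot))

    def restore_for(self, straggler_ts):
        """Return the most recent saved state older than the straggler; the events between
        that state and the straggler must then be re-executed (coast forward)."""
        older = [(t, s) for (t, s) in self.log if t < straggler_ts]
        if not older:
            return None                                # fall back to the initial state
        return max(older, key=lambda entry: entry[0])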
4.5.5 Fossil Collection
Opposed to techniques that reclaim memory temporarily used for storing events and messages related to the future of some LP, fossil collection aims to return the space used by history records that will no longer be used by the rollback synchronization mechanism. To assure from which state in the history (and further back) computations can be considered fully committed, the value of global virtual time (GVT) has to be determined. Consider the tuple Σ_i(T) = (LVT_i(T), IQ_i(T), SS_i(T), OQ_i(T)) to be a local snapshot of LP_i at real time T, i.e. LVT_i, IQ_i, etc. as seen by an external observer at real time T, and let Σ(T) = ∪_{i=1..N} Σ_i(T) ∪ CS(T) be the global snapshot of GLP. Further let LVT_i(T) be the local virtual time in LP_i, i.e. the timestamp of the event being processed at the observation instant T, and UM_{i,j}(T) the set of external events imposed by LP_i upon LP_j, encoded as messages m in the snapshot Σ(T). This means m is either in transit on channel ch_{i,j} in CS or stored in some IQ_j, but not yet processed at time T. Then the GVT at real time T is defined to be:

GVT(T) = min( min_i LVT_i(T) , min_{i,j; m ∈ UM_{i,j}(T)} ts(m) ).

It should be clear even by intuition that at any (real time) T, GVT(T) represents the maximum lower bound to which any rollback could ever backdate LVT_i (for all i). An obvious consequence is that any processed event e with ts(e) < GVT(T) can never (at no instant) be rolled back, and can therefore be considered (irrevocably) committed [47] (Figure 23). Further consequences (for all LP_i) are that: (i) messages m ∈ IQ_i with ts(m) < GVT(T), as well as messages m ∈ OQ_i with ts(m) < GVT(T), are obsolete and can be discarded (from IQ, OQ) after real time T; (ii) state variables s ∈ S_i stored in SS_i with ts(s) < GVT(T) are obsolete and can be discarded after real time T. Making use of these possibilities, i.e. getting rid of the external event history according to (i) and of the internal event history according to (ii) that is no longer needed, in order to reclaim memory space, is the idea behind fossil collection. It is called as a procedure in the abstracted Time Warp algorithm (Figure 21) in step S3.11. The idea of reclaiming memory for history earlier than GVT is also expressed in Figure 20, which shows IQ and OQ sections only for entries with timestamps later than GVT, and copies of EVL in SS only if not older than GVT (also the rest of SS beyond GVT could be purged as irrelevant for Time Warp, but we assume here that the state trace is required for a post-simulation analysis). Generally, a combination of fossil collection with any of the incremental/interleaved state saving schemes is recommended. Related to interleaving, however, rollback might be induced to events beyond the momentarily committed GVT, with an average overhead directly proportional to χ. Not only does the interleaving of state recording prohibit fossil collection for states timestamped in the gap between GVT and the most recent saved state chronologically before GVT, it is also counterproductive to GVT computation, which is comparably more expensive than state saving, as will be seen soon.
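The GVT definition and the fossil collection rule above translate into a few lines; the snapshot layout is an assumption of this sketch, and, as a common practical refinement, the newest state at or before GVT is kept as the restart basis rather than discarded.

def gvt(local_virtual_times, unprocessed_message_timestamps):
    """GVT(T): minimum over all LVT_i(T) and over the timestamps of all messages sent
    but not yet processed (in transit or unprocessed in some IQ)."""
    return min(list(local_virtual_times) + list(unprocessed_message_timestamps))

def fossil_collect(lp, gvt_value):
    """Discard history no rollback can reach; lp.iq/lp.oq are assumed to hold
    (event, ts, sign) triples and lp.ss to hold (lvt, state, evl) tuples."""
    lp.iq = [m for m in lp.iq if m[1] >= gvt_value]
    lp.oq = [m for m in lp.oq if m[1] >= gvt_value]
    older = [s for s in lp.ss if s[0] <= gvt_value]
    basis = [max(older, key=lambda s: s[0])] if older else []   # restart basis at/below GVT
    lp.ss = basis + [s for s in lp.ss if s[0] > gvt_value]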
4.5.6 Freeing Memory by Returning Messages
The previous strategies (interleaved and incremental state saving as well as fossil collection) can merely reduce the chance of memory exhaustion, but cannot actually prevent such situations from occurring. In cases where memory is already completely allocated, only additional techniques, mostly based on returning messages to their senders or on artificially initiating rollback, can help to escape from deadlocks due to waiting for free memory:
Message Sendback  The first approach to recover from memory overflow in Time Warp was the message sendback mechanism proposed by Jefferson [44]. Here, whenever the system runs out of memory on the occasion of an arriving message, part or all of the space used for saving the input history is used to recover free memory by returning unprocessed input messages (not necessarily including the one just received) back to their senders and reallocating the freed (local) memory. By intuition, input messages with the highest send timestamps are returned first, since it is more likely that they carry incorrect information compared to "older" input messages, and since the annihilation of their effects can be expected not to disperse as much in virtual time, thus restricting annihilation influence spheres. Related to the original definition of the Time Warp protocol, which distinguishes the send time (ST) and receive time (RT) of messages (ST(m) ≤ RT(m)), only messages with ST(m) > LVT (local future messages) are considered for returning. An indirect effect of the sendback could also be storage release in remote LPs due to the annihilation of messages triggered by the original sender's rollback procedure.

Gafni's Protocol  In a message traffic study of aggressive and lazy cancellation, Gafni [37] notes that past
(RT(m) < GVT) and present messages (ST(m) < GVT < RT(m)) and events accumulate in IQ, OQ and SS for the two annihilation mechanisms at the same rate, pointing out also the interweaving of messages and events in memory consumption. Past messages and events can be fossil collected as soon as a new value of GVT is available. The amount of "present" messages and events in LP_i reflects the difference between LVT_i and the global GVT, directly expressing the asynchrony or "imbalance" of LVT progression. This fact gives an intuitive explanation of the message sendback's attempt to balance LVT progression across LPs, i.e. to intentionally roll back those LPs that have progressed LVT ahead of others. Gafni, considering this asynchrony to be exactly the source from which Time Warp can gain real execution speedup, states that LVT progression balancing does not solve the storage overflow problem. His algorithm reclaims memory by relocating space used for saving the input, state or output history in the following way: No matter whether the overflow condition is raised by an arriving input message, a request to log a new state, or the creation of a new output message, the element (message or event) with the largest timestamp is selected, irrespective of its type.
If it is an output message, a corresponding antimessage is sent, the element is removed from OQ and the state before sending the original message is restored. The antimessage arriving at the receiver will find its annihilation partner in the receiver's IQ upon arrival (at least in FIFO CSs), so memory is also reclaimed in the receiver LP. If it is an input message, it is removed from IQ and returned to the original sender to be annihilated with its dual in the OQ, perhaps invoking rollback there; again, memory is also relocated in the remote LP. If it is a state in SS, it is discarded (and will be recomputed in case of a local rollback). The desirable property of both message sendback and Gafni's protocol is that LPs that ran out of memory can be relieved without shifting the overflow condition to another LP. So, given a certain minimum but limited amount of memory, both protocols make Time Warp "operable".
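The selection rule behind Gafni's protocol can be sketched as follows; the queue layout and the helper callbacks are hypothetical placeholders for the mechanisms described above, not a definitive implementation:

    # Sketch of Gafni-style memory reclamation; data structures are illustrative.

    def reclaim_one_element(lp, send_antimessage, return_to_sender, rollback_to):
        """On overflow, pick the input/output/state record with the largest
        timestamp and undo it, freeing its buffer locally (and remotely)."""
        candidates = ([("in", m) for m in lp["iq"]] +
                      [("out", m) for m in lp["oq"]] +
                      [("state", s) for s in lp["ss"]])
        kind, elem = max(candidates, key=lambda ke: ke[1]["ts"])
        if kind == "out":
            lp["oq"].remove(elem)
            send_antimessage(elem)       # also frees memory in the receiver LP
            rollback_to(lp, elem["ts"])  # restore the state before the original send
        elif kind == "in":
            lp["iq"].remove(elem)
            return_to_sender(elem)       # annihilated with its dual in the sender's OQ
        else:
            lp["ss"].remove(elem)        # recomputed if a later rollback needs it
        return elem                      # the freed buffer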
Cancelback An LP simulation memory management scheme is considered to be storage optimal iff it consumes O(M_seq) memory, i.e. memory bounded by a constant factor of the memory requirement of the corresponding sequential simulation [47]. The worst case space complexity of Gafni's protocol is O(N · M_seq) = O(N²) (irrespective of whether memory is shared or distributed), the reason being that it can only cancel elements within the individual LPs. Cancelback is the first optimal memory management protocol [41], and was developed targeting Time Warp implementations on shared memory systems. As opposed to Gafni's protocol, in Cancelback elements can be canceled in any LP_i (not necessarily in the one that observed memory overflow), whereas the element selection scheme is the same. Cancelback thus allows to selectively reclaim those memory spaces that are used for the very most recent (globally seen) input-, state- or output-history records, whichever LP maintains this data. An obvious implementation of Cancelback is therefore for shared memory environments, making use of system level interrupts. A Markov chain model of Cancelback [4], predicting speedup as the amount of available memory beyond M_seq is varied, revealed that even with small fractions of additional memory the protocol performs about as well as with unlimited memory. The model assumes a totally symmetric workload and a constant number of messages, but is verified with empirical observations.
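Cancelback applies the same selection rule, but across all LPs that share the memory pool. A shared-memory flavored sketch (again with illustrative data structures and callbacks, not the published protocol):

    def cancelback(lps, send_antimessage, return_to_sender, rollback_to):
        """Pick the uncommitted element with the globally largest timestamp,
        regardless of which LP owns it, and cancel it (cf. Gafni's rule)."""
        kind, owner, elem = max(
            ((k, lp, e)
             for lp in lps
             for k, queue in (("in", lp["iq"]), ("out", lp["oq"]), ("state", lp["ss"]))
             for e in queue),
            key=lambda t: t[2]["ts"])
        if kind == "out":
            owner["oq"].remove(elem)
            send_antimessage(elem)
            rollback_to(owner, elem["ts"])
        elif kind == "in":
            owner["iq"].remove(elem)
            return_to_sender(elem)
        else:
            owner["ss"].remove(elem)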
Artificial Rollback Although Cancelback theoretically solves the memory management dilemma of Time Warp, since it produces correct simulations in real, limited memory environments with the same order of storage requirement as the sequential DES, it has been criticized for its implementation not being straightforward, especially in distributed memory environments. Lin [47] describes a Time Warp memory management scheme that is in turn memory optimal (there exists a shared memory implementation of Time Warp with space complexity O(M_seq); see Footnote 1), but has a simpler implementation. Lin's protocol is called artificial rollback for provoking the rollback procedure not only for the purpose of lcc-violation restoration, but also for its side effect of reclaiming memory (since rollback as such does not affect the operational correctness of Time Warp, it can also be invoked artificially, i.e. even in the absence of a straggler). Equivalent to Cancelback in effect (cancelling an element generated by LP_j from IQ_i is equivalent to a rollback in LP_j, whereas cancelling an element from OQ_i or SS_i is equivalent to a rollback in LP_i), artificial rollback has a simpler implementation, since the rollback procedure already available can be used together with an artificial rollback trigger, causing only very little overhead. Determining, however, in which LP_i to invoke artificial rollback, to what LVT to roll back and at what instant of real time T to trigger it is not trivial (except the triggering, which can be related to the overflow condition and the failure of fossil collection). In the implementation proposed by Lin and Preiss [47], the two other issues are coupled to a processor scheduling policy in order to guarantee a certain amount of free memory (called the salvage parameter in [59]), while following the "cancel-furthest-ahead" principle.
Footnote 1: Cancelback and artificial rollback achieve the sequential DES storage complexity bound only by relying on the availability of a global, shared pool of (free) memory; for implementations in distributed memory environments, Time Warp with artificial rollback cannot guarantee a space complexity of O(M_seq).
4.6 Adaptive Simulation Protocols
A general argument on the relative performance of CMB and TW protocol implementations cannot be given due to an overwhelming degree of interdependencies among performance influencing factors [28], mainly coming from the event structure of the simulation model (inherent model parallelism), the partitioning of the simulation model into submodels to be executed by LPs (locality of causal effects, balance of virtual time progression, balance of load across LPs, etc.), the performance characteristics of the target hardware (processor speed, memory size, cache hierarchies, memory access speed, communication latency, etc.), the communication and synchronization model on the target machine (routing strategy, nonblocking communication modes, hardware support for multicast and reduction operations, buffering, etc.), and, last but not least, implementation optimizations. (A "performance comparable implementation" is proposed in [21].) Experience, however, with implementations of parallel SEs for distributed memory environments, which necessarily need to employ synchronization schemes based on the exchange of timestamped messages, often reveals communication overhead to be the dominating performance factor. The reduction of the overhead induced by the sending of messages is therefore of outstanding practical interest. Protocols have therefore been studied that regulate the degree of "optimism" underlying TW according to hypotheses that LPs establish during the simulation about the amount of available model parallelism. The "aggressiveness" of TW can be adaptively throttled to that point in the spectrum between unlimited optimism and extreme pessimism that is the best compromise for a particular simulation model. This has led to so-called adaptive protocols (see Figure 24). In [63] the options for combining CMB and TW to attain the capabilities of both are classified as (i) relaxing conservatism in CMB protocols, (ii) limiting the optimism in TW protocols, and (iii) seamlessly switching between optimistic and conservative schemes. As already mentioned, the Adaptive Time Warp (ATW) protocol [11] allows the execution of events e with timestamp ts(e) only if ts(e) lies within a window [t, t + W), where the window size W is adapted dynamically according to the number of local causality constraint violations in the past; a sketch of this window-based throttling idea is given below. In the Global Synchronization protocol [58], LPs exchange send/receive counts based on which it is decided whether to advance the simulation to the next global synchronization step, or to await and process forthcoming messages. In [31] a protocol is proposed that temporarily blocks an LP if a potential straggler message (one that would induce rollback) is anticipated. The Local Adaptive Protocol (LAP) [39] in much the same way uses statistical information from observing the simulation performance on the fly to dynamically adjust optimism control parameters. An explicit cost function to determine whether it is more "cost effective" to execute an arriving message instantaneously despite the rollback hazard, or to delay its execution to avoid a potential rollback/antimessage, is reported in [34]. All these protocols use statistical data that reflects a central tendency (e.g. the average increase of local virtual time and the average (real and simulated) message interarrival time in LAP; the average commitment rate in AMM) to determine parameters for optimism control in TW.
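As an illustration of window-based optimism control in the spirit of ATW, the following sketch (the multiplicative update rule and all names are assumptions of this sketch, not the published algorithm) releases only events inside [LVT, LVT + W) and adapts W to observed rollbacks:

    # Hedged sketch of window-based optimism throttling (ATW-like idea).

    class OptimismWindow:
        def __init__(self, width=10.0, shrink=0.5, grow=1.1, min_width=0.1):
            self.width = width        # current window size W
            self.shrink = shrink      # multiplicative decrease on a rollback
            self.grow = grow          # multiplicative increase on rollback-free progress
            self.min_width = min_width

        def may_execute(self, lvt, event_ts):
            """Release only events with timestamp inside [lvt, lvt + W)."""
            return lvt <= event_ts < lvt + self.width

        def record_rollback(self):
            self.width = max(self.min_width, self.width * self.shrink)

        def record_committed_step(self):
            self.width *= self.grow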
[Figure 24: A Spectrum of Options in between TW and CMB. The figure arranges protocols between the Chandy/Misra/Bryant and Time Warp extremes along three integration strategies (adding optimism, limiting optimism, seamlessly switching or clustering optimistic/conservative parts); among the protocols shown are SRADS with local rollback (Dickens/Reynolds 90), Composite ELSA (Arvind/Smart 92), Wolf Calls (Madisetti et al. 88), Breathing Time Buckets (Steinman 91/92), Adaptive Time Warp (Ball/Hoyt 90), Moving Time Window (Sokol et al. 88), Speculative Simulation (Mehl 91), Conservative Optimistic (Steinman 92), Bounded Time Warp (Turner/Xu 92), Filtered Rollback (Lubachevsky et al. 89), Local Adaptive Protocol (Hamnes/Tripathi 94), Throttling (Reiher/Jefferson 89), Unified (McAffer 90), Probabilistic Protocol (Ferscha/Chiola 94), Local Time Warp (Rajaei et al. 93), Adaptive Memory Management (Das/Fujimoto 94), Cost Expectation (Ferscha/Luethi 95), and Probabilistic Adaptive Direct (Ferscha 95).]
[Figure 25: "Flow of buffers" in the AMM protocol. Free memory moves to the uncommitted pool whenever a message is created and sent, an event is scheduled, or a state is logged; GVT progression moves processed buffers to the committed pool; fossil collection and message annihilation (by rollback or Cancelback) return buffers to the free pool.]
4.6.1 Adaptive Memory Management
The adaptive memory management (AMM) scheme proposed by Das and Fujimoto [24] attempts a combination of controlling optimism in Time Warp and an automatic adjustment of the amount of memory in order to optimize fossil collection, Cancelback and rollback overheads. Analytical performance models of Time Warp with Cancelback [4] for homogeneous (artificial) workloads have shown that above a certain amount of available free memory, fossil collection alone is sufficient to allocate enough memory. With a decreasing amount of available memory, absolute execution performance decreases due to more frequent Cancelback invocations, until it becomes frozen at some point. Strong empirical evidence has been given in support of these analytical observations. The motivation for an adaptive mechanism to control memory is twofold: (i) absolute performance is expected to have negative increments when memory is reduced even further; indeed, one would like to run Time Warp in the area of the "knee-point" of absolute performance, so a successive adaptation towards that performance optimal point is desired. (ii) The location of the knee might vary during the course of the simulation due to the underlying simulation model; a runtime adaptation to follow movements of the knee is desired. The AMM protocol for automatic adjustment of available storage uses a memory flow model that divides
the available (limited) memory space M into three "pools", M = M^c + M^uc + M^f. M^c is the set of all memory locations used to store committed elements (t(e) ≤ GVT), M^uc is its analogy for uncommitted elements (in IQ, OQ, or SS with t(e) > GVT), and M^f holds temporarily unused (free) memory. The behavior of Time Warp can now be described in terms of flows of (fixed sized) memory buffers (each able to record one message or event, for simplicity) from one pool into the other (Figure 25): Free memory must be allocated for every message created/sent, every state logged or any future event scheduled, causing buffer moves from M^f to M^uc. Fossil collection, on the other hand, returns buffers from M^c to M^f when invoked upon exhaustion of M^f, whereas M^c is being supplied by the progression of GVT. Buffers move from M^uc to M^f with each message annihilation, either incurred by rollback or by Cancelback. A Cancelback cycle is defined by two consecutive invocations of Cancelback. A cycle starts where Cancelback was called due to the failure of fossil collection to reclaim memory; at this point there are no buffers in M^c. Progression of LVT will move buffers to M^uc, rollback of LVT will occasionally return free memory, progression of GVT will deposit into M^c to be depleted again by fossil collection, but tendentially the free pool will be drained, thus necessitating a new Cancelback. Time Warp can now be controlled by two (mutually dependent) parameters: (i) λ, the amount of processed but uncommitted buffers left behind after Cancelback, as a parameter to control optimism; and (ii) β, the amount of buffers freed by Cancelback, as a parameter to control the cycle length. Obviously, λ has to be chosen small enough to avoid rollback thrashing and overly aggressive memory consumption, but not too small in order to prevent rollbacks of states that are most likely to be confirmed (events on the critical path). β should be chosen in such a way as to minimize the overhead caused by unnecessarily frequent Cancelback (and fossil collection) calls. The AMM protocol, by monitoring the Time Warp execution behavior during one cycle, now attempts to simultaneously minimize the values of λ and β while respecting the constraints above. It assumes the Cancelback (and fossil collection) overhead to be directly proportional to the Cancelback invocation frequency. Let ∂M^uc = e_committed / T_process − ∂FC be the rate of growth of M^uc, where e_committed is the fraction of processed events also committed during the last cycle, T_process is the average (real) time to process an event, and ∂FC is the rate of depletion of M^uc due to fossil collection. (Estimates for the right hand side are generated from monitoring the simulation execution.) β is then approximated as β = (T_cycle − T_CB,FC) · ∂M^uc, where T_CB,FC is the overhead incurred by Cancelback and fossil collection in real time units, and T_cycle is the current invocation interval. Indeed, β is a parameter to control the upper tolerable bound for the progression of LVT. To set λ appropriately, AMM records by a marking mechanism whether an event was rolled back by Cancelback. A global (across all LPs) counting mechanism lets AMM determine the number #(e_cp) of events that should not have been rolled back by Cancelback, since they were located on the critical path, and thereby cause a definitive performance degradation (Footnote 2). Starting with a high parameter value for λ (which will give an observation #(e_cp) ≈ 0), λ is continuously reduced as long as #(e_cp) remains negligible.
Footnote 2: The critical path of a DES is computed in terms of the (real) processing time on a certain target architecture respecting lcc. Traditionally, critical path analysis has been used to study the performance of distributed DES as a reference to an "ideal", fastest possible asynchronous distributed execution of the simulation model. Indeed, it has been shown that the length of the critical path is a lower bound on the execution time of any conservative protocol, but some optimistic protocols do exist (Time Warp with lazy cancellation, Time Warp with lazy rollback, Time Warp with phase decomposition, and the Chandy-Sherman Space-Time Method [42]) which can surpass the critical path. The possibility of so-called supercritical speedup, and as a consequence its non-suitability as an absolute lower bound reference, however, has made critical path analysis less attractive.
Rollback thrashing is explicitly tackled by a third mechanism that monitors e_committed and halves both λ and β when the decrease of e_committed hits a predefined threshold. Experiments with the AMM protocol have shown that both of the claimed needs can be met: Limiting optimism in Time Warp indirectly, by controlling the rate of drain of free memory, can be accomplished effectively by a dynamically adaptive mechanism. AMM adapts this rate towards the performance knee-point automatically, and adjusts it to follow dynamic movements of that point due to workloads varying (linearly) over time.
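A minimal sketch of the per-cycle parameter updates described above, assuming the monitored quantities (committed fraction, event processing time, fossil collection rate, overhead and cycle times) are available; the β formula follows the reconstruction given in the text, and everything else is illustrative rather than the published protocol:

    # Hedged sketch of AMM-style per-cycle parameter adaptation.

    def estimate_beta(t_cycle, t_overhead, e_committed, t_process, fc_rate):
        """Buffers to free per Cancelback call: useful cycle time times the
        estimated growth rate of the uncommitted pool (see text)."""
        growth_rate = e_committed / t_process - fc_rate
        return max(0.0, (t_cycle - t_overhead) * growth_rate)

    def adjust_lambda(lmbda, critical_path_rollbacks, step=0.9):
        """Reduce lambda (uncommitted buffers left after Cancelback) as long
        as no critical-path events get rolled back by Cancelback."""
        return lmbda * step if critical_path_rollbacks == 0 else lmbda

    def thrash_guard(lmbda, beta, committed_rate_drop, threshold=0.5):
        """Halve both parameters when the commitment rate collapses."""
        if committed_rate_drop > threshold:
            return lmbda / 2.0, beta / 2.0
        return lmbda, beta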
4.6.2 Probabilistic Optimism
A communication interface CI that considers the CMB protocol and Time Warp as two extremes in a spectrum of possibilities to synchronize the LPs' LVT advancements is the probabilistic distributed DES protocol [32]. Let the occurrence of an event e, t(e) = LVT_i, in some LP_i be causal (→) for a future event e' with t(e') = LVT_i + Δ(t) in LP_j with probability P[e → e'], i.e. event e with some probability changes state variables in S_j with influence on e'. Then the CMB "block until safe-to-process" rule appears adequate for P[e → e'] = 1, and is overly pessimistic for cases P[e → e'] < 1, since awaiting the decision whether (e → e') or not hinders the (probably correct) concurrent execution of events e and e' in different LPs. Clearly, if P[e → e'] ≪ 1, an optimistic strategy could have executed e' concurrently to e most of the time. More specifically (see Figure 26), assume the timestamp of the next (token-)message m_{i+1} to arrive at LP_j lies in the time interval [a, b] (b can be arbitrarily large), i.e. a ≤ ts(m_{i+1}) ≤ b. Then LP_j could have safely simulated all the scheduled transition firings ⟨t_k@ot(t_k)⟩ with ot(t_k) < ts(m_{i+1}); CI_CMB, however, will block at some LVT_j = a (Figure 31). As an example, let the probability of the actual timestamp of the forthcoming message be P[ts(m_{i+1}) = x] = 1/(b − a) for all x ∈ [a, b]; then CMB, by blocking, fails to use a (1 − ε) chance to produce useful simulation work by progressing over a and executing scheduled transition firings in the time interval [a, a + ε(b − a)]. This is a clear overpessimism, at least for small values of ε. On the other hand, CI_TW would allow executing even scheduled firings with ot(t_k) > b, which is a clear overoptimism, because every firing with ot(t_k) > b definitely will have to be rolled back, incurring annihilation messages that could have been avoided. These arguments mainly motivate the use of an optimistic CI for simulation models with nondeterministic event causalities. On the other hand, as already seen, an optimistic LP might tend towards overoptimism, i.e. optimism lacking rational justification. The probabilistic protocol aims to exploit information on the dynamic simulation behavior in order to allow, in every simulation step, just that amount of optimism that can be justified.
[Figure 26: Motivation for the Probabilistic DDES. Internal and external events of an LP on the simulated time axis, with the next external event expected in the interval [s, t], and the CDF of its timestamp used to quantify the chance of useful optimistic work.]
[Table 4: Motivation for Probabilistic LP Simulation. Step-by-step trace (steps 0 to 4) of the input buffer IB, LVT, state S, event list EVL and output buffer OB of LP1 and LP2 under the probabilistic scheme.]
To give an intuition on the justification of optimism, look again at the parallel simulation of the small simulation model in Figure 6 together with the future list, and the parallel execution of lazy cancellation
Time Warp in Figure 3. At the beginning of step 3, LP2 at LVT = 0.56 (= LVT at the end of step 2) faces the straggler ⟨1, P2, 0.37⟩ in its IQ; the next element in LP2's future list is 0.42. Since the effect of the straggler is in the local future of LP2, i.e. ⟨T2@(0.37+0.42)⟩, the lazy rollback strategy applies and rollback is avoided altogether. The event ⟨T2@0.79⟩ is executed in that step, setting LVT = 0.79, and the output message ⟨1, P1, 0.79⟩ is generated (OQ) and sent at the end of the step. Unfortunately, in step 4 a new straggler ⟨1, P2, 0.73⟩ is observed in the IQ of LP2, but now with the effect that at time t = 0.73 + 0.05 < LVT = 0.79, LP2 is forced to roll back (Figure 3). Indeed, LP2 in step 3 generated and sent out ⟨1, P1, 0.79⟩ without considering any information on whether the implicit optimism is justified or not. If LP2 had observed that it received "on the average" one input message per step, with an "average" timestamp increment of 0.185, it might have established a hypothesis that in step 4 a message is expected to arrive with an estimated timestamp of 0.37 + 0.185 = 0.555 (= timestamp of previous message + average increment). Taking this as an alarm for potential rollback, LP2 could have avoided the propagation of the local optimistic simulation progression by e.g. delaying the sendout of ⟨1, P1, 0.79⟩ for one step. This is illustrated in Table 4: LP2 just takes the input message from IQ and schedules the event ⟨T2@0.79⟩ in EVL, but does not process it. Instead, the execution is delayed until the hypothesis on the next message's timestamp is verified. The next message is ⟨1, P2, 0.73⟩, the hypothesis can be dropped, and a new event ⟨T2@0.78⟩ is scheduled and processed next. Altogether, two rollbacks and the corresponding sending of antimessages could be avoided by applying the probabilistic simulation scheme.
To be able to directly control the optimism in TW, each LP in the PADOC (Probabilistic Adaptive Direct Optimism Control) [29] approach monitors the LVT progression on each of its incident channels, i.e. logs the timestamps of messages as they arrive. From the observed message arrival patterns, each LP formulates a hypothesis on the timestamp of the next message expected to arrive, and, related to the statistical confidence in the forecast value, adapts by means of throttling to a synchronization behavior that is presumably the best tradeoff between blocking (CMB) and optimistically progressing (TW) with respect to this hypothesis in the current situation. Throttling is done probabilistically in the sense that blocking is induced with a certain probability.

    program PADOC_Simulation_Engine()
    1    initialize();
    2    while GVT < endtime do
    2.1    for all arriving messages m do
               update(arrivalstatistics, m);
               if ts(m) < LVT                /* m affects local past */
               then /* rollback */
                   restore_earliest_state_before(ts(m));
                   generate_and_sendout(antimessages);
               else chronological_insert(m, IQ);
               endif;
           od;
    2.2    t̂s = forecast(arrivalstatistics);
    2.3    α(t̂s) = confidence_in_forecast(arrivalstatistics);
    2.4    if ts(first(EVL)) ≤ ts(first_nonnegative(IQ))
           then
               if (1 − P[exec first(EVL)]) ≥ rand()
               then /* delay execution */ delay(s);
               else process(first(EVL));
               endif;
           else process(first_nonnegative(IQ));
           endif;
    2.5    sendout(outputmessages);
    2.6    fossil_collection(advance_GVT());
         od while;
Figure 27: PADOC Simulation Engine.
Assume that the history over the last n message arrivals, A_k = (ts(m_{i−n+1}), ts(m_{i−n+2}), ..., ts(m_i)), is maintained in LP_l for every (input) channel ch_{k,l}, and that t̂s(m_{i+1}) is an estimate for the timestamp of the forthcoming message m_{i+1}. Let the confidence 0 ≤ α(t̂s(m_{i+1})) ≤ 1 express the "trust" in this estimate. Then LP_l, having scheduled t_r as the transition to fire next, say at ot(t_r), would execute the occurrence of t_r with some probability P[execute ⟨t_r@ot(t_r)⟩], but would block for the average amount of CPU time s (used to simulate one transition firing) with probability 1 − P.
[Figure 28: Message Timestamp Forecast. From the increments δ_1, ..., δ_n observed between successive token-message arrivals, the timestamp of the next arrival is estimated as t̂s(m_{i+1}) = ts(m_i) + Δ̂(δ_1, ..., δ_n), shown relative to the executed and scheduled transition firings on the simulated time axis.]
The algorithm sketch of the PADOC LP simulation engine in Figure 27 explains further details. Note that in contrast to other adaptive TW mechanisms that compute an optimal delay window for blocking the simulation engine [11, 39, 34], the PADOC engine blocks for a fixed amount of real time (i.e. s), but loops over the blocking decision, incrementally establishing longer blocking periods. By this discretization of the "blocking window" PADOC preserves the possibility to use information on the arrival process encoded in the timestamps of messages that arrive in between blocking phases; algorithms based on variable size blocking windows fail to make use of such intermediate message arrivals.
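The discretized blocking described above can be sketched as a loop over fixed quanta s; exec_probability and the message handling helpers stand in for the corresponding PADOC steps of Figure 27 and are assumptions of this sketch:

    import random

    def throttled_step(lp, s, exec_probability, process_next_event, handle_arrivals, sleep):
        """Loop over the blocking decision in quanta of s CPU time units.
        Each pass first consumes newly arrived messages, which may change
        the forecast and hence exec_probability."""
        while True:
            handle_arrivals(lp)              # may update arrival statistics / forecast
            p = exec_probability(lp)         # P[execute next scheduled firing]
            if random.random() < p:
                process_next_event(lp)       # optimistic progress
                return
            sleep(s)                         # block for one quantum, then re-decide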
Incremental Forecast Methods Predicting the timestamp of the forthcoming message m_{i+1} after having observed n arrivals is explained in Figure 28. Basically, by statistically analyzing the arrival instants ts(m_{i−n+1}), ts(m_{i−n+2}), ..., ts(m_i), an estimate t̂s(m_{i+1}) = ts(m_i) + Δ̂(δ_1, δ_2, ..., δ_n) is generated, where δ_k = ts(m_{i−n+k}) − ts(m_{i−n+k−1}) is the difference in timestamps of two consecutive messages. (Note that δ_k is negative if m_{i−n+k} is a straggler.) The choice of the size n of the observation history as well as the selection of the forecast procedure is critical for the performance of the PADOC engine for two reasons: (i) the achievable prediction accuracy and (ii) the computational and space complexity of the forecast method. Generally, the larger n, the more information on the arrival history is available in the statistical sense. Considering much of the arrival history will at least theoretically give a higher prediction precision, but will also consume more memory space. Intuitively, complex forecast methods could give "better" predictions than trivial ones, but are liable to intrude on the distributed simulation protocol with an unacceptable amount of computational resource consumption. Therefore, incremental forecast methods of low memory complexity are recommended, i.e. procedures where t̂s(m_{i+2}) can be computed from the previous forecast t̂s(m_{i+1}) and the actual observation ts(m_{i+1}) in O(c) instead of O(cn) time.
Arithmetic Mean If no observation window is imposed on the arrival history, but all observed δ_j's are considered, then the observed mean Δ̂_n = (1/n) Σ_{j=1..n} δ_j as an estimate of the increment to the timestamp of the forthcoming message has a recursive form. Upon the availability of the next time difference δ_{n+1}, the estimate can be updated incrementally as Δ̂_{n+1} = (n·Δ̂_n + δ_{n+1}) / (n + 1).
Exponential Smoothing The arithmetic mean based forecast considers all observations δ_j as equally "important". A possibility to express the history as an exponentially weighted sum (e.g. to give recent history higher "importance" than past history) is the exponential smoothing of the observation vector by a smoothing factor α (|1 − α| < 1). Δ̂ in this case has the incremental form Δ̂_{n+1} = α·δ_{n+1} + (1 − α)·Δ̂_n. α → 1 gives a high weight to the last observation, and potentially yields a high variation in the forecasts; α → 0 causes intense smoothing, making forecasts less reactive to shocks in the arrival process. We use the smoothing factor α = argmin_{α ∈ {0.1, 0.2, ..., 0.9}} Σ_{i=1..n−1} [δ_{i+1} − Δ̂_{i+1}(α)]², which is periodically readjusted during the simulation.
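Both estimators can be carried in O(1) space per input channel. The following sketch (class layout is illustrative) updates the arithmetic-mean and the exponentially smoothed increment on every arrival and returns the forecast ts(m_i) + Δ̂:

    class IncrementForecaster:
        """O(1)-space forecast of the next message timestamp on one channel."""

        def __init__(self, alpha=0.5):
            self.n = 0
            self.mean = 0.0        # running arithmetic mean of increments
            self.smooth = None     # exponentially smoothed increment
            self.alpha = alpha
            self.last_ts = None

        def observe(self, ts):
            if self.last_ts is not None:
                delta = ts - self.last_ts                      # negative for stragglers
                self.n += 1
                self.mean += (delta - self.mean) / self.n      # incremental mean update
                self.smooth = (delta if self.smooth is None
                               else self.alpha * delta + (1 - self.alpha) * self.smooth)
            self.last_ts = ts

        def forecast(self, use_smoothing=False):
            d = self.smooth if use_smoothing and self.smooth is not None else self.mean
            return self.last_ts + d

Calibrating α as above can be approximated by running one smoother per candidate α and periodically selecting the one with the smallest accumulated squared one-step error.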
    Scenario  Method  Exec. Time   %-Simulation      %-Rollback
                                   LP1     LP2       LP1     LP2
    DD        TW      0.46         12.7    13.7      5.8     16.1
    DD        TW+M    0.60         12.5    33.4      12.1    5.7
    DD        TW+S    0.65         16.4    35.9      11.5    6.3
    DD        TW+A    0.62         23.7    41.0      6.8     5.4
    SD        TW      0.68         10.9    13.8      9.8     17.2
    SD        TW+M    0.84         19.2    35.0      6.8     8.9
    SD        TW+S    0.96         19.4    37.7      6.8     8.4
    SD        TW+A    0.64         19.8    41.8      5.8     6.8
    DS        TW      0.64         10.7    13.8      8.4     17.9
    DS        TW+M    1.51         22.1    35.6      8.2     8.7
    DS        TW+S    0.66         17.1    39.2      6.0     8.2
    DS        TW+A    1.17         24.3    43.5      5.8     9.5
    SS        TW      0.91         11.0    15.0      11.7    15.3
    SS        TW+M    0.57         16.6    37.6      7.3     6.9
    SS        TW+S    1.03         19.8    37.9      7.4     9.6
    SS        TW+A    0.91         23.1    41.6      6.8     9.3
Table 5: TW with M, S and A for CM-5.
Median Approximation The virtual time increments in general cannot be assumed to follow a nonskewed, unimodal distribution of values, as is implicitly assumed when the arithmetic mean is used as an index of central tendency. Particularly, if the frequency of time increments has a pdf skewed to the left (mass concentrated at small increments, with a long right tail), then the arithmetic mean is higher in value than the median, and would thus overestimate the next message's timestamp. A consequence would be "overpessimism" in the blocking policy. A forecast based on the median would use the estimate Δ̂ = δ_((n+1)/2) if n is odd, and Δ̂ = (1/2)(δ_(n/2) + δ_(n/2+1)) otherwise, where δ_(k) denotes the k-th smallest observed increment. This cannot be computed incrementally, as new timestamp increments have to be inserted into a sorted list of δ_i's to find the value of the median afterwards. As an approximation for the median we have developed the following supplement. Let M = median/mean, which is a constant for every distribution (e.g. for the exponential distribution we have (λ^{-1} ln 2)/λ^{-1} = ln 2). Approximating the median of the last n increments by M times their arithmetic mean, Median(δ_s, δ_{s−1}, ..., δ_{s−n+1}) ≈ M · (1/n)(δ_s + δ_{s−1} + ... + δ_{s−n+1}),
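Assuming the ratio M = median/mean of the increment distribution is known or estimated (M = ln 2 for exponentially distributed increments), the median forecast reduces to a rescaled running mean, e.g.:

    import math

    def approx_median_forecast(last_ts, running_mean_delta, m_ratio=math.log(2)):
        """Median-based forecast approximated via the arithmetic mean:
        Delta_hat = M * mean(delta_1..delta_n); M = ln 2 holds for
        exponentially distributed increments (an assumption of this sketch)."""
        return last_ts + m_ratio * running_mean_delta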
we find a forecast based on a median which is approximated by the arithmetic mean as Δ̂ = M · (1/n)(δ_1 + ... + δ_n).
The performance of the three "straightforward" forecast methods (arithmetic mean (M), exponential smoothing (S) and approximated median (A)), applied to the SPN in Figure 6 with 4 tokens in the initial marking and different timing scenarios, is summarized in Table 5. In the scenario referred to as DD, both T1 and T2 obey deterministic, but imbalanced timing (τ(T1) = 1, τ(T2) = 8). In the second case, SD, T1 has stochastic timing with τ(T1) ~ exp(1), but T2 is deterministically timed as τ(T2) = 8. Similarly, DS represents τ(T1) = 1 and τ(T2) ~ exp(1/8), whereas SS represents τ(T1) ~ exp(1), τ(T2) ~ exp(1/8). Note that in any case LVT progression in LP2 is (on average) eight times higher than in LP1, causing significant load imbalance and rollback (communication) overhead. (All forecast confidences are kept constant at α = 0.9 for comparability; U refers to plain TW with unlimited optimism.) Sample arrival process traces as collected on the CM-5 are depicted in Figure 29 for the two LPs for DD (left), SD (half-left), DS (half-right) and SS (right). Since the timestamp differences in DD toggle between δ = 0 and δ = 9 = 8 + 1, this deterministic behavior represents a neutral case for all methods; e.g. method M repeatedly overestimates and underestimates the next timestamp and thus cannot gain over TW in the long run. Overall execution time grows, however, since forecasting intrudes into the simulation engine (stalls CPU cycles). The first quadruple of lines in Table 5 shows the order of intrusion induced by the various methods. In the case SD (second quadruple of lines in Table 5), method A finds the highest chances to avoid rollbacks and can outperform U. For DS, method M faces an absolute stress case, yielding a slowdown as compared to U. If the arrival process has two stochastic components (SS), both M and A can outperform U. The most important observation from Table 5 is that all the methods are able to increase the percentage of overall execution time spent on simulating events over the communication overhead induced. Thus the optimism control schemes are even more promising for distributed memory environments, for which the communication/computation speed ratio is smaller than on the CM-5, e.g. a cluster of RISC workstations.
[Figure 29: Arrival Processes as observed at LP2 (CM-5). Timestamp differences of successive message arrivals at LP2, plotted over the message number (1 to 300), for the four timing scenarios DD (T1 = 1.0 det., T2 = 8.0 det.), SD (T1 ~ exp(1.0), T2 = 8.0 det.), DS (T1 = 1.0 det., T2 ~ exp(1/8.0)) and SS (T1 ~ exp(1.0), T2 ~ exp(1/8.0)), each with 4 tokens.]
[Figure 30: Autocorrelation and Partial Autocorrelation Function of Arrival Processes at LP2 (CM-5). ACF and partial ACF over lags 0 to 40 for the four scenarios.]
The main drawback of the forecast schemes M, S and A is that they cannot cope with transient "patterns" of arrivals, but respect only a central tendency of the timestamp increments. Arrival patterns that show certain regularities, or at least some correlation in the time increments, can lead to stress cases, as was seen above. Therefore, forecast methods able to identify correlations and to predict next events at the maximum likelihood of those correlations are required.
ARIMA Forecasts In this section we follow the idea of considering the arrival process as an unknown stochastic process {X_t} = (X_1, X_2, ..., X_n), where X_1, X_2, ..., X_n are a series of instances of a random variable. Specifically, X_t = δ_t − μ are the empirically observed timestamp differences, centered by the series mean μ. If the {X_t} are statistically dependent variables, the arrival process can be modeled by an integrated autoregressive moving average process ARIMA[p, d, q] (see e.g. [12]):
    Φ(B) ∇^d X_t = Θ(B) ε_t                                    (6)
where ∇^d = (1 − B)^d is the d-fold differencing operator, and B the backward shift operator defined by B^i X_t = X_{t−i}, i = 0, 1, 2, .... This means that for e.g. d = 2, the process Y_t = ∇² X_t = X_t − 2X_{t−1} + X_{t−2} is assumed to be a stationary ARMA[p, q] (= ARIMA[p, 0, q]), composed of a pure autoregressive process of order p (AR[p]), explaining Y_t as a dependency Y_t = Φ_1 Y_{t−1} + Φ_2 Y_{t−2} + ... + Φ_p Y_{t−p} + ε_t, with ε_t being a white noise random error, and a pure moving average process of order q (MA[q]) that explains Y_t as a series of i.i.d. white noise errors, Y_t = ε_t + Θ_1 ε_{t−1} + Θ_2 ε_{t−2} + ... + Θ_q ε_{t−q}, with E(ε_i) = 0, Var(ε_i) = σ² and E(Y_t) = 0. Looking at the arrival process as obtained on the CM-5 for LP2 of the SPN in Figure 6 (Figure 29) and the corresponding autocorrelation (ACF) and partial ACF (Figure 30), we find high positive correlation after every fourth lag for DD (Figure 30, left), obviously due to the four tokens in the SPN. The case SD (Figure 30, half-left) leads us to hypothesize that the arrival process is AR, since we find the ACF dying down in an oscillating, damped exponential fashion. This is also intuitive, because the deterministic component in the process (τ(T2) = 8) dominates the stochastic one (τ(T1) ~ exp(1)). DS (Figure 30, half-right) gives evidence for a suitable representation of the arrivals as an MA process, because the ACF has a single spike at lag 1 and the PACF dies down, etc.
[Figure 31: Probabilistic Direct Optimism Control with ARIMA Forecasting. Probability of delay as a function of LVT relative to the forecasted timestamp t̂s(m_{i+1}), for confidence 0.95 (left) and 0.90 (right); the delay probability rises sigmoidally as LVT approaches and surpasses the forecast, the ascent being steeper for the higher confidence.]
For the automated characterization of the arrival process as an ARIMA[p, d, q] process, the classical Box-Jenkins procedure can be adapted:
1. Model Order Identification: First, the order [p, d, q] of the ARIMA model is identified. Since the theoretical partial autocorrelations ρ_{k−1}(k) at lag k vanish for k > p for a pure AR[p], the order of such a process can be approximated from the empirical partial autocorrelations r_{k−1}(k). Similarly, for a pure MA[q] the theoretical autocorrelations ρ(k) vanish for k > q, such that again the empirical data (the autocorrelations r(k)) can be used to approximate the order. For a combined ARMA[p, q] process, the Akaike criterion, i.e. the combination (p, q) that minimizes AIC(p, q) = log σ̂²_{p,q} + 2(p + q)/n, approximates the order (σ̂²_{p,q} is a maximum likelihood estimate of the variance of the underlying white noise error). The Akaike criterion has a more general form for ARIMA[p, d, q] processes. Indeed, as intuitively recognized, we find best order fittings e.g. for SD as ARIMA[3, 0, 0] and for DS as ARIMA[0, 0, 4].
2. Model Parameter Estimation: In the next step, the parameters in (6) (Φ_1, ..., Φ_p and Θ_1, ..., Θ_q) are determined as maximum likelihood estimates from the empirical data, i.e. the estimates Φ̂_1, ..., Φ̂_p and Θ̂_1, ..., Θ̂_q that minimize the square sum S²(Φ̂_1, ..., Φ̂_p, Θ̂_1, ..., Θ̂_q) = Σ_{t=1..n} ε̃_t² of the residuals ε̃_t = δ_t − Φ̂_1 δ_{t−1} − ... − Φ̂_p δ_{t−p} − Θ̂_1 ε̃_{t−1} − ... − Θ̂_q ε̃_{t−q} are used as the model parameters.
3. Model Diagnostics/Verification: A well known method to validate the model with the estimates Φ̂_1, ..., Φ̂_p and Θ̂_1, ..., Θ̂_q is the Portmanteau lack-of-fit test, which tests whether the residuals ε̃_t are realizations of a white noise process. The confidence level (1 − β) of the test can be used as a measure to quantify the "trust" in the model and, as a consequence, in the forecast.
4. Forecast: Finally, the (recursive) Durbin-Levinson method provides an algorithm for the one-step (or k-step) best linear prediction X̂_{t+1}. At a confidence level α = (1 − β), the PADOC simulation engine (Figure 27, in Step 2.4) executes the next scheduled transition firing with probability
    P[exec first(EVL)] = 1 − (1 + e^{−(LVT − t̂s) / ((1 − α) · 100)})^{−1};          (7)
otherwise the CPU is blocked for s time units.
Figure 31 explains the blocking probability (7) in relation to the confidence level α = (1 − β): the higher the confidence α, the steeper the ascent of the delay probability as LVT progresses towards t̂s (the steepness of the sigmoid function in Figure 31 (left) with α = 0.95 is higher than in Figure 31 (right) with α = 0.90). Note also that after the LVT progression in LP_j has surpassed the estimate t̂s (Figure 31, right), delays become more and more probable, expressing the increasing rollback hazard the LP runs into. The performance gain of PADOC for the example in a distributed memory environment with comparably high communication latencies is illustrated in Figure 32.
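The AR part of such a forecast can be sketched without any statistics library: the following Python sketch fits AR(p) coefficients with the Durbin-Levinson recursion on the centered timestamp increments, produces the one-step prediction, and evaluates the execution probability according to the reconstruction of equation (7) above. Model order selection, MLE fitting of MA terms and the Portmanteau test are omitted, and all names are illustrative:

    import math

    def autocovariances(x, max_lag):
        """Biased sample autocovariances of a mean-centered series x."""
        n = len(x)
        return [sum(x[t] * x[t - k] for t in range(k, n)) / n for k in range(max_lag + 1)]

    def durbin_levinson(gamma, p):
        """AR(p) coefficients (one-step predictor) from autocovariances gamma[0..p]."""
        phi = [gamma[1] / gamma[0]]
        v = gamma[0] * (1.0 - phi[0] ** 2)
        for k in range(2, p + 1):
            if abs(v) < 1e-12:
                break
            ref = (gamma[k] - sum(phi[j] * gamma[k - 1 - j] for j in range(k - 1))) / v
            phi = [phi[j] - ref * phi[k - 2 - j] for j in range(k - 1)] + [ref]
            v *= (1.0 - ref ** 2)
        return phi

    def forecast_next_ts(arrival_ts, p=3):
        """Forecast ts(m_{i+1}) from past arrival timestamps via an AR(p)
        model of the mean-centered timestamp increments."""
        deltas = [b - a for a, b in zip(arrival_ts, arrival_ts[1:])]
        if len(deltas) <= p:                       # too little history: mean forecast
            mu = sum(deltas) / len(deltas) if deltas else 0.0
            return arrival_ts[-1] + mu
        mu = sum(deltas) / len(deltas)
        x = [d - mu for d in deltas]
        gamma = autocovariances(x, p)
        if gamma[0] < 1e-12:                       # (near-)constant increments
            return arrival_ts[-1] + mu
        phi = durbin_levinson(gamma, p)
        x_hat = sum(c * x[-1 - j] for j, c in enumerate(phi))   # one-step prediction
        return arrival_ts[-1] + mu + x_hat

    def exec_probability(lvt, ts_hat, alpha):
        """Equation (7): probability of executing the next scheduled firing."""
        z = (lvt - ts_hat) / ((1.0 - alpha) * 100.0)
        z = max(-50.0, min(50.0, z))               # avoid overflow in exp
        return 1.0 - 1.0 / (1.0 + math.exp(-z))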
[Figure 32: TW and TW with ARIMA forecasting for an RS6000 cluster. Per-LP execution time, broken down into simulation, communication, rollback and blocking shares, of LP1 and LP2 under plain TW and TW+ARIMA for the scenarios DD, SD, DS and SS.]
[Figure 33: PN4, a PN with virtual time progression in phases. Places P1 to P8, transitions t1 to t12, with λ9 = 1.0, τ10 = 10.0, λ11 = 0.1, τ12 = 1.0.]
A general observation is that with α → 1, PADOC imposes a synchronization behavior close to CMB, whereas with α → 0, optimism is as unlimited as in (plain) TW. Moreover, by periodically rebuilding the ARIMA model, the PADOC scheme adapts the LP to a synchronization behavior directly reflecting the inherent model parallelism, and also copes with transient arrival processes. To demonstrate this by example, we study PN4 (Figure 33), with τ1 = τ2 = ... = τ8 = 0.0 and τ10 = 10.0, τ12 = 1.0, while the firing delays of t9 and t11 are exponentially distributed, t9 ~ exp(1.0) and t11 ~ exp(0.1). Let the net be spatially decomposed into two LPs, where LP1 (LP2) holds the transitions {t1, t2, t3, t4, t9, t10} ({t5, t6, t7, t8, t11, t12}) and the respective input places. TW performance is known to degrade in the case of imbalances in the local virtual time (LVT) progression of different LPs. In the example we have intentionally constructed such an imbalance by having LP2 progress LVT per simulation step at a tenfold higher rate than LP1 if transitions fire along the T-invariant x6 = [1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0]. Clearly, LP1 will permanently force LP2 to roll back, thus causing overhead. If the net executes along x3 = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1], the opposite is the case, with the only difference that the dominating timing component is not deterministic (t10), but of stochastic nature (t11). We introduce a switching probability for the conflicting transitions (t1, t2), (t3, t4), etc. to be able to control the switching of tokens from the cycle C1 defined by x6 to C2 defined by x3 and vice versa. p = 50, e.g., means that for a token at any decision point it is equally likely to switch the cycle or to stay on the current one, whereas p = 95 expresses just a 5 % "willingness" of tokens to switch cycles. For an initial marking with two tokens (Σ_i m_0(P_i) = 2), the LVT progression in the LPs reveals three different phases: 2 tokens in C1 (phase 1), 1 token in each of the cycles C1 and C2 (phase 2), and 2 tokens in C2 (phase 3), thus naturally representing a stress case for TW in general, and for adaptive TW protocols with optimism control parameters based on averages (ATW, LAP, AMM) in particular. The LVT progression traces as obtained for PN4 (Figure 33) with p = 95 on the CM-5 are depicted in Figure 34. It is seen that plain TW considerably suffers from the (artificial) straggler bulks, yielding an oscillating LVT progression. TW with token-time forecasts based on the arithmetic mean can cope with this destructive situation to some extent, but is still prone to pathological behavior. The TW+ARIMA approach, however, gains from consistent forecasts and reveals the best adaptation. The major strength of the PADOC protocol is that the optimism of Time Warp can be automatically controlled by the model parallelism available, as expressed by the likelihood of future messages, and can even adapt to a transient behavior of the simulated model. The asymptotic behavior of the protocol for simulation models with P[e → e'] ≈ 1 (for all pairs (e, e')) is close to CMB, while for models with P[e → e'] ≈ 0 it is arbitrarily close to (a throttled) Time Warp.
[Figure 34: Message number vs. message timestamp as observed on the CM-5 for p = 75 and p = 95, together with the ACF and partial ACF characteristics of the arrival process used for the ARIMA model.]
[Figure 35: Adaptive model parallelism extraction. LVT progression on LP2 (CM-5, 2 tokens) over CPU time for plain Time Warp, the mean forecast, and the ARIMA forecast.]
The PADOC mechanism gains adaptiveness in the sense that, independent of the ratio of the communication and computation speed of the target platform, the synchronization policy is adjusted automatically to that point in the continuum between the TW and CMB protocols that is most appropriate for the parallelism inherent in the simulation model.
5 Sources of Literature and Further Reading
Overviews of the field of parallel and distributed simulation are the surveys by Richter and Walrand [66], Fujimoto [35], and most recently Nicol and Fujimoto [59] and Ferscha [28]. The primary reading for Time Warp is [44], for conservative protocols it is [52]. The most relevant literature in the context of parallel discrete event simulations of Petri nets is [7], [6], [67], [8], while the distributed approach is documented in [71], [72], [60], [5], [18], [33], [68], [27], [55]. Probabilistic and adaptive protocols are studied in [31], [29]. The partitioning issue is addressed in [19], [30]; implementation aspects are presented in [21], [20]. This text, the tutorial slides, as well as related contributions by the author can be downloaded in electronic format from http://www.ani.univie.ac.at/ferscha/Ferscha.html.
References
[1] M. Ajmone Marsan, G. Balbo, A. Bobbio, G. Chiola, G. Conte, and A. Cumani. The Effect of Execution Policies on the Semantics and Analysis of Stochastic Petri Nets. IEEE Transactions on Software Engineering, 15(7):832-846, July 1989.
[2] M. Ajmone Marsan, G. Balbo, G. Chiola, and G. Conte. Generalized Stochastic Petri Nets Revisited: Random Switches and Priorities. In Proc. of the International Workshop on Petri Nets and Performance Models, August 24-26, 1987, Madison, Wisconsin, pages 44-53. IEEE Computer Society Press, 1987.
[3] M. Ajmone Marsan and G. Chiola. On Petri Nets with Deterministic and Exponentially Distributed Firing Times. In G. Rozenberg, editor, Advances in Petri Nets 1987, Lecture Notes in Computer Science Vol. 266, pages 132-145. Springer Verlag, 1987.
[4] I. F. Akyildiz, L. Chen, R. Das, R. M. Fujimoto, and R. F. Serfozo. The Effect of Memory Capacity on Time Warp Performance. Journal of Parallel and Distributed Computing, 18(4):411-422, August 1993.
[5] H. H. Ammar and S. Deng. Time Warp Simulation of Stochastic Petri Nets. In Proc. 4th Intern. Workshop on Petri Nets and Performance Models, pages 186-195. IEEE-CS Press, 1991.
[6] F. Baccelli and M. Canales. Parallel Simulation of Stochastic Petri Nets Using Recurrence Equations. ACM Transactions on Modeling and Computer Simulation, 3(1):20-41, January 1993.
[7] F. Baccelli, G. Cohen, G. J. Olsder, and J.-P. Quadrat. Synchronization and Linearity. An Algebra for Discrete Event Systems. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, Chichester, 1992.
[8] F. Baccelli, N. Furmento, and B. Gaujal. Parallel and Distributed Simulation of Free Choice Petri Nets. In Proceedings of the 9th Workshop on Parallel and Distributed Simulation (PADS '95), pages 3-10, 1995.
[9] D. Baik and B. P. Zeigler. Performance Evaluation of Hierarchical Distributed Simulators. In Proc. of the 1985 Winter Simulation Conference, pages 421-427. SCS, 1985.
[10] W. L. Bain and D. S. Scott. An Algorithm for Time Synchronization in Distributed Discrete Event Simulation. In B. Unger and D. Jefferson, editors, Proceedings of the SCS Multiconference on Distributed Simulation, 19 (3), pages 30-33. SCS, February 1988.
[11] D. Ball and S. Hoyt. The Adaptive Time-Warp Concurrency Control Algorithm. In D. Nicol, editor, Distributed Simulation. Proceedings of the SCS Multiconference on Distributed Simulation, pages 174-177, San Diego, California, 1990. Society for Computer Simulation. Simulation Series, Volume 22, Number 1.
[12] P. J. Brockwell and R. A. Davis. Time Series: Theory and Methods. Springer Verlag, New York, 1991.
[13] R. E. Bryant. A Switch-Level Model and Simulator for MOS Digital Systems. IEEE Transactions on Computers, C-33(2):160-177, February 1984.
[14] W. Cai and St. J. Turner. An Algorithm for Distributed Discrete-Event Simulation - The `Carrier Null Message' Approach. In Proceedings of the SCS Multiconference on Distributed Simulation Vol. 22 (1), pages 3-8. SCS, January 1990.
[15] K. M. Chandy and J. Misra. Distributed Simulation: A Case Study in Design and Verification of Distributed Programs. IEEE Transactions on Software Engineering, SE-5(5):440-452, September 1979.
[16] K. M. Chandy and J. Misra. Asynchronous Distributed Simulation via a Sequence of Parallel Computations. Communications of the ACM, 24(11):198-206, November 1981.
[17] K. M. Chandy, J. Misra, and L. M. Haas. Distributed Deadlock Detection. ACM Transactions on Computer Systems, 1(2):144-156, May 1983.
[18] G. Chiola and A. Ferscha. Distributed Simulation of Petri Nets. IEEE Parallel and Distributed Technology, 1(3):33-50, August 1993.
[19] G. Chiola and A. Ferscha. Distributed Simulation of Timed Petri Nets: Exploiting the Net Structure to Obtain Efficiency. In M. Ajmone Marsan, editor, Proc. of the 14th Int. Conf. on Application and Theory of Petri Nets 1993, Chicago, June 1993, Lecture Notes in Computer Science 691, pages 146-165, Berlin, 1993. Springer Verlag.
[20] G. Chiola and A. Ferscha. Performance Comparable Implementation Design of Synchronization Protocols for Distributed Simulation. Technical Report ACPC-TR 93-20, Austrian Center for Parallel Computation, 1993.
[21] G. Chiola and A. Ferscha. Performance Comparable Design of Efficient Synchronization Protocols for Distributed Simulation. In Proc. of MASCOTS'95, pages 343-348. IEEE Computer Society Press, 1995.
[22] G. Ciardo, R. German, and Ch. Lindemann. A Characterization of the Stochastic Process Underlying a Stochastic Petri Net. In Proc. of the 5th International Workshop on Petri Nets and Performance Models, October 19-22, 1993, Toulouse, France. IEEE Computer Society Press, 1993.
[23] A. I. Concepcion. Mapping Distributed Simulators onto the Hierarchical Multibus Multiprocessor Architecture. In P. Reynolds, editor, Proc. of the SCS Multiconference on Distributed Simulation, pages 8-13. Society for Computer Simulation, 1985.
[24] S. R. Das and R. M. Fujimoto. An Adaptive Memory Management Protocol for Time Warp Parallel Simulation. In Proc. of the 1994 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, Nashville, 1994, pages 201-210. ACM, 1994.
[25] R. David and H. Alla. Petri Nets & Grafcet. Tools for Modelling Discrete Event Systems. Prentice Hall, New York, 1992.
[26] Ph. M. Dickens and P. F. Reynolds. SRADS with Local Rollback. In Proceedings of the SCS Multiconference on Distributed Simulation Vol. 22 (1), pages 161-164. SCS, January 1990.
[27] A. Ferscha. Concurrent Execution of Timed Petri Nets. In J. D. Tew, S. Manivannan, D. A. Sadowski, and A. F. Seila, editors, Proceedings of the 1994 Winter Simulation Conference, pages 229-236, 1994.
[28] A. Ferscha. Parallel and Distributed Simulation of Discrete Event Systems. In A. Y. Zomaya, editor, Parallel and Distributed Computing Handbook. McGraw-Hill, 1995.
[29] A. Ferscha. Probabilistic Adaptive Direct Optimism Control in Time Warp. In Proceedings of the 9th Workshop on Parallel and Distributed Simulation (PADS '95), pages 120-129, 1995.
[30] A. Ferscha and G. Chiola. Accelerating the Evaluation of Parallel Program Performance Models using Distributed Simulation. In Proc. of the 7th Int. Conf. on Modelling Techniques and Tools for Computer Performance Evaluation, Lecture Notes in Computer Science, pages 231-252. Springer Verlag, 1994.
[31] A. Ferscha and G. Chiola. Self Adaptive Logical Processes: The Probabilistic Distributed Simulation Protocol. In Proc. of the 27th Annual Simulation Symposium, pages 78-88, Los Alamitos, California, 1994. IEEE Computer Society Press.
[32] A. Ferscha and G. Chiola. Self Adaptive Logical Processes: The Probabilistic Distributed Simulation Protocol. In Proc. of the 27th Annual Simulation Symposium, pages 78-88, Los Alamitos, California, 1994. IEEE Computer Society Press.
[33] A. Ferscha and G. Csertan. Optimistic Discrete Event Simulation of Petri Nets. In Proceedings of IMPACT TEMPUS JEPs and Hungarian Transputer User Group Workshop on Parallel Processing, pages 85-90. Continuing Education Center, Univ. Miskolc, 1993.
[34] A. Ferscha and J. Luethi. Estimating Rollback Overhead for Optimism Control in Time Warp. In Proc. of the 28th Annual Simulation Symposium, pages 2-12, Los Alamitos, California, 1995. IEEE Computer Society Press.
[35] R. M. Fujimoto. Parallel Discrete Event Simulation. Communications of the ACM, 33(10):30-53, October 1990.
[36] A. Gafni. Rollback Mechanisms for Optimistic Distributed Simulation Systems. In Proc. of the SCS Multiconference on Distributed Simulation 19, pages 61-67, 1988.
[37] A. Gafni. Rollback Mechanisms for Optimistic Distributed Simulation Systems. In B. Unger and D. Jefferson, editors, Proceedings of the SCS Multiconference on Distributed Simulation, 19 (3), pages 61-67. SCS, February 1988.
[38] B. Groselj and C. Tropper. The Time-of-next-event Algorithm. In B. Unger and D. Jefferson, editors, Proceedings of the SCS Multiconference on Distributed Simulation, 19 (3), pages 25-29. SCS, February 1988.
[39] Donald O. Hamnes and Anand Tripathi. Investigations in Adaptive Distributed Simulation. In D. K. Arvind, Rajive Bagrodia, and Jason Yi-Bing Lin, editors, Proceedings of the 8th Workshop on Parallel and Distributed Simulation (PADS '94), pages 20-23, July 1994.
[40] M. A. Holliday and M. K. Vernon. A Generalized Timed Petri Net Model for Performance Analysis. IEEE Transactions on Software Engineering, 13(12), December 1987.
[41] D. Jefferson. Virtual Time II: The Cancelback Protocol for Storage Management in Time Warp. In Proc. of the 9th Annual ACM Symposium on Principles of Distributed Computing, pages 75-90, ACM, New York, 1990.
[42] D. Jefferson and P. Reiher. Supercritical Speedup. In A. H. Rutan, editor, Proceedings of the 24th Annual Simulation Symposium, New Orleans, Louisiana, USA, April 1-5, 1991, pages 159-168. IEEE Computer Society Press, 1991.
[43] D. Jefferson and H. Sowizral. Fast Concurrent Simulation Using the Time Warp Mechanism. In P. Reynolds, editor, Distributed Simulation 1985, pages 63-69, La Jolla, California, 1985. SCS-The Society for Computer Simulation, Simulation Councils, Inc.
[44] D. A. Jefferson. Virtual Time. ACM Transactions on Programming Languages and Systems, 7(3):404-425, July 1985.
[45] L. Lamport. Time, Clocks, and the Ordering of Events in Distributed Systems. Communications of the ACM, 21(7):558-565, July 1978.
[46] Y.-B. Lin, B. Preiss, W. Loucks, and E. Lazowska. Selecting the Checkpoint Interval in Time Warp Simulation. In R. Bagrodia and D. Jefferson, editors, Proc. of the 7th Workshop on Parallel and Distributed Simulation, pages 3-10, Los Alamitos, CA, 1993. IEEE Computer Society Press.
[47] Yi-Bing Lin and Bruno R. Preiss. Optimal Memory Management for Time Warp Parallel Simulation. ACM Transactions on Modeling and Computer Simulation, 1(4):283-307, October 1991.
[48] B. D. Lubachevsky. Bounded Lag Distributed Discrete Event Simulation. In B. Unger and D. Jefferson, editors, Proceedings of the SCS Multiconference on Distributed Simulation, 19 (3), pages 183-191. SCS, February 1988.
[49] B. D. Lubachevsky, A. Weiss, and A. Shwartz. An Analysis of Rollback-Based Simulation. ACM Transactions on Modeling and Computer Simulation, 1(2):154-193, April 1991.
[50] F. Mattern. Algorithms for Distributed Termination Detection. Distributed Computing, 2:161-175, 1987.
[51] E. Mayr and A. Meyer. The Complexity of the Finite Containment Problem for Petri Nets. Journal of the ACM, 28(3):561-576, 1981.
[52] J. Misra. Distributed Discrete-Event Simulation. ACM Computing Surveys, 18(1):39-65, 1986.
[53] M. K. Molloy. Performance Analysis Using Stochastic Petri Nets. IEEE Transactions on Computers, C-31(9):913-917, September 1982.
[54] T. Murata. Petri Nets: Properties, Analysis and Applications. Proceedings of the IEEE, 77(4):541-580, April 1989.
[55] D. Nicol and W. Mao. Automated Parallelization of Timed Petri-Net Simulations. Submitted for publication, 1994.
[56] D. M. Nicol. Parallel Discrete-Event Simulation of FCFS Stochastic Queueing Networks. In Proceedings of the ACM/SIGPLAN PPEALS 1988, pages 124-137, 1988.
[57] D. M. Nicol. Performance Bounds on Parallel Self-initiating Discrete Event Simulations. ACM Transactions on Modeling and Computer Simulation, 1(1):24-50, January 1991.
[58] D. M. Nicol. Global Synchronization for Optimistic Parallel Discrete Event Simulation. In R. Bagrodia and D. Jefferson, editors, Proc. of the 7th Workshop on Parallel and Distributed Simulation, pages 27-34, Los Alamitos, CA, 1993. IEEE Computer Society Press.
[59] D. M. Nicol and R. M. Fujimoto. Parallel Simulation Today. Operations Research, 1994.
[60] D. M. Nicol and S. Roy. Parallel Simulation of Timed Petri-Nets. In B. Nelson, D. Kelton, and G. Clark, editors, Proceedings of the 1991 Winter Simulation Conference, pages 574-583, 1991.
[61] J. K. Peacock, J. W. Wong, and E. G. Manning. Distributed Simulation using a Network of Processors. Computer Networks, 3(1):44-56, 1979.
[62] A. Prakash and R. Subramanian. Filter: An Algorithm for Reducing Cascaded Rollbacks in Optimistic Distributed Simulation. In A. H. Rutan, editor, Proceedings of the 24th Annual Simulation Symposium, New Orleans, Louisiana, USA, April 1-5, 1991, pages 123-132. IEEE Computer Society Press, 1991.
[63] H. Rajaei, R. Ayani, and L. E. Thorelli. The Local Time Warp Approach to Parallel Simulation. In R. Bagrodia and D. Jefferson, editors, Proc. of the 7th Workshop on Parallel and Distributed Simulation, pages 119-126, Los Alamitos, CA, 1993. IEEE Computer Society Press.
[64] Ch. Ramchandani. Analysis of Asynchronous Concurrent Systems by Petri Nets. Technical report, MIT, Laboratory of Computer Science, Cambridge, Massachusetts, February 1974.
[65] P. L. Reiher, R. M. Fujimoto, S. Bellenot, and D. Jefferson. Cancellation Strategies in Optimistic Execution Systems. In Proceedings of the SCS Multiconference on Distributed Simulation Vol. 22 (1), pages 112-121. SCS, January 1990.
[66] R. Richter and J. C. Walrand. Distributed Simulation of Discrete Event Systems. Proceedings of the IEEE, 77(1):99-113, January 1989.
[67] S. Roy. Massively Parallel SIMD Simulation of Discrete Time Stochastic Petri Nets, 1994.
[68] H. Sellami, J. D. Allen, D. E. Schimmel, and S. Yalamanchili. Simulation of Marked Graphs on SIMD Architectures Using Efficient Memory Management. In Proc. of MASCOTS'94, pages 343-348. IEEE Computer Society Press, 1994.
[69] J. Sifakis. Use of Petri Nets for Performance Evaluation. In H. Beilner and E. Gelenbe, editors, Measuring, Modelling and Evaluating Computer Systems, pages 75-93. North-Holland, 1977.
[70] L. M. Sokol, D. P. Briscoe, and A. P. Wieland. MTW: A Strategy for Scheduling Discrete Simulation Events for Concurrent Execution. In Proc. of the SCS Multiconf. on Distributed Simulation, pages 34-42, 1988.
[71] G. S. Thomas. Parallel Simulation of Petri Nets. Technical Report TR 91-05-05, Dept. of Computer Science, University of Washington, May 1991.
[72] G. S. Thomas and J. Zahorjan. Parallel Simulation of Performance Petri Nets: Extending the Domain of Parallel Simulation. In Proc. of the 1991 Winter Simulation Conference, 1991.
[73] K. Venkatesh, T. Radhakrishnan, and H. F. Li. Discrete Event Simulation in a Distributed System. In IEEE COMPSAC, pages 123-129. IEEE Computer Society Press, 1986.