On the Efficient Sequential and Distributed Evaluation of Very Large Stochastic Petri Nets

Boudewijn Haverkort, Henrik Bohnenkamp, Alexander Bell
Rheinisch-Westfälische Technische Hochschule Aachen
Department of Computer Science, Laboratory for Distributed Systems
D-52056 Aachen, Germany
http://www-lvs.informatik.rwth-aachen.de/

January 1999

Abstract

In this paper we present efficient techniques for the generation of very large continuous-time Markov chains (CTMCs) specified as stochastic Petri nets (SPNs). In particular, we investigate how the storage efficiency of the reachability graph generation can be improved by using clever state coding techniques and by using hashing tables instead of tree-based data structures. These techniques already allow us to analyse SPNs with almost 55 million states on a single workstation. The size of the SPNs that can be handled by these techniques is then further enlarged by using a cluster of workstations. With a dozen workstations, connected via an ordinary Ethernet, we are in the position to generate reachability graphs with over 110 million states in reasonable time. The presented techniques have been realised in a prototype tool named PARSECS which has been implemented in C++ using the libraries STL and MPICH. The SPNs to be input to PARSECS are specified using CSPL, known from the tool SPNP. In the paper we present our techniques and study their performance for a number of case studies. We also present comparisons with other well-known SPN tools.

1 Introduction

Over the last two decades, numerous researchers have turned their attention to numerical performance evaluation techniques in which, on the basis of SPNs of some sort, a continuous-time Markov chain (CTMC) is automatically generated and subsequently solved. In doing so, an enormous modelling flexibility is obtained, without having to resort to simulation. The main problem with such an approach is the resulting largeness of the models.

The three steps in an SPN-based performance evaluation trajectory are illustrated in Figure 1: (1) CTMC generation: the generation of the underlying Markov chain from the SPN, resulting in a state space description and the generation of the generator matrix Q; (2) CTMC solution: the evaluation of the steady-state probability vector π from the global balance equations (πQ = 0, π·1 = 1); and (3) the "conversion" of π to measures that can be interpreted at the SPN level, such as average place occupancies or transition throughputs. Experience with SPN-based evaluations has revealed that the first two steps are the most time and memory consuming. In this paper we focus on the first step and comment only briefly on step 2; the third step is not addressed any further.

[Figure 1 (schematic): SPN model → (generation) → Markov chain → (solution) → probabilities → (conversion) → SPN measures]
Figure 1: Steps in the solution process of SPN-based Markov models

The first step consists of a tree-search algorithm in which, given a starting state, new states are generated and subsequently explored until all states have been addressed. This step is both time and memory consuming since, next to an easy-to-access data structure for the SPN, information about all possible states in the SPN and the transition rates between them needs to be stored (the reachability graph). In the second step, exact information about the SPN and the interpretation of the states in the CTMC, i.e., knowing to which marking they correspond, is not needed anymore; however, the complete matrix Q and at least one instance of the probability vector need to be stored.

In this paper we present some results of a recently started research project in which we are investigating a number of ways to improve the efficiency of the evaluation of SPNs with very large underlying CTMCs. We thereby simply accept that the state spaces are very large, i.e., we do not try to use hierarchical model decomposition techniques or other largeness avoidance techniques. In doing so, two directions can be taken: (i) a sequential approach in which one tries to come as far as possible with a single reasonably priced workstation, and (ii) a distributed approach in which a cluster of workstations is used as a multiple-instruction, multiple-data (MIMD) computer system, thereby providing both more processing power and storage capacity.

The sequential approach has been addressed since the emergence of SPN-based performance evaluation tools in the mid 1980s. Here in particular, the proper choice of data structures and operation implementations, as well as the coding of individual markings, plays a key role towards efficiency; appropriate approaches in this direction are well described in [7]. In Section 3 we present some of our improvements and compare them with other implementations.
The distributed approach has been followed to a lesser extent; we mention the most important contributions here. Results on parallel and/or distributed state space generation from SPNs, both on massively parallel machines and on clusters of workstations, have been reported in [5, 6]. Interesting to note is that these authors have found that the use of MIMD computer systems, i.e., workstation clusters, is more profitable than using expensive single-instruction, multiple-data (SIMD) machines such as the CM-2 and CM-5. In [6] Caselli et al. propose an efficient distributed reachability graph generation approach, based on an a priori allocation of states to nodes in the computing cluster using hashing techniques. Ciardo et al. then extended this approach by removing a number of synchronisation barriers, thus increasing the overall efficiency [8]. The results achieved by these authors using networks of workstations are very promising, although they also signal difficulties with respect to a proper load balancing. Therefore, Nicol and Ciardo propose a technique that automatically balances the load in [23]; although their results are promising, they are only validated (in the

paper) for a relatively small SPN (150 000 states), and they do incur a large communication overhead, which is undesirable for workstations connected via a low-priced LAN like Ethernet (Nicol and Ciardo have an SP-2 system with crossbar switch at their disposal). In [21], Caselli et al. continue their work, thereby putting special emphasis on state space compression techniques and efficient communication. Using a cluster of 14 Pentium PCs, they are in the position to generate and solve CTMCs with up to 2 million states.

Allmaier et al. recently proposed a MIMD generation and solution approach using an expensive shared-memory multiprocessor system [3]. Their results (speed-ups) are very promising indeed; however, the need for special hardware remains a severe drawback. They also ported their system to a cluster of workstations, however, the speed-ups and maximum state spaces they arrive at are not very high [1, 2]. These authors explicitly avoid the use of hashing for the state-to-processor allocation and try to use load-balancing techniques to divide the work properly.

Recently, Knottenbelt et al. presented a probabilistic distributed generation of CTMCs from SPNs [17]. In their approach, states are coded very efficiently using a dynamic hashing technique. However, the employed hashing technique allows for collisions to occur that cannot be detected. As a result, at the end of the generation process, the complete reachability graph has been generated only with probability smaller than 1 (but possibly close to 1). Although the correctness probability is estimated to be well over 99% for some cases, when addressing models with potentially millions of states, even such a very small fraction of non-completeness is significant. Nevertheless, Knottenbelt et al. are in the position to generate state spaces larger than 100 million states on the Fujitsu AP3000 (12 UltraSparcs connected via a 200 Mbps network).
The idea to generate state spaces probabilistically using static hashing data structures has been applied successfully in the area of computer-aided verification ("model checking"); see the work by Holzmann [16], Wolper and Leroy [30] and Stern and Dill [24]. The latter two authors recently also described a distributed state space exploration approach very similar to the one presented by Caselli et al., Ciardo et al. and Knottenbelt et al., yielding very good speed-ups as well [25].

In the sections that follow we will present our recent results in this area. To ease the explanation of our work, we first present an example SPN in Section 2; it has been taken from the literature and has been widely used as a benchmark SPN. Section 3 is devoted to the sequential generation of the CTMC, whereas Section 4 reports experience regarding the distributed generation. These two sections form the main body of the current paper. In both sections, we will present theory as well as numerical examples, thereby comparing our sequential implementation with SPNP. Finally, in Section 5, we briefly touch upon the existence and efficiency of numerical solution techniques for large CTMCs (Step 2 in Figure 1). We conclude the paper with a summary and outlook in Section 6. Finally, in Appendix A we will present some results for another model.

2 The flexible manufacturing model

To test and explain our algorithms, we use the flexible manufacturing system (FMS) model, as used by many others [11, 8, 17, 23], and which is depicted in Figure 2. The model describes three cyclic production lines in which some machines (represented by the tokens in places M1, M2 and M3) are shared. The capacity of a production line is dependent on the number of pallets (k) that is available; this number corresponds to the initial number of tokens in the

Figure 2: The flexible manufacturing system model (figure copied from [11])

places P1, P2 and P3. By varying k, we can easily change the number of states and transitions in the underlying CTMC, as illustrated in Table 1.

3 Sequential reachability graph generation

This section starts with a description of the basic state space generation algorithm and the required data structures in Section 3.1. We then continue with techniques to code the states efficiently in Section 3.2. Implementation aspects are presented in Section 3.3 before numerical results are shown in Section 3.4.

3.1 Basic algorithm and data structures

In Figure 3 we show the well-known algorithm for the sequential reachability graph generation. Although not directly visible from this pseudo-code, we do eliminate the vanishing markings on the fly. For this algorithm to work efficiently, we need to consider the possibilities we have for the three main data structures:

- the set of states S: balanced trees, hashing tables;
- the set of new states S_new: stack or queue;
- the set of arcs A (arc a = (s, s', a_{s,s'})): in-/outside main memory.

     k |      states |          arcs |    non-zeros
     1 |          54 |           155 |          155
     2 |         810 |         3 843 |        3 699
     3 |       6 520 |        39 814 |       37 394
     4 |      35 910 |       258 180 |      237 120
     5 |     152 712 |     1 235 025 |    1 111 482
     6 |     537 768 |     4 758 026 |    4 205 670
     7 |   1 639 440 |    15 576 888 |   13 552 968
     8 |   4 459 455 |    44 903 808 |   38 533 968
     9 |  11 058 190 |   116 856 795 |   99 075 405
    10 |  25 397 658 |   279 568 685 |  234 523 289
    11 |  54 682 992 |   623 353 302 |  518 030 370
    12 | 111 414 940 | 1 309 143 628 | 1 078 917 632

Table 1: The number of states, arcs and non-zeros in the Q matrix for the FMS model as a function of the model parameter k; the number of arcs is given before multiple transitions between the same two states have been amalgamated

 1. S := {Initial}; S_new := S; A := ∅;
 2. while ∃ s ∈ S_new do
 3.   S_new := S_new \ {s};
 4.   for each t ∈ Enabled(s) do
 5.     s' := NewState(s, t);
 6.     if s' ∉ S
 7.     then S := S ∪ {s'};
 8.          S_new := S_new ∪ {s'};
 9.     A := A ∪ {(s, s', a_{s,s'})};
10.   end for;
11. end while;

Figure 3: Algorithm for sequential reachability graph generation
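As an illustration, the loop of Figure 3 can be sketched in C++, the implementation language of our tool. The SPN-specific parts (Enabled, NewState, the on-the-fly elimination of vanishing markings) are abstracted into a single successor callback here, and all names are ours, not those of the actual PARSECS code:

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <unordered_set>
#include <utility>
#include <vector>

// States are assumed to be encoded as 64-bit integers (cf. Section 3.2).
using State = std::uint64_t;
struct Arc { State from, to; double rate; };

// successors(s) yields every (s', rate) reachable by firing one enabled
// timed transition in s, with vanishing markings already eliminated.
std::vector<Arc> generate(
    State initial,
    const std::function<std::vector<std::pair<State, double>>(State)>& successors,
    std::unordered_set<State>& S)
{
    std::vector<Arc> A;        // in the real tool, arcs go to disk
    std::queue<State> S_new;   // a queue gives FCFS exploration order
    S.insert(initial);
    S_new.push(initial);
    while (!S_new.empty()) {
        State s = S_new.front();
        S_new.pop();
        for (auto [s2, rate] : successors(s)) {
            if (S.insert(s2).second)   // s' not encountered before
                S_new.push(s2);
            A.push_back({s, s2, rate}); // keep every arc; duplicates are
        }                               // amalgamated later by addition
    }
    return A;
}
```

Swapping the queue for a stack turns the exploration order from FCFS into LCFS, which is exactly the comparison made in Section 3.4.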


The following operations to be performed on these data structures "decide" which is the best:

- S_new := S_new \ {s}: to remove a state from S_new;
- Enabled(s): which transitions are enabled in a certain state s;
- NewState(s, t): given a state s and a transition t that fires, what is the next state;
- s' ∉ S: does a newly generated state s' already exist?
- S := S ∪ {s'} and S_new := S_new ∪ {s'}: add a new state s' to the set of those already existing and to those to be considered further.

The first function requires us to be able to remove an item easily from the data structure for S_new, as well as to add items to the same data structure (fifth operation). Hence, a queue or a stack seems to be the most appropriate. Although the memory costs of these two methods are similar, the resulting ordering of the states differs substantially. This might have an influence on the convergence of the iterative procedure that is to be used to compute the steady-state probabilities in Step 2 (Figure 1). Although the investigation of these differences goes beyond the scope of the current paper, we will present some comparison results below.

The second and third operation can be executed efficiently when the actual marking can be read directly from the state description. The second function then has to check which transitions are enabled, and the third has to compute which successor state is reached upon firing a particular transition. For these purposes, efficient techniques are known [7].

More crucial to the overall memory efficiency is the internal storage of S. When using a balanced tree, which is typically done in SPN tools, various forms of overhead occur. First of all, in a balanced tree three pointers are required to point to the left and right child, as well as to the parent node. The latter pointer is required to allow for rebalancing. Furthermore, we need a pointer to access the actual state information. Given the way we are able to code the actual state information (see below), these 4 pointers (16 bytes) represent a significant overhead. Secondly, when a new state has been generated, it does not yet appear in the tree. Hence, the search in the tree will be unsuccessful, but this fact is only known after a leaf has been reached. Given that n states have already been found, in a balanced tree O(log n) steps (comparisons and pointer operations) are required to reach a leaf. Thus, the computational overhead is quite substantial as well. To simplify the discussion here, we abstract from the overhead for keeping the tree balanced.

In contrast, a data structure based on hashing does not suffer from the above disadvantages. Furthermore, since no elements are removed from S, a very simple form of hashing without chaining can be used. With a simple computation, the address of the newly generated state is computed and it can be verified directly whether the state has already been encountered or not. If not, a simple insertion follows. One problem that comes up here is the fact that the hashing function might cause collisions, i.e., if we have two non-identical states s1 and s2, their image under the hashing function h1 might be identical (h1(s1) = h1(s2)). Of course, collisions should be resolved with some collision resolution technique. For this purpose we have chosen to use so-called "open addressing with double hashing" [18, Chapter 6.4, Algorithm D]. With this mechanism, next to the first hashing function a second one (h2) is used. If the total hash table size is M, the first hashing function should result in a value in the range {0, …, M − 1} and the second one should result in a value in the range

{1, …, M − 1} that is relatively prime to M. In this way, the sequence of hash table addresses h1(s), h1(s) ⊕ h2(s), h1(s) ⊕ 2h2(s), …, will visit all hash table entries (the operator ⊕ denotes addition modulo M). When checking whether a state has already been encountered, the first address in this sequence is probed. If this address does not contain an entry yet, the new state has not been encountered before. If the address does contain the state being checked, it is a known state. On the other hand, if the address contains a state different from the state being checked, the next address in the probe sequence is checked, until either the state is found ("successful search") or an empty entry is encountered ("unsuccessful search") and an insertion can take place. Although exact results for the number of collisions are not available, empirical evidence has shown that the number of collisions before an unsuccessful search (E[C_new]) and the number of collisions before a successful search (E[C_known]) are well estimated as follows [18]:

    E[C_new] ≈ α / (1 − α),    E[C_known] ≈ −(α + ln(1 − α)) / α,    (1)

where α is the degree of filling of the table at the moment of the search. Notice that we use a hashing technique without chaining. We can do so because we only have insertions in the table, no removals. This again saves us a 4-byte pointer overhead per state. The other problem with hashing is the required a priori decision on the hash table size. Given the size of the main memory of the machine the algorithm will be running on, and taking into account the requirements of the other resident programs, we allocate a hash table that is as large as possible. A table that is too large of course represents unused resources; on the other hand, it also has very favourable collision behaviour. Below, we will compare the two alternatives (tree and hashing) in detail, and show performance results of the hashing method. Knuth suggests to use a table of size M such that both M and M − 2 are primes ("twin primes"). This then leads to the following two hashing functions:

    h1(s) = key(s) mod M,    and    h2(s) = key(s) mod (M − 2).

After we have discussed the actual coding of states in Section 3.2, we will present the function key(s). By using a table of twin primes (computed once) it is easy to find a suitable value of M for a given storage demand.

The set of arcs A is so large that we cannot keep it in main memory during the reachability graph computation. Fortunately, this is not needed either; we can simply write the computed arcs to a secondary storage device. It may happen that there is more than one transition sequence between two tangible states. In that case, we store all of them and interpret them as being additive when constructing the CTMC. As an example, if we have the two arcs (s, s', a1) and (s, s', a2) stored, then we have a single arc from state s to state s' with rate a1 + a2 in the CTMC.
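A minimal C++ sketch of the double-hashing probe loop described above, assuming a twin-prime table size M. Two details here are our own choices rather than the text's: h2 is offset by 1 so that it can never be zero, and key(s) is simply taken to be the encoded state itself (the paper takes its first 32 bits); an all-ones word serves as the "empty" marker, which is safe since state codings use at most 40 bits:

```cpp
#include <cstdint>
#include <vector>

// Open addressing with double hashing (Knuth, Algorithm D), without
// chaining: the table supports insertions only, never removals.
class StateTable {
    static constexpr std::uint64_t EMPTY = ~0ull;  // not a valid state code
    std::uint64_t M;                               // table size, twin prime
    std::vector<std::uint64_t> slot;
public:
    explicit StateTable(std::uint64_t twin_prime_M)
        : M(twin_prime_M), slot(twin_prime_M, EMPTY) {}

    // Returns true if s was newly inserted, false if already present.
    bool insert(std::uint64_t s) {
        std::uint64_t key = s;                   // placeholder for key(s)
        std::uint64_t h1 = key % M;
        std::uint64_t h2 = 1 + key % (M - 2);    // in {1,...,M-2}, coprime to M
        for (std::uint64_t i = h1;; i = (i + h2) % M) {
            if (slot[i] == EMPTY) { slot[i] = s; return true; }  // unsuccessful search
            if (slot[i] == s) return false;                      // successful search
            // otherwise: a collision; probe the next address
        }
    }
};
```

Because M and h2 are coprime, the probe sequence visits every slot, so the loop always terminates as long as the table is not completely full.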

3.2 The coding of states

If we assume that a place can hold at most 255 tokens at any time, a single byte suffices to encode the marking of a place; hence, in an SPN with N places, a marking is a vector of N bytes. To number each marking uniquely, a 32-bit (4-byte) identifier is also required. In order to decrease the memory requirements per state, a number of facts need to be considered:

- For many SPNs place invariants can be computed, e.g., with the algorithm of Martinez and Silva [22]. In a place invariant over M places, the actual marking of one of the places can be computed when the other M − 1 place occupancies are known. In this way, per place invariant, the coding of one place marking can be omitted. In fact, we do not use the above algorithm, since it only computes positive place invariants. Instead, we solve the linear system vC = 0 (where C is the incidence matrix of the SPN and v an integer solution vector) in the integer domain, using a modified Gaussian elimination procedure. Since the division operation does not exist in the integer domain, least common multiples of rows are constructed, followed by subtractions, to bring the system of linear equations into upper-triangular form. Subsequently, overlapping parts in the thus-obtained place invariants are removed with the LLL-algorithm [19] of the Number Theory Library (see http://www.cs.wisc.edu/~shoup/ntl/).
- In SPNs with variable arc multiplicities, place invariants cannot generally be computed. To alleviate this problem, we therefore propose to use either constant arc multiplicities, or multiplicities linearly dependent on the marking in a single place.
- Once the place invariants have been computed, we can use them to compute upper bounds on the place occupancies. These can then be used to further reduce the storage requirements for the coding.
- In some SPNs places exist that will not contain tokens in any tangible marking, i.e., as soon as a token enters such a place, immediate transitions are enabled that cause an immediate removal of the token from the place. Hence, such places always contain zero tokens, and therefore do not need to be coded. Notice that we can do so because we eliminate vanishing markings on the fly.
- In many SPNs, the number of different token distributions that can exist in a subnet is so small that we can code this number with fewer bits than we would require for the coding of the individual markings per place in the subnet. In that case we simply use as coding the number of the particular token distribution for the subnet. Although the actual application of this reduction technique is far from trivial (see the example below), it often yields large gains.

As an example, consider the FMS model. It consists of 22 places, so that a naive coding would require 22 bytes. However, we can easily compute the 6 place invariants:

    P1 + P1wM1 + P1M1 + P1d + P1s + P1wP2 + P12 + P12wM3 + P12s + P12M3 = k,
    P2 + P2wM2 + P2M2 + P2d + P2wP1 + P2s + P12 + P12wM3 + P12s + P12M3 = k,
    P3 + P3M2 + P3s = k,
    P1M1 + M1 = 3,
    P12M3 + M3 = 2,
    P2M2 + M2 = 1.

We directly see that for k ≤ 15, all places contain at most 15 tokens, so that a nibble per place suffices. Furthermore, places P1d and P2d are directly connected to immediate transitions which will be enabled as soon as these places become marked; hence, in the tangible markings these places will not contain tokens.

In the first invariant, 9 places remain to be coded. The k tokens in these places can be distributed in C(k + 9 − 1, k) ways. Taking the logarithm of this value (base 2, rounded up) gives us the number of bits required for the coding of these 9 places. The three places contained in the third invariant can be coded similarly with ⌈log2 C(k + 3 − 1, k)⌉ bits. The places contained in the fourth and fifth invariants do not need to be coded anymore: P1M1 and P12M3 have already been coded via the first invariant, and the places M1 and M3 can be computed from those values. The sixth invariant allows us to code the marking of M2 and P2M2 with only 1 bit.

Of the second place invariant, only the four places P2, P2wM2, P2wP1 and P2s still need to be coded now; the other places have already been addressed. We now introduce a fifth pseudo place that holds the sum of the tokens in the places P12, P12wM3, P12s, P12M3 and P2M2. The k tokens in this place invariant can now be coded in ⌈log2 C(k + 5 − 1, k)⌉ bits, i.e., we explicitly take into account the four remaining places. However, what happens in detail in the rest of the places of this place invariant is not important to us. We therefore only code the number of tokens in the rest of the place invariant, which corresponds to the number of tokens in the pseudo place.

In summary, for various values of k, we need the following number of bits to code the states in the FMS example:

    k    |  9 | 11 | 13 | 15
    bits | 32 | 36 | 38 | 40

Hence, for values of k up to 15, 5 bytes suffice. Notice that Knottenbelt et al. [17] also require 5 bytes to code states in the FMS model with k = 7; however, in their coding scheme multiple states might be coded with the same code word (collisions), whereas this is not possible in our scheme.

Now that we have coded the states, we can also define the function key(s) that has been used in the hashing function. We simply let key(s) be equal to the first 32 bits of the encoded state s and interpret it as an unsigned integer.
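The bit counts in the table above can be reproduced mechanically: k indistinguishable tokens distributed over n places admit C(n + k − 1, k) configurations, so ⌈log2 C(n + k − 1, k)⌉ bits suffice to enumerate them. A small sketch (function names are ours, not PARSECS internals):

```cpp
#include <cmath>
#include <cstdint>

// Exact binomial coefficient C(n, k); each intermediate product
// r * (n - k + i) is divisible by i, so integer division is exact.
std::uint64_t binomial(std::uint64_t n, std::uint64_t k) {
    if (k > n - k) k = n - k;
    std::uint64_t r = 1;
    for (std::uint64_t i = 1; i <= k; ++i) r = r * (n - k + i) / i;
    return r;
}

// Bits needed to enumerate k tokens over `places` places.
int bits_for(std::uint64_t places, std::uint64_t k) {
    std::uint64_t d = binomial(k + places - 1, k);
    return (int)std::ceil(std::log2((double)d));
}

// Total for the FMS coding above: 9 places (first invariant),
// 3 places (third invariant), 1 bit (sixth invariant), and the
// 4 remaining places of the second invariant plus the pseudo place.
int fms_bits(std::uint64_t k) {
    return bits_for(9, k) + bits_for(3, k) + 1 + bits_for(5, k);
}
```

Evaluating fms_bits for k = 9, 11, 13, 15 indeed reproduces the 32, 36, 38 and 40 bits of the summary table.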

3.3 Implementation details

We implemented (in C++) a class to specify SPN models (using the CSPL syntax [10]) as well as the reachability graph generation algorithm [15, Chapter 14]. We implemented a balanced-tree-based approach, using the set data structure of the Standard Template Library (STL) [28]. This might not be the most efficient, but it does lead to very flexible and extensible code. We also implemented the hashing approach. As data structure for S_new we implemented both the queue and the stack, again using STL. In order to save main memory, we extended the stack-based approach such that only its top of size 1 MB is in main memory; the rest of the stack is stored on disk. If S_new grows above 1 MB, the lower 1/2 MB is pushed to disk; if S_new empties, at most 1/2 MB is popped from disk.
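The disk-backed stack can be sketched as follows. The spill area is modelled here by an in-memory vector standing in for the on-disk file of the real implementation, and the 1 MB / ½ MB sizes become a generic element capacity; this is an illustration of the spill policy, not the actual PARSECS code:

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

// A stack whose in-memory top holds at most `cap` elements; when it
// overflows, the lower half is spilled, and when it drains, half a
// window is reloaded. LIFO order is preserved across spills.
template <typename T>
class SpillStack {
    std::size_t cap;
    std::deque<T> hot;     // in-memory top of the stack
    std::vector<T> cold;   // spilled entries, oldest first ("disk")
public:
    explicit SpillStack(std::size_t cap_) : cap(cap_) {}
    bool empty() const { return hot.empty() && cold.empty(); }

    void push(const T& x) {
        hot.push_back(x);
        if (hot.size() > cap) {               // spill lower half to "disk"
            std::size_t n = cap / 2;
            cold.insert(cold.end(), hot.begin(), hot.begin() + n);
            hot.erase(hot.begin(), hot.begin() + n);
        }
    }

    T pop() {                                 // precondition: !empty()
        if (hot.empty()) {                    // reload up to cap/2 entries
            std::size_t n = std::min(cap / 2, cold.size());
            hot.insert(hot.begin(), cold.end() - n, cold.end());
            cold.erase(cold.end() - n, cold.end());
        }
        T x = hot.back();
        hot.pop_back();
        return x;
    }
};
```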

         SPNP (4.0)¹  SPNP (6.0)   PARSECS       PARSECS      PARSECS
                                   tree-based    hashing      optimised
    k    CPU     MB   CPU     MB   CPU      MB   CPU     MB   CPU     MB
    3    00:29    6   00:05    1   00:08     1   00:02    1   00:01    1
    4    16:23   26   00:15    8   00:14     3   00:11    2   00:09    1
    5     -:-   110   01:19   35   01:10    11   00:52    4   00:42    3
    6     -:-    --   04:00  125   04:45    38   03:25    9   02:45    6
    7²    -:-    --   14:00  391   16:03   116   11:18   23   09:19   16
    8     -:-    --    -:-    --   48:08   308   32:58   59   27:24   40
    9³    -:-    --    -:-    --  131:54   595   85:53  146   71:51   95
    10    -:-    --    -:-    --    -:-     --  252:10  326  177:54  245
    11    -:-    --    -:-    --    -:-     --    -:-    --  832:09  484

Table 2: A comparison of the required CPU times (wall-clock time in minutes:seconds) and memory usage (in MB) for the serial generation of various SPN models using different tools

We decided to eliminate the vanishing markings "on the fly", knowing that this is more efficient for larger models; indeed, for small models, i.e., with less than 1000 states, SPNP (4.0), which removes the vanishing markings in a second pass, might be faster.

3.4 Numerical examples

Below, we will first compare our prototype implementation with SPNP, before we study the behaviour of the hashing data structure in more detail.

Comparison with SPNP

In Table 2 we show the CPU and memory usage for the complete generation of a CTMC from an SPN model for SPNP (4.0 and 6.0) and for PARSECS in three versions, run on namur (Pentium II, 300 MHz, 512 MB) using the FMS example. The three versions of our tool correspond to the use of trees or hashing tables for the state space, the latter with either a straightforward state coding (using a nibble per place) or the optimised one.

The first thing we observe is that PARSECS, even when employing trees, requires far less memory than SPNP (both versions). The main reason for this is the fact that we do not actually store the full generator matrix of the CTMC in main memory, which SPNP apparently does. Furthermore, we observe that the use of hashing once again reduces the memory requirements substantially. Where SPNP could still compute the case k = 7, we are able to continue up to the case k = 11, which is 33 times as large (measured in states) or even 46 times (measured in transitions); see Table 1.

The second thing we observe is that our prototype using balanced trees is about as fast as SPNP (6.0). If we were to use dedicated data structures instead of the STL data

¹ These times have been measured on a different machine and subsequently scaled for comparison purposes; SPNP (4.0) does not run on namur.
² Without storing the .mc file.
³ For k ≥ 9, the measured times do not include the actual storage of the reachability graph; our disk space does not suffice for this purpose.


structures, we think we can increase the speed of our prototype slightly. However, a big gain is reached when using hashing tables as the data structure for the state space; it simply avoids the lengthy tree searches! The fact that the hashing table implementation is very efficient implies that the collision phenomenon has limited impact; we discuss this in detail below.

The fact that STL allows us to easily change data structures for storing the states that still need to be investigated, i.e., S_new, allows us to study the impact of these differences on the structure of the generated CTMC. When using a queue, the states are investigated in a FCFS manner, whereas using a stack yields an LCFS ordering. In Figure 4 we show the occurrence of nonnull entries in the generator matrices when using a queue and a stack; note that a pixel indicates that in a block of 24 × 24 entries, at least one is nonnull. In both matrices, the number of nonnull entries is the same; however, they are far more scattered in the stack case. We will further investigate the impact of the generator matrix structure on the efficiency of the solution process, and especially on the amount of communication overhead in a distributed solution.

(a) Queue

(b) Stack

Figure 4: The occurrence of nonnull entries in 24 × 24 blocks in the generator matrices for the FMS model for k = 4 in case of a queue and a stack data structure for S_new

The behaviour of the hashing table

One problem that might occur when using hashing tables is the growth of the number of collisions when the table becomes more filled. In order to investigate this number of collisions, we studied the FMS model for k = 5, …, 8 and varied the size of the allocated hashing table. During the complete reachability graph generation we then measured the number of collisions and divided this by the total number of insertions made (denoted ρ). In Figure 5 we show ρ as a function of the table filling at the end of the generation process for k = 5 and 8. As can be observed, even for a filling as large as 90%, ρ remains relatively small. The curves for k = 6 and k = 7 are very similar.



[Figure 5 (plot): the mean number of collisions per insertion, ρ (0 to 9), versus the hash table degree of filling (0.2 to 1.0), for k = 5 and k = 8]

Figure 5: The mean number of collisions (ρ) per insertion in the table during the whole generation process for varying hash table degree of filling

Even if ρ is larger, say 4 or 5, this does not influence the overall speed of the computations dramatically; it merely corresponds to a sequence of simple tries to write an entry to the table. To illustrate this, Figure 6 shows the required overall computation time as a function of the hash table degree of filling at the end of the computation in case k = 7 (using namur). Notice that the curve is very flat, up to a degree of filling of 90%. There an increase occurs; however, this increase is not very dramatic, if one takes into account the fact that the table is very full there. The last two plotted points correspond to table fillings of 0.99928 and 0.99995, yielding an average number of collisions of 6.4 and 8.3 per insertion, respectively. The overall run-time, however, is only 40 to 55 seconds longer (less than 10%). To illustrate the efficiency of the hashing scheme further, consider the fact that the number of free table entries at the end of the generation process is only 1183 and 73, respectively, out of a total of 1 639 440!

4 Distributed state space generation We present the basic algorithm in Section 4.1 and implementation issues in Section 4.2 before we show numerical results in Section 4.3.

4.1 The basic algorithm

[Figure 6 (plot): the generation time t in seconds (roughly 530 to 600) versus the hash table degree of filling (0.2 to 1.0), for k = 7]

Figure 6: The generation time (in seconds) for the FMS example with k = 7 for varying hash table degree of filling

The main problem we face when using multiple processors to explore the reachability graph of an SPN is to divide the work equally. Since we do not know the size and structure of the reachability graph in advance, we do not know how much work there is to be divided. Nevertheless, we use an a priori state allocation function Z : S → {1, …, N}. Our distributed algorithm follows the lines of the ones presented by Caselli et al. [6], Ciardo and Nicol [8] and Stern and Dill [25] and works as follows (see also Figure 7):

- all processors start their exploration program;
- processor i with i = Z(Initial) starts to explore successor states;
- upon generating a state s' from state s:
  - the allocation for the new state s' is computed: j := Z(s');
  - if j = i, then state s' is handled locally;
  - if j ≠ i, then the state s' and the arc (s, s', a_{s,s'}) are sent to processor j;
- all processors process the states received from others, as well as those generated locally.

The algorithm terminates when every processor i has its set S^i_new of states to be explored empty. Notice that the choice of Z is crucial for good performance (load balance, communication overhead), but the domain of Z is not known in advance. We have experimented with a variety of allocation functions and comment on our experience below.

4.2 Implementation details

We have implemented the distributed algorithm using C++ and the STL. As communication platform we used MPICH (see http://www-c.mcs.anl.gov/mpi/mpich/), a free implementation of the Message Passing Interface (MPI) [29]. Of course, we incorporated the efficiency improvements (state coding, hashing) presented for the sequential algorithm.

To make the communication more efficient, we buffer messages for each of the other processes separately and send them as soon as a threshold of 100 messages is reached. To avoid starvation, the contents of every buffer are also sent every second, even if the threshold has not been reached yet. For termination detection, we used an implementation of Dijkstra's algorithm [14].

 1. if Z(Initial) = i then S^i := {Initial} else S^i := ∅;
 2. S^i_new := S^i; A^i := ∅;
 3. while "not received terminate message" do
 4.   while ∃ s ∈ S^i_new do
 5.     S^i_new := S^i_new \ {s};
 6.     for each t ∈ Enabled(s) do
 7.       s' := NewState(s, t);
 8.       j := Z(s');
 9.       if j ≠ i then
10.         SendStateArc(j, s', (s, s', t));
11.       else
12.         if s' ∉ S^i then
13.           S^i := S^i ∪ {s'};
14.           S^i_new := S^i_new ∪ {s'};
15.         A^i := A^i ∪ {(s, s', t)};
16.     end for;
17.   end while;
18.   S^i_new := S^i_new ∪ ReceiveStates; A^i := A^i ∪ ReceiveArcs;
19. end while;

Figure 7: The distributed reachability graph generation program

To test our distributed algorithm we used a cluster of workstations at the computing center of the RWTH, consisting of 7 Sparc Ultra machines (143 MHz) and 5 Sparc Ultras (167 MHz), connected via a 10 Mbps Ethernet. The faster machines are only used when 8 or more nodes are employed.
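The per-destination message buffering with a flush threshold can be sketched as follows. In the real tool the flush would be an MPI send; here it is a callback so that the sketch stays self-contained, and the one-second timer flush is omitted (flushAll() is the hook it would call). All names are ours, not PARSECS internals.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Illustrative per-destination output buffering, used to batch states
// and arcs destined for other processors before sending them.
struct Message {
    std::uint32_t stateKey;   // stand-in for a coded state plus arc info
};

class OutBuffers {
public:
    using FlushFn = std::function<void(int dest, const std::vector<Message>&)>;

    OutBuffers(int numProcs, std::size_t threshold, FlushFn flush)
        : buffers_(numProcs), threshold_(threshold), flush_(std::move(flush)) {}

    // Buffer a message for processor 'dest'; flush once the threshold
    // (100 messages in the paper) is reached.
    void send(int dest, Message m) {
        buffers_[dest].push_back(m);
        if (buffers_[dest].size() >= threshold_) flushOne(dest);
    }

    // Called periodically to avoid starving destinations with few messages.
    void flushAll() {
        for (std::size_t d = 0; d < buffers_.size(); ++d)
            if (!buffers_[d].empty()) flushOne(static_cast<int>(d));
    }

private:
    void flushOne(int dest) {
        flush_(dest, buffers_[dest]);
        buffers_[dest].clear();
    }

    std::vector<std::vector<Message>> buffers_;
    std::size_t threshold_;
    FlushFn flush_;
};
```

Batching amortises the per-message overhead of the network layer, while the periodic flush bounds how long a nearly empty buffer can delay its receiver.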

4.3 Numerical results

As a first test case, we used the FMS model with k = 7. In Table 3 we show the computation times and speed-up values reached on the computing center cluster (with N the number of active nodes). The first line (N = s) shows the required time for a serial solution, whereas the second line (N = 1) shows the times for the parallel solution with only 1 node. These values have been measured using namur and scaled down to account for the "slow" 143 MHz Ultras; we could not generate the CTMC for k = 7 directly on the 143 MHz Ultra machines, because they only had 128 MB of main memory. A comparison of the lines "1" and "s" provides insight into the overhead of parallelising the code. For comparison purposes, we also show the speed-ups relative to the sequential code (SU_s). As can be observed, the generation process scales fairly well, although the speed-ups are not very high. Indeed, the speed-ups for the tree-based approach are better than those for the algorithms using hashing; however, this does not imply that the former approach is better! Notice that the tree-based approach shows irregularities for 10 and 11 nodes. These are accounted for by load imbalance; the employed allocation function is not optimal in that case. Notice that when we use 12 processors, we generate a CTMC with 1.64 million states

         tree-based          hashing
  N       t   SU_s  SU_1      t   SU_s  SU_1
  s    2292    1.0    -     1082    1.0    -
  1    2781    0.8   1.0    1227    0.9   1.0
  2    1458    1.6   1.9     739    1.5   1.7
  3     972    2.4   2.9     575    1.9   2.1
  4     754    3.0   3.7     468    2.3   2.6
  5     606    3.8   4.6     365    3.0   3.4
  6     522    4.4   5.3     327    3.3   3.7
  7     441    5.3   6.3     294    3.7   4.2
  8     406    5.6   6.8     248    4.4   4.9
  9     367    6.2   7.6     225    4.8   5.4
 10     367    6.2   7.6     196    5.5   6.3
 11     396    5.8   7.0     193    5.6   6.4
 12     287    8.0   9.7     182    5.9   6.7
Table 3: Computation times (wall-clock time, in seconds) and speed-ups for the PARSECS distributed generation of the CTMC for the FMS example with k = 7

and 13.5 million transitions in slightly more than 3 minutes. Apart from the gain in time, we gain in the use of main memory. The model with k = 7 requires 116 MB in the sequential case (158 MB in the parallel case with one node). These memory requirements are roughly split among the participating nodes, so that all the required data can be kept in main memory all of the time.

For the derivation of the numbers in Table 3, we have used two different allocation functions. The tree-based approach uses the allocation function proposed by Ciardo [8]:

    Z_tree(s) = ((P1 + 1013 * P2 + 1013^2 * P3) MOD 99991) MOD N.

This function works fine; however, it shows irregularities when N = 10 or 11. Furthermore, we found that for other values of k, e.g., k = 8, the load distribution of this allocation function was not so good. Instead, in the hashing approach, we used a simpler allocation function:

    Z_hash(s) = (key(s) MOD 99991) MOD N,

where key(s) is the 32-bit key also used for the hashing tables. Reusing this key saves computational overhead, especially in comparison with the complex allocation function proposed by Knottenbelt et al. [17]. In all the experiments we have performed so far, the function Z_hash(s) did very well. This is also illustrated in Table 4, where we show the (rounded) percentage of cross arcs (X) that have to be communicated between the processor nodes: when increasing the number of nodes, an increase in this percentage is observed; however, it seems to converge to around 45%. Finally, for the case N = 12, Table 5 shows the distribution of states per processor (minimum and maximum), as well as the cross-arc percentage, for three different allocation functions. As can be observed, the function proposed by Knottenbelt et al. distributes very well, but at the cost of a high communication overhead. Our allocation function distributes better than the one proposed by Ciardo, and even yields a slightly smaller cross-arc percentage.
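The two allocation functions can be written out directly. This sketch assumes the place markings P1 to P3 and the 32-bit state key are available as unsigned integers (function and parameter names are ours); 32-bit arithmetic suffices since the markings in these models are small.

```cpp
#include <cstdint>

// Ciardo's tree-based allocation function: P1, P2, P3 are the markings
// of three selected places; 99991 is prime and 1013 spreads the markings.
std::uint32_t Ztree(std::uint32_t P1, std::uint32_t P2, std::uint32_t P3,
                    std::uint32_t N) {
    return ((P1 + 1013u * P2 + 1013u * 1013u * P3) % 99991u) % N;
}

// The key-based allocation function: reuses the 32-bit hash key already
// computed for the state, so no extra per-state work is needed.
std::uint32_t Zhash(std::uint32_t key, std::uint32_t N) {
    return (key % 99991u) % N;
}
```

The intermediate modulus by the prime 99991 decorrelates the result from regularities in the markings before the final reduction modulo the node count N.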

 N   1   2   3   4   5   6   7   8   9  10  11  12
 X   0  21  35  29  41  41  44  34  47  46  46  43

Table 4: The percentage of cross arcs X in the FMS model (k = 7) for an increasing number of processors N

 allocation function    states/processor    cross arcs
                          min       max         X
 Knottenbelt            136109    137112        89
 Ciardo                 106002    157962        46
 ours                   122688    155772        43

Table 5: The distribution of states over processors and the percentage of cross arcs in the FMS model (k = 7) with N = 12 processors, for three different allocation functions

5 On the solution of large CTMCs

The solution of large CTMCs for their steady-state probabilities has been addressed by many researchers. We did not investigate new techniques in this respect; we only investigated how well standard techniques such as Jacobi, Gauss-Seidel and SOR iterations are suitable for distributed implementation [27]. In a distributed setting, each of the involved processors has to compute only a part of the next iteration vector. Therefore, in each processor, only a part of the generator matrix has to be stored. However, even to compute just a part of the next iteration vector, the full current approximation to the steady-state probability vector has to be available. This can only be accomplished when, after every step in the iterative procedure, all the involved processors exchange their respective parts of the iteration vector (and possibly perform a renormalisation). Furthermore, it should be understood that Gauss-Seidel (and SOR) cannot really be used, since not all the newly computed elements of the iteration vector are locally available in each processor. Hence, a mixed form of Gauss-Seidel and Jacobi results; for a discussion of the convergence properties of these kinds of iterative methods, see [20].

Storing the complete (intermediate) solution vector(s) plus a large part of the generator matrix is beyond what is currently feasible: we are currently not in the position to solve the global balance equations. Therefore, we are considering methods that avoid this memory problem, e.g., by recomputing the non-zero entries of the CTMC on-the-fly during the solution process, every time they are required. Furthermore, we will investigate disk-based methods for the solution of very large CTMCs as well [13, 12].

All of the above approaches cause an enormous communication overhead due to the exchange of the complete iteration vector after every step. To avoid this, blocked variants of the standard numerical techniques can be employed. This works fine for the Jacobi method. Notice that it is difficult to say a priori how large the number of local iterations should be before a global exchange of the iteration vector has to take place. We are currently experimenting with these algorithm parameters.
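The row-block structure that makes Jacobi attractive for distribution can be sketched with a single sweep over one processor's rows. This is a generic, sequential, dense illustration (not the PARSECS solver; the generator matrices in question are sparse and distributed): a processor owning rows [lo, hi) needs the full old vector x but writes only its own slice of the new one.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One Jacobi sweep for A x = b over the row block [lo, hi). Every new
// component is computed from the previous iterate only, which is what
// allows all row blocks to be processed in parallel, followed by a
// global exchange of the updated slices.
void jacobiSweep(const std::vector<std::vector<double>>& A,
                 const std::vector<double>& b,
                 const std::vector<double>& x,
                 std::vector<double>& xNew,
                 std::size_t lo, std::size_t hi) {
    for (std::size_t i = lo; i < hi; ++i) {
        double s = b[i];
        for (std::size_t j = 0; j < x.size(); ++j)
            if (j != i) s -= A[i][j] * x[j];
        xNew[i] = s / A[i][i];
    }
}
```

A blocked variant would run several such sweeps on the local block, reusing locally updated entries, before the (expensive) global exchange of the iteration vector takes place.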


6 Summary and outlook

In this paper we have presented some results of our current research on the evaluation of large SPNs, where the focus has been on the CTMC generation process. Regarding the sequential generation, it has been shown that hashing tables are to be preferred over tree-based approaches, as the former have far less memory overhead and are much faster as well. Furthermore, the use of place invariants also reduces the memory requirements for storing states significantly. Using these techniques, we have been able to generate CTMCs with almost 55 million states and 623 million transitions on a single workstation.

Although these results are really promising, we think that the distributed generation and solution of SPNs comprises a fundamental step towards the evaluation of even larger models. Using the efficient memory techniques of the sequential case, we have been able to solve even larger models on a cluster of standard workstations in reasonable time. A proper work distribution is especially important here; experiments have shown that our proposed distribution function pairs a moderate cross-arc percentage with a reasonable load distribution. However, we require more experimental work to establish its suitability more generally.

In the near future we will continue our work on the distributed generation of CTMCs; however, we also have to increase our activities on the actual CTMC solution. We will especially investigate ways to balance the generation and solution process evenly over as many processors as possible, study the impact of disk-based methods [13, 12, 26], as well as the impact of various memory models for distributed systems (distributed memory, distributed shared memory, and shared memory [4]).

References

[1] S. Allmaier, S. Dalibor, and D. Kreische. Parallel graph generation algorithms for shared and distributed memory machines. In Proceedings of the Parallel Computing Conference (PARCO '97). Springer-Verlag, 1997.
[2] S. Allmaier and G. Horton. Parallel shared-memory state space exploration in stochastic modelling. In G. Bilardi, A. Ferreira, R. Lüling, and J. Rolim, editors, Solving Irregularly Structured Problems in Parallel, Lecture Notes in Computer Science 1253. Springer-Verlag, 1997.
[3] S. Allmaier, M. Kowarschik, and G. Horton. State space construction and steady-state solution of GSPNs on a shared-memory multiprocessor. In Proceedings of the 7th International Workshop on Petri Nets and Performance Models, pages 112-121. IEEE Computer Society Press, 1997.
[4] C. Amza, A.L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel. TreadMarks: Shared memory computing on networks of workstations. IEEE Computer, 29(2):18-28, 1996.
[5] S. Caselli, G. Conte, F. Bonardi, and M. Fontanesi. Experiences on SIMD massively parallel GSPN analysis. In G. Haring and G. Kotsis, editors, Computer Performance Evaluation: Modelling Techniques and Tools, volume 794 of Lecture Notes in Computer Science, pages 266-283. Springer-Verlag, 1994.
[6] S. Caselli, G. Conte, and P. Marenzoni. Parallel state space exploration for GSPN models. In G. De Michelis and M. Diaz, editors, Applications and Theory of Petri Nets 1995, volume 935 of Lecture Notes in Computer Science, pages 181-200. Springer-Verlag, 1995.
[7] G. Chiola. Compiling techniques for the analysis of stochastic Petri nets. In R. Puigjaner and D. Potier, editors, Modelling Techniques and Tools for Computer Performance Evaluation, pages 11-24. Plenum Press, 1989.
[8] G. Ciardo, J. Gluckman, and D. Nicol. Distributed state space generation of discrete-state stochastic models. ORSA Journal on Computing, forthcoming 1998.
[9] G. Ciardo and A.S. Miner. Storage alternatives for large structured state spaces. In R. Marie, B. Plateau, M. Calzarossa, and G. Rubino, editors, Computer Performance Evaluation, Lecture Notes in Computer Science 1245, pages 44-57. Springer-Verlag, 1997.
[10] G. Ciardo, J. Muppala, and K.S. Trivedi. SPNP: Stochastic Petri net package. In Proceedings of the 3rd International Workshop on Petri Nets and Performance Models, pages 142-151. IEEE Computer Society Press, 1989.
[11] G. Ciardo and K.S. Trivedi. A decomposition approach for stochastic reward net models. Performance Evaluation, 18(3):37-59, 1993.
[12] D. Deavours and W.H. Sanders. An efficient disk-based tool for solving very large Markov models. In R. Marie, B. Plateau, M. Calzarossa, and G. Rubino, editors, Computer Performance Evaluation, Lecture Notes in Computer Science 1245, pages 58-71. Springer-Verlag, 1997.
[13] D. Deavours and W.H. Sanders. "On-the-fly" techniques for stochastic Petri nets and extensions. In Proceedings of the 7th International Workshop on Petri Nets and Performance Models, pages 132-141. IEEE Computer Society Press, 1997.
[14] E.W. Dijkstra, W.H.J. Feijen, and A.J.M. van Gasteren. Derivation of a termination detection algorithm for distributed computations. Information Processing Letters, 16:217-219, 1983.
[15] B.R. Haverkort. Performance of Computer Communication Systems: A Model-Based Approach. John Wiley & Sons, 1998.
[16] G. Holzmann. The model checker SPIN. IEEE Transactions on Software Engineering, 23(5):279-295, 1997.
[17] W. Knottenbelt, M. Mestern, P. Harrison, and P. Kritzinger. Probability, parallelism and the state space exploration problem. In R. Puigjaner, N.N. Savino, and B. Serra, editors, Computer Performance Evaluation, Lecture Notes in Computer Science 1469, pages 165-179. Springer-Verlag, 1998.
[18] D.E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, 1973.
[19] A.K. Lenstra, H.W. Lenstra, and L. Lovász. Factoring polynomials with rational coefficients. Mathematische Annalen, 261:515-534, 1982.
[20] B. Lubachevsky and D. Mitra. A chaotic asynchronous algorithm for computing the fixed point of a nonnegative matrix of unit spectral radius. Journal of the ACM, 33(1):130-150, 1986.
[21] P. Marenzoni, S. Caselli, and G. Conte. Analysis of large GSPN models: a distributed solution tool. In Proceedings of the 7th International Workshop on Petri Nets and Performance Models, pages 122-131. IEEE Computer Society Press, 1997.
[22] J. Martinez and M. Silva. A simple and fast algorithm to obtain all invariants of a generalized Petri net. In C. Girault and W. Reisig, editors, Application and Theory of Petri Nets; Informatik Fachberichte 52, pages 301-310. Springer-Verlag, 1981.
[23] D. Nicol and G. Ciardo. Automated parallelization of discrete state-space generation. Journal of Parallel and Distributed Computing, 47:153-167, 1997.
[24] U. Stern and D.L. Dill. Improved probabilistic verification by hash compaction. In IFIP WG 10.5 Advanced Research Workshop on Correct Hardware Design and Verification Methods, pages 206-224, 1995.
[25] U. Stern and D.L. Dill. Parallelizing the Murφ verifier. In O. Grumberg, editor, Computer Aided Verification, Lecture Notes in Computer Science 1254, pages 256-267. Springer-Verlag, 1997.
[26] U. Stern and D.L. Dill. Using magnetic disk instead of main memory in the Murφ verifier. In A.J. Hu and M.Y. Vardi, editors, Computer Aided Verification, volume 1427 of Lecture Notes in Computer Science, pages 172-183. Springer-Verlag, 1998.
[27] W.J. Stewart. On the use of numerical methods for ATM models. In H. Perros, G. Pujolle, and Y. Takahashi, editors, Modelling and Performance Evaluation of ATM Technology, pages 375-396. North-Holland, 1993.
[28] B. Stroustrup. The C++ Programming Language. Addison-Wesley, third edition, 1997.
[29] D.W. Walker. The design of a standard message passing interface for distributed memory concurrent computers. Parallel Computing, 20:657-673, 1994.
[30] P. Wolper and D. Leroy. Reliable hashing without collision detection. In Computer-Aided Verification '93, volume 697 of Lecture Notes in Computer Science, pages 59-70, 1993.

A The Kanban model

The Kanban model has been used in [9] as an example SPN to generate very large Markov chains. For conciseness, we do not elaborate on this model here. What is important to know is that two variants of the model exist: one with certain synchronisation transitions timed, and one with these transitions of the immediate type. These two variants are distinguished in our results as well. Furthermore, by increasing the initial number of tokens (denoted k here, and N in [9]), we obtain very large models, as illustrated in Table 6. Using the distributed generation procedure, we obtained the speed-ups reported in Table 7, again using the key-based allocation function. As state coding technique, we simply used a nibble per place. The results regarding efficiency and scalability obtained for this model confirm our results for the FMS model. Note that the parallel generation with only 1 node is almost as fast as the pure sequential generation (SU_s(N = 1) = 0.96).
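The nibble-per-place coding can be sketched as follows; this assumes every place holds fewer than 16 tokens (so a marking fits in 4 bits), and the function names are ours, not those of PARSECS.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative nibble-per-place state coding: each place marking,
// assumed to be below 16, occupies 4 bits, halving the memory needed
// per state compared to one byte per place.
std::vector<std::uint8_t> encodeState(const std::vector<std::uint8_t>& marking) {
    std::vector<std::uint8_t> code((marking.size() + 1) / 2, 0);
    for (std::size_t p = 0; p < marking.size(); ++p)
        code[p / 2] |= static_cast<std::uint8_t>(
            (marking[p] & 0x0Fu) << (4 * (p % 2)));  // low nibble first
    return code;
}

std::vector<std::uint8_t> decodeState(const std::vector<std::uint8_t>& code,
                                      std::size_t numPlaces) {
    std::vector<std::uint8_t> marking(numPlaces);
    for (std::size_t p = 0; p < numPlaces; ++p)
        marking[p] = (code[p / 2] >> (4 * (p % 2))) & 0x0Fu;
    return marking;
}
```

With millions of states in memory at once, halving the bytes per state is what keeps the generation of the larger Kanban instances feasible on workstation-class machines.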

         immediate                    timed
 k       states         arcs         states         arcs
 1           152          600            160          616
 2         3 816       23 832          4 600       28 120
 3        41 000      316 360         58 400      446 400
 4       268 475    2 343 050        454 475    3 979 850
 5     1 270 962   12 025 566      2 546 432   24 460 016
 6     4 785 536   47 943 168     11 261 376            -
 7    15 198 912  158 891 712              -            -
 8    42 324 525  457 291 980              -            -

Table 6: The number of states and arcs in the Kanban model for an increasing number of initial tokens (k), for both the timed and immediate variant

         hashing
  N       t   SU_s  SU_1
  s    1308   1.0     -
  1    1367   0.96   1.0
  2     884   1.5    1.5
  3     638   2.1    2.1
  4     523   2.5    2.6
  5     442   3.0    3.1
  6     405   3.2    3.4
  7     368   3.5    3.7
  8     353   3.7    3.9
  9     290   4.5    4.7
 10     285   4.6    4.8
 11     259   5.0    5.3
 12     240   5.4    5.7

Table 7: Speed-ups for the Kanban model with k = 6
