A Petri net approach to fault detection and diagnosis in distributed systems. Armen Aghasaryan, Renée Boubour, Eric Fabre, Claude Jard, Albert Benveniste.
PUBLICATION INTERNE No 1117
A PETRI NET APPROACH TO FAULT DETECTION AND DIAGNOSIS IN DISTRIBUTED SYSTEMS
ISSN 1166-8687
ARMEN AGHASARYAN, RENÉE BOUBOUR, ERIC FABRE, CLAUDE JARD, ALBERT BENVENISTE
IRISA CAMPUS UNIVERSITAIRE DE BEAULIEU - 35042 RENNES CEDEX - FRANCE
INSTITUT DE RECHERCHE EN INFORMATIQUE ET SYSTÈMES ALÉATOIRES
Campus de Beaulieu – 35042 Rennes Cedex – France
Tél. : (33) 02 99 84 71 00 – Fax : (33) 02 99 84 71 71
http://www.irisa.fr
A Petri net approach to fault detection and diagnosis in distributed systems
Armen Aghasaryan, Renée Boubour, Eric Fabre, Claude Jard, Albert Benveniste
Theme 1: Networks and systems. Projects Pampa & Sigma2
Publication interne n°1117, August 6, 1997, 41 pages
Abstract: This report presents a new use of safe Petri nets in the field of distributed Discrete Event Dynamic Systems, with application to telecommunication network management. A long-range objective of this study is to provide a generic supervisor that can easily be distributed over a set of sensors. Petri nets are used to provide both a model and an algorithm in the fault management domain. The key features of our approach are: (1) we take advantage of the ability of Petri nets to model concurrency in distributed systems; (2) we refuse to use the marking graph in our algorithms, in order to avoid state explosion, and rely instead on the so-called partial order semantics of Petri nets; and (3) our algorithms use net unfolding techniques associated with partial order semantics, and extend them to the probabilistic case by providing a generalized Viterbi algorithm. This report is composed of two independent parts. The first one concentrates on application, motivations, and modelling. The second part is devoted to a precise mathematical modelling and to algorithms. In particular, an original notion of stochastic Petri net is developed, which provides fully independent behaviors to regions of the net that are not directly interacting.
Key-words: (distributed) DEDS, safe Petri net, stochastic Petri net, partial order semantics, Viterbi algorithm, telecommunication network, fault management, error correlation.
This work is supported by France Telecom/CNET, contract 95 1B 151.
CENTRE NATIONAL DE LA RECHERCHE SCIENTIFIQUE
Centre National de la Recherche Scientifique (URA 227) Université de Rennes 1 – Insa de Rennes
Institut National de Recherche en Informatique et en Automatique – unité de recherche de Rennes
A Petri net approach to the problems of fault detection and diagnosis in distributed systems
Résumé: This report presents an original use of Petri nets in the domain of distributed discrete event systems. The target application is fault management in telecommunication networks, with the objective of developing a supervision system distributed over a set of network sensors. The approach developed here is distinguished by several points: (1) we use the natural ability of Petri nets to model concurrency in distributed systems; (2) we refuse to resort to the marking graph in the diagnosis algorithms, to avoid the explosion of the number of states; (3) we rely instead on the partial order semantics of Petri nets, in particular through unfolding techniques. This formalism is equipped with a probabilistic framework, in which the classical techniques of diagnosis by dynamic programming (Viterbi algorithm) are extended. The report is composed of two independent parts. The first is devoted to the description of the application and to the motivations of the model. The second part develops the mathematical framework and the diagnosis algorithms. In particular, a new notion of stochastic Petri net is proposed, which provides independent behaviors to regions of the net that have no direct interaction.
Key-words: (distributed) discrete event systems, safe Petri net, stochastic Petri net, partial order semantics, Viterbi algorithm, telecommunication network, fault detection, diagnosis, error correlation.
Contents

I Application to telecommunication networks, motivations, and modelling
1 Introduction
2 Application Example
  2.1 SDH Network
  2.2 Modeling Fault and Alarm Propagations
  2.3 Using Petri Nets and Partial Orders
3 Propagation Model
  3.1 Syntax
  3.2 Semantics
    3.2.1 Choice of Partial Order Semantics
    3.2.2 From Fault Net to 1-Safe System Net
    3.2.3 Family of Partial Orders on Alarms
4 Principle of Detection Algorithm
  4.1 Alarm Observation
  4.2 Correlating alarm patterns with fault net histories
5 Related Work
6 Concluding Remarks and Future Work

II Extending Viterbi algorithm and HMM techniques to Petri nets
7 Objectives and ideas
  7.1 Motivations
  7.2 Brief review of existing models of stochastic Petri nets
  7.3 Further ideas for randomization
  7.4 The quest for exact matching between concurrency and probabilistic independence in probabilistic Petri Nets
8 A quick overview of CSS
  8.1 Hybrid system
  8.2 Combination of systems
9 A CSS model for stochastic Petri nets
  9.1 Unfolding of time
  9.2 Mini-systems
    9.2.1 Place
    9.2.2 Transition
    9.2.3 Standby
  9.3 Validity of the CSS model
10 Likelihood of a path
  10.1 Notion of tile
  10.2 Computation of the likelihood
  10.3 Consequences
11 Search for the most likely path
  11.1 Back to our original problem
  11.2 Algorithm
  11.3 Example
12 Conclusion
Part I
Application to telecommunication networks: motivations and modelling

1 Introduction
The complexity of telecommunication networks and the huge amount of information they carry have increased the demand for network management systems. In particular, network fault management requires a lot of expertise and is becoming critical, since breakdowns of telecommunication networks cause huge financial losses. Most current proposals are built on an ad hoc basis. They are also much more concerned with structuring the management system than with designing dedicated algorithms. There is a pressing need to establish a theoretical foundation for network fault management. This report aims to contribute to this foundation by focusing on the treatment of causal dependencies between alarms and faults. The main idea is to take into account the essentially distributed nature of the problem [3]. This is done by using Petri nets (both non-probabilistic and probabilistic) and their "true concurrency" semantics. We think this constitutes an original way to attack the problem of alarm correlation. Telecommunication networks are a prototype of large, distributed systems, and we believe that the approach we develop here can be useful for other such systems. Petri nets are well known as a powerful model for concurrent systems. We decided to found our approach on the explicit description of fault and alarm propagation using 1-safe Petri nets. This allows us to express and deal with multiple faults, alarm interleaving, and causal dependencies between faults and alarms. As fault propagation models in telecommunication networks may result in large Petri nets, it is critical to avoid using the marking graph, which typically blows up due to state explosion. This also holds when probabilistic Petri nets are considered: the associated Markov chain should never be explicitly constructed.
Our investigation is supported by applying this approach to a specific network system: the SDH network (Synchronous Digital Hierarchy), in collaboration with CNET (the research center of France Telecom). The model and algorithm presented here also serve as a basis for a probabilistic approach to fault detection. This first part introduces the problem and its modelling, and provides a sketch of the associated algorithms. Mathematics and algorithms are detailed in the second part. The first part is organized as follows. Sect. 2 presents the real SDH example, which constitutes the context of our study. Sect. 3 details the fault and alarm propagation model, and establishes its mathematical model. Sect. 4 presents the main principles of the detection algorithm. Related work is discussed in Sect. 5, before the conclusion.
2 Application Example

2.1 SDH Network
SDH (Synchronous Digital Hierarchy) was developed from SONET (Synchronous Optical NETwork) to manage optical interconnections as well as existing plesiochronous signals. An SDH signal is constructed from STM-1 (Synchronous Transport Module level 1) frames. Higher interface rates, STM-n signals, are formed by byte-interleaving n STM-1 signals (see e.g. [13]). Figure 1 illustrates a part of an SDH network, with its associated management network.

[Figure 1 diagram: supervisor S and sensors S1, S2, S3; two STM-16 elements and a digital switch (DG); the faulty STM-1 port P (LOS) and the related ports D, F (forward) and B.]

Figure 1: Example of SDH data network and its associated management network. The data network is made of network elements (STM-16, DG for digital switch). These elements are connected via bi-directional connections. Each of these elements contains STM-1 ports. For such a port, there are several elements of different levels (synchronous physical interface level (SPI), multiplex section level (MS), administrative unit level (AU), etc.). A telecommunication network is thus viewed as a set of network elements. These elements are grouped into sites, each of them being associated with a sensor. Alarms go from network elements through the sensors $s_i$ to the supervisor ($S$). In this example, three sites are associated with sensors ($s_1$, $s_2$, $s_3$).
2.2 Modeling Fault and Alarm Propagations
In this study, modeling consists in describing the expected manifestations of part of the supervised network, like the "event model" described in [17]. It does not, however, consist in describing the whole behavior of the considered system: knowledge about faults in telecommunication networks is more often concerned with the manifestations caused by faults than with the full specifications of the network elements.
[Figure 2 diagram: one horizontal line per component, F (AU), P (AU), P (MS), P (SPI), D (SPI), D (MS), D (AU), B (AU), carrying the alarms AU-AIS, MS-AIS, LOS, St-Ch and TF.]

Figure 2: Example of production of alarms and propagation of faulty states in case of a loss of signal (LOS) of the STM-1 port P in the network part of figure 1. Alarms are depicted as black dots, while dashed dots denote faulty states for diagnosis. Some other elements of the net are involved, which are marked as D (for Distant port), F (for Forward port) and B (for Backward port) in figure 1. A horizontal line is dedicated to each component.

This is inferred from official standards, mainly G774, G782 and G783 from ITU. They define which alarms should be produced by the net elements, according to the current possible faults and to the alarm indication signals (AIS) that are emitted between elements (oblique arrows in the diagram). Thus, causality or dependency among alarms and faulty states is depicted by oblique arrows, and by horizontal arrows associated with each component.

Alarms are viewed as special events, signaling faults by way of notifications. Alarms propagate level by level through software components: in figure 2, for example, the SPI, MS and AU components of port P are involved. Alarm indication signals go from low-level components (SPI) to higher-level components (MS, AU). In the general case, this implies dependencies between alarms. A fault state may also manifest itself through several alarms, with dependency or concurrency between these alarms. An alarm pattern can then be defined as a set of alarms (and their dependencies) between two fault states.

Faults are viewed as particular local states between alarms. They constitute the elements of diagnosis. Faults propagate from a first defect, through physical connections for example, as illustrated in the case of LOS. Physically speaking, a fault can occur only once at a time. It makes no sense to consider the same fault involving the same network elements twice at the same time (although it can occur repeatedly).
It has been pointed out that faults in telecommunication networks may appear simultaneously, i.e. multiple faults occur, which are then concurrent (see e.g. [22]). Moreover, several faults may combine to produce other faults (And, Or dependencies in figure 4), and one single fault may result in several other faults (Simultaneous, Exclusive dependencies in figure 4).
2.3 Using Petri Nets and Partial Orders
The previous example shows what is to be expressed, namely dependencies (or causalities) between faulty states and alarms. These are depicted in the form of a circuit-free directed graph, which in turn is equivalent to defining a partial order. Partial orders are thus adequate to represent dependencies between alarms, as well as concurrency. Figure 3 illustrates possible dependencies between alarms, taken from the SDH example. The symbol $\lambda$ (the empty label) means that a faulty state may not manifest itself; this is the case when a fault propagates without any notification. Such a partial order on alarms, which represents their dependencies, is called an alarm pattern. Alarm patterns are associated with fault state manifestation (and so with fault state propagation).

[Figure 3 diagram: Hasse diagrams of the alarm patterns attached to several transitions of the LOS example (among them T4, T7 and T9), built from the alarms AU-AIS (P), MS-AIS (P), St-Ch (P), St-Ch (B), AU-AIS (B) and TF (D).]

Figure 3: Some examples of alarm patterns.

There are several kinds of dependencies between faults. Places, transitions, and their relationships appear to be an adequate tool to express causal dependencies between faults, and multiplicity, as shown in figure 4. This figure illustrates elementary cases of causal dependencies between faults, using the usual drawing of places and transitions of Petri nets. Persistent faults, spontaneous fault occurrence, and reabsorption are also shown. Some of them are illustrated by referring to our SDH example. A set of dependencies between faulty states is therefore represented by a bounded net of capacity one. A fault manifests itself through one, or usually many, alarms. So alarm patterns are associated with transitions of the net. Capacity one is required on places, because of the nature of faults.
[Figure 4 diagram: elementary place/transition patterns titled Simple sequence, Persistent fault, Exclusiveness, Spontaneous fault, Reabsorption, Simultaneity, Or, And.]

Figure 4: Different kinds of dependencies between faults in telecommunication networks.
These considerations altogether motivate the use of finite Petri nets of capacity one, with arc weight 1. The Petri nets used in this way are called fault nets. Transitions are labeled with alarm patterns when needed. The resulting fault net for the LOS of P is given in figure 5. A 1-safe Petri net can be associated with each fault net. This allows us to use the associated theory and the interesting properties of these nets. All considered nets are finite.

[Figure 5 diagram: fault net with transitions T1 to T9 and places such as LOS (P), LOS (D), MS (P), MS (D), AU (F), AU (B); transitions carry alarm patterns over LOS, TF, MS-AIS, AU-AIS and St-Ch for ports P, D, F and B.]

Figure 5: Example of fault net for a LOS (Loss Of Signal) of an STM-1 port.

An observation of produced alarms is provided as a partial order on these alarms, as suggested in figure 2. The detection of faults then consists in recognizing alarm patterns in a given observation. Diagnosis is provided by a backward tracing of the fault net. The use of probabilities, as developed in Part II, makes it possible to deal with uncertainties due to poor knowledge or to "noisy" observations (e.g., loss of alarms themselves); alternatively, probabilities help to discriminate between likely and unlikely fault propagation sequences.
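The backward tracing mentioned above can be pictured as plain backward reachability over the flow relation of a fault net. The following is a minimal sketch under that assumption, not the diagnosis algorithm of the report; the function name `backward_trace` and the small net used as input are hypothetical and only loosely inspired by the names appearing in Figure 5.

```python
def backward_trace(flow, target):
    """Return every node from which `target` can be reached along `flow`.

    flow   -- set of arcs (source, destination) between places and transitions
    target -- the node (typically a faulty-state place) to explain
    """
    predecessors = {}
    for src, dst in flow:
        predecessors.setdefault(dst, set()).add(src)
    seen, stack = set(), [target]
    while stack:
        node = stack.pop()
        for pred in predecessors.get(node, ()):
            if pred not in seen:
                seen.add(pred)
                stack.append(pred)
    return seen


# Hypothetical fault net fragment: two independent propagation chains.
flow = {
    ("LOS(P)", "t1"), ("t1", "MS(P)"),
    ("MS(P)", "t2"), ("t2", "AU(P)"),
    ("LOS(D)", "t3"), ("t3", "MS(D)"),
}
places = {"LOS(P)", "MS(P)", "AU(P)", "LOS(D)", "MS(D)"}

# Faulty states that may have led to AU(P); the chain on port D is not involved.
causes = backward_trace(flow, "AU(P)") & places
```

Restricting the result to places keeps only faulty states, which are the elements of diagnosis in this model.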
3 Propagation Model

This section gives the formal description of the propagation model. It details the syntax of alarm patterns and fault nets. The semantics of the model is then defined in terms of a family of finite partial orders on alarms.
3.1 Syntax
Alarm patterns aim at describing a set of alarms with their expected mutual dependencies. An alarm pattern $A_i$ associated with a transition $T_i$ is defined as a 4-tuple $(X_i, \le_i, \mathsf{A}_i, \varphi_i)$, where:
- $X_i$ is a finite set of vertices, i.e. alarms,
- $\le_i$ is a partial order relation on vertices (where $\le_i \subseteq X_i \times X_i$), i.e. dependencies between alarms,
- and $\varphi_i$ is a labeling function of alarms on $\mathsf{A}_i$.

A vertex represents an occurrence of its labeling alarm. Edges indicate expected dependencies between alarms, according to the prior knowledge about their possible propagation. In a given alarm pattern, the label $\varphi_i(x) = a$ of an alarm must also contain the sensor $s(x)$ which should observe it. The set of all alarm patterns is $\mathcal{A} = \biguplus_i A_i$, where $i$ enumerates the transitions of the net.

The covering relation $-\!\!<$ on a partial order $O = (X, \le)$ is defined by

$$a \mathrel{-\!\!<} b \quad\text{iff}\quad \forall z \in X,\ (a \le z \le b) \Rightarrow (z = a \text{ or } z = b).$$

The Hasse diagram of a partial order is the graphic representation of its covering relation. An alarm pattern is thus characterized by its Hasse diagram, oriented from bottom to top in figures (see for example Figure 10). The set of minima of $O$ is $\mathrm{Min}(O)$, and the set of its maxima is $\mathrm{Max}(O)$. $O'$ is a sub-order of $O$ if $X' \subseteq X$, and $\le$ in $O'$ is the restriction to $X'$ of $\le$ in $O$.

A fault net is defined to describe a set of faults and the known dependencies between them. In a fault net, a place represents a faulty state, and transitions represent dependencies between faults. Transitions are labeled by the corresponding alarm patterns, when some manifestations are expected due to this faulty state. Let $\mathcal{A}$ be an alphabet of alarm patterns. A fault net on $\mathcal{A}$ is a 4-tuple $N = (P, T, F, l)$ where:
- $P$ is the finite set of places,
- $T$ is the finite set of transitions, with $P \cap T = \emptyset$,
- $F \subseteq (P \times T) \cup (T \times P)$ is the flow relation,
- $l : T \to \mathcal{A} \cup \{\lambda\}$ is the labeling function, where $\lambda$ denotes the empty label.
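The order-theoretic notions used for alarm patterns (covering relation, Min, Max) are easy to compute on small examples. The sketch below is illustrative only: the function names and the four-vertex pattern are hypothetical, and the covering relation is taken in its usual strict form (a strictly below b, with nothing strictly between them).

```python
from itertools import product


def partial_order(vertices, edges):
    """Reflexive-transitive closure of `edges`: the set of pairs a <= b."""
    order = {(v, v) for v in vertices} | set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(order), repeat=2):
            if b == c and (a, d) not in order:
                order.add((a, d))
                changed = True
    return order


def covering(vertices, order):
    """Pairs (a, b) with a < b and no z strictly between a and b."""
    return {
        (a, b)
        for (a, b) in order
        if a != b
        and all(z in (a, b) for z in vertices
                if (a, z) in order and (z, b) in order)
    }


def minima(vertices, order):
    return {v for v in vertices
            if not any((u, v) in order for u in vertices if u != v)}


def maxima(vertices, order):
    return {v for v in vertices
            if not any((v, u) in order for u in vertices if u != v)}


# Hypothetical alarm pattern: x1 precedes x2 and x3, which both precede x4.
X = ["x1", "x2", "x3", "x4"]
dependencies = {("x1", "x2"), ("x1", "x3"), ("x2", "x4"),
                ("x3", "x4"), ("x1", "x4")}   # the last edge is redundant
order = partial_order(X, dependencies)
hasse = covering(X, order)
```

The redundant edge `("x1", "x4")` disappears from `hasse`, which is exactly what drawing the Hasse diagram does.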
3.2 Semantics
3.2.1 Choice of Partial Order Semantics
As pointed out by Vogler in [26], one needs partial order semantics when considering non-atomic events on transitions. In some sense, this is the case for fault nets, because of the alarm patterns on transitions. In the case of multiple faults (i.e. several places are active), one must consider concurrency between the associated alarm patterns. The idea is then to work explicitly with the net structure through unfoldings. This avoids the usual state explosion due to concurrency. The interest of this semantics for concurrency is illustrated in Figure 6.
[Figure 6 diagram: two nets over transitions T1 and T2, one of them involving a place P. First net: two partial orders (without concurrency), T1 before T2 or T2 before T1. Second net: a single partial order (with concurrency) on T1 and T2.]

Figure 6: Illustration of partial order semantics. This figure shows two nets that are indistinguishable with respect to their firing sequences ("interleaving semantics"). However, these nets are different in the partial order semantics: they generate different partial orders for their transition firings. The first net admits two partial orders, each of them being totally ordered, i.e. without concurrency. On the other hand, the second net admits one single partial order, involving concurrency.

The fault net semantics is defined by the Set of partial Orders on Alarms it describes. Let $S_{oa}$ denote this semantics. Branching processes of Petri nets and the associated event structures are used to represent this semantics of fault nets.
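The difference between the two semantics illustrated in Figure 6 can be checked by enumerating linear extensions, i.e. the firing sequences compatible with a partial order. This is a small illustrative sketch; the function name `linear_extensions` is an assumption introduced here, and the brute-force enumeration is only meant for tiny examples.

```python
from itertools import permutations


def linear_extensions(events, causality):
    """All total orders of `events` compatible with the strict pairs (a, b),
    meaning "a before b", given in `causality`."""
    result = []
    for perm in permutations(events):
        position = {e: i for i, e in enumerate(perm)}
        if all(position[a] < position[b] for a, b in causality):
            result.append(perm)
    return result


events = ["T1", "T2"]

# First net of Figure 6: two partial orders, each of them total.
first_net = (set(linear_extensions(events, {("T1", "T2")}))
             | set(linear_extensions(events, {("T2", "T1")})))

# Second net of Figure 6: one partial order with T1 and T2 concurrent.
second_net = set(linear_extensions(events, set()))

# n pairwise concurrent events are covered by one partial order,
# but by n! interleavings.
four_concurrent = linear_extensions(["a", "b", "c", "d"], set())
```

Both nets yield the same set of firing sequences, so the interleaving semantics cannot tell them apart, while the partial orders differ. The factorial growth of `four_concurrent` is the kind of blow-up that working with partial orders avoids.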
3.2.2 From Fault Net to 1-Safe System Net
For the sake of convenience, a fault net is translated into a 1-safe system net. This is done in two steps. First, places are complemented to encode explicitly the capacity one. Secondly, alarm patterns on transitions are expanded and described inside the Petri net.

As usual, for all $x \in P \cup T$, let ${}^\bullet x = \{y \in P \cup T \mid (y, x) \in F\}$ be its pre-set and $x^\bullet = \{y \in P \cup T \mid (x, y) \in F\}$ its post-set. The marking of a place $p \in P$ is $M(p)$, with values 0 or 1. To translate a Petri net of capacity one into a 1-safe net, simply add a complement to every place in the fault net. For a place $p \in P$ of the fault net, $\bar{p}$ is the complement of $p$ if ${}^\bullet \bar{p} = p^\bullet$, $\bar{p}^\bullet = {}^\bullet p$, and $M_0(\bar{p}) = 1$ as initial marking. As Reisig pointed out in [24], this is a convenient way to make transition firings depend only on their respective pre-sets, and the behavior of the considered net is left unchanged.

To strictly limit the model to 1-safe Petri nets, alarm patterns are expanded into 1-safe nets. Let $A_i$ be an alarm pattern. The corresponding labeled net $N(A_i) = (B, E, F, l)$ is obtained using a construction similar to the one proposed in [21]:
- Firstly, create a transition for each vertex of the alarm pattern: for every $a \in X_i$, consider a unique transition $t(a) = t \in E$. These transitions are labeled with the corresponding alarms: $l(t) = \varphi_i(a)$.
- Secondly, put a place for every covering edge of the partial order relation. For every $x \mathrel{-\!\!<} y$, create a unique $p \in B$ such that ${}^\bullet p = t(x)$ and $p^\bullet = t(y)$.
- Finally, put a place under every minimum of the order and, similarly, on top of every maximum of the order. For every $a \in \mathrm{Min}(A_i)$ ($b \in \mathrm{Max}(A_i)$), create a unique $p \in B$ with $p^\bullet = t(a)$ (${}^\bullet p = t(b)$).

An alarm pattern is thus expanded into a causal net in which, for every $p \in P$, $|{}^\bullet p| = 1$ and $|p^\bullet| = 1$. The underlying partial order on the transitions of such a causal net is exactly the partial order underlying the considered alarm pattern. This implies a bijection between alarm patterns and their corresponding labeled nets. To interface these expanded alarm patterns with places in the 1-safe nets, a $\lambda$-labeled (silent) transition is used as pre-set for the minima (and another one as post-set of the maxima) of the corresponding causal nets.

Figure 7 illustrates the expansion of an alarm pattern. On top of the figure an alarm pattern A is given, resembling one of the LOS fault net. A possible alarm pattern for a transition labeled A in this fault net is also proposed. The bottom of the figure illustrates the labeled net associated with this alarm pattern. In this example, complements of places are omitted to avoid clutter. Complements can be omitted because they influence neither the behavior nor the semantics of the fault net.

[Figure 7 diagram: alarm pattern A over the alarms St-Ch, AU-AIS and MS-AIS, and the labeled causal net obtained from it.]

Figure 7: An alarm pattern and its corresponding labeled causal net.

This leads to finite, 1-safe labeled system nets. Let $N = (P, T, F, M_0)$ be such a net derived from a fault net, referred to as a 1-safe system net. This net is also T-restricted, i.e. $\forall t \in T,\ {}^\bullet t \neq \emptyset \neq t^\bullet$. To avoid problems due to labeling, places and transitions of the finite 1-safe net are numbered so that each of them can be distinguished.
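The two translation steps described in this subsection can be sketched as follows. This is an illustrative reading of the construction, not the authors' implementation: the function names, the naming scheme for generated places, and the three-alarm pattern are hypothetical. For complements, the sketch assumes that a complement is marked exactly when its place is empty, which gives the value 1 stated in the text when no fault is initially present.

```python
def add_complements(places, flow, marking):
    """Add a complement place ~p for every place p (capacity one -> 1-safe)."""
    new_flow, new_marking = set(flow), dict(marking)
    for p in places:
        comp = f"~{p}"
        for src, dst in flow:
            if src == p:      # a transition emptying p refills ~p
                new_flow.add((dst, comp))
            if dst == p:      # a transition filling p must empty ~p
                new_flow.add((comp, src))
        new_marking[comp] = 1 - marking.get(p, 0)   # assumption, see lead-in
    return new_flow, new_marking


def expand_alarm_pattern(labels, covering):
    """Expand an alarm pattern into a labeled causal net (B, E, F, l).

    labels   -- dict vertex -> alarm label (the labeling function)
    covering -- set of covering pairs (x, y), i.e. Hasse-diagram edges
    """
    events = {x: f"t({x})" for x in labels}        # one transition per vertex
    label_of = {events[x]: labels[x] for x in labels}
    places, flow = [], set()

    def new_place(name, pre=None, post=None):
        places.append(name)
        if pre is not None:
            flow.add((pre, name))
        if post is not None:
            flow.add((name, post))

    for x, y in sorted(covering):                  # one place per covering edge
        new_place(f"p({x},{y})", pre=events[x], post=events[y])
    has_pred = {y for _, y in covering}
    has_succ = {x for x, _ in covering}
    for x in sorted(labels):
        if x not in has_pred:                      # place under every minimum
            new_place(f"in({x})", post=events[x])
        if x not in has_succ:                      # place on top of every maximum
            new_place(f"out({x})", pre=events[x])
    return places, sorted(events.values()), flow, label_of


# Hypothetical pattern: LOS is followed by two concurrent alarms.
labels = {"x1": "LOS", "x2": "TF", "x3": "St-Ch"}
B, E, F, l = expand_alarm_pattern(labels, {("x1", "x2"), ("x1", "x3")})

comp_flow, comp_marking = add_complements(
    ["p"], {("t_in", "p"), ("p", "t_out")}, {"p": 0})
```

In the expanded net every place has at most one input and one output transition; the open ends `in(...)` and `out(...)` are where the silent interface transitions of the text would be attached.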
3.2.3 Family of Partial Orders on Alarms
The partial order semantics of Petri nets is defined through the notion of branching process. Branching processes are themselves based on a simple class of nets, occurrence nets. An occurrence net $ON$ is a finitary acyclic net such that, for every $p \in P$, $|{}^\bullet p| = 1$ and $M_0 = \mathrm{Min}(ON)$. In occurrence nets, places are called conditions and transitions are called events.
A branching process of a net $N$ is a pair $(N', h)$ where $N'$ is an occurrence net and $h$ a homomorphism from $N'$ to $N$, i.e. a mapping $h : P' \cup T' \to P \cup T$ such that:
- $h(P') \subseteq P$ and $h(T') \subseteq T$,
- for every $t \in T'$, the restriction of $h$ to ${}^\bullet t$ is a bijection between ${}^\bullet t$ (in $N'$) and ${}^\bullet h(t)$ (in $N$), and similarly for $t^\bullet$ and $h(t)^\bullet$,
- the restriction of $h$ to $M_0'$ (in $N'$) is a bijection between $M_0'$ (in $N'$) and $M_0$ (in $N$),
- for every $t_1, t_2 \in T'$, if ${}^\bullet t_1 = {}^\bullet t_2$ and $h(t_1) = h(t_2)$, then $t_1 = t_2$.

The unfolding of a Petri net is its unique maximal branching process [9]. A prefix $(N', h')$ of $(N, h)$ is defined as in [10]:
- if a condition belongs to $N'$, then its input event in $N$ also belongs to $N'$,
- if an event belongs to $N'$, then its output conditions in $N$ also belong to $N'$.

Every branching process of a 1-safe system net can be seen as a prefix of its unfolding. The unfolding represents the infinite set of all these branching processes. Figure 8 gives an example of a 1-safe system net, chosen to illustrate unfolding. Figure 9 shows one branching process of this 1-safe system net.

Figure 8: A 1-safe system net.

Performing the unfolding of the fault net of figure 5, and then deleting the places (they are not observed), yields the partial order on alarms depicted in Figure 10. In this case, the partial order generated in this way is unique.
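The step "unfold, then delete the places" can be illustrated on a single run. The sketch below replays one firing sequence of a 1-safe net and returns the causality relation between the fired events, which is the partial order carried by the corresponding conflict-free branching process. It is a simplified illustration rather than a full unfolding algorithm: the function names are introduced here, 1-safety of the input net is assumed rather than checked, and the two small nets are a hypothetical reconstruction of the situation of Figure 6.

```python
def process_order(pre, post, initial, firing_sequence):
    """Replay a firing sequence and return the causal pairs (i, j), meaning
    "the i-th fired event causally precedes the j-th one".

    pre, post -- dicts transition -> set of input / output places
    initial   -- set of initially marked places
    """
    producer = {p: None for p in initial}   # marked place -> event that filled it
    ancestors = []                          # ancestors[i]: causes of event i
    for index, t in enumerate(firing_sequence):
        if not all(p in producer for p in pre[t]):
            raise ValueError(f"transition {t} is not enabled")
        causes = set()
        for p in pre[t]:
            event = producer.pop(p)         # consume the token of p
            if event is not None:
                causes |= {event} | ancestors[event]
        for p in post[t]:
            producer[p] = index             # p is empty here (1-safety assumed)
        ancestors.append(causes)
    return {(a, i) for i, causes in enumerate(ancestors) for a in causes}


def named(order, firing_sequence):
    """Rewrite index pairs as pairs of transition names."""
    return {(firing_sequence[a], firing_sequence[b]) for a, b in order}


# Net where T1 and T2 both take and give back a shared place P.
lock = dict(pre={"T1": {"a", "P"}, "T2": {"c", "P"}},
            post={"T1": {"b", "P"}, "T2": {"d", "P"}},
            initial={"a", "c", "P"})
# Net where T1 and T2 are independent.
free = dict(pre={"T1": {"a"}, "T2": {"c"}},
            post={"T1": {"b"}, "T2": {"d"}},
            initial={"a", "c"})

run_12, run_21 = ["T1", "T2"], ["T2", "T1"]
lock_orders = [named(process_order(**lock, firing_sequence=r), r)
               for r in (run_12, run_21)]
free_orders = [named(process_order(**free, firing_sequence=r), r)
               for r in (run_12, run_21)]
```

Both nets accept the same two firing sequences. For the net with the shared place, each sequence produces its own total order, whereas for the independent net both sequences collapse to the same empty causality relation, i.e. a single partial order with concurrency.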
4 Principle of Detection Algorithm

We provide here only the principles; details and mathematics are provided in Part II. See also [4] for a preliminary version of the algorithm in the non-probabilistic case, based on the use of event structures, a technique slightly different from that used in Part II.
Figure 9: A branching process of the 1-safe system net of Figure 8.

Figure 10: Hasse diagram of the partial order on alarms corresponding to the fault net of figure 5 for LOS.
4.1 Alarm Observation
Typically, a network manager receives many alarms produced by elements of the data network, and has to correlate them with the underlying fault net to recognize the original fault(s). Each sensor of the management network observes alarms and some dependencies. Mechanisms like time-stamps provide an encoding of the dependencies that each sensor can observe between alarms ([12, 20]). One obtains, for each sensor, a partial order on alarms, in which the immediate predecessors of every alarm known by the sensor are recorded. Figure 11 shows an example of the observation provided by sensors for the case of loss of signal of STM-1 port P.

[Figure 11 diagram: timelines of the network elements F (AU), P (AU), P (MS), P (SPI), D (SPI), D (MS), D (AU), B (AU), of the sensors S1, S2, S3 and of the supervisor S, together with the Hasse diagrams observed by each sensor.]

Figure 11: Observation of alarms by sensors for the case of loss of signal of STM-1 port P. Horizontal lines represent time for each network element, for the sensors, and for the supervisor. The line of the supervisor illustrates the sequence of alarms provided to a manager without any information on concurrency between them. On the other hand, the right-hand side of this figure shows the Hasse diagrams of the observations provided by the sensors with time-stamp mechanisms.

To avoid overloading the data network, no further intrusion in the traffic is allowed to observe dependencies; this would be too costly. For this reason, sensors do not exchange information about alarms. This implies that some dependencies between alarms cannot be known: inter-site dependencies are never observed by sensors. Each sensor only knows the intra-site dependencies it observes between alarms produced by elements of its site. Observations collected by the manager are assumed to be in accordance with the effective ordering of production of alarms by the network elements, for each individual sensor.
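The loss of inter-site dependencies can be pictured as a projection of the global order on alarms onto each site. The sketch below assumes that the global order is given as a transitively closed set of pairs; the function name `sensor_views` and the assignment of alarms to sensors are hypothetical.

```python
def sensor_views(order, site_of):
    """Split a global strict order on alarms into per-sensor observations.

    order   -- set of pairs (a, b): alarm a is produced before alarm b
               (assumed transitively closed)
    site_of -- dict alarm -> sensor observing it
    Only pairs whose two alarms belong to the same site are kept.
    """
    views = {sensor: set() for sensor in set(site_of.values())}
    for a, b in order:
        if site_of[a] == site_of[b]:
            views[site_of[a]].add((a, b))
    return views


# Hypothetical observation: port P is watched by s1, port D by s2.
site_of = {"LOS(P)": "s1", "TF(P)": "s1", "LOS(D)": "s2", "TF(D)": "s2"}
order = {
    ("LOS(P)", "TF(P)"),
    ("LOS(P)", "LOS(D)"), ("LOS(P)", "TF(D)"),   # inter-site dependencies
    ("LOS(D)", "TF(D)"),
}
views = sensor_views(order, site_of)
```

The two inter-site pairs are absent from every view, so the supervisor has to recover them by correlating the local observations with the fault net.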
4.2 Correlating alarm patterns with fault net histories

The next steps of the algorithm consist in:
1. Rejecting those fault net histories which are not accepted by (i.e., are incompatible with) the underlying fault net. The corresponding algorithm does not make use of the marking graph but is rather based on net unfoldings, thus avoiding the problem of state explosion. If no probabilistic modelling is at hand, then this finishes the detection/diagnosis algorithm.
2. Probabilistic modelling may be of interest for (a) taking into account "noisy measurements", i.e., the loss of alarms that typically occurs, through buffer overflow, in networks operating under faulty conditions; (b) discriminating between possible faults of low and high likelihood; and, more generally, (c) adding robustness to the detection/diagnosis algorithm. When marking graphs are used and dependencies are ignored, this problem reduces to the classical Viterbi algorithm for maximum likelihood decoding in HMMs, as popularized in the area of speech recognition. In our case, however, since we base our technique on a partial order semantics and refuse to use the marking graph, the corresponding algorithm is based on net unfoldings, thus again avoiding the problem of state explosion.
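For reference, the classical Viterbi algorithm mentioned in the second step can be sketched as follows for an ordinary HMM. This is the textbook sequential version, not the generalized algorithm on net unfoldings developed in Part II; the two-state model and all its probabilities are hypothetical values chosen for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[s] = (probability, path) of the best path ending in state s
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            prob, path = max(
                ((best[r][0] * trans_p[r][s], best[r][1]) for r in states),
                key=lambda item: item[0],
            )
            step[s] = (prob * emit_p[s][obs], path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])


# Hypothetical model: a component is either "ok" or "faulty",
# and at each step the supervisor sees an alarm or nothing.
states = ["ok", "faulty"]
start_p = {"ok": 0.9, "faulty": 0.1}
trans_p = {"ok": {"ok": 0.9, "faulty": 0.1},
           "faulty": {"ok": 0.2, "faulty": 0.8}}
emit_p = {"ok": {"none": 0.9, "alarm": 0.1},
          "faulty": {"none": 0.3, "alarm": 0.7}}

probability, path = viterbi(["none", "alarm", "alarm"],
                            states, start_p, trans_p, emit_p)
# path is ["ok", "faulty", "faulty"]
```

The dynamic programme keeps one best path per state, which is precisely what becomes too expensive when the states are the markings of a large net.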
5 Related Work Fault detection and identi cation in communication networks have been studied in [23], using a discrete event systems (non probabilistic) approach. Also, [7, 6] uses an HMM (Hidden Markov type) probabilistic approach. Both works assume a nite state machine or Markov chain is used, thus facing the risk of state explosion for large, distributed, systems. Our work diers from these in that we explicitly take concurrency and distribution into account and use Petri Nets with partial order semantics. Petri nets have already been used for diagnostic, see e.g. [11] in the eld of automation. They establish a linear logic associated to Petri net. Linear negation is used to diagnose faults. In [25] alarm correlation is considered by the way of database operations. Events are represented by data-patterns, and data patterns are continuously retrieved from the network database. Events are correlated through combination of events. By correlation rules: disjunctions, conjunctions; by temporal relationships: temporal order, by the way of an acyclic directed graph, and temporal constraint, using occurrences time. In our approach, we do not want to use real time, in order to avoid problems due to clock synchronization. We feel that aspects related to real time somewhat hide causal dependencies and concurrency, which we explicit. In [14], alarm correlation is a conceptual interpretation of multiple alarms, so that a new meaning is assigned to these alarms. Their approach is based on the principles of modelbased reasoning. Dependencies between faults are represented by logical fault propagation rules: causal implication, and, or. In [15] the same authors concentrate on time-dependent events management. In our model, a great variety of dependencies between faults and alarms
can be expressed. In particular, because they are treated in a uniform manner, we take into account the problem of alarm inter-dependencies due to fault propagation. In [5], two finite state machines are used. One models the whole behavior of the considered system, which is partially observed by the second. Faults are considered as changes and additions of arcs in the finite state machine of the system. Therefore, the detection problem is expressed in terms of sequences generated (or not) by the system. The focus of their work is on the design of the simplest observer for a given finite state machine. Our propagation model does not describe the whole behavior of the system, but rather its manifestations in case of malfunctions. In [17], problems are viewed as messages encoded in the set of alarms they cause. By pruning, a bipartite correlation graph is obtained. It provides for each problem a code by way of an alarm vector. The alarm correlation problem is then that of finding problems whose codes optimally match an observed alarm vector. The way to perform matching is left open. For a more general class of models, this is what our algorithm does. In [16] probabilistic methods are also used, but without a model of concurrency between events.
6 Concluding Remarks and Future Work
This report takes advantage of results in the theory of Petri nets to deal with fault management in telecommunication networks. It explicitly takes into account multiplicity and different causal dependencies between faults, and it allows one to express dependencies and concurrency between alarms. We believe this case study is representative of fault detection and diagnosis in other discrete event distributed systems. As far as we know, this model is a new approach to fault detection through alarm correlation in distributed systems. The choice of partial order semantics of Petri nets prevents the combinatorial explosion due to interleaving when dealing with concurrency. This model was illustrated with an example from an SDH network. The detection/diagnosis algorithm was first introduced in [4], and is presented in its full probabilistic version in part II. It is currently being programmed in the Eiffel language, and we plan real experiments. A short term research objective is to define precisely how to return a diagnosis (i.e. an explanation) to the operator, and how to distribute the supervising activity on sensors: a diagnosis could be provided to the operator in the form of a backward simulation of the fault net according to the results given by the recognition. This allows the operator to go back to the original fault through the fault net. All this work is done keeping in mind the aim of distributed detection. This detection algorithm can be distributed over sensors. Those sensors are able to process part of the recognition, by providing "synthesized information" or "sufficient statistics" based on local information only. Then the global supervisor would collect these "sufficient statistics" and build the diagnosis. This will be facilitated thanks to the use of partial
orders as a model of dependencies, so that no notion of "global time", and thus no synchronization, is needed.
Part II
Extending Viterbi algorithm and HMM techniques to Petri nets
7 Objectives and ideas
7.1 Motivations
We refer the reader to the first part for a detailed discussion of failure diagnosis in large, distributed, discrete event systems, with telecommunication networks as a typical instance. The main conclusions drawn from this analysis are summarized next:
1. We have advocated the use of a model based approach, in which the spontaneous occurrence of faults is modeled, together with their propagation throughout the system and the resulting emitted alarms. Diagnosis is then performed by inferring underlying faulty state histories from observed alarms.
2. Safe Petri nets¹ were proposed for this purpose, due to their ability to capture the distributed nature of faults via tokens, as well as concurrency. Also, probabilistic nets will be considered, to capture the "noisy" nature of alarm observations (alarms may get lost when the network malfunctions) and the difference in likelihood when several fault scenarios give rise to the same recorded set of alarms.
3. To avoid the risk of state explosion for large distributed systems, we decided to refuse using the marking graph, and rely instead on partial order semantics and net unfolding.
The purpose of this part is 1/ to establish the proper probabilistic setting for safe Petri nets with partial order semantics, and 2/ to provide a (new) Viterbi type algorithm for maximum likelihood estimation of the hidden state history from observed alarms. While it would be a natural objective to seek a distributed Viterbi algorithm, we will not discuss this issue here, but rather defer it to a forthcoming paper.
7.2 Brief review of existing models of stochastic Petri nets
Before presenting our formalism for a partial randomization of Petri nets (PNs), we review existing models to highlight the kind of suitable properties we wish to keep, as well as bad side-effects that need to be canceled.
¹ A Petri net is said to be safe iff each place contains at most one token, and if transition firings preserve this property.
Generalized stochastic Petri nets
The usual way transitions are made stochastic in PNs is by considering timed Petri nets for which waiting times are random [8], [19]. The waiting time of a given transition $t$ is generally distributed according to an exponential law of parameter $\lambda_t = 1/(\text{average waiting time})$. From a given marking, a race policy determines which of the enabled transitions will fire. It is quite straightforward to check that if $t_1, \ldots, t_n$ are enabled by marking $M$, $t_i$ will fire with probability $\lambda_i / \sum_{j=1}^{n} \lambda_j$. This stochastic Petri net (SPN) formalism has been generalized (GSPN) to take into account immediate transitions, that have priority over timed ones. We consider now the GSPN framework where all transitions are immediate, in order to avoid any reference to physical time. These nets are sometimes referred to as probabilistic nets. In a probabilistic net, a weight $w_t$ (instead of an average waiting time) is associated to each transition $t$. It is assumed that only one transition is fired at a time, which requires a special selection policy. In the literature, this is performed by first defining the notion of Extended Conflict Set (ECS). Basically, two transitions are in conflict if they require the same resource (i.e. they have a common input place). This conflict relation is extended by transitivity into an equivalence relation, the classes of which define the ECSs. This yields a partition of the transition set (see figure 12 for examples). Given a marking, the selection of the transition that will fire is done in two steps: first an ECS is selected, then a transition is selected inside this ECS. If $t_1, \ldots, t_n$ are enabled in the same ECS, and have respective weights $w_1, \ldots, w_n$, $t_i$ will be selected with probability $w_i / \sum_{j=1}^{n} w_j$. Several policies exist for the selection of an ECS: 1/ priority rules, 2/ uniform random choice, or 3/ weighted random choice as above (the weight of an ECS is obtained by summing over the enabled transitions it contains).
In this last case, the notion of ECS is useless since $t_i$ is actually competing against all enabled transitions, with a probability proportional to its weight.
Figure 12 (panels a and b): Extended Conflict Sets. In the GSPN framework, the randomization of the net can be achieved through the notion of Extended Conflict Set. ECSs are highlighted by dotted lines on the figure.
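The ECS construction and the selection rule described above can be sketched as follows. This is a minimal illustration that follows the definition given in the text (transitive closure of "share an input place"), under the assumption that a net is described only by the input places of its transitions; the function names and the small example nets are hypothetical and are not those of figure 12.

```python
from fractions import Fraction


def extended_conflict_sets(presets):
    """Partition transitions into ECSs: transitive closure of 'share an input place'.

    presets maps each transition name to the set of its input places.
    """
    parent = {t: t for t in presets}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    first_consumer = {}  # place -> first transition seen that consumes from it
    for t, places in presets.items():
        for place in places:
            if place in first_consumer:
                parent[find(t)] = find(first_consumer[place])
            else:
                first_consumer[place] = t
    classes = {}
    for t in presets:
        classes.setdefault(find(t), set()).add(t)
    return {frozenset(c) for c in classes.values()}


def selection_probabilities(weights, enabled):
    """Probability w_i / sum_j w_j of each enabled transition (integer weights)."""
    total = sum(weights[t] for t in enabled)
    return {t: Fraction(weights[t], total) for t in enabled}


def two_step_probability(t, weights, enabled, ecs_list):
    """Policy 3: pick an ECS proportionally to its weight, then t inside it."""
    ecs = next(e for e in ecs_list if t in e)
    total = sum(weights[u] for u in enabled)
    ecs_weight = sum(weights[u] for u in enabled if u in ecs)
    return Fraction(ecs_weight, total) * Fraction(weights[t], ecs_weight)
```

With policy 3, `two_step_probability` returns the same value as a direct draw among all enabled transitions, which is the sense in which the ECS notion becomes useless in that case.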
Discussion. In any case, the behavior of the network is described by a Markov chain which induces competition between all enabled transitions in a given marking. For example, if $s$ and $t$ are enabled and don't require the same resources, firing the sequence $s, t$ will not in general yield the same likelihood as firing $t, s$. This remains true even if $s$ and $t$
belong to completely disconnected PNs, which means that gathering independent networks automatically generates interaction.
Figure 13 (panels a, b and c): Concurrency and probabilistic independence. $t_1$ and $t_2$ are two concurrent transitions. In a) they belong to different ECSs, while in b) they lie in the same one. In example c), $t_1$, $t_2$ and $t_3$ are not in a pure concurrency position: their firing must follow a partial order.
An interesting point appears when we "drop the order" in GSPNs, as is suggested in [19] and [18]. This requires that we restrict our attention to confusion free Petri nets, which means that firing a transition inside a given ECS cannot disable transitions of other ECSs. Another condition is that ECSs are selected at random according to the above stated policy 3. Consider the example of figure 13.a, where 4 transitions are competing. Assume we want to compute the probability that $t_1$ and $t_2$ fire, regardless of the order. This is given by summing the probabilities of the sequences $t_1, t_2$ and $t_2, t_1$:
$$\mathbb{P}(t_1 \text{ and } t_2 \text{ have fired}) = \frac{w_1}{w_1+w_2+w_3+w_4}\cdot\frac{w_2}{w_2+w_4} + \frac{w_2}{w_1+w_2+w_3+w_4}\cdot\frac{w_1}{w_1+w_3} = \frac{w_1}{w_1+w_3}\cdot\frac{w_2}{w_2+w_4} \qquad (1)$$
This is exactly the probability of performing two independent and local selections, where "local" means that we compare elements of the same ECS only. We suspect that the right notion to describe this property is not exactly that of ECS. Considering the example of fig. 13.b, where $t_1$ and $t_2$ are still concurrent, (1) remains true despite the fact that $t_1$ and $t_2$ lie in the same ECS. This result holds beyond the "pure concurrency" situation, as illustrated in fig. 13.c. The three transitions $t_1$, $t_2$ and $t_3$, lying in distinct ECSs, can't be fired in any order since $t_3$ must always follow $t_1$. But again, if we sum over all legal interleavings subject to this partial order constraint, we recover independence.
Summary. Gathering the above observations, it appears that the main drawback of GSPNs lies in the randomization of all sequences of transitions. If the order is removed, that is if we consider instead the partial order semantics associated to a PN, we recover in a natural way interesting independence properties. This is precisely the kind of behavior
we shall model in the sequel, in a rather direct way. More precisely, if several transitions can be fired simultaneously, i.e. if they require different resources, the choice of one or the other should not be randomized, it should be just "unknown". As a consequence, we will not define a proper Markov chain behavior for our Petri net, since only part of the events will be randomized.
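The independence property obtained by summing over interleavings, as in identity (1), can be checked numerically. The sketch below is an illustration under stated assumptions: the net of figure 13.a is taken to consist of two conflict pairs, $\{t_1, t_3\}$ and $\{t_2, t_4\}$ (the pairing suggested by the denominators of (1)), each transition fires at most once, and the weights 2, 3, 5, 7 are arbitrary. Function and variable names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def sequence_probability(seq, weights, conflicts):
    """Probability that the transitions of seq fire in that order, when each
    firing is drawn among all enabled transitions proportionally to its weight."""
    enabled = set(weights)
    prob = Fraction(1)
    for t in seq:
        if t not in enabled:
            return Fraction(0)
        prob *= Fraction(weights[t], sum(weights[u] for u in enabled))
        # firing t consumes the shared token: t and its competitors get disabled
        enabled -= conflicts[t]
    return prob


weights = {"t1": 2, "t2": 3, "t3": 5, "t4": 7}
conflicts = {
    "t1": {"t1", "t3"}, "t3": {"t1", "t3"},
    "t2": {"t2", "t4"}, "t4": {"t2", "t4"},
}

# left-hand side of (1): sum over the two interleavings of t1 and t2
unordered = sum(
    sequence_probability(seq, weights, conflicts)
    for seq in permutations(["t1", "t2"])
)
# right-hand side of (1): two independent local selections
local = Fraction(weights["t1"], weights["t1"] + weights["t3"]) * Fraction(
    weights["t2"], weights["t2"] + weights["t4"]
)
```

Exact rational arithmetic is used so that the two sides can be compared with `==` rather than up to rounding.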
Free choice Petri nets
The model we develop in the sequel is somewhat more related to free choice nets (FCNs), that we briefly review here.
Figure 14 (panels a and b, transitions $t_1$, $t_2$, $t_3$): From nets to free choice nets. a) There is confusion between $t_1$ and $t_2$: these two transitions look concurrent, but firing $t_1$ first changes the context of $t_2$ since a conflict appears. This is an obstacle to randomization through GSPNs. b) The use of a FCN with a random routing policy can solve this problem.
A free choice net [1] is a Petri net with a reduced notion of conflict: if two transitions share a common input place, then they have only this place as input. In other words, the token of this input place can freely choose the transition it will fire (see figure 14.b). FCNs are an elegant way to guarantee confusion freedom, with a minor change in the behavior of the net. Unfortunately, extended conflicts can't be transformed so easily, and FCNs do not necessarily provide safe nets. Thus a structural difficulty still remains. The randomization of FCNs can be achieved in two equivalent ways. The first method is by considering a timed FCN where waiting times are random (race policy). The second method first applies a random routing to tokens (they choose a transition), and then imposes a random waiting time before the elected transition fires. These two formalisms are equivalent and end up with Markov chain dynamics, i.e., global interaction.
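The structural condition defining free choice nets is easy to test mechanically. The following minimal sketch checks the definition exactly as stated above (two transitions sharing an input place must each have that place as their only input); the representation of a net by its presets and the function name `is_free_choice` are assumptions made for the illustration.

```python
def is_free_choice(presets):
    """presets maps each transition to the set of its input places."""
    transitions = list(presets)
    for i, s in enumerate(transitions):
        for t in transitions[i + 1:]:
            shared = presets[s] & presets[t]
            # sharing a place is allowed only if it is the single input of both
            if shared and not (len(presets[s]) == 1 and len(presets[t]) == 1):
                return False
    return True


# t1 and t2 compete for place "a" only; t3 is not in conflict with anyone.
free = {"t1": {"a"}, "t2": {"a"}, "t3": {"b", "c"}}
# t2 shares "a" with t1 but also needs "b": not free choice.
not_free = {"t1": {"a"}, "t2": {"a", "b"}}
```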
7.3 Further ideas for randomization
Let's now describe the basic mechanisms we develop in the sequel. We are interested in capacity 1 PNs, which means that every place can contain at most one token, and transitions are allowed provided they preserve this property (this is equivalent to a safe net). Notice that such a restriction induces competition between transitions not only for tokens, but also for "holes", i.e. absent tokens. We shall call these elements the resources of a transition.
The idea is to let tokens and holes select at random their routing. A token will choose a branch by which to leave the place, whereas a hole will select a branch from which a token is expected. These choices will be probabilized. Figure 15 gives an example of such rules. Notice on this example that we don't restrict our construction to free choice nets: the example includes a confusion situation, as well as an extended conflict.
Figure 15: Randomization mechanism for a Petri net. In case of multiple possibilities, every token selects at random the branch where to go. In the same way, every "hole" selects a branch from which a token is expected. The corresponding probabilities ($p$, $1-p$, $q$, $1-q$, $r$, $1-r$) are mentioned close to the arrows.
We assume the following protocol, that induces Markov chain dynamics: all places of the net make their routing choice simultaneously. If a place has the token, it selects an output transition, otherwise it selects an input transition. The next marking is computed by looking for agreements: a transition is selected iff it has been chosen by all its resources. Referring to fig. 15, let us denote by $a$ the presence of a token in place a, and by $\bar a$ its absence. Only 5 markings are possible at the next time, that can be coded $\bar a \bar b c$, $\bar a b \bar c$, $\bar a \bar b \bar c$, $a \bar b c$ and $a b \bar c$. They have the respective probabilities $p(1-q)r$, $p(1-q)(1-r) + pq$, $(1-p)q$, $(1-p)(1-q)r$ and $(1-p)(1-q)(1-r)$. We let the reader check these easy computations (and that the sum yields 1). Such a formalism could be very appealing because of its simplicity. However, it raises several difficulties.
1. Observe the last case for example: $a b \bar c$. It corresponds to an interlocking, also referred to as idle situation, or stuttering: places have made incompatible choices, so nothing happens. Specifically, the token in a had chosen $t_2$, the one in b had selected $t_3$, and the hole in c expected a token from $t_4$. This side-effect is really bothering since the PN remains in the same state with a non zero probability, whereas we would like our model to be blind with respect to stuttering: "nothing happens" is a "no-event" and thus should not be assigned a probability.
2. Assume $t$ is a "direct" transition that doesn't involve any choice and so is not competing with any other. Then $t$ must fire. This is inappropriate to capture concurrency properly.
3. Pick a transition $t$. What is the probability that $t$ is fired alone? If no other transition is enabled, the answer is 1. Otherwise, the answer is "the probability that the other transitions lock each other..." And so we lose the locality property that we were looking for: the probability of $t$ depends on decisions that are far from $t$.
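The synchronized routing protocol discussed above, including the stuttering case of item 1, can be reproduced by brute-force enumeration. The sketch below is an illustration only: the net topology is an assumption inferred from the five probabilities quoted in the text (tokens in a and b, a hole in c; $t_1$ needs a; $t_2$ needs a and b; $t_3$ needs b and the hole of c; $t_4$ needs the hole of c plus a resource `x` outside the three places that never votes for it), and the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product


def next_marking_distribution(p, q, r):
    """Distribution of the next marking (a, b, c) under simultaneous routing."""
    # routing law of each place: chosen transition -> probability
    choices = {
        "a": {"t1": p, "t2": 1 - p},  # token in a picks an output transition
        "b": {"t2": q, "t3": 1 - q},  # token in b picks an output transition
        "c": {"t3": r, "t4": 1 - r},  # hole in c picks an input transition
    }
    # resources whose vote each transition needs; "x" is an assumed external
    # resource that never votes, so t4 cannot fire in this marking
    resources = {"t1": {"a"}, "t2": {"a", "b"}, "t3": {"b", "c"}, "t4": {"c", "x"}}
    marking = {"a": 1, "b": 1, "c": 0}
    places = sorted(choices)
    dist = {}
    for votes in product(*(choices[pl].items() for pl in places)):
        vote = {pl: t for pl, (t, _) in zip(places, votes)}
        prob = Fraction(1)
        for _, pr in votes:
            prob *= pr
        new = dict(marking)
        for t, res in resources.items():
            # a transition fires iff all of its resources have chosen it
            if all(vote.get(pl) == t for pl in res):
                for pl in res:
                    new[pl] = 1 - new[pl]  # token leaves, or hole gets filled
        key = tuple(new[pl] for pl in places)
        dist[key] = dist.get(key, Fraction(0)) + prob
    return dist


p, q, r = Fraction(1, 2), Fraction(1, 3), Fraction(1, 4)
dist = next_marking_distribution(p, q, r)
```

The entry `dist[(1, 1, 0)]` is the probability of the stuttering case: the marking is unchanged because the three places made incompatible choices.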
7.4 The quest for exact matching between concurrency and probabilistic independence in probabilistic Petri Nets
The bottom line is: while the idea of local probabilistic routing for both tokens and holes is appealing, it fails here, due to the use of a globally synchronized protocol. But our objective is to have an exact matching between concurrency and probabilistic independence, a feature that fails to be provided by existing models. The crux toward this objective will be the combination of local probabilistic routings and a full use of the partial order semantics of PNs. It will be achieved by allowing an extra possibility to tokens and holes, under the form "I want to move" (in which case I select a routing) or "I want to stay". This choice will not be a random variable, which is a central feature of our model. The rest of this part is organized as follows. To properly define the stochastic PN model sketched above, we need a framework that handles both random variables (routings), non random variables (move or stay), and also constraints that encode legal behaviors of all these variables (the firing rules of Petri nets). These capabilities are borrowed from the CSS model [2], that we briefly present in section 8. Let us indicate right now a particularity of CSS: it abandons the requirement that probabilities be normalized (a point of view borrowed from statistical mechanics). This is technically helpful and has no effect on maximum likelihood techniques. Section 9 is dedicated to the effective construction of a partially randomized Petri net based on CSS. Section 10 computes the likelihood of one trajectory of the PN, and studies its properties. Notice that we shall use the words "trajectory", "path" and "history" interchangeably to describe a legal sequence of transitions in the Petri net. Finally, section 11 addresses the diagnosis problem and provides an extension of Viterbi type dynamic programming techniques to this model.
8 A quick overview of CSS
CSS (a Calculus for Stochastic Systems) has been designed to handle both stochastic and non stochastic variables, related through a set of constraints [2]. This framework proposes 1/ a definition of hybrid {stochastic / non stochastic} systems, 2/ a way of building complex systems by combination of simple ones, 3/ algorithms for incremental simulation and hidden state estimation. We shall concentrate on modelling, and then develop specific Viterbi type
algorithms by taking advantage of the particular nature of Petri Nets regarded as CSS systems.
8.1 Hybrid system
A CSS system (see fig. 16) is a 4-tuple $\Sigma = (X, \mathcal{C}, W, \mathbb{P})$ where
$X = \{X_1, \ldots, X_p\}$ is a finite set of visible variables, i.e. variables that are observable from outside the system. Part of their role is to allow communications between systems, since two systems having visible variables with the same name must share these variables.
$W = \{W_1, \ldots, W_q\}$ is a finite set of private random variables. They are local to each system, i.e. not visible from outside, and obey the unnormalized probability distribution $\mathbb{P}$.
$\mathcal{C}$ is a set of relations between the $X$'s and the $W$'s. They take the form of possible $(p+q)$-tuples $(x_1, \ldots, x_p, w_1, \ldots, w_q)$. We also talk about constraints, since $\mathcal{C}$ represents the set of possible behaviors of the system. These constraints can be "projected" on a subset of variables. For example, $x_1 \in \mathcal{C}$ means that there exist possible values $x_2, \ldots, x_p, w_1, \ldots, w_q$ such that $(x_1, \ldots, x_p, w_1, \ldots, w_q) \in \mathcal{C}$.
Example 1: putting $X_1 = W_1$ in the set of constraints amounts to making the random variable $W_1$ visible from outside.
Example 2: setting $(X_1)^2 = W_1$ means that the absolute value of $X_1$ is a random variable, but $X_1$ itself is not random since its sign is not probabilized.
Figure 16: A CSS system is composed of private random variables $W = \{W_i\}$ and visible variables $X = \{X_j\}$, that are related through a set of constraints $\mathcal{C}$.
8.2 Combination of systems
Let $\Sigma_1$ and $\Sigma_2$ be two CSS systems. Their combination $\Sigma = \Sigma_1 | \Sigma_2$ is defined by (cf. fig. 17):
$X = X_1 \cup X_2$: visible variables having the same name in the two systems are identified; they behave as communication channels between $\Sigma_1$ and $\Sigma_2$,
$W = W_1 \times W_2$: random variables are local,
$\mathbb{P} = \mathbb{P}_1 \times \mathbb{P}_2$: the prior joint distribution of $W_1 \times W_2$ makes its two components independent,
$\mathcal{C} = \mathcal{C}_1 \wedge \mathcal{C}_2$: conjunction of the two sets of constraints.
Figure 17: Combining two CSS systems.
Example. Consider a first system $\Sigma_1 = (\{X_1\}, \{W_1\}, \mathcal{C}_1, \mathbb{P}_1)$ where $\mathcal{C}_1$ only makes the random variable visible: $X_1 = W_1$. Let $\Sigma_2$ be a pure constraint on $X_1$: $\Sigma_2 = (\{X_1\}, \emptyset, \mathcal{C}_2, 1)$. The combination yields $\Sigma = (\{X_1\}, \{W_1\}, \mathcal{C}_1 \wedge \mathcal{C}_2, \mathbb{P}_1)$. $X_1$ is still a random variable in $\Sigma$,
but its range has been constrained by $\mathcal{C}_2$. In other words, we have built the conditional law of $X_1$ given the constraint $\mathcal{C}_2$. Of course $\mathbb{P}_1$ hasn't changed in the operation, but this is coherent with the fact that we handle unnormalized probability distributions. Assuming that $\mathbb{P}_1$ was a normalized probability, the correct conditional distribution of $X_1$ would be obtained by renormalizing: $\mathbb{P}_1(\cdot)/\mathbb{P}_1(\mathcal{C}_2)$. The "|" operator will play a crucial role in the sequel since a stochastic PN will be obtained as a combination of many mini-systems. We won't need all the details described above however, and will rely mainly on the concepts. Two key points must be kept in mind: 1/ random and non-random variables live together in this formalism, 2/ the probabilities are unnormalized, thus reducing the range of a random variable by constraints has no influence on $\mathbb{P}$; nevertheless it amounts to implicitly building a conditional (unnormalized) law.
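The combination operator and the implicit conditioning of the example can be sketched on finite domains. This is a minimal illustration, not the CSS implementation of [2]: the class `CSS`, its fields, the finite domains and the numerical law of $W_1$ are all assumptions chosen to mirror the example ($X_1 = W_1$ combined with a pure constraint on $X_1$, here $X_1 \neq 0$).

```python
from fractions import Fraction
from itertools import product


class CSS:
    """Toy CSS system over finite domains (illustrative only)."""

    def __init__(self, visible, private, constraint, weight):
        self.visible = dict(visible)   # variable name -> finite domain
        self.private = dict(private)   # variable name -> finite domain
        self.constraint = constraint   # predicate on a dict of all values
        self.weight = weight           # unnormalized probability of private values

    def combine(self, other):
        """The '|' operator: share same-named visible variables, multiply laws."""
        return CSS(
            {**self.visible, **other.visible},   # same name = same variable
            {**self.private, **other.private},   # private names assumed distinct
            lambda v: self.constraint(v) and other.constraint(v),
            lambda v: self.weight(v) * other.weight(v),
        )

    def behaviours(self):
        """Yield (values, unnormalized likelihood) for every allowed tuple."""
        names = list(self.visible) + list(self.private)
        domains = list(self.visible.values()) + list(self.private.values())
        for values in product(*domains):
            v = dict(zip(names, values))
            if self.constraint(v):
                yield v, self.weight(v)


law = {0: Fraction(1, 2), 1: Fraction(3, 10), 2: Fraction(1, 5)}
sigma1 = CSS({"X1": [0, 1, 2]}, {"W1": [0, 1, 2]},
             lambda v: v["X1"] == v["W1"], lambda v: law[v["W1"]])
sigma2 = CSS({"X1": [0, 1, 2]}, {},
             lambda v: v["X1"] != 0, lambda v: 1)
sigma = sigma1.combine(sigma2)
unnormalized = {v["X1"]: w for v, w in sigma.behaviours()}
total = sum(unnormalized.values())
conditional = {x: w / total for x, w in unnormalized.items()}
```

The weights attached to the surviving values are those of the original law; only the explicit division by `total` turns them into the conditional distribution, which is unnecessary when one only looks for a maximum likelihood solution.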
9 A CSS model for stochastic Petri nets
As suggested at the end of section 7, we aim at maintaining the random routing choices for tokens and holes, but we also add the extra capability of not participating in firings. However, if tokens keep the control on firings, lockings will not be erased. To allow coherent choices only, the idea is to transfer the control to transitions: they either fire, in which case they were chosen by all their resources, or rest, in which case none of their resources could have selected them. Finally, any token/hole that has been rejected by all its transitions will be considered as not wishing to move. It is quite easy to code this protocol with a CSS model since it simply amounts to adding constraints on the evolution of the net. What we build now is thus exactly the model described in section 7.3, conditioned by the fact that no locking appears.
9.1 Unfolding of time
We're going to represent the full PN as the combination of mini-systems: places and transitions. Places will be in charge of the randomness, and transitions will define constraints on these choices so as to ensure a correct Petri net behavior, and no locking. At this point, we have to face a slight modeling difficulty. The systems we handle in CSS assume no dynamics, whereas we need to express constraints between consecutive time indexes. So we have to duplicate place systems around a transition, in order to represent the "past" state of places, and their "future" state, after the transition is (or isn't) fired. Figure 18 gives an example of this duplication. Clearly, this procedure must be extended to the whole PN. To do so, we assume that a universal clock is given, that governs the instants at which changes can occur in the markings. This is somewhat contradictory with our objective of partial ordering of events in non competing regions, but this point will be discussed later on. For the time being, we adopt the unfolded CSS representation of the PN that is illustrated by figure 19. We have a new version of each place for every tick of the clock, while transitions appear as linking consecutive sets of places.
Figure 18: Unfolding. The CSS framework assumes no dynamics. To encode dynamic relations, we have to "unfold" time, i.e. to represent the past and the future state of places. This is done by duplicating places to stand for different instants. Transitions induce constraints on the evolution of the net, i.e. between past and future. Left: an ordinary PN transition. Right: CSS representation of a transition between two consecutive instants (communication by place choices on the past side, communication by tokens/holes on the future side). The arrows are kept to identify pre- and post-sets of a transition.
Figure 19: Unfolding of time to represent in CSS the paths of a PN. Circles stand for "place" mini-systems, whereas squares represent "transition" mini-systems.
9.2 Mini-systems
9.2.1 Place
The system "place" is composed of three visible variables: the token $\tau$, taking value 0 or 1, and two choice variables $C^o$ and $C^i$ (referring to "output" and "input" respectively). $C^o$ ranges from 0 to the number of output branches of the place, and will thus point towards the desired transition when $\tau = 1$ (token is present), or will be set to 0. The non zero values of $C^o$ are randomized, which means that there exists a (private) random variable $W^o$ such that $C^o = W^o$ as soon as $C^o \neq 0$. Of course, $C^i$ plays the symmetric role: it points towards an input branch when the token is absent, and non zero values of $C^i$ are randomized. The set of local constraints of a place is thus the following:
1. if $C^o \neq 0$ then $C^o = W^o$, corresponding to the predicate $(C^o = 0) \vee (C^o = W^o)$.
2. if $C^i \neq 0$ then $C^i = W^i$, corresponding to the predicate $(C^i = 0) \vee (C^i = W^i)$.
3. $(C^o = C^i = 0) \vee (\tau = 0 \wedge C^i \neq 0 \wedge C^o = 0) \vee (\tau = 1 \wedge C^i = 0 \wedge C^o \neq 0)$.
Constraint 3 encodes the rules of firing from the point of view of places: a token can leave only if there was a token and the place decided to let its token move, and symmetrically for empty places. Notice that constraint 3 implies that at most one branch is selected around the place. But the possibility that none is chosen is also accepted (case $C^o = C^i = 0$), which means that the resource of that place doesn't want to change. Notice that the choice {moving / not moving} is not randomized. Also, when "not moving" is decided by the place, the random variables $W^o$, $W^i$ become unconstrained, i.e., can take any value.
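The three local constraints of a place system can be written as a single predicate. The sketch below is an illustration under assumptions: branches are numbered from 1 and 0 means "no choice", as in the text; the argument names (`token`, `c_out`, `c_in`, `w_out`, `w_in`) stand for the token variable, the output and input choice variables and the two private random variables, and are not taken from the report.

```python
def place_ok(token, c_out, c_in, w_out, w_in):
    """Local constraints 1 to 3 of a 'place' mini-system."""
    rule1 = c_out == 0 or c_out == w_out   # a non zero output choice is the random draw
    rule2 = c_in == 0 or c_in == w_in      # a non zero input choice is the random draw
    rule3 = (
        (c_out == 0 and c_in == 0)                      # the resource does not move
        or (token == 0 and c_in != 0 and c_out == 0)    # a hole expects a token
        or (token == 1 and c_in == 0 and c_out != 0)    # a token wants to leave
    )
    return rule1 and rule2 and rule3


def admissible(n_out, n_in, w_out, w_in):
    """All (token, c_out, c_in) triples allowed for given random draws."""
    return [
        (token, c_out, c_in)
        for token in (0, 1)
        for c_out in range(n_out + 1)
        for c_in in range(n_in + 1)
        if place_ok(token, c_out, c_in, w_out, w_in)
    ]


# A place with two output branches and one input branch,
# whose random draws point to output branch 2 and input branch 1.
triples = admissible(n_out=2, n_in=1, w_out=2, w_in=1)
```

For each value of the token, exactly two visible behaviours survive: "stay" and "move along the randomly drawn branch". The move/stay alternative itself carries no probability.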
9.2.2 Transition
This system is composed of constraints only. Its visible variables coincide with those of the places to which it is connected. More precisely, it is linked to places in the past by $C$ variables: by $C^o$ if the place belongs to its preset, and by $C^i$ if the place is in the postset. The connections towards the future involve the token variables only (see fig. 18). Two situations are allowed for transition $t$: it is either fired, which means that all places in the past had voted for it, or not fired, in which case we impose that nobody had voted for it. This is done to forbid lockings, obviously. The constraints summarize to
4. [(all $C$ variables in the past point towards $t$) $\wedge$ ($t$ has been fired)] $\vee$ [no $C$ variable in the past had selected $t$]
By "$t$ has been fired", we mean of course that tokens are set to 0 in the preset and set to 1 in the postset, on the future side of $t$. Constraint 4 is just the relational form of the firing rule from the point of view of a transition.
9.2.3 Standby
Standby systems appear for a technical purpose only. A standby roughly behaves like a dummy transition linking every place at time $n$ to its future version at time $n+1$. Its function is to propagate the state of the token from past to future in case no transition has been fired around the considered place. This is coded by the constraint:
5. if $C^o = C^i = 0$ in the past, then the token variables $\tau$ are identical in the future and in the past.
For clarity, standby systems are not represented on figure 19.
9.3 Validity of the CSS model
We now prove the equivalence between the CSS model presented above and its underlying Petri net. Specifically, we show that to any trajectory of the Petri net corresponds a unique setting of visible variables in the CSS model. We assume that the PN trajectory is represented by an initial marking and a length $N$ sequence of (possibly empty) transition sets. Transitions of each set are fired simultaneously: they constitute a salvo. To represent this path on the CSS model, we first build a length $N$ unfolding of time. At time 0, tokens are determined by the initial marking. Given the marking at time $n$, we show below that the choices $C^o$, $C^i$ at time $n$ and the token variables $\tau$ at time $n+1$ are uniquely determined by the $n$th salvo. Conversely, we also show that only legal salvos are accepted by the CSS model.
Proof. On the CSS model, if transition $t$ has fired between times $n$ and $n+1$, it was chosen by its pre- and post-set places (rule 4), which can happen only if tokens and holes were correctly positioned (rule 3). So this firing is legal for Petri net rules. Moreover, each place can choose one transition at most. Hence transitions that have the status "fired" (between times $n$ and $n+1$) on the CSS model must be simultaneously firable for the Petri net. Thus a legal salvo for the CSS model is also legal for the PN. By rule 4, firing $t$ sets the tokens on its future side (time $n+1$), but also determines uniquely the $C^o$ and $C^i$ variables on its past side (time $n$). By rule 4 again, a transition $t$ that didn't fire was "forbidden" to all places in its past, thus one possibility is canceled in the range of $C^o$ (resp. $C^i$) for every place in the pre-set (resp. post-set) of $t$, at time $n$. Once all transitions have been checked, the $C$ variables (time $n$) are set for places that were connected to a fired transition. At most one transition is fired around a place (rule 3). In the case where none was fired, the $C^o$ and $C^i$ variables are necessarily set to 0 since all possible choices have been canceled.
This triggers the standby system, which imposes identity between tokens at times $n$ and $n+1$ (rule 5). So finally all tokens are set at time $n+1$.
Remark. Choice variables and random variables are useless for places at time $N$ (extremity of the CSS model), so these place systems can be simplified.
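The constructive content of the proof (a salvo determines the choice variables at time $n$ and the tokens at time $n+1$, and illegal salvos are rejected) can be sketched as a small function. This is an illustration under assumptions: markings are dictionaries from place names to 0/1, nets are given by `pre` and `post` dictionaries, a choice is coded by the name of the chosen transition instead of a branch number, and `apply_salvo` is a hypothetical name.

```python
def apply_salvo(marking, salvo, pre, post):
    """Return (c_out, c_in, next_marking) for a legal salvo, else raise ValueError.

    c_out[place] / c_in[place] hold the transition chosen by the place at time n,
    or 0 when the place did not move (its token is then copied by the standby).
    """
    c_out = {place: 0 for place in marking}
    c_in = {place: 0 for place in marking}
    nxt = dict(marking)  # standby: unchanged unless a fired transition says otherwise
    for t in sorted(salvo):
        for place in pre[t]:
            # the place must hold a token and must not have chosen another transition
            if marking[place] != 1 or c_out[place] != 0:
                raise ValueError(f"{t} cannot take the token of {place}")
            c_out[place] = t
            nxt[place] = 0
        for place in post[t]:
            # the place must hold a hole and must not expect another transition
            if marking[place] != 0 or c_in[place] != 0:
                raise ValueError(f"{t} cannot fill the hole of {place}")
            c_in[place] = t
            nxt[place] = 1
    return c_out, c_in, nxt


# Hypothetical net: t1 moves a token from a to c, t2 moves a token from b to c.
pre = {"t1": {"a"}, "t2": {"b"}}
post = {"t1": {"c"}, "t2": {"c"}}
marking = {"a": 1, "b": 1, "c": 0}
c_out, c_in, nxt = apply_salvo(marking, {"t1"}, pre, post)
```

Firing `{"t1", "t2"}` together is rejected in this example, because both transitions would need the single hole of c.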
10 Likelihood of a path
Let us denote by X and W the visible and random variables of the big CSS model associated to a given Petri net, and let $\mathcal{C}$ be the constraint set of this model. Assuming places are identified by the index k on the Petri net, we shall denote by $p_{k,n}$ the place system corresponding to place k at time n in the CSS model. The equivalence proved in the previous section can be stated in the following way: there is a one-to-one correspondence between $x \in \mathcal{C}$, a possible value for X, and legal paths of the Petri net (see footnote 2).

Now, the novelty of the CSS model is that some random variables are also determined in this equivalence. Namely, the choices of places connected to a fired transition are linked to a random. Conversely, the random $W_{k,n}$ of a place that didn't change ($C^i_{k,n} = C^o_{k,n} = 0$) is not constrained. In other words, given x there are several possible values w of W such that $(x, w) \in \mathcal{C}$. To any $(x, w) \in \mathcal{C}$ is associated a "likelihood" $L(x, w) = L(w)$. $L(w)$ can be computed easily since W is made of independent randoms $W_{k,n}$, one for each place system, thus

$$L(w) = \prod_{k,n} \mathbb{P}_{k,n}(w_{k,n}).$$

Of course, these likelihoods do not sum to 1, even if each $\mathbb{P}_{k,n}$ does, because the constraint set $\mathcal{C}$ has reduced the set of possible values for w (see the example, section 8.2). This is of little importance however since our problem in the sequel will be to find the x maximizing $L(x, w)$ inside a set of admissible solutions. Following [2] we thus define the likelihood of x as:

$$L(x) \triangleq \max_{w \,:\, (x, w) \in \mathcal{C}} L(w)$$

and study its properties below.
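As a small numerical illustration of this definition, the following Python sketch enumerates every value w compatible with a given x and keeps the best product of local weights. The two place systems, their unnormalized weight tables and the toy constraint `compatible` are hypothetical; they only serve to show that L(x) is a max over the unconstrained randoms, not a sum.

```python
from itertools import product
from math import prod

# Hypothetical unnormalized weights IP_{k,n}(w) for two place systems.
# Value 0 means "no transition chosen"; values 1, 2 name a chosen transition.
weights = {
    "p1": {0: 0.2, 1: 0.5, 2: 0.3},
    "p2": {0: 0.1, 1: 0.9},
}

def compatible(x, w):
    """Toy constraint set: a place whose choice variable in x is nonzero must
    have its random equal to that choice; a place with choice 0 is free."""
    return all(x[p] == 0 or w[p] == x[p] for p in x)

def likelihood(x):
    """L(x) = max over all w with (x, w) in the constraint set of prod IP(w)."""
    places = list(weights)
    best = 0.0
    for values in product(*(weights[p] for p in places)):
        w = dict(zip(places, values))
        if compatible(x, w):
            best = max(best, prod(weights[p][w[p]] for p in places))
    return best

# p1 chose transition 2 (its random is constrained), p2 did not change (free).
x = {"p1": 2, "p2": 0}
print(likelihood(x))
```

In this toy setting the free random of p2 simply takes its most favourable value, which is the behaviour exploited in section 10.2.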
10.1 Notion of tile
By purely algebraic considerations, we first show that one can reduce the set of constraints characterizing a given path x. The previous section has shown how the status of transition systems between times n and n+1 uniquely determines the C variables of all place systems at time n. Let us introduce the extra convention: "a C variable equals 0 unless it is set by a fired transition". Then only constraints imposed by fired transitions remain useful, and non-fired transitions can be forgotten. In the same way, a fired transition imposes token values to places of its future side. So standby system constraints become useless for these places.

These two remarks show that, given a path of the Petri net, its corresponding element x in the CSS model is characterized by only part of the constraints: those of fired transitions, and those of standby systems for places not involved in a firing. If we go back to figure 19 (right side), this means that we can "erase" transition systems that didn't fire, together with standby systems that weren't used, and still get a complete identification of a path. This operation yields simplified representations for trajectories, such as the one of figure 20.

Footnote 2: of length not greater than N.

Figure 20: A simplified trajectory. This trajectory is made of "tiles" corresponding to fired transitions (solid lines), and to standby operations (dotted lines).

Let us call a tile a triple composed of 1/ a transition name, 2/ the token conditions that make it enabled, on the past side, together with the associated choice variables, 3/ the result of firing this transition. Tiles can be represented as on figure 21. Obviously, the trajectory depicted in figure 20 is made out of connected tiles, just like a puzzle. Standby operations appear also, as filling elements, or "dummy tiles" that only reproduce the token. C variables are not represented graphically on tiles, since they play no role in the connectivity.

Figure 21: Tiles. These tiles correspond to the Petri net of figure 8. A letter "a" stands for "presence of the token in place a", while "$\bar{a}$" stands for its absence. Values of $C^o$ and $C^i$ variables are also attached to the left legs of each tile.
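A tile can be pictured as a small immutable record. The Python fragment below is only an illustration: the class `Tile`, the helper `connect` and the net used in the usage lines are hypothetical, and a marking of the safe net is represented as the set of its marked places. Choice variables are left out, as on the figures.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

@dataclass(frozen=True)
class Tile:
    name: str                 # 1/ transition name
    pre: FrozenSet[str]       # 2/ places that must hold a token (past side)
    post: FrozenSet[str]      # 3/ places that receive a token (future side)

def connect(marking: FrozenSet[str], tile: Tile) -> Optional[FrozenSet[str]]:
    """Return the marking reached by plugging `tile` on `marking`,
    or None when the tile does not fit (missing token or occupied hole)."""
    if not tile.pre <= marking:
        return None                      # a required token is absent
    remaining = marking - tile.pre
    if remaining & tile.post:
        return None                      # safe net: target place must be empty
    return remaining | tile.post         # untouched places act as standby

t = Tile("t", frozenset({"a"}), frozenset({"b"}))
print(connect(frozenset({"a", "c"}), t))   # token moves from a to b, c stands by
print(connect(frozenset({"b", "c"}), t))   # no token in a: the tile does not fit
```

Representing the untouched places implicitly, rather than as explicit dummy tiles, matches the later simplification in which standby operations are dropped from the trajectory.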
10.2 Computation of the likelihood
Given this simplified representation of a path x, we wish to compute the most likely w such that $(x, w) \in \mathcal{C}$. Let us consider the random $W_{k,n}$ of place system $p_{k,n}$. If this place is a left leg of a tile for path x, then (see footnote 3) $w_{k,n} = c_{k,n}$, the value of the choice variable $C_{k,n}$. Otherwise the place is the left leg of a standby system, so $c_{k,n} = 0$ and $w_{k,n}$ can take any value. Finally, we get

$$L(x) = \prod_{n=0}^{N-1} \left( \prod_{k \,:\, p_{k,n} \to \text{tile}} \mathbb{P}_{k,n}(c_{k,n}) \right) \left( \prod_{k \,:\, p_{k,n} \to \text{standby}} \max_{w_{k,n}} \mathbb{P}_{k,n}(w_{k,n}) \right) \qquad (2)$$

Footnote 3: for the sake of simplicity, we drop the distinction between $W^i$ and $W^o$, and between $C^i$ and $C^o$.

Since we are not interested in the actual value of the likelihood, let us divide both sides of (2) by the constant $C = \prod_n \prod_k \max \mathbb{P}_{k,n}(\cdot)$. This amounts to requiring that all "probabilities" $\mathbb{P}_{k,n}$ of place systems be such that their max is 1, which is allowed by CSS since it handles unnormalized probabilities. With this extra assumption, (2) becomes

$$L(x) = \prod_{n=0}^{N-1} \; \prod_{k \,:\, p_{k,n} \to \text{tile}} \mathbb{P}_{k,n}(c_{k,n}) \qquad (3)$$

so that standby systems now have no influence on the likelihood. Terms in (3) can be grouped tile by tile, and the final likelihood L(x) results in a product of "tile likelihoods". Notice that a single likelihood is attached to a tile: it is determined by the only choice permitted to places represented by the left legs of the tile.

In practical applications, it is more convenient to work with minus the log of likelihoods. So we shall talk about the cost of a tile and look for the minimal cost trajectory, where the cost of a trajectory is obtained by summing over tiles that have been used. This suggests simplifying once more the representation of a path, by removing standby systems since they have a null cost (figure 22).

Figure 22: A simplified trajectory. A trajectory can be seen as a concatenation of tiles only. Standby operations have been removed since they do not contribute to the global likelihood.
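The passage from likelihoods to costs can be checked numerically. In the sketch below the weight table of the place system and the tile likelihoods are invented for the example; only the two operations come from the text: rescaling a table so that its max is 1, and summing minus-log likelihoods over the tiles of a trajectory.

```python
from math import log, isclose

def normalize_to_max_one(weights):
    """Rescale an unnormalized table so that its largest entry is 1."""
    top = max(weights.values())
    return {choice: value / top for choice, value in weights.items()}

def tile_cost(likelihood):
    """Cost of a tile = minus the log of its likelihood."""
    return -log(likelihood)

def trajectory_cost(tile_likelihoods):
    """Cost of a trajectory = sum of the costs of the tiles it uses."""
    return sum(tile_cost(l) for l in tile_likelihoods)

# Hypothetical place system with three possible values of its random.
table = normalize_to_max_one({0: 0.6, 1: 0.3, 2: 0.1})

# A standby place contributes max_w IP(w), which is 1 after rescaling,
# hence a null cost.
standby_cost = tile_cost(max(table.values()))

# Hypothetical tile likelihoods along a path.
path = [0.5, 0.25, 0.5]
total = trajectory_cost(path)
print(standby_cost == 0, total)
```

The sum of tile costs equals minus the log of the product of tile likelihoods, so minimizing the cost and maximizing L(x) select the same trajectory.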
10.3 Consequences
We have shown above that all trajectories of a Petri net can be represented in the CSS framework as a concatenation of tiles (this must be related to the unfolding of a safe net, described in part I). The partial randomization we have defined with this model induces a local and constant cost for each tile. A simple addition over tiles in a path gives the global cost of this path.

Notice first that we have removed extra costs due to lockings: we only pay for firing a transition, and the price doesn't depend on the marking from which the transition is fired.

Secondly, let us come back to the universal clock assumption that was used to unfold time. Assume t1 and t2 are concurrent transitions, i.e. they require different resources and so can be fired simultaneously or in any order. Since waiting costs nothing in our model, the order in which they fire cannot be determined. All sequences described by fig. 23 are equivalent, so the two transitions behave just as if they had their own clock. More generally, we have randomized only partial orderings on firings, rather than total orderings, which provides a perfect matching between concurrency and independence.

Figure 23: Several firing sequences in the PN that have the same cost (axis: time; rows: t1, t2). Black rectangle = firing; white rectangle = standby.
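The equivalence of these firing sequences is easy to verify by enumeration. The following Python sketch uses two hypothetical tile costs and three time steps; a sequence is a tuple of salvos, and an empty salvo is a pure standby step.

```python
from itertools import permutations

# Hypothetical constant tile costs for two concurrent transitions.
cost = {"t1": 1.5, "t2": 0.75}
STANDBY_COST = 0.0   # waiting costs nothing in this model

def sequence_cost(steps):
    """Total cost of a sequence of salvos; each salvo is a set of fired
    transitions, and an empty salvo is a pure standby step."""
    return sum(sum(cost[t] for t in salvo) + STANDBY_COST for salvo in steps)

# All sequences over three time steps in which t1 and t2 each fire once:
# either in separate steps (any order), or together in one salvo.
sequences = set(permutations([frozenset({"t1"}), frozenset({"t2"}), frozenset()]))
sequences |= set(permutations([frozenset({"t1", "t2"}), frozenset(), frozenset()]))

costs = {sequence_cost(s) for s in sequences}
print(len(sequences), costs)
```

All enumerated sequences share a single cost, so the cost alone cannot order t1 and t2, which is the sense in which only the partial order is randomized.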
11 Search for the most likely path
11.1 Back to our original problem
A target application of the previous formalism is failure diagnosis in telecommunication networks. We assume that a safe PN representing fault propagations is provided for the whole network. We assume also that it has been randomized according to the model presented here: tiles have been isolated, and their costs have been computed.

Information is obtained from the system in the following way: each time a transition is fired, a label (an "alarm") is sent to a global supervisor. Labels are randomized, i.e. each transition picks at random a label among its possible choices. These emissions are assumed to be independent, which is a slight difference with part I (there, causal dependence relations between alarms were also known).

The objective is to find the most likely path of the network given a sequence of labels $\lambda_1, \ldots, \lambda_N$. By "likely" we mean cheapest in terms of the costs defined above. Of course, transitions have labels in common, otherwise the problem has a straightforward solution.
11.2 Algorithm
As in the case of standard dynamic programming, we build the most likely path by recursion on the number of absorbed measurements, i.e. the number of fired transitions, or tiles. We first review the usual Viterbi algorithm, under the Markov chain assumption, and then show how it specializes to the new model.
Standard dynamic programming
The forward sweep computes the most likely n-transition path leading to any (possible) marking $M_n$ that is compatible with the first n observations $(\lambda_1, \ldots, \lambda_n)$. It propagates according to the recursion

$$\mathbb{P}^*(M_{n+1}) = \max_{(n+1)\text{-paths} \,\to\, M_{n+1}} \mathbb{P}(M_{n+1}, \lambda_1^{n+1}) = \max_{(M_n, t_{n+1}) \,:\, M_n [t_{n+1}\rangle M_{n+1}} \mathbb{P}(\lambda_{n+1} \mid t_{n+1}) \, \mathbb{P}(t_{n+1} \mid M_n) \, \mathbb{P}^*(M_n) \qquad (4)$$

where the notation $\lambda_1^{n+1}$ stands for the sequence of alarms $\lambda_1, \ldots, \lambda_{n+1}$. The pair $(M_n, t_{n+1})$ attaining the max is stored as $M_n^*(M_{n+1})$ and $t_{n+1}^*(M_{n+1})$. The last step of the recursion, at time N, computes the most likely final marking by

$$M_N^* = \arg\max_{M_N} \mathbb{P}^*(M_N)$$

from which all previous optimal markings, and the optimal path, are deduced by the backtrack sweep

$$M_n^* = M_n^*(M_{n+1}^*), \qquad t_{n+1}^* = t_{n+1}^*(M_{n+1}^*).$$

This algorithm would apply directly to our model, provided we work with minus the log of the quantities appearing in (4). Let us describe this adaptation before stating the differences between the two methods.
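For comparison with the tile-based version, here is a minimal Python sketch of the forward sweep and backtrack of recursion (4), written in the minus-log domain. The marking graph, its marking-dependent transition probabilities and the label probabilities are all made up for illustration, and `viterbi` is a hypothetical helper name.

```python
from math import log

# Hypothetical marking graph: trans[M] = list of (t, P(t | M), next marking).
trans = {
    "M0": [("t1", 0.7, "M1"), ("t2", 0.3, "M2")],
    "M1": [("t3", 1.0, "M0")],
    "M2": [("t4", 1.0, "M0")],
}
# Hypothetical label probabilities P(label | t).
emit = {"t1": {"a": 0.5, "b": 0.5}, "t2": {"a": 1.0},
        "t3": {"b": 1.0}, "t4": {"b": 0.2, "c": 0.8}}

def viterbi(initial, labels):
    """Return (cost, [t_1, ..., t_N]) of the cheapest path explaining labels."""
    best = {initial: 0.0}          # -log P*(M_n) for each reachable M_n
    back = []                      # back[n][M_{n+1}] = (M_n*, t_{n+1}*)
    for label in labels:
        nxt, pointers = {}, {}
        for m, c in best.items():
            for t, p_t, m_next in trans[m]:
                p_label = emit[t].get(label, 0.0)
                if p_label == 0.0:
                    continue       # t cannot emit this label
                cand = c - log(p_t) - log(p_label)
                if m_next not in nxt or cand < nxt[m_next]:
                    nxt[m_next], pointers[m_next] = cand, (m, t)
        best = nxt
        back.append(pointers)
    if not best:
        raise ValueError("no path explains the observations")
    final = min(best, key=best.get)        # most likely final marking
    cost, path, m = best[final], [], final
    for pointers in reversed(back):        # backtrack sweep
        m, t = pointers[m]
        path.append(t)
    return cost, path[::-1]

print(viterbi("M0", ["a", "b"]))
```

Note that the table `trans` has to list a probability for every (marking, transition) pair, which is the part of the model that becomes unaffordable when the marking graph is large.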
Specialization to the new model: a concurrent Viterbi algorithm

Let $\mathcal{M}_n$ be the set of possible markings at time n, and $\mathrm{Cost}^*(M_n)$ the lowest cost among the paths that produced $M_n$ (and the observed alarms $\lambda_1^n = \lambda_1, \ldots, \lambda_n$). For each $M_n \in \mathcal{M}_n$, we construct a possible successor marking $M_{n+1}$ by connecting a tile $t_{n+1}$ that could have produced $\lambda_{n+1}$. The price associated to the path $(M_n, t_{n+1})$ is:

$$\mathrm{Cost}(M_n, t_{n+1}) = \mathrm{Cost}^*(M_n) + \mathrm{Cost}(t_{n+1}) + \mathrm{Cost}(\lambda_{n+1} \mid t_{n+1})$$

where $\mathrm{Cost}(t_{n+1})$ is of course the cost of tile $t_{n+1}$, and $\mathrm{Cost}(\lambda_{n+1} \mid t_{n+1})$ the price for emitting $\lambda_{n+1}$ from $t_{n+1}$. We end up with a set $\mathcal{M}_{n+1}$ of possible markings, some of which may have been reached in different ways. If several pairs $(M_n, t_{n+1})$ produce the same $M_{n+1}$, we keep the best one:

$$\mathrm{Cost}^*(M_{n+1}) = \min_{(M_n, t_{n+1}) \,:\, M_n [t_{n+1}\rangle M_{n+1}} \mathrm{Cost}(M_n, t_{n+1})$$

and $t_{n+1}^*(M_{n+1})$ stores the best tile. Of course, at the last step we keep only the best possible marking

$$M_N^* = \arg\min_{M_N \in \mathcal{M}_N} \mathrm{Cost}^*(M_N)$$

and backtrack by removing the last best tile

$$M_n^* = M_{n+1}^* \ominus t_{n+1}^*(M_{n+1}^*).$$

The key term of the standard dynamic programming recursion is $-\log \mathbb{P}(t_{n+1} \mid M_n)$. It represents a transition cost that varies with $M_n$. We have replaced it by a constant term $\mathrm{Cost}(t_{n+1})$, provided that $t_{n+1}$ applies to $M_n$, which is a somewhat minimal dependence. As a consequence, we no longer need to consider complex dynamics on the marking graph. Moreover, $\mathrm{Cost}(t_{n+1})$ is easy to compute from local considerations.
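One forward step of this recursion can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: markings are frozensets of marked places, a tile is a hypothetical tuple (pre-set, post-set, tile cost, label costs), and the two-tile net in the usage lines is invented. The marking is consulted only to check that the tile applies; the tile cost itself is a constant.

```python
def forward_step(best, tiles, label):
    """best: {marking: Cost*(M_n)}. Returns ({M_{n+1}: Cost*(M_{n+1})},
    {M_{n+1}: (M_n, tile name)}) after absorbing one label."""
    nxt, pointer = {}, {}
    for marking, cost_so_far in best.items():
        for name, (pre, post, tile_cost, label_costs) in tiles.items():
            if label not in label_costs:
                continue                       # tile cannot emit this label
            if not pre <= marking or (marking - pre) & post:
                continue                       # tile does not apply to M_n
            m_next = (marking - pre) | post
            cand = cost_so_far + tile_cost + label_costs[label]
            if m_next not in nxt or cand < nxt[m_next]:
                nxt[m_next], pointer[m_next] = cand, (marking, name)
    return nxt, pointer

# Hypothetical net: two tiles that both move the token from p to q.
tiles = {
    "u": (frozenset({"p"}), frozenset({"q"}), 2.0, {"x": 0.5}),
    "v": (frozenset({"p"}), frozenset({"q"}), 1.0, {"x": 1.0, "y": 0.0}),
}
costs, pointer = forward_step({frozenset({"p"}): 0.0}, tiles, "x")
print(costs, pointer)
```

Both tiles reach the same marking, so only the cheaper pair (marking, tile) survives, as in the min of the recursion.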
11.3 Example

Figure 24: Initial marking and possible alarms for each transition.

We consider here the network depicted in figure 24, where the possible alarms are represented next to each transition. Let us assume that the sequence (α, β, α, γ) has been received by the observer. We are going to detail the four (forward) steps of the algorithm.

Notice that we have defined tile costs, but the algorithm also takes into account the price of emitting a label from each given tile. We could as well have defined different versions of each tile, one for each possible label, and attached global costs to them (= cost of the tile + cost of the label for this tile). Example: $1_\alpha$ could denote transition 1 + emission of α from 1. We use this notation below.

The first step (fig. 25) starts building trajectories by connecting on $M_0$ a possible tile for the label α. There are 3 possibilities: $1_\alpha$, $3_\alpha$ and $5_\alpha$. Since they produce different markings, we keep all of them.

Figure 25: Step 1. Three α-tiles out of four can be connected on the initial marking. All of them produce different markings.

At step 2 (fig. 26) we want to connect a tile corresponding to β. No compatible connection is available for $5_\alpha$, so this track is abandoned. We end up with three possibilities: $(1_\alpha, 2_\beta)$, $(1_\alpha, 5_\beta)$ or $(3_\alpha, 4_\beta)$. But $(1_\alpha, 2_\beta)$ and $(3_\alpha, 4_\beta)$ yield the same marking, so we keep the best one, say $(1_\alpha, 2_\beta)$. Notice that on $(1_\alpha, 5_\beta)$ the order is determined by the observation sequence, since (1, 5) or (5, 1) are equivalent for the network alone.

Figure 26: Step 2. β-tiles produce a first reduction. One track is abandoned, and one is selected among two others that yield the same marking.

Step 3 (fig. 27) tries to incorporate a new α-tile on the two remaining tracks. It is possible in both cases and we get $(1_\alpha, 2_\beta, 1_\alpha)$, $(1_\alpha, 2_\beta, 3_\alpha)$, $(1_\alpha, 2_\beta, 5_\alpha)$ and $(1_\alpha, 5_\beta, 2_\alpha)$. The last two sequences produce the same marking, and moreover are identical from the network standpoint. Only alarm probabilities help the discrimination. We assume $(1_\alpha, 5_\beta, 2_\alpha)$ is the best choice.

The last step (fig. 28) now has to connect a γ-tile. All three tracks find a place to click into, which yields $(1_\alpha, 2_\beta, 1_\alpha, 5_\gamma)$, $(1_\alpha, 2_\beta, 3_\alpha, 4_\gamma)$ and $(1_\alpha, 5_\beta, 2_\alpha, 6_\gamma)$. Once again, the last two tracks produce the same marking, so we keep only the best one, say $(1_\alpha, 2_\beta, 3_\alpha, 4_\gamma)$. But since there is no more tile to add, we now have to look for the lowest cost among the two finalists, which yields the most likely trajectory of the net for the observed sequence (α, β, α, γ).
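The four steps above can be replayed with a short Python sketch. Two things in it are assumptions and are not taken from the report: the net structure (pre-sets, post-sets and label sets of transitions 1 to 6, with initial marking {a, b}), which is one reconstruction consistent with the steps described in the text, and all numerical costs, which were chosen so that the arbitrary preferences of the example (the "say ..." choices) are reproduced. For brevity the sketch stores the best path with each marking instead of backtracking.

```python
F = frozenset   # F("ab") is the set of places {a, b}, one character per place

# Assumed net: name -> (pre-set, post-set, tile cost, {label: label cost}).
TILES = {
    1: (F("a"), F("d"), 1, {"α": 0}),
    2: (F("d"), F("a"), 1, {"α": 1, "β": 2}),
    3: (F("ab"), F("c"), 2, {"α": 1, "β": 1}),
    4: (F("c"), F("ab"), 2, {"β": 1, "γ": 1}),
    5: (F("b"), F("e"), 1, {"α": 3, "β": 1, "γ": 2}),
    6: (F("e"), F("b"), 6, {"γ": 0}),
}

def concurrent_viterbi(m0, labels, tiles=TILES):
    """Return {final marking: (cost, [(tile, label), ...])} for the survivors."""
    best = {m0: (0, [])}                     # marking -> (Cost*, best path)
    for label in labels:
        nxt = {}
        for m, (cost, path) in best.items():
            for name, (pre, post, c_tile, c_label) in tiles.items():
                if label not in c_label or not pre <= m or (m - pre) & post:
                    continue                 # wrong label or tile does not fit
                m_next = (m - pre) | post
                cand = cost + c_tile + c_label[label]
                if m_next not in nxt or cand < nxt[m_next][0]:
                    nxt[m_next] = (cand, path + [(name, label)])
        best = nxt                           # one survivor per marking
    return best

finalists = concurrent_viterbi(F("ab"), ["α", "β", "α", "γ"])
cost, path = min(finalists.values(), key=lambda entry: entry[0])
print(cost, path)
```

With these assumed costs the two finalists are $(1_\alpha, 2_\beta, 1_\alpha, 5_\gamma)$ and $(1_\alpha, 2_\beta, 3_\alpha, 4_\gamma)$, and the first one is the cheaper; other cost values would of course select other survivors.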
Figure 27: Step 3. Two identical trajectories are obtained (on the right), but the sequence of alarms induces different orderings, among which one is preferred. Labels are mentioned here to make the distinction.

Figure 28: Step 4. Once again, a selection is made between trajectories yielding the same marking. Since we have introduced all labels, the most likely trajectory is the best one among the remaining possibilities.

12 Conclusion

For the application field underlying this report (failure diagnosis in telecommunication networks), standard stochastic Petri nets do not appear as the right framework. They assume a global knowledge of the state space, together with transition probabilities in this space. These elements are unaffordable for large networks. Moreover, ordinary stochastic PNs assume that all enabled transitions are in competition at a given time, even though they do not use the same tokens. This phenomenon seems to contradict the intuition that physical systems behave independently until they reach a state that requires cooperation or imposes competition.

We have presented here a new way of randomizing a Petri net. It aims at generating independent behaviours for regions of the net that are not directly in competition. This framework induces two original features. First, the complete ordering of transitions (or time) is abandoned. Only partial orderings are randomized. In other words, when transitions are not competing, we don't know which one fires first. This is in favour of parallelism. Secondly, the full transition probability matrix on the marking graph is replaced by very few elements (one for each transition in the net), and these elements are easy to compute. This second property has two major advantages: it greatly simplifies dynamic programming procedures, but also allows an easy updating of the stochastic model when the net is modified.

Several extensions of this framework have been developed, in particular for the case where some alarms are missing (masking phenomena). Efforts are now dedicated to the distribution of the diagnosis algorithm on sensors that observe different regions of the network.
References

[1] F. Baccelli, S. Foss, and B. Gaujal. Structural, temporal and stochastic properties of unbounded free-choice Petri nets. INRIA research report, (2411), November 1994.
[2] A. Benveniste, B.C. Levy, E. Fabre, and P. Le Guernic. A calculus of stochastic systems: Specification, simulation, and hidden state estimation. Theoretical Computer Science, (152):171-217, 1995.
[3] R. Boubour and C. Jard. Une approche pour des capteurs d'alarmes intelligents dans les réseaux. In A. Bennani, R. Dssouli, A. Benkiran, and O. Rafiq, editors, CFIP'96 : Ingénierie des Protocoles, October 1996.
[4] R. Boubour and C. Jard. Fault detection in telecommunication networks based on Petri net representation of alarm propagation. In P. Azéma and G. Balbo, editors, Proceedings of the 18th Int. Conf. on Application and Theory of Petri Nets, Toulouse, June 1997.
[5] A. Bouloutas, G. Hart, and M. Schwartz. On the Design of Observers for Fault Detection in Communication Networks, chapter 5. New York: Plenum, 1990.
[6] A. Bouloutas, G. Hart, and M. Schwartz. Fault Identification using a Finite State Machine Model with Unreliable Partially Observed Data Sequences, volume 41. July 1993.
[7] A. Bouloutas, G. W. Hart, and M. Schwartz. Simple finite-state fault detectors for communication networks. IEEE Trans. on Communications, 40(3):477-479, March 1992.
[8] R. David and H. Alla. Petri nets for modeling of dynamic systems - a survey. Automatica, 30(2):175-202, 1994.
[9] J. Engelfriet. Branching Processes of Petri Nets. Acta Informatica, 28(6), 1991.
[10] J. Esparza, S. Römer, and W. Vogler. An Improvement of McMillan Unfolding Algorithm. In 2nd Int. Workshop TACAS, number 1055 in Lecture Notes in Computer Science. Springer-Verlag, March 1996.
[11] R. Valette and L.A. Künzle. Réseaux de Petri pour la détection et le diagnostic. In Journées Sûreté, surveillance, supervision : détection et localisation des défaillances, number LAAS-R-94463, November 1994.
[12] J. Fidge. Timestamps in Message Passing Systems that Preserve the Partial Ordering. In Proc. 11th Australian Computer Science Conference, pages 55-66, February 1988.
[13] John M. Griffiths, editor. ISDN Explained: Worldwide Network and Applications Technology. John Wiley and Sons, 1990.
[14] G. Jakobson and M. D. Weissman. Alarm Correlation. IEEE Network, 7(6), November 1993.
[15] G. Jakobson and M. D. Weissman. Real-time Telecommunication Network Management: Extending Event Correlation with Temporal Constraints. In Sethi, Raynaud, and Faure-Vincent, editors, Integrated Network Management, number 4, pages 290-301. IFIP, Chapman and Hall, May 1995.
[16] I. Katzela, A.T. Bouloutas, and S.B. Calo. Centralized vs Distributed Fault Localisation. In Sethi, Raynaud, and Faure-Vincent, editors, Integrated Network Management, number 4, pages 250-261. IFIP, Chapman and Hall, May 1995.
[17] S. Kliger, S. Yemini, Y. Yemini, D. Ohsie, and S. Stolfo. A Coding Approach to Event Correlation. In Sethi, Raynaud, and Faure-Vincent, editors, Integrated Network Management, number 4, pages 266-277. IFIP, Chapman and Hall, May 1995.
[18] M. Ajmone Marsan, G. Balbo, G. Chiola, and G. Conte. Generalized stochastic Petri nets revisited: Random switches and priorities. Proceedings of PNPM '87, IEEE-CS Press, pages 44-53, 1987.
[19] M. Ajmone Marsan, G. Balbo, G. Conte, S. Donatelli, and G. Franceschinis. Modelling with generalized stochastic Petri nets. Wiley Series in Parallel Computing, 1995.
[20] F. Mattern. Virtual Time and Global States of Distributed Systems. In Cosnard, Quinton, Raynal, and Robert, editors, Proc. Int. Workshop on Parallel and Distributed Algorithms, Bonas, France, Oct. 1988. North Holland, 1988.
[21] M. Nielsen, G. Plotkin, and G. Winskel. Petri Nets, Event Structures and Domains, part 1. Theoretical Computer Science, (13), 1981.
[22] Y.A. Nygate. Event Correlation using Rule and Object Based Techniques. In Sethi, Raynaud, and Faure-Vincent, editors, Integrated Network Management, number 4, pages 278-289. IFIP, Chapman and Hall, May 1995.
[23] Y. Park and E. K. P. Chong. Fault detection and identification in communication networks: a discrete event systems approach. In Proceedings of the 1995 Allerton Conference, 1995.
[24] W. Reisig. A Primer in Petri Net Design. Springer-Verlag, 1992.
[25] O. Wolfson, S. Sengupta, and Y. Yemini. Managing Communication Network by Monitoring Databases. IEEE Transactions on Software Engineering, 17(9), September 1991.
[26] W. Vogler. Modular Construction and Partial Order Semantics of Petri Nets. Number 625 in Lecture Notes in Computer Science. Springer-Verlag, 1992.