Using Extended Logic Programming for Alarm-Correlation in Cellular Phone Networks

Peter Fröhlich¹, Wolfgang Nejdl², Michael Schroeder³, Carlos Damásio⁴, and Luis Moniz Pereira⁴

¹ ABB Corporate Research Center, Heidelberg, Germany, [email protected]
² Universität Hannover, Germany, [email protected]
³ City University, London, [email protected]
⁴ Centria, Universidade Nova de Lisboa, Portugal, {cd,lmp}@di.fct.unl.pt

Abstract. Alarm correlation is a necessity in large mobile phone networks, where the alarm bursts resulting from severe failures would otherwise overload the network operators. In this paper, we describe how to realize alarm correlation in cellular phone networks using extended logic programming, which provides integrity constraints and implicit and explicit negation. We solve different scenarios for a GSM network application using the extended logic programming system REVISE.

1 Introduction

Mobile networks, like the pan-European GSM networks, are growing rapidly. Alarm handling systems enable the operators to run such networks with minimal operation costs. The goal is to collect and interpret alarm messages and failure indications from the network elements without human intervention. In large networks, like the current GSM networks, the alarm vectors supplied by the network elements tend to flood the workstations of the operators, especially in critical situations like the passage of a thunderstorm front. Performance of the mobile network is degraded heavily in such situations, and operators have difficulties interpreting the burst of important and less important messages from the network [12]. To deal with alarm bursts, alarm correlation systems are required to filter and condense the incoming alarms into meaningful high-level alarms and diagnoses.

We review the application described in [12] and show how the problem is modelled and solved with extended logic programming, which provides integrity constraints and implicit and explicit negation. In the model-based approach, we describe the cellular network using an extended logic program. This logic program is used to predict the expected behaviour of the network, given assumptions about the correct or faulty behaviour of its network elements. We monitor the behaviour of the actual network and derive observations. A diagnosis has to be computed if the behaviour predicted under the current assumptions differs from the observed behaviour. This is summarized in Fig. 1. The diagnosis is computed by revising assumptions with the goal of finding valid sets of assumptions explaining the observed behaviour. Each assumption concerns the behavioural mode of a component in the network.

Fig. 1. Model-based diagnosis: the model yields predictions, the actual system yields observations, and a contradiction between the two triggers the computation of diagnoses.

We consider only the modes ok, i.e. working as specified, and abnormal, i.e. faulty. By default, we assume that all components are ok. If a failure is observed, we have to assume that some of the components are abnormal in order to explain the observed failure. Each minimal set of abnormal components explaining the failure is called a diagnosis [18].

Fig. 2. Structure of the GSM network [12]: the mobile station (MS), the access network (BTS with antennas and transceivers, cross connects (CC), microwave (ML) or cable links (CL), and the BSC), and the switched network (MSCs, ISDN).

Fig. 2 and Fig. 3 summarize our application domain. Mobile networks can be divided into three parts: the mobile station (MS, see Fig. 2); the access network with the base station transceiver (BTS) consisting of antennas, radio transceivers, cross connect systems (CC) and microwave (ML) or cable links (CL), and the base station controller (BSC); and the switched network, which is connected to the access network by the BSCs. The BSC provides the radio resource management, which serves the control and selection of appropriate radio channels to interconnect the mobile station and the switched network. The switched network interconnects the mobile station to the communication partner, which might be another mobile station or an ISDN subscriber [12].

Fig. 3. Star configuration of a base station subsystem [12]: the base transceiver stations (BTS) are connected to the base station controller (BSC) by microwave links and leased lines.

The rest of the paper is organised as follows. First we introduce extended logic programs and diagnosis, then we show how to model cellular phone networks and alarm correlation as extended logic programs, and finally we show how the alarm correlation is realised in REVISE [9], a system for contradiction removal of extended logic programs.

2 Extended Logic Programming and Diagnosis

Since Prolog became a standard in logic programming, much research has been devoted to the semantics of logic programs. In particular, Prolog's unsatisfactory treatment of negation as finite failure led to many innovations. Well-founded semantics [13] turned out to be a promising approach to cope with negation by default. Subsequent work extended well-founded semantics with a form of explicit negation and constraints and showed that the richer language, called WFSX, is appropriate for a spate of knowledge representation and reasoning forms [4]. In particular, the technique of contradiction removal from extended logic programs [17] opens up many avenues in model-based diagnosis [16,7,8,15].

2.1 Syntax

Definition 1. An extended logic program is a set of rules of the form

    L0 ← L1, ..., Lm, not Lm+1, ..., not Ln    (0 ≤ m ≤ n),

where each Li (0 ≤ i ≤ n) is an objective literal. An objective literal is either an atom A or its explicit negation ¬A. Literals of the form not L are called default literals. Literals are either objective or default ones. Note that the coherence principle relates explicit and default, or implicit, negation: ¬L implies not L for every objective literal L.

The behaviour of the system to be diagnosed is coded as an extended logic program. To express the assumption that the system works correctly by default we use negation by default.

Example 2. Below is an example of a syntactically correct rule containing negation by default not (as we will explain in detail later, this rule is part of the propagation model of the cellular phone network):

    signal(NE, up, Sender, Signal) ←
        not ab(NE), type(NE, ml), type(Sender, bts),
        class(Signal, farend_or_status_signal),
        signal(NE, down, Sender, Signal).

Rules such as the one above allow one to predict the behaviour of the system to be diagnosed in case it is working fine. To express that normality assumptions may lead to contradictions between predictions and actual observations we introduce integrity constraints.

Definition 3. An integrity constraint has the form

    ⊥ ← L1, ..., Lm, not Lm+1, ..., not Ln    (0 ≤ m ≤ n),

where each Li (0 ≤ i ≤ n) is an objective literal and ⊥ stands for false.

Syntactically, the only difference between program rules and integrity constraints is the head. A rule's head is an objective literal, whereas a constraint's head is ⊥, the symbol for false. Semantically, the difference is that program rules open the solution space, whereas constraints limit it.

Example 4. Integrity constraint. Now we can express that a contradiction arises if predictions and observations differ. In the setting of alarm correlation we use, for example, the constraint

    ⊥ ← ¬signal(bsc, down, Sender, alive), signal(bsc, down, Sender, alive).

to express that it is contradictory for the BSC to allegedly have received an alive signal from a BTS and to know at the same time that it has not. The rules used in diagnosis contain default literals of the form not ab(c), expressing the assumption that a component c is not abnormal, i.e. working ok. In Example 2, the rule contains not ab(NE), indicating the default assumption that the network element NE is working as specified. A contradiction may be based on the assumptions that the components are ok. We can remove a contradiction by partially revising some of these default literals. Technically, we achieve this by adding a minimal set of revisable facts to the initially contradictory program:

Definition 5. The revisables R of a program P are a subset of the default negated literals which do not occur as rule heads in P. For convenience, we define for a set of literals L that L̄ = {l | not l ∈ L} ∪ {¬l | l ∈ L}. The set R′ ⊆ R ∪ R̄ is called a revision if it is a minimal set such that P ∪ R′ is free of contradiction, i.e. P ∪ R′ ⊭ ⊥, where ⊥ stands for false (for a definition of the ⊨ operator see the next subsection).

Example 6. In the example above not ab(NE) is a revisable.

The limitation of revisability to default literals which do not occur as rule heads is adopted for efficiency reasons, but without loss of generality. We want to guarantee that the truth value of revisables is independent of any rules. Thus we can change the truth value of a revisable whenever necessary without considering an expensive derivation of the default literal's truth value.

Before we show how to model alarm correlation as extended logic programs, we define the WFSX (Well-Founded Semantics with eXplicit negation [4]) semantics of the programs. We use WFSX as it allows us to use an SLD-like top-down proof procedure rather than a bottom-up fixpoint computation. This top-down approach will be very important to efficiently compute diagnoses, as shown in Section 4. At first reading the section below may be skipped.

2.2 WFSX Semantics: a Top-Down Proof Procedure

WFSX, the well-founded semantics with explicit negation, enjoys the properties of simplicity, cumulativity, rationality, relevance, and partial evaluation that other semantics do not fully enjoy [4,1,2].

– Simplicity means that the semantics can be characterised by two iterative operators, without recourse to three-valued logic.
– Cumulativity means that the addition of lemmas, i.e. true propositions, to the program does not change the semantics.
– Rationality refers to the ability to add the negation of a non-provable conclusion without affecting the semantics.
– The issue of relevance is of particular importance for the top-down algorithm. Relevance means that the top-down inference of a literal requires nothing but the predicate call-graph below the literal.
– Partial evaluation means that partially evaluated programs do not change the semantics of the program.

The above properties are important, as the following two examples show. Consider the programs

    P1 = {p ← p}    and    P2 = {p ← not p}.

We obtain the minimal models M1 = {} and M2 = {p} for P1 and P2, respectively. Thus, as expected intuitively, we cannot conclude p from P1, but we do from P2. In Prolog, however, both cases would lead to no inference due to infinite loops. Trying to prove p in P1, Prolog applies the rule p ← p over and over again. A similar reason stops P2 from terminating in Prolog. These problems occur in general due to positive or negative loops through recursion. To deal with this problem we define T- and TU-trees, which prove verity and non-falsity, respectively [1–4].
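To see this looping behaviour concretely, the following plain-Prolog rendering of P1 and P2 is a small sketch of our own (Prolog's \+ stands in for negation by default); it is meant purely as an illustration of why a different proof procedure is needed:

    % Plain Prolog rendering of P1 and P2 (illustrative only).
    p1 :- p1.        % P1 = {p <- p}:     ?- p1. recurses forever (positive loop)
    p2 :- \+ p2.     % P2 = {p <- not p}: ?- p2. recurses forever through negation
    % Both queries loop until the stack is exhausted, which is exactly the
    % behaviour the T-/TU-tree procedure defined next avoids by loop detection.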

Definition 7. T-tree, TU-tree. Let P be a ground extended logic program. A T-tree (resp. TU-tree) for a literal L is an and-tree with root labeled L and nodes labeled by literals. T-trees (resp. TU-trees) are constructed top-down, starting from the root, by successively expanding new nodes using the following rules:

1. If n is a node labeled with an objective literal L, then if there are no rules for L in P then n is a leaf; else select a rule

       L ← L1, ..., Lm, not Lm+1, ..., not Ln

   from P. In a T-tree the successors of n are nodes labeled L1, ..., Lm, not Lm+1, ..., not Ln, while in a TU-tree there are, additionally, the successor nodes labeled not ¬L1, ..., not ¬Lm.
2. Nodes labeled with default literals are leaves.

In figures, TU-trees are drawn inside a box (see e.g. the figure in Example 9).

Definition 8. Successful or failed tree. A T- or TU-tree is either successful or it fails. All infinite trees are failed. A tree is successful/failed if its root is successful/failed. Nodes are marked as follows:

1. A leaf labeled true is successful.
2. A leaf labeled with an objective literal distinct from true is failed.
3. A leaf labeled with a default literal not L is successful in a T-tree (TU-tree) if (a) all TU-trees (T-trees) for L are failed or (b) there is a successful T-tree for ¬L. Otherwise it is labeled as failed.
4. An intermediate node n of a T-tree (TU-tree) is successful if all its children are successful and otherwise failed.

All remaining nodes are labeled failed in T-trees and successful in TU-trees.

Example 9. Consider the program p ← not p. To prove verity of p, a T-tree for p is created. In order to prove not p true, all TU-trees for p have to fail. Thus, an infinite loop occurs and therefore the nodes in the TU-tree are eventually classified as failed and the nodes in the T-tree as successful, so that p is proved true.

(Trees for Example 9: the T-tree has root p and child not p; the boxed TU-tree for p again reaches not p and p, forming the loop.)

Theorem 10. Correctness [1,2,4] Let P be a ground, possibly infinite, extended logic program, M its well-founded model according to WFSX, and let L be an arbitrary fixed literal.

– If there is a successful T-tree with root L then L ∈ M (soundness).
– If L ∈ M then there is a successful T-tree with root L (completeness).

As argued above, the main issues in defining top-down procedures for well-founded semantics are infinite positive recursion and infinite recursion through negation by default. The former results in failure to prove verity, while the latter results in failure to prove verity and falsity, i.e. the literal is undefined. Cyclic infinite positive recursion is detected locally in T-trees and TU-trees by checking whether a literal L depends on itself. A list of local ancestors is maintained to implement this pruning rule. For cyclic infinite negative recursion detection, a set of global ancestors is kept. A T-tree for a literal L that already appears in an ancestor T-tree is failed. A TU-tree for a literal L which already appears in an ancestor TU-tree is successful.

The demo predicate in Fig. 4 implements top-down inference using T-trees and TU-trees and pruning of cycles. The predicate demo has the goal to be proved, the mode t or tu, and the local and global ancestor lists as parameters. The top goal (1) initialises the ancestor lists as empty. Item 2 in Fig. 4 states that a node marked true is successful. A conjunction is inferred if the conjuncts hold (3). Items 4 and 5 implement that not L is successful in a T-tree (TU-tree) if L fails for TU-trees (T-trees), where the ancestor lists are kept accordingly. A revisable is proven if it is assumed, either by default or after flipping its truth value (6). Loop checking is implemented by (7) and rule application by (8) and (9). For revision, the demo predicate is extended with a fifth parameter that returns the revisables involved in the proof. With the demo predicate defined in Fig. 4 and Theorem 10, we can use demo to check entailment, i.e. P ⊨ L iff demo(L) succeeds for program P.

1: demo(L) :- demo(L, t, [], []).
2: demo(true, _, _, _).
3: demo((L, Cont), M, LA, GA) :- demo(L, M, LA, GA), demo(Cont, M, LA, GA).
4: demo(not L, t, _, GA) :- not demo(L, tu, [], GA).
5: demo(not L, tu, _, GA) :- not demo(L, t, GA, GA).
6: demo(L, _, _, _) :- revisable(L), !, assumed(L, _).
7: demo(L, _, Ans, _) :- loop_detect(L, Ans), !, fail.
8: demo(L, t, LA, GA) :- rule(L, Body), demo(Body, t, [L|LA], [L|GA]).
9: demo(L, tu, LA, GA) :- complement_neg(L, NL), rule(L, Body), demo((Body, not NL), tu, [L|LA], [L|GA]).

Fig. 4. Top-down proof procedure for WFSX (in PROLOG syntax).
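Fig. 4 leaves the auxiliary predicates rule/2, revisable/1, assumed/2, loop_detect/2 and complement_neg/2 unspecified. The following scaffolding is a minimal sketch of our own, with assumed names and encodings (a not prefix operator, a neg/1 functor for explicit negation, facts for the object program of Example 20 in Section 4), so that the listing can be loaded into a Prolog system such as SWI-Prolog and tried out; it is not the representation used by REVISE itself:

    :- op(900, fy, not).                 % let "not p" be written as in Fig. 4
    :- dynamic assumed/2.                % no revisable is assumed true initially
    :- use_module(library(lists)).       % member/2

    % Object-level program of Example 20; constraint heads are encoded as 'false'.
    rule(false, not p).
    rule(false, (a, not c)).
    rule(false, c).
    rule(p, a).
    rule(p, b).

    revisable(a).  revisable(b).  revisable(c).

    % A literal loops if it already occurs among the given ancestors.
    loop_detect(L, Ancestors) :- member(L, Ancestors).

    % Explicit complement of an objective literal, encoded with neg/1.
    complement_neg(neg(A), A) :- !.
    complement_neg(A, neg(A)).

    % ?- demo(false).   succeeds, i.e. the program is contradictory (Theorem 19),
    %                   mirroring the conflict derivation of Example 20.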

3 Modelling Cellular Phone Networks

The cellular phone networks are configured in a star topology (see Fig. 3), with exactly one path from a BTS to the BSC.

Fig. 5. The network's topology where microwave link ml18 is faulty [12]: the trunk connects the BSC via microwave links ml16–ml20 to bts17–bts21; BTS failure alarms are raised for the base stations downstream of the faulty link.

Since such networks are highly dynamic, an explicit model of the network is a necessary prerequisite for alarm correlation. Consider the trunk depicted in Fig. 5. Its topology is modelled by facts describing the components' types and connections (see rules 1 and 2 in Fig. 6). The types of the components are given by facts of the predicate type. The fact type(ml16, ml), for example, states that the network element ml16 is a microwave link (ml). The connections are described using the predicate conn. The fact conn(ml16, up, bsc, down) states that the up-stream port of microwave link ml16 is connected to the down-stream port of the base station controller.

1: type(ml16, ml).  type(ml17, ml).  type(ml18, ml).  ...

2: conn(ml16, up, bsc, down).  conn(ml18, up, ml16, down).  conn(bts18, up, ml17, down).  ...

Fig. 6. Facts for the network's topology.

The network elements are intelligent and perform local diagnosis, resulting in alarm messages which are sent to the BSC. In our model, we abstract from the actual messages sent through the alarm network and group them into the following classes:

farend_signal: This type of alarm message is generated by a BTS and sent to the BSC. It indicates that another BTS, which is located downstream from the sender (i.e. further away from the BSC), is unreachable. In Fig. 7 you see that the alarm farend_alarm_1 is classified as such a farend signal.

bts_failure_signal: These messages are generated directly in the BSC when it detects that a BTS is no longer reachable. As indicated in Fig. 7, there are several message types in the class bts_failure_signal, related to different layers of the communication protocol. In the case of a failure, instances of all these messages are physically generated, leading to the above-mentioned alarm showers.

status_signal: We have introduced this message type in our model to describe the answers of a BTS in the polling process. It therefore does not correspond to an alarm message in the protocol.

The alarm classes are specified with the predicate class. The fact class(farend_alarm_1, farend_signal) indicates that the alarm farend_alarm_1 belongs to the class farend_signal.

3: class(bts_omu_link_fail, bts_failure_signal).  class(bcch_missing, bts_failure_signal).
   class(available_traffic, bts_failure_signal).  class(lapd_link_failure, bts_failure_signal).
   class(farend_alarm_1, farend_signal).  class(alive, status_signal).

4: bts_failure_alarm(Sender) ←
       alarm(Sender, Alarm), type(Sender, bts), class(Alarm, bts_failure_signal).

Fig. 7. Alarm classes.
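As a small illustration of how rule 4 classifies an incoming alarm, consider the following query; the type fact for bts20 is our own assumption here, standing in for the elided type listing of Fig. 6:

    type(bts20, bts).               % assumed, analogous to the type facts of Fig. 6
    alarm(bts20, bcch_missing).     % one alarm from the burst in Fig. 12

    % ?- bts_failure_alarm(bts20).
    % succeeds via rule 4, since class(bcch_missing, bts_failure_signal) holds (rule 3).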

The propagation of alarms and status messages is captured by rules 5 and 6, shown in Fig. 8. The BSC periodically checks whether all BTSs are reachable. We model this using a status message alive, which is sent by each BTS to the BSC (rule 5). Usually the status message is forwarded to the BSC using the propagation mechanisms of the network described later. However, if a microwave link fails, the message is lost and the BSC reports a BTS failure alarm. This is captured in rule 6: if the BSC generates a BTS failure alarm, we know that the alive signal of the concerned BTS did not arrive at the BSC. Here explicit negation (¬signal) proves to be very useful to obtain a compact model.

5: signal(Sender, up, Sender, alive) ← type(Sender, bts).

6: ¬signal(bsc, down, Sender, alive) ← bts_failure_alarm(Sender), type(Sender, bts).

Fig. 8. Signal generation and suppression.

In Fig. 9 (rules 7–9) we formalize how signals are propagated over connections and through components (BTS or ML). Rule 7 states that signals are propagated from a component's upstream port to another component's downstream port if these two components are connected. Rule 8 expresses that BTSs propagate any status and farend signal from their downstream port to their upstream port. Note that we use no default literals in these rules, so these propagations always work and can never be assumed faulty. Microwave links propagate signals too, but they may be faulty, which is captured by the additional abnormal predicate (rule 9). By default we assume that the microwave link is not abnormal (not ab(NE)), but this default literal is a revisable whose truth value may be changed if a broken link can explain alarms and thus satisfy the constraints.

Finally, we have to specify the integrity constraints (see Fig. 10). It is contradictory to have and not to have an alive message at the BSC (constraint 10). A second constraint (11) captures an invariant for the BSC: for every BTS, the BSC either has the BTS's alive signal, or an alarm message of the BTS, or somehow the BTS's messages were lost. The latter is far more likely than faults in the system. This is represented by a low probability (0.001) for the abnormal predicate and a high probability (0.1) for the message_lost predicate (rule 12 in Fig. 11).

7: signal(NE2, down, Sender, Signal) ←
       type(Sender, bts), conn(NE1, up, NE2, down), signal(NE1, up, Sender, Signal).

8: signal(NE, up, Sender, Signal) ←
       type(NE, bts), type(Sender, bts), NE ≠ Sender, signal(NE, down, Sender, Signal).

9: signal(NE, up, Sender, Signal) ←
       not ab(NE), type(NE, ml), type(Sender, bts), signal(NE, down, Sender, Signal).

Fig. 9. Signal propagation.
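To make the propagation scheme of rules 5 and 7–9 concrete, the following plain-Prolog approximation runs them on a toy one-link chain. The element names ml_a and bts_x and the use of Prolog's \+ in place of WFSX's not are our own simplifications for illustration; the fragment is stratified, so negation as failure behaves as intended here:

    :- dynamic ab/1.                                 % no link is abnormal initially

    type(bts_x, bts).   type(ml_a, ml).              % toy elements (assumed names)
    conn(ml_a, up, bsc, down).                       % ml_a's up port feeds the BSC
    conn(bts_x, up, ml_a, down).                     % bts_x hangs below ml_a

    signal(S, up, S, alive) :- type(S, bts).                               % rule 5
    signal(NE2, down, S, Sig) :-                                           % rule 7
        type(S, bts), conn(NE1, up, NE2, down), signal(NE1, up, S, Sig).
    signal(NE, up, S, Sig) :-                                              % rule 8
        type(NE, bts), type(S, bts), NE \= S, signal(NE, down, S, Sig).
    signal(NE, up, S, Sig) :-                                              % rule 9; \+ approximates not ab(NE)
        \+ ab(NE), type(NE, ml), type(S, bts), signal(NE, down, S, Sig).

    % ?- signal(bsc, down, bts_x, alive).                     % true: alive reaches the BSC
    % ?- assertz(ab(ml_a)), signal(bsc, down, bts_x, alive).  % fails: a broken link blocks it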

The exact values for these probabilities are not important and would be hard to establish. Usually it is sufficient to have the right orders of magnitude among the probabilities [10].

10: ⊥ ← ¬signal(bsc, down, Sender, alive), signal(bsc, down, Sender, alive).

11: ⊥ ← type(Sender, bts), not signal(bsc, down, Sender, alive),
         not bts_failure_alarm(Sender), not message_lost(Sender).

Fig. 10. Constraints for signals.

12: probability(ab(_), 0.001).  probability(message_lost(_), 0.1).

Fig. 11. A-priori probabilities of revisables.

In order to compute diagnoses given the above model, consider Fig. 1 again. We specified a model of the system which is in itself consistent. Now we may compare the predictions of this model to actual observations. If they agree, everything is fine; otherwise we have to change assumptions in our model to accommodate the observations. In our application, the observations are alarm messages. Without any alarms, the model of the network is consistent. If alarms are generated and received by the BSC, the constraints are violated, and satisfying them involves assuming microwave links abnormal and/or messages lost, which yields the diagnoses of the problem.

Example 11. Consider the system description (rules 1–12) in the previous figures. It is consistent; but when we add the alarms in Fig. 12 (rule 13) it becomes inconsistent, as the constraints are violated. The burst of alarm messages is difficult to survey for a human operator, but revising assumptions in the system description to achieve consistency with the alarms easily reveals that the most probable explanation is microwave link ml16 being abnormal, i.e. {ab(ml16)} is a revision (in the sense of Definition 5) of the contradictory program composed of rules 1–13.

13: alarm(bts17, bcch_missing).  alarm(bts17, bcf_bie_alarm_in).  alarm(bts17, pcm_fail).
    alarm(bts17, bts_omu_link_fail).  alarm(bts18, bcch_missing).  alarm(bts18, pcm_failure).
    alarm(bts18, bts_omu_link_fail).  alarm(bts19, bcf_bie_alarm_in).  alarm(bts19, pcm_failure).
    alarm(bts19, bts_omu_link_fail).  alarm(bts20, bcch_missing).  alarm(bts20, pcm_fail).
    alarm(bts20, bts_omu_link_fail).  alarm(bts21, bcch_missing).  alarm(bts21, pcm_fail).
    alarm(bts21, bts_omu_link_fail).

Fig. 12. Alarms.

With our definition of a revision we can correlate the alarms or, more generally, revise contradictory extended logic programs. But Definition 5 is declarative: it defines what a revision is, but not how to compute it efficiently. The next section is devoted to this problem. We develop an efficient algorithm to compute revisions, which is an adaptation of Reiter's hitting-set algorithm [18,14] suitable for extended logic programming. Our algorithm is implemented in the REVISE system [9].

4 Computing the Revisions

Definition 5 states that a revision is a minimal set of revisables which, when added to the program, prevent the derivation of a contradiction. To compute such revisions we need the notion of a conflict: a set of revisables which supports the derivation of a contradiction. If we disable (or "hit") one revisable of each such conflict, the contradiction is removed. As detailed below, we call such a set a hitting set. Regarding the generation of conflicts, we will see that the top-down proof procedure and the demo predicate of Section 2.2 are of tremendous importance: we can extend the demo predicate to collect all revisables encountered during the proof, and the result is the conflict we wanted to generate. Conflicts and hitting sets are the core concepts of our algorithm, so let us define them formally.

4.1 Definitions

Before we show how the revisions are computed, we need some definitions. Conflicts are sets of revisables that lead to a contradiction.

Definition 12. Let P be an extended logic program with revisables R. Then C ⊆ R is a conflict iff P ∪ {¬c | not c ∈ C} ∪ {c | c ∈ C} ⊨ ⊥.

Example 13. Consider for example P = {⊥ ← not a, not b.  ⊥ ← not a, not c} with revisables {not a, not b, not c}. There are three conflicts (two of which are minimal): {not a, not b}, {not a, not c}, and {not a, not b, not c}.

To compute revisions, we have to change default assumptions so that all conflicts are covered. Such a cover is called a hitting set, since all conflicts involved are hit.

Definition 14. A hitting set for a collection of sets C is a set H ⊆ ⋃_{S ∈ C} S such that H ∩ S ≠ {} for each S ∈ C. A hitting set is minimal iff no proper subset of it is a hitting set for C.

Example 15. (continued) There are three hitting sets for the above conflicts, namely {not a}, {not b, not c}, and {not a, not b, not c}. Only the first two are minimal.

The next theorem is fundamental, as it gives a first hint of how to compute revisions (for the definition of H̄ see Definition 5):

Theorem 16. [18] Let P be a program. Then H̄ is a revision of P iff H is a minimal hitting set for the collection of conflicts for P.

Example 17. (continued) The above program has two revisions, {a} and {b, c}.

Theorem 16 states that revisions can be computed from conflicts and hitting sets, which in turn can be obtained from hitting-set trees [18]:

Definition 18. Let C be a collection of sets. A hitting-set tree for C, call it T, is a smallest edge-labeled and node-labeled tree with the following properties:

1. The root is labeled √ if C is empty. Otherwise the root is labeled by an arbitrary set of C.
2. For each node n of T, let H(n) be the set of edge labels on the path in T from the root node to n. The label for n is any set Σ ∈ C such that Σ ∩ H(n) = {}, if such a set Σ exists. Otherwise, the label for n is √. If n is labeled by the set Σ, then for each σ ∈ Σ, n has a successor nσ, joined to n by an edge labeled by σ.

To compute hitting-set trees, Reiter proposed an algorithm [18] which was corrected in [14]. In the next section we explain its adaptation to our approach.
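The following small Prolog sketch (our own, not REVISE code) enumerates hitting sets for an explicitly given collection of conflicts, such as the two minimal conflicts of Example 13; minimality still has to be filtered afterwards, which is what the hitting-set tree construction below achieves implicitly:

    :- op(900, fy, not).                  % write conflicts as lists of "not X" literals

    % hitting_set(+Conflicts, -HS): pick one element from every conflict.
    hitting_set([], []).
    hitting_set([Conflict|Rest], HS) :-
        member(E, Conflict),
        hitting_set(Rest, HS0),
        ( member(E, HS0) -> HS = HS0 ; HS = [E|HS0] ).

    % ?- hitting_set([[not a, not b], [not a, not c]], HS).
    % HS = [not a] ;  HS = [not a, not c] ;  HS = [not b, not a] ;  HS = [not b, not c].
    % The minimal ones, [not a] and [not b, not c], are the minimal hitting sets of Example 15.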

4.2 A Top-Down Algorithm to Compute Revisions

Our diagnosis algorithm to compute revisions consists of three components, as depicted in Fig. 13:

– a hitting-set tree, which is incrementally expanded,
– a conflict generator, which returns new conflicts based on partial revisions using the demo predicate, and
– a sorter, which re-arranges the nodes in the hitting-set tree according to a defined preference relation.

The three components are related as follows. The nodes of the hitting-set tree are conflicts. For a leaf n in the tree, the negated set of revisables H̄(n) on the path from the root to the leaf forms a partial revision. This partial revision is passed to the conflict generator, which checks whether it can compute another conflict under the partial revision. The conflict, if it exists, is passed back to the hitting-set tree component, which updates the tree accordingly. This also involves a re-ordering of the tree according to the desired preference relation, so that the most promising candidate among the partial revisions is expanded next. This mechanism allows us to cater for different minimality criteria such as minimality by set-inclusion, cardinality, or probability.

Fig. 13. Components of the revision algorithm: (1) the hitting-set tree passes a partial revision to the conflict generator, (2) the conflict generator returns a new conflict, and (3) the preference order re-orders the tree.

Conflict generator. The core of the conflict generator is the top-down proof procedure for WFSX presented in Section 2.2. The top-down inference allows us to construct one conflict at a time. We know that there are conflicts if and only if demo(⊥) (as defined in Fig. 4) succeeds. Furthermore, the successful T-tree for ⊥ contains the revisables forming the conflict. Thus we can employ demo to generate conflicts by collecting the encountered revisables in a successful proof of ⊥.

Theorem 19. demo and conflicts. There is a conflict for P iff demo(⊥) succeeds. If T is a successful T-tree for ⊥, then the revisables in T (negated if they occur in a T- or TU-subtree) form a conflict.

Example 20. Consider the following extended logic program with revisables a, b, and c, initially false:

    ⊥ ← not p.    ⊥ ← a, not c.    ⊥ ← c.    p ← a.    p ← b.

The top-down proof procedure creates a T-tree with root ⊥ and child not p. To prove verity of not p, falsity of p must be proven, so a TU-tree rooted at p is created. Since both children a and b fail, p fails and not p succeeds. Finally, ⊥ is proved based on the falsity of the revisables a and b. Thus, the proof procedure returns the conflict {not a, not b}, which means that at least one of a or b must be true.

The hitting-set tree component. Being able to compute conflicts, let us turn to the hitting-set tree. It is constructed iteratively by expanding leaves. To ensure that we generate a fresh conflict when expanding a leaf n, we add the partial revision H̄(n), i.e. the negated edge labels from the root to n, to the program P. The following lemma ensures that for intermediate nodes n there is a conflict of P that contains H(n).

Lemma 21. Let P be a program and n a node in a hitting-set tree such that n is not a leaf. Then there is a conflict C of P such that H(n) ⊆ C.

Proof sketch: By Definition 18 of a hitting-set tree we know that n \ H(n) ≠ {}. Therefore there are revisables in n not occurring in H(n), which cause a contradiction. Hence there is a conflict C′ for P ∪ H̄(n), and C = H(n) ∪ C′ is a conflict of P with H(n) ⊆ C. □

To put Lemma 21 in other words, we can revise the literals of H(n) and still obtain another conflict. Thus, we add H̄(n) to P and generate a conflict to expand the current node. With Theorem 19 and Lemma 21 we are in a position to define our revision algorithm, which is based on Reiter's original diagnosis algorithm [18,14]. The algorithm is depicted in Fig. 14; the numbers in brackets below refer to the lines in Fig. 14. We start with a hitting-set tree containing nothing but a root node (1). Then we iterate the following steps: As long as there are unmarked nodes, we select one, call it n, which is minimal according to our minimality criterion (2). Now we have to check whether this node can potentially lead to a revision. Therefore, we check whether there is another node n′ with H(n′) ⊆ H(n) which is already marked √ and thus is a solution. In this case we discard n, as it cannot be minimal anymore (4). Otherwise, we mark n (5) and check whether it is a solution, i.e. whether there are no further conflicts (6); in this case we label the node n with √. If there is, however, another conflict Σ (7), then we label the node n with Σ (9), unless Σ is already covered (8). Next, we expand the node (12), possibly re-using existing nodes (11).

Theorem 22. Computing Revisions. The top-down algorithm in Fig. 14 computes the revisions of P. If a leaf n is marked √, then H̄(n) is a revision.

1. Let D represent a growing dag. Generate a node which will be the root of the dag. This node will be processed by step 2 below.
2. While there is an unmarked node:
3.   Define H(n) to be the set of edge labels on the path in D from the root down to node n.
4.   If there is a node n′ which is labeled by √ and H(n′) ⊆ H(n), then close node n and mark it ×. Otherwise,
5.   let n be an unmarked node such that H(n) is minimal; mark n.
6.   If there is no conflict for P ∪ H̄(n) (i.e. demo(⊥) fails), then label n by √,
7.   else let Σ be a conflict for P ∪ H̄(n).
8.   If there is a node n′ which has been labeled by a set S′ of C where Σ ⊂ S′, then relabel n′ with Σ. For any α in S′ \ Σ, the α-edge under n′ is no longer allowed. The node connected by this edge and all of its descendants are removed, except for those nodes with another ancestor which is not being removed. Interchange the sets S′ and Σ in the collection.
9.   Else label n by Σ.
10.  If n is labeled by a set Σ ∈ C, then for each σ ∈ Σ,
11.    either reuse a node, i.e. if there is a node n′ in D such that H(n′) = H(n) ∪ {σ}, then let the σ-arc under n point to this existing node n′,
12.    or generate a new downward arc labeled by σ. This arc leads to a new node m with H(m) = H(n) ∪ {σ}. The new node m will be processed (labeled and expanded) after all nodes in the same generation as n have been processed.
13. Return the resulting dag, D.

Fig. 14. Top-down algorithm to compute revisions.

Proof sketch: The theorem follows by Reiter's Theorem 16. The algorithm in Fig. 14 deviates only slightly from Reiter's original algorithm: we adapted lines 5–7 to work with our approach. In line 5, we select a minimal partial solution H(n) to guarantee minimality of the final solution. In line 6, we use Theorem 19 to check for conflicts by calling demo(⊥). Lemma 21 ensures that we find further conflicts (if they exist) for the program P ∪ H̄(n) (line 7). The rest of the algorithm is left untouched, so that we inherit correctness and termination. □

Let us illustrate the algorithm by an example:

Fig. 15. Iterative construction of the hitting-set tree for Example 23: (1)–(3) the empty partial revision yields the conflict {not a, not b}, which labels the root; (4)–(5) expanding the not a edge with partial revision {a} yields the conflict {not c}; (6) expanding not b yields no further conflict, so that node is a solution; (7)–(9) expanding not c under not a with {a, c} yields the empty conflict, so that branch is closed, and the overall solution is {b}.

Example 23. Consider the extended logic program of Example 20, with revisables a, b, and c initially false, and follow Fig. 15:

    ⊥ ← not p.    ⊥ ← a, not c.    ⊥ ← c.    p ← a.    p ← b.

The algorithm starts with the empty graph and passes H(n) = {} to the conflict generator (1). As explained in Example 20, the proof procedure (demo) returns the conflict {not a, not b} (2), which can be satisfied by adding a or b to the program. Thus, the root node is labeled by the conflict {not a, not b} and its downward arcs are labeled not a and not b (3). Assume that the arc not a is selected for expansion and H̄(n) = {a} is passed to the conflict generator. The new conflict {not c} is found (4). Thus the node is labeled {not c} and a new downward arc not c is added (5). Next the node reached by the arc not b is selected for expansion and {b} is passed to the conflict generator. It turns out to be a revision, since ⊥ cannot be derived: the conflict generator returns nothing and the node is marked √ (6). Note that returning nothing is distinct from returning the empty set; the former means there are no conflicts and we are done, the latter means that there is a conflict which cannot be resolved. Finally, for expansion of the last node, {a, c} is sent to the conflict generator (7). The proof procedure returns the empty conflict {} (8), so the node is marked × as closed. Thus the overall solution is {b} (9).

5 Application

The algorithm developed above has been implemented in the REVISE system [9]. In this section, we give an example of how the algorithm works in our alarm-correlation domain and briefly discuss its performance.

5.1 Example for Alarm Correlation

Consider the description of the phone network of the previous section. Given the two alarms {alarm(bts20, bcch_missing), alarm(bts20, lapd_link_failure)}, the most probable solution is that microwave link ml19 is faulty and a message from bts21 was lost. To compute these revisions, our algorithm (Fig. 14) proceeds as shown in Fig. 16. The left column shows the expanding hitting-set tree and the right the conflict generator. For clarity's sake, the proof tree in the conflict generator contains only the most relevant literals, such as signal, etc. Assuming nothing (2), the base station controller explicitly derives from the given alarms that there cannot be an alive signal of bts20. However, since all components are assumed to be working correctly, the BSC also derives that there should be an alive signal of bts20. The components assumed fault-free and involved in the propagation are the microwave links ml16, ml18, ml19. They form the first conflict and, in (3), label the root of the hitting-set tree. The tree is now expanded step by step. First, it is assumed that ml16 is abnormal (4). A new conflict is derived, since the BSC has neither an alive signal from bts17 nor a failure alarm for it, nor was a message lost. The latter is revisable, i.e. we may assume that messages get lost, and thus the conflict {not msg_lost(bts17)} is returned. The hitting-set tree is updated accordingly

(5) and, in particular, the nodes are re-ordered according to their probability. While the branches for microwave links ml18 and ml19 have a probability of 0.001, the branch ab(ml16), msg_lost(bts17) has only a probability of 0.0001. Hence, the former are expanded before the latter. Similarly to (4), ml18 is now assumed abnormal (6), leading to a new conflict involving msg_lost(bts19). This process of changing assumptions, deriving a contradiction with an associated conflict, and updating the hitting-set tree accordingly carries on until finally, assuming ml19 abnormal and the message from bts21 lost, there is no further conflict. Due to the constant re-ordering of the hitting-set tree to expand the most probable candidate next, we know that we have found the most probable solution and can terminate the search.

Fig. 16. Computation of revisions: the hitting-set tree (left) and the conflict generator (right) interact as described above. The root conflict is {not ab(ml16), not ab(ml18), not ab(ml19)}; expanding its branches yields singleton conflicts of the form {not msg_lost(btsXX)} for the affected base stations, until the partial revision {ab(ml19), msg_lost(bts21)} produces no further conflict and is returned as the most probable diagnosis.

5.2 Consistency-based and Abductive Diagnosis

In the example of the previous section we computed diagnoses by changing assumptions so that the predictions from the model under these assumptions were consistent with the observations. Restoring consistency in this way is sufficient for some applications and was also the idea of the first definitions of model-based diagnosis, e.g. [18]. However, for some applications this consistency-based approach to diagnosis has been found insufficient, and more expressive forms of explanatory diagnosis have been developed [6]. In this section, we introduce another scenario from our cellular network application where consistency-based diagnosis is insufficient. We solve this example through an abductive diagnosis approach. The example stresses the suitability of REVISE for the different forms of model-based diagnosis, since we are able to use the same model and the same revision algorithm for abductive diagnosis, merely by adding additional constraints defining the abduction task.

To understand the need for a stronger form of diagnostic explanation, consider the alarms shown in Fig. 17. We observe that the BSC has generated BTS failure alarms for both BTS20 and BTS21, indicating that they are not reachable. On the other hand, the farend alarm sent by BTS19 has reached the BSC. Given the topology of the network trunk we are considering (shown in Fig. 5), the only correct diagnosis for this set of observations is a failure of microwave link ml19, because the BTSs located downstream from ml19 are unreachable, but a message from BTS19 has reached the BSC, so the path via ml18 and ml16 must be working properly.

14: alarm(bts20, bcch_missing).  alarm(bts20, pcm_fail).  alarm(bts20, bts_omu_link_fail).
    alarm(bts21, bcch_missing).  alarm(bts21, pcm_fail).  alarm(bts21, bts_omu_link_fail).
    alarm(bts19, farend_alarm_1).

Fig. 17. Alarms for the scenario in Fig. 18.

Unfortunately, consistency-based diagnosis leads to the two diagnoses ml18 and ml19 in this case. Fig. 18 explains why the unwanted diagnosis gets computed: ml18 is considered abnormal and so the farend alarm is lost at ml18 (the default literal in rule 9 is false and so the propagation of the message is blocked). However, it magically reappears at the BSC for no obvious reason. This phenomenon is a known weakness of the consistency-based approach to diagnosis. We have specified how messages are generated and propagated through the network, but we have not written down explicitly that these generation and propagation rules are the only way for messages to be transmitted. Therefore, the model allows the magical appearance of an additional alarm message if it is needed for consistency.

In the theory of abductive diagnosis [6], a subset of the literals is defined which must be explained, i.e. derived using the rules of the logic program. We can get rid of the counter-intuitive diagnosis by stating that the presence of messages at the BSC must be explained. After this modification, messages can only reach the BSC following the generation and propagation process described in our rule set.

Fig. 18. Counter-intuitive diagnosis resulting from the consistency-based approach: on the trunk of Fig. 5, assuming ml18 abnormal blocks the farend alarm from BTS19 at ml18, yet the alarm still has to be present at the BSC.

The technical implementation of this abductive approach is usually achieved by completing the rule base [5], i.e. by systematically adding rules which explicitly state that messages cannot appear magically within the network. This approach has been applied in several domains [6], including the mobile network domain described in the current paper [11]. REVISE allows for a simpler and more elegant solution based on its constraints. We can keep the rule base and only add the constraint:

15: ⊥ ← not signal(bsc, down, bts19, farend_alarm_1).

This constraint forces REVISE to derive the presence of the alarm message at the BSC using the rule set. After this modification, the counter-intuitive diagnosis ml18 is eliminated and only the correct diagnosis ml19 remains.

5.3 Performance

The algorithm described in Section 4 is implemented in the REVISE system (version 2.4) [9]. As described in [9], REVISE 2.4 substantially improves the performance of a previous implementation that relies on bottom-up computation and generates all conflicts prior to hitting-set computation. Fig. 19 shows timings and most probable diagnoses for 17 test vectors of alarms also used in [12]. REVISE ran on four Pentium II processors and timings are in msec. REVISE's performance is similar to that of the DRUM system described in [12]. REVISE is also available online at www.soi.city.ac.uk/~msch.

Conclusions

In cellular phone networks, faults of microwave links are likely to cause alarm bursts, which put the system operators to a hard test. Alarm correlation tools are needed which support the operators by identifying the cause of a large number of alarm messages.

Testvector  Time in msec  Diagnosis
 1.          270          {ab(ml16), message_lost(bts19)}
 2.          300          {ab(ml20)}
 3.          240          {ab(ml19)}
 4.          330          {ab(ml19), message_lost(bts21)}
 5.          350          {ab(ml19), message_lost(bts21)}
 6.          350          {ab(ml19), message_lost(bts21)}
 7.          210          {ab(ml18)}
 8.          190          {ab(ml18)}
 9.          250          {ab(ml18), message_lost(bts20), message_lost(bts21)}
10.           90          {}
11.          210          {ab(ml17)}
12.           90          {}
13.          140          {ab(ml16), message_lost(bts18), message_lost(bts19)}
14.          280          {ab(ml16), message_lost(bts19)}
15.          140          {ab(ml16), message_lost(bts19)}
16.          290          {ab(ml16)}
17.          270          {ab(ml16), message_lost(bts19)}

Fig. 19. Timings in msec for the most probable diagnosis for 17 sets of alarms.

The contribution of this paper is twofold: We have shown how to model alarm correlation in cellular phone networks declaratively as extended logic programs and how to diagnose these networks by satisfying violated integrity constraints using the REVISE¹ system. We have demonstrated that the key elements of extended logic programs naturally match the requirements of the diagnosis application: Integrity constraints are used to constrain predictions and observations; they also provide a very elegant and simple way to enforce abductive explanations which does not require changes to the rule base. Implicit negation is used to represent the default assumptions concerning the behavioural modes of the components. Explicit negation is a direct way to encode negative observations, which is not present in classical logic programming languages. Many practical problems like alarm correlation require a logical explanation of observations and therefore abductive capabilities as used in our approach.

When compared with other approaches to alarm correlation, such as neural networks [19], there are three major benefits:

1. A very small and maintainable system description that separates structural from behavioural components and thus makes changes of the network topology easy.
2. A propagation model which allows us to correctly diagnose unforeseen errors as well as multiple faults.
3. Failure probability estimates (see Fig. 11), which lead to correct diagnoses even on noisy data, where alarm messages have been lost.

Acknowledgements. We would like to thank ICCTI-BMFT.

¹ REVISE is online at www.soi.city.ac.uk/~msch.

References

1. J. J. Alferes, C. V. Damásio, and L. M. Pereira. Top-down query evaluation for well-founded semantics with explicit negation. In A. Cohn, editor, Proc. of the European Conference on Artificial Intelligence'94, pages 140–144. John Wiley & Sons, August 1994.
2. J. J. Alferes, C. V. Damásio, and L. M. Pereira. A top-down derivation procedure for programs with explicit negation. In M. Bruynooghe, editor, Proc. of the International Logic Programming Symposium'94, pages 424–438. MIT Press, November 1994.
3. J. J. Alferes, C. V. Damásio, and L. M. Pereira. A logic programming system for nonmonotonic reasoning. Journal of Automated Reasoning, 14(1):93–147, 1995.
4. J. J. Alferes and L. M. Pereira. Reasoning with Logic Programming. LNAI 1111, Springer-Verlag, 1996.
5. K. L. Clark. Negation as failure. In Gallaire and Minker, editors, Logic and Databases, pages 293–322. Plenum Press, New York, 1978.
6. Luca Console and Pietro Torasso. A spectrum of logical definitions of model-based diagnosis. Computational Intelligence, 7(3):133–141, 1991.
7. Carlos Viegas Damásio, Wolfgang Nejdl, and Luis Moniz Pereira. REVISE: An extended logic programming system for revising knowledge bases. In J. Doyle, E. Sandewall, and P. Torasso, editors, Knowledge Representation and Reasoning, pages 607–618, Bonn, Germany, May 1994. Morgan Kaufmann.
8. Carlos Viegas Damásio, Wolfgang Nejdl, Luis Moniz Pereira, and Michael Schroeder. Model-based diagnosis preferences and strategies representation with meta logic programming. In Krzysztof R. Apt and Franco Turini, editors, Meta-logics and Logic Programming, chapter 11, pages 269–311. The MIT Press, 1995.
9. Carlos Viegas Damásio, Luis Moniz Pereira, and Michael Schroeder. REVISE: Logic programming and diagnosis. In Proceedings of the Conference on Logic Programming and Non-monotonic Reasoning LPNMR97. LNAI 1265, Springer-Verlag, 1997.
10. Johan de Kleer. Using crude probability estimates to guide diagnosis. Artificial Intelligence, 45:381–391, 1990.
11. Peter Fröhlich and Wolfgang Nejdl. Efficient diagnosis based on incomplete system descriptions. In Proceedings of the Workshop on Non-monotonic Reasoning NMR98, Trento, Italy, 1998.
12. Peter Fröhlich, Wolfgang Nejdl, Klaus Jobmann, and Hermann Wietgrefe. Model-based alarm correlation in cellular phone networks. In Fifth International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), January 1997.
13. Allen Van Gelder, Kenneth Ross, and John S. Schlipf. Unfounded sets and well-founded semantics for general logic programs. In Proceedings of the 7th ACM Symposium on Principles of Database Systems, pages 221–230. Austin, Texas, 1988.
14. Russell Greiner, Barbara A. Smith, and Ralph W. Wilkerson. A correction to the algorithm in Reiter's theory of diagnosis. Artificial Intelligence, 41(1):79–88, 1989.
15. I. A. Móra and J. J. Alferes. Diagnosis of distributed systems using logic programming. In C. Pinto-Ferreira and N. J. Mamede, editors, Progress in Artificial Intelligence, 7th Portuguese Conference on Artificial Intelligence EPIA95, volume LNAI 990, pages 409–428. Springer-Verlag, Funchal, Portugal, 1995.
16. L. M. Pereira, C. V. Damásio, and J. J. Alferes. Diagnosis and debugging as contradiction removal. In L. M. Pereira and A. Nerode, editors, 2nd Int. Workshop on Logic Programming and Non-Monotonic Reasoning, pages 334–348, Lisboa, Portugal, June 1993. MIT Press.
17. Luis Moniz Pereira, José Júlio Alferes, and Joaquim Aparicio. Contradiction Removal within Well Founded Semantics. In A. Nerode, W. Marek, and V. S. Subrahmanian, editors, Logic Programming and Nonmonotonic Reasoning, pages 105–119, Washington, USA, June 1991. MIT Press.
18. Raymond Reiter. A theory of diagnosis from first principles. Artificial Intelligence, 32(1):57–96, 1987.
19. Hermann Wietgrefe, Klaus-Dieter Tuchs, Klaus Jobmann, Guido Carls, Peter Fröhlich, Wolfgang Nejdl, and Sebastian Steinfeld. Using neural networks for alarm correlation in cellular phone networks. In Proceedings of the International Workshop on Applications of Neural Networks in Telecommunications, 1997.
