The Expert Advisor: An Expert System for Real Time Network Monitoring

Tony White Bell Northern Research P.O. Box 3511 Station C Ottawa, Ontario K1S 5B6 email: [email protected] Andrzej Bieszczad Bell Northern Research Ottawa, Ontario K1Y 4H7 email: [email protected]

Keywords Real-Time Expert System, Network Monitoring

Abstract This paper describes the Expert Advisor (see footnote 1), a real-time expert system used in the monitoring of packet switching networks, and how it has been completely integrated into a conventional network surveillance system. The paper describes the primary functions of the expert system as being the identification of service-affecting conditions in the network and the presentation of all information pertinent to the problem to the network operator in a single entity. The expert system is based upon the concept of a network problem, defined in terms of network events and information from conventional databases. Problems are represented by problem descriptions that are written in a Problem Description Language (PDL) and are modular, allowing for incremental growth of the knowledge contained in the system. The paper describes system design considerations, the novel knowledge representation used, the Problem Description Language itself, and references a mathematical model used to compute the benefits derived by its use in a live network.

1 Copyright Northern Telecom.

1. Introduction A network operator's job is to monitor the network and identify service-affecting conditions as they arise and to take steps to rectify them. These actions can include the rebooting of a component, patching software or telephoning a repair person in order to have some on site repair effected. Frequently, a single fault in the network can cause the generation of tens of logs or alarms, most of which are artifacts of the fault rather than indicating the cause. It is left to the operator to determine which of the network events indicate the true failure. Also, as faults rarely occur one at a time, the network operator often has to deal with multiple faults simultaneously. With the increasing size and complexity of networks, the value of automated tools becomes apparent. The Expert Advisor is one such tool. The system acts as an interface between the event stream from the network and the operator and does what the operator traditionally has had to do - identify problems from a stream of network event information. Naturally, the system can deal with multiple faults simultaneously and is able to make correlations between problems in order to form problem hierarchies. The system can deal with several hundred problems simultaneously, significantly in excess of a human counterpart. The principal function of the Expert Advisor is to collect together in a single object - a problem - all events from the network, and other information from conventional data sources that might prove useful to the operator and present it in a comprehensible format. The system thus correlates network event information, along with other data, and presents the network operator with a set of problems. A problem can be any event or sequence of events in the network that is reflected in the stream of diagnostic data that comes from the network along with any information that has been gathered from system data bases. 
A problem can also have states associated with it - pieces of information that have been inferred from the stream of data gathered from the above sources. The key design goal in the system has been to present only the relevant information to the network operator, thereby eliminating extraneous data from the network. In this way, network operators can quickly identify, and rectify, problems in the network.

The Expert Advisor thus replaces a stream of network events with a set of problems that are structured in a hierarchy. Tools are provided with specialized browsers that allow rapid traversal of the problem hierarchy in order to view information pertinent to the solution of the problem at hand. The remainder of this paper is composed of five sections. Section two describes Expert Advisor design considerations. Section three provides a brief description of the various Expert Advisor elements. Section four describes aspects of the maintenance process. Section five describes the Expert Advisor benefits and section six concludes with a summary of the paper's key messages.

2. Design considerations An initial feasibility study indicated that conventional surveillance tools, such as a graphical browser, were component oriented, not problem oriented as is the nature of the work done by network operators. Conventional tools were found to be excellent at displaying the raw network events, but did little beyond that in terms of problem formulation. The need to create, and reason with, "a problem" was established. Tools from Artificial Intelligence seemed most appropriate for this. However, it was realized that the Expert Advisor could benefit significantly from providing data to, and receiving data from, the conventional surveillance tools for two reasons. Firstly, effective means for the graphical presentation of data have already been established, and the Expert Advisor should not duplicate these. Secondly, network operators are familiar with their conventional tools and would prefer new tools to augment, rather than replace, their existing mode of operation. For example, by allowing a graphical network browser to indicate one or more problems on a component and, through selection, to focus an Expert Advisor problem browser tool on that component, the best of the component and problem oriented paradigms is provided. A clear distinction was made between knowledge - behavior applying to all networks - and data - such as the topology of a specific network.

Several important characteristics were identified that the Expert Advisor must have in order to succeed. Firstly, the system must be able to operate in real time. It must be able to deal with large bursts of data from the network and to present the condition of the network to the operator quickly. A performance goal of 20 network events per second was established. After a prototype had been built using an expert system shell, this goal implied the need to work in a procedural language and to develop an interpreter that was highly tuned to the type of data being received from the network.

Secondly, the system should be customizable. Traditionally, vendor-developed expert systems have been closed or difficult to modify. Very rarely has it been possible to modify the knowledge base while the system remains actively monitoring the network. It is very important to have a network surveillance system operational for 100% of the time, and it should not be necessary to take down an expert system just because one inference is incorrect. Expert systems, like most software, contain defects; more importantly, the quality of their inferences can be improved by allowing the use of data from conventional data bases that are customer and network specific. By allowing customizability and providing a large startup knowledge base, both the novice and the expert user can derive benefits from the system, the latter by using the 10,000 lines of provided problem description language (PDL) statements as examples of PDL coding. Enhanced network operator training was highlighted as a benefit of providing a customization environment. With such a large knowledge base it is necessary to provide a mechanism for decomposing the knowledge base into smaller components and to provide a testing environment that can exercise the components in isolation. In providing a customizable system, several problems have to be solved: the choice of customization mechanism (design of a language or graphical programming environment), the provision of tools to effect knowledge base modification, and how to introduce these changes to a running expert system.

These problems are addressed in another paper concerned with the Expert Advisor customization environment; see [White, Bieszczad, 1992] for details.

Thirdly, the expert system should be integrated with conventional surveillance systems and be able to exchange information with them. Traditionally, expert systems have not been written such that they can easily be integrated with conventional surveillance systems. As a result, to a greater or lesser degree, the value of the expert system is lost as information has to be transferred manually from one system to another. The Expert Advisor is completely integrated with conventional surveillance tools, such as a graphical network browser, thereby exploiting the strengths of both systems. Hence the choice of platform had to be that of the conventional surveillance system. This is a SUN Sparc2 (see footnote 2).

Finally, the expert system should support a graphical user interface in order to present information in a format that is easy to understand and manipulate, preferably one consistent with the user interface paradigm used in conventional surveillance tools.

3. Expert System Design The real time Expert Advisor design consists of a number of distinct components, as is shown in figure 1. It should be noted that this is a much simplified view of the system.

Figure 1. Expert System Components: Problem Viewer, Application Program Interface, PDL Compiler, Problem Inference Engine, Knowledge Base.

There are four principal components of the expert system. They are the set of problem descriptions (or knowledge base), the problem inference engine that interprets those descriptions in the context of network events, an application program interface and a user interface component - the Problem Viewer. These four elements are described in detail in the next four sections.

2 Copyright Sun Microsystems.

3.1 Problem Descriptions The set of problem descriptions stored in a single directory comprises the knowledge base, which forms the most important part of the system. The knowledge base currently consists of approximately 10,000 lines of problem description language statements and 6,000 lines of problem help. One or more problem descriptions are contained in an ASCII file, stored on disk, and compiled to form structures usable by the problem inference engine. These compiled structures represent the expert knowledge needed by operators to monitor the network. This information includes the meaning of specific events and which events should be associated with a certain failure, the relationships between particular failures, and how information should be displayed for the operator. Figure 2 shows the organization of a knowledge base. A hybrid of model [Kahn et al, 1987] and rule based [Laffey et al, 1988] representations was chosen in the Expert Advisor. Early work on our current knowledge representation can be found in [Peacocke, Rabie, 1987] and [Rabie et al, 1988]. A description of the most recent knowledge representation, which integrates aspects of object oriented technology, can be found in [Baird, White, 1989].

The advantages of this modular approach are twofold. Firstly, the knowledge base used by the expert system can be easily modified by changing directory, and the system has been designed to support the dynamic modification of the knowledge base while monitoring the network. In this way, the Expert Advisor remains online during knowledge base maintenance. Garbage collection of vestigial problem descriptions is performed automatically by the system. Multiple knowledge bases are easily supported. For example, at Northern Telecom multiple knowledge bases have been built which allow fault diagnosis to take place down at the process level. Secondly, the whole knowledge base need not be recompiled if a single problem description needs to be changed, or a new problem description added.

Figure 2. Knowledge Base Organization. A directory (example: /surveillance/kb) contains files .obj1 to .objn (example: processor.obj), each holding one or more problems, Prob1 to Probm (example: processor.prb).

Each problem description is associated with a given network component type, i.e., each problem is said to occur on a specific component type. A problem description can, therefore, be described as encapsulating the partial behavior of a component type. The problems that are defined on a given component type together provide a comprehensive model of the behavior of that component type. In this sense the Expert Advisor can be said to be fault model based - the fault model being the sum of all of the behaviors encoded in problem descriptions for a particular network component type. For example, the Expert Advisor knowledge base contains "Processor Fail", "Accounting records lost" and "Alarms lost" problem descriptions for the network component type PE. Hence we say that the "Processor Fail", "Accounting records lost" and "Alarms lost" problems form a partial fault model for the network component type PE. Each problem description represents a category of service-affecting conditions. For example, "Processor Fail" contains descriptions of several types of processor failure and therefore accepts a wide range of network events. However, "Accounting records lost" indicates service degradation and is very specific in nature. In this latter case, only a single network event is captured.

A problem instance is an example of a particular problem description that has been instantiated on a specific component in the network. An example might be a "Processor Fail" problem instance on the component "PM R99 PE 8" in the network. To draw a comparison between the object oriented programming paradigm and the Expert Advisor knowledge representation scheme: the problem description corresponds to the object class and the problem instance corresponds to the object instance. As with the object oriented programming paradigm, several instances of a particular class can be active concurrently in the problem inference engine. In the Expert Advisor knowledge representation scheme, this translates to several physical components having similar faults.

A problem description can be seen as a frame consisting of 14 slots that represents the behavior of a network component type under specific fault conditions. These slots are:

Network component type: the type of component for which the problem applies.

Problem Name: a meaningful description of the nature of faults described in the problem description.
Problem trigger: a description of which events cause a new problem instance to be created. This slot is only used during problem instance creation.
Accepts: a description of which events from this network component type are to be associated with a problem instance after it has been created. This slot is only used after a problem instance has been created and is used in conjunction with the During monitoring slot.
Includes: a description of which events from other network component types are to be associated with a problem instance after it has been created. This slot is only used after a problem instance has been created and is used in conjunction with the During monitoring slot.
Retrieves: a description of which events are to be retrieved from the events stored in memory and associated with the newly created problem instance. This slot provides a "look back" mechanism. This slot is only used during problem instance creation.
Related: a description of which problem descriptions are related to this problem description. The slot encodes a "depends-upon" relationship. For example, a processor being part of a switch depends upon the power supply of the switch. This will be further described in a later section.

Figure 3. Problem Description Frame. Two frames are shown, each with the slots Network component type, Problem Name, Problem trigger, Includes, Accepts, Retrieves, Related, Suppresses, Sends, Receives, Displays, On creation, On deletion and During monitoring.

Suppresses: a description of which problem descriptions are subordinate or inferior to this problem description. This slot encodes a "part-of" relationship. For example, a port being part of a processor will have several of its problem descriptions suppressed by a processor problem description. The information encoded in this slot allows problem hierarchies to be generated. The Suppresses slot has a display impact in that Problem Viewer users by default only see the problem instances at the top of the various problem hierarchies. As the Suppresses slot allows the specification of specific problem descriptions on particular network component types, it is considerably more flexible than a simple "part-of" relationship.
Sends: consists of a list of problem states which are to be broadcast to other problems which are related to or suppressed by this problem. This slot, along with the Receives slot, implements an inter-problem message passing mechanism.
Receives: consists of a list of problem states that are to be accepted when broadcast from other problems that are related to or suppressed by this problem. This slot, along with the Sends slot, implements an inter-problem message passing mechanism.
Displays: provides a description of which problem states are to be displayed in the Problem Viewer. This slot encodes which state variables from a problem are of interest and, whenever they change, Problem Viewer users are automatically notified of the new value. This slot removes the need for "printf" rule actions, which comprised a large percentage of rule actions in early prototypes.

The final three slots provide mechanisms for describing the dynamic behavior of a problem as events arrive from the network. The Expert Advisor encodes this dynamic behavior using a rule-based representation. These slots are:

On creation: this slot consists of a set of production rules which are evaluated on problem instance creation.
On deletion: this slot consists of a set of production rules which are evaluated on problem instance deletion.
During monitoring: this slot consists of a set of production rules that are evaluated whenever an event occurs which is of interest to a particular problem instance, i.e., whenever an event passes through the Accepts or Includes filters.

By decomposing the frame into multiple slots which describe the various phases of the "life" of a problem instance, a significant reduction has been made in the number of rules which have to be evaluated for a single event. Similarly, by having slots which indicate which events apply to the various phases of the "life" of a problem instance, a similar reduction in inferencing has been observed. The overall Expert Advisor knowledge representation scheme is shown in figures 3 and 4. Figure 3 graphically shows the structure of a problem description frame, the arrows representing relationships between problem descriptions. Figure 4 graphically shows the model-based nature of the knowledge base and the inter-problem relationships which exist between problem description frames.

Figure 4. Knowledge Representation. Legend: network component, fault model, problem description, inter-problem relationship, knowledge base.
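As an illustration only, the class/instance analogy and a few of the frame slots described above could be modelled along the following lines in Python. This is a hypothetical sketch and not PDL: the names Event, ProblemDescription and ProblemInstance, the use of Python predicates for the Problem trigger and Accepts slots, the event text and the "Port problem" entry are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Event:
    component_type: str  # e.g. "PE"
    component: str       # e.g. "PM R99 PE 8"
    text: str            # hypothetical free-text event body


EventFilter = Callable[[Event], bool]


@dataclass
class ProblemDescription:
    """Plays the role of the object class: one frame per problem description."""
    component_type: str
    name: str
    trigger: EventFilter   # Problem trigger slot, used only at creation
    accepts: EventFilter   # Accepts slot, used only after creation
    suppresses: List[str] = field(default_factory=list)
    displays: List[str] = field(default_factory=list)

    def instantiate(self, event: Event) -> "ProblemInstance":
        # The instance is named after the component that raised the event.
        return ProblemInstance(self, event.component, [event])


@dataclass
class ProblemInstance:
    """Plays the role of the object instance: one per affected component."""
    description: ProblemDescription
    component: str
    events: List[Event]
    states: Dict[str, object] = field(default_factory=dict)


processor_fail = ProblemDescription(
    component_type="PE",
    name="Processor Fail",
    trigger=lambda e: "failed" in e.text,  # assumed trigger condition
    accepts=lambda e: True,                # accepts a wide range of events
    suppresses=["Port problem"],           # hypothetical subordinate problem
    displays=["status"],
)

# Two physical components with similar faults share one description.
first = processor_fail.instantiate(Event("PE", "PM R99 PE 8", "processor failed"))
second = processor_fail.instantiate(Event("PE", "PM R99 PE 1", "processor failed"))
```

Keeping the trigger and Accepts filters as separate fields mirrors the point made above: each filter is consulted only in the phase of the instance's "life" to which it applies.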

3.2 Problem Inference Engine The second component of the Expert Advisor is the problem inference engine (PIE). If the events are viewed as program data, then the set of problem descriptions forms a program operating on this data and PIE is the interpreter that runs this program. Events fall into five categories within the Expert Advisor. These are:

Network alarms.
Network status messages.
Expert Advisor internal alarms.
Expert Advisor expectation alarms.
Expert Advisor state messages.

The first two categories are called network events; the last three categories are called internal events. The algorithm that is used to process network events and internal alarms can be described as follows:

1. The event is first classified according to which network component type generated it.
2. All problem descriptions and problem instances which are associated with this network component type are then considered in two stages.
(a) In the first stage, new problem instances are created. The Problem trigger slot of the problem description is evaluated in the context of the event. If the event passes the trigger, then a new instance of the problem description is created, instantiated using the component name associated with the event. Note that each problem description may define several triggering events but the problem instance will only have one triggering event. In other words, each problem in the real network is triggered by a single event. Checks are provided in order to ensure that multiple instances of a given problem description are not generated on a specific network component.

Creating a problem instance involves several steps. (1) An identifier for the problem instance is created using the identity of the event. (2) The Retrieves slot is evaluated in order to retrieve events from the memory-resident event cache. (3) The On creation slot is then evaluated in the context of the triggering event. (4) The Suppresses and Related slots are evaluated and new inter-problem instance links are created as appropriate. This stage may cause changes in the display seen by a Problem Viewer user as problem hierarchies may be modified. (5) If new links are created, the Sends slots from the linked problem instances are evaluated and states that are present in the Receives slot of the newly created problem instance are sent as messages to it. This causes all rules that relate to that state to be evaluated. (6) The rules stored in the During monitoring slot are then evaluated in the context of the triggering event.

(b) In the second stage, the existing problem instances for the network component type are processed. The Includes slot is evaluated and if the event passes the filter, the event is added to the event list stored for the problem instance. The rules stored in the During monitoring slot are then evaluated in the context of the event. Note that rules are executed in the order in which they are written in the problem description. Forward chaining of the rules occurs whenever the value of a state variable is changed by the actions of a rule.

Expert Advisor expectation alarms provide a mechanism for the exploitation of the time dependency of events. Expectation alarms are created by the actions of rules, then inserted into the regular event stream after an appropriate delay. As such, expectation alarms provide a mechanism for belief revision. In association with the state message-passing mechanism mentioned previously, arbitrarily complex belief revision strategies can be encoded. Expectation alarms are used to express conditions such as:

If the processor does not clear within 60 seconds of failing
Then inform the operator that the processor is not recovering normally.

or

If 5 threshold alarms occur within a 10 minute time span
Then inform the operator that a potential problem exists.
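The 60-second processor condition quoted above can be illustrated with a small Python sketch of delayed event injection. It is a sketch under stated assumptions rather than the Expert Advisor's own mechanism: the ExpectationQueue class, the processor_rule function, the event strings and the use of a simulated clock passed in as `now` are all hypothetical.

```python
import heapq
from typing import List, Tuple


class ExpectationQueue:
    """Holds expectation alarms until their delay has elapsed (simulated clock)."""

    def __init__(self) -> None:
        self._pending: List[Tuple[float, int, str]] = []
        self._counter = 0  # tie-breaker so equal due times keep insertion order

    def schedule(self, now: float, delay: float, alarm: str) -> None:
        heapq.heappush(self._pending, (now + delay, self._counter, alarm))
        self._counter += 1

    def due(self, now: float) -> List[str]:
        """Remove and return every alarm whose due time has been reached."""
        released = []
        while self._pending and self._pending[0][0] <= now:
            released.append(heapq.heappop(self._pending)[2])
        return released


def processor_rule(state: dict, event: str, now: float,
                   queue: ExpectationQueue) -> List[str]:
    """Rule sketch for the processor condition; returns operator messages."""
    messages = []
    if event == "processor failed":
        state["failed"] = True
        # Rule action: create an expectation alarm to arrive 60 seconds later.
        queue.schedule(now, 60.0, "expect processor clear")
    elif event == "processor cleared":
        state["failed"] = False
    elif event == "expect processor clear" and state.get("failed"):
        # The belief "the processor will recover" is revised here.
        messages.append("processor is not recovering normally")
    return messages


queue = ExpectationQueue()
state: dict = {}
processor_rule(state, "processor failed", 0.0, queue)
for alarm in queue.due(60.0):  # alarms re-enter the regular event stream
    print(processor_rule(state, alarm, 60.0, queue))
```

The sketch does not distinguish between repeated failures of the same processor inside one 60-second period; a fuller version would need to tie each expectation alarm to the failure that created it.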

Expert Advisor state messages are generated whenever the value of a state variable is modified by a rule action. The Sends slot of the problem description script is evaluated and if it contains the state variable, the value of the state variable is broadcast to all related or suppressed problem instances.
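The two-stage processing of network events described in this section can be sketched in Python as follows. This is a simplified illustration, not the PIE implementation: the Engine, Description and Instance names are hypothetical, stage (b) is reduced to a single `accepts` filter restricted to the component that raised the event, and rule evaluation, the Retrieves look back and inter-problem links are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    component_type: str
    component: str
    text: str


@dataclass
class Description:
    component_type: str
    name: str
    trigger: Callable[[Event], bool]
    accepts: Callable[[Event], bool]


@dataclass
class Instance:
    description: Description
    component: str
    events: List[Event] = field(default_factory=list)


class Engine:
    def __init__(self, descriptions: List[Description]) -> None:
        # Index descriptions by network component type for step 1.
        self.by_type: Dict[str, List[Description]] = {}
        for d in descriptions:
            self.by_type.setdefault(d.component_type, []).append(d)
        # At most one instance per (problem name, component).
        self.instances: Dict[Tuple[str, str], Instance] = {}

    def process(self, event: Event) -> None:
        # Step 1: classify the event by the component type that generated it.
        descriptions = self.by_type.get(event.component_type, [])
        created = set()
        # Stage (a): create new instances for descriptions whose trigger fires.
        for d in descriptions:
            key = (d.name, event.component)
            if key not in self.instances and d.trigger(event):
                self.instances[key] = Instance(d, event.component, [event])
                created.add(key)
        # Stage (b): offer the event to instances that already existed.
        for d in descriptions:
            key = (d.name, event.component)
            inst = self.instances.get(key)
            if inst is not None and key not in created and d.accepts(event):
                inst.events.append(event)
                # During monitoring rules would be evaluated here, in order.


processor_fail = Description(
    "PE", "Processor Fail",
    trigger=lambda e: "failed" in e.text,  # assumed trigger condition
    accepts=lambda e: True,
)
engine = Engine([processor_fail])
engine.process(Event("PE", "PM R99 PE 8", "processor failed"))
engine.process(Event("PE", "PM R99 PE 8", "processor restarting"))
```

Keying instances by problem name and component is one way to realise the check that multiple instances of a given problem description are not generated on a specific network component.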

3.3 Problem Viewer Network operators use the expert system by accessing a Problem Viewer which has been built using the NT Signature (see footnote 3) tool kit, which is built on top of X windows (see footnote 4). Each operator defines an individual view (or set of views) of the network in terms of the region of interest (the components) and the problem descriptions of interest. A Problem Viewer provides a number of specialized browsers that can display different types of information. Problems are displayed in a hierarchy, with the highest level, the network, displayed in the main Problem Viewer browser. Every problem is considered to be a sub problem of the network. Figure 5 shows the main Problem Viewer browser with two problems: a PE problem on RM 99 PE 1, which is out of service, and a port problem on AM 98 PE 5 PI 5 PO 1, indicating a modem problem. As the menu in figure 5 indicates, the alarms, status messages and system messages can be reviewed for the problem instance. An example of a message browser is shown in figure 6. Alarm and status browsers are similar.

Figure 5. Main Problem Viewer Browser, listing "RM99 PE 1: Out of service" and "AM98 PE 5 PI 5 PO 1: Modem problem" (both timestamped 22-10:10:01), with a menu offering Alarms, Status, Messages, Related, Problem help and Help.

Figure 6. Problem Viewer Message Browser for AM 98 PE 5 PI 5 PO 1, showing timestamped messages such as "[22-10:09:00] Modem okay" and "[22-10:10:01] Modem problem".

It is possible for the operator to manipulate the problem instance hierarchy. Problem instances can be moved from one level of the hierarchy to another and it is possible to hide or delete problem instances. Hiding a problem instance moves the problem instance to another specialized browser where it remains until the operator "unhides" it, or a change in the state of the instance occurs. For problem instances that have suppressed or related problems it is possible to open a related (or sub problem) browser. A sub problem browser looks similar to figure 5 except that the title "Problem Viewer" is replaced with the name of the problem, for example, "RM 99 PE 1".

As the menu in figure 5 indicates, help for a particular problem can also be obtained. Help on a problem provides textual information on, for example, how to solve the problem and where extra information can be obtained. The use of graphical and video help is currently under research. Audio help is already provided directly from the Problem Viewer without the provision of a special browser. An example of a help browser is shown in figure 7. Multiple radio buttons are defined in a help browser. Each radio button provides help on an aspect of the problem, or possibly alternative assistance that cannot be disambiguated based upon the event-related information processed by the expert system.

Figure 7. Problem Viewer Help Browser. Help for: RM 99 PE 1. "Currently, accounting records are being lost. This is possibly due to several calls clearing simultaneously and the processor is under-engineered. Please check call clear logs and, if excessive call clears are observed, re-engineer the processor with more memory." Radio buttons: Records lost; Re-engineering a processor; Help on help.

3 Copyright Northern Telecom.
4 X11. Copyright Massachusetts Institute of Technology.

3.4 Application Program Interface The application program interface (API) allows other applications to share information with the Expert Advisor. The API provides a simple ASCII interface with mechanisms to:
(a) Get data from the Expert Advisor, i.e., ask for historical data.
(b) Set data in the Expert Advisor, i.e., modify parameters stored in the Expert Advisor. This allows problem states to be modified by external programs, implying that it is possible to inject state message events into the system.
(c) Ask to be notified of changes in Expert Advisor parameters.
(d) Destroy data stored in the Expert Advisor.
The API allows conventional applications such as a graphical network browser to benefit from the advanced reasoning done by the Expert Advisor. It is currently being used to drive a trouble ticket system.
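The paper does not give the ASCII syntax of the API, so the following Python sketch only illustrates the four mechanisms (get, set, notify, destroy) over a line-oriented text protocol. The GET, SET, NOTIFY and DESTROY verbs, the parameter naming, the reply strings and the AdvisorStore class are all assumptions invented for the example.

```python
from typing import Callable, Dict, List, Optional

Watcher = Callable[[str, str], None]


class AdvisorStore:
    """Toy in-memory stand-in for parameters held by an expert system."""

    def __init__(self) -> None:
        self.values: Dict[str, str] = {}
        self.watchers: Dict[str, List[Watcher]] = {}

    def handle(self, line: str, notify: Optional[Watcher] = None) -> str:
        """Handle one ASCII request line (hypothetical syntax), return a reply."""
        parts = line.strip().split(" ", 2)
        verb = parts[0].upper()
        if verb == "GET" and len(parts) == 2:
            return self.values.get(parts[1], "")
        if verb == "SET" and len(parts) == 3:
            key, value = parts[1], parts[2]
            self.values[key] = value
            # A set acts like an injected state message: tell every watcher.
            for callback in self.watchers.get(key, []):
                callback(key, value)
            return "OK"
        if verb == "NOTIFY" and len(parts) == 2 and notify is not None:
            self.watchers.setdefault(parts[1], []).append(notify)
            return "OK"
        if verb == "DESTROY" and len(parts) == 2:
            self.values.pop(parts[1], None)
            return "OK"
        return "ERROR"


store = AdvisorStore()
changes: List[tuple] = []
store.handle("NOTIFY pe8.status", lambda key, value: changes.append((key, value)))
store.handle("SET pe8.status out of service")
print(store.handle("GET pe8.status"))
```

A client such as a trouble ticket system would, in this sketch, register a watcher once and then react to each change as it is reported, rather than polling with repeated get requests.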

4. Maintenance Maintenance of the Expert Advisor knowledge base takes two forms. These are:
• End user customization
• Northern Telecom maintenance

The first type of maintenance uses the customization environment described earlier. The second type of maintenance occurs once every six months, or whenever a release of network software occurs. At that time, differences between the current and previous versions of the network log documentation are computed and appropriate software designers consulted to determine the impacts of these new logs. New problems are added or existing ones modified based upon interviews with these designers. New versions of the various problem scripts are placed in a revision control system. This is done in order that Northern Telecom updates can be merged with whatever end user changes have been made to appropriate scripts. This same revision control system is used in the customization environment in order that end users can maintain a history of their knowledge base changes.

5. Benefits A mathematical model [White, 1989] has been built in order to calculate the benefits of the Expert Advisor. The Expert Advisor was designed to assist network operators in the following areas.
• Problem identification
• Problem tracking
• Problem resolution
• Problem review
• Operator training

These benefits are described in the next five sections.

5.1 Problem identification One or more network events are said to trigger a problem. The network operator is presented with a stream of problems, not network events such as logs or alarms. Constant observation of a stream of logs or alarms is therefore not required. Audio feedback, such as "PM R99 PE 1 is down", has helped here.

5.2 Problem tracking Once identified, the status of a problem is monitored automatically and displayed on screen. One of the principal functions of the Expert Advisor is to update the various browsers in real time in such a way as to present the most important problem in the main browser. Thus, only the most important information is displayed; irrelevant network events, or events that are artifacts of another problem, are hidden. Multiple failures or repeating problems are easily captured.

5.3 Problem resolution Assistance in the resolution of persistent problems is provided by clustering all relevant information together in one place. Status information and data from electronic sources can be accessed without recourse to other tools. Further, suggestions on possible solution strategies are available via the problem help browser. It is also possible to have the system send commands to the network automatically, which speeds up resolution of the problem.

5.4 Problem review Most conventional surveillance tools delete fault indicators on a given component when the component has returned to service. Thus a "fault audit trail" does not exist and it is difficult to detect chronic component failures. The Expert Advisor does not delete a problem when the associated fault has cleared up but rather retains it up to the storage limits of the system. Storage capacity has been so designed that 12-24 hours of problems can be stored in memory in order that one shift can hand off chronic failures to the next shift.
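The idea of a "fault audit trail" can be illustrated with a short Python sketch that retains cleared problems for a fixed period and reports components that fail repeatedly. It is hypothetical throughout: the paper bounds retention by storage capacity, whereas this sketch uses a time window for simplicity, and the ProblemHistory class, its method names and the problem labels are assumptions. Records are assumed to arrive in time order.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class ProblemHistory:
    """Keeps cleared problems for a retention window (default 24 hours)."""

    def __init__(self, retention_seconds: float = 24 * 3600) -> None:
        self.retention = retention_seconds
        self._items: Deque[Tuple[float, str]] = deque()

    def record(self, now: float, problem: str) -> None:
        # A cleared problem is kept rather than deleted.
        self._items.append((now, problem))
        self._evict(now)

    def _evict(self, now: float) -> None:
        # Drop only the entries that have aged out of the window.
        while self._items and now - self._items[0][0] > self.retention:
            self._items.popleft()

    def chronic(self, now: float, minimum: int = 2) -> List[str]:
        """Problems seen at least `minimum` times within the window."""
        self._evict(now)
        counts: Dict[str, int] = {}
        for _, problem in self._items:
            counts[problem] = counts.get(problem, 0) + 1
        return sorted(p for p, n in counts.items() if n >= minimum)


history = ProblemHistory()
history.record(0.0, "RM 99 PE 1: Out of service")
history.record(3600.0, "AM 98 PE 5 PI 5 PO 1: Modem problem")
history.record(7200.0, "RM 99 PE 1: Out of service")
print(history.chronic(8000.0))
```

A list such as the one returned by `chronic` is the kind of information one shift could hand off to the next.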

5.5 Operator training The customization environment [White, Bieszczad, 1992] provided with the Expert Advisor allows novice operators to train themselves. Using fault scenarios captured from the network, it is possible to replay problem situations in a controlled way, at the user's own pace. The customization environment provides a what-you-see-is-what-you-get interface, thus allowing the operator to become familiar with the operation of the various problem browsers. It is strongly believed that this environment will ultimately lead to more consistently trained network operators. The mathematical model built predicted considerable savings due to increased network availability.

6. Summary The Expert Advisor system described in this paper has shown the benefits of a hybrid approach to expert system development. The use of a fault model, composed of multiple inter-related problem descriptions, has shown that a very large system can be designed, implemented and, most importantly, maintained. The paper on Expert Advisor customization [White, Bieszczad, 1992] provides further information on this aspect of the system. The isolation of network knowledge from network topology has allowed a generic knowledge base to be delivered. The message passing mechanism employed in order to exchange information between problem instances has proven to be particularly effective for decoupling the rule sets that describe the dynamic behavior of each problem. Message passing has also proved to be an excellent way of testing the interfaces between problem descriptions. The system has shown itself to be easy to use, and has been quoted as being the most valuable single surveillance tool available in the Network Management toolset. Currently, work is being done to use the Problem Viewer, Problem Inference Engine and support tools to generate another expert system. The domain is again Network Management, but in the voice area on a different platform. The view is that greater than 80% of the software written will be reused and that an operational system will be produced in six months. The knowledge representation scheme has been shown to apply to this completely different environment with no modifications.

Bibliography

[Kahn et al, 1987] Kahn G., Kepner, Pepper J., TEST: A Model-Driven Application Shell, Proceedings AAAI 1987, pages 814-818.

[Laffey et al, 1988] Laffey T., Weitzenkamp S., Read J., Kao S., Schmidt J., Intelligent Real-Time Monitoring, Proceedings AAAI 1988, pages 72-76.

[Peacocke, Rabie, 1987] Peacocke D., Rabie S., Knowledge Based Maintenance in Networks, Proceedings Globecom 1987.

[Rabie et al, 1988] Rabie S., Rau-Chaplin A., Shibahara T., DAD: A Real-Time Expert System for Monitoring Data Packet Networks, IEEE Networks Magazine, September 1988.

[Baird, White, 1989] Baird C., White T., A Real-Time Network Monitor, Proceedings of the 9th International Workshop on Expert Systems and their Applications, 1989, pages 35-41.

[White, Bieszczad, 1992] White T., Bieszczad A., A Customization Environment for the Expert Advisor Network Management System, Canadian Conference on Artificial Intelligence, Vancouver, 1992.

[White, 1989] White T., Expert Advisor Benefits, internal Bell-Northern Research report.