ARCHITECTURE AND DATA MODEL FOR MONITORING OF DISTRIBUTED AUTOMATION SYSTEMS

Volodymyr Vasyutynskyy, Klaus Kabitzsch

Faculty of Computer Science, Dresden University of Technology, D-01062 Dresden, Germany
Fax: ++49 351 463 38460
E-mail: {vv3, kk10}@inf.tu-dresden.de
Abstract: A monitoring architecture and a corresponding data model to support fault diagnosis in distributed automation systems are presented. The monitoring system consists of a central monitor and a set of monitoring agents that cooperate via rule sets and diagnosis results. The data model combines event-based and continuous diagnostic methods and allows the monitoring overhead to be adjusted. The resulting compaction of monitoring information on the end nodes, together with iterative diagnosis, makes teleservice easier and more attractive. Copyright © 2004 IFAC

Keywords: remote diagnosis, monitoring, complex events, monitoring agents
1. INTRODUCTION
Modern automation systems, such as home automation or the MES of a factory, are typically distributed and consist of heterogeneous components, with an increasing role played by software. These properties, together with growing complexity, make such systems fault-prone during operation, even if they have been carefully tested. As a rule, the arising faults are rare, transient and therefore hardly predictable. To cope with such faults, monitoring of these systems is necessary. It is an important part of diagnosis that helps to indicate possible faults and to avoid more critical fault effects in the future.

Known event-based monitoring systems like ZM4/SIMPLE (Dauphin, et al., 1992), GEM (Mansouri-Samani, 1995), HiFi (Al-Shaer, et al., 1999) etc. are oriented towards specific system architectures and require a lot of a priori information about the investigated systems. This leads to high costs for creating the diagnosis knowledge base during installation of the monitoring system or for adjusting it after system reconfiguration. In addition, exact information about the behavior of some components is often not available during system design. This information must be obtained during system integration and operation in order to tune the diagnosis system. The monitor architecture must therefore additionally support iterative learning, automated or with a human observer, and mechanisms for easy reconfiguration.

Automation systems combine discrete and continuous processes. Thus the event-based monitors mentioned above may also profit from the highly developed diagnosis methods for continuous systems, see (Simani, et al., 2003), such as diagnosis based on linear and nonlinear system models, fuzzy and neural algorithms etc. Although these methods are not considered in detail in this paper, the possibility of coupling them with event-based methods will be shown. Iterative diagnosis requires efficient interaction with human experts and powerful learning algorithms.

The next important point is the optimization of the monitoring overhead at the sources of log data, in the communication medium and at the data users. Two variants of distributing the monitoring overhead can be distinguished from these points of view. In the first case, all log data is collected at the central monitor and analyzed there, in most cases offline. This may provide full flexibility of data manipulation, but leads to an overload of the communication medium, which is an expensive and critical resource. The intensive stream of monitoring data can seriously influence total system performance, causing new faults as a result of network overload (Kotte, et al., 2002). The contrary approach uses distributed intelligent agents (Köppen-Seliger, et al., 2002; Munz, 2001) that perform all diagnosis activities directly at the end nodes. This is possible due to the trend towards more intelligent end nodes in automation systems, which allows more powerful processing algorithms to be placed there. Distributed data processing reduces the network load, but may bring poorly predictable diagnosis results and end node monitoring overhead. The control possibilities of human experts and the use of their intuition are then also restricted.

This paper considers a middle way, intended to solve the problems mentioned above. It is asserted that preprocessing of monitoring data can be effective due to its high redundancy, caused by the monotony and periodicity of the underlying processes. At the same time, the human expert must retain control over the diagnosis activities. Therefore a monitoring architecture is proposed in this article that allows the monitoring overhead in distributed systems to be optimized and adapted to diagnosis needs. It uses monitoring agents that are adjustable through sets of rules sent from the central monitor. The architecture uses a special extendable data model combining the detection of complex events with continuous signal diagnosis methods. In this way the processes of monitoring and diagnosis are conjugated. The monitoring system is implemented completely in Java and has been tested on several real applications and simulations at the Dresden University of Technology.

The rest of the paper is organized as follows. Section 2 describes the proposed monitoring architecture. The description of the corresponding data model follows in section 3, demonstrated with examples from home automation. Finally, the described approach is evaluated and the perspectives for application and future work are stated.
2. ARCHITECTURE OF THE DISTRIBUTED MONITORING SYSTEM
The monitoring system consists of one central monitor and several monitoring agents, as shown in fig. 1. Monitor and agents communicate via a local network or the internet. Monitoring agents access log records of the investigated system over an interface to monitoring sensors, which are system-specific sources of monitoring records. Sensors can be instrumented in different ways: as event notifiers, as checkpoints integrated in application program code, as various buffers, databases etc. A monitoring agent can also access results from underlying monitoring agents, so that hierarchical monitoring systems can be built.
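To illustrate the sensor interface mentioned above, a minimal Java sketch is given below. The type and method names (LogRecord, MonitorSensor, poll) are illustrative assumptions and do not reproduce the actual interface of the implemented tool.

import java.util.List;
import java.util.Map;

/** One raw monitoring record: a time stamp plus arbitrary named attributes. */
class LogRecord {
    final long timestampMs;                  // occurrence time in milliseconds
    final Map<String, Object> attributes;    // e.g. "Source" -> 17, "Msg" -> "Request"

    LogRecord(long timestampMs, Map<String, Object> attributes) {
        this.timestampMs = timestampMs;
        this.attributes = attributes;
    }
}

/** System-specific source of monitoring records (event notifier, checkpoint, buffer, database, ...). */
interface MonitorSensor {
    /** Returns the records that arrived since the last call, in chronological order. */
    List<LogRecord> poll();
}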
[Fig. 1 component labels: Expert, Rule editor, Visualization, User interface; Monitor with global knowledge base (scenarios, rule base), interpretation of results (events, complex events, ...), data buffer, result buffer and communication; Agent with rule base, data buffer, event triggering, interpretation, result buffer and communication, connected to the monitoring sensors of the system or to an agent of the lower hierarchy; Local agent with data DB.]
Fig. 1. Architecture of the monitoring system.

The monitor sends requests to the agents in the form of scenarios. Scenarios are a formal description of the log data structure, the diagnostic rules and the diagnostic steps. They also include the communication modes, which describe the cooperation between agent and monitor, i.e. when and which resulting data should be sent from the agent to the monitor. The following modes are available:
• in definite time periods, for example every hour;
• on request from the monitor;
• on occurrence of a definite event; for example, some critical events must be sent to the monitor immediately;
• on overflow of the data or result buffer.
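The restriction to these four modes can be made concrete with a small sketch of an agent-side flush policy. The enum values, field names and thresholds below are assumptions for illustration and do not reproduce the actual implementation.

/** The four communication modes under which an agent forwards results to the monitor. */
enum CommunicationMode { PERIODIC, ON_REQUEST, ON_CRITICAL_EVENT, ON_BUFFER_OVERFLOW }

class FlushPolicy {
    private final CommunicationMode mode;
    private final long periodMillis;      // used by PERIODIC, e.g. one hour
    private final int bufferCapacity;     // used by ON_BUFFER_OVERFLOW
    private long lastFlush = System.currentTimeMillis();

    FlushPolicy(CommunicationMode mode, long periodMillis, int bufferCapacity) {
        this.mode = mode;
        this.periodMillis = periodMillis;
        this.bufferCapacity = bufferCapacity;
    }

    /** Decides whether the buffered results should be sent to the monitor now. */
    boolean shouldFlush(int bufferedResults, boolean criticalEventSeen, boolean monitorRequested) {
        switch (mode) {
            case PERIODIC:           return System.currentTimeMillis() - lastFlush >= periodMillis;
            case ON_REQUEST:         return monitorRequested;
            case ON_CRITICAL_EVENT:  return criticalEventSeen;
            case ON_BUFFER_OVERFLOW: return bufferedResults >= bufferCapacity;
            default:                 return false;
        }
    }

    /** Remembers the flush time so that the PERIODIC mode restarts its interval. */
    void markFlushed() { lastFlush = System.currentTimeMillis(); }
}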
In this way the communication modes are strictly limited and the communication overhead can therefore be controlled more easily. Scenarios also contain the status of the rules, which defines whether the results of rule firing should be sent to the monitor. In most cases it suffices to send only the most relevant higher-order events. The data model for the description of diagnostic rules and results is described in detail in the next section.

When a monitoring agent has received a scenario and saved it in its rule base, it starts processing the log information coming from the monitoring sensors. Log records are triggered and interpreted according to the scenario rules. Interpreted data come into the result buffer and are then sent to the monitor according to the communication mode and rule status. The obtained results are, in their turn, interpreted on the monitor side. More complex rules and learning methods can be used here, because the data stream there is much more compact and more resources are available. Diagnosis results are then passed to the user interface, which presents them to the human expert. Depending on the diagnosis results, the expert can modify the scenarios and send them to the agents again, and so on. In this way the iterative diagnosis process is supported. The monitor can also use data from local agents, for example in the case of offline diagnosis.
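The following sketch puts the described agent-side steps (install a scenario, trigger and interpret log records, buffer results, forward them according to the communication mode) into a compact skeleton. All types here are simplified placeholders invented for illustration; they do not reproduce the classes of the implemented tool.

import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Minimal placeholder types so the sketch is self-contained; all names are illustrative.
interface LogRecord {}
interface Result {}
interface Rule { Optional<Result> interpret(LogRecord record); }
interface MonitorSensor { List<LogRecord> poll(); }
interface MonitorConnection { void send(List<Result> results); }
interface FlushPolicy { boolean shouldFlush(int buffered); void markFlushed(); }
interface Scenario { List<Rule> rules(); FlushPolicy flushPolicy(); }

/** Illustrative skeleton of the agent-side processing loop: trigger, interpret, buffer, forward. */
class MonitoringAgent {
    private Scenario scenario;                              // rule set received from the central monitor
    private final List<Result> resultBuffer = new ArrayList<>();
    private final MonitorSensor sensor;
    private final MonitorConnection monitor;

    MonitoringAgent(MonitorSensor sensor, MonitorConnection monitor) {
        this.sensor = sensor;
        this.monitor = monitor;
    }

    /** Called when the central monitor sends a new or updated scenario. */
    void installScenario(Scenario s) { this.scenario = s; }

    /** One processing step: read new records, apply the scenario rules, forward results if required. */
    void step() {
        if (scenario == null) return;
        for (LogRecord record : sensor.poll()) {
            for (Rule rule : scenario.rules()) {
                rule.interpret(record).ifPresent(resultBuffer::add);  // events, complex events, outliers
            }
        }
        if (scenario.flushPolicy().shouldFlush(resultBuffer.size())) {
            monitor.send(new ArrayList<>(resultBuffer));              // only elements whose rule status allows it
            resultBuffer.clear();
            scenario.flushPolicy().markFlushed();
        }
    }
}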
3. DATA MODEL
The presented data model describes the structure of the log data, the diagnostic rules and their results in one uniform system. This helps to evaluate and control the monitoring overhead. The model extends notations of complex events such as GEM (Mansouri-Samani, 1995) with process diagnosis functions. It consists of elements combined in hierarchies and is constructed according to the principles of object-oriented programming. The simplified class diagram of the model is shown in fig. 2. The formal language used here has been adapted for better comprehension by the reader; in the implemented tool the rules are entered through a more intuitive visual interface. The elements of the data model are described below.
Fig. 2. Basic elements of the data model.

A trace presents a log source as a chronologically ordered stream of log records. Each record in a trace possesses several attributes, for example the identifier of the program module or a variable value. In this way the trace becomes independent of the source or the specific storage form, such as a text file or a database. The trace presents an (N + 1)-dimensional state space, where N is the number of attributes. The further dimension is time, which is as a rule discrete and not equidistant.

The rules for trace analysis are combined in scenarios. As mentioned in section 2, scenarios combine the elements used in the same cases. A scenario therefore represents one diagnosis hypothesis. Several, possibly competing, scenarios may be produced during diagnosis. The scenarios are stored in XML files, which allows easy data exchange with other development tools. For instance, scenarios can be automatically generated from UML diagrams (Matzke, et al., 2003) or from development information databases such as LNS databases (LON).

The original monitoring events are abstracted into higher-order elements like events and time periods. Primitive events are trace records that are relevant for a diagnosis hypothesis and thus have a proper semantic meaning. Primitive events are defined through the attributes of the trace, for instance:

Req_17 := (Trace.Source = 17 AND Trace.Msg = 'Request')

This rule describes the events from the node with ID = 17 that contain the message "Request". Primitive events are used for quick triggering and filtering of log data and are the basis for more complex events.

Complex events combine several events (among others also other complex events) that stand in certain relations to each other. They thereby embody the causal relationships in a system and unite the events into more complex structures, so that the amount of analyzed data is reduced. Let A and B be primitive or complex events; then the following relations can be used in the definition of complex events, compare Mansouri-Samani (1995):
• A ; B – event A must precede event B (sequence);
• A | B – one of the two events may occur (branching);
• A ~ B – events A and B may occur in any order (parallel execution);
• [n:m] A – event A must appear from n to m times (iteration);
• ! A – event A may not occur (exclusion).

For example, the following expression describes a complex event that represents a successful transaction, consisting of two activities: sending of the request Req_17 and receiving of the response Resp_17. The transaction timeout equals 30 seconds:

TA_17 := (Req_17 ; Resp_17).(Duration < 00:00:30)

The description of causality by complex events is equivalent to descriptions such as timed automata, Petri nets or causal trees, compare fig. 3. At the same time the model offers a more compact, powerful and easily extendable presentation. For instance, further event attributes can simply be introduced, which would require additional extensions in Petri nets.
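To make the semantics of the sequence operator and the Duration attribute more tangible, here is a minimal Java sketch of how an agent could detect TA_17 and the corresponding incomplete transaction over the incoming record stream. The class and method names are illustrative; for simplicity, timeouts are only noticed when the next relevant record arrives.

/** Minimal detector for the complex event TA_17 := (Req_17 ; Resp_17).(Duration < 00:00:30). */
class TransactionDetector {
    private static final long TIMEOUT_MS = 30_000;
    private Long openRequestMs = null;          // time stamp of a pending Req_17, if any

    /**
     * Feeds one primitive event of node 17 (msg is "Request" or "Response") and returns
     * a human-readable detection, or null if the record completes nothing yet.
     */
    String onEvent(String msg, long timestampMs) {
        // A pending request that has not been answered within the timeout becomes an outlier.
        if (openRequestMs != null && timestampMs - openRequestMs >= TIMEOUT_MS) {
            openRequestMs = null;
            String outlier = "Outlier: disrupted transaction (no Resp_17 within 30 s)";
            if ("Request".equals(msg)) {
                openRequestMs = timestampMs;    // the new request opens a fresh transaction
            }
            return outlier;
        }
        if ("Request".equals(msg)) {
            // Sequence operator: Req_17 must precede Resp_17. A second request within the
            // timeout simply replaces the pending one (simplification of this sketch).
            openRequestMs = timestampMs;
            return null;
        }
        if ("Response".equals(msg) && openRequestMs != null) {
            long durationMs = timestampMs - openRequestMs;
            openRequestMs = null;               // complex event TA_17 completed within the timeout
            return "TA_17 completed, duration " + durationMs + " ms";
        }
        return null;                            // record is not relevant for this rule
    }
}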
[Fig. 3 depicts automata for the example expressions A ; B ; C, A ; (B | C) ; D, A ; [0:5] B ; C, A ; (B ~ C ~ D) ; E and (A ; B) ! C.]
Fig. 3. Relations in complex events in the form of automata.

Violations of event rules, also called outliers, may indicate possible faults or their causes. They can therefore play an important role and are processed specially. For example, when in the rule TA_17 the final event Resp_17 does not occur, an outlier is thrown that represents an incomplete transaction. Depending on protocol details, such an outlier can be interpreted as a fault ("disrupted transaction") or as a symptom of network overload, since several such disrupted transactions can be tolerated by the automation system.

Time periods are generally defined as periods of the trace that possess definite properties. They may embody temporal properties of the system such as:
• System configuration. For example, the time period from 9:00 till 18:00 from Monday till Friday represents the time of human activity in the office, described by the rule: Per_WorkDay := (9:00-18:00, Mo.-Fr.).
• Overall system state, like the period with high network load: Per_High_Load := (TA_17.Duration > 10) AND (TA_17.Frequency > 20).
• Control loop state, e.g. the period of a transaction, i.e. between the beginning and the end of a transaction: Per_TA := (TA_17).Period.

Such time periods allow all processes of different nature to be represented on a unified time axis, so that the causal relationships may become obvious. The rules can be simpler than in pure event description models.

All elements possess a set of primary and secondary attributes. Primary attributes are contained in the source data and represent the explicit knowledge of the expert about the system behavior. These are, for instance, the log source or the time stamp of an event occurrence. Secondary attributes, or functions, are derived on the basis of the primary attributes. They can be changed during diagnosis, hence they depend on the diagnosis hypothesis. An example of a secondary attribute is a class that describes the network load in a given time period ("high load" or "low load"). This class is calculated on the basis of the transaction frequency in the network channel like this:

Class_Load := { "high" : TA.Frequency ≥ 30; "low" : TA.Frequency < 30 }

Secondary attributes produce an additional projection of the trace data, so that a deeper view of the relationships is created.
They are calculated dynamically and can be added online during diagnosis if necessary. Once encapsulated in attributes, different model-based and model identification methods may be used for diagnosis, such as statistical clustering, neural nets, fuzzy classifiers etc. To be placed in a remote agent, the routines implementing these methods must fulfill the requirements on monitoring overhead, which can be checked by a sample run of the routine on typical historical data. Heterogeneous values may be compared with each other via the classes obtained from the attributes.

Complex events, time periods and functions are internally organized as tree structures, which allows quick and online-capable search. But this also requires a corresponding tree structure of the rules. These restrictions are checked in the rule editor.

Further elements, namely groups and decisions, are introduced for the purposes of better structuring and comprehension for the expert. Groups are parts of the investigated system or sets of attributes that possess some common properties. They restrict and subdivide the validity space of the rules to make the search for relationships easier and more automated. As an example, a group of transactions is introduced that is produced by the different heating controllers in the rooms of one house:

Gr_TA_17 := Group(TA_17, GroupingParameter = Room)

This can be used for a comparison of the rooms of one house. Further examples of groups are all devices that communicate on one fieldbus channel, or all devices of an automated house that possess the same functionality. The diagnostician thus only has to define the group that is relevant for his diagnosis purposes; the further search inside the group proceeds automatically.

Decisions complete the data model with expert rules in the form of If…Then statements. For instance, the following rule:

IF TA.Duration.Max > 00:00:20 THEN Residuum := 'Problems in communication'

indicates communication problems and fires when the duration of some transactions exceeds a certain limit. Here is an example of a more complex decision:

IF AllNetworkEvents.Frequency > 30 AND PID.Frequency > 20 THEN Residuum := 'Oscillations in PID controller caused by network overload'

The rule indicates that if the oscillations in the PID control loop are accompanied by a high frequency of messages in the network, then these oscillations may be caused by the message delays.
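A brief sketch of how the secondary attribute Class_Load and the two decisions above could be evaluated in code is given below. The thresholds follow the examples, while the method names and the assumption that frequencies are counted per minute are ours.

/** Illustrative evaluation of a secondary attribute and two simple If...Then decisions. */
class LoadDiagnosis {

    /** Secondary attribute Class_Load, derived from the transaction frequency in a time period. */
    static String classLoad(double transactionsPerMinute) {
        return transactionsPerMinute >= 30 ? "high" : "low";
    }

    /** Decision: IF TA.Duration.Max > 00:00:20 THEN Residuum := 'Problems in communication'. */
    static String checkCommunication(long maxTransactionDurationMs) {
        return maxTransactionDurationMs > 20_000 ? "Problems in communication" : null;
    }

    /**
     * Decision: IF AllNetworkEvents.Frequency > 30 AND PID.Frequency > 20
     * THEN Residuum := 'Oscillations in PID controller caused by network overload'.
     */
    static String checkPidOscillation(double networkEventsPerMinute, double pidOscillationsPerMinute) {
        if (networkEventsPerMinute > 30 && pidOscillationsPerMinute > 20) {
            return "Oscillations in PID controller caused by network overload";
        }
        return null;   // no residuum: the observed values do not match this hypothesis
    }
}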
Ideally, a complex decision can be constructed that indicates normal system behavior ("Everything is OK"). In that case the monitoring overhead on the communication medium is minimal, though not the overhead on the agent side.

The answers returned by a monitoring agent repeat the structure of the corresponding rules, with the difference that, as a rule, only higher-order elements are transferred. The results can then be presented to the human expert online or offline, for example as a Gantt diagram in the diagnosis tool "eXtrakt" (Kotte, et al., 2002), as shown in fig. 4.

[Fig. 4 shows rows of primitive events, complex events and an outlier on a common time axis.]
Fig. 4. Presentation of primitive and complex events as a Gantt diagram.

4. USAGE OF THE DATA MODEL
With the help of the data model, queries on the monitoring data can be produced. The diagnosis proceeds iteratively, starting with simple events describing simple transactions. These events can be generated automatically on the basis of protocol details. Then queries on different subsystems can be produced, depending on the purposes of the diagnosis. For example, the behaviour of a control loop, the behaviour of the household appliances in a room or the behaviour of the whole heating system of a house can be investigated. The results of one diagnosis step can be reused in further diagnosis steps. The queries can be entered in a comfortable way directly in the GUI of the development tools.

As stated in the introduction, the application of the proposed monitoring system promises a more compact and adjustable transfer of monitoring data. The compression rate is larger when only elements of higher order are transferred. This rate depends in general on the purposes of the diagnostician, on properties of the underlying processes such as periodicity, on the ratio of available explicit and implicit knowledge about the system, and on the fault frequency. Clearly, the compression rate may grow during the diagnosis, as more explicit information about the system becomes known. Compression rates of 5 up to 20 were achieved in the tested real applications with admissible monitoring overhead on the agent side.
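For illustration only (the numbers are hypothetical and not measurements from the tested applications), the compression rate can be read as the ratio between the raw log records processed by an agent and the higher-order elements it actually transfers:

r = N_raw_records / N_transferred_elements, e.g. r = 10 000 / 500 = 20.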
The choice of a proper ratio between the monitoring overhead in the communication medium and on the agent side is not a trivial task. The obtained experience shows that this choice should be made iteratively, starting with a few rules describing the most frequently appearing cases. Usually only a few iterations are necessary to achieve the desired monitoring overhead. The diagnosis and monitor adjustment can proceed offline as well as online ("on the fly") using the same data model.

The described data model is general enough to cope with heterogeneous nodes in automation systems. It unifies the diagnosis and is easily extendable to further diagnosis methods. Thanks to the common trace interface, the rules do not depend on the source data format and can be reused in different application domains, provided the definitions of the primitive events have been adapted to the format of the monitoring records. The model can use a priori information as well as newly obtained monitoring results and human experience. An example of using design information is shown in (Matzke, et al., 2003), where UML diagrams are used to produce diagnostic rules.

The monitor and the agents are implemented in Java, so the monitoring system may be used in different applications using this language. The OSGi initiative is interesting in this connection, because it proposes universal interfaces for using Java with different automation systems. On the basis of the data model, specific monitoring agents can be produced in the native programming language of the system and placed directly in the automation nodes. This would bring a further performance gain while retaining the unified data structure and communication interfaces.
5. CONCLUSION AND FUTURE WORK
The architecture for monitoring of distributed heterogeneous automation systems and the corresponding data model have been presented. They combine generality of description with adjustable monitoring overhead. The data model allows different diagnosis methods to be used in one diagnosis system. The compacting of monitoring information on the end nodes, along with iterative diagnosis, would make teleservice easier and more attractive. Future work will concern learning algorithms on the monitor and agent sides and the efficient distribution of the learning activities between these two parts. Another task is the representation of diagnostic results to the human diagnostician.

REFERENCES

Al-Shaer, E., H. Abdel-Wahab and K. Maly (1999). HiFi: A New Monitoring Architecture for Distributed Systems Management. In: 19th IEEE International Conference on Distributed Computing Systems, May 31 - June 04, 1999, Austin, Texas.

Dauphin, P., R. Hofmann, R. Klar et al. (1992). ZM4/SIMPLE: a General Approach to Performance-Measurement and -Evaluation of Distributed Systems. In: Readings in Distributed Computing Systems (T. Casavant and M. Singhal, eds.), IEEE Computer Society Press, Los Alamitos, California, 1992, Chapter 6, pp. 286-309.

Köppen-Seliger, B., S. X. Ding and P. M. Frank (2002). MAGIC - IFATIS: EC-Research Projects. In: Proceedings of the 15th IFAC World Congress, Barcelona, 2002.

Kotte, G., K. Kabitzsch and V. Vasyutynskyy (2002). Diagnosis in MES of Semiconductor Manufacturing. In: Advanced Computer Systems, 9th International Conference, ACS'2002, Miedzyzdroje, Poland, October 23-25, 2002, Proceedings Part 1, pp. 223-238.

LON: http://www.echelon.com

Mansouri-Samani, M. (1995). Monitoring of distributed systems. PhD Thesis, University of London.

Matzke, F., V. Vasyutynskyy and K. Kabitzsch (2003). UML Specification Based Fault Diagnosis on Embedded Systems. In: Proceedings of the Fourth International Conference on Industrial Automation, Montreal, Canada, 9-11 June 2003.

Munz, H. (2001). The State of PC Based Control. In: Proceedings of the 1st IFAC Conference on Telematics Applications in Automation and Robotics, Weingarten, Germany, pp. 179.

OSGI: http://www.osgi.org/

Simani, S., C. Fantuzzi and R. J. Patton (2003). Model-Based Fault Diagnosis in Dynamic Systems Using Identification Techniques. Springer Verlag.