Towards an information-based approach for the dependability evaluation of distributed control systems

Fabrice JUMEL*, Jean-Marc THIRIET**, Jean-François AUBRY***, Olaf MALASSE****

*Institut National Polytechnique de Lorraine (Technical National Institute of Lorraine), LORIA, 54500 Vandoeuvre, France, [email protected]
**Université Henri Poincaré Nancy 1, Centre de Recherche en Automatique de Nancy / Nancy Research Centre for Automatic Control, 2, rue Jean Lamour, 54 519 VANDOEUVRE les Nancy cedex, France, [email protected]
***Institut National Polytechnique de Lorraine (Technical National Institute of Lorraine), Centre de Recherche en Automatique de Nancy / Nancy Research Centre for Automatic Control, 2 avenue de la Forêt-de-Haye, 54516 Vandoeuvre, France, [email protected]
****ENSAM (Ecole Nationale Supérieure des Arts et Métiers / Higher Education Institute of Engineering), 4 rue Augustin Fresnel, 57078 METZ cedex 3, France, [email protected]

Abstract - With both the improvement of technology and the formalization of end-user needs, distributed architectures have become common for both control and measurement applications. The credibility of the information obtained from these distributed architectures is not easy to establish. Several aspects need to be taken into account: the processing applied to the information at various places in the distributed system, and the temporal aspect, or dating of the information, which is linked to its decay. The characteristics of the communication function (generally provided by a fieldbus network) also have to be taken into account in order to quantify their influence on the lifecycle of the information.

Index Terms - dependability - RAMS (reliability, availability, maintainability, safety) - distributed systems - credibility of the information - distributed architecture for measurement and control
1. INTRODUCTION AND CONTEXT
In most applications nowadays, control systems are built around a communication network. Most devices used in these systems, such as Programmable Logic Controllers, intelligent actuators and sensors, are equipped with microcontrollers or DSPs, which provide them with calculation and communication capabilities. For an intelligent sensor, for example, communication means not only providing the system with its measurement, but also providing information about its own state and its local environment; the intelligent sensor can also receive information coming from other parts of the system, for example in order to validate its own information.
Evaluating the dependability characteristics (by dependability we mean more particularly reliability and availability [1]) remains a huge challenge, since the parameters of the whole system cannot easily be obtained from a serial or parallel combination of the dependability characteristics of its subparts: some dynamic or temporal effects have to be taken into account. Another aspect, which is not easy to take into account, is that all these components (sensor, actuator, network) are equipped with digital calculation units. The influence of these calculation capabilities on the credibility of the information remains to be analyzed. Communication networks are the backbone of modern distributed systems, and it is not trivial to measure the influence of the network on the dependability characteristics of the whole system [2] [3] [4].
This paper is an information-oriented reflection which aims at defining a methodology for evaluating the criticality of the system versus the credibility of the information. Applying this reflection to an example yields some interesting results.
2. PROPOSED METHOD
In order to evaluate the dependability characteristics of a distributed control system, it is proposed to evaluate them from an information point of view. The method is based upon the evaluation of the credibility of the information. The aim of this section is to examine, on an example, some of the situations that have to be taken into account.
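[Figure 1 (diagram): sensor and database (DB) inputs, two adders, a variable delay τ and a delay of 1 sample period; signals va, vb, vb-1, vc, vd, vr and V1 to V4; annotations: availability of the components, target credibility, obtained credibility, system criticality; N modes, among which modes n1 and n2 are critical (V3 -> n1; V3, V2 -> n2).]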
Figure 1: General scheme of the approach
As an example, let us imagine (figure 1) that the information vr is the critical one for our application. This information is obtained by a combination of the following operations:
- an information vd is elaborated by a sensor, and a certain credibility is attributed to it,
- a second information va is read from a database, in which it is stored for a certain duration with a certain credibility; we may imagine that the credibility of this information is constant during a first period of time and then decreases during a second period according to a certain mathematical law (in order to model its senescence),
- an addition is performed, vc = va + vd; the new information vc gets its credibility as a function of the current credibilities of va and vd, to which are added the influence of the dependability characteristics of the calculation unit used to perform the addition, and also the time needed to perform it,
- the information vc is then transmitted over a communication network, symbolized by a variable delay τ; the credibility of the resulting variable vb is a function of the value of the delay,
- the value of vb is then added to the preceding value of vb (vb-1): the credibility of vr is a function of the credibility of vb and of the credibility of vb-1, under the influence of the dependability characteristics of the "1 sample period delay" operator and of the adding processor.
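The chain of operations described for figure 1 can be illustrated with a minimal Python sketch. The paper does not specify any credibility law, so everything numerical here is an assumption made for illustration only: credibilities are taken as numbers between 0 and 1, the stored value follows a "constant, then exponential decay" law, an operation multiplies the credibilities of its inputs by a reliability factor of the calculation unit, and the network delay applies an exponential penalty. All function names and parameter values are hypothetical.

```python
import math

def stored_credibility(c0, age, hold, tau):
    """Credibility of a stored value: constant during `hold`, then
    exponential decay with time constant `tau` (assumed senescence law)."""
    if age <= hold:
        return c0
    return c0 * math.exp(-(age - hold) / tau)

def combine(credibilities, unit_reliability=1.0):
    """Credibility of the result of an operation: product of the input
    credibilities and of the calculation unit reliability (assumed rule)."""
    result = unit_reliability
    for c in credibilities:
        result *= c
    return result

def delayed(c, delay, tau):
    """Credibility after a transmission delay (assumed exponential penalty)."""
    return c * math.exp(-delay / tau)

# Chain of figure 1, with purely illustrative values.
c_vd = 0.99                                             # sensor information vd
c_va = stored_credibility(0.98, age=12.0, hold=10.0, tau=20.0)  # database information va
c_vc = combine([c_va, c_vd], unit_reliability=0.999)    # first addition
c_vb = delayed(c_vc, delay=0.05, tau=1.0)               # network, variable delay
c_vb_prev = c_vb                                        # assumption: previous sample had the same credibility
c_vr = combine([c_vb, c_vb_prev], unit_reliability=0.999)  # second addition
```

With such rules the credibility can only decrease along the chain, which is one possible way of expressing that each processing or transmission step degrades the information; other combination rules could be substituted without changing the structure of the sketch.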
3. TOP-DOWN AND BOTTOM-UP APPROACHES
From figure 1, two approaches may be envisaged.
3.1. Top-down
This synthesis approach consists in defining, for a system under control, the critical, dangerous or undesired functioning modes. From this analysis, the variables which can influence the functioning modes are identified, and those which can lead to a critical, dangerous or undesired state are characterized. From the probability of reaching a risky state and its level of gravity, we have to find methods for qualifying the desired or needed credibility of the influencing variables. The HAZOP methodology, for example, could be useful to trace an information back to its origin(s) [5].
3.2. Bottom-up
This approach is based on the analysis of the information, from the generation of the basic data, to which credibility values are assigned, up to the final actuator controls, which also carry a certain credibility. To the traditional serial or parallel combinations of the information (or data) have to be added the particular problems linked to:
• the types of mathematical operations performed by the calculation units (for instance, a multiplication is more complex than an addition),
• quantization and numerical errors (overflow…),
• time; the following characteristics have to be taken into account:
♦ the evolution of the credibility of the information, seen from the application, as it is elaborated,
♦ the way time is taken into account in the elaboration of a complex parameter resulting from the combination of several basic parameters, each basic parameter having its own independent time base,
♦ the way feedback loops are taken into account in the elaboration of the information.
3.3. Conclusions
These two approaches are complementary and should make it possible to develop a methodology for identifying the weak links, as far as reliability is concerned, of a complex distributed architecture.
The final paper will also contain some considerations about the credibility of the variables, and an object-oriented model of the information.
4. EXAMPLE, STUDY OF ONE ISOLATED FAILURE
Let us consider the error state due to the information chain: the error state is linked to a bad control information applied to the system. We consider here that the control law is correct and that the only fault is an alteration of the information received and used by the controller to elaborate the control. This information failure can result from an acquisition, calculation or transmission error. In this example, we discuss a system made up of a tank and its control law. For this system, the critical variable is the high level of the water, since it is imperative not to overflow; the maximal height is called Hmax. The control law controls the level, but there are some uncontrollable events, called here "Disturbance", such as an entering and a leaving flow.
4.1. Presentation of the information processing chain
The elaboration of the control values can be seen as an information processing chain made up of three functions:
- one function for the acquisition of the system state by means of a sensor,
- one function for the elaboration of the control values as a function of the control law and of the state of the system under control,
- one function for the action, which modifies the system dynamics through the actuator control.
The information is a representation of the data flows between these various functions. Figure 3 shows this information chain for our level control.
[Figures 2 and 3 (diagrams): labels: Level, Tank, Level sensor, Sensor, Controller, Actuator (Valve + Pump), Disturbance, desired input value, tracking error e, controller output s (smin, smax, thresholds -ethreshold and ethreshold), informatic system, dataflow.]
Figure 2: Scheme of the system
The quantity of liquid is obtained from a level sensor, whose measured level is proportional to the liquid quantity. The filling and emptying of the tank are performed by a valve and a pump, so there are three possible states:
- no action: the level of the tank varies as a function of the disturbance,
- emptying: a quantity Qv of liquid is taken out of the tank (without disturbance) during a certain duration (linked to the sampling frequency),
- filling: a quantity Qr of liquid is put into the tank (without disturbance) during a certain duration (linked to the sampling frequency).
The considered control law is very simple; it is a "threshold-based control". At each sampling period:
- the level of the tank is measured,
- if it is lower than the desired level, a quantity Qr is added during the sampling period δT,
- if it is higher than the desired level, a quantity Qv is removed from the tank.
Figure 2 presents the physical system and the control loop. S (between Smin and Smax) represents the leaving flow.
Figure 3: information chain
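The threshold-based control law of section 4.1 can be sketched as follows. This is a minimal illustration without disturbance, assuming the level is expressed in the same unit as the quantities Qr and Qv; the function and parameter names (`threshold_control`, `simulate`, `q_fill`, `q_empty`) are hypothetical and do not come from the paper.

```python
def threshold_control(level, desired, q_fill, q_empty):
    """Quantity added to the tank during one sampling period.

    q_fill plays the role of Qr and q_empty the role of Qv.
    """
    if level < desired:
        return q_fill        # filling: Qr is added
    if level > desired:
        return -q_empty      # emptying: Qv is removed
    return 0                 # no action at the desired level

def simulate(level, desired, q_fill, q_empty, steps):
    """Apply the control law for `steps` sampling periods, no disturbance."""
    history = [level]
    for _ in range(steps):
        level += threshold_control(level, desired, q_fill, q_empty)
        history.append(level)
    return history

# Example with Qr = 3 and Qv = 1: the level starts above the desired level
# and is emptied one unit per sampling period.
trajectory = simulate(level=7, desired=5, q_fill=3, q_empty=1, steps=3)
```

Starting below the desired level shows the asymmetry used in the rest of the example: a single filling period overshoots the desired level, and several emptying periods are then needed to come back.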
4.2. Links between the information failure and the system failure
4.2.1 Model of the failure on the information
For this study, let us consider that there is no disturbance, except the water overflow, which is taken into account. The variable considered critical for the application is the maximum height level, so the study focuses only on failures which may increase the water level. The system is either at equilibrium, and there is nothing to do, or at a higher level than the desired one, and the control system needs to empty the tank. Two kinds of errors have to be taken into account:
- the control is on "filling" while it is required to empty or to do nothing,
- the control is on "doing nothing" while it is required to empty.
We always consider the worst case, which is the filling. Only one error is therefore considered: a filling control issued when it would be necessary to empty or to do nothing.
4.2.2 Response of the system to a temporary information failure
By studying the response of the system to a failure, we can see that the error is corrected, provided it did not itself cause a failure. Because the emptying capacity is weaker than the filling capacity, the system needs N sampling periods to return to the equilibrium position (N = Qr/Qv). During this time, the physical system is in an error state which can possibly lead to a failure. Figure 4 shows the response of the system to a control failure (filling).
[Figure 4 (plot): level H (states Si, from 0 to 11) versus time t; marks: Fault, Qr, 0, δt, N.δt.]
Figure 4: Occurrence of a fault
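The recovery time N = Qr/Qv of section 4.2.2 can be written as a short worked equation. The notation is introduced here for illustration only: H_0 is the equilibrium level and H_k the level k sampling periods after the end of the faulty filling, with no further fault and no disturbance.

```latex
H_k = H_0 + Q_r - k\,Q_v , \qquad k = 0, 1, \dots, N ,
\qquad\text{so that}\qquad
H_N = H_0 \iff N = \frac{Q_r}{Q_v} .
```

For instance, with N = 3 (the value used in section 5), the level after the fault is H_0 + 3Qv, then H_0 + 2Qv, H_0 + Qv and finally H_0: the system stays in an error state during three sampling periods.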
5. EXAMPLE, SEQUENCE OF FAILURE
The aim here is to characterize all the failure sequences, linked to the activation period of the control law, which lead to a failure of the physical system. For that, we define a function which links the reached level to the sequence of errors. Let us consider as an example the simple case where the failures of the variable used by the control law follow a binomial law (the failure probability is independent for each sample). Let us assume N=3 for the tank of water and Smax=11. Figure 5 shows an example of the evolution of such a system with some failures of the control law.
[Figure 5 (plot): system state (axis ticks 0 to 4) versus time t, with a row of error indicators εi: 1 0 1 0 1 1 0 0 1 1 0 0.]
Figure 5: Example of an avalanche of faults
By taking into account simultaneously the occurrence law of the control failures and the error evolution model, it is possible to describe the evolution of the system state as a Markov chain. State 11, which is absorbing, corresponds to the failure: once it is reached, it is not possible to leave it.
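The function linking the reached level to the sequence of errors can be sketched in Python. This is a hedged illustration, not the authors' implementation: it assumes that the level is counted in units of Qv, that the equilibrium is level 0, that a faulty sample raises the level by N units (capped at Smax) and that a correct sample lowers it by one unit. The names `level_trajectory` and `sample_errors` are hypothetical.

```python
import random

def level_trajectory(errors, n=3, smax=11, start=0):
    """Levels reached for a sequence of error flags (1 = faulty sample).

    Level smax is absorbing: once reached, the level no longer changes.
    """
    level = start
    levels = [level]
    for error in errors:
        if level < smax:
            if error:
                level = min(level + n, smax)   # faulty filling: +N, capped at Smax
            else:
                level = max(level - 1, 0)      # correct control: -1, not below equilibrium
        levels.append(level)
    return levels

def sample_errors(p_fail, length, rng):
    """Independent error flags, each equal to 1 with probability p_fail."""
    return [1 if rng.random() < p_fail else 0 for _ in range(length)]

# One isolated fault is corrected in N = 3 periods ...
isolated = level_trajectory([1, 0, 0, 0])
# ... whereas an avalanche of four consecutive faults reaches Smax = 11.
avalanche = level_trajectory([1, 1, 1, 1, 0])
# Random sequence with an illustrative failure probability of 0.01 per sample.
random_run = level_trajectory(sample_errors(0.01, 1000, random.Random(0)))
```

Such a simulation only illustrates individual trajectories; the Markov chain formulation is what gives the probability of reaching the failure state.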
[Figure 6 (diagram): chain of states numbered up to 11; legend: transition probability p, transition probability 1-p.]
Figure 6: Markov chain for the system
This Markov chain can be represented as a matrix:
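$$
P = (P_{i,j})_{0 \le i,j \le 11} =
\left(\begin{array}{cccccccccccc}
p & 0 & 0 & 1-p & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
p & 0 & 0 & 0 & 1-p & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & p & 0 & 0 & 0 & 1-p & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & p & 0 & 0 & 0 & 1-p & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & p & 0 & 0 & 0 & 1-p & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & p & 0 & 0 & 0 & 1-p & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & p & 0 & 0 & 0 & 1-p & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & p & 0 & 0 & 0 & 1-p & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & p & 0 & 0 & 0 & 1-p \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & p & 0 & 0 & 1-p \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & p & 0 & 1-p \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{array}\right)
$$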
Figure 7: Matrix equivalent to the Markov chain
From this matrix it is possible to calculate some results. The calculation of P(n) makes it possible to determine the probability of reaching the failure state after n steps, starting from the equilibrium state 1. The canonical form $P = \begin{pmatrix} 1 & 0 \\ R & Q \end{pmatrix}$ also makes it possible to determine the mean and the standard deviation of the time necessary to reach the failure state [6]. In the cases studied here, because of the shape of the matrix and also because p is close to 1, the standard deviation of the time needed to reach the failure state is close to the obtained mean time. These results are indicative: because of the values of the standard deviations, large deviations from the mean time could occur.
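The mean and standard deviation of the time to failure can be computed from the transient block Q of the canonical form. The following Python sketch is given under stated assumptions and is not claimed to reproduce the figures reported in the paper: states are levels 0 to Smax, a correct sample (probability p) lowers the level by one, a faulty sample (probability 1-p) raises it by N, and the start state is chosen by the reader. The mean time t solves (I - Q) t = 1 and the second moment m solves (I - Q) m = 1 + 2 Q t, which are standard results for absorbing Markov chains. All function names are hypothetical; exact rational arithmetic is used to avoid rounding issues when p is close to 1.

```python
from fractions import Fraction
import math

def transition_matrix(p, n=3, smax=11):
    """Transition matrix over levels 0..smax; level smax is absorbing."""
    size = smax + 1
    P = [[Fraction(0)] * size for _ in range(size)]
    for i in range(smax):
        P[i][max(i - 1, 0)] += p           # correct sample: one level down (stay at 0)
        P[i][min(i + n, smax)] += 1 - p    # faulty sample: n levels up, capped at smax
    P[smax][smax] = Fraction(1)
    return P

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination."""
    m = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(m):
        pivot = next(r for r in range(col, m) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(m):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [row[-1] for row in M]

def absorption_time_moments(P):
    """Mean and variance of the number of steps before absorption,
    for each transient state (the last state is assumed absorbing)."""
    m = len(P) - 1
    Q = [row[:m] for row in P[:m]]
    i_minus_q = [[(1 if i == j else 0) - Q[i][j] for j in range(m)] for i in range(m)]
    mean = solve(i_minus_q, [Fraction(1)] * m)
    rhs = [1 + 2 * sum(Q[i][j] * mean[j] for j in range(m)) for i in range(m)]
    second = solve(i_minus_q, rhs)
    variance = [s - t * t for s, t in zip(second, mean)]
    return mean, variance

# Usage with p = 0.99, N = 3, Smax = 11, starting from level 0 (assumption).
mean, variance = absorption_time_moments(transition_matrix(Fraction(99, 100)))
mean_steps = float(mean[0])
std_steps = math.sqrt(float(variance[0]))
```

Multiplying `mean_steps` by the sampling period gives a mean time to failure that can be compared with the values reported in the results table; changing `n`, `smax` or `p` shows how sensitive this time is to each parameter.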
Probability   Step                             Mean time       Standard deviation (= mean time)
0.99          1 second                         267 days        267 days
0.999         1 second                         7862 years      7862 years
0.99          0.5 second (1) (=> Smax = 22)    12.10^6 years   12.10^6 years
0.99          1 second (N = 2)                 4940 years      4940 years
Table: results
(1) In this case, it is equivalent to doubling the sampling frequency.
From these results, we can show that several parameters may influence the occurrence probability. Concerning what we are more particularly interested in, the credibility of the information, it is interesting to notice that the choice of a different sampling frequency, and so a better discretization of the system, leads to a lower occurrence probability (12.10^6 years compared to 267 days) [7].
6. CONCLUSION
It is not trivial to evaluate the dependability characteristics of a distributed control or measurement system. Several approaches or methods (Petri nets [9], Bayesian networks [10]) may be used. In this paper, we present an information-oriented methodology which aims at elaborating the credibility of the information over its whole lifecycle. The final paper will present how it is possible to get a real-time characterization of this credibility, at least for the most critical information.
7. ACKNOWLEDGEMENTS We would like to thank: • The Lorraine region for the grants • The various active members of the working group on "Automatic control and Information engineering: from the functional architecture to the operational architecture: how to guarantee the dependability (RAMS)?"
REFERENCES
[1] A. Villemeur (1988). Methods and Techniques, Volume 1, Reliability, Availability.
[2] G. Juanole, I. Blum (France) - Evaluating the QoS of Real Time Networks and Linking this QoS to the Performance of an Industrial Application.
[3] N. Navet, Y-Q. Song, F. Simonot - Worst-Case Deadline Failure Probability in Real-Time Applications Distributed over CAN (Controller Area Network) - Journal of Systems Architecture, Elsevier Science, vol. 46, n°7, 2000.
[4] L. Cristaldi, A. Ferrero, C. Muscas, S. Salicone, R. Tinarelli - The effect of net latency on the uncertainty in distributed measurement system - 18th IEEE/IMTC Instrumentation and Measurement Technology Conference, Anchorage, Alaska, USA, 21-23 May 2002.
[5] J.A. McDermid, M. Nicholson, D.J. Pumfrey, P. Fenelon - Experience with the application of HAZOP to computer-based systems - 10th Annual IEEE Conference on COMPuter ASSurance (COMPASS '95).
[6] U.N. Bhat - Elements of Applied Stochastic Processes - John Wiley & Sons, 1984.
[7] S. Nuccio, C. Spataro - Can the effective number of bits be useful to assess the measurement uncertainty - 18th IEEE/IMTC Instrumentation and Measurement Technology Conference, Anchorage, Alaska, USA, 21-23 May 2002.
[8] M. Choi, N. Park, F. J. Meyer, F. Lombardi, V. Piuri - Reliability Measurement of Fault-tolerant onboard memory system under fault clustering - 18th IEEE/IMTC Instrumentation and Measurement Technology Conference, Anchorage, Alaska, USA, 21-23 May 2002.
[9] K. Jensen - Coloured Petri Nets. Basic Concepts, Analysis Methods and Practical Use - Monographs in Theoretical Computer Science, Springer-Verlag, 2nd corrected printing, 1997.
[10] F. V. Jensen, K. G. Olesen, S. K. Andersen - An Algebra of Bayesian Belief Universes for Knowledge-Based Systems - Networks, vol. 20, p. 637-659, 1990.