Management of Uncertainty in Sensor Validation, Sensor Fusion, and Diagnosis of Mechanical Systems Using Soft Computing Techniques

by

Kai Frank Goebel

Dipl. Ing. (Technische Universität München, Germany, 1990)
M.S. (University of California at Berkeley, 1993)

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Engineering - Mechanical Engineering in the GRADUATE DIVISION of the UNIVERSITY of CALIFORNIA at BERKELEY

Committee in charge:
Professor Alice M. Agogino, Chair
Professor David Dornfeld
Professor Stuart Dreyfus

Fall 1996

The dissertation of Kai Frank Goebel is approved:

Chair

Date

Date

Date

University of California, Berkeley
1996

Management of Uncertainty in Sensor Validation, Sensor Fusion, and Diagnosis of Mechanical Systems Using Soft Computing Techniques

Copyright © 1996 by Kai Frank Goebel

Abstract

Management of Uncertainty in Sensor Validation, Sensor Fusion, and Diagnosis of Mechanical Systems Using Soft Computing Techniques

by Kai Frank Goebel
Doctor of Philosophy in Mechanical Engineering
University of California at Berkeley
Professor Alice M. Agogino, Chair

The goal of this dissertation is to provide means to deal with uncertainty in complex sensor driven systems, in particular for sensor validation, sensor fusion, and diagnosis. These means come from probability theory, neural network theory, and fuzzy logic and are more generally termed "soft computing" techniques. Our investigation considered systems which had to be controlled based on monitored sensor readings to allow corrective action in case of aberrations from a desired value. Since all sensor readings are subject to a variety of noise conditions, such as Gaussian noise, bias, clutter, outliers, and non-symmetric noise distributions, the aim is to perform sensor validation and, in the case of multiple sensor readings, sensor fusion.

This dissertation provides means for fuzzy sensor validation and fusion. This approach was compared with a probabilistic method, a Kalman filter based scheme assuming a Gaussian noise distribution. While the fusion is performed for the probabilistic method in a Bayesian way, the fuzzy sensor validation and fusion approach uses non-symmetric validation regions in which sensor readings are assigned confidence values. Each sensor has its own dynamic validation curve which is shaped according to sensor characteristics. These characteristics can take into account the range, external factors affecting the sensor, the reliability of the sensor, etc. The curves have their maximum value at the predicted value, which is arrived at using a fuzzy exponentially weighted moving average time series predictor. Confidence values are 0 at the boundaries of the validation gate, which is determined by the maximum possible change a system can undergo in one time sample. Since readings outside the gate are implausible, they are discarded. Readings closer to the predicted value are rewarded with a higher confidence value. Fusion is performed using a weighted average of sensor readings and confidence values and the predicted value scaled by the operating condition. Each method performs best in the presence of certain types of noise, and recommendations are made as to which approach is more appropriate under various conditions.

Another aspect of this dissertation is to provide a tool for diagnosis in the presence of vague symptoms. This is achieved through fuzzy abduction, which can diagnose crisp as well as soft faults. This means that faults can be diagnosed if they occur to some degree. The proposed algorithm computes a closeness measure taking into account the distance from an observed symptom set to the modeled symptom set for all failure combinations. It then ranks the failure sets according to maximum closeness measure and minimum cover, i.e., number of faults. As an extension, a framework for fuzzy influence diagrams is provided which uses this closeness measure.

Several applications from extant systems show the feasibility of the approaches developed here. The first example aims at providing safety enhancement in intelligent vehicle highway systems. Here it is crucial to always provide a correct sensor value to ensure a safe ride for the passengers. It is shown how the system recovers from bad sensor readings and how functional redundancy included in the fusion helps to improve the results. Furthermore, it is shown how fault diagnosis is performed on-line. The second example addresses robust and optimal control in power plants. It is imperative to provide correct sensor values to be able to distinguish between system and sensor failures and to prevent unwarranted shutdowns. Through the inclusion into the fusion algorithm of system variables having a functional relationship to the reading of interest, the system is able to exploit this functional redundancy to perform properly in the presence of multiple simultaneous sensor failures. Lastly, an example from manufacturing shows how the neural network tools provided allow for improved decision making for optimal tool exchange in a high-speed milling machine environment by monitoring the degree of tool wear on-line.

Alice M. Agogino, Thesis Committee Chair

To my parents

Management of Uncertainty in Sensor Validation, Sensor Fusion, and Diagnosis of Mechanical Systems Using Soft Computing Techniques

Outline

1 Introduction
2 Monitoring: Sensor Validation and Fusion under Uncertainty
2.1 Uncertainty – Preliminaries
2.1.1 Probability Theory
2.1.2 Fuzzy Reasoning
2.1.3 Neural Networks
2.1.3.1 Backpropagation Neural Network
2.1.3.2 Self-Organizing Probabilistic Neural Network (SOPNN)
2.1.3.3 Neural-Fuzzy Networks
2.1.4 Genetic Algorithms
2.2 Sensors
2.3 Validation and Fusion Techniques
2.3.1 Probabilistic Sensor Validation and Fusion with PDAF
2.3.2 Fuzzy Sensor Validation and Fusion
2.3.2.1 A Review on Techniques for Fuzzy Sensor Validation and Fusion
2.3.2.2 A New Approach to Fuzzy Sensor Validation and Fusion (FUSVAF)
2.3.2.2.1 The FUSVAF Algorithm
2.3.2.2.2 Simulations
2.3.2.2.3 Sensitivity Analysis
2.3.2.2.4 Recommendation and Summary
3 Diagnosis
3.1 Probabilistic Influence Diagram
3.2 Fuzzy Diagnosis
3.2.1 A Review on Fuzzy Diagnosis
3.2.2 A New Approach for Fuzzy Diagnosis
3.2.2.1 Crisp Failures
3.2.2.2 Gradual Failures
3.2.2.3 Example
3.2.2.4 Summary and Conclusions
3.2.3 Fuzzy Influence Diagrams
4 Applications for Sensor Validation, Fusion, and Diagnosis
4.1 Using a Neural-Fuzzy Scheme to Diagnose Manufacturing Processes
4.2 Power Plant Gas Turbine
4.3 Automated Vehicles
4.3.1 Sensor Validation and Fusion Using PDAF
4.3.2 Fuzzy Sensor Validation and Fusion for IVHS
4.3.3 Fuzzy Diagnosis in IVHS
5 Conclusions
5.1 Summary and Conclusions
5.2 Future Research
6 Appendix
7 References


1 Introduction

The basic assumption upon which classical (two-valued) logic is based, namely that every proposition is either true or false, has been questioned since Aristotle. In his treatise "De Interpretatione" he discusses the problematic truth status of matters that are future contingent. Propositions about future events, he maintains, are neither actually true nor actually false but potentially either; hence, their truth value is undetermined, at least prior to the event. We do not restrict this "problematic" truth status to future events only. The indeterminate or "uncertain" truth status has to be allowed to cover a wide range of instances where a case cannot be classified as either completely true or false.

There are several methods of dealing with uncertainty, e.g. probability theory and fuzzy logic. Probability theory assigns to an event a probability, which is a measure of how likely the event is to happen. Fuzzy logic, on the other hand, operates by assigning degrees of truth to data. In the course of this dissertation, the two methods are evaluated in how they deal with uncertainty for various applications. This evaluation is not a competitive one, since the two methods address dissimilar notions of uncertainty; in our approach they complement each other.

The focus of this dissertation is on dealing with uncertainty in specific instances of decision-making. These instances encompass validation and fusion of data as well as diagnosis. Decisions have to be made in the presence of imprecise information obtained from sensors which measure physical quantities necessary to control a system. The information is imprecise in part because of the "characteristics" of the sensor, meaning a deficiency of complete understanding of the principles governing the operation of the sensor and its physical and spatial limitations. More imprecision is added through an incomplete knowledge of the environment, sensor failure, tolerances added during manufacturing, receptiveness to environmental conditions, etc.

Sensor validation is the process by which sensor data are authenticated by rating their integrity and, where warranted, replacing them with a better value. In this dissertation, sensor validation is achieved through the use of fuzzy logic by expressing a confidence for a given measurement based upon the expected value, the system state, the environmental conditions, sensor reliability, and the physical limitations of the system.

Sensor fusion is the process of integrating the information obtained from several sensors into one value. Redundant readings obtained from several sensors measuring the same quantity are never identical at some level of precision, which necessarily means that some sensors give more precise information than others. It is therefore necessary to identify how each sensor performs and then to use the information from all sensors to output one value for further processing. This task is closely related to sensor validation. In this thesis, fuzzy sensor fusion makes use of the ratings obtained during the validation process. Where possible, values obtained through functional redundancy are integrated into the fusion process. Values obtained through functional redundancy rely on sensor readings which do not directly measure the quantity of interest but are calculated using other measured system variables which stand in some functional relation to the variable of interest.
The advantage of exploiting this information is an increase in robustness against sensor failure, which allows better performance of any algorithm relying on sensor information.

Diagnosis is reasoning from a specific observation and several general rules to an explanation, a process called abduction. This is in contrast to deduction, which involves a general rule and a specific case, from which a specific result can be "deduced". It is also different from induction, which consists of a specific case and a specific result from which a general rule can be hypothesized. The problem with diagnosis is that there is generally more than one possible explanation for an observation. The contribution of this

work is to provide a ranking scheme which allows finding the most plausible explanation for an observation. This ranking scheme also uses fuzzy logic and introduces a suitable fuzzy measure.

The main body of this thesis is arranged in chapters 2 - 5. The second chapter gives an introduction to the problems of sensing and control in real world settings. Uncertainty is the key focus and will be illuminated from within systems of varying degrees of complexity. This complexity can be translated into the ability to model the system and therefore warrants the use of techniques appropriate to the situation. Chapter 2 also proposes several techniques for sensor validation and fusion. These techniques comprise probability theory, fuzzy logic, neural networks, and genetic algorithms. Strengths and weaknesses of each method are discussed. The probabilistic method under investigation is the probabilistic data association filter (PDAF), a Kalman filter based approach using Bayesian fusion techniques; noise is assumed to be Gaussian distributed. Then an algorithm for fuzzy sensor validation and fusion (FUSVAF) is introduced. The fusion method uses non-symmetric validation regions in which sensor readings are assigned confidence values. Each sensor has its own dynamic validation curve which is shaped according to sensor characteristics. These characteristics can take into account the range, external factors affecting the sensor, the reliability of the sensor, etc. The curves have their maximum value at the predicted value, which is arrived at using a fuzzy exponentially weighted moving average time series predictor. Confidence curves have value 0 at the boundaries of the validation gate, which is determined by the maximum possible change a system can undergo in one time sample. Since readings outside the gate are implausible, they are discarded. Readings closer to the predicted value are rewarded with a higher confidence value. Fusion is performed using a weighted average of sensor readings and confidence values and the predicted value scaled by the operating condition. The fuzzy and probabilistic approaches are compared in Monte Carlo simulations where they are subjected to a variety of noise and modeling conditions. The chapter concludes with recommendations for the use of each approach.

Diagnosis is the subject of chapter 3. First, probabilistic influence diagrams are presented as a successful technique for diagnosis. This is followed by a review of historic techniques to solve the diagnosis problem using fuzzy logic. Then, fuzzy diagnosis is introduced, which is achieved through abduction and a ranking scheme; the fuzzy abduction can diagnose crisp as well as soft faults, i.e., not only are faults diagnosed but also the degree to which they occur is determined. The proposed algorithm computes a closeness measure taking into account the distance from an observed symptom set to the modeled symptom set for all failure combinations. It then ranks the failure sets according to maximum closeness measure and minimum cover, i.e., number of faults. Lastly, formal methods for a fuzzy influence diagram are presented using the results of fuzzy diagnosis.

Chapter 4 shows three real world examples which have different degrees of complexity. In particular, monitoring of tool wear in a milling machine is introduced as an example of a highly non-linear system which is consequently very hard to model.
The approaches chosen for this application use neural network and fuzzy techniques, and their performance is compared. The next example is a power plant gas turbine, a process which is relatively well understood. However, sensor data fidelity is of prime concern to allow operation at the optimal point and to avoid system shutdown due to faulty sensor readings. It is shown how the robustness of the system is greatly improved through the use of functional redundancy. The last example involves longitudinal control of automobiles within the context of intelligent vehicle highway systems (IVHS). This system is again highly non-linear and moreover subject to a fast-changing environment. High sensor data fidelity and proper diagnosis are crucial to ensure a safe ride because human life is at stake. Therefore, the importance of sensor validation and fusion becomes very apparent, in particular with the inclusion of functionally redundant values.


2 Monitoring: Sensor Validation and Fusion under Uncertainty

Monitoring encompasses a wide range of operations which deal with the observation of events. Usually, conclusions about findings are produced which are then used as feedback advice for control purposes. Monitoring is distinguished from diagnosis in that the latter looks specifically for malfunctions by analyzing a set of symptoms. The controlling unit is alerted about failures and remedial actions are initiated.

Validation and fusion of data are part of the monitoring activities. Data are analyzed with regard to their integrity. The output of this process is a value which replaces the raw sensor reading. In case several sensors are involved, a fusion process is necessary. Usually, the measured data do not completely coincide when several sensors measure the same quantity, a discrepancy which depends on a variety of factors. It is therefore necessary to integrate the information from the several sensors to come up with one value which can be trusted. It is desirable to have several sensors because through this redundancy the problem of sensor failure can be alleviated. This is particularly true for systems which involve the safety of personnel or which involve an expensive shutdown when a malfunction is indicated.

Imprecision is an inherent property of sensors because they are subject to noise. Here, noise is used as a term which summarizes several reasons for this imprecise behavior. Explanations for the occurrence of noise are a poor understanding of the principles governing the sensor behavior, a lack of understanding of the environment, changing environmental conditions, tolerances within the sensor, etc. This uncertainty has to be dealt with by filtering out the noise. Among the methods for dealing with uncertainty are probability theory, fuzzy logic, and neural networks. The underlying theory is briefly introduced in the next sections.


2.1 Uncertainty – Preliminaries

Uncertainty is inherent to almost all processes. The sources of uncertainty are most generally the likelihood of an event occurring and the imprecision of information. Inferencing under uncertainty involves uncertainty of the data and of the rules. The problems are the validity of the inference steps taken under uncertainty as well as of the resulting data. Generally, uncertainty is represented by a measure such as a probability or a fuzzy membership.

Probability theory is a tried and tested mathematical theory which has attained an advanced state of development. Its axioms are clear and undisputed; the fundamental axiom is that of additivity of probabilities for disjoint events. The controversial aspects pertain to the interpretation. Two schools of thought have arisen, the frequentists and the subjectivists. The former regard probabilities as limits of frequencies of observed events while the latter regard probability as a measure of a feeling of uncertainty. While this position can be contested from a philosophical and practical point of view, the axioms of probability are the only reasonable basis for evaluating subjective probability. Probability theory has been used successfully in control theory and other applications. The limiting assumptions are that the events considered are mutually exclusive and collectively exhaustive.

Fuzzy logic, on the other hand, focuses more on the vagueness of data as opposed to the likelihood of an event. In fuzzy logic, the confidence in observed data is expressed through a membership value which is the degree of truth in believing that the data are correct. This view diverges from the classical view that an event is either true or not true. Rather, an event can be partially true. This degree of truth is captured in the membership value.

Lastly, as a third method to deal with uncertainty, neural networks have been introduced. They are not a traditional tool in the sense that they do not quantify the uncertainty involved in a process or event. Rather, they are used to filter out uncertainty. This can be achieved by first training the neural network, e.g. by exposing it to noisy input patterns and known outputs with a suitable learning algorithm. After that, the network has acquired the ability to remove the uncertainty attached to the data. The disadvantage is that neural networks have only an internal representation of this process; that is, knowledge about the procedure is not made transparent. Also, for many neural network architectures it is not certain whether the representation generalizes beyond the training data.

The subsequent sections give a brief introduction to probability theory, fuzzy logic, and neural networks to the extent necessary for this thesis.

2.1.1 Probability Theory

This chapter gives a brief introduction to probability theory to the extent necessary for this work. An extended examination can be found in Anderson (Anderson, 1958), Bennett and Franklin (Bennett and Franklin, 1953), and Hadley (Hadley, 1967). Probability can be seen from at least two different standpoints: one regards probability as the frequency of events, another regards it as a subjective assessment of an event based on a particular state of information which can change. In either case, probability is governed by three fundamental axioms.

The first demands that a probability is nonnegative,
P(X) ≥ 0
where P is a probability and X is an event. The second axiom normalizes the measure to 1 for the certain event,
P(I) = 1
where I is the universe of all events. The third axiom requires that the probabilities of two mutually exclusive events are additive,
P(A + B) = P(A) + P(B)
where B is another event and AB = 0, i.e., the events are mutually exclusive. The Venn diagram in Fig. 2.1.1-1 illustrates these axioms.

[Fig. 2.1.1-1: Venn diagram for mutually exclusive events A and B within the universe I]

The first of three theorems of probability says that the probability of the negated event, i.e., everything but the event in question, is
P(¬A) = 1 − P(A)
The second says that the nil event has probability 0, i.e.,
P(0) = 0
and the third says that the probability of two events that are not necessarily mutually exclusive is
P(A + B) = P(A) + P(B) − P(A,B)
which means that the overlapping region in a Venn diagram has to be subtracted once to prevent it from being counted twice. The corresponding Venn diagram is shown in Fig. 2.1.1-2.

[Fig. 2.1.1-2: Venn diagram for not mutually exclusive events A and B within the universe I]

The conditional probability describes the likelihood of an event given that another event has occurred. In the interpretation of the Venn diagram this means the ratio of the area shared by both events, i.e., the intersection, to the area of the conditioning event. Fig. 2.1.1-3 shows this scenario. The conditional probability essentially renormalizes the probability measure when new information shows that an event (A) is known with certainty.

[Fig. 2.1.1-3: Venn diagrams for conditional probabilities; after conditioning on A, the universe I reduces to A]

The conditional probability is defined as
P(A|B) = P(A,B) / P(B)
When knowledge of one event gives no information about the other event, the two events are called independent. In that case, their probabilities can simply be multiplied (which is an important property for computational purposes):
P(A,B) = P(A) · P(B)
Therefore
P(A|B) = P(A,B) / P(B) = P(A) · P(B) / P(B) = P(A)

Bayes' theorem is the most important relation in probabilistic expert systems and influence diagrams. With the help of Bayes' theorem, probabilistic inference becomes possible. The underlying assumption is that the events Ai are mutually exclusive and collectively exhaustive, i.e.,
P(Ai, Aj) = 0 for i ≠ j
P(A1 + A2 + ... + An) = 1
If all probabilities P(Ai) and P(B|Ai) are known, then the joint probability can be formed:
P(Ai, B) = P(Ai) · P(B|Ai)
Using the theorems above, P(Ai|B) becomes
P(Ai|B) = P(Ai, B) / P(B) = P(B|Ai) · P(Ai) / P(B)
Since the events Ai are collectively exhaustive, i.e., P(A1 + A2 + ... + An) = 1, we can write
B = I · B = (A1 + A2 + ... + An) B
Therefore, the probability P(B) becomes (assuming mutually exclusive events)
P(B) = P(A1, B) + P(A2, B) + ... + P(An, B) = Σj P(Aj, B) = Σj P(B|Aj) · P(Aj)
and hence
P(Ai|B) = P(B|Ai) · P(Ai) / Σj P(B|Aj) · P(Aj)
which is called Bayes' theorem. It is the driving mechanism for inference about failures in view of evidence.
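To make the mechanics concrete, here is a minimal sketch of Bayes' theorem applied to inference about failures given a piece of evidence; the fault hypotheses, priors, and likelihoods are hypothetical illustration values, not data from this work.

```python
def posterior(priors, likelihoods):
    """Return P(A_i | B) for mutually exclusive, exhaustive events A_i.

    priors:      dict mapping event -> P(A_i)
    likelihoods: dict mapping event -> P(B | A_i)
    """
    total = sum(priors[a] * likelihoods[a] for a in priors)  # total probability P(B)
    return {a: priors[a] * likelihoods[a] / total for a in priors}

# Hypothetical fault hypotheses and numbers for illustration only.
priors = {"sensor fault": 0.05, "process fault": 0.10, "no fault": 0.85}
likelihoods = {"sensor fault": 0.90, "process fault": 0.60, "no fault": 0.02}
print(posterior(priors, likelihoods))
```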

The mean of a probability distribution is its location or mathematical expectation. It is a measure of the center of the probability distribution and the first moment about the origin,
µ = Σx x · f(x)
where f(x) is the probability distribution, which expresses the probability that any one variable falls within the range of f(x), with
Σx f(x) = 1
The variance is the second moment about the mean,
σ² = (1/n) Σ (x − µ)²
and the standard deviation σ is its square root. If a random sample of size n is taken from a population having mean µ and variance σ², then the sample mean is a random variable whose distribution has the mean µ.

In a normal distribution, the errors of the measurement are found to be regular and can be approximated by the continuous curve
fn(x; µ, σ) = 1/(√(2π) σ) · e^(−(x−µ)²/(2σ²)),  −∞ < x < ∞
Other distributions are the uniform distribution
fu(x) = 1/(β − α) for α < x < β, 0 elsewhere
with
µ = (α + β)/2,  σ² = (β − α)²/12
the log-normal distribution with
µ = e^(α + β²/2),  σ² = e^(2α + β²) (e^(β²) − 1)
the gamma distribution
fg(x) = 1/(β^α Γ(α)) · x^(α−1) e^(−x/β) for x > 0, α > 0, β > 0, 0 elsewhere
where
Γ(α) = ∫0∞ x^(α−1) e^(−x) dx
with
µ = αβ,  σ² = αβ²
the beta distribution
fb(x) = Γ(α+β)/(Γ(α)Γ(β)) · x^(α−1) (1−x)^(β−1) for 0 < x < 1, α > 0, β > 0, 0 elsewhere
with
µ = α/(α+β),  σ² = αβ / ((α+β)²(α+β+1))
and the Weibull distribution
fw(x) = αβ x^(β−1) e^(−αx^β) for x > 0, α > 0, β > 0, 0 elsewhere
with
µ = α^(−1/β) Γ(1 + 1/β),  σ² = α^(−2/β) [Γ(1 + 2/β) − (Γ(1 + 1/β))²]

The expectation of a random variable whose function is given by g(X) is
E[g(X)] = ∫−∞∞ g(x) f(x) dx
where f is the probability density function. If there is a function of several random variables, g(x1, x2), the expected value E[(X1 − µ1)(X2 − µ2)] tends to be positive when large X1, X2 or small X1, X2 occur together. It is therefore called the covariance of X1 and X2. If X1 and X2 are independent, the covariance is 0.
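As a small illustration of the covariance just defined, the sketch below estimates it from paired samples; the data are made up.

```python
def covariance(x1, x2):
    """Sample estimate of E[(X1 - mu1)(X2 - mu2)] from paired observations."""
    n = len(x1)
    m1 = sum(x1) / n
    m2 = sum(x2) / n
    return sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / n

# Hypothetical paired readings; large and small values occur together, so the covariance is positive.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.1, 1.9, 3.2, 3.8]
print(covariance(x1, x2))
```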

Oftentimes, one needs to decide whether a statement is true or false. If a sample is taken and the mean and standard deviation are found, they can still be wrong. A test of hypothesis can help to accept or reject a statement. The event hypothesis Hj is true with a prior probability pj. For mutually exclusive events, the observed event is associated with a hypothesis and divided by the total probability to become the posterior probability P(Hj|Ai). Hj is selected if P(Hj|Ai) > P(Hk|Ai), k ≠ j. The directive to maximize the probability of correct decisions in this way brings us back to Bayes' rule.

When a mathematical representation is required for prediction, the formula which relates dependent variables (the value to be predicted) to independent variables can be arrived at through curve fitting techniques. The least squares method works by minimizing
Σi=1..n [yi − (a + b·xi)]²
from which the following two normal equations can be derived:
Σi=1..n yi = a·n + b·Σi=1..n xi
Σi=1..n xi·yi = a·Σi=1..n xi + b·Σi=1..n xi²
The solution to the normal equations is
b = Sxy / Sxx  and  a = ȳ − b·x̄
where
Sxx = Σi=1..n (xi − x̄)² = Σi=1..n xi² − (Σi=1..n xi)² / n
Sxy = Σi=1..n (xi − x̄)(yi − ȳ) = Σi=1..n xi·yi − (Σi=1..n xi)(Σi=1..n yi) / n
Similarly, curvilinear regression works for data with exponential character, e.g. y = c · d^x. If none of these regression tools works satisfactorily, a Taylor series expansion can be employed to approximate the data with the polynomial y = d0 + d1·x + d2·x² + ... + dp·x^p, which is minimized and solved for the parameters.
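The normal-equation solution above translates directly into code. The sketch below fits a straight line to a handful of assumed data points; the helper name and the numbers are illustrative only.

```python
def fit_line(x, y):
    """Least-squares line y = a + b*x via b = Sxy/Sxx and a = ybar - b*xbar."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b

# Hypothetical data points for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.3, 5.9, 8.2, 9.8]
print(fit_line(x, y))  # intercept a, slope b
```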

Time series analysis uses information about dependent observations and can utilize spectral analysis or stochastic models (Goodwin and Sin, 1984; Graupe, 1984; Box and Jenkins, 1970). The types of problems intrinsic to filtering, prediction, and control are:
1. Filtering: extraction of signals from noise. In principle it is possible to design an optimal filter to enhance the signal-to-noise ratio if models for the signal and noise are available. If the model is not known, data should be used to estimate a model; adaptive filters are a natural extension.
2. Prediction: extrapolation of a given time series into the future through simple recursive calculation. As before, if the model is not known completely, it is estimated from the time series. One can determine how to utilize past data, figure out the variance of the forecast errors, and calculate limits within which a future value of the series will lie with a given probability.
3. Control: manipulation of the inputs to a system to let the output achieve certain objectives. In deterministic systems, noise and disturbances are of secondary importance relative to the modeling errors; the methods can still be used in the presence of noise. In stochastic systems, on the other hand, noise and disturbances (perturbations) are explicitly considered.

The basic concept is to express a data point as a weighted sum of previous observations. For prediction, this equation can then be extrapolated into the future. The parameters in this sum are optimized using a number of different cost functions. Some of the important concepts in time series analysis are included below.

The autoregressive (AR) model is defined as
Σi αi x(k−i) = w(k)
The moving average (MA) model is described by
x(k) = Σj βj w(k−j)
The autoregressive moving average (ARMA) model is given by
Σi φi x(k−i) = Σj θj w(k−j)
or
x(k) = Σi αi x(k−i) + w(k)
hence
x(k) − Σi αi x(k−i) = w(k)
such that w(k) is the model residual. For estimation, we get
x̂(k) = Σi=1..n αi x(k−i)

For least squares parameter identification of time series and systems, consider again
x(k) = Σi αi x(k−i) + w(k)
and the error between the observations and the estimates, defined as the cost
J = Σj=1..r [ x(k−j+1) − Σi=1..n âr,i x(k−i−j+1) ]²
which is to be minimized. Further extension leads to the Kalman filter. Forgetting factors are introduced for systems which are not stationary, i.e., whose parameters change over time. Data points further in the past are weighted less than the ones closer to the present. The cost function J using forgetting factor q is
J = Σj=1..r q^(j−1) [ x(k−j+1) − Σi=1..n âr,i x(k−i−j+1) ]²,  0 ≤ q ≤ 1

In stochastic systems, the output is assumed to be predictable up to a white noise residual. The latter is also termed "innovation". To handle the random nature of this noise, the use of probabilistic tools is near at hand. While the nature of the problem is still largely the same, some modifications need to be made to the algorithms discussed above to reflect the influence of the random components. If the noise and signal spectra are non-overlapping, the undesired noise can be attenuated while the signal is allowed to pass. Low-pass or high-pass filters are concerned with this task and will not be discussed in this context. Here, the problem of overlapping signal and noise spectra is of interest. While in the deterministic case exact prediction is possible, this is not so in the presence of noise. Rather, one aims to minimize the prediction error variance. The Kalman filter lies at the heart of this problem and will be discussed along with alternatives. As before, the state-space model is used, amended by a noise term:
x(t+1) = F x(t) + G u(t) + v(t)
z(t) = H x(t) + w(t)
where v(t) is the process perturbation, w(t) is the measurement or observation noise, and v(t) and w(t) are assumed to be zero mean white noise sequences, defined on a probability space, with given covariances. The model above is called a "Markov model" since the probability density of x(t) conditioned on x(t1), x(t2), ..., x(tm) is simply the probability density of x(t) conditioned on x(tm):
p(x(t) | x(t1), ..., x(tm)) = p(x(t) | x(tm))
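To make the state-space formulation concrete, here is a minimal scalar Kalman filter sketch (one predict/update cycle for the model above with scalar F, G, and H); the noise variances and the measurement sequence are assumed illustration values.

```python
def kalman_step(x_est, p_est, u, z, F=1.0, G=0.0, H=1.0, q=0.01, r=0.25):
    """One predict/update cycle of a scalar Kalman filter for
    x(t+1) = F x(t) + G u(t) + v(t),  z(t) = H x(t) + w(t),
    with process noise variance q and measurement noise variance r."""
    # Predict
    x_pred = F * x_est + G * u
    p_pred = F * p_est * F + q
    # Update
    k = p_pred * H / (H * p_pred * H + r)   # Kalman gain
    x_new = x_pred + k * (z - H * x_pred)   # innovation-weighted correction
    p_new = (1.0 - k * H) * p_pred
    return x_new, p_new

# Hypothetical noisy measurements of a roughly constant signal.
x_est, p_est = 0.0, 1.0
for z in [1.1, 0.9, 1.3, 0.8, 1.05]:
    x_est, p_est = kalman_step(x_est, p_est, u=0.0, z=z)
print(x_est, p_est)
```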

2.1.2 Fuzzy Reasoning

This chapter gives a brief overview of fuzzy logic and introduces terminology and notation employed in later sections of this thesis. For further review, the reader is directed to Klir and Folger (Klir and Folger, 1988), Zimmermann (Zimmermann, 1991), Pedrycz (Pedrycz, 1989), and Kosko (Kosko, 1992). The key to the success in applying fuzzy logic is that it adds quantitative meaning to linguistic rules. Because more than one rule can be active at any given time, elasticity is added to the response of a system.

Fuzzy systems have been around since the 1920s when Lukasiewicz and later Gödel and Black (Rescher, 1969) worked on techniques for inexact reasoning which were first dubbed "possibility theory". In 1965, Zadeh (Zadeh, 1965) extended that work by introducing tools for working with fuzzy natural language terms, which he called "fuzzy logic". Terms used in natural language to describe some concept with vague values are called linguistic variables, such as "small", "medium", and "large". To further refine these variables, adverbs or hedges such as "very" or "somewhat" can be added. By removing the sharp boundaries between members of a class, the complexity of a system can be reduced, allowing uncertainty to be incorporated in the model. The uncertainty is expressed as a grade of membership. In contrast to probability, fuzzy theory describes a deterministic uncertainty. To illuminate the difference, Kosko (Kosko, 1992) points to the description of an inexact oval in Fig. 2.1.2-1 by alternatively saying "it is probably an ellipse" or "it is a fuzzy ellipse".

[Fig. 2.1.2-1: Inexact oval (Kosko, 1992)]

Di Nola (Di Nola et al., 1989) makes a further philosophical distinction: a prior probability of x belonging to A becomes, after observation, a posterior probability of either P(x ∈ A) = 1 or 0. However, the degree to which something is part of a class remains the same. That is, fuzziness remains the same while the randomness changes.

A fuzzy set A of a universe of discourse X is characterized by a membership function µA which associates each element x from X with a degree of membership in A:
µA(x): X → [0, 1]
A fuzzy relation is expressed as
G is A is µ
where A is a fuzzy set, G is a variable, and µ is a membership value. For example, "10 is positive big is 0.9" states that the reading 10 belongs to the fuzzy set "positive big" with membership 0.9.

[Fig. 2.1.2-2: Membership function of "positive big"]

Fig. 2.1.2-2 shows the membership function of the fuzzy set "positive big". Membership functions can be of the parameterized form (Jang, 1993)
µ = 1 / (1 + |(y − cµ)/aµ|^(2bµ))
where µ is the membership value, y is the sensor reading, aµ determines the width of the membership function, bµ is the exponent which determines how closely the function resembles the trapezoidal function, i.e., the slope at the crossover points, and cµ is the center of the fuzzy membership function. The drawback of this function is its symmetry.
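As a sketch of the parameterized membership function above, the following evaluates the bell-shaped form for a few readings; the parameters chosen for the fuzzy set "positive big" are hypothetical.

```python
def bell_membership(y, a, b, c):
    """Parameterized membership function mu(y) = 1 / (1 + |(y - c)/a|**(2*b))."""
    return 1.0 / (1.0 + abs((y - c) / a) ** (2 * b))

# Hypothetical parameters: center c=10, width a=2, shape b=2.
for y in [6.0, 8.0, 10.0, 12.0]:
    print(y, bell_membership(y, a=2.0, b=2.0, c=10.0))
```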

Fuzzy sets, as the name suggests, operate with non-crisp (fuzzy) boundaries of sets, while fuzzy measures allow the assignment of degrees of membership to crisp sets. A notation often used for fuzzy sets is
A = µ1/x1 + µ2/x2 + ... + µn/xn
where A is a fuzzy set, the xi are the elements supporting it, and the µi are the degrees of membership relating to the xi. The height of a fuzzy set is the largest membership grade attained by any element in that set. The fuzzy set is called normalized when at least one of its elements attains the maximum possible membership degree of 1. An α-cut of a fuzzy set is a crisp set Aα which contains all the elements of the universal set X which have membership grade in A greater than or equal to α, i.e.,
Aα = {x ∈ X | µA(x) ≥ α}
The complement of a fuzzy set is defined as
µ¬A(x) = 1 − µA(x)
It must be noted that the complement of this set is not necessarily the same as the literal complement, i.e., the complement of the fuzzy set "old" is not necessarily the same as the fuzzy set "young" but rather "not old". Furthermore, A ∩ ¬A ≠ 0 and A ∪ ¬A ≠ X, i.e., paradoxes are permissible to some degree.
The union of two fuzzy sets is defined as
µA∪B(x) = max[µA(x), µB(x)] = µA(x) ∨ µB(x)   for all x ∈ X
This means that the membership grade of the union is either the membership grade in A or the membership grade in B, whichever is larger. The intersection is defined as
µA∩B(x) = min[µA(x), µB(x)] = µA(x) ∧ µB(x)   for all x ∈ X
Again, the membership grade of the intersection is either the membership grade in A or the membership grade in B, this time the smaller one of the two.
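A minimal sketch of the complement, union, and intersection operations on discrete membership vectors; the membership values are made up for illustration.

```python
def f_union(mu_a, mu_b):
    """Pointwise max: membership in A OR B."""
    return [max(a, b) for a, b in zip(mu_a, mu_b)]

def f_intersection(mu_a, mu_b):
    """Pointwise min: membership in A AND B."""
    return [min(a, b) for a, b in zip(mu_a, mu_b)]

def f_complement(mu_a):
    """Pointwise 1 - mu: membership in NOT A."""
    return [1.0 - a for a in mu_a]

# Hypothetical memberships of elements x1..x4 in fuzzy sets A and B.
mu_a = [0.1, 0.4, 0.8, 1.0]
mu_b = [0.3, 0.3, 0.6, 0.9]
print(f_union(mu_a, mu_b), f_intersection(mu_a, mu_b), f_complement(mu_a))
```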

An important concept is known as the "extension principle" (Zadeh, 1973). It provides a means to generalize the mapping of points of one crisp set to another to fuzzy subsets. Given a function f mapping points in set X to points in set Y and a fuzzy set A where A = µ1/x1 + µ2/x2 + ... + µn/xn, the extension principle states that
f(A) = f(µ1/x1 + µ2/x2 + ... + µn/xn) = µ1/f(x1) + µ2/f(x2) + ... + µn/f(xn)
If more than one element of X is mapped by f to the same element y ∈ Y, then the maximum of the membership grades of these elements in the fuzzy set A is chosen as the membership grade of y. If no element is mapped from X to y, then the membership grade is zero. If a function f maps ordered tuples of several different sets X1, ..., Xn where f(x1, ..., xn) = y, then the membership grade of element y in f(A1, ..., An) is equal to the minimum of the membership grades of x1, ..., xn in A1, ..., An, respectively.

In binary logic, deductive inference is performed through the use of modus ponens. Given two true propositions, A and A ⇒ B (where "A ⇒ B" stands for "IF A THEN B"), the truth of the proposition B may be inferred, or
(A ∧ (A ⇒ B)) ⇒ B
In fuzzy logic, the same inference method is used except that the propositions are expressed as fuzzy propositions, "x is A'" and "IF x is A THEN y is B":
IF x is A THEN y is B
x is A'
y is B'
where A' is similar but not completely equal to A, for example (Zimmermann, 1991):
IF tomato is red THEN tomato is ripe
tomato is very red
tomato is very ripe
This association can be stored in a relational matrix M:
A ∘ M = B
where A is a fuzzy set defined on X and B is a fuzzy set defined on Y. Component bj is computed as
bj = max(1≤i≤n) min(ai, mij)
or, equivalently,
bj = ∨(i=1..n) [ai ∧ mij]
Kosko (Kosko, 1992) calls the entries in the relational matrix a "Fuzzy Associative Memory" because fuzzy systems associate output fuzzy sets with input fuzzy sets and so behave as associative memories. To form the matrix M, Zadeh described the compositional rule of inference using max-min inference:
mij = truth(ai ⇒ bj) = min(ai, bj)
If there are several premises in a rule, e.g. IF x is A AND y is B THEN z is C, the induced fuzzy sets can be computed through composition,
A' ∘ MAC = C'A
B' ∘ MBC = C'B
and the AND-joined matrix
C' = [A' ∘ MAC] ∧ [B' ∘ MBC] = C'A ∧ C'B
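The max-min composition bj = max_i min(ai, mij) can be sketched as follows; the membership vector and relational matrix are hypothetical.

```python
def max_min_composition(a, M):
    """Compose input membership vector a with relational matrix M: b_j = max_i min(a_i, M[i][j])."""
    n_out = len(M[0])
    return [max(min(a_i, row[j]) for a_i, row in zip(a, M)) for j in range(n_out)]

# Hypothetical memberships of the observation in the input fuzzy sets.
a = [0.2, 0.8, 0.5]
# Hypothetical relational matrix, e.g. built from min(a_i, b_j) rule entries.
M = [[0.1, 0.2, 0.2],
     [0.4, 0.8, 0.6],
     [0.3, 0.5, 0.5]]
print(max_min_composition(a, M))
```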

The aggregated fuzzy system can be illustrated as shown in Fig. 2.1.2-3.

[Fig. 2.1.2-3: Fuzzy system architecture: the input A is evaluated by Rule 1 through Rule n, whose outputs B'1, ..., B'n are aggregated (Σ) into the overall output B]

Other relational operators Rs, Rg, Rsg, Rgg, Rgs, and Rss (Mizumoto et al., 1979) can be used to compute the consequence B' for modus ponens through B'x = A' ∘ Rx and the consequence A' for modus tollens through A'x = Rx ∘ B', respectively. Let A and B be fuzzy sets in the universes of discourse U and V, respectively, which are represented as
A = ∫U µA(u)/u  and  B = ∫V µB(v)/v
Then
Rs = (A × V) ⇒s (U × B) = ∫U×V [µA(u) →s µB(v)] / (u,v)
where
µA(u) →s µB(v) = 1 if µA(u) ≤ µB(v), and 0 if µA(u) > µB(v)
and →s is the implication rule in the standard sequence;
Rg = (A × V) ⇒g (U × B) = ∫U×V [µA(u) →g µB(v)] / (u,v)
where
µA(u) →g µB(v) = 1 if µA(u) ≤ µB(v), and µB(v) if µA(u) > µB(v)
and →g is the implication rule in the Gödelian sequence;
Rsg = [(A × V) ⇒s (U × B)] ∩ [(¬A × V) ⇒g (U × ¬B)] = ∫U×V [µA(u) →s µB(v)] ∧ [(1 − µA(u)) →g (1 − µB(v))] / (u,v)
Rgg = [(A × V) ⇒g (U × B)] ∩ [(¬A × V) ⇒g (U × ¬B)] = ∫U×V [µA(u) →g µB(v)] ∧ [(1 − µA(u)) →g (1 − µB(v))] / (u,v)
Rgs = [(A × V) ⇒g (U × B)] ∩ [(¬A × V) ⇒s (U × ¬B)] = ∫U×V [µA(u) →g µB(v)] ∧ [(1 − µA(u)) →s (1 − µB(v))] / (u,v)
Rss = [(A × V) ⇒s (U × B)] ∩ [(¬A × V) ⇒s (U × ¬B)] = ∫U×V [µA(u) →s µB(v)] ∧ [(1 − µA(u)) →s (1 − µB(v))] / (u,v)

For modus ponens, the inference results obtained with the premises "x is A", "x is very A", "x is more or less A", and "x is not A" under the operators Rs, Rg, Rsg, Rgg, Rgs, and Rss are summarized in Table 2.1.2-1. With the exact premise "x is A", all six operators yield the consequence µB; for the hedged and negated premises the entries are functions of µB such as µB, µB², 1 − µB, and 1, depending on the operator.

[Table 2.1.2-1: Inference with modus ponens]

These operators can also be used for modus tollens, which is described by
IF x is A THEN y is B
y is B'
x is A'
It reduces to binary modus tollens when B' = ¬B and A' = ¬A.

For modus tollens, the inference results obtained with the premises "y is not B", "y is not very B", "y is not more or less B", and "y is B" are summarized in Table 2.1.2-2; the entries are functions of µA such as 1 − µA, 1 − µA², 0.5 ∨ (1 − µA), and µA, depending on the operator.

[Table 2.1.2-2: Inference with modus tollens]

Note also the definitions of the following operators and sets, which will be used later:
a α b = 1 if a ≤ b, and b if a > b
a β b = 0 if a < b, and b if a ≥ b
R δ b ≜ [sij] (Pappis and Sugeno, 1985), where
sij = [ ∧(k=1..n) (rik α bk) ] β (rij β bj),  i = 1, ..., m, j = 1, ..., n
a ω b = b if a > b, [b, 1] if a = b, and 0 if a < b
a ω̄ b = [0, b] if a > b, and [0, 1] if a ≤ b
The set Φ(ǎ) is defined by (Pappis and Sugeno, 1985)
Φ(ǎ) = {φ(ǎ)}
where ǎ = (ǎ1, ǎ2, ..., ǎm)ᵀ such that ǎi = â = ∨i(ai) or 0, and
φ(ǎ) = (φ1, φ2, ..., φm)ᵀ with φi = 0 or â, i = 1, ..., m, and Σ(i=1..m) φi = â

Thus, if there are k non-zero elements in ǎ, there are k vectors in Φ(ǎ).

For control purposes, often a crisp value is necessary, which is obtained in the defuzzification step. One popular method is the centroid method, namely
yi = [ Σ(j=1..p) yj µB'(yj) ] / [ Σ(j=1..p) µB'(yj) ]
which is shown in Fig. 2.1.2-4.

[Fig. 2.1.2-4: Defuzzification (weighted average): yj is the control value according to rule j with membership µj, and yi is the overall control value]
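A minimal sketch of the centroid (weighted average) defuzzification step; the candidate outputs and memberships are made-up values.

```python
def centroid_defuzzify(y_values, memberships):
    """Crisp output as the membership-weighted average of the candidate outputs y_j."""
    num = sum(y * mu for y, mu in zip(y_values, memberships))
    den = sum(memberships)
    return num / den

# Hypothetical rule outputs y_j and their memberships mu_j.
print(centroid_defuzzify([2.0, 5.0, 8.0], [0.2, 0.7, 0.4]))
```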

The uncertainty of a system can be measured by the entropy E(A) of the system, defined by
E(A) = l1(A, Anear) / l1(A, Afar)
where l1(A, Ax) = Σi |µA(xi) − µAx(xi)| is the fuzzy Hamming distance, Anear is the nearest vertex, and Afar is the farthest vertex, residing opposite the long diagonal from Anear.

Fuzzy sets contain subsets. In bivalent logic, A is a subset of B, A ⊂ B, iff every element of A is an element of B. In fuzzy logic, the fuzzy set A must be contained in B, A ⊂ B, iff µA(x) ≤ µB(x) for all x. If there are non-fits, we can in fuzzy logic count the violations and then obtain a measure of subsethood. This is done by actually measuring the supersethood (or non-subsethood),
SUPERSETHOOD = Σx max(0, µA(x) − µB(x)) / l1(A, ∅)
where ∅ denotes the origin and
l1(A, ∅) = Σi µA(xi) = M(A)
where M(A) is the cardinality (or size) of the fuzzy set A. Consequently,
SUBSETHOOD = 1 − Σx max(0, µA(x) − µB(x)) / l1(A, ∅)
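The subsethood measure above can be sketched as follows for two discrete fuzzy sets; the membership vectors are hypothetical.

```python
def subsethood(mu_a, mu_b):
    """Degree to which fuzzy set A is a subset of B over a common support."""
    violations = sum(max(0.0, a - b) for a, b in zip(mu_a, mu_b))
    size_a = sum(mu_a)  # cardinality M(A) = l1(A, empty set)
    return 1.0 - violations / size_a

# Hypothetical membership vectors on the same elements x1..x4.
mu_a = [0.2, 0.6, 0.9, 0.4]
mu_b = [0.3, 0.5, 1.0, 0.4]
print(subsethood(mu_a, mu_b))
```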

Lukasiewicz logic Lκ is an attractive candidate for fuzzy logic (Kundu, S., and Chen, J., 1994). The language Lκ has two primitive connectives {→, ¬}, where "→" is the implication given by
a → b = min{1, 1 − a + b},  a, b ∈ [0, 1]
and "¬" is the negation, ¬a = 1 − a. Other operators such as "∧" and "∨" are considered just abbreviations of operations involving "¬" and "→", for example
A ∨ B = (A → B) → B
Lukasiewicz logics have been studied to arrive at an extension to handle fuzzy reasoning. It has been shown (Smets, P., Margrez, P., 1987) that they are isomorphic when completeness requirements are postulated. Furthermore, Lukasiewicz logics are the only ones that are axiomatizable (Kundu, S., and Chen, J., 1994).

2.1.3 Neural Networks

Artificial neural networks have recently attracted attention because they seem to resemble functions performed by biological neurons. Although some of these analogies are far-fetched, neural networks still provide an interesting approach to solving problems, in our context in particular because of their filtering abilities. The neural networks considered here are software or hardware implementations of adaptive matrices that take multiple inputs, process these inputs, and return an output. The key idea in the use of neural nets is that they are not programmed in the traditional sense. Rather, the neural network develops an internal representation out of a set of examples which is used to "train" the network. It thus comes up with a pattern matcher which it can later use to recognize similar patterns within a certain range of tolerance.

Because a neural net consists of many partly redundant nodes, different aspects of a pattern are seen by different nodes which may overlap. That has two primary consequences: (1) the neural net is not dependent on a completely correct input; it may still come up with a correct decision even when part of the information is missing; (2) the neural net will still work even when it is damaged. Especially the first consequence has important implications for a monitoring environment where sensor data may be corrupted or unreadable due to noise or faulty sensors: a correct decision can sometimes still be made in cases where traditional systems would have given up. Many types of neural networks are available; the interested reader is referred to (Rumelhart and McClelland, 1986) or (Hertz et al., 1991). Here we describe a simple multi-layered feedforward network, a self-organizing probabilistic neural network, and a neural-fuzzy network.

2.1.3.1 Backpropagation Neural Network

A neuron, in analogy to its counterpart in the brain, sums several inputs and generates an output signal depending on how much the summed input value exceeds a certain threshold (Fig. 2.1.3.1-1). The following formula expresses this relation mathematically:
y = g( Σj (wj xj) − t )
where y is the output signal, wj is weight j, xj is input signal j, t is the threshold, and g is the (nonlinear) activation function.

[Fig. 2.1.3.1-1: Node (or neuron) of a neural net: the inputs xi are weighted by wi, summed, compared with the threshold t, and passed through g to produce y]

The nodes are organized in layers (Fig. 2.1.3.1-2). Three different types of layers can be distinguished: (1) the input layer is connected with the environment and is fed with the sensor data; (2) the output layer provides the result; (3) hidden layers that lie between input and output layers perform part of the necessary computation of the matching algorithm. The number of hidden layers and nodes per layer can vary. The more nodes and layers are used, the higher the computational time will be. If there are too few hidden nodes, though, the pattern matching task will not perform as desired. Hidden nodes are necessary to be able to obtain certain results. Since the pattern matching can be thought of as a method to partition data into different categories, the network has to be able to divide the input data into the appropriate categories, which is equivalent to setting a hyperplane between different types of data. This is only possible with hidden units and a proper learning algorithm. Although the power of hidden layers was recognized early (Rosenblatt, 1958), no good learning algorithm could be found, thus virtually halting the interest in neural networks for two decades until the introduction of the backpropagation learning algorithm. Networks with only two layers, such as the perceptron, a network consisting only of input and output layers, were not able to perform the parity task, to name an example.

[Fig. 2.1.3.1-2: Layers of a neural net: input layer, hidden layer, and output layer]

During the training period the weights are adjusted according to a certain training algorithm. The most commonly used technique computes the error and its negative gradient and propagates changes in the weights and thresholds backwards through the net until all the different input cases render the desired outputs. It is therefore called backpropagation. The updating of the weights is performed according to
wij^m(new) = wij^m(old) + Δwij^m
where wij^m(new) is the new weight between node i of layer m and node j of layer m−1, wij^m(old) is the old weight between node i of layer m and node j of layer m−1, and Δwij^m is the amount by which the old weight is updated,
Δwij^m = −η ∂E / ∂wij^m
where η is the learning rate and
E = (1/2) Σc Σi (ζi^c − yi^(c,M))²
is the error, with ζi^c the desired output at node i of the output layer for case c and yi^(c,M) the actual output at node i of the output layer M for case c,
yi^(c,m) = g( Σj wij^m yj^(c,m−1) )
where g is the activation function, e.g. g(x) = (1 + e^(−x))^(−1). This gives
Δwij^m = η δi^(c,m) yj^(c,m−1)
where
δi^(c,M) = g′( Σj wij^M yj^(c,M−1) ) (ζi^c − yi^(c,M)),  M = output layer
δi^(c,m) = g′( Σk wik^m yk^(c,m−1) ) Σj wji^(m+1) δj^(c,m+1),  m = M−1, M−2, ..., 2
and yk^(c,0) is the input of the pattern at node k of the input layer for case c.

After the training is completed, the network configuration and the weights are fixed. The network is now ready for operation and should yield approximately correct responses when the input is close to the examples presented during training. This is particularly useful in the presence of noise, of which much is present during machining. However, generalization is not guaranteed, and the training algorithm can get stuck in a local minimum during training, which will prevent it from finding an optimal solution. The appealing aspect of neural nets is that they can provide functionality not easily obtainable with other technologies. All that is necessary is a set of training examples to learn and adapt from; no precompiled knowledge base is required. The neural net provides a great deal of flexibility because it can be applied to a specific problem without further processing or reprogramming. The noise reducing properties stem from the optimization over a batch of training data. The algorithm minimizes the error of training data which are subject to noise. While some data have larger variances than others, best values are obtained because the training procedure finds the best trade-off for the data. This implies an implicit distributional assumption of the neural network.
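As a compact sketch of the training procedure described above, the following trains a one-hidden-layer network with backpropagation on a toy data set (XOR); the network size, learning rate, and data are illustrative assumptions, not settings used in this work.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_backprop(samples, n_hidden=4, eta=0.5, epochs=5000, seed=0):
    """Tiny one-hidden-layer backpropagation sketch for samples of the form (inputs, target)."""
    random.seed(seed)
    n_in = len(samples[0][0])
    # Each row carries a trailing bias/threshold weight.
    w1 = [[random.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w2 = [random.uniform(-1, 1) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for x, target in samples:
            xb = list(x) + [1.0]
            h = [sigmoid(sum(w * xi for w, xi in zip(row, xb))) for row in w1]
            hb = h + [1.0]
            y = sigmoid(sum(w * hi for w, hi in zip(w2, hb)))
            # Deltas: output layer first, then propagated back to the hidden layer.
            delta_out = (target - y) * y * (1.0 - y)
            delta_h = [delta_out * w2[j] * h[j] * (1.0 - h[j]) for j in range(n_hidden)]
            # Gradient-descent weight updates.
            w2 = [w + eta * delta_out * hi for w, hi in zip(w2, hb)]
            for j in range(n_hidden):
                w1[j] = [w + eta * delta_h[j] * xi for w, xi in zip(w1[j], xb)]
    return w1, w2

def predict(x, w1, w2):
    xb = list(x) + [1.0]
    hb = [sigmoid(sum(w * xi for w, xi in zip(row, xb))) for row in w1] + [1.0]
    return sigmoid(sum(w * hi for w, hi in zip(w2, hb)))

# Hypothetical training set (XOR), used only to illustrate the procedure.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
w1, w2 = train_backprop(data)
print([round(predict(x, w1, w2), 2) for x, _ in data])
```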

2.1.3.2 Self-Organizing Probabilistic Neural Network (SOPNN)

A probability-based approach has been investigated using an algorithm designed by Tseng (Tseng, 1991). The particular algorithm used is a Self-Organizing Probabilistic Neural Network, or SOPNN. This is based on the work of Specht (Specht, 1988), but differs in that a K-means clustering method is used to create a finite set of distribution centroids, producing a fixed-size network. The probability density function (pdf) of the system is then modeled by a weighted sum of non-covariant multivariate Gaussian probability distributions about the cluster centroids, with a common variance used to smooth the distribution between clusters:
P(x) = 1 / ((2π)^(d/2) σ^d n) · Σk nk exp( −(x − mk)(x − mk)ᵀ / (2σ²) )
where
n is the number of training points,
k is the number of clusters,
mk is the centroid of cluster k,
nk is the number of points in cluster k,
d is the dimension of the data space, and
σ is the smoothing variance.
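A minimal sketch of evaluating the SOPNN density estimate above at a query point; the cluster centroids, counts, and smoothing parameter are hypothetical.

```python
import math

def sopnn_pdf(x, centroids, counts, sigma):
    """Evaluate the density estimate: a count-weighted sum of isotropic Gaussians
    of common variance sigma**2 placed at the cluster centroids."""
    d = len(x)
    n = sum(counts)
    norm = (2.0 * math.pi) ** (d / 2.0) * sigma ** d * n
    total = 0.0
    for m_k, n_k in zip(centroids, counts):
        sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, m_k))
        total += n_k * math.exp(-sq_dist / (2.0 * sigma ** 2))
    return total / norm

# Hypothetical 2-D cluster centroids, cluster sizes, and smoothing parameter.
centroids = [(0.0, 0.0), (3.0, 1.0)]
counts = [40, 60]
print(sopnn_pdf((1.0, 0.5), centroids, counts, sigma=1.0))
```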

This method produces a joint distribution which can be decomposed into conditional distributions used for detection of both process failure and sensor failure. Kim et al. (Kim et al., 1992) have described a

methodology for validation of sensor input data useful in industrial power generation applications. This method requires some preprocessing of data to generate features which are then modeled with a joint pdf. Knowledge of sensor failure modes is coupled to conditional pdfs generated from this joint pdf to separate data deviations generated by process faults from those caused by sensor error. Explicitly reasoning about the sensor is an essential part of applying knowledge based techniques to on-line systems, and one which has been aided through the use of pdfs for process modeling. The SOPNN method should prove valuable for the integration of reasoning about sensor integrity in the manufacturing environment.

Fig. 2.1.3.2-1 is the neural representation of the SOPNN system. Each set of four circled network subnodes represents a cluster of data. The subnodes are used to separate the input and output spaces to create the proper conditional pdfs used in diagnosis. Node inputs are derived from the quadratic term in the above definition of the multivariate Gaussian pdf. The input data vector, X, is compared to each cluster centroid. Two thresholds are applied to this comparison: θi, the norms of the data clusters, and the norm of the input vector, ||X||². To form the joint distribution, these data sets are augmented by the one dimensional output variable, Y, and its appropriate thresholds. The output of the network is properly scaled as a post process to produce P(Y|X).

Fig. 2.1.3.2-1: SOPNN probabilistic-neural system architecture

2.1.3.3 Neural-Fuzzy Networks

The neural-fuzzy networks introduced here are an assembly of different neural nets whose outputs are given a meaning according to fuzzy logic. That is, the outputs of the neural nets are interpreted as membership values of fuzzy functions and of linguistic statements such as "high", "low", etc. The motivation to use this type of architecture stems from earlier attempts to perform pattern matching on

data from highly noisy environments using pure backpropagation neural networks. The limited success led to breaking down the learning task into learning patterns whose scope was restricted by preprocessing. A multi-network architecture was first proposed by Collins (Collins et al., 1988). The multiple neural network learning system employs an array of coupled subnets and a controller. Each subnet focuses on a non-exclusive subset of the full feature space. The advantages are efficiency and ease of training: the learning task is divided into obvious and fine discriminations (Collins et al., 1988). When comparing our neural-fuzzy network to the approach using just a neural network, the result is found to be much smoother and more precise. The combination of partially redundant information helps to arrive at this result.

The procedure to set up the neural-fuzzy network is divided into several steps (Takagi and Hayashi, 1991). First, data are processed to ease learning by establishing groups or clusters of distinctive data. The groups are then interpreted as antecedents of rules of the form "If the data belong to that cluster, then I can diagnose the system to be in a certain state". A dendrogram clustering method finds the clusters utilizing the centroid method. It does this by finding the two closest data points in a higher dimensional space and establishing their centroid, which thereafter is regarded as a data point itself. Then the next two data points are clustered. This procedure continues until all data are clustered into one group. The next step is to find the number of clusters necessary for the rules. There should be at least as many clusters as there are desired outputs, i.e., as many rules as there are states of the system for diagnosis purposes. If data from different output situations are found in one cluster, the respective data undergo another run through the algorithm until they fall into distinguishable clusters. This eventually determines the final number of rules. Note that if there are an equal number of clusters and states, there is no need (disregarding smoothing effects) to use a multi-network architecture because a single network will be able to perform the pattern matching task by itself. Clustering data this way has the effect of manually setting hyperplanes into place; a pure network has much more trouble overcoming these hyperplanes, as is shown later in section 4.1. Fig. 2.1.3.3-1 shows the clustering performed. Data of one class (C1) may fall into two clusters separated by members of another class (C2).


Fig. 2.1.3.3-1: Data of one class separated by data from another class

The system is subsequently trained with a perfect fit membership value (µ = 1) for the cluster a data set belongs to and no fit (µ = 0) otherwise. This part resembles the antecedent of a fuzzy rule

IF ( data set [x1, x2, ..., xk]^T is part of cluster ci ) is µi

where
µi is the membership value that is learned by this step
x1 - xk are the k different data gathered from the sensors (input pattern)
ci is one of the i clusters which were determined during the preprocessing stage

Fig. 2.1.3.3-2 shows the architecture for learning the antecedents of the fuzzy rule. This part also determines the number of rules, which is equal to the number of clusters ci. This network is called NNmem.
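The following sketch illustrates the preprocessing step described above: centroid-method (dendrogram) clustering of the training data and construction of the 0/1 membership targets for NNmem. The toy data, the use of scipy's hierarchical clustering routines, and the choice of three clusters are assumptions for illustration only:

# Illustrative sketch only: agglomerative clustering with centroid linkage, followed
# by construction of the membership targets (1 for the owning cluster, 0 otherwise).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(m, 0.3, (30, 4)) for m in (0.0, 1.5, 3.0)])  # k = 4 sensor features

Z = linkage(data, method="centroid")                    # dendrogram, centroid method
n_rules = 3                                             # one rule per cluster c_i
labels = fcluster(Z, t=n_rules, criterion="maxclust")   # cluster index in 1..n_rules

# NNmem training targets: membership 1 for the cluster a pattern belongs to, 0 otherwise
targets = np.zeros((len(data), n_rules))
targets[np.arange(len(data)), labels - 1] = 1.0
print(targets[:5])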

Fig. 2.1.3.3-2: Neural Network NNmem for learning membership of clusters

The update function for Δw_ij^M (M = output layer) is

Δw_ij^M = ∂(c_i − µi)² / ∂w_ij^M,   with c_i = 1 if the data are in cluster i and c_i = 0 otherwise

The consequents of the rules are learned in separate networks. The number of networks corresponds to the number of diagnosis instances chosen, e.g., "value is small", "value is medium", "value is large", etc. During training, the value "1" is assigned to the diagnosis category the input data belong to and "0" to the other categories. Links with weight "0" which do not contribute to a particular diagnosis output can be eliminated (Takagi and Hayashi, 1991). This categorization is interpreted as the consequent part of the rule and can be expressed as

( data set [x1, x2, ..., xk]^T is categorized as dm ) is ym

where
dm is the linguistic diagnosis, e.g., dm ∈ {value is small, value is medium, value is large}
ym is the scaled output value associated with dm, e.g., ym ∈ {0, 0.5, 1}

The overall diagnostic output is obtained by taking the weighted average of the membership values with the diagnostic value of each rule. The relation for the defuzzification for the diagnosis can be expressed as follows:

y_j^* = Σ_i µ(c_i) y(d_ij) / Σ_i µ(c_i)

where
y_j^*  is the overall numeric control value associated with output j
d_ij   is the diagnosis value of network j for rule i
µ(c_i) is the membership value of rule i

Fig. 2.1.3.3-3 shows the architecture of the consequent part for rule 1. This network is called NN1.

Fig. 2.1.3.3-3: Architecture of the consequent part for rule 1

The use of the consequent parts has the effect of smoothing the output: they combine partly redundant information into one justified value. Hence, spikes which appear in the pure neural network are avoided, and the categorization is also much more consistent. Learning is terminated for all nets when the error on a validation data set starts to increase, thus ensuring that overlearning is avoided. The system is then ready to run. Note that we deal with a network of neural nets instead of a single one. Its architecture is shown in fig. 2.1.3.3-4.

Fig. 2.1.3.3-4: Architecture of the neural-fuzzy system
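A minimal sketch of the combination stage of fig. 2.1.3.3-4 follows; it assumes the memberships µ(c_i) from NNmem and the diagnosis values y_ij from the consequent nets are already available (the numbers are invented) and applies the weighted-average defuzzification given above:

# Sketch of the defuzzification y*_j = sum_i mu(c_i) y_ij / sum_i mu(c_i).
import numpy as np

def defuzzify(memberships, consequent_outputs):
    """memberships: (n_rules,), consequent_outputs: (n_rules, n_outputs)."""
    mu = np.asarray(memberships, dtype=float)
    y = np.asarray(consequent_outputs, dtype=float)
    return (mu[:, None] * y).sum(axis=0) / mu.sum()

mu = np.array([0.7, 0.2, 0.1])              # outputs of NNmem for one input pattern
y = np.array([[0.0, 0.9],                   # NN_1: diagnosis values for 2 outputs
              [0.5, 0.4],                   # NN_2
              [1.0, 0.1]])                  # NN_3
print(defuzzify(mu, y))                     # smoothed diagnosis values y*_j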

2.1.4 Genetic Algorithms

The basic concepts of genetic algorithms (GA) were developed by Holland (Holland, 1975). The reader is referred to Goldberg (Goldberg, 1989) for a more thorough treatise on the subject. The following summary is adapted largely from Grefenstette (Grefenstette, 1990). Genetic algorithms are general purpose, task independent, adaptive search procedures. They are used as optimizers which iteratively maintain a "population" p(t) of candidate solutions to the objective function f(x):

p(t) = {x1(t), x2(t), ..., xN(t)}

where xi represents a vector of parameters to the function f(x). In this context xi is a binary string of some length. The actual meaning associated with the vector remains unknown to the GA. During each iteration step – or generation – the current population is evaluated. On the basis of that evaluation, a new population of candidate solutions is formed. The algorithm works as shown in fig. 2.1.4-1 (Grefenstette, 1990):


t = 0;
initialize p(t);
evaluate structures in p(t);
while termination condition not satisfied
    t = t + 1;
    select p(t) from p(t-1);
    recombine structures in p(t);
    evaluate structures in p(t);

Fig. 2.1.4-1: Nassi-Shneiderman diagram of the GA algorithm

The initial population p(0) is either chosen at random or heuristically. In any case, the initial population should contain a wide variety of structures. Next, each structure in the initial population is evaluated. If, for example, the task is to minimize a function f, the evaluation consists of computing and storing f(x1), ..., f(xN). The structures of the next population p(t+1) are chosen from the population p(t) by a randomized "selection procedure" that ensures that the expected number of times a structure is chosen is proportional to the performance of that structure relative to the rest of the population. If xj performs on average twice as well as the other structures in p(t), then xj is expected to appear twice in population p(t+1). At the end of the selection procedure, population p(t+1) contains exact duplicates of the selected structures in population p(t).

To allow finding other points in the search space, some variation is introduced into the new population through idealized "genetic recombination operators". The most important recombination operator is the "crossover" operator. For crossover, two structures in the new population exchange portions of their binary representations. A point (the crossover point) is chosen at random and the segments to the right of this point are exchanged. For example (Grefenstette, 1990), let x1 = 100:01010 and x2 = 010:10100 and suppose that the crossover point has been chosen as indicated by the colon. The resulting structures would be y1 = 100:10100 and y2 = 010:01010. Crossover serves two complementary purposes for searching. First, it provides new points for further testing within the schemata already present in the population. In the above example, both x1 and y1 are representatives of the schema 100#####, where the # means "don't care". Thus, by evaluating y1 the GA gathers further information about this schema. Second, crossover introduces representatives of new schemata into the population. In the above example, y2 is a representative of the schema #1001###, which is not represented by either "parent". If this schema represents a high-performance area of the search space, the evaluation of y2 will lead to further exploration in this part of the search space.

Termination is triggered either by finding an acceptable approximate solution to f(x), by reaching a fixed total number of evaluations, or by some other application dependent criterion. Theoretical considerations concerning the allocation of trials to schemata show that genetic techniques provide a highly efficient heuristic for information gathering in complex search spaces. A number of experimental studies have shown that GAs exhibit impressive efficiency in practice (Grefenstette, 1990). While classical gradient search techniques are more efficient for problems which satisfy tight constraints (e.g., continuity, low dimensionality, unimodality, etc.), GAs in many cases outperform both gradient techniques and various forms of random search on more difficult (and more common) problems, such as optimizations involving discontinuous, noisy, high dimensional, and multimodal objective functions. GAs have been applied to many domains, including numerical function optimization, adaptive control system design, and artificial intelligence task domains. In the context of this thesis, GAs are used to optimize parameters of the fuzzy validation and fusion procedure. The parameters of interest determine the shape of the membership functions as well as the shape of the validation region, as explained in chapter 2.3.2.2.
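The following toy sketch illustrates the GA mechanics summarized above – fitness-proportional (roulette-wheel) selection, one-point crossover, and occasional bit-flip mutation – on an arbitrary one-dimensional objective. It is not the optimizer applied to the FUSVAF parameters; the population size, string length, and rates are assumptions:

# Toy GA: selection proportional to relative fitness, one-point crossover, bit-flip mutation.
import random

BITS, POP, GENS = 16, 30, 60

def fitness(bits):
    # decode the bit string to [0, 1] and maximize f(x) = x * (1 - x)
    x = int("".join(map(str, bits)), 2) / (2**BITS - 1)
    return x * (1 - x)

def select(pop, fits):
    # roulette-wheel selection: expected copies proportional to relative fitness
    return random.choices(pop, weights=fits, k=len(pop))

def crossover(a, b):
    point = random.randrange(1, BITS)                 # one-point crossover
    return a[:point] + b[point:], b[:point] + a[point:]

random.seed(0)
pop = [[random.randint(0, 1) for _ in range(BITS)] for _ in range(POP)]
for _ in range(GENS):
    fits = [fitness(ind) for ind in pop]
    mates = select(pop, fits)
    nxt = []
    for i in range(0, POP, 2):
        c1, c2 = crossover(mates[i], mates[i + 1])
        nxt += [c1, c2]
    for ind in nxt:                                   # occasional bit-flip mutation
        if random.random() < 0.1:
            j = random.randrange(BITS); ind[j] ^= 1
    pop = nxt
print("best fitness found:", max(fitness(ind) for ind in pop))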


2.2 Sensors

Most generally, a sensor is a device that indicates characteristics (presence, absence, intensity, or degree) of some form of energy impinging on it. It responds to a physical stimulus and converts its input energy into electrical currents that can be used as signals for measurement or control purposes. In the following sections, the sensors used in the project are briefly introduced. These are the radar, sonar, optical, temperature, pressure, mass flow, vibration, and acoustic emission sensors. For more detail on sensors see Norton (Norton, 1982), Dalley (Dalley et al., 1984), Barney (Barney, 1988), and Fraden (Fraden, 1993).

Many sensors, such as the acoustic emission, vibration, sonar, and pressure sensors, make use of the piezoelectric effect. Generally, the piezoelectric sensor generates an electric charge when it is subjected to stress (Nachtigal, 1990). When an external force is applied, the hexagonal lattice of the crystal (often quartz, composed of silicon and oxygen) deforms. This shifts atoms in such a manner that a positive charge builds up on the silicon side and a negative charge on the oxygen side, so that opposite faces of the crystal develop charges of opposite polarity. To pick up this charge, conductive electrodes are applied to the opposite sides of the crystal, making the piezoelectric sensor a capacitor with a dielectric material. The dielectric acts as a generator of electric charge, resulting in a voltage across the capacitor. The piezoelectric effect is a reversible physical phenomenon, i.e., applying a voltage across the crystal produces mechanical strain. It is possible to use the crystal for both picking up and delivering charges by placing pairs of electrodes on the crystal. This method is used quite extensively in various piezoelectric transducers.

Piezoelectric Accelerometers for Measurement of Vibration and Acoustic Emission
Acceleration is sensed along the longitudinal axis of the sensor and acts on the seismic mass, which exerts a force on the piezoelectric crystal, which in turn produces an electric charge (Nachtigal, 1990). The quartz plates are located next to the seismic mass and are preloaded so that either an increase or a decrease in the force acting on the crystals due to acceleration in either direction causes changes in the charge produced by them. Depending on the type of crystal used as well as differences in the design of seismic mass and support, the sensitivity changes and makes the accelerometer suitable as a sensor for vibration or acoustic emission, with the former working in the range from 0 kHz to 40 kHz and the latter in the range from 50 kHz to several MHz.

Sonar sensor
Sound Navigation and Ranging (sonar) systems may be divided into three categories: 1.) active, 2.) passive, and 3.) acoustic communication systems. In active sonar systems an acoustic projector generates a sound wave that spreads outward and is reflected back by a target object. A receiver picks up and analyzes the reflected signal and may determine the range, bearing, and relative motion of the target. Passive systems consist simply of receiving sensors that pick up the noise produced by the target. Waveforms thus detected may be analyzed for identifying characteristics as well as direction and distance. The third category requires active components at both the sending and the receiving unit and is not considered here. Acoustic transducers utilize piezoelectric crystals (e.g., quartz or tourmaline). These materials change shape when subjected to electric or magnetic fields, thus converting electrical energy to acoustic energy. Suitably mounted in an oil-filled housing, they produce beams of acoustic energy over a wide range of frequencies. Usually the receiving and transmitting transducers are the same.

The attenuation coefficient, x, in Beer's law, as applied to sound, depends on the viscosity of the medium, increases with the frequency of the sound, and is inversely proportional to the density of the medium. High-pitched sounds are absorbed and converted to heat faster than low-pitched sounds. Sound velocity is determined by the square root of the elasticity divided by the medium's density. Since both the elasticity and the density of most media change with temperature, humidity, pollution, and pressure, so does the velocity of sound. The sonar sensor under consideration has a wavelength of about 0.003 m. Its working range in air is restricted to about 4 m.

Pressure Sensor
Piezoelectric pressure sensors use the force applied by a diaphragm to a stack of quartz crystals to measure pressure (Nachtigal, 1990). One side of the sensor has to be exposed to the medium under consideration. Other pressure sensors use the displacement of liquids (water, mercury) in tubes as a measure of pressure, where one side of the tube is exposed to the pressure in question.

Mass Flow
Many flow sensing methods are derived from volumetric measurements with simultaneous measurements of density (Tse and Morse, 1989). Volumetric sensing makes use of pressure and velocity. The latter can, for example, be determined by ultrasonic flow sensing, which uses one or more pairs of transducers to detect variations in a pipe. The principle is based on the sound propagation velocity through the medium: changes in wave travel velocity are an indication of a change in mass flow.

Radar sensor
Radar is an electromagnetic sensor used for detecting, locating, tracking, and identifying objects of various kinds. It operates by transmitting electromagnetic energy toward targets and observing the echoes returned from them, extracting the Doppler frequency shift of the echo (Nyfors and Vainikainen, 1989). What distinguishes radar from optical and infrared sensing devices is its ability to detect faraway objects under all weather conditions. The range accuracy of a simple pulse radar depends on the width of the pulse: the shorter the pulse, the better the accuracy. Short pulses, however, require wide bandwidths in the receiver and transmitter (since bandwidth is equal to the reciprocal of the pulse width). A radar with a pulse width of one microsecond can measure the range to an accuracy of a few tens of meters or better. Some special radars can measure to an accuracy of a few centimeters. The ultimate range accuracy of the best radars is limited not by the radar system itself, but rather by the accuracy to which the velocity of electromagnetic wave propagation is known. The calculation of range involves the velocity of the electromagnetic energy transmitted as well as the round-trip time. Almost all radars use a directive antenna, i.e., one that directs its energy in a narrow beam. The direction of a target can be found from the direction in which the antenna is pointing when the received echo is at a maximum. Radar is an "active" sensing device in that it has its own transmitter for locating targets. Radar typically operates in the microwave region of the electromagnetic spectrum at frequencies extending from about 400 MHz to 40 GHz. The particular radar sensor under consideration operates at 24 GHz and its range is approximately 30 m.
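For illustration, the pulse-radar range follows from the round-trip time as range = c·t/2; the delay value below is an arbitrary example chosen to land near the sensor's 30 m working range:

# Simple illustration of pulse-radar range from round-trip time, range = c * t / 2.
c = 299_792_458.0          # speed of light, m/s
t_round_trip = 0.2e-6      # measured echo delay, s (example value)
print("target range:", c * t_round_trip / 2, "m")   # about 30 m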

Optical sensor
The optical sensor uses an active source and active targets. The sensor design is based on the triangulation principle, where one infrared light source is installed on the target and two electro-optical cameras are positioned at the source. The triangular geometry gives a relationship between the distance measured and the current output from the detectors. The measuring range of the sensor is related to the focal length of the lens used, the transmission capability of the source, the offset between the two sensors, etc. The source wavelength is chosen to be as long as possible to enhance its transmission power and is modulated at 10 kHz (Qualimatrix). The light beam passes a light filter and a lens system before reaching the detectors. The outputs of the detectors are then demodulated and processed by a DSP board where band pass filters are implemented. The current outputs from the sensors are on the order of milliamps, so the signal-to-noise ratio is very crucial, and the capability of the optical and digital filtering limits the accuracy and measuring range achievable. Some sources of uncertainty are weather conditions like fog, rain, and dust, as well as sources from other leading targets. In some extreme cases, the sensor circuits may saturate (the specified saturation level for the sensor is 500 to 700 mA) and give no reading. The decision on whether the loss of sensor readout is due to an abnormal sensing condition or to a too close (or too far) distance between cars is facilitated by monitoring the DC signal. Setting a different modulation frequency for different leading targets (adjustable from 10 - 28 kHz) is one way to differentiate the sources from other leading targets. The one-detector-two-source configuration is more robust in separating the frequency disturbances. A decrease in light intensity and an increase of the bandwidth of the sensor tend to increase the noise level. Any extraneous particles in the air caused, e.g., by fog, dust, snow, and rain increase light scattering and decrease the transmission efficiency of the light source, which results in false sensor readings. The DC voltage is constantly measured and serves as a reference for the judgment to be made in these situations. With this setup, lateral offset can also be discovered by comparing the two voltages. The problem is that the target has to be active as well. For many applications this is not a feasible solution because either the target does not want to be detected or failure or neglect of the active source raises serious safety concerns.

Temperature
A thermocouple is a temperature-measuring device consisting of two wires of different metals joined at each end (Morris, 1988). One junction is placed where the temperature is to be measured, and the other is kept at a constant lower temperature. A measuring instrument is connected in the circuit. The temperature difference causes the development of an electromotive force (Seebeck effect) that is approximately proportional to the difference between the temperatures of the two junctions and is created by free electrons which diffuse through the junction of the two metals. Any two different metals or metal alloys exhibit the thermoelectric effect, but only a few are used as thermocouples, e.g., antimony and bismuth, copper and iron, or copper and constantan (a copper-nickel alloy). Usually platinum, either with rhodium or a platinum-rhodium alloy, is used in high-temperature thermocouples.


2.3 Validation and Fusion Techniques

Due to their uncertain nature, sensor data have to be validated for many applications where high data fidelity is required. These can be systems where human safety is at stake or where high cost is involved, and there is a tendency to automate such systems in industry as well as in everyday applications. This work will show applications from gas turbines, milling machines, and automated highway systems, all of which utilize sensor validation and fusion. In sensor validation, sensor data are assessed with respect to their integrity. Usually, a better value is offered as output of the validation system as well. Depending on which type of uncertainty is used for evaluation, the ranking is expressed, for example, in terms of probabilities or degrees of membership as confidence measures.

Oftentimes it is desirable to include redundancy in a system to make sure that a sensor value is available even when one sensor fails. This redundancy involves more than one sensor measuring the quantity of interest with either the same or different techniques. Redundancy can also be created by using indirect sensors which have some functional relationship to the quantity of interest. The problem is that two sensors almost never return the same measurements. Although averaging techniques can be employed to smooth out the effects of an aberrant reading, it would be desirable to know which sensor performs better than the other and to what degree. Furthermore, if the readings are far apart, averaging techniques may not be acceptable if the resulting action leaves the system in an unstable or unsafe state. Voting schemes attempt to eliminate bad readings but still face the fact that no reading is ever exactly correct. It is the task of sensor fusion to make sure that each sensor is rated according to its integrity and to output a reliable value. Validation and fusion often go hand in hand when the validation process becomes an integral part of the rating for fusion purposes.

The following sections will show techniques for sensor validation and sensor fusion which make use of the theoretical tools introduced in the previous section. Section 2.3.1 investigates probabilistic uncertainty and proposes a Kalman filter based solution. The underlying assumption is that the noise is always Gaussian distributed; Kalman filters are optimized for this type of noise. The application itself is the probabilistic data association filter (PDAF), a suboptimal process which performs better in the presence of clutter. Section 2.3.2 looks at fuzzy sensor validation and fusion. After a brief review in section 2.3.2.1, section 2.3.2.2 introduces a new algorithm for fuzzy sensor validation and fusion (FUSVAF). This algorithm makes no assumptions about the nature of the noise. The fusion algorithm uses the confidence values obtained for each sensor reading and performs a weighted average. With increasing distance from the predicted value, readings are discounted through a non-linear validation function. The predicted value uses an exponentially weighted moving average (EWMA) time series predictor which has adaptive coefficients. These coefficients are governed by fuzzy membership functions. The membership functions can be learned via machine learning techniques with experimental data. Simulations motivate the use of the rules chosen. An exhaustive comparison to Kalman filter based approaches is given using Monte Carlo simulations. A sensitivity analysis shows the robustness of the parameters chosen.
The chapter closes with recommendations for the use of fuzzy and probabilistic methods.


2.3.1 Probabilistic Sensor Validation and Fusion with PDAF

Sensor validation involves the estimation of the true state of a system, which is then compared to the measured sensor data. If measurements are taken from several sensors, the information has to be shared and combined. This pooling of information involves weighted averaging techniques of varying degrees of complexity (Berger, 1985, Durrant-Whyte, 1987, Manyika and Durrant-Whyte, 1994), e.g., a linear combination of posterior probabilities of information sources (Stone, 1961) where the weights sum to 1, and variations such as the independent opinion pool or the independent likelihood pool. The latter two methods model more accurately the multi-sensor system where the conditional contributions of the observations can be shown to be independent. In probabilistic fusion, both a priori and a posteriori information can be used. A priori information assigns weights which are inversely proportional to the noise covariance, while a posteriori information considers the distance of the observation to the predicted value, with distance measures based on the likelihood ratio and its derivatives (Alag, 1996).

Estimation of the state of a moving object is called "tracking", which involves the generation of state trajectories estimated from the measurements. Tracking makes use of data association, which associates a given data point with possibly several tracks. This is especially applicable to radar tracking of airplanes, where several data have to be associated with particular targets. This becomes difficult when uncertainty is involved, which in turn is due to random false alarms in the detection process, clutter due to spurious reflections, interfering targets, etc. Filtering noise out of data from a dynamic system is closely tied to estimation. The essence of estimation is to generate information whose quality is higher than that of the raw measurements and which contains information that is not directly available in the measurements. The main tracking algorithms are: 1.) the Kalman Filter, 2.) the α−β Filter, 3.) the Extended Kalman Filter, and 4.) the Multiple Model Estimator. The fundamental assumptions of the Kalman Filter are that the linear plant and measurement equations are known and that both the process noise and the measurement noise are zero-mean white Gaussian with known covariances. These assumptions make this approach vulnerable to auto- and cross-correlated noise, model changes, and nonlinear motion models, among others (although there are extensions which tackle these issues). The Extended Kalman Filter linearizes the nonlinear functions, which results in a suboptimal state estimation algorithm. It is sensitive to the accuracy of the initial conditions as well as the covariance. However, if the initial errors and the noises are not too large, then the Extended Kalman Filter performs well.

A filter for measurement association or data correlation is the Probabilistic Data Association Filter (PDAF) (Bar-Shalom and Li, 1995, Fortmann et al., 1983). It calculates the association probabilities of each validated measurement at the current time to the target of interest. This probabilistic information accounts for the measurement origin uncertainty. The PDAF is a suboptimal Bayesian algorithm; however, in the presence of clutter, which can result from returns of nearby objects, weather, electromagnetic interference, false alarms, acoustic anomalies, etc., it performs more effectively than the standard Kalman filter.
The PDAF is motivated by the tracking of airplanes on radar screens. Therefore, the terminology evolved mainly to signify the prevalent events in this environment. It is shown later that the PDAF can also be used in different applications where this terminology does not necessarily make sense. However, for historic reasons and consistency, the terminology will be carried over to other applications as well. Another important difference between the original development and other uses is that in the original the set of validated measurements consists of correct and incorrect measurements. In the application shown here, however, validated measurements come exclusively from one target, with several sensors measuring the quantity of interest. In this work, it is assumed that the measurements are from the same target. It will be shown that this difference in assumptions poses no problem for the application of this

technique. Alag (Alag, 1996) also addressed this issue and developed a modified PDAF algorithm which removes the assumption that only one measurement is correct.

Kalman Filtering
We begin by reviewing the principle of Kalman filtering which is used in the validation process (Bar-Shalom and Fortmann, 1988, Chui and Chen, 1991b, Bar-Shalom, 1990, Grewal and Andrews, 1993). Consider a discrete time dynamic system described by

x(k+1) = F(k) x(k) + G(k) u(k) + v(k)

where
x(k) is the state at time k
u(k) is the (known) input or control signal
v(k) is a sequence of zero-mean, white, Gaussian process noise with covariance Q(k)
F is the system model
G is the gain through which the input is multiplied

A number of sensors i = 1, ..., m are considered to take observations zi(k) of the state according to the observation equation

z(k) = H(k) x(k) + w(k)

where
z(k) = [z1(k), ..., zm(k)]^T is the stacked observation vector
w(k) is a sequence of zero-mean, white, Gaussian measurement noise with covariance R(k)
H is the observation model

The initial state is assumed to be Gaussian with mean x̂(0|0) and covariance P(0|0). The two noise sequences and the initial state are assumed to be independent, i.e., we assume

E[w(k)] = E[v(k)] = 0
E[w(k) w(j)^T] = R(k) δ_kj
E[v(k) v(j)^T] = Q(k) δ_kj
E[w(k) v(j)^T] = 0

For the above system the Kalman filter provides a recursive solution for the estimate x̂(k|k) of the state x(k) in terms of the estimate x̂(k−1|k−1) and the new measurements z(k). The one step prediction and the state update are

x̂(k+1|k) = F(k) x̂(k|k) + G(k) u(k)
x̂(k+1|k+1) = x̂(k+1|k) + W(k+1) ν(k+1)
ν(k+1) = z(k+1) − H(k+1) x̂(k+1|k)

ν(k+1) is called the innovation or measurement residual. The filter gain W(k+1) is

W(k+1) = P(k+1|k) H^T(k+1) S^(−1)(k+1)

where
P(k+1|k) is the one step prediction covariance
S(k+1) is the measurement prediction covariance

P(k+1|k) = E[x̃(k+1|k) x̃^T(k+1|k) | z(1), ..., z(k)] = F(k) P(k|k) F^T(k) + Q(k)
x̃(k+1|k) = x(k+1) − x̂(k+1|k) = F(k) x̃(k|k) + v(k)

The measurement prediction covariance is

S(k+1) = E[z̃(k+1|k) z̃^T(k+1|k) | z(1), ..., z(k)] = H(k+1) P(k+1|k) H^T(k+1) + R(k+1)
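The following one-dimensional sketch implements the recursion above with scalar F, H, Q, and R (values chosen arbitrarily); it is meant only to make the prediction–gain–update cycle concrete, not to reproduce the filters used later in this chapter:

# Scalar Kalman filter sketch following the prediction / gain / update equations above.
import numpy as np

F, H = 1.0, 1.0
Q, R = 0.02**2, 0.5
rng = np.random.default_rng(0)

x_true, x_hat, P = 0.0, 0.0, 1.0
for k in range(100):
    x_true = F * x_true + rng.normal(0, np.sqrt(Q))      # system evolution
    z = H * x_true + rng.normal(0, np.sqrt(R))           # noisy measurement

    x_pred = F * x_hat                                   # state prediction (no input, u = 0)
    P_pred = F * P * F + Q                               # prediction covariance
    S = H * P_pred * H + R                               # innovation covariance
    W = P_pred * H / S                                   # filter gain
    nu = z - H * x_pred                                  # innovation
    x_hat = x_pred + W * nu                              # updated state estimate
    P = P_pred - W * S * W                               # updated covariance

print("final estimate vs. true state:", x_hat, x_true)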

PDAF
The fundamental assumptions of the PDAF which make it possible to obtain a state estimation scheme as simple as the Kalman filter are:
• There is only one target of interest.
• The filter has been initialized.
• The past information about the target is summarized approximately by

p[x(k) | Z^(k−1)] = N[x(k); x̂(k|k−1), P(k|k−1)]

where Z^(k−1) is the set of validated measurements up to time k−1, Z(k) = {z_i(k)}, i = 1, ..., m_k, z_i(k) is the i-th validated measurement, and m_k (which is also a random variable) is the number of measurements in the validation region.
• At each time step a validation region is set up.
• If a measurement fell into the validation region, the measurement may or may not originate from the target. This assumption differs from the original assumption of the PDAF, which deemed only one measurement to be correct. This made sense because there was only one device taking the measurements. If there was more than one observation in the vicinity of the expected reading, any one could potentially be right – but only one actually came from the target. The rest were readings due to false alarms and clutter. For an application where several sensors observe the same target, the assumption that at most one reading is correct is no longer practicable. It is therefore dropped. That does not imply that all readings are correct at all times. Measurements can still be corrupted and originate from other targets. Hence all that can be said is that any number of the observations may be right (if they fall inside the validation region).
• The remaining measurements – the ones which fall outside the validation region – are assumed to be due to outliers or some other kind of sensor malfunction.
• The target detection occurs independently over time with known probability P_D.

The validation region is the elliptical region

V(k; γ) = { z : [z − ẑ(k|k−1)]^T S^(−1)(k) [z − ẑ(k|k−1)] ≤ γ }

where
γ is the gate threshold
S(k) = H(k) P(k|k−1) H^T(k) + R(k) is the covariance of the innovation corresponding to the true measurement z(k)

The association events θ_i(k) are mutually exclusive and collectively exhaustive for m(k) ≥ 1:

θ_i(k) = { z_i(k) is a target originated measurement },  i = 1, ..., m(k)
θ_0(k) = { none of the measurements is target originated }

Using the total probability theorem w.r.t. the above events, the conditional mean of the state at time k can be written as

x̂(k|k) = E[x(k) | Z^k]
       = Σ_{i=0}^{m(k)} E[x(k) | θ_i(k), Z^k] P{θ_i(k) | Z^k}
       = Σ_{i=0}^{m(k)} x̂_i(k|k) β_i(k)

where
x̂_i(k|k) is the updated state conditioned on the event that the i-th validated measurement is correct
β_i(k) = P{θ_i(k) | Z^k} is the conditional probability of this event (the association probability)

The estimate conditioned on measurement i being correct is

x̂_i(k|k) = x̂(k|k−1) + W(k) ν_i(k),  i = 1, ..., m_k

where
ν_i(k) = z_i(k) − ẑ(k|k−1) is the corresponding innovation
W(k) = P(k|k−1) H^T(k) S^(−1)(k) is the standard filter gain

For i = 0, i.e., if none of the measurements is correct (or there is no validated measurement, i.e., m(k) = 0), the estimate is x̂_0(k|k) = x̂(k|k−1). That is, the estimate of the previous time step becomes the new estimate. With the above equations the updated PDAF estimate becomes

x̂(k|k) = x̂(k|k−1) + W(k) ν(k)

where

ν(k) = Σ_{i=1}^{m(k)} β_i(k) ν_i(k)

is the combined innovation.

Unlike the Kalman filter, where the covariance equation is independent of the measurements, the estimation accuracy of the PDAF depends upon the data actually encountered. The error covariance is

P(k|k) = β_0(k) P(k|k−1) + [1 − β_0(k)] P^c(k|k) + P̃(k)

where
P^c(k|k) = P(k|k−1) − W(k) S(k) W^T(k) is the covariance of the updated state
β_0(k) is the probability that no measurement is target originated

The term β_0(k) P(k|k−1) reflects that no update is performed when no measurement is target originated, the term [1 − β_0(k)] P^c(k|k) corresponds to the update with the correct measurement, and the spread-of-the-innovations term P̃(k) accounts for the increase of the covariance since it is not known which of the m(k) validated measurements is target originated:

P̃(k) = W(k) [ Σ_{i=1}^{m(k)} β_i(k) ν_i(k) ν_i^T(k) − ν(k) ν^T(k) ] W^T(k)

Probabilistic inference is performed on the number of measurements in the validation region and their locations, i.e.,

β_i(k) = P{θ_i(k) | Z^k} = P{θ_i(k) | Z(k), m(k), Z^(k−1)},  i = 0, 1, ..., m(k)

Using Bayes' rule,

p(x | Z^k) = p(Z^k | x) p(x) / p(Z^k)

we have

β_i(k) = (1/c) p[Z(k) | θ_i(k), m(k), Z^(k−1)] P{θ_i(k) | m(k), Z^(k−1)},  i = 0, 1, ..., m(k)

where the joint density of the validated measurements conditioned on θ_i(k) is the product of the (assumed) Gaussian pdf of the correct (target originated) measurement and the pdf of the incorrect measurements, assumed uniform in the validation region. Since the events are mutually exclusive and collectively exhaustive,

Σ_{i=0}^{m(k)} β_i(k) = 1

Using the Poisson clutter model, the association probabilities are given by

β_i(k) = e_i / (b + Σ_{j=1}^{m(k)} e_j),  i = 1, ..., m(k)
β_0(k) = b / (b + Σ_{j=1}^{m(k)} e_j)

where
e_i = exp( −½ ν_i^T(k) S^(−1)(k) ν_i(k) )
b = λ V(k) c_nz^(−1) (2π/γ)^(nz/2) (1 − P_D P_G) / P_D
P_G is the probability that the correct measurement falls in the validation region
P_D is the probability that the true measurement is detected at all
nz is the dimension of the measurement z
c_nz is the volume of the nz-dimensional unit hypersphere (c1 = 2, c2 = π, c3 = 4π/3, etc.)
β_0(k) is the probability that no measurement is target originated
λ V(k) = m(k)

Assuming a Poisson density, the probabilities of the events conditioned only on the number of validated measurements are

γ_i[m(k)] = P_D P_G / { m(k) [ P_D P_G + (1 − P_D P_G) µ_F[m(k)] / µ_F[m(k)−1] ] },  i = 1, ..., m(k)
γ_0[m(k)] = { (1 − P_D P_G) µ_F[m(k)] / µ_F[m(k)−1] } / { P_D P_G + (1 − P_D P_G) µ_F[m(k)] / µ_F[m(k)−1] }

where

µ_F(m) = e^(−λV) (λV)^m / m!

is the probability mass function of the number of false measurements in the validation region for the Poisson model, V is the volume of the elliptical validation region, and λ is the spatial density of the Poisson model. The discrimination capability of the PDAF relies on the difference between the Gaussian and the uniform densities.
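As a concrete illustration, the sketch below computes the association probabilities β_i(k) and the combined innovation for a scalar measurement (nz = 1, c1 = 2) using the nonparametric substitution λV(k) = m(k); the gate threshold, P_D, P_G, and the sample innovations are assumptions:

# Sketch of the PDAF association probabilities for a scalar measurement.
import numpy as np

def pdaf_betas(innovations, S, gamma=6.0, PD=0.9, PG=0.959):
    nu = np.asarray(innovations, dtype=float)
    d2 = nu**2 / S                                   # normalized innovations nu^T S^-1 nu
    nu, d2 = nu[d2 <= gamma], d2[d2 <= gamma]        # keep only gated (validated) readings
    m = len(nu)
    assert m > 0, "no validated measurements"
    lam_V = m                                        # nonparametric model: lambda * V = m(k)
    e = np.exp(-0.5 * d2)
    b = lam_V / 2.0 * (2*np.pi/gamma)**0.5 * (1 - PD*PG) / PD   # c_1 = 2
    denom = b + e.sum()
    betas = e / denom                                # beta_i, i = 1..m
    beta0 = b / denom                                # beta_0: none correct
    combined = (betas * nu).sum()                    # combined innovation nu(k)
    return beta0, betas, combined

print(pdaf_betas([0.3, -1.1, 2.6], S=0.6))           # the third reading falls outside the gate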

Simulations
We illustrate the probabilistic validation and fusion by means of an example motivated by a tracking application. Throughout the example the same sensor measurements as displayed in fig. 2.3.1-1 will be used. For the simulations, a sampling time of 0.02 seconds was used. The covariance of the process was taken as Q = 0.02². Three sensors are used, with data generated as follows. The covariance of the sensor noise was taken as R = 0.5 for all three sensors for the first fifty samples. The covariance of the sensor noise for sensor 1 was increased to R = 1 for samples 50 to 80 (1 sec - 1.6 sec). The same was done for sensors 2 and 3 for samples 60 to 80 (1.2 sec - 1.6 sec) and samples 70 to 80 (1.4 sec - 1.6 sec), respectively. The process covariance was changed to Q = 0.04² for samples 80 to 90 (1.6 sec - 1.8 sec) and to Q = 0.01² from 90 to 100 (1.8 sec - 2 sec). A sensor bias of 0.25 meters was introduced in sensor 1 from sample 100 to 150 (2 sec - 3 sec). As can be seen, the performance of the fused values does not drop significantly over the observed interval. This shows the robustness of the method to modeled and unmodeled disturbances. The outputs of the three sensors and the actual distance are shown in fig. 2.3.1-1.

Fig. 2.3.1-1: The outputs from the three sensors and the actual value of the process used in example 1

Fig. 2.3.1-2 shows the fused estimate along with the actual value of the process and an estimate obtained by averaging the values of the three sensors which were shown in fig. 2.3.1-1. As can be seen clearly,

the fused estimate follows the actual process value very well, in spite of unmodeled disturbances and changes in the process.

Fig. 2.3.1-2: Fused estimate, average value from the sensors, and actual process value

Fig. 2.3.1-3 shows the normalized innovations for the three sensors. Here, a validation gate corresponding to a confidence of 95.9% (innovation should be less than 6) was used for the sensor validation process. Where the sensor measurements exhibit large jumps, the corresponding innovations go up accordingly.

Fig. 2.3.1-3: Normalized innovations for the sensors used in the validation process

The probabilities behave inversely to the innovations: when the innovations indicate a reading outside the validation gate – for confidence 95.9% at value 6 – the probability drops to zero. Similarly, where the innovations are small, the probabilities rise. This can be seen in fig. 2.3.1-4, where the probabilities are displayed. Compare this figure to fig. 2.3.1-3, which showed the innovations for the sensors, to observe the described effect.

Fig. 2.3.1-4: Probabilities with which sensor values were fused

To illustrate the sensor bias detection methodology, recall that a bias of -0.25 meter was introduced in the readings of sensor 1 from samples 101 to 150 (2 sec - 3 sec). Fig. 2.3.1-5a shows the residue (the difference between the sensor output and the fused estimate) for sensor 1 for the first 100 samples (0 sec - 2 sec), while fig. 2.3.1-5b shows the sensor 1 residue for the remaining samples. As stated earlier, in the absence of sensor bias the sensor residue should ideally be zero. An estimate for the sensor bias can be obtained from the mean of the sensor residue. Since the sensor 1 readings (up to sample 100) and the sensor 2 and 3 readings were simulated so as not to have a bias, the means of their residues should be close to zero. The mean is -0.0867 for the first 100 readings (0 sec - 2 sec) for sensor 1 and changes to -0.3246 for the next 50 readings (for which a bias of -0.25 was introduced). The means of the sensor residues for sensors 2 and 3 are -0.0319 and 0.0055, which are close to 0 as expected. Fig. 2.3.1-5 shows the sensor residue for sensor 1, while fig. 2.3.1-6 shows the sensor residues for sensors 2 and 3.
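A tiny numerical illustration of this bias estimate, using made-up residues rather than the simulation data above:

# The sensor bias is approximated by the mean of the residue (sensor output minus fused estimate).
import numpy as np

rng = np.random.default_rng(0)
fused = np.linspace(1.0, 3.0, 50)
sensor = fused + rng.normal(0, 0.2, 50) - 0.25    # reading with a -0.25 m bias
residue = sensor - fused
print("estimated bias:", residue.mean())          # close to -0.25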

Fig. 2.3.1-5: Sensor residue for sensor 1

Fig. 2.3.1-6: Sensor residue for sensors 2 and 3

Summary and Conclusions
This section shows how the Probabilistic Data Association Filter can be used to validate and fuse data which are assumed to originate from the same source. Only data falling outside the validation region are discarded as being too far away from the expected value and are therefore not considered in the fused value. Inside the validation region, probabilities are assigned to the sensor values according to a Gaussian distribution. The probabilistic information accounts for the measurement uncertainty which can result from sensor noise or various kinds of environmental interference. The PDAF is based on the Kalman filter but abandons optimality to achieve better performance in the presence of clutter. Several examples show the performance of the PDAF in a variety of operating conditions.


2.3.2 Fuzzy Sensor Validation and Fusion

Section 2.3.2.1 gives a brief review of techniques used for fuzzy validation and fusion. Section 2.3.2.2 introduces the new algorithm for fuzzy sensor validation and fusion (FUSVAF). The fusion algorithm uses confidence values obtained for each sensor reading from validation curves and performs a weighted average fusion. With increasing distance from the predicted value, readings are discounted through a non-linear validation function. The predicted value uses an exponentially weighted moving average (EWMA) time series predictor which has adaptive coefficients. These coefficients are governed by fuzzy membership functions. The membership functions are learned via machine learning techniques using genetic algorithms (GA) with experimental data as training data. Simulations motivate the use of the fuzzy rules chosen. An exhaustive comparison to Kalman filter based approaches is given using Monte Carlo simulations. A sensitivity analysis shows the robustness of the parameters chosen. The chapter closes with recommendations for the use of fuzzy and probabilistic methods.
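The following rough sketch shows an EWMA predictor with an adaptive coefficient; a crisp large-change rule stands in for the fuzzy membership functions of the FUSVAF scheme, whose learned shapes are described in section 2.3.2.2 and are not reproduced here (the threshold and coefficient values are assumptions):

# EWMA predictor with a coefficient that adapts to the observed change.
def ewma_predict(readings, alpha_lo=0.2, alpha_hi=0.8, jump=0.5):
    pred = readings[0]
    for z in readings[1:]:
        # adapt the smoothing coefficient to the observed change (placeholder for fuzzy rules)
        alpha = alpha_hi if abs(z - pred) > jump else alpha_lo
        pred = alpha * z + (1 - alpha) * pred
    return pred

print(ewma_predict([2.0, 2.1, 2.05, 3.2, 3.3]))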

2.3.2.1 A Review on Techniques for Fuzzy Sensor Validation and Fusion

Fuzzy sensor validation and fusion often go hand in hand. Most of the techniques attempt to assign some kind of confidence value to the sensor under consideration. The fusion then takes the confidence of the sensor reading into account, either by discarding readings with lesser confidence, by applying rules as to which sensor to trust under certain circumstances, or by performing an averaging method. Aguilar-Crespo (Aguilar-Crespo et al., 1992) looks at the equality of sensor values. Sensor characteristics are extracted by creating a possibilistic distribution using the histogram method, which is interpreted as the fuzzy definition of normality. Assuming there are two sensors measuring the same quantity, four confidence values are obtained: two for the individual sensors and two for the pairwise combinations. The individual confidence values are obtained by comparing the current behavior of the sensor with the past values. Pairwise confidence values are obtained through the comparison of the two sensors and the evaluation of their coincidence rate. If the readings are similar, the confidence is high; otherwise it will be low (Aguilar-Crespo et al., 1992). The rules are designed as follows:

IF readings from sensor_1 and sensor_2 similar THEN confidence high
IF readings from sensor_1 and sensor_2 different THEN confidence low
etc.

The fusion creates a single measurement value whose confidence is assembled from the confidences and weights of each measurement:

σ_f = w_1 σ_1² + w_2 σ_2² + ... + w_n σ_n²

where
σ_f is the fused confidence value
σ_i are the confidence values obtained
w_i are the weights applied to each confidence value (0