Utilizing data mining algorithms for identification and reconstruction of sensor faults: a Thermal Power Plant case study
Christina Athanasopoulou, Vasilis Chatziathanasiou, and Ioannis Petridis
Abstract— This paper describes a procedure for identifying sensor faults and reconstructing the erroneous measurements. Data mining algorithms are successfully applied to derive models that estimate the value of one variable from other, correlated variables. The estimated values can then be used in place of the recorded values of a measuring instrument with a false reading. The aim is to ensure the correctness of the data fed to an optimization software application under development for the Thermal Power Plants of Western Macedonia, Greece.
Index Terms— Data mining, sensor, power generation.
I. INTRODUCTION
Deregulation and privatization are requiring more efficient and cost-effective electric power generation. As the need for reliable operation of a power plant becomes ever more urgent, it is difficult for a human operator to detect and cope with operational problems in real time. To this end, an optimization software application for the Thermal Power Plants (TPP) of the Public Power Corporation (PPC) in Western Macedonia, Greece is under development. Part of this project concerns the identification and reconstruction of sensor faults.

This paper describes a procedure for identifying sensor faults and reconstructing the erroneous measurements, which is implemented through the design and development of a software application. Data mining (dm) algorithms are applied to derive models that estimate the value of one variable from other, correlated variables. The estimated values can then be used in place of the recorded values of a measuring instrument with a false reading.

Various methods for sensor validation are available, but in practice they have serious drawbacks: they place high demands on human expertise and computer resources and, furthermore, they are time consuming. This holds even truer when considering that these methods are intended to comprise only the first part of the overall application. In the current project the decision was taken to base the validation on rules derived from the measuring equipment requirements, the plant operation specifications and the personnel experience. The main goal is to ensure that the data passed to the following stages of the optimization software are correct and that the production of false alarms is minimized. The latter aims at preventing the repeated display of false alarms from causing disorientation or even scorn.

Christina Athanasopoulou and Vasilis Chatziathanasiou are with the Electrical and Computer Engineering Department, Aristotle University of Thessaloniki, Greece, GR 54124, P.O. Box 486 (phone: 00306946507550; email: athanasc@eng.auth.gr, [email protected]). Ioannis Petridis is with the Public Power Corporation S.A. Greece ([email protected]).

As far as sensor value estimation is concerned, a considerable number of papers is available. The authors believe that, contrary to sensor validation, this task is worth the time and the computer and personnel resources it demands. The most significant contribution of the proposed system is the replacement of the recorded values of an erroneous instrument measurement with values estimated by models derived by applying data mining algorithms to historical data.

Sensor validation has been the subject of a vast number of papers in the literature; some of them are indicatively mentioned below. Typical approaches for detecting incorrect sensor readings include hardware redundancy with majority voting, analytical redundancy, and temporal redundancy [1]. The difficulties that each of them presents motivated further research in this field, especially using Artificial Intelligence techniques. Paper [2] references some of them, which mainly apply neural networks and fuzzy logic. It also introduces a novel theory and algorithms for information validation based on Bayesian networks, using probabilistic propagation to estimate the expected values of variables. Reference [3] presents a methodology for intelligent sensor measurement validation, fusion and sensor fault detection for equipment monitoring and diagnostics.
It comprises four steps and, according to the authors, is able to detect multiple simultaneous failures and, moreover, to distinguish between a sensor failure and a system failure in complex systems.

Most of the available theoretical frameworks are expensive or too complicated to apply, and very few of them have been adopted by industry. When the authors of [4] sought industrial applications of the sensor validation methodologies available in the literature, they found that it was difficult to persuade the power plant personnel to apply any of them due to the lack of previous successful cases. So they stopped short at selective averaging of signals and a thermodynamic condition check.

Since many proposed methods for sensor validation involve comparing a predicted value with the available measurement of each sensor, fault reconstruction can be conveniently achieved with small additions. In [5] Principal Components Analysis and Partial Least Squares (i.e. projection to latent structures) were applied both for the identification of process and sensor faults and for sensor value estimation. In other cases, fault signal replacement is not considered at all, whereas in several research papers it is included as a separate second part.

Inferential sensing, which uses correlation information for sensor estimation, has been proposed [6]. Auto-associative neural networks, kernel regression and its variations, namely multivariate state estimation techniques (MSET) and support vector regression (SVR) models, have been employed based on this idea [6]. Paper [7] describes the design of an auto-associative neural network (AANN) based sensor validation and fault detection system for a power generation system. It aimed to act as a sensor calibration monitoring system that provides continuous sensor status information and virtual estimates for faulty sensors. In [8] a back-propagation network is used to estimate variables from other variables, whereas in [9] multilayer perceptrons are used to estimate variables based on their most recent values.

In Section 2 the proposed procedure for sensor validation is outlined. In Section 3 the methodology for sensor estimation is described; the applied data mining algorithms and attribute selection filters are listed and tables with representative results are given. The conclusions are stated in Section 4.
Finally, Section 5 describes our future plans.

II. SENSOR VALIDATION

For the current application it was decided to follow a straightforward approach consisting of the following steps:
1) The various sensor faults were recognized and categorized based on the graphs of their data. A knowledge engineering methodology, CommonKADS [10], was applied in order to convert the tacit knowledge of plant engineers and operators into explicit rules.
2) The anomalies (faulty data) were then used for the sensor validation.
3) The error-free data were used as input to the dm algorithms for deriving the estimation models.

The installed instrumentation and control equipment vary significantly among TPPs. As demonstrated in Fig. 1, input data may range from plain raw data to signals that are filtered by microprocessors or even advanced functions. The project aims at developing a common methodology that will be tailored to meet the needs of PPC TPPs. However, it will take advantage of any majority voting procedures already available. For instance, in many plants such functionality exists for some temperatures that are important for safety and/or performance reasons.

The specifications of the power plant regarding the validity of the values of the system's variables differ according to the operating mode. A clear distinction is made as to whether the status is start-up, shut-down, steady-state, lifting-load, lower-load, etc. For instance, the plant's engineers indicated that in steady-state operation there are three cases in which temperature, pressure or flow measurements should be considered invalid: 1) stuck-at signals, 2) zero values, and 3) values outside the limits of the measurement instruments. This assumption is based on the specifications of the signal transmitters and measurement instruments used in the power plant. Biased measurements are minimized by quality assurance. Apart from false sensor readings, there are two types of errors that are critical for validation: 1) jumps and 2) values out of the range for the specific operating mode.

The challenge at this part of the procedure is the adaptation of the threshold of each variable based on previous cases. Effort is spent in this direction, as a successful adaptation is expected to increase the robustness of the validation process. Choosing too low a threshold increases the rate of false alarms, while choosing it too high increases the time to fault detection.

Fig. 1 Data acquisition and validation. [Figure labels: Power plant sensors; Raw data; Microprocessors; DCS function; Advanced; Control monitoring system; Original Database; Data validation; Preliminary Database.]
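The steady-state validity rules named in this section (stuck-at signals, zero values, readings outside the instrument limits, and jumps) can be illustrated with a minimal Python sketch. It is not the code of the application described here: the `SensorLimits` fields, the label names and all numeric thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class SensorLimits:
    """Hypothetical per-sensor settings; all values are illustrative."""
    low: float          # lower limit of the measuring instrument
    high: float         # upper limit of the measuring instrument
    max_jump: float     # largest plausible change relative to the last valid reading
    stuck_samples: int  # identical consecutive readings that count as "stuck"


def validate(readings: Sequence[float], lim: SensorLimits) -> List[str]:
    """Label each reading 'ok', 'zero', 'out_of_limits', 'jump' or 'stuck'."""
    labels: List[str] = []
    run = 1                             # length of the current run of identical readings
    last_valid: Optional[float] = None  # last reading that passed every rule
    for i, x in enumerate(readings):
        run = run + 1 if i > 0 and x == readings[i - 1] else 1
        if x == 0.0:
            label = "zero"
        elif not (lim.low <= x <= lim.high):
            label = "out_of_limits"
        elif last_valid is not None and abs(x - last_valid) > lim.max_jump:
            label = "jump"
        elif run >= lim.stuck_samples:
            label = "stuck"
        else:
            label = "ok"
        if label == "ok":
            last_valid = x
        labels.append(label)
    return labels


# Illustrative use with made-up readings of a steam temperature in °C.
limits = SensorLimits(low=400.0, high=600.0, max_jump=5.0, stuck_samples=3)
labels = validate([535.0, 535.4, 0.0, 650.0, 560.0, 535.9, 535.9, 535.9], limits)
```

In this sketch `max_jump` and `stuck_samples` play the role of the thresholds discussed above: lowering them flags more readings (more false alarms), raising them delays detection. Because jumps are measured against the last valid reading, a genuine change of operating level would keep being flagged, which is one reason such thresholds would have to depend on the operating mode.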
The same rules that are applied for cleaning the data are embedded in the software application. On-line validation of measurements will be performed both periodically and on demand by other parts of the overall application. Special care was taken to write the code in such a way that the rules can easily be replaced by new ones, should this be needed in the future.

III. SENSOR'S VALUE ESTIMATION
The estimation of a sensor's value includes data preprocessing, application of dm algorithms, and results evaluation/model selection, as depicted in Fig. 2.
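$$\text{Mean absolute error} = \frac{|p_1 - a_1| + \dots + |p_n - a_n|}{n} \tag{1}$$

$$\text{Relative absolute error} = \frac{|p_1 - a_1| + \dots + |p_n - a_n|}{|a_1 - \bar{a}| + \dots + |a_n - \bar{a}|} \tag{2}$$

$$\text{Correlation coefficient} = \frac{S_{PA}}{\sqrt{S_P S_A}} \tag{3}$$

where

$$S_{PA} = \frac{\sum_i (p_i - \bar{p})(a_i - \bar{a})}{n - 1}, \quad S_P = \frac{\sum_i (p_i - \bar{p})^2}{n - 1}, \quad S_A = \frac{\sum_i (a_i - \bar{a})^2}{n - 1},$$

and $p_1, \dots, p_n$ and $a_1, \dots, a_n$ represent the predicted and the actual values of the instances 1 to n, respectively, with $\bar{p}$ and $\bar{a}$ their means.

Fig. 2 Off-line preprocessing and processing of data. [Figure labels: Preliminary Database; Data Mining (data preprocessing, classification algorithms, visualization, evaluation); Models.]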
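For readers who want to reproduce the three evaluation measures outside WEKA, the following Python sketch computes the mean absolute error, the relative absolute error and the correlation coefficient as defined in (1)-(3). The function name `evaluate` and the sample values are assumptions made for illustration; they are not plant data.

```python
import math
from typing import Sequence, Tuple


def evaluate(predicted: Sequence[float], actual: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean absolute error, relative absolute error, correlation coefficient)."""
    n = len(actual)
    if n < 2 or len(predicted) != n:
        raise ValueError("need two equally long sequences with at least 2 values")
    p_mean = sum(predicted) / n
    a_mean = sum(actual) / n

    # Numerator shared by (1) and (2): sum of absolute prediction errors.
    abs_err = sum(abs(p - a) for p, a in zip(predicted, actual))
    mae = abs_err / n
    # (2): error relative to always predicting the mean of the actual values.
    rae = abs_err / sum(abs(a - a_mean) for a in actual)

    # (3): sample covariance over the root of the product of sample variances.
    s_pa = sum((p - p_mean) * (a - a_mean) for p, a in zip(predicted, actual)) / (n - 1)
    s_p = sum((p - p_mean) ** 2 for p in predicted) / (n - 1)
    s_a = sum((a - a_mean) ** 2 for a in actual) / (n - 1)
    corr = s_pa / math.sqrt(s_p * s_a)
    return mae, rae, corr


# Made-up example: every prediction is 0.5 too high.
mae, rae, corr = evaluate([1.5, 2.5, 3.5, 4.5], [1.0, 2.0, 3.0, 4.0])
```

The sketch divides by zero if either sequence is constant; a production implementation would need to handle that case explicitly.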
A. Data mining algorithms
Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. The desired outcome as well as the form of the data (numeric values) indicated classification as the most appropriate technique. The WEKA data mining suite has been used for the application of the algorithms [11].

In the frame of our case studies, extensive experiments were carried out in a trial-and-error approach, in which we varied the classifier algorithm and its parameters, the contribution of the classes in the training dataset and even the input features. Various preprocessing techniques were also applied, such as normalization, discretization and filters that suggest which variables (those that will act as input parameters to the algorithms) are more relevant to each class (output variable). The latter is quite important, as it can contribute to a significant reduction of the dataset dimensions.

Two of the most commonly used methods were applied for determining the performance of classifiers: the hold-out estimate and cross-validation [12]. These methods provide five statistical parameters when the output variable is numeric. Three of them, i.e. mean absolute error (1), relative absolute error (2), and correlation coefficient (3), were used for the evaluation.
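The cross-validation estimate mentioned above can be illustrated with a short Python sketch. WEKA performs this step internally; the interleaved fold assignment, the plain 2-nearest-neighbour learner used as the example model, and all names and data below are assumptions made for illustration only.

```python
import math
from typing import Callable, List, Sequence, Tuple

Row = Tuple[Sequence[float], float]  # (input attributes, output value)


def knn_predict(train: List[Row], x: Sequence[float], k: int = 2) -> float:
    """Average the outputs of the k training rows closest to x (Euclidean distance)."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def cross_validated_mae(
    data: List[Row],
    folds: int,
    predict: Callable[[List[Row], Sequence[float]], float],
) -> float:
    """Mean absolute error when every row is predicted from the other folds only."""
    total = 0.0
    for f in range(folds):
        test = data[f::folds]  # every folds-th row, starting at row f
        train = [row for i, row in enumerate(data) if i % folds != f]
        total += sum(abs(predict(train, x) - y) for x, y in test)
    return total / len(data)


# Made-up data: one input attribute x and the output y = 2x.
data: List[Row] = [((float(x),), 2.0 * x) for x in range(10)]
mae = cross_validated_mae(data, folds=5, predict=knn_predict)
```

On this toy data the interior points are predicted exactly and only the two end points contribute an error, which shows how the estimate reflects performance on rows the model has not seen. A hold-out estimate corresponds to evaluating a single such train/test split.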
Evaluation of the learned knowledge also involved knowledge interpretation by domain experts. For this purpose, an adequate knowledge representation formalism has been used (e.g. rules defined by a decision tree). Based on the criterion of minimum mean absolute error, the following algorithms proved to be the most suitable for our dataset:
1) REPTree: fast decision/regression tree learner [13].
2) IBk: k-nearest neighbors classifier [13]. In our test case k=2 gave quite good results, while for k>2 the improvement was not worth the extra computer resources and in some cases the results even deteriorated.
3) M5Rules [14].
4) M5P: the original algorithm M5, invented by Quinlan [15], was improved by Y. Wang as M5' [16].

B. Data mining results
Temperatures and gas emissions are considered to be the most important features for the optimization software application. The performance of the algorithm REPTree applied for the models of steam temperature (Table 1) and NOx (Table 2) is presented as an example.

The original dataset covers 2 months. The sampling interval was set to 1 minute. The dataset was divided based on the criterion of partial or full power plant operation. The latter corresponded to more than 60000 instances, which were distributed among 6 files. Each of them was used once as the training set, with the others as test sets. Cross-validation was applied for determining the performance of each training set. The dimensions of the original dataset (88 variables) were reduced (to 44 variables) by applying aggregation techniques. This comprised the first dataset. In an effort to further reduce the input parameters, the attribute selection filters available in WEKA were tested.

1) Application of the algorithm REPTree
a)
Steam temperature
Table 1 includes the three aforementioned statistical parameters, i.e. mean absolute error (1), relative absolute error (2), and correlation coefficient (3), that resulted when the classification algorithm REPTree was applied for modeling the live steam temperature. The descriptive statistics of the live steam temperature in the specific dataset are: minimum = 523.114 °C, maximum = 545.231 °C, mean = 535.162 °C, standard deviation = 3.84. The second dataset was formed by the Best First search method and the CfsSubset evaluator [13]. The authors selected the attributes of the third and fourth datasets. The attributes of Dataset 2 and Dataset 3 are listed below.
Dataset 1. Initial 44 attributes.
Dataset 2. 9 attributes: Load (MW), Coal calorific value (MJ/kg), Flue gas temperature after Economizer (°C), Steam temperature after Superheater 2 at the 1st of 4 steam paths (°C), Steam temperature after Superheater 2 at the 2nd of 4 steam paths (°C), Steam temperature after Reheater at the 4th of 4 steam paths (°C), Injection rate to Reheater (kg/s), CO2 before stack (mg/m³), O2 after Economizer (%).
Dataset 3. 5 attributes: Load (MW), Coal calorific value (MJ/kg), Flue gas temperature after Economizer (°C), Steam temperature after Superheater 2 at the 1st of 4 steam paths (°C), Steam temperature after Reheater at the 4th of 4 steam paths (°C).

TABLE 1
THE PERFORMANCE OF THE CLASSIFICATION ALGORITHM REPTREE FOR THE VARIABLE STEAM TEMPERATURE SET AS CLASS

             Mean absolute error   Relative absolute error   Correlation coefficient
Dataset 1    0.1141                3.7001 %                  0.9975
Dataset 2    0.1421                4.4838 %                  0.9935
Dataset 3    0.1375                4.3203 %                  0.9939
Dataset 4    0.1592                5.0834 %                  0.9913
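As a reading aid for the relative absolute error column of Table 1: since (1) and (2) share the same numerator, dividing the mean absolute error by the relative absolute error gives the mean absolute error that would result from always predicting the mean of the actual values. The figure below is derived from the rounded Dataset 1 entries only, so it is approximate and assumes that both entries were computed on the same instances.

$$\frac{\text{MAE}}{\text{RAE}} = \frac{1}{n}\sum_i |a_i - \bar{a}| \approx \frac{0.1141}{0.037001} \approx 3.08\ ^\circ\text{C}$$

Under this assumption, replacing a faulty reading with the REPTree estimate instead of the mean of the variable reduces the mean absolute error from roughly 3.08 °C to roughly 0.11 °C.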
The high value of the correlation coefficient supports the appropriateness of the chosen input parameters for estimating the value of the output variable. The mean absolute errors reported in Table 1 are considered satisfactory, taking into account that the mean of the live steam temperature is 535.162 °C. It should also be noted that the plant experts set as the limit an error of less than 2 °C. The relative absolute error expresses the benefit of applying this algorithm instead of using the variable's mean to replace missing values. In this case, the relative absolute errors are quite good.
b)
NOx
Table 2 includes three statistical parameters, mean absolute error (1), relative absolute error (2), and correlation coefficient (3), that resulted when the classification algorithm REPTree was applied for modeling the NOx. The descriptive statistics of NOx in the specific dataset are: minimum = 100.11 mg/m³, maximum = 205.01 mg/m³, mean = 151.81 mg/m³, standard deviation = 18.6. The second dataset was formed by the Best First search method and the CfsSubset evaluator. The attributes of the third and fourth datasets were selected by the authors. The attributes of Dataset 2 and Dataset 3 are listed below.
Dataset 1. Initial 44 attributes.
Dataset 2. 12 attributes: Air flow (kg/s), Excess O2 (%), Steam temperature after Superheater 1 at the 3rd of 4 steam paths (°C), Steam temperature after Superheater 2 at the 4th of 4 steam paths (°C), Steam temperature after Reheater at the 1st of 4 steam paths (°C), Steam temperature after Reheater at the 4th of 4 steam paths (°C), Induced Draft Fan 2 (IDF) current (A), Total primary air flow (kg/s), Flue gas temperature before stack (°C), Flue gas temperature at Air heater outlet (°C), O2 after Economizer (%), O2 before stack (%).
Dataset 3. 7 attributes: Steam temperature after Superheater 1 at the 3rd of 4 steam paths (°C), Steam temperature after Superheater 2 at the 4th of 4 steam paths (°C), Steam temperature after Reheater at the 3rd of 4 steam paths (°C), Flue gas temperature at Air heater outlet (°C), IDF 1 current (A), Total primary air flow (kg/s), Flue gas temperature before stack (°C).

TABLE 2
THE PERFORMANCE OF THE CLASSIFICATION ALGORITHM REPTREE FOR THE VARIABLE NOX

             Mean absolute error   Relative absolute error   Correlation coefficient
Dataset 1    3.1017                19.0955 %                 0.9498
Dataset 2    3.6668                22.9389 %                 0.9359
Dataset 3    2.4899                14.9343 %                 0.9616
Dataset 4    2.9171                17.8398 %                 0.9508
The mean absolute errors reported in Table 2 are acceptable, taking into account that the mean of the variable NOx is 151.81 mg/m³. The plant experts set as the objective an error of less than 2-3 %, depending on the operating mode and the circumstances. The relative absolute errors were marginally acceptable.

2) Application of the algorithm IBk (k=2)
The algorithm IBk (k=2) has so far performed better than all the other algorithms tested on datasets concerning the boiler. However, its computer resource requirements discouraged or even prohibited its use, depending on the dimensions of the dataset. For instance, WEKA shut down when we attempted to apply it to the aforementioned dataset of 44 attributes with 10256 instances. Evidently, more than the allocated 600 MB of RAM was required. Nevertheless, the number of attributes was further reduced to 28 in order to enable the successful execution of IBk. The results were more than encouraging.

The good performance of IBk (k=2) motivated further experiments on datasets preprocessed by attribute selection filters. We combined all the search methods and evaluators available in WEKA and applicable to numeric data [13]. These resulted in 10 datasets, 7 of which are listed
in Table 3, along with the dataset comprising the initial 28 attributes.
Dataset 1: initial 28 attributes.
Dataset 2: 11 attributes, srch = BestFirst.
Dataset 3: 10 attributes, srch = Exhaustive Search.
Dataset 4: 10 attributes, srch = Genetic Search.
Dataset 5: 10 attributes, srch = Random Search.
Dataset 6: 7 attributes, srch = BestFirst. It comprised the attributes: Live steam temperature (°C), Flue gas temperature after Economizer (°C), Flue gas temperature at Air heater outlet (°C), Total primary air flow (kg/s), Steam temperature after Superheater 2 - mean of the 4 steam paths (°C), Steam temperature after Reheater - mean of the 4 steam paths (°C), Time passed since the last sootblowing procedure ended (minutes).
Dataset 7: attributes, srch = Random Search.
Here srch is the search algorithm.

The CfsSubset evaluator was used for datasets 2-5. It evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them. The WrapperSubsetEval evaluator was used for datasets 6 and 7. It evaluates attribute sets by using a learning scheme, which in this case was chosen to be REPTree. Cross-validation is used to estimate the accuracy of the learning scheme for a set of attributes.

TABLE 3
THE PERFORMANCE OF THE CLASSIFICATION ALGORITHM IBK (K=2) APPLIED TO DATASETS FORMED BY DIFFERENT ATTRIBUTE SELECTION FILTERS. THE VARIABLE NOX IS THE OUTPUT

             Mean absolute error   Relative absolute error   Correlation coefficient
Dataset 1    2.164                 14.716 %                  0.958
Dataset 2    2.807                 19.091 %                  0.949
Dataset 3    2.671                 18.165 %                  0.950
Dataset 4    2.852                 19.396 %                  0.948
Dataset 5    2.376                 16.159 %                  0.949
Dataset 6    1.181                 8.035 %                   0.967
Dataset 7    1.450                 9.890 %                   0.961

The BestFirst search method gave better results than the other tested search methods when combined with the same evaluator. It searches the space of attribute subsets by greedy hill-climbing augmented with a backtracking facility. The WrapperSubsetEval evaluator gave better results than the CfsSubset evaluator, but it was also the one that demanded the most computer resources. For instance, Dataset 7 was obtained after 10 days of continuous execution in WEKA of the attribute selection filter combining the WrapperSubsetEval evaluator and the Random Search. It should be stressed that this concerns only the derivation of the model. The time needed for running the output model in order to estimate one instance of a sensor measurement is approximately a few seconds and does not differ a lot from algorithm to algorithm.

3) Embedding models to the software application
For the software application, two of the models derived by applying dm algorithms are chosen for each variable. They are used for estimating its value whenever it is characterized as invalid. The best model is normally preferred, except in cases where it contains one or more variables with false readings. In that case the second best model is used. If this also contains invalid variables, then both models are adopted with the modification that each invalid variable is replaced with its last valid value. Their average is used for replacing the measurement in question.

IV. CONCLUSIONS
Data mining algorithms can be successfully applied for the reconstruction of sensor faults, as was demonstrated by the good performance of the classification algorithms. This paper proposes to base sensor fault identification on rules derived from the measuring equipment requirements, the plant operation specifications and the personnel experience. These rules can be applied both for cleaning the input data to dm and for the on-line validation of measurements.

For dm experiments it is suggested to begin with less demanding algorithms (e.g. REPTree) in order to discard datasets. At this stage the appropriateness of various preprocessing techniques can also be judged. This step should be followed by attribute selection filters and by algorithms that have proved to be more efficient but consume more time and computer resources (e.g. IBk). Attribute selection by the WEKA filters proved to be useful. Personal judgment based on domain knowledge permits further reduction of the dataset attributes with similar statistical results.

V. FUTURE WORK
Future work is planned in two directions: 1) the development of a Multi Agent System that will extend the present software by adding intelligence and retraining capabilities, and 2) the application of dm algorithms in order to extract rules that trigger alarms based on more than one parameter, thus covering the TPP operation more comprehensively.
ACKNOWLEDGMENT
Special thanks are owed to the engineers, technical personnel and operators of the PPC S.A. Thermal Power Plants of Western Macedonia, Greece for providing the information, data and continuous support.

REFERENCES
[1] P. Frank, “Fault diagnosis in dynamic systems using analytical and knowledge based redundancy - a survey and some new results”, Automatica, vol. 26, pp. 459-470, 1990.
[2] P. H. Ibarguengoytia, S. Vadera, and L. E. Sucar, “A probabilistic model for information and sensor validation”, The Computer Journal, vol. 49, no. 1, pp. 113-126, 2006.
[3] S. Alag, A. M. Agogino, and M. Morjaria, “A methodology for intelligent sensor measurement, validation, fusion, and fault detection for equipment monitoring and diagnostics”, Artificial Intelligence for Engineering Design, Analysis and Manufacturing, vol. 15, pp. 307-320, 2001.
[4] G. Heo, S. H. Chang, and S. S. Choi, “Development of a need-oriented steam turbine cycle simulation toolbox”, IEEE Trans. on Energy Conversion, vol. , 2004.
[5] J. A. Ritsie and D. Flynn, “Data mining for performance monitoring and optimization”, Thermal Power Plant Simulation, Monitoring and Control, IEE, 2003, pp. 309-344.
[6] A. V. Gribok, A. M. Urmanov, J. W. Hines, and R. E. Uhrig, “Use of kernel based techniques for sensor validation in nuclear power plants”, Statistical Data Mining and Knowledge Discovery, Chapman and Hall/CRC Press, 2004, pp. 221-235.
[7] X. Xu, J. W. Hines, and R. E. Uhrig, “Sensor validation and fault detection using neural networks”, in Proc. Maintenance and Reliability Conference (MARCON 99), Gatlinburg, TN, May 10-12, 1999.
[8] E. Eryurek and B. R. Upadhyaya, “Sensor validation for power plants using adaptive backpropagation neural network”, IEEE Trans. Nuclear Science, vol. 37, pp. 1040-1047, Apr. 1990.
[9] M. R. Napolitano, D. Windom, J. Casanova, M. Innocenti, and G. Silvestri, “Kalman filters and neural network schemes for sensor validation in flight control systems”, IEEE Trans. Control Systems Technology, vol. 6, pp. 596-611.
[10] G. Schreiber, H. Akkermans, A. Anjewierden, R. de Hoog, N. Shadbolt, W. Van de Velde, and B. Wielinga, Knowledge Engineering and Management: the CommonKADS methodology. MIT Press, 2000.
[11] Available: http://www.cs.waikato.ac.nz/~ml/weka/index.html
[12] M. Stone, “Cross-validatory choice and assessment of statistical predictions”, Journal of the Royal Statistical Society, vol. 36, no. 2, pp. 111-147, 1974.
[13] I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition). Morgan Kaufmann, 2005.
[14] M. Hall, G. Holmes, and E. Frank, “Generating rule sets from model trees”, in Proc. 12th Australian Joint Conf. on Artificial Intelligence, Sydney, Australia, 1999.
[15] J. R. Quinlan, “Learning with continuous classes”, in Proc. of the Australian Joint Conf. on Artificial Intelligence, Singapore, 1992, pp. 343-348.
[16] Y. Wang and I. H. Witten, “Induction of model trees for predicting continuous classes”, in Proc. of the poster papers of the European Conference on Machine Learning, Prague, 1997, pp. 128-137.
Christina A. Athanasopoulou was born in Thessaloniki, Greece, on February 27, 1976. She received her Dipl. Eng. degree from the Department of Electrical and Computer Engineering at the Aristotle University of Thessaloniki in 2000. She worked as technical manager of IST European projects for ALTEC, Greece during 2001-2003. Since 2003 she has been a postgraduate student at the Department of Electrical and Computer Engineering at the Aristotle University of Thessaloniki. Her special interests include power plant monitoring, multi agent systems, and data mining.

Vasilis Chatziathanassiou was born in Serres, Greece, on October 30, 1954. He received his Dipl. Eng. degree and his Ph.D. degree from the Department of Electrical and Computer Engineering at the Aristotle University of Thessaloniki in 1978 and 1989, respectively. He has been working as research assistant since 1980, as Lecturer since 1990 and as Ass. Professor since 1994 at the Division of Electrical Energy of the Department of Electrical and Computer Engineering at the Aristotle University of Thessaloniki, Greece. His special interests are coupled electrothermal fields in cables and electrical machines, heat transfer problems in the production, transmission and distribution of electrical energy, and applications of intelligent agents in the power industry.

Ioannis M. Petridis was born in Kozani, Greece, on February 6, 1960. He received his Dipl. Eng. degree from the Department of Electrical and Computer Engineering at the Aristotle University of Thessaloniki in 1983. He worked as Shift Engineer during 1985-89, as Head of Optimization Department during 1989-94, and as Head of Operation Department during 1984-2002, and has been Production Manager since 2002 at Amynteon-Filota SES, Public Power Corporation SA, Greece. His special interests are Thermal Power Plant simulators and on-line efficiency monitoring systems.