Domain Driven Data Mining for Unavailability Estimation of Electrical Power Grids

Paulo J.L. Adeodato 1,2, Petrônio L. Braga 2, Adrian L. Arnaud 1, Germano C. Vasconcelos 1,2, Frederico Guedes 3, Hélio B. Menezes 3, Giorgio O. Limeira 3

1 NeuroTech Ltd., Av. Cais do Apolo, 222 / 8º andar, 50030-905, Recife-PE, Brazil
2 Center for Informatics, Federal University of Pernambuco, Av. Professor Luís Freire s/n, Cidade Universitária, 50740-540, Recife-PE, Brazil
3 Companhia Hidrelétrica do São Francisco - CHESF, St. Delmiro Gouveia, 333 – Bongi, 50761-901, Recife-PE, Brazil

{Paulo,Adrian,Germano}@neurotech.com.br, {pjla,plb,gcv}@cin.ufpe.br, {fred,helio,giorgiol}@chesf.gov.br

Abstract. In Brazil, power generating, transmitting and distributing companies operating in the regulated market are paid for their equipment availability. In case of system unavailability, the companies are financially penalized, more severely so for unplanned interruptions. This work presents a domain driven data mining approach for estimating the risk of system unavailability based on the historical data of its component equipment, within one of the biggest Brazilian electric sector companies. Traditional statistical estimators are combined with the concepts of Recency, Frequency and Impact (RFI) to produce variables containing behavioral information finely tuned to the application domain. The unavailability costs are embedded in the problem modeling strategy. Logistic regression models bagged via their median score achieved Max_KS=0.341 and AUC_ROC=0.699 on the out-of-time data sample. This performance is much higher than that of the previous approaches attempted within the company. The system has been put into operation and will be monitored for performance reassessment and maintenance re-planning.

Keywords: Electrical power grid unavailability, Equipment unavailability penalties, Domain driven data mining, Model ensembles, Logistic regression.

1 Introduction

In the 1990s, most of the Brazilian power companies went private and started operating, under concession from the government, regulated by the National Agency of Electrical Energy (ANEEL = Agência Nacional de Energia Elétrica) and inspected by the National System Operator (ONS = Operador Nacional do Sistema). The companies operating in this regulated market are paid for the service they provide and are penalized for system unavailability at the Operational Function (FUNOP = FUNção OPeracional) level [1]. Each unavailability penalty depends on the value of

the FUNOP asset, its characteristics, the duration of the power interruption and, mainly, on whether the interruption had been planned or not; an unplanned unavailability costs roughly 20 times more than a planned one of the same duration [1].

The reliability of electrical power grids is already very high and under continuous improvement. Each FUNOP is composed of several equipments which implement an operational function in power generation, transmission or distribution. To preserve this high reliability profile, strict maintenance plans are periodically conducted on these equipments, with particular features for each family of equipments. In general, the maintenance plan follows mainly the equipment manufacturer's recommendations, which take into account the electrical load, the temperature and other aspects to define the periodicity, the procedures, and the parameter monitoring and adjustments. The equipment manufacturers have carried out series of trials within their plants, collected data from their customers' installations and applied statistical methods to define their maintenance recommendations. However, many other factors interfere in the reliability of different power grids, such as the quality of the repairmen's labor, the way the system is loaded, etc.

There is also a major aspect to be considered: as the system's quality improves, less data about risky conditions are produced. Therefore, the better the system becomes, the less data about faults will be available for statistical modeling of risky conditions. Fortunately, more data from monitoring operation in normal conditions are being collected and will be available for future modeling.

Instead of using traditional statistical modeling, this paper introduces an approach based on behavioral data. That may seem odd if one thinks of the system operating in a stable regime, under a constant fault rate.
However, as faults are very rare events, it is not possible to assure a constant fault rate, and the adherence hypothesis test always reveals at least a small difference; behavioral consolidation of data may capture variations which are important for risk estimation. The results presented here support this idea.

This paper is organized in five further sections. Section 2 characterizes the unavailability problem faced by CHESF (Companhia Hidro Elétrica do São Francisco), the data structure available and the integration and transformation needed. Section 3 shows the modeling of the problem as a binary decision based on the maintenance plan, the creation of behavioral variables and the selection of the most relevant variables for modeling. Section 4 describes the knowledge extraction process via a bagged ensemble of logistic regression models. Section 5 presents and interprets the results achieved on a statistically independent data set. Section 6 summarizes the important contributions, the limitations of the approach and future work to be done to broaden the research.

2 Problem Characterization

CHESF (Companhia Hidro Elétrica do São Francisco) is one of the biggest power generating companies in Brazil, producing 10,618 MW in 14 hydroelectric power plants and one thermoelectric plant. It also transmits this energy along an 18-thousand km long power grid [2]. Its annual revenue reached R$ 5.64 billion (= US$ 3.15 billion) in

2008, with a net profit of R$ 1.43 billion. Unfortunately, the revenue losses caused by penalties for unavailabilities remain undisclosed.

CHESF's power grid has 462 FUNOPs of 7 different families with an average of 39 equipments each, in a total of 17.8 thousand equipments with an average age of approximately 19 years of operation. The seven FUNOP families are: transmission lines, power transformers, reactors, capacitor banks, synchronous compensators, static compensators and isolated cables.

Just before being put into operation, the equipments and FUNOPs are registered in the Asset Management System (SIGA = Sistema de Gerenciamento de Ativos). After becoming operational, the equipments have all their maintenances, planned or not, recorded in the same system. Each unavailability, no matter the cause, is recorded in the accountability system within SIGA. Unavailabilities that occurred before January 1, 2008 were recorded in a separate system (DISPON) which had no direct link to SIGA. These two data sources, hosted in two different systems with relational databases, needed to be integrated into a single data mart because they are the basis for the unavailability risk estimation system to be developed.

The difference in granularity between the DISPON and SIGA databases and the consequent lack of a unique key, together with the legacy systems, turned this database integration into a non-trivial task. Asset registration and maintenance records have been integrated in the SIGA system for the last two years, but there were several adjustments in data imported from legacy systems for previous periods in a much longer history. The most important difficulty faced, however, was the integration with the DISPON system, where each unavailability recorded had not been linked to an equipment maintenance action. Furthermore, DISPON had been abandoned without any data importation to the current SIGA, installed only two years ago.
So, the unavailability data were dumped from the legacy database (DISPON) and joined to the current SIGA database to form the complete unavailability database. These integration steps alone took around 60% of the project duration, requiring many interactions with the IT management and electrical engineers at CHESF.

The purpose of this work is to estimate the risk of unavailabilities occurring in the FUNOPs which compose the power grid at each moment. At this point, it is important to emphasize the device used to turn the risk assessment problem into a binary decision problem for data mining. Considering that unavailabilities caused by planned maintenances are negligible in cost compared to unplanned ones (a 1:20 ratio), and that maintenance actions reset the operational status of the system to optimal, the temporal sequence of planned maintenance actions defines a frame of time intervals where the presence or absence of unplanned unavailabilities characterizes the binary target for the supervised training process. This binary target definition will be explained in the next section, along with the creation of behavioral information.

3 Data Transformation

The variables present in the integrated relational database needed to be transformed into more meaningful variables for the binary decision problem characterized for modeling the unavailability risk assessment. This section explains the proposed random variables that produce the most adequate mapping from the original input space to the data mart variables. It also presents how the binary decision target was defined.

3.1 Variable Creation

Behavioral data are widely used in behavior scoring for credit risk assessment [3] and other business applications. In that domain, this generally consists of the RFM (Recency, Frequency and Monetary value) variable creation approach [4]. For system faults at CHESF, the approach was adapted to capture the relevant sequential features implicit in each event within the FUNOP, related to recency, frequency and impact along time for faults and errors (the RFI approach). In this approach, the impact is measured by the duration, cost and other features related to each system component / event. This is a very important basis for the systematic and automatic creation of behavioral variables, considering several time spans. Other variables inherent to the FUNOPs and related to their complexity were also created, such as the amount of equipments, the families of equipments and the entropy of the equipment distribution within the FUNOP. This is a Domain Driven Data Mining approach [5] of embedding the expert's knowledge from the electrical engineering field into the decision support system. The RFI approach can be generalized to model rare events in several application domains where the impact is captured by several different metrics (to be published elsewhere). Another important aspect is that, due to the very small amount of faults per equipment, the rate of faults is defined at the equipment family level.
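To make the RFI construction concrete, the sketch below consolidates, for a single FUNOP, the recency (months since the last unplanned event), the frequency (event count) and the impact (unavailability hours) over 12- and 24-month windows. The column names and window sizes are illustrative assumptions, not the exact definitions used at CHESF.

```python
import pandas as pd

def rfi_features(events, ref_date, windows=(12, 24)):
    """Recency, Frequency and Impact variables for one FUNOP.

    events: DataFrame with a 'date' column (event Timestamp) and a
    'duration_h' column (unavailability hours, the impact measure).
    ref_date: the point in time at which the variables are computed.
    """
    past = events[events["date"] < ref_date]
    feats = {}
    # Recency: months elapsed since the last event (inf if none occurred).
    feats["months_since_last"] = (
        (ref_date - past["date"].max()).days / 30.44 if len(past) else float("inf")
    )
    for w in windows:
        window = past[past["date"] >= ref_date - pd.DateOffset(months=w)]
        feats[f"qty_last_{w}m"] = len(window)                   # Frequency
        feats[f"hours_last_{w}m"] = window["duration_h"].sum()  # Impact
    return feats
```

Running such a function over every (FUNOP, reference date) pair, for several window sizes and impact measures, is what yields the hundreds of candidate variables mentioned later in the paper.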
Several "ratio" variables were created for measuring differences between a FUNOP and the population. So, the ratio between the average rate of faults per family of equipments within a FUNOP and the average in the whole grid forms an important set of variables. At this point, it is important to highlight that, in general, equipments are not replaced or swapped in the power grid; they are simply maintained.

3.2 Proposed Model and Label Definition

Considering the scarcity of data about system faults and the consequent high imprecision of the estimated distributions and fault rates, the approach adopted here was to convert a classical statistical problem into a data mining problem, with the advantage of reducing the amount of limiting assumptions in the modeling process. In this approach, the temporal sequence of planned maintenance actions defines a frame of time intervals used for modeling and labeling the system condition. The label is defined as "bad" if there is at least one unplanned unavailability within that

time interval and "good" otherwise. This characterizes the binary target needed for the supervised training process of the decision support system [6]. The set of all planned maintenances defines a sequence of time intervals, each of which possesses a binary label and takes into account all the past history of the FUNOP and its components (behavior), as illustrated in Fig. 1.

Fig. 1. Planned maintenances define a sequence of time intervals for modeling the problem as a binary decision and labeling the target.
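The interval construction and labeling described above can be sketched as follows; the dates are purely hypothetical and only illustrate the mechanism:

```python
from datetime import datetime

def label_intervals(planned_dates, unplanned_dates):
    """Each pair of consecutive planned maintenances bounds one interval;
    it is labeled 'bad' if at least one unplanned unavailability falls
    inside it, and 'good' otherwise."""
    planned = sorted(planned_dates)
    intervals = []
    for start, end in zip(planned, planned[1:]):
        bad = any(start <= t < end for t in unplanned_dates)
        intervals.append((start, end, "bad" if bad else "good"))
    return intervals

planned = [datetime(2007, 1, 1), datetime(2007, 7, 1), datetime(2008, 1, 1)]
unplanned = [datetime(2007, 3, 15)]
print(label_intervals(planned, unplanned))
```

Each labeled interval then receives the behavioral variables computed from the FUNOP's history up to its start date.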

An approximation has been made in the approach depicted above, considering the negligible cost of the planned unavailabilities compared to the unplanned ones (a 1:20 ratio) and the fact that planned unavailabilities may be produced during a planned maintenance itself. Therefore, planned unavailabilities were discarded from the training data for the modeling process. Differently from the statistical approaches, no other constraint has been imposed concerning data distribution types or their parameters.

The goal of this modeling approach is to take preventive action whenever a "bad" prediction is made within a time interval. Despite not being in the long term maintenance plan, this short term planned maintenance action produces either a negligible penalty (1:20 of the fault unavailability penalty) or no penalty at all (several preventive maintenance actions do not cause unavailability).

3.3 Variable Selection

As the process of systematic creation of behavioral variables makes it very easy to automatically produce new variables, variable selection is needed to preserve only the most meaningful and discriminative variables. The selection process was based on an approach for maximizing the information gain of the input variables in relation to the binary target and, simultaneously, minimizing the similarity (redundancy) among the selected input variables, measured by appropriate metrics in a univariate fashion. As all input variables were numerical and the target binary, the Max_KS (Kolmogorov-Smirnov) metric [7] was used for ranking the variables by their univariate discriminative importance. The redundancy among input variables was measured by linear correlation: input variables with correlation higher than 0.9 with other variables of higher Max_KS were discarded from the model. Following this approach, only 30 of over 900 input variables were preserved. Table 1 lists the top five most relevant variables selected, with their information gain measured in

terms of Max_KS and AUC_ROC (Area under the ROC Curve) [8], to be explained in Sub-section 5.1.

Table 1. Five univariately most relevant variables selected in terms of Max_KS.

Variable                                                Max_KS  AUC_ROC
Hours of Unplanned Unavailability in Last 24 Months      0.30    0.68
Quantity of Unplanned Unavailability in Last 24 Months   0.28    0.69
Hours of Unplanned Unavailability in Last 12 Months      0.27    0.66
Quantity of Unplanned Unavailability in Last 12 Months   0.27    0.67
Time Since Last Unplanned Unavailability                 0.25    0.58

It is clear that unplanned unavailability along the last two years of operation is the most relevant aspect for estimating the risk of unavailability before the next planned maintenance. It is interesting that the equipments' age appears only in 22nd place in the ranking, with Max_KS=0.09 and AUC_ROC=0.48, suggesting that the system fault rate is indeed at the flat part of its curve.
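A minimal sketch of this two-step selection, assuming NumPy and SciPy: variables are ranked by the two-sample KS statistic against the binary target and then greedily filtered by pairwise linear correlation.

```python
import numpy as np
from scipy.stats import ks_2samp

def select_variables(X, y, max_corr=0.9):
    """X: dict mapping variable name -> 1-D numeric array; y: 0/1 target.
    Returns kept variable names (by decreasing Max_KS) and the KS scores."""
    # Univariate discriminative power: Max_KS between class-conditional samples.
    ks = {name: ks_2samp(v[y == 1], v[y == 0]).statistic for name, v in X.items()}
    kept = []
    for name in sorted(ks, key=ks.get, reverse=True):
        # Discard variables too correlated with an already-kept, stronger one.
        if all(abs(np.corrcoef(X[name], X[k])[0, 1]) <= max_corr for k in kept):
            kept.append(name)
    return kept, ks
```

Applied to the 900+ candidate variables, this kind of filter reduces the set to the 30 variables cited above.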

4 Modeling Strategy

4.1 Data Sampling

As the modeling strategy involves the creation of behavioral variables, there is statistical dependence among the examples, differently from typical classification problems. Therefore, the data division for modeling and testing the system should be temporally disjoint in two blocks, as done in time series forecasting tasks [9], for a more realistic performance assessment. The diagram in Fig. 2 shows this division in time.

Fig. 2. Data partition along time for modeling and performance assessment of the system.

This data partition took into account the change in the computational environment to represent the worst case in terms of performance assessment. In the modeling set, the target class (unavailability) represents 18.1% of the examples whereas, in the testing set, it represents only 10.5% of the examples. An additional difficulty is related to the differences in the way data were recorded before and after SIGA, which not even CHESF’s personnel can precisely assess. The modeling data refer to the whole period before the SIGA system was deployed while the testing data have their target defined after SIGA’s deployment. The behavioral variables of the testing data, however, also capture historical information from the preceding period.
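The temporally disjoint division can be expressed as a simple cutoff on each example's interval start date; the column name below is an assumption for illustration:

```python
import pandas as pd

def temporal_split(df, cutoff):
    """Split examples into temporally disjoint modeling and testing blocks.
    Behavioral variables share history across examples, so a random split
    would leak information between the two blocks."""
    modeling = df[df["interval_start"] < cutoff]
    testing = df[df["interval_start"] >= cutoff]
    return modeling, testing
```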

4.2 Logistic Regression and Model Ensemble

The modeling technique chosen was logistic regression for several of its interesting features, the most relevant for this work being the quality and understandability of the solution produced and the small amount of data required. Logistic regression has been successfully applied to binary classification problems, particularly to credit risk assessment [3]; it does not require a validation set for overfitting prevention and it presents explicitly the knowledge extracted from data in terms of statistically validated coefficients [10].

As preliminary experiments with different data samples showed a high variance in performance, it was clear that an ensemble of systems was necessary [11]. In this work, an ensemble of 31 logistic regression models reduced the system's variance, and the median of their scores was taken as the response for each test example. This median approach had been adopted by the authors' teams since 2007 in the PAKDD Data Mining Competition [12] and in the NN3 Time Series Forecasting Competition [13]. For training the 31 models, 50% of the examples in the modeling data set were randomly sampled without replacement. These parameters were chosen by a linear experimental design [14], with the ensemble size taking the values 31, 51 and 101 and the sampling percentage taking the values 70%, 60% and 50%.
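A sketch of the bagged median ensemble, using scikit-learn's logistic regression as a stand-in for the actual implementation: 31 models are each trained on 50% of the modeling set sampled without replacement, and the median of their scores is the ensemble output.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bagged_lr_median(X, y, n_models=31, frac=0.5, seed=0):
    """Train n_models logistic regressions, each on a random subsample
    drawn without replacement, and return a median-score function."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        models.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))

    def score(X_new):
        # The ensemble response is the median of the n_models scores.
        probs = np.stack([m.predict_proba(X_new)[:, 1] for m in models])
        return np.median(probs, axis=0)

    return score
```

The median, unlike the mean, is robust to an individual model producing an extreme score for a given example, which is the variance-reduction behavior sought here.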

5 Experimental Metrics, Results and Interpretation

5.1 Performance Metrics

As there was no criterion available yet for defining the decision threshold along the continuous output of the logistic regression ensemble, the performance assessment was carried out using two metrics over the whole decision domain (the score range): the maximum Kolmogorov-Smirnov distance (Max_KS) [7] and the Area Under the ROC Curve (AUC_ROC) [8].

The AUC_ROC metric is widely accepted for performance assessment of binary classification based on a continuous output. Similar wide acceptance holds for the Max_KS within the business application domain. Differently from its original purpose as a nonparametric statistical tool for measuring the adherence of cumulative distribution functions (CDFs) [7], in binary decision systems the maximum KS distance is applied for assessing the lack of adherence between the data sets of the two classes, having the score as the independent variable. The Kolmogorov-Smirnov curve is the difference between the CDFs of the data sets of the two classes. The higher the curve, the better the system, and the point of maximum value is particularly important in performance evaluation.

Another widely used tool is the Receiver Operating Characteristic curve (ROC curve) [8], whose plot represents the compromise between the true positive and the false positive classifications based on a continuous output, along all its possible decision threshold values (the score). The closer the ROC curve is to the upper left corner (optimum point), the better the decision system. The focus is on

assessing the performance throughout the whole X-axis range by calculating the area under the ROC curve (AUC) [8]. The bigger the area, the closer the system is to the optimum decision, which happens with an AUC_ROC equal to one.

5.2 Results and Interpretation

Performance was assessed on the testing set, which consisted of the out-of-sample data with 4,059 examples reserved for this purpose only. Fig. 3 shows the Kolmogorov-Smirnov curve with its Max_KS=0.341. Fig. 4 shows the ROC curve with its AUC_ROC=0.699.

Fig. 3. Performance assessment by the Kolmogorov-Smirnov metric with Max_KS=0.341.

Fig. 4. Performance assessment by Area Under the ROC Curve metric with AUC_ROC=0.699.
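Both metrics can be computed directly from the score vector; a minimal sketch with SciPy and scikit-learn, on a toy example that is illustrative only:

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import roc_auc_score

def max_ks(scores, y):
    """Max_KS: largest vertical gap between the score CDFs of the two classes."""
    return ks_2samp(scores[y == 1], scores[y == 0]).statistic

# Toy scores: two perfectly separated classes give Max_KS = AUC_ROC = 1.
y = np.array([0, 0, 0, 1, 1])
s = np.array([0.1, 0.2, 0.6, 0.7, 0.9])
print(max_ks(s, y), roc_auc_score(y, s))
```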

The curves are quite noisy, probably because of the small amount of data in the testing set. There are only around 400 examples from the target class in this data set, whose CDF is a very noisy curve (top plot in Fig. 3), whereas the non-target ("good") class curve is smooth. Even though noisy, the performance curves are consistent and represent an improvement which will be useful for CHESF, particularly considering that the testing set represents a worst case approximation.

6 Concluding Remarks

This paper has presented a domain driven data mining approach to the problem of Operational Function unavailability in the electrical power grid of one of the biggest power companies in Brazil - CHESF. Different from statistical approaches, this innovative work has modeled the unavailability as a data mining binary decision problem with behavioral input variables. These variables were created by sliding windows of different sizes, timed by the planned maintenance events; each interval was labeled as "bad" when an unplanned unavailability occurred before the next planned maintenance. An important advantage of this approach compared to the statistical ones is that it does not impose any constraint on the data distributions to be modeled. The only approximation made was to consider the planned unavailability's cost negligible compared to that of an unplanned one: around 5% of the value.

It should be emphasized here that there is a big difference between the concepts of approach and technique, which becomes clear when the statistical technique of logistic regression is used within a domain driven data mining approach for modeling the whole problem as a sequence of rare events consolidated in RFI variables which capture sequential information in terms of Recency, Frequency and Impact. The median of an ensemble of bagged logistic regression models has provided the unavailability risk estimating score, and its coefficients made explicit the most relevant variables for each suggested decision.

Results of the experiments carried out on an out-of-sample test set have shown that the approach is viable for risk estimation, attaining Max_KS=0.341 and AUC_ROC=0.699 in a worst case scenario. After this approach's validation, the testing data set has been included in the modeling data set and the system has been re-trained with the same procedure.
The system has now been put into operation and its performance will be monitored for the next six months, during which CHESF will be making pro-active maintenance based on the system predictions. Both the quality of the solution and the availability of the power grid can lead to redesigning maintenance periods. Several refinements still have to be made, particularly those referring to the revenue losses caused by the penalties for power grid unavailability. This refinement can be made by considering the "losses" either in the modeling process or in the post-processing stage along with the risk estimating score produced by the decision support system. Also, the variable selection process should include multivariate techniques such as the variance inflation factor (VIF) [15].

References

1. ANEEL: Normative Resolution no. 270 of June 2007, http://www.aneel.gov.br/cedoc/ren2007270.pdf
2. CHESF: Companhia Hidro Elétrica do São Francisco, http://www.chesf.gov.br/acompanhia_visaomissao.shtml
3. West, D.: Neural Network Credit Scoring Models. Computers and Operations Research 27, pp. 1131–1152 (2000)
4. Jiang, T., Tuzhilin, A.: Improving Personalization Solutions through Optimal Segmentation of Customer Bases. IEEE Trans. Knowledge and Data Eng. 21(3), pp. 1–16 (2009)
5. Cao, L.: Introduction to Domain Driven Data Mining. In: Cao, L., et al. (eds.) Data Mining for Business Applications, pp. 3–10 (2008)
6. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, CA (2006)
7. Conover, W.J.: Practical Nonparametric Statistics, 3rd edn. John Wiley & Sons, NY, USA (1999)
8. Provost, F., Fawcett, T.: Robust Classification for Imprecise Environments. Machine Learning J. 42(3), pp. 203–231 (2001)
9. Adya, M., Collopy, F.: How Effective are Neural Networks at Forecasting and Prediction? A Review and Evaluation. J. of Forecasting 17, pp. 481–495 (1998)
10. Hilbe, J.M.: Logistic Regression Models. Chapman & Hall / CRC Press (2009)
11. Breiman, L.: Bagging Predictors. Machine Learning 24(2), pp. 123–140 (1996)
12. Adeodato, P.J.L., Vasconcelos, G.C., Arnaud, A.L., Cunha, R.C.L.V., Monteiro, D.S.M., Oliveira Neto, R.: The Power of Sampling and Stacking for the PAKDD-2007 Cross-Selling Problem. Int. J. of Data Warehousing and Mining (IJDWM) 4, pp. 22–31 (2008)
13. Adeodato, P.J.L., Vasconcelos, G.C., Arnaud, A.L., Cunha, R.C.L.V., Monteiro, D.S.M.P.: MLP Ensembles Improve Long Term Prediction Accuracy over Single Networks. Int. J. of Forecasting (2010) (to appear)
14. Jain, R.: The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. John Wiley & Sons, New York (1991)
15. Kutner, M., Nachtsheim, C., Neter, J.: Applied Linear Regression Models, 4th edn. McGraw-Hill / Irwin (2004)
