
Data Mining Process to Improve the Inspections of a Power Utility

Iñigo Monedero, Félix Biscarri, J.I. Guerrero, Eva María Mestres, Juan María Jubani, Carlos León

Abstract—A non-technical loss in a power utility is defined as any consumed energy or service that is not billed because of some type of anomaly. For power utilities, detecting these losses is a very important task, since they account for a high percentage of unbilled energy and, therefore, unearned revenue. Endesa is the main power utility in Spain. One methodology used by Endesa to detect non-technical losses is based on the detection and inspection of customers that have null consumption during a certain period. The problem with this methodology is the low success rate of the inspections. We are working on a collaborative project with Endesa, and we have developed a module that improves this type of inspection. The module is based on a set of rules and has been developed by means of a data-mining process. With this module, the success rate of the inspections can be multiplied by 3.

Keywords—Non-technical losses, power utility, data mining, decision tree, neural network.

I. INTRODUCTION

The non-technical losses (NTLs) in power utilities are defined as any consumed energy or service which is not billed because of measurement equipment failure or ill-intentioned and fraudulent manipulation of said equipment. NTLs are thus caused by breakdowns or illegal manipulation in customer installations. These types of losses are very hard to predict. Normally, utility companies use massive inspection campaigns to reduce NTLs. The main problem is that these companies lack the technology needed to process their data in depth before carrying out these inspections. Thus, although utility companies invest effort in detecting and correcting this kind of anomaly, the main focus of their work lies in other areas, such as the maintenance of infrastructure.

A common methodology used by power utilities to detect NTLs is based on the study of customers that have null consumption during a certain period. For this type of customer, null consumption is the clearest sign of a non-technical loss. The problem with this methodology is that a customer with null consumption does not necessarily have an NTL. Such cases may be due to empty houses for private customers, or to a drop in electrical demand for businesses that depend on some sort of clientele. It is therefore often useful to consider additional information about the customer, and not only the consumption, to determine the reason for the null consumption. This information includes the history of incidents of each customer, or the type of business, in order to know whether it is a business in which demand is currently falling (e.g., currently, the construction business in Spain). Likewise, the inspectors of the Company know that the following types of business are more likely to have consumption drops inherent to their usage of energy (and not due to a possible NTL): wells, lightings, irrigation pumps, water purification and construction (previously mentioned).

Data mining techniques [1] are nowadays being applied in multiple fields, and the detection of anomalies such as NTLs is one area in which they have recently met with success [2-7]. These anomalies include fraud in the telecommunication and financial sectors, and breakdowns or fraud in the power, water or gas sectors.

Midas Project is the name of a collaborative project between Endesa (the main power utility in Spain) and the University of Seville whose objective is the detection of NTLs by means of artificial intelligence techniques. We have been working on this project for 9 years and have already published some of its results [8-10].

The study presented in this article arises from Endesa's need to improve its detection of NTLs among customers with null consumption. Specifically, the need comes from the low success rate of the on-site inspections carried out by the power utility with its original methodology. These poor results arise because, as described above, a customer with null consumption does not necessarily have any kind of anomaly. The aim of our work has been to study the methodology that Endesa uses to select customers with null consumption for inspection, and to develop an additional module based on data mining techniques that improves its results by filtering customers.

Iñigo Monedero, Félix Biscarri, J.I. Guerrero and Carlos León are with the Electronic Technology Department of the University of Sevilla, Spain (e-mail: [email protected]). Eva María Mestres and Juan María Jubani are with the IT, Measure and Non Technical Losses Control Areas of the Endesa Company, Spain.

II. DATA MINING

A. Database

In order to carry out the data mining process, Endesa provided us with the database of a particular area of Spain (specifically Catalonia) containing the data of the latest inspections of customers with null consumption. Essentially, the criterion that Endesa uses to select these customers for inspection depends on the number of most recent consecutive bills with null consumption that the customer registers. The data of these inspections date from April 2012 and contain the contracts selected for inspection as well as the inspection results. This sample set comprises 3510 customers, with information on their consumption during the last two years (from April 2010 to March 2012), contracted power and the specific use of the contract (private or the type of

business). Additionally, we know the outcome of the inspection carried out for each customer. The categorization of these inspections and the number of customers in each category are given in Table 1.

Table 1.- Results of the inspections of the Company

Result of inspection    Number    Percentage
Correct                 2284      65.07%
Not done                455       12.96%
Anomaly without NTL     607       17.30%
Anomaly with NTL        164       4.67%

These results comprise inspections that found no anomaly (Correct), inspections that could not be carried out because the meter was not accessible (Not done), inspections that detected some type of anomaly without energy loss (Anomaly without NTL) and inspections that detected some type of anomaly with energy loss (Anomaly with NTL). As can be observed, the success rate of the inspections was less than 5%, since Endesa is interested in identifying only the customers of the last group.

The main problems we faced in our data-mining process were two:
- The little information available for each customer (only the contracted power, the use of the contract and the type of business).
- The limited variety of consumption patterns (most of the customers had no consumption at all during the two-year analysis interval).

Having analyzed the data and the problems found in their analysis, we focused the data-mining process on developing a set of detections (a rule set) to raise the current success rate. With that objective, we considered two modules for two types of detection:
- Detection of customers with a stable consumption that drops suddenly at some point in the two years of analysis. In this way we try to discard the customers corresponding to empty flats, closed businesses, etc.; that is, the contracts with genuinely null consumption.
- Detection of contracts belonging to types of business in which the rate of NTLs (which include fraud) is high.

To run the data-mining process and generate the models, we used IBM SPSS Modeler 15 [11], a software package widely used in the field of data mining. It offers quick access to databases and many libraries for generating models, such as clustering processes, decision trees and neural networks.

B. Filter by drops of consumption

The pattern that we tried to detect with this module is depicted in Figure 1: customers with consumption during a period and, subsequently, a sudden drop to null consumption.

Figure 1.- Pattern of drop consumption

In order to detect this type of pattern we carried out the following processing:
- Divide the analysis time frame (two years) of each contract into 4 windows (6 months each). The division of the previous example is shown in Figure 2.
- Normalize the consumption of each customer by dividing each consumption value by the contracted power of the contract. This puts all the customers on a comparable scale in the analysis.
- Calculate the following values for each window: maximum, minimum, mean, median and standard deviation.
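The three processing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SPSS Modeler stream used in the study: the readings are assumed to be monthly (24 values for two years), the function name `window_features` and the feature names other than `Median_w1`-style ones are hypothetical, and the population standard deviation is an arbitrary choice because the text does not say which variant was used.

```python
from statistics import mean, median, pstdev


def window_features(consumption, contracted_power, n_windows=4):
    """Normalize a consumption series and summarize it per window.

    consumption: equally spaced readings covering the analysis period
                 (assumed monthly here, i.e. 24 values for two years).
    contracted_power: contracted power of the contract.
    """
    if contracted_power <= 0:
        raise ValueError("contracted power must be positive")
    if len(consumption) == 0 or len(consumption) % n_windows != 0:
        raise ValueError("series length must be a positive multiple of n_windows")

    # Normalization step: divide each reading by the contracted power.
    normalized = [value / contracted_power for value in consumption]

    size = len(normalized) // n_windows
    features = {}
    for w in range(n_windows):
        chunk = normalized[w * size:(w + 1) * size]
        tag = f"w{w + 1}"
        features[f"Max_{tag}"] = max(chunk)
        features[f"Min_{tag}"] = min(chunk)
        features[f"Mean_{tag}"] = mean(chunk)
        features[f"Median_{tag}"] = median(chunk)
        # Population standard deviation (an assumption, see the lead-in).
        features[f"Std_{tag}"] = pstdev(chunk)
    return features


# Invented series: steady consumption for a year, then a sudden drop to zero.
example = window_features([100.0] * 12 + [0.0] * 12, contracted_power=5.0)
```

For the invented series, the first two windows keep a non-zero median and the last two are null, which is the shape that the filtering rules look for.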

Figure 2.- Windows in the consumption pattern

Once these window parameters had been obtained for each customer, we applied algorithms that take them as inputs and produce filtering rules. The first algorithm that we applied was a decision tree, specifically C&R [12]. The classification and regression (C&R) tree node is a tree-based prediction and classification method. Similar to C4.5, this method uses recursive binary partitioning to split the training records into segments with similar values of the output field. By means of C&R we obtained the first simple tree, shown in Figure 3, which divides the sample set according to the median of window 1 (Median_w1). Taking into account that the training algorithm holds out 30% of the data to prevent overfitting, the final results are presented in Figure 4. That is, the rule Median_w1>36.61 achieved a confidence of 12.87% (35 out of 272) with a support of 7.8% (272 out of 3510).
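As a worked check of how support and confidence relate to the counts, the sketch below computes both measures for a rule. It assumes the usual reading of the two terms in this context: support is the fraction of inspected customers selected by the rule, and confidence is the fraction of selected customers with an NTL. The function name `rule_metrics` and the boolean-list representation are illustrative choices; the counts (3510 customers, 272 selected, 35 NTLs among them, 164 NTLs overall) are the ones reported in the text and in Table 1.

```python
def rule_metrics(selected, is_ntl):
    """Return (support, confidence) for a rule.

    selected: list of booleans, True if the rule selects the customer.
    is_ntl:   parallel list of booleans, True if the inspection found an NTL.
    """
    total = len(selected)
    n_selected = sum(selected)
    n_hits = sum(1 for s, y in zip(selected, is_ntl) if s and y)
    support = n_selected / total
    confidence = n_hits / n_selected if n_selected else 0.0
    return support, confidence


# Rebuild the reported counts: 272 selected customers, 35 of them with an NTL,
# and 164 NTLs in the whole sample of 3510 customers.
selected = [True] * 272 + [False] * 3238
is_ntl = [True] * 35 + [False] * 237 + [True] * 129 + [False] * 3109

support, confidence = rule_metrics(selected, is_ntl)
print(f"support = {support:.1%}, confidence = {confidence:.2%}")
```

The reported 12.87% corresponds to 35 NTLs among the 272 selected customers, and the reported 7.8% to 272 selected customers out of 3510.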

Figure 3.- C&R tree for filtering

As can be seen, this rule greatly improves on the success rate of the original sample set. However, this pattern has a low support. Thus, searching for a higher support by means of C&R, we obtained another rule (Median_w1>36.6 or Mean_w2>22.9) with a higher support (12.5%) and a good confidence (10.8%).

Figure 4.- Results of first C&R tree

Finally, to obtain an even higher support, we generated another rule (Median_w1 > 36.61 or (Mean_w3 0.63)) with a support of 21% and a confidence of 9.28%. The results of this rule are presented in Figure 5.

Figure 5.- Results of third C&R tree

As can be deduced from their conditions, all the previous rules aim to find the shape of Figure 1. That is, these rules identify customers with consumption in the first windows and null consumption in the last ones. Depending on the number of customers that Endesa wants to inspect (that is, the support of the rule), one rule or another would be used.

Subsequently, we used another type of data-mining algorithm: clustering. For the clustering procedure, we first filtered out all the outliers of the sample set (283 customers). We then tested the K-Means algorithm and self-organizing maps (SOM).

Figure 6.- SOM neural network

The best results were obtained with a SOM [13] with a 4x3 neuron structure. Analyzing the NTLs grouped in each neuron gives the chart of Figure 7 (where the first column gives the coordinates of each neuron and the remaining columns its information).

Figure 7.- Results of SOM neural network

As can be noted, there are two clusters (X=2, Y=0 and X=3, Y=0) with a higher rate of NTLs. Specifically, joining both clusters gives a confidence of 12.5% with a support of 5%. If we merge the customers detected with both algorithms, corresponding to the third rule from the C&R tree and the SOM algorithm, the confidence increases to 14.75% for a support of 8%. These results are shown below in Figure 8. Combining both algorithms could also serve as a method to validate the results.
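To make the clustering step more concrete, the following is a minimal self-organizing map written in plain Python, followed by the per-neuron NTL rate that a chart like the one described for the SOM results would report. It is a generic textbook SOM under stated assumptions, not the SPSS Modeler implementation used in the study: the learning-rate and neighbourhood schedules, the default parameters and the names `train_som`, `best_matching_unit` and `ntl_rate_per_neuron` are all hypothetical, and the toy data are invented.

```python
import math
import random


def best_matching_unit(weights, vector):
    """Grid coordinates (x, y) of the neuron closest to the vector."""
    return min(
        weights,
        key=lambda k: sum((a - b) ** 2 for a, b in zip(weights[k], vector)),
    )


def train_som(data, cols=4, rows=3, epochs=20, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small rectangular SOM; returns {(x, y): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {
        (x, y): [rng.random() for _ in range(dim)]
        for x in range(cols)
        for y in range(rows)
    }
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        samples = list(data)
        rng.shuffle(samples)
        for vector in samples:
            # Learning rate and neighbourhood radius shrink linearly over time.
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)
            sigma = max(sigma0 * (1.0 - frac), 0.3)
            bx, by = best_matching_unit(weights, vector)
            for (x, y), w in weights.items():
                grid_d2 = (x - bx) ** 2 + (y - by) ** 2
                influence = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                for i in range(dim):
                    w[i] += lr * influence * (vector[i] - w[i])
            step += 1
    return weights


def ntl_rate_per_neuron(weights, data, labels):
    """Map each neuron to (number of customers, fraction with an NTL)."""
    counts = {}
    for vector, is_ntl in zip(data, labels):
        key = best_matching_unit(weights, vector)
        n, hits = counts.get(key, (0, 0))
        counts[key] = (n + 1, hits + int(is_ntl))
    return {key: (n, hits / n) for key, (n, hits) in counts.items()}


# Invented two-feature data: two groups of customers with different labels.
points = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]] * 5
labels = [False, False, True, True] * 5
som = train_som(points)
rates = ntl_rate_per_neuron(som, points, labels)
```

In practice the inputs would be the normalized window features of each customer, and neurons with a clearly higher NTL rate would be the candidates to select.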

Figure 7.- Results merging C&R tree and SOM network

C. Filter by type of contract use (private or business)

Besides customer consumption, we had an additional parameter called CNAE [14], the list of economic activities. This parameter specifies whether the customer is private or a business (as well as the type of business). In this study we tried to understand how the type of business influences whether null consumption is inherent to the contract, knowing that businesses such as the following have inherent consumption drops: wells, lightings, irrigation pumps, water purification and construction.

A first analysis of the distribution of the main CNAEs (each one is encoded as a numeric value), together with the corresponding distribution of NTLs, is shown in Figure 8 (the column Proportion indicates the normalized proportion of NTLs for each CNAE).

Figure 11.- Flowchart of the rule set (boxes: Inspections originally proposed by Endesa; Rule 3 from C&R; Rule 2 from C&R; Rule 1 from C&R; Rule 1 from C&R merged with CNAE filter; axis: Confidence (+))

Figure 8.- Distribution of CNAEs and NTLs

As can be seen from the meaning of the CNAE values, private customers (values 9820 and 9810) and warehousing and storage (value 5210) account for most of the inspections. On the other hand, the highest rates of NTLs are found in private customers, legal activities (value 6910) and restaurants (value 5610). Therefore, if we keep only the customers with CNAE codes 9820, 9810, 6910 and 5610 from the rules of the C&R algorithm, their confidence improves while the support remains of the same order. The results for these rules are presented in Figure 9.

Figure 9.- Results of C&R with selection of CNAEs

Lastly, if we filter by CNAE (9820, 9810, 6910 and 5610) the results of merging the C&R tree and the SOM network, we obtain a very high confidence (almost 18%) in the resulting pattern. This result is almost four times the original success rate obtained by Endesa in its inspections.
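The CNAE analysis amounts to a per-code NTL rate followed by a filter on a set of codes. The sketch below illustrates both steps; the four codes in `HIGH_NTL_CNAES` are the ones named in the text, while the record layout (dictionaries with `cnae` and `is_ntl` keys), the function names and the toy customers are assumptions made for illustration.

```python
from collections import defaultdict

# CNAE codes kept by the filter, as listed in the text.
HIGH_NTL_CNAES = {"9820", "9810", "6910", "5610"}


def ntl_rate_by_cnae(customers):
    """Fraction of inspected customers with an NTL for each CNAE code."""
    counts = defaultdict(lambda: [0, 0])  # code -> [customers, NTLs]
    for customer in customers:
        entry = counts[customer["cnae"]]
        entry[0] += 1
        entry[1] += int(customer["is_ntl"])
    return {code: hits / n for code, (n, hits) in counts.items()}


def filter_by_cnae(customers, allowed=HIGH_NTL_CNAES):
    """Keep only the customers whose CNAE code is in the allowed set."""
    return [c for c in customers if c["cnae"] in allowed]


# Invented records, only to show the shape of the computation.
toy_customers = [
    {"cnae": "9820", "is_ntl": True},
    {"cnae": "9820", "is_ntl": False},
    {"cnae": "5210", "is_ntl": False},
    {"cnae": "5610", "is_ntl": True},
]
toy_rates = ntl_rate_by_cnae(toy_customers)
toy_kept = filter_by_cnae(toy_customers)
```

Applying `filter_by_cnae` to the customers already selected by a consumption rule gives the combined filter described above.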

Figure 10.- Results of merging C&R/SOM with CNAEs

As a brief summary, Figure 11 presents the flowchart of the generated rule set, which also shows the variations of support and confidence. A specific rule would be selected depending on the number of customers that Endesa wants to inspect (which determines a specific support).
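One possible way to turn that selection into a procedure is sketched below: among the rules whose support is large enough to supply the desired number of inspections, take the one with the highest confidence. The support and confidence values are those reported for the three C&R rules; the selection criterion itself, the function name `pick_rule` and the example inspection counts are assumptions, since the text does not specify how the choice is made.

```python
# (name, support, confidence) as reported for the three C&R rules.
RULES = [
    ("Rule 1 from C&R", 0.078, 0.1287),
    ("Rule 2 from C&R", 0.125, 0.108),
    ("Rule 3 from C&R", 0.21, 0.0928),
]


def pick_rule(rules, n_to_inspect, n_candidates):
    """Most confident rule that still selects at least n_to_inspect customers.

    Returns None when no rule has enough support.
    """
    target_support = n_to_inspect / n_candidates
    eligible = [rule for rule in rules if rule[1] >= target_support]
    if not eligible:
        return None
    return max(eligible, key=lambda rule: rule[2])


# With 3510 candidate customers, a small campaign can use the strictest rule,
# while a larger one has to fall back to a rule with more support.
small_campaign = pick_rule(RULES, 250, 3510)
large_campaign = pick_rule(RULES, 700, 3510)
```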

III. CONCLUSIONS

For the electrical distribution business, detecting NTLs is a very important task; for instance, in Spain it has been estimated that fraud accounts for about 35-45% of the total NTLs in terms of energy. One of the methodologies employed by power utilities is the search for and inspection of customers with null consumption during a certain period. The main problems of this methodology are:
- The low success rate of this type of inspection (around 5%). This is because null consumption is not necessarily an NTL, and additional information is often needed to support that conclusion.
- The requirement of a large number of inspectors, so that the low success rate of the inspections involves a high cost to the Company.

Endesa is the most important energy distribution company in Spain, with more than 12 million customers. For the past 6 years we have been carrying out a collaborative project with it, and we are currently working on improving its inspections of customers with null consumption. We have developed a module, based on rules obtained from a data mining process, to improve the results of the inspections. This module is an additional filter on top of the methodology that Endesa already applies. It consists of a set of rules that filters a number of customers depending on the support (percentage filtered) that Endesa wants to achieve in its inspections. These patterns have been generated with two algorithms: C&R (a type of decision tree) and SOM (a type of neural network for clustering). The module has been developed and validated with the database of results of a real campaign of inspections carried out by the company. Our module multiplied the original success rate of these inspections by up to 4. Currently, this module is being used by Endesa to generate a new campaign.

ACKNOWLEDGMENT

The authors would like to thank the Endesa Company and Sadiel Company for providing the funds for this project (since 2005). The authors are also indebted to the following colleagues for their valuable assistance in the project: Gema Tejedor, Francisco Godoy, Joaquín Mejías, Rocío Millán and Jesús Biscarri. Special thanks to Jesús Macías, Eduardo Ruizbérriz, Juan Ignacio Cuesta, Tomás Blazquez and Jesús Ochoa for their help and cooperation.

REFERENCES

[1] N. Padhy, P. Mishra, R. Panigrahi, "The Survey of Data Mining Applications And Feature Scope", International Journal of Computer Science, Engineering and Information Technology (IJCSEIT), vol. 2, no. 3, June 2012.
[2] C. Phua, V. Lee, K. Smith-Miles, and R. Gayler, "A comprehensive survey of data mining-based fraud detection research", Artificial Intelligence Review, 2010.
[3] Yufeng Kou, Chang-Tien Lu, S. Sirwongwattana, and Yo-Ping Huang, "Survey of fraud detection techniques", presented at the 2004 IEEE International Conference on Networking, Sensing and Control, 2004, vol. 2, pp. 749-754.
[4] M. Weatherford, "Mining for fraud", IEEE Intelligent Systems, vol. 17, no. 4, pp. 4-6, Aug. 2002.
[5] S. Ghosh and D. L. Reilly, "Credit card fraud detection with a neural-network", presented at the Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences, 1994, vol. 3, pp. 621-630.
[6] E. Aleskerov, B. Freisleben, and B. Rao, "CARDWATCH: a neural network based database mining system for credit card fraud detection", presented at the Computational Intelligence for Financial Engineering (CIFEr), Proceedings of the IEEE/IAFE 1997, 1997, pp. 220-226.
[7] J. R. Dorronsoro, F. Ginel, C. Sánchez, and C. S. Cruz, "Neural fraud detection in credit card operations", IEEE Transactions on Neural Networks, vol. 8, no. 4, pp. 827-834, Jul. 1997.
[8] I. Monedero, F. Biscarri, C. León, J.I. Guerrero, J. Biscarri et al., "Detection of frauds and other non-technical losses in a power utility using Pearson coefficient, Bayesian networks and decision trees", International Journal of Electrical Power and Energy Systems, vol. 34, no. 1, pp. 90-98, 2012.
[9] C. León, F. Biscarri, I. Monedero, J.I. Guerrero, J. Biscarri, "Integrated Expert System Applied to the Analysis of Non-Technical Losses in Power Utilities", Expert Systems with Applications, vol. 38, no. 8, pp. 10274-10285, 2011.
[10] F. Biscarri, C. León, I. Monedero, J.I. Guerrero, J. Biscarri, "Variability and Trend-Based Generalized Rule Induction Model to NTL Detection in Power Companies", IEEE T Power Syst., vol. 26, no. 4, pp. 1798-1807, 2011.
[11] .i m.c m s ae d c s es es s ss-m dele
[12] Editorial. Recent advances in data mining. Eng App Artif Intell; 2006. p. 19.
[13] S. Valero, M. Ortiz, C. Senabre, A. Gabaldón, and F. García, "Classification, filtering and identification of electrical customer load pattern through the use of self-organizing maps," IEEE Trans. Power Systs., vol. 21, no. 4, pp. 1672-1682, 2006.
[14] http://www.cnae.com.es/lista-actividades.php