
An adaptive technique for energy consumption anomaly detection

Gaetano Zazzaro, Angelo Martone, Gianpaolo Romano and Michele Ferrucci

Abstract Energy saving is a pressing topic, given the increase in energy costs and the resulting efforts to reduce waste in both residential and industrial settings. In particular, energy resources must be optimized by identifying potential anomalies in the consumption of a specific system by means of ad hoc analysis techniques, based on the peculiar characteristics of the monitored system. In this paper, we illustrate an analysis technique that uses Temporal Data Mining algorithms, in particular clustering and box-plot diagrams, to identify and represent anomalies in consumption flows. The algorithm determines power consumption ranges that adapt themselves, in real time, to new or unpredictable operating conditions.

Keywords Temporal Data Mining, anomaly detection, energy consumption monitoring, online analysis, statistical-based adaptive ranges.

Gaetano Zazzaro, CIRA; email: [email protected]
Angelo Martone, CIRA; email: [email protected]
Gianpaolo Romano, CIRA; email: [email protected]
Michele Ferrucci, CIRA; email: [email protected]


1 Introduction

This paper describes an adaptive analysis technique developed within the SIDECO (SIstema Dati Energetici aperti e COndivisi) project [11] to detect anomalies in the energy consumption process. The SIDECO project, funded by MIUR (Italian Ministry of Education, University and Research), aimed at creating a Big Data Analytics software platform to support the strategic decisions of institutions, companies, professionals and communities, with the objective of providing innovative data analysis techniques applied to the energy management domain. Within the project, an architecture for real-time monitoring, analysis and correlation of sensor data received from heterogeneous streams, such as wireless sensor networks and web services, has been proposed and implemented. This monitoring activity focuses on detecting possible anomalies or failures in energy consumption, so that mitigation actions can be adopted to minimize the waste of energy.

The remainder of this paper is organized as follows. Section 2 collects the requirements and the preliminary definitions for the design of the Analysis Models. Section 3 explores the dataset, pointing out its features and analysing its outliers. Section 4 defines the algorithm and the main data analytics concepts, while Section 5 reports the results of the implementation and testing of the algorithm. Finally, Section 6 outlines possible future work, concerning the application of the proposed technique to different domains and use cases, and its extension to applications characterized by large datasets, using scalable analytics frameworks.

2 Analysis Models Design

The Analysis Models of the electrical consumption of a user (residential, mall, industrial plant, etc.) must be able to identify abnormal power consumption in real time. In addition, they must be able to update themselves as soon as new data arrive and whenever consumption habits change (e.g. energy savings, installation of new plants, etc.). In the next sections, the requirements, the preliminary definitions and the thresholds algorithm are given in more detail.

2.1 Requirements

The electrical consumption Analysis Models must show a "sensitivity" to changes in energy consumption habits and therefore they must show a fast and complete adaptation to the new conditions. In particular, these models are designed to detect and report abnormal consumption peaks by applying statistical methods. Table 1 lists the requirements that will be met by the Analysis Models.

Table 1: Analysis Models Requirements

| Id | Req. Name | Description |
|---|---|---|
| REQ1 | Analysis | Analysis Models must be able to analyse energy consumption data and should be aimed at detecting any abnormal consumption |
| REQ2 | Adaptability | Analysis Models must show adaptability to changing habits and energy behaviours of consumption, savings, or any changes to the plants |
| REQ3 | Phases | Analysis Models must take into account the different hours of the day |
| REQ4 | Exceptions | Analysis Models should consider the existence of exceptional days |
| REQ5 | Background | Analysis Models must be able to handle background energy consumption |
| REQ6 | Failures | Analysis Models must be robust to any failure in the recording and/or transmission of consumption data |

2.2 Preliminary Definitions

This section provides the key definitions used to analyse energy consumption data and to detect potentially abnormal power spikes or changes in habits. Let $U$ be the hourly consumption time series of a user (e.g. a private residence, a mall, etc.), i.e. a sequence of triples (date, hour, consumption):

$$U = \{(m_0/d_0/y_0, t_0, p_0), (m_0/d_0/y_0, t_1, p_1), \ldots, (m_0/d_0/y_0, t_{23}, p_{23}), (m_0/d_1/y_0, t_0, p_{24}), \ldots\}$$

where $m_i$, $d_i$ and $y_i$ are respectively the month, the day and the year that make up the date $D$, and $t_i$ is the hour in which the consumption $p_i$ was recorded. $t_0$ can also be written as $t_0 = 00{:}00 = [t_0] = [0]$, and more generally $t_i = i = [t_i] = [i]$.

Def1. A time interval consumption of period $T$ (in days) from date $D$ is the average $\mu_{i,T,D}$, calculated over the $T$ days before date $D$, of the energy consumption recorded at hour $i$ of each day. For simplicity, the average is rounded to the nearest integer. The set of $T$ days is also called Training Set. For example, $\mu_{14,20,03/02/2015}$ is the rounded average of the energy consumption recorded at 14:00 on the 20 days prior to 03/02/2015.

Def2. A consumption $p_i$ is anomalous (or an outlier) if it does not belong to $[s_1, s_2]|_{i,T,D}$, where $s_1$ and $s_2$ are respectively the lower and the upper energy threshold. $s_1$ and $s_2$ are calculated by applying:
1. the method of the minimum confidence interval in the night hours [23], [0], [1], [2], [3], [4], [5], [6], which are collectively indicated as [23,6] and also called night band;
2. the box-plot diagram method, in the other hours.

Both thresholds are calculated on the distribution formed by the $T$ energy consumptions recorded at hour $i$ on the $T$ days prior to the date $D$. For example, if $i = [13]$, $T = 20$ and $D$ = 02/03/2015, the distribution consists of the 20 energy consumptions recorded at 13:00 on the 20 days prior to 02/03/2015, and $s_1$ and $s_2$ are determined with the box-plot diagram method.
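As an illustration of Def1, the following minimal Python sketch computes a time interval consumption from a list of (date, hour, consumption) triples. It is not the authors' implementation: the function names are hypothetical, and rounding ties upward in `arr_int` is an assumption, since the text only says "nearest integer".

```python
import math
from datetime import date, timedelta

def arr_int(x):
    # Round to the nearest integer; rounding ties upward is an assumption.
    return math.floor(x + 0.5)

def hourly_training_values(series, hour, T, D):
    """Consumptions recorded at `hour` on the T days strictly before date D."""
    first_day = D - timedelta(days=T)
    return [p for (day, h, p) in series if h == hour and first_day <= day < D]

def time_interval_consumption(series, hour, T, D):
    """Rounded average of Def1 for the given hour, period T and date D."""
    values = hourly_training_values(series, hour, T, D)
    if not values:
        raise ValueError("no observations in the training set")
    return arr_int(sum(values) / len(values))
```

The list returned by `hourly_training_values` is also the distribution on which the thresholds of Def2 would be computed.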

2.3 Thresholds calculation by the Box-Plot Diagram Method

The following algorithm calculates the thresholds by using the box-plot diagram method, which is useful to identify extreme outlier values in a univariate statistical distribution:

$$S = \{(t_1, p_1), (t_2, p_2), \ldots, (t_n, p_n)\}$$

Algorithm: Box-Plot Diagram Method
Input: $S$ distribution of length $n$
Output: upper and lower abnormal extremes
1. Sort $S$ by increasing consumption, so that $p_1 \le p_2 \le \ldots \le p_n$
2. Determine the median $Me$:
   a. If $n$ is odd: $Me = p_{(n+1)/2}$
   b. If $n$ is even: $Me = \frac{p_{n/2} + p_{n/2+1}}{2}$
3. Determine the first quartile $Q_1$:
   a. If $0.25 \cdot n$ is not an integer, the index $k$ is the next integer and $Q_1 = p_k$
   b. If $0.25 \cdot n$ is an integer, $Q_1 = \frac{p_k + p_{k+1}}{2}$ with $k = 0.25 \cdot n$
4. Determine the third quartile $Q_3$:
   a. If $0.75 \cdot n$ is not an integer, the index $k$ is the next integer and $Q_3 = p_k$
   b. If $0.75 \cdot n$ is an integer, $Q_3 = \frac{p_k + p_{k+1}}{2}$ with $k = 0.75 \cdot n$
5. Determine the interquartile distance $DI = Q_3 - Q_1$
6. Determine the extreme abnormal values, using the following thresholds:
   a. Lower Threshold: $s_1$ is the smallest $p_i$ of the distribution greater than or equal to $Q_1 - 3 \cdot DI$. The $p_i$ below $s_1$ are lower extremes (abnormal or outlier);
   b. Upper Threshold: $s_2$ is the greatest $p_i$ of the distribution smaller than or equal to $Q_3 + 3 \cdot DI$. The $p_i$ above $s_2$ are higher extremes (abnormal or outlier).
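The steps of the box-plot diagram method can be sketched in Python as follows. This is an illustrative sketch rather than the authors' Java code: the function names are hypothetical, the quartile positions are handled with integer arithmetic to avoid floating-point comparisons, and the lower fence is taken as $Q_1 - 3 \cdot DI$, the usual convention for extreme outliers (an assumption of this sketch).

```python
def quartile(sorted_values, numerator):
    """Quartile as in steps 3-4; numerator is 1 for Q1 and 3 for Q3."""
    # position = numerator * n / 4, expressed with 1-based indices in the text
    q, r = divmod(numerator * len(sorted_values), 4)
    if r:
        # Position is not an integer: take the next integer index k = q + 1,
        # which is element q in 0-based Python indexing.
        return sorted_values[q]
    # Position is an integer k = q: average p_k and p_(k+1).
    return (sorted_values[q - 1] + sorted_values[q]) / 2

def box_plot_thresholds(values):
    """Return (s1, s2), the lower and upper thresholds of step 6."""
    s = sorted(values)
    q1, q3 = quartile(s, 1), quartile(s, 3)
    di = q3 - q1                                  # interquartile distance
    s1 = min(p for p in s if p >= q1 - 3 * di)    # smallest value inside the fence
    s2 = max(p for p in s if p <= q3 + 3 * di)    # greatest value inside the fence
    return s1, s2

def is_anomalous(p, s1, s2):
    """Def2: a consumption outside [s1, s2] is abnormal."""
    return p < s1 or p > s2
```

For instance, with twelve readings of which one is 100 and the others lie between 10 and 14, the thresholds are 10 and 14, so the reading of 100 is flagged as an upper extreme.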

3 Dataset Exploration

The available dataset contains the electrical consumption data of a mall, sampled hourly, and consists of 4224 observations running from 01/01/2015 to 25/06/2015. The following sections detail the analysis of the consumption data.


3.1 Analysis of Consumption trends in a mall

The consumption chart for the entire observation period is shown in Figure 1.

Figure 1: Chart of the time series consumption

We can easily observe that:
• there is a background energy consumption (BEC);
• there are abnormal energy consumptions;
• there are days when the mall is closed (green circles in the figure);
• there are hours (about 0.6% of the entire data sample) in which the consumption is either zero or near zero, representing failures in records due to unknown causes (red circles in the figure);
• it is not possible to determine the existence of periods and consumption cycles, given the small sample of data available.

Figure 2: Average hourly consumption chart

The chart of Figure 2 shows the average hourly consumption of the entire reference period. Thanks to this chart, it is possible to confirm the existence of a background energy consumption, which can be obtained from night-time consumption (from 23:00 to 06:00), where the mall has a constant consumption on average.

3.2 Background Energy Consumption and failures in recordings

The Background Energy Consumption, called BEC, can be derived from the consumptions during the night hours [23], [0], [1], [2], [3], [4], [5], [6], which are collectively indicated as [23,6] and also known as the night band. The average of the consumptions in the night hours is taken as the BEC value, which for the whole dataset is 10.6. In addition, it is important to consider any failures in recordings and/or any abnormal consumption, such as those circled in Figure 1. Therefore, after calculating the BEC, the following rule for failures is set: "If any consumption at the i-th hour is lower than $arr\_int(BEC - \sigma)$, it will be replaced by the average consumption of the i-th hour".

This rule is also justified by the fact that the minimum confidence interval $[\mu - \sigma, \mu + \sigma]$ (i.e., considering rounding, [9, 12]) contains 1200 observations, which are about 97% of the 1232 elements of the night hours. In addition, only 6 observations in these hours have a consumption less than $arr\_int(BEC - \sigma) = 9$ (such as the red-circled elements in Figure 1). The chart in Figure 3 shows only the night-time consumptions, together with the thresholds 9 and 12. The 6 consumptions below the lower threshold 9 are evident. Furthermore, there is a single consumption equal to 0, clearly caused by a measurement failure, recorded on 29/03/2015 at 02:00.
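The BEC calculation and the rule for failures can be sketched in Python as shown below. This is a hedged illustration, not the authors' code: the function names are hypothetical, the population standard deviation is assumed for $\sigma$, ties in `arr_int` are rounded upward, and the hourly average used as the replacement value is computed over all readings of that hour (including the failed ones), which the text does not specify.

```python
import math
import statistics

NIGHT_BAND = (23, 0, 1, 2, 3, 4, 5, 6)   # the [23,6] night band

def arr_int(x):
    # Round to the nearest integer; rounding ties upward is an assumption.
    return math.floor(x + 0.5)

def night_band_stats(series):
    """Return (BEC, sigma) over the night-band readings of the series."""
    night = [p for (_, hour, p) in series if hour in NIGHT_BAND]
    # Population standard deviation is an assumption of this sketch.
    return statistics.mean(night), statistics.pstdev(night)

def replace_failures(series):
    """Replace readings below arr_int(BEC - sigma) with the hourly average."""
    bec, sigma = night_band_stats(series)
    floor_value = arr_int(bec - sigma)
    by_hour = {}
    for _, hour, p in series:
        by_hour.setdefault(hour, []).append(p)
    hourly_avg = {h: arr_int(statistics.mean(v)) for h, v in by_hour.items()}
    return [(day, hour, p if p >= floor_value else hourly_avg[hour])
            for day, hour, p in series]
```

Each element of `series` is a (day, hour, consumption) triple; the day can be any label, since only the hour and the consumption are used here.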

Figure 3: Chart of energy consumption in a [0,6] band with thresholds 9 and 12

Figure 4 shows the average daily energy consumption chart. It is possible to see that:
• cycles, seasonality and/or periodicity of consumption cannot be obtained, as there is not enough data available;
• there are closing days of the mall in which the energy consumption can be approximated with the BEC. These closing days are exceptions;
• the highest energy consumption is in May and June.

Figure 4: Chart of daily energy consumption

3.3 Outlier Analysis

By applying the DBSCAN [5] clustering algorithm, with epsilon = 0.1 and min_num = 1, to the whole dataset of average daily energy consumption, the noisy elements of the dataset are mined; these can easily be matched to entire abnormal or exceptional days. In this way, only 4 days of mall closure are mined (Table 2):

Table 2: Results of the DBSCAN Clustering algorithm application

| Instance number | Row tag | Day | Cluster |
|---|---|---|---|
| 0 | 1/1/2015 | Thursday | NOISE |
| 94 | 5/4/2015 | Sunday | NOISE |
| 95 | 6/4/2015 | Monday | NOISE |
| 120 | 1/5/2015 | Friday | NOISE |

In addition, looking at the algorithm results, it is possible to infer that during the monitored period there are no exceptional days (e.g. strike days and/or days of special maintenance) other than the four closing days for national holidays. The mall closing days are handled separately and are listed in a set named E, known a priori. Figure 5 shows the chart of the average hourly energy consumption on the closing days of the mall. It is possible to see that:
• in [0,15] the average hourly power consumption is close to the BEC;
• in [16,23] there is a consumption peak, probably tied to maintenance or goods unloading activities.

Figure 5: Average hourly energy consumption chart in closing days

4 An algorithm for thresholds determination

The following sections describe an algorithm for thresholds determination. The algorithm is quite generic, because it can be applied to a broad range of users (e.g. residential, commercial, industrial, etc.).


4.1 Design specifications

Table 3 shows the specifications used to design the energy data analysis algorithm, which detects abnormal consumption by determining thresholds.

Table 3: Specifications for designing analysis algorithm

| Id | Name | Description | REQ |
|---|---|---|---|
| SPEC1 | Sliding Training | The training set T is fixed to the 30 days prior to the current day, reduced by the days listed in the set E defined in SPEC6 and taking into account SPEC5. | REQ2 |
| SPEC2 | Night Hours | The hours [23], [0], [1], [2], [3], [4], [5], [6] are defined as night hours, indicated by [23,6] and also called Night Band. | REQ3 |
| SPEC3 | Thresholds | Thresholds are calculated by distinguishing two cases: (1) in the night band the lower threshold is $s_1 = arr\_int(\mu - \sigma)$, while the upper threshold is $s_2 = arr\_int(\mu + \sigma)$, with $\mu$ = mean and $\sigma$ = standard deviation, calculated in [23,6], and $arr\_int$ meaning rounding to the nearest integer; (2) otherwise, the box-plot diagram method is used. In this way, one threshold for the lower extremity and one for the upper extremity are obtained. | REQ1 |
| SPEC4 | BEC | The BEC is calculated as the average $\mu$ of the consumption in the night band. | REQ5 |
| SPEC5 | Failures | Any hourly energy consumption lower than $arr\_int(BEC - \sigma)$ is replaced with the average energy consumption of the corresponding hour, in order to calculate the thresholds of SPEC3. | REQ6 |
| SPEC6 | Exceptions | A set E is defined containing the exceptional days. The thresholds on these days are considered equal to those of the night band. | REQ4 |
| SPEC7 | Updating | All calculation routines are updated upon receipt of the last measurement of the current day. | REQ2 |
| SPEC8 | Anomalies | A consumption that does not belong to $[s_1, s_2]|_{i,T,date}$ is considered abnormal. | REQ1 |

4.2 Thresholds algorithm

For the calculation in every hourly band, the Training Set T consists of the 30 days prior to the current day G, from which the set E of exceptional days, known at the beginning of the year, is subtracted. For residential users, the set E may be empty. It is also important to note that any failures in the recordings are replaced with the average consumption of the corresponding hourly band. The algorithm for thresholds calculation is listed below.

Algorithm: Thresholds
Input: hourly energy consumption time series $S$; current day $G$; set $E$ of exceptional days
Output: thresholds of Def2
1. Get the $C$ set, selecting the 30 days prior to the current day $G$ from $S$
2. BEC calculation, as in SPEC4, considering $C$ elements
3. Replace failures in $C$, as in SPEC5, getting the $C^*$ set
4. Replace failures in $E$, as in SPEC5, getting the $E^*$ set
5. Calculate the training set $T = C^* - E^*$
6. Considering all elements in $T$
7. foreach time band do
8.    if $G \in E$ then
9.       calculate thresholds $s_1$ and $s_2$ as in 1 of SPEC3
10.   else
11.      calculate thresholds $s_1$ and $s_2$ as in SPEC3 (case 1 in the night band, case 2 otherwise)
12.   endif
13. endfor

The Thresholds Algorithm is applied as follows:
1. At the beginning of G, the algorithm is applied to calculate the lower threshold $s_1$ and the upper threshold $s_2$, including the transformation of C into T by replacing the failures and removing the set $E^*$, which is also cleansed of failures;
2. The thresholds $s_1$ and $s_2$ are plotted;
3. The hourly consumptions of G are calculated;
4. The consumptions are plotted.
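A compact Python sketch of the Thresholds algorithm is given below, for illustration only (the paper's Analysis Models were developed in Java). All names are hypothetical. The sketch assumes that exceptional days always use the night-band thresholds (SPEC6), that ordinary days use the night-band rule in [23,6] and the box-plot rule elsewhere (SPEC3), that the lower box-plot fence is $Q_1 - 3 \cdot DI$, that $\sigma$ is the population standard deviation, and that failures are replaced by the hourly average computed over all readings of $C$.

```python
import math
import statistics
from datetime import date, timedelta

NIGHT_BAND = {23, 0, 1, 2, 3, 4, 5, 6}   # SPEC2

def arr_int(x):
    # Round to the nearest integer; rounding ties upward is an assumption.
    return math.floor(x + 0.5)

def quartile(sorted_values, numerator):
    # numerator = 1 for Q1, 3 for Q3; positions are 1-based in the text.
    q, r = divmod(numerator * len(sorted_values), 4)
    if r:
        return sorted_values[q]
    return (sorted_values[q - 1] + sorted_values[q]) / 2

def box_plot_thresholds(values):
    s = sorted(values)
    q1, q3 = quartile(s, 1), quartile(s, 3)
    di = q3 - q1
    return (min(p for p in s if p >= q1 - 3 * di),
            max(p for p in s if p <= q3 + 3 * di))

def night_stats(readings):
    night = [p for (_, hour), p in readings.items() if hour in NIGHT_BAND]
    return statistics.mean(night), statistics.pstdev(night)

def thresholds(series, G, E, window_days=30):
    """series maps (day, hour) -> consumption; returns {hour: (s1, s2)}."""
    # Step 1: C = readings of the 30 days prior to the current day G.
    window = {G - timedelta(days=k) for k in range(1, window_days + 1)}
    C = {key: p for key, p in series.items() if key[0] in window}
    # Step 2: BEC (SPEC4) and sigma over the night band of C.
    bec, sigma = night_stats(C)
    # Steps 3-4: replace failures (SPEC5) with the hourly average.
    by_hour = {}
    for (_, hour), p in C.items():
        by_hour.setdefault(hour, []).append(p)
    hourly_avg = {h: arr_int(statistics.mean(v)) for h, v in by_hour.items()}
    floor_value = arr_int(bec - sigma)
    cleaned = {(d, h): (p if p >= floor_value else hourly_avg[h])
               for (d, h), p in C.items()}
    # Step 5: T = C* - E* (drop the exceptional days).
    T = {key: p for key, p in cleaned.items() if key[0] not in E}
    # Steps 6-13: one pair of thresholds per hourly band.
    mu, sd = night_stats(T)
    night_range = (arr_int(mu - sd), arr_int(mu + sd))
    result = {}
    for hour in sorted({h for (_, h) in T}):
        if G in E or hour in NIGHT_BAND:
            result[hour] = night_range
        else:
            result[hour] = box_plot_thresholds(
                [p for (_, h), p in T.items() if h == hour])
    return result
```

A reading `p` recorded at hour `i` of day `G` would then be reported as abnormal when it falls outside `thresholds(series, G, E)[i]`, as stated in SPEC8.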

5 Experiment Design & Visualization

In order to test the Analysis Models on streaming data, a software platform was developed (Figure 6) to simulate streaming data (the measurement data were read from a csv file, see Section 3.1) and to collect them to build the training set T. The Analysis Models were developed in the Java programming language (they were grouped in the Data Analytical Layer) and they take as input both the streaming measurement data and the training set T.

Additionally, to store measurements as time series, a particular type of database called Time Series Database [4] (TSDB for short) was used. A TSDB is a software system optimized to manage time series data, i.e. arrays indexed by time. OpenTSDB [12] is a highly scalable and distributed open source time series database, which was chosen as the TSDB implementation within the reference architecture for the SIDECO project.

Figure 6: Energy Data Analytics System Architecture

5.1 Streaming simulation

In order to simulate the stream of consumption data, the csv file is read by a Java process that sends one packet for each line of the file to both the TSDB and the Data Analytical Layer. Each packet is made up as follows:
1. a metric = the string "EnergyConsumption";
2. a timestamp = the timestamp of the measurement;
3. a value = the measured value;
4. a tag = a pair of strings (sensor=Mall in the example below), which specifies where the measurement comes from.

An example of a data packet is as follows:

[EnergyConsumption 1448979636 22 sensor=Mall]

Lastly, a Data Visualization Layer was developed, using the Grafana [7] and Highcharts [9] software tools, to allow the visual evaluation of the experimental results.
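The packet layout of the streaming simulation can be illustrated with the short Python sketch below, which only builds the packet strings and does not send them anywhere. It is not the Java process used in the paper: the function names are hypothetical, and the csv layout (one `timestamp,value` pair per line, without header) is an assumption, since the actual file format is not described.

```python
import csv
import io

def format_packet(metric, timestamp, value, tags):
    """Build one packet: metric, timestamp, value and key=value tags."""
    tag_text = " ".join(f"{key}={val}" for key, val in tags.items())
    return f"[{metric} {timestamp} {value} {tag_text}]"

def packets_from_csv(csv_text, metric="EnergyConsumption", tags=None):
    """Yield one packet per csv line (assumed layout: timestamp,value)."""
    tags = tags or {"sensor": "Mall"}
    for timestamp, value in csv.reader(io.StringIO(csv_text)):
        yield format_packet(metric, int(timestamp), int(value), tags)
```

In a simulation, each yielded packet would be forwarded to both the TSDB and the Data Analytical Layer, for example at a fixed rate to emulate the hourly sampling.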

5.2 Experimental results visualization

In this section, a detailed description of the Data Visualization Layer implementation is provided. This layer (see Figure 6) consists of two GUIs, each designed for a particular type of end user (i.e. Normal User and Energy Data Scientist, described below):
1. Grafana: to display time series with a refresh rate chosen by the user;
2. Thresholds Graph Visualizer: to display the consumptions and thresholds calculated by the Data Analytical Layer.

Grafana is a graphical editor for time series that can use the most popular time series databases as data sources and can overlay multiple time series in a single dashboard. This GUI was designed for Normal Users, who have no expertise in the energy domain and no data analysis skills.

The Thresholds Graph Visualizer is a GUI developed ad hoc, which displays the thresholds for a selected day (the current day G) and for a selected number of days of the Training Set T (from 1 to 150 days). The interface was designed for Energy Data Scientist users (that is, experts in the energy domain or in data analysis) to allow data exploration and to identify innovative algorithms for consumption forecasting (e.g. this GUI was used to properly determine the thresholds described in SPEC3).

Some use cases of the developed system are shown hereafter, using the Thresholds Graph Visualizer as user interface and varying the current day G input parameter. The graph in Figure 7 relates to 17/05/2015, when all hourly consumptions (in blue) are considered normal, as they fall within the interval $[s_1, s_2]$, whose bounds are drawn in cyan and red respectively. The BEC is rendered in green.

Figure 7: Day G ∉ E without anomalies

Figure 8: Day G ∉ E with anomalies

Considering 07/05/2015, we notice that most of the consumptions are above the $s_2$ threshold, which is evidence that a change in consumption with respect to the normal condition is under way (Figure 8). Considering a closing day G ∈ E (Figure 9), for example 05/04/2015 (Easter day), we notice five abnormal consumptions concentrated in the evening hours ([18], [19], [20], [21], [22]).

Figure 9: Day G ∈ E with anomalies

6 Conclusions and Future Works

The software platform used to test the Thresholds Algorithm presented in this paper can be applied to very different domains simply by changing the data sources, and it is able to analyse data streams in real time. It is also highly scalable and reliable, thanks to the replication capability of OpenTSDB. A future step could be to spread the analysis phase over multiple machines, using an open source distributed real-time computation system (e.g. Apache Spark or Apache Flink).



References

1. Aggarwal C.C., Outlier Analysis, Springer Science+Business Media New York, 2013.
2. Bifet A., Holmes G., Kirkby R., Pfahringer B., Data Stream Mining. A Practical Approach, COSI-Centre for Open Software Innovation, May, 2011.
3. Bifet A., Kirkby R., Kranen P., Reutemann P., Massive Online Analysis. Manual, COSI-Centre for Open Software Innovation, March, 2012.
4. Dunning T., Friedman E., Time Series Databases: New Ways to Store and Access Data, O’Reilly Media, 1st edition, 2014.
5. Ester M., Kriegel H.P., Sander J., Xu X., A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, in: Second International Conference on Knowledge Discovery and Data Mining, 226-231, 1996.
6. Frank E., Hall M.A., and Witten I.H., Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, Fourth Edition, 2016.
7. Grafana, “http://grafana.org/”.
8. Gupta M., Gao J., Aggarwal C.C., Han J., Outlier Detection for Temporal Data: A Survey, in IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 1, January 2014.
9. Highcharts, “https://www.highcharts.com/”.
10. Mitsa, T., Temporal Data Mining, Chapman & Hall/CRC, Data Mining and Knowledge Discovery Series, 2010.
11. MIUR - PAC Piano di Azione e Coesione: Bando Startup, Linea Big Data, n°436 del 13 marzo 2013.
12. OpenTSDB, “http://opentsdb.net/”.