D4.1 Load Profile Classification WP4 – Classification of EU residential energy consumers
D4.1: Load Profile Classification
Disclaimer
Any dissemination of results reflects only the author's view and the European Commission is not responsible for any use that may be made of the information it contains.

Copyright message
© NATCONSUMERS Consortium, 2016
This deliverable contains original unpublished work except where clearly indicated otherwise. Acknowledgement of previously published material and of the work of others has been made through appropriate citation, quotation or both. Reproduction is authorised provided the source is acknowledged.
Document Information
Grant Agreement Number: 657672
Acronym: NATCONSUMERS
Full Title: NATural Language Energy for Promoting CONSUMER Sustainable Behaviour
Horizon 2020 Call: H2020-LCE-2014-2, Call for competitive low-carbon energy
Type of Action: Coordination and Support Action
Start Date: 1st May 2015
Duration: 26 months
Project URL: www.natconsumers.eu
EU Project Officer: Damian BORNAS CAYUELA
Project Coordinator: Mr. Zoltán Kmetty, Ariosz Consulting Ltd.
Deliverable: D4.1. Load Profile Classification
Work Package: WP4 – Classification of EU residential energy consumers
Date of Delivery: Contractual M15, Actual M15
Nature: R - Report
Dissemination Level: PU - Public
Lead Beneficiary: Ariosz LTD
Responsible Author: Zoltán Kmetty PhD
Email: [email protected]
Phone: +36303664611
Reviewer(s):
Keywords: Load profile, classification, segmentation, clustering
Document History
Version | Issue Date | Stage | Changes | Contributor
1.0 | 18/07/16 | Draft | First Version | Zoltán Kmetty (Ariosz)
2.0 | 28/04/16 | Draft | Proof-reading | Caitlin Bent, Greg Shreeve (EST)
3.0 | 29/04/16 | Final | Final adjustments | Zoltán Kmetty (Ariosz)
Table of Contents
1 Executive summary
2 Introduction
3 Smart Meters
4 State-of-the Art
4.1 General overview
4.2 Opower’s practice
5 Our Approach
5.1 Three parts of time
5.1.1 Within the day
5.1.2 Two other aspects of time
5.1.2.1 Within the week
5.1.2.2 Within a year
5.1.3 Multidimensional modelling
5.1.3.1 Nested approach
5.1.3.2 Separate time dimensions
5.1.3.3 Time-series approach
5.2 Classification methods
5.2.1 Hierarchical clustering
5.2.1.1 Distance
5.2.1.2 Agglomeration methods
5.2.1.3 Decision about final number of clusters
5.2.1.4 Advantages and disadvantages
5.2.2 K-means clustering
5.2.3 Latent Profile analysis
5.2.4 Time series clustering (DTW)
5.2.5 Other classification methods
5.2.6 Comparing relevant methods
6 Case Study
6.1 Description of the Pilots
6.1.1 Data (pre-)processing
6.1.1.1 Sample size
6.1.2 Basic structure of consumption
6.2 Segmentation baselines
6.3 Hierarchical clustering
6.3.1 Load profiles of Ireland – 5 cluster solution
6.3.2 Visualisation
6.3.3 Why five clusters in Ireland?
6.3.4 Country comparison
6.3.4.1 Consumption structure of the UK
6.3.4.2 Consumption structure of Italy
6.3.4.3 Consumption structure of Hungary
6.3.4.4 Country differences and common patterns
6.4 Other clustering methods
6.4.1 K-means clustering
6.4.2 Latent Profile Analysis
6.4.3 DTW Time Series clustering
6.5 Conclusions of load profile classification
7 Further Steps
8 References
9 Appendix
Table of Figures
Figure 1. Analytic service flowchart of WP4
Figure 2. Smart Electricity Metering Roll-Out in the European Union
Figure 3. An illustration from Portugal
Figure 4. The graph displays weather-normalized hourly electricity consumption from a random sample of 1,000 residential utility customers, for a typical weekday.
Figure 5. Opower “Archetypes” example 1
Figure 6. Opower “Archetypes” example 2
Figure 7. An example of electricity consumption on a weekly basis
Figure 8. Three different weekly patterns of electricity consumption (illustration)
Figure 9. An example of electricity consumption on a yearly basis
Figure 10. Two different monthly patterns of electricity consumption
Figure 11. An example of hierarchical clustering dendrogram
Figure 12. A model selection example using SSE
Figure 13. A model selection example using BIC
Figure 14. Daily, weekly and yearly share of electricity consumption in the four countries
Figure 15. Load profile segments, daily share Ireland, 5 cluster solution, hierarchical clustering
Figure 16. Boxplot of load profile segments, Double Risers segment, three time aspects, Ireland, 5 cluster solution, hierarchical clustering
Figure 17. Boxplot of load profile segments, Afternoon Actives segment, three time aspects, Ireland, 5 cluster solution, hierarchical clustering
Figure 18. Boxplot of load profile segments, Evening Actives segment, three time aspects, Ireland, 5 cluster solution, hierarchical clustering
Figure 19. Boxplot of load profile segments, Home Luncher segment, three time aspects, Ireland, 5 cluster solution, hierarchical clustering
Figure 20. Boxplot of load profile segments, Winter Spinners segment, three time aspects, Ireland, 5 cluster solution, hierarchical clustering
Figure 21. Figure of load profile segments, Double Risers, integrated approach of time dimensions, Ireland, 5-cluster solution, hierarchical clustering
Figure 22. Figure of load profile segments, Afternoon Actives segment, integrated approach of time dimensions, Ireland, 5 cluster solution, hierarchical clustering
Figure 23. Figure of load profile segments, Evening Actives segment, integrated approach of time dimensions, Ireland, 5 cluster solution, hierarchical clustering
Figure 24. Figure of load profile segments, Home Lunchers segment, integrated approach of time dimensions, Ireland, 5 cluster solution, hierarchical clustering
Figure 25. Figure of load profile segments, Winter Spinners segment, integrated approach of time dimensions, Ireland, 5 cluster solution, hierarchical clustering
Figure 26. Sum of Squares Error (SSE) of the hierarchical clustering - Ireland
Figure 27. Part of the hierarchical clustering dendrogram, Ireland
Figure 28. Figure of load profile segments, all clusters, integrated approach of time dimensions with two bi-months, Ireland, 9 cluster solution, hierarchical clustering
Figure 29. Figure of load profile segments, all clusters, integrated approach of time dimensions with two bi-months, Ireland, 5 cluster solution, hierarchical clustering
Figure 30. Daily share of electricity consumption in the four countries in separate charts
Figure 31. Load profile segments, daily share UK, 4-cluster solution, hierarchical clustering
Figure 32. Load profile segments, daily share Italy, 5-cluster solution, hierarchical clustering
Figure 33. Boxplot of load profile segments, Summer Wavers segment, three time aspects, Italy, 5-cluster solution, hierarchical clustering
Figure 34. Boxplot of load profile segments, Summer Peakers segment, three time aspects, Italy, 5-cluster solution, hierarchical clustering
Figure 35. Load profile segments, daily share Hungary, 4-cluster solution, hierarchical clustering
Figure 36. Boxplot of load profile segments, Summer Wavers, three time aspects, Hungary, 4-cluster solution, hierarchical clustering
Figure 37. Load profile segments, daily share 4 countries, Double Risers segments, hierarchical clustering
Figure 38. Load profile segments, daily share 4 countries, Afternoon/Evening/Late Actives segments, hierarchical clustering
Figure 39. Load profile segments, daily share 4 countries, Home Luncher segments, hierarchical clustering
Figure 40. Sum of Squares Error (SSE) of the K-means clustering - Ireland
Figure 41. BIC value of the fitted latent profile models (Cluster number 1-11)
Figure 42. Sum of Squares Error (SSE) of the DTW Time Series clustering - Ireland
Figure 43. Boxplot of load profile segments, Double Risers segment, three time aspects, United Kingdom, 4-cluster solution, hierarchical clustering
Figure 44. Boxplot of load profile segments, Afternoon Actives segment, three time aspects, United Kingdom, 4-cluster solution, hierarchical clustering
Figure 45. Boxplot of load profile segments, Home Lunchers segment, three time aspects, United Kingdom, 4-cluster solution, hierarchical clustering
Figure 46. Boxplot of load profile segments, Winter Spinners segment, three time aspects, United Kingdom, 4-cluster solution, hierarchical clustering
Figure 47. Boxplot of load profile segments, Double Risers segment, three time aspects, Italy, 5-cluster solution, hierarchical clustering
Figure 48. Boxplot of load profile segments, Evening Actives segment, three time aspects, Italy, 5-cluster solution, hierarchical clustering
Figure 49. Boxplot of load profile segments, Late Actives segment, three time aspects, Italy, 5-cluster solution, hierarchical clustering
Figure 50. Boxplot of load profile segments, Double Risers segment, three time aspects, Hungary, 4-cluster solution, hierarchical clustering
Figure 51. Boxplot of load profile segments, Afternoon Actives segment, three time aspects, Hungary, 4-cluster solution, hierarchical clustering
Figure 52. Boxplot of load profile segments, Home Lunchers segment, three time aspects, Hungary, 4-cluster solution, hierarchical clustering
Figure 53. Figure of load profile segments, Summer Wavers segment, integrated approach of time dimensions, Hungary, 4-cluster solution, hierarchical clustering
Tables
Table 1. An example of the data structure used in the Nested approach (Ireland, volume (kWh) data)
Table 2. An example of the data structure used in the separate time dimension approach (Ireland, volume (kWh) data)
Table 3. Comparing the relevant clustering techniques
Table 4. Basic information about the 4 case studies data
Table 5. 5-cluster Hierarchical clustering – average explained variance in the three time dimensions
Table 6. 9- to 4-cluster Hierarchical clustering – average explained variance in the three time dimensions
Table 7. 9- to 4-cluster K-means clustering – average explained variance in the three time dimensions
Table 8. 5-cluster Latent Profile clustering – average explained variance in the three time dimensions
Table 9. 9- to 4-cluster DTW Time Series clustering – average explained variance in the three time dimensions
1 Executive summary

The key aim of the NATCONSUMERS project has been to develop an advanced and integral user-centred framework for the implementation of efficient energy feedback programmes in the domestic sector. With the rollout of smart metering across Europe, the project will look at incorporating smart meter data to provide households with relevant advice. To raise the energy awareness of consumers, they need to be given customised, tailored advice. To achieve this, multiple consumer segmentations are planned in this project. This report summarises the first part of the segmentation work, the load profile classification.

Smart metering provides customers with more information on how they use energy and enables them to reduce their energy usage. The most notable benefits to customers are the following:

- Smart metering means the end of estimated billing: consumers can get more accurate bills, based on real usage data in every payment period.
- Consumers can be better informed about their energy consumption and costs. When their energy usage is easier to understand, they can make smarter decisions to save energy and money.
- Time-of-use tariff functionality can be added, allowing consumers to reduce costs by shifting energy consumption to cheaper off-peak tariff periods.
Although smart metering in the mass market is a relatively new technology in the energy industry (not more than ten years old), there is a fast-growing literature dealing with consumer segmentation based on historical load profile data. Some prominent experts in the field were invited to a NATCONSUMERS workshop held in Madrid in January 2016 to share methodological details with the participants of our research programme. Most experts highlight that any reliable segmentation of smart meter data has to meet some basic requirements and go through preparatory procedures (such as a sufficiently long historical data series, proper granularity of smart meter data, standardisation and data validation). With regard to the analytical ‘space’ of load profile classification, the predominant intention in the literature is to identify daily consumption patterns. Other aspects of time (for example days of the week or seasons) receive relatively little attention and are treated as secondary conditions. A typical approach is to filter by them, for example ‘average weekday in winter’.

In the NATCONSUMERS project we recommend a more sophisticated method of load profile segmentation than the one used in other projects. The key idea of our approach is an enhanced and integrated concept of time differences within electricity consumption. We argue that beyond “daily” shapes (which are arguably the most important ones), there are other aspects of time which should be regarded as inherent parts of load profile segmentation. We therefore suggest using weekly and yearly time aspects in load profile segmentation alongside the commonly used daily one. To do this we propose a Nested approach that uses the 24 hours of the day, the days of the week aggregated to average weekdays, Saturdays and Sundays, and monthly data aggregated to bi-months. In this approach the interaction of these three time variables is used, giving 24*3*6=432 dimensions.
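To make the 432-dimensional structure concrete, the following is a minimal sketch of how one household's hourly series could be turned into a nested profile. It is an illustration only, not the project's implementation: the function name `nested_profile`, the pairing of calendar months into bi-months (January-February, March-April and so on) and the normalisation to shares are all assumptions.

```python
import numpy as np
import pandas as pd


def nested_profile(hourly: pd.Series) -> pd.Series:
    """Return 24 x 3 x 6 = 432 nested features for one household.

    `hourly` is assumed to hold hourly consumption (kWh) with a
    DatetimeIndex covering roughly one full year.
    """
    idx = hourly.index
    dow = idx.dayofweek.to_numpy()  # Monday=0 ... Sunday=6
    frame = pd.DataFrame({
        "kwh": hourly.to_numpy(),
        "hour": idx.hour.to_numpy(),
        # Day type: 0 = average weekday, 1 = Saturday, 2 = Sunday.
        "day_type": np.where(dow < 5, 0, dow - 4),
        # Bi-month: 1 = Jan-Feb, ..., 6 = Nov-Dec (assumed pairing).
        "bimonth": (idx.month.to_numpy() - 1) // 2 + 1,
    })
    # Mean consumption for every bi-month x day type x hour cell.
    means = frame.groupby(["bimonth", "day_type", "hour"])["kwh"].mean()
    # Normalise to shares so that households of different size are comparable
    # (assumed normalisation).
    return means / means.sum()
```

Stacking the resulting vectors for all households would give the input matrix for the clustering methods discussed in this report.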
A wide range of classification methods is available for load profile segmentation. These methods differ in many ways, and each has its own advantages and disadvantages. We do not want to give an ultimate solution in this report, but rather a menu of possible methods. The advantages and disadvantages of hierarchical clustering, K-means clustering, latent profile analysis and Dynamic Time Warping (DTW) Time Series clustering are demonstrated in this report. The most common approach in segmentation is the use of hierarchical or K-means algorithms. As load profiles are time-sequenced data, it is also possible to use special distance functions like DTW as input to hierarchical or K-means clustering. These are all heuristic models. A model-based approach could also handle the segmentation problem. In this report a finite mixture model called latent profile analysis is introduced. All of the methods need normalised raw data with continuous variables. Only the latent profile technique relies on normality assumptions, and this method can handle only a limited number of input variables. In the case of big samples (N>=100 000) only the K-means method could be used, as the other three clustering techniques can only handle smaller databases.

Determining the proper number of clusters proved to be a complicated problem in its own right. The latent profile method has the advantage that the BIC value can be used to choose the most suitable number of clusters. But the BIC value is sensitive to sample size, so this is not a universal solution either. In hierarchical clustering, a dendrogram can also be used to determine the appropriate number of clusters.

In the case study part of the report, we used pilot data from four countries: Hungary, Ireland, Italy and the UK. The frequency of the data differed between the countries, so we had to harmonise it. We decided to aggregate all data on an hourly basis, as the segmentation procedures did not require finer granularity. We also harmonised the sample size across the countries. Despite the many differences between the four countries involved in our case study, the basic structure of their residential electricity consumption is rather similar. Consumption declines during the night until 3-4 o’clock, followed by a small peak in the morning and a high peak in the evening. The level of consumption is lower on weekdays and higher on weekends, especially on Sundays. The most observable difference between the countries is on a yearly basis, with summer peaks in Hungary and Italy, and winter peaks in Ireland and the UK.
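The hourly harmonisation and the daily share curve described above can be sketched as follows. This is a hedged illustration: the function names `to_hourly` and `daily_share` are hypothetical, and the sketch assumes that each reading is the energy (kWh) consumed in its interval, so that summing is the right way to aggregate.

```python
import pandas as pd


def to_hourly(readings: pd.Series) -> pd.Series:
    """Aggregate sub-hourly meter readings (kWh per interval) to hourly totals."""
    return readings.resample(pd.Timedelta(hours=1)).sum()


def daily_share(hourly: pd.Series) -> pd.Series:
    """Share of consumption falling into each hour of an average day."""
    by_hour = hourly.groupby(hourly.index.hour).mean()
    return by_hour / by_hour.sum()
```

If the pilot data were instead recorded as average power (kW), the aggregation would need a mean rather than a sum.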
We use the hierarchical clustering method as a benchmark in the report, and the other methods mainly for comparison. The choice of hierarchical clustering is not arbitrary: this method has a number of advantages over the others. For most of the analysis we focus on the Irish load profile data. Our preliminary analysis showed that the five-cluster solution fits the Irish data well, and the segments are also easy to interpret. All five clusters have a unique daily characteristic, but similarities can also be found between them. Three types of general consumption pattern are identified: those with a big peak in the afternoon or in the evening; those with a small morning peak and an average afternoon/evening peak; and those with a strong bi-modality in their consumption, with two peaks of rather similar volume. The most interesting Irish group was the segment of Winter Spinners, whose yearly share has much greater dispersion than that of the other segments. The existence of this segment highlights the necessity of using annual data in load profile segmentation. If we had focused on daily curves only, we would have missed this segment, although its members need different messages in order to receive custom-tailored advice about their energy usage.

To determine the impact of each of the three dimensions, the average explained consumption variance was calculated by segment. In the 5-cluster Irish hierarchical clustering model, the average explained variance was around 20 percent on a daily basis, 12 percent on a yearly basis and only 2 percent on a weekly basis. This confirms that the daily pattern has the biggest impact on the classification, but also supports our approach of using different time aspects. We have also presented an integrated visualisation that uses the three time dimensions jointly. This new visualisation technique helped us to understand the segments in a more complex way. For example, the detailed figure of the Irish Afternoon Actives’ consumption revealed some interesting results which had been hidden earlier. Their afternoon peak mainly occurs in the winter months and the between-season periods, but not in the summer months. In the summer months a small afternoon peak can also be found, but on weekdays the evening peak is nearly as high as the afternoon one.
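The average explained variance reported above for the three time dimensions can be illustrated with a short sketch. The exact formula is not given in this summary, so the version below is an assumption: for each variable of one time dimension it takes the between-cluster sum of squares divided by the total sum of squares, and then averages over the variables. The function name `average_explained_variance` is hypothetical.

```python
import numpy as np


def average_explained_variance(X, labels) -> float:
    """Mean share of variance explained by cluster membership.

    X is a households x variables matrix for one time dimension
    (for example 24 hourly shares); labels holds one cluster id per household.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    grand_mean = X.mean(axis=0)
    ss_total = ((X - grand_mean) ** 2).sum(axis=0)
    ss_between = np.zeros(X.shape[1])
    for k in np.unique(labels):
        members = X[labels == k]
        ss_between += len(members) * (members.mean(axis=0) - grand_mean) ** 2
    # Variables with zero total variance are counted as 0 percent explained.
    ratio = np.divide(ss_between, ss_total,
                      out=np.zeros_like(ss_total), where=ss_total > 0)
    return float(ratio.mean())
```

Applied separately to the daily, weekly and yearly variables, such a measure would give one figure per time dimension that can be compared across cluster solutions.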
After the brief analysis of the four countries we can formulate some conclusions about the load profile characteristics. At least three load profile curves can be identified in (nearly) all the countries. We called the first general consumption character Double Risers. Although there are some variations by country, the main profile curve is similar. The peaks in their consumption are not extremely high; they have a very small morning peak and an average afternoon/evening peak between 5 and 8 pm. On a weekly and yearly basis the segment shows an ordinary consumption structure. The second general group contains the Afternoon/Evening/Late Actives segments. In some countries these segments merge; in other countries they stay separate. In Italy we could call them Night Actives, as their consumption continues until midnight. The common characteristic of these segments is that they do not produce morning peaks (except in the UK), only afternoon/evening peaks, but these late peaks are markedly high. The third general group has a bi-modal consumption structure, with a morning/forenoon peak and an afternoon peak. In the case of Ireland and Hungary, the forenoon peak is higher than the afternoon peak. The forenoon peak is a clear indicator of a home-based household member who stays at home during the whole day. We called this segment Home Lunchers.

Besides the general consumption profiles, the hierarchical clustering could identify country-specific load profiles too. In Ireland and the UK there is a group called Winter Spinners (who probably use electric heating), in Italy there is a Summer Peakers segment, and in Italy and Hungary there is a Summer Wavers segment. The common feature of these segments is that their daily load profile is rather usual; only the yearly consumption pattern reveals their main attributes. This also shows that it is highly important to use yearly data in load profile segmentation, as this can help to find unique country-specific segments.

The results of our analysis have clearly shown that it was worth using an enhanced and integrated concept of time differences in load profile segmentation (Nested approach). Many classification methods exist which could help in load profile segmentation (hierarchical clustering, K-means clustering, latent profile analysis, the DTW time series approach etc.). Many aspects have to be taken into account when choosing among these methods (sample size, assumptions about data distribution, processing time, outliers, non-convexity of the variable space, availability on different software platforms etc.). The hierarchical clustering of load profiles provided an efficient way to segment the electricity profiles of consumers. The other clustering methods (K-means, Latent Profile, DTW Time Series) gave different outcomes in load profile classification. All of the clustering techniques could be adequate to classify load profiles, but in our case study hierarchical clustering performed best with regard to interpretability and usability.
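As an illustration of the hierarchical clustering benchmark and of how the Sum of Squares Error (SSE) can support the choice of cluster number, a minimal sketch using SciPy is given below. It assumes Ward linkage on Euclidean distances, which this summary does not confirm as the setting used in the case study, and the function name `sse_by_cluster_count` is hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def sse_by_cluster_count(X, max_clusters: int = 9) -> dict:
    """Within-cluster SSE for every cut of the tree from 1 to max_clusters."""
    X = np.asarray(X, dtype=float)
    tree = linkage(X, method="ward")  # assumed agglomeration method
    sse = {}
    for k in range(1, max_clusters + 1):
        # Cut the dendrogram so that at most k clusters remain.
        labels = fcluster(tree, t=k, criterion="maxclust")
        total = 0.0
        for c in np.unique(labels):
            members = X[labels == c]
            total += ((members - members.mean(axis=0)) ** 2).sum()
        sse[k] = total
    return sse
```

Plotting the returned values against the number of clusters gives an SSE curve of the kind listed in the Table of Figures; a flattening of the curve is one possible indication of a suitable cluster number, alongside the dendrogram and the interpretability of the segments.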
2 Introduction The key aim of the NATCONSUMERS project has been to develop an advanced and integral user-centred framework for the implementation of efficient energy feedback programmes in the domestic area. The project will develop a methodology to communicate more effectively with consumers by using ‘Natural Language’. This advice will be designed to raise awareness about how people use energy within their homes and to give them advice about how to use energy in a more sustainable way. With the rollout of smart metering across Europe, the project will look at incorporating smart meter data to provide households with advice. The main objective of Work Package 4 is the integration of user and infrastructure information with load data for the classification of EU consumers into groups showing similar features and habits. In order to propose effective measures to increase engagement of consumers in sustainable energy patterns, it is necessary to be aware of their views, status, personal and households’ characteristics, and national context, so that the approach deployed is relevant and feasible for them. The next figure helps to explain the basic data structure and the planned segmentation processes in WP4. Figure 1. Analytic service flowchart of WP4
In order to raise the energy awareness of consumers, custom-tailored advice needs to be given to them. To fulfil this expectation, multiple consumer segmentations are planned in this project. This report summarises the first part of the segmentation work, the load profile classification. Our main objective is to create a methodological framework. We therefore do not intend to give an ultimate solution for segmenting load profiles, but rather to present the advantages and disadvantages of classification methods using real load profile data. The report is structured in the following way: after a short explanation of the technical background of load profile data, the “state of the art” will be presented. In the section after the literature review, we will demonstrate our own approach that we suggest applying in classifying load profiles. This approach is based on the innovative use of the time dimensions. Load profile classification studies usually rely on the daily load curves, and don’t integrate other time aspects. In this report a more complex approach will be suggested with the integrated use of daily, weekly and yearly time dimensions. In the theoretical part of the paper, four possible segmentation methods will be presented to segment energy load profiles feasibly. All
the advantages and disadvantages of these methods will be demonstrated. Based on these methods we will present a case study that helps us evaluate the different methods. This case study will also give a brief overview of the different energy load profiles in the EU. After the case study, we will lay out the basics of our future work, which will lead to a complex and integrated classification of EU energy consumers.
3 Smart Meters With traditional electricity metering systems, data operators rely on human resources to physically take meter readings in order to gather household energy usage data. This manual data collection process means meter readings occur at only monthly or annual frequencies. As such, both the data operators and the households receive only aggregated consumption data, which has highly limited utility. By contrast, the smart meter is a new kind of energy consumption meter that can digitally send meter readings to the data operators. Smart meters are the next generation of metering technology for every kind of usage (electricity, natural gas, piped water, central heating system etc.). Smart meters enable two-way communication between the device and data operator. This means they can be controlled remotely, thus the data operator can switch them on and off centrally. The similarity between traditional metering and smart metering is that both have a settled point of delivery. This is why smart meters can only measure the whole amount of household consumption, like the regular metering system (rather than monitoring usage at a more disaggregated level). The main differences between smart and regular metering systems are with regards to the granularity and data reading frequency. Concerning granularity, the smart meter can register household energy usage within a day, commonly every 15 or 30 minutes. Granularity depends on the device settings, so it is possible to measure and store usage data at anything from a 1-minute to an hourly frequency. Smart metering enables data operators to read registered consumption data from devices which can store these records for a while. Although data communication proceeds online, there is no real-time connection between the data operation server and the devices. The data operator must address all devices on the distribution network and gather consumption data from every device.
Therefore the data reading frequency is determined by the data collection process schedule. Smart metering provides customers with more information on how they use energy and enables customers to reduce their energy usage. The major benefits to customers are the following:
- Smart metering means the end of estimated billing: consumers can get more accurate bills which are based on real usage data in every payment period;
- Consumers can be informed on their energy consumption and costs. By making their energy usage more easily understood, they can make smarter decisions to save energy and money;
- Time-of-use tariff functionality can be added to allow consumers to reduce costs by moving energy consumption to cheaper off-peak tariff periods.
The European Union asked all member governments to look at a smart metering system as part of their measures to upgrade their energy supply and tackle climate change. However, the smart metering roll-out is at different stages in different countries across the European Union.
Figure 2. Smart Electricity Metering Roll-Out in the European Union
Source: http://ses.jrc.ec.europa.eu/smart-metering-deployment-european-union
4 State-of-the-Art Before delivering NATCONSUMERS’ own Load Profile Segmentation (LPS) recommendations, it is worth taking a brief look at the most relevant available instances of similar segmentation efforts made so far in Europe and the US. Although smart metering in the mass market is a relatively new technology in the energy industry (not more than ten years old), there is a fast-growing literature dealing with consumer segmentation based on historical load profile data. Some prominent experts of the field were invited to a NATCONSUMERS workshop held in Madrid in January 2016 to share some methodological details with the participants of our research programme. For brevity’s sake, not all the materials will be discussed here in detail; instead we are going to focus on the generalised methodological conclusions to be drawn. However, as Opower has been acknowledged as the most experienced player at providing feedback to electricity customers based on load profile segmentation, we will take a closer look at their methodology. For a detailed list of materials referred to in the paper, see the References.
4.1 General overview
Most experts highlight the need for some basic requirements which must be met and preparatory procedures to be carried out for any reliable segmentation process on smart meter data. The most important of them are:
- A large-scale and well-balanced sample of the population. Ideally the sample size would be at least several thousand units (households in our case);
- A long enough historical series of smart meter data. The basic requirement is at least one year;
- Proper granularity of smart meter data. The typical range is from fifteen minutes to one hour;
- Data validation processes to handle outliers, missing and invalid data;
- Some kind of data normalisation/standardisation to focus on consumption patterns instead of volume;
- Proper computational capacity to run complex data processes within a reasonable timeframe;
- A preference for robust data analytic techniques which are relatively resistant to noisy data.
With regards to the analytical ‘space’ of the load profile classification, the predominant intention in the literature is to identify daily consumption patterns. Other aspects of time (for example days of the week or seasons) get relatively little attention, and are treated as secondary conditions. A typical approach is filtering by them, for example: ‘average weekday in winter’.
Figure 3. An illustration from Portugal
Source: Vieira, Susana: Computational methodologies for knowledge extraction about consumer behaviour using smart metering data. January, 2016
With regards to classification methods, a good variety of different techniques have been found, ranging from the more traditional ones (hierarchical clustering, K-Means) through time series clustering to solutions based on fuzzy logic (deterministic Fuzzy C-Means). None of these can be regarded as the single best possible approach, and some researchers test more than one method against their datasets.
4.2 Opower’s practice
Undoubtedly, the most important instance of Load Profile Segmentation has been implemented by Opower, a leading provider of customer engagement and energy efficiency cloud services to utilities 1. The company works with over 100 utilities in the world, and Opower’s big data platform stores and analyses over 600 billion meter reads from 60 million utility end customers. Based on that extremely large data warehouse, they demonstrate some examples on their blog site2 as shown below: Figure 4. The graph displays weather-normalized hourly electricity consumption from a random sample of 1,000 residential utility customers, for a typical weekday.
Technical Details Opower: “By applying advanced clustering techniques, such as vector quantization, we were able to identify a series of recurring patterns across the usage data. Specifically, based on our statistical clustering, you can see around five distinct hourly electric load patterns start to emerge from what used to be a mere jumble. Each of these well-defined patterns can be described as a particular ‘load archetype’. There are about five weekday load archetypes discernible in the below graph”
1 Oracle bought Opower in May 2016. For further information see the announcement and the press release here: https://opower.com/oracle/
2 https://blog.opower.com/2014/10/load-curve-archetypes/
Figure 5. Opower “Archetypes” example 1
The defined patterns are usually referred to as “archetypes” and each group has been labelled with a tell-tale name which is assumed to be a “story-telling” description of the specific segment. Some illustrations, drawn from another sample of customers, are shown below. Figure 6. Opower “Archetypes” example 2
Note that all of the examples above relate to a typical weekday in terms of the daily shape of the electricity consumption. According to Opower’s recommendations, “archetypes” can be approached efficiently in different ways when communicating to consumers. For example “based on what time of day customers tend to use the most energy, customers can be delivered information and energy savings tips that are most likely to be relevant to them.” They also state that “customers with different load profile archetypes who are enrolled in behavioural energy efficiency programs tend to save energy at different rates”.
5 Our Approach In the chapter above we have identified and described some examples of load profile segmentation. They all focus on the (average) daily load curves and determine consumer segments (“archetypes”) based on daily load curve patterns. In our NATCONSUMERS project, we recommend a more sophisticated method of load profile segmentation. The key idea of our approach is an enhanced and integrated concept of the time differences within electricity consumption. We argue that besides the “daily” shapes (which are undoubtedly the most important), there are other aspects of time which might be regarded as inherent parts of load profile segmentation. In this chapter we briefly show the possible clustering alternatives and summarize their advantages and disadvantages. This chapter could help any future adopter of the NATCONSUMERS methodology choose the best method for their needs.
5.1 Three parts of time
5.1.1 Within the day
Most energy consumption segmentation models use daily data. This is a reasonable approach because the deviations between load curves mainly occur on an hourly basis. The different signal patterns could distinguish different demographic groups and also different behavioural groups. Pensioners and parents of infants have a higher consumption share at midday than the average consumer. The evening peak of electricity usage could differ between family life cycle stages. Those who have higher energy awareness may produce smaller peaks, and have smaller variability in their consumption share. In some countries, the energy tariffs could also influence consumption patterns if the users pay attention to different prices and use more electricity in periods when it’s cheaper. Based on the earlier findings and previous experiences, it is clear that daily consumption data is the most important part of any kind of load profile segmentation. On the other hand, we think that it is worth integrating other aspects of time to get a clearer picture of the consumption shapes. In the next section we concentrate on two other aspects of time.
5.1.2 Two other aspects of time
5.1.2.1 Within the week
In nature the weekly pattern is rare, but human behaviour is closely connected to days of the week. It is obvious that there is a clear weekly pattern in energy consumption, because people usually spend more time at home at the weekend than on weekdays. This time pattern alone doesn’t mean that the weekly data should be used in load profile segmentations. On the other hand, these weekly time patterns could be diverse, as different types of household spend different amounts of time at home, and use different types of appliances. Here are just two simple examples: Some people like to stay at home during the weekends, others always travel away. Pensioners tend to have smaller differences in their energy consumption between weekdays and weekends, because they stay at home during weekdays too. These weekly energy consumption patterns are not as remarkable as the daily curve, but also contain important information about energy (electricity) usage, which we could exploit in our segmentation models. In our segmentation process, we could either use all seven days of the week, or we could aggregate the data.
The figure below shows an example of the share of electricity usage by days of week. The electricity usage is lowest on Mondays and highest on Sundays. The other days are quite uniform from that viewpoint. Figure 7. An example of electricity consumption on a weekly basis
The frequency of the data used is heavily reliant on the sample size, the length of the load profile time series, and the applied segmentation method. All seven days could be used in segmentation, but merging days is also possible. In the NATCONSUMERS project we aggregate the 5 weekdays to an average weekday, and handle the two weekend days separately. In the next figure we present three different weekly patterns. In this figure we aggregated the 5 weekdays to one time dimension as mentioned above.
Figure 8. Three different weekly patterns of electricity consumption (illustration)
Segment B represents a quite common weekly pattern, with a slightly lower share of energy use on weekdays, and a higher share on Saturdays and Sundays. In this type of consumption pattern the difference between Saturdays and Sundays is nearly zero. A more specialized but also prevalent consumption pattern is represented by segment C. During weekdays the share is smaller than on weekends, and the difference between Saturdays and Sundays is also significant. This group has the highest Sunday consumption share, 50 percent higher than on average weekdays. Segment A represents quite a special consumer group, having high weekday consumption and lower consumption at weekends. These three different consumption patterns clearly point to the necessity of using weekly data in segmentation models. When we communicate with end-users regarding their consumption, the different weekly consumption patterns need different handling. Messages which are well-tailored for one group could be misleading to others. Based on this, we suggest the use of weekly data in any kind of load profile segmentation procedure.
5.1.2.2 Within a year
The yearly distribution of energy consumption is obviously not uniform. It correlates with the weather, the length of daylight, the appliances used, etc. Again, if the consumption pattern of users were the same within a country, the use of monthly data wouldn’t be necessary. If we can assume that there are different monthly consumption patterns, it is beneficial to use this information in a segmentation procedure. In theory it is not a hard task to imagine different types of consumption patterns on a yearly basis:
- Heating with electricity could increase consumption in the winter
- Air-conditioning could increase consumption in the summer
- School-aged children could increase consumption in the summer
- A weekend-house, near a lake or the sea, could decrease consumption in the summer
Yearly consumption patterns depend on owning certain appliances, having the family in a certain life cycle situation or showing certain attitudinal behaviours. Just as in the case of days of the week, when we aim to use yearly data we have to first decide the level of aggregation. In this case we have to find a solution that produces categories detailed enough for our goals, but doesn’t increase the degrees of freedom of our statistical models too much. The next figure shows the share of electricity consumption per month through example data. Figure 9. An example of electricity consumption on a yearly basis
In the NATCONSUMERS project we decided to use a bi-monthly approach. The availability of pilot data only allowed us to use a within-year aggregation method, so merging December and January was not a possible solution (although based on the above figure it would be a good idea). In our bi-monthly approach we also wanted to handle the two hot summer months (July, August) together, so we decided to start the aggregation process with January. The single-monthly approach would be too detailed for our proposed clustering solution (see later), whilst more aggregated data (for example, seasons) would be too country specific. In the next figure we present two different monthly patterns. In this figure we use bi-monthly data as described before.
Figure 10. Two different monthly patterns of electricity consumption
Segment A represents a common yearly consumption pattern structure. The consumption is higher in the winter season, and lower in the summer, but the standard deviation is rather low. Segment B represents another consumption pattern, which is less common but not rare. In this case, the winter months again have the highest share, and the summer months are similarly the lowest, but the differences here are much greater than in the other consumption group. This indicates that they probably use electric heating. It is therefore worth considering that these two types of consumption groups need different message content. Based on this, we suggest the use of yearly data in any kind of load profile segmentation procedure.
5.1.3 Multidimensional modelling
In the above sections we suggested the use of weekly and yearly data in load profile segmentation procedures, besides the commonly used daily data. There are many ways to apply this approach; we suggest three of them in the next sections. These methods fit both the segmentation techniques presented later and the structure of our pilot data.
5.1.3.1 Nested approach
Our first approach combines the three time elements into one large 432-dimensional space. It uses the 24-hour daily data, the days of the week aggregated to average weekdays, Saturdays and Sundays, and monthly data aggregated to bi-months (Jan-Feb/Mar-Apr/May-Jun/Jul-Aug/Sep-Oct/Nov-Dec). In this approach the interaction of these three time variables is used, hence the 24*3*6=432 dimensions.
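Table 1. An example of the data structure used in the Nested approach (Ireland, volume (kWh) data)
Bi-months           Days of week   Hours   User1   User2   User3
January - February  Weekdays       1       0.21    1.286   1.646
January - February  Weekdays       2       0.126   1.242   1.131
January - February  Weekdays       …       …       …       …
January - February  Weekdays       20      0.85    2.188   3.897
January - February  Weekdays       21      0.889   1.802   3.625
January - February  Weekdays       22      0.99    1.636   3.686
January - February  Weekdays       23      0.946   1.621   3.248
January - February  Weekdays       24      0.668   1.433   2.549
January - February  Saturday       1       0.372   1.408   2.818
January - February  Saturday       2       0.137   1.145   2.338
January - February  Saturday       …       …       …       …
January - February  Saturday       20      0.809   1.916   4.707
January - February  Saturday       21      0.986   1.914   4.212
January - February  Saturday       22      0.853   1.769   3.785
January - February  Saturday       23      0.772   1.326   2.967
January - February  Saturday       24      0.603   1.238   2.882
January - February  Sunday         1       0.175   1.131   2.788
January - February  Sunday         2       0.138   1.1     2.344
January - February  Sunday         …       …       …       …
January - February  Sunday         20      0.759   2.679   4.36
January - February  Sunday         21      0.825   2.087   3.708
January - February  Sunday         22      0.827   1.898   3.361
January - February  Sunday         23      0.878   1.8     3.307
January - February  Sunday         24      0.825   1.625   2.439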
As seen in the above example (which is just a short excerpt of the whole data-space) this combination allows us the joint use of the three time segments in any modelling process. This method also gives natural weight to the three time dimensions. Natural weight here means that the three time dimensions have different impacts in the creation of the 432 dimensional space. As the daily data is the most frequent (24) in this approach, this dimension has the biggest natural weight; the yearly data has lower weight and the weekly data has the lowest. It is also conceivable to give the three time dimensions equal weight, but the natural weight structure described above fits our theoretical viewpoint better. In the table above the data used for demonstrative purposes were based on consumption volumes. Before any load profile segmentation this raw data needs to be standardized.
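The construction of the 432-dimensional space can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: the function names `nested_index` and `nested_profile` are hypothetical, readings are assumed to be hourly (timestamp, kWh) pairs, each cell is assumed to hold the average of the readings falling into it, and hours are numbered 0-23 rather than 1-24 as in Table 1.

```python
from datetime import datetime

N_HOURS, N_DAY_TYPES, N_BIMONTHS = 24, 3, 6  # 24 * 3 * 6 = 432 cells


def nested_index(ts: datetime) -> int:
    """Map a timestamp to its cell in the 432-dimensional nested space."""
    bimonth = (ts.month - 1) // 2  # 0 = Jan-Feb, ..., 5 = Nov-Dec
    weekday = ts.weekday()         # Monday = 0, ..., Sunday = 6
    # 0 = average weekday, 1 = Saturday, 2 = Sunday
    day_type = 0 if weekday < 5 else weekday - 4
    return (bimonth * N_DAY_TYPES + day_type) * N_HOURS + ts.hour


def nested_profile(readings):
    """Average hourly kWh readings into a 432-element list.

    `readings` is an iterable of (datetime, kWh) pairs; cells without
    any reading are left at 0.0 (an assumption of this sketch).
    """
    sums = [0.0] * (N_HOURS * N_DAY_TYPES * N_BIMONTHS)
    counts = [0] * len(sums)
    for ts, kwh in readings:
        i = nested_index(ts)
        sums[i] += kwh
        counts[i] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]
```

Because the hour index varies fastest, the 24 values of one day type within one bi-month sit next to each other in the resulting list, which mirrors the row order of Table 1.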
5.1.3.1.1 Standardization
The common approach in load profile segmentation is the usage of share data instead of raw volume data. The volume of electricity consumption can be estimated with the help of demographic variables quite well. The size of the household, the size of the dwelling and the method of heating are good indicators to estimate the volume. By contrast, the patterns of consumption and the shape of the signal are more difficult to model through demographic variables. In a segmentation procedure where raw data is used, the volume of usage would dominate any signal pattern. This segmentation would only give back different volume levels, not different shapes. As such, we need to exclude the volume of consumption in order to avoid this problem. Technical Details: There are many ways to do that and standardize the data. We suggest the following normalization method.
\[ S_{mwh} = \frac{V_{mwh}}{\sum_{m} \sum_{w} \sum_{h} V_{mwh}} \]
where $S_{mwh}$ is the share usage rate in a given (bi-)month (m), on a given day of the week (w) and in a given hour (h), and V is the volume of consumption. After this standardization every user consumption share sum will be the same (1).
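A minimal Python sketch of this share normalization is given below. It assumes that a user's volumes are already arranged as a flat list over all (bi-month, day type, hour) cells; the function name `to_shares` is hypothetical.

```python
def to_shares(volumes):
    """Divide each cell's volume by the user's total, so the shares sum to 1."""
    total = sum(volumes)
    if total <= 0:
        raise ValueError("total consumption must be positive")
    return [v / total for v in volumes]


# Two users with the same shape but different volumes get identical shares.
small_user = to_shares([1.0, 3.0, 4.0])
large_user = to_shares([2.0, 6.0, 8.0])
```

After this step the two users above are indistinguishable for a clustering method, which is the intended effect: only the shape of the profile remains.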
5.1.3.1.2 Applicable methods based on this approach
The Nested approach introduced here could be used in different types of segmentation procedures. It fits hierarchical and K-means clustering, fuzzy C-means methods, and also time series clustering models. This approach has quite universal properties, is robust in many ways, and could be applied to different countries. As we will demonstrate in the later sections, the use of this nested time handling technique greatly helps in the exploitation of time factors behind different consumption behaviour.
5.1.3.2 Separate time dimensions
The number of possible time dimensions used in segmentation could be different based on the size of available data, the frequency of data, and the segmentation methods used. The Nested approach suggested above uses a 432-dimensional space. This number of variables could be too high for some clustering techniques which use model-based approaches (for example the latent model techniques presented later). In that case, lower-dimensional data is needed. It is possible to achieve this by using the three time dimensions separately. The next table shows an example of this data handling method.
Table 2. An example of the data structure used in the separate time dimension approach (Ireland, volume (kWh) data)
Time dimension   Category    User1     User2     User3
Hours            1           3.534     17.946    33.271
Hours            2           2.346     14.977    27.628
Hours            3           2.246     13.35     21.428
Hours            …           …         …         …
Hours            20          9.72      34.525    69.576
Hours            21          10.573    31.614    64.073
Hours            22          11.115    27.652    54.225
Hours            23          10.777    23.184    48.834
Hours            24          7.647     20.336    38.299
Hours            Sum         166.406   544.699   860.615
Days of week     Weekdays    63.833    177.714   270.762
Days of week     Saturdays   54.535    184.195   289.099
Days of week     Sundays     48.038    182.79    300.754
Days of week     Sum         166.406   544.699   860.615
Bi-months        Jan-Feb     32.717    103.323   153.877
Bi-months        …           …         …         …
Bi-months        Nov-Dec     26.701    98.189    159.761
Bi-months        Sum         166.406   544.699   860.615
In this example the same aggregation levels were used as in the Nested approach: we used bi-months and combined the 5 workdays to an average workday. As seen above, the sum of consumption (volume data) is equal in the three time dimensions; only the aggregation level differs. This data structure needs a different standardization procedure.
5.1.3.2.1 Standardization and applicable methods based on this approach
The same standardization procedure used in the Nested approach doesn’t fit this data structure. Handling the three time dimensions separately could solve the problem. So the hourly data needs to be standardized separately, and the days of week and bi-months data need to be handled separately as well. After this process the sum of shares would be 3 for each consumer (1 for each of the three time dimensions). This data structure could be used in hierarchical clustering processes, and also in K-means and fuzzy C-means segmentations, but for these methods the above-mentioned Nested approach performed better. In model-based methods where the number of input variables is limited, this data structure is recommended. In time-series clustering this data structure couldn’t be used.
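The separate standardization can be sketched as follows. This is an illustrative sketch, not the project's code: the function name `standardize_separately` is hypothetical, and the three inputs are assumed to be a user's volumes per hour (24 values), per day type (3 values) and per bi-month (6 values), as in Table 2.

```python
def standardize_separately(hours, day_types, bimonths):
    """Normalise each time dimension on its own and concatenate the shares.

    Each block is divided by its own total, so the result holds
    24 + 3 + 6 = 33 values whose overall sum is 3.
    """
    profile = []
    for block in (hours, day_types, bimonths):
        total = sum(block)
        if total <= 0:
            raise ValueError("each time dimension needs positive consumption")
        profile.extend(v / total for v in block)
    return profile
```

The resulting 33 variables per consumer are far fewer than the 432 of the Nested approach, which is why this structure suits model-based methods with a limited number of input variables.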
5.1.3.3 Time-series approach
The third suggested mode of time handling is a simple time series approach. In this case no further aggregation is needed and the whole time period can be used. The frequency of data could be hourly or more detailed. Because of the different seasonal patterns, we propose using full years; the normalization procedure could be the same as shown in the Nested approach section. This data structure fits best in models where time-series clustering techniques are used.
5.2 Classification methods
There is a wide range of classification methods available which could be used in load profile segmentation procedures. These methods are different in many ways; each has its own advantages and disadvantages. In this paper, we do not want to give an ultimate solution or ‘best method’, but rather a menu of possible methods. In the next section we therefore briefly introduce those clustering methods that could be applied by any NATCONSUMERS third party users, and summarize their advantages and disadvantages.
5.2.1 Hierarchical clustering
Hierarchical Clustering is one of the most well-known classification algorithms. Hierarchical clustering could be either agglomerative or divisive. At the start of the agglomerative process, each unit constitutes its own cluster, meaning the number of clusters is equal to the number of cases. In the next step, the two closest clusters are merged into a joint cluster. By the last step all of the cases constitute one cluster. The method is hierarchical because the merging is done step by step, and if two cases are merged into the same cluster, they stay in the same cluster until the end of the process. The divisive process uses a reverse approach. In the first step all of the cases are located in the same cluster. In the next step we search for the farthest case (from the cluster centre), and put it in a separate cluster. By the last step each of the cases constitutes a single cluster.
Technical Details Most of the commercial programs use agglomerative methods, so we focus on this in the next part.
1. Calculate the distance matrix of the cases3
2. Each case constitutes a single cluster
3. Merge the two closest clusters
4. Calculate the distance matrix with the new cluster centres
Iterate steps 3 and 4 until all the cases have been merged into one cluster.
In a hierarchical clustering procedure we have to define the distance measure and the concrete process of merging.
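The agglomerative steps listed above can be sketched in plain Python. This is a deliberately naive illustration under stated assumptions: Euclidean distance between cluster centres is used as the merging rule, the function names `centroid` and `agglomerate` are hypothetical, and the optional `n_clusters` stopping point is added for convenience. Statistical packages implement the same idea far more efficiently.

```python
from itertools import combinations
from math import dist


def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    return [sum(col) / len(points) for col in zip(*points)]


def agglomerate(cases, n_clusters=1):
    """Naive agglomerative clustering that merges the two closest centres.

    Returns a list of clusters, each a sorted list of indices into `cases`.
    """
    clusters = [[i] for i in range(len(cases))]  # every case starts alone
    while len(clusters) > n_clusters:
        centres = [centroid([cases[i] for i in c]) for c in clusters]
        # distances between all current cluster centres; pick the closest pair
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: dist(centres[p[0]], centres[p[1]]))
        # merge the pair (a < b, so removing b leaves index a valid)
        clusters[a] = sorted(clusters[a] + clusters.pop(b))
    return clusters
```

Once two cases share a cluster they are never separated again, which is the hierarchical property described in the text.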
3 It is possible to cluster variables too, but in the NATCONSUMERS project this mode is not of interest to us.
5.2.1.1 Distance
Defining the distance and the similarity of the cases and clusters is a crucial part of a hierarchical clustering process. In the NATCONSUMERS environment the load profile variables used are all continuous and are at the interval level of measurement. This means that the metrics dealing with nominal and ordinal variables are not applicable in this project. A common distance metric is the Euclidean metric. The Euclidean metric could be used in all cases when the variables are continuous and standardized, or have the same range. The data structure used in the NATCONSUMERS project fulfils these assumptions. The squared Euclidean metric is also used often in hierarchical clustering processes. This is simply the square of the Euclidean distance. In some clustering processes it is theoretically important to use statistically independent (uncorrelated) variables. If this is not possible, and the correlation adds bias to the process, the use of the Mahalanobis metric could solve this problem. This metric separates out the correlation from the distance. It is also an option to use stronger or weaker penalties regarding distance. The Minkowski metric is a generalisation of the Euclidean distance. As seen above, there are plenty of options available for distance calculation. The chosen option depends on the data structure and the agglomeration method used. Technical Details Euclidean metric:
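\[ d(x, y) = \sqrt{\sum_{i} (x_i - y_i)^2} \]
Mahalanobis metric:
\[ d(x, y) = \sqrt{(x - y)^{T} S^{-1} (x - y)} \]
where S is the covariance matrix of the variables, and $S^{-1}$ is the inverse of this matrix.
Minkowski metric:
\[ d(x, y) = \left( \sum_{i} |x_i - y_i|^{w} \right)^{1/w} \]
Here x and y are two cases and i runs over the variables. In the case when w=2, we get back the Euclidean distance.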
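The Euclidean, squared Euclidean and Minkowski metrics are short enough to write out directly. The sketch below uses hypothetical function names and plain Python lists as profiles; the Mahalanobis metric is left out because it additionally needs the inverse covariance matrix, for which a numerical library would normally be used.

```python
from math import sqrt


def squared_euclidean(x, y):
    """Sum of squared differences between two equal-length profiles."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def euclidean(x, y):
    """Euclidean distance: the square root of the squared Euclidean distance."""
    return sqrt(squared_euclidean(x, y))


def minkowski(x, y, w):
    """Minkowski distance of order w; w = 2 reproduces the Euclidean distance."""
    return sum(abs(a - b) ** w for a, b in zip(x, y)) ** (1.0 / w)
```

A larger order w penalises large differences in single variables more heavily, while w = 1 weights all differences linearly.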
5.2.1.2 Agglomeration methods
The calculation of cluster distance is a required but not sufficient part of the clustering process. It is also crucial to define the proper agglomeration method. Here are some common algorithms to do this:
- Single linkage: the distance between the two closest cases of the clusters
- Complete or farthest linkage: the distance between the two farthest cases of the clusters
- Average linkage: the average of the distances between all cases of the two clusters
- Centroid linkage: the distance between the cluster centres
- Ward linkage: the two clusters are merged whose merging keeps the mean square error minimal
In the NATCONSUMERS project we suggest the use of the Ward method. The Ward method minimizes the square of in-group deviation from the cluster centre. This method isn't (or is less) sensitive to outliers and biased data. It is a good method if we assume that the standard deviations in the clusters are closely equal in
all groups, and the sizes of clusters are also equal. This method is the most analogous with the K-means clustering technique. It works best if the squared Euclidean metric has been used.
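To make the differences between the linkage rules concrete, the sketch below computes the distance between two clusters under each rule. It is an illustration only: the function names are hypothetical, Euclidean distance between cases is assumed, and the Ward value is expressed through the standard identity for the increase in the within-cluster sum of squares caused by a merge.

```python
from math import dist


def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    return [sum(col) / len(points) for col in zip(*points)]


def linkage_distance(a, b, method):
    """Distance between clusters a and b (lists of points) under one linkage rule."""
    pairwise = [dist(p, q) for p in a for q in b]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    if method == "centroid":
        return dist(centroid(a), centroid(b))
    if method == "ward":
        # increase in the within-cluster sum of squares if a and b are merged
        n_a, n_b = len(a), len(b)
        return n_a * n_b / (n_a + n_b) * dist(centroid(a), centroid(b)) ** 2
    raise ValueError(f"unknown linkage method: {method}")
```

In each agglomeration step the pair of clusters with the smallest value under the chosen rule is merged, so the choice of rule directly shapes the resulting segments.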
5.2.1.3 Decision about final number of clusters
A crucial part of cluster analysis is defining the proper number of clusters. There is no universal formula for it, just suggestions and basic principles. In the hierarchical cluster analysis, the dendrogram is the most useful tool in this process.
Figure 11. An example of hierarchical clustering dendrogram
The vertical axis of the dendrogram represents the distance between clusters. The horizontal axis represents the cases. Each merging of two cases (or clusters) is represented in the figure by the splitting of a vertical line into two vertical lines. The vertical position of the split gives the dissimilarity of the two clusters. When defining the number of clusters, it is a good idea to look for large jumps between successive splits. Another possible solution is based on the sum of squares error (SSE). The sum of squares error is the sum of the squared distances between each case and its cluster centre. Technical Details Calculation of Sum of Squares Error
$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \operatorname{dist}(m_i, x)^2$$
where $K$ is the number of clusters, $x$ is a data point in cluster $C_i$ and $m_i$ is the representative point for cluster $C_i$.
The SSE can be calculated for solutions with different numbers of clusters, as seen in the next figure.
Figure 12. A model selection example using SSE
The SSE value decreases as the number of clusters grows, but the rate of decrease changes. The elbow point of the figure shows a possible solution to the problem of cluster numbers. The number of cases within each cluster could also be important. For further analysis, clusters which are too small could be problematic. If a higher cluster number only produces small and unique clusters, a lower cluster number is suggested. This could also be a decision point for the final number of clusters. Last but not least, the interpretability of the clusters could also be a crucial point in the evaluation of the process.
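The SSE-based selection can be sketched as follows. The helper `sse`, the synthetic three-group data and the use of SciPy are assumptions for illustration only; cluster means are used as the representative points.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def sse(X, labels):
    # Sum of squared distances between each case and its cluster mean
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        total += float(np.sum((members - members.mean(axis=0)) ** 2))
    return total

# Synthetic data with three clearly separated groups
rng = np.random.default_rng(1)
centres = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in centres])

Z = linkage(X, method="ward")
sse_by_k = {k: sse(X, fcluster(Z, t=k, criterion="maxclust")) for k in range(1, 7)}

# Decrease in SSE gained by moving from k-1 to k clusters
drops = {k: sse_by_k[k - 1] - sse_by_k[k] for k in range(2, 7)}
print(sse_by_k)
print(drops)
```

In this constructed example the gain from a fourth cluster is far smaller than the gain from the third, which is the "elbow" pattern described above. Real load profile data rarely shows such a sharp elbow.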
5.2.1.4 Advantages and disadvantages
The biggest disadvantage of the hierarchical clustering procedure is its high computational demand. In the first step an N*N distance matrix has to be calculated, where N is the number of cases, so it is quite problematic to use this method for big databases (N>=10 000). The method does not guarantee the best possible solution, only a locally best one, because once two cases are merged they cannot be split again. It is also sensitive to outliers and to non-spherical (non-convex) clusters.
One of the biggest advantages of hierarchical clustering is that it has an exact solution. Although it is based on an iterative process the final result of clustering is always the same (if the distance metric and the agglomeration method remain unchanged). The different types of distance modes and the different types of agglomeration methods make this technique quite flexible to handle different types of data and variable structure. It is also possible to examine the different levels of clustering and which clusters are merged in that iteration. This useful information could be used at the stage of defining the most suitable number of clusters. Hierarchical clustering is available through all the common statistical software like SPSS and SAS.
5.2.2 K-means clustering
K-means clustering is one of the most prevalent clustering techniques; it is fast and quite efficient, but it also has some disadvantages. It is a partitional algorithm, which puts the data cases into non-overlapping clusters, so that each case is in exactly one cluster. K-means clustering searches for a global optimum (but usually finds only a local optimum). Technical Details The K-means method consists of the following steps:
1. Define K cases as initial cluster centres
2. Assign each case to the closest cluster (based on the cluster centroids)
3. Re-calculate the cluster centroids
4. Repeat the second and third steps until the system reaches a stable state
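The four steps can be written down directly in NumPy. This is a minimal sketch assuming random cases as initial centres and Euclidean distance; the function name, parameters and data are hypothetical and do not reproduce any particular software package.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct cases as initial cluster centres
    centres = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(n_iter):
        # Step 2: assign every case to the closest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once the assignment no longer changes
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: recompute each centroid (empty clusters keep their centre)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return labels, centres

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centres = kmeans(X, k=2)
print(labels)
print(centres)
```

Changing `seed` changes the initial centres, which is exactly the source of instability discussed in this section.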
How we define the initial cluster centres is a crucial point in K-means clustering. There are many options to do this. We could use the first K elements or random initial points, or we could define the centroids based on previous findings (for example from hierarchical clustering). The basic K-means clustering technique is usable only when continuous variables are available. In some respects (distance, agglomerative process) the algorithm is analogous to the hierarchical clustering technique using Euclidean distance and the Ward method (but different K-means methods exist)⁴. Speed is one of the biggest advantages of the K-means method over the hierarchical technique. It can also easily handle large data structures. The hierarchical method has limitations in terms of the number of cases; K-means has no such limitation⁵ (the running time increases quasi-linearly with the number of cases). In K-means clustering the number of clusters has to be defined before the analysis. A possible solution is to rerun the analysis with different cluster numbers and then choose the best fitting one. The choice between the numbers of clusters is not straightforward; the same solutions could work as described previously for hierarchical clustering. The hardest part in K-means clustering is finding a stable cluster structure. A change in the initial cluster centres can lead to a different final cluster classification. The most common solution to this problem is based on multiple runs of the K-means process with different initial cluster centres, and finding the typical clusters through evaluation of the results. The problem is that this solution relies heavily on the decisions of the researchers. It is also possible to run a hierarchical cluster analysis on a sample of cases and use the resulting cluster centres as initial centroids. Another solution is to find the farthest cases in the database and use them as initial cluster centres. As seen here, many solutions could work, but there is no exact method. In the pilot study part of this paper, we introduce a special technique developed by Ariosz to solve this problem. K-means clustering is available in all the common statistical software packages, like SPSS and SAS.
⁴ It is problematic to use this technique when the sizes or shapes of the “original” clusters are quite different. It is also sensitive to outliers and biased cases.
⁵ In very large databases other methods (like CLARA) solve the clustering quicker than K-means does.
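One possible reading of "finding the farthest cases" is a farthest-first traversal, sketched below. This is an assumption made for illustration; it is not the technique developed by Ariosz, and the function name and the `first` parameter are hypothetical.

```python
import numpy as np

def farthest_first_centres(X, k, first=0):
    # Start from one case, then repeatedly add the case that is
    # farthest from all centres chosen so far
    chosen = [first]
    min_dist = np.linalg.norm(X - X[first], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return X[chosen]

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [10.0, 1.0], [0.0, 10.0]])
initial_centres = farthest_first_centres(X, k=3)
print(initial_centres)
```

Such an initialisation is deterministic for a given starting case, but it tends to pick outliers, so it may need to be combined with the outlier handling described elsewhere in this report.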
5.2.3 Latent Profile analysis
One of the disadvantages of the hierarchical and K-means clustering techniques is that they use a heuristic approach and are not based on formal models. Model-based clustering algorithms (Banfield-Raftery 1993) provide an alternative approach for the segmentation process. Model-based clustering is closely associated with mixture modelling. In a finite mixture model each component probability distribution corresponds to a cluster. In that case the determination of the number of clusters, and the choice between possible models, can be recast as a model selection problem. Technical Details In a finite Gaussian mixture model, we assume that the data is generated by a mixture of normal densities. Each component of this multivariate mixture distribution corresponds to a cluster. The clustering process tries to identify these different distributions behind the data structure. This is a probabilistic clustering technique that also quantifies the uncertainty of the classification. The result could be used as a fuzzy clustering, but also as a rigid one. The components are usually modelled by the normal (Gaussian) distribution. For a fixed number of clusters, the model parameters can be estimated by the EM algorithm (Fraley-Raftery 1998). Different parametrisations could be used in latent profile models. We suggest the parametrisation which assumes ellipsoidal distributions, and also assumes that the clusters have the same volume, shape and orientation in the p-dimensional space. This model specification is closest to the Ward approach of hierarchical clustering. The model estimation process is based on Maximum Likelihood fitting, so statistical selection criteria could be used to choose the best model. A commonly used statistic for that purpose is the Bayesian Information Criterion (BIC), which has given good results in a range of applications of model-based clustering (Fraley-Raftery 1998). A bigger BIC value means a better estimated model. The next figure shows an example of this.
Figure 13. A model selection example using BIC
In the above example the BIC value clearly shows that the improvement in model fit stops at the 9-cluster solution. After that only minor improvements can be seen. This indicates that a 9-cluster model is likely the best one. On the other hand, the BIC value is very sensitive to sample size. With a small sample a solution with a low number of clusters fits best, whereas a big sample could necessitate a higher number of clusters. One way to deal with this problem is to search for the elbow point of the figure, that is, the cluster number where the model improvement starts to level off. This model-based approach also has some disadvantages. First of all, the number of variables that can be used is limited. There is no opportunity to use a hundred-dimension data space (as we suggested in the Nested approach section), so a parsimonious solution is needed (as suggested in Separate time dimensions) to avoid over-fitting the models. The model-based approach also has problems when it comes to large databases, similarly to hierarchical cluster analysis. It could also be a problem if the distributions behind the data structure strongly differ from the normal (Gaussian) distribution. The common statistical software packages (SPSS, SAS) haven’t implemented latent profile analysis yet, so special software, like R (Fraley-Raftery 2007) or LatentGold, is needed to fit a mixture model.
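For reference, the BIC in the "bigger is better" form used in this section can be written as below. The symbols are introduced here for illustration: $\hat{L}_M$ is the maximised likelihood of model $M$, $m_M$ is its number of free parameters and $n$ is the sample size.

$$\mathrm{BIC}_M = 2 \log \hat{L}_M - m_M \log n$$

The penalty term $m_M \log n$ grows with the sample size, which is one way to see why the selected number of clusters depends on $n$. Some software reports the criterion with the opposite sign ($-2 \log \hat{L}_M + m_M \log n$), in which case smaller values indicate the better model, so the convention should be checked before comparing results.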
5.2.4 Time series clustering (DTW)
The methods introduced in the previous sections (hierarchical clustering, K-means clustering, latent profile modelling) don’t use the time series attributes of the data. The load profiles are basically temporal, sequenced data, so there is explicit information about timing in them. The notion here is similar to other conventional clustering techniques: given a set of individual time series, our objective is to group similar time series into the same cluster (this is called whole clustering and has to be distinguished from subsequence clustering). The task can be done in various ways (Keogh-Lin 2005, Benitez et al. 2014). Time series clustering can be tackled in three ways: a model-based approach, a feature-based approach and a raw-data-based approach (Liao 2005). We will use the second type in this paper. One of the most common approaches in raw-data time series clustering is the use of a special distance measure that also relies on the time dimension of the data. After the extraction of a distance matrix, any general solution (hierarchical/K-means) could be used for clustering. A typical distance function used in time series clustering is the dynamic time warping (DTW) distance. Technical Details This measure allows a non-linear mapping between two vectors by minimizing the distance between them. DTW is expensive to compute, as it is quadratic in the length of the sequences. It also has the disadvantage that it typically cannot handle slight variations in the underlying process (Li-Prakash 2011). It is mainly used in speech and signature recognition, but it could also be used in pattern recognition applications. This method concentrates on the shape of the time series, and it allows for sequence drifting. The DTW metric could be a good option when it is important to capture highly similar cases, and it is not critical to capture the not-so-similar cases (Iglesias-Kastner 2013). There is no implementation of this metric in the basic statistical software, but it is available in special statistical programmes (like R).
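The DTW distance can be computed with a simple dynamic programme. The sketch below is the textbook form without any warping window, using the absolute difference as local cost; the function name and the example series are assumptions for illustration.

```python
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    # cost[i, j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i, j] = local + min(cost[i - 1, j],      # a advances alone
                                     cost[i, j - 1],      # b advances alone
                                     cost[i - 1, j - 1])  # both advance
    return float(cost[n, m])

# The same peak shifted by one time step
a = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
b = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
print(dtw_distance(a, b))
print(sum(abs(x - y) for x, y in zip(a, b)))
```

The two nested loops make the quadratic cost visible. The example also shows the "sequence drifting" property: the shifted peak is matched by the warping, while a point-by-point comparison treats the two series as different.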
5.2.5 Other classification methods
A range of other classification methods exists. In this short section we set out only a few of them. The fuzzy C-means algorithm is similar to K-means in various ways, but with one big difference. In K-means applications it is assumed that each case is in exactly one cluster. In fuzzy clustering every case could belong to different clusters with different probabilities. The level of “fuzziness” could be set by a parameter of the model. The self-organizing map is a neural network approach based on unsupervised learning. The Kohonen adaptive vector quantization (AVQ) algorithm is a variation of the K-means method, and it belongs to the unsupervised neural network family of models (Tsekouras et al. 2008). For a detailed description of these clustering techniques, see Liao 2005 and Tsekouras et al. 2008.
5.2.6 Comparing relevant methods
In the previous five sections we have briefly introduced some clustering techniques that were used in previous classification works. The most common approach in segmentation is the use of hierarchical or K-means algorithms. As load profiles are time sequenced data, it is also possible to use special distance functions like DTW as input for hierarchical or K-means clustering. These are all heuristic models. A model-based approach could also handle the segmentation problem. In this report we have introduced a finite mixture model, called latent profile analysis. In the above sections we have mainly presented the universal attributes of these models. In this section we compare these methods with respect to all the factors that are relevant in the NATCONSUMERS project.
Table 3. Comparing the relevant clustering techniques
| | Hierarchical clustering (Ward) | K-means | Latent Profile | Time series approach (DTW) |
|---|---|---|---|---|
| Data requirements | Normalized raw data | Normalized raw data | Normalized raw data | Normalized raw data |
| Sample size | Limited | Quasi unlimited | Limited | Very limited |
| Dimensions | Quasi unlimited | Quasi unlimited | Very limited | Limited |
| Data measurement level | Continuous (based on distance metric) | Continuous | Continuous | Continuous |
| Data distribution | No assumptions | No assumptions | Multivariate normal | No assumptions |
| Process time⁶ | Slow | Quick | Slow | Very slow |
| Handles outliers | Badly | Badly | Well | Badly |
| Handles non-convexity | Badly | Badly | Well | Average |
| Determination of cluster number | No universal approach (SSE value, cluster size, interpretation, dendrogram) | No universal approach (SSE value, cluster size, interpretation) | BIC value (sensitive to sample size) | No universal approach (SSE value, cluster size, interpretation) |
| Other disadvantages | Finds only local optimums. | No stable solution. | Has to be parameterized. Has difficulties with non-normal distributions | Has to be parameterized. Captures badly the not so similar cases |
| Availability in different software platforms | All statistical software platforms (SPSS, SAS) | All statistical software platforms (SPSS, SAS) | Special statistical software platforms (R, LatentGold, Matlab) | Special statistical software platforms (R, Matlab) |
⁶ The process time relies heavily on the hardware. With stronger hardware and with the help of parallel processing the performance could easily be improved.
In the table above we have summarised some pros and cons of the four methods. In the case of hierarchical clustering, we assumed the use of the Ward method in agglomeration. All of the techniques need normalised raw data with continuous variables. Only the latent profile technique relies on normality assumptions, and this method can handle only a limited number of input variables. In the case of big samples (N>=100 000) only the K-means method could be used, as the other 3 clustering techniques can only handle smaller databases. This factor closely correlates with the speed of data processing, as K-means is the quickest and DTW is the slowest method. Most of the methods react badly to outliers, except for the latent profile technique, where additional parameters could be added to the model to deal with outliers. The same is true for non-spherical (non-convex) clusters, which hierarchical and K-means clustering handle badly. The determination of the proper number of clusters is a complicated issue in its own right. The latent profile method has the advantage that the BIC value could be used in choosing a suitable number of clusters. But the BIC value is sensitive to sample size, so this isn’t a universal solution either. One big disadvantage of K-means is that it doesn’t give a stable solution, and the result is highly sensitive to the definition of the initial cluster centres. As a consequence, the K-means method usually finds local optima and not global ones. We will get back to this problem in the Pilot Study chapter (6.4.1 K-means clustering). On the other hand, K-means algorithms are available in nearly all statistical software platforms, and the method doesn’t need any additional parameterisation. The latent profile method handles non-normal distributions badly, and it is not an easy task to properly parametrize latent profile models. It needs special statistical software to run, as it hasn’t been implemented in any big commercial statistical software yet. The same is true for the DTW approach. Besides these, the DTW method badly captures those clusters where the cases are not so similar. In the next section we will present the usability of the above described statistical methods in load profile segmentation using a 4 country (Italy, Hungary, Ireland, UK) case study.
6 Case Study
In the previous chapter of the paper a short overview was given of the basics of our data classification approach. We have presented the importance of combining the three time aspects: the hour, the day of the week and the month (season). There are plenty of methods available for the classification of load profiles. We have presented four techniques: hierarchical clustering, K-means clustering, latent profile analysis and DTW time series clustering. In the next sections we will test these methods using pilot data from 4 countries: Hungary, Ireland, Italy and the UK. We will mainly focus on the hierarchical clustering technique, as we think it has many benefits compared to the other algorithms. In the main text, mostly the Irish load profile data will be presented. In the appendix we will provide additional results for the other countries. The content of this chapter is organized in the following order:
- Basic information and descriptive statistics about the 4 countries involved in the case study
- Detailed description of the hierarchical clustering segmentation using Ireland load profile data
- Visualization of the different time aspects
- Comparison of the load profile segments of the 4 countries based on hierarchical clustering
- Common patterns and country specific differences
- Description of the K-means clustering segmentation using Ireland load profile data
- Description of the latent profile analysis of Ireland load profile data
- Short description of the DTW time series clustering using Ireland load profile data
- Conclusion about the outputs of different segmentation methods
6.1 Description of the Pilots
Some basic information is provided about the four pilot cases below.
Table 4. Basic information about the 4 case studies data
| | Hungary | UK | Ireland | Italy |
|---|---|---|---|---|
| Source | Hungarian DSOs (ELMŰ/ÉMÁSZ, E.ON, EDF DÉMÁSZ) | ACORN | CER | ENEL |
| Period | 2014.01.01. cont. | 2008.01.08 - 2010.09.30. | 2009.07.13 - 2010.12.31. | 2011.01.01 - 2012.12.31. |
| Granularity | 15 min | 30 min | 30 min | 15 min |
| Number of cases (used) | 17 207 (4 000) | 14 319 (4 000) | 4 225 | 1 020 |
| Customer type | Residential | Residential | Residential | Residential |
| Unit | kWh | kWh | kWh | kWh |
| Data format | Oracle database | TEXT format | CSV format | Excel format |
As can be seen in the above table, the four pilot cases differ in many ways. The Hungarian pilot data were provided by three DSOs (ELMŰ/ÉMÁSZ, E.ON, EDF DÉMÁSZ), which together cover all residential households in Hungary. This is the most recent data source; it comes from the period 2014-2015. The data contains the load profiles of two full years, with a 15 minute frequency. The sample size is quite big, around 17 000 cases. The Hungarian sample is representative of Hungarian households by region and urbanization, but households with higher consumption are slightly over-represented. To avoid any bias due to this over-representation we have taken a 4 000 case sample, stratified by consumption level. We have used this 4 000 case sample in the work presented later here. The UK load profile data contains nearly 15 000 cases. We have again taken a 4 000 case sample from this sample frame to balance the different sample sizes in the analysed countries (we will return to the issue of sample size in the next sub-section). The UK data comes from the period between 2008 and 2010; the time span is more than two years, with a frequency of 30 minutes. The load profiles from Ireland have the same 30 minute frequency as in the UK, and the sample size is a little over 4 000. The data comes from the period between 2009 and 2010, but the time span is less than two years. The Italian pilot has the smallest sample size, just 1 020 households, but the frequency of the data is 15 minutes. It comes from the period between 2011 and 2012 and contains the load profiles of two full years. The unit of the raw data is kWh in all cases, and all the samples contain residential electricity usage.
6.1.1 Data (pre-)processing
Before the segmentation we had to harmonize the case studies’ data, and we also had to handle the outliers and anomalies in the raw data. Our primary aim was to create a standardized data structure that is the same for every country. The data comes from residential consumers, and the unit of consumption was kilowatt-hours (kWh) in all four countries. The frequency of the data differed between the countries, so we needed to harmonize it. To do so, we decided to aggregate the data to an hourly basis, as the segmentation procedures did not require a more detailed granularity. We have also dealt with the problem of daylight saving time: during data processing we transformed all the hourly aggregated consumption data into winter time. After the consolidation of the raw data, we performed several data validation checks for anomalies, outliers and extreme data, and we eliminated all these irregularities before proceeding with the data analysis. We also harmonized the sample size across the countries. The details of this are presented in the next sub-section.
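The hourly aggregation step can be sketched as follows. The sketch assumes that each raw reading is the energy (kWh) consumed in its interval, so that readings are summed; if the readings were average power values they would have to be averaged instead. The function name and the example values are hypothetical, and the daylight saving correction is not covered here.

```python
import numpy as np

def to_hourly(readings_kwh, readings_per_hour):
    # readings_per_hour is 4 for 15 minute data and 2 for 30 minute data
    readings = np.asarray(readings_kwh, dtype=float)
    if readings.size % readings_per_hour != 0:
        raise ValueError("series length must be a whole number of hours")
    # One row per hour, then sum the interval energies within each hour
    return readings.reshape(-1, readings_per_hour).sum(axis=1)

quarter_hourly = [0.1, 0.2, 0.3, 0.4, 1.0, 1.0, 1.0, 1.0]  # two hours, 15 min data
half_hourly = [0.5, 0.5, 1.0, 2.0]                          # two hours, 30 min data
print(to_hourly(quarter_hourly, 4))
print(to_hourly(half_hourly, 2))
```

After this step the 15 minute and the 30 minute sources share the same hourly structure and can be processed by the same segmentation code.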
6.1.1.1 Sample size
The initial sample size was quite different in the four countries. In the previous chapter (5.2.6: Comparing relevant methods) we described in detail the problems that sample size can cause for the different methods. Only the K-means clustering algorithm can handle large data structures easily; the other methods are more limited. In order to avoid this problem, we decided to balance the sample size across the four countries. Although the Italian sample size was the smallest, we decided that a better approach was to use the Ireland sample size as a benchmark. In the case of Hungary and the UK, we had to take a sub-sample from the initial sample frame. In Hungary we took a stratified sample (also to fix some sample bias); in the UK we took a random sample. As a result, the sample size of Hungary, the UK and Ireland is around 4 000 cases each, and the Italian sample contains around 1 000 consumers.
6.1.2 Basic structure of consumption
Besides the many differences between the four countries involved in the case study, the basic structure of their residential electricity consumption is rather similar, as we will see in this section. To make the comparison easier the standardized share data is presented here.
Figure 14. Daily, weekly and yearly share of electricity consumption in the four countries
The daily share figure presents the electricity consumption based on hourly frequency. The basic shape of the consumption is quite similar; there is a downward tendency in the night until 3-4 o’clock, after which comes a small peak in the morning and a high peak in the evening. The Irish data has the biggest deviation from the average, it has the smallest consumption share ratio at night, and the highest in the late afternoon and early evening. The exact time of the evening peak differs between each country. It occurs at around 5-6 pm in Ireland and in the UK, at 7 pm in Hungary, and around 8 pm in Italy. The weekly share of electricity usage also highlights a basic consumption structure. The level of consumption is lower on weekdays and higher at the weekends, especially on Sundays. There are some minor differences between the countries. In Ireland and Italy, the electricity consumption is quite low on Mondays, lower than on the other weekdays. There are no such differences in Hungary and in the United Kingdom. The Italian consumption is surprising on Fridays compared to the other three countries, as it is much higher than their
consumption on Saturdays. On a daily basis the Irish data has the highest deviation from the average. On a weekly basis it appears that Italy has the highest deviation, but the differences here are not so sharp. The most noticeable difference between the countries appears on a yearly basis. There are two easily distinguishable tendencies in the load profiles. In the case of Ireland and the UK the peak of the consumption appears in December and January. Consumption then decreases until June, after which it begins to increase again. To simplify the picture, the consumption is high in winter and rather low in summer. Hungary and Italy have quite a different consumption structure. There is also a winter peak in their data, but these countries also have a summer peak in July and August. These two countries are hotter in the summer than the UK and Ireland are, and air-conditioning is widely used there (especially in Italy).
6.2 Segmentation baselines
In the theoretical part of the paper we have presented four statistical methods that are widely used in segmentation processes: hierarchical clustering, K-means clustering, latent profile analysis and DTW time-series clustering. In the next section we will focus on how these methods could work in load profile segmentation procedures. As we highlighted earlier, we have a rather unique approach to time handling. We suggested combining the hourly consumption data (which is commonly used in other load profile segmentations) with day-of-the-week consumption and monthly consumption into a big, 432-dimension space (see 5.1.3.1 Nested approach). This data structure will be used in the following analysis (except in the case of latent profile models). Our primary aim is to test and present the usability of these methods with real data coming from various countries. However, a detailed description of all the results would unnecessarily widen the scope of this analysis, so we decided to highlight only one country’s results, those of Ireland. Thus, in the next sections we will briefly present the load profile segmentation of the Irish data with the four described methods. In a brief section, the results from the other countries will also be shown. We will use the hierarchical clustering method as a benchmark, and we will use the other methods mainly for comparative reasons. The choice of hierarchical clustering was not random; this method has a number of advantages compared to the others. The main reasons were that the hierarchical clustering result is exact, and that we can follow how the clusters join each other. We could therefore draw a map of cluster merging that is very useful in interpretation. The 4 000 case samples used were not too big, so the limitation of hierarchical clustering regarding the number of cases was not an issue.
6.3 Hierarchical clustering
6.3.1 Load profiles of Ireland – 5 cluster solution
In the first part of the empirical analysis we present the result of the hierarchical clustering. The Irish data contained 4 225 cases, with a granularity of 30 minutes. It came from the period between 2009 and 2010, and the time span was less than two years. First the data was pre-processed using the methods described in the previous section (Data (pre-)processing 6.1.1). After that, the 432 dimensional variable space was created (see: Nested approach 5.1.3.1), and the data was standardized (see: Standardization 5.1.3.1.1). In hierarchical clustering there is no need to choose the final cluster number at the beginning of the procedure, so it is quite easy to compare solutions with different cluster numbers. Our preliminary analysis showed that the 5 cluster solution fits the data well, and it is also interpretable. So next, we focus on the results of the 5 cluster solution, and we use this as a showcase (later we explain why we have chosen the 5 cluster solution).
Figure 15. Load profile segments, daily share Ireland, 5 cluster solution, hierarchical clustering
In the figure above only the daily dimension was used to present the segments. All five clusters have a unique daily characteristic, but we can also find some similarities between them. We could identify three main types of consumption pattern, those who have a big peak in the afternoon or in the evening, those who have a small morning peak and an average afternoon/evening peak, and those who have a strong bimodality in their consumption, with two peaks of rather similar magnitude. The next plots help us to understand the differences between these segments.
Figure 16. Boxplot of load profile segments, Double Risers segment, three time aspects, Ireland, 5 cluster solution, hierarchical clustering
The biggest segment contains 44 percent of the cases, nearly half of the sample. We called them Double Risers as they have two big jumps in their consumption, one in the morning and one in the afternoon. Their consumption corresponds mainly to average consumption. The peaks in their consumption are not extremely high: they have a very small morning peak and an average afternoon/evening peak between 5pm and 8pm. On a weekly basis, their consumption is quite low on Mondays, lower than on other weekdays. On Sundays their consumption significantly increases, but this is not surprising as their consumption pattern is quite similar to the average Irish consumption shown in a previous section. The pattern of the yearly share of this group is also very similar to the average yearly pattern in Ireland. The peak of the consumption appears in December and January and there is a clear decrease in the summer, although the differences between the months are only moderate.
Technical Details Boxplot: We use boxplots to visualize the electricity consumption of the segments. The bottom and top of the box are the first and third quartiles, and the band inside the box is the median. The ends of the whiskers represent the 5th percentile and the 95th percentile.
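The five values behind each box can be obtained with `numpy.percentile`. The helper below is a hypothetical illustration of the whisker convention described above, not the code used to draw the figures.

```python
import numpy as np

def boxplot_stats(values):
    # Whiskers at the 5th and 95th percentiles, box at the quartiles
    p5, q1, median, q3, p95 = np.percentile(values, [5, 25, 50, 75, 95])
    return {"whisker_low": p5, "q1": q1, "median": median,
            "q3": q3, "whisker_high": p95}

# Illustrative values only: the integers 0..100
stats = boxplot_stats(np.arange(0, 101))
print(stats)
```

In matplotlib the same convention can be requested by passing `whis=(5, 95)` to `boxplot`, whose default whiskers otherwise extend to 1.5 times the interquartile range.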
Figure 17. Boxplot of load profile segments, Afternoon Actives segment, three time aspects, Ireland, 5 cluster solution, hierarchical clustering
The second Irish cluster was named Afternoon Actives. Their weekly and yearly consumption is quite similar to what we have found in the case of Double Risers, but their daily consumption differs significantly. This group has the biggest average consumption, 7 percent above the national average. This group does not have any morning peak; the share rate is uniform through midday until 2 pm. After that a quick increase appears, but after 6 pm their consumption starts to decrease. This means they arrive home quite early, but they also finish their day early. This is also a big segment; nearly 30 percent of the sample belongs here. The third Irish segment, called Evening Actives, is similar to the Afternoon Actives in many ways. The weekly and yearly consumption patterns don’t show any difference compared to those previously presented. The biggest similarity with the Afternoon Actives is that this group also has a single large consumption peak. But there is also a big difference: this peak appears much later in the evening. The consumption share of Evening Actives is uniform in the morning and afternoon, starts to increase around 4 pm, and reaches its maximum around 8 pm. So there is a clear 2-3 hour shift between the hourly consumption patterns of the Afternoon and the Evening Actives. The Evening Actives group is also smaller; only 11 percent of the consumers belong there.
Figure 18. Boxplot of load profile segments, Evening Actives segment, three time aspects, Ireland, 5 cluster solution, hierarchical clustering
The three profile types above have only one consumption peak (the Double Risers also have a morning peak, but that one is rather low). The following segment has two distinct peaks; their daily load profile is bi-modal. The first peak appears around 12 o’clock, the second around 4 pm. So one peak appears at lunchtime, the other in the afternoon. Their name, Home Lunchers, alludes to the fact that this consumer group stays at home during the day. They are probably inactive, maybe pensioners, or parents with small children, or people who work from home or are self-employed. Their weekly and yearly consumption pattern is completely ordinary, without any surprises. The within-year variation of consumption is probably the smallest among all groups. They have the smallest average consumption in kWh among the 5 clusters.
Figure 19. Boxplot of load profile segments, Home Luncher segment, three time aspects, Ireland, 5 cluster solution, hierarchical clustering
Figure 20. Boxplot of load profile segments, Winter Spinners segment, three time aspects, Ireland, 5 cluster solution, hierarchical clustering
The last consumption segment is quite specialized; we called them Winter Spinners. Their daily consumption pattern is nearly the same as that of the Double Risers, although the Winter Spinners’ consumption share decreases a little around 4 pm (unlike in the Double Risers segment). The weekly pattern doesn’t show anything unusual: the Winter Spinners’ consumption is lower on Mondays and higher on Sundays. The biggest difference comes from the yearly share. The pattern itself is not unusual, with higher rates in the winter months than in the summer months. However, the dispersion of the consumption is much bigger. The main reason behind this is that this segment probably uses electricity for heating, so they have much higher electricity consumption in the cold winter months. The size of this group is 6 percent of the sample. The existence of this segment shows how necessary it is to use yearly data in load profile segmentation. If we had focused only on the daily curves, we would have missed this segment, although they need different messages to get appropriately tailored advice about their energy usage.
To determine the relative impact of the three dimensions, the average variance of consumption explained by the segments was calculated.
Technical Details: Separate linear regression models were fitted to the share of consumption on each time basis. The dependent variable was the consumption share in a given time dimension (for example: consumption share of the users at 6 o’clock, or on Sundays, or in January-February). The independent variable was the segment classification as a factor variable. The R squared statistics were averaged separately for the daily, weekly and yearly basis.
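The explained-variance calculation described above can be sketched as follows. When a linear regression has the segment label as its only (categorical) predictor, the least-squares fitted value of each case is its segment mean, so R squared can be computed directly from segment means. The function names and the toy data are hypothetical and are not taken from the project's code.

```python
def r_squared_by_segment(values, segments):
    """R squared of a regression of `values` on the segment label as a factor."""
    overall_mean = sum(values) / len(values)
    groups = {}
    for value, segment in zip(values, segments):
        groups.setdefault(segment, []).append(value)
    segment_means = {s: sum(v) / len(v) for s, v in groups.items()}
    ss_total = sum((v - overall_mean) ** 2 for v in values)
    if ss_total == 0:
        raise ValueError("the consumption share does not vary between cases")
    # Residuals are measured against the segment mean (the fitted value).
    ss_residual = sum((v - segment_means[s]) ** 2 for v, s in zip(values, segments))
    return 1.0 - ss_residual / ss_total


def average_explained_variance(share_columns, segments):
    """Average R squared over all time slots of one time dimension."""
    scores = [r_squared_by_segment(column, segments) for column in share_columns]
    return sum(scores) / len(scores)


# Hypothetical toy data: six households, two segments, two time slots.
segments = ["A", "A", "A", "B", "B", "B"]
slot_1 = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]  # segments differ strongly
slot_2 = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]  # segments are identical
r2_slot_1 = r_squared_by_segment(slot_1, segments)
r2_slot_2 = r_squared_by_segment(slot_2, segments)
average = average_explained_variance([slot_1, slot_2], segments)
```

In this toy case the first slot is explained almost entirely by the segments and the second not at all, so the average lies in between, in the same way that the daily, weekly and yearly figures average over their respective time slots.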
The detailed analysis of the 5 segments clearly shows the great role of daily shares in the classification procedure. The biggest difference between the segments could be captured on a daily basis. The weekly consumption curves did not differ greatly. On the other hand, there was one segment – the Winter Spinners – which differs from the other segments mainly because of their yearly consumption pattern. The table below supports these findings.
Table 5. 5-clusters Hierarchical clustering – average explained variance in the three time dimensions
Average Explained Variance
Daily 19.8%
Weekly 2.4%
Yearly 12.3%
The average explained variance is around 20 percent on a daily basis, 12 percent on a yearly basis and only 2 percent on a weekly basis. This clearly confirms that the daily pattern has the biggest impact on the classification, but it also supports our approach of using different time aspects. The yearly pattern has quite a strong effect on the classification, although the weekly one has only a small impact. In the next section we continue the detailed analysis of the 5 Irish segments.
6.3.2
Visualisation
In the previous section only aggregated results were visualised. There were separate charts for the daily, weekly and yearly consumption patterns. This viewpoint is very useful in exploring the main characteristics of the segments, but it could also hide significant details. So in the next part an integrated visualization will be presented with the joint use of the three time dimensions [7]. The figure is separated into three parts by day of the week. Each part displays the whole 24-hour consumption curve. The colours represent the bi-months in the figures.
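The joint use of the three time dimensions can be illustrated with a small indexing sketch. It assumes that the integrated space consists of 3 day types (weekdays, Saturdays, Sundays), 6 bi-months and 24 hours, which gives 3 x 6 x 24 = 432 cells and matches the 432 dimensions reported for the nested approach. The cell ordering and all names below are our own choices, not the project's.

```python
from datetime import datetime

DAY_TYPES = ("weekday", "saturday", "sunday")
BIMONTHS = ("Jan-Feb", "Mar-Apr", "May-Jun", "Jul-Aug", "Sep-Oct", "Nov-Dec")
HOURS_PER_DAY = 24
N_CELLS = len(DAY_TYPES) * len(BIMONTHS) * HOURS_PER_DAY


def cell_of(timestamp):
    """Return (day type index, bi-month index, hour) for a meter reading time."""
    weekday = timestamp.weekday()  # Monday = 0 ... Sunday = 6
    if weekday < 5:
        day_type = 0
    elif weekday == 5:
        day_type = 1
    else:
        day_type = 2
    bimonth = (timestamp.month - 1) // 2
    return day_type, bimonth, timestamp.hour


def cell_index(timestamp):
    """Flatten the three time dimensions into one position of the profile vector."""
    day_type, bimonth, hour = cell_of(timestamp)
    return (day_type * len(BIMONTHS) + bimonth) * HOURS_PER_DAY + hour


sunday_evening = cell_of(datetime(2010, 1, 3, 18))      # a Sunday in January, 6 pm
weekday_morning = cell_index(datetime(2010, 7, 14, 8))  # a Wednesday in July, 8 am
```

Averaging each household's readings within these cells would give one curve per day type and bi-month, which is the layout used in the integrated figures.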
Figure 21. Figure of load profile segments, Double Risers, integrated approach of time dimensions, Ireland, 5 cluster solution, hierarchical clustering
In the case of Double Risers, the curves of weekdays, Saturdays and Sundays look similar at first sight, with only minor differences. The highest consumption share can be found in the winter months, from November to February. It is also easy to identify which bi-months correlate strongly: November/December with January/February, March/April with September/October, and May/June with July/August, that is, the winter months, the between-season periods and the summer months. These months (seasons) move together on all three day types. The biggest difference between months occurs at the afternoon-evening peak time. This peak is very low in the summer months on Sundays (lower than the morning peak).
The detailed figure of the Afternoon Actives’ consumption reveals some interesting results which were not visible in the earlier analysis. Their afternoon peak mainly occurs in the winter months and the between-season periods, but not in the summer months. In the summer months a small peak is also found, but on weekdays the evening peak is nearly as high as the afternoon one. The same is true on Sundays, with an additional morning peak that is as high as the afternoon peak. This morning peak appears on Sundays throughout the whole season. The same bi-months correlate with each other as in the case of the Double Risers segment, i.e. the winter months, the between-season periods, and the summer months.
[7] For other advanced visualizations, see Benitez et al. 2014.
Figure 22. Figure of load profile segments, Afternoon Actives segment, integrated approach of time dimensions, Ireland, 5 cluster solution, hierarchical clustering
Figure 23. Figure of load profile segments, Evening Actives segment, integrated approach of time dimensions, Ireland, 5 cluster solution, hierarchical clustering
Figure 24. Figure of load profile segments, Home Lunchers segment, integrated approach of time dimensions, Ireland, 5 cluster solution, hierarchical clustering
Figure 25. Figure of load profile segments, Winter Spinners segment, integrated approach of time dimensions, Ireland, 5 cluster solution, hierarchical clustering
In the case of Evening Actives a significant difference can be identified between the seasonal share ratios. A huge evening peak appears in every time aspect, bigger in summer and smaller in winter. The curves in the weekly comparison look similar; there are only minor differences between them.
The Home Lunchers group produces the smallest seasonal differences. The winter and between-season periods produce a similar consumption structure in the morning and in the afternoon; they diverge only in the evenings, when the afternoon/evening peaks last longer in the between-season periods. On Sundays the morning peak is much higher than the afternoon peak.
Compared to Home Lunchers, the Winter Spinners segment produces huge seasonal differences. The summer bi-months correlate with each other (since there is no need for heating in the summer), but the other bi-months clearly separate this group from the others. There is a weak bi-modality on weekdays and a strong bi-modality in the Sunday electricity consumption. Beyond that, the daily curves do not differ from each other.
6.3.3
Why five clusters in Ireland?
We have used the 5 cluster solution as a showcase in the previous sections, although the number of clusters is not self-evident. Hierarchical clustering gives a stable and definite solution, and the cluster merging history is easy to trace. One way to determine the optimum number of clusters is based on the sum of squares error (SSE). The SSE is the sum of the squared distances between each case and its cluster centre. A smaller number means a better fit, but it is important to highlight that more clusters automatically give a better fit in the case of hierarchical clustering. The elbow point of the SSE curve shows a possible solution to the problem of cluster numbers. We defined in advance that the solution we seek has to be between 3 and 9 clusters. Ten or more would be useless for interpretation; in the NATCONSUMERS project a more parsimonious approach is needed. On the other hand, the classification needs to give a solution that is suitable for giving tailored messages to the consumers, and a two cluster solution cannot satisfy this requirement. So we have tested the 3 to 9 cluster runs. As the next figure shows, it is not an easy task to determine the correct number of clusters with the help of the SSE value. The decrease of the value is quite continuous, with minor changes. If we have to identify an elbow point, it is somewhere around the 5- or 6-cluster solution.
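The SSE of a given classification can be computed as in the minimal sketch below, which assumes squared Euclidean distance to the cluster mean. The function names and the four toy points are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equally long points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sum_of_squares_error(points, labels):
    """Sum of squared distances between each case and its cluster centre."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    sse = 0.0
    for members in clusters.values():
        centre = centroid(members)
        for point in members:
            sse += sum((x - c) ** 2 for x, c in zip(point, centre))
    return sse


# Hypothetical toy data: two tight pairs of points that lie far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
sse_one_cluster = sum_of_squares_error(points, [0, 0, 0, 0])
sse_two_clusters = sum_of_squares_error(points, [0, 0, 1, 1])
sse_four_clusters = sum_of_squares_error(points, [0, 1, 2, 3])
```

The toy values fall from 104 to 4 to 0 as the same partition is refined, which illustrates why the SSE alone cannot select the number of clusters and why the elbow of the curve is inspected instead.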
Figure 26. Sum of Squares Error (SSE) of the hierarchical clustering - Ireland
Another method to define the number of clusters is based on the examination of the clustering tree, the dendrogram. The red boxes highlight the 9-cluster solution. The height at which clusters merge indicates the distance between them. The merges into 8 and 7 groups occur very close to each other. After that there is a big jump, and the 6- and 5-cluster classifications are also close to each other. The jump to 4 clusters is moderate, and it merges a very small cluster with a bigger one (we will come back to this point later). Based on the dendrogram, the 5 or the 6 cluster solution seems to be the best.
Figure 27. Part of the hierarchical clustering dendrogram, Ireland
In the NATCONSUMERS project there are two other dimensions that have to be considered when defining the correct cluster number. The first is cluster size, the second is interpretation. As we plan to apply further segmentations to the given clusters, small clusters are useless. It is very hard to define the minimum cluster size; we have used a 5 percent limit for the Irish data. The exact number is a function of the sample size, but we suggest that the smallest cluster should contain at least 200 cases.
The other important element is the interpretation. It is crucially important to understand the clusters and the socio-economic factors that lead to them. As in the NATCONSUMERS project we have to send tailored messages to the consumers, we have to be close to them, and we have to understand how their habits or attitudes influence their energy usage.
Next we will present the 9-cluster solution as a starting point, and the process of merging that leads to the 5-cluster result. For this purpose, a new type of figure will be used. Only two bi-months are used in this chart, one in winter (January-February) and one in summer (July-August). On the one hand this is a simplification of the previous charts; on the other, it allows us to put all the clusters in one figure.
Figure 28. Figure of load profile segments, all clusters, integrated approach of time dimensions with two bi-months, Ireland, 9 cluster solution, hierarchical clustering
Group A – 22% This is the largest segment identified. From this point of view, it is not so surprising that this group is quite close to the average consumption pattern. Within the day they have two peaks, a small morning one and a high afternoon one. In the summer the morning peak is as high as the afternoon peak. The consumption difference between the summer and the winter is big, but this is typical in nearly all segments. This difference is principally the consequence of longer daylight, but in some cases heating could also contribute to it.
Group B – 16% The characteristics of this group are also close to the average. Within the day, consumption is especially strong in the evening, and sometimes it goes on until midnight. On the weekends the morning consumption is also high.
Group C – 6% This is the only group that does not have any significant difference between their winter and summer consumption. In wintertime their consumption reaches its maximum in the afternoon, but in summer their consumption is uniform. Another typical characteristic of this group is that their consumption is relatively low at night. This segment is very small; only 6 percent of the sample belongs here.
Group D – 17% Group D is a sizeable, one-peak group; the daily maximum of this segment falls in the early evening. They start the mornings late and slowly, and they don’t even have a local morning peak. The forenoon consumption of this segment is only detectable on the weekends, especially on Sundays.
Group E – 12% Group E has a very similar character to group D, with one early evening peak. The main difference between them is that group E has very high weekend consumption, especially on Sunday forenoons.
Group F – 11% A sharp evening peak principally typifies this group, and can be seen on every day of the week and in both seasons. This peak goes on until midnight and it is coupled with low afternoon consumption. It is likely that active people without children, in one- or two-person households, belong to this group.
Group G – 10% Group G is a typical “daytimer” group with a two-peak consumption pattern. One peak appears at lunchtime, the other in the early afternoon. They have a particularly high consumption peak on Sunday forenoons. The difference between their winter and summer consumption is rather moderate. It is likely that at least one person stays at home in these households.
Group H – 2% The main characteristic of this segment is the significant difference between the winter and summer consumption. The simple explanation of this profile pattern is that these households have electric heating. The rather high night consumption in winter also supports this assumption. Presumably the heating system is set automatically to utilise a night electricity tariff. The within-day consumption of this group is typified by two peaks in winter and a “daytimer” character in the summer, especially on the weekends.
Group I – 4% Group I also has a big winter-summer difference in their consumption, although not as big as group H. The winter electric heating is presumably not automated in this segment; it probably has a supplementary role only, because the extra consumption is only traceable during the day and not at night. The segment basically has a two-peak consumption pattern, with a strong forenoon peak on the weekends, especially on Sundays.
The next figure presents the 5 cluster solution that we have used as a showcase in the previous sections. It also shows which segments were merged in the clustering process.
Figure 29. Figure of load profile segments, all clusters, integrated approach of time dimensions with two bi-months, Ireland, 5 cluster solution, hierarchical clustering
The process was the following:
9 to 8 clusters: Merge of clusters I and H. The two electric heater segments are merged.
8 to 7 clusters: Merge of clusters D and E. Groups D and E are one-peak segments with very similar consumption profiles. The joined segments were named Afternoon Actives in the previous part of the paper.
7 to 6 clusters: Merge of clusters B and C. C was a small segment with an interesting uniform seasonal consumption, but their daily pattern is quite similar to that of group B.
6 to 5 clusters: Merge of clusters A and BC. The consumption patterns of group A and the united group BC are close to the average, with two peaks. The morning peak of group BC comes later than the morning peak of group A. In other respects the groups are quite similar. We named the merged segment Double Risers in the previous part.
Groups F and G remain unmerged throughout the process, which is not surprising as they have unique characteristics. We called group F Evening Actives and group G Home Lunchers in the previous sections.
It is also interesting to examine which segments would merge in the next step. From 5 to 4 clusters, the Winter Spinners (group IH) merge with the Double Risers (group ABC). The Winter Spinners group is very special in many ways, and it is obvious that they need different messages about their electricity consumption from the Double Risers, so it is important to keep them separate.
It is also interesting to examine the size of explained variance at the different levels of hierarchical clustering:
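The merging history can be checked against the group sizes reported for the 9-cluster solution. The sketch below replays the four merges and tracks the size of the smallest cluster at each level, using the 5 percent limit mentioned earlier. The data structures and function name are our own; the shares are the rounded percentages given in the group descriptions.

```python
# Rounded shares (percent of the Irish sample) of the 9-cluster solution.
NINE_CLUSTER_SHARES = {"A": 22, "B": 16, "C": 6, "D": 17, "E": 12,
                       "F": 11, "G": 10, "H": 2, "I": 4}

# Merges in the order described in the text: 9->8, 8->7, 7->6, 6->5.
MERGES = [("I", "H"), ("D", "E"), ("B", "C"), ("A", "BC")]

MIN_SHARE = 5  # percent limit used for the Irish data


def merge_history(shares, merges):
    """Return the cluster shares at every level, from 9 clusters down to 5."""
    current = dict(shares)
    history = [dict(current)]
    for left, right in merges:
        merged_share = current.pop(left) + current.pop(right)
        current[left + right] = merged_share
        history.append(dict(current))
    return history


history = merge_history(NINE_CLUSTER_SHARES, MERGES)
five_clusters = history[-1]
smallest_per_level = [min(level.values()) for level in history]
levels_meeting_limit = [len(level) for level in history
                        if min(level.values()) >= MIN_SHARE]
```

The resulting shares are 44 percent for ABC (Double Risers), 29 percent for DE (Afternoon Actives), 11 percent for F (Evening Actives), 10 percent for G (Home Lunchers) and 6 percent for IH (Winter Spinners), which agrees with the segment sizes quoted in the detailed descriptions.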
Table 6. 9 to 4 cluster Hierarchical clustering – average explained variance in the three time dimensions
Clusters   Daily    Weekly   Yearly
9          28.6%    3.5%     16.6%
8          27.1%    3.3%     14.8%
7          25.8%    2.4%     14.7%
6          23.9%    2.4%     12.6%
5          19.8%    2.4%     12.3%
4          19.7%    2.3%     1.2%
The weekly explained variance is small during the entire process. After the detailed analysis of the segments, this is not a surprise: the day-of-week patterns don’t differ significantly between the segments. We will come back to this in the conclusion. The yearly explained variance decreases somewhat when groups I and H are merged and when groups B and C are merged. The step from 5 to 4 clusters would reduce the yearly explained variance to near zero. The daily explained variance is the highest throughout the whole process, which suits our theoretical concept.
Up to this point only the Irish consumption profiles were analysed. In the next sections we will go further and briefly present the results of hierarchical clustering in the case of the other three countries.
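The loss of explained variance at each merging step can be read off Table 6 with simple arithmetic. The sketch below assumes that the rows of the table correspond to the 9- to 4-cluster solutions in the order printed; the names used are our own.

```python
# Table 6: number of clusters -> (daily, weekly, yearly) explained variance in percent.
TABLE_6 = {
    9: (28.6, 3.5, 16.6),
    8: (27.1, 3.3, 14.8),
    7: (25.8, 2.4, 14.7),
    6: (23.9, 2.4, 12.6),
    5: (19.8, 2.4, 12.3),
    4: (19.7, 2.3, 1.2),
}
DIMENSIONS = ("daily", "weekly", "yearly")


def loss_per_merge(table):
    """Percentage points of explained variance lost at each merging step."""
    counts = sorted(table, reverse=True)
    losses = {}
    for before, after in zip(counts, counts[1:]):
        losses[(before, after)] = {
            name: round(table[before][i] - table[after][i], 1)
            for i, name in enumerate(DIMENSIONS)
        }
    return losses


losses = loss_per_merge(TABLE_6)
largest_yearly_step = max(losses, key=lambda step: losses[step]["yearly"])
largest_daily_step = max(losses, key=lambda step: losses[step]["daily"])
```

By this arithmetic the step from 5 to 4 clusters costs 11.1 percentage points of yearly explained variance, far more than any earlier merge, while the largest daily loss (4.1 points) occurs at the step from 6 to 5 clusters.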
6.3.4
Country comparison
The detailed analysis of Irish consumption data has revealed a complex consumption structure with many different load profile characters. The length of this paper does not allow repetition of that analysis for all countries to a similar depth, so in the next sections we will focus only on the main characteristics of the remaining three countries.
Figure 30. Daily share of electricity consumption in the four countries in separate charts
The basic 24-hour consumption patterns of the countries are quite similar: there is a downward tendency in the night until 3 or 4 o’clock, after which comes a small peak in the morning and a high peak in the afternoon/evening. The Irish data presents the smallest consumption share ratio in the night and the highest in late afternoon/early evening, meaning it has the biggest range. The exact time of the evening peak differs between the countries. It occurs at around 5-6 pm in Ireland and in the UK, at 7 pm in Hungary, and around 8 pm in Italy.
Next we will present the country-specific results for the UK, Hungary and Italy. The same cluster names will be used where possible in order to make comparison between countries easier.
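The quantities compared here (hourly share, time of the peak, and range) can be illustrated with a short sketch. It assumes that the daily share of an hour is that hour's consumption as a percentage of the day's total. The function names and the hourly values are hypothetical and do not come from any of the four datasets.

```python
def daily_shares(hourly_kwh):
    """Convert 24 hourly consumption values into percentage shares of the daily total."""
    total = sum(hourly_kwh)
    if total <= 0:
        raise ValueError("daily total must be positive")
    return [100.0 * value / total for value in hourly_kwh]


def peak_hour(shares):
    """Hour (0-23) with the highest share; the earliest hour wins ties."""
    return max(range(len(shares)), key=lambda hour: shares[hour])


def share_range(shares):
    """Difference between the highest and the lowest hourly share."""
    return max(shares) - min(shares)


# Hypothetical average consumption (kWh) for hours 0..23 of one day.
hourly_kwh = [0.3, 0.2, 0.2, 0.2, 0.2, 0.3, 0.4, 0.6, 0.6, 0.5, 0.5, 0.5,
              0.6, 0.5, 0.5, 0.6, 0.8, 1.0, 0.9, 0.9, 0.8, 0.7, 0.6, 0.4]
shares = daily_shares(hourly_kwh)
evening_peak = peak_hour(shares)
spread = share_range(shares)
```

Because shares are relative to each household's own total, households with very different consumption volumes can still be compared on the shape of their daily curve.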
6.3.4.1
Consumption structure of the UK
In the UK a four-cluster solution was selected. The observed segments are very similar to the ones presented for Ireland but without Evening Actives. The segments are Winter Spinners, Double Risers, Afternoon Actives and Home Lunchers.
Figure 31. Load profile segments, daily share UK, 4-cluster solution, hierarchical clustering
The size of the Winter Spinners segment is 9 percent, a little bigger than in Ireland. Their consumption share is nearly uniform from the night until the early afternoon. They have a moderate evening peak in their consumption, but the overall range of their electricity usage is low compared to other countries. On the other hand, a big consumption difference can be found between the summer and winter months (see appendix). The most probable reason behind this difference is the usage of electric heating in these households. Their average electricity usage is 45 percent above the national mean.
The Double Risers segment’s daily load profile correlates strongly with the average consumption dynamics. The peaks in their consumption are not extremely high: they have a very small morning peak and an average afternoon/evening peak between 5 pm and 8 pm.
In Ireland the Evening and the Afternoon Actives were distinguished. In the UK these two groups merged into a larger Afternoon Actives segment. Nearly 50 percent of the sample belongs to this segment. Their consumption pattern is quite similar to that of the Double Risers (their consumption increases at two main times, once in the morning and once in the afternoon), with one big difference: their afternoon peak is very high. The range of the Double Risers’ consumption share is around 4 percent, while in the case of Afternoon Actives it is around 6 percent.
The last group has three consumption peaks: one in the morning, one just before noon and one in the afternoon. The two morning peaks are of quite similar height, but the afternoon peak is a little higher. We called this segment Home Lunchers, as their consumption structure clearly shows that somebody in the household stays at home throughout the day. The size of this group is 23 percent, which is much bigger than the similar group in Ireland (10 percent).
6.3.4.2
Consumption structure of Italy
The Italian sample was the smallest, as it contained only about 1000 cases. Due to this, the uncertainty of the results is higher than in the other cases. The hierarchical clustering of the Italian consumption identified 5 clusters that differ from each other in many ways.
Figure 32. Load profile segments, daily share Italy, 5-cluster solution, hierarchical clustering
The biggest Italian group is the Evening Actives; 37 percent of the sample belongs here. It has the same characteristics as the cluster with the same name in Ireland.
There is another group, the Double Risers, which was also found in the UK and Ireland. Their electricity usage corresponds mainly to average consumption. The segment is characterized by a small morning peak and a moderate afternoon peak, with two big jumps.
The third group that is already known from the previous analysis is the segment of Late Actives. We have found a rather similar group in Ireland, but we called them Evening Actives. There is a 2-hour shift between the consumption profiles of the two segments. The Italian Late Actives segment reaches its minimum consumption at 4 o’clock, after which a continuous increase can be observed until 6 pm, when an evening peak starts. Their high consumption goes on through to midnight. In summary, the people who belong to this segment are likely to get up late, just before noon, and go to bed late at night.
The fourth group, the Summer Wavers, has a quite similar daily consumption character to the Afternoon Actives, except that the peak of their afternoon consumption is only moderate. However, as shown below, the main difference between these segments is not due to the afternoon peaks, but rather due to the seasonality of their consumption. The first 3 segments’ yearly consumption pattern was classic in that they used more electricity in the winter and less in the summer.
Figure 33. Boxplot of load profile segments, Summer Wavers segment, three time aspects, Italy, 5-cluster solution, hierarchical clustering
Figure 34. Boxplot of load profile segments, Summer Peakers segment, three time aspects, Italy, 5-cluster solution, hierarchical clustering
As the boxplot of Summer Wavers clearly shows, the yearly share of electricity usage is rather uniform throughout the year, with a small peak in August. They probably use an air-conditioner or other electric appliances in the summer. In the case of the last segment, the Summer Peakers, however, there is a huge peak in summer electricity consumption. Their daily load profile is similar to the Summer Wavers’, so the difference between these two segments comes from the yearly pattern.
6.3.4.3
Consumption structure of Hungary
In Hungary a 4-cluster solution [8] was selected, as in the UK. Three of the four segments are recognizable from the previous analysis, but there is a country-specific one in Hungary too.
[8] The hierarchical clustering of the Hungarian load profiles originally suggested a 6-cluster solution, but two clusters were tiny (1 and 3 percent). The detailed analysis of these little segments revealed that the 1 percent group contains summer weekend houses (there is consumption only in the summer, not in the winter). The 3 percent group is just the opposite: they consume energy only in the winter and not in the summer; these could also be some kind of seasonal homes. Although these segments are interesting, they are too small for further analysis. Therefore, we decided to leave them out of the following analyses.
Figure 35. Load profile segments, daily share Hungary, 4-cluster solution, hierarchical clustering
The Afternoon Actives segment contains 22 percent of the households in the sample. It presents the ‘usual’ Afternoon Actives characteristic, with continuously increasing consumption until 4 pm, followed by a high afternoon peak that continues into the evening.
The Double Risers segment also has this afternoon peak, but their consumption is rather bi-modal, with a moderate late morning consumption peak. As in the other countries, their electricity usage corresponds mostly to average consumption.
The third group also has two consumption peaks, one in the late morning and one in the afternoon. The main defining feature is that their forenoon peak is higher than their afternoon peak. The Home Lunchers name was used earlier for this consumption pattern, as it is obvious that at least one person stays at home in this household throughout the day.
The Summer Wavers segment is a country-specific group (although we use the same name for it as for the Italian segment). Below we present a more in-depth analysis of this segment.
Figure 36. Boxplot of load profile segments, Summer Wavers, three time aspects, Hungary, 4-cluster solution, hierarchical clustering
This segment is quite big; 20 percent of the sample belongs to it. Their consumption is rather flat compared to other segments, which might suggest low consumption due to fewer appliances. The consumption volume of this segment is consistent with this, as consumption is 15 percent lower than the annual average. The analysis of the yearly share shows that these households have higher consumption in summer than in winter, which is the opposite of the overall average consumption profile for all homes. This could be an effect of air-conditioning as in Italy (Summer Wavers, Summer Peakers), but the low consumption suggests that poorer households with few appliances are behind this load profile. In Hungary air-conditioning is typical of wealthier households.
As additional information is available in Hungary about the appliances used and the demographic attributes of these households, we looked a little deeper to understand this segment. They are small households (1-2 persons), who are generally elderly and poorer. The overall number of appliances they own is moderate, but a common point is that they have a freezer. This freezer is usually old and in poor condition, and therefore inefficient. The freezer is likely to explain the increased summer consumption. Therefore, for these households it is principally the freezer and not the air-conditioner that raises the summer energy usage. This analysis indicates that different mechanisms could work in different countries to produce two similar load profile curves.
6.3.4.4
Country differences and common patterns
After the brief analysis of the 4 countries we can formulate some conclusions about the load profile characteristics. There are at least three load profile curves that can in general be identified in (nearly) all the countries. We called the first general consumption segment the Double Risers. Although there are some differences between countries, the main profile curve is similar. The peaks in their consumption are not extremely high: they have a very small morning peak and an average afternoon/evening peak between 5 pm and 8 pm. The main characteristic of their consumption is that two big increases (jumps) can be identified. On a weekly and yearly basis this segment produces a consumption profile similar to the overall average.
Figure 37. Load profile segments, daily share 4 countries, Double Risers segments, hierarchical clustering
The second general group contains the Afternoon/Evening/Late Actives segments. In some countries this segment merges with another; in other countries they stay separate (naturally this also depends on the clustering method and the decision about the number of clusters). In Italy we called them Late Actives, as their consumption goes on until midnight. The common characteristic of these segments is that they don’t have significant morning peaks (except in the UK), only afternoon/evening/night ones, and these late peaks are particularly high.
Figure 38. Load profile segments, daily share 4 countries, Afternoon/Evening/Late Actives segments, hierarchical clustering
The third general group has a bi-modal consumption structure, with a late morning peak and an afternoon peak. In the case of Ireland and Hungary, the late morning peak is higher than the afternoon peak. The late morning peak is a clear indicator of an inactive household member who stays at home all day. We have called these segments the Home Lunchers.
Figure 39. Load profile segments, daily share 4 countries, Home Lunchers segments, hierarchical clustering
Besides the general consumption profiles, the hierarchical clustering also identified some country-specific load profiles. In Ireland and the UK there is the Winter Spinners group (who probably use electric heating). In Italy there is the Summer Peakers segment, and in Italy and Hungary there is the Summer Wavers segment. The common feature of these segments is that their daily load profile is close in shape to the overall average. Only the yearly consumption pattern revealed the main attributes of these segments. This also shows that it is highly important to use yearly data in load profile segmentation, as this can help to find the unique country-specific segments.
6.4
Other clustering methods
The hierarchical clustering of load profiles provided an efficient way to segment the electricity profile of consumers. But Hierarchical Clustering is only one possible method of segmenting load profile data. We have also undertaken this segmentation with K-means clustering, latent profile analysis and DTW time-series clustering. As there is no space to present a detailed analysis of the load profile segments produced by all the other clustering techniques, we will focus only on the Irish data. To make the comparison easier, principally the 5-cluster solution will be presented in every case.
6.4.1 K-means clustering
K-means clustering follows heuristic principles, like Hierarchical Clustering, but the result of K-means clustering is not deterministic. The hardest part of K-means clustering is finding a stable cluster structure: a change in the initial cluster centres can lead to a different final classification. To solve this problem, we used a meta-clustering technique developed by Ariosz LTD.
Technical Details
Metaclustering K-Means: To remedy a well-known drawback of the K-Means clustering method (unstable results with different initial cluster centres), Ariosz has developed and offers a simple and automated extension to the procedure. The steps are as follows:
1. Sort the records in your dataset randomly
2. Run the standard K-Means procedure with a predefined number of clusters (‘n’)
3. Save the cluster centres to an external data file, also recording the number of cases belonging to each cluster
4. Repeat Steps 1 to 3 several times (e.g. one hundred times)
5. Merge all the saved files containing the cluster centres into a common dataset
6. Run a clustering process on the merged dataset (weighted by the size of the groups)
7. Save the cluster centres and accept them as the final solution
The same variable space was used in K-means clustering as in the case of Hierarchical Clustering: the 432-dimension Nested approach. As the lower levels of the cluster structure do not originate from a higher-level structure, we cannot draw an evolutionary map of the clusters. The SSE value suggests a 6- or 7-cluster solution in the case of Ireland. It is interesting that the 5-cluster solution has a worse SSE value than the 4-cluster solution.
Figure 40. Sum of Squares Error (SSE) of the K-means clustering - Ireland
Examining the explained variance at the different levels of K-means clustering can also help to determine the best number of clusters to use.
Table 7. 9- to 4-cluster K-means clustering – average explained variance in the three time dimensions
Clusters   9       8       7       6       5       4
Daily      32.5%   31.3%   29.2%   26.7%   25.1%   24.7%
Weekly     1.3%    2.2%    2.3%    2.7%    0.6%    1.6%
Yearly     8.4%    8.3%    9.7%    10.4%   5.0%    5.2%
A quick examination of the explained variance suggests the use of a 6- or 7-cluster solution. In the case of five clusters, the yearly explained variance drops from 10 percent to 5 percent. As the difference between the 6- and 7-cluster models is only moderate in terms of explained variance, we would suggest using the 6-cluster solution, as it is more parsimonious. In the 6-cluster solution the smallest cluster contains only 2.4 percent of the sample, which is under the suggested 5 percent rule. On the other hand, nearly one third of the sample belongs to the biggest segment. Although there is a correlation between the K-means and the Hierarchical Clustering classifications, it is not easy to recognize the same load profile characters.
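The two checks used above, the average explained variance per time dimension and the 5 percent rule for cluster size, can be illustrated with a short Python sketch. It assumes that "explained variance" means the between-cluster share of the total sum of squares (eta squared) of each share variable, averaged over the variables of one time dimension; the report does not spell out the formula, so this definition and the function names are assumptions.

```python
def eta_squared(values, labels):
    """Between-cluster sum of squares divided by the total sum of squares."""
    n = len(values)
    grand = sum(values) / n
    total_ss = sum((v - grand) ** 2 for v in values)
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    between_ss = sum(len(vs) * (sum(vs) / len(vs) - grand) ** 2
                     for vs in groups.values())
    return between_ss / total_ss if total_ss > 0 else 0.0


def average_explained_variance(rows, labels):
    """Mean eta squared over the columns (e.g. the 24 hourly shares) of `rows`."""
    n_vars = len(rows[0])
    return sum(eta_squared([r[j] for r in rows], labels)
               for j in range(n_vars)) / n_vars


def small_clusters(labels, min_share=0.05):
    """Clusters whose share of the sample falls below the 5 percent rule."""
    n = len(labels)
    counts = {}
    for g in labels:
        counts[g] = counts.get(g, 0) + 1
    return {g: c / n for g, c in counts.items() if c / n < min_share}
```

Calling `average_explained_variance` separately on the daily, weekly and yearly share variables would give one row of a table like Table 7 for a given number of clusters.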
6.4.2 Latent Profile Analysis
In contrast to the hierarchical and K-means segmentation algorithms, which follow heuristic principles, latent profile analysis uses a model-based clustering technique. As shown in the detailed theoretical section, the model-based approach has advantages and disadvantages. The greatest disadvantage compared to the hierarchical and K-means algorithms is that the number of usable variables is strongly limited, due to overfitting problems. To handle this problem, we changed the data structure from the Nested approach to separate time dimensions (see earlier: 5.1.3.2: Separate time dimensions). In the following part of the analysis we use the 24 hourly share variables, the 6 yearly share variables (using bi-months) and the 3 day-of-week share variables.
Technical Details
To fit the latent profile models the R statistical programme was used. We used the mclust package (Fraley et al., 2012). This package fits a finite Gaussian mixture model via an EM algorithm. In the model parameterization we define the volume, the shape and the orientation of the final clusters. These parameters can be either equal or varying across clusters, or they can rely on an identity matrix. We have chosen the EII parameterization, with equal volume and an identity-matrix-based shape and orientation (which is a special case of the equal approach), as a standard approach in cases when the main structures are the focus of interest.
Using the 33 variables presented above, we fitted separate models with different numbers of clusters. We tested the solutions from 1 to 11 clusters. As described earlier, the BIC statistic is a useful index for measuring model fit. The next figure shows the BIC value of the fitted models.
Figure 41. BIC value of the fitted latent profile models (Cluster number 1-11)
The above figure suggests that the 11-cluster solution fits the data best. As also described in detail earlier, this finite mixture modelling technique strongly depends on the sample size. A low number of clusters fits a small sample better, whereas more clusters could better fit a bigger sample. To avoid this problem, we could search for the elbow points of the BIC figure. It is not easy to determine the elbow point from this figure, although the 5-cluster point could be one. As we used the 5-cluster hierarchical solution as a benchmark, we will use this cluster number in the following analysis based on latent profile segmentation as well. This segmentation yields three relatively big clusters and two very small ones: the smallest contains 2 percent of the sample and the other 4 percent. These small cluster sizes are quite problematic, as it is hard to build any statistical models upon them later. In addition to the problems with cluster size, the explained variance is also skewed.
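To make the model comparison described above more concrete, the following Python sketch fits a spherical, equal-volume Gaussian mixture (the structure that the EII parameterization describes) with a plain EM loop and computes a BIC value. It is a simplified stand-in written for illustration, not the mclust code: the function names, the starting means passed in by the caller, the numerical floors and the stopping rule are assumptions. The BIC is written in the form 2·logL − p·ln(n), where larger values indicate a better fit, and p counts the G − 1 mixing proportions, the G·d means and the single shared variance.

```python
import math


def e_step(data, means, props, var):
    """Responsibilities and log-likelihood of a spherical Gaussian mixture."""
    d = len(data[0])
    resp, loglik = [], 0.0
    for x in data:
        logs = [math.log(pi) - 0.5 * d * math.log(2 * math.pi * var)
                - sum((a - b) ** 2 for a, b in zip(x, mu)) / (2 * var)
                for mu, pi in zip(means, props)]
        top = max(logs)
        norm = top + math.log(sum(math.exp(v - top) for v in logs))  # log-sum-exp
        loglik += norm
        resp.append([math.exp(v - norm) for v in logs])
    return resp, loglik


def fit_eii(data, init_means, n_iter=200, tol=1e-8):
    """EM for a mixture whose components share one variance (covariance var * I)."""
    n, d, G = len(data), len(data[0]), len(init_means)
    means = [list(m) for m in init_means]
    props = [1.0 / G] * G
    grand = [sum(x[j] for x in data) / n for j in range(d)]
    var = max(sum((x[j] - grand[j]) ** 2 for x in data for j in range(d))
              / (n * d), 1e-12)
    prev = -math.inf
    for _ in range(n_iter):
        resp, loglik = e_step(data, means, props, var)
        if loglik - prev < tol:          # stop when the likelihood stops improving
            break
        prev = loglik
        # M-step: update proportions, means and the single shared variance.
        Ng = [max(sum(r[g] for r in resp), 1e-300) for g in range(G)]
        props = [Ng[g] / n for g in range(G)]
        means = [[sum(r[g] * x[j] for r, x in zip(resp, data)) / Ng[g]
                  for j in range(d)] for g in range(G)]
        var = max(sum(r[g] * sum((a - b) ** 2 for a, b in zip(x, means[g]))
                      for r, x in zip(resp, data) for g in range(G)) / (n * d),
                  1e-12)
    _, loglik = e_step(data, means, props, var)
    return means, props, var, loglik


def eii_bic(loglik, G, d, n):
    """BIC = 2 logL - p ln(n); larger is better in this convention."""
    n_params = (G - 1) + G * d + 1
    return 2.0 * loglik - n_params * math.log(n)
```

Fitting the model for each candidate number of clusters and plotting `eii_bic` against that number would reproduce the kind of curve in which an elbow point is sought.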
Table 8. 5-cluster Latent Profile clustering – average explained variance in the three time dimensions
Time dimension   Average explained variance
Daily            8.3%
Weekly           16.5%
Yearly           28.6%
The daily time dimension has the lowest average explained variance and the yearly dimension has the highest. This structure is quite different from what we found in the case of hierarchical clustering. We use the term skewed because the structure does not fit the expected outcome of segmentation. The clear consensus in the electricity market is that the daily profiles are the most important, so the daily time dimension should play the biggest part in the creation of consumer classes. As seen above, that is not the case here: the daily time dimension is underrepresented in terms of effect strength. It is therefore no surprise that the correlation between the latent profile clusters and the hierarchical clusters is weak. Only one segment is fairly similar, the Evening Actives (although this group is smaller in the latent profile solution).
6.4.3 DTW Time Series clustering
There are many ways to use the time series attribute of a data structure in a segmentation process. In this paper we suggest applying the dynamic time warping (DTW) distance. This measure allows a non-linear mapping between two vectors by minimizing the distance between them. As DTW is a distance measure rather than a clustering method in its own right, it can be combined with “traditional” clustering techniques like K-means or Hierarchical Clustering. As it is easier to analyse a deterministic solution than an “unstable” one, we decided to use Hierarchical Clustering. The Ward method was applied in the agglomeration process, as in the previous “simple” hierarchical clustering runs. The same 432-dimension Nested approach was used in this case too. Although it is possible to use the raw time series data as an input (after standardization), it is not suggested. The calculation of DTW is time consuming: it depends heavily on the sample size, and also on the number of variables (time points) used. An aggregated time series (as in the Nested approach) can therefore speed up processing. Even so, processing was much slower than in the case of simple hierarchical clustering.
Technical Details
To fit the DTW time series clustering models the R statistical programme was used. We applied the dtwclust package (Sarda-Espinosa, 2016). This package allows time series clustering to be fitted along with optimised techniques related to the Dynamic Time Warping distance and its corresponding lower bounds. The DTW distance method was used in a Hierarchical Clustering process with the Ward agglomeration method. As the running time can be extremely long when using big databases in DTW clustering, a 1000-case random sample was selected from the Irish data. We ran the DTW classification using this sub-sample.
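The non-linear mapping behind the DTW distance can be illustrated with the textbook dynamic programming recursion. The Python sketch below is not the dtwclust implementation: it has no window constraint or lower bounds, and the absolute difference as local cost and the function names are assumptions chosen for the example.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i points of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a[i-1] matched again
                                     cost[i][j - 1],      # b[j-1] matched again
                                     cost[i - 1][j - 1])  # both series advance
    return cost[n][m]


def dtw_matrix(series):
    """Symmetric matrix of pairwise DTW distances, as input for clustering."""
    k = len(series)
    dist = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            dist[i][j] = dist[j][i] = dtw_distance(series[i], series[j])
    return dist
```

Two profiles with the same shape but a one-hour shift, such as `[0, 0, 1, 2, 1, 0]` and `[0, 1, 2, 1, 0, 0]`, have a DTW distance of zero, whereas their point-by-point absolute difference is 4. The nested loops also show why processing time grows quickly: each pair of households costs on the order of the product of the two series lengths, and the number of pairs grows with the square of the sample size.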
The SSE value suggests a 5-cluster solution in the case of Ireland, although the change in the SSE value between successive solutions is volatile. The change is large from 8 to 7 and from 7 to 6 clusters, quite small from 5 to 4 clusters, and large again from 4 to 3 clusters.
Figure 42. Sum of Squares Error (SSE) of the DTW Time Series clustering - Ireland
As with the other methods, the average explained variance was also calculated in the three time dimensions. The weekly dimension does not play a significant role, but the explained variance was higher on a yearly basis than on a daily basis. Although the yearly shape is quite important in our approach, the daily pattern should have the stronger impact. The DTW clustering result is skewed from that perspective.
Table 9. 9- to 4-cluster DTW Time Series clustering – average explained variance in the three time dimensions
Clusters   9       8       7       6       5       4
Daily      17.1%   16.1%   15.3%   13.2%   9.5%    7.8%
Weekly     3.1%    2.9%    2.9%    2.8%    1.4%    1.0%
Yearly     27.8%   27.5%   24.2%   15.9%   15.2%   14.6%
The cluster size is also problematic, as a segment containing less than 1 percent of the sample appears as cluster number 4. This small cluster would have to be screened out if this method were chosen. As with the other methods, the correlation between the DTW time series clusters and the hierarchical clusters is weak.
6.5 Conclusions of load profile classification
At the end of this chapter we give a brief summary of the lessons learnt from our analysis.
It is worth using an enhanced and integrated concept of the time differences in load profile segmentation. We argue that besides the “daily” shapes (which are undoubtedly the most important), there are other time aspects that might be regarded as inherent parts of load profile segmentation (mainly the yearly shape).
There are plenty of classification methods that could help us in load profile segmentation (Hierarchical Clustering, K-means clustering, Latent Profile analysis, DTW time series approach, etc.). We have to take many aspects into account when choosing between these methods (sample size, assumptions about data distribution, processing time, outliers, non-convexity of the variable space, availability in different software platforms, etc.).
We have to perform several data validation checks concerning anomalies, outliers and extreme data, and have to eliminate irregularities before undertaking the rest of the data analysis.
Despite the many differences between the four countries involved in the case study, the basic structure of their residential electricity consumption is rather similar. Consumption declines during the night until 3-4 o’clock, after which comes a small peak in the morning and a high peak in the evening.
We identified three main types of consumption pattern: those who have a big peak in the afternoon or evening, those who have a small morning peak with an average afternoon/evening peak, and those who have a strong bi-modality in their consumption, with two peaks of rather similar volume.
It is possible to identify similar consumption profiles across countries, but each country may have its own specific consumption segment.
Interpretation is a key aspect of the evaluation of classification. It is crucially important to understand the clusters and the socio-economic factors that give rise to them. In NATCONSUMERS we intend to send customised messages to consumers. We therefore have to be familiar with them and need to understand how their habits or attitudes influence their energy usage. The hierarchical clustering method gives the best opportunity to understand the emergence of segments.
The weekly aspect of electricity consumption does not play a big role in the classification processes. This time dimension might be omitted from future load profile segmentation work.
The hierarchical clustering of load profiles provided an efficient way to segment the electricity profiles of consumers. The other clustering methods (K-means, Latent Profile and DTW Time Series) gave different outcomes in load profile classification. All clustering techniques could adequately classify load profiles. In our case study, however, Hierarchical Clustering performed best in terms of interpretability and usability.
7 Further Steps
As part of the NATCONSUMERS project multiple segmentation models will be created, based both on load profile data (from smart meters) and on attitudinal surveys. Based on user load profiles we identify typical load curves, which can characterise different types of usage. In this report we have presented the basis of this segmentation work. In order to compare households’ energy consumption to benchmarks we will segment the volume of consumption by demographics with various statistical models. The result of the load profile and demographic segmentation will define the form and content of the message.
We are convinced that communication efficiency depends not only on the content of the message (what is communicated), but is also very much influenced by the messaging style and argumentation (how it is communicated). Different segmentations are therefore needed, including load profile classification and segmentation based on demographics, as well as segmentation based on social values and attitudes. As a next step we will conduct a segmentation based on people’s attitudes to energy usage and energy savings, and on the social norms that impact on their behaviour. Based on this segmentation we will also determine the language and the style of the messages to be used with these consumers (the how).
Building upon the results of WP3, which identified the main factors influencing electricity usage, the NATCONSUMERS project designed and commissioned quantitative surveys in four countries (Denmark, Hungary, Italy and UK). These surveys have been designed to collect data about the identified factors and form the basis for our attitudinal classification model. The exact methods of demographic and attitudinal segmentation will be described in D4.2, the next deliverable of the project.
8 References
Banfield, J. D., & Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 803-821.
Benítez, I., Quijano, A., Díez, J. L., & Delgado, I. (2014). Dynamic clustering segmentation applied to load profiles of energy consumption from Spanish customers. International Journal of Electrical Power & Energy Systems, 55, 437-448.
Keogh, E., & Lin, J. (2005). Clustering of time-series subsequences is meaningless: implications for previous and future research. Knowledge and Information Systems, 8(2), 154-177.
Fraley, C., & Raftery, A. E. (1998). How Many Clusters? Which Clustering Method? – Answers via Model-based Cluster Analysis. Computer Journal, 41, 578-588.
Fraley, C., & Raftery, A. E. (2007). Model-based methods of classification: using the mclust software in chemometrics. Journal of Statistical Software, 18(6), 1-13.
Fraley, C., Raftery, A. E., Murphy, T. B., & Scrucca, L. (2012). mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation. Technical Report No. 597, Department of Statistics, University of Washington.
Gioffreda, Giulia, Head of EU Affairs, Opower: Historical consumption data to drive energy efficiency. Presentation held at NATCONSUMERS Workshop on 28th January 2016 in Madrid.
Iglesias, F., & Kastner, W. (2013). Analysis of similarity measures in times series clustering for the discovery of building energy patterns. Energies, 6(2), 579-597.
Li, L., & Prakash, B. A. (2011). Time series clustering: Complex is simpler!. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) (pp. 185-192). http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Li_159.pdf
Liao, T. W. (2005). Clustering of time series data - a survey. Pattern Recognition, 38(11), 1857-1874.
Opower, Opower blog. https://blog.opower.com/2014/10/load-curve-archetypes/
Opower: Unlocking a smarter energy future through customer engagement and insights. British American Business Council, January 2015. https://www.babcne.org/images/PDF/OPower.pdf
Sarda-Espinosa, A. (2016). dtwclust: Time Series Clustering with Dynamic Time Warping Distance. R package version 2.0.0. http://CRAN.R-project.org/package=dtwclust
Tsekouras, G. J., Salis, A. D., Karanasiou, I. S., & Tsaroucha, M. A. (2008). Load time-series classification based on pattern recognition methods. INTECH Open Access Publisher.
9 Appendix
Figure 43. Boxplot of load profile segments, Double Risers segment, three time aspects, United Kingdom, 4-cluster solution, hierarchical clustering
Figure 44. Boxplot of load profile segments, Afternoon Actives segment, three time aspects, United Kingdom, 4-cluster solution, hierarchical clustering
Figure 45. Boxplot of load profile segments, Home Lunchers segment, three time aspects, United Kingdom, 4-cluster solution, hierarchical clustering
Figure 46. Boxplot of load profile segments, Winter Spinners segment, three time aspects, United Kingdom, 4-cluster solution, hierarchical clustering
Figure 47. Boxplot of load profile segments, Double Risers segment, three time aspects, Italy, 5-cluster solution, hierarchical clustering
Figure 48. Boxplot of load profile segments, Evening Actives segment, three time aspects, Italy, 5-cluster solution, hierarchical clustering
Figure 49. Boxplot of load profile segments, Late Actives segment, three time aspects, Italy, 5-cluster solution, hierarchical clustering
Figure 50. Boxplot of load profile segments, Double Risers segment, three time aspects, Hungary, 4-cluster solution, hierarchical clustering
Figure 51. Boxplot of load profile segments, Afternoon Actives segment, three time aspects, Hungary, 4-cluster solution, hierarchical clustering
Figure 52. Boxplot of load profile segments, Home Lunchers segment, three time aspects, Hungary, 4-cluster solution, hierarchical clustering
Figure 53. Figure of load profile segments, Summer Wavers segment, integrated approach of time dimensions, Hungary, 4-cluster solution, hierarchical clustering