Data Density Based Clustering
Richard Hyde, Plamen Angelov
Data Science Group, School of Computing and Communications, Lancaster University, Lancaster, LA1 4WA, UK
[email protected] [email protected]
Abstract—A new, data density based approach to clustering is presented which automatically determines the number of clusters. By using RDE for each data sample the number of calculations is significantly reduced in offline mode and, further, the method is suitable for online use. The clusters allow a different diameter per feature/dimension, creating hyper-ellipsoid clusters which are axis-orthogonal. This results in greater differentiation between clusters where the clusters are highly asymmetrical. We illustrate this with 3 standard data sets, 1 artificial dataset and a large real dataset to demonstrate comparable results to Subtractive, Hierarchical, K-means, ELM and DBScan clustering techniques. Unlike subtractive clustering, we do not iteratively calculate the potential, P. Unlike hierarchical clustering, we do not need O(N^2) distances to be calculated or a cut-off threshold to be defined. Unlike k-means, we do not need to predefine the number of clusters. Using the RDE equations to calculate the densities, the algorithm is efficient and requires no iteration to approach the optimal result. We compare the proposed algorithm to k-means, subtractive, hierarchical, ELM and DBScan clustering with respect to several criteria. The results demonstrate the validity of the proposed approach.
Keywords—clustering; incremental clustering; evolving clustering; big data
I. INTRODUCTION
Clustering algorithms have long been considered useful methods of extracting information from large datasets, especially those with high dimensionality that are hard to visualize. As we enter the world of 'big data', the speed, efficiency, accuracy and autonomy of the methods become ever more important. As the size of data sets grows, in both the number of samples and the number of dimensions, small differences in algorithm efficiency become more significant. By minimizing the number of calculations required to calculate the density, compared with the potential calculation in subtractive clustering, Data Density based Clustering (DDC) is more efficient, especially for big data. In comparison with traditional approaches such as k-means and hierarchical clustering, DDC is autonomous and does not require the pre-definition of 'k' or of thresholds.
This work was supported by the Natural Environment Research Council (Grant number NE/J006262/1).
II. RELATED WORK
A. K-means Clustering
The well-known k-means clustering method seeks to cluster data points by assigning each sample to the nearest mean, updating the cluster means and re-assigning until the mean values no longer change or there are no re-allocations. The number of clusters, 'k', needs to be pre-defined and fixed. 'k' random samples are chosen as the initial cluster means and the data assignment starts. The method is susceptible to producing differing results according to the initial samples chosen and to converging to sub-optimal results. As a result the technique is usually repeated a number of times and the best, or most common, result chosen. The dependence on the Euclidean distance as a measure also results in hyper-spherical clusters, which may not suit natural data. Selection of a suitable value for 'k' requires some expert prior knowledge of the expected results.
B. Subtractive Clustering
A more recent method, called subtractive clustering [1-3], treats each data sample as a potential cluster center. It measures the likelihood of each sample being a cluster center based on the inverse exponential sum of the distances to the other samples, called the potential. The point with the highest value is considered to be the first cluster center. Input parameters for the cluster radii are required and all samples within these radii are included in the cluster and removed from the data set. The process is repeated on the remaining data samples until no points are left un-clustered. The requirement to re-calculate the distances can result in slow operation, particularly with large datasets. The advantage over the k-means approach is that the number of clusters is not required beforehand; however, the expected size of the clusters is required [1-3].
C. Hierarchical Clustering
Hierarchical clustering builds a dendrogram based on the distance between each pair of samples. Although different methods of measuring the distance can be utilized, the overall method is the same. In the agglomerative version each sample starts as a potential cluster. The nearest samples are then joined based on the distance between them. After the dendrogram has been created, a threshold has to be chosen to find the number of clusters or, alternatively, a number of
clusters can be chosen. In many cases of natural data there are no clear divisions between clusters, as the spread of the data results in samples lying between groupings. Even where there are clear divisions, the divisions may be small relative to the cluster size and may not provide natural boundaries. If the number of expected clusters is not known then a suitable cut-off height can be found by repeated iteration until an acceptable number of clusters is found.
D. Evolving Local Means
Evolving Local Means (ELM) [4] is an online density based approach to clustering. It is based on the observation that the mean of any cluster has the highest density and that the local densities always reduce with distance from the mean. By recursively updating the local mean and variance, ELM is very time and memory efficient, making it very useful not only for streaming data but also for large offline datasets, processing data samples sequentially. ELM requires a user input of an initial radius on which the cluster sizes are based.
E. DBScan
DBScan [5] is another density based method which aims to create clusters based on the density around data samples. Two user inputs are required: the minimum number of samples within a radius for a sample to be considered dense, and the radius itself. Although often presented as creating large clusters it is, in fact, a cluster joining technique. The two inputs effectively define a series of small clusters of specified size and density where every sample is a potential cluster center. DBScan then joins these clusters if they overlap. By defining the required minimum density, DBScan limits its ability to find clusters of varying density. Further, calculating the distances to all other data samples to check how many are within the radius is time consuming. In practice, for all the datasets used in this paper the only successful approach was to define the number of clusters required.
III. THE PROPOSED ALGORITHM
The Data Density based Clustering method demonstrated here has advantages over these existing methods, namely:
• No prior knowledge of the number of clusters is required.
• The initial radius parameter is used only initially, and is updated later based on the real data distribution.
• The use of the RDE [6] equations to calculate the density results in dramatically fewer calculations, particularly for larger datasets.
• In smaller datasets the time advantage may be reduced by carrying out the density calculations twice per cluster when adjusting the radii. However, the efficiency gain becomes significant with larger datasets.
A. Overview
The newly proposed Data Density based Clustering (DDC) algorithm uses the density of the real data sample
distribution to automatically identify clusters. This approach is influenced by the mountain clustering method [1], whereby clusters are identified by the mountain function, which is a Gaussian function of the distances between the data samples. This method was further improved with subtractive clustering [2, 3], where the samples themselves are used as candidate cluster centers rather than points on a grid, removing the need for the grid itself. In this paper we use Recursive Density Estimation (RDE) and a much simpler Cauchy function which allows recursive calculation. After the creation of the first cluster, the use of RDE makes updating the densities very computationally efficient.
B. Offline DDC
In a similar approach to subtractive clustering, in the proposed method we identify the cluster centers based on the density of the samples in the region. The highest density point is chosen as the first cluster center. All samples within the given initial radii are included in the cluster. These points are then removed from the un-clustered data and the densities re-calculated. Unless the point of maximum density lies well within a data group, this method is likely to find new cluster centers on the edge of a local group. At this stage a significant part of the cluster may include low density spaces between groups or overlap a nearby cluster. To overcome this we move the cluster center to the local densest point and re-assign the data samples to the cluster. Any new samples that are now within the cluster radii are added. This moves the cluster nearer to the center of the data group. The cluster radii are then adjusted to match the data according to equation (4) below. An alpha value of α = 1 keeps the initial radii, while a value of α = 0 exactly matches the size of the assigned data. Any previously clustered samples that are no longer within the radii are returned to the un-clustered data.
In the offline case, because we have all the data, we can adjust the cluster details to better match the data spread. By examining the data assigned to a cluster we can remove outliers and adjust the radii to match the size of the cluster in each data dimension. In traditional subtractive clustering the densities would be re-calculated using the distances between each point and all other points. In the DDC approach each remaining sample requires only a single calculation to update its density, thus greatly reducing the computation compared with subtractive clustering. The difference in computational load becomes more apparent for big data sizes and high variable dimensions.
C. Incremental DDC
Due to the speed and relative (to subtractive clustering) simplicity of the calculation, the algorithm is suitable for application in an incremental mode. This has been tested and is discussed here, although it requires further work to make it suitable for large data sets.
Data is fed to the algorithm one sample at a time. The global variables are updated at this stage. These are copied to local variables for use during the clustering. Next we use a simplified version of the DDC offline algorithm to assign the
samples within the cluster radii. In this implementation we do not move the clusters to the local mean, or adjust the radii to match the cluster data.
1) Drawbacks
In this basic implementation there are a number of drawbacks:
1. All the data is stored, so although the method is relatively quick it is not very memory efficient.
2. With each new sample a full clustering routine is re-calculated, making it computationally inefficient.
3. As the cluster adjustment routines of the full offline DDC method have not been implemented online, there is a drop in cluster purity and a rise in the number of clusters.
4. This version is slow for high dimensional, large data.
Options for further improvement are discussed later.
2) Advantages
The key advantages of this method are:
1. High accuracy and cluster purity (see results).
2. Computational efficiency makes it faster than repeated subtractive clustering.
3. Order independence, due to the re-clustering carried out with each new sample.
It should be noted that this is a simple implementation and is not intended to be suitable for all situations at this stage. However, it provides a good starting point for future work.
D. Equations used for DDC
The equations used for DDC are well established and the proofs can be found in the cited texts. To aid understanding of the DDC algorithms we present them here using the following notation:
N – the number of samples being considered; this is all the remaining samples for a global calculation, or the samples assigned to a cluster for a local calculation
x_i – sample 'i'
D_i – the density of sample x_i
r_0 – the initial radii
r_j – the radii of cluster 'j'
α – a learning parameter
μ_j – the mean of cluster 'j'
Data mean, calculated recursively [6]. This is the mean of all the data points being considered, described as 'global' for all remaining data or 'local' for the data within a cluster:

\mu_N = \frac{N-1}{N}\mu_{N-1} + \frac{1}{N}x_N, \quad \mu_1 = x_1    (1)

Scalar product, calculated recursively [6]. This is used to calculate the sample density recursively and is likewise described as 'global' or 'local':

X_N = \frac{N-1}{N}X_{N-1} + \frac{1}{N}\|x_N\|^2, \quad X_1 = \|x_1\|^2    (2)

Density [6]. The sample density is calculated recursively using the results of equations (1) and (2). It is also described as 'global' for all remaining data or 'local' for the data within a cluster:

D_i = \frac{1}{1 + \|x_i - \mu\|^2 + X - \|\mu\|^2}    (3)

Radii learning equation, used to update the cluster radii according to the data spread and applied in each data dimension d. A value of α = 1 retains the user-supplied initial radii, while α = 0 matches the spread of the assigned data:

r_{j,d}^2 = \alpha\, r_{0,d}^2 + (1-\alpha)\,\frac{1}{N_j}\sum_{i=1}^{N_j}(x_{i,d} - \mu_{j,d})^2    (4)
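To make the recursive use of equations (1)–(3) concrete, the following minimal Python sketch (our own illustration, not part of the original paper; all names are ours) updates the mean, scalar product and density one sample at a time, so that each new sample costs only a constant amount of work.

import numpy as np

def rde_update(mu, X, k, x):
    """One recursive RDE step for the k-th sample (k is 1-based).

    mu : running mean of the samples seen so far, equation (1)
    X  : running mean of the squared norms (scalar product), equation (2)
    x  : the new data sample (1-D NumPy array)
    Returns the updated (mu, X) and the density of x from equation (3).
    """
    if k == 1:
        mu, X = x.astype(float).copy(), float(x @ x)
    else:
        mu = ((k - 1) / k) * mu + x / k
        X = ((k - 1) / k) * X + float(x @ x) / k
    # Cauchy-type density: 1 / (1 + ||x - mu||^2 + X - ||mu||^2)
    d = 1.0 / (1.0 + float((x - mu) @ (x - mu)) + X - float(mu @ mu))
    return mu, X, d

# Example: stream a small 2-D sample set through the recursive update.
data = np.array([[0.10, 0.20], [0.15, 0.22], [0.90, 0.80], [0.12, 0.19]])
mu, X = None, None
for k, x in enumerate(data, start=1):
    mu, X, d = rde_update(mu, X, k, x)
    print(f"sample {k}: density {d:.3f}")

In the offline algorithm below, the same three quantities are maintained globally over the remaining data and locally within each candidate cluster.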
E. The Offline Algorithm
The DDC algorithm works on the full data set, i.e. offline. The whole data set is loaded in step 1. Steps 2 and 3 are used to find the data sample closest to the point of maximum density. By using the RDE equations the calculation is computationally efficient, as each data sample requires only a single calculation. Step 4 uses the user supplied radii in each data dimension to assign any samples within that region to the cluster. For the reasons discussed previously we then calculate the local density, again using RDE, to find the local densest point in step 5. This is assigned as the final cluster center. By calculating the local densest point we move the cluster center to the densest point of the cluster samples, providing a more accurate result. After moving the cluster center we assign the samples within the user supplied radii to create the cluster in step 6. Steps 7 and 8 refine the cluster definition. We examine the data within the cluster and first remove any outliers. After this we adjust the cluster radii to match the data within the cluster. This results in a final cluster definition, in terms of center and radii, that closely matches the clustered data. The clustered data is removed from the global dataset and the global mean, global scalar product and remaining sample densities updated in steps 9 and 10. We check whether enough samples remain to form a further cluster in step 11 and repeat the process from step 3 if necessary.
1. Load data
2. Calculate the global mean, global scalar product and sample densities using equations (1), (2), (3)
3. Find the sample with the highest density and assign it as a cluster center
4. Find all points within the cluster radii and assign them to the cluster
5. Calculate the local mean and move the cluster center to this point using equations (1), (2), (3) applied to the local cluster data only
6. Find all points within the cluster radii of the new cluster center
7. Check for outliers and return them to the global dataset
8. Adjust the cluster radii to match the included points using equation (4)
9. Update the global mean and global scalar product using equations (1), (2)
10. Calculate new sample densities using equation (3)
11. If >2 samples remain un-clustered return to step 3
12. Assign the last sample to a cluster
F. The Incremental Algorithm
The incremental algorithm simply repeats the DDC algorithm with each new data sample, carrying out a complete iteration of the offline DDC algorithm each time. It is not intended to be fast, or to replace a genuine online algorithm, but rather to make use of the speed of DDC to update previous clustering results without the delay associated with subtractive clustering. The algorithm reads the new data sample in step 1 and carries out the DDC procedure in steps 2 to 13. Step 14 loops for as long as the data stream keeps providing new samples.
1. Read the first data sample and assign it as the first cluster center
2. Calculate the global mean, global scalar product and sample densities using equations (1), (2), (3)
3. Read the next data sample
4. Calculate the global mean, global scalar product and sample densities using equations (1), (2), (3)
5. Assign the global densest point to be the first cluster center
6. Copy the global values to the local values
7. Find all points within the cluster radii and assign them to the cluster
8. Update the global mean and global scalar product using equations (1), (2)
9. Calculate new densities using equation (3)
10. Assign the local densest point as the cluster center
11. Assign samples within the radii to the cluster
12. Remove the samples from the un-clustered data
13. If >2 samples remain go to step 8
14. While the data stream continues go to step 3
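For illustration only, the sketch below is our own compact Python rendering of the offline procedure (steps 1–12); it is not the authors' code. It computes the densities in batch form rather than fully recursively, omits the outlier check of step 7, tests membership against an axis-orthogonal hyper-ellipsoid, and leaves any final one or two samples unassigned rather than attaching them as in step 12; all names and defaults are assumptions.

import numpy as np

def _density(X):
    # Batch form of equations (1)-(3) over the rows of X.
    mu = X.mean(axis=0)
    sp = (X ** 2).sum(axis=1).mean()                 # mean squared norm, eq. (2)
    return 1.0 / (1.0 + ((X - mu) ** 2).sum(axis=1) + sp - mu @ mu)

def _inside(X, centre, radii):
    # Axis-orthogonal hyper-ellipsoid membership test.
    return (((X - centre) / radii) ** 2).sum(axis=1) <= 1.0

def ddc_offline(data, r0, alpha=0.5):
    """Illustrative offline DDC sketch: returns (centre, radii, member indices) per cluster."""
    remaining = np.arange(len(data))
    clusters = []
    while len(remaining) > 2:                        # step 11: stop when <= 2 samples remain
        X = data[remaining]
        centre = X[np.argmax(_density(X))]           # steps 2-3: global densest point
        members = _inside(X, centre, r0)             # step 4: provisional membership
        local = X[members]
        centre = local[np.argmax(_density(local))]   # step 5: move to the local densest point
        members = _inside(X, centre, r0)             # step 6: re-assign from the new centre
        # Step 8: pull the radii towards the local spread in each dimension, eq. (4).
        spread = ((X[members] - centre) ** 2).mean(axis=0)
        radii = np.sqrt(alpha * r0 ** 2 + (1.0 - alpha) * spread)
        members = _inside(X, centre, radii)
        clusters.append((centre, radii, remaining[members]))
        remaining = remaining[~members]              # steps 9-10: drop the clustered samples
    return clusters

# Example: two well separated 2-D groups, initial radii of 0.5 per dimension.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0.0, 0.1, (100, 2)), rng.normal(1.0, 0.1, (100, 2))])
print(len(ddc_offline(pts, r0=np.array([0.5, 0.5]))), "clusters found")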
IV. DATASETS USED
A number of experiments were carried out across a range of datasets.
1. Three standard datasets are used, with both continuous and discrete data:
a. Iris [7]. This dataset has low dimensionality and a small number of samples.
b. Wine [8]. A higher dimensional dataset with discrete data.
c. Agaricus-lepiota (mushroom) [9]. A high dimensional dataset with a medium number of samples and discrete values.
2. An artificial dataset of 5 clusters, each containing 5,600 samples, was created. The clusters have no overlap but are allowed to 'touch'. This results in some noise between the clusters, making differentiation between some of them more difficult than others. The dataset is shown in Figure 1; an illustrative generation procedure is sketched below.
3. The household power usage dataset [10], also shown in Figure 1, was used to test speed and accuracy on both large and higher dimensional datasets.
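The artificial Gaussian dataset in item 2 can be reproduced approximately with the short Python sketch below; the cluster centres, spread and random seed are our own assumptions, chosen only so that neighbouring clusters touch without merging, and are not taken from the original experiment.

import numpy as np

rng = np.random.default_rng(0)
# Five 2-D Gaussian clusters of 5,600 samples each (28,000 samples in total).
# Centres and standard deviation are illustrative; adjacent clusters are close
# enough for their tails to 'touch', producing some noise between groups.
centres = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0], [3.0, 3.0], [1.5, 1.5]])
samples_per_cluster, std = 5600, 0.5
gaussian_data = np.vstack([rng.normal(c, std, size=(samples_per_cluster, 2))
                           for c in centres])
labels = np.repeat(np.arange(len(centres)), samples_per_cluster)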
Figure 1. Datasets: Gaussian distributions of 28,000 samples (5,600 samples per cluster) and Household Power, a random selection from the 2,075,258 samples.

Table 1. DDC clustering results and comparisons with alternative techniques (DDC, Subtractive, Hierarchical, K-means, ELM and DBScan), reporting average purity, accuracy, number of clusters and run time (ms) for each dataset: Iris (150 samples, 4 dimensions, 3 classes), Wine (178 samples, 13 dimensions, 3 classes), Mushroom (8,124 samples, 23 dimensions, 3 classes), Gaussian (28,000 samples, 2 dimensions, 5 classes) and Household Power (2,075,258 samples, 3 dimensions and 8 dimensions).
It is used in two sets of tests:
a. Using the household consumption measurements only. This has 3 dimensions and can easily be represented on a plot. The meters represent 3 'rooms' in the house: kitchen, utility and heating/air-conditioning.
b. Using the household measurements and the global power data for a total of 7 dimensions. Primarily this is used to measure the ability of DDC to work with higher dimensions.
The data represents household power use in 3 household areas: kitchen, laundry/utility room and heating/air conditioning. It would be expected that most households would have similar usage in these areas, with some having more or less efficient equipment. This should produce clusters centred around 0 where equipment is off, around 0.5 where typical efficient equipment is on, and up to 1 for less efficient equipment. The additional 4 data dimensions relate to distribution grid readings and are included to increase the clustering dimensionality.
V. EXPERIMENTAL RESULTS
A. Offline DDC
Here we apply the techniques to various datasets available from the UCI Machine Learning Repository [11]: Iris [7], agaricus-lepiota (mushroom) [9] and wine [8], Household Power Usage [10], and the artificial Gaussian data. For comparison we use the MathWorks MATLAB routines for Subtractive [12], k-means [13] and hierarchical [14] clustering on the same datasets with a range of suitable parameters. Individual implementations of ELM and DBScan were used but may not be optimized. Each of the techniques was run with a range of their respective input parameters and the results recorded.
Average cluster purity is a commonly used measure of the quality of a clustering technique. However it has drawbacks, especially
when dealing with techniques that produce clusters of varying sizes and membership. Low purity, high membership clusters can be disguised by a number of high purity, low membership clusters, and vice versa. To avoid this an additional measure was used: cluster accuracy. To obtain this figure we calculate the total number of samples that are correctly assigned as a percentage of the total number of available samples. Both figures are given where available. For the household power data there is no known classification, so no figure is available; however, we show plots of the results to illustrate that they are meaningful.
Where techniques are extremely slow they are designated as 'Failed'. Extremely slow was taken to be greater than 10 times the longest successful method with no significant progress. Hierarchical clustering in particular failed at relatively small data sizes and was excluded from the larger data experiments. K-means clustering failed to provide any results on the household power data due to errors creating empty clusters when asked to find more than 3. As the data can clearly be seen to contain more than 3 clusters this is considered a failure. The results are shown in Table 1.
B. Incremental DDC
Some results for the incremental DDC are shown in Table 2. In each case the data was presented in a randomized order. The results show a high level of purity for a reasonable number of clusters. It should be noted that, at the end of its run, the incremental method effectively performs an offline DDC analysis of the final dataset, so the results should be similar. However, during the run a similar level of cluster purity is seen, illustrating the potential of the incremental method; see Figure 2, which shows the cluster number and purity measured every 10 samples. The cluster purity drops when the number of clusters reduces and some samples are mis-assigned.
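As an illustration of the two measures described above, the following Python sketch (our own; 'correctly assigned' is interpreted as belonging to the majority class of a sample's cluster, and the purity average is taken unweighted over clusters) computes them from a cluster assignment and the known class labels.

import numpy as np

def purity_and_accuracy(cluster_ids, class_labels):
    """Both arguments are 1-D arrays of non-negative integers, one entry per sample."""
    cluster_ids = np.asarray(cluster_ids)
    class_labels = np.asarray(class_labels)
    purities, correctly_assigned = [], 0
    for c in np.unique(cluster_ids):
        members = class_labels[cluster_ids == c]
        dominant = np.bincount(members).max()          # size of the majority class in this cluster
        purities.append(dominant / len(members))
        correctly_assigned += dominant
    average_purity = float(np.mean(purities))          # unweighted mean over clusters
    accuracy = correctly_assigned / len(class_labels)  # correctly assigned samples / all samples
    return average_purity, accuracy

Under this formulation a few large, low purity clusters lower the accuracy even when many small, pure clusters keep the average purity high, which is the masking effect noted above.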
Table 2. Incremental DDC results.
Dataset | Radii | Purity % | Clusters | Runtime (s)
Iris (150 samples, 4 dimensions, 3 classes) | 0.2 | 95.37 | 21 | 0.58
Iris | 0.3 | 95.13 | 11 | 0.35
Iris | 0.4 | 96.83 | 7 | 0.23
Iris | 0.5 | 91.80 | 4 | 0.16
Iris | 0.6 | 57.50 | 3 | 0.13
Wine (178 samples, 13 dimensions, 3 classes) | 0.5 | 99.40 | 53 | 1.22
Wine | 0.6 | 96.83 | 10 | 0.83
Wine | 0.7 | 96.16 | 16 | 0.54
Wine | 0.9 | 94.33 | 8 | 0.30
Wine | 1.0 | 92.11 | 6 | 0.20
Mushroom (8124 samples, 23 dimensions, 3 classes) | 0.9 | 98.91 | 66 | 347.00
Mushroom | 1.0 | 98.98 | 53 | 272.00
Mushroom | 1.1 | 98.56 | 43 | 226.45
Mushroom | 1.5 | 96.00 | 21 | 116.44
Mushroom | 1.6 | 98.56 | 19 | 226.45

Figure 2. Cluster number and purity during an incremental run on randomized Iris data.
However the purity then recovers as the cluster centers are improved with more data and more samples are correctly assigned. The runtimes shown in Table 2 are the total time for all the data to pass through the analysis. Even on the larger mushroom dataset this averages 47 ms per sample.
VI. DISCUSSION OF RESULTS
As with all clustering techniques there is no single best answer. There are two targets, a reasonable number of clusters and a high cluster purity, and there is often a trade-off between the two. With a lower number of clusters the size of each cluster must be larger and is therefore more likely to overlap into the next class, reducing the purity. With a higher number of clusters the results become less valuable, as considerable subsequent work is still required to interpret them. The cluster accuracy measure is arguably more indicative of the usefulness of the clustering techniques: a high value indicates that the results achieved are more likely to place any data sample in the most appropriate cluster.
When considering the results it is worth noting that neither Iris nor Wine has necessarily been classified on the basis of the available data. Wine in particular is an arbitrary classification of wine quality based on subjective evaluation. As there are no pre-defined classifications for the Household Power data, the plot of the results in Figure 3 is used to show that the clusters produced are meaningful and clear.
A. Comparison of run times
For all these comparisons it should be remembered that the DDC, ELM and DBScan software scripts are not optimized, whereas the native MATLAB scripts for the alternative methods have been highly optimized. Some future efficiency gains should, therefore, be expected for the DDC method. DDC has an overhead when moving the cluster centers and repeating the clustering assignment. On smaller datasets such as Iris and Wine this overhead is greater than the time saved
Figure 3. Plot of DDC clustering of household power data. Clusters are shown in different colours and are clearly separate and centred around expected data locations.
using recursive techniques, resulting in longer run times. However, we see a great improvement on datasets that are still relatively small, as shown by the mushroom dataset, and only k-means comes close when increasing the data size to that of the Gaussian set. With much larger datasets, such as the household power dataset, all the alternatives except ELM run into memory and/or operational speed issues.
B. Comparison of cluster results
With two notable exceptions it is seen that DDC provides the most accurate clustering results. The two exceptions are the wine data, where k-means is better, and the Gaussian 28,000 sample data, where ELM manages a perfect score. The wine data has been classified subjectively and not based on the provided data. Although this should affect the k-means method equally, the reliance of DDC on a Gaussian-type distribution when re-clustering at the second stage may be significant, as is the use of equal radii along each data dimension in these experiments. It should also be noted that DBScan failed to produce meaningful results on this dataset, suggesting that the data itself is not well suited to density based clustering.
The mushroom dataset also proved difficult for DDC in terms of reducing the number of clusters found. With a reduction in cluster numbers the purity and accuracy dropped dramatically. This is likely due to the discrete nature of the data, meaning that there is no distribution around a central node to form the kind of clusters that DDC is intended to find.
The results for the household power data are shown in Figure 3. The plot shows a random 0.05% of each cluster to reduce plotting time. The clusters are shown in different colors but, due to the number of clusters, some appear similar. Clusters are centered around the expected locations, as discussed previously, suggesting that these are meaningful results.
C. Comparison of prior knowledge required
DDC requires only an initial estimate of the cluster radii, in a similar manner to subtractive clustering. However, the computational efficiency of DDC allows the cluster center to be moved and the cluster radii to be adjusted to match the data. The k-means and hierarchical clustering techniques both require some prior knowledge of the number of clusters required. During repeated experiments on similar sets of data this information can be tuned based on manual examination of the results; however, where relationships in the data are unknown, the ability of DDC to discover these relationships is a clear advantage. ELM requires some knowledge of the expected cluster size, requiring an initial radius estimate to be entered. Although this is updated during the analysis according to the data presented, it is order dependent. DBScan requires deep knowledge of the expected data clusters to define whether each data sample is a small cluster that can be joined to others nearby, or whether it is an edge cluster with lower density.
VII. CONCLUSIONS
DDC as a clustering technique is slightly slower than some techniques on small data sets; however, it is not intended to excel in this area. It also fails to produce very good results where data is discrete rather than continuous, although it is still comparable with many techniques. As datasets grow larger, both the computational efficiency and the low memory requirements due to the use of recursive density estimation become more significant. DDC excels for big data and higher dimensions in comparison to the other techniques. DDC requires no prior knowledge of the number of clusters. Additionally, it is less sensitive to the initial radii user input, as the final cluster radii are adjusted based on the actual cluster data spread.
VIII. FUTURE WORK
A. DDC Offline
Some work has already taken place on using the data density to provide a data based estimate of the initial radii. This will remove all of the user inputs and provide a fully autonomous clustering technique. Initial work has proved successful and it is especially useful on high dimensional data, where tuning radii for each dimension is extremely time consuming. There are many aspects of the DDC algorithm that make it suitable for parallelization. Investigation into parallelizing both the algorithm itself and the division of the data space is planned.
B. Incremental DDC
Further investigation will consider reducing the need to re-cluster in the incremental method. Such approaches may include:
1. checking whether a new sample is near a current cluster center, within some proportion of the cluster radii;
2. exploiting the fact that, with higher numbers of samples, clusters become 'settled'. This should show an initial increase in run time, followed by a reduction as the clusters settle down.
C. Adaptation to Online Clustering
DDC was initiated with a view to using recursive equations. The intention is to adapt the method to online use; the incremental method presented here is not intended to replace that initial goal.
REFERENCES
1. Yager, R. and Filev, D., "Generation of Fuzzy Rules by Mountain Clustering," Journal of Intelligent & Fuzzy Systems, 2(3), pp. 209-219, 1994.
2. Chiu, S., "A cluster estimation method with extension to fuzzy model identification," in Proceedings of the Third IEEE Conference on Fuzzy Systems, IEEE World Congress on Computational Intelligence, 1994.
3. Chiu, S.L., "Fuzzy Model Identification Based on Cluster Estimation," Journal of Intelligent & Fuzzy Systems, 2(3), pp. 267-278, 1994.
4. Baruah, R.D. and Angelov, P., "Evolving Local Means Method for Clustering of Streaming Data," in IEEE World Congress on Computational Intelligence, Brisbane, Australia, pp. 2161-2168, 2012.
5. Ester, M., Kriegel, H.-P., Sander, J. and Xu, X., "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, Oregon, AAAI, 1996.
6. Angelov, P., "Fundamentals of Probability Theory," in Autonomous Learning Systems, John Wiley & Sons, Ltd, pp. 17-36, 2012.
7. Fisher, R.A., Iris Plants Database, UCI Machine Learning Repository.
8. Forina, M. et al., PARVUS, Wine Recognition Data, UCI Machine Learning Repository, 1998.
9. Lincoff, G.H., Mushroom records drawn from The Audubon Society Field Guide to North American Mushrooms, New York, 1981, UCI Machine Learning Repository.
10. Hebrail, G. and Berard, A., Individual Household Electric Power Consumption Data Set, EDF R&D, Clamart, France, UCI Machine Learning Repository, 2012.
11. Bache, K. and Lichman, M., UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, 2013.
12. The MathWorks, Inc., subclust, MATLAB R2014a documentation, 2014. Available from: http://www.mathworks.co.uk/help/fuzzy/subclust.html [accessed 14/05/2014].
13. The MathWorks, Inc., kmeans, MATLAB R2014a documentation, 2014. Available from: http://www.mathworks.co.uk/help/stats/kmeans.html [accessed 21/05/2014].
14. The MathWorks, Inc., Hierarchical Clustering, MATLAB R2014a documentation, 2014. Available from: http://www.mathworks.co.uk/help/stats/hierarchicalclustering.html [accessed 14/05/2014].