Scalable Evolutionary Clustering Algorithm with Self Adaptive Genetic Operators

Elizabeth León, Olfa Nasraoui, and Jonatan Gomez
Abstract— In this paper, we present a scalable evolutionary algorithm for clustering large and dynamic data sets, called Scalable Evolutionary Clustering with Self Adaptive Genetic Operators (Scalable ECSAGO). The proposed evolutionary clustering algorithm adapts its genetic operator rates while the evolution converges to the optimal cluster centers. The sizes of the clusters are estimated using a hybrid analytical optimization procedure. Moreover, a memorization factor is introduced in order to allow the algorithm to keep as much of the previously discovered knowledge about clusters and data summarization as desired. The proposed Scalable ECSAGO algorithm is able to find accurate representations of the clusters in very large data sets of different sizes and dimensionality that might not fit in main memory, while maintaining the desirable properties of robustness to noise and automatic detection of the number of clusters. The algorithm is also useful for tracking evolving cluster structures that change with the passage of time.
Elizabeth León and Jonatan Gómez are with the Department of Computer Engineering, Universidad Nacional de Colombia, Bogotá, Colombia (emails: {eleonguz, jgomezpe}@unal.edu.co). Olfa Nasraoui is with the Department of Computer Science and Computer Engineering, University of Louisville, Louisville (KY), USA (email: [email protected]).

I. INTRODUCTION

Clustering [1], [2] is a descriptive learning technique from data mining and exploratory data analysis that aims at classifying unlabeled data points into different groups or clusters according to some similarity concept, such that members of the same group are as similar as possible, while members of different groups are as dissimilar as possible. Genetic Algorithms (GA) have been used for solving clustering problems (genetic or evolutionary clustering) [3], [4], [5]. However, data sets (in fields such as marketing, network security, and the World Wide Web) have become more difficult to store, analyze, and manage due to their size: the number of records and the dimensionality. In particular, finding clusters in these data sets is a very challenging task because most clustering techniques are computationally expensive in terms of the required time and memory [6], [7], [5]. Scalable clustering techniques are designed to handle this kind of data by finding accurate representations of clusters quickly [8], [9], [10], by reducing the time complexity, and by managing the loading and unloading of the data set into the computer's main memory (the size of a typical computer's main memory is small compared to the number of points in a huge data set).

This paper introduces a scalable model of the ECSAGO algorithm proposed by León et al. in [11]. ECSAGO is a self adaptive genetic clustering algorithm based on the Unsupervised Niche Clustering algorithm (UNC) proposed by
Nasraoui et al. in [12]. Although ECSAGO's time complexity is linear with respect to the size of the data set, it still depends on the population size and the number of generations: each individual uses the full data set to calculate its fitness. Moreover, the space complexity of ECSAGO is linear with respect to the size of the data set, i.e., the full data set must be loaded into the computer's memory in order to use ECSAGO.

The proposed Scalable ECSAGO algorithm is a sequential and incremental clustering algorithm. It receives the data points as they arrive, and the data is loaded into main memory only once. First, ECSAGO is considered as a non-stationary evolutionary optimization algorithm [13], [14], [15]. Second, the fitness and scale update concepts are adapted in order to handle both new and summarized data. The summarized data samples are generated from the previous runs and form a dynamic memory of the data that has been clustered in the past. While this memory must be as accurate as possible in representing past data (in the ideal case, the most accurate memory consists of the entire past data!), it should not become a burden on the clustering process, i.e., it should be as small as possible (in the ideal case, the smallest memory is a null set!); otherwise, the entire goal of achieving scalability would be defeated. Hence, it is essential that this summarization be concise and reliable. In order to achieve these two difficult and often contradictory objectives, we interpret the cluster search performed by UNC as being based on the concept of kernel density estimation. This interpretation provides a strong backing for a kernel-density-based summarization strategy, which is proposed to allow an effective estimation and manipulation of the summarization, and hence to achieve scalable clustering in the presence of large and evolving data sets. Finally, a memorization factor is introduced in order to allow the algorithm to keep as much of the previously discovered knowledge about clusters and data summarization as desired. The proposed scalable algorithm is able to find accurate representations of the clusters in very large data sets of different sizes and dimensionality, while maintaining the desirable properties of robustness to noise and automatic detection of the number of clusters.

This paper is divided into five sections. Section 2 gives an overview of the underlying genetic clustering algorithms used by the proposed approach. Section 3 presents the proposed scalable genetic clustering algorithm. Section 4 describes a set of experiments carried out on synthetic data sets and analyzes the results obtained by the proposed scalable algorithm. Finally, Section 5 draws some conclusions about the presented work.
II. BACKGROUND

A. Unsupervised Niche Clustering Algorithm (UNC)

UNC is a clustering approach, based on genetic niching, that is robust to noise and is able to determine the number of clusters automatically [12], [16], [17]. UNC locates and maintains dense areas (clusters) in the solution space using an Evolutionary Algorithm (EA) and a niching technique [13], [18]. In UNC, each individual of the population represents a candidate cluster (center and scale). While the center of the cluster is evolved using the EA, its scale or size is updated using an iterative hill-climbing procedure. The updated scale is used to compute the cluster's fitness, hence introducing a Baldwin effect into the evolutionary process. To preserve individuals in the already detected niches, a mating restriction is imposed: only individuals that belong to the same niche mate to produce offspring.

UNC's model is composed of [16]: the evolutionary process (generation of the population that represents candidate clusters), the extraction of the final prototypes (selection of the best candidates in the final population as optimal clusters), and an optional refinement process (improvement of the center and size of these final clusters by applying a local optimization process)¹.

In the evolutionary process, the fitness value f_i for the i-th candidate center location c_i is defined as follows:

    f_i = \frac{\sum_{j=1}^{N} w_{ij}}{\sigma_i^2},    (1)

where \sigma_i^2 is the scale measure (or dispersion) of the i-th candidate center. This scale is updated (for the entire population) after each generation:

    \sigma_i^2 = \frac{\sum_{j=1}^{N} w_{ij} d_{ij}^2}{\sum_{j=1}^{N} w_{ij}},    (2)

    w_{ij} = \exp\left(-\frac{d_{ij}^2}{2\sigma_i^2}\right).    (3)

Here w_{ij} is a robust compatibility weight that measures how typical data point x_j is in the i-th cluster (w_{ij} is computed using the value of \sigma_i^2 calculated in the previous generation), d_{ij}^2 is the squared distance from data point x_j to cluster center c_i, and N is the number of data points. To further reduce the effect of outliers, the weights w_{ij} are binarized, i.e., mapped to 1 if their value exceeds a minimum weight threshold (typically 0.3), or to 0 otherwise.

¹ The Maximal Density Estimator (MDE) [19], a robust estimation process, is used by UNC in this final refinement process.
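To make Eqs. (1)-(3) concrete, the following is a minimal sketch (not the authors' implementation) of the fitness evaluation for a single candidate center; the NumPy usage, the function name, and the choice to feed the binarized weights into both the scale update and the fitness are our assumptions.

import numpy as np

def unc_fitness(center, data, sigma2_prev, w_min=0.3):
    """Fitness and updated scale for one candidate center (one reading of Eqs. (1)-(3))."""
    d2 = np.sum((data - center) ** 2, axis=1)          # squared distances d_ij^2
    w = np.exp(-d2 / (2.0 * sigma2_prev))              # robust weights, Eq. (3), previous sigma^2
    w = np.where(w > w_min, 1.0, 0.0)                  # binarization against outliers
    sigma2 = np.sum(w * d2) / max(np.sum(w), 1e-12)    # scale update, Eq. (2)
    fitness = np.sum(w) / max(sigma2, 1e-12)           # density fitness, Eq. (1)
    return fitness, sigma2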
B. Evolutionary Clustering Algorithm with Self Adaptive Genetic Operators (ECSAGO)

ECSAGO [11], [20] is an evolutionary algorithm based on UNC that has the advantage of reducing the number of parameters required by UNC (thus avoiding the problem of fixing the genetic operator parameter values), and of solving problems where the representation should be different from binary strings. ECSAGO maintains from UNC the final cluster prototype extraction process and the final local refinement process. In the evolutionary process, the fitness function, the automatic hybrid scale updating procedure, the encoding, the genetic operators, the niching technique, the selection mechanism, the mating restriction, and the adaptation of the genetic operator rates are different. ECSAGO uses an algorithm called Hybrid Adaptive Evolutionary Algorithm (HAEA) [21] to change the genetic operator rates while the individuals undergo evolution. Operators such as Linear Crossover (LC), Linear Crossover per Dimension (LCD), and Gaussian and Uniform mutations were proposed in [11] as operators for the real-valued representation.

III. PROPOSED SCALABLE ECSAGO MODEL

Although ECSAGO's time complexity is linear with respect to the size of the data set, the population size, and the number of generations, it needs all data points at the same time to evaluate the fitness of each individual in the population. Therefore, the full data set must be loaded into the computer's main memory, thus limiting the size of the input data set to the memory available on the machine.

In order to achieve scalability, the Scalable ECSAGO algorithm considers portions of the total data, or portions of the data as it arrives when the environment is very dynamic (high amounts of data), such as in the case of data streams. The portions, or chunks, of data are used by the algorithm one at a time, and the generated cluster model is updated when another chunk of data is processed. The model can keep or lose, to various degrees, the knowledge obtained from previous chunks. The size of the chunks is defined according to the memory limitations.

To start, ECSAGO uses the first portion of the data, in the order in which the data arrives, to generate a preliminary solution (final population). Then, when ECSAGO receives another portion of data, the initial population and the new portion of data are enriched with some individuals from the final population obtained from clustering the previous portion of data, while the remaining initial individuals are randomly selected from the new chunk of data. In this way, information that is accumulated from the previous data is reused. The results of this model must be as similar as possible to those of the original evolutionary approach that processes the entire data in a single chunk. The proposed scalable evolutionary clustering model is depicted in Figure 1.
Fig. 1. Scalable ECSAGO model for evolutionary clustering.
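The overall flow of Fig. 1 can be illustrated with the following sketch; run_ecsago, weak_extract, and random_individuals are hypothetical callables standing in for the components described in this section, not functions of the original implementation.

def scalable_ecsago(chunks, pop_size, run_ecsago, weak_extract, random_individuals):
    """Process the data chunk by chunk, carrying summarized prototypes forward."""
    summary = []                                   # extracted prototypes (center, sigma, W)
    for chunk in chunks:                           # only one chunk resides in memory at a time
        data = list(chunk) + list(summary)         # new raw data plus summarized records
        seeds = list(summary)                      # extracted individuals seed the population
        seeds += random_individuals(chunk, pop_size - len(seeds))
        final_pop = run_ecsago(seeds, data)        # evolve centers, update scales
        summary = weak_extract(final_pop)          # best individual per niche, f_min = 0
    return summary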
The scalable ECSAGO algorithm can be considered as a hybrid approach that strives to maintain the diversity of the population while also exploiting a memory-based strategy. It maintains diversity by restarting the GA and replacing the remaining population with new individuals selected randomly from the new data when a change in the environment occurs; it can also be considered a memory-based approach, since it preserves the best individuals (according to the extraction process) in the population at the moment of change in the environment. Moreover, ECSAGO can be considered to evolve the best locations for kernel density estimation [22] (the prototype centers) as well as the optimal scale/smoothing parameter for each kernel, which furthermore, and unlike classical kernel density estimation, adjusts to the different locations within the data space (i.e., this scale is not constrained to be fixed and identical for all the kernels). Kernel density estimation is a powerful mathematical technique for estimating densities in arbitrary data sets; its goal is to provide an estimate of the density of the data at a given point in the space of the data set [22]. In this way, ECSAGO performs an accurate cluster estimation, even in noisy data sets that contain an unknown number of clusters.

A. Summarization Strategy

The most difficult challenge in mining large, evolving data sets is being able to quickly compute an estimate of the past data that is as accurate and as complete as possible, i.e., a faithful snapshot or summarization of the past data. In classical kernel density estimation, the kernel scale is fixed a priori and is generally very small, decreasing with the number of data points, which is why a large number of kernels are needed to accurately portray the input data. Because ECSAGO allows the scale to be higher, but only as large as necessary to fit the natural dispersion of the clusters, it can approximate an entire cluster with fewer kernels, and in the optimal case, with a single kernel. Hence, the kernels that are evolved can be considered a sparse summary of the classical kernels computed by kernel density estimation. This is exactly why we propose to use the prototypes that are evolved in the final population at the end of each batch of data as a summary of the previous data chunks. The complete final population is not kept, due to the redundancy of its individuals. Instead, individuals are extracted using the final extraction phase of the ECSAGO algorithm. However, we apply a weak extraction, i.e., we only consider the same-niche property (only the best individual in each niche is extracted) and do not use a minimum fitness threshold (f_min = 0). This is done to avoid the premature loss of promising solutions that may have relatively low fitness because of an "unfortunate" order of arrival of the data in their cluster (i.e., only a small number of data points from such a cluster arrives in each portion of the data). The extraction ensures that only the best individual in each detected cluster (or niche) is kept, and hence prevents redundancy.
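A possible reading of this weak extraction is sketched below, assuming that an individual belongs to an already-kept niche when its center falls within K·σ of a fitter, already-kept representative (the exact niche criterion is ECSAGO's and is not reproduced here).

import numpy as np

def weak_extract(population, K):
    """population: list of (center, sigma, fitness); keep only the best individual per niche."""
    kept = []
    for center, sigma, fitness in sorted(population, key=lambda p: p[2], reverse=True):
        # assumption: same niche <=> center within K*sigma of a fitter kept representative
        if not any(np.linalg.norm(center - c) < K * s for c, s, _ in kept):
            kept.append((center, sigma, fitness))   # no minimum-fitness cut (f_min = 0)
    return kept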
Once a set of clusters is detected in the first chunk of data, these clusters become part of the initial population for the next chunk. If no information besides the cluster centers is conserved, and if the new data chunk does not contain data samples from the already detected clusters, these clusters risk being lost. Therefore, the extracted individuals must maintain more information about the clusters that they represent, which will be used in the next step (when mining the next portion of data). The information to be maintained is the scale σ (niche radius) and the cumulative robust weight W. The weight W represents the number of data points belonging to the cluster represented by the individual (the cardinality of the cluster) according to the weight threshold (the threshold that allows the binarization of the weights). The extracted individuals will be part of the initial population (as new individuals), and will also be added to the new portion of data (as summarized data). An individual that was discovered in the previous chunk of data is maintained in the new population in order to give it a chance to improve and adapt according to the new portion of data. It is added to the input data in order to summarize the cluster that was already detected in the previous stage. This in turn enables the computation of the appropriate robust weights for the data points that this summarized point represents.

B. Population Initialization and Summarization

New individuals are initialized from the new chunk of input data using the same random selection process as the original ECSAGO. However, an individual that is selected from the previous final population is initialized using its summarized information (center and σ) that resulted from the previous stage (i.e., after clustering the previous portion of N_old data points). This individual will later evolve its center and update its scale in accordance with the new chunk of data and the "summarized data". The summarized data that is added to the new portion of data maintains the previously learned knowledge in the form of data attribute values, together with its scale σ and its effective weight W. We therefore distinguish between two kinds of data points, summarized data and raw data, that need to be represented in the same way: both kinds of data points are represented by their data attributes (prototype attributes), in addition to σ and W. A summarized data record x*_j is represented by a prototype's attributes, its learned scale σ_j, and its learned cumulative weight W_j, while a raw data record x_j is represented by its data attribute values, an initial scale σ_j = 0, and an initial cumulative weight W_j = 1.

C. Calculating Weights

When the evolution restarts with a new chunk of data, each individual i in the population is evolved, and therefore its scale σ_i must be updated according to the weights of the new data and of the summarized data points. Calculating the weight of a raw data point relative to an individual is done in the same way as in the original ECSAGO and UNC. Since the weights are a prerequisite to computing the fitness, we need to define how the compatibility weights are computed for summarized data.
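Before defining those weights, the two record kinds introduced in Section III-B can be represented, for illustration only, as follows (field and function names are ours, not the paper's):

from dataclasses import dataclass
import numpy as np

@dataclass
class Record:
    x: np.ndarray        # data (or prototype) attribute values
    sigma: float = 0.0   # scale: 0 for raw data, learned sigma_j for summarized data
    W: float = 1.0       # cumulative weight: 1 for raw data, cluster cardinality for summaries

def raw_record(point):
    return Record(x=np.asarray(point, dtype=float))

def summarized_record(center, sigma, W):
    return Record(x=np.asarray(center, dtype=float), sigma=sigma, W=W)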
Fig. 2. Approximation for calculating the compatibility weight of past summarized data with respect to the current individual.
For summarized data, the weight can be considered as a local density kernel that approximates the distribution of all the "past" data points represented by the summarized point, instead of being only 0 or 1. At first intuition, one could set this weight to the regular weight W, using equation (3) computed based on the distance between the summarized data record and the individual evolved within the current population. However, not all the "real" points from the previous batch that are represented by the summarized data point are at the same distance from this individual. Hence, it is possible that the weight with respect to the individual is less than this cumulative weight W. Therefore, we need to devise an approximation that allows calculating the appropriate weight of a summarized data point with respect to an individual.

A simple approximation is to consider the problem as a univariate problem (that is, a function only of the distance from the individual). Since the weights are calculated using an exponential kernel model, the approximation is defined based on an exponential representation (see Fig. 2). The cumulative weight of the summarized data with respect to an individual is defined as the proportion of raw data points from the previous batch that the summarized data represents and that fall within the influence of the current population individual. Thus, the estimated proportion of past summarized data that is covered by the current individual is defined as the ratio of the intersection between the area under the weight/kernel function of the summarized data and the area under the weight/kernel function of the current individual, to the area under the weight/kernel function of the summarized data spanning the interval defined by its radius. The radius of a summarized point and of an individual is Kσ, where K is a factor derived from the minimum weight threshold T_w by solving T_w = e^{-(K\sigma)^2 / (2\sigma^2)}, which results in

    K = \sqrt{-2 \ln T_w}.    (4)
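As a quick numeric check of Eq. (4), using the typical threshold T_w = 0.3 mentioned in Section II-A:

import math

def radius_factor(T_w):
    """Eq. (4): the kernel radius is K * sigma."""
    return math.sqrt(-2.0 * math.log(T_w))

print(radius_factor(0.3))   # ~1.55, so each kernel's radius is roughly 1.55 * sigma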
Consider a past summarized data point j and a current individual i, with radii r_j = Kσ_j and r_i = Kσ_i respectively, separated by a distance d_ij. The proportion of summarized data j affecting current individual i is determined by the intersection between them and by the cumulative weight W_j of the summarized data, where the intersection region is defined, as shown in Fig. 2, by

    I = \begin{cases} [L, R] & \text{if } L \le R \\ \emptyset & \text{otherwise} \end{cases}    (5)

with

    [L, R] = [\max\{-r_j, (d_{ij} - r_i)\}, \min\{r_j, (d_{ij} + r_i)\}].
For simplicity, consider the summarized data j to be centered at the origin and the individual i to be centered at position d_ij. The proportion of the summarized data covered by individual i can then be calculated as the ratio of the area under the kernel of the summarized data falling into the region I to the area under the kernel of summarized data j spanning [-r_j, r_j]. In general, the area under the kernel function of data j over any interval [a, b] of distance values x is calculated by the following integral²:

    \int_a^b \exp\left(-\frac{x^2}{2\sigma_j^2}\right) dx.    (6)

Therefore, the proportion of summarized data points is calculated as follows:

    w^*_{ij} = W_j \cdot \frac{\int_L^R \exp\left(-\frac{x^2}{2\sigma_j^2}\right) dx}{\int_{-r_j}^{r_j} \exp\left(-\frac{x^2}{2\sigma_j^2}\right) dx}.    (7)

² It should be calculated using any numeric integration method.
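Footnote 2 leaves the integration method open; since the kernel of Eq. (6) is Gaussian-shaped, a sketch can also use the closed form via the error function:

import math

def kernel_area(a, b, sigma):
    """Area under exp(-x^2 / (2 sigma^2)) over [a, b], i.e., the integral of Eq. (6)."""
    scale = sigma * math.sqrt(math.pi / 2.0)
    to_z = lambda t: t / (sigma * math.sqrt(2.0))
    return scale * (math.erf(to_z(b)) - math.erf(to_z(a)))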
Thus, the equation for calculating the weight of a data point j (including the summarized data) with respect to an individual i is

    w^*_{ij} = \begin{cases}
        W_j \cdot \dfrac{\int_L^R \exp\left(-\frac{x^2}{2\sigma_j^2}\right) dx}{\int_{-r_j}^{r_j} \exp\left(-\frac{x^2}{2\sigma_j^2}\right) dx} & \text{if } j \text{ is summarized} \\
        1 & \text{if } j \text{ is not summarized and } w_{ij} > T_w \\
        0 & \text{otherwise,}
    \end{cases}    (8)
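Putting Eqs. (4), (5), and (8) together, a sketch of the weight computation might look as follows (the area helper from the previous sketch is repeated so the example stands on its own; here sigma_j = 0 marks a raw data point):

import math

def kernel_area(a, b, sigma):
    scale = sigma * math.sqrt(math.pi / 2.0)
    to_z = lambda t: t / (sigma * math.sqrt(2.0))
    return scale * (math.erf(to_z(b)) - math.erf(to_z(a)))

def compatibility_weight(d_ij, sigma_j, W_j, sigma_i, T_w=0.3):
    """w*_ij of Eq. (8); sigma_j > 0 marks a summarized record, sigma_j == 0 a raw one."""
    if sigma_j > 0.0:                                  # summarized data point
        K = math.sqrt(-2.0 * math.log(T_w))            # Eq. (4)
        r_i, r_j = K * sigma_i, K * sigma_j
        L = max(-r_j, d_ij - r_i)                      # intersection interval, Eq. (5)
        R = min(r_j, d_ij + r_i)
        if L > R:                                      # empty intersection
            return 0.0
        return W_j * kernel_area(L, R, sigma_j) / kernel_area(-r_j, r_j, sigma_j)
    w = math.exp(-d_ij ** 2 / (2.0 * sigma_i ** 2))    # raw data point, Eq. (3)
    return 1.0 if w > T_w else 0.0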
where W_j is the cumulative weight of the summarized data maintained from the previous learning phase, w_{ij} = \exp\left(-\frac{d_{ij}^2}{2\sigma_i^2}\right), and T_w is the weight threshold used for binarization.

D. Updating the Scale

Since the summarized data point j represents the center of a previously detected cluster, and its data points are not all located exactly at this center (they are rather distributed inside the radius r_j, and their actual distance values are not maintained due to the memory restrictions), d_ij is not a sufficiently accurate measure of the distance from an individual to a summarized data point. Only the data points belonging to j that fall inside the intersection area should be considered. Therefore, the distance can be smaller than the distance to the center, or it can be (on average) equal to σ_j when the summarized data and the individual coincide. Hence, the distance from i to j is defined as follows:
    d^{*2}_{ij} = \begin{cases}
        \left(\dfrac{2 r_i - |d_{ij} - r_i| + r_j}{2}\right)^2 & \text{if } j \text{ is summarized and } d_{ij} \neq 0 \\
        \sigma_j^2 & \text{if } j \text{ is summarized and } d_{ij} = 0 \\
        d_{ij}^2 & \text{otherwise.}
    \end{cases}    (9)
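A sketch of Eq. (9), with the cases paired according to the surrounding discussion (σ_j^2 when the individual coincides with the summarized point); function and parameter names are ours:

def effective_sq_distance(d_ij, sigma_j, r_i, r_j):
    """d*_ij^2 of Eq. (9); sigma_j > 0 marks a summarized data point."""
    if sigma_j > 0.0:                      # j is a summarized data point
        if d_ij == 0.0:                    # individual and summary coincide
            return sigma_j ** 2
        return ((2.0 * r_i - abs(d_ij - r_i) + r_j) / 2.0) ** 2
    return d_ij ** 2                       # raw data point: ordinary squared distance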
The scale σ of the i-th individual is then calculated by extending the scale update equation (2) in a natural way over the N_new records of the new chunk together with the summarized data, as follows:

    \sigma_{new}^2 = \frac{\sum_{j=1}^{N_{new}} w^*_{ij} \, d^{*2}_{ij}}{\sum_{j=1}^{N_{new}} w^*_{ij}}.
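Finally, a sketch of this extended scale update, assuming the natural weighted-mean extension of Eq. (2) over the N_new raw and summarized records (the helper name is ours):

def updated_scale_sq(weights, sq_distances):
    """sigma_new^2 as the w*-weighted mean of d*^2 over new plus summarized records."""
    den = sum(weights)
    if den == 0.0:
        return 0.0
    return sum(w * d2 for w, d2 in zip(weights, sq_distances)) / den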