GANY: A Genetic Spectral-based Clustering Algorithm for Large Data Analysis

Héctor D. Menéndez
Department of Computer Sciences
Universidad Autónoma de Madrid
Madrid, Spain
Email: [email protected]

David Camacho
Department of Computer Sciences
Universidad Autónoma de Madrid
Madrid, Spain
Email: [email protected]
Abstract—Data analysis is currently one of the fastest growing fields, and the huge amounts of data involved make it a challenging area. The most relevant techniques are mainly divided into two sub-domains: classification and clustering. Although classification is growing and evolving, clustering is one of the most promising techniques for Large Data Analysis, because classification needs human supervision, which makes the analysis more expensive. Clustering is a blind process used to group data by similarity. Currently, the most relevant methods are those based on manifold identification. The main idea behind these techniques is to group data according to the form they define in the space. Several techniques based on spectral analysis deal with this problem. However, these techniques are not suitable for Large Data, since they require a lot of memory to determine the groups. Moreover, they suffer from local minima convergence problems, which are common in statistical methodologies. This work focuses on combining genetic algorithms with spectral-based methodologies to deal with the Large Data Analysis problem. We combine the Nyström method with the spectrum to approximate the problem through an accurate summary of the search space. A genetic algorithm is then used to reduce the local minimum convergence problem in the new search space. The performance of this methodology has been evaluated, using accuracy, on both synthetic and real-world datasets extracted from the literature.
I. INTRODUCTION

Current Data Analysis methods face several challenges. This new area covers several perspectives according to data volume, algorithm velocity and data dimension, among others [1]. Since there are still several methodologies which may or may not be considered Big Data, authors usually focus their work on specific sub-areas of this field. In this work, we focus on Large Data Analysis, i.e., the set of methodologies and tools used to extract patterns from data where the number of instances is higher than in common datasets found in the literature, such as the UCI Machine Learning repository [2].

One of the most challenging areas in Data Analysis is the manifold identification problem. Manifold identification consists of automatically identifying the structure of data in the space. Usually, data are grouped in manifolds following a continuity criterion. In order to identify these manifolds blindly, continuity-based clustering techniques are usually employed to discriminate the manifolds. These methodologies try to separate the data according to the form they define in the space [3]. The most representative technique is Spectral Clustering [4]. Spectral Clustering is useful when the features of the search space are unknown and it is necessary to generate a continuity space of the data instances. This problem is frequent in graph representations, where the graph edges represent the similarities among data and the nodes represent data instances. The algorithm is divided into three main steps:

• A Similarity Graph among the data is generated. There are three main approaches, depending on the information contained in the similarity graph: full (where all the instances are connected), neighbourhood (where only a number of neighbours are connected) and epsilon (where there is a minimum similarity and only those elements whose similarity is bigger are connected).

• The Laplacian or the Spectrum is extracted. Three Laplacian matrices are frequently used: unnormalized (the Laplacian is L = D − W, where W is the Similarity Graph and D is the diagonal matrix whose (i, i)-element is the sum of the i-th row of W), symmetrically normalized (the Laplacian is Lsym = D^{−1/2} L D^{−1/2}) and the normalized version inspired by Random Walks (the Laplacian is Lrw = D^{−1} L).

• The eigenvectors of the Laplacian are extracted and used to generate a projective space where a clustering algorithm is applied to group the data instances. This provides the cluster assignment.
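The three-step pipeline described above can be sketched in a few lines. The following is a minimal illustration (not the paper's own implementation): it uses a full similarity graph with a Gaussian kernel, the unnormalized Laplacian L = D − W, and a naive farthest-point-initialized k-means in the projected space; the kernel width and iteration count are illustrative assumptions.

```python
import numpy as np

def spectral_clustering(X, k, sigma=1.0, n_iter=50):
    """Minimal spectral clustering sketch: full Gaussian similarity
    graph, unnormalized Laplacian, eigenvector embedding, naive k-means."""
    # Step 1: full similarity graph with a Gaussian kernel.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Step 2: unnormalized Laplacian L = D - W.
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Step 3: rows of the k eigenvectors with smallest eigenvalues
    # form the projective space; cluster them with a tiny k-means.
    vals, vecs = np.linalg.eigh(L)           # ascending eigenvalues
    U = vecs[:, :k]
    # Farthest-point initialization keeps the sketch deterministic.
    centers = [U[0]]
    for _ in range(1, k):
        d = ((U[:, None, :] - np.array(centers)[None]) ** 2).sum(-1).min(axis=1)
        centers.append(U[int(np.argmax(d))])
    centers = np.array(centers)
    labels = np.zeros(len(U), dtype=int)
    for _ in range(n_iter):
        labels = np.argmin(((U[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = U[labels == j].mean(axis=0)
    return labels
```

On two well-separated groups the leading eigenvectors are close to cluster indicator vectors, so the k-means step in the projected space recovers the groups directly.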
978-1-4799-7492-4/15/$31.00 © 2015 IEEE
Spectral Clustering has several problems regarding its robustness [3] and memory consumption [5]. The latter is especially important for Large Data Analysis, since the algorithm consumes a lot of memory during the search process. Some methodologies, such as the Nyström extension [6], try to reduce the memory usage; however, these techniques make the algorithm more sensitive to noisy data. This work tries to reduce this effect by introducing a genetic search in the projective space. Such techniques have proved to obtain more accurate solutions than classical techniques while using less memory [7]. This can be applied to different fields such as image segmentation [8], community finding [9], Social Network analysis [10] and behaviour extraction [11], among others.
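The idea of a genetic search over cluster assignments in a projective space can be illustrated with a generic sketch. This is not the GANY algorithm itself; the fitness function (negative within-cluster variance), population size, mutation rate and generation count below are all illustrative assumptions.

```python
import numpy as np

def ga_cluster_search(U, k, pop=30, gens=60, mut=0.05, seed=0):
    """Generic GA sketch over cluster labelings of the rows of a
    projected space U; fitness favours tight clusters."""
    rng = np.random.default_rng(seed)
    n = len(U)

    def fitness(labels):
        # Negative total within-cluster variance (higher is better);
        # empty clusters are heavily penalized.
        s = 0.0
        for j in range(k):
            pts = U[labels == j]
            if len(pts) == 0:
                return -np.inf
            s -= ((pts - pts.mean(axis=0)) ** 2).sum()
        return s

    # Random initial population of label vectors.
    P = rng.integers(0, k, size=(pop, n))
    for _ in range(gens):
        f = np.array([fitness(ind) for ind in P])
        P = P[np.argsort(f)[::-1]]          # sort by fitness, best first
        elite = P[: pop // 2]               # elitist selection
        children = []
        while len(children) < pop - len(elite):
            a, b = elite[rng.integers(0, len(elite), 2)]
            cut = rng.integers(1, n)        # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            m = rng.random(n) < mut         # per-gene mutation
            child[m] = rng.integers(0, k, m.sum())
            children.append(child)
        P = np.vstack([elite] + children)
    f = np.array([fitness(ind) for ind in P])
    return P[int(np.argmax(f))]
```

Because the elite individuals are carried over unchanged, the best labeling found so far is never lost, which mitigates the local-minimum convergence problem mentioned above.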
Other techniques which are gaining importance in Large Data Analysis are the online-clustering algorithms [12]. These analytical techniques are focused on data streams. Data streams generate data continually, and the idea behind the online-clustering algorithms is to analyse these data in real time [12]. These techniques usually need to deal with large quantities of data, which imposes several limitations on the algorithm representation. The most relevant limitations of these systems are:

• The data order matters and cannot be modified.

• The data cannot be stored or re-analysed during the process.

• The results of the analysis depend on the moment the algorithm is stopped.

The main problem of these algorithms is that they need a specific space to update the information. One of the main tools used for online-clustering analysis is the Massive On-line Analysis (MOA) tool [13]. This framework provides the following online-clustering algorithms:

• Online K-means [12]: This online algorithm updates the centroid position when a new instance arrives. Only one centroid is updated per iteration. It is similar to the classical K-means algorithm.

• ClusTree [14]: This online algorithm iteratively updates the information of the clusters. It is able to consider the speed of the data stream, generating the concept of the age of an object. It also maintains stream summaries.

• CluStream [15]: This algorithm combines offline clustering and online-clustering in order to provide partial clustering solutions which measure the evolution of the clusters.

All of these algorithms have been designed to identify more properties of the data stream than the final clusters at a specific moment. In this work, we have compared against these algorithms to evaluate the performance of the new method on synthetic and real-world datasets.

This work is structured as follows: Section II introduces the Nyström extension, Section III presents the GANY algorithm, which is evaluated in Section IV with synthetic and real-world datasets. Finally, the paper presents the conclusions and future work.

II. NYSTRÖM EXTENSION

The main application of the Nyström extension is to reduce the memory usage during the eigenvector extraction process of a given matrix. Given a positive semi-definite matrix W, this method can approximate the main eigenvectors using a subsample of the original matrix. The goal of this method is to reduce the dimensions by sampling the Similarity Matrix. We choose a subset of points S = {s_1, ..., s_n} ⊂ X = {x_1, ..., x_N}.

First, the method divides the matrix as follows (see Figure 1):

W = [ W1    W2
      W2^t  W3 ],

where W1 ∈ R^{n×n}, W2 ∈ R^{n×(N−n)} and W3 ∈ R^{(N−n)×(N−n)}. This methodology only uses W1 and W2 to estimate the eigenvectors of W.

The Nyström extension is defined as follows [6]:

Definition II.1 (Nyström Extension). Let k(x_i, x_j) be a kernel whose Gram matrix K is symmetric and positive semi-definite, satisfying K_{i,j} = k(x_i, x_j). We assume that the eigendecomposition is KU = UΛ with U orthogonal. Then, the Nyström extension for a new instance x is the approximation ū_k(x) to the real u_k(x) given by:

ū_k(x) = (1/λ_k) Σ_{j=1}^{N} k(x, x_j) u_{kj}    (1)

The Nyström Extension will only need the augmented matrix formed by W1 and W2: (W1 | W2). The matrix W3 is the part that we want to approximate; it is larger than W1, since n ≪ N.
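In matrix form, applying the extension of Definition II.1 to all unsampled points at once amounts to eigendecomposing the sampled block W1 and extending its eigenvectors through W2. The following is a minimal sketch under that standard formulation (function and variable names are our own, not from the paper):

```python
import numpy as np

def nystrom_eigenvectors(W1, W2, k):
    """Approximate the top-k eigenvectors of the full (N x N) matrix
    W = [[W1, W2], [W2.T, W3]] using only the augmented matrix
    (W1 | W2), i.e. without ever forming the large block W3."""
    # Eigendecompose the small sampled block: W1 = U1 diag(l1) U1^T.
    l1, U1 = np.linalg.eigh(W1)
    order = np.argsort(l1)[::-1][:k]     # keep the top-k eigenpairs
    l1, U1 = l1[order], U1[:, order]
    # Nystrom extension, Eq. (1) over the sampled points:
    # u_bar_k(x) = (1/lambda_k) * sum_j k(x, x_j) * u_kj,
    # evaluated for every unsampled point simultaneously.
    U2 = W2.T @ U1 / l1                  # divide each column by lambda_k
    return np.vstack([U1, U2]), l1
```

The memory saving is the point: only the n×n and n×(N−n) blocks are stored, while the (N−n)×(N−n) block W3 is never materialized. For a smooth kernel, the extended top eigenvector is typically well aligned with the exact one.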