GANY: A Genetic Spectral-based Clustering Algorithm for Large Data Analysis

Héctor D. Menéndez
Department of Computer Sciences
Universidad Autónoma de Madrid
Madrid, Spain
Email: [email protected]

David Camacho
Department of Computer Sciences
Universidad Autónoma de Madrid
Madrid, Spain
Email: [email protected]
Abstract—Data analysis is currently one of the fastest growing fields, and the huge amounts of data involved make it a challenging area. The most relevant techniques are mainly divided into two sub-domains: classification and clustering. Although classification is growing and evolving, clustering is one of the most promising techniques for Large Data Analysis, because classification needs human supervision, which makes the analysis more expensive. Clustering is a blind process used to group data by similarity. Currently, the most relevant methods are those based on manifold identification. The main idea behind these techniques is to group data according to the form they define in the space. Several techniques based on spectral analysis deal with this problem. However, these techniques are not suitable for Large Data, since they require a lot of memory to determine the groups. Moreover, they suffer from local minima convergence problems, which are common in statistical methodologies. This work focuses on combining genetic algorithms with spectral-based methodologies to deal with the Large Data Analysis problem. We combine the Nyström method with the spectrum to approximate the problem through an accurate summary of the search space. A genetic algorithm is then used to reduce the local minimum convergence problem in the new search space. The performance of this methodology has been evaluated, using accuracy, on both synthetic and real-world datasets extracted from the literature.
I. INTRODUCTION

Current Data Analysis methods face several challenges. This new area covers several perspectives according to data volume, algorithm velocity and data dimension, among others [1]. Since there are still several methodologies which may or may not be considered Big Data, authors usually focus their work on specific sub-areas of this field. In this work, we focus on Large Data Analysis, i.e., the set of methodologies and tools used to extract patterns from data where the number of instances is higher than in common datasets found in the literature, such as the UCI Machine Learning repository [2].

One of the most challenging areas in Data Analysis is the manifold identification problem. Manifold identification consists of automatically identifying the structure of data in the space. Usually, data are grouped in manifolds following a continuity criterion. In order to identify these manifolds blindly, continuity-based clustering techniques are usually employed to discriminate the manifolds. These methodologies try to separate the data according to the form they define in the space [3]. The most representative technique is Spectral Clustering [4]. Spectral Clustering is useful when the features of the search space are unknown and it is necessary to generate a continuity space of the data instances. This problem is frequent in graph representations, where the graph edges represent the similarities among data and the nodes represent data instances. The algorithm is divided into three main steps:

• A Similarity Graph among the data is generated. There are three main approaches, depending on the information contained in the similarity graph: full (where all the instances are connected), neighbourhood (where only a number of neighbours are connected) and epsilon (where there is a minimum similarity and only those elements whose similarity is bigger are connected).

• The Laplacian or the Spectrum is extracted. Three Laplacian matrices are frequently used: unnormalized (the Laplacian is L = D − W, where W is the Similarity Graph and D is the diagonal matrix whose (i, i)-element is the sum of the i-th row of W), symmetrically normalized (the Laplacian is Lsym = D^{−1/2} L D^{−1/2}) and the normalized version inspired by Random Walks (the Laplacian is Lrw = D^{−1} L).

• The eigenvectors of the Laplacian are extracted and used to generate a projective space where a clustering algorithm is applied to group the data instances. This provides the cluster assignment.
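The three-step pipeline described above can be sketched in a few lines. The following is a minimal illustration (not the paper's own implementation): it uses a full similarity graph with a Gaussian kernel, the unnormalized Laplacian L = D − W, and a naive farthest-point-initialized k-means in the projected space; the kernel width and iteration count are illustrative assumptions.

```python
import numpy as np

def spectral_clustering(X, k, sigma=1.0, n_iter=50):
    """Minimal spectral clustering sketch: full Gaussian similarity
    graph, unnormalized Laplacian, eigenvector embedding, naive k-means."""
    # Step 1: full similarity graph with a Gaussian kernel.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Step 2: unnormalized Laplacian L = D - W.
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Step 3: rows of the k eigenvectors with smallest eigenvalues
    # form the projective space; cluster them with a tiny k-means.
    vals, vecs = np.linalg.eigh(L)           # ascending eigenvalues
    U = vecs[:, :k]
    # Farthest-point initialization keeps the sketch deterministic.
    centers = [U[0]]
    for _ in range(1, k):
        d = ((U[:, None, :] - np.array(centers)[None]) ** 2).sum(-1).min(axis=1)
        centers.append(U[int(np.argmax(d))])
    centers = np.array(centers)
    labels = np.zeros(len(U), dtype=int)
    for _ in range(n_iter):
        labels = np.argmin(((U[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = U[labels == j].mean(axis=0)
    return labels
```

On two well-separated groups the leading eigenvectors are close to cluster indicator vectors, so the k-means step in the projected space recovers the groups directly.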
978-1-4799-7492-4/15/$31.00 © 2015 IEEE
Spectral Clustering has several problems regarding its robustness [3] and memory consumption [5]. The latter is especially important for Large Data Analysis, since the algorithm consumes a lot of memory during the search process. Some methodologies, such as the Nyström extension [6], try to reduce the memory usage; however, these techniques make the algorithm more sensitive to noisy data. This work tries to reduce this effect by introducing a genetic search in the projective space. Such techniques have proved to obtain more accurate solutions than classical techniques while using less memory [7]. This can be applied to different fields such as image segmentation [8], community finding [9], Social Network analysis [10] and behaviour extraction [11], among others.
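The idea of a genetic search over cluster assignments in a projective space can be illustrated with a generic sketch. This is not the GANY algorithm itself; the fitness function (negative within-cluster variance), population size, mutation rate and generation count below are all illustrative assumptions.

```python
import numpy as np

def ga_cluster_search(U, k, pop=30, gens=60, mut=0.05, seed=0):
    """Generic GA sketch over cluster labelings of the rows of a
    projected space U; fitness favours tight clusters."""
    rng = np.random.default_rng(seed)
    n = len(U)

    def fitness(labels):
        # Negative total within-cluster variance (higher is better);
        # empty clusters are heavily penalized.
        s = 0.0
        for j in range(k):
            pts = U[labels == j]
            if len(pts) == 0:
                return -np.inf
            s -= ((pts - pts.mean(axis=0)) ** 2).sum()
        return s

    # Random initial population of label vectors.
    P = rng.integers(0, k, size=(pop, n))
    for _ in range(gens):
        f = np.array([fitness(ind) for ind in P])
        P = P[np.argsort(f)[::-1]]          # sort by fitness, best first
        elite = P[: pop // 2]               # elitist selection
        children = []
        while len(children) < pop - len(elite):
            a, b = elite[rng.integers(0, len(elite), 2)]
            cut = rng.integers(1, n)        # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            m = rng.random(n) < mut         # per-gene mutation
            child[m] = rng.integers(0, k, m.sum())
            children.append(child)
        P = np.vstack([elite] + children)
    f = np.array([fitness(ind) for ind in P])
    return P[int(np.argmax(f))]
```

Because the elite individuals are carried over unchanged, the best labeling found so far is never lost, which mitigates the local-minimum convergence problem mentioned above.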
Other techniques which are gaining importance in Large Data Analysis are the online-clustering algorithms [12]. These analytical techniques are focused on data streams. Data streams generate data continually, and the idea behind the online-clustering algorithms is to analyse these data in real time [12]. These techniques usually need to deal with large quantities of data, which imposes several limitations on the algorithm representation. The most relevant limitations of these systems are:

• The data order matters and cannot be modified.

• The data cannot be stored or re-analysed during the process.

• The results of the analysis depend on the moment the algorithm is stopped.

The main problem of these algorithms is that they need a specific space to update the information. One of the main tools used for online-clustering analysis is the Massive On-line Analysis (MOA) tool [13]. This framework provides the following online-clustering algorithms:

• Online K-means [12]: This online algorithm updates the centroid position when a new instance arrives. Only one centroid is updated per iteration. It is similar to the classical K-means algorithm.

• ClusTree [14]: This online algorithm iteratively updates the information of the clusters. It is able to consider the speed of the data stream, generating the concept of the age of an object. It also maintains stream summaries.

• CluStream [15]: This algorithm combines offline clustering and online-clustering in order to provide partial clustering solutions which measure the evolution of the clusters.

All of these algorithms have been designed to identify more properties of the data stream than the final clusters at a specific moment. In this work, we have compared against these algorithms to evaluate the performance of the new method on synthetic and real-world datasets.

This work is structured as follows: Section II introduces the Nyström extension, Section III presents the GANY algorithm, which is evaluated in Section IV with synthetic and real-world datasets. Finally, the paper presents the conclusions and future work.

II. NYSTRÖM EXTENSION

The main application of the Nyström extension is to reduce the memory usage during the eigenvector extraction process of a given matrix. Given a positive semi-definite matrix W, this method can approximate the main eigenvectors using a subsample of the original matrix. The goal of this method is to reduce the dimensions by sampling the Similarity Matrix. We choose a subset of points S = {s_1, ..., s_n} ⊂ X = {x_1, ..., x_N}.

First, the method divides the matrix as follows (see Figure 1):

W = [ W1    W2
      W2^t  W3 ],

where W1 ∈ R^{n×n}, W2 ∈ R^{n×(N−n)} and W3 ∈ R^{(N−n)×(N−n)}. This methodology only uses W1 and W2 to estimate the eigenvectors of W.

The Nyström extension is defined as follows [6]:

Definition II.1 (Nyström Extension). Let k(x_i, x_j) be a kernel whose Gram matrix K is symmetric and positive semi-definite, satisfying K_{i,j} = k(x_i, x_j). We assume that the eigendecomposition is KU = UΛ with U orthogonal. Then, the Nyström extension for a new instance x is the approximation ū_k(x) to the real u_k(x) given by:

ū_k(x) = (1/λ_k) Σ_{j=1}^{N} k(x, x_j) u_{kj}    (1)

The Nyström Extension will only need the augmented matrix formed by W1 and W2: (W1 | W2). The matrix W3 is the part that we want to approximate; it is larger than W1, since n ≪ N.
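In matrix form, applying the extension of Definition II.1 to all unsampled points at once amounts to eigendecomposing the sampled block W1 and extending its eigenvectors through W2. The following is a minimal sketch under that standard formulation (function and variable names are our own, not from the paper):

```python
import numpy as np

def nystrom_eigenvectors(W1, W2, k):
    """Approximate the top-k eigenvectors of the full (N x N) matrix
    W = [[W1, W2], [W2.T, W3]] using only the augmented matrix
    (W1 | W2), i.e. without ever forming the large block W3."""
    # Eigendecompose the small sampled block: W1 = U1 diag(l1) U1^T.
    l1, U1 = np.linalg.eigh(W1)
    order = np.argsort(l1)[::-1][:k]     # keep the top-k eigenpairs
    l1, U1 = l1[order], U1[:, order]
    # Nystrom extension, Eq. (1) over the sampled points:
    # u_bar_k(x) = (1/lambda_k) * sum_j k(x, x_j) * u_kj,
    # evaluated for every unsampled point simultaneously.
    U2 = W2.T @ U1 / l1                  # divide each column by lambda_k
    return np.vstack([U1, U2]), l1
```

The memory saving is the point: only the n×n and n×(N−n) blocks are stored, while the (N−n)×(N−n) block W3 is never materialized. For a smooth kernel, the extended top eigenvector is typically well aligned with the exact one.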