Scalable Parallel Clustering Approach for Large Data Using Parallel K-Means and Firefly Algorithms

Juby Mathew, Student Member, IEEE
Dept. of MCA, Amal Jyothi College of Engg., Kanjirapally, Kerala, India
[email protected]

R. Vijayakumar
Mahatma Gandhi University, Kottayam, Kerala, India
[email protected]


Abstract: This paper focuses on identifying the limitations of the k-means algorithm and proposes a parallelization of k-means using a firefly-based clustering method. The new parallel architecture can handle a large number of clusters. The firefly algorithm is used to find initial optimal cluster centroids, and the k-means algorithm, starting from these optimized centroids, then refines them and improves the clustering accuracy. The final convergence issue is also addressed and solved to a great extent. Finally, the modified algorithm is compared with parallel k-means in experiments, and it has been found that the performance of the modified algorithm is better than that of the existing algorithm. Four typical benchmark data sets from the UCI machine learning repository are used to demonstrate the results of the techniques. To achieve this we use the fork-join method in Java programming, which is an effective design method for achieving good parallel performance.

Keywords: Clustering, k-means, parallel k-means, Firefly algorithm, fork and join parallelism

I. INTRODUCTION

In the present work, the aim is to build a clustering algorithm that utilizes the maximum capabilities of a regular multi-core PC to cluster a dataset as fast as possible while producing clusters of acceptable quality. One of the most important tasks in data analysis is data clustering, especially when there is no extra information about the data. The aim of clustering is to partition data instances (i.e., observations) into groups in such a way that the members of each group have similar characteristics. It has a variety of scientific and industrial applications, including business intelligence, bioinformatics, web mining, and social network analysis. Starting from the early 2000s, the datasets collected by companies, websites, and academia became extremely massive, and computational resources were overwhelmed by the large amount of data being processed. Traditional data analytic methods usually cannot be applied to big datasets. Consequently, research in the data mining community has recently been directed towards mining of massive datasets for various tasks, such as classification [1], clustering [2], and association rule mining. A promising solution for handling massive datasets is to use the parallel processing power of modern multi-core processors via multi-threading. Most of the available clustering algorithms have two shortcomings when used on big data: (1) a large group of clustering algorithms, e.g. k-means, has to keep the data in memory and iterate over the data many times, which is very costly for big datasets; (2) clustering algorithms that run on limited memory sizes, especially the family of stream-clustering algorithms, do not have a parallel implementation to



utilize modern multi-core processors, and they also lack decent quality in their results. A popular partitional clustering algorithm, k-means, is essentially a function minimization technique in which the objective function is the squared error. However, the main drawback of the k-means algorithm is that it converges to a local minimum from the starting position of the search [3] and is sensitive to the initial cluster centers. In order to overcome local optima problems, many nature-inspired algorithms, such as genetic algorithms [4], ant colony optimization [5], artificial immune systems [6], artificial bee colony [7], and particle swarm optimization [8], have been used. Recently, efficient hybrid evolutionary optimization algorithms that combine evolutionary methods with k-means to overcome local optima problems in clustering have also been used [9-11]. The computing process needs improvements to efficiently apply the method to applications with a huge number of data objects, such as genome data analysis and geographical information systems. Parallelization is one of the obvious solutions to this problem, and many researchers proposed the idea years ago [12][13][14]. Optimization is one of the most challenging problems in the field of operations research. The goal of an optimization problem is to find the set of variables that yields the optimal value of the objective function among all values that satisfy the constraints. The literature survey involved the analysis of many existing optimization algorithms. Natural phenomena are mimicked by some powerful population-based algorithms, such as the firefly algorithm, which are used for solving optimization problems. The field of nature-inspired computing and optimization techniques has evolved to solve difficult optimization problems in diverse fields of engineering, science, and technology. The firefly algorithm is one of several nature-inspired algorithms developed in the recent past and is inspired by the flashing behavior of fireflies. The growing volumes of information emerging from the progress of technology make clustering of very large-scale data a challenging task. In order to deal with this problem, many researchers try to design efficient parallel clustering algorithms. Efficient parallel clustering algorithms and implementation techniques are the key to meeting the scalability and performance requirements entailed in such scientific data analyses. In the partitioning

approach, data points are continuously relocated to the nearest centroid and the shapes of the clusters are fine-tuned. Even though this approach is used in the K-means clustering algorithm, its performance depends on the initial values of the starting centroids, which are generated anew with each execution of the algorithm. It is known that K-means can easily fall into local optima that do not produce the best clustering results. Achieving a globally optimal clustering result requires an exhaustive process in which all partitioning possibilities are tried out, which is computationally prohibitive. Parallel computing has become attractive in the recent past as an efficient technique to improve the efficiency of population-based methods. One can identify many different reasons to parallelize an algorithm: (i) reduction of execution time, (ii) expansion of the problem size, (iii) making a class of problems computationally feasible, and so on. Clustering is a widely used technique for finding interesting patterns residing in a dataset that are not obviously known. The K-means algorithm is the most commonly used partitional clustering algorithm because it can be easily implemented and is the most efficient in terms of execution time. However, due to its sensitivity to the initial partition, it can only generate a locally optimal solution. The firefly technique offers a globalized search methodology but suffers from slow convergence near the optimal solution. In this paper, we present a new hybrid parallel clustering approach, which uses the FA in sequence with the K-means algorithm for data clustering. The proposed approach overcomes the drawbacks of both algorithms, improves clustering, and avoids being trapped in a locally optimal solution.

II. REVIEW OF LITERATURE

Parallelization of clustering techniques is receiving increasing attention due to the ever-increasing data sizes of today. Many parallel implementations of various clustering techniques [2] have been studied in the literature, and the k-means algorithm is one of them. The k-means algorithm was proposed by J.B. MacQueen in 1967 [15], and since then it has gained great interest from data analysts and computer scientists. K-means is a method of data analysis that is easy to implement and apply, even on large data sets. To obtain decent computational speed on huge datasets, most researchers adopt some form of parallelization scheme. Li and Fang [3] are among the pioneer groups studying parallel clustering. They proposed a parallel algorithm on a single instruction multiple data (SIMD) architecture. Modha and Dhillon [8] proposed a distributed k-means that runs on a multiprocessor environment. Couch and Kantabutra [4] proposed a master-slave single program multiple data (SPMD) approach on a network of workstations to parallelize the k-means algorithm. Tian and colleagues [9] proposed a method for initial cluster center selection and the design of a parallel k-means algorithm. Stoffel and Belkoniene proposed a parallel k-means that works on a distributed database; the database was distributed over a network of 32 PCs, and their results revealed that as the number of nodes increases the speedup degrades, because of the increase in communication overhead and the variations in the execution times of the different processors [10].

Prasad [11] parallelized the k-means algorithm on distributed-memory multi-processors using the MPI scheme. Farivar and colleagues [12] studied parallelism using graphic coprocessors to reduce the energy consumption of the main processor. Inderjit S. et al. [13] presented a parallel implementation of the k-means clustering algorithm based on the message passing model. The author of [29] proposed a scalable parallel clustering approach for large data using PFCM to parallelize the computation across machines. Some of the parallelization comes from performing the cluster assignment in parallel [31] or the distance computation in parallel [34]. In the paper by Judd et al. [31], algorithmic enhancements are described that reduce the large computational effort in mean-square-error data clustering. Zhang et al. [35] proposed a parallel CLARANS (Clustering Large Applications based upon RANdomized Search) algorithm using PVM (Parallel Virtual Machine). Swarm Intelligence (SI) belongs to the artificial intelligence (AI) discipline and became increasingly popular over the last decade. SI systems are complex computational systems that mimic the behavior of species such as ants, birds, fish, bees, cuckoos, and frogs. These animals and insects have very limited individual capability, but when they act as a group (swarm), they can perform many complex tasks in order to survive. Swarm intelligence refers to a research field concerned with collective behavior within self-organized and decentralized systems. The term was probably first used by Beni [18] in the sense of cellular robotic systems consisting of simple agents that organize themselves through neighborhood interactions. Recently, methods of swarm intelligence have been used in optimization, the control of robots, and routing and load balancing in new-generation mobile telecommunication networks demanding robustness and flexibility. Examples of notable swarm-intelligence optimization methods are ant colony optimization (ACO) [19], particle swarm optimization (PSO) [20], and artificial bee colony (ABC) [21]. Today, some of the more promising swarm intelligence optimization techniques include the firefly algorithm (FA) [22], cuckoo search [23], and the bat algorithm [24], while new algorithms such as the krill herd bio-inspired optimization algorithm [25] and algorithms for clustering have also emerged recently. FA is one of the recent swarm intelligence methods, developed by Yang [18] in 2008, and is a stochastic, nature-inspired, meta-heuristic algorithm that can be applied to solving the hardest optimization problems (including NP-hard problems). Subotic et al. [26] developed a parallelized FA for unconstrained optimization problems, tested on standard benchmark functions. Both the speed and the quality of the results were compared by the authors and, as a result, the parallelized FA obtained much better results in much less execution time. However, this conclusion was valid only when more than one population was taken into account. Husselmann et al. [27] proposed a modified FA on a parallel graphical processing unit (GPU), where the standard benchmark functions were taken for comparison with the classic firefly algorithm. They revealed that the results of this parallel algorithm were more accurate and faster than those of the original

firefly algorithm, but this was only valid for multi-modal functions. As a matter of fact, the classical FA is well suited to optimizing unimodal functions, as very few fireflies are required and, thus, calculation times are dramatically lower. The k-means objective function (SSE) is not convex, and hence the optimization procedure can get stuck in local optima. This makes k-means very sensitive to the choice of initial cluster centroids. In the original k-means, the initial centroids are chosen randomly. However, various alternatives have been proposed to reduce the sensitivity of k-means to initialization. A simple but expensive solution is to run k-means multiple times with different initial seeds; in this case, the best solution among the different clusterings is chosen [28]. Another well-known algorithm for seed selection is k-means++, which selects the centroids through a biased stochastic selection [29].
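To make the biased stochastic selection concrete, the following is a minimal Java sketch of k-means++-style seeding, in which each subsequent centroid is drawn with probability proportional to its squared distance from the nearest centroid already chosen. The class and method names are illustrative, not taken from the paper.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class KMeansPlusPlusSeeder {

    // Selects k initial centroids from the data using k-means++ biased sampling.
    static List<double[]> seed(double[][] data, int k, Random rnd) {
        List<double[]> centroids = new ArrayList<>();
        centroids.add(data[rnd.nextInt(data.length)]);   // first centroid: uniform choice
        double[] d2 = new double[data.length];           // squared distance to nearest chosen centroid
        while (centroids.size() < k) {
            double total = 0.0;
            for (int i = 0; i < data.length; i++) {
                d2[i] = nearestSquaredDistance(data[i], centroids);
                total += d2[i];
            }
            // Draw the next centroid with probability proportional to d2[i].
            double r = rnd.nextDouble() * total;
            int chosen = data.length - 1;
            double acc = 0.0;
            for (int i = 0; i < data.length; i++) {
                acc += d2[i];
                if (acc >= r) { chosen = i; break; }
            }
            centroids.add(data[chosen]);
        }
        return centroids;
    }

    static double nearestSquaredDistance(double[] x, List<double[]> centroids) {
        double best = Double.MAX_VALUE;
        for (double[] c : centroids) {
            double s = 0.0;
            for (int j = 0; j < x.length; j++) {
                double diff = x[j] - c[j];
                s += diff * diff;
            }
            best = Math.min(best, s);
        }
        return best;
    }
}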

A. Firefly algorithm

The firefly algorithm (FA) is a metaheuristic algorithm inspired by the flashing behavior of fireflies. It is a type of swarm intelligence algorithm based on the reaction of a firefly to the light of other fireflies [30]. The firefly algorithm idealizes some characteristics of firefly behavior:
a. All fireflies are considered unisex, and every firefly is attracted to other fireflies irrespective of sex.
b. Attractiveness is proportional to brightness: for any two flashing fireflies, the less bright one moves towards the brighter one, and if neither is brighter than the other, a firefly moves randomly. Furthermore, both brightness and attractiveness decrease as the distance between fireflies increases.
c. The brightness of a firefly is proportional to the value of its objective function.

1. Light intensity and attractiveness

In the firefly algorithm there are two important components: the variation in light intensity and the formulation of attractiveness. Assume that there exists a swarm of n agents (fireflies), where xi represents a solution for firefly i and f(xi) denotes its fitness value. The brightness Ii of a firefly is selected to reflect the fitness value f(x) at its current position x:

Ii = f(xi), 1 <= i <= n   (1)

Firefly attractiveness is proportional to the light intensity perceived by adjacent fireflies. Each firefly has a distinctive attractiveness β, which determines how strongly it attracts other members of the swarm [30]. However, the attractiveness β is relative: it varies with the distance rij between two fireflies i and j at locations xi and xj respectively, given by

rij = ||xi - xj||   (2)

The degree of attractiveness of a firefly is determined by

β(r) = β0 exp(-γ r^2)   (3)

where β0 is the attractiveness at r = 0 and γ is the light absorption coefficient at the source. An objective function f(x) encodes the brightness of a given firefly; in effect, it represents the light intensity at location x as I(x) = f(x). Yet there are some issues concerning distance, point of view, and the fact that the environment absorbs part of the emitted light. At the source, the brightness is higher than at some distant point, and the brightness also decreases as the environment absorbs the light while it travels. It can therefore be

concluded that the attractiveness β of a firefly is relative. It is known that the light intensity I(r) varies following the inverse square law:

I(r) = I0 / r^2   (4)

where I0 represents the light intensity at the source and r is the radius. The attractiveness of the firefly is dependent on, and directly proportional to, the intensity of light. If the light absorption coefficient γ is added to Equation (4), it transforms to:

I(r) = I0 / (1 + γ r^2)   (5)

The constant 1 is added to the denominator to avoid the singularity of the term at the source (r = 0). Because attractiveness β is proportional to intensity, the following can also be used:

β(r) = β0 / (1 + γ r^2)   (6)

Equation (6) can be further approximated by the Gaussian form [16]: β(r) = β0 exp(-γ r^2). The movement of a firefly i at location xi attracted to another, more attractive (brighter) firefly j at location xj is determined by

xi(t + 1) = xi(t) + β0 exp(-γ rij^2) (xj - xi) + α εi   (7)

where εi is a random perturbation vector. For most implementations, β0 = 1 and α ∈ [0, 1]. The parameter γ characterizes the variation of the attractiveness, and its value is important in determining the speed of convergence and how the FA behaves. In most applications, it typically varies from 0.01 to 100.

III. PROPOSED METHOD

There are two key factors in achieving good K-means clustering results: the ability to find good centroid locations at the start, and the ability to explore global optima beyond the local ones. As with any partitional clustering algorithm, the results are highly sensitive to the initial parameters: the number k of clusters and their centroids. K-means clustering often leads to local optima that may be far from the best results [31]. In practice, to obtain the best clustering results, the K-means algorithm is often applied many times with different random initializations, which can lead to bad clustering results. The proposed method is intended to show a basic construct for integrating a K-means clustering algorithm with bio-inspired optimization algorithms. Whenever a new solution is found to be better than the current one, the searching agents relocate there by replacing the solutions, and continue to search further elsewhere until some stopping criteria are met. In the proposed algorithm, the clustering works in two steps. In the first step, we apply the firefly algorithm to find good centroid locations at the start; the objective function, which must be minimized, is the Euclidean distance. At its beginning stage, the algorithm finds the number of cores in the system, and based on this knowledge the number of clusters is set to the number of cores available. In this approach the cores do not communicate among themselves at all, so the speed increases almost as many times as there are execution cores in the system. The firefly mechanism runs until a predefined number of iterations is reached. In the second step, k-means is initialized with the position of the best firefly, and the k-means clustering recalculates the centers.
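As a concrete illustration of the core-detection step described above, the number of available cores can be queried in Java as follows (a minimal sketch; the variable names are ours):

int cores = Runtime.getRuntime().availableProcessors(); // number of execution cores P
int k = cores;                                          // number of clusters K = P, as in the proposed method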

The proposed hybrid parallel clustering algorithm can be summarized in the pseudo code below. In the proposed algorithm, the global optimum is used in the fireflies' movement. The global optimum is related to the optimization problem, and it can be the firefly that has the maximum or minimum value; it is updated in every iteration of the algorithm. In the proposed algorithm, when a firefly is compared with another firefly, instead of allowing the one firefly to influence and attract its neighbours, the global optimum (the firefly with the maximum or minimum value) in each iteration is allowed to influence the others and affect their movement. In the modified approach, when a firefly is compared with a corresponding firefly, if the corresponding firefly is brighter, the compared firefly will move toward it, guided by the global optimum.

Algorithm 1: Pseudo code for the modified parallel k-means

1. Start.
2. Determine the number of cores P and assign the number of clusters K = P; initialize the population of fireflies, N, and the related parameters.
3. Randomly assign K clusters for each of the N fireflies.
4. For each firefly, select K objects from the S data objects as initial centroids, by taking the mean values of the attributes of the objects within their given clusters.
5. Calculate the fitness of the centroids in each firefly, and find the best solution, represented by the total fitness value of the centroids in a firefly.
6. For each firefly, update its light intensity according to its fitness value (objective function).
7. For each firefly, update its attractiveness, which varies with distance r via exp(-γ r^2).
8. Merge the fireflies by allowing the less bright one to be attracted by the brighter one.
9. Update the centroids in each firefly according to their latest positions.
10. Rank the fireflies and find the current best.
11. Reassign the clusters according to the best solution.
12. Output the best cluster configuration, represented by the firefly that has the greatest fitness.
13. If the exit criteria are met, end.
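The attractiveness and movement updates in steps 6-8 can be sketched in Java as follows, directly following Equations (3) and (7). This is a minimal illustration under our own naming and parameter choices; the paper's actual implementation is not shown.

import java.util.Random;

public class FireflyStep {
    static final double BETA0 = 1.0;   // attractiveness at r = 0, as suggested in the text
    static final double GAMMA = 1.0;   // light absorption coefficient (typical range 0.01 to 100)
    static final double ALPHA = 0.2;   // randomization weight, in [0, 1]

    // Moves firefly i towards a brighter firefly j, per Eq. (7).
    static void move(double[] xi, double[] xj, Random rnd) {
        double r2 = 0.0;
        for (int d = 0; d < xi.length; d++) {
            double diff = xi[d] - xj[d];
            r2 += diff * diff;                       // squared distance rij^2, cf. Eq. (2)
        }
        double beta = BETA0 * Math.exp(-GAMMA * r2); // attractiveness, Eq. (3)
        for (int d = 0; d < xi.length; d++) {
            double eps = rnd.nextDouble() - 0.5;     // random perturbation component
            xi[d] += beta * (xj[d] - xi[d]) + ALPHA * eps;
        }
    }
}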

Algorithm 2: Modified parallel k-means

1. Define the objective function f(X).
2. Generate an initial population of N fireflies.
3. Define the light absorption coefficient γ.
4. Find the number of cores P.
5. Partition the data into P subgroups.
6. Repeat until the maximum generation is reached:
   for i = 1 to N do
7.    for j = 1 to i do
         if (Ij > Ii) then move firefly i towards j in all d dimensions
         end if
8. Receive the cluster members of the k clusters from the P processes.
9. Recalculate the new centroids.
10. If k stable centroids are reached, stop;
11. else go to step 5.
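The per-subgroup work in steps 5-9 maps naturally onto Java's fork/join framework mentioned in the abstract. Below is a minimal sketch, under our own class and field names, of how the nearest-centroid assignment over the data partitions could be expressed with a ForkJoinPool; it illustrates the structure, not the authors' exact code.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class ParallelAssign extends RecursiveAction {
    static final int THRESHOLD = 250_000;  // subgroup size below which work is done sequentially
    final double[][] data;                 // data points
    final double[][] centroids;            // current cluster centers
    final int[] label;                     // output: nearest-centroid index per point
    final int lo, hi;                      // half-open range of points handled by this task

    ParallelAssign(double[][] data, double[][] centroids, int[] label, int lo, int hi) {
        this.data = data; this.centroids = centroids; this.label = label;
        this.lo = lo; this.hi = hi;
    }

    @Override
    protected void compute() {
        if (hi - lo <= THRESHOLD) {
            for (int i = lo; i < hi; i++) label[i] = nearest(data[i]);
        } else {
            int mid = (lo + hi) >>> 1;     // fork two halves and join them
            invokeAll(new ParallelAssign(data, centroids, label, lo, mid),
                      new ParallelAssign(data, centroids, label, mid, hi));
        }
    }

    int nearest(double[] x) {
        int best = 0; double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double s = 0.0;
            for (int d = 0; d < x.length; d++) {
                double diff = x[d] - centroids[c][d];
                s += diff * diff;
            }
            if (s < bestDist) { bestDist = s; best = c; }
        }
        return best;
    }

    // Usage: new ForkJoinPool().invoke(new ParallelAssign(data, centroids, label, 0, data.length));
}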

IV. PERFORMANCE MEASURES

Validity of clusters, effectiveness of classification, and classification error percentage (CEP) are the three parameters used to analyze the performance of the proposed algorithm. In paper [32], Senthilnath applied FA for clustering data objects into groups according to the values of their attributes. In this work, for a given data set, the FA is used to find the cluster centers. The cluster centers are obtained from a randomly selected 75% of the given data set, which we call the training set. The FA algorithm uses this training set to obtain the cluster centers. To study the performance of the FA algorithm, the remaining 25% of the data set (called the test data set) is used. The performance measure used for the FA is the classification error percentage (CEP), defined as the ratio of the number of misclassified samples in the test data set to the total number of samples in the test data set. This can be done because in the test data set we know the actual class of each test sample. The distances between a given test sample and the cluster centers are computed, and the sample is assigned to the cluster center (class) with the minimum distance. Hence, we can compute the performance measure, the classification error percentage (CEP).

A. Classification Error Percentage (CEP)

One of the most important characteristics of a clustering method is its ability to decrease the clustering error. Given a data set, 75% of it is randomly selected to obtain the cluster centers using Algorithm 1; in this way, the cluster centers are obtained for all the classes. The remaining 25% of the data set (the test data set) is used to obtain the classification error percentage (CEP). Each pattern is classified by assigning it to the class whose cluster center is closest to it. The classified output is then compared with the desired output, and if they are not exactly the same, the pattern is counted as misclassified. Let n be the total number of elements in the dataset and m the number of elements misclassified after finding the cluster centers using the above algorithms. Then the classification error percentage is given by [17]

CEP = (m / n) * 100   (8)
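For illustration (with a hypothetical misclassification count, not taken from the paper's experiments): if the test set for the Glass data contains n = 53 patterns and m = 4 of them are misclassified, then CEP = (4 / 53) * 100 ≈ 7.55%.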

B. Classification efficiency

Classification efficiency is obtained using both the training data (75%) and the test data (25%). The classification matrix is used to obtain the statistical measures for the class-level performance (individual efficiency) and the global performance (average and overall efficiency) of the classifier [30]. The efficiency is indicated by the percentage classification, which tells us how many samples belonging to a particular class have been correctly classified. The percentage classification ηi for the class ci is

ηi = qii / Σ(j=1..n) qji   (9)

where qii is the number of correctly classified samples and n is the number of samples for the class ci in the data set.
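As a hypothetical illustration: if a class accounts for 50 samples in the classification matrix (Σ qji = 50) and 45 of them are classified correctly (qii = 45), the individual efficiency of that class is 45 / 50 = 0.90, i.e. 90%.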

C. Cluster validity

Cluster validation is a technique for finding a set of clusters that best fits the natural partitions of a given dataset without the benefit of any a priori class information. A cluster validity index is used to validate the outcome. The evaluation of a clustered output has several facets. One is actually an assessment of the data domain rather than of the clustering algorithm itself: data which do not contain clusters should not be processed by a clustering algorithm at all. Different clustering algorithms achieve different results on certain data sets because most clustering algorithms are sensitive to the input parameters and to the structure of the data sets. The means of evaluating the results of clustering algorithms, cluster validity, is one of the central problems in cluster analysis. Cluster validity measures are methods that can not only compare the results of two different clustering algorithms to determine the better one, but also determine the "correct" number of clusters in a data set.

D. Metrics and Evaluation Indexes

Evaluation indexes and performance metrics in the area of clustering were used to assess the effectiveness of the algorithms. We have evaluated the results according to the following metrics: accuracy, SSW, SSB, the Davies-Bouldin index, the Dunn index, and the Silhouette Coefficient [33]. One of the most important metrics is accuracy, which checks how many of the samples were correctly classified; if all samples are grouped into their correct clusters, the accuracy is 100%. We also calculate the SSW (Sum of Squares Within), which measures how close each sample is to its cluster center, and the SSB (Sum of Squares Between), which measures how far samples are from the other clusters. For the Davies-Bouldin (DB) index, a lower score results from less dispersion within clusters and more distance between clusters. Unlike the Dunn index, DB uses cluster centroids to represent clusters in order to measure separation. Because the score uses the maximum comparison for each cluster, the measure is built upon "worst-case" situations. DB divides compactness by separation, meaning that as clusters become more compact and more separated, the DB values shrink; the smaller the DB value, the better the clustering. The Dunn index (DDI) tries to maximize the inter-cluster distance (increasing separation) while minimizing the intra-cluster distance (increasing compactness), measuring the separation ratio of the clustering. Its range is [0, ∞), and the higher the value, the better the clustering. The Silhouette Coefficient (SC) is based on the proximity of objects within a cluster and on the distance of objects from one cluster to the nearest other cluster. It is used to evaluate a partition, measuring the suitability of each object to its cluster and the quality of each cluster individually. The silhouette value lies in the range [-1, 1], with the best partition occurring when the silhouette value is 1.
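As a brief sketch of how SSW and SSB can be computed from labeled data and centroids (our own helper, assuming Euclidean distance and a precomputed global mean; not the authors' code):

public class ClusterScatter {
    // SSW: sum of squared distances of each point to its own cluster center.
    static double ssw(double[][] data, int[] label, double[][] centroids) {
        double total = 0.0;
        for (int i = 0; i < data.length; i++)
            total += sqDist(data[i], centroids[label[i]]);
        return total;
    }

    // SSB: size-weighted squared distances of the cluster centers to the global mean.
    static double ssb(double[][] centroids, int[] clusterSize, double[] globalMean) {
        double total = 0.0;
        for (int c = 0; c < centroids.length; c++)
            total += clusterSize[c] * sqDist(centroids[c], globalMean);
        return total;
    }

    static double sqDist(double[] a, double[] b) {
        double s = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            s += diff * diff;
        }
        return s;
    }
}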

V. EXPERIMENTAL RESULTS

We implemented the proposed algorithm, parallel k-means, and serial k-means in the Java language; in Java, concurrent programming is mostly concerned with threads. The code was executed on a Dell Inspiron N4030 laptop with an Intel(R) Core(TM) i5 processor at 2.67 GHz, 2 MB cache memory, 3 GB RAM, 64-bit Windows 7 Home, and NetBeans IDE 8.0.

Data set description

Dataset 1: The Credit data set is based on Australian credit card data used to assess applications for credit cards. There are 690 patterns (numbers of applicants), 15 input features, and an output with 2 classes.
Dataset 2: The Glass data set defines glass type in terms of oxide content. The nine inputs are based on 9 chemical measurements, with one of 6 types of glass. The data set contains 214 patterns, which are split into 161 for training and 53 for testing.
Dataset 3: The Wine data set concerns wine recognition and has 178 samples in total, classified into three different classes containing 59, 71, and 48 samples respectively, obtained from the chemical analysis of wines derived from three different cultivators. Each sample in the data set has 13 attributes.
Dataset 4: The WDBC (Wisconsin Diagnostic Breast Cancer) data set is about breast cancer and was collected at the University of Wisconsin. It has 345 instances separated into two classes, and each instance has 30 continuous features.
The performance of the proposed approach is evaluated by execution time, cluster validity, classification error percentage (CEP), and classification efficiency.

1. Execution time

We implemented the proposed algorithm on a two-dimensional dataset, with four cores and data sizes varying from 10K to 1000K. The computational speed of the proposed algorithm compared with parallel k-means and serial k-means is given in Table I, where Ts refers to the execution time of serial k-means, Tp to the execution time of parallel k-means, Tpf to the execution time of the modified parallel k-means, and Sp and Spf are the speedups obtained by the respective algorithms. The running time comparison of the modified parallel algorithm against parallel k-means is shown graphically in Figure 2, and the percentage of running time speedup is shown in Fig. 3. Table I presents the results of our experiments. When the total number of instances is less than 40,000, the parallel algorithms perform almost constantly. However, when the total number of instances is greater than 40,000, the execution time starts to increase linearly in both cases. The reason for this behavior is that, for small numbers of instances, not all threads are active, so adding new instances results in the creation of new threads; since all threads run in parallel, we observe no performance loss. However, above 40,000 instances, all threads become saturated and instances must wait for threads to terminate before being processed.

TABLE I. THE EXECUTION TIME OF PARALLEL K-MEANS AND MODIFIED PARALLEL K-MEANS AT 4 CORES

Dataset Size | Exec. Sec Serial (Ts) | Exec. Sec PKM (Tp) | Exec. Sec MPKM (Tpf) | Sp (PKM) | Spf (MPKM)
100K  |  8.16 |  3.12 |  3.02 | 2.62 | 2.70
200K  | 17.24 |  6.56 |  6.32 | 2.63 | 2.73
300K  | 28.56 | 10.8  | 10.4  | 2.64 | 2.75
400K  | 35.18 | 12.9  | 12.7  | 2.73 | 2.77
500K  | 42.78 | 15.92 | 15.3  | 2.69 | 2.80
600K  | 54.87 | 19.9  | 19.4  | 2.76 | 2.83
700K  | 67.58 | 24.2  | 23.4  | 2.79 | 2.89
800K  | 78.21 | 27.9  | 26.46 | 2.80 | 2.96
900K  | 86.34 | 29.76 | 28.67 | 2.90 | 3.01
1000K | 95.67 | 31.43 | 29.14 | 3.04 | 3.22

It is noticeable from Table I that the execution time of MPKM is better than that of PKM. Speedup is calculated according to the following equation:

Speedup = Ts / Tp   (10)

where Ts refers to the execution time of serial k-means and Tp refers to the execution time of parallel k-means. Linear (ideal) speedup is obtained when the speedup is equal to the number of processors (cores); in this case, linear speedup is equal to four.
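For example, at 1000K instances Table I gives Ts = 95.67 s and Tp = 31.43 s, so Sp = 95.67 / 31.43 ≈ 3.04, below the ideal speedup of 4 on four cores owing to thread scheduling and memory overheads.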

Fig. 1 shows the Task Manager Performance view while the proposed algorithm is running.

[Fig. 2: Running time comparison of the modified parallel k-means against parallel k-means; execution time in seconds (Ts) versus dataset size.]