Scalable parallel clustering approach for large data using genetic possibilistic fuzzy c-means algorithm

Juby Mathew, Student Member, IEEE
Dept. of MCA, Amal Jyothi College of Engg., Kanjirapally, Kerala, India
[email protected]

Abstract— Big data now plays a crucial role in various domains and related processes because of the latest developments in the digital world. Such irrepressible data growth has led clustering algorithms to segment data into small sets on which the associated processes can be performed. The challenge, however, remains in dealing with large data, because most algorithms are compatible only with small data: existing clustering algorithms either handle different data types but are inefficient on large data, or handle large data but are limited to numeric attributes. Hence, parallel clustering has come into the picture as a crucial contribution towards clustering large data, and a scalable parallel clustering method is needed to solve the aforesaid problems. In this paper, we develop a scalable parallel clustering algorithm based on Possibilistic Fuzzy C-Means (PFCM) to cluster large data. Our ultimate aim is to design and develop the algorithm in a parallel way. The parallel architecture splits the input data and clusters each set using PFCM; a genetic firefly algorithm is then applied to the merged cluster data, which provides better clustering accuracy on the merged data. An experimental analysis is carried out to evaluate the feasibility of the scalable PFCM clustering approach; it shows that the proposed approach has the upper hand over the existing method in terms of accuracy and time.

Keywords— parallel clustering, large data, PFCM, FCM, Firefly algorithm, genetic algorithm

I. INTRODUCTION

Data mining technologies are emerging as concrete solutions for extracting hidden information from huge databases. The term "data mining" can be defined as "the non-trivial extraction of implicit, previously unknown and potentially useful information from data in databases" [1]. Advancements in parallel computing and communication, in both wired and wireless networks, have led to the development of numerous pervasive distributed computing environments, that is, environments with a distributed organization of diverse data and computing sources. When mining is performed in such environments, complete exploitation of the distributed resources becomes essential. Unfortunately, most off-the-shelf data mining techniques deal only with monolithic, centralized applications: they mine the information in a centralized location after downloading the relevant data into it. This approach is not well suited to many data mining applications in which distribution, ubiquity, and privacy play key roles. The problem can be solved when distributed resources are used for data mining, which is often termed distributed data mining (DDM) [2].

978-1-4799-3975-6/14/$31.00 ©2014 IEEE

Dr. R Vijayakumar
Dean (Engg & Tech), Mahatma Gandhi University, Kottayam, Kerala, India
[email protected]

Clustering can be defined as the process of allocating data objects into specific disjoint groups, called clusters, such that data objects belonging to the same cluster are similar to each other, while data objects of different clusters differ from each other. Numerous clustering algorithms have been reported in the literature for clustering data efficiently, and current research on clustering algorithms needs to address scalability and big data analysis. As solutions for handling big data, clustering algorithms are initiated by clustering a sample of the data or by a crude partitioning of the whole data. Partitioning methods such as CLARANS [3], hierarchical methods such as BIRCH [4], and grid methods such as STING [5] and WAVE CLUSTER [6] are found to be outstanding clustering algorithms. Even so, a single computer is not sufficient to process such large data. Hence, parallel and distributed clustering come into the picture because of their high scalability and cost efficiency when clustering in a distributed environment. The rest of the paper is organized as follows: Section II reviews recent related works, Section III describes the proposed approach and methodologies, Section IV presents the experimental results, and Section V concludes the paper.

II. REVIEW OF RELATED WORKS

Here, we review certain literary works among the many reported in the literature. Ashwani Garg et al. [7] proposed a parallel BIRCH algorithm to improve scalability without affecting clustering quality. The algorithm balances the processing load by cyclic distribution of the incoming data, or block-cyclic distribution if the incoming data is bursty. Experimental results on large-scale data revealed that the algorithm exhibits a linear scaling property with an increasing number of processors, and the obtained cluster quality is competitive with that of BIRCH. On the basis of the message passing model, Inderjit S. Dhillon et al. [8] proposed a parallel implementation of the k-means clustering algorithm. The algorithm exploits the natural data-parallelism available in k-means. Their analysis observed that the speedup and scaleup of the algorithm remain close to optimal even as the number of data points increases.

2014 IEEE International Conference on Computational Intelligence and Computing Research

Olivier Beaumont et al. [9] investigated large-scale distributed platforms such as BOINC and WCG to understand the resource clustering problem. They execute a task on a set of resources so that the single-computing-resource constraint can be eliminated, aiming to design a distributed method that forms clusters from a huge set of resources. A generic two-phase method based on resource augmentation is detailed, in which the approximation ratio is maintained at 1/3; if the value of D in the metric space is small and the distances are defined by the L′ norm, a distributed version of the method is introduced. Numerous techniques have been developed to address the clustering problem, among which parallel clustering attempts to solve the large data clustering problem. The authors of [18] proposed a scalable parallel clustering approach for large data using PFCM to parallelize the computation across machines. Md. Mostofa Ali Patwary et al. [10] recently performed large data clustering by introducing a scalable parallel clustering method based on OPTICS; the quality of the results obtained from POPTICS is competitive with those of the classical OPTICS algorithm. This has inspired us to concentrate on large data clustering using scalable parallel clustering. The proposed method is formulated by unifying an upgraded fuzzy c-means clustering algorithm with a genetic algorithm: FCM clusters the large data in parallel mode, while the genetic algorithm solves a modified fitness function to obtain dominant results. The procedural steps framed in the proposed method are able to solve the aforesaid issues. The entire approach is developed in Java, and the experimentation is carried out on recognized datasets with standard measures.

The main contributions of the paper are:
1) Parallel architecture: the parallel architecture splits the input data and clusters each set of data using PFCM; the genetic firefly algorithm is then applied to the merged cluster data.
2) Genetic firefly: developed by combining the genetic and firefly algorithms, it provides better clustering accuracy on the merged data.
3) Analysis: two different datasets, the skin dataset and the poker hand dataset, are used to evaluate the performance of the proposed method.

III. PROPOSED SCALABLE PARALLEL CLUSTERING APPROACH FOR LARGE DATA USING GENETIC POSSIBILISTIC FUZZY C-MEANS ALGORITHM

The aim of the proposed method is to cluster a large dataset efficiently. A scalable parallel clustering algorithm is used to overcome the problems of clustering a large, high-dimensional dataset. Two different clustering methods are used. The first is the Possibilistic Fuzzy C-Means (PFCM) clustering algorithm, applied to each randomly divided set of input data. The second is a genetic-algorithm-based clustering method called the genetic firefly algorithm, applied to the merged cluster centroid data obtained from PFCM; the resulting clusters are then obtained at the output. Fig. 1 below represents the architecture of the proposed scalable parallel clustering algorithm. At the initial stage, the proposed method randomly divides the input large dataset into an equal number of sets.

1) Partitioning the input large dataset
Let the input be a large dataset of size $M \times N$. Processing the whole input dataset directly with the Possibilistic Fuzzy C-Means clustering algorithm is difficult, so randomly dividing it into small subsets of equal size makes the system better. The input large dataset is therefore randomly divided into $N$ subsets $S = \{S_1, S_2, S_3, \dots, S_N\}$, where $N$ is the total number of equal-sized sets. Each subset of data is then clustered using a standard and efficient clustering algorithm, the Possibilistic Fuzzy C-Means (PFCM). Each single data subset $S$ consists of vectors of $d$ measurements, $X = (x_1, x_2, x_3, \dots, x_d)$, where $x_i$ represents an attribute of an individual data item and $d$ represents the dimensionality of the vector. PFCM is applied to each subset of size $n \times d$ to cluster it into $k$ clusters.
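The partitioning step described above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the class and method names (`Partitioner`, `partition`) and the round-robin dealing (which yields near-equal subset sizes) are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of the random partitioning step: shuffle the input rows and deal them
// round-robin into N subsets, giving (near-)equal sizes. Names are illustrative.
public class Partitioner {
    public static List<List<double[]>> partition(List<double[]> data, int n, long seed) {
        List<double[]> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed)); // random division of the data
        List<List<double[]>> subsets = new ArrayList<>();
        for (int i = 0; i < n; i++) subsets.add(new ArrayList<>());
        for (int i = 0; i < shuffled.size(); i++)
            subsets.get(i % n).add(shuffled.get(i)); // round-robin => equal-sized subsets
        return subsets;
    }
}
```

Each subset returned by this step would then be handed to one PFCM worker.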

Fig. 1. Parallel Architecture of proposed scalable parallel clustering algorithm
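The pipeline of Fig. 1 (split the data, cluster each subset in parallel, collect the per-subset results for merging) can be sketched as follows. Here `clusterSubset()` is only a stand-in for PFCM (it returns the subset mean so the sketch is runnable), and all names are illustrative assumptions, not the authors' code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// High-level sketch of the Fig. 1 architecture: one clustering task per subset,
// run in parallel, then the per-subset results are gathered for the merge stage.
public class ParallelPipeline {
    // Placeholder for PFCM on one subset: returns the subset mean as a "centroid".
    static double[] clusterSubset(List<double[]> subset) {
        int d = subset.get(0).length;
        double[] mean = new double[d];
        for (double[] x : subset)
            for (int j = 0; j < d; j++) mean[j] += x[j] / subset.size();
        return mean;
    }

    public static List<double[]> run(List<List<double[]>> subsets) {
        ExecutorService pool = Executors.newFixedThreadPool(subsets.size());
        try {
            List<Future<double[]>> futures = new ArrayList<>();
            for (List<double[]> s : subsets)
                futures.add(pool.submit(() -> clusterSubset(s))); // one task per subset
            List<double[]> merged = new ArrayList<>();            // merged cluster data
            for (Future<double[]> f : futures) merged.add(f.get());
            return merged; // input to the genetic firefly refinement
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```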

2014 IEEE International Conference on Computational Intelligence and Computing Research

2) Possibilistic Fuzzy C-Means (PFCM)
The Possibilistic Fuzzy C-Means (PFCM) clustering method [11] is applied to each randomly divided subset of data. PFCM is one of the most efficient parallel clustering methods: it allows one piece of data to belong to two or more clusters, and it is one of the most frequently used techniques in pattern recognition. Let the unlabelled data set $S = \{S_1, S_2, S_3, \dots, S_N\}$ be clustered into a group of $k$ clusters using PFCM. The proposed PFCM is based on the minimization of the objective function

$$\min_{(U,T,V)} J_{m,\eta}(U,T,V;X) = \sum_{i=1}^{c}\sum_{k=1}^{n}\left(a\,u_{ik}^{m} + b\,t_{ik}^{\eta}\right)\lVert x_k - v_i \rVert_A^{2} + \sum_{i=1}^{c}\gamma_i \sum_{k=1}^{n}\left(1 - t_{ik}\right)^{\eta} \qquad (1)$$

subject to the constraints

$$\sum_{i=1}^{c} u_{ik} = 1 \quad \forall k, \qquad 0 \le u_{ik}, t_{ik} \le 1.$$

Here $a > 0$, $b > 0$, $m > 1$, $\eta > 1$ [12], where $m$ is any real number greater than 1, $u_{ik}$ is the degree of membership of $x_k$ in cluster $i$, $x_k$ is the $k$-th $d$-dimensional measured data item, $v_i$ is the $d$-dimensional center of the cluster, and $\lVert \cdot \rVert$ is any norm expressing the similarity between a measured data item and the center, where $D_{ikA} = \lVert x_k - v_i \rVert_A$ and

$$\sum_{k=1}^{n} t_{ik} = 1 \quad \forall i.$$
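A minimal sketch of the alternating optimization that minimizes eq. (1) is given below, assuming the Euclidean norm for $\lVert\cdot\rVert_A$; the class name, the initialization scheme, and the small numerical guards are illustrative assumptions, not the authors' code.

```java
// Minimal sketch of the PFCM alternating optimization (Euclidean norm assumed;
// a, b, m, eta and the typicality weights gamma_i are user-supplied).
public class Pfcm {
    // squared Euclidean distance D_ik^2 = ||x_k - v_i||^2
    static double dist2(double[] x, double[] v) {
        double s = 0;
        for (int j = 0; j < x.length; j++) s += (x[j] - v[j]) * (x[j] - v[j]);
        return s;
    }

    // Returns the c cluster centers after maxIter update sweeps.
    public static double[][] cluster(double[][] X, int c, double a, double b,
                                     double m, double eta, double[] gamma, int maxIter) {
        int n = X.length, d = X[0].length;
        double[][] V = new double[c][];
        for (int i = 0; i < c; i++) V[i] = X[(i * n) / c].clone(); // spread initial centers
        double[][] U = new double[c][n], T = new double[c][n];
        for (int it = 0; it < maxIter; it++) {
            for (int k = 0; k < n; k++) {
                for (int i = 0; i < c; i++) {
                    double dik = Math.max(dist2(X[k], V[i]), 1e-12);
                    // membership update: (D_ik/D_jk)^(2/(m-1)) on distances equals
                    // (dik/djk)^(1/(m-1)) on squared distances
                    double sum = 0;
                    for (int j = 0; j < c; j++)
                        sum += Math.pow(dik / Math.max(dist2(X[k], V[j]), 1e-12), 1.0 / (m - 1));
                    U[i][k] = 1.0 / sum;
                    // typicality update: 1 / (1 + (b * D_ik^2 / gamma_i)^(1/(eta-1)))
                    T[i][k] = 1.0 / (1.0 + Math.pow(b * dik / gamma[i], 1.0 / (eta - 1)));
                }
            }
            for (int i = 0; i < c; i++) { // center update with weights a*u^m + b*t^eta
                double[] num = new double[d];
                double den = 0;
                for (int k = 0; k < n; k++) {
                    double w = a * Math.pow(U[i][k], m) + b * Math.pow(T[i][k], eta);
                    den += w;
                    for (int j = 0; j < d; j++) num[j] += w * X[k][j];
                }
                for (int j = 0; j < d; j++) V[i][j] = num[j] / den;
            }
        }
        return V;
    }
}
```

A production version would also track the change in $U$ between sweeps and stop once it falls below a tolerance, as in the stopping rule described next.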

The PFCM clustering or partitioning is carried out through an iterative optimization of the objective function shown above, with the membership $u_{ik}$, the typicality $t_{ik}$, and the cluster centers $v_i$ updated by [13]

$$u_{ik} = \left( \sum_{j=1}^{c} \left( \frac{D_{ikA}}{D_{jkA}} \right)^{2/(m-1)} \right)^{-1}, \quad 1 \le i \le c,\; 1 \le k \le n \qquad (2)$$

$$t_{ik} = \frac{1}{1 + \left( \dfrac{b}{\gamma_i} D_{ikA}^{2} \right)^{1/(\eta-1)}}, \quad 1 \le i \le c,\; 1 \le k \le n \qquad (3)$$

$$v_i = \frac{\sum_{k=1}^{n} \left( a\,u_{ik}^{m} + b\,t_{ik}^{\eta} \right) x_k}{\sum_{k=1}^{n} \left( a\,u_{ik}^{m} + b\,t_{ik}^{\eta} \right)}, \quad 1 \le i \le c \qquad (4)$$

The iteration stops when $\max_{ik} \lvert u_{ik}^{(k+1)} - u_{ik}^{(k)} \rvert < \varepsilon$, where $\varepsilon$ is a termination criterion between 0 and 1 and $k$ is the iteration step. This procedure converges to a local minimum of $J_{m,\eta}$ [14]. The PFCM clustering algorithm consists of the following steps.

Algorithm 1: PFCM clustering algorithm
Step 1: Initialize the membership matrix $U = [u_{ik}]$, $U^{(0)}$.
Step 2: At step $k$, calculate the center vectors $C^{(k)} = [v_i]$ with $U^{(k)}$ [15]:

$$v_i = \frac{\sum_{k=1}^{n} \left( u_{ik}^{m} + t_{ik}^{\eta} \right) x_k}{\sum_{k=1}^{n} \left( u_{ik}^{m} + t_{ik}^{\eta} \right)}, \quad 1 \le i \le c \qquad (5)$$

Step 3: Update $U^{(k)}$ to $U^{(k+1)}$:

$$u_{ik} = \left( \sum_{j=1}^{c} \left( \frac{D_{ikA}}{D_{jkA}} \right)^{2/(m-1)} \right)^{-1} \qquad (6)$$

Step 4: If $\lVert U^{(k)} - U^{(k+1)} \rVert < \varepsilon$, stop; otherwise return to Step 2.

Finally, for each subset of the input data, a group of $K$ clusters is obtained after applying the PFCM clustering method. The size of the obtained group of $K$ clusters is smaller than the size of the input subset and depends entirely on the value of $K$. Before applying the optimization algorithm, the groups of $K$ clusters obtained from the individual sets are merged into a single group of clusters. For this purpose, a new metaheuristic method called the genetic firefly algorithm is used; this new genetic firefly clustering algorithm is applied to the merged cluster data. The merging process of the proposed method is given below.

3) Merging cluster
Merging is the process of grouping the distributed groups of clusters, which is an important step in the proposed method. The main purpose of merging is to obtain the relevant data from all the clusters. The proposed approach then initiates a genetic-algorithm-based clustering for obtaining the merging node.

4) Genetic firefly algorithm clustering

A new method for clustering the merged data is developed in this paper by combining the genetic algorithm and the firefly algorithm. In general, the genetic-firefly-based clustering algorithm can reach the optimum point of the function very quickly. In the first step, each firefly is randomly generated: for each firefly, two data points from the input data are taken as the initial population and form the two centroids of that firefly. The distance is then computed between the data in the firefly and all the data in the dataset; likewise, for each data point in the firefly, the distance between the centroid and the whole dataset is computed. The minimum distance among the centroids in each firefly is then found, which serves as the fitness of that firefly. For each firefly's fitness, the Davies–Bouldin index (DBI) is calculated. The two worst solutions are then selected and subjected to crossover and mutation. These steps are repeated until the stopping criterion is met.
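The loop described above can be sketched as follows. This is a simplified skeleton under stated assumptions: the fitness used here is the total distance of every point to its nearest centroid (a stand-in for the minimum-distance/DBI fitness in the text, lower is better), the firefly movement step is omitted, and all names are illustrative, not the authors' code.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

// Skeleton of the genetic firefly search: each firefly holds two candidate
// centroids drawn from the data; each generation the two worst fireflies are
// recombined (single crossover point) and mutated, so the best is preserved.
public class GeneticFirefly {
    public static double fitness(double[][] firefly, double[][] data) {
        double total = 0;
        for (double[] p : data) {
            double best = Double.MAX_VALUE;
            for (double[] c : firefly) {
                double s = 0;
                for (int j = 0; j < p.length; j++) s += (p[j] - c[j]) * (p[j] - c[j]);
                best = Math.min(best, s);
            }
            total += Math.sqrt(best); // distance to the nearest centroid
        }
        return total;
    }

    public static double[][] search(double[][] data, int popSize, int iters, long seed) {
        Random rnd = new Random(seed);
        double[][][] pop = new double[popSize][][];
        for (int i = 0; i < popSize; i++) // random initial population of centroid pairs
            pop[i] = new double[][] { data[rnd.nextInt(data.length)].clone(),
                                      data[rnd.nextInt(data.length)].clone() };
        Comparator<double[][]> byFitness = Comparator.comparingDouble(f -> fitness(f, data));
        for (int it = 0; it < iters; it++) {
            Arrays.sort(pop, byFitness); // best first
            double[][] w1 = pop[popSize - 1], w2 = pop[popSize - 2]; // two worst
            double[][] c1 = { w1[0].clone(), w2[1].clone() };        // crossover point 1
            double[][] c2 = { w2[0].clone(), w1[1].clone() };
            for (double[] c : c1) c[rnd.nextInt(c.length)] += 0.1 * rnd.nextGaussian();
            for (double[] c : c2) c[rnd.nextInt(c.length)] += 0.1 * rnd.nextGaussian();
            pop[popSize - 1] = c1; // mutated offspring replace the two worst
            pop[popSize - 2] = c2;
        }
        Arrays.sort(pop, byFitness);
        return pop[0]; // best pair of centroids found
    }
}
```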


Fig. 2. Sample Initial Population

a) Initial population for firefly
The initial population of fireflies is created as $F = \{F_1, F_2, F_3, \dots, F_N\}$. For each firefly, two input data $d_1$ and $d_2$ are taken, giving two centroids $C_1^1$ and $C_2^1$. For the selected data, the DBI is calculated to find the minimum distance, which determines the new solution. Initially, the minimum distance for each firefly in the population is determined as follows. First, the distances between the data in the firefly and the whole data in the dataset are computed. The randomly taken centroid data form the data in the firefly, represented by $C_i^j$, where $0 < i \le 2$ and $0 < j \le N$. For the first firefly, the two centroids are denoted $C_1^1$ and $C_2^1$. The whole dataset is represented by $d_i$, where $0 < i \le M$ and $M$ is the total number of data items in the dataset. The term $C_i^1 - d_j$ represents the distance between the $i$-th centroid and the $j$-th data item in the dataset, computed by the formula

$$dis = \sum_{n=1}^{M} (z_{in} - z_{jn})^2 \qquad (7)$$

where $z_{in}$ is the $n$-th attribute of the $i$-th data item and $z_{jn}$ is the $n$-th attribute of the $j$-th data item.

Once the distance computation is carried out, the minimum distance value in each row is found, i.e., the minimum distance among the centroids in the firefly with respect to each data item. For example, the minimum distance between the centroids in the first firefly and the first data item is found using

$$D\min_1^1 = \begin{cases} C_1^1 - d_1, & \text{if } (C_1^1 - d_1) < (C_2^1 - d_1) \\ C_2^1 - d_1, & \text{otherwise} \end{cases} \qquad (8)$$

Generalizing, the minimum distance between the centroids in the first firefly and the $j$-th data item is found using

$$D\min_j^1 = \begin{cases} C_1^1 - d_j, & \text{if } (C_1^1 - d_j) < (C_2^1 - d_j) \\ C_2^1 - d_j, & \text{otherwise} \end{cases} \qquad (9)$$

Then the DB index is calculated for the minimum distance $D\min$ of each firefly.

b) DB Index calculation
In this section, the DBI for the selected data of each firefly is calculated. For $N$ clusters, the DBI is defined as

$$DB \equiv \frac{1}{N} \sum_{i=1}^{N} D_i \qquad (10)$$

The DB index depends on both the data and the algorithm. In the above equation, $D_i$ chooses the worst-case scenario, and this value equals $R_{i,j}$ for the cluster most similar to cluster $i$:

$$D_i \equiv \max_{j:\, j \ne i} R_{i,j}, \quad \text{where } R_{i,j} = \frac{S_i + S_j}{M_{i,j}}$$

Here $R_{i,j}$ is a measure of how good the clustering scheme is. By definition, this measure has to account for $M_{i,j}$, the separation between the $i$-th and $j$-th clusters, which ideally has to be as large as possible, and $S_i$, the within-cluster scatter of cluster $i$, which has to be as low as possible. The DB index calculation has various properties:

(i) $R_{i,j} \ge 0$ (11)
(ii) $R_{i,j} = R_{j,i}$ (12)
(iii) If $S_j \ge S_k$ and $M_{i,j} = M_{i,k}$, then $R_{i,j} \ge R_{i,k}$ (13)
(iv) If $S_j = S_k$ and $M_{i,j} \le M_{i,k}$, then $R_{i,j} \ge R_{i,k}$ (14)

Let $C_i$ be a cluster of vectors and $X_j$ an $n$-dimensional feature vector assigned to cluster $C_i$. The within-cluster scatter $S_i$ for cluster $i$ is represented as

$$S_i = \left( \frac{1}{T_i} \sum_{j=1}^{T_i} \lVert X_j - A_i \rVert^q \right)^{1/q} \qquad (15)$$

where $A_i$ is the centroid of $C_i$ and $T_i$ is the size of cluster $i$. Usually the value of $q$ is 2, which makes this a Euclidean distance function between the centroid of the cluster and the individual feature vectors. For a meaningful result, the distance metric has to match the metric used in the clustering.

$$M_{i,j} = \lVert A_i - A_j \rVert_p = \left( \sum_{k=1}^{n} \lvert a_{k,i} - a_{k,j} \rvert^p \right)^{1/p} \qquad (16)$$

where $M_{i,j}$ is a measure of separation between cluster $C_i$ and cluster $C_j$.
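The DB-index computation of eqs. (10)–(16) can be sketched for a hard assignment of points to clusters, with $q = p = 2$ (Euclidean). This is an illustrative sketch, not the paper's code; the class and method names are assumptions.

```java
// Sketch of the Davies-Bouldin index: S_i per eq. (15) with q = 2, M_ij per
// eq. (16) with p = 2, D_i = max_{j != i} R_ij, and DB per eq. (10).
public class DaviesBouldin {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) s += (a[j] - b[j]) * (a[j] - b[j]);
        return Math.sqrt(s);
    }

    // clusters[i] = points assigned to cluster i; centroids[i] = A_i
    public static double index(double[][][] clusters, double[][] centroids) {
        int n = clusters.length;
        double[] S = new double[n]; // within-cluster scatter, eq. (15), q = 2
        for (int i = 0; i < n; i++) {
            double s = 0;
            for (double[] x : clusters[i]) s += Math.pow(dist(x, centroids[i]), 2);
            S[i] = Math.sqrt(s / clusters[i].length);
        }
        double db = 0;
        for (int i = 0; i < n; i++) {
            double Di = 0; // D_i = max over j != i of R_ij
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                double Mij = dist(centroids[i], centroids[j]); // separation, eq. (16)
                Di = Math.max(Di, (S[i] + S[j]) / Mij);        // R_ij = (S_i + S_j)/M_ij
            }
            db += Di;
        }
        return db / n; // eq. (10)
    }
}
```

Lower values indicate compact, well-separated clusters, which is why the search above prefers fireflies with smaller index values.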


$a_{k,i}$ is the $k$-th element of $A_i$, and there are $n$ such elements, since $A_i$ is an $n$-dimensional centroid. Hence, the DB index of all fireflies is computed and is subsequently used for the movement computation.

c) Movement Computations
The movement computation is based on the calculated DB index values: the movement is applied to the firefly with the higher DB index. For example, compare the first firefly, with DB index $D_1$, and the second firefly, with DB index $D_2$. If $D_2 > D_1$, i.e., the second firefly has the better DB index, the movement calculation is applied to the second firefly, and the data in the first firefly are substituted by the solution acquired after the movement computation. Similarly, if $D_1 > D_2$, the data in the second firefly are substituted by the solution obtained after the movement computation. The movement computation is carried out by the formula

$$z_i = z_i + \beta_0 e^{-\gamma r_{ij}^{2}} + \alpha \left( \mathrm{random} - \tfrac{1}{2} \right) \qquad (17)$$

where $z_i$ is the attribute data value of the firefly with the higher DB index, $\beta_0$, $\gamma$, and $\alpha$ are constants, $\mathrm{random}$ is a value in the range 0 to 1, and $r_{ij}$ is the difference between $z_i$ and $z_j$. This comparison is carried out for all fireflies, and movement computation and substitution are performed accordingly.

d) Crossover and Mutation
Two parent solutions are used for crossover. By performing the crossover operation on the selected parent solutions, two new offspring solutions are generated, each carrying a part of its parent solutions. The crossover point is set to one. After crossover, $n$ new child clusters are obtained. Mutation is then applied to the children of each parent cluster to introduce diversity among the solutions, so that entirely new offspring solutions are added to the population at each iteration. Through the genetic operators, the parent solutions are thus able to generate offspring superior to their parents. After crossover and mutation, a completely new population of clusters is obtained and its fitness values are calculated. The population of clusters with the maximum fitness is selected as the new parents. These steps are repeated until the maximum iteration limit is reached; after the termination criterion is met, a new set of clusters representing the better clustering is obtained.

IV. EXPERIMENTAL RESULTS AND DISCUSSION

We implemented the proposed approach in Java. The code was executed on a Dell Inspiron N4030 laptop with an Intel(R) Core(TM) i5 processor at 2.67 GHz, 2 MB cache memory, 3 GB RAM, 64-bit Windows 7 Ultimate, and NetBeans IDE 8.0.

A. Dataset Description
In our proposed method, two kinds of datasets are used for evaluation. The first is the skin segmentation dataset, taken from the UCI machine learning repository; it consists of 245,057 instances and 4 attributes [16]. The second, the poker hand dataset, is multivariate and consists of 1,025,010 instances and 11 attributes. Each record is an example of a hand of five playing cards drawn from a standard deck of 52; each card is described by two attributes, for a total of 10 predictive attributes [17].

B. Evaluation metrics
The proposed scalable parallel clustering algorithm uses two main evaluation metrics: clustering accuracy and computation time. Clustering accuracy is measured by counting the number of correctly assigned documents and dividing by $N$:

$$\mathrm{Accuracy} = \frac{1}{N} \sum_{k} \max_{j} \lvert B_k \cap C_j \rvert \qquad (18)$$

where $B_k$ belongs to the set of clusters $B = \{B_1, B_2, B_3, \dots, B_K\}$ and $C_j$ to the set of classes $C = \{C_1, C_2, C_3, \dots, C_K\}$; $B_k$ is interpreted as the set of documents in cluster $B_k$ and $C_j$ as the set of documents in class $C_j$. The other metric used here is the computation time, measured from the start to the end of the program.
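The purity-style accuracy of eq. (18) can be sketched as follows; the integer label encoding and the names below are illustrative assumptions, not the paper's code.

```java
// Sketch of eq. (18): for each cluster, count the overlap with its best-matching
// class, sum over clusters, and divide by the number of items N.
public class ClusterAccuracy {
    // clusterOf[k] and classOf[k] give the cluster id and true class id of item k.
    public static double accuracy(int[] clusterOf, int[] classOf,
                                  int numClusters, int numClasses) {
        int n = clusterOf.length;
        int[][] overlap = new int[numClusters][numClasses]; // |B_k intersect C_j|
        for (int k = 0; k < n; k++) overlap[clusterOf[k]][classOf[k]]++;
        int correct = 0;
        for (int b = 0; b < numClusters; b++) {
            int best = 0; // max_j |B_k intersect C_j|
            for (int c = 0; c < numClasses; c++) best = Math.max(best, overlap[b][c]);
            correct += best;
        }
        return (double) correct / n; // eq. (18)
    }
}
```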

C. Performance evaluation based on skin dataset
The initial analysis is based on the skin dataset. The analysis is conducted for varying data size and for a varying initial number of clusters in the parallel system. The following section presents the accuracy and computation time with respect to the final number of clusters.

1) Accuracy and computation time based on final cluster

Fig.3: Accuracy of skin data set on final cluster

The existing methods used to compare the performance of the proposed method are k-means and genetic clustering. Fig. 3 presents the accuracy of the proposed approach for a varying number of final clusters in the parallel clustering system, from 2 to 6. The figure shows that the proposed approach has the upper hand over the existing methods; moreover, the accuracy of the proposed approach increases as the number of final clusters increases.

2014 IEEE International Conference on Computational Intelligence and Computing Research

The maximum accuracy attained by the proposed approach is 91%, a considerably better figure than the 70% maximum accuracy obtained by the existing method. All of these experiments were conducted in the same environment.

Fig.4: Time analysis of the skin data set on final cluster

Fig. 4 presents the time analysis of the proposed approach for a varying number of clusters. The response of the execution time differs from that of the accuracy analysis and is clearly irregular across the cases. The proposed and existing approaches nevertheless differ clearly from each other, and considering all scenarios, this study assesses that the proposed approach is more time-efficient than the existing approach.

2) Accuracy and computation time based on data size

Fig.5: Accuracy and computation time based on data size in the skin data set

Fig. 5 shows the accuracy of the proposed approach based on the data size of the parallel clustering system, varied from 50K to 90K; the accuracy is plotted for each data size. The proposed approach clearly has the upper hand over the existing method: its maximum accuracy is 94%, compared with 91% for the existing method.

Fig.6: Time analysis of the skin data set based on data size

The time analysis based on data size is shown in Fig. 6. This analysis again found that the proposed and existing approaches clearly differ from each other; considering all scenarios, the proposed approach is more time-efficient than the existing approach.

D. Performance evaluation based on poker hand dataset
This analysis is based on the poker hand dataset. The analysis is conducted for varying data size and for a varying initial number of clusters in the parallel system. The following section presents the accuracy and computation time with respect to the final number of clusters.

1) Accuracy and computation time based on final cluster

Fig.7: Accuracy of poker hand data set on final cluster

Fig. 7 shows the accuracy of the proposed approach for a varying number of final clusters (2 to 6) in the parallel clustering system on the poker hand dataset. The analysis shows the effectiveness of our approach, possibly because the initial cluster centers generated by the proposed algorithm are quite close to the optimum solutions.

Fig.8: Time analysis of the poker hand data set based on final clusters

Fig. 8 presents the time analysis of the proposed approach for a varying number of clusters on the poker hand dataset. The execution-time response differs from the accuracy analysis, and the proposed and existing approaches clearly differ from each other. Considering all scenarios, this study assesses that the proposed approach is more time-efficient than the existing approach.


2) Accuracy and computation time based on data size

Fig.9: Accuracy and computation time based on data size in the poker hand data set

The accuracy is plotted for data sizes varying from 50K to 90K; for each data size the accuracy is plotted. The accuracy of the proposed method increases with increasing data size, showing that the proposed method provides better accuracy than the existing method. Similarly, the time analysis for varying data size is shown in Fig. 10; it reveals that the execution time of the proposed method is very low compared with the existing method.

Fig.10: Time analysis of the poker hand data set based on data size

This analysis found that the proposed and existing approaches clearly differ from each other. Considering all scenarios, this study assesses that the proposed approach is more time-efficient than the existing approach.

V. CONCLUSION

Large data clustering plays a crucial role in various domains. However, most clustering algorithms are compatible only with small data, and a practical way to address the large-scale clustering problem is to exploit parallel algorithms. Based on this, we have presented in this paper a method for scalable parallel clustering built on a series of methods. The proposed approach is designed mainly to address the difficulty of clustering large databases: a PFCM algorithm handles the large dataset, and a hybrid genetic firefly clustering method is applied to the merged clusters. The analysis was made on two datasets, the skin and poker hand datasets from the UCI machine learning repository, and the proposed method was compared with the existing genetic and k-means clustering algorithms. The performance analysis and experimental results show that the proposed method provides better results and has the upper hand over the existing method in terms of both accuracy and time. The highest accuracy achieved by the proposed approach is 94%.

REFERENCES
[1] D. Vimal Kumar and A. Tamilarasi, "Mining of Optimized Multi Relational Relation Patterns for Prediction System", International Review on Computers & Software, 2013.
[2] Souptik Datta, Kanishka Bhaduri, Chris Giannella, Ran Wolff and Hillol Kargupta, "Distributed Data Mining in Peer-to-Peer Networks", IEEE Internet Computing, vol. 10, no. 4, pp. 18-26, 2006.
[3] R. T. Ng and J. Han, "Efficient and effective clustering methods for spatial data mining", International Conference on Very Large Data Bases, 1994.
[4] T. Zhang, R. Ramakrishnan and M. Livny, "BIRCH: an efficient data clustering method for very large databases", Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 103-114, 1996.
[5] W. Wang, J. Yang and R. Muntz, "STING: A Statistical Information Grid Approach to Spatial Data Mining", VLDB, 1997.
[6] G. Sheikholeslami, S. Chatterjee and A. Zhang, "WaveCluster: A multi-resolution clustering approach for very large spatial databases", VLDB, pp. 428-439, 1998.
[7] Ashwani Garg, Ashish Mangla, Neelima Gupta and Vasudha Bhatnagar, "PBIRCH: A Scalable Parallel Clustering algorithm for Incremental Data", IEEE, pp. 315-316, 2006.
[8] Inderjit S. Dhillon and Dharmendra S. Modha, "A Data-Clustering Algorithm on Distributed Memory Multiprocessors", Proceedings of the KDD Workshop on High Performance Knowledge Discovery, pp. 245-260, 1999.
[9] Olivier Beaumont, Nicolas Bonichon, Philippe Duchon, Lionel Eyraud-Dubois and Hubert Larcheveque, "A Distributed Algorithm for Resource Clustering in Large Scale Platforms", Principles of Distributed Systems, vol. 5401, pp. 564-567, 2008.
[10] Md. Mostofa Ali Patwary, Diana Palsetia, Ankit Agrawal, Wei-keng Liao, Fredrik Manne and Alok Choudhary, "Scalable Parallel OPTICS Data Clustering Using Graph Algorithmic Techniques", International Conference for High Performance Computing, Networking, Storage and Analysis, ACM, no. 49, 2013.
[11] Nikhil R. Pal, Kuhu Pal, James M. Keller and James C. Bezdek, "A Possibilistic Fuzzy c-Means Clustering Algorithm", IEEE Transactions on Fuzzy Systems, vol. 13, no. 4, 2005.
[12] Jian-Jiang Zhou, "Possibilistic Fuzzy c-Means Clustering Model Using Kernel Methods", International Conference on Computational Intelligence for Modelling, Control and Automation (CIMCA-IAWTIC'06), 2005.
[13] Ehsan Nadernejad, "A new method for image segmentation based on Fuzzy C-means algorithm on pixonal images formed by bilateral filtering", Signal, Image and Video Processing, 2011.
[14] Yue Li, "Setting up Model of Forecasting Core Reservoir Parameters by Fusion of Soft Computing Methods", Third International Conference on Natural Computation (ICNC 2007), 2007.
[15] Tai-hoon Kim, "Procedure of Partitioning Data into Number of Data Sets or Data Group – A Review", Communications in Computer and Information Science, 2010.
[16] Skin Segmentation data set, UCI machine learning repository, "https://archive.ics.uci.edu/ml/datasets/Skin+Segmentation".
[17] Poker Hand data set, UCI machine learning repository, "https://archive.ics.uci.edu/ml/datasets/Poker+Hand".
[18] Juby Mathew and R. Vijayakumar, "Scalable Parallel Clustering Approach for Large Data using Possibilistic Fuzzy C-Means Algorithm", International Journal of Computer Applications (IJCA), vol. 103, no. 9, pp. 24-29, 2014.
