Journal of Communication and Computer 11 (2014) 45-51
Comparing K-Means and Mean Shift Algorithms Performance Using Mahout in a Private Cloud Environment

Paulo Muniz de Avila1, Roan Simoes da Silva2, Luiz Angelo Valota Francisco2, Rodrigo Palucci Pantoni2, David Buzatto2 and Sergio Donizetti Zorzo3

1. Department of Informatics, Federal Institute of South of Minas Gerais (IFSULDEMINAS), Minas Gerais 13872-551, Brazil
2. Department of Informatics, Federal Institute of São Paulo (IFSP), São Paulo 01109-010, Brazil
3. Computing Department, Federal University of São Carlos (UFSCar), São Paulo 13565-905, Brazil
Received: November 17, 2013 / Accepted: December 16, 2013 / Published: January 31, 2014.

Abstract: Cloud computing has provided the possibility of services such as data storage, application execution and resource scalability. The ability to provision computing resources on demand allows a great quantity of data to be stored in data centers. Hadoop is a promising solution to problems involving big data sets. Mahout is a project built on Hadoop that implements, by default, several clustering and classification algorithms, such as K-Means and Mean Shift, clustering algorithms successfully used in recent years on small databases. This paper presents a performance analysis of K-Means and Mean Shift in the standard Mahout implementation under the MapReduce distributed paradigm.

Key words: Cloud computing performance, Apache Hadoop, Apache Mahout.
1. Introduction

Cloud computing has introduced the possibility of services such as data storage, application execution and the scaling of computing resources according to demand, allowing a big quantity of data to be stored in data centers [1]. In this new scenario, it is essential to offer solutions that are efficient from the computational point of view and, at the same time, guarantee the correct classification of elements for clustering in big databases. The Hadoop framework [2] provides a solution to the problem of processing a big volume of data because it runs on a cluster with failure tolerance. Hadoop implements a computational paradigm called MapReduce and provides a distributed file system (HDFS) that stores data across the cluster nodes.

Corresponding author: Rodrigo Palucci Pantoni, Ph.D., professor, research fields: Informatics. E-mail: [email protected].

The Apache Mahout Project, which was developed on top of the Hadoop framework and the MapReduce paradigm, aims at offering a platform with several algorithms ready for use. However, it is not yet consolidated in the market. K-Means and Mean Shift are popular clustering algorithms and have been successfully used in recent years on small databases [3]. Both algorithms have standard implementations in Mahout. This paper presents a performance analysis of the K-Means and Mean Shift algorithms implemented in the Apache Mahout platform [11]. The purpose of this work is to evaluate their clustering operation on a big database. This paper is organized as follows: Section 2
presents related work, the MapReduce paradigm, the Hadoop framework, the Mahout project, and the K-Means and Mean Shift algorithms; Section 3 presents the experiment; Section 4 analyzes the results; and Section 5 presents the conclusions.
2. Related Work

There are related works on data scalability and data mining in cloud computing systems, presented by Refs. [4, 5]. Regarding tests with the Mahout framework, Esteves et al. [6] present a study carried out on Amazon EC2. This study measures the gain in K-Means runtime on a cluster with many nodes. The conclusion is that the Mahout framework is not the best solution for small data sets, due to the overhead of replicating and distributing data across many blocks of the HDFS file system. This overhead is compensated as the data set grows. A second study, presented by Esteves and Chunming [7], shows a comparison between the K-Means and fuzzy C-Means algorithms on big data sets. In that study the Apache Mahout/Hadoop platform is used and tests are carried out to verify which clustering algorithm deals more efficiently with a big quantity of data. The authors conclude that, although the Mahout framework is a promising clustering platform, great effort is still needed to pre-process data and configure the algorithms in it. This paper compares the performance of the two clustering algorithms implemented by the Mahout project, i.e., K-Means and Mean Shift.

2.1 MapReduce

Relational database systems are not able to meet the performance and scalability demands of current distributed systems that manipulate big quantities of data. MapReduce is a distributed programming paradigm for processing big data sets that aims at meeting these necessities,
providing better performance and scalability [8]. Thus, the MapReduce model breaks with the organization paradigm used by relational databases, working with the concept that data can be stored in a distributed way. Systems developed in this programming paradigm are able to distribute data processing over big clusters, even when these are composed of conventional hardware. Its architecture is based on the map and reduce functions, which are commonly used in functional languages. The map function is responsible for mapping a request onto the origin of the data stored in a distributed way, while the reduce function is responsible for consolidating the results, aggregating the data that was distributed into a set that meets the initial request. In general, the nodes that compose the cluster do not have only one function; they are responsible for application processing and also for data storage. This characteristic helps to improve the performance of the MapReduce process.

2.2 Apache Hadoop

The Apache Hadoop Project [2] is an open-source framework, supported by the Apache Software Foundation, which simplifies the development of distributed systems for the cloud through the MapReduce development paradigm. Applications using the Apache Hadoop framework have an agile way to manipulate big quantities of data, keeping characteristics like scalability, performance and failure tolerance, without requiring a lot of equipment. Apache Hadoop is also able to run on low-cost conventional hardware [9]. One of the main features that increases failure tolerance [10] is that the Apache Hadoop framework libraries detect errors in the application layer, instead of relying only on the conventional hardware verification used in other systems. Thus, all computers used in the cluster can tolerate failures, even those which do not have
physical resources to do so.

2.3 Apache Mahout

Apache Mahout [11] is an open-source project from the Apache Software Foundation developed on top of the Apache Hadoop framework and the MapReduce paradigm. It aims at providing a practical mechanism for using machine-learning algorithms with high scalability and failure tolerance. Apache Mahout has clustering implementations, like K-Means, Fuzzy K-Means and Mean Shift, among others, besides tools for collaborative filtering, the main standards for mining, and evolutionary programming. Apache Mahout provides mechanisms to be used mainly in the following cases: recommendation systems based on user behavior; data classification in groups using clusters; classification using learning algorithms; and data mining to identify related items.

2.4 Clustering

Clustering, or grouping, means joining items that present similarities and dividing them into subsets (called clusters). Records presenting similarities are grouped in the same cluster. Partitioning-based clustering algorithms receive a database of N records and divide it into n partitions defined by the user. Each item is then inserted into the cluster providing the lowest distance value between the item and the cluster "center".

2.5 K-Means Algorithm

K-Means is a widely used data mining algorithm that is able to classify data by analyzing and
Fig. 1 MapReduce for clustering [14, 15].
comparing numeric values from the data itself. Thus, in order to use discrete data, they have to be mapped to numeric values so they can be classified. Considering that the K-Means classification is not based on pre-established standards and does not depend on human interference, this algorithm is classified as unsupervised. The K-Means algorithm structure is presented in Table 1 [12].

K-Means can be defined as an iterative process that minimizes the error E defined in Eq. (1):

E = Σ_{j=1}^{k} Σ_{x_n ∈ s_j} d(x_n, λ_j)    (1)

where x_n is the vector representing item number n, λ_j is the centroid of cluster s_j and d is the distance measure. The K-Means algorithm moves items between clusters until E cannot be reduced any further [13].

The K-Means implementation in Apache Mahout used in this work is divided into three phases, depicted in Fig. 1: Initial Phase, Mapping Phase and Reduction Phase.

(1) Initial Phase: the data set is divided into HDFS blocks and these blocks are replicated and transferred according to the cluster configuration. In this phase the necessary tasks are assigned and defined;

Table 1 K-Means algorithm.
Select the number of clusters k;
Select randomly k points as the initial centroids;
Associate each point to the nearest centroid;
Calculate the centroid of each group;
Associate each point to the nearest centroid;
Stop as soon as the centroids stabilize.
(2) Mapping Phase: the distance between each data sample and the centroid vector is calculated, and data are grouped according to centroid vector proximity;

(3) Reduction Phase: the average of the coordinate points is calculated for each grouping. After this procedure, a new location is given to the centroid vector based on the result of the average. This phase only finishes when the centroid grouping points have converged. The centroid configuration is used as feedback for the mapping phase.

2.6 Mean Shift Algorithm

Mean Shift [16] is a non-parametric clustering algorithm used for analyzing sets of discrete data, aiming at finding the density points. For the distances between the points, the Mean Shift algorithm uses the Gaussian kernel function presented in Eq. (2):

K(x_i − x) = e^(−c ‖x_i − x‖²)    (2)
With that function, the Mean Shift algorithm evaluates the distance between the points. The algorithm selects each of the points, bounds an area around it and calculates the position of the centroid of this density area using the function presented in Eq. (3):

m(x) = Σ_{x_i ∈ N(x)} K(x_i − x) x_i / Σ_{x_i ∈ N(x)} K(x_i − x)    (3)
Next, the algorithm moves each analyzed point to this centroid and repeats the process until the main centroids are found, thereby identifying the points of maximum density in the data set.
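For intuition, Eqs. (2) and (3) can be sketched in a few lines of plain Python. This is a minimal single-machine sketch, not Mahout's distributed implementation; the bandwidth parameter c, the convergence tolerance and the use of all points as the neighborhood N(x) are simplifying assumptions:

```python
import math

def gaussian_kernel(xi, x, c=1.0):
    # Eq. (2): K(xi - x) = e^(-c * ||xi - x||^2)
    dist_sq = sum((a - b) ** 2 for a, b in zip(xi, x))
    return math.exp(-c * dist_sq)

def mean_shift_step(x, points, c=1.0):
    # Eq. (3): kernel-weighted mean of the neighbours of x
    weights = [gaussian_kernel(xi, x, c) for xi in points]
    total = sum(weights)
    return tuple(sum(w * xi[d] for w, xi in zip(weights, points)) / total
                 for d in range(len(x)))

def mean_shift(points, c=1.0, tol=1e-6, max_iter=100):
    # Shift every point towards its local density maximum (mode).
    modes = []
    for x in points:
        for _ in range(max_iter):
            m = mean_shift_step(x, points, c)
            shift_sq = sum((a - b) ** 2 for a, b in zip(m, x))
            x = m
            if shift_sq < tol:
                break
        modes.append(x)
    return modes
```

With a sufficiently large c, points from well-separated groups converge to distinct modes; Mahout's MapReduce version parallelizes per-point work of this kind across map tasks.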
3. Experiment

The dataset chosen for this experiment was the "KDD Cup 1999 Data" [17]. This dataset was used in the "Third International Competition of Knowledge Discovery Using Data Mining Tools". In the fifth edition of this event the same dataset was used, but aiming at developing a network intrusion detector able to distinguish between invasions or
attacks and normal connections. This dataset contains a standard set of records to be audited, covering a great variety of intrusions that occurred in a military network.

3.1 Infrastructure

The Apache Mahout algorithms were run on a Dell PowerEdge R720 server, which has two Intel Xeon E5-2600 processors with 6 cores of 2.0 GHz each, totaling 12 physical cores, 64 GB of DDR3 memory at 1333 MHz with ECC integrity checking, and 4 SAS hard disks of 300 GB in RAID level 6. Tests were carried out in a virtual environment, with a 64-bit GNU/Linux distribution as the host operating system, using Linux kernel version 3.5.0, with the Citrix Free XenServer hypervisor responsible for hardware-level virtualization. Ten identical virtual machines were installed, with a 64-bit GNU/Linux operating system, using Linux kernel 3.5.0 and the Cloudera Enterprise Free system. Each of the hosts was set up with one processing core and 2 GB of RAM.

3.2 K-Means versus Mean Shift

K-Means is one of the most popular clustering algorithms. It is simple, fast and efficient. One important difference is that K-Means makes two general presuppositions: the number of clusters to be used needs to be known, and the clusters are spherically or elliptically shaped. As Mean Shift is a non-parametric algorithm, it does not assume anything about the number of clusters. K-Means is very sensitive to initialization: a wrong initialization can delay convergence or sometimes even result in wrong clusters. Mean Shift is very robust to initialization [18]. K-Means presents an O(KnT) order of growth in time, where K is the number of clusters, n is the number of points, and T is the number of
iterations. The classical Mean Shift algorithm has a high computational cost, since it is quadratic in the number of points, O(Tn²), in terms of processing time. Both algorithms were therefore selected for the tests: they are both clustering algorithms, but they have very different implementations and computational costs. This difference allows analyzing the performance of two algorithms performing similar tasks with different complexities.
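For reference, the single-node K-Means procedure of Table 1, whose cost is O(KnT) as noted above, can be sketched as follows. This is a minimal pure-Python sketch with a fixed seed for reproducibility; it is illustrative only and unrelated to Mahout's MapReduce implementation:

```python
import random

def kmeans(points, k, max_iter=100, seed=42):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids (Table 1)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: O(K*n) distance computations per iteration.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        # Update step: recompute each centroid as its cluster mean.
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # stop when the centroids stabilize
            break
        centroids = new_centroids
    return centroids, clusters
```

The assignment step is exactly what Mahout's Mapping Phase distributes, and the update step corresponds to its Reduction Phase.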
Table 2 Results with K-Means algorithm.

Size of file   Mahout, one node-Time (s)   Mahout, ten nodes (MN)-Time (s)   Gain (%)
10%            40.2                        43.9                              -8.43
50%            152.1                       49.8                              205.42
75%            200.8                       60.4                              232.5
100%           320.6                       71.2                              350.28
4. Results

This section presents the results and the performance analysis of both algorithms using the Mahout platform.

4.1 K-Means Performance Tests

Table 2 presents the results of the tests carried out with the K-Means algorithm. First, the dataset was divided into subsets with the following sizes relative to the original file: 10%, 50%, 75% and 100%. Next, the data subset with 10% of the original data, i.e., nearly 70 MB, was used. It was noticed that the framework overhead compromises the algorithm runtime in this scenario. Despite having ten nodes available, the HDFS file system works with 64 MB blocks and thus only 2 nodes were used. The initial block-division process alone (Fig. 1, "Initial Phase") produces enough overhead that the runtime on a single cloud node is better. Using half of the available data, a significant improvement in processing time can be observed: with 50% of the data, the time necessary to perform the task improved by approximately 200% compared to the performance on only one node. In this scenario, all available nodes are used and they compensate for the initial block-division overhead. Fig. 2 presents average values over seven executions of the K-Means algorithm considering different data file sizes. Table 3 presents runtime results related to the number of nodes effectively assigned.
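The limited parallelism with the 10% subset follows directly from block arithmetic: the number of map tasks is bounded by the number of HDFS input blocks. A quick sanity check, assuming the 64 MB default block size, ~70 MB for the 10% subset, and hence roughly 700 MB for the full file (the full-file size is inferred from the 10% figure, not stated in the text):

```python
import math

def hdfs_blocks(file_mb, block_mb=64):
    # Number of HDFS blocks; this bounds the number of parallel map tasks.
    return math.ceil(file_mb / block_mb)

print(hdfs_blocks(70))   # 10% subset: 2 blocks, so only 2 of the 10 nodes work
print(hdfs_blocks(700))  # full file: 11 blocks, enough to keep all 10 nodes busy
```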
Fig. 2 Average runtime considering different sizes of the file using K-Means.

Table 3 Processing time vs. number of nodes used.

Number of nodes   Processing time (s)   Gain (%)
1                 320.6                 -
2                 154.8                 207.11
3                 128.34                249.81
4                 123.12                260.40
5                 117.16                273.64
6                 105.23                304.67
7                 102.1                 314.01
8                 99.45                 322.37
9                 98.2                  326.48
10                93.1                  344.36
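The diminishing returns visible in Table 3 can be quantified directly from those runtimes (values copied from Table 3; a small illustrative computation, not part of the original experiment):

```python
# Processing times (s) from Table 3, for 1 to 10 nodes.
times = [320.6, 154.8, 128.34, 123.12, 117.16, 105.23, 102.1, 99.45, 98.2, 93.1]

# Speedup relative to a single node.
speedup = [times[0] / t for t in times]

# Seconds saved by each additional node: large at first, then marginal.
savings = [times[i] - times[i + 1] for i in range(len(times) - 1)]

print(round(speedup[-1], 2))  # ~3.44x with ten nodes
print(round(savings[0], 1))   # adding the 2nd node saves ~165.8 s
print(round(savings[-1], 1))  # adding the 10th node saves only ~5.1 s
```

The same saturation pattern appears in the Mean Shift runs (Fig. 4).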
It is possible to verify in Table 3 the performance gain with six nodes (around 300%). From this point on, a lower gain can be noted with each new node inserted. This indicates that some nodes remain inactive, waiting for the others to finish their work. This framework behavior was also observed in the results presented in Ref. [6] and considerably penalizes the algorithm performance.

4.2 Mean Shift Algorithm Performance Tests

For the Mean Shift tests, the same datasets used in the K-Means tests were employed. The results are presented in Table 4.
Table 4 Results with Mean Shift algorithm.

Size of file   Mahout, one node (UN)-Time (s)   Mahout, ten nodes (MN)-Time (s)   Gain (%)
10%            50.45                            64.35                             -21.60
50%            192.19                           59.45                             223.28
75%            350.12                           92.45                             278.71
100%           390.60                           103.10                            278.86
Similarly to the K-Means experiment, with few data, i.e., 10% of the original file, the framework overhead compromises the runtime, as shown in Fig. 3. Due to the Hadoop framework overhead, with 10% of the original file there is a loss of approximately 21% of performance compared to sequential execution. As Fig. 3 shows, with 50% of the file size the performance gain over sequential execution is 223.28%. Fig. 4 shows the execution with 100% of the file while varying the number of nodes. As Fig. 4 shows, from six nodes onward, the gain from each new node included is reduced.
Fig. 3 Average runtime considering different sizes of the file using Mean Shift.
Fig. 4 Runtime per set of nodes considering the total dataset using Mean Shift.
5. Discussion

The tests using the K-Means algorithm showed the better performance results. In both sets of tests, it is observed that for a small data set the cost of running the algorithm in parallel produces a performance loss compared to sequential execution. Fig. 5 presents the comparison between K-Means and Mean Shift performance. As shown in Fig. 5, K-Means outperforms Mean Shift, mainly from 50% of the file size onward.
6. Conclusions and Future Works

This work presented the analysis and results of two clustering algorithms using the Mahout framework, an open-source project of the Apache Software Foundation. During the tests it was observed that using the platform is non-trivial, a conclusion also reached by Esteves et al. [6]. In terms of algorithm efficiency, a significant gain is noted when compared to executions using single-core implementations. However, the lack of
Fig. 5 Average K-Means vs. Mean Shift runtime (one node).
documentation of the algorithm implementations leads to doubts regarding the degree of optimization in the Mahout framework. A private cloud was used, which considerably reduces failure problems. Thus, the main objective was to verify the gain in processing time obtained with the inclusion of new nodes in the cluster. It became evident that there is a problem with CPU idle time when the number of cluster nodes increases. This idleness has a direct impact on the efficiency of the platform. The Mahout framework presents itself as a promising clustering tool. However, it still needs
investments in documentation, examples and tools for pre-processing data. As future work, it would be interesting to expand the number of nodes used in the experiments, besides comparing the Mahout framework's performance on other classes of data mining algorithms, such as association rule algorithms and stochastic algorithms. The CPU idleness issue as new nodes are inserted into the cluster is also an interesting topic for future research.
References
[1] B. Jing, Z. Zhiliang, T. Ruixiong, W. Qingbo, Dynamic provisioning modeling for virtualized multi-tier applications in cloud data center, in: IEEE International Conference on Cloud Computing, Miami, Florida, July 5-10, 2010.
[2] Apache Foundation, Apache Hadoop Home Page, http://hadoop.apache.org.
[3] H. Jiawei, M. Kamber, Data Mining: Concepts and Techniques, 2nd ed., Elsevier, MA, 2006.
[4] K. Ericson, S. Pallickara, On the performance of high dimensional data clustering and classification algorithms, Journal of Future Generation Computer Systems 29 (2012) 1024-1034.
[5] Y. Yuan, L. Wen-Cai, Efficient resource management for cloud computing, in: International Conference on System Science, Engineering Design and Manufacturing Informatization, Guiyang, Oct. 22-23, 2011, pp. 233-236.
[6] R.M. Esteves, R. Pais, C. Rong, K-Means clustering in the cloud: A Mahout test, in: IEEE Workshops of International Conference on Advanced Information Networking and Applications, Biopolis, Mar. 22-25, 2011, pp. 514-519.
[7] R.M. Esteves, R. Chunming, Using Mahout for clustering Wikipedia's latest articles: A comparison between K-Means and fuzzy C-Means in the cloud, in: IEEE International Conference on Cloud Computing Technology and Science, Athens, Nov. 29-Dec. 01, 2011.
[8] J. Dean, S. Ghemawat, MapReduce: Simplified data processing on large clusters, Communications of the ACM 51 (2008) 107-113.
[9] T. White, Hadoop: The Definitive Guide, 3rd ed., O'Reilly, 2012.
[10] Hadoop Documentation, The Hadoop Distributed File System: Architecture and Design [Online], http://hadoop.apache.org/docs/r0.18.0/hdfs_design.html.
[11] S. Owen, R. Anil, T. Dunning, E. Friedman, Mahout in Action, Manning Publications, 2012.
[12] J.B. MacQueen, Some methods for classification and analysis of multivariate observations, in: Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281-297.
[13] F. Ricci, L. Rokach, B. Shapira, P.B. Kantor, Recommender Systems Handbook, Springer, New York, 2011.
[14] O. Maimon, L. Rokach, Introduction to Knowledge Discovery and Data Mining, Springer, New York, 2010.
[15] J. Ekanayaka, S. Pallickara, MapReduce for data intensive scientific analyses, in: IEEE International Conference on E-Science, Indianapolis, IN, Dec. 7-12, 2008, pp. 277-284.
[16] K. Fukunaga, L.D. Hostetler, The estimation of the gradient of a density function, with applications in pattern recognition, IEEE Transactions on Information Theory 21 (1975) 32-40.
[17] KDD Cup 1999 Data [Online], University of California, 1999, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
[18] D. Comaniciu, P. Meer, Mean shift: A robust approach toward feature space analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002) 603-619.