Speeding up k-means Clustering by Bootstrap Averaging

Ian Davidson and Ashwin Satyanarayana
Computer Science Dept, SUNY Albany, NY, USA, 12222
{davidson, ashwin}@cs.albany.edu
Abstract

K-means clustering is one of the most popular clustering algorithms used in data mining. However, clustering is a time-consuming task, particularly with the large data sets found in data mining. In this paper we show how bootstrap averaging with k-means can produce results comparable to clustering all of the data, but in much less time. The approach of bootstrap (sampling with replacement) averaging consists of running k-means clustering to convergence on small bootstrap samples of the training data and averaging similar cluster centroids to obtain a single model. We show why our approach should take less computation time and empirically illustrate its benefits. We show that the performance of our approach is a monotonic function of the size of the bootstrap sample. However, knowing what size of bootstrap sample yields results as good as clustering the entire data set remains an open and important question.

1. Introduction and Motivation

Clustering is a popular data mining task [1], with k-means clustering being a common algorithm. However, since the algorithm is known to converge to local optima of its loss/objective function and is sensitive to initial starting positions [8], it is typically restarted from many initial starting positions. This results in a very time-consuming process, and many techniques are available to speed up the k-means clustering algorithm, including preprocessing the data [2], parallelization [3], and intelligently setting the initial cluster positions [8].

In this paper we propose an alternative approach to speeding up k-means clustering known as bootstrap averaging. This approach is complementary to other speed-up techniques such as parallelization. Our approach builds multiple models by creating small bootstrap samples of the training set and building a model from each, but rather than aggregating like bagging [4], we average similar cluster centers to produce a single model that contains k clusters. In this paper we shall focus on bootstrap samples that are smaller than the training data size. This produces results that are comparable with multiple random restarting of k-means clustering using all of the training data, but takes far less computation time. For example, when we take T bootstrap samples, each of size 25% of the training data set, the technique takes at least four times less computation time but yields results as good as if we had randomly restarted k-means T times using all of the training data. To test the effectiveness of bootstrap averaging, we apply clustering in two popular settings: finding representative clusters of the population, and prediction.

Our approach yields a speedup for two reasons: firstly, we are clustering less data, and secondly, the k-means algorithm converges (using standard tests) more quickly for smaller data sets than for larger data sets from the same source/population. It is important to note that we do not need to restart our algorithm many times for each bootstrap sample. Our approach is superficially similar to Bradley and Fayyad's initial point refinement (IPR) [8] approach, which: 1) sub-samples the training data, 2) clusters each sub-sample, and 3) clusters the resultant cluster centers many times to generate refined initial starting positions for k-means. However, we shall show that there are key differences between our method and their clever alternative to randomly choosing starting positions.

We begin this paper by introducing the k-means algorithm and exploring its computational behavior. In particular, we show and empirically demonstrate why clustering smaller sets of data leads to faster convergence than clustering larger sets of data from the same data source/population. We then introduce our bootstrap averaging algorithm, after which we discuss our experimental methodology and results. We show that for bootstrap samples smaller than the original training data set, our approach performs as well as standard techniques but in far less time. We then discuss the related Bradley and Fayyad IPR technique and its differences from our own work. Finally, we conclude and discuss future work.
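To make the procedure described above concrete, here is a minimal Python sketch of bootstrap averaging under stated assumptions; it is not the paper's implementation. It relies on scikit-learn's KMeans and SciPy's Hungarian solver, the function name bootstrap_average_kmeans and its parameter defaults are illustrative, and pairing "similar" centroids by a minimum-cost assignment against one reference model is our own stand-in, since this excerpt does not spell out the exact matching rule.

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def bootstrap_average_kmeans(X, k, T=10, sample_frac=0.25, seed=0):
    """Cluster T small bootstrap samples of X and average matched centroids."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    models = []
    for _ in range(T):
        # Bootstrap sample: draw with replacement, smaller than the full data set.
        idx = rng.integers(0, n, size=int(sample_frac * n))
        # A single k-means start per sample; no multiple restarts are needed.
        km = KMeans(n_clusters=k, n_init=1).fit(X[idx])
        models.append(km.cluster_centers_)

    # Align each model's centroids to a reference model with a minimum-cost
    # (Hungarian) assignment on pairwise centroid distances.
    reference = models[0]
    aligned = [reference]
    for centers in models[1:]:
        cost = np.linalg.norm(reference[:, None, :] - centers[None, :, :], axis=2)
        _, cols = linear_sum_assignment(cost)
        aligned.append(centers[cols])

    # Average the matched ("similar") centroids to obtain one model with k clusters.
    return np.mean(aligned, axis=0)

For example, bootstrap_average_kmeans(X, k=3, T=10, sample_frac=0.25) returns a single k-by-m array of averaged centroids built from ten bootstrap samples, each containing 25% of the data.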
2. Background to k-means Clustering

Consider a set of data containing n instances/observations, each described by m attributes. The k-means clustering problem is to divide the n instances into k clusters, with the clusters partitioning the instances (x1, ..., xn) into the subsets Q1, ..., Qk. The subsets can be summarized as points (C1, ..., Ck) in the m-dimensional space, commonly known as centroids or cluster centers, whose co-ordinates are the averages of all points belonging to their subsets. We shall refer to this collection of centroids obtained from an application of the clustering algorithm as a model. K-means clustering can also be thought of as vector quantization, with the aim being to minimize the vector quantization error (also known as the distortion) shown in equation (1). The mathematically trivial solution is to have a cluster for each instance, but typically k ≪ n.
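Equation (1) itself is not reproduced in this excerpt. Under the assumption that it is the standard k-means vector quantization error, the distortion being minimized can be written (in LaTeX) as:

\mathrm{Distortion}(C_1, \ldots, C_k) = \sum_{j=1}^{k} \sum_{x_i \in Q_j} \lVert x_i - C_j \rVert^{2}

that is, the sum of squared distances between every instance and the centroid of the subset Q_j to which it is assigned.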