Parallelization of a Hierarchical Data Clustering Algorithm using OpenMP

Panagiotis E. Hadjidoukas1,* and Laurent Amsaleg2

1 Department of Computer Science, University of Ioannina, Ioannina, Greece
  [email protected]
2 IRISA/INRIA, Campus de Beaulieu, 35042 Rennes cedex, France
  [email protected]

* This work was done while the first author was a postdoctoral researcher at IRISA/INRIA, Rennes.

Abstract. This paper presents a parallel implementation of CURE, an efficient hierarchical data clustering algorithm, using the OpenMP programming model. OpenMP provides a means of transparently managing the asymmetry and non-determinism in CURE, while our OpenMP runtime support enables the effective exploitation of its irregular, nested loop-level parallelism. Experimental results for various problem parameters demonstrate the scalability of our implementation and its effective utilization of parallel hardware, which enable the use of CURE for large data sets.

1 Introduction

Data clustering is one of the fundamental techniques in scientific data analysis and data mining. The clustering problem is to partition a data set into segments (called clusters) so that intra-cluster data are similar and inter-cluster data are dissimilar. Clustering algorithms are computationally demanding and thus require high-performance machines to deliver results in a reasonable amount of time.

In this paper, we present a parallel implementation of CURE (Clustering Using REpresentatives) [4], a well-known hierarchical data clustering algorithm, using OpenMP [2]. CURE is a very efficient clustering algorithm with respect to the quality of clusters: it can identify arbitrarily shaped clusters and handle high-dimensional data. However, its worst-case time complexity is O(n² log n), where n is the number of points to be clustered. Although sampling and partitioning allow CURE to handle larger data sets, the algorithm is not applicable to today's huge databases because of this quadratic time complexity.

Our general goal is the development of an efficient parallel data clustering algorithm that targets shared-memory multiprocessors, clusters of computers, and computational grids. This paper focuses only on the shared-memory architecture. Although CURE provides high-quality clustering, a parallel version was not available, owing to the asymmetric and non-deterministic parallelism of the clustering algorithm. OpenMP, however, successfully resolves these issues through the dynamic assignment of parallel tasks to processors. In addition, our previous research work has already produced a portable OpenMP environment that supports multiple levels of parallelism very efficiently. We are therefore able to exploit nested parallelism in order to achieve load balancing. Our experimental results demonstrate significant performance gains for PCURE, the parallel version of CURE, and effective utilization of the shared-memory architecture, both on small-scale SMPs and on high-performance multiprocessors.

A first survey of parallel algorithms for hierarchical clustering using distance-based metrics is given in [9]. Most parallel data clustering approaches target distributed-memory multiprocessors and are implemented with message passing [1, 7, 8, 10, 11]. None of them has been applied to a pure hierarchical data clustering algorithm. As we show experimentally, static parallelization approaches are not applicable to CURE, due to the non-deterministic behavior of its algorithm. In addition, message passing would require significant programming effort to handle the highly irregular and unpredictable data access patterns in CURE.

The rest of this paper is organized as follows: Section 2 presents our modifications to the main clustering algorithm of CURE and its parallelization using OpenMP directives. Experimental results are reported in Section 3. Finally, Section 4 presents some conclusions and our ongoing work.
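As a brief illustration of the two OpenMP features this work relies on, the sketch below combines a dynamically scheduled outer loop with a nested inner parallel region. It is a minimal, hypothetical example: the work() function and all constants are stand-ins, not code from PCURE.

#include <omp.h>
#include <stdio.h>

#define N 512

/* Hypothetical stand-in for an asymmetric unit of work. */
static double work(int i, int j) { return (double)(i % (j + 1)); }

int main(void)
{
    double sum = 0.0;

    omp_set_nested(1);  /* enable nested parallel regions */

    /* Dynamic scheduling assigns iterations to threads on demand,
     * which balances asymmetric, non-deterministic workloads. */
    #pragma omp parallel for schedule(dynamic) reduction(+:sum)
    for (int i = 0; i < N; i++) {
        double partial = 0.0;
        /* Second, nested level of loop parallelism. */
        #pragma omp parallel for reduction(+:partial)
        for (int j = 0; j < N; j++)
            partial += work(i, j);
        sum += partial;
    }

    printf("sum = %f\n", sum);
    return 0;
}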

2 CURE Data Clustering Algorithm

2.1 Introduction

CURE is a bottom-up hierarchical clustering algorithm, but instead of using a centroid-based approach it employs a method based on choosing a well-formed group of points to identify the distance between clusters. For each intermediately computed cluster, CURE chooses a constant number c of well-scattered points. These points are used to identify the shape and size of the cluster. The next step of the algorithm shrinks the selected points towards the centroid of the cluster by a predetermined fraction a. Varying this fraction between 0 and 1 helps CURE identify different types of clusters. Using the shrunken positions of these points to identify each cluster, the algorithm then finds the clusters with the closest pairs of identifying points. These clusters are the ones merged by the hierarchical algorithm. Merging continues until the number of clusters desired by the user, k, remains. A k-d tree is used to store information about the clusters and the points that belong to them.

Figure 1 outlines the main clustering algorithm. Since CURE is a hierarchical agglomerative algorithm, every data point is initially considered a separate cluster with one representative, the point itself. The algorithm first computes the closest cluster for each cluster. Next, it starts the agglomerative clustering, merging the closest pair of clusters until only k clusters remain. According to the merge procedure, the centroid of the new cluster is the weighted mean of the two merged clusters. Moreover, to reduce the time complexity of the algorithm, the authors propose an improved merge procedure in which the c new representative points are chosen among the 2c representative points of the two merged clusters. A sketch of this merge step in code follows the figure.

1. Initialization: compute distances and find the nearest-neighbor pairs for all clusters.
2. Clustering: perform hierarchical clustering until the predefined number of clusters k has been computed:
   While (number of remaining clusters > k) {
     a. Find the pair of clusters with the minimum distance
     b. Merge them:
        i.   new_size = size1 + size2
        ii.  new_centroid = a1*centroid1 + a2*centroid2,
             where a1 = size1/new_size and a2 = size2/new_size
        iii. Find the c new representative points
     c. Update the nearest-neighbor pairs for the clusters
     d. Reduce the number of remaining clusters
     e. If conditions are satisfied, apply pruning of clusters
   }
3. Output the representative points of each cluster.

Fig. 1. Outline of CURE
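As a concrete reading of step 2b in Fig. 1, the following minimal C sketch computes the weighted centroid of a merged cluster and shrinks a scattered point toward it by the fraction a. The function names, the DIM constant, and the flat point layout are illustrative assumptions, not the authors' code.

#define DIM 2  /* dimensionality of the data points (assumed) */

/* Step 2b.i-ii: the new centroid is the size-weighted mean of the
 * centroids of the two merged clusters. */
void merge_centroids(const double *c1, int size1,
                     const double *c2, int size2,
                     double *c_new)
{
    int new_size = size1 + size2;
    double a1 = (double)size1 / new_size;
    double a2 = (double)size2 / new_size;
    for (int d = 0; d < DIM; d++)
        c_new[d] = a1 * c1[d] + a2 * c2[d];
}

/* Shrinking a well-scattered point p toward the centroid by the
 * fraction a (0 < a < 1), as done when selecting representatives. */
void shrink_toward_centroid(double *p, const double *centroid, double a)
{
    for (int d = 0; d < DIM; d++)
        p[d] += a * (centroid[d] - p[d]);
}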

The worst-case time complexity of CURE is O(n² log n), where n is the number of points to be clustered. To allow CURE to handle very large data sets, it operates on a random sample of the database. Sampling improves the performance of the algorithm, since the sample can be designed to fit in main memory, thus eliminating significant I/O costs; it also contributes to the filtering of outliers. To speed up the clustering process when the sample size increases, CURE partitions the random sample and partially clusters the data points within each partition. Instead of using a centroid to label the clusters, multiple representative points are used, and each data point is assigned to the cluster with the closest representative point, as the sketch below illustrates. The use of multiple points enables the algorithm to identify arbitrarily shaped clusters. Empirical work with CURE showed that the algorithm is insensitive to outliers and can identify clusters with interesting shapes. Moreover, sampling and partitioning speed up the clustering process without sacrificing cluster quality.
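A hedged sketch of that labeling scan follows; the flattened reps layout and all identifiers are assumptions made for illustration.

#include <float.h>

#define DIM 2  /* dimensionality of the data points (assumed) */

static double sqdist(const double *x, const double *y)
{
    double s = 0.0;
    for (int d = 0; d < DIM; d++) {
        double diff = x[d] - y[d];
        s += diff * diff;
    }
    return s;
}

/* Return the index of the cluster whose representative is closest to
 * point p. reps holds k clusters times c representatives, flattened:
 * the j-th representative of cluster i starts at reps + (i*c + j)*DIM. */
int label_point(const double *p, const double *reps, int k, int c)
{
    int best = 0;
    double best_d = DBL_MAX;
    for (int i = 0; i < k; i++)
        for (int j = 0; j < c; j++) {
            double d = sqdist(p, reps + (i * c + j) * DIM);
            if (d < best_d) { best_d = d; best = i; }
        }
    return best;
}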

2.2 Implementation

Our parallel implementation of CURE was inspired by the source code of Dr. Han and has been enhanced to handle large data sets. The algorithm uses a linear array of records that keeps information about the size, the centroid, and the representative points of each cluster (a possible record layout is sketched below). Taking into consideration the improved procedure for merging clusters, and the fact that the labeling of data is a separate process, this per-cluster information is all that the clustering phase needs to maintain.
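A record layout consistent with this description might look as follows; the field names, the fixed constants, and the cached nearest-neighbor fields are assumptions for illustration, not the authors' actual declaration.

#define DIM   2   /* dimensionality of the data points (assumed) */
#define C_REP 10  /* representative points per cluster (assumed) */

/* One record per cluster, stored in a linear array. The nearest
 * neighbor of each cluster is cached so that the clustering loop can
 * quickly locate the closest pair. */
typedef struct {
    int    size;              /* number of points in the cluster    */
    double centroid[DIM];     /* centroid of the cluster            */
    double rep[C_REP][DIM];   /* shrunken representative points     */
    int    nnb;               /* index of the nearest neighbor      */
    double nnb_dist;          /* distance to that nearest neighbor  */
} cluster_t;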

[Listing: init_nnbs(), the initialization routine that computes pairwise distances and finds the nearest neighbor of every cluster; the original 24-line listing is truncated in this copy, and only its opening, "init_nnbs () { for (i=0; i", survives.]
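In its place, here is a hedged reconstruction of what such an initialization routine plausibly does, reusing the hypothetical cluster_t record sketched above. For brevity it measures inter-cluster distance between centroids, whereas CURE proper measures it between representative points; it is not the authors' original listing.

#include <float.h>

extern cluster_t clusters[];  /* linear array of cluster records */
extern int nclusters;         /* number of remaining clusters    */

/* Squared distance between two cluster centroids (simplification:
 * CURE measures cluster distance between representative points). */
static double cluster_dist(const cluster_t *a, const cluster_t *b)
{
    double s = 0.0;
    for (int d = 0; d < DIM; d++) {
        double diff = a->centroid[d] - b->centroid[d];
        s += diff * diff;
    }
    return s;
}

/* Initialization (Fig. 1, step 1): for every cluster, scan all other
 * clusters and cache the nearest one; O(n^2) distance computations.
 * The outer loop is the natural target for OpenMP parallelization. */
void init_nnbs(void)
{
    for (int i = 0; i < nclusters; i++) {
        clusters[i].nnb = -1;
        clusters[i].nnb_dist = DBL_MAX;
        for (int j = 0; j < nclusters; j++) {
            if (j == i) continue;
            double d = cluster_dist(&clusters[i], &clusters[j]);
            if (d < clusters[i].nnb_dist) {
                clusters[i].nnb_dist = d;
                clusters[i].nnb = j;
            }
        }
    }
}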
