Efficient Single-Linkage Hierarchical Clustering based on Partitioning

Mohamed A. Mahfouz
PhD, Department of Computer and Systems Engineering
Faculty of Engineering, Alexandria University
Alexandria, Egypt
[email protected]

Mohamed A. Mahfouz is with the Computer Engineering Department, Elbehira Higher Institute of Engineering and Technology.

Abstract— Ease of interpretation of results makes hierarchical clustering algorithms suitable for many applications. However, they suffer from high computational complexity. Single-linkage hierarchical clustering, with its related graph-theoretical terms, is the most famous among hierarchical algorithms. Several existing techniques reduce its complexity by formulating it as a minimum spanning tree (MST) problem. The main contribution of this research study is minimizing the number of edges given as input to the MST algorithm by first partitioning the dataset using any scalable clustering technique and then building a k-nearest-neighbors table for each partition. After that, outliers and border objects are identified in each partition and moved to a separate partition. Objects in the outliers' partition are used in updating the k-nearest-neighbors tables of that partition and the other partitions. The final edges in the k-nearest-neighbors tables of all partitions are sorted in descending order and then merged. The final sorted edges are given as input to an MST algorithm. Several experiments are carried out in order to validate the idea, fine-tune the input parameters, and gain insight into its performance. The proposed algorithm can be implemented completely in parallel, which can improve the performance by a significant factor.

Keywords— clustering; hierarchical clustering; spanning trees; nearest neighbors; parallel computing.

I. INTRODUCTION


Clustering techniques are computational methods that aim to partition input data into groups such that objects in the same group are more similar to each other than to objects in other groups [1]. Compactness and separation are required properties of the output groups (clusters). Cluster analysis has been successfully applied to several research areas such as bioinformatics and web mining. Partitioning clustering (either medoid-based or centroid-based) and hierarchical clustering (either agglomerative or divisive) are the two main categories of clustering algorithms [2]. While in medoid-based algorithms such as CLARANS [3] a cluster is represented by the object that is most centrally located in the cluster, in centroid-based algorithms such as k-means [2] a cluster is represented by the average of its objects. Also, while agglomerative methods progressively merge objects according to their degree of similarity, divisive methods start with the whole dataset as one cluster and progressively subdivide it. Both agglomerative and divisive methods are able to explore the input data as a hierarchy of clusters and do not require the number of clusters to be known in advance as partitioning algorithms do. However, they suffer from high computational cost and from an inability to recover from early bad decisions made while building the hierarchy. Other variations of hierarchical clustering methods that tackle these problems are based on either cluster proximity or cluster interconnectivity or both [4]. Several linkage criteria can be used in building the hierarchy (dendrogram), such as single linkage, complete linkage, or average linkage.

In the agglomerative single-linkage clustering algorithm (SLCA), at each step the two clusters containing the closest (most similar) pair of objects, one from each cluster, are combined. The main drawback of this method is that objects at opposite ends of the resulting long, thin clusters may be much farther from each other than from objects of other clusters. A recent scalable hierarchical algorithm for sequence clustering based on the single-linkage strategy is found in [5]. Most recent work on reducing the complexity of the single-linkage algorithm formulates the problem as a minimum spanning tree (MST) problem and develops parallel and distributed implementations of state-of-the-art algorithms (Borůvka, PRIM, and KRUSKAL) for solving the MST problem [6-8]. An early online single-linkage hierarchical clustering algorithm that tries to reduce the complexity of the single-linkage algorithm by using a k-nearest-neighbors table is found in [9]. The k-nearest-neighbors table (knn-table) is continuously updated on the arrival of each new object. Using simulation, the authors of [9] show that the list of k nearest neighbors (knn-table) is sufficient for building the dendrogram produced by the single-linkage strategy with very high accuracy. The value of k is found to be very small compared to the size of the dataset. However, the cost of computing the list of nearest neighbors is very high, O(kn²), where n is the number of objects. This motivates us to look for a new method, based on partitioning, for efficiently computing the knn-table of the input dataset. The selected edges in the computed knn-table are merge-sorted and given to the MST algorithm as input.


The remainder of this paper is organized as follows. Section II presents related work and preliminaries. Section III discusses the proposed technique. Section IV discusses experimental results. Finally, Section V concludes the paper and highlights future research directions.


II. PRELIMINARIES


This section provides supporting material to help the reader better understand the next sections. Commonly used abbreviations and symbols are listed in Table I.

TABLE I. ABBREVIATIONS AND SYMBOLS USED IN THE TEXT

Symbol      Description
X           Dataset matrix of size n objects by m features
x, y        Used to denote objects of the input data matrix X
d(x, y)     Distance between x and y
s(x, y)     Similarity between x and y
xij         The j-th feature value of the i-th object in the dataset X
knn-table   A table of n×k cells; each cell (j, s(xi, xj)) in row i represents a nearest neighbor xj of xi along with the similarity
c           Number of clusters (partitions)
k           The number of entries kept in the knn-table for each object
Nj(xi)      The list of nearest neighbors of object xi in cluster cj
SLCA        Single-Linkage Hierarchical Clustering Algorithm
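To make this structure concrete, a minimal sketch of one possible in-memory layout follows (an assumption for illustration; the paper does not prescribe an implementation): two parallel n×k arrays, one holding neighbor indices and one holding the stored similarities.

import numpy as np

def empty_knn_table(n, k):
    # row i will hold the k nearest neighbors of object x_i
    idx = np.full((n, k), -1, dtype=np.int64)  # neighbor index j of each cell
    sim = np.full((n, k), -np.inf)             # similarity s(x_i, x_j) of each cell
    return idx, sim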

A. Minimum Spanning Tree and single-linkage clustering

The input to an MST algorithm is a weighted graph, and the output is a tree that connects all the vertices together with minimum total weight for its edges. The solution proposed in [10] is one of the fastest sequential algorithms to compute the MST; it runs in O(mα(n)) time, where m is the number of edges and α is a function with very slow growth that can be considered a constant less than 4.

Several research studies work on reducing the complexity of single-linkage clustering by formulating it as a minimum spanning tree (MST) problem. Recent efficient algorithms for solving the MST problem are parallel and distributed implementations of early MST algorithms (Borůvka, PRIM, and KRUSKAL) [6-8]. In [6], the whole set of ordered edges is assigned to a main thread and is also evenly divided into k partitions, each assigned to a helper thread. A helper thread is responsible for testing for cycles locally, while the main thread tests for cycles globally and produces the final set of edges. In [7], the entire dataset is partitioned into equally sized partitions; PRIM's algorithm is applied on each partition, and then KRUSKAL's algorithm is applied to merge the outputs of the partitions. The final ordered output edges of the MST algorithm correspond to the edges of the dendrogram of the single-linkage algorithm. In [8], a Spark-based system is proposed in which a divide-and-conquer approach is used: KRUSKAL's algorithm is applied on each subset of the dataset to compute its local MST, and finally the local MSTs are merged to form a single MST for the whole dataset.

Most of these algorithms work on the complete graph constructed from the whole input dataset, which corresponds to the dissimilarity matrix of the input dataset. As the input data become very large, these solutions fail to scale well. Recently, a few works use only the edges that correspond to the k-nearest-neighbors graph [11]; however, they do not partition the input data. In this research study we try to merge two ideas: partitioning the data using a clustering algorithm, and the use of the k nearest neighbors in constructing the MST. The dataset is partitioned using CLARANS and then a k-nearest-neighbors table (knn-table) is built for each partition. Outliers and border objects are identified in each partition and moved to a separate partition. Objects in the outlier partition are used in updating the k-nearest-neighbors tables of overlapping partitions. The edges of each knn-table are sorted in descending order and then merged. The final sorted edges are given as input to an MST algorithm.
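As an illustration of this formulation (a sketch, not the code of [6-8]): Kruskal's algorithm with union-find consumes an ascending distance-sorted edge list, and the sequence of accepted edges is exactly the merge order of the single-linkage dendrogram.

def kruskal_single_linkage(n, sorted_edges):
    """sorted_edges: iterable of (dist, i, j), ascending by dist.
    Returns the MST edges, i.e. the single-linkage merge sequence."""
    parent = list(range(n))

    def find(a):
        # union-find root lookup with path halving
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    merges = []
    for dist, i, j in sorted_edges:
        ri, rj = find(i), find(j)
        if ri != rj:                  # edge joins two clusters: a dendrogram merge
            parent[ri] = rj
            merges.append((dist, i, j))
            if len(merges) == n - 1:  # spanning tree complete
                break
    return merges

When the edge list contains only the k-nearest-neighbors edges rather than the complete graph, the same loop applies; fewer than n - 1 merges may be produced if the k-NN graph happens to be disconnected.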

B. Partitioning the input dataset using CLARANS

In this research study the CLARANS algorithm [3] is used in the partitioning stage of the proposed algorithm. CLARANS is a medoid-based clustering algorithm built on randomized search. After creating initial clusters, CLARANS tries to find a better solution by randomly choosing one of the k medoids (representatives) and trying to replace it with one of the other (n - k) objects chosen at random. If the objective function to be minimized is lower for the new solution than for the existing one, the new solution is accepted and the search restarts from it. The search stops when the number of consecutive failed trials exceeds a threshold termed max_trials. CLARANS finds a locally optimal solution; it can be run several times starting from different initial solutions, and the best of the local optima found is chosen. In this research study CLARANS is used to initially partition the input dataset. Fuzzy and semi-fuzzy implementations of CLARANS can be found in [12] and [13] respectively, but since we target efficiency, the hard version is preferred.
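The randomized search just described can be sketched as follows (illustrative only; the brute-force cost function and the dist interface are assumptions, and a practical CLARANS would evaluate swaps incrementally):

import random

def clarans(dist, n, num_medoids, max_trials, seed=None):
    """Sketch of CLARANS' randomized medoid-swap search.
    dist(i, j) returns the distance between objects i and j."""
    rng = random.Random(seed)

    def cost(medoids):
        # total distance of every object to its nearest medoid
        return sum(min(dist(i, m) for m in medoids) for i in range(n))

    medoids = rng.sample(range(n), num_medoids)
    best = cost(medoids)
    failed = 0
    while failed < max_trials:
        out = rng.choice(medoids)                      # medoid to swap out
        cand = rng.choice([i for i in range(n) if i not in medoids])
        trial = [cand if m == out else m for m in medoids]
        c = cost(trial)
        if c < best:                                   # better solution: move to it
            medoids, best, failed = trial, c, 0
        else:
            failed += 1                                # count consecutive failures
    return medoids

As the text notes, running this search several times from different initial solutions and keeping the cheapest result guards against poor local optima.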

C. Scalability using Partitioning

The algorithm proposed here tries to reduce complexity by using partitioning, similar to the idea in [14]. In [14], the computational complexity of DBSCAN is reduced by first partitioning the dataset and examining dense regions in each partition; the dense regions found are then merged to reach the final natural number of clusters. In our problem, we find that by partitioning the input dataset, most of the k nearest neighbors can be found locally in each partition instead of searching the whole dataset for each object. Other scalability techniques may rely on random sampling, compression, etc.
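A brute-force sketch of that observation (assuming Euclidean data and that k is smaller than each partition size): restricting the neighbor search to each partition replaces one n×n distance computation with several much smaller ones.

import numpy as np

def local_knn_tables(X, labels, k):
    """X: (n, m) data matrix; labels: (n,) partition id of each object.
    Returns, per partition, global neighbor indices and distances."""
    tables = {}
    for p in np.unique(labels):
        members = np.where(labels == p)[0]
        A = X[members]
        # pairwise distances within the partition only
        D = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2)
        np.fill_diagonal(D, np.inf)            # an object is not its own neighbor
        order = np.argsort(D, axis=1)[:, :k]   # k closest per row (local indices)
        tables[p] = (members, members[order],
                     np.take_along_axis(D, order, axis=1))
    return tables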


D. Online Single Linkage Algorithm

In [9], a simple data structure is used in order to keep the k nearest objects of each object. The k nearest neighbors are computed online for each incoming object; besides that, the k-nearest lists of previous objects need to be updated in accordance with the new object. The use of the k nearest neighbors reduces the computational complexity of the traditional single-linkage algorithm from O(n³) to O(kn²), where n is the total number of objects. Also, the memory required for keeping the whole dissimilarity matrix in the traditional algorithm is reduced from n² to k·n. The complexity of this algorithm is still high, and several more efficient algorithms exist; however, the main advantage of this algorithm is that it is incremental. After computing the k-nearest table, the set of edges corresponding to the entries of the table is sorted ascending according to the corresponding distances, and then the MST algorithm is applied. Several experimental studies on small datasets are carried out in [9] to estimate a suitable value for k. In this research study these experiments are extended in order to estimate k for larger datasets, and the computational complexity is further reduced.
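A minimal sketch of that incremental maintenance, in the spirit of [9] (the dictionary-based layout here is an assumption for illustration, not the data structure of [9]):

def insert_object(knn, dist_to_new, new_id, k):
    """knn: dict mapping object id -> ascending list of (dist, neighbor_id).
    dist_to_new: dict mapping each existing id -> its distance to new_id."""
    # the newcomer's own list: its k closest existing objects
    knn[new_id] = sorted((d, j) for j, d in dist_to_new.items())[:k]
    # existing objects adopt the newcomer if it beats their k-th neighbor
    for j, d in dist_to_new.items():
        lst = knn[j]
        if len(lst) < k:
            lst.append((d, new_id))
            lst.sort()
        elif d < lst[-1][0]:
            lst[-1] = (d, new_id)
            lst.sort()

Each arrival costs n distance evaluations plus up to n list updates of O(k) each, which over n arrivals matches the O(kn²) bound noted above.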



III. PROPOSED ALGORITHM


Fig. 1 shows the stages of the proposed algorithm. The main work is in the stages that precede applying the MST algorithm. In this research study, the main objective is to show the efficiency and effectiveness of producing the k-nearest-neighbors table by first partitioning the dataset using CLARANS. The computed table is merge-sorted and given as input to the MST algorithm instead of the whole graph. KRUSKAL's algorithm is used for computing the MST.

Fig. 1. Layout of the proposed algorithm

TABLE II. STEPS OF THE PROPOSED ALGORITHM

Input:
  X: dataset matrix of size n by m, or its corresponding distance matrix S
  k: the number of nearest neighbors (default 7)
  max_trials: randomized-search parameter of CLARANS
  α: defuzzification threshold (dataset dependent)
  c: number of partitions (default n/1000)
Output:
  E: sorted list of the k-nearest edges of all objects
  T: dendrogram of SLCA for X
Begin
  1. Apply CLARANS to partition the input dataset into c partitions, where n is the size of the input dataset
  2. Build a knn-table for each partition (k = 7)
  3. Compute the mean and standard deviation of the average distance of objects for each knn-table
  4. Move all objects whose average distance to their neighbors is greater than 3 standard deviations to a new partition, called the outlier partition
  5. Compute a new medoid for each partition after removing the outliers
  6. Compute the distance of each outlier object xj to the medoids of all partitions and identify the partition i with minimum distance dmj
  7. For each partition i and outlier object j such that (dmj - dij) < α
     Begin
       Update the knn-table of partition i by object j
       Update every entry h in the knn-table of the outlier partition such that h < j
     End
  8. Compute the MST after merging the sorted edges of the corresponding knn-tables of each partition, including the outlier partition
End.


Table II shows the steps of the proposed algorithm. After the input of the required parameters (the number of nearest neighbors k, the CLARANS parameter max_trials, and a defuzzification threshold α), the first step is to partition the data into partitions with an average size of 1000 objects. Then, for each partition, we compute the k-nearest table locally, i.e., using only the objects in that partition. The value of k in this step is set to 7 in all experiments; the value 7 is shown in [9] to be suitable for this partition size, and if larger partitions are used, the input parameter k is used instead. The average distance of each object to the entries in its local knn-table is computed. Objects having an average distance greater than three standard deviations above the mean are moved to another partition, termed the outlier partition. Depending on the size of the outlier partition, a suitable value for k is chosen. A new medoid is computed for each non-outlier partition. The new medoid xq of partition P is the object of P that minimizes the total distance to the other objects of P:

xq = argmin_{xj ∈ P} Σ_{xi ∈ P} d(xi, xj)

The distance of each outlier object to the new medoids is then computed, and the object is used to update the knn-table of every partition in which the distance between its medoid and the outlier object is close, by a threshold α, to the minimum distance, i.e., that of the first updated partition. Also, the previous entries in the knn-table of the outlier partition are updated by this object. Finally, the k-nearest entries of each table, including the outlier partition, are sorted locally and merged (merge sort), then given as input to the MST algorithm [11].
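The outlier handling of steps 3, 4, and 7 can be sketched as follows (hypothetical helper functions; the step-7 condition is read here as "medoid distance within α of the minimum distance dmj"):

import numpy as np

def outlier_mask(avg_nn_dist):
    """Steps 3-4: flag objects whose average distance to their k
    neighbors exceeds the partition mean by 3 standard deviations."""
    return avg_nn_dist > avg_nn_dist.mean() + 3 * avg_nn_dist.std()

def partitions_to_update(dist_to_medoids, alpha):
    """Step 7: partitions whose medoid distance is within alpha of the
    closest medoid distance d_mj for a given outlier object."""
    d_min = dist_to_medoids.min()
    return np.where(dist_to_medoids - d_min < alpha)[0]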

A. Computational Complexity

The cost of the partitioning step depends on the algorithm used; if CLARANS is used, it costs roughly O(n²) and requires the data to be in memory. For low-dimensional numerical data, a scalable algorithm such as BIRCH [15] is O(n). In order to build the k-nearest-neighbors table, the list of the first k entries in the corresponding row of xi in the distance matrix is sorted descending and stored in the neighbor list of xi, denoted N(xi). Then the other n - k entries are compared to the top of N(xi) one by one, and the list is updated accordingly. The sorting step costs O(k log k), and the whole step costs O(k log k + n - k). For k
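A sketch of the per-object build just described (assuming a precomputed distance row; the text keeps similarities sorted descending, while the sketch below keeps distances ascending, which is equivalent):

def knn_row(dist_row, i, k):
    """Build N(x_i): sort the first k candidates, O(k log k), then scan
    the remaining n - k entries, replacing the current worst neighbor
    whenever a closer object is found."""
    cand = [(d, j) for j, d in enumerate(dist_row) if j != i]
    neighbors = sorted(cand[:k])       # ascending by distance
    for d, j in cand[k:]:
        if d < neighbors[-1][0]:       # closer than the current k-th neighbor
            neighbors[-1] = (d, j)
            neighbors.sort()           # restore order, O(k) for a nearly sorted list
    return neighbors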