A Simple Yet Effective Data Clustering Algorithm∗

Soujanya Vadapalli    Satyanarayana R Valluri    Kamalakar Karlapalem
Center for Data Engineering, IIIT, Hyderabad, INDIA
{soujanya, satya, kamal}@iiit.ac.in

∗ This work was made possible by the grant from The Boeing Company.

Abstract

In this paper, we use a simple concept based on k-reverse nearest neighbor digraphs to develop a framework, RECORD, for clustering and outlier detection. We developed three algorithms: (i) the RECORD algorithm (requires one parameter), (ii) the Agglomerative RECORD algorithm (no parameters required) and (iii) the Stability-based RECORD algorithm (no parameters required). Our experimental results on published, synthetic and real-life datasets show that RECORD not only handles noisy data, but also identifies the relevant clusters. Our results are as good as (if not better than) those obtained from other algorithms.

1. Introduction

Clustering partitions a dataset into highly dissimilar groups of similar points [5]. The problem of outlier detection is closely related to clustering and deals with identifying anomalous observations in the data. The earliest clustering and outlier detection algorithms were devised based on objective criteria and prior assumptions about the data distribution. Recent approaches are based on concepts such as density, k-nearest neighbor (kNN) graphs and shared nearest neighbors. The definition of clusters and outliers depends very much on the domain of the dataset, but for the sake of clarity, the general definitions quoted in the literature are given below. A cluster is a set of similar points that are highly dissimilar with other points in the dataset. An outlier or a noise point is an observation which appears to be inconsistent with the remainder of the data [1]. Aggarwal and Yu [1] note that outliers may be considered as noise points lying outside a set of defined clusters.

The problem of identifying clusters becomes more challenging when: (i) there is noise, (ii) clusters are arbitrarily shaped and (iii) clusters have different densities. Generally, existing clustering techniques work well in the absence of noise. When there is noise in the dataset, the clusters identified by these techniques include the surrounding noise points too. The presence of noise disrupts the process of clustering; noise needs to be separated from the dataset to enhance the quality of the clustering results. Density-based techniques address the issue of noise well but require user-given thresholds to identify the noise points. In this paper, a simple yet effective solution for clustering and outlier detection is proposed: RECORD (REverse nearest neighbor based Clustering and OutlieR Detection). The algorithms proposed in this paper first segregate noise points that might disrupt clustering; clusters are then identified based on reverse nearest neighbors and their mutual reachability with other points in the dataset. Our technique is based on this simple concept, but its results are as good as (and in some cases better than) the results of other algorithms such as CURE, DBSCAN, CHAMELEON and SNN. Therefore, any application that finds the results of these algorithms satisfactory can as well use our algorithm. Our algorithm does not require any user-input parameters. Some of the results displayed in this paper show the effectiveness of our algorithms on various datasets. Though the notion of reverse nearest neighbors has been around and has been used for various applications, the very specific concept of ‘utilizing the number of reverse nearest neighbors for a data point’ for clustering and outlier detection has not been studied, to the best of our knowledge. More details and results are given in the extended version of the paper [10].

1.1. Related Work

Density-based techniques such as DBSCAN [4] and OPTICS [2] consider the density around each point to demarcate cluster boundaries and identify the core cluster points; close cluster points within a single neighborhood are then merged. OPTICS does not generate a clustering solution; instead, it generates an augmented ordering of the points. To understand these ordering plots, however, the user needs at least a minimal knowledge of the algorithm, which precludes its usage by novice users. Finding the appropriate parameters for DBSCAN and identifying cluster boundaries in OPTICS are challenges to the user.
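To make this parameter burden concrete, the following is a minimal sketch (our illustration, not code from the paper) that runs scikit-learn's DBSCAN on a toy dataset. The eps and min_samples values are arbitrary assumptions; shifting either one can merge clusters, split them, or reclassify points as noise.

```python
# Minimal DBSCAN illustration: eps (neighborhood radius) and min_samples
# (density threshold) are the user-given parameters the text refers to;
# the values chosen here are illustrative only.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

labels = DBSCAN(eps=0.3, min_samples=5).fit(X).labels_   # -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters: {n_clusters}, noise points: {int(np.sum(labels == -1))}")
```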


Graph-based approaches: To address the problem of clustering, k-nearest neighbors (kNNs) are used to identify the k most similar points around each point, and clusters are generated by conditional merging. The various variants of kNN clustering differ in the conditional-merging part of the solution. For a given point p, its kNNs are found; if the distance between p and a point q in the set kNN(p) is less than ε, then q is merged into the cluster of p. This algorithm requires tuning of the ε and k values to obtain clusters. Tarjan [9] proposed a method to construct strongly connected components from a directed graph in which each edge is associated with a weight (usually the distance between points); this method is highly sensitive to the presence of noise and cannot handle clusters of different densities. CHAMELEON [7] involves two phases: identifying small, tight clusters built from the k-nearest neighbor graph, and merging these clusters using measures of inter-connectivity and relative closeness. CHAMELEON can find clusters of different shapes and sizes, but it requires the relative inter-connectivity and relative closeness parameters to be fine-tuned. The Shared Nearest Neighbor (SNN) approach [3] (first proposed in [6]) defines a similarity measure between a pair of points based on the number of nearest neighbors they share. Using this similarity measure, a notion of density is defined based on the sum of the similarities of a point's neighbors. Points with high density and similarity are formed into clusters, and points with low density are eliminated as noise. This method requires three parameters: k (to build the kNN graphs), a link strength threshold (to identify dense and outlier points) and a link weight threshold (to remove edges with low weights).
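The following is a minimal sketch, under our own assumptions rather than any particular published variant, of the conditional-merging scheme just described: each point p is united with any of its kNNs q that lies within ε, using a union-find structure, and the clusters are read off as the resulting components. Both k and eps must be tuned by the user, which is exactly the weakness noted above.

```python
# Sketch of kNN conditional merging (illustrative, not the paper's algorithm).
# q in kNN(p) is merged into p's cluster whenever dist(p, q) < eps.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_conditional_merge(X, k=5, eps=0.5):
    n = len(X)
    parent = list(range(n))                  # union-find over the n points

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path halving
            i = parent[i]
        return i

    # Ask for k+1 neighbors: each point is returned as its own 0-th neighbor.
    dists, nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    for p in range(n):
        for d, q in zip(dists[p][1:], nbrs[p][1:]):
            if d < eps:                      # the conditional merge step
                parent[find(p)] = find(int(q))

    return np.array([find(i) for i in range(n)])  # cluster label per point
```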

The notion of reverse nearest neighbors (RNNs) was introduced in the database setting by Korn and Muthukrishnan [8]. In this paper, we demonstrate the application of RNNs to address the problem of clustering and outlier detection.
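Since the paper's central quantity is the number of reverse nearest neighbors of a point, the sketch below (our illustration; the authors' exact algorithms are in the paper) computes kRNN sets from kNN sets using the definition in Section 2.1 below, q ∈ kRNN(p) iff p ∈ kNN(q), and then separates core points, taken here to be those with at least k reverse nearest neighbors as in the kRNN≥kG sub-graph, from candidate outliers.

```python
# Illustrative kRNN computation: q is in kRNN(p) iff p is in kNN(q).
# Core points are taken to be those with >= k reverse nearest neighbors,
# following the kRNN>=k sub-graph notation in Section 2.1; this is a
# sketch of the idea, not the RECORD algorithm itself.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def krnn_counts(X, k=5):
    n = len(X)
    _, nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    counts = np.zeros(n, dtype=int)
    for q in range(n):
        for p in nbrs[q][1:]:     # p in kNN(q)  =>  q in kRNN(p)
            counts[p] += 1
    return counts

X = np.random.RandomState(0).rand(200, 2)   # toy data
counts = krnn_counts(X, k=5)
core = counts >= 5                          # candidate core points
print(f"core: {int(core.sum())}, candidate outliers: {int((~core).sum())}")
```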

2.1. Notation and Definitions

The following notation is used in this paper.

X: d-dimensional dataset; n: size of the dataset
C: set of clusters; c: number of clusters
O: set of outlier points
xi, xj: the i-th and j-th points in X
p, q: any two points in X
dij: distance between two points (xi, xj)
kNN(p): k-nearest neighbor set of point p
kRNN(p): k-reverse nearest neighbor set of p; a point q belongs to kRNN(p) iff p ∈ kNN(q)
kRNNG: k-reverse nearest neighbor graph
kRNN≥kG: sub-graph of kRNNG having only core points (defined below)
kRNN