2012 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing
DSCLU: a new Data Stream CLUstring algorithm for multi density environments
Amin Namadchian
Gholamreza Esfandani
Department of Computer Shahid Rajaee University Tehran, Iran
[email protected]
Department of Computer Engineering Sharif University of Technology Tehran, Iran
[email protected]
Abstract— Recently, data streams have become popular in many contexts of data mining. Due to the high volume of incoming data, traditional clustering algorithms are not suitable for this family of problems. Many data stream clustering algorithms proposed in recent years consider the scalability of data, but most of them do not address the following issues: (1) the quality of clustering can degrade dramatically over time; (2) some of the algorithms cannot handle arbitrary shapes in a data stream, and consequently their results are limited to specific regions; (3) most of the algorithms have not been evaluated in multi-density environments. The aim of this paper is to identify appropriate clusters in a data stream while handling arbitrary cluster shapes. The overall approach consists of two phases. In the online phase, data are handled with a specific data structure called a micro cluster. This phase is activated by the arrival of data. The offline phase is activated manually by a request from the user. In it, the algorithm forms clusters from the micro clusters created by the online phase. The experimental evaluation shows that the proposed algorithm has suitable quality and returns appropriate results even in multi-density environments.
Keywords: data stream clustering, density-based clustering, DSCLU, multi-density environment clustering
I. INTRODUCTION
Nowadays, processing data streams is an essential task in data mining. One of the issues in this area is data stream clustering. The problem of data stream clustering refers to partitioning data points into clusters whose members have high intra-cluster similarity and low similarity with members of other clusters. The goal is to classify points that arrive continuously. The constraint is that the number of allowed accesses to the points is limited, and only points saved in memory can be accessed several times [1]. Several data stream clustering algorithms already exist. Among them, Den-Stream [2], D-Stream [3] and MRStream [4] are the best known; they are density-based algorithms capable of identifying clusters of different shapes appropriately over time. Our contributions in this paper are: (1) We propose a new method for the offline phase of the algorithm that can identify clusters of different shapes in multi-density environments. The other algorithms use a traditional clustering method in the offline phase, which is not always accurate. (2) To manage the micro clusters in the online phase, we use a buffer that does not allow the number of micro clusters to exceed a threshold (determined by an input parameter). This methodology has some features which will be explained in this paper. The rest of the paper is organized as follows: related work on data stream clustering is briefly described in Section 2. In Section 3 we describe the basic definitions used in the algorithm. The details of the online and offline phases of the algorithm are given in Section 4. In Section 5 we analyze DSCLU using both synthetic and real data. Finally, some conclusions end the paper.
978-0-7695-4761-9/12 $26.00 © 2012 IEEE DOI 10.1109/SNPD.2012.119
II. RELATED WORKS
In this section we review some existing methods related to data stream clustering. In [5] the authors presented an approximation method for clustering based on the k-median technique. Their algorithm processes the points in one pass and needs little memory. The time complexity and memory cost of this method are $O(nk)$ and $O(n^{\epsilon})$ respectively, in which $k$ is the number of clusters, $n$ is the number of points and $\epsilon$ is a value less than one. In [6] a method is presented that uses the sliding window technique in clustering. It uses a histogram to enhance the algorithm in [5]. In this method the histogram is used as the data structure for combining clusters whose centers are far from each other. STREAM [7] is another one-pass algorithm in this area. It works incrementally and receives the data stream as incoming chunks. Each point is weighted by its number of occurrences. It then uses an algorithm known as LOCALSEARCH to cluster the chunks. Aggarwal et al. proposed the CluStream framework [8] for clustering data streams. The algorithm divides the clustering process into two phases. In the online phase, the algorithm gathers statistical information about the data and stores it in a data structure called a micro cluster. The offline phase performs the clustering operation on the micro clusters created in the online phase.
Now we review some important density-based clustering algorithms. DBSCAN [9] is the first density-based clustering algorithm, but it lacks the ability to handle different shapes in multi-density environments. To solve this problem many algorithms have been proposed, such as DENCLUE [10], OPTICS [11], LDBSCAN [12] and MSDBSCAN [13]. DENCLUE [10] models the influence of each point on its neighbors by using an influence function and a density function. With this approach the algorithm tries to find appropriate clusters in multi-density environments. More recently, a density-based algorithm, LDBSCAN [12], suggested using the concepts of local outlier factor (LOF) and local reachability distance (LRD) [14] to find clusters. The algorithm takes four input parameters, $MinPts_{LOF}$, $MinPts_{LDBSCAN}$, $pct$ and $LOFUB$. The value of each of these parameters strongly influences the output of clustering, and the algorithm offers no guidance for selecting appropriate values. The researchers in [13] presented MSDBSCAN, which uses the concept of local core distance (lcd) for each point. The algorithm specifies an interval based on lcd for each point to identify the core points in the data set. Experimental results show that the algorithm can find clusters in multi-density environments and can also find overlapping clusters appropriately. Unfortunately, these density-based clustering algorithms are not suitable for data streams.
Den-Stream [2] assigns a weight to each point according to a fading function and builds micro clusters. Based on the radius and weight of each micro cluster, it determines whether the micro cluster is core, potential or outlier. When a request arrives from the user, the offline phase is triggered to identify clusters. D-Stream [3] is a grid-based stream clustering algorithm. Unlike Den-Stream, in which a user request initiates the offline phase, D-Stream determines an appropriate time interval from the fading factor and performs the clustering periodically. The importance of each grid is directly related to the weights of the points inside it. MRStream [4] has a hierarchical, multi-resolution view of clusters at any time. Like [2], [3] and [5], MRStream includes online and offline phases. Using the concept of extended neighbor grids in the offline phase leads to more accurate results.
III. BASIC DEFINITIONS OF THE ALGORITHM
In this section we describe the symbols and definitions used in the DSCLU algorithm. Some of these definitions are novel and some are refined from definitions in Den-Stream.
A. Basic concepts and formal definitions
One approach to partitioning a data stream into distinct clusters is to use a sliding window. The drawback of this approach is that it cannot keep the history of old data. Recently, many algorithms have used a fading function to keep the history of data. They usually use a fading function of the form $\alpha^{-\lambda t}$, in which $\alpha$ and $\lambda$ are positive constant coefficients and $t$ is the time variable. In a traditional clustering algorithm we can assume that all points fit in memory, but this is not an acceptable assumption for a stream. The bottlenecks here are memory and time complexity, which should be handled carefully. One applicable approach is to use micro clusters, which are created with low time complexity as data points arrive. The bigger clusters are then obtained from the micro clusters at the request of the user.
Definition 1: Dense Micro Cluster (DMC): A dense micro cluster at time $t$ contains the history of points $p_1, p_2, \ldots, p_n$. Assuming that the points arrive at times $T_1, T_2, \ldots, T_n$ respectively, a DMC is a quadruple $\{CF^1, CF^2, w, t_u\}$ in which $w = \sum_{i=1}^{n} f(t - T_i) = \sum_{i=1}^{n} 2^{-\lambda (t - T_i)}$ is the weight of the micro cluster, and it is greater than the input parameter $W_d$. The other elements of a DMC are two vectors $CF^1$ and $CF^2$. Vector $CF^1 = \sum_{j=1}^{n} 2^{-\lambda (t - T_j)} p_j$ equals the weighted sum of the points in the micro cluster, and vector $CF^2 = \sum_{j=1}^{n} 2^{-\lambda (t - T_j)} p_j^2$ equals the weighted sum of the squares of the points. The input $t_u$ is the last update time of this micro cluster. It serves to set the timeIndex variable and is not used on its own. The value of timeIndex shows how many time steps have passed without an update to a sporadic micro cluster. It is calculated by the formula $timeIndex = t_{current} - t_u$. The main role of timeIndex is in the handling of sporadic micro clusters, explained in this section. The center of a micro cluster is calculated as $c = CF^1 / w$. The radius of the micro cluster equals $r = \sqrt{\left|\, |CF^2| / w - (|CF^1| / w)^2 \,\right|}$, as proposed by [2]. A sporadic micro cluster is defined like a dense one, except that $w$ is less than a real input parameter $W_s$. Similarly, a transitional micro cluster is defined like a dense one, with the condition $W_s < w \le W_d$.
Definition 2: Generic neighbor and special neighbor: For each micro cluster $mc$, two types of neighbors are defined. With respect to the input parameter MinPts, the special neighbors, denoted $N_s^{MinPts}(mc)$, form a vector of the MinPts nearest dense micro clusters to $mc$. Assuming the furthest dense micro cluster in $N_s^{MinPts}(mc)$ is at distance $r$ from $mc$, all transitional and dense micro clusters within distance $r$ are in the vector of generic neighbors, denoted $N_g^{MinPts}(mc)$. Thus the proposition $card(N_g^{MinPts}(mc)) \ge MinPts$ is always true for $mc$.
Definition 3: Density of micro cluster: In order to define the density of a micro cluster, it is essential to clarify the definition of the volume of a micro cluster. The definition of volume for a micro cluster is very similar to the definition of the volume of an n-sphere [15]. For radius 1, as the dimension grows, the volume increases until dimension 5. After that the volume decreases and finally converges to zero. So we set a condition on the coefficient of (1): for dimensions greater than twenty, the algorithm uses the volume of dimension twenty as the volume of the micro cluster. The volume of micro cluster $mc$ in $d$-dimensional space is obtained by the following formula:
(1) $V_d(mc) = C_d \, r^d$
in which $r$ is the radius of $mc$ and $C_d$ is a coefficient calculated by (2). $d!!$ in (2) is the double factorial [15].
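The bookkeeping in Definition 1 and the volume formula (1) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the class and function names are hypothetical, the sums are maintained incrementally by decaying the stored values from $t_u$ to the arrival time, the norms in the radius formula are read per dimension, and the coefficient $C_d$ is computed with the standard closed form $\pi^{d/2} / \Gamma(d/2 + 1)$ for the unit ball, capped at dimension twenty.

```python
import math


def fade(dt, lam):
    """Fading function f(dt) = 2^(-lam * dt) used in Definition 1."""
    return 2.0 ** (-lam * dt)


class MicroCluster:
    """Hypothetical container for the quadruple {CF1, CF2, w, t_u}."""

    def __init__(self, dim, lam):
        self.lam = lam
        self.cf1 = [0.0] * dim  # weighted sum of the points
        self.cf2 = [0.0] * dim  # weighted sum of the squared points
        self.w = 0.0            # weight, valid as of time t_u
        self.t_u = 0.0          # last update time

    def insert(self, point, t):
        # Decay the stored sums from t_u to t, then add the new point
        # with weight f(0) = 1.
        d = fade(t - self.t_u, self.lam)
        self.cf1 = [d * a + x for a, x in zip(self.cf1, point)]
        self.cf2 = [d * b + x * x for b, x in zip(self.cf2, point)]
        self.w = d * self.w + 1.0
        self.t_u = t

    def center(self):
        return [a / self.w for a in self.cf1]

    def radius(self):
        # Assumption: norms are taken per dimension, so this is the root
        # of the summed weighted variances.
        var = sum(b / self.w - (a / self.w) ** 2
                  for a, b in zip(self.cf1, self.cf2))
        return math.sqrt(abs(var))

    def kind(self, w_s, w_d):
        # Uses the weight as of t_u. Assumption: w == w_s counts as
        # sporadic, since the text leaves that boundary case open.
        if self.w > w_d:
            return "dense"
        if self.w > w_s:
            return "transitional"
        return "sporadic"

    def time_index(self, t_current):
        return t_current - self.t_u


def volume(radius, dim, max_dim=20):
    # Assumption: only the coefficient C_d is capped at max_dim; the
    # exponent of the radius stays at the true dimension.
    d = min(dim, max_dim)
    c_d = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    return c_d * radius ** dim


mc = MicroCluster(dim=2, lam=0.25)
mc.insert([0.0, 0.0], t=0)
mc.insert([2.0, 0.0], t=4)  # the first point has faded to weight 0.5
print(mc.w, mc.center(), mc.radius(), mc.kind(w_s=1.0, w_d=3.0))
```

With $\lambda = 0.25$ the first point loses half of its weight after four time steps, so the weight after the second insertion is 1.5 and the center is pulled toward the newer point. The threshold values passed to `kind` are arbitrary illustration values.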
(2) $C_d = \pi^{d/2} / (d/2)!$ if $d$ is even and $d \le 20$; $C_d = \left(2^{(d+1)/2} \, \pi^{(d-1)/2}\right) / d!!$ if $d$ is odd and $d$