Stream Clustering Based on Kernel Density Estimation

Stefano Lodi, Gianluca Moro and Claudio Sartori 1

1 Department of Electronics, Computer Science and Systems, University of Bologna, Viale Risorgimento 2, IT-40136 Bologna BO, Italy, email: {stefano.lodi,gianluca.moro,claudio.sartori}@unibo.it

Abstract. We present a novel algorithm for clustering streams of multidimensional points based on kernel density estimates of the data. The algorithm requires only one pass over each data point and a constant amount of space, which depends only on the accuracy of the clustering. The algorithm recognizes clusters of nonspherical shape and handles both inserted and deleted objects in the input stream. Queries on the cluster membership of a point can be answered in constant time.
1 Introduction

In many emerging scenarios, applications process the output of high-speed, high-volume data sources that originate continuous, unbounded flows of data, which have been termed data streams. Examples of data streams are call records of telephone companies, click streams, environmental data, and data from sensor networks. Stream processing is subject to tight constraints: huge volume and high speed make random accesses and multiple scans of a memory image of the entire stream clearly infeasible, whereas unboundedness requires algorithms with constant time complexity per update, to avoid continuously increasing response times. The design of stream data mining algorithms is especially demanding, as traditional solutions often entail combinatorial computations or the exploration of mutual relationships among the data.

A fundamental data mining problem which has recently been studied in the stream domain is data clustering. Most formal definitions of the clustering problem capture the intuitive idea that similar objects must be grouped into the same cluster, whereas dissimilar objects must belong to different clusters. All approaches avoid storing the data records and maintain various kinds of statistics on the data, mostly in constant time. The statistics are sufficient to describe the clustering structure of the stream, and to answer cluster membership queries, at the price of some degree of inaccuracy.

We propose a novel stream clustering algorithm based on nonparametric kernel density estimation, capable of recognizing nonspherical clusters and allowing for both insertions and deletions of points from the stream. Nonparametric density estimation is the construction of an estimate of the probability density function without assuming that the data are drawn from a parametric family of distributions [5]. Kernel estimators, or Parzen estimators, are a popular family of nonparametric estimators [4]. Kernel estimators are expressed as summations of scaled copies of a single function; each copy is shifted and centered at a data point. Let us assume a set $D = \{\vec x_i \mid i = 1, \dots, N\} \subseteq \mathbb{R}^d$ of multidimensional data points. Kernel estimators formalize the idea that every data point contributes to the value of the estimate at a space vector $\vec x \in \mathbb{R}^d$ an amount that decreases monotonically with its distance from $\vec x$.
Let $K : \mathbb{R} \to \mathbb{R}^+ \cup \{0\}$ be a non-negative, non-increasing function with unit integral on $\mathbb{R}$; $K$ is called a kernel function. Examples of kernel functions are the square pulse function $\frac{1}{4}(\operatorname{sign}(x+1) - \operatorname{sign}(x-1))$ and the Gaussian function $\frac{1}{\sqrt{2\pi}} \exp(-\frac{1}{2}x^2)$. A kernel density estimate $\hat\varphi_{K,h}[D](\vec x) : \mathbb{R}^d \to \mathbb{R}^+ \cup \{0\}$ is defined as the sum over all data points $\vec x_i$ in $D$ of the differences between $\vec x$ and $\vec x_i$, scaled by a factor $h$, called window width, and weighted by the kernel function $K$:

$$\hat\varphi_{K,h}[D](\vec x) = \frac{1}{Nh^d} \sum_{i=1}^{N} K\!\left(\frac{1}{h}(\vec x - \vec x_i)\right). \qquad (1)$$

The estimate is therefore a sum of bumps centered at the data points, and the flatness of the bumps is controlled by $h$; thus the smoothness of the estimate depends on the window width $h$.

Nonparametric density estimates have been used in several data clustering algorithms, both in the statistical literature [5] and in the data mining literature [2, 3]. Clustering based on nonparametric estimates offers numerous advantages: immunity to noise, recognition of nonspherical clusters, and data-driven, objective criteria for optimally choosing $K$ and $h$ for a given dataset. The intuition underlying all approaches is that well-separated, high-density regions around local maxima of the estimate correspond to clusters. In a center-defined cluster [2], all points are connected by an uphill path to a single local maximum having a sufficiently large density. In [5], a forest of points is constructed such that every path from a leaf to a root is an uphill path. These approaches cannot easily be adapted to the stream environment, since they do not maintain information allowing quick, repeated updates of the clusters as new points arrive.
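Concretely, equation (1) can be evaluated as in the following minimal sketch. Since $K$ is univariate, we assume, as one reading of (1), that $K$ is applied to the scaled distance $\|\vec x - \vec x_i\|/h$; the Gaussian kernel quoted above is used, and the function name is our own.

import numpy as np

def kernel_density_estimate(x, data, h):
    """Equation (1): x is the query point, shape (d,); data holds the N
    points x_i, shape (N, d); h is the window width."""
    n, d = data.shape
    u = np.linalg.norm(data - np.asarray(x), axis=1) / h   # scaled distances
    gaussian = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return gaussian.sum() / (n * h ** d)

Evaluating this on a regular grid of query points is what produces contour plots such as the one in Figure 1.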
2 Clustering Data Streams

The approaches reviewed in the previous section build polygonal paths; the direction of every segment starting at $\vec x$ approximates the direction of the gradient of $\hat\varphi[D](\vec x)$. In our approach, we relax this condition and allow for curves along which $\hat\varphi[D](\vec x)$ increases at any rate.

Definition 1. $C \subseteq D$ is a subcluster for a local maximizer $\vec x^*$ of $\hat\varphi[D]$ if and only if $\vec x_i \in C$ implies that there exists a simple continuous curve $c : [0,1] \to \mathbb{R}^d$ satisfying $c(0) = \vec x_i$, $c(1) = \vec x^*$, and $\hat\varphi[D](c(t_1)) > \hat\varphi[D](c(t_2))$ whenever $t_1 > t_2$. $C$ is a cluster if $C$ is a maximal subcluster for some local maximizer of $\hat\varphi[D]$.
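Definition 1 can be illustrated numerically: from any point, follow a discretised uphill curve until the density estimate is (nearly) stationary. The sketch below does so with a finite-difference gradient; the function name, step size, and tolerance are illustrative choices of ours, and the stream algorithm defined next deliberately avoids this per-point ascent.

import numpy as np

def uphill_maximizer(x0, density, step=0.05, tol=1e-3, max_iter=500):
    """Trace a discretised uphill curve from x0, as in Definition 1, by
    ascending a finite-difference gradient of the density estimate."""
    x = np.asarray(x0, dtype=float)
    d = x.size
    for _ in range(max_iter):
        # central-difference gradient of the estimate at x
        grad = np.array([(density(x + tol * np.eye(d)[j])
                          - density(x - tol * np.eye(d)[j])) / (2.0 * tol)
                         for j in range(d)])
        norm = np.linalg.norm(grad)
        if norm < tol:              # (near-)stationary: a local maximizer
            break
        x = x + step * grad / norm  # a short uphill step along the curve
    return x

Two points then belong to the same cluster exactly when such curves lead to the same local maximizer (up to numerical tolerance); the density argument can be, e.g., the kernel_density_estimate sketch above.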
We now define our problem. Let $S$ be a stream of $d$-dimensional points $\dots, \vec x^{(-1)}, \vec x^{(0)}, \vec x^{(1)}, \vec x^{(2)}, \dots$, with $\vec x^{(i)} \in \mathbb{R}^d$ for $i \in \mathbb{Z}$, and let the index $i_0 \in \mathbb{Z}$ of the first data object read from the input be given. Further, let $Q$ be a stream of cluster membership queries $\dots, q[-1], q[0], q[1], q[2], \dots$, where each $q[j]$ asks whether two objects $\vec x^{(i)}$, $\vec x^{(i')}$, with $i, i' \le j$, are in the same cluster. The problem is to output correct results (w.r.t. Definition 1) to the queries in $Q$ in such a way that: the number of data objects read from the input stream $S$ between reading $q[j]$ and writing its result is $O(1)$; the number of computation steps executed between reading $\vec x^{(i)}$ and $\vec x^{(i+1)}$, $i \ge i_0$, is $O(1)$; and the number of computation steps between reading $\vec x^{(i)}$ and the earliest time at which the result of a query $q$ equals the result of $q$ applied to the clustering of $\{\vec x^{(i_0)}, \vec x^{(i_0+1)}, \dots, \vec x^{(i)}\}$ is $O(1)$. These requirements specify that a stream algorithm must not fall behind the stream by an ever-increasing lag as time passes.

A constant bound on the number of items that must be updated to yield an updated clustering is attainable only if we avoid building and maintaining a forest over the data objects. Instead, we maintain a forest of directed grid trees whose vertex set is a set of regularly spaced points of $\mathbb{R}^d$ covering the support of the density estimate. Note that smooth kernel functions with bounded support exist; for example, take $K(x) = \exp(1/(x^2 - 1))$ if $|x| < 1$, and $K(x) = 0$ otherwise. Each vertex is associated with the value the density estimate takes at it, that is, with a sample of $\hat\varphi$. Let $w$ be a sampling frequency, and let us denote elements of $\mathbb{Z}^d$ by $\vec n$. Let $\vec n_1/w$, $\vec n_2/w$ be the corners of the minimal $d$-dimensional rectangle containing the support of $\hat\varphi$. Note that $\vec n_1$, $\vec n_2$ can always be set initially so as to contain the bounding box of the data, which is always known in practical cases, extended by $h$ in all dimensions. The vertex set is therefore the $d$-dimensional rectangular grid with corners $\vec n_1/w$, $\vec n_2/w$ and spacing $1/w$ in all dimensions, $\{\vec n/w : \vec n_1 \le \vec n \le \vec n_2\}$, where $\vec n \le \vec n'$ if and only if $\le$ holds componentwise. The main invariant we wish to maintain is the following.

Invariant 1. $(\vec n'/w, \vec n''/w)$ is an edge of the grid forest if and only if $\hat\varphi[S](\vec n'/w) < \hat\varphi[S](\vec n''/w)$ and $\vec n''$ maximizes $\hat\varphi[S](\vec n/w)$ over all $\vec n/w$ nearest to $\vec n'/w$, i.e., $\vec n'' = \arg\max_{\vec n}\{\hat\varphi[S](\vec n/w) : \vec n/w = \vec n'/w + \vec e/w \text{ for some versor } \vec e\}$.

Therefore, in the data structures we store polygonal uphill paths, constructed from segments that can only run parallel to some coordinate axis, together with the value of the density at every grid point. The structure can be updated as follows (a code sketch is given below). When a new point $\vec x^{(i)}$ arrives from the stream $S$, only the density at a bounded number of grid points can increase. (If deletions are allowed, densities increase or decrease according to a sign attached to the new point.) The affected grid points are located inside the smallest hypercube containing the hypersphere of radius $h$ centered at $\vec x^{(i)}$. Edges going into or out of the hypercube must be checked for inversion. The number of density values and edge updates needed depends on $h$ and $w$.

The forest induces a clustering of the space, for example by a nearest-neighbour criterion. Membership queries $q[i]$ in the stream $Q$ can therefore be answered by first finding the nearest vertex and then following the unique uphill path from that vertex. The cost is clearly independent of the stream size, and depends only on the resolution $w$ of the vertex set. Other types of queries can be answered at similar cost. The total number of clusters can be maintained as the number of tree roots; grid points which become roots, or cease to be roots, are easily detected during the updates. The size of a cluster can be computed by visiting the corresponding tree and summing the values of a counter attached to every grid point; the counter is incremented whenever its grid point is the nearest grid point to a new stream point.
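To make the maintenance of Invariant 1 concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a bounding box supplied up front (already extended by $h$), uses the compactly supported kernel quoted above, keeps unnormalised density samples (the forest depends only on their relative order), and all names (GridForest, update, same_cluster, ...) are illustrative choices of ours.

import numpy as np
from itertools import product

class GridForest:
    """Density samples on a regular grid with spacing 1/w, plus one
    directed uphill edge per non-root vertex (Invariant 1)."""

    def __init__(self, low, high, w, h):
        self.w, self.h = w, h
        self.low = np.asarray(low, dtype=float)       # corner n1/w of the grid
        self.shape = tuple(int(np.ceil((b - a) * w)) + 1
                           for a, b in zip(self.low, high))
        self.density = np.zeros(self.shape)           # samples of phi-hat
        self.count = np.zeros(self.shape, dtype=int)  # per-vertex counters
        self.parent = {}                              # the uphill edges

    def _coords(self, n):                             # vertex index -> point n/w
        return self.low + np.asarray(n) / self.w

    def _nearest(self, x):                            # point -> nearest vertex
        n = np.rint((np.asarray(x) - self.low) * self.w)
        return tuple(np.clip(n, 0, np.array(self.shape) - 1).astype(int))

    def _neighbours(self, n):                         # one versor per axis and sign
        for j in range(len(n)):
            for s in (-1, 1):
                m = list(n); m[j] += s
                if 0 <= m[j] < self.shape[j]:
                    yield tuple(m)

    def _kernel(self, u):
        # the smooth kernel with bounded support quoted in the text:
        # K(x) = exp(1/(x^2 - 1)) for |x| < 1, and 0 otherwise
        return np.exp(1.0 / (u * u - 1.0)) if u < 1.0 else 0.0

    def update(self, x, sign=+1):
        """Process one stream point; sign = -1 encodes a deletion."""
        x = np.asarray(x, dtype=float)
        # vertices inside the smallest hypercube containing the radius-h
        # hypersphere centred at x: the only densities that can change
        lo = np.maximum(np.floor((x - self.h - self.low) * self.w), 0).astype(int)
        hi = np.minimum(np.ceil((x + self.h - self.low) * self.w),
                        np.array(self.shape) - 1).astype(int)
        affected = list(product(*(range(a, b + 1) for a, b in zip(lo, hi))))
        for n in affected:
            u = np.linalg.norm(self._coords(n) - x) / self.h
            self.density[n] += sign * self._kernel(u)
        self.count[self._nearest(x)] += sign          # cluster-size counter
        # re-check edges into or out of the hypercube for inversion
        to_check = set(affected)
        for n in affected:
            to_check.update(self._neighbours(n))
        for n in to_check:
            best = max(self._neighbours(n), key=lambda m: self.density[m])
            if self.density[best] > self.density[n]:
                self.parent[n] = best                 # uphill edge (Invariant 1)
            else:
                self.parent.pop(n, None)              # n is a tree root

    def _root(self, n):                               # follow the uphill path
        while n in self.parent:
            n = self.parent[n]
        return n

    def same_cluster(self, x, y):
        """Answer a membership query q[j]."""
        return self._root(self._nearest(x)) == self._root(self._nearest(y))

    def cluster_size(self, x):
        """Sum the counters over the tree of x's cluster (a linear scan here;
        a visit of the tree via child lists would serve in practice)."""
        r = self._root(self._nearest(x))
        return sum(int(self.count[n]) for n in np.ndindex(*self.shape)
                   if self._root(n) == r)

A stream is then consumed by calling update(x) once per arriving point (with sign = -1 for deletions), and a query q[j] is answered by same_cluster(x, y) in time that depends on the grid resolution w but not on the stream size.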
Preliminary experiments have shown that considerable accuracy can be achieved on clusters of arbitrary shape. The dataset shown in Figure 1 was processed as a stream, with $w = 2$ and $w = 4$.
Figure 1. A test data set with a contour plot of its kernel estimate (h = 0.7).
Figure 2. Grid forests constructed for (a) w = 2 and (b) w = 4.
The resulting grid forests are shown in Figures 2(a) and 2(b), respectively; equal gray levels correspond to the same cluster. For $w = 2$, the top-left cluster was erroneously split near its maximum, since the directions of the grid edges do not yield an increase in $\hat\varphi[S]$. Halving the spacing between grid points allows for correct recognition. This behaviour was apparent in all experiments and should, in fact, reflect a formal convergence property: although a path in the grid forest is not guaranteed to be monotonic for $\hat\varphi[S](\vec x)$, every such path should converge to a monotonic path for $\hat\varphi[S](\vec x)$ as $w$ increases. The speed of convergence must also be investigated, as well as possible heuristics for merging clusters when the grid spacing is large (e.g., due to limited resources). Moreover, the speed of processing of stream data points must be experimentally evaluated and compared to the rate of a fast algorithm, e.g. CluStream [1]. Methods to choose $h$ should also be investigated; in fact, the optimal value of $h$ decreases with the sample size (as $N^{-1/(d+4)}$) [5], and therefore no single value can be uniformly optimal.
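Silverman's normal-reference rule from [5] exhibits exactly this $N^{-1/(d+4)}$ decay. The sketch below, with a deliberately crude pooled scale estimate and a function name of our own, shows one plausible way to choose $h$ for a snapshot of $N$ points; a stream-adaptive rule remains an open question.

import numpy as np

def rule_of_thumb_bandwidth(data):
    """Normal-reference window width from [5]: shrinks as N**(-1/(d+4))."""
    n, d = data.shape
    sigma = data.std(axis=0).mean()   # crude pooled estimate of scale
    return sigma * (4.0 / (d + 2)) ** (1.0 / (d + 4)) * n ** (-1.0 / (d + 4))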
REFERENCES
[1] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, 'A framework for clustering evolving data streams', in VLDB 2003, pp. 81–92, Berlin, Germany, September 9–12, 2003. Morgan Kaufmann.
[2] A. Hinneburg and D. A. Keim, 'An efficient approach to clustering in large multimedia databases with noise', in Proc. KDD-98, pp. 58–65, New York City, New York, USA, 1998. AAAI Press.
[3] M. Klusch, S. Lodi, and G. Moro, 'Distributed clustering based on sampling local density estimates', in Proc. IJCAI-03, pp. 485–490, Acapulco, Mexico, August 2003. AAAI Press.
[4] E. Parzen, 'On estimation of a probability density function and mode', Ann. Math. Statist., 33, 1065–1076, 1962.
[5] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman and Hall, London, 1986.