Dynamic Subspace Clustering for Very Large High-Dimensional Databases

P. Deepa Shenoy¹, K.G. Srinivasa¹, M.P. Mithun¹, K.R. Venugopal¹, and L.M. Patnaik²

¹ University Visvesvaraya College of Engineering, Bangalore-560001, India.
[email protected], [email protected], [email protected]
http://www.venugopalkr.com
² Microprocessor Application Laboratory, Indian Institute of Science, Bangalore-560012, India.
[email protected]
Abstract. Emerging high-dimensional data mining applications need to find interesting clusters embedded in arbitrarily aligned subspaces of lower dimensionality. It is difficult to cluster high-dimensional data objects when they are sparse and skewed. Updates are quite common in dynamic databases and are usually processed in batch mode. In very large dynamic databases, it is necessary to perform incremental cluster analysis only on the updates. We present an incremental clustering algorithm for subspace clustering in very high dimensions, which handles both insertions and deletions of data points in the backend database.
1 Introduction
Clustering is a useful technique for discovering the data distribution and patterns in an underlying database. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters; a cluster of data objects can therefore be treated as one group. Applications such as data warehousing, market research, customer segmentation and web search collect data from multiple sources that are frequently updated. The cluster patterns derived in these applications have to be revised as updates to the backend database occur. Since backend databases are very large, it is highly desirable to apply the changes to the previous patterns incrementally.

In high-dimensional spaces, not all dimensions may be relevant to a given cluster. Further, a single set of subspaces may not be enough for discovering all the clusters, because different sets of points may cluster better in different subsets of dimensions [2]. The algorithm of [1], ORCLUS (arbitrarily ORiented projected CLUSter generation), determines the best projection for each cluster by retaining the greatest amount of similarity among the points in the detected cluster. This algorithm is used as the base for our dynamic clustering algorithm; it generates clusters in a lower-dimensional projected subspace for data of high dimensionality.
It is assumed that the number of output clusters k and the dimensionality l of the subspace in which each cluster exists are user-defined input parameters to the algorithm. The output of the algorithm is the set C := {C_1, ..., C_k, Ψ} of the data points, such that the points form the clusters {C_1, ..., C_k}, each cluster of dimensionality l, together with the set of outliers Ψ (outliers are the points that do not belong to any cluster). Let ε_i := {ε_i1, ε_i2, ..., ε_il} denote the set of l ≤ d orthogonal vectors defining the subspace associated with cluster C_i; the set of vectors associated with the outlier set Ψ is empty. Let N be the total number of data points, d the dimensionality of the input data, k_0 the initial number of clusters, and C_i := {x_1, x_2, ..., x_t} the set of points in cluster i.

The paper is organized into the following sections. Section 2 presents the algorithm DPCA, and Section 3 gives the performance analysis and results. Finally, Section 4 presents the conclusions.
2 Dynamic Projected Clustering Algorithm (DPCA)
Consider a large backend database D of high dimensionality d consisting of a finite set of records {R_1, R_2, ..., R_m}, where each record is represented as a set of numeric values. Let δ+ be the set of incremental records added to D and δ− the set of records deleted from D. Let Ψ denote the set of outlier points; Υ the set of non-classified points, i.e., the points that do not belong to any of the existing clusters; ξ_i := E(C_i, ε_i) / E(U, ε_i), the ratio of the projected energy of the cluster C_i to that of the universal set of points U in the subspace ε_i; and τ the value of ξ_i below which the cluster C_i is well-formed. The objective is to generate the new cluster patterns for D + δ+ or D − δ− without reprocessing D. Data points can be either inserted into or deleted from the database, which can affect the cluster patterns in the following ways.

Insertions:
a) Absorption: For a point p ∈ δ+, C_i = C_i ∪ {p} (1 ≤ i ≤ k).
b) Creation: A new cluster C_{k+1} is added to the set of existing clusters C.
c) Outliers: For p ∈ δ+ with p ∉ C_i (1 ≤ i ≤ k), p becomes an outlier.

Deletions:
a) Reduction: For a point p ∈ δ−, C_i = C_i − {p} and ξ_i ≤ τ (1 ≤ i ≤ k).
b) Removal: For a point p ∈ δ−, C_i = C_i − {p} and ξ_i > τ (1 ≤ i ≤ k).
c) Split: A particular cluster C_i may split into γ clusters, γ ≥ 2.
d) Outliers: The undeleted points of a cluster C_i become outliers when ξ_i > τ.

The base algorithm is initially executed on the backend database; the clusters formed have a cluster sparsity coefficient less than one. To determine whether a particular cluster C_i is well-formed, ξ_i, the ratio of the projected energy of the cluster C_i in the subspace ε_i to the projected energy of the universal set of points U in the same subspace, is computed. If ξ_i > τ, the average spread of the points in the subspace is almost the same as that of the universal set of points of the database, and the cluster thus formed is sparse [7]; it is therefore meaningless to consider C_i a cluster. If ξ_i ≤ τ, the cluster C_i is dense and is said to be well-formed.
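As an illustration, this sparsity test can be sketched in C++ (the language of our implementation) as follows. The projected energy E(S, ε) is taken here to be the mean squared distance of the points of S from their centroid, measured along the subspace vectors in ε; this reading follows the base algorithm [1], but the exact definition, the types, and the identifiers below are our own assumptions, not part of the paper's specification.

    #include <cstddef>
    #include <vector>

    using Point = std::vector<double>;

    // One projected cluster C_i with its subspace eps_i (l <= d orthogonal vectors).
    struct Cluster {
        std::vector<Point> points;     // {x_1, ..., x_t}
        std::vector<Point> subspace;   // eps_i; empty for the outlier set Psi
    };

    // Projected energy E(S, eps): mean squared distance of the points of S from
    // their centroid, measured along the (orthonormal) subspace vectors eps.
    double projectedEnergy(const std::vector<Point>& pts,
                           const std::vector<Point>& eps) {
        const std::size_t d = pts[0].size();   // assumes pts is non-empty
        Point centroid(d, 0.0);
        for (const Point& p : pts)
            for (std::size_t j = 0; j < d; ++j) centroid[j] += p[j] / pts.size();

        double energy = 0.0;
        for (const Point& p : pts)
            for (const Point& e : eps) {       // project (p - centroid) onto e
                double proj = 0.0;
                for (std::size_t j = 0; j < d; ++j)
                    proj += (p[j] - centroid[j]) * e[j];
                energy += proj * proj;
            }
        return energy / pts.size();
    }

    // Sparsity coefficient xi_i = E(C_i, eps_i) / E(U, eps_i); the cluster is
    // dense (well-formed) when xi_i <= tau.
    bool isWellFormed(const Cluster& c, const std::vector<Point>& universe,
                      double tau) {
        return projectedEnergy(c.points, c.subspace) /
               projectedEnergy(universe, c.subspace) <= tau;
    }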
Algorithm IncrementData(δ+)
    Υ = δ+ ∪ Ψ;
    while (true) do
        for each point p ∈ Υ do
            Classify(p, Υ);                      // classify p as a merged or non-merged point
        O = CalculateCentroid(Υ);                // seed of the candidate cluster C_{k+1}
        ν = CalculateInterCentroidDistance(O);   // distance from O to the nearest existing centroid
        for each point p ∈ Υ do
            if (distance(p, O) ≤ ν) then
                AddPoint(p, C_{k+1});            // p is absorbed into cluster C_{k+1}
                Υ = Υ − {p};
            endif
        ξ_{k+1} = E(C_{k+1}, ε_{k+1}) / E(U, ε_{k+1});
        if (ξ_{k+1} ≤ τ) then
            k = k + 1;                           // a new cluster is formed
        else
            Ψ = Υ;                               // neglect C_{k+1}; end of the iteration
            return;
        endif
        for i = 1 to k do
            recompute the eigenvectors and eigenvalues of C_i;
            recompute the centroid of C_i;
End Algorithm IncrementData

Lemma: For any p ∈ δ+, p may join the existing cluster C_i iff dist_i(p) ≤ min{dist_i(X_j), 1 ≤ j ≤ k and j ≠ i}, where X_j is the centroid of cluster C_j.

Proof: Assume p ∈ C_i. If dist_i(p) > min{dist_i(X_j), 1 ≤ j ≤ k and j ≠ i}, where X_j is the centroid of cluster C_j, then p becomes an outlier and therefore cannot join any of the existing clusters, which is a contradiction.

IncrementData(): The increments are assumed to arrive in batch mode. The input set of points δ+ is classified into merged and non-merged points. Merged points are absorbed into the existing clusters when their distance from the centroid of the parent cluster is less than or equal to the shortest inter-centroid distance between the parent and the remaining clusters (Lemma). The remaining points may form a new cluster or become outliers. The data point nearest to the arithmetic mean of the non-merged points forms a potential seed for the new cluster. The distance from this potential seed to the nearest centroid of the existing clusters is calculated (ν), as is the distance between each non-merged point and the seed (g). If g < ν, the point is added to the new cluster. The remaining points become outliers, but they are not discarded, because they may form new clusters in future updates. The dimensionality of the newly formed cluster C_{k+1} is then reduced to check whether it is well-formed: ξ_{k+1}, the ratio of the projected energy of the cluster C_{k+1} in the subspace ε_{k+1} to the projected energy of the universal set of points U in the same subspace, is calculated. If ξ_{k+1} ≤ τ, the cluster C_{k+1} is added to the set of existing clusters, and the centroids and the eigenvectors corresponding to the least-spread subspaces are recomputed for all the clusters.
This iteration is repeated for the remaining set of outliers. If ξ_{k+1} > τ, the cluster C_{k+1} is discarded, the iteration stops, and the points of C_{k+1} are added to the outlier set.

The deletion of points can be treated as an indirect insertion into the database. For each set δ− deleted from the database, ξ_i, the ratio of the energy of each cluster to that of the universal set of points, is determined. The clusters with ξ_i > τ are discarded; their points are added to the outlier set Ψ and given as input for reinsertion. This step is required mainly to handle the split situation, in which a cluster may split into two or more clusters.
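The merge test of the Lemma above can be sketched as follows. Plain Euclidean distance is used here for dist_i, whereas the algorithm measures distances in the cluster's projected subspace; all identifiers in this sketch are our own.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <limits>
    #include <vector>

    using Point = std::vector<double>;

    // Euclidean distance between two points of equal dimensionality.
    double dist(const Point& a, const Point& b) {
        double s = 0.0;
        for (std::size_t j = 0; j < a.size(); ++j)
            s += (a[j] - b[j]) * (a[j] - b[j]);
        return std::sqrt(s);
    }

    // Lemma: p may join cluster i only if its distance to the centroid X_i
    // does not exceed the smallest inter-centroid distance from X_i to any
    // other centroid X_j; otherwise p is treated as a non-merged point.
    bool mayJoin(const Point& p, std::size_t i,
                 const std::vector<Point>& centroids) {
        double nearestOther = std::numeric_limits<double>::max();
        for (std::size_t j = 0; j < centroids.size(); ++j)
            if (j != i)
                nearestOther = std::min(nearestOther,
                                        dist(centroids[i], centroids[j]));
        return dist(p, centroids[i]) <= nearestOther;
    }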
3 Performance Analysis
The algorithm DPCA was implemented in C++ and tested on an Intel PIII 900 MHz machine with 128 MB of memory; the data was stored on a 20 GB hard disk. We evaluated the performance of DPCA with respect to the accuracy of clustering under a variable number of updates, the running time requirements for insertions, and the speedup factor of our algorithm over the base algorithm. The algorithm was tested using synthetic data.

Table 1. Confusion Matrix: ORCLUS (input clusters 1-7 mapped to output clusters A-G)

Table 2. Confusion Matrix: Insertion (input clusters 1-7 mapped to output clusters A-G)
The algorithm was tested for variable numbers of updates and the corresponding confusion matrices were obtained. A confusion matrix indicates the mapping of input clusters to output clusters: the rows (A, B, ...) correspond to the output clusters (OC) and the columns (1, 2, ...) to the input clusters (IC). The quality of clustering of a particular input cluster is measured by the number of its points that appear in the non-dominant entries of its column of the confusion matrix; a sketch of this measure is given below. The algorithm was executed on a backend database of 0.1 million points. The confusion matrices are shown for the base algorithm (Table 1) and for an insertion of 5000 points (Table 2). The dominance of a particular column in each row indicates that the corresponding cluster is well-formed.
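A minimal sketch of this quality measure, under our reading that the points of an input cluster which land outside its dominant output cluster count against it (the function name and the matrix orientation below are our own assumptions):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Confusion matrix m: m[r][c] = number of points of input cluster c that
    // ended up in output cluster r. For each input cluster, count the points
    // outside its dominant entry; 0 means the cluster was recovered exactly.
    std::vector<long> nonDominantPoints(const std::vector<std::vector<long>>& m) {
        std::vector<long> quality(m[0].size(), 0);   // assumes m is non-empty
        for (std::size_t c = 0; c < m[0].size(); ++c) {
            long sum = 0, best = 0;
            for (std::size_t r = 0; r < m.size(); ++r) {
                sum += m[r][c];
                best = std::max(best, m[r][c]);
            }
            quality[c] = sum - best;   // points in non-dominant entries
        }
        return quality;
    }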
The execution time of DPCA versus the number of increments for different sizes of the backend database is depicted in Fig. 1. The running time shows that the algorithm scales linearly with the number of insertions into the database. The performance of DPCA versus the base algorithm is evaluated, and speedup factors are derived for typical parameter values. Let f_ins be the total cost of running the base algorithm on both the backend database (D) and the increments (δ+), f_base the cost of running the base algorithm on the backend database (D) alone, and f_DPCA the cost of running DPCA on the increments (δ+) alone. Therefore,

Speedup factor = (f_ins − f_base) / f_DPCA.

The speedup factor indicates the degree of improvement of DPCA over the base algorithm with respect to the running time for a specified size of increments. Fig. 2 shows that the speedup factor increases with an increase in the size of the increments (δ+), and decreases with an increase in the size of the backend database (D).
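As a hypothetical illustration of this formula (the timings below are invented for the example, not measured): if the base algorithm takes f_ins = 520 s on D + δ+ and f_base = 480 s on D alone, while DPCA processes δ+ in f_DPCA = 8 s, then the speedup factor is (520 − 480) / 8 = 5; that is, DPCA handles the increment five times faster than the additional time the base algorithm would need.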
Fig. 1. Execution time vs. increments (running time against the number of increments, ×1000, for backend databases of 0.1 to 0.5 million records)

Fig. 2. Speedup factor vs. increments (speedup factor against the number of increments, ×1000, for backend databases of 0.1 to 0.5 million records)

4 Conclusions
In this paper, we have proposed a Dynamic Subspace Clustering Algorithm (DPCA) for dynamic updates. DPCA addresses sparse and skewed databases of high dimensionality. The results show that the algorithm is stable under dynamic updates with respect to the quality of the clusters formed, and that it scales to large databases. Significant speedup factors are achieved over the base algorithm for variable numbers of increments.
References

1. Charu C. Aggarwal and Philip S. Yu, "Redefining Clustering for High Dimensional Applications," IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 2, pp. 210-224, 2002.
2. Anil K. Jain and Richard C. Dubes, "Algorithms for Clustering Data," Prentice Hall, Englewood Cliffs, New Jersey, 1988.
3. P. Deepa Shenoy, K.G. Srinivasa, K.R. Venugopal, and L.M. Patnaik, "An Evolutionary Approach for Association Rule Mining on Dynamic Databases," Proc. Int. Conf. on PAKDD, LNCS, Springer-Verlag, 2003.
4. Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques," Academic Press, 2001.