An Enhanced Density Connected Clustering With Local Subspace Preferences

Anant Ram1, Anand S. Jalal1, Narendra Kohli2, Ashish Sharma1
1 Department of Computer Science, G.L.A Institute of Technology and Management, Mathura, India
2 Department of Computer Science, H.B.T.I Kanpur, India
Abstract - In the presence of noise and outliers in a high dimensional database, it is a difficult task to find clusters that differ in shape, size, and local density in a subspace. The density based clustering algorithm DBSCAN finds clusters of different shapes and sizes by giving equal weight to the complete feature space, while the PreDeCon (subspace PREference weighted DEnsity CONnected clustering) algorithm finds clusters of different shapes and sizes in subspaces. In this paper we propose an enhanced version of the PreDeCon algorithm that can find clusters of different shapes and sizes that also differ in density in a subspace of the complete feature space. The experimental results illustrate that the proposed clustering algorithm gives better results.

Keywords: Core object, Preference weighted core object, Preference weighted relative core object, Density variation.
1 Introduction
Clustering is a common data analysis task that tries to partition a collection of objects into homogeneous groups called clusters [1]. Determining clusters in data containing noise and outliers is very difficult when the clusters are of different shapes and sizes and differ in local density in a subspace of the complete feature space. The performance of distance based algorithms degenerates rapidly with increasing dimensionality, because in a high dimensional space the distance of an object to its nearest and to its farthest object is nearly the same [2]. Better clustering can therefore be achieved by working in a subspace of the complete feature space, an observation first made by the authors of the CLIQUE [3] clustering algorithm. Although many algorithms exist for finding clusters of different sizes, shapes, and densities in the complete feature space, the enhanced PreDeCon algorithm proposed here can determine clusters of different sizes, shapes, and densities in a subspace of the complete feature space.

The rest of the paper is organized as follows. Section 2 reviews related work on density based clustering techniques for both the complete feature space and subspaces. Section 3 presents the existing subspace clustering algorithm PreDeCon and the modification required to obtain better clustering results than PreDeCon. The proposed algorithm is discussed in Section 4. Experimental results are illustrated in Section 5. Finally, Section 6 presents conclusions and future work.
2 Related Work
DBSCAN (Density Based Spatial Clustering of Applications with Noise) [8] is a basic density based clustering algorithm. The density around an object is measured by counting the number of objects in a region of specified radius, say ε, around the object. An object is treated as dense (core) if its ε-neighborhood contains at least a specified threshold number of objects, MinPts; otherwise it is sparse (non-core). Non-core objects that do not have a core object within the specified radius are treated as noise. A cluster formed by DBSCAN may have wide density variation [8]; such a cluster might better be represented by several smaller clusters, each with reasonably uniform density. DBSCAN does not define an upper limit on the density of a core object, i.e. on how many objects may be present in its ε-neighborhood, so regions with widely varying local density may be merged into the same cluster. The OPTICS [10] algorithm adapts DBSCAN to address this problem: it computes an ordering of the objects, augmented by a reachability distance, that represents the intrinsic hierarchical clustering structure. Valleys in the reachability plot indicate clusters, and for identifying these valleys, called ξ-clusters, the parameter ξ is vital. DD_DBSCAN [5] finds clusters that differ in shape, size, and local density by applying an upper limit during the expansion of a cluster, but it gives equal weight to all dimensions of the feature space. CHAMELEON [11] finds clusters in a data set using a two phase algorithm: in the first phase it generates a k-nearest neighbor graph, which reduces the effect of noise and outliers; in the second phase it uses an agglomerative hierarchical clustering algorithm to find clusters by iteratively combining these sub-clusters.

Some of the density based clustering algorithms discussed above can determine clusters of different shapes and sizes in the presence of noise and outliers, and some can additionally handle clusters that differ in local density, but only in the complete feature space. In high dimensional data, however, not all dimensions of the complete feature space necessarily contribute to a cluster. To find clusters in subspaces, CLIQUE [3], the base approach to subspace clustering, uses a grid-based clustering notion: the data space is partitioned by an axis-parallel grid into equi-sized units of width ξ, and only units that contain at least a minimum number of points are considered dense. ENCLUS [6] is a modification of CLIQUE. SUBCLU [7] uses the DBSCAN cluster model of density connected sets [8]. A d-dimensional data set contains 2^d subspaces that may hold clusters, so the output of subspace clustering is typically very large, whereas many applications require that the data set be partitioned so that each object belongs to exactly one cluster. Projected clustering algorithms were introduced to achieve this goal, although problems such as overlapping clusters remain with some of them. PROCLUS [9] is a projected clustering algorithm that does not allow overlapping clusters and finds representative clusters in an appropriate set of cluster dimensions; it needs the number of clusters k and the average cluster dimensionality l as input parameters.

The clusters detected by PreDeCon [4] are dense regions separated by sparse regions in a subspace. The sparse regions (noise) have fewer than MinPts objects within the preference weighted ε-neighborhood of each of their objects, while clusters, i.e. dense regions, have at least MinPts objects within the preference weighted ε-neighborhood of each core object. There is, however, no upper bound on the number of objects that a core object may have in its preference weighted ε-neighborhood. The proposed Enhanced PreDeCon algorithm applies both an upper and a lower limit on the number of objects a preference weighted core object can have in its preference weighted ε-neighborhood, in order to maintain only a reasonable density variation within the same cluster.
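The core/non-core distinction that DBSCAN and its descendants build on can be sketched in a few lines of Python. This is an illustrative sketch, not any published implementation; the function name and the brute-force distance computation are our own choices:

```python
import numpy as np

def classify_objects(X, eps, min_pts):
    """Label each object as 'core' or 'non-core' following the DBSCAN
    notion described above: an object is core if its eps-neighborhood
    (including the object itself) contains at least min_pts objects."""
    n = len(X)
    # pairwise Euclidean distances (brute force, O(n^2 * d))
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    labels = []
    for i in range(n):
        neighborhood_size = int(np.sum(dists[i] <= eps))
        labels.append("core" if neighborhood_size >= min_pts else "non-core")
    return labels
```

A non-core object that additionally has no core object within radius eps would then be noise; the check above is the building block for that.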
3 PreDeCon Algorithm
PreDeCon (subspace PREference weighted DEnsity CONnected clustering) [4] is a density based subspace clustering algorithm. It detects clusters in subspaces whose dimensionality is bounded by a threshold parameter λ. The algorithm is based on the definitions summarized below; for a detailed description, refer to the PreDeCon paper [4].
Let D be a data set over a set of d dimensions A = {d1, d2, ..., dd}. Any subset S ⊆ A is called a subspace. The projection of an object p ∈ D onto a dimension di ∈ A is denoted by πdi(p). Let Nε(p) denote the ε-neighborhood of p ∈ D, i.e. Nε(p) contains all objects q with dist(p, q) ≤ ε. A cluster in a subspace is a collection of density connected objects associated with a certain subspace preference vector. A subspace preference of dimensionality λ is exhibited by a group of objects having a variance smaller than a threshold δ along at most λ dimensions. The variance is a measure of how far each value in the neighborhood is from the mean. For every object p ∈ D and every dimension di ∈ A, the variance of Nε(p) along di is denoted by VARdi(Nε(p)). The subspace preference dimensionality of Nε(p), denoted Pdim(Nε(p)), is the number of dimensions di with VARdi(Nε(p)) ≤ δ. The objective is thus to consider as core objects of a cluster those objects that have sufficiently many dimensions with a low variance in their neighborhood. Therefore, each object p ∈ D is associated with a subspace preference vector wp, which reflects the variance of the objects in Nε(p) along each dimension of A. The preference weighted similarity measure distp(p, q) is a weighted Euclidean distance, where the parameter δ specifies the threshold for a low variance. Since the objective is only to distinguish dimensions with low variance from all other dimensions, the weight vector has just two possible values: if the variance of Nε(p) along a dimension is greater than δ, the corresponding entry of the preference weight vector is set to 1; otherwise it is set to a constant κ greater than 1. The resulting similarity measure of two objects p and q is not symmetric, i.e. in general distp(p, q) ≠ distq(q, p).
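The preference vector computation described above can be sketched as follows. This is an illustrative Python sketch under the stated definitions, not the authors' code; the constant κ = 100 is an assumption (the definition only requires κ > 1), and the variance is taken around the neighborhood mean, as stated in the text:

```python
import numpy as np

KAPPA = 100.0  # assumed constant weight > 1 for low-variance dimensions

def preference_weights(X, p_idx, eps, delta):
    """Subspace preference vector w_p of object p: for each dimension,
    weight = KAPPA if the variance of p's eps-neighborhood along that
    dimension is at most delta, else 1."""
    p = X[p_idx]
    neigh = X[np.linalg.norm(X - p, axis=1) <= eps]  # Nε(p), includes p
    variances = neigh.var(axis=0)                    # VARdi(Nε(p)) per dimension
    return np.where(variances <= delta, KAPPA, 1.0)

def pdim(X, p_idx, eps, delta):
    """Preference dimensionality Pdim(Nε(p)): number of dimensions
    along which the eps-neighborhood has variance <= delta."""
    p = X[p_idx]
    neigh = X[np.linalg.norm(X - p, axis=1) <= eps]
    return int(np.sum(neigh.var(axis=0) <= delta))
```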
To overcome this problem, the preference weighted similarity of two arbitrary objects p, q ∈ D, denoted distpref(p, q), is defined as the maximum of the two corresponding preference weighted similarity measures of p and q: distpref(p, q) = max{distp(p, q), distq(q, p)}. Based on this, the preference weighted ε-neighborhood is defined as a symmetric concept.
The preference weighted ε-neighborhood of an object p ∈ D is denoted by Nεw(p) = {q ∈ D | distpref(p, q) ≤ ε}.
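The asymmetric measure distp and its symmetric maximum distpref can be sketched as follows (an illustrative sketch; the function names are ours):

```python
import numpy as np

def weighted_dist(p, q, w):
    """Preference weighted similarity measure dist_p(p, q): a Euclidean
    distance weighted per dimension by p's preference vector w."""
    return float(np.sqrt(np.sum(w * (p - q) ** 2)))

def pref_dist(p, q, w_p, w_q):
    """Symmetric preference weighted distance:
    dist_pref(p, q) = max(dist_p(p, q), dist_q(q, p))."""
    return max(weighted_dist(p, q, w_p), weighted_dist(q, p, w_q))
```

Taking the maximum makes the measure symmetric, so q lies in the preference weighted ε-neighborhood of p exactly when p lies in that of q.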
An object p ∈ D is said to be a preference weighted core object w.r.t. parameters ε, δ ∈ R and MinPts, λ ∈ N if the preference dimensionality of its ε-neighborhood is at most λ and its preference weighted ε-neighborhood contains at least MinPts objects, i.e. CORE(p) ⇔ Pdim(Nε(p)) ≤ λ ∧ |Nεw(p)| ≥ MinPts.
An object p ∈ D is said to be directly preference weighted reachable from an object q ∈ D w.r.t. parameters ε, δ ∈ R and MinPts, λ ∈ N if q is a preference weighted core object, the subspace preference dimensionality of Nε(p) is at most λ, and p ∈ Nεw(q). Direct preference weighted reachability is symmetric for preference weighted core objects: both p ∈ Nεw(q) and q ∈ Nεw(p) must hold. An object p ∈ D is said to be preference weighted reachable from an object q ∈ D w.r.t. the same parameters if there is a chain of objects p1, ..., pn with p1 = q and pn = p such that pi+1 is directly preference weighted reachable from pi. An object p ∈ D is said to be preference weighted connected to an object q ∈ D if there is an object o ∈ D such that both p and q are preference weighted reachable from o.

Initially, every object is marked as unclassified. To find a new cluster, PreDeCon starts with an arbitrary unclassified preference weighted core object p and creates a new clusterID. It collects all objects in the preference weighted ε-neighborhood of p, inserts them into a queue, and assigns them the same clusterID. For each object in the queue, it computes all directly preference weighted reachable objects and inserts those that are still unclassified into the queue. This is repeated until the queue is empty and the entire cluster has been computed; the same clusterID is assigned to all objects found during the generation of the subspace preference cluster. When the algorithm terminates, every object is either assigned to a cluster or marked as noise.
Since PreDeCon uses global values for ε and MinPts, it may merge two clusters of different density into one cluster if they are close to each other. Let the distance between two sets of objects Q1 and Q2 be defined as distpref(Q1, Q2) = min{distpref(p, q) | p ∈ Q1, q ∈ Q2}, where p ∈ Q1 and q ∈ Q2 are preference weighted core objects. Then two sets of objects having at least the density of the thinnest cluster will be separated from each other only if the distance between the two sets is larger than ε. Thus wide density variation can exist within a cluster detected by PreDeCon. Such a cluster could be represented by several smaller clusters, each with reasonably uniform density. However, if we insist on perfectly uniform density, a large number of small, unimportant clusters may be generated, because in real life data sets local densities vary significantly. Therefore, some amount of local density variation should be allowed within the same cluster: the density within a cluster may slowly rise or fall at some allowed rate, but a greater change in density should indicate a separate cluster.
4 The Proposed Algorithm (Enhanced PreDeCon)
Enhanced PreDeCon allows considerable density variation within the same cluster while requiring a large density variation between clusters. It starts with a preference weighted core object and collects the other directly preference weighted reachable objects. Consider a preference weighted core object q having m0 objects in its preference weighted ε-neighborhood, including itself, and let mi, i ∈ [1, m0 − 1], denote the preference weighted ε-neighborhood sizes of these objects. To detect clusters of variable density starting from q, a restriction is applied on the values mi relative to m0: an object with neighborhood size mi is included in the cluster for expansion only if it is a preference weighted core object and satisfies

mi ≤ (1 + Δ) · m0   if mi > m0
m0 ≤ (1 + Δ) · mi   if mi < m0

where Δ, the density variation factor, is a fraction in the interval [0, 1) denoting the density variation that is allowed to exist within a cluster. Any preference weighted core object that does not satisfy this condition is not expanded. In addition to the definitions of PreDeCon [4], the following definitions are needed in Enhanced PreDeCon to allow considerable density variation within the same cluster and wide variation between clusters.

Definition 1 (Preference weighted relative core object): An object p is said to be a preference weighted relative core object with respect to an object q (denoted RCORE(p, q)) w.r.t. ε, δ, Δ ∈ R and MinPts, λ ∈ N if it satisfies the following:
(i) both p and q are preference weighted core objects;
(ii) p ∈ Nεw(q);
(iii) if |Nεw(p)| ≥ |Nεw(q)|, then |Nεw(p)| ≤ (1 + Δ) · |Nεw(q)|;
(iv) if |Nεw(q)| > |Nεw(p)|, then |Nεw(q)| ≤ (1 + Δ) · |Nεw(p)|;
where Δ is a fraction in the interval [0, 1).
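Conditions (iii) and (iv) of Definition 1 collapse into a single symmetric check on the two neighborhood sizes, which can be sketched as follows (an illustrative sketch; the helper name is ours):

```python
def is_relative_core(m_p, m_q, delta_var):
    """Density variation condition of Definition 1: given neighborhood
    sizes m_p = |N(p)| and m_q = |N(q)|, the larger of the two may exceed
    the smaller by at most a factor (1 + delta_var)."""
    larger, smaller = max(m_p, m_q), min(m_p, m_q)
    return larger <= (1 + delta_var) * smaller
```

With Δ = 0.2 and |Nεw(q)| = 19, this reproduces the four cases discussed below: neighborhood sizes 32 and 11 fail the check, while 18 and 22 pass it.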
Figures 1(a)-1(d) show example sets of scattered points in a subspace created by the parameter λ, and illustrate when the preference weighted ε-neighborhood of a preference weighted core object is introduced for expansion, or introduced but not expanded, in order to keep the density variation within a cluster bounded by the factor Δ.

[Figure 1(a), Figure 1(b), Figure 1(c), Figure 1(d): scatter plots illustrating the four cases below.]

In Figures 1(a)-1(d), let MinPts = 10, Δ = 0.2, and |Nεw(q)| = 19.

Case 1: In Figure 1(a), |Nεw(p)| = 32 > |Nεw(q)| and |Nεw(p)| > (1 + Δ) · |Nεw(q)|, i.e. 32 is not less than or equal to 1.2 · 19 = 22.8. Therefore p cannot be expanded, because the preference weighted ε-neighborhood of p shows a wide density variation with respect to the preference weighted ε-neighborhood of q.

Case 2: In Figure 1(b), |Nεw(p)| = 11 < |Nεw(q)| and |Nεw(q)| > (1 + Δ) · |Nεw(p)|, i.e. 19 is greater than 1.2 · 11 = 13.2. Therefore p cannot be expanded, because the preference weighted ε-neighborhood of p shows a wide density variation with respect to the preference weighted ε-neighborhood of q.

Case 3: In Figure 1(c), |Nεw(p)| = 18 < |Nεw(q)| and |Nεw(q)| ≤ (1 + Δ) · |Nεw(p)|, i.e. 19 is less than 1.2 · 18 = 21.6. Therefore p will be expanded, because the density variation between the preference weighted ε-neighborhoods of p and q is within the allowed range.

Case 4: In Figure 1(d), |Nεw(p)| = 22 > |Nεw(q)| and |Nεw(p)| ≤ (1 + Δ) · |Nεw(q)|, i.e. 22 is less than 1.2 · 19 = 22.8. Therefore p will be expanded, because the density variation between the preference weighted ε-neighborhoods of p and q is within the allowed range.

Definition 2 (Direct preference weighted relative density reachable): An object p is said to be direct preference weighted relative density reachable from an object q w.r.t. ε, δ, Δ ∈ R and MinPts, λ ∈ N if RCORE(p, q) holds, q is a preference weighted core object, and Pdim(Nε(p)) is at most λ. Direct preference weighted relative density reachability is symmetric for preference weighted relative core objects.

Definition 3 (Preference weighted relative density reachability): An object p ∈ D is said to be preference weighted relative density reachable from an object q ∈ D w.r.t. ε, δ, Δ ∈ R and MinPts, λ ∈ N if there is a chain of objects p1, ..., pn with p1 = q and pn = p such that pi+1 is direct preference weighted relative density reachable from pi.

Cluster formation mechanism of the Enhanced PreDeCon algorithm: Initially, every object is marked as unclassified. To find a new cluster, Enhanced PreDeCon checks whether an object is a preference weighted core object; if so, the algorithm includes the object in a newly created cluster, otherwise the object is marked as noise. Enhanced PreDeCon starts with an arbitrary preference weighted core object q, creates a new clusterID, assigns it to q, and inserts q into the queue. It then collects all objects directly preference weighted reachable from q. Those that are preference weighted relative core objects with respect to q (i.e. that satisfy Case 3 or Case 4) and are still unclassified are added to the queue for further expansion and assigned the clusterID. The remaining ones (i.e. those that satisfy Case 1 or Case 2) are simply added to the cluster but cannot be expanded. This limitation on the expansion of objects keeps the density variation within the cluster bounded by the factor Δ. For each object in the queue, the algorithm computes all directly preference weighted reachable objects; those that are direct preference weighted relative density reachable (i.e. satisfy Case 3 or Case 4) and still unclassified are inserted into the queue, while the remaining directly preference weighted reachable objects that are not direct preference weighted relative density reachable (i.e. satisfy Case 1 or Case 2) and are still unclassified are simply assigned the clusterID. This is repeated until the queue is empty and the entire cluster has been computed. At the end, every object is either assigned a clusterID or marked as noise.
Algorithm Enhanced PreDeCon(D, ε, δ, λ, Δ, MinPts)
1. Mark all objects as unclassified
2. For each unclassified object o ∈ D
3.   If CORE(o) then
4.     Insert o into the queue
5.     Generate a new clusterID and assign o to the cluster
6.     While the queue is not empty
7.       Extract the front object y from the queue
8.       Compute R = {x ∈ D | x is directly preference weighted reachable from y}
9.       For each object x ∈ R
10.        If x is unclassified and RCORE(x, y)
11.          Then insert x into the queue
12.        If x is unclassified or noise
13.          Then assign the clusterID to x
14.      End for
15.    End while
16.  Else mark o as noise
17. End for
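The cluster formation loop above can be illustrated with a compact Python sketch. This is not the authors' implementation (which was written in Java); for brevity, plain Euclidean ε-neighborhoods stand in for the preference weighted ones, so the sketch only demonstrates the queue logic and the Δ-restriction, not the full subspace machinery:

```python
import numpy as np
from collections import deque

def enhanced_predecon_sketch(X, eps, min_pts, delta_var):
    """Simplified sketch of steps 1-17 above, with plain Euclidean
    eps-neighborhoods substituted for the preference weighted ones."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neigh = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    size = [len(nb) for nb in neigh]
    core = [size[i] >= min_pts for i in range(n)]

    def relative_core(p, q):
        # both must be core and their neighborhood sizes may differ
        # by at most a factor (1 + delta_var), cf. Definition 1
        big, small = max(size[p], size[q]), min(size[p], size[q])
        return core[p] and core[q] and big <= (1 + delta_var) * small

    labels = [None] * n          # None = unclassified
    cluster_id = 0
    for o in range(n):
        if labels[o] is not None or not core[o]:
            continue
        cluster_id += 1
        labels[o] = cluster_id
        queue = deque([o])
        while queue:
            y = queue.popleft()
            for x in neigh[y]:
                if labels[x] is None:
                    labels[x] = cluster_id
                    # expand x only if the density variation w.r.t. y
                    # stays within the allowed factor (Case 3 / Case 4)
                    if relative_core(x, y):
                        queue.append(x)
    # objects never assigned to any cluster are noise
    return [lab if lab is not None else "noise" for lab in labels]
```

On two well separated dense groups plus an isolated point, the sketch assigns each group its own clusterID and labels the isolated point as noise.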
5 Experimental Evaluation

The proposed Enhanced PreDeCon algorithm and PreDeCon have been implemented in Java. The performance of the two algorithms is evaluated using the real world color image segmentation data set [12] from the UCI repository of machine learning databases. This classification data set has five class labels: grass, sky, window, concrete, and dirt.

The Enhanced PreDeCon and PreDeCon clustering algorithms are used to obtain segment labels for each pixel of an image, in order to identify potential clusters of image pixels. The extracted features are 19 attributes that describe the position of the extracted image region, line densities, edges, and color values.

Since all the attributes have numerical values and Enhanced PreDeCon and PreDeCon are distance based, density oriented clustering algorithms, each pixel with its 19 attribute values can be represented as a point in a 19-dimensional space. Both are subspace clustering algorithms, and both create the subspace mathematically using the parameter value λ. Clustering of pixels takes place due to some of the important features such as intensity-mean, rawred-mean, rawblue-mean, rawgreen-mean, and hedge-mean. In the subspace each pixel is again represented by a point, and density variation in the subspace occurs due to attribute values of the pixels, such as RGB values measured in different ways; these measured values differ for pixels belonging to different classes. PreDeCon cannot detect the density variation in the subspace and merges pixels of different classes into one cluster, whereas Enhanced PreDeCon generates a number of clusters based on the density variation parameter Δ and the other parameters.

[Figure 2: Cluster sizes per total dimensionality for the clusters generated by the PreDeCon algorithm, for the parameter values 15, 0.001, 10, 0.5.]

[Figure 3: Cluster sizes per total dimensionality for the clusters generated by the Enhanced PreDeCon algorithm, for the same parameter values and Δ = 0.2.]

Figures 2 and 3 provide a comparative study of the PreDeCon and Enhanced PreDeCon clustering algorithms. For the specified parameter values, PreDeCon generates one large cluster, as shown in Figure 2, and cannot differentiate the local density variation within the same cluster in the subspace. Enhanced PreDeCon, for the same parameter values plus the additional parameter Δ for the local density variation in the subspace, gives three large clusters and two further clusters by taking the variation in local density in the subspace into account, as shown in Figure 3.

Thus the Enhanced PreDeCon clustering algorithm detects clusters of different shapes and sizes that differ in local density in the subspace. Clusters formed by Enhanced PreDeCon have only the allowed density variation within each cluster and wide density variation between clusters, while clusters detected by PreDeCon may have wide density variation within a single cluster.

The time complexities of PreDeCon and Enhanced PreDeCon are the same. Let n be the number of data objects in the data set and d the number of dimensions of the data space; then the time complexity of both algorithms is O(n² · d).
6 Conclusion
In this paper we proposed an enhancement of the PreDeCon subspace clustering algorithm for high dimensional databases so that density variations are detected in the subspace. The proposed clustering algorithm can find clusters that represent relatively uniform regions that are not separated by sparse regions in the subspace. A parameter Δ is used to limit the amount of local density variation allowed within a cluster in the subspace. Future work includes determining the value of Δ automatically to obtain better clusterings for any given data set.
7 References
[1] A.K. Jain, R.C. Dubes: Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ (1988).
[2] J. Han, M. Kamber: Data Mining: Concepts and Techniques. Morgan Kaufmann (2001).
[3] R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan: Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 94-105 (1998).
[4] C. Böhm, K. Kailing, H.-P. Kriegel, P. Kröger: Density Connected Clustering with Local Subspace Preferences. In Proc. ICDM (2004).
[5] B. Borah, D.K. Bhattacharyya: A Clustering Technique using Density Difference. In Proc. ICSCN, India, Feb. 22-24, pp. 585-588 (2007).
[6] C.H. Cheng, A.C. Fu, Y. Zhang: Entropy-Based Subspace Clustering for Mining Numerical Data. In Proc. ACM SIGKDD (1999).
[7] K. Kailing, H.-P. Kriegel, P. Kröger: Density Connected Subspace Clustering for High-Dimensional Data. In Proc. SIAM Data Mining (2004).
[8] M. Ester, H.-P. Kriegel, J. Sander, X. Xu: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proc. KDD (1996).
[9] C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, J. Park: A Framework for Finding Projected Clusters in High Dimensional Spaces. In Proc. ACM SIGMOD International Conference on Management of Data (1999).
[10] M. Ankerst, M. Breunig, H.-P. Kriegel, J. Sander: OPTICS: Ordering Points To Identify the Clustering Structure. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 49-60 (1999).
[11] G. Karypis, E.-H. Han, V. Kumar: CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling. IEEE Computer 32(8), 68-75 (1999).
[12] UCI Machine Learning Repository, archive.ics.uci.edu/ml/datset.html.