Density-based Clustering Algorithms Parameters Estimation

Ibrahim Z. Al-Sharif
Faculty of Engineering, Islamic University, Gaza
[email protected]

Wesam M. Ashour
Faculty of Engineering, Islamic University, Gaza
[email protected]

Abstract
Density-based clustering is one of the main models of cluster analysis. It defines clusters as connected dense regions and depends on two key parameters: σ, the radius around a point within which neighbors are searched for in DBSCAN, or within which the density is calculated in DENCLUE, and ξ, the minimum number of neighboring points required for a point to be considered part of a cluster. The choice of these two parameters strongly affects the clustering quality.



This paper introduces a new approach for estimating the two key parameters σ and ξ directly, instead of the heuristic approach, which requires running the algorithm many times.

1. Introduction
Data clustering (or just clustering), also called cluster analysis, segmentation analysis, taxonomy analysis, or unsupervised classification, is a method of creating groups of objects, or clusters, in such a way that objects in one cluster are very similar and objects in different clusters are quite distinct. Data clustering is often confused with classification, in which objects are assigned to predefined classes [1]. Data clustering algorithms are classified into four categories: partitioning, hierarchical, density-based, and grid-based [2].

In density-based clustering, the local point density around a point has to exceed some threshold for the point to be considered part of a cluster [3]. The σ-neighborhood of a point p is defined as

Nσ(p) := { q ∈ D | dist(p, q) ≤ σ } .... (1)

where Nσ(p) is the set of points of the data set D within radius σ of p, and ξ is the minimum size of Nσ(p) required for p to be considered part of a cluster.

 Categorization of points
Points (objects) are categorized into three exclusive groups: core points, border points and noise points. A point is a core point if it has at least ξ (MinPts) points within radius σ (Eps); a border point has fewer than ξ (MinPts) points within radius σ (Eps) but lies in the neighborhood of a core point; and a noise point is any point that is neither a core point nor a border point.

Figure (1): The three categories of points in density-based clustering algorithms, with ξ = 4 and σ = r

 Directly density-reachable
A point q is directly density-reachable from a point p if p is a core point and q is in p's σ-neighborhood. Density reachability is transitive but not a symmetric relation, which means that nothing can be reached from a non-core point, so the cluster cannot be extended from it.

 Density-connected
A pair of points p and q are density-connected if they are both density-reachable from a common point o. This relation is symmetric, which makes it possible to extend the newly found cluster.

 Formal description of a cluster
A cluster C is a subset of points satisfying two criteria:
1. Connected: ∀ p, q ∈ C: p and q are density-connected.
2. Maximal: ∀ p, q: if p ∈ C and q is density-reachable from p, then q ∈ C.
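As an illustration (ours, not from the paper), the following sketch computes the σ-neighborhood of equation (1) and categorizes two-dimensional points into core, border and noise points for given σ (Eps) and ξ (MinPts); all class and method names are our own:

import java.util.*;

/** Minimal sketch: categorize 2-D points into core, border and noise points. */
public class PointCategorizer {

    enum Category { CORE, BORDER, NOISE }

    static double dist(double[] p, double[] q) {
        return Math.hypot(p[0] - q[0], p[1] - q[1]);
    }

    /** Indices of all points of D lying within radius sigma of point p (equation (1)). */
    static List<Integer> neighborhood(double[][] D, int p, double sigma) {
        List<Integer> n = new ArrayList<>();
        for (int q = 0; q < D.length; q++)
            if (dist(D[p], D[q]) <= sigma) n.add(q);
        return n;
    }

    static Category[] categorize(double[][] D, double sigma, int xi) {
        Category[] cat = new Category[D.length];
        List<List<Integer>> nbrs = new ArrayList<>();
        for (int p = 0; p < D.length; p++) {
            nbrs.add(neighborhood(D, p, sigma));
            cat[p] = nbrs.get(p).size() >= xi ? Category.CORE : Category.NOISE;
        }
        // A non-core point in the neighborhood of a core point is a border point.
        for (int p = 0; p < D.length; p++)
            if (cat[p] == Category.CORE)
                for (int q : nbrs.get(p))
                    if (cat[q] != Category.CORE) cat[q] = Category.BORDER;
        return cat;
    }

    public static void main(String[] args) {
        double[][] D = { {0, 0}, {0, 1}, {1, 0}, {1, 1}, {5, 5} };
        // prints [CORE, CORE, CORE, CORE, NOISE]
        System.out.println(Arrays.toString(categorize(D, 1.5, 3)));
    }
}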

The rest of the paper is structured as follows: Section 2 reviews related work, Section 3 describes the proposed method, Section 4 presents the experiments and results, and Section 5 concludes the paper.

2. Related Works
2.1 DBSCAN Algorithm
Density Based Spatial Clustering of Applications with Noise (DBSCAN) starts with an arbitrary point p and retrieves all points density-reachable from p with respect to the radius σ (Eps) and the minimum number of points ξ (MinPts). If p is a core point, this procedure yields a cluster; if p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database. Since global values are used for σ and ξ, DBSCAN may merge two clusters of different density into one cluster if they are "close" to each other.

The following is the DBSCAN algorithm:

for each o ∈ D do
    if o is not yet classified then
        if o is a core object then
            collect all objects density-reachable from o
            and assign them to a new cluster
        else
            assign o to NOISE

Unlike the k-means algorithm, DBSCAN does not require the number of clusters as input; it can find arbitrarily shaped clusters and is robust to noise objects, and it requires two key parameters: σ (Eps) and ξ (MinPts). However, DBSCAN cannot cluster data sets with large differences in density well, since the σ-ξ combination cannot then be chosen appropriately for all clusters.
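To make the pseudocode above concrete, the following is a small, illustrative Java translation (ours, not the authors' implementation); the labels and the toy data set are made up for the example:

import java.util.*;

/** A compact, illustrative sketch of DBSCAN following the pseudocode above. */
public class Dbscan {
    static final int NOISE = -1, UNCLASSIFIED = -2;

    static double dist(double[] p, double[] q) {
        return Math.hypot(p[0] - q[0], p[1] - q[1]);
    }

    static List<Integer> neighbors(double[][] D, int p, double eps) {
        List<Integer> n = new ArrayList<>();
        for (int q = 0; q < D.length; q++)
            if (dist(D[p], D[q]) <= eps) n.add(q);
        return n;
    }

    /** Returns a cluster label per point; noise points get label -1. */
    static int[] cluster(double[][] D, double eps, int minPts) {
        int[] label = new int[D.length];
        Arrays.fill(label, UNCLASSIFIED);
        int clusterId = 0;
        for (int p = 0; p < D.length; p++) {
            if (label[p] != UNCLASSIFIED) continue;
            List<Integer> seeds = neighbors(D, p, eps);
            if (seeds.size() < minPts) { label[p] = NOISE; continue; }
            // p is a core object: collect everything density-reachable from it.
            label[p] = clusterId;
            Deque<Integer> queue = new ArrayDeque<>(seeds);
            while (!queue.isEmpty()) {
                int q = queue.pop();
                if (label[q] == NOISE) label[q] = clusterId;      // border point
                if (label[q] != UNCLASSIFIED) continue;
                label[q] = clusterId;
                List<Integer> qNbrs = neighbors(D, q, eps);
                if (qNbrs.size() >= minPts) queue.addAll(qNbrs);  // q is also a core object
            }
            clusterId++;
        }
        return label;
    }

    public static void main(String[] args) {
        double[][] D = { {1, 1}, {1, 2}, {2, 1}, {8, 8}, {8, 9}, {9, 8}, {20, 20} };
        // prints [0, 0, 0, 1, 1, 1, -1]
        System.out.println(Arrays.toString(cluster(D, 1.5, 3)));
    }
}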

2.2 DENCLUE Algorithm
 How to find the clusters
The DENCLUE algorithm employs a cluster model based on kernel density estimation. A cluster is defined by a local maximum of the estimated density function. Data points are assigned to clusters by hill climbing, i.e. points going to the same local maximum are put into the same cluster [4].

A disadvantage of DENCLUE 1.0 is that its hill climbing may make unnecessarily small steps in the beginning and never converges exactly to the maximum; it only comes close. DENCLUE 2.0 introduces a new hill-climbing procedure for Gaussian kernels, which adjusts the step size automatically at no extra cost.

 Influence & density function
The influence function can be an arbitrary function. For the definition of specific influence functions, we need a distance function which determines the distance between two d-dimensional feature vectors. The distance function has to be reflexive and symmetric. For simplicity, we assume a Euclidean distance function in the following; note, however, that the definitions are independent of the choice of distance function. Examples of basic influence functions are [4]:
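The formulas themselves did not survive in this copy; from [4], the two basic examples are the square wave and the Gaussian influence function:

f_square(x, y) = 0 if d(x, y) > σ, and 1 otherwise
f_Gauss(x, y) = exp(−d(x, y)² / (2σ²))

The density function at a point x is then the sum of the influences of all N data points, f_D(x) = Σ_{i=1..N} f(x, xi).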

 Density-Attractor
A point x* is a density attractor if it is a local maximum of the density function.
 Density-attracted points
Points from which a path to x* exists along which the gradient is continuously positive (in the case of a continuous and differentiable influence function).
 Center-Defined Cluster
Assign to each density attractor the points density-attracted to it.

The DENCLUE algorithm then proceeds as follows:
1. Divide the data space into grids with size 2σ.
2. Consider only grids that are highly populated.
3. Determine the density attractors for all points using hill climbing.
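As an illustration of step 3, the sketch below (ours, not code from [4]) performs the hill climbing for a Gaussian kernel using the kernel-weighted mean update of DENCLUE 2.0, which adjusts the step size automatically as mentioned above:

import java.util.Arrays;

/** Illustrative hill climbing towards a density attractor (Gaussian kernel). */
public class HillClimb {

    static double gaussian(double[] x, double[] xi, double sigma) {
        double d2 = 0;
        for (int k = 0; k < x.length; k++) d2 += (x[k] - xi[k]) * (x[k] - xi[k]);
        return Math.exp(-d2 / (2 * sigma * sigma));
    }

    /** One update step: the kernel-weighted mean of the data points. */
    static double[] step(double[] x, double[][] data, double sigma) {
        double[] next = new double[x.length];
        double wSum = 0;
        for (double[] xi : data) {
            double w = gaussian(x, xi, sigma);
            wSum += w;
            for (int k = 0; k < x.length; k++) next[k] += w * xi[k];
        }
        for (int k = 0; k < x.length; k++) next[k] /= wSum;
        return next;
    }

    /** Iterate until the position barely moves: that point is the density attractor. */
    static double[] attractor(double[] start, double[][] data, double sigma, double eps) {
        double[] x = start.clone();
        for (int it = 0; it < 1000; it++) {
            double[] next = step(x, data, sigma);
            double move = 0;
            for (int k = 0; k < x.length; k++) move += Math.abs(next[k] - x[k]);
            x = next;
            if (move < eps) break;
        }
        return x;
    }

    public static void main(String[] args) {
        double[][] data = { {0, 0}, {0.2, 0.1}, {0.1, 0.3}, {5, 5}, {5.1, 4.9} };
        System.out.println(Arrays.toString(attractor(new double[]{0.5, 0.5}, data, 1.0, 1e-6)));
    }
}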

DENCLUE can find arbitrarily shaped clusters, it is robust to noise objects, and it requires three key parameters: σ (Eps), ξ (MinPts), and a third parameter (the hill-climbing step size, δ in [4]) used to calculate the density attractors.

2.3 OPTICS Algorithm
OPTICS (Ordering Points To Identify the Clustering Structure) is an algorithm for finding density-based clusters in spatial data. Its basic idea is similar to DBSCAN, but it addresses one of DBSCAN's major weaknesses: the problem of detecting meaningful clusters in data of varying density. To do so, the points of the database are (linearly) ordered such that points which are spatially closest become neighbors in the ordering. Additionally, a special distance is stored for each point that represents the density that needs to be accepted for a cluster in order for both points to belong to the same cluster [5]. OPTICS can be seen as a generalization of DBSCAN that replaces the σ (Eps) parameter with a maximum value that mostly affects performance; ξ (MinPts) then essentially becomes the minimum cluster size to find. While the algorithm is much easier to parameterize than DBSCAN, the results are a bit more difficult to use, as it usually produces a hierarchical clustering instead of the simple data partitioning that DBSCAN produces.

2.4 BDE-DBSCAN
In 2014, Amin Karami and Ronnie Johansson introduced an article [6] that presents an efficient and effective hybrid clustering method, named BDE-DBSCAN, that combines Binary Differential Evolution and the DBSCAN algorithm to quickly and automatically specify appropriate parameter values for Eps and MinPts. Since the Eps parameter can largely degrade the efficiency of the DBSCAN algorithm, a combination of an analytical way of estimating Eps and the Tournament Selection (TS) method is also employed.

2.5 Entropy Method
In 2011, Niphaphorn Obthong and Wiwat Sriphum introduced a paper on the optimal choice of parameters for DENCLUE-based and ant colony clustering [7].

Entropy is a measure of uncertainty [8]; here it measures the spread of density when choosing the σ value. If the density at a data point is comparatively equal to that at the other data points, the uncertainty of the density spread, and hence the entropy, increases. On the other hand, if the data points have clearly different density spreads, the uncertainty of the spread and the entropy decrease. Consequently, the optimal σ value is the one that minimizes the entropy measure. For n data points in the data space D, the density entropy value can be found using equation (2):
........... (2)

If the σ value is too low or too high, the value of H gets close to the maximum entropy, while some σ values make H a global minimum, which corresponds to the optimal σ value. After obtaining the proper σ value, it is possible to find the minimum density level for a density attractor, ξ. The clustering results therefore depend on the density threshold ξ, which can be found using equation (3):
................... (3)
where ||DN|| is the number of noise points, c is a constant with c ≤ 10, and d is the dimensionality of the data [9].
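Equation (2) itself did not survive extraction above. Assuming the density entropy has the usual Shannon form over normalized point densities (as appears to be used in [8]; the exact normalization there may differ), it would read:

H(D) = − Σ_{i=1..n} pi · log(pi),   with pi = f(xi) / Σ_{j=1..n} f(xj)

where f(xi) is the estimated density at point xi for a given σ. Under this assumed form, H is largest when all points have equal density and decreases as the densities become more distinct, which matches the description above.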

3. The Proposed Method of Estimating σ and ξ
The basic idea of the proposed estimation method is to map the distribution of the points into a two-dimensional array in the case of two-dimensional data (an n-dimensional array in the case of n-dimensional data), such that each data point belongs to exactly one location in the array, but not vice versa, according to its dimension values, and then to calculate distances and densities over these locations in order to obtain suitable parameters. To do this we divide the range of each data dimension into equal steps; the number of steps in each dimension is the length of that dimension in the array, which we call the map array.

Each cell in the array represents a location containing a number of points, and the value in each cell is the number of points in that location. To determine the lengths of the array dimensions we first calculate the range of the data in each dimension:

d1 = max1 − min1 ..................... (4)
d2 = max2 − min2 ..................... (5)

where min1 and max1 are the minimum and maximum values in the first dimension, min2 and max2 are the minimum and maximum values in the second dimension, and d1 and d2 are the ranges of the first and second dimensions respectively.

For our example on (D1): max1 = 17.124, min1 = 3.402, max2 = 17.012, min2 = 3.178, d1 = 13.722 and d2 = 13.834. We then calculate the step value in each dimension as follows:

step1 = 1, step2 = d2 / d1    (if d1 < d2)
step1 = d1 / d2, step2 = 1    (if d1 > d2)          .......... (6)
step1 = step2 = 1             (if d1 = d2)

where step1 is the step value in the first dimension of the points and step2 is the step value in the second dimension. When d1 is less than d2, steps in the first dimension must weigh less than steps in the second dimension.

Figure (2): Data set (D1) [10]

For example, data set (D1) in Figure (2), which consists of 600 two-dimensional points, results in a map array of locations as in Figure (3). For our example on (D1): step1 = 1, step2 = 1.008.

The sizes of the map array dimensions depend on step1 and step2 and are calculated as follows:

length1 = (Integer)(d1 / step1) + 1 .................... (7)
length2 = (Integer)(d2 / step2) + 1 .................... (8)

We add 1 in equations (7) and (8) because the result has to be an integer value and the division is truncated. For our example on (D1): length1 = 14, length2 = 14.

Figure (3): Map array example

Note that each location in the array covers a distance of step1 in the first dimension and step2 in the second dimension. To identify the location of each point in the map array, we use the following equations:

index1 = (Integer)((first-dimension value − min1) / step1) .............. (9)
index2 = (Integer)((second-dimension value − min2) / step2) ............. (10)

where index1 and index2 are the indices of the location in the map array (map[index1][index2]). For example, the point (14.02, 5.614) has index1 = (14.02 − 3.402)/1 = 10 and index2 = (5.614 − 3.178)/1.008 = 2.

Each location that has a number of points greater than t = 2 (the noise threshold) is considered a dense location. We choose t > 2 because every location covers enough area that only two points in that space almost certainly represent noise; we do not choose t > 0 or t > 1 in order to eliminate the effect of noise as far as possible.
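As an illustrative sketch (ours, not the authors' program), the following Java class builds the map array described above for two-dimensional data, following equations (4) to (10), and returns the dense locations for a noise threshold t:

import java.util.*;

/** Illustrative sketch of the map-array construction (equations (4)-(10)). */
public class MapArray {
    final double min1, min2, step1, step2;
    final int[][] counts;          // counts[index1][index2] = number of points in that location

    MapArray(double[][] data) {
        double max1 = Double.NEGATIVE_INFINITY, max2 = Double.NEGATIVE_INFINITY;
        double mn1 = Double.POSITIVE_INFINITY, mn2 = Double.POSITIVE_INFINITY;
        for (double[] p : data) {
            mn1 = Math.min(mn1, p[0]); max1 = Math.max(max1, p[0]);
            mn2 = Math.min(mn2, p[1]); max2 = Math.max(max2, p[1]);
        }
        min1 = mn1; min2 = mn2;
        double d1 = max1 - min1, d2 = max2 - min2;              // equations (4), (5)
        if (d1 < d2)      { step1 = 1;       step2 = d2 / d1; } // equation (6)
        else if (d1 > d2) { step1 = d1 / d2; step2 = 1; }
        else              { step1 = 1;       step2 = 1; }
        int length1 = (int) (d1 / step1) + 1;                   // equation (7)
        int length2 = (int) (d2 / step2) + 1;                   // equation (8)
        counts = new int[length1][length2];
        for (double[] p : data) {
            int i1 = (int) ((p[0] - min1) / step1);             // equation (9)
            int i2 = (int) ((p[1] - min2) / step2);             // equation (10)
            counts[i1][i2]++;
        }
    }

    /** Dense locations: more than t points (the paper uses t = 2 by default). */
    List<int[]> denseLocations(int t) {
        List<int[]> dense = new ArrayList<>();
        for (int i = 0; i < counts.length; i++)
            for (int j = 0; j < counts[i].length; j++)
                if (counts[i][j] > t) dense.add(new int[]{i, j});
        return dense;
    }
}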

For every dense location we calculate the closest location to it; the closest location must be a non-empty location. We then identify the minimum and maximum Euclidean distance between all dense locations and their closest locations. The distance between a dense location and its closest location is:

d(l, c) = √((l.x − c.x)² + (l.y − c.y)²) ................. (11)

where l and c are the dense location and its closest location, and d is the distance between the two locations.

For the previous example on (D1), two samples of locations and their closest locations are as follows:

l = [0, 3], c = [1, 3]
l = [0, 4], c = [1, 4]

Figure (4): Closest location example

In Figure (4), the location [0,3], which has 17 points, has location [1,3] as its closest location. Note that [0,4] could also be the closest location to [0,3]; since we are concerned here only with the distance, and both are at the same distance from [0,3], it does not matter which location we consider. Note also that location [1,4] is farther from [0,3] than [1,3] and [0,4] are.

The minimum and maximum Euclidean distances between all dense locations and their closest locations are used for calculating σ as follows: let (lmin, cmin) be the dense location and its closest location with minimum distance, let (lmax, cmax) be the dense location and its closest location with maximum distance, and let Avdis(L, C) be the average distance between all points in the two locations L and C. Then:

σ = [Avdis(lmin, cmin) + Avdis(lmax, cmax)] / 2 ........... (12)

To get the estimated minimum number of points ξ, we take the average of the number of points in the two locations with maximum distance (lmax, cmax) and the two locations with minimum distance (lmin, cmin). Let Count(L) be the number of points in location L; then:

C1 = Count(lmin) + Count(cmin) ........... (13)
C2 = Count(lmax) + Count(cmax) ........... (14)
ξ = (C1 + C2) / 2 ........... (15)
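Continuing the sketch, the following illustrative code (again ours, not the authors') estimates σ and ξ from the dense locations following equations (11) to (15). It assumes the points falling into each map-array location have been recorded, for example by extending the MapArray sketch above, and it measures the distance of equation (11) between location indices:

import java.util.*;

/** Illustrative sketch of the parameter estimation in equations (11)-(15). */
public class ParameterEstimator {

    /** Euclidean distance between two map-array locations, equation (11). */
    static double locDist(int[] l, int[] c) {
        return Math.hypot(l[0] - c[0], l[1] - c[1]);
    }

    /** Average distance between all pairs of points of locations L and C (Avdis). */
    static double avDis(List<double[]> L, List<double[]> C) {
        double sum = 0;
        for (double[] p : L)
            for (double[] q : C)
                sum += Math.hypot(p[0] - q[0], p[1] - q[1]);
        return sum / (L.size() * C.size());
    }

    /** Returns {sigma, xi}; assumes at least one dense location with a non-empty neighbor. */
    static double[] estimate(List<double[]>[][] pointsAt, List<int[]> dense) {
        int[] lMin = null, cMin = null, lMax = null, cMax = null;
        double dMin = Double.POSITIVE_INFINITY, dMax = Double.NEGATIVE_INFINITY;
        for (int[] l : dense) {
            int[] closest = null;
            double best = Double.POSITIVE_INFINITY;
            // closest non-empty location to l (other than l itself)
            for (int i = 0; i < pointsAt.length; i++)
                for (int j = 0; j < pointsAt[i].length; j++) {
                    if ((i == l[0] && j == l[1]) || pointsAt[i][j].isEmpty()) continue;
                    double d = locDist(l, new int[]{i, j});
                    if (d < best) { best = d; closest = new int[]{i, j}; }
                }
            if (closest == null) continue;
            if (best < dMin) { dMin = best; lMin = l; cMin = closest; }
            if (best > dMax) { dMax = best; lMax = l; cMax = closest; }
        }
        List<double[]> pLmin = pointsAt[lMin[0]][lMin[1]], pCmin = pointsAt[cMin[0]][cMin[1]];
        List<double[]> pLmax = pointsAt[lMax[0]][lMax[1]], pCmax = pointsAt[cMax[0]][cMax[1]];
        double sigma = (avDis(pLmin, pCmin) + avDis(pLmax, pCmax)) / 2;   // equation (12)
        double c1 = pLmin.size() + pCmin.size();                          // equation (13)
        double c2 = pLmax.size() + pCmax.size();                          // equation (14)
        double xi = (c1 + c2) / 2;                                        // equation (15)
        return new double[]{sigma, xi};
    }
}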

4. Experiments and Results
Our experiments are performed on different two-dimensional data sets, simulated by a Java program on a personal computer with a Core i5 CPU and 8 GB of RAM.

Experiment I
The data set (D1) shown in Figure (2) consists of 600 points in 15 clusters. To get a good clustering by DBSCAN or DENCLUE we would have to try random parameters many times and adjust them by running the algorithm repeatedly. Table (1) below shows the results of running DBSCAN many times to adjust the parameters.

# points   σ     ξ    # clusters   # noise points
600        1.5   30   8            0
600        1.5   20   8            0
600        1.5   10   8            0
600        1.5    5   8            0
600        1.5    1   8            0
600        1     30   8            0
600        1     20   8            0
600        1     10   8            0
600        1      5   8            0
600        1      1   8            0
600        0.5   30   12           185
600        0.5   20   15           26
600        0.5   10   14           9
600        0.5    5   11           5
600        0.5    1   13           0

Table (1): Results of heuristic experiments on (D1)

After many experiments we note that the best results are obtained for σ between 0.5 and 1.5. As we can see, the results are poor, and we do not know exactly when to increase or decrease these parameters. But when we apply the proposed algorithm we obtain the following results:

# points   σ       ξ    # clusters   # noise points
600        0.589   20   15           5

Table (2): Results of the proposed method on (D1)

Figure (5): Data set (D1) clustered by DBSCAN with our estimated parameters; 595 points of 600 are clustered into 15 clusters.

The results show that the proposed method estimates the parameters directly and accurately.

Experiment II
The data set (D2) shown in Figure (6) consists of 788 points in 7 clusters. To get a good clustering by DBSCAN or DENCLUE we would again have to try random parameters many times and adjust them by running the algorithm repeatedly. After many experiments we note that the best results are obtained for σ between 1 and 2.

Figure (6): Data set D2 [11]

Table (3) below shows the results of running DBSCAN many times to adjust the parameters.

# points   σ     ξ    # clusters   # noise points
788        2     10   5            0
788        2      8   5            0
788        2      6   5            0
788        2      5   5            0
788        2      4   5            0
788        2      3   5            0
788        2      2   5            0
788        1.5   10   7            16
788        1.5    8   7            3
788        1.5    6   5            1
788        1.5    5   5            1
788        1.5    4   5            1
788        1.5    3   5            1
788        1.5    2   5            0
788        1.2   10   11           371
788        1.2    8   11           97
788        1.2    6   7            9
788        1.2    5   6            3
788        1.2    4   5            2
788        1.2    3   5            2
788        1.2    2   6            0
788        1     10   2            777
788        1      8   16           506
788        1      6   20           164
788        1      5   13           61
788        1      4   7            18
788        1      3   6            8
788        1      2   7            6

Table (3): Results of heuristic experiments on (D2)

When we apply the proposed algorithm we obtain the following results:

# points   σ      ξ   # clusters   # noise points
788        1.30   7   7            9

Table (4): Results of the proposed method on (D2)

The estimated parameters (σ = 1.30, ξ = 7) generate a very good clustering result. Figure (7) below shows a visualization of the DBSCAN clustering on data set (D2) with our estimated parameters.

Figure (7): Data set (D2) clustered by DBSCAN with our estimated parameters; 779 points of 788 are clustered into 5 clusters.

Note that the heuristic result in Table (3) which produces 7 clusters with 6 noise points (σ = 1, ξ = 2) does not correspond to the right parameters, because the clusters themselves are not the ideal clusters, while the choice of (σ = 1.5, ξ = 8) is very good because it produces the 7 right clusters with only 3 noise points.

Figure (8): Data set (D2) clustered by DBSCAN with (σ = 1, ξ = 2); all 788 points are clustered into 7 clusters, but the wrong ones.

Experiment III
The data set (D3) shown in Figure (9) consists of 3100 points in 31 clusters.

Figure (9): Data set D3 [10]

To get a good clustering by DBSCAN or DENCLUE we would have to try random parameters many times and adjust them by running the algorithm repeatedly; Table (8) shows our results. After a lot of heuristic experiments we note that the best values for σ are between 1 and 1.5, and we fail to find any parameters that produce 31 clusters with a small number of noise points, because of the nature of the data set itself. When we apply the proposed algorithm we obtain the following results:

# points   σ      ξ    # clusters   # noise points
3100       1.01   21   13           38

Table (5): Results of the proposed method on (D3)

# points   σ     ξ    # clusters   # noise points
3100       1.5   50   9            3
3100       1.5   48   8            3
3100       1.5   46   8            3
3100       1.5   44   8            3
3100       1.5   42   8            3
3100       1.5   40   7            3
3100       1.5   30   6            2
3100       1.5   25   2            2
3100       1.5   20   2            2
3100       1.5   18   2            2
3100       1.5   15   2            2
3100       1     50   31           447
3100       1     48   31           381
3100       1     46   31           323
3100       1     44   31           268
3100       1     42   31           230
3100       1     40   30           193
3100       1     30   23           87
3100       1     25   18           56
3100       1     20   13           37
3100       1     18   12           27
3100       1     15   9            15

Table (8): Results of heuristic experiments on (D3)

Note that the data set (D3) is classified into 31 clusters, but our estimated parameters make DBSCAN classify it into 13 clusters, while the heuristic experiments show that with σ = 1, ξ = 42 DBSCAN classifies (D3) into 31 clusters with 230 points as noise (~7.5%).

Figure (10): Data set (D3) clustered by DBSCAN with our estimated parameters; 3062 points of 3100 are clustered into 13 clusters.

For this type of data set, in which we expect heavy noise, we have to increase the noise threshold to t > 3 to further eliminate the effect of noise. When we apply the proposed algorithm with t > 3, we obtain better results than with the default value t > 2; Table (9) shows the new estimated parameters.

# points   σ      ξ    # clusters   # noise points
3100       0.82   24   28           225

Table (9): Results of the proposed method on (D3), with t > 3

Figure (11) is a visualization of the DBSCAN clustering with the new estimated parameters.

Figure (11): Data set (D3) clustered by DBSCAN with our estimated parameters; 2875 points of 3100 are clustered into 28 clusters.

Overall, when we used the proposed algorithm on data set (D3), we came to a result very close to the best result obtained after tens of experiments, so we obtained good results in much less time.

5. Conclusion
Experiments showed that the proposed method of estimating the parameters of density-based algorithms is very useful: it allows us to estimate very good σ (Eps) and ξ (MinPts) parameters directly, without needing to run the clustering tens of times.

For future work, we need to work on reducing the effect of noise, improving performance, and handling multi-dimensional data.

References
[1] Gan, Guojun, Chaoqun Ma, and Jianhong Wu. Data Clustering: Theory, Algorithms, and Applications. ASA-SIAM Series on Statistics and Applied Probability, SIAM, Philadelphia, ASA, Alexandria, VA, 2007.
[2] Tran Manh Thang and Juntae Kim. The anomaly detection by using DBSCAN clustering with multiple parameters. In International Conference on Information Science and Applications (ICISA), pages 1-5, 2011.
[3] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pages 226-231, 1996.
[4] Alexander Hinneburg and Daniel A. Keim. An Efficient Approach to Clustering in Large Multimedia Databases with Noise. 1998.
[5] Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander. OPTICS: Ordering Points To Identify the Clustering Structure. ACM SIGMOD International Conference on Management of Data, ACM Press, pp. 49-60, 1999.
[6] Amin Karami and Ronnie Johansson. Choosing DBSCAN Parameters Automatically using Differential Evolution. International Journal of Computer Applications 91(7):1-11, April 2014.
[7] Niphaphorn Obthong and Wiwat Sriphum. Optimal Choice of Parameters for DENCLUE-based and Ant Colony Clustering. 2011 International Conference on Modeling, Simulation and Control, IPCSIT vol. 10, IACSIT Press, Singapore, 2011.
[8] Xiao-Gao Yu and Yin Jian. A New Clustering Algorithm based on KNN and DENCLUE. Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, Guangzhou, 18-21 August 2005.
[9] Niphaphorn Obthong and Wiwat Sriphum. Optimal Choice of Parameters for DENCLUE-based and Ant Colony Clustering. 2011 International Conference on Modeling, Simulation and Control, IPCSIT vol. 10, IACSIT Press, Singapore, 2011.
[10] Veenman, C.J., M.J.T. Reinders, and E. Backer. A maximum variance cluster algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(9): pp. 1273-1280.
[11] Gionis, A., H. Mannila, and P. Tsaparas. Clustering aggregation. ACM Transactions on Knowledge Discovery from Data (TKDD), 2007, 1(1): pp. 1-30.
