Fuzzy joint points based clustering algorithms for large data sets


Fuzzy Sets and Systems 270 (2015) 111–126, www.elsevier.com/locate/fss

Efendi Nasibov a,b,*, Can Atilgan a, Murat Ersen Berberler a, Resmiye Nasiboglu a

a Department of Computer Science, Dokuz Eylul University, Buca, 35160 Izmir, Turkey
b Institute of Control Systems, Azerbaijan National Academy of Sciences, AZ-1141 Baku, Azerbaijan

Received 24 September 2013; received in revised form 16 June 2014; accepted 15 August 2014
Available online 27 August 2014

Abstract

The fuzzy joint points (FJP) method is one of the successful fuzzy approaches to density-based clustering. Besides the basic FJP method, there are other methods based on the FJP approach, such as Noise-Robust FJP (NRFJP) and Fuzzy Neighborhood DBSCAN (FN-DBSCAN). These FJP-based methods suffer from the low speed of the FJP algorithm, so applications that deal with large databases cannot benefit from them. The Modified FJP (MFJP) method addresses this issue and achieves an improvement in speed, but it is still not satisfactory in terms of applicability. In this work, we integrate various methods with FJP to establish an optimal-time algorithm. An even faster algorithm, which uses the FJP approach in a somewhat supervised fashion, is also proposed. Along with a theoretical comparison, experimental results are presented to show the significant speed improvement, which allows the FJP-based methods to be used on large data sets.

© 2014 Elsevier B.V. All rights reserved.

Keywords: Fuzzy neighborhood relation; Fuzzy Joint Points (FJP); Clustering; Optimal algorithm

1. Introduction

Clustering is one of the main tasks of data mining, a well-studied field of extracting information from large data sets. The goal of clustering algorithms is to segment the entire data set into relatively homogeneous clusters, such that the similarity of the records within a cluster is maximized and the similarity to records outside the cluster is minimized. Clustering techniques are commonly categorized into hierarchical, model-based, grid-based, partitioning-based and density-based clustering [1,2]. Implementing fuzzy approaches to clustering usually yields more robust methods. Fuzzy C-Means (FCM), which falls into the partitioning-based category, is the most popular fuzzy clustering algorithm. Various algorithms have been developed by integrating FCM with other methods [3–7]. Despite their speed advantage, partitioning-based algorithms have some major disadvantages. One is the need for the number of clusters to be known beforehand; these algorithms are usually run multiple times with different cluster numbers, and the best of the obtained results is then singled out. Another disadvantage is the dependency on the chosen metric.

* Corresponding author at: Department of Computer Science, Dokuz Eylul University, Buca, 35160 Izmir, Turkey.

E-mail address: [email protected] (E. Nasibov).
http://dx.doi.org/10.1016/j.fss.2014.08.004
0165-0114/© 2014 Elsevier B.V. All rights reserved.

Shapes of the resulting clusters are actually determined by the distance function, which implies that clusters with different and irregular shapes cannot be discovered using partitioning-based algorithms. Specifying initial cluster centers and handling noise are the other problems with these algorithms. Density-based clustering methods discover spatially connected components of data by investigating neighborhood relations. These methods typically deal with noise well and are designed to discover clusters with irregular shapes. Some well-known algorithms such as DBSCAN, OPTICS, DBCLASD and DENCLUE [8–11] embody the density-based approach. Algorithms like GDBSCAN can also work on geometric objects rather than merely points [12]. For clustering web pages, an algorithm with relatively low time complexity was given in [13]. DBSCAN is sensitive to two parameter inputs, which are used to determine neighborhood relations. OPTICS addresses the parameter selection problem of DBSCAN by intelligently ordering the points with respect to a reachability property described in [9]. It does not explicitly extract clustering partitions, but maintains a cluster ordering that contains information equivalent to a wide range of parameter settings of DBSCAN. A robust density-based method that builds a hierarchy of partitions with different densities, called HDS (or Auto-HDS), is given in [14]. HDS starts by clustering the entire data set (i.e. no point is considered noise) and gradually reduces the number of data points it handles by a predefined fraction, resulting in a number of partitions with different densities. Each partition is then evaluated to construct a compact hierarchy. It also provides ranking criteria to choose the best clustering. A density-based algorithm that uses a fuzzy neighborhood relation is the Fuzzy Joint Points (FJP) algorithm [15]; it is a fuzzy relative of DBSCAN. FN-DBSCAN, Scalable FN-DBSCAN, NRFJP and MFJP are other algorithms based on the FJP approach [16–20].
They were developed to improve the FJP algorithm in terms of noise handling, speed and overall clustering performance. Although FJP-like algorithms yield some notable advantages over DBSCAN, they are slower. The modified version of FJP, i.e. MFJP, provides a speed improvement to some extent. However, the implementation of the MFJP algorithm was still so slow that the computational tests could be conducted on at most 1336 data points with 2 dimensions in an acceptable amount of time. In this study, we discuss how FJP-based methods can be further improved in terms of worst-case time complexity and running time performance, so that they can be used in practice on larger data sets. In Section 2, we review the FJP and MFJP methods. Their time complexities are analyzed to reveal the bottlenecks, and improvements that lead to an optimal-time algorithm are introduced in Section 3. A novel algorithm based on the FJP approach, which has a considerable speed advantage, is then presented in Section 4. We give theoretical and experimental comparisons of the discussed methods in Section 5. Finally, Section 6 concludes the paper.

2. The FJP and MFJP methods

The FJP and MFJP methods use the Euclidean distance as their fundamental notion of distance between data points. There are clustering methods that use different distance functions, but the majority of distance-based clustering methods use the Euclidean distance. The Euclidean distance between any points a and b of the m-dimensional space E^m is defined as

d(a, b) = √( Σ_{i=1}^{m} (a_i − b_i)² ).

Any distance function mentioned in this paper refers to the Euclidean distance. Data points are handled as conical fuzzy points, and fuzzy neighborhood relations are used to determine clusters. The definitions of the mathematical notions that are essential for the FJP and MFJP methods are introduced below.

Conical fuzzy point: A conical fuzzy point P = (p, R) ∈ F(E^m) is a fuzzy set whose membership function is given by

μ_P(x) = 1 − d(x, p)/R  if d(x, p) ≤ R,  and  μ_P(x) = 0  otherwise,

where p ∈ E^m is the center of the conical fuzzy point P, and R ∈ E^1 is the radius of the point's support set supp P, where

supp P = {x ∈ E^m | μ_P(x) > 0}.

α-Level set: The α-level set of a conical fuzzy point P is expressed as follows:

P_α = {x ∈ E^m | μ_P(x) ≥ α}.

Fuzzy neighborhood relation: A fuzzy neighborhood relation T : X × X → [0, 1] on a set X is defined by

T(X1, X2) = 1 − d(x1, x2)/(2R),

where x1, x2 ∈ E^m are the centers of the conical fuzzy points X1, X2 ∈ X ⊂ F(E^m), respectively. For |X| = n, the radius is chosen in the methods as

R = d_max/2,  where d_max = max d(x_i, x_j), i, j = 1, ..., n.

This implies

T(X_i, X_j) = 1 − d(x_i, x_j)/d_max.
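To make the construction concrete, the relation matrix can be computed directly from the pairwise distances. The following Python sketch is ours, not the authors' code; it assumes the data points are given as equal-length coordinate tuples:

```python
import math

def fuzzy_neighborhood_matrix(points):
    """Build the fuzzy neighborhood relation t_ij = 1 - d(x_i, x_j) / d_max,
    which follows from choosing the radius R = d_max / 2."""
    n = len(points)
    d = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    d_max = max(max(row) for row in d)
    return [[1.0 - d[i][j] / d_max for j in range(n)] for i in range(n)]
```

The diagonal of the resulting matrix is always 1 (reflexivity) and the matrix is symmetric, which is precisely what the fast closure algorithms discussed later rely on.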

α-Neighborhood: Similar to the α-level set notion, the α-neighborhood of a conical fuzzy point P is the set of conical fuzzy points whose fuzzy neighborhood relation values to P are greater than or equal to a particular α value. In other words, conical fuzzy points X1 = (x1, R) and X2 = (x2, R) are α-neighbors if

T(X1, X2) ≥ α,  α ∈ (0, 1],

and this is denoted by X1 ∼α X2.

Max–min composition: Let T : X × X → [0, 1] be a fuzzy neighborhood relation with matrix T = (t_ij)_{n×n}. The relation T² = (t²_ij)_{n×n} calculated as

t²_ij = max_k min{t_ik, t_kj},  i, j = 1, ..., n,

is the max–min composition of T with itself, denoted by T ∘ T.

Max–min transitive closure: Let T : X × X → [0, 1] be a fuzzy neighborhood relation. The relation T̂ defined by

T̂ = T ∪ T² ∪ ... ∪ Tⁿ ∪ Tⁿ⁺¹ ∪ ...,  where T^k = T ∘ T^{k−1}, k ∈ Z, k ≥ 2,

is the max–min transitive closure of the relation T.

Fuzzy α-joint points: Conical fuzzy points X1 and X2 are fuzzy α-joint points if there is a sequence of α-neighbors between them, such that

X1 ∼α Y1, Y1 ∼α Y2, ..., Y_{k−1} ∼α Y_k, Y_k ∼α X2,  k ≥ 0.

It is shown in [15] that any points X1, X2 are fuzzy α-joint points if and only if T̂(X1, X2) ≥ α, where T̂ : X × X → [0, 1] is the max–min transitive closure of the relation T : X × X → [0, 1].
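As a concrete, if slow, illustration of these definitions, the closure can be computed by repeated max–min composition until a fixed point is reached. This Python sketch is our own illustration, not the paper's algorithm:

```python
def maxmin_compose(A, B):
    """Max-min composition: (A ∘ B)_ij = max_k min(A_ik, B_kj)."""
    n = len(A)
    return [[max(min(A[i][k], B[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]

def transitive_closure_naive(T):
    """Max-min transitive closure T-hat = T ∪ T² ∪ ...; for a reflexive
    relation T ⊂ T² ⊂ ..., so composing with T until the matrix stops
    changing reaches the closure."""
    C = [row[:] for row in T]
    while True:
        N = maxmin_compose(C, T)
        if N == C:
            return C
        C = N
```

The result is idempotent under composition, which is exactly the max–min transitivity property.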

Pseudo-code of the FJP algorithm is as follows:

FJP(X):
Input: Data set X = {x1, x2, ..., xn}.
Output: Clustering partition X¹, X², ..., X^k.
Step1. Calculate d_ij = d(x_i, x_j), d_max = max d_ij, i, j = 1, ..., n; α0 = 1;
Step2. Calculate the fuzzy neighborhood relation T: t_ij = 1 − d_ij/d_max, i, j = 1, ..., n;
Step3. Calculate the transitive closure T̂ of the relation T;
Step4. y_i = {x_i}, i = 1, ..., n; h = 1; k = n;
Step5. Calculate d(y_i, y_j) = min{d(x′, x″) | x′ ∈ y_i, x″ ∈ y_j}; d_h = min_{i≠j} d(y_i, y_j), i, j = 1, ..., k; α_h = 1 − d_h/d_max;
Step6. Call the procedure Clusters(X, α_h) to obtain a clustering partition X¹, X², ..., X^k of the global data set X;
Step7. If k > 1, then y_i = X^i, i = 1, ..., k; h = h + 1; go to Step5; End if;
Step8. Calculate Δα_i = α_i − α_{i+1}, i = 0, ..., h − 1; z = arg max_{i=0,...,h−1} Δα_i;
Step9. Call the procedure Clusters(X, α_z) to obtain the resulting clustering partition X¹, X², ..., X^k;
End.

Pseudo-code of the procedure Clusters is as follows:

Procedure Clusters(X, α):
Input: Data set X and parameter α.
Output: Clustering partition X¹, X², ..., X^k.
Step1. S = F(X), where X is the global data set; k = 1;
Step2. Pick an element A ∈ S to form X^k = {B ∈ S | T̂(A, B) ≥ α}; S = S \ X^k;
Step3. If S ≠ ∅, then k = k + 1; go to Step2;
Else, return the partition X¹, X², ..., X^k and k;
End if;
End.
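Because T̂ is max–min transitive, picking any unassigned representative and collecting every point whose closure value to it is at least α yields a well-defined partition. A small Python sketch of ours, working on an index-based closure matrix, illustrates the procedure:

```python
def clusters_at_alpha(t_hat, alpha):
    """Procedure Clusters sketch: each unassigned point starts a new cluster
    and absorbs every unassigned point whose closure value to it is >= alpha.
    This yields a partition because t_hat is a transitive similarity matrix.
    Returns (labels, k), where labels[i] is the cluster index of point i."""
    n = len(t_hat)
    labels = [None] * n
    k = 0
    for a in range(n):
        if labels[a] is None:
            for b in range(n):
                if labels[b] is None and t_hat[a][b] >= alpha:
                    labels[b] = k
            k += 1
    return labels, k
```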

The MFJP method is a modification of the FJP method, which offers a faster algorithm with the same clustering efficiency. Two strategies based on mathematical lemmas are applied to decrease the running time and the time complexity, while maintaining the structure of the original algorithm. The first strategy is exploiting the reflexivity property of the fuzzy neighborhood relation to compute the transitive closure, as explained below.

Proposition 1. Let T : X × X → [0, 1] be a fuzzy neighborhood relation and |X| = n. If T is a reflexive relation, then

I ⊂ T ⊂ T² ⊂ ... ⊂ T^{n−1} = T^n = T^{n+1} = ...,

therefore T̂ = T^{n−1} holds. Also, there exists a positive integer k such that T^{2(k−1)} ⊂ T^{2k} = T^{2(k+1)} = ..., where T^{2k} = T^{2(k−1)} ∘ T^{2(k−1)}.

Lemma 1. According to the propositions above, it is proved in [20] that T̂ = T^{2k}, where k = ⌈log₂(n − 1)⌉.

The second strategy is extracting the critical alpha value α_z from the transitive closure matrix, instead of calculating clustering partitions and finding the minimum distance between the clusters for different α values at each step.

Lemma 2. Consider an array V = (a_i)_{i=1}^{h}, which consists of the unique elements of the matrix T̂ = (t̂_ij)_{n×n} and is sorted in decreasing order. It is again proved in [20] that the clustering partitions corresponding to the different α_i ∈ V are also different, and the clustering partitions corresponding to any values in the interval [α_{i+1}, α_i) are the same.

Pseudo-code of the MFJP algorithm is as follows:

MFJP(X):
Input: Data set X = {x1, x2, ..., xn}.
Output: Clustering partition X¹, X², ..., X^k.
Step1. Calculate d_ij = d(x_i, x_j), d_max = max d_ij, i, j = 1, ..., n;
Step2. Calculate the fuzzy neighborhood relation T: t_ij = 1 − d_ij/d_max, i, j = 1, ..., n;
Step3. Calculate the transitive closure T̂ of the relation T using Lemma 1;
Step4. Form an array V = (a_i)_{i=1}^{h} by getting the unique elements of the T̂ matrix and sort the array in decreasing order;
Step5. Calculate Δα_i = α_i − α_{i+1}, i = 1, ..., h − 1; z = arg max_{i=1,...,h−1} Δα_i;
Step6. Call the procedure Clusters(X, α_z) to obtain the resulting clustering partition X¹, X², ..., X^k;
End.

It is worth noting that α_z is chosen at the maximum difference between two consecutive α values of the sorted array V, since this implies the clustering partition remains the same for the largest interval [α_{i+1}, α_i), according to Lemma 2. In other words, the minimum inter-cluster distance is maximized. The modifications provided in the MFJP method clearly improve the speed. However, it is still a relatively slow algorithm, since its worst-case time complexity is higher than that of similar classical methods such as DBSCAN, and the computational experiments conducted in the previous work [20] support this analysis.

3. Improving the time complexity

The worst-case time complexities of the FJP and MFJP algorithms are O(n⁴) and O(n³ log₂ n), respectively. Let us investigate the complexities step by step. Step1 and Step2 of both algorithms run in O(n²) time, since pairwise distances are calculated and n × n matrices are formed. By definition, calculating a max–min composition has O(n³) complexity. Calculating the transitive closure using the standard definition requires calculating max–min compositions n − 2 times; thus, Step3 has O(n⁴) complexity. On the other hand, calculating the transitive closure using Lemma 1 has a lower complexity of O(n³ log₂ n), because ⌈log₂(n − 1)⌉ max–min compositions are needed. Step4 of the FJP algorithm has O(n) time complexity. The FJP algorithm iterates over Step5–Step7 at most n(n − 1)/2 times, since there can be n(n − 1)/2 unique elements in the T̂ matrix. At the fifth step, the pairwise distances between n elements are calculated, so its complexity is O(n²). Step6 calls the procedure Clusters, whose complexity is O(n), and Step7 has O(n) complexity as well, due to the assignment operations. Consequently, the loop over Step5–Step7 yields O((n(n − 1)/2) · (n² + n + n)) = O(n⁴) complexity.

Step4 of the MFJP algorithm has a complexity of O(n² log₂ n), which is the complexity of the required sorting process. The last two steps of both algorithms are the same. At the next-to-last step, at most (n(n − 1)/2) − 1 subtractions and comparisons are done to obtain α_z, and the last step calls the procedure Clusters again. In conclusion, the FJP algorithm has O(n⁴) worst-case time complexity, whereas the MFJP algorithm has a better complexity of O(n³ log₂ n). Despite the improvements in the MFJP method, O(n³ log₂ n) is a high complexity for a clustering task, considering the huge data sets present in almost any related application. It is also much higher than the complexity of its non-fuzzy counterpart, the DBSCAN algorithm, whose time complexity is O(n²) and can be decreased to O(n log₂ n) using an R*-tree structure [8]. While an adequate implementation of the MFJP algorithm can do a fair job, a further improvement in time complexity can make the method applicable to a wider range of applications. The primary bottleneck of the methods is the calculation of the max–min transitive closure matrix at Step3, which dominates the rest of the computations. Determining the critical alpha value is a secondary bottleneck, which is addressed by the MFJP method as well.

3.1. Calculating the max–min transitive closure

In [21], an optimal algorithm to calculate the max–min transitive closure of a fuzzy similarity matrix is described. This algorithm and its integration into the MFJP method are explained below.

Fuzzy similarity matrix: A fuzzy matrix that is reflexive and symmetric is a fuzzy similarity matrix. It can be shown that the matrix of a fuzzy neighborhood relation T : X × X → [0, 1] is a fuzzy similarity matrix. For each X1 ∈ X,

T(X1, X1) = 1 − d(x1, x1)/(2R) = 1 − 0/(2R) = 1,

therefore the matrix of T is reflexive, and for each X1, X2 ∈ X,

T(X1, X2) = 1 − d(x1, x2)/(2R) = 1 − d(x2, x1)/(2R) = T(X2, X1),

therefore the matrix of T is symmetric. So, the matrix of a fuzzy neighborhood relation T is a fuzzy similarity matrix.

Pseudo-code of the procedure TC is as follows:

Procedure TC(T):
Input: Fuzzy neighborhood relation T.
Output: Transitive closure T̂ of the relation T.
Step1. T̂ = (t̂_ij)_{n×n} = ∅; U(T) = list of the elements of the upper triangle of T;
Step2. Sort U(T) in descending order;
Step3. t̂_ii = 1, i = 1, ..., n; k = 1;
Step4. While there is a blank cell in T̂, do
  t_{i0 j0} = kth element of U(T);
  If t̂_{i0 j0} is blank, then
    I = {i | t̂_{i j0} is not blank}, J = {j | t̂_{i0 j} is not blank};
    t̂_ij = t̂_ji = t_{i0 j0}, i ∈ I, j ∈ J;
  End if;
  k = k + 1;
End while;
Step5. Return T̂;
End.

The algorithm can be implemented in two different ways, i.e. using a matrix and an array in a straightforward fashion or using a max-heap data structure. For clarity, the optimal algorithm that calculates max–min transitive closure is called procedure TC and the one that uses max-heap is called procedure TC-heap henceforth. The time complexity of the procedure TC is O(n2 log2 n), due to the sorting process at Step2. The assignments at Step3 yield O(n) complexity and the loop at Step4 has a complexity of O(n2 ), since there are n2 cells in the Tˆ matrix to be filled. Pseudo-code of the procedure TC-heap is as follows:

Procedure TC-heap(T):
Input: Fuzzy neighborhood relation T.
Output: Transitive closure T̂ of the relation T.
Step1. T̂ = (t̂_ij)_{n×n} = ∅; U(T) = the elements of the upper triangle of T;
Step2. Build a max-heap H for U(T);
Step3. t̂_ii = 1, i = 1, ..., n;
Step4. While there is a blank cell in T̂, do
  t_{i0 j0} = the root of H;
  If t̂_{i0 j0} is blank, then
    I = {i | t̂_{i j0} is not blank}, J = {j | t̂_{i0 j} is not blank};
    t̂_ij = t̂_ji = t_{i0 j0}, i ∈ I, j ∈ J;
  End if;
  Delete the root of H and heapify;
End while;
Step5. Return T̂;
End.
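The core of both TC variants is that the closure value of a pair is the largest relation value at which the two points fall into the same connected component. The sketch below is our Python reading of that idea, using heapq for the max-heap and a union-find structure (our own bookkeeping choice, not in the paper) to track the components whose cells are already filled:

```python
import heapq

def transitive_closure_heap(T):
    """Fill t_hat by scanning pairs in decreasing relation value: the first
    value that joins the components of i and j is their closure value."""
    n = len(T)
    t_hat = [[None] * n for _ in range(n)]
    for i in range(n):
        t_hat[i][i] = 1.0
    heap = [(-T[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    heapq.heapify(heap)  # min-heap on negated values acts as a max-heap

    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    members = {i: [i] for i in range(n)}
    while heap:
        neg_t, i, j = heapq.heappop(heap)
        ri, rj = find(i), find(j)
        if ri != rj:  # all pairs across the two components get this value
            for a in members[ri]:
                for b in members[rj]:
                    t_hat[a][b] = t_hat[b][a] = -neg_t
            parent[rj] = ri
            members[ri].extend(members.pop(rj))
            if len(members[ri]) == n:
                break  # matrix fully filled
    return t_hat
```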

The time complexity of the procedure TC-heap is O(n²), which is lower than that of the straightforward procedure. The difference is the way of getting the next largest element of U(T). A max-heap structure can be built in linear time [22]; since there are n(n − 1)/2 elements in the upper triangle of T, this takes O(n²) time. Heapification has O(log₂ n) complexity and the loop iterates O(n) times at most, so it does not change the overall complexity of the algorithm. Using the procedure TC-heap to compute the max–min transitive closure drops the complexity of the MFJP algorithm to O(n² log₂ n). In this case, the overall time complexity is dominated by the sorting process at Step4.

3.2. Determining the critical alpha value

It is stated in Lemma 2 that the clustering partitions are different for different α values, where the α values correspond to the unique values in the transitive closure matrix. In order to determine the critical alpha value α_z, the maximum gap, i.e. the maximum difference between two consecutive elements of the sorted array V = (a_i)_{i=1}^{h} consisting of the unique values in the transitive closure matrix, is found. However, sorting the array is not obligatory for finding the maximum gap.

Proposition 2. Let x be the number of discrete elements and y be the number of containers that hold these elements. There is at least one container that holds no more than ⌊x/y⌋ elements.

This proposition is a generalization of the pigeonhole principle, also known as the Dirichlet box principle [23].

Lemma 3. Given an unsorted array V = (a_i)_{i=1}^{h}, suppose that the interval [α_min, α_max] is equally divided into h − 1 subintervals. Then the maximum gap of the array V is between the maximum and the minimum elements of two consecutive nonempty subintervals, where the subintervals are considered as containers that hold the elements of V.

Proof. Let the global interval [α_min, α_max] of the unsorted array V = (a_i)_{i=1}^{h} be divided into subintervals s_k of equal length γ, where α_min ∈ s_1 and α_max ∈ s_{h−1}, and S = (s_k)_{k=1}^{h−1}. Since the containers of α_min and α_max are obvious, there are h − 2 elements left and h − 1 containers to hold them. As stated in Proposition 2, one of the containers must hold ⌊(h − 2)/(h − 1)⌋ = 0 elements at most; therefore the maximum gap must be at least γ, which is the length of a subinterval. Accordingly, the maximum gap is not within a subinterval but between the maximum and the minimum elements of two consecutive nonempty subintervals. □

The procedure MaxGap, whose pseudo-code is given below, finds the maximum gap in an unsorted array V and returns the critical alpha value α_z.

Pseudo-code of the procedure MaxGap is as follows:

Procedure MaxGap(V):
Input: Array V.
Output: Critical alpha value α_z.
Step1. h = |V|; α_max = max α_i, α_min = min α_i, i = 1, ..., h;
Step2. Divide the interval [α_min, α_max] into h − 1 subintervals s_k of equal length γ = (α_max − α_min)/(h − 1), where S = (s_k)_{k=1}^{h−1}; s_k = ∅, k = 1, ..., h − 1;
Step3. Calculate the maximum and the minimum values within the subintervals:
For i = 1, ..., h, do
  k = ⌊(α_i − α_min)/γ⌋ + 1;
  If s_k = ∅, then s_k,max = s_k,min = α_i;
  Else if s_k,max < α_i, then s_k,max = α_i;
  Else if s_k,min > α_i, then s_k,min = α_i;
  End if;
End for;
Step4. Remove the blank subintervals from the set S; r = |S|;
Step5. Find the maximum gap and the corresponding α_z value:
Δ_max = 0;
For k = 1, ..., r − 1, do
  If Δ_max < s_{k+1},min − s_k,max, then
    Δ_max = s_{k+1},min − s_k,max; α_z = s_k,max;
  End if;
End for;
Step6. Return α_z;
End.

The procedure MaxGap runs in linear time, since each step performs at most O(|V|) operations. V has at most n(n − 1)/2 elements, so the complexity of MaxGap is O(n²). Thus, using MaxGap to find α_z eliminates the O(n² log₂ n) complexity of the sorting process.
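The bucketing argument translates almost directly into code. The Python sketch below is our illustration of procedure MaxGap; following the pseudo-code, it returns the element at the lower end of the maximum gap:

```python
def max_gap_alpha(values):
    """Linear-time critical alpha: bucket the h values into h - 1 equal
    subintervals; by the pigeonhole argument of Lemma 3, the maximum gap
    straddles bucket boundaries, so only per-bucket minima and maxima
    need to be compared."""
    h = len(values)
    lo, hi = min(values), max(values)
    if h < 3 or lo == hi:
        return lo
    gamma = (hi - lo) / (h - 1)
    b_min = [None] * (h - 1)
    b_max = [None] * (h - 1)
    for v in values:
        k = min(int((v - lo) / gamma), h - 2)  # clamp hi into the last bucket
        if b_min[k] is None or v < b_min[k]:
            b_min[k] = v
        if b_max[k] is None or v > b_max[k]:
            b_max[k] = v
    best_gap, alpha_z, prev_max = -1.0, lo, None
    for k in range(h - 1):
        if b_min[k] is None:
            continue  # empty bucket: the gap spans it
        if prev_max is not None and b_min[k] - prev_max > best_gap:
            best_gap = b_min[k] - prev_max
            alpha_z = prev_max
        prev_max = b_max[k]
    return alpha_z
```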

3.3. The optimal-time FJP algorithm

Replacing the slow steps of the MFJP algorithm with the aforementioned procedures leads to an optimal-time algorithm that runs in O(n²) time. To avoid any confusion, we name the optimal-time algorithm OFJP.

Pseudo-code of the OFJP algorithm is as follows:

OFJP(X):
Input: Data set X = {x1, x2, ..., xn}.
Output: Clustering partition X¹, X², ..., X^k.
Step1. Calculate d_ij = d(x_i, x_j), d_max = max d_ij, i, j = 1, ..., n;
Step2. Calculate the fuzzy neighborhood relation T: t_ij = 1 − d_ij/d_max, i, j = 1, ..., n;
Step3. Call the procedure TC-heap(T) to obtain the transitive closure T̂;
Step4. Call the procedure MaxGap(V) to obtain the critical alpha value α_z;
Step5. Call the procedure Clusters(X, α_z) to obtain the resulting clustering partition X¹, X², ..., X^k;
End.

Optimality of the OFJP algorithm is obvious: since the entire FJP method is based on the fuzzy neighborhood relation, calculating the distances between all pairs is inevitable. For n data points, there are n(n − 1)/2 distances to be calculated, so the complexity of an FJP-based method cannot be lower than O(n²).

4. Tradeoff between autonomy and speed

The FJP, MFJP and OFJP algorithms are autonomous, and they produce exactly the same results. We say they are autonomous with regard to their ability to construct the resulting clustering partition without the need for an input parameter chosen by the user. The self-operated nature of these algorithms eliminates a major disadvantage of the DBSCAN algorithm: DBSCAN is very sensitive to input parameters, i.e. desired results can only be achieved for some small interval of MinPts and ε values [18], whereas the autonomous FJP-based methods determine their parameter values in an automatic fashion. On the other hand, faster algorithms that share the same fundamentals with the FJP-based methods can be devised by sacrificing the autonomy and, sometimes, the clustering efficiency. A useful insight is that the larger the interval [α_{i+1}, α_i) is, the farther the clusters are from each other. An algorithm that scans the interval of possible α values can avoid the computational intensity of the autonomous FJP methods. Let us present such an algorithm. Instead of calculating the transitive closure matrix and extracting the α_z value from it, the fuzzy neighborhood relation can be scanned with different α_l values to find an appropriate α_z-neighborhood. For each different α_l-neighborhood, the connected components of the fuzzy neighborhood relation matrix must be discovered; each component will correspond to a cluster. The clustering partitions will be the same for some different α_l values, and a maximum inter-cluster distance is obtained where a clustering partition is repeated the most. We name this algorithm αScan (alpha-scan).
Unlike the autonomous FJP methods, it has an input parameter Δα, which is the unit of scanning.

Pseudo-code of the αScan algorithm is as follows:

αScan(X, Δα):
Input: Data set X = {x1, x2, ..., xn} and parameter Δα.
Output: Clustering partition X¹, X², ..., X^c.
Step1. Calculate d_ij = d(x_i, x_j), d_max = max d_ij, i, j = 1, ..., n; R = ∅; c = r = s = 0; α_l = 1 − Δα;
/* α_l: α value of the current partition
   c: number of clusters of the last different partition
   r: number of repetitions of the current partition
   s: number of repetitions of the most repeated partition */
Step2. Calculate the fuzzy neighborhood relation T: t_ij = 1 − d_ij/d_max, i, j = 1, ..., n;
Step3. Call the procedure ConnectedComponents(X, T, α_l) to obtain a clustering partition X¹, X², ..., X^k;
Step4. If k = c, then r = r + 1; /* note that k is the number of clusters obtained at Step3 */
Else,
  If s < r and k ≠ 1, then
    R = {X¹, X², ..., X^k}; s = r; c = k; α_z = α_l + Δα;
  End if;
  r = 1;
End if;
Step5. α_l = α_l − Δα;
If k > 1 and α_l > 0, then go to Step3;
Else, the resulting clustering partition is R, the number of clusters is c and the critical alpha value is α_z;
End if;
End.

Pseudo-code of the ConnectedComponents is as follows:

Procedure ConnectedComponents(X, T, α_l):
Input: Data set X, fuzzy neighborhood relation T and parameter α_l.
Output: Clustering partition X¹, X², ..., X^k.
Step1. S = F(X), where X is the global data set; k = 1;
Step2. Pick an element C ∈ S; N = {C}; X^k = ∅;
Step3. While N ≠ ∅, do
  pick an element A ∈ N to form X^k = X^k ∪ {B ∈ S | T(A, B) ≥ α_l};
  N = X^k \ {A}; S = S \ {A};
End while;
Step4. If S ≠ ∅, then k = k + 1; go to Step2;
Else, return the partition X¹, X², ..., X^k and k;
End if;
End.
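A direct way to realize this procedure is a breadth-first search over the graph whose edges are the point pairs with relation value at least α_l. This Python sketch is our own, index-based illustration; it returns one label per point and the number of components:

```python
from collections import deque

def connected_components(T, alpha):
    """One αScan step: BFS over the thresholded fuzzy neighborhood relation.
    Points i and j are adjacent iff T[i][j] >= alpha."""
    n = len(T)
    labels = [None] * n
    k = 0
    for start in range(n):
        if labels[start] is not None:
            continue
        labels[start] = k
        queue = deque([start])
        while queue:
            a = queue.popleft()
            for b in range(n):
                if labels[b] is None and T[a][b] >= alpha:
                    labels[b] = k
                    queue.append(b)
        k += 1
    return labels, k
```

Each call costs O(n²), since a neighborhood query is answered for every point, which matches the per-iteration cost analysis below.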

Table 1
Comparison of the time complexities of the algorithms by each main process.

Algorithm | Calculating the fuzzy neighborhood relation | Calculating the transitive closure | Determining the critical alpha value | Forming the clustering partitions
FJP   | O(n²) | O(n⁴)        | O(n⁴)        | O(n)
MFJP  | O(n²) | O(n³ log₂ n) | O(n² log₂ n) | O(n)
OFJP  | O(n²) | O(n²)        | O(n²)        | O(n)
αScan | O(n²) | –            | –            | O(rn²)

Table 2
NoiseFilter parameters chosen in the experiments.

Instance | n      | ε1   | ε2  | Number of noise points
1        | 300    | 0.9  | 0.2 | 0
2        | 500    | 0.9  | 0.3 | 37
3        | 1000   | 0.95 | 0.2 | 129
4        | 2000   | 0.95 | 0.2 | 221
5        | 3000   | 0.9  | 0.2 | 78
6        | 4000   | 0.95 | 0.2 | 69
7        | 5000   | 0.9  | 0.2 | 365
8        | 7500   | 0.9  | 0.2 | 708
9        | 10 000 | 0.9  | 0.2 | 339

The αScan algorithm has O(rn²) time complexity. Again, Step1 and Step2 have O(n²) complexity. The loop between Step3 and Step5 is repeated at most r = (1 − Δα)/Δα times, and the procedure ConnectedComponents has a time complexity of O(n²), since it requires a neighborhood query for each data point. The number of repetitions r will be large for small values of Δα, and vice versa. So, the parameter Δα can be chosen relatively large in order to decrease the running time. In this regard, if r ≤ n, then the time complexity equals O(n²). In addition, a single cluster is likely to be obtained many iterations before α_l reaches a non-positive value. If the clusters are distinct enough, then the maximum inter-cluster distance can be achieved for very large Δα values (e.g. Δα > 0.01, which implies fewer than 100 iterations). We can intuitively say that αScan is not very sensitive to the input parameter and that it is easy to choose an appropriate Δα value. However, this is still an open discussion and a subject of study, along with observation and optimization of the loop between Step3 and Step5. Even though the time complexity of αScan is larger than that of OFJP, the value of r can be negligible in practice. Depending on the Δα parameter, the αScan algorithm can run in a much shorter time than OFJP. A comparative study of the time complexities and computational performances is given in the next section.

5. Theoretic and experimental comparisons

A detailed cluster validity analysis was done, and the clustering efficiency of the FJP algorithm was tested and compared with the FCM algorithm in [15] on some famous data sets, such as Anderson's [24] and Bensaid et al.'s [25], as well as some synthetic data sets. The autonomous FJP, MFJP and OFJP methods are identical in terms of clustering performance; hence we do not attend to it in this work, but rather evaluate the algorithms in terms of running time efficiency. To summarize the theoretical outcomes of the study, the time complexities of the main processes of the discussed algorithms are presented in Table 1.

Fig. 1. Data set instances.

For the computational experiments, we generated 2-dimensional (for visualization purposes), dense data set instances with distributions similar to the data sets used in previous works [15,18]. These instances are depicted in Fig. 1. They are noisy, like the majority of real-world data. The FJP-like methods are tailored for noise-free data and typically run after a preprocessing procedure called NoiseFilter, which simply filters out the noise. The procedure is explained in detail in [15]. It detects the noise points according to two parameters in the interval (0, 1). The first parameter ε1 specifies the ε1-neighborhood relation

N(X1, ε1) = {Y ∈ X | T(X1, Y) ≥ ε1},

and the second parameter ε2 determines the noise threshold ε2 · (max_{i=1,...,n} card N(X_i, ε1)), which is compared with the card N(X1, ε1) value of each fuzzy point X1 ∈ X to decide whether it is noise, where

card N(X1, ε1) = Σ_{Y ∈ N(X1, ε1)} T(X1, Y).

Pseudo-code of the procedure NoiseFilter is as follows:

Procedure NoiseFilter(X, ε1, ε2):
Input: Data set X, filtering parameters ε1 and ε2.
Output: Filtered data set X and the set of noise points X_noise.
Step1. X = {X1, X2, ..., Xn} is the global data set; X_noise = ∅;
Step2. Calculate card_i = Σ_{Y ∈ N(X_i, ε1)} T(X_i, Y); card_max = max card_i, i = 1, ..., n;
Step3. For i = 1, ..., n, do
  If card_i < ε2 · card_max, then X_noise = X_noise ∪ {X_i};
  End if;
End for;
Step4. X = X \ X_noise;
Return X and X_noise.
End.
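Given the relation matrix, the filter can be expressed compactly. The Python sketch below is our index-based illustration, not the paper's code:

```python
def noise_filter(T, eps1, eps2):
    """NoiseFilter sketch: card_i sums the relation values over the
    eps1-neighbors of point i (including i itself, since T[i][i] = 1);
    points whose card falls below eps2 * max card are declared noise.
    Returns (kept_indices, noise_indices)."""
    n = len(T)
    card = [sum(T[i][j] for j in range(n) if T[i][j] >= eps1) for i in range(n)]
    card_max = max(card)
    noise = [i for i in range(n) if card[i] < eps2 * card_max]
    kept = [i for i in range(n) if card[i] >= eps2 * card_max]
    return kept, noise
```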

Table 3
Comparison of running time performances of the algorithms.

Instance    n         CPU time in seconds
                      MFJP        OFJP     αScan (Δα = 0.05)
1           300       0.15        0.00     0.00
2           500       0.74        0.01     0.00
3           1000      36.63       0.08     0.01
4           2000      381.98      0.37     0.06
5           3000      > 10 min    0.92     0.14
6           4000      > 10 min    1.68     0.23
7           5000      > 10 min    2.65     0.46
8           7500      > 10 min    6.15     0.92
9           10 000    > 10 min    11.70    1.72

The NoiseFilter procedure takes O(n²) time. Table 2 shows the input parameters chosen for each data set instance and the number of corresponding noise points. The instances after omitting the noise points are illustrated in Fig. 2. The final results of the computational experiments are given in Table 3. The filtering processes are included in the running time results. The filtering parameters were chosen by trial and error. Even though the similarity of the proper parameters (see Table 2) provides some insight into the matter, further research must be done to establish a solid method for setting the noise filtering parameters. The algorithms were coded in the C language and compiled with GCC [26] using the -O3 optimization flag. All the running time tests were conducted on a PC with an Intel i5 3.10 GHz CPU and 4 GB RAM, running a 64-bit Windows operating system. All the algorithms achieved successful results, as seen in Fig. 2. The Δα parameter of αScan was chosen to be 0.05, since it was small enough to produce the same results as the autonomous methods and large enough to perform notably faster. The experimental results show that the Optimal-time FJP algorithm is dramatically faster than the Modified FJP algorithm. This speedup makes the FJP approach appropriate for large data sets. A further improvement in speed is also achieved by αScan. Fig. 3 reflects the results graphically.


Fig. 2. Clustering results.

6. Conclusion

Despite their significant advantages (robustness, autonomy, etc.) over similar classical methods such as DBSCAN, the prior FJP methods are too slow to use in practice on large data sets. In order to resolve this issue, we identified the bottlenecks of the prior methods, integrated some valuable techniques with them, and established an optimal-time algorithm. Moreover, we proposed a parameter-dependent but considerably faster algorithm that is also based on the fuzzy neighborhood relation. Computational experiments support the theoretical outcome and show that a great speedup can be achieved using the new algorithms. Hence, the FJP approach can be used in applications that work on large data sets. When speed is of primary concern, the αScan algorithm might be preferred, since it appears to be even faster than the OFJP algorithm. On the other hand, the validity of the clustering efficiency of the αScan algorithm and the choice of a proper scanning unit were not addressed in this work; hence they could be the subjects of future work.


Fig. 3. Running times of the algorithms vs. data set cardinality.

Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable comments and corrections.

References

[1] D.T. Larose, Discovering Knowledge in Data, Wiley Press, 2005.
[2] J. Han, M. Kamber, Data Mining Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
[3] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function, Plenum Press, 1981.
[4] N. Zahid, O. Abouelala, M. Limouri, A. Essaid, Fuzzy clustering based on K-nearest-neighbours rule, Fuzzy Sets Syst. 120 (2001) 239–247.
[5] E. Mehdizadeh, S. Sadi-Nezhad, R. Tavakkoli-Moghaddam, Optimization of fuzzy clustering criteria by a hybrid PSO and fuzzy C-means clustering algorithm, Iran. J. Fuzzy Syst. 5 (3) (2008) 1–14.
[6] J.C. Dunn, A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters, J. Cybern. 3 (3) (1973) 32–57.
[7] L. Mahnhoon, W. Pedrycz, The fuzzy C-means algorithm with fuzzy P-mode prototypes for clustering objects having mixed features, Fuzzy Sets Syst. 160 (2009) 3590–3600.
[8] M. Ester, H.P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining, 1996, pp. 226–231.
[9] M. Ankerst, M.M. Breunig, H.P. Kriegel, J. Sander, OPTICS: Ordering Points To Identify the Clustering Structure, in: ACM SIGMOD International Conference on Management of Data, 1999, pp. 49–60.
[10] X. Xiaowei, E. Martin, H.P. Kriegel, J. Sander, A distribution-based clustering algorithm for mining in large spatial databases, in: Proceedings of the 14th Int. Conference on Data Engineering, 1998, pp. 324–331.
[11] A. Hinneburg, A.K. Daniel, An efficient approach to clustering in large multimedia databases with noise, in: Proceedings of the 4th Int. Conference on Knowledge Discovery and Data Mining, 1998, pp. 58–65.
[12] J. Sander, M. Ester, H.P. Kriegel, X. Xu, Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications, Data Min. Knowl. Discov. 2 (1998) 169–194.
[13] M.H. Chehreghani, H. Abolhassani, M.H. Chehreghani, Improving density-based methods for hierarchical clustering of web pages, Data Knowl. Eng. 67 (2008) 30–50.
[14] G. Gupta, A. Liu, J. Ghosh, Automated hierarchical density shaving: a robust automated clustering and visualization framework for large biological data sets, IEEE/ACM Trans. Comput. Biol. Bioinform. 7 (2) (2010) 223–237.
[15] E.N. Nasibov, G. Ulutagay, A new unsupervised approach for fuzzy clustering, Fuzzy Sets Syst. 158 (19) (2007) 2118–2133.
[16] E.N. Nasibov, G. Ulutagay, A new approach to clustering problem using the fuzzy joint points method, Autom. Control Comput. Sci. 39 (6) (2005) 8–17.
[17] E.N. Nasibov, G. Ulutagay, On the fuzzy joint points method for fuzzy clustering problem, Autom. Control Comput. Sci. 40 (5) (2006) 33–44.
[18] E. Nasibov, G. Ulutagay, Robustness of density-based clustering methods with various neighborhood relations, Fuzzy Sets Syst. 160 (24) (2009) 3601–3615.
[19] J.K. Parker, L.O. Hall, A. Kandel, Scalable fuzzy neighborhood DBSCAN, in: IEEE International Conference on Fuzzy Systems, 2010, pp. 1–8.
[20] G. Ulutagay, E. Nasibov, On fuzzy neighborhood based clustering algorithm with low complexity, Iran. J. Fuzzy Syst. 10 (3) (2013) 1–20.
[21] H.S. Lee, An optimal algorithm for computing the max-min transitive closure of a fuzzy similarity matrix, Fuzzy Sets Syst. 123 (2001) 129–136.
[22] T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms, 3rd ed., The MIT Press, 2009.


[23] V.G. Sprindzhuk (Originator), Dirichlet box principle, in: Encyclopedia of Mathematics, http://www.encyclopediaofmath.org/index.php?title=Dirichlet_box_principle&oldid=16845 (last accessed 01.09.2014).
[24] E. Anderson, The IRISes of the Gaspe peninsula, Bull. Am. Iris Soc. 59 (1935) 2–5.
[25] A.M. Bensaid, L.O. Hall, J.C. Bezdek, L.P. Clarke, M.L. Silbiger, J.A. Arrington, R.F. Murtagh, Validity-guided (re)clustering with applications to image segmentation, IEEE Trans. Fuzzy Syst. 4 (2) (1996) 112–123.
[26] gcc.gnu.org (last accessed 21.08.2013).