Artif Intell Rev (2009) 32:59–75 DOI 10.1007/s10462-009-9133-6

Reusable components for partitioning clustering algorithms

Boris Delibašić · Kathrin Kirchner · Johannes Ruhland · Miloš Jovanović · Milan Vukićević

Published online: 20 October 2009 © Springer Science+Business Media B.V. 2009

Abstract  Clustering algorithms are well established and widely used for solving data-mining tasks. Every clustering algorithm is composed of several solutions to specific sub-problems in the clustering process. These solutions are linked together in a clustering algorithm, and they define its process and structure. Many of these solutions occur in more than one clustering algorithm. New clustering algorithms typically include frequently occurring solutions to typical sub-problems from clustering, as well as from other machine-learning algorithms. The problem is that these solutions are usually integrated into their algorithms, and the original algorithms are not designed to share solutions to sub-problems easily outside the original algorithm. We propose a way of designing cluster algorithms and of improving existing ones, based on reusable components. Reusable components are well-documented, frequently occurring solutions to specific sub-problems in a specific area. We therefore identify reusable components, first, as solutions to characteristic sub-problems in partitioning cluster algorithms and, further, identify a generic structure for the design of partitioning cluster algorithms. We analyze some partitioning algorithms (K-means, X-means, MPCK-means, and Kohonen SOM) and identify reusable components in them. We give examples of how new cluster algorithms can be designed based on these components.

Keywords  Cluster algorithm · Partitioning clustering · Reusable component · Generic · Kohonen SOM · K-means · X-means · MPCK-means

B. Delibašić (B) · M. Jovanović · M. Vukićević
Faculty of Organizational Sciences, University of Belgrade, Jove Ilića 154, Belgrade, Serbia
e-mail: [email protected]; [email protected]

K. Kirchner · J. Ruhland
Faculty of Economics and Business Administration, Friedrich Schiller University of Jena, Carl-Zeiß-Straße 3, Jena, Germany


1 Introduction

Clustering in data mining is the process of grouping a set of objects into classes of similar objects. Many clustering algorithms are described in the literature. Among these, the most important are partitioning and hierarchical algorithms (Berkhin 2006). In this paper, we consider partitioning cluster algorithms that are widely discussed in the literature as well as established in data-mining software. In partitioning clustering, given a dataset and the number of clusters K, an algorithm organizes the objects into K partitions, where each partition represents one cluster. The algorithm tries to discover these partitions by iteratively relocating points between subsets.

During a partitioning cluster process, several decisions have to be made. Choosing the number of clusters K is often a decision based on prior knowledge, assumptions, and practical experience. This task is more difficult when the data has many dimensions, even when clusters are well separated. Another common task, the choice of good starting points for clusters, influences the number of iterations needed. Badly chosen starting points degrade the quality of solutions and can prevent an algorithm from finding the optimal solution, forcing it into local optima. The most well-known and commonly used partitioning methods are K-means, K-medoids, and their variations (Han and Kamber 2006).

Every partitioning clustering algorithm uses a different strategy for finding clusters. Most of the algorithms were designed to solve a specific problem in partitioning clustering. The problem in real-life applications is which algorithms to choose and how to use them when a new problem arises. To handle a new problem, existing algorithms were usually modified, or new algorithms were developed. Although good solutions for specific partitioning sub-problems have been reused in newly developed or modified algorithms, most new algorithms were not designed to allow the solutions to specific sub-problems to be reused easily outside the original algorithms. Most algorithms were implemented as black boxes.

In software engineering, on the other hand, good solutions to specific sub-problems are formalized as reusable components. Software-engineering design is often based on reusable components that have been validated through usage in other software. The idea of reusable components is not common in data mining today. Instead of implementing the same things over and over again, reusable components could be defined, and the design of algorithms could be based on these components. Therefore, our research idea is to extract good solutions for typical partitioning clustering sub-problems from the original algorithms and to formalize them in a way that allows them to be applied as reusable components in the design of partitioning clustering algorithms. In this way we can exploit reusable-components design in solving new, or existing, partitioning clustering tasks.

We are aware that the combination of reusable components is interesting but cannot guarantee more efficient algorithms. Better performance achieved by combining reusable components can only be demonstrated with statistical tests that provide evidence that a new generic algorithm performs better than another algorithm. Furthermore, reusable-components design offers additional benefits. For every characteristic sub-problem in a domain, several reusable components can be identified. Experiments can be carried out on the applicability and performance of these components. This can help to find discriminative properties for the usage of reusable components for solving a certain sub-problem. Those properties can be used further for documenting and applying the components better.


Every reusable component is documented in a way that reveals when and how the component can be used and explains what the component is good for. This documentation helps to understand whether a component is suitable for reuse in a specific situation. Therefore, the objective of this paper is:

1. to identify reusable components that can be applied to solve important sub-problems for partitioning clustering, and
2. to propose a generic algorithm that allows the composition of reusable components and is able to generate the original algorithms, modify existing algorithms, and design new ones.

The outline of this paper is as follows: in the next section we summarize the findings of our literature review on typical sub-problems in partitioning clustering and on design with reusable components. The third section describes the cluster algorithms we analyzed. Section four presents the identified components. Examples for building new cluster algorithms based on our reusable components are given in section five. Finally, section six concludes with future research directions.

2 Literature review

Our research on reusable components in partitioning cluster algorithms has its roots, on the one hand, in the identification of typical problems in the application of partitioning clustering algorithms and, on the other hand, in design with reusable components.

2.1 Sub-problems in partitioning clustering

In partitioning clustering, K-means (Hartigan and Wong 1979) is the most common clustering algorithm. According to Berkhin's (2006) analysis of clustering algorithms, "K-means is by far the most popular clustering algorithm used in scientific and industrial applications". Steinley (2006) gives an exhaustive overview of K-means modifications and improvements. K-means is an interesting algorithm in partitioning clustering because it opened a series of investigations into modifications and improvements of the original algorithm. These are done to remedy K-means' weaknesses, but also to allow K-means-like algorithms to solve additional important sub-problems in partitioning clustering. We therefore scanned the literature and analyzed a number of K-means-related papers. In our literature review, we found six typical problems in partitioning clustering, which are solved in different ways in different algorithms. Table 1 summarizes our findings. Based on this literature survey we conclude that:

1. there are many K-means improvements for solving typical clustering sub-problems, and
2. ideas from statistics and machine learning can be combined in various algorithms to make them more efficient.

2.2 Designing with reusable components

In every engineering discipline, systems can be designed by arranging reusable components that have been used in other systems. In software design, there are many types of reusable components.


Table 1 Some characteristic problems and their solutions in partitioning clustering

| # | Typical problem in partitioning clustering | Solution | Article |
|---|---|---|---|
| 1 | Number of clusters definition | User defines K | Hartigan and Wong (1979) |
|   |   | User defines the shape (map, cube, etc.) | Kohonen (2001) |
|   |   | Uses sub-sampling and a heuristic to determine the right number of clusters | Bradley and Fayyad (1998) |
| 2 | Centroids initialization | Random | Hartigan and Wong (1979), Kohonen (2001) |
|   |   | Perform (K−1)·N runs of K-means to find the optimal centroids | Likas et al. (2002) |
|   |   | Choose the first centroid randomly, and the remaining K−1 with a simple probabilistic measure | Arthur and Vassilvitskii (2007) |
|   |   | Initialize centroids in a 2D map | Su et al. (2002) |
|   |   | Uses sub-sampling and a heuristic to determine the right centroids | Bradley and Fayyad (1998) |
| 3 | Measure distance | Euclidean distance | Hartigan and Wong (1979), Kohonen (2001) |
|   |   | Learn a weighted Euclidean distance for each cluster | Xing et al. (2003) |
| 4 | Calculate representatives | Arithmetic mean | Hartigan and Wong (1979) |
|   |   | Change centroid each time an object is assigned to a cluster, or a neighbouring cluster | Kohonen (2001), Cheung (2003) |
| 5 | Stop criterion | Stability of solution | Hartigan and Wong (1979) |
|   |   | No. of runs | Kohonen (2001) |
| 6 | Split clusters | Minimal number of points in cluster | Bennett et al. (2000) |
|   |   | Use principal components analysis to test the quality of binary-split clusters | Ding and He (2004) |
|   |   | Use the Anderson-Darling statistic to test the quality of binary-split clusters | Hammerly and Elkan (2003) |
|   |   | Use the Bayesian information criterion to test the quality of binary-split clusters | Pelleg and Moore (2000) |

Many authors discuss reusable-components design. According to Freeman (1983), reusable components are usually source-code fragments, but they can also include design structures, module-level implementation structures, specifications, documentation, or transformations. Sommerville (2004) identifies several widely applied design-reuse approaches, e.g. design patterns, application frameworks, or program libraries. Moreover, reuse is possible at a range of levels, from simple functions to complete systems and from abstract descriptions to concrete applications.


Another popular theory that deals with reusable-components design is pattern theory. It was originally introduced by the architect Alexander (1979, 2005a, b, c, d). Since then, it has been successfully applied to software engineering (Coplien and Schmidt 1995; Gamma et al. 1995), organizational design (Coplien and Harrison 2005), telecommunication design (Adams et al. 1998), and avionics control system design (Lea 1994). The pattern approach has earned its popularity over the years and is being accepted in many areas of human creative activity. There are also first papers discussing patterns in data mining (e.g. Delibasic et al. 2008).

Pattern theory provides some guidelines for identifying good reusable components, i.e. patterns. Winn and Calder (2002) proposed an open list of pattern features that can be used for pattern research:

1. A pattern implies an artefact: a pattern should have a visual representation; the solution it gives has to be an artefact. This gives the pattern the ability to be independent and reusable.
2. A pattern bridges many levels of abstraction: a pattern provides an abstract interpretation of the solution, but also specific implementation details.
3. A pattern is both functional and non-functional: a pattern provides a concrete solution, but it also includes information on the feasibility of the pattern for solving a problem in a context.
4. A pattern is manifest in a solution: a pattern is recognizable in a solution. It is "explained, understood, and demonstrated" in an artefact.
5. A pattern captures system hot spots: a pattern is about making crucial decisions in solution design. The patterns used in a solution are capable of defining the whole solution.
6. A pattern is part of a language: a pattern solves important sub-problems of a problem (e.g. clustering). To solve a problem it is most often necessary to couple several patterns together. The possible couplings of patterns for solving a problem define a pattern language.
7. A pattern is validated by use: only experience-validated components can be defined as patterns.
8. A pattern is grounded in a domain: a pattern is always related to a domain.
9. A pattern captures a big idea: a pattern solves important sub-problems of a problem.

Even with the help of this list, it is difficult to decide whether a component is a good design component or not, because the process of identifying reusable components is not formalized. For the description of components, Tracz's definition (Tracz 1990) is helpful. He defines reusable components as triplets consisting of concept, content, and context. The concept is the description of what a component does; it describes the interface and the semantics, represented by pre- and post-conditions. The content describes how the component is realized, which is hidden from the casual user. The context explains the application domain of the component, which helps to find the right component for a specific problem. Every reusable component is well documented and describes where the component can be applied, but it also includes a solution for the problem it solves. A reusable component is abstract in the sense that it can have many areas of usage, but it is also specific because it offers a concrete solution.

In the field of data mining and machine learning, the definition and composition of components to build new algorithms occurs very rarely, although Sonnenburg et al. (2007) report the emerging need for open-source software with code that is well documented and can easily be reused and combined.
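To illustrate Tracz's triplet, a component description could be captured in a small data structure along the following lines. This is a minimal sketch in Python; the class, field, and instance names are ours and are not part of any existing framework:

```python
from dataclasses import dataclass

@dataclass
class ComponentDescription:
    """Hypothetical record documenting a reusable component in Tracz's terms."""
    name: str     # component name, e.g. "EUCLIDEAN"
    concept: str  # what the component does: interface and semantics
    content: str  # how the component is realized (hidden from the casual user)
    context: str  # application domain: when the component is appropriate

euclidean = ComponentDescription(
    name="EUCLIDEAN",
    concept="Assigns each object to its nearest centroid using the Euclidean distance.",
    content="Computes sqrt(sum((x - c) ** 2)) between an object x and a centroid c.",
    context="Numeric data; efficient for spherical clusters of similar shape.",
)
```

Such a record covers the concept and context parts of a component; the content is the actual implementation that gets plugged into an algorithm.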


Another reuse approach in data mining is through hybrid algorithms that combine various machine learning algorithms (e.g. Siddique et al. 2007). Research similar to ours is rare. Drossos et al. (2000) have identified some reusable components for decision-tree design, and have developed a generic decision tree. Zaki et al. (2005) have developed DMTL (data mining template library) which consists of containers and algorithms for frequent pattern mining. They show that “the use of generic algorithms is competitive with special purpose algorithms”.

3 Survey of partitioning clustering algorithms

For the definition of components, we chose algorithms that contain, in our opinion, interesting sub-problem solutions and are also implemented in common machine-learning platforms, e.g. Weka (Witten and Frank 2005) and RapidMiner (Mierswa et al. 2006). In the following, we analyze the K-means algorithm (Hartigan and Wong 1979), X-means (Pelleg and Moore 2000), MPCK-means (Bilenko et al. 2004), and Kohonen SOM (Kohonen 2001) in order to identify reusable components that can be used to generate hybrid algorithms.

3.1 K-means

The K-means algorithm follows three steps (Hartigan and Wong 1979):

1. Random initialization of k centroids,
2. Assigning objects to clusters according to the minimal distance from the centroids,
3. Recalculating centroids as the arithmetic mean of the objects found in the corresponding cluster.

Steps 2 and 3 are repeated until the clusters are stable between two iterations. K-means uses the Euclidean distance to calculate distances between objects.

3.2 X-means

An algorithm that improves the K-means initialization and the quality of clusters is X-means (Pelleg and Moore 2000). X-means uses a binary-division partitioning strategy for clustering. It starts with a user-defined lower bound on the number of clusters and divides clusters until it reaches the upper bound. The algorithm has the following steps:

1. K-means,
2. Binary cluster division, and
3. Calculating BIC (Bayesian information criterion) for all clusters. If the sum of the BICs of the child clusters is smaller than the BIC of the parent cluster, no split is implemented.

Steps 2 and 3 are repeated until the stop criterion is reached. The Bayesian (also known as Schwarz) information criterion is given by (1):

BIC(J) = -2 \sum_{j=1}^{J} \xi_j + m_J \log(N)    (1)

where:
J — number of clusters,
\xi_j — log likelihood of the j-th cluster,
N — number of cases,
m_J — number of parameters in a model with J clusters, given by (2):

m_J = J \left( 2 K_A + \sum_{k=1}^{K_B} (L_k - 1) \right)    (2)

where:
K_A — total number of numeric attributes,
K_B — total number of categorical attributes,
L_k — number of categories of the k-th categorical attribute.

The log likelihood \xi_j is given by (3):

\xi_j = -N_j \left( \frac{1}{2} \sum_{k=1}^{K_A} \log\left(\sigma_k^2 + \sigma_{jk}^2\right) + \sum_{k=1}^{K_B} E_{jk} \right)    (3)

where:
N_j — number of objects in cluster j,
\sigma_k^2 — variance of the k-th numeric attribute over the whole data set,
\sigma_{jk}^2 — variance of the k-th numeric attribute in cluster j,
E_{jk} — log likelihood in cluster j for categorical attribute k, calculated with (4):

E_{jk} = - \sum_{l=1}^{L_k} \frac{N_{jkl}}{N_j} \log \frac{N_{jkl}}{N_j}    (4)

where:
N_{jkl} — number of objects within cluster j for categorical attribute k and category l,
N_j — number of objects within cluster j.

(A simplified, numeric-only sketch of this BIC computation is given at the end of this section.)

3.3 Kohonen SOM

Although Kohonen self-organizing maps (SOMs) (Kohonen 2001) are usually presented as neural networks, we are interested here in the design of the algorithm. We believe that the Kohonen SOM has similarities with the K-means design. The Kohonen SOM algorithm for a 2D map consists of the following steps:

1. Random initialization of m × n centroids (m × n are the dimensions of the map, when there are two dimensions),
2. Assigning objects in a stepwise manner to the clusters (map cells) according to the minimal distance from the centroids,
3. Stepwise centroid recalculation.

Steps 2 and 3 are repeated until the stop criteria are reached. In Kohonen SOM the objects are assigned one at a time to centroids. In the original algorithm, objects are chosen randomly from the dataset one by one.


When an object is assigned to a centroid, this influences not only the recalculation of the centroid of the cluster the object is assigned to, but also the neighbouring centroids that lie inside the radius of the centroid the object was assigned to. This radius is user-defined and decreases during the learning process of the Kohonen SOM. It is defined by (5):

\sigma(t) = \sigma_0 \, e^{-t/\lambda}    (5)

where:
\sigma_0 — initial radius,
t — learning iteration, and
\lambda — a constant calculated as \lambda = t / \log(\sigma_0).

The centroids are adjusted stepwise using Eq. (6):

C(t+1) = C(t) + \eta(t)\,\theta(t)\,(C^{*}(t) - C(t))    (6)

where:
C(t+1) — adjusted centroid value,
C(t) — current centroid value,
C^{*}(t) — the best centroid for the current object,
\eta(t) — learning rate defined by (7), and
\theta(t) — learning-rate modifier given by (8).

\eta(t) = \eta_0 \, e^{-t/\lambda}    (7)

where:
\eta_0 — initial learning rate.

The learning-rate modifier enables centroids to be adjusted according to their similarity to the best centroid:

\theta(t) = e^{-\frac{d^2}{2\sigma^2(t)}}    (8)

where:
d — distance between a neighbouring centroid and the best centroid, and
\sigma(t) — the radius of the best centroid.

3.4 MPCK-means

The MPCK-means algorithm, described by Bilenko et al. (2004), is a semi-supervised, constraint-based, distance-based algorithm with a K-means design structure. The algorithm uses user-defined must-links ML (objects that should be in the same cluster) and cannot-links CL (objects that should not be in the same cluster) for cluster initialization. The same information is used to adjust the distance measures. The algorithm has the following steps:

1. Cluster initialization:
   a. Based upon the must-link and cannot-link constraints, λ neighbourhoods of objects are formed, and


   b. K centroids are initialized randomly, although there are situations where more complex algorithms are used (e.g. when λ > K).

2. Assigning objects to clusters, where the similarity to centroids is calculated using Eq. (9):

\sum_{x_i \in X} \left( \lVert x_i - \mu_{l_i} \rVert^2_{A_{l_i}} - \log\det(A_{l_i}) \right) + \sum_{(x_i, x_j) \in M} w_{ij}\, f_M(x_i, x_j)\,[l_i \neq l_j] + \sum_{(x_i, x_j) \in C} \bar{w}_{ij}\, f_C(x_i, x_j)\,[l_i = l_j]    (9)

3. Centroids recalculation and recalculation of the distance adjustment matrix A using (10):

A_h = |X_h| \left( \sum_{x_i \in X_h} (x_i - \mu_h)(x_i - \mu_h)^T + \frac{1}{2} \sum_{(x_i, x_j) \in M_h} w_{ij}\,(x_i - x_j)(x_i - x_j)^T\,[l_i \neq l_j] + \sum_{(x_i, x_j) \in C_h} \bar{w}_{ij} \left( (x''_h - x'_h)(x''_h - x'_h)^T - (x_i - x_j)(x_i - x_j)^T \right) [l_i = l_j] \right)^{-1}    (10)

where:
x_i — objects, i = 1, …, N (N — number of objects),
M — set of must-link constraints,
C — set of cannot-link constraints,
A_h — the distance adjustment matrix; a separate matrix can be learned for every cluster h,
\mu — centroids,
(x'_h, x''_h) — the maximally separated pair of points in cluster h,
f_M — penalty function for must-link constraints,
f_C — penalty function for cannot-link constraints,
w_{ij} — penalty weights for must-link constraints, and
\bar{w}_{ij} — penalty weights for cannot-link constraints.

Steps 2 and 3 are repeated until a stable solution is reached.
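To make the split test of Sect. 3.2 concrete, the following is a minimal sketch of the BIC computation of Eqs. (1)–(3), restricted to numeric attributes (so the categorical terms involving E_jk and L_k drop out). The function names are ours; this is an illustration, not a reference implementation of X-means:

```python
import numpy as np

def cluster_log_likelihood(cluster: np.ndarray, data: np.ndarray) -> float:
    """xi_j of Eq. (3), numeric attributes only (the categorical term E_jk is omitted)."""
    n_j = cluster.shape[0]
    sigma_all = data.var(axis=0)      # sigma_k^2: variance over the whole data set
    sigma_j = cluster.var(axis=0)     # sigma_jk^2: variance within the cluster
    return -n_j * 0.5 * float(np.sum(np.log(sigma_all + sigma_j)))

def bic(clusters, data: np.ndarray) -> float:
    """BIC(J) of Eq. (1), with m_J = 2*J*K_A from Eq. (2) for numeric attributes only."""
    j = len(clusters)                 # J: number of clusters
    k_a = data.shape[1]               # K_A: number of numeric attributes
    n = data.shape[0]                 # N: number of cases
    xi_sum = sum(cluster_log_likelihood(c, data) for c in clusters)
    return -2.0 * xi_sum + (2 * j * k_a) * np.log(n)
```

The split test of Sect. 3.2 would then compare, for a binary split of a parent cluster, bic([parent_objects], data) with bic([child1_objects, child2_objects], data).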

4 Reusable components in partitioning clustering

With the help of our literature review, we identified six characteristic sub-problems in the partitioning cluster algorithms we analyzed. These are:

1. Define number of clusters,
2. Initialize centroids,
3. Measure distance,
4. Calculate representatives,
5. Stop criterion, and
6. Split clusters.

In the following, we will discuss problem solutions for these sub-problems and derive reusable components.

4.1 Define number of clusters

In partitioning clustering it is important to estimate the number of partitions that should be found in the data. At the start of clustering, it is not easy to estimate how many clusters can be


found in the data. On the other hand, the number of supposed clusters has a great influence on cluster quality. K-means expects a user-defined number of clusters, and Kohonen SOM expects a user-defined map in one or more dimensions, which implicitly defines the number of clusters. Both methods try to find the user-defined number of clusters. X-means, in contrast, uses a strategy that starts searching for clusters from a usually small number. By dividing each of the initially found clusters in a binary fashion and testing the quality of the child clusters against the parent cluster, it decides whether or not to keep the child clusters. X-means is designed to search for the right number of clusters, and it asks the user to define the minimal and maximal number of clusters supposed to be hidden in the data. Hammerly and Elkan (2003) use the Anderson-Darling statistic as an improvement on the BIC measure used in X-means. Ding and He (2004) use PCA to test the quality of binary-divided clusters. MPCK-means uses a semi-supervised strategy to find the right cluster number: the user can define the number of clusters supposedly hidden in the data, as well as constraints that define which objects should and should not be coupled together.

To solve this sub-problem one can either define K, as in K-means, or define a map, as in Kohonen SOM. Alternatively, one can define a minimum and a maximum K, start clustering from the minimum K, and then perform statistical tests (BIC, Anderson-Darling, PCA) on the binary-split child clusters and their parents to find a more appropriate K than the user-defined value. So for solving this sub-problem, we identified two components: RANGE (as in K-means and X-means) and MAP (as in Kohonen SOM).

4.2 Initialize centroids

After k partitions have been chosen, it is necessary to initialize the partitions (clusters) with cluster representatives, i.e. centroids for numerical data. They are important because they are the starting solutions for the clustering process, and they have a great influence on the quality of the final solution. They are also very effective because they reduce the number of distance calculations between objects: compared to hierarchical clustering algorithms, in which the distances between all objects are calculated, in partitioning clustering only distances between the objects and the centroids have to be calculated.

In K-means and Kohonen SOM, centroids are initialized randomly. In X-means the first parent centroids are initialized randomly; the child centroids are initialized with a strategy of finding two centroids that lie on different sides of a randomly chosen vector and whose distance is proportional to the size of the parent cluster (Pelleg and Moore 2000). In MPCK-means the centroids are initialized based on the must-link (ML) and cannot-link (CL) constraints between objects; there are three possibilities for cluster initialization, depending on the relationship between k and the number of clusters defined by the constraints (Bilenko et al. 2004). Su et al. (2002) made an improvement to the initialization of centroids for Kohonen SOM two-dimensional maps: the first two centroids are initialized as the most distant from each other, the third centroid is the most distant from the first and second one, etc., and the inner centroids are initialized through a linear transformation of the border centroids. The same process can also be used to improve K-means centroid initialization. Arthur and Vassilvitskii (2007) use in K-means++ a strategy of selecting the first centroid randomly and the remaining K−1 with a simple probabilistic measure, as sketched below.
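The following is a minimal sketch of this seeding rule, assuming numeric data in a NumPy array; the function name is ours, and this is an illustration rather than the reference K-means++ implementation:

```python
import numpy as np

def kmeans_pp_init(data: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """K-means++-style seeding (Arthur and Vassilvitskii 2007): the first centroid is
    drawn uniformly at random; each further centroid is drawn with probability
    proportional to the squared distance to its nearest already chosen centroid."""
    centroids = [data[rng.integers(len(data))]]
    for _ in range(k - 1):
        # squared Euclidean distance of every object to its nearest centroid so far
        diffs = data[:, None, :] - np.array(centroids)[None, :, :]
        d2 = np.min((diffs ** 2).sum(axis=2), axis=1)
        probs = d2 / d2.sum()
        centroids.append(data[rng.choice(len(data), p=probs)])
    return np.array(centroids)
```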


Although more solutions to this sub-problem are proposed in Table 1, we only mention reusable components here that can easily be integrated into the generic clustering-algorithm structure presented in Sect. 4.7. Other reusable components require a more complex generic algorithm structure that allows for nesting of components, but we do not plan to implement them in the first phase of our research. For solving the sub-problem of centroid initialization, we identified three components: RANDOM, as in Kohonen SOM, K-means, and X-means for the minimum K; LINEAR MAP, as in Su et al. (2002); and K-MEANS++, as in Arthur and Vassilvitskii (2007).

4.3 Measure distance

The distance between objects and centroids is typically measured with the Euclidean distance. All of the algorithms we analyzed work with numerical data and use the Euclidean metric to measure distances between objects and centroids. MPCK-means uses a weighted Euclidean metric that can be different for every cluster: after the first iteration, and based upon the ML and CL constraints as parameters, the algorithm learns a specific metric for every cluster. In this way clusters of different shapes can be found. However, the Euclidean distance is not the only way to calculate distances. In Basu et al. (2004), for example, the authors of MPCK-means use two other distortion measures: the I-divergence and the cosine similarity. The cosine similarity should be used for directional data, in which differences in angles between vectors are important while vector lengths are not. In contrast, I-divergence should be used for data that is described by probability distributions (Basu et al. 2004). A host of other distance measures for both continuous and categorical data have been proposed in the literature (see e.g. Han and Kamber 2006; Zezula et al. 2006).

As examples, we describe two reusable components for measuring distances, documented according to Tracz (1990).

Component name: EUCLIDEAN

1. Concept:
   Description: Measures the distance between objects using the Euclidean distance.
   Input: Objects and centroids.
   Output: Objects assigned to their nearest representative, i.e. centroid (labelled objects), and distances between objects and centroids.
2. Context:
   Application: Measures distances between centroids and objects. The most frequently used distance measure. It is efficient for finding spherical clusters, but it supposes that all clusters have the same shape.
3. Content: Uses the Euclidean distance as in Hartigan and Wong (1979).

Component name: ADJUST DISTANCE TO CONSTRAINTS

1. Concept:
   Description: Measures the distance between objects using a different distance measure for each cluster.


   Input: Objects, centroids, and pair-wise constraints on object relations (must-links, cannot-links).
   Output: Objects assigned to their nearest representative, i.e. centroid (labelled objects), and distances between objects and centroids.
2. Context:
   Application: Measures distances between centroids and objects. Can find clusters of various shapes. Allows a user to define pair-wise constraints on object relationships.
3. Content: Calculates the distance as described in Bilenko et al. (2004).

These component descriptions help to understand when and for what purposes a reusable component can be applied.

4.4 Calculate representatives

Every cluster in a partitioning cluster algorithm can have a representative point; for numerical data it is the centroid. In K-means and X-means it is calculated as the arithmetic mean of all objects contained in a cluster. In Kohonen SOM a strategy of stepwise centroid adjustment is used. In Fig. 1 (the generic clustering algorithm) we call this strategy of centroid calculation STEP BY STEP. It allows similar clusters to lie next to each other in the map and separates dissimilar clusters. The same approach can be used in K-means clustering (Cheung 2003). Centroids are important cluster representatives because they improve the efficiency of cluster algorithms compared to algorithms in which distances between all objects have to be calculated (e.g. hierarchical cluster algorithms). For this sub-problem we found two components: ARITHMETIC MEAN (used in K-means, X-means, and MPCK-means) and STEP BY STEP (used in Kohonen SOM and in Cheung (2003)).

4.5 Stop criterion

A clustering process has to have some stopping criterion in order to end the learning process. In K-means and MPCK-means the criterion for stopping is that no modification between two iterations can be found. Kohonen SOM stops after a certain number of iterations, or when a stable solution is reached. X-means stops when no more splits can be implemented or when the upper limit of the user-defined number of clusters has been reached.

4.6 Split clusters

X-means uses the strategy of binary cluster splitting in order to find more appropriate clusters. This strategy can also be used for cluster quality testing in K-means and in Kohonen SOM. In Kohonen SOM every map can be divided binarily into 1 × 2 or 2 × 1 cells (for two-dimensional maps). In X-means, BIC is used to test the quality of the binary-split clusters, although there are also other measures of cluster quality, e.g. information entropy in Barbara et al. (2001), PCA in Ding and He (2004), and the Anderson-Darling statistic (Hammerly and Elkan 2003). The


split-cluster strategy can be used for testing the quality of the clusters in all of the algorithms we analyzed. Here we identified four components: BIC (Pelleg and Moore 2000), PCA (Ding and He 2004), ANDERSON-DARLING (Hammerly and Elkan 2003), and NONE if no splitting should be performed.

4.7 The generic partitioning cluster algorithm

We propose a generic algorithm that can combine the reusable components we have identified; it is shown in Fig. 1. The user generates a clustering algorithm by combining these reusable components. The problems we have described and their corresponding solutions can thus be presented in the form of a generic partitioning clustering algorithm. The components of the generic algorithm are structural in the sense that they define both the process and the structure of the generated algorithm: they explain and resolve crucial sub-problems, and the chosen components determine how the algorithm is built.

[Fig. 1 A generic cluster partitioning algorithm: the six sub-problems (Define number of clusters, Initialize centroids, Measure distance, Calculate representatives, one or more Stop criteria, Split clusters) arranged as a pipeline, each offered with its candidate components — RANGE or MAP; RANDOM, LINEAR MAP, or K-MEANS++; EUCLIDEAN, COSINE, I-DIVERGENCE, or ADJUST DISTANCE TO CONSTRAINTS; ARITHMETIC MEAN or STEP BY STEP; No. OF RUNS or STABILITY; BIC, PCA, ANDERSON-DARLING, or NONE.]
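To make the composition idea concrete, the following is a minimal sketch (ours; not an implementation from this paper or from any existing platform) in which each sub-problem of Fig. 1 is a plain Python callable and the generic skeleton only fixes the order in which the plugged-in components are applied. With RANDOM, EUCLIDEAN, ARITHMETIC MEAN, and STABILITY it reproduces the basic K-means loop of Sect. 3.1:

```python
import numpy as np

def random_init(data, k, rng):
    """INITIALIZE CENTROIDS (RANDOM): pick k distinct objects as starting centroids."""
    return data[rng.choice(len(data), size=k, replace=False)]

def euclidean_assign(data, centroids):
    """MEASURE DISTANCE (EUCLIDEAN): label each object with its nearest centroid."""
    d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def arithmetic_mean(data, labels, centroids):
    """CALCULATE REPRESENTATIVES (ARITHMETIC MEAN): recompute each centroid."""
    new = centroids.copy()
    for j in range(len(centroids)):
        members = data[labels == j]
        if len(members) > 0:
            new[j] = members.mean(axis=0)
    return new

def stability(old_labels, new_labels):
    """STOP CRITERION (STABILITY): stop when no object changes its cluster."""
    return old_labels is not None and np.array_equal(old_labels, new_labels)

def generic_partitioning(data, k, init, assign, represent, stop, rng=None):
    """Generic skeleton of Fig. 1: each sub-problem is a plug-in callable."""
    data = np.asarray(data, dtype=float)
    rng = rng or np.random.default_rng(0)
    centroids = init(data, k, rng)
    labels = None
    while True:
        new_labels = assign(data, centroids)            # measure distance / assign objects
        if stop(labels, new_labels):                    # stop criterion
            break
        labels = new_labels
        centroids = represent(data, labels, centroids)  # calculate representatives
    return centroids, labels
```

Swapping any single callable, e.g. the initialization or the stop criterion, changes the generated algorithm without touching the skeleton.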


The generic cluster algorithm shows important problems in partitioning clustering that can have various manifestations in clustering-algorithm design. The components chosen in the generic clustering algorithm define the resulting clustering algorithm.

For "Define number of clusters" we identified two components:

1. RANGE: found in K-means and X-means, and
2. MAP: found in Kohonen SOM.

For "Initialize centroids" we identified three components:

1. RANDOM: found in Kohonen SOM, K-means, and X-means,
2. LINEAR MAP: found in Su et al. (2002), and
3. K-MEANS++: found in Arthur and Vassilvitskii (2007).

For "Measure distance" we identified four components:

1. EUCLIDEAN: found in Kohonen SOM, K-means, and X-means,
2. COSINE: found in Basu et al. (2004),
3. I-DIVERGENCE: found in Basu et al. (2004), and
4. ADJUST DISTANCE TO CONSTRAINTS: found in MPCK-means.

For "Calculate representatives" we identified two components:

1. ARITHMETIC MEAN: found in K-means, X-means, and MPCK-means, and
2. STEP BY STEP: found in Kohonen SOM.

For "Stop criterion" we identified two components:

1. No. OF RUNS: found in Kohonen SOM, and
2. STABILITY: found in K-means, X-means, and MPCK-means.

For "Split clusters" we identified four components:

1. BIC: found in X-means,
2. PCA: found in Ding and He (2004),
3. ANDERSON-DARLING: found in Hammerly and Elkan (2003), and
4. NONE: if no split component is used.

In MPCK-means the reusable components are strongly tied to one another, so it was not easy to extract more than one reusable component (ADJUST DISTANCE TO CONSTRAINTS) that could be used in the generic clustering algorithm we propose. Our generic clustering algorithm composes components in a linear way, so it is not suitable for handling more complicated reusable-component dependencies that also frequently occur in algorithm design, e.g. nesting of components. In MPCK-means the ML and CL constraints are tied to every other reusable component, so more complicated generic structures are needed to describe the MPCK-means design. The generic clustering algorithm we propose can only handle linear and recursive reusable-component dependencies.

5 Examples of using the generic clustering algorithm

The reusable components we identified can be used in two different ways. First, they can be combined to recreate the original algorithms described in the literature; Kohonen SOM, for example, can be reproduced using the following components:


"Define number of clusters" = MAP
"Initialize centroids" = RANDOM
"Measure distance" = EUCLIDEAN
"Calculate representatives" = STEP BY STEP
"Stop criterion" = No. OF RUNS, or STABILITY
"Split clusters" = NONE

All of the algorithms we analyzed can be reproduced in a similar way. Moreover, all algorithms can be modified using these components: for K-means, for example, the "Calculate representatives" component STEP BY STEP could be used instead of ARITHMETIC MEAN.

The second way of using our components is to create new partitioning cluster algorithms by combining the analyzed components. Including "Split clusters" can change K-means, Kohonen SOM, and MPCK-means. We can easily define an algorithm that uses RANGE for "Define number of clusters", LINEAR MAP for "Initialize centroids", EUCLIDEAN for "Measure distance", one of the stop criteria, and BIC for "Split clusters". In this way we can compose an algorithm that fits our needs for a particular clustering task, as illustrated below. This shows that there are many components whose usage has traditionally been restricted to their original algorithms, but which can be reused intelligently in other partitioning cluster algorithms.
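Reusing the illustrative skeleton and component functions sketched after Fig. 1, together with the K-means++ seeding sketch from Sect. 4.2 (all function names are from our sketches, not from an existing library), the two ways of using the components might look roughly as follows:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(size=(300, 2))

# 1. Recreate an original algorithm: K-means = RANDOM + EUCLIDEAN
#    + ARITHMETIC MEAN + STABILITY (with "Split clusters" = NONE).
kmeans_centroids, kmeans_labels = generic_partitioning(
    data, k=3,
    init=random_init, assign=euclidean_assign,
    represent=arithmetic_mean, stop=stability)

# 2. Create a modified algorithm by swapping a single component:
#    K-MEANS++ seeding instead of RANDOM initialization.
pp_centroids, pp_labels = generic_partitioning(
    data, k=3,
    init=kmeans_pp_init, assign=euclidean_assign,
    represent=arithmetic_mean, stop=stability)
```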

6 Summary and future research

In this paper, we defined reusable components that can be applied to build generic partitioning cluster algorithms. The idea was to extract interesting solutions to sub-problems as reusable components that can be implemented, even in a slightly different way, in other algorithms and thereby benefit those algorithms. We built a generic algorithm for handling the reusable components we identified; there is, however, more than one alternative to our generic algorithm, even among partitioning cluster algorithms that can combine the components we identified. We plan to investigate partitioning clustering algorithms further and to identify new reusable components. The next step in our research is to implement our components in an open-source machine-learning platform (e.g. R, Weka, RapidMiner, etc.). This research can help to achieve the following goals:

1. "Unleashing" reusable components that can be found in existing clustering algorithms. In this way new algorithms can be created more easily and tested for their effectiveness.
2. In situations where good solutions for specific data and user requirements exist in a specific algorithm, these solutions can easily be used outside of the original algorithm to solve the problem.
3. Sharing of ideas in the machine-learning and pattern communities can be made easier.

This paper only provides first ideas on implementing reusable components for building new partitioning cluster algorithms. Experimental evidence of the performance of reusable components has to be provided in order to document a reusable component well. A precise description of a component can provide information about its strengths and limitations, depending on the data (attribute types, number of attributes, number of cases, correlation between attributes, attribute distributions, missing values, etc.) but also on user demands.


Combining components has great potential, but we still have to show that a generic partitioning algorithm can outperform a known algorithm on specific data and user demands. Our investigation aims at a decision-support system architecture that can guide users in the partitioning cluster process and that can even create data-driven algorithms, generated automatically or semi-automatically. We have already started to develop a white-box, component-based data-mining framework, called WhiBo, which can work with reusable components and allows their combination. Currently, only decision trees are included,¹ but we plan to extend it with components for partitioning clustering.

¹ The WhiBo platform, an extension of the open-source RapidMiner data-mining software, can be downloaded from its Google Code page (http://code.google.com/p/whibo).

Acknowledgements  This research is partially funded by a grant from the Serbian Ministry of Science, TR 12013.

References

Adams M, Coplien J, Gamoke R, Hammer R, Keeve F, Nicodemus K (1998) Fault-tolerant telecommunication system patterns. In: Rising L (ed) The pattern handbook: techniques, strategies, and applications. Cambridge University Press, New York, pp 189–202
Alexander C (1979) The timeless way of building. Oxford University Press, New York
Alexander C (2005a) The nature of order book 1: the phenomenon of life. The Center for Environmental Structure, Berkeley, CA
Alexander C (2005b) The nature of order book 2: the process of creating life. The Center for Environmental Structure, Berkeley, CA
Alexander C (2005c) The nature of order book 3: a vision of a living world. The Center for Environmental Structure, Berkeley, CA
Alexander C (2005d) The nature of order book 4: the luminous ground. The Center for Environmental Structure, Berkeley, CA
Arthur D, Vassilvitskii S (2007) K-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms, Society for Industrial and Applied Mathematics, New Orleans, Louisiana, pp 1027–1035
Barbara D, Couto J, Li Y (2001) COOLCAT: an entropy-based algorithm for categorical clustering. In: Proceedings of the eleventh international conference on information and knowledge management, pp 582–589
Basu S, Bilenko M, Mooney RJ (2004) A probabilistic framework for semi-supervised clustering. In: Proceedings of ACM SIGKDD, Seattle, WA, pp 59–68
Bennett KP, Bradley PS, Demiriz A (2000) Constrained k-means clustering. Microsoft Research. Available via DIALOG. ftp://ftp.research.microsoft.com/pub/tr/tr-2000-65.ps. Accessed 9 Apr 2009
Berkhin P (2006) A survey of clustering data mining techniques. In: Kogan J, Nicholas C, Teboulle M (eds) Grouping multidimensional data. Springer, Berlin-Heidelberg, pp 25–71
Bilenko M, Basu S, Mooney RJ (2004) Integrating constraints and metric learning in semi-supervised clustering. In: Proceedings of the twenty-first international conference on machine learning, Banff, Canada, pp 81–88
Bradley PS, Fayyad UM (1998) Refining initial points for k-means clustering. In: Proceedings of the fifteenth international conference on machine learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, pp 91–99
Cheung YM (2003) k*-Means: a new generalized k-means clustering algorithm. Pattern Recog Lett 24:2883–2893
Coplien JO, Harrison NB (2005) Organizational patterns of agile software development. Prentice-Hall PTR, Upper Saddle River, NJ
Coplien JO, Schmidt DC (1995) Pattern languages of program design. Addison-Wesley Professional, Reading, MA
Delibasic B, Kirchner K, Ruhland J (2008) A pattern-based data mining approach. In: Preisach C, Burckhardt H, Schmidt-Thieme L et al (eds) Data analysis, machine learning and applications. Springer, Berlin/Heidelberg, pp 327–334
Ding C, He X (2004) K-means clustering via principal component analysis. In: Proceedings of the twenty-first international conference on machine learning, ACM, New York, NY, p 29
Drossos N, Papagelis A, Kalles D (2000) Decision tree toolkit: a component-based library of decision tree algorithms. In: Zighed DZ, Komorowski J, Zytkow J (eds) Principles of data mining and knowledge discovery. Springer, Berlin/Heidelberg, pp 121–150
Freeman P (1983) Reusable software engineering: concepts and research directions. In: Workshop on reusability in programming, ITT Programming, Stratford, Connecticut, pp 2–16
Gamma E, Helm R, Johnson R, Vlissides JM (1995) Design patterns: elements of reusable object-oriented software. Addison-Wesley, Reading, MA
Hammerly G, Elkan C (2003) Learning the k in k-means. In: Proceedings of the seventeenth annual conference on neural information processing systems, pp 281–288
Han J, Kamber M (2006) Data mining: concepts and techniques, 2nd edn. Morgan Kaufmann Publishers, San Francisco
Hartigan JA, Wong MA (1979) A k-means clustering algorithm. Appl Stat 28:100–108
Kohonen T (2001) Self-organizing maps. Springer, Berlin
Lea D (1994) Design patterns for avionics control systems. Available via DIALOG. http://gee.cs.oswego.edu/dl/acs/acs.pdf. Accessed 9 Apr 2009
Likas A, Vlassis N, Verbeek JJ (2002) The global k-means clustering algorithm. Pattern Recog 36:451–461
Mierswa I, Wurst M, Klinkenberg R, Scholz M, Euler T (2006) YALE: rapid prototyping for complex data mining tasks. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, pp 935–940
Pelleg D, Moore A (2000) X-means: extending k-means with efficient estimation of the number of clusters. In: Proceedings of the seventeenth international conference on machine learning, Morgan Kaufmann, San Francisco, pp 727–734
Siddique NH, Amavasai BP, Ikuta A (eds) (2007) Special issue on hybrid techniques in AI. Artif Intell Rev 27:71–201
Sommerville I (2004) Software engineering. Pearson, Boston
Sonnenburg S, Braun ML, Ong CS, Bengio S, Bottou L, Holmes G, LeCun Y, Müller KR, Pereira F, Rasmussen CE, Rätsch G, Schölkopf B, Smola A, Vincent P, Weston J, Williamson RC (2007) The need for open source software in machine learning. J Mach Learn Res 8:2443–2466
Steinley D (2006) K-means clustering: a half-century synthesis. Br J Math Stat Psychol 59:1–34
Su MC, Liu TK, Chang HT (2002) Improving the self-organizing feature map algorithm using an efficient initialization scheme. Tamkang J Sci Eng 5:35–48
Tracz W (1990) Where does reuse start? ACM SIGSOFT Softw Eng Notes 15:42–46
Winn T, Calder P (2002) Is this a pattern? IEEE Softw 19(1):59–66
Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco
Xing EP, Ng AY, Jordan MI, Russell S (2003) Distance metric learning with application to clustering with side-information. Adv Neural Inf Process Syst 15:521–528
Zaki M, De N, Gao F, Palmerini P, Parimi N, Pathuri J, Phoophakdee B, Urban J (2005) Generic pattern mining via data mining template library. In: Boulicaut JF, De Raedt L, Mannila H (eds) Constraint-based mining and inductive databases. European workshop on inductive databases and constraint based mining. Springer, Berlin/Heidelberg, pp 362–379
Zezula P, Amato G, Dohnal V, Batko M (2006) Similarity search: the metric space approach (Advances in database systems). Springer, New York
