
Table of Contents

Chapter 4. Clustering and Association Analysis
  4.1. Cluster Analysis or Clustering
    4.1.1. Distance and similarity measurement
    4.1.2. Clustering Methods
    4.1.3. Partition-based Methods
    4.1.4. Hierarchical-based clustering
    4.1.5. Density-based clustering
    4.1.6. Grid-based clustering
    4.1.7. Model-based clustering
  4.2. Association Analysis and Frequent Pattern Mining
    4.2.1. Apriori algorithm
    4.2.2. FP-Tree algorithm
    4.2.3. CHARM algorithm
    4.2.4. Association Rules with Hierarchical Structure
    4.2.5. Efficient Association Rule Mining with Hierarchical Structure
  4.3. Historical Bibliography
  Exercise


Chapter 4. Clustering and Association Analysis

This chapter presents two unsupervised tasks: clustering and association rule mining. Clustering groups input data into a number of groups without examples or clues, while association rule mining finds frequently co-occurring events using correlation analysis. The chapter covers basic techniques for clustering and then for association rule mining, in that order.

4.1. Cluster Analysis or Clustering

Unlike classification, cluster analysis (also called clustering or data segmentation) handles data objects whose class labels are unknown or not given. Clustering is known as unsupervised learning: it does not rely on predefined classes or class-labeled training examples. For this reason, clustering is a form of learning by observation rather than learning by examples (as in classification). Although classification is an effective means for distinguishing groups or classes of objects, it often requires the costly process of labeling a large set of training tuples or patterns, which the classifier uses to model each group. It is therefore often more desirable to proceed in the reverse direction: first partition the set of data into groups based on data similarity (i.e., using clustering), and then assign labels to the relatively small number of groups. In other words, although we can analyze data objects directly when the class label of each object is known (classification), it is quite common in large databases that objects are not assigned any label (clustering), since the process of assigning class labels is very costly. In such situations, where we have a set of data objects without any class labels, we often need to arrange them into groups based on their similarity in order to discover some structure within the object collection. Moreover, a cluster of data objects can be treated collectively as one group and so may be considered a form of data compression. As a by-product of clustering, it is also possible to detect outliers, i.e., objects that deviate from the norm. An additional advantage of such a clustering-based process is that it is adaptable to changes and helps identify useful features that distinguish different groups.

In data mining, efforts have focused on finding methods for efficient and effective cluster analysis in large databases. Active themes of research include the scalability of clustering methods, the effectiveness of methods for clustering complex shapes and types of data, high-dimensional clustering techniques, and methods for clustering mixed numerical and categorical data in large databases.

Clustering attempts to group a set of given data objects into classes or clusters, so that objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters. Similarities and dissimilarities are assessed based on the attribute values describing the objects, using distance measures. There are many clustering approaches, including partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, methods for high-dimensional data (such as frequent pattern-based methods), and constraint-based clustering.

Conceptually, cluster analysis is an important human activity. Early in childhood, we learn how to distinguish between dogs and cats, between birds and fish, or between animals and plants, by continuously improving subconscious clustering schemes. With automated clustering, dense and sparse regions are identified in object space, and therefore overall distribution patterns and interesting correlations among data attributes are discovered. Clustering has its roots in many areas, including data mining, statistics, information theory, and machine learning.
It has been widely used in numerous applications, such as market research, pattern recognition, data analysis, biological discovery, and image processing. In business, clustering can help marketing people discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns. In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations. Clustering may also assist in the identification of areas of similar land usage in an earth observation database, in the identification of groups of houses in a city according to house type, value, and geographic location, as well as in the identification of groups of automobile insurance policy holders with a high average claim cost. It can also be used to help classify online documents, e.g., web documents, for information discovery.

As one major data mining function, cluster analysis can be used as a stand-alone tool to gain insight into the distribution of data, to observe the characteristics of each cluster, and to focus on a particular set of clusters for further analysis. Alternatively, it may serve as a preprocessing step for other algorithms, such as characterization, attribute subset selection, and classification, which would then operate on the detected clusters and the selected attributes or features. In the context of the huge amounts of data collected in databases, cluster analysis has recently become a highly active topic in data mining research. As a branch of statistics, cluster analysis has been extensively studied for many years, focusing mainly on distance-based cluster analysis. Cluster analysis tools based on k-means, k-medoids, and several other methods have also been built into many statistical analysis software packages or systems, such as DBMiner, Weka, KNIME, RapidMiner, SPSS, and SAS. In machine learning, clustering is an example of unsupervised learning.

There are different ways in which the result of clustering can be expressed. Firstly, the groups that are identified may be exclusive, so that any instance belongs to only one group. Secondly, it is also possible to have clusters that are overlapping, i.e., an instance may fall into several groups. Thirdly, the clusters may be probabilistic, in the sense that an instance may belong to each group with a certain probability. Fourthly, the clusters may be hierarchical, such that there is a crude division of instances into groups at the top level, and each of these groups is refined further, perhaps all the way down to individual instances. Which of these possibilities should be applied depends on the constraints and purposes of clustering. Two further factors are whether clusters are formed in discrete domains (categorical or nominal attributes), numeric domains (integer- or real-valued attributes), or hybrid domains, and whether clusters are formed incrementally or statically.

The following is a list of typical issues that need to be considered for clustering in real applications.

1. Practical clustering techniques are required to handle large-scale data, say millions of data objects, since real applications usually involve such large amounts of data.
2. Clustering techniques need to handle various types of attributes, including interval-based (numerical) data, binary data, categorical or nominal data, and ordinal data, or mixtures of these data types.
3. Without any constraint, most clustering techniques naturally tend to find clusters of spherical shape. However, intuitively and practically, clusters could be of any shape, and clustering techniques should be able to detect clusters of arbitrary shape.
4. Clustering techniques should autonomously determine the suitable number of clusters and their associated elements without any required extra steps.
5. Clustering should be able to detect and eliminate noisy data, i.e., outliers or erroneous data.
6. It is necessary to enable incremental clustering that is insensitive to the order of the input data. Many techniques cannot incorporate newly inserted data into existing clustering structures but must build clusters from scratch, and most of them are sensitive to the order of the input data.
7. In many cases, it is inevitable to handle data with high dimensions, since the data in many tasks have objects with a large number of attributes. Finding clusters of data objects in high-dimensional space is challenging, especially considering that such data can be sparse and highly skewed.
8. It is often necessary in practice to perform clustering under various kinds of constraints. For example, when we are assigned to locate a number of convenience stores, we may cluster residents by taking their static characteristics into account, as well as considering geographical constraints such as the city's rivers, ponds, roads, and highway networks, and the type and number of residents per cluster (customers for a store). A challenging task is to find groups of data with good clustering behavior that satisfy specified constraints.
9. The selection of clustering techniques and features has to be made with interpretability and usability in mind. Clustering may need to be tied to specific purposes that match semantic interpretations and applications. It is important to study how an application goal may influence the selection of clustering features and methods.

The next sections describe distance or similarity, which is the basis for clustering similar objects, and then illustrate well-known types of clustering methods, including partition-based methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. Special topics on clustering in high-dimensional space, constraint-based clustering, and outlier analysis are partially discussed.

4.1.1. Distance and similarity measurement

To enable the process of clustering, a form of distance or similarity needs to be defined to determine which pairs of objects are close to each other. In general, a data object is characterized by a set of attributes or features. These can be of various types, including interval-scaled variables, binary variables, categorical, ordinal, and ratio-scaled variables, or combinations of these variable types. Data objects can be represented in the form of a mathematical schema, as shown in Section 2.1 and summarized briefly as follows. Assume that a dataset D is composed of n instance objects o_1, ..., o_n, each of which, o_i, is represented by a vector of p attribute values a_i1, ..., a_ip over the attributes A_1, ..., A_p. In other words, a dataset can be viewed as a matrix with n rows and p columns, called a data matrix, as in Figure 4-1. Each instance object o_i is characterized by the values of the attributes, which express different aspects of the instance.

Dataset D (a data matrix of n objects and p attributes):

           A_1      A_2      ...    A_k      ...    A_p
  o_1      a_11     a_12     ...    a_1k     ...    a_1p
  o_2      a_21     a_22     ...    a_2k     ...    a_2p
  ...      ...      ...      ...    ...      ...    ...
  o_i      a_i1     a_i2     ...    a_ik     ...    a_ip
  ...      ...      ...      ...    ...      ...    ...
  o_n      a_n1     a_n2     ...    a_nk     ...    a_np

    D = (o_1, o_2, ..., o_n)^T,      o_i = (a_i1, a_i2, ..., a_ip)

Here the database D is a set of data objects, and each object o_i is characterized by its attribute values.

Figure 4-1: A matrix representation of data objects with their attribute values

Conceptually, we can measure the dissimilarity (also called distance) or similarity between any pair of objects. The result can be represented in the form of a two-dimensional matrix, called a dissimilarity (distance) matrix or a similarity matrix. This matrix holds an object-by-object structure and stores a collection of proximity values (dissimilarity or similarity values) for all pairs of objects. It is represented by an n-by-n matrix, as shown in Figure 4-2.


In Figure 4-2 (a), d(i, j) expresses the difference or dissimilarity between objects i and j. In most works, d(i, j) is defined to be a nonnegative number. It is close to 0 when objects i and j are very similar or close to each other, and it becomes larger when they are more different from each other. In general, the symmetry d(i, j) = d(j, i) is assumed, and the difference between two identical objects is set to zero, i.e., d(i, i) = 0. On the other hand, in Figure 4-2 (b), s(i, j) expresses the similarity between objects i and j. Normally, s(i, j) is a number that takes a maximum of 1 when objects i and j are identical, and it becomes smaller when they are farther from each other. As with the dissimilarity measure, the symmetry s(i, j) = s(j, i) is naturally assumed, and the similarity between two identical objects is set to one, i.e., s(i, i) = 1.

To calculate distance or similarity, since there are various types of attributes, the measurement of the distance between two objects, say o_i and o_j, becomes an important issue. The point is how to fairly define the distance for interval-scaled attributes, binary attributes, categorical attributes, ordinal attributes, and ratio-scaled attributes, since the distance (or similarity) between two objects, i.e., d(i, j) or s(i, j) in Figure 4-2, naturally comes from the combination of the distances of their individual attributes. As a naive way, in many works the combination is simply defined by a summation over the attributes, i.e., d(i, j) = Σ_A d_A(i, j) (or, for similarity, s(i, j) = Σ_A s_A(i, j)). While different types of attributes may have different distance (or similarity) calculations, it is possible to classify them into two main approaches.

o1 o2 … o i …

on-1 on

o1  0 d(2,1) … d(i,1) … d(n-1,1) d(n,1)

o2  d(1,2) 0 … d(i,2) … d(n-1,2) d(n,2)

… … … 0 … … … …

oi  d(1,i) d(2,i) … 0 … d(n-1,i) d(n,i)

… … … … … 0 … …

o n-1  d(1,n-1) d(2,n-1) … d(i,n-1) … 0 d(n,n-1)

on  d(1,n) d(2,n) … d(i,n) … d(n-1,n) 0

(a) Dissimilarity (Distance) matrix (minimum value = 0)

o1 o2 … o i …

on-1 on

o1  1 s(2,1) … s(i,1) … s(n-1,1) s(n,1)

o2  s(1,2) 1 … s(i,2) … s(n-1,2) s(n,2)

… … … 1 … … … …

oi  s(1,i) s(2,i) … 1 … s(n-1,i) s(n,i)

… … … … … 1 … …

o n-1  s(1,n-1) s(2,n-1) … s(i,n-1) … 1 s(n,n-1)

on  s(1,n) s(2,n) … s(i,n) … s(n-1,n) 1

(b) Similarity matrix (maximum value = 1)

Figure 4-2: A dissimilarity (distance) matrix and a similarity matrix

The first approach is to normalize all attributes into a fixed standard scale, say 0.0 to 1.0 (or -1.0 to 1.0), and then use a distance measure, such as the Euclidean distance or the Manhattan distance, or a similarity measure, such as the cosine similarity, to determine the distance between a pair of objects. The details of this approach are described in Section 2.5.2. The second approach is to use different measurements for different types of attributes, as follows.


1. Interval-scaled attributes

[Normalization step] Option 1: transform the original value a_if of an attribute A_f into a standardized value z_if using the mean absolute deviation of the attribute, denoted by s_f, and the mean value of the attribute, denoted by m_f:

    s_f = \frac{1}{n}\big(|a_{1f} - m_f| + |a_{2f} - m_f| + \cdots + |a_{nf} - m_f|\big), \qquad z_{if} = \frac{a_{if} - m_f}{s_f}

Option 2: transform the original value a_if into the standardized value z_if using the standard deviation of the attribute, denoted by \sigma_f, and the mean value of the attribute, denoted by m_f:

    z_{if} = \frac{a_{if} - m_f}{\sigma_f}

[Distance measurement step] For the distance measurement d(i, j) in Figure 4-2, we can use a standard measure such as the Euclidean distance or the Manhattan distance between two objects, say o_i and o_j, as follows.

Euclidean distance:

    d(i, j) = \sqrt{(a_{i1} - a_{j1})^2 + (a_{i2} - a_{j2})^2 + \cdots + (a_{ip} - a_{jp})^2}

Manhattan distance:

    d(i, j) = |a_{i1} - a_{j1}| + |a_{i2} - a_{j2}| + \cdots + |a_{ip} - a_{jp}|

Both the Euclidean distance and the Manhattan distance satisfy the following mathematical requirements of a distance function.

1. d(i, j) ≥ 0: the distance is a nonnegative number.
2. d(i, i) = 0: the distance from an object to itself is zero.
3. d(i, j) = d(j, i): the distance is a symmetric function.
4. d(i, j) ≤ d(i, h) + d(h, j): the distance satisfies the triangular inequality; the direct distance from object i to object j is not larger than an indirect detour over any other object h.


1. Interval-scaled attributes (continued)

It is also possible to measure similarity instead of distance. For s(i, j) in Figure 4-2, two common similarity measures are the dot product and the cosine similarity. Their formulae are given below. When the objects are normalized to unit length, the dot product and the cosine similarity become identical.

Dot product:

    s(i, j) = a_{i1} a_{j1} + a_{i2} a_{j2} + \cdots + a_{ip} a_{jp}

Cosine similarity:

    s(i, j) = \frac{\sum_{k=1}^{p} a_{ik} a_{jk}}{\sqrt{\sum_{k=1}^{p} a_{ik}^2}\;\sqrt{\sum_{k=1}^{p} a_{jk}^2}}

Note that the dot product has no bound, but the cosine similarity ranges between -1 and 1 (in this setting, with nonnegative attribute values, between 0 and 1). A short Python sketch illustrating these measures, together with those for the remaining attribute types, is given after item 5 below.

2. Categorical attributes

A categorical attribute can be viewed as a generalization of the binary attribute in that it can take more than two states. For example, 'product brand' is a categorical attribute that may take one value from a set of more than two possible values, say Oracle, Microsoft, Google, and Facebook. For the distance on a categorical attribute, it is possible to follow the same approach as for an interval-scaled attribute by setting the per-attribute distance to 0 if the two objects have the same value, and 1 otherwise. That is, when the value of a categorical attribute A of two objects o_i and o_j is the same, the dissimilarity and the similarity between these two objects for that attribute are set to 0 and 1, respectively; otherwise they are 1 and 0, respectively:

    d_A(i, j) = 0 and s_A(i, j) = 1, when the objects o_i and o_j have the same value for attribute A;
    d_A(i, j) = 1 and s_A(i, j) = 0, when the objects o_i and o_j have different values for attribute A.

3. Ordinal attributes

A discrete ordinal attribute lies between a categorical attribute and a numeric-valued attribute, in the sense that an ordinal attribute has a finite number M_f of discrete values (like a categorical attribute) but they can be ordered in a meaningful sequence (like a numeric-valued attribute). An ordinal attribute is useful for recording subjective assessments of qualities that cannot be measured objectively. For example, 'height' can be high, middle, or low, and 'weight' can be heavy, mild, or light. This ordinal property expresses continuity on an unknown scale, but the actual magnitude is not known. To handle the scale of an ordinal attribute, we can treat an ordinal variable by normalization as follows. First, the values of the ordinal attribute A_f are mapped to ranks, so that the value of the attribute for the i-th object is replaced by its rank r_if in {1, ..., M_f}. Second, each rank is mapped to a value z_if between 0 and 1 as follows:

    z_{if} = \frac{r_{if} - 1}{M_f - 1}

Third, the dissimilarity can then be computed using any of the distance measures for interval-scaled attributes, using z_if to represent the value of the attribute for the i-th object.


4. Binary attributes

Even though it is possible to use the same approach as for interval-scaled attributes, treating binary variables as if they were interval-scaled may not be suitable, since it can lead to improper clustering results. Here, it is necessary to use methods specific to binary data for computing dissimilarities. As one approach, a dissimilarity matrix can be calculated from the given binary data. Normally we consider all binary variables to have the same weight. With this setting, a 2-by-2 contingency table can be constructed to calculate the dissimilarity between object i and object j, where a is the number of attributes for which both objects take the value 1, c is the number for which object i takes 1 and object j takes 0, b is the number for which object i takes 0 and object j takes 1, and d is the number for which both take 0.

                        Object j
                     1           0           Total
  Object i   1       a           c           a + c
             0       b           d           b + d
             Total   a + b       c + d       a + b + c + d

Here, the two types of dissimilarity measures for binary attributes are the symmetric binary dissimilarity and the asymmetric binary dissimilarity:

    d_{sym}(i, j) = \frac{b + c}{a + b + c + d}, \qquad d_{asym}(i, j) = \frac{b + c}{a + b + c}

The asymmetric binary dissimilarity is used when the positive and negative outcomes of a binary attribute are not equally important, such as the positive and negative outcomes of a disease test; that is, the value 1 of an attribute has a different importance level from the value 0 of that attribute. For example, we may give more importance to the outcome HIV positive (1), which occurs rarely, and less importance to the outcome HIV negative (0), which is the usual finding. As the negated version of the dissimilarity, we can calculate the symmetric binary similarity and the asymmetric binary similarity (the latter is also known as the Jaccard coefficient) as follows:

    s_{sym}(i, j) = 1 - d_{sym}(i, j) = \frac{a + d}{a + b + c + d}, \qquad s_{asym}(i, j) = 1 - d_{asym}(i, j) = \frac{a}{a + b + c}

5. Ratio-scaled attributes

A ratio-scaled attribute takes a positive value on a nonlinear scale, such as an exponential scale. The following is the general form of a ratio-scaled value, where A is a positive numeric constant, B is a numeric constant, and t is the variable of interest:

    a_{if} = A e^{B t}

It is not good to treat ratio-scaled attributes like interval-scaled attributes, since the scale is likely to be distorted by the exponential growth. There are two common methods to compute the dissimilarity between objects described by ratio-scaled attributes.

1. The first method is to apply a logarithmic transformation to the value a_if of a ratio-scaled attribute for object i, using the formula z_if = log(a_if). The transformed value can then be treated as an interval-valued attribute. For some attributes, another transformation, such as a log-log transformation, may be more suitable.
2. The second method is to treat a_if as a continuous ordinal attribute and treat its ranks as interval-valued.
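
To make these per-attribute measures concrete, the following Python sketch gathers toy helper functions written for this text (they are not part of any standard library, and the function names and small example values are illustrative assumptions only):

import math

# --- Interval-scaled attributes ---------------------------------------------
def standardize(column):
    """Standardize one numeric attribute with the mean absolute deviation."""
    m = sum(column) / len(column)                      # mean m_f
    s = sum(abs(v - m) for v in column) / len(column)  # mean absolute deviation s_f
    return [(v - m) / s for v in column]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

# --- Categorical attributes: 0 if equal, 1 otherwise -------------------------
def categorical_distance(u, v):
    return 0.0 if u == v else 1.0

# --- Ordinal attributes: rank -> [0, 1], then treat as interval-scaled -------
def ordinal_to_scale(value, ordered_levels):
    r = ordered_levels.index(value) + 1          # rank r_if in {1, ..., M_f}
    M = len(ordered_levels)
    return (r - 1) / (M - 1)

# --- Binary attributes: symmetric / asymmetric dissimilarity -----------------
def binary_dissimilarity(bin_i, bin_j, asymmetric=False):
    a = sum(1 for x, y in zip(bin_i, bin_j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(bin_i, bin_j) if x == 0 and y == 1)
    c = sum(1 for x, y in zip(bin_i, bin_j) if x == 1 and y == 0)
    d = sum(1 for x, y in zip(bin_i, bin_j) if x == 0 and y == 0)
    if asymmetric:
        return (b + c) / (a + b + c)             # negative matches d are ignored
    return (b + c) / (a + b + c + d)

# Toy usage (hypothetical values)
heights = [160, 175, 182, 168]
print("standardized:", [round(v, 2) for v in standardize(heights)])
print("euclidean   :", euclidean([1.0, 2.0], [4.0, 6.0]))                # 5.0
print("cosine      :", round(cosine([1.0, 2.0], [2.0, 4.0]), 3))         # 1.0
print("categorical :", categorical_distance("Oracle", "Google"))         # 1.0
print("ordinal     :", ordinal_to_scale("middle", ["low", "middle", "high"]))
print("binary sym  :", binary_dissimilarity([1, 0, 1, 0], [1, 1, 0, 0]))
print("binary asym :", binary_dissimilarity([1, 0, 1, 0], [1, 1, 0, 0], asymmetric=True))

In practice, per-attribute distances such as these are then combined, for example summed or averaged over the attributes, to obtain the overall d(i, j) of Figure 4-2.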

4.1.2. Clustering Methods

While there are many existing clustering algorithms, we can classify the major clustering methods into partition-based methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. Their details are discussed below.


1. Partitioning methods

A partitioning method divides n objects (data tuples) into k partitions of the data, where each partition represents a cluster and k ≤ n. This method needs the number of partitions, k, to be specified beforehand. Usually a partitioning method assigns each object to exactly one cluster, but it is also possible to allow an object to belong to several clusters, as in fuzzy partitioning techniques. The steps of partitioning methods are as follows. First, the partitioning method assigns each object, randomly or heuristically, to a cluster as an initial partition; each cluster contains at least one object from the beginning. Second, the method iteratively relocates objects among clusters, attempting to improve the partitioning by moving objects from one group to another according to a predefined criterion, namely that objects in the same cluster should be close to each other, whereas they should be far apart or very different from objects in other clusters. At present, there are a few popular heuristic methods, such as (1) the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster, and (2) the k-medoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster. Partitioning methods work well for constructing a number of spherical-shaped clusters in small- to medium-sized databases. However, these methods need some modification to deal with clusters of complex shapes and with very large data sets.

2. Hierarchical methods

Unlike a partitioning method, a hierarchical method does not specify the number of clusters beforehand but attempts to create a hierarchical structure for the given set of data objects. The two types of hierarchical methods are agglomerative and divisive. The agglomerative method is the bottom-up approach, where the process starts with each object forming a separate group and then successively merges the objects or groups that are close to one another, until all of the groups are merged into one (the topmost level of the hierarchy) or until a termination condition holds. On the other hand, the divisive method is the top-down approach, where the procedure begins with all of the objects in the same cluster and then, in each successive iteration, a cluster is split into smaller clusters, until eventually each object is in its own cluster or until a termination condition holds. Although hierarchical methods are attractive for their small computational cost, achieved by avoiding a combinatorial number of different merging or splitting choices, they may suffer from erroneous decisions, since once a merging or splitting step is done, it can never be undone. To reduce this problem, two solutions are (1) to perform careful analysis of linkages among objects at each hierarchical partitioning, as in Chameleon, or (2) to integrate hierarchical agglomeration with other approaches by first using a hierarchical agglomerative algorithm to group objects into microclusters and then performing macroclustering on the microclusters using another clustering method such as iterative relocation, as in BIRCH.

3. Density-based methods

Most methods that use the distance between objects for clustering tend to find clusters of spherical shape. In general, however, clusters can have arbitrary shape. To address this, it is possible to apply the notion of density to grow clusters into any shape. The general idea is to start from a single point in a cluster and then to grow the given cluster as long as the density (number of data points or objects) in its "neighborhood" exceeds a threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. A density-based method tries to filter out
noise (outliers) and to discover clusters of arbitrary shape. Examples of the density-based approach are DBSCAN, its extension OPTICS, and DENCLUE.

4. Grid-based methods

While the point-to-point (object-pair similarity) calculation in most clustering methods is slow, a grid-based method first divides the object space into a finite number of cells forming a grid structure. Clustering operations are then applied on this grid structure. Grid-based methods are superior in their fast processing time: rather than on the number of data objects, the time complexity depends only on the number of cells in each dimension of the quantized space. A typical grid-based method is STING. It is also possible to combine grid-based and density-based approaches, as done in WaveCluster.

5. Model-based methods

Instead of using a simple similarity definition, a model-based method predefines a suitable model for each of the clusters and then finds the best fit of the data to the given model. The model may form clusters by constructing a density function that reflects the spatial distribution of the data points. With statistical criteria, we can automatically detect the number of clusters and obtain more robustness. Some well-known model-based methods are EM, COBWEB, and SOM.

It is hard to say which type of clustering fits a given task best. The solution depends both on the type of data available and on the particular purpose of the application. It is possible to explore the methods one by one, inspect their resulting clusters, and compare them to find the most practical one. Some clustering methods combine the ideas of several clustering approaches and become mixed-type clustering. Moreover, some clustering tasks, such as text clustering or DNA microarray clustering, may involve high-dimensional data, which makes clustering difficult because the data become sparse. Clustering high-dimensional data is challenging due to the curse of dimensionality: many dimensions may not be relevant, and as the number of dimensions increases, the data become increasingly sparse, so that the distance measurement between pairs of points becomes meaningless and the average density of points anywhere in the data is likely to be low. For this task, two influential subspace clustering methods are CLIQUE and PROCLUS. Rather than searching over the entire data space, they search for clusters in subspaces (subsets of dimensions) of the data. Frequent pattern-based clustering, another clustering methodology, extracts distinct frequent patterns among subsets of dimensions that occur frequently and uses such patterns to group objects and generate meaningful clusters. pCluster is an example of frequent pattern-based clustering that groups objects based on their pattern similarity. Beyond simple clustering, constraint-based clustering performs clustering under user-specified or application-oriented constraints. Users may have some preferences on clustering data and specify them as constraints on the clustering process. A constraint is a user's expectation or describes "properties" of the desired clustering results. For example, objects in a space may be clustered in the presence of obstacles, or they may be clustered when some objects are known to be (or not to be) in the same cluster.

4.1.3. Partition-based Methods

A partitioning method divides n objects (data tuples) into k partitions of the data, where each partition represents a cluster and k ≤ n. This method needs the number of partitions, k, to be specified beforehand. As the most classic clustering method, the k-means method receives in advance the number of clusters k that we want to construct. With this parameter, k points are chosen at random as cluster centers. Next, all instances are assigned to their closest cluster center according to an ordinary distance metric, such as the Euclidean distance. Then the centroid, or mean, of the instances in each cluster is calculated, and these centroids are taken to be the new center values for their respective clusters. The whole process is repeated with the new cluster centers, and iteration continues until the same points are assigned to each cluster in consecutive rounds, at which stage the cluster centers have stabilized and will remain the same. In summary, the four steps of the k-means algorithm are as follows.

1. Partition the objects into k non-empty subsets.
2. Compute seed points as the centroids of the clusters of the current partition; the centroid is the center (mean point) of the cluster.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2, and stop when there are no more new assignments, that is, when no member changes its group.

The formal description of k-means is as follows. Given a dataset T = {o_1, o_2, ..., o_n} of objects without any class label, each object o_i is represented by a p-dimensional attribute vector o_i = (a_i1, ..., a_ip) depicting the measured values of p attributes. The algorithm partitions T into k clusters so as to minimize the total distance from the objects to the centroids of their clusters. Algorithm 4.1 shows a pseudocode of the k-means method.

Algorithm 4.1. k-means Algorithm

Input:    T, a dataset of n objects, and k, the number of clusters.
Output:   A set of clusters C = {C_1, ..., C_k}, where each element is a cluster with its members.

Procedure:
(1) FOREACH o ∈ T { cluster(o) ← a randomly chosen cluster in C; }      // randomly assign o to a cluster
(2) WHILE some members change their groups {
(3)     FOREACH C_j ∈ C { m_j ← mean of the objects currently in C_j; } // calculate the centroid of each cluster
(4)     FOREACH o ∈ T {
(5)         FOREACH C_j ∈ C { d_j ← dist(o, m_j); }                     // distance of o to each cluster centroid
(6)         j* ← argmin_j d_j;                                          // select the best cluster for o
(7)         cluster(o) ← C_{j*};                                        // assign o to the best cluster
(8)     }
    }
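
To complement the pseudocode of Algorithm 4.1, the following is a minimal Python sketch of k-means written for this text; the random seeding, the use of the Euclidean distance, and the toy data are assumptions, and a production implementation would add a better initialization (e.g., k-means++) and more careful handling of empty clusters.

import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(points, k, max_rounds=100, seed=0):
    """Cluster 'points' (equal-length numeric tuples) into k groups."""
    rng = random.Random(seed)
    # (1) random initial assignment of each object to one of the k clusters
    assign = [rng.randrange(k) for _ in points]
    for _ in range(max_rounds):
        # (2) compute the centroid (mean point) of each cluster
        centroids = []
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if not members:                       # keep empty clusters usable
                members = [rng.choice(points)]
            centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
        # (3) reassign every object to the cluster with the nearest centroid
        new_assign = [min(range(k), key=lambda j: euclidean(p, centroids[j]))
                      for p in points]
        # (4) stop when no member changes its group
        if new_assign == assign:
            break
        assign = new_assign
    return assign, centroids

# Toy usage with two obvious groups
data = [(1, 8), (2, 7), (2, 10), (3, 9), (8, 2), (9, 3), (10, 4), (8, 3)]
labels, centers = kmeans(data, k=2)
print(labels)
print([tuple(round(c, 2) for c in ctr) for ctr in centers])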




Figure 4-3: A graphical example of k-means clustering (cluster assignments and centroids over four rounds)

For more clarity, Figure 4-3 and Figure 4-4 show a graphical example of the k-means algorithm and its calculation at each step, respectively. Compared with other methods, the k-means clustering method is simple and effective. Its principle is to swap members among clusters until the total distance from each cluster's points to its center becomes minimal and no more swapping is needed. However, it is possible to have situations in which k-means fails to find a good clustering, as in Figure 4-5. This figure shows a local optimum due to improper initial clusters. The four objects are arranged at the vertices of a rectangle in two-dimensional space. In the figure, the two initial clusters are A and B, where P1 and P3 are in cluster A, and P2 and P4 are in cluster B. Graphically, the two initial cluster centers fall at the middle points of the long sides. This clustering result seems stable. However, the two natural clusters should be formed by grouping together the two vertices at either end of each short side, that is, P1 and P2 in cluster A, and P3 and P4 in cluster B. In the k-means method, the final clusters are quite sensitive to the initial cluster centers, and completely different clustering results may be obtained even when slight changes are made in the initial random cluster assignment. To increase the chance of finding a global minimum, one can execute the algorithm several times with different initial choices and choose the best final result, i.e., the one with the smallest total distance.


[The tables of Figure 4-4 list, for each of the four rounds, the twenty points with their (x, y) coordinates, their distances to the current centroids of clusters A and B, the resulting cluster assignments, and the updated centroid coordinates of A and B after the round.]

Figure 4-4: Numerical calculation of the k-means clustering in Figure 4-3


Figure 4-5: A local optimum due to improper initial clusters

There are many variants of the basic k-means method. Some of them first produce a hierarchical clustering result (shown in the next section), cut it at k groups, and then perform k-means clustering on that result. However, all of these methods still leave open the question of how large k should be, since it is hard to estimate the likely number of clusters in advance. One solution is to try different values of k and choose the best clustering result, i.e., the one with the largest intra-cluster similarity and the smallest inter-cluster similarity. Another solution for finding k is to begin with a few clusters and determine whether it is worth splitting them. For example, we can choose k = 2, perform k-means clustering until it terminates, and then consider splitting each of the resulting clusters.
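
As a rough sketch of these two ideas, multiple restarts and trying several values of k, the code below reuses the kmeans and euclidean functions and the toy data from the sketch in Section 4.1.3 (so it assumes those definitions are in scope) and scores each result by the total within-cluster distance; this scoring choice is an illustrative assumption rather than a prescribed rule.

def total_within_distance(points, labels, centroids):
    """Sum of distances from each point to the centroid of its cluster."""
    return sum(euclidean(p, centroids[j]) for p, j in zip(points, labels))

def best_kmeans(points, k, restarts=10):
    """Run k-means several times with different random seeds, keep the best."""
    best = None
    for seed in range(restarts):
        labels, centroids = kmeans(points, k, seed=seed)
        score = total_within_distance(points, labels, centroids)
        if best is None or score < best[0]:
            best = (score, labels, centroids)
    return best

# Inspect how the best total distance shrinks as k grows; a sharp bend
# ("elbow") in these numbers is one informal way to pick k.
for k in range(1, 5):
    score, _, _ = best_kmeans(data, k)
    print(k, round(score, 2))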

4.1.4. Hierarchical-based clustering

As one of the common clustering methods, hierarchical clustering groups data objects into a tree of clusters, without a predefined number of clusters. Its basic operation is to merge similar objects or object groups, or to split dissimilar objects or object groups into different groups. The most serious drawback of a pure hierarchical clustering method is its inability to reassign objects once a merge or split decision has been executed: if a particular merge or split decision later turns out to be wrong, the method cannot correct it. To alleviate this, it is possible to incorporate an iterative relocation mechanism into the original version. In general, there are two types of hierarchical clustering methods, agglomerative and divisive, depending on whether the hierarchical structure (tree) is formed in a bottom-up (merging) or top-down (splitting) style.

As for the bottom-up fashion, the agglomerative hierarchical clustering approach starts with each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are included in a single cluster or until certain termination conditions are satisfied. Most hierarchical clustering methods belong to this type, although they may use different definitions of intercluster similarity. On the other hand, as for the top-down fashion, the divisive hierarchical clustering approach performs the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until certain termination conditions are satisfied, such as a desired number of clusters being obtained or the diameter of each cluster being within a certain threshold.


Agglomerative versus divisive hierarchical clustering

While the two directions of hierarchical clustering are top-down and bottom-up, both of them can be represented in the form of a tree structure. In general, such a tree structure is called a dendrogram. It is commonly used to represent the process of hierarchical clustering and shows how objects are grouped together step by step. Figure 4-6 shows a dendrogram for seven objects in (a) agglomerative clustering and (b) divisive clustering. Here, Step 0 is the initial stage, while Step 6 is the final stage when a single cluster is constructed. The agglomerative hierarchical clustering method places each object into a cluster of its own and then tries to merge clusters step by step according to some criterion. For example, in Figure 4-6 (a) at Step 4, the agglomerative method merges {a,b,c} with {d} and forms {a,b,c,d} in the bottom-up manner. This approach is also known as AGNES (AGglomerative NESting). In Figure 4-6 (b) at Step 3, the divisive hierarchical clustering method divides {e,f,g} into {e,f} and {g}. This approach is also called DIANA (DIvisive ANAlysis). In either agglomerative or divisive hierarchical clustering, the user can specify the desired number of clusters as a termination condition; that is, it is possible to terminate at any step to obtain a clustering result. If the user requests three clusters, the agglomerative method will terminate at Step 4, while the divisive method will terminate at Step 2.

Figure 4-6: Dendrogram: Agglomerative (bottom-up) vs. Divisive (top-down) Clustering

Distance measurement among clusters

As stated in Section 4.1.1, there are several possible definitions of the distance between two single objects. However, in hierarchical clustering, an additional requirement is to define the distance between two clusters, each of which may include more than one object. Four widely used measures are single linkage, complete linkage, centroid comparison, and element comparison. Figure 4-7 shows a graphical representation of these four methods. The formulation of each measure can be defined as follows, where |p - q| is the distance between two objects p and q.

1. Single linkage (minimum distance):

    d_{\min}(C_i, C_j) = \min_{p \in C_i,\, q \in C_j} |p - q|

2. Complete linkage (maximum distance):

    d_{\max}(C_i, C_j) = \max_{p \in C_i,\, q \in C_j} |p - q|

3. Centroid comparison (mean distance):

    d_{\mathrm{mean}}(C_i, C_j) = |m_i - m_j|

where m_i is the centroid of C_i and m_j is the centroid of C_j.

4. Element comparison (average distance):

    d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{q \in C_j} |p - q|

where n_i and n_j are the numbers of objects in C_i and C_j, respectively.

Figure 4-7: Graphical representation of four definitions of cluster distances

Firstly, for the single linkage, if we use the minimum distance, d_min(C_i, C_j), to measure the distance between clusters, the method is sometimes called a nearest-neighbor clustering algorithm. In this method, the clustering process is terminated when the distance between the nearest clusters exceeds an arbitrary threshold. It is possible to view the data points as nodes of a graph, where edges form a path between the nodes in a cluster. When two clusters, C_i and C_j, are merged, an edge is added between the nearest pair of nodes in C_i and C_j. This merging process results in a tree-like graph. An agglomerative hierarchical clustering algorithm that uses the minimum distance measure is also known as a minimal spanning tree algorithm.


Secondly, for the complete linkage, an algorithm uses the maximum distance, d_max(C_i, C_j), to measure the distance between clusters. Also called a farthest-neighbor clustering algorithm, the clustering process is terminated when the maximum distance between the nearest clusters exceeds a predefined threshold. In this method, each cluster can be viewed as a complete subgraph in which edges connect all of the nodes in the cluster, and the distance between two clusters is determined by the most distant nodes in the two clusters. Farthest-neighbor algorithms aim to minimize the increase in the diameter of the clusters at each iteration. This works well when the true clusters are compact and approximately equal in size; otherwise, the clusters produced can be meaningless.

The nearest-neighbor and farthest-neighbor criteria represent two extreme ways of defining the distance between clusters, and both are quite sensitive to outliers or noisy data. Rather than these two methods, it is sometimes better to use the mean or average distance instead, in order to compromise between the minimum and maximum distances and to reduce the effect of outliers and noise. The mean distance comes from calculating the centroid of each cluster and then measuring the distance between a pair of centroids. Also known as centroid comparison, this method is computationally simple and cheap. Every time two clusters are merged (for agglomerative clustering) or a cluster is split (for divisive clustering), a new centroid is calculated for the newly merged cluster, or two centroids are calculated for the newly split clusters. Agglomerative clustering is slightly simpler here, since a so-called weighted combination technique can be used to calculate the centroid of the newly merged cluster from the centroids of its parts, which cannot be applied in the divisive approach. For both methods, it is necessary to calculate the distance between the new centroid(s) and the other centroids.

Compared to the centroid comparison, the element comparison calculates the distance between two clusters as the average distance over all pairs of elements in the two clusters. This is much more computationally expensive than the centroid comparison. However, the average distance is advantageous in that it can handle categorical as well as numeric data: the computation of a mean vector for categorical data can be difficult or impossible to define, while the average pairwise distance remains possible.
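
The following Python sketch, written only for illustration, implements a naive agglomerative procedure with a pluggable cluster-distance function, so that single linkage, complete linkage, and the average (element-comparison) distance can be compared directly; a centroid-comparison variant could be added analogously. The function names and toy points are assumptions.

import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_link(ci, cj):
    return min(euclidean(p, q) for p in ci for q in cj)

def complete_link(ci, cj):
    return max(euclidean(p, q) for p in ci for q in cj)

def average_link(ci, cj):
    return sum(euclidean(p, q) for p in ci for q in cj) / (len(ci) * len(cj))

def agglomerative(points, num_clusters, cluster_distance=single_link):
    """Bottom-up clustering: start with singletons, repeatedly merge the
    closest pair of clusters until num_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        best_pair, best_d = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best_d is None or d < best_d:
                    best_pair, best_d = (i, j), d
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]   # merge the two closest clusters
        del clusters[j]
    return clusters

# Toy usage: compare the linkages on the same points
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (5, 5)]
for name, link in [("single", single_link), ("complete", complete_link),
                   ("average", average_link)]:
    print(name, agglomerative(pts, 2, cluster_distance=link))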

Problems in the hierarchical approach

While hierarchical clustering is simple, it has a drawback in how to select points to merge or split. Each merging or splitting step is important, since the next step processes the newly generated clusters and the method never reconsiders the results of previous steps or swaps objects between clusters. If earlier merge or split decisions are not well determined, low-quality clusters may be generated. To alleviate this problem, it is possible to perform multiple-phase clustering by incorporating other clustering techniques into hierarchical clustering. Three common methods are BIRCH, ROCK, and Chameleon. BIRCH partitions objects hierarchically using a tree structure whose leaf nodes (or low-level nonleaf nodes) are treated as microclusters, depending on the scale of resolution; it then applies other clustering algorithms to perform macroclustering on the microclusters. ROCK merges clusters based on their interconnectivity after hierarchical clustering. Chameleon explores dynamic modeling in hierarchical clustering.

4.1.5. Density-based clustering

Unlike partition-based or hierarchical clustering, which tends to discover clusters with a spherical shape, density-based clustering methods have been designed to discover clusters of arbitrary shape. In this approach, dense regions of objects in the data space are separated by regions of low density. Among density-based methods, DBSCAN grows clusters according to a density-based connectivity analysis, OPTICS extends DBSCAN to produce a cluster ordering obtained from a wide range of parameter settings, and DENCLUE clusters objects based on a set of density distribution functions. In this book, we describe DBSCAN.

DBSCAN: Density-Based Spatial Clustering of Applications with Noise

DBSCAN is a density-based clustering algorithm that gathers regions containing objects with sufficiently high density into clusters. By this nature, it discovers clusters of arbitrary shape in spatial databases with noise. In this method, a cluster is defined as a maximal set of density-connected points. The following are the definitions of the concepts used in the method, where ε is a user-specified radius and MinObjs is a minimum number of objects.

- ε-neighborhood: The set of objects within a radius ε of a given object is called the ε-neighborhood of the object.
- Core object: When an object has an ε-neighborhood containing at least a minimum number of objects, MinObjs, it is called a core object.
- Directly density-reachable: Given a set of objects D, an object p is directly density-reachable from an object q if p is within the ε-neighborhood of q and q is a core object.
- Density-reachable: An object p is density-reachable from an object q with respect to ε and MinObjs in a set of objects D if there is a chain of objects p_1, ..., p_m, where p_1 = q and p_m = p, such that p_{i+1} is directly density-reachable from p_i with respect to ε and MinObjs, for 1 ≤ i < m.
- Density-connected: An object p is density-connected to an object q with respect to ε and MinObjs in a set of objects D if there is an object o in D such that both p and q are density-reachable from o with respect to ε and MinObjs.
- Density-based cluster: A density-based cluster is a set of density-connected objects that is maximal with respect to density-reachability. Every object not contained in any cluster is considered to be noise.

Note that density reachability is the transitive closure of direct density reachability, and this relationship is asymmetric. Only core objects are mutually density reachable. Density connectivity, however, is a symmetric relation.

Figure 4-8: An example of density-based clustering


For example, given twenty objects (a-t) as shown in Figure 4-8, the neighbor elements of each object are shown in the left list in the figure; they are derived based on the radius ε. Moreover, the core objects (marked with '*') are a, b, d, f, g, i, j, k, m, o, p, q and s, since each of them has at least three neighbors (MinObjs = 3). The directly density-reachable objects of each core object q are as follows.

    a*: b, c, d
    b*: a, d, f
    d*: a, b, c, f, h
    f*: b, d, h
    g*: e, i, j, k
    i*: g, j, k
    j*: e, g, i, k, m
    k*: i, j, m, o, p
    m*: k, o, p
    o*: k, m, p, q, s
    p*: k, m, o, q
    q*: o, p, s, t
    s*: o, q, t

The density-reachable objects of each core object q are as follows.

    a*: b, c, d, f, h
    b*: a, c, d, f, h
    d*: a, b, c, f, h
    f*: a, b, c, d, h
    g*: e, i, j, k, m, o, p, q, s, t
    i*: e, g, j, k, m, o, p, q, s, t
    j*: e, g, i, k, m, o, p, q, s, t
    k*: e, g, i, j, m, o, p, q, s, t
    m*: e, g, i, j, k, o, p, q, s, t
    o*: e, g, i, j, k, m, p, q, s, t
    p*: e, g, i, j, k, m, o, q, s, t
    q*: e, g, i, j, k, m, o, p, s, t
    s*: e, g, i, j, k, m, o, p, q, t

The clustering result is then defined by the density-connected property. In this example, the result is the two clusters below, and the objects l, n and r are treated as noise.

    Cluster 1: a, b, c, d, f, h
    Cluster 2: g, e, i, j, k, m, o, p, q, s, t

When we apply a form of spatial indexing, the computational complexity of DBSCAN is O(n log n), where n is the number of database objects. Without any index, it is O(n^2). With appropriate settings of the user-defined parameters ε and MinObjs, the algorithm is effective at finding arbitrary-shaped clusters.
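
The following is a minimal Python sketch of the DBSCAN idea described above; it is a didactic O(n^2) implementation written for this text rather than an optimized or library version, and the parameter values eps and min_objs in the example are arbitrary assumptions.

import math

def dbscan(points, eps, min_objs):
    """Label each point with a cluster id (0, 1, ...) or None for noise."""
    def neighbors(i):
        # eps-neighborhood of point i (includes the point itself)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    visited = [False] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if visited[i]:
            continue
        visited[i] = True
        seeds = neighbors(i)
        if len(seeds) < min_objs:          # not a core object; may stay noise
            continue
        labels[i] = cluster_id             # start a new cluster from core object i
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] is None:
                labels[j] = cluster_id     # density-reachable point joins the cluster
            if not visited[j]:
                visited[j] = True
                j_neighbors = neighbors(j)
                if len(j_neighbors) >= min_objs:   # j is also a core object
                    queue.extend(j_neighbors)      # keep expanding the cluster
        cluster_id += 1
    return labels

# Toy usage: two dense groups plus one isolated noise point
pts = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9), (5, 15)]
print(dbscan(pts, eps=1.5, min_objs=3))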

4.1.6. Grid-based clustering

In contrast with partition-based, hierarchical, and density-based methods, the grid-based clustering approach uses a multiresolution grid data structure to quantize the object space into a finite number of cells that form a grid structure, on which all of the clustering operations are performed. This approach aims to improve the processing time. Its time complexity is typically independent of the number of data objects; instead, it depends on the number of cells in each dimension of the quantized space. Some typical grid-based methods are STING, WaveCluster, and CLIQUE. Here, STING explores statistical information stored in the grid cells, WaveCluster clusters objects using a wavelet transformation method, and CLIQUE represents a grid- and density-based approach for clustering in high-dimensional data space.

STING: STatistical INformation Grid

STING is a grid-based multiresolution clustering technique in which the spatial area is divided into rectangular cells. There are usually several levels of such rectangular cells corresponding to different levels of resolution, and these cells form a hierarchical structure: each cell at a high level is partitioned to form a number of cells at the next lower level. Statistical information regarding the attributes in each grid cell (such as the mean, maximum, and minimum values) is precomputed and stored. These statistical parameters are useful for query processing, as described below.

Figure 4-9: Grid structure in grid-based clustering

Figure 4-9 shows a hierarchical structure for STING clustering. Statistical parameters and characteristics of higher-level cells can easily be computed from those of the lower-level cells. These parameters can be attribute-independent parameters, such as count; attribute-dependent parameters, such as mean, stdev (standard deviation), min (minimum), and max (maximum); and the distribution type of the attribute values in the cell, such as normal, uniform, exponential, or none (if the distribution is unknown). When the data are loaded into the database, the parameters (e.g., count, mean, stdev, min, and max) of the bottom-level cells can be calculated directly from the data.

Since STING uses a multiresolution approach to cluster analysis, the quality of STING clustering depends on the granularity of the lowest level of the grid structure. When the granularity is too fine, the cost of processing increases substantially; on the other hand, when the bottom level of the grid structure is too coarse, it may reduce the quality of the cluster analysis. Because STING does not take into account the spatial relationship between the children and their neighboring cells when constructing a parent cell, the shapes of the resulting clusters are isothetic; that is, all of the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected. Owing to this characteristic, the quality and accuracy of the clusters may be lower, as a trade-off for the fast processing time.
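
As a rough illustration of the kind of bottom-level statistics a STING-like grid stores, the following Python sketch (an illustrative toy, not the actual STING data structure) quantizes 2-D points into cells, records count, mean, min, and max per cell, and aggregates a parent cell's statistics from its children; the cell size and example points are assumptions.

def cell_stats(values):
    """count, mean, min, max of one attribute inside a cell."""
    if not values:
        return {"count": 0, "mean": None, "min": None, "max": None}
    return {"count": len(values), "mean": sum(values) / len(values),
            "min": min(values), "max": max(values)}

def build_bottom_level(points, cell_size):
    """Map each 2-D point to a grid cell and summarize the x attribute per cell."""
    cells = {}
    for x, y in points:
        key = (int(x // cell_size), int(y // cell_size))
        cells.setdefault(key, []).append(x)
    return {key: cell_stats(vals) for key, vals in cells.items()}

def parent_stats(children):
    """Aggregate the statistics of child cells into their parent cell."""
    nonempty = [c for c in children if c["count"] > 0]
    if not nonempty:
        return cell_stats([])
    total = sum(c["count"] for c in nonempty)
    mean = sum(c["mean"] * c["count"] for c in nonempty) / total
    return {"count": total, "mean": mean,
            "min": min(c["min"] for c in nonempty),
            "max": max(c["max"] for c in nonempty)}

# Toy usage
pts = [(0.5, 0.5), (1.2, 0.8), (3.4, 3.9), (3.8, 3.1), (0.2, 3.7)]
bottom = build_bottom_level(pts, cell_size=2.0)
print(bottom)
# A parent cell covering the four bottom-level cells of the lower-left area:
children = [bottom.get((i, j), cell_stats([])) for i in (0, 1) for j in (0, 1)]
print(parent_stats(children))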

4.1.7. Model-based clustering

Instead of using a simple similarity definition, a model-based method predefines a suitable model for each of the clusters and then finds the best fit of the data to the given model. Some well-known model-based methods are EM, COBWEB, and SOM.


EM Algorithm: Expectation-Maximization method

The EM (Expectation-Maximization) algorithm (Algorithm 4.2) is a popular iterative refinement algorithm used in several applications, such as speech recognition and image processing. It was developed to estimate suitable values of unknown parameters by maximizing an expectation.

Algorithm 4.2. The EM Algorithm

1. Initialization step: To obtain the seed for the probability calculation, we start by making an initial guess of the parameter vector. There are several possible choices for this; as one simple approach, the method randomly partitions the objects into k groups and then, for each group (cluster) C_k, calculates its mean m_k (its center), similar to k-means partitioning.

2. Repetition step: To improve the initial cluster parameters, we iteratively refine the parameters (or clusters) using the two steps below, the expectation step and the maximization step.

(a) Expectation step

The probability that an object o_i is in a cluster C_k is defined as follows:

    P(C_k \mid o_i) = \frac{P(C_k)\, p(o_i \mid C_k)}{p(o_i)}

Here, p(o_i | C_k) is the probability that the object o_i occurs in the cluster C_k. We can define it so that o_i follows the normal (i.e., Gaussian) distribution with the mean m_k and the standard deviation σ_k of the cluster C_k:

    p(o_i \mid C_k) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\!\left( -\frac{(o_i - m_k)^2}{2\sigma_k^2} \right)

P(C_k) is the prior probability that the cluster C_k occurs; without bias, we can set all clusters to have the same prior probability. The denominator p(o_i) does not depend on any cluster, so it can be ignored when clusters are compared. Finally, it is therefore possible to use only p(o_i | C_k), weighted by the prior if the priors differ, as the score that the object o_i belongs to the cluster C_k.

(b) Maximization step

Using the above probability estimates, we can re-estimate or refine the model parameters as follows:

    m_k = \frac{\sum_{i=1}^{n} P(C_k \mid o_i)\, o_i}{\sum_{i=1}^{n} P(C_k \mid o_i)}, \qquad
    \sigma_k^2 = \frac{\sum_{i=1}^{n} P(C_k \mid o_i)\,(o_i - m_k)^2}{\sum_{i=1}^{n} P(C_k \mid o_i)}, \qquad
    P(C_k) = \frac{1}{n} \sum_{i=1}^{n} P(C_k \mid o_i)

As its name suggests, this step is the maximization of the likelihood of the distributions given the data.
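
The following is a minimal Python sketch of Algorithm 4.2 for one-dimensional data and k Gaussian clusters, written for this text; the initialization scheme and the toy data are assumptions, and a robust implementation would also monitor the log-likelihood for convergence.

import math

def gaussian(x, mean, std):
    return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

def em_1d(data, k, rounds=50):
    """EM for a 1-D mixture of k Gaussians: returns means, stds and priors."""
    # Initialization: spread the initial means over the sorted data,
    # unit standard deviations, equal priors (a simple parameter guess).
    data = sorted(data)
    means = [data[(i * len(data)) // k] for i in range(k)]
    stds = [1.0] * k
    priors = [1.0 / k] * k
    for _ in range(rounds):
        # Expectation step: responsibility P(C_j | x_i) for every object and cluster
        resp = []
        for x in data:
            scores = [priors[j] * gaussian(x, means[j], stds[j]) for j in range(k)]
            total = sum(scores)
            resp.append([s / total for s in scores])
        # Maximization step: re-estimate means, standard deviations and priors
        for j in range(k):
            w = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / w
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / w
            stds[j] = max(math.sqrt(var), 1e-6)   # avoid collapsing to zero width
            priors[j] = w / len(data)
    return means, stds, priors

# Toy usage: samples roughly around 0 and around 10
data = [-0.5, 0.1, 0.4, -0.2, 0.3, 9.6, 10.2, 9.9, 10.4, 10.1]
means, stds, priors = em_1d(data, k=2)
print([round(m, 2) for m in means], [round(s, 2) for s in stds],
      [round(p, 2) for p in priors])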


Although there have been several variants of EM methods, most of them can be viewed as an extension of the k-means algorithm, which assigns an object to the cluster that is the closest to it, based on the cluster mean or the cluster representative. Instead of assigning each object to a single dedicated cluster, the EM method assigns each object to a cluster according to a weight representing the probability of membership; that is, no rigid boundaries between clusters are defined. Afterwards, new means are computed based on the weighted measures. Since we do not know beforehand which objects should be grouped together, the EM method begins with an initial estimate (or guess) for each parameter in the mixture model (collectively referred to as the parameter vector). The parameter vector can be set by randomly grouping the objects into k clusters, or by just selecting k objects from the set of objects to be the means of the clusters. After this initial setting, the EM algorithm iteratively rescores the objects against the mixture density produced by the parameter vector, and the rescored objects are then used to update the parameter estimates. During the calculation, each object is assigned a probability of how likely it is to belong to a given cluster. While the EM algorithm is simple, easy to implement, and converges fast, it sometimes falls into a local optimum. Convergence is guaranteed for certain forms of optimization functions, with a computational complexity of O(d × n × t), where d is the number of input features, n is the number of objects, and t is the number of iterations.

Sometimes known as Bayesian clustering, a related family of methods focuses on the computation of class-conditional probability densities. They are commonly used in the statistics community. In industry, AutoClass is a popular Bayesian clustering method that uses a variant of the EM algorithm. The best clustering maximizes the ability to predict the attributes of an object given the correct cluster of the object. AutoClass can also estimate the number of clusters. It has been applied to several domains and was able to discover a new class of stars based on infrared astronomy data.

Conceptual Clustering
Unlike conventional clustering, which does not focus on a detailed description of each cluster, conceptual clustering forms a concept tree by also producing characteristic descriptions for each group, where each group corresponds to a node (concept or class) in the tree. In other words, conceptual clustering has two steps: clustering and characterization. Clustering quality is not solely a function of the individual objects but also of the generality and simplicity of the derived concept descriptions. Most conceptual clustering methods use probabilistic measurements to determine the concepts or clusters. As an example of this type, COBWEB is a popular and simple method of incremental conceptual clustering. Its input objects are expressed by categorical attribute-value pairs. COBWEB creates a hierarchical clustering in the form of a classification tree. Figure 4-10 shows an example of a classification tree for a set of animal data. A classification tree differs from a decision tree in the sense that an intermediate node in a classification tree specifies a concept, whereas an intermediate node in a decision tree indicates an attribute test. In a classification tree, each node refers to a concept and contains a probabilistic description of that concept, which summarizes the objects classified under the node. In summary, COBWEB works as follows. Given a set of objects, each object is represented by an n-dimensional attribute vector depicting the values of its n attributes. A Bayesian (statistical) classifier assigns (or predicts) for an object the class that has the highest posterior probability over the others, conditioned on the object's attribute values; that is, the classifier predicts that an object with attribute vector x belongs to the class C_i that maximizes P(C_i \mid x).


Figure 4-10: A classification tree. This figure is based on (Fisher, 1987).

The probabilistic description includes the probability of the concept, P(C_k), and conditional probabilities of the form P(A_i = v_{ij} \mid C_k), where A_i = v_{ij} is an attribute-value pair (that is, the i-th attribute takes its j-th possible value) and C_k is the concept class. Normally, the counts are accumulated and stored at each node for probability calculation. The sibling nodes at a given level of a classification tree form a partition. To classify an object using a classification tree, a partial matching function is employed to descend the tree along a path of best-matching nodes. COBWEB uses a heuristic evaluation measure called category utility to guide construction of the tree. Category utility (CU) is defined as follows:

    CU = \frac{1}{n} \sum_{k=1}^{n} P(C_k) \left[ \sum_{i}\sum_{j} P(A_i = v_{ij} \mid C_k)^2 - \sum_{i}\sum_{j} P(A_i = v_{ij})^2 \right]

where n is the number of nodes (also called concepts or categories) forming a partition, {C_1, C_2, ..., C_n}, at the given level of the tree. In other words, category utility measures the increase in the expected number of attribute values that can be correctly guessed given the partition (the term P(C_k) \sum_i \sum_j P(A_i = v_{ij} \mid C_k)^2) over the expected number of correct guesses with no such knowledge (the term \sum_i \sum_j P(A_i = v_{ij})^2).
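As an illustration of this formula, the following minimal Python sketch (not taken from the textbook; the helper name category_utility and the dictionary-based object representation are assumptions for the example) computes category utility for a proposed partition of categorical objects.

```python
from collections import Counter

def category_utility(partition):
    """Category utility of a partition, given as a list of clusters,
    where each cluster is a list of objects (dicts of attribute -> value)."""
    objects = [obj for cluster in partition for obj in cluster]
    n_objects = len(objects)
    n_clusters = len(partition)

    # Unconditional term: sum_i sum_j P(A_i = v_ij)^2 over all objects.
    overall = Counter((a, v) for obj in objects for a, v in obj.items())
    base = sum((c / n_objects) ** 2 for c in overall.values())

    cu = 0.0
    for cluster in partition:
        p_ck = len(cluster) / n_objects
        # Conditional term: sum_i sum_j P(A_i = v_ij | C_k)^2 within the cluster.
        within = Counter((a, v) for obj in cluster for a, v in obj.items())
        cond = sum((c / len(cluster)) ** 2 for c in within.values())
        cu += p_ck * (cond - base)
    return cu / n_clusters

animals = [{"cover": "hair", "legs": "4"}, {"cover": "hair", "legs": "4"},
           {"cover": "feathers", "legs": "2"}, {"cover": "scales", "legs": "0"}]
print(category_utility([animals[:2], animals[2:]]))   # mammals vs. the rest
```

A partition that groups objects with similar attribute values yields a higher category utility, which is exactly how COBWEB decides where to place an incoming object.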

As an incremental approach, when a new object comes to COBWEB, the algorithm descends the tree along an appropriate path, updates counts along the way, and searches for the best node in which to place the object. This decision is made by selecting the option that yields the highest category utility for the resulting partition. In addition, COBWEB computes the category utility of the partition that would result if a new node were created for the object. Therefore, the object is placed in an existing class, or a new class is created for it, based on the partition with the highest category utility value. COBWEB can thus automatically adjust the number of classes in a partition; that is, there is no need to provide the number of clusters as in k-means or hierarchical clustering. However, the COBWEB operators are highly sensitive to the input order of the objects. To address this problem, COBWEB has two additional operators, called merging and splitting. When an object is incorporated, the two best hosts are considered for merging into a single class. Moreover, COBWEB considers splitting the children of the best host among the existing categories, based on category utility. The merging and splitting operators implement a bidirectional search: a merge can undo a previous split, and a split can undo a previous merge.


However, COBWEB still has a number of limitations. Firstly, it assumes that the probability distributions of separate attributes are statistically independent of one another, which is not always true. Secondly, it is expensive to store the probability distribution representation of clusters, especially when the attributes have a large number of values; the time and space complexity depends not only on the number of attributes, but also on the number of values for each attribute. Moreover, the classification tree is not height-balanced for skewed input data, which may cause the time and space complexity to degrade dramatically. As an extension to COBWEB, CLASSIT deals with continuous (real-valued) data. It stores a continuous normal distribution (i.e., a mean and standard deviation) for each attribute in each node, and applies a generalized category utility measure that uses an integral over continuous attributes instead of a sum over discrete attributes as in COBWEB. While conceptual clustering is popular in the machine learning community, both COBWEB and CLASSIT scale poorly when clustering large databases.

4.2. Association Analysis and Frequent Pattern Mining
Another form of knowledge that we can mine from data is frequent patterns or associations, which include frequent itemsets, subsequences, substructures and association rules. For example, in a supermarket database, a set of items such as bread and butter is likely to appear together frequently; such a set is called a frequent itemset. Moreover, in an electronics shop database, there may be a frequent subsequence in which a PC, then a digital camera, and then a memory card are bought in that order; this is called a frequent subsequence. A substructure is a more complex pattern. It can refer to different structural forms, such as subgraphs, subtrees, or sublattices, which may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a frequent structured pattern. Such frequent patterns are important in mining associations, correlations, and many other interesting relationships among data. Moreover, again in a supermarket database, a customer who buys ice is likely to also buy water; this is called an association rule. After performing frequent itemset mining, we can use the result to discover associations and correlations among items in large transactional or relational data sets. The discovery of interesting correlation relationships among huge amounts of business transaction records can help in many business decision-making processes, such as catalog design, cross-marketing, and customer shopping behavior analysis. A typical example of frequent itemset mining and association rule mining is market basket analysis. Its formulation can be summarized as follows. Let I = {i_1, i_2, ..., i_m} be the set of possible items, and let D be a set of database transactions where each transaction T is a set of items such that T \subseteq I. An itemset A is a set of items. A transaction T is said to contain the itemset A if and only if A \subseteq T. Let D_A denote the set of transactions that contain A. The support of an itemset A is the number of transactions that include all items in A, divided by the total number of transactions. It corresponds to the probability that A will occur in a transaction, i.e., P(A), as follows:

    support(A) = \frac{|D_A|}{|D|} = \frac{|\{T \in D \mid A \subseteq T\}|}{|D|} = P(A)

An itemset A is called a frequent itemset if and only if the support of A is greater than or equal to a threshold called minimum support (minsup), i.e., support(A) \geq minsup. Note that an itemset represents a set of items; when the itemset contains k items, it is called a k-itemset. For example, in a supermarket database, the set {bread, butter} is a 2-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset, and its support is this frequency divided by the total number of transactions.


An association rule is an implication of the form A \Rightarrow B, where A \subset I, B \subset I and A \cap B = \emptyset (i.e., A and B are two non-overlapping itemsets). The support of the rule is denoted by support(A \Rightarrow B). It corresponds to the support of A \cup B (i.e., the union of the sets A and B), which is taken to be the probability P(A \cup B). Moreover, as another measure, the confidence of the rule is denoted by confidence(A \Rightarrow B), specifying the percentage of transactions in D containing A that also contain B. It corresponds to the conditional probability P(B \mid A). The formal descriptions are as follows:

    support(A \Rightarrow B) = support(A \cup B) = P(A \cup B)
    confidence(A \Rightarrow B) = \frac{support(A \cup B)}{support(A)} = P(B \mid A)

An association rule is called a frequent rule if and only if A \cup B is a frequent itemset and its confidence is greater than or equal to a threshold called minimum confidence (minconf), i.e., confidence(A \Rightarrow B) \geq minconf. Besides support and confidence, another important measure is lift. Theoretically, if the value of lift is lower than one, i.e., lift(A \Rightarrow B) < 1, then the probability that the conclusion (B) will occur under the condition (A), i.e., P(B \mid A), is lower than the probability of the conclusion without the precondition, i.e., P(B). This makes the rule meaningless. Therefore, we always expect an association rule to have a lift larger than or equal to a threshold called minimum lift (minlift). The formal description of lift is shown below:

    lift(A \Rightarrow B) = \frac{confidence(A \Rightarrow B)}{support(B)} = \frac{P(B \mid A)}{P(B)} = \frac{P(A \cup B)}{P(A)\,P(B)}
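To tie the three measures together, here is a small Python sketch (illustrative only; the function names are our own) that computes support, confidence and lift for a candidate rule over a list of transactions, using the same six retail transactions as the example that follows.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def rule_measures(antecedent, consequent, transactions):
    """Return (support, confidence, lift) of the rule antecedent => consequent."""
    sup_union = support(antecedent | consequent, transactions)
    conf = sup_union / support(antecedent, transactions)
    lift = conf / support(consequent, transactions)
    return sup_union, conf, lift

transactions = [
    {"coke", "ice", "paper", "shoes", "water"},
    {"ice", "orange", "shirt", "water"},
    {"paper", "shirt", "water"},
    {"coke", "orange", "paper", "shirt", "water"},
    {"ice", "orange", "shirt", "shoes", "water"},
    {"paper", "shirt", "water"},
]

# Rule {ice} => {water}: support 0.50, confidence 1.00, lift 1.0 (see Figure 4-11 (c), rule 1).
print(rule_measures({"ice"}, {"water"}, transactions))
```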

In general, association rule mining can be viewed as a two-step process: (1) find all frequent itemsets, and (2) generate strong association rules from the frequent itemsets. The first step finds, from the transactional database, the set of itemsets that occur at least as frequently as a predetermined minimum support (minsup), while the second step generates strong association rules from the frequent itemsets obtained in the first step. The strong association rules are required to have support no less than the minimum support (minsup), confidence no less than the minimum confidence (minconf), and lift no less than the minimum lift (minlift). Since, in principle, the second step is much less costly than the first, the overall performance of mining association rules is normally dominated by the first step. One difficulty in mining frequent itemsets from a large data set is that a huge number of itemsets may satisfy a low minimum support. Furthermore, given a frequent itemset, each of its non-empty subsets is frequent as well, so a long itemset contains a combinatorial number of shorter, frequent sub-itemsets. For example, a frequent itemset with a length of 30 includes up to

    \binom{30}{1} + \binom{30}{2} + \cdots + \binom{30}{30} = 2^{30} - 1 \approx 1.07 \times 10^{9}

subsets. The first term, \binom{30}{1}, counts the frequent 1-itemsets, the second term, \binom{30}{2}, counts the frequent 2-itemsets, and so on. To cope with such an extremely large number of itemsets, the concepts of closed frequent itemset and maximal frequent itemset are introduced. Below, we describe frequent itemsets and association rules, as well as closed frequent itemsets and maximal frequent itemsets, using the example in Figure 4-11.


  Transaction ID   Items
  1                coke, ice, paper, shoes, water
  2                ice, orange, shirt, water
  3                paper, shirt, water
  4                coke, orange, paper, shirt, water
  5                ice, orange, shirt, shoes, water
  6                paper, shirt, water

(a) A toy example of a transactional database for retailing

  Itemset   Trans          Freq.        Itemset          Trans          Freq.
  coke      1,4            2            ice, orange      2,5            2
  ice       1,2,5          3            ice, paper       1              1
  orange    2,4,5          3            ice, shirt       2,5            2
  paper     1,3,4,6        4            ice, water       1,2,5          3
  shirt     2,3,4,5,6      5            orange, paper    4              1
  shoes     1,5            2            orange, shirt    2,4,5          3
  water     1,2,3,4,5,6    6            orange, water    2,4,5          3
                                        paper, shirt     3,4,6          3
                                        paper, water     1,3,4,6        4
                                        shirt, water     2,3,4,5,6      5

  Itemset                 Trans     Freq.
  orange, shirt, water    2,4,5     3
  paper, shirt, water     3,4,6     3

(b) Frequent itemsets (minimum support = 3 (i.e., 50%))

  No.  Left Items      =>  Right Items     Support    Confidence  Lift
  1    ice             =>  water           3/6=0.50   3/3=1.00    (3/3)/(6/6)=1.0
  2    water           =>  ice             3/6=0.50   3/6=0.50    (3/6)/(3/6)=1.0
  3    orange          =>  shirt           3/6=0.50   3/3=1.00    (3/3)/(5/6)=1.2
  4    shirt           =>  orange          3/6=0.50   3/5=0.60    (3/5)/(3/6)=1.2
  5    orange          =>  water           3/6=0.50   3/3=1.00    (3/3)/(6/6)=1.0
  6    water           =>  orange          3/6=0.50   3/6=0.50    (3/6)/(3/6)=1.0
  7    paper           =>  shirt           3/6=0.50   3/4=0.75    (3/4)/(5/6)=0.9
  8    shirt           =>  paper           3/6=0.50   3/5=0.60    (3/5)/(4/6)=0.9
  9    paper           =>  water           4/6=0.67   4/4=1.00    (4/4)/(6/6)=1.0
  10   water           =>  paper           4/6=0.67   4/6=0.67    (4/6)/(4/6)=1.0
  11   shirt           =>  water           5/6=0.83   5/5=1.00    (5/5)/(6/6)=1.0
  12   water           =>  shirt           5/6=0.83   5/6=0.83    (5/6)/(5/6)=1.0
  13   orange, shirt   =>  water           3/6=0.50   3/3=1.00    (3/3)/(6/6)=1.0
  14   orange, water   =>  shirt           3/6=0.50   3/4=0.75    (3/4)/(5/6)=0.9
  15   shirt, water    =>  orange          3/6=0.50   3/5=0.60    (3/5)/(3/6)=1.2
  16   water           =>  orange, shirt   3/6=0.50   3/6=0.50    (3/6)/(3/6)=1.0
  17   shirt           =>  orange, water   3/6=0.50   3/5=0.60    (3/5)/(3/6)=1.2
  18   orange          =>  shirt, water    3/6=0.50   3/3=1.00    (3/3)/(5/6)=1.2
  19   paper, shirt    =>  water           3/6=0.50   3/3=1.00    (3/3)/(6/6)=1.0
  20   paper, water    =>  shirt           3/6=0.50   3/4=0.75    (3/4)/(5/6)=0.9
  21   shirt, water    =>  paper           3/6=0.50   3/5=0.60    (3/5)/(4/6)=0.9
  22   water           =>  paper, shirt    3/6=0.50   3/6=0.50    (3/6)/(3/6)=1.0
  23   shirt           =>  paper, water    3/6=0.50   3/5=0.60    (3/5)/(4/6)=0.9
  24   paper           =>  shirt, water    3/6=0.50   3/4=0.75    (3/4)/(5/6)=0.9

(c) Association rules (minimum confidence = 66.67%)

Figure 4-11: Frequent itemsets and frequent rules (association rules)


Given the transaction database in Figure 4-11 (a), we can find the set of frequent itemsets shown in Figure 4-11 (b) when the minimum support is set to 0.5. The set of frequent rules (association rules) is found as displayed in Figure 4-11 (c) when the minimum confidence is set to 0.66. Moreover, a valid rule needs to have a lift of at least 1.0. In this example, the thirteen frequent itemsets can be summarized as in Figure 4-12.

  No.   Frequent Itemset        Transaction Set   Support
  1.    ice                     1,2,5             3/6
  2.    orange                  2,4,5             3/6
  3.    paper                   1,3,4,6           4/6
  4.    shirt                   2,3,4,5,6         5/6
  5.    water                   1,2,3,4,5,6       6/6
  6.    ice, water              1,2,5             3/6
  7.    orange, shirt           2,4,5             3/6
  8.    orange, water           2,4,5             3/6
  9.    paper, shirt            3,4,6             3/6
  10.   paper, water            1,3,4,6           4/6
  11.   shirt, water            2,3,4,5,6         5/6
  12.   orange, shirt, water    2,4,5             3/6
  13.   paper, shirt, water     3,4,6             3/6

Figure 4-12: Summary of frequent itemsets with their transaction sets and supports.

The concepts of closed frequent itemsets and maximal frequent itemsets are defined as follows.
[Closed Frequent Itemset] An itemset X is closed in a data set S if there exists no proper super-itemset Y such that Y has the same support count as X in S. An itemset X is a closed frequent itemset in S if X is both closed and frequent in S.
[Maximal Frequent Itemset] An itemset X is a maximal frequent itemset (or max-itemset) in a set S if X is frequent, and there exists no super-itemset Y such that X \subset Y and Y is frequent in S.
Normally, it is possible to recover the whole set of frequent itemsets, together with their exact supports, from the set of closed frequent itemsets, whereas the set of maximal frequent itemsets registers only the supports of the maximal itemsets themselves, so the supports of the other frequent itemsets cannot be recovered from it. Using the above example, the frequent closed itemsets are shown in Figure 4-13; the frequent itemsets that are not closed are marked with '(not closed)'. There are six frequent closed itemsets. The column 'Transaction Set' indicates the set of transactions that include the frequent itemset. For example, the frequent itemset {orange, water} occurs in the 2nd, 4th and 5th transactions in Figure 4-11 (a). Moreover, the frequent itemset {orange} is not closed since it has the same transaction set as its proper superset {orange, shirt, water}. Note that the smallest closed frequent itemset that includes {orange} is {orange, shirt, water}; therefore, the closed frequent itemset covering {orange} is {orange, shirt, water}.


  No.   No. (closed)   Frequent Itemset                   Transaction Set   Support
  1.    -              ice (not closed)                   1,2,5             3/6
  2.    -              orange (not closed)                2,4,5             3/6
  3.    -              paper (not closed)                 1,3,4,6           4/6
  4.    -              shirt (not closed)                 2,3,4,5,6         5/6
  5.    1              water                              1,2,3,4,5,6       6/6
  6.    2              ice, water                         1,2,5             3/6
  7.    -              orange, shirt (not closed)         2,4,5             3/6
  8.    -              orange, water (not closed)         2,4,5             3/6
  9.    -              paper, shirt (not closed)          3,4,6             3/6
  10.   3              paper, water                       1,3,4,6           4/6
  11.   4              shirt, water                       2,3,4,5,6         5/6
  12.   5              orange, shirt, water               2,4,5             3/6
  13.   6              paper, shirt, water                3,4,6             3/6

Figure 4-13: Six closed frequent itemsets with their transaction sets and supports.

In general, the set of closed frequent itemsets contains complete information regarding the frequent itemsets. For example, the transaction sets and supports of the frequent itemsets {orange}, {orange, shirt} and {orange, water} are equivalent to those of the closed frequent itemset {orange, shirt, water}, since there is no smaller closed frequent itemset that includes them. That is, TransactionSet({orange}) = TransactionSet({orange, shirt}) = TransactionSet({orange, water}) = TransactionSet({orange, shirt, water}). Moreover, in this example, the maximal frequent itemsets are {orange, shirt, water} and {paper, shirt, water}, since they are the longest frequent itemsets and no superset of either is frequent. For the association rules (Figure 4-11 (c)), we can discard the 2nd, 4th, 6th, 8th, 15th, 16th, 17th, 21st, 22nd and 23rd rules since their confidences are lower than the minimum confidence, i.e., 0.67. Moreover, the 7th, 8th, 14th, 20th, 21st, 23rd and 24th rules are discarded since their lifts are lower than 1.0. Finally, we have ten frequent rules left. As stated above, association rule mining can be viewed as a two-step process: (1) find all frequent itemsets, and (2) generate strong association rules from the frequent itemsets. The second step is relatively straightforward, while the first step consumes far more time and space. The following subsections describe a set of existing algorithms, namely Apriori, CHARM and FP-Tree, that efficiently discover the frequent itemsets.
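As an illustration of the two definitions above, the following minimal Python sketch (not the textbook's code; the helper names are our own) identifies closed and maximal itemsets among the frequent itemsets of the toy database.

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup_count):
    """Brute-force enumeration of frequent itemsets with their support counts."""
    items = sorted(set().union(*transactions))
    freq = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            count = sum(1 for t in transactions if set(combo) <= t)
            if count >= minsup_count:
                freq[frozenset(combo)] = count
    return freq

def closed_and_maximal(freq):
    """Split frequent itemsets into closed and maximal ones."""
    closed, maximal = [], []
    for itemset, count in freq.items():
        supersets = [y for y in freq if itemset < y]
        if not any(freq[y] == count for y in supersets):
            closed.append(itemset)      # no proper superset with the same support
        if not supersets:
            maximal.append(itemset)     # no frequent proper superset at all
    return closed, maximal

transactions = [
    {"coke", "ice", "paper", "shoes", "water"},
    {"ice", "orange", "shirt", "water"},
    {"paper", "shirt", "water"},
    {"coke", "orange", "paper", "shirt", "water"},
    {"ice", "orange", "shirt", "shoes", "water"},
    {"paper", "shirt", "water"},
]
freq = frequent_itemsets(transactions, minsup_count=3)
closed, maximal = closed_and_maximal(freq)
print(len(freq), len(closed), len(maximal))   # 13 frequent, 6 closed, 2 maximal
```

The counts printed at the end match Figures 4-12 and 4-13: thirteen frequent itemsets, of which six are closed and two are maximal.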

4.2.1. Apriori algorithm
The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules. As its name suggests, the algorithm uses prior knowledge of frequent itemset properties to prune candidate itemsets that cannot be frequent, thereby avoiding the cost of counting their supports. It explores the space of itemsets iteratively in a level-wise style, where frequent k-itemsets are used to explore (k+1)-itemsets. In the first step, the set of frequent 1-itemsets is found by scanning the database to accumulate the count of each item; the items that satisfy the minimum support are collected into a set denoted by L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. Each step to find Lk requires one full database scan.


To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space. The Apriori property states: "All non-empty subsets of a frequent itemset must also be frequent." Equivalently, if an itemset X does not satisfy the minimum support threshold minsup (i.e., X is not frequent, support(X) < minsup), then none of its supersets can be frequent. In other words, if we add an item A to the itemset X, the resulting itemset X \cup {A} can never occur more frequently than X, i.e., support(X \cup {A}) \leq support(X). In conclusion, X \cup {A} is not frequent if X is not frequent; that is, support(X) < minsup implies support(X \cup {A}) < minsup. The Apriori property is called anti-monotone in the sense that if a set cannot pass a test, all of its supersets will fail the same test as well; the property is monotonic in the context of failing the test. Exploiting this property, each level of the algorithm is organized as the following two-step (join and prune) process.

[Join Step]

Generate Ck, the set of candidate k-itemsets, by joining Lk-1 with itself. Then count which members of Ck are frequent in order to obtain Lk, the set of all frequent k-itemsets.

Here, if the items within a transaction or itemset are sorted in lexicographic order, then we can easily generate Ck from Lk-1 by joining two itemsets in Lk-1 that share a common (k-2)-item prefix. For example (using generic item names), suppose that {i1, i2, i3} and {i1, i2, i4} are in L3. Since these two itemsets share the same prefix {i1, i2}, we can generate {i1, i2, i3, i4} as a candidate in C4. Note that this join step implicitly uses the Apriori property.

[Prune Step]

After the join step, a set Ck of candidate k-itemsets has been generated. Among them, we have to identify which ones are frequent (have a count no less than the minimum support) and which are not. In other words, Ck is a superset of Lk: members of Ck may or may not be frequent, but all of the frequent k-itemsets are included in Ck, that is, Lk \subseteq Ck.

A scan of the database to determine the count of each candidate in Ck would then yield Lk. In many cases, however, the set of candidates is huge, resulting in heavy computation. To reduce the size of Ck, the Apriori property is used again in addition to the join step: since "all non-empty subsets of a frequent itemset must also be frequent," any candidate that has an infrequent (k-1)-subset can be removed from Ck before counting. This subset testing can be done quickly by maintaining a hash tree of all frequent itemsets. Algorithm 4.3 shows the pseudocode of the Apriori algorithm for discovering frequent itemsets for mining Boolean association rules. In the pseudocode, the function GenerateCandidate corresponds to the join step, while the function CheckNoInfrequentSubset corresponds to the prune step. The following shows an example of the Apriori algorithm with six retailing transactions, where minsup = 0.5 and minconf = 0.67.


Algorithm 4.3: The Apriori Algorithm

OBJECTIVE: Find frequent itemsets using an iterative level-wise approach based on candidate generation.

Input:
   D        # a database of transactions
   minsup   # minimum support count
Output:
   L        # frequent itemsets in D

Procedure Main():
 1: L1 = Find_frequent_1_Itemset(D)        # Find frequent 1-itemsets from D
 2: for (k = 2; Lk-1 ≠ ∅; k++) {
 3:    Lk = ∅;
 4:    Ck = GenerateCandidate(Lk-1);       # Generate candidates
 5:    foreach transaction t ∈ D {         # Scan D for counts
 6:       Ct = Subset(Ck, t);              # Find which candidates are in t
 7:       foreach candidate c ∈ Ct
 8:          count[c] = count[c] + 1;      # Add one to the count of c
 9:    }
10:    foreach c in Ck {                   # Check all candidates c in Ck
11:       if (count[c] ≥ minsup)           # If count[c] ≥ minsup
12:          Lk = Lk ∪ {c};                # Add c to the set of frequent k-itemsets
13:    }
14: }
15: return L = ∪k Lk                       # Return all sets of frequent k-itemsets

Procedure GenerateCandidate(Lk-1)          # Lk-1 is the set of frequent (k-1)-itemsets
 1: Ck = ∅;
 2: foreach itemset x1 ∈ Lk-1 {            # itemset in frequent (k-1)-itemsets
 3:    foreach itemset x2 ∈ Lk-1 {         # itemset in frequent (k-1)-itemsets
 4:       if( (x1[1]=x2[1]) & (x1[2]=x2[2]) & ... & (x1[k-2]=x2[k-2]) &
 5:           (x1[k-1]