Comparison of Unsupervised Classifiers

Muhammad Afzal Upal ([email protected]), University of Alberta, Edmonton, Canada.
Eric M. Neufeld ([email protected]), University of Saskatchewan, Saskatoon, Canada.

Abstract: The activity of sorting like objects into classes without any help from an omniscient supervisor is known as unsupervised classification. In AI, both the symbolic and connectionist camps study classification. Statistical classifiers such as Autoclass and Snob search for the theory that best explains the distribution of the given data, whereas neural network classifiers such as Kohonen's networks and ART2 use the vector quantization principle to classify data. Many previous studies have compared supervised classification algorithms, but the more challenging problem of comparing unsupervised classifiers has largely been ignored. We performed an empirical comparison of ART2, Autoclass and Snob, and we highlight the strengths and weaknesses of the various classifiers. Overall, the statistical classifiers, especially Snob, perform better than their neural network counterpart ART2.
Keywords: Unsupervised classification. Area of Interest: Concept formation and classification.
1 Introduction

Classification is the process by which objects are arranged into categories. Classification can be divided into two types: supervised and unsupervised. Supervised classification is a form of learning from examples. A supervised classifier learns a function f when it is provided with examples of the form (X_i, f(X_i)) by a supervisor. Unsupervised classification can be thought of as learning from observation. An unsupervised classifier is provided with observations X_i that have not been categorized by a supervisor, and the goal is to look for similarities in the observed patterns and group them such that patterns in one group are similar to one another and dissimilar from the patterns in any other group. Unsupervised classification seeks a convenient and valid organization of the data, not to establish rules for separating future data into categories [8]. Early work in the formalization of unsupervised classification techniques was done by taxonomists for naming new species of animals and plants [15]. Similar requirements arose for naming a variety of phenomena in both the social and natural sciences, and statisticians developed tools that employ objective methodology for classification. Reflecting its large number of applications, unsupervised classification is studied variously as clustering, autoclassification, vector quantization, numerical taxonomy and conceptual clustering.
1.1 Cluster Analysis
Cluster analysis is based on a mathematical formulation of a measure of similarity. Some well-known cluster analysis algorithms are the leader algorithm, the k-means algorithm, ISODATA, and the quick partition algorithm [1]. Cluster analysis algorithms require prior specification of the number of classes and/or some distance measure for measuring the distance between objects. In many clustering problems, however, the user has no knowledge about the number of classes to be discovered or about the distance measure on which to base the classification. The alternative is to consider a classification as a theory proposed to explain the given data. This lets us formulate the classification problem as a problem of inductive inference.
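The paragraph above notes that cluster analysis methods need the number of classes k and a distance measure up front. A minimal k-means sketch in Python (our illustration, not from the paper) makes both requirements explicit: k is a parameter, and Euclidean distance is hard-wired into the assignment step.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Toy k-means: X is an (n, q) array, k must be supplied in advance."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # initial centers
    for _ in range(iters):
        # assignment step: nearest center under Euclidean distance
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # update step: move each center to the mean of its members
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```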
1.2 AI Techniques
In AI, unsupervised classification is studied by both the connectionist and symbolic camps. The symbolic research is carried out by researchers in the traditional machine learning community, under the name of conceptual clustering [21], as well as by those employing methodologies with a statistical flavor. The basic idea behind conceptual clustering is that instead of considering just the similarity between objects, the method uses the conceptual cohesiveness among the objects as a criterion for classification. Conceptual cohesiveness depends not only on the objects and the surrounding objects but also on the information that describes the objects together. Conceptual clustering techniques thus perform a context-based clustering and arrange objects in a hierarchical fashion [21]. According to the statistical approach, properties of objects are distributed according to some probability distribution, and the approach suggests estimating the parameters of these distributions based on some assumptions. A distinct focus on modeling the cognitive processes involved in learning and classification sets the connectionist approach apart from the symbolic approaches: connectionist researchers model the self-organizing properties of the brain using networks of neurons.
1.3 Previous Comparative Studies
Thus, given raw data, there are many choices of classification algorithms, and it is difficult for applied researchers to decide which algorithm to prefer over the others. Many studies [10] [11] [26] [36] compare machine learning and neural network techniques in a supervised environment. Comparing unsupervised classifiers, however, is not as straightforward as comparing supervised classifiers because of the absence of class labels in the experimental classification. Anderberg [1] suggests classifying the classification algorithms themselves in order to quantify their similarities and differences. While this approach highlights the similarities and differences among classifiers, it begs the question, because we still must determine the rationale for so classifying the algorithms. Another approach adopted by some researchers is to compile a list of generally acceptable criteria (called admissibility criteria), such as correctness, and then examine which of the algorithms adhere to these criteria [8]. The admissibility approach provides insight into the classification methods, but it does not provide any basis for a fair comparison of the systems, as one would still need to measure the degree of admissibility of a criterion in order to measure the extent of conformity to it [8]. A theoretical comparison of unsupervised classifiers is also not feasible, as they are almost impossible to model mathematically in such a way that the models can be compared [8]. As Milligan [22] notes, none of the classifiers in use provides any theoretical proof ensuring that the algorithm will recover the true cluster structure. A reasonable approach, then, is to evaluate the classifications obtained from the classifiers to establish their characteristics as exhibited on the classification of some data sets. In an earlier study [33] we studied the classification recovery characteristics of Autoclass and ART2 on real-life data sets. The problem with this approach, however, is that the validity of the "original" classification, with which we compare the classification resulting from an algorithm, is open to question. Moreover, using real-life data sets does not give us the flexibility of arbitrarily changing the properties of the data sets and performing controlled statistical experiments to study the effects of these changes on the recovery of the classification. The alternative is to perform a Monte Carlo analysis by generating data separated into well-defined classes and studying the class recovery of the various systems. Most Monte Carlo studies [29] [22] [23] [25] [17] [20] [2] focus on cluster analysis techniques. As Dubes et al. [8] note, the trend in the AI and pattern recognition literature has been to ignore the problems of validation and comparison and to focus on applications of these methods. Lutz Prechelt [28] examined 113 journal articles about neural net learning algorithms: only 6% show results for more than one problem using real-world data, one third present no quantitative comparison with any previous algorithm, and one third use no realistic problem at all [28].
1.4 Objective
The objective of the present work is to perform an impartial and systematic comparison of various unsupervised classifiers by measuring the extent of recovery of the true classification using standard statistics.
2 Algorithms

Most unsupervised classification algorithms suit a particular application and hence vary in the degree of supervision. To make this study fair and unbiased, we consider systems requiring minimal supervision: Autoclass and Snob (two statistical techniques) as well as ART2 (a neural network technique). Next we provide an overview of the algorithms, the test statistics and the data sets used in the experiments, and finally a discussion of the results.
2.1 Statistical Classifiers
Classification is concerned with the task of hypothesis generation, as opposed to the traditional statistical activity of hypothesis testing. The statistical approach to classification suggests that properties of objects are distributed according to some probability distribution, and the problem is to find a theory that best explains this distribution. There are many possible criteria for choosing a good theory. One is that the best theory is the one that has the highest probability of generating the observed data. In this approach we prefer the theory that has the highest posterior probability or, equivalently by Bayes' rule, the one that maximizes the product of the prior probability of the theory with the probability of the data occurring in the light of the theory. This leads to the formulation of the Bayesian approach to classification [5]. An alternative criterion is that the best theory is the one that provides the least complex explanation of the data [31] [34]. An explanation includes both the description of the theory and the specification of the data given that theory. The minimum message length (MML) approach suggests using information-theoretic principles to measure the lengths of the explanations offered by competing theories, and preferring the theory that gives the shorter explanation. The explanation length depends on the language used to describe both the theory and the data with respect to that theory. The MML criterion requires that the language reflect our prior expectations about the environment and have shorter descriptions for the more likely objects. For instance, in natural languages more common objects usually have shorter descriptions than rare ones. In such languages, the theory that is most likely to have generated the observed data also gives the shortest explanation.
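The two criteria agree: maximizing P(H)P(D|H) selects the same theory as minimizing the total description length -log P(H) - log P(D|H). A toy numerical check in Python (the priors and likelihoods below are invented purely for illustration):

```python
import math

# hypothetical (prior, likelihood) pairs for one fixed data set D
hypotheses = {
    "one class":   (0.5, 1e-6),
    "two classes": (0.3, 1e-3),
    "ten classes": (0.2, 2e-3),
}
# two-part message length in bits vs. unnormalized posterior
lengths = {h: -math.log2(ph) - math.log2(pd) for h, (ph, pd) in hypotheses.items()}
posteriors = {h: ph * pd for h, (ph, pd) in hypotheses.items()}
# both criteria pick the same hypothesis
assert min(lengths, key=lengths.get) == max(posteriors, key=posteriors.get)
```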
2.1.1 Bayesian Classification
The problem of unsupervised classification can be formulated as follows. Given a set of n objects X = {X_1, X_2, \ldots, X_n}, with each object X_i, i = 1 \ldots n, described by q attributes \langle x_{i1}, x_{i2}, \ldots, x_{iq} \rangle, determine the classification that is most likely to have generated the observed objects. Suppose the objects are taken from m classes c_1, c_2, \ldots, c_m, where each class is characterized by a probability distribution giving the distribution of the attributes of its members. If H denotes the hypothesis and D the observed data, Bayes' rule states that the probability P(H|D) that the hypothesis explains the data is proportional to the probability P(D|H) of observing the data if the hypothesis were true [5],

P(H|D) = \frac{P(H) \, P(D|H)}{P(D)}

where P(D) is called the prior probability of the data and can be obtained as a normalizing constant. In the case of classification, the hypothesis is the number and probability distribution of the classes. Let \theta and \pi be the vectors that denote the class parameters and class probabilities, respectively. The probability that this distribution explains the observed data can be stated using Bayes' rule,

P(\theta, \pi, m \mid X) = \frac{P(\theta, \pi, m) \, P(X \mid \theta, \pi, m)}{P(X)}.

For a given value of the class parameters, the probability of object X_i belonging to class C_j can be calculated as

P(X_i \in C_j \mid X_i, \theta, \pi, m) = \frac{\pi_j \, P(X_i \mid X_i \in C_j, \theta_j)}{P(X_i \mid \theta, \pi, m)}   (1)

where \theta_j denotes the class parameter vector for class C_j. Bayes' rule does not provide an algorithm for classification. The designers of a Bayesian classifier are faced with the computationally intractable problem of searching the hypothesis space for the optimal distribution that produced the observed data, and the controversial problem of estimating the priors.
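A minimal sketch of Equation 1 for one real attribute, assuming (as Autoclass does below) normal class densities. The class weights, means and standard deviations here are invented for illustration:

```python
import numpy as np
from scipy.stats import norm

def membership(x, pis, mus, sigmas):
    """Posterior P(X_i in C_j | X_i, theta, pi, m) for normal classes."""
    # numerator of Eq. 1 for each class j: pi_j * P(x | x in C_j, theta_j)
    joint = np.array([pi * norm.pdf(x, mu, sg)
                      for pi, mu, sg in zip(pis, mus, sigmas)])
    # denominator P(x | theta, pi, m) is the sum over classes
    return joint / joint.sum()

print(membership(1.0, pis=[0.6, 0.4], mus=[0.0, 3.0], sigmas=[1.0, 1.0]))
```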
Autoclass: Cheeseman et al. [5], [6] implement Bayesian classification in Autoclass. Their strategy involves making simplifying assumptions about the classification model. Rather than searching the entire hypothesis space by considering all possible states of the world, they focus on a limited number of possible states, thereby reducing the number of possibilities to be analyzed. For real attributes, the assumption is that the data are distributed according to the normal probability distribution; a multinomial distribution is assumed for discrete attributes. Autoclass divides the problem into two parts, namely, the calculation of the number of classes and the estimation of the classification parameters. The Bayesian approach suggests that the optimal number of classes m maximizes the posterior probability P(m|X) of the number of classes given the data,

P(m \mid X) = \frac{P(m) \, P(X \mid m)}{P(X)}

where P(m) is the probability that there are m classes in the data, P(X) is the prior probability of the data, and P(X|m) is obtained by normalizing

P(X \mid m) = \int P(\theta, \pi \mid m) \, P(X \mid \theta, \pi, m) \, d\theta \, d\pi.

Instead of calculating the above integral, Autoclass searches for the values of the parameters \theta and \pi that maximize P(X \mid \theta, \pi, m) and then uses a Gaussian distribution as an approximation of the integral around that point. The other part involves estimating the parameter vector for the optimal number of classes. Denoting the parameter vector for the distribution of class j by \theta_j and once again using Bayes' rule gives

P(\theta, \pi \mid X, m) = \frac{P(\theta, \pi \mid m) \, P(X \mid \theta, \pi, m)}{P(X \mid m)}.

Autoclass assumes that the data are generated independently of each other to facilitate the calculation of P(X \mid \theta, \pi, m),

P(X \mid \theta, \pi, m) = \prod_{i=1}^{n} P(X_i \mid \theta, \pi, m).

Autoclass also assumes that the attributes are independent of each other given the classification, which gives

P(X \mid \theta, \pi, m) = \prod_{i=1}^{n} \prod_{k=1}^{q} P(x_{ik} \mid \theta, \pi, m).

Autoclass uses the Expectation Maximization (EM) algorithm [7] to estimate the class parameters that maximize the posterior probability of the parameters for a given number of classes.
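A compact EM sketch for a one-dimensional normal mixture, in the spirit of the estimation step Autoclass performs. This is the generic EM algorithm [7] under our own simplifications (one attribute, maximum-likelihood M step), not Autoclass's implementation:

```python
import numpy as np
from scipy.stats import norm

def em(X, m=2, iters=50, seed=0):
    """Estimate mixture weights, means and std devs of m normal classes."""
    rng = np.random.default_rng(seed)
    pis = np.full(m, 1.0 / m)
    mus = rng.choice(X, m, replace=False)
    sgs = np.full(m, X.std())
    for _ in range(iters):
        # E step: responsibilities, i.e. Equation 1 for every object and class
        r = np.array([pi * norm.pdf(X, mu, sg)
                      for pi, mu, sg in zip(pis, mus, sgs)])
        r /= r.sum(axis=0)
        # M step: weighted re-estimates of the class parameters
        nk = r.sum(axis=1)
        pis, mus = nk / len(X), (r @ X) / nk
        sgs = np.sqrt((r * (X - mus[:, None]) ** 2).sum(axis=1) / nk)
    return pis, mus, sgs
```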
2.1.2 The Minimum Message Length Classification
A set X = [X_i]_{i=1}^{n} of n objects, each with q attributes, can be represented without loss of any information as an n \times q matrix

x_{11}, x_{12}, x_{13}, \ldots, x_{1q}
x_{21}, x_{22}, x_{23}, \ldots, x_{2q}
\vdots
x_{n1}, x_{n2}, x_{n3}, \ldots, x_{nq}.

In the minimum message length paradigm, the problem of classification is to find an explanation that describes this information more briefly. The data can be encoded more briefly only if it exhibits systematic group patterns, as opposed to being random. The strategy is to assume that the data possesses some groups, estimate parameters for the distributions of these groups using traditional statistical techniques, and encode the data assuming that the estimates are the true values. However, these parameters have to be specified in the encoded message. The encoding specifies:

1. the hypothesis, i.e., the number of classes, the prior distribution of each class, the distribution of the classification parameters for each class, and the distribution of the parameters of each attribute for each class;
2. the probability of the data given the hypothesis.

The selection of the optimal message involves a compromise. The length of the first part (i.e., the hypothesis) would be minimal if we assumed that the data belongs to one class, but this would increase the length of the second part. On the other hand, the length of the second part would be minimal if we put each item in a separate class, but this would increase the length of the first part, as more parameters must be specified. Given data D and a hypothesis H, the first part of the encoding is the probability distribution of H and the second part is the distribution of the data given this hypothesis, P(D|H). Shannon's information theory states that an event of probability p can be encoded (for instance, using Huffman encoding) by a message -\log(p) bits long [30]. Therefore, the total message length is the sum of -\log P(H) and -\log P(D|H):

total message length = -\log P(H) - \log P(D|H) = -\log(P(D|H) \, P(H)).   (2)

Using Bayes' rule,

P(H|D) = \frac{P(D|H) \, P(H)}{P(D)},

gives

total message length = -\log(P(H|D) \, P(D)).

Therefore, minimizing the message length is equivalent to maximizing the posterior probability P(H|D), the same criterion as Bayesian classification. Although Snob is differently motivated than Autoclass, both employ the same basic principle; the implementation details vary, however, as Snob does not need to employ the two-step process used by Autoclass (i.e., estimating the optimal number of classes and then estimating the classification parameters) [34].
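The compromise in Equation 2 can be made concrete with a small sketch; the probabilities below are invented to illustrate the trade-off between a cheap hypothesis with an expensive data part and the reverse:

```python
import math

def message_length(p_hypothesis, p_data_given_h):
    """Two-part length of Eq. 2 in bits: -log2 P(H) - log2 P(D|H)."""
    return -math.log2(p_hypothesis) - math.log2(p_data_given_h)

# one big class: short hypothesis part, long data part
print(message_length(0.5, 1e-9))    # ~30.9 bits
# a richer hypothesis: longer first part, much shorter second part
print(message_length(1e-4, 1e-3))   # ~23.3 bits, so this theory is preferred
```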
Snob: Snob [35] assumes that the attributes are independently distributed given the classification and that each object X_i is independently produced by a random process. This assumption facilitates the computation of the probability of the data in Equation 2. If X denotes the data set, then

P(X \mid H) = \prod_{i=1}^{n} \prod_{k=1}^{q} P(x_{ik} \mid H).

Discrete attributes are modeled by a multinomial distribution, and a uniform prior over the possible theories is assumed in this case. In the case of real attributes the set of possible theories becomes a continuum, and it is not possible to assign a non-zero probability to each theory. Snob assumes a normal distribution, and measuring the parameters in the first part of the encoding to a precision comparable to the expected error in the estimate leads to the maximum likelihood estimates for the parameters, given as

mean = \bar{x} and variance = \frac{n s^2}{n - 1}.

Snob implements a heuristic search with routines such as adjust, random, split, merge and wipe to find the best classification [27].
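As a rough illustration of this kind of heuristic search (and emphatically not Snob's actual algorithm), the loop below greedily applies candidate moves such as split and merge while they shorten the message; `length` stands for a two-part message-length evaluation supplied elsewhere, and the move functions are hypothetical:

```python
def search(classes, length, moves, max_steps=1000):
    """Greedy local search over classifications, minimizing message length."""
    best = length(classes)
    for _ in range(max_steps):
        improved = False
        for move in moves:                 # e.g. split, merge, adjust, wipe
            candidate = move(classes)      # a move may return None (no-op)
            if candidate is not None and length(candidate) < best:
                classes, best, improved = candidate, length(candidate), True
        if not improved:                   # local optimum of message length
            break
    return classes, best
```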
2.2 Neural-Net Classifiers

Neural networks are highly interconnected networks of simple processing units called neurons. These networks can learn to approximate arbitrary functions using an appropriate learning algorithm such as competitive learning [19].

2.2.1 The Competitive Learning Model

The basic idea behind competitive, or winner-take-all, learning is similar to the natural selection principle proposed by Charles Darwin to account for human and animal evolution. In a winner-take-all model, neurons compete to respond to a stimulus; the best-responding neuron wins the competition and rewards itself by increasing the synaptic strength of its active synapses. Variations on the basic competitive learning model include Kohonen's self-organizing maps and Adaptive Resonance Theory. Kohonen's self-organizing maps are topographic maps autonomously organized by a cyclic process of comparing input patterns to the weight vector stored in each output node. Grossberg's Adaptive Resonance Theory (ART) addresses the problem of instability of learned patterns by adding extensive feedback connections.

ART2: ART1, ART2, ART3, ART Map, ART Star and Fuzzy ART are variations on the Adaptive Resonance architecture [14]. ART1 classifies binary input patterns in an unsupervised environment, whereas ART2 works in the real pattern space. ART Map and ART Star modify the basic ART architecture for training in a supervised environment, and Fuzzy ART performs a fuzzy classification. We used ART2 in our study.

[Figure 1: Basic diagram of an ART network (adapted from [2]): input patterns feed an input/comparison layer, which is connected to an output/recognition layer through feed-forward and feed-back weights, with control channels 1 and 2 and an orienting subsystem.]

The basic ART2 network (see Figure 1) has two layers, F1 and F2 (collectively called the attentional gain subsystem), and an orienting subsystem. F1 is used as the input/comparison layer (also called the feature representation field) and F2 as the output/recognition layer (also called the category representation field). There are two control channels (control 1 and control 2 in Figure 1), called attentional gain control channels, that regulate the change in behavior of the layers, i.e., from input to comparison, or vice versa, in the case of F1. The detailed diagram of ART2 (Figure 2) reveals extensive loops of feed-forward and feed-back connections.
The input X_i to the input layer F1 is fed through three sub-layers and is modified at each step. These steps perform normalization, contrast enhancement and noise suppression of the input before it is fed to the output layer F2 for competition. These activities are governed by the following equations:

I.   w_i = X_i + a u_i, \qquad t_i = \frac{w_i}{e + \|w\|}

II.  v_i = f(t_i) + b f(q_i), \qquad u_i = \frac{v_i}{e + \|v\|}

III. p_i = u_i + \sum_{j} g(y_j) z_{ji}, \qquad q_i = \frac{p_i}{e + \|p\|}

Here f in II is a non-linear function. Typical definitions of f include

f(x) = \begin{cases} \dfrac{2\theta x^2}{x^2 + \theta^2} & \text{if } 0 \le x \le \theta \\ x & \text{if } x \ge \theta \end{cases}

or

f(x) = \begin{cases} 0 & \text{if } 0 \le x < \theta \\ x & \text{if } x \ge \theta. \end{cases}

The bottom-up weights are initialized to random values

z_{ij} \le \frac{1}{(1 - d)\sqrt{q}}

where q is the number of dimensions of the input vector. The top-down weights are initialized to small values between 0 and 1. When an input pattern X is filtered through the feed-forward connections, competitive learning takes place in the F2 nodes. The activation T_j at some F2 node j is the weighted sum T_j = \sum_i p_i z_{ij}. A winner is decided by comparing the activations produced among all the F2 nodes.

[Figure 2: Detailed diagram of an ART2 network (adapted from [14]), showing the F1 sub-layers (w_i, t_i, v_i, u_i, p_i, q_i), the F2 layer with outputs g(y_j), the reset signal r_i and the orienting subsystem. Open arrows indicate specific patterned inputs to target nodes; filled arrows indicate non-specific control inputs. When F2 makes a choice, g(y_j) = d if the jth node in F2 is active, otherwise g(y_j) = 0.]

The best matching vector is now passed back to F1. The vigilance threshold \rho is compared to the reset function R, defined as

R = \langle r_1, r_2, \ldots, r_q \rangle, \qquad r_i = \frac{u_i + c z_i}{e + \|u\| + \|c z\|}

where u is the processed input vector, z_i is the weight vector associated with the winner node i, and c is a constant. If \|R\| \ge \rho, the weights are modified in a fashion similar to competitive learning. Otherwise the best matching node is disabled, the system is reset, and the whole process is repeated. If all the nodes in the layer F2 become disabled, a new node is created.
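To make the control flow concrete, here is a drastically simplified ART-style loop in Python. It is our sketch only: it keeps the normalize / compete / vigilance-test / reset-or-learn cycle described above but omits the F1 sub-layer dynamics, the gain controls and the exact ART2 reset function, replacing them with a crude cosine match:

```python
import numpy as np

def art_like(patterns, rho=0.9, lr=0.5):
    """Simplified ART-style categorization; returns a label per pattern."""
    weights = []                                    # one weight vector per F2 node
    labels = []
    for x in patterns:
        u = x / (1e-9 + np.linalg.norm(x))          # normalization sub-layer
        order = np.argsort([-(w @ u) for w in weights]) if weights else []
        for j in order:                             # winners by decreasing T_j
            w = weights[j]
            match = (u @ w) / (1e-9 + np.linalg.norm(w))   # crude reset test
            if match >= rho:                        # resonance: learn and stop
                weights[j] = (1 - lr) * w + lr * u
                labels.append(j)
                break
        else:                                       # every node reset: new category
            weights.append(u.copy())
            labels.append(len(weights) - 1)
    return labels
```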
3 Experiments

The experimental setup involves generating the data sets using the data generation algorithm and then measuring the class recovery using test statistics.
         c'_1   c'_2   ...   c'_m  | Row Sum
c_1      n_11   n_12   ...   n_1m  | n_1.
c_2      n_21   n_22   ...   n_2m  | n_2.
...      ...    ...    ...   ...   | ...
c_l      n_l1   n_l2   ...   n_lm  | n_l.
Col Sum  n_.1   n_.2   ...   n_.m  | n_.. = n

Table 1: Contingency table for two classifications.

        C' = 1   C' = 0
C = 1     a        b
C = 0     c        d

Table 2: Contingency table for comparing classifications. Here a is the number of pairs that are in the same class in both classifications, d is the number of pairs in different classes in both classifications, and b and c are the pairs placed in the same class by one classification and in different classes by the other.
3.1 Test Statistics
Consider a set of objects X = {X_1, X_2, \ldots, X_n} to be classified into m classes. Let C = \langle c_1, c_2, \ldots, c_m \rangle denote the "true" classification and C' = \langle c'_1, c'_2, \ldots, c'_l \rangle the classification resulting from the algorithm. Let n_{ij} denote the number of objects that are in both c_i and c'_j. Let n_{i.} denote the row sum for row i in Table 1 and n_{.j} the sum of column j, or equivalently the number of objects in class c'_j. We can define two similarity functions C and C':

C(i, j) = \begin{cases} 1 & \text{if } X_i \in C_u \text{ and } X_j \in C_u, \; 1 \le u \le m \\ 0 & \text{otherwise} \end{cases}

C'(i, j) = \begin{cases} 1 & \text{if } X_i \in C'_v \text{ and } X_j \in C'_v, \; 1 \le v \le l \\ 0 & \text{otherwise} \end{cases}

In terms of Table 2 we can define a, b, c and d as

a = \sum_i \sum_j \binom{n_{ij}}{2},   (3)

b = \sum_j \binom{n_{.j}}{2} - \sum_i \sum_j \binom{n_{ij}}{2},   (4)

c = \sum_i \binom{n_{i.}}{2} - \sum_i \sum_j \binom{n_{ij}}{2},   (5)

d = \binom{n}{2} - \sum_i \binom{n_{i.}}{2} - \sum_j \binom{n_{.j}}{2} + \sum_i \sum_j \binom{n_{ij}}{2}.   (6)
We used the Rand statistic R [29], the corrected Rand statistic R' [16], the Fowlkes and Mallows statistic FM [12] and the Jaccard statistic J [24] to compare the classifications produced by the different algorithms:

R = \frac{a + d}{a + b + c + d}, \qquad R' = \frac{R - E(R)}{1 - E(R)},

FM = \frac{a}{\sqrt{(a + b)(a + c)}}, \qquad J = \frac{a}{a + b + c}.

For a perfect classification the values of all statistics are one, and they approach zero for disjoint classifications.
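The sketch below computes the pair counts of Equations 3-6 from a contingency table (Table 1) and then the four statistics. For the corrected Rand statistic we use the common Hubert-Arabie form [16], in which the expected value is taken under a hypergeometric independence model; whether this matches the paper's exact E(R) is an assumption on our part:

```python
import numpy as np
from math import comb

def pair_counts(table):
    """a, b, c, d of Eqs. 3-6 from an integer contingency table."""
    n = int(table.sum())
    a = sum(comb(int(x), 2) for x in table.ravel())
    rows = sum(comb(int(x), 2) for x in table.sum(axis=1))   # sum_i C(n_i., 2)
    cols = sum(comb(int(x), 2) for x in table.sum(axis=0))   # sum_j C(n_.j, 2)
    b, c = cols - a, rows - a
    d = comb(n, 2) - a - b - c
    return a, b, c, d

def statistics(table):
    a, b, c, d = pair_counts(table)
    n2 = a + b + c + d                                 # = C(n, 2)
    R = (a + d) / n2
    exp_a = (a + b) * (a + c) / n2                     # E(a) under independence
    corrected = (a - exp_a) / (0.5 * ((a + b) + (a + c)) - exp_a)
    FM = a / ((a + b) * (a + c)) ** 0.5
    J = a / (a + b + c)
    return R, corrected, FM, J

print(statistics(np.array([[20, 2], [3, 25]])))
```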
3.2 Data Sets
The data sets were generated using multivariate normal distributions. This approach has been used in various comparisons of clustering algorithms [2] [3] [4] [9] [18] [29] [13] [22] [25]. We used a modified form of the data generation algorithm that Milligan et al. used to examine the properties of cluster analysis algorithms [22], [23], [24], [25]. The algorithm generates externally isolated and internally cohesive classes. The user specifies some number k of distinctly non-overlapping classes to be generated; each class c can have m_c objects, each embedded in a d_c-dimensional space. The following types of noise were then added to the well-separated classes (a small generation sketch follows the list):

Additive noise: Let A_{ij} be the error-free value of the jth dimension of pattern i; then the noise-added value E_{ij} is given by E_{ij} = A_{ij} + \alpha e_{ij}, where e_{ij} is a noise value normally distributed with mean 0 and variance corresponding to the cluster and the dimension, and \alpha is a constant whose value specifies the magnitude of the added noise. Setting \alpha to 2 and 5 gives the low-level and high-level noise-added data.
Noise dimensions: A uniform random number generator is used to generate patterns to be added as random dimensions. Three noise dimensions are added to obtain the noise-dimension-added data.

Outliers: Twenty percent outlier patterns are added to obtain the outlier-added data. The distribution of the outliers is multivariate normal with the same mean as for the error-free class patterns; however, the variance matrix is multiplied by nine. A pattern is accepted as an outlier if it exceeds the class boundaries in at least one dimension.
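A minimal generation sketch for the procedure above, with additive noise E_ij = A_ij + alpha * e_ij. The separation scheme (`sep`, uniformly drawn means) and per-dimension variance ranges are our invented placeholders; Milligan's actual algorithm enforces stronger external-isolation constraints than this toy version does:

```python
import numpy as np

def generate(k=3, per_class=50, dims=8, alpha=0.0, sep=10.0, seed=0):
    """k multivariate normal classes with optional additive noise."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for c in range(k):
        mean = rng.uniform(0, sep, dims) + c * sep    # crude class separation
        var = rng.uniform(0.5, 2.0, dims)             # per-class, per-dimension
        A = rng.normal(mean, np.sqrt(var), (per_class, dims))   # error-free A_ij
        E = A + alpha * rng.normal(0, np.sqrt(var), (per_class, dims))
        X.append(E)
        y += [c] * per_class
    return np.vstack(X), np.array(y)
```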
Experiments were performed with data sets of various sizes, dimensions and numbers of classes, and with various types and levels of added noise, using ART2, Autoclass and Snob, measuring performance with the Rand, corrected Rand, Jaccard, and Fowlkes and Mallows statistics. We used both ordered and unordered data, where ordered data means that patterns from the same class are placed adjacent to each other, and unordered data is the result of randomly perturbing ordered data.
4 Results and Discussion

In all, 426 experiments were done. Detailed results are reported in [32]. Here we present and discuss the general trends (see Table 3).

Alg    size       dimensions   classes
ART2   declines   improves     improves then declines
Auto   improves   declines     improves
Snob   improves   improves     improves then declines

Table 3: Effect of increasing the data set size, the number of dimensions and the number of classes on classifier behavior.

The classification obtained by all three classifiers is affected by the order of the data. ART2 is affected by the order because the classification of a pattern depends on the current pattern and the weights learned by classifying previous patterns. Thus, when the sequence of the patterns changes, the learned weights may be different, which may lead to a different classification. In theory, Snob's information measure and Autoclass's posterior probability measure assure a unique optimal classification. However, because of the intractability of the optimality problem, Snob and Autoclass use heuristic searches to find sub-optimal classifications. Testing with unordered data is perhaps more realistic, and results obtained for unordered data should give users more useful information about the behavior of the classifiers. When patterns are unordered, ART2's performance decreases in general as the data set size increases. Increasing the number of dimensions, however, has the opposite effect on ART2, as it improves with an increase in the number of dimensions. Increasing the number of classes improves ART2's performance, but it eventually drops as the number of classes is further increased. Autoclass's performance improves in all cases, including the noisy data, as the data set size is increased. Adding dimensions, however, has the opposite effect, as its performance deteriorates with more dimensions in all cases. Increasing the number of classes also improves Autoclass in all cases. Snob performs perfectly for noiseless data sets of all sizes, but with noise its behavior is similar to ART2's. Adding dimensions improves Snob in all cases except for very noisy data. For noisy data, it performs poorly in most cases, and it can separate classes only for large-dimensional data. As we increase the number of classes, Snob performs perfectly for all cases except for noisy data, in which case it initially improves and then declines.
Alg    50    100   150   200   250   300   350
ART2   .13   .11   .10   .05   .08   .09   .10
Auto   .09   .06   .22   .23   .37   .45   .63
Snob   .00   .57   1.0   1.0   .98   .99   .99

Table 4: Value of the corrected Rand statistic when increasing the data set size from 50 to 350 in noise-added data, keeping the number of dimensions = 8 and the number of classes = 2.

Alg    4     8     12    16    24    32    48
ART2   .09   .13   .13   .05   .03   .16   .04
Auto   .48   0.1   .08   .05   .13   .07   .01
Snob   .00   .00   .00   .00   .00   .00   1.0

Table 5: Value of the corrected Rand statistic when increasing the number of dimensions from 4 to 48 in noise-added data, keeping the size = 50 and the number of classes = 2.

Alg    2     4     8     16    24
ART2   .08   .29   .73   .48   .46
Auto   .37   .46   .65   .87   .81
Snob   .98   .83   1.0   1.0   1.0

Table 6: Value of the corrected Rand statistic when increasing the number of classes from 2 to 24 in noise-added data, keeping the number of dimensions = 8 and the data set size = 250.

Generally, Snob performs the best and ART2 the worst (see Table 4). ART2's poor performance may be attributed to the fact that the classes were generated using normal distributions, and both statistical classifiers look for normally distributed classes. However, distorting the normal distribution by adding varying amounts of noise to each point adversely affects ART2 more than it affects Snob and Autoclass. The only test condition under which ART2 consistently performs better than the statistical classifiers is the addition of outliers. This shows the dependence of Autoclass and Snob on the normal shape of the classes.
One problem faced by applied researchers is to decide which variables are important to the classification and which ones are irrelevant. We simulate this condition by adding three noise dimensions, which carry no information about the true class structure, to noiseless data. Snob is least affected by the addition of error dimensions and ART2 the most, indicating that for such data Snob is the better choice. In the case of higher-dimensional data, Autoclass performs worse, which is opposite to the trend observed for ART2 and Snob. Autoclass also performs better for data having a large number of classes; in most cases its performance is better for the 24-class data than for the two-class data. Autoclass should be chosen if the data to be classified has a large number of classes but few dimensions.
5 Conclusion

We have performed an empirical comparison of Autoclass, ART2 and Snob, highlighting the strengths and weaknesses of each classifier. We have also attempted to lay out a systematic method for comparing unsupervised classifiers. A possible extension of this method would be to test whether the differences in classifier performance are statistically significant. Significance tests could also be used to quantify the observations about changes in the performance of the classifiers as a result of varying different conditions, such as the addition of various types of noise or increases in the size, dimensions or number of classes. We hope that this study will stimulate further research in the AI literature on the empirical comparison of unsupervised classifiers.

Acknowledgement: The authors would like to thank David Dowe, Paulo Gaudiano and Will Taylor for their help in getting the software.
References

[1] Michael R. Anderberg. Cluster Analysis for Applications. Academic Press, New York, 1973.

[2] C. K. Bayne, C. L. Beauchamp, C. L. Begovich, and V. E. Kane. Monte Carlo comparison of selected clustering procedures. Pattern Recognition, 12:51-60, 1980.

[3] R. K. Blashfield. Mixture model tests of cluster analysis. Psychological Bulletin, 83:377-388, 1976.

[4] R. K. Blashfield, M. S. Aldenderfer, and L. C. Morey. Cluster analysis software. Handbook of Statistics, 2:245-266, 1982.

[5] Peter Cheeseman, James Kelly, Matthew Self, John Stutz, Will Taylor, and Don Freeman. Autoclass: A Bayesian classification system. In Michael B. Morgan, editor, Proceedings of the Fifth International Conference on Machine Learning, pages 54-64, San Mateo, CA, 1988. Morgan Kaufmann.

[6] Peter Cheeseman, John Stutz, and Robin Hanson. Bayesian classification with correlation and inheritance. In Proceedings of the 12th International Joint Conference on Artificial Intelligence, Sydney, Australia, pages 692-698, San Mateo, CA, 1991. Morgan Kaufmann.

[7] Arthur P. Dempster. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38, 1977.

[8] Richard C. Dubes and Anil K. Jain. Algorithms for Clustering Data. Prentice Hall, New York, 1988.

[9] C. Edelbrock. Mixture model tests of hierarchical clustering algorithms: the problem of classifying everybody. Multivariate Behavioral Research, 14:367-384, 1979.

[10] C. Feng, A. Sutherland, R. King, S. Muggleton, and R. Henry. Comparison of machine learning classifiers to statistics and neural networks. In Fourth International Workshop on Artificial Intelligence and Statistics, pages 41-52, 1993.

[11] Douglas H. Fisher and Kathleen B. McKusick. An empirical comparison of ID3 and back propagation. In Proceedings of the Eleventh IJCAI, pages 788-793, San Mateo, CA, 1989. Morgan Kaufmann.

[12] E. Fowlkes and C. Mallows. A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78:553-569, 1983.

[13] A. L. Gross. Monte Carlo study of the accuracy of a hierarchical grouping procedure. Multivariate Behavioral Research, 7:379-389, 1972.

[14] Stephen Grossberg and Gail Carpenter. Pattern Recognition by Self-Organizing Neural Networks. MIT Press, Cambridge, Massachusetts, 1991.

[15] John Hartigan. Clustering Algorithms. Wiley, New York, 1975.

[16] L. J. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193-218, 1985.

[17] Anil Jain, A. Indrayan, and L. Goel. Monte Carlo comparison of six hierarchical clustering methods on random data. Pattern Recognition, 19:95-100, 1986.

[18] F. K. Kuiper and L. Fisher. Monte Carlo comparison of six clustering procedures. Biometrics, 31:777-783, 1975.

[19] A. S. Lapedes and R. M. Farber. How neural nets work. In Y. S. Lee, editor, Evolution, Learning and Cognition. World Scientific, Singapore, 1988.

[20] J. E. Mezzich and H. Solomon. Taxonomy and Behavioral Science. Academic Press, London, 1980.

[21] Ryszard S. Michalski. A theory and methodology of inductive learning. In R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, editors, Machine Learning: An Artificial Intelligence Approach. Morgan Kaufmann, Los Altos, CA, 1983.

[22] Glenn W. Milligan. An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45:325-342, 1980.

[23] Glenn W. Milligan. A Monte Carlo comparison of 30 internal criterion measures for cluster analysis. Psychometrika, 46:187-195, 1981.

[24] Glenn W. Milligan and C. Cooper. An examination of the procedures for determining the number of clusters in data. Psychometrika, 50:159-179, 1985.

[25] Glenn W. Milligan, Lisa M. Sokol, and S. C. Soon. The effect of cluster size, dimensionality, and the number of clusters on recovery of true cluster structure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(1):40-47, January 1983.

[26] Raymond Mooney, Jude Shavlik, Geoffrey Towell, and Allen Gove. An experimental comparison of symbolic and connectionist learning algorithms. In Proceedings of the Eleventh IJCAI, pages 775-780, San Mateo, CA, 1989. Morgan Kaufmann.

[27] J. D. Patrick. Snob: A program for discriminating between classes. Technical Report TR 151, Department of Computer Science, Monash University, Clayton, Victoria 3168, Australia, 1991.

[28] Lutz Prechelt. A study of experimental evaluations of neural network learning algorithms: Current research practice. Technical Report 19/94, Fakultät für Informatik, Universität Karlsruhe, Karlsruhe, Germany, August 1994.

[29] W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66:846-851, 1971.

[30] C. Shannon. A mathematical theory of communication. Bell System Technical Journal, 24:379-423 and 623-656, 1948.

[31] R. Solomonoff. Formal theories of inductive inference I and II. Information and Control, 7:1-22 and 224-254, 1964.

[32] M. Afzal Upal. Monte Carlo comparison of non-hierarchical unsupervised classifiers. Master's thesis, University of Saskatchewan, 1995.

[33] M. Afzal Upal and Eric Neufeld. Comparison of Bayesian and neural net unsupervised classification techniques. In Proceedings of the Sixth Annual Symposium on Computational Science, University of Saskatchewan, pages 152-163, 1994.

[34] C. S. Wallace and M. P. Georgeff. A general selection criterion for inductive inference. In T. O'Shea, editor, Proceedings of ECAI-84: Advances in Artificial Intelligence, pages 473-482. Elsevier, Amsterdam, 1984.

[35] C. S. Wallace and D. L. Dowe. Intrinsic classification by MML: the Snob program. In C. Zhang, J. Debenham, and D. Lukose, editors, Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pages 37-44. World Scientific, Singapore, 1994.

[36] Sholom M. Weiss and Ioannis Kapouleas. An empirical comparison of pattern recognition, neural nets, and machine learning classification methods. In N. S. Sridharan, editor, Proceedings of the Eleventh IJCAI, pages 781-787, San Mateo, CA, 1989. Morgan Kaufmann.