AntTree: a New Model for Clustering with Artificial Ants - CiteSeerX

71 downloads 461 Views 842KB Size Report
Abstract- In this paper we will present a new clustering .... update its thresholds in the following manner: ..... Intelligence, Lyon, France, july 2002, IOS Press.
AntTree: a New Model for Clustering with Artificial Ants H. Azzag, N. Monmarch´e, M. Slimane, G. Venturini Laboratoire d’Informatique, Ecole Polytechnique de l’Universit´e de Tours, 64, Avenue Jean Portalis, 37200 Tours, France. Phone: +33 2 47 36 14 14, Fax: +33 2 47 36 14 22 [email protected] {monmarche,slimane,venturini}@univ-tours.fr

C. Guinot C.E.R.I.E.S., 20 rue Victor Noir, 92521 Neuilly sur Seine Cedex, France. [email protected]

Abstract- In this paper we will present a new clustering algorithm for unsupervised learning. It is inspired from the self-assembling behavior observed in real ants where ants progressively become attached to an existing support and then successively to other attached ants. The artificial ants that we have defined will similarly build a tree. Each ant represents one data. The way ants move and build this tree depends on the similarity between the data. We have compared our results to those obtained by the k-means algorithm and by AntClass on numerical databases (either artificial, real, or from the CE.R.I.E.S.). We show that AntTree significantly improves the clustering process.

convenient location in the tree. This behavior is directed by the local structure of the tree and by the similarity between data represented by ants. When all ants are fixed in the tree, this hierarchical structure can be interpreted in several ways: it can be seen as a partitioning of the data (that we will compare with other clustering methods), or it can be used for data visualization purposes, as in hierarchical clustering [JAI 99]. The remainder of this paper is organized as follows: in section 2, we give an overview of the underlying biological model. In section 3, we present the details of the AntTree algorithm. Its properties and comparative results will be given in section 4. In section 5, we finally draw some conclusions and describe future evolutions of this method.

1 Introduction Many data mining systems often require the use of a clustering algorithm: this is the case for instance when one wants to reduce a huge data collection to a few clusters which can easily be studied, or when one wants to discover a structure within unstructured data. Natural systems have evolved in order to solve many problems that can be related to clustering. Different species have developed social behaviors to tackle the problem of gathering objects or individuals. For instance, we can cite the brood sorting or cemetery organization of ants [FRA 92] or the collective movements in different species such as the ability of bacteria to form surprising spatial patterns and aggregations [CAM 01]. Many researchers in computer science have been inspired by real ¨ 90] and have defined artificial ants paradigms ants [HOL for dealing with optimization or machine learning problems (see a review in [BON 99]). In this paper, we propose the adaptation of a new biological model which, as far as we know, has never been used before to solve computer science problems. We model the ability of ants to build live structures with their bodies [LIO 01] in order to discover, in a distributed and unsupervised way, a tree-structured organization of the data set. Each ant represents a data and is initially placed on a fixed point, called the support, which corresponds to the root of the tree. The behavior of an ant consists of moving on already fixed ants to fix itself to a

2 A model for self-assembling behaviors in real ants Real ants provide a stimulating self-organization model for the clustering problem. Previous models involve the ability of ants to sort objects [LUM 94] [KUN 99][GOS 91] [MON 99] or to build a colonial odor [LAB 02]. Here, we consider another biologically observed behavior: ants are able to build mechanical structures by a selfassembling behavior. This can be for instance the formation of drops constituted of ants [THE 01], or the building of chains by ants with their bodies in order to link leaves together [LIO 01]. These types of self-assembly behavior have been observed with Linepithema humiles Argentina ants and African ants of gender Oecophylla longinoda. The goal of drop structures built by L. humiles is today still obscure. This ability has been recently experimentally demonstrated [THE 01]. The drop can sometimes fall down. For Oecophylla longinoda ants, it can be observed that two types of chains are built: for crossing an empty space, and on the other hand for building their nest [LIO 01]. In both cases, crossing chains and building chains, these structures disaggregate after a given time. From these self-assembly behaviors, we can extract properties that will constitute the framework of our algo-

rithm: • ants build this type of live structures starting from a fixed support (stem, leaf,...), • ants can move on this structure whilst it is being currently built, • ants can cling anywhere on the structure because every position can be reached. Nevertheless, in the formation of chains for example, ants will preferably fix themselves at the end of the chain, because they are attracted by gravity or by the object to reach,

Moving Ant

Connected Ant

Figure 1: General principles of tree building with artificial ants.

• the majority of ants which constitute the structure can be blocked without any possibility of displacement. For example, in the case of a chain of ants, this corresponds to ants placed in the middle of the chain, • some ants (a number generally much more reduced than for blocked ants considered in the previous point) are connected to the structure but with a link which they maintain by themselves, they can thus be detached from the structure whenever they want. In the case of a chain of ants, this corresponds to the ants placed at the end of the chain, • we can observe a phenomenon of growth but also of decrease of the structure. In general, the motivation for using bio-inspired clustering techniques is twofold: they can avoid local minima thanks to their probabilistic behavior and they can produce high quality results without any prior knowledge of data structure (such as the number of clusters or an initial partition). In this work, in addition to these motivations, we are especially interested in showing that this new biological model may be a promising technique for achieving parallel tree-based clustering.

3 The AntTree algorithm 3.1 General principles To obtain a partitioning of the data, we build a tree where nodes represent data and where edges remain to be discovered. One should notice that this tree will not be strictly equivalent to a dendogram as used in standard hierarchical clustering techniques: each node in our tree will correspond to one data while this is not the case in general for dendrograms, where data only correspond to leaves [JAI 99]. Nevertheless, as it will be shown in section 4, it is possible to interpret our tree in different ways. We consider in the following that each data can be described by any representation language, provided that there

apos

ant close to apos

Figure 2: Computation of an ant’s neighborhood. exists a similarity measure between two data. In the following, we denote this similarity measure by Sim(i, j) which gives, for a couple of data (di , dj ), i ∈ [1, N ], j ∈ [1, N ], a value in [0, 1] (N indicates the number of data). 0 means that di and dj are totally different and 1 means that they are identical. We do not need any additional hypothesis about data representation. The main principles of our algorithm called AntTree are the following (see figure 1): each ant represents a node of the tree to be assembled, i.e. a data to be clustered. On the basis of a root materialized by a fictitious node a0 which represents the support on which the tree will be built , ants will gradually fix themselves on this initial node, and then successively on the ants fixed to this node, and so on until all ants are attached to the structure. All these moves and these fixings depend on the value returned by Sim(i, j), and on the local neighborhood of the moving ants. Thus, for each ant ai , i ∈ [1, N ] we need to define the following concepts: • the outgoing link of ai is the link that ai can maintain toward another ant, • the incoming links of ai are the links that the other ants maintain toward ai , these bonds can be the legs of the ant,

• the data di represented by ai ,

1. If no ant is connected yet to the support a0 Then connect ai to a0

• a similarity threshold TSim (ai ) and a dissimilarity threshold TDissim (ai ) that will be locally updated by ai .

2. Else (a) let a+ be the ant connected to a0 which is the most similar to ai

During the assembly of the structure, each ant ai will be either:

(b) If Sim(ai , a+ ) ≥ TSim (ai ) Then move ai toward a+ /* ai is similar enough to a+ */

• moving on the tree: ai moving over the support a0 or over an other ant denoted by apos , but ai is not connected to the structure (see the ants colored in gray on figure 2 and 1). It is thus completely free to move on the support or toward another ant within its neighborhood. If apos denotes the ant where ai is located on, then ai will move randomly to any immediate neighbors of apos in the tree (considering in this case that edges are undirected, as shown in figure 2), • connected to the tree: ai can no be longer released anymore from the structure. Moreover we will consider the fact that each ant has only one outgoing link toward other ants and cannot have more than Lmax incoming links from other ants. This enables us to build a tree having a maximum degree of Lmax links for any given node (see figure 2). 1. initially, all ants are placed on the support and their similarity and dissimilarity thresholds are respectively initialized to 1 and 0 2. While there exists a non connected ant ai Do 3.

If ai is located on the support Then Support case

4.

Else Ant case

5. End While Figure 3: Main algorithm

3.2 Ants behavior Details about AntTree are given in figures 3, 4 and 5. At each step, an ant ai is selected in a sorted list of ants (we will explain how this list is sorted in the following section) and will connect itself or move according to the similarity with its neighbors. While there is still a moving ant ai , we simulate an action for ai according to its position (i.e. on the support or on another ant). The first ant is directly connected to the support. Next, for each ant ai , two cases need to be considered. The first case is when ai is on the support (figure 4): if ai is similar enough to a+ according to its similarity threshold

(c) Else i. If Sim(ai , a+ ) < TDissim (ai ) Then /* ai is dissimilar enough to a+ */ connect ai to the support a0 (or, if no incoming link available on a0 , decrease TSim (ai ) and move ai toward a+ ) ii. Else decrease TSim (ai ) and increase TDissim (ai ) /* ai is more tolerant */ Figure 4: Support case (where a+ is the ant which is the most similar to ai among the ants already connected to the support), ai is moved toward a+ in order to be possibly clustered in the same subtree, i.e. the same cluster. otherwise (i.e. ai is not similar enough to a+ ), if ai is dissimilar enough to a+ , then it is connected to the support. This means that we then create a new sub-tree, where ants will be as much dissimilar as possible to the ants in the other sub-trees connected to a0 . Finally, if ai is not similar or dissimilar enough to a+ , we update its thresholds in the following manner: TSim (ai ) ← TSim (ai ) ∗ 0.9 and TDissim (ai ) ← TDissim (ai ) + 0.01 in order to allow ai to be more tolerant and to increase its probability of being connected the next time it will be considered (in the meantime, other ants will have changed the tree). The second case is when ai is on top of another ant denoted by apos (figure 5): if there is a free incoming link for apos and if ai is similar enough to apos and dissimilar enough to ants connected to apos , then ai is connected to apos . In this case, ai represents the root of a new sub-tree or sub-cluster below apos . Its dissimilarity with other ants directly connected to apos is such that sub-clusters of apos will be well ”separated” from each other (while being similar to apos ). otherwise, ai is randomly moved toward a neighbor node of apos and its thresholds are updated in the same way than on the previous case. So, ai will move in the graph in order to find a better location to be connected.The algorithm ends when all ants are connected.

1. let apos denote the ant on which ai is located, and let ak denote a randomly selected neighbor of apos , 2. If Sim(ai , apos ) ≥ TSim (ai ) Then (a) Let a+ be the neighbor ant of apos which is the most similar to ai C2

(b) If Sim(ai , a+ ) < TDissim (ai ) Then Connect ai to apos (or, if no incoming link is available on apos , randomly move ai toward ak ) (c) Else decrease TDissim (ai ), increase TSim (ai ) and move ai toward ak 3. Else randomly move ai toward ak

C1

Figure 6: Interpreting the tree as clusters.

Figure 5: Ant case

Database

One should also note that ants have been used to build specific kind of trees in [AND 02] but with a ”standard” ACO algorithm (ant colony optimization) [BIA 02].

Art1 Art2 Art3 Art4 Art5 Art6 Art7 Art8 Iris Wine Glass Pima Soybean Thyroid CERIES

4 Experimental results 4.1 Testing methodology In order to evaluate and compare results obtained by AntTree, we have used databases where data is represented with numerical attributes (one data is a vector of reals). Two kinds of databases are used: artificially generated ones (Art1,..., Art8) and real ones (Iris, Wine, Glass, Pima, Soybean and Thyroid) from the Machine Learning Repository [BLA 98]. We have also used data from the CE.R.I.E.S. research center on healthy human skin domain [GUI 03]. These databases are supervised ones because for each data we know the cluster it belongs to. In this way, it is possible to evaluate the clusters we obtain with respect to these real classes (but of course real classes are not given to the clustering methods). We denote by ki the known class number for data di and by ki0 the class number computed by a clustering method. K corresponds to the real number of classes and K 0 corresponds to the number of classes found by one of our methods. We have used the following classification error measure: Ec =

2 N (N − 1)

X

ij

AntTree (random order) Ec [σEc ] K 0 [σK 0 ] 0.44 [0.01] 2.36 [0.08] 0.27 [0.01] 1.94 [0.03] 0.37 [0.01] 2.26 [0.08] 0.27 [0.00] 3.80 [0.05] 0.36 [0.02] 3.36 [0.09] 0.53 [0.02] 1.68 [0.06] 0.66 [0.00] 3.68 [0.06] 0.66 [0.01] 3.74 [0.06] 0.27 [0.01] 2.36 [0.08] 0.64 [0.00] 2.04 [0.04] 0.67 [0.01] 2.98 [0.07] 0.45 [0.00] 1.92 [0.11] 0.10 [0.00] 3.90 [0.04] 0.36 [0.00] 2.76 [0.06] 0.58 [0.02] 2.30 [0.11]

K 4 2 4 2 9 4 1 1 3 3 7 2 4 3 6

N 400 1000 1100 200 900 400 100 1000 150 178 214 798 47 215 259

Table 1: Results obtained by AntTree for a random initial order of objects. Ec corresponds to the averaged classification error on 50 runs and K 0 to the averaged number of classes, σEc and σK 0 are the corresponding standard deviations, and N a number of data.

v u M u 1 X Sim(i, j) = 1 − t (vik − vjk )2 M

(3)

k=1

where M denotes the number of numerical attributes used for describing each data, and where vik is the value of the k th attribute for the ith data.

(1)

(i,j)∈{1,...,N }2 ,i