How to use ants for hierarchical clustering - CiteSeerX

0 downloads 0 Views 709KB Size Report
Abstract. We present in this paper, a new model for document hierar- chical clustering, which is inspired from the self-assembly behavior of real ants. We have ...
How to use ants for hierarchical clustering H. Azzag1 , C. Guinot2 , and G. Venturini1 1

Laboratoire d’Informatique de l’Université de Tours, École Polytechnique de l’Université de Tours - Département Informatique 64, Avenue Jean Portalis, 37200 Tours, France. Phone: +33 2 47 36 14 14, Fax: +33 2 47 36 14 22 [email protected] [email protected] http://www.antsearch.univ-tours.fr/webrtic 2 C.E.R.I.E.S., 20 rue Victor Noir, 92521 Neuilly sur Seine Cedex. [email protected]

Abstract. We present in this paper, a new model for document hierarchical clustering, which is inspired from the self-assembly behavior of real ants. We have simulated the way ants build complex structures with different functions by connecting themselves to each other. Ants may thus build ”chains of ants” or form ”drops of ants”. The artificial ants that we have defined will similarly build a tree. Each ant represents one document. The way ants move, disconnect or connect themselves depends on the similarity between these documents. The result obtained is presented as a hierarchical structure with a series of HTML files with hyperlinks.

1

Introduction

Many web systems and applications often require the use of a clustering algorithm. For instance, one can define a portal site as a hierarchical partitioning of a set of documents (the documents contain information from different subject) which will repeat the following property: at each node (or category), the subcategories are similar to their mother, but they are as much dissimilar to each others. The major problem of the automatic construction of portal sites is the automatic definition of this hierarchy of documents which, in actual systems, must be given by a human expert [3][9][15][13]. If we want to work with an important number of documents, or if we wish to let the computer autonomously do all this work, then standard approaches are useless. Many researchers in computer science have been inspired by real ants [7] and have defined artificial ants paradigms for dealing with optimization or machine learning problems [2]. Therefore, several studies involve ants behavior and data clustering [6] [12] [10][4]. In this paper we deal with a new biological model based on artificial ants which as far as we know, has never been used before to solve computer science problems. We model the ability of ants to build structures by

assembling their bodies together[16] in order to discover, in a distributed and unsupervised way, a tree-structured organization of the document set as in actual system of portals sites. The remainder of this paper is organized as follows: in section 2, we give an overview of the biological model that we have use and we present the details of the AntTree algorithm. Its properties and the results obtained will be given in section 3. Section 4, concludes and describes future evolutions of this method.

2 2.1

AntTree Algorithm Biological inspiration

Real ants provide a stimulating self-organization model for the clustering problem. In this paper we consider a new biologically observed behavior: ants are able to build mechanical structures by a self-assembling behavior. Biologists observe for instance the formation of drops of ants [17], or the building of chains of ants [11]. These types of self-assembly behavior have been observed with Linepithema humiles Argentina ants and African ants of gender Oecophylla longinoda. The goal of drop structures built by Linepithema humiles is today still obscure. This ability has been recently experimentally demonstrated [17]. The drop can sometimes fall down. For Oecophylla longinoda ants, it can be observed that two types of chains are built: for crossing an empty space, and for building their nest [11]. In both cases, crossing chains and building chains, these structures disaggregate after a given time. From these self-assembly behaviors, we can extract properties that will constitute the framework of our algorithm: – ants build this type of live structures starting from a fixed support (stem, leaf,...), – ants can move on this structure whilst it is being currently built, – ants can cling anywhere on the structure because every position can be reached. Nevertheless, in the formation of chains for example, ants will preferably fix themselves at the end of the chain, because they are attracted by gravity or by the object to reach, – the majority of ants which constitute the structure can be blocked without any possibility of displacement. For example, in the case of a chain of ants, this corresponds to ants placed in the middle of the chain, – some ants (a number generally much more reduced than for blocked ants considered in the previous point) are connected to the structure but with a link which they maintain by themselves, they can thus be detached (disconnected) from the structure whenever they want. In the case of a chain of ants, this corresponds to the ants placed at the end of the chain, – we can observe a phenomenon of growth but also of decrease of the structure.

ant close to Apos

Moving Ant

Connected Ant

Disconnection of a group of ants

Fig. 1. Disconnection of ants

2.2

Artificial ants behavior

For our algorithm called AntTree. Each ant ai , i ∈ [1, N ] represents one document di to cluster. An ant ai is moving over the support denoted by a0 or over an other ant denoted by apos (see the ants colored in gray on figure 1) The similarity measure used (denoted by Sim(i, j)) is based on cosine measure [14] which encodes each text di and dj as a vector of word count. We have used a common weighting scheme, i.e. tf-idf (term frequency - inverse document frequency) to represent these vectors. tf denotes the word count of the document and idf denotes the inverse document frequency (document frequency is the number of documents which contain the considered word). The main principles of our deterministic algorithm are the followings: at each step, an ant ai is selected in a sorted list of ants (we will explain how this list is sorted in the following section) and will connect itself or move according to the similarity with its neighbors. While there is still a moving ant ai , we simulate an action for ai according to its position (i.e. on the support or on another ant). In the following we consider that apos denotes the ant or the support over which the moving ant ai is located, and a+ is the ant (daughter) connected to apos which is the most similar to ai (see figure 1). For clarity, consider now that ant ai is located on an ant apos and that ai is similar to apos . As will be seen in the following, when an ant moves toward another one, it means that it is similar enough to that ant. So ai will become connected to apos provided that it is dissimilar enough to ants connected to apos . ai will thus form a new sub-category of apos which is as dissimilar as possible from the other existing sub-categories. For this purpose, let us denote by TDissim (apos ) the lowest similarity value which can be observed among the daughters of apos . ai is connected to apos if and only if the connection of ai decreases further this value. The test that we perform consists in comparing ai to the most similar ant

a+ . If these two ants are dissimilar enough (Sim(ai , a+ ) < TDissim (apos )), then ai is connected to apos , else it is moved toward a+ . Since this minimum value TDissim (apos ) can only be computed with at least two ants, then the two first ants are automatically connected without any tests. This may result in "abusive" connections for the second ant. Therefore the second ant is removed and disconnected as soon as a third ant is connected (for this latter ant, we are certain that the dissimilarity test has been successful). When this second ant is removed, all ants that were connected to it are also dropped, and all these ants are placed back onto the support (see figure 1). This algorithm can thus be stated, for a given ant ai , as the following ordered list of behaviour rules: R1 (no ant or only one ant connected to apos ): ai connects to apos R10 (2 ants connected to apos , and for the first time): 1. Disconnect the second ant from apos (and recursively all ants connected to it) 2. Place all these ants back onto the support a0 3. Connect ai to apos R2 (more than 2 ants connected to apos , or 2 ants connected to apos but for the second time): 1. let TDissim (apos ) be the lowest dissimilarity value between daughters of apos (i.e. TDissim (apos ) = M in Sim(aj , ak ) where aj and ak ∈ {ants connected to apos }), 2. If ai is dissimilar enough to a+ (Sim(ai , a+ ) < TDissim (apos )) Then ai connects to apos 3. Else ai moves toward a+ When ants are placed back on the support, they may find another place where to connect using the same behavior rules. It can be observe that, for any node of the tree, the value TDissim (apos ) is only decreasing, which ensures the termination and convergence of the algorithm. One must notice that no parameters or predefined thresholds are necessary for using our algorithm: this is one major advantage, because ants based methods often need parameter tuning, as well as clustering methods which often require a user-defined similarity threshold.

3 3.1

Results Testing methodology

We have evaluated AntTree on 4 databases which have from N = 258 to 1025 texts (see table 1). The Reuters databases contains 1025 texts extracted from the reuteurs21578 database (8653 texts). Since clusters in the Reuters database can be very small (two or three texts for instance), we have selected some of the largest clusters only. The CE.R.I.E.S. database [5] contains 258 texts dealing with human skin (the CE.R.I.E.S. is a laboratory funded by Chanel). Database

Databases Size (# of documents) Size (Mb) # of classes Reuters 1025 4.05 9 CE.R.I.E.S. 258 3.65 17 Database 1 319 13.2 4 Database 2 524 20 7 Table 1. Description of used databases (see text for more explanation)

1 consists of web pages from different scientific topics (73 about scheduling, 84 about pattern recognition, 81 about TcpIp network, 94 about vrml courses). Database 2 consists of web pages with general topics (55 about c++ courses, 82 about the Danone food company, 86 about IEEE, 90 about cinema, 50 about the Le Monde newspaper, 63 about the sfr phone company, 101 about medicine category from Google’s directory). Database 1 and Database 2 are extracted from the web. The real classes of documents (Cr ) are of course not given to the algorithms. They are used in the final evaluation of the obtained partitioning. The evaluation of the results is performed with the number of found clusters Cf , with the purity Pr of clusters (percentage of correctly clustered document in a given cluster), and the classification error measure, denoted by Ec, this measure represents the proportion of document couples which are not correctly clustered, i.e. in the same real cluster but not in the same found cluster, and vice versa. We have tested three methods: AntTree Disc (the algorithm presented in this paper), AntTree Stoch [1] a previous stochastic version and AHC which is an efficient clustering method [8] (we have used the Ward criterion for cutting the dendrogram). 3.2

Analyzing the results

Databases Cr Reuters 9 CERIES 17 Database 1 4 Database 2 7

AntTree Ec Cf 0.22 10 0.26 7 0.16 9 0.06 9

Disc Pr 0.47 0.27 0.84 0.87

AHC Ec Cf Pr 0.21 5 0.50 0.21 7 0.30 0.09 7 0.82 0.23 3 0.52

AntTree Stoch Ec[σEc ] Cf [σCf ] Pr [σPr ] 0.35 [0.004] 12 [0.00] 0.40 [0.007] 0.15 [0.001] 17 [0.00] 0.37 [0.012] 0.29 [0.011] 7 [0.00] 0.68 [0.012] 0.10 [0.006] 8 [0.00] 0.80 [0.009]

Table 2. Comparative results obtained between AntTreeDisc , AntTree stoch and AHC (C is the number of clusters, P is the purity, and Ec a classification error).

Results are presented in tables 2 and 3. As far as the number of classes is concerned, AntTree Disc and Stoch obtain the best results compared to AHC,

Databases AntTree Disc AHC AntTree Stoch Reuters 0.81 120 1.51 CERIES 0.03 4 0.04 Database 1 0.08 6 0.12 Database 2 0.25 25 0.34 Table 3. Prossesing Time: seconds

except for Database 1 where the number of classes is small. The explanation is that AHC often get a lower number of classes than the other algorithms. The automatic cutting procedure of the dendogram consider the largest value of the Ward criterion. Purity values are comparable for AHC and AntTree Disc, except for Database 2 where AHC does not perform well. AntTree Disc performs better than AntTree Stoch for two databases. The definition of thresholds is one drawback of the stochastic method. If these thresholds are well tuned (by hand), then the results are very good. So obviously, the automatic threshold computation in AntTree Disc is efficient. Disconnecting the ants and placing them back onto the support also seems to increase the performances. The classification error results lead us to the following conclusions: results of AHC and AntTree Disc are comparable (equal for two databases, and one algorithm better than the other for the two other databases). AntTree Stoch does not perform as well as the two other methods. In table 3, we give the computation time needed once the similarity measure has been computed. AnTree was programmed in C++ and AHC in Visual Basic and the tests were performed on a standard PC (Pentium 2GHz, 512Mo). Both AntTree algorithms clearly outperform AHC. This is due to the fact that AntTree exploits the tree-structure very well and avoids exploring the whole tree when connecting a new ant. It is known for instance that the sorting algorithms based on trees (such as heap sorting for instance) have a lower complexity than other methods. When one consider the whole portal site generation, a run approximatively takes 1 second, including the text extraction from the HTML pages, the display of the tree and the generation of the HTML output as explained in the next section. 3.3

Portal site generation

When all of texts have been clustered in a tree, then it is easy to generate the corresponding portal site. The hierarchy of documents is represented in our actual implementation as a directory tree with indentation. The tree is encoded in a database in a few seconds and the generation of HTML files is dynamic using PHP language. Figure 2 gives an example of the portal home page obtained for Database 2 (524 documents) and in figure 3 we have integrated a search engine based on a word index which is automatically generated in the database.

Fig. 2. A typical portal site generated from the Database 2.

4

Conclusion

In this paper we have described a new algorithm directly inspired from the ants self-assembly behavior and its application to the automatic generation of portals sites. The results obtained are extremely encouraging and the main perspective of this work is to keep on studying this promising model. As future work, we are currently including the automatic extraction of keywords in order to automatically annotate the sub-categories of a given node. Another direction of research is to perform a sampling of the database when it contains many documents (e.g. more than 3000 pages for instance): the texts which have not been used for learning the tree are assigned to a category by following the path with highest similarity. This procedure is on again very fast. We also want to generalize AntTree to the generation of graphs (and not just trees). We could generate hypertexts with the same self-assembly principles.

References 1. N. Azzag, H. Monmarché, M. Slimane, G. Venturini, and C. Guinot. Anttree : a new model for clustering with artificial ants. In IEEE Congress on Evolutionary Computation, Canberra, Australia, 08-12 December 2003. 2. E. Bonabeau, M. Dorigo, and G. Theraulaz. Swarm Intelligence: From Natural to Artificial Systems. Oxford University Press, New York, 1999. 3. D. Filo and J. Yang. Yahoo!, 1997. 4. S. Goss and J.-L. Deneubourg. Harvesting by a group of robots. In Varela, editor, Proceedings of the First European Conference on Artificial Life, pages 195–204, Paris, France, 1991. Toward a Practice of Autonomous Systems. 5. C. Guinot, D. J.-M. Malvy, F. Morizot, M. Tenenhaus, J. Latreille, S. Lopez, E. Tschachler, and L. Dubertret. Classification of healthy human facial skin. Textbook of Cosmetic Dermatology Third edition (to appear), 2003.

Fig. 3. Searching through the portal site using "C++" term

6. J. Handl, J. Knowles, and M. Dorigo. On the performance of ant-based clustering. Proceedings of Hybrid Intelligent Systems (to appear), IOS Press, 2003. 7. B Hölldobler and E-O Wilson. The Ants. Springer Verlag, Berlin, 1990. 8. A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice Hall Advanced Reference Series, 1988. 9. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. On semi-automated web taxonomy construction. In WebDB, Santa Barbara, May 2001. 10. P. Kuntz, D. Snyers, and P. Layzell. A stochastic heuristic for visualising graph clusters in a bi-dimensional space prior to partitioning. Journal of Heuritics, 5(3), October 1999. 11. A. Lioni, C. Sauwens, G. Theraulaz, and J.-L. Deneubourg. The dynamics of chain formation in oecophylla longinoda. Journal of Insect Behavior, 14:679–696, 2001. 12. E.D. Lumer and B. Faieta. Diversity and adaptation in populations of clustering ants. In Proceedings of the Third International Conference on Simulation of Adaptive Behaviour, pages 501–508, 1994. 13. Andrew K. McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 3(2):127–163, 2000. 14. Gerard Salton and Christopher Buckley. Term weighting approaches in automatic text retrieval. In information processing and management, volume 25, pages 513– 523, 1988. 15. Mark Sanderson and W. Bruce Croft. Deriving concept hierarchies from text. In Research and Development in Information Retrieval, pages 206–213, 1999. 16. Christian Sauwens. Étude de la dynamique d’auto-assemblage chez plusieurs espèces de fourmis. Thèse de doctorat, Université libre de bruxelles, 2000. 17. G. Theraulaz, E. Bonabeau, C. Sauwens, J.-L. Deneubourg, A. Lioni, F. Libert, L. Passera, and R.-V. Solé. Model of droplet formation and dynamics in the argentine ant (linepithema humile mayr). Bulletin of Mathematical Biology, 2001.

Suggest Documents