On data clustering with a flock of artificial agents
Fabien Picarougne, Hanene Azzag, Gilles Venturini
Laboratoire d’Informatique, Ecole Polytechnique de l’Université de Tours, 64, Avenue Jean Portalis, 37200 Tours, France.
{fabien.picarougne,hanene.azzag,gilles.venturini}@univ-tours.fr

Abstract
We present a new bio-inspired algorithm that dynamically creates and visualizes groups of data. This algorithm uses the concept of a flock of agents that move together in a complex manner by following simple local rules. Each agent represents one data item. The agents move with the aim of creating homogeneous groups of data that evolve together in a 2D environment. These groups are visualized in real time and help the domain expert to understand the underlying class structure of the data set: for example, a realistic number of classes, clusters of similar data, and isolated data. We present several extensions of this algorithm and results obtained on artificial and real-world data.
1. Introduction
Numerous data clustering algorithms exist, several of them using biological principles such as genetic algorithms, artificial ants or immune networks. This study deals with a model inspired by the social behavior of animals with respect to movement [3]. The social moves of artificial individuals can be used for the clustering problem as follows [2]: each artificial agent represents one data item, and all agents move according to the same set of behavioral rules. These rules are such that, after a given number of iterations, groups of homogeneous agents (i.e. data) are created. The work that we present here goes further in the analysis and evaluation of the performance of such algorithms: we have extended these algorithms with the concept of ideal distance, and we have introduced several key points such as the initialization of the agents, the use of a stereoscopic display, interactive requests and a stopping criterion. Finally, we have evaluated our method more precisely.
Christiane Guinot, C.E.R.I.E.S., 20 rue Victor Noir, 92521 Neuilly sur Seine Cédex, France.
[email protected]
2. Main principles of moves
We consider a set of n data (or examples) (e1, ..., en) to cluster. The only assumption that we make about the data is the existence of a similarity function denoted by Sim(i, j), where Sim(i, j) ∈ [0, 1]. We consider a population of n agents where the ith agent represents data ei. Agents move in a 2D environment. The main algorithm for controlling agents works as follows: initially, all agents are placed at a random position and with a random initial direction. Ideal distances are then computed for each pair of agents. Then, agents move and decide, according to a local rule, whether they must get closer to each other or not, and whether they should go in the same direction or not. Intuitively, this rule has two goals: (1) to establish an ideal distance between agents that is representative of the similarities of the data they represent, and (2) to let agents with similar data move in the same direction. From this local rule emerge groups of agents that move together and that define a partitioning of the data set.

We have proposed several extensions of this initial algorithm. First, the user may interact with the dynamic visualization by selecting agents, either individually or as a group, which results in displaying information about the selected data. This can be useful for checking why an agent is isolated and for determining whether this data item is, for instance, noisy or not. It also allows the user to get additional information about the agents located at the center of a group (which can thus be considered as good prototypes for this group). Then we have integrated this swarm algorithm into a virtual reality system presented in [1]. For this purpose, we have added a third dimension to all equations. The user can change their point of view with a 3D sensor (Ascension Flock of Birds) which is located on the hand and which can be used as a virtual camera. The user may also visualize the flock of agents with LCD stereoscopic
Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2004) 1082-3409/04 $20.00 © 2004 IEEE
Figure 1. Results obtained for (a) an artificial dataset and (b) the Iris database (gray levels represent the real classes).
glasses (Nuvision 60 GX). Finally, we have defined an automatic procedure for computing the final groups, which includes a stopping criterion based on spatial entropy (the algorithm automatically stops when the agents have reached an organized state) and a simple algorithm for determining a partitioning of the data.
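The principles of this section can be illustrated with a short Python sketch. The paper does not give the formulas, so everything below is an assumption made for illustration: the linear mapping from Sim(i, j) to an ideal distance, the parameters `d_min`, `d_max`, `radius`, `speed`, `gain` and `cell`, the grid-based definition of spatial entropy, and the test that stops once the entropy has stayed within `tol` over the last `window` iterations. All function names are hypothetical.

```python
import math
import random


def ideal_distance(sim, d_min=1.0, d_max=10.0):
    """Assumed linear mapping: similar data (sim near 1) should stay close."""
    return d_min + (1.0 - sim) * (d_max - d_min)


def random_init(n, size=50.0, rng=random):
    """Random positions in a size x size square and random unit headings."""
    pos = [(rng.uniform(0, size), rng.uniform(0, size)) for _ in range(n)]
    angles = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    return pos, [(math.cos(a), math.sin(a)) for a in angles]


def step(pos, heading, sim, radius=15.0, speed=1.0, gain=0.05):
    """One synchronous update of all agents; returns new (pos, heading)."""
    n = len(pos)
    new_pos, new_heading = [], []
    for i in range(n):
        hx, hy = heading[i]      # start from the agent's current direction
        cx = cy = 0.0            # displacement correcting the distances
        for j in range(n):
            if i == j:
                continue
            dx, dy = pos[j][0] - pos[i][0], pos[j][1] - pos[i][1]
            dist = math.hypot(dx, dy)
            if dist == 0.0 or dist > radius:
                continue         # only local neighbours influence agent i
            # err > 0: too far from j, move closer; err < 0: move away.
            err = dist - ideal_distance(sim[i][j])
            cx += gain * err * dx / dist
            cy += gain * err * dy / dist
            # Alignment: similar neighbours pull the heading toward theirs.
            hx += sim[i][j] * heading[j][0]
            hy += sim[i][j] * heading[j][1]
        # If the headings cancel out, the agent keeps a zero heading and
        # moves only by the distance correction during this step.
        norm = math.hypot(hx, hy) or 1.0
        hx, hy = hx / norm, hy / norm
        new_heading.append((hx, hy))
        new_pos.append((pos[i][0] + speed * hx + cx,
                        pos[i][1] + speed * hy + cy))
    return new_pos, new_heading


def spatial_entropy(pos, cell=5.0):
    """Shannon entropy of the agent counts over square cells of side `cell`."""
    counts = {}
    for x, y in pos:
        key = (math.floor(x / cell), math.floor(y / cell))
        counts[key] = counts.get(key, 0) + 1
    n = len(pos)
    return -sum(c / n * math.log(c / n) for c in counts.values())


def run(pos, heading, sim, max_iter=500, window=20, tol=1e-3):
    """Iterate until the entropy is stable over `window` steps (assumed test)."""
    history = []
    iterations = 0
    for iterations in range(1, max_iter + 1):
        pos, heading = step(pos, heading, sim)
        history.append(spatial_entropy(pos))
        recent = history[-window:]
        if len(recent) == window and max(recent) - min(recent) < tol:
            break
    return pos, heading, iterations
```

A typical call would build the matrix `sim` from the data, create the agents with `random_init(n)`, and pass both to `run`. In this sketch the neighborhood radius is fixed and the environment is unbounded, which are simplifications rather than properties of the method described here.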
3. Results and Conclusion
We have tested our method on numerous datasets, and we give here a very small sample. Once the algorithm is started, groups quickly appear among the data. As can be seen in the results of figure 1, the interest of the dynamic visualization of the data comes from the fact that the user may perceive a lot of information at once. The 2D distribution of the data gives information about the groups and their relative size, about the center of groups (the center contains very similar data), and about the edges of groups where data are less similar. For instance, for the Iris database (see figure 1(b)), the algorithm easily detects the fact that one class can be linearly separated from the two others, and that the two remaining classes are not separable. These are grouped in a single cluster. One may notice that in this cluster, data from the two real classes are not located at random positions but rather according to their similarities: one can clearly distinguish the frontier between the two classes. One also obtains a visualization of isolated data, which may be noisy and which do not belong to any larger group.

Our algorithm behaves well over the tested databases (when the obtained partition is compared to the real classes). The number of clusters found is close to reality, and most of the data are clustered in agreement with the real classes. We have compared our algorithm to the K-means algorithm. The obtained results are globally similar for both methods, with a slight advantage for our algorithm, which seems to better approximate the real number of classes. This confirms that our algorithm may also be used to find real clusters (which may be complementary to the visualization of data). We have applied our algorithm to a real-world database in collaboration with the Biometric Unit of the CE.R.I.E.S., a research center on healthy human skin funded by Chanel. We have confirmed the hypothesis of the existence of six classes in this domain (83% of the data are clustered in the same way as with standard data analysis techniques).

We have described in this paper a new method for creating and dynamically visualizing groups within a database. This method has shown that the principles used in flocks of agents in other domains can be adapted to solve a clustering and visualization problem. The interactive display of the clusters gives the user quick access to a lot of information, such as the number of clusters, their shapes, the similarity of the data they contain, and the isolated or noisy data. Other interesting points of our approach are that it does not need initial knowledge and that it can handle numeric or symbolic data. We have also studied the visual properties of this algorithm and its integration in our virtual reality system for visual data mining. We wish to extend this work along several directions.
Among them, we would like to let the agents automatically adapt the radius of their neighborhood.
References
[1] N. Monmarché, H. Marteau, J.-P. Gérard, C. Guinot, and G. Venturini. Interactive mining of multimedia databases with virtual reality. In Proceedings of the Third International Conference on Virtual Reality, pages 478–484, Hangzhou, China, April 9-12, 2002.
[2] G. Proctor and C. Winter. Information flocking: Data visualisation in virtual worlds using emergent behaviours. In J.-C. Heudin, editor, Proc. 1st Int. Conf. Virtual Worlds, VW, volume 1434, pages 168–176. Springer-Verlag, 1998.
[3] C. W. Reynolds. Flocks, herds, and schools: A distributed behavioral model. Computer Graphics (SIGGRAPH ’87 Conference Proceedings), 21(4):25–34, 1987.