An Interactive Visualization Environment for Data Exploration Using Points of Interest
David Da Costa1,2 and Gilles Venturini2
1 AGICOM, 3 degré Saint Laumer, 41000 Blois, France, [email protected]
2 Laboratoire d’Informatique, 64, Avenue Jean Portalis, 37200 Tours, France, {david.dacosta, venturini}@univ-tours.fr
Abstract. We present an interactive method for visualizing numeric or symbolic data that allows a domain expert to extract useful knowledge and information. We propose a new approach based on points of interest (POIs), applied here in the context of visual data mining. POIs are located on a circle, and data are displayed within this circle according to their similarities to these POIs. Several interactive actions are possible: selection, zoom, and dynamic change of POIs. We evaluate the properties of this visualization on standard data sets with known characteristics. We describe an industrial application that explores the results of satisfaction surveys.
1 Introduction
The methods of "visual data mining" (VDM) try to solve the problems of interpretation and interaction in the knowledge discovery process by using dynamic visualizations and graphical requests on the represented data and knowledge [5], [11], [12]. Among the traditional examples, we can mention Chernoff faces [4], which encode data into icons and rely on the fact that the human mind easily analyzes the resemblances and differences between faces. We can also mention "scatter plots" [2], which make it possible to obtain multiple views of the data and to observe the data using graphical techniques such as "brushing" (selecting data in one view highlights the same data in the other views). These methods bring innovations and pursue goals that are promising for the field of VDM: the use of visual perception and often of preattentive perception [6], dynamic interaction with the data, ease of use, and direct use of the results. However, these methods also have limits as far as VDM is concerned: the visualized data are generally numerical; the visualizations and their handling require user training (as is the case, for example, for interpreting graphs like "parallel coordinates" [8]); and dynamic interaction requires substantial computing resources (real-time modifications) and therefore needs the fastest possible algorithms, which must in addition provide as much information as possible.

In this work, we suggest a new VDM method adapted from the methods involving points of interest that are used for the visualization of textual data. Our objectives, in addition to those of VDM, are: to be able to represent all types of data, provided that a similarity function (or distance) exists between the data; to have a very fast display when working with dynamic interactions and, if possible, to handle large volumes of data (algorithms with linear time and space complexities w.r.t. the number of data); and to use a visualization requiring the shortest possible training time, so that it is understood by the majority of potential users, who are not experts in data mining.
2 Background of the visualization methods involving points of interest
These methods are known as "points of interest" or "points of reference" methods. They consist in positioning specific icons (POIs) on a circle and then displaying the data icons within this disc at locations determined by the similarity between the POIs and the data. For example, this visualization has been used to display the documents returned by a search engine request, which makes it easier to navigate among them. The selected POIs are generally keywords used in the request, and the data are the documents, whose locations are determined by the proportion of keywords they contain. The choice of the keywords depends on their frequency in the documents. To visualize these data, one generally uses graph drawing techniques based on springs and forces: the force exerted between a POI and a data is proportional to the similarity between them. The VIBE system [9], SQWID [10], Radviz [7] and the radial visualization of the Information Navigator system [1] use these principles. It is sometimes difficult to see exactly toward which point of interest a data is attracted. In this case, these systems make it possible to remove or add points of interest on the circle in order to obtain a better representation of the data. These are the main interactive operations offered by these methods.

Radial [1] is a typical example of such visualization methods. First, after extracting the results of the request, Radial identifies a series of key terms relating to these results. The 12 most important terms are then arranged around a circle. The list of displayed terms can be modified by choosing from two lists placed on the left of the screen. The documents are then displayed inside the disc. Only the data related to the keywords placed around the circle are displayed. A document behaves as if it were suspended by springs connected to the keywords. It is thus impossible to move a document by clicking on it, because of the forces exerted by the springs. On the other hand, clicking on a point highlights the keywords related to this point, and a bubble displays information on this data. It is possible to move the terms outside the circle and thus to move all the data nodes connected to these terms. This makes it possible to manually classify the results into different categories.
All these systems showed that this type of dynamic visualization is of great interest to the user, who can extract information quite easily. The speed of display, coupled with the interaction possibilities, adds to the value of these methods. In addition, as we will show in this paper, they can visualize data of various types. To our knowledge, they have not yet been used for data mining in the way we present in the following sections.
3 Using Points of Interest in Data Mining
3.1 Basic principles of visualization
One considers n data $D_1, \dots, D_n$ and a matrix of similarity Sim between these data. $Sim(i, j)$ is the similarity between the data $D_i$ and $D_j$, this matrix being symmetrical and with a diagonal of 1s. If $Sim(i, j) = 1$ then the data $D_i$ and $D_j$ are identical, and if $Sim(i, j) = 0$ then they are completely different. Initially, we will consider that the POIs are a subset of these data, denoted by $D_1, \dots, D_k$. We display these k data on a circle with an arc of constant length between each POI (see figure 1).

Fig. 1. Basic principles of our visualization with the positioning of a data Di according to k POIs (the figure shows six POIs D1, ..., D6 on the circle and a data Di attracted with weights w1, ..., w6).
One then wants to position the n − k remaining data according to their similarity to the POIs $D_1, \dots, D_k$. We use the following formulas to calculate the display coordinates $(X_{D_i}, Y_{D_i})$ of data $D_i$:

$$X_{D_i} = w_1 \times X_{D_1} + \dots + w_k \times X_{D_k} \qquad (1)$$

$$Y_{D_i} = w_1 \times Y_{D_1} + \dots + w_k \times Y_{D_k} \qquad (2)$$

with $w_1 + \dots + w_k = 1$. For this data $D_i$ and a POI $D_j$, the weight $w_j$ is computed in the following way:

$$w_j = \frac{Sim(D_i, D_j)}{\sum_{p=1}^{k} Sim(D_i, D_p)} \qquad (3)$$
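As an illustration of equations (1) to (3), the positioning can be sketched in a few lines of Python. This is a minimal sketch and not the authors' implementation: the function names `poi_positions` and `project`, the unit circle centred on the origin, and the handling of a data with zero similarity to every POI are assumptions made for the example.

```python
import math


def poi_positions(k, radius=1.0):
    """Place k POIs on a circle centred on the origin, with equal arcs."""
    return [
        (radius * math.cos(2.0 * math.pi * j / k),
         radius * math.sin(2.0 * math.pi * j / k))
        for j in range(k)
    ]


def project(sims, pois):
    """Return the 2D position of one data.

    sims[j] is Sim(D_i, D_j) for POI j; pois[j] is the (x, y) of POI j.
    """
    total = sum(sims)
    if total == 0.0:
        # Assumption (this case is not covered in the text): a data with
        # zero similarity to every POI is drawn at the centre of the disc.
        return (0.0, 0.0)
    weights = [s / total for s in sims]                     # equation (3)
    x = sum(w * px for w, (px, _) in zip(weights, pois))    # equation (1)
    y = sum(w * py for w, (_, py) in zip(weights, pois))    # equation (2)
    return (x, y)
```

With four POIs, `project([1.0, 1.0, 1.0, 1.0], poi_positions(4))` falls at the centre of the disc (up to rounding), while `project([1.0, 0.0, 0.0, 0.0], poi_positions(4))` coincides with the first POI.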
If $D_i$ is equally similar to all of the POIs, it will be displayed in the middle of the disc. Conversely, if it is completely similar to one POI and completely different from the others, its position will coincide with that POI. If its similarity is biased towards certain POIs, then it will tend to approach these POIs. More generally, our method is such that two data close to one another in the initial representation will also be close with respect to the POIs, and they will thus be close in the 2D space. The visualized space thus becomes a space of distances between selected points (POIs) and the data. This is how the method can deal with any type of data. We use a Euclidean distance for numerical data or a Hamming distance for symbolic data. However, the converse of this property does not hold: two data close in the 2D space are not necessarily close in the original space (all the points at equal distance from two POIs in the initial space form a perpendicular bisector and are thus displayed at the same 2D location). It will be necessary to use other methods to remove these ambiguities (see the last section). Finally, displaying requires very little calculation and only needs to compute a part of the similarities (k × (n − k)).

Several questions are raised by this method. First of all, the initial POIs must be chosen. Initially, we consider that if the data are supervised (a class label is available), then we take the first representative of each class as initial POIs. There will thus be as many POIs as classes in the first visualization suggested to the user. If the data are not supervised, we choose the first k data. Other automatic choices are possible (and certainly more judicious), as we describe in the last section; here we only try to suggest initial choices that the user will be able to modify interactively and dynamically according to what is displayed (see the following section).
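The initial choice of POIs and the computation of the k × (n − k) similarities can be sketched as follows. The text does not state how a distance is turned into a similarity in [0, 1]; the mapping `1 - d / d_max` used here, the bound `d_max`, and all function names are assumptions made for this example only.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two numerical vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming(a, b):
    """Hamming distance: number of attributes on which a and b differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def similarity(a, b, dist, d_max):
    """Map a distance to [0, 1].

    Assumption: similarity = 1 - distance / d_max, where d_max is an
    upper bound on the distance chosen by the caller.
    """
    if d_max == 0:
        return 1.0
    return max(0.0, 1.0 - dist(a, b) / d_max)


def initial_pois(n, labels=None, k=3):
    """Indices of the initial POIs.

    Supervised data: the first representative of each class.
    Unsupervised data: the first k data.
    """
    if labels is not None:
        first_seen = {}
        for i, label in enumerate(labels):
            first_seen.setdefault(label, i)
        return list(first_seen.values())
    return list(range(min(k, n)))


def poi_similarities(data, poi_idx, dist, d_max):
    """Compute only the k x (n - k) similarities needed for the display."""
    poi_set = set(poi_idx)
    return {
        i: [similarity(item, data[j], dist, d_max) for j in poi_idx]
        for i, item in enumerate(data)
        if i not in poi_set
    }
```

For instance, `initial_pois(5, labels=["a", "a", "b", "c", "b"])` selects the data at indices 0, 2 and 3, one per class.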
A second question comes from the order of the POIs: if a great number of data are attracted by two POIs, then it is desirable that these POIs be close to each other on the circle. A critical situation would consist in placing these POIs diametrically opposite each other, which would generate unreadable visualizations (many data in the center). We propose an interactive solution to this problem in the following section, but automatic solutions can clearly be found, such as ordering the POIs according to their similarity (see the last section). It would also be possible not to preserve a fixed arc length between POIs, in order to show the similarities that exist between them.

3.2 Interaction
To be really efficient, the visualization of information must be interactive: it must make it possible to dynamically refine the display and to answer the graphical requests of the user. In a visualization with POIs, the user can make the following requests: what is this data (or this POI); how to enlarge this part of the visualization (zoom without loss of context); and how to change the POIs (to remove some, to add some, to change their order, and possibly to define POIs which are not necessarily data of the initial database).
When the mouse is moved over a point/data, we indicate what this point is. It is then possible to focus on a data by clicking on it. The zoom carries out the following operations: it centers the data in the middle of the disc; it enlarges the area centered on this data and pushes the other data toward the edges of the visualization. The distortion is calculated using a hyperbolic function. This zoom makes it possible to enlarge the view while preserving the context of the global data display.

As far as the POIs are concerned, the main possible interactions are the following. First of all, it is possible to remove a POI. This is done very simply by dragging a POI inside the disc. This POI takes its place back among the data, and the view is dynamically recomputed. A dynamic and progressive transition is performed so that the user can follow the change of representation. The user then has the possibility to cancel this action, which puts the POI back on the circle. It is also possible to choose a data and to define it as a POI. For this purpose, one drags the data onto the circle. If the data is placed on a POI, it replaces this POI, and if it is placed between two POIs, it is inserted between them. The length of the arcs between POIs is kept constant. These functionalities are very significant since they allow the user to redefine the representation at will. However, the initial k POIs are important for this method because it is necessary to have at least three POIs around the circle. If there are only two POIs, then all the data will be placed on the line formed by these two POIs. If too many POIs are placed around the circle, then all the data shrink toward the center of the visualization. Lastly, it is possible to generalize POIs so that they are no longer necessarily data, but more generally any point of the space of representation, and even any object for which it is possible to compute a similarity with the data.
Thus, one can represent ”ideal” data, not really existing, and according to which the user would like to position the real data. We present in section 4 a typical application of this functionality. Also, it would be possible to represent for example a decision rule as a POI, and to place the data according to their matching with this rule. This functionality offers many perspectives by visualizing not only data but also knowledge.
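The POI operations described in this section (removing a POI, replacing a POI by a data, inserting a data between two POIs, with arcs of constant length) can be illustrated by a small data structure. This is only a sketch: the class name `PoiRing`, its methods, and the use of identifiers to designate data are hypothetical and are not taken from the tool described here.

```python
import math


class PoiRing:
    """Ordered list of POI identifiers laid out on a circle."""

    def __init__(self, pois):
        self.pois = list(pois)

    def positions(self, radius=1.0):
        """Recompute the layout so that all arcs have the same length."""
        k = len(self.pois)
        return {
            poi: (radius * math.cos(2.0 * math.pi * j / k),
                  radius * math.sin(2.0 * math.pi * j / k))
            for j, poi in enumerate(self.pois)
        }

    def remove(self, poi):
        """The POI is dragged inside the disc and becomes a data again."""
        self.pois.remove(poi)

    def replace(self, old, new):
        """A data dropped on an existing POI takes its place."""
        self.pois[self.pois.index(old)] = new

    def insert_between(self, left, right, new):
        """A data dropped between two adjacent POIs is inserted there."""
        i = self.pois.index(left)
        j = self.pois.index(right)
        if (i + 1) % len(self.pois) != j:
            raise ValueError("left and right must be adjacent on the circle")
        self.pois.insert(i + 1, new)
```

After each operation, calling `positions()` again gives the new POI coordinates, from which the positions of the remaining data can be recomputed with the weighted formulas of section 3.1.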
4 Results
4.1 Artificial and traditional bases
We have evaluated this method on various artificial and standard data sets. Figure 2 represents an artificial database Art1 made up of 400 data and 4 classes. We illustrate in particular the effects of a zoom. When there are only two classes, the data of the two classes are positioned on the segment ranging from POI1 to POI2. To help the user find a better visualization of the data, one may add one or several POIs. Finally, we have tested our approach on traditional databases from the "Machine Learning Repository" [3]. Figure 3 thus represents the Iris database (150 data, 3 classes), the Wine database (178 data, 3 classes) and the Segment database (2310 data, 7 classes). The expected shapes of these databases are easily found in our visualization (as for Iris and Wine, for example).
Fig. 2. Visualization of Art1 data without zoom (a) and with zoom (b).
4.2 Real world application
Some of the activities of Agicom consist in collecting data that result from satisfaction surveys based on questionnaires. These data can be considered as an individuals × variables table in which the variables are qualitative (with ordered values). The values of such variables can be, for instance, "delighted", "satisfied", "unsatisfied", "disappointed" and "NSP" ("Do not know"). For a domain expert to be able to exploit these data, it is important to be able to visualize the satisfaction of the customers graphically. The aims of the expert are, for instance, to detect possible correspondences between individuals, to follow how customers move from one segment to another, and to visualize the relation between a given variable and the other variables. Our goal is thus to design a tool for representing the results of the satisfaction surveys, with the final aim of understanding and improving user satisfaction. We have evaluated and tested our method on the Agicom1 database, made up of 31 unsupervised data. Figure 4(a) illustrates this first application, in which the POIs are not data but typical profiles of variables. A profile thus corresponds to a distribution of values (answers) for this variable. In this manner, the POIs represent various known typologies of variables (like very positive answers, or extremely positive or negative answers, etc.). We present a second example on the Agicom2 database (see figure 4(b)). In this application, we have allowed the Agicom users to interact with the characteristics of the POIs and with the zoom. Moreover, in this database we allowed the visualization of the various classes. A validation with real users is currently under study.
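To make the notion of a profile used as a POI more concrete, the following sketch compares the answer distribution of a variable with a few typical profiles. It is purely illustrative: the profile values are hypothetical, and the similarity measure (one minus the total variation distance between two distributions) is an assumption, since the measure used in the Agicom application is not specified here.

```python
from collections import Counter

# Answer levels quoted in the text, in their given order.
LEVELS = ["delighted", "satisfied", "unsatisfied", "disappointed", "NSP"]

# Hypothetical typical profiles used as POIs (illustrative values only).
PROFILES = {
    "very positive": [1.0, 0.0, 0.0, 0.0, 0.0],
    "extremely positive or negative": [0.5, 0.0, 0.0, 0.5, 0.0],
    "undecided": [0.0, 0.0, 0.0, 0.0, 1.0],
}


def answer_distribution(answers):
    """Proportion of each answer level for one variable."""
    if not answers:
        raise ValueError("at least one answer is required")
    counts = Counter(answers)
    return [counts[level] / len(answers) for level in LEVELS]


def profile_similarity(p, q):
    """Assumed similarity: 1 minus the total variation distance."""
    return 1.0 - 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def similarities_to_profiles(answers):
    """Similarity of one variable to every profile POI."""
    dist = answer_distribution(answers)
    return {name: profile_similarity(dist, prof)
            for name, prof in PROFILES.items()}
```

The resulting similarities play the role of $Sim(D_i, D_j)$ when the POIs are profiles rather than data of the database.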
Fig. 3. Visualizations of Iris (a), Wine (b), and Segment (c) databases.
5 Conclusion and Perspectives
We have described in this paper a new visualization method inspired by the work performed in the context of points of interest. It consists in transforming an initial space represented by a similarity matrix into a visual 2D representation of these similarities. This method has advantages such as the speed of display, an intuitive presentation of the data, rather fast user learning, and interactive abilities. We have detailed its behavior on traditional data and finally within a real application being developed by Agicom.
Fig. 4. Visualization of Agicom1 (a) and Agicom2 (b) databases.
Several perspectives can emerge. We mentioned the importance of the choice of POIs as well as their location on the circle. A first extension consists in studying the use of an optimization algorithm in order to find the most effective ordering of POIs. This consists in finding some relevant permutations of the k chosen POIs. Another significant perspective consists in extending the visualization so as to remove ambiguities related to overlapping data. We intend to use a graph drawing method based on forces and springs in order to move apart the points that are too close to each other on the graph. We also want to distinguish between the points which are placed at the same location but which have different mean similarities with the POIs. A 3D approach will be tested soon for this purpose.
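As a simple illustration of the POI ordering problem mentioned among these perspectives, one possible heuristic is a greedy chaining in which each POI is followed on the circle by the most similar POI not yet placed. This is only a sketch of one candidate heuristic, not the optimization algorithm envisaged by the authors; the function name `greedy_poi_order` and the choice of POI 0 as the starting point are assumptions.

```python
def greedy_poi_order(sim):
    """Order POIs so that similar POIs tend to be neighbours.

    sim is a k x k symmetric matrix (list of lists) of similarities
    between POIs. Starting from POI 0, the next POI on the circle is
    the unplaced POI most similar to the last placed one; ties are
    broken in favour of the lowest index.
    """
    k = len(sim)
    if k == 0:
        return []
    order = [0]
    remaining = set(range(1, k))
    while remaining:
        last = order[-1]
        nxt = max(sorted(remaining), key=lambda j: sim[last][j])
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

Such a greedy pass gives no guarantee of optimality; in particular, the last and first POIs of the returned order are also neighbours on the circle, which this heuristic does not take into account.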
References
1. Peter Au, Matthew Carey, Shalini Sewraz, Yike Guo, and Stefan M. Rüger. New paradigms in information visualization. In Research and Development in Information Retrieval, pages 307–309, 2000.
2. R. A. Becker and W. S. Cleveland. Brushing Scatterplots. Technometrics, 29:127–142, 1987. Reprinted in Dynamic Graphics for Data Analysis, edited by W. S. Cleveland and M. E. McGill, Chapman and Hall, New York, 1988.
3. C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998.
4. H. Chernoff. Using faces to represent points in k–dimensional space graphically. Journal of the American Statistical Association, 68:361–368, 1973.
5. W. S. Cleveland. Visualizing Data. Hobart Press, Summit, New Jersey, U.S.A., 1993.
6. Christopher G. Healey, Kellogg S. Booth, and James T. Enns. Harnessing preattentive processes for multivariate data visualization. In Proceedings of Graphics Interface ’93, pages 107–117, Toronto, ON, Canada, May 1993.
7. Patrick Hoffman, Georges Grinstein, and David Pinkney. Dimensional anchors: a graphic primitive for multidimensional multivariate information visualizations. In NPIVM ’99: Proceedings of the 1999 workshop on new paradigms in information visualization and manipulation in conjunction with the eighth ACM international conference on Information and knowledge management, pages 9–16, New York, NY, USA, 1999. ACM Press.
8. Alfred Inselberg. The plane with parallel coordinates. The Visual Computer, 1:69–91, 1985.
9. Robert Korfhage. To see, or not to see: Is that the query? In Abraham Bookstein, Yves Chiaramella, Gerard Salton, and Vijay V. Raghavan, editors, Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Chicago, Illinois, USA, October 13-16, 1991 (Special Issue of the SIGIR Forum), pages 134–141. ACM, 1991.
10. S. McCrickard and C. Kehoe. Visualizing search results using sqwid. In Proceedings of the Sixth International World Wide Web Conference, April 1997.
11. Ben Shneiderman. The eyes have it: A task by data type taxonomy for information visualizations. In IEEE Visual Languages, number UMCP-CSD CS-TR-3665, pages 336–343, College Park, Maryland 20742, U.S.A., 1996.
12. Pak Chung Wong and R. Daniel Bergeron. 30 years of multidimensional multivariate visualization. In Scientific Visualization — Overviews, Methodologies and Techniques, pages 3–33. IEEE Computer Society Press, Los Alamitos, CA, 1997.