Digital Libraries - Classification and Visualization Techniques

Dieter Merkl and Andreas Rauber
Institut für Softwaretechnik, Technische Universität Wien
Favoritenstraße 9–11/188, A-1040 Wien, Austria
www.ifs.tuwien.ac.at/~dieter www.ifs.tuwien.ac.at/~andi

Abstract
The constantly increasing flood of available textual information demands the development of powerful tools to organize, search, and explore document libraries. Within the framework of the SOMLib digital library project, we have proposed the use of unsupervised artificial neural networks for document classification and an intuitive user interface that relies on metaphor graphics to visualize the contents of the digital library.
1 Introduction
Today’s information age is characterized by the constant, massive production and dissemination of written information. More powerful tools for exploring, searching, and organizing this mass of information are needed to cope with the situation. Users benefit particularly from clustering techniques that uncover similar documents and bring these similarities to their attention. In particular, the map metaphor for displaying the contents of a document archive in a two-dimensional display has gained increased interest [2, 4, 5, 6, 11]. In such an environment, similar documents are found in neighboring parts of the map display. The obvious benefit for the user is that navigating the document archive resembles the familiar task of navigating a geographical map.

However, most of the above-mentioned research aims at providing one single map representation for the complete document archive. As a consequence, hierarchical relations between documents are hidden in the display. Moreover, as the document archive grows, the maps representing it grow larger as well, which makes it harder for the user to stay oriented within the map. A second shortcoming is the underlying assumption that the document archive is available locally for training the usually adaptive process of document clustering. Hence, distributed archives cannot be addressed appropriately and elegantly with these approaches.
2 The SOMLib Project
We believe that both issues, i.e., the representation of hierarchical document relations and support for integrating distributed document archives, are vital for the usefulness of map-based approaches to document archive visualization. We are currently evaluating strategies to address these issues within the framework of our SOMLib project [10]. The SOMLib project builds on unsupervised neural network technology for organizing document archives. In particular, we rely on self-organizing maps [3] and variants thereof for document clustering.
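To make the clustering step concrete, the following is a minimal sketch of a standard self-organizing map trained on document vectors. It is an illustration only, not the SOMLib implementation: the function names `train_som` and `best_matching_unit`, the linear decay schedules, the Gaussian neighborhood, and the toy term-weight vectors are all assumptions chosen for brevity.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position of the unit whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda unit: sum((w - xi) ** 2 for w, xi in zip(weights[unit], x)),
    )


def train_som(vectors, rows, cols, epochs=100, lr0=0.5, seed=0):
    """Train a rows x cols map; returns {(row, col): weight_vector}.

    Assumed (illustrative) schedule: learning rate and neighborhood radius
    both decay linearly with the epoch number.
    """
    rng = random.Random(seed)
    dim = len(vectors[0])
    # Random initial weight vectors, one per map unit.
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay
        radius = max(radius0 * decay, 0.5)
        # Present the documents in random order.
        for x in rng.sample(vectors, len(vectors)):
            bmu = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                # Units near the winner on the grid are adapted more strongly.
                grid_dist2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
                h = math.exp(-grid_dist2 / (2.0 * radius ** 2))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights


if __name__ == "__main__":
    # Hypothetical term-weight vectors for four documents on two topics.
    docs = [
        [1.0, 0.9, 0.0, 0.0],
        [0.9, 1.0, 0.1, 0.0],
        [0.0, 0.0, 1.0, 0.9],
        [0.1, 0.0, 0.9, 1.0],
    ]
    som = train_som(docs, rows=2, cols=2)
    for doc in docs:
        print(doc, "->", best_matching_unit(som, doc))
```

After training, each document is assigned to its best-matching unit; documents with similar vectors tend to end up on the same or neighboring units, which is the property the map display relies on.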
2.1 Topical hierarchies
In order to detect hierarchical document relations, we proposed a novel neural network architecture, the growing hierarchical self-organizing map [1]. The distinctive feature of this model is its problem-dependent architecture, which develops during the unsupervised learning process. Hence, the time-consuming and error-prone development of heuristics for defining the size of the neural network prior to training is no longer needed. First experiments with this model on a large archive of newspaper articles demonstrated its general feasibility in uncovering hierarchical document relations [9]. For the user, this approach enables exploratory access to large document archives, where zooming into topics of interest is realized in an easy and intuitive fashion. As an example, consider the hierarchical access to documents on the war in the Balkan region depicted in Figure 1. The documents originate from a collection of articles from an Austrian daily newspaper. The labels that characterize each topical cluster are derived with our LabelSOM technique [7].
Figure 1: Topic hierarchy on the Balkan war
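The idea of an architecture that develops during training can be illustrated with a simplified growth decision based on quantization errors. This is a sketch under stated assumptions rather than the algorithm of [1]: the function names, the two thresholds `tau_breadth` and `tau_depth`, and the exact comparison rules are hypothetical and chosen only to show how a map might decide between adding units, spawning a child map for a unit, or stopping.

```python
import math


def quantization_error(weight, vectors):
    """Mean Euclidean distance between a unit's weight vector and its mapped inputs."""
    if not vectors:
        return 0.0
    total = 0.0
    for v in vectors:
        total += math.sqrt(sum((wi - vi) ** 2 for wi, vi in zip(weight, v)))
    return total / len(vectors)


def growth_decision(unit_errors, parent_error, root_error,
                    tau_breadth=0.5, tau_depth=0.1):
    """Decide how a map should adapt (illustrative rules, not taken from [1]).

    unit_errors  -- {unit: quantization error} for the units of one map
    parent_error -- quantization error of the unit this map refines
    root_error   -- quantization error of the whole archive around its mean
    Returns ("grow", [unit]), ("expand", [units]) or ("stop", []).
    """
    map_error = sum(unit_errors.values()) / len(unit_errors)
    if map_error > tau_breadth * parent_error:
        # The map is still too coarse: add units next to the worst unit.
        worst = max(unit_errors, key=unit_errors.get)
        return ("grow", [worst])
    # The map is fine-grained enough; refine only the units that still
    # represent their documents poorly by giving them a child map.
    to_expand = sorted(u for u, e in unit_errors.items()
                       if e > tau_depth * root_error)
    if to_expand:
        return ("expand", to_expand)
    return ("stop", [])
```

In this reading, smaller values of `tau_breadth` yield larger maps per layer, while smaller values of `tau_depth` yield deeper hierarchies; a unit that is expanded corresponds to a topic the user can zoom into.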
2.2 User interface
Intuitive access to document archives is further supported by an interface that uses metaphor graphics for metadata associated with the documents, e.g., size of the document, time of last access, and frequency of access [8]. The various metaphors are depicted in Figure 2. Among the metaphors used, the thickness of a book reflects the size of the document; the time of last access is visualized by the position of the book within the shelf, with recently accessed documents appearing towards the front; and the frequency of access is symbolized by the appearance of the book, ranging from shiny and new to well-thumbed spines. In Figure 3 we present a portion of the SOMLib interface for a collection of TIME Magazine articles. In this case, the clustering provided by the artificial neural network is visualized by means of books appearing in different sections of the shelf. Similar documents, i.e., documents dealing with similar subject matter, are shown in the same section.
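The mapping from metadata to metaphor graphics described above can be sketched as a small function. The class names, field names, pixel values, and thresholds below are hypothetical and serve only to illustrate the three metaphors (thickness, shelf position, wear); they are not taken from the SOMLib interface.

```python
from dataclasses import dataclass


@dataclass
class DocumentMeta:
    title: str
    size_bytes: int
    days_since_access: int
    access_count: int


@dataclass
class BookGlyph:
    title: str
    thickness_px: int   # thicker spine = larger document
    shelf_depth: float  # 0.0 = front of the shelf, 1.0 = far back
    wear: str           # "new", "used", or "well-thumbed"


def to_glyph(doc, max_size_bytes, max_idle_days=365):
    """Translate document metadata into visual book attributes (assumed scales)."""
    # Size -> spine thickness between 8 and 40 pixels.
    size_ratio = min(doc.size_bytes / max_size_bytes, 1.0) if max_size_bytes else 0.0
    thickness = 8 + round(32 * size_ratio)
    # Recently accessed documents stay at the front of the shelf.
    depth = min(doc.days_since_access / max_idle_days, 1.0)
    # Frequently accessed documents look more worn.
    if doc.access_count >= 50:
        wear = "well-thumbed"
    elif doc.access_count >= 5:
        wear = "used"
    else:
        wear = "new"
    return BookGlyph(doc.title, thickness, depth, wear)
```

A rendering layer would then draw each `BookGlyph` in the shelf section that corresponds to the document's cluster.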
3 Conclusion
In this paper we have described the main design principles of the SOMLib digital library project. These principles are, first, an unsupervised organization of the documents into topical hierarchies and, second, an intuitive user interface relying on metaphor graphics for metadata visualization.
References
[1] M. Dittenbach, D. Merkl, and A. Rauber. The growing hierarchical self-organizing map. In Proceedings International Joint Conference on Neural Networks, Como, Italy, 2000.
[2] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen. WEBSOM - self-organizing maps of document collections. Neurocomputing, 21(1–3), 1998.
[3] T. Kohonen. Self-organizing maps. Springer-Verlag, Berlin, 1995.
[4] X. Lin, D. Soergel, and G. Marchionini. A self-organizing semantic map for information retrieval. In Proceedings International ACM SIGIR Conference on Research and Development in Information Retrieval, Chicago, IL, 1991.
[5] D. Merkl. Exploration of text collections with hierarchical feature maps. In Proceedings International ACM SIGIR Conference on Research and Development in Information Retrieval, Philadelphia, PA, 1997.
[6] D. Merkl. Text classification with self-organizing maps: Some lessons learned. Neurocomputing, 21(1–3), 1998.
[7] D. Merkl and A. Rauber. Automatic labeling of self-organizing maps for information retrieval. In Proceedings Int’l Conference on Neural Information Processing, Perth, WA, 1999.
[8] A. Rauber and H. Bina. Visualizing electronic document repositories: Drawing books and papers in a digital library. In Advances in Visual Database Systems: Proceedings of the IFIP TC2 WG2.6 5th Working Conference on Visual Database Systems, Fukuoka, Japan, 2000.
[9] A. Rauber, M. Dittenbach, and D. Merkl. Automatically detecting and organizing documents into topic hierarchies: A neural network based approach to bookshelf creation and arrangement. In Proceedings European Conference on Research and Advanced Technology for Digital Libraries, Lisbon, Portugal, 2000.
[10] A. Rauber and D. Merkl. The SOMLib Digital Library system. In Proceedings European Conference on Research and Advanced Technology for Digital Libraries, Paris, France, 1999.
[11] D. Roussinov and M. Ramsey. Information forage through adaptive visualization. In Proceedings International ACM Conference on Digital Libraries, Pittsburgh, PA, 1998.

Figure 2: Metaphor graphics in SOMLib
Figure 3: Clustering of documents into sections of the bookshelf