Automatic Labeling of Self-Organizing Maps for Information Retrieval

Dieter Merkl and Andreas Rauber
Institut für Softwaretechnik, Technische Universität Wien
Resselgasse 3/188, A-1040 Wien, Austria
www.ifs.tuwien.ac.at/~dieter
www.ifs.tuwien.ac.at/~andi
Abstract

The self-organizing map is a very popular unsupervised neural network model for the analysis of high-dimensional input data as in information retrieval applications. However, the interpretation of the map requires much manual effort, especially as far as the analysis of the learned features and the characteristics of identified clusters is concerned. In this paper we present our novel LabelSOM method which, based on the features learned by the map, automatically selects the most descriptive features of the input patterns mapped onto a particular unit of the map, thus making the characteristics of the various clusters within the map explicit. We demonstrate the benefits of this approach on an example from text classification using a real-world document archive. In this particular case, the features correspond to keywords describing the contents of a document. The benefit of this approach is that the various document clusters are characterized in terms of shared keywords, thus making it easy for the user to explore the contents of an unknown document archive.

1 Introduction

Today's information age may be characterized by constant massive production and dissemination of written information. Powerful tools for exploring, searching, and organizing this mass of information are needed. In particular, the aspect of exploration has received only limited attention in the research community compared to other aspects of information retrieval systems. Current information retrieval technology still relies on systems that retrieve documents based on the similarity between keyword-based document and query representations.

An attractive way to assist the user in document archive exploration is based on unsupervised artificial neural networks, especially self-organizing maps, for document space representation. A number of research publications show that this idea has found appreciation in the community [7, 8, 9, 10, 11, 13, 15, 22]. Maps are used to visualize the similarity between documents in terms of distances within the two-dimensional map display. Hence, similar documents may be found in neighboring regions of the map display.

Many of the above-mentioned papers focus on the visualization of cluster structure. However, it remains a tedious task to interpret the mapping of the self-organizing map as such, i.e. to analyze which attributes were relevant for a particular mapping. In present applications of the self-organizing map, we usually find the map labeled manually: after inspection of the trained map, a set of keywords is assigned to each unit or cluster to provide the user with some hints on the contents of the map. Manual labeling is highly labor-intensive, since it requires inspecting all data items mapped onto the units, and it is difficult if not impossible for very high-dimensional data sets. What is needed is a way to automatically label the units and clusters of a self-organizing map to make the structures learned by the map explicit, i.e. to give a justification for a particular mapping.

In this paper we present our novel method to automatically assign keywords to the units of a trained self-organizing map. For brevity we will refer to this method as the LabelSOM method. We demonstrate the benefits of this method by using an application scenario from information retrieval. In particular, we describe the results from labeling a self-organizing map that was trained on a collection of TIME Magazine articles. The LabelSOM method makes it possible to automatically describe the subject matter of documents using the features learned by the self-organizing map and thus assists the user in understanding the data collection that is presented by the map.

The remainder of this paper is organized as follows. Section 2 provides a brief outline of self-organizing maps. In Section 3 we describe the document collection used for the experiments and our approach for document representation. In Section 4 we provide experimental results from document classification with self-organizing maps and subsequent semantic labeling of the units. A brief review of related work in document classification with unsupervised artificial neural networks is contained in Section 5. Finally, we give our conclusions in Section 6.

2 Self-organizing maps

The self-organizing map [5, 6] is a general unsupervised tool for the ordering of high-dimensional data in such a way that similar input items are grouped spatially close to one another. The model consists of a number of neural processing elements, i.e. units. Each unit $i$ is assigned an $n$-dimensional weight vector $m_i = [\mu_{i1}, \mu_{i2}, \ldots, \mu_{in}]^T$, $m_i \in \mathbb{R}^n$.
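
As an illustration of the model outlined above, the following minimal Python sketch represents a map as a grid of units, each holding an n-dimensional weight vector, and performs one step of the standard self-organizing map learning rule. It is not the authors' implementation. The function names, the 3 x 3 grid, the Euclidean distance, the Gaussian neighbourhood, and the values of `alpha` and `radius` are all assumptions chosen for illustration.

```python
import math
import random


def make_map(rows, cols, n, seed=0):
    """Create a rows x cols grid of units, each with an n-dimensional weight vector."""
    rng = random.Random(seed)
    return {(r, c): [rng.random() for _ in range(n)]
            for r in range(rows) for c in range(cols)}


def distance(x, m):
    """Euclidean distance between input x and weight vector m (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, m)))


def best_matching_unit(som, x):
    """Return the grid position of the unit whose weight vector is closest to x."""
    return min(som, key=lambda pos: distance(x, som[pos]))


def train_step(som, x, alpha=0.5, radius=1.0):
    """Move the winner and its grid neighbours towards x.

    alpha (learning rate) and radius (neighbourhood width) are
    illustrative values, not taken from the paper.
    """
    wr, wc = best_matching_unit(som, x)
    for (r, c), m in list(som.items()):
        grid_dist_sq = (r - wr) ** 2 + (c - wc) ** 2
        h = math.exp(-grid_dist_sq / (2.0 * radius ** 2))  # Gaussian neighbourhood
        som[(r, c)] = [mi + alpha * h * (xi - mi) for mi, xi in zip(m, x)]
    return (wr, wc)


# Hypothetical 4-dimensional input, e.g. a tiny keyword vector of a document.
som = make_map(rows=3, cols=3, n=4)
x = [1.0, 0.0, 0.0, 1.0]
before = distance(x, som[best_matching_unit(som, x)])
winner = train_step(som, x)
after = distance(x, som[winner])
```

Because units near the winner on the grid are also pulled towards the input, repeated presentation of similar inputs tends to place them on neighbouring units, which is the spatial grouping described in the text.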
