To appear in the Proc. of the International Conf. on Neural Networks (ICNN'96), Washington, June 2-6, 1996.
Exploration of Full-Text Databases with Self-Organizing Maps
Timo Honkela, Samuel Kaski, Krista Lagus, and Teuvo Kohonen
Helsinki University of Technology, Neural Networks Research Centre
Rakentajanaukio 2 C, FIN-02150 Espoo, Finland
Timo.Honkela@hut.

ABSTRACT
Availability of large full-text document collections in electronic form has created a need for intelligent information retrieval techniques. In particular, the expanding World Wide Web calls for methods for the systematic exploration of miscellaneous document collections. In this paper we introduce a new method, the WEBSOM, for this task. Self-Organizing Maps (SOMs) are used to represent documents on a map that provides an insightful view of the text collection. This view visualizes the similarity relations between the documents, and the display can be utilized for orderly exploration of the material instead of having to rely on traditional search expressions. The complete WEBSOM method involves a two-level SOM architecture, comprising a word category map and a document map, and means for interactive exploration of the database.
1. Introduction

Full-text classification may be based on the assumption that the elementary textual features of documents that deal with similar topics are statistically similar. Consider tentatively that the documents (text files) were simply described by their word histograms. A Self-Organizing Map (SOM) of the documents could then be computed easily using these histograms as input vectors. However, it has turned out to be a more effective encoding scheme to first construct the statistics of word categories, because this increases generality: the general nature of the texts is not altered much if the words are replaced by their synonyms or other words of the same category. The categorial histograms may then be used as inputs to the SOM mentioned above.

The so-called "semantic" SOM [8] is able to effectively and completely automatically cluster words into grammatical and semantic categories on the basis of their short contexts, i.e., frames of neighboring words in which they occur. The document-searching method WEBSOM introduced in this paper also first extracts a great number of elementary contextual features from text files and maps them onto an ordered two-dimensional SOM array. In the second stage, histograms of contexts accumulated on the first array are further mapped into points on the second map, whereby similar files become mapped close to each other. This order facilitates easy browsing and searching of related files.

The WEBSOM applies to arbitrary full-text files: no descriptors are needed. In this work we have used it to organize 20 selected Usenet newsgroups¹ in the Internet, from which we have picked in total 10 000 discussions, or approximately 3 000 000 words. Systematic organization of such an amount of colloquial text is a very hard task, but the WEBSOM seems to do it effectively.

The WEBSOM has two possible modes of operation: unsupervised and supervised. In the former, clustering of arbitrary text files is made by a conventional two-level Self-Organizing Map, whereby no class information about the documents is given; classification is simply based on the analysis of the raw texts. In the supervised mode of operation, separation of the document groups is enhanced if auxiliary class information, for instance the name of the newsgroup, can be given. In this work we use a partly supervised mode.
¹ The term "newsgroup" has already become established, although "discussion group" would in most cases be more accurate.
The overall architecture of the WEBSOM method thus consists of two levels: the word category map and the document map.
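To make the two-level architecture concrete, the following is a minimal sketch of how such a pipeline could look. The class MiniSOM, the function doc_histogram, and all parameter values are our own illustrative choices under stated assumptions, not the authors' implementation.

```python
import numpy as np

class MiniSOM:
    """A deliberately small SOM (rectangular grid, Gaussian neighborhood,
    stochastic updates). Illustrative only; not the paper's code."""

    def __init__(self, rows, cols, dim, seed=0):
        self.rng = np.random.default_rng(seed)
        self.rows, self.cols = rows, cols
        self.w = self.rng.standard_normal((rows * cols, dim)) * 0.1
        # Grid coordinates of the units, used for neighborhood distances.
        self.grid = np.array(
            [(r, c) for r in range(rows) for c in range(cols)], dtype=float)

    def bmu(self, x):
        """Index of the best-matching (winner) unit for input x."""
        return int(np.argmin(((self.w - x) ** 2).sum(axis=1)))

    def train(self, data, epochs=20, lr0=0.5):
        radius0 = max(self.rows, self.cols) / 2.0
        total, t = epochs * len(data), 0
        for _ in range(epochs):
            for i in self.rng.permutation(len(data)):
                frac = t / total
                lr = lr0 * (1 - frac)                  # shrinking learning rate
                radius = radius0 * (1 - frac) + 0.5    # shrinking neighborhood
                b = self.bmu(data[i])
                d2 = ((self.grid - self.grid[b]) ** 2).sum(axis=1)
                h = np.exp(-d2 / (2 * radius ** 2))    # Gaussian kernel
                self.w += lr * h[:, None] * (data[i] - self.w)
                t += 1

def doc_histogram(tokens, word_map, codes):
    """Second-stage input: histogram of word-category-map hits
    over one document's words, normalized to unit sum."""
    h = np.zeros(len(word_map.w))
    for tok in tokens:
        if tok in codes:           # rare words were already replaced
            h[word_map.bmu(codes[tok])] += 1.0
    return h / max(h.sum(), 1.0)
```

In this sketch, the first MiniSOM would be trained on the context vectors of words (Section 2.2); doc_histogram then encodes each document as a histogram over that map's units, and a second MiniSOM trained on the histograms plays the role of the document map.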
2. The Word Category Map
2.1. Preprocessing of Text
The documents in the Internet contain plenty of details that may be only remotely connected with the topic, for example ASCII drawings and automatically included signatures. These were first removed using heuristic rules. Also the articles ("a", "an", "the") were removed, and numerical expressions as well as special codes were treated with heuristic rules. The documents may also contain plenty of words that occur only a few times in the whole database. Their contribution to the formation of the SOM would be highly erratic. For this reason, and also to reduce the computational load significantly, all words occurring fewer than 50 times were represented by a "don't care" symbol and neglected in further computation.
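A short sketch of this kind of preprocessing is given below. The token pattern, the DONT_CARE symbol name, and the helper preprocess are illustrative assumptions; only the article list and the 50-occurrence threshold come from the text, and the heuristic stripping of signatures and ASCII art is omitted.

```python
from collections import Counter
import re

ARTICLES = {"a", "an", "the"}
MIN_COUNT = 50          # threshold stated in the paper
DONT_CARE = "<dc>"      # stand-in symbol; the name is our choice

def preprocess(raw_docs):
    """Tokenize each document, drop articles, and replace words occurring
    fewer than MIN_COUNT times in the whole collection with DONT_CARE."""
    docs = [re.findall(r"[a-z']+", d.lower()) for d in raw_docs]
    counts = Counter(w for d in docs for w in d)
    return [[DONT_CARE if counts[w] < MIN_COUNT else w
             for w in d if w not in ARTICLES]
            for d in docs]
```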
2.2. Formation of the Word Category Map
Several articles have been published on SOMs that map words into grammatical and semantic categories [1], [7], [8], [9], [14]. Below we only delineate the basic idea of the "semantic" SOM. Consider that all words in a text corpus are concatenated into a single symbol string, from which strange symbols and delimiters (such as punctuation marks) have been eliminated, and rare words have been denoted by the "don't care" symbol as mentioned above. In the vocabulary of all occurring words, each word is represented by an n-dimensional real vector x with random-number components. (In our present experiments we had x ∈
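As a rough sketch of this encoding, the snippet below assigns random real codes to the vocabulary and forms context vectors by concatenating the codes of a word's immediate neighbors. The helper names, the default dimensionality n, and the exact context frame are assumptions on our part; the paper's precise definitions continue beyond this excerpt.

```python
import numpy as np

def word_codes(vocab, n=90, seed=0):
    """Assign each vocabulary word (including the don't-care symbol)
    an n-dimensional random real vector. n is a free parameter here."""
    rng = np.random.default_rng(seed)
    return {w: rng.standard_normal(n) for w in vocab}

def context_vectors(tokens, codes):
    """One common variant of short contexts: for each interior position,
    concatenate the codes of (predecessor, word, successor)."""
    out = []
    for i in range(1, len(tokens) - 1):
        triple = (tokens[i - 1], tokens[i], tokens[i + 1])
        out.append(np.concatenate([codes[t] for t in triple]))
    return np.array(out)
```

The resulting 3n-dimensional vectors would serve as the training inputs to the word category map in the sketch after Section 1.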