To appear in the Proc. of the International Conf. on Neural Networks (ICNN'96), Washington, June 2-6, 1996.
Exploration of Full-Text Databases with Self-Organizing Maps
Timo Honkela, Samuel Kaski, Krista Lagus, and Teuvo Kohonen
Helsinki University of Technology, Neural Networks Research Centre
Rakentajanaukio 2 C, FIN-02150 Espoo, Finland
Timo.Honkela@hut.

ABSTRACT
Availability of large full-text document collections in electronic form has created a need for intelligent information retrieval techniques. In particular, the expanding World Wide Web calls for methods for the systematic exploration of miscellaneous document collections. In this paper we introduce a new method, the WEBSOM, for this task. Self-Organizing Maps (SOMs) are used to represent documents on a map that provides an insightful view of the text collection. This view visualizes the similarity relations between the documents, and the display can be utilized for orderly exploration of the material instead of having to rely on traditional search expressions. The complete WEBSOM method involves a two-level SOM architecture, comprising a word category map and a document map, and means for interactive exploration of the database.
1. Introduction

Full-text classification may be based on the assumption that the elementary textual features of documents that deal with similar topics are statistically similar. Consider tentatively that the documents (text files) were simply described by their word histograms. A Self-Organizing Map (SOM) of the documents could then be computed easily using these histograms as input vectors. However, it has turned out to be a more effective encoding scheme to first construct the statistics of word categories, because this increases generality: the general nature of the texts is not altered much if the words are replaced by their synonyms or other words of the same category. The categorial histograms may then be used as inputs to the SOM mentioned above.

The so-called "semantic" SOM [8] is able to effectively and completely automatically cluster words into grammatical and semantic categories on the basis of their short contexts, i.e., frames of neighboring words in which they occur. The document-searching method WEBSOM introduced in this paper also first extracts a great number of elementary contextual features from text files and maps them onto an ordered two-dimensional SOM array. In the second stage, histograms of contexts accumulated on the first array are further mapped into points on the second map, whereby similar files become mapped close to each other. This order facilitates easy browsing and searching of related files.

The WEBSOM applies to arbitrary full-text files: no descriptors are needed. In this work we have used it to organize 20 selected Usenet newsgroups¹ in the Internet, from which we have picked in total 10 000 discussions, or approximately 3 000 000 words. Systematic organization of such an amount of colloquial text is a very hard task, but the WEBSOM seems to do it effectively.

The WEBSOM has two possible modes of operation: unsupervised and supervised. In the former, clustering of arbitrary text files is made by a conventional two-level Self-Organizing Map, whereby no class information about the documents is given; classification is simply based on the analysis of the raw texts. In the supervised mode of operation, separation of the document groups is enhanced if auxiliary class information, for instance the name of the newsgroup, can be given. In this work we use a partly supervised mode.
¹ The term "newsgroup" has already become established, although "discussion group" would in most cases be more accurate.
The overall architecture of the WEBSOM method thus consists of two levels: the word category map and the document map.
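To make the two-level architecture concrete, the following is a minimal sketch of how such a pipeline could look. The class MiniSOM, the function doc_histogram, and all parameter values are our own illustrative choices under stated assumptions, not the authors' implementation.

```python
import numpy as np

class MiniSOM:
    """A deliberately small SOM (rectangular grid, Gaussian neighborhood,
    stochastic updates). Illustrative only; not the paper's code."""

    def __init__(self, rows, cols, dim, seed=0):
        self.rng = np.random.default_rng(seed)
        self.rows, self.cols = rows, cols
        self.w = self.rng.standard_normal((rows * cols, dim)) * 0.1
        # Grid coordinates of the units, used for neighborhood distances.
        self.grid = np.array(
            [(r, c) for r in range(rows) for c in range(cols)], dtype=float)

    def bmu(self, x):
        """Index of the best-matching (winner) unit for input x."""
        return int(np.argmin(((self.w - x) ** 2).sum(axis=1)))

    def train(self, data, epochs=20, lr0=0.5):
        radius0 = max(self.rows, self.cols) / 2.0
        total, t = epochs * len(data), 0
        for _ in range(epochs):
            for i in self.rng.permutation(len(data)):
                frac = t / total
                lr = lr0 * (1 - frac)                  # shrinking learning rate
                radius = radius0 * (1 - frac) + 0.5    # shrinking neighborhood
                b = self.bmu(data[i])
                d2 = ((self.grid - self.grid[b]) ** 2).sum(axis=1)
                h = np.exp(-d2 / (2 * radius ** 2))    # Gaussian kernel
                self.w += lr * h[:, None] * (data[i] - self.w)
                t += 1

def doc_histogram(tokens, word_map, codes):
    """Second-stage input: histogram of word-category-map hits
    over one document's words, normalized to unit sum."""
    h = np.zeros(len(word_map.w))
    for tok in tokens:
        if tok in codes:           # rare words were already replaced
            h[word_map.bmu(codes[tok])] += 1.0
    return h / max(h.sum(), 1.0)
```

In this sketch, the first MiniSOM would be trained on the context vectors of words (Section 2.2); doc_histogram then encodes each document as a histogram over that map's units, and a second MiniSOM trained on the histograms plays the role of the document map.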
2. The Word Category Map
2.1. Preprocessing of Text
The documents in the Internet contain plenty of details that may be only remotely connected with the topic, for example ASCII drawings and automatically included signatures. These were first removed using heuristic rules. Also the articles ("a", "an", "the") were removed, and numerical expressions as well as special codes were treated with heuristic rules. The documents may also contain plenty of words that occur only a few times in the whole database. Their contribution to the formation of the SOM would be highly erratic. For this reason, and also to reduce the computational load significantly, all words occurring fewer than 50 times were represented by a "don't care" symbol and neglected in further computation.
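A short sketch of this kind of preprocessing is given below. The token pattern, the DONT_CARE symbol name, and the helper preprocess are illustrative assumptions; only the article list and the 50-occurrence threshold come from the text, and the heuristic stripping of signatures and ASCII art is omitted.

```python
from collections import Counter
import re

ARTICLES = {"a", "an", "the"}
MIN_COUNT = 50          # threshold stated in the paper
DONT_CARE = "<dc>"      # stand-in symbol; the name is our choice

def preprocess(raw_docs):
    """Tokenize each document, drop articles, and replace words occurring
    fewer than MIN_COUNT times in the whole collection with DONT_CARE."""
    docs = [re.findall(r"[a-z']+", d.lower()) for d in raw_docs]
    counts = Counter(w for d in docs for w in d)
    return [[DONT_CARE if counts[w] < MIN_COUNT else w
             for w in d if w not in ARTICLES]
            for d in docs]
```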
2.2. Formation of the Word Category Map
Several articles have been published on SOMs that map words into grammatical and semantic categories [1], [7], [8], [9], [14]. Below we only delineate the basic idea of the "semantic" SOM. Consider that all words in a text corpus are concatenated into a single symbol string, from which strange symbols and delimiters (such as punctuation marks) have been eliminated, and rare words have been denoted by the "don't care" symbol as mentioned above. In the vocabulary of all occurring words, each word is represented by an n-dimensional real vector x with random-number components. (In our present experiments we had x ∈
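As a rough sketch of this encoding, the snippet below assigns random real codes to the vocabulary and forms context vectors by concatenating the codes of a word's immediate neighbors. The helper names, the default dimensionality n, and the exact context frame are assumptions on our part; the paper's precise definitions continue beyond this excerpt.

```python
import numpy as np

def word_codes(vocab, n=90, seed=0):
    """Assign each vocabulary word (including the don't-care symbol)
    an n-dimensional random real vector. n is a free parameter here."""
    rng = np.random.default_rng(seed)
    return {w: rng.standard_normal(n) for w in vocab}

def context_vectors(tokens, codes):
    """One common variant of short contexts: for each interior position,
    concatenate the codes of (predecessor, word, successor)."""
    out = []
    for i in range(1, len(tokens) - 1):
        triple = (tokens[i - 1], tokens[i], tokens[i + 1])
        out.append(np.concatenate([codes[t] for t in triple]))
    return np.array(out)
```

The resulting 3n-dimensional vectors would serve as the training inputs to the word category map in the sketch after Section 1.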