In Proc. WCNN'96, World Congress on Neural Networks, pp. 814-817. Lawrence Erlbaum and INNS Press, Mahwah, NJ, 1996.

Creating an Order in Digital Libraries with Self-Organizing Maps

Samuel Kaski, Timo Honkela, Krista Lagus, and Teuvo Kohonen

Helsinki University of Technology, Neural Networks Research Centre
Rakentajanaukio 2 C, FIN-02150 Espoo, Finland
Samuel.Kaski@hut.fi

Abstract. Formulating suitable search expressions for information retrieval from large full-text databases can currently require considerable effort. Changing the scope of the search when, e.g., too many or too few hits have been obtained requires reformulating the search expression. As an alternative, we suggest an explorative full-text information retrieval method in which the Self-Organizing Map (SOM) algorithm is used to order documents based on their full textual contents. The visualized order can then be used for explorative search or for exploring novel knowledge areas, and the scope can be changed interactively. The ordering of the documents is achieved by a two-level analysis: first, word categories are extracted from the text by a "semantic" SOM; second, the textual context of the documents is encoded on the basis of the histograms of words formed on the word category map.

1 Introduction

The information age is characterized by an uncontrolled flood of miscellaneous digital information from very different sources. Powerful methods for organizing, exploring, and searching collections of free-form textual documents are needed even for everyday purposes. Classical methods exist for searching by keywords or by indexed contents of full-text documents, sometimes enhanced with proximity search or with Boolean combinations of search terms. However, there is a growing need for methods of systematic explorative information retrieval, where the exact keywords that would lead to relevant and interesting information may not be known in advance.

Exploration may be supported by organizing the documents into taxonomies or hierarchies, a task that librarians have carried out throughout the past centuries. However, as the amount of available textual information keeps increasing, automatic methods for its management become necessary. Such methods have been missing so far because effective means for encoding and ordering free-form documents have been lacking.

The Self-Organizing Map (SOM) [2], [3] is a general unsupervised tool for ordering high-dimensional statistical data so that similar inputs are in general mapped close to each other. To apply the SOM to texts, a document might, for example, be represented as the histogram of its words. A more practical method is to first use the so-called semantic SOM [8] for word categorization. The semantic SOM organizes the words into grammatical and semantic categories represented on a two-dimensional array. The relative similarity of the categories is reflected in their distance relationships on the array. An extra benefit of using word category histograms instead of simple word histograms is that the dimensionality of the input to the document map is reduced by an order of magnitude.
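To make the ordering property of the SOM concrete, the following is a minimal sketch of a small rectangular map trained with the standard online update rule. It is not the implementation used in this work: the function names, the linearly decaying learning rate, the Gaussian neighborhood, and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return (row, col) of the unit whose weight vector is closest to x."""
    best, best_dist = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist = sum((a - b) ** 2 for a, b in zip(x, w))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def train_som(data, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a rows x cols SOM on a list of equal-length vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Random initial weights in [0, 1); assumes inputs are scaled similarly.
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    radius0 = max(rows, cols) / 2.0
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for idx in order:
            x = data[idx]
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                    # assumed linear decay
            radius = max(radius0 * (1.0 - frac), 0.5)  # assumed linear shrink
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    # Units near the winner on the grid move more strongly.
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2.0 * radius ** 2))
                    w = weights[r][c]
                    for k in range(dim):
                        w[k] += lr * h * (x[k] - w[k])
            step += 1
    return weights
```

After training, `best_matching_unit(weights, x)` gives the map location of an input `x`; because neighboring units are updated together, similar inputs tend to end up at nearby locations.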
Several studies have been published on SOMs that map words into grammatical and semantic categories [1], [7], [8], [9], [12]. Lin et al. [4] have previously used the SOM to form a small map based on the titles of scientific documents. Scholtes has developed a SOM-based neural filter and a neural interest map for information retrieval [10], [11], [12]. Merkl [5], [6] has used the SOM to cluster textual descriptions of software library components. In this work we introduce a new architecture in which the semantic SOM is first used to extract statistical contextual information from free-form documents, and another SOM transforms the categorial statistics into an ordered document map. The locations in the document map can be used as bins, or a kind of specific "traps," into which closely related documents are automatically gathered for easy retrieval.

[Diagram omitted. Panel labels: a) word category map (full-text input, encoded word, encoded context); b) document map ("blurred" histogram, word category map, full-text input).]

Figure 1: The basic two-level WEBSOM architecture. a) The word category map first learns to represent relations of words based on their averaged contexts. This map is used to form a word histogram of the text to be analyzed. b) The histogram, a "fingerprint" of the document, is then used as input to the second SOM, the document map.
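The step from panel a) to panel b) can be sketched as follows: each word of a document is looked up on the word category map, the hits are accumulated into a histogram over the map units, and the histogram is blurred before it is passed on as the document "fingerprint". This is only an illustration. The `word_to_unit` lookup table is a hypothetical stand-in for a trained word category map, and the Gaussian kernel, its width `sigma`, and the final normalization are assumptions, since the figure only states that the histogram is "blurred".

```python
import math


def category_histogram(words, word_to_unit, rows, cols):
    """Count how often each word category map unit is hit by the words."""
    hist = [[0.0] * cols for _ in range(rows)]
    for word in words:
        unit = word_to_unit.get(word)  # words missing from the map are skipped
        if unit is not None:
            r, c = unit
            hist[r][c] += 1.0
    return hist


def blur(hist, sigma=1.0):
    """Spread each count over neighboring units with a Gaussian kernel."""
    rows, cols = len(hist), len(hist[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if hist[r][c] == 0.0:
                continue
            for rr in range(rows):
                for cc in range(cols):
                    d2 = (r - rr) ** 2 + (c - cc) ** 2
                    out[rr][cc] += hist[r][c] * math.exp(-d2 / (2.0 * sigma ** 2))
    return out


def fingerprint(words, word_to_unit, rows, cols, sigma=1.0):
    """Flatten the blurred histogram into a normalized input vector."""
    blurred = blur(category_histogram(words, word_to_unit, rows, cols), sigma)
    flat = [v for row in blurred for v in row]
    total = sum(flat)
    return [v / total for v in flat] if total > 0 else flat
```

The resulting vector has one component per unit of the word category map, so its length is fixed by the map size rather than by the vocabulary size.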

2 The WEBSOM Method

The basic two-level WEBSOM architecture, consisting of two hierarchically interrelated Self-Organizing Maps (SOMs), is depicted in Fig. 1. The word category map (Fig. 1 a) is a "semantic SOM" that describes relations of words based on their averaged short contexts. The ith word in the sequence of words is represented by an n-dimensional real vector $x_i$ (here, n = 90) with random-number components. The averaged context vector of this word reads

$$X(i) = \begin{bmatrix} E\{x_{i-1} \mid x_i\} \\ \varepsilon x_i \\ E\{x_{i+1} \mid x_i\} \end{bmatrix} \tag{1}$$

where E denotes the estimate of the expectation value evaluated over the text corpus, and $\varepsilon$ is a small scalar number (e.g., $\varepsilon = 0.2$). Now the $X(i) \in$
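Equation (1) can be illustrated with a short sketch that estimates the averaged context vectors from a token sequence. Several details here are assumptions rather than the authors' choices: the random codes are drawn from a standard Gaussian (the text only says "random-number components"), the expectation is estimated as a plain average over all occurrences of a word, and a zero vector is used when a word never has a predecessor or successor. The function names are hypothetical.

```python
import random
from collections import defaultdict


def random_codes(vocab, n=90, seed=0):
    """Assign each distinct word a fixed random n-dimensional code vector."""
    rng = random.Random(seed)
    # Sorting makes the assignment reproducible for a given seed.
    return {w: [rng.gauss(0.0, 1.0) for _ in range(n)] for w in sorted(vocab)}


def averaged_contexts(tokens, codes, eps=0.2):
    """Build [E{x_(i-1)|x_i}, eps * x_i, E{x_(i+1)|x_i}] for each word."""
    n = len(next(iter(codes.values())))
    prev_sum = defaultdict(lambda: [0.0] * n)
    next_sum = defaultdict(lambda: [0.0] * n)
    prev_cnt = defaultdict(int)
    next_cnt = defaultdict(int)
    for i, word in enumerate(tokens):
        if i > 0:  # accumulate the code of the preceding word
            for k, v in enumerate(codes[tokens[i - 1]]):
                prev_sum[word][k] += v
            prev_cnt[word] += 1
        if i + 1 < len(tokens):  # accumulate the code of the following word
            for k, v in enumerate(codes[tokens[i + 1]]):
                next_sum[word][k] += v
            next_cnt[word] += 1
    contexts = {}
    for word, code in codes.items():
        # Assumed fallback: zero vector when no neighbor was ever observed.
        before = ([s / prev_cnt[word] for s in prev_sum[word]]
                  if prev_cnt[word] else [0.0] * n)
        after = ([s / next_cnt[word] for s in next_sum[word]]
                 if next_cnt[word] else [0.0] * n)
        contexts[word] = before + [eps * v for v in code] + after
    return contexts
```

For example, with `tokens = "the cat sat the cat ran".split()` and `codes = random_codes(set(tokens), n=4)`, the call `averaged_contexts(tokens, codes)["cat"]` returns a 12-dimensional vector whose first block is the code of "the" and whose last block is the mean of the codes of "sat" and "ran".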
