RETRIEVING DOCUMENTS BY CONSTRAINED SPREADING ACTIVATION ON AUTOMATICALLY CONSTRUCTED HYPERTEXTS

Fabio Crestani
Department of Computing Science, University of Glasgow
Glasgow G12 8QQ, Scotland
Tel. +44 141 330 6292, fax +44 141 330 4913
email: [email protected]

Abstract

We report on the design of a system that enables the user to retrieve items of interest both by browsing an automatically constructed very large hypertext and by spreading activation on the hypertext network itself. The second option is particularly interesting, since it enables the user to retrieve items that have not been visited but that are similar to items flagged as relevant during browsing. The spreading of activation from relevant items to similar items is achieved using a form of “constrained spreading activation”. This is a kind of spreading activation controlled by heuristic rules that limit and direct the activation towards the most promising links and nodes of the hypertext.

1 Introduction

In associative retrieval, associations among information items are often represented as a network, where information items are represented by nodes and associations by links connecting nodes. The heuristic rule of retrieving items associated with those assessed as relevant is often implemented by means of a technique called spreading activation. The purpose of this paper is to describe the design of a system that enables associative retrieval by “constrained” spreading activation on an automatically constructed very large hypertext.

At present there is a decrease of interest in the use of spreading activation on large networks (for example, semantic networks). This is mainly because constructing a network of associations among information items is a very time-consuming process when the document collection is very large. Most of the original work in associative retrieval was performed with small document collections, and often the associations among the information items were set up manually or semi-automatically [9]. This, of course, becomes impossible when the document collection is very large. However, more and more computing power is becoming available and its cost is rapidly decreasing, making it possible to construct associative networks from large document collections in an automatic way. In [2, 1] we presented a methodology and a tool for the automatic construction of hypertexts to be used for information retrieval purposes. With such a tool, it is possible to build a large hypertext from a flat collection of textual documents in a completely automatic way.

2 Information Retrieval and Hypertexts

Information Retrieval (for a good overview see [5]) is a science that aims to store and allow fast access to a large amount of unstructured information. This information can be of any kind: textual, visual, or auditory. Most current IR systems store and enable the retrieval of only textual information, called documents. Even so, the task is not simple: to give an idea of its size, the collections of documents an IR system has to deal with often contain several thousands or even millions of documents.

A user accesses the IR system by submitting a query. The system tries to retrieve all documents that “satisfy” (are relevant to) the query. Good IR systems typically rank the matched documents so that those most likely to be relevant (for example, those with the highest degree of similarity with the query representation) are presented to the user first. Some retrieved documents will be relevant (with varying degrees of relevance) and some will, unfortunately, be irrelevant. The user marks the documents considered relevant and feeds them back to the system through a process called relevance feedback, which modifies the original query to produce a new, improved query and, as a consequence, a new ranking of documents. If the IR process is interactive, this goes on until the user is satisfied with the resulting list of documents.

Hypertexts are large networks of textual nodes and links connecting nodes. A very famous example of hypertext (actually a hypermedia, since nodes can also be pictures, sounds, or movies) is the World Wide Web. Hypertext links usually do not have a semantics associated with them, unlike, for example, links in semantic networks or conceptual graphs, and most often they do not have weights either. Browsing is therefore achieved by following links that may have a clear or obscure semantics and importance. Links simply connect nodes with a semantics and a degree of importance that remain in the head of their designer.

Systems providing both browsing and querying search strategies allow users to access a hypertext by browsing only after a query has been issued. In this way users are given access to documents that have not matched the query. In particular, given a retrieved document, the user can pick up the neighboring documents, even if they do not match the query. This mixed form of access is especially useful if the collection is made up of multimedia documents. The representation of multimedia documents is rather difficult because of the different nature of the media. Some approaches have been proposed to index multimedia document collections, and cluster-based techniques are used to relate textual documents to neighboring multimedia documents [4].

* Previously at Dipartimento di Elettronica e Informatica, Università di Padova, Italy.

3 Automatic Construction of Hypertexts

In 1993 Agosti and Crestani [1] proposed a methodology for the automatic construction of hypertexts. The starting point was to model a set of raw IR data by means of a conceptual schema. In IR the term conceptual schema refers to a conceptual structure describing semantic relationships among IR data, i.e. among the different objects (documents, index terms, concepts, etc.) taking part in an IR application. A conceptual schema of a specific IR application provides the user with a frame of reference in the query formulation process and can be very useful if the user is allowed to browse it. In the IR field the idea of using conceptual models is new. Most IR applications have an “ad hoc” data model.

In the proposed conceptual model, applying the classification mechanism to the IR data implies working with three different levels of abstraction: documents, index terms, and concepts, as depicted in Figure 1. Links connect nodes to express a semantic relation between them. The semantics of a link can be made explicit by putting a label on the link, and its importance can be quantified by putting a weight on it. There are links connecting objects of the same type (on the same level) and links connecting objects of different types. For example, a link connecting two index terms indicates that the two terms are similar or often occur together. A link connecting an index term with a document indicates that the document has been indexed using that term or that the term occurs in the document. A link between an index term and a concept indicates that the concept can be expressed using that index term. Links on the document level represent bibliographical citations or similarity between documents.

A tool that automates the proposed methodology has been presented in [2]. This tool makes it possible to construct a large hypertext from a collection of documents in a fully automatic way. The tool has been tested with several collections and in various applications.
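The three-level structure described in this section can be pictured as a small typed graph. The following Python sketch is only an illustration under stated assumptions: the class names (`Level`, `Link`, `Node`, `Hypertext`), the node identifiers, the link labels, and the choice to store every link in both directions are hypothetical and are not taken from the tool presented in [2].

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class Level(Enum):
    """The three levels of abstraction of the conceptual schema."""
    CONCEPT = "concept"
    INDEX_TERM = "index_term"
    DOCUMENT = "document"


@dataclass
class Link:
    target: str          # identifier of the node the link points to
    label: str           # explicit semantics of the link (hypothetical labels)
    weight: float = 1.0  # importance of the link


@dataclass
class Node:
    node_id: str
    level: Level
    links: List[Link] = field(default_factory=list)


class Hypertext:
    """A network of typed nodes connected by labelled, weighted links."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}

    def add_node(self, node_id: str, level: Level) -> None:
        self.nodes[node_id] = Node(node_id, level)

    def add_link(self, source: str, target: str, label: str,
                 weight: float = 1.0) -> None:
        # Assumption: links can be followed in both directions.
        self.nodes[source].links.append(Link(target, label, weight))
        self.nodes[target].links.append(Link(source, label, weight))

    def neighbours(self, node_id: str,
                   level: Optional[Level] = None) -> List[str]:
        """Linked nodes, optionally restricted to a single level."""
        return [link.target for link in self.nodes[node_id].links
                if level is None or self.nodes[link.target].level is level]


net = Hypertext()
net.add_node("c:retrieval", Level.CONCEPT)
net.add_node("t:search", Level.INDEX_TERM)
net.add_node("d:42", Level.DOCUMENT)
net.add_link("t:search", "c:retrieval", label="expresses", weight=0.8)
net.add_link("t:search", "d:42", label="indexes", weight=0.6)
print(net.neighbours("t:search"))
print(net.neighbours("t:search", Level.DOCUMENT))
```

Keeping the label and the weight on each link is what later allows a spreading procedure to treat links differently according to their semantics and importance.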

4 Associative Retrieval using Spreading Activation on a Hypertext

Studies on associative retrieval date back to the 1960s and had their origins in statistical studies of associations among terms and/or documents in a collection. The “associative linear retrieval model” is one of the earliest models based on the concept of associative retrieval. This model, in its basic principles [8], consists of expanding the original query using statistically determined term–term, term–document, and document–document associations. This technique is based on the assumption that there exist statistically determinable relations among terms, among documents, and between documents and terms. These associations can be represented using a similarity matrix. The model rests on many other strong assumptions, and more recent studies [6, 9] have led to the conclusion that effective term expansion methods valid for a variety of different collections are difficult to generate.
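As a rough illustration of query expansion through a term–term similarity matrix, the sketch below adds to each query term weight a fraction of the weights of the terms associated with it. The linear form, the damping factor `alpha`, the function name `expand_query`, and the toy matrix are all assumptions made for this example; they are not the formulation given in [8].

```python
def expand_query(query, term_sim, alpha=0.5):
    """Expand a query vector using term-term associations.

    query:    list of term weights, one per index term
    term_sim: square term-term similarity matrix (list of lists)
    alpha:    hypothetical factor scaling the contribution of associated terms
    """
    n = len(query)
    expanded = []
    for j in range(n):
        # Weight received by term j from the other terms of the query.
        associated = sum(query[i] * term_sim[i][j] for i in range(n) if i != j)
        expanded.append(query[j] + alpha * associated)
    return expanded


terms = ["hypertext", "browsing", "retrieval"]
term_sim = [
    [1.0, 0.6, 0.2],
    [0.6, 1.0, 0.0],
    [0.2, 0.0, 1.0],
]
query = [1.0, 0.0, 0.0]  # the user asked only for "hypertext"
expanded = expand_query(query, term_sim)
for term, weight in zip(terms, expanded):
    print(term, round(weight, 3))
```

In this toy case the terms "browsing" and "retrieval" enter the expanded query with small weights even though the user never typed them, which is the effect the associative model aims for.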

[Figure 1: A three-level hypertext, with a concept level (concept nodes), an index term level (index term nodes), and a document level (document nodes).]

Recently, these models of associative retrieval have been revised using the so-called Spreading Activation Model (SA), which is based on supposed mechanisms of human memory operations [7].

4.1 The pure Spreading Activation model

The SA model in its pure form is made up of a conceptually simple processing technique on a network data structure. The processing technique is defined by a sequence of iterations that can be halted by the user or by the triggering of some termination condition. An iteration consists of:

1. preadjustment;
2. spreading;
3. postadjustment;
4. termination check.

In the preadjustment and postadjustment phases, which are optional, some form of activation decay can be applied to the active nodes. These phases are used to avoid retention of activation from previous pulses, making it possible to control both the activation of single nodes and the overall activation of the network. They implement a form of “loss of interest” in nodes that are not continually activated. The spreading phase consists of a number of passages of activation waves from one node to all other nodes connected to it. There are many ways of spreading activation over a network. In its simplest form, at the single node level, SA consists first of the computation of the node input using the formula:

$$I_j = \sum_i O_i \, w_{ij}$$

where $I_j$ is the total input of node $j$, $O_i$ is the output of node $i$ connected to node $j$, and $w_{ij}$ is a weight associated with the link connecting node $i$ to node $j$. Input values and weights are usually real numbers; however, their numerical types are determined by the specific requirements of the application to be modelled. After a node has computed its input value, its output value must be determined. This is usually computed as a function of the input value. There are many different functions that can be used in the evaluation of the output; the most commonly used function in pure SA models is the threshold function, where a threshold is used to determine whether node $j$ has to be considered active, and therefore fire activation to other nodes, or not. An example of an output threshold function is the following:

$$O_j = \begin{cases} 0 & \text{if } I_j < k_j \\ 1 & \text{if } I_j > k_j \end{cases}$$

where $k_j$ is the threshold value for node $j$. The threshold value of the activation function is application dependent and can vary from node to node, hence the notation $k_j$ for the node threshold. After the node has computed its output value, it fires it to all the nodes connected to it, usually sending the same value to each of them. Pulse after pulse, the activation spreads over the network, reaching nodes that are far from the initially activated ones. After a number of pulses have been fired, a termination condition is checked. If the condition is verified, then the SA process stops; otherwise it goes on for another series of pulses. SA is therefore iterative, consisting of a sequence of pulses and termination checks. The result of the SA process is the activation level of nodes reached at termination time. The interpretation of the level of activation of each node depends on the application and, in particular, on the characteristics of the object being modelled by that node.
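The input computation, the threshold output, and the pulse loop described in this section can be put together in a short Python sketch. Several choices here are assumptions made only for illustration: the optional preadjustment and postadjustment phases are omitted, a node stays active once it has fired, a node whose input equals its threshold is treated as inactive, and the process terminates when a pulse activates no new node. The function names, the toy weights, and the default threshold of 0.5 are hypothetical.

```python
def node_inputs(outputs, weights):
    """Input of each node j: sum over i of O_i * w_ij.

    outputs: dict mapping node -> current output value
    weights: dict mapping (i, j) -> weight of the link from i to j
    """
    inputs = {}
    for (i, j), w in weights.items():
        inputs[j] = inputs.get(j, 0.0) + outputs.get(i, 0.0) * w
    return inputs


def threshold_output(input_value, threshold):
    """Output 1 if the input exceeds the node threshold, otherwise 0."""
    return 1.0 if input_value > threshold else 0.0


def spread(weights, seeds, thresholds, default_threshold=0.5, max_pulses=10):
    """Pure SA: fire pulses until nothing changes or max_pulses is reached."""
    outputs = {node: 1.0 for node in seeds}
    for _ in range(max_pulses):
        inputs = node_inputs(outputs, weights)            # spreading
        updated = dict(outputs)
        for j, total in inputs.items():
            k_j = thresholds.get(j, default_threshold)
            if threshold_output(total, k_j) == 1.0:
                updated[j] = 1.0
        if updated == outputs:                            # termination check
            break
        outputs = updated
    return outputs


weights = {
    ("a", "b"): 0.8,
    ("b", "c"): 0.7,
    ("a", "c"): 0.3,
    ("c", "d"): 0.4,
}
active = spread(weights, seeds=["a"], thresholds={})
print(sorted(active))
```

With these toy values, node `c` is not activated by the first pulse (its input is only 0.3) but fires at the second one, once `b` is active as well; node `d` never exceeds its threshold.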

4.2 Constrained Spreading Activation

The pure SA model, however, presents some serious drawbacks:

- unless controlled carefully by means of the preadjustment and postadjustment phases, the activation ends up spreading all over the network;
- the information provided by labels or weights associated with links is not fully used;
- it is difficult to implement forms of inference based on the semantics or weights of associations.

These problems can be addressed by taking into account, in the SA process, the diverse significance of the relations among nodes. This can be achieved by using the information provided by labels or weights on links and by processing links in different ways according to them. In this way it is possible to implement complex forms of heuristics and to spread activation on the network according to complex inference rules. A common way of implementing a processing technique that spreads the activation according to rules is by means of constraints on the spreading. Here are some constraints that we found could be used in SA models.

- Distance constraints: the spreading of activation should cease when it reaches nodes that are too far away, in terms of links covered to reach them, from the initially activated ones. This corresponds to the simple heuristic rule that the strength of the relation between two nodes decreases with their semantic distance.
- Fan-out constraints: the spreading of activation should cease at nodes with very high connectivity (fan-out), that is, at nodes connected to a very large number of other nodes. The purpose of this constraint is to avoid a spreading that would be too wide, which could derive from nodes with a very broad semantic meaning that are therefore connected to many other nodes.
- Path constraints: activation should spread using preferential paths, reflecting application-dependent inference rules. This can be modelled using weights on links or, if links are labelled, by diverting the activation flow to particular semantic paths while stopping it from following other, less meaningful paths.
- Activation constraints: using the threshold function at the single node level, it is possible to control the spreading of activation on the network. This can be achieved by automatically changing the threshold value in relation to the total level of activation over the entire network at any single pulse. Moreover, it is possible to assign different threshold levels to each node or set of nodes in relation to their meaning in the context of the application. Although this may cause an increase in the number of computations, it makes it possible to implement various complex inference rules.

Referring to the pure SA model, these constraints act during the preadjustment phase (this is the case for distance, fan-out, and path constraints) or during the postadjustment phase (activation constraints). A good example of the use of constrained SA in IR is reported in [3].
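The four kinds of constraint listed above can be sketched in Python as follows. This is a simplified illustration, not the procedure of [3] or of the system described here: the constraints are checked inline during each pulse rather than in separate preadjustment and postadjustment phases, activation is real-valued, a single global threshold is used, and the function name, parameters, link labels, and toy network are all hypothetical.

```python
def constrained_spread(links, seeds, max_distance=2, max_fan_out=3,
                       allowed_labels=None, threshold=0.2):
    """Spread activation from the seed nodes under four constraints.

    links: dict mapping node -> list of (target, label, weight) tuples
    Returns a dict mapping each activated node to its activation level.
    """
    activation = {node: 1.0 for node in seeds}
    frontier = set(seeds)
    # Distance constraint: at most max_distance links from the seeds.
    for _ in range(max_distance):
        incoming = {}
        for node in sorted(frontier):
            outgoing = links.get(node, [])
            # Fan-out constraint: highly connected nodes do not fire.
            if len(outgoing) > max_fan_out:
                continue
            for target, label, weight in outgoing:
                # Path constraint: follow only links with allowed labels.
                if allowed_labels is not None and label not in allowed_labels:
                    continue
                incoming[target] = (incoming.get(target, 0.0)
                                    + activation[node] * weight)
        frontier = set()
        for target, value in incoming.items():
            if target in activation:
                continue  # assumption: a node is activated only once
            # Activation constraint: the input must exceed the threshold.
            if value > threshold:
                activation[target] = value
                frontier.add(target)
        if not frontier:
            break
    return activation


links = {
    "q": [("t1", "similar", 0.9), ("t2", "similar", 0.5),
          ("hub", "similar", 0.8)],
    "t1": [("d1", "indexes", 0.7), ("d2", "cites", 0.9)],
    "t2": [("d3", "indexes", 0.3)],
    "hub": [("d1", "indexes", 1.0), ("d2", "indexes", 1.0),
            ("d3", "indexes", 1.0), ("d4", "indexes", 1.0)],
    "d1": [("d5", "indexes", 0.9)],
}
result = constrained_spread(links, seeds=["q"],
                            allowed_labels={"similar", "indexes"})
print(sorted(result))
```

In this toy network `hub` is activated but does not fire because of its fan-out, `d2` is never reached because the only usable link to it carries a label that is not allowed, `d3` stays below the threshold, and `d5` lies beyond the distance limit.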

4.3 Constrained Spreading Activation on a Hypertext

In Section 3 we have seen how, starting from a raw set of documents and a thesaurus-like structure among concepts, we can use statistical and IR techniques to determine nodes and links and build a large network. The network, which follows the schema depicted in Figure 1, is written in HTML. It is basically a hypertext and can be browsed by means of a hypertext browsing tool such as Netscape. The browsing tool can be used not only for simple browsing, but also for query formulation. A control structure on the hypertext could enable the user to build a query by moving through the hypertext on different levels of abstraction and marking the information items (nodes) that best represent the information need.

After the user has built a query in this way, an automatic procedure making use of the different semantics and weights associated with links spreads activation over the network. Activation flows to concepts, index terms, or documents that are linked to those marked by the user, and from there on. A set of constraints like the ones presented in Section 4.2 controls and directs the spreading over the network, implementing precise heuristic search rules. Active document nodes are put in a ranked retrieval list for the user to browse. The ranking order is determined by the amount of activation reaching them.

The user can provide additional information to the system by marking new nodes in the retrieval list that are considered relevant. In this way the user assesses whether the spreading has been successful or not. This process is similar to the relevance feedback technique used in advanced IR systems. The user can then start a new spreading of activation and continue the search in an iterative and interactive process.
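The last two steps, ranking the active document nodes and feeding the user's marks back as a new starting point, can be sketched as follows. The function names, the level names, the activation values, and the rule that the next set of starting nodes is simply the union of the old ones and the newly marked ones are assumptions made for this example; the prototype may handle feedback differently.

```python
def rank_documents(activation, levels):
    """Ranked retrieval list: active document nodes, most activated first.

    activation: dict mapping node -> activation level after spreading
    levels:     dict mapping node -> "concept", "index_term" or "document"
    """
    documents = [(node, value) for node, value in activation.items()
                 if levels.get(node) == "document"]
    # Ties are broken by node identifier to keep the order deterministic.
    return sorted(documents, key=lambda pair: (-pair[1], pair[0]))


def next_seeds(previous_seeds, marked_relevant):
    """Feedback step: nodes marked by the user join the starting nodes."""
    return set(previous_seeds) | set(marked_relevant)


levels = {"c1": "concept", "t1": "index_term",
          "d1": "document", "d2": "document", "d3": "document"}
activation = {"c1": 1.0, "t1": 0.9, "d1": 0.63, "d2": 0.8, "d3": 0.25}

ranking = rank_documents(activation, levels)
print(ranking)

# The user marks d2 as relevant; a new spreading would start from these nodes.
seeds = next_seeds({"c1"}, {"d2"})
print(sorted(seeds))
```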

5 Conclusions and future work

This work is currently at a prototype stage. We still have to determine the most effective heuristic search rules to be implemented by means of constrained SA. We will report in more detail on the complete architecture and on the evaluation of the prototype in a future paper.

Acknowledgements

The work reported in this paper has been partially supported by the European Space Agency (ESA) under the project Study of Semantic Networks Inter-Operations, contract ESA N.12100/96/I-GP.

References

[1] M. Agosti and F. Crestani. A methodology for the automatic construction of a Hypertext for Information Retrieval. In Proceedings of the ACM Symposium on Applied Computing, pages 745–753, Indianapolis, USA, February 1993.
[2] M. Agosti, F. Crestani, and M. Melucci. Design and implementation of a tool for the automatic construction of hypertexts for information retrieval. Information Processing and Management, 32(4):459–476, 1996.
[3] P.R. Cohen and R. Kjeldsen. Information Retrieval by constrained spreading activation on Semantic Networks. Information Processing and Management, 23(4):255–268, 1987.
[4] M.D. Dunlop. Multimedia Information Retrieval. PhD thesis, Department of Computing Science, University of Glasgow, Glasgow, UK, October 1991.
[5] W.R. Frakes and R. Baeza-Yates, editors. Information Retrieval: data structures and algorithms. Prentice Hall, Englewood Cliffs, New Jersey, USA, 1992.
[6] S.E. Preece. A spreading activation model for Information Retrieval. PhD thesis, University of Illinois, Urbana-Champaign, USA, 1981.
[7] D.E. Rumelhart and D.A. Norman. Representation in memory. Technical report, Department of Psychology and Institute of Cognitive Science, UCSD, La Jolla, USA, 1983.
[8] G. Salton. Automatic information organization and retrieval. McGraw-Hill, New York, 1968.
[9] G. Salton and C. Buckley. On the use of spreading activation methods in automatic Information Retrieval. In Proceedings of ACM SIGIR, Grenoble, France, June 1988.