HYPERMEDIA AND FREE TEXT RETRIEVAL*

M.D. DUNLOP† and C.J. VAN RIJSBERGEN
Department of Computing Science
University of Glasgow
Glasgow G12 8QQ
United Kingdom
Telephone: +44 (0)41 3304256
Abstract – This paper discusses aspects of multimedia document bases and how access to documents held on a computer-based system can be achieved, concentrating on the two current access methods: hypermedia and free text information retrieval. Browsing-based hypermedia systems provide ease of use for novice users and equal access to any medium; however, they typically perform poorly with very large document bases. In contrast, query-based free text retrieval systems are typically designed to work with very large document bases, but have very poor multimedia capabilities. The paper presents a hybrid of these two traditional fields of information retrieval, together with a technique for using contextual information to provide access, through query, to documents which cannot be accessed by content (e.g. images). Two experiments carried out to test this approach are then presented. Finally, the paper gives a brief discussion of a prototype implementation, which provides access to mixed-media information by query or browsing, and of the user-interface issues which arise.
* A version of this paper was presented at the RIAO 91 conference in Barcelona (Dunlop and Van Rijsbergen 1991). This paper completely supersedes the conference paper in the development and testing of the model; however, the conference paper gives more details on the prototype application and on the wider issues described at the end of this paper.
† May be reached by e-mail at [email protected].
INTRODUCTION

The research reported in this paper was conducted with the main aim of providing a general method for query-based access to non-textual documents. The motivation behind this work comes from two directions: firstly, the requirement to provide access to non-textual documents held in large computer-based document bases; secondly, consideration of the current methods for accessing such document bases.

Computers have traditionally been used for processing numerical and textual information, with the vast majority of computers now used almost exclusively for processing textual documents. There are, however, many fields of work which require access to non-textual information; for example, medics require access to x-rays, architects to building plans, ornithologists to bird calls, and estate agents to property photographs. In such fields the non-textual information is at least as important as the textual information which may accompany it; in other areas the non-textual information is used to highlight details or to give alternative views. With recent advances in the quality and price of display and storage technology, computers are being used more regularly for the production of images, animation, and music. It is becoming apparent to many computer users that it is possible to create large libraries of documents which contain mixed textual and non-textual information.

Most existing non-textual libraries are held on non-computer media; for example, public libraries often have an extensive music selection held on cassette, vinyl, or compact disc. Within these libraries, items are indexed by artist, by title, or by a rough classification, i.e. by textual identifiers which describe the non-textual medium. Libraries have traditionally provided access to textual documents by the same process; novels, for example, are typically indexed by author and title. In recent years, however, there has been a significant growth in computer-based library systems which have access to the entire text of each document (or at least a paragraph or two extracted from it). This not only allows the searcher to partially examine the content of documents, e.g. academic papers, without going to the shelf, but also permits searching on the documents' content. In a textual environment, reasonably effective general-purpose algorithms have been developed which allow a user to input a natural language sentence which is then matched against all the documents in the document base (van Rijsbergen 1979; Croft & Harper 1979). However, no such general-purpose algorithms exist, or are likely to exist, for the automatic matching of non-textual documents.
The problem of accessing non-textual nodes by query has typically been solved with domain-specific solutions or by associating a piece of text with each non-textual document. Domain-specific solutions have generally fallen into two categories: searching a limited pictorial language (e.g. Constantopoulos et al. (1991) for document retrieval, and Kurlander & Bier (1988) for searching and replacing within a larger drawing) or working in specific application domains (e.g. Andrew et al. (1990) for access to images held within a document base of medical scans, and Hirabayashi et al. (1988), who used impressions, e.g. bright, flamboyant, and formal, to describe images held within a document base of fashion photographs).

The use of a textual entry to represent a non-textual document, e.g. Al-Hawamdeh et al. (1991), though at first promising, is likely to lead to many problems since these descriptions must be created by an indexer. This is not only a time-consuming task, but is also likely to be unreliable, since human indexers may produce inconsistent and biased descriptions due to their own perspective and level of domain knowledge; for example, an auctioneer would index the Mona Lisa very differently from a fine-art student.

Partly to provide access to multimedia document bases, there has been a rapid increase in the popularity of browsing-based hypermedia* systems in recent years. These systems, based on early work by Bush (1945) and Nelson (1967), allow the user to browse through the document base using a series of paths (or links) connecting documents together. These paths need not be restricted to accessing textual nodes and often access nodes in many media, providing a very natural environment for the storage and retrieval of non-textual information. Browsing-based retrieval systems are, however, restricted in scale due to the undirected approach users must take. General reviews of work in hypermedia can be found in a special edition of CACM (Smith and Weiss 1988) and in Conklin (1987); Begoray (1990) and Nielsen (1990) present more recent surveys together with discussions of the design decisions involved in creating a hypermedia system.

The contrast between browsing and querying may be expressed in terms of a book library. When using a small library, e.g. a small departmental library, or when looking within a field with which one is very familiar, it is often easier simply to browse through the bookcases looking for useful books.
* Throughout this paper the term hypermedia will be used as a general term which encompasses mixed-media and text-only variants of the access method. The text-only variant is also referred to, elsewhere, as hypertext.
The organisation of the library, and any labels which are provided, will help locate the required books. Alternatively, when looking for books in an unknown domain in a large library, it is much easier to start with a query, either to the librarian for help or to the catalogue system (whether computerised or not).

This distinction has led to the inclusion of query routines in hypermedia systems, so that the user can issue a query to locate the approximate areas to browse. These queries cannot, however, provide direct access to non-textual nodes, leaving users to browse to these from textual nodes. In such systems non-textual documents must be linked on paths from textual documents; otherwise it would not be possible to access them at all. These links could, in turn, be used to provide direct query access to non-textual nodes, based on the non-textual document's context in the document base; for example, if a digitised image were linked with various textual nodes, and a reasonable proportion of these textual nodes were relevant to the user's query, it is likely that the image itself would be relevant.

To make use of this form of access to non-textual information, a combined hypermedia and free text model of information retrieval must be used. Frisse (1988) developed such a hybrid model and used it in the development of a medical handbook which allowed access through query and browsing. The model used by Frisse takes account of the hypermedia structure when performing a query, as opposed to the approach taken here, which treats the two access methods as almost orthogonal. The model developed here uses hypermedia links to give an approximation to the content of a non-textual node; this approximation, or descriptor, is then treated as the document's content, with the retrieval engine having no understanding of links.

A MODEL FOR ACCESSING NON-TEXTUAL NODES

Within a hypermedia system composed of many nodes of different media, it is likely that the structure shown in figure 1 would occur naturally. Indeed, if access to the non-textual node is to be permitted, then links must exist between textual and non-textual nodes.
Figure 1: Non-textual object linked to textual objects. [The figure shows a central non-text node connected by links to four textual nodes, each labelled "This is a textual node which can be retrieved directly."]
In a retrieval system which provides free text querying and hypermedia browsing, it would be possible to issue a textual query which results in one (or more) of the textual nodes being presented to the user. The user could then use the links to browse from the matched document to the non-textual document. These links can also be used to calculate an artificial descriptor for the image* node, which would permit it to be retrieved directly from a textual query. The textual nodes which are connected to the image node can be considered as forming a cluster – a group of nodes (or documents) which are very closely related. Cluster techniques for calculating the average meaning, or cluster representative, of the cluster can then be applied to establish the overall content of the documents which compose it. As these nodes are all linked with the image node, the cluster representative can be assigned to the image, giving it a retrieval content equal to the combined content of the nodes connected to it. This approach provides a method for automatically calculating a descriptor for non-textual nodes; the descriptor can then be used to perform information retrieval operations (e.g. querying and relevance feedback) directly on the non-textual node.

A Simple Cluster-Based Model of Retrieval (Level 1 Cut-Off)

The descriptor of a non-textual node can be calculated by considering each document, L, as an N-dimensional vector – where each term occurring in the document base is considered as a dimension and the value of L_i is the weight of term i in document L. The cluster centroid algorithm (Croft 1978) can then be used to calculate the point in N-dimensional space which is at the centroid, or centre, of
* To increase clarity this discussion considers an image node to be connected with many textual nodes. There is no requirement for the central node to be an image and it could be composed of any non-textual medium.
the points which represent the documents in the given cluster. The algorithm essentially averages the weight of terms in neighbouring documents and can be expressed as follows:
$$\forall i \in 1..N: \quad W_{d,i} = \frac{\sum_{L \in C_d} L_i}{|C_d|}$$
where

W_{d,i} = cluster-based weight of term i of document d
C_d = cluster of documents linked to, and from, document d

Supporting a Wider Context (Level 2 Cut-Off)

The level 1 cut-off, described above, only takes into account the immediate neighbours of a node in the hypermedia network. It may be useful to consider more of the context of the nodes when calculating their cluster-based descriptors. The basic model is extended below to take into account all nodes which can be reached from the node in question by following at most two links and, vice versa, all nodes which can reach the non-textual node within two links (since links need not be symmetrical).
$$\forall i \in 1..N: \quad W_{d,i} = \frac{\sum_{L \in C_d} L_i}{|C_d|} + k\,\frac{\sum_{L \in C'_d} L_i}{|C'_d|}$$

where

W_{d,i} = cluster-based weight of term i for document d
C_d = cluster of documents linked to, and from, document d
C'_d = cluster of documents linked to, and from, document d by exactly two links: C'_d = (∪_{i ∈ C_d} C_i) − C_d − {d}
k = a constant, 0 ≤ k ≤ 1, defining the relative strength of the more remote neighbours
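To make the two definitions concrete, the following minimal sketch computes a combined level 1 / level 2 descriptor for a single node. It is an illustration only, not the implementation used in this work: the data structures (links, term_weights), the pre-symmetrised link sets, and the decision to average over indexable neighbours only are all assumptions.

```python
from collections import defaultdict

def cluster_descriptor(doc, links, term_weights, k=0.5):
    """Level 1 / level 2 cluster-based descriptor for a node (e.g. an image).

    links[d]        -- set of documents linked to, and from, document d
    term_weights[d] -- {term: weight} for each indexable (e.g. textual)
                       document; non-indexable documents have no entry
    k               -- relative strength of the two-link neighbours, 0 <= k <= 1
    """
    n1 = links[doc]                               # C_d: immediate neighbours
    c1 = [d for d in n1 if d in term_weights]

    # C'_d = (union of the neighbours' neighbourhoods) - C_d - {doc}
    n2 = set().union(*[links[d] for d in n1]) - n1 - {doc}
    c2 = [d for d in n2 if d in term_weights]

    w = defaultdict(float)
    for cluster, strength in ((c1, 1.0), (c2, k)):
        for d in cluster:                         # centroid: average each term
            for term, weight in term_weights[d].items():
                w[term] += strength * weight / len(cluster)
    return dict(w)
```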
The model could be extended continually to include more and more distant nodes; however, the impact of these extra nodes would be small, as they are quite distant from the node of interest. The model can also be extended by taking into account different strengths of link between nodes, for example as a result of the type of link or the medium of the destination node.
Limitations of the Model

Although the model developed here provides a general-purpose method of indexing non-textual nodes for access by textual query, it does have some limitations. These are mainly connected with the quality and quantity of the links available in a given document base.

The model assumes that the document base contains many nodes which can have descriptors assigned from their content (in a traditional document base this set is restricted to textual nodes). This restricts use of the model to document bases which have a reasonable ratio of indexable (e.g. textual) to non-indexable nodes. For the cluster-based algorithms to provide some benefit, each non-indexable node should be linked to at least two indexable nodes; if only one link is available then the model degrades to a poor variant of the traditional method, by which non-indexable media are retrieved by indexing a hidden textual description. This restriction does not, however, require the network to contain twice as many indexable nodes as non-indexable nodes. As an absolute minimum requirement for level 1 cut-off, each non-indexable node in the network must be connected directly to at least one indexable node. If the document base uses level 2 cut-off, the requirement is relaxed: every non-indexable node must be connected to an indexable node with no more than one intervening node on the path.

The precise ratio of indexable to non-indexable nodes required for reasonable cluster descriptors to be created is not clear, and will vary between document bases; in general, a ratio of 1:1 could be considered a reasonable minimum. As with many areas of free text information retrieval, there is no solid cut-off point: the model can be used with lower ratios of indexable to non-indexable nodes, but the effectiveness of the retrieval engine will decrease as the percentage of indexable nodes falls. Likewise, the model benefits from an increased ratio of indexable to non-indexable nodes and from an increased number of links – so long as the links from each node still reach only a small subset of the document base. As the number of nodes used to calculate a cluster descriptor approaches the total number of nodes in the document base, the effectiveness of the retrieval engine again reduces: as the cluster descriptor becomes a descriptor for the entire document base, its ability to distinguish documents within the base is lost.

As with plain hypermedia, the quality of links is also very important. The links must connect the node being indexed to other nodes which are similar. If the links are to very similar nodes then the
descriptors will more precisely describe the content of the cluster, and therefore more accurately describe the non-textual node. This issue is addressed in more detail later.

TESTING THE CLUSTER DESCRIPTOR

To test the validity of this approach, two experiments were carried out using a text-only document base. The experiments first calculated a descriptor for each document by traditional statistical analysis, and then a second descriptor for each document based on the cluster technique described above. The test collection was composed of 3204 records derived from the Communications of the A.C.M. (Fox 1991). Each record was composed of (amongst other fields) a title, an abstract, a list of keywords, and a list of references to other records in the collection. Many of the records in the collection did not have any references to, or from, other records and so could not be used for these comparative experiments; this left only 1751 usable records.

Initially, each document had a descriptor calculated for it using traditional statistical analysis of the words which compose the document. These words were passed through a stop-word filter (van Rijsbergen 1979, pp. 28-29) to remove most words which have no meaning when taken out of context. The remaining words were then conflated (Porter 1980) so that inflectional suffixes were removed. Finally, the words were assigned a score reflecting their ability to distinguish relevant from non-relevant documents (Sparck Jones 1971). A cluster descriptor was then calculated for each document, based on the documents which were cited by it and which cited it; this descriptor was calculated using the simpler, level 1 cut-off, method described above.

Experiment 1 – Comparing Different Clusters

Once each document had a descriptor calculated by traditional and by cluster techniques, the two descriptors were compared using the cosine coefficient (van Rijsbergen 1979, p. 39). This coefficient considers the two N-dimensional vectors (which start at the origin) representing the descriptors and calculates the cosine of the angle between them: a value of 0 corresponds to two perpendicular (Θ = 90°) vectors, or two documents with no common terms; a value of 1 (Θ = 0°) corresponds to two identical* documents.
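For illustration, the cosine coefficient over sparse term-weight vectors (such as those produced by the stop-word, conflation, and weighting steps above) can be sketched as follows; the dictionary representation is an assumption, not the experimental implementation.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors.

    u, v -- {term: weight} dictionaries; returns 0.0 if either vector is empty.
    """
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```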
The cosine coefficient was calculated between each document's traditional, content-based, descriptor and its cluster descriptor. As a base case, the same process was carried out with cluster descriptors calculated from random links (as opposed to using citations as links). Figure 2 shows the correspondence between the two sets of coefficient values obtained by comparing each traditional descriptor with the corresponding citation-linked cluster descriptor and with a random-linked cluster descriptor. The distribution of links for random cluster assignment was devised so as to approximate the distribution of the actual citation links, but was not a perfect match – hence the three outlying points (which, in these cases, represent single documents).
Figure 2: Comparison of random and link-based cluster descriptors. [Scatter of cosine correlation (0.0–1.0, y-axis) against number of links (0–80, x-axis) for the two link types: citations, mean = 0.230, std. dev. = 0.182; random links, mean = 0.037, std. dev. = 0.034.]
The document base was heavily biased towards records which had very few links associated with them; figure 2 averages the correlation values for each integral number of links, so the quoted means need not be the mean of the points displayed. The cluster descriptors based on citations achieved an average 23% similarity with the original textual descriptors, whilst the random-based cluster descriptors achieved only 4%. This shows that
* As far as the retrieval system is concerned.
the citation-based cluster descriptors perform significantly (approximately six times) better than the random case; it also shows that the cluster technique is likely to provide descriptors of suitable quality for use by a retrieval system.

Experiment 2 – Assessing the Effect on Precision and Recall

To gain an impression of how cluster-based retrieval would perform in practice, an experiment was run to compare the precision and recall of a given retrieval engine on the CACM collection. The experiment was based on the 64 standard queries (with relevant documents, or answers) which are provided with the CACM collection. Initially, the main text of each query was matched, using the cosine coefficient, against the descriptor of every document in the CACM collection, producing a sorted list of matched documents. The set of 64 queries was then repeated, but using the citation-based cluster descriptors to represent the documents. The outcome of this experiment was two sets of query solutions, each solution stating the query number and which documents were considered the best matches (in decreasing order of matching value).

From these lists the top M documents were selected, where M is the number of documents given in the CACM sample answers, and compared with the sample answers using the standard measures of precision and recall, producing recall and precision figures for each query. Similar figures were produced for the top 2M and 3M documents. As a base case, the experiment was then repeated using clusters based on random links, as opposed to the citations used to calculate the main clusters. Finally, the recall and precision figures were averaged for each number of documents retrieved (the same number as in the sample solutions, twice as many, or three times as many) and are presented in figures 3 and 4.
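A minimal sketch of this evaluation step is given below; the ranked-list and answer-set structures are assumptions, and M is taken per query from the sample answers as described above.

```python
def precision_recall_at(ranked, relevant, multiple=1):
    """Precision and recall over the top M, 2M, or 3M ranked documents.

    ranked   -- document ids sorted by decreasing matching value
    relevant -- set of document ids given as the sample answer (size M)
    multiple -- 1, 2, or 3 for the single, double, and triple levels
    """
    top = ranked[:multiple * len(relevant)]
    hits = sum(1 for doc in top if doc in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```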
Figure 3: Graph of Recall Against Number of Documents Retrieved. [Recall, on a 0.0–0.5 scale, at the single, double, and triple retrieval levels, for real data, citation-based clusters, and random clusters.]

Figure 4: Graph of Precision Against Number of Retrieved Documents. [Precision, on a 0.0–0.5 scale, at the same three retrieval levels and for the same three descriptor types.]
From these graphs it is quite clear that citation-based clusters provided almost as good retrieval performance as directly using the records' content (approximately 70% of the performance). Considering that the approach is intended to retrieve documents which cannot be retrieved directly, this quality of result is very encouraging, and confirms that these descriptors are of suitable quality for use in a retrieval engine. Not surprisingly, clusters based on random links provided very low retrieval performance (approximately 4% of the performance of content-based retrieval).

Statistical analysis of the performance figures was carried out using the Wilcoxon matched-pairs signed-ranks test (Siegel 1956). For this analysis the E-measure of retrieval performance (van Rijsbergen 1979) was used; this provides a single value describing the performance of the retrieval and can be defined
as below. The E-measure produces a number in the range 0 to 1, where 0 represents a perfect retrieval (recall and precision 1) and 1 represents a complete failure (recall and precision 0):

$$E_q = 1 - \frac{1}{\alpha\frac{1}{P_q} + (1-\alpha)\frac{1}{R_q}}$$

where

E_q = E-measure for query q
P_q = precision for query q
R_q = recall for query q
α = scaling factor such that 0 ≤ α ≤ 1; for these results α = 0.5.
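Since lower E is better, with α = 0.5 the measure reduces to one minus the harmonic mean of precision and recall. A minimal transcription follows; the explicit handling of zero precision or recall is an assumption the definition leaves implicit.

```python
def e_measure(precision, recall, alpha=0.5):
    """Van Rijsbergen's E-measure: 0 is perfect retrieval, 1 complete failure."""
    if precision == 0.0 or recall == 0.0:
        return 1.0                # either axis failing completely gives E = 1
    return 1.0 - 1.0 / (alpha / precision + (1.0 - alpha) / recall)
```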
Table 1 shows the mean E-values, for the 64 queries, together with their standard deviations.

                      Single          Double          Triple
  Real Data           0.753 (0.225)   0.778 (0.167)   0.804 (0.134)
  Citation Clusters   0.840 (0.166)   0.843 (0.143)   0.861 (0.116)
  Random Clusters     0.995 (0.016)   0.991 (0.020)   0.992 (0.016)

Table 1: E-measure values – mean values with standard deviations in brackets.
Significance figures were then calculated based on pairs of E-measures, where one value in each pair came from queries using the documents' content and one from queries using the context-based descriptors. The test was carried out for the three levels of retrieval (single, double, and triple) and for both cluster definitions (citations and random). A null hypothesis that the retrieval performance was the same for cluster-based and content-based descriptors was adopted; since it was strongly suspected that the content descriptors would perform better than the cluster descriptors, a one-tailed test was carried out. As can be seen below, this hypothesis was rejected for all cases with random clusters (at any level of significance). At the 0.05 significance level the null hypothesis is also rejected for citation-based clusters. However, at the 0.01 significance level the null hypothesis cannot be rejected for the single and double retrieval levels – at this level of significance, it cannot be stated that content-based descriptors are better than citation-cluster-based descriptors. Although not a strong result when assessing equality of retrieval performance, the experiment does show that context-based cluster retrieval has similar performance to content-based retrieval when the links are meaningful. The differences in
performance shown here are also within the range of differences typically found in information retrieval experiments.
                      Single     Double     Triple
  Citation Clusters   0.0287     0.0307     0.0016
  Random Clusters     0.00000    0.00000    0.00000

Table 2: Probabilities that cluster-based retrieval has the same performance as content-based retrieval.
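For reference, the paired one-tailed comparison behind Table 2 could be reproduced along the following lines; scipy's wilcoxon is a modern stand-in for the test as originally computed, and the arrays of per-query E-measures are placeholders.

```python
from scipy.stats import wilcoxon

# e_content[i] and e_cluster[i] are the E-measures for query i (64 queries)
# under content-based and cluster-based retrieval respectively.  Lower E is
# better, so "content performs better" means e_content - e_cluster < 0.
def one_tailed_p(e_content, e_cluster):
    statistic, p_value = wilcoxon(e_content, e_cluster, alternative="less")
    return p_value
```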
The results given here for citation-based clustering are based on the entire CACM document base, in which 45% of documents cannot be accessed by citation-based indexing – they have no citations (within the document base) and are not cited. The figures shown here are therefore somewhat pessimistic for fully connected document bases; separate experiments need to be carried out on such document bases.

Limitations of these Experiments

The experiments described here have attempted to show that the cluster-based model of descriptor calculation is worthwhile. To provide a test situation in which the effects could be compared against an automatically derived base case, a text-only document base was used. This raises two issues about the validity of these experiments: will the results extend to a non-textual environment, and will links in hypermedia document bases have the same properties as the citations used here?

The usage of the text document base was designed so that, when calculating the cluster-based descriptor of a node, its content was completely ignored and the node was treated as if it were non-textual. When considering the calculations carried out with respect to a particular node, the only time its contents were used* was in the calculation of the content-based descriptor for comparison with the cluster-based descriptor. As a result, there is no reason why the results shown here should not extend to a multimedia environment in which content-based descriptors cannot be calculated.

A stronger challenge to the validity of these experiments comes when considering the relationship between the citations used here and links in a hypermedia document base. These different forms of
* Excluding when it was used to calculate the cluster descriptors for other nodes.
connection between nodes have two important properties in common: they are created by human users, and there is no single formal definition of the relationship between the two connected nodes. When writing a scientific paper, authors will cite other work for various reasons: citations might be to similar work, contradictory work, interesting work in another field, the source of methods used, or deeper discussions of topics briefly covered. Although much of the work cited by authors is in the same subject area, this is not always the case, and citations are not always a strong indicator of the subject content of a paper. Likewise, in hypermedia networks, authors will include links to many nodes which they think users might find interesting. The motivation for creating, or following, many of the links in a hypermedia environment would appear to be very similar to that for citations – both are simply connections to something which the reader may find useful or interesting. It would appear that the experimental conditions and those in which the cluster-based algorithm will be used are similar enough, in the areas of importance, that the results shown here will remain valid in a hypermedia document base.

The relatively low correlations achieved in the first experiment, and the low matching quality in the second, will in part be due to the simplistic retrieval engine used in the tests. The retrieval engine was based on simple information retrieval techniques and did not include many features, such as a thesaurus, which would improve its ability to match documents. Improving the retrieval engine would have little or no effect on the correlations (and retrieval performance) for randomly created clusters, thus increasing the difference between the random and citation-based correlations.

WIDER ISSUES

So far this paper has concentrated on the development and testing of a hypothesis: that contextual information can be used to retrieve documents which cannot be retrieved by content. This section extends the argument to consider more general issues concerned with the combination of query-based and browsing-based retrieval. These issues are all considered further in Dunlop (1991).

Improvement of Textual Descriptions

So far the discussion has been based on the premise that contextual information would be used to retrieve documents which cannot be retrieved by content (e.g. non-textual nodes). While the main purpose of the approach presented here is to provide access to non-textual nodes, it may also be used
to improve the descriptors of textual nodes by defining them partly upon their context in the document base, i.e. by merging the content-based and context-based descriptors (a minimal sketch of such merging is given below, after the Removal of Header Nodes subsection). This is similar to work by Sparck Jones (1971) on query expansion by use of clustering information.

Effect on Relevance Feedback

In query-based systems which provide relevance feedback, the power of feedback is potentially reduced by the fact that users can only pass judgement on documents which matched their most recent query, as these are the only documents which are presented*. However, when query and browsing access are both provided, users can browse through the document base after issuing a query. This approach treats the query as a method for providing starting points for a browsing-based search, in a similar manner to early work by Croft (1978). When browsing, users can view, and hence give relevance feedback on, documents which do not match the current query. The strength of positive feedback on such non-matched documents could be expected to be higher than for matched documents, as the user is, in some sense, correcting the retrieval decision.

Removal of Header Nodes

Hypermedia systems often require users to select a document from the underlying file system (e.g. Guide by OWL (Brown 1986)) or provide a method for users to browse to the actual documents or nodes they are concerned with (e.g. the Home Card of HyperCard (Goodman 1989)). When the document base is large these manual access techniques become harder to use and, in the case of specific access nodes, very difficult to create. After the initial document or node is chosen, users can, however, browse freely using links between documents. The provision of query-based access removes the need for such manual access methods, since users can provide the starting point(s) for a browse by issuing a query.
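Returning to the Improvement of Textual Descriptions subsection above, one hedged reading of merging the content-based and context-based descriptors is a simple weighted combination of the two term vectors. This is a minimal sketch: the linear form and the mixing constant lam are assumptions, not a scheme specified in this paper.

```python
def merged_descriptor(content, context, lam=0.5):
    """Blend a textual node's content-based and context-based descriptors.

    content, context -- {term: weight} dictionaries
    lam              -- assumed mixing constant, 0 <= lam <= 1
    """
    terms = set(content) | set(context)
    return {t: (1.0 - lam) * content.get(t, 0.0) + lam * context.get(t, 0.0)
            for t in terms}
```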
* This is technically an interface issue and is not necessarily the case for query-only access; however, it is very typical of query-only systems. As a counter-example, Sanderson and Van Rijsbergen (1992) permit users to access documents from previous queries and to give feedback on these for the current query.
Use in Single Search Style Systems

So far the discussion has assumed that this model of retrieval will be accessed through both queries and browsing. However, this need not be the case – the hybrid model of document bases can be used "behind the scenes" to provide some query-style features to users who can only browse, and vice versa.

When the hybrid model of document bases is presented through a browsing-only interface, nearest-neighbour links (van Rijsbergen 1979) could be provided between nodes. These links would provide a method for browsing between two nodes which, although not directly linked, have very similar content. This reproduces some of the effects of queries, but within an entirely browsing-based environment. The idea of using information about documents to create links is not a new one, and was developed by Ivie (1966) for search routines.

It would also be possible to use access to nodes as a mild form of relevance feedback and, consequently, to build an impression of the user's interests. This impression could then be used either to provide extra links or to highlight links to nodes which match it. However, the dynamic creation of links would have to be very rapid to avoid degrading the browsing performance of the system.

A free text retrieval engine could also be used by authors of the hypermedia base to assist in the creation of links, by suggesting nodes which are very similar to the node being added. This would reduce the, theoretical, requirement for authors to scan the entire document base for nodes to be connected with the new node.

In a system which is restricted to providing query-only access (e.g. due to compatibility constraints), links may still exist in the underlying document base. These could be used to provide context-based access to non-textual nodes and to improve the descriptions of textual nodes – indeed, links might be developed entirely for these purposes.

It is not clear whether a user interface which provided access by a combination of queries and browsing would perform as well as, or better than, a single search style system supplemented as described above. While a hybrid (query and browsing) search strategy has potentially more power than either browsing-only or query-only searching, it is also a more complex model of retrieval, which may reduce users' effectiveness. A set of user tests (Dunlop 1991) showed very small differences in user effectiveness, with respect to time and quality of results, when using a basic integration of querying and browsing – better results are expected for the tighter integration of the
access methods suggested above. The tests were carried out by 30 users (mainly undergraduate students with limited experience of using computers and no knowledge of computing science), over approximately 1.5 hours per user, using the prototype system mmIR (multi-media Information Retriever) described briefly below.

A prototype system, mmIR, has been developed, based on the ideas presented in this paper, to provide access to the British Highway Code (Department of Transport 1987) by browsing, by querying, or by a combination of both. The Highway Code is a mixed-media document composed of 198 rules, which may reference each other, and 31 images, which are required to fully understand the textual rules. Figure 5 presents a typical screen from mmIR, showing the 5th best-matched node for the query "position in road when turning right". The Navigator window provides users with most of the commands for browsing around the document base and/or the matched document list, while links to other nodes are presented as a list in the main window (see figure 6).
Figure 5: A typical screen shot
While the main window in figure 5 shows an image node which matched the query, figure 6 shows a non-matching textual node, which was accessed by link traversal and is itself linked to node 55 – which may, or may not, match the current query. mmIR is described more fully in Dunlop and Van Rijsbergen (1991) and in Dunlop (1991).
Figure 6: Display of a Non-Matching Node (with link)
CONCLUSIONS

This paper has built upon a combined model of information retrieval which encompasses the principles, and the benefits, of both free text retrieval and hypermedia. The model gives users access to large document bases which have limited structure, while allowing them to browse using whatever structure does exist. The paper developed a model for approximating the meaning (or content) of documents which cannot be retrieved by content (e.g. images), by use of contextual information extracted from a hypermedia network. Two experiments, in a text-only environment, have shown that the model provides descriptions of documents which are reasonably similar to content-based descriptions, and that, when issuing queries, context-based descriptions performed reasonably effectively when compared with content-based descriptions.

Further work needs to be carried out on the use of these methods within a large multimedia document base; however, the testing of such a system would be more difficult since no standard collection (with relevance judgements) is available. It would also be useful to experiment with different document bases which make different use of non-textual media: for example, queries to fine-art catalogues would probably be expected to retrieve images directly, while following links is perfectly acceptable for a bar chart in a company report which merely highlights the text. Work also needs to be carried out to compare the different levels of cluster-based description presented here.

Overall, the experiments, and observation of a small prototype system, have shown that using context information from a hypermedia network to retrieve non-textual nodes by query is effective and of reasonable quality.
Acknowledgement – The authors would like to acknowledge the support of the Science and Engineering Research Council, which funded the main author's Ph.D. work from which this paper is mainly derived. The authors would also like to thank Professor Robert Oddy for his extensive help during the writing of this paper.
REFERENCES

Al-Hawamdeh, S., Ang, Y.H., Hui, L., Ooi, B.C., Price, R., & Tng, T.H. (1991). Nearest neighbour searching in a picture archive system. Proceedings of the ACM SIGIR International Conference on Multimedia Information Systems, National University of Singapore.

Andrew, M.L., Bose, D.K., & Cosby, S. (1990). Scene analysis via Galois lattices. Proceedings of the IMA Conference on A Maths Revolution Due to Computing, Oxford University Press.

Begoray, J.A. (1990). An introduction to hypermedia issues, systems and application areas. International Journal of Man-Machine Studies, 33, 121-147.

Brown, P.J. (1986). Interactive documentation. Software: Practice and Experience, 16(3), 291-299.

Bush, V. (1945). As we may think. Atlantic Monthly, 176(1), 101-108.

Conklin, J. (1987). Hypertext: An introduction and survey. IEEE Computer, 20(9), 17-41.

Constantopoulos, P., Drakopoulos, J., & Yeorgaroudakis, Y. (1991). Retrieval of multimedia documents by pictorial content: a prototype system. Proceedings of the ACM SIGIR International Conference on Multimedia Information Systems, National University of Singapore.

Croft, W.B. (1978). Organising and searching large files of document descriptions. Ph.D. Thesis, University of Cambridge.

Croft, W.B., & Harper, D.J. (1979). Using probabilistic models of document retrieval without relevance information. Journal of Documentation, 35, 285-295.

Department of Transport (1987). The Highway Code. Her Majesty's Stationery Office, London.

Dunlop, M.D., & van Rijsbergen, C.J. (1991). Hypermedia & free text retrieval. Proceedings of the RIAO 91 Conference on Intelligent Text and Image Handling, Universitat Autònoma de Barcelona, Catalunya, Spain, pp. 337-356.
Dunlop, M.D. (1991). Multimedia Information Retrieval. Ph.D. Thesis, Computing Science Department, University of Glasgow. Report 1991/R21.

Fox, E. (1991). CACM document collection, Virginia Polytechnic Institute and State University.

Frisse, M.E. (1988). Searching for information in a medical handbook. Communications of the ACM, 31(7), 880-886.

Goodman, D. (1989). The Complete HyperCard Handbook. Bantam Computer Books, New York.

Hirabayashi, F., Matoba, H., & Kasahara, Y. (1988). Information retrieval using impression of documents as a clue. Proceedings of the 1988 ACM SIGIR Conference.

Ivie, E.L. (1966). Search procedures based on measures of relatedness between documents. Ph.D. Thesis, M.I.T., USA. Report MAC-TR-29.

Kurlander, D., & Bier, E.A. (1988). Graphical search and replace. Computer Graphics, 22(4), 113-120.

Nelson, T.H. (1967). Getting it out of our system. In Schechter, G. (Ed.), Information Retrieval: A Critical Review (pp. 121-210). Thompson Books, Washington D.C.

Nielsen, J. (1990). Hypertext and Hypermedia. Academic Press, San Diego, California.

Porter, M.F. (1980). An algorithm for suffix stripping. Program, 14(3), 130-137.

Sanderson, M., & Van Rijsbergen, C.J. (1992). NRT (News Retrieval Tool). Electronic Publishing – Origination, Dissemination and Design (to be published).

Siegel, S. (1956). Nonparametric Statistics: For the Behavioral Sciences. McGraw-Hill.

Smith, J.B., & Weiss, S.F. (1988). Hypertext. Introduction to the special issue of Communications of the ACM on hypermedia, 31(7).

Sparck Jones, K. (1971). Automatic Keyword Classification for Information Retrieval. Butterworths, London.

van Rijsbergen, C.J. (1979). Information Retrieval (second edition). Butterworths, London.