SUPPORT OF ONLINE DATABASE SELECTION IN KHS
Marc Rittberger
University of Constance, Department of Information Science
P.O. Box 5560, D-78434 Konstanz
email:
[email protected]
Keywords: online database selection, open hypertext, KHS, online retrieval
Abstract: The Constance-Hypertext-System (KHS) is an open hypertext-system which employs internal and external information sources to satisfy the user’s need for information when navigating through a hypertext. E-mail, Gopher and online databases are available in KHS as external information sources. The main tasks which must be carried out during an online search are analysis of the question, selection of the databases, formulation of a search strategy, online search, feedback and the presentation of the results. All of these tasks are supported within KHS, but special emphasis is given to database selection, which will be discussed below. In this article we will introduce a new method for selecting relevant online databases with KHS using the database descriptions from several hosts.
1. INTRODUCTION
The 1994 Gale Directory of Databases (formerly Cuadra) [Ref. Marcaccio 94] shows that the online market is still growing. The numbers of producers, hosts and especially databases continue to increase. Whereas in 1990 there were nearly 4,000 databases, at the beginning of 1994 about 5,300 different databases were available on the online market. Selecting the right database for a search is difficult not only because of the great number of available databases, but also because they differ in contents, type, size, update time and geographic coverage. Even the name of a database can be a problem, because some databases are offered by several hosts under different names. For example, the INSPEC database is referred to as files 2, 3 and 4 by DIALOG and as INSP, INZZ and IN79 by DATASTAR. This matters because appropriate database selection is one of the key factors in obtaining the right information from online databases. The online community is well aware of this and has made available various aids for database selection, such as guides for the whole online market, guides for specific subjects (e.g. chemistry) and guides for single hosts. There are also a few experimental systems [Ref. Glasen 93, Morris et al. 93, Williams & Preece 77] which support online database selection using expert systems. In the following we explain an approach to the selection of online databases aided by a hypertext system, using the Constance-Hypertext-System (KHS) as an example. KHS, as an open hypertext system, provides tools for online retrieval, including support for the selection of online databases. Below we give a brief introduction to KHS and its online features and describe its database selection mechanism in detail.
2. KHS
KHS is an open hypertext-system designed to integrate various application domains, the use of multiple information resources and use by several users. It offers a uniform presentation and interaction style, which does not vary with the specific application. By typing hypertext objects, KHS allows users to structure a hypertext in the way most suitable for their current application. Because units and/or links are typed, KHS can handle, for example, e-mail communication, a knowledge base or the results of an online search [Ref. Hammwöhner & Kuhlen 93, Rittberger et al. 94]. KHS units are arranged in a polyhierarchy of composite nodes which allows users to structure information in a domain-specific way, e.g. all information about a specific topic being researched can be assembled and organized under one composite node. The system distinguishes between composite units and terminal units.
• Composite units are hypertext structuring elements which are used to build the polyhierarchy of a hypertext. They can contain additional units, either other composite units or terminal ones.
• Terminal units organize data, their presentation and the interaction with these data. Textual units handle text and image units handle images, whereas form-units contain structured information, e.g. an address, and external application units allow users to run external applications such as a PostScript viewer.
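The distinction between composite and terminal units, and the fact that a unit may belong to several composite units at once, can be illustrated with a small data structure. The following Python sketch is not KHS code; the class names, the `kind` field and the helper `paths_to_top` are hypothetical, and the unit names are merely borrowed from the figures for illustration.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Unit:
    """A hypertext unit; in a polyhierarchy it may have several parents."""
    name: str
    parents: list["CompositeUnit"] = field(default_factory=list, repr=False)


@dataclass(eq=False)
class TerminalUnit(Unit):
    """Leaf unit holding data; `kind` (assumed) is e.g. 'text', 'image', 'form'."""
    kind: str = "text"
    content: str = ""


@dataclass(eq=False)
class CompositeUnit(Unit):
    """Structuring unit that contains composite or terminal units."""
    children: list[Unit] = field(default_factory=list, repr=False)

    def add(self, unit: Unit) -> None:
        # Register the relation in both directions.
        self.children.append(unit)
        unit.parents.append(self)


def paths_to_top(unit: Unit) -> list[list[str]]:
    """Return every path of unit names from `unit` up to a parentless unit."""
    if not unit.parents:
        return [[unit.name]]
    return [[unit.name] + path
            for parent in unit.parents
            for path in paths_to_top(parent)]


# Illustrative layout only: one form-unit placed in two hierarchies.
top = CompositeUnit("Supporting Units")
alphabetical = CompositeUnit("alphabetical listing")
water = CompositeUnit("water resources")
top.add(alphabetical)
top.add(water)
env = TerminalUnit("Environmental Bibliography", kind="form")
alphabetical.add(env)
water.add(env)

for path in paths_to_top(env):
    print(" > ".join(path))
```

Storing parents as a list rather than a single reference is what turns an ordinary tree into a polyhierarchy: the same unit object is reachable along more than one path.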
Figure 1 The ’KHS-Browser’ and the ’Table of Contents Browser’. The ’KHS-Browser’ shows the current unit, the links of the current unit, the path to the top of the hierarchy and the available contexts. The ’Table of Contents Browser’ displays the embedding of the currently selected unit in the hierarchy of the hypertext.
Links represent relations between units which are based on the contents of the related units. Depending on their type, links may connect whole units, pieces of text (hotwords) within units, or areas within images (hotareas). The main tool used to navigate and browse in a hypertext with KHS is the ’KHS-Browser’, shown in figure 1. It has four windows, each of which contains information on the selected unit.
• The left-hand window shows the content of the currently selected unit. In our system, a form-unit of type ’OnlineDatabaseUnit’ is used to display textual information describing an online database. The different fields of this form-unit are used to organize the information.
• The upper right-hand window shows the various links used in this unit. To navigate in the hypertext a user can click on a link, e.g. similarity, and will thereby obtain the possible destination units connected with the current unit. There are four possible destinations shown in figure 1, and for each of them the weight of the link and the name of the destination unit are given.
• The middle right-hand window gives access to the hierarchy in which the user is currently navigating.
• The lower right-hand window shows the available contexts. Because of the polyhierarchical structure of the hypertext, a given unit can be part of different hierarchies in the hypertext. The current unit in figure 1 has two contexts. First, the unit is included in the selected alphabetical listing of units; second, the unit is part of a content-based cluster of units designated ’water resources’, which has been selected in the illustration.
Besides the main ’KHS-Browser’, other tools are available in KHS which facilitate navigation, orientation and searches in a hypertext.
• The ’Table of Contents Browser’ provides an overview of how a chosen unit is embedded in the selected hierarchical structure of the hypertext. In the left-hand side of figure 1 we find the unit ’Environmental Bibliography’ as part of a composite unit designated ’water resources’, together with other environmentally relevant units.
• The ’Full Text Browser’ takes the full text of all units of the current hierarchical level and all units in the subhierarchy of a selected unit and linearizes the textual information.
• As with the ’Table of Contents Browser’, one can access the hierarchy of the links in the hypertext with the ’Link Hierarchy Browser’ (see figure 3).
• With the ’Query Browser’ the user can access units by retrieving them through either an index or a full text search.
A KHS user has access to information sources other than internal ones. The external information sources integrated in KHS are based on different Internet services. Useful Internet services such as Gopher, World Wide Web, Wide Area Information Servers, Simple Mail Transfer Protocol, Domain Name Service, Telnet or the File Transfer Protocol are normally accessible only through different tools. KHS provides the opportunity to use several Internet services through a single interface with a uniform interaction model, and also to share data with other users. External information sources based on standard Internet services already available for KHS are electronic mail, a Gopher client service and Telnet’s NVT (Network Virtual Terminal), which permits online database searches [Ref. Aßfalg et al. 93, Hammwöhner & Rittberger 93].
Figure 2 The ’Online Browser’ with which the user can arrange the search terms and start the search. The lower window displays the retrieval process in the online database.
3. ONLINE SEARCH
Because not all vendors of online databases are accessible through the Internet, the international online market is also accessible to KHS using the German X.25 network. The integration of online database searches in a hypertext environment was first mentioned by Smith [Ref. Smith 88] as a ’query link’, whereby users can run an online search starting in a hypertext. Percival and Mac Morrow [Ref. Percival & Morrow 89] used a hypertext system based on Hypercard to establish a connection with a business database containing company information. Online searching involves not only a ’query link’, but also several steps a user must follow in order to complete a successful search. The most important steps are analysis of the question, selection of the databases, formulation of a search strategy, online search, feedback and presentation of results. While there are already many user-friendly systems which support one or more of these steps, KHS goes further and assists users during all the steps necessary to carry out a successful online search. A short overview of online searches in KHS is presented below, followed by a detailed description of database selection.
• Users have several options for choosing the right terms to employ when searching in external databases. While navigating through a hypertext, users can select terms by:
1. using the index terms of one or more specific units relevant to their problem.
2. selecting words or phrases in the hypertext referring to relevant aspects of the search topic.
3. navigating through a KHS-thesaurus.
4. entering words of their own which may be needed for the search.
• KHS supports users in selecting an online database with a hypertext which contains the database descriptions of the online databases of several hosts (described in detail in section 4). Users can start the selection process during the preparation of an online search by using the ’Online Browser’ (see figure 2).
• To select a search strategy, users decide whether or not they want to use boolean operators in the online search. Without boolean operators, the online search is performed in the same way as a retrieval query in KHS: the system takes the terms chosen by the user, counts how often they occur in the inverted file of the selected online database, analyzes these data using a statistical procedure, searches for the most relevant documents and ranks them in the order of their frequency [Ref. Rittberger et al. 94]. Users who know boolean operators can arrange their search terms in the appropriate way for a specific strategy (e.g. ’building block’) and include formal parameters like publication year or language in the search request.
• The search in the online database runs fully automatically, either with a statistical query or with a boolean one. KHS connects to the selected online database, runs the query, displays the results, finishes the search and terminates the connection.
• Every retrieved document is assigned to a new hypertext unit unless a unit for the document already exists, which may be the case if the document was already found in a previous search. Users receive a list of all new units so that they can look them up and decide whether they are relevant. Every unit which is not to be kept in the hypertext can be deleted from this list. All new units are aggregated under several nodes and contexts, e.g. according to author names, publication year, publication type (e.g. all reports) or as a group containing all documents from the National Online Meetings.
• The online search is itself documented in a special unit which contains the query and is linked to all new units.
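The statistical (non-boolean) search and the subsequent filing of results can be sketched in a few lines of Python. The text does not specify the statistical procedure, so the idf-style weight used here is an assumption, as are the function names, the toy inverted file and the dictionary standing in for the hypertext.

```python
import math
from collections import defaultdict


def rank_documents(inverted_file, terms, n_docs, top_k=10):
    """Rank documents by the summed weights of the query terms they contain.

    `inverted_file` maps a term to the set of document ids containing it
    (a toy stand-in for the host's inverted file). The weighting formula
    is an assumed example of a "statistical procedure".
    """
    scores = defaultdict(float)
    for term in terms:
        postings = inverted_file.get(term, set())
        if not postings:
            continue  # term does not occur in this database
        weight = math.log(1 + n_docs / len(postings))  # rarer terms weigh more
        for doc_id in postings:
            scores[doc_id] += weight
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_k]


def file_results(hypertext, ranked, search_unit, query):
    """Create a unit for each document not yet in the hypertext and record
    the search in its own unit, linked to all new units."""
    new_units = []
    for doc_id, _score in ranked:
        if doc_id not in hypertext:  # a unit may exist from an earlier search
            hypertext[doc_id] = {"links": []}
            new_units.append(doc_id)
    hypertext[search_unit] = {"query": list(query), "links": list(new_units)}
    return new_units


# Hypothetical data: four documents, one of them already known.
inverted_file = {
    "sludge": {"d1", "d2", "d3"},
    "lake constance": {"d2", "d4"},
}
query = ["lake constance", "sludge"]
ranked = rank_documents(inverted_file, query, n_docs=100)
hypertext = {"d4": {"links": []}}
new_units = file_results(hypertext, ranked, "search-1", query)
print([doc for doc, _ in ranked], new_units)
```

Returning only the new units mirrors the list presented to the user for review; deleting an unwanted unit would then amount to removing its key from the hypertext and from the links of the search unit.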
4. DATABASE SELECTION
Choosing the right database from among those available on the online market is a major problem. At our department we analyzed about a hundred online searches run on just one host, and even with the relatively small number of about 200 available databases, the searchers often had difficulties in choosing the optimal database for their search. One reason was that they were simply not aware of one or more relevant databases. The aim of the project was to assist novice users in selecting online databases for their first searches, to help experienced users find relevant databases for new areas, and to give experts advice on a special topic of a database or to recommend a relevant but still untried database. To support the different aims of different user groups, several procedures are provided for accessing the relevant databases. The selection of online databases depends on many factors such as contents, sources, geographic and time coverage, update frequency and cost. All of these factors are accessible to users for each database in KHS, but we have placed the main emphasis on content-based support of database selection.
Figure 3 The ’Link Hierarchy Browser’ displays the connectivity of the currently selected unit. It shows the links, starting with the current unit and all neighboring units accessible by means of a link from the current unit.
To obtain the most recent database content, we used the database descriptions of several hosts, which are available in machine-readable form. We chose three hosts offering altogether more than 600 different databases: DATASTAR, DIALOG and STN, all three of which are major players on the international online market. We took all the descriptions of the online databases which the hosts offer in the databases DATASTAR-BASE, DIALOG-BLUESHEETS and STNGUIDE and embedded them in a KHS hypertext arranged in a polyhierarchical structure. A hierarchy in which each composite unit contains nine other units allows KHS to store and organize all database descriptions on the third level of such a hierarchy (9 × 9 × 9 = 729 places, enough for the more than 600 descriptions). The database descriptions of the three hosts are embedded in such a hierarchy in alphabetical order, which allows quick and direct access with only three navigational steps from the top unit to the unit containing the database description of current interest.
All databases are linked to each other on the basis of the similarity of their descriptions. Similarity is measured with a statistical procedure using term frequency and inverse document frequency [Ref. Salton & McGill 83]. If the calculated similarity between two units exceeds a certain threshold, KHS connects the two units through a ’similarity link’. Independently of the threshold, every unit is linked to a minimum number of other units, namely those most similar to it. With the help of this link structure we have built clusters of databases, where databases with similar contents are grouped in the same cluster. Figure 1 shows a database belonging to an environmental cluster. In this cluster users will find several databases relevant to environmental questions. Figure 3 shows the link hierarchy of the unit shown in figure 1. All links from the current unit and the links of the closest neighbors are displayed. Using information about links, KHS builds clusters of units for those units where interconnectivity is especially high.
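The linking and clustering procedure can be made concrete with a minimal Python sketch. It assumes cosine similarity over tf-idf vectors and treats clusters as connected components of the link graph; the text fixes neither choice, and the threshold value, the function names and the one-line toy descriptions are invented for the example rather than taken from the hosts' documentation.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """docs maps a unit name to its description; returns name -> {term: tf*idf}."""
    tokens = {name: text.lower().split() for name, text in docs.items()}
    df = Counter(term for words in tokens.values() for term in set(words))
    n = len(docs)
    return {name: {t: tf * math.log(n / df[t]) for t, tf in Counter(words).items()}
            for name, words in tokens.items()}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def similarity_links(docs, threshold=0.1, min_links=1):
    """Link pairs above the threshold, plus each unit's `min_links` nearest units."""
    vecs = tfidf_vectors(docs)
    sims = {frozenset(p): cosine(vecs[p[0]], vecs[p[1]])
            for p in combinations(docs, 2)}
    links = {pair for pair, s in sims.items() if s > threshold}
    for name in docs:
        nearest = sorted((p for p in sims if name in p), key=lambda p: -sims[p])
        links.update(nearest[:min_links])
    return {pair: sims[pair] for pair in links}  # pair -> link weight


def clusters(units, links):
    """Connected components of the link graph (an assumed cluster definition)."""
    parent = {u: u for u in units}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for pair in links:
        a, b = tuple(pair)
        parent[find(a)] = find(b)
    groups = {}
    for u in units:
        groups.setdefault(find(u), set()).add(u)
    return list(groups.values())


# Toy descriptions, invented for illustration only.
docs = {
    "Water Resources Abstracts": "water resources hydrology pollution",
    "Pollution Abstracts": "pollution environment water waste",
    "INSPEC": "physics electronics computing engineering",
    "Engineered Material Abstract": "materials engineering polymers ceramics",
}
links = similarity_links(docs)
for group in clusters(docs, links):
    print(sorted(group))
```

In this toy data the first pair is linked because its similarity exceeds the threshold, while the second pair is linked only through the minimum-links rule, which shows how the two rules complement each other.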
Figure 4 The polyhierarchy of the unit ’Environmental Bibliography’ is displayed. There are several paths from this unit to the top unit, designated ’Supporting Units’. The different paths show the hierarchies in which the unit ’Environmental Bibliography’ is embedded.
The user can find a database description along the link hierarchy, by direct access via an alphabetical list or by entering a relevant cluster. The different clusters are organized in alphabetical and content-related order. Figure 4 shows how the unit for the database description of ’Environmental Bibliography’ is embedded in KHS. The unit is included in the composite units ’Engineered Material Abstract’ and ’water resources’. One can see the different possible paths to the top (’Supporting Units’) starting with ’Environmental Bibliography’. Users thus have many ways to browse through a range of database descriptions of interest and to select the databases relevant to their online search. But where should they start navigating through a hypertext in search of the relevant database descriptions? KHS offers several ways of finding an appropriate unit as a starting point.
• An experienced user of online services and KHS can use index and full-text searches in KHS to retrieve one or more relevant starting points. Using the local search options, the user obtains a list of relevant units ranked in accordance with his query.
• A less sophisticated user can obtain assistance from online services by using their database indices. KHS connects to the three above-mentioned hosts and, in response to a simple query, searches the databases CROS, DIALINDEX and STNINDEX to find the databases with the greatest number of answers. These online retrieved databases can be used for the online search, either directly or as starting points for a search in KHS for the most relevant databases.
A user searching for information about sewage sludge in the area of Lake Constance may start database selection automatically with the terms ’Lake Constance’ and ’sludge’ (see figure 2). KHS searches for the statement in each of the index databases of the three hosts and combines the results of these three searches. In our example (see figure 2) KHS presents a list of the four best databases found in the three indices (see table 1). The result is a good example of the use of KHS. The four databases comprise three bibliographic databases and one factual database. If the user is interested in obtaining bibliographic information, he will not use the database ’Hazardous Substances’ for an online search. As he reads the database descriptions provided by KHS, he will recognize that ’Excerpta Medica’ is not a typical environmental database. Finally, in the cluster containing ’Water Resources Abstracts’ and ’Pollution Abstracts’, he will find other environmental databases, which are displayed in the ’Table of Contents Browser’ in figure 1.

Database Name                Hits
Hazardous Substances         8
Water Resources Abstracts    6
Excerpta Medica              4
Pollution Abstracts          4

Table 1 The most valuable databases for the query ’Lake Constance and sludge’ found in the three online hosts DATASTAR, DIALOG and STN and the number of hits.
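How the results of the three index searches might be combined can be sketched as follows. The text does not say how KHS merges the hit lists, so the rule used here (add up the hits of a database's segments within one host, then keep the largest per-host total) is an assumption, and all hit counts are hypothetical. The alias table reuses only the INSPEC names mentioned in the introduction; a real table would have to cover every database.

```python
from collections import Counter

# Host-specific names of the same database (INSPEC example from section 1).
ALIASES = {
    ("DIALOG", "2"): "INSPEC",
    ("DIALOG", "3"): "INSPEC",
    ("DIALOG", "4"): "INSPEC",
    ("DATASTAR", "INSP"): "INSPEC",
    ("DATASTAR", "INZZ"): "INSPEC",
    ("DATASTAR", "IN79"): "INSPEC",
}


def combine_hits(per_host_hits, aliases=ALIASES, top_k=4):
    """Merge per-host hit counts into one ranked list of databases.

    Assumed rule: sum a database's segments within a host, then take the
    maximum across hosts so the same database is not counted several times.
    """
    best = {}
    for host, hits in per_host_hits.items():
        host_totals = Counter()
        for name, count in hits.items():
            host_totals[aliases.get((host, name), name)] += count
        for name, count in host_totals.items():
            best[name] = max(best.get(name, 0), count)
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))[:top_k]


# Hypothetical index-search results, not taken from the paper.
per_host_hits = {
    "DIALOG": {"2": 3, "4": 1, "Pollution Abstracts": 2},
    "DATASTAR": {"INSP": 2, "Pollution Abstracts": 1},
    "STN": {"INSPEC": 1},
}
for name, hits in combine_hits(per_host_hits):
    print(f"{name:<28}{hits}")
```

Summing across hosts instead would be a one-line change, but it would favor databases that happen to be offered by more hosts rather than those with more matching records.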
5. CONCLUSIONS
In our approach we use a procedure to find relevant online databases which can be described as lying somewhere between a search in conventional dictionaries and the use of expert systems for database selection. KHS uses system intelligence to organize the database descriptions in its polyhierarchy and to make the database descriptions searchable in KHS. Furthermore, it uses the assistance of the online vendors for the index search. KHS helps users to find and select the relevant database in several ways, but the decision as to which database is most suitable for their special purpose is still left to the users. Initial tests of the system have produced encouraging results and suggest possibilities for further development of the system. We intend to add a selection mechanism that includes data about the size, update, language, and geographic or time coverage of the online databases, in order to reduce the information space a user must navigate in KHS. In addition, we are running evaluations to locate points where we need to improve database selection with KHS.
REFERENCES
[Aßfalg et al. 93.] R. Aßfalg, R. Hammwöhner and M. Rittberger. The hypertext internet connection: E-mail, online search, gopher. In D.I. Raitt and B. Jeapes (eds.), Online Information 93. 17th International Online Information Meeting, 7.–9. December, London, pp. 453–464. Learned Information Ltd: London, 1993.
[Glasen 93.] F. Glasen. Wissensbasiertes Informationsressourcen-Management zur Kreditwürdigkeitsprüfung. Schriften zur Informationswissenschaft, vol. 11. Universitätsverlag Konstanz: Konstanz, 1993.
[Hammwöhner & Kuhlen 93.] R. Hammwöhner and R. Kuhlen. Semantic Control of Open Hypertext Systems by Typed Objects. Report 34-94 (WITH 6/93), Informationswissenschaft. Universität Konstanz, 1993.
[Hammwöhner & Rittberger 93.] R. Hammwöhner and M. Rittberger. KHS – ein offenes Hypertext-System. In G. Knorz, J. Krause and C. Womser-Hacker (eds.), Information Retrieval ’93. Von der Modellierung zur Anwendung. Proceedings der Tagung Information Retrieval ’93, pp. 208–222. Universitätsverlag Konstanz, 1993.
[Marcaccio 94.] K. Young Marcaccio (ed.). Gale Directory of Databases. Volume 1: Online Databases, January 1994.
[Morris et al. 93.] A. Morris, H. Drenth and G. Tseng. The development of an expert system for online company database selection. Expert Systems, Vol. 10(2), pp. 47–59, 1993.
[Percival & Morrow 89.] M. Percival and N. Mac Morrow. Evaluating the feasibility of using hypercard as an interface prototyping tool with reference to online services: the impact of ISDN. In Online Information 89. 13th International Online Information Meeting Proceedings, pp. 265–276. Oxford: Learned Information, 1989.
[Rittberger et al. 94.] M. Rittberger, R. Hammwöhner, R. Aßfalg and R. Kuhlen. A homogenous interaction platform for navigation and search in and from open hypertext systems. In RIAO 94 Conference Proceedings. Intelligent multimedia information retrieval systems and management, pp. 649–663, New York, NY, USA, October 11–13, 1994. Rockefeller University.
[Salton & McGill 83.] G. Salton and M.J. McGill. Introduction to modern information retrieval. McGraw-Hill: New York, 1983. 1987.
[Smith 88.] K.B. Smith. Hypertext – Linking to the future. Online, Vol. 12(3), pp. 32–40, 1988.
[Williams & Preece 77.] M.E. Williams and S.E. Preece. Database selector for network use. A feasibility study. In Information Management in the 1980's. Proceedings of the ASIS annual meeting 1977, pp. 1–8/C13–D6, 1977.