A File System Based Inverted Index (LUT CS-TR 996)

Jon P. Knight and Martin Hamilton
Department of Computer Studies, Loughborough University of Technology, Ashby Road, Loughborough, Leics. LE11 3TU, UK

October 17, 1995

Abstract
This paper documents the design and development of a simple file system based inverted index for use in rapid retrieval of information from resource templates in a Subject Based Information Gateway (SBIG) application on the World Wide Web (WWW). The index mechanism trades file system space for speed and ease of implementation. It makes use of the UNIX hierarchical filesystem structure to hold the index and provide rapid access to the index files with only a few disc accesses. Unlike DBM and the other simple UNIX based databases, this index has no limit on its bucket size, other than that imposed by the physical disc space available.

This work was supported by FIGIT/ISSC under Electronic Libraries Programme grant 12/39/01.
1 Introduction

Resource location on the Internet is a problem which is currently of interest to a large number of researchers and engineers worldwide [1]. The growth of the Internet has been phenomenal and hundreds of new resources are being brought online daily. The vast majority of these resources are accessible through the distributed hypermedia system known as the World Wide Web (WWW) [2]. A number of indexing systems have been developed to allow users to search for resources which they require. These indexing services are split into two broad camps: automatic index services that use robots to scour the Web for new or changed resources, such as Lycos [3], and indexing systems that use manual methods, such as ALIWEB [4]. The former often aim to be as comprehensive as possible and can have hundreds of thousands or even millions of indexed documents. However, they suffer from:

- a large percentage of "dead links" that can no longer be resolved into the original target documents
- having their robots excluded from some sites because of past problems with resource use by robot indexers
- conflicts of search terms between different subject disciplines
The last point is highly significant, since different disciplines frequently use the same words to mean different things. This can lead to large numbers of false hits in the index database for some search terms, with the result that users have to trawl through documents that may not be of use to them before finding the information that they want.

The authors have been funded under the UK Electronic Libraries Programme [5] to develop infrastructure software for a number of Subject-Based Information Gateways (SBIGs). The SBIGs will provide quality reviewed indexing services for on-line resources in their subject area. This underlying infrastructure, known as ROADS (Resource Organisation And Discovery in Subject-based services), will use the WHOIS++ protocol [6, 7] to support both the searching of individual SBIG databases, and cross-discipline searches across multiple SBIGs. WHOIS++ is one of the leading contenders to be the search protocol component of the WWW architecture.

Existing free and commercial databases were considered as the back-end to ROADS, but there were a number of concerns with these. The most significant issues were felt to be the danger of being locked into proprietary database technology on the one hand, and the support and maintenance problems associated with free software on the other. As an experiment, the authors decided to attempt to build a simple indexing system which would use as little specially developed software as possible, and which would be specifically tailored to our requirements in building ROADS. The experiment was successful, and the resulting system was eventually deployed in the first version of the ROADS software.
2 The Basic Structure

The indexing system developed for the ROADS prototype uses the UNIX file system as its means of maintaining its inverted index structure. Inverted indexes are a common technique used for full-text databases, such as WAIS [8]. Typically the index holds, for each word in the source data, a list of pointers to the records where this word occurs. Some implementations go further than this, e.g. by storing information about the proximity of one word to another.

The UNIX filesystem has already had many years of work put into its design and implementation to ensure that it provides a means of rapid retrieval of stored information. By careful use of a directory structure within the file system, the inverted index mechanism can build upon that work and give a good response time.

The databases being indexed by the ROADS system belong to the SBIGs and contain Internet Anonymous FTP Archive (IAFA) templates [9]. These are plain text records containing sets of attribute-value pairs, and are typically up to five kilobytes in size, though there is no fixed upper bound. It is anticipated that an SBIG will typically have between a few hundred and a few thousand of these templates in a database. A sample IAFA template is shown in Figure 1. There is no strict requirement that IAFA templates be used with this indexing scheme; this just happens to be the metadata format being used internally within the ROADS system.

    Template-Type: SERVICE
    Handle: SOSIG123
    Title: European Space Agency
    Keywords: Military Science Space Research
    Description: The aim of the European Space Agency is to "provide for and to promote, for exclusively peaceful purposes, cooperation among European States in space research and technology and their space applications, with a view to their being used for scientific purposes and operational space applications systems." One of the ESA's projects is the use of satellites in disarmament verification for CFE.
    Subject-Descriptor-Scheme-v1: UDC
    Subject-Descriptor-v1: 355
    Admin-Email-v1:
    URI-v1: http://www.esrin.esa.it/
    Record-Last-Modified-Name: Jon P. Knight
    Record-Last-Modified-Email:
    Record-Last-Modified-Date: Tue, 23 May 95 13:50:55 GMT

Figure 1: Sample IAFA template

The size of the inverted index is not of great concern to us, as the amount of data being indexed by the SBIGs using the ROADS architecture is relatively small, and the index entries themselves are compact. For example, no matter how many repetitions of an indexed term there are in a particular template, only one pointer to that template will be stored in the index.

The basic inverted index structure is shown in Figure 2. The root directory of the index contains a number of subdirectories. (The root directory of the index is an arbitrary directory in the filesystem chosen for this purpose and should not be confused with the root of the filesystem itself.) Each of these subdirectories has a two letter name, also known as a bigraph. The bigraphs are derived from the first two characters of the words being indexed. For example, if the word category is indexed, its information is placed in the ca directory. The bigraphs are always converted to lower case before indexing, so that Category, category and CATEGORY will all have their indexing information placed in the ca subdirectory. This means that case insensitive searching can be done more quickly than if these words had their indexing information held in three different subdirectories. It also conveniently bounds the number of subdirectories which can appear under the root index directory.

    Inverted Index Root Directory
      aa  ab  ac  ad  ...  za  zi  zo  zz                    (Bigraph Directories)
                  |
                  +-- Advanced  adam  advanced  advocacy     (Index Files)

Figure 2: The basic structure of the inverted index

Inside each bigraph directory, there exists a file for each term being indexed, named after the term. This time the name is case sensitive, as we may need to do case sensitive searching and it is the filename that is matched against the search term. The file itself is a plain text list of the IAFA template handles and the absolute paths to the files that the templates are located in. These are specified one pair per line, with a colon separating the two elements. This file format allows a single inverted index to hold information on templates spread through many different directories in the UNIX filesystem. The templates could even be located on different machines which are exporting their filesystems through a mechanism such as NFS.

Figure 3 shows a sample index file for the term space, which would be located in the file sp/space under the top level inverted index directory. In this case, sp is the bigraph.

    SOSIG11:/home/jon/test/source/11.idx
    SOSIG123:/home/jon/test/source/123.idx
    SOSIG214:/home/jon/test/source/214.idx
    SOSIG280:/home/jon/test/source/280.idx
    SOSIG293:/home/jon/test/source/293.idx
    SOSIG332:/home/jon/test/source/332.idx
    SOSIG342:/home/jon/test/source/342.idx

Figure 3: Sample index file for the term space

Since these index files are plain text, the prototype versions of the software implementing this indexing scheme can be debugged easily. The contents of the index files can quickly be changed using a text editor, and normal UNIX text processing tools such as grep(1) can be run over the index files. The format of the index files is extensible, allowing new data to be added as new colon separated fields. For example, the index file format could be extended to record the template attributes containing the term being indexed.
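As an illustration only, the mapping from a term to its index file, and the parsing of an index file, might be sketched in Python as follows. This is not the ROADS Perl code: the function names (bigraph, index_file_path, parse_index_file) are hypothetical, and the sketch assumes terms of at least two characters.

```python
import os


def bigraph(term):
    """Return the lower-cased first two characters of a term."""
    return term[:2].lower()


def index_file_path(index_root, term):
    """Path of the index file for a term: <root>/<bigraph>/<term>.

    The directory name is lower-cased; the file name keeps the term's case.
    """
    return os.path.join(index_root, bigraph(term), term)


def parse_index_file(text):
    """Parse 'handle:path' lines into a dict mapping handle -> path."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only, so that any further
        # colon-separated fields stay attached to the second element.
        handle, path = line.split(":", 1)
        entries[handle] = path
    return entries
```

Splitting on the first colon only is a design choice of this sketch; an extended index file format with extra fields would need a fuller parser.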
3 Indexing a Template

A script to generate the inverted index structure described above from a collection of IAFA templates has been written in Perl [10]. This is called mkinv.pl.

Over the course of its development, the mkinv.pl script has undergone a number of iterative refinements. The first version read through the templates being indexed a line at a time, split each line into single terms and then individually added each term to the index in the manner detailed in the previous section. Whilst this method worked, it was very slow, taking several hours to index several hundred templates (see footnote 3) on a lightly loaded SPARCstation 5. The reason for the poor performance of this version was that many files had to be opened, read through (to ensure the term had not already been indexed for that template), written to and then closed. A single file for a common term might undergo this process several hundred times in the course of the indexing procedure.

In an effort to overcome this problem, the indexing script was completely re-written. The new version read through the source templates and split each line into terms as before. However, instead of immediately attempting to add each term to the inverted index, it recorded each term, together with the file in which it appeared, in an associative array held in memory. Only when all the templates had been read through did the indexing script write out the terms to the inverted index. All matches for each term were written in one go, so that each index file was opened only once during the indexing operation. This improved performance enormously and allowed the test set of templates to be indexed in less than half an hour.

However, when a much larger set of templates (see footnote 4) was indexed, the program exhausted the workstation's virtual memory as the associative array grew too large. To solve this problem and permit the indexing of arbitrarily large sets of templates, the mkinv.pl script was modified to record only a limited number of terms from the source templates before writing out the index files, freeing the memory associated with the array and then continuing. Whilst this slowed the indexing procedure down slightly, it permitted the large set of templates to be indexed successfully, and in a time comparable to that of the popular WAIS software.
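The batching strategy described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a port of mkinv.pl: the names terms_in, flush and build_index, the simplistic tokeniser and the max_pending_terms threshold are all hypothetical.

```python
import os
import re
from collections import defaultdict


def terms_in(template_text):
    """Split a template into its distinct terms (a simplistic tokeniser)."""
    return set(re.findall(r"[A-Za-z0-9]+", template_text))


def flush(index_root, pending):
    """Append all pending entries to their index files, one open per term."""
    for term, entries in pending.items():
        directory = os.path.join(index_root, term[:2].lower())
        os.makedirs(directory, exist_ok=True)
        with open(os.path.join(directory, term), "a") as index_file:
            for handle, path in entries:
                index_file.write(f"{handle}:{path}\n")
    pending.clear()


def build_index(index_root, templates, max_pending_terms=10000):
    """Index (handle, path, text) triples, flushing when memory use grows.

    Flushing happens only between templates, so a term is recorded at
    most once per template even when the work is split into batches.
    """
    pending = defaultdict(list)
    for handle, path, text in templates:
        for term in sorted(terms_in(text)):
            pending[term].append((handle, path))
        if len(pending) >= max_pending_terms:
            flush(index_root, pending)
    flush(index_root, pending)
```

With a small threshold an index file for a common term is reopened once per batch rather than once overall, which is the trade between memory use and speed noted above.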
Footnote 3: The test set of templates was taken from the Social Science Information Gateway database (see ).
Footnote 4: Just over 13,000 USER templates representing the LUT White Pages directory.

4 Searching Using the Index

To search for a single term in the inverted index:

1. the bigraph at the front of the term is first generated by taking the initial two characters and converting them to lowercase, e.g. Spies -> sp
2. the resulting bigraph is matched against the list of bigraph directories under the index root and, if a match is found, that directory is entered
3. the directory listing is then read from the disk and a regular expression search is made to compare the directory entries with the word being searched for

This comparison can take into account whether the search is to be caseless or not and can implement either an exact match or a left string search. These are the two search types which it is mandatory for a WHOIS++ implementation to provide. The resulting match is a list of handles and absolute paths to the templates, and is held in an associative array in the Perl package that provides the search interface.

Boolean operations are implemented by making multiple searches of the inverted index and then combining the associative arrays together. The AND operator combines its associative arrays using an intersection operation and the OR operator uses a union routine. The NOT operation is slightly more complex, as it must generate a list of all the templates in which the search term is not present. To do this, a single text file is also generated by the indexing script that simply lists all the handles of the templates and their associated absolute pathnames. The negated term being searched for is matched in the index as before, and then the match list associative array is generated by reading this file containing all the template details and removing the entries that matched in the index search. This makes the NOT operation more time consuming than the AND and OR operations, but luckily it appears to be used far less frequently.

A simple stemming algorithm has also been implemented in the search engine that allows the various forms of a root word to be searched for. This is done by converting a single search term into a number of separate terms, doing exact matches on all of them and then ORing the resulting match lists together.
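The lookup steps and Boolean combinations can be sketched in Python as follows. This is an illustration under assumptions rather than the ROADS search package: the names lookup, op_and, op_or and op_not are hypothetical, search terms are assumed to be at least two characters long, and the Boolean helpers use the usual set semantics (AND as intersection, OR as union, NOT as difference from the full template list).

```python
import os
import re


def lookup(index_root, term, caseless=False, left_match=False):
    """Return {handle: path} for index files whose names match `term`."""
    directory = os.path.join(index_root, term[:2].lower())
    if not os.path.isdir(directory):
        return {}
    pattern = re.compile(re.escape(term), re.IGNORECASE if caseless else 0)
    # A left string search anchors the start only; an exact match must
    # cover the whole file name.
    matcher = pattern.match if left_match else pattern.fullmatch
    matches = {}
    for name in os.listdir(directory):
        if matcher(name):
            with open(os.path.join(directory, name)) as index_file:
                for line in index_file:
                    line = line.strip()
                    if line:
                        handle, path = line.split(":", 1)
                        matches[handle] = path
    return matches


def op_and(a, b):
    """Templates present in both match lists (intersection)."""
    return {h: p for h, p in a.items() if h in b}


def op_or(a, b):
    """Templates present in either match list (union)."""
    return {**a, **b}


def op_not(all_templates, a):
    """Templates in the full template list that are absent from `a`."""
    return {h: p for h, p in all_templates.items() if h not in a}
```

Because the bigraph directory is chosen before any matching is done, only one directory listing is scanned per term regardless of the size of the index.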
The prototype implementation of the search engine also permits matches to be confined to terms appearing in a particular attribute of the IAFA template. This feature provides support for the WHOIS++ requirement that search terms can be specified as being present in one or more attributes of a template. It is currently implemented by doing the match as normal and then reading through the templates that have been matched to check whether the term being searched for is present in the required attribute. This reduces the performance slightly but has not yet become a cause for concern. If performance were felt to be falling to unacceptable levels, the attributes could be recorded in the index files themselves by the indexing script, which would mean that the search script would not need to read through the template itself at run time.
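The attribute restriction amounts to a post-filter over the match list, which might be sketched in Python as follows. The names parse_template, term_in_attribute and restrict_to_attribute are hypothetical, and the parsing rules (attribute names made of letters, digits and hyphens; other non-blank lines treated as continuations of the previous attribute) are assumptions of this sketch rather than a statement of the IAFA template syntax.

```python
import re

# Assumed shape of an attribute line: "Name: value".
ATTRIBUTE_LINE = re.compile(r"^([A-Za-z0-9-]+):\s*(.*)$")


def parse_template(text):
    """Parse a template into a dict mapping attribute name -> value."""
    attributes = {}
    current = None
    for line in text.splitlines():
        match = ATTRIBUTE_LINE.match(line)
        if match:
            current = match.group(1)
            attributes[current] = match.group(2)
        elif current is not None and line.strip():
            # Treat any other non-blank line as a continuation.
            attributes[current] += " " + line.strip()
    return attributes


def term_in_attribute(template_text, attribute, term, caseless=False):
    """True if `term` occurs as a whole word in the named attribute."""
    value = parse_template(template_text).get(attribute, "")
    words = re.findall(r"[A-Za-z0-9]+", value)
    if caseless:
        return term.lower() in (word.lower() for word in words)
    return term in words


def restrict_to_attribute(matches, attribute, term, read_template, caseless=False):
    """Keep only the matched templates whose `attribute` contains `term`.

    `matches` maps handle -> path; `read_template(path)` returns the text.
    """
    return {
        handle: path
        for handle, path in matches.items()
        if term_in_attribute(read_template(path), attribute, term, caseless)
    }
```

Passing the template reader in as a function keeps the filter independent of where the templates are stored.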
5 Deindexing a Template

After a time, the data held in a template may become out of date and need to be edited or deleted completely. This requires that the inverted index can have all references to the template removed from it. A simple Perl deindexing script called deindex.pl has been written that, given a template handle, will remove its details from the index. It does this by reading the template and generating bigraphs in much the same way as mkinv.pl does. It then opens the file for each term in the template and removes the entry for that template. As this basically means copying most of the file to a new file for each term being deindexed, this is quite an expensive operation. Luckily, deindexing is usually done a template at a time and so the speed is not as important as with the index generator, which may run over hundreds or even thousands of templates in one run.

The deindex.pl script also removes the template's entry from the list of all the templates that is used for the logical NOT operation, and has the option of deleting the actual template file itself. By default it does not do this, as some SBIGs have indicated that they would prefer to move their deindexed templates to archive directories or to new databases ready for reindexing.
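The copy-to-a-new-file approach to deindexing can be sketched in Python as follows. This is an assumption-laden illustration, not deindex.pl itself: the names remove_handle and deindex and the ".new" temporary file suffix are hypothetical, and the caller is assumed to supply the terms extracted from the template.

```python
import os


def remove_handle(index_file_path, handle):
    """Rewrite one index file without the entries for `handle`.

    Returns True if at least one entry was removed.
    """
    temporary_path = index_file_path + ".new"
    removed = False
    with open(index_file_path) as old, open(temporary_path, "w") as new:
        for line in old:
            # The handle is everything before the first colon.
            if line.split(":", 1)[0] == handle:
                removed = True
            else:
                new.write(line)
    # Replace the old index file with the filtered copy.
    os.replace(temporary_path, index_file_path)
    return removed


def deindex(index_root, handle, terms):
    """Remove `handle` from the index file of every term in `terms`."""
    for term in terms:
        path = os.path.join(index_root, term[:2].lower(), term)
        if os.path.isfile(path):
            remove_handle(path, handle)
```

This sketch leaves an index file in place even when its last entry is removed; whether empty files should be deleted is a policy decision not covered here.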
6 Conclusions

This document has detailed the design and implementation of a simple UNIX filesystem based inverted index mechanism. This mechanism is being used in the ROADS system to provide a fast search engine for the SBIGs' resource databases.

The inverted index mechanism is simple to implement and is relatively fast even on quite large databases. It can support exact and left string matching and attribute matching restrictions, and so appears to be suitable for use as the back-end of a simple WHOIS++ implementation. It does have the disadvantage of requiring quite a lot of disc storage compared to the size of the data set being indexed, but this is not a pressing problem in the ROADS environment, and the ease of implementation and alteration far outweighs it.

There are a number of features which would be desirable in a search mechanism which this simple indexing system does not yet support. Firstly, substring or full regular expression matching is not yet possible without scanning all of the bigraph directories for possible matches. This could be fixed by supplementing the existing inverted index structure with an additional file listing all the terms being indexed. The terms in this list that match the search expression would be used to key index lookups, and the results of these lookups ORed together.

More information about the ROADS project, and the ROADS software itself, can be found on the Web, at .
References

[1] M. F. Schwartz, A. Emtage, B. Kahle, and B. C. Neuman. A Comparison of Internet Resource Discovery Approaches. Computing Systems, pages 461-493, Fall 1992. Earlier version appeared as TR CU-CS-601-92, July 1992.
[2] T. Berners-Lee, R. Cailliau, A. Luotonen, H. Nielson, and A. Secret. The World Wide Web. Communications of the ACM, 37(8):76-82, August 1994.
[3] M. L. Mauldin and J. R. R. Leavitt. Web Agent Related Research at the Center for Machine Translation. Presented at the August 1994 SIGNIDR meeting, August 1994.
[4] M. Koster. ALIWEB - Archie-Like Indexing in the WEB. In WWW94 Conference Proceedings. Elsevier Science BV, 1994.
[5] Electronic Libraries Programme (eLib) - Press Release, June 1995.
[6] J. Gargano and K. Weiss. Whois and Network Information Lookup Service - Whois++. Request For Comments (Informational) RFC 1834, Internet Engineering Task Force, August 1995.
[7] P. Deutsch, R. Schoultz, P. Faltstrom, and C. Weider. Architecture of the WHOIS++ service. Request For Comments (Standards Track) RFC 1835, Internet Engineering Task Force, August 1995.
[8] M. J. Fullton, K. J. Goldman, B. J. Kunze, H. Morris, and F. Schiettecatte. WAIS over Z39.50-1988. Request For Comments (Informational) RFC 1625, Internet Engineering Task Force, June 1994.
[9] P. Deutsch, A. Emtage, M. Koster, and M. Stumpf. Publishing Information on the Internet with Anonymous FTP. Internet Draft (expired), Internet Engineering Task Force, September 1994.
[10] L. Wall and R. Schwartz. Programming Perl. O'Reilly & Associates, Inc., 1991.