Integrating external information sources to guide worldwide web information retrieval

Alvaro E. Monge and Charles P. Elkan
Department of Computer Science and Engineering, 0114
University of California, San Diego
La Jolla, California 92093
(619) 534-8897
{amonge,elkan}@cs.ucsd.edu
http://www.cs.ucsd.edu/users/{amonge,elkan}

October 1995
ABSTRACT: Information retrieval in the worldwide web environment poses unique challenges. The most common approaches involve indexing, but indexes introduce centralization and can never be up-to-date. This paper advocates using external databases and information sources as guides for locating worldwide web information. This approach has been implemented in WEBFIND, a tool which discovers scientific papers made available on the web by their authors. The external information sources used by WEBFIND are MELVYL, the online University of California library catalog, and NETFIND, a service for finding email addresses. WEBFIND combines the information available from these services in order to find good starting points for searching for the papers that a user wants. At several stages in its operation, WEBFIND must solve instances of what we call the field matching problem. This problem is to determine whether or not two syntactic values are alternative representations of the same semantic entity. For example, the addresses "Dept. of Comput. Sci. and Eng., University of California, San Diego, 9500 Gilman Dr. Dept. 0114, La Jolla, CA 92093" and "UCSD, Computer Science and Engineering Department, CA 92093-0114" do designate the same department. We propose a systematic algorithm for this problem that takes into account the many complexities, such as abbreviations and missing subfields, that are evident in the example above.
Keywords: search, applications, information retrieval, worldwide web.
1 Introduction

The worldwide web is a distributed, always changing, and ever expanding collection of documents. These features of the web make it difficult to find information about a specific topic. Available information retrieval software has been designed for very different environments: typical tools [Salton and McGill, 1983; Salton and Buckley, 1988] work on an unchanging corpus, with the entire corpus available for direct access.

This paper describes a novel approach to performing information retrieval on the worldwide web. In a nutshell, the approach is to use a combination of external information sources as a guide for locating where to look for information on the web. The first tool in which we have implemented this approach is called WEBFIND. This application discovers scientific papers made available by their authors on the web.

The external information sources used by WEBFIND are MELVYL and NETFIND. MELVYL is a University of California library service that includes comprehensive databases of bibliographic records [University of California, 1995]. NETFIND is a white pages service that gives internet host addresses and people's email addresses [Schwartz and Pu, 1993]. Separately, these services do not provide enough information to locate papers on the web. WEBFIND integrates the information provided by each in order to find a path for discovering on the web the information actually wanted by a user.

The integration of external information sources and the whole discovery process pose several reasoning problems that require heuristic solutions. In combining information provided by MELVYL and NETFIND, for example, WEBFIND must determine which internet host suggested by NETFIND best corresponds to the institutional affiliation of an author as given by MELVYL. This problem is an instance of what we call the field matching problem. In general, the field matching problem is to determine whether or not two field values (typically strings) are syntactic alternatives that designate the same semantic entity.

Section 2 describes existing approaches for doing information retrieval (also called resource discovery) over the worldwide web and compares them to our approach. Section 3 addresses the field matching problem: properties of the problem are discussed and a recursive algorithm is presented. Section 4 discusses the design and implementation of WEBFIND, while Section 5 describes experiments using WEBFIND to discover computer science articles.
2 Approaches to worldwide web information retrieval

This section briefly discusses the most important tools now available for worldwide web information retrieval. Most tools share the same general strategy: to create a centralized index of all or some of the pages of the web. While this technique has given favorable results, it has important disadvantages. Later in this section, our approach of integrating information sources to guide the discovery of relevant information is presented; we then revisit the disadvantages of indexing techniques and find that they do not affect our approach.

The phrase resource discovery is often used to mean information retrieval over distributed information sources like the worldwide web. The most common approach to resource discovery over the web is to use an index to store information about web documents. This approach involves
periodic automatic searching of the web and gathering of information about all the documents found in these searches. The information gathered is used to make an index which users can later query for desired information. The WebCrawler [Pinkerton, 1994], Lycos [Mauldin and Leavitt, 1994], OpenText [Open Text Corporation, 1995], Infoseek [Randall, 1995], and Inktomi [Brewer and Gauthier, 1995] are the most important examples of applications which use this indexing approach.

Unfortunately, many of the indexing-based tools are inadequate now for resource discovery, and the others will be inadequate sooner or later. Tools based on indexing are ultimately inadequate because they are centralized, while the web is distributed and growing exponentially [Gray, 1995]. All the tools cited above are server-based, with the index and/or the querying engine residing on a single computer (which may be a multiprocessor). One host, however powerful, cannot keep up with changes in the web. Existing documents are always being updated or removed, while new documents keep coming online.

A second problem is that experience has shown that the headers of web documents typically provide little to no information about the contents of a document. More recent web indexing tools therefore index documents in their entirety in order to base relevance judgements on their full contents. However, full-contents indexing implies that the size of a web index must be of the same order of magnitude as the size of the whole worldwide web itself.

The main alternative to resource discovery based on offline indexing is to perform automated online searching. Such online searching requires sophisticated heuristic reasoning to be sufficiently focused. The most developed example of this approach is the so-called Internet Softbot [Etzioni and Weld, 1994]. The Softbot is a software agent that transforms a user's query into a goal and applies a planning algorithm to generate a sequence of actions that should achieve the goal. The Softbot planner possesses extensive knowledge (some acquired through learning) about the information sources available to it.

Some resource discovery tools use a combination of indexing and online search. For example, the WebCrawler uses an index to suggest starting points for online searches. Having good starting points is important in online searches because otherwise it takes longer than users are willing to wait to find relevant information. The WebCrawler approach may solve the problem of keeping up with the dynamism of the web, but it does not solve the other problems associated with indexing approaches.

The WEBFIND approach to resource discovery is similar to the WebCrawler and Softbot approaches, in that WEBFIND performs online searching of the web. However, unlike the WebCrawler, the starting points used by WEBFIND are suggested by inference from information provided by external sources, not by a precomputed index of the web. Unlike the Softbot, WEBFIND uses application-specific algorithms for reasoning with its information sources. In principle a planning algorithm could generate the reasoning algorithm used by WEBFIND, but in practice the WEBFIND algorithms are more sophisticated than it is feasible to synthesize automatically.
3 The field matching problem

Many information sources provide information about the same real-world entities, but designate these entities differently. The field matching problem is to determine whether or not syntactically
different designators are semantically equivalent. The field matching problem arises often with WEBFIND. For example, successfully matching an institution as named in the affiliation field of a MELVYL record for an article with the same institution as named in a NETFIND entry gives WEBFIND knowledge of an internet host computer at which to look for a worldwide web server that may contain the article.

A field is a value, typically a string, that designates an entity in some semantic category. The following are two examples of fields designating academic institutions.

    Dept. of Comput. Sci. and Eng.
    University of California, San Diego
    9500 Gilman Dr. Dept. 0114
    La Jolla, CA 92093

    UCSD
    Computer Science and Engineering Department
    La Jolla, CA 92093-0114
These examples show that fields can be made up of subfields delimited by separators such as newlines, commas, or spaces. A subfield is itself a field and may be made up of subsubfields, and so on. Two fields are equivalent if they are equal semantically, that is, if they both designate the same semantic entity. For example, the subfields "Dept. of Comput. Sci. and Eng." and "Computer Science and Engineering Department" are equivalent. Given two fields, the field matching problem is to determine whether or not the two fields are equivalent. Equivalence may sometimes be a question of degree, so we allow a function solving the field matching problem to return a value between 0.0 and 1.0, where 1.0 means certain equivalence and 0.0 means certain non-equivalence.

There has been little previous research on the field matching problem, although the problem has been recognized as important in industry for decades. For example, tax agencies must do field matching to correlate different pieces of information about the same taxpayer when social security numbers are missing or incorrect. The published previous work deals with special cases of the problem, involving customer addresses [Ace et al., 1992], census records [Slaven, 1992], or variant lexicon entries [Jacquemin and Royaute, 1994]. The work most similar to ours is due to Wen-Syan and Clifton [1993]. However, this work is in the context of heterogeneous database systems and assumes that schema information is available for the records to be matched, i.e. information about the structure of fields and subfields and their semantic categories. In general, and in particular in the worldwide web environment, such information is not available.
3.1 Properties of the field matching problem

This section discusses properties of the field matching problem.

First, the field matching problem is naturally recursive. Whether or not two fields composed of subfields match depends on whether or not their subfields match each other. There may be more than one level of subfields in the original fields. The base case for the recursion is when the subfields involved are primitive values, typically individual words represented as strings.

Second, the example above shows that any field matching algorithm must be able to tolerate missing information. The first address above contains a subfield (i.e. line) giving a street name and number. This subfield is missing from the second address, but the fields still match overall.
Third, a field matching algorithm must be able to tolerate lack of order among subfields. For example, the first subfield of the first address corresponds to the second subfield of the second address.

Fourth, a field matching algorithm must deal with abbreviations. In the example above, "UCSD" matches "University of California, San Diego." Almost all abbreviations follow one of four patterns:

(i) the abbreviation is a prefix of its expansion, e.g. "Univ." abbreviates "University";
(ii) the abbreviation combines a prefix and a suffix of its expansion, e.g. "Dept." abbreviates "Department";
(iii) the abbreviation is an acronym for its expansion, e.g. "UCSD" matches "University of California, San Diego";
(iv) the abbreviation is a concatenation of prefixes from its expansion, e.g. "Caltech" abbreviates "California Institute of Technology".

All four cases are complicated by the fact that periods and apostrophes may or may not appear inside or at the end of abbreviations. Cases (iii) and (iv) are also complicated by two further issues: first, some words in the expansion may be discarded in forming the abbreviation (for example "Institute of" in forming "Caltech"), and second, the abbreviation may match what appears to be more than one subfield (the comma makes "University of California, San Diego" resemble two subfields).
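These patterns are concrete enough to test in code. The following is a minimal Python sketch, not the WEBFIND implementation (which, as Section 3.2 explains, currently handles only the substring case); the function name is_abbreviation, the stop-word list of discardable words, and the recursive prefix matcher are our own assumptions. It takes the expansion as a list of words, so splitting across subfields (the comma in "University of California, San Diego") is assumed to be handled by the caller.

    import re

    def normalize(word):
        # Strip periods and apostrophes, which may or may not appear
        # inside or at the end of abbreviations, and lowercase.
        return re.sub(r"[.']", "", word).lower()

    def is_abbreviation(abbrev, expansion):
        # expansion is a list of words, e.g.
        # ["University", "of", "California", "San", "Diego"].
        a = normalize(abbrev)
        words = [normalize(w) for w in expansion]

        # (i) prefix: "Univ." abbreviates "University".
        if len(words) == 1 and words[0].startswith(a):
            return True

        # (ii) prefix + suffix: "Dept." = "Dep" + "t" of "Department".
        if len(words) == 1:
            w = words[0]
            for i in range(1, len(a)):
                if w.startswith(a[:i]) and w.endswith(a[i:]):
                    return True

        # For (iii) and (iv), some words of the expansion may be
        # discarded, e.g. "Institute of" in forming "Caltech".
        content = [w for w in words if w not in ("of", "the", "and", "for")]

        # (iii) acronym: "UCSD" from "University of California, San Diego".
        if a == "".join(w[0] for w in content):
            return True

        # (iv) concatenation of prefixes: "Caltech" from
        # "California Institute of Technology".
        def prefixes_match(s, ws):
            if not s:
                return True
            if not ws:
                return False
            w, rest = ws[0], ws[1:]
            # Consume a nonempty prefix of the next word, or skip the word.
            for i in range(1, min(len(s), len(w)) + 1):
                if s[:i] == w[:i] and prefixes_match(s[i:], rest):
                    return True
            return prefixes_match(s, rest)

        return prefixes_match(a, content)

For instance, is_abbreviation("Caltech", ["California", "Institute", "of", "Technology"]) returns True by consuming the prefixes "cal" and "tech" while skipping "Institute".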
3.2 Recursive algorithm

This section describes the field matching algorithm used in WEBFIND. Following the nature of the problem, the algorithm is recursive, and it handles the issues explained above.
INPUT: Two fields A and B and two lists of separators FS_A and FS_B, where first(FS_A) is the separator that divides A into subfields, second(FS_A) is the separator that divides these into subsubfields, and so on; similarly for FS_B.

OUTPUT: A real number representing the degree to which A and B match.
MATCH(A, B):
    if empty(FS_A) or empty(FS_B) then
        return base_case_match(A, B)
    let subfields_A = list of strings in A as separated by first(FS_A)
    let subfields_B = list of strings in B as separated by first(FS_B)
    let total_score = 0
    for each subfield_A in subfields_A
        let N_A = number of subfields in subfield_A as separated by second(FS_A)
        let max_score = 0
        for each subfield_B in subfields_B
            let N_B = number of subfields in subfield_B as separated by second(FS_B)
            let A' = <subfield_A, tail(FS_A)>
            let B' = <subfield_B, tail(FS_B)>
            let score = MATCH(A', B') / ((N_A + N_B) / 2)
            let max_score = max(max_score, score)
        let total_score = total_score + max_score
    return total_score
The recursive field matching algorithm.

The algorithm sketched above handles missing subfields and unordered subfields. Matching of abbreviations is handled by the base_case_match function. The present version of this function says that A and B match with degree 1.0 if either is a substring of the other; otherwise their degree of match is 0.0. This version of the base_case_match function deals with the first of the four abbreviation patterns explained above, which is the most common one. The next implementation will deal with all four patterns.

One important optimization in the implementation is to apply memoization to remember the results of recursive calls which have already been made. When MATCH is called with the same arguments as on a previous call, the result saved previously is returned instead of being recalculated.
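For concreteness, the following is a direct Python transcription of the pseudocode, a sketch rather than the actual WEBFIND code. The representation of a field as a string together with a tuple of separators, and the names match and base_case_match, are our own choices; memoization is obtained with functools.lru_cache.

    from functools import lru_cache

    def base_case_match(a, b):
        # Degree 1.0 if either primitive value is a substring of the
        # other (abbreviation pattern (i)); degree 0.0 otherwise.
        a, b = a.strip().lower(), b.strip().lower()
        return 1.0 if a and b and (a in b or b in a) else 0.0

    @lru_cache(maxsize=None)  # memoize repeated recursive calls
    def match(a, b, seps_a, seps_b):
        # a, b: field strings. seps_a, seps_b: tuples of separators,
        # e.g. ("\n", ",", " ") for subfields, subsubfields, words.
        if not seps_a or not seps_b:
            return base_case_match(a, b)
        subfields_a = [s.strip() for s in a.split(seps_a[0]) if s.strip()]
        subfields_b = [s.strip() for s in b.split(seps_b[0]) if s.strip()]
        total_score = 0.0
        for sub_a in subfields_a:
            # N_A: subsubfields of sub_a under the next separator.
            n_a = len(sub_a.split(seps_a[1])) if len(seps_a) > 1 else 1
            max_score = 0.0
            for sub_b in subfields_b:
                n_b = len(sub_b.split(seps_b[1])) if len(seps_b) > 1 else 1
                score = match(sub_a, sub_b, seps_a[1:], seps_b[1:]) / ((n_a + n_b) / 2)
                max_score = max(max_score, score)
            total_score += max_score
        return total_score

Applied to the two example addresses of Section 3:

    a = ("Dept. of Comput. Sci. and Eng.\nUniversity of California, San Diego\n"
         "9500 Gilman Dr. Dept. 0114\nLa Jolla, CA 92093")
    b = "UCSD\nComputer Science and Engineering Department\nLa Jolla, CA 92093-0114"
    seps = ("\n", ",", " ")
    print(match(a, b, seps, seps))

Note that, as in the pseudocode, the value returned is a sum of per-subfield maxima; a caller wanting a degree in [0, 1] can divide by the number of subfields of A (an interpretation on our part, since the paper leaves normalization implicit).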
4 The design of WEBFIND

This section describes the protocol followed by WEBFIND when retrieving a scientific paper over the worldwide web. The two main phases are, first, integrating information provided by MELVYL and NETFIND, and second, discovering a worldwide web server, an author's home page, and finally the location of the wanted paper.
4.1 MELVYL and NETFIND integration

A WEBFIND search starts with the user providing keywords to identify the paper, exactly as he or she would in searching MELVYL directly. A paper can be identified using any combination of the names of its authors, words from its title or abstract, or other bibliographic information. After the user confirms that the right paper has been identified, WEBFIND queries MELVYL to find the institutional affiliation of the principal author of the paper. Then, WEBFIND uses NETFIND to provide the internet address of a host computer with the same institutional affiliation.

A query to NETFIND is a set of keywords describing an affiliation. Useful keywords are typically words in the name of the institution or in the name of the city, state, and/or country where it is located [Schwartz and Pu, 1993]. The NETFIND query engine is incapable of processing abbreviations, so WEBFIND chooses full words found in the affiliation given by MELVYL, with a few common abbreviations expanded, such as "univ." to "university."

In general, the result of a NETFIND query is the set of all hosts whose affiliation contains the keywords in the query. There can be many such hosts, and WEBFIND must determine which of them is best. To do so, WEBFIND uses the field matching algorithm described in Section 3. The algorithm is executed once for each of the candidate hosts. Each time, the first field to be matched is the MELVYL affiliation and the second field is the NETFIND description of the location of the host.

For the special task of matching affiliation fields, the algorithm of Section 3 is extended to take into account the ordering of top-level subfields. Leftmost subfields in affiliations are typically more specific, and hence matching them is more important than matching subfields further right. Instead of returning a single degree of match, the algorithm returns separate scores for each subfield of the MELVYL affiliation. Each score is the maximum degree of match between this subfield and any subfield of the NETFIND affiliation. The overall score is a weighted mean of these scores, with weights decreasing from left to right.
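As an illustration, the weighted extension might look as follows. This sketch reuses the match function from Section 3.2; the geometric decay of the weights (1, 1/2, 1/4, ...) is our assumption, since the paper specifies only that the weights decrease from left to right.

    def affiliation_score(melvyl_affil, netfind_affil, seps=(",", " ")):
        # Weighted mean of per-subfield match scores; leftmost
        # subfields of the MELVYL affiliation weigh most. The
        # geometric decay is a hypothetical choice.
        subfields = [s.strip() for s in melvyl_affil.split(seps[0]) if s.strip()]
        candidates = [s.strip() for s in netfind_affil.split(seps[0]) if s.strip()]
        weights = [2.0 ** -i for i in range(len(subfields))]
        scores = []
        for sub in subfields:
            # Maximum degree of match between this MELVYL subfield
            # and any subfield of the NETFIND description.
            scores.append(max(match(sub, cand, seps[1:], seps[1:])
                              for cand in candidates))
        return sum(w * s for w, s in zip(weights, scores)) / sum(weights)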
4.2 Discovery phase

The searching of the worldwide web done by WEBFIND is real-time in two senses. First, the search takes place while the user is waiting for a response to his or her query. Second, information gathered from one retrieved document is analyzed and used to guide which documents are retrieved next.

The first step in the discovery phase is to find a worldwide web server on the chosen internet host. This step uses heuristics based on common patterns for naming servers. The most widely used convention is to use the prefix www. or www-. WEBFIND tests for the existence of a server named with either of these prefixes by calling the Unix ping utility. If either prefix yields a server, then WEBFIND continues with the next step of the discovery phase. Otherwise, WEBFIND strips off the first segment of the internet host name and applies the same heuristics again. For example, cs.ucsd.edu is transformed to ucsd.edu, and then the potential servers www.ucsd.edu and www-ucsd.edu are pinged. (A sketch of this naming heuristic appears after the description of the two-stage search below.)

Once a worldwide web server has been identified, WEBFIND follows links until the wanted article is found. This search proceeds in two stages:
1. find a web page for the principal author, and
2. find a web page that is the wanted article.

Each stage of the search uses a priority queue whose entries are candidate links to follow. The priority of each link in the queue is equal to the estimated relevance of the link. For the first stage, the priority queue initially has a single link, the link for the main page of the server. For the second stage, the priority queue initially contains just the result of the first stage.

When a link is added to the priority queue, its relevance is estimated by applying the field matching algorithm to the context of the link and each of two sets of keywords, a primary set and a secondary set. The context of a link is its anchor text and the two lines before and the two lines after the line containing the link, provided no other link appears in those lines. Links are ranked lexicographically, first using degree of match to the primary set, and then using degree of match to the secondary set.

In the first stage of search, the primary set of keywords is the name of the principal author, while the secondary set is {staff, people, faculty}. Intuitively, the main objective is to find a home page for the author, while the fall-back objective is to find a page with a list of people at the institution. In the second stage of search, the primary set of keywords is the title of the wanted article, while the secondary set is {publications, papers, reports}. Here, the main objective is to find the actual wanted paper, while the fall-back objective is to find a page with pointers to papers in general.

At each stage, the search procedure is to repeatedly remove the first link from the priority queue and retrieve the pointed-to web page. The search succeeds when this page is the wanted page. The search fails if the queue becomes empty. If the page is not the wanted page, all links on it are added to the priority queue with their relevance estimated as just described.

Even if either stage of search fails, the user still receives useful information. If the first stage fails, the user is given the web page of the author's institution. If the second stage fails, the user is given the web page of the author's institution and the author's home page.
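Returning to the server-discovery step: the naming heuristic can be sketched as follows. Note that the paper calls the Unix ping utility; this sketch instead attempts a TCP connection on port 80, which also verifies that a web server is actually listening. The host names in the comment are the paper's example.

    import socket

    def find_web_server(host, port=80, timeout=5):
        # Try the www. and www- prefixes, stripping leading host name
        # segments until a responding server is found. For example,
        # cs.ucsd.edu yields the candidates www.cs.ucsd.edu,
        # www-cs.ucsd.edu, www.ucsd.edu, and www-ucsd.edu in turn.
        segments = host.split(".")
        while len(segments) >= 2:
            suffix = ".".join(segments)
            for candidate in ("www." + suffix, "www-" + suffix):
                try:
                    with socket.create_connection((candidate, port), timeout):
                        return candidate
                except OSError:
                    continue
            segments = segments[1:]  # strip the first segment and retry
        return None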
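Similarly, one stage of the best-first search might be sketched as below. The helper functions fetch, extract_links, and is_wanted are hypothetical stubs for page retrieval, link/context extraction, and the goal test; relevance is estimated with the match function of Section 3.2, and the lexicographic (primary, secondary) ranking is encoded by pushing negated scores onto Python's min-heap.

    import heapq
    import itertools

    def search_stage(start_url, primary, secondary, fetch, extract_links, is_wanted):
        # One stage of WEBFIND's best-first search. fetch(url) returns
        # a page or None; extract_links(page) yields (url, context)
        # pairs, where the context is the anchor text plus surrounding
        # lines; is_wanted(page) tests whether the page is the goal.
        counter = itertools.count()   # tie-breaker for equal priorities
        seps = ("\n", " ")            # context split into lines, then words
        queue = [(0.0, 0.0, next(counter), start_url)]
        visited = set()
        while queue:
            _, _, _, url = heapq.heappop(queue)
            if url in visited:
                continue
            visited.add(url)
            page = fetch(url)
            if page is None:
                continue
            if is_wanted(page):
                return url            # success: the wanted page
            for link_url, context in extract_links(page):
                # Rank lexicographically: primary keywords, then secondary.
                prim = match(context, primary, seps, seps)
                sec = match(context, secondary, seps, seps)
                heapq.heappush(queue, (-prim, -sec, next(counter), link_url))
        return None                   # failure: queue exhausted

In the first stage, primary would be the principal author's name and secondary the string "staff people faculty"; the URL returned by the first stage then seeds the second stage, whose primary set is the article title.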
5 Experimental results

This section reports on experiments performed with the initial implementation of WEBFIND. The aim of the experiments was to identify which aspects of this first version of WEBFIND are the limiting factors in its ability to locate authors and their papers on the worldwide web. Figure 1 shows an example of a WEBFIND discovery session. The experiments discussed here used ten queries in different areas of computer science, concerning papers by authors at ten different institutions.
Figure 1: Results from a WEBFIND discovery session
    Dept. of Comput. Sci. & Eng., California Univ., San Diego, La Jolla, CA, USA
    Dept. of Cognitive Sci., California Univ., San Diego, La Jolla, CA, USA
    Dept. of Electr. Eng. & Comput. Sci., California Univ., Berkeley, CA, USA
    Dept. of Comput. Sci., California Univ., Santa Barbara, CA, USA
    Lab. for Comput. Sci., MIT, Cambridge, MA, USA
    Dept. of Comput. Sci., Cornell Univ., Ithaca, NY, USA
    Dept. of Comput. Sci., Texas Univ., Austin, TX, USA
    Dept. of Electr. Eng. & Comput. Sci., Illinois Univ., Chicago, IL, USA
    Dept. of Comput. Sci., Waterloo Univ., Ont., Canada
    Dept. of Comput. Sci., Columbia Univ., New York, NY, USA

Author affiliations as reported by MELVYL.
The discussion here is organized into four phases: mapping affiliations to internet hosts, discovery of worldwide web servers, discovery of home pages for authors, and finally discovery of the wanted papers.

WEBFIND correctly associated eight of the ten MELVYL affiliations with internet hosts in NETFIND. The first affiliation that WEBFIND did not correctly identify was "Dept. of Cognitive Sci., California Univ., San Diego." The reason for this failure is that NETFIND does not have an entry for this department, although its internet host is cogsci.ucsd.edu. In future work we intend to quantify the comprehensiveness of the coverage of NETFIND, and if necessary we will extend WEBFIND to use additional white pages resources.

The other affiliation that WEBFIND did not find a correct host for was "Lab. for Comput. Sci., MIT." The reason here is that fifteen different internet hosts all have a NETFIND description equivalent to "Lab. for Comput. Sci., MIT." Each of these hosts corresponds to a different research group (for example cag.lcs.mit.edu belongs to the computer architecture group), but this information is not available in either the MELVYL or the NETFIND affiliation descriptions. The next version of WEBFIND will overcome this problem by adding keywords to the MELVYL and/or the NETFIND affiliations when necessary. Added MELVYL affiliation keywords will be subject keywords, while added NETFIND affiliation keywords will be host name segments. For example, adding the subject keywords "computer architecture" to "Lab. for Comput. Sci., MIT" would give a specific match to the host name cag.lcs.mit.edu. Note that this will often involve matching of abbreviations, e.g. of "cag" and "computer architecture."

Of the eight internet hosts that WEBFIND found correctly, there was only one for which it could not find a worldwide web server. Given the simple heuristic used for finding a server (Section 4.2), this is encouraging.

WEBFIND found home pages for five principal authors on the seven worldwide web servers it searched. The other two principal authors did not have home pages of any kind on the servers found by WEBFIND. In these two cases, the authors were no longer affiliated with the institution that MELVYL provided. We will solve this problem in the next version of WEBFIND by using the most
recent information that MELVYL can provide. This will involve additional queries to MELVYL, as described in the next section.

Finally, WEBFIND successfully discovered two of the wanted papers starting from the five home pages found. This low rate is due to the type of author pages that were discovered. Two of the five pages were not personal home pages; rather, they were "annual reports" or "research statements" which did not provide any outgoing links, so the wanted papers were not in fact available through their authors' home pages.

In summary, our experiments show that WEBFIND is successful at finding worldwide web servers and at finding web pages for authors. WEBFIND is less successful at finding actual papers, most of all because many authors have not yet published their papers on the worldwide web.
6 Discussion

From a relational database perspective, when WEBFIND combines information from MELVYL and NETFIND, it is performing a "join" on these two sources of information. Specifically, MELVYL can be viewed as a relation where each tuple is one bibliographic record, and NETFIND can be viewed as a relation where each tuple gives information about one internet host. These two relations have an attribute in common, namely institution. What WEBFIND does is to join the two relations on this attribute. Since institutions are represented differently in MELVYL and NETFIND, it is non-trivial to decide when a MELVYL institution field value is in fact equal to a NETFIND institution field value. The field matching algorithm solves this problem.

The relational database perspective suggests extensions to WEBFIND. For example, some MELVYL records do not contain affiliation fields, and many have out-of-date affiliation fields. This limits the functionality of WEBFIND because affiliation information is needed to find a suitable internet host using NETFIND. A related difficulty is that MELVYL provides an institutional affiliation only for the principal author of a paper; however, the paper may exist on the web only in association with another author. All these problems can be solved in the same way, by performing additional join operations to retrieve the needed information from other bibliographic records.

Formally, the join of two relations T1 and T2 on the common attribute A is a relation T whose tuples are those in the Cartesian product of T1 and T2 such that the A attribute of T1 stands in a given relation to the corresponding A attribute of T2. The relation can be any of the comparison operators
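To make the analogy concrete, the join WEBFIND performs is an approximate join in which equality on the institution attribute is replaced by the field matching score. The following is a minimal sketch reusing match from Section 3.2; the representation of relations as lists of dicts, the attribute names "affiliation" and "description", and the 0.75 threshold are all hypothetical (WEBFIND itself keeps the best-scoring host for each record rather than thresholding).

    def approximate_join(melvyl, netfind, seps=(",", " "), threshold=0.75):
        # Join bibliographic records and host records on the common
        # institution attribute, with the field matching score
        # standing in for equality.
        joined = []
        for record in melvyl:      # one tuple per bibliographic record
            for host in netfind:   # one tuple per internet host
                score = match(record["affiliation"], host["description"],
                              seps, seps)
                if score >= threshold:
                    joined.append({**record, **host, "match_score": score})
        return joined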