Proxy Searching of Non-Searchable and Poorly Searchable Open Access Archives of Digital Scholarly Journals

Péter Jacsó

University of Hawaii, Department of Information and Computer Sciences, 2550 The Mall, Honolulu, HI 96882
[email protected]

Abstract. Many high-quality, open access scholarly journals are published on the Web. However, many of them offer no search program for finding articles in their archives; they provide access only by browsing down through the volumes, issues, and table of contents pages. The archives of many other open access scholarly journals do offer search options, but these often have very limited capabilities and lack even such essential full-text search features as exact phrase searching. Both types of archives, however, can be searched by using as a proxy the advanced options of some of the Web-wide search engines, such as Google, AllTheWeb and WiseNut, which spider the open access sites hosting the archives of full-text articles in HTML, Word and/or PDF format. The paper demonstrates the efficiency and, in the case of directly searchable archives, the much higher precision and relevance of searching the archives by proxy.

1 Introduction

Traditional publishers of scholarly print journals, such as Elsevier, Springer Verlag, and MCB University Press, have expensive and capable information storage and retrieval or database management systems that offer their subscribers sophisticated access to the full text of large digitized collections of articles. All of them allow the use of Boolean operators, truncation of search terms, exact phrase searching, and limiting the search to specific fields, such as title, abstract, author or keywords (descriptors). Some of the programs allow restricting the search to the journals, volumes or issues to which the library subscribes or subscribed. Elsevier’s ScienceDirect even allows searching for cited works.

In contrast with the traditional large publishers, most publishers of single open access publications, which were born on the Web to be distributed on the Web without any revenue or compensation, focus on providing current articles. Understandably, for lack of funds and of experienced systems staff for software installation, maintenance and customization, not much thought could be given to efficient searching of the growing archives. Many other single-title publishers chose to install one of the open source (free) but limited-capability web site search engines.

2 The Problem

The problem is that many valuable scholarly digital journals offer no appropriate access for efficiently discovering their content. The lack of search capabilities forces users to dig down into each issue and each article or other document, such as editorials and letters to the editor, and to judge from the table of contents entries and, if available, from the abstract whether the document may discuss a specific topic, product or service in which they are interested. The full text of articles, of course, can be scanned by using the browser’s Find command, but it is a slow, awkward and less than perfect process, requiring too many unrewarding mouse clicks.

This happens, for example, with the archive of the Journal of Medical Internet Research (JMIR). Even though it has a relatively small archive of 120+ articles and other documents, finding all articles which discuss, for example, the MEDCERTAIN information quality assessment project and “trustmark” of health-related Web sites is an arduous process. The term appears on the table of contents pages only in one article’s title and in another’s subtitle. A few other titles and abstracts allude to possible coverage of the project. A time-consuming, thorough check through the 15 issues of the 5 volumes of JMIR would discover a good dozen items which discuss MEDCERTAIN to varying extents. To put this into perspective, the entire MEDLINE database has only four hits about MEDCERTAIN, and it is the one which has the richest coverage of this topic among the nearly 500 databases on the DIALOG information system.

The lack of appropriate access and efficient discovery is less obvious in the case of those open access scholarly journal archives which do offer a web-site search engine. The problem, however, becomes apparent when you go beyond the single-word query.
The case of the WebSTAR Search software used with one of the best Web-born information science and technology journals, First Monday (FM), illustrates the problem well. It cannot do exact phrase searching, and its relevance ranking algorithm is overly simplistic. Searching for articles which include an exact term, such as “information organization”, is not possible. The software truncates the component words to inform and organ, and finds 262 articles even when relevance is set at a minimum of 90%. There are 7 articles in the entire archive (as of June 20th, 2003) in which the term “information organization” appears. The rest of the “hits” retrieved include articles with one of these word variants: inform, informed, information, informatics AND one of these variants: organ, organs, organize, organizer, organizers, organized, organization, organizations. Pluralizing the word organization adds insult to the injury caused by degrading the adjacent, unidirectional word order relationship of the query to a simple Boolean AND operation between the two excessively truncated original words.

The more often those word variants occur anywhere in an article, the higher its relevance rank will be. The aggregate frequency of the truncated component words of the query and their density in the article are weighted higher for relevance ranking than adjacency, proximity, position and exact phrase matching. This explains (but does not justify) that the seven articles which have an exact match for the original query “information organization” were ranked by WebSTAR Search as 6th, 13th, 27th, 84th, 103rd, 181st, and 210th. Irrelevant articles which include one of the modified query words on the first page and the other on the last page or in the next paragraph push down the relevant ones in the result list. Users tire of the deluge of irrelevant results before they get to the relevant articles.

There are similar problems in searching the archives of open access scholarly journals which use the ht://dig software, which cannot handle exact phrases in a search query. In addition, it offers only the Boolean AND and OR operators between the words in the query, and also engages in senseless, unsolicited pluralizing, such as informations, knowledges, shelfs, when encountering the singular form of these words in a query. Its relevance ranking also leaves much to be desired.
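The effect described above can be reproduced with a toy model. The sketch below is not WebSTAR Search's actual algorithm, which the paper does not document; it only assumes the behaviour reported here, namely that the query is reduced to the stems inform and organ joined by a Boolean AND. The function names and the two sample sentences are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def truncated_and_match(stems, text):
    """Toy over-truncated search: every stem must begin some word,
    anywhere in the text, with no regard to adjacency or word order."""
    words = tokenize(text)
    return all(any(w.startswith(s) for w in words) for s in stems)


def exact_phrase_match(phrase, text):
    """True only if the phrase words occur adjacent and in order."""
    q, words = tokenize(phrase), tokenize(text)
    return any(words[i:i + len(q)] == q
               for i in range(len(words) - len(q) + 1))


# Hypothetical sample sentences, not taken from any First Monday article.
doc_a = "Informed citizens expect each organ of government to publish its records."
doc_b = "Information organization on the Web differs from library cataloging."

stems = ("inform", "organ")           # the truncations reported in the text
phrase = "information organization"   # the original query

for name, doc in (("doc_a", doc_a), ("doc_b", doc_b)):
    print(name, truncated_and_match(stems, doc), exact_phrase_match(phrase, doc))
```

Both sentences satisfy the truncated AND query, but only the second contains the exact phrase, which is the kind of false hit that inflates the result list.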

3 The Solution

The Web-wide search engines used as proxy agents can provide a) much needed access to many of those open access digital archives of scholarly journals which do not have site search engines and b) far more relevant search results from those archives which use site search engines of very limited capabilities. Using a Web-wide search engine as a proxy agent merely requires specifying, along with the query, the domain of the archive to be searched. To search for articles about “information organization” on the primary host site of the First Monday archive, the user simply invokes the advanced search template, enters the query in the exact phrase cell, and specifies the domain to which the search must be limited.

Fig. 1. Using Google as a proxy agent to search the First Monday archive

Query specification in the major Web-wide search engines is more powerful and flexible than that offered by most of the open source site-search engines. For example, the Boolean NOT operator and exact phrase searching are available. Until recently, Google was my top choice for proxy searching, by virtue of its superiority in terms of the size of its database, the scope, depth and frequency of spidering relevant sites, its ability to harvest HTML, Word and PDF documents, its speed and its options for presenting results. Current test searches, however, showed that AllTheWeb and WiseNut yielded results as good as Google’s for most test queries. AltaVista and Inktomi were on par with them except for one test query shown below. The newest competitor, GigaBlast, had the fewest results across all the archives tested, but it is worth trying. As the search engines’ command syntaxes are becoming more and more similar, it is easy to switch from one to the other using such query syntax as “information organization” site:www.firstmonday.dk.
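The query syntax above can be assembled programmatically. The following is a minimal sketch; the helper names are hypothetical, and it assumes that the engine accepts a leading minus sign for Boolean NOT and a q parameter on the Google search URL. Other engines may use different parameter names and operators.

```python
from urllib.parse import urlencode


def proxy_query(phrase, site, exclude=()):
    """Combine a quoted exact phrase, optional excluded words and a
    site restriction into one query string."""
    parts = [f'"{phrase}"']
    # Assumption: a leading minus expresses Boolean NOT.
    parts += [f"-{word}" for word in exclude]
    parts.append(f"site:{site}")
    return " ".join(parts)


def search_url(query, base="https://www.google.com/search"):
    """Percent-encode the query into a search URL (the base URL and
    the 'q' parameter name are assumptions about the engine)."""
    return f"{base}?{urlencode({'q': query})}"


query = proxy_query("information organization", "www.firstmonday.dk")
print(query)   # "information organization" site:www.firstmonday.dk
print(search_url(query))
```

Keeping the query string separate from the URL makes it easy to send the same query to another engine by changing only the base address.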

Fig. 2. Rank number of the exact matches in the native and proxy agents’ result list

Apart from the number of hits, the biggest difference among the Web-wide search engines as proxy agents is in the relevance ranking. Relevance ranking of the results is better or much better using the proxy agents (except for GigaBlast) than using the open source web-site search engines, simply because the former are constantly being tuned and improved, while the open source search programs, understandably, get much less attention and improvement from their developers. Special features, such as the cached versions of the articles with multi-color highlighting of the matching terms, the conversion of PDF files to HTML, or the sneak preview of results, will also determine which search engine is the best proxy agent for a user.

Certain syntax limitations, such as the inability to specify a secondary domain level in Google, may also play a role in the decision if the archive is not at the top domain of a host. For example, the use of AllTheWeb (ATW) as proxy is recommended when searching the Bulletin of ASIS&T because it allows the use of the specific secondary level domain www.asis.org/Bulletin. In Google, one can’t go beyond www.asis.org, and the query picks up many matching records from ASIS&T publicity announcements, not just articles from the Bulletin. Comparing, on a regular basis, the results and their ranking from the users’ preferred archives using different Web-wide search engines will help determine which are the best proxy agents for them.
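Where an engine can restrict a search only to the host, one possible workaround, not described in this paper, is to filter the returned result URLs by path afterwards. The sketch below assumes the result URLs are already available as a list of strings; the function name and the sample URLs are hypothetical.

```python
from urllib.parse import urlsplit


def under_prefix(url, host, path_prefix):
    """True if the URL is on the given host and its path lies at or
    below path_prefix (the path comparison is case-sensitive)."""
    parts = urlsplit(url)
    prefix = "/" + path_prefix.strip("/")
    on_host = parts.hostname == host.lower()
    in_path = parts.path == prefix or parts.path.startswith(prefix + "/")
    return on_host and in_path


# Hypothetical result URLs from a search restricted to www.asis.org.
results = [
    "http://www.asis.org/Bulletin/example-article.html",
    "http://www.asis.org/announcements/example-notice.html",
    "http://www.asis.org/BulletinBoard/example-post.html",
]

bulletin_only = [u for u in results
                 if under_prefix(u, "www.asis.org", "Bulletin")]
print(bulletin_only)
```

Matching on the prefix followed by a slash prevents a similarly named directory, such as the hypothetical BulletinBoard above, from slipping through.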