Distributed Resource Discovery using a Content-Sensitive Infrastructure

Anders Fongen

A thesis submitted in partial fulfilment of the requirements of the University of Sunderland for the degree of Doctor of Philosophy

May 2004

Abstract

The growth rate of the Internet indicates that search engines based on a centralised design will not cope well in the future. The volume of information that needs to be indexed and the volume of queries to be processed will grow beyond what a single centralised site on the Internet can handle. A distributed approach to Internet resource discovery seems to be necessary. While several distributed resource discovery and information retrieval systems have been built, few of them discuss or analyse their scaleability properties thoroughly.

The focus of the thesis is the design and properties of a distributed resource discovery system called the Content-Sensitive Infrastructure (CSI). The CSI is designed as a peer-to-peer networking system and uses relaxed-consistency principles for its application protocols. The scaleability of the CSI is formally analysed using complexity theory, and the results of the analysis indicate that the CSI is extremely scaleable.

The CSI bases its distribution of data and queries on classification: classes from a hierarchical classification system are attributed to data, queries and network nodes. Data and queries are forwarded towards network nodes of the same class, thus increasing the chances of relevant query results without having to inspect the entire data set. The CSI is tested both on pre-classified data and on data classified by an automatic classification method.

The scaleability properties and retrieval performance of the CSI are studied through a series of experiments. A standard document collection and evaluation methods taken from TREC (Text REtrieval Conference) are employed in this evaluation. The conclusion from this research is that the design shows good scaleability and fault-tolerance properties, and that the retrieval recall is acceptable.


Table of Contents

CHAPTER 1 Introduction
1.1 The history of Internet search
1.2 The future of Internet search
1.3 Research questions
1.4 Justification for the research
1.5 Foundations of the research
1.6 Thesis overview
1.7 Delimitation of scope
1.8 Acknowledgements
1.9 Summary

CHAPTER 2 Literature survey
2.1 Introduction
2.2 Information Retrieval
2.2.1 Building an index
2.2.2 Removing stop words
2.2.3 Stemming
2.2.4 Inverse Document Frequency
2.2.5 The Inverted file
2.2.6 The Centroid
2.2.7 Feature selection
2.2.8 Query processing
2.2.9 Ranking and presenting result sets
2.2.10 Evaluation of IR engines
2.3 Internet search engines
2.3.1 Introduction
2.3.2 What is Resource Discovery?
2.3.3 Issues related to Internet Search Engines
2.3.4 Retrieving information
2.3.5 Summary and conclusion
2.4 Automatic classification
2.4.1 Introduction
2.4.2 Principles of automatic classification
2.4.3 The Vector Space model
2.4.4 Naïve Bayesian model
2.4.5 Feature selection
2.4.6 Some automatic classification research projects
2.5 Distributed systems
2.5.1 Introduction
2.5.2 Historical background and key concepts
2.5.3 Motivation factors
2.5.4 Peer-to-peer distribution
2.5.5 Service traders
2.5.6 Summary and conclusion
2.6 Distributed information retrieval
2.6.1 Introduction
2.6.2 A taxonomy of DIR projects
2.6.3 Meta-searchers
2.6.4 Using forward knowledge
2.6.5 Content Distribution
2.6.6 Conclusion
2.7 Conclusion

CHAPTER 3 The problem of scaleability in distributed resource discovery
3.1 Introduction
3.2 The elements of a search engine
3.3 The flows of data in a search engine
3.3.1 The Fulltext volume
3.3.2 The Index volume
3.3.3 The Query volume
3.3.4 The Metadata volume
3.4 Alternative distribution strategies
3.4.1 Introduction
3.4.2 Required terminology
3.4.3 Centralised index (centralised query processor)
3.4.4 Distribution of Query volume
3.4.5 Distribution of Fulltext volume
3.4.6 Distribution of the Index volume
3.5 The use of centroids and their properties
3.5.1 Introduction
3.5.2 Size
3.5.3 The distributional effect on queries
3.5.4 Trade-off between scaleability and distribution
3.6 Conclusion

CHAPTER 4 Possible solutions to the scaleability problem
4.1 Introduction
4.2 Using static forward knowledge
4.3 Metadata associated with a topic
4.3.1 Advantages of using metadata
4.4 Building a system based on static forward knowledge
4.4.1 Introduction
4.4.2 The goal of the system
4.4.3 The “ocean” metaphor
4.4.4 The need for classification
4.4.5 The effects of classification on the retrieval effectiveness
4.5 Conclusion

CHAPTER 5 The Content-Sensitive Infrastructure - design and analysis
5.1 Introduction
5.1.1 The design principles
5.2 The architectural elements of the CSI
5.2.1 The CSI Member
5.2.2 The member group, Tuplets
5.2.3 The member tree
5.2.4 The Helper Member
5.3 The message protocols used in the CSI
5.3.1 Peer discovery
5.3.2 Member information forwarding (join phase)
5.3.3 Advertisement forwarding
5.3.4 Query forwarding
5.3.5 Leaving members
5.4 Scaleability analysis of the CSI
5.4.1 Introduction
5.4.2 The message complexity
5.4.3 Distribution of members
5.4.4 The processing of advertisements
5.4.5 The processing of queries
5.5 Scaleability simulation experiment (experiment 5a)
5.6 The fault-tolerant properties of the CSI
5.6.1 A relaxed-consistency approach
5.6.2 Masking and recovery techniques used
5.6.3 Forwarding of advertisements
5.6.4 Forwarding of queries
5.6.5 Processing of member information messages
5.6.6 Consequences of errors - the Retrieval Ratio
5.6.7 Simulated study of the Retrieval Ratio (experiment 5b)
5.6.8 Summary
5.7 Conclusion

CHAPTER 6 Description of the classification experiment
6.1 Introduction
6.1.1 The reason for studying classification
6.1.2 The methods used in the experiment
6.2 Experimental setup and design
6.2.1 Introduction
6.2.2 The choice of a classification hierarchy
6.2.3 Assembling a training set and evaluation documents
6.2.4 Details of algorithm design
6.2.5 Calculating evaluation metrics
6.3 Training and evaluation of the classifier (experiment 6a)
6.3.1 Introduction
6.3.2 Using ODP metadata for training
6.3.3 Using web documents for training
6.4 Recursive classification (experiment 6b)
6.4.1 Introduction
6.4.2 Traversal of hierarchical classes
6.4.3 Evaluation of classification accuracy
6.4.4 Conclusion from the recursive classification experiment
6.5 Classification of the TREC corpus (experiment 6c)
6.5.1 Introduction
6.5.2 The conduct of the experiment
6.5.3 Evaluation of experimental results
6.6 Discussion: are the results reliable? Are they as expected?
6.7 Conclusion

CHAPTER 7 Evaluation of retrieval effectiveness in the CSI
7.1 Introduction
7.1.1 What is retrieval effectiveness?
7.2 Experimental design (experiment 7a)
7.2.1 Automatic generation of advertisements
7.2.2 The resident retrieval engine
7.2.3 The topic collections
7.2.4 The TREC evaluation guidelines
7.3 Results from effectiveness evaluation
7.4 Discussion
7.4.1 Factors affecting the precision
7.4.2 What is most important: Recall or precision?
7.5 Conclusion

CHAPTER 8 Description of the distribution experiment
8.1 Introduction
8.2 Experimental design
8.2.1 Member control program and resident retrieval engine
8.2.2 Open Directory Project data set
8.2.3 Configuration and logging facility
8.3 Conduct of experiment
8.3.1 Measurement of scaleability (experiment 8a)
8.3.2 Measurement of fault tolerance (experiment 8b)
8.3.3 Measurement of response time
8.3.4 Self-configuration
8.4 Quality of result set (experiment 8c)
8.5 Discussion
8.6 Conclusion
8.6.1 Scaleability
8.6.2 Fault tolerance
8.6.3 Correctness

CHAPTER 9 Conclusion
9.1 Introduction
9.2 Conclusions about the research questions
9.3 Contributions
9.4 Future work
9.5 A personal statement
9.6 Summary

Appendix A: Experiments overview
Appendix B: Retrieval result sets
Appendix C: The CSI User Agent
Appendix D: Published papers related to the thesis
Bibliography


List of Figures

CHAPTER 1 Introduction

CHAPTER 2 Literature survey
Figure 2-1. A typical recall-precision graph
Figure 2-2. Peer-to-peer configuration
Figure 2-3. A taxonomy of distributed information retrieval projects
Figure 2-4. Architecture of a meta-searcher
Figure 2-5. Distributing queries on the basis of forward knowledge
Figure 2-6. A network of WHOIS++ servers organised as a directed graph
Figure 2-7. CDNs use DNS to direct requests to nearby CDN server

CHAPTER 3 The problem of scaleability in distributed resource discovery
Figure 3-1. The components of a search engine
Figure 3-2. Estimated distribution of web page lifetime
Figure 3-3. The symbols used in this discussion
Figure 3-4. The information flow using centralised index.
Figure 3-5. The information flow using query volume distribution
Figure 3-6. The information flow using distributed index and Forward Knowledge.
Figure 3-7. Word distribution according to Zipf’s law.

CHAPTER 4 Possible solutions to the scaleability problem
Figure 4-1. The information flow using distributed metadata clustering and static FK
Figure 4-2. Service model of the Content-Sensitive Infrastructure
Figure 4-3. The CSI visualised as an ocean.

CHAPTER 5 The Content-Sensitive Infrastructure - design and analysis
Figure 5-1. The duties of a CSI member
Figure 5-2. The CSI member groups shown together with their category (COI) relationships
Figure 5-3. Sequence diagram showing flow of messages during peer discovery and join phase
Figure 5-4. Sequence of join-information messages.
Figure 5-5. The forwarding of advertisement
Figure 5-6. Distribution of queries inside a member group is a variation of tree-based multicast.
Figure 5-7. The delegation of query processing to child members.
Figure 5-8. Number of messages in members during simulated operation of the CSI
Figure 5-9. Causes of failure and their effects on the Retrieval Ratio
Figure 5-10. The retrieval ratio measured in a simulated environment
Figure 5-11. The Retrieval Ratio measured as a function of the number of CSI members

CHAPTER 6 Description of the classification experiment
Figure 6-1. The effect of giving breadth and cutoff different values.
Figure 6-2. Example of a TREC topic (from TREC-9)
Figure 6-3. Maximum retrieval recall as a function of breadth and cutoff during classification

CHAPTER 7 Evaluation of retrieval effectiveness in the CSI
Figure 7-1. Data model of the resident retrieval engine
Figure 7-2. Example of a TREC topic (from TREC-9)
Figure 7-3. Interpolation of recall-precision values (taken from the TREC guidelines)
Figure 7-4. The Recall-Precision graph from the numbers in table 7-1

CHAPTER 8 Description of the distribution experiment
Figure 8-1. Data model of the retrieval engine in the member control program
Figure 8-2. Generation and insertion of advertisements into the CSI
Figure 8-3. Experimental and analytical results compared
Figure 8-4. The observed fault-tolerance during query processing in two different contexts

CHAPTER 9 Conclusion
Figure C-1. Screenshot of the CSI User Agent


List of Tables

CHAPTER 1 Introduction
CHAPTER 2 Literature survey
CHAPTER 3 The problem of scaleability in distributed resource discovery
Table 3-1. The symbols used to denote the data flows
CHAPTER 4 Possible solutions to the scaleability problem
CHAPTER 5 The Content-Sensitive Infrastructure - design and analysis
CHAPTER 6 Description of the classification experiment
Table 6-1. Evaluation of two different classifiers
Table 6-2. Classification accuracy under flat and recursive classification
CHAPTER 7 Evaluation of retrieval effectiveness in the CSI
Table 7-1. Average recall-precision calculations
Table 7-2. Average precision over all topics
Table 7-3. Average recall in a result set with 1000 advertisements
Table 7-4. Selected results from TREC-2001 Web Track
CHAPTER 8 Description of the distribution experiment
Table 8-1. Mean average message count during advertisements processing
Table 8-2. Mean average message count during query processing
CHAPTER 9 Conclusion


1 Introduction

1.1 The history of Internet search

The publication of information on the Internet has mostly been an uncoordinated activity: Each author offers information on Internet information servers with little or no regard to existing information elsewhere on the Internet. As a result, the task of retrieving information from the Internet requires additional resources (indices, directories etc.).

Even before the World Wide Web emerged and before the Internet was made public in 1993, this need was perceived and addressed through manually compiled directories maintained by volunteer experts. Of more interest, though, was archie, a directory service for public FTP servers. Given the knowledge about an FTP server, the archie server was able to index its content by fetching directory listings (only file names were indexed). Archie proved to be effective because it offered a centralised entry point to a distributed network of information servers.

With the introduction of the World Wide Web a different type of indexing mechanism was necessary: The application protocol (called HTTP, HyperText Transfer Protocol) does not offer a directory listing option as the FTP protocol does. On the other hand, documents may refer to other documents by the use of hyperlinks, including documents on different servers. The combination of documents and hyperlinks renders the World Wide Web (WWW) as a directed graph, and by traversing this graph new WWW servers can be discovered by a directory service. This property paved the way for the Internet Search Engines: discovery systems that automatically discover new web documents and index them for later query processing. The Search Engines transform the World Wide Web into a document collection where full-text retrieval operations are possible.

The directed graph of the WWW accounts for both the strengths and the weaknesses of the Internet Search Engines: New servers can be discovered by following hyperlinks, but the existence of a document will not be known to an observer unless there is a hyperlink pointing to this document. A search engine employs a crawler to discover documents (and other resources as well) by traversing the WWW graph, so that the documents may be fetched and indexed for later retrieval.

1.2 The future of Internet search

The Internet search engines have become a cornerstone for the exchange of information on the Web. A large number of users turn to Yahoo!, Google, AltaVista or Lycos as their starting point when looking for information of various kinds. Given their tremendous popularity, it is important that search engines are not overwhelmed by the increase in the volume of information on the web, and by the growth in user demand. At the time of writing, Google reports 200 million queries per day¹, and the number is expected to grow exponentially (Kobayashi and Takeda, 2000). The volume of information on the web is also growing exponentially (ibid.), and the use of different media types is becoming more common: Pictures and sound are complementing the use of textual representation, and information is increasingly offered as responses from programmed services, e.g. in the form of web services.

The traditional designs of web search engines have difficulties in facing these challenges: Their centralised designs face scaleability problems in dealing with the increase in traffic volume, and the use of crawlers for information discovery does not deal well with the diversity of media types and the use of web services.

This thesis contributes to the future of Web search engines by proposing a novel distributed search engine based on distributed clusters of classified metadata. A new analytical framework is used for identifying the properties necessary for a distributed search engine to be scaleable to very large proportions, and it will be shown that no existing approaches to distributed Information Retrieval or Resource Discovery possess these properties. A working distributed retrieval system is built on these principles and demonstrated in the thesis. The properties of the design are investigated through analysis, simulation and a series of laboratory experiments.

1. Reported in the Norwegian newspaper Itavisen Business, May 22. 2003


A design approach known as peer-to-peer was applied in the construction of the system, where a large number of personal computers connected to the Internet become parts of a distributed system of potentially very large scale. Peer-to-peer can be seen as a strategy for utilising the world’s largest available resource for storage, computing and communication, namely the pool of under-used computers connected to the Internet (together with the access network). The distribution strategy used is based on classification, i.e. information belonging to the same class is clustered together in the system. The assumption that it is possible to find a good result set by inspecting a small number of clusters relies on van Rijsbergen’s Cluster Hypothesis (Jardine and van Rijsbergen, 1971). The system, called the Content-Sensitive Infrastructure (CSI), offers the core services for a new generation of web search engines exploiting distributed clusters of metadata. The process of generating metadata and collecting them in clusters is studied in this thesis, and the effect of this process on the retrieval performance is investigated through several experiments.

1.3 Research questions

The research questions critically addressed in this thesis are:

• What are the necessary properties for a Web search engine to be scaleable to very large proportions (i.e. millions of nodes)?
• What is the retrieval performance of a system built on these principles measured by precision, recall and average precision?
• Can the operating principles for extreme scaleability be enforced in an implementation that also offers fault tolerance, resilience, self-configuration and heterogeneous web content?
• Can the CSI provide the basis for a search engine that is scaleable to very large proportions?

1.4 Justification for the research

The first research question is justified by the arguments in section 1.2: It is of great importance that the search engines cope with the future growth in information and traffic volume. Search engines are widely used and the quality of their services is important to a large fraction of Internet users.

The second research question is justified by the need for comparing the performance of search engines: Since the design and algorithms chosen for the system profoundly affect the retrieval performance, it is important to ensure that the resource discovery system performs in a manner that makes it useful as a search engine. The two criteria used for evaluating a “scaleable search engine” are inherent in the words used: It must be scaleable, and it must be a search engine. Therefore, the retrieval performance needs to be investigated in a manner that enables comparison with existing search engines. The widely used measures precision, recall and average precision are chosen for this purpose.

Since the project in this thesis obtains scaleability through large-scale distribution, it is necessary to study to what extent the implementation has the necessary properties (self-configuration, resilience etc.) for a large-scale distributed application. This justifies the third research question.

The fourth research question is justified by the need to see the design invented in this thesis, the CSI, in the perspective of the preceding discussion.

1.5 Foundations of the research

In this section some of the theories and methods that have been important in the work to answer the research questions are presented:

The idea of using distribution as a means for aggregating computing and communication resources in order to obtain a large scale system is central to the research presented in this thesis.

The concept and utility of distributed clusters of metadata lean heavily on van Rijsbergen’s Cluster Hypothesis (Jardine and van Rijsbergen, 1971). The Cluster Hypothesis states that “closely associated documents tend to be relevant to the same requests”. The hypothesis is applied in this project as a distribution strategy, in that it indicates that a good query result set can be constructed by searching a small number of metadata clusters.

The vector model, devised by Salton et al. (1975), provides an important framework for the calculation of similarities between document or category representations. In the vector model, a document or a group of documents is represented as a vector where each index represents a given term in the vocabulary in use, and the value of each element represents the importance of the term at that index. The similarity between two documents can be expressed as the cosine of the angle between the term vectors in a Euclidean space. This thesis describes the use of the vector model both in its retrieval engine and its classification engine.

Zipf’s law (explained in chapter 3) provides a tool to estimate the relative frequency of terms in a document containing natural language. It is used in this thesis for calculating the volume of index information which needs to be moved during operation of some distributed IR systems.

Complexity analysis is used as a tool for analysing the message complexity of the distributed system, i.e. the number of exchanged messages expressed in “big-Oh” notation (Weiss, 1999 p.41).

Simulation experiments have been employed in chapter 5, where the scaleability and fault-tolerant properties have been investigated under carefully controlled conditions.

TREC-based retrieval evaluation offers a standardised set of guidelines by which retrieval engines can be compared. These guidelines are used in one of the experiments in order to put the retrieval performance of the CSI into the perspective of other retrieval engines.

Laboratory distribution experiments have been used to demonstrate the correctness and performance of the distributed retrieval engine. Computers connected together in a campus lab network were used to start a medium number of cooperating nodes.

1.6 Thesis overview

Following the introduction contained in this chapter, the remaining components of this thesis are as follows:

Chapter 2 surveys the state of the art in areas related to this thesis. The area of Internet Search Engines is described together with relevant issues from Information Retrieval and Resource Discovery, then selected issues from the area of Distributed Computing are discussed. Distributed Information Retrieval is then discussed together with some related research projects. The chapter presents evidence that Internet Search Engines must cope with an aggressive growth both in information volume and request volume.

Chapter 3 examines the nature of Internet information volumes and discusses the need for scaleable solutions to Internet resource discovery. The chapter then provides an analytical framework for assessing the scaleability properties of existing distributed IR systems, and identifies the key properties that need to be present in scaleable solutions. The chapter concludes that existing distribution models for information discovery and retrieval have insufficient scaleability properties.

Chapter 4 outlines a prospective architecture and a service model for a scaleable resource discovery system. The main idea behind this system is the use of classified items of metadata, and to cluster metadata items together according to their classification. The advantages of using metadata instead of full-text indices are also addressed in this chapter. The system has the properties identified as necessary for a scaleable system in chapter 3.

Chapter 5 describes the actual design of a peer-to-peer based resource discovery system based on the model outlined in chapter 4. The protocols required for automatic peer discovery and message forwarding are described and discussed. The scaleability properties of the design are formally analysed, and the results from the analysis are confirmed through a simulation experiment (experiment 5a). The fault-tolerant properties of the system are discussed and investigated in a simulation experiment (experiment 5b). The conclusion drawn from the experiments is that the chosen design has better scaleability properties than existing systems.

Chapter 6 addresses the problem of automatic classification. First, the choice of a classification hierarchy is discussed, together with the properties that a hierarchy used in this type of system must have. Second, a recursive classification algorithm is trained on a training set assembled for this purpose and evaluated using a related evaluation set (experiments 6a and 6b). Third, the classification algorithm is applied to the TREC WT10g corpus. The related effect on retrieval recall is studied by applying classification to TREC topics and comparing classification results on documents found in the query relevance judgements (experiment 6c). The conclusion drawn from this chapter is that the classification scheme allows for a reasonable retrieval recall.

Chapter 7 describes experiments to evaluate the retrieval performance of the Content-Sensitive Infrastructure as a conventional retrieval engine, and applies the TREC guidelines for reporting average precision and a recall-precision graph. The entire WT10g corpus is automatically classified using the algorithm developed in chapter 6, and metadata descriptions representing every single document in WT10g are generated. The retrieval operations (using the TREC standard topics) are then conducted on this metadata collection, and the retrieval performance is calculated (experiment 7a). The conclusion from this chapter is that the CSI performs acceptably as an Internet Search Engine.

Chapter 8 describes the evaluation of the performance of the Content-Sensitive Infrastructure as a distributed system in a real distributed environment. This experiment is conducted in order to assess the correctness of the networking protocols, and to measure the fault-tolerant properties (experiment 8b), response times, and networking traffic in an environment as close to the real world as possible. The measured scaleability properties of the system are compared to the simulation experiment 5a (experiment 8a), and the quality of the retrieved result sets is informally judged (experiment 8c). The conclusion drawn from this chapter is that the networking protocols work as expected, and that the system has the expected scaleability and fault-tolerant properties.

Chapter 9 concludes the thesis and discusses the relevance and usefulness of the results presented in the previous chapters. It also points to questions and problems that may be investigated further.

1.7 Delimitation of scope

This thesis focuses on a particular distribution model for use in Internet Search Engines, and investigates its technical properties in terms of scaleability, fault-tolerance and retrieval performance. The thesis does not address the more controversial issues related to peer-to-peer computing, such as various security problems, vulnerability to sabotage and denial-of-service attacks.

This thesis proposes a common classification hierarchy for the entire Web, and demonstrates an approach for automatic classification. In practice, automatic classification can be applied only to a fraction of the available information. The practicality of this limitation is not addressed in the thesis.


1.8 Acknowledgements

I would like to thank my employer, the Norwegian School of Information Technology, for support and funding, and for letting me use the school laboratories for my experiments. I would also like to thank my family and friends for patience and encouragement during my time as a Ph.D. student, including those friends with furs and whiskers.

Thanks to the NITH students Astrid Vareberg, Kristian Kyvig, Petter Berg-Kristoffersen and Preben Ingebricson for their efforts in programming a CSI User Agent. The implementation of Porter’s algorithm used in this work was programmed by Darcy Quesnel. Linux and other software from the Open Source community were used during the development of the CSI. Java compilers, libraries and run-time system were provided by Sun Microsystems, Inc. The XML parser was provided by the Apache Xerces project.

Thanks to the two people at the University I first met, Dr. Ian Ferguson and Dr. Chris Bloor, for giving me the first feeling that this study was possible to accomplish. Thanks to my supervisors, and in particular Dr. Ian Ferguson and Prof. John Tait, for their consistent support and criticism during the entire project. They have taught me a lot about academic thinking and writing, and I owe them both a huge debt.

1.9 Summary

This chapter has introduced the research questions of this thesis, and the methodology used in the conduct of the research. It has also given a brief overview of the results obtained and their significance. The thesis shall now proceed with a discussion of the background to the research followed by a detailed description of the study.


2 Literature survey

2.1 Introduction

This chapter provides a background to the rest of the thesis. The literature survey intends to relate the thesis work to existing relevant research in the areas of Information Retrieval, Internet Search Engines (since their scaleability is the focus of this thesis) as well as Automatic Classification, Distributed systems and Distributed Information Retrieval, because the project described by this thesis builds on knowledge and research results in these areas. The literature survey also contains evidence of the problems addressed by the Content-Sensitive Infrastructure, namely the scaleability problems in Internet search engines.

2.2 Information Retrieval

The term “Information Retrieval” (IR) refers to the science of storage, maintenance and retrieval of loosely structured information (Kowalski and Maybury, 2000). Whilst the science of databases deals with strongly structured information (kept in tables), Information Retrieval deals with “natural” forms of information (collections of documents in natural languages, to a lesser extent speech, images and so on).

The science of how to organise collections of books so that they are easily retrieved is as old as the libraries, but the study of how to use computers to store and retrieve the information dates back to 1958. The International Conference on Scientific Information held in Washington 1958 marked the start of Information Retrieval as it is known today, according to Sparck Jones and Willett (1997 p.3).

Although the fields of Natural Language Processing and Artificial Intelligence have been involved in Information Retrieval experiments, they have not performed better than Lexical approaches where the statistical properties of term occurrences have been the object of study (Sparck Jones, 2003). This chapter will focus on Lexical approaches to IR, where the Vector Model (Salton et al., 1975) and Probabilistic Model (Croft and Harper, 1979, van Rijsbergen, 1986) are the most well-studied.


The following subsections describe some problem areas connected to the construction of a searchable document collection and the retrieval operations on document collections.

2.2.1 Building an index

The retrieval of information requires that the information is stored in a way that enables search operations to take place efficiently. The term indexing refers to the process of building and maintaining a representation of the document in a data structure efficient for searching. In an early paper, Luhn (1961) studied the use of keyword lists made automatically by computers. His arguments for doing this are mainly those of scaleability, since he predicts the information volume to grow beyond the capacity of human indexers.

The index is built by selecting and processing terms from the documents in the collection. Each term (a word or an n-gram¹) is often associated with a weight indicating its significance for describing the characteristics of the document. Criteria for judging a term as a candidate for selection can be based on its frequency or its syntactic role. The index terms may also be annotated with their position in the document. The selection and weighting of terms is a well-researched problem in the area of IR (Biebricher et al., 1988, Salton and Buckley, 1988).

The following sub-sections describe some of the tasks necessary to consider when making a document representation: Stop word removal, stemming and weight calculation.

2.2.2 Removing stop words

A large portion of a document consists of very frequent words that do not carry meaning in natural language (e.g. the, and, him) and should be disregarded in retrieval operations. These words, called stop words, should not be part of a document representation. Comparing terms to a list of stop words before selecting them can significantly reduce the size of a document representation. Stop word lists are often small (200-500 words) and easily kept in main memory for fast lookup.
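The filtering step described above can be sketched in a few lines of Python. This is only an illustration: the stop word list is a tiny hypothetical sample, the function name `remove_stop_words` is invented here, and splitting on whitespace is a simplifying assumption about tokenisation.

```python
# Tiny illustrative stop word list; real lists typically hold 200-500 words.
STOP_WORDS = frozenset({"the", "and", "him", "a", "of", "to", "in", "is"})


def remove_stop_words(text):
    """Lowercase the text, split it on whitespace and drop the stop words."""
    return [term for term in text.lower().split() if term not in STOP_WORDS]


terms = remove_stop_words("The crawler fetched the page and indexed it")
```

Keeping the list in a `frozenset` gives constant-time membership tests, which matches the observation that such lists are small enough to be held in main memory.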

2.2.3 Stemming

Since the different grammatical forms of a word (singular/plural, past/present tense) seldom need to be kept distinct, words are often normalised by extracting their “stems”.

1. The term n-gram denotes an ordered collection of n words


A popular algorithm for stemming English words is Porter’s algorithm (Porter, 1980), which is available coded in different programming languages¹. Stemming may introduce ambiguous words, since e.g. ‘arm’ and ‘arms’ are reduced to the same stem. The use of stemming thus involves a precision/recall tradeoff (section 2.2.10).

2.2.4 Inverse Document Frequency

A term may be given a weight indicating its significance for this particular document (compared to other documents) as mentioned in section 2.2.1. The frequency of the term $i$ in the document ($\mathit{tf}_i$) is a good start for calculating the weight. Terms that are frequent in most documents (not only in this document) should be given a lower weight, since they are less able to distinguish this document from other documents. A common measure of this effect is the inverse document frequency of a term, defined as follows:

$$\mathit{idf}_i = \log \frac{N}{n_i}$$

(Eq. 2-1)

(Salton and Buckley, 1988) where $N$ is the total number of documents in the collection (domain of calculation) and $n_i$ is the number of documents in which the term occurs. A common expression for calculating the weight is:

$$w_i = \mathit{tf}_i \cdot \mathit{idf}_i$$

(Eq. 2-2)
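As a minimal sketch of equations 2-1 and 2-2, the following Python function computes the weights for a small collection of already tokenised documents. The function name `tf_idf_weights` is hypothetical, raw term counts are assumed for $\mathit{tf}_i$, and the natural logarithm is assumed because the equations above do not fix a base.

```python
import math
from collections import Counter


def tf_idf_weights(documents):
    """Return one {term: weight} dict per document, with w_i = tf_i * idf_i."""
    n_docs = len(documents)  # N in Eq. 2-1
    # n_i: number of documents in which term i occurs at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # raw term frequency within this document
        # Natural logarithm is an assumption; the base only rescales the weights.
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights


example = tf_idf_weights([["peer", "peer", "index"], ["index", "query"]])
```

In this example the term "index" occurs in every document, so its inverse document frequency, and therefore its weight, is zero, which illustrates why very common terms contribute little to a document representation.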

2.2.5 The Inverted file

The inverted file is a popular form of index organisation. For a given term, the inverted file has a list of (document, weight) tuples associated with the term. The inverted file thus facilitates the operation of listing all documents containing a given term (Kowalski and Maybury, 2000, p.85). Luhn’s contribution on the keyword-in-context index can be viewed as a pre-stage to inverted files (Luhn, 1961).
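The structure can be illustrated with a small in-memory sketch. The function name `build_inverted_file`, the document identifiers and the weights are all hypothetical; a production index would normally be kept on disk and compressed.

```python
from collections import defaultdict


def build_inverted_file(doc_weights):
    """Map each term to a list of (document id, weight) tuples.

    doc_weights has the form {document id: {term: weight}}.
    """
    inverted = defaultdict(list)
    for doc_id, weights in doc_weights.items():
        for term, weight in weights.items():
            inverted[term].append((doc_id, weight))
    return dict(inverted)


index = build_inverted_file({
    "d1": {"peer": 1.4, "index": 0.3},
    "d2": {"index": 0.7, "query": 1.1},
})
# Listing all documents that contain a given term is a single lookup.
documents_with_index = [doc_id for doc_id, _ in index["index"]]
```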

1. See: http://www.tartarus.org/~martin/PorterStemmer/ [31 Jul 2003]


2.2.6 The Centroid

While the inverted file is useful for storing indices representing individual documents in a document collection, it does not easily answer the question “what are the characterising terms in this particular document”. A list of (term, weight) pairs from each document is a better choice when e.g. the similarity between two documents is of interest. A weighted term list that describes a document in this manner is called a Centroid (Salton et al., 1975). This approach to document representation is called the Vector Space Model (Salton et al., 1975), or sometimes a Bag-of-words approach, since the order of terms and their proximity are ignored.

In the Vector Space Model, each possible term of the domain (e.g. the vocabulary in use) is given an index, and a vector element on this index is the weight of this term in the particular document. Formally: For a document $d_j$, the term $k_i$ is given the weight $w_{i,j}$. The vector representing this document can be defined as:

$$\vec{d_j} = (w_{1,j}, w_{2,j}, w_{3,j}, \ldots, w_{t,j})$$

(Eq. 2-3)

Where $t$ is the total number of terms in the domain. The similarity between two documents $d_i$ and $d_j$ is expressed by the cosine of the angle between the vectors $\vec{d_i}$ and $\vec{d_j}$:

$$\mathrm{sim}(d_i, d_j) = \frac{\vec{d_i} \cdot \vec{d_j}}{|\vec{d_i}| \cdot |\vec{d_j}|} = \frac{\sum_{x=1}^{t} w_{x,i} \cdot w_{x,j}}{\sqrt{\sum_{x=1}^{t} w_{x,i}^{2}} \cdot \sqrt{\sum_{x=1}^{t} w_{x,j}^{2}}}$$

(Eq. 2-4)

The term $w_{x,j}$ denotes the weight of term $x$ in document $j$, see equation 2-3.

This function returns 1 for two identical vectors (indicating a perfect match), and 0 for orthogonal vectors. Centroids can also be used to represent several documents as a whole, e.g. the lexical characteristics of a document collection.
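The cosine measure can be sketched for Centroids stored as sparse term-to-weight dictionaries, so that terms with zero weight need not be stored. The function name `cosine_similarity` and the example weights are hypothetical, and returning 0.0 for an empty vector is a convention chosen for this sketch rather than part of the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for empty vectors
    return dot / (norm_a * norm_b)


identical = cosine_similarity({"peer": 3.0, "index": 4.0},
                              {"peer": 3.0, "index": 4.0})
orthogonal = cosine_similarity({"peer": 1.0}, {"query": 2.0})
```

The two calls correspond to the boundary cases named in the text: identical vectors and vectors with no terms in common.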

2.2.7 Feature selection

For large documents the vector can contain tens of thousands of terms, and computations on such vectors are costly in terms of computing and storage resources. The number of elements in the vector can be reduced by a process called feature selection, also known as dimensionality reduction (Kobayashi and Takeda, 2000 p.160).

Feature selection replaces related terms with some other representation. Synonyms can be replaced by their entry in Roget’s thesaurus or with synonym sets from WordNet (Fukumoto and Suzuki, 2001). Semantically related words can be replaced by common hypernyms from WordNet (Scott and Matvin, 1998). Kowalski calls these methods “indexing by concept” (Kowalski and Maybury, 2000 p.63).

Another approach to feature selection is to look at the statistical interdependence between words, and to replace groups of often co-occurring terms with distributional clusters (Bekkerman et al., 2001).

2.2.8 Query processing

When a user of an IR system looks for documents of interest, s/he will formulate a query which expresses her/his information need. The IR engine processes the query and returns a result set back to the user in a ranked form.

“Search statements are the statements of an information need generated by users to specify the concepts they are trying to locate in items.” (Kowalski and Maybury, 2000 p.166)

The query is processed by retrieving those documents regarded as relevant for the query. This is done by searching the index for documents that have the required properties expressed by (or derived from) the query. The relevance criteria determine which properties of a document must be preserved during the term extraction phase. The properties of the term selection algorithm therefore become important for the effectiveness of the retrieval operations.


The syntax of the query influences the IR system’s ability to attract new users or to integrate with other systems. The standardisation of query languages has resulted in at least two widely adopted standards:

• Z39.50 (Z39.50, 1995) - This standard not only describes the syntax of the query language, but also the communication protocol between an IR engine and a client.
• WAIS (Wide Area Information Servers), (RFC1580, 1994, RFC1625, 1994) - A now obsolete protocol for querying databases through the Internet. It was popular in the first half of the 1990s as a pre-stage to today’s Internet search engines.

2.2.9 Ranking and presenting result sets

A result set has two key properties, precision and recall: Precision measures the fraction of relevant documents in the result set, and recall measures the fraction of the total population of relevant documents that are present in the result set. A formal definition is given in section 2.2.10.

In a large IR system, the processing of a query will often produce a result set too large to be useful. It therefore becomes important to rank the result set so that the results are presented ordered by relevance to the query. Users of a retrieval system are not expected to browse through the entire result set: Jansen et al. (1998) show that users seldom look at more than the first two screens of results. Improving the precision of the result set is therefore considered more important than having a high recall. This opinion is shared by several researchers (Pinkerton, 1994, Sullivan, 1999).

A basic approach to ranking is to calculate each document’s similarity to the query using some “similarity metric” and to present the documents ordered by this metric. A common method is to represent the query and the document as Centroids (section 2.2.6), and to calculate the cosine of the angle between these two vectors (Salton et al., 1975).

Other approaches consider the popularity or importance of the documents by looking at their retrieval frequency or in-degree (the number of references to this document from elsewhere). The search engine Google¹ takes this approach (Notess, 2000). Other approaches to importance ranking are the algorithms HITS and PageRank (Arasu et al., 2001).

Ranking can also identify inter-document similarities and present a list of document clusters. Some clustering algorithms are presented and evaluated by Griffiths et al. (1997). A clustered presentation of a result set often includes a “More like this” button, so that the user can indicate the document type of interest and see more documents from that particular cluster (Harper and Muresan, 2002).

1. http://www.google.com/ [May 13. 2003]

2.2.10 Evaluation of IR engines

Research on Information Retrieval techniques requires that IR engines can be compared. The metrics often used when evaluating the result set of a query are called precision and recall. For a query q, the definitions of these are:

$$\mathrm{Precision}(q) = \frac{\text{Number of relevant documents in the result set}}{\text{Total number of documents in the result set}} \qquad \text{(Eq. 2-5)}$$

$$\mathrm{Recall}(q) = \frac{\text{Number of relevant documents in the result set}}{\text{Total number of relevant documents in the collection}} \qquad \text{(Eq. 2-6)}$$

These formulae do not give any credit to a ranking algorithm, since the precision calculation disregards the document’s position in the result set. Jansen’s findings (Jansen et al., 1998) support the importance of bringing the relevant documents as high up as possible in the result set. The effect of ranking may be expressed by a recall-precision graph, where the precision and recall are calculated over a varying number of documents in the result set, counted from the highest ranked document and downwards (Figure 2-1).
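As an illustration, the following Python sketch computes precision and recall according to Eq. 2-5 and Eq. 2-6, and then repeats the calculation over a growing prefix of a ranked result set, which yields the points of a recall-precision graph. The function names and the document identifiers are assumptions made for this example, and the ranking is assumed to contain no duplicates.

```python
def precision_recall(result_set, relevant):
    """Precision (Eq. 2-5) and recall (Eq. 2-6) of an unordered result set."""
    result_set = set(result_set)
    relevant = set(relevant)
    hits = len(result_set & relevant)
    precision = hits / len(result_set) if result_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def recall_precision_points(ranking, relevant):
    """(recall, precision) after the first 1, 2, ..., n ranked documents."""
    points = []
    for n in range(1, len(ranking) + 1):
        precision, recall = precision_recall(ranking[:n], relevant)
        points.append((recall, precision))
    return points

# Hypothetical ranked result set and relevance judgements.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d3", "d7", "d9"}

overall = precision_recall(ranking, relevant)
points = recall_precision_points(ranking, relevant)
```

The overall figures are identical for any ordering of the same four documents, whereas the list of points changes with the ordering, which is why the graph can express the effect of ranking.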


Figure 2-1. A typical recall-precision graph1

Another measure of ranking effectiveness is the average precision, which is the average of the precision calculated over the result set at each relevant document (Voorhees and Harman, 2000). Since “relevance” is a subjective concept, the retrieval operation is often evaluated by comparing the result set to a result set constructed through human judgement. This is the case with TREC2-based experiments, where search engines are compared using a standard document collection, a standard set of queries (called topics) and a list of documents considered relevant to the individual query (Hawking et al., 1999). The guidelines for TREC-based evaluation will be described in detail in Chapter 7, as this approach has been chosen for some of the experiments in this thesis.
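The average precision measure can be sketched in Python as follows. The sketch averages the precision observed at each relevant document found in the ranking, as described above. The optional flag `count_missing` is an assumption added for this example: it divides by the total number of relevant documents instead, so that relevant documents that were never retrieved lower the score, which is how TREC-style evaluation tools commonly define the measure.

```python
def average_precision(ranking, relevant, count_missing=False):
    """Average of the precision values measured at each relevant document."""
    relevant = set(relevant)
    hits = 0
    precisions = []
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / position)
    if not precisions:
        return 0.0
    # Optionally treat unretrieved relevant documents as precision 0.
    denominator = len(relevant) if count_missing else len(precisions)
    return sum(precisions) / denominator

# Hypothetical ranking; d5 is relevant but was not retrieved.
ranking = ["d3", "d1", "d7", "d2", "d9"]
relevant = ["d3", "d7", "d9", "d5"]

ap_found = average_precision(ranking, relevant)
ap_all = average_precision(ranking, relevant, count_missing=True)
```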

1. Taken from: http://technet.oracle.com/products/text/htdocs/imt_quality.htm [May 13. 2003]

2. Text REtrieval Conference, see: http://trec.nist.gov/ [May 13. 2003]


2.3 Internet search engines

2.3.1 Introduction

This section will present some general principles of Resource Discovery systems and then examine their application in the technology of Internet Search Engines, as this is of special relevance to this thesis. The section also discusses some research results within this area. In addition, scaleability problems with existing Internet search engines are identified. They include representing the “invisible web” and web services.

Internet search engines in the form that is known today first appeared in 1994 as OpenText1 and WebCrawler2, succeeding a variety of smaller search facilities like Archie, and manually updated directory pages such as Galaxy. A range of local retrieval engines were by then accessible over the Internet using the WAIS protocol (described later). The use of crawlers, whereby the search engine can discover web pages that it wishes to index, is among the factors that distinguish a search engine from older retrieval engines. The properties of a crawler are described in section 2.3.3.1.

2.3.2 What is Resource Discovery? An application of Information Retrieval is Resource Discovery. The term refers to the process of locating accessible resources in a network, so they can be indexed and subject to retrieval operations. The well-known Internet search engines are examples of resource discovery systems. Resource discovery does not seem to have a precise definition, but research groups within this area (like the Distributed Systems Technology Centre at the University of Queensland, Australia3) describe Resource Discovery as location, access, retrieval and management of resources in heterogeneous distributed networks. Resource Discovery extends Information Retrieval in that it involves the problem of finding resources (documents, services) in a network that lacks a central registration authority. A centrally located document collection can be completely indexed by traversing the file

1. http://www.opentext.com/about_us/our_history.html [May 13. 2003]
2. http://www.wiley.com/legacy/compbooks/sonnenreich/history.html#crawler [May 13. 2003]
3. See http://archive.dstc.edu.au/RDU/reports/QuestNet95.html [Dec. 22. 2002]


system where the document files are stored. A document collection spread across a network where no “list all documents”-command is available needs to be indexed by some form of discovery heuristic. In an Internet totally dominated by the World Wide Web, the discovery heuristic tries to optimise the traversal of the graph made up by hyperlinked documents (Arasu et al., 2001). The discovery mechanisms used by Internet search engines will be discussed separately in section 2.3.3.1.

2.3.3 Issues related to Internet Search Engines

Internet search engines (www.google.com, www.lycos.com etc.) are applications of Resource Discovery systems. They are also full-text retrieval systems, i.e. they offer retrieval operations based on any text occurring in documents found on the web. A user of an ideal Internet search engine can use a set of search terms (possibly also with boolean operators between the words) and receive a list of all documents available on the Internet that contain these terms. This does not happen in practice, due to circumstances that will be discussed shortly.

Internet search engines have become tremendously popular, and their particular properties (general domain, large scale) have made them popular research objects. Kobayashi and Takeda (2000) present a survey of this research area and some of their findings are:
• The volume of information on the Internet appears to be growing at an exponential rate
• 85% of Internet users claim to be using search engines
• Users are not satisfied with the performance of search engines. Response time and poor result set quality (noise, broken links) are commonly cited problems1.

Internet search engines are to some extent distinct from traditional retrieval engines. Henzinger et al. (2002) point out the following differences:
• Spam-dexing (misleading and manipulative content)
• Unknown content quality

1. Full report is available at: http://www.gvu.gatech.edu/user_surveys/survey-1998-10/graphs/ graphs.html#use [May 13. 2003]


• Duplicate hosts (same hosts under different names or with identical content)
• Vaguely structured data
• The content is on different media (sound, image, video etc.)

Therefore, the tasks of a resource discovery system (collecting, indexing, searching, ranking etc.) face distinct problems when dealing with information from the Internet. These tasks will be briefly discussed in the context of Internet search engines.

2.3.3.1 Collecting index information

In order to generate a full-text index of a document collection on the web, the documents need to be discovered and the content transported to an index-generating entity. On the World Wide Web (WWW) most documents are linked together in an incomplete directed graph through the use of hyperlinks. Other documents are available only as responses to retrieval operations in a document server; these are called dynamic documents. Documents on the WWW have unknown lifetimes and update cycles, and new documents do not announce their existence.

Crawling through hyperlinks: The portion of documents available through hyperlinks can be retrieved by a crawler program (also known as “robots”, “spiders” etc.), which attempts to traverse the WWW graph completely. Crawlers are considered to generate at least two problems:
1. They consume too much processing and communication resources (Kobayashi and Takeda, 2000 p.154). Any web server log will show that crawlers make up a significant fraction of total traffic1.
2. They only find a fraction of the total web pages. A study published by NEC Research Institute in 1999 showed that no search engine indexed more than 16% of the estimated population of 800 million pages (Sherman, 2000). The rest is unreachable for crawlers due to the use of HTML framesets, dynamic web content or programmed hyperlinks (using e.g. Javascript). This is sometimes called “the invisible web” (Sullivan, 1999, Arasu et al., 2001). The published percentage may also be explained by scaleability problems, which will be discussed in Chapter 3.

1. On the author’s private web server, 19.5% of the hits during Sept.2002 came from crawlers.
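The traversal of the hyperlink graph described above can be sketched as a breadth-first walk. The Python example below is an illustration only: it works on a small in-memory link graph instead of fetching real pages, and the page names and the function `crawl` are hypothetical.

```python
from collections import deque

def crawl(link_graph, seeds):
    """Breadth-first traversal of a link graph; returns pages in visiting order."""
    visited = []
    seen = set(seeds)
    queue = deque(seeds)          # queue of discovered but unvisited pages
    while queue:
        page = queue.popleft()
        visited.append(page)
        for target in link_graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited

# Hypothetical toy web: keys are pages, values are their outgoing hyperlinks.
link_graph = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html", "d.html"],
    "c.html": ["a.html"],
    "d.html": [],
    "e.html": ["a.html"],         # no other page links to e.html
}
reached = crawl(link_graph, ["a.html"])
unreached = sorted(set(link_graph) - set(reached))
```

The page `e.html` is never reached, because no visited page links to it. This is a simplified picture of why a crawler that relies on hyperlinks alone finds only a fraction of the pages.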


The crawler may take on several strategies in order to find important pages early in the crawling process. It can do so by inspecting the queue of unvisited hyperlinks and reordering it by their estimated importance. A presentation of different “importance metrics” can be found in the work by Arasu et al. (2001).

Selecting useful content: Web documents consisting of text may be parsed and split into a set of indexing terms. The issues of term extraction were discussed briefly in section 2.2.1. The same techniques apply in Internet search engines as in any retrieval system, but the task is a little more complicated due to the competition for “web visibility”: web documents are often filled with terms not visible in the web browser, but visible to the indexing agent, with the purpose of manipulating the search engine’s ranking algorithm. This practice, called spam-dexing (Kobayashi and Takeda, 2000 p.154, Henzinger et al., 2002), reduces the indexing terms’ utility in the ranking process. Chekuri et al. (1997) discuss this problem in the context of automatic classification.

Services and non-text media: Traditional IR indexing relies on the presence of text that can be subject to lexical or language analysis. Indexing dynamic or non-textual material poses new problems: the automatic indexing of images currently relies on basic properties in the picture (e.g. colour and light distribution) or its filename. The indexing of a dynamic page (like a directory or a translation service) must normally rely on the textual content of the query form or a “front page” linking to the query form. The growing use of multimedia information on the WWW makes this problem quite relevant. Sherman states: “As we move into the next century, improvements in computer processing and storage capabilities, together with the stunning increase in available bandwidth for data communications, means we’ll be seeing far more types of media on the Web. And increasingly, information will be kept in databases that serve content dynamically, rather than stored as static Web pages. This trend will be accelerated by the “convergence” we’re already starting to see in a variety of areas.” (Sherman, 1999)

Revisit pages for re-indexing: Since static web pages do not announce when they are updated, a web crawler needs to revisit them on a regular basis in order to have an up-to-date index. The time interval between visits to a page is chosen as a trade-off between network capacity consumption and index currency. Also, as a page is retrieved, its revisiting


interval is estimated by analysing its content, past revision history and user popularity (Sullivan, 1999, Arasu et al., 2001). There exists considerable statistical material on the lifetime of web pages and the frequency of web page retrieval. Some of this material will be analysed in Chapter 3.

2.3.3.2 Metadata

The term metadata refers to auxiliary information about a resource or a document, conveying knowledge not necessarily present in the document text (Agosti et al., 1993, Milstead and Feldman, 1999, Dornfest and Brickley, 2001). Metadata can hold information about a document’s formal status, revision history, classification etc. This information can assist the indexing, retrieval and ranking operations. The use of metadata in Internet search engines is of interest, since this type of information is seldom included in the text retrieved by the crawler. Metadata is similar to the traditional library card, and does not exclusively apply to the operation of Internet search engines, but due to its particular relevance it is mentioned in this context.

Several metadata standards exist: the Dublin Core (DublinCore, 1995) suggests a general set of attributes for the description of books, but can also be used for other information. The Resource Description Framework (W3C, 1999) offers a syntax for the construction of ontologies based on the eXtensible Markup Language (W3C, 1998). XML and RDF lend themselves well to the description of Dublin Core attributes, and these standards are often seen working together.

Problems related to the indexing of non-text material and web services (Section 2.3.3.1) can be reduced through the use of metadata. Automatic generation of metadata based on textual analysis is a current research area (Jenkins et al., 1999, Liddy et al., 2002, Stuckenschmidt and van Harmelen, 2001), and the project presented in this thesis also does automatic metadata generation (classification and keyword extraction) to some extent. Metadata content resulting from automatic textual analysis will in general exclude ‘auxiliary’ information like author, revision history, copyrights etc.

2.3.4 Retrieving information

The syntax of a query (search statement) will most likely fall into one of three categories:

• A list of terms
• A question posed in natural language
• A boolean expression with terms as operands

The typical Internet search engine uses a simplification of the third alternative: a list of terms is treated as a boolean expression with implicit “AND”-operators between them. An in-depth comparison of the features of Internet search engines is given by Hock (2000).

Traditional retrieval engines targeted at a professional audience expect the user to be experienced in formulating complex queries. Internet users, on the other hand, are naive and use few search terms. According to Jansen et al. (1998) the average length of queries is 2.21 terms, and less than 10% of queries included any boolean operators. The study was done on a sample of 51,473 queries on the Excite search engine (www.excite.com). A later report from “Search Engine Watch1” states that, of a population of 33,000 users, 28.6% use one keyword and 44.8% use multiple keywords when searching. Spink et al. (2002) report a mean query length of 2.3 and that 27% use one keyword only (Excite study). Despite the time between these reports, they all indicate that users prefer to use only one or a few keywords when searching, and that this habit is unlikely to change in the future.
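The treatment of a term list as a boolean expression with implicit “AND”-operators can be illustrated by the following Python sketch. It assumes a simple in-memory inverted index built by splitting text on whitespace; the function names and the three sample documents are hypothetical.

```python
def build_index(documents):
    """Map every term to the set of documents in which it occurs."""
    index = {}
    for doc_id, text in documents.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

def search_and(index, query):
    """Treat the query as a list of terms joined by implicit AND operators."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    # A document qualifies only if it appears in every term's posting set.
    return set.intersection(*postings)

# Hypothetical document collection.
documents = {
    "d1": "distributed resource discovery",
    "d2": "resource discovery on the web",
    "d3": "distributed operating systems",
}
index = build_index(documents)
both = search_and(index, "distributed discovery")
```

Each additional term can only shrink the result set, which fits the observation that short queries tend to produce large result sets.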

2.3.5 Summary and conclusion This section has discussed some aspects of Internet search engines. They are applications of Information Retrieval engines, and are also examples of Resource Discovery systems. The indexing process is dependent upon crawler programs, which are known to find only a fraction of WWW documents. Although very popular in use, surveys indicate that users of search engines are not satisfied with their performance (Section 2.3.3). Poor result set quality (noise and broken links) and scaleability problems are among the causes for the dissatisfaction.

1. Web-published in 2000 on: http://www.searchenginewatch.com/sereport/article.php/2162791 [May 13. 2003]


2.4 Automatic classification

2.4.1 Introduction

The term automatic classification refers to the process of attributing a category or class to an item of information (e.g. a document, an item of metadata or a query). Automatic classification is an important field of research for this thesis, because the Content-Sensitive Infrastructure requires that both resources and queries have classification codes attributed to them. Having a method for automatic generation of the classification codes is important for the utility of the system.

The set of possible categories to choose from may be predefined (like the Dewey Decimal Classification System), or it can be a set of categories that emerges from the classification process. The latter is often called clustering. This section will discuss automatic classification in the first context, i.e. where a fixed and predefined category system is used.

A comprehensive bibliography of the early history of automatic classification can be found in (Sparck Jones, 1991). Most of the early work on classification inside IR, e.g. (van Rijsbergen, 1979), seems to focus on clustering. Van Rijsbergen and Sparck Jones (1973) formulated the Cluster Hypothesis, which states that “closely associated documents tend to be relevant to the same requests”. The hypothesis has been studied e.g. by Griffiths et al. (1997) in the context of clustering, but not in the context of a predefined classification system. Automatic classification remains an active research area: the proceedings from SIGIR 2002 include 7 papers in the fields clustering/classification, e.g. by Kawatani (2002).

The cluster hypothesis forms an important basis for the design of the Content-Sensitive Infrastructure: I propose that an effective retrieval system can be built using collections of metadata distributed on the basis of the topical property of the metadata. A query will find its result set in one of these clusters, which (according to the cluster hypothesis) should have an acceptable recall.

Automatic classification in a predefined classification system is not covered in the well-known textbooks on IR (e.g. Baeza-Yates and Ribeiro-Neto, 1999, Kowalski and Maybury, 2000, Korfhage, 1997). Also, papers describing results in this area are relatively recent.


For the rest of this section, “automatic classification” means classification in a predefined classification system.

2.4.2 Principles of automatic classification

Automatic classification can employ different theoretical models. This section will present two important models: The Vector Space model takes a vector representation of a document and of the alternative categories and calculates the similarity between the document and the categories using a cosine similarity function (Section 2.2.6). The Naïve Bayesian model estimates the probability that the document “belongs” to (is produced by) a given category. The mathematical method used is based on Bayes’ rule for conditional probabilities. Both models rely on specific representations of the document and alternative categories, which will be discussed below. Other models exist (e.g. the Support Vector Machine), but are not presented.

2.4.3 The Vector Space model

Automatic classification based on the Vector Space model is closely related to the Centroid-based retrieval model described in Section 2.2.6. In this model, the document and the alternative categories are represented as lists of (term, weight) pairs, and the similarities between the document and the categories are calculated using the cosine calculation discussed previously. The techniques of term weighting, stop word removal (Section 2.2.2), stemming (Section 2.2.3) and feature selection (Section 2.2.7) have similar applications as in retrieval operations. The result from a classification operation is a list of suggested categories, ranked by decreasing similarities to the document (based on the cosine calculation).

2.4.3.1 Making a category representation

Building a representation of a category is done relatively easily by concatenating a number of documents already known to belong to this class and building a document representation on this concatenated document using the term selection and weighting techniques previously described. A document collection already classified and used for this purpose


is called a training set, and the process of building category representations is often referred to as training the classifier. In the same manner as document terms were weighted according to a tf ⋅ idf scheme (section 2.2.4), statistics are needed on the frequency of terms in other categories. Classification schemes like the Naïve Bayesian classifier (described in section 2.4.4) also require a count of the number of documents in every category as a part of the training process. The training of a classifier is therefore done with all categories at the same time. A category representation is sometimes called a category centroid, since it can be viewed as an estimated description of all the documents belonging to a given category. The process of “server ranking” based on centroids, as described in section 2.6.4, closely resembles the categorisation algorithms that are described in the following section.
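The construction of category centroids and their use in classification can be sketched in Python as follows. This is a minimal illustration and not the classifier used in this thesis: the weighting formula (term frequency scaled by the logarithm of the inverse category frequency, plus one) is an assumption standing in for the tf ⋅ idf scheme, stop word removal and stemming are omitted, and the training data are invented.

```python
import math
from collections import Counter

def train_centroids(training_set):
    """training_set: list of (category, text). Returns {category: {term: weight}}."""
    term_counts = {}
    for category, text in training_set:
        # Concatenate all training documents of a category into one term count.
        term_counts.setdefault(category, Counter()).update(text.lower().split())
    n_categories = len(term_counts)
    category_freq = Counter()      # number of categories in which a term occurs
    for counts in term_counts.values():
        category_freq.update(counts.keys())
    centroids = {}
    for category, counts in term_counts.items():
        centroids[category] = {
            # Assumed weighting: terms shared by many categories weigh less.
            term: tf * (1.0 + math.log(n_categories / category_freq[term]))
            for term, tf in counts.items()
        }
    return centroids

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def classify(text, centroids):
    """Return all categories, ranked by decreasing similarity to the document."""
    doc = Counter(text.lower().split())
    scored = [(cosine(doc, centroid), cat) for cat, centroid in centroids.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [cat for _, cat in scored]

# Hypothetical pre-classified training set.
training_set = [
    ("religion", "church prayer faith"),
    ("religion", "faith scripture"),
    ("computing", "network protocol server"),
    ("computing", "server peer network"),
]
centroids = train_centroids(training_set)
suggestions = classify("peer network faith protocol", centroids)
```

Note that all categories must be known before any weight can be computed, which reflects the remark above that training is done with all categories at the same time.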

2.4.4 Naïve Bayesian model

Naïve Bayesian classification calculates the probability for a document to belong to a given category. It measures how often a term appears in different categories (from documents in the training set) and how often a given category appears in the training set. From these training set data it is possible to study the terms of a test document and see “how much” they belong to different categories using Bayes’ rule. Based on the frequency distribution of terms in the different classes from documents in the training set, as well as the frequency distribution of documents in the different classes, it is possible to use Bayes’ rule to compute the class that has the highest posterior probability for a particular document. $P(c_k \mid d_i)$, the posterior probability of class $c_k$ given a test document $d_i$, is computed using Bayes’ rule

$$P(c_k \mid d_i) = \frac{P(c_k)\,P(d_i \mid c_k)}{P(d_i)} \qquad \text{(Eq. 2-7)}$$

and $d_i$ is assigned to the class with the highest probability (thus $P(d_i)$ can be disregarded):

$$\text{Class of } d_i = \arg\max_{1 \le k \le N} \{ P(c_k \mid d_i) \} = \arg\max_{1 \le k \le N} \{ P(c_k)\,P(d_i \mid c_k) \} \qquad \text{(Eq. 2-8)}$$

where $N$ is the total number of classes. The term $P(c_k)$ can be computed by finding the fraction of training documents inside each class:

$$P(c_k) = \frac{\sum_{i=1}^{|D|} P(c_k \mid d_i)}{|D|} \qquad \text{(Eq. 2-9)}$$

where $D$ is the set of training documents and $|D|$ is the number of training documents in $D$. The term $P(d_i \mid c_k)$ expresses the probability for a given test document (not a training document) to appear under class $c_k$. It is possible to compute this if the frequency (or occurrence) of a term $d_{ij}$ in $d_i$ is assumed to be statistically independent of other terms in the same document. The probability then becomes the product of the probabilities for every term in that document:

$$P(d_i \mid c_k) = \prod_{j=1}^{m} P(d_{ij} \mid c_k) \qquad \text{(Eq. 2-10)}$$

The explanation in this section was inspired by the explanation given by Han and Karypis (2000).
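Equations 2-8 to 2-10 can be turned into a small Python sketch. Two details are assumptions that do not appear in the text above: the term probabilities are estimated with add-one (Laplace) smoothing so that an unseen term does not force the whole product to zero, and the product of Eq. 2-10 is evaluated as a sum of logarithms to avoid numerical underflow. The training data and function names are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train(training_set):
    """training_set: list of (category, text). Returns the counts the model needs."""
    class_docs = Counter()                 # number of training documents per class
    term_counts = defaultdict(Counter)     # term frequencies per class
    vocabulary = set()
    for category, text in training_set:
        terms = text.lower().split()
        class_docs[category] += 1
        term_counts[category].update(terms)
        vocabulary.update(terms)
    return class_docs, term_counts, vocabulary

def log_posteriors(text, model):
    """log( P(c_k) * P(d_i | c_k) ) for every class, i.e. Eq. 2-8 without P(d_i)."""
    class_docs, term_counts, vocabulary = model
    n_docs = sum(class_docs.values())
    scores = {}
    for category, n in class_docs.items():
        total = sum(term_counts[category].values())
        score = math.log(n / n_docs)       # P(c_k): fraction of training documents
        for term in text.lower().split():
            if term not in vocabulary:
                continue                   # terms never seen in training are ignored
            # Assumed add-one smoothing of P(d_ij | c_k).
            p = (term_counts[category][term] + 1) / (total + len(vocabulary))
            score += math.log(p)           # product of Eq. 2-10, in log space
        scores[category] = score
    return scores

def classify(text, model):
    scores = log_posteriors(text, model)
    return max(scores, key=scores.get)     # arg max over the classes (Eq. 2-8)

# Hypothetical pre-classified training set.
training_set = [
    ("religion", "church prayer faith"),
    ("religion", "faith scripture"),
    ("computing", "network protocol server"),
    ("computing", "server peer network"),
]
model = train(training_set)
label = classify("network server", model)
```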

2.4.5 Feature selection

The applications of feature selection are described in section 2.2.7. The benefit from using feature selection is smaller term vectors and thus reduced computational effort during similarity calculations. The effect of feature selection on classification accuracy has also been reported. Scott and Matvin (1998) report better accuracy in classification when using WordNet for replacing words with their common hypernyms. Bekkerman, on the other hand, could not see improvements using so-called distributional clusters when compared with a traditional bag-of-words approach on a text with relatively precise vocabulary (Bekkerman et al., 2001).


2.4.6 Some automatic classification research projects

Among research projects on automatic classification using a predefined set of categories, examples of the use of the techniques mentioned in Section 2.4.2 can be found, as well as combinations of them. Han and Karypis (2000) report that “Centroid-Based” document classification consistently and substantially outperforms the other tested approaches, among them the Naïve Bayesian approach. The training sets used include well-known corpora like TREC (from the Text REtrieval Conference) and Reuters-21578 (a set of news stories from Reuters). Their results show that the different approaches tested differ by significant margins, and that the number of categories in use is rather low (6-44). Also, the training sets included Reuters-21578 and OHSUMED-233445 (a subset of the MEDLINE database), which are known to contain specific and precise vocabularies.

Fukumoto and Suzuki (2001) report from a classification experiment using Support Vector Machines and term expansion using WordNet (replacing synonyms with synsets). Although the training set is the same as in Han and Karypis’ work (Reuters-21578), the number of categories is different (90). Fukumoto presents a table of comparable results on this corpus, which all lie in a narrow range of 83-86% correctly classified documents.

In an earlier paper, Chekuri et al. (1997) report from a classification experiment using vector calculation and a training set picked from the Yahoo! portal. By picking 2000 web pages from 20 top-level categories as a training set and another 500 as a test set, Chekuri obtains a pre-classified set of web pages. The results are presented using a “cutoff” value, a concept often associated with ranking: the classification algorithm results in an ordered list of categories, and a cutoff value of e.g. 3 will consider a classification successful if the correct category appears among the three best suggestions. Chekuri also discusses problems of picking significant terms from web documents of general nature, and the special problems associated with “spam-dexing”.

Two projects using Support Vector Machines, from Fukumoto and Bekkerman (Fukumoto and Suzuki, 2001, Bekkerman et al., 2001), both combine SVM-based classification with dimensionality reduction (see section 2.2.7): Fukumoto takes a WordNet-based approach to replace terms with their common hypernyms, while Bekkerman takes a statistical approach in order to cluster interdependent terms (see section 2.4.5).
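The “cutoff” evaluation described for the experiment by Chekuri et al. can be illustrated with a short Python sketch. The category names and the four classification results are invented for this example and are not taken from the cited paper.

```python
def success_at_cutoff(ranked_categories, correct, cutoff):
    """True if the correct category is among the `cutoff` best suggestions."""
    return correct in ranked_categories[:cutoff]

def accuracy_at_cutoff(results, cutoff):
    """Fraction of documents classified successfully at the given cutoff."""
    if not results:
        return 0.0
    hits = sum(success_at_cutoff(ranked, correct, cutoff)
               for ranked, correct in results)
    return hits / len(results)

# Hypothetical results: (ranked list of suggested categories, correct category).
results = [
    (["Science", "Health", "Arts"], "Science"),
    (["Arts", "Science", "Health"], "Health"),
    (["Health", "Arts", "Science"], "Arts"),
    (["Science", "Arts", "Health"], "Health"),
]
accuracy = {cutoff: accuracy_at_cutoff(results, cutoff) for cutoff in (1, 2, 3)}
```

The reported accuracy can never decrease as the cutoff grows, so results quoted at different cutoff values are not directly comparable.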


Toutanova et al. (2001) use Bayesian classification in the context of hierarchical classification. The Bayesian model is extended with the idea that a document is “produced” from terms of more than one category: all parent categories can contribute terms, as well as the document’s “home category”. This brings the discussion over to the branch of classification projects concerned with hierarchical classification, which is of particular interest to this thesis.

The motivation for using hierarchical classification systems varies between research projects. D’Alessio reports an improvement in running time by a factor of three in his experiment on the Reuters-21578 corpus (D'Alessio et al., 2000). Dumais and Chen (2000) report a small improvement in accuracy when classifying the Reuters corpus with an SVM-based classifier. Dumais and Chen point out the potential for disambiguating words that are associated with sub-categories. In a hierarchical approach, only top-level categories may be considered in a first step, before moving down to a sub-tree and thus disregarding possible categories in other branches. This divide-and-conquer approach has good effects on running time, as pointed out by D’Alessio. Another reason for using hierarchical classification systems is found in any library: hierarchical classification has been in successful use since the 19th century for organising large information volumes.

The problem with comparing the results from these projects has to do with different versions of the Reuters corpus (at least 5), and different measures when reporting the results. Yang presents a comprehensive comparison of 14 categorisation algorithms in (Yang, 1999), but does not consider any projects using a hierarchical classification schema. The lack of a common hierarchy for these projects adds to the difficulties of comparing the results.
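The divide-and-conquer idea behind hierarchical classification can be sketched in Python as follows. The sketch is an illustration under simplifying assumptions and does not reproduce any of the cited projects: categories are described by plain term sets, the similarity measure is a simple count of shared terms, and the small category tree is invented.

```python
def category(name, terms, children=()):
    """Build a node of a hypothetical category tree."""
    return {"name": name, "terms": set(terms), "children": list(children)}

def overlap(document_terms, category_terms):
    """Assumed similarity measure: the number of shared terms."""
    return len(document_terms & category_terms)

def classify_top_down(document_terms, root, score=overlap):
    """Descend from the root, following the best-scoring child at each level."""
    path = []
    node = root
    while node["children"]:
        # Only the children of the current node are compared; all categories
        # in other branches are disregarded from here on.
        node = max(node["children"],
                   key=lambda child: score(document_terms, child["terms"]))
        path.append(node["name"])
    return path

root = category("Root", [], [
    category("Religion", ["faith", "prayer", "scripture", "church", "mosque"], [
        category("Islam", ["mosque", "quran"]),
        category("Christianity", ["church", "bible"]),
    ]),
    category("Computing", ["network", "peer", "router", "database", "query"], [
        category("Networks", ["network", "peer", "router"]),
        category("Databases", ["database", "query", "index"]),
    ]),
])
path = classify_top_down({"peer", "router", "network"}, root)
```

An early wrong choice at the top level cannot be corrected further down, which is the price paid for the reduced number of comparisons.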

2.5 Distributed systems

2.5.1 Introduction

The theoretical foundations of Distributed Systems are of interest to this thesis, since the Content-Sensitive Infrastructure is a distributed system. This section gives a brief introduction to some key concepts in and motivation factors for distributed computing. The area of peer-to-peer computing is of particular interest to this


thesis, since the described resource discovery system (the Content-Sensitive Infrastructure) can be considered to be a peer-to-peer system.

2.5.2 Historical background and key concepts

The term Distributed Systems refers to computers connected by a communication network, which are co-operating in order to achieve a common goal. The presence of a common objective is a distinctive feature of a distributed system: a network of uncoordinated computers does not by itself constitute a distributed system (Coulouris et al., 2001). The study of distributed systems is firmly founded on the research areas of computer networks and operating systems, and modern textbooks in these areas usually give considerable attention to distributed systems (Galli, 2000, Bacon, 2000, Fongen, 2002).

The origins of the Internet date back to 1973, when the U.S. Defence Advanced Research Project Agency (DARPA) initiated a study on how to interlink packet networks. In 1986, the National Science Foundation (NSF) founded NSFNET, which even today forms an important backbone for the Internet. Several regional and national networks joined the Internet during the 1980s, e.g. NORDUNET and EUNET. In 1993, the Internet was opened for commercial use, the same year as the World Wide Web was invented and fuelled the public interest in the use of the Internet1.

The underlying theory of distributed systems (on issues like synchronisation, security, replica management, distributed agreements and transactions) was stable by 1985. Tanenbaum’s review of distributed operating systems (Tanenbaum and Renesse, 1985) outlines a theoretical approach to the field that is mainly unchanged today. Modern textbooks in the field (Tanenbaum and Steen, 2002, Coulouris et al., 2001) include additional chapters on code mobility, client mobility and multimedia services.

2.5.3 Motivation factors

The reasons for developing distributed applications (as opposed to centralised applications) are:
• Improving performance by aggregating the processing and networking capacity of several computers (scaleability).

1. For more information on Internet history see: http://www.isoc.org/internet/history/ [May 13. 2003]


• Improved resource utilisation by offering remote use of resources (storage, peripherals etc.)
• Reducing network bandwidth consumption through the separation of storage, processing and presentation tasks (e.g. passing a chart to the end-user instead of the entire data set, and validating user input before sending it to the processing equipment).
• Fault tolerance (and error masking) due to presence of redundant resources and data

2.5.4 Peer-to-peer distribution

A general description of distribution models will not be given, but rather a description of a model of particular relevance to the Content-Sensitive Infrastructure, namely peer-to-peer distribution (Figure 2-2). Distributed applications without central servers or authorities are hardly a new invention: systems as old as UUCP (Unix-to-Unix Copy) (RFC976, 1986)1 let peers of UNIX computers form networks for exchanging mail, files and offering remote execution.

Today’s use of the term “Peer-to-peer networking” (p2p) is somewhat vague. Schollmeier (2001) attempts to give a definition of the concept, with two distinguishing criteria:
• Participating nodes are both clients and servers (Servents)
• Shared resources are directly accessible by peers (without intermediaries)

On the basis of the properties found in the p2p projects referenced in this thesis, these distinct characteristics are found to be essential:
• No central server: members take on both the server and the client role, see Figure 2-2.
• Spontaneous community formation and zero configuration: the system is self-configuring and reacts automatically to arriving or leaving members (also called peer discovery)
• Relaxed consistency design: members arrive and leave frequently, and strong-consistency replication or causal/total ordering do not often apply

1. This RFC describes the mail message format of UUCP. The networking protocols are not defined in an RFC.


• Relaxed security needs: p2p systems raise lots of security concerns on member computers (trojan horses and other hostile code) and on network transport (authentication, privacy, availability).
• Scaleability: the design tolerates (or should tolerate) a high number of members and a large volume of information

Figure 2-2. Peer-to-peer configuration (peers labelled A to E)

Despite the papers presented in a recent issue of Communications of the ACM (Agre, 2003, Kubiatowitcz, 2003, Balakrishnan et al., 2003), as well as (Clark, 2002) and (Oram, 2001), there is a surprisingly small amount of research literature on p2p systems. This section will focus on the Chord (Stoica et al., 2001) and Content-Addressable Network (Ratnasamy et al., 2001) projects.

Apart from file-sharing (most often used for distribution of copyrighted material), there are few obvious applications for p2p networking. Distributed directories and lookup services seem to be good candidates for p2p design: in the case of a lookup service, the client has a key and looks for the associated information. The (key, value) pairs are easy to distribute/replicate among members in a network, and the correspondence between the key and a network route/network address can be established using different techniques.

The Domain Name System (DNS) is a well-proven distributed lookup service. Although not a p2p design, it has facilitated the study of the effects of distribution, caching and replication in a system of very large scale.

The Content-Addressable Network (CAN) (Ratnasamy et al., 2001) is a look-up service designed as a p2p system that maps the key space into a set of coordinates in a d-dimensional torus¹ through a hashing function. A CAN node occupies a portion of the key space, stores values for keys inside this portion and responds to queries.

1. A torus is a space where the coordinates wrap. A 2-dimensional torus can be visualised as a doughnut.


Values and queries are routed to this node by other CAN members by looking at the coordinates of the key. As new members enter the CAN, the portions are split between an existing node and the new node. Key/value pairs corresponding to the new node’s portion of the key space are handed over to it. As members leave the CAN, their stored key/value pairs are handed over to a neighbour expected to inherit the vacant key space. By “overloading” portions of the key space with more than one member (called peers), the CAN also provides for higher availability through replication of key/value pairs.

The Chord project (Stoica et al., 2001) provides a different solution to a similar problem: the key space is mapped onto a ring using consistent hashing, and the nodes of the system take different positions around the ring and take responsibility for keys from this position and on to the next node on the ring. In a ring of N keys, each node has pointers (called fingers) to log N nodes exponentially distributed along the ring (pointing to the first, second, 4th, 8th, 16th node and so on in the forward direction). When a key request is forwarded along the ring, the distribution of fingers makes the forwarding process resemble a binary search, where O(log N) hops are necessary in order to reach the destination.

Similar to the CAN, Chord offers a dynamic partitioning of the key space as nodes arrive and leave. But unlike CAN, the Chord project does not offer replication of key values. When a node crashes, the stored key/value pairs are invariably lost.

Both papers, but especially the CAN paper, give valuable insight into the tension between consistency and scaleability: both projects see the necessity of “consolidating” the network structure at regular intervals, and the CAN paper suggests that key/value pairs are “refreshed” from the outside client at regular intervals. Both these measures solve certain problems of uncorrected consistency, but at the price of reduced scaleability, since the necessary number of “refresh” messages grows in proportion to the total volume of data stored in the system.
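As an illustration only, the finger-based forwarding described above can be sketched in Python. The identifier width M, the node identifiers and all function names are assumptions made for this sketch and are not taken from the Chord paper. The sketch also keeps a global, sorted list of nodes for simplicity, whereas a real Chord node only knows its own fingers. Here a key is assigned to the first node at or after it on the ring, which is one possible convention.

```python
from bisect import bisect_left

M = 6                # bits in the identifier space (illustrative assumption)
RING = 2 ** M        # identifiers wrap around at 64


def successor(nodes, ident):
    """Return the first node at or after ident on the ring (nodes is sorted)."""
    i = bisect_left(nodes, ident % RING)
    return nodes[i % len(nodes)]


def fingers(nodes, n):
    """Finger i of node n is the successor of n + 2**i (distances 1, 2, 4, 8, ...)."""
    return [successor(nodes, n + 2 ** i) for i in range(M)]


def in_interval(x, a, b):
    """True if x lies in the ring interval (a, b], which may wrap past zero."""
    if a < b:
        return a < x <= b
    return x > a or x <= b


def lookup(nodes, start, key):
    """Route a request for key from node start; return (owner, hops)."""
    key %= RING
    n, hops = start, 0
    while True:
        if key == n:                       # this node owns its own identifier
            return n, hops
        succ = successor(nodes, n + 1)
        if in_interval(key, n, succ):      # the next node on the ring owns the key
            return succ, hops + 1
        # Otherwise jump to the farthest finger that does not overshoot the key.
        n = next(f for f in reversed(fingers(nodes, n)) if in_interval(f, n, key))
        hops += 1


# Hypothetical ring of ten nodes in a 6-bit identifier space.
nodes = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
print(lookup(nodes, 8, 54))   # (56, 3): forwarded via nodes 42 and 51
```

Each forwarding step uses the longest finger that still precedes the key, which is what gives the lookup its resemblance to a binary search.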

2.5.5 Service traders

One type of distributed system that has much in common with a resource discovery system is the service trader. In the area of distributed objects (Java RMI, OMG CORBA), there is a need for announcing and requesting services. The mediator between a service exporter (usually a server) and a service importer (a client) is called a service trader. A service trader has a database of registered interfaces and policies on how to process lookup operations. Interfaces are classified in a type hierarchy just like the classes in an object-oriented programming language. The rules for referencing specific interfaces with general references (found in most OO languages) also apply to service traders. It is thus possible to search for interfaces in a sub-tree of interface types.

Independent traders can cooperate by linking to each other and by processing/forwarding queries on behalf of others. Federated traders form a directed graph where the arcs denote the ability of one trader to access another. The architecture and mechanisms of a service trader are described by Bearman (1997). The criteria for matching a registered interface with a query are based on formal rules and not on lexical similarities.

Although traders and retrieval engines are conceptually related, there is surprisingly little contact between the two research areas. Tari and Craske (2000) suggest extensions to the semantics of service matches by adding informal information in interface specifications, to be used by humans, not programmed clients. They also suggest a framework for an ontology of such information, and a more flexible method for query propagation in a trader graph.

Distributed objects and web documents do not live isolated lives. The arrival of Web Services will move the two types of services closer together, and the same server will be likely to offer both. There seems to be an obvious need for search and retrieval technology that can trade both service requests and information queries. The project described in this thesis has several properties (e.g. the use of metadata) that make it a good candidate for this type of service integration.
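The matching on a sub-tree of interface types and the forwarding between federated traders described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the class Trader, its methods and the interface type names are hypothetical and do not correspond to any CORBA or Java RMI API.

```python
class Trader:
    """A toy service trader: type-hierarchy matching plus federation links."""

    def __init__(self, name, supertypes):
        self.name = name
        self.supertypes = supertypes   # interface type -> parent type (None at the root)
        self.offers = []               # registered (interface type, service reference)
        self.links = []                # other traders this trader can access

    def export(self, iface, ref):
        """Register a service offer under an interface type."""
        self.offers.append((iface, ref))

    def conforms(self, iface, wanted):
        """True if iface is the wanted type or one of its subtypes."""
        while iface is not None:
            if iface == wanted:
                return True
            iface = self.supertypes.get(iface)
        return False

    def lookup(self, wanted, visited=None):
        """Match local offers, then forward the query along the trader graph."""
        visited = set() if visited is None else visited
        if self.name in visited:       # the graph may contain cycles
            return []
        visited.add(self.name)
        found = [ref for iface, ref in self.offers if self.conforms(iface, wanted)]
        for other in self.links:
            found += other.lookup(wanted, visited)
        return found


# Hypothetical type hierarchy: ColourPrinter is a subtype of Printer.
types = {"Printer": None, "ColourPrinter": "Printer", "Scanner": None}
t1, t2 = Trader("t1", types), Trader("t2", types)
t1.links, t2.links = [t2], [t1]
t1.export("ColourPrinter", "svc-a")
t2.export("Printer", "svc-b")
print(t1.lookup("Printer"))   # ['svc-a', 'svc-b']
```

Note that matching is decided purely by the formal type relation, not by any lexical similarity between names.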

2.5.6 Summary and conclusion

This section has given a short presentation of the basic concepts of distributed systems. The architecture of peer-to-peer networking has been given most attention, because the Content-Sensitive Infrastructure (presented in this thesis) builds on the same principles.


2.6 Distributed information retrieval

2.6.1 Introduction

The term “distributed information retrieval” (DIR) refers to a large class of systems characterised by the utilisation of several retrieval engines being coordinated through a network protocol. The reasons for choosing a distributed approach to information retrieval can be:

• A desire to federate existing, independent document collections (Meta-searchers, Section 2.6.3)
• To overcome scaleability problems associated with a centralised design
• To provide fault-tolerant services
• To reduce network traffic through storing of information closer to the user (exploiting locality)

For the purpose of this discussion, the research projects in distributed IR can be divided into three categories:

1. Meta-searchers that multicast queries to several independent collections
2. Distributed document collections using forward knowledge to route queries
3. Co-operative caching mechanisms that distribute content according to client requests

Some of these projects are not IR in the traditional sense, but rather content distribution services. They will still be mentioned here and the design commented upon where appropriate, since they address the scaleability problem of resource discovery.

2.6.2 A taxonomy of DIR projects In Figure 2-3 the distributed information retrieval projects discussed in this section are presented in a taxonomy.


Figure 2-3. A taxonomy of distributed information retrieval projects. Root: Distributed Information Retrieval. Branches: Distributed Collections, Meta-searchers, Moving content. Sub-branches: Dynamic forward knowledge, Static forward knowledge, Co-operative caching, Content Distribution Networks, p2p lookup. Projects, in the order shown: WHOIS++, Harvest, MIDS, Q-Pilot; EOSDIS, Content-Sensitive Infrastructure; SPREAD; Akamai, Exodus; CAN, Chord, Freenet.

2.6.3 Meta-searchers

Meta-searchers attempt to increase the recall of queries by joining the result sets from several retrieval engines. Today’s meta-searchers (www.metacrawler.com, www.dogpile.com)¹ are seen as “umbrellas” over popular search engines. They multicast identical queries to several independent search engines and process the accumulated result set (removing duplicates, ranking results) upon return, see Figure 2-4.

1. For more meta-searchers see: http://dmoz.org/Computers/Internet/Searching/Metasearch/ [May 13. 2003]

Figure 2-4. Architecture of a meta-searcher (the query client sends a query to the meta-searcher over a common protocol; the meta-searcher multicasts the query to the document collections, co-ordinates their responses and returns a co-ordinated result set)

In order to facilitate meta-searching and the development of universal search client programs, a common client-server protocol for retrieval operations is needed. As pointed out in section 2.2.8, there exist standardised protocols that describe both the syntax of the queries and the networking protocols between the client and the retrieval engine. These standards could apply to meta-searchers as well, but are not used on the Internet today: the Z39.50 Register of Implementors¹ does not include any of the important search engines, so the major meta-searchers (www.metacrawler.com, www.dogpile.com) are assumed to deal with each search engine’s syntax separately.

Although they take a distributed approach, meta-searchers do not improve scaleability, since they replicate queries rather than distribute them. They increase the network traffic in order to improve the recall of the result set. Research on meta-searchers, e.g. (Sander-Beuermann and Schomburg, 1998), mostly deals with problems related to local processing of result sets and is of little interest to the project described in this thesis.
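The multicast-and-merge behaviour of a meta-searcher can be sketched as follows. This is an illustration under stated assumptions: each search engine is modelled as a hypothetical Python callable returning (URL, score) pairs, and scores from different engines are assumed to be comparable, which is rarely true in practice.

```python
def meta_search(query, engines, limit=10):
    """Send the same query to every engine, drop duplicates, rank by best score."""
    best = {}
    for engine in engines:                    # one copy of the query per engine
        for url, score in engine(query):
            if url not in best or score > best[url]:
                best[url] = score             # keep the highest score per URL
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Two hypothetical engines with one overlapping result.
def engine_a(query):
    return [("http://a.example/1", 0.9), ("http://shared.example/x", 0.4)]


def engine_b(query):
    return [("http://shared.example/x", 0.7), ("http://b.example/2", 0.2)]


print(meta_search("scientology", [engine_a, engine_b]))
```

The loop makes the scaleability argument visible: every query costs one request per engine, so the total query load grows with the number of engines instead of being divided among them.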

1. See: http://lcweb.loc.gov/z3950/agency/register/entries.html [May 13. 2003]

2.6.4 Using forward knowledge

A more scaleable approach than meta-searching is to pass queries to a subset of document collections expected to have relevant documents. The process of selecting document collections is called collection selection. Collection selection is the responsibility of the query router and can take place on the basis of two information resources:

• Information about the query: the textual content of the query can always be analysed. The intent of the query can also be investigated (through personal profiles, previous queries etc.).
• Information about the document collections: the characteristics of the document collections in the retrieval engines must be known. The knowledge needed about the document collection in order to make forwarding decisions is called forward knowledge.

Figure 2-5. Distributing queries on the basis of forward knowledge (the query client sends a query to the query router; the query router distributes the query to document collections on the basis of the forward knowledge it holds about them, co-ordinates the responses and returns a co-ordinated result set)

Since the characteristics of a document collection are likely to change over time, the query router needs to have its forward knowledge updated. This flow of data is indicated in Figure 2-5 by an arrow pointing to the query router from the document collection. The forward knowledge can be generated by the retrieval engine (possessing the document collection) or by the query router.

The tasks of collecting or disseminating updated forward knowledge can use several strategies: a popular approach is the use of centroids (Salton et al., 1975), explained in section 2.2.6. A centroid used in this manner represents the lexical characteristics of an entire document collection, not only a single document. The query router can calculate the similarities between the query and the document collections by the cosine function described in section 2.2.6.
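Centroid-based collection selection can be sketched as follows. This is a minimal illustration, not the method of any system cited here: raw term frequencies are used as weights (which may differ from the weighting described in section 2.2.6), and the function names, the threshold value and the sample collections are assumptions.

```python
import math
from collections import Counter


def centroid(documents):
    """Average term-frequency vector over all documents of one collection."""
    total = Counter()
    for text in documents:
        total.update(text.lower().split())
    return {term: count / len(documents) for term, count in total.items()}


def cosine(a, b):
    """Cosine of the angle between two sparse term vectors (dicts)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_collections(query, centroids, threshold=0.1):
    """Rank collections by similarity to the query; keep those above threshold."""
    query_vector = Counter(query.lower().split())
    scored = [(cosine(query_vector, c), name) for name, c in centroids.items()]
    return [name for score, name in sorted(scored, reverse=True) if score >= threshold]


# Forward knowledge held by a hypothetical query router: one centroid per collection.
forward_knowledge = {
    "religion": centroid(["islam poetry religion", "religion history"]),
    "sport": centroid(["football match", "football league table"]),
}
print(select_collections("religion poetry", forward_knowledge))   # ['religion']
```

Only the centroids travel to the query router, so the volume of forward knowledge depends on the vocabulary of each collection rather than on the number of documents in it.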


Forward knowledge generated by the retrieval engine: The centroid is in this case calculated by the retrieval engine and passed to interested parties (in the opposite direction of the queries). An unweighted variant of centroids is used in WHOIS++ (RFC1913, 1996) and other systems that will be described later in this chapter. A generalisation of centroids can be found as content labels in Sheldon’s Ph.D. thesis (Sheldon, 1996), where forward knowledge may include queries that are “true” for every document in the collection. Yet another model is the use of inference networks and probabilistic calculations (Callan, Lu and Croft, 1995) based on similar statistics gathered from the collections.

Forward knowledge generated by the query router: In this case the query router may interrogate the retrieval engine. This can be done by:

• fetching web pages from the web server that hosts the retrieval engine (where applicable)
• making “probe queries” to the retrieval engine and analysing the returned result set

Q-Pilot (Sugiura and Etzioni, 2000) is an example of this type of system. Q-Pilot performs “pre-classification” of collections during a training phase where the search engine’s “home page” and adjacent pages are analysed for word occurrences. Systems of this type cooperate loosely with the individual collections; they usually offer a web interface where the user clicks on a hyperlink that either brings the web client to the front page of the collection, or forwards the query using the collection’s local query syntax. Such systems mostly deal with the problem of query analysis and query classification, and often report their performance as “recall divided by number of servers visited” (Dolin et al., 1997, Hawking and Thistlewaite, 1999, French et al., 1999, Rasolofo et al., 2001). They do not discuss scaleability problems and their work has limited relevance to the study of distribution models presented in this thesis.
The scaleability properties of forward knowledge are largely unknown. There is little discussion to be found on the volume (and rate) of forward knowledge that needs to be transported across the network, or the effect of forward knowledge on the distribution of queries. This is a core issue for the project presented in this thesis, and will be addressed in the next chapter.


2.6.4.1 The effect of forward knowledge on retrieval effectiveness

The idea of using forward knowledge for selecting a subset of document collections for query processing is based on an assumption that the document collections are not completely heterogeneous, i.e. the documents relevant to a query are not scattered evenly across every collection but reside in a small number of collections.

Xu and Croft (1999) point to the need for topicality in document collections, i.e. that the documents in a collection are relatively similar. They base this claim on the Cluster Hypothesis (Jardine and van Rijsbergen, 1971), which states that “Closely associated documents tend to be relevant to the same requests”. Xu and Croft’s work presents experiments where different clustering algorithms were applied to documents and the documents were distributed across collections as clusters. The experimental results show that the use of “clustered” collections improves the retrieval effectiveness.

2.6.4.2 Distributed IR projects based on use of forward knowledge

Harvest (Bowman et al., 1995) was designed for scaleability and customisation through the separation of gatherers, which are responsible for the acquisition of information, and brokers, which are responsible for collection, index generation and dissemination of that information. Gatherers run at information provider sites and transmit the acquired information back to one or more brokers using Summary Object Interchange Format (SOIF). Brokers interact with one or more gatherers for initial acquisition and with other brokers where useful to filter information already collected by those brokers. Brokers provide query interfaces to the gathered information through the use of GLIMPSE (Global IMPlicit SEarch). In the hierarchy of brokers, there is a central broker which is mainly responsible for finding suitable brokers to send queries to. Similar to WAIS (RFC1580, 1994, RFC1625, 1994), this partially centralised means of searching the hierarchical tree structure is not a convincing solution with respect to scaleability¹.

1. This description of Harvest is taken from personal communication with Dr. Ian Ferguson, Strathclyde University, Glasgow, UK

MIDS - MITRE Information Discovery System (Helm et al., 1996) adds components and services to the Harvest architecture so that the forward knowledge can be classified. A new service between the Gatherer and Broker, called Gatherer Dissemination Service (GDS), classifies the forward knowledge into a fixed set of topics and disseminates the information to those brokers who have registered their interest in that particular topic. The Broker is extended to allow the user to look for documents within specific categories. The classification algorithm used in MIDS is not described in detail, but it seems to be more primitive than those that will be described separately later in this chapter.

The WHOIS++ service forms the basis for many centroid-based¹ query forwarding services. WHOIS++ organises the retrieval engines in a mesh (a directed acyclic graph, shown in Figure 2-6), and queries are passed along the arcs of this graph if they are similar (above some threshold) to the centroid representing the adjacent node. Queries can be forwarded in several hops. WHOIS++ is described in (RFC1913, 1996), where the details of the message protocol are also shown.

Figure 2-6. A network of WHOIS++ servers organised as a directed graph (queries flow in one direction along the arcs between WHOIS++ servers; forward knowledge flows in the opposite direction)
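The multi-hop forwarding over a mesh of servers can be sketched as follows. The sketch is illustrative only and does not follow the WHOIS++ message protocol: the class Server, the representation of an unweighted centroid as a plain set of terms, the aggregation of centroids along the arcs, and the similarity measure (the fraction of query terms present in the centroid) are all assumptions.

```python
class Server:
    """A toy server in a mesh; arcs point to the servers it can forward to."""

    def __init__(self, name, records=()):
        self.name = name
        self.records = list(records)   # local records as plain strings
        self.children = []             # adjacent servers along outgoing arcs

    def centroid(self):
        """Unweighted centroid: every term found here or further along the arcs."""
        terms = {t for record in self.records for t in record.lower().split()}
        for child in self.children:
            terms |= child.centroid()
        return terms

    def search(self, query, threshold=1.0, visited=None):
        """Return local matches plus matches from sufficiently similar neighbours."""
        visited = set() if visited is None else visited
        visited.add(self.name)
        wanted = set(query.lower().split())
        if not wanted:
            return []
        hits = [r for r in self.records if wanted <= set(r.lower().split())]
        for child in self.children:
            similarity = len(wanted & child.centroid()) / len(wanted)
            if child.name not in visited and similarity >= threshold:
                hits += child.search(query, threshold, visited)
        return hits


# Hypothetical mesh: root -> a, root -> b, and both a and b -> c.
root, a = Server("root"), Server("a", ["alice smith oslo"])
b, c = Server("b", ["bob jones london"]), Server("c", ["carol smith london"])
root.children, a.children, b.children = [a, b], [c], [c]
print(root.search("smith london"))   # ['carol smith london']
```

Passing a set as the visited argument records which servers a query actually reached, which is one way to observe how well the centroids prune the search.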

1. As mentioned earlier in the section, WHOIS++ uses unweighted centroids, which are different from Salton’s concept.

The idea behind WHOIS++ (and related systems) is to federate independent document collections. No transfer of documents between collections takes place, nor are any adaptive measures taken to balance workload or to increase fault tolerance. Distribution takes place on the basis of the characteristics of the document collection only.

Recent reports on Peer-to-Peer Information Retrieval (Klampanos and Jose, 2003, Bawa et al., 2003) describe schemes quite similar to WHOIS++. They assume that the collections are “specific” and that each collection contains information relevant only to a small fraction of queries. By making this assumption the volume of forward knowledge can be limited.

Variants of this architecture include the Isaac Network (Lukas and Roszkowski, 1999), which offers an LDAP (Lightweight Directory Access Protocol) (RFC2251, 1997) interface to a centroid-based service. Centroids are exchanged between servers using a “Common Indexing Protocol” (CIP). The Isaac Network stores metadata, not full documents, which makes the LDAP architecture more suitable for a search service; LDAP is designed to handle directory data and is not expected to handle large documents or multimedia content efficiently. The query forwarding mechanism relies on the use of LDAP referrals. Referrals are used in order to redirect a client to a different server so that it re-issues the query there. In contrast to query routing, redirection forces the client to send new copies of the query to alternate receivers. It is thus possible that the client becomes a bottleneck during the query processing. An interesting feature of the Isaac Network is that any LDAP browser can be used as a client for search operations. As in any other LDAP application, the servers need to agree on a schema in order to cooperate on the Isaac Network. Some of the base attributes of the Dublin Core (DublinCore, 1995) attribute set are used in this schema.

Yet another variant of a centroid-based service is the WHERE project, where the use of term weights improves the precision and economy of the query forwarding mechanism (Rio et al., 1997): the use of term weights enables a client to obtain a better ranked list of document collections, and to visit only the “best” collections for the query.
Ranking a list of candidate document collections for the purpose of collection selection is also the object of Yuwono’s work (Yuwono and Lee, 1997), where a weighted list of key terms is sent from a retrieval engine to a client. Yet another early variant of forward knowledge is the use of “Content Labels” (Sheldon, 1996) in a context where query broadening and query refinement are offered.

Judging from the year of publication of the projects mentioned, it appears that centroid-based server selection schemes no longer attract much academic interest, while the “client interrogation schemes”, e.g. (Rasolofo et al., 2001), are still being investigated.

2.6.4.3 Projects using Distributed Metadata

It is possible to conduct search operations on metadata collections rather than document collections. Metadata (Section 2.3.3.2) used as document representations offers several advantages during search operations, since metadata can contain information about a document not necessarily present in the document text, e.g. formal status, price, client software requirements etc. In addition, metadata can represent general web resources (e.g. a programmed service) in addition to documents. Also, metadata (due to its small size) can be replicated on several retrieval engines or moved between engines for any reason. Centroids can well be regarded as an item of metadata describing a document collection.

The use of distributed metadata is often found in domain-specific information systems where metadata is replicated rather than distributed (e.g. the Association for Geographic Information’s GIgateway¹). Although the MIDS project mentions “metadata” (Helm et al., 1996), this aspect of the project is not described.

The EOSDIS architecture (Hinds, 1998) uses metadata and distributes “metadata referrals” between participating servers and the client resolver. The system classifies referrals in a system called “centroid hierarchy” (topics). Queries are routed to metadata servers on the basis of a server ranking algorithm. The system successfully exploits the use of caches, since the metadata seldom changes once it has been added to the system. Hinds’ research is interesting because it takes some new approaches to traditional problems, which is possible since this is a domain-specific system (earth observations) where some general problems can be disregarded: the nature of the data and its usage pattern are known in advance.

2.6.5 Content Distribution

An alternative to query routing is content distribution. One can obtain a better network economy and lower response times by moving the information content (documents or resource representations) towards the user using a speculative approach.

1. See: http://www.gigateway.org.uk/default.asp [May 13. 2003]


In order to move content, two requirements must be met:

1. The content must be re-constructable anywhere. In practice, only the content of static documents can meet this requirement. Dynamic documents that are the result of a programmed service (program output) can in general not be moved.
2. The reference to the resource must be location-independent, i.e. the same reference must be able to find the resource regardless of its present location.

Note that the use of metadata (Section 2.3.3.2) can relax the first requirement: metadata can be static documents while representing dynamic documents (or web services). An architecture for moving metadata content can thus also accommodate programmed services. When resources are referenced through the use of URLs (Uniform Resource Locators), the second requirement can be met by modifying the DNS (Domain Name System) as explained in section 2.6.5.2.

For the following discussion, this branch of research will be divided into three categories:

• co-operative caching
• content distribution networks
• peer-to-peer file sharing

2.6.5.1 Cooperative caching

One method of transferring content towards the client is caching. Caching speculates on the chance that the same content will be requested again from the same user (or group of users) in the near future. A well-known implementation of caching is the web proxy mechanism. In this case, retrieved web content passes through a series of computers (proxies) where it is stored for a while. Later retrieval operations from this user (or other users following the same network path) are intercepted by the proxies, and if there is a valid cached copy of the addressed content in the proxy, then this content is returned. Several proxy servers may be configured to cooperate, i.e. they form a chain between client and server, and every proxy performs the same operation of checking against local content.


Since a web page does not announce when it is updated or deleted, there is an issue of ensuring cache validity. On the World Wide Web, a proxy server seldom takes the chance of returning content that is out of date, so it normally contacts the web server in order to check if the locally stored copy is still valid (based on timestamps). Therefore, a normally configured web proxy does not save round-trip delay, only transfer delay.

In order for proxy servers to cooperate along other paths than that between the client and the web server, the Internet Cache Protocol (RFC2186, 1997) allows proxies to exchange information about the content of their caches (by requests and “hit/miss” responses) in order to pick the best candidate for serving a client request.

The SPREAD project (Rodriguez and Sibal, 2000) takes a “translucent” approach to web proxies, since it attempts to configure the proxy servers so that they intercept all IP traffic (almost like a firewall) and recognise IP packets with HTTP transactions in them. A proxy server can be set up anywhere on the network without any configuration and does not require any knowledge about the other proxy servers in the network. A SPREAD proxy server can take four approaches to ensuring cache validity:

1. Checking the validity of a cached copy for every request
2. Relying on a “time-to-live” value associated with the web page
3. The proxy can subscribe to “invalidate” messages from the server, indicating that the web page needs to be re-fetched
4. The proxy can subscribe to “replications”, whereby the server pushes the updated content of the web page when necessary

The principle of dynamic IP routing may cause the proxy server to see only parts of the HTTP communication. SPREAD therefore requires that the proxies set up IP tunnels between themselves by exploiting some esoteric features in the TCP protocol (called TPOT).

The SPREAD project has looked into the frequency of access vs. frequency of updates for different types of web pages. Rodriguez and Sibal report this distribution to be Gaussian with the peak between 10 and 100 accesses per update.

Although co-operative caching is not distributed IR in the normal sense of the word, it contributes to the knowledge on the distribution of content, rather than of queries.
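The difference between the first two validity strategies listed above (checking on every request versus relying on a time-to-live value) can be sketched as follows. This is a simplified illustration and not the SPREAD implementation: the classes Origin and Proxy, the version counter used in place of timestamps, and the explicit now parameter are assumptions made for the sketch.

```python
class Origin:
    """A toy web server that counts how often proxies contact it."""

    def __init__(self):
        self.pages = {}        # url -> (version, body)
        self.contacts = 0      # number of round trips made to this server

    def put(self, url, body):
        version = self.pages.get(url, (0, None))[0] + 1
        self.pages[url] = (version, body)

    def version(self, url):    # cheap validity check, still one round trip
        self.contacts += 1
        return self.pages[url][0]

    def fetch(self, url):      # full transfer of the page
        self.contacts += 1
        return self.pages[url]


class Proxy:
    """Caches pages; ttl=None means validate against the origin on every request."""

    def __init__(self, origin, ttl=None):
        self.origin, self.ttl, self.cache = origin, ttl, {}

    def get(self, url, now):
        entry = self.cache.get(url)            # (version, body, time stored)
        if entry is not None:
            version, body, stored = entry
            if self.ttl is not None:
                if now - stored < self.ttl:    # strategy 2: trust the copy until expiry
                    return body
            elif self.origin.version(url) == version:   # strategy 1: always check
                return body
        version, body = self.origin.fetch(url)
        self.cache[url] = (version, body, now)
        return body
```

With ttl=None every request costs at least one round trip to the origin, which matches the observation that a normally configured proxy saves transfer delay but not round-trip delay. With a time-to-live the round trip is saved, at the risk of returning a stale copy until the entry expires.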

2.6.5.2 Content distribution network

A less transparent approach to moving content towards the users is the content distribution network (CDN), whereby updated web pages are replicated through a network of intermediate storage nodes, initiated by the agent that controls the updating process of the web pages (Kurose and Ross, 2003). The CDN nodes behave like replicated web servers, not like cache servers, since they do not need to check the web servers to ensure consistency with the original web pages. The consistency of the replicas is the responsibility of the web servers, which need to be made “CDN-aware”.

Figure 2-7. CDNs use DNS to direct requests to nearby CDN server (from Kurose & Ross, 2003, p.165). Steps shown: (1) HTTP request for www.foo.com/sports/sports.html to the origin server; (2) DNS query for www.cdn.com to the CDN’s authoritative DNS server; (3) HTTP request for www.cdn.com/www.foo.com/sports/ruth.gif to the nearby CDN server.

In order to redirect the clients to the nearest CDN node, the CDN has to cooperate with the Domain Name System (DNS). When a web client is initiating a DNS lookup for a host served by a CDN service, the DNS will not return the same IP address for every client request, but rather the IP address of the CDN node “nearest” (by some topological metric) to the client (Figure 2-7). In this way, the CDN becomes invisible to a web client or a web proxy, but the scheme requires specialised software to run in the DNS servers responsible for the names used by the CDN nodes.
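The DNS-based redirection can be sketched as follows. This is a toy model of the decision made by the CDN’s authoritative DNS server, not the interface of any real DNS software: the function resolve, the cost table standing in for the topological metric, the client network labels and the IP addresses (taken from the ranges reserved for documentation) are all assumptions.

```python
def resolve(hostname, client_network, cdn_nodes, cost):
    """Answer a DNS query with the address of the cheapest CDN node for this client."""
    candidates = cdn_nodes[hostname]
    return min(candidates, key=lambda address: cost[(client_network, address)])


# Hypothetical CDN with two nodes serving one host name.
CDN_NODES = {"www.cdn.com": ["192.0.2.10", "198.51.100.20"]}

# Hypothetical topological metric (e.g. hop count) from client network to node.
COST = {
    ("oslo", "192.0.2.10"): 3,
    ("oslo", "198.51.100.20"): 12,
    ("tokyo", "192.0.2.10"): 14,
    ("tokyo", "198.51.100.20"): 2,
}

print(resolve("www.cdn.com", "oslo", CDN_NODES, COST))    # 192.0.2.10
print(resolve("www.cdn.com", "tokyo", CDN_NODES, COST))   # 198.51.100.20
```

The same host name thus resolves to different addresses for different clients, which is why neither the web client nor a web proxy needs to be aware of the CDN.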


Content distribution networks are found as commercial Internet services, often in the form of industrial-grade web hosting. CDN services are offered by companies like Akamai (www.akamai.com) and Exodus (www.exodus.net).

2.6.5.3 Peer-to-peer file sharing

Peer-to-peer (p2p) networking in general is discussed in section 2.5.4, so at this point only the use of p2p networking for content distribution will be mentioned. In the same manner as content distribution networks, p2p distribution can offer content distribution based on a combination of topological information (e.g. distance/cost between nodes) and request patterns.

P2p projects are often limited to offering file sharing. They often do so by distributing queries to nodes expected to know about files of interest. The queries can be centrally processed as in Napster (www.napster.com), where every file offered in the network is registered in a central directory, while the content resides on any node in the distributed application. At the other extreme, the queries can be flooded so that they visit as many nodes as possible, as in the case of Gnutella: every node that receives a query will look for matching files in its local repository and respond to the requester accordingly. Gnutella has a very simple design and reportedly poor performance (Ripeanu, 2001, Ritter, 2003)¹, but nevertheless became very popular before it was made obsolete by more mature projects like Kazaa (www.kazaa.com).

Freenet (Clarke, 1999) is a p2p project taking a different approach: along the lines of CDN, several copies of the information content may be replicated throughout the p2p network. A Freenet node will react to a request by looking into its local repository, and then forward the request to every known neighbour. In this way the network is searched in a breadth-first traversal. If data matching the request’s key² is found, the data is forwarded along the reverse path of the request, and the search is terminated. Copies of the data are stored in the nodes along this path for a finite period of time, thus making the Freenet nodes behave like proxy servers. Also, the nodes will store a reference to the requesting node and thus increase the number of “neighbour links”.

1. Jordan Ritter is one of the founding authors of Napster.
2. URLs are not used in Freenet, but keys, similar to what is used in a distributed lookup service (section 2.5.4)


Cache consistency is not addressed as a problem in Freenet; data is assumed not to change while stored under a key. Updated data is expected to use a new key.

2.6.6 Conclusion The area of distributed information retrieval has been discussed separately from information retrieval in general. Distributed information retrieval can follow several distribution strategies:
1. Distribute queries based on knowledge about the characteristics of the document collections (forward knowledge). This knowledge can be generated and distributed by the collection, or generated by the query router based on the study of previous responses.
2. Distribute data closer to the requesting client (content distribution) based on speculation about the client’s future queries.
The Content-Sensitive Infrastructure (presented in subsequent chapters) is able to exploit both strategies in order to improve scaleability and network economy.

2.7 Conclusion This chapter has presented some principles and research projects inside the fields of Information Retrieval, Internet Search Engines, Automatic Classification, Distributed Systems and Distributed Information Retrieval. The necessary support for the claims has been found in the course of the survey:
• There are scaleability problems with Internet Search Engines (section 2.3.3).
• The potential of Metadata has not been exploited in search engines, e.g. the potential unification of document and service representation (section 2.3.3.2).
• Automatic classification using the Vector model for document and category representation (also called the Centroid approach) is reportedly performing better than other techniques (section 2.4.6).
• Peer-to-peer distribution models have been used for building distributed look-up systems, but few p2p-based retrieval systems have been found (section 2.5.4).
• The scaleability of existing distributed retrieval systems has not been thoroughly examined (section 2.6.4).
The next chapter will present a formal analysis of the scaleability properties of distributed information retrieval under different architectural models.


3

The problem of scaleability in distributed resource discovery

3.1 Introduction The purpose of this chapter is to provide an analytical framework for the evaluation of scaleability in distributed resource discovery (RD) systems. The scaleability problems with existing projects indicated in the literature survey will be identified using this framework. The term “scaleability” refers to a system’s ability to scale, as explained in (Coulouris et al., 2001 p.19): The system should remain effective when there is a significant increase in the number of resources and the number of users. It is of interest to see how an increasing number of computers can share the increasing communication and storage requirements of a resource discovery system in a large scale. Some of the research projects discussed in the previous chapter have a similar goal to the project presented in this thesis: the provision of scaleable solutions to Resource Discovery. These projects often claim to be scaleable. The property of scaleability is a common argument for distribution, but is seldom a subject for a thorough analysis or discussion. In this chapter a new framework for comparing and evaluating the scaleability properties in systems dealing with distributed RD will be presented. Statistical research results from analysis of Internet information lifetime will be applied. The conclusion from the chapter will be that existing resource discovery systems do not have the necessary scaleability when used under conditions found on the Internet.

3.2 The elements of a search engine From an architectural viewpoint, search engines are not complicated. It is possible to compare alternative designs using an evaluation framework based on analysis of the information flows between components. The components (shown in Figure 3-1) are:

Indexing Agents (IA) - Entities which take web documents as input (in general, any textual information) and produce a condensed representation of the documents that is useful for searching operations. Indexing agents also have responsibility for the crawling operation and revisiting web documents for updating indices.

Query Processors (QP) - Entities which store index information received from the IA about a set of documents (called the index collection), and offer a facility for processing queries on this collection in order to produce a result set. In contrast to many “traditional” information retrieval systems, the query processor of an Internet search engine is not required to store documents, only indices and resource pointers (e.g. URLs).

Query Router (QR) - An entity that inspects a query and routes it to the query processors expected to produce a good result set. The Query Router relies on knowledge about the content of the alternative query processors in order to offer this service. This knowledge is called Forward Knowledge (see section 2.6.4).

Figure 3-1. The components of a search engine [Diagram not reproduced; it shows Query Routers (QR), Query Processors (QP) and Indexing Agents (IA), with flows labelled Queries, Result set, Forward knowledge, Index and Web documents.]

3.3 The flows of data in a search engine The resource of greatest scarcity when applications are distributed over the Internet is the networking capacity: In addition to the “last mile problem” and “first mile problem”1, the so-called “peering points” between the Internet providers have been identified as serious bottlenecks (Akamai, 2000). The chosen perspective is therefore to study the message complexity of the different alternatives, i.e. the number of messages sent or received by any node as a function of fulltext volume, index volume, query volume and metadata volume. These four terms are discussed below. The message complexity is presented using the familiar “big-Oh”-notation (Weiss, 1999 p.41).

1. The “first” and “last” mile refer to the access network at each end of a transport connection. These parts of the network are often underdimensioned and often become bottlenecks.

3.3.1 The Fulltext volume Any search engine using full-text indices (indices representing the full body of text) will need an automatic indexing mechanism. An agent that generates indices (Indexing Agent, IA) will need to fetch a given web page periodically in case it has changed content or has become unavailable. The flow of full-text documents into the indexing agent is called the fulltext volume. In order to estimate the size of the fulltext volume, some statistical information about web page changes is needed. The results presented by Brewington and Cybenko (2000) provide this information for the discussion to follow: A presentation of web page lifetime distribution as a cumulative distribution function (CDF) estimates the fraction of web pages that change content during a given time period (Figure 3-2b). The median value is 117 days, i.e. half of the web pages have an average lifetime of less than 117 days.

Figure 3-2. Estimated distribution of web page lifetime (a) Probability Density Function (PDF) and (b) Cumulative Distribution Function (CDF) [Plots not reproduced; horizontal axes: mean lifetime (days); vertical axes: probability density × 10^-3 in (a) and cumulative probability in (b).]

Furthermore, this information can be extended to estimate the inspection interval necessary for web pages in order to have an index at most β days out-of-date with the probability α. Brewington and Cybenko denote this property (α,β)-currency, and the inspection interval necessary in order to ensure this property can be used to calculate the necessary data rate to the indexing agent. An example of Brewington and Cybenko’s results is that a page needs to be fetched every 18th day if the index should have a 0.95-probability that it is at most one week out-of-date, i.e. (0.95,1 week)-currency. This period decreases to 8.5 days if the index should be less than one day out-of-date with the same probability. With a total web page volume of 800 million pages (estimated in 1999)1, each having an average size of 12 kBytes (numbers indicated by Brewington and Cybenko), a re-indexing period of 18 days requires the transport of 533 Gigabytes per day into the index-generating agent if it attempts to index the entire web! These numbers are based on a model where all web pages are re-fetched at the same interval. A model where some pages are re-fetched more often than others (based on estimated importance or update frequency) is a complex optimisation problem investigated by Arasu et al. (2001). As stated in section 2.3.3, the number of web pages is believed to currently be growing exponentially over time (Kobayashi and Takeda, 2000).
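The arithmetic behind the 533 Gigabytes per day can be checked with a short calculation. The sketch below is only an illustration: the function name is hypothetical, decimal units are assumed (1 Gigabyte = 10^6 kBytes), and every page is assumed to be re-fetched at the same interval, as in the model described above.

```python
def daily_fulltext_gb(pages, avg_page_kbytes, refetch_interval_days):
    """Fulltext volume (GB per day) flowing into one indexing agent.

    Assumes decimal units (1 GB = 1,000,000 kB) and a uniform
    re-fetch interval for every page.
    """
    total_gb = pages * avg_page_kbytes / 1_000_000
    return total_gb / refetch_interval_days


# Figures quoted in the text: 800 million pages of 12 kBytes each.
print(round(daily_fulltext_gb(800_000_000, 12, 18)))   # 18-day interval
print(round(daily_fulltext_gb(800_000_000, 12, 8.5)))  # 8.5-day interval
```

With an 18-day interval the result rounds to 533 Gigabytes per day, matching the text. Under the same assumptions, the 8.5-day interval needed for one-day currency more than doubles the required rate.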

3.3.2 The Index volume The volume of index information generated by the IA and sent to a query processor (QP) is called the index volume. The structure most commonly used for storing full-text indices is the inverted file (Kowalski and Maybury, 2000 p. 82). The size of the inverted file is dependent on the richness of the information contained in the file, as well as the use of stop words in the indexing process. On the basis of experiments by Melnik et al. (2001) and earlier text retrieval experiments that were undertaken by the author, the estimated size of the index is less than, but of the same order of magnitude as, the full-text volume. When an indexing agent is distributing indices to several query processors, it will see exactly which portion of the index has changed since the last time of distribution and transmit only this portion. Brewington and Cybenko’s work provides the probability density function (PDF) for web page mean lifetime (shown in Figure 3-2a; it may be called $p_l(t)$), but not the integral required to calculate the size of this portion:

$$\int \frac{p_l(t)}{t}\,dt \qquad \text{(Eq. 3-1)}$$

1. Google now reports 3 billion pages indexed (May 2003).

The numbers presented are therefore the lowest possible: 100% of the pages have an estimated lifetime less than 500 days. Disregarding that most pages would change more than once during this period, the total index would have to be transmitted during these 500 days. If the size of the index is 10% of the total fulltext volume (the lower end of the interval identified above, in order not to overestimate the volume), and this needs to be transported from the index-generating agent to the query processor every 500 days, an index volume of 1.9 Gigabytes per day needs to be transported to a query processor representing the entire web. The actual numbers would be higher than this if currency of the index is to be kept under 500 days. Since the size of the index volume is a linear function of the fulltext volume, it is also expected to grow exponentially over time.
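The lower bound of 1.9 Gigabytes per day follows from the same page statistics. The sketch below repeats the calculation under the assumptions stated in the text (an index of 10% of the fulltext volume, transmitted once every 500 days). The function name and the decimal unit conversion are assumptions made for this illustration.

```python
def daily_index_gb(pages, avg_page_kbytes, index_fraction, period_days):
    """Lower-bound index volume (GB per day) sent to one query processor.

    Assumes decimal units (1 GB = 1,000,000 kB) and that the whole
    index is transmitted exactly once per period.
    """
    fulltext_gb = pages * avg_page_kbytes / 1_000_000
    return fulltext_gb * index_fraction / period_days


# 800 million pages of 12 kBytes, index at 10% of fulltext, 500-day period.
print(f"{daily_index_gb(800_000_000, 12, 0.10, 500):.2f} GB per day")
```

The result is about 1.92 Gigabytes per day. Since most pages change more than once in 500 days, a real system would need to transfer more than this.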

3.3.3 The Query volume The query volume is the volume of queries flowing to the query processor from all its clients plus the result sets returned back to the clients. The size of the query is small (the size of what a user types in, seldom more than 100 characters). The size of the result set is solely dependent upon the characteristics of the document collection and the query. The network traffic generated by queries and responses is therefore expected to grow linearly with the number of query operations1. The number of Internet users is believed to be growing exponentially over time (Kobayashi and Takeda, 2000). The popularity of search engines is not decreasing (ibid.), so the number of query operations generated by each user is not expected to decrease. Network appliances are also expected to conduct web searches, e.g. as the Semantic Web falls into place (Berners-Lee et al., 2001). The query volume is therefore expected to grow exponentially over time as well. The most popular search engines report more than 40 million queries per day (Brewer, 2001). The volume of information transferred is not known.

1. Local- or LAN-caching of query results is not considered here.


3.3.4 The Metadata volume When using metadata to represent documents (see section 2.3.3.2), the document collection is not represented by a common index, but every document is separately described through self-contained items of information, made through a manual or automatic process. A metadata description of a web page can be generated inside its web server and be sent to a query processor “just-in-time” (when the metadata representation of the page is changed). A metadata description of a web page can be of any size, but the dominant standards (e.g. Dublin Core) indicate a typical compressed size of 100-200 bytes (500-1000 characters uncompressed). In contrast to a full-text index, which changes at the same rate as the web pages it represents, a metadata representation of a web page does not necessarily change each time the web page changes. The flow of metadata between the IA (which generates metadata) and the QP is therefore difficult to estimate by analysis, but the volume of this flow is likely to be much smaller than the index volume: the metadata representing the document is smaller than the alternative inverted file, and the rate of change for metadata is also smaller. Since an item of metadata is independent of the size of the document it represents, it is reasonable to expect the size of the metadata volume to be a linear function of the total number of web pages, not the total fulltext volume. Nevertheless, the volume of metadata is also expected to grow exponentially over time. The choice of metadata over full-text collection indices has great impact on the functionality and the architectural properties of the search engine. This is discussed in section 4.3.1.

3.4 Alternative distribution strategies 3.4.1 Introduction When dealing with distributed resource discovery, the processing model of the architecture has great influence on the way in which the information volumes need to be transported across the network. Some alternative models will be presented here, most of which are in use in existing distributed IR or RD projects. The analytical model used is developed by the author, as well as one of the distribution models, which forms the basis for the Content-Sensitive Infrastructure.

3.4.2 Required terminology In order to process queries, a query processor (QP) needs access to an index1 representing the content of the collections. In distributed resource discovery there are several collections distributed across several information servers (which is the case in the World Wide Web). The index can be kept centralised (representing every collection) or distributed (representing a subset of collections). In the following analysis, these symbols are used to denote the different data flows:

Symbol    Data flow
F         Fulltext volume
I         Index or Metadata volume
Q         Query volume
M         Metadata volume

Table 3-1. The symbols used to denote the data flows

During the description of distribution models, the notation shown in Figure 3-3 will be used to show how the data flows are distributed or replicated to several recipients:

Figure 3-3. The symbols used in this discussion: (a) shows an information flow received by one of several recipients, (b) shows a flow where two or more recipients receive the same information (multicast), (c) shows a flow where every recipient receives the same information (broadcast) [Symbols not reproduced.]

1. At this stage of the discussion, the word “index” includes full-text indices (e.g. inverted files) and metadata.

3.4.3 Centralised index (centralised query processor) In this case the entire query volume is processed by one central query processor. Also, in order to keep the index current, the entire index volume (from an indexing agent) or metadata volume must be transported into the same site (Figure 3-4). The message complexity of the search engine in this case is O(Q+I) for the QP and O(F+I) for the IA, respectively. Since Q, F and I are all expected to have exponential growth rates, this design is not future-proof. “Traditional” search engines like Google (www.google.com) are using this computational model1. Note that the I flow does not flow across a wide area link in this configuration, but is more likely to be transported across a high-speed local bus.

Figure 3-4. The information flow using centralised index. QP=Query Processor, IA=Indexing Agent

For a search engine to be scaleable to the expected future Internet proportions, all the dataflows in use (Q, I, F) must be distributed across several networking nodes and networking links.

3.4.4 Distribution of Query volume The volume of queries may be distributed across several identical query processors. This strategy requires that the index information is replicated between the query processors, so that they produce identical results to any query. It also requires that there exists a strategy for distributing the queries evenly between query processors. Figure 3-5 shows this arrangement, where the query volume (Q) is distributed among n query processors, and the index volume (I) is replicated to the same set of query processors. The message complexity for this strategy becomes O(Q/n+I) for the query processors. This strategy manages only to distribute the query flow, whilst the Index/Metadata flows are unchanged. In addition, an indexing agent needs to send replicas to several query processors, making the message complexity O(nI+F) for this component2. AltaVista (www.altavista.com) seems to be using a variation of this distribution strategy: there are query processors in different countries (no.altavista.com, se.altavista.com), and they give almost the same result to the same query1. The differences in the result sets indicate that only parts of the index database are replicated between the query processors.

1. Any references to the design of commercial search engines are based on assumptions, since the internals of these systems are not open to the academic community. 2. The workload of replication can also be shared between the query processors, making the message complexity slightly different.

Figure 3-5. The information flow using query volume distribution

3.4.5 Distribution of Fulltext volume A distribution strategy where the Fulltext volume is distributed across several Indexing Agents (IA) requires that the F volume is partitioned into several disjoint parts, one for each IA. In a Search Engine, where a crawler traverses the World Wide Web by following hyperlinks in order to bring the F volume to the Indexing Agent, this becomes a graph partitioning problem. There is no known graph partitioning algorithm that could be trusted to partition the entire web. Although hyperlink traversals have been shown to work as a clustering method (Flake et al., 2002, Salampasis and Tait, 1999), it is not known to be useful for partitioning purposes. Distribution of the fulltext volume is a realistic approach when the F volume resides on a small number of independent retrieval engines, as will be shown in the next section.

3.4.6 Distribution of the Index volume An approach where the index volume is distributed requires several Indexing Agents and several Query Processors. Figure 3-6 shows the principle of such a configuration. When the Index volume is distributed, each Query Processor only keeps a subset of the total Index volume, and a query must be forwarded to the appropriate set of QPs to produce an acceptable result set.

1. The DNS servers for the altavista.com domain are directing clients to query processors based on locality of clients.


The IR literature refers to this problem as the collection fusion problem1 (Salampasis and Tait, 1999), database selection (French et al., 1999) or server selection (Hawking and Thistlewaite, 1999). In this thesis the task is called Query Routing, since some entity in the system (a Query Router) must decide which query processors a query should be forwarded to. This decision is based on Forward Knowledge in the Query Router (see section 2.6.4), a type of information describing the characteristics of the index collection kept in the Query Processor. This information can be generated by the document collection or by the query router. Several approaches to the generation of forward knowledge have been demonstrated in the literature (Sugiura and Etzioni, 2000, Yuwono and Lee, 1997, Kirriemuir et al., 1998), but none of them pays much attention to scaleability issues. Since the use of Centroids (Weider et al., 1996, Wang et al., 1998) is popular and since its analysis seems to be tractable, this distribution strategy has been chosen for the analysis in section 3.5. As the characteristics of the index collection change (as the underlying document collections change), the forward knowledge in the Query Routers needs to be updated. Therefore, a new information flow becomes necessary that carries changes in the forward knowledge. The volume of this flow (called I’) will be analysed in section 3.5 together with the effect of its distribution on queries. In Figure 3-6, I’ has been shown as the flow from the QP to the QR. Note that the model relies on a distribution of the fulltext volume, which has been deemed infeasible in a web search engine environment where web crawlers are applied. Using the model of distributed index flow, the “ideal” message complexity of a Query Processor becomes O(Q/n+I/n), and that of the Query Router becomes O(Q/m+I’) (where m is the number of query routers and n the number of query processors).
This is true only in the situation where the query is forwarded to only one query processor. Under conditions where queries are sent to several query processors, the complexity of the query processor grows towards O(Q+I/n). In order to efficiently divide the query volume between query processors, the collections must be distinct and different. When an indexing agent is crawling the World Wide Web, there is no known method by which to find collections sufficiently different so that the Query Routers can discriminate well between them. An approach where the web servers themselves generate the index (inverted files or metadata) rather than letting an indexing agent rely on arbitrary hyperlinks could overcome this problem, since the web servers can use other methods to cluster the documents into distinct collections than a crawler (the web server can traverse a document collection using file system commands, rather than rely on hyperlinks between documents).

1. The term Collection Fusion has also been observed to mean “result set merging” (Hawking and Thistlewaite, 1999).

Figure 3-6. The information flow using distributed index and Forward Knowledge. QR=Query Router
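The message complexities of the three models discussed so far can be compared numerically. The sketch below simply evaluates the expressions inside the big-Oh terms as per-node loads in arbitrary units. The class and function names, the `fanout` parameter (the number of query processors each query is forwarded to) and the example volumes are assumptions made for this illustration and are not part of the analysis itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Volumes:
    Q: float      # query volume
    I: float      # index (or metadata) volume
    F: float      # fulltext volume
    I_fwd: float  # forward-knowledge volume, written I' in the text


def centralised(v):
    """Centralised index: O(Q+I) for the QP, O(F+I) for the IA."""
    return {"QP": v.Q + v.I, "IA": v.F + v.I}


def query_distribution(v, n):
    """Replicated index on n QPs: O(Q/n+I) per QP, O(nI+F) for the IA."""
    return {"QP": v.Q / n + v.I, "IA": n * v.I + v.F}


def index_distribution(v, n, m, fanout=1):
    """Distributed index on n QPs and m QRs.

    fanout=1 is the ideal case O(Q/n+I/n); with fanout=n every QP
    receives every query and the load grows to O(Q+I/n).
    """
    return {"QP": v.Q * fanout / n + v.I / n, "QR": v.Q / m + v.I_fwd}


# Hypothetical volumes in arbitrary units.
v = Volumes(Q=100.0, I=50.0, F=500.0, I_fwd=5.0)
print(centralised(v))
print(query_distribution(v, n=10))
print(index_distribution(v, n=10, m=5))
print(index_distribution(v, n=10, m=5, fanout=10))
```

The last two lines show how sensitive the distributed index model is to the quality of query routing: the load on each query processor moves from the ideal Q/n+I/n towards Q+I/n as the fanout approaches n.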

3.5 The use of centroids and their properties 3.5.1 Introduction Since centroids (section 2.2.6) are used for distributing forward knowledge from query processors in distributed information retrieval systems (Weider et al., 1996, Wang et al., 1998; also cf. section 2.4.4.1), it is of interest to analyse the properties of the centroids, in particular their size and distributional effect on queries. It will be shown that the size of the centroid grows linearly with the size of the corresponding document collection, and that the distributional effect is dependent on the informational characteristics of the collections. In this context (somewhat different from the use of the term in section 2.2.6 (Salton et al., 1975)) the term centroid refers to a binary term vector (weights either 0 or 1).

3.5.2 Size There is a well-known correlation between the frequency of a word in a language and its “rank”. In the context of natural language, Zipf’s law states that if the most frequent word has frequency $\mathit{freq}_1$, the second most frequent word has frequency $\mathit{freq}_1/2$, the third most frequent word the frequency $\mathit{freq}_1/3$. The general formula is:

$$\mathit{freq}_n = \frac{\mathit{freq}_1}{\mathit{rank}^{d}} \qquad \text{(Eq. 3-2)}$$

where $\mathit{freq}_n$ is the frequency of the word of rank n. d is often (and in these calculations) given the constant value of 1. Zipf’s law can be used to estimate the size of the centroid, by calculating the number of unique words with a frequency > 1 for a given size of a collection. The rank for the least frequent term (and the number of terms in the centroid) is:

$$\mathit{size}_{centroid} = \mathit{rank} = \mathit{freq}_1 \qquad \text{(Eq. 3-3)}$$

Figure 3-7. Word distribution according to Zipf’s law. Words excluded from centroids are shaded [Plot not reproduced; it shows frequency against rank for the word distributions of collections A and B, the sizes of centroids A and B, and the region of most frequent words removed (stopwords).]

The graph in Figure 3-7 shows the frequency distribution of ranked terms in two collections of different sizes (note that the terms with rank x in the two collections are not necessarily the same). When extracting terms from a collection in order to construct a centroid, stop words (frequent and insignificant words) are removed. These sets of terms are shown inside the shaded regions in the figure.

The terms that are included in the centroid are represented as the portion of the curve between the shaded parts. The number of terms (the size of the centroid) is represented by the distance along the x-axis. From equation 3-3 this distance can be expressed as:

$$\mathit{size}_{centroid} = \mathit{freq}_1 - \mathit{size}_{stopwords} \qquad \text{(Eq. 3-4)}$$

where $\mathit{size}_{stopwords}$ is the number of terms removed during the stopword process. This is a constant value and independent of the size of the document collection. $\mathit{freq}_1$ is (in the generalised document model) proportional to the size of the document collection. According to equation 3-4, the size of the centroid becomes proportional to the size of the document collection. In a real situation, the centroid size will not increase indefinitely, since the vocabulary of terms in use is a finite number. The graph in Figure 3-7 will ultimately meet the x-axis when the rank exceeds this number. During simulation, the size of the centroid falls below the linear growth (with respect to collection size) when it approaches the vocabulary size, see Figure 3-8.


Figure 3-8. A simulation study of the size of a centroid as a function of document collection size. In this experiment, a term must occur 10 times before it is included in the centroid. [Plot not reproduced; title: Centroid size as a function of document collection size; horizontal axis: Collection size (words), 0 to 3e+06; vertical axis: Centroid size (words), 0 to 25000.]
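The linear growth and eventual saturation described in this section can be reproduced with a small deterministic model. The sketch below is not the simulation behind Figure 3-8. It assumes an idealised Zipf distribution with d = 1 over a finite vocabulary, so that the word of rank n is expected to occur freq1/n times. The vocabulary size, the stopword count and the function names are hypothetical.

```python
def harmonic(v):
    """Sum of 1/k for k = 1..v, the normaliser of Zipf's law with d = 1."""
    return sum(1.0 / k for k in range(1, v + 1))


def centroid_size(collection_words, vocabulary=25_000,
                  min_occurrences=10, stopwords=100):
    """Expected number of centroid terms for a collection of given size.

    A term of rank n is expected to occur freq1/n times. It is included
    when that count reaches min_occurrences, unless it is one of the
    most frequent `stopwords` terms, which are removed.
    """
    freq1 = collection_words / harmonic(vocabulary)
    highest_rank = min(vocabulary, int(freq1 / min_occurrences))
    return max(0, highest_rank - stopwords)


for words in (100_000, 200_000, 1_000_000, 3_000_000):
    print(words, centroid_size(words))
```

For small collections the centroid size is proportional to the collection size apart from the constant stopword term, as in equation 3-4. Once the highest included rank reaches the vocabulary size, the centroid stops growing, which corresponds to the flattening reported for the simulation.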

3.5.3 The distributional effect on queries The purpose of a centroid is to facilitate the distribution of queries between query processors. A query will be compared to the available centroids and passed to the query processors most likely to have relevant information. The typical criterion for comparing a query and a centroid is lexical similarity. Two centroids with a high degree of similarity will not contribute to a good distribution of queries, since a relatively larger fraction of queries will be sent to both query processors. In general, as the size of the collections grows, there is an increasing probability that any given term will occur in more than one collection. Therefore, the centroids for large collections are likely to be more similar to each other than the centroids for small collections. In order for centroids to be used for efficient query routing, it must be certain that the centroids for different query routes are different regardless of the scale and heterogeneity of the indexed document collections. It is believed that this is too strong a condition for the real World Wide Web. Xu and Croft (1999) identify a similar problem (i.e. inefficient query distribution and poor retrieval performance) with heterogeneous document collections, yet there are quite recent projects on distributed IR that implicitly rely on the existence of homogeneous collections (Klampanos and Jose, 2003, Bawa et al., 2003).
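The loss of distributional effect can be illustrated with a toy example. In the sketch below a centroid is modelled as a set of terms (a binary term vector), and a query is forwarded to every query processor whose centroid shares at least one term with it. This matching rule, the Jaccard measure used to compare centroids, and all names and documents are assumptions made for the illustration. They are not taken from the cited systems.

```python
def build_centroid(documents, stopwords=frozenset()):
    """Binary term vector of a collection, represented as a set of terms."""
    terms = set()
    for doc in documents:
        terms.update(w for w in doc.lower().split() if w not in stopwords)
    return frozenset(terms)


def route(query, centroids):
    """Names of the query processors whose centroid matches any query term."""
    query_terms = set(query.lower().split())
    return sorted(name for name, c in centroids.items() if query_terms & c)


def jaccard(a, b):
    """Lexical similarity between two centroids (0 = disjoint, 1 = equal)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


music = ["jazz guitar chords", "piano scales"]
cooking = ["bread baking", "pasta sauce"]
before = {"qp_music": build_centroid(music),
          "qp_cooking": build_centroid(cooking)}
print(route("jazz piano", before))

# The cooking collection grows and becomes more heterogeneous.
after = dict(before, qp_cooking=build_centroid(cooking + ["jazz brunch menu"]))
print(route("jazz piano", after))
print(jaccard(before["qp_music"], before["qp_cooking"]),
      jaccard(after["qp_music"], after["qp_cooking"]))
```

Before the collection grows, the query reaches one query processor. Afterwards the two centroids overlap and the same query is sent to both, which is the effect described above on a very small scale.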

3.5.4 Trade-off between scaleability and distribution When centroids are used for distributed information retrieval (as opposed to distributed resource discovery) the scaling problem can be met in two different ways: 1. adding more collections to the system, thereby increasing the total number of centroids linearly with the number of collections (it is necessary to send centroids for each collection separately). 2. adding information to existing collections, thereby decreasing the distributional effect of the centroids, and increasing the volume of queries (since queries become less efficiently distributed). While (1) weakens the scaleability of the system by increasing the I’ flow, (2) weakens the effect of query distribution.

3.6 Conclusion The scaleability properties of resource discovery systems using different distribution strategies have been the subject of discussion in this chapter. Different approaches have been analysed, and their scaleability properties have been found to be unsatisfactory or unknown (not discussed in the reports).
1. The fulltext volume cannot reliably be partitioned into separate disjoint flows when hyperlink crawlers are the method for finding documents.
2. The use of forward knowledge has not been proven to ensure scaleability to Internet proportions. The analysis in this chapter indicates poor scaleability properties.
3. The distribution of the index volume into several Query Processors requires that the index collections are distinct and different, a property difficult to maintain as the collections grow in size.
In the following chapter a distribution model will be presented where the scaleability properties are better. This property is due to elimination of forward knowledge, the use of metadata for distribution of index volume and maintaining of the distinctiveness of document collections.


4

Possible solutions to the scaleability problem

4.1 Introduction In the previous chapter several different distribution strategies for resource discovery systems were evaluated and analysed. In this chapter a conceptual framework for a solution will be presented together with a list of required properties for this solution to work.

4.2 Using static forward knowledge Several alternative distribution models for resource discovery systems have now been analysed. A flaw in the design of many distributed resource discovery systems has been identified in the previous chapter: the necessity of transporting forward knowledge of uncertain volume and uncertain distributional effect. This problem can be solved by ensuring that 1. the size of forward knowledge is growing less than linearly with the size of the index collections (which is not the case with centroids, see section 3.5.2) 2. the document collections remain sufficiently distinct so that queries can be efficiently distributed. This is generally not the case with approaches where forward knowledge is describing heterogeneous collections. Based on the discussion in section 3.5.3, it is believed that collections need to be coordinated in order to keep them distinct. One way to avoid sending the I’ (the forward knowledge) volume to every Query Router (Figure 3-6) is by relying on Static Forward Knowledge i.e. forward knowledge that does not change. Static forward knowledge can characterise Static collections, but static collections do not exist in the World Wide Web. Static forward knowledge can, on the other hand, be obtained if each collection only contains information belonging to a given class, e.g.


information related to a specific topic. By attributing a topic to the query, a query router can forward the query to the correct collection. Such collections are unlikely to exist in a heterogeneous environment such as the World Wide Web, so this is not a realistic solution. One approach to obtaining static forward knowledge is to relax the 1:1 relationship between the Indexing Agent and the Query Processor. In the previous models it has been assumed that the index collection generated by an IA is sent to one QP only. In this case the index collection will reflect the diversity and heterogeneity of the documents discovered by this IA. In the model proposed in this thesis, the index collection of a QP only refers to documents belonging to the same class. This requires the system to generate index information on a per-document basis, and that the IAs send index information for any given document to the QPs responsible for the class this document belongs to. Index information for a single document has been termed metadata in section 2.3.3.2.

4.3 Metadata associated with a topic

One possible way of obtaining static forward knowledge is to use query processors that store metadata instead of full-text collection indices. As explained in section 3.3.4, a collection of metadata can describe documents and resources in a number of collections. A query processor can therefore become “static” (in the sense that it holds information on a fixed topic) by collecting metadata only on this topic. Document collections will remain independent and general, and metadata descriptions of their individual documents (generated by an IA) are passed to different Query Processors depending on the topical nature of the document. One metadata item that describes a particular document can be stored inside a metadata collection together with metadata descriptions of similar resources. Since the metadata collections can be stored on different nodes in the network, this process is called distributed metadata clustering. Queries can in the same manner be routed to a query processor based on the topic attributed to them. Both the Q volume and the M volume can be distributed across several nodes using this approach, and the I’ volume does not exist, since only static forward knowledge is used (the join-protocol, described in section 5.3.2, can however be viewed as a form of forward knowledge, I’). See Figure 4-1 (the Indexing Agent is not shown in this figure, because it is not an essential part of the discussion at this point. The IA is likely to be resident in the same computer as the document collection, which will be explained later).

[Figure 4-1: diagram showing Query Routers (QR), Query Processors (QP) and the Q and M flows]

Figure 4-1. The information flow using distributed metadata clustering and static Forward Knowledge.
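
The routing idea behind distributed metadata clustering can be illustrated with a minimal sketch. It is not the design of the CSI itself: the class names, the metadata fields and the substring match are assumptions made for illustration, and the topic-to-processor table stands in for static forward knowledge that never has to be updated from index data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metadata:
    """Index information for one document (hypothetical fields)."""
    url: str
    topic: str
    description: str


class QueryProcessor:
    """Holds metadata on one fixed topic only, so it is 'static'."""

    def __init__(self, topic: str) -> None:
        self.topic = topic
        self.items: list[Metadata] = []

    def store(self, item: Metadata) -> None:
        self.items.append(item)

    def search(self, term: str) -> list[str]:
        # A naive substring match stands in for a real retrieval algorithm.
        return [m.url for m in self.items
                if term.lower() in m.description.lower()]


class TopicTable:
    """Static forward knowledge: a fixed mapping from topic to processor."""

    def __init__(self, processors: list[QueryProcessor]) -> None:
        self.by_topic = {qp.topic: qp for qp in processors}

    def send_metadata(self, item: Metadata) -> None:
        # Role of the Indexing Agent: one message per document (M flow).
        self.by_topic[item.topic].store(item)

    def route_query(self, topic: str, term: str) -> list[str]:
        # Role of the Query Router: forward by topic (Q flow), no I' needed.
        return self.by_topic[topic].search(term)


table = TopicTable([QueryProcessor("Recreation.Golf"),
                    QueryProcessor("Computing.OS")])
table.send_metadata(Metadata("http://example.org/a", "Recreation.Golf",
                             "Putting tips for beginners"))
table.send_metadata(Metadata("http://example.org/b", "Computing.OS",
                             "Kernel scheduling notes"))
golf_hits = table.route_query("Recreation.Golf", "putting")
os_hits = table.route_query("Computing.OS", "putting")
```

In this sketch two documents from the same collection could end up in different query processors, which is the relaxation of the 1:1 relationship between IA and QP described above.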

The message complexity of the query router is, under ideal conditions, O(Q/n) (when the Q volume is evenly distributed across n query routers), and that of the query processors is O(M/m) (when the Q and M volumes are evenly distributed across m query processors). The project described in this thesis employs this model.
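
As a small worked example of these bounds, the sketch below deals messages to nodes in round-robin order, which is an idealised stand-in for an even distribution. The volumes and node counts are illustrative figures only and are not taken from the thesis.

```python
from collections import Counter


def per_node_load(volume: int, nodes: int) -> Counter:
    """Deal `volume` messages to `nodes` nodes round-robin; count per node."""
    return Counter(i % nodes for i in range(volume))


q_volume, n_routers = 10_000, 8       # illustrative figures (assumption)
m_volume, m_processors = 50_000, 20   # illustrative figures (assumption)

router_load = per_node_load(q_volume, n_routers)
processor_load = per_node_load(m_volume, m_processors)

print(max(router_load.values()))      # 1250, i.e. Q/n
print(max(processor_load.values()))   # 2500, i.e. M/m
```

Doubling the number of nodes halves the load on each node in this idealised setting. An uneven distribution of topics would raise the maximum load above these values.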

4.3.1 Advantages of using metadata

Metadata has been shown to be a key factor in obtaining scaleability, and a topic-based metadata scheme has the potential to obtain effective static forward knowledge. The use of metadata has several other advantages, which are now discussed:

Currency
Resource discovery is normally based on the use of crawlers, which extract information from an information server, but do not know anything about the expected lifetime of this information. Thus, a re-indexing scheme must be based on a heuristic estimate, as discussed earlier (section 3.3.1). Metadata can be generated by an IA resident in the document collection server “just in time” and passed to the query processors that have “subscribed” to this class of information. As a consequence, the flow of metadata (called the M flow in section 3.3.4) can be optimised so that new metadata is distributed only when needed (not necessarily each time the document has changed).

Coverage
As explained in section 2.3.3.1, a large fraction of the resources on the Internet, called “the invisible web”, is invisible to a crawler (Sullivan, 1999, Arasu et al., 2001). A related problem is that crawlers may pick up pages that should not be retrieved outside a particular context (e.g. an HTML frameset). The most effective way to cope with this problem is by letting the information servers themselves announce their resources. Using this approach, documents that are available through ordinary URL-references, but invisible to a crawler due to use of JavaScript or server programming, can now be visible and become part of a query result set. Metadata can extend the resource discovery service into the area of service trading (section 2.5.5).

Relevance
Metadata can be augmented by adding information about the resource not present in the actual document text, e.g. topic, publishing status, date, copyrights etc. Since such information can be relevant for a query, it may improve the relevance of the items in a result set. There is no point in returning references that the user cannot use for technical, economic or legal reasons.
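
The subscription-based metadata flow described under Currency can be sketched as follows. This is only an illustration under stated assumptions: the `IndexingAgent` class, its method names and the metadata dictionary are hypothetical, and "only when needed" is interpreted here as "only when the generated metadata differs from what was last distributed".

```python
from collections import defaultdict


class IndexingAgent:
    """Hypothetical IA that runs next to the document collection."""

    def __init__(self) -> None:
        self.subscribers = defaultdict(list)  # class -> list of callbacks
        self.last_sent = {}                   # url -> metadata last pushed

    def subscribe(self, info_class: str, callback) -> None:
        self.subscribers[info_class].append(callback)

    def document_changed(self, url: str, info_class: str,
                         metadata: dict) -> bool:
        """Push metadata to subscribers only if it actually changed."""
        if self.last_sent.get(url) == metadata:
            return False  # document changed, but its metadata did not
        self.last_sent[url] = dict(metadata)
        for callback in self.subscribers.get(info_class, ()):
            callback(url, metadata)
        return True


received = []
agent = IndexingAgent()
agent.subscribe("Computing.OS",
                lambda url, md: received.append((url, md["title"])))

first = agent.document_changed("http://example.org/k", "Computing.OS",
                               {"title": "Kernel notes"})
second = agent.document_changed("http://example.org/k", "Computing.OS",
                                {"title": "Kernel notes"})
third = agent.document_changed("http://example.org/k", "Computing.OS",
                               {"title": "Kernel notes, 2nd ed."})
```

The second call distributes nothing, because an edit that leaves the metadata unchanged does not need to reach the query processors.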

4.4 Building a system based on static forward knowledge

4.4.1 Introduction

In order to demonstrate the validity of the principles of static forward knowledge and distribution of metadata just discussed, a real working system must be built and evaluated. The rest of the thesis will describe the design and properties of the Content-Sensitive Infrastructure (CSI), which is a scaleable resource discovery system built on the aforementioned principles. The goals and service model of the system will be presented in this section, while the design principles and protocol details will be presented in the next chapter. During the subsequent chapters the properties of the CSI will be examined through a series of experiments.

4.4.2 The goal of the system The goal of the CSI is to provide a scaleable infrastructure for distributed resource discovery. Resource providers (or information providers) may announce available resources by


describing the resource in the form of Advertisements (metadata) using a formal syntax and sending the advertisements to the CSI. Resource seekers can express their needs in the form of a Query using a formal syntax and send it to the CSI. The CSI will process the query and return a Result set containing advertisements considered relevant for this query. The service model of the CSI is shown in Figure 4-2. The system behaves like a client-server system using message passing between the client and the server. There is no synchronous relationship between a request and its responses. The retrieval effectiveness and processing efficiency of the CSI should be comparable to existing search engines. The system should be scaleable to very large proportions, i.e. to millions of networking nodes. These requirements will be investigated in subsequent chapters.

[Figure 4-2: resource providers send Advertisements to the Content-Sensitive Infrastructure; a resource seeker sends a Query and receives a Result set]

Figure 4-2. Service model of the Content-Sensitive Infrastructure
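
The service model can be sketched as three message types exchanged over queues. The sketch is a toy under stated assumptions: the type and field names are hypothetical, the whole infrastructure is collapsed into one in-process object, and categories are matched exactly rather than hierarchically. Its purpose is only to show that a client sends a message and collects any result later, with no synchronous call in between.

```python
import queue
from dataclasses import dataclass


@dataclass(frozen=True)
class Advertisement:
    resource_url: str
    category: str
    text: str


@dataclass(frozen=True)
class ResultSet:
    query_id: int
    advertisements: tuple


@dataclass
class Query:
    query_id: int
    category: str
    term: str
    reply_to: queue.Queue  # where result sets should be delivered


class ToyInfrastructure:
    """Single-process stand-in for the CSI (assumption, not the design)."""

    def __init__(self) -> None:
        self.inbox = queue.Queue()
        self.store = []

    def send(self, message) -> None:
        # Clients only enqueue a message; nothing is returned to them here.
        self.inbox.put(message)

    def run_pending(self) -> None:
        while not self.inbox.empty():
            msg = self.inbox.get()
            if isinstance(msg, Advertisement):
                self.store.append(msg)
            elif isinstance(msg, Query):
                hits = tuple(ad for ad in self.store
                             if ad.category == msg.category
                             and msg.term.lower() in ad.text.lower())
                msg.reply_to.put(ResultSet(msg.query_id, hits))


csi = ToyInfrastructure()
replies = queue.Queue()
csi.send(Advertisement("http://example.org/golf", "Recreation.Golf",
                       "Second-hand golf clubs"))
csi.send(Query(1, "Recreation.Golf", "clubs", replies))
pending_before = replies.empty()  # True: nothing has been processed yet
csi.run_pending()
result = replies.get_nowait()
```

The `query_id` field lets the resource seeker match a result set to its query, which is needed once requests and responses are decoupled.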

4.4.3 The “ocean” metaphor

The CSI can be visualised as an ocean where items of information (advertisements) swim around like fish, forming shoals with other fish of a similar kind in fishing grounds. The fisherman must know where the fishing grounds are in order to catch the fish. In this way the ocean is divided into areas where similar information forms clusters. Queries are


expected to look for information of a certain type, and should be directed towards the area where such information is likely to be found. In the same manner, the CSI contains member nodes that are interested in a certain type of information, in the sense that they store information of this type and process queries looking for this type of information. Furthermore, the CSI organises the ocean in “concentric” areas containing increasingly specific information, i.e. an area containing specific information lies within an area containing general information. Figure 4-3 shows this property.

[Figure 4-3: an “Ocean” with areas labelled Recreation, Rafting, Golf, Religion, Religion.Islam.Poetry, Computing, Languages, OS, Network, Biology.Genetics and Economy]

Figure 4-3. The CSI visualised as an ocean. The concentric areas show areas of increasingly specific information

4.4.4 The need for classification

The “ocean model” implies that there exists a typological framework for information, so that any advertisement and query may be associated with an information class. This framework needs to have a hierarchical structure so that information classes can be related along a specificity dimension. The reasons for the use of a hierarchical classification framework are:
1. To allow queries and advertisements to have different specificity. General queries should gather advertisements from a larger area of the ocean than specific queries. Advertisements assigned to a given class are candidates for being included in the result of queries assigned to any superclass.


2. To allow for a flexible number of networking nodes inside the infrastructure, and a flexible granularity of distribution. This is an implementation issue which will become clear during the presentation of the design in chapter 5.
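
The first reason above can be made concrete with a short sketch that treats a class as a dotted path, as in the labels of Figure 4-3. The function names are hypothetical, and the sketch assumes that a query also matches advertisements of its own class, not only of its subclasses.

```python
def path(category: str) -> tuple[str, ...]:
    """Split a dotted class name such as 'Religion.Islam' into its parts."""
    return tuple(category.split("."))


def is_same_or_superclass(general: str, specific: str) -> bool:
    """True if `general` equals `specific` or lies above it in the hierarchy."""
    g, s = path(general), path(specific)
    # Compare whole path components, so 'Rel' is not a superclass of 'Religion'.
    return s[:len(g)] == g


def candidates(query_class: str, adverts: list[tuple[str, str]]) -> list[str]:
    """Return the advertisements that may appear in the query's result set."""
    return [name for name, ad_class in adverts
            if is_same_or_superclass(query_class, ad_class)]


adverts = [
    ("ad1", "Religion.Islam.Poetry"),
    ("ad2", "Religion"),
    ("ad3", "Recreation.Golf"),
]

general = candidates("Religion", adverts)
specific = candidates("Religion.Islam", adverts)
unrelated = candidates("Recreation.Golf.Clubs", adverts)
```

The general query gathers `ad1` and `ad2`, the more specific query gathers only `ad1`, and a query that is more specific than every stored advertisement gathers nothing.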

4.4.5 The effects of classification on the retrieval effectiveness

The retrieval performance of a distributed system built on classification will be thoroughly investigated in this thesis. Related work by Xu and Croft (1999) has studied retrieval performance in distributed systems based on clustering, and found that they perform slightly worse than systems based on centralised retrieval, but better than systems based on heterogeneous document collections. Xu and Croft’s work focuses on retrieval performance and not on scaleability, and is therefore not directly comparable to our work. They do not consider the cost (in terms of network consumption) of clustering (and re-clustering) documents.

4.5 Conclusion

In this chapter some necessary properties for a scaleable resource discovery system have been identified and discussed:
1. The volume of forward knowledge needs to be controlled, e.g. by ensuring that the index collections are kept static (no changes in type of information content).
2. The query router’s ability to divide the query volume efficiently must be controlled, e.g. by keeping the index collections distinct and different (with regard to type of information content).

Distributing the forward knowledge in the form of centroids is not a good solution, neither in terms of network bandwidth consumption nor in terms of effect on distribution of queries. An approach based on distributed metadata clustering is promising, since the distinctive properties of the index collections can be controlled (by the “topicality” of query processors). The query distribution can be controlled better and there is no need to pass forward knowledge to the query routers.


I hypothesise that it is possible to build a working resource discovery system based on these principles. The service model and design overview of the Content-Sensitive Infrastructure have been presented in this chapter. The remainder of the thesis will present the protocols, design and properties of the Content-Sensitive Infrastructure, and it will demonstrate that a system based on distributed metadata clustering can be made scaleable to very large proportions and will have acceptable retrieval performance.

5 The Content-Sensitive Infrastructure - design and analysis

5.1 Introduction

In the previous chapter, the rationale behind a resource discovery system based on distributed classified metadata was presented, along with a possible service model and some design requirements. In this chapter the detailed design of the Content-Sensitive Infrastructure (CSI) will be presented. The CSI meets the requirements for a scaleable resource discovery system given in the previous chapter. The focus of the design description and the algorithms will therefore be on scaleability issues. Discussion of IR aspects is held over till chapter 6. The scaleability properties of the system will be formally analysed, and the analysis will be verified through a simulation experiment. The fault-tolerant properties of the system will also be investigated. The CSI is based on the idea of a distributed network of Query Processors (called Members) where each member is responsible for holding metadata and processing queries about a particular category.

5.1.1 The design principles

The CSI is implemented using a simple set of design principles:
• The infrastructure should use a peer-to-peer approach so that central servers and “single points of failure” are avoided
• The system should be completely self-configuring in order to avoid managerial bottlenecks
• The networking protocols should emulate a messaging system in order to enable asynchronous operation of peers
• “Relaxed consistency” between member nodes should be applied in order to avoid costly error recovery procedures
• There should be no limitations on the number of networking nodes. The system should be able to operate with any number of nodes
• There should be a loose coupling between query processing and message processing, so that members can employ different retrieval algorithms

5.2 The architectural elements of the CSI

5.2.1 The CSI Member

The core of the CSI operation is a number of member nodes that can be described as forwarding and processing entities. The main property of a CSI member is its Category of Interest (COI) by which a member declares which area of the “ocean” (see section 4.4.3) it is willing to observe. Fig. 5-1 shows the basic elements of the CSI member. The duties of a CSI member are:
1. Storing Advertisements of types that are within its Category of Interest
2. Processing Queries of types that are within its COI (and returning a result set)
3. Forwarding Advertisements and Queries towards CSI members having the corresponding COI.

Definition: a Category of Interest (COI) denotes a class in a classification hierarchy. A COI is represented by the path from the top of the category tree as a sequence of category names separated by dots. There exists a “less-than” relation (
