FWEB: Automatic Hyperlink Creation Using Peer-to-Peer Web Servers

Simon Courtenage and Steven Williams
Cavendish School of Computer Science, University of Westminster,
115 New Cavendish Street, London, UK
{courtes,williast}@wmin.ac.uk

Abstract. The World-Wide Web allows users to quickly and easily publish information in the form of web pages. Pages are linked to other pages already on the web by hyperlinks that the page's author inserts, each containing the URL of another existing page. This model of web publishing, although simple and efficient, means that links between pages must be created manually, and only to pages that are known to the author of the links. This is a disadvantage when, for example, information in a particular field is incomplete and expanding rapidly over time, so that a page author cannot be expected to know which pages are the most appropriate to link to or when they become available. In this paper, we look at a radically different model of web publishing in which the author of a web page does not specify links using URLs. Instead, the page author expresses an interest in the kind of content the page should link to, and as new content that matches that interest comes online, links to it are inserted automatically into the original page. One consequence is that a hyperlink from a particular location in a web page can lead to multiple destinations, something we call a multi-valued hyperlink. We also describe a prototype implementation of our web architecture, built on a CHORD-based peer-to-peer overlay network, which uses publish/subscribe to communicate page authors' interests to other peers in order to create links between pages.

1 Introduction

The web has been a source of enormous benefit to a great many people. Its great advantages are, of course, the ease with which information can be published and made available to a wide audience, and the ability to organize and connect different documents in a graph-based structure using hyperlinks. However, the way in which the web is constructed, through the addition of new documents, can be at odds with the ways in which knowledge expands. New pages link to pages that already exist in the web graph (as described in, for example, [1]), but are not yet linked to by other pages, because they are new. The web graph, therefore, does not grow forwards. We do not write a web page with links from locations in the page to pages that do not yet exist, nor is a new page immediately linked in to existing pages, even though it may expand on information contained in those pages. Yet this is one possible model for how human knowledge grows, both in terms of a particular individual's understanding of a subject and in terms of research in general. Research sometimes makes use of undefined terms in particular models, expanding or completing the definition of those terms at some later date. In some circumstances, therefore, there is a disparity between how the web is constructed and how human knowledge is constructed: the web is a great tool for organizing and connecting existing knowledge, but not always knowledge that is incomplete and expanding.

In this paper, we describe a new architecture for the web that models how human knowledge expands when that knowledge is incomplete, but which is fully compatible with the existing web and existing web pages. The key feature of this architecture is that authors of web pages do not insert links to other pages as they create or edit a page. Instead, they indicate the parts of the page's content from which links should be established as soon as pages with matching content become available. The basis for this feature is a distributed content-based publish/subscribe system [2] [3] [4]. We show how this publish/subscribe system can implement our forward web (which we refer to in this paper as fweb) over a peer-to-peer network of web servers. We also describe a prototype implementation of fweb, based on CHORD [5] [6], an efficient P2P architecture that provides a single operation, node lookup, efficiently using distributed hash tables.

2 A Forward Web with Multi-Valued Hyperlinks

Consider the following scenario: a geneticist publishes, as a web document, the results of research into a particular gene and its effect on a genetically-inherited disease. This research may involve factoring in the presence of a second gene in the DNA of a patient, which is not the primary subject of the geneticist's research. Research results on the second gene may not be available at the time the web document is published. At some later time, a web document on the effect of the second gene on patient susceptibility to disease may be published. The geneticist must now edit the original document to point to the new document describing research into the second gene. In other words, the web must be revised backwards as new knowledge expands forwards.

To help the geneticist in this example, we propose a new web infrastructure that allows the web to grow forwards, in the same direction as knowledge. The current web grows backwards as a consequence of the fact that a page must already exist in order to be targeted by an anchor in another page. To grow forwards, therefore, we must allow anchors to pages that do not yet exist but which may exist at some point in the future. Clearly, we cannot target specific pages that do not yet exist, since their URLs are unknown. Instead, we indicate, through the anchor text, what kind of pages we want to link to.

2.1 FHTML - A Forward HTML

Web authors need to be able to state where links should be inserted into their documents, as and when new, relevant content is made available. To provide authors with this capability, we first add a new SUMMARY tag, to be included in the header of a page, which acts as an aid to matching page content to link requirements. The SUMMARY tag is similar in use to the META tag in existing HTML; we have defined a separate tag rather than reuse META. At present, the matching between the content of the KEY attribute in a link tag and the content of other web documents is done using simple keyword matching. The author of an FHTML document does not have complete control over which documents will be linked to, since this depends on the matching carried out between the keywords in a link tag and the summary in a SUMMARY tag. In fact, using the link tag in a web document creates the distinct possibility that its anchor text can refer to more than one web document, leading to what we term a multi-valued hyperlink. In our current implementation, this means the browser displays a drop-down list of URLs when the user clicks on a multi-valued hyperlink, from which the user can select the page to actually visit.
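The simple keyword matching described above can be sketched as follows. This is a sketch under assumptions: the paper specifies only that matching is by keyword equality, and the URLs and keyword lists are invented for illustration. A link request is treated as matching a page when every request keyword appears in that page's summary.

```python
def matches(link_keywords, summary_keywords):
    """A link request matches a page when every request keyword
    appears in the page's summary (case-insensitive equality)."""
    return {k.lower() for k in link_keywords} <= {k.lower() for k in summary_keywords}

def resolve(link_keywords, published_pages):
    """Return every matching URL; together they form one
    multi-valued hyperlink."""
    return [url for url, summary in published_pages.items()
            if matches(link_keywords, summary)]

# Hypothetical published pages and their SUMMARY keywords.
pages = {
    "http://a.example/gene2.html": ["gene", "BRCA2", "susceptibility"],
    "http://b.example/diet.html":  ["diet", "heart", "disease"],
    "http://c.example/brca2.html": ["BRCA2", "gene", "therapy"],
}
targets = resolve(["gene", "BRCA2"], pages)   # two destinations -> drop-down list
```

Because two pages match here, the anchor would be rendered as a drop-down list of both URLs, exactly the multi-valued case described above.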

3 An Architecture for the Forward Web

The Forward Web demands more from the components of its infrastructure than the current web. Links between pages must be found and created automatically when a match occurs between the content of a link tag in one document and the content of a SUMMARY tag in another. The infrastructure must therefore be aware of the contents of both kinds of tag in all documents, in order to infer which documents should link to which others. Our solution to this problem is to use the publish/subscribe communications paradigm to connect the components of the web infrastructure. Specifically, we use content-based publish/subscribe over a peer-to-peer (P2P) network. The use of a publish/subscribe network overcomes the problems in matching keywords with web page content posed by the asynchronous nature of web publishing, while the use of a P2P network enables document repositories to provide services to each other to allow new hyperlinks to be formed.

3.1 Publish-Subscribe and Content-Based Routing

Publish/subscribe systems [7] form an important communications paradigm in distributed systems, one in which servers (or producers of messages) are decoupled from clients (or consumers) by the network. Instead of clients contacting servers directly to request services or information, clients register a subscription with the network to receive messages satisfying certain criteria. Servers publish information onto the network without knowing who will receive it, and the network undertakes to route messages to the appropriate clients based on the set of subscriptions currently in effect. Traditional publish/subscribe systems create channels, groups or topics, sometimes hierarchical, under which messages may be classified; a subscription is simply the identity of the channel, group, or topic from which a user wants to receive messages. Once subscribed, the subscriber receives all messages published under that channel, group or topic. Recently, another approach to publish/subscribe has been developed that allows subscribers to specify their interests in terms of the kind of message content they want to receive: this is known as content-based routing. The advantage of combining content-based routing and publish/subscribe, to create content-based publish/subscribe [3] [2] [4], over more conventional systems is the far greater flexibility permitted in creating subscriptions. Subscribers are in effect allowed to create their own message groupings, rather than simply sign up to predefined ones, by defining predicates over the structure of a particular message type. When a subscription has been registered with the network (typically a network of servers overlaying a TCP/IP-based network), the network undertakes to route to the subscriber all messages of that type whose content satisfies the subscriber's criteria, typically using an overlay network of brokers: servers whose role is to match up subscriptions with publications.

3.2 Implementing the Forward Web

We have implemented a prototype version of fweb as a custom-written Java-based web server. In this section, we describe some of the key features of the implementation. The key problem in implementing fweb is how to connect a web page containing a link tag with a page containing a SUMMARY tag when the keywords in the link tag and the content of the SUMMARY tag agree. We have tackled this problem by arranging fweb servers as a broker network and employing the content-based publish/subscribe paradigm to match up subscriptions with publications. In our prototype implementation, the broker architecture is provided by a P2P overlay network based on CHORD [5] [6], a popular distributed P2P service with a single operation, node lookup, based on Distributed Hash Tables (DHTs). Given a particular hash key, the CHORD architecture allows fast and efficient lookup of the P2P node associated with that key. Each fweb server participates in the P2P system, acting not only as a web server but also as a broker node in a content-based publish/subscribe system, in order to link pages together.

Placing a new web page in the document root of an fweb server triggers the server to parse the document and extract the content of any link or SUMMARY tags. If a link tag is found, its keywords are used to create subscriptions (each link tag in a page is assigned a unique identity within the page). Each keyword is hashed using the CHORD hash function to locate the fweb server in the CHORD-based P2P network that will act as broker for the subscription. The located server is then contacted with the details of the subscription (the unhashed keyword and the URL of the document containing the link tag). Similarly, the keywords in the page summary are used to create publications, by hashing the keywords to locate the nodes in the fweb P2P network to which the publications should be sent. When an fweb server receives a subscription, it attempts to match the subscription against the web pages that it knows have already published summaries. Since both the page summary and the subscription are sent to a node on the basis of hashing a keyword, if they both contain the same keyword they will end up at the same node. If the match is deemed successful, details of the publication are sent back to the fweb server that made the subscription.

The subscribing fweb server collects the notifications for pages in its document repository, keeping track of how many subscriptions were created for a particular link tag in a particular page. When all keywords for a link tag have been matched by publications from another document, it adds the URL of the publishing document to a URL file associated with the subscribing document. For example, an fweb server with a document such as index.html will create a file index.url to hold the URLs of documents that match link tags within index.html. When a document that has an associated URL file is requested by a client browser, the fweb server handling the request parses the document and the URL file, dynamically replacing the link tags in the document with the matching URLs before sending the page to the browser. Currently, to handle multi-valued hyperlinks, we redirect document requests within the web server to a JSP that parses the requested page and inserts a DHTML drop-down menu of hyperlinks where the multi-valued hyperlink should be.
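The pipeline above can be sketched in miniature as follows. This is a Python sketch, not the Java prototype, and everything concrete in it is an assumption: the four-node ring, SHA-1 modulo 256 standing in for the CHORD hash, the "tag_id url" line format of the .url file, and all URLs and keywords are invented for illustration.

```python
# Toy sketch of fweb's matching pipeline: keyword -> rendezvous node via
# hashing, retained publications, and the companion .url file.
import hashlib
import os
import tempfile

NODES = [0, 64, 128, 192]                      # toy identifier ring

def node_for(keyword):
    """Successor of the keyword's hash on the identifier circle."""
    key = int(hashlib.sha1(keyword.encode()).hexdigest(), 16) % 256
    return next((n for n in NODES if n >= key), NODES[0])

# Per-node table of retained publications: keyword -> set of URLs.
retained = {n: {} for n in NODES}

def publish(keyword, url):
    """A page-summary keyword is routed to its rendezvous node and retained."""
    retained[node_for(keyword)].setdefault(keyword, set()).add(url)

def subscribe(keyword):
    """A link-tag keyword rendezvouses at the same node; return matches."""
    return retained[node_for(keyword)].get(keyword, set())

def resolve_link(doc_path, tag_id, keywords):
    """When every keyword is matched by one publisher, record its URL
    in the subscribing document's companion .url file."""
    hits = [subscribe(k) for k in keywords]
    found = set.intersection(*hits) if hits and all(hits) else set()
    for url in found:
        with open(os.path.splitext(doc_path)[0] + ".url", "a") as f:
            f.write(f"{tag_id} {url}\n")

# A publishing server announces a page summary, keyword by keyword.
publish("gene", "http://b.example/gene2.html")
publish("BRCA2", "http://b.example/gene2.html")

# A subscribing server resolves a link tag in index.html.
doc = os.path.join(tempfile.mkdtemp(), "index.html")
resolve_link(doc, "link1", ["gene", "BRCA2"])
```

On a subsequent page request, the server would read index.url back and substitute the stored URLs into the document, or a drop-down menu when one tag id maps to several URLs.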

4 Related Work

There has been a great deal of work in designing and developing content-based publish/subscribe systems, for example [3] [8] [9] [4] [10] [11], which have successfully tackled many of the problems involved in routing a message from its source to a client based solely on its content. Our work makes use of this research in its use of a broker-style network to mediate between subscriptions and publications, in a manner similar to Hermes [4]. However, unlike Hermes (and most other content-based publish/subscribe systems), fweb servers keep track of past publications in order to create hyperlinks to documents that already exist. Moreover, we do not use subscriptions with filters more complex than simple equality.

Lewis et al [12] describe the use of a content-based publish/subscribe system as part of a complete semantic-based information dissemination system, allowing client browsers to make subscriptions that create information spaces, and then receive notifications when new information becomes available. Our system, however, uses publish/subscribe only to create ordinary hyperlinks between HTML-compliant web pages based on matching content, which is a much simpler proposition. Moreover, the documents received by web browsers are fully HTML-compliant, whereas the system described in [12] depends heavily on Semantic Web markup and associated technologies.

P2P systems have been used before to implement publish/subscribe: for example, content-based publish/subscribe systems using CHORD are described by Triantafillou et al [13] and Terpstra et al [14]. The primary goals of [14] are the robustness of the routing strategy for content-based publish/subscribe and the implementation of a filtering strategy on top of a DHT-based P2P system, neither of which is currently handled by fweb (fweb's current notion of a filter is limited to simple keyword equality, which poses no problem in DHT-based P2P systems). In [13], the concern is with how range predicates (more complex filters on content, such as less-than or greater-than) can be implemented in DHT-based P2P systems such as CHORD, where the use of hashing to locate nodes makes support of filters other than equality difficult.
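The difficulty with range predicates can be illustrated with a small sketch (the node ring and hash function are invented for the example, as in the rest of this paper's toy figures): an equality filter hashes to exactly one rendezvous node, whereas a range such as "price < 14" gives no single key to hash, so a naive scheme must enumerate candidate values, which may scatter across the ring.

```python
# Equality vs. range filters over a DHT (toy illustration).
import hashlib

NODES = [0, 64, 128, 192]                  # toy identifier ring

def node_for(key):
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16) % 256
    return next((n for n in NODES if n >= h), NODES[0])

# An equality subscription routes to exactly one node:
eq_nodes = {node_for("price=10")}

# A naive range subscription (integer prices below 14) has no single
# key to hash; each possible value may land on a different node:
range_nodes = {node_for(f"price={p}") for p in range(14)}
```

Since hashing deliberately destroys the ordering of keys, the nodes responsible for a range cannot be located with a single lookup, which is exactly the problem [13] addresses.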

5 Conclusions and Future Work

This paper has described the use of a P2P-based content-based publish/subscribe system to augment the current web and allow web authors to specify what kind of pages they want to link to, rather than explicit URLs. The advantage of this approach is that links to matching pages are automatically added, maintained and updated without intervention from the web page author. This may suit domains where information is incomplete and expanding. The principal disadvantages of our approach are (i) that each server must act as the subscription/publication rendezvous node for a large number of keywords, and (ii) that, at the moment, we allow free text in the keyword lists of the link and SUMMARY tags, resulting in a very large namespace, as well as problems in dealing with synonyms.

We have experimented with fweb as a small P2P network of web servers, with a small collection of documents in each web server's repository, in order to test the implementation. This has not given us a great deal of information with which to evaluate the strengths or weaknesses of the architecture. However, we have found that, in practice, it is as difficult to write link tags that result in meaningful matches between pages as it is to write meaningful search engine queries, particularly since we require exact matching between keywords. This is likely to be a disadvantage in a real deployment of fweb, since we cannot expect web page authors to agree on which keywords to use in a distributed environment. (Unlike search engines, however, we do not relax the matching criteria to produce more matches. As a result, we do not need to rank the results so that the first link displayed in a multi-valued hyperlink is the most relevant.) One approach to this problem is to allow keywords only from agreed ontologies, although this limits application to specific communities where such ontologies can be agreed on and predefined. The use of ontologies would, however, eliminate the problems with synonyms that arise when free text is used in conjunction with DHTs. This would bring our work closer to the W3C's Semantic Web initiative, although currently we propose using only the term hierarchies of ontologies, rather than any reasoning capability, to solve problems of matching terms. Future work includes exploring the possibilities opened up by fweb, as well as the need for more structured descriptions in link and SUMMARY tags.