Architectural Alternatives for Information Filtering in Structured Overlay Networks ∗
Christos Tryfonopoulos 1, Christian Zimmer 1, Gerhard Weikum 1, Manolis Koubarakis 2

1 Max-Planck Institute for Informatics, Department of Databases and Information Systems, 66123 Saarbruecken, Germany
{trifon,czimmer,weikum}@mpi-inf.mpg.de

2 National and Kapodistrian University of Athens, Department of Informatics and Telecommunications, 15784 Athens, Greece
[email protected]

ABSTRACT

In this work we discuss how to provide information filtering (pub/sub) functionality over structured peer-to-peer overlay networks by presenting two approaches we have developed. Both approaches use the Chord DHT as the routing substrate, but one stresses retrieval effectiveness while the other relaxes recall guarantees to achieve lower message traffic and thus better scalability. We highlight the main characteristics of the two approaches, present the issues and tradeoffs involved in their design, and compare them in terms of scalability, efficiency and filtering effectiveness. Finally, building on our experience with these systems, we highlight the advantages and disadvantages of each approach and state the lessons we learned.
∗ The research presented here was supported in part by the DELOS Network of Excellence and the EU projects AEOLUS and Evergrow. This work is a modified version of the article published in IEEE Internet Computing, special issue on Dynamic Information Dissemination (© 2007 IEEE, 1089-7801/07/$25.00, published by the IEEE Computer Society).

1. INTRODUCTION

In recent years, dynamic information dissemination applications such as news alerts, weather monitoring and stock quotes have gained popularity among large, open and dynamic communities of users. In this work we are interested in applications such as news alerts, digital libraries and RSS feeds, where the data of interest is mostly textual and user needs are expressed using languages from the area of Information Retrieval (e.g., keywords or pieces of text). We argue that peer-to-peer (P2P) structured overlay networks are well-suited as an implementation technology for these scenarios due to their potential to handle huge amounts of information in a highly distributed, self-organizing way. These characteristics offer enormous potential benefits for information dissemination capabilities in terms of openness, scalability, efficiency, resilience to failures, and dynamics.

The functionality that we consider here is information filtering (IF), also known as publish-subscribe (pub/sub) or information alert. Users, or services that act on behalf of users, specify continuous queries for information, thus registering a subscription to newly appearing documents that satisfy the query conditions. The user will then be automatically notified whenever a new matching document is published. Publishers can be news feeds, digital libraries, or users who post new items to blogs or other forms of Internet communities.

[Figure 1: A high-level view of a P2P pub/sub architecture, with users, publisher/subscriber/router services, crawlers, observer modules and external sources such as the IEEE DL and CiteSeer.]

The P2P information filtering architecture we envision is shown in Figure 1.
In this setting, users utilize publisher nodes to make the content they publish available to the network. The content may be created locally, e.g., by a Web server or an underlying content management system, or it may be gathered from the Web with a (possibly thematically focused) crawler. Additionally, other content providers such as digital libraries or publishing houses (e.g., IEEE, Elsevier) may also use the network and utilize publisher nodes to make metadata about their resources available to the system. In this setting we also expect to see observer modules (as in [7]) for information sources that do not provide their own alerting service; these modules query the sources for new material in a scheduled manner and inform subscribers accordingly. All this information flow is filtered and redirected to users according to their submitted queries¹, by making use of different types of network nodes. Information production, consumption and flow are thus highly decentralized and asynchronous, making a P2P architecture a natural approach. In the rest of this section we briefly discuss different approaches that have utilized IR-based query languages and overlay networks to realize some facet of the scenario described above.

Note that this functionality is very different from information search as offered by search engines, where a query is posed and executed only once to retrieve the currently matching documents. In fact, the preferred approach for centralized pub/sub applications is to index queries rather than data and to evaluate newly published data items against these existing subscriptions. Depending on the expressiveness of the query language for subscriptions, such IF services are technically difficult already in a centralized setting. In a widely distributed setting with a large number of subscribers and potential publishers, efficiently routing newly published documents so that they can be matched against the existing subscriptions poses additional complexity and technical challenges.

Building on our experience from designing information filtering systems on top of overlay networks, this paper presents architectural alternatives and considerations that should be taken into account when one wants to build efficient and scalable distributed pub/sub systems. Initially, we provide a short overview of recent work on IF on top of structured overlays. Then we discuss how to build this filtering functionality on top of the Chord DHT [18] by focusing on two different architectures that we have designed, coined DHTrie and MAPS. DHTrie stresses retrieval effectiveness and aims for exact pub/sub functionality by trading message traffic, whereas MAPS puts the focus on scalability and relaxes recall to achieve approximate pub/sub functionality. We briefly present the main components of these architectures, discuss the associated protocols, and demonstrate the usage of DHTs to achieve scalability. Then we compare these two approaches in terms of their main characteristics, highlight the advantages and disadvantages of each architecture and state the lessons learned during their design. Our contributions lie in the demonstration of the different architectural approaches that can be taken using the same routing infrastructure (i.e., the DHT) and of the different characteristics of these architectures in terms of scalability, retrieval effectiveness and possible optimizations. We identify the issues and design choices available for P2P IF and point out two recent approaches that share the same routing mechanisms but target different objectives. Our goal is to demonstrate the feasibility of P2P IF, to highlight the different tradeoffs involved in the design of such systems and to provide pointers for further reading and exploration.

¹ The terms query, subscription and profile will always denote a continuous query (since we focus on a pub/sub setting) and will be used interchangeably.
2. INFORMATION FILTERING IN STRUCTURED OVERLAYS
P2P networks are typically divided into three classes according to their topology: unstructured, hierarchical and structured networks. DHTs [18, 15, 16] are a prominent class of structured overlay networks devised to efficiently solve the object lookup problem. With the invention of DHTs, distributed information management emerged as an interesting research area that could benefit from the P2P paradigm. One category of information management applications focused on supporting content retrieval on top of structured overlays using various data models and query languages [10, 19], while another category focused on the pub/sub paradigm. In this section we put the emphasis on surveying work from the pub/sub domain, and do not further explore retrieval approaches as these are not directly relevant to the focus of the paper.

Over the last years, a new wave of pub/sub systems that use structured overlays as the routing substrate has appeared. Scribe [17] was one of the first topic-based approaches built on top of Pastry [16]. Hermes [7] is similar to Scribe because it uses the same underlying DHT (Pastry), but it allows more expressive subscriptions by supporting the notion of an event type with attributes. Each event type in Hermes is managed by an event broker, which acts as a rendezvous node for subscriptions and publications related to this event type. PeerCQ [8] is another notable pub/sub system implemented on top of a DHT infrastructure. The most important contribution of PeerCQ is that it takes peer heterogeneity into account and extends consistent hashing with simple load balancing techniques based on an appropriate assignment of peer identifiers to network nodes. Meghdoot [9] is a recent proposal implemented on top of a CAN-like DHT [15]; it supports an attribute-value data model and offers new ideas for the processing of subscriptions with range predicates (e.g., the price is between 20 and 40 Euros) and for load balancing. A P2P system with a similar attribute-value data model, which has been used to implement a publish/subscribe system for network games, is Mercury [4]. All the aforementioned approaches support Boolean queries with arithmetic operators and do not consider IR-based data models and languages.

Recently, systems that employ an IR-based query language to support information filtering on top of structured overlay networks have also been deployed. pFilter [20] uses a hierarchical extension of the CAN DHT to store user queries and relies on multicast trees to notify subscribers. LibraRing [21] and MinervaDL [25] present two frameworks that provide information retrieval and filtering services for future digital libraries in a super-peer environment. Finally, [1] shows how to implement a DHT-agnostic solution that supports prefix and suffix operations over string attributes in a pub/sub environment. Compared to these approaches, the solutions presented here use a more expressive data model and query language, do not need to maintain multicast data structures to notify subscribers, and do not rely on a hierarchical (two-tier) architecture.
3. THE DHTRIE ARCHITECTURE
In the DHTrie network, nodes can implement a basic router service and any (or both) of the following types of services: subscriber service and publisher service. To facilitate efficient messaging between nodes, extensions to the Chord [18] routing infrastructure have been developed and we briefly describe them in Section 3.2.
3.1 Types of Services
Router service. All nodes in the DHTrie system form the message routing layer of the network by implementing the router service. Each node that implements this service runs a DHT protocol which is an extension of Chord. The role of the DHT in DHTrie is threefold:

• To act as a rendezvous point between information producers and information consumers.

• To serve as a robust, fault-tolerant and scalable routing infrastructure. When the number of nodes is small, each node can easily maintain a complete routing table; when the network grows in size, the DHT provides a scalable means of locating other nodes.

• To serve as a global metadata index that is partitioned among nodes and can be queried efficiently.

Subscriber service. Nodes implementing only the subscriber service are called subscribers. Subscribers are information consumers: they can subscribe to resource publications and receive notifications about published resources that match their interests. If subscribers are not online at notification time, notifications matching their interests are stored by their DHT successors and delivered once the clients reconnect. Resource requests are handled directly by the publisher node that owns the resource.

Publisher service. This service is implemented by information producers (e.g., digital libraries or users) that want to expose their content to the rest of the network. A node implementing only the publisher service uses the router service to connect to the rest of the network. To implement the publisher service, an information producer must create metadata for the resources it stores using the AWPS data model [11]. Figure 2 shows how the subscription, publication and notification protocols are realized in DHTrie.
3.2 Content-Based Multicasting
In DHTrie we use a well-understood attribute-value data model called AWPS [11]. The query language of AWPS allows equality, containment and similarity operators on attribute values. The containment operator offers conjunction, disjunction, negation of words and also distance between words. Here for simplicity we consider only keyword queries and describe how they are indexed in DHTrie. Let us assume that a subscriber node wants to submit a continuous query q, consisting of k distinct terms. It uses the DHT hash function to obtain k identifiers and uses the DHT to route q to the nodes responsible for these identifiers. This way q is stored at k nodes in the network, and these nodes are responsible for matching the query against new publications and notifying the subscriber. This query indexing mechanism does not affect recall, since the whole
query q is indexed at k different nodes. In this way, at publication time, the full query is matched against the incoming document and only exact matches are returned to the users. When a node wants to publish a document d, it identifies all the distinct words in it, and uses the DHT hash function to obtain at most as many identifiers as the number of distinct words in d. It then contacts all nodes responsible for these words, since these are the nodes that may store queries matching d. The local matching of incoming documents against stored queries performed at each peer uses appropriate local filtering algorithms such as [23]. From the above procedure it is clear that query placement in DHTrie is deterministic and depends upon the terms contained in the query. Publication is also deterministic, and depends on the words contained in the incoming document. The role of the DHT is to partition the term space and act as a rendezvous point for queries and documents. This query indexing strategy allows DHTrie to achieve the recall of a centralized system. The DHTrie protocols are described in detail in [22].

To facilitate message sending between nodes we use the function send(msg, I) to send message msg from some node to the node responsible for identifier I. Function send() is similar to the Chord function lookup(I) and costs O(log N) overlay hops in a network of N nodes. When function send(msg, I) is invoked by node S, it works as follows. S contacts S', where id(S') is the greatest identifier contained in the finger table of S for which id(S') ≤ I holds. Upon reception of a send() message by a node S, I is compared with id(S). If id(S) < I, then S simply forwards the message by calling send(msg, I) itself. If id(S) ≥ I, then S processes msg since it is the intended recipient.

From the subscription and publication procedures briefly described above, it is clear that a node needs to send the same message to a group of other nodes. This group creation is dynamic: it is specified by the message originator and depends on the resource publication or query submission that takes place. In this way, group members are implicitly subscribed to groups depending on the keys they are responsible for. Thus, our multicast groups are not known a priori, and this is the major difference between our content-based multicast scheme and the standard enrollment-based schemes found in systems such as Scribe [17]. In those schemes receivers explicitly subscribe to multicast groups, which maintain special data structures (e.g., multicast trees) to be able to notify subscribers. To facilitate this message sending we have designed and implemented three different methods that illustrate the tradeoffs between message traffic and publication latency.

• The Iterative method. The obvious way to handle content-based multicasting over Chord is to create k different send() messages, where k is the number of different nodes to be contacted, and then locate the recipients of the message in an iterative fashion using O(k log N) messages. The iterative method optimizes latency since messages are sent in parallel.

• The Recursive method. Using the iterative method requires asking the same routing question to the same node multiple times, which leads to an increase in message traffic. The idea behind the recursive method is to pack messages together and exploit other nodes' routing tables to reduce network traffic. This is achieved
by the use of a recipients list that is sent along with the actual message. This list contains the identifiers of all intended recipients sorted in ascending order. The message is then sent to the node responsible for the first identifier in the list. Once this node is reached, the message is copied for matching against the node's query database, this identifier is removed from the list, and the message is forwarded in the same recursive way to the rest of the recipients in the list. If the recipients list is long, then the last recipient has to wait for a long time until it is notified about the publication, which means higher publication latency.

• The Hybrid method. The idea behind the hybrid method is to design a tunable alternative that provides fast delivery of messages at low network cost. Thus, the hybrid method is a combination of the techniques used in the two previous methods, and is able to trade network traffic for publication latency. The identifiers of all intended recipients are split into several small recipient lists, and the messages containing the lists are sent iteratively, while the intended recipients within each list are reached in a recursive way. This splitting of lists uses a system parameter called maximum list size and is based on the finger table entries of the message originator. Each finger table entry is associated with a number of lists so as to better exploit the routing information already present in the finger tables. In the experiments presented in the next section, we use a maximum list size of 30. Our experiments showed that this approach is slightly more expensive than the recursive method in terms of network traffic (but still significantly cheaper than the iterative one), whereas it performs well in terms of latency. In general, it is a versatile algorithm that combines the benefits of the two previous methods to keep both message traffic and latency low.

[Figure 2: Subscription, publication and notification procedure in DHTrie. Subscription: the subscriber S chooses a peer in the network using the DHT mapping function and uses the DHT to send its query to this peer; the receiving peer R stores the query and notifies the subscriber when a publication matching the query arrives at any network node. Publication and notification: publishers P1, P2, P3 use the DHT to contact a peer R storing a possibly matching query; R matches the new documents against its local query database and notifies the subscriber S about documents that match its query. Subscription messages are sent using the DHT; notifications are sent point-to-point.]
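To make the content-based multicast concrete, the following Python sketch illustrates the general idea under our own simplifying assumptions: a publisher hashes the distinct terms of a document to Chord-style identifiers and forwards the publication with a sorted recipients list, in the spirit of the recursive method described above. The class and function names (Node, term_to_id, multicast, etc.) are illustrative and do not correspond to the actual DHTrie implementation.

# Illustrative sketch of DHTrie-style content-based multicast (recursive method).
# All names and data structures here are assumptions made for exposition.
import hashlib

RING_BITS = 16
RING_SIZE = 2 ** RING_BITS

def term_to_id(term: str) -> int:
    """Hash a term to a Chord-style identifier, as the DHT hash function would."""
    return int(hashlib.sha1(term.encode()).hexdigest(), 16) % RING_SIZE

class Node:
    def __init__(self, node_id, ring):
        self.id = node_id
        self.ring = ring            # stand-in for Chord routing state: nodes sorted by id
        self.stored_queries = {}    # identifier -> list of continuous queries

    def successor(self, ident):
        """Return the first node whose id is >= ident (wrapping around the ring)."""
        for node in self.ring:
            if node.id >= ident:
                return node
        return self.ring[0]

    def deliver(self, doc, ident):
        """Match the publication against the queries stored locally under ident."""
        for query in self.stored_queries.get(ident, []):
            if all(term in doc["terms"] for term in query["terms"]):
                print(f"node {self.id}: notify subscriber of query {query['id']}")

    def multicast(self, doc, recipients):
        """Recursive method: a single message carries a sorted recipients list."""
        if not recipients:
            return
        head, *rest = recipients        # recipients are sorted in ascending order
        target = self.successor(head)
        target.deliver(doc, head)       # copy the message for local matching
        target.multicast(doc, rest)     # forward to the remaining recipients

def publish(origin, doc):
    """Hash every distinct term of the document and contact the responsible nodes."""
    ids = sorted({term_to_id(t) for t in doc["terms"]})
    origin.multicast(doc, ids)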
In combination with these methods, we also introduce a simple routing table (called FCache) that uses only local information and manages to further reduce network traffic and publication latency. Each node updates, with minimal bookkeeping, the FCache entries with the IP addresses of other nodes that are contacted often. In this way, the routing infrastructure can be bypassed and thus relieved of a significant load. This caching scheme provides a mapping between frequently used keys and the nodes responsible for them (similar to value proxying) and differs from caching mechanisms such as [6], where cached copies of the actual data are left along the path from the querying peer to the data provider peer. FCache is by design a self-maintained (outdated entries are repaired when needed, at no extra maintenance cost in terms of messages), self-tunable (frequently used entries need fewer repairs) and self-adaptive (entries are dynamic) data structure that increases the scalability and efficiency of the system.

Before closing the section, let us note that although our query partitioning scheme is by keyword, the setting is not similar to the one described in [12], since no posting lists of any sort need to be sent over the network to match a query against an incoming document. The only information sent across the network is a link to the document publisher, which may be used by nodes storing possibly matching queries. Additionally, the optimizations discussed in the previous paragraphs greatly improve the scalability of the proposed architecture, which follows the scalability guarantees of Chord. Further improving efficiency was also one of the main reasons that led to the development of MAPS, briefly described in Section 4.
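As an illustration of the FCache idea, the following minimal Python sketch shows a proxy table that maps frequently used identifiers directly to the address of the responsible node and falls back to DHT routing on a miss. The capacity, eviction policy and repair method shown here are our own assumptions, not details of the actual FCache design.

# Hypothetical sketch of an FCache-style proxy table (assumed API and policies).
class FCache:
    def __init__(self, capacity: int = 10_000):
        self.capacity = capacity
        self.entries = {}                                # identifier -> node address

    def resolve(self, ident: int, dht_lookup) -> str:
        """Return a node address for ident, falling back to the DHT on a miss."""
        addr = self.entries.get(ident)
        if addr is None:
            addr = dht_lookup(ident)                     # O(log N) Chord-style lookup
            self.remember(ident, addr)
        return addr

    def remember(self, ident: int, addr: str) -> None:
        if len(self.entries) >= self.capacity:
            self.entries.pop(next(iter(self.entries)))   # simplistic eviction (assumption)
        self.entries[ident] = addr

    def repair(self, ident: int, dht_lookup) -> str:
        """Drop a stale entry (e.g., the node left the network) and re-resolve it."""
        self.entries.pop(ident, None)
        return self.resolve(ident, dht_lookup)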
3.3 Experimental Evaluation
To show the scalability of our solutions we have implemented and experimented with six variations of the DHTrie protocols. Algorithms It, Re and Hy utilize the Iterative, Recursive and Hybrid method respectively in the publication protocol and do not use the FCache, whereas algorithms ItC, ReC and HyC utilize an FCache of size 10K. Figures 3 and 4 show the message traffic and latency per document observed when publishing 10K documents of average size 5415 words in networks ranging from 10K to 100K nodes.

[Figure 3: Performance in terms of overlay messages to achieve the recall of a centralized system (DHT messages per document vs. network size for algorithms It, ItC, Re, ReC, Hy, HyC).]

[Figure 4: Performance in terms of latency. Latency is measured as the longest chain of messages needed until the publication reaches all intended recipients (latency in hops vs. network size).]

The main observation is that both message traffic
and latency grow logarithmically with network size, due to the use of the DHT. Secondly, the effectiveness of the FCache is demonstrated: it achieves a six-fold reduction in both message traffic and latency. We have also conducted experiments [22] that show how each algorithm is affected by parameters such as FCache size, level of FCache training, document size and type of queries. In general, these experiments showed that algorithm ReC performs best when message traffic is the main concern: it remains relatively unaffected by network size, shows little sensitivity to increasing document sizes, and needs only a few valid FCache entries to deliver a large performance improvement. On the other hand, when publication latency is the main concern, algorithm ItC is a good candidate. It remains unaffected by network size, by the level of training and size of the FCache, and by the size of the publication, but this comes at the price of higher network traffic in the routing infrastructure. HyC is a tunable alternative to the previous approaches that manages to strike a balance between network traffic and publication latency: it is slightly more expensive than ReC in terms of network traffic, but still significantly cheaper than ItC, while performing well in terms of latency. Finally, we also studied three important cases of load balancing, namely query, routing and filtering load balancing, and devised a novel algorithm based on load shedding that manages to distribute the load evenly between the nodes [22].
4. THE MAPS ARCHITECTURE
In the previous section we presented an architecture that uses a DHT as the routing infrastructure to support IF and achieves the recall of a centralized system. In this section we present MAPS (Minerva Approximate Publish/Subscribe) [24, 3], a complementary approach that improves the scalability of the solution presented above by trading reduced recall for lower message traffic. Again, a DHT is used as the underlying routing mechanism, but its purpose is to disseminate statistics about the publications rather than the publications themselves, as in DHTrie. In the MAPS approach, only a few carefully selected, specialized and promising peers store the user query and are monitored for new publications. This approximate filtering relaxes the assumption, made in most pub/sub systems, that notifications may potentially be delivered from every producer, and thereby amplifies scalability.
4.1 Types of Services
The architecture of MAPS is based on the P2P search engine Minerva [2]. Each participating peer implements three types of services.

Directory service. All nodes that participate in the MAPS network implement the directory service. This service provides the DHT-based routing infrastructure and is responsible for maintaining a distributed statistics index for document and query terms. This index forms a conceptually global but physically distributed directory, which is layered on top of a Chord-style DHT and manages aggregated information about each peer's local knowledge in compact form. The DHT partitions the term space, such that every peer is responsible for the statistics of a randomized subset of terms within the global directory. To keep the IR statistics up to date, each peer distributes per-term summaries of its local index, along with its contact information, to the global directory. For efficiency reasons these messages are piggy-backed onto DHT maintenance messages and message batching is used. Additionally, to reduce the term space and improve recall, MAPS exploits correlations among query terms (e.g., the word Aguilera almost always appears together with the word Christina in a query) and considers multi-word terms. The DHT determines which peer is responsible for collecting statistics for a specific (multi-word) term; this peer maintains statistics and pointers to other nodes that publish documents containing this term.

Publication service. The publication service can be used by users that want to publish their own documents to the MAPS network or by digital libraries that want to expose their content. A node implementing the publication service uses the directory service to update statistics about the terms contained in the documents it publishes. Publisher nodes are also responsible for storing the continuous queries
submitted by the users and for matching them against their own publications. All queries that match a publication produce appropriate notifications to the interested subscribers. Additionally, a node implementing this service may use a thematically focused Web crawler to locate and publish documents on behalf of the user.

Subscription service. The subscription service is implemented by users that want to monitor specific data sources. This service is critical to the recall that will be achieved at filtering time, since it is responsible for selecting the peers that will index the query. The node selection procedure uses the directory service to discover and retrieve node statistics that facilitate query indexing. Once these statistics are retrieved, a ranking of the potential sources is computed and the user query is sent to the k top-ranked publishers. Only these publishers will be monitored for new publications, and since their publishing behavior may not be consistent over time, query repositioning is necessary to achieve higher recall.

[Figure 5: Subscription, publication and notification procedure in MAPS. Subscription: the subscriber S collects term statistics from directory nodes D1, D2 using the DHT and selects the peers it will monitor; S then forwards its query to the selected peers R1, R2, R3 using the DHT and will be notified only of documents published by these peers. Publication and notification: only publisher peers that store the subscriber's query (P2, P3) and produce matching documents notify the subscriber S, using point-to-point links.]
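To illustrate how such a term-partitioned directory might look, the following Python sketch shows per-term peer summaries being posted to, and collected from, a conceptually global directory. The PeerStats fields and the post/collect API are our own illustrative assumptions rather than the actual MAPS interfaces.

# Hedged sketch of a MAPS-style directory of per-term IR statistics.
# Field names and the API are assumptions made for illustration.
from dataclasses import dataclass

@dataclass
class PeerStats:
    peer: str          # contact information of the publishing peer
    df: int            # number of local documents containing the term
    cs: int            # total size of the peer's local collection
    timestamp: float   # when this summary was posted

class Directory:
    """Conceptually global directory, physically partitioned over the DHT by term."""
    def __init__(self):
        self.index = {}                             # term -> {peer -> PeerStats}

    def post(self, term: str, stats: PeerStats) -> None:
        # In MAPS these posts would be batched and piggy-backed onto DHT
        # maintenance messages; here the partition is simply updated directly.
        self.index.setdefault(term, {})[stats.peer] = stats

    def collect(self, term: str) -> "list[PeerStats]":
        """Used by subscribers to retrieve per-peer statistics for a query term."""
        return list(self.index.get(term, {}).values())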
4.2 Continuous Query Forwarding
A peer receiving a keyword query q with k distinct terms has to determine which peers in the network are promising candidates to satisfy q with documents published in the future. Thus, for each term t contained in q the peer requests per-peer statistics from the directory service. The querying peer then computes a peer score based on a combination of resource selection and peer behavior prediction formulas, as shown in the equation below:

score(p, q) = α · sel(p, q) + (1 − α) · pred(p, q)

The tunable parameter α controls the balance between authorities (high sel(p, q) scores) and promising peers (high pred(p, q) scores) in the final ranking. Based on the scores calculated for each peer, a ranking of peers is determined, and q is forwarded to the highest ranked peers. Only publications from these peers will be matched against q, and the matching documents will produce appropriate notifications to subscribers. Since publishing is a dynamic process, the querying peer may need to reposition q periodically to achieve higher recall.

To show why an approach that relies only on resource selection is not sufficient, and to give the intuition behind node behavior prediction, consider the following example. Assume a node P that has specialized and become an authority in sports but no longer publishes relevant documents. Another node P' is not specialized in sports but is currently crawling a sports portal. Imagine a user who wants to stay informed about the upcoming 2008 Olympic Games to be held in Beijing and subscribes with the continuous query 2008 Olympic Games. If the ranking function relied solely on resource selection, node P would always be chosen to index the user's continuous query, which would be wrong given that node P no longer publishes sports-related documents. Node P', on the other hand, would have to specialize in sports in order to be assigned a high score by the ranking function, a long procedure that is inapplicable in a filtering setting, which is by definition dynamic. The fact that resource selection alone is not sufficient is even more evident in the case of news items: news items have a short shelf-life, making them the worst candidate for slow-paced resource selection algorithms. The above example shows the need to make slow-paced selection algorithms more sensitive to the
publication dynamics in the network. We employ node behavior prediction to cope with these dynamics. The main contribution of our work with respect to predicting node behavior is to view the IR statistics as time-series data and to use statistical analysis tools to model node behavior. Time-series analysis accounts for the fact that data points taken over time have some sort of internal structure (e.g., trend or periodicity), and uses this observation to analyze older values and predict future ones. In the next section, we briefly discuss different time-series analysis techniques and present the double exponential smoothing technique that we use in our approach.
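As a minimal illustration of the scoring and forwarding step, the Python sketch below (with names of our own choosing) ranks peers by the combined score and forwards the query to the k best ones, assuming that sel and pred scores are already available; concrete formulas for both components are given in the following sections.

# Minimal sketch of MAPS-style peer ranking; sel and pred scores are assumed
# to be precomputed per peer (see the formulas in the following sections).
def score(sel_score: float, pred_score: float, alpha: float = 0.5) -> float:
    """score(p, q) = alpha * sel(p, q) + (1 - alpha) * pred(p, q)."""
    return alpha * sel_score + (1 - alpha) * pred_score

def select_publishers(peer_scores: dict, k: int, alpha: float = 0.5) -> list:
    """peer_scores maps a peer id to a (sel, pred) tuple; return the k most
    promising peers that should index the continuous query."""
    ranked = sorted(peer_scores,
                    key=lambda p: score(*peer_scores[p], alpha=alpha),
                    reverse=True)
    return ranked[:k]

# Example: with alpha = 0 the ranking is driven purely by behavior prediction,
# so the currently active peer P' wins over the inactive authority P.
peers = {"P": (0.9, 0.0), "P_prime": (0.3, 0.8)}
print(select_publishers(peers, k=1, alpha=0.0))    # ['P_prime']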
4.2.1 Time-Series Analysis
To predict node behavior we consider time-series of IR statistics, thus making a rich repository of techniques from time-series analysis [5] applicable to our problem. These techniques predict future time-series values based on past observations and differ in their assumptions about the internal structure of the time-series (e.g., whether it exhibits a trend or seasonality). Next, we give a concise overview of the most important techniques that we considered, and explain why we chose double exponential smoothing for our setting. For our explanations, let x_1, . . . , x_{n−1} denote the observed time-series values and let x*_n be the predicted value.

Moving average techniques are a prominent group of time-series prediction techniques. The simplest, the single moving average, uses the mean of the most recent k observations to predict the next time-series value, i.e.,

x*_n = (x_{n−k} + · · · + x_{n−1}) / k.

Two common objections to moving average techniques are that they cannot cope well with trends in the data and that they assign equal weights to all past observations. Both weaknesses are critical in our scenario. The considered IR statistics exhibit trends, for instance, when nodes successively crawl sites that belong to different topics, or gradually change their thematic focus. In our setting it is also reasonable to put emphasis on a node's recent behavior and thus to assign higher weight to recent observations.

Exponential smoothing techniques, the second group of prediction techniques considered here, address both issues. Single exponential smoothing is similar to the moving average technique presented above, but takes into account all past observations (instead of only the last k) with exponentially decaying weights. The smoothed value S_n that is used as a predictor for x*_n is recursively defined as

S_n = η · x_{n−1} + (1 − η) · S_{n−1}.

The decay parameter η controls the speed at which older observations are dampened: when η is close to 1, dampening is quick, and when η is close to 0, dampening is slow. Like the moving average techniques, single exponential smoothing cannot cope appropriately with trends in the observed data. Double exponential smoothing, the technique that we use, eliminates this weakness by taking trends in the observed data into account. The technique maintains two smoothed values, L_n and T_n, representing the level and the trend respectively. A predictor for the next time-series value is obtained as

x*_n = L_n + T_n
where L_n and T_n are recursively defined as

L_n = η · x_{n−1} + (1 − η) · (L_{n−1} + T_{n−1}),
T_n = γ · (L_n − L_{n−1}) + (1 − γ) · T_{n−1}.
The parameter γ, analogous to η, is introduced to dampen the effect of the trend over time. Both parameters are estimated by comparing, on a similar data set, the predicted values with the real values; η and γ are then chosen such that the mean square error is minimized. For completeness we mention that there is also triple exponential smoothing which, in addition, handles seasonality in the observed data. Since many queries are expected to be short-lived, so that no seasonality will be observed in the IR statistics time-series, we do not consider seasonality in our predictions. For an application with many long-lasting queries, one could use triple exponential smoothing so that seasonality is taken into account.
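The following small Python sketch implements the double exponential smoothing recurrence above and returns the one-step-ahead prediction L_n + T_n; the initialization of the level and trend is a common convention we assume here, as it is not prescribed by the text.

# Sketch of double exponential smoothing as defined above; eta smooths the
# level, gamma smooths the trend, and level + trend predicts the next value.
def double_exponential_smoothing(observations, eta: float, gamma: float) -> float:
    """Return the one-step-ahead prediction x*_n for the observed series."""
    if len(observations) < 2:
        raise ValueError("need at least two observations")
    # Common initialization (an assumption): level = first value,
    # trend = difference of the first two values.
    level, trend = observations[0], observations[1] - observations[0]
    for x in observations[1:]:
        prev_level = level
        level = eta * x + (1 - eta) * (prev_level + trend)
        trend = gamma * (level - prev_level) + (1 - gamma) * trend
    return level + trend

# Example: a linearly growing df series is extrapolated one step further (12.0).
print(double_exponential_smoothing([2, 4, 6, 8, 10], eta=0.5, gamma=0.5))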
4.2.2 Node Behavior Prediction
The function pred(P, q) returns a score for a node P that represents its likelihood of publishing documents that are relevant to query q in the future. Using the double exponential smoothing technique described above, two quantities are predicted. First, for every term t in query q, we predict the value of df_{P,t} (denoted df*_{P,t}) and use the difference between the predicted value and the last value received from the directory, denoted \hat{df}*_{P,t} (the symbol ˆ signifies a difference), to calculate the score for P. The value \hat{df}*_{P,t} reflects the number of relevant documents that P will publish in the next time unit. Second, we predict \hat{cs}*_P, the difference in the collection size of node P, which reflects the node's overall expected future publishing activity. We thus model two aspects of the node's behavior, namely its ability to publish relevant documents in the future (as reflected by the \hat{df}*_{P,t} values) and its overall expected future publishing activity (as reflected by the \hat{cs}*_P value). The idea behind including the second aspect is that it represents the node's potential for publishing additional relevant documents. The time-series of IR statistics that are needed as input to our prediction mechanism are obtained by the querying node by regularly retrieving the underlying values from the directory. The predicted behavior of node P can now be quantified as follows:

pred(P, q) = Σ_{t∈q} ( log(\hat{df}*_{P,t} + 1) + log(\hat{cs}*_P + 1) )

In the above formula, the publishing of relevant documents is emphasized more strongly than the overall publishing rate. If a node publishes no documents at all, or, to be exact, if the prediction of \hat{cs}*_P, and thus also the prediction of \hat{df}*_{P,t}, is 0, then the pred(P, q) value is also 0. The addition of 1 inside the logarithms yields positive predictions and avoids log(0).
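A hedged Python sketch of this prediction score is given below; predict is assumed to be a one-step predictor such as the double exponential smoothing sketch shown earlier, and negative differences are clamped to zero purely to keep the logarithm defined, an assumption not spelled out in the text.

# Sketch of the behavior prediction score: differences between predicted and
# last reported statistics (df per query term, collection size cs) are combined
# as in the formula above. All function and parameter names are illustrative.
import math

def pred(df_series: dict, cs_series: list, query_terms: list, predict) -> float:
    """df_series maps a term to the time series of df values reported by the
    peer; cs_series is the time series of the peer's collection size."""
    cs_diff = max(predict(cs_series) - cs_series[-1], 0.0)     # \hat{cs}* term
    score = 0.0
    for term in query_terms:
        series = df_series.get(term, [0, 0])
        df_diff = max(predict(series) - series[-1], 0.0)       # \hat{df}* term
        score += math.log(df_diff + 1) + math.log(cs_diff + 1)
    return score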
4.2.3 Resource Selection
The function sel(P, q) returns a score for a node P and a query q, and is calculated using standard resource selection algorithms from the IR literature (such as simple tf-idf based methods, CORI, language models, etc.). Using sel(P, q) we can identify authorities specialized in a topic, which is not sufficient for our filtering scenario as we argued above. In our implementation we use a simple but efficient approach based on node document frequency (df) and maximum node term frequency (tf^max). The values for all query terms t are aggregated as follows:

sel(P, q) = Σ_{t∈q} ( β · log(df_{P,t}) + (1 − β) · log(tf^max_{P,t}) )
The value of the parameter β can be chosen between 0 and 1 and is used to weigh the importance of df against tf^max. Experiments with resource selection have shown that β = 0.5 is a satisfactory choice (see [14] for an overview of other node selection approaches).
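For illustration, a small Python sketch of this aggregation is shown below; adding 1 inside the logarithms is our own adjustment so that missing statistics contribute zero, whereas the formula above uses log(df) and log(tf^max) directly.

# Sketch of the resource selection score from the formula above, based on the
# per-term document frequency df and maximum term frequency tf_max of a peer.
import math

def sel(peer_df: dict, peer_tf_max: dict, query_terms: list, beta: float = 0.5) -> float:
    score = 0.0
    for term in query_terms:
        df = peer_df.get(term, 0)
        tf_max = peer_tf_max.get(term, 0)
        # +1 keeps the logarithm defined when a statistic is missing (assumption).
        score += beta * math.log(df + 1) + (1 - beta) * math.log(tf_max + 1)
    return score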
4.2.4 Term Correlation
To further enhance node selection and improve recall, we have devised techniques for capturing and exploiting correlations between terms in multi-term queries, as in [13]. A standard approach would decompose queries like "Christina Aguilera" into individual terms, identify the best nodes for each of the terms separately, and finally combine them in order to derive a candidate list of publishers to index the query. In this way term correlations are lost and recall decreases. Hence, statistical information about term co-occurrences is also maintained in the directory, so that, when possible, the resource selection and publication prediction techniques base their rankings on the correlated terms. Details on how this technique is implemented in an IF setting are omitted due to space reasons.
4.3 Resource Publication
When a peer p publishes a new document, the document is matched against p's local query index using appropriate local filtering algorithms such as [23], and notifications are triggered for the corresponding subscribers. Notice that only subscribers whose continuous queries are indexed at p will be notified about the new publication, since the document is not distributed to any other peer in the network. Thus, the placement of a subscriber's continuous query is a crucial decision. The publication and notification process itself does not incur any additional communication cost.
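The sketch below illustrates this publication step in Python, with a simple conjunctive keyword match standing in for the local filtering algorithms of [23]; the class and method names are our own.

# Illustrative sketch of the MAPS publication step: a new document is matched
# only against the queries indexed locally at the publishing peer, and the
# corresponding subscribers are notified over point-to-point links.
class PublisherPeer:
    def __init__(self):
        self.query_index = []                  # list of (subscriber_address, query_terms)

    def store_query(self, subscriber: str, terms: set) -> None:
        self.query_index.append((subscriber, terms))

    def publish(self, doc_id: str, doc_terms: set) -> None:
        for subscriber, terms in self.query_index:
            if terms <= doc_terms:             # all query terms appear in the document
                self.notify(subscriber, doc_id)

    def notify(self, subscriber: str, doc_id: str) -> None:
        print(f"notify {subscriber}: new matching document {doc_id}")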
4.4 Experimental Evaluation
We conducted experiments with 1000 peers specialized in ten categories (e.g., Sports or Travel); that is, for each category the local document collections of 100 peers host documents mainly from that topic. Our data collection consists of 2M categorized documents from a focused Web crawl, and we used continuous queries with two, three and four query terms that are strong representatives of the categories (example queries are modern art or music instrument). Subscribers submit the queries and monitor a fraction ρ of the publisher population, while periodically repositioning their queries according to the scoring function. We measure recall as the ratio of the total number of notifications received by the subscriber to the total number of published documents matching the subscriptions.

Figure 6 shows the experimental results for a scenario where publishers shift their interest to a different topic. Different values of α are used to investigate the influence of peer behavior prediction and resource selection in our scoring function. On the x-axis, the parameter ρ is varied from 1 to 25, meaning that up to 25% of the publishers are monitored by the subscriber. In this setting we observe that resource selection alone (α = 1) is not enough to reach a satisfying recall (it achieves less than 10% recall when monitoring 10% of the publishers). In contrast, when α is small (α = 0 or α = 0.25) and the emphasis is on peer behavior prediction, MAPS achieves 60% recall by monitoring only 10% of the publishers. In scalability experiments we observed a reduction in message traffic of as much as 8 times compared to DHTrie, at the cost of around 10%-20% of recall.

[Figure 6: High recall by monitoring only a few publishers is possible with peer behavior prediction (average recall vs. percentage of publishers monitored, for α = 0, 0.25, 0.5, 0.75, 1).]
5. COMPARISON
We now proceed to identify common elements in the two presented architectures, highlighting their differences and stressing the issues involved and lessons learned from this process. Table 1 provides a comparison between the two approaches in terms of design decisions made.
5.1 Routing Infrastructure
A crucial design decision in both approaches is the use of the DHT as the underlying routing infrastructure. Contrary
to filtering based on topics, where groups of subscriber nodes are usually organized in a rather static (but possibly distributed) tree forming interest groups, content-based filtering requires an efficient object location mechanism to support expressive query languages that capture the user's specific interests. This renders Gnutella-style networks an inefficient solution and the simple keyword functionality of standard DHTs an insufficient one. To overcome this limitation of exact lookup, both approaches extend the DHT functionality to support richer data models and more expressive queries. However, to support this functionality efficiently, changes and extensions to the DHT protocols and data structures are necessary. DHTrie uses message grouping and extra routing tables to overcome inefficiencies, whereas MAPS reduces network traffic by batch posting of term summaries, by piggy-backing post messages onto directory maintenance messages, and by shrinking the term space through the use of correlated multi-word terms.
5.2 Query Placement
Distributed pub/sub systems involve some sort of peer selection technique to decide where to place a user query. This selection is critical, since future publications that may match the query must be able to reach the node storing it in order to trigger a notification in case of a match. Query placement in DHTrie is deterministic and depends upon the terms contained in the query and the hash function provided by the DHT. To decide where a keyword query should be placed, all query terms are hashed and the query is forwarded to the nodes responsible for these identifiers, which ensures correctness at publication time. These query placement protocols lead to filtering performance that is exactly the same as that of a centralized system. On the other hand, in MAPS only the most specialized and promising nodes store a user query and are thus monitored. Publications produced by each node are matched against its local query database only, since for scalability reasons no publication forwarding is used. In this case recall is lower than that of DHTrie, but document-granularity dissemination to the network is avoided.
| | DHTrie | MAPS |
| objective | exact pub/sub functionality (queries are indexed, all publishers are monitored) | approximate pub/sub functionality (only selected publishers are monitored) |
| advantages and disadvantages | + retrieval effectiveness (recall of a centralized system); − dependent on publication rate; − relatively high message traffic | + low network traffic, scalability; + independent of publication rate; − lower recall (missing potentially interesting publications) |
| routing | Chord DHT (to index queries) | Chord DHT (to maintain statistics) |
| DHT optimizations | (i) message grouping, (ii) extra routing table | (i) batch messaging, (ii) piggy-backing to DHT messages, (iii) term-space reduction |
| query placement | deterministic indexing (depends on query terms) | on selected nodes (depends on publishing behavior) |
| term statistics | (i) implicit (due to indexing), (ii) needed for matching | (i) explicit (peers post and collect statistics), (ii) needed for peer ranking and matching |
| load balancing | (i) explicit (algorithm to address imbalances), (ii) more sensitive to load imbalances | (i) implicit (due to query placement), (ii) less sensitive to load imbalances |

Table 1: A comparison between the two approaches presented.
5.3 Statistical Information
Another issue in content-based filtering that supports IR-based models and languages is the distributed management of statistical information. Matching incoming documents against stored subscriptions involves the maintenance and estimation of important global statistics such as the document frequency of a certain term, i.e., the number of distinct documents seen in the last interval that contain a specific keyword. In DHTrie this functionality is implicit: every time a new document is published by some node, it reaches the nodes responsible for the distinct keywords contained in the document, since these are the candidate nodes that may store continuous queries potentially matching the document. These nodes do all the necessary bookkeeping of the statistical information, and can be queried for it when other nodes need to compute a similarity score. MAPS, on the other hand, explicitly addresses this issue with the maintenance of a directory. Notice that in the case of DHTrie the statistics maintenance is a byproduct of the filtering process and is only necessary for computing similarity scores, whereas in MAPS it is the cornerstone of the filtering process, since both peer selection (and thus query routing) and similarity computation rely on it.
5.4 Load Balancing
In typical IF scenarios the probability distributions associated with documents and queries can be arbitrary and are typically skewed. For example, the frequency of occurrence of words in a document collection follows a Zipf distribution; similarly, subscriptions to an electronic journal might refer mostly to current hot topics, while publications appearing in the same journal might reflect its established tradition. Thus, a key issue that arises when trying to partition the query space among the different nodes of a DHT in a pub/sub scenario is to achieve load balancing. In such a pub/sub setting we can distinguish three types of node load: query, routing and filtering load. The query load of a node is a function of the number of queries stored at it. The routing load of a node is a function of the number of messages that the node has to forward due to the overlay maintenance protocols. Finally, the filtering load of a node is a function of the number of filtering requests (i.e., publications) that need to be processed at this node.

The effect of load imbalances between nodes is different in the two approaches. DHTrie nodes are more susceptible to load imbalances due to the nature of the subscription and publication protocols. Through the mapping of the DHT hash function, an overloaded node or a node with small processing capabilities may become responsible for a popular term, and thus be forced to store high volumes of user queries and to process high numbers of publications. The DHTrie load balancing mechanisms are based on load shedding and prove efficient even for highly skewed distributions. MAPS, on the other hand, has an intrinsic load balancing mechanism to cope with imbalances: assuming that nodes are willing to offer some of their resources to the community, a node with resources to share soon becomes a hub node for some topic and receives more subscriptions, whereas a node with few resources will be forced to specialize more, and thus reduce the number of users that are interested in its publications. MAPS is more sensitive to filtering load imbalances (i.e., the load imposed by the filtering requests that need to be processed), since a directory node responsible for a popular term will receive high numbers of statistics retrieval requests. Finally, the routing load (i.e., the load imposed by the messages a node has to forward due to the overlay maintenance protocols) is similar in both systems, and mainly depends on the DHT usage.
5.5 Caching
Another dimension of comparison between the two systems is the way caching is used to reduce network traffic. In DHTrie, the caching of frequently contacted nodes through the use of the FCache helps reduce overlay hops at publication time, latency, and overall message traffic. This caching mechanism is built using local information only and is maintained at no extra effort by reusing routing information discovered through the DHT. This approach could also be used in MAPS to reduce network stress at subscription time. However, since in many pub/sub scenarios publications arrive at a higher rate than subscriptions (e.g., in news feeds), the benefits to the MAPS protocols would not be considerable. This last observation also summarizes the different philosophies of the two protocols: DHTrie uses more messages since it focuses on publications (which normally arrive at high rates) but also achieves higher recall, while MAPS monitors selected nodes and focuses only on the publications produced by them, thus reducing message traffic at the expense of recall.
6. CONCLUSIONS
In this paper we have described two pub/sub architectures that are able to support an expressive data model in a distributed dynamic environment. Supporting expressive query languages on top of distributed data structures that are mainly designed for exact-match lookups, such as DHTs, may lead to pitfalls as suggested in [12]. Both systems avoid these problems by different approaches: DHTrie relies on routing optimizations and intelligent caching mechanisms, while MAPS relies on selective monitoring of sources and smart indexing mechanisms that avoid page-granularity indexes. We have identified the strong and weak points of each architecture and highlighted important architectural considerations that should be taken into account when designing a P2P pub/sub system.
7. REFERENCES
[1] I. Aekaterinidis and P. Triantafillou. PastryStrings: A Comprehensive Content-Based Publish/Subscribe DHT Network. In ICDCS, 2006.
[2] M. Bender, S. Michel, P. Triantafillou, G. Weikum, and C. Zimmer. Improving Collection Selection with Overlap-Awareness. In SIGIR, 2005.
[3] K. Berberich, M. Koubarakis, C. Tryfonopoulos, G. Weikum, and C. Zimmer. MAPS: Approximate Publish/Subscribe Functionality in Peer-to-Peer Networks. In ADPUC, 2006.
[4] A. Bharambe, M. Agrawal, and S. Seshan. Mercury: Supporting Scalable Multi-Attribute Range Queries. In SIGCOMM, 2004.
[5] C. Chatfield. The Analysis of Time Series - An Introduction. CRC Press, 2004.
[6] F. Dabek, M. F. Kaashoek, D. Karger, R. Morris, and I. Stoica. Wide Area Cooperative Storage with CFS. In SOSP, 2001.
[7] D. Faensen, L. Faulstich, H. Schweppe, A. Hinze, and A. Steidinger. Hermes – A Notification Service for Digital Libraries. In JCDL, 2001.
[8] B. Gedik and L. Liu. PeerCQ: A Decentralized and Self-Configuring Peer-to-Peer Information Monitoring System. In ICDCS, 2003.
[9] A. Gupta, O. D. Sahin, D. Agrawal, and A. E. Abbadi. Meghdoot: Content-Based Publish/Subscribe over P2P Networks. In Middleware, 2004.
[10] R. Huebsch, J. M. Hellerstein, N. Lanham, B. T. Loo, S. Shenker, and I. Stoica. Querying the Internet with PIER. In VLDB, 2002.
[11] M. Koubarakis, T. Koutris, C. Tryfonopoulos, and P. Raftopoulou. Information Alert in Distributed Digital Libraries: The Models, Languages, and Architecture of DIAS. In ECDL, 2002.
[12] J. Li, B. T. Loo, J. M. Hellerstein, M. F. Kaashoek, D. R. Karger, and R. Morris. On the Feasibility of Peer-to-Peer Web Indexing and Search. In IPTPS, 2003.
[13] S. Michel, M. Bender, N. Ntarmos, P. Triantafillou, G. Weikum, and C. Zimmer. Discovering and Exploiting Keyword and Attribute-Value Co-Occurrences to Improve P2P Routing Indices. In CIKM, 2006.
[14] H. Nottelmann and N. Fuhr. Evaluating Different Methods of Estimating Retrieval Quality for Resource Selection. In SIGIR, 2003.
[15] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A Scalable Content-Addressable Network. In SIGCOMM, 2001.
[16] A. Rowstron and P. Druschel. Pastry: Scalable, Decentralised Object Location and Routing for Large-Scale Peer-to-Peer Systems. In Middleware, 2001.
[17] A. Rowstron, A.-M. Kermarrec, M. Castro, and P. Druschel. Scribe: The Design of a Large-Scale Event Notification Infrastructure. In COST264, 2001.
[18] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. In SIGCOMM, 2001.
[19] J. Stribling, I. G. Councill, J. Li, M. F. Kaashoek, D. R. Karger, R. Morris, and S. Shenker. OverCite: A Cooperative Digital Research Library. In IPTPS, 2005.
[20] C. Tang and Z. Xu. pFilter: Global Information Filtering and Dissemination Using Structured Overlays. In FTDCS, 2003.
[21] C. Tryfonopoulos, S. Idreos, and M. Koubarakis. LibraRing: An Architecture for Distributed Digital Libraries Based on DHTs. In ECDL, 2005.
[22] C. Tryfonopoulos, S. Idreos, and M. Koubarakis. Publish/Subscribe Functionality in IR Environments using Structured Overlay Networks. In SIGIR, 2005.
[23] C. Tryfonopoulos, M. Koubarakis, and Y. Drougas. Filtering Algorithms for Information Retrieval Models with Named Attributes and Proximity Operators. In SIGIR, 2004.
[24] C. Tryfonopoulos, C. Zimmer, M. Koubarakis, and G. Weikum. Architectural Alternatives for Information Filtering in Structured Overlay Networks. IEEE Internet Computing, 2007.
[25] C. Zimmer, C. Tryfonopoulos, and G. Weikum. MinervaDL: An Architecture for Information Retrieval and Filtering in Distributed Digital Libraries. In ECDL, 2007.