Epidemic Algorithms for Reliable Content-Based Publish-Subscribe
Paolo Costa, Matteo Migliavacca, Gian Pietro Picco, Gianpaolo Cugola∗
11th April 2003
∗ Authors' affiliation: Dip. di Elettronica e Informazione, Politecnico di Milano, P.za Leonardo da Vinci, 32, 20133 Milano, Italy. E-mail: {pcosta,migliava,picco,cugola}@elet.polimi.it.
Abstract Distributed content-based publish-subscribe middleware provides the necessary decoupling, flexibility, expressiveness, and scalability required by modern distributed applications. Unfortunately, this kind of middleware usually does not provide reliability guarantees, as this problem has thus far been largely disregarded by the research community and solutions developed in other contexts are not immediately applicable. In this paper, we tackle the problem of introducing reliability in content-based publish-subscribe by exploiting epidemic algorithms, whose characteristics in terms of decentralization, scalability, and resilience to topological changes resonate with our problem. The three algorithms we present, and their relative tradeoffs, are evaluated through simulation in challenging scenarios, showing that they indeed significantly improve event delivery, are scalable, and introduce only limited overhead.
1 Introduction Publish-subscribe middleware has recently become popular because of its asynchronous, implicit, multi-point, and peer-to-peer style of communication. Components in a publish-subscribe system are strongly decoupled: they can be easily replaced, thus providing a high degree of flexibility both at the application and infrastructure level. A number of publish-subscribe systems have been proposed to date. In this paper we focus on those that seek increased scalability and flexibility by exploiting a distributed architecture for event dispatching, and that empower the programmer with maximum expressiveness by using a content-based scheme for determining the match between an event and a subscription. Representative examples are [4, 28, 26, 3, 9]. Although the publish-subscribe model enjoys a growing popularity, we observe that the characteristics of the available systems still fall short of expectations in many respects. For instance, this paper is motivated by the observation that the reliability of the distributed event dispatching infrastructure is rarely guaranteed by dedicated mechanisms: instead, it is typically delegated to the underlying transport protocol, e.g., by assuming the existence of TCP links. Unfortunately, this approach is overly restraining in several scenarios, including simple ones characterized by small
scale and a static network topology. For instance, communication can be implemented on top of unreliable transport protocols like UDP for performance reasons; moreover, links and nodes of the dispatching infrastructure may fail altogether. Clearly, the situation is exacerbated in the more dynamic scenarios that are increasingly characterizing modern distributed computing, where publish-subscribe would find its natural use. As an example, mobile computing implies a continuously changing network topology, where reliable links are often difficult to maintain and where the event dispatching infrastructure is itself continuously reconfigured, providing an additional source of event loss. In this paper, we present solutions for reliable publish-subscribe, and analyze their effectiveness by means of simulation. The starting point of our research was the desire to combine our previous work on efficient reconfiguration of content-based routing in the presence of changes in the underlying network topology [10, 11, 21] with a mechanism to minimize the loss of events caused by such reconfiguration. Nevertheless, the contribution we put forth here goes well beyond this original goal, in that it is not tied to a specific source of event loss and hence it enjoys general applicability. As such, it is useful also in more traditional scenarios, e.g., those characterized by a fixed topology and unreliable links. The approach we investigate in this paper relies on epidemic algorithms [2, 17, 14], a breed of distributed algorithms that find inspiration in the theory of epidemics. These algorithms aim at providing a lightweight, scalable, and robust means of reliably disseminating information to a group of recipients, by providing guarantees only in probabilistic terms. Given their characteristics, epidemic algorithms are amenable to the unreliable and highly dynamic scenarios we target. At the same time, epidemic algorithms have never been applied to content-based publish-subscribe, and previous results in other fields, e.g., multicast communication, cannot be easily adapted to such a scenario. In this paper, we describe three different epidemic algorithms for content-based publish-subscribe, and evaluate their performance. The results show that our solutions indeed significantly improve event delivery, are scalable, and introduce only limited overhead. The paper is structured as follows. Section 2 provides the reader with the necessary background information concerning content-based publish-subscribe systems and epidemic algorithms. Section 3 analyzes the challenges posed by the application of epidemic algorithms in the specific context of content-based routing. Section 4 presents three algorithms we designed to provide reliability in content-based publish-subscribe systems. Section 5 evaluates their performance and the tradeoffs involved. Finally, Section 6 places our contribution in the context of related work, and Section 7 ends the paper with brief concluding remarks.
2 Background In this section we provide the reader with the background information about content-based publish-subscribe systems and epidemic algorithms necessary to grasp the contribution put forth by this paper.
2.1 Content-Based Publish-Subscribe The last few years have seen the development of a large number of publish-subscribe middleware systems, which differ along several dimensions1. Two are usually considered fundamental: the expressiveness of the subscription language and the architecture of the event dispatcher. The expressiveness of the subscription language draws a line between subject-based systems, where subscriptions identify only classes of events belonging to a given channel or subject, and content-based systems, where subscriptions contain expressions (called event patterns) that provide increased flexibility and expressiveness through sophisticated matching on the event content. The architecture of the event dispatcher can be either centralized or distributed. In this paper, we focus on publish-subscribe middleware with a distributed event dispatcher. In such middleware, a set of dispatching servers2 are connected in an overlay network, as shown in Figure 1. These servers cooperate in collecting subscriptions coming from clients and in routing events, with the goal of reducing the network load and increasing scalability.

Figure 1: A dispatching network with subscriptions laid down according to a subscription forwarding scheme.

Systems exploiting a distributed dispatcher can be further classified according to the interconnection topology of the dispatching servers, and the strategy exploited to route subscriptions and events. In this work we consider a subscription forwarding scheme on an unrooted tree topology, as this choice covers the majority of existing systems. In a subscription forwarding scheme [4], subscriptions are delivered to every dispatcher along a single unrooted tree connecting all the dispatchers, and are used to establish the routes that are followed by published events. When a client issues a subscription, a message containing the corresponding event pattern is sent to the dispatcher the client is attached to. There, the event pattern is inserted in a subscription table, together with the identifier of the subscriber. Then, the subscription is propagated by the dispatcher, which now behaves as a subscriber with respect to the rest of the dispatching network, to all of its neighboring dispatchers on the overlay network. In turn, they record the subscription and re-propagate it towards all their neighboring dispatchers, except for the one that sent it. This scheme is typically optimized by avoiding the propagation of subscriptions for the same event pattern in the same direction. The propagation of a subscription effectively sets up a route for events, through the reverse path from the publisher to the subscriber. Requests to unsubscribe from a given event pattern are handled and propagated analogously to subscriptions, although at each hop entries in the subscription table are removed rather than inserted.

Figure 1 shows a dispatching network with two dispatchers subscribed for a "black" pattern, and one for a "gray" pattern. Arrows represent the routes laid down according to these subscriptions, and reflect the content of the subscription tables of each dispatcher in the network. As a consequence of the subscription forwarding process we described, the routes for the two separate subscriptions are laid down on the single tree constituting the dispatching network. This choice is typical of content-based systems and is motivated by the fact that a single event may match multiple patterns. Routing on multiple independent trees, as typically done by subject-based systems, would lead to an inefficient duplication of events along the separate trees. Finally, here and in the rest of the paper we ignore the presence of clients and focus only on dispatchers. Accordingly, with some stretch of terminology we say that a dispatcher is a subscriber if at least one of its clients is, although in principle only clients can be subscribers.

1 For more detailed comparisons see [4, 9, 25].
2 Unless otherwise stated, in the following we refer to a dispatching server simply as a dispatcher, although the latter term denotes the whole distributed component in charge of dispatching events rather than a specific server.
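To make the subscription forwarding scheme concrete, the following toy Python sketch is our own illustration, not code from any of the systems cited above; dispatchers sit on an acyclic overlay, and matching is reduced to membership of a pattern name in the event, a deliberate simplification of content-based matching.

    class Dispatcher:
        """Toy dispatcher illustrating subscription forwarding on an acyclic overlay."""

        def __init__(self, name):
            self.name = name
            self.neighbors = []       # adjacent dispatchers in the overlay tree
            self.table = []           # subscription table: (interested neighbor or None, pattern)
            self.delivered = []       # events delivered to local clients

        def link(self, other):
            self.neighbors.append(other)
            other.neighbors.append(self)

        def subscribe(self, pattern, frm=None):
            """frm is None for a local client subscription, otherwise the sending dispatcher."""
            already_known = any(p == pattern for _, p in self.table)
            self.table.append((frm, pattern))
            if already_known:
                return                # optimization: this pattern was already propagated from here
            for n in self.neighbors:
                if n is not frm:      # re-propagate to every neighbor except the sender
                    n.subscribe(pattern, frm=self)

        def publish(self, event, frm=None):
            """An event is modelled as the set of pattern names it matches."""
            if any(dest is None and pattern in event for dest, pattern in self.table):
                self.delivered.append(event)              # a local client is subscribed
            for dest in {d for d, pattern in self.table if pattern in event and d not in (None, frm)}:
                dest.publish(event, frm=self)             # follow the reverse subscription route

    # Three dispatchers in a line, d1 - d2 - d3; a client of d3 subscribes to "black".
    d1, d2, d3 = Dispatcher("d1"), Dispatcher("d2"), Dispatcher("d3")
    d1.link(d2)
    d2.link(d3)
    d3.subscribe("black")
    d1.publish({"black"})
    print(d3.delivered)               # [{'black'}]

In this sketch d3 receives the event published at d1 because its subscription for "black" laid down a reverse route through d2, mirroring the arrows of Figure 1.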
2.2 Epidemic Algorithms Epidemic (or gossip) algorithms (e.g., [2, 17, 14]) have recently become popular as a solution for scalable and reliable multicast dissemination of information. These algorithms are inspired by the theory of epidemics, in that communication is achieved by trying to "infect" as many nodes as possible. In essence, gossip algorithms trade the strong reliability guarantees, typical of the traditional deterministic approaches, for better scalability, achieved at the price of weaker guarantees defined only in probabilistic terms. Although epidemic algorithms were originally developed to deal efficiently with the consistency management of replicated databases [12], they have been applied to date to a number of relevant problems, including dissemination of news through NNTP and multicast in ad hoc mobile networks [6, 18].
2.2.1 Basic Concepts
The idea underlying this family of algorithms is for each process to periodically communicate its knowledge about the system "state" to a random subset of other processes. Hereafter, the state we consider is the set of messages that have appeared in the system so far. Missing messages are recovered through one or more "gossip rounds", during which other processes potentially holding a copy of the data are contacted. A single gossip round consists of the following steps:
1. Process A randomly chooses another process to communicate with, say B.
2. A sends to B information that allows the two to determine the presence of inconsistencies in their views of the system's state (e.g., the identifiers of the messages A has received, or missed).
3. A and B reconcile their state by exchanging the actual messages that are not part of the history of both.
Epidemic algorithms differ along two dimensions. The first one is the mode of communication, which can exploit a push or pull style. In a push style, each process gossips periodically, to disseminate its view of the system to other processes. Instead, in a pull style a process solicits the transmission of information from other processes to compensate for local losses. Demers et al. showed in [12] that a pull approach converges faster than push, provided that a majority of the participants have the requested message. This can be explained intuitively by considering a scenario where a broadcast message reaches all the receivers but one. In this case, the pull strategy allows the receiver who missed the message to immediately recover it instead of waiting to be pushed, thus improving message delivery latency. Another dimension along which gossip algorithms differ is the scheme used for disseminating the information about the process state, which can exploit either positive or negative gossip messages. In the former scheme, each gossip message sent by a process contains the state of communication as perceived by the process, e.g., the content of the process' event history. Hence, the gossip message lists all the events that the process has received lately. In a negative gossip scheme, instead, the gossip message contains the events that the process has missed. While in principle the style of communication, push vs. pull, and the information dissemination scheme, positive vs. negative, are orthogonal, in practice pull/negative and push/positive are the most meaningful cases and those typically exploited. In the presence of a pull strategy, a negative scheme can be naturally used to react to a missing message by
“pulling” it from other processes, while the positive scheme is best employed to proactively push a process' state to the rest of the system. This last remark highlights a key difference between the two solutions we are considering, which impacts their performance and applicability. The pull/negative strategy is intrinsically reactive, in that it is triggered only when a process realizes it has lost a message; otherwise, no action is necessary. Instead, push/positive must be implemented according to a proactive scheme, where gossip takes place periodically. Typically, the high degree of reactivity provided by the first approach is preferable. Nevertheless, in some scenarios, e.g., those characterized by a low rate of communication, a process may not realize it has missed a message until the next one is received. In this case, the other approach may be preferable. Clearly, the two approaches are not mutually exclusive, and mixed approaches are also possible [12, 14].
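As an illustration of the two styles, the following minimal Python sketch is ours (not the paper's); process state is reduced to a dictionary of message identifiers, and the request step of a real push exchange is collapsed into a direct transfer.

    import random

    class Process:
        """Toy process; its state is the set of message identifiers received so far."""

        def __init__(self):
            self.store = {}           # message id -> payload
            self.missing = set()      # ids this process knows it has missed

        # pull / negative digest: react to known losses
        def pull_round(self, peers):
            if not self.missing or not peers:
                return
            peer = random.choice(peers)                    # 1. pick a random partner
            digest = set(self.missing)                     # 2. send the ids we are missing
            for mid, msg in peer.answer(digest).items():   # 3. reconcile the two states
                self.store[mid] = msg
                self.missing.discard(mid)

        def answer(self, digest):
            return {mid: self.store[mid] for mid in digest if mid in self.store}

        # push / positive digest: proactively advertise what we hold
        def push_round(self, peers):
            if not peers:
                return
            peer = random.choice(peers)
            for mid in set(self.store) - set(peer.store):
                peer.store[mid] = self.store[mid]
                peer.missing.discard(mid)

    a, b = Process(), Process()
    b.store = {1: "e1", 2: "e2"}
    a.store, a.missing = {1: "e1"}, {2}
    a.pull_round([b])
    print(a.store)                    # {1: 'e1', 2: 'e2'}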
2.2.2 Advantages and Motivation
The probabilistic and decentralized nature of these algorithms gives them a number of desirable properties. Gossip algorithms impose a constant, equally distributed load on the processes in the system, and are very resilient to changes in the system configuration (e.g., topological changes) since they do not rely on the existence of one or more specific processes. Moreover, these properties are preserved as the size of the system increases, thus leading to good scalability. Finally, these algorithms are very simple to implement and rather inexpensive to run. Gossip algorithms are therefore a good match for highly distributed and dynamic scenarios. Nevertheless, their application to content-based publish-subscribe systems is not straightforward. Content-based systems pose peculiar challenges that have not been tackled thus far by the research community, which at best has concentrated on the simpler subject-based publish-subscribe systems. Still, the synergy between the two approaches is worth investigating, since a content-based approach enhances the underlying publish-subscribe middleware with unprecedented levels of flexibility, hence simplifying the programmer's task. The challenges arising from the use of epidemic algorithms for content-based publish-subscribe are examined next.
3 Epidemic Algorithms and Content-Based Publish-Subscribe Epidemic algorithms typically rely on some notion of group (or subject) that is exploited in at least two ways. On one hand, the group defines the set of nodes (the group members) that collectively define the scope of a gossip interaction.
Hence, it provides a way to determine how to route gossip messages within the system. On the other hand, the group is used to tag the messages exchanged within it. Hence, it provides a clue for the recovery process when a message gets lost. These criteria have been applied successfully in systems that already provide a notion of group for the purpose of enabling communication, e.g., multicast protocols, group communication facilities, and subject-based publish-subscribe. Unfortunately, content-based publish-subscribe systems do not provide an explicit notion of group. One could argue that a subscription to an event pattern can be treated as an implicit group, but a more careful analysis reveals that the analogy holds only to a given extent. In subject-based systems, each message is associated with one subject (the group), and routing is performed independently for each subject. Instead, in content-based systems routing is entirely based on the message content: hence, a single message can match different patterns—i.e., different groups in the aforementioned analogy. This observation is at the core of the challenge of applying epidemic algorithms to content-based publish-subscribe systems, which can be summarized by the following issues: • Detecting event loss. In subject-based systems, a simple solution to detect event loss relies on tagging a published event at its source s not only with its subject p, as already done by s for routing purposes, but also with a sequence number associated with s and p. This sequence number gets increased each time an event for the subject p is published by s. Receivers can then easily detect an event loss by discovering missing sequence numbers in the set of events received from a given source for a given subject (a minimal sketch of this bookkeeping follows this list). In content-based systems this technique must be somehow generalized, since an event may match many patterns instead of a single subject, and the source is not required to tag the published event in any way, since routing is entirely determined by the event content. • Routing gossip messages. In subject-based systems, the subject defines the set of nodes it is useful to gossip with, since it contains the set of receivers for a given event associated with that subject. In content-based systems, however, this set of nodes cannot be determined as easily since the subject notion is missing. In principle, gossip interactions should rely on content-based routing as much as possible, since it is precisely the event content that determines the set of potential recipients. Nevertheless, it is generally not possible to simply route gossip messages as normal events. This is particularly evident when a pull approach with negative digests is applied. In this case, a node starts a gossip round to retrieve an event that has been lost and whose content is, by definition,
not available.
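For the subject-based tagging recalled in the first issue above, loss detection amounts to tracking one sequence number per (source, subject) pair. The sketch below is our own illustration of this bookkeeping, with hypothetical names.

    from collections import defaultdict

    class SubjectReceiver:
        """Per-(source, subject) sequence-number bookkeeping for loss detection."""

        def __init__(self):
            self.expected = defaultdict(lambda: 1)   # next seqno expected per (source, subject)

        def on_event(self, source, subject, seqno):
            """Return the sequence numbers detected as lost for this (source, subject)."""
            exp = self.expected[(source, subject)]
            lost = list(range(exp, seqno))           # empty if nothing was skipped
            self.expected[(source, subject)] = max(exp, seqno + 1)
            return lost

    r = SubjectReceiver()
    assert r.on_event("s1", "quotes", 1) == []
    assert r.on_event("s1", "quotes", 2) == []
    assert r.on_event("s1", "quotes", 5) == [3, 4]   # events 3 and 4 were lost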
4 Reliable Content-Based Publish-Subscribe In this section, we present three epidemic algorithms that overcome the challenges described in the previous section. The first algorithm, presented in Section 4.1, uses proactive gossip push with positive digests. Instead, in Section 4.2 we illustrate two alternatives using reactive pull with negative digests. All the solutions assume that a unicast transport layer (not necessarily reliable, e.g., UDP-based) is available, which is a reasonable assumption in most environments. The behavior of each event dispatcher is formalized, for what concerns the processing of gossip messages, by using a very simple pseudo-code notation, as shown in Figures 2 and 3. The key steps of the algorithms are formalized as actions, reminiscent of subroutines, whose body is assumed to be executed atomically. The three algorithms we present share a common structure. Each dispatcher periodically initiates a new round of gossip by performing the operations described by an action startGossipRound, which is invoked externally upon expiration of a timeout determined by the gossip interval. We omitted the actions concerning the setting and triggering of this timeout as their semantics is trivial. Processing of gossip messages received from neighbor dispatchers is instead codified in the action handleGossipMsg. The variable Cache contains the event messages stored by the dispatcher. In the description and formalization of our algorithms we do not delve into the details of the policy employed to select which events to cache. A reasonable policy is to cache only the events for which a dispatcher is either a subscriber or a source. Nevertheless, other alternatives are meaningful depending on the deployment scenario, and we provide additional insights about this issue later in this paper. Similarly, we do not detail further the last step of a gossip interaction, i.e., the sending of the missing event. A reasonable implementation is to exploit an out-of-band channel, e.g., a unicast link.
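One way to realize the caching policy just mentioned is a bounded FIFO buffer that retains only events the dispatcher published or is subscribed to. The Python sketch below is our own reading of that policy; the Event and EventCache names, and the β-sized OrderedDict, are illustrative assumptions rather than the paper's implementation.

    from collections import OrderedDict
    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Event:
        id: str                                       # globally unique event identifier
        patterns: frozenset = field(default_factory=frozenset)   # patterns the event matches

    class EventCache:
        """Bounded FIFO cache keeping only events the dispatcher published or subscribes to."""

        def __init__(self, beta, local_patterns, local_source):
            self.beta = beta                          # buffer size (the paper's beta parameter)
            self.local_patterns = set(local_patterns) # patterns subscribed by local clients
            self.local_source = local_source          # this dispatcher's identifier as a publisher
            self.events = OrderedDict()               # event id -> event, in arrival order

        def offer(self, event, source):
            relevant = source == self.local_source or (event.patterns & self.local_patterns)
            if not relevant:
                return
            self.events[event.id] = event
            if len(self.events) > self.beta:
                self.events.popitem(last=False)       # evict the oldest cached event (FIFO)

        def get(self, event_id):
            return self.events.get(event_id)

    cache = EventCache(beta=1500, local_patterns={"Pa"}, local_source="D1")
    cache.offer(Event("S:Pa-5:Pb-4", frozenset({"Pa", "Pb"})), source="S")
    print(cache.get("S:Pa-5:Pb-4"))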
4.1 Push To address the issues identified in Section 3 in the case of proactive push with positive digests, we observe that, with this strategy, a gossip message sent by a dispatcher should include information about the set of events it has cached. Moreover, this gossip message should be sent only to dispatchers subscribed to such events. As we already discussed, in content-based publish-subscribe systems this set of subscribers cannot be computed once and for all. Nevertheless, we can leverage the fact that every dispatcher that received and cached an event e knows, from
its subscription table, all the patterns matching e. This means that each dispatcher is able to construct a gossip message which includes a digest of all the cached events matching a given pattern p. This gossip message can then be labelled with p and routed similarly to events matching p.

Per dispatcher information:
- list of neighboring nodes (including clients)
- subscription table, stored as a list of pairs (node, pattern)
- buffer Cache holding a copy of the last events received

Invoked periodically, e.g., after timeout expiration. Triggers the start of a new gossip round for a pattern in the subscription table.
startGossipRound()
  choose a pattern p from the subscription table
  create digest = ∅
  for all e ∈ Cache do
    if matches(e, p) then
      insert e.id in digest
    end if
  end for
  create gossipMsg = (self, p, digest)
  send gossipMsg towards one or more subscribers for p

Invoked on a dispatcher upon receipt of a gossip message.
handleGossipMsg(gossipMsg)
  if self is subscribed to gossipMsg.pattern then
    create a new reqMsg = ∅
    for all id ∈ gossipMsg.digest do
      if ¬isReceived(id) then
        insert id in reqMsg
      end if
    end for
    if reqMsg ≠ ∅ then
      send reqMsg to gossipMsg.initiator
    end if
  end if
  with probability Pforward
    send gossipMsg towards one or more subscribers for gossipMsg.pattern

Invoked on the gossip initiator when a request for a missing event is received.
handleReqMsg(reqMsg)
  for all id ∈ reqMsg do
    if ∃ e ∈ Cache | e.id = id then
      send e to the sender of reqMsg
    end if
  end for

Figure 2: Push.
Figure 2 describes an algorithm based on the considerations above. When startGossipRound is invoked by the dispatcher acting as the gossip initiator—or gossiper—a pattern p is chosen according to some strategy (e.g., randomly) from the dispatcher's subscription table, and a digest is constructed which includes the (globally unique) identifiers3 of all the cached events matching p. The gossip message gossipMsg is then labelled with the pattern p and propagated along the dispatching tree. Routing of the gossipMsg message outside the gossiper and along the way towards a subscriber is determined as usual, by looking at the subscription table to find neighbors interested in the pattern p. Nevertheless, it is worth noting that the gossip message is not necessarily duplicated on all the outgoing routes towards subscribers, as in the normal operation of a publish-subscribe system. Instead, to limit the overhead, it is forwarded only to a subset of the neighbors on the dispatching tree, e.g., determined randomly. The extent of propagation is determined at each hop by the probability Pforward.

Moreover, in traditional push-based approaches every node gossips only with nodes sharing the same interests. A similar behavior could be obtained in a content-based publish-subscribe system by limiting the choice of p to those patterns in the subscription table that belong to subscriptions issued locally, i.e., by the clients attached to the dispatcher. Nevertheless, in the system we consider a dispatcher receives—and stores—also subscriptions that are not issued locally, but that are nonetheless handled by the dispatcher because it is on the route toward a subscriber. Consequently, the above strategy would potentially limit the scope of each gossip round. For this reason, in our solution p is selected by considering the whole subscription table, i.e., among all the patterns known to the dispatcher. This increases the chances of eventually finding all the dispatchers interested in the cached events, and speeds up convergence.

Upon receipt of gossipMsg, a dispatcher is expected to perform the operations represented by the action handleGossipMsg. These consist of checking whether the dispatcher is subscribed to the pattern p labelling gossipMsg and, if so, of verifying whether all the identifiers contained in the digest correspond to events that have already been received. In our solution, the details of how this test is performed are glossed over, and encapsulated in a function isReceived(id), which returns true if the dispatcher received an event with the given id. The identifiers of all the missed events, if any, are then included in a request message reqMsg, which is sent back to the gossiper using an out-of-band channel. Upon receipt of this message, the gossiper invokes the action handleReqMsg, which selects the events with the corresponding identifiers from the cache and sends them back to the requester. This third and last phase concludes the interaction taking place in our push approach.

3 A straightforward implementation of this identifier is the pair given by the source identifier and a monotonically increasing sequence number associated with the source.
4.2 Pull In some situations a proactive push approach may converge slowly or result in unnecessary traffic. In these cases, an approach using reactive pull with negative digests may be preferable. Nevertheless, the problem of detecting an event loss in a content-based system is complicated by the fact that dispatchers do not receive all events, but only those matching the patterns they are subscribed to. The technique we employ to overcome this problem is to tag events with identifiers carrying enough information to enable loss detection. Besides the event source, these identifiers contain information about the patterns4 matched by the event, each associated with a sequence number incremented at the source each time an event is published for that pattern. For instance, let us consider a publisher with identifier S, which has already published four events matching a pattern Pa, and three others matching Pb5. When this event source publishes a new event that matches both patterns, the identifier associated with the event message is S:Pa-5:Pb-4. Patterns are associated with an event at its source: this is made possible by the fact that a subscription forwarding strategy is chosen, and hence subscriptions are known to all dispatchers.

4 A hash signature of the pattern is actually enough.
5 Here, we abstract away from the details of patterns, but content-based systems enable the use of rather sophisticated expressions. For instance, Pa could be {Software* OR Appl*s} and Pb {Distributed AND ?pplication?}. Events containing "Distributed Applications" would then match both patterns.
Subscriber-based pull (Figure 3, left):

Per dispatcher information:
- list of neighboring nodes (including clients)
- subscription table, stored as a list of pairs (node, pattern)
- buffer LostBuffer holding a triple (source, pattern, sequence number) for each lost event
- buffer Cache holding a copy of the last events received

Invoked periodically, e.g., after timeout expiration. Triggers the start of a new gossip round for a pattern in the subscription table.
startGossipRound()
  if LostBuffer ≠ ∅ then
    choose a pattern p from the subscription table (considering only those coming from clients)
    create digest = ∅
    for all (s, p, c) ∈ LostBuffer do
      insert (s, c) in digest
    end for
    create gossipMsg = (self, p, digest)
    send gossipMsg towards one or more subscribers for p
  end if

Invoked on a dispatcher upon receipt of a gossip message.
handleGossipMsg(gossipMsg)
  for all (s, c) ∈ gossipMsg.digest do
    if ∃ e ∈ Cache | e matches (s, gossipMsg.pattern, c) then
      send e to gossipMsg.initiator
    end if
  end for
  if self is not subscribed to gossipMsg.pattern then
    send gossipMsg to one or more subscribers for gossipMsg.pattern
  else
    with probability Pforward send gossipMsg to one or more subscribers for gossipMsg.pattern
  end if

Publisher-based pull (Figure 3, right):

Per dispatcher information:
- list of neighboring nodes (including clients)
- subscription table, stored as a list of pairs (node, pattern)
- buffer LostBuffer holding a triple (source, pattern, sequence number) for each lost event
- buffer Cache holding a copy of the last events received
- buffer Routes holding a pair (source, route) for each event source

Invoked periodically, e.g., after timeout expiration. Triggers the start of a new gossip round for a source.
startGossipRound()
  if LostBuffer ≠ ∅ then
    choose a publisher s
    create digest = ∅
    for all (s, p, c) ∈ LostBuffer do
      insert (p, c) in digest
    end for
    create gossipMsg = (self, s, digest, r) with r | (s, r) ∈ Routes
    send gossipMsg to the first node in gossipMsg.route
  end if

Invoked on a dispatcher upon receipt of a gossip message.
handleGossipMsg(gossipMsg)
  for all (p, c) ∈ gossipMsg.digest do
    if ∃ e ∈ Cache | e matches (gossipMsg.source, p, c) then
      send e to gossipMsg.initiator
    end if
  end for
  if self = gossipMsg.source then
    drop gossipMsg
  else
    send gossipMsg to the next node in gossipMsg.route
  end if
Figure 3: Subscriber-based (left) and publisher-based (right) pull.

This scheme, which is a generalization of the one we described in Section 3 for subject-based systems, enables the detection of event loss. Whenever a dispatcher receives an event matching a pattern p, but for which the sequence number associated with p in the event identifier is greater than the one expected for that pattern and source, it can detect the loss of an event and trigger the appropriate actions. In the remainder of this section, we present two solutions that both rely on this technique for detecting event loss, but differ in the way they attempt to retrieve the missing event. The solutions are complementary, in that the first one steers gossip messages towards the event receivers (the subscribers), while the other steers them towards the event sender (the publisher). Both solutions are shown in Figure 3.
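The identifier scheme can be made concrete as follows. This is our own illustrative encoding of identifiers such as S:Pa-5:Pb-4; the paper leaves the concrete representation open (footnote 4 only notes that a hash of the pattern suffices).

    from collections import defaultdict

    class Source:
        """Builds identifiers of the form S:Pa-5:Pb-4."""

        def __init__(self, source_id):
            self.source_id = source_id
            self.seq = defaultdict(int)       # one sequence number per pattern, kept at the source

        def next_id(self, matched_patterns):
            parts = []
            for p in matched_patterns:
                self.seq[p] += 1
                parts.append(f"{p}-{self.seq[p]}")
            return self.source_id + ":" + ":".join(parts)

    def components(event_id):
        """Split an identifier into its (source, pattern, seqno) triples."""
        source, *pairs = event_id.split(":")
        return {(source, p, int(c)) for p, c in (pair.rsplit("-", 1) for pair in pairs)}

    s = Source("S")
    for _ in range(4):
        s.next_id(["Pa"])                     # four events matching Pa
    for _ in range(3):
        s.next_id(["Pb"])                     # three events matching Pb
    eid = s.next_id(["Pa", "Pb"])
    print(eid)                                # S:Pa-5:Pb-4, as in the example above
    print(components(eid))                    # {('S', 'Pa', 5), ('S', 'Pb', 4)}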
4.2.1 Subscriber-Based Pull
In the solution shown in Figure 3, as soon as a lost event is detected it is immediately inserted in the buffer LostBuffer, by an action that is not shown explicitly in the figure to keep the algorithm description concise. The elements of LostBuffer are the triples identifying an event in our encoding, i.e., source, pattern, and sequence number associated with the pattern and source. The action startGossipRound, which is invoked at regular intervals as in the push solution, first checks whether there are lost events. If so, a gossip round is effectively triggered6. Note how in this case we do not use the whole subscription table as in the push approach, since here the focus is on retrieving events that are relevant to the gossiper, and not on disseminating events to as many dispatchers as possible. Consequently, the pattern p is drawn from those associated with subscriptions issued locally, i.e., by the clients attached to the dispatcher. Pattern p is used to select the corresponding lost events from LostBuffer, which are inserted in the digest attached to the gossip message gossipMsg, which is then labelled with p and routed in a way analogous to the push solution. When a dispatcher receives a gossipMsg, it checks its event cache to see whether it holds some of the events requested by the gossiper. It does not matter whether the dispatcher at hand is a subscriber for the pattern p requested by the gossiper. For instance, following up on our earlier example, let us suppose that the gossiper is missing the event S:Pa-5, and that this information is included in a gossipMsg. Of course, there is no way for the gossiper to know that this event has been delivered also to dispatchers subscribed to Pb. Instead, a dispatcher that is subscribed to events matching Pb and has cached the event can easily determine that S:Pb-4 and S:Pa-5 are indeed the same event by looking at the event identifier (S:Pa-5:Pb-4). Hence, the dispatcher can retransmit the missed event to the gossiper. Although events can be retransmitted by any dispatcher that has “seen” the event, it is very important for a gossip message to be steered towards a subscriber for the same pattern as the lost event (Pa in our case). In fact, subscribers act as “points of accumulation” for events, in that not only might they have the requested event in the cache, but they also actively try to recover lost events through gossip. Hence, when a gossipMsg reaches a subscriber rather than a middleman dispatcher, the likelihood of recovering the lost events is much higher.
6 In principle, a gossip round could be triggered immediately upon detection of a lost event. Nevertheless, in scenarios characterized by frequent losses it is convenient to delay the triggering to the next gossip round, so that multiple lost events can potentially be retrieved during a single round.
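The key point of the example above—that a dispatcher subscribed only to Pb can serve a request expressed in terms of Pa—reduces to comparing the (source, pattern, sequence number) components of cached identifiers against the digest. The small sketch below is ours, reusing the components helper from the previous sketch.

    def components(event_id):
        """(source, pattern, seqno) triples of an identifier, as in the previous sketch."""
        source, *pairs = event_id.split(":")
        return {(source, p, int(c)) for p, c in (pair.rsplit("-", 1) for pair in pairs)}

    def serve_request(cache, requested):
        """Return the cached events whose identifiers cover any requested triple."""
        return [event for eid, event in cache.items() if components(eid) & requested]

    # A dispatcher subscribed only to Pb cached the event whose identifier is S:Pa-5:Pb-4.
    cache = {"S:Pa-5:Pb-4": "event payload"}
    # A gossiper missing S:Pa-5 can nonetheless be served from this cache:
    print(serve_request(cache, {("S", "Pa", 5)}))    # ['event payload']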
4.2.2 Publisher-Based Pull
The right side of Figure 3 shows a source-routing scheme that recovers lost events by walking backwards towards the publisher. While in the other algorithms we are not sensitive to the policy used to cache events, in this solution we assume that published events are cached at the source and possibly at all the dispatchers located on routes towards the subscribers for that event. Moreover, the address of each dispatcher encountered by the event during its travel towards a subscriber is appended to the event message, thus recording a route from the publisher to the subscriber. Lost events are stored in LostBuffer as described earlier for the subscriber-based solution. In addition, a new buffer Routes is necessary to store the route towards a given publisher, e.g., based on the route information stored in the most recent event received from that source. When a new gossip round is triggered, an event source is chosen among those known. The actions startGossipRound and handleGossipMsg essentially behave like their counterparts in the subscriber-based pull algorithm, except for the fact that the information distinctive of the gossip message is now the event source rather than the pattern, and that gossipMsg is now augmented with the information necessary to route it back to the publisher. It is interesting to note that there is no guarantee that the route stored in Routes is the same originally followed by the missing event, since changes in the dispatching network could have occurred in the meantime. On the other hand, it is likely that the two share at least the first portion or, in the worst case, the source. One reasonable question to ask is whether this solution suffers from the well-known “acknowledgment implosion” problem, which affects several reliable group communication schemes and occurs when several nodes missing a message request retransmission simultaneously from the same node. Our push and subscriber-based pull solutions, thanks to their distributed nature, are essentially free from this risk. For the publisher-based solution, the probability of such a phenomenon is rather low. In fact, it is unlikely that two subscribers holding different subscriptions (e.g., Pa and Pb) realize at the same time that they have missed an event. Following our example, this would happen only if the next event published by S matches both patterns as well. Finally, differently from traditional NACK-based reliable multicast schemes, retransmission requests are handled by the first dispatcher holding the desired event found along the path, thus avoiding overloading the publisher.
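The backward walk along the recorded route can be sketched as follows. The route recording, the dictionary-based caches, and the in-memory walk are our own simplifications: in the actual algorithm the gossip message travels hop by hop and each hop retransmits what it holds.

    def record_route(event, dispatcher_id):
        """Called at every hop as the event travels from the publisher towards a subscriber."""
        event.setdefault("route", []).append(dispatcher_id)

    def publisher_based_pull(gossip_msg, caches):
        """Walk the recorded route back towards the publisher; the first cache found to hold
        a requested event serves it, so the publisher is contacted only as a last resort."""
        recovered, pending = {}, set(gossip_msg["digest"])   # (pattern, seqno) pairs missed
        for hop in reversed(gossip_msg["route"]):            # route was recorded publisher-first
            for key in list(pending):
                if key in caches.get(hop, {}):
                    recovered[key] = caches[hop][key]
                    pending.discard(key)
            if not pending:
                break                                        # everything recovered: stop early
        return recovered

    # The gossiper at D3 missed (Pa, 5) from source S; D1, closer to S, still caches it.
    caches = {"D1": {("Pa", 5): "event payload"}, "D2": {}}
    msg = {"initiator": "D3", "source": "S", "digest": [("Pa", 5)], "route": ["S", "D1", "D2"]}
    print(publisher_based_pull(msg, caches))                 # {('Pa', 5): 'event payload'}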
5 Simulation Results This section presents simulation results that allow one to critically evaluate our approach. We analyze the performance of the push, subscriber-based pull, and publisher-based pull algorithms. Moreover, we also show simulations where the two pull approaches are combined because, as discussed in the rest of the section, this option enables a significant performance improvement. In our simulations, the combination is obtained by choosing probabilistically, at each round, which variant to use. In addition, we also simulate the behavior of a random pull approach where routing is neither subscriber- nor publisher-based; instead, the link along which a gossip message must be routed is chosen entirely at random. We consider this alternative since it allows us to determine whether the extra effort of deciding how to route gossip messages is worthwhile. The simulations for the analogous random push approach are instead not shown here since its performance is extremely poor. We begin by describing the simulation setting, and then show that the epidemic algorithms we proposed indeed significantly improve event delivery by considering two very different unreliable scenarios: one where events are lost because the links of the overlay network are lossy, and one where events are lost because the overlay network is being reconfigured (e.g., because of mobility of the dispatchers). In the rest of the section we focus on the former scenario, as it is more general. We first analyze the effect of changes in the key parameters of our algorithms, i.e., gossip interval and buffer size, and then evaluate the scalability of our approach and the overhead it introduces.
5.1 Simulation Setting In the absence of reference scenarios for comparing content-based publish-subscribe systems, we defined our own by choosing what we believe are reasonable and significant values. The parameters we vary in the simulations are discussed below and summarized, together with their default values, in Figure 4.

Parameter                                        Default Value
number of dispatchers                            N = 100
maximum number of patterns per subscriber        πmax = 2
publish rate                                     50 publish/s
link error rate                                  ε = 0.1
interval between topological reconfigurations    ρ = +∞
buffer size                                      β = 1500
gossip interval                                  T = 0.03s

Figure 4: Simulation parameters and their default values.
Modeling content-based publish-subscribe. For this set of parameters, we built upon the simulation parameters used in previous work by some of the authors [21].
• Events, subscriptions, and matching. Events are represented as randomly-generated sequences of numbers, where each number represents a pattern of the system. We chose a uniform distribution for these sequences. An event pattern is represented as a single number. An event matches a subscription if it contains the number specified by the event pattern in the subscription. Each dispatcher can subscribe to a maximum number πmax of event patterns, drawn randomly from the overall number Π of patterns available in the system, which in our simulation is set to 70. Given these parameters, it is possible to calculate the number of subscribers per pattern as Nπ = (N πmax)/Π, whose default value, given the values in Figure 4, is 2.85 (a sketch of this model follows this list).
• Publish rate. Our simulations are run with dispatchers continuously publishing events on a network with stable subscription information (i.e., no (un)subscriptions are being issued). The frequency at which the publish operation is invoked by each dispatcher is governed by the parameter fpub, which hence essentially determines the system load in terms of event messages that need to be routed. As a default, we choose a high publishing load scenario with about 50 publish/s per dispatcher (fpub = 0.05). In some of the simulations we also consider a low publishing load scenario of 5 publish/s per dispatcher.
• Overlay network topology. The results we present here are all obtained with configurations where each dispatcher is connected with at most four other dispatchers. Clients are not modeled explicitly, as their activity ultimately affects only the dispatcher they are attached to.
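A possible instantiation of the event/subscription model above is sketched below. The code is ours (the paper does not publish its simulator), and for simplicity every dispatcher subscribes to exactly πmax patterns, whereas in the paper πmax is a maximum.

    import random

    N, PI_MAX, PI = 100, 2, 70    # dispatchers, patterns per subscriber, total patterns
    MATCH = 3                     # patterns matched by each event (at most 3 in our simulations)

    patterns = list(range(PI))
    # Every dispatcher subscribes to exactly PI_MAX randomly drawn patterns.
    subscriptions = {d: set(random.sample(patterns, PI_MAX)) for d in range(N)}

    def publish_event():
        """An event is a randomly generated set of pattern numbers."""
        return set(random.sample(patterns, MATCH))

    def matches(event, subscribed):
        """An event matches a subscription if it contains a subscribed pattern number."""
        return bool(event & subscribed)

    event = publish_event()
    subscribers = [d for d in range(N) if matches(event, subscriptions[d])]
    print(len(subscribers), "dispatchers receive this event;",
          "expected subscribers per pattern:", N * PI_MAX / PI)    # 100 * 2 / 70 ≈ 2.85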
Modeling the sources of event loss. We considered two different scenarios: one where links are lossy and hence the percentage of events lost is directly determined by the error rate on the link, and one where event loss is indirectly determined by a topological reconfiguration taking place in the overlay network. For this latter case, we relied again on existing simulation code developed in previous work. The reader interested in the details about how this reconfiguration is performed and simulated is referred to [21].
• Channel reliability. We assume that each link connecting two dispatchers in the overlay network behaves as a 10 Mbit/s Ethernet link. For the lossy link scenario, we simulated scenarios with an error rate ε = 0.1 (leading to 75% of events lost) and ε = 0.05 (leading to a 55% loss). In the case of topological reconfiguration, the links are instead assumed to be fully reliable.
• Frequency of reconfiguration. In the case where we considered topological reconfigurations, the latter are triggered with a frequency determined by the duration of the interval ρ between two reconfigurations. A reconfiguration in this setting is the breakage of a link, and its replacement with a different link that maintains the network connected. Following [21], we assume that the route on the overlay network is repaired in 0.1s. We simulated non-overlapping reconfigurations (ρ = 0.2s), where a link is replaced by another before a new link breaks, as well as overlapping ones (ρ = 0.03s). Clearly, the former cause less disruption and hence less event loss.
Gossip parameters. The parameters governing the behavior of our gossip algorithms are the following:
• Buffer size. Each dispatcher is equipped with a buffer where events are stored to satisfy retransmission requests. The buffer has a size of β elements. In our simulations we adopted a simple FIFO buffering strategy where each dispatcher caches only events for which it is either the publisher or a subscriber.
• Gossip interval. The frequency of gossiping is controlled by the gossip interval T between two gossip rounds.
• Combining subscriber-based and publisher-based pull. As we discuss in the sequel, we combined these two pull approaches to improve performance. Which approach is used at a given moment is determined in probabilistic terms, using the Psource parameter (a toy sketch follows this list).
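Combining the two pull variants thus amounts to a biased random choice per gossip round, driven by Psource. The sketch below is a toy stand-in; the class and method names are ours.

    import random

    class CombinedPullDispatcher:
        """Toy stand-in showing only the Psource choice between the two pull variants."""

        def __init__(self, p_source):
            self.p_source = p_source
            self.lost_buffer = [("S", "Pa", 5)]      # pretend one event is currently missing

        def start_subscriber_based_round(self):
            print("gossiping towards other subscribers")

        def start_publisher_based_round(self):
            print("gossiping back towards the publisher")

        def start_gossip_round(self):
            if not self.lost_buffer:                 # nothing to recover: skip this round
                return
            if random.random() < self.p_source:      # Psource biases the choice of variant
                self.start_publisher_based_round()
            else:
                self.start_subscriber_based_round()

    CombinedPullDispatcher(p_source=0.5).start_gossip_round()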
Effect of randomization. Since the topology, the subscription patterns, the event generation, and the reconfigurations are determined randomly, we ran each simulation 10 times with different random seeds, to investigate the variability. It turned out that, differently from [21], the variations were limited, around 1%-2%. Hence, we present here the results of a single simulation, rather than averaging over the various runs.
Simulation tool. In our simulations, we are essentially comparing the algorithms only at the application level, and we are not interested in modeling the behavior of the underlying networking stack. For this reason, we did not use network simulators like ns-2 [1], but instead decided to develop our simulations using OMNeT++ [30], a free, open source discrete event simulation tool.
5.2 Event Delivery As mentioned in Section 1, our initial and driving motivation for tackling reliability was to cope with event loss induced by the dynamic reconfiguration of the dispatching infrastructure, e.g., due to mobility. Nevertheless, thus far we did not make any hypothesis about the cause of event loss. Hence, our algorithms enjoy general applicability, and in principle
can improve reliability in any situation where an event loss may occur. In this section, we show the effectiveness of our algorithms in improving event delivery. We do this by considering both the common scenario of a stable topology with lossy links, and the scenario where events are lost because of changes in the topology of the overlay network.

Figure 5(a) shows the simulation results7 for the former case, by comparing the performance of the various solutions when the error rate ε = 0.05. The performance metric we choose is the delivery rate, i.e., the ratio between the number of events correctly received by a process and those that would be received in a fully reliable scenario. The delivery rate in the charts is averaged, and shown as a percentage. The time on the x-axis is the simulation time. In this scenario, our baseline is the delivery rate obtained without any form of recovery, which is around 75%. The chart shows how neither the subscriber-based nor the publisher-based pull solution alone is sufficient to achieve a satisfactory delivery rate. This behavior is easily understood by focusing on the special case where there is a single dispatcher subscribed for a given pattern. In this case, a subscriber-based approach is not very effective, in that there are no other dispatchers to gossip with. Instead, a publisher-based approach is more convenient in this case. Nevertheless, in a situation where there are several dispatchers subscribed to the same pattern a publisher-based approach is less appealing, since gossip involves a much smaller fraction of the dispatchers. In this case, a subscriber-based solution is more reasonable. Hence, the two variants essentially complement each other and, as shown by the simulations, perform best when they are combined. In fact, the combination of the subscriber-based and publisher-based pull approaches enables a delivery rate close to 98%. Analogous performance is achieved by the push algorithm.

This behavior and the associated benefits can be better appreciated in the more challenging scenario considered in the right-hand side of Figure 5(a), where ε = 0.1 and the baseline is a delivery rate of 55%. Again, subscriber-based and publisher-based pull alone are not enough, but together they boost the delivery rate up to 90%, similar to what is achieved by the push algorithm alone. Hence, in this scenario the recovery phase performed with our algorithms is responsible for the delivery of almost half of the events being dispatched in the system.

The effect of our algorithms is evident when topological reconfigurations occur. While in the scenario with lossy links errors are by and large uniformly spread, in the case of topological reconfigurations (over fully reliable links) they are instead concentrated around the time when the reconfiguration occurs. The left-hand side of Figure 5(b) shows a situation where reconfigurations occur every ρ = 0.2s, leading to a sequence of non-overlapping reconfigurations, i.e., the system reaches a stable state where subscription routes are correctly restored and hence events correctly delivered, before a new link breaks. In this case, depending on where in the tree the reconfiguration actually occurs, the delivery rate may drop as low as 70%. All of our algorithms have a beneficial effect, by reducing the fraction of events lost. Nevertheless, again the push approach and the combined pull approach effectively “level” the delivery rate close to 100%, by eliminating all the negative spikes corresponding to event loss.

7 Hereafter, we assume that the parameters whose value is not explicitly mentioned in the text are set to their default value, defined in Figure 4.
(a) Scenario with lossy links, ε = 0.05 (left) and ε = 0.1 (right).
(b) Scenario with topological reconfigurations, ρ = 0.2s (left) and ρ = 0.03s (right).
Figure 5: Event delivery.

The right-hand side of Figure 5(b) shows instead a scenario where reconfigurations occur every ρ = 0.03s and hence are overlapping. Clearly, this is a very challenging scenario, that can be regarded as an approximation of the case where a non-leaf dispatcher is detached from the network and multiple links are broken at once. In any case, it defines a good, extreme test case for our analysis. It can be seen that the baseline delivery rate may drop as low as 60%. Instead, our best algorithms cut again all the negative spikes, and never fall below a 95% delivery rate. Hence, they introduce a high degree of robustness in the system, by masking to a great extent the perturbations caused by topological reconfiguration. In the remainder of the section we focus on scenarios characterized by lossy links, since they represent the most general case. In particular, we set the error rate to the value of ε = 0.1 to better appreciate the variations in performance determined by changes in the other parameter values.
5.3 Gossip Interval and Buffer Size The most important parameters of our approach are the gossip interval T, determining how frequently dispatchers communicate for the sake of recovery, and the size β of the buffers where events are cached. Figure 6 shows how changes in these parameters affect event delivery. The chart at the top shows simulation results for a buffer size β ranging from 500 to 4000 buffered events, which in our scenario translates into a time of persistence of an event in the buffer ranging between 1.3s and 9.2s, against an overall simulation run time of 25s. It is evident how subscriber-based pull alone cannot improve beyond a given limit. The reason for this behavior is the same we discussed earlier, and lies in the scarcity of dispatchers with the same pattern. The publisher-based and random pull approaches perform better than subscriber-based pull, but nevertheless exhibit a much slower convergence to 100% delivery.

Figure 6: Effect of buffer size (top) and gossip interval (bottom) on delivery.
Again, push and the combined pull approach exhibit the best performance. Interestingly, combined pull has better performance than push with small buffers, while push approaches 100% delivery much faster as the buffer increases. This is easily explained by observing that the push approach relies more heavily on the persistence of events in the buffer. In fact, as known from the literature on epidemic algorithms [12], the push approach has a higher recovery latency than pull. Moreover, in our push approach each gossip round involves only one of the potentially many patterns matching an event. Hence, the event recovery may involve several gossip rounds. Instead, the pull approach gossips more precise information about the lost event, and hence exhibits a smaller latency. It is worth noting at this point that since the buffer capacity is a key factor in determining the performance of our algorithms, in all the simulations presented in this section we were very careful in setting the value of β, to minimize bias. In particular, as we discuss next, we increased the buffer size linearly with the system scale. This is a rather conservative choice, since it is shown in the literature that buffer requirements grow as O(fpub N log N). Moreover, we are currently investigating if and how some of the published results (e.g., [20]) that enable a significant buffer optimization are applicable in our context.
The chart at the bottom of Figure 6 shows instead how event delivery is affected by the gossip interval. The considerations that can be drawn are similar to those we made about the buffer size. Once more, subscriber-based pull has a limit at about 78%, while push and combined pull are the best solutions, with the former improving much faster as gossip rounds become more frequent, and the latter performing better as the interval between rounds increases.

The interplay between the two parameters is shown in Figure 7, where we plotted the event delivery obtained with the combined pull approach against the gossip interval, and varied the buffer capacity with increments of 1000 elements, starting with a buffer of 500. (Simulations with push show a similar behavior, and hence are not shown here.) The chart evidences a number of interesting phenomena. First, increments in the buffer size do not bear any significant impact after a given threshold. This is particularly evident when T is very small. Moreover, it can be seen how the sensitivity of our algorithms, and in particular of the combined pull approach considered in the figure, to changes in T is greater when the buffer size is smaller. This is evident from our previous discussion: when the buffer is big, less frequent gossip rounds are compensated by a longer persistence of events in the buffer.

Figure 7: Effect on delivery of simultaneous changes to β and T (combined pull).
5.4 Scalability The charts we presented thus far were derived for an overlay network of N = 100 nodes. An open question is how an increase of N affects event delivery. The answer is in Figure 8. In each run we increased the number of dispatchers in the system and, to compensate for the increased scale, we also increased the buffer size accordingly, so that a given event persists in the buffer for a constant time (of about 4s). The simulation results show that our solutions exhibit good scalability with respect to the number of nodes. This is not surprising, given that this is a characteristic of epidemic algorithms, and one of the reasons why we chose this approach.

Figure 8: Delivery as the system size increases.

Again, the best performance in terms of delivery is achieved by the combined subscriber-based and publisher-based pull approaches, and by the push approach. All the pull solutions, when applied alone, are more sensitive to scale, with the publisher-based one being the best when the number of nodes is limited. In particular, push becomes more convenient as the system size increases. The reason is that an increase in N extends the overlay network with new leaves, and hence the push message can travel through new links and disseminate information about events to more dispatchers. The system size, however, is not the only parameter characterizing
90 80
scalability. In a content-based system, the distribution of patterns is
maximum number πmax of patterns a dispatcher can be subscribed to. The effect of this parameter in terms of scalability can be grasped by looking at Figure 9, where πmax is plotted against the average number of subscribers that receive a single event. It can be seen how π max = 5 is already sufficient to reach about 25% of dispatchers; this percentage
receivers per event
another key factor. Here, we evaluate this aspect by intervening on the
70 60 50 40 30 20 10 0
0
5 10 15 20 25 Πmax (maximum number of patterns per subscriber)
30
Figure 9: Number of dispatchers receiving an event as the number πmax of subscriptions per dispatcher increases.
raises to 80% with πmax = 30, essentially making communication more akin to a broadcast rather than a content-based one 8 .
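The trend in Figure 9 can be reproduced qualitatively with a small Monte Carlo sketch. The total number of patterns, the uniform choice of subscriptions, and the assumption that each dispatcher subscribes to exactly πmax patterns are illustrative simplifications of ours; the limit of 3 matching patterns per event mirrors footnote 8. The absolute percentages depend on the assumed pattern population, so only the shape of the curve is meaningful.

# Minimal Monte Carlo sketch of the qualitative trend in Figure 9.
# Assumptions (ours, for illustration only): 100 patterns in total, each dispatcher
# subscribes to exactly pi_max distinct patterns chosen uniformly at random, and
# each event matches 3 patterns (cf. footnote 8).
import random

def receivers_per_event(n_dispatchers=100, n_patterns=100, pi_max=5,
                        patterns_per_event=3, trials=1000, rng=random):
    total = 0
    for _ in range(trials):
        subs = [set(rng.sample(range(n_patterns), pi_max)) for _ in range(n_dispatchers)]
        event = set(rng.sample(range(n_patterns), patterns_per_event))
        total += sum(1 for s in subs if s & event)   # dispatchers interested in the event
    return total / trials

if __name__ == "__main__":
    for pi_max in (1, 5, 10, 20, 30):
        avg = receivers_per_event(pi_max=pi_max)
        print(f"pi_max={pi_max:2d}  ~{avg:5.1f} receivers out of 100")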
The impact of πmax on the delivery rate is then analyzed in Figure 10, under different publishing loads. The chart to the left shows how, under a low publish rate of 5 publish/s, the delivery rate of push and combined pull is basically unaffected by increases in πmax, with the former performing slightly better than the latter. The subscriber-based pull improves a little, since more dispatchers are now caching an event. The chart to the right, derived under the usual high publish rate of 50 publish/s, shows a more interesting behavior. For a small number of subscriptions per dispatcher, about πmax < 6, combined pull improves delivery, while push makes it worse. This is explained by observing that the push approach proceeds by gossiping about one pattern at a time: the higher the number of patterns, the higher the number of gossip rounds potentially required to recover an event, and the higher the likelihood that the event is discarded from all the caches before being recovered, especially under a high publish load. Instead, in a pull approach the increase in the number of subscribers is beneficial, since it increases the probability of contacting a dispatcher that has actually cached the event. For πmax > 6, instead, performance decreases significantly for all solutions. This can be explained by noting that both charts in Figure 10 were derived with a buffer size β = 4000. Since the number of subscriptions per dispatcher increases, each subscriber receives more events: the aforementioned value of β is more than enough for a low publishing load, but it is insufficient to keep up with a high publishing load, hence the decrease in performance.

Figure 10: Delivery as the number of subscribers per pattern increases, under a low (left) and high (right) publish load. (Delivery rate against πmax, with curves for no recovery, push, subscriber-based pull, and combined pull.)
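The benefit that pull draws from a larger subscriber population can be captured by a rough probability estimate. This is our own simplification, not a parameter setting from the paper: assume each pull round contacts F dispatchers chosen uniformly and independently, and that a fraction c of dispatchers currently cache the missing event.

# Rough probabilistic sketch (ours, not from the paper) of why pull recovery
# benefits from a larger subscriber population: if a fraction c of dispatchers
# currently cache the missing event and each pull round contacts F dispatchers
# chosen uniformly at random (with replacement across rounds), the chance of
# recovering within k rounds is  P = 1 - (1 - c) ** (F * k).
# F, k, and the values of c below are illustrative assumptions.

def pull_recovery_probability(cache_fraction, fanout=1, rounds=3):
    return 1.0 - (1.0 - cache_fraction) ** (fanout * rounds)

if __name__ == "__main__":
    for c in (0.05, 0.25, 0.80):   # roughly the regimes suggested by Figure 9
        p = pull_recovery_probability(c)
        print(f"cache fraction {c:.2f} -> recovery probability {p:.2f} within 3 rounds")

As the fraction of caching dispatchers grows with πmax, the recovery probability per round grows with it, which matches the improvement of combined pull in the left part of the right chart; push, gossiping one pattern at a time, gains no such advantage.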
5.5 Overhead

After we verified that our solutions indeed significantly improve event delivery even when the system scale increases, the next question concerns the overhead they introduce to achieve this improvement. Moreover, overhead can be regarded as another facet of scalability. Figure 11 contains the results of our evaluation. It considers the system size N and the number πmax of subscriptions per dispatcher as measures of scalability, similarly to what we did in Section 5.4. The overhead is presented in two ways: as the number of gossip messages sent by each dispatcher, to evaluate the overhead on the single dispatcher, and as the ratio between the gossip and event messages dispatched in the overall system, to evaluate the impact of gossiping on the overall bandwidth available to event dispatching.

The left-hand side of Figure 11(a) shows that the number of gossip messages sent by each dispatcher increases as the dispatching network grows, but well below a linear trend. This very desirable behavior is a direct consequence of the decentralized nature of gossip algorithms: the local effort of a dispatcher, in terms of gossip messages sent, is independent of the system size. Hence, the growth of gossip traffic is proportional to the number of hops made by each gossip message, which in our case increases logarithmically. The right-hand side of Figure 11(a) shows instead that the traffic caused by event forwarding rises faster than the one caused by gossiping, under the assumption of continuous publishing. Again, this is a desirable property of our algorithms, in that it leads to high scalability. It can be explained by noting that, while event forwarding essentially implements a multicast scheme and hence must reach all the recipients, gossiping involves only a fraction of them; moreover, the propagation of a single message is often "short-circuited" by the first dispatcher that owns the requested message.

Figure 11(b) presents the impact on overhead of the number πmax of subscriptions per dispatcher. The overhead on a single dispatcher, shown on the left-hand side, is only marginally affected by πmax. It decreases a little for increasing values of πmax, which can be explained by observing that an increase in the number of patterns increases the number of dispatchers where an event gets cached, and hence the likelihood of retrieving the event close to the gossiper. Instead, the ratio of gossip messages to event messages in the system decreases significantly as πmax increases, which is a desirable property. The reason can be grasped by looking back at Figure 9: an increase in πmax determines an increase in the number of receivers, hence the number of events dispatched in the system rises much faster than the number of gossip messages, especially in the scenario with high publishing load we considered.

Figure 11: Overhead introduced by gossip, in absolute terms (left, gossip messages per dispatcher) and relative to the events in the system (right, ratio of gossip messages to events). (a) Overhead vs. the system size N. (b) Overhead vs. the number πmax of subscriptions per dispatcher. (Curves for push and combined pull.)
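The sub-linear growth of the per-dispatcher gossip traffic in Figure 11(a) follows from the argument above: the effort each dispatcher initiates per round is constant, and only the hop count of a gossip message grows, roughly logarithmically, with N. A minimal sketch of this scaling argument, under an assumed log2 hop count and purely illustrative parameters:

# Illustrative sketch (assumed model, not the paper's simulator): each dispatcher
# initiates a constant number of gossip messages per round, and each message is
# forwarded for a number of hops that grows roughly logarithmically with N, so
# the per-dispatcher gossip load grows well below linearly.
import math

def gossip_msgs_per_dispatcher(n_dispatchers, initiated_per_round=1, rounds=100):
    hops = math.log2(n_dispatchers)   # assumed logarithmic hop count
    return initiated_per_round * rounds * hops

if __name__ == "__main__":
    for n in (40, 80, 120, 160, 200):
        print(f"N={n:3d}  ~{gossip_msgs_per_dispatcher(n):6.0f} gossip msgs per dispatcher")

Multiplying the system size by five increases the per-dispatcher load by well under a factor of two in this model, which is the qualitative behavior the chart exhibits.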
It is worth pointing out some additional issues related to overhead. First, we note that the overhead against the system scale might look relatively high, since in Figure 11 it ranges from about 28% for 40 nodes down to about 20% for 200 nodes. Still, it should be remembered that we performed these simulations in extremely challenging scenarios, where the system load is very high and so is the chance of losing an event. Given this tough setting and the remarkable improvement in event delivery, the overhead plotted in Figure 11 does not seem unreasonable. In general, however, the tradeoffs between overhead and event delivery are essentially determined by the application and networking scenario at hand, and can be tuned appropriately by intervening on the gossip interval and buffer size, whose impact has been described in Section 5.3.

Moreover, if the assumptions about load and error rate are made less challenging, the relative performance of the push and pull approaches changes significantly, as the reactive pull approach triggers communication only when a recovery is needed, while the proactive push approach gossips continuously and hence may waste bandwidth. This behavior is shown in Figure 12, which plots the total number of gossip messages sent against the error rate. The top chart plots the overhead at the high publish rate of 50 publish/s, while the bottom one is derived with 5 publish/s. In the latter case, the pull approach clearly wastes less bandwidth, especially when communication is more reliable: from the chart it can be seen that when ε = 0.01, corresponding to a baseline delivery rate of 95%, the pull overhead is one third of the push overhead. The pull approach, in this case, may skip some gossip rounds because no event has been detected as lost in the meantime, while a push approach must proactively push at each gossip round. To remove this potential source of inefficiency of the push algorithm, an adaptive approach could be exploited, where the gossip interval T is changed dynamically according to the current state of the system, as suggested in [8].

Figure 12: Overhead under a high (top) and low (bottom) publishing load. (Gossip messages per dispatcher against ε, the link error rate, with curves for push and combined pull.)

In general, note also that in these simulations we assumed that event and gossip messages have the same size. Hence, the plots actually show only an upper bound on the overhead: in reality, gossip messages are likely to be much smaller than event messages, bringing the relative overhead below the curves shown. Finally, another issue is that of computational overhead, which we did not investigate through simulations. Qualitatively, the pull-based solutions require that, when an event e is published by a dispatcher, the latter performs a match
of e against all the patterns in its subscription table. This is more than normally required, since usually the match processing needed to route a message towards a neighbor stops as soon as the first matching pattern is found. While we are currently investigating optimizations to limit this overhead, we also observe that only the publisher experiences it: the event routing performed by the other dispatchers follows the normal processing.
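A schematic sketch of the extra matching work described above; the helper names and the toy substring matching are hypothetical, not the authors' implementation. Routing can stop at the first matching pattern, whereas the pull-based recovery requires the publisher to collect every pattern the event matches.

# Schematic sketch (hypothetical names, not the authors' implementation) of the
# extra matching effort at the publisher: routing towards a neighbor can stop at
# the first matching pattern, while the pull-based recovery needs all of them.

def first_match(event, subscription_table, matches):
    """Enough for routing towards a neighbor: stop at the first hit."""
    for pattern in subscription_table:
        if matches(event, pattern):
            return pattern
    return None

def all_matches(event, subscription_table, matches):
    """What a publisher must compute in the pull-based solutions."""
    return [p for p in subscription_table if matches(event, p)]

if __name__ == "__main__":
    # Toy matching rule: an event "matches" a pattern if it contains the pattern string.
    matches = lambda event, pattern: pattern in event
    table = ["stock/", "stock/IBM", "news/"]
    event = "stock/IBM price=42"
    print("routing needs:  ", first_match(event, table, matches))
    print("publisher needs:", all_matches(event, table, matches))

As the text notes, only the publisher pays this cost; all other dispatchers route the event with the usual first-match processing.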
6 Related Work

Several publish-subscribe systems offer a reliable service (e.g., all the JMS [27] compliant systems) by adopting a centralized event dispatcher and reliable channels between the dispatcher and its clients. Similarly, some of the existing publish-subscribe systems that adopt a distributed dispatcher provide a reliable service [29, 23, 3, 22, 31, 5], but none of them uses a content-based routing scheme.

Researchers working on reliable multicast [24, 19, 16] and group communication [13, 7] have proposed several protocols for reliable multicast where routing is group- or subject-based. Unfortunately, none of them can be adapted easily to content-based routing, for the reasons discussed in Section 3. In recent years, gossip techniques have been employed successfully to deal with scalable reliable multicast in peer-to-peer networks [14] and in highly dynamic environments such as mobile ad hoc networks [6, 18]. Results have been encouraging but, again, these techniques are not directly applicable to our problem, since they consider broadcasting information to all participants in a group.

To the best of our knowledge, the only system that provides reliable content-based routing is hpcast [15]. In hpcast, nodes are organized in a hierarchy where the leaves represent event subscribers and publishers, and intermediate nodes represent delegates, i.e., special nodes chosen to represent the aggregate interests of their children. A gossip push approach is used to distribute events, starting from the root of the hierarchy and moving down each time a delegate retrieves an event that could interest its children. This idea of using gossip not just to improve event delivery but as the only routing mechanism is simple and elegant, but it results in several drawbacks. First, in the absence of faults it increases the overhead, since events are not routed only to interested nodes: they can also reach non-interested nodes, or even be sent more than once to the same node. Second, even in the absence of faults it does not guarantee that events are delivered correctly. Third, it forces the adoption of a push approach in which gossip messages include the entire event content instead of a simple digest, thus further increasing the network traffic. Finally, the nodes near the root of the hierarchy are subject to high traffic, and hence must keep their event caches very large to increase the probability of correctly delivering events.
7 Conclusions

Modern distributed computing is steering towards scenarios that are increasingly large scale, unreliable, and highly dynamic. Distributed content-based publish-subscribe embodies a communication model providing the component decoupling, flexibility, expressiveness, and scalability needed to deal with these characteristics at the application level. Nevertheless, the problem of reliable event delivery, which is exacerbated by the aforementioned scenarios, has not yet been tackled by researchers, and this is hampering the exploitation of content-based publish-subscribe middleware in real-world applications.

In this paper, we presented solutions based on epidemic algorithms that are shown to improve remarkably the event delivery in unreliable environments. Our findings are supported by simulation results, which also show that our approach is scalable and introduces limited overhead. Our results are derived without making assumptions about the source of event loss, and hence enjoy general applicability. Our ongoing work, however, aims at complementing the results we described with those we already obtained for the reconfiguration of the dispatching infrastructure, and at bringing them together in a new-generation distributed content-based publish-subscribe system able to tolerate arbitrary reconfigurations while minimizing the number of lost events.
References

[1] ns-2 Web page. www.isi.edu/nsnam/ns.
[2] K. Birman et al. Bimodal multicast. ACM Trans. on Computer Systems, 17(2):41–88, 1999.
[3] L. F. Cabrera, M. B. Jones, and M. Theimer. Herald: Achieving a global event notification service. In Proc. of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), pages 87–94, Elmau, Germany, May 2001. IEEE Computer Society.
[4] A. Carzaniga, D. Rosenblum, and A. Wolf. Design and evaluation of a wide-area event notification service. ACM Trans. on Computer Systems, 19(3):332–383, 2001.
[5] M. Castro, P. Druschel, A. Kermarrec, and A. Rowstron. Scribe: A large-scale and decentralized application-level multicast infrastructure. IEEE Journal on Selected Areas in Communications (J-SAC), 20(8), 2002.
[6] R. Chandra, V. Ramasubramanian, and K. Birman. Anonymous gossip: Improving multicast reliability in mobile ad-hoc networks. In Proc. of the 21st Int. Conf. on Distributed Computing Systems (ICDCS), pages 275–283, 2001.
[7] G. Chockler, I. Keidar, and R. Vitenberg. Group communication specifications: a comprehensive study. ACM Computing Surveys, 33(4):427–469, 2001.
[8] F. Cuenca-Acuna et al. PlanetP: Infrastructure support for P2P information sharing. Technical Report DCS-TR-465, Rutgers University, November 2001.
[9] G. Cugola, E. Di Nitto, and A. Fuggetta. The JEDI event-based infrastructure and its application to the development of the OPSS WFMS. IEEE Trans. on Software Engineering, 27(9):827–850, September 2001.
[10] G. Cugola, D. Frey, A.L. Murphy, and G.P. Picco. Minimizing the reconfiguration overhead in content-based publish-subscribe. Technical report, Politecnico di Milano, March 2003. Submitted for publication. Available at www.elet.polimi.it/~picco.
[11] G. Cugola, G.P. Picco, and A.L. Murphy. Towards distributed publish-subscribe middleware for mobile systems. In Proc. of the 3rd Int. Workshop on Software Engineering and Middleware (SEM02), co-located with the 24th Int. Conf. on Software Engineering (ICSE03), Orlando (FL), USA, May 2002.
[12] A. Demers et al. Epidemic algorithms for replicated database maintenance. Operating Systems Review, 22(1):8–32, 1988.
[13] C. Diot, W. Dabbous, and J. Crowcroft. Group communication. IEEE Journal on Selected Areas in Communication, Special Issue on Group Communication, May 1997.
[14] P. Eugster et al. Lightweight probabilistic broadcast. In Proc. of the Int. Conf. on Dependable Systems and Networks (DSN 2001), pages 443–452, 2001.
[15] P. Eugster and R. Guerraoui. Hierarchical probabilistic multicast. Technical report, EPFL, Lausanne (Switzerland), 2001. Available at icwww.epfl.ch/publications.
[16] B. Levine and J. Garcia-Luna-Aceves. A comparison of known classes of reliable multicast protocols. In Proc. of the IEEE Int. Conf. on Network Protocols, October 1996.
[17] M.-J. Lin and K. Marzullo. Directional gossip: Gossip in a wide area network. In European Dependable Computing Conference, pages 364–379, 1999.
[18] J. Luo, P. Eugster, and J.-P. Hubaux. Route Driven Gossip: Probabilistic Reliable Multicast in Ad Hoc Networking. Technical report, EPFL, Lausanne (Switzerland), 2002. Available at icwww.epfl.ch/publications.
[19] K. Obraczka. Multicast transport protocols: a survey and taxonomy. IEEE Communications, 36(1):94–102, January 1998.
[20] O. Ozkasap, R. van Renesse, K. Birman, and Z. Xiao. Efficient Buffering in Reliable Multicast Protocols. In Proc. of NGC99, Pisa, Italy, November 1999.
[21] G.P. Picco, G. Cugola, and A.L. Murphy. Efficient Content-Based Event Dispatching in the Presence of Topological Reconfigurations. In Proc. of the 23rd Int. Conf. on Distributed Computing Systems (ICDCS 2003). To appear. Available at www.elet.polimi.it/~picco.
[22] P. Pietzuch and J. Bacon. Hermes: A distributed event-based middleware architecture. In Proc. of the Workshop on Distributed Event-Based Systems (DEBS), Vienna, Austria, July 2002. IEEE Computer Society.
[23] Real-Time Innovations, Inc. NDDS: Network Middleware for Distributed Real Time Applications. www.rti.com.
[24] V. Roca et al. A survey of multicast technologies. Technical report, Laboratoire d'Informatique de Paris 6 (LIP6), September 2000. www-rp.lip6.fr.
[25] D.S. Rosenblum and A.L. Wolf. A Design Framework for Internet-Scale Event Observation and Notification. In Proc. of the 6th European Software Engineering Conf. held jointly with the 5th Symp. on the Foundations of Software Engineering (ESEC/FSE97), LNCS 1301, Zurich (Switzerland), September 1997. Springer.
[26] R. Strom et al. Gryphon: An information flow based approach to message brokering. In Int. Symp. on Software Reliability Engineering, 1998.
[27] Sun Microsystems, Inc. Java Message Service Specification Version 1.1, April 2002.
[28] P. Sutton, R. Arkins, and B. Segall. Supporting Disconnectedness—Transparent Information Delivery for Mobile and Invisible Computing. In Proc. of the IEEE Int. Symp. on Cluster Computing and the Grid, May 2001.
[29] TIBCO Inc. TIBCO Rendezvous. www.rv.tibco.com.
[30] A. Varga. OMNeT++ Web page. www.hit.bme.hu/phd/vargaa/omnetpp.htm.
[31] S. Zhuang et al. Bayeux: An architecture for scalable and fault-tolerant wide-area data dissemination. In Proc. of the 11th Int. Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV 2001), June 2001.