The Arbitrary Boolean Publish/Subscribe Model: Making the Case

Sven Bittner, Annika Hinze
Department of Computer Science, The University of Waikato, New Zealand

ABSTRACT

In this paper, we present BoP, a content-based publish/subscribe system for arbitrary Boolean subscriptions and advertisements. BoP targets the time- and space-efficient matching of event messages using the widespread attribute-value pair event model. In contrast to other content-based publish/subscribe systems focussing on an efficient matching process, BoP internally supports subscriptions and advertisements as arbitrary Boolean expressions. As we will show in this paper, directly handling these representations leads to efficiency benefits for applications using this class of expressions. The support of arbitrary Boolean subscriptions and advertisements requires the introduction of efficient matching and overlapping calculation algorithms, as well as applicable routing optimizations. In this paper, we will outline these solutions that have been integrated into BoP. The evaluation part of this work presents the results of a comparative study of our approaches and recent conjunctive solutions.

Categories and Subject Descriptors
C.2.4 [Computer-Communication Networks]: Distributed Systems—Distributed applications; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Information filtering

General Terms
Algorithms, Experimentation, Performance

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DEBS ’07, June 20–22, 2007 Toronto, Ontario, Canada Copyright 2007 ACM 978-1-59593-665-3/07/03 ...$5.00.

1. INTRODUCTION

The BoP (Boolean publish/subscribe) system has been designed with the clear objective of creating a generic but efficient content-based publish/subscribe (pub/sub) system for arbitrary Boolean subscriptions and advertisements. The need for systems supporting such subscriptions and advertisements arises from applying the event-based communication paradigm, which originally served more low-level purposes such as monitoring distributed systems, to more high-level areas such as electronic commerce applications [7].

Although it may be argued that current systems support arbitrary Boolean subscriptions/advertisements by applying the canonical conversion approach [25, 27], utilizing this method within pub/sub leads to unnecessary efficiency and scalability drawbacks. We will demonstrate the negative effect of conversions in a comparative study of our arbitrary Boolean and conjunctive approaches, reported in this paper.

The general drawbacks of the canonical conversion approach in pub/sub systems had already been clearly identified in one of the fundamental works in the information dissemination area by Yan and García-Molina [33]: The authors state that it might be problematic to convert arbitrary subscriptions to conjunctions (due to the exponential explosion of the problem size). Nevertheless, they left the analysis of this aspect to future work. To the best of our knowledge, this evaluation has not been undertaken so far with respect to efficiency aspects; the work in [3] only focusses on memory requirements. Within this paper, we aim to close this gap.

The canonical conversion approach is well-known from its effective application in DBMSs. However, for pub/sub systems the applicability of conversions is questionable. This results from the different problem definitions in pub/sub and DBMSs. We believe that the application of conversions of subscriptions/advertisements in pub/sub can be contested: Taking an interaction semantics view, we identify similarities between subscriptions and database queries (user interests), event messages and the stored data (objects of interest), advertisements and the database schema/access privileges (there is no real counterpart to advertisements), and notifications and query results (computational result of the system). We have given an overview of these correspondences in Figure 1.

Figure 1: Correspondences in pub/sub and DBMSs taking an interaction semantics view.

DBMSs convert queries; at first glance, it thus appears reasonable to convert their counterpart, subscriptions. However, pub/sub systems simultaneously handle hundreds of thousands of subscriptions whereas DBMSs typically only execute a few queries at the same time. The effect of converting arbitrary Boolean expressions to canonical forms, leading to an exponential number of subscriptions [8] in the worst case, is thus much more influential in pub/sub than in DBMSs: The already large problem size (number of subscriptions/advertisements) explodes even more due to conversions. Additionally, the counterpart to advertisements is not converted at all in DBMSs.

Taking the data storage view, for notifications and advertisements (stored as advertisement base) the correspondences are the same as in the semantics view. As a difference, one now finds similarities between data and subscriptions (stored by the system as data and subscription base), and between queries and event messages (not being stored but triggering a processing based on the stored data/subscriptions). The overview in Figure 2 graphically illustrates these correspondences. Converting subscriptions now becomes even more unreasonable because it amounts to converting the counterpart of data, which DBMSs store “as is” (without conversions). Event messages, on the other hand, are already in a canonical form (attribute-value pairs), as are their counterparts, queries.

Figure 2: Correspondences in pub/sub and DBMSs taking a data storage view.

The recognition of these facts and our recent work on arbitrary subscriptions [3, 4] have been driving factors for designing BoP. Another factor has been the observation that most of the proposed (high-level) applications for content-based pub/sub naturally benefit from the support of other than conjunctive expressions. Examples include electronic commerce [7], meteorology [23], and facility management [19].

For inexperienced human users, formulating arbitrary Boolean subscriptions might be a more complicated task than formulating conjunctive subscriptions. One would typically apply advanced user interfaces for this purpose, e.g., inspired by [20]. More experienced users are likely to formulate relatively compact Boolean expressions [28], i.e., expressions without strong redundancies. For machine-generated subscriptions, the automatic formulation of arbitrary Boolean expressions is a realistic assumption. Under all circumstances, the system can apply a logic expression minimization algorithm before registering subscriptions/advertisements. Generating compact expressions is thus not a large overhead because each user typically registers only a relatively small number of subscriptions/advertisements.

Supporting arbitrary Boolean subscriptions and advertisements requires advanced solutions in four areas of pub/sub systems. Most current proposals cannot be practically applied to this class of expressions, as we outline in detail in Section 7. The four system aspects to be enhanced are: (1) the matching algorithm, (2) the subscription-based routing optimization, (3) the overlapping calculation algorithm, and (4) the advertisement-based routing optimization. The former two of these solutions are required under all circumstances regardless of whether subscription or advertisement forwarding [9, 24] is applied. The latter two solutions are only needed in combination with advertisement-based pub/sub. BoP includes solutions to all four of these areas but does not depend on the utilization of advertisements, i.e., it works either with or without advertisements. The contributions of this paper are:

1. We introduce BoP, an efficient distributed content-based pub/sub system that supports arbitrary Boolean subscriptions and advertisements.

2. We present the matching algorithm and an effective subscription-based optimization to support the subscription forwarding routing scheme in BoP.

3. We propose an overlapping calculation algorithm and an advertisement-based routing optimization to additionally allow for the use of advertisements.

4. We present the results of a comparative study of the algorithms of BoP and existing conjunctive solutions.

We have structured the rest of this paper as follows: In Section 2, we introduce the event, subscription, and advertisement model of BoP, and analyze the use of the term expressiveness in existing pub/sub literature. Section 3 presents the matching algorithm of BoP. The event routing optimization applied in BoP is the focus of Section 4; Section 5 outlines how to symmetrically support arbitrary advertisements. The detailed evaluation of the introduced algorithms and their comparison to existing solutions is the focus of Section 6. Finally, Section 7 relates BoP to current research.

2. EXPRESSIVENESS AND INTERACTION

We start this section by analyzing the usage of the term expressiveness in pub/sub systems. In Section 2.2, we specify the notions of event messages, subscriptions, and advertisements in the arbitrary Boolean pub/sub model.

2.1 Publish/Subscribe and Expressiveness

To our knowledge, there is no agreed definition of the term expressiveness in the context of pub/sub systems. In the literature, we have found some general explanations: Carzaniga et al. [9] define expressiveness as the ability of a pub/sub system to express subscriptions. Eugster et al. [13] state that expressiveness defines how accurately subscriptions can represent the interests of subscribers. Most other works, e.g., [7, 14], identify different levels of expressiveness in the distinction between topic and content-based systems. However, content-based systems (even supporting range queries) can be mapped to topic-based ones [30]. Hence, the term expressiveness in the context of pub/sub does not model the general notion of expressiveness, describing what facts can be represented by a language [22].

Li et al. [21] explicitly include the opportunities to combine predicates in subscriptions into their expressiveness definition. They state that, in contrast to conjunctive approaches, an expressive subscription language can be naturally supported by providing for arbitrarily complicated Boolean functions in subscriptions. Mühl [24] also takes this approach and states that the restriction to conjunctions in current systems reduces the expressiveness of these solutions. In this paper, we adopt the view of these latter works, and consider the support of subscriptions and advertisements as either conjunctive or arbitrary Boolean combinations of predicates as different levels of expressiveness.

A current universal claim is that achieving an expressive subscription language conflicts with realizing a pub/sub system in an efficient manner [9, 13, 24, 27]. Our contributions within this paper contradict this assumption with respect to arbitrary Boolean and conjunctive languages.

2.2 Messages, Subscriptions & Advertisements

Within this section, we informally introduce the notions of event messages, subscriptions, and advertisements that are used within BoP. We also state the possible relationships (matching, conforming, and overlapping) between these three directives. We need to properly define these concepts and their exact semantics because the later introduced algorithms strongly depend on these definitions. Our definitions are based on the usage of event types, e.g., as used by Eugster et al. [14] and Hermes [27]. Thereby, an event type specifies a set of attribute specifications, with each attribute specification stating an attribute name (unique within the type), an attribute domain, and a set of supported filter functions of two variables.

2.2.1 Event Messages

Each event message e belongs to exactly one event type and specifies one attribute-value pair for each attribute specification of its type. The value of each pair belongs to the attribute domain. This definition of messages is similar to what is referred to as total messages by Campailla et al. [8]. The set of all possible messages is referred to as event space.

2.2.2 Subscriptions

Subscriptions are also based on event types. They generally describe a filter on event messages and are issued by subscribers in order to describe their interests. Subscriptions are usually highly selective filters so that they lead to only a small number of notifications. More specifically, a subscription s specifies an event type and a Boolean filter expression using the operators conjunction, disjunction, and negation. The variables of the filter expression are referred to as predicates p, being attribute-function-operand triples. Each of these predicates has to specify one of the attributes of the event type of s, a valid filter function for this attribute (defined by the event type), and an operand that is valid as second variable for the stated filter function.

This definition of subscriptions is similar to the three subscription definition languages introduced by Campailla et al. [8]. This similarity to all three languages occurs due to their equal expressivity in case of total messages, being our assumption for events. Using these definitions, subscriptions do not need to contain predicates referring to all of the attributes defined by their event type, or they might contain several predicates referring to the same attribute. The semantics in the latter scenario is given by the Boolean combination of predicates. For attributes not referred to by predicates, subscribers accept all possible values. Whether the incoming message is matching (cf. Section 2.2.4), and thus resulting in a notification, solely depends on the stated predicates.

This semantics of subscriptions is different from conjunctive approaches, e.g., Rebeca [24], where all attribute specifications have to be referred to by at most one predicate. Other approaches, e.g., by Gough and Smith [16], explicitly need to insert don’t-care predicates for attributes not stated by a subscription. The underlying reason for this procedure is the applied indexing scheme, which, however, strongly increases in size if subscribers do not restrict all attributes.¹ An alternative solution by Aguilera et al. [1] does not show these size problems but, as a consequence, is inefficient to re-organize, e.g., in case of deregistrations.

¹ Every don’t-care branch in a node needs to contain all subtrees that can be reached by the other branches of that node.

2.2.3 Advertisements

Advertisements in BoP are defined similarly to subscriptions, i.e., an advertisement a specifies an event type and an arbitrary Boolean filter expression. The semantic difference to subscriptions is that advertisements describe future messages of publishers whereas subscriptions state subscriber interests. If attributes stay unrestricted in advertisements, publishers potentially send all possible values for these attributes. Advertisements are generally less restrictive than subscriptions because publishers send broader ranges of messages than those described by typical subscriptions.

We have given an abstract, graphical illustration of event messages, subscriptions, and advertisements in the upper part of Figure 3. In this figure, we have used a 2-dimensional event space in combination with filter functions specifying ranges of values. In contrast to conjunctive approaches, resulting in subscriptions/advertisements as rectangles, subscriptions and advertisements in BoP describe sets of rectangles. A concrete example subscription s1 and advertisement a1 from an online auction application scenario is given in Figure 4. They are represented by what is referred to as subscription tree [4] and advertisement tree, respectively.

2.2.4 Relationships

Between instances of these three directives, we define the following relationships (similarly to current approaches):

Matching (message and subscription). An event message e matches a subscription s (equivalent to s is fulfilled by e) if and only if: (1) s and e specify the same event type, and (2) the filter expression of s evaluates to true on e. For this evaluation, one evaluates the Boolean combination of predicates, where each predicate p of s is assigned the result of the function stated by p, with the first variable being the value of the respective attribute-value pair in e and the second variable being the operand stated by p.

Conforming (message and advertisement). A message e conforms to an advertisement a if and only if (1) a and e specify the same event type, and (2) the filter expression of a evaluates to true on e. This evaluation works as defined for the matching relation. If using advertisements, BoP assumes that publishers only send conforming messages.

Overlapping (subscription and advertisement). An advertisement a overlaps a subscription s (and vice versa) if and only if there exists at least one event message e that both matches s and conforms to a.

We have given examples and counterexamples of these three relationships in the bottom part of Figure 3.
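To make the three relationships of Section 2.2.4 concrete, the following Python sketch evaluates a filter expression on an event message. It is only an illustration under stated assumptions: the tuple-based expression format, the `FUNCS` table, the `auction` event type, and all function names are hypothetical and not part of BoP. In particular, `overlaps` tests the definition by brute force over a small finite event space; it is not the overlapping calculation algorithm of BoP.

```python
import operator
from itertools import product

# Filter functions of two variables (attribute value, operand); illustrative only.
FUNCS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}

def holds(expr, values):
    """Evaluate a Boolean filter expression on the attribute values of a message."""
    kind = expr[0]
    if kind == "pred":                      # ("pred", attribute, function, operand)
        _, attribute, function, operand = expr
        return FUNCS[function](values[attribute], operand)
    if kind == "not":                       # ("not", child)
        return not holds(expr[1], values)
    results = [holds(child, values) for child in expr[1]]   # ("and"/"or", children)
    return all(results) if kind == "and" else any(results)

def matches(event, subscription):
    """Same event type, and the filter expression evaluates to true on the event."""
    (e_type, values), (s_type, expr) = event, subscription
    return e_type == s_type and holds(expr, values)

conforms = matches  # conforming is evaluated exactly like matching

def overlaps(subscription, advertisement, event_type, domains):
    """Brute force: does any message both match s and conform to a?"""
    names = sorted(domains)
    for combo in product(*(domains[name] for name in names)):
        event = (event_type, dict(zip(names, combo)))
        if matches(event, subscription) and conforms(event, advertisement):
            return True
    return False

# Hypothetical event type "auction" with two attributes and small finite domains.
domains = {"price": range(0, 10), "condition": ["new", "used"]}
s = ("auction", ("and", [("pred", "price", "<", 5),
                         ("pred", "condition", "=", "new")]))
a_wide = ("auction", ("pred", "price", ">", 3))
a_high = ("auction", ("pred", "price", ">", 7))
e = ("auction", {"price": 4, "condition": "new"})

print(matches(e, s), conforms(e, a_wide), conforms(e, a_high))  # True True False
print(overlaps(s, a_wide, "auction", domains))                  # True
print(overlaps(s, a_high, "auction", domains))                  # False
```

The unrestricted attribute `condition` in both advertisements accepts every value, which mirrors the rule that attributes not referred to by predicates are left open.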

Figure 3: Examples of event messages, subscriptions, advertisements, and their relationships. The upper part shows event messages, subscriptions, and advertisements in the event space; the bottom part shows examples and counterexamples of the matching, conforming, and overlapping relationships.

Figure 4: Example subscription s1 (top) and advertisement a1 (bottom). s1 combines the predicates title like A, ending < 1 day, condition = new, price < B, price < C, and condition = used using AND and OR operators; a1 combines the predicates category = A, buyItNow = yes, price > B, price > C, and attribute = signed.

3. MATCHING ALGORITHM

When designing BoP, we particularly aimed at creating a system that is not tailored solely to arbitrary Boolean expressions but also efficiently supports the use of mere conjunctions. Regarding the matching algorithm, this design goal has been achieved by extending an arbitrary Boolean variant [3] of the general-purpose conjunctive counting algorithm [2, 33]. We present these extensions in the following. Our decision to further improve this algorithm has been based on the advantageous properties of this approach:

1. It uses one-dimensional indexes, being the best compromise between time and space-efficient matching.

2. It is a general-purpose solution, not requiring particular subscription patterns in order to be applicable.

Regarding Advantage 1, multi-dimensional indexing solutions (e.g., [1, 16]) generally result in a more efficient matching process [19] than one-dimensional ones (e.g., [2, 15, 17, 33]). Non-indexing solutions (e.g., [8, 29]), on the contrary, are the least efficient matching method [19, 29]. However, multi-dimensional approaches are costly to re-organize in case of deregistrations. Additionally, they show poor scalability characteristics due to their large memory requirements. Non-indexing approaches also lead to poor scalability because of their requirement to evaluate each subscription for each message. A one-dimensional approach, as taken in the following, balances these two extremes [3].

Regarding Advantage 2, other one-dimensional (conjunctive) matching solutions make strong assumptions on typical subscription patterns. For example, the cluster approach [15, 17] requires highly selective access predicates that are shared among various subscriptions. This property makes that approach a specialized matching solution that is not practically applicable to more general settings.

3.1 Indexation Process

Within one-dimensional indexing solutions, it has become a “de facto standard” to apply predicate indexes according to both the attribute domain and the filter function. Equality predicates on integers or floats typically apply hash tables; Patricia trees could be used for strings. Domains of a fixed enumerable size allow for the development of specialized, highly efficient data structures, e.g., described in [2]. Before the indexation process, every subscription s undergoes a syntactical analysis and rewriting procedure, involving the negation removal and the operator summarization.

Rewriting. In the negation removal procedure, BoP pushes down all negations into the leaf nodes nl of subscription trees. By applying De Morgan’s laws, negations are thus integrated into predicates. This procedure requires the matching algorithm to always support the inverse of a given filter function, e.g., equality and inequality. In practice, this requirement is straightforward to implement by using the original function for indexing and inverting the results. The operator summarization procedure analyzes the inner nodes of subscription trees and summarizes consecutive operators of the same kind (i.e., conjunctive nodes nc and disjunctive nodes nd). Proceeding that way reduces the memory requirements for the later encoding of subscription trees. Next to these two basic rewriting procedures, one can apply other syntactic rules, e.g., to minimize or simplify a filter expression. Such rules can be easily integrated into BoP. The semantic rewriting of filter expressions is also a possible extension of BoP, but it is beyond the scope of this paper.

Indexation. As in conjunctive solutions, the predicate indexing phase inserts predicates into specialized predicate index structures, assigns an identifier id(p) to predicates p (common predicates get assigned the same identifier), and integrates information about the predicate usage in a predicate subscription association table (cf. Figure 5). Subscription indexing extends the procedure taken in the counting algorithm: BoP uses the minimum predicate count vector to store a subscription-specific property, the minimal number of fulfilled predicates pmin(s) that is required for subscription s to be fulfilled. For every subscription s, BoP can calculate pmin(s) by analyzing the structure of its filter expression, encoded in the subscription tree [3]. Subscription indexing additionally encodes subscription trees space-efficiently, and stores the memory position loc(s) of this encoding for a subscription s in a subscription location table. For this purpose, subscriptions s get assigned a unique identifier id(s). We have given an overview of the described predicate and subscription index structures in Figure 5.

BoP improves the originally proposed subscription encoding scheme [3]: Inner nodes contain Boolean operators and consist of two parts, a structural component using 2 bytes (1 byte for the operator and 1 byte for the number of children) and a functional component representing the child nodes. Leaf nodes contain the predicate identifiers using 5 bytes (1 byte to denote it as a leaf and 4 bytes for the identifier). We have given an example of the encoding of s1 in Figure 6. Evidently, subscription trees are not the only method of representing a Boolean filter expression.
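The operator summarization, the minimum predicate count, and the byte-level encoding of Section 3.1 can be sketched in Python as follows. This is a minimal illustration rather than BoP's implementation: the tuple-based tree format and the function names are hypothetical, the little-endian byte order is an assumption, and the recursive rule used for `p_min` (sum over conjunctions, minimum over disjunctions) is one plausible reading of "the minimal number of fulfilled predicates" that only applies once negations have been pushed into the predicates. The node codes follow the caption of Figure 6, and the example tree has the structure of s1 as encoded there.

```python
import struct

CONJ, DISJ, LEAF = 1, 2, 4   # node codes as stated in the caption of Figure 6
CODES = {"and": CONJ, "or": DISJ}

def summarize(node):
    """Operator summarization: merge a child into a parent with the same operator."""
    if isinstance(node, int):            # a leaf is a predicate identifier
        return node
    op, children = node
    merged = []
    for child in (summarize(c) for c in children):
        if not isinstance(child, int) and child[0] == op:
            merged.extend(child[1])      # lift grandchildren one level up
        else:
            merged.append(child)
    return (op, merged)

def p_min(node):
    """Assumed rule: a conjunction needs all children, a disjunction the cheapest one."""
    if isinstance(node, int):
        return 1
    counts = [p_min(child) for child in node[1]]
    return sum(counts) if node[0] == "and" else min(counts)

def encode(node):
    """Leaf: 1 byte marker + 4 byte identifier. Inner node: operator, child count, children."""
    if isinstance(node, int):
        return struct.pack("<BI", LEAF, node)
    op, children = node
    header = struct.pack("<BB", CODES[op], len(children))
    return header + b"".join(encode(child) for child in children)

# s1 with an unnecessarily nested conjunction, using id(p_i) = i as in Figure 6.
nested = ("and", [1, ("and", [("or", [("and", [3, 4]), ("and", [5, 6])]), 2])])
flat = summarize(nested)
encoded = encode(flat)

print(flat)                        # ('and', [1, ('or', [('and', [3, 4]), ('and', [5, 6])]), 2])
print(p_min(flat), len(encoded))   # 4 38
```

With this layout, s1 occupies 4 inner nodes of 2 bytes and 6 leaves of 5 bytes, i.e., 38 bytes, and summarizing the nested conjunction saves one 2-byte structural component.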

Campailla et al. [8] and Li et al. [21] apply variants of binary decision diagrams to represent Boolean filter expressions. We refer to Section 7 for an analysis of the advantages and disadvantages of these alternative encoding schemes.

Deregistration. The deregistration in BoP just requires the subscriber to issue the identifier id(s) of the subscription s to be deregistered. All information that is needed for the deregistration process is found in the index structures.

Figure 5: Overview of the applied index structures (one-dimensional predicate indexes, predicate subscription association table mapping id(p) to {id(s)}, minimum predicate count vector indexed by id(s), subscription location table mapping id(s) to loc(s), and subscription trees).

Figure 6: Encoding of subscription s1 using the index of a predicate as its identifier (ID in this figure), i.e., id(pi) = i. A conjunctive node is identified by 1, a disjunctive node by 2, and a leaf node by 4. Encoded sequence: [Conj 1, Child. 3] [Leaf 4, ID 1] [Disj 2, Child. 2] [Conj 1, Child. 2] [Leaf 4, ID 3] [Leaf 4, ID 4] [Conj 1, Child. 2] [Leaf 4, ID 5] [Leaf 4, ID 6] [Leaf 4, ID 2].

3.2 Matching Process

Due to our approach of extending a conjunctive solution, the initial part of the matching process in BoP is the same as in the counting algorithm: In predicate matching, all predicates that are fulfilled by the incoming message are determined using the one-dimensional index structures; they are noted in a fulfilled predicate vector. The next step is subscription matching: Using the predicate subscription association table, the algorithm accumulates a counter in a hit vector, stating the number of fulfilled predicates per subscription.

The remaining steps extend the counting approach: Comparing the entry in the hit vector to the entry in the minimum predicate count vector leads to a set of candidate subscriptions. These candidates are all subscriptions that have a larger or equal value in the hit vector than in the minimum predicate count vector. What remains to be done for these candidates is the evaluation of their filter expressions: Using the subscription location table, the algorithm determines the memory address of the encoded subscription tree; this tree is finally evaluated. Because predicates are represented by their identifiers, BoP just needs to evaluate inner nodes. The result of the filter function in the leaves is found in the fulfilled predicate vector, populated in predicate matching.

3.2.1 Extensions and Fine Tuning

One of the design goals of BoP is to provide a general-purpose system. There are also extensions to the presented generic algorithm that significantly improve its performance.

Pure Conjunctions. If subscribers register pure conjunctive subscriptions, BoP handles them with only a little overhead compared to the counting algorithm: In the registration process, BoP analyzes the filter expression of a subscription s. If s is a pure conjunction, the structural component of the encoded root contains a different operator identifier than that of an ordinary conjunction. The matching algorithm subsequently avoids the evaluation of s if it is a candidate subscription, i.e., the evaluation method just returns true. BoP can apply this method because in a conjunctive subscription s, pmin(s) is always equal to the total number of predicates. Hence, every conjunctive candidate is a matching subscription. The minimal overhead in comparison to the counting approach is to retrieve the memory address of the subscription tree, using the subscription location table.

Short-Circuiting. For Boolean subscriptions, BoP applies a short-circuiting optimization. However, due to the memory-aware encoding, full short-circuiting can only be applied to root nodes. Inner nodes use partial short-circuiting, i.e., nodes are not fully bypassed but only accessed to determine their size; BoP thus avoids the evaluation of Boolean expressions and the access of the fulfilled predicate vector. We have also experimented with the original encoding [3] that allows for full bypassing. Our alternative scheme requires less memory but has led to the same efficiency properties in empirical studies.

Order of Children. BoP applies a routing optimization (cf. Section 4) that estimates the selectivity of the nodes of subscription trees. The matching algorithm uses this information and re-orders the children of a node according to the selectivity estimate. For conjunctions, BoP orders children with increasing selectivity. It is thus more likely to determine a non-matching candidate early in the evaluation process. For disjunctions, children are arranged with a decreasing selectivity estimation. Hence, BoP determines matching candidates early and avoids a further evaluation.

Matching Shortcut. BoP applies the subscription/advertisement forwarding scheme as routing algorithm. This allows for the implementation of a shortcut to avoid the evaluation of most candidate subscriptions. The same shortcut can be applied if subscribers with various registered subscriptions only need to be notified about matching events but not about which subscription has been fulfilled. BoP uses a hash table to record whether any non-local subscription that has been forwarded by a particular neighbor is fulfilled by the incoming message e. Because e needs to be routed to a neighbor regardless of how many of the forwarded subscriptions are fulfilled, BoP only needs to evaluate the respective candidates until one fulfilled subscription is found. Proceeding in this way avoids the evaluation of the majority of candidate subscriptions. The same approach can be used for subscribers having the properties described before. An inspiring shortcut has been proposed in [10] for subscriptions in disjunctive normal form (treating a set of conjunctive subscriptions as one subscription).

Exploiting Event Types. Event types, on the one hand, define the semantics of subscriptions. On the other hand, BoP can exploit them to improve its matching process: Messages and subscriptions can only match if they specify the same type. A matching algorithm can thus neglect subscriptions of any other type than the one stated by the event message. This is automatically exploited in predicate matching. However, subscription matching, in the generic way we have described it previously, offers some optimization potential: The general idea is to compact the hit vector to reduce the number of comparisons that is required to identify candidates. A way of compacting this structure while still using an efficient implementation is an advanced handling of subscription identifiers: Firstly, identifiers contain two parts, one stating the type and one stating a unique identifier for this type. This allows the hit vector (as array) to only contain entries for one type. Secondly, subscription identifiers should not contain holes, i.e., the identifier space should be densely populated. This can be achieved by reissuing identifiers to subscribers or by adding another level of indirection, i.e., internal identifiers differ from those used by subscribers. We plan to fully integrate this extension into BoP.
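The matching process of Section 3.2, together with the handling of pure conjunctions, can be sketched in Python as follows. The sketch makes several simplifying assumptions that are not taken from BoP: predicates are plain Python functions scanned linearly instead of being stored in one-dimensional indexes, all structures are dictionaries rather than vectors, every predicate occurs at most once per subscription, and negations have already been pushed into the predicates. The names (`Matcher`, `register`, `match`), the rule used in `p_min`, the predicate operands, and the sample events are hypothetical.

```python
from collections import Counter

def leaves(tree):
    """Predicate identifiers of a tree; a leaf is an int, an inner node is (op, children)."""
    if isinstance(tree, int):
        return {tree}
    return set().union(*(leaves(child) for child in tree[1]))

def p_min(tree):
    """Assumed rule: sum over conjunctions, minimum over disjunctions."""
    if isinstance(tree, int):
        return 1
    counts = [p_min(child) for child in tree[1]]
    return sum(counts) if tree[0] == "and" else min(counts)

def evaluate(tree, fulfilled):
    """Evaluate inner nodes only; leaf results come from the fulfilled predicate set."""
    if isinstance(tree, int):
        return tree in fulfilled
    results = (evaluate(child, fulfilled) for child in tree[1])
    return all(results) if tree[0] == "and" else any(results)

class Matcher:
    def __init__(self, predicates):
        self.predicates = predicates   # id(p) -> test function on an event
        self.assoc = {}                # predicate subscription association table
        self.min_count = {}            # minimum predicate count vector
        self.trees = {}                # stands in for location table plus encoded trees
        self.pure_conj = set()         # subscriptions flagged as pure conjunctions

    def register(self, sid, tree):
        self.trees[sid] = tree
        self.min_count[sid] = p_min(tree)
        for pid in leaves(tree):
            self.assoc.setdefault(pid, set()).add(sid)
        if tree[0] == "and" and all(isinstance(c, int) for c in tree[1]):
            self.pure_conj.add(sid)

    def match(self, event):
        # Predicate matching (a linear scan replaces the index lookup here).
        fulfilled = {pid for pid, test in self.predicates.items() if test(event)}
        # Subscription matching: count fulfilled predicates per subscription.
        hits = Counter(sid for pid in fulfilled for sid in self.assoc.get(pid, ()))
        candidates = [sid for sid, n in hits.items() if n >= self.min_count[sid]]
        # Pure conjunctive candidates always match; other candidates are evaluated.
        return sorted(sid for sid in candidates
                      if sid in self.pure_conj or evaluate(self.trees[sid], fulfilled))

predicates = {
    1: lambda e: "camera" in e["title"],
    2: lambda e: e["ending_hours"] < 24,
    3: lambda e: e["condition"] == "new",
    4: lambda e: e["price"] < 100,
    5: lambda e: e["price"] < 50,
    6: lambda e: e["condition"] == "used",
}
matcher = Matcher(predicates)
matcher.register("s1", ("and", [1, ("or", [("and", [3, 4]), ("and", [5, 6])]), 2]))
matcher.register("s2", ("and", [1, 2]))

cheap_used = {"title": "camera lens", "ending_hours": 5, "condition": "used", "price": 40}
pricey_used = dict(cheap_used, price=80)
print(matcher.match(cheap_used))    # ['s1', 's2']
print(matcher.match(pricey_used))   # ['s2']
```

For the second event, s1 reaches four fulfilled predicates and therefore becomes a candidate, but evaluating its tree shows that it is not fulfilled; this is the case that distinguishes candidates from matching subscriptions for non-conjunctive expressions.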

(1) title like A (4)

AND

(3) OR (2) ending < 1 day

(5) (6)

AND (7)

condition = new price < B price < C condition = used

Figure 7: Overview of all valid pruning operations for s1 (Figure 4), having named them (1) to (7). ing leaf node thus becomes a child of the disjunction. To ensure the correct notification of subscribers, BoP only prunes non-local subscriptions. The generalization of routing entries thus only affects the internal pub/sub network but not subscribers: The broker sending a notification always uses the unpruned subscription for matching. Internally created false positives are thus not sent to subscribers.

4.2

4. SUBSCRIPTION-BASED OPTIMIZATION

The second problem we need to solve in order to support arbitrary Boolean subscriptions is the development of a subscription-based routing optimization. Most current optimizations have only been proposed in combination with restricted conjunctive subscriptions and are not practically applicable to arbitrary Boolean ones. We refer to Section 7 for an overview and analysis of these optimizations. Another problem of most current optimizations is their dependence on certain relationships and similarities among subscriptions. BoP thus applies a variant of the generic subscription pruning optimization [4], which does not depend on such properties. We show the independence of BoP from the covering relationships among subscriptions in Section 6.

4.1 Optimization Idea: Pruning

The main goal of subscription pruning is to reduce the sizes of event routing tables. When applying subscription/advertisement forwarding, subscriptions act as routing entries. BoP thus alters subscriptions to achieve its goal. To implement an efficient optimization, BoP analyzes subscriptions once when they are registered. This analysis leads to a local optimization decision: every subscription is ranked with a numeric value, stating its optimization potential. If BoP needs to optimize, it has to reach global optimization decisions; these decisions are based merely on the local rankings of subscriptions. They thus do not involve the complex relation of all subscriptions with each other, as required by current approaches, but only a comparison of local rankings. As introduced, subscriptions are internally represented by their tree structures. BoP prunes these subscription trees in order to reach its optimization goals. A required property of this pruning process is to always create a more general (i.e., a less or equally selective) subscription. The only syntactically valid pruning operation to achieve this property is to remove a child of a conjunctive node. We have illustrated the seven valid prunings for example subscription s1 in Figure 7. After having pruned a subscription s, BoP again applies the operator summarization method (cf. Section 3.1) to compact s. Additionally, BoP performs the unary operator removal procedure, e.g., the conjunctive node in Figure 7 is removed after performing Pruning Option (7); the remain-

Ordering Pruning Operations

BoP currently supports four different measures $\Delta(s_i, s_j)$ to rank pruning operations of subscription $s_i$ to $s_j$. Other measures are currently under development. The smaller $\Delta(s_i, s_j)$, the better the optimization potential of this pruning option.

Accuracy. The accuracy-based measure aims at reducing the selectivity of a subscription $s$ as little as possible through pruning. BoP estimates the selectivity $sel_{\approx}(s)$ of subscriptions based on historic information about the selectivity of predicates [4]. When pruning a subscription $s_i$ to $s_j$, the rank is defined as follows: $\Delta(s_i, s_j) = sel_{\approx}(s_j) - sel_{\approx}(s_i)$.

Efficiency. An effective efficiency-based measure should be intertwined with the matching algorithm of a system. The most important property of the algorithm used in BoP is $p_{min}(s)$. BoP thus aims at reducing $p_{min}(s)$ as little as possible for a subscription $s$ when pruning. This avoids an increase in the occurrence of $s$ as a candidate subscription. The rank is defined as follows: $\Delta(s_i, s_j) = p_{min}(s_i) - p_{min}(s_j)$.

Size. When considering the size as measure, BoP aims at strongly reducing the memory requirements for routing tables. To achieve this effect, BoP approximates the influence of a pruning operation as the change in the size $mem(s)$ of the encoded subscription tree of $s$. We thus define the rank as follows: $\Delta(s_i, s_j) = mem(s_j) - mem(s_i)$.

Accuracy and Popularity. A way of relating subscriptions with each other is to incorporate both the change in accuracy and the commonality of subscriptions (partially modeling the degree of subscription-subscription overlap). BoP takes this approach by favoring prunings that remove uncommon predicates. For this purpose, we introduce the commonality measures $occ(n)$ and $occ(s)$ for nodes $n$ and subscriptions $s$:

• For a leaf node $n_l$, $occ(n_l)$ equals the number of subscriptions using the predicate in $n_l$.
• For an inner node $n_i$, $occ(n_i)$ is the sum of $occ(n_j)$ for all children $n_j$ of $n_i$.
• For a subscription $s$, it holds $occ(s) = occ(n_i)$ with $n_i$ being the root node of the subscription tree of $s$.

Based on this definition, BoP avoids the removal of branches that involve popular predicates, but it also incorporates the


[Figure 8 diagram, steps in one broker: 1. Forwarded subscription; 2. Forwarding of original subscription; 3. Pruning and integration into routing table.]

Figure 8: Example of post-pruning in one broker.

[Figure 9 diagram, steps in one broker: 1. Forwarded subscription; 2. Pruning and forwarding of subscription; 3. Integration into routing table.]

Figure 9: Example of pre-pruning in one broker.

decrease in selectivity due to pruning: Subtrees containing infrequently used predicates are assigned a small rank that is further weighted down according to the selectivity decrease induced by the pruning. It holds $0 \le \Delta(s_i, s_j)$:

taken decisions are potentially sub-optimal from the viewpoint of neighbors. One can, however, collect status information from neighbors, e.g., memory usage or routing load, to tailor the pruning decisions to these neighbors. That is, brokers might perform different prunings for different neighbors. The example in Figure 9 shows this approach.

Combined pruning. In addition to using pre-pruning and post-pruning individually, both variants can be applied at the same time. Brokers thus prune subscriptions before forwarding (as in pre-pruning). Additionally, brokers independently prune subscriptions later on (as in post-pruning). The example in Figure 9 shows this option when replacing the broker with the one that is given in Figure 8.

Subscription pruning uses an optimization approach that is orthogonal to current proposals, e.g., covering [9, 25]. This property allows for the simultaneous application of pruning and recent optimizations. We will show this effect in Section 6: Even though the covering optimization has been fully applied, pruning leads to a further reduction in memory requirements and an increase in performance.
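The way a broker derives global pruning decisions from local rankings with a priority queue (minimum rank at the top) can be sketched as follows. This is a simplified illustration, not the BoP implementation: the function name `global_pruning`, the `budget` parameter, and the assumption that the successive best local ranks of each subscription are known in advance (BoP recalculates the next local decision after each executed pruning) are all hypothetical.

```python
import heapq


def global_pruning(local_ranks, budget):
    """Execute up to `budget` prunings in globally ascending rank order.

    local_ranks maps a subscription id to the ranks of its successive best
    local pruning decisions (an assumption made to keep the sketch short).
    """
    heap = [(ranks[0], sid, 0) for sid, ranks in local_ranks.items() if ranks]
    heapq.heapify(heap)
    executed = []
    while heap and len(executed) < budget:
        rank, sid, step = heapq.heappop(heap)  # best global decision
        executed.append((sid, step, rank))
        nxt = step + 1
        if nxt < len(local_ranks[sid]):
            # insert the new local decision of the pruned subscription
            heapq.heappush(heap, (local_ranks[sid][nxt], sid, nxt))
    return executed


order = global_pruning({"s1": [0.1, 0.5], "s2": [0.2, 0.3]}, budget=3)
```

Only local rankings are compared; the subscriptions themselves are never related with each other, which is the property that distinguishes pruning from covering-style optimizations.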

\[ \Delta(s_i, s_j) = \frac{occ(s_i)}{occ(s_j)} \times \bigl( sel_{\approx}(s_j) - sel_{\approx}(s_i) \bigr) \]

In empirical studies, this measure has resulted in the best optimization. We do not present the results within this paper but focus on a comparative study of the overall approach (using this measure) and conjunctive solutions (Section 6).

Bandwidth and Distance. When extending the routing scheme in BoP by integrating a "distance measure" to the local broker of a subscription (e.g., using the number of hops or the bandwidth along that path), an optimization can aim at pruning subscriptions according to this distance: The further away from a subscriber, the more strongly the routing entry is pruned. We plan to integrate this measure into BoP.

Combination of Measures. The best optimization effect might be achieved by using a multi-criteria optimization approach, combining all five of the presented measures. Although the accuracy and popularity variant has led to the best overall optimization, combining it with the other approaches might result in further improved behavior. We plan to take this step in the future.
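As an illustration of how the accuracy, efficiency, and accuracy-and-popularity ranks could be computed on subscription trees, consider the following sketch. Several parts are assumptions rather than statements about BoP: the tuple encoding of trees, the recursive definition of `p_min` (sum over conjunctions, minimum over disjunctions, chosen because it yields 4 for the assumed structure of s1), and in particular the selectivity estimate `sel`, which assumes independent predicates, whereas BoP relies on the estimation described in [4].

```python
from math import prod


def p_min(node):
    """Assumed definition: minimal number of fulfilled predicates."""
    kind, body = node
    if kind == "leaf":
        return 1
    counts = [p_min(child) for child in body]
    return sum(counts) if kind == "and" else min(counts)


def occ(node, usage):
    """Commonality: sum of the usage counts of all predicates in the tree."""
    kind, body = node
    if kind == "leaf":
        return usage[body]
    return sum(occ(child, usage) for child in body)


def sel(node, pred_sel):
    """Illustrative selectivity estimate assuming independent predicates."""
    kind, body = node
    if kind == "leaf":
        return pred_sel[body]
    parts = [sel(child, pred_sel) for child in body]
    if kind == "and":
        return prod(parts)
    return 1 - prod(1 - p for p in parts)


def rank_accuracy(si, sj, pred_sel):
    return sel(sj, pred_sel) - sel(si, pred_sel)


def rank_efficiency(si, sj):
    return p_min(si) - p_min(sj)


def rank_accuracy_popularity(si, sj, pred_sel, usage):
    return occ(si, usage) / occ(sj, usage) * rank_accuracy(si, sj, pred_sel)


# Pruning predicate "b" from the conjunction (a AND b).
si = ("and", [("leaf", "a"), ("leaf", "b")])
sj = ("leaf", "a")
pred_sel = {"a": 0.5, "b": 0.25}  # hypothetical predicate selectivities
usage = {"a": 3, "b": 1}          # hypothetical usage counts
combined = rank_accuracy_popularity(si, sj, pred_sel, usage)
```

In this toy setting, removing the rarely used predicate `b` keeps `occ(sj)` close to `occ(si)`, so the factor `occ(si) / occ(sj)` stays near 1 and the rank is dominated by the selectivity change.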

5. SUPPORTING ADVERTISEMENTS

BoP also supports the use of advertisements in its routing protocols. To practically use the advertisement forwarding routing scheme, BoP implements an algorithm to calculate the overlapping relationships between Boolean subscriptions and advertisements (Section 5.1), and also applies an advertisement-based optimization, advertisement pruning.

4.3 Global Pruning

Having defined the measures to assign a rank to prunings, BoP easily reaches global optimization decisions: After determining the best local decision at registration, the ranking and the respective subscription $(\Delta(s_i, s_j), s_i)$ are inserted into a priority queue (minimum element at the top). By successively removing the top from the queue, BoP finds the best global pruning operation. After it has been executed, the new local decision is calculated and inserted into the queue. Changes, e.g., in the selectivity estimation, can be resolved by re-inserting the new local decision into the queue. Pruning can be performed in three variants, as presented in the following. BoP currently applies post-pruning.

Post-pruning. Post-pruning is performed in broker components to achieve an optimization in the individual component: Subscriptions are always forwarded in their original, unpruned form. Every broker then independently prunes according to the chosen or assigned measure. This allows brokers to reach the best global pruning decisions based on their current situation. We exemplify this in Figure 8.

Pre-pruning. When using pre-pruning, brokers reach pruning decisions before forwarding subscriptions, i.e., already pruned subscriptions are forwarded. This means that the

5.1 Calculation of Overlapping

Because conjunctive subscriptions contain one predicate per attribute, their overlap can be calculated from overlapping predicates, e.g., as sketched in [24]. This solution, however, does not work for the Boolean pub/sub model because subscriptions and advertisements can contain any number of predicates per attribute. That is, overlap might exist even if there are no overlapping predicates. However, approaching the overlap from the opposite direction, i.e., basing the calculation on non-overlapping predicates, leads to a solution. In the following, we describe the calculation from the viewpoint of advertisements, i.e., given an advertisement, BoP determines all overlapping subscriptions. The computation works analogously for the other direction due to the symmetry of the overlapping relationship.

5.1.1 Disjoint Predicates

We refer to non-overlapping predicates as disjoint predicates. The set of disjoint predicates for a predicate p includes all those predicates pi that (1) are used in subscriptions, (2) refer to the same attribute as p, and (3) are never


fulfilled by any attribute-value pair that fulfils p. BoP bases the calculation of disjoint predicates on its predicate indexes. According to the filter function used in a predicate p, the system can apply a set of fixed computation rules (cf. [6]) to determine the set of disjoint predicates Pdis (p). Based on these results for predicates, BoP successively calculates the disjoint predicates for arbitrary Boolean advertisements based on the nodes n of advertisement trees. To incorporate the Boolean structure, the semantics of disjoint predicates Pdis (n) for nodes n of a tree differs from the semantics of Pdis (p) for predicates p. Whereas the latter has been introduced as a set of predicates, we need to define the former as a set of sets of predicates (each of these sets is referred to as individual disjoint predicate set):

Table 1: Properties of subscription classes

Property                       Class 1   Class 2   Class 3
No. of Boolean operators          4        10         6
No. of original predicates        6        12         7
No. of conjunctions               2         4         6
No. of converted predicates       8        20        18
pmin(s)                           4         5         3

without strongly increasing the existing overlapping relationships. Applying this optimization thus only minimally increases the additional number of subscription forwardings. Similarly to the measures for subscription pruning, BoP defines an overlapping rank to estimate the influence of advertisement pruning operations. This measure contains a quantitative overlapping rank, incorporating the number of disjoint predicates, and a qualitative overlapping rank, incorporating the influence of these predicates on the overlapping relationships. By relating the ranks before and after pruning, BoP is able to define an order of all possible pruning operations and to execute these prunings in this order. We refer to [6] for details and evaluation results.

• For a leaf node $n_l$, it holds $P_{dis}(n_l) = \{P_{dis}(p)\}$, with $p$ being the predicate encoded in $n_l$.
• For a disjunctive node $n_d$ with $k$ children, $n_1$ to $n_k$, it holds $P_{dis}(n_d) = \bigcup_{i=1 \ldots k} P_{dis}(n_i)$, being the union of all individual disjoint predicate sets of children.
• For a conjunctive node $n_c$ with $k$ children, $n_1$ to $n_k$, it holds $P_{dis}(n_c) = \{ \bigcup_{i=1 \ldots k} x_i \mid x_i \in P_{dis}(n_i) \}$, describing the union of each individual disjoint predicate set of a child with each of these sets of all other children.

6.

In this section, we present the results of a comparative study of the arbitrary Boolean approaches introduced before and conjunctive solutions. We start by describing our test system and setup, and proceed with evaluating our results.

We have implemented the BoP prototype in C/C++. For the distributed component, we use standard TCP/IP sockets. Predicate indexes are implemented using the STL (Standard Template Library) map class. The minimum predicate count vector and the subscription location table are based on the STL class vector. For the predicate subscription association table, the fulfilled predicate vector, and the hit vector, we have used dynamic array implementations. Subscriptions, advertisements, and events are represented by a proprietary language, developed using Antlr (see footnote 2). These directives are always serialized to a textual notation before being sent over the network and deserialized before being internally processed.

As argued before, the matching algorithm in BoP supports conjunctive subscriptions in nearly the same way as the counting approach. Nevertheless, we have compiled a conjunctive version of BoP, removing the overhead of the extension and directly utilizing the counting approach. Proceeding in that way ensures that both algorithms utilize the same data structures and thus removes implementation-specific influences. The conjunctive version of BoP performs canonical conversions and supports the covering optimization.

• For an advertisement $a$, it holds $P_{dis}(a) = P_{dis}(n)$, with $n$ being the root node of $a$.

Having the means to calculate these disjoint predicates for any Boolean advertisement, BoP deduces the overlapping relationships out of this information as follows.
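The recursive definition of disjoint predicates for advertisement trees can be made concrete with a short sketch. The tuple encoding of trees, the function name `p_dis`, and the lookup table `disjoint_of` (standing in for the per-predicate sets derived from the predicate indexes) are assumptions for illustration and not part of BoP.

```python
from itertools import product


def p_dis(node, disjoint_of):
    """Return the set of individual disjoint predicate sets of a tree node."""
    kind, body = node
    if kind == "leaf":
        # a single individual set: the disjoint predicates of the predicate
        return {frozenset(disjoint_of[body])}
    child_sets = [p_dis(child, disjoint_of) for child in body]
    if kind == "or":
        # disjunction: collect the individual sets of all children
        return set().union(*child_sets)
    # conjunction: union one individual set per child, for every combination
    return {frozenset().union(*combo) for combo in product(*child_sets)}


# Hypothetical advertisement p1 AND (p2 OR p3) with made-up disjoint sets.
advertisement = ("and", [("leaf", "p1"),
                         ("or", [("leaf", "p2"), ("leaf", "p3")])])
disjoint_of = {"p1": {"q1"}, "p2": {"q2"}, "p3": {"q3", "q4"}}
result = p_dis(advertisement, disjoint_of)
```

For this example the disjunction contributes two individual sets, and the conjunction combines each of them with the set of `p1`, giving two individual disjoint predicate sets for the advertisement.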

5.1.2 Determination of Overlapping

The overlapping algorithm also uses the notion of candidates. BoP bases the determination of overlapping candidates of an advertisement $a$ on the minimal number of fulfilled predicates $p_{min}(s)$ (cf. Section 3.1), the total number of predicates $|P(s)|$, and the number of disjoint predicates $|P_{dis}(s)|$ per subscription $s$. The values $|P_{dis}(s)|$ are calculated by BoP using a hit vector for each individual disjoint predicate set in $P_{dis}(a)$. Using the predicate subscription association table, the system can efficiently determine this subscription-specific property. For each candidate overlapping subscription $s$, the following property has to hold:

\[ |P(s)| \ge p_{min}(s) + |P_{dis}(s)| \]

This inequality states that a subscription can still evaluate to true for a determined individual disjoint predicate set: even if all $|P_{dis}(s)|$ disjoint predicates are unfulfilled, at least $p_{min}(s)$ predicates remain that may be fulfilled. To determine whether a candidate is an overlapping subscription, BoP evaluates its subscription tree. It is again (as in the matching algorithm) not required to evaluate the functions of predicates because BoP has already determined the disjoint predicates: For disjoint predicates, BoP assumes predicates to evaluate to false; other predicates are assumed to evaluate to true. If the whole subscription tree of $s$ still evaluates to true, $s$ is an overlap.

EVALUATION AND COMPARISON

6.1 Experimental Setup

We conducted our experimental evaluation using an online book auction scenario. To obtain realistic data for event messages, we analyzed the distributions of book auctions on eBay (http://www.ebay.com). Based on this information, we created a typical workload (we refer to [5] for details). For subscriptions and advertisements, we have taken a semi-realistic approach: We have identified various subscription/advertisement classes that would typically be used in online book auctions [6]. These classes describe the overall

5.2 Advertisement-based Optimization

The final step to support the arbitrary Boolean pub/sub model is to develop an optimization for advertisements. BoP includes such an optimization, advertisement pruning. Advertisement pruning is tailored to advertisements and thus aims at decreasing the sizes of subscription routing tables

2 Language recognition tool, http://www.antlr.org/

structure of subscriptions/advertisements, i.e., they define the used operators and templates for predicates. Actual operands in predicates are created randomly, based on the identified properties for events. We have obtained similar results for various operand distributions. Table 1 gives an overview of the main properties of our three subscription classes before and after conversion. The influence of conversions increases for these classes: 2, 4, and 6 conjunctions are required to express the same interest as in a Boolean subscription (Row 3). Thereby, the predicate numbers increase by 33, 67, and 157% (relation of Rows 4 and 2). The minimal number of fulfilled predicates pmin(s) for subscriptions s of these classes is 4, 5, and 3 (Row 5). Class 1 is represented by subscription s1 in Figure 4.

In our experiments, brokers are run on machines with 512 MB of RAM, using a 2 GHz processor, and connected by a 10 Mbps network. To avoid influences and characteristics of a chosen network topology, we decided to connect brokers in a line. This is the best way to directly show the behavior of the algorithms of BoP in a distributed setting without incorporating effects originating from specific topologies. For most of our experiments, we restrict the network to five brokers, register 200,000 subscriptions, and state the average results when publishing 100,000 event messages (both uniformly distributed among brokers). In parallel experiments (not shown in this paper), we show the independence of the optimization potential of pruning from the network by scaling the network size for two extreme topologies.
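The Class 1 figures in Table 1 can be reproduced with a small sketch of the canonical conversion to disjunctive normal form. The tuple encoding of trees and the function name `to_dnf` are hypothetical, and the structure used for s1 is an assumption reconstructed from Figure 7 and Table 1 (six predicates, four operators); the conversion itself is the textbook distribution of conjunctions over disjunctions, not necessarily the exact procedure of the conjunctive BoP version.

```python
from itertools import product


def to_dnf(node):
    """Return the conjunctions (lists of predicates) equivalent to a tree."""
    kind, body = node
    if kind == "leaf":
        return [[body]]
    parts = [to_dnf(child) for child in body]
    if kind == "or":
        # disjunction: the conjunctions of all children side by side
        return [conj for part in parts for conj in part]
    # conjunction: combine one conjunction of every child
    return [sum(combo, []) for combo in product(*parts)]


s1 = ("and", [
    ("leaf", "title like A"),
    ("leaf", "ending < 1 day"),
    ("or", [
        ("and", [("leaf", "condition = new"), ("leaf", "price < B")]),
        ("and", [("leaf", "price < C"), ("leaf", "condition = used")]),
    ]),
])
conjunctions = to_dnf(s1)
converted = sum(len(conj) for conj in conjunctions)
increase_percent = round((converted - 6) / 6 * 100)  # 6 original predicates
```

Under the assumed structure, the conversion yields 2 conjunctions with 8 predicates in total, i.e., an increase of 33% over the 6 original predicates, which matches Rows 2 to 4 of Table 1 for Class 1.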

times, in fact, is linear but advantageously influenced by the processor cache for small subscription numbers. The point of changing gradients occurs at a much smaller number of subscriptions for the counting than for the arbitrary Boolean approach because the conjunctive algorithm needs to internally convert the original subscriptions: Due to the commonality among the converted subscriptions, more counters in more subscriptions need to be increased after the conversion. The hit vector thus does not fit into the processor cache from approximately 50,000 original subscriptions onwards (200,000 converted ones). After having registered approximately 100,000 original subscriptions (400,000 converted ones), the influence of the processor cache is negligible and thus the maximal gradient of the curve is reached. Although the arbitrary Boolean algorithm is subjected to the same influence, the effect on this algorithm is much less: Firstly, the point of changing gradients occurs at a much larger number of registered subscriptions: approximately 200,000. This number of subscriptions is four times the number of subscriptions in the conjunctive setting (the conversion leads to four times the number of original subscriptions). Hence, the processor cache can store four times more unconverted subscriptions in the hit vector. Secondly, the main proportion of matching time is not spent for increasing counters but for evaluating candidate subscriptions (there are much fewer predicates because no canonical conversion needs to be performed). Hence, the processor cache does not have such a great influence on the overall matching time, i.e., the change in the gradient is less. Interpreting the curves in Figure 10, both algorithms initially show similar matching times. The more subscriptions get registered, the larger becomes the difference between arbitrary Boolean and conjunctive approach. 
Having registered more than 400,000 subscriptions, the difference between the algorithms stabilizes: The Boolean solution requires approximately 75, 64, 68, 72, and 76 ms less per event than the conjunctive approach for the five distributions.

6.2 Matching: Conjunctive vs. Boolean

[Figure 10 plot: average time per event in ms (ordinate, 0 to 300) over the number of original subscriptions (abscissa, 100,000 to 500,000); series Bool and Conj for the distributions u, z, n, rz, and rn.]

6.3 Pruning: Conjunctive vs. Boolean

We have also comparatively analyzed the influence of subscription pruning for both the Boolean and the conjunctive version of BoP. In the graphical illustration in Figure 11, we have directly related the reduction in memory requirements for event routing tables (abscissa) to matching time (left ordinate) and network load (right ordinate), using the setting described in Section 6.1. We performed an increasing number of post-prunings in all brokers in this experiment (stopping once each of the remaining prunings would remove a subscription). We had to reduce the number of original subscriptions to 100,000 for Class 2 and to 67,000 for Class 3 due to the increase in memory requirements after the canonical conversion. Note that the reduction in memory usage is given relative to the un-optimized setting, i.e., for the conjunctive case the situation after the conversion (having increased the predicate numbers) is used as the basis. Our results show that the application of pruning is beneficial for both Boolean and conjunctive subscriptions. For the different classes, compared to the un-optimized case, the overall matching time decreases by 64, 51, and 29% (46% for the combined setting) for the Boolean setting and by 37, 50, and 22% (32% for the combined setting) for the conjunctive one. At the same time, the network load only increases slightly. From a certain point onwards, the cut-off point, the

Figure 10: Comparison of matching performance with increasing subscriptions using various predicate distributions (u–uniform, n–normal, z–Zipf, rn–reversed normal, rz–reversed Zipf distribution).

We have comparatively evaluated the matching performance in individual brokers of BoP when using the Boolean algorithm against the counting algorithm. Figure 10 shows the matching time (ordinate) of both algorithms with an increasing subscription number (abscissa) in the combined setting (we omit similar results for the individual classes due to space restrictions). For each algorithm, we evaluated five distributions in the operands of subscriptions (Figure 10). As can be seen, the time efficiency of both algorithms is only slightly influenced by the actual predicate distributions. Theoretically, both algorithms should show linearly increasing matching times with increasing subscription numbers. In practice, however, both approaches appear to lead to super-linearly developing times. The reason for this property of both algorithms is found in their general approach of incrementing counters per subscription and the influence of a limited processor cache. The behavior of the matching


[Figure 11 plots, panels (a) Subscription Class 1, (b) Subscription Class 2, (c) Subscription Class 3, (d) Subscription Class 1–3. Abscissa: proportion of removed pred/sub associations; left ordinate: average time per event in ms; right ordinate: average brokers per message. Series: NetworkBool, NetworkConj, TimeBool, TimeConj.]

Figure 11: Influences of pruning for conjunctive and Boolean subscriptions. Relation of reduction in memory requirements (abscissae) to matching performance (left ordinates) and network load (right ordinates).

network load increases strongly and, at the same time, the overall matching time starts to increase. In practice, post-pruning should only be performed up to this cut-off point. When comparing the un-optimized settings, the Boolean approach performs better than the conjunctive one, as predicted by our results regarding matching (previous section). This behavior is illustrated by the leftmost points in Figure 11. The overall best performance after the pruning optimization is always achieved by the arbitrary Boolean approach. However, the conjunctive solution might temporarily lead to a better performance when reducing the predicates by the same proportion, e.g., as shown in Figure 11(b).

regardless of the covering properties: It reduces the memory requirements by between 30 and 65% independently of the existing covering. The matching performance is higher for small covering proportions. The reasons are (1) that subscription pruning optimizes orthogonally to this proportion, and (2) that in case of little covering, fewer subscriptions are forwarded in the network, leading to fewer filtered messages. Subscription covering, on the other hand, improves its optimization potential, with respect to both performance and memory usage, with an increasing covering proportion. Subscription covering thus leads to better results in case of a high covering proportion. Regarding performance, subscription covering leads to better results than pruning for covering proportions of more than approximately 0.78 (Class 1), 0.30 (Class 2), 0.85 (Class 3), and 0.68 (Classes 1–3). Due to the orthogonality of subscription covering and pruning, one can utilize both optimizations at the same time. After having fully optimized based on the existing covering, pruning even further improves the matching performance and reduces the memory requirements. We have given an overview of the results when proceeding in that way in Table 2. Our experiments show that pruning improves the results of the covering optimization for all covering proportions. The relative improvement of the matching performance is higher the less covering exists among subscriptions. This behavior results from the removal of a large number of subscriptions in case of high covering proportions.

6.4 Subscription Pruning vs. Covering

6.5

In this part of our evaluation, we directly compare the optimization potential of covering and pruning. For the former, we canonically convert subscriptions, apply the covering optimization, and use the counting approach for matching. In the latter scenario, we register the original subscriptions, apply subscription pruning, and use the Boolean matching algorithm. In our experiments, we vary the number of covering relationships among subscriptions (abscissae in Figure 12). For this purpose, we define the covering proportion as the number of removed subscriptions due to applying covering divided by the original subscription number. In our experiments, we have created subscriptions with different covering proportions by increasing the sizes of selected attribute domains. Due to the varying structures of subscription classes, this results in different covering proportions. We have again decreased the subscription number for some settings (cf. Section 6.3) due to the large memory usage of the conjunctive approach for little covering. Our results show that subscription pruning is applicable

Similarly to the matching algorithm, we comparatively evaluated the performance of our overlapping calculation approach against a conjunctive solution. For this experiment, we registered a varying number of advertisements (abscissae in Figure 13); in the conjunctive setting, we converted these advertisements. The performance measure at the ordinates of Figure 13 shows the average time (using 25,000 subscriptions) to calculate the overlap. We analyzed two scenarios: All calculates all relationships, whereas First only determines whether overlap exists. The latter scenario is practically required for subscription forwarding. Again, the behavior of the calculation performance with an increasing problem size appears to be super-linear but, in fact, is linear with an advantageous cache exploitation for small advertisement numbers (cf. Section 6.2). In particular Scenario First shows a better computation performance for the Boolean than for the conjunctive algorithm. The reason for this advantage is that the problem size explosion due to conversions occurs twice for overlapping relationships: once

Table 2: Simultaneous covering and pruning

Covering only;   Covering & pruning;   Covering
time in ms       time in ms            proportion
1.81             1.74                  0.95
2.24             2.10                  0.91
3.89             3.17                  0.68
4.50             3.63                  0.56
4.95             3.41                  0.43
5.10             3.13                  0.35
5.15             2.74                  0.31
5.26             2.01                  0.26
5.28             1.68                  0.25


Advertisements: Boolean vs. Conjunctive

[Figure 12 plots, panels (a) Subscription Class 1, (b) Subscription Class 2, (c) Subscription Class 3, (d) Subscription Class 1–3. Abscissa: covering proportion (removed subscriptions); left ordinate: average time per event in ms; right ordinate: proportion of removed pred/sub associations. Series: TimeBool, TimeConj, MemoryBool, MemoryConj.]

Figure 12: Comparison of covering and pruning with the relation of covering proportion (abscissa) to matching performance (left ordinate) and memory reduction (right ordinate).

[Figure 13 plots, panels (a) Subscription Class 1, (b) Subscription Class 2, (c) Subscription Class 3, (d) Subscription Class 1–3. Abscissa: number of advertisements (20,000 to 260,000); ordinate: time per subscription in ms. Series: TimeBool(All), TimeConj(All), TimeBool(First), TimeConj(First).]

Figure 13: Overlapping calculation using conjunctive and arbitrary Boolean advertisements.

for subscriptions and once for advertisements. The advantage for the Boolean approach in Scenario All, however, is smaller; it is even a disadvantage for Class 3. The reason for this behavior is the number of disjoint predicates and the implied number of candidates to evaluate.

re-organization process when deregistering subscriptions.

Event Routing Optimizations. There are three main categories of routing optimizations: covering, merging, and summarization. Although slightly dissimilar, all approaches exploit commonalities and similarities among subscriptions: Subscription covering has been widely researched, e.g., in [9, 11, 21, 25]. It exploits subset relationships among individual subscriptions to reduce the number of routing table entries. Subscription subsumption [26] is a similar approach, exploiting these relationships when considering subscription sets. Merging combines several subscriptions to reduce the number of routing table entries. There is a perfect and an imperfect variant. It has been applied in a range of systems, e.g., [12, 21, 25]. Subscription summarization is similar to merging; it has been proposed and analyzed by Wang et al. [32], and Triantafillou and Economides [31]. The main difference to merging is that the computed summaries are distributed "as a whole" (merging is applied "on the fly"). Due to the common optimization approach of these proposals, they share the same drawbacks: (1) they strongly depend on the commonality among the registered subscriptions, (2) they create a strong overhead in case of deregistrations, and (3) they are impractical to apply to arbitrary Boolean expressions (all subscriptions need to be related with each other, leading to (co-)NP-hardness [12, 26]). The broad idea of filter weakening, briefly sketched in [13], can be seen as a predecessor to subscription pruning. However, this approach is restricted to conjunctive subscriptions; there is no work on how to broaden subscriptions in practice, except for the idea of basing it on their generality [13].

Advertisements. Current systems define advertisements as conjunctions (e.g., Padres [21], Rebeca [24], Siena [9], and [18]), or they only specify the message type (e.g., [27]).
The proposed algorithms to compute overlappings are re-

7. RELATED WORK Having presented the concepts and algorithms of BoP, we now relate them to existing works. Following the general structure of this paper, we have organized this section into parts on matching, event routing, and advertisements. Matching Algorithms. There is no current matching approach that (1) applies predicate indexes to achieve an efficient and scalable matching, and (2) supports arbitrary Boolean subscriptions. Indexing approaches [1, 2, 15, 16, 21, 33] are restricted to conjunctions, whereas non-indexing solutions [8, 29] support arbitrary Boolean subscriptions. The counting approach [2, 33] is the most promising conjunctive solution (general-purpose one-dimensional approach, cf. our reasoning in Section 3) that offers potential for an extension to more expressive subscriptions, as undertaken in BoP. Another way of storing arbitrary Boolean subscriptions (instead of subscription trees) is to use Binary Decision Diagrams (BDD), as done by Campailla et al. [8] and by Li et al. [21]. However, [8] does not index predicates, whereas [21] only supports conjunctive subscriptions. Storing individual subscriptions in a compact form, such as BDDs, might result in a more space-efficient storage and a more time-efficient evaluation of candidates. However, applying a sophisticated subscription tree encoding scheme might also lead to these results. This low-level optimization is generally out of the focus of BoP: Making the case for the arbitrary Boolean pub/sub model. Conversely using shared BDDs for several subscriptions, as proposed in [8], results in a time-consuming

236

stricted to conjunctive advertisements and subscriptions. There are no optimizations that are tailored to advertisements. Instead, subscription-based optimizations [9, 24] (covering/merging) are suggested to be applied to advertisements, leading to the problems described before.
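To illustrate the subset relationship that subscription covering exploits, the following is a minimal sketch for conjunctive subscriptions only. It is not the algorithm of any of the cited systems: the representation of a subscription as a mapping from attribute names to closed numeric intervals, and the names `covers` and `prune_covered`, are assumptions made purely for this illustration.

```python
from typing import Dict, List, Tuple

# Assumed representation (hypothetical, for illustration only): a conjunctive
# subscription maps each constrained attribute to a closed interval (low, high).
Subscription = Dict[str, Tuple[float, float]]


def covers(general: Subscription, specific: Subscription) -> bool:
    """Return True if every event matching `specific` also matches `general`."""
    for attr, (g_low, g_high) in general.items():
        if attr not in specific:
            # `specific` places no constraint on this attribute while
            # `general` does; this is treated conservatively as "not covered".
            return False
        s_low, s_high = specific[attr]
        if s_low < g_low or s_high > g_high:
            # The interval of `specific` is not contained in that of `general`.
            return False
    return True


def prune_covered(table: List[Subscription]) -> List[Subscription]:
    """Drop entries covered by another entry; of identical entries keep the first."""
    kept: List[Subscription] = []
    for i, sub in enumerate(table):
        covered = any(
            j != i
            and covers(other, sub)
            and (not covers(sub, other) or j < i)
            for j, other in enumerate(table)
        )
        if not covered:
            kept.append(sub)
    return kept


routing_table: List[Subscription] = [
    {"price": (0, 100)},
    {"price": (10, 50), "year": (1990, 2000)},
    {"year": (1995, 1999)},
]
reduced_table = prune_covered(routing_table)
```

In this example the second entry is covered by the first and is dropped, so the routing table shrinks from three entries to two. The per-attribute interval test relies on the conjunctive form; for arbitrary Boolean expressions no such simple test is available, which corresponds to drawback (3) above.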


8. CONCLUSIONS AND FUTURE WORK

BoP is a content-based pub/sub system supporting arbitrary Boolean subscriptions and advertisements. Directly supporting such expressions leads to efficiency benefits for applications requiring this class of expressiveness. We have presented the applied matching algorithm, a subscription-based routing optimization, an overlapping calculation algorithm, and an advertisement-based optimization approach.

Our experimental comparison of BoP to conjunctive solutions has shown that the direct support of arbitrary Boolean expressions leads to efficiency benefits. Most importantly, our solutions are also applicable to conjunctive subscriptions and even further improve the optimization potential of state-of-the-art conjunctive routing approaches. Our solutions also show an optimization effect in settings that cannot be optimized by recent solutions.

In the future, we will fully integrate the extensions outlined in this paper and plan further analyses of application scenarios other than the online auction setting used in our experiments. This will show the advantages of the arbitrary Boolean pub/sub model in universal settings.


9. REFERENCES

[1] M. K. Aguilera, R. E. Strom, D. C. Sturman, M. Astley, and T. D. Chandra. Matching Events in a Content-Based Subscription System. In PODC ’99, 1999.
[2] G. Ashayer, H.-A. Jacobsen, and H. Leung. Predicate Matching and Subscription Matching in Publish/Subscribe Systems. In ICDCSW ’02, Austria, July 2002.
[3] S. Bittner and A. Hinze. A Detailed Investigation of Memory Requirements for Publish/Subscribe Filtering Algorithms. In CoopIS 2005, Cyprus, November 2005.
[4] S. Bittner and A. Hinze. Dimension-Based Subscription Pruning for Pub/Sub. In ICDCSW ’06, 2006.
[5] S. Bittner and A. Hinze. Event Distributions in Online Book Auctions. Technical Report 03/2006, Computer Science Department, Waikato University, Feb 2006.
[6] S. Bittner and A. Hinze. Optimizing Pub/Sub Systems by Advertisement Pruning. In DOA 2006, 2006.
[7] A. P. Buchmann, C. Bornhövd, M. Cilia, L. Fiege, F. C. Gärtner, C. Liebig, M. Meixner, and G. Mühl. DREAM: Distributed Reliable Event-Based Application Management. In Web Dynamics, 2004.
[8] A. Campailla, S. Chaki, E. Clarke, S. Jha, and H. Veith. Efficient Filtering in Pub-Sub Systems using Binary Decision Diagrams. In ICSE 2001, 2001.
[9] A. Carzaniga, D. S. Rosenblum, and A. L. Wolf. Achieving Scalability and Expressiveness in an Internet-Scale Event Notification Service. In PODC 2000.
[10] A. Carzaniga and A. L. Wolf. Forwarding in a Content-Based Network. In SIGCOMM ’03, 2003.
[11] R. Chand and P. Felber. A Scalable Protocol for Content-Based Routing in Overlay Networks. In NCA 2003, USA, April 2003.
[12] A. Crespo, O. Buyukkokten, and H. Garcia-Molina. Query Merging: Improving Query Subscription Processing in a Multicast Environment. IEEE TKDE, 15(1):174–191, 2003.
[13] P. T. Eugster, P. Felber, R. Guerraoui, and S. B. Handurukande. Event Systems: How to Have Your Cake and Eat It Too. In ICDCSW ’02, July 2002.
[14] P. T. Eugster, P. A. Felber, R. Guerraoui, and A.-M. Kermarrec. The Many Faces of Publish/Subscribe. ACM Computing Surveys, 35(2):114–131, 2003.
[15] F. Fabret, A. Jacobsen, F. Llirbat, J. Pereira, K. Ross, and D. Shasha. Filtering Algorithms and Implementation for Fast Pub/Sub Systems. In SIGMOD 2001.
[16] J. Gough and G. Smith. Efficient Recognition of Events in a Distributed System. In ACSC-18, 1995.
[17] E. N. Hanson, M. Chaabouni, C.-H. Kim, and Y.-W. Wang. A Predicate Matching Algorithm for Database Rule Systems. In SIGMOD 1990, USA, May 1990.
[18] D. Heimbigner. Expressive and Efficient Peer-to-Peer Queries. In HICSS-38, USA, January 2005.
[19] A. Hinze. A-MEDIAS: Concept and Design of an Adaptive Integrating Event Notification Service. PhD thesis, FU Berlin, Inst. of Computer Science, 2003.
[20] S. Jones, S. McInnes, and M. S. Staveley. A Graphical User Interface for Boolean Query Specification. IJDL, 2(2–3):207–223, 1999.
[21] G. Li, S. Hou, and H.-A. Jacobsen. A Unified Approach to Routing, Covering and Merging in Publish/Subscribe Systems based on Modified Binary Decision Diagrams. In ICDCS ’05, USA, June 2005.
[22] J. MacKinley and M. Genesereth. Expressiveness and Language Choice. DKE, 1:17–29, 1985.
[23] I. Mathieson, S. Dance, L. Padgham, M. Gorman, and M. Winikoff. An Open Meteorological Alerting System: Issues and Solutions. In ACSC-27, 2004.
[24] G. Mühl. Large-Scale Content-Based Publish/Subscribe Systems. PhD thesis, TU Darmstadt, September 2002.
[25] G. Mühl and L. Fiege. Supporting Covering and Merging in Content-Based Pub/Sub Systems: Beyond Name/Value Pairs. IEEE DS Online, 2(7), 2001.
[26] A. M. Ouksel, O. Jurca, I. Podnar, and K. Aberer. Efficient Probabilistic Subsumption Checking for Content-based Pub/Sub Systems. In Middleware ’06, 2006.
[27] P. R. Pietzuch. Hermes: A Scalable Event-Based Middleware. PhD thesis, Cambridge University, 2004.
[28] K. A. Ross. Selection Conditions in Main Memory. ACM TODS, 29(1):132–161, 2004.
[29] B. Segall and D. Arnold. Elvin has left the building: A publish/subscribe notification service with quenching. In AUUG97, Australia, September 1997.
[30] D. Tam, R. Azimi, and H.-A. Jacobsen. Building Content-Based Publish/Subscribe Systems with Distributed Hash Tables. In DBISP2P 2003, 2003.
[31] P. Triantafillou and A. Economides. Subscription Summarization: A New Paradigm for Efficient Publish/Subscribe Systems. In ICDCS ’04, 2004.
[32] Y.-M. Wang, L. Qiu, C. Verbowski, D. Achlioptas, G. Das, and P. Larson. Summary-based Routing for Content-based Event Distribution Networks. ACM SIGCOMM CCR, 34(5):59–74, 2004.
[33] T. W. Yan and H. García-Molina. Index Structures for Selective Dissemination of Information Under the Boolean Model. ACM TODS, 19(2):332–364, 1994.
