
Hindawi Publishing Corporation International Journal of Distributed Sensor Networks Volume 2015, Article ID 390329, 14 pages http://dx.doi.org/10.1155/2015/390329

Research Article
Summary Instance: Scalable Event Priority Determination Engine for Large-Scale Distributed Event-Based System

Ruisheng Shi,1 Yang Zhang,2 Lina Lan,3 Fei Li,4 and Junliang Chen2

1 Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education, and School of Humanities, Beijing University of Posts and Telecommunications, Beijing 100876, China
2 State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
3 School of Network Education, Beijing University of Posts and Telecommunications, Beijing 100088, China
4 Siemens AG Austria, Siemensstrasse 90, 1210 Vienna, Austria

Correspondence should be addressed to Ruisheng Shi; [email protected]

Received 19 October 2014; Accepted 5 February 2015

Academic Editor: Ching-Hsien Hsu

Copyright © 2015 Ruisheng Shi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Data prioritization is paramount for the timely delivery of real-time events in a distributed publish/subscribe infrastructure, since a large number of low priority events may clog the channel and thereby delay high priority events. The challenge for event-based middleware in large-scale distributed systems such as vehicular ad hoc networks is that the event priority determination engine must be efficient and scalable in terms of priority rule size and event throughput. This paper proposes an innovative approach based on Bloom filters and event discretization. A Bloom filter data structure is used to store the rule instances and their priorities. The complex rule evaluation is reduced to set membership testing, expressed as queries on Bloom filters. The time complexity of data prioritization is constant and independent of the number of priority rules. Because event discretization signatures can be cached, the approach is cache friendly by nature. Previous computation results can be cached in overlay network nodes and reused to improve system throughput and determination time. We have evaluated the proposed approach, and the results show a significant performance improvement.

1. Introduction With the advent of ubiquitous sensor-rich environments and location-based services, distributed event-based systems with the publish/subscribe communication paradigm have been gaining popularity [1, 2]. For example, in vehicular ad hoc networks (VANETs), the applications logics are triggered by various events from geographically distributed sources. In expressway monitoring system of VANETs, the sensing data of vehicles are published continuously and the vehicle information system may subscribe to different data based on the vehicle’s location. With the increasing popularity of distributed event-based systems (especially the publish/subscribe systems) and the adoption in mission critical areas, performance and scalability issues are becoming a major concern [3, 4]. The performance and scalability of the event-based middleware used to

process real-time event data will be crucial for the successful adoption of such applications. Flexible and efficient events routing mechanisms are paramount for the improvement of user experience. Publish/subscribe systems must support a large number of geographically distributed publishers and subscribers. Efficient communication between these brokers is paramount. Data of different importance are transported in the same communication infrastructure. Large number of low priority events may occupy much bandwidth of event brokers and incur delivery delay of time sensitive data. We propose event delivery on-time rate (EDOR) metrics to measure system performance. Prioritized multiqueue approach is a natural choice to improve the system performance under given system resources. However, the effectiveness of this approach depends on the performance and scalability of event priority determination engine (PDE).

A naïve implementation of priority rule matching might check each rule against the event instance values. This naïve approach performs poorly in large-scale systems [5], because the performance of the priority determination engine depends on the number of rules in the system. Since each condition of the rules needs to be checked on the fly, the naïve approach is also cache-unfriendly and may perform poorly in a geographically distributed environment. A cache-friendly approach may alleviate the load of the PDE dramatically and achieve significant improvement of system performance in terms of the speed of priority determination and event delivery on-time rate (EDOR).

Another naïve priority policy lets each event producer determine the priority of its own events. If most events are labeled as high priority, the system cannot benefit much from the priority mechanism. A global policy is needed for resource scheduling in the overlay network.

This paper addresses the design issues of an efficient and scalable priority determination engine (PDE). We present an innovative PDE design based on Bloom filters and event discretization. First, the speed of priority determination is independent of the number of rules in the system. Second, the approach is cache friendly. The system can handle a large number of events in geographically distributed deployments.

The results in this paper are an improved and extended version of our conference paper in IEEE SCC 2012 [6]. The major extensions in this work are the following. First, the model is refined and expressed more accurately. Second, more related works are explored. Third, the discretization algorithm is detailed, which provides a much more thorough description than the preliminary results in [6]. Finally, the evaluation methods and results are introduced.

2. Related Work

Data Prioritization. Internet services impose soft real-time constraints, for example, 300 msec latency [7, 8]. For some applications, a 100 msec increase in latency can affect the user experience significantly. Experiments at Amazon and Google [9] demonstrated that latencies of hundreds of milliseconds can already result in significant financial loss. The absence of traffic prioritization causes latency-sensitive data streams to wait behind latency-insensitive data streams. Events are useful if and only if they are delivered within their deadlines. Recent research works [7, 8, 10] addressed this issue in the datacenter environment, and the solutions mostly focus on transport layer protocols. In the cross-layer approach named DeTail, the solution depends upon applications to properly specify data priorities based on how latency sensitive they are [7]. The presence of data prioritization can alleviate this issue significantly. Our system [11] introduces data prioritization into the application layer, that is, the publish/subscribe overlay network. In the overlay system, the prioritization of application data can be handled by the publish/subscribe infrastructure. However, if the overhead of prioritization is too high, the solution is not affordable for most soft real-time applications. These online applications require fast data prioritization services. The data prioritization engine must offer low latency and high throughput in a geographically distributed environment. At the same time, it should be scalable in terms of priority rule size.

Rule Matching. Rule matching engines have been intensively studied in the past decades. The most famous algorithm is Rete, which was proposed by Charles L. Forgy at Carnegie Mellon University in the 1970s [5, 12]. The Rete algorithm has become very widely used; it is the basis of OPS5, CLIPS, and numerous commercial rule-based tools. Techniques used in expert and rule-based systems support expressive predicate languages [12] but are unable to scale up to process millions of Boolean expressions. Most of the traditional rule-based systems used in expert systems focus on language expressiveness, and their expected sizes are assumed to be less than thousands of Boolean expressions. The latest Rete implementation is reported to scale up to 100 K rules with millions of objects and to be at least 500 times faster than the original Rete [13, 14]. Although advances in the implementation of knowledge-based expert systems have provided substantial performance improvements, the rule matching speed in large-scale systems with millions of Boolean expressions under severe time constraints, for example, submillisecond, is still an open issue.

Many innovative approaches have recently been proposed to address fast rule matching against millions of Boolean expressions [15–18]. The rule matching performance has been improved significantly, but all these algorithms scale linearly with respect to the number of matched Boolean expressions [15]. These algorithms focus more on top-k matching of Boolean expressions [15, 17, 18].

Compared with Rete and the aforementioned works, our approach focuses on scalability and provides an innovative rule matching engine design for the distributed computing environment. It targets the scalability of the online query speed of event priority rule matching, at the cost of offline maintenance of a large rule instance database and cache management on broker nodes.

3. Model Description 3.1. System Model. The publish/subscribe system can be classified by architecture as centralized publish/subscribe system and distributed publish/subscribe system [19]. As the increasing scale of event-based systems, the distributed publish/subscribe system attracts more attention from both industry [20] and academia [21]. A generic publish/subscribe system (often referred to in the literature as event service or notification service) is composed of a set of broker nodes distributed over a communication network. These nodes form an overlay network, which is a logical network built on the physical network. The links between nodes are paths in the physical network. Formally [22] the distributed publish/subscribe system can be represented as a 5-tuple 𝐺 = ⟨𝐵, 𝐶, 𝑃, 𝑆, 𝐸𝑇 ⟩ where


also may be refined into proper granularity per application requirements. The discretization procedures can be defined by applications per business requirements.

3.3. Priority Rule Model. The triple consisting of attribute, operator, and set of values is referred to as a Boolean predicate. A conjunction of Boolean predicates is a Boolean expression. A priority rule can be modeled as a set of Boolean expressions. The rule is expressed as a disjunction of Boolean expressions. For a given priority, there may be a set of priority rules specified by applications.

Figure 1: System architecture of distributed publish/subscribe overlay network. (Diagram labels: broker network with brokers B1 to B7 and clients C1 to C10; legend: access broker, inner broker, client.)

𝐵 = {𝑏1, 𝑏2, ..., 𝑏|𝐵|} is the set of system broker nodes.
𝐶 = {𝑐1, 𝑐2, ..., 𝑐|𝐶|} is the set of connections between broker nodes.
𝑃 = {𝑝1, 𝑝2, ..., 𝑝|𝑃|} is the set of publishers.
𝑆 = {𝑠1, 𝑠2, ..., 𝑠|𝑆|} is the set of subscribers.
𝐸𝑇 = {𝑇1, 𝑇2, ..., 𝑇|𝐸𝑇|} is the set of event types.

The overlay topology of the publish/subscribe network is shown in Figure 1. A client can be a publisher, a subscriber, or both. Each client is connected to only one of the brokers in the system. The broker that a client is connected to is called the access broker from the network view and is also called the home broker with respect to that client. The brokers that route events between brokers are called event routers or inner brokers.

3.2. Event Model. Publish/subscribe based event models were first introduced in the data and business domain as complex event processing [1]. Each event is described by a set of attributes, 𝑒 = ⟨𝑣1, 𝑣2, ..., 𝑣𝑘⟩ [23, 24]. The tuple ⟨type(𝑣1), type(𝑣2), ..., type(𝑣𝑘)⟩ is called the event schema. A more succinct presentation is the attribute vector {𝑎1, 𝑎2, ..., 𝑎𝑘}. Events having the same schema are classified into the same event type. The event type space is defined as a set, denoted by 𝐸𝑇 = {𝑇1, 𝑇2, ..., 𝑇|𝐸𝑇|}, where 𝑇𝑖 = {𝑎1, 𝑎2, ..., 𝑎𝑘}, 𝑖 ∈ [1, |𝐸𝑇|]. Let 𝐸 be the set of all events published in the system. An event 𝑒 ∈ 𝐸 must follow one of the event schemas in the event type set 𝐸𝑇. Let 𝐸𝑇𝑖 denote all events which follow the event schema defined by event type 𝑇𝑖 = {𝑎1, 𝑎2, ..., 𝑎𝑘}. Therefore, with 𝑛 = |𝐸𝑇|, we have the following relations:

$$E = E_{T_1} \cup E_{T_2} \cup \cdots \cup E_{T_n}, \qquad \forall\, T_i, T_j \in ET,\ i \neq j:\ E_{T_i} \cap E_{T_j} = \varnothing. \tag{1}$$

For each event schema, the attributes vectors that determine the event priority are called priority signature vector. Let 𝑉Sig (𝑇𝑖 ) denote the signature vector for event type 𝑇𝑖 . 𝑉Sig (𝑇𝑖 ) = ⟨𝑎𝑖1 , 𝑎𝑖2 , . . . , 𝑎𝑖𝑚 ⟩ has 𝑚 priority attribute fields, where 𝑎𝑖𝑗 ∈ {𝑎1 , 𝑎2 , . . . , 𝑎𝑘 }, 𝑎𝑖𝑗 = 𝑎𝑖𝑘 if and only if 𝑗 = 𝑘. We distinguish two types of attributes: continuous variables and discrete variables. In our approach, the attributes with continuous value must be refined into discrete values with proper granularity. The attributes with discrete value

The sets of values in all Boolean predicates compose the metadata of the priority rule. An expressive set of operators is supported: relational operators (for example <, ≤, >) and set operators (∈, ∉). The metadata of a discrete attribute shall be a subset of the corresponding attribute domain. The metadata of a continuous attribute shall be an element of the corresponding attribute domain.

For example, consider an event schema for coal mine monitoring data defined as ⟨𝑎1, 𝑎2, 𝑎3⟩, where 𝑎1 is a discrete variable which denotes the location identifier, 𝑎2 is a continuous variable which denotes methane (CH4) gas density, and 𝑎3 is the timestamp field. The Boolean predicates 𝑝11 = 𝑎1 ∈ 𝑆1 and 𝑝21 = 𝑎2 ≤ 𝐶1 construct a Boolean expression BE1 = 𝑝11 ∧ 𝑝21 = (𝑎1 ∈ 𝑆1) ∧ (𝑎2 ≤ 𝐶1), which means that if the data come from locations in 𝑆1 and the value of attribute 𝑎2 is not greater than 𝐶1, the event shall be assigned the corresponding priority. The metadata 𝑆1 and 𝐶1 satisfy the following constraints: 𝑆1 ⊆ Dom(𝑎1) and 𝐶1 ∈ Dom(𝑎2). Similarly, BE2 can be defined as BE2 = 𝑝12 ∧ 𝑝22 = (𝑎1 ∈ 𝑆2) ∧ (𝑎2 ≤ 𝐶2). The priority rule is defined as 𝑅1 = BE1 ∨ BE2 = (𝑝11 ∧ 𝑝21) ∨ (𝑝12 ∧ 𝑝22). The data {𝑆1, 𝐶1, 𝑆2, 𝐶2} are metadata defined by the application system, which may be constant or may change dynamically with context.

The rule set for a given priority can be modeled as a set of Boolean expressions, namely the union of the Boolean expression sets of the priority rules for that priority. The general format of the priority Boolean function for a set of rules can be formalized as 𝑅 = 𝑅1 ∨ ⋅⋅⋅ ∨ 𝑅𝑘 = BE1 ∨ ⋅⋅⋅ ∨ BE𝑚 = 𝑓(𝑝1, 𝑝2, ..., 𝑝𝑛). The size of the priority Boolean function can be measured by the number of Boolean expressions. We define the normal model of a priority rule as a disjunction of Boolean expressions, where each Boolean expression is a conjunction of Boolean predicates.

The transformation from a natural language rule specification to the normal expression is another research topic in requirements engineering, which is not addressed in this paper.

Consider an event instance 𝑒 = ⟨𝑣1, 𝑣2, 𝑣3⟩. If 𝑒 satisfies the conditions that 𝑣1 is in set 𝑆1 and 𝑣2 ≤ 𝐶1, event 𝑒 matches rule 𝑅; if 𝑒 satisfies the conditions that 𝑣1 is in set 𝑆2 and 𝑣2 ≤ 𝐶2, event 𝑒 also matches rule 𝑅; if both fail, event 𝑒 does not match the priority rule 𝑅. In this simple example, we observe that the four condition tests denoted by Boolean predicates involve only two attributes of the event priority signature vector. We also observe that in condition tests on Boolean predicates 𝑝11 = 𝑎1 ∈ 𝑆1



and 𝑝12 = 𝑎1 ∈ 𝑆2, the computation time of the condition test depends on the size of set 𝑆𝑖 (𝑖 = 1, 2). The condition tests on Boolean predicates need to be checked on the fly, since past evaluations on attributes with continuous values cannot be reused directly.

3.4. Assumptions and Design Goals. Our design has been guided by assumptions that offer both challenges and opportunities.

(1) The system shall support large-scale deployments running in a geographically distributed environment.
(2) The speed of priority determination is paramount to ensure that real-time events are delivered on time. The performance shall scale with the number of condition tests in rules and the traffic of events in the overlay network.
(3) Priority determination can accept false positives if the false-positive rate can be kept under an acceptable bound.
(4) Large numbers of condition tests in priority rules are actually defined over a small number of event signature attributes.
(5) The condition tests in rules can be expressed as a set membership query problem. Priority rules are mostly expressed as some attribute satisfying some condition (membership in a particular set, or being less than or greater than a specified threshold value). The condition tests are unlikely to be as complex as the pattern matching problem in the artificial intelligence area.
(6) The set of event types is known in advance.

Our design goals for the PDE focus on the following aspects.

(1) The PDE should strive to maximize the number of events that meet their deadlines and thus contribute to application throughput.
(2) The PDE should tolerate bursts in order to raise the peak load, that is, the load at which the publish/subscribe system can operate without impacting the user experience.

To achieve these design goals, we propose our approach: summary instance.

4. Solution The basic ideas of our approach are composed of two main principles. First, make online query on event instance as simple as possible. The time consuming procedures should be done offline. Second, exploiting the power of cache on each broker node to reduce network round trip may bring much room for performance improvement. 4.1. Overview. To accommodate the aforementioned principles, it would be best that the computation time of query

Figure 2: Architecture of event priority rule matching engine. (Diagram: the rule management interface feeds the RIE (rule instantiation engine), whose steps are generating instances from rules, hash computation on instance sets, and updating the array of Bloom filters. The event priority query interface feeds the PDE (priority determination engine), whose steps are event discretization signature generation, querying the array of Bloom filters, and determination of the incoming event. Both engines share the array of Bloom filters P1, P2, ..., Pk, a lossy summary of rule instances.)

is simple and independent of the number of condition tests in rules, and that a query can be answered by looking up the local cache as often as possible. The key ideas of our approach are rule instantiation, event (attribute) discretization, and a caching-friendly signature-based rule matching mechanism for the distributed event environment. As shown in Figure 2, the rule matching engine is decoupled into two parts: offline rule instantiation and online query on event instance matching. The offline part is named the rule instantiation engine (RIE). The online part is named the priority determination engine (PDE).

The rule instantiation process represents the event priority rule set as a set of instances. If event priority is queried on this set directly, the computation time of the query depends on the number of elements in set 𝑅. To reduce the computation time, the rule instance set 𝑅 is stored in a Bloom filter data structure. The computation time of a query on a Bloom filter is independent of the size of the rule instance set 𝑅. Furthermore, the amount of storage required by the Bloom filter for each element in set 𝑅 is independent of the element's length. By employing a Bloom filter, the online query of event priority is reduced to a two-hash-function computation on the event signature; refer to Section 4.2 on Bloom filter theory.

4.2. Preliminary. In order to keep this paper self-contained, this subsection presents a concise introduction to Bloom filters. After the Bloom filter was proposed in the 1970s [25], it was first used in the database community. The technique has gained popularity in network applications with the emergence of the Internet [26].

A Bloom filter is a simple, space-efficient randomized data structure for representing a set of strings compactly for efficient membership querying. It outperforms other efficient data structures such as binary search trees and tries, because the time needed to add an item or to check whether an item belongs to the set is constant irrespective of the cardinality of the set.

We first present the mathematics behind Bloom filters concisely. A standard Bloom filter for representing a set 𝑆 = {𝑥1, 𝑥2, ..., 𝑥𝑛} of 𝑛 elements is an 𝑚-bit vector. All bits in the 𝑚-bit vector are initially set to 0. A Bloom filter uses 𝑘 independent hash functions {ℎ1, ℎ2, ..., ℎ𝑘} with range {1, 2, ..., 𝑚}. For each member 𝑥 of 𝑆, the bits ℎ𝑖(𝑥) are set to 1 for 1 ≤ 𝑖 ≤ 𝑘. A bit can be set to 1 multiple times, but only the first change has an effect. After repeating this procedure for all members of the set, the programming of the filter is completed [25–27].

The query process is similar to programming. To check whether an item 𝑦 is in 𝑆, we generate 𝑘 hash values with {ℎ1, ℎ2, ..., ℎ𝑘} from item 𝑦. Then, we check whether all 𝑘 bits ℎ𝑖(𝑦) are set to 1. If at least one of these 𝑘 bits is not set to 1, 𝑦 is clearly not a member of 𝑆. If all 𝑘 bits are set to 1, we assume that 𝑦 is in 𝑆, although we are wrong with some probability. Hence, a Bloom filter may yield a false positive. The false-positive probability is $f = (1 - e^{-nk/m})^k$, where 𝑛 is the number of elements in 𝑆, 𝑘 is the number of hash functions, and 𝑚 is the size of the bit vector. We can reduce the value of 𝑓 by choosing appropriate values of 𝑚 and 𝑘 for a given size 𝑛 of the member set. In the optimal case, which minimizes the false-positive probability with respect to 𝑘, $k = (m/n)\ln 2$. This corresponds to a false-positive probability of $f = (1/2)^k$.

To accommodate the deletion operation on Bloom filters, Fan et al. proposed the idea of counting Bloom filters [28]. In a counting Bloom filter, each entry is not a single bit but rather a small counter. When an item is inserted, the corresponding counters are incremented; when an item is deleted, the corresponding counters are decremented. To avoid counter overflow, we choose sufficiently large counters [26, 27]. The analysis from [26, 28] reveals that 4 bits per counter satisfy the requirements of most applications. To accommodate membership queries on dynamic sets, Guo et al. proposed dynamic Bloom filters (DBF) [29]. Further improvements on the scalability problem of Bloom filters are addressed by the scalable Bloom filter (SBF) [30]. To reduce the need to compute a possibly large number of different hash functions, the authors of [31] have shown that only two hash functions are necessary to effectively implement a Bloom filter without any loss in the asymptotic false-positive probability.

4.3. Rule Instantiation Engine. RIE (rule instantiation engine) is designed to program the event priority determination rules into a set of Bloom filter structures. RIE transforms abstract priority determination rules into concrete instances and generates Bloom filter based summaries of the large data set of rule instances for each priority. In this section, we show how RIE works.
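As a concrete reference for the structure that RIE programs, the following is a minimal Python sketch of the standard Bloom filter described in Section 4.2. It is illustrative only and is not the authors' implementation: the names `BloomFilter`, `optimal_k`, and `false_positive_rate` are hypothetical, SHA-256 is an arbitrary choice of base hash, and the 𝑘 bit positions are derived from two base hash values in the spirit of the two-hash result cited as [31].

```python
import hashlib
import math


class BloomFilter:
    """Standard Bloom filter: an m-bit vector probed at k hashed positions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # all bits start at 0

    def _positions(self, item: bytes):
        # Assumed construction: derive k positions from two base hashes,
        # position_i = (h1 + i * h2) mod m.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd, never zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes) -> None:
        # Programming step: set every probed bit to 1.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # Query step: a single unset bit proves the item was never added.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def optimal_k(m: int, n: int) -> int:
    """Number of hash functions minimizing f: k = (m / n) * ln 2, rounded."""
    return max(1, round((m / n) * math.log(2)))


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Approximate false-positive probability f = (1 - e^{-nk/m})^k."""
    return (1 - math.exp(-n * k / m)) ** k
```

Under these assumptions, a filter sized at roughly ten bits per rule instance with seven probes keeps the estimated false-positive rate near one percent, and lookups never produce false negatives.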

4.3.1. Rule Instantiation Process. First, we explain how RIE transforms rules into a set of rule instances with a simple example. Consider the event schema and rule description as follows.

(1) The event schema is defined as 𝑇 = ⟨𝑎1, 𝑎2, 𝑎3⟩, where Dom(𝑎1) = {𝐴, 𝐵, 𝐶, 𝐷, 𝐸}, Dom(𝑎2) = [0, 1], and Dom(𝑎3) = [0, +∞). The attribute 𝑎1 denotes the identifier of the location where the data is generated. In this example, the whole coal mine area is divided into five areas, named 𝐴, 𝐵, 𝐶, 𝐷, and 𝐸. The attribute 𝑎2 denotes the methane gas density at 𝑎1. The gas density ranges from 0% to 100%. The attribute 𝑎3 denotes the timestamp of the generated data.

(2) The rule set is 𝑅 = BE1 ∨ BE2 = (𝑝11 ∧ 𝑝21) ∨ (𝑝12 ∧ 𝑝22), where 𝑝11 = 𝑎1 ∈ 𝑆1, 𝑝21 = 𝑎2 ≤ 𝐶1, 𝑝12 = 𝑎1 ∈ 𝑆2, and 𝑝22 = 𝑎2 ≤ 𝐶2. The sets 𝑆1 = {𝐴, 𝐵, 𝐶} and 𝑆2 = {𝐴, 𝐷} are location sets. The threshold values of the methane gas density are 𝐶1 and 𝐶2. The first rule means that if the incoming event data are generated from locations in set 𝑆1 = {𝐴, 𝐵, 𝐶} and the methane gas density is not greater than the threshold value 𝐶1, the incoming event satisfies the first rule. The second rule means that if the incoming event data are generated from locations in set 𝑆2 = {𝐴, 𝐷} and the methane gas density is not greater than the threshold value 𝐶2, the incoming event satisfies the second rule.

The event priority signature vector can be inferred as 𝑉sig = ⟨𝑎1, 𝑎2⟩ from the rule set specifications. The attribute 𝑎2 is a numeric value and shall be discretized into a discrete value, which belongs to the set 𝑆𝑎2 = {𝐴, 𝐵, 𝐶}. The entity which defines the discretization procedure is called a discretizer. Each numeric attribute in the signature vector shall have its own discretizer, which is defined by the application running over the publish/subscribe infrastructure. The set 𝑆𝑎2 is defined as follows. Assume 𝐶1 < 𝐶2; the symbols 𝐴, 𝐵, and 𝐶 here denote the numeric ranges 𝐴 = (−∞, 𝐶1], 𝐵 = (𝐶1, 𝐶2], and 𝐶 = (𝐶2, +∞), respectively.

For 𝑝21 = {𝑎2 ≤ 𝐶1}, the condition test is transformed into 𝑝21 = {𝑇(𝑎2) ∈ 𝑇(𝐶1)}, where 𝑇(𝑎2) ∈ 𝑆𝑎2, 𝑇(𝐶1) = {𝐴}, and 𝑇(𝐶1) ⊆ 𝑆𝑎2. 𝑇(𝑎2) is the result of the PDE discretizer applied to event attribute 𝑎2. The set 𝑇(𝐶1) is the result of the RIE discretizer applied to the threshold value 𝐶1 in condition tests on the numeric attribute 𝑎2. Each rule instance is an element of the rule instance space (IS), which is defined as IS = 𝑆𝑎1 × 𝑆𝑎2. The cardinality of the set IS is |𝑆𝑎1| × |𝑆𝑎2|. In this example, as |𝑆𝑎1| = 5 and |𝑆𝑎2| = 3, |IS| = 15. We can deduce the instance representation of rule 𝑅 as follows: (𝑝11 ∧ 𝑝21) = {⟨𝐴, 𝐴⟩, ⟨𝐵, 𝐴⟩, ⟨𝐶, 𝐴⟩} and (𝑝12 ∧ 𝑝22) = {⟨𝐴, 𝐴⟩, ⟨𝐴, 𝐵⟩, ⟨𝐷, 𝐴⟩, ⟨𝐷, 𝐵⟩}; therefore, we get the rule instance set

$$S_R = \{\langle A,A\rangle, \langle B,A\rangle, \langle C,A\rangle\} \cup \{\langle A,A\rangle, \langle A,B\rangle, \langle D,A\rangle, \langle D,B\rangle\} = \{\langle A,A\rangle, \langle A,B\rangle, \langle B,A\rangle, \langle C,A\rangle, \langle D,A\rangle, \langle D,B\rangle\}. \tag{2}$$



(1) Predicate: ⟨attribute, operator, value⟩.
(2) Operators: OperatorEnumerator {relational operators, ∈, ∉} defines the supported operator set.
(3) BE: a list of predicates, ordered by event attribute. For example, 𝑎2 < 50% ∧ 𝑎1 ∈ {𝐴, 𝐵, 𝐶} ∧ 𝑎2 > 10% shall be formalized as 𝑎1 ∈ {𝐴, 𝐵, 𝐶} ∧ 𝑎2 < 50% ∧ 𝑎2 > 10%.
(4) Rule: a list of BEs.
(5) RuleSet: a list of rules.
(6) VecSig: the signature vector is logically a list of the attributes involved in priority rules. In this basic design schema, there is only one signature vector for each priority per event schema. The vector is initialized as a bit vector of zeros. If an attribute appears in a Boolean expression, the corresponding bit is set to 1.
(7) Discretizer: two types of hash table are defined. ⟨𝑖𝑑, ⟨𝑙, 𝑟⟩⟩ is the discretizer for a continuous attribute and ⟨𝑖𝑑, 𝑠𝑒𝑡⟨int⟩⟩ is the discretizer for a discrete attribute.
(8) Signature for event instance: a byte block structured as ⟨Event Type ID, Discretizer ID, an array of the discretized attribute values in the event instance⟩.
Algorithm 1: Data structure.

Input: RuleSet
Output: 𝑆𝑅, VecSig
(1) generate BE list for RuleSet
(2) for all predicate 𝑏𝑝 in RuleSet do
(3)     Get the attribute ID in predicate
(4)     Set the bit in signature vector to 1
(5) end for
(6) for all BE 𝑏𝑒 do
(7)     predicates table ⟨attribute ID, 𝑆attribute⟩, 𝑆attribute is an array of integer to denote a discrete set
(8)     for all predicate 𝑏𝑝 in 𝑏𝑒 do
(9)         𝑇𝑝 : ⟨𝑎, 𝑜𝑝𝑡, 𝑣⟩ → 𝑆𝑎 : Transform predicate 𝑝 triple into discrete set by attribute discretizers
(10)        Add set 𝑆𝑎 into predicates table
(11)    end for
(12)    Merge discrete sets on the same attribute by set intersect operation in predicates table
(13)    Generate the rule instance set 𝑆be from merged predicates table by set product operation
(14)    Store the rule instances in set 𝑆be into Bloom filter
(15) end for
Algorithm 2: Rule instantiation algorithm.
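The core of Algorithm 2 (lines (6) to (13)) can be sketched in Python as follows. This is a hedged illustration, not the authors' code: predicates are assumed to be already discretized into sets of labels (the job of 𝑇𝑝), the final Bloom filter insertion is replaced by a plain `set`, and the names `instantiate_be`, `instantiate_rule_set`, and `DOMAINS` are hypothetical. Treating an attribute that a Boolean expression leaves unconstrained as "all discretized values" is also an assumption of this sketch.

```python
from itertools import product

# Discretized value sets per signature attribute for the coal-mine example:
# a1 uses the location labels; a2 uses the range labels
# A = (-inf, C1], B = (C1, C2], C = (C2, +inf).
DOMAINS = {"a1": {"A", "B", "C", "D", "E"}, "a2": {"A", "B", "C"}}
SIGNATURE = ("a1", "a2")

# Each Boolean expression is a list of (attribute, discretized set) pairs.
BE1 = [("a1", {"A", "B", "C"}), ("a2", {"A"})]       # a1 in S1 and a2 <= C1
BE2 = [("a1", {"A", "D"}), ("a2", {"A", "B"})]       # a1 in S2 and a2 <= C2


def instantiate_be(predicates, signature, domains):
    """Merge sets on the same attribute by intersection, then take the product."""
    merged = {}
    for attr, values in predicates:
        merged[attr] = merged[attr] & values if attr in merged else set(values)
    # Assumption: an attribute absent from this expression matches every label.
    columns = [merged.get(attr, domains[attr]) for attr in signature]
    return set(product(*columns))


def instantiate_rule_set(expressions, signature, domains):
    """Union of the instance sets of all Boolean expressions (the set S_R)."""
    instances = set()
    for be in expressions:
        instances |= instantiate_be(be, signature, domains)
    return instances


S_R = instantiate_rule_set([BE1, BE2], SIGNATURE, DOMAINS)
```

With the inputs above, `S_R` contains the six tuples of equation (2); the duplicate ⟨𝐴, 𝐴⟩ disappears because the union is taken over sets.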

The duplicated instance ⟨𝐴, 𝐴⟩ is eliminated during the union operation of two sets. It is obvious that 𝑆𝑅 ⊆ IS. The cardinality of set IS is the upper bound of the size of rule instance set 𝑆𝑅 . The key data structures are described in Figure 3 and Algorithm 1. The procedure on rule instantiation process is described as in Algorithm 2. First, transform the condition tests as set membership determination with proper granularity, 𝑇𝑝 : ⟨𝑎, 𝑜𝑝𝑡, V⟩ → 𝑆𝑎 as shown in Algorithm 3. Second, translate the logic operators (conjunction, disjunction, and negation) as set operators (intersection/product, union, and set difference) at line (6) to line (13) in Algorithm 2. Finally, 𝑆𝑅 is the result of rule instantiation procedure. Signature vector is also generated during rule instantiation procedure at line (2) to line (5) in Algorithm 2. Given 𝑅 = BE1 ∨⋅ ⋅ ⋅∨BE|BE| = 𝑓(𝑝1 , 𝑝2 , . . . , 𝑝𝑛 ) and event priority signature vector 𝑉sig = ⟨𝑎1 , 𝑎2 , . . . , 𝑎𝑚 ⟩, rule instance space IS is the Cartesian product of 𝑆𝑎1 , 𝑆𝑎2 , . . . , 𝑆𝑎𝑚 , where 𝑆𝑎𝑖 (1 ≤ 𝑖 ≤ 𝑚) is the set defined by attribute 𝑎𝑖 ’s discretizer. Each Boolean expression can be represented by a set of rule

Figure 3: Data structure of signature vector and attribute discretizer. (Diagram: the signature vector is a bit array over the event schema attributes a1, a2, a3, ..., an; each attribute entry holds Attribute ID, Type (0: continuous, 1: discrete), Discretizer, and Value; continuous type: ⟨id, ⟨l, r⟩⟩; discrete type: ⟨id, set⟨id⟩⟩.)

instances 𝑆be , which is a subset of rule instance space IS. The rule instance representation of rule set 𝑅, denoted by symbol 𝑆𝑅 , is the union of all 𝑆be . Obviously, 𝑆𝑅 is a subset of rule instance space IS. Our framework draws a schema of the discretizer, which can be customized by applications. The discretizer divides the domain of the attribute value into several ranges or discrete



Input: predicate triple 𝑝
Output: discrete set 𝑆
(1) switch(p.attribute.type)
(2) case: continuous
(3)     ContinuousDiscretizer cd = p.attribute.discretizer;
(4)     switch(p.operator)
(5)     case: Less
(6)         for all item in cd do
(7)             if p.value < item.lower
(8)                 ; //do nothing
(9)             else if p.value < item.upper
(10)                S.insert(item.id); //false positive rules are introduced
(11)            else //p.value ≥ item.upper
(12)                S.insert(item.id);
(13)        end for
(14)        break;
(15)    case: Great
(16)        for all item in cd do
(17)            if p.value < item.lower
(18)                S.insert(item.id);
(19)            else if p.value < item.upper
(20)                S.insert(item.id); //false positive rules are introduced
(21)            else //p.value ≥ item.upper
(22)                ; //do nothing
(23)        end for
(24)        break;
(25)    . . . //other operators
(26)    break;
(27) case: discrete
(28)    DiscreteDiscretizer dd = p.attribute.discretizer;
(29)    switch(p.operator)
(30)    case: In
(31)        for all item in dd do
(32)            if p.value ∩ item.set ≠ ⌀
(33)                S.insert(item.id); //false positive rules are introduced
(34)        end for
(35)        break;
(36)    case: NotIn
(37)        for all item in dd do
(38)            if item.set ⊆ p.value
(39)                ; //do nothing
(40)            else //false positive rules are introduced when item.set ∩ p.value ≠ ⌀
(41)                S.insert(item.id);
(42)        end for
(43)        break;
(44)    . . . //other operators
(45)    break;
Algorithm 3: Predicate transformation algorithm 𝑇𝑝 : ⟨𝑎, 𝑜𝑝𝑡, 𝑣⟩ → 𝑆𝑎.
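The predicate transformation can be sketched in Python as below. This is a simplified, over-approximating variant written for illustration rather than a line-by-line transcription of Algorithm 3: ranges are assumed to be half-open intervals (lower, upper] as in the example of Section 4.3.1, only the operators `<=`, `>`, `in`, and `not in` are handled, and the function names and the threshold values used in the usage note are hypothetical. A range or group is selected whenever at least one of its values satisfies the predicate, which is where false positive rule instances come from.

```python
import math


def transform_continuous(op, value, ranges):
    """Map `attr op value` to the ids of ranges (lower, upper] it can touch."""
    selected = set()
    for range_id, (lower, upper) in ranges.items():
        if op == "<=":
            # Some x in (lower, upper] satisfies x <= value iff lower < value.
            hit = lower < value
        elif op == ">":
            # Some x in (lower, upper] satisfies x > value iff upper > value.
            hit = upper > value
        else:
            raise ValueError(f"unsupported operator: {op}")
        if hit:
            selected.add(range_id)
    return selected


def transform_discrete(op, values, groups):
    """Map `attr op values` to the ids of discretizer groups it can touch."""
    selected = set()
    for group_id, members in groups.items():
        if op == "in":
            hit = bool(members & values)      # group shares a value with the set
        elif op == "not in":
            hit = not members <= values       # group has a value outside the set
        else:
            raise ValueError(f"unsupported operator: {op}")
        if hit:
            selected.add(group_id)
    return selected
```

For instance, with assumed thresholds 𝐶1 = 0.01 and 𝐶2 = 0.05 and `ranges = {"A": (-math.inf, 0.01), "B": (0.01, 0.05), "C": (0.05, math.inf)}`, the predicate 𝑎2 ≤ 𝐶1 maps to the single range 𝐴, matching 𝑇(𝐶1) = {𝐴} in the text, while a threshold that falls strictly inside a range selects that whole range and therefore admits false positives.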

sets, and these ranges or sets shall be mutually disjoint; that is, for all 𝑆𝑖 , 𝑆𝑗 ∈ Discretizer with 𝑖 ≠ 𝑗, 𝑆𝑖 ∩ 𝑆𝑗 = ∅. Each range or set in the discretizer is labeled with an identifier, as shown at line (7) in Algorithm 1. To represent a discretizer for a continuous attribute, the hash table ⟨𝑖𝑑, ⟨𝑙, 𝑟⟩⟩ is employed, where id denotes the identifier of the range and ⟨𝑙, 𝑟⟩ denotes the lower bound and upper bound of the range. To represent a discretizer for a discrete attribute, the hash table ⟨𝑖𝑑, 𝑠𝑒𝑡⟨int⟩⟩ is employed, where id denotes the identifier of the set and 𝑠𝑒𝑡⟨int⟩ denotes the discrete set. Applications can define their own discretizer for an event schema according to business requirements. The id in the discretizer hash tables is used to compose the rule instance and event instance signatures. The signature vectors for event schemas are generated in Algorithm 2 and are used by the event signature generation procedure in Section 4.4. The structure of the signature vector is shown in Figure 3.

4.3.2. Hash Computation on Instance Set 𝑆𝑅 . For each element in 𝑆𝑅 , generate the signature for each rule instance tuple,

Table 1: Example for rule instance structure.

Event schema ID    Discretizer ID    Attribute 1       Attribute 2
32-bit integer     32-bit integer    32-bit integer    32-bit integer

for example, ⟨𝐴, 𝐴⟩. The signature may be a string composed of the fields in the rule instance tuple; a more compact encoding of the signature is also possible. In either case, the computation complexity depends on the size of the rule instance set. If event types share the same set of Bloom filters, the event type identification shall be encoded into the signature to ensure the uniqueness of each signature in the Bloom filter. The event type identification is a unique string that distinguishes event types. An alternative design choice is that each event type has its own set of Bloom filters for priority determination. Different event discretizers produce different rule instances even for the same event schema. If multiple discretizers are defined by various applications, the discretizer identifier shall be contained in the rule instance structure. An example is shown in Table 1.

4.3.3. Update the Computation Results into Bloom Filters. For each priority, one bit vector is dedicated to the summary of the rule instance set 𝑆𝑅 . The Bloom filter set 𝑃 = {𝑃1 , 𝑃2 , . . . , 𝑃𝑘 } is a lossy summary of rule instances for 𝑘 priorities.

4.4. Priority Determination Engine. The PDE discretizes event instance values to generate the priority signature and determines the event priority by querying the rule database, which is represented by a group of Bloom filters. When multiple rules match the same event, the engine shall choose the highest priority result.

4.4.1. Event Priority Signature Generation in Access Brokers. The generation of the event priority signature is based on the same priority signature vector and corresponding discretizer as the rule instances. The interaction procedure between the broker node and the PDE service is shown in Figure 4. Consider the signature and discretizer as follows: 𝑉sig = ⟨𝑎1 , 𝑎2 ⟩; 𝑎1 ∈ 𝑆𝑎1 = {𝐴, 𝐵, 𝐶, 𝐷, 𝐸}; 𝑎2 ∈ (−∞, +∞); 𝑆𝑎2 = {𝐴, 𝐵, 𝐶}; 𝐴 = (−∞, 𝐶1 ], 𝐵 = (𝐶1 , 𝐶2 ], and 𝐶 = (𝐶2 , ∞). When the access broker receives a published event, the broker needs to generate the event signature.
The signature generation algorithm is shown in Algorithm 4 (line (1) to line (7)). The signature is initialized in line (1). Assume that the event type ID is “ETID0001” and the discretizer ID is “DISC01”; the signature is then initialized as ⟨ETID0001, DISC01⟩. Assume that the event instance ⟨V1 , V2 , V3 ⟩ is ⟨B, 30%, timestamp⟩. We traverse all attribute values in the event instance to transform each attribute value into a discrete id. For a discrete attribute, the discretizer does nothing by default. In this example, 𝑇𝑒 (V1 ) : 𝐵 → 𝐵. If the application needs to classify the domain of a discrete attribute, the algorithm performs line (4) to line (6) in Algorithm 5. For example, if the application defines the discretizer for attribute 𝑎1 ∈ 𝑆𝑎1 = {𝐴, 𝐵, 𝐶, 𝐷, 𝐸} as 𝑆1 = {𝐴, 𝐵, 𝐶} and 𝑆2 = {𝐷, 𝐸}, the function 𝑇𝑒 yields 𝑇𝑒 (V1 ) : 𝐵 → 𝑆1 . For a continuous attribute, the discretizer performs line (1) to line (3) in Algorithm 5. In this example, V2 = 30% and the discretizer divides the domain of attribute 𝑎2 ∈ (−∞, +∞) into three ranges identified by 𝐴, 𝐵, and 𝐶, respectively. If 30% ≤ 𝐶1 , then 𝑇𝑒 (V2 ) : 30% → 𝐴. The third attribute is not in the signature vector, so it has no effect on signature generation (line (3) in Algorithm 4). In this example, the final signature is ⟨ETID0001, DISC01, 𝐵, 𝐴⟩, in the format shown in Table 1. Once the event signature is generated as shown in Figure 4, the access broker sends the event signature to the PDE service for priority determination.

[Figure 4: Event priority determination procedure. Participants: Broker and PDE_service. The broker finds the signature vector, calls the discretizers, and generates the signature; the event instance signature is sent to PDE_service, which returns the priority flag.]

4.4.2. Query Bloom Filter with Signature. Based on hash computations on the event instance signature string, the PDE queries the BF-based rule DB to determine the event priority. The PDE returns the priority flag to the access broker.

4.4.3. Caching Query Result in Access Brokers. Since a large number of event instances may share the same signature, network round trips can be saved by caching hot signatures in the local broker. The main memory access time is typically less than 100 ns, whereas even a round trip within the same datacenter takes about 500,000 ns. The round trip time in a wide area network may exceed 100 ms, which is about 6 orders of magnitude larger than the main memory reference time. Saving the network round trip may therefore speed up the determination of event priority significantly.

4.5. Discussion on Discretization. The discretization can be flexibly defined by applications according to business requirements. The basic principle is the false-positive principle: a rule is over-approximated rather than under-approximated, so discretization may introduce false positives but must not miss a matching event.



Input: event instance ⟨V1 , V2 , . . . , V𝑛 ⟩
Output: event priority flag 𝑘
(1) Initialize signature with event type ID and discretizer ID
(2) for all attribute value V𝑖 in event instance do
(3)   if this attribute is in SignatureVector then
(4)     𝑇𝑒 (V𝑖 ) : V𝑖 → 𝑖𝑑 : transform V𝑖 into the id of the discretizer
(5)     add id into signature
(6)   end if
(7) end for
(8) Query BF-based rule DB with signature to determine the event priority

Algorithm 4: Event priority query algorithm.
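Algorithm 4, together with the broker-side caching of Section 4.4.3, can be sketched end to end in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `BloomFilter` class, the `|` separator, the priority labels, the filter sizes, and the thresholds used for attribute a2 (C1 = 0.5, C2 = 0.8) are all hypothetical. The two-hash derivation of probe positions follows the general idea of [31].

```python
import hashlib
from functools import lru_cache
from typing import Callable, Dict, Iterator, List, Optional, Sequence, Tuple


class BloomFilter:
    """Plain Bloom filter; k probe positions are derived from two base hashes."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str) -> Iterator[int]:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd stride, never zero
        return ((h1 + i * h2) % self.num_bits for i in range(self.num_hashes))

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(key))


def make_signature(event_type_id: str, discretizer_id: str, event: Dict[str, object],
                   signature_vector: Sequence[str],
                   discretizers: Dict[str, Callable[[object], str]]) -> str:
    """Lines (1)-(7): start from the two ids, then append one discrete id per attribute."""
    parts = [event_type_id, discretizer_id]
    for name, value in event.items():
        if name in signature_vector:          # attributes outside the vector are ignored
            parts.append(discretizers[name](value))
    return "|".join(parts)


# One Bloom filter per priority, ordered from highest to lowest (hypothetical labels).
RULE_DB: List[Tuple[str, BloomFilter]] = [("high", BloomFilter()), ("low", BloomFilter())]
RULE_DB[0][1].add("ETID0001|DISC01|B|A")      # a rule instance programmed offline


def query_priority(signature: str) -> Optional[str]:
    """Line (8): probe the filters in priority order and return the first hit."""
    for priority, bloom in RULE_DB:
        if signature in bloom:
            return priority
    return None


@lru_cache(maxsize=1024)
def cached_priority(signature: str) -> Optional[str]:
    """Broker-side cache: repeated signatures skip the round trip to the PDE."""
    return query_priority(signature)


DISCRETIZERS: Dict[str, Callable[[object], str]] = {
    "a1": lambda v: str(v),                                        # default: do nothing
    "a2": lambda v: "A" if v <= 0.5 else ("B" if v <= 0.8 else "C"),
}
EVENT = {"a1": "B", "a2": 0.30, "a3": "timestamp"}
SIGNATURE = make_signature("ETID0001", "DISC01", EVENT, ("a1", "a2"), DISCRETIZERS)
print(SIGNATURE, cached_priority(SIGNATURE))
```

Probing the filters from the highest priority downward is one simple way to satisfy the requirement that the highest priority result wins when several rules match.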

Input: attribute value of event instance V𝑖
Output: id
(1) if this attribute is continuous type then
(2)   traverse discretizer ⟨𝑖𝑑, ⟨𝑙, 𝑟⟩⟩ to locate id
(3) end if
(4) if this attribute is discrete type then
(5)   traverse discretizer ⟨𝑖𝑑, 𝑠𝑒𝑡⟨int⟩⟩ to locate id
(6) end if

Algorithm 5: Transform event attribute value into discrete id: 𝑇𝑒 (V𝑖 ) : V𝑖 → 𝑖𝑑.
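The two traversals in Algorithm 5 can be written as a short Python sketch. The function names and the example tables are assumptions for illustration; continuous entries are read as (lower, upper], and the discrete example uses letters as in Section 4.4.1 rather than integers.

```python
import math
from typing import Dict, FrozenSet, Optional, Tuple


def locate_continuous(value: float, discretizer: Dict[str, Tuple[float, float]]) -> Optional[str]:
    """Lines (1)-(3): linear traversal of <id, <l, r>> to find the range holding value."""
    for range_id, (lower, upper) in discretizer.items():
        if lower < value <= upper:
            return range_id
    return None  # value outside every range


def locate_discrete(value: str, discretizer: Dict[str, FrozenSet[str]]) -> Optional[str]:
    """Lines (4)-(6): linear traversal of <id, set> to find the set containing value."""
    for set_id, members in discretizer.items():
        if value in members:
            return set_id
    return None  # value not covered by any set


# Illustrative tables; the thresholds 0.5 and 0.8 stand in for C1 and C2.
A2_RANGES = {"A": (-math.inf, 0.5), "B": (0.5, 0.8), "C": (0.8, math.inf)}
A1_SETS = {"S1": frozenset({"A", "B", "C"}), "S2": frozenset({"D", "E"})}

print(locate_continuous(0.30, A2_RANGES), locate_discrete("B", A1_SETS))
```

Because the ranges and sets of a discretizer are disjoint, at most one id can be returned for any value.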

False Positive. For a numeric attribute 𝑎, let the corresponding discretization set be 𝑆𝑎 = {𝐴, 𝐵, 𝐶}, where 𝐴 = (−∞, 𝐶1 ], 𝐵 = (𝐶1 , 𝐶2 ], and 𝐶 = (𝐶2 , ∞). Consider the Boolean predicate 𝑝𝑗 = 𝑎𝑖 ≤ 𝐶, where 𝐶1 < 𝐶 < 𝐶2 (here 𝐶 denotes the predicate constant, not the range 𝐶). By the false-positive principle, the rule instance set representing 𝑝𝑗 shall be {𝐴, 𝐵}, not {𝐴}, because some values in range 𝐵 also satisfy the predicate. The computation granularity chosen for a specific attribute 𝑎𝑖 determines the cardinality of the corresponding set 𝑆𝑎𝑖 and is set according to application requirements. For discrete attributes, the discretizer is optional; that is, it can be composed of dummy (do nothing) functions, depending on application requirements. If the original granularity of the attribute value set is too fine, applications can plug in a customized discretizer to achieve a proper granularity.

Performance. The traversal of a continuous discretizer can be sped up by binary search. As this is trivial, we do not discuss it in detail. The traversal of a discrete discretizer can be avoided in most cases, since there is no discretizer for discrete attributes by default. Employing a discrete discretizer can reduce the size of the rule instance space at the cost of additional computation in Algorithms 3 and 5, which increases the computation time of the signature generation procedure.

4.6. Analysis on System Performance. The system performance analysis is divided into online query (interactions between the access broker and the PDE) and offline rule instance summary building (RIE module).
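The binary search mentioned under Performance can be sketched with Python's standard `bisect` module. This is one possible realization, assuming the continuous discretizer is stored as sorted cut points with ranges of the form (lower, upper]; the names `CUTS`, `IDS`, and `locate` and the cut values are illustrative only.

```python
import bisect
from typing import List, Sequence

# Ranges A = (-inf, 1000], B = (1000, 2000], C = (2000, +inf) stored as sorted cut points.
CUTS: List[float] = [1000.0, 2000.0]
IDS: List[str] = ["A", "B", "C"]          # always one more id than cut points


def locate(value: float, cuts: Sequence[float] = CUTS, ids: Sequence[str] = IDS) -> str:
    """Return the id of the range containing value in O(log n) comparisons.

    bisect_left finds the first cut point that is >= value, so a value equal
    to a cut point stays in the lower range, matching the (lower, upper] form.
    """
    return ids[bisect.bisect_left(cuts, value)]


print([locate(v) for v in (-5.0, 1000.0, 1000.5, 2000.0, 2001.0)])
```

With only a handful of ranges per attribute the linear traversal is already cheap, so this variant mainly matters when an application chooses a fine granularity.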

4.6.1. Online Query Performance. The computation is broken into two parts, as shown in Algorithm 4. The first part is signature generation. Its computation complexity depends on two system parameters: the size of the priority signature vector and the size of the corresponding attribute discretizers. Assume that the priority signature vector is ⟨𝑎1 , 𝑎2 , . . . , 𝑎𝑠 ⟩ and the corresponding discretizers are 𝐷𝑖 (𝑖 = 1, 2, . . . , 𝑠). For an incoming event 𝑒 = ⟨𝑒1 , 𝑒2 , . . . , 𝑒𝑛 ⟩, the values used for the priority signature are ⟨𝑒1 , 𝑒2 , . . . , 𝑒𝑠 ⟩. The discretization result of 𝐷𝑖 belongs to one set, whose size is 𝑁𝑖 = |𝐷𝑖 |. Therefore, the computation complexity is $\sum_{i=1}^{s} N_i$, where each 𝑁𝑖 is a constant predefined by domain knowledge. For a given domain problem, $\sum_{i=1}^{s} N_i$ is constant. For example, the human body temperature set is {Low, Normal, Low Fever, Medium Fever, High Fever}. Therefore, the first stage computation complexity is 𝑂(1) and is independent of the number of rules in the system.

The second stage is the query on the rule set Bloom filters. The query computation complexity is 𝑚 ∗ 𝑘, where 𝑚 denotes the number of priorities predefined by the rule set and 𝑘 is the Bloom filter parameter (the number of hash functions). Since 𝑚 and 𝑘 are both constants for a given rule set, the second stage computation complexity is also 𝑂(1) and independent of the number of rules in the system. Therefore, the computation complexity of the online query is 𝑂(1) and is independent of the number of rules in the system. The computation of signature generation depends on the size of the priority signature vector and the corresponding attribute discretizers. Since these discretizers can work in

parallel, the speed of signature generation depends on the slowest discretizer. It would not be the bottleneck in practice. In the PDE, the main part of the query computation time is the computation of two hash functions on the signature string [31]. To keep the cache fresh, updates to rule instances shall be notified to the access brokers. The cache management procedure has no impact on the online query speed.

4.6.2. Offline Building and Maintenance of Rule Instance Database Based on Bloom Filters. Although offline work is not time sensitive, we also need to evaluate the effort required to build the rule instance database and to understand how to minimize it. The basic idea of the rule instantiation process is presented in Algorithm 2. The upper bound of the rule instance set size is the cardinality of the rule instance space. Applications shall choose the priority signature vector so that its size 𝑚 is as small as possible. The attribute discretizer shall choose a proper computation granularity to make the size of 𝑆𝑎𝑖 (1 ≤ 𝑖 ≤ 𝑚) as small as possible. These efforts reduce the size of the instance space. The computation time of the Bloom filter programming procedure is 𝑂(|𝑆𝑅 |). The application can dynamically adjust these parameters to improve the efficiency of offline computation. Minimizing the offline computation at the middleware layer is the subject of ongoing work, and a more efficient implementation requires further exploration. The goal is to reduce the rule instance space size dramatically without significantly affecting online query performance. For rule maintenance, delta rule changes shall be processed efficiently and the cache shall be managed efficiently. These issues will be addressed in future work.
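The cost model of Section 4.6 can be collected into a few lines of LaTeX. The online part restates the text, using 𝑠 for the signature vector size as in Section 4.6.1 (Section 4.6.2 calls the same size 𝑚). The offline bound assumes that the rule instance space IS is the Cartesian product of the per-attribute sets, which is an assumption of this sketch; the numeric example reuses the sets from Section 4.4.1.

```latex
\begin{align*}
% Online query: signature generation plus Bloom filter probes
T_{\text{online}} &= \sum_{i=1}^{s} N_i + m \cdot k = O(1)
  && \text{(independent of the number of rules)} \\
% Offline building, assuming IS is the Cartesian product of the S_{a_i}
|S_R| &\le |IS| = \prod_{i=1}^{s} |S_{a_i}|,
  \qquad T_{\text{build}} = O(|S_R|) \\
% Example: |S_{a_1}| = 5 and |S_{a_2}| = 3
|IS| &= 5 \times 3 = 15
\end{align*}
```

Under this assumption, adding an attribute to the signature vector or refining a discretizer multiplies the size of the instance space, which is why both are kept as small as the application allows.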

5. Evaluations
In this section, we evaluate the query performance and scalability of the summary instance (SI) approach with simulations. The experiments were run on an Intel Xeon Dual-core E5645 2.4 GHz machine with 8 GB of memory, of which 6 GB is allocated to the JVM.

5.1. Data Set. To evaluate the performance of the summary instance based priority determination engine, we generate rule sets containing from 100 K to 1000 K Boolean expressions. Lacking benchmarks and real application data, we generated the rule data set and event data set with a workload generator, which produces the data by randomly selecting values from given value ranges. The value ranges can be specified in the configuration of the data generator application.

5.2. Matching Algorithm. The brute-force approach is an exhaustive algorithm that scans and evaluates all Boolean expressions (BEs) one by one for each assignment. We call this approach SF and compare our approach SI with it in the following experiments.

5.3. Experiment Results. In this section, we explore the impacts on matching time from workload size, workload

Table 2: Workload distribution impacts.

(a) Matching time variance in SF algorithm on different workload distribution (Unit: ms)

Rule set size    100 K     300 K      500 K      700 K      900 K      1 M
Uniform          245.37    734.21     1226.42    1716.16    2211.40    2464.45
Zipf             375.62    1089.25    1884.68    2543.11    3269.20    3778.07
Variance         53.08%    48.36%     53.67%     48.19%     47.83%     53.30%

(b) Matching time variance in SI algorithm on different workload distribution (Unit: ms), raw data

Rule set size    100 K     300 K      500 K      700 K      900 K      1 M
Uniform          0.1133    0.1099     0.1141     0.1140     0.1125     0.1125
Zipf             0.1133    0.1099     0.1094     0.1109     0.1078     0.1063
Variance         0.00%     0.00%      −4.12%     −2.72%     −4.18%     −5.51%

(c) Matching time variance in SI algorithm on different workload distribution (Unit: ms)

Rule set size    100 K     300 K      500 K      700 K      900 K      1 M
Uniform          0.11      0.11       0.11       0.11       0.11       0.11
Zipf             0.11      0.11       0.11       0.11       0.11       0.11
Variance         0.00%     0.00%      0.00%      0.00%      0.00%      0.00%

distribution, and matching rate of event data set. Then, we evaluate the false-positive issue in the SI algorithm.

Workload Size. We evaluate the impact of workload size on the matching algorithms. Figure 5 compares the SF and SI algorithms under varying rule set size, from 100 K to 1 M. As shown in Figures 5(a) and 5(b), the matching time of the SF algorithm increases linearly with the workload size, whereas the matching time of the SI algorithm is nearly constant. The SI algorithm therefore scales well, while the performance of the SF algorithm degrades as the workload increases. The simulation results are consistent with our theoretical analysis.

Workload Distribution. The effects of workload distribution are shown in Table 2, which compares the performance of event matching under a uniformly distributed workload and a Zipf distributed workload. The SF algorithm is sensitive to the workload distribution in the data set. According to the experiment results in Table 2(a), the Zipf distribution workload increases the matching time of the SF algorithm by about 50% compared with the uniform distribution workload. The SI algorithm is robust to the workload distribution. There are no significant matching time increases in different workload


[Figure 5: Varying workload size. Panels: (a) Uniform: workload size; (b) Zipf: workload size. x-axis: varying number of Boolean expressions (100 K to 1 M); y-axis: matching time/event (ms); series: SF, SI.]

[Figure 6: Varying matching rate of event set. Panels: (a) Uniform: matching rate of event set; (b) Zipf: matching rate of event set. x-axis: varying matching rate of event set (%), from 10 to 90; y-axis: matching time/event (ms); series: SF, SI.]

distribution. Since the time precision of the computer system is 100 milliseconds and the size of the event data set is 10,000, the precision of the matching time per event is about 0.01 milliseconds (100 ms divided by 10,000 events). The raw performance data of the SI algorithm are given in Table 2(b). The variance can be ignored considering the time precision of our experimental environment. The final results are shown in Table 2(c); the variance of performance is nearly zero.

Event Set Matching Rate. We consider the effects of the matching rate of the event data set. If an event instance does not match any rule in the rule set, the SF algorithm needs to go through all rules in the rule set. Intuitively, the average event matching time will increase if the matching rate of the event data set decreases. From Figure 6, we can see that the performance of the SF algorithm is sensitive to the matching rate of the event set. As the matching rate increases, the average matching time per event decreases, and the decrease is linear in the matching rate of the event data set.
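The linear dependence of SF on the matching rate can be illustrated with a toy brute-force matcher. The sketch assumes that SF stops at the first matching rule, which the text implies but does not state explicitly; the rule set, the event values, and all function names are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Rule = Callable[[int], bool]


def sf_match(event: int, rules: Sequence[Rule]) -> Tuple[Optional[int], int]:
    """Scan rules in order; return (index of first match or None, rules evaluated)."""
    for evaluated, rule in enumerate(rules, start=1):
        if rule(event):
            return evaluated - 1, evaluated
    return None, len(rules)               # an unmatched event pays for the full scan


def average_evaluations(events: Sequence[int], rules: Sequence[Rule]) -> float:
    """Average number of rule evaluations per event."""
    return sum(sf_match(event, rules)[1] for event in events) / len(events)


def make_events(match_percent: int, total: int = 10) -> List[int]:
    """Build an event set in which match_percent of the events match rule 499."""
    matched = total * match_percent // 100
    return [499] * matched + [-1] * (total - matched)


# Toy rule set: rule i matches exactly the event value i.
RULES: List[Rule] = [lambda value, i=i: value == i for i in range(1000)]

for percent in (10, 50, 90):
    print(percent, average_evaluations(make_events(percent), RULES))
```

In this toy setting a matched event costs 500 evaluations and an unmatched one costs 1000, so the average cost falls linearly as the matching rate rises.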

We can see that the performance of the SI approach is robust to varying matching rates under different workloads. The experiment results under the uniform workload are shown in Figure 6(a), and those under the Zipf workload are shown in Figure 6(b). The matching time is nearly constant under varying matching rates and different workloads.

False-Positive Issue Evaluation. An important property of the SI algorithm is its false-positive rate, which we explore in this experiment. There are two sources of false positives: the Bloom filter query process and the discretization process. The discretization process is defined by applications and can be adjusted at the application layer. This paper focuses on the platform layer, so discretizer design and optimization are out of its scope. An automatic adaptive mechanism is a promising way to balance the false-positive rate against the computation effort; this optimization issue needs to be addressed in an independent paper.
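For the first source, the Bloom filter query, the standard textbook approximation of the false-positive probability is recalled below (see, e.g., [25, 26]). The symbols 𝑏 (number of bits), 𝑛 (number of stored signatures), and ℎ (number of hash functions) are notation chosen for this sketch to avoid a clash with the 𝑚 and 𝑘 used in Section 4.6.1.

```latex
\begin{align*}
% Probability that all h probed bits are set for a signature that was never inserted
p &\approx \left(1 - e^{-hn/b}\right)^{h} \\
% Number of hash functions that minimizes p for a given b and n
h_{\text{opt}} &= \frac{b}{n}\,\ln 2 \\
% Bits needed to reach a target false-positive probability p at h_opt
b &= -\frac{n \ln p}{(\ln 2)^{2}}
\end{align*}
```

These relations describe a single filter; how they combine across the per-priority filters depends on the query order and is not analyzed here.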

Table 3: False-positive rate evaluation in SI algorithm.

BF-FPR    Matched    Unmatched    SI result    FPR
0.1%      2000       8000         2000         0.00%
1%        2000       8000         2005         0.06%
10%       2000       8000         2499         6.24%
20%       2000       8000         3563         19.54%
30%       2000       8000         4344         29.30%
40%       2000       8000         4644         33.05%
50%       2000       8000         5878         48.48%

False positive rate in SI algorithm with different Bloom filter parameters.

We set up controlled experiments to evaluate the impact of the Bloom filter configuration on the false-positive issue. We also verify that no false negatives occur, in support of the theoretical analysis. The test data set is designed as follows. False positives from the discretization process are avoided by generating the event data set and rule data set from predefined ranges based on the definition of the discretizers. We use a simple example to illustrate the data set construction principles. The attribute discretizer is defined as 𝐴 = (−∞, 1000], 𝐵 = (1000, 2000], and 𝐶 = (2000, ∞). The rules have the form 𝑎 < Const, where Const is randomly selected from [900, 1000]. The discretization result of rule 𝑎 < Const shall be Sig(𝑒) ∈ {𝐴}. The event instance value is randomly selected from [100, 800] or [1100, 1800], so the discretization result of an event instance shall be Sig(𝑒) = 𝐴 or Sig(𝑒) = 𝐵. In this data set, no false-positive cases are introduced by the discretization process: the event 𝑒 satisfies the rule 𝑎 < Const if and only if the discretized event signature satisfies Sig(𝑒) ∈ {𝐴}. If Sig(𝑒) = 𝐴, the query of Sig(𝑒) on the Bloom filter is definitely true. If Sig(𝑒) = 𝐵, the query of Sig(𝑒) may be true only if a Bloom filter false positive occurs.

In this experiment, the event data set and rule data set are generated randomly under these constraints, so that no false positives are introduced by the discretization process. The event data set size is 10 K, and the event instances are uniformly distributed in the given ranges. The rule set size is 10 K, and the rule parameters are randomly selected from the given ranges. The experiment results are shown in Table 3. The Bloom filter false-positive rate varies from 0.001 to 0.5, as shown in the BF-FPR (Bloom filter false-positive rate) column in Table 3. The data in the Matched and Unmatched columns are from the SF algorithm; these two columns give the exact rule matching results. The 4th column (SI result) shows the approximate rule matching results of the SI algorithm. The impact of the Bloom filter parameter on the rule matching results, namely, the FPR (false-positive rate), is shown in column 5 of Table 3. Since the priority determination problem tolerates a low false-positive rate, the SI algorithm is well suited to this kind of application.
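The FPR column of Table 3 appears consistent with the number of extra matches reported by SI divided by the number of truly unmatched events. The short Python check below recomputes it from the table; the formula is an inference from the published numbers rather than a definition given in the text, and the variable names are illustrative.

```python
from typing import Dict, Tuple

MATCHED = 2000      # events that truly match (from SF)
UNMATCHED = 8000    # events that truly do not match (from SF)

# BF-FPR setting -> (SI result, FPR in percent as reported in Table 3)
TABLE_3: Dict[str, Tuple[int, float]] = {
    "0.1%": (2000, 0.00),
    "1%": (2005, 0.06),
    "10%": (2499, 6.24),
    "20%": (3563, 19.54),
    "30%": (4344, 29.30),
    "40%": (4644, 33.05),
    "50%": (5878, 48.48),
}


def observed_fpr_percent(si_result: int) -> float:
    """False matches reported by SI as a percentage of the unmatched events."""
    return 100.0 * (si_result - MATCHED) / UNMATCHED


def max_deviation() -> float:
    """Largest gap between the recomputed and the reported FPR, in percentage points."""
    return max(abs(observed_fpr_percent(si) - reported) for si, reported in TABLE_3.values())


for setting, (si_result, reported) in TABLE_3.items():
    print(setting, round(observed_fpr_percent(si_result), 4), reported)
```

Every row also has an SI result of at least 2000, the number of true matches, which is consistent with the statement that no false negatives occur.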

5.3.1. Summary on Evaluation. The SI algorithm outperforms the SF algorithm by 2–5 orders of magnitude. It demonstrates significant scalability with workload size and stable performance across different workload distributions and event data sets with varying event matching rates. In addition, it shows an acceptable false-positive rate. Therefore, it is a suitable approach for providing a scalable and robust priority determination service.

6. Conclusion
Information representation and query processing are two core problems of event-based distributed systems such as VANETs. In the design of an event priority rule matching engine, the two core problems are rule representation and event instance priority determination. Rule representation means organizing rule policy information according to some format and mechanism, making the information operable by the corresponding method. Query processing means deciding whether an event instance with a given attribute value belongs to a given set. To speed up the online query in a distributed event-based system, we introduce a rule storage schema based on the rule instantiation method and the Bloom filter technique. This approach leverages offline effort to increase the online query speed. This paper draws a fundamental framework for this approach. The key features of our approach are the following:

(1) scalability: the performance of rule matching is independent of the number of rules in the system, because an important property of the Bloom filter is that the computation time of a query is independent of the number of strings in the database, provided that the memory used by the data structure scales linearly with the number of strings stored in it,
(2) efficiency: the signature approach is cache friendly and works very efficiently in a large-scale distributed environment; a large number of event instances do not need to occupy the bandwidth of the rule matching engine,
(3) false-positive rule matching: the false-positive rate can be kept acceptable by adjusting the parameters of the Bloom filters.

Our approach is promising for providing an efficient and scalable design for the event priority determination problem in large-scale distributed event-based systems. It is also applicable to many rule matching scenarios with severe time constraints and large rule sets.

Disclosure
A preliminary version of this paper appeared in IEEE SCC 2012, June 24–29, Honolulu, Hawaii, USA.

Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgments
We express our thanks to the anonymous reviewers for their insightful and constructive comments. This work was supported by National Grand Fundamental Research 973 Program of China under Grant no. 2013CB329605; National Natural Science Foundation of China under Grant no. 91124002; Chinese Universities Scientific Fund (BUPT2014RC0701); Transformation Project of Scientific and Technological Achievements in Henan Province (2014) no. 142201210009; Key Project of Science and Technology in Henan Province (2014) no. 144300510001.

References
[1] G. Cugola and A. Margara, “Processing flows of information: from data stream to complex event processing,” ACM Computing Surveys, vol. 44, no. 3, pp. 15–84, 2012.
[2] R. Shi, F. Liu, Y. Zhang, B. Cheng, and J. Chen, “An MID-based load balancing approach for topic-based pub-sub overlay construction,” Tsinghua Science and Technology, vol. 16, no. 6, pp. 589–600, 2011.
[3] A. Hinze, K. Sachs, and A. Buchmann, “Event-based applications and enabling technologies,” in Proceedings of the 3rd ACM International Conference on Distributed Event-Based Systems (DEBS ’09), Nashville, Tenn, USA, July 2009.
[4] A. Schroter, G. Muhl, S. Kounev, H. Parzyjegla, and J. Richling, “Stochastic performance analysis and capacity planning of publish/subscribe systems,” in Proceedings of the 4th ACM International Conference on Distributed Event-Based Systems, pp. 258–269, ACM, 2010.
[5] http://en.wikipedia.org/wiki/Rete_algorithm.
[6] R. Shi, Y. Zhang, J. Chen, B. Cheng, X. Qiao, and B. Wu, “Summary instance: scalable event priority determination engine for large scale distributed event-based system,” in Proceedings of the IEEE 9th International Conference on Services Computing (SCC ’12), pp. 400–406, IEEE, Honolulu, Hawaii, USA, June 2012.
[7] D. Zats, T. Das, P. Mohan, D. Borthakur, and R. Katz, “DeTail: reducing the flow completion time tail in datacenter networks,” in Proceedings of the Conference Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM ’12), pp. 139–150, ACM, August 2012.
[8] B. Vamanan, J. Hasan, and T. N. Vijaykumar, “Deadline-aware datacenter tcp (D2TCP),” in Proceedings of the ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM ’12), pp. 115–126, ACM, 2012.
[9] R. Kohavi and R. Longbotham, “Online experiments: lessons learned,” Computer, vol. 40, no. 9, pp. 103–105, 2007.
[10] C. Wilson, H. Ballani, T. Karagiannis, and A. Rowstron, “Better never than late: meeting deadlines in datacenter networks,” in Proceedings of the ACM SIGCOMM Conference (SIGCOMM ’11), pp. 50–61, 2011.
[11] R.-S. Shi, Y. Zhang, J.-L. Chen et al., “Publish/subscribe network service infrastructure design for EDSOA service platform,” Computer Integrated Manufacturing Systems, vol. 18, no. 8, pp. 1659–1666, 2012.
[12] C. L. Forgy, “Rete: a fast algorithm for the many pattern/many object pattern match problem,” Artificial Intelligence, vol. 19, no. 1, pp. 17–37, 1982.
[13] J. Owen, “World’s fastest rules engine,” September 2010, http://www.javaworld.com/javaworld/jw-09-2010/100920-retent.html.
[14] http://www.pst.com/reteii2.html.
[15] S. Whang, C. Brower, J. Shanmugasundaram et al., “Indexing boolean expressions,” in Proceedings of the 35th International Conference on Very Large Data Bases (VLDB ’09), Lyon, France, August 2009.
[16] M. Sadoghi and H.-A. Jacobsen, “BE-Tree: an index structure to efficiently match Boolean expressions over high-dimensional discrete space,” in Proceedings of the International Conference on Management of Data (SIGMOD ’11), pp. 637–648, ACM, June 2011.
[17] M. Sadoghi and H.-A. Jacobsen, “Relevance matters: capitalizing on less (Top-k matching in publish/subscribe),” in Proceedings of the IEEE 28th International Conference on Data Engineering (ICDE ’12), pp. 786–797, IEEE, Washington, DC, USA, April 2012.
[18] A. Machanavajjhala, E. Vee, M. Garofalakis, and J. Shanmugasundaram, “Scalable ranked publish/subscribe,” Proceedings of the VLDB Endowment, vol. 1, no. 1, pp. 451–462, 2008.
[19] P. T. Eugster, P. A. Felber, R. Guerraoui, and A.-M. Kermarrec, “The many faces of publish/subscribe,” ACM Computing Surveys, vol. 35, no. 2, pp. 114–131, 2003.
[20] B. F. Cooper, R. Ramakrishnan, U. Srivastava et al., “PNUTS: Yahoo!’s hosted data serving platform,” Proceedings of the VLDB Endowment, vol. 1, no. 2, pp. 1277–1288, 2008.
[21] R. Baldoni and A. Virgillito, “Distributed event routing in publish/subscribe communication systems: a survey,” Tech. Rep., DIS, Universita di Roma La Sapienza, 2005.
[22] S. Kounev, J. Bacon, K. Sachs, and A. Buchmann, “A methodology for performance modeling of distributed event-based systems,” in Proceedings of the 11th IEEE International Symposium on Object Oriented Real-Time Distributed Computing (ISORC ’08), pp. 13–22, IEEE, 2008.
[23] S. Tian, G. Weber, and C. Lutteroth, “A tuplespace event model for mashups,” in Proceedings of the 23rd Australian Computer-Human Interaction Conference (OzCHI ’11), pp. 281–290, ACM, December 2011.
[24] T. Pongthawornkamol, K. Nahrstedt, and G. Wang, “Probabilistic QoS modeling for reliability/timeliness prediction in distributed content-based publish/subscribe systems over best-effort networks,” in Proceedings of the 7th IEEE/ACM International Conference on Autonomic Computing, pp. 185–194, ACM, Washington, DC, USA, June 2010.
[25] B. Bloom, “Space/time trade-offs in hash coding with allowable errors,” Communications of the ACM, vol. 13, no. 7, pp. 422–426, 1970.
[26] A. Broder and M. Mitzenmacher, “Network applications of bloom filters: a survey,” Internet Mathematics, vol. 1, no. 4, pp. 485–509, 2004.
[27] S. Dharmapurikar, P. Krishnamurthy, T. Sproull, and J. Lockwood, “Deep packet inspection using parallel bloom filters,” in Proceedings of the 11th IEEE Symposium on High Performance Interconnects, pp. 44–51, 2003.
[28] L. Fan, P. Cao, J. Almeida, and A. Z. Broder, “Summary cache: a scalable wide-area Web cache sharing protocol,” IEEE/ACM Transactions on Networking, vol. 8, no. 3, pp. 281–293, 2000.
[29] D. Guo, J. Wu, H. Chen, and X. Luo, “Theory and network applications of dynamic bloom filters,” in Proceedings of the 25th IEEE Conference on Computer Communications (INFOCOM ’06), vol. 1, April 2006.
[30] K. Xie, Y. Min, D. Zhang, J. Wen, and G. Xie, “A scalable bloom filter for membership queries,” in Proceedings of the IEEE Global Telecommunications Conference (GLOBECOM ’07), pp. 543–547, Washington, DC, USA, November 2007.
[31] A. Kirsch and M. Mitzenmacher, “Less hashing, same performance: building a better bloom filter,” in Proceedings of the 14th Annual European Symposium on Algorithms, pp. 456–467, 2006.

