Distrib Parallel Databases (2012) 30:27–62 DOI 10.1007/s10619-011-7087-6

Pattern-based event detection in sensor networks

Wenwei Xue · Qiong Luo · Hejun Wu

Published online: 4 November 2011 © Springer Science+Business Media, LLC 2011

Abstract Many applications of wireless sensor networks monitor the physical world and report events of interest. To facilitate event detection in these applications, in this paper we propose a pattern-based event detection approach and integrate the approach into an in-network sensor query processing framework. Different from existing threshold-based event detection, we abstract events into patterns in sensory data and convert the problem of event detection into a pattern matching problem. We focus on applying single-node temporal patterns, and define the general patterns as well as five types of basic patterns for event specification. Considering the limited storage on sensor nodes, we design an on-node cache manager to maintain the historical data required for pattern matching and develop event-driven processing techniques for queries in our framework. We have conducted experiments using patterns for events that are extracted from real-world datasets. The results demonstrate the effectiveness and efficiency of our approach.

Keywords Sensor networks · Event detection · Pattern matching · Query processing · Data compression

Communicated by Amit Sheth.

W. Xue (✉)
Nokia Research Center, Beijing, China

Q. Luo
Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, Hong Kong

H. Wu
Department of Computer Science, Sun Yat-sen University, Guangzhou, China


1 Introduction

Event detection has become an essential element of a wide variety of sensor network applications, such as disaster detection [17, 18, 29, 32], habitat monitoring [19, 24], industrial process control [1] and object tracking [10, 11]. Instead of acquiring a complete view of sensory data over time and space, these applications often require pre-defined actions to be triggered whenever certain events occur. An event is a phenomenon in the physical world that the users are interested in, e.g. vehicle movement or a fire emergency. With the goal of facilitating the development of sensor network applications, in this paper we propose an event-driven framework that integrates the application semantics and processing logic of event detection with distributed, database-style query processing in sensor networks [19, 32].

A typical event detection approach in recent work on sensor databases is to use thresholds in query predicates to express the events [1, 10, 19, 31, 32]. For instance, in order to detect whether the door of a restricted-access lab is opened, the lab manager attaches a sensor node to the door, and installs on the node a query having a predicate on the acceleration attribute: accel_x > threshold. The intuition is that when the door moves, the accelerometer reading on the node will exceed a threshold value. Aggregates and temporal aggregates, such as avg, max, winavg and winmax [19], can be used together with thresholds [10, 32] to reduce the false alarms caused by environmental noise or hardware malfunction [24]. For example, a predicate winavg(accel_x, 5sec) > threshold will report the door opening event only if the average value of multiple accelerometer readings on the sensor node within a sliding window of five seconds exceeds the threshold; if a group of several sensor nodes is attached to the door, a predicate avg(accel_x) > threshold can be evaluated over the multi-node reading stream of the whole group.
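The windowed threshold predicate described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name, the window size expressed in samples, the requirement that the window be full before the predicate can fire, and the sample readings are all assumptions made for illustration.

```python
from collections import deque


class WinAvgThreshold:
    """Sketch of a predicate such as winavg(accel_x, window) > threshold."""

    def __init__(self, window_size, threshold):
        # deque(maxlen=...) silently drops the oldest reading when full.
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def update(self, reading):
        """Add one reading; report True only when the window is full
        (an assumption) and its average exceeds the threshold."""
        self.window.append(reading)
        if len(self.window) < self.window.maxlen:
            return False
        return sum(self.window) / len(self.window) > self.threshold


# Hypothetical ADC counts: one isolated noisy reading, then a sustained rise.
readings = [500, 500, 1000, 500, 500, 900, 900, 900, 900, 900]
detector = WinAvgThreshold(window_size=5, threshold=700)
alarms = [detector.update(r) for r in readings]
```

In this example a plain predicate `reading > 700` would fire on the isolated value 1000, whereas the windowed average fires only once the sustained rise dominates the window. The single threshold value still has to be tuned, which is the limitation discussed next.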
Despite its implementation simplicity, this threshold-based approach to event detection suffers from a number of drawbacks. First, a single threshold value may not be able to fully express the semantics of an event. Take the previous door opening event as an example. On the widely-used MEMSIC motes [20], this event corresponds to a transient spike in the readings of the accelerometer, preceded and followed by a constant reading level when the door is closed. Such temporal semantics of the event are hard to capture with a threshold. Second, the suitable threshold value for an event may change from time to time and from environment to environment. It is difficult for the user to tune the value for various occurrences of the event. In the door-opening example, the accelerometer readings may differ depending on how hard a user pushes the door and at what points in time during the opening process the readings are sampled. Last, although aggregates and temporal aggregates are useful to smooth sensor readings from a group of nodes at a time or from a node over a period, a single aggregate value compared against the threshold is still unreliable and error-prone.

Motivated by these limitations of the threshold-based approach, in this paper we propose a pattern-based approach to event detection in sensor networks. Our approach is based on a key observation that we obtained from various real-world sensory datasets: the occurrence of an event generates a particular spatio-temporal pattern in the sensor readings of the nodes that detect the event. This observation has been used in many previous studies [9, 11, 12, 18, 21, 28–30].


With our pattern-based event detection, the main problem is converted to pattern matching in sensor readings. Specifically, given the pattern generated by an event, our approach reports an occurrence of the event when the streaming sensory data on a node, or on a multi-node group, matches the pattern. Compared to a threshold, a pattern better captures the event semantics and is more resilient to missing or erroneous values in sensor readings. Furthermore, patterns can be specified by fixing their overall change trends only, which is more flexible and reliable than specifying specific threshold values. This enables a global pattern specification on all nodes that detect the same event.

In this work we focus on a common class of events that can be detected through the matching of temporal patterns on a node. Compared to complex spatial patterns that require the resource-consuming modeling of network-wide sensor data distribution [18, 29, 30], temporal patterns are simpler to express while still providing great opportunities for event abstraction. Moreover, temporal patterns only require matching processes on individual sensor nodes. This can save much of the communication cost in the network, which is the dominating factor of energy consumption on the nodes [19]. Similar to threshold evaluation combined with aggregates, a temporal pattern in our approach can be matched not only on the reading stream of a single node, but also on the aggregated reading stream of a group of multiple nodes.

As examples, Fig. 1 shows several temporal patterns in datasets that we collected using motes. The y-axes in the figure are the raw ADC (Analog-to-Digital Converter) counts sampled by various sensors on the motes and the x-axes are the sample periods. The patterns represent a number of real-world events in two case study applications we set up. These applications are described as follows.
Application 1 (Frog Pond Surveillance) A number of sensor nodes are deployed around a frog pond on our campus. In the summer, the frogs in the pond croak all day except for a few short periods. People may walk around the pond or sit nearby in their leisure time. Interesting events in this application include the pause of frog croaks, the sunrise or sunset, and the presence or movement of visitors around the pond.

Application 2 (Pervasive Lab Surveillance) A number of sensor nodes are deployed in the pervasive research lab of our department. The application monitors the activities of users and the status of equipment in the lab. Example events to be detected are the opening of the lab door (implying a person enters or leaves the lab), the switching of ceiling lights, loud talking and the movement of valuable equipment in the lab.

There are three important design objectives for our pattern-based approach. First, the approach must be general and applicable to detecting various events in the physical world. Second, the approach must be feasible to run on current-generation sensor nodes. These nodes are characterized by very limited computation, storage and communication resources. For instance, the TelosB motes have an 8 MHz microcontroller, 10 KB of RAM and a 250 Kbps 802.15.4-compliant wireless radio [20]. Last, the pattern specification should strike a balance between simplicity and expressive power.

Fig. 1 Example temporal patterns in sensor readings for real-world events


Fig. 2 Example complex pattern concatenated from basic patterns

Given these design objectives, we abstract a general event as a sequential pattern [13] in the reading time series sampled by a type of sensor on a node (e.g. accelerometer, temperature or sound sensor). For user convenience, we further define five types of basic patterns that are useful to represent many events in real-world applications: the horizon, slope, oscillation, jump and spike patterns. To tolerate imperfect user knowledge about the events, each basic pattern is specified using a set of parameters that only fix the rough relative shape of the pattern but not the specific reading values. As an alternative to specifying general patterns, the users can compose more complex patterns for events that involve one or more types of sensors by combining a few basic patterns as building blocks. Figure 2 shows an example complex event pattern from our case study datasets. It is the concatenation of three basic patterns. Compared to the threshold-based approach, our approach improves the accuracy of event detection by replacing the requirement for a single highly accurate parameter with that for a set of reasonably accurate parameters.

We extend a SQL-based sensor query language [19, 32] to allow the specification of general and basic patterns using system built-in methods. Through this query interface, thresholds and aggregates can be used together with our patterns as the users see fit. Since thresholds can still be effective in several situations and are complementary to patterns, this query interface allows the applications to take advantage of both approaches whenever applicable.

Because the sensor nodes have limited storage, we design a cache manager equipped with a two-level compression scheme on each node to maintain the historical sensor reading time series that must be stored for pattern matching. Focusing on the evaluation of pattern matching methods embedded in continuous queries, we propose event-driven techniques for query processing in our framework.
Results from simulation and a real-mote prototype demonstrate that our pattern-based approach achieves at least 80% and mostly 90%–100% accuracy of event detection. Such good performance does not rely on the users setting the pattern parameters perfectly; a reasonable value range for each parameter suffices. On the experimental query workload, our approach achieves a 30%–90% cache data compression ratio on average while ensuring the aforementioned accuracy, and its computation cost on real motes is as small as 0.3 seconds. The results also show that the basic patterns we specially design are more immune to insufficient memory than the general patterns, so they are more desirable for resource-limited sensor motes.


The remainder of the paper is organized as follows. We describe our pattern-based event specification in Sect. 2. We present the design of the on-node cache manager in Sect. 3 and the event-driven query processing framework in Sect. 4. In Sect. 5, we evaluate the performance of our approach using patterns generated by real-world events. We discuss related work in Sect. 6 and draw concluding remarks in Sect. 7.

2 Pattern-based event specification

In our approach, we define an event as a particular temporal pattern in sensor readings and detect the event using a SQL-based query embedded with one or more methods for pattern matching. We present the properties of our patterns and discuss considerations on selecting these properties in Sect. 2.1. We then give formal definitions of individual types of patterns in our event specification in Sect. 2.2, and introduce the class of queries handled in our framework in Sect. 2.3.

2.1 Properties of patterns

We focus on using temporal patterns in sensor readings that have the following three properties to abstract the events:

(1) Single-attribute. A pattern is defined on the readings of a type of sensor, i.e., the values of a sensory attribute [19]. Example attributes are temp, light, noise and accel_x. They represent the readings of temperature, light, microphone and acceleration sensors.
(2) Single-node. A pattern consists of a sliding window of historical sensor readings or aggregated values on a node.
(3) Equal-interval. A pattern is a value sequence with a fixed time interval between any two consecutive values in the sequence.

The properties are all chosen based on the design objectives we presented in the Introduction. We choose the first property as it simplifies the event specification. Single-attribute patterns for events are more intuitive and easier for the users to recognize. Moreover, the AND/OR keywords in SQL can be used to combine multiple single-attribute patterns to describe the same event (see Sect. 2.3).

With the second property, a main advantage is that local temporal patterns can be matched entirely on individual nodes. A node requires no communication in order to perform pattern matching. By this means, our approach preserves the distributed merit of the threshold-based approach to save communication cost. The processing of the two approaches can be seamlessly combined on a node.
We adopt the third property for simplicity and feasibility. Because the events are unpredictable phenomena in the physical world, continual sampling with a regular sample period remains necessary to ensure the accuracy of event detection [6, 7]. The fixed interval expresses the requirement of the user about how fast the sensor should sample in order to promptly catch an occurrence of the event. We understand there are many physical-world events whose subtle semantics cannot be fully captured by temporal patterns. To express these events, we will need to


extend our approach to model the correlation between readings of multiple sensors of one or more types on different nodes in the network, and abstract the event as a general spatio-temporal pattern. This requires a global view of the sensory data distribution generated by the event for pattern matching, and implies considerable communication costs in the network. A few previous papers provide potential directions on how we could extend our pattern-based approach to support general spatio-temporal patterns: either all data is transmitted to a server (possibly combined with in-network aggregation) for centralized pattern matching [18, 29, 30], or real-time data is exchanged in the network to coordinate nodes to model the spatial data distribution [4, 8, 12] for in-network matching.

On the other hand, according to our case studies, many events in the physical world have straightforward semantics that can be expressed effectively by single-attribute temporal patterns. Typical examples have been illustrated in Fig. 1. To detect these events, the costly correlation maintenance between multi-sensor and multi-node readings becomes unnecessary, whereas a simpler alternative in which each node examines its own patterns is acceptable and desirable, especially on resource-limited sensor nodes. This motivates us to use temporal patterns for event abstraction and detection in this paper.

In addition to the traditional threshold-based approach, there have been a number of newly proposed approaches to sensor network event detection in the database literature [1, 4, 12, 18, 29–31]. Like all prior approaches, the pattern-based approach we propose in this paper may not be able to abstract and detect all kinds of events in the physical world very accurately. Nevertheless, it provides a competitive alternative and a complement to these prior approaches for the users.
2.2 Definitions of patterns

We define a general pattern in our event specification using the semantics conforming to the similarity search, i.e. pattern matching queries in time series databases [2, 3, 13, 27], as presented in Definition 1.

Definition 1 (Pattern) Consider a sequence of consecutive readings $S = (s_1, s_2, \ldots, s_n)$ of a sensor with a sample period $sp$. $S$ forms a pattern if and only if, given a query sequence $Q = (q_1, q_2, \ldots, q_n)$, at least $n(1 - \alpha)$ readings in $S$ satisfy $|s_i - q_i| \le \Delta$ $(1 \le i \le n)$. $(1 - \alpha) \in (0, 1)$ is called the confidence level. $\Delta > 0$ is called the degree of difference.

In the definition, every element in $S$ can also be an aggregated value. The query sequence $Q$ is equal-interval as $S$. This interval is the sample period $sp$. The confidence level is designed to tolerate a certain percentage of outlying sensor readings that deviate from the overall trend of the pattern. This is because real-world sensory data is unreliable [32] and a few readings may be completely invalid [24]. Figure 1(a) shows an example pattern randomly picked from the readings of a microphone sensor.

Definition 1 requires a user to provide a specific value sequence for pattern matching. An immediate problem in many sensor network applications is that a user may only have partial knowledge about the overall trend of the pattern for an event. For


instance, by posing test queries, the user notices that a sunrise leads to a gradual increase in the light sensor readings of the nodes attached to a window. However, it is difficult for the user to find out the exact fitting curve (e.g., polynomial or exponential) or value sequence that best approximates the pattern, which, like a threshold, may change from environment to environment.

Addressing this problem, we define five types of basic patterns in our event specification. They are all derived from our two case study applications described in the Introduction. Each type of basic pattern can be viewed as a kind of shape in the sensor reading curve over time. To specify a basic pattern, a user only needs to provide a set of parameter values that fix the rough shape of the pattern. This tolerates the user's imperfect knowledge of the event. Without ambiguity, in the rest of the paper we call these five types of patterns "basic patterns" and the patterns defined by Definition 1 "general patterns".

We name the five types of basic patterns the horizon, slope, oscillation, jump and spike patterns. To the best of our knowledge, there is no previous work on pattern matching that has defined a set of basic patterns similar to ours. In the following, we give the definition of each pattern and provide example real-world events for it.

Definition 2 (Horizon Pattern) Consider a sequence of consecutive readings $S = (s_1, s_2, \ldots, s_n)$ of a sensor with a sample period $sp$. $S$ forms a horizon pattern if and only if at least $n(1 - \alpha)$ readings in $S$ satisfy $s_i = s_{avg}$. $(1 - \alpha) \in (0, 1)$ is called the confidence level. $s_{avg}$ is the average of all values in $S$.

Intuitively, a horizon pattern is a stable state with little variance in the sensor reading curve. Even such a simple pattern can represent interesting events for applications. For instance, the pause of frog croaking in our case study application corresponds to a horizon pattern in microphone sensor readings as in Fig.
1(b), since there is little other noise around the frog pond during the period of application deployment.

Definition 3 (Slope Pattern) Consider a sequence of consecutive readings $S = (s_1, s_2, \ldots, s_n)$ of a sensor with a sample period $sp$. $S$ forms an increasing slope pattern if and only if it satisfies the following two conditions:

1. A monotonically non-decreasing sequence $S'$ of at least $n(1 - \alpha_1)$ readings can be generated by removing some readings from $S$ without changing the relative order of the others. $(1 - \alpha_1) \in (0, 1)$ is called the non-decreasing confidence level.
2. A monotonically increasing sequence $S''$ of at least $n(1 - \alpha_2)$ readings can be generated by removing some readings from $S'$ without changing the relative order of the others. The absolute difference between the first and last readings in $S''$ is at least $\Delta$. $(1 - \alpha_2) \in (0, 1)$ is called the increasing confidence level and $(1 - \alpha_2) < (1 - \alpha_1)$. $\Delta > 0$ is called the degree of change.

A decreasing slope pattern is defined symmetrically by replacing "non-decreasing" in Condition 1 with "non-increasing", and replacing "increasing" in Condition 2 with "decreasing".

A slope pattern is not simply a line. It is a gradual, continuous and long-term increase or decrease trend in the sensor reading curve. For instance, a fire generates an


increasing slope pattern in the readings of temperature sensors nearby [17]. A sunrise generates an increasing slope pattern in the light sensor readings of the nodes exposed to the sunlight (Fig. 1(c)).

Definition 4 (Oscillation Pattern) Consider a sequence of consecutive readings $S = (s_1, s_2, \ldots, s_n)$ of a sensor with a sample period $sp$. $S$ forms an oscillation pattern if and only if no subsequence $S'$ of $S$ satisfies the following two conditions:

1. All readings in $S'$ are larger than, or smaller than, or equal to $s_{avg}$.
2. The length of $S'$ is larger than $n \rho (1 + \alpha)$.

$\rho \in (0, 1)$ is called the percentage of transience and $(1 - \alpha) \in (0, 1)$ the confidence level. $s_{avg}$ is the average of all values in $S$. We define a subsequence of a sequence $S$ as a new sequence formed by a number of consecutive elements in $S$.

An oscillation pattern is a fluctuation with respect to a mean level in the sensor reading curve. The percentage of transience $\rho$ captures the semantics that the readings cannot stay on either side (up or down) of the mean level for more than a relatively long sub-period of the total period of the pattern. This type of pattern is useful to represent various object vibrations in the physical world. The pattern in Fig. 1(d) is obtained from the accelerometer readings of a node attached to the base of a swing model installed in our pervasive lab when the swing is played.

Definition 5 (Jump Pattern) Consider a sequence of consecutive readings $S = (s_1, s_2, \ldots, s_m, s_{m+1}, s_{m+2}, \ldots, s_n)$ of a sensor with a sample period $sp$. $S$ forms an increasing jump pattern if and only if it satisfies the following three conditions:

1. $s_{m+1} - s_m \ge \Delta_1$. $\Delta_1 > 0$ is called the degree of instantaneous change.
2. At least $m(1 - \alpha)$ readings in the subsequence $S' = (s_1, s_2, \ldots, s_m)$ satisfy $\delta_i = s_{m+1} - s_i \ge \Delta_2$ $(1 \le i \le m)$. $(1 - \alpha) \in (0, 1)$ is called the confidence level. $\Delta_2 > 0$ is called the degree of level change and $\Delta_2 \le \Delta_1$.
3. At least $(n - m)(1 - \alpha)$ readings in the subsequence $S'' = (s_{m+1}, s_{m+2}, \ldots, s_n)$ satisfy $\delta_j = s_j - s_m \ge \Delta_2$ $(m + 1 \le j \le n)$.

A decreasing jump pattern is defined symmetrically by replacing "$\ge \Delta_k$" $(k = 1, 2)$ in the conditions with "$\le -\Delta_k$".

A jump pattern is a phase shift in the sensor reading curve with a large difference in the overall reading level before and after the change. For instance, the light sensor readings in a dark lab follow an increasing jump pattern when we switch on the electric lights in the lab (Fig. 1(e)). When a loud conversation happens somewhere in a quiet lab, the readings of microphone sensors near the conversation also follow an increasing jump pattern.

Definition 6 (Spike Pattern) Consider a sequence of consecutive readings $S = (s_1, s_2, \ldots, s_m, s_{m+1}, s_{m+2}, \ldots, s_n)$ of a sensor with a sample period $sp$. $S$ forms an increasing spike pattern if and only if it satisfies the following three conditions:


1. $s_{m+1} - s_m \ge \Delta_1$. $\Delta_1 > 0$ is called the degree of instantaneous change.
2. At least $m(1 - \alpha)$ readings in the subsequence $S' = (s_1, s_2, \ldots, s_m)$ satisfy $\delta_i = s_{m+1} - s_i \ge \Delta_2$ $(1 \le i \le m)$. $(1 - \alpha) \in (0, 1)$ is called the confidence level. $\Delta_2 > 0$ is called the degree of level change and $\Delta_2 \le \Delta_1$.
3. At least $(n - m)(1 - \alpha)$ readings in the subsequence $S'' = (s_{m+1}, s_{m+2}, \ldots, s_n)$ satisfy $\delta_j = s_{m+1} - s_j \ge \Delta_2$ $(m + 1 \le j \le n)$.

A decreasing spike pattern is defined symmetrically by replacing "$\ge \Delta_k$" $(k = 1, 2)$ in the conditions with "$\le -\Delta_k$".

A spike pattern is a large transient increase or decrease in the sensor reading curve. Its effect dies out quickly, with nearly the same overall reading level before and after the change. The door opening event in Fig. 1(f) is reflected as a spike pattern in the accelerometer readings on the node attached to the lab door.

In all Definitions 1–6, we call the time interval $T = (n - 1) \cdot sp$ the interval of inspection. For Definitions 5–6, we also call $T_1 = (m - 1) \cdot sp$ the anterior interval of inspection and $T_2 = (n - m - 1) \cdot sp$ the posterior interval of inspection.

As a whole, the general patterns make our event specification generic and complete. A user is able to specify the pattern for any real-world event as long as the user can provide the specific value sequence of the pattern. The basic patterns, on the other hand, improve the flexibility and usefulness of our specification by tolerating imperfect user knowledge.

Our pattern definitions are user-specific. The parameters involved in each definition, for example, $sp$, $Q$, $(1 - \alpha)$, $\Delta$ and $T$ in Definition 1, do not have system-defined default values. We call these parameters the pattern parameters. A user specifies an instance of a type of pattern by providing instantiated values for these pattern parameters. The user-specified pattern is then matched with the readings of the corresponding sensor on the nodes for event detection.
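As an illustration, Definitions 1–6 can be read as Boolean predicates over a window of readings. The following Python sketch is a reading of the definitions under stated assumptions, not the on-node matching code of the paper: all function and parameter names are hypothetical, the horizon check treats readings within a tolerance `eps` of the average as equal (an assumption), only the increasing variants are shown, and the slope check is a heuristic that inspects a single longest non-decreasing subsequence rather than every possible choice of $S'$.

```python
from itertools import groupby


def enough(hits, total, alpha):
    """True if at least total * (1 - alpha) readings qualified."""
    return hits >= total * (1 - alpha)


def is_general(s, q, alpha, delta):
    """Definition 1: most readings lie within delta of the query sequence."""
    hits = sum(abs(a - b) <= delta for a, b in zip(s, q))
    return len(s) == len(q) and enough(hits, len(s), alpha)


def is_horizon(s, alpha, eps=0):
    """Definition 2: most readings equal the average (within eps, assumed)."""
    avg = sum(s) / len(s)
    return enough(sum(abs(x - avg) <= eps for x in s), len(s), alpha)


def longest_non_decreasing(s):
    """One longest non-decreasing subsequence of s (O(n^2) dynamic program)."""
    best, prev = [1] * len(s), [-1] * len(s)
    for i in range(len(s)):
        for j in range(i):
            if s[j] <= s[i] and best[j] + 1 > best[i]:
                best[i], prev[i] = best[j] + 1, j
    i = max(range(len(s)), key=best.__getitem__)
    out = []
    while i != -1:
        out.append(s[i])
        i = prev[i]
    return out[::-1]


def is_increasing_slope(s, alpha1, alpha2, delta):
    """Definition 3, checked heuristically on one candidate S'."""
    s1 = longest_non_decreasing(s)   # candidate S'
    s2 = sorted(set(s1))             # longest strictly increasing S'' inside S'
    return (enough(len(s1), len(s), alpha1)
            and enough(len(s2), len(s), alpha2)
            and s2[-1] - s2[0] >= delta)


def is_oscillation(s, rho, alpha):
    """Definition 4: no run on one side of the average exceeds n*rho*(1+alpha)."""
    avg = sum(s) / len(s)
    runs = groupby(s, key=lambda x: (x > avg) - (x < avg))  # +1, -1 or 0
    longest_run = max(len(list(group)) for _, group in runs)
    return longest_run <= len(s) * rho * (1 + alpha)


def is_increasing_jump_or_spike(s, m, alpha, delta1, delta2, spike):
    """Definitions 5 and 6; s[:m] is the anterior part, s[m:] the posterior."""
    before, after = s[:m], s[m:]
    peak, base = s[m], s[m - 1]      # s_{m+1} and s_m in 1-based notation
    if peak - base < delta1:         # condition 1
        return False
    cond2 = sum(peak - x >= delta2 for x in before)
    if spike:                        # condition 3 of Definition 6
        cond3 = sum(peak - x >= delta2 for x in after)
    else:                            # condition 3 of Definition 5
        cond3 = sum(x - base >= delta2 for x in after)
    return enough(cond2, m, alpha) and enough(cond3, len(after), alpha)
```

For instance, with the hypothetical window `[100, 102, 99, 101, 100, 500, 510, 505, 498, 502]`, `m=5`, `alpha=0.2`, `delta1=300` and `delta2=200`, the last function accepts the window as a jump but rejects it as a spike, because the readings stay at the new level after the change.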
Our approach tries to match a pattern only when the pattern is explicitly specified by a user in a query. We do not intend to mine and discover all potential patterns implicitly without user specification, as previous work does [9, 21]. This line of work on pattern mining in sensory data is orthogonal to our work. It can be applied in our approach to assist a user in fixing the type and rough parameters to be used in a pattern specification, given a training dataset that contains pattern instances generated by the event.

Although a little complex in expression, our pattern definitions are all intuitive in nature and quite different from each other. Indeed, our approach requires more parameters than the threshold-based approach. However, our experimental results in Sect. 5.2 illustrate that the accuracy of our approach is relatively insensitive to the fluctuation of individual pattern parameters, whereas the accuracy of the threshold-based approach is largely affected by the single threshold.

The readings of a sensor within the same interval may form different types of patterns given different sets of pattern parameters, or, put differently, in the perceptions of different users. For instance, Fig. 1(f) can be viewed as a horizon pattern rather than a spike pattern by another user who is not interested in the door opening event. We believe it is generally reasonable to assume that a user is able to understand and infer what type of pattern an event should be abstracted into and what rough values for


Fig. 3 Global trend vs. local oscillation of a pattern

the pattern parameters should be used to specify the pattern in order to successfully detect the event, based on the user's specific requirements and preferences.

Finally, due to hardware limitations and environmental noise, small-scale local oscillations exist in sensor readings. For instance, Fig. 3 shows an enlarged view of the anterior part of the jump pattern in Fig. 1(e). Such local reading oscillation is unfavorable and unpredictable. It is different from the oscillation patterns for events of interest that we define. To reduce the influence of this oscillation on matching the global trend of a pattern, we require a user to provide an error bound ε when specifying any pattern. Two consecutive readings of a sensor with a difference not larger than ε are treated as equal in our approach. A user can set the value of ε based on the accuracy of the sensor, which is often available from the hardware manual; e.g., the temperature sensor on the MEMSIC MTS420CA sensor board has an accuracy of ±2 °C [20].

2.3 Query interface

We encapsulate the semantics of general and basic patterns into system built-in Boolean methods and extend an existing SQL-based sensor query language [19, 32] to support our proposed processing logic of pattern matching. As a high-level programming paradigm, the database-style query interface brings generality and flexibility to event specification and detection. The users can specify diverse events declaratively with simple queries. The process of event detection is then effectively integrated into query processing and efficiently performed in a generic, distributed manner on the sensor nodes. We focus on presenting the novel query language constructs and the corresponding processing techniques we propose as our extension in this paper.

The methods that correspond to Definitions 1–6 are named pattern, horizon, slope, oscillation, jump and spike. Every method has two input parameters. The first parameter is an attribute or aggregate function.
If the parameter is an attribute, the method is invoked on each node in the sensor network. If the parameter is an aggregate function, the method is only invoked on a single “root” node in each group that computes the final aggregated value of the group every sample period of the method. This root node is the node in the group that is closest to the sink node of the network connected to a base station PC (i.e. a server) [19]. The second parameter is the pathname of an XML configuration file that contains the user-specified values for pattern parameters. We encapsulate the specification of parameter values into a configuration file rather than adding multiple new clauses to SQL in order to: (i) keep the query


language concise, (ii) allow multiple methods in a query to have different values of the same parameter, e.g. sample period or confidence level, (iii) enable convenient modification of the values.

A query in our framework is continuous [19, 32] and embedded with one or more methods for pattern matching. These methods can be used in the WHERE or HAVING clause of a query. In each sample period of a method, the method is evaluated on a node by matching the pattern it specifies with the cache data sequence in a recent sliding window. The method returns true if and only if a pattern match is achieved in this period. The SELECT clause of a query contains the action in response to the event. The action can be attribute projection, SQL aggregation, or any user-defined function in general.

Queries 1–2 show two example queries in our framework. The sensors virtual table [19] in the queries abstracts sensory data collected from all nodes in a sensor network. It has a schema of (nodeid, timestamp, attr1, attr2, . . . , attrn). nodeid is the ID of a sensor node and attri (1 ≤ i ≤ n) is an attribute name. Each tuple in the table consists of the attribute values collected from a node at a time indicated by timestamp.

As exemplified by Query 2, thresholds and aggregates can be easily used together with our patterns in the SQL-based queries. The avg() method in Query 2 computes the spatial average noise values over the group of all nodes in each room. The jump pattern is then matched over the sequence of these average values stored on the group leader node in each room. Figure 4 shows the configuration file talk.xml of the jump method in Query 2. The elements in the file are self-explanatory given the pattern definition.

Query 1:
  SELECT   alarm(room_no)
  FROM     sensors
  WHERE    spike(accel_x, “door_opening.xml”)

Query 2:
  SELECT   room_no, avg(noise)
  FROM     sensors
  GROUP BY room_no
  HAVING   jump(avg(noise), “talk.xml”) AND avg(noise) > 800

A user can request the detection of an arbitrary event by issuing a query with a pattern method that specifies the general pattern generated by the event. Alternatively, the user can compose the specification of a complex pattern by connecting the methods for basic pattern matching with the AND/OR keywords in SQL, or by using nested queries. This is helpful when the general pattern for an event is difficult or infeasible for the user to identify. Take the complex event pattern in Fig. 2 as an example. It begins with an increasing slope pattern, followed by an increasing spike and then a decreasing spike pattern. Between the slope and the first spike pattern there is a gap of 5∼10 seconds that may contain any pattern. This complex pattern can be specified using Query 3. As the query shows, we design a CREATE EVENT statement, similar to CREATE VIEW in SQL, as a language extension in our query interface. It enables complex general patterns to be composed from the basic patterns using nested queries.

Fig. 4 Example configuration file of a method (element values, in order): jump; increasing; 10 seconds; 3 minutes; 3 minutes; 400 ADC counts; 300 ADC counts; 0.2; 5 ADC counts
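As a rough illustration of how a node-side or server-side component might read such a configuration file, the following Python sketch parses a jump-pattern configuration carrying the values of Fig. 4. The element names (pattern, mode, sample_period, T1, T2, delta1, delta2, alpha, epsilon), the unit attribute, and the assignment of the value 0.2 to alpha are assumptions made for this sketch; they are not the tag names of the actual talk.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: the tag names are assumptions, the values follow Fig. 4.
TALK_XML = """
<method>
  <pattern>jump</pattern>
  <mode>increasing</mode>
  <sample_period unit="seconds">10</sample_period>
  <T1 unit="minutes">3</T1>
  <T2 unit="minutes">3</T2>
  <delta1 unit="ADC counts">400</delta1>
  <delta2 unit="ADC counts">300</delta2>
  <alpha>0.2</alpha>
  <epsilon unit="ADC counts">5</epsilon>
</method>
"""

SECONDS_PER_UNIT = {"seconds": 1, "minutes": 60}


def load_pattern_config(xml_text):
    """Return a dict of pattern parameters; durations are converted to seconds."""
    root = ET.fromstring(xml_text)
    config = {}
    for child in root:
        unit = child.get("unit")
        text = child.text.strip()
        if unit in SECONDS_PER_UNIT:
            # Time-valued parameter: normalize to seconds.
            config[child.tag] = float(text) * SECONDS_PER_UNIT[unit]
        else:
            # Numeric parameters become floats, everything else stays a string.
            try:
                config[child.tag] = float(text)
            except ValueError:
                config[child.tag] = text
    return config


config = load_pattern_config(TALK_XML)
```

Keeping the parameters in a separate file, as the text argues, means that a change to a value such as the sample period only touches this file and leaves the query string unchanged.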

CREATE EVENT increasingSlopes AS
SELECT * FROM sensors WHERE slope(light, “increasing_slope.xml”)

CREATE EVENT increasingSpikes AS
SELECT * FROM sensors WHERE spike(light, “increasing_spike.xml”)

CREATE EVENT decreasingSpikes AS
SELECT * FROM sensors WHERE spike(light, “decreasing_spike.xml”)

Query 3:
SELECT p3.nodeid, p3.timestamp
FROM increasingSlopes p1, increasingSpikes p2, decreasingSpikes p3
WHERE p1.nodeid = p2.nodeid AND p2.nodeid = p3.nodeid
AND p2.timestamp − p1.timestamp >= (T2 + 5) sec
AND p2.timestamp − p1.timestamp
AND p2.timestamp
AND p3.timestamp − p2.timestamp

threshold SAMPLE PERIOD 1 sec is posed in the threshold-based approach. Figure 8 shows the unsatisfactory accuracy achieved by this threshold-based query as the threshold value varied. No matter what value was used for the threshold, the precision and recall of the query were much

Fig. 8 Accuracy of threshold-based detection for example event

Table 3 Default pattern parameters for Q1–Q6

Query ID | Values of Parameters
1 | sp = the sample period of the dataset, T = 99 ∗ sp, 1 − α = 0.9, Δ = 0.1 ∗ Q.mean, ε = 0.02 ∗ Q.mean
2 | sp = 30 sec, T = 20 min, 1 − α = 0.8, ε = 5 ADC counts
3 | sp = 30 sec, T = 30 min, 1 − α1 = 0.4, 1 − α2 = 0.2, Δ = 100 ADC counts, ε = 3 ADC counts
4 | sp = 1 sec, T = 3 min, 1 − α = 0.8, ρ = 0.1, ε = 2 ADC counts
5 | sp = 10 sec, T1 = T2 = 3 min, 1 − α = 0.8, Δ1 = 400 ADC counts, Δ2 = 300 ADC counts, ε = 5 ADC counts
6 | sp = 1 sec, T1 = T2 = 20 sec, 1 − α = 0.8, Δ1 = 6 ADC counts, Δ2 = 4 ADC counts, ε = 1 ADC count
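The number of readings each query must keep follows from the sample periods and intervals of inspection in Table 3. The short Python check below is a sketch under two assumptions: a window of length T sampled every sp holds T/sp + 1 readings (both endpoints included), and for the jump and spike patterns the two intervals T1 and T2 are laid end to end. The names QUERIES and readings_in_window are illustrative. Under these assumptions the values come out as 100, 41, 61, 181, 37 and 41 readings for Q1–Q6.

```python
# (interval of inspection, sample period), both in seconds, taken from Table 3.
# Q1 is expressed in units of its own sample period (T = 99 * sp).
# For Q5 (jump) and Q6 (spike) the intervals T1 and T2 are summed (assumption).
QUERIES = {
    "Q1": (99, 1),
    "Q2": (20 * 60, 30),
    "Q3": (30 * 60, 30),
    "Q4": (3 * 60, 1),
    "Q5": (3 * 60 + 3 * 60, 10),
    "Q6": (20 + 20, 1),
}


def readings_in_window(interval, sample_period):
    """Readings in a closed window of the given length, one per sample period."""
    return interval // sample_period + 1


window_sizes = {
    name: readings_in_window(interval, sp)
    for name, (interval, sp) in QUERIES.items()
}
```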

lower than that of Q6 in Fig. 7(e). A threshold value that resulted in good precision corresponded to poor recall and vice versa. We have further considered threshold-based counterparts for the other queries in our query workload. Q1 has no threshold-based counterpart because it makes no sense to replace a random sequence of values with any single value. Q2 cannot be expressed in the threshold-based approach either, since a horizon pattern focuses on the semantics of a “long-term steady value state” that a single value simply cannot express. For Q3–Q5, even though thresholds are insufficient to capture their event semantics, we manually tuned their threshold counterparts many times for comparison. The “inverse” relationship between precision and recall shown in Fig. 8 was similarly revealed in the accuracy results for these three threshold-based queries. For the remaining experiments presented, we used a fixed set of parameter values for each of Q2–Q6, as listed in Table 3. As shown in Fig. 7, these sets of parameters resulted in 100% accuracy for the queries. The parameters we used for Q1 throughout all experiments are also provided in the table. Q1 always achieved 100% precision and recall with this parameter setting and an unlimited Lc. Given the sample period and interval of inspection for each pattern in the table, Q1–Q6 need to store 100, 41, 61, 181, 37 and 41 sensor readings, respectively.

5.3 Evaluation of cache data compression

We continued to evaluate our two-level data compression scheme in the cache manager. In addition to the accuracy, we studied two other performance metrics in this

Fig. 9 Storage costs of Q1–Q6 with different compression schemes

experiment: (i) the maximum cache space requirement (MCSR), and (ii) the average cache space requirement (ACSR). When a query is run over a dataset, the query requires a certain amount of cache space on a node at any given time. We call this space the instantaneous cache requirement of the query on the node at that time. The MCSR and ACSR of the query on a node are defined as the maximum and the average, respectively, of the instantaneous cache requirements over the entire time span of the dataset. These two metrics indicate the worst-case and average-case storage cost of a query. Figure 9 shows the cache space requirements of Q1–Q6 when each of them was run individually with three different compression schemes applied in the cache manager: (i) our combination of PCA and PLR, (ii) PCA, and (iii) PLR. “UC” in the figure means that the cache is uncompressed, i.e., every sensor reading is stored as an RP in the cache. Each value in the figure is the average of the metric values on all nodes in the dataset. We assume a size of 2 bytes for a sensor reading or a timestamp in this experiment. The results show that, in comparison with UC, our proposed PCA + PLR achieved a data compression ratio of 10%, 21%, 6%, 33%, 13% and 51% for Q1–Q6 on MCSR. This indicates that even in the worst case, the reduction in storage cost achieved by our scheme is considerable. The compression ratio of our scheme for the six queries rose to 61%, 59%, 33%, 40%, 33% and 90% on ACSR. This suggests that the average storage cost of a query can be reduced sharply when our scheme is applied. The figure also shows that, for all queries, applying our scheme led to a consistently smaller MCSR and ACSR than applying PCA alone for the data compression. When PLR was applied alone, it achieved a smaller MCSR and ACSR for Q3, but its compression ratio on the other five queries was worse than that of our scheme.
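The two storage metrics defined above can be sketched in a few lines of Python. The sketch assumes that the instantaneous cache requirement of a query on a node has been logged as one byte count per sample period; the function names and the sample values are purely illustrative and do not come from the datasets used in the experiments.

```python
def cache_space_requirements(instantaneous_bytes):
    """Return (MCSR, ACSR) of one query on one node.

    instantaneous_bytes: cache bytes needed at each point in time.
    """
    if not instantaneous_bytes:
        raise ValueError("at least one measurement is required")
    mcsr = max(instantaneous_bytes)                            # worst case
    acsr = sum(instantaneous_bytes) / len(instantaneous_bytes)  # average case
    return mcsr, acsr


def average_over_nodes(per_node_series):
    """Average the per-node MCSR and ACSR over all nodes, as reported per query."""
    pairs = [cache_space_requirements(series) for series in per_node_series]
    mean_mcsr = sum(m for m, _ in pairs) / len(pairs)
    mean_acsr = sum(a for _, a in pairs) / len(pairs)
    return mean_mcsr, mean_acsr


# Made-up measurements for two nodes, in bytes.
node_a = [8, 12, 20, 12]
node_b = [4, 4, 16, 8]
summary = average_over_nodes([node_a, node_b])
```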
The good performance of PLR on Q3 was due to the fact that a linear function happened to approximate well the slope pattern generated by the sunrise event in our FP dataset. As a whole, our scheme compresses the different types of patterns better than PCA or PLR applied alone. Figure 10 shows the accuracy of Q1–Q6 obtained in this experiment. In the figure, all three compression schemes achieved more than 80% precision and 90% recall for every query. As with storage cost, PLR performed the worst and our scheme the best among the three. This suggests that our scheme preserves the quality of real-world sensory data better than applying PCA or PLR alone. Interestingly, Fig. 10 indicates that the accuracy of event detection for the six queries got worse if the caches were not compressed. This is because we apply the cache data compression scheme not only to save storage cost, but also to smooth the

Fig. 10 Accuracy of Q1–Q6 with different compression schemes

Fig. 11 Accuracy of individual queries with insufficient cache space

unreliable sensory data and remove the influence of local reading oscillation on the matching of the overall pattern of interest. The compression we conducted serves as a kind of pre-processing that improves the quality of raw sensory data before the data are used by high-level applications. In contrast, the plausible assumption that using the original, uncompressed sensor readings would lead to the best possible accuracy is incorrect, due to the intrinsic errors in real-world sensory data. We have tried many other parameter settings for Q1–Q6. All these results confirmed the relative magnitude of accuracy achieved by the three schemes and UC shown in Fig. 10. Finally, we investigated the effect of the cost-based policy in our PLR model for dealing with insufficient cache space. Figure 11 shows the accuracy of Q2–Q5 when each query was run individually over its corresponding dataset with the cache size limit Lc varied from 100% to 20% of the ACSR of the query. More specifically, the category “100%” on the x-axis of the figure represents a different ACSR value for each query: 158, 68, 163, 432, 54 and 9 bytes for Q1–Q6, in order. The other categories are treated likewise and denote values computed in proportion. Because the recall of Q1 and Q6 decreased to nearly 0% when the cache space was insufficient, we do not include the results for these two queries in the figure. This experiment is specifically designed to evaluate our approach in the worst case of extremely limited on-node storage, although current-generation sensor motes have at least a few kilobytes of RAM [20]. As expected, the accuracy of the queries degraded when the available cache space on the nodes decreased. The reason is that whenever the cache space became insufficient, many consecutive readings in a cache had to be aggregated using the cost-based policy in a way that breaks the current error bound of the cache. Consequently, the subtle trends in the sensory data may get lost.
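Precision and recall, the two accuracy measures reported throughout these experiments, can be computed as in the following minimal Python sketch. It assumes that detections and ground-truth event occurrences have already been matched to common identifiers; the function name, the convention for empty inputs, and the sample identifiers are illustrative choices and are not taken from the experiments.

```python
def precision_recall(detected, actual):
    """Return (precision, recall) for sets of event-occurrence identifiers.

    detected: identifiers of occurrences reported by a query.
    actual:   identifiers of occurrences that really happened.
    """
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    # Convention assumed here: an empty denominator counts as a perfect score.
    precision = true_positives / len(detected) if detected else 1.0
    recall = true_positives / len(actual) if actual else 1.0
    return precision, recall


# Made-up example: four reported occurrences, five real ones, three in common.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6})
```

A recall close to 0%, as observed for Q1 and Q6 with insufficient cache space, corresponds to almost none of the real occurrences appearing in the detected set.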
The figure illustrates that the accuracy of Q2–Q4 was not affected as significantly as that of Q1, Q5 and Q6 by insufficient cache space. The preservation of accuracy for

Fig. 12 Accuracy of concurrent queries with insufficient total cache space

Q2–Q4 was mainly due to the continuity property of horizon, slope and oscillation patterns: even if part of a pattern generated by the event falls out of the trend due to the lossy compression, the type of pattern can still be detected as long as the unaffected part of the pattern is long enough to satisfy the user-specified confidence level. In contrast, for Q1 and Q6, since the lossy compression smoothed the sensory data, the confidence level or degree of change for these two types of patterns was less likely to be satisfied. The poor precision of Q5 arose because, under the lossy data smoothing, a few transient increases in voice level during the same talk were wrongly regarded as the beginning of a new talk. Figure 12 shows the accuracy of Q1–Q6 when all queries were run concurrently over a synthetic dataset with the value of Lc varied from 100% to 20% of the sum of the ACSRs of all queries. The synthetic dataset was merged from the two datasets in Table 1. The pattern that corresponds to every occurrence of the five events in Table 2 was kept in the synthetic dataset, as were the random patterns generated for Q1. Unlike in Fig. 11, in Fig. 12 the category “100%” on the x-axis represents a single value for all of Q1–Q6: the sum of their ACSR values, 884 bytes. The other categories are treated likewise and denote values computed in proportion. In comparison with Fig. 11, we see that the accuracy of the queries in Fig. 12 improved, most notably for Q1, Q5 and Q6. This suggests that when multiple queries run simultaneously on the nodes, the cost-based policy can favorably balance the cache space requirements among these concurrent queries and improve their overall performance. As shown in Fig. 12, with concurrent query execution, Q1 still suffered from poor recall when the available cache space was extremely limited, but Q6 did not. The combined results of Figs. 11–12 tell us that our cost-based policy can ensure the accuracy of event detection for Q1–Q6 with as little as a few hundred bytes of cache space available. However, performance may degrade if the cache space is limited to only tens of bytes and/or cannot satisfy the storage requirement of every single query. Furthermore, the basic patterns we define were more immune to insufficient cache space than the general patterns. Consequently, the matching of these basic patterns is more applicable and desirable on resource-limited sensor nodes.
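The cache size limits used in Figs. 11 and 12 follow directly from the ACSR values quoted in the text. The Python sketch below recomputes them; the intermediate percentage steps (80%, 60%, 40%) and the rounding down to whole bytes are assumptions, since the text only states the range from 100% to 20%.

```python
# ACSR of each query in bytes, as quoted in the text for Q1-Q6.
ACSR_BYTES = {"Q1": 158, "Q2": 68, "Q3": 163, "Q4": 432, "Q5": 54, "Q6": 9}

# Assumed x-axis categories; only the end points 100% and 20% are stated.
PERCENTAGES = (100, 80, 60, 40, 20)


def cache_limit(acsr_bytes, percent):
    """Cache size limit Lc as a percentage of an ACSR, rounded down to bytes."""
    return acsr_bytes * percent // 100


# Individual runs (Fig. 11): one series of limits per query.
individual_limits = {
    query: [cache_limit(acsr, p) for p in PERCENTAGES]
    for query, acsr in ACSR_BYTES.items()
}

# Concurrent run (Fig. 12): a single series based on the sum of all ACSRs.
total_acsr = sum(ACSR_BYTES.values())
concurrent_limits = [cache_limit(total_acsr, p) for p in PERCENTAGES]
```

Under these assumptions the smallest individual limits fall to a few bytes or tens of bytes (for example, 20% of Q6's 9 bytes), while the smallest concurrent limit is still well over a hundred bytes. This is consistent with the observation that a few hundred bytes suffice but tens of bytes may not.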

Table 4 Computation costs of Q1–Q6 in the prototype

Query ID | Average Computation Overhead (seconds)
1 | 0.281
2 | 0.277
3 | 0.259
4 | 0.277
5 | 0.304
6 | 0.308

5.4 Prototype evaluation on real sensor motes

We have implemented a real-mote prototype of our approach in nesC using TinyOS [25] to verify and complement the simulation results. We evaluated the prototype on TelosB motes [20]. We have also successfully compiled and tested the prototype on MICA2 motes, which have a smaller RAM space of 4 KB. To evaluate the prototype using the datasets and query workload in Sect. 5.1, we cut the two datasets for each node into smaller pieces and loaded each piece into RAM in sequence. We made sure that each event pattern was completely contained in one piece of the dataset. The cache manager in the prototype read sensory data from these pieces in RAM rather than from the real sensors on the mote. We first used the prototype to test the computation efficiency of our approach on the real motes. When run over a dataset, a query in our workload invokes a pattern matching algorithm in each sample period of the method in the query. We call the running time of the algorithm on a node in a given period the instantaneous computation overhead of the query on the node in that period. We define the average computation overhead of the query on a node as the average of the instantaneous computation overheads over the entire time span of the dataset. Table 4 shows the average computation overheads for Q1–Q6 in the prototype. Each value in the table is the average of the metric values on all nodes in the dataset. The table shows that the computation overheads of Q1–Q6 were similar in the experiment, at roughly 0.3 seconds each. This overhead is small enough for real-time event detection with Q1–Q6, considering that the sample periods are on the order of seconds and the intervals of inspection are on the order of tens or hundreds of seconds for the relevant events. The overhead is thus acceptable for real-world sensor network applications. In comparison, a threshold-based query had an average computation overhead of 0.063 seconds in the prototype.
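The relation between the pattern-based overheads in Table 4 and the threshold-based overhead of 0.063 seconds can be checked with a few lines of Python. The variable names are illustrative; the numbers are those reported in the text.

```python
# Average computation overhead per query in seconds (Table 4).
OVERHEAD_SECONDS = {
    "Q1": 0.281,
    "Q2": 0.277,
    "Q3": 0.259,
    "Q4": 0.277,
    "Q5": 0.304,
    "Q6": 0.308,
}

# Average computation overhead of a threshold-based query, in seconds.
THRESHOLD_OVERHEAD_SECONDS = 0.063

# Ratio of each pattern-based overhead to the threshold-based overhead.
ratios = {
    query: seconds / THRESHOLD_OVERHEAD_SECONDS
    for query, seconds in OVERHEAD_SECONDS.items()
}
worst_ratio = max(ratios.values())
best_ratio = min(ratios.values())
```

The ratios lie between roughly 4.1 (Q3) and 4.9 (Q6), so every query stays below five times the threshold-based cost.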
This suggests that, with this query workload, the computation overhead of our pattern-based approach was at most 5 times that of the threshold-based approach on the motes. Such an overhead increase is reasonably small and sub-linear, since the patterns in the queries are much longer than a single threshold value. The result indicates the good computation efficiency of our approach on real sensor hardware. We have performed further experiments to measure the average energy consumption for Q1–Q6 in the prototype. We implemented a mechanism to put a node into

sleep mode for energy conservation when it has no operation to perform. The result was similar for all queries, at roughly 0.41 milli-joules per second. As for the mote lifetime, we found that without sleeping a TelosB mote consumed 66 milli-joules per second and could last 6 days on two new batteries (brand: Energizer). This is consistent with the result calculated in TinyDB [19]. Therefore, scaling in proportion to our average energy consumption value, we estimate that a mote running our prototype and query workload could last about 966 days. The result indicates that our approach is feasible for applications that require long-term sensor network deployment. As for the accuracy, the results for Q1–Q6 over the two datasets in the prototype were nearly the same as those shown in Fig. 7. We have also examined the runtime accuracy of our approach by matching patterns in the real-time sensor readings of each mote rather than in the loaded datasets. The event patterns to be detected in this experiment were generated manually on the fly: we used a piece of black paper to cover the light sensor on a mote to form a general, horizon, oscillation, jump or spike pattern individually, and a hair dryer to heat the temperature sensor for a slope pattern. Each of these real-time patterns contains 20 readings. In all experimental runs, each pattern matching algorithm correctly detected all occurrences of the corresponding event. This result demonstrates the practical applicability of our prototype.
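The lifetime estimate above is a simple proportional scaling, which the following Python sketch reproduces. It assumes that lifetime is inversely proportional to the average power draw and that the battery capacity is the same in both cases; the function name is illustrative.

```python
def estimated_lifetime_days(baseline_days, baseline_mj_per_s, measured_mj_per_s):
    """Scale a measured lifetime by the ratio of two average power draws.

    Assumes the same usable battery energy in both cases, so that
    lifetime * power is constant.
    """
    return baseline_days * baseline_mj_per_s / measured_mj_per_s


# Without sleeping: 66 mJ/s sustained for 6 days.
# With the prototype and query workload: roughly 0.41 mJ/s.
lifetime_days = estimated_lifetime_days(6, 66.0, 0.41)
```

With the reported figures this evaluates to a little under 966 days, which matches the estimate in the text after rounding.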

6 Related work

Event-based query processing in sensor networks has attracted many research efforts in the database community. Madden et al. [19] have studied continuous queries that are triggered by external events in their TinyDB in-network sensor query processor. An event in TinyDB is reported from a query, a program module in the operating system, or a stand-alone hardware component attached to a mote, such as a switch [19] or a motion detector [10]. Our work does not assume the existence of any hand-coded OS module or hardware equipment. We enable the flexible specification of events in SQL queries through the definitions of general patterns and five types of basic patterns. REED [1] is an extension of TinyDB. It implements efficient in-network algorithms for joining sensory data streams on the nodes with static tables stored at the base station. The authors showed that the multi-predicate filtration (join) queries supported by REED are useful for detecting events in industrial process control applications. Yang et al. [31] studied the efficient processing of self-join queries in sensor networks and used these queries for event detection. Nevertheless, these approaches are still threshold-based, as opposed to our pattern-based approach. Chu et al. [4] proposed to distribute a dynamic probabilistic model into a sensor network for energy-efficient data collection and anomalous event detection. That work lacks a systematic event specification like ours. Kotidis [15] used linear regression to cluster adjacent nodes that have correlated readings. The goal is to remove spatial data redundancy by electing a number of representative nodes that alone are involved in query processing. This work can be adapted to optimize our approach

by identifying redundant sensor nodes in event detection and stopping the pattern matching process on these nodes. Indexing [2] and normalization [13] are two common approaches in traditional pattern matching to tolerate the presence of noise, amplitude/longitudinal scaling or shifting. The heavy storage and maintenance cost of indices, as well as the simple query workload that current sensor query processing systems need to support, make indexing less preferable. The five types of basic patterns we define can tolerate imperfect user knowledge about the events. This has a similar effect to normalization. In sensor network applications, it may not be intuitive to ask a user to provide a normalized difference level for pattern matching [13]. Our basic patterns bear a similarity to shape-based subsequence matching in time series databases [3, 27]. In comparison, we have defined more versatile shapes in sensor readings that abstract real-world events. Moreover, we have included a number of parameters in the pattern definitions to deal with the special characteristics of sensory data, such as the confidence level for noise elimination and the error bound for smoothing local oscillation. There are traditional methods in signal processing that can be applied to process and compare sensor reading time series for event detection. Examples include cross correlation, Fourier transform, convolution and spectrum analysis [23]. We instead adopt a database-oriented approach to address the problem. We define our own patterns with high-level, user-specific semantics that can be naturally integrated with declarative SQL-based queries. In the networking and systems community, Directed Diffusion [11] recognizes the emergence of an animal near a node by matching sensor readings on the node with pre-stored libraries of waveforms. This target recognition mechanism gave the initial inspiration for our event detection approach.
However, in Directed Diffusion the authors focus on studying the in-network sensor data propagation algorithms, and no details of the matching process are given. Dutta et al. [7] designed a sensor node platform called XSM as a kind of hardware support for event detection in surveillance applications. Node reprogramming is required for both Directed Diffusion and XSM when the event to be detected changes. In contrast, we propose a declarative specification of our general and basic patterns in an SQL query interface. This gives users the flexibility of concurrently detecting multiple events and the convenience of on-the-fly node re-tasking. DSWare [17] is a middleware that provides group-based event detection services for sensor networks. The authors proposed to use a confidence function to identify the occurrence of a complex event composed of multiple basic events, but the details of the basic events are missing. Wittenburg et al. [26] proposed to use training-based pattern recognition for distributed event detection in sensor networks. Unlike that work, we take an approach based on pattern matching and provide a formal event specification integrated with declarative SQL queries. In our previous work [29], we used contour maps as a mathematical tool to model spatio-temporal patterns of events. We proposed an approach for event detection in sensor networks based on in-network contour mapping and server-side contour map matching. The work in this paper differs from our previous work in three aspects. First, we abstract events as temporal patterns on individual nodes in the network. In contrast, in our previous work we focused on studying the spatial maps generated

by an event and paid little attention to the temporal domain. The event specification and query processing framework proposed in this paper are entirely different from those in the previous work. Second, in our previous work the events are detected centrally at the base station by matching the contour maps built by in-network aggregation. In this work, we have distributed the pattern matching process onto the nodes to save communication cost. Third, our previous work requires the nodes to have relatively large computation and storage capabilities to perform the Boolean operations on polygons. In comparison, the computation and storage cost of the approach in this paper is lower, making it more applicable to resource-limited motes. Li et al. [18] studied non-threshold based event detection in sensor networks using 3D gradient data maps, which is similar to our previous work [29]. Both approaches divide the sensor network area into individual unit cells, apply a single linear regression model for in-network map construction, and propose an event specification based on the maps. The main difference is that in our work we used 2D contour maps rather than 3D ones. Moreover, in a more recent follow-up work, we have proposed another, more generic event detection approach [30]. This approach collects all required sensory data back to the server and uses multiple regression models to match various historical sensor data distributions generated by an event over spatial regions. The work in this paper completes our systematic study of alternative mechanisms for sensor network event detection. Each of our mechanisms is preferable under certain application requirements.

7 Conclusion

We have proposed a pattern-based approach to event detection in sensor networks. A pattern we use for event abstraction consists of a temporal sequence of readings of a sensor on a node. We define general and basic patterns via user-specified parameters and encapsulate pattern matching processes into Boolean methods in SQL queries. We design a cache manager on each node to maintain a sliding window of cache data for pattern matching. The manager combines PCA and PLR as a two-level data compression scheme for the caches. We have conducted simulation studies to evaluate the performance of our approach using patterns for real events from two case study datasets. The results show that our approach achieves good event detection accuracy even when the users are unable to specify perfect values for the pattern parameters. The cache data compression scheme reduces the storage cost of our approach. We have further evaluated our approach on real motes to demonstrate its computation efficiency and runtime accuracy.

Appendix

This appendix contains the pattern matching algorithms that are used in our event-driven query processing framework. Algorithms 1–4 are for the matching of a general, horizon, slope or oscillation pattern. Algorithm 5 is for both jump and spike pattern matching. For brevity, the input and output descriptions of Algorithms 2–5 are omitted due to their similarity to those of Algorithm 1.

Algorithm 1: Matching of a General Pattern
Input: a new reading r for a cache C sampled at time t, a method M that uses C and specifies a general pattern
Output: the new return value of M given r and current historical data in C

ts = t − M.T; /* [ts, te = t] is the interval to be inspected at this time */
M.value = false; M.timestamp = t; /* set the new return value and associated timestamp for M */
if ts < C.IT then return; /* not enough historical data */
s1 = findCacheItem(C, ts); /* find the item in C that covers ts */
s2 = M.Qcts.head; /* the first item in M.Qcts */
t1 = ts; t2 = 0; /* current timestamp markers for the two time series compared */
count = (M.T/M.sp + 1) ∗ (1 − M.(1 − α));
while true do /* match the query sequence Q with the cache data sequence by a simple point-to-point comparison */
  while t1 ≤ s1.t do
    if |getValue(s1, t1) − (s2.w0 + s2.w1 ∗ t2)| + s2.ε > M.Δ + M.ε then count−−; /* the error brought to each segment (s2.ε) when compressing Q into Qcts is considered */
    t1 += M.sp; t2 += M.sp;
    if t2 > s2.t then s2 = ∗(s2.next);
  if count < 0 then return;
  if s1 == C.tail then
    v = the value of r;
    if |v − (s2.w0 + s2.w1 ∗ t2)| + s2.ε > M.Δ + M.ε then count−−;
    if count ≥ 0 then M.value = true; /* a pattern is found */
    break; /* exit the loop */
  else s1 = ∗(s1.next);
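The core idea of Algorithm 1, a point-to-point comparison with a budget of tolerated mismatches, can be illustrated with the simplified Python sketch below. It is not the on-node algorithm: it assumes that both the query sequence and the window are available as plain, uncompressed lists of equal length, it omits the per-segment compression error s2.ε, and the parameter names (delta for the allowed difference level, eps for the error bound, confidence for 1 − α) are chosen for this sketch only.

```python
def matches_general_pattern(window, query, delta, eps, confidence):
    """Point-to-point comparison of a data window against a query sequence.

    A reading is a mismatch if it differs from the query value by more than
    delta + eps. The pattern matches if the number of mismatches does not
    exceed len(query) * (1 - confidence).
    """
    if len(window) != len(query):
        raise ValueError("window and query must have the same length")
    # Small tolerance so that floating-point rounding of (1 - confidence)
    # does not reject a window that is exactly on the boundary.
    allowed = len(query) * (1.0 - confidence) + 1e-9
    mismatches = 0
    for observed, expected in zip(window, query):
        if abs(observed - expected) > delta + eps:
            mismatches += 1
            if mismatches > allowed:
                return False  # early exit, like "if count < 0 then return"
    return True


# Made-up query sequence and three windows derived from it.
query = [100, 120, 140, 160, 180, 180, 160, 140, 120, 100]
noisy = [q + 3 for q in query]          # small noise everywhere
two_outliers = list(noisy)
two_outliers[2] += 50
two_outliers[7] -= 50
three_outliers = list(two_outliers)
three_outliers[0] += 50
```

With a confidence level of 0.8 and ten readings, up to two readings may fall outside the tolerance band, so the window with two outliers still matches while the one with three does not.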

Algorithm 2: Matching of a Horizon Pattern

ts = t − M.T; M.value = false; M.timestamp = t;
if ts < C.IT or t − M.lastEventTime ≤ M.T then return; /* lastEventTime indicates when the last match was achieved for M. It is used to avoid disturbing the user with repetitive actions upon the same event instance. */
count = (M.T/M.sp + 1) ∗ (1 − M.(1 − α)); s = findCacheItem(C, ts); t1 = ts;
computeMean(r, C, M, ts); /* incrementally compute the mean of the cached data sequence in [ts, te] */
while true do
  if s is an RP or a PPFP and |s.v − M.mean| > M.ε then count−−;
  if s is a CAP and M.mean ∈ [s.min, s.max] then count −= (s.t − t1)/M.sp + 1;
  if s is an LRP or a CLRP then
    while t1 ≤ s.t do
      if |getValue(s, t1) − M.mean| > M.ε/2 then count−−; /* average-case estimation */
      t1 += M.sp;
  if count < 0 then return;
  if s == C.tail then M.value = true; break; /* a pattern is found */
  s = ∗(s.next); t1 = getStartTimestamp(C, s);
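The horizon test can likewise be illustrated on uncompressed data. The following Python sketch is a simplification under stated assumptions: every reading in the window is treated as a raw point (the RP case), so the handling of CAP, LRP and CLRP cache items and the lastEventTime check are left out, and the names eps and confidence stand for ε and 1 − α.

```python
def matches_horizon_pattern(window, eps, confidence):
    """Check whether a window of readings stays close to its own mean.

    A reading is a mismatch if it deviates from the window mean by more
    than eps. The pattern matches if the number of mismatches does not
    exceed len(window) * (1 - confidence).
    """
    if not window:
        raise ValueError("window must not be empty")
    mean = sum(window) / len(window)
    # Small tolerance against floating-point rounding of (1 - confidence).
    allowed = len(window) * (1.0 - confidence) + 1e-9
    mismatches = sum(1 for value in window if abs(value - mean) > eps)
    return mismatches <= allowed


# Made-up windows: a steady signal and a ramp, both in ADC counts.
steady = [500, 502, 499, 501, 500, 498, 503, 500, 501, 499]
ramp = list(range(0, 100, 10))
```

The steady window stays within 5 ADC counts of its mean at every point, whereas most readings of the ramp lie far from the ramp's mean, so only the former qualifies as a horizon pattern.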

Algorithm 3: Matching of a Slope Pattern

ts = t − M.T; M.value = false; M.timestamp = t;
if ts < C.IT or t − M.lastEventTime ≤ M.T then return;
if M.S == null then /* M.S is a value array stored in M. It is a post-processed snapshot of cache data that falls into [ts, te]. */
  S = constructValueArray(C, M, ts); M.headEvicted = false; /* initial construction of the array in linear time */
else /* incremental update of the array in constant time */
  S = M.S;
  if not M.headEvicted then remove S[0]; /* remove expired data at the head of S */
  extract values v1, v2, v3 from C at time ts − M.sp, ts, ts + M.sp;
  if M.mode == 0 and v2 < v1 and v2 < v3 or M.mode == 1 and v2 > v1 and v2 > v3 then add v2 to the head of S; /* M.mode is the change mode of M: 0 for increasing and 1 for decreasing */
  v = the value of r; add v to the tail of S;
len = M.T/M.sp + 1; n = S.length; marked = false;
while n ≥ len ∗ (M.(1 − α1)/2) do /* worst-case estimation */
  for i = 0 to n − 2 do
    if M.mode == 0 and S[i + 1] + M.ε/2 < S[i] or M.mode == 1 and S[i + 1] > S[i] + M.ε/2 then
      mark both S[i] and S[i + 1]; marked = true;
      if i == 0 then M.headEvicted = true;
  if not marked then break;
  remove all marked entries in S; n = S.length; marked = false;
if n < len ∗ (M.(1 − α1)/2) then /* over-balancing to deal with the potential error removal above */
  if S.length == 0 then M.S = null; else M.S = S;
  return;
while n ≥ len ∗ M.(1 − α2) do
  for i = 0 to n − 2 do
    if |S[i + 1] − S[i]| > M.ε/2 then
      mark S[i]; marked = true;
      if i == 0 then M.headEvicted = true;
  if not marked then break;
  remove all marked entries in S; n = S.length; marked = false;
if S.length == 0 then M.S = null; else M.S = S;
if S.length ≥ len ∗ M.(1 − α2) and |S[n − 1] − S[0]| ≥ M.Δ then M.S = null; M.value = true; /* a pattern is found */

Algorithm 4: Matching of an Oscillation Pattern

ts = t − M.T; M.value = false; M.timestamp = t;
if ts < C.IT or t − M.lastEventTime ≤ M.T then return;
count = (M.T/M.sp + 1) ∗ M.ρ ∗ (2 − M.(1 − α)); computeMean(r, C, M, ts);
s = findCacheItem(C, ts); t1 = ts; trend = ‘U’; /* ‘U’ denotes “unknown” */
while true do
  trend_org = trend; /* backup the original change trend */
  while t1 ≤ s.t do
    trend = testTrend(M.mean, getValue(s, t1)); /* testTrend(mean, v) returns the relative scale of a value v to an average level mean. The scale can be ‘L’, ‘S’ or ‘E’, suggesting that v is larger than, smaller than or equal to mean. */
    t1 += M.sp;
    if trend == trend_org then count−−;
    else count = (M.T/M.sp + 1) ∗ M.ρ ∗ (2 − M.(1 − α)) − 1; trend_org = trend;
    if count < 0 then return;
  if s == C.tail then
    v = the value of r; trend_org = trend; trend = testTrend(M.mean, v);
    if trend == trend_org then count−−;
    else count = (M.T/M.sp + 1) ∗ M.ρ ∗ (2 − M.(1 − α)) − 1;
    if count > 0 then M.value = true; /* a pattern is found */
    break;
  s = ∗(s.next); t1 = getStartTimestamp(C, s);

Algorithm 5: Matching of a Jump or a Spike Pattern

checkPPFP(r, C, M); ts = t − M.T2; M.value = false; M.timestamp = t;
if ts < C.IT then return;
s = findCacheItem(C, ts); if s is not a PPFP for M then return;
if M specifies a jump pattern then
  v1 = getValue(∗(s.prev), ∗(s.prev).t); M.mode = !M.mode; /* invert the change mode of M */
else v1 = getValue(s, s.t); /* M specifies a spike pattern */
v2 = the value of r;
count = checkLevelChange(C, M, ts, M.T2/M.sp + 1, v1); /* examine the posterior level change when the last reading in the pattern sequence arrives */
if M specifies a jump pattern then M.mode = !M.mode; /* restore the change mode of M */
else switch v1 and v2;
if M.mode == 0 and v2 − v1 < M.Δ2 or M.mode == 1 and v2 − v1 > −M.Δ2 then count−−;
if count ≥ 0 then M.value = true; /* a pattern is found */

Function: checkPPFP(r, C, M) /* check whether r is a PPFP for M */

te = t − M.sp; ts = te − M.T1; if ts < C.IT then return;
v1 = getValue(C.tail, C.tail.t); v2 = the value of r;
if M.mode == 0 and v2 − v1 ≥ M.Δ1 or M.mode == 1 and v2 − v1 ≤ −M.Δ1 then
  if checkLevelChange(C, M, ts, M.T1/M.sp + 1, v2) ≥ 0 then mark r as a PPFP for M; /* examine the anterior level change immediately after the detection of the instantaneous change */

Function: checkLevelChange(C, M, ts, n, v) /* check whether v forms a level change for M with respect to the data sequence in C from ts */

count = n ∗ (1 − M.(1 − α)); s = findCacheItem(C, ts); t1 = ts;
while true do
  while t1 ≤ s.t do
    v′ = getValue(s, t1); t1 += M.sp;
    if M.mode == 0 and v − v′ < M.Δ2 or M.mode == 1 and v − v′ > −M.Δ2 then count−−;
  if count < 0 or s == C.tail then return count;
  s = ∗(s.next); t1 = getStartTimestamp(C, s);

References

1. Abadi, D., Madden, S., Lindner, W.: REED: robust, efficient filtering and event detection in sensor networks. In: Proc. 31st International Conference on Very Large Data Bases (VLDB), pp. 769–780 (2005)
2. Agrawal, R., Lin, K., Sawhney, H., Shim, K.: Fast similarity search in the presence of noise, scaling, and translation in time-series databases. In: Proc. 21st International Conference on Very Large Data Bases (VLDB), pp. 490–501 (1995)
3. Agrawal, R., Psaila, G., Wimmers, E., Zaït, M.: Querying shapes of histories. In: Proc. 21st International Conference on Very Large Data Bases (VLDB), pp. 502–514 (1995)
4. Chu, D., Deshpande, A., Hellerstein, J., Hong, W.: Approximate data collection in sensor networks using probabilistic models. In: Proc. 22nd International Conference on Data Engineering (ICDE), p. 48 (2006)
5. Deligiannakis, A., Kotidis, Y., Roussopoulos, N.: Compressing historical information in sensor networks. In: Proc. 2004 ACM SIGMOD International Conference on Management of Data, pp. 527–538 (2004)
6. Deshpande, A., Guestrin, C., Madden, S.: Model-driven data acquisition in sensor networks. In: Proc. 30th International Conference on Very Large Data Bases (VLDB), pp. 588–599 (2004)
7. Dutta, P., Grimmer, M., Arora, A., Bibyk, S., Culler, D.: Design of a wireless sensor network platform for detecting rare, random, and ephemeral events. In: Proc. 4th International Conference on Information Processing in Sensor Networks (IPSN), pp. 293–300 (2005)
8. Guestrin, C., Bodik, P., Thibaux, R., Paskin, M., Madden, S.: Distributed regression: an efficient framework for modeling sensor network data. In: Proc. 3rd International Conference on Information Processing in Sensor Networks (IPSN), pp. 1–10 (2004)

9. Guralnik, V., Srivastava, J.: Event detection from time series data. In: Proc. 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 33–42 (1999)
10. Hellerstein, J., Hong, W., Madden, S., Stanek, K.: Beyond average: towards sophisticated sensing with queries. In: Proc. 2nd International Conference on Information Processing in Sensor Networks (IPSN), pp. 63–79 (2003)
11. Intanagonwiwat, C., Govindan, R., Estrin, D.: Directed diffusion: a scalable and robust communication paradigm for sensor networks. In: Proc. 6th International Conference on Mobile Computing and Networking (MOBICOM), pp. 56–67 (2000)
12. Jin, G., Nittel, S.: Efficient tracking of 2D objects with spatiotemporal properties in wireless sensor networks. Distrib. Parallel Databases 29(1–2), 3–30 (2011)
13. Keogh, E.: Fast similarity search in the presence of longitudinal scaling in time series databases. In: Proc. 9th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 578–584 (1997)
14. Keogh, E., Chu, S., Hart, D., Pazzani, M.: An online algorithm for segmenting time series. In: Proc. 1st International Conference on Data Mining (ICDM), pp. 289–296 (2001)
15. Kotidis, Y.: Snapshot queries: towards data-centric sensor networks. In: Proc. 21st International Conference on Data Engineering (ICDE), pp. 131–142 (2005)
16. Lazaridis, I., Mehrotra, S.: Capturing sensor-generated time series with quality guarantees. In: Proc. 19th International Conference on Data Engineering (ICDE), pp. 429–440 (2003)
17. Li, S., Lin, Y., Son, S., Stankovic, J., Wei, Y.: Event detection services using data service middleware in distributed sensor networks. Telecommun. Syst. 26(2–4), 351–368 (2004)
18. Li, M., Liu, Y., Chen, L.: Non-threshold based event detection for 3D environment monitoring in sensor networks. In: Proc. 27th International Conference on Distributed Computing Systems (ICDCS), p. 9 (2007)
19. Madden, S., Franklin, M., Hellerstein, J., Hong, W.: TinyDB: an acquisitional query processing system for sensor networks. ACM Trans. Database Syst. 30(1), 122–173 (2005)
20. MEMSIC, Inc.: http://www.memsic.com (2011)
21. Papadimitriou, S., Brockwell, A., Faloutsos, C.: Adaptive, hands-off stream mining. In: Proc. 29th International Conference on Very Large Data Bases (VLDB), pp. 560–571 (2003)
22. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1986)
23. Stranneby, D., Walker, W.: Digital Signal Processing and Applications, 2nd edn. Elsevier, Amsterdam (2004)
24. Szewczyk, R., Polastre, J., Mainwaring, A., Culler, D.: Lessons from a sensor network expedition. In: Proc. 1st European Conference on Wireless Sensor Networks (EWSN), pp. 307–322 (2004)
25. TinyOS: http://www.tinyos.net (2011)
26. Wittenburg, G., Dziengel, N., Wartenburger, C., Schiller, J.: A system for distributed event detection in wireless sensor networks. In: Proc. 9th International Conference on Information Processing in Sensor Networks (IPSN), pp. 94–104 (2010)
27. Wu, H., Salzberg, B., Zhang, D.: Online event-driven subsequence matching over financial data streams. In: Proc. 2004 ACM SIGMOD International Conference on Management of Data, pp. 23–34 (2004)
28. Xue, W., He, B., Wu, H., Luo, Q.: The HKUST frog pond—a case study of sensory data analysis. In: Proc. 1st IFIP International Conference on Network and Parallel Computing (NPC), pp. 551–558 (2004)
29. Xue, W., Luo, Q., Chen, L., Liu, Y.: Contour map matching for event detection in sensor networks. In: Proc. 2006 ACM SIGMOD International Conference on Management of Data, pp. 145–156 (2006)
30. Xue, W., Luo, Q., Pung, H.K.: Modeling and detecting events for sensor networks. Inf. Fusion 12(3), 176–186 (2011)
31. Yang, X., Lim, H.B., Özsu, M.T., Tan, K.L.: In-network execution of monitoring queries in sensor networks. In: Proc. 2007 ACM SIGMOD International Conference on Management of Data, pp. 521–532 (2007)
32. Yao, Y., Gehrke, J.: Query processing for sensor networks. In: Proc. 1st Biennial Conference on Innovative Data Systems Research (CIDR) (2003)
