IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 19, NO. 7, JULY 2008
A Novel Algorithm for Mining Association Rules in Wireless Ad Hoc Sensor Networks

Azzedine Boukerche and Samer Samarah

Abstract—In this paper, we propose a comprehensive framework for mining Wireless Ad Hoc Sensor Networks (WASNs), which is able to extract patterns regarding the sensors’ behaviors. The main goal of determining behavioral patterns is to use them to generate rules that will improve the WASN’s Quality of Service by participating in the resource management process or compensating for the undesired side effects of wireless communication. The proposed framework consists of 1) a formal definition of sensor behavioral patterns and sensor association rules, 2) a novel representation structure that we refer to as the Positional Lexicographic Tree (PLT) that is able to compress the data gathered for the mining process and thus allows the fast and efficient mining of sensor behavioral patterns, and 3) a distributed data extraction mechanism to prepare the data required for mining sensor behavioral patterns. Several experimental studies have been conducted to evaluate our PLT structure and our proposed data extraction algorithms for mining wireless sensor networks.

Index Terms—Wireless sensor networks, distributed systems, distributed data mining.
The authors are with the School of Information Technology and Engineering (SITE) and the PARADISE Research Laboratory, University of Ottawa, 800 King Edward Ave., Ottawa, Ontario, K1N 6N5, Canada. E-mail: {boukerch, ssamarah}@site.uottawa.ca.
Manuscript received 21 Dec. 2006; revised 13 July 2007; accepted 24 July 2007; published online 12 Oct. 2007.
Recommended for acceptance by S. Olariu.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPDS-0412-1206.
Digital Object Identifier no. 10.1109/TPDS.2007.70789.
1045-9219/08/$25.00 © 2008 IEEE

1 INTRODUCTION

ADVANCES in wireless technologies have led to the development of sensor nodes that are capable of sensing, processing, and transmitting. This new trend in sensor technology allows the design of Wireless Ad Hoc Sensor Networks (WASNs) that consist of several sensor nodes, whose main functions are sensing the area surrounding them and sending detected events to a well-equipped node, called the sink, in multihop fashion. The detected events are transmitted to the sink periodically or based on whether or not they meet a particular predicate [2], [22], [23]. WASNs have proven their success in a variety of applications, especially those that require the fine-grained monitoring of physical environments that are subject to critical conditions such as fire, toxic gas leaks, and explosions. These kinds of applications introduce new challenges for WASN developers. In order to guarantee an acceptable level of quality for event delivery, a new class of fast, reliable, and fault-tolerant protocols for WASNs needs to be developed. However, the distributed nature and the limited resources of sensor nodes, as well as the unreliability of wireless communication, cause delays and losses of transmitted events, which can have devastating effects on the overall quality of WASNs [3], [22]. Several techniques have been proposed in the literature to enhance the performance of WASNs, such as clustering, aggregation, and data fusion, to mention just a few examples [1].

In this paper, we introduce a data mining solution to extract behavioral patterns from WASNs to formulate what we call sensor association rules. The main objective of the sensor association rules is to capture the temporal relations between sensor nodes based on common intervals of activities. An example of such a rule is $(s_1 s_2 \Rightarrow s_3, 90\%, \lambda)$, which means that if we receive events from sensors $s_1$ and $s_2$, then there is a 90 percent chance of receiving an event from sensor $s_3$ within $\lambda$ units of time. The main step in the formation of association rules is to find the patterns of sensors that co-occur and whose frequency exceeds a certain threshold (these patterns are called frequent association patterns). For instance, in our example, the rule $(s_1 s_2 \Rightarrow s_3)$ is generated from the pattern $(s_1 s_2 s_3)$.

Two major impacts of sensor association rules that benefit many applications are the ability to predict the source of future events and the ability to identify sets of temporally correlated sensors. These impacts can be used to enhance the performance of WASNs by participating in the resource management process of sensor nodes in order to cope with the sensors’ limitations and reduce the undesired effects of wireless communication, thereby improving the Quality of Service of WASNs. Predicting the sources of future events can also be helpful in a variety of applications, such as predicting faulty nodes (for example, we expect to receive an event from a certain node, and it does not occur), or it may be used to identify the source of the next event in the emergency preparedness class of applications. Identifying correlated sensors can also be helpful in compensating for the undesirable effects of unreliable wireless communication, such as missed readings, and in the resource management process, such as deciding which nodes can safely be switched to a sleep mode without affecting the coverage of the network.
There are several challenges in mining sensor association rules. The first challenge is to find a formal definition for sensor behavioral patterns and rules. The second challenge is the need to design a data extraction mechanism that is able to collect data regarding sensors’ behavior from sensor nodes while also taking into consideration the limited resources of sensor nodes, especially their energy. The third challenge is the need for an efficient data structure that is able to compress the data collected from sensor networks and efficiently allow mining the required patterns, which may require exploring an exponential search space. For example, if we have a set of $n$ sensors, then up to $2^n$ patterns between sensors may have to be checked against the collected data to determine the frequency of each pattern.

The contributions of this paper can be summarized as follows: First, we provide a reformulation of the association rule mining problem, a well-known data mining technique, that makes it applicable to sensors’ behavioral data. Second, we present data extraction methodologies for extracting the data required for mining sensor association rules. Third, we propose an efficient data mining algorithm for generating sensors’ behavioral patterns, which uses the Positional Lexicographic Tree (PLT), a new representation structure that is able to compress the sensors’ behavioral data extracted from the sensor nodes.

The remainder of this paper is organized as follows: Section 2 reviews related work regarding the algorithms proposed to solve the association rule problem and the algorithms that apply data mining to sensor data. Section 3 presents our framework for mining sensor association rules. Section 4 provides a performance study for our proposed schemes. Section 5 concludes this paper.

Published by the IEEE Computer Society
Authorized licensed use limited to: Feng Chia University. Downloaded on November 11, 2008 at 02:08 from IEEE Xplore. Restrictions apply.
2 RELATED WORK
In this section, we will review some of the work that has been proposed for applying data mining to sensor data. In addition, we will highlight the main techniques that have been introduced for generating association rules.

Loo et al. [17] have studied the problem of mining the associations that exist between sensor values in a stream of data reported from a wireless sensor network. They proposed a data model that stores the data and presents them in a way that makes it possible to adapt the lossy counting algorithm [26], which performs an online one-pass analysis of the data. In this data model, sensors are assumed to take values from a finite set of discrete values, whereas a quantization method is applied to continuous values. The time is divided into equal-sized intervals, and a snapshot of the sensor readings is taken whenever there is an update on a sensor reading. These snapshots form the contexts of the database. Although taking snapshots at state changes will reduce the redundancy in the data, these snapshots occur randomly; thus, each context is associated with a weight value that indicates for how many intervals this reading is valid (that is, for how long these readings will be kept unchanged). The support of the pattern is defined by the total length of nonoverlapping intervals in which the pattern is valid.

Mining spatial temporal event patterns, proposed by Römer [18], is another attempt to link the problem of mining sensor data to the association rule mining problem. Römer’s approach takes into consideration the distributed nature of wireless sensor networks and proposes an in-network data mining technique to discover frequent patterns of events with certain spatial and temporal properties. In this approach, each sensor should be aware of the events that are within a certain distance from itself (this distance may be a Euclidean distance or a number of hops). The mining parameters include a minimum support S, a minimum confidence C, a maximum scope, and a maximum history. Each node in the network collects the events from its neighbors within the maximum scope and keeps a history of their events for a duration of the maximum history. After that, each node applies a mining algorithm to discover the frequent patterns (those whose frequency exceeds the given minimum support).

Halatchev and Gruenwald [19] propose an association rule mining framework to tolerate the missed readings that result from the loss and corruption of messages while they are routed from sensor nodes to the processing points. Sensor readings are streaming in nature; hence, an association mining algorithm such as Apriori [5] cannot be applied directly to the stream of data. This situation led the authors to propose the Data Stream Association Rule Mining (DSARM) framework, which adapts the Apriori algorithm to make it applicable to the data stream received from sensor nodes. Several modifications have been made to the Apriori scheme to adapt it to sensor streams. First, rules are generated between pairs of sensors instead of generating all of the possible rules. Second, the association between pairs of sensors is evaluated with respect to a particular state of the sensors, and this modification leads to rules of the form $(s_1 \Rightarrow s_2 / st)$, which means that $s_1$ determines $s_2$ with respect to state $st$.
Finally, the sliding window technique is implemented to generate the association between sensors within the given window size.

To the best of our knowledge, few studies have addressed the problem of extracting data from wireless sensor networks for mining patterns regarding the sensor nodes themselves. Existing attempts have focused on extracting patterns regarding the phenomenon monitored by the sensor nodes, in which the mining techniques are applied to the sensed data received from the sensor nodes and accumulated at a central database. In our work, we propose a solution to extract the behavioral data required for mining patterns regarding the behavior of the sensor nodes in the network (that is, the data used in the mining process is metadata describing the nodes’ activities, and it differs from the sensed data). A primary assumption of the proposed data extraction mechanism is that a flash memory device is attached to each sensor to store the metadata about the sensor’s behavior that will be used during the extraction process. Several researchers have studied the cost of attaching a storage device to each sensor. In [20], Mathur et al. have shown that current flash memories offer a low-cost, high-capacity, energy-efficient storage solution, especially when compared with the transmission of the data. Peter et al. [21] show how sensors with storage could be an ideal solution for a class of applications in which historical data is needed.

Several algorithms in the literature have been proposed to tackle the problem of mining frequent patterns from large databases. These algorithms differ mainly in the way they represent the database and in the way they generate the frequent patterns. They can be classified into two main approaches: the candidate generation approach [5] and the pattern growth approach [14]. Within these approaches, the algorithms also differ in how they represent the database. The two most popular formats are 1) the vertical layout, in which each object is associated with the list of context identifiers where it occurred, and 2) the horizontal layout, in which each context identifier is associated with the list of objects.

The candidate generation approach enumerates the frequent patterns gradually, with several scans of the database. In each iteration, patterns found to be frequent are used to generate the candidates (possible frequent patterns) to be counted in the next iteration. Within this approach, several schemes have been developed, such as the AIS algorithm [4], Apriori, AprioriTid, AprioriHybrid [5], Direct Hashing and Pruning (DHP) [6], the Partition algorithm [8], and Dynamic Itemset Counting (DIC) [7]. The most popular algorithm among the candidate generation approaches is the Apriori scheme. All other approaches, except for AIS, are basically optimized versions of the Apriori scheme. Recall that in the Apriori algorithm, a database scan is conducted in order to determine the set of frequent one-element patterns. From this set, it generates a set of candidates to be counted in the next step. In addition, it prunes the set of candidates by eliminating the candidates that have at least one infrequent subset. This process is repeated a number of times that is equal to the size of the largest frequent pattern.
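The level-wise loop of the candidate generation approach described above can be illustrated with a minimal Python sketch. The function name `apriori`, the representation of the database as a list of sets, and the toy data are assumptions made for this illustration; they are not taken from [5], and the optimizations of the published algorithm are omitted.

```python
from itertools import combinations


def apriori(database, min_sup):
    """Return {pattern (frozenset): frequency} for all frequent patterns."""
    # First scan: count the one-element patterns.
    counts = {}
    for context in database:
        for item in context:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {p: c for p, c in counts.items() if c >= min_sup}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unite pairs of frequent (k-1)-patterns into k-patterns.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # One database scan to count the surviving candidates.
        counts = {c: sum(1 for ctx in database if c <= ctx) for c in candidates}
        frequent = {p: n for p, n in counts.items() if n >= min_sup}
        result.update(frequent)
        k += 1
    return result


# Hypothetical toy database of five contexts.
toy_db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
patterns = apriori(toy_db, min_sup=3)
```

In this toy run, the three single elements and the three pairs are frequent, whereas the candidate `{a, b, c}` survives pruning but is rejected after counting because it occurs in only two contexts.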
The pattern growth approach [14] tries to avoid the large number of candidates generated in each pass and to overcome the repeated scans of the database, thereby enabling most of the algorithms in this approach to outperform the candidate-generation-based algorithms. The Frequent Pattern Growth (FP-Growth) method proposed by Han et al. is the core algorithm of the pattern growth approach [14]. In this method, the database is converted into a compact representation in the form of a tree, called the Frequent Pattern tree (FP-tree), which is much smaller in size than the original database. The FP-tree is constructed in such a way that all relevant information needed in the mining process is present in the tree structure. Note that building the tree structure requires only two scans of the database. After building the FP-tree, the FP-Growth routine mines all the frequent patterns from the tree structure without referring to the original database and without generating candidates. Only one pattern is considered at a time, and a new tree, which is referred to as a conditional structure of the pattern, is constructed from the set of frequent patterns that occur with it in the same context. This process is repeated recursively until all the frequent patterns are generated.

Although the FP-Growth method has proven its efficiency in comparison to the candidate generation approach, it has been shown in [13] that this method is not suitable for all kinds of data. In particular, when the database is sparse, the resulting FP-tree is very large and leads to a significant overhead when traversing it during the mining process. This leads to the need for a large amount of space in the recursive process, which will prevent the FP-Growth method from scaling well for large amounts of data. Several algorithms that follow the pattern growth approach have been implemented in the literature, such as H-Mine [13], FP-growth* [11], COFI-tree [12], ITL-tree [16], and CT-ITL [25].
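To make the two-scan construction of the FP-tree more concrete, the following Python sketch builds only the prefix tree; the header table, the node links, and the recursive FP-Growth mining routine of [14] are deliberately left out. The names `FPNode`, `build_fp_tree`, and `count_nodes`, as well as the toy database, are hypothetical and chosen for this illustration.

```python
from collections import Counter


class FPNode:
    """One node of the prefix tree: an item, its count, and its children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(database, min_sup):
    # Scan 1: frequency of every single item.
    freq = Counter(item for context in database for item in set(context))
    frequent = {i for i, c in freq.items() if c >= min_sup}

    root = FPNode(None)
    # Scan 2: insert each context, keeping only frequent items, ordered by
    # descending frequency (ties broken by name) so that prefixes are shared.
    for context in database:
        items = sorted((i for i in set(context) if i in frequent),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root


def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(c) for c in node.children.values())


# Hypothetical toy database: 12 item occurrences in five contexts.
toy_db = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
tree = build_fp_tree(toy_db, min_sup=3)
```

On this dense toy database, the 12 item occurrences are stored in 6 tree nodes because contexts share prefixes. When contexts share few prefixes, as in sparse data, the tree gains little compression, which is consistent with the limitation discussed above.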
3 SENSOR ASSOCIATION RULES MINING FRAMEWORK
In this section, we will explain in detail the main steps for mining sensor association rules. We will start by providing a formal definition for the sensor association rules and then illustrate two different extraction mechanisms that we used for extracting the data needed for mining these rules. Finally, we will introduce the PLT, our proposed structure for storing and mining the extracted data.
3.1 The Sensor Association Rules Mining Problem

Our definition of the problem of mining sensor association rules is based upon the definition of association rules proposed in the domain of transactional databases [5]. However, not much work has been done on how we can define association rules for wireless sensor networks, in which the sensors themselves are the main objects in the extracted rules, regardless of their values. Note that we are interested in whether a sensor detects an event, not in the value of the event.

Let $S = \{s_1, s_2, \ldots, s_m\}$ be a set of sensors in a particular sensor network. We assume that the time is divided into equal-sized slots $\{t_1, t_2, \ldots, t_n\}$ such that $t_{i+1} - t_i = \lambda$ for all $1 \le i < n$, where $\lambda$ is the size of each time slot, and $T_{his} = t_n - t_1$ represents the historical period of the behavioral data defined during the data extraction process. We also refer to $P = \{s_1, s_2, \ldots, s_k\} \subseteq S$ as a pattern of sensors.

Definition 1.1. A sensor database $DS$, the behavioral data, is defined to be a set of epochs in which each epoch is a couple $E(E_{ts}, P)$, where $P$ is a pattern of sensors that report events within the same time slot and $E_{ts}$ is the epoch’s time slot.

Definition 1.2. Let $P_1$ be a pattern of sensor nodes such that $P_1 \subseteq S$. We say that an epoch $E(E_{ts}, P)$ supports $P_1$ if $P_1 \subseteq P$.

Definition 1.3. The frequency of the pattern $P_1$ in $DS$ is defined to be the number of epochs in $DS$ that support it:
$\mathrm{Freq}(P_1, DS) = |\{E(E_{ts}, P) \mid P_1 \subseteq P\}|$.

Definition 1.4. The pattern is said to be frequent if its frequency is greater than or equal to the given minimum support.

Definition 1.5. Sensor association rules are defined in the form of $P' \Rightarrow P''$, where $P' \subset S$, $P'' \subset S$, and $P' \cap P'' = \emptyset$.

Definition 1.6. The frequency of the rule $(P' \Rightarrow P'')$ represents the frequency of the pattern $(P' \cup P'')$ in $DS$, whereas the confidence of the rule is defined as follows:
$\mathrm{Conf}(P' \Rightarrow P'') = \mathrm{Freq}(P' \cup P'', DS) / \mathrm{Freq}(P', DS)$.
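Definitions 1.3 and 1.6 translate almost directly into code. The sketch below is an illustration under stated assumptions: the function names `freq` and `conf` and the representation of an epoch as a `(slot, frozenset)` pair are ours, and the seven toy epochs are the ones implied by the messages of sensors s1 to s4 in the worked example of Section 3.2 (sensors s5 and s6 are left out because their slots are not given).

```python
def freq(pattern, ds):
    """Definition 1.3: number of epochs in ds that support the pattern."""
    pattern = frozenset(pattern)
    return sum(1 for _slot, sensors in ds if pattern <= sensors)


def conf(antecedent, consequent, ds):
    """Definition 1.6: Freq(P' ∪ P'') / Freq(P')."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    if antecedent & consequent:
        raise ValueError("antecedent and consequent must be disjoint")
    base = freq(antecedent, ds)
    # A rule whose antecedent never occurs has no evidence; report 0.0.
    return freq(antecedent | consequent, ds) / base if base else 0.0


# Toy sensor database: one (time slot, pattern) couple per epoch.
DS = [
    (1, frozenset({"s1", "s2", "s3"})),
    (2, frozenset({"s1", "s2", "s3"})),
    (3, frozenset({"s1", "s2", "s3", "s4"})),
    (4, frozenset({"s1", "s2", "s3", "s4"})),
    (5, frozenset({"s2", "s3"})),
    (6, frozenset({"s1", "s3"})),
    (7, frozenset({"s1", "s2"})),
]

support_s1s2 = freq({"s1", "s2"}, DS)             # epochs 1, 2, 3, 4, 7
confidence = conf({"s1", "s2"}, {"s3"}, DS)       # 4 of those 5 contain s3
```

Here the pattern (s1 s2) is supported by five epochs and (s1 s2 s3) by four, so the rule (s1 s2 ⇒ s3) has frequency 4 and confidence 4/5.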
Fig. 2. Detected events for a historical period of 70 minutes.
Fig. 1. Network architecture.
We say that a rule is of interest to the targeted application if its frequency and confidence are greater than or equal to a given minimum support min_sup and minimum confidence percentage min_conf, respectively. Note that frequency and support are used interchangeably and that min_sup represents the minimum number of epochs that the frequency of a rule should satisfy. Given a database of epochs generated at a particular time slot size and historical period, as well as a minimum support and a minimum confidence, the problem of mining sensor association rules is to generate all the of-interest rules present in the behavioral data.

Mining association rules can be decomposed into two steps [4]: generating the frequent patterns (that is, those that have frequency $\ge$ min_sup) and generating the rules that satisfy the min_conf restriction. Once the frequent patterns are known, generating the sensor association rules is straightforward and does not take a long runtime. The main challenges of mining these rules can thus be summarized as follows:

1. how the data required for the mining process can efficiently be extracted and
2. how the patterns that meet the given minimum support can efficiently be generated.

Extracting the data required for mining association rules involves interaction between sensor nodes and the sink, which is the main focus of the next section. The process for generating the frequent sensor patterns will be described in Section 3.3.
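The second step of the decomposition, deriving rules from already-known frequent patterns, can be sketched as follows. This is a minimal illustration rather than the procedure used in the paper: the function name `generate_rules` is hypothetical, and the input dictionary holds toy frequencies (computed from the worked example of Section 3.2 under an assumed min_sup of 4). The sketch relies on the fact that every subset of a frequent pattern is itself frequent, so its frequency is available in the dictionary.

```python
from itertools import combinations


def generate_rules(frequent, min_conf):
    """Return (antecedent, consequent, frequency, confidence) tuples.

    `frequent` maps every frequent pattern (frozenset) to its frequency.
    """
    rules = []
    for pattern, support in frequent.items():
        if len(pattern) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for size in range(1, len(pattern)):
            for subset in combinations(sorted(pattern), size):
                antecedent = frozenset(subset)
                confidence = support / frequent[antecedent]
                if confidence >= min_conf:
                    rules.append(
                        (antecedent, pattern - antecedent, support, confidence)
                    )
    return rules


# Toy frequent patterns and their frequencies (assumed min_sup = 4).
frequent_patterns = {
    frozenset({"s1"}): 6,
    frozenset({"s2"}): 6,
    frozenset({"s3"}): 6,
    frozenset({"s1", "s2"}): 5,
    frozenset({"s1", "s3"}): 5,
    frozenset({"s2", "s3"}): 5,
    frozenset({"s1", "s2", "s3"}): 4,
}
rules = generate_rules(frequent_patterns, min_conf=0.75)
```

No database scan is needed in this step, since each confidence is a ratio of two stored frequencies; this is why the cost of mining is dominated by frequent pattern generation.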
3.2 Data Extraction Methodologies

In this section, we will present two possible methodologies for extracting the behavioral data required for mining sensor association rules from WASNs. The first methodology is direct reporting, in which the data are transferred to the sink without any involvement from the sensor nodes in the reporting process. The second methodology considers the overall limited resources of the network, with each node trying to optimize the number of messages that it will send. In what follows, we will describe the network architecture and then describe these methodologies in detail.
3.2.1 System Architecture

The network architecture for extracting the data consists of sensor nodes, where each node is coupled with a flash memory device (see footnote 1) that is able to store megabytes of data. Attaching a storage device to each sensor was previously impossible due to the high energy consumption required for maintaining the storage device. However, recent advances in flash memory technology have changed this perspective, and studies have shown that the energy consumption for maintaining a unit of data within a flash memory coupled with a sensor node is very low compared to the energy needed for transmitting this unit. In [20], Mathur et al. reported that a flash memory like NAND from Toshiba, storing 28 Gbytes of data generated at a rate of 512 bytes per second, will decrease the lifetime of a sensor node designed to run for 3 years by only 6 weeks. In our network model, as shown in Fig. 1, sensors are deployed in an ad hoc fashion and use a multihop mode of transmission to route the data to a well-equipped node called the sink. The sink is attached to a database to store the retrieved epochs (the behavioral data) that are received using either the distributed or the direct reporting methodology.

3.2.2 Direct Reporting

In direct reporting, the extraction process starts with the application that provides the mining parameters to the sink. These parameters include the time slot size $\lambda$, the historical period $T_{his}$, and the minimum support min_sup. The sink then broadcasts the time slot size and the historical period to the nodes in the network. Each node keeps track of the time. At the end of each time slot, it checks whether there is any detected event. If there is, it sends a notification message to the sink that contains its identifier and the number of the time slot in which the event occurred (that is, an integer is used to refer to the current time slot). At the sink node, the time is monitored.
At the end of each time slot (with an additional delay), the sink checks the received messages and creates an epoch for the current time slot consisting of the time slot number and a pattern of sensors’ identifiers extracted from the received messages that carry the same time slot number. Then, the sink stores this epoch in the database.

Let us consider the following scenario as an illustrative example. Let $S = \{s_1, s_2, \ldots, s_6\}$ be the sensors in a particular sensor network. Let the time slot size be equal to 10 minutes and the historical period equal to 70 minutes. Assume that the extraction process is initiated at time 10:00. Fig. 2 shows the detected events within the sensor network. At the end of the first time slot (10:10), sensors $s_1$, $s_2$, and $s_3$ send the messages $M_1(1, s_1)$, $M_2(1, s_2)$, and $M_3(1, s_3)$, respectively, each of which contains the number of the time slot in which the event was detected and the sensor’s identifier. At time 10:10 plus the additional delay, the sink forms the first epoch $E(1, (s_1 s_2 s_3))$ and stores it in the database. The same process is repeated periodically at the end of each time slot until the end of the historical period. Table 1 shows the extracted epochs after a historical period of 70 minutes. Algorithm 1 shows a formal description of direct reporting.

Footnote 1. Sensors with flash memory are necessary for the distributed methodology.

TABLE 1 DS Using Direct Reporting
TABLE 2 DS Using Distributed Extraction
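The sink-side step of direct reporting, grouping notification messages by time slot number, can be sketched in a few lines of Python. This is a simplified illustration and not the paper's Algorithm 1: the function name `build_epochs` is hypothetical, timers and the additional delay are ignored, and all messages are assumed to be available at once as `(slot_number, sensor_id)` pairs.

```python
from collections import defaultdict


def build_epochs(messages):
    """Group (slot_number, sensor_id) notifications into epochs.

    Returns a list of (slot_number, pattern) couples ordered by slot number.
    """
    by_slot = defaultdict(set)
    for slot, sensor in messages:
        by_slot[slot].add(sensor)
    return [(slot, frozenset(sensors)) for slot, sensors in sorted(by_slot.items())]


# Messages M1(1, s1), M2(1, s2), and M3(1, s3) from the example above.
first_slot_messages = [(1, "s1"), (1, "s2"), (1, "s3")]
epochs = build_epochs(first_slot_messages)
```

For the three messages of the first time slot, the result is the single epoch E(1, (s1 s2 s3)). Note that every detected event costs one message in this scheme, regardless of whether the reporting sensor can ever take part in a frequent pattern.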
nodes in the network. These parameters include the minimum support, the time slot size, and the historical period. Upon receiving the mining parameters, each sensor will establish a local buffer that has a bit entry for each time slot in the historical period. Initially, all the bit entries in the buffer are unset. After that, sensors keep track of the time, and at the end of each time slot, each sensor checks whether there is any detected event within this time slot. If there is, the bit entry corresponding to this time slot is set. At the end of the historical period, each sensor traverses its local buffer, and if the number of set bits is greater than or equal to the minimum support, the node will create a message (or a series of messages, depending on the packet size) containing the sensor identifier and the numbers of the time slots whose corresponding bits are set. Then, the sensor sends this message to the sink. The sink waits until it receives all possible messages from the sensor nodes, and it then restructures the data in the messages in such a way that all sensors that reported an event in the same time slot will appear in the same epoch. This epoch is then stored in the database.

As an example, let us reconsider the events presented in Fig. 2. Each sensor will maintain a buffer of length 7, with one entry for each time slot, and a buffer entry is set if there is a detected event in that time slot. For example, $s_1$’s buffer will be [1 | 1 | 1 | 1 | 0 | 1 | 1]. Assume that the minimum support is 2. Then, at time 11:10, which is the end of the historical period, sensors $s_1$, $s_2$, $s_3$, and $s_4$ will form the messages $(s_1, [1, 2, 3, 4, 6, 7])$, $(s_2, [1, 2, 3, 4, 5, 7])$, $(s_3, [1, 2, 3, 4, 5, 6])$, and $(s_4, [3, 4])$, respectively. These messages are then sent to the sink. Note that there are no messages to be sent from nodes $s_5$ and $s_6$, since the number of set entries in their buffers is less than the required minimum support. The sink waits until it receives all messages and forms the epochs to be stored in the database. Table 2 shows the extracted database for the events in Fig. 2. Algorithm 2 shows a formal description of the distributed methodology.
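The two halves of the distributed methodology, the node-side filter and the sink-side restructuring, can be sketched as follows. This is an illustration under stated assumptions and not the paper's Algorithm 2: the function names `sensor_message` and `restructure` are hypothetical, packet-size limits are ignored, and the buffers of s5 and s6 are invented (the text only states that they have fewer than two set entries). The buffers of s1 to s4 follow the example above.

```python
def sensor_message(sensor_id, buffer, min_sup):
    """Node side: report the set slots only if there are at least min_sup."""
    slots = [i + 1 for i, bit in enumerate(buffer) if bit]
    return (sensor_id, slots) if len(slots) >= min_sup else None


def restructure(messages):
    """Sink side: turn per-sensor slot lists into per-slot epochs."""
    by_slot = {}
    for sensor_id, slots in messages:
        for slot in slots:
            by_slot.setdefault(slot, set()).add(sensor_id)
    return [(slot, frozenset(s)) for slot, s in sorted(by_slot.items())]


buffers = {
    "s1": [1, 1, 1, 1, 0, 1, 1],
    "s2": [1, 1, 1, 1, 1, 0, 1],
    "s3": [1, 1, 1, 1, 1, 1, 0],
    "s4": [0, 0, 1, 1, 0, 0, 0],
    "s5": [0, 0, 0, 0, 1, 0, 0],  # hypothetical: a single detected event
    "s6": [0, 0, 0, 0, 0, 0, 0],  # hypothetical: no detected events
}

candidates = (sensor_message(sid, buf, min_sup=2) for sid, buf in buffers.items())
messages = [m for m in candidates if m is not None]
epochs = restructure(messages)
```

With a minimum support of 2, only s1 to s4 send a message, and the sink rebuilds seven epochs from four messages. A sensor with fewer set bits than the minimum support cannot appear in any frequent pattern, so suppressing its message does not change the set of frequent patterns.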
Algorithm 1: direct reporting.
Sink:
  Broadcast parameters ($T_{his}$, $\lambda$)
  Slot Number = 1;
  Time = current time;
  While (current time