History-Sensitive Based Approach to Optimizing Top-k Queries in Sensor Networks
Qunhua Pan, Minglu Li, and Min-You Wu
Department of Computer Science and Engineering, Shanghai JiaoTong University, 1954 Huashan Road, Shanghai, China
{oct-pan, li-ml, wu-my}@cs.sjtu.edu.cn
Abstract. Sensor networks generate a large amount of data during the monitoring process. These data must be extracted sparingly to conserve energy. There are two methods to obtain data: “push” and “pull”. In the “push” method, sensory data that satisfy a preset condition are pushed towards the base station. In the “pull” method, the sensor network is actively queried for sensory data of interest. The problem is how to plan the query so as to save energy. Once a query has been executed, some hints can be kept to optimize subsequent query processing. Energy consumption can be reduced by not contacting nodes whose values either can be predicted or are unlikely to be used. In this paper, we propose a history-sensitive method to optimize top-k query processing in sensor networks. A top-k query looks for and utilizes the historical data cached in each sensor node. Subsequent top-k queries are guided by these historical data, which improves the entire query process. Simulation results show that the number of query hops can be reduced and the response delays are improved.
Keywords: Query processing, Wireless sensor networks, Top-k, Historical data.
1 Introduction
Technological advances in wireless sensor networks have opened up new opportunities for collecting data from all sorts of environments. Effectively and efficiently querying these networks is an important and challenging problem. Because sensors are often battery powered, the lifetime of the network is tied to the rate at which it consumes energy. In particular, radio communication is a primary source of energy consumption in sensor networks. Hence, minimizing communication in query execution can save a significant amount of energy and prolong the lifetime of the network. Moreover, because the resources of sensors are extremely limited, the corresponding protocols and algorithms should be carefully designed. Sensor networks generate a large amount of data during the monitoring process. These data must be extracted sparingly to conserve energy. There are two methods to obtain data: “push” and “pull”. In the “push” method, when the sensory data satisfy a pre-established condition, a pushing event is triggered and the data are pushed to the base station. In this situation, the sensors must keep working for as long as needed, until their energy is exhausted. In the “pull” method, the sensor network is actively queried for sensory data of interest. The sensors can stay in a sleeping or idle state when no query is requested.
J. Cao et al. (Eds.): MSN 2006, LNCS 4325, pp. 674 – 684, 2006. © Springer-Verlag Berlin Heidelberg 2006
Once they receive a query, they wake up, sense the environment, and send back the sensory data that satisfy it. The query process is therefore more flexible and efficient. The problem is how to plan the query to save energy. Consider an example of temperature monitoring in a large exhibition with many exhibition rooms. To automatically monitor and adjust the temperature of the rooms, the system needs to collect the real-time temperature of each room. Temperature sensors are deployed in all rooms and self-organize into a wireless sensor network. The sensory data are sent back to the control room, where a manager can monitor the temperature of each room and is particularly interested in knowing which rooms have the highest or lowest temperature. The controller may run a top-k query over the network to find the target rooms in order to adjust their air conditioners. Optimization of top-k sensor queries is significantly more complex than that of ordinary queries (e.g., return all readings greater than x). A top-k query must, in principle, know all the sensory data in the sensor network. Flooding the query to the entire sensor network obtains all the information, but it consumes too much energy: every sensor is visited regardless of whether its data will be useful. We propose a new approach that combines the push and pull methods. A base query specifies a time period for sensing and aggregating minimal data to continuously monitor the field. This simulates a push method with the minimal possible energy consumption. The important function of the base query is that it provides minimal but necessary information so that later queries can be optimized. Other queries use the pull method to actively request information from the sensing field. In addition, the results of previous queries can be cached in sensor nodes for subsequent queries. The cache can also include information about the data distribution.
These hints can guide subsequent queries to the right region. Based on this idea, we analyze the features of top-k query dissemination and data aggregation, and we optimize top-k query processing based on historical data in sensor networks. Our contributions are as follows:
1. Based on historical data, we have developed a query optimization framework for query dissemination and data aggregation in sensor networks. Useful information is extracted from the aggregated data. Historical queries are cached in sensors. A new incoming query checks whether there is a matching historical query in the cached query table. If so, query processing stops and the cached result is sent back. Otherwise, the query agent checks whether some of the tagged data can satisfy the query. Finally, the query is sent to the next sensor nodes according to the routing schema, which is also generated from the aggregated data. The number of query hops is reduced in this framework, which makes it energy efficient. Moreover, the response time is substantially improved.
2. We apply the above concepts and techniques to the top-k query in sensor networks. Top-k query processing conventionally defines a threshold on the historical aggregated data: when a subsequent query arrives at a sensor, it is forwarded only if there are aggregated data higher than the threshold. However, the threshold is chosen subjectively by the user and is application-dependent. We optimize the top-k query without defining a threshold. By following the data aggregation algorithms and the decision rules that judge whether
the query should be forwarded to child nodes, queried nodes are pruned. We evaluate these algorithms using simulation. The rest of this paper is organized as follows. Section 2 provides an overview of related work on top-k query processing in traditional database management systems and on current query processing in sensor networks. Section 3 presents the framework of our historical data based top-k processing. The algorithms are discussed in Section 4, followed by their performance evaluation in Section 5. We conclude with an outlook on open research problems in Section 6.
2 Related Works
A substantial amount of work has been done on querying sensor networks. Fjords [3] is an architecture for managing multiple queries over many sensors; it allows users to pose queries that combine streaming, push-based sensor sources with traditional pull-based sources. It also proposed power-sensitive Fjord operators called sensor proxies, which serve as mediators between the query processing environment and the physical sensors. Reference [4] discussed the aspects of an acquisitional query language and introduced event and lifetime clauses to control when and how often sampling occurs. It discussed query optimization together with the associated issues of modeling sampling costs and ordering sampling operators, and it showed how event-based queries can be rewritten as joins between streams of events and sensor samples. That paper also demonstrated the use of semantic routing trees as a mechanism for efficiently disseminating queries. Query processing in sensor networks is concerned with routing trees [5][6], aggregation [7][8], and semantics [9]. In traditional database technology, the top-k query problem has been intensively studied. Reference [10] studied the advantages and limitations of processing a top-k query by translating it into a single range query that can be efficiently processed by a traditional relational database management system. It studied how to determine a range query to evaluate a top-k query by exploiting the statistics available to an RDBMS. Donjerkovic and Ramakrishnan [11] proposed a probabilistic approach to query optimization for returning the top-k tuples for a given query. Chen and Ling [12] used sampling to define the range selection query that is expected to cover most of the top-k tuples. The result of the selection query serves as an approximate answer to the original top-k query. The processing of top-k queries in sensor networks is different from that in a traditional relational database. Deshpande et al.
[2] propose model-driven data acquisition, which suggests using models such as multivariate Gaussians to predict sensor readings. These models make it possible to avoid visiting nodes whose readings can be accurately predicted or are unlikely to contribute to the final result. This approach can dramatically reduce the energy consumed by the network, but it makes the results approximate. Instead of using models explicitly, Silberstein et al. [13] propose to use samples of past sensor readings. The samples are computationally efficient to use in query optimization. They demonstrate the power and flexibility of the sampling-based approach by developing a series of top-k query planning algorithms with linear programming. Zeinalipour-Yazti et al. [15] present the Threshold Join Algorithm (TJA), an efficient top-k query processing algorithm for distributed sensor networks. TJA uses a
non-uniform threshold on the queried attribute in order to minimize the number of tuples that have to be transferred towards the querying node. Because the data management systems in sensors, such as TinyDB and Cougar, are limited and the main goal when querying sensor networks is saving energy, top-k query techniques for relational databases cannot be used directly in sensor networks. Prior work on top-k query optimization in sensor networks uses a threshold based on historical data, and the threshold definition is application-dependent. Moreover, the accuracy of results obtained with the sampling-based approach cannot always be guaranteed. The processing of top-k queries in our framework is distributed, and its optimization is independent of any threshold. The communication cost and the delays are expected to be low.
3 History-Sensitive Based Top-k Query Processing
The main goal in designing this querying system is to minimize the energy consumption of the system, that is, to minimize both the number and the size of the messages in the system, including query messages and data messages. In the application scenarios we described, detailed monitoring is only necessary for the subset of the data whose numeric attribute values are among the k largest, where k is an application-dependent parameter. Therefore, the transmission, storage, and processing burdens in the monitoring infrastructure can be reduced by limiting the scope of detailed monitoring accordingly. Because of the extremely limited resources of a sensor node, purely flooding top-k queries to the sensor network to collect all sensory data is costly and often unnecessary. A low-cost mechanism is needed for continually identifying the top-k data values in a sensor network. When a query has been executed, many sensory data are generated. They are sent back and aggregated in intermediate sensors. If the historical data still reside in the sensors near the base station, a repeated query can be answered immediately, provided the result is still available. Moreover, the semantic information in the resulting data can be used to guide subsequent new queries. From this point of view, we can optimize the top-k query by utilizing the historical data.
3.1 Framework of Historical Data Based Query Processing
The framework illustrated in Figure 1 consists of several components: query agent, cached historical data, cached queries, and routing schema. When a new query arrives at a node, the cached queries table is checked to see whether the query has been executed before. If so, the query is not forwarded, and the query result can be extracted from the aggregated data. Even if there is no matching query in the cache, the query result may still be generated from the aggregated data.
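The local lookup order described so far (cached queries first, then the cached aggregated data) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `SensorNode`, its fields, and the function `answer_locally` are hypothetical names, and the sketch assumes that the readings have not changed since the aggregated data were cached and that the cached list holds the largest values of the node's subtree in descending order.

```python
from dataclasses import dataclass, field


@dataclass
class SensorNode:
    """Hypothetical per-node state; the field names are illustrative only."""
    cached_queries: dict = field(default_factory=dict)  # query key -> cached result
    aggregated: list = field(default_factory=list)      # largest subtree values, descending


def answer_locally(node, k):
    """Return a top-k answer from the node's caches, or None if the query
    cannot be answered locally and would have to be forwarded."""
    key = ("top-k", k)
    # Step 1: the same query was executed before, so reuse its result.
    if key in node.cached_queries:
        return node.cached_queries[key]
    # Step 2: no matching query, but the cached aggregated data already
    # contain at least k values, so the answer is their first k entries.
    if k <= len(node.aggregated):
        result = node.aggregated[:k]
        node.cached_queries[key] = result  # remember the query for next time
        return result
    # Step 3: the caches are insufficient for this query.
    return None
```

Under these assumptions a node that cached the values [9, 7, 5] answers a top-2 query with [9, 7] without contacting any other node, while a top-5 query returns None.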
If none of these operations can satisfy the query, it is forwarded to the next nodes through the routing schema. Because the aggregated data contain information about the data distribution, the selection of the next nodes also depends on the historical data. This query framework consists of a base query and subsequent normal queries for an application. The base query is injected into the sensor network before the normal queries. The subsequent normal queries may further explore the field for details. They may utilize the data cached in the sensor nodes to minimize energy
[Figure 1: block diagram; labels: Interface, Query, Query Agent, Cached Queries, Routing Schema, Data, Aggregated Data, Sensor Networks Interface]
Fig. 1. Framework for History-based query processing
consumption. Historical data may provide guidance so that the query heads to the most likely location of the top-k values. In this paper, we assume that a global time synchronization mechanism has been implemented, so that all operations can be executed synchronously. We also assume that a fault-tolerance mechanism is in place to handle possible failures of sensor nodes. Although dynamic changes in the status of the field could be handled in this framework, we focus on a relatively simple problem in this paper, that is, how to obtain the top-k values between two consecutive query readings. The base query is discussed below.
3.2 Base Query
Because the monitored environment varies, the base query requests all sensors in the sensor network to sense the field periodically. The principle in designing the base query is to reduce its energy consumption while the required information about the field is continuously monitored. Normally, a base query can be an aggregation operation performed once every time period P. As an example, consider a sensor network that monitors forest fires. A base query may be designed as follows. Each sensor in the field wakes up every P = 10 minutes, senses the temperature, and transmits it to its parent. The aggregation operation extracts the maximum of the values from a node's children and the node itself. The base station thus receives the highest temperature in the field. To further minimize energy consumption, a threshold can be set so that only temperatures higher than the threshold are transmitted. Thus, when there is no fire in the field, the activity and energy consumption are negligible while the field is still monitored continuously. Once a potential fire is detected, normal queries can be generated to understand the severity and extent of the fire. For example, a top-k query may be sent to find the most severe part of the fire. In summary, a base query is designed as follows.
First, it is a periodic query, that is, each node that executes the query periodically executes the operation specified in the query with the specified period until another query explicitly cancels the operation. Second, it is a query that is sent to every node through a broadcast tree, that is, the query floods the entire sensor network. Third, it serves a surveillance purpose, so it must be energy-efficient. To this end, the length of the period must be carefully selected and should be application-aware. The period must be short enough that the
field is sufficiently monitored, yet long enough to ensure a long lifetime of the sensor network. Furthermore, the messages sent to the base station must be minimized. The smallest possible message is sent upward to the base station; normally, aggregation is performed instead of collecting all messages, which minimizes the message size. In the above example, the maximum of the temperature values from every node is aggregated to monitor fire in the field, so that only N-1 messages, each carrying a single value, are transmitted in every time period, where N is the number of nodes. Finally, messages sent from a child to its parent are cached in the parent node. These cached data provide a guide for subsequent queries. The base query is of fundamental importance in the design of the entire querying system. It provides a basic surveillance mechanism with low energy consumption, which ensures a long lifetime of the sensor network while the field is continuously monitored. In addition, it provides a substrate of information for subsequent queries. Furthermore, as the query instructs every node to periodically sense the field and return its reading, the sensor network keeps monitoring the dynamic changes in the global state of the field.
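One round of the forest-fire base query described in this section (maximum aggregation, an optional reporting threshold, and caching of each child's report in its parent) might look like the following Python sketch. The function name `base_query_max`, the dictionary-based tree, and the cache layout are assumptions made for illustration; the paper does not prescribe them.

```python
def base_query_max(node, children, readings, threshold, cache):
    """Run one period of a MAX base query on the subtree rooted at `node`.

    children:  dict mapping a node to the list of its child nodes (assumed layout)
    readings:  dict mapping a node to its current temperature reading
    threshold: values not above the threshold are not transmitted
    cache:     dict filled in place; cache[parent][child] is the child's last report
    Returns the value the node reports to its parent, or None if it stays silent.
    """
    best = readings[node]
    for child in children.get(node, []):
        report = base_query_max(child, children, readings, threshold, cache)
        if report is not None:
            # The parent caches what the child sent; later queries use it as a hint.
            cache.setdefault(node, {})[child] = report
            best = max(best, report)
    # Only a value above the threshold is worth a transmission.
    return best if best > threshold else None
```

For example, with `children = {0: [1, 2], 1: [3]}`, `readings = {0: 20, 1: 25, 2: 60, 3: 72}` and a threshold of 50, the root obtains 72 and ends up with the hint `{1: 72, 2: 60}` in its cache, which points a later query towards the subtree of node 1. With a threshold of 100 no node transmits anything.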
4 Algorithms for Top-k Query in Sensor Networks
We now describe our algorithms for historical data based top-k monitoring. Network initialization includes establishing the query routing tree. The root node broadcasts a hello message to the sensor network. When a node receives the message forwarded by another node, it joins that node as its child. The child node then forwards the message to its own neighbors. After all nodes have received the message, the initial routing tree is established. Because a node always receives the message first from its nearest node, the routing tree is a nearest-first tree. After the base query is flooded through the sensor network along the nearest-first tree, the responding nodes send data back to their parent nodes. Query processing computes the answer bottom-up in one pass over the network. Each node simply collects the top-k values from each of its children, selects the top-k from all such values and its own, and passes them on to its parent. If the subtree rooted at a node has fewer than k nodes, then all values from the subtree are passed up. Each parent node waits until it has received the gathered data from all its child nodes, applies the aggregation operator to them, and sends the result to its parent. Since every node must be visited in order to guarantee an exact answer, the number of query hops is large. The base query procedure is as follows:
input: top-k query, N   // N is the current query node.
output: result of top-k data
N.broadcastTopKQuery(query, k)
Begin
  If (N has no child node) then
    N.visited = true
    NodeResult = localDataValue
    return (NodeResult)
  If each of my children nodes Ni has Ni.visited = true then
    AggregatedResult = MergeResult(localDataValue, NodeResult1, …, NodeResultn)
    NodeResult = FindtopK(AggregatedResult, k)
    N.visited = true
    return (NodeResult)
  else
    For each of my children Ni with Ni.visited = false Do
    Begin
      NodeResulti = Ni.broadcastTopKQuery(query, k)
    End
End
Comments on the objects and methods in the above algorithm:
NodeResult: the stored (data value, nodeID) pairs.
MergeResult: aggregates localDataValue with the NodeResulti received from the child nodes.
FindtopK: selects the top-k data as NodeResult; NodeResult is sent to the parent node.
After the base top-k query, there are at most |Ni|*k data in each node N, where |Ni| is the number of one-level child nodes of N. But at least (|Ni| - 1)*k of them will not be in the final result, representing a significant waste of bandwidth. Using these historical data to guide the subsequent top-k’ queries can reduce the redundant data transmission. When the root node receives the subsequent top-k’ query, k’ is split into {ki}, one for each child node (recall Section 3). Child node i executes a top-ki query in its subtree. The algorithm for the case where k’ is less than k is as follows.
input: top-k’ query, N
output: result of top-k’
Initialization: N = root node
N.broadcastTopKQuery(query, k’)
Begin
  Sort OverallResult
  Find Top k’ from OverallResult
  Find MinValue of Top k’
  Find {ki}   // ki is the number of data sent by child node i, and sum(ki) = k’
  If ki = 1 AND top 1 data value = localDataValue then
    N.visited = true
    NodeResult = localDataValue
    return (NodeResult)
  If each of my children nodes Ni has Ni.visited = true then
    AggregatedResult = MergeResult(localDataValue, NodeResult1, …, NodeResultn)
    NodeResult = FindtopK(AggregatedResult, ki)
    N.visited = true
    return (NodeResult)
  else
    For each of my children Ni do
    Begin
      NodeResulti = Ni.broadcastTopKQuery(query, ki)
    End
End
When k’ is larger than k, the k assignment algorithm is not suitable. When the k data all come from one node i, the (k+1)th datum cached in the root comes from some other node j. If we simply forwarded the query to node j to find the (k+1)th datum, the true (k+1)th datum might remain hidden in the subtree of node i. So we first obtain the (k+1)th datum from node i and then compare it with the (k+1)th datum in the root node. If the newly received datum is larger, it is the true (k+1)th datum; otherwise the true (k+1)th datum comes from another node, such as node j. The (k+2)th datum can be found in the same way. The drawback is a long delay, especially when k’ >> k. But the method is energy efficient when k’ is not much larger than k. The algorithm is as follows:
input: top-k’ query, N
output: result of top-k’
Begin
  Initialization: N = root node
  Do until k = k’
    Find Nodei   // Nodei is the node that sent the k-th top data value
    Ni.broadcastTopKQuery(query, k+1)
    If (k+1)th data in NodeResult(Ni) > historical (k+1)th data in N then
      Update (k+1)th data in N by (k+1)th data in NodeResult(Ni)
    k = k+1
  End do
  Return top-k’ data
End
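To make the first two procedures of this section concrete, the following Python sketch implements the bottom-up base top-k query and a history-guided top-k’ query for k’ ≤ k. It is an illustration under stated assumptions rather than the authors' code: the names `base_topk` and `guided_topk`, the dictionary-based tree, and the per-parent cache layout are hypothetical, readings are assumed unchanged between the two queries, and a hop is counted each time a query is forwarded from a parent to a child.

```python
import heapq


def base_topk(node, children, value, k, cache):
    """Bottom-up base query: returns (top-k list of (value, node_id), hops).
    cache[parent][child] stores the list each child reported to its parent."""
    merged = [(value[node], node)]
    hops = 0
    cache[node] = {}
    for child in children.get(node, []):
        sub, sub_hops = base_topk(child, children, value, k, cache)
        cache[node][child] = sub      # historical data kept at the parent
        merged.extend(sub)
        hops += sub_hops + 1          # one hop to reach this child
    return heapq.nlargest(k, merged), hops


def guided_topk(node, children, value, k2, cache):
    """Top-k' query with k' <= k, guided by the cache of the base query.
    A child is visited only if its cached data reach the cached top-k'."""
    candidates = [(value[node], node, None)]
    for child, sub in cache[node].items():
        candidates.extend((v, n, child) for v, n in sub)
    top = heapq.nlargest(k2, candidates, key=lambda t: t[:2])
    # quota[child] plays the role of ki: how many of the top-k' come from child i.
    quota = {}
    for _, _, child in top:
        if child is not None:
            quota[child] = quota.get(child, 0) + 1
    merged = [(value[node], node)]
    hops = 0
    for child, ki in quota.items():
        sub, sub_hops = guided_topk(child, children, value, ki, cache)
        merged.extend(sub)
        hops += sub_hops + 1
    return heapq.nlargest(k2, merged), hops
```

On a seven-node tree with `children = {0: [1, 2], 1: [3, 4], 2: [5, 6]}` and `value = {0: 10, 1: 50, 2: 20, 3: 90, 4: 40, 5: 30, 6: 80}`, the base top-3 query visits every node (6 hops), whereas the guided top-1 query follows only the path 0, 1, 3 (2 hops). Subtrees whose cached values do not reach the cached top-k’ receive a quota of zero and are pruned, which is where the hop savings come from.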
5 Simulation and Results
In this section we report simulation results that evaluate the effectiveness and efficiency of the algorithms in top-k query processing.
5.1 Data and Simulation Setup
We run our simulation on a 20×20 grid-topology network with one sensor in each grid cell. There are two different data distributions in our simulation: a circle distribution and a power-law distribution. The circle distribution means that there is a circular region in which the data values are higher than in the other regions. The top-k query will be satisfied by the sensors in that region if it is efficiently guided. The transmission radius of a sensor node is set to 3 grid distances. The root node is located in the left corner of the sensor network. It has been observed that for several self-organizing networks the degree distribution follows a power law (or, equivalently, scaling) distribution of the form $P(k) \sim k^{-\alpha}$. In the simulation, the degree (k) of a sensory datum is defined as the location of the sensor node. A higher degree exponent means the distribution goes to zero faster, i.e., there are very few nodes that have very high degrees. On the other hand, if
the exponent is smaller, there is a relatively higher number of nodes with very high degrees. Power-law distributions are common in sensor network applications. A user query is generated by a user who queries the sensor network for data. The user query starts at the root node. First, the base query is flooded through the whole sensor network. The nodes that respond to the query send data to their parent nodes. The result data are aggregated and cached in intermediate nodes. When a subsequent query arrives at a node, the cached data determine whether it is forwarded to the node's children. Here, the number of query hops is used to measure energy consumption; it is the number of nodes visited by the subsequent query.
5.2 Simulation Result
We first evaluate the performance of the historical data based top-k query when the subsequent k’ < k. The base query in our simulation finds the top 5 data in the sensor network. After the base query, the subsequent queries are top 4, … , top 1. In the base query, every node is visited, so the number of hops is 399. Figure 2 shows that the hops of the subsequent queries are reduced after the base query.
Fig. 2. Hops in static data distribution and k’ < k
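The setup of Section 5.1 can be reproduced approximately with the Python sketch below. It builds the 20×20 grid, floods a hello message breadth-first from a corner root to obtain the routing tree, and generates a circle distribution. Several details are assumptions not stated in the paper: the transmission radius is interpreted as Euclidean distance, the root is placed at grid position (0, 0), and the hot-region center, radius, and the two reading levels are arbitrary illustrative values.

```python
from collections import deque
import math


def build_routing_tree(size=20, radius=3.0, root=(0, 0)):
    """Flood a hello message from the root; each node adopts as parent the
    node from which it first receives the message. Returns {node: parent}."""
    parent = {root: None}
    queue = deque([root])
    reach = int(radius)
    while queue:
        x, y = queue.popleft()
        for dx in range(-reach, reach + 1):
            for dy in range(-reach, reach + 1):
                nx, ny = x + dx, y + dy
                in_grid = 0 <= nx < size and 0 <= ny < size
                if in_grid and math.hypot(dx, dy) <= radius and (nx, ny) not in parent:
                    parent[(nx, ny)] = (x, y)
                    queue.append((nx, ny))
    return parent


def circle_readings(size=20, center=(14, 14), hot_radius=3.0, high=100.0, low=20.0):
    """Circle distribution: readings inside the circular region are higher.
    The center, radius, and reading levels are illustrative assumptions."""
    return {
        (x, y): high if math.hypot(x - center[0], y - center[1]) <= hot_radius else low
        for x in range(size)
        for y in range(size)
    }
```

The resulting tree spans all 400 sensors and therefore has 399 parent-child links, which matches the 399 hops reported for the base query, where every node is visited once.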
The optimized top-k query limits the scope of the searched region. The number of visited nodes is therefore reduced and, consequently, the hops of the top-k query decrease. The delay is defined as the distance from the root node to the farthest visited node. Figure 3 shows the delays of top-k’ queries (k’ < k) under a static data distribution. It is evident that the delays are reduced by the hints from the base query.
Fig. 3. Delays in static data distribution and k’ < k
In Figure 4, we report the hops when k’ of the subsequent query is larger than the k of the base query. Here, the base query finds the maximum datum of the sensor network. The algorithm increases the number of required top data in steps of 1. The hops increase as k’ increases. If k’ is not very large, the total hops are fewer than those of the flooding query.
Fig. 4. Hops in static data distribution and k’ > k
The delays also increase when k’ > k. Figure 5 shows the delays of each top-k’ query.
Fig. 5. Delays in static data distribution and k’ > k
When the data center is moving or the data values vary, the query guided by historical data cannot obtain accurate data. If the rate of variation is slow, the query can, on reaching the stop node, extend one or more steps to query the neighbor nodes. This expands the search scope and yields more accurate data. However, the direction and rate of movement of the data center are application-dependent. If the query routing tree is static, it cannot adapt to the data variance.
6 Conclusion and Outlook
Sensor networks are data centric. This paper proposes a historical data based framework for query processing. Query results are aggregated in intermediate nodes before being sent back, which conserves energy. Moreover, these aggregated data reside in the nodes for a period of time, and subsequent queries are guided by these historical data. We apply this framework to the top-k query problem. There are a base
top-k query and subsequent top-k’ queries in the application. Guided by the result data of the base query, a top-k’ query reaches its target quickly and minimizes the query hops from the root to the target. Simulation results show that historical data based top-k query processing saves energy and reduces delays when k’ is less than k. When k’ is larger than k, the query hops are reduced but the delays increase. The problem of data storage in a node is not discussed in this paper. Because the memory of a sensor is limited, an efficient compression algorithm should be designed in future work. The query routing tree is very important to the framework we proposed; how to create an efficient query routing tree is another direction of ours. The design of query plans for dynamic environments is also left to future work.
References
1. B. Babcock and C. Olston. Distributed top-k monitoring. In Proc. of the 2003 ACM SIGMOD Intl. Conf. on Management of Data, San Diego, California, USA, June 2003.
2. A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. In Proc. of VLDB 2004.
3. S. Madden and M. J. Franklin. Fjording the stream: An architecture for queries over streaming sensor data. In Proc. of ICDE 2002.
4. S. Madden, M. Franklin, J. Hellerstein, and W. Hong. The design of an acquisitional query processor for sensor networks. In Proc. of ACM SIGMOD 2003.
5. Jeffrey E. Wieselthier, Gam D. Nguyen, and A. Ephremides. On the Construction of Energy-Efficient Broadcast and Multicast Trees in Wireless Networks. In Proc. of IEEE INFOCOM 2000.
6. H. Yang, F. Ye, and B. Sikdar. A Dynamic Query-tree Energy Balancing Protocol for Sensor Networks. In Proc. of WCNC 2002.
7. S. Madden, R. Szewczyk, Michael J. Franklin, and David Culler. Supporting Aggregate Queries Over Ad-Hoc Wireless Sensor Networks. Workshop on Mobile Computing and Systems Applications, 2002.
8. W. Yu, T. Nam Le, Dong Xuan, and W. Zhao. Query Aggregation for Providing Efficient Data Services in Sensor Networks. In Proc. of IEEE Mobile Sensor and Ad-hoc and Sensor Systems (MASS), October 2004.
9. Qunhua Pan, Minglu Li, and Min-You Wu. A semantic-based architecture for sensor networks. Annals of telecommunications, Vol. 60, n° 7-8, July-August 2005, pp. 928-943.
10. N. Bruno, S. Chaudhuri, and L. Gravano. Top-k Selection Queries over Relational Databases: Mapping Strategies and Performance Evaluation. ACM Transactions on Database Systems, Vol. 27, No. 2, June 2002, pp. 153-187.
11. D. Donjerkovic and R. Ramakrishnan. Probabilistic optimization of top N queries. In Proc. of VLDB’99.
12. C. Chen and Y. Ling. A sampling-based estimator for top-k selection query. In Proc. of ICDE 2002.
13. A. Silberstein, R. Braynard, C. Ellis, and K. Munagala. A Sampling-Based Approach to Optimizing Top-k Queries in Sensor Networks. In Proc. of ICDE 2006.
14. Christopher R. Palmer and J. Gregory Steffan. Generating network topologies that obey power law. In Proc. of the IEEE GLOBECOM, San Francisco, 2000, pp. 434-438.
15. D. Zeinalipour-Yazti, Z. Vagena, D. Gunopulos, V. Kalogeraki, and V. Tsotras. The Threshold Join Algorithm for Top-k Queries in Distributed Sensor Networks. In Proc. of DMSN’05, August 29, 2005, Trondheim, Norway.