
M-LARI: Locality-Aware Multidimensional Range query Index for Border Monitoring Queries over Data Streams

SangJeong Lee, Seungwoo Kang, Jinwon Lee, Youngki Lee, Byoungjip Kim, Hyunju Jin, and Junehwa Song

CS/TR-2005-213 January 5, 2005

KAIST Department of Computer Science

Korea Advanced Institute of Science and Technology (KAIST)
373-1 Guseong-dong Yuseong-gu Daejeon, Korea
{peterlee, swkang, jcircle, youngki, bjkim, hyunju, junesong}@nclab.kaist.ac.kr

ABSTRACT

This paper presents a Locality-Aware Multidimensional Range query Index, M-LARI, to evaluate Border Monitoring Queries efficiently. A Border Monitoring Query (BMQ) has different query semantics from traditional continuous range queries. It monitors the values of data streams and reports which data streams cross the borders of its range. This paper emphasizes the importance and usefulness of BMQs with new service scenarios. We propose a novel index structure, M-LARI, which has two important features: multidimensionality and locality-awareness. M-LARI targets high performance evaluation of multidimensional range queries, especially multidimensional BMQs. To accelerate BMQ evaluation, our system utilizes the locality of data streams and contains efficient data structures and operations designed to utilize that locality. Experiments show the performance advantage of M-LARI compared to other continuous range query indices. This paper also presents some challenging situations in which most range query indices suffer and discusses additional technical issues introduced by these situations.

1. INTRODUCTION

Advances in mobile computing and embedded device technology are bringing about new computing environments. These environments contain numerous data generators such as sensors, probes, and agents, which generate data continuously in the form of data streams. We call such environments data stream environments.

In data stream environments, restaurants can advertise special lunch menus to people nearby, and gas stations can send electronic coupons to nearby vehicles through mobile networks. Such applications, location-based advertisement services, take people’s or vehicles’ location information as input from mobile networks, compare that information with advertisement regions, and send advertisements to the proper people or vehicles in real-time. They request information about new objects entering the specified region rather than information about the objects which remain in the region. Similar applications can easily be found in emerging service domains such as location-based services or Telematics services. The query semantics used by such applications is that of “monitoring objects’ location information and reporting objects which cross the borders of the region.”

We are interested in these new types of range queries, which will be quite prevalent in data stream environments. We name such queries Border Monitoring Queries (BMQs). The semantics of a BMQ is totally different from that of the traditional range queries supported by existing continuous query systems [1][4][5]. While existing range queries report the data streams within a configured range (we call these Region Monitoring Queries (RMQs) to distinguish them from BMQs), a BMQ reports the data streams crossing the borders of the configured range. Hence, a BMQ should be evaluated differently from an RMQ. We should consider the intersection relationship between the change of data streams and the borders of configured regions, rather than the inclusion relationship between the data stream values and the regions. Existing continuous range query indices, however, have not yet taken this intersection relationship into consideration, because the importance of BMQs has not yet been recognized.

This paper proposes a Locality-Aware Multidimensional Range query Index, M-LARI. This multidimensional index structure is designed to rapidly evaluate many multidimensional range queries by utilizing the locality of input data streams. In our previous work [15], we utilized the locality of one-dimensional data streams for high performance continuous range query processing. We enhance the index from that work, LARI, to support multidimensional range queries, especially multidimensional BMQs. We define the locality of multidimensional data streams, and suggest data structures and operations which evaluate multidimensional BMQs rapidly while consuming less storage. Various experiments in this paper will demonstrate the significant performance improvement of M-LARI over other indices.

The locality of data streams helps accelerate a costly operation in which continuous range queries are repeatedly evaluated as input data arrives. The use of locality will bring about a considerable performance improvement, especially for queries whose result significantly depends on the change of data stream values, such as BMQs. We have already demonstrated that there is sufficient locality in various data streams [15].
User or vehicle location data streams used for location-based services are representative examples. In addition, weather-related data streams such as temperature, humidity, and pressure and data streams related to the economy such as oil prices, stock prices and volumes also exhibit locality. Among such data streams, we concentrate on multidimensional data streams such as location information or multi-attribute data streams (footnote 1).

Footnote 1: We define multi-attribute data streams as a type of multidimensional data stream that has multiple attributes which constitute the multiple dimensions of a data stream.

First, we formally define the locality of multidimensional data streams. The definition includes a normalization step, since the attributes of multi-attribute data streams have different resolutions. We employ a dimension-fair difference to normalize each dimension and calculate the change of data stream values. Next, we design data structures utilizing multidimensional locality. To manage multidimensional queries, M-LARI employs a Region Segment List (RS List) for each dimension, where a Region Segment represents a region of a dimension and stores its BMQ information. For highly localized data streams, M-LARI achieves high performance by skipping unnecessary computation, as the values of the data stream are highly likely to stay in the same Region Segment. Moreover, because M-LARI employs one RS List per dimension, its storage size stays small even as the number of dimensions increases; i.e., it grows only linearly with the dimensionality. This is a big advantage over existing multidimensional range query indices whose storage sizes increase exponentially with the number of dimensions. Finally, we develop a multidimensional search operation to find matching BMQs for incoming data values. It is an efficient cross-checking algorithm that checks the range query information across multiple dimensions and produces a set of matching BMQs. We also design a novel data structure using a hash table associated with binary search trees in order to accelerate this operation.

A lot of research has attempted to index multidimensional range queries in data stream environments, but most has discussed only traditional range queries, i.e. RMQs. S. Prabhakar et al. first proposed an R-tree based two-dimensional index [12] with safe regions for continuous range queries over moving objects. They subsequently developed a cell-based range query index [10] which outperforms their previous index considerably.
Despite the great performance improvement, the cell-based query index consumes a lot of storage as it stores query information in many cells redundantly. Moreover, as the number of the range queries that partly intersect with cells increases, the overhead required in order to check if a query indeed contains an object degrades performance significantly. In order to avoid this overhead, [14] deployed predefined tiles of fixed sizes to cover range queries. Although this index appears to exhibit better performance and consume less storage than [10], it has severe scalability problems. For applications which require a large region and a high resolution of information, such as Telematics in a large city, the resolution must be lowered and range queries approximated so that they may be covered by large tiles. However, this limitation may lead to inaccurate query results. Not only indices over moving objects, but also a few continuous query systems [4][5] in data stream environments discussed range query indices and proposed a grouped filter index [11] or Interval Skip List [8] in order to evaluate range queries quickly. However, such systems handle range queries concerning only one attribute of data streams, and they have not addressed multidimensional range queries.

Compared to existing work, our work presents two main contributions. First, we propose new semantics for continuous range queries, i.e. BMQs. Although we already introduced the concept of BMQs in our previous work [15], this paper refines the concept further by emphasizing the importance and usefulness of this new type of continuous range query through various service scenarios. We define Border Monitoring Queries formally and investigate their main characteristics in depth. Second, we develop a high performance index for BMQs over multidimensional data streams that consumes less storage than other indices. In terms of technical contributions, we define the locality of data streams in multiple dimensions and develop light-weight data structures and efficient operations utilizing that locality. We demonstrate the high performance and low storage space requirement advantages of M-LARI through experiments. In addition, we discuss the enhancements of M-LARI that enable it to cope with several challenging situations. We address additional technical issues that arise from the challenges and develop proper solutions based on M-LARI.

[Figure 1. Location-Based Advertisement (panel labels: Lunch Menu, Pet-Care, Coupon)]

This paper is organized as follows. In section 2, we introduce new scenarios to describe the Border Monitoring Query and discuss its semantics. In section 3, we present related work in detail and discuss the contributions of M-LARI. Section 4 describes the data structures and operations of M-LARI. Section 5 evaluates the performance of M-LARI as compared to other index structures. In section 6, we address additional technical issues and discuss advanced M-LARI. Finally, we conclude our work in section 7.

2. Border Monitoring Query

Mobile computing and embedded device technology are now becoming mature enough to spawn new types of services and applications. These emerging applications receive a large number of data streams and evaluate continuous queries over them. Continuous query systems, or stream processing systems, are being developed to support those applications. We argue that these emerging applications require new range query semantics. The following three scenarios present emerging applications wherein a new type of range query is prevalent.

Scenario 1: Location-based advertisement
As shown in Figure 1, a restaurant, a café, and a gas station advertise special lunch menus, pet-care services, or electronic coupons. For example, the restaurant first requests the advertisement application to send its lunch menu within a range of 500 meters for a period of two hours. The main feature of this application is quickly locating people who are entering the specified region. Since people do not like to receive the same advertisement more than once, it is not necessary to locate and identify people who are already in the region. Thus, we observe that it is important to retrieve information about people crossing the borders of a region.

[Figure 2. Semantic Difference between BMQ and RMQ. Data streams a to g move over a domain containing one query range; RSetRMQ = {a, b, c, f, g}, +RSetBMQ = {f, g}, -RSetBMQ = {d, e}.]

Scenario 2: Parking lot
A state-of-the-art parking lot management system, one of many Telematics services, monitors vehicles coming to and from the parking lot. For example, this system identifies a vehicle entering a university and guides it to the appropriate parking lot depending on the driver's university status: faculty, student, staff, or guest. In addition, the system checks the parking time and bills her/him for parking later. This system also must frequently evaluate queries that ask for vehicles crossing the boundaries of a specified region.

Scenario 3: Electronic hospital
Soon hospitals will actively use systems that monitor patients’ vital signs such as heart rate, body temperature, and blood pressure. If the vital signs of a patient cross the borders of normal ranges into abnormal ranges, the system must immediately notify the relevant doctors and nurses of the patient’s status. Again, the primary concern of this system lies with people whose vital signs go beyond normal ranges.

The semantics of range queries used by the above applications are quite different from that of traditional range queries used in existing continuous query systems. We have identified the semantic difference between them. We call these new continuous range queries Border Monitoring Queries. Figure 2 shows an example explaining the semantic difference between a BMQ and an RMQ. There are seven data streams labeled from ‘a’ to ‘g’ and a continuous range query which is evaluated under both BMQ and RMQ semantics. As shown in the figure, the result set of an RMQ that was initially {a, b, c, d, e} later becomes {a, b, c, f, g}. A BMQ has two result sets: a set of incoming data streams, +ResultSet, {f, g}, and a set of outgoing data streams, -ResultSet, {d, e}. BMQs are concerned with the variation of data stream values rather than the values themselves.
To evaluate BMQs, a system identifies the change between the previous value and the current value of a data stream, and compares that change against the borders of the BMQs. Executing this comparison for many BMQs and data streams is costly and largely redundant when the values of the data streams change slowly. By eliminating the unnecessary comparisons, we develop a high performance index which can rapidly evaluate many BMQs.
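The semantic difference described above can be made concrete with a small, naive evaluator. This is only an illustrative sketch, not the M-LARI index: the class and function names are hypothetical, the query is one-dimensional, the range is assumed to be half-open, and a stream with no previous value is assumed to count as "previously outside". The sample values are invented so that the outcome mirrors Figure 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeQuery:
    """One-dimensional range [lower, upper); the half-open choice is an assumption."""
    lower: float
    upper: float

    def contains(self, value: float) -> bool:
        return self.lower <= value < self.upper


def evaluate_rmq(query, current):
    """RMQ semantics: report every stream whose current value lies in the range."""
    return {sid for sid, value in current.items() if query.contains(value)}


def evaluate_bmq(query, previous, current):
    """BMQ semantics: report streams that crossed a border between two observations."""
    entered, left = set(), set()
    for sid, value in current.items():
        # Assumption: a stream without a previous value was outside the range.
        was_inside = sid in previous and query.contains(previous[sid])
        is_inside = query.contains(value)
        if is_inside and not was_inside:
            entered.add(sid)   # belongs to +ResultSet
        elif was_inside and not is_inside:
            left.add(sid)      # belongs to -ResultSet
    return entered, left


# Invented values chosen to reproduce the situation of Figure 2.
query = RangeQuery(10, 20)
previous = {"a": 12, "b": 14, "c": 16, "d": 18, "e": 11, "f": 5, "g": 25}
current = {"a": 13, "b": 14, "c": 15, "d": 22, "e": 8, "f": 12, "g": 19}

print(sorted(evaluate_rmq(query, previous)))
print(sorted(evaluate_rmq(query, current)))
print(evaluate_bmq(query, previous, current))
```

With these values the RMQ result changes from {a, b, c, d, e} to {a, b, c, f, g}, while the BMQ reports {f, g} as entering and {d, e} as leaving. The evaluator compares every stream against the query on every update, which is exactly the repeated work that a locality-aware index aims to avoid.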

3. Related Work

A lot of previous research has developed continuous range query evaluation techniques for several application domains. This research, however, has focused only on traditional range queries, i.e. Region Monitoring Queries. In this section, we summarize the previous efforts related to the evaluation of traditional range queries, and discuss the originality and contribution of our work.

Some papers such as [12], [10], and [14] have developed continuous range query indices over moving objects. [12] first proposed a query index for the scalable evaluation of continuous range queries over moving objects. This index is an R-tree based index built on range queries instead of objects and employs safe regions to avoid excessive accesses to the index. As long as it has not moved outside its safe region, an object does not have to report its location. Determining a safe region, however, requires intensive computation, and the index must re-compute the safe region whenever an object moves out of its safe region. Moreover, R-tree based indices exhibit severe performance degradation when many range queries overlap [2][7][13].

Next, [10] proposed a cell-based range query index: a Grid index. The entire monitoring region is partitioned into square cells, and each cell is associated with two lists: a full list and a part list. The part list maintains all the queries that partly intersect the cell, and the full list contains all the queries that fully contain the cell. Although no additional computation is needed for queries in the full list, the index requires intensive computation to check if a query in the part list indeed contains an object in the cell. [10] showed that this cell-based index considerably outperforms the previous index [12]. However, it consumes a large amount of storage in order to provide quality performance. For good performance, the grid cells must be small, which increases the number of cells significantly. Thus, query information would be stored in multiple grid cells redundantly, wasting storage.
To reduce storage consumption, on the other hand, the index must increase the size of grid cells, but this configuration increases the number of queries in the part lists. This requires intensive computation on part lists to retrieve matching queries precisely, and degrades performance drastically.

To reduce the query evaluation overhead of part lists, [14] proposed a COVEring Tile-based range query indexing method, COVET. It assumes that range queries can be completely covered by one or more tiles which are squares or rectangles with various predefined sizes. In detail, every integer grid point in a monitoring region has a number of virtual tiles, and after a query is registered, the tiles that cover the query are activated. Whenever an object reports its position or value, the index finds activated tiles among the candidate virtual tiles (tiles which are calculated out of a difference array and a pivot tile). The index finally finds matching queries using the query information registered at the selected tiles. [14] argues that the index exhibits better performance and consumes less storage than [10], but the index has severe scalability problems. The complexity of the storage size can be as high as $R^2 \cdot (\log_2 R)^2$, as can the complexity of the execution time, where R denotes the resolution of a monitoring region. This is a reason why [14] performed simulations assuming a relatively small region, 500 by 500. However, the resolution is more than one million for general applications which require a large monitoring region and a high resolution of information. In order to use the index in those applications, we have to lower the resolution and approximate range queries so that they may be covered by large tiles. However, these limitations may lead to inaccurate query results. Therefore, COVET is not applicable for large-scale practical applications.
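The full-list/part-list trade-off of the cell-based index discussed above can be illustrated with a minimal sketch. It is not the implementation from [10]: the class name, the half-open rectangles (xl, xu, yl, yu), and the dictionary-based cell storage are all assumptions made for illustration.

```python
import math
from collections import defaultdict


class GridIndex:
    """Toy cell-based query index with a full list and a part list per cell."""

    def __init__(self, cell_size):
        self.cell = cell_size
        self.full = defaultdict(set)   # queries that fully contain the cell
        self.part = defaultdict(set)   # queries that only partly intersect it
        self.queries = {}

    def register(self, qid, rect):
        xl, xu, yl, yu = rect
        self.queries[qid] = rect
        # Visit every cell that the rectangle overlaps.
        for cx in range(math.floor(xl / self.cell), math.ceil(xu / self.cell)):
            for cy in range(math.floor(yl / self.cell), math.ceil(yu / self.cell)):
                cxl, cyl = cx * self.cell, cy * self.cell
                covered = (xl <= cxl and cxl + self.cell <= xu
                           and yl <= cyl and cyl + self.cell <= yu)
                (self.full if covered else self.part)[(cx, cy)].add(qid)

    def lookup(self, x, y):
        key = (math.floor(x / self.cell), math.floor(y / self.cell))
        # Full-list queries match without any further test.
        result = set(self.full.get(key, ()))
        # Part-list queries need an explicit containment check per query.
        for qid in self.part.get(key, ()):
            xl, xu, yl, yu = self.queries[qid]
            if xl <= x < xu and yl <= y < yu:
                result.add(qid)
        return result


index = GridIndex(cell_size=10)
index.register("Q1", (5, 35, 5, 25))
print(index.lookup(15, 15))
print(index.lookup(2, 2))
```

In this sketch a single query is stored in every cell it touches, which shows where the redundancy comes from, and shrinking `cell_size` moves queries from part lists to full lists at the cost of more cells.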

Existing continuous query systems [11][3] have developed a grouped filter index for simultaneous evaluation of multiple continuous queries. The grouped filter index is built on multiple predicates, and efficiently retrieves a set of predicates that match the data. A grouped filter is maintained for each attribute that appears in a query. It consists of four data structures storing query predicates: a greater-than balanced binary tree, a less-than tree, an equality hash-table, and an inequality hash-table. Upon arrival of a data item, each of the data structures is probed, and a bit-mask of matching queries is marked. By scanning the mask, they identify the queries to which the data should be output. Since the grouped filter index separately maintains a greater-than tree and a less-than tree, one-bounded range queries are efficiently supported. For two-bounded range queries, however, additional processing is required to find the queries common to the search results of both trees. This can incur considerable overhead in pruning away non-matching queries when there are a lot of queries in each search result.

In the context of active database rule systems, [8] proposed an Interval Skip List to find rule conditions that match occurring events. They extended a skip list structure to index selection predicates of rule conditions. An Interval Skip List allows efficient retrieval of all rule conditions whose selection intervals contain a given point. However, it supports only one-dimensional intervals. In addition, there is no consideration for the locality of input events in the design of Interval Skip Lists, because the events hardly exhibit any correlation with each other in an active database system.

Q. Hart et al. have proposed a Dynamic Cascade Tree (DCT) to index query regions on a Remotely Sensed Imagery (RSI) data stream [9]. DCT is used to efficiently determine which queries are interested in an incoming RSI data stream.
They exploit the characteristic that consecutive points in a stream of RSI data have close spatial and temporal proximity. DCT consists of three separate list structures based on a skip list to maintain the currently matching query regions over an incoming data stream. However, DCT was developed to handle just a single or small number of data streams in very close proximity. Moreover, it requires a separate list for each data stream and has to continuously update the list according to the data stream’s movement. This is not a scalable solution for a large number of data streams, thus its usage is very limited.

Compared to the aforementioned research, M-LARI presents two original contributions. First, we propose a new semantics for continuous range queries, i.e. BMQs, which will be useful and prevalent in the near future. Developing the concept of BMQs, which was first introduced in our previous work [15], we presented various service scenarios in Section 2 and discussed the usefulness of BMQs in future applications. To the authors’ knowledge, M-LARI is the first work to strongly advocate the importance of BMQs in data stream environments. Second, we develop a high performance index for BMQs over multidimensional data streams that consumes less storage than other indices. M-LARI is the first to utilize the locality of data streams in multiple dimensions in order to achieve high performance. We design locality-aware data structures whose storage consumption is drastically lower than that of other indices. We also develop a novel cross-checking algorithm to identify matching BMQs for an efficient search operation. In the next section, we will present the design and structure of M-LARI, and discuss the technical details featured in M-LARI.

4. M-LARI

In this section, we discuss the characteristics of input data streams and define their multidimensional degree of locality. Then, we describe the data structures of M-LARI and main operations such as query registration/deregistration and search. Finally, we analyze the storage space and computation complexity to determine the advantages of M-LARI.

4.1 Discussion on Data Streams

Continuous query systems, or stream processing systems, assume that a computing environment contains numerous data generators such as sensors, probes, or agents, which generate data continuously in the form of data streams. Many services require that such CQ systems process huge data streams in real-time. In order to achieve high performance stream processing, we investigate the characteristics of data streams. After reviewing new applications presented in recent papers and articles, we have classified various data streams into the following two categories.

Observation Data Streams: This category includes data streams concerning user locations, vital signs, stock prices, temperatures, etc. These data streams are generated by active sensors which send periodic observations. They report (Sensor-ID, Parameter, Value) information, where Sensor-ID denotes the sensor identification, Parameter the name of the data, and Value the value of the parameter.

Event Data Streams: This category includes data streams employed in various RFID-based applications which have recently become popular in the retail and express delivery industries. Once an RFID-tagged object passes by an RFID reader, the reader immediately generates and transmits data about the object. Sensors monitoring vehicles on roadways or packets in the Internet also report information about detected vehicles or packets, whenever a vehicle or a packet passes by. These data streams are generated by passive sensors which detect event occurrences. The data streams contain (Sensor-ID, Event-Info) information, where Sensor-ID and Event-Info denote sensor and event information, respectively.

Based on these categories, the first characteristic that we discuss in this paper is the locality of data streams. First, we argue that Observation Data Streams exhibit a sufficient degree of locality. The conditions of the real world usually do not change suddenly.
Moreover, so as not to miss significant changes, sensors should perform sensing operations frequently. As a result, the values in Observation Data Streams change gradually. Thus, they exhibit a sufficient degree of locality. Second, unfortunately, we cannot argue that Event Data Streams exhibit as high a degree of locality as Observation Data Streams. Event occurrences are quite random and successive events are unrelated to each other.

We consider a new type of data stream called Aggregation Data Streams, which are derived from Event Data Streams. A sensor aggregates its event data periodically and forms Aggregation Data Streams in order to report valuable information. For example, the incoming packet rate in the Internet, the service rate of a server system, and the number of vehicles and their average speed on a roadway over a certain time interval are interesting Aggregation Data Streams. These data streams deliver information in the same format as Observation Data Streams, (Sensor-ID, Parameter, Value). We argue that most Aggregation Data Streams also exhibit a sufficient degree of locality as their aggregation periods are small enough to perceive tiny changes.

As shown in the above discussion, many valuable data streams exhibit sufficient locality. Therefore, we can get performance improvements in many situations when we utilize locality for evaluating continuous range queries. Especially for queries whose results are quite dependent on the change of data stream values, such as Border Monitoring Queries, the use of locality brings significant performance improvement. We already demonstrated the significant performance improvement in the case of one-dimensional data streams in our previous work [15].

Next, we consider the multidimensionality of data streams. A data stream that consists of location information is a good example of a multidimensional data stream which provides valuable information. We have noticed that many valuable applications in location-based services or Telematics services utilize location information obtained from GPS devices or cell-phones. We also consider multi-attribute data streams from tiny sensors capable of observing various conditions as multidimensional data streams. A sensor, for instance, can monitor temperature, humidity, atmospheric pressure, and wind speed at the same time, or the number of vehicles and their average speed on a roadway.
We can derive interesting information about the relationship between those multiple attributes from multidimensional data streams. Taking the locality and multidimensionality of data streams into consideration, we define the locality of multidimensional data streams formally (footnote 2).

• Definition: Multidimensional Degree of Locality (DoL)

$$\mathrm{Fluctuation} = \frac{\sum_{n=1}^{N} \mathrm{Diff}^{*}\big(S(n+1), S(n)\big)}{N}$$

where $S(n)$ denotes the n-th value of stream S, and consists of $(S_{D_1}(n), S_{D_2}(n), \ldots, S_{D_i}(n), \ldots, S_{D_k}(n))$;

$$\mathrm{Diff}^{*}\big(S(n+1), S(n)\big) = \sum_{i} \left( \frac{S_{D_i}(n+1) - S_{D_i}(n)}{unit_{D_i}} \right)^{2}$$

where $S_{D_i}(n)$ denotes the n-th value of the i-th dimension of stream S, and $unit_{D_i}$ denotes the normalization factor of the i-th dimension of stream S.

Degree of Locality:

$$\mathrm{DoL} = \frac{1}{\mathrm{Fluctuation}}$$

Diff*, the dimensionless difference, normalizes the values of each dimension when the dimensions have different resolutions. For example, the X-dimension and the Y-dimension of location information have identical resolutions, but in the case of a multi-attribute data stream which delivers information about temperature and pressure together, the resolution of each dimension is different from that of other dimensions. The dimensionless difference gives a dimension-fair difference by normalizing the resolution of each dimension using $unit_{D_i}$. With this formal definition of DoL, we will discuss the performance of our index and the effect of locality through experiments.

Footnote 2: We already defined the degree of locality of one-dimensional data streams in our previous work [15].

[Figure 3. Matching query and Delta query. A 2D domain with queries Q1, Q2, and Q3, showing the previous and current values of a stream, its previous and current matching queries, and its -delta and +delta queries.]
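The Degree of Locality of Section 4.1 can be computed directly from a recorded stream. The sketch below follows the definition as it appears in the text (a sum of squared, unit-normalized differences, with no square root applied); the function names are hypothetical, and returning infinity for a stream that never changes is an assumption, since the definition leaves 1/0 unspecified.

```python
import math


def diff_star(prev, cur, units):
    """Dimension-fair difference between two consecutive k-dimensional values."""
    return sum(((c - p) / u) ** 2 for p, c, u in zip(prev, cur, units))


def degree_of_locality(stream, units):
    """DoL = 1 / Fluctuation over a list of N + 1 consecutive values."""
    if len(stream) < 2:
        raise ValueError("need at least two values to measure fluctuation")
    n = len(stream) - 1
    fluctuation = sum(
        diff_star(stream[i], stream[i + 1], units) for i in range(n)
    ) / n
    # Assumption: a perfectly constant stream is treated as infinitely local.
    return math.inf if fluctuation == 0 else 1.0 / fluctuation


# Hypothetical two-attribute stream; the second attribute has a coarser unit.
stream = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)]
print(degree_of_locality(stream, units=(1.0, 2.0)))
print(degree_of_locality(stream, units=(1.0, 1.0)))
```

Changing `units` changes how much each dimension contributes to the fluctuation, which is the purpose of the normalization factor: a step of 2 in the second attribute counts the same as a step of 1 in the first when its unit is 2.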

4.2 Data Structure

In this section, we present the data structures of M-LARI with definitions and terms. For simplicity, we assume that M-LARI is deployed in a two-dimensional situation. Let Q denote a set of Border Monitoring Queries whose element Q_i has a query range of (xl_i, xu_i, yl_i, yu_i), where xl_i, xu_i, yl_i, and yu_i are the lower and upper bounds of each dimension. Each query has two result sets, +RSetBMQ and -RSetBMQ, as described in Section 2. Let D denote a two-dimensional domain which has the bounds (bX_min, bX_max, bY_min, bY_max). We define a set of bounds for each dimension. BX is the set consisting of the lower and upper bounds of every Q_i’s range on the X-dimension, together with bX_min and bX_max. BY is constructed in the same manner for the Y-dimension, and we sort the elements of BX and BY in increasing order of their values.
• BX = {bX | bX is either xl_i or xu_i for Q_i ∈ Q, bX_min, or bX_max}
• BY = {bY | bY is either yl_i or yu_i for Q_i ∈ Q, bY_min, or bY_max}
, where bX_0 < bX_1 < ... < bX_n and bY_0 < bY_1 < ... < bY_m
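The bound sets BX and BY defined above can be built in a few lines. This is a minimal sketch with hypothetical names and invented query ranges; queries are given as (xl, xu, yl, yu) tuples and the domain as (bX_min, bX_max, bY_min, bY_max).

```python
def build_bounds(queries, domain):
    """Return the sorted bound lists BX and BY for a set of 2D range queries."""
    bx_min, bx_max, by_min, by_max = domain
    bx, by = {bx_min, bx_max}, {by_min, by_max}
    for xl, xu, yl, yu in queries.values():
        bx.update((xl, xu))
        by.update((yl, yu))
    # Sets remove duplicates, so the sorted lists are strictly increasing.
    return sorted(bx), sorted(by)


# Invented ranges for three queries inside a 100 x 100 domain.
queries = {
    "Q1": (10, 40, 10, 30),
    "Q2": (20, 60, 40, 70),
    "Q3": (30, 50, 20, 60),
}
BX, BY = build_bounds(queries, (0, 100, 0, 100))
print(BX)
print(BY)
```

Three queries with distinct bounds yield eight bounds per dimension and therefore seven intervals between consecutive bounds; bounds shared by several queries collapse into one element.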

In order to better explain the data structures and operations of M-LARI, we define two kinds of query categories: matching queries and delta queries. Figure 3 shows an example of these queries. Q1 and Q2 in the figure contain the previous value of a data stream; we call such queries the matching queries of the data stream. For the current value of the stream, Q2 and Q3 become the matching queries of the stream. On the other hand, Q1 in the figure previously contained the value of a data stream and currently does not; we call such a query the delta query of the data stream, specifically the -delta query. In contrast, Q3 in the figure previously did not contain the value of a data stream and currently does; we call such a query the +delta query of the stream.

[Figure 4. Stream Table and RS Lists. The RS-X List holds segments RS-X1 to RS-X7 bounded by bX0 to bX7, and the RS-Y List holds segments RS-Y1 to RS-Y7 bounded by bY0 to bY7; each segment is annotated with its +DQSet and -DQSet for queries Q1, Q2, and Q3. The Stream Table shown in the figure is:]

StreamID   v            PX      PY      MQSet-X    MQSet-Y
s1         (vX1, vY1)   RS-X2   RS-Y2   {Q1}       {Q1}
s2         (vX2, vY2)   RS-X3   RS-Y5   {Q1,Q2}    {Q2,Q3}
s3         (vX3, vY3)   RS-X5   RS-Y4   {Q2,Q3}    {Q1,Q2,Q3}

[Figure 5. Query Registration example. The figure shows a query Q4 being registered, with new segments RS-X3' and RS-Y6' appearing in the RS-X List and RS-Y List.]

Based on the categorization of continuous range queries, we developed two kinds of data structures: RS (Region Segment) Lists and a Stream Table. To support multiple dimensions, we employ an RS List for each dimension. For two-dimensional data and queries, an RS-X List is maintained for the X-dimension and an RS-Y List for the Y-dimension. An RS-X List is a list of region segments that together comprise the range of the X-dimension (footnote 3). Each region segment, RS-X_i, maintains a tuple (bX_{i-1}, bX_i, +DQSet-X_i, -DQSet-X_i), where
• bX_{i-1} is the lower bound of RS-X_i
• bX_i is the upper bound of RS-X_i, where bX_* ∈ BX
• +DQSet-X_i is the set of queries Q_k whose xl_k is equal to bX_{i-1}, i.e., +DQSet-X_i = {Q_k | xl_k = bX_{i-1} for Q_k ∈ Q}
• -DQSet-X_i is the set of queries Q_k whose xu_k is equal to bX_{i-1}, i.e., -DQSet-X_i = {Q_k | xu_k = bX_{i-1} for Q_k ∈ Q}

An RS-Y List is a list of region segments that comprise the range of the Y-dimension. Each RS-Yi also maintains a tuple (bYi-1, bYi, +DQSet-Yi, -DQSet-Yi), where the bounds and ±DQSet-Yi are defined in the same manner as for the X-dimension. By this definition, every query is inserted into a +DQSet and a -DQSet once for each dimension (footnote 4).

The Stream Table contains an entry for each data stream. Each entry consists of StreamID, v, PX, PY, and one MQSet per dimension; in the two-dimensional case, these are MQSet-X and MQSet-Y. StreamID is the identifier of a data stream, and v is the current value of the data stream, (vX, vY). PX is the pointer to the region segment RS-Xi that contains the current value of the X-dimension. PY is the pointer to the region segment RS-Yi that contains the current value of the Y-dimension. MQSet-X is the set of matching queries for the X-dimension, and MQSet-Y is the set of matching queries for the Y-dimension:

• MQSet-X = {Qk | xlk ≤ vX < xuk for Qk ∈ Q}
• MQSet-Y = {Qk | ylk ≤ vY < yuk for Qk ∈ Q}

Footnote 3: An RS List is implemented as a doubly linked list.
Footnote 4: We do not store an entire query string in the sets, just a query ID.
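As an illustration of the Stream Table entry and the MQSet definitions, the following sketch computes one entry directly from the definitions, without any incremental maintenance. The names StreamEntry, matching_set, and make_entry are hypothetical, segment numbers stand in for the pointers PX and PY, and the bounds, query ranges, and stream value are made-up numbers chosen so that the result agrees with the row for s2 in Figure 4.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class StreamEntry:
    stream_id: str
    v: tuple        # current value (vX, vY)
    px: int         # number i of the segment RS-Xi holding vX (stands in for PX)
    py: int         # number j of the segment RS-Yj holding vY (stands in for PY)
    mqset_x: set
    mqset_y: set


def matching_set(value, ranges):
    """Query IDs whose half-open range [lo, hi) contains `value`."""
    return {qid for qid, (lo, hi) in ranges.items() if lo <= value < hi}


def make_entry(stream_id, v, bounds_x, bounds_y, ranges_x, ranges_y):
    vx, vy = v
    # bisect_right returns i such that bounds[i-1] <= value < bounds[i],
    # which is the 1-based number of the containing region segment.
    return StreamEntry(stream_id, v,
                       px=bisect_right(bounds_x, vx),
                       py=bisect_right(bounds_y, vy),
                       mqset_x=matching_set(vx, ranges_x),
                       mqset_y=matching_set(vy, ranges_y))


# Hypothetical bounds and query ranges, reused for both dimensions.
BOUNDS = [0, 10, 20, 30, 40, 50, 60, 70]
RANGES = {"Q1": (10, 40), "Q2": (20, 60), "Q3": (30, 50)}

s2 = make_entry("s2", (25, 45), BOUNDS, BOUNDS, RANGES, RANGES)
print(s2.px, s2.py, sorted(s2.mqset_x), sorted(s2.mqset_y))

# Delta queries, taken straight from their definition, for a value
# that moves from 25 to 45 in one dimension.
previous = matching_set(25, RANGES)
current = matching_set(45, RANGES)
minus_delta = previous - current   # matched before, no longer matches
plus_delta = current - previous    # did not match before, matches now
print(sorted(minus_delta), sorted(plus_delta))
```

Recomputing the sets this way costs a scan over all queries per update; the ±DQSets exist so that only the crossed segments need to be visited.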

In the example given in Figure 4, the Stream Table has three entries for the three data streams s1, s2, and s3, which hold the current values v(s1), v(s2), and v(s3), respectively. Two RS Lists are built for three BMQs: Q1, Q2, and Q3. The six query bounds in each dimension divide its range into seven region segments, each with lower and upper bounds, a +DQSet, and a -DQSet as described above. For example, PX(s2) points to RS-X3, PY(s2) points to RS-Y5, MQSet-X(s2) is {Q1, Q2}, and MQSet-Y(s2) is {Q2, Q3}, as shown in the figure. When a new value arrives for a data stream s, MQSet-X(s) and MQSet-Y(s) are incrementally updated by manipulating the queries in the ±DQSet-Xi and ±DQSet-Yi of the region segments involved. We design efficient operations to manage and evaluate BMQs using the Stream Table and RS Lists.
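The incremental update described above might look as follows for a single dimension. This is a sketch under stated assumptions, not the paper's algorithm: the function name update_1d, the list-based layout (segment i spans bounds[i-1] to bounds[i], and plus_dq[i] and minus_dq[i] hold the sets attached to bounds[i-1]), and the numeric values are all hypothetical. The cancellation of queries that are both entered and left in one jump is also an assumption, based on the definition of delta queries.

```python
def update_1d(bounds, plus_dq, minus_dq, seg, mqset, v):
    """Walk from segment `seg` to the segment containing `v`.

    Returns (new_seg, new_mqset, plus_delta, minus_delta).
    Assumes bounds[0] <= v < bounds[-1].
    """
    assert bounds[0] <= v < bounds[-1]
    plus_delta, minus_delta = set(), set()
    i = seg
    while v < bounds[i - 1]:           # moving down across the lower bound of segment i
        plus_delta |= minus_dq[i]      # an upper bound crossed downward: value enters
        minus_delta |= plus_dq[i]      # a lower bound crossed downward: value leaves
        i -= 1
    while v >= bounds[i]:              # moving up across the lower bound of segment i+1
        i += 1
        plus_delta |= plus_dq[i]       # a lower bound crossed upward: value enters
        minus_delta |= minus_dq[i]     # an upper bound crossed upward: value leaves
    # A query whose whole range was jumped over matched neither before nor after.
    skipped = plus_delta & minus_delta
    plus_delta -= skipped
    minus_delta -= skipped
    new_mqset = (mqset - minus_delta) | plus_delta
    return i, new_mqset, plus_delta, minus_delta


# Hypothetical layout: Q1 = [10, 40), Q2 = [20, 60), Q3 = [30, 50).
BOUNDS = [0, 10, 20, 30, 40, 50, 60, 70]
plus_dq = [set() for _ in BOUNDS]      # index 0 is unused; segments are 1..7
minus_dq = [set() for _ in BOUNDS]
plus_dq[2], plus_dq[3], plus_dq[4] = {"Q1"}, {"Q2"}, {"Q3"}
minus_dq[5], minus_dq[6], minus_dq[7] = {"Q1"}, {"Q3"}, {"Q2"}

# The stream sits in segment 3 with MQSet {Q1, Q2}; its value moves to 45.
up = update_1d(BOUNDS, plus_dq, minus_dq, 3, {"Q1", "Q2"}, 45)
print(up)
```

When the new value stays inside the current segment, both loops are skipped and the delta sets are empty, which corresponds to the "hit" case of the evaluation.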

4.3 Query Registration/Deregistration

M-LARI supports dynamic query registration and deregistration. This section explains these operations on the Stream Table and RS Lists in detail.

Query Registration: Assume that a new query Qk, whose range is (xlk, xuk, ylk, yuk), is given. First, the operation finds the region segment in the RS-X List that contains xlk, RS-Xi. If xlk is greater than bXi-1, RS-Xi is split at xlk into two region segments, one below and one above xlk, and their bounds are set accordingly. The left segment keeps the ±DQSets of the original segment, and the +DQSet-X of the right segment contains Qk. If xlk is equal to bXi-1, no split is needed and Qk is registered in the +DQSet-Xi of RS-Xi. Next, the operation finds the region segment that contains xuk, RS-Xj. If xuk is greater than bXj-1, RS-Xj is likewise split at xuk into two region segments, and Qk is registered in the -DQSet-X of the right segment. If xuk is equal to bXj-1, Qk is registered in the -DQSet-Xj of RS-Xj. For the Y-dimension, the same operations are executed using ylk, yuk, and the RS-Y List. Finally, Qk is added to the Stream Table's MQSets of the data streams whose current value is contained in Qk.
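The registration steps above can be sketched for one dimension as follows. The names Seg, attach, and register_1d, the dictionary layout of the stream entries, and all numeric values are hypothetical. A Python list stands in for the RS List, and maintenance of the PX pointers after a split is left out. The final loop applies the MQSet-X definition to a single dimension.

```python
from dataclasses import dataclass, field


@dataclass
class Seg:
    lower: float
    upper: float
    plus_dq: set = field(default_factory=set)
    minus_dq: set = field(default_factory=set)


def attach(segs, bound, qid, dq_name):
    """Register `qid` at `bound`, splitting the containing segment if needed."""
    i = next(k for k, s in enumerate(segs) if s.lower <= bound < s.upper)
    seg = segs[i]
    if bound > seg.lower:
        # Split: the left part keeps the original ±DQSets,
        # the new right part starts with empty sets.
        right = Seg(bound, seg.upper)
        seg.upper = bound
        segs.insert(i + 1, right)
        seg = right
    getattr(seg, dq_name).add(qid)


def register_1d(segs, streams, qid, xl, xu):
    attach(segs, xl, qid, "plus_dq")    # lower bound goes into a +DQSet
    attach(segs, xu, qid, "minus_dq")   # upper bound goes into a -DQSet
    for entry in streams.values():      # add qid where the current value matches
        if xl <= entry["vx"] < xu:
            entry["mqset_x"].add(qid)


# Hypothetical domain [0, 70) with one stream whose X value is 25.
segs = [Seg(0, 70)]
streams = {"s2": {"vx": 25, "mqset_x": set()}}
for qid, (xl, xu) in {"Q1": (10, 40), "Q2": (20, 60), "Q3": (30, 50)}.items():
    register_1d(segs, streams, qid, xl, xu)

# Both bounds of Q4 coincide with existing bounds, so no segment is split.
register_1d(segs, streams, "Q4", 20, 50)
print(len(segs), sorted(streams["s2"]["mqset_x"]))
```

Splitting only when a bound falls strictly inside a segment keeps the number of segments at most one more than the number of distinct query bounds.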

/* For input: v (vX, vY) of s, and output: ±RSetBMQ(s) */

Find a stream entry, s, in Stream Table;
Read two segments RS-Xa and RS-Yb by PX(s) and PY(s);

/* hit */
if bXa-1 ≤ vX < bXa and bYb-1 ≤ vY < bYb, then
    return empty ±RSetBMQ;

/* miss */
/* Initialize variables */
±DQSet-X(s), ±DQSet-Y(s), ±RSetBMQ(s) = {};

/* Build +DQSet-X(s) and -DQSet-X(s) */
if ( vX < bXa-1 ), then
    for ( i = a; bXi-1 ≤ vX < bXi; i-- ) {
        -DQSet-Xi → +DQSet-X(s);
        +DQSet-Xi → -DQSet-X(s);
    }
else if ( vX ≥ bXa ), then
    for ( i = a; bXi-1 ≤ vX
