Answering Arbitrary Conjunctive Queries over Incomplete Data Stream Histories 1)

Alasdair J.G. Gray (1), M. Howard Williams (1) and Werner Nutt (2)

(1) Heriot-Watt University, Edinburgh, UK.
(2) Free University of Bozen, Italy.

[email protected], www.macs.hw.ac.uk/magik-i

Abstract: Streams of data often originate from many distributed sources. A user wanting to query the streams should not need to know where each stream originates, but should instead be provided with a global view of the streams. R-GMA is a system that integrates distributed data streams to provide such a global view for users to query. R-GMA has been developed as a grid information and monitoring system, although the techniques developed can be applied wherever there is a need to publish and query distributed streams. A stream is important not only for its current values but also for the values it has produced in the past. To support this, the history of the stream must be archived and stream processing systems must support history queries. However, one problem which then arises is that data streams published by distributed sources are prone to missing data values, e.g. due to a network failure. When a stream has missed some values, its stored history contains gaps. This paper considers how to generate the most complete answer possible to a positive conjunctive query over the available stream history. A model for representing the incompleteness in the stream history is provided, along with an algorithm that distinguishes when and how the missing data affects the answer to a query.

1 Introduction

Data streams appear in a variety of settings and often originate at distributed sources. Typical examples include traffic monitoring, sensor networks, and status information. While the processing of data streams has been the focus of much research in recent years [6], this has predominantly been done in a centralised setting, with little attention paid to the past behaviour of the stream. It is not just the current values of a data stream that are of interest: historical data is useful for identifying trends and patterns. By storing the past content of a stream, the data can then be queried to see how behaviour changes over time, to pick out events with particular characteristics, or to compare events.

For example, consider the problem of monitoring resources on a grid. Each resource is instrumented with sensors and scripts that capture and publish key characteristics of the resource; for a computing element, for instance, it is useful to know the number of free CPUs and the number of running jobs. Since the resources of the grid are distributed, the streams of monitoring information will also be distributed. One problem is that this distribution can often give rise to data being lost due to network failures, communication errors, etc. This leads to incomplete historical data sets. A user posing a query for historical data should not be aware that data is missing from the system unless the query cannot be answered completely. When the query cannot be answered completely, it would be helpful if the system could provide details of how the answer returned compares with what would have been returned had the data been complete.

This paper presents (i) a model for expressing when the history of a data stream is incomplete, and (ii) an algorithm that decides whether a positive conjunctive query can be answered completely when the histories of the streams contain gaps. If a query cannot be answered completely, then the answer that is returned can be annotated with appropriate information to allow the user to make an informed decision about the effects of the incompleteness.

Section 2 provides an overview of R-GMA, a data stream integration system for publishing monitoring data about grid resources, and gives an illustrative example from the domain of grid monitoring to motivate the problem addressed in this paper. Section 3 presents the model for representing missing data in the history of a stream and the algorithm for computing answers in the presence of missing data. Related work is considered in Section 4 and our conclusions are presented in Section 5.

1) This work was supported by the UK Research Council EPSRC grant MAGIK-I GR/S44839/01.

2 Publishing Distributed Data Streams

R-GMA is a grid information and monitoring system that allows users to locate monitoring data of interest without knowledge of where it is published. References [4] and [5] provide an overview of the architecture of R-GMA, details of the query planning mechanisms, and performance measures of the system; however, neither of them covers the problem of incompleteness. R-GMA continues to be developed as part of the EGEE Grid infrastructure, which has been deployed on several grids including the Large Hadron Collider Grid. R-GMA is a local-as-view information integration system for data streams. The architecture is shown in Figure 1 and consists of Primary Producers (which publish monitoring data as a stream according to some view description), Secondary Producers (which pose a query and

publish the resulting stream), Consumers (which retrieve specific monitoring data by posing a query), and a Registry (which matches Consumer requests with Producer descriptions).

Figure 1: The architecture of R-GMA

Primary Producers publish their data as a stream of tuples conforming to a selection view description over an agreed global schema. A Primary Producer may additionally maintain a history buffer of the stream, i.e. all the tuples that have been published during some allotted period of time, and/or a latest-state buffer, which contains the most recent tuple seen for a given value of the key.

To access the data, Consumers may pose three different types of query over the agreed global schema. The first, a continuous query, returns every new tuple that satisfies the query condition. The second, a history query, returns all the previously published tuples that satisfy the query condition and fall in the query's stated time period. The third, a latest-state query, returns the most recent tuple for each of the key values that satisfy the query.

To illustrate the problem addressed in this paper, an example from the grid monitoring domain is used. Consider the following relations for providing status information about resources on the grid:

compEle(CEId, freeCPUs, runningJobs, ts), storEle(SEId, currentIO, ts), CESEBind(CEId, SEId).

The relation compEle contains information about the number of free CPUs and the number of running jobs at a particular time instant ts. The storEle relation contains information about the I/O load on a storage element. The CESEBind relation shows which computing elements and storage elements are linked; it does not contain a ts attribute since this information is not expected to change much over time. In each relation the primary key consists of the identifying attribute together with the timestamp ts where present, i.e. CEId and ts for compEle, SEId and ts for storEle, and both attributes of CESEBind. An instance of the histories of these global relations is shown in Table 1; for the two streams, data is missing where there is a blank line.

compEle
CEId  freeCPUs  runningJobs  ts
1     4         1            1
2     1         6            1

1     2         5            5
2     5         2            5
..    ..        ..           ..

CESEBind
CEId  SEId
2     10

storEle
SEId  currentIO  ts
10    30         1
20    80         1
10    27         2

10    23         6
20    60         6
..    ..         ..

Table 1: An instance of the global relations

The static CESEBind relation is considered to be complete since its data does not change often. It is assumed that the histories carry on with no more missing data and no additional grid resources. The values in the ts attribute represent the hour at which the reading was taken; in reality, a much finer granularity of timestamp and a higher frequency of data capture would be used.

Three queries that might be posed over this data instance now follow. Each query is first expressed as an English statement and then as a conjunctive query. The conjunctive query notation has been extended to express the length of history that should be considered in the query; this is written as an additional conjunct in square brackets. All conjunctive queries can be expressed as simple select-project-join SQL queries (a sketch of such a translation is given after the query list).

1. Find all machines that have had more than 5 running jobs in the last 24 hours.
q(CEId) ← compEle(CEId, freeCPUs, runningJobs, ts) ∧ runningJobs > 5 ∧ [history = 24 hrs]

2. Find all machines that have had more than 5 running jobs in the last 12 hours and are linked to a storage element.
q(CEId) ← compEle(CEId, freeCPUs, runningJobs, ts) ∧ CESEBind(CEId, SEId) ∧ runningJobs > 5 ∧ [history = 12 hrs]

3. Find all machines that have had more than 5 running jobs in the last 24 hours which are linked to a storage element that has had an I/O load of greater than 75 in the same period.
q(CEId) ← compEle(CEId, freeCPUs, runningJobs, ts1) ∧ storEle(SEId, currentIO, ts2) ∧ CESEBind(CEId, SEId) ∧ runningJobs > 5 ∧ currentIO > 75 ∧ [history = 24 hrs]
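To make the SQL remark concrete, the following sketch shows how queries 1 and 3 might be written as select-project-join SQL over the global schema. The :now parameter, the hour-based timestamp arithmetic and the exact dialect are assumptions made purely for illustration; they are not part of R-GMA.

# Hypothetical translation of queries 1 and 3 into select-project-join SQL,
# written here as Python string constants. The table and column names follow
# the global schema above; ":now" and the hour-based arithmetic are assumed.

QUERY_1 = """
SELECT DISTINCT c.CEId
FROM   compEle c
WHERE  c.runningJobs > 5
  AND  c.ts > :now - 24                -- [history = 24 hrs]
"""

QUERY_3 = """
SELECT DISTINCT c.CEId
FROM   compEle c
       JOIN CESEBind b ON b.CEId = c.CEId
       JOIN storEle  s ON s.SEId = b.SEId
WHERE  c.runningJobs > 5
  AND  s.currentIO > 75
  AND  c.ts > :now - 24                -- ts1 and ts2 are independent variables,
  AND  s.ts > :now - 24                -- so the window is the only temporal restriction
"""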

Note that the semantics of a join between the histories of two streams is not immediately obvious. One semantics would permit a join between two tuples only if their timestamps are the same (possibly to within some threshold, to allow for differences between distributed clocks). An alternative semantics is that the timestamps do not affect the join unless explicitly declared. In query 3 above, different variables have been used for the timestamp attributes of the two streams, which means that a tuple in the history of one stream may be joined with any tuple in the history of the other stream. This matches the semantics used in Data Stream Management Systems, where a join is processed over a window of data; here the window is defined to be the history asked for in the query. The two semantics are contrasted in the sketch below.
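The following minimal sketch contrasts the two join semantics using plain Python tuples for the stream histories. The data values and the one-hour clock-skew threshold are illustrative assumptions, not values taken from R-GMA or Table 1.

# Histories restricted to the queried window, as (CEId, freeCPUs, runningJobs, ts)
# and (SEId, currentIO, ts) tuples; CESEBind is a complete static relation.
comp_hist = [(2, 1, 6, 1)]
stor_hist = [(10, 80, 2), (10, 60, 6)]
cese_bind = {(2, 10)}

# Window semantics: timestamps do not affect the join, so any pair of tuples
# inside the history window may be combined.
window_join = [(c, s) for c in comp_hist for s in stor_hist
               if (c[0], s[0]) in cese_bind]

# Timestamp semantics: tuples join only if their timestamps agree to within
# a threshold that allows for differences between distributed clocks.
SKEW = 1
ts_join = [(c, s) for c in comp_hist for s in stor_hist
           if (c[0], s[0]) in cese_bind and abs(c[3] - s[2]) <= SKEW]

print(len(window_join), len(ts_join))   # 2 pairs vs. 1 pair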

3 Incomplete Data Stream Histories

This section details the model developed for representing the data missing from a stream history. The effects of the missing data on positive conjunctive queries are described, along with an algorithm for generating the most complete answer possible.

3.1 Representing Missing Information

In order to be able to answer a query over the history of a stream, the stream must be archived for later access. In R-GMA this can be achieved by configuring a Primary or Secondary Producer to maintain a history for some period of time. Every tuple published by that producer is then stored in a database until its timestamp falls outside the producer's history period, at which point it is discarded. The length of history maintained by each producer therefore depends on that producer's history period.

The planning mechanisms in R-GMA allow arbitrary SQL history queries to be posed, provided that there is a producer that collects all of the relevant data. The assumption in the current system is that the data in such a producer is complete. However, this is not always the case. For example, a Secondary Producer may miss values if its data sources publish at too high a rate, or it may have lost contact with one or more of its sources due to some temporary network failure. It is important to note that there is no single producer which contains all of the data published. While it is currently possible to create a Secondary Producer that publishes an entire stream, in the future this will not be feasible due to the anticipated size of future grids.

For the purposes of this work, the complete virtual database would contain a tuple for each sensor reading made. Additionally, it is assumed that there is a query processing service [8]

that converts the query over the global schema into separate sub-queries over the various producers available to retrieve the required data. In order to be able to handle the incompleteness of the producers and their different history retention periods, the query planning service requires a mechanism for describing the "gaps" in the global streams. In the R-GMA setting this is possible on a per-producer basis, as the views provide a relationship between a producer and the stream that it publishes.

A data stream can be thought of as a collection of channels. Each channel is a maximal substream in which all of the tuples agree on their key values except the timestamp. Since each producer describes the data that it publishes using a view on the global schema, which for each relation has a defined primary key, the producer's view effectively describes a set of channels that it publishes. The values on those channels may be restricted by the view.

For the purposes of this paper it is assumed that a producer can detect when it has potentially missed data. For example, when a Secondary Producer is unable to contact a data source, it will know how long it has been out of contact with that source. Alternatively, when a Primary Producer is unable to cope with the input rate of its source, it will know how long this state lasts. Each period during which the producer has missed data should be declared as a "gap" in its history. Other systems may rely on different techniques for detecting when the data is incomplete.

The gaps in the global stream can be derived by combining all of the gap information for each producer of each channel. The query planning service can then construct a plan using the data available in the sources to answer the query. The gap information can be stored on a per-channel basis. Additionally, since a channel can only exist if at least one tuple has been published on it, all of the channels are known to the system.
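As a concrete illustration of this model, the following is a minimal sketch, under naming and representation assumptions of our own, of how per-channel gap information might be recorded; it is not R-GMA code.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Interval = Tuple[int, int]              # (start_ts, end_ts) of a declared gap

@dataclass
class Channel:
    key: Dict[str, object]              # key values excluding the timestamp, e.g. {"CEId": 2}
    gaps: List[Interval] = field(default_factory=list)

    def declare_gap(self, start: int, end: int) -> None:
        """Record a period during which data on this channel may have been missed."""
        self.gaps.append((start, end))

    def has_gap_in(self, window_start: int, window_end: int) -> bool:
        """Does any declared gap overlap the queried history window?"""
        return any(s <= window_end and e >= window_start for s, e in self.gaps)

# Example: the compEle channel CEId = 2 with a gap between hours 2 and 4.
ce2 = Channel(key={"CEId": 2})
ce2.declare_gap(2, 4)
print(ce2.has_gap_in(0, 24))            # True: a last-24-hours query is affected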

3.2 Answering Conjunctive Queries

When the history of a data stream is not complete, the effects of this incompleteness on the answer to a query must be considered. There are some cases where a query can still be answered completely. However, when the available data cannot answer the query completely, there are different types of answer that can be returned. A tuple t in the available database is a member of one of three sets of answers:

Certain Positive Answer: Tuple t would be returned if the query were posed over the full set of data.

Certain Negative Answer: Tuple t would not be returned if the query were posed over the full set of data.

Possible Answer: Tuple t may be returned if the query were posed over the full set of data; the status of the answer depends on the content of the gaps.

These answer sets do not cover all of the information that is available. Consider a query that asks for some non-key attribute to be returned, and a channel whose definition does not contradict the query condition but which contains gaps. The data missed by the gaps is unavailable, so there is no tuple to place in one of the above answer sets, although some of the missed tuples could potentially be returned if the query were posed over the complete database. Therefore, details of the channel and its gaps should be returned to inform the user that there are potentially more answers.

Algorithm 1 provides a mechanism for answering a selection query. For the purposes of the algorithm, a channel definition c is a condition of the form a1 = v1 ∧ … ∧ ak = vk, where a1, …, ak are the key attributes (except the timestamp attribute) of r restricted by c and v1, …, vk are scalar values. The algorithm analyses the query in order to classify the available data into the answer sets.

Algorithm 1: Generate the answer sets for a selection query q := πA(σC(r))
  for all channels c of r do
    if c ∧ C is satisfiable then
      if A only contains primary key attributes then
        if there exists a tuple t on c such that t satisfies C then
          the projection of t on A is a certain positive answer
        else if c contains a gap then
          the projection of c on A is a possible answer
        else
          the projection of c on A is a certain negative answer
      else
        for all tuples t on c do
          if t satisfies C then
            the projection of t on A is a certain positive answer
          else
            the projection of t on A is a certain negative answer
        if c contains a gap then
          details of the gaps are reported as potentially missed answers
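The following is a sketch of Algorithm 1 in Python, under data-representation assumptions of our own: each channel is a dict holding its key values, its stored tuples (as dicts) and its gap intervals; the condition C is a predicate over a tuple; and the satisfiability check for c ∧ C is passed in as a function, since it depends on the constraint language used.

def classify(channels, A, C, key_attrs, channel_satisfiable):
    """Classify the available data into the answer sets of Algorithm 1."""
    certain_pos, certain_neg, possible, missed = [], [], [], []
    for ch in channels:
        if not channel_satisfiable(ch):          # c ∧ C unsatisfiable: skip channel
            continue
        if set(A) <= set(key_attrs):             # projection onto key attributes only
            proj = tuple(ch["key"][a] for a in A)
            if any(C(t) for t in ch["tuples"]):
                certain_pos.append(proj)
            elif ch["gaps"]:
                possible.append(proj)
            else:
                certain_neg.append(proj)
        else:                                    # non-key attributes requested
            for t in ch["tuples"]:
                proj = tuple(t[a] for a in A)
                (certain_pos if C(t) else certain_neg).append(proj)
            if ch["gaps"]:
                missed.append((ch["key"], ch["gaps"]))   # potentially missed answers
    return certain_pos, certain_neg, possible, missed

# Applying the sketch to query 1 over the Table 1 instance (the gap positions
# are those assumed in the running example, and all timestamps are taken to
# lie within the 24 hour window):
ch1 = {"key": {"CEId": 1}, "gaps": [(2, 4)],
       "tuples": [{"CEId": 1, "freeCPUs": 4, "runningJobs": 1, "ts": 1},
                  {"CEId": 1, "freeCPUs": 2, "runningJobs": 5, "ts": 5}]}
ch2 = {"key": {"CEId": 2}, "gaps": [(2, 4)],
       "tuples": [{"CEId": 2, "freeCPUs": 1, "runningJobs": 6, "ts": 1},
                  {"CEId": 2, "freeCPUs": 5, "runningJobs": 2, "ts": 5}]}
pos, neg, poss, missed = classify([ch1, ch2], A=["CEId"],
                                  C=lambda t: t["runningJobs"] > 5,
                                  key_attrs={"CEId"},
                                  channel_satisfiable=lambda ch: True)
print(pos, neg, poss)                            # [(2,)] [] [(1,)]

Under these assumptions the sketch reproduces the answer sets for query 1 discussed below.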

The execution of a join query would be done in a similar manner. If there are any static relations involved, these are consulted first in order to reduce the number of channels considered. Each stream relation involved in the join is then processed to discover which tuples on which channels satisfy the part of the query condition relating to that relation. This information is then combined to produce the answer returned to the query.

The approach to query execution will now be illustrated by considering the example from Section 2. The answer sets to query 1 over the stream history instance would be:

Certain Positive Answer: { (2) }.
Certain Negative Answer: ∅.
Possible Answer: { (1) }.

The certain negative answer set is empty since, because of the gaps in the data set, no answers can be identified that definitely do not satisfy the condition. Since the query projects onto the key attributes (the channel descriptor), there are no potentially missed answers and all the data is accounted for. Due to the use of set semantics, the gap in the history of the channel CEId = 2 does not lead to any values being missed from the final answer.

Query 2 can be answered completely from the available information. Since the CESEBind table is complete, only the channel for CEId = 2 needs to be considered. The tuple (2) would be returned, and the gaps on the channel can lead to no additional answers.

For query 3, the only answer returned would be the possible answer { (2) }. The information in the join relation CESEBind limits the set of tuples considered in the stream relations. Only the tuple (2, 1, 6, 1) in the compEle relation satisfies the condition runningJobs > 5. However, there is no tuple in storEle that satisfies the condition currentIO > 75 which can be joined with it, although there is a gap in the history, and this results in the possible answer.

When a gap on a channel affects the answer returned to a query, additional meta-data about the gaps and the data on the channel can be provided. This channel meta-data can contain information about the percentage of the channel that was missing, the frequency with which the available data was published, and the maximum, minimum and average of the values available. For example, the answer to query 3 would be annotated with:

Channel: SEId = 10
Coverage: 90%
Frequency: 1 tuple per hour
Maximum currentIO: 35
Minimum currentIO: 23
Average currentIO: 26.67

This additional meta-data allows the user to make an informed decision about the accuracy of their answer. For instance, from the meta-data above the user could reasonably conclude that it is very unlikely that there were any computing elements with more than 5 running jobs linked to a storage element with an I/O load of more than 75%.
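As an illustration of how such annotations might be derived, the sketch below computes channel meta-data from the values and gap intervals available within the queried window. The formulas and field names are our own assumptions; the paper does not prescribe how the meta-data is calculated.

def channel_metadata(values, gaps, window_start, window_end):
    """values: attribute values available on the channel within the window;
    gaps: list of (start, end) gap intervals declared for the channel."""
    window_len = window_end - window_start
    missing = sum(min(e, window_end) - max(s, window_start)
                  for s, e in gaps
                  if s < window_end and e > window_start)
    covered = window_len - missing
    return {
        "coverage": covered / window_len,        # fraction of the window with data
        "frequency": len(values) / covered,      # tuples per hour of available data
        "maximum": max(values),
        "minimum": min(values),
        "average": sum(values) / len(values),
    }

# e.g. channel_metadata(values=[30, 27, 23], gaps=[(3, 5)], window_start=0, window_end=24)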

4 Related Work

Since the conception of data stream processing systems, quite a few systems have been developed, the most notable being the STREAM system [2]. Many of the issues underpinning the processing of a data stream in a centralised setting are presented in [6]. The Borealis system extends these notions to provide a distributed stream processing system [1]. None of the existing data stream systems has focused on archiving the data on a stream for later processing, although some mechanisms for summarising the past content, such as synopses and digests, have been developed [6].

The publication and processing of experimental data as a stream on a grid has also been studied. The Calder system [10] concentrates on providing a grid service interface to a query processing engine. The StreamGlobe system [9] is a peer-to-peer system for publishing data streams on a grid.

There has been a considerable amount of work on incompleteness in traditional databases, e.g. the use of null values [3]. However, the history of a data stream has additional properties which can be exploited to provide more complete answers in certain circumstances, e.g. query 2 in Section 3.2. The effects of incomplete data sources have been considered in information integration systems, resulting in the concepts of certain and possible answers [7]. This work has considered how these concepts can be applied to queries over the history of a data stream.

5 Conclusions

Distributed data streams are often incomplete. If these streams are archived for later access, then their histories will also be incomplete. This paper proposes a model for describing the incompleteness present in the history of an integrated set of data streams.

When a conjunctive query is posed over an incomplete data stream history and the complete answer cannot be retrieved, the available data falls into one of three sets: (i) certain positive answers, (ii) certain negative answers, and (iii) possible answers. An algorithm was developed to generate these answer sets for a positive conjunctive query. These answer sets were shown not to cover all of the knowledge available: there are occasions when information about the missing data can help inform the user about their answer. When the full answer is not available, additional meta-data can be derived that helps inform the user about the effects of the incompleteness.

Other techniques for handling incompleteness in data stream histories are still being explored, under different assumptions and with different classes of query. Once this work has been completed, it is planned to develop an implementation as an extension to the R-GMA system.

References

[1] D.J. Abadi, Y. Ahmad, M. Balazinska, et al. The design of the Borealis stream processing engine. In Proceedings of the Second Biennial Conference on Innovative Data Systems Research (CIDR 2005), pages 277–289, Asilomar (CA, USA), January 2005. On-line proceedings.

[2] A. Arasu, B. Babcock, S. Babu, et al. STREAM: The Stanford data stream management system. In Data-Stream Management: Processing High-Speed Data Streams. Springer-Verlag, New York (NY, USA), 2007. To appear.

[3] E.F. Codd. Missing information (applicable and inapplicable) in relational databases. SIGMOD Record, 15(4):53–78, December 1986.

[4] A. Cooke, A.J.G. Gray, and W. Nutt. Stream integration techniques for grid monitoring. Journal on Data Semantics, 2:136–175, 2005.

[5] A.W. Cooke, A.J.G. Gray, W. Nutt, et al. The relational grid monitoring architecture: Mediating information about the grid. Journal of Grid Computing, 2(4):323–339, December 2004.

[6] L. Golab and M.T. Özsu. Issues in data stream management. SIGMOD Record, 32(2):5–14, June 2003.

[7] G. Grahne and V. Kiricenko. Partial answers in information integration systems. In International Workshop on Web Information and Data Management (WIDM 2003), pages 98–101, New Orleans (LA, USA), November 2003.

[8] D. Kossmann. The state of the art in distributed query processing. ACM Computing Surveys, 32(4):422–469, December 2000.

[9] B. Stegmaier, R. Kuntschke, and A. Kemper. StreamGlobe: Adaptive query processing and optimization in streaming P2P environments. In Proceedings of the 1st International Workshop on Data Management for Sensor Networks (DMSN 2004), pages 88–97, Toronto (Canada), August 2004.

[10] N. Vijayakumar, Y. Liu, and B. Plale. Calder query grid service: Insights and experimental evaluations. In 6th International Symposium on Cluster Computing and the Grid (CCGrid 2006), pages 539–543, Singapore (Singapore), May 2006. IEEE Computer Society.