Optimizations Enabled by a Relational Data Model View to Querying Data Streams

Beth Plale and Karsten Schwan
College of Computing, Georgia Institute of Technology

Abstract

Streaming data is growing in prevalence as connectivity increases and data streaming sources proliferate. But getting precisely the data one needs from data streams can be difficult. Given the low resource capabilities of some clients, the decision about which data to keep often must be made `upstream' of the client. We postulate that the popularity of SQL for querying relational databases makes the language a viable solution to retrieving data from data streams. In response, we have developed a system, dQUOB, that uses SQL queries to extract data from streaming data in real time. The high performance needs of, say, scientific visualization motivate our search for optimizations to improve query evaluation efficiency. The primary purpose of this paper is to discuss the unique optimizations we have realized by taking a database point of view to streaming data.

1 Introduction

Passage of time and widespread adoption have made the benefits of queries as a means of retrieving data from databases widely known to computer science specialists and non-specialists alike. The wide popularity of SQL as the standard query language for relational database management systems attests to that particular language's ability to satisfy a user's need for data of interest. That is, its widespread use for a wide range of applications is informal testament to its expressiveness.

From the recent explosion of the Internet and the ubiquity of computers has emerged a new data source, however: the computational data stream. Data streams are generally regarded as event data in transit from some source to some consumer. And the streams are prevalent: the Aware Home Project at Georgia Tech has data continuously flowing from the myriad of sensors in the house to large compute engines on campus. An investigation of a scientific model run that uses high-end graphics machines to display results visually transports large volumes of complex scientific data. Delta Airlines [13] pushes in excess of 12 million events per day between the ticket counter, check-in counter, passenger check-in, airport update monitors in concourses, reservations desks, and the mainframes. We adopt the commonly held view of data streams as streams of events, where an event is timestamped data about a component.

Our group has developed the notion and established the viability of computational data streams: streams with computation inserted at the source, at the destination, or at intermediate points between. The computations often transform, aggregate, or filter the data. For example, aggregation might be used to sum values over neighbor points in a 3D space to reduce downstream bandwidth needs. Transformation might perform units conversion or partially prepare the data for visualization. Computational data streams are one of the underlying mechanisms of the Infosphere project [16]. Their viability has been established in [6] and considered by others in [3], [4], and [8].

Our work with dQUOB is in adapting database queries to operate over streaming data instead of database tables. Viewing data streams as data sources over which relational queries can be specified has been explored in the past in the context of performance monitoring [19, 12], but that work suffered from limitations in its ability to keep up. Our contribution is to replace all traces of a database with temporary buffers to improve performance.

The work further contributes adaptivity to query processing, under the hypothesis that more efficient queries can be achieved if run-time data can be fed back into the optimization cycle. Earlier results [15] have shown that optimized queries can significantly reduce query computation time. Further, our work with a global atmospheric transport model, and earlier with an autonomous robotics application, has shown that queries relevant and meaningful to users accessing data streams can be stated in the SQL query language. Earlier work by our group has justified the benefits of stream computation [6], that is, performing transformations on the data as it flows from source to client. Our work has shown that by preceding a transformation computation with queries, one can significantly decrease the total amount of time spent in transformation, and additionally decrease the total network bandwidth consumed.

The contributions of this paper are three-fold. First, we demonstrate the usefulness and ease of use of a query language for specifying interesting queries over streaming data. Second, conceptualizing streaming data using the relational data model creates an opportunity for new optimizations on data streams; we identify in this paper the optimizations realizable from this new way of thinking. Finally, we quantify the overhead incurred in using a general approach to query evaluation.

In the following section we give a brief overview of dQUOB. In Section 3 we expand on the notion of data streams and data stream components, then narrow to the single example used in the remainder of the paper. Using the sample application, we illustrate the query language by example. The optimizations made possible by our implementation and the relational data model are the topic of Section 4. Measurements are presented in Section 5.
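The claim that placing a query ahead of a transformation reduces the time spent transforming can be illustrated with a small sketch. This is not dQUOB code: the `Event` fields, the `query_then_transform` helper, and the counting wrapper are hypothetical names introduced only for illustration, and the units conversion stands in for an arbitrary, possibly expensive, transformation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Event:
    """Hypothetical stream event: a timestamped value at an atmospheric level."""
    timestamp: int
    level: int
    value: float


def query_then_transform(
    events: Iterable[Event],
    predicate: Callable[[Event], bool],
    transform: Callable[[Event], Event],
) -> Iterator[Event]:
    """Evaluate the query first so the transform runs only on matching events."""
    for ev in events:
        if predicate(ev):
            yield transform(ev)


class CountingUnitsConversion:
    """Stand-in transformation (multiply by 1000) that counts its invocations."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, ev: Event) -> Event:
        self.calls += 1
        return Event(ev.timestamp, ev.level, ev.value * 1000.0)


# 100 events spread evenly over levels 0..9; the query keeps levels 8 and 9.
stream = [Event(timestamp=t, level=t % 10, value=0.5) for t in range(100)]
convert = CountingUnitsConversion()
kept = list(query_then_transform(stream, lambda e: e.level >= 8, convert))
```

In this sketch the transformation runs once per event that satisfies the query rather than once per event in the stream, and only the matching events would continue downstream.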

2 dQUOB Overview

The dQUOB (dynamic QUery OBjects) system enables users to create queries for precisely the data they wish to use. Associated with such queries are user-defined computations, which can further filter data and/or transform it, thereby generating data in the form in which it is most useful to end users. Query execution is performed by dQUOB runtime components termed quoblets, which may be dynamically embedded `into' data streams at arbitrary points, including data providers, intermediate machines, and data consumers. The intent is to distribute filtering and processing actions according to resource availability and application needs.

The dQUOB system is a tool for creating queries with associated computation, and dynamically embedding these query/action rules into a data stream. The software architecture, shown in Figure 1, consists of a dQUOB query compiler and a run-time environment. The compiler accepts an SQL query (Step 1), compiles the query into an intermediate form as a parse tree, performs query optimizations over the parse tree, then generates a script. The query is deployed at the quoblet by passing it a script (Step 2). A quoblet consists of an interpreter to execute the script, and the dQUOB library to dynamically create compiled code representations of the queries at runtime. The script also contains information used by the quoblet to retrieve and dynamically link the user-defined action code (Step 3). During run-time, the reoptimizer gathers statistical information about the data stream, periodically triggering reoptimization (Step 4). The dQUOB runtime handles events for queries Q1-Q3.
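The run-time feedback loop of Step 4 can be sketched as follows. The text above says only that the reoptimizer gathers stream statistics and periodically triggers reoptimization; the policy used here (reorder the conjuncts of a where-clause so the most selective one is tested first) is an assumption chosen for illustration, and `AdaptiveConjunction` and its parameters are hypothetical names, not part of dQUOB.

```python
from typing import Callable, Dict, List

Predicate = Callable[[dict], bool]


class AdaptiveConjunction:
    """Conjunction of named predicates, reordered by observed pass rate.

    Assumed policy (not taken from the source): every `reoptimize_every`
    events, sort the predicates so that the one that passes least often is
    evaluated first, letting short-circuiting reject events sooner.
    """

    def __init__(self, predicates: Dict[str, Predicate], reoptimize_every: int = 100):
        self.predicates = dict(predicates)
        self.order: List[str] = list(predicates)
        self.seen = {name: 0 for name in predicates}
        self.passed = {name: 0 for name in predicates}
        self.reoptimize_every = reoptimize_every
        self.events = 0

    def matches(self, event: dict) -> bool:
        self.events += 1
        result = True
        for name in self.order:
            self.seen[name] += 1
            if self.predicates[name](event):
                self.passed[name] += 1
            else:
                result = False
                break  # short-circuit: later predicates are not evaluated
        if self.events % self.reoptimize_every == 0:
            self.reoptimize()
        return result

    def pass_rate(self, name: str) -> float:
        # Rates are conditional on earlier predicates having passed.
        return self.passed[name] / self.seen[name] if self.seen[name] else 1.0

    def reoptimize(self) -> None:
        self.order.sort(key=self.pass_rate)


query = AdaptiveConjunction(
    {"wide": lambda e: e["level"] >= 0, "narrow": lambda e: e["level"] == 9},
    reoptimize_every=100,
)
hits = sum(query.matches({"level": i % 10}) for i in range(100))
```

After the first 100 events the sketch moves the rarely satisfied `narrow` predicate ahead of `wide`, so most later events are rejected after a single test. A real reoptimizer would operate on the compiled query representation rather than on Python callables.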

3 Query Language Through Examples

The viability of the dQUOB approach is determined in large part by the expressiveness of its query language. In this section we demonstrate the usefulness of the language with a series of examples drawn from a sample scientific application.

The dQUOB language adopts the create-if-then rule construct of the Starburst [20] query language for active database systems. The create-clause creates a named rule, the if-clause contains an SQL query, and the then-clause is a set of actions to be executed when the query is satisfied. SQL has a select-from-where three-clause syntax. The select-clause specifies the attributes whose values are to be retrieved as well as the structure into which the retrieved values are to be organized; that is, the outbound event. The from-clause specifies the classes from which data are to be retrieved as well as the structure into which the retrieved values are to be organized; that is, the inbound events. The where-clause specifies the conditions to be satisfied [21].
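One way to picture how the clauses fit together is as a record with one field per clause. The sketch below is an illustrative model only, not the dQUOB representation: the `Rule` class, its field names, and the sample event attributes are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence

Event = Dict[str, float]


@dataclass
class Rule:
    """Hypothetical in-memory model of a create-if-then rule."""
    name: str                        # create-clause: the rule's name
    from_events: Sequence[str]       # from-clause: inbound event types
    select: Sequence[str]            # select-clause: attributes of the outbound event
    where: Callable[[Event], bool]   # where-clause: condition to be satisfied
    then: Callable[[Event], Event]   # then-clause: action run when the query is satisfied

    def fire(self, inbound: Event) -> Optional[Event]:
        """Return the action's result, or None if the query is not satisfied."""
        if not self.where(inbound):
            return None
        outbound = {attr: inbound[attr] for attr in self.select}
        return self.then(outbound)


demo = Rule(
    name="demo",
    from_events=["Data_Ev"],
    select=["level", "ozone"],
    where=lambda e: e["level"] > 5,
    then=lambda e: {**e, "ozone": e["ozone"] * 2},
)
satisfied = demo.fire({"level": 7, "latitude": -70.0, "ozone": 1.5})
rejected = demo.fire({"level": 2, "latitude": -70.0, "ozone": 1.5})
```

Here `satisfied` contains only the selected attributes after the action has run, and `rejected` is `None` because the where-condition fails and the action is never invoked.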

[Figure: the dQUOB compiler (compiler, optimizer, script generation), the code repository, and the dQUOB runtime, whose quoblet holds the interpreter, the dQUOB library, the user-defined action code, the compiled queries Q1-Q3, and the reoptimizer. Steps: (1) SQL query and action defined by scientist; (2) query code moved into quoblet; (3) action code dynamically linked into quoblet; (4) reoptimization of compiled queries at runtime.]

Figure 1: Life of a Query/Action Rule.

The data stream example we use is the visualization of 3D atmospheric data generated by a parallel and distributed global atmospheric model developed at Georgia Tech. The model consists of an atmospheric transport model that simulates the flow of chemical species, specifically ozone, through the stratosphere, coupled with a chemical model that models the interaction of the ozone with short-lived species (e.g., CH4, CO, HNO3). Species data is pushed from the model each logical timestep (i.e., 2 hrs. of modeled time). A 3D gridpoint is defined by the tuple (level, latitude, longitude). 'Level' corresponds to an atmospheric pressure. Through a series of small examples drawn from the atmospheric visualization we illustrate and discuss the query language.

Example 1: Rule construct, Select. The following rule, named C:1, is a simple query to retrieve data for the upper atmospheric levels of the Antarctic circle, stated by the conjunction of two select expressions. The data records that satisfy the query are passed to the function ppm2ppb, which converts the grid points in the 3D slice from parts-per-million to parts-per-billion.

CREATE RULE C:1 ON Data_Ev
IF
  SELECT Data_Ev
  FROM Data_Ev as d
  WHERE d.latitude_min = 30 and
THEN FUNC ppm2ppb
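The behavior described for rule C:1 can be sketched in a few lines. The where-clause of the rule is incomplete in this copy of the text, so the sketch does not reproduce it: the two conditions, the event field names (`latitude_min`, `level_min`, `values`), and the thresholds passed in the usage are assumptions for illustration only, as is the convention that a larger level number means a higher atmospheric level. The only fixed fact used is that one part-per-million equals 1000 parts-per-billion.

```python
from typing import Iterable, Iterator


def ppm2ppb(event: dict) -> dict:
    """Return a copy of the event with grid values converted from ppm to ppb."""
    converted = dict(event)
    converted["values"] = [v * 1000.0 for v in event["values"]]
    return converted


def rule_c1(events: Iterable[dict], *, max_latitude: float, min_level: int) -> Iterator[dict]:
    """Conjunction of two selections followed by the ppm2ppb action.

    Both conditions are hypothetical stand-ins for the rule's where-clause.
    """
    for d in events:
        if d["latitude_min"] <= max_latitude and d["level_min"] >= min_level:
            yield ppm2ppb(d)


data_events = [
    {"latitude_min": -80.0, "level_min": 35, "values": [0.25, 0.5]},  # both conditions hold
    {"latitude_min": 10.0, "level_min": 35, "values": [1.0]},         # fails the latitude test
    {"latitude_min": -80.0, "level_min": 5, "values": [1.0]},         # fails the level test
]
# Threshold values are placeholders, not taken from the source.
selected = list(rule_c1(data_events, max_latitude=-66.5, min_level=30))
```

Only the first event satisfies both conditions, so only its values are converted; the other two events never reach the action.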

Example 2: Boolean Operators, Join. The second example accepts two event types: data events from the atmospheric model, Data_Ev, and a user request for a particular region of data, Request_Ev. For illustration purposes, the user-requested region is actually one of two 3D points at selected latitudes. The query evaluates to true if the data event contains the data for the longitude and level in which one or the other of the requested latitudinal points appears. (The negation boolean operator is supported as well.)

CREATE RULE C:2 ON Data_Ev, Request_Ev IF SELECT Data_Ev FROM Data_Ev as d, Request_Ev as r WHERE ((r.lat_point1 >= d.lat_min and r.lat_point1 = d.lat_min and r.lat_point2 = r.lat_min or d.lat_min = 30 and r.aid == 1001 and p.latency == 1001 and p.latency