Calder Query Grid Service: Insights and Experimental Evaluation

Nithya N. Vijayakumar, Ying Liu and Beth Plale
Department of Computer Science, Indiana University
{nvijayak,yingliu,plale}@cs.indiana.edu

Abstract

We have architected and evaluated a new kind of data resource, one composed of a logical collection of ephemeral data streams that can be viewed as a collection of publish-subscribe "channels" over which rich data-access and semantic operations can be performed. This paper contributes new insight into stream processing under the highly asynchronous stream workloads often found in data-driven scientific applications, and presents insights gained through porting a distributed stream processing system to a Grid services framework. Experimental results reveal limits on stream processing rates that are directly tied to differences in stream rates.

1. Introduction

Data-driven applications have received considerable attention in the distributed computing literature over the past several years, largely in response to recent developments in hardware and technology. A common approach to data-driven computing is to organize an application as a dataflow in which one or more streams of data flow through multiple stages arranged as a pipeline or a directed graph. The ordering of the tasks can be static or event driven, and the tasks can range from simple logical expressions (filters) to mathematically complex data transformations, aggregation operators, analysis operators, or cleansing operators. A useful distinction can be made between data-driven applications that process data streams in order to make the streams available to others, and those that consume the data streams. A stream-consuming data flow application ingests data streams for the purpose of accomplishing a computational outcome. A stream-preserving data flow application, on the other hand, builds a data resource for use by others: it might clean the data, generate metadata, and search for interesting features, but with the goal of storing or buffering the data for subsequent access by others. Our system, which falls into the category of stream-preserving data flow applications, provides access to stream data for scientific applications that need to access resources through a service-oriented architecture. Our system, called

Calder¹, sits between a data-consuming application and a set of data streams, providing customized access and minimal, efficient preprocessing. We model a data resource as a logical collection of data streams and provide access to that resource. We believe that the value of streams increases dramatically when streams are aggregated and global behavior can be interrogated. We further believe that access to streaming data through database operations is an intuitive way to think about stream access. The recent burgeoning interest of the database research community in data streams reinforces this view [2, 3, 5, 10]. A stream is a member of a logical collection if its inclusion does not violate the collection's data coherence and group meaning. In comparison with a DBMS, a data stream resource is analogous to a database and the streams are analogous to its tables. Calder is well suited to applications that need rapid, relatively simple, user-customized stream processing at a level of scalability and efficiency that might not be achievable with a database management system. It is optimized for processing the highly asynchronous streams often found in data-driven science applications. Calder is built on the dQUOB query processing system [10] and extends it with web service support provided by the database grid access framework OGSA-DAI [1], bringing programmatic access to user-driven data flow processing in the form of filtering and aggregation operators. This paper contributes new insight to the general body of knowledge on data stream processing: the limits of stream data processing under highly asynchronous stream workloads, and insights gained while porting Calder to a grid services framework. A valid measure of stream processing is service time, that is, the total time required for the stream processing to react to a condition that might be defined over multiple streams.
We experimentally evaluated the relationship between stream rate and service time for highly asynchronous streams, revealing rate-based limitations on operations applied over multiple streams. We have found that service time, measured as a rate, is largely independent of operator ordering.

The remainder of the paper is organized as follows. The port to a grid services model and the programming model are discussed in Section 2. A brief overview of the system architecture is given in Section 3. Section 4 describes the experimental evaluation we carried out. Related work is discussed in Section 5. Section 6 discusses lessons learned and future work.

¹ This work is funded by the National Science Foundation ATM-0331480 and CNS-0202048; and Department of Energy DE-FG02-04ER25600.

2. Porting Calder to Grid Services

In Calder, we model a data stream resource as a logical collection of data streams and provide access to that resource. There are two kinds of users of the Calder system. Stream publishers add input data streams to, or remove them from, a data stream resource; the input streams enter Calder through a publish-subscribe mechanism. Stream consumers submit a continuous query to process the streams and then consume the resulting derived streams. Porting Calder to a grid service framework leverages the interoperability provided by grid services, giving stream publishers and stream-consuming applications convenient and efficient access to data streams.

Streaming Realization of a Grid Data Service

In a service-oriented architecture, a data virtualization is represented by, and encapsulated in, a web service that implements one or more base data interfaces defined in the Global Grid Forum (GGF) DAIS Data Service specification [6]. The main challenge in porting Calder to a grid service environment was that there was no previously existing realization of the DAIS specification for a data stream resource. Moreover, data streams have unique features that make it non-trivial to simply adopt a realization defined for other data resources, such as files and databases. Hence, our first task was to define the DAIS interfaces for data streams. We achieved this by defining a grid data service for a data stream resource to be a persistent service that serves data from a logical collection of data streams. We define four sub-services (SQL logical interface, Rowset logical interface, Stream Publish logical interface and Administrator logical interface) composed of methods from three base porttypes (DataAccess interface, DataFactory interface and DataManagement interface), as shown in Figure 1. A detailed discussion of the DAIS stream realization is beyond the scope of this paper; interested readers are referred to [8].
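The composition of the four logical interfaces over the three base porttypes can be sketched as follows. This is a minimal Python sketch of the grouping only; the method names are illustrative assumptions, not the actual DAIS operation names, and the real services are web services rather than Python classes.

```python
from abc import ABC, abstractmethod

# Three base porttypes from the DAIS Data Service specification.
# Method names are illustrative placeholders.
class DataAccess(ABC):
    @abstractmethod
    def perform(self, activity_document: str) -> str: ...

class DataFactory(ABC):
    @abstractmethod
    def create_derived_resource(self, request: str) -> str: ...

class DataManagement(ABC):
    @abstractmethod
    def add_resource(self, resource_id: str) -> None: ...
    @abstractmethod
    def remove_resource(self, resource_id: str) -> None: ...

# The four logical interfaces are compositions of methods drawn
# from the base porttypes (Figure 1).
class SQLLogicalInterface(DataAccess, DataFactory):
    """Continuous-query submission; creates derived streams."""

class RowsetLogicalInterface(DataAccess):
    """Tuple retrieval from the ring buffer of a derived stream."""

class StreamPublishLogicalInterface(DataManagement):
    """Publishers add/remove raw streams in the logical collection."""

class AdminLogicalInterface(DataManagement):
    """Administrators add/remove computational nodes."""
```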
Insight Gained from Porting Calder

Porting Calder to a grid service framework means conforming to the interfaces defined in the stream DAIS specification. We encountered a number of challenges that had to be addressed.

Figure 1. Realization of GGF DAIS Data Service Specification for Streams

First, Calder, in its role as a stream-preserving data flow application, serves both stream consumers and stream publishers. These users have different views of the streams and different needs. Stream publishers view input data streams as entities to be added to or removed from the resource. For this, it is useful to create an "empty collection" into which streams can be logically organized. Administrative users are responsible for managing the computational resources of the system. The stream registry service allows stream publishers to add and remove raw streams, and enables administrators to add or remove computational nodes in the system. It is based on the Stream Publish logical interface and Administrator logical interface of Figure 1. Second, consumers issue continuous queries on input streams to create derived streams that are then consumed. Based on these observations, we created a query planning service for stream consumers, with which they can submit queries and check their status. The operations are shown as the SQL logical interface in Figure 1. Third, streaming data differs from traditional data in a database in that the query result is itself a stream, a derived stream as previously mentioned. This makes it impossible to directly return a single query result as the output of a function call. Hence, we designed the stream rowset service to allow consumers to access derived data streams synchronously as streams or asynchronously from a buffer. The stream rowset service is based on the Rowset logical interface of Figure 1. When a user submits a continuous query to the system, a new in-memory, time-based ring buffer is created in the stream rowset service for the output of that query. The query user receives an identifier for this ring buffer and can access it later by issuing a tuple retrieval request. The stream registry service, query planning service and stream rowset service are all web services and together implement the GGF DAIS specification for streams given in Figure 1.
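The per-query, time-based ring buffer can be sketched as follows. This is a simplified Python sketch under our own naming assumptions; the actual stream rowset service is a web service whose buffer implementation is not given here.

```python
from collections import deque

class TimeBasedRingBuffer:
    """Per-query buffer sketch: retains derived-stream tuples for a
    fixed time span and evicts anything older on each append."""

    def __init__(self, span_seconds: float):
        self.span = span_seconds
        self.buf = deque()          # (timestamp, tuple) pairs, oldest first

    def append(self, timestamp: float, event) -> None:
        self.buf.append((timestamp, event))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop everything older than the retention window.
        while self.buf and self.buf[0][0] < now - self.span:
            self.buf.popleft()

    def get_tuples(self, start: float, end: float):
        """Asynchronous retrieval: tuples whose timestamps fall in [start, end]."""
        return [e for (t, e) in self.buf if start <= t <= end]

# One buffer per active query; consumers hold an identifier to theirs.
buffers = {"query-42": TimeBasedRingBuffer(span_seconds=60.0)}
rb = buffers["query-42"]
rb.append(100.0, {"temp": 21.5})
rb.append(130.0, {"temp": 22.1})
rb.append(170.0, {"temp": 22.8})   # evicts the t=100 tuple (outside the 60 s window)
```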
To enable grid service access to Calder, we extended the OGSI-compliant OGSA-DAI v6 grid data service [1] to a stream grid data service (GDS) that supports the API described in Figure 1 as internal activities. A user accesses the Calder system by creating a stream GDS and passing it an XML document containing the continuous query activity, the rowset request activity or the metadata query activity. The GDS acts as a gateway, redirecting continuous queries to the planner service, rowset requests to the stream rowset service and metadata queries to the stream registry service.
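The gateway role of the GDS amounts to routing each activity in the submitted XML document to one of the three backend services. The sketch below illustrates this dispatch; the activity element names (`continuousQuery`, `rowsetRequest`, `metadataQuery`) and service names are assumptions for illustration, not the actual OGSA-DAI activity vocabulary.

```python
import xml.etree.ElementTree as ET

# Illustrative mapping of activity elements to backend services.
ROUTES = {
    "continuousQuery": "query-planner-service",
    "rowsetRequest": "stream-rowset-service",
    "metadataQuery": "stream-registry-service",
}

def route_activity(activity_document: str) -> str:
    """Return the backend service that should handle the activity
    contained in the submitted XML document."""
    root = ET.fromstring(activity_document)
    for child in root:
        if child.tag in ROUTES:
            return ROUTES[child.tag]
    raise ValueError("unknown activity")

doc = "<perform><rowsetRequest buffer='query-42'/></perform>"
```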

3. Architecture

Figure 2. Calder architectural components

The Calder system comprises a data management subsystem and a query processing subsystem. The two subsystems communicate through a publish-subscribe mechanism. The data management subsystem is composed of four functionally distinct web services, as shown in Figure 2. The stream grid data service (GDS) is a transient grid service instantiated on a per-user basis. The distributed query planner is a persistent web service that optimizes the query for distribution across the computational network using query reuse techniques. The query planner is responsible for transforming the SQL query into an intermediate representation, decomposing the query into fragments, optimizing the fragments, and assigning the fragments to computation hosts. The stream rowset service, a persistent web service, is in its simplest form a buffer between timely streams and programs that may be delayed in accessing the data. It maintains a ring buffer of event data, one ring buffer per active query in the network; the stream rowset service can have thousands of ring buffers active simultaneously. The stream registry service is a persistent web service that captures domain-specific metadata on streams registered with the Calder system. Streams are tracked with respect to the logical domain they belong to and their provenance information, such as the generation source, schema type and formats (XML, NetCDF, etc.). Running at each computational node in the network is a query processing engine (quoblet) that dynamically accepts queries as scripts and deploys them as compiled code. Details of quoblet functionality are given in [10]. In the absence of any queries, a computational node listens for commands from the query planning service through the publish-subscribe system.
When a query is deployed to a computational node, the node subscribes as a receiver to the streams that participate in the query and as a publisher to the resulting "view" (derived stream) that the query creates. An arriving event is queued at the query, then pushed through the query operators (e.g., select, project, join) by means of a depth-first traversal of the query tree [10]. Data streams are implemented as events pushed or pulled through a publish-subscribe system. Events are grouped into typed channels, in that a channel transfers events of the same type. A channel is a logical entity through which one or more sources send data to all of the channel's sinks; a stream is connected to its channels by a one-to-many relationship. Calder supports SQL-like continuous queries with the constructs SELECT, FROM, WHERE, AND and OR. It also supports special constructs: EXEC (execute a user-defined function), START (start time of the query) and EXPIRE (stop time of the query). A rowset request is typically a tuple retrieval request in which the bounds are specified by timestamps, for example getTuple(timestamp) and getTuples(start-timestamp, end-timestamp). Metadata queries retrieve metadata and provenance information about the streams.
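The select/project semantics of such a continuous query can be illustrated as follows. The stream name, field names and query text below are invented for this sketch; in the real system the query is compiled and deployed to a quoblet rather than interpreted as shown here.

```python
# A hedged sketch of Calder-style continuous-query semantics.
# "SensorStream" and its fields are illustrative assumptions.
query = """
SELECT temperature, humidity
FROM SensorStream
WHERE temperature > 30 AND humidity < 0.2
START 2005-06-01T00:00:00
EXPIRE 2005-06-08T00:00:00
"""

def matches(event: dict) -> bool:
    """Predicate corresponding to the WHERE clause above."""
    return event["temperature"] > 30 and event["humidity"] < 0.2

def project(event: dict) -> dict:
    """Projection corresponding to the SELECT clause above."""
    return {k: event[k] for k in ("temperature", "humidity")}

# Each matching input event yields one tuple on the derived stream.
derived = [project(e) for e in (
    {"temperature": 35, "humidity": 0.1, "site": "A"},
    {"temperature": 25, "humidity": 0.1, "site": "B"},
) if matches(e)]
```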

4. Experimental Evaluation

We experimentally evaluated several aspects of system behavior. In the first experiment we measure the cost of deploying a query; in the second we quantify service time at the stream processing node. The experimental setup is as follows. The GDS and the planning service reside on a dual-processor 2.8 GHz Pentium with 2 GB of memory, running RHEL. The stream rowset service and the computational nodes are each hosted on a single-processor 2.8 GHz Pentium with 1 GB of memory, running RHEL. The machines are interconnected through a 1 Gbps switched Ethernet LAN. The computational mesh consists of 4 single-processor workstations, each executing a quoblet.

Query Deployment Time

Query deployment time (QDT) is the delay incurred when a client submits a long-running query to the system. Query deployment is handled in five logical functions or components: Stream GDS setup - create a service instance that acts on behalf of the user; Query planning - generate a distributed query plan for the incoming query; Query distribution - transfer query parts to the nodes where they are to be executed by quoblets; Query instantiation - instantiate the query on the quoblets; and Ring buffer (RB) setup - allocate a buffer in the stream rowset service and register to receive the derived stream. Each quoblet is subscribed to two input data streams: an S1 stream made up of 50 KB events generated at a rate of 50 events/second, and a second stream S2 also consisting of 50 KB typed events but generated at a rate of 10 events/second. The query workload consists of 100 queries with complexity varying from simple selects to multiple joins. The queries were issued by a single user sequentially. The results appear in Figure 3.

Figure 3. Startup and query deployment time. Average time spent in each component.

Along the X-axis are the five functional components described above. The Y-axis plots in log scale the average execution time computed over 100 queries. The wide vertical bars indicate average time, while the thin error bars indicate the max and min over the 100-query workload plotted around the mean. Instantiation of the GDS is a one-time cost that can be amortized over multiple query submissions by a single user. While the sum of the component times is roughly 80 ms, this measure does not capture service-to-service communication delays; in a different experiment, the round-trip query deployment time under a realistic workload measured in the range of 300 to 400 ms.

Stream Processing Overhead

Over the life of an application, the performance of a continuous query system such as Calder will be dominated by stream processing times (versus deployment times), because queries are generally long running and execute repeatedly. In this second experiment we quantify service time at the stream processing node.

Figure 4. Stream processing stack

Figure 4 depicts event flow through a query processing intermediary. In the absence of intermediate stream processing, the latency between event generation and receipt at the recipient is determined by network bandwidth, network latency, the overhead of encoding and decoding the events, and the cost of copying the event out of kernel space to the ring buffer. The filtering agent adds delay in the form of query scheduling and query execution time. We measure the query shown in Figure 5 at various stream rates and show the results of one configuration here. The input streams begin generating data before the queries begin execution, to avoid measurements skewed by queries blocking while waiting for data to arrive. A stream generator that we developed pushes the synthetic workload down channels at the desired rates.
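A rate-controlled synthetic generator of this kind can be sketched as follows. This is our own minimal illustration, not the actual generator: timestamps are computed rather than slept on, so the same code can drive a deterministic simulation instead of a real-time channel.

```python
# A minimal sketch of a rate-controlled synthetic stream generator.
def generate_stream(rate_hz: float, duration_s: float, payload_bytes: int):
    """Yield (timestamp, payload) events at a steady rate.
    Timestamps are computed, not slept on."""
    interval = 1.0 / rate_hz
    n_events = int(duration_s * rate_hz)
    payload = b"x" * payload_bytes
    for i in range(n_events):
        yield (i * interval, payload)

# S1 from the deployment experiment: 50 KB events at 50 events/sec.
s1 = list(generate_stream(rate_hz=50, duration_s=1.0, payload_bytes=50 * 1024))
```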
The S1 stream carries events 50 KB in size, generated at a steady rate of 10 events/sec. The S2 stream also carries 50 KB events, but its rate cycled between 10 events/sec and 1 event/sec every 10 seconds. Figure 5 displays the timing points we used.

Figure 5. Conditional join query over streams S1 and S2. Timing points noted as ti.

We use three metrics to capture query processing cost: Input buffer delay - the time an event remains in an input buffer; this time is an indication of how well the query is keeping up with the current flow (t1-t0 in Figure 5). Sliding window delay - queries with join operators maintain a sliding window over the two streams participating in the join; an event is retained in the sliding window until pushed off by events that have arrived more recently, so delay in a sliding window reflects the level of synchronicity between the two streams (t3-t2 in Figure 5). Total service time - the sum of the input buffer and sliding window delays plus any processing overhead (t4-t0 in Figure 5). The experiment is conducted over 2 minutes, during which the two event streams are pushed at the query and the time spent in the input buffer and join window is captured for the steady stream S1. Figure 6 depicts the second half of the 2-minute run. The top graph captures overall service time. The middle graph captures the time spent in the input buffer; this delay is mainly due to the overhead of the query scheduler and is a few tens of microseconds. The bottom graph captures the amount of time an event from S1 spends in the sliding window. Figure 6 clearly shows that stream processing is dominated by the sliding window delay when streams are asynchronous. In fact, the more asynchronous the streams are, the larger the sliding window, and the larger the delay, will be. Experience applying Calder to various stream configurations and architectures reveals a similar relationship between query latency and the rates of asynchronous streams. To better understand the results shown in Figure 6, we examine the point at second 70. Between seconds 60 and 70, streams S1 and S2 are both arriving at 10 ev/sec. Between seconds 70 and 80, however, the rate of stream S1 remains at 10 ev/sec while the rate of stream S2 drops to 1 ev/sec.
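The three metrics follow directly from the per-event timing points t0 through t4 of Figure 5, as the sketch below shows. The timestamp values are invented for illustration; they mimic an S1 event that clears the input buffer in microseconds but waits in the join window for a late S2 event.

```python
# Computing the three cost metrics from per-event timing points
# t0..t4 (Figure 5). The timestamp values below are invented.
def delays(t0, t1, t2, t3, t4):
    """Return (input buffer delay, sliding window delay, total service time)."""
    input_buffer_delay = t1 - t0       # waiting on the query scheduler
    sliding_window_delay = t3 - t2     # waiting in the join window
    total_service_time = t4 - t0       # end-to-end through the query
    return input_buffer_delay, sliding_window_delay, total_service_time

# An S1 event arriving while S2 is slow: ~50 microseconds in the
# input buffer, but nearly a second in the join window.
ib, sw, total = delays(t0=70.000, t1=70.00005, t2=70.0001, t3=70.9, t4=70.95)
```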
So in the interval 70-80, 100 events are processed from stream S1 and only 10 from stream S2. The impact of the slowed stream is dramatic in the sliding window.

Figure 6. Breakdown of time spent in input buffer and join window for processing asynchronous streams.

At second 70, the 10 events that arrive on S1 must sit in the sliding window waiting for the first S2 event to arrive. When the S2 event arrives, it satisfies all of the waiting S1 events, and the cycle repeats. The variation in sliding window delays of between 50 ms and 200 ms is explained by changes in the order in which events arrive and are scheduled. We implemented an Earliest Job First scheduling algorithm to smooth out this behavior, and an algorithm that adjusts the sliding window size according to stream rates [9].

5. Related Work

Borealis [5] is a distributed stream processing solution built as a peer-to-peer extension of Aurora. Calder shares Borealis' goal of a distributed design, but Borealis assumes all incoming data is in a format internal to the system, whereas Calder works with data formats that are familiar to science application users. Calder's metadata management and asynchronous data delivery make it better suited to grid applications. NiagaraCQ [3] is a file-based continuous query system that supports queries written in an XML query language called XML-QL. Queries are triggered either by timer events or by modifications to files; this file-oriented philosophy is less well suited to event-based, data-driven applications. The Stanford STREAM [2] project investigates language support for queries over databases containing temporal stream data and traditional snapshot data. Calder's emphasis instead is on a fully functioning, scalable system. GATES [4] is a grid-based system for stream-consuming data flow applications. GATES' research is on adaptability techniques that enable the nodes in the flow graph to respond to changes in their environment. Calder targets a different set of applications: in GATES, the functionality carried out by the nodes in the graph is user-supplied code and the ordering is supplied by the user, whereas Calder provides a limited set of nodes (i.e., query operators) and determines their ordering with a compiler based on a well-defined algebra [10]. Hence programming the system is much easier in Calder. The Grid Stream Database Manager (GSDM) project [7] extends a main-memory, object-relational DBMS to work in a grid environment. The targeted users of GSDM are considerably different from those of Calder: GSDM is a centralized database supporting heavy-weight queries, while Calder supports highly scalable, distributed but simpler queries over streams.

6. Conclusion

We have architected and evaluated a new kind of data resource, one composed of a logical collection of ephemeral data streams. By providing access to the data stream resource through a grid service interface, a grid-enabled application can discover a stream resource and then issue long-lived queries that execute relatively simple, user-customized stream processing. Calder supports a level of scalability and efficiency not achievable with a database management system, and additionally supports buffering of the resulting streams for subsequent access. Our research focuses on systems issues of suitable stream data formats, communication protocols, conversion and copying overheads, and scalability under realistic workloads. Our future work comprises performance and scalability analysis, query optimization and distribution, metadata and provenance management, and approximate query processing.

References

[1] A. Anjomshoaa et al. Design and implementation of grid database services in OGSA-DAI. Concurrency and Computation: Practice and Experience, 17(2-4):357-376, 2005.
[2] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proceedings of ACM Symposium on PODS, 2002.
[3] J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable continuous query system for internet databases. In Proceedings of ACM SIGMOD, 2000.
[4] L. Chen, K. Reddy, and G. Agrawal. GATES: A grid-based middleware for processing distributed data streams. In Proceedings of HPDC, 2004.
[5] D. J. Abadi et al. The design of the Borealis stream processing engine. In Proceedings of CIDR, 2005.
[6] I. Foster, S. Tuecke, and J. Unger. OGSA data services. Global Grid Forum GWD-I, August 2003.
[7] M. G. Koparanova and T. Risch. High-performance grid stream database manager for scientific data. In Proceedings of the First European Across Grids Conference, 2003.
[8] Y. Liu, B. Plale, and N. Vijayakumar. Realization of GGF DAIS data service interface for grid access to data streams. Technical Report IUCS TR 613, Indiana University, 2005.
[9] B. Plale. Evaluation of rate-based adaptivity in joining asynchronous data streams. In Proceedings of the International Parallel and Distributed Processing Symposium, 2005.
[10] B. Plale and K. Schwan. Dynamic querying of streaming data with the dQUOB system. IEEE Transactions on Parallel and Distributed Systems, 14(4):422-432, 2003.
