Blending OLAP processing with real-time data streams
João Costa1, José Cecílio2, Pedro Martins2, Pedro Furtado2
1 Polytechnic of Coimbra, 2 University of Coimbra, Coimbra, Portugal
[email protected], 2{jcecilio, pmon, pnf}@dei.uc.pt
Abstract
CEP and databases share some characteristics but are traditionally treated as two separate worlds, the former oriented towards real-time event processing and the latter towards long-term data management. However, many real-time data and business intelligence analyses need to combine streaming data with long-term stored data. For instance, how do current power consumption and power grid distribution compare with last year’s for the same day? StreamNetFlux is a novel system, recently brought to market, designed to integrate CEP and database functionalities. This blending allows queries that need both capabilities to be processed efficiently and at scale.
1. INTRODUCTION
Lately, CEP engines have been gaining ground in commercial enterprises with event processing needs. IBM System S [1] is a stream processing core that provides scalable distributed runtime execution, the SPADE language and compiler [2] for stream processing applications, and built-in optimization and fault-tolerance capabilities. Tibco BusinessEvents [3] uses a UML-based state model, a rules engine based on the industry-standard Rete algorithm, and event capture and processing functionality. Truviso [4] offers data analysis using standard SQL; results can trigger actions such as alerts to decision makers or events in other systems, or be delivered to an end user through a standard web browser. Its queries are continuous and the system can be run in a distributed fashion across many applications. Coral8 and Aleri [5] are an engine and a platform, respectively, for streaming event data, with an SQL extension to process data. The Coral8 Engine uses the Continuous Computation Language (CCL), and the Aleri Streaming Platform offers visual dataflow authoring, event-driven integration and scalability support. The StreamBase [6] server is programmed with either the StreamSQL processing language or a visual query language, and the engine offers a large number of stream operators, including merging, filtering, statistics computation, thresholding and pattern matching. In Oracle CEP [7], applications are built from stream sources, processors represented in XML with CQL-like queries, stream sink beans that process output data, and channels linking the parts together. Esper [8] is a Java-based CEP engine that lets users reuse Java capabilities and add event and pattern handling through an Event Processing Language (EPL).
Typically, CEP engines and databases are two distinct entities. This separation increases complexity and creates performance problems in many practical applications whose data analyses require database data together with CEP streams.
2. STREAMNETFLUX SYSTEM
The StreamNetFlux CEP-DB system, illustrated in Figure 1, integrates both functionalities into a single, scalable and efficient engine. Users pose queries, and StreamNetFlux manages memory, machines, databases and CEP engines on their behalf.
Figure 1 – StreamNetFlux Architecture. Diagram labels: SQL query and result; SNF-CQL and continuous result; DB functions (persistency, long queries, ODBC/JDBC); CEP functions (event base, event processing, continuous query); business rules; configuration; customized code; parallelized/distributed.
StreamNetFlux implements common features of database systems, including persistence, DB storage mechanisms, transaction management, recovery and an ODBC/JDBC interface. The persistent storage uses a hybrid memory-disk organization to achieve efficiency and scalability. Complex event and stream processing keeps data analysis always up to date. Users can pose StreamNetFlux statements directly or embed them in application code. These include continuous StreamNetFlux queries and customized data analysis code. By offering DB functionalities and capabilities that are transparent to the user, StreamNetFlux allows applications to use the system without disrupting their regular functionality.
StreamNetFlux's ease of use, real-time processing, scalability and efficiency were evaluated with a massive volume of high-rate data produced by an energy power grid infrastructure. Besides power grid data, it also processes data from energy producers, ranging from major energy enterprises to thousands of micro-generation producers (e.g. producing from solar power or wind turbines).
The distribution of electricity requires that energy be produced as consumers demand it, since it cannot be stored efficiently for later use, except for a limited time period. When required, inactive power plants need to be activated to produce the additional energy that meets consumption. To reduce the energy lost while electricity traverses the power lines and substations, this additional energy should be produced by power plants near the consumers.
To prevent power blackouts, it is crucial to continuously monitor the power grid infrastructure and to evaluate energy consumption, energy generation, and the capacity of interconnecting links and power substations.
Figure 2 – Performance Summary
For operational purposes, it is important to monitor the usage of the transmission lines in order to assess when and where abnormal patterns occur and to take preventive or corrective actions (e.g. adding capacity). Marketing decisions can also be made based on the available data. Alerts and actions may be triggered when there is an excess or shortage of some indicator. The scope of alerts, reports and analyses may be drilled up, down or across different perspectives. The capacity of substations and power plants can also be monitored to detect abnormal usage patterns.
StreamNetFlux, even under high loads, delivers results in the millisecond range (Figure 2), whereas the evaluated DBMS engines take much longer to obtain the same results, return them in a discrete manner and integrate new data very inefficiently. Such result discretization (the time lag between successive re-executions of a query against the data, including the most recent data) is unacceptable for critical, highly demanding applications such as power grid or telecommunications applications. Stream engines for real-time data analysis can deliver results quickly, but only for a small subset of the recent data, limited by the window size. They are unsuited to broader analyses that require not only the recent data contained within the window but also data that lies outside the window scope. A tool such as StreamNetFlux, which lets users work with both streaming and past data, is therefore an important improvement.
StreamNetFlux was designed with ease of use in mind, with a short learning curve and a fast time to market. It allows users to specify operators and data processors, and to define computations, filters, data aggregations and the dataflows between them. It also allows the definition of event-based actions, or rules, which specify triggering conditions and the actions to be performed.
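The contrast between window-limited stream analysis and analysis that also draws on stored data can be sketched in Python. This is a minimal illustration, not StreamNetFlux code: the class `WindowedAverage`, the helper functions, the `history` dictionary and all readings are hypothetical, and the stored history is assumed to hold one average per (date, hour).

```python
from collections import deque
from datetime import date, datetime, timedelta


class WindowedAverage:
    """Average of the readings seen during the last `span` of time."""

    def __init__(self, span: timedelta):
        self.span = span
        self.readings = deque()  # (timestamp, value), oldest first
        self.total = 0.0

    def add(self, ts: datetime, value: float) -> None:
        self.readings.append((ts, value))
        self.total += value
        # Evict readings that are older than the window span.
        while self.readings and self.readings[0][0] <= ts - self.span:
            _, old = self.readings.popleft()
            self.total -= old

    def average(self) -> float:
        return self.total / len(self.readings) if self.readings else 0.0


def same_day_last_year(day: date) -> date:
    """Same calendar day one year earlier (29 February maps to 28 February)."""
    try:
        return day.replace(year=day.year - 1)
    except ValueError:
        return day.replace(year=day.year - 1, day=28)


def variation_vs_last_year(window, history, now):
    """Relative change of the windowed average against the stored value for
    the same day and hour one year earlier, or None without usable history."""
    key = (same_day_last_year(now.date()), now.hour)
    baseline = history.get(key)
    if not baseline:
        return None
    return (window.average() - baseline) / baseline


# Streaming side: a 10-minute window over made-up readings (in MW).
window = WindowedAverage(timedelta(minutes=10))
start = datetime(2011, 3, 15, 12, 0)
for minutes, megawatts in [(0, 100.0), (5, 120.0), (12, 140.0)]:
    window.add(start + timedelta(minutes=minutes), megawatts)

# Stored side: hourly averages kept long after they left the window.
history = {(date(2010, 3, 15), 12): 104.0}
now = start + timedelta(minutes=12)
variation = variation_vs_last_year(window, history, now)
print(variation)
```

The window alone can only report the average of the last ten minutes; the comparison with the previous year becomes possible only because the lookup reaches data outside the window.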
3. DEMONSTRATION ROADMAP
In the demonstration, we will use a power grid data schema, with information from the power grid infrastructure, from energy generators (including major energy producers and micro-producers) and from power meters at consumers, to demonstrate the main features of StreamNetFlux: ease of use, real-time processing, scalability and efficiency. The demonstration consists of the following steps:
Setting up: using a power grid data schema, we show how to set up the system through a few simple drag-and-drop and dataflow steps. We also show how it can be set up through a command line console.
Running: we demonstrate how queries are seamlessly posed against StreamNetFlux and how the engine executes business rules. We will demonstrate alert, reporting and analysis queries, some of which correlate streaming data with stored persistent data. For instance, to: compute the national grid or substation usage and detect variations (e.g. > than) in comparison with the same weekday of previous years; trigger alerts when the variation is greater than a given threshold; raise alarms when link usage falls below a certain level.
Visualization: we show how to explore and visualize data.
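The alert and alarm rules listed in the running step can be sketched in Python as follows. This is an illustration under stated assumptions, not the StreamNetFlux rule language: the names `Notification` and `evaluate_rules` are hypothetical, the baseline is assumed to be the mean usage on the same weekday in previous years, and the threshold values are made up.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Notification:
    kind: str    # "ALERT" or "ALARM"
    source: str  # substation or link that triggered the rule
    detail: str


def evaluate_rules(source, current_usage, past_usages,
                   variation_threshold, min_usage):
    """Apply two rules to one usage reading.

    Rule 1: ALERT when usage deviates from the historical baseline by more
            than `variation_threshold` (a fraction, e.g. 0.2 for 20%).
    Rule 2: ALARM when usage falls below `min_usage`.
    """
    notifications = []

    # Assumption: the baseline is the mean of the stored usages for the
    # same weekday in previous years.
    if past_usages:
        baseline = mean(past_usages)
        if baseline > 0:
            variation = (current_usage - baseline) / baseline
            if abs(variation) > variation_threshold:
                notifications.append(Notification(
                    "ALERT", source,
                    f"usage varies {variation:+.0%} from the baseline"))

    if current_usage < min_usage:
        notifications.append(Notification(
            "ALARM", source,
            f"usage {current_usage} is below the minimum {min_usage}"))

    return notifications


# Made-up readings: a substation well above its baseline, and an idle link.
busy = evaluate_rules("substation-A", 150.0, [100.0, 100.0], 0.2, 50.0)
idle = evaluate_rules("link-7", 10.0, [10.0], 0.2, 50.0)
for note in busy + idle:
    print(note.kind, note.source, note.detail)
```

Keeping each rule as an independent condition means one reading can raise both an alert and an alarm, which matches the idea of rules as separate triggering conditions with their own actions.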
StreamNetFlux is a disruptive product that works with both streaming data and stored data, while providing scalability and unusual ease of programming and querying. In this demo we show how the system achieves this. The technology behind StreamNetFlux has already led to patent applications and to a spin-off company dedicated to developing applications for industrial markets such as telecommunications and energy efficiency.
REFERENCES
[1] “IBM Research Exploratory Stream Processing Systems”, http://domino.research.ibm.com/comm/
[2] B. Gedik, H. Andrade, K. Wu, P.S. Yu, and M. Doo, “SPADE: the System S declarative stream processing engine,” SIGMOD, Vancouver, Canada: ACM, 2008, pp. 1123-1134.
[3] “TIBCO Business Events”, http://www.tibco.com/software/
[4] “Truviso Data Analysis”, http://truviso.com/products/
[5] “Aleri & Coral8”, http://www.aleri.com/products/aleri-cep
[6] “Streambase”, http://www.streambase.com/
[7] “Complex Event Processing”, http://www.oracle.com/technologies/soa/complex-eventprocessing.html
[8] “Esper Complex Event Processing”, http://esper.codehaus.org