Multi-Query Unification for Generating Efficient Big Data Processing Components from a DFD Kosaku Kimura, Yoshihide Nomura, Hidetoshi Kurihara, Koji Yamamoto and Rieko Yamamoto Software Innovation Lab., FUJITSU LABORATORIES LTD., Kawasaki, Japan Email:
[email protected]
Abstract: This paper proposes multi-query unification, a technique for generating unified components from a DFD that aims to reduce the total cost of data transmission between components deployed to a computing fabric that includes processing nodes and interconnection services. The method focuses on generating components for the two primary data processing methodologies: cumulative data processing (CDP) and data stream processing (DSP). Multi-query unification generates a unified query by applying one of two methods, depending on the order sensitivity of the processes in a DFD. Nesting unification composes a unified query by embedding the query of a process into the query of the next process as a subquery. Clause assembly unification composes a query using templates for each clause of the original query. Because clause assembly is applicable only to processes that are executable simultaneously, we define a criterion called order sensitivity for applying clause assembly and propose two-stage unification, in which nesting unification is always applied after clause assembly. The performance evaluation based on a virtual DFD shows that applying two-stage unification reduces the execution time of components by 60 percent in DSP; however, execution time is reduced by only 10 percent in CDP. On the other hand, nesting unification alone reduces the execution time by 30 percent in CDP. Based on those results, we conclude that clause assembly should be applied to DSP using Esper but should not be applied to CDP using Hive.
Keywords: multi-query unification; order sensitivity; big data; component; DFD; platform as a service
I. INTRODUCTION
Big data processing technologies, such as MapReduce [1] and complex event processing (CEP), enable efficient processing of massive event data produced by mobile devices, sensors, web services, various ICT systems, and so on. A consolidated development environment to support the use of these technologies is important to extract new and different value from unutilized stored data or data that may be used in the near future. Two primary data processing methodologies can be applied to massive event data: cumulative data processing (CDP) and data stream processing (DSP). These methodologies are differentiated based on the type of data. The former is a data-intensive batch-style process and is often applied to statistical analysis of past data. The latter is real-time processing that continuously processes event data in order of arrival and is often applied to event detection to provide real-time services.
Similar to business intelligence tools or “extract, transform, load” tools, an intuitive data-flow diagram (DFD) is used to define how data is processed and the order of execution in the processing methodologies. Several existing development environments for big data processing also adopt a DFD or a domain-specific language (DSL). For example, [2], [3] adopt DSLs to define MapReduce processes, and [4]–[6] adopt DFDs and DSLs to define continuous queries for CEP. However, each of these development environments supports only one of the processing methodologies. Multiple development environments are therefore necessary to carry out real service development procedures that analyze massive event data and apply the results of the analysis to the service. Madras [7] provides integrative support for utilizing both CDP and DSP. Madras transforms a DFD into components that have different query processing implementations for deployment to a cloud and provides a computing fabric that includes CDP and DSP nodes and interconnection services. In a Madras deployment, each implementation should be able to use a different query language and therefore should be deployed to a different type of processing node. In such an environment, components communicate with each other to deliver event data using an interconnection service. Using this type of environment to process massive event data is problematic because of the large size of the data transmitted between components. To address this data overhead problem, we propose a technique that unifies the queries of multiple processes to reduce the number of components generated from a DFD. We call this technique “multi-query unification.” This technique generates a single query from queries that can be processed simultaneously and also embeds a process query into the next process as a subquery to retain the correct processing order. The remainder of this paper is organized as follows.
Section II presents an overview of Madras as an example of a target development environment. Section III describes the three unification methods that constitute multi-query unification. Section IV describes the criteria for applying each unification method. We evaluate and discuss the reduction of execution time by multi-query unification in Section V. Section VI summarizes related work, and our conclusions are presented in Section VII.
II. MADRAS: TARGET DEVELOPMENT ENVIRONMENT EXAMPLE
Madras is a platform as a service that is capable of processing massive event data produced by point-of-sales systems, electronic ticket systems, etc., and that provides integrative support for data analysis, service provision, and evaluation. We assume that the end user of Madras is an analyst who has extensive data analysis experience but does not know which technology to use to process data most efficiently or how to implement the selected technology. Madras provides a method to define a data processing procedure using a DFD that does not depend on any particular technology. Consequently, the end user can execute data analysis and develop a real-time service without actually implementing specific technologies, such as CEP and MapReduce. Figure 1 shows the Madras architecture. In Figure 1, Web Browser denotes the web browser on a client PC. The DFD Editor works in the Web Browser and is employed by the end user to define a DFD. We assume that the Web Node, the Control Node, and the other nodes that execute processing engines illustrated in Figure 1 are deployed to a public or private cloud. The Resource Service stores the DFD defined by the end user in the Resource Repository for version control. The Deploy Service generates components from the DFD, stores them in the Component Repository for version control, and transfers them to the Control Node for execution. According to the component description, the Control Node deploys nodes for processing engines and provides message queue services, such as the Java Message Service. Each component has an implementation and interfaces to transmit data between components using a specified method in accordance with the component assembly model of Service Component Architecture (SCA) (footnote 1). Madras adopts Apache Tuscany (footnote 2) as the runtime engine for components, Apache Hive (footnote 3) as the CDP engine, and Esper (footnote 4) as the DSP engine. Madras also uses the Eclipse Modeling Framework (EMF) (footnote 5) to create and manipulate the DFD model described below.
Footnotes: 1 http://www.oasis-opencsa.org/sca 2 http://tuscany.apache.org/ 3 http://hive.apache.org/
A. DFD model
In Madras, data analysis and service development procedures are consistently defined by the same style of DFD. The benefit of this methodology is that it facilitates the application of information revealed by the data analysis, such as a rule, model, or method, to service development. Generally, the application of such information is a challenging problem because there are gaps between data analysis and service development that vary depending on the characteristics of the data and the process. In data analysis, a large amount of past event data that has been stored for a long time is processed using batch-style processing technology, whereas
[Figure 1 diagram labels: Client; Web Browser (HTML5); DFD Editor; Server (PaaS); Web Node; CRUD of DFD; Build, deploy and execute; Resource Service; Resource Repository; Read; Deploy Service; CRUD; Component Repository; Deploy and execute; Control Node; SCA Engine; Directory Service; Computing fabric incl. CDP and DSP nodes and interconnection services.]
Figure 1. Madras architecture.
in service development, fresh event data is processed using real-time processing technology. Therefore, the development environment should bridge the gaps between data analysis and service development. The definition of the Madras DFD model is similar to the general DFD definition [8] in that it consists of dependencies between data and processes. However, unlike the general DFD definition, the data source, data sink, and temporal data store are not differentiated in the Madras DFD model; they are all treated simply as data. That is, let a DFD be a directed bipartite graph (DN, PN, E) comprising two disjoint sets DN and PN and a set E such that every directed edge e ∈ E connects a data node d ∈ DN to a process node p ∈ PN, or vice versa. In this paper, we assume that the DFD model in the proposed method is the same as the Madras DFD model. We define data analysis and service development procedures by combining the following types of processes: data filtering with a condition, numerical calculation, an analytical method, data cleansing, service provision processes, and so on. In the Madras DFD model, a process has implementation parameters, and data has a schema that comprises a list of fields (i.e., name and type) and the type of data storage used. Moreover, both processes and data are assigned their available processing behaviors (i.e., CDP, DSP, both, or other). The DFD defined by the end user is serialized to the XML Metadata Interchange (XMI) (footnote 6) format by using EMF and is transferred to the Resource Service and stored in the Resource Repository.
Footnotes: 4 http://www.espertech.com/ 5 http://www.eclipse.org/modeling/emf/ 6 http://www.omg.org/spec/XMI/
B. Generating components from the DFD
The Deploy Service generates component source code in SCA format from the DFD in XMI format in response to requests from the DFD Editor. To generate component source code, the Deploy Service uses a template engine that combines templates with process properties. There are two types of templates: query and component. A query process is implemented using a specific query template and the generalized component template, which processes the specific query language associated with the query. A non-query process (i.e., an analytical method) has an implementation-specific component template. The Deploy Service generates components by the following procedure. First, the Deploy Service generates the query that corresponds to the process by applying the process properties to the parameters of its query template. Then, if the process has a query, the generated query is applied to the parameter of the generalized component template, and the component source code is generated. If the process does not have a query, the process properties are applied to the parameters of the implementation-specific component template, and the component source code is generated. A message queue service and file delivery are considered the most general and versatile methods for transmitting data between components.
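The generation procedure above can be sketched as follows. This is a minimal illustration and not the Deploy Service implementation: Python's string.Template stands in for the template engine, and the process types (filter, clustering), the property names, and the plain-text component layout are all hypothetical, whereas real components would be SCA source code.

```python
from string import Template

# Hypothetical query templates, keyed by process type (HiveQL-like shape).
QUERY_TEMPLATES = {
    "filter": Template(
        "INSERT OVERWRITE TABLE $output SELECT $fields FROM $input WHERE $condition"
    ),
}

# Generalized component template shared by every query process (hypothetical layout).
GENERIC_COMPONENT = Template("component $name\n  engine: $engine\n  query: $query\n")

# Implementation-specific templates for non-query processes (hypothetical).
SPECIFIC_COMPONENTS = {
    "clustering": Template("component $name\n  implementation: $impl\n  clusters: $k\n"),
}


def generate_component(process: dict) -> str:
    """Return component source text for one DFD process."""
    props = process["properties"]
    query_template = QUERY_TEMPLATES.get(process["type"])
    if query_template is not None:
        # Step 1: apply the process properties to the query template.
        query = query_template.substitute(props)
        # Step 2: apply the generated query to the generalized component template.
        return GENERIC_COMPONENT.substitute(
            name=process["name"], engine=process["engine"], query=query
        )
    # No query: apply the properties to the implementation-specific template.
    return SPECIFIC_COMPONENTS[process["type"]].substitute(props, name=process["name"])


# Two hypothetical processes: one query process and one analytical method.
pick_evening = {
    "name": "PickEvening",
    "type": "filter",
    "engine": "hive",
    "properties": {
        "output": "EveningEvent",
        "fields": "*",
        "input": "TicketEvent",
        "condition": "time >= 64800000",
    },
}
cluster_users = {
    "name": "ClusterUsers",
    "type": "clustering",
    "properties": {"impl": "kmeans", "k": 3},
}

print(generate_component(pick_evening))
print(generate_component(cluster_users))
```

Because every query process passes through the same generalized component template, each query yields its own component; this is the one-component-per-query behavior that multi-query unification sets out to reduce.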
[Figure 2 diagram. Query of process: SELECT (all fields) FROM (data) WHERE (condition: 18:00 <= Time AND Time <= 20:00). Query of next process: SELECT (specific fields: UserID, ItemID, Date) FROM (data derived from previous process). Unified query of the two: SELECT (specific fields) FROM (data) WHERE (condition).]
Figure 2. Clause assembly unification.
[Figure 3, panel (a) HiveQL. Queries before unification: two statements of the form INSERT OVERWRITE TABLE ... SELECT ... FROM ... WHERE ...]
III. OVERVIEW OF MULTI-QUERY UNIFICATION
Multi-query unification is a technique that unifies the queries of processes to reduce the number of components generated by the generalized component template previously described. In this section, we describe the three unification methods that constitute multi-query unification. In this paper, to simplify the description of the methods, we assume query languages that have a syntax similar to Structured Query Language (SQL).
A. Clause assembly unification
Clause assembly unification is a method that composes a unified query by dividing the original SQL-like queries into several clauses, such as SELECT. To unify the query of a process with that of the next process, the clause assembly method generates the clauses of each process query using templates and then composes them into the unified query using string operations, such as concatenation and replacement, for each clause. To execute the method, clause templates, rather than query templates, must be prepared for each clause. Figure 2 shows an example of clause assembly unification. In the example shown in Figure 2, the queries can be unified simply by concatenating the SELECT clause of the second process, the FROM clause of the first process, and a WHERE clause that joins the conditions of both processes using the AND operator.
B. Nesting unification
Nesting unification is a method that composes a unified query by embedding a process query into the query of the next process as a subquery. The procedures for embedding the subquery vary depending on the processing engine. Figure 3 shows an example of nesting unification. As shown in Figure 3, in Hive, queries
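[Figure 3, panel (a) HiveQL, unified query: INSERT OVERWRITE TABLE ... SELECT ... FROM ( SELECT ... FROM ... WHERE ... ) WHERE ... Panel (b) EPL, unified query: INSERT INTO ... SELECT ... FROM ... WHERE ...; SELECT ... FROM ... WHERE ...;]
Figure 3. Nesting unification.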
can be unified by replacing the expression in the FROM clause of the next process with the query of the first process in parentheses. In Esper, on the other hand, queries can be unified by the following procedure. An INSERT INTO clause is placed at the beginning of the query of the first process, so that this query is treated as an internal stream with a new name. Then, the expression in the FROM clause of the next process is replaced by the name of the internal stream.
C. Two-stage unification
Two-stage unification is a combined method that applies clause assembly first and then nesting. Clause assembly cannot unify several query types that typically occur in a DFD. On the other hand, nesting can unify any type of query in a DFD and can be applied after clause assembly. Two-stage unification therefore covers the same query types as nesting. However, queries unified by two-stage unification have different characteristics from those unified by nesting unification alone; there are fewer nested queries in two-stage unification, and each subquery
Table I. CLASSIFICATION OF OPERATORS.
Type: π σ X γ M
Order-sensitive: No No Partially No Yes Yes
SQL example: SELECT a, b, c FROM A SELECT * FROM A WHERE 10
64800000 select userid from TicketEvent5
Table IV. EPL QUERIES UNIFIED BY EACH UNIFICATION METHOD.
Method: Two-stage, q4 ◦ q5 (q1 ◦ q2 ◦ q3 (·)) =
Query: insert into TicketEvent4 select *, (Math.floor(date / 3600000) % 24) * 3600000 as time from TicketEvent where station = "1" and gender = 1; select userid from TicketEvent4 where 72000000 > time and time > 64800000
Method: Nesting, q5 (q4 (q3 (q2 (q1 (·))))) =
Query: insert into TicketEvent2 select * from TicketEvent where station = "1"; insert into TicketEvent3 select *, (Math.floor(date / 3600000) % 24) * 3600000 as time from TicketEvent2; insert into TicketEvent4 select * from TicketEvent3 where gender = 1; insert into TicketEvent5 select * from TicketEvent4 where 72000000 > time and time > 64800000; select userid from TicketEvent5
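The two rows of Table IV can be reproduced with a short sketch of clause assembly followed by EPL-style nesting. Everything below is an illustrative assumption and not the generator used in Madras: the Query class, the function names, and the simplified SELECT rule (the last projection other than a bare star wins) are hypothetical, and the grouping into {q1, q2, q3} and {q4, q5} is copied from Table IV instead of being derived from order sensitivity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Query:
    """One SQL-like process query split into clauses (hypothetical model)."""
    select: List[str]              # projected expressions; ["*"] keeps all fields
    source: str                    # stream named in the FROM clause
    where: Optional[str] = None    # filter condition, if any
    target: Optional[str] = None   # output stream of the process, if any


def render(q: Query) -> str:
    """Concatenate the clauses of a query into EPL-like text."""
    text = f"select {', '.join(q.select)} from {q.source}"
    if q.where:
        text += f" where {q.where}"
    return text


def clause_assembly(group: List[Query]) -> Query:
    """Unify a group of consecutive queries clause by clause."""
    select = ["*"]
    conditions = []
    for q in group:
        if q.select != ["*"]:
            select = q.select          # simplified rule: last real projection wins
        if q.where:
            conditions.append(q.where)
    # FROM comes from the first query; conditions are joined with AND.
    return Query(select, group[0].source,
                 " and ".join(conditions) or None, group[-1].target)


def nest_epl(queries: List[Query]) -> str:
    """Chain queries through internal streams created by INSERT INTO."""
    statements = []
    stream = None
    for i, q in enumerate(queries):
        # Replace the FROM expression with the previous internal stream.
        source = stream if stream else q.source
        body = render(Query(q.select, source, q.where))
        if i < len(queries) - 1:
            body = f"insert into {q.target} {body}"
            stream = q.target
        statements.append(body)
    return "; ".join(statements)


q1 = Query(["*"], "TicketEvent", 'station = "1"', "TicketEvent2")
q2 = Query(["*", "(Math.floor(date / 3600000) % 24) * 3600000 as time"],
           "TicketEvent2", None, "TicketEvent3")
q3 = Query(["*"], "TicketEvent3", "gender = 1", "TicketEvent4")
q4 = Query(["*"], "TicketEvent4", "72000000 > time and time > 64800000",
           "TicketEvent5")
q5 = Query(["userid"], "TicketEvent5")

two_stage = nest_epl([clause_assembly([q1, q2, q3]), clause_assembly([q4, q5])])
nesting = nest_epl([q1, q2, q3, q4, q5])
print(two_stage)
print(nesting)
```

In this sketch, two_stage needs one internal stream (TicketEvent4) where nesting needs four, which is the difference in nesting depth that the evaluation compares.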
shows the average quotients of reduction of execution time for each data set. Table IV shows the EPL query unified by two-stage unification. In two-stage unification, clause assembly generated two unified queries. The first included “Pick Users Getting Off At The Station,” “Convert Date-time Format,” and “Pick Male.” The second included “Pick 18:00-20:00” and “Select User ID.” Then, nesting unification generated a unified query that embedded the first clause-assembled unified query into the second clause-assembled unified query. We found that applying two-stage unification reduced the execution time by 60 percent for DSP but by only 10 percent for CDP.
B. Change of reduction effect with and without application of clause assembly
We measured the execution time for nesting unification alone to evaluate the effect of clause assembly by comparing the results of nesting unification with those of two-stage unification. Consistent with the conditions given in the previous section, let the execution time be the average of three measurements using the same data set. We also measured the execution time without multi-query unification because the contents of the data set were different from those of the data set measured in the previous section. Figure 6 shows the execution time measurements. Figure 6(a) shows the execution times for CDP using Hive, and Figure 6(b) shows the execution times for DSP using Esper. Table V shows the average quotients of reduction of execution time for each data set.
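The quotient of reduction reported in Table V is not defined in the surviving text. One plausible reading, stated here as an assumption, is the relative reduction in execution time averaged over the data sets, where each execution time is itself the mean of three measurements. The symbols D (the set of data sets), t_d, t_{d,k}, and r are introduced only for this sketch:

```latex
\[
  \bar{r} \;=\; \frac{1}{|D|} \sum_{d \in D}
  \frac{t_d^{\mathrm{not\ unified}} - t_d^{\mathrm{unified}}}{t_d^{\mathrm{not\ unified}}},
  \qquad
  t_d \;=\; \frac{1}{3} \sum_{k=1}^{3} t_{d,k}
\]
```

Under this reading, a quotient of 0.600 corresponds to the 60 percent reduction reported for two-stage unification in DSP.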
Table IV shows the EPL query unified by nesting unification. Nesting unification generated a unified query involving all processes between “Pick Users Getting Off At The Station” and “Select User ID.” The unified query had four-stage nesting. We found that applying nesting unification alone reduced the execution time by 30 percent for CDP and 35 percent for DSP.
C. Discussion
In CDP using Hive, data transmission was performed by file delivery on the HDFS. Hive produces multistage Map and Reduce processing from a HiveQL query [2], and temporary data is generated between each processing stage. We believe that the execution time reduction effect of multi-query unification was generally smaller in CDP than in DSP because the number of file deliveries did not decrease regardless of whether or not multi-query unification was applied. However, the execution time reduction when only nesting unification was applied was higher than when two-stage unification was applied. We consider that this was because the query optimization of Hive handles a multistage nested query more efficiently than the complex query generated by clause assembly. For DSP using Esper, data transmissions between components were implemented using a message queue service. Regardless of whether or not clause assembly was applied, the DFD shown in Figure 4 was eventually transformed into a unified component involving all processes for “Pick
[Figure 5 plots: execution time [s] versus # of events; series: not unified, two-stage. Panel (a) CDP using Hive. Panel (b) DSP using Esper.]
Figure 5. Execution time of the components applying two-stage unification.
[Figure 6 plots: execution time [s] versus # of events; series: not unified, nesting. Panel (a) CDP using Hive. Panel (b) DSP using Esper.]
Figure 6. Execution time of the components applying nesting unification.
Users Getting Off At The Station” and “Select User ID.” The number of message queues decreased from six to two. Because the quotient of execution time reduction for DSP is high, we consider that reducing the number of message queues had a positive effect on execution time. When applying nesting to DSP using Esper, event data was temporarily stored in PC memory, because an internal stream is created by positioning an INSERT INTO clause at the beginning of the query. We believe that clause assembly had a positive effect on execution time reduction in DSP because it reduced the number of these in-memory data transmissions. In light of the above results, the effect of clause assembly differs between CDP and DSP. Consequently, it is reasonable to conclude that nesting unification should be applied to CDP and two-stage unification should be applied to DSP.
VI. RELATED WORK
Query optimization [9] is a technique that optimizes a query execution plan on a processing engine, such as a database management system. Multi-query optimization [10], [11] is a technique that optimizes execution plans for multiple queries arriving at a processing engine simultaneously. Multi-query unification is a technique that unifies sequentially processed queries to generate unified
Table V. QUOTIENT OF THE REDUCTION OF EXECUTION TIME.
            CDP     DSP
Two-stage   0.108   0.600
Nesting     0.299   0.345
components. Multi-query unification works outside a processing engine, whereas query optimization works within a processing engine. Therefore, we can use multi-query unification in combination with query optimization or multi-query optimization. There are several query transformation methods that work outside a processing engine. Yang et al. [12] proposed a query transformation method that defines multiple queries in relation to members that transform themselves to enable calculation from pre-calculated relationships. Ahmed et al. [13] proposed a method that extracts subqueries to reduce processing engine execution time. However, unlike multi-query unification, these methods do not reduce the number of generated components. Several methods to facilitate the creation and execution of queries in a development environment have also been
reported. Peterson et al. [14] proposed a method that facilitates the creation of a query by providing an editor that can combine templates for each clause of a query. Ahmed et al. [15] proposed a method that reuses the processing results of part of a query by dividing the query into several blocks. There are various commercial products for CEP development environments that adopt DFDs to create queries. Products such as StreamBase [6], Sybase [5], and Oracle CEP [4] can create queries using primitive operations, such as Projection, Selection, and Join, in a DFD. Moreover, Oracle CEP provides an editor that enables the creation of an application by composing queries during editing. However, in these environments, components are created for each query, and consequently significant data transmission overhead persists. None of these products provides a solution for this problem.
VII. CONCLUSIONS
This paper proposed multi-query unification, a technique for generating unified components from a DFD that aims to reduce the total cost of data transmission between components. We described the three unification methods that constitute multi-query unification. Nesting unification composes a unified query by embedding the query of a process into the query of the next process as a subquery. Clause assembly unification composes a query using templates for each clause of the original query. Because clause assembly is applicable only to processes that are executable simultaneously, we defined a criterion called order sensitivity for applying clause assembly and proposed two-stage unification, in which nesting unification is always applied after clause assembly. In this paper, we assumed query languages that have a syntax similar to Structured Query Language (SQL) to describe the methods. However, we consider that multi-query unification is applicable to various query languages that implement a close approximation of relational algebra, even without SQL-like syntax.
Moreover, we believe that multi-query unification can even be generalized to component unification by clarifying the order sensitivity between processes that utilize various technologies, such as NoSQL and other unstructured databases. The performance evaluation based on a virtual DFD shows that applying two-stage unification reduces the execution time of components by 60 percent in DSP; however, execution time is reduced by only 10 percent in CDP. On the other hand, nesting unification alone reduces the execution time by 30 percent in CDP. Based on those results, we conclude that clause assembly should be applied to DSP using Esper but should not be applied to CDP using Hive. In two-stage unification, clause assembly can reduce the number of nesting operations. Deciding whether or not it is appropriate to use clause assembly must take the processing engine into consideration because processing engines treat subqueries differently. We believe that, when using Hive, this issue arises because of an incompatibility between clause assembly and query optimization; however, we did not confirm or verify this assumption in this paper. Therefore, future
work will attempt to verify this assumption and improve the proposed method to achieve higher query optimization compatibility.
REFERENCES
[1] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[2] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy, “Hive: a warehousing solution over a map-reduce framework,” Proc. VLDB Endow., vol. 2, no. 2, pp. 1626–1629, Aug. 2009. [Online]. Available: http://dl.acm.org/citation.cfm?id=1687553.1687609
[3] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, “Pig Latin: a not-so-foreign language for data processing,” in Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008, pp. 1099–1110.
[4] Oracle Complex Event Processing. Oracle Corporation. [Online]. Available: http://www.oracle.com/technetwork/middleware/complex-event-processing/overview/index.html
[5] Sybase Aleri Event Stream Processor. SYBASE, Inc. [Online]. Available: http://www.sybase.jp/products/financialservicessolutions/complex-event-processing
[6] StreamBase CEP. StreamBase Systems, Inc. [Online]. Available: http://www.streambase.com/products/streambasecep/
[7] Y. Nomura, K. Kimura, H. Kurihara, R. Yamamoto, K. Yamamoto, and S. Tokumoto, “Massive event data analysis and processing service development environment using DFD,” in Services (SERVICES), 2012 IEEE Eighth World Congress on, 2012, pp. 80–87.
[8] T. DeMarco, Structured Analysis and System Specification. Upper Saddle River, NJ, USA: Prentice Hall PTR, 1979.
[9] Y. Ioannidis, “Query optimization,” ACM Computing Surveys (CSUR), vol. 28, no. 1, pp. 121–123, 1996.
[10] T. Sellis, “Multiple-query optimization,” ACM Transactions on Database Systems (TODS), vol. 13, no. 1, pp. 23–52, 1988.
[11] P. Kalnis and D. Papadias, “Multi-query optimization for on-line analytical processing,” Information Systems, vol. 28, no. 5, pp. 457–473, 2003.
[12] H. Yang and P. Larson, “Query transformation for PSJ-queries,” in Proceedings of the 13th International VLDB Conference, 1987, pp. 245–254.
[13] R. Ahmed, A. Lee, A. Witkowski, D. Das, H. Su, M. Zait, and T. Cruanes, “Cost-based query transformation in Oracle,” in Proceedings of the 32nd international conference on Very large data bases. VLDB Endowment, 2006, pp. 1026–1036.
[14] T. Peterson, “Query templates with functional template blocks,” U.S. Patent Application 12/131,263, Jun. 2, 2008.
[15] R. Ahmed, “Reusing optimized query blocks in query processing,” U.S. Patent Application 10/901,272, Jul. 17, 2007.