
Applied Soft Computing 30 (2015) 72–82

Contents lists available at ScienceDirect

Applied Soft Computing journal homepage: www.elsevier.com/locate/asoc

Robust heuristic algorithms for exploiting the common tasks of relational cloud database queries

Tansel Dokeroglu a, Murat Ali Bayir b, Ahmet Cosar c,∗

a Simsoft Computer Technologies, Middle East Technical University, Teknokent Bolgesi, 06800 Ankara, Turkey
b Microsoft, 1 Microsoft Way, Redmond, WA 98052, United States
c Computer Engineering, Middle East Technical University, 06800 Ankara, Turkey

Article info

Article history:
Received 20 April 2013
Received in revised form 10 August 2014
Accepted 14 January 2015
Available online 7 February 2015

Keywords:
Relational cloud database
Multiple-query optimization
Evolutionary computing
Branch-and-Bound
Hill Climbing

Abstract

Cloud computing enables a conventional relational database system’s hardware to be adjusted dynamically according to query workload, performance and deadline constraints. One can rent a large amount of resources for a short duration in order to run complex queries efficiently on large-scale data with virtual machine clusters. Complex queries usually contain common subexpressions, either in a single query or among multiple queries that are submitted as a batch. The common subexpressions scan the same relations, compute the same tasks (join, sort, etc.), and/or ship the same data among virtual computers. The total time spent on the queries can be reduced by executing these common tasks only once. In this study, we build and use efficient sets of query execution plans to reduce the total execution time. This is an NP-Hard problem; therefore, a set of robust heuristic algorithms (Branch-and-Bound, Genetic, Hill Climbing, and Hybrid Genetic-Hill Climbing) is proposed to find (near-)optimal query execution plans and maximize the benefits. The optimization time of each algorithm for identifying the query execution plans and the quality of these plans are analyzed by extensive experiments.

© 2015 Elsevier B.V. All rights reserved.

∗ Corresponding author. Tel.: +90 505 2370615; fax: +90 312 2105544. E-mail address: [email protected] (A. Cosar).
http://dx.doi.org/10.1016/j.asoc.2015.01.026
1568-4946/© 2015 Elsevier B.V. All rights reserved.

1. Introduction

Cloud computing is creating a new market, DataBase as a Service (DBaaS), that has great potential to attract users ranging from small businesses to very large enterprises seeking high-performance solutions. In addition to high performance, this emerging paradigm offers lower cost of ownership, quality of service guarantees, data privacy, scalability, and elasticity. One of the biggest challenges for DBaaS providers is the users' demand that service level agreements be met continuously. Providing an illusion of infinite resources under increasing database workloads is an NP-Hard optimization problem in which the tasks need to be scheduled optimally in order to deliver the required services [1,2].

Cloud database query engines can take advantage of common tasks and manage resources efficiently by using a well-known database optimization technique, Multiple Query Optimization (MQO) [3–8]. Although MQO requires a significant search effort to identify the common tasks among queries, it has been successfully applied to complex Online Analytical Processing (OLAP) queries that involve big data processing and common tasks [9,10]. MQO has been studied on centralized databases for more than 30 years; however, solving the same problem for relational Cloud databases has not been studied from the perspective of alternative query plans (QP) [8,12,13]. Conventional query engines find the fastest execution plan for each single query and try to execute it as fast as possible. MQO, on the other hand, can execute sets of queries together in a shorter time by using their alternative QPs.

In this study, we introduce four robust heuristic algorithms (Branch-and-Bound, Genetic, Hill Climbing, and Hybrid Genetic-Hill Climbing) that improve the total execution time of a set of queries in a relational Cloud database by using alternative QPs that have more sharable tasks. The proposed algorithms optimize the locality of previously computed tasks and of concurrently executing sub-queries, and use both to solve this problem.

Our contributions in this study can be listed as:

(1) The MQO problem is formally adapted for relational Cloud databases, including an improved cost model with network communication costs.
(2) Alternative QP generation methods for relational Cloud databases, where the site locations of join tasks can be decided intelligently to reduce communication costs, are developed and experimentally evaluated.
(3) Heuristic Branch-and-Bound, Genetic, Hill Climbing, and Hybrid Genetic-Hill Climbing algorithms are developed and experimentally evaluated for solving the Cloud MQO problem.

In Section 2, information about the related work on Grid/Cloud MQO techniques is given. Section 3 gives the formal definition of the problem. Section 4 explains the distributed query engine. Section 5 presents our proposed algorithms that work with alternative QPs. Section 6 discusses the experiments conducted for evaluating the proposed algorithms. Finally, our concluding remarks are given in Section 7.

2. Related work

The MQO problem was first defined in the 1980s, and finding a globally optimal QP by using MQO was shown to be an NP-Hard problem [8,16]. A detailed theoretical study of query scheduling, caching, and pipelining in MQO can be found in [18]. A considerable amount of MQO work has been done on relational databases [17,19,20]. The idea of using joint subexpressions has been applied to batch execution of multiple related queries and efficient maintenance of materialized views [49,50]. The studies in [16,51] considered these optimizations and used only the best plans of queries, thus achieving less sharing (i.e. higher total cost) than could be obtained by using suboptimal QPs. Polat et al. provide heuristics and methods for generating alternative QPs that will improve the performance of MQO [14]. In [46], the execution time of a batch of queries is improved by evaluating each common plan task only once; the common tasks are found with a lightweight and effective mechanism for detecting potential sharing opportunities among QP tasks.
When we survey MQO on distributed/parallel databases, we can note early research such as:

• Increasing inter-query locality by decomposing a query into parallel sub-tasks so that a scheduler rearranges the execution order of the QP tasks to maximize the reuse of cached data [21],
• Resource usage models to perform multiple query scheduling on parallel query processing systems in order to reduce the response times of queries [22],
• Dividing a query into sub-queries that can be executed in parallel on many processors and enabling already computed (and cached) sub-query results to be re-used for improving the processing speeds of new queries [20].

Mehta and DeWitt developed algorithms that take advantage of intra-operator parallelism; they used the CPU loads and tuple production rates of select and hash-join database operations to decide on the number of allocated processors and the assignment of database operations to these processors [23].

Distributed query processing middleware systems have also been extensively studied as a solution for data-intensive scientific applications. MOCHA [25] was one of the first database middlewares developed to execute database queries over distributed data sources. MOCHA could move the code required to process the query to the data storage site. In Beynon [26], user-defined functions can be executed at data storage sites to perform subsetting operations, and many filter (e.g. aggregation) operators can be run in parallel on a large number of computers.

Indexing the data at each server is an efficient method for distributed query optimization. R-trees are widely used to index and integrate the back-end servers as a single query server. Parallel R-trees, Master R-trees, and Master-Client R-trees are mechanisms used for improving the performance of shared-nothing environments [29]. More specifically, the savings resulting from reusing cached results have to be weighed against the service time, the extra storage cost, and the extra data access load imposed on the server where the cached result is located. Mondal et al. used data migration to shift the workload from heavily loaded servers to lightly loaded servers in shared-nothing environments [30]. Chen et al. considered the network layer of the problem and reduced the communication costs with a query reconstruction algorithm that enables sharing of overlapped data through micro-machines that collaborate in evaluating query batches [6].

IGNITE [11], OGSA-DQP [32], CoDIMS-G [33], and GridDB-Lite are some of the important projects that focus on Cloud/Grid data integration [34]. Except for IGNITE, none of these systems has MQO support.

Recently, studies have been performed for adapting traditional query optimizers to Cloud computing. In [47], a classical query optimizer is adapted to Cloud computing workloads where it uses a partitioned database on a shared-nothing architecture. In [44], a parallel data warehouse system optimizer is developed for single queries by considering a rich space of execution alternatives with bushy-tree plans instead of simply parallelizing the best serial plan. Unlike traditional query optimization, query optimization in Cloud environments can have different goals, and the search space becomes much larger. In an interesting study, the scheduling of data processing workflows on the Cloud is considered from the perspective of minimizing the completion time given a fixed budget [48].

Although there exist some initial studies to integrate MQO techniques into existing relational Cloud database query engines, to our knowledge, there is no approach like ours that optimizes a batch of queries by employing a relational Cloud database query optimizer which can produce and exploit alternative QPs. Recently, there were two remarkable projects. Giannikis et al.
developed a new database architecture that is based on batching queries and sharing computation across many concurrent queries on a shared-disk, shared-L3-cache, multi-core, multi-processor machine [9]. Their model does not try to generate any new alternative plans for input queries. A framework for a Cascade-style Cloud query optimizer has also been developed that uses MQO techniques to improve the performance of massive data analysis scripts containing common subexpressions [45]. This approach differs from our technique because new alternative plans are not generated and subexpression costs are used only for making optimization decisions. In our study, a data flow execution model (operator-centric) is used instead of an iterator model; thus, most of the mentioned systems cannot be compared with ours. IGNITE, a distributed query system that is similar to ours, is chosen for comparison, and its architecture is modified to equip it with the capability of alternative QP generation.

3. Problem formulation

In this section, we formulate the Cloud-MQO problem. A multiple query execution scenario in a relational Cloud database is given, the symbols used throughout the study are explained, and the formal problem description is given.

3.1. Sample scenario

A sample relational Cloud database environment for the TPC-H benchmark database can be seen in Fig. 1. Concurrently accessed databases, queries, the network, and the sites/processors are the main elements of the Cloud computing environment. In this scenario, there are 6 virtual machines connected via a network.

Fig. 1. Multiple queries executed on a Cloud computing environment.

In Fig. 2, two different QPs are shown for Query 3 of the TPC-H benchmark. QP1 scans the relations Customer at site S6, Order at site S4, and Lineitem at site S3. According to QP1, Customer and Order are joined at site S4, and Order and Lineitem are joined at site S3, before selection and projection operations are applied. QP2 joins the Customer and Order relations at site S3, and Order and Lineitem at site S4. As can be seen from this example, MQO becomes a very complicated problem when applied to a Cloud database environment.

Fig. 2. Two different QPs of TPC-H query 3 (σ is the selection operator that works on rows, π is the projection operator that selects the desired columns, and ⋈ is the join operator).

Traditional distributed query engines process a set of queries using only the best QPs of the queries and can make use of the cached (sub-)query results at different sites [6]. On the other hand, non-optimal QPs may result in a smaller total execution time. Some subexpressions can be shared by two or more queries, so that while the response time for an individual query can be higher, the total execution time of a set of queries can be improved dramatically. A considerable improvement can be obtained by taking advantage of common tasks, mainly by avoiding redundant page accesses and re-computation. Executing and caching the shared sub-queries at the sites with the lowest communication cost and shipping the outputs to the sites that need them as input are important factors that improve the total execution time.

Fig. 3 illustrates two global plans for the execution of queries q1 and q2 with alternative QPs of q1 (left-deep and right-deep execution orders) to show how the second query can exploit common tasks of q1. Both queries are issued from site S1 (the costs of the tasks are given in Table 1). According to Fig. 3a, relations B and C are scanned from disk and brought into main memory to produce the join (B ⋈ C) in 27 s. Relation A is also scanned at site S1 while executing the join (B ⋈ C). In the last phase, the join A ⋈ (B ⋈ C) is executed at site S1, for a total of 42 s, and another 20 s is needed to complete q2,1. It takes 62 s to complete both queries in total.
On the other hand, with the same access path selections, the total execution time is improved to 42 s with the second global plan (a 32% decrease in the total execution time), because the common join (A ⋈ B) is executed only once and shared by both queries.

Fig. 3. Alternative query plan generation can achieve larger common tasks and smaller total cost for a query batch.

Table 1
Costs of the tasks.

Task   Cost (s)   Explanation of the task
t1     5          Scan relation A from the disk at S1
t2     5          Scan relation B from the disk at S1
t3     5          Scan relation C from the disk at S2
t8     7          Send relation C from S2 to S1
t6     10         Join cost of A ⋈ B
t4     10         Join cost of B ⋈ C
t7     10         Join cost of (A ⋈ B) ⋈ C
t5     10         Join cost of A ⋈ (B ⋈ C)

3.2. Formulation

Before giving the definition of the Cloud-MQO problem, we explain the meaning of a task. A task is a process that an operator (machine) can handle. Tasks can be parts of a complex query; for example, the scan of a relation by a machine can be an input to a hash-join. In other words, a task can be seen as an operator. The same tasks can be executed at different virtual machines. In this study, we divide query plans (QPs) into tasks and execute the related tasks with the same machines only once [10,27].

The symbols used to define the Cloud-MQO problem formally are given in Table 2. Symbol S is the set of virtual machines that are connected to the other sites via a network, and s is an element of S. The number of virtual machines in the environment is |S|. Symbol Q represents the batch of queries. pi,m is the m-th query execution plan of query qi. ti,s is a task that can be used by different QPs, with a CPU execution cost of ei,s and a shipping cost, ci,s, for the resulting tuples of the task. ioi,s is the time spent reading the data from the disk for the task ti,s. The details of how these parameters are used during the optimization are given in Section 5.

Table 2
Symbols and notations.

Symbol   Definition
S        Set of virtual machines
|S|      The cardinality of the set of sites S
Si       Site i
Q        Set of queries
qi       Query i
pi,j     j-th query execution plan of query qi
ti,s     Task i executed at site s
ei,s     CPU execution cost of task ti,s
ci,s     Network comm. cost of output tuples of task ti,s
ioi,s    Time to read the data from the disk for ti,s

The proposed system focuses on the detection and the synchronization of subexpressions by using alternative QPs that increase the possibility of detecting more shareable tasks [14]. Instead of compiling each query into a separate QP, our system compiles the whole workload of a query batch into a single plan (called a global QP) and calculates its total execution time. This global plan computes and serves the results of many concurrent queries. The queries issued from different sites may have common tasks. Each query has a number of alternative QPs, named pi,j for the j-th alternative QP of query qi. In a distributed environment, each QP corresponds to a query evaluation tree in which each node represents a task and the site where the task is executed. Edges indicate the network.

More formally:

1. The set of virtual machines with k elements is S = {s1, ..., sk}.
2. |S| is the number of virtual machines.
3. The set of queries ranging from 1 to n is Q = {q1, ..., qn} (s is the element of set S where a query is issued).
4. The set of m alternative QPs of query qi is: QPs of qi = {pi,1, ..., pi,m}.
5. The set of j tasks of the x-th query execution plan of query qi,s that can be executed at all of the sites is: tasks of plan pi,x = {t1,1, ..., t1,|S|, ..., tj,1, ..., tj,|S|}, meaning that task t1,s of a query execution plan can be executed at any of the sites, where |S| is the number of sites.

Tasks having the same inputs and producing the same results may have different costs depending on the site at which they are performed, due to the communication costs and the resources available at that site (relations, partitions of relations, replications, etc.).
In an environment where the same subexpressions are exploited, the main purpose of the proposed system is to determine a set of QPs with a minimum total execution cost such that the shared tasks are executed only once and the communication cost of the shared results is minimized by careful site selection. Although the Cloud-MQO problem is very similar to MQO in a centralized database, the storing and caching (in memory) of relations and intermediate query results at different sites, combined with the communication costs of shipping sub-query results between sites, make Cloud-MQO a more complex problem to optimize.
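To make the objective concrete, the Fig. 3 example can be recomputed from the task costs in Table 1. The sketch below is illustrative only: the task lists assigned to each plan are inferred from the totals quoted in Section 3.1 (27 s, 42 s, and 20 s) and are an assumption, as are the names `COST`, `plan_cost`, and `shared_cost`.

```python
# Task costs in seconds, copied from Table 1.
COST = {
    "t1": 5,   # scan relation A at S1
    "t2": 5,   # scan relation B at S1
    "t3": 5,   # scan relation C at S2
    "t8": 7,   # send relation C from S2 to S1
    "t6": 10,  # join A with B
    "t4": 10,  # join B with C
    "t7": 10,  # join (A join B) with C
    "t5": 10,  # join A with (B join C)
}

# Assumed task lists, inferred from the totals quoted in the text.
Q1_RIGHT_DEEP = ["t1", "t2", "t3", "t8", "t4", "t5"]  # A join (B join C)
Q1_LEFT_DEEP = ["t1", "t2", "t3", "t8", "t6", "t7"]   # (A join B) join C
Q2 = ["t1", "t2", "t6"]                               # A join B


def plan_cost(tasks):
    """Cost of one plan executed on its own."""
    return sum(COST[t] for t in tasks)


def shared_cost(plans):
    """Cost of a global plan in which every distinct task runs only once."""
    distinct = set()
    for tasks in plans:
        distinct.update(tasks)
    return sum(COST[t] for t in distinct)


# First global plan: q2 cannot reuse the join of q1; the text charges it separately.
first_total = plan_cost(Q1_RIGHT_DEEP) + plan_cost(Q2)
# Second global plan: q2 is fully covered by tasks already computed for q1.
second_total = shared_cost([Q1_LEFT_DEEP, Q2])
saving = round(100 * (first_total - second_total) / first_total)

print(first_total, second_total, saving)  # 62 42 32
```

Under these assumptions the totals match the 62 s and 42 s reported for the two global plans, and the saving rounds to the 32% stated in the text.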

4. Distributed query engine

This section explains the structure of the distributed query engine that is used in this study. How the detection of common subexpressions is performed, the types of query execution processing trees for relational Cloud databases, the generation of alternative query execution plans, micro machines (machines), and the distributed query cost model are explained.

The distributed query engine is a relational database integration system that provides a collection of virtual views for integrating data objects from distributed data sources and can be implemented on top of any traditional database [11]. QPs are represented as trees whose leaf nodes are the relations to be joined. Non-leaf nodes of the trees indicate the joins, and edges indicate the flow of partial results from the leaves to the root of the tree. In addition, the edges represent the network layer for the relational Cloud database. Each evaluation plan has a cost and can be represented with various tree structures [15]. Examples of left/right-deep, bushy, and zig-zag trees are shown in Fig. 4. Different orders of these trees can be used during the execution of the queries, and the site of each operator is a crucial issue that must be handled.

Fig. 4. Query processing trees.

The distributed query execution model uses a data flow style that is also known as the operator-centric model [10]. This model uses micro machines (machines). Tasks are implemented as independent machines. Each machine has input and output data buffers. Machines can consume each other's outputs while working in parallel. Fig. 5 describes the query execution engine architecture. In this architecture, input query batches are first received by the MQO component of the system. The MQO component finds the best query execution plan configuration for the input queries by generating alternative QPs. The set of optimized QPs (the global plan) is passed

to the dispatcher. The dispatcher sends the requests to machines. Scan, sort, join, aggregation, and a communicator that sends data to and receives data from the other sites are the machines used by the system [6]. In a data flow query execution model, machines can work in parallel and better data sharing can be achieved across the sites.

Fig. 5. Distributed query execution model.

4.1. Alternative query plan generation

After being parsed, queries are decomposed into tasks that will form a global query execution plan. Selection, projection, join, sorting, and data shipping are the main tasks that are executed to compute a distributed query. The basic principle of the proposed query generator is to decompose a query into its basic tasks, which are then explored by the query optimizer. Given a query execution plan, the alternative plan generator constructs plans consisting of tasks with different QP trees [14]. The plan generator interacts with a cost model, and the cost model of our plan evaluator is different from conventional cost models. A strategy of performing selections and projections before the joins can be efficient for single queries, whereas joining two relations first and applying selections later can be more effective if the same join is used repeatedly [9]. Alternative QPs (with selection-first and projection-first heuristics) for TPC-H query 3 can be seen in Fig. 6. The precedence of applying selection and projection, starting the execution of a query with join operations, and changing the sites of the joins are the means of generating alternative QPs.

Fig. 6. Alternative query execution plans for TPC-H query 3.

Fig. 7. Global query execution plan for a set of TPC-H queries.

4.2. Detecting the common tasks of queries

Given a set of queries Q and, for each query qi, its alternative query execution plans {pi,1, ..., pi,k}, where k is the number of alternative plans for query qi, a set of plans can be selected such that the total execution time of the queries is minimized. A global QP, G, that combines the common tasks of the queries can be generated using the formulation given below [52]:

$$G = (V, A, C_a^q, f_q)$$

where V is the set of vertices and A is the set of arcs over V.

• Create a vertex v for every base relation and relational algebra operator (select and join) in a query tree.
• T(v) is the relation produced by the corresponding vertex v; it can be a relation at the leaf level or an intermediate result that is produced during the processing of a query.
• L is the set of leaf nodes.
• For any root vertex v, T(v) corresponds to a global query, and R is the set of root nodes.
• If a base relation or an intermediate result relation T(u) corresponding to vertex u is needed for a process at node v, an arc u → v is introduced.

• S(v) denotes the source nodes that have edges pointing to vertex v. S(v) = {} if v is a member of the leaf nodes L. S*(v) is the set of descendants of v.
• D(v) denotes the destination nodes to which v points. For any v ∈ R, D(v) = {}.
• $C_a^q(v)$ is the cost of query q accessing T(v), if T(v) has previously been computed.
• $f_q$ is the frequency of the query.

If G is the global QP and $C_{q_i}(G)$ is the cost of computing query $q_i$ from the set of common tasks, then the total query execution time is

$$\sum_{q_i \in Q} f_{q_i} \, C_{q_i}(G).$$
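The vertex and arc rules above can be sketched in a few lines of Python. This is a minimal illustration, not the system's implementation: plan trees are encoded as nested tuples, two subtrees are treated as the same task when their canonical signatures match, and joins are assumed to be commutative for matching purposes. The function names and the encoding are hypothetical.

```python
def signature(node):
    """Canonical string for a plan subtree; equal strings mean a common task."""
    if isinstance(node, str):          # leaf: a base relation
        return node
    op, *children = node               # inner node: (operator, child, ...)
    parts = [signature(c) for c in children]
    if op == "join":                   # assumption: join inputs are order-insensitive
        parts.sort()
    return op + "(" + ",".join(parts) + ")"


def build_global_plan(plans):
    """Merge the plan trees of all queries into one DAG G = (V, A).

    plans maps a query name to its plan tree. Returns the vertices (each with
    the set of queries that use it) and the arcs u -> v meaning T(u) feeds v.
    """
    vertices, arcs = {}, set()

    def visit(node, query):
        sig = signature(node)
        vertices.setdefault(sig, set()).add(query)
        if not isinstance(node, str):
            for child in node[1:]:
                arcs.add((visit(child, query), sig))
        return sig

    for query, tree in plans.items():
        visit(tree, query)
    return vertices, arcs


plans = {
    "q1": ("join", ("join", "A", "B"), "C"),
    "q2": ("join", "B", "A"),
}
vertices, arcs = build_global_plan(plans)
shared = sorted(v for v, users in vertices.items() if len(users) > 1)
print(shared)  # ['A', 'B', 'join(A,B)']
```

Vertices used by more than one query are the candidates for being executed once and shared; a real optimizer would also have to compare predicates, projections, and execution sites, which this sketch ignores.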

The cost model for distributed computation environments needs to take into account the communication cost of transferring data. Given a query qi that is submitted by node Nj, and denoting by Vk a common task used to answer qi, the communication cost is zero if Vk is executed at the same node. Otherwise, if Nl is the node that contains Vk, then the communication cost of transferring Vk from Nl to Nj is:

$$\mathrm{ComCost}(V_k, N_l \rightarrow N_j) = C_{N_j,N_l} \times \mathrm{size}(V_k)$$

where $C_{N_j,N_l}$ is the network transmission cost per unit of data transferred between nodes Nj and Nl, and size(Vk) is the resulting file size of Vk.

Different global query execution plans can be built depending on the query execution plans that are chosen, and by searching the alternative global query execution plans it is possible to obtain significant improvements. Fig. 7 shows a sample global query execution plan (G) for a set of TPC-H queries that share common tasks.

4.3. Cost model

In order to measure the effectiveness of the global query execution plans, we develop a cost model that is based on the total execution time of the queries [35,24,36]. The model uses QP trees that can be executed in a parallel manner by several computers to examine every factor in detail [44]. The cost model depends on the statistics of the database. The main parameters used in the cost model are shown in Table 3 [39].

Table 3
Parameters used in the cost model.

Symbol   Definition
TI/O     I/O time for a page
#I/O     Number of page I/O operations
TCPU     Time for a CPU instruction
#insts   Number of instructions
TMSG     Time to initiate and receive a message
#msgs    Number of messages
TTR      Time to transmit a page
#pages   Number of pages
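The communication-cost formula and the parameters of Table 3 can be combined as in the sketch below. The additive form used in `task_time` is an assumption, modeled on the cost functions commonly given in distributed database textbooks; it is not quoted from this paper, and all numeric values are hypothetical.

```python
def com_cost(unit_cost, result_size, same_node=False):
    """ComCost(V_k, N_l -> N_j): zero when V_k is already at the requesting node."""
    if same_node:
        return 0
    return unit_cost * result_size


def task_time(n_io, n_insts, n_msgs, n_pages, t_io, t_cpu, t_msg, t_tr):
    """Assumed additive combination of the Table 3 parameters."""
    return t_io * n_io + t_cpu * n_insts + t_msg * n_msgs + t_tr * n_pages


# Hypothetical values, in arbitrary time units.
remote = com_cost(unit_cost=2, result_size=10)
local = com_cost(unit_cost=2, result_size=10, same_node=True)
elapsed = task_time(n_io=10, n_insts=1000, n_msgs=2, n_pages=10,
                    t_io=10, t_cpu=1, t_msg=100, t_tr=5)
print(remote, local, elapsed)  # 20 0 1350
```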

Using the number of pages as a parameter, the total execution time taken by a task is calculated as:

Table 4
Sample execution of B&B algorithm.

State   Est. cost   Action
        52          Expanded
        52          Expanded
        67          Pruned
        37          Expanded
        37          Expanded
        37          Solution
        37          Solution

$p_{i,s}$ represents the number of alternative QPs of $q_{i,s}$. The heuristic function of B&B is:

$$h(S_m) = \sum_{t_x \in t_{sel}} \mathrm{cost}(t_x) + \sum_{m} \min\big(\mathrm{est\_cost}(p_{i,1}), \ldots, \mathrm{est\_cost}(p_{i,(p_{i,m})})\big) \tag{5}$$
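One way to read Eq. (5) is as an optimistic bound: the cost of the tasks already selected, plus the cheapest estimated plan for every query that has not been assigned a plan yet. The sketch below follows that reading, which is an assumption about the state definition; the argument names and the sample numbers are hypothetical.

```python
def heuristic(selected_tasks, task_cost, unassigned_plan_costs):
    """h(S_m) under the assumed reading of Eq. (5).

    selected_tasks: tasks of the plans chosen so far (t_sel); duplicates are
        counted once because a shared task is executed only once.
    task_cost: mapping from task name to its cost.
    unassigned_plan_costs: for each query without a chosen plan, the estimated
        costs of its alternative plans.
    """
    committed = sum(task_cost[t] for t in set(selected_tasks))
    optimistic = sum(min(costs) for costs in unassigned_plan_costs.values())
    return committed + optimistic


task_cost = {"t1": 5, "t2": 5, "t6": 10}
estimate = heuristic(
    selected_tasks=["t1", "t2", "t6", "t1"],
    task_cost=task_cost,
    unassigned_plan_costs={"q2": [22, 17], "q3": [30, 12]},
)
print(estimate)  # 49
```

In a branch-and-bound search, a state whose estimate is no better than the best complete solution found so far would typically be discarded, which is consistent with the "Pruned" action listed in Table 4.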
