Distributed Query Optimization in the Stack-Based Approach

Hanna Kozankiewicz1, Krzysztof Stencel2, and Kazimierz Subieta1,3

1 Institute of Computer Sciences of the Polish Academy of Sciences, Warsaw, Poland [email protected]
2 Institute of Informatics, Warsaw University, Warsaw, Poland [email protected]
3 Polish-Japanese Institute of Information Technology, Warsaw, Poland [email protected]

Abstract. We consider query execution strategies for object-oriented distributed databases. There are several scenarios of query decomposition, all of which assume that the corresponding query language is fully compositional, i.e. allows decomposing queries into subqueries addressing local servers. Compositionality is a hard issue for known OO query languages such as OQL. We therefore use the Stack-Based Approach (SBA) and its query language SBQL, which is fully compositional and adequate for distributed query optimization. We show flexible methods based on the decomposition of SBQL queries in distributed environments. Decomposition can be static or dynamic, depending on the structure of the query and the distribution of data. The paper presents only the main assumptions, which are now the subject of our study and implementation.

L.T. Yang et al. (Eds.): HPCC 2005, LNCS 3726, pp. 904–909, 2005. © Springer-Verlag Berlin Heidelberg 2005

1 Introduction

Distributed query optimization has been thoroughly investigated for relational database systems (see surveys, e.g. [1,2,3,4,5]), but the methods are hard to generalize to object-oriented or XML-oriented databases. In order to execute a query in a distributed environment, one has to decompose it into queries to be run by a number of participating servers. Unfortunately, queries in the most widely known query languages, OQL and XQuery, are hard to decompose due to their irregular, non-orthogonal syntax and imprecise semantics. Therefore, optimization methods may work for simple queries when data is horizontally fragmented and centralized or distributed indices are present. More advanced methods, however, remain a hard issue.

A distributed query processor should minimize: (1) communication costs, (2) CPU and I/O costs at the client (the global application) and (3) CPU and I/O costs at servers. Furthermore, parallel computation at many servers should be exploited. If all computations are performed by clients (clients are fat), communication lines will be heavily loaded. If all computations are performed by servers (clients are thin), the most heavily loaded server can become the bottleneck.

Execution strategies can be global or local. A global strategy concerns a global application and a coordinating server, while local strategies are implemented by all participating sites. Global strategies should be supplemented by appropriate local strategies. For example, before any global strategy is applied, local optimization, e.g. using indices and rewriting (factoring out independent subqueries, pushing expensive operators down the syntax tree, etc.), should be applied.

In this paper we consider query execution strategies for object-oriented distributed databases. We use the Stack-Based Approach (SBA) [6,7] as the framework, since it provides a fully compositional query language called SBQL (Stack-Based Query Language). Its flexibility allows designing many decomposition and optimization strategies, which can also exploit opportunities for parallel computation. The paper deals with global optimization strategies. We discuss several optimization scenarios. In our discussion and simple examples we assume the most common case of horizontal data fragmentation and distributive queries.

The paper is organized as follows. In Section 2 we consider various execution strategies which can be applied in distributed environments. In Section 3 we sketch the Stack-Based Approach. In Section 4 we present optimization strategies for horizontally fragmented data. Section 5 concludes.

2 Query Execution Strategies

Popular techniques of distributed query processing are based on decomposition [8,9,10,11,12]. A global application (client) usually processes queries in the following way: (1) the query is parsed and (2) decomposed into subqueries to be sent to particular sites, (3) the order of execution (including parallelism) of these subqueries is established, (4) the subqueries are sent to servers, (5) results of these subqueries are collected and (6) combined by the client. Several strategies can be used within this framework:

Total data shipping. The client requests all the data required by a query from servers and processes the query itself. The strategy is conceptually simple and easy to implement, but causes a high or even unacceptable communication overhead.

Static decomposition. The query is decomposed into a number of subqueries to be sent to servers simultaneously. Servers work in parallel. The client collects the results and combines them into a global result.

Dynamic decomposition. The client analyses a query and then generates a subquery to be sent to one of the servers. It then collects the result from this server and uses it to generate another subquery to be sent to the next server. The result returned by the second server is collected, and so on. Servers do not work in parallel.

Dynamic decomposition with data shipped to servers. Subqueries sent to servers are accompanied by a data collection, which facilitates better optimization at these servers. Semi-joins [13] are a well-known application of this strategy.

Hybrid decomposition. This strategy exploits the advantages of both dynamic and static decomposition, as well as the possibility of shipping data, which facilitates local optimizations. At the beginning several subqueries may be generated and executed simultaneously by a number of servers. The collected results are used to generate a subsequent batch of subqueries and data fragments. The process is repeated until the final result of the query is computed.
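The difference between static and dynamic decomposition can be illustrated with a minimal Python sketch. It is not part of the approach described here: the in-memory `SERVERS` dictionary, the function names and the sample data are all hypothetical, and a subquery is modelled simply as a Python function applied to a server's fragment.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "servers": each holds one horizontal fragment.
SERVERS = {
    "A": [{"name": "Smith", "sal": 2000}, {"name": "Blake", "sal": 3000}],
    "B": [{"name": "Jones", "sal": 3500}, {"name": "Clark", "sal": 2500}],
}

def run_on_server(server, subquery):
    """Stand-in for shipping a subquery to a server and collecting its result."""
    return subquery(SERVERS[server])

def static_decomposition(subquery):
    """Send the same subquery to all servers at once and combine the results."""
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda s: run_on_server(s, subquery), SERVERS)
    return [row for part in parts for row in part]

def dynamic_decomposition(make_subquery, initial=None):
    """Visit servers one by one; each result parameterizes the next subquery."""
    carried = initial
    for server in SERVERS:
        carried = run_on_server(server, make_subquery(carried))
    return carried

# Static: every server evaluates the same projection in parallel.
all_names = static_decomposition(lambda emps: [e["name"] for e in emps])

def top_salary_so_far(best):
    # The subquery sent to the next server depends on the previous result.
    return lambda emps: max([best] + [e["sal"] for e in emps])

# Dynamic: the running maximum is carried from server to server.
highest = dynamic_decomposition(top_salary_so_far, initial=0)
```

In the static variant the client only merges results, whereas in the dynamic variant it must wait for each server before it can generate the next subquery, which is why the servers do not work in parallel.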


3 Stack-Based Approach (SBA)

Query optimization is efficient only if it is possible to verify that the query before optimization is equivalent to the query after optimization for all database states. Therefore, we must have a precise definition of the query language semantics. Specification of semantics is a weak point of current query languages such as OQL and XQuery. They present semantics through rough explanations and examples, which leave a lot of room for different interpretations. In contrast, SBA and its query language SBQL have fully formal, precise and complete semantics.

In SBA a query language is treated as a kind of programming language. Thus the evaluation of queries is based on mechanisms well known from programming languages. The approach precisely determines the semantics of all query operators (including hard non-algebraic ones, such as dependent joins, quantifiers or transitive closures), their relationships with object-oriented concepts, constructs of imperative programming, and programming abstractions, including procedures, functional procedures, methods, views, etc. Thanks to the stack-based semantics, all these abstractions can be recursive.

In the Stack-Based Approach four data store models are defined, with increasing functionality and complexity. The M0 model described in [6] is the simplest data store model. In M0 objects can be nested (with no limitations on nesting levels) and can be connected with other objects by links. M0 covers relational and XML-oriented structures. It can be easily extended [7] to comply with more complex models which include classes and static inheritance (M1), dynamic object roles and dynamic inheritance (M2), encapsulation (M3) and other features of object-oriented databases.

The basis of SBA is the environment stack, the most basic auxiliary data structure in programming languages. It supports the abstraction principle, which allows the programmer to consider the currently written piece of code to be independent of the context of its possible uses. SBA respects the naming-scoping-binding discipline, which means that each name occurring in a query is bound to a run-time entity (an object, an attribute, a method, a parameter, etc.) according to the scope of its name.
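The role of the environment stack in name binding can be sketched in a few lines of Python. This is a deliberately simplified illustration under our own assumptions, not the formal SBA definition: the class `EnvironmentStack`, the function `where` and the sample data are hypothetical, each stack section is a plain dictionary, and binding returns a single entity from the innermost section that contains the name.

```python
class EnvironmentStack:
    """Toy environment stack: a list of sections mapping names to entities."""

    def __init__(self):
        self.sections = []

    def push(self, binders):
        self.sections.append(dict(binders))

    def pop(self):
        return self.sections.pop()

    def bind(self, name):
        # Search from the top (innermost scope) down to the bottom.
        for section in reversed(self.sections):
            if name in section:
                return section[name]
        raise NameError(name)

def where(stack, collection_name, predicate):
    """Evaluate `collection where predicate` by opening a scope per object."""
    result = []
    for obj in stack.bind(collection_name):
        stack.push(obj)  # the object's attributes become visible names
        try:
            if predicate(stack):
                result.append(obj)
        finally:
            stack.pop()  # leave the scope before the next object
    return result

stack = EnvironmentStack()
# Bottom section: the database roots (sample data, invented for this sketch).
stack.push({"Emp": [{"name": "Smith", "sal": 2000},
                    {"name": "Blake", "sal": 3000}]})

# Emp where sal > 2500: the name `sal` is bound inside each object's scope.
well_paid = where(stack, "Emp", lambda env: env.bind("sal") > 2500)
```

The predicate never needs to know which object it is evaluated for; the name `sal` is resolved purely by the current state of the stack, which is the scoping discipline described above.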

4 Distributed Queries to Horizontally Fragmented Data

Due to the compositionality of SBQL, queries can be easily decomposed into subqueries, down to atomic ones (single names or values). This property can be used to decompose queries into subqueries addressing particular servers. For instance, if a name N occurs in a query and we know that the data named N is horizontally fragmented among servers A and B, then we can substitute the name N by (N_A ∪ N_B), where N_A is to be bound on server A and N_B on server B. Such a decomposition works in the general case, but it is easier for horizontally fragmented data (the most usual case) and for queries distributive with respect to set/bag union. If a query is not distributive, a dynamic scheme (similar to semi-joins) can be used.

If data is horizontally fragmented, a query can be executed in the simplest variant of the static decomposition scenario, which assumes sending the query in parallel to all servers and then taking the union of their results. This requires, however, checking whether the query is distributive, i.e. the union operator, as shown above, must be recursively pushed to the root of the query syntax tree.
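What distributivity with respect to bag union means can be checked on sample data with a short Python sketch. All names and data below are hypothetical, and the check only compares q(A ∪ B) with q(A) ∪ q(B) on one concrete pair of fragments; it illustrates the property but is not a proof of it.

```python
from collections import Counter

# Two invented horizontal fragments of the Emp collection.
milano = [{"name": "Smith", "sal": 2000}, {"name": "Blake", "sal": 3000}]
napoli = [{"name": "Jones", "sal": 4000}]

def project_name(emps):
    """Emp . name"""
    return [e["name"] for e in emps]

def select_rich(emps):
    """(Emp where sal > 2500) . name"""
    return [e["name"] for e in emps if e["sal"] > 2500]

def avg_sal(emps):
    """avg(Emp . sal), returned as a one-element bag"""
    return [sum(e["sal"] for e in emps) / len(emps)]

def is_distributive_here(query, fragments):
    """Compare q(A ∪ B) with q(A) ∪ q(B) as bags, on this sample data only."""
    whole = [e for frag in fragments for e in frag]
    global_result = Counter(query(whole))
    merged_local = Counter()
    for frag in fragments:
        merged_local.update(query(frag))
    return global_result == merged_local

projection_ok = is_distributive_here(project_name, [milano, napoli])
selection_ok = is_distributive_here(select_rich, [milano, napoli])
average_ok = is_distributive_here(avg_sal, [milano, napoli])
```

On this data the projection and the selection give the same bag either way, while the average does not: the global average is 3000, but the two local averages are 2500 and 4000. An aggregate of this kind therefore has to be evaluated above the union, at the client.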


A distributed optimizer must have access to a global schema which describes the structure and location of distributed objects. The global schema models the data store and itself looks like a data store. Some nodes of the global schema have the attribute loc. Its value indicates the names of the servers which store the data described by the node. If the value of a node’s attribute loc contains a number of servers, the collection of objects represented by this node is fragmented among the indicated servers. Only root object nodes have the attribute loc, since we assume that an entire object is always stored on the same server (no vertical fragmentation). Pointers from objects to objects may cross the boundaries of servers. A thorough description of database schemata for SBQL can be found in [14]. Due to space limits, in this paper we use a very simple schema with a collection of objects Emp, which are fragmented and stored in Milano and Napoli.

4.1 Analyzing Queries

1. A query is statically analyzed against the global schema in order to determine the schema nodes to which particular names occurring in the query are to be bound. The method of the analysis involves static environment and result stacks, see e.g. [7, 15]. During this process, if a name is to be bound to a node of the global schema which contains the attribute loc, the name is marked with the value of this attribute. In this way the names of local servers are introduced into the query syntax tree.
2. The distributiveness of query operators is employed. All nodes of the syntax tree associated with servers are marked with the flag “distributive”. Next, if a node marked distributive is the left argument of a distributive operator, the whole subquery rooted at this operator is also marked distributive. This process is repeated up the syntax tree of the query until no more markings are possible. If this process meets a non-distributive operator (e.g. an aggregate function), it is terminated.
3. Each node of the syntax tree marked distributive and having associated names of servers is split by the union operator into as many nodes as there are servers. Then, the splitting is continued on bigger and bigger query syntax sub-trees, until the root or a non-distributive node is reached.
4. In this way the query is decomposed into subqueries addressing single servers. Depending on the interdependencies of the subqueries, they can be executed sequentially or simultaneously. If a subquery q1 is parameterized by a subquery q2, q2 must be executed before q1. If they are independent, they can be executed in parallel.

4.2 Example of Static Decomposition

The majority of queries are simply distributive with respect to horizontal fragmentation. They can be statically decomposed into subqueries addressed to a number of servers. The results collected from the servers are then merged into the final result. Consider the following query:

    Emp . name    (1)

The name Emp is marked distributive and is assigned to servers Milano, Napoli:

    Emp_{Milano,Napoli} . name    (2)


and equivalently:

    (Emp_Milano ∪ Emp_Napoli) . name    (3)

Since the dot operator is distributive with respect to horizontal fragmentation, we can push ∪ up the syntax tree of the query and eventually obtain:

    (Emp . name)_Milano ∪ (Emp . name)_Napoli    (4)
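The marking and splitting steps of Section 4.1, applied to the query Emp . name, can be sketched in Python as follows. This is an illustrative simplification under our own assumptions: the syntax tree classes, the `GLOBAL_SCHEMA` dictionary, the set `DISTRIBUTIVE_OPS` and the textual `@server` notation for a subquery addressed to one server are all hypothetical, and the static analysis with environment and result stacks is replaced by a direct schema lookup.

```python
from dataclasses import dataclass

# Hypothetical global schema: root object nodes carry a `loc` attribute.
GLOBAL_SCHEMA = {"Emp": {"loc": ["Milano", "Napoli"]}}
# Operators assumed distributive in their left argument (illustrative list).
DISTRIBUTIVE_OPS = {".", "where"}

@dataclass
class Name:
    name: str

@dataclass
class BinOp:
    op: str
    left: object
    right: object

@dataclass
class Call:  # e.g. an aggregate function, treated as non-distributive
    fn: str
    arg: object

def servers(node):
    """Servers a node can be split over, or [] if it is not marked distributive."""
    if isinstance(node, Name):
        return GLOBAL_SCHEMA.get(node.name, {}).get("loc", [])
    if isinstance(node, BinOp) and node.op in DISTRIBUTIVE_OPS:
        return servers(node.left)  # marking propagates from the left argument
    return []

def show(node):
    if isinstance(node, Name):
        return node.name
    if isinstance(node, Call):
        return f"{node.fn}({show(node.arg)})"
    return f"({show(node.left)} {node.op} {show(node.right)})"

def split(node):
    """Replace each maximal distributive subtree by a union of per-server subqueries."""
    locs = servers(node)
    if locs:
        return " ∪ ".join(f"{show(node)}@{s}" for s in locs)
    if isinstance(node, Call):
        return f"{node.fn}({split(node.arg)})"
    if isinstance(node, BinOp):
        return f"({split(node.left)} {node.op} {split(node.right)})"
    return show(node)

emp_names = split(BinOp(".", Name("Emp"), Name("name")))
emp_count = split(Call("count", Name("Emp")))
```

For Emp . name the union reaches the root of the tree, which corresponds to form (4). For the hypothetical query count(Emp) the splitting stops below the aggregate, so only the argument is distributed and the aggregate itself stays with the client.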

This query can be executed in parallel on different servers, and the results are eventually merged by the client.

4.3 Example of Dynamic Decomposition

Sometimes a query contains a subquery that acts as its parameter. Consider the following query (find employees who earn more than Blake):

    Emp where sal > ((Emp where name = “Blake”) . sal)    (5)

In this query the right argument of the outer selection contains a so-called independent subquery that can be executed independently of the outer one. Each occurrence of the name Emp can be decorated with the names of servers:

    Emp_{Milano,Napoli} where sal > ((Emp_{Milano,Napoli} where name = “Blake”) . sal)    (6)

After this operation we can split the query syntax tree, as shown in the previous example. Because we expect that Blake’s salary is a value that can be found on one of the servers, the final result of the decomposition can be presented as follows:

    P := ((Emp where name = “Blake”) . sal)_Milano
    if P is null then P := ((Emp where name = “Blake”) . sal)_Napoli    (7)
    (Emp where sal > P)_Milano ∪ (Emp where sal > P)_Napoli

The decomposition employs the distributivity of the where operator. In effect, the inner subquery is executed on the servers in some order and then the outer query is executed in parallel. The decomposition (the order of execution of subqueries) may involve a cost model estimating the execution cost of subqueries on particular servers, e.g. we can decide whether to look for Blake’s salary first in Milano and then in Napoli, or vice versa.
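The two phases of decomposition (7) can be sketched in Python as follows. The sketch rests on our own assumptions: the `FRAGMENTS` dictionary and its sample data are invented, the helper functions stand in for subqueries evaluated at a server, the `probe_order` parameter stands in for the decision a cost model would make, and returning an empty result when P is not found anywhere is a choice made for this sketch only.

```python
from concurrent.futures import ThreadPoolExecutor

# Invented horizontal fragments of Emp stored on the two servers.
FRAGMENTS = {
    "Milano": [{"name": "Smith", "sal": 2000}, {"name": "Rossi", "sal": 3600}],
    "Napoli": [{"name": "Blake", "sal": 3000}, {"name": "Jones", "sal": 3500}],
}

def salary_of(server, who):
    """(Emp where name = who) . sal evaluated on one server; None if absent."""
    for emp in FRAGMENTS[server]:
        if emp["name"] == who:
            return emp["sal"]
    return None

def earning_more_than(server, threshold):
    """(Emp where sal > threshold) evaluated on one server (names only)."""
    return [e["name"] for e in FRAGMENTS[server] if e["sal"] > threshold]

def employees_earning_more_than(who, probe_order=("Milano", "Napoli")):
    # Phase 1 (sequential): probe servers until the parameter P is found.
    p = None
    for server in probe_order:
        p = salary_of(server, who)
        if p is not None:
            break
    if p is None:
        return []  # assumption of this sketch: no P, no result
    # Phase 2 (parallel): the outer query is distributive once P is a constant.
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda s: earning_more_than(s, p), FRAGMENTS)
    return [name for part in parts for name in part]

richer_than_blake = employees_earning_more_than("Blake")
```

With this data the first probe in Milano finds nothing, the second probe in Napoli yields P = 3000, and the outer selection then runs on both servers at once. Passing `probe_order=("Napoli", "Milano")` would save one round trip here, which is the kind of choice the cost model is meant to inform.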

5 Conclusions

In this paper we have proposed optimization techniques for distributed object databases. We have shown that a fully compositional query language facilitates many decomposition methods, either static or dynamic. The Stack-Based Approach (SBA) and the query language SBQL have a unique potential for query decomposition. Unlike in OQL and XQuery, each fragment of an SBQL query, even an atomic name or value, is also a query. Therefore, the optimizer has full freedom in partitioning a query into subqueries sent to remote servers. The decomposition can support various scenarios of distributed query processing, including static and dynamic decomposition.


We have discussed optimization techniques for the data grid architecture originally proposed in [16]. This architecture is currently being developed on top of our object-oriented DBMS ODRA (Object Database for Rapid Application development), which is devoted to grid applications, the composition of web services and distributed web content management.

References

1. C.T. Yu, C.C. Chang: Distributed Query Processing. ACM Comput. Surv. 16(4): 399-433 (1984)
2. S. Ceri, G. Pelagatti: Distributed Databases: Principles and Systems. McGraw-Hill Book Company, 1984
3. M.T. Özsu, P. Valduriez: Principles of Distributed Database Systems, Second Edition. Prentice-Hall, 1999
4. D. Kossmann: The State of the Art in Distributed Query Processing. ACM Comput. Surv. 32(4): 422-469 (2000)
5. C.T. Yu, W. Meng: Principles of Database Query Processing for Advanced Applications. Morgan Kaufmann Publishers, 1998
6. K. Subieta, Y. Kambayashi, J. Leszczyłowski: Procedures in Object-Oriented Query Languages. Proc. VLDB Conf., Morgan Kaufmann, 182-193, 1995
7. K. Subieta: Theory and Construction of Object-Oriented Query Languages. Polish-Japanese Institute of Information Technology Editors, Warsaw 2004, 522 pages
8. V. Josifovski, T. Risch: Query Decomposition for a Distributed Object-Oriented Mediator System. Distributed and Parallel Databases 11(3): 307-336 (2002)
9. D. Suciu: Query Decomposition and View Maintenance for Query Languages for Unstructured Data. VLDB 1996: 227-238
10. K. Evrendilek, A. Dogac: Query Decomposition, Optimization and Processing in Multidatabase Systems. NGITS 1995
11. E. Leclercq, M. Savonnet, M.-N. Terrasse, K. Yétongnon: Object Clustering Methods and a Query Decomposition Strategy for Distributed Object-Based Information Systems. DEXA 1999: 781-790
12. E. Bertino: Query Decomposition in an Object-Oriented Database System Distributed on a Local Area Network. RIDE-DOM 1995: 2-9
13. P.A. Bernstein, N. Goodman, E. Wong, C.L. Reeve, J.B. Rothnie Jr.: Query Processing in a System for Distributed Databases (SDD-1). ACM Trans. Database Syst. 6(4): 602-625 (1981)
14. R. Hryniów, M. Lentner, K. Stencel, K. Subieta: Types and Type Checking in Stack-Based Query Languages. Institute of Computer Science, Polish Academy of Sciences, Report 984, March 2005
15. J. Płodzień, A. Kraken: Object Query Optimization through Detecting Independent Subqueries. Inf. Syst. 25(8): 467-490 (2000)
16. H. Kozankiewicz, K. Stencel, K. Subieta: Implementation of Federated Databases through Updatable Views. Proc. of the European Grid Conference, Amsterdam, The Netherlands, Springer LNCS 3470: 610-620 (2005)
