Laboratoire de l’Informatique du Parallélisme Ecole Normale Supérieure de Lyon Unité de recherche associée au CNRS n°1398

Modelization and simulation of parallel relational query execution plans using DPL graphs and High-level Petri nets

Lionel Brunie, Harald Kosch

September 1996

Research Report No 96-22

Ecole Normale Supérieure de Lyon 46 Allée d’Italie, 69364 Lyon Cedex 07, France Téléphone : (+33) 72.72.80.00 Télécopieur : (+33) 72.72.80.80 Adresse électronique : [email protected]−lyon.fr


Abstract

This report presents a novel representation model of parallel relational query execution plans, called DPL graphs. This model can deal with any kind of parallel architecture and any kind of parallel execution strategy. Based on an analysis of the execution dependencies between operators, it precisely represents communications, run-time control mechanisms, scheduling constraints and specific processing strategies (e.g. bucket processing). The report especially focuses on the modelization and simulation of the data and control flows, which are realized using high-level Petri nets.

Keywords: Parallel databases, parallel query optimization, parallel query execution plan, scheduling graph, High-level Petri nets.

Résumé

Dans ce rapport nous introduisons un nouveau modèle de représentation d'un plan d'exécution parallèle d'une requête relationnelle, appelé DPL graphs. Ce modèle nous permet de modéliser toutes sortes de stratégies d'exécution parallèles sur n'importe quelle architecture parallèle. Il se base sur une analyse complète des dépendances existant entre des opérateurs relationnels exécutés en parallèle et intègre en même temps les communications, le contrôle d'exécution, les contraintes d'ordonnancement et les traitements spécifiques (ex. traitement par bucket). Ce rapport se concentre sur la modélisation et la simulation des flux de contrôle et de données, qui sont réalisées à l'aide de réseaux de Petri.

Mots-clés : bases de données parallèles, optimisation de requêtes pour l'exécution parallèle, plan d'exécution parallèle, graphe d'ordonnancement, réseaux de Petri.

Contents

1 Introduction
2 Problem formulation
3 Related work
   3.1 PEP representation models
   3.2 Integration of communication and run-time consideration into operator nodes
4 DPL graphs: a novel PEP representation model
   4.1 Operator nodes: basic, communication and control operators
   4.2 DPL graphs
      4.2.1 From D graphs to DPL graphs
      4.2.2 Formal definition of DPL graphs
5 DPL graphs – A scheduling tool for parallel query execution
6 Discussion of the chosen methodology
   6.1 Comparison of the DPL graphs and par-LERA
   6.2 Discussion
   6.3 Impacts on the parallel query optimization
7 Using high-level Petri nets for modeling control and data flows in DPL graphs
   7.1 Short notes about Place/Transition Petri nets
   7.2 Modelization of the control and data flow in DPL graphs using annotated and inscribed P/T-nets
      7.2.1 Transformation of the operator vertices to transitions
      7.2.2 Initialization and terminating control schemas
      7.2.3 Control schema for data and precedence dependencies
      7.2.4 Control schema for unlabeled loop dependencies
      7.2.5 Control schema for labeled loop dependencies
   7.3 Verification of the control schema representation
      7.3.1 Correctness consideration for data and precedence dependencies
      7.3.2 Correctness consideration for unlabeled loop dependencies
      7.3.3 Correctness consideration for labeled loop dependencies
   7.4 From modelization to simulation of query processing
8 Simulation of query processing trees using High-level Petri nets
   8.1 The Petri net simulation tool Cabernet
      8.1.1 A little example
      8.1.2 Timed Petri nets bring simulation time
9 Conclusion and connected works

Publishing and submitting remarks

Parts of this research report are published:

- Lionel Brunie and Harald Kosch. DPL graphs – a powerful representation of parallel relational query execution plans. In EURO-PAR'96 Conference, LNCS 1123, Springer-Verlag.

1 Introduction

With the emergence of decision support systems, relational queries tend to become more and more complex. In such contexts, parallel query optimization is extremely difficult, since the relational model provides many sources of parallelism. Until now, parallel query optimizers have been strongly limited by the fact that execution plan representation models give a very schematic view of the actual execution. Thus, communications, run-time control mechanisms, scheduling constraints and specific processing strategies (e.g. bucket processing) are seldom correctly handled by existing models.

In that context, the first part of this report presents a novel representation model of parallel relational query execution plans, called DPL graphs. Based on an analysis of the execution dependencies between operators, this model can deal with any kind of parallel architecture and any kind of parallel execution strategy. Stress was especially put on the modelization of communications and run-time constraints. In the second part of the report, we concentrate on the modelization and simulation of the control and data flows in these DPL graphs. The modelization is based on high-level Petri nets, which seem especially well suited for this purpose. Many simulation tools exist for such high-level Petri nets; we selected a suitable one (Cabernet) in order to build an efficient simulation tool for parallel execution plans.

This report is organized as follows. Section 2 presents some short basic notes about parallel query processing and execution plan representation models. An analysis of related work follows (section 3). Section 4 describes the concept of DPL graphs. This methodology is discussed in sections 5 and 6. Then we describe how high-level Petri nets can model (section 7) and simulate (section 8) the data and control flows in DPL graphs. Finally, section 9 concludes this paper and points out future developments and connected works.

2 Problem formulation

The purpose of this section is to give a basic introduction to relational query processing, optimization and representation models of execution plans, and to describe the integration of parallelism into these components. Readers familiar with these notions may skip this section.

Introduction to query processing, optimization and execution plan representation

Three relational operators are of major importance: the selection, the projection and the join. Because of its high implementation cost, the latter has been especially studied in the literature (see, for instance, the survey in [1]). In this paper, we focus on the class of hash-based join methods, which are the most commonly used, and more particularly on the equi-join operator. This technique works in two steps. First, a hash table is built using the tuples of the first relation. Then the tuples of the second relation are probed against this hash table in order to find the pairs of tuples having the same value for the join attribute.

Query optimization consists in selecting an optimal execution plan for a given query. This selection is based on some predefined optimization objectives, such as the minimization of the response time or the maximization of the system throughput. The execution plan must specify the operations to be run, their operands (relations), their execution order, the implementation algorithms and the access methods. In uniprocessor optimization techniques, execution plans are usually represented using query processing trees [2].
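As an illustration of the two-step hash-based equi-join described above, here is a minimal Python sketch; the representation of relations as lists of tuples and the addressing of the join attributes by position are our assumptions, not part of the report.

from collections import defaultdict

def hash_equi_join(r1, r2, attr1, attr2):
    # Build phase: hash every tuple of the first relation on its join attribute.
    hash_table = defaultdict(list)
    for t1 in r1:
        hash_table[t1[attr1]].append(t1)
    # Probe phase: look up each tuple of the second relation in the hash table
    # and output the pairs of tuples sharing the same join attribute value.
    result = []
    for t2 in r2:
        for t1 in hash_table.get(t2[attr2], []):
            result.append(t1 + t2)
    return result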

Definition 1 A sequential query processing tree is a binary tree in which the leaves

represent the base relations that participate in the query and intermediate nodes represent relational operations. Intermediate nodes receive their input relations via the incoming edges and send the result through their outgoing edge to the next operation. The root of the tree represents the result of the whole query.

In this formalism, implementation and access methods are represented as annotations attached to the operator nodes.
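For illustration, a query processing tree carrying such annotations could be encoded as follows; the class and field names are ours and purely illustrative.

from dataclasses import dataclass, field

@dataclass
class PlanNode:
    operator: str                                     # relational operation or base relation name
    children: list = field(default_factory=list)      # incoming edges: the input relations/operators
    annotations: dict = field(default_factory=dict)   # implementation algorithm, access method, ...

# A sequential plan for R1 join R2: the leaves are base relations, the root is the result.
plan = PlanNode("join",
                [PlanNode("scan", [PlanNode("R1")], {"access method": "file scan"}),
                 PlanNode("scan", [PlanNode("R2")], {"access method": "index scan"})],
                {"algorithm": "hash-based equi-join"})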

Integration of parallelism into relational query processing, optimization and execution plan representation models

The relational model is especially well suited for parallelism. Basically, three forms of parallelization strategies can be applied:

- pipeline parallelism: suppose a selection operator is to be performed on the output of an equi-join operator. Clearly, the selection operator can start its job as soon as the first tuple has been processed by the join operator, and can then work in parallel with the latter.

- inter-operation parallelism (or task parallelism): consider now the join of two selected relations. The two initial select operators can be performed in parallel.

- intra-operation parallelism (or data parallelism): assume a select operator is to be run on a relation. The tuples of that relation can be partitioned into sub-relations processed separately (a small sketch follows this list).
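The following sketch illustrates the third form, intra-operation (data) parallelism, on a select operator: the tuples are hash-partitioned into sub-relations and each partition is processed independently. The partitioning function and the sequential simulation of the processors are illustrative assumptions.

def hash_partition(relation, attr, n_processors):
    # Assign each tuple to a processor according to a hash of the partition attribute.
    partitions = [[] for _ in range(n_processors)]
    for t in relation:
        partitions[hash(t[attr]) % n_processors].append(t)
    return partitions

def parallel_select(relation, predicate, attr, n_processors):
    # At run time each partition would be handled by a distinct processor;
    # here the processors are simply simulated by a loop over the partitions.
    partitions = hash_partition(relation, attr, n_processors)
    return [t for part in partitions for t in part if predicate(t)]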

In addition to the usual information carried by sequential execution plans, a Parallel Execution Plan (PEP) must define a parallelization strategy: the number of processors assigned to each relational operation (degree of intra-operation parallelism), the number of operations to be run in parallel (degree of inter-operation parallelism) and the management of the communications between processors.

Obviously, parallelism highly complicates the optimization process. In particular, the number of possible plans dramatically increases [3, 4]. A lot of research work has focused on the study of plan search heuristics [5, 6, 7]. On the contrary, the representation of parallel execution plans has been much more neglected. This may seem surprising: using plan representation models that cannot express all forms of parallelization strategies de facto restricts the search space used by the parallel optimizer, which incurs the risk that the optimal, and even suboptimal, solutions are excluded from the search space [3].

3 Related work

3.1 PEP representation models

Representation of global parallelization strategies (inter-operation parallelism)

All representation models use processing trees as the base structure for representing global parallelization strategies. Indeed, the "external" shape of a processing tree specifies in a very intuitive way the degree of inter-operation parallelism: operations lying on different paths of the tree can be executed concurrently (see, for instance, fig. 2). In practice, relational operators are usually split into atomic operations (e.g. a hash-based join R1 ⋈ R2 will be split into a build hash table R1 operator and a probe hash table R2 operator), represented as nodes of a processing tree. In the remainder of this paper, this kind of structure will be called a D-graph:

Definition 2 A D-graph is a processing tree in which intermediate nodes model atomic operators and edges represent Data dependencies between operators. A dependency can be sequential (i.e. the successor operator can start its execution only when the predecessor operator has finished its job, i.e. has processed all the tuples) or pipelined (i.e. the successor can start its execution as soon as one tuple has been processed by the predecessor).

Fig. 1 presents an example of D-graph for a three-way join R1 ⋈ R2 ⋈ R3 executed using a hash-partition method [8] on a shared-nothing system. First, the tuples of the relations are hash-partitioned over the available processors (split phase). Then a local hash-join (see section 2) is applied (join phase), represented here by the atomic operators build and probe (the initial hash partitioning is specified via annotations attached to the build and probe operators; see below).

[Figure 1: D-graph for a three-way join.]

D-graphs have a representation power much higher than that of classical query processing trees. Consider for instance the previous example. The fact that the two hash tables can be built in parallel could not be specified using a basic query processing tree, since the nodes of such trees represent whole join operations and not atomic operations.

Representation of intra-operator parallelization strategies

Intra-operator parallelization strategies are usually represented by annotations placed on the nodes of the D-graph. Fig. 2 shows a typical example of such annotations [6] (only the annotations relevant to the parallel execution are shown). The join R1 ⋈ R2 is executed using a sort-merge algorithm. The annotation processors specifies the number and the identity of the processors executing this join operation. The annotation dependency states whether the result relation must be materialized or pipelined to the next operator. Finally, the annotation redistribution specifies whether a tuple redistribution must be triggered.

[Figure 2: Example of annotated operator nodes (Ganguly et al.): scan R1, scan R2, sort1, sort2 and merge operators, each annotated with processors, dependency (pipelined or materialized) and redistribution.]

Such annotations provide a compact and powerful tool for representing parallelization strategies. DPL graphs, which are introduced later in this paper, adopt this approach.

Discussion

Representation models based on such approaches suffer from the fact that they cannot represent all kinds of parallelization strategies. Thus, most models, e.g. [9, 10, 6, 7], only consider data dependencies between operators. Only par-LERA (used in the DBS3 system [11]; see section 6.1 for a detailed comparison of our approach with par-LERA) and the representation model implemented in the GAMMA prototype [12] actually implement a precedence dependency between operators (see section 4.2 for a definition of this form of dependency). However, even these representation models do not specify the parallel execution finely enough to represent all kinds of inter-operation parallelization strategies. Indeed, none of these models correctly deals with algorithms based on bucket processing. This technique is applied when relation partitions cannot fit in the main memory of the processors. Instead of working on whole partitions, algorithms implementing this approach work on portions of partitions, called buckets [13, 14]. We will see in section 4.2.1 that correctly modeling such algorithms requires the introduction of a novel form of dependency, the loop dependency.

In summary, no model proposed so far allows complex parallel execution strategies to be represented correctly and finely. This has a severe impact on the whole query optimization process. First, the estimated execution costs are unrealistic, since they do not correctly integrate bucket-based processing strategies. Second, the optimizer is not provided with a correct view of the parallelization space. Although it is clear that the optimizer will not investigate the whole parallelization space (also called search space), it should decide by itself how this space must be restricted, and not be constrained by the underlying representation model.

3.2 Integration of communication and run-time consideration into operator nodes

Integration of communications

Communicating requires accessing relations and efficiently managing communication buffers. Such operations are very expensive. Consequently, communications should be considered with the same attention as any other operator working on relation partitions. Indeed, as first mentioned by Gray in 1988 [15] and later by Hasan [16, 17], optimizers that do not integrate communication costs cannot work efficiently.

Two classes of approaches have been proposed in order to integrate communications into PEPs. Some works suggest modeling communications as weights associated to the edges of the D-graph representation (e.g. [16, 10]). In contrast, some systems explicitly introduce communication operators. To our knowledge, only three systems have adopted the latter approach: the par-LERA database language [18] with its constructor node (see section 6.1), the Volcano prototype [19] (exchange operator) and the commercial product DB2 Parallel Edition of IBM [20] (data-exchange operator). However, all these operators carry very poor information, which strongly limits their representation power: no communication algorithm can be specified, and no means are provided to specify the access methods used to read the relation partitions to be communicated. In fact, these operators basically only carry an estimation of the communication cost.

Integration of run-time control mechanisms

Query processing often has to face important run-time problems. Among them, load imbalance [21, 22] and run-time adaptation of static parallelization strategies [23, 24] are especially crucial. In order to ensure good performance, such situations must be quickly detected and corrected. This requires control mechanisms to be introduced at compile time, in the PEP itself [24] (control operators can be placed, for instance, after operators that require result data to be written to disk). Thus, in XPRS, a special choose operator [23] is introduced which allows a choice at run time between several execution alternatives. The exchange operator of Volcano includes some kind of control, as it can dynamically re-estimate the degree of intra-operator parallelism [19]. In a previous work [25], we introduced a control operator designed to supervise the query processing, i.e. to detect optimization errors. However, being based on a representation model similar to par-LERA [18], this control operator could not integrate annotations specifying implementation parameters.

Thus, to the best of our knowledge, no PEP representation model takes into account all kinds of run-time control, e.g. load imbalance, optimizer estimation errors, etc. However, in the absence of a correct representation of the communications and run-time control mechanisms, a parallel query optimizer cannot work efficiently. Indeed, parallel optimizers are based on a cost model assumed to provide a good image of the actual execution costs. This can be done only if PEPs provide a good representation of the query execution.

In that framework, this report proposes a complete and simple PEP representation model. This model allows all kinds of dependencies existing between operators to be specified, thus enabling the representation of all possible parallelization strategies. Furthermore, special annotated operators make it possible to precisely state the communications to be performed and the run-time control mechanisms to be triggered.

4 DPL graphs: a novel PEP representation model

4.1 Operator nodes: basic, communication and control operators

DPL graphs distinguish three kinds of operators: basic, communication and control operators.

- Basic operators are atomic operators working on relation partitions. They are part of the implementation of a relational operator. These operators work independently on each processor holding a part of the implicated relations. Basic operators are graphically represented by circles whose inscriptions detail their functionality (e.g. build hash table).

- Communication operators implement data redistribution. They are graphically represented as boxes whose inscriptions state the kind of repartitioning to be done (e.g. all-to-all repartition). Example: suppose the join R1 ⋈ R2 is to be executed using a hash-based method. If the input relation R1 is not distributed on the same processors as R2, at least one of the two relations must be repartitioned. This can be specified using the communication operator any-to-any repartition.

- Control operators are used to control the query processing. They are represented by lozenges. Inscriptions state the kind of control to be performed, and special annotations specify the execution parameters. Example: suppose a distributed control of the redistribution skew (see [26]) is to be placed before a hash-based join. This can be done by inserting into the plan a control operator with the inscription control to Redistribution Skew and an annotation specifying the processors in charge of this control.

As noted just above, specific annotations make it possible to precisely specify the operator execution strategy. Such annotations can be computed at compile time and/or run time. Suppose the target machine is based on a shared-nothing architecture. Fig. 3 shows possible annotations for each of the three classes of operators. These lists of annotations do not pretend to be exhaustive; of course, annotations specific to the actual target machine can be added.

[Figure 3: Example of annotations for basic, communication and control operators.]

Let us analyze these annotations. Six basic types of annotations are attached to the basic operator build hash-table: the partition function (e.g. hash, round-robin) and the partition attribute; the relation name; the access method, i.e. file scan, index scan or sorted scan. When an index scan is used, the nature of the index (e.g. primary, sorted or clustered index) must be specified, as well as the way data are distributed over the disks (e.g. record distribution or page distribution [27]). When a sorted scan is used, it must be stated whether the sorting properties of the relation partitions are to be used. Furthermore, some annotations provide statistical information about the involved relations (e.g. tuple size, cardinality) and about the implemented relational operation (e.g. join selectivity). Other annotations specify the degree of intra-operator parallelism and the identity of the processors that will execute the operation.

A last kind of annotation states the data dependency with the following operator. This dependency can be either sequential or pipelined (section 3).

Let us now consider communication operators. The annotations to be completed are: the repartition function and the repartition attributes before and after the redistribution; the access methods; statistical information about the relations (e.g. tuple size, cardinality) and the communication network (e.g. bandwidth, start-up time); the data dependency; the degree of intra-operator parallelism and the identity of the processors executing the predecessor and successor basic operators. Finally, it should be stated whether the redistribution follows a special heuristic or not. Indeed, heuristic redistributions are especially important for load balancing (e.g. the best fit decreasing strategy [28]).

Finally, let us analyze the annotations attached to control operators. As previously, we find: the degree of intra-operator parallelism, the identity of the processors executing the operation and statistical information (e.g. cardinalities of the relation partitions). The way these statistics are collected at run time can also be specified (e.g. "P1 collects all cardinalities"). Formulas describing how the processing strategy is to be modified with respect to the actual skew could also be included. Using such annotations, the concept of control operators allows any kind of control mechanism to be modeled, whether it deals with load imbalance, optimizer estimation errors or unavailable resources. In that sense, this concept formalizes and generalizes related works (see section 3).

The annotations analyzed above are related to shared-nothing systems. However, the same methodology can be applied to any system architecture simply by adapting the annotations to be filled in. For instance, for a shared-memory architecture, the annotations attached to communication operators should specify the memory zone where the data to be exchanged are stored.
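As an illustration, the annotation set shown in fig. 3 for the basic operator build hash-table can be written down as a plain dictionary; the field names are ours, while the values follow the figure.

build_hash_table_annotations = {
    "partition":       {"function": "hash", "attribute": "Attr1"},
    "relation name":   "R",
    "access method":   {"kind": "index scan", "index": "primary", "distribution": "record"},
    "statistics":      {"cardinality": 100000, "tuple size": "200K", "join selectivity": 0.000012},
    "data dependency": "pipelined",    # or "sequential"
    "parallelism":     {"degree": 4, "processors": ["P1", "P3", "P4", "P5"]},
}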

4.2 DPL graphs

In this section, we first show, using simple examples, why it is necessary to generalize the notion of D graphs and how DPL graphs make it possible to correctly represent a parallel execution plan. Then a formal definition of DPL graphs is presented. This section is based on the study of the optimization of a sample query R1 ⋈ R2 ⋈ R3 executed on a shared-nothing system. The relations are supposed to be already partitioned on the join attribute over all the disks. The intermediate result relation R2 ⋈ R3 is repartitioned using a communication operator. The two join operators are implemented using hash-based algorithms, whereby the two hash tables are built on the base relations R2 and R1. In order to keep the figures readable, only the data dependencies (see section 4.1) between operators (sequential or pipelined) are mentioned. Other annotations are hidden.

4.2.1 From D graphs to DPL graphs

[Figure 4: Left scheme: D graph for the sample query. Right scheme: DP graph for the same query.]

D graphs cannot model all possible parallel execution strategies. To illustrate this, let us suppose that the available memory is limited and that the hash table on R1 can be built only when the intermediate relation R2 ⋈ R3 has been built. Such a situation cannot be represented using D graphs (fig. 4, left scheme). Indeed, no data dependency exists between the probe hash table R3 operator and the build hash table R1 operator. So, according to the D graph of fig. 4, left scheme, these two operations should be executed in parallel.

Representing such an execution strategy requires the introduction of precedence dependencies stating that an operator must be terminated before another operator can start, even though no data dependency is involved. In the remainder of this paper, graphs including precedence dependencies will be called DP graphs (for data and precedence dependencies). A precedence dependency is graphically represented as a double directed edge. Thus, the DP graph of fig. 4, right scheme, indicates that the build hash table R1 operator can start only when the probe hash table R3 operator has terminated.

Let us now consider a more realistic processing situation in which the relation partitions cannot fit in the main memory of the processors. In such situations, some overflow processing has to be carried out. A common way to do that is to work on portions of partitions, called buckets [13, 14]. Indeed, this technique has been proved to be the most efficient way to handle large amounts of data [13]. The simplest hash-based join algorithm working on buckets is the Grace hash join [8]. In the split phase, a first hash function is applied to the tuples in order to determine the processor they must be assigned to; then a second hash function is applied so as to determine a bucket number (the hash function is chosen in such a way that the bucket size is smaller than the memory size available on the processor). In the local join phase, the buckets of the relations to be joined are successively loaded into the main memory and a classical hash-based algorithm is applied. Practically, the first bucket of the first relation is loaded and a build hash table operation is performed. Then the first bucket of the second relation is loaded and a probe hash table operation is applied. Then the second bucket of the first relation is loaded and its hash table built, before the second bucket of the second relation is probed against it. And so on and so forth.

Unfortunately, a DP graph cannot model such bucket-based join algorithms. Indeed, modeling the local join process with only one build hash table on bucket operator connected by a sequential data dependency to a probe hash table on bucket operator is not correct, as the loop phenomenon between these two operators is not represented. In practice, PEPs can include sequences of several basic or communication operators, all based on a same initial bucket decomposition. Such processing strategies should be explicitly represented in the PEP structure itself, not in the annotations attached to the operators. Indeed, this is the only way to precisely model what will actually be performed at run time, and only a precise view of the execution allows the query optimizer to correctly estimate the resource consumption and execution costs and to optimize the query processing. So, it is necessary to introduce a new kind of dependency, called loop dependency.
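Before this is formalized, the following minimal sketch illustrates the bucket-wise local join phase of the Grace hash join described above; the bucket layout and the function names are our assumptions.

from collections import defaultdict

def grace_local_join(buckets_r1, buckets_r2, attr1, attr2):
    # Buckets of the two relations are processed pairwise, one pair at a time,
    # so that each hash table fits in the available main memory.
    result = []
    for b1, b2 in zip(buckets_r1, buckets_r2):
        hash_table = defaultdict(list)
        for t1 in b1:                              # build hash table on a bucket of R1
            hash_table[t1[attr1]].append(t1)
        for t2 in b2:                              # probe it with the matching bucket of R2
            result.extend(t1 + t2 for t1 in hash_table.get(t2[attr2], []))
    return result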

Definition 3 A loop dependency is based on a sequence of basic and communication operators lying on a same data dependency path. It indicates that this sequence must be repeated as many times as there are available buckets (or tuples). Loop dependencies are represented by dotted directed edges. This dependency structure requires a test of whether there are still available buckets (tuples) at the starting operator of the loop (i.e. the operator the loop dependency points to); in order to avoid ambiguous situations, we require that starting operators have only one incident data dependency (i.e. the loop condition depends on only one data stream).

PEPs including loop dependencies will be called DPL graphs (for data, precedence and loop dependencies).

Fig. 5, left scheme, shows the PEP, represented by a DPL graph, modeling our sample join R1 ⋈ R2 ⋈ R3, in which all the join operations are executed using a Grace-join method. The loop dependency between the build bucket hash table R2 and probe bucket hash table R3 operators translates the fact that the two operations must be repeated for all the buckets of R2.

Loop dependencies can be nested. Indeed, this is required by some algorithms, e.g. the bucket-based nested-loop join algorithm. This algorithm is especially used for implementing theta-joins, in which the join condition is not an equality but a comparison between attributes. In such cases, it is the only practicable implementation algorithm [1].


[Figure 5: Left scheme: DPL graph. Right scheme: DPL graph with nested loop dependency.]

Suppose the R2 ⋈ R3 join involved in the previous sample query is now a theta-join, executed using a bucket-based nested-loop algorithm. The first bucket of the first relation is loaded into the main memory before all the buckets of the second relation are loaded and matched against it. Then the second bucket of the first relation is loaded and all the buckets of the second relation are matched against it. And so on. Fig. 5, right scheme, shows the corresponding DPL graph. The basic operator match bucket R2,R3 is the initial operator of two loop dependencies. A labeling of the loop dependencies is introduced in order to assign a priority to each loop. Thus, the loop dependency involving the load one bucket R3 operator has priority 1, while the loop dependency involving the load one bucket R2 operator has priority 2. This translates the fact that for each loaded bucket of R2 all the buckets of R3 must be loaded and matched.
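For illustration, the bucket-based nested-loop theta-join corresponding to this DPL graph can be sketched as follows; the outer loop plays the role of the priority 2 dependency (buckets of R2) and the inner loop that of the priority 1 dependency (buckets of R3). Names and data layout are ours.

def nested_loop_theta_join(buckets_r2, buckets_r3, theta):
    result = []
    for b2 in buckets_r2:                 # outer loop: load one bucket of R2
        for b3 in buckets_r3:             # inner loop: load one bucket of R3
            for t2 in b2:                 # match bucket R2, R3
                for t3 in b3:
                    if theta(t2, t3):     # arbitrary comparison, not only equality
                        result.append(t2 + t3)
    return result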

4.2.2 Formal definition of DPL graphs

This paragraph gives a formal definition of DPL graphs. Let us first recall three basic definitions we will use later:

Definition 4 A graph G(V,E) consists of a set V of elements called vertices and a set E of unordered pairs of members of V called edges.

Definition 5 A directed graph interprets each edge as an ordered pair of vertices. An edge from the vertex u to the vertex v is denoted by (u, v); the first vertex u is called the initial vertex and the second vertex the terminal vertex.

Definition 6 A path from a vertex u to a vertex v in a directed graph G is defined as an alternating sequence of vertices and edges

v_1, e_1, v_2, e_2, ..., e_{n-1}, v_n

where v_1 = u and v_n = v, all vertices in the sequence are distinct, and the successive vertices v_i and v_{i+1} (1 ≤ i ≤ n-1) are the initial and terminal vertices of the intermediate edge e_i.

Directed graphs can be used to model data dependencies between operators [29]: data (i.e. relations) are materialized as edges, while operators are represented by vertices.

Definition 7 A DPL graph is a directed graph G based on four kinds of vertices (base relations, basic operators, communication operators and control operators) and three kinds of edges:

1. a D-edge represents a data dependency between a base relation and an operator or between two operators;
2. a P-edge represents a precedence dependency between two operators;
3. an L-edge represents a loop dependency between a sequence of operators lying on a same data dependency path. An L-edge can be labeled.

Let us detail the semantics of these different edges:

- A D-edge links a relation vertex to an operator vertex if the relation is an input of the operator. A D-edge between two operators states that the successor operator needs data from the predecessor operator before starting its execution.

- A P-edge between two operators not connected by a D-edge specifies that the terminal operator can start its execution only when the initial operator has been completed.

- Finally, an L-edge is based on a sequence of operator vertices lying on a same D-edge path. It indicates that these operators must be repeated as many times as there are available buckets (tuples) to be processed. The test of bucket availability is performed at the terminal vertex of the L-edge; in order to make this availability control more robust, we have required above that the terminal vertex of an L-edge have only one incident D-edge. Nested loop dependencies are allowed on condition that the L-edges involved in a nest do not lie on the same data dependency path and are issued from a single initial vertex. In this case, L-edges must be labeled in order to specify their relative execution priority.
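To make this structure concrete, the following minimal Python sketch encodes a DPL graph; the class and field names are ours and purely illustrative (a real implementation would also enforce the constraints listed above). The same encoding is reused in the scheduling sketch of section 5 and in section 7.2.3.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Vertex:
    name: str
    kind: str                             # "relation", "basic", "communication" or "control"
    annotations: dict = field(default_factory=dict)

@dataclass
class Edge:
    source: Vertex
    target: Vertex
    kind: str                             # "D" (data), "P" (precedence) or "L" (loop)
    mode: Optional[str] = None            # for D-edges: "sequential" or "pipelined"
    label: Optional[int] = None           # for labeled (nested) L-edges: the loop priority

@dataclass
class DPLGraph:
    vertices: list
    edges: list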

5 DPL graphs – A scheduling tool for parallel query execution

Precedence dependencies used in DPL graphs make it possible to integrate scheduling considerations into PEPs. The purpose of this section is to illustrate, on an example, how this integration can be made. Scheduling a PEP consists in defining an execution order on the operators involved in the PEP, i.e. in specifying for each operator which operators precede it [18].

Let us consider the DPL graph of fig. 6, left scheme, representing the query R1 ⋈ R2 ⋈ R3. Suppose the target machine is based on a shared-nothing architecture, and all base and intermediate relations are declustered over all available disks. Consequently, the operations represented in fig. 6 are executed on the same processors/disks. Suppose, finally, that at most two operations can be run on the same processor (such a requirement can be introduced by query optimizers in order to restrict the search space). This implies that the build hash table R1 operator cannot be run in parallel with the two pipelined operators probe hash table R3 and any-to-any redistribution.

[Figure 6: Left scheme: DPL graph. Right scheme: DPL graph integrating scheduling considerations.]

A possible scheduling of this PEP is the following: first, run in parallel the build hash table R2 and build hash table R1 operators; then execute the probe hash table R3 operator and the any-to-any redistribution; finally, run the probe hash table R2 ⋈ R3 and store result operators. Such a schedule can be represented in the DPL graph by inserting P-edges between operators. This can very simply be performed by executing the following algorithm:

for all operations op in the schedule do
    for all preceding operations pop of op in the schedule do
        if (pop, op) ∉ E then insert a P-edge (pop, op)

where E denotes the set of edges of the DPL graph.

So, a P-edge is inserted between an operation and its predecessors in the schedule if no other dependency already connects them. In our example, a P-edge is inserted between the build hash table R1 and the probe hash table R3 operators (see fig. 6, right scheme).
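Assuming the illustrative DPLGraph and Edge classes sketched in section 4.2.2, the algorithm above can be written as the following small routine; the encoding of the schedule as a list of scheduling steps, each step being a list of operator vertices, is our assumption.

def insert_precedence_edges(graph, schedule):
    # schedule: list of steps, each step being a list of operator vertices run in parallel.
    connected = {(e.source.name, e.target.name) for e in graph.edges}
    for step, operations in enumerate(schedule):
        preceding = [v for earlier in schedule[:step] for v in earlier]
        for op in operations:
            for pop in preceding:
                # Insert a P-edge only if no dependency already connects pop to op.
                if (pop.name, op.name) not in connected:
                    graph.edges.append(Edge(pop, op, kind="P"))
                    connected.add((pop.name, op.name))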

6 Discussion of the chosen methodology

In this section we first compare the DPL graphs introduced in this paper with the methodology that seems to us the most advanced in this research field, the par-LERA language (subsection 6.1). Then we discuss our approach (subsection 6.2) and analyze the possible positive impacts of DPL graphs on parallel query optimization (subsection 6.3).

6.1 Comparison of the DPL graphs and par-LERA

The LERA project (LERA stands for Language for Extended Relational Algebra) aims to develop a universal database language able to represent all kinds of database processing, whether object-oriented, deductive or parallel. In that framework, the parallel version of LERA, called par-LERA, first introduced in a paper of Borla-Salamet et al. in 1991 [18], proposes both a PEP representation model and a scheduling formalism.

A par-LERA graph is a D-graph with two special operator nodes: the operator and the constructor nodes (see also section 2). An operator node works on declustered relations. Conversely, a constructor node specifies the execution dependency between two operators, which can be either pipelined or sequential. Furthermore, par-LERA distinguishes whether a dependency is local, i.e. the producer and consumer are mapped on the same processors, or global. A constructor node must always be connected to operator nodes. Two constructor nodes, START and END, represent the global beginning and termination of the query execution. A global pipeline dependency means that a redistribution has to be triggered. On the contrary, a global sequential dependency can be data-driven or simply represent a precedence constraint.

Par-LERA is certainly one of the most elaborate PEP representation models, as it formalizes the notion of precedence dependency between operators and integrates scheduling considerations. However, par-LERA suffers from important limitations. First, loop dependencies are not taken into account. Second, no run-time aspects are integrated. Third, the lack of a clear distinction between the representation of communications and precedence dependencies can lead to modelization problems.

Let us illustrate this fact by a very simple example (fig. 7): suppose R1 ⋈ R2 is to be implemented using a hash-based join algorithm and that the two relations R1 and R2 must first be repartitioned on the join attribute. Suppose also that a communication bottleneck forces the any-to-any redistribution of R2 to wait for the termination of the any-to-any redistribution of R1 (this situation is not unrealistic). Fig. 7, left scheme, shows the corresponding DPL graph.

[Figure 7: Left scheme: Example of DPL graph. Right scheme: Corresponding par-LERA graph.]

Designing the corresponding par-LERA graph requires each communication operator to be replaced by two nodes: a constructor node and an operator node (scan); indeed, the START operator cannot be directly connected to another constructor node. Furthermore, the precedence dependency included in the DPL graph cannot be directly represented, since par-LERA does not allow the constructor node PIPE, modeling the redistribution of R2, to be connected to a constructor node SEQ, modeling a precedence dependency. Only the introduction of an intermediate pseudo-operator node (here called ON) can solve this problem. Fig. 7, right scheme, shows the resulting graph, which is much more complicated than the corresponding DPL graph (fig. 7, left scheme). This simple example shows that only a clear distinction between communication, basic and control operators on the one hand and dependencies on the other hand can actually provide a simple and consistent PEP representation model.

6.2 Discussion

In section 4 we have shown that modeling inter-operation parallelism requires distinguishing three kinds of dependencies between operators: data dependencies, precedence dependencies and loop dependencies.

Precedence dependencies allow parallel processing strategies to be modeled in which operators are ordered without reference to a data stream. This typically occurs when resources must be optimized, e.g. when the available memory is limited. More generally, precedence dependencies allow scheduling considerations to be integrated into DPL graphs. This property enables the query optimizer to work on a single data structure during the whole optimization process and avoids the development of complicated interfaces (e.g. between sequential and parallel optimization plans, or between optimization plans and scheduling graphs).

Loop dependencies, never considered before, allow operators working on buckets to be correctly represented. This is especially interesting as bucket processing is the main technique used to handle very large relations.

Furthermore, DPL graphs actually provide tools for integrating communications and run-time considerations into PEPs, via communication and control operators, thus formalizing the notion of execution control (see section 3.2). All these features allow DPL-graph-based execution plans to be much more realistic, i.e. to correctly model the actual execution.

6.3 Impacts on the parallel query optimization

DPL graphs, thanks to their annotated operator nodes, make it possible to consider all kinds of parallel execution strategies. They unify, in this way, features of previously developed representations (see section 3).

Furthermore, cost functions can be made much more realistic. Until now, in the absence of a powerful representation model, the only way to take complex run-time situations into consideration has been to implement very complex cost models (e.g. [30]). Conversely, DPL graphs allow complex situations to be easily modeled, in the PEP itself, using annotated operators and dependency links. Thus, because a precise view of the actual execution is available, a relatively simple cost model (e.g. that proposed by Zaït et al. in [31]) can give very good estimations: the complexity has been moved.

Moreover, the whole parallel search space is accessible to the optimizer. This means that the optimizer is now in charge of the restriction of the search space. In other words, the search space is no longer restricted because of the limitations of the underlying PEP representation, but according to optimization objectives. Let us now look at the modelization of the control and data flows of DPL graphs.

7 Using high-level Petri nets for modeling control and data flows in DPL graphs

High-level Petri nets are a powerful and simple formalism for modeling data flow in distributed and parallel systems [32, 33]. They allow control and processing strategies to be represented while remaining independent of the underlying parallel machine.

7.1 Short notes about Place/Transition Petri nets

This section recalls some basic notions about Place/Transition Petri nets (P/T-nets) that will be used later to model the data and control flows in DPL graphs. Definitions are taken from [32]. Readers familiar with the concept of P/T-nets may skip this section.

Definition 8 A quintuple N = (S,T,F,M,W) is called a Place/Transition Petri net (P/T-net) iff:

1. S and T are disjoint sets. The elements of S are called places and the elements of T transitions.

2. F ⊆ (S × T) ∪ (T × S) is a binary relation, called the flow relation.

3. W : F → N \ {0} is called the weight function.

4. M : S → N ∪ {ω} is called the initial marking (ω denotes the empty element).

The capacity of places is omitted, since it plays no role in our modelization.

Graphically, places are represented as circles and transitions as bars; the flow relation F is represented by arcs between circles and bars. Arcs f are labeled by their weight W(f) (see e.g. fig. 13).

Notation: let N be a P/T-net, N = (S,T,F,M,W). We write U for the union of S and T: U = S ∪ T.

Definition 9 Let N be a P/T-net.

1. For x ∈ U, •x = {y | y F x} is called the preset of x and x• = {y | x F y} is called the postset of x.

2. For X ⊆ U, •X = ⋃_{x ∈ X} •x.

We can now define the dynamics of a P/T-net:

Definition 10 Let N be a P/T-net.

1. A mapping M : S → N ∪ {ω} is called a marking of N.

2. A transition t ∈ T is M-enabled iff ∀s ∈ •t : M(s) ≥ W(s,t).

3. An M-enabled transition t ∈ T may yield a follower marking M' of M, which is such that for each s ∈ S:

   M'(s) = M(s) + W(t,s)            if s ∈ t• \ •t,
           M(s) - W(s,t)            if s ∈ •t \ t•,
           M(s) - W(s,t) + W(t,s)   if s ∈ •t ∩ t•,
           M(s)                     otherwise.

   It is said that t fires from M to M', denoted M [t> M'.

A marking M is represented by drawing M(s) tokens on each place. Fig. 8 shows an example of the firing of a transition.

[Figure 8: Firing of a transition in a P/T-net.]
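As an illustration of Definition 10, here is a small executable sketch of the enabling test and of the firing rule; the encoding of markings and arc weights as dictionaries is our choice, not part of the report.

def enabled(marking, t_in):
    # t_in maps each place s of the preset of t to the weight W(s, t).
    return all(marking.get(s, 0) >= w for s, w in t_in.items())

def fire(marking, t_in, t_out):
    # t_out maps each place s of the postset of t to the weight W(t, s).
    assert enabled(marking, t_in)
    new_marking = dict(marking)
    for s, w in t_in.items():
        new_marking[s] = new_marking.get(s, 0) - w
    for s, w in t_out.items():
        new_marking[s] = new_marking.get(s, 0) + w
    return new_marking

# A small example in the spirit of fig. 8: a transition consuming two tokens from s1
# and one token from s2 and producing two tokens on s3 (the weights are illustrative).
m1 = fire({"s1": 2, "s2": 1}, t_in={"s1": 2, "s2": 1}, t_out={"s3": 2})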

7.2 Modelization of the control and data flow in DPL graphs using annotated and inscribed P/T-nets

Recent works [34, 35] have shown that high-level Petri nets are a very powerful formal tool, especially interesting for modeling the semantics of parallel systems and parallel programming languages (there is no "official" definition of high-level Petri nets; it is usually considered that a high-level net is derived from a low-level net by adding labels and annotations to the tokens, transitions and places). Indeed, high-level Petri nets allow data flows to be represented in a simple and intuitive way. Let us first introduce a special class of high-level Petri nets, the annotated and inscribed P/T-nets.

Definition 11 An annotated and inscribed P/T-net (a-i P/T-net) N is a P/T-net where places and transitions can take inscriptions, and where annotations can be attached to transitions. (For the sake of simplicity, we will not consider labeled tokens; furthermore, we will assume that conditions can be computed locally on each concerned processor.) The following properties must be verified:

1. Place inscriptions denote conditions which have to be fulfilled to allow the associated transitions to fire. If a place condition is not fulfilled, all tokens attached to the place are removed.

2. Transition inscriptions denote operator instructions which are executed when the transition fires.

3. Transition annotations denote the annotations attached to the corresponding operator vertex (see section 4.1).

4. Operator instructions inscribed in a transition define an atomic action (i.e. during the execution of a transition, the data required by this transition cannot be modified by the firing of other transitions).

Inscribed places make a control schema adaptable to the run-time context, i.e. conditions attached to places can change the firing behavior of an a-i Petri net. Such adaptation capabilities can be contrasted with the control schemas used in systems like DBS3 [18], which are static and allow no run-time adaptation. Let us describe in detail how the annotated operator vertices and the different edges of a DPL graph are transformed into an annotated and inscribed P/T-net.

7.2.1 Transformation of the operator vertices to transitions

In general, we let the transitions correspond to the operator vertices of the DPL graph. The inscribed instructions of a transition represent the executable code implementing the operator. Basic and communication operator vertices are replaced straightforwardly by inscribed transitions. The integration of control operator vertices is a little more complicated, as their execution can imply a decision between two execution alternatives (see section 4.1). This decision is modeled with two conditioned places, connected to the a-i P/T-nets representing the two alternative executions. Fig. 9, left scheme, shows an example DPL graph for a control operator implementing a choose operator between an indexscan and a seqscan of the relation R. The right scheme presents the corresponding executable PEP. The place inscription choose index is fulfilled if an indexscan has to be performed and, in this case, enables the corresponding transition. The same holds for choose seq.

[Figure 9: Left scheme: Example DPL graph with a control operator (choose between indexscan and seqscan). Right scheme: Corresponding Petri net.]

So, basically, data manipulated in a DPL graph will be modeled by places of an a-i P/T-net, while operators will be represented by transitions. Thus a-i P/T-nets provide a very powerful and intuitive representation model of DPL graphs. In the rest of this section, we will consider only basic operator vertices.

7.2.2 Initialization and terminating control schemas

Two special nets, the input and the output nets, are devoted to modeling the beginning and the end of the execution.

[Figure 10: Input net construction.]

The input place is connected to the initialization transition. The latter represents the operations to be done before starting the execution. The initialization transition issues as many places as there are base relations. These places are then connected to the transitions representing the first operator vertices working on these base relations. Access information is integrated into the annotations of these transitions. In order to allow the execution to begin, a starting token is put in the input place. See fig. 10 for this input net construction, for three base relations.

The output place is connected to the last executed operator vertices (e.g. the store result operator vertex in fig. 5). A correct termination is characterized by a marking equal to the initial net marking, except for the starting token, which must have migrated from the input place to the output place.

7.2.3 Control schema for data and precedence dependencies

Building the a-i P/T-net associated to a DPL graph is straightforward concerning data and precedence dependencies. A D- or P-edge between two operator vertices is modeled by two transitions (representing the two operator vertices) plus a place connecting the two transitions (denoting the dependency) (fig. 11 and 12). One exception, however: if the terminal vertex of a D-edge is the initial vertex of an L-edge, a special net construction is necessary, which will be introduced in the next paragraph.

Remark: a data dependency can be either pipelined or sequential (see section 4). In the case of a pipelined data dependency, the transition representing the initial operator vertex will fire as soon as the first tuple is output. Conversely, in the case of a sequential data dependency, it will fire only when all tuples are output.

[Figure 11: Net construction for D-edges.]

[Figure 12: Net construction for P-edges.]
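Assuming the illustrative DPLGraph encoding of section 4.2.2, the basic translation described in this subsection can be sketched as follows; the target net encoding (dictionaries and arc lists) is our assumption, and L-edges are deliberately left out since they require the special constructions of the next paragraphs.

def dpl_to_net(graph):
    # Each operator vertex becomes an inscribed transition.
    transitions = {v.name: {"inscription": v.name, "annotations": v.annotations}
                   for v in graph.vertices if v.kind != "relation"}
    places, arcs = [], []
    for e in graph.edges:
        # Each D- or P-edge between two operator vertices becomes one place
        # connected to the two corresponding transitions.
        if e.kind in ("D", "P") and e.source.kind != "relation":
            p = f"{e.source.name}->{e.target.name}"
            places.append(p)
            arcs.append((e.source.name, p))      # initial transition -> place
            arcs.append((p, e.target.name))      # place -> terminal transition
    return transitions, places, arcs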

7.2.4 Control schema for unlabeled loop dependencies

Representing an L-edge requires more complex net constructions. Let us first consider a non-labeled L-edge. Assume, without loss of generality, that only one external operator vertex (noted operator j) is linked to an operator vertex lying on the loop.

[Figure 13: Net construction for unlabeled L-edges.]

Fig. 13 shows the net construction associated to a non-labeled L-edge. The operators (noted operator i, ..., operator k) are represented by transitions (noted ti, ..., tk). The L-edge is represented by two places (sNEOS, sINI) and one transition (tNEOS). Initially, a token is put in sINI in order to start the execution of the loop. The place sNEOS checks whether there are still available tuples, i.e. whether the End of Stream (EOS) condition is false (condition Not EOS? relation). If so, the transition tNEOS fires. This transition provides with tokens all the places (si, sINI, sj) used to re-execute the loop.

The end of the loop is represented by a place sEOS and a transition tEOS. This transition is linked to sINI and (via the place sk+1) to tk+1. The inscribed place condition of sEOS, EOS? relation, verifies that the End of Stream condition of relation R is true and fires the transition tEOS. This transition provides sINI with a token (in order to come back to the initial state) and fires tk+1, which continues the query execution.

7.2.5 Control schema for labeled loop dependencies

[Figure 14: Labeled L-edges.]

Consider now the case of labeled L-edges. Fig. 15 shows the net construction associated to a loop nest containing two L-edges (fig. 14). A higher label means a higher priority, i.e. for each step of the priority 2 loop, the priority 1 loop is entirely executed (in other words, the higher the priority, the more external the associated loop). The higher priority loop will be called the outer loop; the lower priority loop will be called the inner loop.

The L-edge associated with relation 2 is substituted by a net for unlabeled L-edges, as described above. One edge must however be added: in order to enable the firing of tk, the place sk must be provided with a token by tNEOS2. The output place sOUT of this subnet is connected to a new transition tCONT. This transition is linked to two places, sEOS1 and sNEOS1, which verify whether there are still data to be processed in the outer (i.e. priority 2) loop. If this is the case, all the places sa, si, sb, sj, sINI1 are provided with tokens, which allows re-entering the outer loop (and, later, the inner loop). Otherwise, the transition representing the operator vertex operator k+1 (i.e. the operator following the loop nest) is fired. This procedure can easily be generalized to three or more L-edges.

7.3 Veri cation of the control schema representation In this subsection we intend to prove the correctness of our modelization of the control scheme by a-i P/T-nets.


Figure 15: Net construction for labeled L-edges.

7.3.1 Correctness consideration for data and precedence dependencies

Correctness for data and precedence dependencies is quite obvious. The nature of these dependencies corresponds directly to the semantics of transition firing. A transition representing the terminal operator vertex of such a dependency (the terminal transition) can only fire when the execution of the initial operator vertex has terminated^14. The intermediate place between the two transitions is then provided with the token necessary to fire the terminal transition.
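The distinction in the footnote between sequential and pipelined data dependencies boils down to a different enabling condition for the consuming transition. The snippet below is a minimal illustration with assumed names, not the report's code.

```python
# Sequential dependency: the consumer fires only once the producer has emitted
# its whole output. Pipelined dependency: it may fire as soon as the first
# tuple is available (assumed helper, for illustration only).

def consumer_can_fire(tuples_emitted, total_tuples, pipelined):
    if pipelined:
        return tuples_emitted >= 1            # first tuple is enough
    return tuples_emitted == total_tuples     # producer must have terminated

print(consumer_can_fire(1, 100, pipelined=True))     # True
print(consumer_can_fire(1, 100, pipelined=False))    # False
print(consumer_can_fire(100, 100, pipelined=False))  # True
```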

7.3.2 Correctness consideration for unlabeled loop dependencies

Correctness for the loop dependency is more complex: first, the a-i P/T-nets must be transformed into P/T-nets without conditioned places; then the correctness is proved with the help of the theory of P/T-net T-invariants. Let us first introduce two definitions necessary for the correctness proofs, starting with the matrix representation of a P/T-net.

Definition 12 Let N = (S, T, F, M, W) be a P/T-net.

1. For each transition $t \in T$, let the vector $t: S \to \mathbb{Z}$ be defined as:

^14 In the case of a pipelined data dependency, as soon as the first tuple is output.


$$
t(s) = \begin{cases}
-W(s,t) & \text{if } s \in {}^{\bullet}t \setminus t^{\bullet}, \\
W(t,s) & \text{if } s \in t^{\bullet} \setminus {}^{\bullet}t, \\
W(t,s) - W(s,t) & \text{if } s \in {}^{\bullet}t \cap t^{\bullet}, \\
0 & \text{otherwise.}
\end{cases}
$$

2. Let the matrix $N: S \times T \to \mathbb{Z}$ be defined as $N(s,t) = t(s)$.

Clearly, every marking of the net may be represented by a vector $M$. The dynamics of a net can now be described with the help of the matrix $N$ and the vector $M$.

Definition 13 Let N be a pure^15 P/T-net and let $M, M': S \to \mathbb{N}$ be two markings of N. Then for each transition $t \in T$:

1. If $t$ is M-enabled, then $M[t\rangle M' \Leftrightarrow M + t = M'$.
2. $t$ is M-enabled $\Leftrightarrow 0 \leq M + t$.

Suppose, without loss of generality, that the loop consists of one inner operator, i.e. the sequence linked by the L-edge is of length 3: initial, inner and terminal operator vertex. Furthermore, we write $n$ for the number of loop executions ($n \geq 1$). The conditioned places are replaced by rearranging the net. Fig. 16 shows the corresponding P/T-net for the unlabeled L-edge of fig. 13 with $n = 4$ and one inner loop operator vertex (represented by $t_2$). The two conditioned places ($s_{NEOS}$ and $s_{EOS}$) of fig. 13 are replaced by ordinary places $s_5$ and $s_6$. In order to represent the semantics of the conditioned places, a special place $s_4$ is introduced: $s_4$ initially holds $n-1$ tokens and is connected to the transition $t_4$. This transition can fire exactly $n-1$ times, which means that the loop is executed $n$ times. When the last token has left $s_4$, the place $s_7$ must hold $n-1$ tokens. In the next loop execution, $s_7$ enables $t_5$ to fire, and $s_5$ can no longer fire $t_4$ (no more tokens in $s_4$). In order to return the net to the initial marking, the place $s_8$ and the transition $t_6$ must be added to fig. 16 (the remaining tokens in $s_5$ and $s_6$, no longer able to fire any transition, are removed). The lighter arcs in fig. 16 illustrate the described token flow when transition $t_5$ fires.

In order to show the correctness of this net construction for unlabeled L-edges, we shall prove the following properties:

1. The operator vertices of the loop are executed exactly $n$ times.
2. After firing the transition representing the operator vertex to be executed after the loop, the net returns to the initial marking that held before the loop execution.
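Definitions 12 and 13 can be illustrated on a toy net. The sketch below is illustrative only (the two-place net and its names are assumptions, not the net of fig. 16): it builds the incidence vector of Definition 12 and checks enabling and firing via Definition 13.

```python
# Sketch of Definitions 12 and 13 for a pure P/T-net. W maps arcs to weights.
import numpy as np

places = ["s1", "s2"]
transitions = ["t1"]
W = {("s1", "t1"): 1, ("t1", "s2"): 1}          # s1 -> t1 -> s2

def column(t):
    """Incidence vector t(s) of Definition 12 (pure net: no self-loops)."""
    vec = np.zeros(len(places), dtype=int)
    for i, s in enumerate(places):
        vec[i] = W.get((t, s), 0) - W.get((s, t), 0)
    return vec

N = np.column_stack([column(t) for t in transitions])   # matrix of Definition 12
M = np.array([1, 0])                                     # marking as a vector
print("N =\n", N)

t1 = column("t1")
print("t1 enabled:", np.all(M + t1 >= 0))   # Definition 13(2): 0 <= M + t
print("M' =", M + t1)                       # Definition 13(1): M[t> M' <=> M + t = M'
```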

We use the theory of T-invariants of a P/T-net [32] to prove the properties.

Definition 14 The T-invariants of a P/T-net are the solutions $x$ of the equation $N \cdot x = 0$,

where $N$ is the matrix representing the P/T-net and $x^T = (x_1, x_2, x_3, x_4, x_5, x_6, x_7)$.^16 Let $v: T \to \mathbb{N}$ be such a solution. If it is possible, starting from some marking $M$, to fire each transition $t$ exactly $v(t)$ times, then this again yields the marking $M$ (see [32] for formalization and proof).

^15 A P/T-net is called a pure net if $F$ does not contain self-loops. The nets constructed in our context are always pure.
^16 $x^T$ denotes the transpose of the vector $x$.


Figure 16: P/T-net for the a-i P/T-net of fig. 13.

Let $M_1$ be the initial marking of the net of fig. 16, and let $N$ be the matrix of this net, obtained column by column from the transition vectors $t_1, \ldots, t_7$ of fig. 16 according to Definition 12. In $M_1$, one token is placed to start the loop, and the place $s_4$ holds its initial $n-1$ tokens.

We can then state the following lemma:

Lemma 1 The vectors of the linear space $\{(n, n, n, n-1, 1, 1, 1)^T \cdot i \; ; \; i \in \mathbb{N}\}$ are the T-invariants of $N$.

Proof of lemma 1:

In order to solve the equation $N \cdot x = 0$, we rearrange $N$ into an upper triangular matrix $U$:

$$
U = \begin{pmatrix}
-1 & 0 & 0 & 1 & 0 & 0 & 1 \\
0 & -1 & 0 & 1 & 0 & 0 & 1 \\
0 & 0 & -1 & 1 & 0 & 0 & 1 \\
0 & 0 & 0 & -1 & n-1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & -1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}
$$

The 6th line of $U$ yields the equation $x_6 = x_7$, and the 5th line $x_7 = x_5$, which implies $x_6 = x_5$. The 4th line gives $x_4 = (n-1) \cdot x_5$, and the 3rd line $x_3 = x_4 + x_7$, which simplifies to $x_3 = n \cdot x_5$. Similarly for the first and second indices: $x_2 = n \cdot x_5$ and $x_1 = n \cdot x_5$. The solution vector $x$ is therefore

$$x = (n, n, n, n-1, 1, 1, 1)^T \cdot x_5.$$

Consequently, the solution space is the linear space indicated in lemma 1.

Lemma 1 can then be interpreted as follows. When the loop execution starts, the initial marking is equal to $M_1$. The three operator vertices of the loop (represented by $t_1$, $t_2$, $t_3$) are executed exactly $n$ times. The control flow switches via $t_5$ to the operator vertex after the loop (represented by $t_7$) before a new loop start is initiated. The transition $t_6$ re-provides $s_6$ with a token and removes the token from $s_8$. Finally, when $t_7$ fires, the net returns to the initial marking $M_1$. This proves the properties.
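As a quick numerical check (a sketch, not part of the original proof), one can verify that the invariant vector of lemma 1 lies in the kernel of the matrix $U$ reconstructed above, for several values of $n$:

```python
# Check that (n, n, n, n-1, 1, 1, 1) solves U x = 0 for the matrix U above.
import numpy as np

def U_matrix(n):
    U = np.zeros((10, 7), dtype=int)
    U[0, [0, 3, 6]] = [-1, 1, 1]
    U[1, [1, 3, 6]] = [-1, 1, 1]
    U[2, [2, 3, 6]] = [-1, 1, 1]
    U[3, [3, 4]]    = [-1, n - 1]
    U[4, [4, 6]]    = [1, -1]
    U[5, [5, 6]]    = [-1, 1]
    return U                                   # rows 7-10 stay zero

for n in (2, 4, 7):
    x = np.array([n, n, n, n - 1, 1, 1, 1])
    assert not np.any(U_matrix(n) @ x), f"invariant check failed for n={n}"
print("(n, n, n, n-1, 1, 1, 1) is a T-invariant for all tested n")
```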

7.3.3 Correctness consideration for labeled loop dependencies

Let us now turn to labeled L-edges and consider the a-i P/T-net construction proposed in fig. 15. For clarity of presentation, we omit operator i-1 and operator j-1. Suppose that $m$ is the number of loop executions for relation 1. For each of these executions, the loop of relation 2 is executed $n$ times. The P/T-net for the L-edge of relation 2 can be considered independently (see fig. 17) from that of the L-edge of relation 1; we will refer to this subnet as $N_1$. Indeed, $N_1$ has the same structure as the net constructed for unlabeled L-edges in fig. 13. Consequently, the properties established for unlabeled L-edges apply: the transitions $t_5$, $t_4$, $t_3$ are executed $n$ times, and the net returns to its initial marking before $t_8$ fires. The net $N_1$ can therefore be replaced by a simple place $s_{15}$. The resulting structure is once again equal to the net constructed for unlabeled L-edges, where the corresponding L-edge now links $t_1$ and $t_2$. Applying the properties of unlabeled L-edges again, we conclude that the sequence $t_1$, $t_2$ is executed $m$ times.

Figure 17: Partitioning of the net construction for labeled L-edges.

This proves the correctness of the net construction for labeled L-edges.

7.4 From modelization to simulation of query processing

In the last subsections, we have shown how high-level Petri nets can model the data and control flows in DPL graphs. This point is especially interesting for two reasons. First, simulators exist that allow Petri nets to be simulated. Thus, DPL graphs combined with high-level Petri nets provide a very powerful tool for simulating optimization strategies and run-time control mechanisms (e.g. load balancing). Second, tools exist that allow Petri nets to be actually executed on a parallel machine, and therefore to execute PEPs modeled as DPL graphs. In this context, we show in the next section how to build an efficient simulation tool for processing DPL graphs. As a first step, we looked for a Petri net simulation tool suited to implementing the developed modelization of the data and control flow in DPL graphs.

8 Simulation of query processing trees using High-level Petri nets

The CRIM (Centre de recherche d'informatique de Montreal) provides at the WWW server http://www.crim.ca/se/petri-tools.html a regularly revised overview of all known Petri net tools. We investigated this list for the best suited simulation tool and chose Cabernet (Computer Aided software engineering environment Based on ER NETs) for the simulation of our annotated and inscribed P/T-nets. This choice was motivated by the excellent documentation available (only half of the tools provide any documentation). Cabernet is freeware, relatively simple to install, and available from CEFRIEL and Politecnico di Milano at ftp-se.elet.polimi.it. We made extensive use of this excellent tool, which is based on a graphical editor, an executor, an animator and an analyzer (see fig. 20).

8.1 The Petri net simulation tool Cabernet

More precisely, Cabernet [36, 37] is a software engineering environment for the specification and analysis of real-time systems based on ER nets. ER nets [34] are P/T-nets in which the tokens carry information and the transitions are augmented with actions and predicates. A transition fires when its predicate is fulfilled; actions can implement any kind of program.

Figure 18: Waste place.

Such an ER net can easily represent an a-i P/T-net. Transition instructions correspond to actions, and the conditions written in places correspond to predicates of the associated transitions. However, a small difference exists in the handling of transition firing: when the predicate C is not fulfilled, an ER net does not remove such tokens, as an a-i P/T-net does. In order to remove these tokens, a special waste place (inscribed with the complementary predicate not C) must be introduced (see fig. 18).
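The waste-place idea can be paraphrased with a tiny interpreter. The sketch below is hypothetical and does not use Cabernet's API: tokens satisfying the predicate C are processed by the regular transition, while the complementary transition absorbs the others into the waste place.

```python
# Sketch of the waste-place transformation in ER-net terms (assumed names).

def step(tokens, predicate, action):
    fired, wasted = [], []
    for tok in tokens:
        if predicate(tok):
            fired.append(action(tok))     # regular transition: predicate C
        else:
            wasted.append(tok)            # waste transition: predicate not C
    return fired, wasted

tokens = [{"eos": False, "value": 3}, {"eos": True, "value": None}]
fired, wasted = step(tokens,
                     predicate=lambda tok: not tok["eos"],
                     action=lambda tok: {**tok, "value": tok["value"] * 2})
print(fired)    # tokens processed by the C-transition
print(wasted)   # tokens absorbed by the waste place
```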


8.1.1 A little example

Let us look again at the query processing tree for the sample relational query $R_1 \bowtie R_2 \bowtie R_3$ of fig. 4 (right scheme). Fig. 19 shows the a-i P/T-net modeling the execution of this DPL graph. This a-i P/T-net was then transformed into an ER net, introducing waste places where necessary. Fig. 20 shows a snapshot of the resulting ER net.
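As a rough rendering of this example (the exact wiring of places and arcs is the one of fig. 19; the chain below is only an assumed sequential reading of the operator labels), the plan can be written as a list of transitions and played through as a token game:

```python
# Assumed sequential reading of the operator labels of fig. 19 for R1 |><| R2 |><| R3;
# each entry is (transition, input places, output places). Place names are hypothetical.

chain = [
    ("build hash table R2",        ["input"],        ["ht_R2_ready"]),
    ("probe hash table with R3",   ["ht_R2_ready"],  ["R2_R3"]),
    ("build hash table R1",        ["R2_R3"],        ["ht_R1_ready"]),
    ("probe hash table with R2R3", ["ht_R1_ready"],  ["output"]),
]

marking = {"input": 1}
for t, ins, outs in chain:                 # play the token game along the chain
    assert all(marking.get(p, 0) > 0 for p in ins), f"{t} not enabled"
    for p in ins:
        marking[p] -= 1
    for p in outs:
        marking[p] = marking.get(p, 0) + 1
print(marking)   # one token in "output" once the whole plan has executed
```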

Figure 19: The a-i P/T-net for the example query processing tree.

Figure 20: Snapshot of Cabernet showing the example Petri net.

8.1.2 Timed Petri nets bring simulation time

ER nets are timed Petri nets, i.e. tokens can carry a time value that is modified by the transitions' actions. Once a transition is fired, its action estimates the execution time of the operator it represents^17. This local execution time is then added to the time value of the token. The simulation time can be displayed graphically for each transition and can even be modified in order to model load imbalance. In such a case, several alternative run-time strategies can be tested, e.g. a redistribution operator can be integrated. Even the structure of the DPL graph can be quickly modified with the graphical representation tool, e.g. a bushy processing strategy can be changed to a linear one by adapting the arcs of the ER net.
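The accumulation of simulation time along a token can be sketched as follows. The cost figures and operator names below are illustrative assumptions, not values from the report; the point is only that each transition's action advances the clock carried by the token.

```python
# Sketch of how ER-net time stamps accumulate (hypothetical per-operator costs).

estimated_cost = {
    "scan":  4.0,
    "build": 2.5,
    "probe": 3.0,
}

def fire_timed(token, operator, skew_factor=1.0):
    """Action of a timed transition: advance the token's clock."""
    token = dict(token)
    token["time"] += estimated_cost[operator] * skew_factor
    return token

token = {"time": 0.0}
for op in ("scan", "build", "probe"):
    token = fire_timed(token, op)
print(token["time"])                      # simulated completion time: 9.5

# Load imbalance can be modeled by scaling one operator's time, e.g.:
print(fire_timed(token, "probe", skew_factor=2.0)["time"])
```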

9 Conclusion and connected works

This report has described a novel representation model of parallel relational query execution plans (PEPs). In comparison with previous approaches, DPL graphs allow any strategy of data, task or pipeline parallelism to be represented very accurately. By allowing communication and run-time considerations to be integrated into the PEP itself (i.e. at compile-time), this formalism provides a PEP representation model much closer to the actual execution. Furthermore, DPL graphs allow a scheduling graph to be encapsulated into the PEP. This makes it possible to work on the same data structure during the whole optimization process, without needing interface structures. Finally, we have shown that the data and control flows can be modeled using high-level Petri nets, thus completing the definition of a general framework for parallel query optimization. We described how this modelization can lead to an efficient simulation environment based on the Petri net simulation tool Cabernet. Indeed, we believe that DPL graphs, completed by high-level Petri nets, can provide a very powerful simulation environment for testing run-time control strategies as well as query optimization methods.

This report presents a global, conceptual framework for parallel query optimization. To date, we have developed and implemented a query optimizer based on DPL graphs using randomized search strategies. Details of the implementation can be found in [38] and are also the subject of a forthcoming technical report.

^17 For estimations of execution time, see for example [13]. We are interested here in the global time dependencies.


References

[1] P. Mishra and M.H. Eich. Join Processing in Relational Databases. ACM Computing Surveys, 24(1), March 1992.
[2] R. Krishnamurthy, H. Boral, and C. Zaniolo. Optimization of Nonrecursive Queries. In Proceedings of the International Conference on Very Large Databases, Kyoto, Japan, August 1986.
[3] H. Lu, B.-C. Ooi, and K.-L. Tan, editors. Query Processing in Parallel Relational Database Systems, page 382. IEEE Computer Society Press, 1994.
[4] W. Hasan, D. Florescu, and P. Valduriez. Open issues in parallel query optimization. ACM SIGMOD Record, 25(3), September 1996.
[5] R.S.G. Lanzelotte, P. Valduriez, and M. Zaït. Industrial-Strength Parallel Query Optimization: Issues and Lessons. Information Systems - An International Journal, 1994.
[6] S. Ganguly, W. Hasan, and R. Krishnamurthy. Query Optimization for Parallel Execution. In Proceedings of the ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, 1992.
[7] G. Graefe, R.L. Cole, D.L. Davison, W.J. McKenna, and R.H. Wolniewicz. Extensible Query Optimization and Parallel Execution in Volcano, page 305. In Query Processing for Advanced Database Applications. Morgan Kaufmann, San Mateo, CA, 1994.
[8] D. Schneider and D.J. DeWitt. A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Portland, Oregon, USA, June 1989.
[9] Wei Hong. Parallel Query Processing Using Shared Memory Multiprocessors and Disk Arrays. PhD thesis, University of California, Berkeley, August 1992.
[10] A. Hameurlain and F. Morvan. A parallel scheduling method for efficient query processing. In Proceedings of the International Conference on Parallel Processing, Boca Raton, FL, USA, August 1993.
[11] P. Valduriez, M. Couprie, and B. Bergstein. Prototyping DBS3, a shared-memory parallel database system. In Proceedings of the 1st International Conference on Parallel and Distributed Information Systems, Miami Beach, Florida, December 1991.
[12] Donovan Schneider. Complex Query Processing in Multiprocessor Database Machines. PhD thesis, University of Wisconsin, Madison, USA, 1990. Also available as Computer Sciences Technical Report 965 (September 1990).
[13] Goetz Graefe. Query evaluation techniques for large databases. ACM Computing Surveys, 25(2), June 1993.
[14] D. Schneider and D.J. DeWitt. Tradeoffs in processing complex join queries via hashing in multi-processor database machines. In Proceedings of the International Conference on Very Large Databases, Melbourne, Australia, August 1990.
[15] Jim Gray. The Cost of Messages. In Proceedings of the Seventh ACM Symposium on Principles of Distributed Computing, Toronto, Canada, August 1988.

[16] W. Hasan and R. Motwani. Optimization Algorithms for Exploiting the Parallelism-Communication Tradeoff in Pipelined Parallelism. In Proceedings of the International Conference on Very Large Databases, pages 36-47, Santiago, Chile, September 1994.
[17] W. Hasan and R. Motwani. Coloring Away Communication in Parallel Query Optimization. In Proceedings of the International Conference on Very Large Databases, pages 36-47, Santiago, Chile, 1995.
[18] P. Borla-Salamet, C. Chachaty, and B. Dageville. Compiling Control into Database Queries for Parallel Execution Management. In Proceedings of the 1st International Conference on Parallel and Distributed Information Systems, Miami, Florida, December 1991.
[19] G. Graefe and D.L. Davison. Encapsulation of parallelism and architecture independence in extensible database query processing. IEEE Transactions on Software Engineering, 19(7), July 1993.
[20] C.K. Baru, G. Fecteau, A. Goyal, H. Hsiao, A. Jhingran, S. Padmanabhan, G.P. Copeland, and W.G. Wilson. DB2 Parallel Edition. IBM Systems Journal, 34(2), 1995.
[21] J. Wolf, D. Dias, P. Yu, and J. Turek. An effective algorithm for parallelizing hash joins in the presence of data skew. In Proceedings of the Seventh International Conference on Data Engineering, pages 200-209, Kobe, Japan, April 1991.
[22] H. Lu and K.L. Tan. Load-Balanced Join Processing in Shared-Nothing Systems. Journal of Parallel and Distributed Computing, 23:382-398, 1994.
[23] Wei Hong. Exploiting Inter-Operation Parallelism in XPRS. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 19-28, USA, June 1992.
[24] L. Brunie and H. Kosch. Control strategies for complex relational query processing in shared nothing systems. ACM SIGMOD Record, 25(3), September 1996.
[25] L. Brunie, H. Kosch, and A. Flory. New Static Scheduling for Parallel Query Processing. In BIWIT 95, IEEE Computer Society Press, pages 50-59, San Sebastian, Spain, July 1995.
[26] D. Schneider, D.J. DeWitt, J. Naughton, and S. Seshadri. Practical skew handling in parallel joins. In Proceedings of the International Conference on Very Large Databases, Vancouver, British Columbia, August 1992.
[27] B. Seeger and P.-A. Larson. Multi-Disk B-trees. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Miami Beach, USA, December 1991.
[28] K.A. Hua, M. Lo, and H.C. Young. Considering Data Skew Factor in Multi-Way Join Query Optimization for Parallel Execution. Very Large Databases Journal, 2(6), March 1993.
[29] James A. McHugh. Algorithmic Graph Theory. Prentice Hall, Englewood Cliffs, New Jersey 07632, 1991.
[30] G. von Bueltingsloewen. Optimizing SQL Queries for Parallel Execution. Master's thesis, Universität Karlsruhe, 1990.

[31] M. Zaït, P. Valduriez, and D. Florescu. Validating a parallel Query Optimizer. Ingénierie des systèmes d'information, 3(1):85-111, January 1995.
[32] Wolfgang Reisig. Petri Nets: An Introduction. EATCS Monographs on Theoretical Computer Science. Springer-Verlag, New York, 1985.
[33] K. Jensen and G. Rozenberg. High-Level Petri Nets: Theory and Application. Springer-Verlag, 1991.
[34] C. Ghezzi, D. Mandrioli, S. Morasca, and M. Pezze. A Unified High-Level Petri Net Formalism for Time-Critical Systems. IEEE Transactions on Software Engineering, 17(2), February 1991.
[35] I. Gorton. Parallel Program Design using High-Level Petri Nets. Concurrency: Practice and Experience, 5(2), April 1993.
[36] M. Pezze and C. Ghezzi. Cabernet: an environment for the specification and verification of real-time systems. In Proceedings of the 1992 DECUS Europe Symposium, Cannes, France, September 1992.
[37] Mauro Pezze. Cabernet: a customizable environment for the specification and verification of real-time systems. Submitted for publication, 1994.
[38] L. Brunie and H. Kosch. Parallel query optimization in parallel databases. In 4th Workshop on Scientific Computing, October 1996.
