A Model for Pipelined Query Execution

Annita N. Wilschut    Stephan A. van Gils
University of Twente
P.O. Box 217, 7500 AE Enschede, the Netherlands
[email protected]    [email protected]

Abstract

This paper develops an analytical model for the pipelined execution of complex queries. The goal of this research is to gain insight into the potential performance gain from the concurrent execution of pipelined relational operations. First, the model is developed in general terms; after that, it is elaborated for selection, unique, and join operations. Finally, it is shown how the model is used to understand the behavior of multi-operation queries. It is shown how the individual characteristics of operations influence their cooperation in a pipeline. The model increases the understanding of dataflow query execution and can form a basis for generating query optimization heuristics for a parallel DBMS.

1 Introduction

During the last few years, much attention has been paid to the development of parallel DBMSs. Teradata [Ter83], GAMMA [DGS90], Bubba [BAC90], HC16-186 [BrG89], and PRISMA/DB [Ame91, ABF92, WFA92] are examples of parallel DBMSs that were actually implemented. Each of these systems exploits parallelism to speed up query execution. Intra-operator parallelism is the primary source of parallelism in many projects. This type of parallelism is well understood now, and using it, efficient execution strategies can be found for simple queries that consist of only one or two relational operations on possibly large volumes of data. A parallel DBMS, however, may also have the potential of exploiting inter-operator parallelism to speed up the execution of more complex queries. PRISMA/DB is an example of a dataflow DBMS that offers flexibility in its query execution, so that inter-operator parallelism can be experimented with. However, the use of inter-operator parallelism is not well understood yet. Two forms of inter-operator parallelism can be discriminated. Horizontal inter-operator parallelism allocates independent parts of a query execution tree to disjoint (sets of) processors, so that those parts of the query tree can execute independently in parallel. The independence of the execution of the parallel tasks makes this form of parallelism easy to understand. Inter-operator pipelining allocates processes that have a producer-consumer relationship to different processors. The processes may run concurrently, but the execution is restricted by the interrelationship of the processes. Specifically, a producer may impede the consumer process by supplying input at a low rate. The interrelationship of operations in a pipeline makes it hard to understand their processing: it is not clear how much performance gain pipelining yields. Recently, we started studying the use of inter-operator parallelism for complex queries in the context of PRISMA/DB.
Multi-join queries were used as an example in this study. Important questions in this research are: What type of query tree yields the most performance gain from inter-operator parallelism, and what is the influence of the algorithms used for the individual operations on this performance gain? The latter question led to the proposal of pipelining algorithms, which aim at producing output as early as possible to increase the effective parallelism from pipelined processes. In Section 4.3, a pipelining hash-join algorithm is described. Figure 1 illustrates the first question. It shows a linear and a bushy tree for the same multi-join query. Allocating each join operation to a private processor yields inter-operator parallelism in both join trees; however, it is not clear which tree performs best.

Figure 1: A bushy and a linear join tree

[WiA91b, WiA92] report the first results of that study. A simulator was used to study different query execution plans. It appeared to be difficult to understand inter-operator pipelining. An attempt to understand the concurrent execution of processes in a producer-consumer relationship led to the development of an analytical model for dataflow query execution. The results of that analysis for pipelining between join operations are presented in [WiA91b, WiA92] to explain the results from the simulation study. In this paper, the full development and evaluation of the analytical model for dataflow query execution are presented. It is shown how the model is designed to increase the understanding of the modeled phenomena. Therefore, the yield of this modeling is twofold: firstly, in cases where full analytic evaluation is possible, it yields a strong performance model for the concurrent execution of pipelined operations; secondly, it increases the insight into the essential features of pipelining in general by providing a framework that looks at pipelining from the right angle. [WiA91b, WiA92] apply the dataflow model to join operations; here, the model is extended to cover relational operations other than the join. Finally, it is shown how query trees can be described by combining the models for their components. Recently, some papers on the parallel execution of complex queries have appeared. In a first attempt to understand the effect of various query tree formats, [ScD90] studies the behavior of right-deep and left-deep linear query trees for multi-join queries. It is concluded in that paper that right-deep scheduling has performance advantages in the context of GAMMA. In [Gra90], it is shown how arbitrarily shaped query trees can be parallelized using the "exchange" operator, which splits (part of) a query tree into a number of subtrees that can be executed in parallel.
Although that paper makes clear that certain query trees can be parallelized, it does not solve the problem of which (type of) query tree performs best. [HoS91] discriminates between blocking and non-blocking operations. An operation is called blocking if it cannot execute concurrently with the consumer of its output. In that paper, inter-operator pipelining is only exploited between non-blocking operations. In contrast, our approach tries to adapt the algorithms for relational operations to eliminate the synchronization requirements that make them blocking. The performance gain that can be expected in return is studied in this paper. This paper is organized as follows: Section 2 summarizes the essentials of PRISMA/DB and its dataflow execution model. Section 3 starts with a discussion of the key ideas behind the analytical model, and then it develops the model for the pipelined execution of an operation in general. In Section 4, the model is specialized to describe a selection, a unique, and a join operation. The emphasis of this section is on the path that leads from the characteristics of the operation to the model, and on the intuitive interpretation of the results. After that, it is shown how the models for individual operations can be combined to describe a query tree. Section 6 summarizes and concludes the paper.


2 Dataflow Query Execution on PRISMA/DB

PRISMA/DB [ABF92] is a parallel, main-memory DBMS that runs on a shared-nothing parallel multiprocessor. This system has the following features: The hardware consists of a number of processors that can communicate via a message-passing network. Each processor hosts part of the base data. A processor can access its part of the base data directly. If a processor wants to access the data stored on another processor, those data have to be sent to it via the network. PRISMA/DB supports various forms of parallelism in its query execution. PRISMA/DB is a main-memory system, which has important consequences for this research: a main-memory system allows the use of pipelining algorithms. Also, the behavior of a main-memory system is relatively simple, which allows the formulation of a simple model. A query on a relational database can be represented as a dataflow graph. The nodes of such a graph represent eXtended Relational Algebra (XRA) operations [GWF91]. Each processor can run one or more operation processes. In this paper, we want to study inter-operator parallelism, and therefore, each operation process is assumed to have a private processor. Operation processes evaluate XRA operations on local data, or on tuple streams that are sent to them via the message-passing network. The result of the evaluation of an XRA operation consists of a (multi)set of tuples. Network transport of tuples is modeled as follows: To transport a tuple from a process to another, remote process, first, it has to be "wrapped" and put on the network hardware by the sending operating system; then, it is sent over the network; and finally, it has to be retrieved from the network and "unwrapped" by the receiving operating system.
So, sending a tuple over the network implies CPU costs on the sending and receiving processors, and actual transmission, which implies a delay. In general, the CPU costs involved appear to be the limiting factor [DeG90], and therefore, the rate at which tuples are transported over the network is determined by the capacity of the CPUs that send and receive the tuples, and not by the capacity of the network hardware.

3 An analytical model for dataflow query execution

In this section, an analytical model for dataflow query execution is developed. The key idea behind this model is as follows: A dataflow query in execution is seen as an "assembly line" in which the data are the items to be processed, and the operation processes serve as the workers that manipulate the data items coming along. At every point in the system, the data flow at a certain rate, which is defined as the number of tuples passing by per time unit. Operation processes map the rates at which their operands are available onto the rate at which the result is generated. This mapping depends on the sort of operation process and on the resources that are available to the operation process. The way in which operation processes map their input streams onto an output stream is essential for the potential parallelism from pipelining. This can be illustrated as follows: A sort operation can only start producing output after it has finished sorting its input. Therefore, concurrent execution of a sort operation and an operation that consumes its output will not yield much performance gain, because the consumer is waiting for input during the main processing phase of the sort operation. A selection operation, on the other hand, can produce output during its execution, so its consumer can execute at the same time. In this case, concurrent execution of the producer and consumer does yield performance gain, the extent of which depends on the relative costs of the producing selection process and the process consuming the output. These examples show that understanding the components of a pipeline is essential to understanding the behavior of the entire pipeline. However, not all relational operations are as easy to understand as the sort and the selection used in this example. Therefore, an analytical model is used. This model is first developed for individual operations in a pipeline.
After that, the models are connected to describe a multi-operation pipeline. In this paper, the dataflow model is described as follows: the next section (3.1) states some preliminary remarks and assumptions. After that, we show how one relational operation maps the rate at which the input is available onto the rate at which the result is produced. This is done in general terms first (Section 3.2), and Section 4 specializes the general model to describe a selection operation, a unique operation, and a join operation. Finally, Section 5.1 shows how a join tree can be described from the models for the participating join operations, and Section 5.2 gives an indication of the modeling of general query trees.

3.1 Some preliminaries

Resources in the model

The model describes the rates at which tuples are transported and processed in a dataflow system. Also, the utilization of the processors participating in the dataflow system is modeled. Because, as described above, the bandwidth of the message-passing network is assumed to exceed the requirements of the application, the utilization of this hardware is not modeled. This paper only deals with retrieval, and in a main-memory context, retrieval does not need any disk accesses. So, there is no need to model secondary storage either. The only resource that has to be taken into account is thus the CPU. The resulting model is simple and consequently powerful: a complete analysis is possible for some classes of queries.

Modeling discrete phenomena

Tuples are discrete entities. Our model, however, is continuous. A continuous model for a discrete phenomenon is possible if large numbers of events are described [Oku80, WiD87]. The transition from a discrete to an analytical model eliminates the need to use probability theory: if, in a discrete model, there is a probability of 0.5 that a tuple is generated, the analytical model will generate half a tuple. This way of modeling is generally accepted in physics and biology, and can, in our opinion, be used here without significant problems.

Entities and dimensions

The rates at which tuples are transported and processed, and the utilization of processors, are modeled. To do so, the costs of certain operations also have to be expressed. Tuple transport is expressed in number of tuples per unit of time. The processor utilization is dimensionless and has a maximum of 1. The costs of operations are expressed in units of time per tuple. Consider, as an example, an operation process that processes tuples at a rate of x tuples per time unit. The processor spends A time units on the processing of one tuple. The resulting processor utilization is Ax (dimensionless).
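The dimensional bookkeeping in this example can be written out as a few lines of code. This is a minimal sketch; the numeric values are illustrative choices, not taken from the paper:

```python
# Dimensional bookkeeping for the utilization example above.
# The values of x and A are illustrative, not from the paper.
x = 500      # processing rate: tuples per time unit
A = 0.0015   # processing cost: time units per tuple

w = A * x    # (time/tuple) * (tuples/time) -> dimensionless utilization
assert w <= 1.0   # a processor utilization can never exceed 1
print(w)
```

The dimensions cancel exactly as the text describes: time units per tuple times tuples per time unit leaves a pure number.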

3.2 Definition of a dataflow model

Figure 2: Dataflow query execution.

Figure 2 summarizes the essentials of a dataflow operation. The large box in this figure represents the processor; the small box represents the operation process. Data is sent to the operation process at the bottom of the box; the result is sent to some destination at the top of the box.

Terminology

In Figure 2, each input stream contains two arrows: the first arrow indicates the rate at which tuples arrive at the operation process; the second one represents the rate at which tuples are processed by the operation process. This distinction is made because these two rates may differ. For example, if tuples are sent to an operation process at a high rate, and the operation process cannot keep up with this rate, they are processed at a lower rate than they are available. If, on the other hand, operand tuples are sent to an operation process at a low rate, the rate at which tuples are available may be the limiting factor, so they are processed at the rate at which they are available. The arrow in an output stream indicates the rate at which result tuples are produced in that stream. The left column in Figure 2 shows the formalism used: a(t) is the rate at which tuples of a particular operand are available to an operation process at time t (dimension: tuples per unit of time); x(t) is the rate at which tuples of a particular operand are processed by an operation process at time t (dimension: tuples per unit of time); w(t) is the processor utilization at time t (dimensionless); r(t) is the rate at which tuples are produced in a particular output stream (dimension: tuples per unit of time). A query in execution consists of a number of communicating processes. Time t = 0 is always used to indicate the moment at which the execution of the entire query starts. Some operation processes in the query may be idle at time t = 0. The symbol T is used to indicate the termination time of the execution of an operation process.

3.3 Some relationships

From the description of dataflow execution, the following relationships can be deduced. The model handles binary and unary operations. This section is written in terms of a binary operation; the equivalent expressions for unary operations are straightforward. The processor utilization is a function of the rates at which the operand tuples are processed:

    w(t) = W(x1(t), x2(t))        (1)

The rate at which tuples are produced is also a function of the rates at which the operand tuples are processed:

    r(t) = R(x1(t), x2(t))

If a CPU works at full capacity, its utilization is 1. Therefore, w(t) can never be larger than 1. The discrimination between a(t) and x(t) leads to the definition of two different modes in which an operation process can work.

input-limited mode: Tuples are available to the operation process at such a low rate that the operation process can keep up with this rate. Now, w(t) ≤ 1, and xj(t) = aj(t) for all operands.

CPU-limited mode: Tuples are available to the operation process at such a high rate that the receiving processor cannot keep up with this rate, so w(t) = 1 and there is a j ∈ {1, 2} with xj(t) < aj(t).

This discrimination leads to the central equation in this paper:

    xj(t) = aj(t)             if the operation process is input-limited
    xj(t) meets w(t) = 1      if the operation process is CPU-limited        (2)

This equation is used to evaluate the behavior of an operation process. An outline of such an evaluation is as follows: Equation (1) expresses the CPU utilization as a function of the rates at which

operand streams are processed. In the input-limited mode of an operation process, xj(t) = aj(t), and W(a1(t), a2(t)) < 1. In CPU-limited mode, the operation cannot keep up with the rate at which tuples arrive, so W(a1(t), a2(t)) > 1. Therefore, evaluating W(a1(t), a2(t)) and comparing the result to 1 (the maximal CPU utilization) can reveal whether an operation process is input-limited or CPU-limited at time t. If an operation process is input-limited, the rate at which the operand streams are processed is clear (a(t)). If, on the other hand, the process appears to be CPU-limited, then solving the equation W(x1(t), x2(t)) = 1 for x1(t) and x2(t) shows at what rate each operand tuple stream is processed. Knowing the functions xj(t) and the mapping R, the rate at which tuples are produced by an operation process, r(t), can be calculated. The result of an operation process can be sent as input to another operation process. Those tuples are assumed to arrive at the receiving operation process, with some delay¹, at the rate at which they are produced by the producing operation process. So the function a(t) for the consumer is known, and we are in the position to evaluate the behavior of this consumer process. Summarizing, the model maps the rate at which operand tuples are available onto the rate at which result tuples are produced. To describe a query tree, the result of the evaluation of one operation can be used as input to the next one. In the remainder of this paper, the model developed above is applied to some specific relational operations. In all cases, the goal is a full characterization in terms of x(t), w(t), r(t), and T, given the rate at which operand tuples are available (a(t)).
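The evaluation outline above can be sketched in code. This is an illustrative fragment, not part of the paper's model: it assumes a binary operation whose utilization happens to be linear in the processing rates, W(x1, x2) = A1·x1 + A2·x2, and it resolves the CPU-limited case by scaling both arrival rates down to unit utilization (one possible way of solving w(t) = 1; the paper leaves this operation-specific):

```python
def evaluate_mode(a1, a2, A1, A2):
    """Decide the mode of a binary operation at one instant.

    a1, a2: rates at which operand tuples are available (tuples/time unit)
    A1, A2: CPU cost per tuple of each operand (time units/tuple)
    Assumes the linear cost model W(x1, x2) = A1*x1 + A2*x2 (an illustration).
    Returns (mode, x1, x2): the processing rates actually achieved.
    """
    w = A1 * a1 + A2 * a2          # utilization if every arriving tuple were processed
    if w <= 1.0:                   # input-limited branch of equation (2)
        return "input-limited", a1, a2
    # CPU-limited branch: scale both rates proportionally so that W(x1, x2) = 1.
    scale = 1.0 / w
    return "CPU-limited", a1 * scale, a2 * scale

mode, x1, x2 = evaluate_mode(a1=400, a2=400, A1=0.002, A2=0.003)
print(mode, x1, x2)   # here W(a1, a2) = 2.0 > 1, so the process is CPU-limited
```

Feeding the resulting rates into R then gives r(t), which in turn becomes the a(t) of the next operation in the pipeline, exactly as the outline describes.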

4 Elaboration of the model for some relational operations

The general model can be specialized to describe relational operations. Full analytic evaluation is possible in many cases. In this section, three operations are described: the selection, the unique operation, and the join operation. These three operations were chosen because they are fundamentally different in their behavior. Each subsection starts with a description of the operation and its parameters; then it is shown how the operation can be modeled; and finally the solution is given and interpreted intuitively. The essential features of the solution are summarized at the end of each subsection. The paths from the model to the solution are given in the Appendix.

4.1 Selection

As a first simple example, the behavior of a selection operation is studied. A selection operation has one operand. The selectivity of the selection is ρ, so, if the operand contains n tuples, the result contains ρn tuples. The operand tuples are assumed to be available at a constant rate: a(t) = X. The amount of work associated with processing one operand tuple (retrieving the tuple from the network or the local memory, and evaluating the selection condition) is denoted As. The amount of work associated with processing one result tuple (storing it, or putting it on the network) is denoted Ss. From these definitions, it is clear that the CPU utilization can be formulated as follows:

    w(t) = As x(t) + ρ Ss x(t)        (3)

For this case, equation (2) can be specialized into:

    x(t) = X                              if As X + ρ Ss X ≤ 1
    x(t) meets  As x(t) + ρ Ss x(t) = 1   otherwise        (4)

This equation can be solved for x(t):

    x(t) = min(X, 1 / (As + ρ Ss))        (5)

¹ The transmission delays can easily be handled in our model, and their influence on the results is simple. Using them complicates the formalism, however, and also, the transmission delay is assumed to be small compared to the time needed to evaluate a relational operation. Therefore, we chose not to incorporate them in the model in this paper.


This last equation is intuitively right: a selection process processes its input stream at the minimum of the rate at which tuples are available and the maximum rate at which the processor can process tuples. The rate of tuples in the output stream is:

    r(t) = ρ x(t)

With this equation, the characterization of a selection process on a constant input stream is complete. From these results, it is clear that a selection operation maps an input stream in which the tuples arrive at a constant rate onto an output stream that also has a constant rate. Also, in CPU-limited mode, a selection processes its input stream at a constant rate, producing output, again, at a constant rate.
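The completed characterization can be stated compactly in code. A minimal sketch of equations (3) and (5); the parameter values in the two usage lines are arbitrary illustrations, not from the paper:

```python
def selection_rates(X, rho, A_s, S_s):
    """Processing and output rates for a selection on a constant input stream.

    X:    rate at which operand tuples are available (tuples/time unit)
    rho:  selectivity of the selection
    A_s:  cost of processing one operand tuple (time units/tuple)
    S_s:  cost of producing one result tuple (time units/tuple)
    """
    x = min(X, 1.0 / (A_s + rho * S_s))   # equation (5)
    r = rho * x                           # output rate
    w = A_s * x + rho * S_s * x           # equation (3); at most 1 by construction
    return x, r, w

# Input-limited: slow input, the selection keeps up (x = X, w < 1).
print(selection_rates(X=100, rho=0.5, A_s=0.004, S_s=0.004))
# CPU-limited: fast input, x is capped at 1/(A_s + rho*S_s) and w = 1.
print(selection_rates(X=1000, rho=0.5, A_s=0.004, S_s=0.004))
```

Both calls return constant rates, which is exactly the characteristic behavior of the selection noted above: a constant input rate maps onto a constant output rate in either mode.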

4.2 Duplicate removal

This section specializes the model for the unique operation, which removes duplicate tuples from its input. The input of the operation consists of n tuples, ρn of which are removed. So, the output consists of (1 − ρ)n tuples. For simplicity, it is assumed that there are no tuples that occur more than twice in the input relation. Under this assumption, it is important to realize that 2ρn tuples in the relation have a "sibling" in the relation. It is assumed that the input is available at a non-limiting rate; the rate at which the input is processed and the rate at which result tuples are produced are calculated.

Model

Similar to the selection, we discriminate between the work done to process a tuple in the input and the work done to produce a tuple in the output. Assuming a hash-based unique algorithm, each operand tuple has to be made available to the unique process, its hash value has to be calculated, and it has to be inserted in a hash table to find equal tuples. These costs are assumed to be constant during the unique process (Au)². If no duplicate is found, an output tuple is produced. The costs associated with producing one tuple (storage or network transport) are assumed to be constant too (Su). The distinction between work dedicated to processing operand tuples and work dedicated to generating result tuples is essential in this model. Through this distinction, insight is gained into how an operation process maps its input streams onto an output stream. Consider a tuple processed by the unique operation at time t. At that time,

    ∫₀ᵗ x(τ) dτ

tuples have already been processed. Because 2ρn tuples have a "sibling" in the relation, a tuple processed at time t has probability

    (2ρ/n) ∫₀ᵗ x(τ) dτ

of being a duplicate of a tuple that has already been processed, so that no result tuple is generated. Therefore, the rate at which result tuples are generated is equal to

    x(t) − (2ρ/n) x(t) ∫₀ᵗ x(τ) dτ        (6)

and the work spent on generating the output is equal to

    Su (x(t) − (2ρ/n) x(t) ∫₀ᵗ x(τ) dτ)

The work spent on processing input tuples is simply equal to:

    Au x(t)

² Actually, these costs increase slightly during the unique process, due to the fact that hash buckets fill up. Using a good hash table, though, minimizes this increase.


The CPU utilization is equal to the sum of the amount of work spent on processing the input and the amount of work spent on generating the output:

    w(t) = Au x(t) + Su (x(t) − (2ρ/n) x(t) ∫₀ᵗ x(τ) dτ)

The unique operation processes in its CPU-limited mode, so equation (2) can be specialized to:

    x(t) meets  Au x(t) + Su (x(t) − (2ρ/n) x(t) ∫₀ᵗ x(τ) dτ) = 1        (7)

Solution

The integral equation can be solved explicitly using elementary calculus (the solution is equivalent to Theorem 1 in the Appendix):

    x(t) = 1 / √((Au + Su)² − (4ρ Su / n) t)        (8)

The unique operation is ready at time T. At this time, the n operand tuples have been processed:

    ∫₀ᵀ x(τ) dτ = n

Substitution of equation (8) and solving this equation for T yields:

    T = Au n + Su (1 − ρ) n        (9)

The rate at which output is produced was derived in equation (6).
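Equations (8) and (9) can be checked against each other numerically: integrating the processing rate x(t) from 0 to T should account for exactly the n operand tuples. A small sketch with arbitrary illustrative parameter values (n, ρ, Au, Su below are not from the paper):

```python
import math

def x_unique(t, n, rho, A_u, S_u):
    """Processing rate of the unique operation, equation (8)."""
    return 1.0 / math.sqrt((A_u + S_u) ** 2 - (4.0 * rho * S_u / n) * t)

n, rho, A_u, S_u = 10_000, 0.2, 0.002, 0.001
T = A_u * n + S_u * (1 - rho) * n            # termination time, equation (9)

# Numerically integrate x(t) over [0, T] with the midpoint rule.
steps = 100_000
dt = T / steps
processed = sum(x_unique((i + 0.5) * dt, n, rho, A_u, S_u)
                for i in range(steps)) * dt
print(T, processed)   # processed should be very close to n
```

Evaluating `x_unique` at t = 0 and t = T also shows the rate increasing over the run, the characteristic unique-operation behavior described next.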

Example

Figure 3: Duplicate Removal. (The diagrams plot r(t), w(t), and x(t) against t for selectivities ρ = 0.02, 0.1, 0.2, and 0.5.)

Figure 3 illustrates the model for the unique operation³. The bottom row contains diagrams that plot x(t) against t for different values of ρ. x(t) increases with time because, as the unique operation proceeds, the probability of finding a duplicate tuple increases. Therefore, the amount of work that has to

³ Part of the symbolic manipulation, and the generation of plots of the results of this manipulation, was carried out using the symbolic manipulator Maple [CGG88].


be spent on processing result tuples decreases, leaving more time to process input tuples. This effect is stronger for the more selective unique operations. The middle diagrams show the processor utilization as a function of time. As the unique operations are CPU-limited during the entire operation, the processor utilization is equal to 1. In these diagrams, an additional curve shows what portion of the CPU effort is spent on processing input tuples (the area below the curve), and what portion is spent on generating output (the other area). We see that the more selective unique operations spend less effort on generating output, leaving more CPU capacity to process input. As a result, the more selective unique operations terminate earlier than the less selective ones. The topmost diagrams show the rate at which result tuples are generated as a function of time. This rate decreases over time due to the increasing probability of finding a sibling tuple. These results can be summarized as the characteristic behavior of a unique operation: the input is processed at an increasing rate, and the output is produced at a decreasing rate. This behavior can be contrasted with the behavior of the selection operation in the previous subsection, which tends to process both input and output at constant rates. The next section shows that the behavior of the Pipelining Hash-Join is "opposite" to the behavior of the unique operation: the Pipelining Hash-Join naturally processes its input at a decreasing rate, and it produces output at an increasing rate.

4.3 Pipelining Hash-Join

In this section, the behavior of the Pipelining Hash-Join algorithm is studied. This algorithm is extensively described and evaluated in [WiA91a, WiA91b, WiA90, WAF91]. The Pipelining Hash-Join algorithm is a main-memory join algorithm that aims at producing its output tuples as early as possible. As such, the algorithm is designed to yield optimal performance gain from the concurrent execution of pipelined join operations. During the join process, a hash table is built for both operands. As a tuple comes in, it is first hashed and used to probe that part of the hash table of the other operand that has already been constructed. If any matches are found, the corresponding result tuples are formed. Finally, the tuple is inserted in the hash table of its own operand. Now, the join process can go on processing the next tuple of either operand, whichever has a tuple available for processing first. If both operands have tuples available for processing, the join process will process tuples from both operands in turn. In this section, it is assumed that the operand tuples from both operands are available at a non-limiting rate, so the operation process works in its CPU-limited mode. The more general case, in which the join process works both input-limited and CPU-limited, is dealt with afterwards. Both operands are equal in size: they contain n tuples. The selectivity of the join operation is assumed to be ρ: the result contains ρn² tuples. From the description of the pipelining join algorithm above, it is clear that in this case (CPU-limited) the join algorithm processes tuples from both operands at the same rate (x(t)). The goal of this section is finding x(t) as a function of time, and deriving T and r(t) from x(t).
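The probe-then-insert discipline described above can be sketched as follows. This is a minimal illustration, not the PRISMA/DB implementation; it assumes tuples join on their first attribute and that the two input streams arrive interleaved as (side, tuple) pairs:

```python
from collections import defaultdict

def pipelining_hash_join(stream):
    """Symmetric (pipelining) hash join, one tuple at a time.

    `stream` yields (side, tuple) pairs with side in {0, 1}; tuples join on
    their first attribute.  Result tuples are emitted as soon as both matching
    inputs have arrived, so output starts before the input is exhausted.
    """
    tables = (defaultdict(list), defaultdict(list))  # one hash table per operand
    for side, tup in stream:
        key = tup[0]
        for other in tables[1 - side][key]:          # probe the other operand's table first
            yield (tup, other) if side == 0 else (other, tup)
        tables[side][key].append(tup)                # then insert into the own table

# Tuples from both operands arrive interleaved:
stream = [(0, ("a", 1)), (1, ("a", "x")), (0, ("a", 2)), (1, ("b", "y"))]
print(list(pipelining_hash_join(stream)))
# [(('a', 1), ('a', 'x')), (('a', 2), ('a', 'x'))]
```

Note that the first result tuple appears after only two input tuples have been consumed, which is precisely the early-output property that makes the algorithm attractive for pipelining.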

Model

The development of the model is equivalent to the development of the model for the unique operation, and therefore, it is described only briefly. Again, we discriminate between the costs related to processing an input tuple (Aj) and the costs related to generating an output tuple (Sj). The tuples in both operands are processed at the same rate: x(t). The amount of work the processor spends on processing input tuples from both operands is a function of time:

    2 Aj x(t)

The amount of work spent on generating the result is calculated as follows. The number of tuples that a tuple processed at time t produces a match with is proportional to the number of tuples that have already been processed in the other operand. However, the number of tuples that have been processed in each operand is equal. Since the number of tuples that have been processed in either operand at time t is equal to ∫₀ᵗ x(τ)dτ, the number of result tuples that is formed upon the arrival of one tuple at time t is equal to:

    ρ ∫₀ᵗ x(τ) dτ

Using this expression, the amount of work spent on generating the result can be formulated as:

    2ρ Sj x(t) ∫₀ᵗ x(τ) dτ

There is a factor 2 in this expression because tuples from both operands are processed. The CPU utilization is equal to the sum of the amount of work spent on processing the input and the amount of work spent on generating the output, so equation (2) can be specialized to:

    x(t) meets  2 Aj x(t) + 2ρ Sj x(t) ∫₀ᵗ x(τ) dτ = 1        (10)

Solution

We are now ready to find x(t) by solving equation (10). This integral equation can be solved using elementary calculus (see Theorem 1 in the Appendix):

    x(t) = 1 / (2 √(ρ Sj t + Aj²))        (11)

The join operation is ready at time T. At this time, n tuples per operand have been processed:

    ∫₀ᵀ x(τ) dτ = n

Substitution of equation (11) and solving this equation for T yields:

    T = 2 Aj n + ρ Sj n²        (12)

The rate at which result tuples are produced can be derived from equation (10):

    r(t) = 2ρ x(t) ∫₀ᵗ x(τ) dτ        (13)
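As with the unique operation, the closed form can be verified numerically: integrating x(t) from equation (11) up to T from equation (12) should give n tuples per operand. An illustrative sketch (the parameter values are arbitrary choices, not from the paper):

```python
import math

def x_join(t, rho, A_j, S_j):
    """Per-operand processing rate of the Pipelining Hash-Join, equation (11)."""
    return 1.0 / (2.0 * math.sqrt(rho * S_j * t + A_j ** 2))

n, A_j, S_j = 1000, 0.002, 0.001
rho = 1.0 / n                                # so the result contains rho * n^2 = n tuples
T = 2 * A_j * n + rho * S_j * n ** 2         # termination time, equation (12)

# Numerically integrate x(t) over [0, T] with the midpoint rule.
steps = 100_000
dt = T / steps
processed = sum(x_join((i + 0.5) * dt, rho, A_j, S_j)
                for i in range(steps)) * dt
print(T, processed)   # processed should be very close to n
```

Evaluating `x_join` at t = 0 and t = T confirms the decreasing processing rate, the mirror image of the unique operation's behavior.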

Example


Figure 4 shows some diagrams of the characterization of two different join processes. Two join operations are illustrated: one in which the selectivity of the join was chosen to be 1/n, so that the result contains n tuples, and another in which the selectivity of the join is 2/n, so that the result contains 2n tuples. The middle diagrams show the processor utilization as a function of time. As the join operations are CPU-limited during the entire operation, the processor utilization is equal to 1. We see that the less selective join operation spends a larger proportion of its effort on generating output, and that in both cases the amount of work related to generating output increases, at the expense of processing input. This is caused by the fact that the probability of finding a matching tuple in the other operand's hash table increases with time. The bottom diagrams show the rate at which input tuples are processed. This rate is decreasing, and this effect is stronger for the second join operation. The topmost diagrams show the rate at which result tuples are generated as a function of time. This rate is increasing. Again, we can summarize this section by formulating the characteristic behavior of the Pipelining Hash-Join: a Pipelining Hash-Join processes its input at a decreasing rate, and it produces output at an increasing rate. In the next section, we describe how the models for individual join operations can be combined to describe a join tree. It will appear that full analytic evaluation of this complex model is possible.

Figure 4: Model for the Pipelining Hash-Join. (The panels plot the output rate r(t), the processor utilization w(t), and the input rate x(t) against time, for join selectivities 1/n and 2/n; the corresponding termination times are n(2A_j + S_j) and n(2A_j + 2S_j).)

5 Multi-operation queries

In this section, we show how individual operations can be linked to form a pipeline. First, bushy join trees are described; after that, it is indicated how we plan to evaluate the behavior of more general query trees.

5.1 Bushy Join trees

Symmetric bushy join-trees are used as an example in this section for several reasons: firstly, the puzzling behavior of the execution of symmetric bushy join-trees [WiA91b] was the initial incentive to develop the analytical dataflow model; secondly, this type of query tree allows full analytic evaluation of the model. Section 5.2 describes how we plan to evaluate irregular query trees.

Symmetric Bushy Join trees

The following multi-join query is assumed to be evaluated: all operands are equal in size (each operand contains n tuples); the selectivity (\rho) of the join operations is equal to 1/n, so each result relation contains n tuples; and each result tuple is projected, so that the size of the result tuples is equal to the size of the operand tuples. Under these assumptions, it is clear that the join operations that have two base relations as operands all have the same execution characteristics. These join operations are called level0 joins. The join operations that join the results of level0 joins again have the same characteristics; they are called level1 joins. In the same way, level2, level3, and maybe even higher levels can be defined.

Model

It is assumed that the base operands are available to the level0 joins at non-limiting rates. Therefore, the characteristics of these joins are as described in Section 4.3. To describe the other levels, some notational conventions are needed. In this section, subscripts are used to indicate the level of the join operation. So, T_3 denotes the termination time of a level3 join, and x_2(t) denotes the rate at which tuples of one level2 join operand are processed. Result tuples from one level are sent as input to a join operation in the next level. As operation processes are assumed to have a private processor, this tuple transport amounts to transport over the network. As stated in Section 3.3, the transmission delay over the network connections is assumed to be 0. So,

    a_{i+1}(t) = r_i(t)        (14)

A leveli join terminates at time T_i. A join at the next level terminates at T_{i+1}. Clearly, T_{i+1} is always greater than T_i. The difference between the termination times of two joins at subsequent levels is called the termination delay. Also, the description needs an additional symbol: \tau_i, the time at which a CPU executing a leveli join is saturated. At this time, the join process switches from its input-limited to its CPU-limited mode. Apparently, \tau_0 is equal to 0. Now, we can derive the model for a join operation at leveli. Equation (14) is used to characterize the input rate for leveli, and the central equation of this paper (2) is used to model the operation process. Furthermore, an expression for the CPU utilization (10) of the pipelining join was derived in the previous section. The combination of these three equations and the definition of \tau_i given above yields the model for leveli of a bushy join-tree:

    x_i(t) = a_i(t)                                                                      if 0 \le t < \tau_i
    x_i(t) meets 2 A_j x_i(t) + \frac{2 S_j}{n} x_i(t) \int_0^t x_i(\tau)\,d\tau = 1     if t \ge \tau_i        (15)

where

    a_i(t) = r_{i-1}(t) = \frac{2}{n}\, x_{i-1}(t) \int_0^t x_{i-1}(\tau)\,d\tau        (16)
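The two-mode model of equations (15) and (16) lends itself to a simple forward simulation: at each time step, a level either processes exactly its arriving input (input-limited mode) or runs at the rate allowed by the CPU-saturation condition, whichever is smaller. The sketch below is my own (not from the paper); it works in the normalized units of the appendix (y = A_j x, \lambda = \rho S_j / A_j^2), and the choices of \lambda, step size, horizon, and number of levels are illustrative:

```python
import math

# Forward simulation of the leveled model (15)-(16), a sketch of my own,
# written in the normalized units of the appendix (y = A_j*x and
# lam = rho*S_j/A_j**2).  lam, the step size h, the horizon, and the
# number of levels are illustrative choices.
lam, h, steps, levels = 1.0, 1e-3, 4000, 4

Y = [0.0] * levels                    # cumulative integral of y_i up to t
sat = [0.0] + [None] * (levels - 1)   # saturation times tau_i (tau_0 = 0)
for k in range(steps):
    t = k * h
    rate = [0.0] * levels
    for i in range(levels):
        if i == 0:
            # level0 joins see non-limiting input: closed form, eq. (17)
            rate[i] = 1.0 / (2.0 * math.sqrt(lam * t + 1.0))
        else:
            arrival = 2.0 * lam * rate[i - 1] * Y[i - 1]   # a_i(t), eq. (16)
            cap = 1.0 / (2.0 * (1.0 + lam * Y[i]))         # CPU-limited rate
            if sat[i] is None and arrival >= cap:
                sat[i] = t                                 # tau_i reached
            rate[i] = min(arrival, cap)  # input-limited vs CPU-limited mode
    for i in range(levels):
        Y[i] += rate[i] * h

d1, d2 = sat[2] - sat[1], sat[3] - sat[2]
print(sat[1], d1, d2)  # startup phases of successive levels, evenly spaced
```

In this run the simulated saturation times come out (approximately) evenly spaced, consistent with equation (19) below and with the constants derived in the appendix.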

Solution

Appendix A shows how these equations can be solved explicitly for x_i(t). Here we want to make one remark about this evaluation: each level contains an input-limited and a CPU-limited phase. Solving each phase is easy; the hard part in this evaluation is finding out when and how these phases connect to each other. Level0 was solved in the previous section:

    x_0(t) = \frac{1}{2\sqrt{\rho S_j t + A_j^2}}        (17)

For i > 0, x_i(t) can be expressed recursively as:

    x_i(t) = a_i(t)             if 0 \le t < \tau_i
    x_i(t) = x_{i-1}(t - d)     if t \ge \tau_i        (18)

where a_i(t) is as defined in equation (16). Also, it can be derived that:

    \tau_i = \sigma + i d        (19)

where \sigma and d are constants that are proportional to n. In the previous section the termination time for level0 joins was derived:

    T_0 = 2 A_j n + \rho S_j n^2        (20)

It can be shown that each subsequent level terminates d time-units after its predecessor:

    T_i = T_0 + i d        (21)

Note that, although this solution is formulated recursively, it is well-defined, because it is initialized with x_0. This solution can be interpreted in the following way.

• Two phases can be discriminated in a join process: the startup phase, in which the join process does not saturate its processor, and the main phase, in which the join process is CPU-limited. A join process at leveli of a symmetric bushy join-tree reaches its main phase at time \tau_i. In a level0 join the startup phase has length 0 (so, there is no startup phase). As the values of \tau_i are proportional to n, the length of the startup phase scales with the number of tuples in the operands.

Figure 5: Modeled execution characteristics of the joins at different levels (level0 through level3) of a bushy join-tree, for operand sizes n = 1000 and n = 1500.

• The main phases of subsequent join levels are similar, apart from a translation d in time. This implies that the main phase of each subsequent level starts and ends d time-units after its predecessor.

• From this result, it can be concluded that the termination delay of subsequent join levels is proportional to n. Also, the response time of the entire query is proportional to n.

Examples

Figure 5 shows diagrams which plot w(t) against time for level0 through level3, for operand sizes of 1000 and 1500 tuples. Comparison of the diagrams in one column shows that the termination delay between subsequent levels is constant. Comparison of the two columns illustrates the fact that the termination delay between levels grows with the number of tuples in the operands. As a consequence, the response time of the entire query is also proportional to the number of tuples in the operands. Detailed calculations show that the termination delay between two successive levels is limited to about 1/5th of the execution time of a level0 join.
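The proportionality to n can be made concrete by translating the appendix constants back into operation units. The sketch below is my own (A_j and S_j are illustrative): it computes the normalized delay from the constant k of Theorem 3 in the appendix, and checks that the termination delay for n = 1500 is exactly 1.5 times that for n = 1000:

```python
# Back-translation of the appendix constants to operation units: with
# rho = 1/n and lam = rho*S/A**2, the normalized delay d_hat scales to
# d = d_hat / lam, so d is proportional to n.  A and S are illustrative.
A, S = 2.0, 1.0

def bisect(f, lo, hi, it=80):
    """Plain bisection root-finder (no external libraries needed)."""
    for _ in range(it):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# k is the only real root of 2k^3 - 6k^2 + 7k - 4 = 0 (Theorem 3).
k = bisect(lambda y: 2*y**3 - 6*y**2 + 7*y - 4, 1.0, 2.0)
delta_hat = k * k - 1                                # lam * delta
sigma_hat = (-3*k*k + 8*k - 4) / (2*k - 2) ** 2      # lam * sigma
d_hat = delta_hat - sigma_hat                        # normalized delay

def termination_delay(n):
    lam = (1.0 / n) * S / (A * A)
    return d_hat / lam

# d scales linearly with the operand size n.
assert abs(termination_delay(1500) / termination_delay(1000) - 1.5) < 1e-9
```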

5.2 Modeling general query trees

In the previous section, it was shown how the models for individual join operations can be connected to describe a join tree. Full analytic evaluation of the model is possible for symmetric bushy join-trees. Our experience in this field shows that full analytic evaluation is hard, if not impossible, for an arbitrary combination of operations. Currently, we study this problem; here an intuitive outline of the direction of this research is given. The description of the models for the selection, the join, and the unique operation showed that these operations differ in the way they process their operands and in the way in which they produce output. For example, the join operation processes its input at a decreasing rate if it operates in CPU-limited mode. If such a join operation is pipelined as consumer to an operation that produces output at an increasing rate (starting at a low rate), it is obvious that this join initially cannot fully exploit the available CPU power. Pipelining join operations to each other yields an example of this situation: the delay between subsequent levels in a symmetric bushy join tree is caused by the fact that the consumer join operation, which wants to process input at an initially high and then decreasing rate, gets its input in the opposite way, starting at a low rate and then increasing. If, on the other hand, a join is pipelined as consumer to an operation that produces output at a decreasing rate, the match between the producer and the consumer is likely to be much better, so that the join operation can fully exploit the available CPU power from the beginning. To test this hypothesis, we have studied the situation in which a join operation produces input for a unique operation. From the discussion above it is clear that this pipeline is expected to have a good match between the producer and the consumer. Some initial experiments had encouraging results: it was clear that the effective parallelism between the join operation and the unique operation is relatively high.

6 Summary and Future work

This paper studies the performance aspects of inter-operation parallelism in complex queries. The focus is on inter-operation pipelining, as this form of parallelism is hard to understand. It is argued that understanding the components of a pipeline, the individual relational operations, is essential to understand the behavior of the entire pipeline. An analytical model for operations in a dataflow system is developed. This model describes how a relational operation maps the rate at which its input is available onto the rate at which its output is produced. In the examples, it is shown that selections tend to produce output at a constant rate, a unique operation produces output at a decreasing rate, and a join operation produces output at an increasing rate. The models for individual operations can be linked together to describe a complex query. The model for a symmetric bushy join tree is described and fully evaluated. The results confirm the results of a former simulation study [WiA91b].

The work reported in this paper is continued in several directions. Firstly, as indicated in Section 5.2, general query trees are studied; this work should lead to an analytical cost model for queries that exploit horizontal inter-operation parallelism and inter-operation pipelining. Secondly, we try to incorporate intra-operator parallelism in the model. This can be done as follows: using hash-partitioning of the data, the work for one operation is evenly distributed over the participating processors. This can be modeled by adding more CPU capacity to an operation, which would change the central equation (2) into:

    x_j(t) = a_j(t)             if the operation process is input-limited
    x_j(t) meets w(t) = n       if the operation process is CPU-limited        (22)

in which n is the number of processors allocated to an operation process. The consequences of this change are being studied. Finally, we will validate the model presented in this paper against PRISMA/DB. The work in this direction was started recently.

Acknowledgement

The authors thank Peter Apers, Paul Grefen, and Martin Kersten for their useful suggestions on earlier drafts of this paper.

A Mathematical evaluation of the model.

The mathematical model of join queries uses the following integral equation:

    x(t) meets 2 A_j x(t) + 2 \rho S_j x(t) \int_0^t x(\tau)\,d\tau = 1        (23)

This equation contains the parameters A_j, S_j, and \rho. To reduce the number of parameters, and thus the complexity of the formalism, the following substitutions are made:

    y(t) = A_j x(t)        (24)

and

    \lambda = \frac{\rho S_j}{A_j^2}        (25)

These substitutions convert equation (23) into:

    y(t) meets 2 y(t) + 2 \lambda y(t) \int_0^t y(\tau)\,d\tau = 1        (26)

The entire appendix is written in terms of y(t) and \lambda. Using equations (24) and (25), the results of this appendix can easily be translated back. Let y_0 : \mathbb{R}^+ \to \mathbb{R} satisfy the integral equation

    2 y_0(t) + 2 \lambda y_0(t) \int_0^t y_0(\tau)\,d\tau = 1        (27)

Theorem 1  Equation (27) has a unique solution, given explicitly by

    y_0(t) = \frac{1}{2\sqrt{\lambda t + 1}}, \qquad t \ge 0        (28)

Proof. To solve equation (27), it is first converted into a differential equation using

    Y(t) = \int_0^t y_0(\tau)\,d\tau \implies \frac{dY(t)}{dt} = y_0(t)

Substitution into (27) yields:

    2 \frac{dY(t)}{dt} \{ 1 + \lambda Y(t) \} = 1

Separation of variables yields a general solution to this differential equation:

    Y(t) = \frac{\sqrt{\lambda (t - C) + 1} - 1}{\lambda}

in which C is an integration constant, to be set by a boundary condition on the solution of the differential equation. Differentiation of Y(t) yields y_0(t):

    y_0(t) = \frac{dY(t)}{dt} = \frac{1}{2\sqrt{\lambda (t - C) + 1}}

The boundary condition is implicitly given by equation (27): substitution of t = 0 in this equation yields y_0(0) = 1/2, and therefore C = 0. This observation concludes the proof of this theorem. □
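As an added check (not in the original), the explicit solution can also be verified by direct substitution into equation (27), writing Y(t) for the integral of y_0:

```latex
% With y_0(t) = \frac{1}{2\sqrt{\lambda t + 1}} we get
% Y(t) = \int_0^t y_0(\tau)\,d\tau = \frac{\sqrt{\lambda t + 1} - 1}{\lambda},
% so the left-hand side of (27) becomes
2\,y_0(t)\bigl(1 + \lambda Y(t)\bigr)
  = \frac{1}{\sqrt{\lambda t + 1}}\,\Bigl(1 + \sqrt{\lambda t + 1} - 1\Bigr)
  = \frac{\sqrt{\lambda t + 1}}{\sqrt{\lambda t + 1}}
  = 1 .
```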

Next, let the mapping a_1 : \mathbb{R}^+ \to \mathbb{R} be defined by

    a_1(t) = 2 \lambda y_0(t) \int_0^t y_0(\tau)\,d\tau        (29)

Lemma 1  There exist unique positive values \delta and \sigma, such that

    2 a_1(\delta) + 2 \lambda a_1(\delta) \int_0^\delta a_1(\tau)\,d\tau = 1        (30)

and

    y_0(\sigma) = a_1(\delta)        (31)

Proof. First, we observe from equation (27) that a_1(t) = 1 - 2 y_0(t). Because, as is obvious from the explicit solution (28), y_0(t) is strictly decreasing, a_1(t) is strictly increasing. Therefore the mapping

    t \mapsto 2 a_1(t) + 2 \lambda a_1(t) \int_0^t a_1(\tau)\,d\tau

is strictly increasing. Also, a_1(0) = 0, and a_1(\infty) = 1. Hence equation (30) uniquely determines \delta > 0. As y_0(0) = 1/2, a_1(0) = 0, y_0(\infty) = 0, and a_1(\infty) = 1, the monotonicity of both y_0 and a_1 implies that there exists a unique value of \sigma > 0 that satisfies equation (31). □
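The constants of Lemma 1 can be cross-checked numerically against the closed forms of Theorem 3 below. The following sketch is my own (with \lambda = 1 so that \sigma and \delta equal their normalized values); it root-finds \delta and \sigma directly from equations (30) and (31) and compares them with the closed-form expressions:

```python
import math

# Numeric cross-check (mine) of Lemma 1 against the closed forms of
# Theorem 3, using lam = 1 so sigma and delta equal their normalized values.
lam = 1.0

def y0(t):
    return 1.0 / (2.0 * math.sqrt(lam * t + 1.0))       # equation (28)

def a1(t):
    return 1.0 - 2.0 * y0(t)                            # a_1 = 1 - 2*y_0

def A1(t):
    """Integral of a_1 from 0 to t, in closed form."""
    return (math.sqrt(lam * t + 1.0) - 1.0) ** 2 / lam

def bisect(f, lo, hi, it=80):
    for _ in range(it):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# delta solves equation (30); sigma solves y_0(sigma) = a_1(delta), eq. (31).
delta = bisect(lambda t: 2*a1(t)*(1 + lam*A1(t)) - 1, 0.1, 5.0)
sigma = bisect(lambda t: y0(t) - a1(delta), 0.1, 5.0)

# Closed forms from Theorem 3, with k the real root of 2k^3-6k^2+7k-4 = 0.
k = bisect(lambda y: 2*y**3 - 6*y**2 + 7*y - 4, 1.0, 2.0)
assert abs(delta - (k*k - 1) / lam) < 1e-6
assert abs(sigma - (-3*k*k + 8*k - 4) / (lam * (2*k - 2)**2)) < 1e-6
```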

Let the sequence \{\tau_i\}_{i \in \mathbb{N}} be defined by \tau_i = \sigma + i d, where d = \delta - \sigma. Now, we can define the functions y_i, i \in \mathbb{N}^+, on \mathbb{R}^+ by

    y_i(t) = a_i(t)                                                                   if 0 \le t < \tau_i
    y_i(t) satisfies 2 y_i(t) + 2 \lambda y_i(t) \int_0^t y_i(\tau)\,d\tau = 1        if t \ge \tau_i        (32)

where

    a_i(t) = 2 \lambda y_{i-1}(t) \int_0^t y_{i-1}(\tau)\,d\tau        (33)

and y_0 is as defined in equation (27).

Theorem 2  The functions y_i, defined by equation (32), can be solved by the recursive expression:

    y_i(t) = a_i(t)             if 0 \le t < \tau_i
    y_i(t) = y_{i-1}(t - d)     if t \ge \tau_i        (34)

where a_i satisfies equation (33), and y_0 is defined by equation (27).

Proof. This theorem is proved via induction. In the initial step, i is set to 1. From the definition of \tau_i, we observe that \tau_0 = \sigma and \tau_1 = \delta. Hence, the definition of \sigma, \delta, and y_1 implies that

    y_0(\tau_0) = y_1(\tau_1)        (35)

From this result, the definition of \delta in equation (30), and equation (27), we conclude that

    2 a_1(\delta) + 2 \lambda a_1(\delta) \int_0^\delta a_1(\tau)\,d\tau = 1
        \implies 2 y_1(\tau_1) + 2 \lambda y_1(\tau_1) \int_0^{\tau_1} y_1(\tau)\,d\tau = 1

which, together with

    2 y_0(\tau_0) + 2 \lambda y_0(\tau_0) \int_0^{\tau_0} y_0(\tau)\,d\tau = 1

implies

    \int_0^{\tau_1} y_1(\tau)\,d\tau = \int_0^{\tau_0} y_0(\tau)\,d\tau        (36)

Equations (35) and (36) are used to derive the relationship between y_0 and y_1 on (\tau_1, \infty). For t > \tau_1, we have:

    2 y_1(t) + 2 \lambda y_1(t) \int_0^t y_1(\tau)\,d\tau = 1
    2 y_0(t - d) + 2 \lambda y_0(t - d) \int_0^{t-d} y_0(\tau)\,d\tau = 1

    \implies

    2 y_1(t) + 2 \lambda y_1(t) \{ \int_0^{\tau_1} y_1(\tau)\,d\tau + \int_{\tau_1}^t y_1(\tau)\,d\tau \} = 1
    2 y_0(t - d) + 2 \lambda y_0(t - d) \{ \int_0^{\tau_0} y_0(\tau)\,d\tau + \int_{\tau_0}^{t-d} y_0(\tau)\,d\tau \} = 1

    \implies

    2 y_1(t) + 2 \lambda y_1(t) \{ \int_0^{\tau_0} y_0(\tau)\,d\tau + \int_{\tau_1}^t y_1(\tau)\,d\tau \} = 1
    2 y_0(t - d) + 2 \lambda y_0(t - d) \{ \int_0^{\tau_0} y_0(\tau)\,d\tau + \int_{\tau_1}^t y_0(\tau - d)\,d\tau \} = 1        (37)

These last two equations imply that \lim_{t \downarrow \tau_1} y_1(t) = \lim_{t \downarrow \tau_1} y_0(t - d), and therefore these two equations each have a unique solution that satisfies the initial condition, so it can be concluded that

    y_1(t) = y_0(t - d), \qquad t > \tau_1        (38)

The observation that y_1(t) = a_1(t) on [0, \tau_1] by definition concludes the initial step of the proof of Theorem 2. For the induction step, we assume that

    y_i(t) = a_i(t)             if 0 \le t < \tau_i
    y_i(t) = y_{i-1}(t - d)     if t \ge \tau_i        (39)

Moreover, it is assumed that

    \int_0^{\tau_i} y_i(\tau)\,d\tau = \int_0^{\tau_{i-1}} y_{i-1}(\tau)\,d\tau        (40)

Note that this last equation holds for i = 1 (see equation (36)). First, we observe that

    y_{i+1}(t) = a_{i+1}(t), \qquad 0 \le t \le \tau_{i+1}

by definition. Next, using equations (39) and (40), and the definitions of a_{i+1}(t) and \tau_{i+1}, we observe that for \tau_i \le t \le \tau_{i+1}:

    a_{i+1}(t) = 2 \lambda y_i(t) \int_0^t y_i(\tau)\,d\tau
               = 2 \lambda y_i(t) \{ \int_0^{\tau_i} y_i(\tau)\,d\tau + \int_{\tau_i}^t y_i(\tau)\,d\tau \}
               = 2 \lambda y_{i-1}(t - d) \{ \int_0^{\tau_{i-1}} y_{i-1}(\tau)\,d\tau + \int_{\tau_i}^t y_{i-1}(\tau - d)\,d\tau \}
               = 2 \lambda y_{i-1}(t - d) \{ \int_0^{\tau_{i-1}} y_{i-1}(\tau)\,d\tau + \int_{\tau_{i-1}}^{t-d} y_{i-1}(\tau)\,d\tau \}
               = 2 \lambda y_{i-1}(t - d) \int_0^{t-d} y_{i-1}(\tau)\,d\tau = a_i(t - d)

This result implies that

    y_{i+1}(\tau_{i+1}) = y_i(\tau_i)        (41)

The analogue of equation (40) for i + 1 is derived as follows:

    \int_0^{\tau_{i+1}} y_{i+1}(\tau)\,d\tau
        = \int_0^{\tau_{i+1}} a_{i+1}(\tau)\,d\tau
        = \int_0^{\tau_i} a_{i+1}(\tau)\,d\tau + \int_{\tau_i}^{\tau_{i+1}} a_{i+1}(\tau)\,d\tau
        = 2 \lambda \int_0^{\tau_i} y_i(\tau) \int_0^\tau y_i(t)\,dt\,d\tau + \int_{\tau_{i-1}}^{\tau_i} a_i(\tau)\,d\tau
        = \lambda \{ \int_0^{\tau_i} y_i(\tau)\,d\tau \}^2 + \int_{\tau_{i-1}}^{\tau_i} a_i(\tau)\,d\tau
        = \lambda \{ \int_0^{\tau_{i-1}} y_{i-1}(\tau)\,d\tau \}^2 + \int_{\tau_{i-1}}^{\tau_i} a_i(\tau)\,d\tau
        = \int_0^{\tau_{i-1}} a_i(\tau)\,d\tau + \int_{\tau_{i-1}}^{\tau_i} a_i(\tau)\,d\tau
        = \int_0^{\tau_i} a_i(\tau)\,d\tau = \int_0^{\tau_i} y_i(\tau)\,d\tau

Using this result and equation (41), we can prove, analogously to the derivation of (38) in the initial step, that:

    y_{i+1}(t) = y_i(t - d), \qquad t > \tau_{i+1}        (42)

which concludes the proof of the theorem. □

Lemma 2  The solutions y_i(t), given in equation (34), are continuous in \tau_i.

Proof. This lemma is proved by induction too. The continuity of y_1 in \tau_1 can be derived as follows. Equation (30) implies

    y_1(\tau_1) = a_1(\delta) = \frac{1}{2 + 2 \lambda \int_0^\delta a_1(\tau)\,d\tau} = \frac{1}{2 + 2 \lambda \int_0^\sigma y_0(\tau)\,d\tau}

and equation (37) implies

    \lim_{t \downarrow \tau_1} y_1(t) = \frac{1}{2 + 2 \lambda \int_0^\sigma y_0(\tau)\,d\tau}

so y_1(t) is continuous in \tau_1. Given the continuity of y_i(t) in \tau_i, the fact that

    y_{i+1}(t) = a_{i+1}(t) = a_i(t - d) = y_i(t - d), \qquad \tau_i \le t \le \tau_{i+1}
    y_{i+1}(t) = y_i(t - d), \qquad t > \tau_{i+1}

implies the continuity of y_{i+1} in \tau_{i+1}. □

Theorem 3  The values \sigma and \delta, as defined in Lemma 1, are proportional to 1/\lambda.

Proof. Straightforward solution of equation (30) for \delta yields

    \delta = \frac{k^2 - 1}{\lambda}

in which k is the only real root of 2y^3 - 6y^2 + 7y - 4 = 0. Solution of equation (31) for \sigma yields

    \sigma = \frac{-3k^2 + 8k - 4}{\lambda (2k - 2)^2}

□

References

[Ame91] P. America, ed., Proceedings of the PRISMA Workshop on Parallel Database Systems, Springer-Verlag, New York-Heidelberg-Berlin, 1991.
[ABF92] P. M. G. Apers, C. A. van den Berg, J. Flokstra, P. W. P. J. Grefen, M. L. Kersten & A. N. Wilschut, "PRISMA/DB: A Parallel Main-Memory Relational DBMS," to appear in IEEE Transactions on Knowledge and Data Engineering (December 1992).
[BAC90] H. Boral, W. Alexander, L. Clay, G. Copeland, S. Danforth, M. Franklin, B. Hart, M. Smith & P. Valduriez, "Prototyping Bubba, A Highly Parallel Database System," IEEE Transactions on Knowledge and Data Engineering 2 (1990), 4-24.
[BrG89] K. Bratbergsengen & T. Gjelsvik, "The Development of the CROSS8 and HC16-186 (Database) Computers," in Proceedings of the Sixth International Workshop on Database Machines, Deauville, France, June 1989, 359-372.
[CGG88] B. W. Char, K. O. Geddes, G. H. Gonnet, M. B. Monagan & S. M. Watt, Maple Reference Manual, WATCOM Publications Limited, Waterloo, Canada, 1988.
[DGS90] D. J. DeWitt, S. Ghandeharizadeh, D. A. Schneider, A. Bricker, H. Hsiao & R. Rasmussen, "The GAMMA Database Machine Project," IEEE Transactions on Knowledge and Data Engineering 2 (March 1990), 44-62.
[DeG90] D. J. DeWitt & J. Gray, "Parallel Database Systems: The Future of Database Processing or a Passing Fad?," SIGMOD Record 19 (1990), 104-112.
[Gra90] G. Graefe, "Encapsulation of Parallelism in the Volcano Query Processing System," in Proceedings of the ACM-SIGMOD 1990 International Conference on Management of Data, Atlantic City, NJ, May 23-25, 1990, 102-111.
[GWF91] P. W. P. J. Grefen, A. N. Wilschut & J. Flokstra, "PRISMA/DB1 User Manual," Memorandum INF91-06, Universiteit Twente, Enschede, The Netherlands, 1991.
[HoS91] W. Hong & M. Stonebraker, "Optimization of Parallel Query Execution Plans in XPRS," in Proceedings of the First International Conference on Parallel and Distributed Information Systems, Miami Beach, Florida, USA, December 1991.
[Oku80] A. Okubo, Diffusion and Ecological Problems: Mathematical Models, Springer-Verlag, New York-Heidelberg-Berlin, 1980.
[ScD90] D. A. Schneider & D. J. DeWitt, "Tradeoffs in Processing Complex Join Queries via Hashing in Multiprocessor Database Machines," in Proceedings of the Sixteenth International Conference on Very Large Data Bases, Brisbane, Australia, August 13-16, 1990, 469-480.
[Ter83] Teradata Corporation, "DBC/1012 Database Computer Concepts and Facilities," C02-0001-00, 1983.
[WiA91a] A. N. Wilschut & P. M. G. Apers, "Parallel Execution of Multi-join Queries," Memorandum INF91-11, Universiteit Twente, Enschede, The Netherlands, 1991.
[WiA91b] A. N. Wilschut & P. M. G. Apers, "Dataflow Query Execution in a Parallel Main-Memory Environment," in Proceedings of the First International Conference on Parallel and Distributed Information Systems, Miami Beach, Florida, USA, December 1991.
[WiA92] A. N. Wilschut & P. M. G. Apers, "Dataflow Query Execution in a Parallel Main-Memory Environment," to appear in Journal of Distributed and Parallel Databases (1992).
[WiA90] A. N. Wilschut & P. M. G. Apers, "Pipelining in Query Execution," in Proceedings of the International Conference on Databases, Parallel Architectures and their Applications, Miami, USA, March 1990.
[WAF91] A. N. Wilschut, P. M. G. Apers & J. Flokstra, "Parallel Query Execution in PRISMA/DB," in Proceedings of the PRISMA Workshop on Parallel Database Systems, Noordwijk, The Netherlands, September 1990, P. America, ed., Springer-Verlag, New York-Heidelberg-Berlin, 1991.
[WiD87] A. N. Wilschut & P. G. Doucet, "Theoretical Studies on Animal Orientation: A Model for Kinesis," Journal of Theoretical Biology 127 (1987), 111-125.
[WFA92] A. N. Wilschut, J. Flokstra & P. M. G. Apers, "Parallelism in a Main-Memory System: The Performance of PRISMA/DB," in Proceedings of the 18th International Conference on Very Large Data Bases, Vancouver, Canada, August 23-27, 1992.
