On a Three-Way Hash Join Algorithm

Vasilis Samoladas    Daniel P. Miranker
Department of Computer Sciences
The University of Texas at Austin
Taylor Hall 2.124, Austin, TX 78712-1188
{vsam, [email protected]
Tel: (512) 471-9541
Preliminary paper number: AR 272

Abstract
We develop hash-based algorithms for computing a three-way join. The method involves hashing all three relations into buckets, and then joining buckets in main memory, three buckets at a time. Compared to two cascaded hash joins, the algorithms avoid materializing an intermediate result. We present a cost model for this approach, from which we identify the range of query parameters that benefit from our technique. We also validate our analysis with experimental results, comparing our approach to performing two hybrid-hash joins. This approach is almost always preferable to computing two consecutive GRACE hash joins, and in many cases it is also preferable to two consecutive hybrid-hash joins.
1 Introduction

Decision support is an emerging area of database processing that has created new demands on database systems. Traditional OLTP systems are built around the assumption that they will be called to execute a sequence of queries, arriving at a high rate, but where each query will only touch a small portion of the database. Thus, the processing in these systems is heavily based on index mechanisms, so that for each arriving query the relevant data can be located quickly. In decision support applications, the system is asked to evaluate queries that touch almost all of the database. The execution of each of these queries is expected to be very long, and to require large amounts of I/O. Most of the execution time is spent in computing joins between relations that are often much larger than main memory. From now on, we use the term join to refer to the equi-join, which is by far the most frequent kind of join in real-world applications.

It has been shown that when join computation needs to touch most of the pages of each relation, join algorithms based on indices are not very efficient. For this problem, the best techniques employ hashing. There is a very large number of contributions in this area. The algorithm of choice is the hybrid-hash join algorithm, proposed by Shapiro [9]. It is shown to perform very well for joining large relations, and at the same time to make efficient use of any amount of main memory in the system.

The drawback of hash-based join techniques is that they are not amenable to pipelining. Pipelining is a technique that avoids materializing the intermediate results of a sequence of join operators. If a 3-way join, $(R \bowtie_A S) \bowtie_B T$, is executed by two hybrid-hash joins, the intermediate result $R \bowtie_A S$ will be materialized on disk. Assume that attribute A is a primary key in R and a foreign key in S, in a 1 : n dependency, where n is large. Every tuple of R will join with n tuples of S, and the records will materialize on disk as an intermediate result. Note that n is a multiplicative factor on the size of the intermediate result and may seriously impact the I/O requirements of the query. We claim that the repetition of values from R tuples is a hidden, or implicit, replication of data.

In this paper, we present an algorithm that performs a 3-way join. The algorithm first scans all three relations and fills buckets, similar to 2-way hash-join algorithms. Then these buckets are joined in groups of three, one bucket from each relation. A complete result is assured by replicating,
when necessary, tuples in the third relation and placing them into multiple buckets. Thus, the loading of buckets for the third relation is a data-dependent many-to-many mapping. Our method avoids materializing an intermediate result by virtue of the explicit replication of data in a preprocessing phase. A key concern is the I/O impact of this explicit replication relative to the I/O entailed by implicit replication. We develop a cost function for our algorithm and compare its performance to hybrid-hash join. We show, analytically, that our method always outperforms two cascaded GRACE hash joins (Section 3). Since hybrid-hash joins reduce to GRACE hash joins when the relations are much bigger than main memory, our result carries over to these cases. We then compare empirically the performance of our algorithm, and of an improvement of it, against two cascaded hybrid-hash joins (Section 5). The evaluation is based on a number of cases derived from the TPC-D benchmark specification. The experiments show our algorithms to be better over a broad range of parameters.
2 A 3-way Join Algorithm

Replication Hash Join (RH) is a 3-way equi-join algorithm. We first describe the algorithm, and then develop a cost function modelling the I/O required by this algorithm. We use the following notation in this paper:

- Letters $R, S, T, \ldots$ denote relations. Letters $A, B, C, \ldots$ denote attributes.
- $\bowtie_A$ is the natural join operator on attribute A.
- $\|R\|$ is the number of tuples in relation R.
- $|R|$ is the number of disk pages that relation R occupies when materialized to disk.
- $[X]$ is the per-tuple space of a tuple of schema X. Space is measured in disk pages; thus, for any relation R that has schema X,
$$[X] = \lim_{\|R\| \to \infty} \frac{|R|}{\|R\|}$$
- R[X] denotes the projection of relation R on X, and t[X] denotes the projection of tuple t on X.
2.1 Description
Assume we have to compute the following 3-way join:
$$R \bowtie_A S \bowtie_A T \qquad (1)$$
Notice that the join attribute is the same in both joins, i.e. this is a star query. We could compute this join in the following way: for an appropriate number n, hash the tuples of each relation into a set of n buckets (one set of buckets per relation), based on attribute A. Then, load corresponding triplets of buckets into main memory (assume they always fit), and use a main memory algorithm to compute the joins. Ignoring any CPU costs, the cost of this operation is approximately
$$(|R| + |S| + |T|)(2c_{read} + c_{write}) \qquad (2)$$
This is quite low, considering that the relations may be many times bigger than main memory. We could also compute the result by two cascaded GRACE hash joins. If R and S are joined first, the approximate cost is
$$(|R| + |S| + |T|)(2c_{read} + c_{write}) + |R \bowtie_A S|(c_{read} + c_{write}) \qquad (3)$$
We see that the first approach is better in this case. Observe that the amount of extra I/O performed by the cascaded GRACE hash joins is precisely the amount of I/O involved in materializing the intermediate result $R \bowtie_A S$.

So far, this approach is of restricted application, because it requires both joins to be on the same attribute. The results in this paper are based on the observation that an arbitrary 3-way join can be transformed into a star query by applying another join operation. Assume that we want to compute
$$Q_1 = R \bowtie_A S \bowtie_B T \qquad (4)$$
We define table $T'$ as
$$T' = \pi_{AB}(S) \bowtie_B T \qquad (5)$$
Then, we can compute $Q_1$ by the following expression:
$$Q_1 = R \bowtie_A S \bowtie_{AB} T' \qquad (6)$$
which is a star query, with an additional join attribute in the second join.
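To make the bucket-oriented star-query strategy concrete, the following is a minimal sketch in Python (our illustration, not the paper's implementation), assuming relations are in-memory lists of dicts, that Python's built-in hash stands in for the partitioning hash function, and that bucket triplets fit in memory; all disk I/O is elided.

# A minimal sketch of the star-query strategy: hash all three relations on
# the common attribute, then join corresponding bucket triplets in memory.
# Relations are lists of dicts; bucket I/O is elided for brevity.

def partition(rel, attr, n):
    buckets = [[] for _ in range(n)]
    for t in rel:
        buckets[hash(t[attr]) % n].append(t)
    return buckets

def star_join(R, S, T, attr, n=16):
    RB, SB, TB = partition(R, attr, n), partition(S, attr, n), partition(T, attr, n)
    out = []
    for Ri, Si, Ti in zip(RB, SB, TB):      # load one triplet at a time
        index = {}                          # hash Si on the join attribute
        for s in Si:
            index.setdefault(s[attr], []).append(s)
        for r in Ri:                        # probe with Ri, then with Ti
            for s in index.get(r[attr], []):
                for t in Ti:
                    if t[attr] == r[attr]:
                        out.append({**r, **s, **t})
    return out

For instance, star_join(R, S, T, 'A') computes $R \bowtie_A S \bowtie_A T$ for relations whose tuples carry an 'A' field; Section 2.5 describes a better in-memory join for each triplet.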
[Figure 1 appears here: an operator tree with selections $\sigma_R$, $\sigma_S$, $\sigma_T$ and projections $\pi_R$, $\pi_S$, $\pi_T$ applied to the base relations R, S, T; R and S are joined on A, a selection $\sigma_J$ and projection $\pi_J$ are applied to that result, and it is finally joined with T on B.]
Figure 1: A complete SPJ tree for three relations.

The above operation does not directly provide an interesting solution, because the computation of $T'$ requires another join. For the moment, though, let us put aside the computation of $T'$ and see what $T'$ will contain. First, observe that tuples of T that don't join with S do not appear in $T'$. This is good, but not unique, since the same effect can be achieved by bit-filter techniques for hybrid hash joins. Second, tuples of T are potentially replicated more than once in $T'$. Thus, although we can compute the transformed 3-way join efficiently, it is a (potentially) larger problem than the initial one.
2.2 Replication Hash Join Algorithm
Assume that R, S and T are three disk-resident relations. The family of Select-Project-Join (SPJ) relational expressions with respect to R, S and T can be captured, up to a permutation of the relations, by the relational algebra expression depicted in Fig. 1. We now present an algorithm that can compute the result of such expressions, when the input relations conform to certain restrictions. Some of the notation used in the algorithm corresponds to Fig. 1.

Algorithm 1 RH(R, A, S, B, T, n)
1. Scan R. For each tuple t that satisfies $\sigma_R$, place $\pi_R(t)$ in one of the buckets $R_1, \ldots, R_n$, based on t[A].
2. Allocate space in main memory for a table L.
3. Scan S. For each tuple t that satisfies $\sigma_S$, do the following:
   (a) Hash $\pi_S(t)$ into one of the buckets $S_1, \ldots, S_n$, based on t[A]. Let $S_b$ be that bucket.
   (b) Add the pair $(h(t[B]), b)$ to table L, where $h(t[B])$ is a hash value for t[B]. If L overflows, continue it on disk.
4. Call LOCALIZE(T, L). The operation of this step is called localization and is described below. This step produces n buckets, $T_1, \ldots, T_n$, of tuples of T that satisfy $\sigma_T$, projected as required by $\pi_T$.
5. For $i = 1, \ldots, n$, compute
$$\pi_J(\sigma_J(R_i \bowtie_A S_i)) \bowtie_B T_i$$
where $\sigma_J$ and $\pi_J$ refer to Fig. 1.
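The partitioning phase (steps 1-3) can be sketched as follows, continuing the in-memory representation used above; selections and projections are assumed to have been applied already, and spilling of L to disk is elided. Using a set for L gives the duplicate elimination that LOCALIZE obtains by sorting.

# A minimal sketch of steps 1-3 of Algorithm 1 (RH). Relations are lists of
# dicts; selections/projections are assumed already applied.

def rh_partition(R, S, A, B, n):
    """Hash R and S on attribute A; build the localization table L from S."""
    R_buckets = [[] for _ in range(n)]
    S_buckets = [[] for _ in range(n)]
    L = set()                               # pairs (h(t[B]), bucket index)
    for t in R:
        R_buckets[hash(t[A]) % n].append(t)     # step 1
    for t in S:
        b = hash(t[A]) % n                      # step 3(a)
        S_buckets[b].append(t)
        L.add((hash(t[B]), b))                  # step 3(b)
    return R_buckets, S_buckets, L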
Step 5 of the RH algorithm is intended to be performed by a 3-way join algorithm in main memory, such as TreeTracker [7]. The key component of RH is the LOCALIZE routine, which is responsible for creating the buckets $T_i$. The LOCALIZE routine operates on a relation stored on disk, and on table L. Table L is known as a localization table. L may or may not fit into main memory. The case where L fits into main memory is presented first.

Algorithm 2 LOCALIZE-FIT(T, L)
Comment: version of LOCALIZE where L fits in main memory.
1. Sort L, removing any duplicates.
2. Scan T. For every tuple t of T that satisfies $\sigma_T$, do:
   (a) Look up $h(t[B])$ in L. If no such entry is found, discard t. Otherwise, for each entry $(h(t[B]), b)$ of table L, place t into $T_b$.
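A corresponding sketch of LOCALIZE-FIT, under the same assumptions; here L is regrouped into a dict from hash values to bucket indices, the in-memory analogue of the sorted, duplicate-free table.

# A sketch of LOCALIZE-FIT: replicate each surviving tuple of T into every
# bucket recorded for its hash value; disk I/O is again elided.

def localize_fit(T, L_pairs, B, n):
    L = {}
    for hv, b in set(L_pairs):              # duplicate removal, as in step 1
        L.setdefault(hv, []).append(b)
    T_buckets = [[] for _ in range(n)]
    for t in T:                             # step 2: one scan of T
        for b in L.get(hash(t[B]), []):     # discard t if no entry matches
            T_buckets[b].append(t)          # replicate t into every bucket
    return T_buckets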
Algorithm 2 will indeed produce a number of buckets with tuples of T, as specified in step 4 of the RH algorithm. However, some tuples are replicated into more than one bucket. This is because bucket $T_i$ must hold all tuples that join with some tuple in bucket $S_i$. It is this replication that gives the algorithm its name. Before we proceed to calculate the cost of this algorithm, we present the version of the LOCALIZE procedure for when L does not fit into main memory.
Algorithm 3 LOCALIZE-NOTFIT(T, L)
Comment: version of LOCALIZE where L does not fit in main memory.
1. Sort L on $h(t[B])$, removing any duplicates, using some external sorting algorithm. Let L be seen as segmented into m buckets $B_1, \ldots, B_m$ (as big as possible), such that each of them fits in main memory. Pairs with the same $h(t[B])$ value must all be in one bucket; thus the buckets $B_i$ partition the range of values of $h(t[B])$.
2. Load $B_1$ into main memory. Initialize a new set of $m - 1$ buckets, $U_2, \ldots, U_m$.
3. Scan T. For each tuple t that satisfies $\sigma_T$, do the following:
   (a) If $h(t[B])$ lies in $B_1$, then for each entry $(h(t[B]), b)$ in $B_1$, emit $\pi_T(t)$ to bucket $T_b$.
   (b) If $h(t[B])$ lies in $B_k$, $k > 1$, then emit $\pi_T(t)$ to bucket $U_k$.
4. For $i = 2, \ldots, m$ do:
   (a) Load $B_i$ into main memory.
   (b) Scan $U_i$. For each tuple t, and for each entry $(h(t[B]), b)$ in $B_i$, emit t to $T_b$.
2.3 Cost of RH
We shall make some simplifying assumptions for the cost estimation, in the spirit of Shapiro [9]. First, we assume that our data does not contain any skew. Also, we assume that the hash function $h(t[B])$ is a perfect hash function. In practice, we can always achieve low collision ratios (say, less than 1%) by expanding the number of bits per hash value. We also assume that the produced buckets for the three relations always fit into main memory. In fact, for each triplet of corresponding buckets, we only require two of them to fit into main memory.
In terms of cost measures, we only compute I/O costs, and we model I/O by assigning a constant cost of 1 to every disk page read and write. Also, we do not include the cost of the initial scanning of the relations, since it does not offer any information and would only complicate the formulae. We now compute the cost for the case where L fits in main memory. The cost of RH is given by the following formula:
$$C_{RH} = \underbrace{|\pi_R(R)|}_{\text{hash } R} + \underbrace{|\pi_S(S)|}_{\text{hash } S} + \underbrace{\alpha|\pi_T(T)|}_{\text{localize } T} + \underbrace{|\pi_R(R)| + |\pi_S(S)| + \alpha|\pi_T(T)|}_{\text{read produced buckets}}$$
where $\alpha$ above is the replication factor, defined as
$$\alpha = \frac{\sum_{i=1}^{n} |T_i|}{|\pi_T(T)|} \qquad (7)$$
Notice that since we do not include the cost of the original read of the data, we can drop the projection operators from the expression. Thus,
$$C_{RH} = 2(|R| + |S| + \alpha|T|) \qquad (8)$$
When the localization table does not fit into main memory, LOCALIZE-NOTFIT is called. This routine performs a computation similar to hybrid hash join, between the localization table and relation T. Its cost is
$$C_{NOTFIT} = 4|L| + 2\left(1 - \frac{M}{|L|}\right)|T| \qquad (9)$$
where $|L|$ is the size of the localization table and M is the size of main memory. The I/O cost of sorting L is estimated to be $4|L|$, to reflect the cost of bucket sort merge [11]. This cost should be added to $C_{RH}$ to obtain the total cost.
2.4 Estimation of $\alpha$
The replication factor $\alpha$ is the ratio of the total disk space consumed by the buckets of the localized relation T, over the disk space of the relation. Its exact value is of course data-dependent, but we can compute statistical approximations of it, using typical database estimations. First, we assume that all tuples have approximately the same size, thus
$$\alpha = \frac{\sum_{i=1}^{n} \|T_i\|}{\|T\|} \qquad (10)$$
First, we must estimate the number of tuples of T that actually join with S, i.e. the size of the semi-join. This is because during the localization stage, tuples of T that do not have matching entries in the localization table get discarded. Assuming a perfect hash function $h(t[B])$, i.e. no collisions, we see that only those tuples of T that join with S are localized. In practice, there will be collisions, but if we require that the hash values be uniformly distributed in a large enough range, we can achieve collisions in less than, say, 1% of the tuples. In practice, a length of $6 + \log_2 \max\{B(S), B(T)\}$ bits for values of $h(t[B])$ is enough to achieve this, where $B(S)$ (resp. $B(T)$) is the number of distinct values that attribute B, the join attribute, takes in relation S (resp. T). We find the size of the semijoin to be
$$\|T'\| = \|T\| \min\left\{1, \frac{B(S)}{B(T)}\right\} \qquad (11)$$
If we assume an infinite number of buckets, then the average replication $\hat\alpha$ of a localized tuple could be computed by
$$\hat\alpha = \frac{\|S \bowtie_B T\|}{\|T\| \min\{1, B(S)/B(T)\}} = \frac{\|S\|}{B(S)} \qquad (12)$$
where $\|S \bowtie_B T\|$ is estimated as
$$\|S \bowtie_B T\| = \frac{\|S\| \|T\|}{\max\{B(S), B(T)\}} \qquad (13)$$
However, if the number of buckets is not infinite, but n, no tuple will be replicated more than n times. In fact, we estimate the average replication per localized tuple to be¹
$$n\left(1 - \left(1 - \frac{1}{n}\right)^{\hat\alpha}\right) \qquad (14)$$
and thus the replication factor becomes
$$\alpha = \alpha(n) = \frac{\sum_{i=1}^{n} \|T_i\|}{\|T\|} = n\left(1 - \left(1 - \frac{1}{n}\right)^{\|S\|/B(S)}\right) \min\left\{1, \frac{B(S)}{B(T)}\right\} \qquad (15)$$
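As a hypothetical worked example (the numbers are ours, purely illustrative, not from the paper's experiments): let $\|S\| = 10^6$, $B(S) = 10^5$ and $B(T) = 2 \times 10^5$, so that $\hat\alpha = \|S\|/B(S) = 10$ and $\min\{1, B(S)/B(T)\} = 0.5$. With $n = 50$ buckets, Eq. 15 gives
$$\alpha(50) = 50\left(1 - \left(1 - \tfrac{1}{50}\right)^{10}\right) \times 0.5 \approx 50 \times 0.183 \times 0.5 \approx 4.6,$$
noticeably below the infinite-bucket value $\hat\alpha \min\{1, B(S)/B(T)\} = 5$, since tuples whose matching S tuples hash to the same bucket are stored in that bucket only once.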
2.5 Joining the buckets in main memory
Good techniques for joining data in main memory are beyond the scope of this paper. However, we will describe a simple idea that was actually implemented in the experiments: for each triplet of related buckets, choose the smallest two and hash them into two main-memory tables, on their join attributes. Then scan the third bucket, and for each of its tuples perform lookups into these tables. This is a simple but fast way to compute the join, and it gives us a simple but useful constraint on the size (and thus the number) of the buckets. The above algorithm also gives us an estimate of the CPU cost of the join processing, which becomes comparable to that of two hybrid hash joins. However, to any CPU cost for this processing we must add the cost of sorting the localization table. This cost should be negligible compared to the overall execution time.
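A minimal sketch of this triplet join follows; for brevity it always builds the hash tables on $R_i$ (keyed on A) and $T_i$ (keyed on B) and probes with $S_i$, whereas choosing the two smallest buckets as the build inputs, as described above, is a straightforward variation.

# A sketch of the in-memory triplet join of Section 2.5: build hash tables
# on two buckets and probe with the third.

def join_triplet(Ri, Si, Ti, A, B):
    r_tab, t_tab = {}, {}
    for r in Ri:
        r_tab.setdefault(r[A], []).append(r)
    for t in Ti:
        t_tab.setdefault(t[B], []).append(t)
    out = []
    for s in Si:                        # probe with the third bucket
        for r in r_tab.get(s[A], []):
            for t in t_tab.get(s[B], []):
                out.append({**r, **s, **t})
    return out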
2.6 Computing the number of buckets
The number of buckets to be produced per relation enters indirectly into the cost formula, by affecting the replication factor $\alpha$. In general, fewer buckets imply lower replication and thus lower cost. There is no useful algebraic formula for this number. However, depending on the choice of the main memory join algorithm, we can get a lower bound for this number, say $n_{min}$. Then, we can use iterative approximation to compute n. For the experiments, we used the following algorithm:

Algorithm 4 Let M be the size of the main memory to be used in the main memory join.
1. Compute $n_{min}$ by assuming no replication:
$$n_{min} = \left\lceil \frac{\min\{|R| + |S|,\ |R| + |T|,\ |S| + |T|\}}{M} \right\rceil \qquad (16)$$
2. $n := n_{min}$
3. Do forever:
   (a) Compute $\alpha(n)$.
   (b) Let $s := \min\{|R| + |S|,\ |S| + \alpha(n)|T|,\ |R| + \alpha(n)|T|\}$.
   (c) If $s/n \le M$, return n.
   (d) Otherwise, $n := \lceil s/M \rceil$.
This loop converges in only a few iterations (2 or 3 in most cases).

¹ Because $n(1 - (1 - 1/n)^x) \le x$ (for $x \ge 1$), this is a conservative approximation. It is relatively accurate for small variations of the tuple fan-in. In general, though, this is an upper bound.
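The iteration of Algorithm 4, combined with the estimate of Eq. 15, can be sketched as follows; this is a sketch under the stated uniformity assumptions, and the parameter names are ours.

import math

# Inputs: page counts (pR, pS, pT), memory size M in pages, and the
# statistics normS = ||S||, BS = B(S), BT = B(T).

def alpha(n, normS, BS, BT):
    """Replication factor estimate of Eq. 15."""
    fanout = normS / BS                          # alpha-hat, Eq. 12
    return n * (1.0 - (1.0 - 1.0 / n) ** fanout) * min(1.0, BS / BT)

def choose_n(pR, pS, pT, M, normS, BS, BT):
    n = math.ceil(min(pR + pS, pR + pT, pS + pT) / M)   # step 1: no replication
    while True:
        a = alpha(n, normS, BS, BT)                     # step 3(a)
        s = min(pR + pS, pS + a * pT, pR + a * pT)      # step 3(b)
        if s / n <= M:                                  # step 3(c)
            return n
        n = math.ceil(s / M)                            # step 3(d)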
3 Comparison with Hybrid Hash Join

Hybrid hash join (HHJ) has been known to combine the benefits of GRACE hash join for large relations, and of Simple hash join when a relation fits in main memory [9, 5, 4]. It works very well for a wide range of join parameters. It is relatively simple to implement, and takes full advantage of the presence of a large main memory. In this section, we develop cost formulae for a cascaded hybrid hash 3-way join, and compare them with the cost of RH. Assume that we have determined the join ordering to be $(R \bowtie S) \bowtie T$, and let J stand for the intermediate result:
$$J = R \bowtie_A S \qquad (17)$$
As in the previous section, we will not include the cost of the original read of the relations, and thus we can ignore projections. From Shapiro [9], after some manipulations, we get
$$C_{HH} = \underbrace{2(1 - q_1)(|R| + |S|)}_{\text{first join}} + \underbrace{|J|}_{\text{write intermediate result}} + \underbrace{|J| + 2(1 - q_2)(|J| + |T|)}_{\text{second join}}$$
where $q_1$ and $q_2$ are defined as
$$q_1 = \max\left\{0, \frac{M}{\min\{|R|, |S|\}}\right\}, \qquad q_2 = \max\left\{0, \frac{M}{\min\{|J|, |T|\}}\right\} \qquad (18)$$
The above cost can be improved if the second hybrid hash join can accept its input J "pre-bucketed", as it is produced. With this slight modification, which big systems would reasonably support, the cost becomes
$$C'_{HH} = 2(1 - q_1)(|R| + |S|) + 2|J| + 2(1 - q_2)|T| \qquad (19)$$
i.e. an extraneous $2(1 - q_2)|J|$ was saved. From now on, we will refer to the original algorithm as naive HH, and we will use the term HH for the modified variant.
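For reference, these cost formulae transcribe directly into code (the helper names are ours; inputs are page counts and, for RH, the replication factor of Eq. 15; the formulas assume M smaller than the build inputs, so the q factors stay below 1).

# Direct transcriptions of Eqs. 8, 18 and 19 as Python helpers.

def cost_naive_hh(pR, pS, pT, pJ, M):
    q1 = max(0.0, M / min(pR, pS))
    q2 = max(0.0, M / min(pJ, pT))
    return 2*(1 - q1)*(pR + pS) + 2*pJ + 2*(1 - q2)*(pJ + pT)

def cost_hh(pR, pS, pT, pJ, M):          # intermediate result pre-bucketed
    q1 = max(0.0, M / min(pR, pS))
    q2 = max(0.0, M / min(pJ, pT))
    return 2*(1 - q1)*(pR + pS) + 2*pJ + 2*(1 - q2)*pT

def cost_rh(pR, pS, pT, a):              # Eq. 8, replication factor a
    return 2*(pR + pS + a*pT)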
3.1 Comparison
What follows is a comparison of RH and HH in the extreme case where the size of main memory is much smaller than the size of each of the relations. For a given join order of hybrid hash joins, say $(R \bowtie S) \bowtie T$, we approximate the factors $q_1$ and $q_2$ by 0. In this case, hybrid-hash join reduces to GRACE hash join. The cost of the two joins becomes
$$C'_{HH} = 2(|R| + |S| + |T|) + 2|J| \qquad (20)$$
or, for naive HH,
$$C_{HH} = 2(|R| + |S| + |T|) + 4|J| \qquad (21)$$
To compare these expressions to $C_{RH}$, we need a simple estimate for $\alpha$, and also an estimate for J. We can get a correlation between these two quantities when we consider the following RH-join: "hash S and T on B, and localize R". The cost of this operation now becomes (assuming that the localization table fits in main memory)
$$C_{RH} = 2(\alpha|R| + |S| + |T|) \qquad (22)$$
From these formulae, we get
$$C'_{HH} - C_{RH} = 2|J| - 2(\alpha - 1)|R| = 2|R| + 2|J| - 2\alpha|R| \qquad (23)$$
From Eqs. 12-15 we get
$$\alpha \lesssim \frac{\|R \bowtie S\|}{\|R\|} \qquad (24)$$
and then
$$C'_{HH} - C_{RH} \gtrsim 2|R| + 2|J| - 2\frac{\|R \bowtie S\|}{\|R\|}|R| = 2|R| + 2\left(1 - \frac{[R]}{[J]}\right)|J| \qquad (25)$$
(recall that $[X]$ is the size of a tuple of schema X). If $[R] \le [J]$, the above cost difference is positive, i.e. in favor of RH-join. By virtue of pushing projections through the joins, this will be true in most cases (recall that R is the projected relation). It is possible, though, as can be seen from Fig. 1, to have $[J] < [R]$, when a non-trivial $\sigma_J$ exists in the query expression (for example, an inequality between attributes of R and S, followed by projection). Based on the above analysis, we can state the following theorems:

Theorem 1 Let $(R \bowtie S) \bowtie T$ be a 3-way join, and let $J = R \bowtie S$. The I/O overhead of the RH algorithm is no more than the I/O overhead of GRACE hash joins, for the query expression of Fig. 1, where $J = R \bowtie S$, provided that:
1. the localization table fits in main memory,
2. the data is uniformly distributed, and
3. $|R| + |J| \ge \frac{[R]}{[J]}|J|$.

Proof: The proof is a direct consequence of Eq. 25 and the fact that the cost function $C'_{HH}$ of the HH join for $q_1 = q_2 = 0$ is identical to the cost function for GRACE hash join. □
Theorem 2 Under the assumptions of Theorem 1, if we additionally assume that the operation $\sigma_J$ does not exist in the query, then the I/O overhead of RH is strictly less than the overhead of GRACE hash join, by an amount of at least $2|R|$.

Proof: This is a direct consequence of the previous theorem, together with the observation that if $\sigma_J$ = True, $\pi_J$ does not exist (it can be pushed down through the joins), and thus $[J] \ge [R]$. □
By the above arguments, we see that there is potential for savings by using RH-join. Moreover, the mathematical manipulations reveal an intuitive explanation of why replication works. The fact is that replication of data appears in the HH solution as well, but implicitly, inside the intermediate result, whenever the joined relations are in a one-to-many or many-to-many dependency with respect to the join attribute. We call this replication implicit replication, whereas the replication performed by the localization operator is called explicit replication. In general, replication introduces overhead in proportion to the size of the replicated data, i.e. the size of the projected tuples. If this size is small (cf. $[J] < [R]$), the overall effect is not significant. In this case, the HH solution is advantageous over RH-join, because the $q_1$ and $q_2$ coefficients offer savings that RH-join does not compensate for. Before we proceed to the experiments, we discuss a very interesting aspect of explicit replication, namely techniques to reduce it. It is not obvious that implicit replication is amenable to such techniques, and thus we consider this an advantage of explicit replication.
4 Reducing Replication

The simple technique presented in this section is based on the idea of having the buckets of the localized relation share common pages. If a page is shared between two or more buckets, all tuples in this page belong to all the buckets that share the page. This idea does indeed decrease the space requirements of the RH-join algorithm, but does not necessarily reduce the I/O overhead, because a "shared" page may have to be loaded more than once, possibly once for each bucket that shares it.

In order to discuss the page sharing idea, we define a conceptual tool, the placement vector of a tuple. If an RH-join produces n buckets during localization, say $B_1, \ldots, B_n$, then for some tuple t of the localized relation, its placement vector $v_t$ is an n-bit vector, with bit i equal to 1 if t is placed in $B_i$, and 0 otherwise. Since tuples can be replicated in more than one bucket, more than one of the bits of $v_t$ can be 1.

One approach to page sharing would be to define one macro-bucket for each n-bit binary vector, for a total of $2^n$ macro-buckets, and place into each macro-bucket all tuples that have that placement vector. This solution works beautifully for small values of n, say less than 8, and it eliminates replication completely, because each tuple will only belong to one macro-bucket. Each $B_i$ will then be compiled at load time from a number of these macro-buckets. Unfortunately, this solution is very hard to scale to the larger values of n that large relations imply. In this case, most of the $2^n$ macro-buckets would be empty, or, worse, almost empty. Thus the complexity of loading buckets becomes unmanageable, and disk space utilization, which reflects heavily on I/O overhead, decreases tremendously. However, we can modify this idea slightly to obtain a workable solution. Instead of identifying macro-buckets with whole n-bit vectors, we identify them with positioned runs of 1s in an n-bit vector; in other words, we define
one macro-bucket for each sub-interval of the interval [1 : n]. For example, suppose that tuple t has the 10-bit placement vector $v_t = 0110111000$. There are two runs of 1s in this vector, so this tuple will be placed in two macro-buckets: bucket B[2 : 3] and bucket B[5 : 7]. This solution scales up much more nicely. Indeed, there are at most $\frac{n(n+1)}{2}$ macro-buckets. With respect to replication, there will only be two copies of this tuple on disk, instead of 5, so physical replication decreases. Even more interestingly, the amount of I/O performed decreases proportionally to the decrease in replication, because each macro-bucket, say B[i : j], will be loaded only once: it is read right before step i, remains in main memory until step j is finished, and is then unloaded and never used again. In some sense, this approach resembles a pre-computed replacement policy, adhering to the temporal locality of the in-memory joins. The above technique will be denoted LRH-join, for Linear Replication Hash join. In the next section, we present experimental results comparing HH, naive HH, RH-join and LRH-join, and find LRH-join to have the best performance in many cases.
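A small sketch of the macro-bucket assignment (our illustration, not the paper's implementation): given a tuple's placement vector, emit the run intervals that name its macro-buckets.

# Map a placement vector to its maximal runs of 1s, each naming a
# macro-bucket B[i:j] (1-indexed, inclusive).

def macro_buckets(v):
    """v is a list of 0/1 bits; returns the list of (i, j) run intervals."""
    runs, start = [], None
    for pos, bit in enumerate(v, start=1):
        if bit and start is None:
            start = pos                     # a run of 1s begins
        elif not bit and start is not None:
            runs.append((start, pos - 1))   # the run ended before pos
            start = None
    if start is not None:
        runs.append((start, len(v)))        # run extends to the last bit
    return runs

# The placement vector 0110111000 from the text yields [(2, 3), (5, 7)],
# i.e. macro-buckets B[2:3] and B[5:7].
print(macro_buckets([0, 1, 1, 0, 1, 1, 1, 0, 0, 0]))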
5 Experimental Results

In order to evaluate the proposed techniques, we chose the domain of decision support processing. Compared to OLTP, decision support queries are much more likely to include large multi-way joins, and especially hash joins. To this effect, we adopted the schema and data statistics of the TPC-D benchmark [10]. For convenience, we show part of the schema in Fig. 2.
[Figure 2 appears here: the relations PART, SUPPLIER, PARTSUPP, CUSTOMER, ORDER and LINEITEM, with foreign-key arrows PART -> PARTSUPP (PARTKEY, 1:4), SUPPLIER -> PARTSUPP (SUPPKEY, 1:80), PARTSUPP -> LINEITEM (SUPPKEY/PARTKEY, 1:7.5), ORDER -> LINEITEM (ORDERKEY, 1:4), and CUSTOMER -> ORDER (CUSTKEY, 1:10).]
Figure 2: Part of the schema for the TPC-D benchmark. Arrows show dependency on a foreign key. All 1 : n dependencies given are averages.

Each node in the figure represents a relation. An arrow from node R to node S, labeled by key K, denotes that the primary key K of R is a foreign key of S. The sizes of the relations are given in Table 1. Notice that each size is given as a multiple of a coefficient SF. For more information see [10]. We simulated four algorithms:

naiveHH  Two hybrid-hash joins.
HH  Two hybrid-hash joins, with hashing of the intermediate result on-the-fly.
RH  Replication hash join (LOCALIZE-NOTFIT implemented).
LRH  Linear replication hash join (LOCALIZE-NOTFIT implemented).

We optimized three simple 3-way join queries on the given schema. As specified by TPC-D, the data has no skew, and buckets will not overflow. We used the usual database estimation formulas to derive parameters for our optimization.
Relation    Size (k = thousands of tuples)
PART        200k SF
PARTSUPP    800k SF
SUPPLIER    10k SF
CUSTOMER    150k SF
ORDER       1500k SF
LINEITEM    6000k SF
Table 1: Sizes of relations in the TPC-D schema.

In all cases, we assumed that tuples are projected as they are scanned from the base relations. Tuples of two of the relations are projected to 40 bytes, and tuples of the third are projected to 20 bytes per tuple. For naiveHH and HH, each tuple of the intermediate result is 40 bytes long. The results of the query cost estimates are shown in Figs. 3-5. These graphs range over the size of the main memory. Each point on the graphs is a different optimization. Optimization consisted of evaluating the cost models over all alternative executions of the 3-way joins, and choosing the cheapest one. The graphs show the estimated amount of I/O for these queries, not including the initial scans of the base relations, and not including writing the final result to disk. We assumed that SF = 10 for these examples. This implies a total of 10Gb of data in the database (not including any indices and other supporting data). We chose a range of main memory from 1Mb to 200Mb.

SELECT * FROM PART P, SUPPLIER S, PARTSUPP PS
WHERE P.PARTKEY=PS.PARTKEY AND PS.SUPPKEY=S.SUPPKEY
[Figure 3 appears here: estimated I/O (Mb) versus main memory (Mb), from 0 to 200 Mb, for HH, naiveHH, RH and LRH.]
Figure 3: Query 1. The relations occupy approx. 1.5Gb on disk. After initial projections, the relations occupy 244Mb. Minimum memory for the localization table to fit: 39Mb.
SELECT * FROM ORDER O, LINEITEM L, SUPPLIER S
WHERE O.ORDERKEY=L.ORDERKEY AND L.SUPPKEY=S.SUPPKEY

[Figure 4 appears here: estimated I/O (Mb) versus main memory (Mb), from 0 to 200 Mb, for HH, naiveHH, RH and LRH.]
Figure 4: Query 2. The relations occupy approx. 8.3Gb on disk. After initial projections, the relations occupy 1.8Gb on disk. Minimum memory for the localization table to fit: 307Mb.

The first query, corresponding to Fig. 3, a classic database query, shows domination by the LRH algorithm throughout the range of main memory sizes. Including the initial scan of the relations (not shown), there is an improvement of 5% to 16%. For this query, the least amount of memory needed for the localization table to fit was 39Mb; the jump in the graph at that point for RH and LRH indicates it.

Results for the second query are shown in Fig. 4. This is a very interesting query, because it shows the existence of implicit replication. The first join, between the very large relation LINEITEM and one of the two other relations, simply produces too big an intermediate result. Since this intermediate result has to be materialized to disk and then read back in, even the HH case incurs another 2Gb of I/O. Notice that the localization table does not fit in main memory anywhere in the depicted range. This does not affect the performance, because the localized table, SUPPLIER, is quite small.

The third query is a representative example where replication is not preferable to hybrid hash join. In this case the size of the intermediate result is small, and the fact that the hybrid hash variants scan the data less than twice makes the difference in performance. They dominate for main memory larger than 75Mb for HH, and larger than 160Mb for naiveHH. Also, notice again the jump at 78Mb, when the localization table begins to fit into main memory.

The three graphs reveal the strengths and weaknesses of each of the examined techniques. For the HH derivatives, the lower bound on the overall I/O is proportional to the size of the intermediate result. Indeed, as memory gets larger, the fraction of the input that is read only once increases, until after some point all input is read only once. In contrast, the RH variants will read the input data at least twice. As main memory increases, the number of buckets per relation decreases, and thus the replication also decreases. Thus, the lower bound on the I/O overhead of the RH variants is proportional to the size of the input. Usually, with infinite main memory available, the HH variants are preferred over the RH variants. However, Fig. 3 demonstrates that this is not always the case.
SELECT * FROM CUSTOMER C, ORDER O, LINEITEM L
WHERE C.CUSTKEY=O.CUSTKEY AND O.ORDERKEY=L.ORDERKEY

[Figure 5 appears here: estimated I/O (Mb) versus main memory (Mb), from 0 to 200 Mb, for HH, naiveHH, RH and LRH.]
Figure 5: Query 3. The relations occupy approx. 8.6Gb on disk. After initial projections, the relations occupy 2.8Gb on disk. Minimum memory for the localization table to fit: 73Mb.

In the particular example of Fig. 3, the replication in the intermediate result is so high that the size of the intermediate result exceeds the size of all 3 input relations. In contrast, the query of Fig. 5 has an intermediate result (joining CUSTOMER and ORDER first) that is much smaller than the size of the input relations, especially relation LINEITEM. In this case, the RH algorithms are penalized for having to read the input twice. However, even in this case, they are comparable to the HH variants for less than 80Mb of main memory.
6 Related Work

Explicit replication has been exploited in parallel hash-joins. In that work, replication was exploited to overcome the effects of skew. That effort was restricted to memory-bound query processing. We will consider the integration of these methods in the future.

Although it is generally acknowledged that multi-way join algorithms can improve performance [11], there is relatively little work in the literature. The first such proposal was in INGRES [6]. Rich et al. addressed implicit replication by the use of non-normalized join indexes [2]. Both of the above approaches apply only to star queries. In [7] we proposed a general multi-way join algorithm for acyclic queries, which performs best when the data fits into main memory.

The literature on hash-based join algorithms is huge. Most of it examines join algorithms for parallel query processing, where the main issue is the amelioration of skew. For the uniprocessor case, hybrid hash as proposed by Shapiro [9] has dominated for the last decade. Recently, there has been renewed interest in hash-based techniques for object-oriented databases. DeWitt et al. [3] investigated pointer-based joins derived from hybrid-hash join, and identified that algorithms that require less replication will often outperform hybrid-hash join.
7 Conclusions

The cost models demonstrate that RH is not a replacement for HH. In general, we can conclude that the RH variants can substantially reduce the I/O overhead of a 3-way join when the intermediate result is large, and are comparable to the HH variants even for a small intermediate result, provided that the input relations are much larger than main memory.

We have left open a number of implementation issues in this paper. The most important open issue is the main memory join algorithm. We do propose a simple algorithm, but have not really examined the problem in detail. This would be an important component of a practical implementation. Its choice will affect issues such as the handling of bucket overflow in the presence of skew, and the use of virtual memory. Since we concentrate on a sequential algorithm in this paper, we have not been particularly concerned with skew.

We would like to note that our approach is compatible with other database techniques for query execution, such as integrating projections with joins. Also note that the technique can be integrated into an existing optimizer relatively easily, by postprocessing a chosen physical execution plan, and replacing cascades of two hybrid-hash joins with the replication hash join when the cost estimate for the latter is lower. Another popular database technique is Bloom filtering [1], which uses a bit filter to semi-join reduce the second argument of a join operation during preprocessing. In RH join, the localized relation is reduced during localization. It is easy to also include Bloom-filtering reduction during the partitioning of the other two relations, by materializing the localization table to disk instead of main memory. Because the localization table is very small compared to the input data, this should not be a significant burden on the I/O overhead, and the potential savings can be quite big for joins with very small selectivity.

There is a number of open questions about the ideas presented in this paper. Most important is the potential for extending this technique to joins of larger arity. Indeed, this is possible in a straightforward manner, because tables can produce their own localization tables as they are themselves being localized. In fact, for a hypothetical query with no projections, this would be a very viable alternative. However, if most of the attributes of intermediate results are projected away after joining, this approach is not very promising. Also, it requires a significant amount of temporary storage. So, it is not clear whether the replication approach has merits for larger join arities.
8 Acknowledgements The authors would like to thank Dr. Roberto Bayardo Jr. for many useful discussions and ideas.
References

[1] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422-426, July 1970.
[2] C. Rich, A. Rosenthal, and M. H. Scholl. Reducing duplicate work in relational joins: A unified approach. In Proc. Int'l Conf. on Information Systems and Management of Data, Delhi, India, 1993.
[3] D. DeWitt, D. Lieuwen, and M. Mehta. Pointer-based join techniques for object-oriented databases. Technical Report CS-TR-92-1099, Univ. of Wisconsin at Madison, 1992.
[4] D. DeWitt et al. Implementation techniques for main memory database systems. In Proc. of SIGMOD'84, Boston, MA, June 1984. ACM.
[5] M. Kitsuregawa et al. Application of hash to database machine and its architecture. New Generation Computing, 1(1):62-74, 1983.
[6] M. Stonebraker et al. The design and implementation of INGRES. ACM TODS, 1(3):189-222, Sept. 1976.
[7] V. Samoladas and D. P. Miranker. Loop optimizations for acyclic object oriented queries. Technical Report TR96-10, University of Texas at Austin, 1996.
[8] D. Schneider and D. DeWitt. A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment. In Proc. SIGMOD'89, pages 110-121. ACM, June 1989.
[9] L. D. Shapiro. Join processing in database systems with large main memories. ACM TODS, 11(3):239-264, 1986.
[10] Transaction Processing Performance Council, http://www.tpc.org. TPC BENCHMARK D Standard Specification, Dec. 1996.
[11] J. Ullman. Principles of Database and Knowledge-Base Systems. Computer Science Press, Inc., 1988.
[12] P. Wilson. Locality of reference, patterns in program behavior, memory management and memory hierarchies. Unpublished manuscript.