Join Algorithm Costs Revisited

Evan P. Harris    Kotagiri Ramamohanarao

Technical Report 93/5

Department of Computer Science
The University of Melbourne
Parkville, Victoria 3052, Australia

Abstract

A method of analysing join algorithms based upon the time required to access, transfer and perform the relevant CPU based operations on a disk page is proposed. The costs of variations of several of the standard join algorithms, including nested block, sort-merge, GRACE hash and hybrid hash, are presented. For a given total buffer size, the cost of these join algorithms depends on the parts of the buffer allocated for each purpose (for example, when joining two relations using the nested block join algorithm the amount of buffer space allocated for the outer and inner relations can significantly affect the cost of the join). Analysis of expected and experimental results of various join algorithms shows that a combination of the optimal nested block and optimal GRACE hash join algorithms usually provides the greatest cost benefit. Algorithms to quickly determine the buffer allocation producing the minimal cost for each of these algorithms are presented.

1 Introduction

In the past, the analysis of join algorithms has primarily consisted of counting the number of disk pages transferred during the join operation, as this has been perceived as the dominant cost of the join algorithm. However, the difference between the time taken to locate a single disk page and the time taken to transfer it is significant, with the time taken to locate the page dominating. Hence, the difference between transferring two consecutive disk pages and two random disk pages is quite significant. For example, if it takes 25 milliseconds to locate the average disk page and 5 milliseconds to transfer it from disk to memory, then reading two consecutive pages takes 35 milliseconds, whereas reading two random pages takes 60 milliseconds. When the cost of a join operation is determined, this difference should be taken into account.

The CPU cost of a join should also be taken into account. Experience with the Aditi deductive database [13] has shown that the disk access and transfer times correspond to 10-20% of the time taken to perform a join. Thus, the CPU time required to perform the join should be considered when determining the most efficient method to use for any given join.

There have been a number of join algorithms described in the past, none of which is the best under all circumstances. We provide analyses of some of the more common of these. The nested loop algorithm is the first of these. This algorithm and the sort-merge algorithm are the algorithms used in most current database implementations. The nested loop is used when a small relation is joined to another, and the sort-merge for larger relations [2]. For a while it was believed that the sort-merge join was the best possible algorithm [9]; however, the description of hash based join algorithms [4,7] indicated that this was not necessarily true. DeWitt, et al. compared the sort-merge, simple hash, GRACE hash and hybrid hash join algorithms and concluded that hybrid hash has the lowest cost. While some are still unsure (for example, Cheng, et al. [3] claimed that when memory is large, sort-merge and hybrid hash have similar disk I/O performance), hybrid hash is regarded as one of the best algorithms for performing the join. A survey of join algorithms appears in [10].

Each of the aforementioned papers typically uses the number of pages transferred in its description of the cost of disk operations. This assumes that the cost of transferring a number of consecutive pages at once is the same as transferring them individually from random parts of the disk. Hagmann [5] argued that, for current disk drive technology, when small numbers of pages are transferred the cost of locating the pages is much greater than the cost of transferring them. He described an analysis of the nested loop algorithm counting only the number of accesses which, when minimised, showed that half the pages of the memory buffer should be devoted to each relation. Under the previous cost model, the inner relation was provided with one page of memory, and the remaining memory was devoted to the outer relation. Our analysis is a generalisation of these two and allows the relative, or absolute, cost of each disk and CPU operation to be specified. The use of an extent based file system, even under UNIX [8], will provide greater support for our technique than standard file systems, which do not guarantee that consecutive blocks are even on the same part of the disk.
Although standard file systems do typically try to cluster contiguous blocks, extent based file systems achieve this to a greater degree.

Using our cost model, we will demonstrate that the cost of calculating the optimal buffer allocation and then performing the join using the GRACE hash join algorithm is usually lower than that of the same operations using the hybrid hash join algorithm. The hybrid hash join algorithm is generally regarded to have the lowest cost of all join algorithms when the relation size is larger than the memory in which the join is to take place. When the time taken to calculate the optimal buffer allocation is taken into account, this is not usually the case for the algorithms we have used.

In the next section we present four join algorithms: the nested block, sort-merge, GRACE hash and hybrid hash algorithms. Each join algorithm is described and analysed. In Section 3 a generalisation of the two hash methods is described and we show how to maximise the buffer usage to reduce the number of disk seeks. Minimisation algorithms are described for some of the join algorithms in this section.

In Section 4 an analysis of expected results is presented and some experimental results are reported. In Section 5 we discuss how non-uniform data distributions may be handled, and in the final section we present our conclusions.

Table 1: Notation used in cost formulae.

V_A   number of pages in relation A
V_B   number of pages in relation B
V_R   number of pages in the result of joining relations A and B
B     number of pages available in memory
B_A   number of pages in memory for relation A
B_B   number of pages in memory for relation B
B_H   number of pages in memory used for a hash table
B_I   number of pages in memory used for an input buffer
B_P   number of pages in memory used for each partition
B_R   number of pages in memory for the result
P     number of partitions created on each pass
δ     number of passes during the partitioning phase
T_C   cost of constructing a hash table, per page, in place in memory
T_J   cost of joining a page with a hash table in memory
T_K   cost of moving the disk head to a page on disk
T_M   cost of merging a page with another in the sort-merge algorithm
T_P   cost of partitioning a page in memory
T_S   cost of sorting a page in memory
T_T   cost of transferring a page from disk to memory
k_S   sorting constant

2 Join Algorithms

In the following analyses of the nested block, sort-merge, GRACE hash and hybrid hash join algorithms we make some assumptions. In common with the other papers cited above, we assume that the distribution of records to partitions in the hash based join algorithms is uniform. However, in Section 5 we will describe a method which will work when the data is not uniformly distributed. We also assume that some small amount of memory is available, in addition to that provided for buffering the pages from disk. For example, we allow an algorithm to require a pointer or two for each page of memory. This additional memory will typically be thousands of times smaller than the size of the buffer. The notation used in the analysis of the cost of each algorithm is given in Table 1. Note that in each of the costs below the cost of locating a page on disk, T_K, would typically be the sum of the average seek time and the average latency time. However, equally, the maximum seek and latency times could be used, giving an upper bound on the cost of each operation. The cost of a single disk operation, transferring a set of disk pages, can be given by

    C_P = T_K + V_P T_T

where V_P is the number of pages transferred from disk to memory. This is a generalisation of the assumption that only a single page is transferred during each disk operation (T_K = 0, T_T = 1, V_P = 1) and of the assumption that any number of pages can be transferred at the same cost (T_K = 1, T_T = 0).
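This cost model is easy to evaluate directly. The following minimal Python sketch bundles the Table 1 time constants and computes C_P; the class and function names are ours, and the default values are taken from the figure captions in Section 4:

    import math
    from dataclasses import dataclass

    @dataclass
    class CostParams:
        """Per-page time constants from Table 1, in seconds."""
        T_K: float = 0.0243    # locate a page: average seek + latency
        T_T: float = 0.00494   # transfer one page from disk to memory
        T_C: float = 0.015     # build an in-memory hash table, per page
        T_J: float = 0.015     # join a page against an in-memory hash table
        T_P: float = 0.0018    # partition a page in memory
        T_S: float = 0.013     # sort a page in memory
        T_M: float = 0.0025    # merge a page in the sort-merge algorithm
        k_S: float = 0.00144   # sorting constant

    def disk_op_cost(p, V_P):
        """C_P = T_K + V_P * T_T: one seek, then V_P consecutive page transfers."""
        return p.T_K + V_P * p.T_T

With the introduction's example values, disk_op_cost(CostParams(T_K=0.025, T_T=0.005), 2) gives 0.035 seconds for two consecutive pages, while two single-page operations cost 0.060 seconds.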

The operation each of the following algorithms describes may be given by

    A ⋈ B → R

where A, B and R are relations.

2.1 Nested Block

The nested block join algorithm is a more efficient version of the nested loop join algorithm which is used in a paged disk environment. The nested loop algorithm works by reading one record from one relation, the outer relation, and passing over each record of the other relation, the inner relation, joining the record of the outer relation with all the appropriate records of the inner relation. The next record from the outer relation is then read and the whole of the inner relation is again scanned, and so on. The nested block algorithm works by reading a block of records, typically one page, from the outer relation and passing over each record of the inner relation (also read in blocks), joining the records of the outer relation with those of the inner relation. Note that under Hagmann's analysis [5] half the available memory is devoted to pages from the inner relation, and half to the outer relation. A slightly more efficient version of this algorithm can be obtained by rocking backwards and forwards across the inner relation. That is, for the first block of the outer relation the inner relation is read forwards and for the second block of the outer relation the inner relation is read backwards. This eliminates the need for reading in one set of blocks of the inner relation at the start of each pass, except the first, because the blocks will already be in memory. This is the version of the algorithm the following analysis represents. We assume that the memory based part of the operation is based on hashing. That is, the pages of the outer relation are formed into a hash table, and the records of the inner relation are hashed against this table to find records to join with. The constraints B_A + B_B + B_R ≤ B, 1 ≤ B_A ≤ V_A, 1 ≤ B_B ≤ V_B, B_R ≥ 1 must be satisfied. They ensure that the number of pages allocated to each relation is reasonable. Relation A is the outer relation (V_A < V_B). It is read precisely once, B_A pages at a time. Relation B is the inner relation. It is read ⌈V_A/B_A⌉ times, ⌈V_B/B_B⌉ read operations at a time. Each pass over relation B except the first reads V_B − B_B pages. Assuming the constraints above are satisfied, the cost of the nested block join algorithm is given by

    C_NB = ⌈V_A/B_A⌉ T_K + V_A (T_T + T_C)                                        (read A)
         + ⌈V_B/B_B⌉ T_K + V_B (T_T + T_J)                                        (read B, first pass)
         + (⌈V_A/B_A⌉ − 1) ((⌈V_B/B_B⌉ − 1) T_K + (V_B − B_B)(T_T + T_J))          (read B, later passes)
         + ⌈V_R/B_R⌉ T_K + V_R T_T                                                (write R)

The (⌈V_A/B_A⌉ − 1) and (⌈V_B/B_B⌉ − 1) terms are due to the rocking. On each of the ⌈V_A/B_A⌉ passes except the first, V_B − B_B blocks are read in, B_B blocks at a time. Simplifying this equation,

    C_NB = (⌈V_A/B_A⌉ ⌈V_B/B_B⌉ + 1 + ⌈V_R/B_R⌉) T_K
         + (V_A + ⌈V_A/B_A⌉ (V_B − B_B) + B_B + V_R) T_T
         + V_A T_C + (⌈V_A/B_A⌉ (V_B − B_B) + B_B) T_J.
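The simplified form can be evaluated directly. A minimal Python sketch, continuing the CostParams sketch above (the function name is ours):

    import math

    def nested_block_cost(p, V_A, V_B, V_R, B_A, B_B, B_R):
        """Simplified C_NB for the rocking nested block join of Section 2.1."""
        a = math.ceil(V_A / B_A)                  # passes over the inner relation
        b = math.ceil(V_B / B_B)
        r = math.ceil(V_R / B_R) if V_R else 0    # no output term when V_R = 0
        seeks = (a * b + 1 + r) * p.T_K
        transfers = (V_A + a * (V_B - B_B) + B_B + V_R) * p.T_T
        cpu = V_A * p.T_C + (a * (V_B - B_B) + B_B) * p.T_J
        return seeks + transfers + cpu

Sweeping B_A with B = 64 and V_R = 0 should reproduce the shape of the curve in Figure 1 below.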

2.2 Sort-Merge

The sort-merge join algorithm works in two phases. In the first, sorting, phase, each relation is sorted on the join attributes. In the second, merging, phase, a record (or block of records) is read from each relation and they are joined if possible, otherwise the one with the smaller sort order is discarded and a new record read from that relation. In this way, each relation is only read once after it has been sorted. The variant of the sort-merge algorithm whose cost we present below is a simplified version of that used in the Aditi deductive database system [13]. Instead of completely sorting each relation, each relation is partitioned into sorted partitions that are the size of the memory buffer, B. To perform this partitioning, B pages are read from a relation, sorted, and written out, then another B pages, and so on. During the merge phase, the partitions of each relation are merged together and the records from each relation are joined. The same pass over the partitions both merges the partitions of each relation and joins the relations. Up to B − B_R partitions of both relations may be created prior to the merge phase. In the equations in this section, B_A and B_B represent the number of pages allocated to each partition of relations A and B respectively. If the time taken to sort a page of x records is given by T_S = kx lg x, then the time taken to sort n pages may be given by

    T = knx lg(nx) = knx(lg n + lg x) = knx lg n + knx lg x = k_S n lg n + n T_S = n(k_S lg n + T_S),

where k_S = kx is the sorting constant. Assuming the constraints ⌈V_A/B⌉ B_A + ⌈V_B/B⌉ B_B + B_R ≤ B, B_A ≥ 1, B_B ≥ 1, B_R ≥ 1 are satisfied, the cost of the sort-merge join algorithm is given by

    C_SM = 2 (⌈V_A/B⌉ T_K + V_A T_T) + V_A (k_S lg B + T_S)                        (sort A)
         + 2 (⌈V_B/B⌉ T_K + V_B T_T) + V_B (k_S lg B + T_S)                        (sort B)
         + (⌈V_A/B⌉ ⌈B/B_A⌉ + ⌈V_B/B⌉ ⌈B/B_B⌉) T_K + (V_A + V_B)(T_T + T_M)        (merge)
         + ⌈V_R/B_R⌉ T_K + V_R T_T                                                 (write R)

Simplifying,

    C_SM = (⌈V_A/B⌉ (2 + ⌈B/B_A⌉) + ⌈V_B/B⌉ (2 + ⌈B/B_B⌉) + ⌈V_R/B_R⌉) T_K
         + (3V_A + 3V_B + V_R) T_T + k_S (V_A + V_B) lg B + (V_A + V_B)(T_S + T_M).

This cost analysis assumes that at most B − B_R partitions are created during the sort phase. For large memory buffer sizes this is likely to be true under normal circumstances, and is certainly enough to compare the sort-merge join algorithm with the other join algorithms presented in this paper. For example, consider a memory buffer of size 4 Mbytes, ignoring the final output buffer. If we assume 8 kbyte pages, this means that a maximum of 4096/8 = 512 partitions may be created which are to be merged together on the final pass, thus 256 partitions for each relation. Each partition will be 4 Mbytes in size, thus the maximum size of each relation is 1 Gbyte if only one sorting and one merging pass is permitted. A similar analysis shows that if 16 Mbytes of memory is available, the maximum size of each relation is 16 Gbytes. Thus, if large main memory is available, one sorting and merging pass is likely to be sufficient.
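The simplified cost is again easy to evaluate; a Python sketch continuing the earlier ones (the assertion encodes the one-merge-pass assumption just discussed):

    import math

    def sort_merge_cost(p, V_A, V_B, V_R, B, B_A, B_B, B_R):
        """Simplified C_SM of Section 2.2, valid when all sorted runs can be
        merged (and joined) in a single pass."""
        runs_A, runs_B = math.ceil(V_A / B), math.ceil(V_B / B)
        assert runs_A * B_A + runs_B * B_B + B_R <= B, "more than one merge pass needed"
        seeks = (runs_A * (2 + math.ceil(B / B_A))
                 + runs_B * (2 + math.ceil(B / B_B))
                 + math.ceil(V_R / B_R)) * p.T_K
        transfers = (3 * V_A + 3 * V_B + V_R) * p.T_T
        cpu = (p.k_S * (V_A + V_B) * math.log2(B)
               + (V_A + V_B) * (p.T_S + p.T_M))
        return seeks + transfers + cpu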

2.3 GRACE Hash

The GRACE hash join algorithm [7] also works in two phases, in a similar way to the sort-merge algorithm. In the first, partitioning, phase, each relation is partitioned such that one relation may be contained within memory. This may take a number of passes over the relations. In the second, merging, phase, one partition of the outer relation is read into memory and the corresponding partition of the inner relation is scanned and the appropriate records joined. The second phase effectively consists of a number of invocations of the nested block join algorithm. The partitioning operation is performed by hashing each record based on the values of its join attributes and placing the record in one of the output partitions using the hash value. The same hash function is used for each relation, guaranteeing that all the records of one relation which will join with any given record of the other relation are in the corresponding partition of the other relation. In the cost given below we assume that during the partition phase records are read into a buffer of size B_I and then distributed between P output buffers of size B_P. While we set P to be a single value, it could vary for each pass. However, if a large amount of main memory is available, one pass will typically be enough. Assuming the constraints B_A + B_B + B_R ≤ B, 1 ≤ B_A ≤ V_A, 1 ≤ B_B ≤ V_B, B_R ≥ 1, P > 1 are satisfied, together with 1 ≤ B_I ≤ B and 2 ≤ PB_P ≤ B (or, if the input and output buffers are distinct, 3 ≤ PB_P + B_I ≤ B), and the optimal number of partitioning passes is δ, the cost of the GRACE hash join algorithm is given by

    C_GH = Σ_{i=0}^{δ−1} P^i ( ⌈(V_A/P^i)/B_I⌉ T_K + (V_A/P^i)(T_T + T_P) )         (partition A, read)
         + Σ_{i=1}^{δ}   P^i ( ⌈(V_A/P^i)/B_P⌉ T_K + (V_A/P^i) T_T )                (partition A, write)
         + Σ_{i=0}^{δ−1} P^i ( ⌈(V_B/P^i)/B_I⌉ T_K + (V_B/P^i)(T_T + T_P) )         (partition B, read)
         + Σ_{i=1}^{δ}   P^i ( ⌈(V_B/P^i)/B_P⌉ T_K + (V_B/P^i) T_T )                (partition B, write)
         + P^δ ( ⌈(V_A/P^δ)/B_A⌉ T_K + (V_A/P^δ)(T_T + T_C) )                       (read A)
         + P^δ ( ⌈(V_B/P^δ)/B_B⌉ T_K + (V_B/P^δ)(T_T + T_J) )                       (read B, first pass)
         + P^δ ( ⌈(V_A/P^δ)/B_A⌉ − 1 ) ( (⌈(V_B/P^δ)/B_B⌉ − 1) T_K + (V_B/P^δ − B_B)(T_T + T_J) )   (read B, later passes)
         + ⌈V_R/B_R⌉ T_K + V_R T_T.                                                 (write R)

Note that there is a corresponding CPU based operation for each disk read operation. Like the sort-merge algorithm, this algorithm is likely to only use one pass over the data during the partitioning phase. For example, consider the same memory buffer of size 4 Mbytes used in the example in the previous section.

One pass means that δ = 1. The largest size of the outer relation will occur when P = B − 1, thus V_A = (B − 1) B_A. If we assume that most of the 512 pages (4 Mbytes) are allocated to B_A, the size of the smaller relation, relation A, will be just under 2 Gbytes. The other relation may be much larger. By a similar analysis, if 16 Mbytes (2048 pages) of memory is available, the maximum size of the smaller relation will be just under 32 Gbytes. In practice, a lower cost may be found by making two passes to partition the data when the relations are this large, so that multiple pages may be read during one read operation. This is discussed further in Section 3.2.1.
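As a structural check on C_GH, the formula can be evaluated directly. A Python sketch, continuing the CostParams sketch above (names are ours; passes stands for δ):

    import math

    def grace_hash_cost(p, V_A, V_B, V_R, B_A, B_B, B_R, B_I, B_P, P, passes):
        """Reconstructed C_GH: `passes` partitioning passes over both relations,
        then a nested block join of each of the P**passes partition pairs."""
        def read(V, i, buf, t_cpu):   # read the P**i partitions at level i
            Vi = V / P**i
            return P**i * (math.ceil(Vi / buf) * p.T_K + Vi * (p.T_T + t_cpu))

        def write(V, i):              # write the P**i partitions at level i
            Vi = V / P**i
            return P**i * (math.ceil(Vi / B_P) * p.T_K + Vi * p.T_T)

        d = passes
        c = sum(read(V, i, B_I, p.T_P) for V in (V_A, V_B) for i in range(d))
        c += sum(write(V, i) for V in (V_A, V_B) for i in range(1, d + 1))
        c += read(V_A, d, B_A, p.T_C)                    # build the hash tables
        c += read(V_B, d, B_B, p.T_J)                    # first probe pass
        VAd, VBd = V_A / P**d, V_B / P**d                # rocking re-reads of B
        c += P**d * (math.ceil(VAd / B_A) - 1) * (
                (math.ceil(VBd / B_B) - 1) * p.T_K + (VBd - B_B) * (p.T_T + p.T_J))
        c += math.ceil(V_R / B_R) * p.T_K + V_R * p.T_T  # write the result
        return c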

2.4 Hybrid Hash

The hybrid hash join algorithm [4,12] is very similar to the GRACE hash join algorithm. The difference is that the hybrid hash join algorithm reserves an area of memory to join records in during the partitioning phase. Instead of hashing each record into one of P partitions during the partitioning phase, each record is hashed into P + 1 partitions. During the partitioning of relation A, records which hash into the extra partition are not written out to disk but are instead stored in the reserved area in memory. When relation B is partitioned, records which hash into the extra partition are joined with the records of relation A which are stored in memory. The amount of memory reserved for this extra partition need not be the same as the expected sizes of the other partitions, providing that the extra partition does not overflow during one pass of relation A. The basis of the cost of the hybrid hash join algorithm below is similar to the GRACE hash join algorithm, with the addition of the hash table in memory, of size B_H, during the partition phase. Assuming the constraints B_A + B_B + B_R ≤ B, B_H + PB_P + B_I + B_R ≤ B, 1 ≤ B_A ≤ V_A, 1 ≤ B_B ≤ V_B, B_R ≥ 1, 1 ≤ B_H ≤ V_A, P > 1, B_P ≥ 1, B_I ≥ 1 are satisfied, the cost of the hybrid hash join algorithm is approximated by

the following. To simplify the presentation, let

    V_A^(i) = (V_A − B_H Σ_{j=0}^{i−1} P^j) / P^i        and        V_B^(i) = (V_B / V_A) V_A^(i)

denote the number of pages of each of the P^i partitions of relations A and B which remain on disk at pass i. Then

    C_HH = Σ_{i=0}^{δ−1} P^i ( ⌈V_A^(i)/B_I⌉ T_K + V_A^(i) T_T + V_A^(i) T_P + B_H T_C )              (partition and join A, read)
         + Σ_{i=1}^{δ}   P^i ( ⌈V_A^(i)/B_P⌉ T_K + V_A^(i) T_T )                                      (partition and join A, write)
         + Σ_{i=0}^{δ−1} P^i ( ⌈V_B^(i)/B_I⌉ T_K + V_B^(i) T_T + V_B^(i) T_P + (V_B B_H / V_A) T_J )  (partition and join B, read)
         + Σ_{i=1}^{δ}   P^i ( ⌈V_B^(i)/B_P⌉ T_K + V_B^(i) T_T )                                      (partition and join B, write)
         + P^δ ( ⌈V_A^(δ)/B_A⌉ T_K + V_A^(δ) (T_T + T_C) )                                            (join A, read)
         + P^δ ( ⌈V_B^(δ)/B_B⌉ T_K + V_B^(δ) (T_T + T_J) )                                            (join B, read, first pass)
         + P^δ ( ⌈V_A^(δ)/B_A⌉ − 1 ) ( (⌈V_B^(δ)/B_B⌉ − 1) T_K + (V_B^(δ) − B_B)(T_T + T_J) )         (join B, read, later passes)
         + ⌈V_R/B_R⌉ T_K + V_R T_T                                                                     (write R)

where δ is the number of passes during the partition phase. This is a generalisation of the original hybrid hash scheme [4], in which it was assumed that there was only one pass through the data, that is, δ = 1.
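Since the printed formula is itself an approximation, a Python sketch of the common δ = 1 case may be the clearest summary of its structure. It follows our reconstruction above; all names are ours, and it continues the CostParams sketch:

    import math

    def hybrid_hash_cost_one_pass(p, V_A, V_B, V_R, B_A, B_B, B_R, B_I, B_P, B_H, P):
        """Reconstructed C_HH for delta = 1: one partitioning pass, with B_H
        pages of A kept and joined in memory while partitioning."""
        VA1 = (V_A - B_H) / P            # pages of A left in each disk partition
        VB1 = VA1 * V_B / V_A            # corresponding pages of B
        c = (math.ceil(V_A / B_I) * p.T_K + V_A * (p.T_T + p.T_P)
             + B_H * p.T_C)                                         # read + split A
        c += P * (math.ceil(VA1 / B_P) * p.T_K + VA1 * p.T_T)       # write A's partitions
        c += (math.ceil(V_B / B_I) * p.T_K + V_B * (p.T_T + p.T_P)
              + (V_B * B_H / V_A) * p.T_J)                          # read B, join memory part
        c += P * (math.ceil(VB1 / B_P) * p.T_K + VB1 * p.T_T)       # write B's partitions
        c += P * (math.ceil(VA1 / B_A) * p.T_K + VA1 * (p.T_T + p.T_C))   # join phase, read A
        c += P * (math.ceil(VB1 / B_B) * p.T_K + VB1 * (p.T_T + p.T_J))   # join phase, read B
        c += P * (math.ceil(VA1 / B_A) - 1) * (
                (math.ceil(VB1 / B_B) - 1) * p.T_K + (VB1 - B_B) * (p.T_T + p.T_J))
        c += math.ceil(V_R / B_R) * p.T_K + V_R * p.T_T             # write the result
        return c

Setting B_H = 0 recovers the single-pass GRACE hash cost, which is the generalisation exploited in Section 3.2.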

3 Minimising Costs

The equations given in Section 2 provide the costs of each join method. Now we describe how we determine the optimal buffer allocation for the nested block and hash-based join algorithms.

3.1 Nested Block

To minimise the cost of the nested block join algorithm we must minimise C_NB in the presence of two variables, B_A and B_B. We set B_R = B − B_A − B_B. For this to be useful in practice we must be able to find the minimum value in a small period of time, relative to the time taken by the join.

Figure 1: Nested block join algorithm varying B_A. V_A = 100, V_B = 1000, B_R = V_R = 0, B = 64, T_K = 0.0243, T_T = 0.00494, T_C = T_J = 0.015. [Graph: cost (secs) versus the number of pages in the buffer for the outer relation; the minimum cost of 44 seconds is marked.]

To determine how to find the minimum we have plotted the cost of the join versus B_A after setting B_R = 0 in Figure 1. The values of T_K and T_T are based on the Wren 6 disk drive, and T_C and T_J were based on the operations which may be performed on a Sparcstation 10/30 using an 8 kbyte page size. We observe that the minimum value of C_NB is likely to occur when the values of the variables are such that V_A/B_A, V_B/B_B and V_R/B_R are integers. If they are not integers, the greater the value of the fractional part the better. Our minimisation algorithm (see Figure 2) works by stepping down from the largest values of B_A and B_B until the cost is greater than the minimum cost found by a certain factor, τ. It initially sets B_A = ⌈V_A/i⌉, where i is the smallest integer such that B_A ≤ B − 2, and B_B = ⌈V_B/j⌉, where j is the smallest integer such that B_A + B_B ≤ B − 1. It sets B_R = B − B_A − B_B and calculates the cost, then increments j and recalculates B_B, B_R and the cost. This process continues while the cost is less than τ multiplied by the minimum cost found for this value of B_A. The final cost is saved, i is incremented and B_A is recalculated. This process continues while the best cost for each value of B_A is less than τ multiplied by the global minimal cost. We describe how this algorithm performs in Section 4.

3.2 A Generalised Hash Join Algorithm

By relaxing the constraints on the hybrid hash join algorithm, so that B_H = 0 is permitted, and by removing the result buffer B_R during the partitioning phase when B_H = 0, the hybrid hash join algorithm can be generalised to include the GRACE hash join algorithm. Both of these algorithms, as described in Section 2, have separate input and output buffer areas during the partitioning phase. These can be combined without affecting the cost equations C_GH and C_HH; however, the constraints need to be altered to permit this. Figure 3 diagrammatically represents our scheme. A set of P one-page buffers is reserved along with the B_R and B_H pages.

Figure 2: Function for minimising the cost of the nested block join algorithm.

function minimiseNB(V_A, V_B, V_R, B)
    mincost ← ∞; B'_A ← 0; i ← 1
    while B'_A ≠ 1 do
        B_A ← ⌈V_A/i⌉                           # find the largest integer (almost) dividing V_A
        if B_A ≤ B − 2 and B'_A ≠ B_A then      # a fixed value for B_A is a "run"
            runcost ← ∞; B'_B ← 0; j ← 1
            while B'_B ≠ 1 do
                B_B ← ⌈V_B/j⌉                   # find the largest integer (almost) dividing V_B
                if B_B ≤ B − B_A − 1 and B'_B ≠ B_B then
                    B_R ← B − B_A − B_B
                    cost ← C_NB(V_A, V_B, V_R, B_A, B_B, B_R)
                    if cost < runcost then
                        (B_A^+, B_B^+, B_R^+, runcost) ← (B_A, B_B, B_R, cost)
                    else if cost > τ · runcost then
                        break                   # this cost is much worse, so end this run
                    end if
                end if
                B'_B ← B_B                      # save B_B to ensure we don't try it twice
                j ← j + 1                       # prepare for next value of B_B
            end while
            if runcost < mincost then
                (B_A*, B_B*, B_R*, mincost) ← (B_A^+, B_B^+, B_R^+, runcost)   # save run best
            else if runcost > τ · mincost then
                break                           # this cost is much worse, so end
            end if
        end if
        B'_A ← B_A                              # save B_A to ensure we don't try it twice
        i ← i + 1                               # prepare for next value of B_A
    end while
    return mincost, B_A*, B_B*, B_R*
end function
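For reference, the search of Figure 2 transcribes directly into Python. Here cost_fn stands for C_NB (for example, nested_block_cost from Section 2, partially applied with a CostParams), and tau is our name for the paper's cut-off factor:

    import math

    def minimise_nb(cost_fn, V_A, V_B, V_R, B, tau):
        """Figure 2: step B_A and B_B down through ceil(V/i) and keep the
        cheapest allocation, pruning once a cost exceeds tau times the best."""
        mincost, best = math.inf, None
        prev_BA, i = 0, 1
        while prev_BA != 1:
            B_A = math.ceil(V_A / i)
            if B_A <= B - 2 and B_A != prev_BA:     # a fixed B_A is one "run"
                runcost, runbest = math.inf, None
                prev_BB, j = 0, 1
                while prev_BB != 1:
                    B_B = math.ceil(V_B / j)
                    if B_B <= B - B_A - 1 and B_B != prev_BB:
                        B_R = B - B_A - B_B
                        cost = cost_fn(V_A, V_B, V_R, B_A, B_B, B_R)
                        if cost < runcost:
                            runcost, runbest = cost, (B_A, B_B, B_R)
                        elif cost > tau * runcost:
                            break                   # much worse: end this run
                    prev_BB, j = B_B, j + 1
                if runcost < mincost:
                    mincost, best = runcost, runbest
                elif runcost > tau * mincost:
                    break                           # much worse: stop searching
            prev_BA, i = B_A, i + 1
        return mincost, best

Called as minimise_nb(lambda *a: nested_block_cost(CostParams(), *a), 100, 1000, 100, 64, tau=1.5), it should reproduce allocations of the kind listed in Table 2, such as (50, 11, 3) for V_A = 100; the value of τ used in the paper is not stated.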

Figure 3: Buffer structure of the modified hash join algorithm during the partitioning stage. [Diagram: the buffer comprises the B_R and B_H pages, the P spare pages of size B_P, and a region of PB_P ≥ B_I pages shared between input and output.]

The remaining PB_P buffers are used as both the input and output buffers during the partitioning phase. A relation is read into the PB_P pages, then it is partitioned into PB_P + P pages. It effectively tries to partition in place using the PB_P pages. Pages which are not completely filled are moved to one of the P spare pages based upon their partition number. On the next pass, the contents of each spare page are added to its partition after partitioning, before the full pages are written out. Incompletely filled pages are again saved within the P spare pages. Thus, only complete pages are written out, except after the final read, when all pages must be written. As the number of pages transferred under this scheme is the same as in the algorithms described in Section 2, and the number of seeks will be the same if the constraints on P, B_P, B_I, B_H and B_R are modified appropriately, the cost functions of Section 2 are still valid. These constraints are described in the next two sections, and the page bookkeeping is sketched below.
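A toy Python sketch of the full-page/spare-page bookkeeping just described; read_batch, hash_fn and page_records are stand-ins of ours for the real page layout, and the in-place movement of pages within the buffer is elided:

    def partition_pass(read_batch, n_batches, P, hash_fn, page_records=64):
        """Records hash to one of P partitions; complete pages are emitted
        immediately, partial pages survive in the P spare pages until the
        next batch, and everything is flushed after the final read."""
        spare = [[] for _ in range(P)]     # the P spare (partially filled) pages
        out = [[] for _ in range(P)]       # complete pages written per partition
        for _ in range(n_batches):
            for rec in read_batch():       # partition the batch just read
                part = hash_fn(rec) % P
                spare[part].append(rec)
                if len(spare[part]) == page_records:
                    out[part].append(spare[part])   # a complete page: write it out
                    spare[part] = []
        for part in range(P):              # after the final read, write everything
            if spare[part]:
                out[part].append(spare[part])
        return out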

3.2.1 Modified GRACE Hash

The buffer arrangement of our modified GRACE hash join algorithm during the partitioning phase is the same as that described in Figure 3, providing B_R = 0 and B_H = 0 during this phase. For the cost C_GH to be valid, the constraints are now that B_A + B_B + B_R ≤ B, 1 ≤ B_A ≤ V_A, 1 ≤ B_B ≤ V_B, B_R ≥ 1, P > 1, B_P ≥ 1 and B_I = PB_P ≤ B − P. We now consider the maximum practical size of the relations under this scheme before two passes must be made over the data to partition it. If we assume that both relations are the same size and the output size is one block, the largest relation which will be partitioned in one pass is 1344 Mbytes in size, assuming a 4 Mbyte buffer area, an 8 kbyte page size and the values for T_K, T_T, T_C, T_J and T_P given above. The values for the variables are B_A = 506, B_B = 5, B_R = 1, P = 170, B_P = 3, and the join would take over 6 hours to execute. For relations larger than this, it is better to make two passes over the relation in which two blocks are allocated to each output partition than it is to make only one pass in which each output partition is allocated only one block. Note that this is similar to the capacity of the sort-merge algorithm when only one pass is permitted. To minimise the cost of the GRACE hash join algorithm which partitions in place, we must minimise C_GH in the presence of four variables: B_A, B_B, P and δ. We set B_R = B − B_A − B_B, B_P = ⌊(B − P)/P⌋ and B_I = PB_P. We must be able to find the minimal value in a small period of time, relative to the time taken by the join, for this to be useful. To determine how to find the minimum we have plotted the cost of the join versus B_A after setting B_R = 1, P = 10 and δ = 1 (the actual values which result in the minimal cost for the best B_A) in Figure 4. This graph is very similar to the graphs produced using the nested block algorithm, such as Figure 1. We use a similar minimisation algorithm for the GRACE hash join method to the one which minimises the nested block join algorithm. The primary difference is that the value of B_A is derived from the value of P and the number of passes. Thus, instead of changing B_A directly, the values of P and δ are changed. The minimisation algorithm is presented in Figure 5.

3.2.2 Modified Hybrid Hash

The buffer arrangement of our modified hybrid hash join algorithm is that described in Figure 3. For the cost C_HH to be valid, the constraints are now that B_A + B_B + B_R ≤ B, 1 ≤ B_A ≤ V_A, 1 ≤ B_B ≤ V_B, B_R ≥ 1, P > 1, B_P ≥ 1, B_H ≥ 1 and B_I = PB_P ≤ B − P − B_H − B_R.

Figure 4: GRACE hash join algorithm varying B_A. V_A = V_B = 1000, B_R = V_R = 1, B = 64, P = 10, B_P = 5, δ = 1, T_K = 0.0243, T_T = 0.00494, T_P = T_C = T_J = 0.015. [Graph: cost (secs) versus the number of pages in the buffer for the smaller relation; the minimum cost of 120 seconds is marked.]

Our algorithm to minimise C_HH is similar to that which minimises the GRACE hash join algorithm, except that instead of minimising the cost using three variables it must minimise using seven: B_A, B_B, B_R, P, B_P, B_H and δ. We set B_I = B − B_H − B_R − PB_P. It takes significantly longer to produce a result than any of the other minimisation algorithms. Additionally, it is not feasible to perform an exhaustive search to determine whether it finds the minimal result or not. As a consequence, we do not present it here.

4 Computational Results

In the remaining sections, when we use the GRACE hash (hybrid hash) algorithm we mean the modified GRACE hash (hybrid hash) algorithm described in Section 3.2.1 (3.2.2). In this section we provide examples of the expected performance of the join algorithms under the cost model described in Section 2. We compare the nested block join algorithm, the GRACE hash join algorithm, and the hybrid hash join algorithm with the respective version of each algorithm commonly described in the literature. We then compare the expected performance of all four join algorithms described in Section 2 on a variety of joins on relations of different sizes, and report some preliminary experimental data based on programs which read and write the relevant number of pages for each algorithm.

4.1 Nested Block

While the buffer arrangement which results in the minimal value of C_NB is obviously the best buffer arrangement to use, determining it requires some CPU cost. Therefore, we would like to know whether the improvement in performance makes this worthwhile on a general basis.

Figure 5: Functions for minimising the cost of the GRACE hash join algorithm.

function minimiseGH(V_A, V_B, V_R, B)
    (mincost, B_A^+, B_B^+, B_R^+) ← minimiseNB(V_A, V_B, V_R, B)
    P^+ ← 0; δ^+ ← 0
    runcost ← ∞; δ ← 1
    do
        prevcost ← runcost
        for i ← 1 to ∞ do
            for P ← ⌈(V_A/(i(B − 2)))^{1/δ}⌉ to ⌊B/3⌋ do
                B_A ← ⌈V_A/(iP^δ)⌉
                if B_A ≤ B − 2 then
                    (cost, B_B, B_R) ← minimiseBB(V_A, V_B, V_R, B, B_A, P, δ)
                    if cost < mincost then
                        (mincost, runcost, B_A^+, B_B^+, B_R^+, P^+, δ^+) ← (cost, cost, B_A, B_B, B_R, P, δ)
                    else if cost < runcost then
                        runcost ← cost
                    else if cost > τ · runcost then
                        break                     # this cost is much worse, so end this run
                    end if
                end if
            end for
        end for
        δ ← δ + 1                                 # prepare for the next number of passes
    while runcost < prevcost
    return mincost, B_A^+, B_B^+, B_R^+, P^+, δ^+
end function

function minimiseBB(V_A, V_B, V_R, B, B_A, P, δ)
    mincost ← ∞; B'_B ← 0; i ← 1
    while B'_B ≠ 1 do
        B_B ← ⌈(V_B/P^δ)/i⌉                       # find the largest integer (almost) dividing the partition size
        if B_B ≤ B − B_A − 1 and B'_B ≠ B_B then
            B_R ← B − B_A − B_B
            cost ← C_GH(V_A, V_B, V_R, B_A, B_B, B_R, P, δ)
            if cost < mincost then
                (B_B^+, B_R^+, mincost) ← (B_B, B_R, cost)   # save best values
            else if cost > τ · mincost then
                break                             # this cost is much worse, so finish
            end if
        end if
        B'_B ← B_B                                # save B_B to ensure we don't try it twice
        i ← i + 1                                 # prepare for next value of B_B
    end while
    return mincost, B_B^+, B_R^+
end function

Figure 6: Nested block join algorithm comparison. V_B = 1000, V_R = 100, B = 64, T_K = 0.0243, T_T = 0.00494, T_C = T_J = 0.015. [Graph: cost (secs) versus outer relation size (pages) for Nested block (std), Nested block (Hag) and Nested block (opt).]

To this end, we have plotted the cost of each algorithm for a range of values of V_A for a fixed V_B, V_R, B and realistic values for T_K, T_T, T_C and T_J. The result of this is shown in Figure 6. If we assume a page size of 8 kbytes, V_B = 1000 approximately corresponds to an 8 Mbyte relation and V_R = 100 to an 800 kbyte result relation. B = 64 corresponds to 512 kbytes of main memory used during the join operation. The times T_K and T_T are based on a disk drive with 8 kbyte pages, an average seek time of 16 milliseconds, which rotates at 3600 RPM. Therefore, T_K = 24.3 msec and T_T = 4.94 msec. T_C and T_J are based on the operations which would be performed on a Sparcstation 10/30. While different values for all numbers are possible, Figure 6 is a representative comparison of the different versions of the nested block algorithm. The Nested block (opt) line corresponds to the minimal value of C_NB calculated using the minimisation algorithm we describe in Figure 2. Table 2 shows a number of the optimal values of B_A, B_B and B_R for various values of V_A. The Nested block (std) line corresponds to the standard version of the nested block algorithm commonly described in the literature, in which B_A = B − 2, B_B = B_R = 1. The Nested block (Hag) line corresponds to the version proposed by Hagmann [5], in which B_A = B_B = (B − 1)/2, B_R = 1. Note that while Hagmann's version is faster than the standard version for larger relations, it is approximately twice as slow for some of the smaller relations. This is in marked contrast to the results Hagmann reported, and is due entirely to our more realistic cost model, which includes CPU time in the cost of calculating the join. The results show that a significant improvement in performance is gained by using the buffer arrangement which results in the minimal value of C_NB. For example, when V_A = 100, the optimal version takes 51% of the time of the standard version and 55% of the time of Hagmann's version. When relation A can be contained within the memory buffer an improvement is still achieved by selecting better values for B_B and B_R. In addition, the minimisation algorithm is very fast. This is discussed further in Section 4.5.

Table 2: Optimal buffer allocation of B_A, B_B and B_R for the nested block join algorithm when V_B = 1000, V_R = 100, B = 64, T_K = 0.0243, T_T = 0.00494, T_C = T_J = 0.015.

V_A    B_A  B_B  B_R
1      1    50   13
32     32   25   7
50     50   11   3
64     32   26   6
100    50   11   3
128    43   18   3
1000   56   7    1

4.2 GRACE Hash

As with the nested block join algorithm, while the buffer arrangement of the GRACE hash join algorithm which results in the minimal value of C_GH is obviously the best buffer arrangement to use, determining it requires some CPU cost. We would like to know if the improvement in performance which can be achieved makes determining the minimal arrangement worthwhile. We have plotted the cost of each algorithm for a range of values of V_A for the same fixed V_B, V_R, B, T_K, T_T, T_C, T_J, T_P used above. The result of this is shown in Figure 7. The GRACE hash (opt) line corresponds to the minimal value of C_GH calculated using the minimisation algorithm. The GRACE hash (std) line corresponds to the standard version of the GRACE hash algorithm, in which B_A = B − 2, B_B = B_R = 1, and P = B. As with the nested block algorithm, the results show that a significant improvement in performance is gained by using the buffer arrangement providing the minimal value of C_GH. For example, when V_A = 200, the optimal version takes 34% of the time of the standard version, and when V_A = 1000 it takes 49% of the time of the standard version. Like the nested block minimisation algorithm, the time taken by the GRACE hash minimisation algorithm is small. This is discussed further in Section 4.5.

4.3 Hybrid Hash and Simulated Annealing

While the buffer arrangement which results in the minimal value of C_HH is obviously the best buffer arrangement to use, determining it using our minimisation algorithm requires a large CPU cost, often much larger than performing the join itself. To reduce this we used simulated annealing [1] to attempt to find a near-optimal buffer arrangement in a much shorter period of time. We would like to know whether the improvement in performance achieved using simulated annealing makes determining the minimal arrangement for the hybrid hash join algorithm worthwhile. The parameters of the simulated annealing algorithm were chosen so that it would terminate in around 60 seconds, although in all our tests times actually varied between 15 and 130 seconds. This is discussed further in Sections 4.4 and 4.5. We have plotted the cost of each algorithm for a range of values of V_A for the same fixed V_B, V_R, B, T_K, T_T, T_P, T_C, T_J as the nested block and GRACE hash algorithms. The result of this is shown in Figure 8. The Hybrid hash (opt) line corresponds to the near-minimal value of C_HH calculated using simulated annealing. Simulated annealing obviously did not always determine the minimal value at each point, because the gradient of the line would be non-decreasing if it did. The Hybrid hash (std) line corresponds to the standard version of the hybrid hash algorithm, in which B_A = B − 2, B_B = B_R = B_P = B_I = 1, P = ⌈(V_A − (B − 2))/((B − 2) − 1)⌉ + 1 and B_H = B − P − B_I − B_R. As with the nested block and GRACE hash algorithms, the results show that a significant improvement in performance is gained by using the buffer arrangement providing a near-minimal value of C_HH.

Figure 7: GRACE hash join algorithm comparison. V_B = 1000, V_R = 100, B = 64, T_K = 0.0243, T_T = 0.00494, T_P = 0.0018, T_C = T_J = 0.015. [Graph: cost (secs) versus outer relation size (pages) for GRACE hash (std) and GRACE hash (opt).]

Figure 8: Hybrid hash join algorithm comparison. V_B = 1000, V_R = 100, B = 64, T_K = 0.0243, T_T = 0.00494, T_P = 0.0018, T_C = T_J = 0.015. [Graph: cost (secs) versus outer relation size (pages) for Hybrid hash (std) and Hybrid hash (opt).]

Figure 9: Join algorithm comparison. V_B = 1000, V_R = 100, B = 64, T_K = 0.0243, T_T = 0.00494, T_P = 0.0018, T_C = T_J = 0.015, T_S = 0.013, T_M = 0.0025, k_S = 0.00144. [Graph: cost (secs) versus outer relation size (pages) for Nested block (opt), Sort-merge (opt), GRACE hash (opt) and Hybrid hash (opt).]

For example, when V_A = 200, the optimal version takes 41% of the time of the standard version, and when V_A = 1000 it takes 50% of the time of the standard version.

4.4 Join Algorithm Comparison

We have seen that using a more realistic cost model for determining memory buffer usage for the nested block, GRACE hash and hybrid hash join algorithms can result in a significant improvement in the performance of the join algorithm. We now compare the four join algorithms, using the same example as above. The result of this is shown in Figures 9 and 10. Figure 10 contains an enlarged version of part of Figure 9. Others using the standard cost model, such as Blasgen and Eswaran [2], found that when the outer relation may be contained within main memory, the nested block algorithm performs best. Figures 9 and 10 show that this is still the case under our cost model. Note that our definitions of the GRACE and hybrid hash algorithms presented in Section 2 reduce to the nested block algorithm when no partitioning passes are made over the data, hence the costs are the same. As the size of the outer relation grows, the other join algorithms all perform better than the nested block algorithm. The result that differs from that reported in the past using the standard cost model, such as by DeWitt, et al. in [4], is that, in general, the GRACE hash algorithm performs as well as, or better than, the hybrid hash algorithm for large relation sizes. The evidence provided in the next section indicates that this is likely to be because the simulated annealing algorithm failed to find the optimal buffer allocation in the time available. We discuss this further in the next section. When the relation size is small, but larger than the buffer size, the hybrid hash algorithm performs better than all the other algorithms. This is because a large percentage of the relations may be joined during the first pass over the data. Thus, the amount of data which does not have to be written to disk, read back in and then joined will be large enough to ensure that the cost of the hybrid hash algorithm is significantly smaller than that of the other methods.

Figure 10: Join algorithm comparison. V_B = 1000, V_R = 100, B = 64, T_K = 0.0243, T_T = 0.00494, T_P = 0.0018, T_C = T_J = 0.015, T_S = 0.013, T_M = 0.0025, k_S = 0.00144. [Graph: enlargement of Figure 9 for outer relation sizes up to 300 pages.]

Again, as reported in [4] using the standard cost model, the sort-merge algorithm does not perform as well as either hash-based algorithm, despite the fact that our version of the sort-merge algorithm has a lower cost than the version usually reported.

4.5 Join Algorithm Comparison: Optimisation Times

In the previous four sections we have shown that the optimal versions of each of the algorithms will perform better than the standard versions. Therefore, whether the optimal versions are the best to use in practice will depend on the time taken to find the optimal buffer allocation in each case. The sum of the time taken to determine the optimal buffer allocation and then to execute the join must be smaller than the cost of simply using the standard version of each algorithm for this scheme to be worthwhile. To determine the likely relative performance of each join and minimisation algorithm we generated 120 random join queries. For each query, the value of B was randomly chosen to be one of 64, 128, 256 or 512. If we assume pages are 8 kbytes, this tests buffer sizes up to 4 Mbytes. The values of V_A, V_B and V_R were chosen randomly such that V_A < V_B. The extreme values for these variables are shown in Table 3. For each of these queries the minimisation algorithms for the nested block and GRACE hash algorithms were used to determine their "minimal" values. Simulated annealing was used for the extended version of the modified hybrid hash algorithm which generalises both the hybrid hash and the GRACE hash algorithms. That is, B_H was allowed to be zero, and if it was then B_R = 0 was set during the partitioning phase. This was done so that the GRACE and hybrid hash algorithms could be compared equally, and to see if simulated annealing was, in fact, determining the optimal buffer allocation. An exhaustive search was also performed for the nested block, sort-merge and GRACE hash algorithms to determine their actual minimal values.
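A sketch of how such a workload can be generated follows; the exact distribution used in the paper is not stated, so we simply draw uniformly within the observed extremes of Table 3:

    import random

    def random_join_queries(n=120, seed=0):
        """Random join queries as in Section 4.5: B from {64, 128, 256, 512}
        and V_A < V_B, with ranges bounded by Table 3."""
        rng = random.Random(seed)
        queries = []
        for _ in range(n):
            B = rng.choice([64, 128, 256, 512])
            V_B = rng.randint(42, 59405)
            V_A = rng.randint(16, min(21229, V_B - 1))
            V_R = rng.randint(3, 28129)
            queries.append((V_A, V_B, V_R, B))
        return queries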

Table 3: Range of random values for V_A, V_B, V_R.

Variable  Minimum  Maximum
V_A       16       21229
V_B       42       59405
V_R       3        28129

Table 4: Algorithm resulting in minimal buffer allocation.

                                Number of minimal costs
Algorithm                Excluding      Including minimisation time
Nested block             48             52
GRACE and hybrid hash    9              0
GRACE hash only          28             63
Hybrid hash only         35             5
Sort-merge               0              0
Total                    120            120

The performance of the nested block and GRACE hash minimisation algorithms was very good. The nested block minimisation algorithm always found its minimal value, and always ran in less than 0.05% of the time to perform the join. The GRACE hash minimisation algorithm found its minimal value on all but one occasion, and on that occasion its value was 0.04% greater than the minimal value. Finding the minimal value took less than 0.4% of the time taken to perform the join on all but two occasions, and the longest time taken was 0.6% of the time taken to perform the join. Thus, we believe that these minimisation algorithms should be used in practice when executing joins, particularly given the improvements demonstrated in Sections 4.1 and 4.2.

The running time of the simulated annealing minimisation of the extended hybrid hash algorithm varied between 3% and 96% of the time it would take to both use simulated annealing to determine the buffer allocation and then to perform the join. When the nested block algorithm wasn't the optimal join method, the running time varied between 3% and 77% of the total time. Unfortunately, simulated annealing did not always find the optimal buffer allocation. We know this because, for some of the random join queries, the cost of the hybrid hash algorithm using the buffer allocations determined using simulated annealing was higher than the cost of the buffer allocations found by the GRACE hash minimisation algorithm. As the simulated annealing algorithm was minimising a cost function which generalised the GRACE hash join method, in addition to the hybrid hash join method, the minimal cost should be less than, or equal to, that found by the GRACE hash minimisation algorithm.

Table 4 summarises the performance of each of the minimisation algorithms. Of the 120 joins, the nested block join algorithm was the best (or equal best) algorithm to use in 48 of the joins based upon execution time alone, and was clearly the best in 52 of the cases when the time taken to determine the optimal buffer allocation was also considered. The GRACE and hybrid hash algorithms determined minimal buffer allocations with the same cost on 9 occasions. The results in Table 4 show that the simulated annealing minimisation algorithm we used for the extended hybrid hash join method, which encompasses both the original hybrid hash join method and the GRACE hash join method, needs to be improved significantly before it will be generally useful. Conversely, we believe that the minimisation algorithm for the GRACE hash join algorithm can be used now to determine the optimal buffer allocation for any join.

Interestingly, for 35 of the 78 join queries in which algorithms other than the nested block algorithm produced the minimal cost, the simulated annealing algorithm produced a minimal value such that B_H = 0. That is, the GRACE hash method resulted in a lower cost than the traditional hybrid hash method, in which B_H ≥ 1 is a constraint. On only 9 of these 35 occasions was the buffer allocation determined by the simulated annealing algorithm the same as that produced by the GRACE hash minimisation algorithm. On the other 26 occasions the cost of the allocation produced by the simulated annealing algorithm was greater than that produced by the GRACE hash minimisation algorithm. As a result, we also do not have confidence that the hybrid hash costs in Figures 9 and 10 of Section 4.4 necessarily represent the minimal costs of this algorithm, although they are likely to be closer to the minimal costs for smaller values of V_A.

The expected performance of the sort-merge algorithm reinforced the results shown in Figure 9. Due to the restricted version of the algorithm we are examining, only a limited number of joins were appropriate. Twenty-eight of the 120 joins were too big to sort the relations in one pass, and in another 48 joins the nested block algorithm produced the minimal value. The sort-merge algorithm had a higher cost than the GRACE hash algorithm for all of the remaining 44 joins. The additional cost varied between 17% and 208% of the cost of the GRACE hash algorithm. The primary advantage of the sort-merge algorithm is that it avoids the problem of uneven partition sizes which can affect the performance of the hash join algorithms. We discuss how this problem can be reduced for the hash join algorithms in Section 5. An experimental investigation into the effectiveness of this and a comparison with the sort-merge algorithm will be performed in the future.

Instead of using a random starting point for the simulated annealing algorithm, we also tried using the buffer allocation produced by the GRACE hash minimisation algorithm. This method proved cost effective for only 24 of the 120 join queries. That is, the improvement in the cost was greater than the time taken for the single run of the simulated annealing algorithm (which varied between 0.2 and 3.4 seconds of CPU time) for 24 join queries, with an average improvement of 1.9% across all 120 queries. Of the 24 join queries for which a lower cost was found, the cost determined was equal to that determined by the simulated annealing algorithm described above for 2 of the joins. The cost was less than the cost determined by the simulated annealing algorithm for 2 of the joins, and greater for the other 20 joins. The differences were typically small (14 of the 20 were less than 2 seconds); however, one was almost 60 seconds. We do not believe that this is clearly a better algorithm to use, as the average improvement across all queries is small for our set of experimental join queries.

Another possible method of determining the optimal buffer allocation for the hybrid hash algorithm would be a combination of storing precomputed optimal buffer allocations for various input and output relation sizes, in conjunction with a (small) search around that optimal buffer allocation for the given relation sizes. Storing all possible combinations of relation sizes is unlikely to be possible unless the sizes of potential relations are severely constrained. Additionally, in a multi-user system the amount of memory available as buffer space is likely to vary depending on the system load.
Thus, combinations of relation and buffer sizes would have to be stored, so the storage of optimal buffer allocations for all possible joins is impractical. The storage of only unique optimal buffer allocations is also likely to be impractical. With four distinct variables (B_A, B_R, B_H and P, with B_B and B_P derived from them), the number of possible buffer allocations is O(B^4). If the buffer size can vary as well, it clearly becomes impractical to store buffer allocations for many values of B. An open problem remains to determine how many buffer allocations to store and how to efficiently derive an optimal allocation from them.

In conclusion, we have seen that the combination of the nested block and GRACE hash join algorithms, and their respective minimisation algorithms, provides a significant improvement over the standard versions of all the join algorithms we have examined. We believe that these are the join algorithms which should be implemented. The use of simulated annealing to determine the optimal buffer allocation for the hybrid hash join method is not accurate enough and is too slow to be cost effective to implement. We believe the optimal hybrid hash method will become viable only when we have a more effective optimisation algorithm for determining optimal buffer allocations at the time of query optimisation. None of the optimisation algorithms examined to date are effective because of the amount of CPU time they require compared with the improvement in performance.

Table 5: Buffer size changes when the relationship between T_K and T_T is varied. V_B = 1000, V_R = 100, B = 64, T_C = T_J = 3T_T, T_P = 0.4T_T.

           Nested block (V_A = 62)        GRACE hash (V_A = 1000)
T_K/T_T    B_A  B_B  B_R  Cost ratio      B_A  B_B  B_R  P    Cost ratio
1          62   1    1    1.53            50   10   4    20   1
2          62   1    1    1.29            50   10   4    20   1
3          62   1    1    1.11            50   10   4    20   1
4          31   28   5    1               50   10   4    20   1
5          31   28   5    1               50   10   4    20   1
6          31   28   5    1               50   10   4    20   1
7          31   28   5    1               42   17   5    12   1.01
8          31   28   5    1               42   17   5    12   1.03

4.6 Stability of Optimal Allocation: Varying Seek and Transfer Times

It is important to know the effect of the seek and transfer times, T_K and T_T, on the optimal buffer arrangement. These times are used in the calculation of the cost of each join operation, and different hardware typically has different values for each of these times. If a small variation in the relationship between these values had a significant impact on the stability of the optimal buffer arrangement, then the characteristics of each disk drive would have to be known. This would make this method of determining optimal buffer arrangements much less useful, and difficult to apply. Table 5 shows typical values of V_A for which each algorithm will be used, while V_B, V_R, B, T_C, T_J and T_P are held constant. Let C(β, N) be the cost of the join algorithm using buffer arrangement β when N = T_K/T_T. To calculate the cost ratio for a given value N′ of T_K/T_T, we first find the optimal buffer arrangement β for T_K/T_T = 5 and set C_1 = C(β, N′). We then find the optimal buffer arrangement β_1 for T_K/T_T = N′ and set C_2 = C(β_1, N′); the cost ratio is C_1/C_2. Table 5 shows that the relationship between T_K and T_T does not have an enormous impact on the cost of the optimal buffer arrangement. We believe that the number of occasions on which an extremely accurate estimation of the relationship between the seek and transfer times is required will be relatively rare.
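The cost-ratio calculation is mechanical. A Python sketch, with cost_fn and optimise as stand-ins of ours for C and the minimisation algorithms of Section 3:

    def cost_ratio(cost_fn, optimise, N_ref, N_prime):
        """Section 4.6 cost ratio: how much worse the allocation optimised for
        the reference ratio N_ref is when the true ratio is N_prime."""
        ref_alloc = optimise(N_ref)      # allocation tuned for the reference ratio
        true_alloc = optimise(N_prime)   # allocation tuned for the actual ratio
        C1 = cost_fn(ref_alloc, N_prime)
        C2 = cost_fn(true_alloc, N_prime)
        return C1 / C2                   # 1 means the reference allocation is still optimal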

4.7 Stability of Optimal Allocation: Varying CPU and Disk Times

It is also important to know the effect of CPU times, compared with the disk seek and transfer times, on the optimal buffer arrangement. As with the relationship between seek and transfer times, these times are used in the calculation of the cost of each join operation. Not only does different hardware have different values for each of these times, but the operating system and software used also affect them. If a small variation in the relationship between these values had a significant impact on the stability of the optimal buffer arrangement, then the relationship between the values would have to be known. This would make this method of determining optimal buffer arrangements much less useful, and more difficult to apply.

Table 6: Buffer size changes when the relationship between T_J and T_T is varied. V_B = 1000, V_R = 100, B = 64, T_K = 5T_T, T_C = T_J, T_P = T_J/8.

           Nested block (V_A = 62)        GRACE hash (V_A = 1000)
T_J/T_T    B_A  B_B  B_R  Cost ratio      B_A  B_B  B_R  P    Cost ratio
0.5        31   28   5    1               50   12   2    10   1.10
1          31   28   5    1               42   17   5    12   1.06
1.5        31   28   5    1               42   17   5    12   1.03
2          31   28   5    1               50   10   4    20   1
3          31   28   5    1               50   10   4    20   1
4          31   28   5    1               50   10   4    20   1
5          62   1    1    1.07            50   10   4    20   1

Table 6 shows typical values of V_A for which each algorithm will be used, while V_B, V_R and B are held constant. The ratio of T_J to T_T is varied, while the other values are set so that T_K = 5T_T, T_C = T_J and T_P = T_J/8. We use a similar analysis to the previous section to derive the cost ratio. Let C(β, N) be the cost of the join algorithm using buffer arrangement β when N = T_J/T_T. To calculate the cost ratio for a given value N′ of T_J/T_T, we first find the optimal buffer arrangement β for T_J/T_T = 3 and set C_1 = C(β, N′). We then find the optimal buffer arrangement β_1 for T_J/T_T = N′ and set C_2 = C(β_1, N′); the cost ratio is C_1/C_2. Table 6 shows that the relationship between T_J and T_T does not have an enormous impact on the cost of the optimal buffer arrangement.

4.8 Benefits of Optimal Allocation: Varying CPU and Disk Times

The current trend in hardware technology is for CPU speeds to increase at a greater rate than disk drive technology (seek and transfer rates). As the cost of memory is decreasing, it is also likely that in the future more memory will be available to be used for buffers. In the future, will it be more or less beneficial to use the optimal buffer allocation? Figures 11 and 12 show how the ratio between the CPU time constants (T_C, T_J and T_P) and the disk time constants (T_K and T_T) affects the performance of the GRACE hash join algorithm. The ratio of the cost of the standard buffer allocation to that of the optimal buffer allocation is compared for a number of different buffer and relation sizes. These figures show that as the CPU speed gets faster, relative to the disk seek and transfer speeds, the optimal buffer allocations perform increasingly better than the standard buffer allocations. Figure 11 also demonstrates the effect of varying the total buffer size: as the amount of memory available for buffers increases, so does the advantage of the optimal allocation over the standard allocation. Figure 12 also demonstrates the effect of varying the relation sizes. It shows that smaller relations exhibit a greater performance improvement from the optimal buffer allocation than larger ones; however, the difference is not as great as the impact of using different buffer sizes.

4.9 Experimental Results

Programs were written which performed the appropriate I/O and CPU operations for the nested block and GRACE hash join algorithms when provided with input files. This was implemented on a Sun Sparcstation 10 and relied on the UNIX file system for file management. Thus, we had no control over the disk accesses required to retrieve the data. The SunOS file system [8] attempts to keep consecutive pages together in cylinder groups wherever possible, as well as buffering disk pages and reading disk pages ahead. To combat these features of the file system we performed the join operation on very large files and read the files backwards.

Figure 11: Relative cost of optimal and standard buffer allocations when the ratio TJ/TT varies. VR = 100, TK = 5TT, TC = TJ, TP = TJ/8. GRACE hash join algorithm. [Plot of performance gain (standard cost / optimal cost) against the CPU to disk ratio, for VA = VB = 1000 with B = 64, 128 and 256; the point corresponding to current technology is marked.]

Figure 12: Relative cost of optimal and standard buffer allocations when the ratio TJ/TT varies. VR = 100, TK = 5TT, TC = TJ, TP = TJ/8. GRACE hash join algorithm. [Plot of performance gain (standard cost / optimal cost) against the CPU to disk ratio, for B = 128 with VA = VB = 500, 1000 and 2000; the point corresponding to current technology is marked.]

Table 7: Experimental results: time taken by join algorithms. VA = 2000 (109 Mbyte), VB = 2000, VR = 0.

Type                  BA   BB   BR    P   Passes   Time (sec)   Expected (sec)
Nested block (opt)   118   10    0    -      -        2879            2809
Nested block (std)   127    1    0    -      -        2949            3136
Nested block (Hag)    64   64    0    -      -        4775            4992
GRACE hash (opt)     100   28    0   20      1         412             419
GRACE hash (std)     127    1    0  127      1         515             578

features of the file system, we performed the join operation on very large files and read the files backwards. We did this to attempt to simulate a system in which the database has direct control over the file system and there is no operating system supported cache. By reading through very large files we minimised the number of pages which would be saved in the buffer cache (although there is no separate fixed-size buffer cache in SunOS), so that the next time a page was required it would be read from disk. By reading the files backwards we defeated the read-ahead mechanism, which reads subsequent pages when a file is read sequentially forwards. However, as the file system reads seven 8 kbyte pages at a time, even reading backwards through the file one page at a time does not result in a disk access for every read. To combat this we set the page size to 56 kbytes. The amount of memory reserved for the memory buffer was 128 of the 56 kbyte pages, a total of 7 Mbytes. The large page size has the effect of moving the cost of the standard algorithm a lot closer to that of the optimal algorithm. However, if the page size is assumed to be 8 kbytes, the performance difference would be of the order of 2 to 3 times, as in Figures 11 and 12. The disk drive used was an Elite-2. In the calculation of expected results for this disk we assumed that TK = 16.556 msec and TT = 14.993 msec. We also assumed that TP = 10.9 msec, TC = 26.3 msec and TJ = 63.9 msec. These values were calculated using a separate program which performed the same memory based operations as our join simulation did on a single 56 kbyte page; they represent the average of a number of runs of this program. The time taken by each join operation, and the expected time calculated using the costs in Section 2, are shown in Table 7; both represent the total time in seconds taken by the join operation. The times reported are the average of four runs of each program. The results show that our estimates of the times involved are reasonable, given our lack of control over the buffer cache, our lack of knowledge of the exact values of TK, TT, TP, TC and TJ during the test runs, and the fact that the average seek time is not the exact seek time for each file. These results show that using the optimal buffer allocation instead of the standard one is likely to result in the improvements that we have earlier shown are theoretically possible.
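As an illustration, the backwards, large-page read pattern described above can be sketched as follows. This is a minimal example, assuming a hypothetical file and the 56 kbyte page size from the text; it is not the original test program.

    PAGE_SIZE = 56 * 1024  # 56 kbyte pages, chosen to defeat read-ahead

    def read_pages_backwards(path):
        """Yield the pages of a file from last to first, so that the file
        system's forward read-ahead mechanism is never useful."""
        with open(path, "rb") as f:
            f.seek(0, 2)                  # seek to the end of the file
            size = f.tell()
            offset = (size // PAGE_SIZE) * PAGE_SIZE
            if offset == size:            # size is an exact page multiple
                offset -= PAGE_SIZE
            while offset >= 0:
                f.seek(offset)
                page = f.read(PAGE_SIZE)  # one read, likely one disk access
                yield page
                offset -= PAGE_SIZE

Because each 56 kbyte read starts at a position the file system has not prefetched, each read is likely to cause a genuine disk access, which is the behaviour the experiments required.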

5 Non-Uniform Data Distributions

The problem of non-uniform data distributions is a common one for all hash based techniques. To address this problem, Nakayama, et al. proposed an extension to the hybrid hash method, called the Dynamic Hybrid GRACE Hash join method (DHGH) [11]. Their method dynamically determines which partitions will be stored in memory and which will be stored on disk, depending upon the distribution of the data. In [6] they provide an analysis of DHGH, show the effect of varying partition sizes, and show that a large number of small partitions is the best way to handle non-uniform distributions. However, these small partitions are combined for the join phase to minimise the join cost. As with almost all other analyses of join algorithms, they count disk pages transferred in their cost model. DHGH assumes that there is one input block and that all the other blocks are for the partitions. As we have seen, this is not generally optimal when disk page access time is taken into account, as

in our cost model. Additionally, their method does not support partitioning of the data in place, which has significant benefits for the extended GRACE and hybrid hash methods. Our proposal is that a sample of the first relation is read. Each record is hashed into a range that is a number of times the size of the desired number of partitions. A table is constructed of the frequency of each of the hash values. Finally, each hash value is assigned to a partition such that the partitions are of equal size. Each record in each relation is then placed in the appropriate partition by looking up its assigned partition based on its hash value; a sketch of this assignment step is given below. The additional overhead of this method, compared with the methods described in the previous sections, is likely to be very small compared to the running time of the algorithm. We call this method the Optimal GRACE Hash algorithm for Non-Uniform distributions (OGHNU), and the original method the Optimal GRACE Hash algorithm for Uniform distributions (OGHU). The OGHNU method assumes that the sample of the first relation is representative of the whole relation. This may be achieved in two ways. A small sample of randomly chosen pages from the first relation may be read; this would effectively require TK + TT time for each page, so only a small number of pages should be read. The other way would be to read the B pages as the GRACE hash algorithm would normally do. As in DHGH, OGHNU will take more CPU time, due to the construction of the table and the grouping of partitions. Therefore, the OGHNU method will be a little more expensive than the OGHU method when applied to a uniform data distribution. However, the OGHNU method would be much more efficient than the OGHU method for non-uniform data distributions. Therefore, we believe that the OGHNU method is the best method to use as a general method which tolerates skewed data distributions.
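The partition-assignment step just described can be sketched as follows. This is a minimal illustration, assuming a hypothetical hash function, a sample of join keys and an expansion factor of 8; the greedy balancing loop is one simple way of making the partitions roughly equal in size, not necessarily the exact procedure intended above.

    from collections import Counter

    def build_partition_table(sample_keys, num_partitions, expansion=8):
        """Hash the sampled keys into expansion * num_partitions buckets,
        count the frequency of each hash value, then assign each bucket to
        a partition so that the partitions are of roughly equal size."""
        num_buckets = expansion * num_partitions
        freq = Counter(hash(k) % num_buckets for k in sample_keys)
        table = {}
        load = [0] * num_partitions
        # Greedy balancing: place the most frequent buckets first, each
        # into the currently least loaded partition.
        for bucket, count in freq.most_common():
            p = load.index(min(load))
            table[bucket] = p
            load[p] += count
        # Buckets not seen in the sample are spread round-robin.
        for bucket in range(num_buckets):
            table.setdefault(bucket, bucket % num_partitions)
        return table

    def assign_partition(key, table, num_buckets):
        """Look up the partition of a record from its hash value."""
        return table[hash(key) % num_buckets]

Each record of both relations is then routed by assign_partition, so the per-record overhead is one hash computation and one table lookup, consistent with the small overhead claimed above.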

6 Conclusion

In this paper, we have presented an analysis of four common join algorithms, the nested block, sort-merge, GRACE hash and hybrid hash join algorithms, based upon the time required to access and transfer a page from disk and the processing cost of the join. This is a generalisation of both the commonly used method of counting the number of disk pages transferred and a proposed alternative of counting the number of operations (seeks), in which any number of disk pages may be transferred as a single operation. We have shown that it is very important to consider both the disk seek and transfer times and the CPU times involved in the join algorithms for optimal performance. We have presented cost effective algorithms to quickly find the optimal buffer allocations of the nested block and GRACE hash join algorithms, and suggested a mechanism for handling non-uniform data distributions. The time taken by the minimisation algorithms is often less than 0.3% of the running time of the join operation, and was always less than 1% for the cases we have tried. We have reported experimental results which confirm that the performance gains reported in this paper are likely to be achieved in practice. With current disk and CPU technology, we expect performance gains of the order of two to three times when using the join algorithms with optimal buffer allocations rather than the standard join algorithms. In the future, as the speed of CPUs relative to disks increases, the use of optimal buffer allocations will become more important. Our modified version of the GRACE hash join algorithm will usually perform better than the extended hybrid hash algorithm when the cost of calculating the optimal buffer allocation is taken into account. We require a cost effective minimisation algorithm for the hybrid hash algorithm to become competitive. We suggested an algorithm based on simulated annealing; however, this algorithm is not cost effective. Using it on 120 experimental joins, we found only 6 instances (5%) in which it was cost effective to minimise the extended hybrid hash algorithm. The approach is improved by using the buffer allocation produced by the GRACE hash minimisation algorithm as a starting point, instead of a random starting point; even so, it was cost effective on only 20% of the joins tested, resulting in an average improvement of 1.9% across all the joins tested. Unless a better algorithm to minimise the extended hybrid hash algorithm is found, we believe

that the best algorithms to implement will be the GRACE hash join algorithm and the GRACE hash minimisation algorithm. Even in an operating system environment in which file system access is not directly under the control of the database, there will be a decrease in the total cost as a result of a reduction in the number of system calls required to read the data and in the CPU time of the memory based parts of the join algorithms. Further work in progress arising from this paper includes an experimental study into the effects of non-uniform data distributions on the algorithms we have described. Another issue is to improve on the extended hybrid hash minimisation algorithms presented in this paper.

Acknowledgements

The authors would like to thank Zoltan Somogyi for his valuable comments on an earlier draft. The first author is partially supported by an APRA scholarship. The second author would like to acknowledge the support of the Australian Research Council, the Cooperative Research Centre for Intelligent Decision Systems, and the Key Centre for Knowledge Based Systems.

References

[1] E. Aarts and J. Korst. Simulated Annealing and Boltzmann Machines. Wiley, 1989.

[2] M. W. Blasgen and K. P. Eswaran. Storage and access in relational data bases. IBM Systems Journal, 16(4), 1977.

[3] J. Cheng, D. Haderle, R. Hedges, B. R. Iyer, T. Messinger, C. Mohan, and Y. Wang. An efficient hybrid join algorithm: a DB2 prototype. In Proceedings of the Seventh International Conference on Data Engineering, pages 171–180, Kobe, Japan, April 1991. IEEE Computer Society Press.

[4] D. J. DeWitt, R. H. Katz, F. Olken, L. D. Shapiro, M. R. Stonebraker, and D. Wood. Implementation techniques for main memory database systems. In B. Yormark, editor, Proceedings of the 1984 ACM SIGMOD International Conference on the Management of Data, pages 1–8, Boston, MA, USA, June 1984.

[5] R. B. Hagmann. An observation on database buffering performance metrics. In Y. Kambayashi, editor, Proceedings of the Twelfth International Conference on Very Large Data Bases, pages 289–293, Kyoto, Japan, August 1986.

[6] M. Kitsuregawa, M. Nakayama, and M. Takagi. The effect of bucket size tuning in the dynamic hybrid GRACE hash join method. In P. M. G. Apers and G. Wiederhold, editors, Proceedings of the Fifteenth International Conference on Very Large Data Bases, pages 257–266, Amsterdam, The Netherlands, August 1989.

[7] M. Kitsuregawa, H. Tanaka, and T. Moto-oka. Application of hash to data base machine and its architecture. New Generation Computing, 1(1):66–74, 1983.

[8] L. W. McVoy and S. R. Kleiman. Extent-like performance from a UNIX file system. In Proceedings of the USENIX 1991 Winter Conference, pages 33–43, Dallas, Texas, USA, January 1991.

[9] T. H. Merrett. Why sort-merge gives the best implementation of the natural join. SIGMOD Record, 13(2):39–51, January 1981.

[10] P. Mishra and M. H. Eich. Join processing in relational databases. ACM Computing Surveys, 24(1):63–113, March 1992.

[11] M. Nakayama, M. Kitsuregawa, and M. Takagi. Hash-partitioned join method using dynamic destaging strategy. In F. Bancilhon and D. J. DeWitt, editors, Proceedings of the Fourteenth International Conference on Very Large Data Bases, pages 468–478, Los Angeles, CA, USA, August 1988.

[12] L. D. Shapiro. Join processing in database systems with large main memories. ACM Transactions on Database Systems, 11(3):239–264, September 1986.

[13] J. Vaghani, K. Ramamohanarao, D. B. Kemp, Z. Somogyi, and P. J. Stuckey. Design overview of the Aditi deductive database system. In Proceedings of the Seventh International Conference on Data Engineering, pages 240–247, Kobe, Japan, April 1991. IEEE Computer Society Press.
