Optimal Clustering of Relations to Improve Sorting and Partitioning for Joins

Evan P. Harris and Kotagiri Ramamohanarao
Department of Computer Science, The University of Melbourne, Parkville 3052, Australia
Email:
[email protected] The sorting or partitioning of relations is very common in relational database systems. Implementations of the join operation include the sort–merge join algorithm, which sorts both relations, and the hash join algorithm, which usually partitions both relations. We describe how clustering records using an optimal multi-attribute hash (MAH) "le, taking the query pattern and distribution into account, reduces the average cost of sorting or partitioning. We demonstrate that maintaining multiple copies of a data "le, each with a different clustering organization, further reduces the average cost of sorting or partitioning. We describe an inexpensive method for determining a good partitioning index (MAH "le organization). Our analysis and experiments show that the partitioning indexes we "nd are usually optimal and can often partition a relation more than ten times faster than by not using any clustering. We also show that a signi"cant change in the query pattern or distribution is required before a reorganization of a data "le is necessary, and that such a reorganization is, in general, an inexpensive operation. Received August 29, 1996; accepted September 8, 1997
1. INTRODUCTION
In database systems, such as relational databases, only a small proportion of the retrieval operations will be to answer simple partial-match queries. Other common operations include projection, join, division, intersection, union and set difference. The join [1] is one of the most common of these expensive operations. When relations are larger than main memory, joining them typically involves reading and writing the relations more than once, the exact number of times depending on the join method and the nature of the relations. Two of the most common types of join algorithm are the sort–merge join algorithm and variations on the hash join algorithm [2–8]. In the first phase of the sort–merge algorithm, the relations are sorted. In the first phase of the hash join algorithms, the relations are partitioned.

The cost of these join algorithms can be reduced if the relations are stored in a manner which facilitates the operation of the algorithm. For example, consider a relation whose records are stored in a file sorted on a particular attribute. For a join on that attribute, the first phase of the sort–merge algorithm is not necessary. If there are many joins on this attribute, it may be worthwhile incurring the costs of maintaining the relation in a sorted form to reduce the cost of these joins. Similarly, by maintaining a relation in a suitably partitioned form, the cost of a hash join can be reduced. Making use of multi-attribute clustered relations to reduce the amount of partitioning when performing a hash join has previously been suggested in [9–12]. Thom et al. [9] used
the multi-attribute hash (MAH) file as their clustering technique; Ozkarahan et al. [10, 11] used their own DYOP file; while Harada et al. [12] used both the grid file [13] and the multi-dimensional binary search tree (k-d tree) [14].

In this paper, we concentrate on reducing the average cost of the first phase of each join, by using an optimal clustering of records. We define an `optimal clustering of records' to be the file organization which minimizes the cost

    C_A = ∑_i p_i C_i,    (1)
where C_i is the cost of the ith join, taking maximum advantage of any clustering, and p_i is the probability of the ith join being asked, such that ∑_i p_i = 1. In Sections 4 and 5 we describe how to calculate each cost C_i, and in Section 7 we describe how to construct (near-)optimal MAH files using Equation 1. Similar techniques have been used to produce optimal clusterings of records for partial-match retrieval in [15–17] and for partial-match range queries in [18, 19].

Like Thom et al. [9], we use multi-attribute hashing as our example file organization. However, the optimization process can be applied to other multi-attribute indexing and storage techniques, such as the grid file [13], the k-d tree [14], the BANG file [20], the multilevel grid file [21] and the k-d-B-tree [22]. That is, it will work with explicit indexes (separate data structures) or organized files (such as hash files). For some of these data structures, the cost functions we present must be modified to reflect accurately
the cost of sorting or partitioning relations stored within them. This is discussed in more detail in [23].

As the cost of mass storage devices decreases, it is becoming feasible to store multiple copies of data, each with a different clustering organization. When the query performance is crucial, such that it is acceptable to maintain multiple copies of the data, we show that using multiple copies of a data file can increase the performance of the average sorting or partitioning operation. This is similar to the use of multiple files to improve the performance of partial-match retrieval, presented in [17].

In the next section we briefly describe multi-attribute hash files, then in Section 3 we describe the assumptions we make in calculating the costs of each operation. In Sections 4 and 5 we describe the cost of the partitioning and sorting operations, respectively, and give some results. In Section 6 we compare partitioning and sorting using MAH files and give some experimental results. In Section 7 we describe various approaches to bit allocation, in Section 8 we present computational results which show how these techniques perform, and in Section 9 we discuss updates and determine how our method performs when the probability distribution of the join queries changes. In Section 10 we discuss reorganizing a data file, in Section 11 we discuss non-uniform data distributions, in Section 12 we discuss other miscellaneous issues, and in the final section we present our conclusions. The appendices contain details of the query distributions used to produce our results.

2. MULTI-ATTRIBUTE HASH FILES
A `record' consists of a number of `fields', each of which may contain some data. An `attribute' is a field which may be specified in a query. One method used to store the records in a hash file is `multi-attribute hashing', as described by Rivest [24]. A single hash key is calculated for each record from a number of its attributes. If a record has n attributes, A_1, ..., A_n, then there are n corresponding hash functions, h_1, ..., h_n. Each hash function maps a value for an attribute A_i to a bit string of length l_i, of the form b_1...b_{l_i} where b_x ∈ {0, 1}. The lengths of the bit strings returned by the hash functions need not be equal. For example, if an attribute represents a person's sex, it would have only one bit in its bit string. For a text attribute, its hash function would return a bit string with many bits in it. We define the `constraining' number of bits of an attribute to be the length of the bit string generated by the hash function of the attribute. If this number is small, we assume that it is because the domain of the attribute is limited, not that the hash function is inadequate.

A hash key for a record is constructed by combining bits from the n bit strings. This hash key defines the block in the file in which the record is stored. The number of bits taken from the bit string of the ith attribute, d_i, must be between zero and its constraining number of bits, 0 ≤ d_i ≤ l_i. If d_i = l_i, we say that the attribute is `maximally allocated'. The total number of bits taken from the bit strings must be
equal to the length of the hash key required to identify each block in the data file. If the size of the data file is 2^d blocks, only d bits are useful in identifying which block a record should be stored in. Therefore, ∑_{i=1}^{n} d_i = d. We refer to the set of d_i s used to make up the hash key as the `bit allocation'.

EXAMPLE 1. The relation in Figure 1 has two attributes. The bit allocation consists of five bits from the first attribute and three bits from the second attribute. The second attribute is maximally allocated because its hash function only produces three bits. In this case, the block number in which to store each record is constructed by appending the three bits of the bit string of the second attribute to the first five bits of the bit string of the first attribute.

The bit allocation, and the order of the bits within the hash key, determines various properties of the hash file. In Figure 1, the hash key was made by appending the bits from one attribute to those of the other. If the file is required to be dynamic, the bits from the attributes must be taken from the front of the bit strings and are usually interleaved. The bits are taken from the front of the bit strings so that when the file size increases, the whole of the data file need not be reorganized. Interleaving the bits achieves better performance for each file size because fewer bits are used when the file size decreases, and more are used when the file size increases. These issues are discussed further by Lloyd and Ramamohanarao [16, 25]. Faloutsos [26] suggested a technique based on Gray codes which attempts to order the bits to maximize retrieval performance by clustering similar bit patterns together. This technique can be used with our optimal MAH files.

An advantage of multi-attribute hashing is that it uses a dynamic data structure whose maintenance costs are the same as those of the primary key hashing scheme upon which it is based. Using an optimal MAH file does not incur any additional overhead.
That is, the maintenance costs using an optimal MAH file are the same as using a hash file which is not optimal. Primary key hashing schemes which can be used include dynamic hashing [27], linear hashing [28, 29], extendible hashing [30], adaptive hashing [31], and variations on these schemes [16, 25]. We discuss the maintenance of optimal MAH files in more detail in Sections 9 and 10.

A disadvantage of multi-attribute hashing is that it performs poorly when attributes are correlated. All bits composing a hash key should be independent. If a hash key is formed from the bit strings of correlated attributes, bits in the hash key will be correlated. This can result in an uneven distribution of records to blocks in the data file. Since records are stored in a hash file, this can increase the access time for records. One method of avoiding uneven clustering of data is to alter the method of storage. This is the approach taken in adaptive hashing [31] and in BANG files [20]. These storage structures are compatible with our method of determining optimal file organizations. We discuss non-uniform data distributions in more detail in Section 11.
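The key construction described above (and illustrated in Example 1) can be sketched in a few lines. The per-attribute hash functions below are illustrative stand-ins built on Python's salted built-in `hash`, not functions prescribed by the paper:

```python
# Sketch of MAH key construction (Example 1's layout: five bits from the
# first attribute, three from the second). attr_hash is an illustrative
# stand-in for the paper's hash functions h_i, not a prescribed choice.

def attr_hash(value, l_i):
    """Map an attribute value to a bit string of length l_i."""
    return format(hash(value) & ((1 << l_i) - 1), f"0{l_i}b")

def block_number(record, allocation, lengths):
    """Append the first d_i bits of each attribute's bit string (in
    attribute order) to form the hash key, i.e. the block number."""
    key = ""
    for value, d_i, l_i in zip(record, allocation, lengths):
        assert 0 <= d_i <= l_i      # d_i bounded by the constraining bits l_i
        key += attr_hash(value, l_i)[:d_i]
    return int(key, 2)

# Two attributes, bit allocation (5, 3): the key has 5 + 3 = 8 bits,
# so the record lands in one of 2^8 blocks.
blk = block_number(("Smith", "F"), (5, 3), (5, 3))
assert 0 <= blk < 2 ** 8
```

A dynamic file would interleave the bits rather than append them, as discussed above; only the order in which bits enter `key` changes.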
FIGURE 1. The location of two records in an MAH file.
3. ASSUMPTIONS
In this paper, we make a number of assumptions to simplify the cost model and disregard factors which do not affect the relative performance of the algorithms.

We give the cost of each algorithm in terms of the number of disk blocks transferred. That is, the number of blocks read from, or written to, disk. While this is a common metric, a more sophisticated approach may include differentiating between seek and transfer costs, as in [32]. This remains an area for future work.

We assume that the records are uniformly distributed amongst all blocks in the data file. That is, we do not consider the effect of overflow chain lengths in our cost formulae. However, if we wish to model overflow we simply scale up our cost formulae by multiplying by an appropriate factor which models the overflow chains. For example, if the load factor is 80% with 50 records per block, the multiplying factor is 1.0725. This is equal to the unsuccessful search length. The values of these multiplying factors for various load and blocking factors can be found in [25]. We discuss extreme non-uniform data distributions in Section 11.

A consequence of assuming a uniform data distribution is that we assume that the join attributes are distributed reasonably across their respective domains, so the worst case of a cross product is not considered. We believe this is reasonable because, for a cross product, neither the sort–merge nor hash join algorithms should be used. The best algorithm for implementing a cross product is based on the nested loop join [1], so our method is inapplicable in this case.

We assume that the information known about the sorting and partitioning operations to be performed on a relation consists of the attributes of the relation involved in each sorting or partitioning operation and the probability of each operation.

We assume that the size of each file is 2^d blocks. This is because we are attempting to determine the optimal bit allocation for a multi-attribute hash file.
However, the actual size of a data file can be any integral number of blocks if, for example, a linear hash file is used.
4. PARTITIONING
The reason for using multi-attribute hashing is to partition a data file. Therefore, it can be used to reduce the cost of further partitioning of the data file. In this section we derive the cost of partitioning a relation stored in an MAH file and, using Equation 1, the average cost of partitioning the relation. We then compare the average partitioning cost using the optimal MAH file with the average partitioning cost using some other MAH organizations.

Note that a relation partitioned using multi-attribute hashing can be used for partial-match retrieval, even if the particular bit allocation is chosen to support other operations. Furthermore, if the relative probabilities of all relational operations on a given relation are known, a combination of the cost functions given in this paper and elsewhere [15–19] can be used to construct indexes which support all of these operations [23]. However, in this paper we concentrate on using it to support join algorithms.

The purpose of the partitioning phase of the hash join algorithm is to partition the input relations such that the partitions of one can be held in memory. That is, if the size of main memory is B blocks, the final partitions can be at most B − 2 blocks. The final partitions cannot be B blocks because one block is reserved for the second relation and one for the output relation. If two relations being joined use the same hash functions and the join attribute of each relation contributes bits to the hash key of its relation, then the partitioning implicit in an MAH file can be used as an implicit partitioning operation on each file. If the implicit partitioning is such that the partitions are of size B − 2 blocks or smaller, no explicit partitioning is required.

EXAMPLE 2. Figure 2 contains an example using the same relation and bit allocation as the example in Figure 1. It shows the bit allocation used to construct the address in which each record is stored.
The arrows indicate the binary logarithm of the number of blocks available in memory to hold the partitions of this relation. That is, ⌊lg(B − 2)⌋ = 3. If a join is performed on attribute A_1, the relation would not have to be explicitly partitioned because the MAH file already partitions the relation into partitions which are small
FIGURE 2. A bit allocation.
enough. However, if a join is performed on attribute A_2, the relation would have to be partitioned because the MAH file only partitions the relation into partitions of size 2^5 blocks on attribute A_2.

4.1. Cost of partitioning using a single copy of a data file

We first describe the cost of partitioning a data file assuming the contents of the file are not organized in any particular way. If B is the number of blocks in the main memory buffer and 2^d is the number of blocks in the file being partitioned, the cost of partitioning the file into partitions of size B − 2 is

    C = 2 · 2^d ⌈log_{B−1}(2^d/(B − 2))⌉    (2)

disk blocks transferred. The term 2 · 2^d indicates that there will be two disk block transfers, one read and one write, for each of the 2^d blocks in the data file. The expression ⌈log_{B−1}(2^d/(B − 2))⌉ denotes the minimum number of partitioning passes required to partition a file of size 2^d blocks into partitions of size B − 2 blocks. During each partitioning pass, for each input partition there can be at most B − 1 output partitions. This is because there must be one block for the input partition and one block for each of the output partitions. We can see that the maximum relation size which results in one pass is (B − 1)(B − 2), and if 2^d ≤ B − 2, no partitioning is necessary.

If the data file is organized using multi-attribute hashing, then all of the bits in the bit allocation contributed by the attributes on which the data must be partitioned can be used to partition the data. For example, if we wish to partition the relation in Figure 2, using attribute A_1, then there can be a separate partitioning phase for each of the 2^5 distinct values of the bits of A_1. In general, if there are 2^{d_{A_1}} distinct values in the MAH file for attribute A_1, there will be 2^{d_{A_1}} partitions of size 2^{d−d_{A_1}} without performing any explicit partitioning.

If we are partitioning a data file, either implicitly or explicitly, which is then used in an operation involving another partitioned relation, the best performance is achieved when both relations are partitioned in the same way. In the absence of information about any other relation, we assume that the number of partitions constructed must be a power of two. We do this because the MAH file organization does this implicitly and we want to allow the other relation to be partitioned implicitly. Therefore, instead of forming B − 1 partitions during each pass, we form 2^{⌊lg(B−1)⌋} partitions. Similarly, the final size of each partition will be 2^{⌊lg(B−2)⌋} blocks instead of B − 2 blocks.

Let Q be the set of attributes used in a partitioning operation q. The size of each initial implicit partition, after taking the MAH file organization into account, is ∏_{i∉Q} 2^{d_i}. If we let α_j = ⌊lg(B − j)⌋ and β_j = 2^{α_j}, the cost of partitioning the data file, following from Equation 2, is

    C(q) = 2 · 2^d ⌈log_{β_1}((∏_{i∉Q} 2^{d_i})/β_2)⌉
         = 2^{d+1} ⌈lg(∏_{i∉Q} 2^{d_i})/α_1 − α_2/α_1⌉.

If we assume that ⌊lg(B − 1)⌋ = ⌊lg(B − 2)⌋, then α_1 = α_2 and

    C(q) = 2^{d+1} ⌈(∑_{i∉Q} d_i)/α_2 − 1⌉.    (3)

Note that if ∑_{i∉Q} d_i ≤ α_2 = ⌊lg(B − 2)⌋, then the cost is zero, as no partitioning is required. Therefore, when choosing values for the d_i s, there is no gain in allocating more bits to the d_i s than the minimum required to satisfy this inequality.

The average cost of partitioning, derived from Equation 1, is

    C_P = ∑_{i∈P} p_i C(i),    (4)
where P is the set of all possible partitioning operations, and p_i is the probability of partitioning operation i. To minimize the average partitioning cost we must find the values of the d_i s which minimize Equation 4 while maintaining the constraint ∑_i d_i = d, as discussed in Section 2. The results obtained by doing this are presented in Section 4.3. In Section 8 we present the results of testing a number of methods which attempt to find optimal bit allocations quickly.
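Equations 3 and 4 are inexpensive to evaluate directly. The following is a minimal sketch; the attribute names, operation set and buffer size are hypothetical, chosen only for illustration:

```python
from math import ceil, floor, log2

def partition_cost(d, bits, Q, B):
    """Equation 3: block transfers to partition a 2^d-block MAH file for
    an operation on attribute set Q. bits maps attribute -> d_i; B is the
    buffer size in blocks (assumes floor(lg(B-1)) == floor(lg(B-2)))."""
    alpha2 = floor(log2(B - 2))                # lg of the final partition size
    rest = sum(di for a, di in bits.items() if a not in Q)
    if rest <= alpha2:                         # implicit partitions already fit
        return 0
    return 2 ** (d + 1) * ceil(rest / alpha2 - 1)

def average_partition_cost(d, bits, ops, B):
    """Equation 4: sum over partitioning operations of p_i * C(i)."""
    return sum(p * partition_cost(d, bits, Q, B) for Q, p in ops)

# Hypothetical relation: d = 19, bit allocation d_A1 = 16, d_A2 = 3,
# and a buffer of B = 1026 blocks (so alpha_2 = 10).
bits = {"A1": 16, "A2": 3}
ops = [({"A1"}, 0.7), ({"A2"}, 0.3)]
avg = average_partition_cost(19, bits, ops, 1026)
```

With this allocation, partitioning on A_1 is free (only 3 bits remain outside Q), while partitioning on A_2 still needs one explicit pass.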
4.2. Cost of partitioning using multiple copies of a data file
If it is crucial that queries are as fast as possible, and consequently that update costs are less important, one method which can be used to increase the performance of partitioning is to have multiple copies of the data file and provide each copy with a different bit allocation scheme. This is analogous to having separate clusterings using different keys. Other methods which can be used include precomputed joins and denormalizing relations. Which of these methods to use, if any, will depend on the application environment. When there are multiple copies of a data file, the file copy which provides the best performance for each operation is used.

Let there be m copies of the data file. C_j(i) is the cost of partitioning for the ith operation using the jth file copy and is given by Equation 3. In calculating the average partitioning cost, we take the minimum cost of performing the partitioning on each of the m data files. Under these conditions, the average cost of partitioning becomes

    C_P = ∑_{i∈P} p_i min_{1≤j≤m} C_j(i).    (5)
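Equation 5 simply routes each operation to its cheapest file copy. A sketch, with hypothetical per-copy costs standing in for values precomputed via Equation 3:

```python
def average_cost_multi(ops, copy_costs):
    """Equation 5: each partitioning operation i uses the copy j with the
    smallest cost C_j(i). ops is a list of (operation id, probability);
    copy_costs[j][op] holds C_j(i), assumed precomputed via Equation 3
    from each copy's own bit allocation."""
    return sum(p * min(costs[op] for costs in copy_costs) for op, p in ops)

# Two hypothetical copies clustered for different operations: each
# operation is free on one of the copies, so the average cost is zero.
copy_costs = [{"q1": 0.0, "q2": 2 ** 20},
              {"q1": 2 ** 20, "q2": 0.0}]
ops = [("q1", 0.6), ("q2", 0.4)]
assert average_cost_multi(ops, copy_costs) == 0.0
```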
The minimum possible average partitioning cost, that is, no partitioning at query execution time, is usually achieved with fewer than n file copies (m < n).

4.3. Computed results
In this section we compare the performance of partitioning an optimal MAH file with partitioning a file with no special clustering, and with partitioning MAH files created using methods discussed in other papers. This will enable us to determine whether using an optimal MAH file is worthwhile. The two other clusterings which we examined consist of an equal allocation of bits to attributes (`Even MAH File'), and the allocation of all bits to one attribute (`Single Hash File'). The first of these clustering schemes is the same as that used by data structures which partition their domain space by cyclically dividing the domains of the attributes into two. Data structures which do this include the grid file, k-d tree and multilevel grid file. The second of the clustering schemes is equivalent to creating a hash file using a single attribute. This is the attribute with the highest probability of appearing in a partitioning operation.

4.3.1. Distributions used in all computed results

All of the computed results in this and subsequent sections were produced using the same sets of query distributions. These distributions are described in more detail in Appendix B. More information on all of the distributions used is given in [23]. To produce the computed results, we generated a number of sets of random probabilities for partitioning (or sorting) operations on a relation. The number of attributes of each relation involved in the operations varied between 2 and 7. Each possible combination of attributes was assigned a probability. We considered a wide spectrum of probability distributions, including distributions biased towards joins on single attributes and equi-distributions. In some distributions, we also restricted the size of the domain of the attributes by assigning a constraining number of bits to each attribute. We assumed that the block size is 8 kbytes and that the size of each relation is 4 Gbytes, thus d = 19.
The number of unreserved blocks in the buffer (B − 2) ranged between 1024 (8 Mbytes) and 16,384 (128 Mbytes). The costs are measured as the number of disk blocks read or written. Twenty-five sets of results, each testing five main memory sizes, were generated for each of the algorithms. In [33] we presented preliminary results for smaller files (32 Mbytes, d = 13), a smaller page size (4 kbyte) and smaller main memories ((B − 2) varied from 32 to 256 4 kbyte pages). The results obtained for these smaller sizes are the same as those presented in this paper, unless otherwise noted, and therefore are not discussed any further.

4.3.2. A single data file

Figure 3 contains an example of the performance of the different methods to partition a data file. Note that the cost axis has a logarithmic scale. Equation 4 was used to
FIGURE 3. Partitioning costs using various file clustering organizations (Distribution 1).
determine the average cost of partitioning. The optimal bit allocation was determined by an exhaustive search of all possible bit allocations. This was feasible because most of the relations only contained a few attributes, and we were prepared to allow the exhaustive search to run for hours, if required. In Section 8 we discuss why an exhaustive search cannot be used in general.

The improvement in performance when comparing the optimal MAH file with the other file organizations in Figure 3 is representative of the results we obtained. A reduction in cost as memory increases occurs when sufficient memory is provided to reduce the number of partitioning passes required in one or more of the join operations. In Figure 3, the increase to 64 Mbytes is sufficient for the reduction of partitioning passes when the optimal MAH file is used, but not the other bit allocations. Using an optimal MAH file, we expect the partitioning to be performed substantially faster than when using an unclustered file, on average.

The performance of the other two clustering arrangements varied depending on the nature of the probability distribution. Allocating all bits to a single attribute was never near-optimal for the distributions we tested. As expected, performance was best when a high proportion of the partitioning operations involved one particular attribute. However, because there is no benefit in allocating more than d − ⌊lg(B − 2)⌋ bits to any attribute, as discussed in Section 4.1, some of the bits were wasted. The optimal MAH file puts these bits to good use, even for distributions in which a high proportion (but not all) of the partitioning operations involved a particular attribute.

An equal allocation of bits to attributes did produce near-optimal results on some occasions. It performed well for distributions in which a high proportion of the attributes appeared in each operation. For other distributions, it performed worse than allocating all bits to a single attribute.
TABLE 1. Costs (blocks transferred) when the number of file copies varies (Distribution 3, 7 attributes, d = 19, B = 2048).

    Copies    Cost
    1         6040.024
    2          625.817
    3           67.110
    4            0.000
For some distributions it only performed well at certain memory buffer sizes. There were distributions in which it was near-optimal for all memory buffer sizes, often when there were few attributes in the relation. Our results show that for many distributions, the results produced by the other bit allocations are not optimal and a better bit allocation needs to be used for optimal performance.

4.3.3. Multiple copies of a data file

When multiple copies of a data file are involved, it is impractical to search exhaustively all possible bit allocations to find the optimal bit allocation. In Sections 7 and 8 we describe and compare several algorithms for finding good bit allocations which are optimal or near-optimal. The results for multiple copies of a data file were produced in the same way as the results above, except that the bit allocation used was the best one found by these algorithms and may not be optimal.

Table 1 provides an example of the improvement in performance of the average cost of partitioning which can be achieved using multiple file copies. It shows that using a second file can result in a significant increase in performance. The smallest improvement in performance we observed for a large number of joins when using two files instead of one was a factor of seven. When more files or a large main memory buffer is used, a greater improvement in performance is achieved. When a large main memory buffer and more than two files are used, the cost of partitioning becomes effectively zero, even when seven attributes are involved. In our tests, the improvement in cost between three and four copies was typically so small that using the fourth copy was not cost effective. However, if a very large number of attributes were involved, using four or more copies may become cost effective.
Note that creating multiple secondary indexes will not improve the partitioning in joins in the same way that having multiple copies does, because the records will still be distributed randomly throughout the file. By maintaining multiple file copies, each with a different bit allocation, the records will be physically clustered together, which is required to achieve the level of performance we have discussed.

5. SORTING
Although the sort–merge join algorithm requires the relations to be sorted prior to the merge phase, they do not have to be sorted solely on the value of the join attributes. By
using the clustering inherent in MAH files to partially order the relation, we can reduce the amount of sorting required.

We define a `sort combination' to be the combination of attributes on which a relation is to be sorted. For example, each attribute alone is a sort combination, as is the set of all attributes. In the context of the sort–merge join, the sort combination is the join attributes. We define a notation for sorting on combinations of attributes. Let Π(A_1, ..., A_n) be the result of sorting a relation on A_1, then A_2, and so on, up to A_n. Within each distinct value of A_i the records will be ordered on (say) increasing values of A_{i+1} for all values of 1 ≤ i < n.

We define a `hashed sort key' to be the MAH key value of an attribute concatenated with the value of the attribute. The presence of the value of the attribute is needed to deal with the case of different attribute values having the same hash key. The relations can be sorted on this sort key instead of just the attribute value using the sort–merge join algorithm. An MAH file is partially sorted using our hashed sort keys. For example, consider the relation used in Figures 1 and 2. If the data is required to be sorted on the combination Π(A_1, A_2), each of the blocks with the same value of A_1 have to be sorted and merged together. This is because the fourth and subsequent bits of the hash value of A_1 are more significant than the bits of A_2.

We may wish to produce a relation which is completely sorted on the first attribute of the sort combination, then the second attribute, and so on. To do this, the hash function which is used to generate the hashed sort key must be order preserving. Thus, when used as a part of the sort–merge join algorithm, we are still able to perform theta-joins. For operations which do not require this, such as equijoins, any hash function may be used. In this case we can sort on the hash value and then the attribute value using the hashed sort key.
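A hashed sort key can be sketched as a tuple that places each attribute's hash value before the attribute value itself. The order-preserving "hash" below (truncating an integer to its high-order bits) is an illustrative assumption, not the paper's h_i; any order-preserving hash function would do:

```python
# Sketch of a hashed sort key for the combination Pi(A1, A2). Truncating
# an integer to its high-order bits is an illustrative stand-in for an
# order-preserving hash function h_i.

def op_hash(value, l_i, domain_bits=32):
    return value >> (domain_bits - l_i)     # keep the l_i high-order bits

def hashed_sort_key(record, lengths):
    """Each attribute's MAH hash value, followed by the attribute value
    itself to break ties between values sharing a hash key."""
    key = []
    for value, l_i in zip(record, lengths):
        key.extend((op_hash(value, l_i), value))
    return tuple(key)

records = [(0x80000001, 3), (0x80000000, 7), (0x10000000, 9)]
records.sort(key=lambda r: hashed_sort_key(r, (5, 3)))
# Records are now ordered on A1's hash, then A1, then A2's hash, then A2.
```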
Note that if an order-preserving hash function is used, then the resulting MAH file supports range queries, in addition to the partial-match queries which all MAH files permit [19].

5.1. General solution for a single copy of a data file
To calculate the average sorting cost, we must determine the cost of sorting using each combination of attributes, Π(A_1, ..., A_k), k ≤ n. First, we discuss the cost of sorting a relation. Sorting a relation using a multiway merge–sort consists of two steps, an initial run-generation step and the merge step [34]. The run-generation step takes the relation of size V blocks and breaks it up into N runs of R·B sorted blocks each, where B is the number of blocks of memory which are used for sorting. The value of R depends on the type of initial run generation used. According to Knuth [34], for replacement selection R = 2 on average. The initial run generation performs V block reads and V block writes. If the merge step merges M runs at a time, there will be ⌈log_M N⌉ passes over the relation. The cost of this merge is 2V⌈log_M N⌉. Therefore, the cost of sorting the relation is 2V + 2V⌈log_M N⌉ blocks transferred.
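The two-step cost above can be checked numerically. A minimal sketch, with symbols as in the text and R = 2 assuming replacement selection:

```python
from math import ceil, log

def merge_sort_cost(V, B, R=2):
    """Blocks transferred by a multiway merge-sort: run generation reads
    and writes all V blocks, then each of ceil(log_M N) merge passes
    costs a further 2V, with N = ceil(V/(R*B)) runs and fan-in M = B - 1."""
    N = ceil(V / (R * B))
    M = B - 1
    passes = ceil(log(N, M)) if N > 1 else 0
    return 2 * V + 2 * V * passes

# A 2^19-block relation with a 2^10-block buffer: 256 runs, one merge pass,
# so the relation is read and written twice in total.
assert merge_sort_cost(2 ** 19, 2 ** 10) == 2 * 2 ** 19 * 2
```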
To sort a relation of size $2^d$ blocks, $V = 2^d$, $N = 2^d/RB$ and $M = B - 1$. The cost of sorting this relation is

$$C = 2 \cdot 2^d + 2 \cdot 2^d \left\lceil \log_{B-1} (2^d/RB) \right\rceil = 2^{d+1} \left( \left\lceil \log_{B-1} (2^d/R) - \log_{B-1} B \right\rceil + 1 \right) \quad (6)$$

disk blocks transferred. This is usually equal to $2^{d+1} \lceil \log_{B-1} (2^d/R) \rceil$, so this is the cost we use for the remainder of this section. If we wish to sort an MAH file on the sort combination $\Pi(A_1, \ldots, A_k)$, the MAH file is partially sorted on the first attribute if $d_{A_1} > 0$. In this case, we can perform $2^{d_{A_1}}$ merge–sort operations on partitions containing $2^{d-d_{A_1}}$ blocks, instead of one merge–sort operation on all $2^d$ blocks. Therefore, based on Equation 6, the cost of sorting is given by

$$C(\Pi(A_1, \ldots, A_k)) = \begin{cases} 2^{d_{A_1}} \cdot 2^{d-d_{A_1}+1} \left\lceil \log_{B-1} (2^{d-d_{A_1}}/R) \right\rceil & \text{if } B < 2^{d-d_{A_1}} \\ 2^{d_{A_1}} \cdot 2^{d-d_{A_1}+1} & \text{if } B \ge 2^{d-d_{A_1}} \end{cases}$$

$$= \begin{cases} 2^{d+1} \left\lceil (d - d_{A_1} - \lg R) / \lg(B-1) \right\rceil & \text{if } B < 2^{d-d_{A_1}} \\ 2^{d+1} & \text{if } B \ge 2^{d-d_{A_1}}. \end{cases}$$

The second case covers the situation in which the partitions of the relation are sufficiently small that they can be read into memory and sorted in one operation. If $R = 2$,

$$C(\Pi(A_1, \ldots, A_k)) = \begin{cases} 2^{d+1} \left\lceil (d - d_{A_1} - 1) / \lg(B-1) \right\rceil & \text{if } B < 2^{d-d_{A_1}} \\ 2^{d+1} & \text{if } B \ge 2^{d-d_{A_1}}. \end{cases} \quad (7)$$
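Equation 7 can be evaluated directly. The following Python sketch (the function and parameter names are ours, not from the paper) computes the cost of sorting a sort combination whose first attribute has been allocated `d_a1` bits.

```python
import math

def sort_cost(d, d_a1, B, R=2):
    """Blocks transferred to sort an MAH file of 2**d blocks on a sort
    combination whose first attribute holds d_a1 bits (Equation 7).
    B is the number of memory blocks available for sorting; R is the
    average run-length factor (R = 2 for replacement selection)."""
    if B >= 2 ** (d - d_a1):
        # Each of the 2**d_a1 partitions fits in memory: one read/write pass.
        return 2 ** (d + 1)
    # Otherwise each partition needs a full merge-sort.
    passes = math.ceil((d - d_a1 - math.log2(R)) / math.log2(B - 1))
    return 2 ** (d + 1) * passes
```

For example, with $d = 20$ and $B = 64$, allocating 15 bits to the first attribute lets every partition fit in memory, removing all merging passes.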
We can see that it is extremely desirable to allocate $d - \lg B$ bits to attribute $A_1$ so that the second condition is satisfied. In fact, given that replacement selection will produce sorted runs of length $2B$ on average, we only need to allocate $d - \lg(B-1) - 1$ bits to attribute $A_1$ to produce sorted partitions in a single pass over the data. The cost given in Equation 7 only depends on the first attribute in each sort combination. It is independent of both $k$ and $n$; $d$ and $B$ are constant for a given file and memory size. Therefore, the cost of sorting on a combination of attributes is the same as sorting on the first attribute of the combination. Thus, the cost of sorting on a sort combination starting with $A_1$ is $C(\Pi(A_1, \ldots)) = C(\Pi(A_1))$. It follows that the cost of sorting on any attribute combination is one of the $n$ costs $C(\Pi(A_1)), \ldots, C(\Pi(A_n))$. For now, let us assume that $B < 2^{d-d_{A_i}}$ for each attribute $A_i$. Combining Equations 1 and 7, the average sorting cost is given by

$$C_S = 2^{d+1} \sum_{i=1}^{n} p_i \left\lceil (d - d_i - 1) / \lg(B-1) \right\rceil, \quad (8)$$

where $d_i$ is the number of bits allocated to the $i$th attribute and $p_i$ is the sum of the probabilities of the sort combinations in which the $i$th attribute is first.
As we will see later, there is no easy method of finding an optimal bit allocation for this cost function for all probability distributions. However, if there is at most one merging pass when sorting on any attribute, we can determine the optimal bit allocation. This constraint is not unreasonable if a large amount of main memory is used. We do this by observing that each sorting operation will require either one or no merging passes after the initial sorting pass which creates the sorted runs. No merging passes are required to sort on attribute $A_i$ if we set $d_i = \lceil d - 1 - \lg(B-1) \rceil$. It follows that to create the optimal MAH file to minimize the cost of sorting we should:

(1) allocate as many bits as possible, up to $\lceil d - 1 - \lg(B-1) \rceil$ bits, to the attribute with the highest probability of appearing in a join operation;
(2) allocate as many of the remaining bits as possible, up to $\lceil d - 1 - \lg(B-1) \rceil$ bits, to the attribute with the second highest probability of appearing in a join operation;
(3) and so on, until all the bits have been allocated.

Each attribute must have a domain sufficiently large to appear unconstrained, so that no attribute can be maximally allocated. That is, $l_i \ge \lceil d - 1 - \lg(B-1) \rceil$. If there are constraints on the domain of an attribute, then no more bits should be allocated to that attribute than given by the constraints. Supporting maximally allocated attributes is discussed in Section 5.3, after we describe the general solution for multiple copies of a data file.
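The greedy procedure above can be sketched as follows (Python; the function name and argument conventions are ours):

```python
import math

def greedy_allocation(probs, limits, d, B):
    """Allocate d hash-key bits to attributes so that sorting needs at most
    one merging pass. probs[i] is the probability that attribute i is the
    first attribute of a sort combination; limits[i] is its constraining
    number of bits (use a large value for an unconstrained domain)."""
    cap = math.ceil(d - 1 - math.log2(B - 1))  # bits for a one-pass sort
    alloc = [0] * len(probs)
    remaining = d
    # Visit attributes from most to least probable.
    for i in sorted(range(len(probs)), key=lambda i: -probs[i]):
        take = min(cap, limits[i], remaining)
        alloc[i] = take
        remaining -= take
        if remaining == 0:
            break
    return alloc
```

With $d = 10$ and $B = 9$, the cap is $\lceil 9 - \lg 8 \rceil = 6$ bits, so the most probable attribute receives 6 bits and the next receives the remaining 4.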
5.2. General solution for multiple copies of a data file
If the resources exist to create multiple copies of a relation, each with a different bit allocation, we would like to minimize the average cost of sorting using the best copy for each operation. Having multiple copies of a data file has the disadvantage that all insertions, deletions and updates are duplicated across multiple files. However, if the performance of query operations is more important than insertion and update operations, this technique may be valuable instead of, or in addition to, precomputed joins and denormalized relations. For now, let us assume that $B < 2^{d-d_{A_i}}$ for each attribute $A_i$. As in Section 4.2, let there be $m$ copies of the data file. Here $d_i^j$ is the number of bits allocated to the $i$th attribute in the $j$th file copy. In calculating the average sorting cost, we take the minimum cost of performing the sort on each of the $m$ data files. The average cost of sorting, based on Equation 7, is given by

$$C_S = 2^{d+1} \sum_{i=1}^{n} p_i \min_{1 \le j \le m} \left\lceil (d - d_i^j - 1) / \lg(B-1) \right\rceil. \quad (9)$$
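A direct evaluation of Equation 9 can be sketched in Python as follows, under the stated assumption that $B < 2^{d-d_i^j}$ throughout (names are ours):

```python
import math

def avg_sort_cost(probs, allocs, d, B):
    """Average sorting cost over m file copies (Equation 9).
    probs[i]     -- probability that attribute i leads a sort combination
    allocs[j][i] -- bits given to attribute i in file copy j
    Assumes B < 2**(d - allocs[j][i]) for every i and j."""
    total = 0.0
    for i, p in enumerate(probs):
        # Best copy for attribute i: fewest merging passes.
        passes = min(math.ceil((d - a[i] - 1) / math.log2(B - 1))
                     for a in allocs)
        total += p * passes
    return 2 ** (d + 1) * total
```

Adding a second copy clustered on the previously neglected attributes reduces the minimum in each term, and hence the average cost.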
The argument we used for determining the optimal bit allocation for a single file also applies in the case of multiple files. That is, there is no easy method of determining the optimal bit allocation for this cost function for all probability distributions, and no merging passes are required for sorting on attribute $A_i$ if we set $d_i^j = \lceil d - 1 - \lg(B-1) \rceil$ for some file copy, numbered $j$. Under these circumstances, optimal MAH files can be created by primarily partitioning the $n$th copy of the data
file using the attribute with the $n$th highest probability of being sorted. The attribute should be allocated as many bits as possible, up to $\lceil d - 1 - \lg(B-1) \rceil$ bits. Clearly, the optimal average cost of sorting is obtained when there is a file copy for each attribute. However, if the size of the data files is not significantly larger than the size of main memory, the optimal average cost of sorting can be obtained with fewer than $n$ file copies.
5.3. Solution when attributes are maximally allocated
A special situation arises when an attribute is maximally allocated. For maximally allocated attributes, for values $a$ and $b$ of the attribute, $\forall a, b : h(a) = h(b) \Rightarrow a \equiv b$, where $h(x)$ is the hash function used to construct the hash value for the attribute. There is no point allocating more than the constraining number of bits to a maximally allocated attribute, because the extra bits do not differentiate between values of the attribute. Consider the example in Figure 1. The hash function for $A_2$ only produces three bits. Therefore, we assume the domain size of the attribute is (at most) 8 and no advantage can be gained by allocating more than three bits to attribute $A_2$ in the bit allocation. The file is completely sorted on attribute $A_2$ because all the blocks can be retrieved, one by one, in the correct order. The file is also completely sorted on the sort combination $\Pi(A_2, A_1)$ because it is completely sorted on $A_2$ first and also partially sorted on $A_1$. However, it is not completely sorted on the sort combination $\Pi(A_1, A_2)$ because it is not completely sorted on $A_1$. Thus, the cost of sorting on a combination of attributes becomes

$$C(\Pi(A_1, \ldots, A_i, \ldots, A_k)) = \begin{cases} 2^{d+1} \left\lceil \left( d - 1 - \sum_{j=1}^{i} d_{A_j} \right) / \lg(B-1) \right\rceil & \text{if } \sum_{j=1}^{i} d_{A_j} < d - \lg B \\ 2^{d+1} & \text{if } \sum_{j=1}^{i} d_{A_j} \ge d - \lg B, \end{cases} \quad (10)$$

where $A_i$ is the first attribute which is not maximally allocated. Note that if $l_{A_1} \ge d$, then $i = 1$ and Equation 10 reduces to Equation 7. The average cost of sorting on all combinations is given by combining Equations 1 and 10.
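Equation 10 can be sketched in Python as follows (the caller supplies which attributes of the combination are maximally allocated; the representation is ours):

```python
import math

def sort_cost_maximal(d_alloc, maximal, d, B):
    """Cost of sorting on a sort combination (Equation 10).
    d_alloc[k] -- bits allocated to the kth attribute of the combination
    maximal[k] -- True if that attribute is maximally allocated
    Only the bits up to and including the first non-maximal attribute
    contribute to the partial sort order."""
    # Find the first attribute that is not maximally allocated.
    i = next((k for k, m in enumerate(maximal) if not m), len(maximal) - 1)
    bits = sum(d_alloc[: i + 1])
    if bits >= d - math.log2(B):
        return 2 ** (d + 1)  # partitions fit in memory: one pass suffices
    return 2 ** (d + 1) * math.ceil((d - 1 - bits) / math.log2(B - 1))
```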
5.4. Computed results
In this section we compare the performance of the sorting algorithms which make use of an optimal MAH file organization with sorting an unclustered file and with sorting algorithms which make use of MAH files created using methods discussed in other papers. This will enable us to determine whether using an optimal MAH file is worthwhile. As in Section 4.3, the two other clusterings were an equal allocation of bits to attributes (`Even MAH File'), and an allocation of all bits to one attribute (`Single Hash File'). Equations 1 and 10 were used to determine the average cost of the sort operations. The optimal bit allocation was determined by an exhaustive search of all possible bit allocations. In Section 8 we discuss why an exhaustive search cannot be used for all problems.
FIGURE 4. Sorting costs using various file clustering organizations (Distribution 1).
Figure 4 contains an example of the effectiveness of using the optimal bit allocation to aid in sorting a data file. In general, we expect to reduce the cost of the sorting phase of the average join by at least 10%, and often much more, when compared with sorting an unclustered file. The performance of the other two clustering arrangements varied depending on the nature of the probability distribution. If a high proportion of the sorting operations involved a particular attribute then the `Single Hash File' arrangement (that is, the relation is sorted on one attribute) was near-optimal. However, if a high proportion of the sorting operations involved, for example, one of two particular attributes, it was not near-optimal, as in Figure 4. The performance of the equal bit allocation was relatively consistent across all distributions. For each operation, the relation could be sorted in either one or two passes over the data, depending on the number of bits allocated to the appropriate attributes. If the memory size and equal bit allocation were such that each operation could be performed in one pass, the equal bit allocation was optimal. If it was such that it took two passes, the cost of using an MAH file with an even bit allocation was the same as that of an unclustered file. This can be seen in Figure 4. For main memory sizes up to 64 Mbytes, the equal bit allocation performed no better than not clustering at all. For main memory sizes from 128 Mbytes, the equal bit allocation produced results much closer to optimal. Whether an equal allocation of bits to attributes is optimal or not depends much more on the number of attributes, the size of the relation, and the size of main memory than it does on the distribution of sorting operations.
For example, in our tests, an equal allocation of bits was optimal across all the memory sizes we tested when two attributes were involved; it was optimal across the larger three memory sizes when three attributes were involved, and it was usually (but
not always) optimal for only the largest memory size when five attributes were involved. It was never optimal when seven attributes were involved. Clearly, it does not produce optimal results in general. In many relational database applications most sorting operations will be on only one or two attributes, if the TPC-D benchmark [35] is any guide. If most sorting operations are on one particular attribute, creating an MAH file on that attribute will provide optimal or near-optimal results. However, if most sorting operations are on one of two (or more) particular attributes, as in Figure 4, creating an MAH file on only one of the attributes will not produce near-optimal results. In these circumstances, a better bit allocation must be found for optimal performance. As a result, we believe that neither an equal allocation of bits, nor hashing on only one attribute, will be appropriate under all circumstances, and in many cases it is worth using the optimal bit allocation.

6. RESULTS

6.1. Comparing partitioning and sorting

FIGURE 5. Partitioning and sorting costs using the optimal bit allocation (Distribution 3).

Figure 5 compares the performance of the sorting and partitioning algorithms. These are from the same set of results which we discussed in Sections 4.3 and 5.4. We can see that partitioning has a lower cost than sorting when not using an MAH file organization. This is not a new result. However, we can see that when using optimal MAH files we get a greater relative improvement using partitioning than using sorting. In Figure 5, partitioning an unclustered file has a lower cost than sorting using an optimal MAH file. This is not consistent across all distributions. For a number of distributions the costs were the same. However, in our distributions, the cost of sorting using an optimal MAH file was never lower than the cost of partitioning an unclustered file. For the remainder of this paper we report results for the partitioning algorithm alone. We do not present the results obtained for sorting because they are similar to those we do present and lead to similar conclusions.

6.2. Experimental results
We have seen that by using an optimal bit allocation we should be able to reduce the average cost of partitioning. To see if these improvements are possible in practice, we generated four distributions, described in Appendix A, and recorded the execution time (elapsed running time) of the partitioning algorithm when not using any clustering and when partitioning an optimal MAH file. For Distributions C and D, single-attribute joins make up around 90% of the operations, whereas for Distributions A and B, multi-attribute joins make up around 70% of the operations. Experiments were performed using Distributions A and B on an unloaded Sun SPARCstation IPX with 28 Mbytes of main memory. These results were previously reported in [23]. The block (disk page) size used was 56 kbytes because the Sun extent-based file system uses this as the unit of transfer to the disk [36]. The size of each file was 28 Mbytes ($d = 9$) and 64 blocks (3.5 Mbytes) of main memory was used. The additional experiments shown here were run on a machine with an Intel Pentium CPU and 64 Mbytes of main memory. Table 2 shows the results for Distributions A and B on the Sun SPARCstation IPX (`Sun') and Distributions A to D on the Intel Pentium-based machine (`Intel'). The results are based on the average cost of partitioning each relation. The cost ratio is the ratio of the cost of each method compared with the cost of the algorithm using the optimal bit allocation. The experimental cost ratios are the ratios of the elapsed running times, while the calculated cost ratio (`Calc.') is the ratio of the expected costs, calculated using Equation 4. These results show that the ratio of the expected costs can correspond to the ratio of the costs which are achieved in practice.
Note that while the experimental results presented here use small (28 Mbyte) files, the fact that the ratio of calculated costs corresponds well to the ratio of experimental costs gives us confidence that the results presented in Sections 4.3 and 5.4 using 4 Gbyte files can be achieved in practice. That is, we believe that our method will scale up to much larger files than those used in our experiments. Both the calculated and experimental results have shown that using the optimal bit allocation can result in a significant increase in performance. Figures 3, 4 and 5, and Table 2, have shown a substantial increase in performance over algorithms which use bit allocations that are not optimal, such as hashing on a single attribute or using an MAH file with an equal bit allocation, or algorithms which do not use any clustering. Thus, from a performance perspective, we believe it is essential to use an optimally clustered file.
TABLE 2. Experimental and expected (calculated) results using the partitioning algorithm.

Cost ratios:

                          Distribution A       Distribution B      Distribution C   Distribution D
Method                  Sun   Intel  Calc.   Sun   Intel  Calc.    Intel   Calc.    Intel   Calc.
Optimal MAH File          1      1      1      1      1      1        1       1        1       1
Single Hash File       19.4    5.9   18.4   18.9    5.6    5.5      7.1     7.2      3.7     3.6
No Clustering          39.2   11.4   36.9   38.1   10.5   10.6     10.9    11.0      5.4     5.3

7. FINDING OPTIMAL BIT ALLOCATIONS FOR MAH FILES

If the probabilities of operations on a relation are stable, the optimal bit allocation only needs to be found once for each relation, when its data file is created. As the bit allocation only needs to be found once, finding the optimal bit allocation is more important than the time taken to find it. However, it is still important to find the bit allocation as quickly as possible. For a large file with many attributes there are a large number of possible bit allocations which could be created. For large databases it is computationally infeasible to calculate the average cost of all the possible combinations in a reasonable amount of time. We would like a procedure which enables us to compute a good bit allocation in a reasonable amount of time. This bit allocation should be as close to optimal as possible. To this end, we tested a number of approaches for finding a good bit allocation. The simplest algorithms derive bit allocations directly from the probability distribution or the number of attributes in the relation and are very fast. The more complex methods have been successful in the past in a number of similar problem domains. In Section 9 we discuss the effect of a changing probability distribution.

7.1. Naive algorithms

The following naive algorithms were tested to see if they can find the optimal bit allocation. While we did not expect that they would find the optimal bit allocation for all distributions, they are fast and could potentially provide an adequate level of optimization.

7.1.1. PROB1

Up to $d - \lfloor \lg(B-2) \rfloor$ bits (or the attribute's constraining number of bits, whichever is lower) are allocated to the attribute with the highest probability of appearing in an operation. The remaining bits are then allocated to the next most probable attribute, up to a maximum of $d - \lfloor \lg(B-2) \rfloor$ bits (or its constraining number of bits), then the next attribute, and so on. This is a simple extension of the algorithm which allocates all bits to a single attribute. It takes advantage of the fact that no partitioning is required for operations involving a given attribute if $d - \lfloor \lg(B-2) \rfloor$ bits are allocated to that attribute. Therefore, allocating more than $d - \lfloor \lg(B-2) \rfloor$ bits is wasteful.
THEOREM 7.1. A bit allocation produced by PROB1 will be optimal if all partitioning operations are on one attribute, each partitioning operation requires at most one pass to partition the data, and the constraining number of bits for each attribute is at least $d - \lfloor \lg(B-2) \rfloor$.

Proof. Consider a bit allocation for a probability distribution satisfying the constraints specified in the theorem. Let query $i$ be the partitioning operation containing attribute $A_i$. As each join requires at most one pass to partition the data, the cost $C(i)$ is given by

$$C(i) = \begin{cases} 2^{d+1} & \text{if } d_i < d - \lfloor \lg(B-2) \rfloor \\ 0 & \text{if } d_i \ge d - \lfloor \lg(B-2) \rfloor. \end{cases}$$

Let $A$ be the set of attributes for which $C(i) = 0$. As $d - \lfloor \lg(B-2) \rfloor$ bits must be given to attribute $A_i$ for $C(i) = 0$, the maximum number of attributes for which $C(i)$ may be zero is $|A|$, where

$$|A| = \begin{cases} \min(n, \lfloor d / (d - \lfloor \lg(B-2) \rfloor) \rfloor) & \text{if } \lfloor \lg(B-2) \rfloor < d \\ n & \text{if } \lfloor \lg(B-2) \rfloor \ge d. \end{cases}$$

Assume that there is another bit allocation, with a set $A'$ of attributes with $C(i) = 0$, which has a lower cost. Then $|A'| \le |A|$. The new average partitioning cost is given by

$$C'_P = C_P + \sum_{i \in A} p_i 2^{d+1} - \sum_{i \in A'} p_i 2^{d+1} = C_P + 2^{d+1} \left( \sum_{i \in A} p_i - \sum_{i \in A'} p_i \right).$$

The new bit allocation will have a lower average partitioning cost if $\sum_{i \in A'} p_i > \sum_{i \in A} p_i$. However, the definition of PROB1 guarantees that $\sum_{i \in A'} p_i \le \sum_{i \in A} p_i$ if $|A'| \le |A|$. Therefore, the bit allocation derived from PROB1 is optimal.

7.1.2. PROB2

Like PROB1, up to $d - \lfloor \lg(B-2) \rfloor$ bits (or the attribute's constraining number of bits, whichever is lower) are allocated to the attribute with the highest probability of appearing in a partitioning operation. All the operations involving this attribute are then eliminated. Up to $d - \lfloor \lg(B-2) \rfloor$ bits are now allocated to the attribute with the highest probability of appearing in the remaining operations. This process is repeated until all bits have been allocated.
If all partitioning operations are on one attribute only, this method is identical to PROB1. Thus, the result of this method will be optimal if the conditions of Theorem 7.1 are satisfied.
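PROB2 can be sketched as follows (Python; representing each partitioning operation as a probability and a set of attribute indices is our own convention):

```python
import math

def prob2(ops, n_attrs, limits, d, B):
    """PROB2 heuristic. ops is a list of (probability, set_of_attributes)
    pairs describing the partitioning operations. Bits are repeatedly
    allocated to the most probable attribute of the remaining operations,
    which are then eliminated, until all d bits are used."""
    cap = d - math.floor(math.log2(B - 2))
    alloc = [0] * n_attrs
    remaining_bits = d
    remaining_ops = list(ops)
    while remaining_bits > 0 and remaining_ops:
        # Probability of each attribute appearing in the remaining operations.
        weight = [0.0] * n_attrs
        for p, attrs in remaining_ops:
            for a in attrs:
                weight[a] += p
        best = max(range(n_attrs), key=lambda a: weight[a])
        take = min(cap, limits[best], remaining_bits)
        if take == 0:
            break
        alloc[best] += take
        remaining_bits -= take
        # Eliminate the operations involving the chosen attribute.
        remaining_ops = [(p, s) for p, s in remaining_ops if best not in s]
    return alloc
```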
7.2. Minimal marginal increase
Minimal marginal increase (MMI) has been used successfully by Lloyd and Ramamohanarao [16] to determine optimal bit allocations for partial-match queries when the attributes are not independent. More recently, it has also been used by Chen et al. [18] and Harris and Ramamohanarao [19] for determining the optimal bit allocation for partial-match range queries. The method of minimal marginal increase commences with no bits allocated to any attribute. It works in $d$ steps, where $d$ is the length of the hash key of the relation. At each step a bit is allocated to a single attribute. To determine which attribute to allocate the bit to, the bit is allocated to each attribute in turn and the value of the cost function to be minimized, such as Equation 4, is calculated. The bit is permanently allocated to the attribute which results in the smallest value of the cost function. This algorithm is explained in greater detail in, for example, [16, 23]. This method can be extended to include multiple files. At each step the bit is added to the attribute, across all files, which minimizes the value of the cost function.
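MMI is easy to express over any cost function. A Python sketch (names are ours; `cost_fn` stands for whichever average-cost function, such as Equation 4 or Equation 8, is being minimized):

```python
def mmi(n_attrs, d, cost_fn):
    """Minimal marginal increase: greedily place each of the d hash-key bits
    on the attribute whose allocation yields the smallest cost.
    cost_fn(alloc) returns the average cost of a bit allocation (a list of
    per-attribute bit counts)."""
    alloc = [0] * n_attrs
    for _ in range(d):
        best_attr, best_cost = None, None
        for a in range(n_attrs):
            alloc[a] += 1              # tentatively allocate the bit here
            cost = cost_fn(alloc)
            if best_cost is None or cost < best_cost:
                best_attr, best_cost = a, cost
            alloc[a] -= 1              # undo the tentative allocation
        alloc[best_attr] += 1          # commit the best placement
    return alloc
```

Because only one bit moves per step, MMI evaluates the cost function $nd$ times in total, rather than once per possible allocation.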
7.3. Simulated annealing
Simulated annealing (SA) [37] has been examined for use in a number of similar problem domains. These include query optimization [38], join query optimization [39, 40], bit allocation for partial-match range queries [19], and bit allocation for partial-match queries using multiple files [17]. Simulated annealing performs $T$ trials, and returns the bit allocation with the minimum cost from these trials. Each trial initially generates a random bit allocation and calculates the cost using this bit allocation. The next phase involves $S$ iterations in which bits are `perturbed', commencing with the initial random bit allocation. A bit allocation is perturbed by decrementing the number of bits allocated to one randomly selected attribute and incrementing the number of bits allocated to a different randomly selected attribute. The cost using this new bit allocation is then calculated. If this cost is less than the cost using the previous bit allocation, or if the value of the `cooling function' is true, the new bit allocation is used as the basis of the next iteration; otherwise the previous bit allocation is used. A limit, $L$, specifies the maximum number of consecutive iterations permitted which do not find a bit allocation with a lower cost. If this limit is exceeded, the trial ceases and the best bit allocation found during the trial is returned. The cooling function tests whether a randomly generated number is less than the value of an inverse exponential function which takes the iteration number as a parameter. The purpose of the cooling function is to allow the algorithm
to accept bit allocations with worse costs in early iterations, but only accept bit allocations with better costs in later iterations. This allows the simulated annealing algorithm to escape from local minima early in each trial and continue searching for the global minimum. Like MMI, this algorithm can be extended to include multiple files. At the beginning of each trial a random bit allocation is generated for each file. During each trial a random file is chosen prior to choosing the two attributes whose bits are perturbed. The two attributes must be from the same file. One of the disadvantages of simulated annealing is that the bit allocation is only optimized for a given file size (value of $d$). If the file size were to double (incrementing $d$), a new bit allocation found by simulated annealing might allocate a different number of bits to each attribute. This would mean that the file would have to be reorganized. However, [17] describes how MMI can be used to allocate additional bits on top of an initial bit allocation determined by simulated annealing so that a complete reorganization is not required. A longer discussion of the simulated annealing algorithm we use can be found in [23].
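The trial loop can be sketched as follows (Python; the specific cooling schedule and acceptance rule below are our illustrative choices, not the exact ones used in the paper):

```python
import math
import random

def sa_trial(n_attrs, d, cost_fn, S=100, L=100, temp=0.05):
    """One simulated annealing trial for bit allocation. Starts from a
    random allocation of d bits and perturbs it by moving one bit between
    two randomly chosen attributes; cost_fn(alloc) is being minimized."""
    alloc = [0] * n_attrs
    for _ in range(d):                 # random initial allocation of d bits
        alloc[random.randrange(n_attrs)] += 1
    cost = cost_fn(alloc)
    best, best_cost = list(alloc), cost
    failures = 0
    for s in range(S):
        if failures > L:
            break                      # too long without an improvement
        src, dst = random.sample(range(n_attrs), 2)
        if alloc[src] == 0:
            continue                   # cannot move a bit from an empty attribute
        alloc[src] -= 1
        alloc[dst] += 1
        new_cost = cost_fn(alloc)
        # Accept if better, or occasionally while the temperature is high.
        if new_cost < cost or random.random() < math.exp(-s * temp):
            cost = new_cost
        else:
            alloc[src] += 1            # revert the perturbation
            alloc[dst] -= 1
        if cost < best_cost:
            best, best_cost = list(alloc), cost
            failures = 0
        else:
            failures += 1
    return best, best_cost

def simulated_annealing(n_attrs, d, cost_fn, T=10, **kw):
    """Run T independent trials and keep the cheapest allocation found."""
    return min((sa_trial(n_attrs, d, cost_fn, **kw) for _ in range(T)),
               key=lambda r: r[1])[0]
```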
7.4. Multiple copies of a data file: naive algorithms
While the simulated annealing algorithm described in the last section can be used with few changes to try to find an optimal bit allocation for multiple file copies, the algorithms in Section 7.1 cannot be applied without change. The naive algorithms we tested with multiple file copies are as follows.

7.4.1. PROB1

This works in a similar way to the single-copy version of PROB1. The attribute with the highest probability of appearing in a partitioning operation has $d - \lfloor \lg(B-2) \rfloor$ bits (or its constraining number of bits, whichever is lower) allocated to it in the first copy. The attribute with the second highest probability has $d - \lfloor \lg(B-2) \rfloor$ bits allocated to it in the second copy. This process continues until the $m$ most probable attributes have $d - \lfloor \lg(B-2) \rfloor$ bits allocated to them in one of the $m$ copies. The remaining bits, up to $d - \lfloor \lg(B-2) \rfloor$ bits, of each copy are then allocated to the next most probable attributes in turn. This process continues until all bits are allocated in all copies.

7.4.2. PROB2

This works in a similar way to the single-copy version of PROB2. That is, it operates in the same way as the multiple-copy PROB1, except that the partitioning operations involving attributes with bits already allocated to them in one copy are removed when calculating the most probable attribute.

7.4.3. EVEN

For the purposes of creating bit allocations, the attributes are divided amongst the file copies as equally as possible. That is, if there are $n$ attributes, the bit allocation for a given file
TABLE 3. Values of simulated annealing constants.

Algorithm     T      S      L
SA1          10   1000    100
SA2         100    100    100
SA3         500    100    100
SA4          50   1000    100
SA5         100    500    100
SEED          1   1000    100
copy will contain bits from $n/m$ of the attributes. Within each file copy the $n/m$ attributes are allocated an equal number of bits in the manner described in Section 4.3.
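The multiple-copy PROB1 heuristic can be sketched as follows (Python; the list-of-lists representation and the exact order in which the leftover bits are filled are our own reading of the description above):

```python
import math

def prob1_multi(probs, limits, m, d, B):
    """Multiple-copy PROB1: the kth most probable attribute anchors the kth
    file copy with its full quota of bits; each copy's remaining bits are
    then given to the next most probable attributes in turn.
    probs[i]  -- probability that attribute i appears in an operation
    limits[i] -- constraining number of bits for attribute i."""
    cap = d - math.floor(math.log2(B - 2))   # bits that avoid partitioning
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    allocs = [[0] * len(probs) for _ in range(m)]
    free = [d] * m                           # unallocated bits per copy
    # Anchor each copy with one of the m most probable attributes.
    for j in range(min(m, len(order))):
        a = order[j]
        allocs[j][a] = min(cap, limits[a], free[j])
        free[j] -= allocs[j][a]
    # Fill the remaining bits of each copy with the other attributes.
    for j in range(m):
        for a in order:
            if free[j] == 0:
                break
            if allocs[j][a] == 0:
                allocs[j][a] = min(cap, limits[a], free[j])
                free[j] -= allocs[j][a]
    return allocs
```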
8. COMPUTATIONAL RESULTS
In this section we answer the following questions.

1. Which bit allocation method provides the best bit allocation?
2. What is the relative cost of finding a bit allocation using each method?
3. Is the best bit allocation we found the optimal bit allocation?
The same set of distributions were used under the same conditions as in Sections 4.3, 5.4 and 6, as discussed in Section 4.3.1. They are described in more detail in Appendix B.
8.1. One data file
For results using a single copy of the data file, three versions of the simulated annealing algorithm were tested. They had different values of $T$, $S$ and $L$, and are shown as SA1, SA2 and SA3 in Table 3. These sets of constants have been used in the past [17, 19]. In addition, a hybrid method, SEED, was tested. SEED is a single simulated annealing trial with the initial bit allocation set to the best bit allocation returned by PROB1 and the other two bit allocation techniques discussed in Section 4.3. SA4 and SA5 in Table 3 were only used when multiple copies of the data file were tested; these results are presented in Section 8.2. Twenty-five sets of results were generated on a Sun SPARCserver 1000. The costs are again measured as the number of disk blocks read or written. Table 4 contains examples of the results. The time shown to find the Optimal MAH File is the time to perform an exhaustive search of all possible bit allocations. The results show that the simulated annealing algorithms SA1, SA2, SA3 and SEED usually find the optimal bit allocation, but the other algorithms often do not. The algorithms SA1 and SEED failed to find the optimal bit allocation on one and ten occasions (out of 125), respectively. Distribution 2 in Table 4 shows one of these. The MMI, PROB1 and PROB2 algorithms did not consistently find the optimal bit allocation. As reported in Section 4.3, neither did an equal bit allocation, nor a hash file
indexed using a single attribute. The relative performance of the Single Hash File, PROB1 and PROB2 was consistent. PROB1 always had a lower cost than the Single Hash File; often it was significantly lower. The cost of the bit allocation found using PROB2 was usually the same as that of PROB1. However, on seven occasions PROB2 had a lower cost than PROB1. The time taken by PROB2 was greater than SEED because it constructs additional supporting data structures prior to starting the search for the bit allocation. For Distribution 3, the actual running times of the two algorithms after initialization were 0.09 and 2.93 seconds, respectively. In [33] we reported results generated using much smaller relations and much less main memory. On two occasions on those smaller data files PROB2 had a higher cost than PROB1. This occurred when the constraining number of bits of the most probable attribute was less than $d - \lfloor \lg(B-2) \rfloor$. The queries involving that attribute were incorrectly assumed to have a zero cost by PROB2 when the second most probable attribute was chosen. This was not observed with larger relations and more memory because more attributes were allocated bits. It is more likely that other attributes involved in these queries are allocated bits, thereby reducing the cost of the bit allocation produced by PROB2. The relative performance of an equal bit allocation and PROB1 varied. There were occasions on which one was clearly superior to the other, as in Table 4. There were distributions in which one of them was optimal; in others both were significantly worse than optimal, as for Distribution 3 in Table 4. In general, we found that the smaller the amount of main memory and the more attributes that are involved, the less likely it is that any of the heuristic algorithms finds the optimal solution. Thus, we conclude that we cannot use these naive algorithms alone to find the optimal bit allocation.
To answer the second question posed at the start of this section, we must examine the time taken by each algorithm. Table 4 shows that the time taken to find the optimal bit allocation using SA2 or SA3 can be relatively high. For example, for the distribution with five attributes it is faster to exhaustively search all possible bit allocations than to use either SA2 or SA3. This is because it is possible for the simulated annealing trials to test the same bit allocation several times. As the number of attributes increases, the time required to search all bit allocations increases dramatically, while the time taken by the simulated annealing algorithm does not increase to the same degree. SA2 and SA3 are both faster than searching all possible bit allocations when seven attributes are involved. However, this depends on the size of the files and the amount of memory involved. In [33] we reported that for small files and a small amount of main memory, SA3 was slower than exhaustive searching for relations with seven attributes. The time taken by the MMI and PROB2 algorithms is much less than that of the three simulated annealing algorithms. However, the bit allocations they find often have a higher cost than those found by the simulated annealing algorithms.
TABLE 4. Performance of bit allocation algorithms (cost in blocks transferred, time in seconds).

                               Distribution 2                  Distribution 3
Number of attributes                 5                               7
B − 2                             2048                            8192

                        Partitioning   Bit allocation   Partitioning   Bit allocation
Bit allocation method   cost (blocks)    time (secs)    cost (blocks)    time (secs)
No Clustering              1,048,576          –            1,048,576          –
Even MAH File                 46,833          –                2,461          –
Single Hash File             188,486          –              146,352          –
PROB1                         39,602          –                5,596          –
PROB2                         39,602        0.27               4,657        5.64
MMI                           64,832        0.26              12,066        7.28
SA1                           39,602        0.78               1,427       33.70
SA2                           34,251        3.90               1,427      144.14
SA3                           34,251       18.52               1,427      796.84
SEED                          39,602        0.21               1,427        4.37
Optimal MAH File              34,251        2.78               1,427     2496.52
As we stated in Section 7, an optimal bit allocation only needs to be found once. Therefore, allowing a greater amount of time to find an optimal bit allocation, by using SA2 or SA3, will often be acceptable. The best bit allocation algorithm will depend on the properties of the relation and the query distribution. If the relation has few attributes and a large amount of main memory is available, then the best of the heuristic algorithms is likely to be optimal. As they are all fast, each can be run and the best bit allocation chosen. For relations with more attributes, or smaller amounts of main memory, a simulated annealing algorithm should be chosen. Algorithms with long running times, such as SA2 and SA3, are the best. However, algorithms such as SA1 or SEED can be used if the running time is critical. We also tested the algorithms in an attempt to find the optimal bit allocation for sorting. The performance of the algorithms was similar to that described above for the partitioning algorithm. One difference was that PROB1 often performed better than PROB2. To obtain the best results for sorting, simulated annealing should be used and at least one of the runs should be seeded with the best bit allocations found by the heuristic algorithms.
Multiple copies of data "les
We also produced results for multiple files to help answer the first of the two questions above. As we noted in Section 4.3, it is impractical to calculate the optimal bit allocation using an exhaustive search for multiple file copies. Hence, we cannot determine whether the best bit allocation found is the optimal bit allocation. The results were produced in the same way as the results for a single file. In addition to the three naive algorithms described in Section 7, five simulated annealing algorithms were tested. They are shown as SA1 to SA5 in Table 3. A multiple-copy version of the SEED algorithm, which
uses the best of the three naive algorithms described in Section 7.4 as its first bit allocation, was also tested. Table 5 shows examples of the performance of the bit allocation algorithms. The time taken by the seeded simulated annealing algorithm includes the time taken by all of the naive algorithms.

TABLE 5. Performance of bit allocation algorithms for multiple file copies (cost in blocks transferred, time in seconds).

                          Distribution 2           Distribution 3
Number of attributes      5                        7
B − 2                     1024                     4096
File copies               2                        2

Bit allocation method     Cost         Time        Cost         Time
EVEN                      10,112       –           1872         –
PROB1                     2989         –           1453         –
PROB2                     2179         –           1074         –
SA1                       18,212       1.58        544          56.37
SA2                       5167         7.67        470          319.17
SA3                       2989         37.43       385          1562.05
SA4                       10,922       5.44        380          276.98
SA5                       6998         10.58       380          519.87
SEED                      2179         0.35        1074         9.53

The relative performance of the naive algorithms was consistent. The PROB algorithms usually performed better than the EVEN algorithm. They never performed worse, even for the distributions in which an equal bit allocation performed better for a single file copy. The PROB2 algorithm never performed worse than PROB1, and often performed better. Both distributions in Table 5 demonstrate these properties.

The SEED algorithm did not improve on any of the bit allocations with which it was seeded. This differs from the single-file version, which was able to improve on its seed. The performance of the other simulated annealing algorithms varied. They were able to find better bit allocations than the naive algorithms on some occasions, such as for Distribution 3 in Table 5. However, on other occasions the naive algorithms found better bit allocations than any of the simulated annealing algorithms, such as for Distribution 2 in Table 5. For distributions for which the best cost was found to be zero, PROB1 and PROB2 always found the best bit allocation. However, on a number of these occasions none of the simulated annealing algorithms (other than SEED) found the best bit allocation.

To maximize the chances of finding the optimal bit allocation for multiple files we recommend using the PROB2 algorithm in conjunction with a simulated annealing algorithm. The amount of time taken by the simulated annealing algorithm can be controlled by its parameters. Optimal bit allocations with an average partitioning cost of zero are likely to be found by PROB2.

The time taken by the algorithms demonstrates that near-optimal bit allocations are likely to be found in a feasible amount of time for multiple file copies. Note that these bit allocations are only found once, when the data files are built. As the performance improvement is very high, reducing the partitioning cost to almost nothing in some cases, we believe using multiple copies of data files is worthwhile when performance is crucial. As we noted in Section 4.3.3, simply using multiple secondary indexes will not result in the same level of performance.
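The benefit of multiple file copies comes from each join being free to use whichever copy is cheapest for its join attributes. The sketch below illustrates that averaging; the `toy_cost` function is a deliberately simplified stand-in for the paper's cost model, and all names are ours.

```python
def avg_partitioning_cost(copies, query_dist, cost):
    """Average partitioning cost when each join may use whichever file
    copy is cheapest for its set of join attributes.

    copies:     list of bit allocations, one per file copy
    query_dist: maps a tuple of join-attribute indexes to its probability
    cost:       cost(allocation, attrs) -> partitioning cost for one copy
    """
    return sum(p * min(cost(c, attrs) for c in copies)
               for attrs, p in query_dist.items())

def toy_cost(alloc, attrs, total_bits=3):
    """Illustrative cost only: bits already allocated to the join
    attributes need no repartitioning, so the remaining bits determine
    how many partitions must still be formed."""
    return 2 ** (total_bits - sum(alloc[i] for i in attrs))
```

With copies (3, 0) and (0, 3) and joins split evenly between two attributes, every join finds a copy already fully clustered for it, so the average cost under this toy model drops to the minimum, mirroring the "almost nothing" cases reported above.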
9. UPDATES AND CHANGES IN A QUERY DISTRIBUTION
We have proposed using optimal bit allocations for data stored using a multi-dimensional clustering technique, such as MAH files. We have compared it favourably with other bit allocation techniques which use the same multi-dimensional structures. The cost of performing updates on a multi-dimensional data structure using an optimal bit allocation is the same as the cost of performing updates for the same data structure using any other bit allocation. This is true for whatever data structure we use, including linear hash files [28, 29], extendible hash files [30], grid files [13], BANG files [20], and multilevel grid files [21].

In the context of an optimal bit allocation, the only change to the environment which is relevant is a change to the distribution of operations which will be performed. In this paper, we have assumed that the probability of using each combination of attributes in an operation is known. In practice, the probability will not be known exactly, or it will change over time. To determine the impact of the probability distribution changing, we produced results in which the optimal bit allocations already found were tested using versions of the original distributions with the probabilities changed. The probabilities were changed from their original values
using the formula p′_i = p_i ((1 − s/100) + 2rs/100), where s is the percentage change range and r is a random number in the range [0, 1]. For example, a probability of 0.1 with a change range of s = 40 would be randomly changed to a value between 0.06 and 0.14. After changing, the probabilities are normalized so that their sum is one. To study the robustness of the original optimal solution, results were obtained for values of s of 10, 20, 40 and 80.

Assume that we have a probability distribution P with an optimal bit allocation A. We create a changed probability distribution P′ using the technique described above. The changed distribution has an optimal bit allocation A′. Let C(P, A) be the average cost using bit allocation A with probability distribution P. Our results report the cost ratio C(P′, A)/C(P′, A′). The cost C(P′, A′) is the cost of the optimal bit allocation of the changed probability distribution. The cost C(P′, A) is the cost of the optimal bit allocation of the original probability distribution evaluated under the changed probability distribution.

Examples of the results are shown in Table 6. The results indicate that for distributions whose probabilities changed by up to ±40% of their original values, the optimal bit allocation for the original distribution performs as well, or nearly as well, as the optimal bit allocation for the changed distribution. When each change was up to ±10%, the cost difference varied between 0% and 6.3%, with most distributions having no cost difference. When each change was up to ±40%, the cost difference varied between 0% and 20.2%, with the majority of distributions again having no difference. As the buffer size increases, the cost difference typically decreases for a given percentage change in the probabilities. As the number of attributes decreases, the cost difference also decreases for a given percentage change in the probabilities.
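The perturbation formula and the renormalization step can be written directly; `perturb` below is a small sketch of that procedure (the function name and interface are ours).

```python
import random

def perturb(probs, s, rng=random):
    """Change each probability p to p*((1 - s/100) + 2*r*s/100), with r
    drawn uniformly from [0, 1], then renormalize so the sum is one."""
    changed = [p * ((1 - s / 100) + 2 * rng.random() * s / 100) for p in probs]
    total = sum(changed)
    return [p / total for p in changed]
```

With s = 40 and r fixed at 0 or 1, a probability of 0.1 moves to 0.06 or 0.14 before renormalization, matching the example in the text.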
Distribution 2 in Table 6 is typical of the situation in which the buffer size is large with a small number of attributes.

TABLE 6. Cost ratio for changed distributions.

                          Distrib. 2     Distrib. 3
Number of attributes      5              7
B − 2                     16,384         4096

Change range              Cost ratio     Cost ratio
−5% … +5%                 1.000          1.006
−10% … +10%               1.000          1.038
−20% … +20%               1.000          1.056
−40% … +40%               1.073          1.062
Random                    3.375          1.591
Distribution 3 is typical when the buffer size is small with a larger number of attributes.

The fact that there is no significant increase in the cost even when each probability changes by up to ±40% may suggest that the optimal bit allocation does not depend on the probability distribution. To disprove this we tested the original optimal bit allocation with a new random probability distribution (`Random' in Table 6) in which the probability of an operation was inversely proportional to the number of attributes involved in that operation. When the distribution is random, the cost ratio can be large, as for Distribution 2 in Table 6, or small, as for Distribution 3 in Table 6. This shows that the original optimal bit allocations are not optimal for all distributions. Once a near-optimal MAH file has been constructed, a reorganization of the file is rarely required: it is only required when the probability distribution changes significantly.

10. DATA FILE REORGANIZATION
Although an optimal bit allocation performs very well even after the query probabilities change by up to ±40%, if the query distribution changes substantially the data file should be reorganized to maintain the best performance. Fortunately, this is, in general, an inexpensive operation. The cost of reorganizing a data file is the same as that of partitioning the data file. Thus, a single pass over the data file is often all that is required. On a single pass, ⌊lg(B − 1)⌋ bits in the hash key can be changed. Therefore, if c bits must be changed to transform the old MAH file into the new one, ⌈c/⌊lg(B − 1)⌋⌉ passes are required.

Consider an MAH file with n attributes and d_i bits allocated to the ith attribute, which must be reorganized so that the ith attribute has d′_i bits allocated to it. The number of bits which must change, that is, be taken from one attribute and given to another, is given by

    c = (1/2) Σ_{i=1}^{n} |d′_i − d_i|.

This takes one pass if c ≤ ⌊lg(B − 1)⌋.

EXAMPLE 3. Consider an MAH file with 2^19 blocks in which d_1 = 3, d_2 = 2, d_3 = 9 and d_4 = 5. If we wish to reorganize the file so that d′_1 = 7, d′_2 = 8, d′_3 = 1 and d′_4 = 3, then c = 10. If the main memory buffer contains more than 2^10 blocks, this requires only one pass over the file. If each block is 8 kbytes in size, only 8 Mbytes of memory would be required to reorganize the 4 Gbyte file in one pass. Even if it cannot be reorganized in one pass, it can be reorganized incrementally. Between each pass, other operations can be performed using the MAH file.

FIGURE 6. Grid file directory.

11. NON-UNIFORM DATA
As we discussed in Section 2, one of the most serious problems encountered when using hash functions is that of non-uniform distributions of records to blocks, and the resulting decreased storage utilization. In analysing the costs in the previous sections, we have assumed that there is a one-to-one relationship between the blocks specified by an MAH file and the number of blocks containing records. That is, the sizes of the records that are required to be stored in any block are such that they can all be stored in the one block (no blocks overflow) and all blocks contain a reasonable number of records (no blocks underflow).

Careful selection of hash functions and taking correlated attributes into account will reduce the chances of a non-uniform distribution of records to blocks, to some extent. However, the only solution which can avoid this completely is to use a different multi-attribute data structure. The technique we describe for determining the optimal bit allocation can also be used with many multi-dimensional indexing techniques; MAH files are simply used as an example.

EXAMPLE 4. Figure 6 shows how the grid file [13] directory would be constructed for the bit allocation of the relation shown in Figure 1. Each bit which is allocated to an attribute divides each domain of that attribute into two. Therefore, for d_1 = 5 and d_2 = 3, the domain of A_1 is divided into 32 segments, and that of A_2 into 8 segments.

This principle applies to many other multi-dimensional data structures, such as those mentioned in Section 2. Data structures which do not degrade in the presence of non-uniform data and can be used with our technique include the multilevel grid file [21] and the BANG file [20]. For highly non-uniform data distributions, they perform much better than simple multi-attribute hashing using linear hash files or grid files. For these indexing structures, the total cost will include the cost of retrieving both index pages and data pages.
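The segment counts in Example 4 follow mechanically from the bit allocation; a one-line sketch (the function name is ours):

```python
def grid_segments(alloc):
    """Each bit allocated to an attribute halves its domain again, so an
    attribute with d bits is divided into 2**d segments."""
    return tuple(2 ** d for d in alloc)
```

For d_1 = 5 and d_2 = 3 this gives a 32 × 8 grid file directory, as in Example 4.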
In the case of partial-match retrieval, it has been shown that the cost of retrieving index pages is insignificant compared to the cost of retrieving data pages [16]. The same is true for partitioning and sorting, where data access costs dominate. Since our proposed scheme clusters the data for partitioning or sorting, substantial reductions in cost can be achieved, even for multi-attribute indexing schemes in which explicit indexes are maintained.

Most hash join partitioning algorithms assume that the distribution of records to buffers is even, and thus they do not perform as well under a non-uniform distribution. However, a number of methods have been proposed which do not assume this, including the work of Kitsuregawa et al. [4, 5]. They showed that their methods perform better than the hybrid-hash join [3] for non-uniform distributions. Their methods can equally be used in place of the partitioning strategy we described without affecting our method of finding a near-optimal bit allocation.

12. OTHER ISSUES
In this paper we have concentrated on reducing the cost of the first pass of various join algorithms, to produce MAH files which perform well for these operations alone. In practice, it would be more appropriate to consider all the relational operations which would be applied to the file when deciding on the bit allocation to use. For example, the first pass of the union, intersection and difference algorithms is also a sorting or partitioning pass [23, 41], so the relative probabilities of these operations could be included with that of the join algorithm. Similarly, the overall average query cost (instead of the average join cost) could include the partial-match retrieval (selection) costs of [16, 33]. This is discussed in more detail in [23].

Unlike the sort–merge join, the hash join is not symmetric. Only one of its two relations needs to be partitioned sufficiently to be held in memory. The other relation simply needs to be partitioned to match the partitioning of the first relation. Therefore, to produce globally optimal MAH files for a number of relations, all the different operations between all the different relations should be considered. This is a much more difficult optimization problem, and is also discussed in [23].

13. CONCLUSION

Partitioning or sorting forms the first of the two passes of two common join algorithms: the hash join and the sort–merge join. Reducing the cost of the first pass can result in a substantial reduction in the time taken by these joins. We have used partitioning and sorting algorithms which take advantage of the clustering inherent in multi-attribute hash files to reduce the cost of each algorithm. We have described how to find a clustering scheme which minimizes the average cost of the partitioning and sorting algorithms. These techniques can also be used to better organize other data structures, such as the grid file, multilevel grid file, BANG file, and k-d tree.

We observed that an optimal bit allocation can often be determined by using the better of two naive techniques, an equal allocation of bits to attributes and PROB1, as a seed for a trial of simulated annealing. Better performance can be achieved by performing additional simulated annealing trials.

An optimal bit allocation typically provides orders of magnitude of improvement in the average time taken by the partitioning algorithm compared with partitioning an unclustered file. For example, in Figure 3, when using a large main memory, the average partitioning operation was over 10 times faster than without any clustering. An optimal bit allocation also provides a large improvement to the average sorting operation, but not as great as that of the average partitioning operation. For example, in Figure 4 an improvement of a factor of almost two was achieved for large buffer sizes. These improvements are achieved because the MAH file eliminates the need for partitioning for some operations, reducing this cost to zero, or because it reduces the cost of sorting.

When multiple copies of the data file are used, each with a different clustering organization, the increase in performance is at least another order of magnitude. For example, in Table 1, when two file copies were used, the average partitioning operation was almost 10 times faster than with one copy. When three file copies were used, the average partitioning operation was 90 times faster than with one copy.

We have shown that an optimal bit allocation is robust with respect to changes in the probability of each operation. Even when the probability of each operation is changed randomly by up to ±40% of its original value, the optimal bit allocation of the original distribution results in a bit allocation as good, or nearly as good, as the optimal bit allocation for the changed distribution. Therefore, reorganization of data files is rarely required once the optimal bit allocation has been found. We also showed that when reorganization is required, it is relatively inexpensive. Reorganization usually only requires one pass over the data file.

The results in this paper can be summarized as follows.

•  Making use of an optimal clustering of records in a relation results in a significant increase in performance compared with not using any clustering.
•  The optimal clustering organization often performs much better than other commonly used clustering techniques. In the case of partitioning, the cost of many of the common partitioning operations is reduced to zero.
•  The optimal clustering organization usually only needs to be found once, when the data file is created.
•  An optimal, or near-optimal, clustering organization can usually be found quickly. Even when an optimal clustering organization is not found, a substantial reduction in cost can be achieved using several simple heuristic bit allocation algorithms, including simulated annealing.
•  Using an optimal clustering organization does not affect the cost of insertions, deletions or updates.
•  The query probability distribution must change significantly before a new organization is required to maintain the best performance. When reorganization is required, it can usually be performed in only one pass over the data file. If it takes more than one pass, it can be performed incrementally.
•  For high performance, one of a number of options is to construct multiple copies of the data file, each with a different clustering organization. Other options include the more standard techniques of precomputed (materialized) joins and denormalized relations.
•  The optimal clustering organization can also be applied to indexing schemes such as the grid file, BANG file and multilevel grid file. We are currently preparing a paper indicating that this can be successfully implemented for BANG files and partial-match retrieval.
•  The optimal clustering organization can work effectively in cases where data is highly skewed, by using an indexing scheme whose performance does not degrade for such data distributions, such as the BANG file or multilevel grid file.

ACKNOWLEDGEMENTS

The authors would like to thank the anonymous reviewers for their suggestions, and Justin Zobel and Zoltan Somogyi for their comments on earlier versions of this paper. The authors were supported by an Australian Research Council grant, the Cooperative Research Centre for Intelligent Decision Systems, the Key Centre for Knowledge Based Systems, and the Collaborative Information Technology Research Institute. The first author was also supported by APA and CITRI scholarships.

APPENDIX A. PROBABILITY DISTRIBUTIONS USED IN EXPERIMENTAL RESULTS

Distribution A contained four attributes. Up to three attributes were specified as join attributes. Those combinations with non-zero probabilities are given in Table A.1.

TABLE A.1. Probabilities of Distribution A.

Attributes   Probability     Attributes   Probability
1            0.089398        2            0.090118
12           0.083304        23           0.082593
123          0.042774        234          0.106684
124          0.100586        24           0.075053
13           0.081804        3            0.026218
134          0.027279        34           0.045867
14           0.065543        4            0.082779

Distribution B contained five attributes. Up to three attributes were specified as join attributes. Those combinations with non-zero probabilities are given in Table A.2.

TABLE A.2. Probabilities of Distribution B.

Attributes   Probability     Attributes   Probability
1            0.115839        235          0.030406
12           0.064662        24           0.098332
123          0.019633        245          0.040491
124          0.012986        3            0.018403
13           0.048545        34           0.065467
134          0.057326        345          0.064890
14           0.046966        35           0.039186
145          0.027279        4            0.010077
15           0.034016        45           0.036990
2            0.091604        5            0.052590
234          0.024312

Distribution C contained four attributes. Up to three attributes were specified as join attributes. Single-attribute joins are the most prevalent, making up around 90% of the joins. Those combinations with non-zero probabilities are given in Table A.3.

TABLE A.3. Probabilities of Distribution C.

Attributes   Probability     Attributes   Probability
1            0.218926        24           0.000857
134          0.002513        3            0.288873
14           0.043046        34           0.041206
2            0.090584        4            0.302465
234          0.011531

Distribution D contained five attributes. Up to three attributes were specified as join attributes. Single-attribute joins are the most prevalent, making up around 90% of the joins. Those combinations with non-zero probabilities are given in Table A.4.

TABLE A.4. Probabilities of Distribution D.

Attributes   Probability     Attributes   Probability
1            0.105370        24           0.033872
12           0.017165        3            0.279551
123          0.002513        35           0.000857
125          0.011531        4            0.169687
13           0.033215        5            0.082699
2            0.263540

APPENDIX B. PROBABILITY DISTRIBUTIONS USED IN COMPUTATIONAL RESULTS

Distribution 1 contained five attributes. All combinations of all attributes were specified as join attributes. Single-attribute joins are the most prevalent, making up around 90% of the joins. The probabilities are given in Table B.1.

TABLE B.1. Probabilities of Distribution 1.

Attributes   Probability     Attributes   Probability
1            0.356172        2            0.006146
12           0.016470        23           0.008195
123          0.000267        234          0.000248
1234         0.000013        2345         0.000012
12345        0.000001        235          0.000402
1235         0.000014        24           0.008547
124          0.000308        245          0.000207
1245         0.000012        25           0.008306
125          0.000307        3            0.156984
13           0.015236        34           0.008036
134          0.000449        345          0.000423
1345         0.000011        35           0.010489
135          0.000334        4            0.083093
14           0.006410        45           0.010184
145          0.000525        5            0.289563
15           0.012639

Distribution 2 contained five attributes. All combinations of all attributes were specified as join attributes. The probabilities are given in Table B.2.

TABLE B.2. Probabilities of Distribution 2.

Attributes   Probability     Attributes   Probability
1            0.002091        2            0.004004
12           0.005700        23           0.010742
123          0.025524        234          0.014703
1234         0.079964        2345         0.067272
12345        0.380285        235          0.015902
1235         0.075543        24           0.011599
124          0.012227        245          0.013339
1245         0.070733        25           0.011015
125          0.014288        3            0.003610
13           0.007746        34           0.007189
134          0.022129        345          0.020900
1345         0.066695        35           0.011002
135          0.016345        4            0.004364
14           0.009939        45           0.004240
145          0.012751        5            0.002869
15           0.001758

Distribution 3 contained seven attributes. All combinations of all attributes were specified as join attributes. To conserve space, only those combinations with probabilities greater than 0.002 are given in Table B.3.

TABLE B.3. Probabilities of Distribution 3.

Attributes   Probability     Attributes   Probability
12345        0.008767        1257         0.002086
123456       0.052818        13456        0.008521
1234567      0.371376        134567       0.049846
123457       0.051017        13457        0.008944
12346        0.009462        13467        0.007809
123467       0.050739        13567        0.008678
12347        0.008395        1367         0.002033
12356        0.008590        1456         0.002182
123567       0.053106        14567        0.008064
12357        0.008511        23456        0.007981
1236         0.002118        234567       0.052932
12367        0.009124        23457        0.008754
1237         0.002091        23467        0.008366
12456        0.008281        23567        0.008808
124567       0.052394        2357         0.002109
12457        0.008576        24567        0.009351
12467        0.008866        34567        0.008316
12567        0.008671

REFERENCES

[1] Mishra, P. and Eich, M. H. (1992) Join processing in relational databases. ACM Comput. Surv., 24, 63–113.
[2] Kitsuregawa, M., Tanaka, H. and Moto-oka, T. (1983) Application of hash to data base machine and its architecture. New Generation Computing, 1, 66–74.
[3] DeWitt, D. J., Katz, R. H., Olken, F., Shapiro, L. D., Stonebraker, M. R. and Wood, D. (1984) Implementation techniques for main memory database systems. In Proceedings of the 1984 ACM SIGMOD International Conference on the Management of Data, June, Boston, MA, pp. 1–8. ACM Press, New York.
[4] Kitsuregawa, M., Nakayama, M. and Takagi, M. (1989) The effect of bucket size tuning in the dynamic hybrid GRACE hash join method. In Proceedings of the Fifteenth International Conference on Very Large Data Bases, August, Amsterdam, The Netherlands, pp. 257–266. Morgan Kaufmann, Palo Alto, CA.
[5] Nakayama, M., Kitsuregawa, M. and Takagi, M. (1988) Hash-partitioned join method using dynamic destaging strategy. In Proceedings of the Fourteenth International Conference on Very Large Data Bases, August, Los Angeles, CA, pp. 468–478. Morgan Kaufmann, Palo Alto, CA.
[6] Pang, H., Carey, M. J. and Livny, M. (1993) Partially preemptible hash joins. In Proceedings of the 1993 ACM SIGMOD International Conference on the Management of Data, May, Washington, DC, pp. 59–68. ACM Press, New York.
[7] Shapiro, L. D. (1986) Join processing in database systems with large main memories. ACM Trans. Database Systems, 11, 239–264.
[8] Zeller, H. and Gray, J. (1990) An adaptive hash join algorithm for multiuser environments. In Proceedings of the Sixteenth International Conference on Very Large Data Bases, August, Brisbane, Australia, pp. 186–197.
Morgan Kaufmann, Palo Alto, CA.
[9] Thom, J. A., Ramamohanarao, K. and Naish, L. (1986) A superjoin algorithm for deductive databases. In Proceedings of the Twelfth International Conference on Very Large Data Bases, August, Kyoto, Japan, pp. 189–196. Morgan Kaufmann, Palo Alto, CA.
[10] Ozkarahan, E. A. and Ouksel, M. (1985) Dynamic and order preserving data partitioning for database machines. In Proceedings of the Eleventh International Conference on Very Large Data Bases, Stockholm, Sweden, pp. 358–368. VLDB Endowment, Saratoga, CA.
[11] Ozkarahan, E. A. and Bozsahin, C. H. (1988) Join strategies using data space partitioning. New Generation Comput., 6, 19–39.
[12] Harada, L., Nakano, M., Kitsuregawa, M. and Takagi, M. (1990) Query processing method for multi-attribute clustered relations. In Proceedings of the Sixteenth International Conference on Very Large Data Bases, August, Brisbane, Australia, pp. 59–70. Morgan Kaufmann, Palo Alto, CA.
[13] Nievergelt, J., Hinterberger, H. and Sevcik, K. C. (1984) The grid file: an adaptable, symmetric multikey file structure. ACM Trans. Database Systems, 9, 38–71.
[14] Bentley, J. L. (1975) Multidimensional binary search trees used for associative searching. Commun. ACM, 18, 509–517.
[15] Lloyd, J. W. (1980) Optimal partial-match retrieval. BIT, 20, 406–413.
[16] Lloyd, J. W. and Ramamohanarao, K. (1982) Partial-match retrieval for dynamic files. BIT, 22, 150–168.
[17] Ramamohanarao, K., Shepherd, J. and Sacks-Davis, R. (1990) Multi-attribute hashing with multiple file copies for high performance partial-match retrieval. BIT, 30, 404–423.
[18] Chen, C. Y., Chang, C. C. and Lee, R. C. T. (1993) Optimal MMI file systems for orthogonal range queries. Information Systems, 18, 37–54.
[19] Harris, E. P. and Ramamohanarao, K. (1993) Optimal dynamic multi-attribute hashing for range queries. BIT, 33, 561–579.
[20] Freeston, M. (1987) The BANG file: a new kind of grid file. In Proceedings of the 1987 ACM SIGMOD International Conference on the Management of Data, May, San Francisco, CA, pp. 260–269. ACM Press, New York.
[21] Whang, K.-Y. and Krishnamurthy, R. (1991) The multilevel grid file — a dynamic hierarchical multidimensional file structure. In International Symposium on Database Systems for Advanced Applications, April, Tokyo, Japan, pp. 449–459. World Scientific, Singapore.
[22] Robinson, J. T. (1981) The k-d-B-tree: a search structure for large multidimensional dynamic indexes. In Proceedings of the 1981 ACM SIGMOD International Conference on the Management of Data, April, Ann Arbor, MI, pp. 10–18. ACM Press, New York.
[23] Harris, E. P. (1994) Towards optimal storage design for efficient query processing in relational database systems. Ph.D. Thesis. Department of Computer Science, The University of Melbourne, Parkville, Victoria 3052, Australia.
[24] Rivest, R. L. (1976) Partial-match retrieval algorithms. SIAM J. Comput., 5, 19–50.
[25] Ramamohanarao, K. and Lloyd, J. W. (1982) Dynamic hashing schemes. The Comput. J., 25, 478–485.
[26] Faloutsos, C. (1986) Multiattribute hashing using gray codes. In Proceedings of the 1986 ACM SIGMOD International Conference on the Management of Data, pp. 227–238. ACM Press, New York.
[27] Larson, P.-Å. (1978) Dynamic hashing. BIT, 18, 184–201.
[28] Larson, P.-Å. (1980) Linear hashing with partial expansions. In Proceedings of the Sixth International Conference on Very Large Data Bases, October, Montreal, Canada, pp. 224–232. IEEE Society Press, Long Beach, CA.
[29] Litwin, W. (1980) Linear hashing: a new tool for file and table addressing. In Proceedings of the Sixth International Conference on Very Large Data Bases, October, Montreal, Canada, pp. 212–223. IEEE Society Press, Long Beach, CA.
[30] Fagin, R., Nievergelt, J. and Strong, H. R. (1979) Extendible hashing — a fast access method for dynamic files. ACM Trans. Database Systems, 4, 315–344.
[31] Hsiao, Y. and Tharp, A. L. (1988) Adaptive hashing. Information Systems, 13, 111–127.
[32] Harris, E. P. and Ramamohanarao, K. (1996) Join algorithm costs revisited. The VLDB J., 5, 64–84.
[33] Harris, E. P. and Ramamohanarao, K. (1994) Using optimized multi-attribute hash indexes for hash joins. In Proceedings of the Fifth Australasian Database Conference, January, Christchurch, New Zealand, pp. 92–111. Global Publications Services, Singapore.
[34] Knuth, D. E. (1973) Sorting and Searching. The Art of Computer Programming, Vol. 3. Addison-Wesley, Reading, MA.
[35] Transaction Processing Performance Council (TPC) (1995) TPC Benchmark D (Decision Support) Standard Specification, Revision 1.0, May. 777 N. First Street, Suite 600, San Jose, CA 95112-6311.
[36] McVoy, L. W. and Kleiman, S. R. (1991) Extent-like performance from a UNIX file system. In Proceedings of the USENIX 1991 Winter Conference, January, Dallas, TX, pp. 33–43. USENIX Association, Berkeley, CA.
[37] Aarts, E. and Korst, J. (1989) Simulated Annealing and Boltzmann Machines. Wiley, New York.
[38] Ioannidis, Y. E. and Wong, E. (1987) Query optimization by simulated annealing. In Proceedings of the 1987 ACM SIGMOD International Conference on the Management of Data, May, San Francisco, CA, pp. 9–22. ACM Press, New York.
[39] Swami, A. (1989) Optimization of large join queries: combining heuristic and combinatorial techniques. In Proceedings of the 1989 ACM SIGMOD International Conference on the Management of Data, June, Portland, OR, pp. 367–376. ACM Press, New York.
[40] Ioannidis, Y. E. and Kang, Y. C. (1990) Randomized algorithms for optimizing large join queries. In Proceedings of the 1990 ACM SIGMOD International Conference on the Management of Data, May, Atlantic City, NJ, pp. 312–321. ACM Press, New York.
[41] Graefe, G. (1993) Query evaluation techniques for large databases. ACM Comput. Surveys, 25, 73–170.