Performance Modeling of the Grace Hash Join on Cluster Architectures

Erich Schikuta
Institut für Informatik und Wirtschaftsinformatik, University of Vienna
Rathausstraße 19/9, A-1010 Vienna, Austria
[email protected]
Abstract

The aim of this paper is to develop a concise but comprehensive analytical model for the well-known Grace Hash Join algorithm on cost-effective cluster architectures. This approach is part of an ongoing project to develop algorithms for the design of quasi-optimal query execution plans for parallel database systems. We concentrate on a limited number of characteristic parameters to keep the analytical model clear and focused. We believe that a meaningful model can be built upon only three characteristic parameter sets, describing main memory size, I/O bandwidth and disk bandwidth. We justify our approach by a practical implementation and a comparison of the theoretical and real performance values.
1. Introduction

Due to its immense costs and difficult manageability, the development of classical, special-purpose parallel database machines was replaced by the use of affordable and available distributed architectures. This trend was supported by the dramatic improvements in processor and network performance of commodity workstations built from relatively cheap, off-the-shelf parts. Nowadays clusters of such workstations are the focus of many high performance applications searching for viable and affordable platforms to replace expensive supercomputer architectures. A cluster system is a parallel or distributed processing system consisting of interconnected stand-alone workstations working together as a single, integrated computing resource [3]. Each of a cluster's workstations consists basically of three components: one (or more) processing unit(s) with its own main memory, a disk subsystem, and an interconnect. In practice such clusters have proven to be a cost-effective replacement for classical supercomputers in many application domains. We believe that cluster systems can also be a suitable environment for parallel database systems. There is also an urgent need for novel database architectures due to new stimulating application domains with huge data sets to administer, search and analyze. A typical example is high energy physics (see [15]), whose applications are based heavily on relational database technology.

Due to its inherent expressive power, the most important operation in a relational database system is the join. It allows combining information of different relations according to a user-specified condition, which makes it the most demanding operation of the relational algebra. Thus the join is obviously the central point of research for performance engineering in database systems. Generally two approaches for parallel or distributed execution of database operations can be distinguished, inter- and intra-operational parallelism. Inter-operator parallelism, resembling the rather simple, distributed approach, is achieved by executing distinct operators of a query execution plan in a parallel or pipelined parallel fashion. In contrast, intra-operator parallelism focuses on the parallel execution of a single operation among multiple processors by applying smart algorithms.

In the past a number of papers appeared covering this topic, like [11], [1], [4], [6], [7], [12], [16], which proposed and analyzed parallel database algorithms for parallel database machines. [2] presents an adaptive, load-balancing parallel join algorithm implemented on a cluster of workstations. The algorithm efficiently adapts to use additional join processors when necessary, while maintaining a balanced load. [8] develops a parallel hash-based join algorithm using shared-memory multiprocessor computers as nodes of a networked cluster. [18] uses a hash-based join algorithm to compare the designed cluster system with commercial parallel systems.

In this paper we present an analysis and evaluation of the Grace Hash Join. This work is part of a running project for a comprehensive analysis of parallel join operations ([9]). We did similar research for all important parallel join operations (e.g. the Hybrid Hash Join [14]). A focus on analyzing hardware characteristics of the underlying system is beyond the scope of this paper. So we are interested
in the specifics of the algorithms and not of the machines.
2. Parallel Database Operations

2.1. Declustering

Declustering is the general method in a parallel database system to increase the bandwidth of the I/O operations by reading and writing multiple disks in parallel. It denotes the distribution of the tuples (or records, i.e. the basic data units) of a relation among two or more disk drives, according to one or more attributes and/or a distribution criterion. Three basic declustering schemes can be distinguished (sketched in code below):

• range declustering (contiguous parts of a data set are stored on the same disk)
• round-robin declustering (tuples are spread consecutively on the disks)
• hash declustering (the location of a tuple is defined by a hash function)
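As a rough illustration, the three schemes can be written as tuple-placement functions. This is a minimal sketch; the function names and signatures are our own and not taken from the paper.

```python
# Minimal sketch of the three declustering schemes as placement functions.

def range_decluster(key, boundaries):
    """Range declustering: contiguous key ranges map to the same disk.
    `boundaries` lists the (inclusive) upper key bound of each disk."""
    for disk, upper in enumerate(boundaries):
        if key <= upper:
            return disk
    return len(boundaries) - 1  # keys above the last bound go to the last disk

def round_robin_decluster(tuple_index, num_disks):
    """Round-robin declustering: tuples are spread consecutively over the disks."""
    return tuple_index % num_disks

def hash_decluster(key, num_disks):
    """Hash declustering: a hash function on the key determines the disk."""
    return hash(key) % num_disks
```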
The disjoint property of the declustered sets R_i can be exploited by parallel algorithms based on the SPMD (single program, multiple data) paradigm. This means that multiple processors execute the same program, but each on a different set of tuples. In database terminology this approach resembles intra-operational parallelism. A realistic assumption of our model is that the relations of the database system are too large to fit into the main memory of the processing units. Consequently all operations have to be done externally, and the I/O costs are the dominant factor for system performance.

2.2. The Join operation

Basically the join operation 'merges' two relations R and S via two attributes (or attribute sets) A and B (of relations R and S respectively) according to a certain join condition. The join attributes have to have the same domain. Two different types of join operators are distinguished, the equi-join and the theta-join. In the following only the equi-join is discussed. In the literature [5] three different approaches for join algorithms are distinguished:

• sort merge,
• nested loop, and
• hash based join.

Basically the parallel versions of these approaches are realized in a conventional client-server scheme. The server stores both relations to join and distributes the declustered tuples among the available clients. The clients perform the specific join algorithm on their sub-relations and send the sub-results back to the server. The server collects the result tuples and stores the result relation.

3. Analytical Model

In the following the Grace Hash Join is presented with its specific characteristics. We develop a model of operation, and evaluate and justify it by a practical implementation.

3.1. Model Parameters

In the following (see Table 1) we specify several parameters and a few derived terms, which describe the characteristics of the model environment and build the basis for the derived cost functions.

  m            number of tuples of relation R (inner relation)
  n            number of tuples of relation S (outer relation)
  p            number of processors
  ntm          number of tuples per message
  b            bucket size (tuples per bucket)
  s            selectivity factor (percentage of the product of m and n giving the result size)
  lf           loop factor (number of loops necessary to build hash buckets due to open file limitations)
  read         read one tuple from disk
  write        write one tuple to disk
  receive      receive one message
  send         send one message
  find_target  find the right target client
  hash         store a tuple into a main memory hash table
  probe        probe a main memory hash table with a tuple
  fill         fill a tuple into main memory
  compare      compare the keys of two tuples in main memory and build a result tuple if the keys match

Table 1. Parameters of the cost model

The specific values of the basic parameters and the derived functions used in the theoretical model to calculate "real" numbers were profiled by a specific test program on the real hardware. The values are depicted in section 4.
The declustering of the data across the clients is equal for all four join algorithms and is therefore analyzed at the beginning. The server reads the two input relations, see equation (1),

    server_read = (m + n) * read    (1)

calculates the respective target client using a declustering function with p as one of its parameters (2),

    server_compute = (m + n) * find_target    (2)

and sends the tuples (packed in messages) to the target clients (3).

    server_send = (m/ntm + n/ntm) * send    (3)

After sending the messages the server is in an idle state. It waits for the results of the clients (4).

    server_receive = (m * n * s / ntm) * receive    (4)

The received tuples are written to disk (5).

    server_write = m * n * s * write    (5)

The total costs of the server are defined in (6) and are the sum of (1) to (5).

    server_cost = server_read + server_compute + server_send + server_receive + server_write    (6)

The server's I/O costs (read, write), message costs (send, receive) and computational costs are obviously independent of the number of processors used. Figure 1 shows this situation graphically by splitting the total server execution costs into the shares of I/O costs (read and write operations), communication costs (send and receive operations) and pure computational costs.

[Figure 1. Percentage of server-side cost types (computation, communication, I/O) over the number of tuples of R and S (50,000 to 1,000,000)]
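To make the server-side model concrete, the following sketch evaluates equations (1) to (6). It is written in Python for illustration (the paper's implementation was in C with PVM); the dictionary `times`, mapping each derived function of Table 1 to its profiled per-operation cost in seconds, is our own packaging.

```python
# Sketch of the server-side cost equations (1)-(6).

def server_cost(m, n, s, ntm, times):
    server_read    = (m + n) * times['read']               # (1) read both input relations
    server_compute = (m + n) * times['find_target']        # (2) find the target client per tuple
    server_send    = (m / ntm + n / ntm) * times['send']   # (3) tuples are packed in messages
    server_receive = (m * n * s / ntm) * times['receive']  # (4) collect the result tuples
    server_write   = m * n * s * times['write']            # (5) write the result relation
    return (server_read + server_compute + server_send     # (6) sum of (1) to (5)
            + server_receive + server_write)
```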
3.2. Grace Hash-Join

A well-known representative of hash join algorithms is the Grace join algorithm [10]. It works in three phases. In the first phase, the algorithm declusters relation R (inner relation) into N disk buckets by hashing on the join attribute of each tuple in R. In phase 2, relation S (outer relation) is declustered into N buckets using the same hash function. In the final phase, the algorithm joins the respective matching buckets from relations R and S and builds the result relation.

The number of buckets, N, is chosen to be very large. This reduces the chance that any bucket of relation R will exceed the memory capacity of the processor used to actually effect the join (the buckets of the outer relation may exceed the memory capacity because they are only used for probing the buckets of the inner relation and need not fit in main memory). In the case that the buckets are much smaller than main memory, several will be combined during the third phase to form join buckets of more optimal size (referred to as bucket tuning in [10]).

The Grace algorithm differs from the sort merge and simple-hash join algorithms in that data partitioning occurs at two different stages: during bucket forming and during bucket joining. Parallelizing the algorithm thus must address both of these data partitioning stages. To ensure maximum utilization of the available I/O bandwidth during the bucket-joining stage, each bucket is partitioned across all available disk drives. A partitioning split table is used for this task. To join the i-th bucket of R with the i-th bucket of S, the tuples from the i-th bucket of R are distributed to the available joining processors using a joining split table (which contains one entry for each processor used to effect the join). As tuples arrive at a site they are stored in in-memory hash tables. Tuples from the i-th bucket of relation S are then distributed using the same joining split table, and as tuples arrive at a processor they probe the hash table for matches. The bucket-forming phase is completely separate from the joining phase under the Grace join algorithm. This separation of phases forces the Grace algorithm to write the buckets of both joining relations back to disk before beginning the join stage of the algorithm.
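The three phases can be illustrated by a minimal, single-node, in-memory sketch. This is our own illustration: it models bucket files as Python lists and omits bucket tuning, the split tables, and all parallelism.

```python
# Minimal in-memory sketch of the three Grace hash join phases, assuming
# both relations are lists of (key, payload) tuples.

def grace_hash_join(R, S, num_buckets):
    # Phase 1: decluster R into buckets by hashing on the join attribute.
    r_buckets = [[] for _ in range(num_buckets)]
    for key, payload in R:
        r_buckets[hash(key) % num_buckets].append((key, payload))

    # Phase 2: decluster S into buckets using the same hash function.
    s_buckets = [[] for _ in range(num_buckets)]
    for key, payload in S:
        s_buckets[hash(key) % num_buckets].append((key, payload))

    # Phase 3: join matching bucket pairs. Build a hash table on the R
    # bucket (which must fit in memory) and probe it with the S bucket.
    result = []
    for rb, sb in zip(r_buckets, s_buckets):
        table = {}
        for key, payload in rb:
            table.setdefault(key, []).append(payload)
        for key, s_payload in sb:
            for r_payload in table.get(key, []):
                result.append((key, r_payload, s_payload))
    return result
```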
3.2.1 Cost model

The costs for the clients start with receiving the tuples of the inner and outer relation from the server. Every client gets only m/p tuples of the inner relation R and n/p tuples of the outer relation S. The costs for receiving are described by (7).

    client_receive = (m/p)/ntm * receive + (n/p)/ntm * receive    (7)

Afterwards the received tuples are written to the local disk (8).

    build_temp_files = (m/p) * write + (n/p) * write    (8)

In lf loops the buckets are built using a hash table. We need lf loops because there is a restriction on the number of possible open files within a system. The formula for these costs is given by (9).

    build_buckets = ((m/p) * lf + (n/p) * lf) * (read + hash + write)    (9)

In the third phase the buckets of relation R are read and an in-memory hash table is filled with the tuples of the buckets. The costs are described by (10).

    build_hash = (m/p) * (read + hash)    (10)

The hash table is probed with every tuple of S. If the tuples match, a result tuple is created (11).

    probe_hash = (n/p) * (read + probe)    (11)

Finally the client has to send the result tuples back to the server (12).

    send_result = (m/p * n/p) * (s/ntm) * send    (12)

The complete costs of the client (13) are the sum of (7) to (12).

    client_grace_cost = client_receive + build_temp_files + build_buckets + build_hash + probe_hash + send_result    (13)

The total cost for the Grace hash join (14) is the sum of the cost of the server and (13).

    grace_cost = server_cost + client_grace_cost    (14)

The theoretical speed-up of the Grace Hash Join algorithm based on our model is shown in Figure 2. Independently of the number of input tuples we can calculate a significant speedup. Thus we see a perfectly linear behavior of the algorithm. Please note that the x axis is logarithmic.

[Figure 2. Theoretical speed-up of the Grace Hash Join: execution time in seconds over 1 to 4 processors, one curve per input size (50k to 1000k tuples)]

Figure 3 shows the behavior of the algorithm with increasing workload (note the quasi-logarithmic scale of the x axis). It can be seen that a duplication of the number of input tuples leads to a duplication of costs, independent of the number of processors (1 to 4). Additionally we see again the improvement of performance when using more processors, which justifies our first result.

[Figure 3. Theoretical scale-up of the Grace Hash Join: execution time in seconds over the number of tuples of R and S (50,000 to 1,000,000), one curve per processor count (1 to 4)]

It is interesting to check the percentages of I/O costs, message costs and computational costs in relation to the total client costs. The model shows a steady distribution of the different cost shares while changing the number of input tuples. This could be expected because the Grace hash join is of linear order in n. This situation is depicted in Figure 4. Even the number of clients does not change the distribution of the cost shares, as can easily be seen in Figure 5.

[Figure 4. Percentage of client-side cost types (computation, message, I/O) with changing tuple numbers]

[Figure 5. Percentage of client-side cost types (computation, message, I/O) with changing client numbers]
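Analogously to the server side, the client-side equations (7) to (13) and the total cost (14) can be sketched as follows. This is again our own Python packaging of the formulas; `server_cost` is the sketch from section 3.1 and `times` the same per-operation cost table.

```python
# Sketch of the client-side cost equations (7)-(13) and the total cost (14).

def client_grace_cost(m, n, s, p, ntm, lf, times):
    client_receive   = ((m / p) / ntm + (n / p) / ntm) * times['receive']  # (7)
    build_temp_files = ((m / p) + (n / p)) * times['write']                # (8)
    build_buckets    = ((m / p) * lf + (n / p) * lf) * (                   # (9)
        times['read'] + times['hash'] + times['write'])
    build_hash       = (m / p) * (times['read'] + times['hash'])           # (10)
    probe_hash       = (n / p) * (times['read'] + times['probe'])          # (11)
    send_result      = (m / p) * (n / p) * (s / ntm) * times['send']       # (12)
    return (client_receive + build_temp_files + build_buckets              # (13)
            + build_hash + probe_hash + send_result)

def grace_cost(m, n, s, p, ntm, lf, times):
    # (14): total cost is the server cost plus the (parallel) client cost.
    return (server_cost(m, n, s, ntm, times)
            + client_grace_cost(m, n, s, p, ntm, lf, times))
```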
4. Model Justification

To justify the presented model we evaluate it and compare it to a practical performance analysis, where we realized the algorithm according to the preceding section.
We realized the client-server architecture so that one node implemented the server (starting the operations, distributing the workload, and collecting the result) and 4 client nodes processed the distributed workload. At the beginning of each test run the relations R (inner relation) and S (outer relation) reside on the server. In the tests we used an integer variable as join attribute. The test-bed for our analysis was an off-the-shelf "el-cheapo" PC cluster consisting of 5 single-processor nodes (computational units) running the Linux operating system. The algorithms were realized in the C language with PVM [17] as communication library. The specific values of the parameters used in our tests are depicted in Table 2.

  m    50k, 100k, 250k, 500k, 1000k
  n    50k, 100k, 250k, 500k, 1000k
  p    1, 2, 3, 4
  ntm  100
  b    1000
  s    1/m
  lf   10

Table 2. Specific values of the basic parameters

We used a test module to determine the values of the basic parameters and the derived functions (measured in seconds) in our cost model. The results can be seen in Table 3.

  read         0.0000105 seconds
  write        0.00001 seconds
  receive      0.0025 seconds
  send         0.0025 seconds
  find_target  0.000005 seconds
  hash         0.00001 seconds
  probe        0.00001 seconds
  fill         0.0000008 seconds
  compare      0.0000008 seconds

Table 3. Values for the derived functions of the cost model

Figure 6 gives the real execution times for the Grace Hash Join. All given values are averages of at least 20 runs.

[Figure 6. Real speed-up of the Grace Hash Join: execution time in seconds over 1 to 4 processors, one curve per input size (50k to 1000k tuples)]

Analogously to the speed-up, the scale-up values given in Figure 7 show the expected behavior.

[Figure 7. Real scale-up of the Grace Hash Join: execution time in seconds over the number of tuples of R and S, one curve per processor count (1 to 4)]

The real values correspond to the theoretical values amazingly well. The asymptotic runtime behavior for increasing workloads and processing nodes (speed-up) of the model and of the reality is about the same. Not only is the trend of the data the same, but the real execution values also match the ones calculated by the model. The difference between the two was about 15 percent, which is due to the simplified model ignoring operating system specifics. Summing up, this result shows that the simplified approach described in the previous section models reality very well and justifies it as a basis for thoroughly analyzing the algorithm's behavior on a cluster architecture.
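For illustration, the profiled values of Table 3 and the parameter settings of Table 2 can be plugged into the cost sketches above to reproduce the model's predictions, e.g. for m = n = 1,000,000 tuples. The script below is our own illustration and assumes the `server_cost` and `grace_cost` sketches from the previous sections.

```python
# Worked evaluation of the model with the profiled values of Table 3 and
# the parameter settings of Table 2 (m = n = 1,000,000, s = 1/m, ntm = 100,
# lf = 10), reusing the cost functions sketched above.

times = {
    'read': 0.0000105, 'write': 0.00001,
    'receive': 0.0025, 'send': 0.0025,
    'find_target': 0.000005,
    'hash': 0.00001, 'probe': 0.00001,
}

m = n = 1_000_000
s, ntm, lf = 1.0 / m, 100, 10

base = grace_cost(m, n, s, 1, ntm, lf, times)  # single-processor cost
for p in (1, 2, 3, 4):
    t = grace_cost(m, n, s, p, ntm, lf, times)
    print(f"{p} processor(s): predicted {t:.1f} s, speed-up {base / t:.2f}")
```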
5. Conclusions

By the presented analysis of the proposed concise, but obviously comprehensive model and its justification by the comparison to real implementation results, we could prove our stated assumption from the beginning of the paper: to build an analytical model for relational operations on cluster systems it is sufficient to concentrate on the characteristics of main memory, I/O bandwidth and disk bandwidth. Thus it is possible to model the execution time behavior of data-intensive operations on clusters accurately, which builds the needed basis for the development of query analyzers for parallel/distributed database systems. There is an ongoing project which uses an artificial intelligence based blackboard method as a heuristic approach to find near-optimal parallel query execution plans for relational database queries [13]. As a side effect of our analysis we made a case for the usage of cluster systems as a base architecture for parallel database systems. With the development of faster and cheaper network interfaces, clusters can deliver an unbeatable price/performance ratio for many formerly pure supercomputer applications, as in our case the administration and manipulation of very large database systems.

6. Acknowledgements

First we would like to express our thanks to Peter Kirkovits for implementing the Grace Hash Join algorithm and to Thomas Fuerle, who was always helpful in the administration of the cluster and the programming of the test suite. The work described in this paper was partly supported by the Special Research Program SFB F011 AURORA of the Austrian Science Fund.

References
[1] P. America. Parallel database systems. In Proc. PRISMA Workshop. Springer Verlag, 1990.
[2] M. B. Amin, D. A. Schneider, and V. Singh. An adaptive, load balancing parallel join algorithm. In Sixth International Conference on Management of Data (COMAD'94), Bangalore, India, 1994.
[3] M. Baker and R. Buyya. High Performance Cluster Computing, chapter Cluster Computing at a Glance, pages 3–47. Prentice Hall, 1999.
[4] G. Copeland, W. Alexander, E. Boughter, and T. Keller. Data placement in Bubba. In Proc. of the ACM-SIGMOD Int. Conf. on Management of Data. ACM, 1988.
[5] C. Date. An Introduction to Database Systems. Addison-Wesley, 1986.
[6] D. DeWitt, R. Gerber, G. Graefe, M. Heytens, K. Kumar, and M. Muralikrishna. GAMMA - a high performance dataflow database machine. In Proc. of the 12th Conf. on Very Large Data Bases, 1986.
[7] D. DeWitt and J. Gray. Parallel database systems: The future of database processing or a passing fad? ACM-SIGMOD Record, 19(4), 1990.
[8] Y. Jiang and A. Makinouchi. A parallel hash-based join algorithm for a networked cluster of multiprocessor nodes. In Proceedings of COMPSAC '97 - 21st International Computer Software and Applications Conference, 1997.
[9] P. Kirkovits and E. Schikuta. Parallel join algorithms on clusters: Analysis and evaluation. Accepted for publication, 2002.
[10] M. Kitsuregawa, H. Tanaka, and T. Moto-Oka. Application of hash to data base machine and its architecture. New Generation Computing, 1(1), 1983.
[11] M. H. Nodine and J. S. Vitter. Optimal deterministic sorting on parallel disks. In Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '93), pages 120–129, Velen, Germany, 1993.
[12] H. Pirahesh, C. Mohan, J. Cheng, T. Liu, and P. Selinger. Parallelism in relational database systems: Architectural issues and design approaches. In Proc. of the IEEE Conf. on Distributed and Parallel Database Systems. IEEE Computer Society Press, 1990.
[13] E. Schikuta and P. Kirkovits. A knowledge base for the optimization of parallel query execution plans. In Proc. Int. Symposium on Parallel Architectures, Algorithms and Networks (ISPAN'97). IEEE Computer Society Press, 1997.
[14] E. Schikuta and P. Kirkovits. Cluster based hybrid hash join: Analysis and evaluation. In Proc. IEEE International Conference on Cluster Computing, Chicago, 2002. IEEE Computer Society Press.
[15] B. Segal. Grid computing: The European data project. In IEEE Nuclear Science Symposium and Medical Imaging Conference, Lyon, France, 2000. IEEE Computer Society Press.
[16] M. Stonebraker, P. Aoki, R. Devine, W. Litwin, and M. Olson. Mariposa: A new architecture for distributed data. In Proc. of the Int. Conf. on Data Engineering. IEEE Computer Society Press, 1994.
[17] V. S. Sunderam. PVM: A framework for parallel distributed computing. Concurrency: Practice and Experience, 2(4):315–339, December 1990.
[18] T. Tamura, M. Oguchi, and M. Kitsuregawa. Parallel database processing on a 100 node PC cluster. In Proc. of Supercomputing '97. IEEE Computer Society Press, 1997.