Parallel “GroupBy-Before-Join” Query Processing for High Performance Parallel/Distributed Database Systems David Taniar1 and Wenny Rahayu2 1 Monash University, Australia 2 La Trobe University, Australia
[email protected],
[email protected]
Abstract
GroupBy-Join queries in SQL are queries that involve both the group-by clause and joins of several tables. In this paper, we describe three parallelization techniques for GroupBy-Join queries, particularly queries where the group-by clause can be performed before the join operation. We subsequently call these "GroupBy-Before-Join" queries. A performance evaluation of the three parallel processing methods is also carried out.
1. Introduction
Queries involving aggregates are very common and are often used as a tool for strategic decision making [1]. As the data repositories containing data for integrated decision making grow, database queries are required to execute efficiently [2]. Consequently, effective optimization has the potential to yield huge performance gains. In this paper, we focus on the use of parallel query processing techniques for GroupBy-Join queries. It is common for a GroupBy query to involve multiple tables [4,8]. These tables are joined to produce a single table, and this table becomes the input to the group-by operation. The following query illustrates a GroupBy-Join: "Retrieve project numbers, names, and total quantity of shipments for each project having a total shipment quantity of more than 1000".

Select J.J#, J.Jname, SUM(Qty)
From PROJECT J, SHIPMENT S
Where J.J# = S.J#
Group By J.J#, J.Jname
Having SUM(Qty) > 1000
When the join attribute and the group-by attribute are the same, as in the above query, it is preferable that the group-by operation be carried out first, followed by the join operation. Hence, we call it a GroupBy-Before-Join query.
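To make the transformation concrete, the sketch below checks on a small hypothetical data set (the sample rows are ours, not from the paper) that aggregating SHIPMENT on J# before the join produces the same answer as the naive join-then-group order for the example query above.

```python
from collections import defaultdict

# Hypothetical sample data (illustration only, not the paper's experimental data).
project = [("P1", "Bridge"), ("P2", "Tunnel")]        # (J#, Jname)
shipment = [("P1", 600), ("P1", 700), ("P2", 300)]    # (J#, Qty)

# GroupBy-Before-Join: aggregate SHIPMENT on the join/group-by attribute J# first ...
totals = defaultdict(int)
for j, qty in shipment:
    totals[j] += qty

# ... then join the (much smaller) aggregate with PROJECT and apply the HAVING clause.
result = sorted((j, name, totals[j]) for j, name in project
                if totals.get(j, 0) > 1000)

# Naive order (join first, then group) for comparison.
joined = [(j, name, qty) for j, name in project for j2, qty in shipment if j == j2]
naive = defaultdict(int)
names = {}
for j, name, qty in joined:
    naive[j] += qty
    names[j] = name
naive_result = sorted((j, names[j], q) for j, q in naive.items() if q > 1000)

assert result == naive_result
```

Because J# is both the join and group-by attribute, every SHIPMENT record that contributes to a group would join with exactly one PROJECT record anyway, so the two orders are equivalent while the early aggregation shrinks the join input.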
2. Parallelization Methods
We discuss three parallel algorithms as follows.
• Early Distribution Scheme
The Early Distribution scheme is divided into two phases: the distribution phase and the group-by-join phase. Using the query example described earlier, the two tables to be joined are Project and Shipment, joined on attribute J#, and the group-by is based on table Shipment. For simplicity of notation, the table which forms the basis of the group-by is called table R (e.g. table Shipment), and the other table is called table S (e.g. table Project). In the distribution phase, raw records from both tables (i.e. tables R and S) are distributed based on the join/group-by attribute according to a data partitioning function. An example of a partitioning function is to allocate to each processor the project numbers falling within a certain range. This distribution scheme is commonly used in parallel join, where raw records are partitioned into buckets based on an adopted partitioning scheme such as the range partitioning above [3]. Once the distribution is completed, each processor holds records within certain groups identified by the group-by/join attribute. Subsequently, the second phase (the group-by-join phase) groups the records of table R on the group-by attribute and calculates the aggregate values for each group. Aggregation in each processor can be carried out through sorting or a hash function. After table R is grouped in each processor, it is joined with table S in the same processor. After joining, each processor holds a local query result; the final query result is the union of all sub-results produced by the processors.
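The two phases can be sketched as follows. This is a minimal single-process simulation of the scheme, not the paper's C/MPI implementation: the dictionaries indexed by processor number stand in for the N processors, and `part` is a hypothetical hash partitioning function standing in for the range partitioning described above.

```python
from collections import defaultdict

def part(key, n):
    # Hypothetical partitioning function on the join/group-by attribute.
    return hash(key) % n

def early_distribution(R, S, n):
    """R: (key, qty) records of the group-by table; S: (key, payload) records."""
    # Phase 1 (distribution): raw records of BOTH tables are partitioned on the
    # join/group-by attribute, so matching keys land on the same processor.
    r_at = defaultdict(list)
    s_at = defaultdict(list)
    for k, qty in R:
        r_at[part(k, n)].append((k, qty))
    for k, payload in S:
        s_at[part(k, n)].append((k, payload))
    # Phase 2 (group-by-join): each processor aggregates its R records into a
    # hash table, then probes it with its S records; the final result is the
    # union of the local results.
    result = []
    for p in range(n):
        agg = defaultdict(int)
        for k, qty in r_at[p]:
            agg[k] += qty
        for k, payload in s_at[p]:
            if k in agg:
                result.append((k, payload, agg[k]))
    return sorted(result)
```

For example, `early_distribution([("P1", 600), ("P1", 700), ("P2", 300)], [("P1", "Bridge"), ("P2", "Tunnel")], 4)` returns one aggregated row per project joined with its name.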
• Early GroupBy with Partitioning Scheme
The Early GroupBy scheme performs the group-by operation first, before anything else (e.g. distribution).
Proceedings of the 20th International Conference on Advanced Information Networking and Applications (AINA'06), 1550-445X/06 $20.00 © 2006 IEEE
The Early GroupBy with Partitioning scheme is divided into three phases: (i) the local grouping phase, (ii) the distribution phase, and (iii) the final grouping and join phase. In the local grouping phase, each processor performs its group-by operation and calculates its local aggregate values on its records of table R; that is, each processor groups its local records of R according to the designated group-by attribute and applies the aggregate function. In the second phase (i.e. the distribution phase), the local aggregate results from each processor, together with the records of table S, are distributed to all processors according to a partitioning function. The partitioning function is based on the join/group-by attribute, which in this case is attribute J# of tables Project and Shipment. In the third phase (i.e. the final grouping and join phase), two operations are carried out: the final aggregation or grouping of R, and its join with S. The final grouping is carried out by merging all temporary results obtained in each processor. Once the distribution of the local results based on the chosen partitioning function is completed, global aggregation in each processor is done simply by merging all entries with identical project numbers (J#) into one aggregate value each. After the global aggregation results are obtained, they are joined with table S in each processor.
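The three phases can be sketched as below, again as a single-process simulation rather than the paper's MPI implementation; `R_frags[p]` and `S_frags[p]` stand for the fragments initially stored at processor p, and `part` is a hypothetical partitioning function.

```python
from collections import defaultdict

def part(key, n):
    # Hypothetical partitioning function on the join/group-by attribute.
    return hash(key) % n

def early_groupby_partitioning(R_frags, S_frags, n):
    """R_frags[p]: (key, qty) records at processor p; S_frags[p]: (key, payload)."""
    # Phase 1 (local grouping): each processor aggregates its own R fragment.
    local_aggs = []
    for frag in R_frags:
        agg = defaultdict(int)
        for k, qty in frag:
            agg[k] += qty
        local_aggs.append(agg)
    # Phase 2 (distribution): local aggregates and raw S records are partitioned
    # on the join/group-by key.
    r_at = defaultdict(lambda: defaultdict(int))
    s_at = defaultdict(list)
    for agg in local_aggs:
        for k, v in agg.items():
            r_at[part(k, n)][k] += v   # merging identical keys = global aggregation
    for frag in S_frags:
        for k, payload in frag:
            s_at[part(k, n)].append((k, payload))
    # Phase 3 (final grouping and join): join global aggregates with S locally.
    result = []
    for p in range(n):
        for k, payload in s_at[p]:
            if k in r_at[p]:
                result.append((k, payload, r_at[p][k]))
    return sorted(result)
```

Note that only one partial aggregate per group per processor crosses the network, rather than every raw R record as in the Early Distribution scheme.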
• Early GroupBy with Replication Scheme
The Early GroupBy with Replication scheme, which is also divided into three phases, works as follows. The first phase, local grouping, is exactly the same as that of the Early GroupBy with Partitioning scheme. The main difference is in phase two: the local aggregate results obtained by each processor are replicated to all processors. The third phase, the final grouping and join phase, is basically similar to that of the "with Partitioning" scheme: the local aggregates from all processors are merged to obtain the global aggregates, which are then joined with S.
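A sketch of the replication variant follows, under the same simulation assumptions as the previous sketches. Because every processor receives every local aggregate, each processor builds the same global aggregate table, so S never has to move; the merge is shown once here although in reality it happens redundantly on all N processors.

```python
from collections import defaultdict

def early_groupby_replication(R_frags, S_frags):
    """R_frags[p]: (key, qty) records at processor p; S_frags[p]: (key, payload)."""
    # Phase 1 (local grouping): identical to the partitioning scheme.
    local_aggs = []
    for frag in R_frags:
        agg = defaultdict(int)
        for k, qty in frag:
            agg[k] += qty
        local_aggs.append(agg)
    # Phase 2 (replication): every processor receives all local aggregates and
    # merges them into an identical global aggregate table.
    global_agg = defaultdict(int)
    for agg in local_aggs:
        for k, v in agg.items():
            global_agg[k] += v
    # Phase 3 (final grouping and join): each processor joins the replicated
    # global aggregates with its own fragment of S, which never moves.
    result = []
    for frag in S_frags:
        for k, payload in frag:
            if k in global_agg:
                result.append((k, payload, global_agg[k]))
    return sorted(result)
```

The trade-off against the partitioning scheme is that no partitioning of S is needed, at the price of sending each local aggregate to N-1 processors.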
3. Analytical Models
The notations used by the cost models are presented in Table 1. The projectivity and selectivity ratios (i.e. π and σ) in the query parameters have values ranging from 0 to 1. The projectivity ratio π is the ratio between the projected attribute size and the original record length. Since two tables are involved, we use the notations πR and πS to distinguish between the projectivity ratios of the two tables.
Table 1. Cost Notations

System and Data Parameters:
N: Number of processors
R, S: Size of table R and table S
|R|, |S|: Number of records in table R and table S
|Ri|, |Si|: Number of records in table R and table S on node i
P: Page size
H: Hash table size

Query Parameters:
πR, πS: Projectivity ratios of table R and table S
σR: Group-by selectivity ratio of table R
σj: Join selectivity ratio

Time Unit Costs:
IO: Effective time to read a page from disk
tr: Time to read a record
tw: Time to write a record
th: Time to compute a hash value
ta: Time to add a record to the current aggregate value
tj: Time to compare a record with a hash table entry
td: Time to compute a destination

Communication Costs:
mp: Message protocol cost per page
ml: Message latency for one page
There are two different kinds of selectivity ratio: one is related to the group-by operation, whereas the other is related to the join operation. The group-by selectivity ratio σR is the ratio between the number of groups in the aggregate result and the original total number of records. Since table R is the table being aggregated by the group-by, the selectivity ratio σR is applicable to table R only. The join selectivity ratio σj is similar: it is the ratio between the size of the join result and the product of the sizes of the two tables R and S. The number of records in table R is denoted by |R|, and the number of records in table R on node (processor) i is denoted by |Ri|; the size of table R's fragment on node i is denoted by Ri.
3.1. “Early Distribution” Scheme
Cost Models for Phase One (Distribution Phase):
• Scan cost is the cost of loading data from the local disk in each processor. Since data is loaded from disk page by page, the size of the table fragment residing on each disk is divided by the page size to obtain the number of pages:
((Ri / P) × IO) + ((Si / P) × IO) (1)
The left-hand side term is the data loading cost of table R in processor i, whereas the right-hand side term is the corresponding loading cost of table S.
• Select cost is the cost of getting records out of the data pages, calculated as the number of records loaded from disk times the unit costs of reading and writing them to main memory:
(|Ri| × (tr + tw)) + (|Si| × (tr + tw)) (2)
The select cost involves the records of both tables R and S in each processor.
• Determining the destination cost is the cost of calculating the destination of each record to be distributed from phase one to phase two. This overhead is given by the number of records in each fragment times the destination computation unit cost, as follows.
( |Ri| × td ) + ( |Si| × td ) (3)
• Data transfer cost for sending records to other processors is given by the number of pages to be sent multiplied by the message unit cost:
((πR × Ri / P) × (mp + ml)) + ((πS × Si / P) × (mp + ml)) (4)
When distributing the records during the first phase, only those attributes relevant to the query are redistributed. This factor is captured by the projectivity ratios, denoted by π.

Cost Models for Phase Two (GroupBy-Join Phase):
• Receiving records cost from the processors in the first phase is calculated as the number of pages of projected values of the two tables multiplied by the message unit cost:
((πR × Ri / P) × mp) + ((πS × Si / P) × mp) (5)
If the number of groups is less than the number of processors, Ri = R / (Number_of_Groups) instead of Ri = R / N (i.e. assuming uniform distribution), because not all processors are used. Consequently, when the number of groups is smaller than the available number of processors, performance can be expected to be poor.
• Aggregation and join costs involve reading, hashing, computing the cumulative value, and probing:
(|Ri| × (tr + th + ta)) + (|Si| × (tr + th + tj)) (6)
The aggregation process basically reads each record of R, hashes it into a hash table, and computes the aggregate value. After all records of R have been processed, the records of S are read, hashed, and probed. If they are matched, the matching records are written out to the query result. The hashing process is very much determined by how much of the hash table fits into main memory. If main memory is smaller than the hash table, the hash table is normally partitioned into multiple buckets, each of which fits into main memory, and all but the first bucket are spooled to disk. In this scenario, we must include the I/O cost for reading and writing the overflow buckets, as follows.
• Reading/writing of overflow buckets cost is the I/O cost associated with the inability of main memory to accommodate the entire hash table. It includes the costs of writing and then reading back the records not processed in the first pass of hashing:
(1 − min(H / (σR × |Ri|), 1)) × (πR × Ri / P) × 2 × IO (7)
The first factor of the above equation can be explained as follows. For example, if the maximum hash table size H is 10 records, the selectivity ratio σR is 0.25, and there are 200 records (|Ri|), then the number of groups in the query result will be 50 (σR × |Ri|). Since only 10 groups can be processed at a time, the hash table must be broken into 5 buckets, and all buckets but the first are spooled to disk. Hence, 80% of the groups (1 − (10/50)) overflow. Should there be no more than 10 groups in the query result, the first factor would be 0 (zero), and there would be no overhead. The constant 2 refers to two input/output accesses: one for spooling the overflow buckets to disk, and one for reading the overflow buckets back from disk.
• Generating result records cost is the number of selected records multiplied by the writing unit cost:
|Ri| × σR × |Si| × σj × tw (8)
• Disk cost for storing the final result is the number of pages needed to store the final aggregate values times the disk unit cost:
(πR × σR × Ri × πS × Si × σj / P) × IO (9)
The total cost of the Early Distribution scheme is the sum of equations (1) to (9) above.
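The cost model above can be written out directly. The sketch below is our own transcription of equations (1) to (9) into a function, with the Table 1 parameters supplied in a dictionary; the parameter values in the usage example are hypothetical, chosen only to exercise the overflow term of equation (7).

```python
def early_distribution_cost(p):
    """Total per-processor cost of the Early Distribution scheme, eqs (1)-(9).

    p holds the Table 1 notations: fragment sizes Ri/Si, record counts
    nRi/nSi (i.e. |Ri|, |Si|), page size P, hash table size H, and the
    projectivity (pi_*), selectivity (sigma_*), time-unit and message costs.
    """
    Ri, Si, nRi, nSi = p["Ri"], p["Si"], p["nRi"], p["nSi"]
    P, H, IO = p["P"], p["H"], p["IO"]
    pr, ps = p["pi_R"], p["pi_S"]
    sr, sj = p["sigma_R"], p["sigma_j"]
    tr, tw, th, ta, tj, td = (p[k] for k in ("tr", "tw", "th", "ta", "tj", "td"))
    mp, ml = p["mp"], p["ml"]

    scan   = (Ri / P) * IO + (Si / P) * IO                          # (1)
    select = nRi * (tr + tw) + nSi * (tr + tw)                      # (2)
    dest   = nRi * td + nSi * td                                    # (3)
    send   = (pr * Ri / P) * (mp + ml) + (ps * Si / P) * (mp + ml)  # (4)
    recv   = (pr * Ri / P) * mp + (ps * Si / P) * mp                # (5)
    agg    = nRi * (tr + th + ta) + nSi * (tr + th + tj)            # (6)
    over   = (1 - min(H / (sr * nRi), 1)) * (pr * Ri / P) * 2 * IO  # (7)
    gen    = nRi * sr * nSi * sj * tw                               # (8)
    store  = (pr * sr * Ri * ps * Si * sj / P) * IO                 # (9)
    return scan + select + dest + send + recv + agg + over + gen + store
```

With H = 10, σR = 0.25 and |Ri| = 200 as in the worked example above, the overflow factor of equation (7) evaluates to 0.8, i.e. 80% of the groups are spooled to disk and read back.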
3.2. “Early GroupBy with Partitioning” Scheme
Cost Models for Phase One (Grouping Phase):
• Scan cost is associated with both tables R and S, and is the same as that of the "Early Distribution" scheme; equation (1) presented in the previous section can therefore be used.
• Select cost is also associated with both tables R and S, and is identical to equation (2) in the Early Distribution scheme.
• Local aggregation cost covers the costs of reading, hashing, and accumulating the aggregate values:
|Ri| × (tr + th + ta) (10)
Notice that this cost equation involves R only, and not S, since table S is not yet processed. Equation (10) is similar to the left-hand side term of equation (6) presented in the previous section; the only difference is that equation (6) also involves the hashing/probing cost of table S, whereas equation (10) involves only the aggregation cost.
• Reading/writing of overflow buckets cost is similar to equation (7) in the Early Distribution scheme. The main difference is that the group-by selectivity factor used here is σR1 instead of σR, because in the Early GroupBy with Partitioning scheme there are two group-by operations: the local group-by and the final/global group-by. Here σR1 denotes the first group-by selectivity ratio.
(1 − min(H / (σR1 × |Ri|), 1)) × (πR × Ri / P) × 2 × IO (11)
• Generating final result cost is:
|Ri| × σR1 × tw (12)
The sum of equations (10) to (12) gives the total cost for phase one of the Early GroupBy with Partitioning scheme.

Cost Models for Phase Two (Distribution Phase):
• Determining the destination cost is associated with both tables R and S, since both tables are distributed:
( |Ri| × σR1 × td ) + ( |Si| × td ) (13)
• Data transfer cost is the cost of sending the local aggregate results and the fragment of table S from each processor:
((πR × σR1 × Ri / P) × (mp + ml)) + ((πS × Si / P) × (mp + ml)) (14)
• Receiving records cost is similar to the data transfer cost, but without the message latency overhead:
((πR × σR1 × Ri / P) × mp) + ((πS × Si / P) × mp) (15)
The sum of equations (13) to (15) gives the total cost for phase two of the Early GroupBy with Partitioning scheme.

Cost Models for Phase Three (GroupBy-Join Phase):
• Aggregation and join costs involve reading, hashing, computing the cumulative value, and probing:
(|Ri| × σR1 × (tr + th + ta)) + (|Si| × (tr + th + tj)) (16)
• Reading/writing of overflow buckets cost is similar to equation (7) described earlier. The overflow percentage is determined by the maximum hash table size and the table being hashed. Notice that |Ri| is reduced by σR2, the second group-by selectivity ratio. We assume σR1 ≥ σR2, meaning that the second group-by selectivity is a further filtering of the first. Notice also that the I/O cost associated with |Ri| is reduced by the first group-by selectivity ratio σR1:
(1 − min(H / (σR2 × |Ri|), 1)) × (πR × σR1 × Ri / P) × 2 × IO
+ (1 − min(H / (σR2 × |Ri|), 1)) × (πS × Si / P) × 2 × IO (17)
• Generating result records cost is the number of selected records multiplied by the writing unit cost, which is identical to equation (8); the Early GroupBy with Partitioning scheme also uses equation (8). Notice that equation (8) uses σR.
Here σR is calculated by multiplying the two ratios σR1 and σR2.
• Disk cost for storing the final result is the number of pages needed to store the final aggregate values times the disk unit cost, which is identical to equation (9). As above (i.e. in the generating result records cost), σR is used to indicate the overall group-by selectivity ratio.
The total cost of the Early GroupBy with Partitioning scheme is the sum of equations (1), (2), and (10) to (17).
3.3. “Early GroupBy with Replication” Scheme
Cost Models for Phase One (Grouping Phase):
The cost components of the first phase of the Early GroupBy with Replication scheme are identical to those of the first phase of the Early GroupBy with Partitioning scheme.
Cost Models for Phase Two (Replication Phase):
This cost component is purely associated with table R, as table S is not moved at all from where it is stored.
• Data transfer cost is the cost of sending the local aggregate results of each processor to all other processors:
((πR × σR1 × Ri × (N − 1) / P) × (mp + ml)) (18)
In the above equation, Ri is reduced by two factors, namely πR and σR1. However, the replication cost is increased by the factor (N − 1), the number of other processors.
• Receiving records cost is as follows:
((πR × σR1 × Ri × (N − 1) / P) × mp) (19)
The sum of the above two equations gives the total cost for phase two of the Early GroupBy with Replication scheme.

Cost Models for Phase Three (Grouping/Joining Phase):
• Aggregation and join costs are as follows:
(|R| × σR1 × (tr + th + ta)) + (|Si| × (tr + th + tj)) (20)
• Reading/writing of overflow buckets cost is very similar to equation (17), except that we now use |R|, not |Ri|, because of the replication:
(1 − min(H / (σR2 × |R|), 1)) × (πR × σR1 × R / P) × 2 × IO
+ (1 − min(H / (σR2 × |R|), 1)) × (πS × Si / P) × 2 × IO (21)
The generating result records cost and the disk cost are the same as those of the Early GroupBy with Partitioning scheme, which in turn are identical to those of the Early Distribution scheme; hence, equations (8) and (9) can be used.
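The key trade-off between the two Early GroupBy variants lies in their transfer and receive costs. The sketch below is our own side-by-side transcription of equations (14)+(15) versus (18)+(19); the parameter values in the test are hypothetical, chosen to show that replication wins when the local aggregates are small (low σR1) and the processor count is modest, but loses as N grows.

```python
def partition_transfer(pi_R, s_R1, Ri, pi_S, Si, P, mp, ml):
    # Equations (14) + (15): local aggregates of R and fragments of S are
    # partitioned; each page pays (mp + ml) to send and mp to receive.
    return ((pi_R * s_R1 * Ri / P) * (2 * mp + ml)
            + (pi_S * Si / P) * (2 * mp + ml))

def replication_transfer(pi_R, s_R1, Ri, P, mp, ml, N):
    # Equations (18) + (19): local aggregates of R are sent to the other
    # (N - 1) processors; table S never moves.
    return (pi_R * s_R1 * Ri * (N - 1) / P) * (2 * mp + ml)
```

This mirrors the experimental finding in the next section: with a small number of groups, replicating the tiny aggregate can be cheaper than repartitioning all of S, but the (N − 1) factor eventually dominates.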
4. Implementation Results
The experimental setup is based on the shared-nothing architecture mentioned above, using 12 Linux workstations. Although the number of workstations may seem small, the same trend is expected to hold if more workstations are used. The experiments use the following parameters: the number of records in each table is 10,000, and the number of groups is below 100. The three algorithms were implemented in C using the Message Passing Interface (MPI) library, which provides message passing functions for distributed and parallel algorithms.
4.1. Number of Processors (Execution Time)
When the number of processors is below 6, the Early Group-By Replication method performs poorly compared to the Early Distribution and Early Group-By Partition methods, as shown in Figure 1, and it remains the worst performer once the number of processors is increased to 6 and above. The Early Group-By Partition method performs consistently well across varying numbers of processors. The execution time of the Early Distribution method declines gradually as the number of processors increases; however, it does not outperform the Early Group-By Partition method. This is due to the small number of groups, which confirms the cost models in the previous section.
Figure 1. Execution time
The execution time of the Early Group-By Partition method remains consistently good throughout the range of processors. The Early Group-By Replication method does not perform as well as the Early Group-By Partition and Early Distribution methods. The main reason is that data skew in the Early Group-By Replication method affects its execution time; this is partly due to the time spent in data redistribution using the MPI_Alltoall primitive.
4.2. Number of Processors (Speed-up)
Figure 2 shows the speed-up results of the Early Distribution, Early Group-By Partition, and Early Group-By Replication methods. The results show that the Early Group-By Partition method achieved a higher speed-up than the other two. The overall speed-up of all three methods is low, the main reason being the high communication overhead of the shared-nothing architecture: communication costs are incurred in finding the range partitioning vector and in data redistribution using the MPI_Alltoall and MPI_Alltoallv primitives. Note also that our experiments did not consider disk I/O as one of the overheads. As mentioned earlier, skew exists in the processing stage, and in the analytical models it has a greater impact on the Early Group-By Replication method than on the other two methods; the experiments show that the performance of the Early Group-By Replication method is almost exactly as expected from the analytical model. The Early Group-By Partition method is not as efficient as the Early Distribution method, although both efficiency lines in Figure 2 exhibit a similar trend, owing to the communication cost incurred in data redistribution using the MPI_Alltoall primitive. The Early Distribution method fares better than the Early Group-By Partition method, with an efficiency gap of around 10%; this is partly because the Early Group-By Partition method is more badly affected by data skew than the Early Distribution method.
Figure 2. Speed-up
4.3. Summary of Implementation Results
The algorithms performed almost as expected from the analytical models. Figure 1 shows the execution times of the Early Distribution, Early Group-By Partition, and Early Group-By Replication approaches, and Figure 2 shows their speed-up results.
In summary, the Early Group-By Partition method performed consistently well in terms of execution time, as shown in Figure 1. However, the Early Distribution method performed much better as the number of processors increased. In terms of speed-up, all methods were low, with the Early Distribution method being the most efficient. Therefore, the Early Distribution method is the preferred choice.
5. Conclusions and Future Work
In this paper, we have investigated three parallel algorithms for processing GroupBy-Before-Join queries in high performance parallel database systems. From our study, we conclude that the Early Distribution method is the preferred one when the number of groups produced grows large, but it is not favoured when the number of groups produced is small. On the other hand, the Early GroupBy with Replication method is good when the number of groups produced is small, but it suffers serious performance problems once the number of groups produced by the query is large. In future work, we plan to investigate high-dimensional Group By operations, often identified as Cube operations, which are highly pertinent to data warehousing applications.
6. References
[1] Bedell, J.A., "Outstanding Challenges in OLAP", Proc. of the 14th Intl. Conf. on Data Engineering, 1998.
[2] Datta, A. and Moon, B., "A Case for Parallelism in Data Warehousing and OLAP", Proc. of DEXA, 1998.
[3] Leung, C.H.C. and Taniar, D., "Parallel Query Processing in Object-Oriented Database Systems", Australian Computer Science Communications, vol. 17, no. 2, pp. 119-131, 1995.
[4] Mishra, P. and Eich, M.H., "Join Processing in Relational Databases", ACM Computing Surveys, vol. 24, no. 1, pp. 63-113, March 1992.
[5] Shatdal, A. and Naughton, J.F., "Adaptive Parallel Aggregation Algorithms", Proc. of the ACM SIGMOD Conference, pp. 104-114, 1995.
[6] Taniar, D. and Rahayu, J.W., "Parallel Group-By Query Processing in a Cluster Architecture", Intl. J. of Computer Systems: Science and Engineering, vol. 17, no. 1, pp. 23-39, 2002.
[7] Taniar, D., Jiang, Y., Liu, K.H., and Leung, C.H.C., "Parallel Aggregate-Join Query Processing", Informatica, vol. 26, pp. 321-332, 2002.
[8] Yan, W.P. and Larson, P., "Performing Group-By Before Join", Proc. of the Intl. Conf. on Data Engineering, 1994.
[9] Zipf, G.K., Human Behaviour and the Principle of Least Effort, Addison-Wesley, 1949.