Parallel Processing of "GroupBy-Before-Join" Queries in Cluster Architecture

David Taniar
School of Business Systems, Monash University
PO Box 63B, Clayton, Vic 3800, Australia
David.Taniar@infotech.monash.edu.au
J. Wenny Rahayu
Department of Comp. Sc. and Comp. Eng., La Trobe University
Bundoora, Vic 3823, Australia
wenny@cs.latrobe.edu.au
Abstract

SQL queries in the real world are replete with group-by and join operations. This type of query is often known as a "GroupBy-Join" query. In some GroupBy-Join queries, it is desirable to perform the group-by before the join in order to achieve better performance. This subset of GroupBy-Join queries is called "GroupBy-Before-Join" queries. In this paper, we present a study on parallelization of GroupBy-Before-Join queries, particularly by exploiting cluster architectures. From our study, we have learned that in parallel query optimization, processing the group-by as early as possible is not always desirable. On many occasions, performing data distribution before the group-by offers performance advantages. In this study, we also describe our cluster-based scheme.

1 Introduction

Queries involving aggregates are very common in database processing, especially in On-Line Analytical Processing (OLAP) and Data Warehousing [2,4]. These queries are often used as a tool for strategic decision making. Queries containing aggregate functions summarize a large set of records based on the designated grouping, and the input set of records may be derived from multiple tables using a join operation. In this paper, we concentrate on queries that contain both a group-by clause/aggregate functions and a join operation. We refer to this type of query as a "GroupBy-Join" query. As the data repositories supporting integrated decision making grow, aggregate queries must be executed efficiently. Large historical tables need to be joined and aggregated with each other; consequently, effective optimization of aggregate functions has the potential to yield huge performance gains. In this paper, we focus on the use of parallel processing techniques. The motivation for efficient parallel query processing is driven not only by the need for performance improvement, but also by the fact that parallel architecture is now available in many forms, such as systems consisting of a small number of powerful processors (i.e. SMP machines), clusters of workstations (i.e. loosely coupled shared-nothing architectures), massively parallel processors (i.e. MPP), and clusters of SMP machines (i.e. hybrid architectures) [1].

Parallelism of GroupBy queries involving aggregate functions is not a new subject in the parallel database community. Many researchers have produced techniques for parallelizing such queries; however, most of them target a general architecture, usually shared-nothing. As cluster architectures are becoming the de facto platform for parallel computing [9,12], there is a need to shift attention to this platform. In this paper, we propose parallelization schemes for a cluster environment. The proposed technique takes into account that processors within a cluster node share main memory, but communicate with other cluster nodes through a slower network.

The work presented in this paper is part of a larger project on parallel aggregate query processing; parallelization of "GroupBy-Before-Join" queries is the third and final stage of the project. The first stage dealt with parallelization of GroupBy queries on single tables (i.e. with no join operation involved); the results were reported at the PART'2000 conference [10]. The second stage focused on parallelization of GroupBy-Join queries where the group-by attributes are different from the join attributes, with the consequence that the join operation must be carried out first, followed by the group-by operation; we presented the outcome of the second stage at the HPC Asia'2000 conference [11]. The third and final stage, which is the main focus of this paper, concentrates on GroupBy-Join queries (as in stage two) in which the join attribute is the same as the group-by attribute, so that the group-by operation can be performed before the join for optimization purposes (i.e. GroupBy-Before-Join queries). More details on the three kinds of group-by queries, our previous work, and the focus of this paper are given in the next section.
0-7695-1010-8/01 $10.00 © 2001 IEEE
The rest of this paper is organized as follows. Section 2 explains the background of this work. Section 3 describes general parallel algorithms for processing GroupBy-Before-Join queries. Section 4 defines the problem to be solved. Section 5 presents our proposed parallel schemes based on a cluster architecture. Section 6 presents performance evaluation results. Finally, Section 7 gives the conclusions.
2 Background

As background to the work presented in this paper, we need to explain two aspects: first, an overview of GroupBy queries; and second, our previous work on parallel GroupBy queries and the focus of this paper.
2.1 GroupBy Queries

GroupBy queries in SQL can be divided into two broad categories: group-by on one table (we call these purely GroupBy queries), and a mixture of group-by and join (we call these GroupBy-Join queries). In either category, aggregate functions are normally involved in the query. To illustrate these two types of GroupBy queries, we use the following tables from a Suppliers-Parts-Projects database:

SUPPLIER (S#, Sname, Status, City)
PARTS (P#, Pname, Weight, Price, City)
PROJECT (J#, Jname, City, Budget)
SHIPMENT (S#, P#, J#, Qty)
An example of a GroupBy query on a single table is to "retrieve the number of suppliers for each city". The table used in this query is Supplier, and the supplier records are grouped according to their city. For each group, the number of records is counted; these counts then represent the number of suppliers in each city. The SQL for this query is given below.

QUERY 1:
Select City, COUNT(*)
From SUPPLIER
Group By City
The next category is GroupBy-Join queries. For simplicity of description and without loss of generality, we consider queries that involve only one aggregate function and a single join. The following two queries give an illustration of GroupBy-Join queries. Query 2 is to "group the part shipments by their city locations". The query written in SQL is as follows.

QUERY 2:
Select PARTS.City, AVG(Qty)
From PARTS, SHIPMENT
Where PARTS.P# = SHIPMENT.P#
Group By PARTS.City
Another example is to "retrieve project numbers, names, and total quantity of shipments for each project". QUERY 3:
Select PROJECT.J#, PROJECT.Jname, SUM(Qty)
From PROJECT, SHIPMENT
Where PROJECT.J# = SHIPMENT.J#
Group By PROJECT.J#, PROJECT.Jname
The main difference between Query 2 and Query 3 above lies in the join attributes and group-by attributes. In Query 3, the join attribute is also one of the group-by attributes. This is not the case in Query 2, where the join attribute is entirely different from the group-by attribute. This difference is a critical factor in processing GroupBy-Join queries, as a decision must be made as to which operation should be performed first: the group-by or the join. When the join attribute and the group-by attribute are different, as in Query 2, there is no choice but to invoke the join operation first and then the group-by operation. However, when the join attribute and the group-by attribute are the same, as in Query 3 (e.g. attribute J# of both the Project and Shipment tables), the group-by operation can be carried out first, followed by the join. Hence, we call the latter kind of query (e.g. Query 3) a "GroupBy-Before-Join" query. In Query 3, all Shipment records are grouped on the J# attribute; after grouping, the result is joined with table Project. As is widely known, join is a more expensive operation than group-by, and it is beneficial to reduce the join relation sizes by applying the group-by first. Generally, the group-by operation should precede the join whenever possible. Early processing of the group-by before the join reduces the overall execution time, in line with the general query optimization rule that unary operations should be executed before binary operations whenever possible. The semantic issues concerning aggregate functions and joins, and the conditions under which a group-by may be performed before a join, can be found in the literature [3,5,8,13]. In this paper, we focus on cases where the group-by operation is performed before the join operation; therefore, we will use Query 3 as a running example throughout this paper.
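To make this equivalence concrete, the following sketch runs Query 3 both ways on a toy in-memory database. The table contents are invented for illustration, and the `#` is dropped from attribute names (`J` instead of `J#`) to keep identifiers plain; this is an illustration of the rewrite, not the paper's implementation.

```python
import sqlite3

# Toy instance of the Project/Shipment schema; all rows are invented.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE PROJECT (J TEXT PRIMARY KEY, Jname TEXT, City TEXT, Budget INT);
CREATE TABLE SHIPMENT (S TEXT, P TEXT, J TEXT, Qty INT);
INSERT INTO PROJECT VALUES ('j1','Sorter','Paris',100), ('j2','Tape','Rome',200);
INSERT INTO SHIPMENT VALUES ('s1','p1','j1',300), ('s2','p1','j1',200),
                            ('s1','p2','j2',400);
""")

# Query 3 as written: join first, then group.
join_first = con.execute("""
    SELECT PROJECT.J, PROJECT.Jname, SUM(Qty)
    FROM PROJECT JOIN SHIPMENT ON PROJECT.J = SHIPMENT.J
    GROUP BY PROJECT.J, PROJECT.Jname
""").fetchall()

# GroupBy-Before-Join rewrite: aggregate SHIPMENT on the join/group-by
# attribute first, then join the (smaller) aggregate result with PROJECT.
group_first = con.execute("""
    SELECT PROJECT.J, PROJECT.Jname, T.total
    FROM (SELECT J, SUM(Qty) AS total FROM SHIPMENT GROUP BY J) AS T
    JOIN PROJECT ON PROJECT.J = T.J
""").fetchall()

assert sorted(join_first) == sorted(group_first)
print(sorted(group_first))  # [('j1', 'Sorter', 500), ('j2', 'Tape', 400)]
```

The join in the rewritten form operates on one pre-aggregated row per project rather than on every shipment record, which is exactly the size reduction the optimization exploits.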
2.2 Our Previous Work and Focus of This Paper

Our previous work on parallelization of GroupBy queries mainly focused on GroupBy queries on a single table (e.g. Query 1) [10], and on GroupBy-Join queries where the join attribute is different from the group-by attribute (e.g. Query 2) [11]. Parallelization of GroupBy queries on single tables (i.e. Query 1) exists in several forms. In Taniar and Rahayu [10], we presented three parallel algorithms. The first two were general algorithms, and the third was a
specialized algorithm for cluster architectures. The main issue in parallelizing single-table GroupBy queries was whether to perform distribution after local aggregation (the Two Phase method) or to perform distribution without local aggregation (the Redistribution method). With the Two Phase method, the communication costs may be reduced by the group-by selectivity factor; however, if the reduction is minimal, local aggregation may not offer much benefit. With cluster architectures, these two approaches can be combined: group-by processing within a cluster node (a cluster node consists of several processors sharing the same main memory) can be done a la the Redistribution method, and global aggregate processing among cluster nodes through an interconnection network can be done a la the Two Phase method. This method proved efficient on cluster platforms, because the redistribution within each node is done through shared memory, whereas communication among nodes, as in the Two Phase method, takes place only after local aggregate filtering has been carried out by each cluster node.

Parallelization of GroupBy-Join queries where the join attribute is different from the group-by attribute (i.e. Query 2) also exists in several forms. In Taniar, Jiang, Liu, and Leung [11], we presented three parallelization techniques. The main issue in parallelizing such a query was deciding whether to use the join attribute or the group-by attribute as the partitioning attribute for data distribution. If we choose the join attribute as the partitioning attribute (the Join Partition method), after data partitioning each processor performs a local join and local aggregation. The results from each processor must then be redistributed according to the group-by attribute so that global aggregation can be applied to the temporary join result.
If we choose the group-by attribute as the partitioning attribute (the Aggregate Partition method), only the table associated with the group-by attribute can be partitioned, whereas the other table must be replicated. In a cluster architecture, a hybrid approach was adopted, where within each node parallelization is carried out a la the Aggregate Partition method (this can be efficient because data replication is done within the shared memory), and among cluster nodes parallelization is performed a la the Join Partition method.

In this paper, we focus on GroupBy-Before-Join queries (i.e. Query 3). The main differences between this work and our previous work can be outlined as follows. Unlike Query 2, which has two candidate partitioning attributes (the join attribute and the group-by attribute), Query 3 has only one partitioning attribute, since the join attribute is the same as the group-by attribute. Therefore, the complexity of this work lies not in choosing the correct partitioning attribute, but in the fact that the group-by clause has to be carried out before the join, and this affects the parallelization techniques. Unlike Query 1, which does not involve a join, Query 3 involves joining tables. Therefore the complexity of
this work is due to the join operation involved in the query, and this affects the decision on when to perform data distribution for calculating the aggregates, since a join operation is also involved. Because the foundation for processing Query 3 is different from that of Queries 1 and 2, parallelization of Query 3 needs special attention.
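As a rough illustration of the Two Phase method recapped above for single-table GroupBy queries (Query 1), the following sketch performs local aggregation per processor, redistributes the partial counts on the group-by attribute, and merges them globally. The partition contents, processor count, and hash-based redistribution are invented for illustration, not taken from [10].

```python
from collections import Counter

# Hypothetical per-processor partitions of the Supplier table (cities only).
partitions = [
    ["Paris", "Rome", "Paris"],   # processor 1's local records
    ["Rome", "Athens", "Paris"],  # processor 2's local records
]

def two_phase(partitions, n_proc=2):
    # Phase 1: each processor aggregates its own records locally.
    local = [Counter(part) for part in partitions]
    # Redistribution: local (city, count) pairs are hashed on the group-by
    # attribute, so all partial counts for a city meet at one processor.
    buckets = [Counter() for _ in range(n_proc)]
    for agg in local:
        for city, cnt in agg.items():
            buckets[hash(city) % n_proc][city] += cnt
    # Phase 2: global aggregation is a simple per-bucket merge.
    result = Counter()
    for b in buckets:
        result.update(b)
    return dict(result)

assert two_phase(partitions) == {"Paris": 3, "Rome": 2, "Athens": 1}
```

Because only (city, count) pairs cross the network instead of raw records, the communication volume shrinks by the group-by selectivity factor, which is the trade-off the section describes.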
3 General Parallel Algorithms for "GroupBy-Before-Join" Queries

In the following sections, we describe two general parallel algorithms for GroupBy-Before-Join query processing, namely the Early Distribution scheme and the Early GroupBy scheme.
3.1 Early Distribution Scheme

As its name states, the Early Distribution scheme performs data distribution before anything else (i.e. before the group-by and join operations). This scheme is influenced by the practice of parallel join algorithms, where raw records are first partitioned/distributed and allocated to each processor, and then each processor performs its operation [6]. The scheme is motivated by fast message-passing multiprocessor systems.

The Early Distribution scheme is divided into two phases: the distribution phase and the group-by-join phase. In Query 3, the two tables to be joined are Project and Shipment, joined on attribute J#, and the group-by is based on table Shipment. For simplicity of notation, the table that forms the basis for the group-by is called table R (e.g. table Shipment), and the other table is called table S (e.g. table Project). From now on, we will refer to them as tables R and S.

In the distribution phase, raw records from both tables (i.e. tables R and S) are distributed based on the join/group-by attribute according to a data partitioning function. An example of a partitioning function is to allocate to each processor the project numbers within a certain range: for example, project numbers (i.e. attribute J#) p1 to p99 go to processor 1, p100 to p199 to processor 2, p200 to p299 to processor 3, and so on. We emphasize that tables R and S are both distributed. As a result, for example, processor 1 will have the records from the Shipment table with J# between p1 and p99, inclusive, as well as the records from the Project table with J# p1-p99. This distribution scheme is commonly used in parallel join, where raw records are partitioned into buckets based on an adopted partitioning scheme, such as the range partitioning above [6]. Once the distribution is completed, each processor will have records within certain groups identified by the group-by/join attribute.
Subsequently, the second phase (the group-by-join phase) aggregates records of table R
based on the group-by attribute and calculates the aggregate values for each group. Aggregation in each processor can be carried out through sorting or hashing. After table R is grouped in each processor, it is joined with table S in the same processor. After joining, each processor will have a local query result, and the final query result is the union of all sub-results produced by the processors. Figure 1 illustrates the Early Distribution scheme. Notice that partitioning is applied to the raw records of both tables R and S, and that the aggregation of table R and its join with table S in each processor are carried out after the distribution phase.

Figure 1. Early Distribution Scheme
Several things need to be highlighted about this scheme. First, the grouping is still performed before the join (although after data distribution). This conforms to the optimization rule for this kind of query: the group-by clause must be carried out before the join in order to achieve a more efficient query processing time. Second, the distribution of records from both tables can be expensive, as all raw records are distributed and no prior filtering is applied to either table. It would be more desirable to carry out the grouping (and aggregate function) even before the distribution, in order to reduce the distribution cost, especially for table R. This leads to the next scheme, the Early GroupBy scheme, which reduces communication costs during the distribution phase.
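A minimal sketch of the Early Distribution scheme follows, using invented records and the range partitioning example above (p1-p99 to the first processor, and so on). The record layout and processor count are assumptions for illustration, not the simulated implementation.

```python
from collections import defaultdict

# Toy R (Shipment-like: (J#, Qty)) and S (Project-like: (J#, Jname)).
R = [("p005", 300), ("p150", 200), ("p150", 100), ("p230", 50)]
S = [("p005", "Sorter"), ("p150", "Tape"), ("p230", "Reader")]

def proc_of(j):
    # Range partitioning on the join/group-by attribute: p001-p099 ->
    # processor 0, p100-p199 -> processor 1, p200-p299 -> processor 2.
    return int(j[1:]) // 100

N = 3
# Distribution phase: raw records of BOTH tables go to their partition.
r_part, s_part = defaultdict(list), defaultdict(list)
for rec in R:
    r_part[proc_of(rec[0])].append(rec)
for rec in S:
    s_part[proc_of(rec[0])].append(rec)

# Group-by-join phase: each processor groups its R fragment (here via a
# hash table), then joins the aggregate result with its S fragment.
result = []
for p in range(N):
    agg = defaultdict(int)
    for j, qty in r_part[p]:
        agg[j] += qty
    s_index = dict(s_part[p])
    result += [(j, s_index[j], t) for j, t in agg.items() if j in s_index]

print(sorted(result))
# [('p005', 'Sorter', 300), ('p150', 'Tape', 300), ('p230', 'Reader', 50)]
```

Because R and S are partitioned on the same attribute, each processor can group and join entirely locally; the union of the local results is the query answer.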
3.2 Early GroupBy Scheme

As its name states, the Early GroupBy scheme performs the group-by operation first (before data distribution). This scheme is divided into three phases: (i) the local grouping phase, (ii) the distribution phase, and (iii) the final grouping and join phase.

In the local grouping phase, each processor performs its group-by operation and calculates its local aggregate values on records of table R. In this phase, each processor groups its local records of R according to the designated group-by attribute and applies the aggregate function. Using the same example as in the previous section, one processor may produce, for example, (p1, 5000) and (p140, 8000), and another processor (p100, 7000) and (p140, 4000). The numerical figures indicate the SUM(Qty) of each project.

In the second phase (i.e. the distribution phase), the local aggregate results from each processor, together with the records of table S, are distributed to all processors according to a partitioning function. The partitioning function is based on the join/group-by attribute, which in this case is attribute J# of tables Project and Shipment. Again using the partitioning function from the previous section, J# of p1-p99 go to processor 1, J# of p100-p199 to processor 2, and so on.

In the third phase (i.e. the final grouping and join phase), two operations are carried out: the final aggregation or grouping of R, and its join with S. The final grouping can be carried out by merging all temporary aggregate results obtained by each processor. Global aggregation in each processor is simply done by merging all identical project numbers (J#) into one aggregate value. For example, processor 2 will merge (p140, 8000) from one processor and (p140, 4000) from another to produce (p140, 12000), which is the final aggregate value for this project number.

Global aggregation can be tricky, depending on the complexity of the aggregate functions used in the actual query. If, for example, an AVG function were used instead of SUM in Query 3, calculating an average value based on temporary averages must take into account the actual raw records involved in each processor. Therefore, for these kinds of aggregate functions, each local aggregate must also carry the number of raw records in that processor, even though it is not specified in the query. This is needed for the global aggregation to produce correct values. For example, one processor may produce (p140, 8000, 5) and the other (p140, 4000, 1). After distribution, supposing processor 2 receives all p140 records, the average for project p140 is calculated by dividing the sum of the two quantities (8000 and 4000) by the total number of shipment records for that project (i.e. (8000+4000)/(5+1) = 2000). The total number of shipments in each project must be determined in each processor even though it is not specified in the query.

Figure 2. Early GroupBy Scheme
After the global aggregation results are obtained, they are joined with table S in each processor. Figure 2 illustrates this scheme. Several things are worth noting. First, the records of R in each processor are aggregated/grouped before being distributed. Consequently, the communication costs associated with table R can be expected to drop, depending on the group-by selectivity factor. Second, we observe that if the number of groups is less than the number of available processors, not all processors can be exploited, reducing the degree of parallelism.
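The AVG subtlety described above (local aggregates must carry record counts so that the global average is correct) can be sketched as follows; the data echoes the p140 example from the text.

```python
# Local grouping phase: each processor emits (J#, local_sum, local_count)
# for its fragment of table R. The count is carried even though the query
# asks only for AVG, exactly as the text explains.
local_1 = {"p140": (8000, 5)}   # processor 1's local aggregate
local_2 = {"p140": (4000, 1)}   # processor 2's local aggregate

# The distribution phase would route all "p140" pairs to one processor;
# the final grouping phase then merges the pairs and divides sum by count.
def merge_avg(*locals_):
    merged = {}
    for loc in locals_:
        for j, (s, c) in loc.items():
            ms, mc = merged.get(j, (0, 0))
            merged[j] = (ms + s, mc + c)
    return {j: s / c for j, (s, c) in merged.items()}

print(merge_avg(local_1, local_2))  # {'p140': 2000.0}
# Averaging the local averages instead would give (1600 + 4000) / 2 = 2800,
# which is wrong -- hence the extra counts.
```

The same pattern applies to any algebraic aggregate (AVG, STDDEV): ship the partial sufficient statistics, not the partial final values.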
4 Problem Formulation

Despite the usefulness of the above two general algorithms for parallel processing of GroupBy-Before-Join queries, a number of issues are worth considering. All of the schemes described previously target general shared-nothing architectures, where each processor is equipped with its own local memory and disk, and communicates with other processors through message passing. They do not consider whether some of the processors share one memory (and disks), as is the case in cluster architectures. Over a network that is slower than the system bus, it is commonly understood that communication should be minimized; this is not clearly addressed in the previous schemes. In a shared-memory environment, we should take advantage of load balancing and load sharing; this is not addressed in the existing schemes either.

Based on these factors, we propose a scheme for parallel GroupBy-Before-Join queries especially designed for a cluster environment, where communication among nodes is done through message passing via an interconnection network, and processors within the same node share the same memory. We need to identify how to minimize communication costs among nodes and how to group shared data in shared memory.

Figure 3. Clusters of SMP

There are a number of variations on the cluster architecture. In this paper, each cluster node is a shared-memory architecture connected to an interconnection network a la shared-nothing. As each shared-memory node (i.e. an SMP machine) maintains a group of processing elements, a collection of these clusters is often called "Clusters of SMP" [9]. Figure 3 shows an architecture of clusters of SMP.
5 Proposed Algorithm for Cluster Architecture

Like the Early GroupBy scheme, the Cluster-based scheme is divided into three phases: the local grouping, distribution, and final grouping/joining phases.

In the first phase (the local grouping phase), each cluster node (i.e. an SMP node consisting of several processors) logically distributes table R based on the group-by attribute. In this phase, the processors within a cluster node, in turn, load each record of R from the shared disk and decide which processor should process the record. Since all processors within a cluster node share main memory, data distribution or data partitioning can be achieved by creating a fragment table for each processor. At the end of this distribution, each processor will have a table fragment of R to work with. Once each processor in a node has a distinct fragment table, the aggregation operation can be carried out by each processor within the cluster node, producing a set of distinct group-by values. Each node will have one set of local aggregates, which is the union of the results produced by each processor within that cluster node.

The second phase (the distribution phase) distributes the local aggregates produced from table R, as well as the non-group-by table S, from each cluster node to the other nodes. The distribution is based on the group-by/join attribute. Remember that distribution is done at a node level, not at a processor level; in other words, from each node there is one outgoing communication stream to another node.

The third phase (the final grouping/joining phase) consists of merging the local aggregates from the first phase and joining the result with table S. The merging process can be explained as follows. After each node has been reallocated local aggregates from different places, each node merges the identical aggregate values it has received, possibly from other clusters.
Since multiple processors exist in each node, the merging can also be done in parallel by having all processors in the node participate. The result of this final grouping is that each node produces a set of distinct aggregate values. The joining operation then joins the result of the final grouping with the non-group-by table S. Since each node consists of several processors, the join adopted is a shared-memory join operation, which works as follows. First, each processor reads in an aggregate value of R and hashes it into a shared hash table; reading and hashing are carried out concurrently by all processors within each SMP node. Second, each processor
reads in a record of S and hashes/probes it against the shared hash table; this is also done concurrently by all processors within the node. Any match is stored in the query result. Figure 4 illustrates the Cluster-based scheme, showing how the new scheme works with three cluster nodes and four processors in each cluster node.

Figure 4. Cluster-based Scheme

The main differences between the Cluster-based scheme and the Early GroupBy scheme can be outlined as follows. In the Early GroupBy scheme, each processor is considered independent; the scheme does not consider the fact that some processors share the same memory. Consequently, local aggregation is done at a processor level instead of at a node level. Since the table in each node is stored on a shared disk as one piece, the processors need to logically divide that one piece of data into fragments. Because the disk is shared, it is common to adopt round-robin logical partitioning in order to maintain load balance across the processors. Since the partitioning is round-robin, each processor will likely produce the same groups as the other processors in the same cluster node, but with different aggregate values. In contrast, the Cluster-based scheme adopts semantic logical partitioning, that is, partitioning based on the group-by attribute, and consequently each processor produces a distinct set of aggregate values (groups), thus reducing the number of groups in the node. The impact of this propagates to the data distribution among nodes, since fewer local aggregate values (groups) are distributed across the network.

Another difference is in the final grouping/joining phase. After the distribution phase, using the general version of the scheme, each processor will have its fragments of the local aggregate results and of table S, and the two are to be joined. It is most likely that each
processor will have fragments of different sizes, and this causes processor load imbalance. On the other hand, using the Cluster-based scheme, data distribution is done at a node level, and hence each node will have its fragments of the local aggregates and of table S. Suppose a node consists of four processors. Using the Cluster-based scheme, there will be one fragment of R and one fragment of S; using the Early GroupBy scheme, there will be four smaller fragments of R and four smaller fragments of S. Assuming that the four processors within each node are consecutive in the hash function, the four fragments (of the Early GroupBy scheme) together equal the one bigger fragment (of the Cluster-based scheme). However, the four smaller fragments may well differ in size, causing load imbalance among the processors. In contrast, the one bigger fragment, as it resides in shared memory, can be divided evenly among all processors during processing. Therefore, load balance within the node is achieved by the Cluster-based scheme. However, we must emphasize that skew among nodes may still occur.
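The three phases of the Cluster-based scheme described above can be sketched as follows. The record values, node/processor counts, and hash-based routing are invented for illustration; a real implementation would route a single stream per node over the interconnect.

```python
from collections import defaultdict

# Two cluster nodes, each holding a fragment of the Shipment-like table R
# as (J#, Qty) on its shared disk; S is Project-like. All data is invented.
node_R = [
    [("p005", 100), ("p150", 200), ("p005", 50)],  # node 0
    [("p150", 300), ("p230", 10)],                 # node 1
]
S = [("p005", "Sorter"), ("p150", "Tape"), ("p230", "Reader")]
N_NODES, N_PROCS = 2, 2

def local_phase(records):
    # Semantic (group-by-attribute) logical partitioning inside the node:
    # each processor gets disjoint groups, so the node's local aggregate
    # is just the union of per-processor results -- no intra-node merge.
    per_proc = [defaultdict(int) for _ in range(N_PROCS)]
    for j, qty in records:
        per_proc[hash(j) % N_PROCS][j] += qty
    union = {}
    for agg in per_proc:
        union.update(agg)  # groups are disjoint across processors
    return union

# Distribution phase: ONE outgoing stream per node, routed on J#; the
# final grouping phase merges identical groups as they arrive.
incoming = [defaultdict(int) for _ in range(N_NODES)]
for node in range(N_NODES):
    for j, total in local_phase(node_R[node]).items():
        incoming[hash(j) % N_NODES][j] += total

# Final joining phase: join the merged aggregates with S in each node.
s_index = dict(S)
result = sorted((j, s_index[j], t) for node in incoming for j, t in node.items())
print(result)
# [('p005', 'Sorter', 150), ('p150', 'Tape', 500), ('p230', 'Reader', 10)]
```

Note how p150's partial sums (200 from node 0, 300 from node 1) travel as two small aggregate tuples rather than three raw records, which is the communication saving the scheme is designed around.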
6 Performance Evaluation

To study the behavior and compare the performance of the three schemes presented in this paper, we carried out a sensitivity analysis, that is, an analysis performed by varying the performance parameters. For this purpose, a simulation package called Transim [7] was used. Transim is a transputer-based simulator that adopts an Occam-like language; using it, the number of processors and the architecture topology can be configured. In the experiments, 64 processors were used, and the table sizes were between 1 and 10 GB, representing between 10 and 100 million records. The maximum entry for each hash table is 10,000, and the projectivity ratio is 15%.
6.1 GroupBy Selectivity

The graph in Figure 5 shows the comparative performance of the three parallel schemes as the GroupBy selectivity ratio (i.e. the number of groups produced by the query) is varied from 0.0000001 to 0.01. With 100 million records as input, a selectivity of 0.0000001 produces 10 groups, whereas the other end, a selectivity ratio of 0.01, produces 1 million groups. The cluster configuration consists of 8 nodes with 8 processors in each node.

Using the Early Distribution scheme, the major cost is the scanning cost in phase one of the processing. The total cost of phase one is constant regardless of the selectivity ratio, as no grouping is done in phase one. Data transfer from phase one to phase two is not large, for two reasons: one, the records to be transferred have been projected, and hence each record size is reduced; and two,
the communication unit cost is much smaller than the disk access unit cost. Note that when the number of groups is smaller than the number of processors, not all processors are used. Looking at the graph in Figure 5, we notice that the cost line of the Early Distribution scheme goes down between 10 and 100 groups. This is because the experiments used 64 processors in total; consequently, when the query produced only 10 groups, not all 64 processors were used, which degrades performance. When 100 groups are produced, all available processors are used.

Using the Early GroupBy scheme, the majority of the processing cost lies in data scanning and loading. Notice that the performance of this scheme is quite steady when the number of groups is small, and the cost increases suddenly when the number of groups in the query output grows. This is primarily caused by the overhead of overflowing hash tables. We also notice that the distribution costs do not play an important role, since the communication unit cost is far smaller than the disk unit cost.

Using the Cluster-based scheme, the cost components are similar to those of Early GroupBy, with the major cost being data scanning and loading. The local partitioning cost, incurred by the logical partitioning within each node, appears to be negligible. Overall, the Cluster-based scheme beats the Early GroupBy scheme for two reasons. One is the data transfer cost along the network: the Cluster-based scheme produces relatively fewer groups in each node than the Early GroupBy scheme, which also lowers the final aggregation cost. The other is that the hash table overflow overhead appears later, at larger group counts, and consequently the cost line of the Cluster-based scheme goes up later than that of Early GroupBy.
Comparing the three schemes, in general the Cluster-based scheme delivers better performance than the other two, except in a few situations, such as an extremely large number of groups produced by the query, in which case the Early Distribution scheme performs better. The experimental results also show that when the number of groups is small, the Early GroupBy scheme is good, as it filters out records in the first phase of processing. When filtering in the first phase is limited, the Early GroupBy scheme is not good at all, as it requires double processing. On the other hand, the Early Distribution scheme, which does not filter in the first phase, is good for a large number of groups. The proposed Cluster-based scheme compensates between the previous two: it behaves like the Early GroupBy scheme but does not increase the cost too much when filtering is insufficient in the first phase of processing. Based on these performance results, we conclude that the Cluster-based scheme is beneficial, as it delivers better performance in most circumstances. We can also conclude that the Early Distribution scheme does not perform too poorly even though the group-by operation is not the first operation performed. In fact, in many cases, Early Distribution performs better than Early GroupBy. This is an interesting conclusion from a parallel processing perspective: optimizations for sequential processors may not necessarily apply to parallel processors.
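For contrast with the Early GroupBy sketch, the following is a rough illustration (again, not the paper's implementation) of the Early Distribution scheme: raw records are first hash-partitioned on the group-by attribute across all processors, with no first-phase filtering, and each processor then aggregates its partition completely, so no final merge is needed. The names and the SUM aggregate are illustrative.

```python
def early_distribution(records, num_procs):
    """Early Distribution sketch: partition raw records on the
    group-by attribute (every record crosses the network), then
    aggregate each partition fully at its destination processor."""
    # Phase 1: distribution -- no local filtering beforehand
    partitions = [[] for _ in range(num_procs)]
    for key, value in records:
        partitions[hash(key) % num_procs].append((key, value))
    # Phase 2: complete aggregation, one disjoint group set per processor
    results = []
    for part in partitions:
        table = {}
        for key, value in part:
            table[key] = table.get(key, 0) + value
        results.append(table)
    return results
```

The sketch also shows the effect noted earlier: with fewer distinct group keys than processors, some partitions stay empty and those processors sit idle, while the full-record distribution in phase 1 is the data transfer cost that dominates when many processors are used.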
[Figure 5: line graph comparing the Early Distribution, Early GroupBy, and Cluster schemes; x-axis: Number of Groups (10 to 1,000,000), y-axis: time in seconds.]
Figure 5. Varying GroupBy Selectivity Ratio
6.2 Cluster Configuration

Figure 6 shows the comparative performance of the three schemes when the number of cluster nodes and the cluster size are varied. The number of cluster nodes is varied from 1 to 8, and each node has 4 or 8 processors. The graphs shown in Figure 6 are the experimental results using the following parameters: the number of groups produced is 10,000 (a selectivity ratio of 0.0001), and the maximum hash table size is 10,000 entries. From the graphs, we notice that the Cluster-based scheme works well when there are more processors in a cluster node and when more nodes are used in the system. Generally, the Cluster-based scheme delivers the best performance, ranging from an 8% to 15% improvement over the other two schemes. With a large number of processors, the data transfer cost of the Early Distribution scheme is expensive. The data transfer cost of the Early GroupBy scheme can also be significant, although not as much as that of the Early Distribution scheme. For the Cluster-based scheme, on the other hand, the data transfer cost is trivial, as the selectivity ratio within each cluster can often be lower than the selectivity factor of the first phase of the Early GroupBy scheme. With a small number of processors, the Cluster-based scheme imposes additional overhead associated with the local data partitioning. Like the Early GroupBy scheme, the Cluster-based scheme also has some overhead associated with hash table overflow. This can only be minimized if more processors are used so that the workload is spread. The graphs in Figure 6 also indicate that, using the current parameters, the Early GroupBy scheme performs the worst. This is due to a small reduction in the original
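The Cluster-based scheme's two-level structure can be sketched as follows. This is a minimal illustration under our own assumptions (SUM aggregate, hash partitioning, invented function names), not the paper's implementation: each SMP node first aggregates its own fragment in shared memory, and only the node-level partial groups, rather than raw records, are then redistributed across nodes on the group-by attribute.

```python
def cluster_based(node_fragments, num_nodes):
    """Cluster-based sketch: shared-memory local aggregation inside
    each SMP node, then redistribution of partial groups (not raw
    records) between nodes for final aggregation."""
    # Phase 1: local aggregation within each node's shared memory
    node_partials = []
    for fragment in node_fragments:
        table = {}
        for key, value in fragment:
            table[key] = table.get(key, 0) + value
        node_partials.append(table)
    # Phase 2: redistribute partial groups across nodes and merge,
    # so each node ends up owning a disjoint set of final groups
    final = [{} for _ in range(num_nodes)]
    for table in node_partials:
        for key, value in table.items():
            dest = hash(key) % num_nodes
            final[dest][key] = final[dest].get(key, 0) + value
    return final
```

Because only one partial row per group leaves each node in phase 2, the inter-node traffic shrinks with the within-node selectivity ratio, which is why the text above describes the Cluster-based scheme's data transfer cost as trivial.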
number of records R. The Early Distribution scheme is, to some degree, better than the Early GroupBy scheme, for exactly the opposite reasons. The Cluster-based scheme in this case takes advantage of a low hash table overflow overhead, like the Early Distribution scheme, but without an expensive data transfer cost. As a result, in most cases, the Cluster-based scheme offers the best performance.

[Figure 6: two line graphs, (a) 4 processors per node and (b) 8 processors per node, comparing the Early Distribution, Early GroupBy, and Cluster schemes; x-axis: Number of Cluster Nodes (1 to 8), y-axis: time in seconds.]
Figure 6. Varying Cluster Configuration

7 Conclusions

In this paper, we have studied three parallel algorithms for processing "GroupBy-Before-Join" queries (i.e. GroupBy queries where the group-by operation can be performed before the join operation) in high-performance parallel database systems. These algorithms are the Early Distribution scheme, the Early GroupBy scheme, and the Cluster-based scheme.

The main rationale for the development of the Cluster-based scheme is twofold: one is to take advantage of cluster architectures, and two is that the two existing methods were not specifically designed for cluster architectures. Using the Cluster-based scheme, local aggregation in each SMP node is done through a shared-memory group-by operation whereby raw records are logically partitioned according to the group-by attribute. Based on this scheme, the cluster version takes advantage of the shared memory of the cluster node, in which logical data partitioning can be done easily and efficiently (load balanced), and filtering through the group-by selectivity can achieve better results.

Our performance evaluation results show that in most cases the Cluster-based scheme delivers better performance than the other two schemes. It is surprising that the Early Distribution scheme is not as bad as we initially thought, since the group-by operation was not the first operation performed. This strongly indicates that optimization rules for sequential processors must not be blindly adopted for parallel processors. In parallel query optimization, we must consider other elements, such as distribution cost, parallel architecture, and so on. Also in this paper, we have shown how parallel query optimization must be tailored for a specific architecture, such as the cluster architecture, which is the main platform of the experimentation in this paper.

References
1. Almasi, G. and Gottlieb, A., Highly Parallel Computing, 2nd ed., The Benjamin/Cummings Publishing Co. Inc., 1994.
2. Bedell, J.A., "Outstanding Challenges in OLAP", Proc. of the 14th Intl. Conf. on Data Engineering, 1998.
3. Bültzingsloewen, G., "Translating and optimizing SQL queries having aggregates", Proc. of the 13th Intl. Conf. on Very Large Data Bases, 1987.
4. Datta, A. and Moon, B., "A case for parallelism in data warehousing and OLAP", Proc. of the 9th Intl. Workshop on Database and Expert Systems Applications, 1998.
5. Dayal, U., "Of nests and trees: a unified approach to processing queries that contain nested subqueries, aggregates, and quantifiers", Proc. of the 13th Intl. Conf. on Very Large Data Bases, Brighton, UK, 1987.
6. DeWitt, D.J. and Gray, J., "Parallel Database Systems: The Future of High Performance Database Systems", Comm. of the ACM, vol. 35, no. 6, pp. 85-98, 1992.
7. Hart, E., Transim: Prototyping Parallel Algorithms, User Guide & Reference Manual, ver. 3.5, Westminster University, 1993.
8. Kim, W., "On optimizing an SQL-like nested query", ACM Transactions on Database Systems, vol. 7, no. 3, Sept. 1982.
9. Pfister, G.F., In Search of Clusters: The Ongoing Battle in Lowly Parallel Computing, 2nd ed., Prentice Hall, 1998.
10. Taniar, D. and Rahayu, J.W., "Parallel Processing of Aggregate Queries in a Cluster Architecture", Proc. of the 7th Australasian Conf. on Parallel and Real-Time Systems (PART 2000), Springer-Verlag, Nov. 2000.
11. Taniar, D., Jiang, Y., Liu, K.H., and Leung, C.H.C., "Aggregate-Join Query Processing in Parallel Database Systems", Proc. of the 4th HPCAsia 2000 Intl. Conf., vol. 2, IEEE CS Press, pp. 824-829, 2000.
12. Wilkinson, B. and Allen, M., Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, Prentice Hall, 1999.
13. Yan, W.P. and Larson, P., "Performing group-by before join", Proc. of the Intl. Conf. on Data Engineering, 1994.