Data Reduction Through Early Grouping

W. Paul Yan and Paul Larson
Department of Computer Science, University of Waterloo
Waterloo, Ontario, N2L 3G1, Canada

Appears in Proceedings of the 1994 IBM CAS Conference, pages 227-235, Toronto, Ontario, November 1994.

Abstract

SQL queries containing GROUP BY and aggregation occur frequently in decision support applications. Grouping with aggregation is typically done by first sorting the input and then performing the aggregation as part of the output phase of the sort. The most widely used external sorting algorithm is merge sort, consisting of a run formation phase followed by a (single) merge pass. The amount of data output from the run formation phase can be reduced by a technique that we call early grouping. The idea is straightforward: simply form groups and perform aggregation during run formation. Each run will now consist of partial groups instead of individual records. These partial groups are then combined during the merge phase. Early grouping always reduces the number of records output from the run formation phase. The relative output size depends on the amount of memory relative to the total number of groups and the distribution of records over groups. When the input data is uniformly distributed -- the worst case -- our simulation results show that the relative output size is proportional to the (relative) amount of memory used. When the data is skewed -- the more common case in practice -- the relative output size is much smaller.

1 Introduction

SQL queries containing GROUP BY and aggregation are common in decision support applications. Grouping with aggregation is typically done by first sorting the input and then performing the aggregation as part of the output phase of the sort. The most widely used external sorting algorithm is merge sort, which consists of two phases: a run formation phase and a merge phase. The amount of data output from the run formation phase and carried through the merge phase can be reduced by a simple technique: form (partial) groups and perform aggregation during run formation. The data carried into the merge phase then consists of (partial) groups instead of individual tuples. The partial groups are then combined during the merge phase to form the final groups. We call this technique early grouping. Early grouping always reduces the number of tuples output from the run formation phase. If the number of groups is small, all groups may fit into main memory. In this case, no merging is required, which saves writing the runs to disk and reading them in again during the merge. Given today's main memory sizes, we believe that this will be the most common case. Even when all groups do not fit in main memory, early grouping may reduce the number of tuples very significantly. The reduction factor depends on the amount of memory relative to the number of groups and the distribution of tuples over groups. We have run a number of simulation experiments aimed at quantifying the reduction factor.

Our results show that when the groups are all of the same size, the reduction factor is directly proportional to the fraction of groups that fit in main memory. This is the worst case. When the groups are of different size -- the more common case in practice -- the reduction factor is much higher. The idea of early grouping is not new. Bitton and DeWitt [2] proposed and analyzed early duplicate removal during run formation, but we have not found any published results on early grouping. However, some database systems implement early grouping and duplicate removal. DB2/MVS performs early grouping, but not in all cases. We designed a set of experiments aimed at determining whether Oracle V7 performs early grouping. The results are not conclusive, but it appears that Oracle V7 performs early grouping in some cases. It is highly likely that several other systems do as well.

(DB2/MVS is a registered trademark of IBM. Oracle is a registered trademark of Oracle.)

2 Previous Work

Most published work on duplicate elimination deals with main-memory algorithms. Munro and Spira [6] gave a computational bound for the number of comparisons required to sort a multiset with early duplicate removal. Several algorithms, based on various sorting methods such as quicksort, hash sort, and merge sort, have been proposed for duplicate elimination. Abdelguerfi and Sood [1] gave the computational complexity of the merge sort method based on the number of three-branch comparisons; Teuhola and Wegner [7] gave a duplicate elimination algorithm based on hashing with early duplicate removal, which requires linear time on the average and O(1) extra space; and Wegner [8] gave a quicksort algorithm for the run formation phase and analyzed its computational complexity (N log n, where N is the number of rows and n is the number of distinct rows).

However, we are mainly interested in large-scale grouping and aggregation requiring external sorting. The processing cost is then dominated by the cost of I/O, and the CPU time can be largely ignored. The idea of early duplicate removal during the run formation phase of external sorting seems to have been investigated first by D. Bitton and D. J. DeWitt [2]. Their duplicate elimination algorithm exploits early duplicate removal during both run formation and run merging. They also analyzed the performance of the algorithm under the simplifying assumption that the external merge procedure starts with internally sorted pages (i.e., initial runs) free of duplicates. Therefore, in their analysis, the effects of early duplicate removal during run formation are ignored. Their argument for this assumption is that (a) if the records are randomly distributed, and (b) if the number of distinct records is much larger than the number of records per page, then early duplicate removal does not reduce the sizes of the initial runs significantly. Their cost analysis is based on (a) the number of multi-level merge phases and (b) the average size of runs in each phase for a two-way merge sort. This analysis is also summarized in [3].

3 Early Grouping

In most database systems, evaluation of a GROUP BY query relies on sorting. It first sorts the records by the grouping columns, which brings the records belonging to the same group together, and then performs the accumulation specified in the query.

Evaluation of GROUP BY

For large tables that cannot be held in main memory, the sorting algorithm is typically external merge sort. This GROUP BY evaluation method is illustrated in Figure 1. It contains three phases: run formation, run merging, and accumulation. During the run formation phase, the input is divided into a number of initial sorted runs which are written out to temporary files. In the run merging phase, the runs are merge-sorted into one run. During accumulation, the records in the same group are reduced to a single record. Early accumulation can be performed during any phase. If early accumulation is performed in both the run formation and run merging phases, the accumulation phase may be omitted.
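To make the evaluation strategy concrete, the following Python sketch (illustrative only, not from the paper) shows the sort-then-accumulate idea for a single SUM aggregate, under the simplifying assumption that the input fits in memory so that run formation and merging collapse into one in-memory sort.

    # In-memory sketch of sort-based GROUP BY with a single SUM aggregate.
    def group_by_sum(records):
        """records: iterable of (group_key, value) pairs."""
        sorted_records = sorted(records, key=lambda r: r[0])   # sort phase
        results, current_key, total = [], None, 0
        for key, value in sorted_records:                      # accumulation phase
            if key != current_key:
                if current_key is not None:
                    results.append((current_key, total))
                current_key, total = key, 0
            total += value
        if current_key is not None:
            results.append((current_key, total))
        return results

    print(group_by_sum([("a", 1), ("b", 2), ("a", 3)]))  # [('a', 4), ('b', 2)]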

Figure 1: Evaluation of GROUP BY -- the input passes through run formation, run merging, and accumulation to produce the output.

Run Formation Algorithms

Run formation algorithms can be classified into two types: those producing fixed-length runs and those producing variable-length runs. Fixed-length run formation algorithms read a certain amount of input into memory (limited by the size of available memory), sort the data in memory using some sorting algorithm, and then write out a run. This process continues until all input records have been scanned. Variable-length run formation algorithms may produce runs that are larger than the available memory. The standard algorithm is replacement selection [4].

Replacement Selection

The basic idea of replacement selection is as follows. In main memory, exactly two runs are maintained, represented by selection trees (heaps). One run is the current run and the other is the next run. Selection trees can efficiently support the operations insert and delete-minimum in logarithmic time. We first read as many records into memory as fit into the available memory, and the records are maintained by the selection tree of the current run. After memory is filled up, we output the root record into a (new) run file, and the current-run heap is adjusted so that the record with the lowest value of the sort key is on top. A new record is then read in. If its sort key is less than the one just output, it cannot be included in the current run, so it goes to the heap for the next run. If it is greater, it is inserted into the heap for the current run. This continues until the current run is empty. Then the next run becomes the current run and a new next run is created. This process continues until the input is empty. The two heaps can be combined into a single heap by prefixing the sort key with a field containing the run number (assigned immediately before the record is inserted into the heap). Then the (extended) key value of a record assigned to the next run is always greater than the (extended) key value of records assigned to the current run. Replacement selection can only increase the number of records in a run. In the worst case, the length of a run is the same as available main memory. This occurs when the input is already sorted in reverse order. When the input is already sorted in correct order, only one run will be produced. This is the best case. E. F. Moore showed that, for randomly ordered input, the expected length of each run is twice the available memory size [4]. When the input exhibits some level of pre-sortedness, runs are likely to be longer than twice the available memory.

Adding Early Grouping to Replacement Selection

When adding early grouping to run formation by replacement selection, we maintain in memory one record for each group encountered so far. When reading a new input record, we first check whether it belongs to one of the groups already in memory. If so, the input record is added to that group; that is, the necessary accumulation is done, after which the input record can be discarded. If the input record does not belong to one of the groups in memory, a new group has to be created. This may force us to write out one of the groups in memory (variable-length runs), or trigger the sort and output of a complete run (fixed-length runs).
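The following Python sketch illustrates run formation by replacement selection with early grouping. It is our illustration, not the authors' code: memory is measured in group records, SUM is assumed to be the only aggregate, and a dictionary keyed by the extended key (run number, group key) stands in for the hash table discussed in Section 5.

    import heapq

    def form_runs_with_early_grouping(records, memory_groups):
        """Replacement selection with early grouping (illustrative sketch).
        records       -- iterable of (group_key, value) pairs
        memory_groups -- number of group records that fit in memory (M)
        Returns a list of runs; each run is a list of (group_key, partial_sum)
        pairs in nondecreasing key order."""
        heap = []        # min-heap of extended keys (run_number, group_key)
        groups = {}      # (run_number, group_key) -> partial sum
        runs, current_run, last_out = [[]], 0, None

        def output_smallest():
            # Pop the group record with the smallest extended key and append it
            # to the current run, starting a new run when the run number changes.
            nonlocal current_run
            run_no, key = heapq.heappop(heap)
            if run_no != current_run:
                runs.append([])
                current_run = run_no
            runs[-1].append((key, groups.pop((run_no, key))))
            return key

        for key, value in records:
            # A record smaller than the last key written to the current run
            # cannot join the current run; it is assigned to the next run.
            run_no = current_run if (last_out is None or key >= last_out) else current_run + 1
            if (run_no, key) in groups:            # group already in memory: accumulate
                groups[(run_no, key)] += value
                continue
            if len(groups) >= memory_groups:       # memory full: write out one group record
                last_out = output_smallest()
                run_no = current_run if key >= last_out else current_run + 1
                if (run_no, key) in groups:
                    groups[(run_no, key)] += value
                    continue
            groups[(run_no, key)] = value          # create a new (partial) group in memory
            heapq.heappush(heap, (run_no, key))

        while heap:                                # flush the remaining group records
            output_smallest()
        return [run for run in runs if run]

The dictionary lookup corresponds to the find operation described under Implementation Considerations; evicting the smallest extended key is exactly the replacement-selection output step.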

Run Merging

When memory is not sufficient for merging all the runs in one pass, the merge phase is divided into several levels. Each level merges a certain number of runs into a new run. Figure 2 shows the case for a three-way, two-level merge. Since at each level every record (or group, if early grouping is used) will be scanned once, it is important to reduce the amount of data carried through the merge. Early grouping can "squeeze" more input records into each run, and thus reduce the number of runs. The idea of early grouping can be applied at every merge operation in the tree, thereby reducing the amount of data carried to the next level. Multi-level merging was common in the past when main memories were small and expensive. However, it is hardly ever needed today; a single merge pass will almost always suffice. The number of runs that can be merged is mainly limited by the main memory space needed for input buffers. At least one buffer for each run is needed, but a more typical number is two or three. Merging several hundred runs in a single pass is not unusual. The following example illustrates why multi-level merging is rarely needed.

Example: Assume that there is 100 GB of data to be sorted and that 500 runs are merged in a single pass. If the page size in the buffer is 8 KB and double buffering (two buffers per run) is used, we will need 500 x 2 x 8 KB = 8 MB of buffer space. The size of each run is 100 GB / 500 = 200 MB. Replacement selection produces runs twice the size of memory (on average), so 100 MB of memory will be sufficient to produce 200 MB runs. In summary, a 100 GB file can be sorted with a single merge pass by using 100 MB of memory during run formation and 8 MB during merging. This is not an excessive requirement for a computer expected to handle 100 GB files.

Figure 2: Multi-level merging -- initial runs are merged into larger runs, which are in turn merged to produce the final output.
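A single merge pass with early grouping can be sketched along the same lines (again an illustration under the assumptions above): partial groups with equal keys are combined as the sorted runs are merged. In a real system the runs would be read from disk through the input buffers discussed above.

    import heapq

    def merge_runs_with_grouping(runs):
        """Merge sorted runs of (group_key, partial_sum) records, combining
        partial groups with equal keys on the fly."""
        merged = heapq.merge(*runs, key=lambda rec: rec[0])   # k-way merge by key
        current_key, total = None, 0
        for key, partial in merged:
            if key != current_key:
                if current_key is not None:
                    yield (current_key, total)
                current_key, total = key, 0
            total += partial
        if current_key is not None:
            yield (current_key, total)

    runs = [[("a", 4), ("c", 1)], [("a", 2), ("b", 5)]]
    print(list(merge_runs_with_grouping(runs)))   # [('a', 6), ('b', 5), ('c', 1)]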

4 Simulation Experiments

We have run a large number of simulation experiments to investigate the effects of early grouping during run formation. The main benefit of early grouping is a reduction of the number of records (and runs) output from the run formation phase. Our experiments were therefore focused on finding out by what factor the data is reduced, and how this depends on the number of groups that fit in main memory and on the distribution of group sizes.

Distribution of Group Sizes

Assume that the input records belong to N different groups. In our experiments, the groups were labeled 1, 2, ..., N. To model the group size distribution we chose a generalized, truncated Zipf distribution [4]. The distribution function is defined by

    Z(x) = (1/x)^alpha / c,    x = 1, 2, ..., N,

where alpha is a nonnegative constant and c is a normalizing constant ensuring that the probabilities add up to one. According to this distribution, the fraction of records that belong to group x equals (1/x)^alpha / c. The group membership of an input record was randomly drawn from this distribution. alpha = 1 gives the traditional Zipf distribution, and alpha = 0 gives a uniform distribution. Increasing alpha increases the skew in the group size distribution, which should increase the data reduction obtained during run formation. Figure 3 shows three Zipf distribution functions, for alpha = 0.0, 0.5, and 1.0, with a million records and 1500 groups. Many phenomena, including the distribution of word occurrences in English text, have experimentally been found to follow the traditional Zipf distribution.

Figure 3: Zipf distribution functions (rows per group versus group position) for alpha = 0.0 (uniform), 0.5, and 1.0 (skewed), with 1,000,000 records and 1500 groups.
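One way to draw group IDs from this truncated, generalized Zipf distribution is to precompute the cumulative probabilities and invert a uniform random number, as in the following sketch (ours, not necessarily the generator used in the experiments).

    import bisect
    import random

    def make_zipf_sampler(num_groups, alpha, seed=42):
        """Return a function that draws group IDs 1..num_groups with
        probability Z(x) = (1/x)**alpha / c."""
        rng = random.Random(seed)
        weights = [(1.0 / x) ** alpha for x in range(1, num_groups + 1)]
        c = sum(weights)
        cumulative, running = [], 0.0
        for w in weights:
            running += w / c
            cumulative.append(running)
        def sample():
            i = bisect.bisect_left(cumulative, rng.random())
            return min(i, num_groups - 1) + 1    # guard against rounding at the tail
        return sample

    sample = make_zipf_sampler(num_groups=1500, alpha=1.0)
    ids = [sample() for _ in range(10)]          # group 1 is drawn most often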

Simulation Process

In all our experiments, runs were formed by replacement selection using a single heap. N is the number of groups; that is, the number of distinct group IDs. When we need a new input record during the simulation, its group ID is computed as a random variate drawn from the Zipf distribution. We then check whether the corresponding group is currently in main memory. If it is, we update the required set function field in the group record and proceed with the next input record. If it is not, we create a new group record and insert it into the heap used by replacement selection. A simulation experiment consisted of generating a certain number of input records and forming the runs as outlined above. The statistic of main interest is the number of (group) records output from the run formation. The results reported here are all for the case of 1500 groups (N) and one million input records.

We define the relative output size of the run formation phase as

    RelSize = number of output records / number of input records.

Note that each output record is a (partial) group. Clearly, the lower the relative output size, the better the data reduction we get. The relative output size ranges between 0 and 1. Assume that we can store M group records in the available main memory. The memory factor is then defined as Mem = M/N; that is, the fraction of groups that fit in main memory simultaneously. With a fixed number of groups, increasing the memory factor (Mem) is expected to reduce the relative output size (RelSize). Our objective is to find the relationship between Mem and RelSize. We summarize the notation we used in Figure 4.

Figure 4: Notation
    N                  Number of groups
    M                  Number of group records that can be stored in main memory
    Mem                Memory factor, defined as M/N, i.e., the fraction of the groups that fit in main memory simultaneously
    RelSize            Relative output size = number of output records / number of input records
    Reduction factor   1 - RelSize
    alpha              Parameter of the Zipf distribution; higher values increase the skew
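Combining the two hypothetical sketches above gives a compact version of this simulation: it forms the runs for a synthetic input and reports the relative output size (the aggregate value is irrelevant to the count, so a dummy value of 1 is used).

    def simulate_relsize(num_records, num_groups, memory_groups, alpha, seed=42):
        """Estimate RelSize = number of output group records / number of input
        records, using the run-formation and Zipf-sampling sketches above."""
        sample = make_zipf_sampler(num_groups, alpha, seed)
        records = ((sample(), 1) for _ in range(num_records))
        runs = form_runs_with_early_grouping(records, memory_groups)
        return sum(len(run) for run in runs) / num_records

    # Example: 1500 groups, memory for 150 group records (Mem = 0.1), uniform data;
    # Figure 5 reports a relative output size of about 0.9 for this setting.
    # print(simulate_relsize(1_000_000, 1500, 150, alpha=0.0))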

Simulation Results

Figure 5 shows the simulation results. The horizontal axis represents the memory factor, and the vertical axis represents the relative output size. Each curve shows the results for a specific value of alpha, i.e., for a particular group size distribution. Consider the curve for alpha = 0, which is the curve when all groups are of the same size. When the memory factor is 0.1, i.e., when memory can hold 10% of the groups, the relative output size is 0.9, meaning that the number of output records (group records) is 90% of the number of input records. When the memory factor is 0.6, the relative output size is 0.4. From the curve, we can clearly see that the reduction in output size is directly proportional to the relative amount of memory used. The experiment suggests that the relationship between RelSize and Mem is simply

    RelSize = 1 - Mem.

Consider the curve for alpha = 1. When the memory factor is 0.1, the relative output size is about 0.43. In other words, if 10% of the groups fit in main memory, the output is reduced by more than half. We can clearly see that the data reduction is much better when the input data is skewed. When the data becomes more and more skewed (as alpha increases), the relative output size decreases very rapidly for a relatively small memory factor. However, when the memory factor comes close to one, the difference among the relative output sizes for differently skewed data distributions is not significant. This is understandable since, in this situation, main memory can hold almost all the input records and there are only a few output (partial) groups. In the extreme case when the memory factor is one, the relative output size is 0 since memory can hold all the groups and no output is necessary. This is the best case: merging is avoided completely, and a single scan is sufficient.

In our other experiments, we also found that the number of groups is almost irrelevant to the relative output size. With the same memory factor and the same data distribution, the relative output size is only very slightly smaller when the number of groups is much larger.

We have not been able to find a general mathematical model for the relationship between the relative output size and the memory factor.
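As a cross-check of the alpha = 0 curve, the hypothetical simulator sketched above can be swept over memory factors; under its assumptions the estimates should track RelSize = 1 - Mem.

    for mem_factor in (0.1, 0.3, 0.6, 0.9):
        M = int(mem_factor * 1500)
        est = simulate_relsize(1_000_000, 1500, M, alpha=0.0)
        print(f"Mem = {mem_factor:.1f}  simulated RelSize = {est:.2f}  1 - Mem = {1 - mem_factor:.2f}")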

Figure 5: Memory factor (groups in memory / total groups) versus relative output size for alpha = 0.0, 0.3, 0.6, 0.8, 0.9, and 1.0, with 1500 groups and 1,000,000 input records.

5 Implementation Considerations

Size of Output Records

A word of caution may be appropriate first. As we saw in the previous section, early grouping always reduces the number of output records. However, the output records (group records) may be larger than the records that would be output without early grouping. This may happen when several aggregation functions are computed on the same column; for example, when computing MIN(PRICE), AVG(PRICE), and MAX(PRICE). The column PRICE in an input record will then be expanded to four columns in the group records, because AVG has to be expanded into SUM and COUNT(NOT NULL). (The count excludes NULL values since AVG does not consider them; COUNT(NOT NULL) is not standard SQL but is used here for convenience.) Therefore, the size of each output record may be larger than that of each input record. If this is the case and the number of records is reduced only slightly, the total amount of data output from the run formation phase may increase.
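As an illustration of this expansion (our example, not the paper's), a group record for a query computing MIN(PRICE), AVG(PRICE), and MAX(PRICE) could carry four accumulators for the single input column PRICE:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class PriceGroupRecord:
        """Accumulators kept per group for MIN(PRICE), AVG(PRICE), MAX(PRICE).
        AVG is carried as SUM plus a count of non-NULL values and is divided
        out only when the final group is produced."""
        group_key: str
        min_price: Optional[float] = None
        max_price: Optional[float] = None
        sum_price: float = 0.0
        count_price: int = 0                     # non-NULL PRICE values only

        def add(self, price: Optional[float]) -> None:
            if price is None:                    # NULL does not affect the aggregates
                return
            self.min_price = price if self.min_price is None else min(self.min_price, price)
            self.max_price = price if self.max_price is None else max(self.max_price, price)
            self.sum_price += price
            self.count_price += 1

        def avg(self) -> Optional[float]:
            return self.sum_price / self.count_price if self.count_price else None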

Hash Table

As a heap can only efficiently support insert and delete-smallest operations, an additional operation, find, is needed to locate the group with a given sort key. This operation can be implemented efficiently by maintaining a hash table on the sort keys. In our experiments we used a linear hash table, as described in [5]. The benefit of a linear hash table is that it can grow and shrink dynamically with the size of the input data.

Hash Table Overhead

If early grouping is not employed, no hash table is needed and no table lookup needs to be performed. In other words, early grouping incurs some additional overhead in both space (main memory) and CPU time. Assume that the memory available for the heap holds M group records. Since we only need to maintain a hash table of maximum size M, the space it takes is O(M). It takes O(1) time to insert, find, or delete an item from the table. For each input record, the worst case for using the hash table is: (1) look for the group the record belongs to, and discover that it is not in the hash table (and therefore not in memory); (2) delete the smallest-key group; and (3) insert the new group. Therefore, the total time overhead is O(n), where n is the number of records in the input. In summary, we need O(M) extra space and O(n) extra time to implement early grouping.

Worst Case

In the worst case, the number of output records is the same as the number of input records. One possible scenario is as follows: (a) the memory holds fewer group records than there are groups; (b) the input is divided into sequences of unique records; and (c) the distance between two records belonging to the same group is always larger than the number of groups that fit in main memory. Here is an example where memory holds 4 group records and there are 5 groups. Consider the input sequence

    6, 7, 8, 9, 10, 6, 7, 8, 9, 10

One can see that whenever a new record is read in, an old record (the one with the lowest key) has to be output. For example, when memory holds 6, 7, 8, 9 and 10 is read in, 6 has to be output and the new heap is 7, 8, 9, 10. When 6 is read in again, 7 is output and the new heap is 8, 9, 10, 6. This pattern continues. If output records are larger than input records, the end result is an expansion of the amount of data carried through the subsequent merge. Fortunately, this appears to be a very rare case.
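This eviction pattern can be reproduced with the hypothetical run-formation sketch from Section 3, using a dummy value of 1 per record and memory for 4 group records:

    keys = [6, 7, 8, 9, 10, 6, 7, 8, 9, 10]
    runs = form_runs_with_early_grouping([(k, 1) for k in keys], memory_groups=4)
    print(sum(len(run) for run in runs))   # 10: every input record is written out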

Best Case

The best case for this technique occurs when the available memory can hold at least as many group records as there are groups. In this case, all groups fit in main memory and no merging is required. The grouping and aggregation is thus completed in a single scan over the data. Taking into account current main memory sizes, we believe that this case will occur frequently.

6 Conclusion

This paper describes early grouping and demonstrates the data reduction that can be achieved by this technique. Our simulation experiments showed that when all groups are of the same size, the relative output size is proportional to the (relative) amount of memory used. This is the worst case. In practice, the distribution of group sizes is typically skewed. In this case, the data reduction is much higher. If the input data is partially ordered, the relative output size also decreases. Early grouping trades off CPU time and main memory against disk I/O. In the vast majority of cases, early grouping can significantly reduce the amount of data carried through the merge. However, in some very rare situations, the amount of data may actually increase. The benefits of early grouping clearly outweigh the drawbacks. Future work includes running experiments against real-life data, and developing a mathematical model for the relationship between the relative output size and the memory factor.

Acknowledgements

The authors gratefully acknowledge financial support for this work from the Natural Sciences and Engineering Research Council and the IBM Toronto Laboratory.

About the Authors

Weipeng Paul Yan is a Ph.D. candidate in the Department of Computer Science, University of Waterloo. His current research interests include query optimization and query processing, multidatabase systems, and formal specification. He can be reached at the Department of Computer Science, University of Waterloo, Waterloo, Ontario, N2L 3G1. His Internet address is [email protected].

P.-A. (Paul) Larson is a professor in the Department of Computer Science, University of Waterloo. He has worked on many aspects of database systems over the last 15 years. His current research interests include multidatabase systems, query optimization, and query processing. He can be reached at the Department of Computer Science, University of Waterloo, Waterloo, Ontario, N2L 3G1. His Internet address is [email protected].

References

[1] M. Abdelguerfi and Arun K. Sood. Computational complexity of sorting and joining relations with duplicates. IEEE Transactions on Knowledge and Data Engineering, 3(4):497-503, December 1991.

[2] D. Bitton and D. J. DeWitt. Duplicate record elimination in large data files. ACM Transactions on Database Systems, 8(2):255, June 1983.

[3] Goetz Graefe. Query evaluation techniques for large databases. ACM Computing Surveys, 25(2):73-170, June 1993.

[4] Donald E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, Reading, Massachusetts, 1973.

[5] Per-Åke Larson. Linear hashing with separators -- a dynamic hashing scheme achieving one-access retrieval. ACM Transactions on Database Systems, 13(3):366-388, December 1988.

[6] I. Munro and P. M. Spira. Sorting and searching in multisets. SIAM Journal on Computing, 5(1), March 1976.

[7] Jukka Teuhola and Lutz Wegner. Minimal space, average linear time duplicate deletion. Communications of the ACM, 34(3):62-73, March 1991.

[8] L. Wegner. Quicksort for equal keys. IEEE Transactions on Computers, C-34(4):362-367, April 1985.
