IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 15, NO. 4, JULY/AUGUST 2003
External Sorting: Run Formation Revisited

Per-Åke Larson, Member, IEEE Computer Society

Abstract—External mergesort begins with a run formation phase creating the initial sorted runs. Run formation can be done by a load-sort-store algorithm or by replacement selection. A load-sort-store algorithm repeatedly fills available memory with input records, sorts them, and writes the result to a run file. Replacement selection produces longer runs than load-sort-store algorithms and completely overlaps sorting and I/O, but it has poor locality of reference, resulting in frequent cache misses, and the classical algorithm works only for fixed-length records. This paper introduces batched replacement selection: a cache-conscious version of replacement selection that also works for variable-length records. The new algorithm resembles AlphaSort in the sense that it creates small in-memory runs and merges them to form the output runs. Its performance is experimentally compared with three other run formation algorithms: classical replacement selection, Quicksort, and AlphaSort. The experiments show that batched replacement selection is considerably faster than classical replacement selection. For small records (average 100 bytes), CPU time was reduced by about 50 percent and elapsed time by 47-63 percent. It was also consistently faster than Quicksort, but it did not always outperform AlphaSort. Replacement selection produces fewer runs than Quicksort and AlphaSort. The experiments confirmed that this reduces the merge time, whereas the effect on the overall sort time depends on the number of disks available.

Index Terms—External sorting, merge sort, replacement selection, run formation.
1 INTRODUCTION

EXTERNAL mergesort is the standard technique used for sorting large sets of records. It comprises two phases: a run formation phase that creates sorted subsets, called runs, and a merge phase that repeatedly merges runs into larger and larger runs until a single run has been created.

Most sort implementations use a load-sort-store algorithm for run formation. This algorithm fills the available memory space with records, extracts pointers to the records into an array, sorts the entries in the array (on the sort key of the records), and scans the array, outputting records into a run file. This process is repeated until all input records have been processed. Any in-memory sorting algorithm can be used for the pointer sort, with Quicksort being the most popular choice. All runs created will be of the same length, except possibly the last one. CPU processing and I/O are not overlapped.

Replacement selection is an alternative algorithm that produces runs twice as long. It is based on the observation that, if we keep track of the highest key output so far, we can easily decide whether an incoming record can be made part of the current run or has to be deferred to the next run. Any record added to the current run in this way increases the run length. Knuth [7] provides an excellent description and analysis of replacement selection. The classical replacement selection algorithm cannot be used for variable-length records. It assumes that, whenever a record is output, it is immediately replaced in memory by another record from the input. If records are of variable length, such one-to-one replacement is no longer possible: The next input record may not fit into the space freed up by
. The author is with Microsoft Corporation, One Microsoft Way, Redmond, WA 98052. E-mail: [email protected].

Manuscript received 14 Feb. 2002; accepted 15 Apr. 2002. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 115899.

1041-4347/03/$17.00 © 2003 IEEE
an output record, or multiple input records may fit into the space. This paper first introduces a version of replacement selection able to handle variable-length records.

Modern CPUs rely heavily on caches to hide memory latency and increase overall performance. Hence, it has become increasingly important to design algorithms that generate few cache misses. The basic replacement selection algorithm has poor cache behavior when the number of records in memory is large. The main loop of the algorithm consists of traversing a path from a leaf to the root of a selection tree, at each node comparing sort keys. Which path is traversed is unrelated to previously used paths. The nodes in the top part of the tree, and their associated sort keys, are touched frequently and are likely to remain in the cache, but not the ones lower down in the tree. We then introduce a cache-conscious version of replacement selection, called batched replacement selection, that reduces the number of cache misses significantly. It resembles AlphaSort [11] in the sense that it creates small in-memory runs and merges them to form output runs. The performance of the new algorithm is experimentally compared with three other run formation algorithms: the classical version of replacement selection, Quicksort, and AlphaSort.

Replacement selection has several advantages compared with a load-sort-store algorithm. First, it produces fewer and longer runs, which speeds up subsequent merge steps. Second, CPU processing, reading of input records, and writing of output runs are all overlapped, which improves the utilization of I/O devices and reduces elapsed time. Third, it very effectively exploits presorting, i.e., input sequences that are not random but somewhat correlated to the desired sorted output sequence [1]. In particular, if the input is already sorted, a single run will be generated.
The rest of the paper is organized as follows: Section 2 briefly describes related work and Section 3 contains some preliminary material. Section 4 describes the classical
replacement selection algorithm. The classical algorithm deals only with fixed-length records and extending the algorithm to variable-length records requires substantial changes, which are described in Section 5. Batched replacement selection is introduced in Section 6. Quicksort and AlphaSort are summarized briefly in Section 7, and experimental results are presented in Section 8.
2 RELATED WORK
Sorting is one of the most extensively studied problems in computing. Knuth's classic text [7] provides extensive coverage of the fundamentals of sorting, including replacement selection and mergesort.

Standard replacement selection produces runs twice the size of memory (on average). There have been several efforts to increase the run length further. Dinsmore [3] observed that initial runs do not have to be completely sorted. If we can guarantee that no record is too far away from its position in the completely sorted run and keep a sliding window of records in memory during merging, then we can reconstitute the run during input to the merge. Frazer and Wong [5] introduced the idea of a reservoir, i.e., incoming records that cannot be included in the current run are flushed out to a fixed-size reservoir on disk. The records in the reservoir are included in the next run. Ting and Wang [15] extended this idea to a reservoir of variable size. These extensions have received limited acceptance because they increase I/O and prevent full overlapping. The benefits do not justify the additional complexity.

The classical replacement selection algorithm stores a run number with each record and compares (run number, key) pairs. Wright pointed out in [17] that the run number can be dispensed with because we can always determine whether a key belongs to the current run or not by comparing it with the key of the last record output. However, this more than doubles the number of key comparisons, which seems like a poor trade-off even when comparisons are cheap.

Some sort algorithms are adaptive in the sense that they do less work if the input exhibits some degree of presortedness. Mergesort with run formation by replacement selection has this property, but mergesort with run formation by load-sort-store does not, i.e., the amount of work is always the same, even when the input is already sorted. Estivill-Castro and Wood [1] provide a survey of adaptive sort algorithms.
LaMarca and Ladner [9] studied the cache behavior of several versions of four in-memory sort algorithms, namely, mergesort, Quicksort, heapsort, and radix sort. Their experiments were run on an Alpha processor, but their key findings apply to any processor with a significant cache miss penalty. Unfortunately, they only considered sorting 8-byte integers. They found that Quicksort and (in-memory) mergesort generated the fewest cache misses and had the best performance. The fastest version of Quicksort was one that sorts small subarrays by insertion sort and does so immediately (instead of in a separate pass at the end). The fastest version of mergesort was one that completed the sort by a single wide merge using a selection tree (instead of
multiple binary merges). They also investigated the cache behavior of heaps [9]. In-memory sorting terminates with the sorted result in memory. Run formation and disk-to-disk sorting, on the other hand, need only a sorted output stream—whether the output ever exists in sorted form in memory is immaterial. AlphaSort [11] is a fast, cache-conscious disk-to-disk sort. When used for run formation, it produces runs of a fixed length. Interestingly enough, one of the motivations for AlphaSort was Quicksort’s poor cache behavior! AlphaSort is discussed in more detail in Section 7. This paper deals with the run formation phase. Techniques to speed up the merge phase are covered in the following references [2], [12], [13], [18], [19], [20].
3 PRELIMINARIES
Sorting has often been studied in isolation, which for external sorting means a freestanding disk-to-disk sort. However, we are also interested in sorting as part of query processing in a database system. In this scenario, input records originate from a stream of records produced by a simple scan or a more complex query. A stream supplies records one at a time and does not provide memory for more than the most recently supplied record. Thus, the sort operation must copy records into its workspace. We assume that the sort operation is limited to its workspace, i.e., it does not use additional memory for large data structures. Most database systems’ internal memory allocation mechanisms do not support allocation of arbitrarily large contiguous memory regions. Therefore, we presume that the workspace is divided into fixed-size extents. Each extent is a contiguous area in memory. Extent sizes may vary from a single disk page, e.g., 8 KB, to very large, e.g., multiple megabytes. Records cannot span extent boundaries. A record is copied twice during run formation: once from the input stream into the workspace and once from the workspace into an output buffer. The run formation algorithms discussed here operate on pointers to the records in the workspace. Once a record has been copied to a location in the workspace, it remains there until it is copied into an output buffer, which will eventually be written to an output file.
4 CLASSICAL REPLACEMENT-SELECTION ALGORITHM
This section reviews the classical version of the replacement selection algorithm as described, for example, by Knuth [7]. This algorithm assumes that records are of fixed length; the next section describes how to modify the algorithm to handle variable-length records. Replacement selection is based on the observation that, if we save the value of the last key output, we can easily decide whether an incoming record can still be made part of the current run or has to be deferred to the next run. When records are of fixed length, the main loop of the algorithm consists of three steps:
1. Among the records in memory, select one with the lowest key greater than or equal to the last key output.
2. Output the selected record and remember its key.
3. Get a new record from the input and store it in the slot previously occupied by the record just output.

The first step is the most expensive. It can be done efficiently with the help of a selection tree, using the run number and the record key as the (compound) selection key. A selection tree for N records is a left-complete binary tree with N external nodes and N − 1 internal nodes, usually stored in an array without pointers. Logically, each external node stores a (run number, key) pair and each internal node stores the lesser of the keys of its two sons, plus a reference to the source node of the key. Physically, they can be implemented in several different ways.

Fig. 1 shows a selection tree with five keys and illustrates replacing the lowest key. External nodes are shown as rectangles and internal nodes as ellipses. The small number to the left of a node indicates its position in the array. First, (1, 12) is output and its key value recorded. This frees up the external node at position 9 and a new record (1, 30) is inserted here. Next, we find a new minimum by traversing the path up to the root, at each step comparing the keys of two sibling nodes and promoting the one with the lower key. The number of comparisons is always either ⌈log2(N)⌉ or ⌊log2(N)⌋.

Fig. 1. Example selection tree illustrating replacement of one element in the tree. (a) Initial state and (b) output (1, 12), replace by (1, 30).

4.1 Implementation
A selection tree can be implemented in several different ways. We will first describe the implementation chosen and then discuss alternatives. A node in the tree contains nothing but a pointer to a record to keep the tree itself of minimal size. The beginning of the array storing the tree is offset so that two siblings always occupy the same cache line. The type of a node is determined by its position in the array, so the type need not be explicitly marked. There are twice as many nodes as records, so this adds 2 × 4 = 8 bytes of overhead per record. A record slot contains a record and two additional fields: a run number and the position of the record's source node, that is, the external node "owning" the record. Assuming these two fields are stored as 4-byte integers, this adds another 8 bytes of overhead per record, bringing the total overhead to 16 bytes per record.

What is the source node field needed for? When a record is output, it is replaced in the tree by a new record. To replace it, we need to know which external node the record occupied, which is what the source node field records.

When records are of a fixed length and we know exactly how much memory is available before run formation starts, we know exactly how large a selection tree is needed. The tree is preallocated and filled with a fictitious run zero that is never output. Processing an incoming record then consists of the following steps:
1. Record the key value and run number of the top record.
2. Copy the top record to an output buffer (provided it is a real record).
3. If the output buffer is full, write it to the run file.
4. Copy the incoming record into the vacated slot in memory.
5. Determine its run number by comparing it with the recorded key value.
6. Copy a pointer to the new record into the appropriate node in the selection tree.
7. Fix up the path from the leaf with the new record to the root of the tree.

When there are no more input records, we again fill memory with fictitious records that are never output. The only operation performed is replacement, consisting mainly of copying two records and traversing a path in the selection tree from a leaf to the root.

Knuth's version of the algorithm [7] does not include a source node field in each record. Instead, it adds a level of indirection: An internal node points to the external node owning the record, which in turn points to the actual record. To locate a record, we first have to access an external node. We rejected this solution because the indirection increases cache misses during traversal. Knuth's version includes two additional refinements: storing the loser instead of the winner in each internal node and packing one internal node and one external node together into a combined physical node. These changes complicate the algorithm but have no effect on its performance, so they were not used in our implementation.

When records are of a fixed length, we can save space by eliminating the source node fields and completely eliminating external nodes. When records are originally loaded into memory, the first external node will point to the record in slot 1, the second external node will point to the record in slot 2, and so on. Whenever a record is output and replaced, the new record will be stored in the freed-up slot and use the same external node as the record that was output. In other words, the pointers in the external nodes never change and, in fact, can be inferred from the node number.
Similarly, the position of a record’s source node can be inferred from which record slot it occupies. This modification will reduce the overhead to 8 bytes per record (four for the run number and four for each internal node). However, we did not implement these space-saving refinements.
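The classical loop can be sketched in a few lines of Python. This is an illustration, not the paper's implementation: a binary heap of (run number, key) pairs stands in for the array-stored selection tree, and records are reduced to bare keys.

```python
import heapq
import itertools

def replacement_selection(records, memory_size):
    """Form sorted runs from an input iterable.

    A heap of (run number, key) pairs plays the role of the
    selection tree: popping the minimum yields the record with
    the lowest key eligible for the current run.
    """
    it = iter(records)
    heap = [(1, key) for key in itertools.islice(it, memory_size)]
    heapq.heapify(heap)
    runs, current, last_run = [], [], 1
    while heap:
        run, key = heapq.heappop(heap)
        if run != last_run:                 # current run is exhausted
            runs.append(current)
            current, last_run = [], run
        current.append(key)                 # output the selected record
        nxt = next(it, None)
        if nxt is not None:
            # A record smaller than the last key output cannot extend
            # the current run and is deferred to the next run.
            heapq.heappush(heap, (run if nxt >= key else run + 1, nxt))
    if current:
        runs.append(current)
    return runs
```

On random input this produces runs averaging about twice the memory size; a fully sorted input yields a single run.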
5 REPLACEMENT SELECTION FOR VARIABLE-LENGTH RECORDS
The textbook version of replacement selection cannot handle variable-length records. This is a serious drawback because variable-length records are common in practice. In this section, we introduce a version of replacement selection that handles this case. Only the basic ideas are described here; the detailed algorithms can be found in the Appendix.

When records are of variable length, two complications arise. First, managing the space reserved for storing records becomes more complex. Second, records are no longer just replaced in the selection tree. A record may be output and deleted from the tree without being replaced (if there is no free slot large enough for the replacement record). Similarly, records may be input and added to the tree without outputting any existing records. Thus, the selection tree is no longer of constant size.

In this case, it is better to view replacement selection as consisting of two processes: an input process that fills memory with new records and adds them to the selection tree and an output process that repeatedly deletes the top record from the tree and outputs it. The input process drives the processing. Whenever it fails to find memory space for an incoming record, it resumes the output process, which runs until it has created a free slot of sufficient size. When there are no more input records, the output process simply continues until all records have been output.

Add to selection tree. A new record can be added to the selection tree easily; see Algorithm AddRecordToTree in the Appendix for details. The new node is added to the end, after the last external node. Because the tree is complete, its parent is always the first external node. The content of the parent node is copied into the next free node, that is, it becomes the right sibling of the new element. A different external node now owns the record, so the source node field in the record has to be updated. The sort keys of the two new external nodes are compared and the lower one promoted to the parent node. The walk up the tree continues until there is no change, that is, the record that gets promoted is the same as the one already stored in the parent node. The main cost of an addition is the (partial) traversal of a path in the tree, more specifically, the cost of key comparisons during the traversal.

Delete from selection tree. Deleting a node is more complex; see Algorithm RemoveFromTree in the Appendix for details. The basic idea is to exchange the target node with the last node of the tree and delete the last node. Suppose we need to delete element (1, 15) in node 5 of Fig. 1b. We first save the pointer to element (1, 30) from the last node of the tree. Next, we delete the last node, which consists of moving element (1, 32) from node 8 to node 4 (its parent), updating the source node field, and fixing up the path from node 4 to the root. (Note that this traversal cannot
continue all the way to the root because the node does not contain the smallest key.) Finally, we copy the pointer to element (1, 30) into node 5, update its source node field, and fix up the path from node 5 to the root. This traversal always continues all the way to the root because we replaced the element with the smallest key. The main cost of a deletion is one full and one partial traversal of the tree.
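The exchange-with-last deletion can be illustrated on a plain array-stored binary min-heap. This is a simplification of the paper's selection tree (no external nodes or source node fields); heap_delete, sift_down, and sift_up are illustrative names.

```python
import heapq

def sift_down(h, i):
    """Restore heap order below position i."""
    n = len(h)
    while True:
        left, right, smallest = 2 * i + 1, 2 * i + 2, i
        if left < n and h[left] < h[smallest]:
            smallest = left
        if right < n and h[right] < h[smallest]:
            smallest = right
        if smallest == i:
            return
        h[i], h[smallest] = h[smallest], h[i]
        i = smallest

def sift_up(h, i):
    """Restore heap order above position i."""
    while i > 0 and h[i] < h[(i - 1) // 2]:
        parent = (i - 1) // 2
        h[i], h[parent] = h[parent], h[i]
        i = parent

def heap_delete(h, i):
    """Delete h[i]: move the last element into the hole, then fix up.

    This mirrors the selection-tree deletion: exchange the target
    with the last node, shrink the tree, and repair the affected
    paths (only one of the two sifts actually moves anything).
    """
    last = h.pop()
    if i < len(h):
        h[i] = last
        sift_down(h, i)
        sift_up(h, i)
```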
5.1 Memory Management
When records are of variable length, managing the space reserved for storing records becomes an issue. In a previous paper [10], we showed that a version of best-fit allocation solves this problem efficiently, resulting in a memory utilization of 90 percent or better. The basic idea of best-fit allocation is to always allocate space from the smallest free block large enough for the record being stored. If the selected free block is larger than required, the unused space is returned to the pool as a smaller free block. We recommended immediate coalescing of adjacent free blocks and the use of boundary tags to make coalescing efficient; see [6, pp. 440-443]. Best-fit allocation depends on being able to efficiently locate the smallest block larger than a given size. We proposed storing the collection of free blocks in a (balanced or unbalanced) binary tree with the block size as the key. If the tree is kept balanced, for example, as an AVL-tree, the best-fitting free block can always be located in logarithmic time.

Subsequent experiments revealed that a simpler, approximate best-fit scheme is much faster while using only slightly more memory. The basic idea is to use blocks of a number of predetermined sizes and have separate free lists for each size. In the experiments reported in this paper, we used block sizes spaced 32 bytes apart, that is, block sizes 32, 64, 96, 128, and so on. Recall the assumption that the workspace consists of extents of some fixed size. In our experiments, extents were of size 8 KB. With the largest block size 8 KB and block sizes spaced 32 bytes apart, we need 8 × 1,024/32 = 256 free lists. The space needed for list headers is thus only 1 KB. The lists must be doubly linked because we need to be able to delete a free block from the middle of a list when coalescing adjacent free blocks. Using this organization, it is easy to locate the smallest free block larger than a given size.
First, round the size up to the closest multiple of 32 and then divide by 32 (shift by 5) to determine the first list that might have free blocks of sufficient size. Scan the free list headers forward to the first nonnull header and grab the first free block from that list. If the block found is larger than required, round the size up to the closest multiple of 32, carve off a piece of this size from the beginning of the block, and return the remainder (as a smaller free block) to the appropriate free list.

To improve search performance, we keep track of the highest nonnull free list and never search past this point. Let last_non_null be a variable that keeps track of this. Whenever a search fails, we know that all lists checked during the search are empty. If the failed search started at position p, we can set last_non_null to p − 1 and terminate future searches there. When a new free block is added to a list q past last_non_null, last_non_null must be reset to q. This simple device significantly reduced the amount of unnecessary checking of free lists known to be empty.
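The size-class free lists and the last_non_null shortcut can be sketched as follows. This is a simplified model: integer offsets stand in for block addresses, coalescing is omitted, and the class name FreeLists is illustrative.

```python
BLOCK_ALIGN = 32          # size classes are spaced 32 bytes apart
EXTENT_SIZE = 8 * 1024    # largest block equals one extent
NUM_CLASSES = EXTENT_SIZE // BLOCK_ALIGN   # 8 * 1,024 / 32 = 256 lists

class FreeLists:
    """Approximate best-fit allocator with one free list per size class."""

    def __init__(self):
        # index c holds free blocks of size c * BLOCK_ALIGN
        self.lists = [[] for _ in range(NUM_CLASSES + 1)]
        self.last_non_null = 0   # no class above this index is nonempty

    def add(self, offset, size):
        cls = size // BLOCK_ALIGN
        self.lists[cls].append(offset)
        if cls > self.last_non_null:
            self.last_non_null = cls

    def allocate(self, size):
        # round the request up to the closest multiple of 32
        need = -(-size // BLOCK_ALIGN) * BLOCK_ALIGN
        start = need // BLOCK_ALIGN
        for cls in range(start, self.last_non_null + 1):
            if self.lists[cls]:
                offset = self.lists[cls].pop()
                block_size = cls * BLOCK_ALIGN
                if block_size > need:
                    # return the remainder as a smaller free block
                    self.add(offset + need, block_size - need)
                return offset, need
        # the search failed: every list from start upward was empty
        self.last_non_null = start - 1
        return None
```

A failed search lowers last_non_null so that later searches terminate without rescanning lists known to be empty.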
When returning a block of memory, we first check whether it can be coalesced with its neighbor to the left, its neighbor to the right, or both. In most cases, no coalescing is possible and the free block is simply placed at the beginning of the appropriate free list. If the block can be coalesced with an adjacent block, we first delete that block from whatever free list it is on, combine the blocks into a new, larger block, and add the new block to the appropriate free list. Thanks to boundary tags and the organization of the free lists, freeing a block can be done in constant time.

5.2 Implementation
The layout of the main data structures is illustrated in Fig. 2. The selection tree remains unchanged, consisting of an array of pointers to sort records and a counter (LastNode) indicating the last active node in the array. A sort record contains the same fields as before: a source node reference (SrcNode), a run number (RunNo), and the actual record. We assume that the record length is available in the record itself. Recall that the selection tree and these fields add up to 16 bytes of overhead per record. Each record slot also contains a header and a footer field storing the boundary tags needed for space management. Each field requires two bytes: one bit to indicate whether the block is free or occupied and 15 bits to indicate the length of the slot. If the slot size is recorded in multiples of 32, 15 bits is enough to handle slots up to 1 MB. The space overhead is now 20 bytes per record plus 1 KB for the headers of the free lists. To this must be added, of course, any space wasted by internal and external fragmentation.

Fig. 2. Layout of main data structures.

6 BATCHED REPLACEMENT SELECTION
The versions of replacement selection described so far have poor locality of reference and generate frequent cache misses. This section describes batched replacement selection, a new version that generates significantly fewer cache misses. In our experiments, we observed 50 percent fewer L2 cache misses for 100-byte records and 20 MB of memory.

The previous algorithms use a selection tree with twice as many nodes as there are records in memory. To reduce its size, we apply the same idea as AlphaSort, namely, attaching to each external node a small batch of sorted records. We call these miniruns to distinguish them from output runs. When a record has been output, the next record in its minirun replaces it and the tree is updated in the normal way. When a minirun becomes empty, there is no replacement record and the tree shrinks as described in the previous section. Incoming records are placed in memory, but, instead of being added individually to the tree, they are collected into miniruns. When the current minirun becomes full, we determine the run number of each record in the batch, sort the minirun using Quicksort, and add its first record to the tree. The size of the selection tree varies as miniruns are added or deleted.

Collecting records into miniruns obviously reduces the size of the selection tree and the set of keys participating in comparisons when traversing the tree. Using the standard snowplow argument, see Knuth [7], one can show that miniruns are on average half full, meaning that a minirun size of B records reduces the tree size by a factor of B/2 and, thus, the number of tree levels by log2(B/2). For example, using miniruns of 512 records reduces the selection tree by a factor of 256, or eight levels. Each level may generate two or three L2 cache misses during traversal, so the potential saving is 16 to 24 misses.

We make one further modification to the algorithm: batch processing of input and output.
Recall that replacement selection can be viewed as consisting of an input process and an output process. The algorithm's main loop processes, on average, one record in input mode, switches to output mode, outputs one record, and switches back to input mode. We can reduce the working set and cache misses by switching less frequently. That is, once we switch to output mode, we output not just one but a batch of records. This will free up multiple record slots allowing us to process multiple input records at a time. In summary, we break the main loop into two loops:
1. An input loop that fills memory. Whenever there is a full minirun of, say, 500 records, it is sorted and added to the selection tree. Whenever there is no free space for an incoming record, switch to output mode.
2. An output loop that selects and outputs a batch of, say, 1,000 records and then switches to input mode. Whenever a minirun in the tree becomes empty, the corresponding external node is deleted and the tree shrinks. As soon as a record has been copied to an output buffer, its memory space is returned to the free pool. Output buffers are written to disk as soon as they become full.
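The two loops can be sketched as follows, again with a heap of minirun heads standing in for the selection tree; the default minirun and batch sizes are scaled down for illustration and there is no real I/O.

```python
import heapq
import itertools

def batched_replacement_selection(records, memory_size=32,
                                  minirun_size=4, batch_size=8):
    """Run formation with miniruns and batched input/output (sketch).

    Only the head of each minirun participates in the selection
    heap, which stands in for the paper's shrunken selection tree.
    """
    it = iter(records)
    heap, runs, current = [], [], []
    last_run, last_key = 1, None
    in_memory = 0                    # records currently buffered
    tie = itertools.count()          # tiebreaker; never compare iterators

    def add_minirun(batch):
        # Tag each record with a run number relative to the last key
        # output, sort the minirun, and push its head onto the heap.
        tagged = sorted(
            (last_run + (last_key is not None and key < last_key), key)
            for key in batch)
        rest = iter(tagged)
        heapq.heappush(heap, (next(rest), next(tie), rest))

    exhausted = False
    while not exhausted or heap:
        # Input loop: fill memory, one minirun at a time.
        while not exhausted and in_memory < memory_size:
            batch = list(itertools.islice(it, minirun_size))
            if not batch:
                exhausted = True
            else:
                in_memory += len(batch)
                add_minirun(batch)
        # Output loop: emit a batch of records, then switch back.
        for _ in range(batch_size):
            if not heap:
                break
            (run, key), t, rest = heapq.heappop(heap)
            if run != last_run:      # current output run is finished
                runs.append(current)
                current, last_run = [], run
            current.append(key)
            last_key = key
            in_memory -= 1
            head = next(rest, None)  # promote the next minirun record
            if head is not None:
                heapq.heappush(heap, (head, t, rest))
    if current:
        runs.append(current)
    return runs
```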
Batch processing reduces the working set and cache misses. When switching to output mode, the selection tree and the related keys have probably been expelled from the cache. There is first a flurry of cache misses, but soon large parts of the tree and the active keys will have been loaded into the cache. After that, most activity takes place in the cache. Similar arguments can be made regarding the effects of input batching.

We must switch to output mode when an incoming record cannot be placed in memory, but how many records should be output before switching back to input mode? That is, how large should output batches be? There is no reason to switch back before a free slot large enough to store the incoming record has been created. Beyond that, several rules are possible:
1. output a fixed number of records,
2. fill one output buffer, issue a write, and switch, or
3. output until x percent of the memory space has been freed up.
In all our experiments, we used the first rule (output a fixed number of records).
6.1 Implementation
The data structure for the selection tree remains unchanged. The data structure for the current minirun being assembled consists of an array of pointers to records and a counter. Once the array has been filled, the records are sorted and the records in the minirun are linked together in sorted order. This solution requires less space than representing a minirun by a pointer array. As mentioned above, miniruns are on average half full, so the effective overhead would have been two pointers per record. Each record slot still needs the two additional fields mentioned earlier: a run number and a source node field. If records are of variable length, the header and footer fields are also needed. The space overhead is thus 2 × 4 = 8 bytes per minirun and 4 + 4 + 4 = 12 bytes per record if records are of a fixed length and 16 bytes if they are of variable length.

The selection tree is now much smaller, so, in this case, it might make sense to use the indirection scheme mentioned earlier, i.e., an internal node points to an external node which, in turn, points to the actual record. This would allow us to eliminate the source node field in each record slot—a significant saving. Another possibility is to include both a pointer to the record and a pointer to the external node in each internal node. We have not investigated the effects of these alternatives.
7 QUICKSORT AND ALPHASORT
Because we compare the performance of replacement selection with Quicksort and AlphaSort, this section briefly describes the versions of these algorithms used in the experiments. Note that these two algorithms both produce runs of fixed size, slightly smaller than the size of available memory.
7.1 Quicksort The easiest way to create a run is to fill the available memory space with records, extract pointers to all records
into an array, sort the pointers in the array (on the sort key of the records pointed to), scan forward through the array outputting records into a run file, and erase all records in memory. Repeat this until all input records have been processed. Any in-memory sort algorithm can be used for the pointer sort, but Quicksort is the standard choice. In our experiments, we used the version of Quicksort found to be the fastest by LaMarca and Ladner [9]. This version simply stops the Quicksort recursion for small subsets and immediately sorts the subset using insertion sort. Our implementation used this optimization with the subset limit set at eight elements.

This method for run creation is easy to implement, but it has one major drawback: CPU processing and I/O cannot be overlapped. This reduces throughput significantly.
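This Quicksort variant can be sketched as follows; the plain last-element (Lomuto) partition is an assumption for brevity, not necessarily the pivot rule the experiments used.

```python
def partition(a, lo, hi):
    """Lomuto partition around a[hi]; returns the pivot's final index."""
    pivot, i = a[hi], lo
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i

def quicksort(a, lo=0, hi=None, cutoff=8):
    """Quicksort that sorts small subarrays by insertion sort
    immediately, rather than in a separate pass at the end."""
    if hi is None:
        hi = len(a) - 1
    while lo < hi:
        if hi - lo + 1 <= cutoff:
            for i in range(lo + 1, hi + 1):     # insertion sort
                x, j = a[i], i - 1
                while j >= lo and a[j] > x:
                    a[j + 1] = a[j]
                    j -= 1
                a[j + 1] = x
            return
        p = partition(a, lo, hi)
        quicksort(a, lo, p - 1, cutoff)   # recurse on the left part
        lo = p + 1                        # loop on the right part
```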
7.2 AlphaSort
In-memory sort algorithms leave the sorted output in memory, but for run formation all we need is a stream of sorted records that can be written to disk (or passed to the next operation if the input fits completely in memory). Merge-based algorithms have exactly this behavior. Although designed as a fast disk-to-disk sort, AlphaSort [11] seems ideally suited for run formation because it has good cache performance and overlaps processing and I/O. AlphaSort first loads all records into memory, but during loading it sorts small batches of records, creating miniruns. The sorting and the next I/O operation are thus overlapped. Sorting each batch immediately also generates fewer cache misses than sorting all records at the end: the records in a batch have just been copied into the workspace, so at least some of their sort keys are still in the cache. Once memory has been filled, AlphaSort enters a merge phase. All miniruns are merged in a single, wide merge using a selection tree in the same way as for replacement selection. The stream of sorted records is output immediately, which allows merging and output to disk to be completely overlapped. In our implementation, batches were sorted using Quicksort and the merge phase used the selection tree version explained above for batched replacement selection. The original version of AlphaSort included sort key prefixes in the selection tree, but this optimization was omitted in our implementation. Including key prefixes would have increased the tree size (by 100 percent or more), which increases the number of cache misses.
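The AlphaSort run formation scheme just described can be sketched as follows (my illustration, with Python's heap-based k-way merge standing in for the selection tree; function and parameter names are assumptions):

```python
import heapq

def alphasort_run(records, batch_size=100):
    """Generate the records of one sorted run; 'records' must fit in memory."""
    miniruns, batch = [], []
    for r in records:                  # loading phase
        batch.append(r)
        if len(batch) == batch_size:   # sort each batch as soon as it fills
            batch.sort()
            miniruns.append(batch)
            batch = []
    if batch:
        batch.sort()
        miniruns.append(batch)
    # merge phase: one wide merge over all miniruns; heapq.merge plays
    # the role of the selection tree and streams records out immediately
    yield from heapq.merge(*miniruns)

data = [9, 1, 8, 2, 7, 3, 6, 4, 5, 0]
run = list(alphasort_run(data, batch_size=4))
# run: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```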
8 EXPERIMENTAL RESULTS
We have run a series of experiments to evaluate the performance of batched replacement selection and compare it with other run formation algorithms. All experiments were run on a 300 MHz Pentium II processor with a 512 KB unified L2 cache and two 16 KB L1 caches, one for data and one for instructions. The machine had 128 MB of memory and two 9 GB SCSI disks (Seagate Barracuda 9, Ultra SCSI). On the Pentium II and similar processors, the cache miss penalty is not fixed but depends on the circumstances. The Pentium II is a superscalar processor with out-of-order execution. When one instruction stalls on a cache miss, the processor attempts to continue executing other instructions. This may mask some of the cache penalty. Furthermore,
reading or writing to memory is not a simple fixed-duration operation; a read/write request may have to wait for the bus and/or memory module to serve other requests.
For the experiments, the stream of input records was generated by randomly selecting records from a pool of 25,084 records in memory. The record keys consisted of (unique) English words from a spelling dictionary, extended with dots (".") to completely fill the record. Key comparisons were done using binary order (i.e., using strcmp()), so comparisons were comparatively cheap. When running experiments with variable-length input records, record lengths were drawn from a triangular distribution with a specified minimum and maximum record length. The probability that a record is of length k is p(k) = 2(1 - (k - l)/n)/(n + 1), k = l, l+1, ..., u, where l is the minimum record length, u is the maximum record length, and n = u - l. The average record length is then approximately l + (u - l)/3.
In all experiments reported in this paper, the workspace used for storing records consisted of 8 KB extents. Fixed-size records were 100 bytes long and variable-size records followed the triangular distribution with minimum 75, maximum 150, and average 100 bytes. Space for auxiliary data structures (pointer arrays for Quicksort and AlphaSort, selection trees for replacement selection and AlphaSort, and the current minirun array for batched replacement selection) was allocated outside this workspace. This was also the case with all output buffers. Batched replacement selection used miniruns of 500 records and AlphaSort used miniruns of 100 records. Initial experiments showed that the throughput of batched replacement selection was close to constant for minirun sizes in the range of 100 to 500 records. (Decreasing it below 100 records slowed down throughput noticeably.) Five hundred was chosen to reduce the size of the selection tree. AlphaSort's throughput deteriorated when the minirun size was increased beyond 100.
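The record-length distribution can be checked numerically (a small sketch; the function name is mine): the probabilities sum to one, and with l = 75 and u = 150 the mean comes out close to the 100-byte average used in the experiments.

```python
def triangular_pmf(l, u):
    """p(k) = 2(1 - (k - l)/n)/(n + 1) for k = l..u, with n = u - l."""
    n = u - l
    return {k: 2.0 * (1.0 - (k - l) / n) / (n + 1) for k in range(l, u + 1)}

pmf = triangular_pmf(75, 150)
total = sum(pmf.values())                  # sums to 1
mean = sum(k * p for k, p in pmf.items())  # close to 75 + (150 - 75)/3 = 100
```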
8.1 CPU Performance—No I/O
The first series of experiments focused on the CPU behavior of the various algorithms and, therefore, did no I/O. Records were copied into output buffers, but the actual writes were omitted. Two series of experiments are reported here: one with 100-byte fixed-length records and one with variable-length records with an average length of 100 bytes.
Fig. 3 shows the overall performance of the four methods as a function of memory size. We measure performance by throughput in MB per second, that is, the total amount of sorted run data output per second by each algorithm. The first observation is that throughput decreases as memory size increases. That is expected: as memory size increases, fewer runs are generated, so, in essence, more of the sort work has been done. Second, the two cache-conscious algorithms (AlphaSort and batched replacement selection) are faster than Quicksort and classic replacement selection, but the difference is less than one might expect. Third, AlphaSort is the fastest overall, but keep in mind that it produces shorter runs than replacement selection, i.e., it does a smaller share of the overall sort work.
Fig. 3. Comparison of overall run formation speed without output to disk, in MB/s for fixed and variable size records. Variable size records have an average size of 100 bytes.
Table 1 provides more detailed information on elapsed time in CPU cycles, instructions executed, and cache misses for memory size 20 MB. All figures are per record processed. We first observe that batched replacement
selection generates about 50 percent fewer L2 cache misses than classical replacement selection. The number of L1 cache misses also drops, but not as much. Note also the excellent cache performance of AlphaSort. Second, reducing L2 cache misses pays off. For fixed-size records, batched replacement selection actually executes 7.7 percent more instructions than the classical algorithm (2,688 versus 2,494) but runs 16.7 percent faster (4,776 versus 5,737 cycles). For variable-size records, batched replacement selection is as much as 33.5 percent faster. Third, AlphaSort is the overall winner both in elapsed time and in instructions executed. Finally, the number of cycles per instruction is quite high. The Pentium II is a superscalar processor capable of retiring two instructions per cycle, which translates to 0.5 cycles per instruction. Here, we observe between 1.78 and 2.30 cycles per instruction, meaning that the processor runs at about a quarter of its peak performance.
The number of key comparisons performed by an algorithm is an important performance measure. In our experiments, key comparisons were relatively cheap because the keys were ASCII strings and the collating sequence was binary order. Even so, most of the elapsed time was spent in the comparison functions. For compound keys using natural-language collating sequences, key comparisons may be very expensive.
TABLE 1 CPU Cycles, Instructions, and Cache Misses per Record
TABLE 2 Number of Key Comparisons per Record
TABLE 3 Average Run Length Relative to Memory Size
Fig. 4. Comparison of overall run formation speed, including output, in MB/s for variable size records. Records have an average size of 100 bytes.
Table 2 shows the number of key comparisons performed by the various algorithms. The first observation is that Quicksort is not as prudent with key comparisons as one might expect. All other algorithms performed fewer key comparisons, with the glaring exception of classic replacement selection for variable-size records. The high comparison count for this algorithm is caused by frequent deletes from the selection tree without replacement and additions to the tree without a matching delete. Performing a series of such unmatched operations is much more expensive than performing the same work as a series of replacement operations. Finally, the number of comparisons for batched replacement selection is slightly lower when records are of variable size because fewer records fit in memory and runs are slightly shorter.
Table 3 lists the average run length for different memory sizes and an input size of 50 times the memory size. Quicksort has no overhead other than the pointer array (4 bytes per record), resulting in runs of length 0.96 times memory size. AlphaSort has some additional overhead for the selection tree, which results in slightly shorter runs than
Quicksort. The claim that replacement selection produces runs twice the size of memory is correct, but with one proviso: it holds if one counts only the memory space occupied by records and ignores overhead. In practice, there is significant space overhead, so the run length will be less than twice the total space used. The experiments show that batched replacement selection produces slightly longer runs than the classical algorithm, mostly because the selection tree requires less space. Compared with Quicksort and AlphaSort, batched replacement selection produces about 45 percent fewer runs for fixed-length records and about 40 percent fewer for variable-length records.
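The roughly two-times-memory run length can be illustrated with a toy simulation (my sketch with assumed parameters; it ignores all space overhead, which is why it approaches the ideal factor of two):

```python
import heapq
import random

def run_lengths(n_records, mem_capacity, seed=42):
    """Simulate replacement selection on random keys; return the
    lengths of the completed runs."""
    rnd = random.Random(seed)
    # memory starts full of records tagged with run number 1
    heap = [(1, rnd.random()) for _ in range(mem_capacity)]
    heapq.heapify(heap)
    lengths, cur_run, cur_len = [], 1, 0
    for _ in range(n_records):
        run, key = heapq.heappop(heap)
        if run != cur_run:            # first record of a new run
            lengths.append(cur_len)
            cur_run, cur_len = run, 0
        cur_len += 1
        new_key = rnd.random()
        # an incoming record joins the current run only if its key is
        # not smaller than the key just output; otherwise it is deferred
        heapq.heappush(heap, (run if new_key >= key else run + 1, new_key))
    return lengths

lengths = run_lengths(200_000, 1_000)
avg = sum(lengths[1:]) / len(lengths[1:])   # skip the shorter first run
# avg is close to 2 * 1000, i.e., twice the memory capacity
```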
8.2 Performance with I/O Included
In reality, the runs have to be written to disk. We ran two series of experiments for this case: one series writing to a single disk and one series writing to two disks in parallel. All experiments used output buffers of size 64 KB. To allow concurrent writing to disk and CPU processing, the number of buffers was set to the number of disks plus one. Writes were asynchronous, allowing full overlap of I/O and CPU processing. All writes bypassed the file system cache completely, thereby avoiding one round of copying. Write caching (by the disk controller) was also disabled.
Fig. 4 plots the throughput for one and two disks. Replacement selection overlaps CPU processing and writing
completely, so the total throughput will be limited by the slower of the two activities. As expected, Quicksort is slowest in almost all cases. Fig. 4 shows that the maximum output rate for a single disk is around 4.5 MB/s. For fixed-size records, both versions of replacement selection are able to keep up with the disk. When records are of variable length, more time is spent on memory management. Batched replacement selection is still able to keep up with the disk, but the classical version cannot and its throughput drops as memory size increases. AlphaSort performs no writes during the initial phase of loading records and creating miniruns, so its maximum throughput is by necessity lower. However, it requires less CPU time per record and its throughput remains almost flat as memory size increases. For large memory sizes and variable-size records, it outperforms classical replacement selection. In summary, when writing runs to a single disk, batched replacement selection was not only the fastest but also produced the fewest runs.
The situation is different when the output bandwidth increases. When writing to two disks, the maximum output rate is around 8 MB/s. When memory is small and records are of fixed size, replacement selection is fast enough to keep both disks busy. However, when the memory size increases, throughput tapers off for both versions of the algorithm. AlphaSort has a flatter throughput curve and, for larger memory sizes, it has a higher throughput than batched replacement selection.
8.3 Overall Disk-to-Disk Sort Performance
Replacement selection produces fewer runs than load-sort-store algorithms. Does this really matter, i.e., does it reduce the overall sort time? The final set of experiments attempts to answer this question by measuring the overall performance of disk-to-disk sorting.
The input to the sort consisted of a 250 MB file containing 2.6 million variable-length records with an average length of 100 bytes. The workspace was 5 MB or 10 MB, divided into 8 KB extents. Records were randomly generated in the same way as described earlier. After run generation, the sort was completed using a single merge pass. All disk reads and writes were asynchronous. During run formation, the input stream used two 128 KB buffers and the output stream (run file) used four 128 KB buffers. During the merge phase, the output stream used four 128 KB buffers and each run used two input buffers. Input buffer size decreased with the number of runs so as not to use (much) more workspace than assigned. For example, using a 5 MB workspace, Quicksort produced 51 runs and the merge phase used 51 × 2 = 102 input buffers for the runs and four output buffers, for a total of 51 × 2 × 46 KB + 4 × 128 KB = 5,204 KB ≈ 5.08 MB.
Two sets of experiments are reported. In the first experiments, two disks were used: one disk stored the input file and received the final sorted output, and the other disk was used for storing runs. This allows for maximum overlap because both run formation and merging read from
TABLE 4 Elapsed Time for Sorting a 250 MB File Containing 2.6 Million Variable-Length Records Using Two Disks
one disk and write to another disk. The second set of experiments stored everything on a single disk.
Just reading the input file sequentially using two 128 KB buffers took about 25 seconds and writing it sequentially using four 128 KB buffers took about 40 seconds. Both the run formation phase and the merge phase write their output sequentially, so each phase takes at least 40 seconds.
The results of the experiments with two disks and two different workspace sizes are shown in Table 4. Batched replacement selection gave the fastest overall sort time, including the fastest times for run formation and for merging. Compared with Quicksort, the improvement in sort time was 28 percent and 26 percent for 5 MB and 10 MB, respectively. As expected, AlphaSort completed run formation faster than Quicksort because it partially overlaps processing and I/O. Classic replacement selection and AlphaSort had about the same overall time; AlphaSort is faster during run formation but slower during merging because of its higher merge fan-in. The run formation phase takes longer as memory size increases because more of the sort work is done during run formation. However, this is more than compensated for by the shorter merge phase (fewer runs and larger input buffers), substantially reducing the overall sort time.
Table 5 shows the results from experiments using a single disk. Merging is inevitably slower when using a single disk because disk accesses for reading the run files and writing the final result are interspersed and compete for the same disk. AlphaSort had the best overall sort time. Quicksort and AlphaSort separate reads and writes during run formation and, hence, are about equally fast whether one or two disks are used. (The minor reduction compared with the figures in Table 4 is, most likely, caused by a different placement of the files on the disks.) Both versions of replacement selection overlap reads and writes, which causes more disk seeks and slower run formation.
The overall conclusion is clear. If run files can be placed on disks separate from those used for the input and the
TABLE 5 Elapsed Time for Sorting a 250 MB File Containing 2.6 Million Variable-Length Records Using One Disk
sorted output, batched replacement selection is the fastest method for run formation and gives the lowest overall sort time. However, if run files cannot be placed on separate disks, then AlphaSort is the preferred method for run formation.
9 SUMMARY AND CONCLUSION
This paper introduces a new version of replacement selection called batched replacement selection. The new algorithm can handle both fixed-length and variable-length records. It incorporates two ideas for reducing cache misses: creating miniruns in memory and merging them to form the output runs, and processing both input and output in batches. Experiments showed that the modifications pay off on all counts: The new algorithm generates fewer cache misses, executes fewer instructions, needs fewer key comparisons, and runs faster than the classical version.
Run formation performance was also compared with Quicksort and (a simplified version of) AlphaSort. In all experiments, Quicksort was the slowest. AlphaSort required the least CPU time for all memory sizes considered. When disk bandwidth was high, AlphaSort was the fastest overall, but when disk bandwidth was lower, batched replacement selection was faster. Quicksort and AlphaSort both produce more and shorter runs than replacement selection. Experiments with complete disk-to-disk sorting using two disks showed that batched replacement selection resulted in the lowest overall time. The speedup was over 25 percent compared with Quicksort and about 15 percent compared with AlphaSort. In all experiments, batched replacement selection had both the fastest run formation phase and the fastest merge phase. However, in experiments using only a single disk, AlphaSort was the overall winner because it does not mix reads and writes to the same disk during run formation.
Several potential improvements to batched replacement selection remain to be investigated. Batched replacement selection can store key prefixes in the selection tree in the same way as AlphaSort, but the current implementation does not exploit this. To avoid accesses to the key, the run number must also be included in the tree. Under what circumstances this change pays off is unknown. Some initial experiments showed that one has to be careful: The cost of computing the prefixes is amortized over no more than a few key comparisons and the additional space may increase cache misses.
The core data structure of replacement selection is a priority queue, implemented as a heap. Batched replacement selection retains the heap structure but improves its cache performance by presorting small batches of records before adding them to the heap. This approach was introduced by AlphaSort and was chosen here because of its simplicity. However, several alternative designs for cache-efficient priority queues have been proposed in the literature. Sanders [14] introduces a cache-efficient heap structure called a sequence heap and reviews several earlier approaches. The sequence heap also relies on creating and merging in-memory runs, but the scheme is somewhat more complex. A sequence heap could be used instead of the current heap implementation. It is not currently known whether this would improve run formation performance. Wegner and Teuhola's [16] hillsort is a version of heapsort designed for external sorting. To reduce I/O operations, they update the heap by page merge operations instead of propagation of single records. Their scheme could be adapted for in-memory sorting, resulting in a cache-conscious version of heapsort. However, the scheme is rather complex, making it questionable whether the additional overhead would be compensated for by reduced cache misses.
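The key-prefix idea discussed above can be illustrated as follows (my illustration, not the paper's or AlphaSort's implementation): tree entries carry the run number and a short key prefix, so most comparisons are resolved without touching the full key.

```python
import heapq

def make_entry(run_no, key, record=None, prefix_len=4):
    """Tree entry: run number first, then a short prefix; the full key
    is kept only as a tiebreaker for equal prefixes."""
    return (run_no, key[:prefix_len], key, record)

heap = []
for run_no, key in [(1, "banana"), (1, "bandana"), (2, "apple")]:
    heapq.heappush(heap, make_entry(run_no, key))
order = [heapq.heappop(heap)[2] for _ in range(3)]
# run 1 comes out before run 2; within run 1, the prefixes
# "bana" < "band" decide the order without comparing full keys
```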
APPENDIX
REPLACEMENT SELECTION ALGORITHM FOR VARIABLE-LENGTH RECORDS
Selection tree (global):
  Array of ptrs to SortRecord TreeNode ;
  Int LastNode ;
A SortRecord consists of three fields:
  Int SrcNode ;
  Int RunNo ;
  InputRecord Record ;
// Main algorithm driving processing
Algorithm ReplacementSelection() {
  int CurRun ;
  SortKey LastKey ;
  Ptr to InputRecord pInRecord ;
  Ptr to SortRecord pStoredRecord ;
  Ptr to SortRecord pOutRecord ;
  LastNode = 0 ;
  CurRun = 1 ;
  LastKey = Lowest possible key value ;
  Clear SortRecord store ;
  Forever do {
    pInRecord = Get next input record ;
    If( pInRecord == NULL ) then exit loop ;
    // Save record in record store and add to tournament tree
    Forever do {
      pStoredRecord = StoreRecord( pInRecord ) ;
      if( pStoredRecord != NULL ) then exit loop ;
      // Not enough free space in record store;
      // output a record and try again
      pOutRecord = TreeNode[1] ;
      if( pOutRecord->RunNo != CurRun ) then {
        // This is the first record of a new run
        Terminate the current run and begin a new run ;
      }
      CurRun = pOutRecord->RunNo ;
      RemoveFromTree( pOutRecord->SrcNode ) ;
      Output pOutRecord->Record to run file ;
      LastKey = pOutRecord->Record.SortKey ;
      FreeRecordSpace( pOutRecord ) ;
    }
    if( CompareKeys( pInRecord->SortKey, LastKey ) >= 0 ) then
      pStoredRecord->RunNo = CurRun ;
    else
      pStoredRecord->RunNo = CurRun+1 ;
    AddRecordToTree( pStoredRecord ) ;
  }
  // End of input: output all records in memory
  while( LastNode > 0 ) {
    pOutRecord = TreeNode[1] ;
    if( pOutRecord->RunNo != CurRun ) then {
      // This is the first record of a new run
      Terminate the current run and begin a new run ;
    }
    CurRun = pOutRecord->RunNo ;
    RemoveFromTree( pOutRecord->SrcNode ) ;
    Output pOutRecord->Record to run file ;
  }
  Terminate final run ;
}
// Add a sort record to the selection tree
Algorithm AddRecordToTree( Ptr to SortRecord pRecord ) {
  if( LastNode > 0 ) then {
    Left = LastNode+1 ;
    TreeNode[Left] = pRecord ;
    TreeNode[Left]->SrcNode = Left ;
    TreeNode[Left+1] = TreeNode[Left/2] ;
    TreeNode[Left+1]->SrcNode = Left+1 ;
    LastNode = LastNode+2 ;
    FixUpPath( LastNode ) ;
  }
  Else {
    TreeNode[1] = pRecord ;
    TreeNode[1]->SrcNode = 1 ;
    LastNode = 1 ;
  }
}
// Fix up the path from StartPos to the root of the
// selection tree so that the heap property is restored
Algorithm FixUpPath( int StartPos ) {
  for( Indx = StartPos; Indx > 1; Indx = Indx/2 ) {
    If( Indx is even ) then Brother = Indx+1 else Brother = Indx-1 ;
    // Compare run numbers and sort keys
    If( CompareSortRecordKeys( TreeNode[Indx], TreeNode[Brother] ) > 0 ) then
      NewParent = TreeNode[Brother] ;
    else
      NewParent = TreeNode[Indx] ;
    // Stop when parent is unchanged
    If( TreeNode[Indx/2] == NewParent ) then return ;
    TreeNode[Indx/2] = NewParent ;
  }
}
// Remove the leaf node at the given position from the
// selection tree.
Algorithm RemoveFromTree( int AtPos ) {
  Ptr to SortRecord pTmp ;
  pTmp = TreeNode[AtPos] ;
  if( AtPos == LastNode ) then {
    // Case 1: remove last node
    RemoveLastTreeNode() ;
  }
  Else if( AtPos == LastNode-1 ) then {
    // Case 2: remove brother of last node
    pTmp = TreeNode[AtPos] ;
    TreeNode[AtPos] = TreeNode[LastNode] ;
    TreeNode[AtPos]->SrcNode = AtPos ;
    TreeNode[LastNode] = pTmp ;
    TreeNode[LastNode]->SrcNode = LastNode ;
    RemoveLastTreeNode() ;
  }
  Else {
    // Case 3: remove other leaf node
    pTmp = TreeNode[LastNode] ;
    RemoveLastTreeNode() ;
    TreeNode[AtPos] = pTmp ;
    TreeNode[AtPos]->SrcNode = AtPos ;
    FixUpPath( AtPos ) ;
  }
}
// Remove the last (highest index) leaf node of
// the selection tree.
Algorithm RemoveLastTreeNode() {
  if( LastNode > 0 ) then {
    Parent = LastNode/2 ;
    If( LastNode/2 > 0 ) then {
      TreeNode[Parent] = TreeNode[LastNode-1] ;
      TreeNode[Parent]->SrcNode = Parent ;
    }
    LastNode = LastNode-2 ;
    FixUpPath( Parent ) ;
  }
}
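For reference, the appendix algorithm can be condensed into a short runnable sketch (my rendering, not the paper's implementation: Python's binary heap replaces the selection tree, and the variable-length record store is approximated by a fixed budget of record slots):

```python
import heapq

def replacement_selection(records, capacity):
    """Yield runs (lists of keys) from an iterable of sortable keys."""
    it = iter(records)
    heap = []
    for _ in range(capacity):          # fill memory with first-run records
        try:
            heap.append((1, next(it)))
        except StopIteration:
            break
    heapq.heapify(heap)
    cur_run, out = 1, []
    for key in it:
        run, smallest = heapq.heappop(heap)
        if run != cur_run:             # first record of a new run
            yield out
            cur_run, out = run, []
        out.append(smallest)
        # the incoming record joins the current run only if its key is
        # not smaller than the key just output (LastKey in the appendix)
        heapq.heappush(heap, (run if key >= smallest else run + 1, key))
    while heap:                        # end of input: drain memory
        run, smallest = heapq.heappop(heap)
        if run != cur_run:
            yield out
            cur_run, out = run, []
        out.append(smallest)
    if out:
        yield out

data = [4, 1, 7, 3, 0, 9, 2, 8, 6, 5]
runs = list(replacement_selection(data, capacity=3))
# the first run is longer than the memory capacity of three records
```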
Per-Åke Larson received the PhD degree from Åbo Akademi University, Finland, in 1976. He joined the Department of Computer Science at the University of Waterloo, Canada, in 1981, where he was promoted to professor in 1987. He is currently a senior researcher in the Database Group at Microsoft Research, which he joined in 1996. His research interests include query optimization, query processing algorithms, and data structures, and his publications cover many topics in these areas. He is a member of the IEEE Computer Society.
REFERENCES
[1] V. Estivill-Castro and D. Wood, "A Survey of Adaptive Sorting Algorithms," Computing Surveys, vol. 24, no. 4, pp. 441-476, 1992.
[2] V. Estivill-Castro and D. Wood, "Foundations of Faster External Sorting," Proc. 14th Conf. Foundations of Software Technology and Theoretical Computer Science, pp. 414-425, 1994.
[3] R.J. Dinsmore, "Longer Strings for Sorting," Comm. ACM, vol. 8, no. 1, p. 48, 1965.
[4] W. Dobosiewicz, "Replacement Selection in 3-Level Memories," The Computer J., vol. 27, no. 4, pp. 331-339, 1981.
[5] W.D. Frazer and C.K. Wong, "Sorting by Natural Selection," Comm. ACM, vol. 15, no. 10, pp. 910-913, 1972.
[6] D.E. Knuth, The Art of Computer Programming, Volume 1, 3rd ed. Addison-Wesley, 1997.
[7] D.E. Knuth, The Art of Computer Programming, Volume 3, 2nd ed. Addison-Wesley, 1998.
[8] A. LaMarca and R.E. Ladner, "The Influence of Caches on the Performance of Heaps," ACM J. Experimental Algorithmics, vol. 1, no. 4, 1996.
[9] A. LaMarca and R.E. Ladner, "The Influence of Caches on the Performance of Sorting," Proc. Eighth Ann. ACM-SIAM Symp. Discrete Algorithms, 1997.
[10] P.-Å. Larson and G. Graefe, "Memory Management during Run Generation in External Sorting," Proc. SIGMOD, pp. 472-483, 1998.
[11] C. Nyberg, T. Barclay, Z. Cvetanovic, J. Gray, and D.B. Lomet, "AlphaSort: A RISC Machine Sort," Proc. SIGMOD, pp. 233-242, 1994.
[12] V.S. Pai and P.J. Varman, "Prefetching with Multiple Disks for External Mergesort: Simulation and Analysis," Proc. Int'l Conf. Data Eng., pp. 273-282, 1992.
[13] B. Salzberg, "Merging Sorted Runs Using Large Main Memory," Acta Informatica, vol. 27, no. 3, pp. 195-215, 1989.
[14] P. Sanders, "Fast Priority Queues for Cached Memory," ACM J. Experimental Algorithmics, vol. 5, Aug. 2000.
[15] T.C. Ting and Y.W. Wang, "Multiway Replacement Selection Sort with a Dynamic Reservoir," The Computer J., vol. 20, no. 4, pp. 298-301, 1977.
[16] L. Wegner and J.I. Teuhola, "The External Heapsort," IEEE Trans. Software Eng., vol. 15, no. 7, pp. 917-925, July 1989.
[17] W.E. Wright, "A Refinement of Replacement Selection," Information Processing Letters, vol. 70, no. 3, pp. 107-111, 1999.
[18] W. Zhang and P.-Å. Larson, "Dynamic Memory Adjustment for External Mergesort," Proc. Very Large Data Bases Conf., pp. 376-385, 1997.
[19] W. Zhang and P.-Å. Larson, "Buffering and Read-Ahead Strategies for External Mergesort," Proc. Very Large Data Bases Conf., pp. 523-533, 1998.
[20] L. Zheng and P.-Å. Larson, "Speeding Up External Mergesort," IEEE Trans. Knowledge and Data Eng., vol. 8, no. 2, pp. 322-332, 1996.