The VLDB Journal manuscript No. (will be inserted by the editor)
External Sorting on Flash Storage: Reducing Cell Wearing and Increasing Efficiency by Avoiding Intermediate Writes

Yaron Kanza · Hadas Yaari
Received: date / Accepted: date
Abstract This paper studies the problem of how to conduct external sorting on flash drives while avoiding intermediate writes to the disk. The focus is on sorting in portable electronic devices, where relations are larger than the main memory only by a small factor, and on sorting as part of distributed processes, where relations are frequently partially sorted. In such cases, sort algorithms that refrain from writing intermediate results to the disk have three advantages over algorithms that perform intermediate writes. First, on devices in which read operations are much faster than writes, such methods are efficient and frequently outperform Merge Sort. Second, they reduce the flash-cell degradation caused by writes. Third, they can be used when there is not enough disk space for the intermediate results. Novel sort algorithms that avoid intermediate writes to the disk are presented. An experimental evaluation, on different flash storage devices, shows that in many cases the new algorithms can extend the lifespan of the devices by avoiding unnecessary writes to the disk, while maintaining efficiency, in comparison to Merge Sort.
Keywords Sort algorithms, external sorting, flash memory, solid-state drive, SSD, cell wearing, write endurance, Merge Sort
Y. Kanza
Jacobs Institute, Cornell Tech, New York, NY, USA
E-mail: [email protected]

H. Yaari
Computer Science Department, Haifa University, Haifa, Israel
E-mail: [email protected]
1 Introduction

Flash-based storage devices such as USB flash drives, flash memory cards and Solid-State Drives (SSDs) have become prevalent due to several advantages they have over ordinary magnetic hard disk drives (HDDs)—their low access latency, their low energy consumption, their resistance to shocks and their light weight. Many mobile devices, including laptop computers, tablets, smartphones and wearable computers, use a flash drive as their secondary storage. Many desktop computers, workstations and servers are also equipped with flash storage devices, and the usage of flash drives is expected to grow in the future [24].

In contrast to ordinary hard disk drives, reading data from flash storage is efficient not only in a sequential scan but also under random access, because flash disks have no moving parts. However, flash storage devices have some disadvantages that should be taken into account. First, in many flash disks, write operations are relatively slow, on average. Second, writing to the disk causes cells to wear out and can lead to disk degradation. Third, flash memory is relatively expensive, so flash disks frequently have a small capacity in comparison to hard disk drives.

The reason for the low write rates is that a page on the disk can be written only if it is clean, i.e., it has not been written since it was last erased, and erasure operations can only be applied to entire blocks. (Typically, a block comprises 32, 64 or 128 pages.) Accordingly, an update of a page cannot be done directly by writing on the page. Instead, the system reads the page, marks it as deleted and writes the updated data on a different page. (A Flash Translation Layer is used to map logical pages to their physical location on the device.) However, such operations
leave unclean pages on the disk, so a garbage-collection process is employed to free the space of unclean pages. When erasing a block, it is required to move pages that are still in use to other parts of the disk. Consequently, in I/O-intensive systems, writes can be delayed by the erasure procedure.

Disk degradation occurs because a flash cell can endure only a limited number of program/erase cycles (i.e., a limited number of writes) before it loses its ability to store data reliably. When cells wear out, the effective size of the storage gradually decreases. This is mitigated by wear leveling, where the program/erase cycles are distributed evenly over the device using the Flash Translation Layer. Other techniques are also used, such as data compression and over-provisioning, where the physical capacity of the device is larger than the logical capacity that the operating system sees. Nonetheless, disk degradation is a drawback that should be taken into account when the disk is used intensively, not just for reading. Hence, it is important to avoid unnecessary writes to the disk. Applications that use flash storage should be designed accordingly, to increase the lifespan of the device [47].

Many mobile instruments that include a flash drive are used for storing and processing data. Frequently, these instruments should manage a relational database or a key-value store [16, 32]. Sorting is a fundamental operation in such systems. The most commonly used sorting algorithm in relational systems is Merge Sort. Initially, it sorts chunks of the relation, where typically each chunk is the size of the main memory. (The replacement-selection technique can be used to produce sorted runs that are, on average, twice the size of the memory [29].) Then, the sorted chunks are merged. Flash drives support well the random access required in the merge phase; however, the initial step of writing sorted chunks to the disk is costly on such devices and should be avoided when possible. Note that using replacement selection does not alleviate the problem, because it does not prevent writing intermediate sorted runs to the disk.

It may seem reasonable to use ordinary in-memory sort algorithms to sort relations on flash disks. However, such algorithms do not avoid intermediate writes, and thus, do not cope well with cell wearing, slow writes or the lack of disk space for intermediate results. For example, Bucket Sort is required to write buckets as intermediate results. Merge-based algorithms such as Timsort and Patience Sort [41] are required to write intermediate sorted runs. Algorithms that work in place when executed in memory, such as Quicksort, are no longer in-place over flash storage, because updates (e.g.,
swaps of values) require moving pages on the disk. Thus, there is a need for new sort methods.

Flash drives are often used in portable devices and in large data centers; in both cases, reducing energy consumption is a major motivation. In this paper we focus on cases that are relevant to these types of usage, where we can avoid the costly step of writing sorted chunks of relations as part of the sorting operation.

Portable Devices: In portable devices, relations that are not smaller than the memory are typically midsize, i.e., larger than the main memory only by a small factor. For example, consider a standard smartphone with 1 GB of main memory and 64 GB of secondary storage. Sorting a relation of size 10 GB while writing intermediate results would require at least 10 GB for the relation and 10 GB for the intermediate results, without write amplification, but may require about half the size of the drive with write amplification. (Write amplification occurs when writing n bytes to the disk leads to writing a multiple of n bytes, due to the need to move pages when erasing blocks to free space [27].) In this case, if the sorted result is written to the drive, most of the secondary storage will be required for the operation. Hence, in practice, only small and midsize relations can be sorted on such devices. Note that refraining from writing temporary results to the disk allows sorting larger relations. Midsize relations are frequently generated and processed by wearable computers [42], by smartphones [20, 22], and in other sub-domains of ubiquitous computing [49]. For example, in the GeoLife project, the GPS sensors recorded about 60 bytes every 1-5 seconds [54]. In a year, such a sensor produces a data set of approximately 600 megabytes (on the order of 10 million records). This is a midsize relation for a portable electronic device with, say, 1 gigabyte of main memory, when only part of the memory is available for the operation.

Data Centers: Distributed query processing is another domain where we can find midsize relations, when relations are partitioned and processed on many servers, e.g., in data centers. For example, in the MapReduce framework [13], the map processes partition a given data set of key-value pairs, based on the key, and send each pair to the assigned reduce process. A reduce process sorts the key-value pairs assigned to it and applies the reduce function to the sorted list. Thus, although the given data set is big, if the system is well-configured, the reduce processes will sort midsize relations. Frequently, sorted results that are produced by the reduce processes are combined and should again be sorted [38]. This requires sorting a sequence of sorted
chunks with different sizes. We show how to handle this case effectively.

Additional Cases: Sorting a relation that is constructed from large sorted chunks is also required in other settings. Such a relation can be the result of an integration (e.g., union) of sorted relations. It can be the result of extending a sorted relation by adding to it several batches of sorted records—the practice of batch insertion is often applied in client-server applications, e.g., when a Java application uses JDBC (the Java Database Connectivity API) to insert records into a relational database. In the domain of ubiquitous computing, streaming data and sensor data are collected in a buffer, in main memory, and from time to time they are written to the disk as a sorted chunk. This yields a data set that is stored as a sequence of sorted chunks.

Another important case we examine in this paper is that of nearly-sorted relations. We consider a relation to be nearly sorted if the removal of a small number of records from it results in a sorted relation. Nearly-sorted relations are frequently created as a result of updates to sorted relations. One of the difficult tasks in this case is to detect that the relation is nearly sorted, to refrain from erroneously applying the algorithm to a relation that is completely unsorted.

In this paper, we present novel sort algorithms for midsize relations, and for relations that are partially or nearly sorted. Our algorithms refrain from writing sorted runs to secondary storage by reading the relation several times and utilizing the main memory effectively. They reduce the disk degradation caused by cell wearing, and on devices in which reading rates are higher than writing rates, they are more efficient than algorithms that write intermediate results to the disk. For example, consider a hybrid disk in which the relation is stored on an SSD and intermediate results are written to an HDD. Suppose that reading from the SSD is 8 times faster than writing to the HDD. (This was the ratio in the devices we used in our experiments.) Then, reading the relation, say, 8 times would be faster than reading the relation, writing it as sorted chunks and then reading the sorted chunks.

The main contributions of this paper are as follows.
– We show that in a sort of midsize or nearly-sorted relations, writing intermediate results to the disk can be avoided, to increase the life expectancy of the storage device.
– We show that in these cases, avoiding the writing of intermediate results can be done without sacrificing performance, in comparison to Merge Sort, even though Merge Sort is considered to be well adapted to flash drives, in terms of efficiency [37, 43].
– We show that on devices in which reading is faster than writing, our algorithms are more efficient than Merge Sort for midsize relations.
– We illustrate the performance of our algorithms over different types of flash storage. In particular, for midsize relations on a hybrid system that combines an ordinary hard drive and a flash drive, our algorithms outperform Merge Sort.

2 Related Work

The problem of managing databases on flash storage has recently been introduced and studied [7, 25, 32, 37]. It was pointed out that query-evaluation methods should be redesigned to accommodate flash storage devices. Traditionally, query-processing methods were optimized for magnetic disks, so they do not exploit or take into account the special characteristics of flash storage. However, adapting query-processing techniques and data-management methods to flash storage can improve the efficiency of query processing on flash storage [10, 43]. Thus, several papers studied the use of flash disks for improving database management systems: using flash disks as a cache [14], as an extension of the buffer pool [15, 18], as a log [36], or by implementing a column-based relational storage [48]. Ajwani et al. [1] described the running times of different I/O operations on flash chips. Do and Patel [17] studied the evaluation of traditional join algorithms on SSDs. Zhang et al. [50] investigated the implementation of OLTP systems on SSDs. However, none of these studies deals with reducing the number of intermediate writes when sorting relations on flash drives.

In some systems, SSDs and HDDs are combined. Such systems are called hybrid systems [31, 39]. The goal is to exploit the unique advantages of these two types of devices, and to mitigate their disadvantages, to create a better system—for example, using the SSD for fast random reads (the HDD's Achilles heel), and using the HDD for intermediate writes (instead of writing to the SSD and shortening its life). The ways to combine the two devices vary. One way is to use the SSD as persistent storage, combined with HDDs [6]. Another way is to use the SSD as a cache for the HDDs [28, 31]. However, none of these papers showed how to improve the sort operation for flash disks or for hybrid systems.

Sorting in database systems has been studied extensively for many years, and it has received a lot of attention since the early days of System R [26]. Many methods were developed to improve the implementation of the sort operation over HDDs. Some methods focused on the use of replacement selection for creating longer chunks (runs) in Merge Sort [30, 46].
Improvements of replacement selection, making better utilization of the cache and more sophisticated access to the runs produced in the first phase of Merge Sort, were suggested in [34, 53]. Chandramouli and Goldstein [9] showed how to use Patience Sort to create sorted runs more efficiently than Merge Sort, and how to conduct the merge phase efficiently using ping-pong merge. Several papers focused on improving memory utilization, caching and buffer management, in Merge Sort and in other external-sort algorithms [33, 35, 44, 45, 51, 52]. Albutiu et al. [2] and Balkesen et al. [4] showed how to utilize a large main memory and massively parallel multi-core processing in sort-merge joins. Chhugani et al. [11] showed how to use modern multiprocessors to improve Merge Sort. A survey by Graefe [23] discusses many of these methods and their implementation for sorting in database management systems. Note, however, that none of these sort algorithms was designed for flash storage. All these methods write intermediate results to the disk. None of them tries to reduce the wearing of disk cells or to deal with cases in which writes are slower than reads.

Liu et al. [40] studied the problem of sorting relations on flash drives and utilizing partial orders to improve the sort. They suggested an algorithm that finds sequences of pages with non-overlapping ranges of keys. Such sequences are then considered as sorted runs in Merge Sort. Their algorithm is somewhat similar to the Naïve SCS-Merge algorithm that we present in Section 4. However, their algorithm does not avoid writing intermediate results to the disk, it does not handle outliers, and it can only utilize order at the granularity of entire pages. In comparison, we provide a more extensive solution that combines different methods to handle different cases (regarding the size of the relation and the extent to which it is sorted). We consider different types of storage devices, and we exploit partial order at the granularity of tuples rather than pages.
3 Framework

In this paper, we assume that the data are stored in a relational database or in a key-value store. A relation R is a sequence of tuples stored on the disk. The initial order of the tuples in R is their order on the disk. A sort key is an attribute, or a set of attributes, according to which we would like to sort the relation. For a sort key K and a tuple t in R, we denote by t[K] the projection of t on the attributes of K. A relation R is sorted on K if the following holds: for every two tuples ti and tj in R, if ti appears in R before tj, then ti[K] ≤ tj[K].

A sort algorithm is given a relation R and a key K, and it changes the order of the tuples so that the resulting relation is sorted on K. A key-value list can be viewed as a relation with only two attributes, K and V, where K is the sort key and V is the value attribute. Thus, the sorting algorithms we present can be applied to key-value lists. In the following section we present sort algorithms for relations on flash drives.

Problem Definition. Given a relation R stored on a flash drive, the task is to sort R and write the sorted result to the device, or to memory buffers, for pipelining. The memory size of the system is limited. The main targets are as follows. (1) The sorting algorithm should refrain from writing intermediate results to the disk, to avoid wearing out flash cells. (2) The algorithm should be as efficient as possible. (3) The algorithm should minimize the time required for the I/O, in order to cope with future speedups of in-memory computations.

4 Algorithms

In this section we present sort algorithms and discuss their adaptation to flash storage. We consider Merge Sort, which is commonly used in commercial database management systems, and we introduce novel algorithms that can avoid writing intermediate results, in specific cases that will be discussed. Throughout this section, we assume that the relation to be sorted is R and the sort key is K.

Database management systems maintain a buffer pool and perform all the read and write operations via this buffer pool. The buffer pool comprises M buffers—also called frames—where each frame has the size of a block. The system can read a block of data from the disk into a frame of the buffer pool, and it can store the content of a frame by writing it to a block on the storage device. Thus, at any given time, there are at most M blocks of data in main memory. We will use these notions throughout this section.
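As a concrete reference for the sketches we give later, the following is a minimal model of the buffer-pool notions above; it is our illustration, not code from the paper, and the block size is an assumption.

#include <cstddef>
#include <vector>

constexpr std::size_t kBlockSize = 4096;     // bytes per block (assumed)

struct Frame {                               // a frame holds one block
    char data[kBlockSize];
};

// A buffer pool of M frames: at any given time, at most M blocks of data
// are in main memory, and all reads and writes go through the pool.
class BufferPool {
public:
    explicit BufferPool(std::size_t m) : frames_(m) {}
    std::size_t size() const { return frames_.size(); }
    Frame& frame(std::size_t i) { return frames_.at(i); }
    // readBlock/writeBlock, wrapping the device I/O, are omitted here.
private:
    std::vector<Frame> frames_;
};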
4.1 Merge Sort

Typically, relational database management systems sort relations using the Merge Sort algorithm. Merge Sort has two phases. First, it reads the relation in chunks, sorts each chunk and writes the sorted chunks to the disk. The main-memory sorting is usually done using Quicksort; however, any in-memory sorting algorithm can be used. In the second phase, the sorted chunks are merged and the result of the merge is written to the disk. For details, see [21, 30].
In the first phase, Merge Sort reads the entire relation and writes the entire relation to the disk, in sorted runs. This phase of writing the entire relation to the disk does not deal well with cell wearing and slow writes; hence, we introduce alternative algorithms.

We implemented the first phase as a parallel process. Designated threads read, sort and write the chunks simultaneously. The threads are connected in a pipeline fashion: one thread reads data into the buffer, another thread sorts the buffer as soon as it is full, and a third thread writes the buffer to the storage device when the sort of the buffer is complete. To exploit the parallelism, we partitioned the buffers into three parts and used them in parallel: in each step, data is read from the disk into one part of the buffer pool, the tuples are sorted in the second part, and the sorted chunk in the third part is written to the disk. At the end of each step, the roles of the different parts change. Such an implementation causes the algorithm to read and write chunks that are one third the size of the buffer pool, rather than chunks the size of the buffer pool. This increases the number of random accesses to the disk, but it allows computations to be performed in parallel with disk access. Since the parallel computation is more efficient than the non-parallel computation, we used it in our experiments.
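The rotation of roles can be sketched as follows; this is a minimal sketch of ours, not the authors' implementation, and readChunk/writeRun are hypothetical stand-ins for the block I/O.

#include <algorithm>
#include <thread>
#include <vector>

using Buffer = std::vector<int>;      // keys only, for brevity

bool readChunk(Buffer& b);            // hypothetical: fill b from the disk;
                                      // returns false (b left empty) at EOF
void writeRun(const Buffer& b);       // hypothetical: append b as a sorted run

// First phase with the buffer pool split into three parts: in step i, one
// part is filled from the disk, the part filled in step i-1 is sorted, and
// the part sorted in step i-1 is written out, all in parallel.
void runFormation() {
    Buffer part[3];
    bool more = true;
    for (long step = 0; ; ++step) {
        Buffer& toRead  = part[step % 3];
        Buffer& toSort  = part[(step + 2) % 3];   // filled in step i-1
        Buffer& toWrite = part[(step + 1) % 3];   // sorted in step i-1
        std::thread reader([&] { if (more) more = readChunk(toRead); });
        std::thread sorter([&] { std::sort(toSort.begin(), toSort.end()); });
        std::thread writer([&] {
            if (!toWrite.empty()) { writeRun(toWrite); toWrite.clear(); }
        });
        reader.join(); sorter.join(); writer.join();
        if (!more && part[0].empty() && part[1].empty() && part[2].empty())
            break;                                // pipeline fully drained
    }
}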
4.2 Multi-Insertion Sort

To deal with the high cost of writing to the disk, we define cases that can be easily identified and where a tailored algorithm can avoid writing intermediate results to the secondary storage. One such case is sorting midsize relations. Let |M| be the size of the main memory, and suppose ρ = |R|/|M|. We say that R is ρ-larger than M, and we refer to R as midsize if ρ is a small number. (The exact number is configurable and depends on the system specifications.)

Multi-Insertion Sort identifies midsize relations and sorts them without writing intermediate results. Essentially, it allocates space in memory for k tuples and produces the result iteratively—in each iteration it finds the k tuples with the smallest keys among those that were not chosen in any previous iteration (namely, the k smallest remaining tuples), sorts them in memory and adds them to the result. This continues until all the tuples are added to the result. In Merge Sort, we read the relation, write it as intermediate sorted chunks and read it again to produce the sorted result. Hence, sequentially reading the relation 3 times to find the k smallest tuples in
Multi-Insertion Sort is expected to be more efficient than Merge Sort. If reading is s times faster than writing, then reading the relation 2 + ⌊s⌋ times is expected to be more efficient than Merge Sort. So, Multi-Insertion Sort is expected to be more efficient than Merge Sort when ρ ≤ 2 + ⌊s⌋.

Now, suppose that there is space in memory for storing 2k tuples. We could use one buffer for reading and the other buffers for storing the smallest tuples seen so far, but this would require many read accesses and would increase the cost of read latency. A more efficient approach is to partition the buffers equally between reading buffers and buffers for storing the k smallest tuples. That is, buffers for k tuples are allocated for reading and buffers for k tuples are allocated for the smallest tuples (the output). Doing so improves the efficiency of the reading. However, the I/O in an SSD is much faster than the I/O in an HDD. Thus, an efficient implementation of the approach also requires efficient main-memory computations, and depends on efficiently handling the main-memory task of finding the k smallest tuples while scanning R. Next, we elaborate on that.

To allocate equal parts for the input and the output, the memory is partitioned into an array AO, to store the k smallest tuples seen so far, and an array AI, into which k tuples of R are read in each step. In each iteration, let Kl be the largest key value among the tuples chosen in the previous iteration. Then, we go over R by reading it into AI in chunks of k tuples. In AO we store the smallest k tuples among the tuples whose key is larger than Kl. A straightforward approach is to sort AI and AO jointly, in place, after each reading of a chunk of tuples into AI (after discarding tuples whose key is not larger than Kl), so that the smallest tuples will be in AO. This requires sorting 2k tuples in each iteration and does not exploit the fact that AO is initially sorted. Implementing AO as a heap is inefficient, because then the smallest tuples will not be sorted at the end of the iteration. Next, we explain how to solve this.

Memory Optimization: In each iteration, we read R in chunks and keep in AO the k smallest tuples greater than Kl. After reading a chunk into AI, AI is sorted. Then, the arrays are scanned, and if tuple i of AO is larger than tuple i of AI, the tuples are swapped. After this, the first half of AO contains the smallest k/2 tuples and the second half of AI contains the largest k/2 tuples. So, merging AO and the first half of AI until k tuples are produced provides the smallest k tuples. The merge is within the allocated memory—first into the second half of AI, and then into the first half of AO. The entire merge is in place, differently from an ordinary merge or a ping-pong merge (see [9]).
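The threshold ρ ≤ 2 + ⌊s⌋ stated above follows from a simple accounting of full passes over R (a back-of-envelope model of ours, counting one write pass as s read passes and treating the number of Multi-Insertion Sort read passes as ρ; the cost of writing the final output is the same for both algorithms and cancels):

cost(Merge Sort) ≈ (1 + s + 1) · tread · |R|   (read R, write the runs, read the runs)
cost(Multi-Insertion Sort) ≈ ρ · tread · |R|   (ρ sequential read passes)

so Multi-Insertion Sort wins when ρ ≤ 2 + s, with the floor taken since the number of passes is integral.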
Multi-Insertion Sort (R)
Input: A relation R
Output: The relation R, sorted.
1: let AI and AO be arrays of size k, in memory
2: {(w.l.o.g., k is even)}
3: let n be the number of tuples in R
4: Kl ← −∞
5: insert into AO dummy tuples with key ∞
6: ℓ ← 0
7: for i = 1 to n/k do
8:    for j = 0 to n/k − 1 do
9:       if j < n/k − 1 then
10:         read tuples j · k, . . . , j · k + (k − 1) into AI
11:      else
12:         read tuples j · k, . . . , n into AI
13:         insert into the unread cells of AI dummy tuples with key ∞
14:      end if
15:      remove from AI tuples whose key < Kl
16:      {(for simplicity, assume replacing each of them with a dummy tuple whose key is ∞)}
17:      sort AI
18:      for ℓ = 1 to k do
19:         if AI[ℓ] < AO[ℓ] then
20:            swap(AI[ℓ], AO[ℓ])
21:         end if
22:      end for
23:      merge AO[1], . . . , AO[k] and AI[1], . . . , AI[k/2] into AI[k/2+1], . . . , AI[k], AO[1], . . . , AO[k/2]
24:      let AI[k/2+1], . . . , AI[k], AO[1], . . . , AO[k/2] be AO
25:      let AI[1], . . . , AI[k/2], AO[k/2+1], . . . , AO[k] be AI
26:   end for
27:   Kl ← key value of AO[k]
28:   add AO to the result
29: end for
Fig. 1 Multi-Insertion Sort.
Multi-Insertion Sort is presented in Fig. 1. In Line 1, the algorithm creates two arrays—AI for reading the input from the storage device and AO for collecting the sorted chunks that are written to the result. In Line 7, the algorithm applies an iteration that extracts, n/k times, the smallest tuples greater than Kl from R. The value of Kl is updated after each writing of a sorted chunk (Line 27). To find the k smallest values, the algorithm reads R in n/k steps (the loop of Line 8)—the read values are inserted into AI. The k smallest tuples among the tuples in AI and AO are moved into AO, sorted. Thus, at the end of the inner iteration (of Line 8), AO is added to the result. Please note that in the algorithm, the insertion of dummy tuples with key ∞ is just to simplify the presentation; in practice, these places are skipped. Instead of sorting an array of size 2k, the algorithm sorts an array of size k, conducts a linear scan for swapping elements between the two arrays, and applies a linear-time merge of two arrays, going over k tuples.
Fig. 2 I. The goal is to efficiently find the k smallest tuples in AO and AI. Here, AO = 1, 3, 5, 7, 23, 28, 44, 99 and AI = 2, 3, 3, 6, 6, 8, 45, 97.
Fig. 3 II. If AO[ℓ] > AI[ℓ], the pair is swapped; each array is partitioned into two equal parts. After the swaps, AO = 1, 3, 3, 6, 6, 8, 44, 97 and AI = 2, 3, 5, 7, 23, 28, 45, 99.
Fig. 4 III. Pick the smallest elements among the largest elements of AO and the smallest elements of AI: from AO[k/2+1 . . . k] = 6, 8, 44, 97 and AI[1 . . . k/2] = 2, 3, 5, 7, the smallest k/2 elements are 2, 3, 5, 6.
Fig. 5 IV. Merging the result of Step III with the smallest elements of AO yields the desired result: AO = 1, 3, 3, 6, 2, 3, 5, 6, which sorted is 1, 2, 3, 3, 3, 5, 6, 6.
For the main-memory sort, we use Quicksort (a standard C++ implementation, with a random pivot). Proposition 1 asserts the correctness of the approach.

Proposition 1 Consider two sorted arrays AI and AO, where both contain k distinct integers, and where, without loss of generality, k is even. Suppose that we apply the swap phase of Multi-Insertion Sort—for every 1 ≤ ℓ ≤ k, if AO[ℓ] > AI[ℓ], we swap AI[ℓ] and AO[ℓ]. Let S be the set of the k smallest elements among the elements of AI and AO. Then, after the swaps,
– {AO[1], . . . , AO[k/2]} ⊂ S, i.e., AO[1], . . . , AO[k/2] are among the k smallest elements of the two arrays;
– {AI[k/2+1], . . . , AI[k]} ∩ S = ∅, i.e., AI[k/2+1], . . . , AI[k] are not among the k smallest elements of the arrays;
– AI and AO are sorted.

The proofs of this proposition and of the following propositions appear in the appendix. According to Proposition 1, after the swap phase, we can discard the k/2 tuples AI[k/2+1], . . . , AI[k] (none of them is among the k smallest) and immediately deem k/2 tuples members of S. Thus, we need to choose the k/2 smallest tuples from the k tuples in AO[k/2+1], . . . , AO[k] and AI[1], . . . , AI[k/2], instead of choosing the k smallest tuples from all the 2k tuples of AI and AO.

Figures 2-5 provide an illustrative example of how to efficiently find the k smallest tuples in Multi-Insertion Sort. Fig. 2 presents the two buffers: AI for the input from the disk and AO for the output. Note that at the
end of each iteration, AO is sorted. The algorithm iterates over the two arrays, and for each 1 ≤ ℓ ≤ k, if AO[ℓ] > AI[ℓ], it swaps AI[ℓ] and AO[ℓ]. In Fig. 3, the bold numbers are the keys of the swapped tuples: when ℓ = 3, the algorithm swaps 3 and 5; when ℓ = 4, the algorithm swaps 6 and 7; and so on. The blue lines present the partition of each array into two halves. Fig. 4 shows the second half of AO and the first half of AI. These two sets of tuples are merged, sorted, and the smallest k/2 tuples among them are selected to be added to AO. We only need to examine these tuples, because the first half of AO contains the smallest tuples among all the tuples in the two buffers, and the second half of AI contains the tuples with the highest keys among all the tuples in the two buffers—tuples that can be immediately deemed irrelevant for this iteration. After finding the k/2 tuples that should be added to AO, AO is sorted in preparation for the next iteration. (On the last iteration, AO contains the output that is added to the result.) This is illustrated in Fig. 5.

Using Pointers: When the key is small relative to the tuple size, we use the memory more effectively by processing pairs of a key and a pointer rather than processing entire tuples. After reading the tuples, we store them in memory, assign a pointer to each tuple and apply Multi-Insertion Sort to the key-pointer pairs. This is a standard technique in database management systems, see [23].
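The swap-and-merge step that figures 2-5 illustrate can be sketched as follows; this is our minimal sketch over integer keys, and, for clarity, it merges into an auxiliary buffer, whereas the algorithm performs the merge in place within AI and AO.

#include <algorithm>
#include <iterator>
#include <vector>

// One in-memory step of Multi-Insertion Sort: a_o holds the k smallest
// keys seen so far and a_i holds k freshly read keys, both sorted.
// Returns the k smallest keys of the union, in sorted order.
std::vector<int> kSmallest(std::vector<int> a_i, std::vector<int> a_o) {
    const std::size_t k = a_o.size();            // both arrays have size k
    for (std::size_t i = 0; i < k; ++i)          // the swap phase
        if (a_o[i] > a_i[i]) std::swap(a_o[i], a_i[i]);
    // By Proposition 1, both arrays are still sorted, a_o[0..k/2) is in S,
    // and a_i[k/2..k) is not, so the latter can be discarded outright.
    std::vector<int> merged;
    merged.reserve(k + k / 2);
    std::merge(a_o.begin(), a_o.end(),            // k candidates
               a_i.begin(), a_i.begin() + k / 2,  // k/2 candidates
               std::back_inserter(merged));
    merged.resize(k);                             // keep the k smallest
    return merged;
}

On the arrays of Fig. 2, kSmallest returns 1, 2, 3, 3, 3, 5, 6, 6, matching Fig. 5.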
4.3 SCS-Merge

Multi-Insertion Sort works well for relations that are larger than the memory by a small factor. We now present a different approach, which utilizes partially sorted sequences in the relation and can be applied to partially-sorted relations of any size. First, we present the general approach and refer to it as SCS-Merge. Then, we improve SCS-Merge to handle a case where the number of buffers is smaller than the number of sorted chunks, and refer to this algorithm as SCS-Merge V1. We discuss outliers and how to handle them, and we present SCS-Merge V2—an improvement of SCS-Merge V1 that handles outliers effectively. Finally, we discuss perturbations, which are small shifts of tuples from their place in the order, and we introduce SCS-Merge V3—an enhancement of SCS-Merge V2 that handles perturbations effectively.

We start with the simple case where R is a sequence of sorted chunks and the number of chunks does not exceed the number of available memory buffers. In such a case, the chunks can be immediately merged, without the initial step of sorting parts of the relation and
without writing the sorted chunks to the secondary storage. The overhead of this approach is the requirement to test whether the relation indeed consists of a few sorted sequences. We will present the approach and show how to implement it efficiently.

4.3.1 Preliminaries

Before presenting how we implemented SCS-Merge, we provide some definitions and notations.

Definition 1 (MSC) A sorted chunk is a sequence of consecutive tuples that are ordered by the key. A maximal sorted chunk (MSC) is a sorted chunk that is not a proper subset of any other sorted chunk. If ti, . . . , ti+k is an MSC, then the following hold:
1. For every two consecutive tuples tj, tj+1 in the sequence, it holds that tj[K] ≤ tj+1[K].
2. Either ti is the first tuple of R, or ti−1[K] > ti[K].
3. If ti+k is not the last tuple of R, then ti+k[K] > ti+k+1[K].

Note that when R is sorted in ascending order, there is one MSC that consists of all the tuples of R, and when R is sorted in descending order, each tuple of R is an MSC (a singleton). We consider a union of two MSCs of R, M1 and M2, to be all the tuples of R that are either in M1 or in M2. The intersection of two MSCs, M1 and M2, comprises all the tuples that are both in M1 and in M2. (Note that these operations do not affect the order of the tuples in R.)

Lemma 1 The union of two different MSCs is not an MSC. The intersection of any two different MSCs is an empty set.

This entails the following assertion.

Proposition 2 Let R be a relation with sort key K. There is a unique partition of R into MSCs.

4.3.2 Naïve SCS-Merge

We now present a naïve implementation of an algorithm that merges sorted chunks. A relation R is considered a k-sorted-chunk sequence if there are k sequences of consecutive tuples in R where each sequence is a sorted chunk. In other words, R is a sequence of tuples
tc0+1, . . . , tc1, tc1+1, . . . , tc2, . . . , tck−1+1, . . . , tck
where each chunk tci−1+1, . . . , tci is sorted. To check that there are at most k maximal sorted chunks in R, we scan R and count the number of cases
where for two consecutive tuples t1 and t2 it holds that t1[K] > t2[K], i.e., the key of t1 is larger than the key of t2. By recording these cases, we obtain the locations where the sequences start. This requires a sequential read of R, and hence, it is relatively inexpensive on flash drives. SCS-Merge first checks that R is a sorted-chunk sequence in which the sequences can be merged using the available frames of the buffer pool. Then, it merges the sequences as in Merge Sort. Note that although SCS-Merge is simple, there are cases where it is more efficient than Merge Sort. This is because the sorted sequences may be of different sizes and their starting points are unknown prior to the scan. An ordinary Merge Sort will not only fail to use the fact that large parts of the relation are already sorted, it may also apply Quicksort to sorted sequences and thereby reduce the efficiency of the in-memory sorting.

Fig. 6 Sorted chunks of R. For each chunk, the figure shows its range of keys and its number of tuples. The chunks are (3), (4), (5), (1 . . . 8), (2 . . . 5), (5 . . . 13), (9 . . . 13), (10 . . . 13), (11 . . . 30), (20 . . . 28).
Fig. 7 The precedence graph G of the relation R in Fig. 6.

4.3.3 SCS-Merge V1

SCS-Merge can sort efficiently relations that comprise a few large sorted chunks, but it cannot be applied when the number of chunks is greater than the number of available buffers. However, when a relation consists of a few large chunks and some additional small chunks, it is more effective to handle the small chunks and the big chunks differently. In this section, we show how to do so. We refer to the algorithm as SCSV1.

SCSV1 has two phases. First, the small chunks are merged into large chunks. This phase is conducted as part of the scan for discovering the chunks. During the scan, when encountering a small chunk, it is inserted into memory and merged with the other small chunks that are in memory. If the part of the memory allocated for the merge of small chunks is full, the content of the memory is written to the disk as a large chunk. (This may require writing an intermediate result to the disk, but it is only a small fraction of the relation.) In the second phase, the algorithm merges all the chunks—those that were discovered in the scan and those that were created by merging small chunks. The following example illustrates the approach.

Example 1 Consider the following relation R (for simplicity, only the keys are depicted),
4, 6, 7, 8, 3, 11, 5, 2, 1, 13, 14, 17, 9, 11, 12, 16, 15
consisting of 3 large sorted sequences and 4 small sequences (of fewer than 3 records each):
(4, 6, 7, 8), (3, 11), (5), (2), (1, 13, 14, 17), (9, 11, 12, 16), (15).
Suppose that there are merely 5 buffers available for reading R during the merge. (For simplicity, we ignore the output buffer.) Then, there are not enough buffers to read simultaneously and merge the 7 sequences. Thus, when the algorithm reads the relation to discover the sorted sequences, the small sequences (3, 11), (5), (2), (15) are collected in memory, merged and sorted to create the sequence 2, 3, 5, 11, 15. Then, the algorithm merges this sequence with the other long sequences (4, 6, 7, 8), (1, 13, 14, 17), and (9, 11, 12, 16).

We implemented the first phase as a parallel process, employing three threads concurrently. One thread reads blocks of data from the relation R into the memory. A second thread examines the data to reveal the start and the end of sorted chunks, and it moves small chunks into the designated area in memory for processing them. The third thread merges the small chunks and, if needed, flushes the merged chunks to the storage device.

The memory size limits the number of chunks SCS-Merge can merge and the number of buffers that can be allocated for each chunk; e.g., if the algorithm allocates two frames of the buffer pool for each processed chunk (for double buffering), the number of processed chunks should not exceed half the number of frames in the buffer pool. Thus, it is desirable to reduce the number of chunks that are processed in parallel and to allocate for each chunk as many buffers of the buffer pool as possible, in order to read several blocks in each disk access and reduce the time wasted on read latency. To that end, we use the fact that two chunks whose key ranges do not overlap may be processed one after the other, and not in parallel. That is, suppose we have a chunk [k1, . . . , k2], where the key of the last tuple is k2, and a chunk [k3, . . . , k4], where the key of the first tuple is k3. If k2 < k3, then the second chunk can be processed after the processing of the first chunk has been completed, and we say in such a case that [k1, . . . , k2] precedes [k3, . . . , k4]. If k2 ≥ k3 and k4 ≥ k1, the two chunks should be processed in parallel and neither precedes the other. For instance, in Example 1, there is no need to read sequence
9, 11, 12, 16 before completing the merge of sequence 4, 6, 7, 8, yet, sequences 4, 6, 7, 8 and 1, 13, 14, 17 should be processed in parallel. To find which chunks should be processed in parallel, we create a precedence graph G = (V, E)—a directed graph in which there is a vertex for each maximal sorted chunk (MSC) of R. There is a directed edge in G from chunk C1 to chunk C2 , for every pair of MSCs C1 and C2 such that C1 precedes C2 . For each chunk, the graph stores the smallest and largest keys in it and a pointer to its location on the storage device. A vertex in G is considered an orphan if it has no incoming edges. In a set Or of orphan vertexes, a vertex (chunk) C is considered first-to-end if the largest key in C is smaller than or equal to the largest keys of all the chunks in Or . Intuitively, this means that for every C 0 in Or that is not first-to-end and C that is first-to-end, there is a tuple in C 0 that appears after all the tuples of C, in the sorted R. Example 2 Fig. 6 depicts a relation R that consists of sorted chunks. For each chunk, the figure shows the range of the keys and the number of tuples in it. For instance, the range of the first chunk is 1, . . . , 8, which means the key values of this chunk are between 1 and 8, and the number of tuples in this chunk is 5, as appears in brackets, on the top-right corner. Fig. 7 presents the graph G constructed from R. In G there are edges from node (1, . . . , 8) to nodes (10, . . . , 13), (20, . . . , 28), (11, . . . , 30) and (9, . . . , 13), because the merge of all these chunks does not start before completing the merge of (1, . . . , 8). There is no edge from (1, . . . , 8) to (2, . . . , 5), because these sequences overlap and should be processed in parallel. A simulation of a sort over G allows us to determine an upper bound on the number of blocks that
must be in memory during the sort. The simulation is presented in Fig. 8. Initially, G is constructed. Then, the simulation imitates the reading of chunks into the memory as part of the sort. This allows us to determine how many chunks must be read in parallel. In each step, the orphan vertexes of G are the chunks that need to be processed (a chunk is processed when tuples are extracted from it and added to the constructed sequence, as part of the merge). Note that a non-orphan chunk is processed only after the processing of its preceding chunks has ended, while an orphan chunk has no preceding chunks. The first-to-end chunks are removed from G, to simulate the end of the reading of these chunks. Then, we update the set of orphans, and we update the maximal number of processed chunks, if needed.

SCS-Merge Simulation (R)
Input: A relation R
Output: Minimal number of buffers for the sort
1: create the precedence graph G of R
2: Omax ← 0
3: while G is not empty do
4:    Or ← orphan vertexes of G
5:    if Omax < |Or| then
6:       Omax ← |Or|
7:    end if
8:    F ← first-to-end vertexes of Or
9:    remove from G the vertexes of F and the edges that contain a vertex of F
10: end while
11: return Omax
Fig. 8 Simulation of SCS-Merge.

Fig. 9 Illustration of the merged chunks in each iteration of SCSV1.
Fig. 10 The precedence graph of SCS-Merge: Iteration 1.
Fig. 11 Precedence graph in Iteration 4.
Fig. 12 Precedence graph in Iteration 5.
Example 3 Fig. 9 shows the chunks in memory when applying SCS-Merge to R of Fig. 6. Each row depicts a merge step and the merged chunks. The red arrows represent the dependencies between the chunks; for each chunk, they indicate the chunks that will be read after its processing has been completed, e.g., when the processing of chunk (3) is completed, the processing of chunk (4) starts. The simulation of this merge is presented in figures 10-12, where the red bold vertexes are the orphan vertexes. Initially, the chunks (1, . . . , 8), (2, . . . , 5) and (3) are the orphan vertexes (Fig. 10). At this stage, all the other chunks are preceded by one of the orphans. This step is represented by the first row of the table in Fig. 9. After removing chunk (3) in step 1, chunk (4) in step 2 and chunks (5) and (2, . . . , 5) in step 3, chunk (1, . . . , 8) is still an orphan and chunk (5, . . . , 13) becomes an orphan (Fig. 11). Thus, step 4 simulates the 4-th row of the table in Fig. 9, where chunks (1, . . . , 8) and (5, . . . , 13) are processed. When chunk (1, . . . , 8) is removed, three chunks become orphans, so in addition to (5, . . . , 13), there are four orphans in this step (Fig. 12). This is the representation of the 5-th row of the table in Fig. 9. Since this is the maximal number of orphans at a single step, the simulation returns 4 as the maximal number of chunks to be merged in parallel. Thus, to run SCS-Merge, we have to allocate at least 4 buffers for chunk reading.

Generally, if Tmax is the maximal number of chunks that are merged simultaneously in the merge phase, according to the simulation, and M is the number of buffers allocated for the merge, then ⌊M/(Tmax + 1)⌋ buffers are allocated for each merged chunk, and by that we read ⌊M/(Tmax + 1)⌋ blocks at a time, to reduce the time spent on read latency. Note that at least one buffer should be used for the output—for collecting the tuples from the merged chunks before adding them to the output (i.e., before writing to the disk or transferring to the next operation by pipelining).
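The simulation of Fig. 8 can be implemented directly when each chunk is represented by its key range; the following is a minimal sketch of ours, using the precedence test k2 < k3 defined above.

#include <algorithm>
#include <vector>

struct Chunk { int minKey, maxKey; };

// Returns the maximal number of chunks that must be merged in parallel,
// i.e., the value Omax computed by the simulation in Fig. 8.
int simulateBuffers(std::vector<Chunk> chunks) {
    int oMax = 0;
    while (!chunks.empty()) {
        // A chunk is an orphan if no remaining chunk precedes it, i.e.,
        // no remaining chunk has maxKey smaller than its minKey.
        std::vector<Chunk> orphans, rest;
        for (const Chunk& c : chunks) {
            bool hasPred = std::any_of(chunks.begin(), chunks.end(),
                [&](const Chunk& p) { return p.maxKey < c.minKey; });
            (hasPred ? rest : orphans).push_back(c);
        }
        oMax = std::max(oMax, (int)orphans.size());
        // The first-to-end orphans have the smallest maxKey; removing
        // them simulates the end of their processing.
        int firstEnd = std::min_element(orphans.begin(), orphans.end(),
            [](const Chunk& a, const Chunk& b) { return a.maxKey < b.maxKey; })
            ->maxKey;
        for (const Chunk& c : orphans)
            if (c.maxKey != firstEnd) rest.push_back(c);
        chunks = std::move(rest);
    }
    return oMax;
}

On the chunks of Fig. 6, simulateBuffers returns 4, as in Example 3.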
4.4 Tidy-Outliers

In this section we discuss outliers, i.e., a few tuples that are out of order, and we present an improvement of SCS-Merge that handles them efficiently. Many sorted relations become slightly unsorted due to sporadic updates. That is, the relation is unsorted, but only a few tuples (relative to the size of R) are not in their place in the order. We refer to such relations as nearly sorted. It is beneficial to exploit the underlying order rather than to consider the relation as completely unsorted.

A relation R is nearly sorted if the removal of a small number of tuples from it, referred to as outliers, yields a sorted relation [5]. In 2, 5, 6, 4, 8, 9, 11, 12, 3, 15, for example, the values 4 and 3 are outliers—removing them provides a sorted sequence. Formally, we say that R is k-nearly sorted if it contains k tuples (outliers), O = {ti1, . . . , tik}, such that the tuples {t | t ∈ R and t ∉ O} are completely sorted in R. A set O of outliers of R is minimal if the number of outliers in O is smaller than or equal to the number of tuples in any other set whose removal from R yields a sorted relation.

The problem of outlier detection is similar to the longest increasing subsequence problem [3], which has been studied for the case of sequences in memory. An in-memory computation of the longest increasing subsequence requires storing in memory an array in the
size of the processed relation, or writing intermediate results to the secondary storage, see [19]. Both options are problematic in external sorting over flash disks, so we present a longest-increasing-subsequence algorithm that is adapted to our problem. Moreover, we do not just discover the outliers; in the algorithms we introduce, we combine outlier detection with a merge of sorted chunks.

To illustrate the approach, suppose we have a buffer that can hold k tuples and we know the locations of the outliers. Tidy-Outliers Sort conducts two iterations. First, it scans R and collects the outliers; the outliers are inserted into a min-heap and kept in memory. Second, it reads R, writes the non-outlier tuples to the result, and whenever the tuple at the root of the heap (the one with the smallest key) fits in the order, this tuple is removed from the heap and added to the constructed result.

The difficult task in this approach is to detect the outliers. To illustrate the difficulty, consider the following two sequences of numbers:
s1 = 1, 2, 5, 6, 7, 3, 4, 8, 9
s2 = 1, 2, 6, 7, 5, 3, 4, 8, 9
In s1, two options for choosing the outliers are {5, 6, 7} or {3, 4} (obviously, there are additional options of choosing more than three outliers). In s2, we can choose the outliers to be {6, 7, 5}, {5, 3, 4}, {6, 7, 3, 4}, etc. Even in these simple examples there are many options for choosing the outliers. The goal is to provide an effective outlier selection that is efficient in terms of running time and returns a set of outliers that is as small as possible.

A direct approach for selecting the outliers is as follows. Given a relation R, first create a directed graph Go = (R, Eo) whose vertexes are the tuples of R, i.e., there is a vertex in Go for every tuple of R. There is a directed edge (ti, tj) in Eo for each pair of tuples ti, tj ∈ R such that (1) ti appears in R before tj, and (2) ti[K] ≤ tj[K], i.e., ti and tj comply with the order induced by the key. After the construction of Go, find the longest directed path in it. This can be done by iterating over the tuples of R and, for each tuple, finding the longest path to it by examining the longest paths discovered for its predecessors in Go, choosing the longest among them and appending the current vertex to it. Note that essentially this is a variation of Dijkstra's Algorithm. The tuples that are not on the longest path of Go are the outliers.

To prove the correctness of the direct outlier discovery, consider a relation R and a minimal set of outliers O discovered for R. Let O′ be the set of outliers discovered by finding the longest path in Go. The following two assertions hold. First, |O| ≤ |O′|, because the removal of O′ from R yields a sorted sequence and O
is minimal with respect to the sets whose removal from R yields a sorted sequence. Second, |O| ≥ |O′|. This is because the removal of O from R produces a sorted sequence, and this sorted sequence, as any other sorted sequence of R, is a path in Go. Let PO and PO′ denote the paths in Go created by the removal of O and O′, respectively. Then, since PO′ is the longest path in Go, it has at least as many vertexes as PO, and thus, |O| ≥ |O′|. From these two assertions, we infer |O| = |O′|; hence, O′ is a minimal set of outliers.

A disadvantage of the direct discovery is its low efficiency. Although the discovery of the longest path, using this variation of Dijkstra's Algorithm with a Fibonacci heap, has O(m log m) time complexity over a relation of size m [12], the actual running time for a big relation is high, as Go can be too large to be managed in memory. Next, we discuss how we can implement this approach more efficiently, for relations that are partially sorted, by applying it over maximal sorted chunks (MSCs).

We say that for a pair (M1, M2) of MSCs, M1 totally precedes M2 if:
1. M1 appears before M2 in R, and
2. t1[K] < t2[K], where t1 is the last tuple of M1 and t2 is the first tuple of M2.
In Example 1, (4, 6, 7, 8) totally precedes (9, 11, 12, 16) because 8 < 9. We say that M1 overlaps and precedes M2 if M1 appears before M2 in R and their ranges overlap. For instance, in Example 1, the MSC (1, 13, 14, 17) overlaps and precedes (9, 11, 12, 16), because the ranges [1 . . . 17] and [9 . . . 16] overlap. In a case of an overlap of the ranges,
1. there are t1 in M1 and t2 in M2 such that t1[K] ≥ t2[K], e.g., for (1, 13, 14, 17) and (9, 11, 12, 16), it holds that 13 ≥ 11; and
2. there are t′1 in M1 and t′2 in M2 such that t′1[K] ≤ t′2[K], e.g., for (1, 13, 14, 17) and (9, 11, 12, 16), it holds that 1 ≤ 9.

A pair (t1, t2) of a tuple t1 of M1 and a tuple t2 of M2 are connectors of M1 and M2 if there are a tuple t′2 in M2 and a tuple t′1 in M1 such that t′2[K] ≤ t1[K] ≤ t2[K] ≤ t′1[K]. For example, in (1, 13, 14, 17) and (9, 11, 15, 16), the tuples 13 and 15 are connectors, because 9 ≤ 13 ≤ 15 ≤ 17. Intuitively, it is possible to create an ordered sequence by reading the tuples of M1 from the start to t1 and then reading the tuples of M2 from t2 to the end. In the case of (1, 13, 14, 17), (9, 11, 15, 16) and the connectors 13 and 15, this would yield the sequence 1, 13, 15, 16. Note that the pair of connectors is not unique; for example, 14 and 15 can also be used as connectors here.

To find outliers, we construct from the MSCs of R the MSC graph of R. The vertexes of the graph are the MSCs of R. We denote the set of vertexes by M. There
is an edge (M1, M2, t1, t2), from M1 to M2, for every pair of MSCs such that M1 totally precedes M2, where t1 is the last tuple of M1 and t2 is the first tuple of M2. In addition, there is an edge (M1, M2, t1, t2), from M1 to M2, for every pair of MSCs such that M1 overlaps and precedes M2, and (t1, t2) are connectors of M1 and M2. We denote the edges by E. The MSC graph of R, denoted GR, is the pair GR = (M, E).

Constructing the MSC graph is simple. The MSCs are discovered in a scan of the relation. Then, for each two MSCs M1 and M2, if M1 precedes M2 (totally or with an overlap), an edge from M1 to M2 is added. It is easy to see that GR is a directed acyclic graph (DAG): when an edge connects two MSCs, the first MSC appears in R before the second MSC, so a cycle in GR that contains two MSCs M1 and M2 would mean that M1 appears before M2 in R and M2 appears before M1 in R—a contradiction. Thus, there cannot be cycles in GR.

We consider the vertexes of GR that do not have any ingoing edge as sources and the vertexes that do not have any outgoing edge as targets. Since GR is a DAG, it must have at least one source and at least one target. A full path in GR is a path from a source vertex to a target vertex. In a path P that traverses the edges (Mi1, Mi2, t′i1, ti2), (Mi2, Mi3, t′i2, ti3), . . . , (Min−1, Min, t′in−1, tin), the weight of Mij is the number of tuples between tij and t′ij in R, including tij and t′ij (for Mi1 it is the number of tuples between its first tuple and t′i1, and for Min it is the number of tuples between tin and its last tuple). The weight of a path is the sum of the weights of the MSCs it goes through. The best-covering path of GR is the path with the greatest weight among the paths of GR. Finding the best-covering path can be done using a variation of Dijkstra's Algorithm, in the manner discussed earlier.

Proposition 3 Consider a relation R with MSC graph GR. Let P = Mi1, . . . , Mik be a path in GR. Then the underlying sequence of P is a sorted sequence.

Given a path P and the underlying sequence it defines, the tuples of R that are not in P are the outliers.

Proposition 4 Consider a relation R with MSC graph GR. Let Pb be the best-covering path in GR. Then the underlying sequence of Pb is a sorted sequence that minimizes the number of outliers.
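Since the MSC graph is a DAG whose edges point forward in R, the best-covering path can be computed by a dynamic program over the vertexes in their order of appearance. The following is a minimal sketch of ours that computes only the weight of the path; for simplicity, it uses one fixed weight per MSC instead of the edge-dependent weights defined above.

#include <algorithm>
#include <vector>

// adjIn[i] lists the predecessors of MSC i in the MSC graph; weight[i]
// is the number of tuples the MSC contributes. Vertexes are assumed to
// be indexed in the order the MSCs appear in R (a topological order).
long long bestCoveringWeight(const std::vector<std::vector<int>>& adjIn,
                             const std::vector<long long>& weight) {
    const std::size_t n = weight.size();
    std::vector<long long> best(n, 0);  // best[i]: max path weight ending at i
    long long answer = 0;
    for (std::size_t i = 0; i < n; ++i) {
        long long fromPred = 0;
        for (int p : adjIn[i])                  // heaviest path to a parent
            fromPred = std::max(fromPred, best[p]);
        best[i] = fromPred + weight[i];
        answer = std::max(answer, best[i]);     // the path may end anywhere
    }
    return answer;  // tuples covered; the outliers are the rest of R
}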
4.5 SCS-Merge V2

We now present a modification of SCSV1 that combines the approach of outlier detection with SCS-Merge. Note that Tidy-Outliers cannot be used when there are
SCS-Merge V2 (R)
Input: A relation R
Output: A sorted relation
1: scan R and find the set M of MSCs of R
2: create an MSC graph GM of R with vertexes M
3: add an edge (M1, M2) to GM for each pair such that M1 precedes M2
4: 𝒫 ← ∅
5: while |GM| > |Boutliers| do
6:    find the longest path P in GM
7:    add P to 𝒫 and consider it as a sequence SP
8:    remove the MSCs of P from GM
9: end while
10: sort in memory the tuples in Boutliers and consider this sequence as SO
11: if |𝒫| + 1 > the number of allocated buffers then
12:    fail and exit
13: end if
14: merge the sequences SP of the paths P ∈ 𝒫 and the sequence SO
15: write the result of the merge to the output
Fig. 13 SCS-Merge V2.
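Lines 5-9 of Fig. 13 can be sketched as follows; this is our illustration, with each MSC represented by its key range and size, the precedes test of this section as the edge relation, and the dynamic program of Section 4.4 extended to record predecessors.

#include <algorithm>
#include <vector>

struct Msc { long long minKey, maxKey, size; };

// Repeatedly peel a maximum-weight path off the MSC graph until the
// tuples left outside any path fit into the outlier buffers (cap).
std::vector<std::vector<int>> peelPaths(std::vector<Msc> m, long long cap) {
    std::sort(m.begin(), m.end(), [](const Msc& a, const Msc& b) {
        return a.minKey != b.minKey ? a.minKey < b.minKey
                                    : a.maxKey < b.maxKey;
    });                                       // edges now point forward
    std::vector<bool> alive(m.size(), true);
    long long left = 0;
    for (const Msc& c : m) left += c.size;
    std::vector<std::vector<int>> paths;
    while (left > cap) {
        std::vector<long long> best(m.size(), 0);
        std::vector<int> par(m.size(), -1);
        int end = -1;
        for (int i = 0; i < (int)m.size(); ++i) {
            if (!alive[i]) continue;
            for (int j = 0; j < i; ++j)       // j precedes i: no overlap
                if (alive[j] && m[j].maxKey <= m[i].minKey && best[j] > best[i]) {
                    best[i] = best[j]; par[i] = j;
                }
            best[i] += m[i].size;
            if (end < 0 || best[i] > best[end]) end = i;
        }
        paths.emplace_back();                 // walk the parents back
        for (int v = end; v >= 0; v = par[v]) {
            paths.back().push_back(v);
            alive[v] = false;
            left -= m[v].size;
        }
        std::reverse(paths.back().begin(), paths.back().end());
    }
    return paths;  // each path's MSCs form one sorted sequence to merge
}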
too many outliers to be stored in memory. In such cases, we wish to join the ability to collect and sort outliers in memory with the merging of sorted chunks. We use SCSV1 as the basis for the improved algorithm; thus, we call the improved algorithm SCS-Merge V2 (or SCSV2, for short).

In SCSV2, we build an MSC graph GM similar to the precedence graph defined in Section 4.3.3. It is somewhat similar to the graph built in Tidy-Outliers; the vertexes of GM are the MSCs of R. The difference is that, for efficiency, there are edges in GM only between two MSCs with non-overlapping ranges. We utilize the fact that in flash storage, random access is efficient, and thus, we ignore the order of the chunks in R. More precisely, we say that an MSC M1 precedes an MSC M2 if t1[K] ≤ t2[K], where t1 is the last tuple of M1 and t2 is the first tuple of M2. Note that the difference between a case where M1 precedes M2 and a case where M1 totally precedes M2 is that in the second case, M1 appears before M2 in R, whereas in the first case, this is not required. In both cases, the range of M1 precedes the range of M2, without an overlap (so the algorithm can start reading M2 after the processing of M1 has been completed). There is an edge in GM between every pair of MSCs M1 and M2 such that M1 precedes M2.

In SCSV2, some part of the memory is allocated for reading tuples during the merge phase and another part is allocated for the output. In the part that is allocated for the input, we designate a subpart (typically about half of it) for outliers and for small sorted chunks that are merged in memory. Let Boutliers be the buffers allocated for the outliers and the small chunks. Initially,
SCSV2 reads R and discovers the sorted chunks. It builds the graph GM in memory. Then, it iteratively extracts paths from GM. In each iteration, it finds the longest path P in GM and adds it to the list of discovered paths. Then, it removes from GM the MSCs that P comprises. When the remaining MSCs are small enough to fit in Boutliers, the iteration stops. Since in GM there are edges only between non-overlapping MSCs, the path P yields a sorted sequence comprising the tuples in the MSCs it goes through. When reading the tuples of this sequence, the MSCs can be read one after the other, according to their order in P. The discovered set of paths and Boutliers are merged to provide the sorted result. The pseudocode of SCSV2 is presented in Fig. 13.

Fig. 14 A path of length 27, depicted in red, is discovered in the first iteration of SCSV2.
Fig. 15 A path of length 13 (in red) is discovered in the second iteration of SCSV2.
Fig. 16 The remaining graph can fit into the outlier buffers

Fig. 17 The merged MSCs—in bold, the accumulated outliers
Example 4 The iterative process of finding the longest paths in SCSV2 is illustrated in figures 14–16. This example presents the application of SCSV2 to the sequences presented in Fig. 6. In the first iteration, a path whose weight is 27 tuples, according to the weights presented in Fig. 6, is discovered (Fig. 14). In the second iteration, a path whose weight is 13 is discovered (Fig. 15). After discarding the vertexes of the discovered paths, the remaining graph represents 12 tuples that can fit into the memory allocated for the outliers (Fig. 16). Thus, three sequences are merged (Fig. 17).

Algorithm SCSV2 reads R twice—first sequentially for discovering the MSCs, and then it applies random access when reading the sequences during the merge. Differently from Merge Sort, it does not write to the
disk an intermediate result in the size of the relation. Differently from SCSV1, it does not consider outliers as sorted chunks, and thereby it reduces the overhead of managing the chunks and utilizes the memory better.

4.6 Handling Perturbations (SCS-Merge V3)

In many real datasets, subsequences of tuples are unsorted merely due to perturbations, i.e., small changes that slightly shift tuples from their place in the order. In this section, we explain the problem and show how to handle it efficiently. We present Algorithm SCS-Merge V3 (SCSV3, for short), which improves SCSV2 to cope better with perturbations. Essentially, it deals with perturbations by sorting, in memory, parts of chunks, during the discovery and the merge phases of SCSV2. The following example illustrates perturbations, their effect on SCSV2 and how SCSV3 handles them.

Example 5 Consider a relation R comprising the following sequence of tuples (for simplicity, only the keys are presented).

2, 4, 8, 12, 16, 20, 18, 19, 21, 22, 23, 24, 5, 7, 9, 11, 13, 15

Suppose that there are 2 buffers available for reading the relation and each buffer can hold 4 tuples. Then, SCSV2 creates a graph with 3 nodes: a node v1 for (2, 4, 8, 12, 16, 20), a node v2 for (18, 19, 21, 22, 23, 24) and a node v3 for (5, 7, 9, 11, 13, 15). There is an edge from v3 to v2, because the last key of v3 is 15, the first key of v2 is 18, and 15 < 18. So, initially, v3 and v1 are merged. When the merge of v3 ends, the remainder of v1 is merged with v2.

Now, consider a similar sequence R′, created by some perturbations in the order of the tuples.

2, 4, 12, 8, 16, 20, 19, 22, 21, 23, 18, 24, 5, 7, 9, 11, 15, 13

There are 7 shorter MSCs now: (2, 4, 12), (8, 16, 20), (19, 22), (21, 23), (18, 24), (5, 7, 9, 11, 15), (13). Note that the MSCs (8, 16, 20), (19, 22), (18, 24) should be processed in parallel. Thus, 2 buffers are now insufficient to apply SCSV2. However, the following can be done: read the tuples 2, 4, 12, 8 into the first buffer and the tuples 5, 7, 9, 11 into the second buffer, sort the buffers in memory, merge and add to the result. Then, insert the tuples 16, 19, 20, 22 into the first buffer and the tuples 15, 13 into the second buffer, sort the buffers in memory, merge and add to the result. Finally, sort the tuples 21, 23, 18, 24 and add to the result. So, by discovering the nearly-sorted chunks (2, 4, 12, 8, 16, 20, 19, 22), (21, 23, 18, 24), (5, 7, 9, 11, 15, 13), and sorting them in memory while merging, the sort can be done using just two buffers.
Note that in this example there are just two buffers. When there are more buffers, it is possible to sort in memory larger sequences that span several buffers.

The process in Example 5 requires discovering the chunks, to see whether there are enough available memory buffers for the sort and merge. This can be done by applying a process similar to the replacement-selection algorithm [46], but without writing intermediate results to the disk. The following example illustrates this.

Example 6 Consider the relation R′ of Example 5, but now, suppose there are three buffers to read the input. Two buffers can be used to create a sorted stream of tuples, as follows. The first 4 tuples are read into the first buffer and the next 4 tuples are read into the second buffer. The read tuples 2, 4, 12, 8, 16, 20, 19, 22 are sorted in memory and the smallest 4 tuples are discarded. (During the discovery phase, the tuples are just removed from the memory.) Then, the next 4 tuples are read and the 4 smallest tuples are discarded. So, by removing 2, 4, 12, 8 and adding 21, 23, 18, 24, the two buffers contain 16, 20, 19, 22, 21, 23, 18, 24. The next step removes 16, 18, 19, 20 and reads 5, 7, 9, 11. Since the last discarded tuple is 20, the tuples 5, 7, 9, 11 cannot be used to continue the sorted sequence. So 21, 22, 23, 24 are discarded and end the chunk. In this case, the tuples 2, 4, 12, 8, 16, 20, 19, 22, 21, 23, 18, 24 are considered a single chunk, when using two buffers, because it is possible to stream them sorted to the merging thread. Similarly, the tuples 5, 7, 9, 11, 15, 13 will be considered a single chunk. The two chunks can be merged as in SCSV2, while applying the in-memory sort.

As demonstrated above, to deal with nearly-sorted sequences, SCSV3 discovers, sorts and merges chunks. Using replacement-selection, it produces chunks that are larger than the allocated memory. That is, given a memory that can store m tuples, the memory is partitioned into two parts whose sizes, in terms of the number of tuples they can store, are sl and sr (i.e., m = sr + sl). Initially, SCSV3 reads the first m tuples of the chunk into the memory. Then, in each iteration, it outputs the smallest sr tuples, sorted (in the second phase, the chunks are merged, so the output is pipelined into the merging thread), and it reads the next sr tuples of the chunk, or the remainder of the chunk if fewer than sr tuples remain unread. This can be continued as long as all the sr new tuples have a key greater than the key of the last outputted tuple (or, equivalently, greater than the keys of all the outputted tuples, since the tuples are outputted sorted). Thus, when a new key is too small, the chunk must end.
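The following Python sketch mirrors the streaming described in Example 6, under the assumption that the chunk is given as a list of keys and that m and sr are the memory and replacement sizes defined above; the name stream_chunk is ours. It emits one chunk in sorted order and stops when an incoming key can no longer be placed correctly (in a full implementation, the violating tuples would begin the next chunk).

import heapq

def stream_chunk(keys, m, sr):
    """Stream the longest nearly-sorted prefix of 'keys' in sorted order,
    using memory for at most m keys and replacing sr keys per iteration."""
    heap = keys[:m]
    heapq.heapify(heap)
    pos, kmax = m, float("-inf")
    while heap:
        # Emit the sr smallest resident keys; they are final, because every
        # unread key of the chunk is known to be at least kmax.
        for _ in range(min(sr, len(heap))):
            kmax = heapq.heappop(heap)
            yield kmax
        incoming = keys[pos:pos + sr]
        pos += sr
        if any(k < kmax for k in incoming):
            break  # a perturbed key is too small: the chunk ends here
        for k in incoming:
            heapq.heappush(heap, k)
    # Drain the remaining resident keys, sorted.
    while heap:
        yield heapq.heappop(heap)

For the relation R′ of Example 5 with m = 8 and sr = 4, the stream yields 2, 4, 8, 12, 16, 18, 19, 20, 21, 22, 23, 24, matching the first chunk discovered in Example 6.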
In order to merge the chunks, first it is required to correctly discover them. Chunks are chosen so that in each iteration, all the sr new tuples have a key greater than the key of the last outputted tuple. Obviously, every sorted chunk satisfies this, but frequently nearly-sorted chunks with perturbations satisfy this as well. To discover the chunks, SCSV3 conducts a simulation of the perturbation handling. It applies the iterative process of reading sr tuples in each step, without actually outputting the tuples. However, this requires sorting m tuples in each step, both when discovering the chunks and when applying the merge. By avoiding the sort in the discovery phase, the discovery becomes more efficient, but we get shorter chunks. Thus, the sort is optional. The code, where the in-memory sort can be toggled on and off, is presented in Fig. 18.

Algorithm Discover Chunks (Fig. 18) receives a relation R of n tuples, and works with a memory that can store m tuples. The memory is partitioned so that in each iteration, sr new tuples are read and replace sr existing tuples, while sl tuples are kept in the memory for the next iteration. A flag flag_sort is set in line 1 to determine whether the content of the memory will be sorted in each iteration. Initially, the first m tuples of R are inserted into the memory (line 6), and if flag_sort is true, the tuples are sorted. Variable Pstart_chunk marks the position of the beginning of the current chunk. Variable Plast_read marks the position of the last read tuple of R. Variable kmax stores the largest key among the keys of "outputted" tuples. (Note that in the merge, the tuples are actually outputted, but in the discovery of the chunks, this is just a simulation of outputting tuples.)

Iteratively, the algorithm reads R (line 11). The tuples are read into an array Am of size m. In each iteration, the last sl tuples of Am are moved to the left, overriding the first sr tuples of Am, to simulate outputting the first sr tuples (line 12). The next sr tuples of R are read into the evacuated array cells (line 16). When there are fewer than sr unread tuples in R, the unread tuples are inserted into Am, dummy tuples are inserted into the remaining cells of Am, and the variable complete is set to true to mark that this is the final iteration. When the key of a read tuple is smaller than kmax (line 24), the position of the read tuple is marked as the end of the chunk, because the key is too small to move the tuple into its correct position by merely applying an in-memory sort. Then, the start and end positions of the chunk are added to L, and a new chunk is opened. Dummy tuples with key −∞ are inserted into the left sr cells of Am, and the sl tuples starting from the position where the new chunk starts are read into
Discover Chunks(R, m, sr)
Input: A relation R with n tuples
Parameters: The number of tuples in the allocated memory m, and the number of tuples read in each iteration sr
Output: A list of chunks L
1: flag_sort ← false/true
2: let Am[1, . . . , m] be an array of tuples
3: L ← ∅
4: Pstart_chunk ← 1
5: sl ← m − sr
6: insert into Am the first m tuples of R
7: if flag_sort then sort Am
8: Plast_read ← m
9: kmax ← max{K | K is a key in Am[1, . . . , sr]}
10: complete ← false
11: while not complete do
12:   for j = 1 to sl do
13:     Am[j] ← Am[j + sr]
14:   end for
15:   if Plast_read + sr ≤ n then
16:     read tuples Plast_read + 1, . . . , Plast_read + sr into Am[sl + 1, . . . , m]
17:   else
18:     read tuples Plast_read + 1, . . . , n into Am[sl + 1, . . . , sl + n − Plast_read]
19:     insert into Am[sl + n − Plast_read + 1, . . . , m] dummy tuples with key ∞
20:     complete ← true
21:   end if
22:   for j = sl + 1 to m do
23:     let t be the tuple in Am[j] and let t[K] be the key of t
24:     if t[K] < kmax then
25:       add chunk [Pstart_chunk, Plast_read + j − 1] to L
26:       Pstart_chunk ← Plast_read + j
27:       Plast_read ← Plast_read + j
28:       insert dummy tuples with key −∞ into Am[1, . . . , sr]
29:       read tuples Plast_read, . . . , Plast_read + sl − 1 of R into Am[sr + 1, . . . , m]
30:       kmax ← −∞
31:       break
32:     end if
33:   end for
34:   if flag_sort then sort Am
35:   kmax ← max{kmax, keys of Am[1], . . . , Am[sr]}
36:   Plast_read ← Plast_read + sr
37: end while
38: add chunk [Pstart_chunk, n] to L
39: return L

Fig. 18 Discovering the chunks in SCS-Merge V3
the right sl cells of Am (lines 28–29). Note that in the next iteration, the dummy tuples are pushed out and kmax is updated. At the end of the iteration, after reading the tuples, kmax and the position of the last read tuple are updated (lines 35 and 36).

The sizes sr and sl are parameters that can be modified. There is a tradeoff here. On the one hand, as we
decrease the size of sr, we increase the length of the created chunks, because in each step we only increase kmax based on the keys of the leftmost sr tuples (when sort is applied, these are the smallest tuples). For a larger sr, the expected value of kmax is larger and there is a higher chance that the condition in line 24 will be true, so the ending of a chunk will occur more frequently. On the other hand, a bigger sr causes the number of read accesses to be smaller, and this decreases the read latency. In our implementation of SCSV3, sl is (3/4)m. In addition, the size of the outlier buffer is |M|/4, and not |M|/2 as in SCSV2, to prevent a case where only a very small number of chunks could be merged.
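As an illustration, the following Python sketch follows the structure of Discover Chunks (Fig. 18), under the simplifying assumptions that the whole relation is addressable as a list of keys and that, after a chunk ends, the window is restarted from the beginning of the new chunk (a simplification of lines 26–30); it returns the chunk boundaries as half-open index intervals. The names are ours.

def discover_chunks(keys, m, sr, flag_sort=True):
    """Simulate the perturbation handling and return [start, end) chunks."""
    n = len(keys)
    chunks, start = [], 0
    while start < n:
        window = keys[start:start + m]   # read up to m tuples of the chunk
        pos = start + len(window)
        if flag_sort:
            window.sort()
        kmax = max(window[:sr], default=float("-inf"))
        end = None
        while pos < n and end is None:
            window = window[sr:]         # "output" the leftmost sr tuples
            for off, k in enumerate(keys[pos:pos + sr]):
                if k < kmax:             # cf. line 24: the chunk must end
                    end = pos + off
                    break
            if end is None:
                window += keys[pos:pos + sr]
                pos = min(pos + sr, n)
                if flag_sort:
                    window.sort()
                if window[:sr]:
                    kmax = max(kmax, max(window[:sr]))
        if end is None:
            end = n                      # the chunk extends to the end of R
        chunks.append((start, end))
        start = end
    return chunks

On the relation R′ of Example 5 with m = 8 and sr = 4, this sketch returns the two chunks [0, 12) and [12, 18), matching the chunks discovered in Example 6.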
5 Experimental Evaluation

In this section we present our experimental evaluation. We compare the running times of the algorithms, including Merge Sort, in tests on different storage devices, over synthetic and real datasets.
5.1 Settings

We start by describing our experimental setting.

5.1.1 Hardware

We executed our tests on a Dell Inspiron N5010, with a 2.53 GHz Intel Core i5 processor, 4 GB RAM and a 64-bit Windows 7 OS. However, we limited the main memory to 128 MB, to emulate the memory typically allocated to an application in Android-based systems with 1 GB of main memory [8]. This is the entire memory that was used—for both the data structures and the data read from the disk. We ran every test on 4 different devices: (1) an SSD; (2) a hybrid system with an SSD as the input storage device from which the data is initially read, and an internal HDD to which intermediate and final results are written; (3) an SD Card; (4) a Micro SD Card. The SSD we used is an Intel 520 with a capacity of 128 GB and a SATA interface. The HDD is a Samsung HM501II with a capacity of 500 GB, 5400 RPM and a SATA interface. The SD Card is a SanDisk SDHC class 6 with a capacity of 8 GB. The Micro SD Card is a PNY Micro SDHC class 10 with a capacity of 16 GB.

5.1.2 Datasets

We tested relations containing tuples of size 128 bytes. We ran the tests on synthetic data that we created and on real data that we downloaded.
Fig. 19 Stock prices (opening rate by date), taken from Yahoo! Finance
Synthetic data. The tuples were created randomly and uniformly, and then they were sorted. Afterwards, we randomly chose pairs of chunks, and we swapped the chunks of each pair. The sizes of the chunks were also chosen randomly. Let σ be the number of chunk switches we performed on the sorted relation. In our tests, σ ranged from 5 to 480. Note that as σ grows, the permuted relation becomes less ordered. For each value of σ, we created 10 different relations on which we executed our tests. (A sketch of this generation procedure is given below, after the description of the measures.)

Real data. For the tests on real data, we downloaded from Yahoo! Finance opening stock rates as a function of the date, for different stocks. See an example in Fig. 19. Tuples were expanded to 128 bytes without changing the key or the order.

5.1.3 Measures

All the algorithms were implemented as multi-threaded applications, where each thread has a single role: read, write or in-memory computation. The I/O operations were blocking, i.e., no operation was performed by the executing thread during the I/O operation. We locked the physical memory so that no swaps would occur during the run of the algorithms, and the files were opened without data caching. Writing to a file occurred immediately, with no buffering, because we wanted to control every I/O operation. We experimented with relations that were larger than the memory size by a factor of ρ ∈ {2, . . . , 10}. We measured I/O times and total running times (total time, for short). Each reported time is the average of 10 different runs. Measuring I/O time means summing up the times spent by the algorithm on I/O operations and ignoring the time spent on computations in main memory.
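To make the synthetic-data generation concrete, here is a small Python sketch of it; the chunk-size bound of a tenth of the relation and the function name make_relation are our own illustrative choices, not taken from the paper.

import random

def make_relation(n, sigma, seed=0):
    """Create a sorted relation of n random keys, then perform sigma
    switches of randomly chosen, randomly sized pairs of chunks."""
    rng = random.Random(seed)
    keys = sorted(rng.random() for _ in range(n))
    for _ in range(sigma):
        # Choose two disjoint chunks; the size bound n // 10 is arbitrary.
        size1 = rng.randint(1, max(1, n // 10))
        size2 = rng.randint(1, max(1, n // 10))
        i = rng.randrange(0, n - size1 - size2)
        j = rng.randrange(i + size1, n - size2 + 1)
        a, b = keys[i:i + size1], keys[j:j + size2]
        # Swap the two chunks; the relation keeps the same n keys.
        keys = keys[:i] + b + keys[i + size1:j] + a + keys[j + size2:]
    return keys

As σ grows, the output of make_relation contains more and shorter maximal sorted chunks, which is the property the experiments vary.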
Fig. 20 Total times as a function of σ, on SSD: ρ = 2, σ = [5 . . . 480]
This relates to cases where computations in main memory can be sped up by using a stronger processor or can be conducted in parallel with the I/O operations. The total running time refers to measuring the entire time it took for the algorithm to perform the sort. Note that the total running time can be smaller than the I/O time when I/O operations are conducted in parallel.
Fig. 21 Total times as a function of σ, on SSD: ρ = 4, σ = [5 . . . 480]
5.2 Results
We now present the test results. In all the tests, Merge Sort wrote to the disk intermediate sorted runs with a total size equal to the size of the sorted relation. The other algorithms did not write any intermediate results to the disk.
Fig. 22 I/O times as a function of σ, on SSD: ρ = 4, σ = [5 . . . 160]
5.2.1 Experimenting with Synthetic Data

In this section we present the results of the tests over synthetic data. In the following graphs, Merge Sort is depicted in blue, Multi-Insertion Sort in red, and SCSV1, SCSV2 and SCSV3 in green, yellow and pink, respectively. All the times are in milliseconds. In these tests, SCSV3 provided results that are slightly less efficient than those of SCSV2, because of the chunk-discovery phase. Thus, to simplify the presentation, we do not depict the results of SCSV3.

Solid-State Drive (SSD). The performance of the different algorithms on an SSD is presented in figures 20–23. In the tests, we used an SSD whose read throughput and write throughput are almost the same. On such a device, Multi-Insertion Sort has relatively low efficiency, yet it is not much worse than Merge Sort for small ρ ratios. In all cases, SCSV2 is more efficient
than SCSV1 (or, for very low σ values, as efficient as SCSV1). For low values of σ, the relations comprise relatively long sorted chunks, thus SCSV2 provides the best results, for any ρ. Fig. 23 shows that for nearly-sorted relations, SCSV2 is more efficient than Merge Sort even when the size of the relation grows. When σ is large, there are too many small chunks to deal with, so the SCS algorithms stop being efficient. Note, however, that SCSV1 and SCSV2 have better I/O times than Merge Sort (Fig. 22), so they can benefit more than Merge Sort from speeding up in-memory computations.

Hybrid System. We tested our algorithms on a hybrid system that combines an SSD and an HDD [31, 39], and uses the SSD as a read-only storage device (from which it reads the relation) and the HDD as a secondary device (on which the intermediate and the final results are written). By writing to the HDD,
Fig. 23 Total times per ρ, on SSD, σ = 40
Fig. 25 Total times per σ, on a Hybrid System, ρ = 4, σ = [5 . . . 480]
Fig. 24 Total times per σ on a Hybrid System, ρ = 2, σ = [5 . . . 480]
we avoid writing unnecessary temporary data to the SSD and reduce the wearing of the SSD. This approach exploits the high read throughput of the SSD and the low wearing level of the HDD. The results are presented in figures 24–27. On a hybrid system, Merge Sort does not wear out the cells, but it has the worst running times, and it is significantly worse than SCSV2 when σ is low. For low σ, SCSV2 outperforms Merge Sort even when the size of the relation grows (Fig. 27). Multi-Insertion Sort is effective when σ is high. Note that SCSV1 and SCSV2 have the lowest I/O times (Fig. 26) and can benefit the most from speeding up the processor.

SD Card. Many hand-held devices use an SD Card, which is a much slower device than an SSD but otherwise has characteristics similar to those of an SSD. The test results are presented in figures 28–31. The results are similar to those of the tests on the SSD, except that on the SD Card the differences between Multi-Insertion Sort and the SCS algorithms are bigger. More importantly, now the SCS algorithms out-
Fig. 26 I/O times per σ, on a Hybrid System, ρ = 4, σ = [5 . . . 320]
perform Merge Sort also for higher values of σ, i.e., for relations with less order. These results demonstrate the effectiveness of our algorithms on devices in which writing is much slower than reading.

Micro SD Card. The test results on the Micro SD Card are similar to the results on the SD Card, e.g., see Fig. 32. That is, on the Micro SD Card, for the ρ and σ values we tested, SCSV2 is the most efficient algorithm, and it avoids wearing out the cells.

5.2.2 Experimenting With Real Data

Next, we present the tests over real data. There are perturbations in the dataset we used (recall the discussion in Section 4.6). This yielded many short sequences when we applied SCSV2 or SCSV1, and there was not enough memory to effectively execute SCSV2 or SCSV1. Hence,
Fig. 27 Total times per ρ, on a Hybrid System, σ = 40
Fig. 29 Total times per σ, on SD Card, ρ = 4, σ = [5 . . . 480]
Fig. 28 Total times per σ, on SD Card, ρ = 3, σ = [5 . . . 480]
we used SCSV3 in these experiments, instead of SCSV2 or SCSV1.

Fig. 30 I/O times per σ, on SD Card, ρ = 4, σ = [5 . . . 160]
SSD. The results of sorting the stock-opening-rates data on an SSD are presented in figures 33–34. On the SSD, SCSV3 is slightly more efficient than Merge Sort, but the difference is not big, because there are many small chunks in this dataset. As mentioned earlier, increasing the number of chunks in a relation increases the in-memory computation of SCSV3. Note, however, that SCSV3 is, after all, more efficient than Merge Sort, and it avoids wearing out disk cells. Also note that SCSV3 has a low I/O time (Fig. 34); thus, it can be improved by using a stronger processor. These experiments show that SCSV3 can effectively handle real data. Efficiency is expected to be even higher for relations with more order (fewer chunks).

Hybrid System. On the hybrid system, Multi-Insertion Sort is more effective than Merge Sort. This is due to the large difference between the read rates of the SSD and the write rates of the HDD. We can see in figures 35 and 36 that SCSV3 is the most efficient
algorithm, for all the ρ values we tested. Also, Multi-Insertion Sort is more efficient than Merge Sort in these experiments. Note that in the hybrid system, the I/O rate is relatively low, so the in-memory computations have a smaller effect than the I/O on the total time.

SD Card / Micro SD Card. On the SD Card and on the Micro SD Card, algorithm SCSV3 was the most efficient. The results for the SD Card are depicted in figures 37 and 38. For the Micro SD Card, the results are similar. We see once more that on a device in which write rates are lower than read rates, avoiding writes speeds up the computation; so, in these experiments, SCSV3 is much more efficient than Merge Sort, and for small values of ρ, even Multi-Insertion Sort is more efficient than Merge Sort.
Fig. 31 Total times per ρ, on SD Card, σ = 40
Fig. 33 Running times per ρ, for the financial data, on SSD
Fig. 32 Total times as a function of σ, on Micro SD Card: ρ = 4, σ = [5 . . . 480]
5.2.3 Tests with a Larger Memory

The tests presented so far were conducted with the main memory limited to 128 MB. To demonstrate that similar results are achieved also with a bigger memory, we ran the tests on an SSD with the memory limited to 1 GB and synthetic relations of size 3 GB. The behavior of the algorithms remains the same as in the previous experiments. The results are presented in Fig. 39 and Fig. 40.
5.3 Summary

We now summarize the results of the tests and explain when to use each algorithm. We consider three parameters: the device, the size of the relation with respect to the memory, and how far the relation is from being sorted. The devices are SSD, hybrid system and SD Card. The memory sizes are small (ρ ≈ 2), medium (ρ ≈ 4) and large (ρ ≈ 10). The sort levels are nearly sorted (σ ≤ 160), partially sorted (160 < σ ≤ 480) and unsorted (σ > 480). Note that testing the sort level can
Fig. 34 I/O times per ρ, for the financial data, on SSD
be done efficiently, in sub-linear time, using property testing; see [5].

SSD. On the SSD, SCSV2 is the most efficient algorithm for nearly-sorted relations. For small and midsize partially-sorted relations, SCSV2 is the preferred algorithm. For small unsorted relations, Multi-Insertion Sort is the most efficient. (In the following tables we refer to Multi-Insertion Sort as MI Sort.) See Table 1.
          Nearly sorted   Partially sorted   Unsorted
Small ρ   SCSV2           SCSV2              MI Sort
Medium ρ  SCSV2           SCSV2              Merge Sort
Large ρ   SCSV2           Merge Sort         Merge Sort

Table 1 Recommended algorithm on SSD.
Fig. 35 Running times as a function of ρ, when sorting the financial data on a Hybrid System
Fig. 37 Total times for ρ values, when sorting the financial data on SD Card
Fig. 36 I/O times as a function of ρ, when sorting the financial data on a Hybrid System

Fig. 38 I/O times for ρ values, on SD Card
Hybrid System. When using a hybrid system, Multi-Insertion Sort (MI Sort) outperforms Merge Sort, except for relations that are very big. So, for unsorted relations, Multi-Insertion Sort is the best algorithm. Also, for small partially-sorted relations, Multi-Insertion Sort is the recommended algorithm. For the other partially-sorted relations and for nearly-sorted relations, SCSV2 is the recommended algorithm, in terms of efficiency and of avoiding cell wearing. See Table 2.

SD Card / Micro SD Card. The algorithms perform similarly on the SD Card and the Micro SD Card. SD Cards have slow I/O. In our tests, SCSV2 was the best algorithm for nearly-sorted and partially-sorted relations, in terms of efficiency and of avoiding cell wearing. For unsorted relations, Multi-Insertion Sort is the preferred algorithm. See Table 3.
          Nearly sorted   Partially sorted   Unsorted
Small ρ   SCSV2           MI Sort            MI Sort
Medium ρ  SCSV2           SCSV2              MI Sort
Large ρ   SCSV2           SCSV2              Merge Sort

Table 2 Recommended methods on Hybrid System.
          Nearly sorted   Partially sorted   Unsorted
Small ρ   SCSV2           SCSV2              MI Sort
Medium ρ  SCSV2           SCSV2              Merge Sort
Large ρ   SCSV2           SCSV2              Merge Sort

Table 3 Recommended algorithms on SD Card and Micro SD Card.
For all the devices, the I/O time of SCSV2 was similar to that of SCSV1 and was lower than the I/O times
Fig. 39 Total times as a func. of σ, when sorting 3GB of data on an SSD, using 1GB of main memory
Fig. 40 I/O times as a function of σ, when sorting 3GB of data on an SSD, using 1GB of main memory
of the other algorithms. So, a faster processor will increase the efficiency of SCSV2 in comparison to the other algorithms.
6 Conclusion

The recent rapid growth in the prevalence of devices with flash storage, such as smartphones and servers in data centers, and the need to store and process data on such devices, warrant the development of algorithms that are adapted to flash-based secondary storage. On flash disks, writing data leads to cell wearing, which causes disk degradation and reduces the lifespan of the drive. Furthermore, in many flash-based storage devices, write rates are lower than read rates, due to the need to erase cells before they can be programmed (rewritten) and the restriction that erase operations can only be applied to entire blocks. Also, frequently the ratio of the memory size to the disk size is small in instruments with a flash disk, due to cost considerations. Data processing algorithms should be designed to take these features of flash disks into account.
In this paper we studied the problem of external sorting on flash drives while avoiding writing intermediate results to the disk, to alleviate the problems of disk degradation, slow writes and insufficient space for intermediate results. The algorithms are designed to cope with important common cases—mainly, relations that are nearly sorted or that are larger than the main memory by a small factor. Such relations are common in distributed processing and in mobile devices. We introduced Multi-Insertion Sort, which is designed to cope with midsize relations, and three SCS algorithms that are adapted to cope with nearly-sorted relations.

We tested our algorithms and compared them to Merge Sort on several storage devices. In all the tests, our algorithms did not write any intermediate results, while Merge Sort wrote intermediate sorted runs whose total size equals the size of the input relation. For relations that are partially sorted, the three versions of SCS (in particular, SCSV2 for data without perturbations and SCSV3 for data with perturbations) were faster than Merge Sort, mainly on the SD Card, the Micro SD Card and the hybrid system. For midsize relations, which are common in hand-held devices, SCSV2 efficiently sorts partially-sorted relations, and Multi-Insertion Sort can be used in the other cases, with running times that in the worst case are not much higher than those of Merge Sort, and without wearing out cells.

To summarize, for midsize or nearly-sorted relations, our algorithms outperform Merge Sort and increase the life expectancy of the device by avoiding cell wearing. For devices in which writing is slower than reading, such as hybrid systems and SD Cards, our algorithms outperform Merge Sort for midsize relations. For small relations that fit into memory, our methods can reduce the memory usage to a fraction of the relation size while still avoiding writing intermediate results to the disk. The algorithms also use less disk space, which may be important in devices with a small disk. Future work includes the development of methods to increase the utilization of multiple cores when executing our algorithms.
Acknowledgements This research was supported in part by the Israel Science Foundation (Grant 1467/13) and by the Israeli Ministry of Science and Technology (Grant 3-9617). We thank the anonymous reviewers for their insightful comments and suggestions.
References

1. D. Ajwani, I. Malinger, U. Meyer, and S. Toledo. Characterizing the performance of flash memory storage devices and its impact on algorithm design. In Proc. of the 7th International Conf. on Experimental Algorithms, pages 208–219, Provincetown, MA, USA, 2008. Springer-Verlag.
2. M.-C. Albutiu, A. Kemper, and T. Neumann. Massively parallel sort-merge joins in main memory multi-core database systems. Proc. VLDB Endow., 5(10):1064–1075, 2012.
3. D. Aldous and P. Diaconis. Longest increasing subsequences: from patience sorting to the Baik-Deift-Johansson theorem. Bulletin of the American Mathematical Society, 36(4):413–432, 1999.
4. C. Balkesen, G. Alonso, J. Teubner, and M. T. Özsu. Multi-core, main-memory joins: Sort vs. hash revisited. Proc. VLDB Endow., 7(1):85–96, 2013.
5. S. Ben-Moshe, E. Fischer, M. Fischer, Y. Kanza, A. Matsliah, and C. Staelin. Detecting and exploiting near-sortedness for efficient relational query evaluation. In Proc. of the 14th International Conf. on Database Theory, pages 256–267, Uppsala, Sweden, 2011. ACM.
6. M. A. Bender, M. Farach-Colton, R. Johnson, B. C. Kuszmaul, D. Medjedovic, P. Montes, P. Shetty, R. P. Spillane, and E. Zadok. Don't thrash: how to cache your hash on flash. In Proc. of the 3rd USENIX Conf. on Hot Topics in Storage and File Systems, Portland, OR, 2011. USENIX Association.
7. P. Bonnet and L. Bouganim. Flash device support for database management. In Fifth Biennial Conf. on Innovative Data Systems Research, pages 1–8, Asilomar, CA, USA, 2011.
8. S. Brähler. Analysis of the Android architecture. Master's thesis, Karlsruhe Institute of Technology, 2010.
9. B. Chandramouli and J. Goldstein. Patience is a virtue: Revisiting merge and sort on modern processors. In Proc. of the 2014 ACM SIGMOD International Conf. on Management of Data, pages 731–742, Snowbird, Utah, USA, 2014.
10. F. Chen, D. A. Koufaty, and X. Zhang. Understanding intrinsic characteristics and system implications of flash memory based solid state drives. In Proc. of the Eleventh International Joint Conf. on Measurement and Modeling of Computer Systems, pages 181–192, Seattle, WA, USA, 2009. ACM.
11. J. Chhugani, A. D. Nguyen, V. W. Lee, W. Macy, M. Hagog, Y.-K. Chen, A. Baransi, S. Kumar, and P. Dubey. Efficient implementation of sorting on multi-core SIMD CPU architecture. Proc. VLDB Endow., 1(2):1313–1324, 2008.
12. T. H. Cormen, C. Stein, R. L. Rivest, and C. E. Leiserson. Introduction to Algorithms. McGraw-Hill Higher Education, 2nd edition, 2001.
13. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.
14. B. Debnath, S. Sengupta, and J. Li. FlashStore: High throughput persistent key-value store. Proc. VLDB Endow., 3(1-2):1414–1425, 2010.
15. D. J. DeWitt, J. Do, J. M. Patel, and D. Zhang. Fast peak-to-peak behavior with SSD buffer pool. In Proc. of the International Conf. on Data Engineering, pages 1129–1140, Washington, DC, USA, 2013. IEEE Computer Society.
16. J. Do, Y.-S. Kee, J. M. Patel, C. Park, K. Park, and D. J. DeWitt. Query processing on smart SSDs: opportunities and challenges. In Proc. of the 2013 ACM SIGMOD International Conf. on Management of Data, pages 1221–1230, New York, NY, USA, 2013.
17. J. Do and J. M. Patel. Join processing for flash SSDs: remembering past lessons. In Proc. of the Fifth International Workshop on Data Management on New Hardware, pages 1–8, Providence, Rhode Island, 2009.
18. J. Do, D. Zhang, J. M. Patel, D. J. DeWitt, J. F. Naughton, and A. Halverson. Turbocharging DBMS buffer pool using SSDs. In Proc. of the 2011 ACM SIGMOD International Conf. on Management of Data, pages 1113–1124, Athens, Greece, 2011.
19. M. L. Fredman. On computing the length of longest increasing subsequences. Discrete Mathematics, 11(1):29–35, 1975.
20. R. Friedman, I. Hefez, Y. Kanza, R. Levin, E. Safra, and Y. Sagiv. Wiser: A web-based interactive route search system for smartphones. In Proc. of the 21st International Conf. on World Wide Web, pages 337–340, Lyon, France, 2012.
21. H. Garcia-Molina, J. D. Ullman, and J. Widom. Database Systems: The Complete Book. Prentice Hall Press, Upper Saddle River, NJ, USA, 2nd edition, 2008.
22. R. Gotsman and Y. Kanza. A dilution-matching-encoding compaction of trajectories over road networks. GeoInformatica, 19(2):331–364, 2015.
23. G. Graefe. Implementing sorting in database systems. ACM Comput. Surv., 38(3):1–37, 2006.
24. G. Graefe. The five-minute rule 20 years later (and how flash memory changes the rules). Commun. ACM, 52(7):48–59, 2009.
25. G. Graefe, S. Harizopoulos, H. A. Kuno, M. A. Shah, D. Tsirogiannis, and J. L. Wiener. Designing database operators for flash-enabled memory hierarchies. IEEE Data Eng. Bull., 33(4):21–27, 2010.
26. T. Härder. A scan-driven sort facility for a relational database system. In Proc. of the 3rd International Conf. on Very Large Data Bases, pages 236–244, Tokyo, Japan, 1977. VLDB Endowment.
27. X.-Y. Hu, E. Eleftheriou, R. Haas, I. Iliadis, and R. Pletka. Write amplification analysis in flash-based solid state drives. In Proc. of SYSTOR: The Israeli Experimental Systems Conf., pages 10:1–10:9, Haifa, Israel, 2009. ACM.
28. W.-H. Kang, S.-W. Lee, and B. Moon. Flash-based extended cache for higher throughput and faster recovery. Proc. VLDB Endow., 5(11):1615–1626, 2012.
29. D. E. Knuth. Length of strings for a merge sort. Commun. ACM, 6(11):685–688, 1963.
30. D. E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison Wesley Longman Publishing, Redwood City, CA, USA, 2nd edition, 1998.
31. I. Koltsidas and S. D. Viglas. Flashing up the storage layer. Proc. VLDB Endow., 1(1):514–525, 2008.
32. I. Koltsidas and S. D. Viglas. Data management over flash memory. In Proc. of the 2011 ACM SIGMOD International Conf. on Management of Data, pages 1209–1212, Athens, Greece, 2011.
33. S. C. Kwan and J.-L. Baer. The I/O performance of multiway Mergesort and tag sort. IEEE Transactions on Computers, C-34(4):383–387, 1985.
34. P.-A. Larson. External sorting: Run formation revisited. IEEE Trans. on Knowl. and Data Eng., 15(4), 2003.
35. P.-A. Larson and G. Graefe. Memory management during run generation in external sorting. In Proc. of the 1998 ACM SIGMOD International Conf. on Management of Data, pages 472–483, Seattle, Washington, USA, 1998.
36. S.-W. Lee and B. Moon. Design of flash-based DBMS: an in-page logging approach. In Proc. of the 2007 ACM SIGMOD International Conf. on Management of Data, pages 55–66, Beijing, China, 2007.
37. S.-W. Lee, B. Moon, C. Park, J.-M. Kim, and S.-W. Kim. A case for flash memory SSD in enterprise database applications. In Proc. of the 2008 ACM SIGMOD International Conf. on Management of Data, pages 1075–1086, Vancouver, Canada, 2008. ACM.
38. R. Levin and Y. Kanza. Stratified-sampling over social networks using MapReduce. In Proc. of the ACM SIGMOD International Conf. on Management of Data, pages 863–874, Snowbird, Utah, USA, 2014.
39. X. Liu and K. Salem. Hybrid storage management for database systems. Proc. VLDB Endow., 6(8):541–552, 2013.
40. Y. Liu, Z. He, Y.-P. P. Chen, and T. Nguyen. External sorting on flash memory via natural page run generation. The Computer Journal, 54(11), 2011.
41. C. Mallows. Problem 62-2, patience sorting. SIAM Review, 4(2):148–149, 1962.
42. S. Mann. Wearable computing: a first step toward personal imaging. Computer, 30(2):25–32, 1997.
43. D. Myers. On the use of NAND flash memory in high-performance relational databases. Master's thesis, Massachusetts Institute of Technology (MIT), 2008. http://hdl.handle.net/1721.1/43070.
44. C. Nyberg, T. Barclay, Z. Cvetanovic, J. Gray, and D. Lomet. AlphaSort: A cache-sensitive parallel external sort. The VLDB Journal, 4(4):603–628, 1995.
45. H. Pang, M. J. Carey, and M. Livny. Memory-adaptive external sorting. In Proc. of the 19th International Conf. on Very Large Data Bases, pages 618–629, San Francisco, CA, USA, 1993. Morgan Kaufmann Publishers.
46. B. Salzberg. Merging sorted runs using large main memory. Acta Inf., 27(3):195–215, 1989.
47. G. Soundararajan, V. Prabhakaran, M. Balakrishnan, and T. Wobber. Extending SSD lifetimes with disk-based write caches. In Proc. of the 8th USENIX Conf. on File and Storage Technologies, Berkeley, CA, USA, 2010. USENIX Association.
48. D. Tsirogiannis, S. Harizopoulos, M. A. Shah, J. L. Wiener, and G. Graefe. Query processing techniques for solid state drives. In Proc. of the 35th SIGMOD International Conf. on Management of Data, pages 59–72, Providence, Rhode Island, USA, 2009.
49. M. Weiser. Some computer science issues in ubiquitous computing. Commun. ACM, 36(7):75–84, 1993.
50. N. Zhang, J. Tatemura, J. Patel, and H. Hacigumus. Re-evaluating designs for multi-tenant OLTP workloads on SSD-based I/O subsystems. In Proc. of the ACM SIGMOD International Conf. on Management of Data, pages 1383–1394, Snowbird, Utah, USA, 2014.
51. W. Zhang and P.-A. Larson. Dynamic memory adjustment for external Mergesort. In Proc. of the 23rd International Conf. on Very Large Data Bases, pages 376–385, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers.
52. W. Zhang and P.-A. Larson. Buffering and read-ahead strategies for external Mergesort. In Proc. of the 24th International Conf. on Very Large Data Bases, pages 523–533, San Francisco, CA, USA, 1998.
53. L. Zheng and P.-A. Larson. Speeding up external Mergesort. IEEE Trans. on Knowl. and Data Eng., 8(2):322–332, 1996.
54. Y. Zheng, X. Xie, and W.-Y. Ma. GeoLife: A collaborative social networking service among user, location and trajectory. IEEE Data Engineering Bulletin, 33(2):32–40, 2010.
APPENDIX A – Proofs

Proof of Proposition 1.

Proof Consider element AO[i], where i ≤ k/2. If we do not swap AO[i] with AI[i], then AO[i] ≤ AI[i]. In this case, AO[i] is smaller than the elements AO[i+1], . . . , AO[k], because AO is sorted. Also, AO[i] is smaller than the elements AI[i+1], . . . , AI[k], because AI[i] is smaller than these elements due to the order of AI. Hence, there are at least k elements that are greater than AO[i] in the two arrays AO and AI. If we swap AO[i] with AI[i], then before the swap the elements AI[i+1], . . . , AI[k] are greater than AI[i], due to the order of AI. The elements AO[i+1], . . . , AO[k] are greater than AI[i], because they are greater than AO[i] and AO[i] > AI[i], which is the cause of the swap. Consequently, prior to the swap there are at least k elements greater than AI[i] in the two arrays, and after the swap this holds for AI[i]. Based on the above two cases, {AO[1], . . . , AO[k/2]} ⊂ S. Similar arguments show that for every element among AI[k/2], . . . , AI[k] there are at least k elements smaller than it, and hence, {AI[k/2], . . . , AI[k]} ∩ S = ∅.

Finally, we show that after the swaps the arrays are sorted. Consider AO[i] and AO[i+1], for some 1 ≤ i ≤ k−1. Before the swapping phase, AO[i] ≤ AO[i+1], since AO is initially sorted. Thus, if neither of them is swapped, AO[i] ≤ AO[i+1] holds. If both AO[i] and AO[i+1] are swapped, then the claim holds because AI is initially sorted, so AI[i] ≤ AI[i+1]. If just AO[i] and AI[i] are swapped, then AI[i] < AO[i], so AI[i] < AO[i] ≤ AO[i+1], and the claim holds. If just AO[i+1] and AI[i+1] are swapped, then AO[i] ≤ AI[i], because they were not swapped, and AI[i] ≤ AI[i+1] because AI is sorted; thus AO[i] ≤ AI[i+1] before the swap, and the claim holds. Similar arguments show that AI is sorted after the swaps.

Proof of Lemma 1.

Proof The union of two different MSCs is not an MSC, because otherwise it would contradict maximality—each MSC is a proper subset of the union but cannot be a proper subset of an MSC. The intersection of any two MSCs is an empty set, because otherwise their union would be an MSC, in contradiction to the previous statement.

Proof of Proposition 2.

Proof To prove the existence of a partition, consider the following recursive process. When given a sorted (ascending) sequence, return as a partition a set comprising this sequence. When given a sequence t1, . . . , tm that contains two consecutive tuples tj and tj+1 such that tj[K] > tj+1[K], apply the process recursively on t1, . . . , tj and tj+1, . . . , tm, and return the union of the partitions that were computed by the recursive calls. The recursive process provides a partition of R, because tuples of R are never discarded or duplicated in the process. Hence, the created sets are disjoint and cover R. Obviously, every set in the result is a sorted sequence, since otherwise it would have been partitioned. Maximality follows from the fact that the partition is always into sequences t1, . . . , tj and tj+1, . . . , tk such that tj[K] > tj+1[K]; thus, extending each such sequence disobeys the order. To show that the partition is unique, suppose there were two different partitions of R into MSCs. Then there would be a tuple t such that t is in one MSC of the first partition and in a different MSC of the second partition. In this case, the intersection of two different MSCs is not empty, since it contains t, in contradiction to Lemma 1.

Proof of Proposition 3.

Proof (Sketch) For two tuples ti and tj that are in the same MSC, their order is the order in the MSC and hence, it is according to the sort key. If ti is in Mi, tj is in Mj and there is an edge from Mi to Mj, then Mi precedes Mj. Suppose Mi totally precedes Mj. Let t′i be the last tuple of Mi and t′j be the first tuple of Mj. Then, t′i[K] ≤ t′j[K], and thus, ti[K] ≤ t′i[K] ≤ t′j[K] ≤ tj[K]. Suppose Mi overlaps and precedes Mj. Let (t′i, t′j) be the connectors of Mi and Mj. Then, ti[K] ≤ t′i[K] ≤ t′j[K] ≤ tj[K]. The rest of the proof is by induction, showing that if ti is in Mi, tj is in Mj and there is a path in GR from Mi to Mj, then ti[K] ≤ tj[K]. The induction is on the length of the path, where the case of Mi and Mj that are connected by an edge is the basis of the induction.

Proof of Proposition 4.

Proof From Proposition 3 it follows that the underlying sequence of Pb is a sorted sequence. Given a sorted subsequence of R, comprising t′′1, . . . , t′′k′′, we partition it by assigning each tuple to the MSC that contains it. Recall that the partition into MSCs is unique (Proposition 2). Let Mi1, . . . , Min be the MSCs to which we assigned tuples of the given sorted subsequence. Each pair of consecutive tuples t′′i, t′′i+1 in the subsequence satisfies one of the following cases. (1) Tuples t′′i, t′′i+1 are in the same MSC. (2) Tuples t′′i, t′′i+1 belong to MSCs Mi and Mj such that Mi totally precedes Mj. (3) Tuples t′′i, t′′i+1 belong to MSCs Mi and Mj such that Mi overlaps and precedes Mj. One of these cases must hold. If
Case 1 does not hold, then t′′i and t′′i+1 are tuples in two different MSCs such that t′′i appears in R before t′′i+1 and t′′i[K] ≤ t′′i+1[K]. In this case, either the MSCs do not overlap (Case 2) or they overlap (Case 3). Thus, t′′1, . . . , t′′k′′ yields a path in GR. The MSCs Mi1, . . . , Min are, therefore, part of a path in GR. Since the weight of this path is not larger than the weight of Pb, the number of tuples in t′′1, . . . , t′′k′′ does not exceed the number of tuples that Pb comprises.
APPENDIX B – I/O Rates

To illustrate the performance of the storage devices on which we ran the tests, we used CrystalDiskMark (http://crystalmark.info/software/CrystalDiskMark/index-e.html), a well-known hard-drive benchmark. We performed four different tests on our devices, to examine their characteristics: (1) Sequential read test: 4 GB of data were read from the device sequentially; the block size was 1024 KB. (2) Sequential write test: 4 GB of data were written to the device sequentially; the block size was 1024 KB. (3) Random read test: 4 GB of data were read from the device from random addresses; the block size was 512 KB. (4) Random write test: 4 GB of data were written to the device at random addresses; the block size was 512 KB. Each test was performed 10 times, the average results were taken and the rate was calculated. The results are presented in Table 4.

Device          Total Size   Test           Read Rate    Write Rate
HDD             4000 MB      Seq 1024KB     66.8 MB/s    68.1 MB/s
                             Random 512KB   22.9 MB/s    25.5 MB/s
SSD             4000 MB      Seq 1024KB     223.5 MB/s   149.5 MB/s
                             Random 512KB   203.3 MB/s   122.4 MB/s
SD Card         4000 MB      Seq 1024KB     18.9 MB/s    8.9 MB/s
                             Random 512KB   18.6 MB/s    0.6 MB/s
Micro SD Card   4000 MB      Seq 1024KB     18.9 MB/s    11.3 MB/s
                             Random 512KB   18.9 MB/s    0.9 MB/s

Table 4 I/O rates of flash storage devices.
We can see in Table 4 that the HDD reading rates are almost equal to the writing rates (for both random and sequential access). This is not surprising, as there is no difference between reading and writing on an HDD. Note that on the HDD the writing rates are slightly better than the reading rates. This is due to the buffering mechanism, which we could not control in this benchmark (as opposed to our experiments). The random I/Os are less efficient than the sequential I/Os on the HDD, due to the large seek time of random access, relative to the access times of sequential access. As opposed to the sequential and random read rates of the HDD, the sequential and random read rates are similar in both the SD Card and the Micro SD
Card. According to [1], the read performance depends on the block size, but usually not on whether the access pattern is random or sequential. We can also see that the sequential-read rate is higher than the sequential-write rate in both devices.

We can see that the SSD performance is good relative to the other devices. The SSD optimizes performance at two different levels. (1) At the hardware level, parallelism is used to perform read, write and erase operations on different planes independently. (2) At the firmware level, the controller of the SSD optimizes the I/O operations by using the garbage collector to erase blocks as a background process, rather than immediately. Table 4 shows that the random write performance is relatively poor, due to write amplification caused by garbage collection. When writing randomly, the overhead of moving valid pages to other blocks leads to write amplification, which increases the required bandwidth and shortens the time during which the SSD reliably operates [27]. The sequential read rate is slightly better than the random read rate, due to optimizations performed by the controller.