Reducing fragmentation impact with forward knowledge in backup systems with deduplication
Michal Kaczmarczyk
Cezary Dubnicki
9LivesData, LLC {kaczmarczyk, dubnicki}@9livesdata.com
Abstract
Deduplication of backups is very effective in saving storage, but it may also cause a significant restore slowdown. This problem is caused by data fragmentation, where logically continuous but duplicate data is not placed sequentially on the disk. Two types of fragmentation introduce a high restore penalty: inter-version fragmentation, caused by duplicates present in multiple versions of the same backup, and internal fragmentation, caused by duplicates present in a single backup stream. This paper introduces the Limited Forward Knowledge cache (LFK), which reduces the internal fragmentation problem. The cache performs block eviction based on the available limited forward knowledge. As keeping the full knowledge requires memory proportional to the size of a backup, we limit the forward knowledge to an 8GB window and show that such a limitation does not impact the performance significantly. In order to further increase the LFK effectiveness in the presence of inter-version fragmentation, we combined this algorithm with an existing solution called Context-Based Rewriting (CBR) (Kaczmarczyk et al. 2012). Our evaluation with real-world traces shows that data fragmentation results in an average 42% slowdown for backups stored on a single disk. LFK alone reduces this drop to 21%. CBR+LFK eliminates it completely, so the restore speed is equal to reading non-duplicated data. In a multi-disk setup the standard approach suffers from an 83% restore performance drop. The combined algorithms reduce this drop to 35%, assuring a 4 times better restore bandwidth.
Categories and Subject Descriptors E.5 [Files]: Backup/recovery
General Terms Algorithms, Performance
Keywords deduplication, fragmentation, backup, restore
1. Introduction
A purpose-built backup appliance (PBBA) is a disk-based system used as a target for backup. Practically all PBBAs support deduplication. IDC forecasts more than a doubling of the value of the PBBA market in a few years: from $2.4 billion in 2011 to $5.8 billion in 2016 (Amatruda 2012). This dramatic growth is also clearly visible in the survey by Whitehouse et al. (2010): 76% of over 300 respondents implemented or planned to implement a disk-based deduplication solution. As shown by numerous studies (Dubnicki et al. 2009; Quinlan and Dorward 2002; Muthitacharoen et al. 2001; Zhu et al. 2008), such an approach significantly reduces both the backup time and the storage space required. Today, storage systems with data deduplication deliver new records of backup bandwidth (Preston 2010a; NEC HS8-4000) and the market is being flooded with various solutions from many vendors (EMC; NEC; IBM; Quantum; HP; Symantec; ExaGrid; Aronovich et al. 2009). As a result, deduplication has become one of the indispensable features of backup systems (Asaro and Biggar 2007; Babineau and Chapa 2010).
The key requirement for all backup systems is to enable fast restore of data in case of emergency. In fact, the Recovery Time Objective (RTO) is one of the main parts of the contract signed between the backup system integrator and the final customer. In most modern deduplication systems, before the data is written, it is chunked into relatively small blocks (e.g. 8KB). A block is stored on the disk only after its uniqueness is verified; otherwise, the address of an already existing block is used. Unfortunately, such a backup pattern results in data fragmentation and, together with the minimal effective read prefetch from a disk (2MB or more (Nam et al. 2012; Lillibridge et al. 2013; Kaczmarczyk et al. 2012)), ends up in a significant increase in the data restore time.
We identified three types of fragmentation based on the origin of the base data block used to eliminate subsequent duplicated blocks: (1) internal stream fragmentation — caused by the same block appearing many times in a single backup,
(2) inter-version fragmentation — caused by periodical backups of similar data in subsequent versions (daily, weekly, monthly), and (3) global fragmentation — caused by the same blocks appearing in backups with no logical connection to each other.
Figure 1. The schematic cost of each kind of fragmentation with in-line dedup (based on our experiments).
As shown in Figure 1, the impact of inter-version fragmentation grows with each new version of a backup stored. The most affected version is the latest one in systems deduplicating against older backups (in-line deduplication) or the oldest one in systems deduplicating against newer backups (off-line deduplication). Since usually the latest backup is restored, the latter approach may seem attractive, but it suffers from many other problems such as reduced write speed and an increased need for storage. In practice, most systems on the market today do in-line deduplication.
The impact of internal stream fragmentation does not grow with the number of versions, but it is often large and affects even a single stream version. Our experiments show that, due to internal fragmentation only, a deduplicated backup restore using 512MB of the most common LRU cache (Wallace et al. 2012; Nam et al. 2012; Lillibridge et al. 2013; Zhu et al. 2008) is up to 80% slower than a sequential backup restore. This problem can be solved by using huge, expensive caches; the same data set shows an almost 50% performance increase with an infinite cache. Another possibility is to optimize the on-disk location of the data appearing many times in a stream, to limit the number of disk seeks on reading. However, it is difficult to make this idea effective. Another intuitive choice is to cache the most common blocks. Unfortunately, after testing, it turned out that such a solution does not work well and is often even worse than the LRU algorithm.
In this work we strive not only to avoid the bandwidth reduction caused by internal fragmentation without affecting the deduplication ratio, but also to use the duplicate data to achieve a restore performance greater than in systems with no deduplication. This is enabled by using additional information about the restore block order, already present in backup systems and called the backup recipe (Lillibridge et al. 2013). Such knowledge facilitates extremely effective usage of the available cache memory. As our simulations suggest, quite limited forward knowledge is usually enough to achieve results comparable to those delivered with an infinite cache.
Figure 2. Internal stream fragmentation. An example of the restore process of blocks 402-438 from some initial backup.
The remainder of the paper contains the motivation followed by a description of our solution. Next, we confirm the effectiveness of the new algorithm in simulation experiments on 6 real-world traces. This evaluation is done for both single-version and multiple-version backups. For the latter case, we combine our algorithm with CBR (Kaczmarczyk et al. 2012), designed for dealing with inter-version fragmentation. Finally, we compare our approach with the related work and propose directions for the future.
2. Motivation
2.1 Problem magnitude
Figure 2 illustrates the problem of internal fragmentation. A single backup is stored to a system with deduplication. As the system stores only one copy of each block, the backup stream is no longer stored sequentially on the disk (see the physical block addresses in Figure 2). This results in a large number of disk seeks needed for reading the data and, in consequence, in low read performance. A restore of such a backup (with a limited cache for 100 blocks) is compared to a restore from a system without deduplication, assuming a prefetch of 4 blocks in both cases.
The internal fragmentation increases the number of disk seeks by 50%. On the other hand, if duplicate blocks are kept in the cache long enough, they can actually improve the restore performance. With an unlimited cache in the above example, none of the duplicate blocks stored in the previous part of the stream require disk accesses. As a result, reading the part of the stream shown in Figure 2 requires only 7 disk accesses; 30% fewer than in the case of a sequential restore. Since the cache memory is always limited, such an effect requires a much more effective cache replacement policy than LRU. The optimal algorithm already exists for the page replacement policy and is called Bélády's optimal algorithm (Belady 1966). To achieve the optimal cache usage, this algorithm discards on eviction the page that will not be needed for the longest time in the future. Even though for the majority of scenarios the algorithm is not implementable, in the case of a backup restore it could use the existing backup recipe describing the future blocks to be read. Note that in such a case the algorithm loses its optimality, as eviction is based on blocks while the smallest disk read (prefetch) size is hundreds of times larger. Still, this solution (referenced as adapted Bélády) indicates well the achievable performance level.
Figure 3. Impact of different kinds of fragmentation and cache algorithm on the latest backup restore.
[Curves: (A) internal fragmentation only with adapted Bélády cache, (B) combined fragmentation with adapted Bélády cache, (C) internal fragmentation only with LRU cache, (D) combined fragmentation with LRU cache. Average results of 6 data sets; y-axis: % of max possible read bandwidth with no dups; x-axis: cache size.]
2.2 Impact on real data
To motivate our work, we performed simulations driven by 6 sets of traces gathered from users of a commercial system, HYDRAstor (NEC; Dubnicki et al. 2009). The experiments include the impact of internal fragmentation alone and the combined results with inter-version fragmentation. To show the potential improvement of full forward knowledge, we implemented the adapted Bélády cache (Belady 1966). All the results were gathered for different cache sizes and are shown as a percentage of the restore bandwidth achieved for a system without deduplication (assuming sequential data location and a cache sized to fit one prefetch only). The description of the data sets and the experimental methodology is given in Section 4.
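The eviction rule behind this adapted Bélády baseline can be summarized with a short sketch (our own illustration, assuming full future knowledge taken from the backup recipe; the structure and names below are ours, not the simulator's):

```cpp
// Adapted Belady baseline (illustrative sketch): evict the cached block whose
// next occurrence in the backup recipe is farthest in the future.
#include <cstdint>
#include <deque>
#include <iterator>
#include <map>
#include <unordered_map>

using BlockId = uint64_t;

struct AdaptedBelady {
    // Future positions of every block, pre-filled from the whole backup recipe.
    std::unordered_map<BlockId, std::deque<uint64_t>> future;
    std::multimap<uint64_t, BlockId> byNextUse;   // next position -> cached block
    std::unordered_map<BlockId, std::multimap<uint64_t, BlockId>::iterator> cached;
    size_t capacity;                              // cache size in blocks

    explicit AdaptedBelady(size_t cap) : capacity(cap) {}

    // Returns true on a cache hit; on a miss the caller reads the block (and its
    // whole prefetch) from disk before continuing with the next block.
    bool access(BlockId id, uint64_t pos) {
        auto& f = future[id];
        if (!f.empty() && f.front() == pos) f.pop_front();         // consume current use
        uint64_t next = f.empty() ? UINT64_MAX : f.front();

        auto c = cached.find(id);
        bool hit = (c != cached.end());
        if (hit) { byNextUse.erase(c->second); cached.erase(c); }  // re-key below
        else if (cached.size() >= capacity && capacity > 0) {
            auto victim = std::prev(byNextUse.end());              // farthest next use
            cached.erase(victim->second);
            byNextUse.erase(victim);
        }
        if (capacity > 0) cached[id] = byNextUse.emplace(next, id);
        return hit;
    }
};
```

Because the data still has to be read in whole prefetches while eviction works on single blocks, this adapted variant is no longer strictly optimal, as noted above.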
2.2.1 Internal stream fragmentation effect
With only a single stream present in the backup system, any performance drop during restore is a result of internal fragmentation. Curve C in Figure 3 shows that for all finite caches the average results are from 39% to 19% below the non-duplicated sequential restore (i.e. the level of 100%). This might suggest that increasing the LRU cache is a solution to the internal fragmentation problem. Unfortunately, such an approach is not effective and is expensive in terms of the additional memory required. On the other hand, curve A shows that the adapted Bélády cache with as little as 128MB of memory provides an average performance similar to the sequential restore, with even a 19% increase when the available memory grows up to 1GB. Those results show the potential of applying forward knowledge to eliminate the negative impact of internal fragmentation on average, i.e. across all traces (in some cases even the adapted Bélády cache stays below 100% with any cache smaller than 1GB).
2.2.2 Combined fragmentation effect
Curves B and D in Figure 3 show the combined impact of both types of fragmentation: the inter-version and the internal one. In fact, inter-version fragmentation occurs very often, as we usually keep multiple versions of the same backup and the subsequent versions differ very little. With the two types of fragmentation considered, the performance of the adapted Bélády cache is in most cases significantly reduced compared to the previous case of internal fragmentation only. Moreover, in many cases even an infinite cache cannot deliver the performance of a system with no deduplication. Depending on the cache size, the average LRU performance drop (curve D) is 37% to 51% compared to the non-duplicated sequential restore. This drop range for the adapted Bélády cache (curve B) is between 16% and 28%. As we can see, full forward knowledge helps a lot, but alone it cannot eliminate the negative combined effect of both fragmentation types.
2.2.3 Scalability issues
The single-disk case presented above is relatively easy to analyze; therefore, it is commonly used as a verification for defragmentation algorithms. In fact, the performance achieved this way is often too low even for storing a single stream within the requested backup window. This is the reason why all available backup systems deliver much higher write bandwidth by leveraging the performance of many disks used at the same time. In case of an emergency restore, the same disks are used to assure high restore bandwidth. In the realistic case of using 10 disks and a 512MB cache, LRU results in an average performance drop of 75% and 83% in the single- and multiple-version scenarios, respectively. Even though the base level of 100% is now the bandwidth of 10 sequential disks (about 1GB/s), the drop is substantial. Moreover, its size suggests that using more disks is a very ineffective solution to the problem of fragmentation.
Figure 4. The data restore process (scheme).
3. Limited Forward Knowledge
This section introduces a new cache replacement algorithm called Limited Forward Knowledge, designed specifically for dealing with internal fragmentation.
3.1 Desired properties of the final solution
The new solution should (1) provide a restore speed close to that achieved by the adapted Bélády cache, (2) preserve deduplication efficiency, and (3) require only limited modifications of the existing code. If trade-offs are necessary, they should be addressed by a range of available choices.
3.2 The idea
As internal fragmentation appears often, even limited forward knowledge is useful to keep in memory only those blocks which are scheduled to reappear in the near future. The idea itself is present in Bélády's algorithm (Belady 1966), but the major issue making it impractical is that, in general, such information is difficult or even impossible to get. In a backup system the situation is different, as backups are very big (tens or hundreds of GBs (Wallace et al. 2012)) and accessed in the same order in which they were written. Moreover, the restore process usually has access to complete knowledge about the blocks to be restored, known as the backup recipe (Lillibridge et al. 2013; Zhu et al. 2008). Even though the idea of using full forward knowledge (as in the adapted Bélády cache) is tempting, it requires significant amounts of additional memory. Fortunately, as our experiments show, similar restore performance can be delivered with limited forward knowledge. For further optimization, we propose dedicated data structures to keep the necessary information in a compact way.
3.3 Algorithm details
3.3.1 The general restore algorithm
A restore of a stream starts by receiving the stream identifier (see Figure 4). Even though such an operation unlocks access to all the metadata, usually only a small amount is restored, enough to assure a constant load fully utilizing the system with read requests. As the restore proceeds, additional metadata is read and more requests are issued. In our algorithm we would like to read more metadata in order to provide the limited forward knowledge. This information is used to build the oracle, the key component of our algorithm (see Figure 5).
Figure 5. The Limited Forward Knowledge cache.
3.3.2 The oracle
The oracle plays a crucial role in the final cache eviction decisions. It maps identifiers of all known blocks to be read to a sorted list of block positions at which they appear in the stream (see the oracle data in Figure 6). Each update with forward information adds an identifier of a block (if not present) and pushes the block position to the back of its list. When necessary, the structure returns for a given block the closest future position at which it will be required, or updates the most recently read block by removing its closest (current) position from the list of next block occurrences.
With this additional data, the oracle requires dedicated memory, separate from the memory where the cache data is kept. This requirement can be significantly lowered with two observations. First, the id of a block does not need to be 100% correct — e.g. for an 8KB block and 16GB of forward knowledge, a 64-bit hash (of a block content or block address) is perfectly enough to assure a 1 in 10 million chance of a collision. Second, the position can be approximate — i.e. 16 bits is enough for numbering 8MB sections starting from the beginning of a stream (with in-memory renumbering every 500GB of backup). Both optimizations are essential to keep the memory requirements limited, at only a minimal cost when compared to the exact solution.
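Both approximations can be sanity-checked with back-of-the-envelope arithmetic (our own calculation; the collision estimate uses the standard birthday bound):

```latex
n = \frac{16\,\mathrm{GB}}{8\,\mathrm{KB}} = 2^{21} \approx 2\cdot10^{6} \ \text{oracle entries}, \qquad
P_{\mathrm{collision}} \approx \frac{n^{2}}{2\cdot 2^{64}} = 2^{-23} \approx \frac{1}{8\cdot10^{6}}, \qquad
2^{16} \cdot 8\,\mathrm{MB} = 512\,\mathrm{GB}.
```

This is consistent with the roughly 1 in 10 million collision chance quoted above and with the in-memory renumbering needed about every 500GB of backup.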
Figure 6. Structures used by the LFK algorithm.
3.3.3 The details of cache organization
The LFK data cache (see Figure 5) is organized as a map with block addresses as keys and block data as values. The only difference from an LRU cache is the additional information kept. Instead of storing the most recent usage of each block, the closest future occurrence is kept and used to sort the LFK priority queue (see Figure 6). Most of the required operations are similar to the operations on an LRU priority queue. The only different operation is the binary insert, required when a block with a new closest occurrence is added.
Figure 7 shows an example of a block restore and cache eviction in two cases. In the case presented on the left, the block is found in the cache, and the only operation performed is the update of the restored block in both the cache and the oracle structures. In the case presented on the right, the block has to be read from the disk. Such a request is handled in the following steps: (1) read the block from the disk to the cache together with its prefetch; (2) update the cache priority queue with the read blocks and the information provided by the oracle; (3) remove the blocks exceeding the maximal cache size, starting with those with the most time to the next occurrence; (4) update the structures as if the block were found in the cache.
When there is no known section of next occurrence in the oracle for some of the prefetched blocks and there is still some space left in the cache, a few choices can be made. We can keep some of those blocks in the cache (for example by assigning an artificial, large section number) or free the memory for some other purpose, in case dynamic memory allocation to different structures is possible. Our experiments showed that the first option does not provide a noticeable performance gain. A better choice would be to use the additional memory for other system operations if necessary (such as restores, writes and background calculations), or to dynamically increase the oracle size, which would provide more forward information until all the available memory is efficiently used.
3.3.4 Memory requirements
Beside the standard LRU structures, the LFK algorithm requires some additional memory for the oracle with its forward knowledge. For this purpose we need 13 bytes per entry (hash: 8 bytes, section entry: 2 bytes, closed-hashing hash table overhead: 30% (Heileman and Luo 2005)), which gives 1.62MB per 1GB of forward knowledge. Additionally, there is a list of all block addresses waiting to be restored after they are received as forward knowledge, but before they are actually used to restore the data. As keeping whole block addresses in memory is expensive, we prefer to read them once again from the disk. Assuming the size of a block address is 20-100 bytes per 8KB block (Romanski et al. 2011), the whole operation would require reading 0.256%-1.5% more data from disks than in the original solution. This amount is acceptable and negligible.
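The structures from Figure 6 and the restore path from Figure 7 can be put together in a compact sketch (our own illustration of the idea, with hypothetical names; the production data structures described above are more compact):

```cpp
// LFK sketch (illustrative): the oracle maps a short block hash to the FIFO list
// of 8MB-section numbers in which the block will still be needed; the cache
// evicts blocks whose next known section is farthest away and does not keep
// blocks with no known future occurrence.
#include <cstdint>
#include <deque>
#include <iterator>
#include <map>
#include <unordered_map>
#include <utility>
#include <vector>

using BlockHash = uint64_t;   // 64-bit hash of block content or address
using Section   = uint16_t;   // 8MB sections, renumbered in memory every ~500GB
using BlockData = std::vector<char>;

struct Oracle {
    std::unordered_map<BlockHash, std::deque<Section>> occ;   // forward knowledge

    void add(BlockHash h, Section s) { occ[h].push_back(s); } // extend the window

    bool nextSection(BlockHash h, Section* out) const {       // closest future use
        auto it = occ.find(h);
        if (it == occ.end() || it->second.empty()) return false;
        *out = it->second.front();
        return true;
    }
    void popCurrent(BlockHash h) {                             // consume current use
        auto it = occ.find(h);
        if (it == occ.end()) return;
        it->second.pop_front();
        if (it->second.empty()) occ.erase(it);
    }
};

struct LfkCache {
    size_t capacityBlocks = 0;
    std::unordered_map<BlockHash, BlockData> data;
    std::multimap<Section, BlockHash> bySection;               // eviction order
    std::unordered_map<BlockHash, std::multimap<Section, BlockHash>::iterator> where;

    bool contains(BlockHash h) const { return data.count(h) != 0; }

    void keep(BlockHash h, BlockData blk, Section next) {      // insert or re-key
        drop(h);
        data[h] = std::move(blk);
        where[h] = bySection.emplace(next, h);
        while (data.size() > capacityBlocks) {                 // evict farthest section
            auto victim = std::prev(bySection.end());
            data.erase(victim->second);
            where.erase(victim->second);
            bySection.erase(victim);
        }
    }
    void rekey(BlockHash h, Section next) {
        auto w = where.find(h);
        if (w == where.end()) return;
        bySection.erase(w->second);
        where[h] = bySection.emplace(next, h);
    }
    void drop(BlockHash h) {
        auto w = where.find(h);
        if (w == where.end()) return;
        bySection.erase(w->second);
        where.erase(w);
        data.erase(h);
    }
};

// One restore step for block h (Figure 7). On a miss the caller has already read
// the block and its whole prefetch from disk and passes it in `prefetched`
// (empty on a hit); only blocks with a known future occurrence are kept.
void restoreOne(Oracle& oracle, LfkCache& cache, BlockHash h,
                const std::vector<std::pair<BlockHash, BlockData>>& prefetched) {
    if (!cache.contains(h)) {
        for (const auto& p : prefetched) {
            Section s;
            if (oracle.nextSection(p.first, &s)) cache.keep(p.first, p.second, s);
        }
    }
    oracle.popCurrent(h);                                // the block is restored now
    Section s;
    if (oracle.nextSection(h, &s)) cache.rekey(h, s);    // keep it for its next use
    else cache.drop(h);                                  // no known future occurrence
}
```

Keeping only blocks with a known future occurrence mirrors the behaviour used in the evaluation (Section 4.3); a real implementation could instead hand the unused memory to other operations, as discussed above.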
data set name        num. of bkps   single backup size   duplicate blocks*   internal duplicate blocks
GeneralFileServer    14             77.6GB               83.4%               17.4%
Wiki                 8              8.8GB                97.8%               17.8%
DevelopmentProject   7              13.7GB               96.4%               17.8%
UserDirectories      50             76.2GB               92.6%               19.3%
IssueRepository      7              18.4GB               85.4%               23.2%
Mail                 22             25.9GB               97.8%               32.6%
Table 1. Data sets characteristics, average values for each data set (* – data excluding the first backup).
3.4 Trade-offs
The major trade-off is the division of the available memory between the standard cache and the oracle. The best solution here includes a dynamic division based on the current data pattern. With such an approach, the system is able to extend the forward knowledge when the cache memory is not fully utilized, or to decrease it otherwise. Although this scenario moves us closer to the full forward knowledge results, it is much more complicated and, at least with our traces, brings a very limited performance gain (up to 3% on average).
One more trade-off is connected with the section size. Since, in order to save memory, the exact position of the next block occurrence is not kept, some evictions can be made not in the desired order. As this can happen only within the section with the longest time to the next occurrence, the achieved performance can never be lower than with the cache memory reduced by the size of a single section. In a typical scenario of a 512MB cache and an 8MB section size, the performance would never be worse than with a 504MB cache and exact knowledge about each block position.
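As a quick check of the memory figures from Section 3.3.4 (also used later in Section 5.1.2), the oracle overhead can be derived directly:

```latex
\frac{1\,\mathrm{GB}}{8\,\mathrm{KB}} = 131{,}072 \ \text{entries}, \qquad
131{,}072 \times 13\,\mathrm{B} \approx 1.62\,\mathrm{MB} \ \text{per 1GB of forward knowledge}, \qquad
8 \times 1.62\,\mathrm{MB} \approx 13\,\mathrm{MB} \approx 2.5\% \ \text{of a 512MB cache}.
```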
4. Methodology of experiments
We prepared a 12K-line C++ simulator able to test thousands of possible configurations in parallel. Based on real-world traces, the simulator produced the statistics which led to the conclusions presented in this work.
4.1 Data sets and testing scenarios
To diagnose the problem of fragmentation and verify the proposed algorithm, we used 6 sets of weekly full backup traces representing over 5.7TB of data, gathered from users of a commercial system HYDRAstor (NEC). Table 1 contains the characteristics of those traces, including the individual deduplication level based on Rabin fingerprinting (Rabin 1981) with an average block size of 8KB (the most popular choice). Even though the traces differ a lot from each other (data pattern, size, number of backups, deduplication level etc.), we decided to present most of the final results in the form of averages over all 6 sets. This approach was used to simplify the presentation and due to the page limit. Detailed results are available in (Kaczmarczyk 2015).
Figure 7. Reading a block already present in the cache (left) and not present in the cache, with the eviction policy (right).
We tested two different scenarios: (1) single-version — only one (the last) backup from a set is loaded into the system, so only internal fragmentation occurs; and (2) multi-version — all backups from a data set are loaded one after another, causing both internal and inter-version fragmentation. The single-version result can be achieved only by backup systems with off-line deduplication (preserving the latest copy of a block) or by archive systems keeping only one version of the data. The restore performance is always measured for the latest backup (the only one in the single-version case), usually the most likely to be restored (Whitehouse et al. 2010). The maximal result for each backup can be achieved in the single-version scenario (no inter-version fragmentation) with an infinite cache (no internal fragmentation). In such a setup the whole backup is placed in one continuous area in the order of reading, and blocks, once read, are never evicted from the cache.
4.2 Write simulation
We simulate a write process assuming in-line duplicate elimination (the most popular approach in today's backup systems (Amatruda 2012)),
with locality-preserving block placement (Zhu et al. 2008) and with new blocks stored after the currently occupied area. This approach preserves the inter-version fragmentation in the way it appears in real-world systems. The data used for the simulations was chunked using Rabin fingerprinting into blocks of an average size of 8KB. In the base case we assume a storage system with a single large disk which is empty at the beginning of each measurement. Experiments with many spindles are described in Section 5.4.
4.3 Restore simulation
To verify the effectiveness of prefetching in an environment with deduplication, we performed a simulation with a fixed prefetch size varied from 512KB up to 8MB for all 6 traces. To enable a performance comparison across different prefetch sizes, we used a common enterprise data center capacity HDD specification (Seagate) (sustained transfer rate: 175MB/s, read access time: 12.67ms). As presenting all the results was not possible due to the page limit, we chose a default prefetch size for the direct comparison between the algorithms. To be conservative, we selected the 2MB prefetch size, which favors the LRU data replacement policy in the multi-version scenario. With such a setup, the shortest restore time was achieved for 3 out of 6 traces. Each of the remaining 3 traces had a different optimal prefetch: 1MB, 4MB and 8MB, respectively.
cache                                       128MB   256MB   512MB   1024MB   infinite
single-version                     LRU      −39%    −35%    −28%    −19%     +28%
(no inter-version fragmentation)   LFK      −6%     +3%     +7%     +11%     +28%
multi-version                      LRU      −51%    −47%    −42%    −37%     −9%
(with inter-version fragmentation) LFK      −28%    −25%    −21%    −17%     −9%
Table 2. Restore bandwidth relative to a system with no deduplication. Average over all 6 data sets.
Note that the restore bandwidth presented in our experiments is always given in relation to a system with a single drive, a 2MB prefetch and no deduplication (except for the multi-disk setup in Section 5.4). For better problem visualization, we varied the cache size from 128MB to 1GB per single stream being restored. Additional experiments with an infinite cache establish the performance limits for any practical cache replacement policy. In the experiments with LFK, only the blocks with a known future appearance are kept in the cache, so the cache may be underutilized. Such an approach is not optimal, but we decided to use it in order to clearly visualize the limitations and provide insights for future work.
5. Evaluation
5.1 Meeting the requirements
5.1.1 Performance results
Our experiments show that LFK can dramatically increase restore performance when compared to LRU. This improvement is achieved by adding a very small amount of memory for the limited forward knowledge. Table 2 shows the average restore bandwidth of all 6 traces in relation to the restore with no deduplication. For the single-version scenario, the negative impact of internal fragmentation was completely eliminated on average for all cache sizes except 128MB (which resulted in a performance drop of 6%, vs. 39% for LRU). In the multi-version scenario, the drop was also significantly reduced, but the inter-version fragmentation did not allow LFK to reach the non-duplicated performance (this problem is addressed in Section 5.5). 8GB of forward knowledge is used for all caches except for the 128MB cache, where 2GB is used (see Section 5.1.2). Windows of these sizes represent only a fraction of the tested traces, except for Wiki, which is very short.
The detailed results used to compute the averages presented in Table 2 are shown in Figure 8, but only for two selected traces. The other four sets, not shown, exhibited a behavior very similar to one of the chosen sets. In each graph, the results with LRU and with full forward knowledge (equivalent to the adapted Bélády cache shown previously in Figure 3) are included for comparison. The restore bandwidth delivered by an infinite
cache establishes the limit for any implementable cache replacement policy. In 4 out of 6 data sets (one of each pair in Figure 8) the results of a 256MB cache with 8GB of forward knowledge are almost identical to those delivered by an infinite cache. For the two others — UserDirectories and Mail — the possible options are to stay with the 256MB cache and gain 22%-73% of additional bandwidth even when compared to LRU with a 1GB cache, or to use the same 1GB cache size with a 22%-253% bandwidth boost. Only in these two cases is at least a 1GB LFK cache needed to deliver results at the level of a system without deduplication.
5.1.2 Resource usage
As described in Section 3.3.4, LFK requires additional resources, which should be included in the total costs. Those are: memory (about 13MB per 8GB of forward knowledge) and bandwidth (about a 0.256% decrease). Although the latter is negligible, the former can matter when the total amount of cache memory is small. Fortunately, in such a case a smaller amount of forward knowledge is also required, as visualized in Figure 8. Moreover, for a 512MB cache, having 8GB of forward knowledge causes only about 2.5% memory overhead. At this small cost, the algorithm enables even an 8 times cache size reduction (from 1GB LRU to only 128MB LFK with 2GB of forward knowledge [+3.2MB, also a 2.5% overhead]), while still delivering about a 15% average gain. Note that in our experiments the additional memory is not included by default in the total cache size. This enables a clear and easy comparison between the different forward knowledge sizes and their impact on the performance while keeping exactly the same cache size for the data.
5.1.3 Code modifications
The code modification required to implement LFK is limited to the restore algorithm only and therefore does not impact deduplication effectiveness. One necessary change is filling the oracle with forward knowledge, which can easily be done by extending the metadata read request. The other change requires only swapping the cache implementation. In general, such limited modifications make the algorithm suitable for most (or possibly even all) systems present on the market.
5.2 Setting the forward knowledge size
Increasing the forward knowledge does not always improve the results (see Figure 8). The gain or its absence is highly correlated with the amount of cache used and the pattern of internal duplicates in a backup, but also with the presence of inter-version fragmentation. Note that increasing the cache while keeping the forward knowledge constant does not give any boost beyond some point, because our cache implementation keeps only the blocks found in the forward knowledge. This problem can easily be corrected by extending the forward knowledge when necessary, which requires very little additional memory.
[Figure 8 panels: the IssueRepository and Mail traces in the single-version (left) and multi-version (right) scenarios; curves: LFK with full, 8GB, 2GB and 512MB forward knowledge, and LRU; y-axis: % of max possible read bandwidth with no dups; x-axis: cache size from 128MB to infinite.]
Figure 8. Impact of the forward knowledge size on the restore performance of the latest backup with internal fragmentation (single-version, left column) and with combined fragmentation (multi-version, right column).

prefetch   single-version      multi-version       NO DEDUP
           LRU      LFK        LRU      LFK
1MB        −45%     −27%       −55%     −43%       −34%
2MB        −28%     +7%        −42%     −21%       0%
4MB        −20%     +38%       −37%     −4%        +36%
8MB        −23%     +57%       −43%     +3%        +65%
Table 3. Impact of different prefetch sizes on the average restore bandwidth (1 spindle, 512MB cache).
5.3 Experimenting with larger prefetch
So far we used a 2MB prefetch, which is commonly used and is the most effective for LRU in 3 out of our 6 traces. To compare the results between different prefetch sizes, we performed dedicated experiments and applied common enterprise disk characteristics (Seagate). In the single-version scenario (see Table 3), the LFK results with larger prefetch sizes are much better than the best LRU (+57% vs −20%) and similar to the ones achieved by a system with no deduplication (NO DEDUP column). As the multi-version scenario suffers from inter-version fragmentation, it does not show such good results. However, LFK with an 8MB prefetch delivers a 63% increase over the best LRU (+3% vs −37%). With a faster disk and an unchanged seek time, the best LFK prefetch shows an even bigger improvement over the best LRU prefetch, because the unneeded prefetched data is effectively discarded by LFK. The only practical reasons for keeping the prefetch smaller are low-level disk fragmentation and the requirement of high system responsiveness when handling many operations at once.
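The NO DEDUP column of Table 3 can be approximated with a simple cost model in which every prefetch I/O pays the full access time of the assumed drive (our reading of the disk model from Section 4.3; the simulator may account for more detail). For a prefetch of size p:

```latex
B(p) = \frac{p}{12.67\,\mathrm{ms} + p / (175\,\mathrm{MB/s})}, \qquad
B(1\,\mathrm{MB}) \approx 54, \quad B(2\,\mathrm{MB}) \approx 83, \quad
B(4\,\mathrm{MB}) \approx 113, \quad B(8\,\mathrm{MB}) \approx 137\ \mathrm{MB/s},
```

which, relative to the 2MB baseline, gives about −34%, 0%, +36% and +65%, closely matching the NO DEDUP column.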
prefetch    single-version      multi-version       NO DEDUP
per disk    LRU      LFK        LRU      LFK
0.2MB       −90%     −85%       −92%     −89%       −86%
0.4MB       −85%     −74%       −88%     −82%       −74%
0.8MB       −79%     −57%       −84%     −72%       −55%
1.6MB       −75%     −37%       −83%     −62%       −29%
3.2MB       −75%     −17%       −87%     −56%       0%
Table 4. Impact of different prefetch sizes on the average restore bandwidth (10 disks, 512MB cache) when compared to the base setup: 10 disks, no dedup, 3.2MB prefetch each.
5.4 Scalability
Today's systems often use 10 or more disks to restore data (Lillibridge et al. 2013; Dubnicki et al. 2009) through RAID or erasure coding (Weatherspoon and Kubiatowicz 2002). In the previous experiments we assumed a 2MB prefetch for the whole stream, which in the above setup would mean a 0.2MB prefetch per disk. Since such a small prefetch results in an up to 6 times longer restore, the per-disk prefetch size should rather be set to much higher values. On the other hand, with deduplication a larger prefetch does not always mean higher performance (see Tables 3 and 4). This is especially true for the LRU algorithm, for which a 1.6MB per-disk prefetch already achieves the maximal results (see Table 4), which are 4-6 times lower than the bandwidth achievable by systems with 10 disks and no deduplication. At the same time the best LFK (with a 3.2MB prefetch per disk) offers a 2.6-3.2 times improvement over the best LRU. However, due to inter-version fragmentation, the best LFK in the multi-version scenario was still over 2 times slower than the non-duplicated sequential restore.
Table 4 also shows that the negative impact of inter-version fragmentation on both LRU and LFK performance increases with the prefetch size. For example, LFK with the 0.2MB prefetch is only 26% worse in the multi-version case than in the single-version one. This drop increases to 47% with the 3.2MB prefetch. Such results suggest that for further performance improvement a technique dealing with inter-version fragmentation is required.
cache (multi-version)   LRU      LFK      LRU+CBR   LFK+CBR
128MB                   −51%     −28%     −41%      −11%
256MB                   −47%     −25%     −36%      −3%
512MB                   −42%     −21%     −30%      0%
1024MB                  −37%     −17%     −22%      +4%
infinite                −9%      −9%      +17%      +17%
Table 5. CBR impact with different cache configurations (1 disk). Average over all 6 data sets.
5.5 Combined results with the CBR defragmentation algorithm
LFK alone is not sufficient to combat the bandwidth drop caused by inter-version fragmentation. The average drop of restore performance is still at least 17% for finite cache sizes on a single disk (Table 2) and over 50% in a multi-disk setup (Table 4). To address this, we combined LFK with a version of the Context-Based Rewriting (CBR) algorithm (Kaczmarczyk et al. 2012), which deals with inter-version fragmentation. CBR rewrites some of the duplicates based on their context, as described in the referenced paper. The restore process is not modified, which allowed us to use any kind of cache and easily compare the results.
The two algorithms, used to fight different types of fragmentation, result in a very effective symbiosis, practically eliminating the impact of fragmentation caused by deduplication for all cache sizes (see Table 5). The results show the synergy of the CBR+LFK approach: the combined result is significantly better than the sum of the individual
improvements introduced by each of the two algorithms separately. Moreover, the combined algorithm comes close to the results achieved with no inter-version fragmentation and 8GB of forward knowledge (see Table 6 with detailed results for each data set). Finally, for 4 out of 6 data sets, the results with only 256MB of memory (and also with 512MB) are already within 12% of the maximal result available with no inter-version fragmentation and an infinite cache.
To visualize the impact of the combined LFK+CBR approach on a common backup system, we performed experiments with 10 disks. We used a 3.2MB prefetch per disk, a 512MB LFK cache and a 48MB stream context for CBR (see (Kaczmarczyk et al. 2012) for details). The best LRU (with a 1.6MB prefetch per disk) resulted in an average performance drop of 83% compared to a sequential restore from 10 disks with a 3.2MB prefetch per disk. LFK+CBR reduced this average drop to 35%. While percentage-wise this drop is still substantial, the 100% level now is the sequential performance of 10 disks. Moreover, the LFK+CBR combination delivers a 4 times better bandwidth than the best LRU with 10 disks.
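For reference, the backup-time decision that CBR adds can be sketched roughly as follows (a simplified paraphrase of (Kaczmarczyk et al. 2012); the utility definition, threshold and names below are our illustration, not the original implementation):

```cpp
// Simplified CBR rewrite decision (illustrative paraphrase of Kaczmarczyk et al. 2012).
// A duplicate is rewritten when the blocks stored near its old on-disk copy (disk
// context) have little in common with the blocks that follow it in the stream being
// written (stream context), subject to a small global rewrite budget (~5% of blocks).
#include <cstdint>
#include <unordered_set>

struct CbrPolicy {
    double minCommonFraction = 0.7;   // illustrative threshold, not taken from the paper
    double rewriteBudget     = 0.05;  // at most about 5% of the blocks may be rewritten
    uint64_t seen = 0, rewritten = 0;

    bool shouldRewrite(const std::unordered_set<uint64_t>& diskContext,
                       const std::unordered_set<uint64_t>& streamContext) {
        ++seen;
        uint64_t common = 0;
        for (uint64_t b : streamContext) common += diskContext.count(b);
        double overlap = streamContext.empty()
                             ? 1.0
                             : double(common) / double(streamContext.size());
        bool withinBudget = double(rewritten + 1) <= rewriteBudget * double(seen);
        if (overlap < minCommonFraction && withinBudget) { ++rewritten; return true; }
        return false;
    }
};
```

In the combined setup, this decision affects only where duplicates are written; the restore path stays exactly the LFK one described in Section 3, which is why the two techniques compose cleanly.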
6. Related Work
The fragmentation problem caused by deduplication has been described widely (Preston 2010b; Livens 2009a,b; Whitehouse 2008), but solutions to this problem started to appear only recently.
One of the first proposals was the Context-Based Rewriting algorithm (Kaczmarczyk et al. 2012), already mentioned in this paper. By selectively rewriting up to 5% of the blocks during backup, CBR reduces the restore speed drop caused by inter-version fragmentation to less than 7%. As it rewrites only the first occurrence of each block within a stream, CBR does not address the internal fragmentation problem. Unlike the other proposals described below, CBR does not degrade the deduplication ratio (the original copies of the rewritten duplicates are removed in the background).
A somewhat different approach, described by Nam et al. (2011, 2012), is to ensure a sufficient read performance. The Chunk Fragmentation Level is used to monitor the simulated read performance during backup and to enable selective defragmentation when this level is below a pre-defined threshold. This approach reduces the deduplication ratio and may reduce write performance by even a few times.
Another technique, called Container Capping (CAP), proposed by Lillibridge et al. (2013), improves bandwidth by restoring only from a limited number of containers. Even though it shows a significant gain, this approach also lowers the deduplication ratio. This work was the first one to leverage the backup recipe to speed up restore by caching recently read blocks which will be needed in the future. A large piece of memory reserved as the Forward Assembly Area (FAA) is used on each container read to assemble the next backup part. Since upon assembly each block slot of the forward area can keep only a specific backup block, the FAA cannot look too far forward, which makes the memory usage not very effective.
single-version             LRU     LFK 8GBfk   LFK full   ∞ cache
GeneralFileServer          12%     16%         17%        17%
Wiki                       1%      21%         21%        21%
DevelopmentProject         −18%    23%         23%        23%
UserDirectories            −36%    −11%        1%         25%
IssueRepository            −48%    25%         32%        32%
Mail                       −80%    −31%        −31%       48%
Average                    −28%    7%          10%        28%

multi-version              LRU     LFK 8GBfk   LFK full   ∞ cache
GeneralFileServer          −1%     2%          3%         3%
Wiki                       −12%    4%          4%         4%
DevelopmentProject         −30%    3%          3%         3%
UserDirectories            −48%    −30%        −28%       −17%
IssueRepository            −76%    −52%        −52%       −44%
Mail                       −84%    −51%        −51%       −5%
Average                    −42%    −21%        −20%       −9%

multi-version + CBR        LRU     LFK 8GBfk   LFK full   ∞ cache
GeneralFileServer          7%      11%         12%        12%
Wiki                       −3%     15%         15%        15%
DevelopmentProject         −23%    17%         17%        17%
UserDirectories            −32%    −16%        −7%        14%
IssueRepository            −50%    16%         22%        22%
Mail                       −81%    −39%        −39%       21%
Average                    −30%    0%          3%         17%
Table 6. Impact of different algorithms and scenarios on the restore bandwidth of each data set (512MB cache).
The LFK cache presented in this work can keep any backup block in each slot. The actual difference can be seen in Figure 8, where the option with a 512MB cache and 512MB of forward knowledge looks very similar to a 512MB forward assembly area.
The most recent algorithm is called History-Aware Rewriting (HAR) (Fu et al. 2014). It keeps data in 4MB containers and rewrites duplicates to ensure that at least 50% of each container read on a backup restore is useful backup data. Additionally, this work introduces an Optimal Restore Cache based on Bélády's algorithm with full forward knowledge, but operating on 4MB containers. Such large eviction units reduce the overhead of keeping full knowledge, but also result in unneeded data blocks being kept in the cache. This fact can significantly lower the final restore bandwidth, as the authors admit in (Fu et al. 2015). In contrast, LFK uses a 500 times finer eviction unit (8KB blocks) together with only limited forward knowledge. This allows LFK to keep in the cache exactly the data to be used soon and to achieve results similar to those of full knowledge with an 8KB eviction unit. For the 3 traces tested in (Fu et al. 2014), HAR with the Optimal Cache reduced the performance drop by between 25% and 50% with one spindle and a 1GB cache, whereas LFK+CBR eliminated this drop completely on average (albeit for a different set of 6 traces). Additionally, HAR significantly degrades deduplication due to block rewriting, whereas CBR does not. In 2 out of 3 cases studied, HAR reduced this ratio by more than 50% to limit the restore performance drop to about 25%. The HAR paper also contains a comparison with CBR, but it was simulated without internal duplicate filtering (implemented but not described in the original CBR paper), which made the CBR results rather poor in the HAR paper.
7. Conclusions and future work
In this work we described and quantified the internal fragmentation problem in backup systems with deduplication. With 1 disk, the problem results in a restore bandwidth drop of up to 80% when compared to systems with no deduplication; the average performance drop is also quite significant: 28%, assuming a single spindle. With 10 disks, the drop is 75% compared to the same system without deduplication.
To deal with the problem, we proposed an algorithm called Limited Forward Knowledge (LFK). It uses the backup recipe, already kept with each backup, to speed up the restore. To verify our algorithm, we performed a large number of experiments on real-world backup traces. LFK turns most of the negative performance impact of internal deduplication into a positive one. By caching only the blocks to be used in the near future, the new algorithm often delivers a better restore performance for single-version backups than systems with no deduplication. LFK with a 512MB cache and 8GB of forward knowledge (only 13MB of memory overhead) ensures a performance almost identical to the one delivered by an infinite cache in 4 out of 6 cases. On average, 8GB of forward knowledge results in a restore bandwidth within 3 percentage points of the level reachable with full forward knowledge. Moreover, LFK can exploit large prefetches, which is useful especially with many disks. With 10 of them and a 3.2MB prefetch per disk, LFK gives on average over 3 times better bandwidth than LRU.
As LFK deals with internal fragmentation only, we combined it with the CBR algorithm (Kaczmarczyk et al. 2012), which reduces the impact of inter-version fragmentation. In the multi-version scenario with 1 spindle, the combined approach practically eliminates the average restore performance drop of the latest backup caused by both internal and inter-version fragmentation. In 4 out of 6 traces, the combined result delivers significantly better bandwidth than systems with no deduplication. One exception is the Mail trace, for which the combined algorithm delivers 61% of the restore speed with no deduplication, but the restore bandwidth is still improved more than 3 times compared to LRU. For 10 spindles, the 83% average drop is reduced to 35%, giving a 4 times better bandwidth.
Our plan is to implement LFK in the commercial system HYDRAstor (NEC) and integrate it with CBR, which is already there. Two interesting topics for further research are a variable prefetch size and a dynamic memory division between the forward knowledge and the cache. Finally, the global fragmentation problem is still waiting for a solution, preferably one that can be combined with the LFK+CBR approach and does not degrade deduplication.
References
R. Amatruda. Worldwide Purpose-Built Backup Appliance 2012-2016 Forecast and 2011 Vendor Shares. International Data Corporation, April 2012. URL http://www.emc.com/collateral/analyst-reports/idc-worldwidepurpose-built-backup-appliance.pdf.
L. Aronovich, R. Asher, E. Bachmat, H. Bitner, M. Hirsch, and S. T. Klein. The design of a similarity based deduplication system. In Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, SYSTOR '09, pages 6:1–6:14, New York, NY, USA, 2009. ACM.
T. Asaro and H. Biggar. Data De-duplication and Disk-to-Disk Backup Systems: Technical and Business Considerations. Enterprise Strategy Group, July 2007.
B. Babineau and D. A. Chapa. Deduplication's Business Imperatives. Enterprise Strategy Group, December 2010. Sponsored by EMC Corporation.
L. A. Belady. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 5(2):78–101, June 1966.
C. Dubnicki, L. Gryz, L. Heldt, M. Kaczmarczyk, W. Kilian, P. Strzelczak, J. Szczepkowski, C. Ungureanu, and M. Welnicki. HYDRAstor: A scalable secondary storage. In Proceedings of the 7th USENIX Conference on File and Storage Technologies, FAST '09, pages 197–210, Berkeley, CA, USA, 2009. USENIX Association.
EMC. DataDomain - Deduplication Storage for Backup, Archiving and Disaster Recovery. URL http://www.datadomain.com.
ExaGrid. ExaGrid. URL http://www.exagrid.com.
M. Fu, D. Feng, Y. Hua, X. He, Z. Chen, W. Xia, F. Huang, and Q. Liu. Accelerating restore and garbage collection in deduplication-based backup systems via exploiting historical information. In Proceedings of the 2014 USENIX Annual Technical Conference, USENIX ATC '14, pages 181–192, Berkeley, CA, USA, 2014. USENIX Association.
M. Fu, D. Feng, Y. Hua, X. He, Z. Chen, W. Xia, Y. Zhang, and Y. Tan. Design tradeoffs for data deduplication performance in backup workloads. In 13th USENIX Conference on File and Storage Technologies, FAST '15, pages 331–344, Santa Clara, CA, February 2015. USENIX Association.
G. L. Heileman and W. Luo. How caching affects hashing. In C. Demetrescu, R. Sedgewick, and R. Tamassia, editors, ALENEX/ANALCO, pages 141–154. SIAM, 2005.
HP. HP StoreOnce Backup. URL http://www8.hp.com/us/en/products/data-storage/storage-backup-archive.html.
IBM. IBM ProtecTIER Deduplication Solution. URL http://www-03.ibm.com/systems/storage/tape/ts7650g/.
M. Kaczmarczyk. Fragmentation in storage systems with duplicate elimination. PhD thesis, University of Warsaw, Poland, 2015. To be published in June 2015.
M. Kaczmarczyk, M. Barczynski, W. Kilian, and C. Dubnicki. Reducing impact of data fragmentation caused by in-line deduplication. In Proceedings of the 5th Annual International Systems and Storage Conference, SYSTOR '12, pages 15:1–15:12, New York, NY, USA, 2012. ACM.
M. Lillibridge, K. Eshghi, and D. Bhagwat. Improving restore speed for backup systems that use inline chunk-based deduplication. In Proceedings of the 11th USENIX Conference on File and Storage Technologies, FAST '13, pages 183–198, Berkeley, CA, USA, 2013. USENIX Association.
J. Livens. Deduplication and restore performance. Wikibon.org, January 2009a. URL http://wikibon.org/wiki/v/Deduplication and restore performance.
J. Livens. Defragmentation, rehydration and deduplication. AboutRestore.com, June 2009b. URL http://www.aboutrestore.com/2009/06/24/defragmentationrehydration-and-deduplication/.
A. Muthitacharoen, B. Chen, and D. Mazieres. A low-bandwidth network file system. In Proceedings of the 18th ACM Symposium on Operating Systems Principles, SOSP '01, pages 174–187, New York, NY, USA, 2001. ACM.
Y. Nam, G. Lu, N. Park, W. Xiao, and D. H. C. Du. Chunk fragmentation level: An effective indicator for read performance degradation in deduplication storage. In Proceedings of the 2011 IEEE International Conference on High Performance Computing and Communications, HPCC '11, pages 581–586, Washington, DC, USA, 2011. IEEE Computer Society.
Y. J. Nam, D. Park, and D. H. C. Du. Assuring demanded read performance of data deduplication storage with backup datasets. In Proceedings of the 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS '12, pages 201–208, Washington, DC, USA, 2012. IEEE Computer Society.
NEC. HYDRAstor Grid Storage System. URL http://www.hydrastor.com.
NEC HS8-4000. NEC HYDRAstor HS8-4000 Specification, 2013. URL http://www.necam.com/HYDRAstor/doc.cfm?t=HS8-4000.
W. C. Preston. Target deduplication appliance performance comparison. BackupCentral.com, October 2010a. URL http://www.backupcentral.com/mr-backup-blog-mainmenu47/13-mr-backup-blog/348-target-deduplicationappliance-performance-comparison.html.
W. C. Preston. Restoring deduped data in deduplication systems. SearchDataBackup.com, April 2010b. URL http://searchdatabackup.techtarget.com/feature/Restoring-deduped-data-in-deduplication-systems.
Quantum. DXi Deduplication Solution. URL http://www.quantum.com/products/disk-basedbackup/index.aspxm.
S. Quinlan and S. Dorward. Venti: A new approach to archival storage. In Proceedings of the 1st USENIX Conference on File and Storage Technologies, FAST '02, pages 7–7, Berkeley, CA, USA, 2002. USENIX Association.
M. Rabin. Fingerprinting by random polynomials. Technical report, Center for Research in Computing Technology, Harvard University, 1981.
B. Romanski, L. Heldt, W. Kilian, K. Lichota, and C. Dubnicki. Anchor-driven subchunk deduplication. In Proceedings of the 4th Annual International Conference on Systems and Storage, SYSTOR '11, pages 16:1–16:13, New York, NY, USA, 2011. ACM.
Seagate. Common enterprise disk specification (based on Seagate Constellation ES.3 4TB, model 2012). URL http://www.seagate.com/www-content/productcontent/constellation-fam/constellation-es/constellation-es-3/en-us/docs/constellation-es3-data-sheet-ds1769-1-1210us.pdf.
Symantec. NetBackup Appliances. URL http://www.symantec.com/backup-appliance.
G. Wallace, F. Douglis, H. Qian, P. Shilane, S. Smaldone, M. Chamness, and W. Hsu. Characteristics of backup workloads in production systems. In Proceedings of the 10th USENIX Conference on File and Storage Technologies, FAST '12, pages 4–4, Berkeley, CA, USA, 2012. USENIX Association.
H. Weatherspoon and J. Kubiatowicz. Erasure coding vs. replication: A quantitative comparison. In IPTPS '01: Revised Papers from the First International Workshop on Peer-to-Peer Systems, pages 328–338, London, UK, 2002.
L. Whitehouse. Restoring deduped data. SearchDataBackup.TechTarget.com, August 2008. URL http://searchdatabackup.techtarget.com/tip/Restoringdeduped-data.
L. Whitehouse, B. Lundell, J. McKnight, and J. Gahm. 2010 Data Protection Trends. Enterprise Strategy Group, April 2010.
B. Zhu, K. Li, and H. Patterson. Avoiding the disk bottleneck in the data domain deduplication file system. In Proceedings of the 6th USENIX Conference on File and Storage Technologies, FAST '08, pages 18:1–18:14, Berkeley, CA, USA, 2008. USENIX Association.