2010 39th International Conference on Parallel Processing
FlashCoop: A Locality-Aware Cooperative Buffer Management for SSD-based Storage Cluster
Qingsong Wei
Data Storage Institute, Agency for Science, Technology and Research, Singapore
[email protected]

Bozhao Gong, Suraj Pathak, Y.C. Tay
School of Computing, National University of Singapore, Singapore
{gbozhao, suraj, dcstaycc}@nus.edu.sg
Abstract—Random writes significantly limit the application of flash-based Solid State Drives (SSDs) in enterprise environments due to their poor latency, negative impact on SSD lifetime, and high garbage collection overhead. To relieve these limitations, we propose a locality-aware cooperative buffer scheme referred to as FlashCoop (Flash Cooperation), which leverages free memory of a neighboring storage server to buffer writes over a high-speed network. Both the temporal and sequential localities of the access pattern are exploited in the design of the cooperative buffer management. Leveraging the filtering effect of the cooperative buffer, FlashCoop can efficiently shape the I/O request stream and improve the sequentiality of the write accesses passed to the SSD.
FlashCoop has been extensively evaluated under various enterprise workloads. Our benchmark results conclusively demonstrate that FlashCoop achieves a 52.3% performance improvement and a 56.5% reduction in garbage collection overhead compared to the system without FlashCoop.

Keywords—SSD, Random Write, Cooperative Buffer, Temporal and Sequential Localities

I. INTRODUCTION

Flash memory is rapidly becoming a promising technology for next-generation storage due to a number of strong technical merits, including (i) low access latency, (ii) low power consumption, (iii) higher resistance to shocks, (iv) light weight, and (v) increasing endurance. As an emerging technology, flash memory has received strong interest in both academia and industry [5-8]. Flash memory has traditionally been used in portable devices. More recently, as prices drop and capacities increase, this technology has made huge strides into the personal computer and server storage space in the form of SSDs, with the intention of replacing traditional hard disk drives (HDDs). In fact, two leading online search engine service providers, google.com and baidu.com, have both announced plans to migrate their existing hard-disk-based storage systems to platforms built on SSDs [4,7]. However, SSDs suffer from random writes when applied in enterprise environments. Write performance on SSDs is highly correlated with access patterns. The electrical properties of flash cells result in random writes being much slower than sequential writes. Figure 1 shows the write performance of an Intel X25-E Extreme SATA SSD.

Figure 1. Performance of write on SSD. (Bandwidth in MB/s versus request size from 512 bytes to 32 KB, for sequential write, random write, and a mix of sequential and random writes.)

The 4 KB random write speed is only 0.87 MB/s, while the sequential write performance is 30.69 MB/s. Further, the performance of a mixed workload with sequential and random writes (50:50) is worse than that of pure random writes. While random writes are very slow on SSDs, the durability issue of SSDs is an even more serious problem. Due to the nature of the technology, NAND flash memory can endure only a finite number of erases for a given physical block. Therefore, the increased erase operations caused by random writes shorten the lifetime of an SSD. Experiments in [18] show that a random-write-intensive workload can make an SSD wear out over a hundred times faster than a sequential-write-intensive workload. Portable devices are designed for single-user applications, in which most write requests are sequential. On the other hand, a server deployed in an enterprise supports multiple users and multiple tasks concurrently. The write request pattern on a server is more complex, with many random writes interposed between sequential writes. In addition, there is high temporal locality among random writes because many popular sectors are updated frequently. Therefore, random writes on SSDs are a critical problem for both performance and lifetime in enterprise environments. To solve this problem, we propose a locality-aware cooperative buffer scheme, referred to as FlashCoop, to improve the performance and sequentiality of the write accesses passed to the SSD. By coordinating the memory of a neighboring server over a high-speed network, FlashCoop implements a
cooperative buffer, which allows data to be written to both a local buffer and a remote buffer instead of being synchronously updated to the SSD. Data buffered in memory is replaced and asynchronously flushed into the SSD by a locality-aware replacement algorithm, which takes both temporal and sequential localities into account. FlashCoop not only improves performance and extends SSD lifetime, but also significantly reduces the internal fragmentation and garbage collection overhead associated with random writes. The rest of this paper is organized as follows. Section II provides an overview of the background and motivation. In Section III, we present the details of the locality-aware cooperative buffer scheme (FlashCoop). Evaluation and measurement results are presented in Section IV. Section V gives a brief study of related work in the literature, and conclusions and possible future work are summarized in Section VI.

II. BACKGROUND
A. Flash Memory Technologies
A NAND flash memory package is composed of one or more dies. Each die within a package contains multiple planes. A typical plane consists of thousands (e.g., 2048) of blocks and one or two page-sized registers used as I/O buffers. Each block in turn consists of 64 to 128 pages. Each page has a 2 KB or 4 KB data area and a metadata area (e.g., 128 bytes) for storing identification, page state, and Error Correcting Code (ECC) information. Flash memory supports three major operations: read, write, and erase. Reads and writes are performed in units of pages. A unique requirement of flash memory is that blocks must be erased before they can be reused, and erase must be conducted at block granularity [7]. In addition, each block can be erased only a finite number of times. A typical MLC flash memory endures around 10,000 erase cycles per block, while an SLC flash memory endures around 100,000. After wearing out, flash memory cells can no longer reliably store data [8].
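The geometry and the erase-before-write constraint described above can be captured by a small model. The following sketch is illustrative only; the page and block parameters are the example values quoted in the text, and the class and field names are our own.

```python
# Minimal model of NAND flash constraints described above (illustrative only).
# Parameters follow the example figures in the text: 64 pages/block, 4 KB pages.
PAGES_PER_BLOCK = 64
PAGE_SIZE = 4096          # bytes of data area per page
ERASE_LIMIT = 10_000      # typical MLC endurance per block

class Block:
    def __init__(self):
        self.state = ["clean"] * PAGES_PER_BLOCK   # clean / valid / invalid
        self.erase_count = 0

    def program(self, page_idx):
        # Pages can only be written (programmed) when clean: no in-place update.
        if self.state[page_idx] != "clean":
            raise ValueError("page must be erased (via whole-block erase) before rewrite")
        self.state[page_idx] = "valid"

    def invalidate(self, page_idx):
        # An overwrite elsewhere marks the old copy invalid instead of updating it.
        self.state[page_idx] = "invalid"

    def erase(self):
        # Erase works only at block granularity and consumes endurance.
        if self.erase_count >= ERASE_LIMIT:
            raise RuntimeError("block worn out: it can no longer store data")
        self.state = ["clean"] * PAGES_PER_BLOCK
        self.erase_count += 1
```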
B. SSD
SSDs use flash memory as their storage medium. The Flash Translation Layer (FTL) [10], a critical component implemented in the SSD controller, allows operating systems to access flash memory devices in the same way as conventional disk drives. The FTL plays a key role in the SSD, and many sophisticated mechanisms are adopted to optimize SSD performance. It provides address mapping, wear leveling, and garbage collection.
Logical block mapping – Generally, FTL schemes can be classified into three groups depending on the granularity of address mapping: page-level, block-level, and hybrid-level FTL schemes [10]. In the page-level FTL scheme, a logical page number (LPN) can be mapped to a physical page number (PPN) anywhere in flash memory. This mapping approach is efficient and shows great garbage collection efficiency, but it requires a large amount of RAM to store the mapping table. On the contrary, block-level FTL is space-efficient, but it requires an expensive read-modify-write operation when writing only part of a block. To overcome these disadvantages, the hybrid-level FTL scheme was proposed. Hybrid-level FTL uses a block-level mapping to manage most data blocks and a page-level mapping to manage a small set of log blocks, which work as a buffer to accept incoming write requests [9]. Hybrid schemes show high garbage collection efficiency and require a small mapping table.
Garbage Collection – Since data in flash memory cannot be updated in place, the FTL simply writes the new data to another clean page and marks the previous page as invalid. When running out of clean blocks, a garbage collection module scans flash memory blocks and recycles invalidated pages. If a page-level mapping is used, the valid pages in the scanned block are copied out and condensed into a new block. For block-level and hybrid-level mappings, the valid pages need to be merged with the updated pages of the same block.
Wear Leveling – Due to the locality in most workloads, writes are often performed over a subset of blocks. Thus some flash memory blocks may be frequently overwritten and tend to wear out earlier than others. FTLs usually employ a wear leveling algorithm to ensure that equal use is made of the available write cycles of each block [26].
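As an illustration of the address-mapping idea (a sketch of a generic page-level scheme, not the FTL of any particular SSD), the mapping can be modeled as a table from logical to physical page numbers, with out-of-place updates invalidating the old physical page:

```python
# Illustrative page-level FTL mapping (a sketch, not a production FTL).
class PageLevelFTL:
    def __init__(self, num_physical_pages):
        self.l2p = {}                                   # LPN -> PPN
        self.free = list(range(num_physical_pages))     # clean physical pages
        self.invalid = set()                            # pages awaiting garbage collection

    def write(self, lpn):
        if not self.free:
            raise RuntimeError("no clean pages: garbage collection needed")
        ppn = self.free.pop(0)          # out-of-place update: always use a clean page
        old = self.l2p.get(lpn)
        if old is not None:
            self.invalid.add(old)       # old copy becomes invalid, reclaimed later
        self.l2p[lpn] = ppn
        return ppn                      # the physical page that would be programmed

    def read(self, lpn):
        return self.l2p[lpn]            # translate before reading the physical page
```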
C. Random Write and SSD
Random writes have the following negative impacts on an SSD.
1) Shortened SSD lifetime: The more random the writes are, the more erase operations are needed. Due to the nature of the technology, NAND flash memory can endure only a finite number of erases for a given physical block. Therefore, the increased erase operations caused by random writes make flash storage wear out much faster than sequential writes do.
2) High garbage collection overhead: Random writes result in higher garbage collection overhead than sequential writes. Considering an erase block of N pages, a random write may incur up to N − 1 page reads from the old block, N page writes to the new block, and one block erase. In contrast, a sequential write only incurs one page write and 1/N erase operations on average [7]. For an SSD adopting a hybrid FTL, the more random the writes are, the more merge operations [11] are needed. In the worst case, each individual page in a log block belongs to a different mapping unit and requires an expensive full merge operation [5]. Random writes are the most likely operations to trigger garbage collection. These internal operations running in the background may compete for resources with incoming foreground requests and increase latency.
3) Internal fragmentation: Flash memory does not support in-place updates. Therefore, if incoming writes are randomly distributed over the logical block address space, sooner or later every physical flash memory block may contain an invalid page, a condition called internal fragmentation [8]. Internal fragmentation has significant impacts on garbage collection and performance. First, the cleaning efficiency drops drastically. Second, after fragmentation, each write becomes excessively expensive and the bandwidth of sequential writes collapses to a level much lower than that of a regular laptop disk [7]. Finally, the prefetching mechanism inside the SSD is no longer effective, since logically contiguous pages are not physically contiguous; this causes the bandwidth of sequential reads to drop close to that of random reads.
4) Little chance for performance optimization: An SSD leverages striping and interleaving to improve performance [6], both of which rely on sequential locality. If a write is sequential, the data can be striped and written across different dies or planes in parallel. Interleaving is used to hide the latency of costly operations: a single multi-page read or write can be efficiently interleaved, while multiple single-page reads or writes can only be conducted separately. Although these optimizations can dramatically improve performance for an SSD given more sequential locality, their ability to deal with random writes is very limited because little sequential locality is left to exploit. Since a buffer positioned at the system level above the SSD receives I/O requests directly from applications, a novel replacement algorithm can be designed there to shape the requests into an I/O stream with more sequential locality, thereby extending SSD lifetime, reducing garbage collection overhead, and providing more opportunities for the interleaving, striping, and prefetching mechanisms.
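To make the cost asymmetry in 2) concrete, the sketch below compares the worst-case page operations charged to a single random write against a sequential write, using the N-page erase block model from the text (N = 64 is simply the block geometry used later in Table II):

```python
# Worst-case garbage-collection cost per write, per the analysis above (a sketch).
def random_write_cost(n_pages):
    # Up to N-1 valid pages copied from the old block, N page programs, 1 erase.
    return {"page_reads": n_pages - 1, "page_writes": n_pages, "erases": 1}

def sequential_write_cost(n_pages):
    # One page program; the block erase is amortized over N sequential writes.
    return {"page_reads": 0, "page_writes": 1, "erases": 1.0 / n_pages}

N = 64  # pages per erase block (matches the 256 KB block / 4 KB page SSD in Table II)
print(random_write_cost(N))      # {'page_reads': 63, 'page_writes': 64, 'erases': 1}
print(sequential_write_cost(N))  # {'page_reads': 0, 'page_writes': 1, 'erases': 0.015625}
```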
III. LOCALITY-AWARE COOPERATIVE BUFFER SCHEME

We now present the design of our locality-aware cooperative buffer scheme for SSD-based storage clusters.

A. Design Rationale
Existing file systems use synchronous writes for metadata updates to ensure proper ordering of the changes being written to disk. This orderly updating is necessary to ensure file system recoverability and data integrity after a system failure. However, it is well known that synchronous writes are very expensive for an SSD because of its poor random write performance and the negative impacts on lifetime and garbage collection overhead. The typical workload in an enterprise system is a mixture of random and sequential writes. Figure 2 shows the typical behavior of write requests generated by multiple tasks and sent to the SSD through the file system in an enterprise environment. The file system may divide a long sequential write request into small write requests and send them to the SSD, interleaved with write requests from other tasks.

Figure 2. The FlashCoop increases sequentiality of write accesses. (Three tasks issue writes with LPNs (1-6), (20,21), and (22); the file system interleaves them as (4,5,6), (20,21), (22), (1,2,3); FlashCoop regroups them into the sequential writes (1-6) and (20-22) before they reach the SSD.)

FlashCoop is proposed and positioned between the file system and the SSD to buffer and shape the I/O stream so that random writes to the SSD can be reduced. The main idea of FlashCoop is to use a small portion of memory in a neighboring server to implement a cooperative write buffer, which provides the chance to improve the sequentiality of the write accesses passed to the SSD. To do so, FlashCoop partitions the memory into two parts, a local buffer and a remote buffer, as shown in Figure 3. The local buffer is used for local reads and writes, and the remote buffer is used to store write data from the neighboring server. Therefore, fresh data is kept in the local buffer for a longer time to reduce random writes to the SSD. The design is feasible for the following reasons.
Firstly, the memory of the storage servers in the cluster is not always 100 percent utilized. It is possible to allocate a small amount of memory as a remote buffer without sacrificing system performance. Further, the size of the remote buffer required to store write data is small because most writes are small [1,2]. Secondly, writing data to the remote buffer over a high-speed network (e.g., 10 Gbit Ethernet) is much faster than synchronously writing to the SSD due to the SSD's poor random write performance, especially for an aged SSD. Finally, data is asynchronously flushed into the SSD without risk of data loss because the probability of both servers failing at the same time is very low, as in RAID 1.
The storage cluster is configured into cooperative pairs, in which each server of a pair serves its own read/write requests as well as remote write requests from its neighboring peer. FlashCoop uses a Local Caching Table (LCT) and a Remote Caching Table (RCT) to manage the pages stored in the local buffer and remote buffer, respectively. All access decisions are made in the access portal module. Upon the arrival of a write request, FlashCoop places it into the local buffer and transfers a copy to the remote buffer of the neighboring server instead of synchronously writing it into the SSD. For read requests, FlashCoop always attempts to fetch data from the local buffer through the LCT, in the traditional way. If the read request does not hit in the local buffer, the access portal fetches the data from the SSD and caches a copy in the local buffer for future requests. With FlashCoop, writing data into the SSD is artificially delayed so that originally sequential but interleaved writes can be reconstructed back into sequential writes and multiple small writes can be combined into one big write. Short-lived files that can be buffered in memory are often never actually written to the SSD: they are removed and purged from the buffer before being pushed to the SSD. Such short-lived files appear to be relatively common in Unix systems [3], and the delayed writes in FlashCoop reduce both the number of metadata updates caused by such files and their impact on SSD internal fragmentation.
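The access-portal behavior just described can be summarized in a short sketch of the write and read paths. This is our own illustration under a plain reading of the text; the helper names (e.g., send_to_remote_buffer) are hypothetical, not from the paper.

```python
# Sketch of the FlashCoop access portal (illustrative names, not the authors' code).
def handle_write(req, local_buffer, lct, remote_peer):
    local_buffer.put(req.lpn, req.data)        # buffer locally instead of writing to SSD
    lct.record(req.lpn)                        # track the page in the Local Caching Table
    remote_peer.send_to_remote_buffer(req)     # replicate to the neighbor's remote buffer
    # No synchronous SSD write: data is flushed later by the replacement policy.

def handle_read(req, local_buffer, lct, ssd):
    if lct.contains(req.lpn):                  # hit: serve from the local buffer
        return local_buffer.get(req.lpn)
    data = ssd.read(req.lpn)                   # miss: fetch from the SSD ...
    local_buffer.put(req.lpn, data)            # ... and cache a copy for future requests
    lct.record(req.lpn)
    return data
```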
Figure 3. Architecture of the FlashCoop. (Two cooperative storage servers are connected by a high-speed data center network. Each server contains an access portal with an LCT and an RCT, a local buffer and a remote buffer managed by dynamic memory allocation and locality-aware replacement, a monitor & recovery module, and an SSD that receives sequential flushes.)
Therefore, FlashCoop can significantly reduce the number of random writes and increase the sequentiality of the write accesses passed to the SSD. Consequently, the negative impacts associated with small writes are dramatically reduced. Cache replacement plays a key role in FlashCoop. We propose a locality-aware cache replacement policy that replaces pages based on the access pattern, taking both temporal and sequential localities into account. FlashCoop leverages the replacement algorithm to shape the I/O stream and asynchronously flushes data to the SSD in a sequential way, which is discussed in subsection B. It should be noted that the size of the remote buffer in each storage server can be different. Further, the memory allocation between the local buffer and the remote buffer within a storage server changes dynamically according to the workload, as discussed in subsection C. The monitor and recovery module constantly monitors the availability of the cooperative partner to detect any device or networking failure. It also enables data recovery from both local and remote failures to maintain data consistency, as discussed in subsection D.
B. Locality-Aware Replacement Policy
The cache management policy influences the sequence of requests that reach the SSD. Different cache management policies may generate different request sequences, which directly affect the performance and lifetime of the SSD. In other words, by changing the cache management scheme, it is possible to change the access pattern seen by the SSD, thus providing more sequential writes to extend its lifetime and improve performance. As discussed in Section II, the sequential locality of the write accesses passed to the SSD significantly influences performance, lifetime, internal fragmentation, and garbage collection overhead. Therefore, we adopt improvement of both the cache hit ratio and sequential locality as our design objectives. Leveraging both the temporal and sequential localities of the access pattern, we propose a Locality-Aware Replacement scheme (LAR) for FlashCoop. It works as follows.
1) Block-based management: LAR buffer management is based on the granularity of a logical block, which consists of multiple pages in the SSD. The system can obtain the block size of the underlying SSD. Block-based management makes the buffer replacement aware of the SSD layout and maintains the sequential locality of accesses. In real applications, read and write accesses are mixed. Usage patterns exhibit block-level temporal locality: the pages in the same logical block are likely to be accessed (read or written) again in the near future. LAR services both read and write operations, because buffering only writes in memory may destroy the original locality present in the access sequence. Servicing reads helps to reduce the load on the flash data channel, which is shared by both read and write operations. Moreover, by servicing foreground read operations from the buffer, the flash data channel's bandwidth can also be used for the background garbage collection task, reducing their mutual interference. LAR treats reads and writes as a whole to make full use of the block-level temporal locality of accesses, which helps to naturally form sequential blocks. Therefore, LAR groups both write and read pages belonging to the same block into one logical block.
2) Two-level sorting and replacement: LAR implements a two-level sort using a table and several chains. In the first level, the table is sorted by block popularity, which is defined as the block access frequency, counting reads and writes of any pages of the block. Sequentially accessing multiple pages of a block is treated as one block access. Thus, a block with sequential accesses has a low popularity value, while a block with random accesses has a high popularity value. LAR selects the least popular block for replacement. For the second level, we use a chain to organize the blocks with the same popularity and sort them by the number of dirty pages in the block. If more than one block has the same least popularity, the block having the largest number of dirty pages is selected as the victim for replacement. Once a block is selected as the victim, there are two cases. If there are dirty pages in the block, both the clean and dirty pages of this block in the local buffer are sequentially flushed into the SSD, and the corresponding pages in the remote buffer of the neighboring server are then discarded. Otherwise, all the clean pages of the block are simply discarded instead of being flushed to the SSD.
3) Clustering multiple small writes into a full block: If there are multiple dirty pages belonging to different logical blocks at the tails of the chains, we group them into a block-sized write and sequentially flush it into the SSD. This makes full use of interleaving and striping to improve performance.

Figure 4. Locality-Aware Replacement Algorithm. Each block is composed of four pages; WR(LPN) and RD(LPN) denote write and read requests, respectively; the numbers within the small boxes denote the LPN of each page; white boxes represent clean pages and grey boxes represent dirty pages; each dashed box denotes a logical block with its popularity and number of dirty pages. (Request sequence: WR(0,1,2), RD(3,8,9) miss, WR(10,11), RD(19) miss, WR(16,17,18), WR(1,2) hit. Resulting blocks: block 0 = pages 0-3, popularity 3, 3 dirty pages; block 2 = pages 8-11, popularity 2, 2 dirty pages; block 4 = pages 16-19, popularity 2, 3 dirty pages, victim for replacement and sequentially flushed to the SSD.)

Figure 4 illustrates an example of LAR. Upon the arrival of write request WR(0,1,2), pages 0, 1, and 2 are written into the local buffer and the remote buffer. They belong to block 0, whose popularity becomes 1 with 3 dirty pages. As read request RD(3,8,9) arrives, the three missed pages are read from the SSD and stored in the local buffer. Page 3 joins block 0, whose popularity is updated to 2, and pages 8 and 9 form block 2 with a popularity of 1. As write request WR(10,11) arrives, pages 10 and 11 are written into the local buffer and the remote buffer; the popularity and dirty page count of block 2 are updated to 2 and 2, respectively. Although it has the same popularity as block 2, block 4 is placed at the tail because it has more dirty pages. When replacement happens, block 4 is selected as the victim, and both its dirty pages (16, 17, and 18) and its clean page (19) are sequentially flushed into the SSD. This policy guarantees that logically contiguous pages can be physically placed onto contiguous pages, so as to avoid internal fragmentation.
LAR minimizes the cost of random writes and brings the following benefits. 1) Delaying writes gives better knowledge of the access pattern and provides more opportunity to group intra-block accesses and reconstruct interleaved writes. 2) It reduces the number of random writes to the SSD, which significantly saves lifetime and reduces garbage collection overhead. 3) It improves read and write latency because most accesses to popular data hit in the buffer. 4) It adapts to different workloads, such as read-intensive and write-intensive workloads: most read pages stay in the buffer for a read-intensive workload, while most write pages stay in the buffer if the workload is write-intensive.
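The two-level victim selection described in 2) can be expressed compactly. The sketch below is our own illustration of the selection rule (lowest popularity first, most dirty pages as the tie-breaker) and of the two eviction cases; the class and method names are illustrative, not the authors' implementation.

```python
# Sketch of LAR victim selection: least popular block first; among equally
# (least) popular blocks, the one with the most dirty pages is evicted first.
class BlockMeta:
    def __init__(self, block_no):
        self.block_no = block_no
        self.popularity = 0     # block access count (a sequential multi-page access counts once)
        self.dirty_pages = 0    # pages that must be flushed to SSD on eviction
        self.clean_pages = 0    # read-cached pages; discarded if the block has no dirty pages

def select_victim(blocks):
    # Sort key: ascending popularity, then descending dirty-page count.
    return min(blocks, key=lambda b: (b.popularity, -b.dirty_pages))

def evict(victim, local_buffer, remote_peer, ssd):
    if victim.dirty_pages > 0:
        ssd.sequential_write(local_buffer.pages_of(victim.block_no))  # flush the whole block in order
        remote_peer.discard(victim.block_no)                          # drop the remote copy
    local_buffer.drop(victim.block_no)                                # clean-only blocks are just dropped
```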
C. Dynamic Memory Allocation Scheme
The memory inside a storage server is partitioned into a local buffer and a remote buffer. The remote buffer is excess idle memory space, which is used to back up write data from the neighboring server. Memory allocation has a significant influence on the performance of the cooperative servers: an unreasonable allocation policy will eventually degrade performance in terms of access latency and throughput, and the cooperative buffer cannot be achieved at the cost of sacrificing performance. In the FlashCoop design, we realize that it is important not only to minimize the number of random writes, but also to provide better overall performance through balanced memory allocation. For a heterogeneous storage cluster, good overall performance is difficult to achieve with a static memory partition between local buffer and remote buffer, because different types of servers possess different memory capacities, CPU power, network bandwidth, and workloads. Therefore, we propose a dynamic memory allocation scheme that takes the workloads and resource usage of both cooperative servers into account. Our strategy allocates more remote buffer on a server if its local resource usage is low and the workload of its neighbor is write-intensive. The workload activity of the two cooperative servers is represented as A = {A1, A2}, and accordingly, the ratio between the size of the remote buffer and the total memory size (excluding system memory) is represented as θ = {θ1, θ2}. We model the workload activity of a server as a set of parameters Ai = (mi, pi, ni), where mi, pi, and ni represent the memory, CPU, and network utilization of the i-th server. The ratio θi of server Si is then defined by Equation (1):

θi = aj · (1 − bi),  with  aj = λj,write / λj  and  bi = α·mi + β·pi + γ·ni        (1)

where aj is the workload factor of the neighboring server j and bi is the resource usage of the local server i; λj,write is the write request arrival rate and λj is the total request arrival rate; and α, β, γ are adjustment factors chosen according to the workload pattern, i.e., data-centric or compute-centric workloads. The value of θi decreases when the resource usage of the local server increases. In addition, the value of θi is sensitive to the access pattern of the workload running on the remote server: it increases when a write-intensive workload is running on the remote server and, by contrast, decreases for a read-intensive workload. In real applications, workloads change. By monitoring the workload and resource usage, FlashCoop dynamically recalculates and adjusts the value of θ. To do so, each server of the pair periodically collects and exchanges the required information. If the workload changes rapidly, excessive communication and computation are needed to adjust θ and smooth out the load variation, which may in turn affect overall system performance. Finding a cost-effective way to do this with reasonable computational overhead is an interesting issue for future work.
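A direct reading of Equation (1) can be expressed in a few lines of code. This is just an illustrative calculation under our interpretation of the symbols (i is the local server, j its neighbor); the weights in the example are the α=0.4, β=0.2, γ=0.4 values reported in the evaluation section, while the utilization numbers are assumed.

```python
# Illustrative computation of the remote-buffer ratio theta_i from Equation (1).
def remote_buffer_ratio(local_util, neighbor_write_rate, neighbor_total_rate,
                        alpha=0.4, beta=0.2, gamma=0.4):
    m, p, n = local_util                             # local memory, CPU, network utilization (0..1)
    b_i = alpha * m + beta * p + gamma * n           # resource usage of the local server
    a_j = neighbor_write_rate / neighbor_total_rate  # write fraction of the neighbor's workload
    return a_j * (1.0 - b_i)                         # fraction of memory to devote to the remote buffer

# Example: lightly loaded local server, write-intensive neighbor (91% writes, as in Fin1).
theta = remote_buffer_ratio(local_util=(0.3, 0.2, 0.1),
                            neighbor_write_rate=91, neighbor_total_rate=100)
print(round(theta, 3))  # 0.728 under these assumed utilization numbers
```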
D. Failure Recovery
A failure may happen due to a network breakdown or a crash of a server itself. The failure recovery module constantly monitors the availability of the cooperative partner. To do so, a heartbeat mechanism is set up between the two cooperative servers: the availability of the peer server is monitored by sending heartbeat messages periodically. There are two types of failure for FlashCoop to deal with: local failure and remote failure. In case of a local failure, the following steps are conducted. Firstly, the local server reads the RCT from the neighboring server after reboot. Then, according to the RCT, the dirty data in the remote buffer of the neighboring server is copied back and stored in the SSD. Finally, the local server notifies the neighboring server to clean out its remote buffer. A remote failure includes a network partition or a remote server crash, which may make the backup of the write data in the remote buffer unavailable. In case of a remote failure, the local server stops forwarding new write data to the neighboring server and immediately flushes the dirty data in its local buffer into the SSD. With this failure recovery mechanism, FlashCoop can successfully maintain data consistency. We observed that the failure recovery time is a tradeoff between performance and reliability. A large remote buffer allows more data to be written in memory, providing more chances to optimize write accesses; however, more data stored in the remote buffer takes longer to transfer during failure recovery. A long failure recovery time will affect normal user accesses and risks a second failure. Finding an effective way for fast recovery is another interesting issue for future work.
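The two recovery paths above can be sketched as follows. This is a schematic illustration under our reading of the text; the method names and the heartbeat timeout are hypothetical, not values from the paper.

```python
# Schematic failure handling for a cooperative pair (illustrative only).
HEARTBEAT_TIMEOUT = 3.0   # seconds without a heartbeat before the peer is declared failed (assumed)

def on_local_restart(local, peer):
    # Local failure: rebuild state from the neighbor's Remote Caching Table.
    rct = peer.fetch_rct()
    for lpn in rct.dirty_pages():
        local.ssd.write(lpn, peer.read_remote_buffer(lpn))  # replay backed-up dirty data
    peer.clear_remote_buffer()                              # release the neighbor's memory

def on_peer_unreachable(local):
    # Remote failure (crash or network partition): the backup copy is gone,
    # so stop forwarding writes and flush local dirty data to the SSD at once.
    local.stop_forwarding_writes()
    for lpn in local.lct.dirty_pages():
        local.ssd.write(lpn, local.local_buffer.get(lpn))
    local.lct.mark_all_clean()
```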
IV. EVALUATION

A. Experiment Setup
1) Trace-driven Simulator: Figure 5 shows our evaluation environment. FlashCoop runs on two servers connected by a 10 Gbit Ethernet network. The FlashCoop function modules, including the buffer management, dynamic memory allocation, and failure recovery modules, were implemented. FlashCoop interfaces with a DiskSim [23] based SSD simulation plug-in [6]. Various FTL functionalities were added, and the SSD code was extensively modified to enable device-level evaluation in terms of garbage collection overhead and write size distribution. We use the system without FlashCoop as the Baseline scheme for evaluation, which synchronously writes data to the SSD without any buffer.

Figure 5. Experiment Setup. (Two servers, each running FlashCoop on top of an SSD simulator and driven by a workload trace, connected to each other over the network.)
2) Workloads: We used a mixture of real-world and synthetic traces to study the efficiency of FlashCoop on various enterprise workloads. Table I presents the salient features of the workloads. We employed write-dominant and read-dominant I/O traces collected at a financial institution [25] and made available by the Storage Performance Council (SPC), henceforth referred to as Fin1 and Fin2. Since the original Fin1 and Fin2 traces are distributed over multiple servers, we filtered out and used the traces of one server. We also used a synthetic trace, referred to as Mix, to study the behavior of different cache replacement algorithms under a workload with mixed read/write and random/sequential characteristics.

TABLE I. SPECIFICATION OF WORKLOADS

Workload   Avg. Req. Size (KB)   Write (%)   Seq. (%)   Avg. Req. Interarrival Time (ms)
Fin1       4.38                  91          2.0        133.50
Fin2       4.84                  10          0.20       64.53
Mix        3.16                  50          50         199.91
3) FTL Configuration of SSD: We choose a page-based FTL and hybrid FTLs to evaluate how FlashCoop behaves under different SSD configurations. A block-based FTL is not chosen because it is not suitable for enterprise applications. Two typical hybrid FTLs are chosen: Block Associative Sector Translation (BAST) [10,14] and Fully Associative Sector Translation (FAST) [20]. The SSD configuration parameters used for evaluation are listed in Table II [6].

TABLE II. SPECIFICATION OF SSD CONFIGURATION

Page Read to Register                   25 μs
Page Program (Write) from Register      200 μs
Block Erase                             1.5 ms
Serial Access to Register (Data bus)    100 μs
Die Size                                4 GB
Block Size                              256 KB
Page Size                               4 KB
Data Register                           4 KB
Erase Cycles                            100 K
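As a back-of-the-envelope illustration of why merges are expensive with these parameters (our own arithmetic, not a result from the paper), consider a full merge of one 256 KB block (64 pages of 4 KB): every valid page must be read, transferred over the data bus, rewritten, and the old block erased.

```python
# Rough full-merge cost for one block, using the Table II timing parameters (a sketch).
PAGES_PER_BLOCK = 256 // 4          # 256 KB block / 4 KB page = 64 pages
READ_US, PROGRAM_US, BUS_US, ERASE_US = 25, 200, 100, 1500

page_read = READ_US + BUS_US        # read to register + serial transfer = 125 us
page_write = BUS_US + PROGRAM_US    # serial transfer + program         = 300 us
full_merge_us = PAGES_PER_BLOCK * (page_read + page_write) + ERASE_US
print(full_merge_us / 1000.0, "ms")                                 # ~28.7 ms per worst-case merge
print("vs one sequential page write:", page_write / 1000.0, "ms")   # 0.3 ms
```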
B. Experiment Results
We evaluated FlashCoop with the LAR, LRU (Least Recently Used), and LFU (Least Frequently Used) replacement algorithms against the Baseline. Please note that the results presented in this paper are collected on one server, except for the dynamic testing.
1) Performance: Figure 6 shows the average response time of FlashCoop and the Baseline for three FTL configurations as we vary the workloads. The following observations are made from the results. FlashCoop yields consistently better average response time than the Baseline across the different FTLs and traces, as shown in Figure 6. Taking BAST as an example, the average response time of FlashCoop with LAR is 0.63 ms and 0.32 ms under Fin1 and Fin2, respectively. By contrast, the average response time of the Baseline is 1.32 ms and 0.51 ms, respectively. The performance gain comes from two aspects: reduced synchronous writes and buffer hits on frequently updated data. The LAR replacement algorithm also outperforms the traditional LRU and LFU schemes. As shown in Figure 6(a), the average response time of the LRU and LFU schemes is 0.8 ms and 0.95 ms for trace Fin1, respectively; LAR outperforms the LRU and LFU caching schemes by a factor of about 1.6. This is because LAR is a block-based replacement algorithm optimized for the SSD, taking both sequential and temporal localities into account. LAR makes its contribution by improving the cache hit ratio and increasing the portion of sequential writes, as supported by the results in Table III, which reports the cache hit ratio of LAR, LRU, and LFU as the buffer size varies under workload Fin1. From the results we can see that LAR achieves a higher hit ratio than LRU and LFU.

TABLE III. CACHE HIT RATIO VARIES WITH BUFFER SIZE

                      Cache hit ratio (%)
Buffer size (pages)   1024    2048    4096    8192
LAR                   55.21   67.34   78.87   91.83
LRU                   50.53   61.53   71.81   83.32
LFU                   46.80   52.71   69.84   80.08

2) Garbage Collection Overhead: Figure 7 shows the number of erased blocks for FlashCoop and the Baseline under three FTL configurations as we vary the workloads. From Figure 7(a), we can see that the number of erased blocks is 11,000 and 20,000 for FlashCoop with LRU and for the Baseline under trace Fin1, respectively. Further, LAR outperforms LRU and LFU in terms of garbage collection overhead reduction. For example, the number of erased blocks is 11,000 and 12,000 for LRU and LFU, respectively, while LAR incurs only 8,700 block erases during the Fin1 replay. The results clearly indicate that FlashCoop with LAR can significantly reduce the overhead of the SSD's internal garbage collection, because the workload passed to the SSD by FlashCoop with LAR is more sequential than random.
3) Write Length Distribution: Because the main way FlashCoop and LAR optimize random writes is by influencing the access pattern presented to the SSD, i.e., the number of sequential writes, we show the write length distribution for workloads running on FlashCoop and the Baseline. For this purpose, we use CDF curves to show the percentage (Y-axis) of written pages whose write sizes are less than a certain value (X-axis). Figure 8 shows the write length distribution for the three workloads; the Baseline curve in Figure 8 denotes the original distribution of write length. The results show that LAR is very efficient in reducing the total number of writes and increasing the sequentiality of write accesses. As we can see in Figure 8(a), the percentage of 1-page writes under LRU and LFU is 29.22% and 27.32%, which is higher than the 10.65% of the Baseline. This means that the sequentiality of write accesses is actually reduced by LRU and LFU replacement. In contrast, LAR has only 2.98% such small writes, better than the Baseline. Further, the ratio of large writes with LAR is much higher than with LRU and LFU: almost 35.6% of writes are larger than 8 pages for LAR, while LRU and LFU produce almost no writes larger than 8 pages. The results further show that the write length distribution is directly correlated with the performance improvement and garbage collection overhead reduction. Under workload Fin1, LAR makes 68.67% of writes larger than 4 pages, compared with 12.59% and 11.56% for LRU and LFU (see Figure 8(a)). Accordingly, there is a 45.61% performance improvement (see Figure 6(a)) and a 51% garbage collection overhead reduction (see Figure 7(a)). This correlation clearly indicates that write length is a critical factor affecting SSD performance and garbage collection overhead, and that LAR makes its contribution by increasing the write size.
4) Impact of FTL: FlashCoop is effective for SSDs with various FTL configurations. From the results we can see that FlashCoop outperforms the Baseline under all FTL configurations: it achieves 45.64%, 13.79%, and 20.14% performance improvement over the Baseline for BAST, FAST, and the page-based FTL under the Fin1 workload (see Figure 6). Accordingly, there is a 51%, 41.62%, and 35.51% reduction in block erases for BAST, FAST, and the page-based FTL compared to the Baseline (see Figure 7). We also observed that the improvement of LAR for BAST is much larger than that of LRU and LFU. This is because BAST is suited to sequential access patterns, and the workload passed by LAR has more sequential locality for it to exploit than those of LRU and LFU (see Figure 8(a)).
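For readers who want to reproduce a curve like Figure 8 from their own traces, the CDF discussed in 3) above can be computed as sketched below. This is our own helper, assuming the distribution is taken over individual write requests (the paper's "written pages" wording could also be read as page-weighted).

```python
# Sketch: empirical CDF of write lengths (in pages), as plotted in Figure 8.
from collections import Counter

def write_length_cdf(write_lengths_in_pages):
    counts = Counter(write_lengths_in_pages)
    total = sum(counts.values())
    cdf, running = {}, 0
    for length in sorted(counts):
        running += counts[length]
        cdf[length] = 100.0 * running / total   # percent of writes with size <= length
    return cdf

# Example with a tiny synthetic sequence of write sizes (pages):
print(write_length_cdf([1, 1, 2, 4, 4, 8, 64]))
# {1: 28.57..., 2: 42.85..., 4: 71.42..., 8: 85.71..., 64: 100.0}
```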
Figure 6. Performance. (Average response time in ms of FlashCoop with LAR, LRU, and LFU and of the Baseline under Fin1, Fin2, and Mix, for (a) BAST, (b) FAST, and (c) page-based FTL.)

Figure 7. Garbage Collection Overhead. (Number of block erases, in units of 10,000, of FlashCoop with LAR, LRU, and LFU and of the Baseline under Fin1, Fin2, and Mix, for (a) BAST, (b) FAST, and (c) page-based FTL.)

Figure 8. Write length distribution. (CDF in percent versus write length in pages, from 1 to 64, for FlashCoop with LAR, LRU, and LFU and for the Baseline, under (a) Fin1, (b) Fin2, and (c) Mix.)
5) Dynamic Features: To evaluate the dynamic features of FlashCoop, we ran the write-intensive workload Fin1 and the read-intensive workload Fin2 on the remote server and varied the request arrival rate on the local server. We set α=0.4, β=0.2, and γ=0.4. The testing results are plotted in Figure 9. From the results, we can see that the value of θ decreases when the workload intensity on the local server increases. In addition, the value of θ is sensitive to the access pattern of the workload running on the remote server. For example, for a local request arrival rate of 0.3, the value of θ is 21.2% when workload Fin1 is running on the remote server; by contrast, the value of θ is 9.1% for the remote workload Fin2.
Figure 9. Memory allocation varies with workload. (Value of θ in percent versus the local access arrival rate from 0.1 to 0.5, with Fin1 or Fin2 running on the remote server.)

The above experimental results demonstrate that FlashCoop and LAR are very efficient at boosting system performance and reducing garbage collection overhead and internal fragmentation by increasing the sequentiality of the write accesses passed to the SSD.

V. RELATED WORK

A. Cache Management
Cooperative caching has been used to improve client latency and reduce server load for some time [21,34]. Over the years, numerous replacement algorithms have been proposed to reduce actual disk accesses. The oldest and still widely adopted algorithm is Least Recently Used (LRU). LRU is a recency-based policy, based on the assumption that a recently accessed block will be referenced again in the near future. LFU is a typical frequency-based policy, taking into account the frequency information that indicates the popularity of a block. A large number of further algorithms have been proposed, such as CLOCK [30], 2Q [32], ARC [31], and LIRS [33]. By exploiting the temporal locality of data accesses, all these replacement algorithms adopt cache hit ratio improvement as the sole objective in order to minimize disk activity [22]. However, this can be a misleading metric for an SSD. As discussed in Section II, the sequential locality of the write accesses passed to the SSD significantly influences performance, lifetime, internal fragmentation, and garbage collection overhead. These buffer management schemes are not effective for SSDs because sequential locality is unfortunately ignored. DULO [22] introduces spatial locality into page replacement and thus makes the replacement algorithm aware of page placement on the hard disk. However, it cannot be directly applied to SSDs because it considers the hard disk layout instead of the SSD layout. The paper [24] proposes a locality-aware cooperative caching protocol to effectively predict cache utilization and the probability of data reuse based on analysis and manipulation of data block reuse distance.

B. Flash Memory
Several studies have been conducted on flash storage concerning the performance of random writes at various levels of the storage hierarchy [15,16,17]. Research on FTLs tries to improve performance and address the problem of high garbage collection overhead. BAST exclusively associates a log block with a data block; in the presence of small random writes, this scheme suffers from increased garbage collection cost. FAST keeps a single sequential log block dedicated to sequential updates, while the other log blocks are used for random writes. The Superblock FTL scheme [12] utilizes block-level spatial locality in workloads by combining consecutive logical blocks into a superblock; it maintains page-level mappings within the superblock to exploit temporal locality by separating hot and cold data. The Locality-Aware Sector Translation (LAST) scheme [5] tries to alleviate the shortcomings of BAST and FAST by exploiting both temporal and sequential locality in workloads; it further separates random log blocks into hot and cold regions to reduce garbage collection cost. Unlike the currently predominant hybrid FTLs, the Demand-based Flash Translation Layer (DFTL) [11] is purely page-mapped: it exploits the temporal locality in enterprise-scale workloads to store the most popular mappings in the limited SRAM, while the rest are maintained on the flash device itself. BPLRU [13], FAB [28], and LB-CLOCK [29] are buffer management schemes proposed inside the SSD to reduce random writes; since FlashCoop is designed at the system level, they address a different layer. MFT [19], a block-device-level solution, translates random writes into sequential writes between the file system and the SSD. FlashLite [18], with a similar idea, does this between the application and the file system for P2P file sharing. Griffin [27] uses a log-structured HDD as a write cache to improve the sequentiality of the write accesses to the SSD. This paper differs from the above studies in a number of ways. First, FlashCoop implements a block-based buffer management that is aware of the SSD layout. Second, a locality-aware replacement algorithm is proposed to leverage both the temporal and sequential localities of accesses. Third, the proposed replacement algorithm LAR buffers both reads and writes to make full use of the locality information of accesses. Finally, this study conducts its investigation from both the system and the device point of view, which provides a detailed understanding of how the system design should interact with the SSD.
VI. CONCLUSIONS AND FUTURE WORK
In this paper, we propose a scheme called FlashCoop to improve the performance and endurance of SSDs in enterprise environments. By coordinating the memory of a neighboring server over a high-speed network, FlashCoop writes data into both a local buffer and a remote buffer. Both the temporal and sequential localities of the access pattern are exploited in the design of the buffer cache management. Leveraging the filtering
effect of the cooperative buffer, FlashCoop can shape the I/O request stream and turn the accesses passed to the SSD into more sequential accesses. FlashCoop not only improves access latency and extends SSD lifetime, it also significantly reduces the negative impacts of random writes on the internal activities of the SSD, including garbage collection and wear leveling. In our immediate future work, we will implement an SSD device in the Linux kernel to enable system-level and device-level investigation in a real environment.
REFERENCES
[1] G. R. Ganger and M. F. Kaashoek. Embedded inodes and explicit grouping: exploiting disk bandwidth for small files. In Proc. of USENIX '97, pages 1-17, January 1997.
[2] D. Roselli, J. Lorch, and T. Anderson. A Comparison of File System Workloads. In Proc. of USENIX '00, pp. 41-54, 2000.
[3] Adam Sweeney, Doug Doucette, Wei Hu, Curtis Anderson, Mike Nishimoto, and Geoff Peck. Scalability in the XFS File System. In Proc. of USENIX '96, 1996.
[4] T. Claburn. Google plans to use Intel SSD storage in servers. http://www.informationweek.com/news/storage/systems/showArticle.jhtml?articleID=207602745.
[5] S. Lee, D. Shin, Y. Kim, and J. Kim. LAST: Locality-Aware Sector Translation for NAND Flash Memory-Based Storage Systems. In Proc. of SPEED'08, February 2008.
[6] N. Agrawal, V. Prabhakaran, T. Wobber, J. D. Davis, M. Manasse, and R. Panigrahy. Design tradeoffs for SSD performance. In Proc. of USENIX'08, June 2008.
[7] F. Chen, D. A. Koufaty, and X. D. Zhang. Understanding intrinsic characteristics and system implications of flash memory based solid state drives. In Proc. of SIGMETRICS'09, 2009.
[8] Abhishek Rajimwale, Vijayan Prabhakaran, and John D. Davis. Block Management in Solid-State Devices. In Proc. of USENIX'09, 2009.
[9] H. J. Choi, S. Lim, and K. H. Park. JFTL: a flash translation layer based on a journal remapping for flash memory. ACM Transactions on Storage, vol. 4, January 2009.
[10] T. Chung, D. Park, S. Park, D. Lee, S. Lee, and H. Song. System software for flash memory: a survey. In Proc. of ICEUC'06, 2006.
[11] A. Gupta, Y. Kim, and B. Urgaonkar. DFTL: a flash translation layer employing demand-based selective caching of page-level address mappings. In Proc. of ASPLOS'09, 2009.
[12] J. Kang, H. Jo, J. Kim, and J. Lee. A superblock-based flash translation layer for NAND flash memory. In Proc. of ICES'06, 2006.
[13] H. Kim and S. Ahn. BPLRU: A buffer management scheme for improving random writes in flash storage. In Proc. of FAST'08, 2008.
[14] J. Kim, J. M. Kim, S. H. Noh, S. L. Min, and Y. Cho. A space-efficient flash translation layer for CompactFlash systems. IEEE Transactions on Consumer Electronics, 48(2):366-375, 2002.
[15] I. Koltsidas and S. Viglas. Flashing up the storage layer. In Proc. of VLDB'08, 2008.
[16] S. Lee, B. Moon, C. Park, J. Kim, and S. Kim. A case for flash memory SSD in enterprise database applications. In Proc. of SIGMOD'08, 2008.
[17] D. Narayanan, E. Thereska, A. Donnelly, S. Elnikety, and A. Rowstron. Migrating enterprise storage to SSDs: analysis of tradeoffs. In Proc. of EuroSys'09, 2009.
[18] Hyojun Kim and Umakishore Ramachandran. FlashLite: a user-level library to enhance durability of SSD for P2P File Sharing. In Proc. of ICDCS'09, 2009.
[19] EasyCo Company. Managed Flash Technology. http://www.easyco.com/mft/index.htm.
[20] S. Lee, D. Park, T. Chung, D. Lee, S. Park, and H. Song. A Log Buffer based Flash Translation Layer Using Fully Associative Sector Translation. IEEE Transactions on Embedded Computing Systems, 6(3):18, 2007.
[21] Jin-Woo Song, Kyo-Sung Park, and Sung-Bong Yang. An Effective Cooperative Cache Replacement Policy for Mobile P2P Environments. In Proc. of ICHIT'06, 2006.
[22] Song Jiang, Xiaoning Ding, Feng Chen, Enhua Tan, and Xiaodong Zhang. DULO: an Effective Buffer Cache Management Scheme to Exploit both Temporal and Spatial Locality. In Proc. of FAST'05, 2005.
[23] John S. Bucy, Jiri Schindler, Steven W. Schlosser, and Gregory R. Ganger. The DiskSim Simulation Environment Version 4.0 Reference Manual. Technical Report CMU-PDL-08-101, Carnegie Mellon University, May 2008.
[24] Song Jiang, Kei Davis, Fabrizio Petrini, Xiaoning Ding, and Xiaodong Zhang. A Locality-Aware Cooperative Cache Management Protocol to Improve Network File System Performance. In Proc. of ICDCS'06, 2006.
[25] OLTP Trace from the UMass Trace Repository. http://traces.cs.umass.edu/index.php/Storage/Storage
[26] L. P. Chang. On efficient wear leveling for large-scale flash memory storage systems. In Proceedings of the 2007 ACM Symposium on Applied Computing, USA, 2007.
[27] Gokul Soundararajan, Vijayan Prabhakaran, Mahesh Balakrishnan, and Ted Wobber. Extending SSD Lifetimes with Disk-Based Write Caches. In Proc. of FAST'10, 2010.
[28] H. Jo, J. Kang, S. Park, J. Kim, and J. Lee. FAB: Flash-aware Buffer Management Policy for Portable Media Players. IEEE Transactions on Consumer Electronics, vol. 22, no. 2, 2006.
[29] Biplob Debnath, Sunil Subramanya, David Du, and David J. Lilja. Large Block CLOCK (LB-CLOCK): A Write Caching Algorithm for Solid State Disks. In Proc. of MASCOTS'09, 2009.
[30] F. Corbato. A Paging Experiment with the Multics System. MIT Project MAC Report MAC-M-384, 1968.
[31] N. Megiddo and D. Modha. ARC: A Self-tuning, Low Overhead Replacement Cache. In Proc. of FAST'03, 2003.
[32] T. Johnson and D. Shasha. 2Q: A Low Overhead High Performance Buffer Management Replacement Algorithm. In Proc. of VLDB'94, pp. 439-450, 1994.
[33] S. Jiang and X. Zhang. LIRS: An Efficient Low Inter-reference Recency Set Replacement Policy to Improve Buffer Cache Performance. In Proc. of SIGMETRICS'02, June 2002.
[34] M. D. Dahlin, R. Y. Wang, T. E. Anderson, and D. A. Patterson. Cooperative Caching: Using Remote Client Memory to Improve File System Performance. In Proc. of USENIX OSDI'94, November 1994.