This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCAD.2015.2474394, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems


An Endurance-Aware Metadata Allocation Strategy for MLC NAND Flash Memory Storage Systems

Min Huang, Zhaoqing Liu, Liyan Qiao, Yi Wang, Zili Shao

Abstract—This paper presents a reliability-aware metadata allocation strategy called Scatter-SLC for MLC NAND flash memory storage systems. In Scatter-SLC, metadata is kept in LSB pages and the corresponding MSB pages are bypassed. Without partitioning blocks into SLC and MLC blocks, Scatter-SLC eliminates the unbalanced lifetime between SLC and MLC blocks while achieving an error rate similar to that of storing metadata in SLC blocks. We implemented Scatter-SLC on a real hardware platform. The experimental results show that Scatter-SLC can reduce uncorrectable page errors by 93.54% while incurring less than 1% time overhead on average compared with the previous work.

Index Terms—NAND flash memory, MLC, reliability, metadata, shared page.

Min Huang, Zhaoqing Liu and Liyan Qiao are with the School of Electrical Engineering and Automation, Harbin Institute of Technology, Harbin, China. Yi Wang is with Guangdong Province Key Laboratory of Popular High Performance Computers, Shenzhen University, Shenzhen, China. Zili Shao is with the Department of Computing, the Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong. Copyright (c) 2015 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to [email protected].

I. INTRODUCTION

MLC flash devices have been widely adopted by the market and have become the mainstream flash storage media [1], [2]. The dramatic improvements in the price and capacity of MLC NAND flash memory have been achieved by technological advances, such as the aggressive shrinking of process geometry and the increase in the number of bits stored in each memory cell. These technological advances inevitably degrade the reliability of MLC NAND flash. Metadata occupies a small portion of storage space but maintains the critical information of the file system and the address translations of the storage system. How to enhance the reliability of the metadata stored in MLC NAND flash memory has therefore become an important issue.

To enhance the metadata reliability, existing approaches adopt stronger ECC (Error Correcting Code) [3] or redundancy [4], [5]. Other work exploited the architecture of the MLC NAND flash memory [6], [7], in which physical blocks can be partitioned into SLC blocks and MLC blocks. SLC blocks provide high reliability and MLC blocks provide high capacity. By manipulating the trade-off between reliability and capacity, this block-level partition can effectively reduce the error rate of a NAND flash memory storage system. However, it introduces the problem of unbalanced lifetime between SLC blocks and MLC blocks.

Fine-grained page-level allocation is a very promising direction for solving this problem. In MLC NAND flash memory, each physical page consists of an LSB page and an MSB page; programming the LSB page while bypassing the corresponding MSB page can achieve high reliability without introducing the unbalanced lifetime caused by the SLC and MLC block partition. We call this method the page-level SLC mode. Our experiments on a real hardware platform show that a metadata allocation strategy based on the page-level SLC mode can achieve a level of reliability similar to that of one based on SLC blocks.

This paper presents Scatter-SLC, a reliability enhancement strategy that utilizes the page-level SLC mode. Scatter-SLC scatters write requests for metadata pages to different LSB pages while bypassing MSB pages. Without partitioning blocks into SLC and MLC blocks, Scatter-SLC eliminates the unbalanced lifetime between SLC blocks and MLC blocks while achieving an error rate similar to that of the method that stores metadata in SLC blocks. The proposed scheme can be combined with wear-leveling algorithms [6], [8] and can supplement wear-leveling schemes to further improve both the reliability and the endurance of MLC NAND flash.

We have implemented Scatter-SLC on a real hardware platform. The experimental results show that Scatter-SLC can reduce uncorrectable page errors by 93.54% on average compared with the baseline scheme. Scatter-SLC can also extend the lifetime by more than three times compared with the method based on SLC blocks. In terms of I/O performance, Scatter-SLC can improve the average response time by up to 7.00% compared with Meta-Cure [4].

0278-0070 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


The rest of the paper is organized as follows. Section II discusses the motivation. Section III presents our proposed Scatter-SLC. Section IV presents experimental results. Finally, Section V concludes the paper.

II. MOTIVATION

[Figure 1: panel (a) plots the bit error rate (%) against the error pattern; panel (b) plots the number of error bits against P/E cycles for PPN 2, PPN 3, PPN 6 and PPN 7.]

Fig. 1. (a) All possible error patterns for MLC NAND flash memory with 2,000 P/E cycles and (b) the test results for the number of error bits with Micron 29F64GCBAAA NAND flash memory chips.

[Figure 2: panel (a) plots the raw bit error rate (%) against P/E cycles; panel (b) plots the uncorrectable page error rate (%) against P/E cycles. Both panels compare normal programming, a random page with the page-level SLC mode, and the block-level SLC mode, and mark the maximum P/E cycle.]

Fig. 2. (a) Raw Bit Error Rate (RBER) without using ECC (b) Uncorrectable page error rate with 5-bit ECC.

In MLC NAND flash memory, an LSB page and its corresponding MSB page form a pair of shared pages. A 2-bit MLC NAND flash memory chip has four codings, “11”, “01”, “00”, and “10”, and a cell can erroneously change from one reference voltage state to another, which gives 12 possible error patterns in total. We have tested all these patterns. The details of the experimental setup are given in Section IV. The percentage of each pattern is presented in Figure 1(a). The results show that the majority of errors occur during the change in state from “00” to “01”. By only programming the LSB page, where the physical cell only has “11” and “01” states, more than 60% of the errors can be directly avoided.

There are two ways of programming only LSB pages: block-level SLC programming, in which all pages of a block use LSB-only programming, and page-level SLC programming, in which only some of the pages in a block use LSB-only programming and the other pages are programmed normally. We adopt the page-level SLC mode and test the number of error bits on a real hardware platform (see Section IV for the detailed experimental setup). We program all pages in one physical block except page 8 and page 9, which are the corresponding MSB pages of two LSB pages (page 2 and page 3), respectively. The results are presented in Figure 1(b). We can observe that the number of error bits of page 2 and page 3, which utilize the page-level SLC mode, is far lower than that of other physical pages whose MSB pages and LSB pages are both programmed (such as page 6 and page 7). Therefore, we can utilize this page-level SLC mode to enhance the reliability of MLC NAND flash memory.

As shown in Figure 2, the page-level SLC mode significantly reduces the raw bit error rate compared to the normal programming mode. With a 5-bit BCH code to correct page errors, at the maximum of 3,000 P/E cycles for the lifetime of the tested MLC NAND flash, the uncorrectable page error rate of the page-level SLC mode is less than 5%, while that of the normal programming mode is over 25%. Furthermore, as shown in Figure 2(b), by combining our scheme with the default ECC, the uncorrectable page error rate of the page-level SLC mode is almost the same as that of the block-level SLC mode.

Unlike the block-level SLC mode, the page-level SLC mode does not restrict specific blocks to SLC mode. Therefore, an LSB page could be used in the page-level SLC mode, or it could be used in normal mode within the next P/E cycle. This provides more flexibility than the block-level SLC mode. Since pages containing metadata only account for a small percentage of all pages stored in NAND flash memory storage systems, only small space and time overheads are incurred by storing the metadata in LSB pages and bypassing the corresponding MSB pages. Furthermore, the metadata usually has a much higher update frequency than normal data and is highly related to the workload.
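The count of 12 error patterns quoted at the start of this section can be checked by simple enumeration. The sketch below is illustrative only and assumes that an error pattern is an ordered pair of distinct cell states (the state that was programmed, the state that is read back); the names STATES and patterns are chosen for this example and do not come from the measurement setup.

```python
from itertools import permutations

# The four codings of a 2-bit MLC cell, as listed in the text.
STATES = ["11", "01", "00", "10"]

# Assumption: an error pattern is an ordered pair (programmed, read back)
# of two different states.
patterns = list(permutations(STATES, 2))

for programmed, read_back in patterns:
    print(f"{programmed} -> {read_back}")

print(len(patterns))  # 4 states x 3 other states = 12
```

Each of the four states can be mistaken for any of the three others, which is where the figure of 12 comes from.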
With the page-level SLC mode, the metadata will be distributed randomly across the whole physical space; thus, advanced load balancing methods such as [6], [8] are no longer required, while a low uncorrectable page error rate can still be achieved. The page-level SLC mode also does not introduce the lifetime imbalance found in the block-level SLC mode. This observation motivates us to propose a scheme to enhance metadata reliability for MLC NAND flash.

III. SCATTER-SLC

A. Overview of Scatter-SLC

Figure 3 illustrates the system architecture of Scatter-SLC, in which the metadata is kept in critical pages. Our basic idea is to program metadata with the page-level SLC mode so that we can keep the metadata at low error rates while eliminating the unbalanced lifetime between metadata and normal data that arises with the block-level SLC mode. Scatter-SLC makes full use of the page-level SLC mode and adopts two techniques, namely metadata address pairing and scatter garbage collection.

[Figure 3: block diagram. Components shown: Host File System and Host Interface; NAND Flash Storage System; Host Interface Logic; Request Queue Scheduler; uncritical pages; critical pages; Scatter-SLC Flash Translation Layer (FTL); Pairing Table; NAND Flash Controller; NAND Flash Memory Array.]

Fig. 3. System architecture of Scatter-SLC.

B. Metadata Address Pairing

The metadata address pairing identifies the pages that contain metadata and allocates them based on the page-level SLC mode. Metadata such as file system metadata are normally stored in predefined locations with fixed logical addresses. Thus, we can identify critical pages containing metadata based on logical page numbers.

Algorithm III.1 Write operation for Scatter-SLC method
Input: A logical page number (LPN#M), LPN page content.
Output: Write the content to a physical page.
1: Allocate a physical page PPN#N for LPN#M by an FTL
2: if LPN#M is a critical page then
3:     if PPN#N is an LSB page then
4:         Write LPN#M into PPN#N
5:         Mark the corresponding MSB page of PPN#N as invalid
6:     else
7:         Invalidate the mapping ⟨LPN#M, PPN#N⟩
8:         Send a signal to Request Queue Scheduler to delay the current request
9:         if there is a non-critical-page write request in the request queue then
10:            Serve this non-critical-page write request with PPN#N
11:            Serve the delayed critical page write request again
12:        else
13:            Find the nearest LSB page PPN#R
14:            Write LPN#M into PPN#R
15:            Invalidate all the physical pages between PPN#N and PPN#R
16:        end if
17:    end if
18: else
19:    Write LPN#M into PPN#N
20: end if

Algorithm III.1 describes the process of a write operation. When an FTL allocates a physical page for a write request (line 1), if the request is for a critical page (line 2) and an LSB page has been allocated (line 3), then, based on the metadata address pairing strategy, the content is written to the LSB page (line 4) and its corresponding MSB page is marked as invalid based on the shared-pages mapping table (line 5). If the FTL assigns an MSB page to a critical page (line 6), the mapping between this page and the MSB page is invalidated (line 7), and a signal is sent to the Request Queue Scheduler to delay this request (line 8). If a non-critical page write request can be found in the request queue (line 9), it is served with the MSB page (line 10), and the Request Queue Scheduler then serves the delayed request again following the above operations (line 11). If there is no non-critical page write request in the request queue (line 12), the nearest available LSB page is found (line 13) and used to serve the request (line 14), and all physical pages between the current MSB page and that LSB page are marked as invalid (line 15). A write request for non-critical data (line 18) can be served with either an LSB or an MSB page (line 19).
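The control flow of Algorithm III.1 can be illustrated with a small single-block simulation. This is a sketch under stated assumptions, not the implementation evaluated in Section IV: the class ToyBlock, the 12-page PAIRING table (only the pairs 2/8 and 3/9 appear in the text, the others are invented), the lowest-free-page allocation policy, the use of bare LPNs as queued requests, and the threshold CRITICAL_LPN_LIMIT (taken from the NTFS setup in Section IV) are all hypothetical choices made for the example. It also assumes that the MSB partner of PPN#R is bypassed after line 14, which the page-level SLC mode implies but the pseudocode does not state.

```python
from collections import deque

CRITICAL_LPN_LIMIT = 1024  # assumption: metadata lives in the first 1,024 logical pages
# Toy shared-page table (LSB page -> paired MSB page) for a 12-page block.
# Only the pairs (2, 8) and (3, 9) appear in the text; the rest are invented.
PAIRING = {0: 4, 1: 5, 2: 8, 3: 9, 6: 10, 7: 11}
PAGES_PER_BLOCK = 12


def is_critical(lpn):
    return lpn < CRITICAL_LPN_LIMIT


class ToyBlock:
    def __init__(self):
        self.state = ["free"] * PAGES_PER_BLOCK  # free | valid | invalid
        self.mapping = {}                        # LPN -> PPN
        self.queue = deque()                     # pending write requests (LPNs)

    def allocate(self):
        # Line 1: hand out the lowest-numbered free page (toy policy).
        return self.state.index("free")

    def program(self, ppn, lpn):
        self.state[ppn] = "valid"
        self.mapping[lpn] = ppn

    def take_non_critical(self):
        # Line 9: remove and return the first queued non-critical request.
        for lpn in self.queue:
            if not is_critical(lpn):
                self.queue.remove(lpn)
                return lpn
        return None

    def write(self, lpn):
        ppn = self.allocate()                     # line 1
        if not is_critical(lpn):                  # line 18
            self.program(ppn, lpn)                # line 19
        elif ppn in PAIRING:                      # lines 2-3: LSB page allocated
            self.program(ppn, lpn)                # line 4
            self.state[PAIRING[ppn]] = "invalid"  # line 5: bypass the MSB page
        else:                                     # line 6: MSB page allocated
            other = self.take_non_critical()      # lines 7-9
            if other is not None:
                self.program(ppn, other)          # line 10
                self.write(lpn)                   # line 11: retry delayed request
            else:
                lsb = next(p for p in range(ppn + 1, PAGES_PER_BLOCK)
                           if p in PAIRING and self.state[p] == "free")  # line 13
                for p in range(ppn, lsb):         # line 15: skip pages in between
                    if self.state[p] == "free":
                        self.state[p] = "invalid"
                self.program(lsb, lpn)            # line 14
                self.state[PAIRING[lsb]] = "invalid"  # assumed MSB bypass


block = ToyBlock()
block.write(2000)            # non-critical data takes LSB page 0
for lpn in (1, 2, 3):        # critical pages take LSB pages 1, 2, 3
    block.write(lpn)
block.queue.append(3000)     # a non-critical request is waiting
block.write(4)               # page 4 is an MSB page: 3000 uses it, LPN 4 moves on
print(block.mapping)
```

In this run, the critical LPN 4 is first offered MSB page 4; the queued non-critical request 3000 takes that page instead, and LPN 4 is retried and lands on LSB page 6, whose partner (page 10 in the toy table) is then bypassed.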

C. Scatter Garbage Collection

When a garbage collector copies valid pages from a victim block to a free block, we need to make sure that critical pages are still stored in the page-level SLC mode. To achieve this, the valid page copy process of the garbage collection needs to be modified. We propose the scatter garbage collection strategy to handle this. Our basic idea is to use a small buffer to reorganize write requests. Algorithm III.2 presents the basic procedure of the garbage collection operations in Scatter-SLC.

Algorithm III.2 Valid page copies in garbage collection of Scatter-SLC
Input: Valid pages in the valid page buffer.
Output: Write the valid pages in the valid page buffer into a free physical block.
1: for each valid critical page Pci in the buffer do
2:     Write Pci to the next available LSB page PPN#M
3:     Find the corresponding MSB page of PPN#M and mark it as invalid
4:     Update the address mapping table
5:     Remove Pci from the valid page buffer
6: end for
7: for each valid non-critical page Pui in the buffer do
8:     Write Pui to the next available physical page
9:     Update the address mapping table
10:    Remove Pui from the valid page buffer
11: end for
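The valid page copy step of Algorithm III.2 can be sketched in a few lines. As before, this is a toy model rather than the evaluated implementation: the function name copy_valid_pages, the representation of the valid page buffer as a list of (LPN, is_critical) tuples, and the 12-page PAIRING table (only the pairs 2/8 and 3/9 appear in the text) are assumptions made for illustration.

```python
# Toy shared-page table (LSB page -> paired MSB page) for a 12-page block.
# Only the pairs (2, 8) and (3, 9) appear in the text; the rest are invented.
PAIRING = {0: 4, 1: 5, 2: 8, 3: 9, 6: 10, 7: 11}
PAGES_PER_BLOCK = 12


def copy_valid_pages(valid_page_buffer):
    """Copy buffered valid pages into a fresh block, following Algorithm III.2.

    valid_page_buffer is a list of (lpn, is_critical) tuples and is emptied
    as pages are written. Returns (mapping, state) for the destination block.
    """
    state = ["free"] * PAGES_PER_BLOCK  # free | valid | invalid
    mapping = {}                        # LPN -> PPN in the destination block

    # Lines 1-6: critical pages go to the next available LSB page.
    for lpn, critical in list(valid_page_buffer):
        if not critical:
            continue
        ppn = next(p for p in sorted(PAIRING) if state[p] == "free")  # line 2
        state[ppn] = "valid"
        state[PAIRING[ppn]] = "invalid"             # line 3: bypass the MSB page
        mapping[lpn] = ppn                          # line 4
        valid_page_buffer.remove((lpn, critical))   # line 5

    # Lines 7-11: non-critical pages take the next free page, LSB or MSB.
    for lpn, critical in list(valid_page_buffer):
        ppn = state.index("free")                   # line 8
        state[ppn] = "valid"
        mapping[lpn] = ppn                          # line 9
        valid_page_buffer.remove((lpn, critical))   # line 10

    return mapping, state


buffer = [(2000, False), (1, True), (2001, False), (2, True)]
mapping, state = copy_valid_pages(buffer)
print(mapping)
```

Processing the critical pages first guarantees that each of them finds an LSB page whose MSB partner is still unprogrammed in the fresh block; the non-critical pages then fill whatever pages remain.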

In Algorithm III.2, a small buffer called the valid page buffer is used to store all valid pages from a victim block. When a valid page in a victim block is moved to the buffer, it is marked as critical or non-critical based on whether or not it is stored in the page-level SLC mode. When valid pages are moved to a free block, valid critical pages are processed first: they are written to LSB pages sequentially, with the corresponding MSB pages bypassed (lines 1-6). Valid non-critical pages are then processed by writing them to both LSB and MSB pages sequentially (lines 7-11).

IV. EVALUATION

In this section, we first introduce the experimental setup and then present the experimental results.

A. Experimental Setup

We conducted a set of experiments on a real hardware platform built on the Xilinx Zynq platform. In this evaluation, we used 64 Gb Micron 29F64GCBAAA NAND flash memory chips. A page-level FTL is implemented, and a BCH code is implemented and used as the error correction code. We focus on enhancing the reliability of the metadata for the NTFS file system, in which the metadata are allocated to the first 1,024 logical pages.

B. Results and Discussion

In this section, we present and analyze the experimental results. We test Scatter-SLC using a set of I/O traces generated by the standard tool IOMeter. Scatter-SLC is compared with several previous techniques, including Meta-Cure [4], Gather-SLC, and the default error correction scheme in the Linux kernel (labeled as “Baseline”) [4], over three performance metrics: the number of uncorrectable page errors, the I/O performance, and the endurance. Meta-Cure makes redundant copies of critical pages to enhance reliability. Gather-SLC is a technique based on the block-level SLC mode, in which the metadata are kept in a specific SLC block partition.

1) Uncorrectable Page Errors: Figure 4 presents the number of uncorrectable page errors under two levels of wear (i.e., 1,000 P/E cycles and 3,000 P/E cycles). The workload generated by IOMeter was configured with 50% random requests and 50% read accesses. Compared to the baseline scheme, Scatter-SLC reduces uncorrectable page errors by 93.54% and 86.05% for 1,000 and 3,000 P/E cycles, respectively. According to Figure 4, our scheme can achieve a similar reduction of uncorrectable page errors with weaker ECC strength (less than 2-bit ECC). Compared to Gather-SLC, we can observe that the raw bit error rate (0-bit ECC) of Scatter-SLC is larger than that of Gather-SLC, because Gather-SLC has a relatively lower average uncorrectable page error. However, as the ECC strength increases, Scatter-SLC achieves numbers of uncorrectable page errors similar to those of Gather-SLC.

[Figure 4: panels (a) and (b) plot the number of uncorrectable page errors (logarithmic axis, 1 to 16384) against ECC strength (0 to 8) for Scatter-SLC, Meta-Cure, Baseline and Gather-SLC.]

Fig. 4. The number of uncorrectable page errors for Baseline, Meta-Cure, Scatter-SLC and Gather-SLC with different ECC strength with (a) P/E cycle of 1000 (b) P/E cycle of 3000.

2) I/O Performance: We evaluated the I/O performance of the proposed scheme in terms of the average response time. We configured Gather-SLC with a static partition sized according to the amount of metadata. Figure 5 shows the experimental results. We observe that Scatter-SLC achieves average response times similar to those of the baseline scheme. From the results, the overhead in average response time varies between 0.17% and 2.23%. Therefore, Scatter-SLC incurs only negligible overhead in terms of the average response time. Compared to Meta-Cure, our scheme can improve the average response time by 7.00% and 6.11% for cases (a) and (b), respectively, while Meta-Cure has a response time over two times longer for read operations. For Gather-SLC, the SLC partition size is set to 105% of the metadata size (an extra 5% of free space for garbage collection). Compared to Gather-SLC, our scheme can improve the average response time by 12.66% and 16.75% for cases (a) and (b), respectively. This is because Gather-SLC may experience a much larger number of block erase counts in its SLC partition, even though it has the low access latency of LSB pages.

[Figure 5: panels (a) and (b) plot the average response time (ms) against (a) the percentage of read operations and (b) the percentage of random accesses (0 to 100) for Baseline, Scatter-SLC, Meta-Cure and Gather-SLC.]

Fig. 5. The average response time for Baseline, Meta-Cure, Scatter-SLC and Gather-SLC with different (a) Percentage of read operations (b) Percentage of random accesses.

3) Endurance: To evaluate the endurance, we compared Scatter-SLC with Meta-Cure and Baseline in terms of the number of block erase counts. We also compared the lifetimes of Scatter-SLC and Gather-SLC. From the results in Figure 6(a), we can observe that Scatter-SLC achieves almost the same number of block erase counts as Meta-Cure. This indicates that our page allocation strategy in Algorithm III.1 is very effective in improving block utilization. Scatter-SLC incurs 3.41% (on average) and 7.14% (at maximum) extra overhead in terms of block erase counts compared to the baseline scheme. Compared to Gather-SLC, when the size of the SLC partition is less than 16 MB (i.e., four times the space occupied by the metadata), the lifetime of Scatter-SLC is longer than that of Gather-SLC. When the size of the SLC partition is similar to the space occupied by the metadata, Scatter-SLC can extend the lifetime by more than three times compared with Gather-SLC. These results show that, although the maximum lifetime of the SLC partition is ten times longer than that of MLC NAND flash (according to the datasheets of SLC and MLC NAND flash), the SLC partition wears out much faster than Scatter-SLC because of the high update frequency of the metadata and the limited space.

[Figure 6: panel (a) plots the number of block erase counts against the percentage of read operations for Baseline, Scatter-SLC and Meta-Cure; panel (b) plots the normalized lifetime against the SLC partition size (MB) for the SLC partition of Gather-SLC, the MLC partition of Gather-SLC, Gather-SLC overall (fitting) and Scatter-SLC.]

Fig. 6. (a) The number of block erase counts for Baseline, Meta-Cure and Scatter-SLC (b) Lifetime comparison of Scatter-SLC and Gather-SLC.

V. CONCLUSION

In this paper, we proposed Scatter-SLC, a reliability-aware metadata allocation strategy for MLC NAND flash memory storage systems. The experimental results show that Scatter-SLC can significantly reduce uncorrectable page errors while eliminating the lifetime imbalance caused by the SLC and MLC partition. In the future, we plan to extend our work to other types of NAND flash [9], [10]. Currently, our technique is implemented based on page-level FTLs.

ACKNOWLEDGMENT

The work described in this paper is partially supported by the grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (GRF 152138/14E and GRF 15222315/15E), National Natural Science Foundation of China (Project 61272103, Project 61373049 and Project 61502309), National 863 Program 2013AA013202, the Hong Kong Polytechnic University (4-ZZD7, G-YK24, G-YM10 and G-YN36), Guangdong Province Key Laboratory Project (2012A061400024), Guangdong Natural Science Foundation (2014A030310269), and Natural Science Foundation of SZU (201534 and 2015/827-000073). Liyan Qiao and Yi Wang are corresponding authors.

REFERENCES

[1] Y. Joo et al., “An energy characterization platform for memory devices and energy-aware data compression for multilevel-cell flash memory,” TODAES, vol. 13, no. 3, pp. 43:1–43:29, 2008.
[2] Y.-H. Chang et al., “A reliable MTD design for MLC flash-memory storage systems,” in EMSOFT ’10, 2010, pp. 179–188.
[3] Y. Kang et al., “Adding aggressive error correction to a high-performance compressing flash file system,” in EMSOFT ’09, 2009, pp. 305–314.
[4] Y. Wang et al., “Meta-Cure: A reliability enhancement strategy for metadata in NAND flash memory storage systems,” in DAC ’12, 2012, pp. 214–219.
[5] R. Zhou et al., “Characterizing the efficiency of data deduplication for big data storage management,” in IISWC ’12, 2012, pp. 98–108.
[6] S. Hong et al., “NAND flash-based disk cache using SLC/MLC combined flash memory,” in SNAPI ’10, 2010, pp. 21–30.
[7] L.-P. Chang et al., “An adaptive, low-cost wear-leveling algorithm for multichannel solid-state disks,” in TECS, 2013, vol. 13, pp. 55:1–55:26.
[8] J. Xavier et al., “Software controlled cell bit-density to improve NAND flash lifetime,” in DAC ’12. ACM, 2012, pp. 229–234.
[9] G. Sun et al., “Exploring the vulnerability of CMPs to soft errors with 3D stacked nonvolatile memory,” JETC, vol. 9, no. 3, pp. 22:1–22:22, 2013.
[10] J. Guo et al., “DPA: A data pattern aware error prevention technique for NAND flash lifetime extension,” in ASPDAC ’14, 2014, pp. 592–597.
