CDF-LDPC: A New Error Correction Method for SSD to Improve the Read Performance

SHIGUI QI, DAN FENG, NAN SU, LINJUN MEI, and JINGNING LIU, Huazhong University of Science and Technology

The raw error rate of a Solid-State Drive (SSD) increases gradually with the number of Program/Erase (P/E) cycles, the retention time, and the number of read cycles. Traditional approaches often use Error Correction Code (ECC) to ensure the reliability of SSDs. For error-free flash memory pages, the time spent on ECC is redundant and makes read performance suboptimal. This article presents a CRC-Detect-First LDPC (CDF-LDPC) algorithm to optimize the read performance of SSDs. The basic idea is to bypass Low-Density Parity-Check (LDPC) decoding for error-free flash memory pages, which can be identified using a Cyclic Redundancy Check (CRC) code. Thus, error-free pages can be read directly without sacrificing the reliability of SSDs. Experimental results show that read performance is improved by more than 50% compared with traditional approaches. In particular, when the idle time of benchmarks and SSD parallelism are exploited, CDF-LDPC can be performed more efficiently. In this case, the read performance of SSDs can be improved by up to about 80% compared with the state of the art.

Categories and Subject Descriptors: D.4.2 [Operating Systems]: Storage Management

General Terms: Reliability, Performance, Algorithms

Additional Key Words and Phrases: Solid-state drives, low density parity check, read performance, error correction code, error detection code

ACM Reference Format:
Shigui Qi, Dan Feng, Nan Su, Linjun Mei, and Jingning Liu. 2017. CDF-LDPC: A new error correction method for SSD to improve the read performance. ACM Trans. Storage 13, 1, Article 7 (February 2017), 22 pages. DOI: http://dx.doi.org/10.1145/3017430

This work is supported by National High-Tech R & D Program of China (863 Program) under Grants No. 2015AA016701, No. 2015AA015301, and No. 2013AA013203 and National Natural Science Foundation of China (NSFC) under Grants No. 61173043, No. 61303046, and No. 61402189. This work was also supported by Key Laboratory of Information Storage System, Ministry of Education, China.

Authors' addresses: S. Qi, D. Feng (corresponding author), N. Su, L. Mei, and J. Liu, Wuhan National Laboratory for Optoelectronic and School of Computer Science and Technology, Huazhong University of Science and Technology; emails: {qisg, dfeng}@hust.edu.cn, [email protected], {ljmei, jnliu}@hust.edu.cn; S. Qi is working at Xuchang University.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected].

© 2017 ACM 1553-3077/2017/02-ART7 $15.00 DOI: http://dx.doi.org/10.1145/3017430

ACM Transactions on Storage, Vol. 13, No. 1, Article 7, Publication date: February 2017.

1. INTRODUCTION

Multilevel Cell (MLC) NAND flash memory is widely used in Solid-State Drives (SSDs). Because of technology scaling, the storage reliability of flash memory has degraded considerably. Error Correction Codes (ECCs) are usually used to ensure the data reliability of SSDs. To meet the increasing reliability requirements of flash memory, Low-Density Parity-Check (LDPC) code [Gallager 1962] has been studied because of its superior error correction capability. However, such advanced ECCs often have a high decoding cost, which deteriorates read performance. An LDPC encoder first generates codewords (raw user data and parity bits) and stores
them in flash memory. The codewords are decoded by an LDPC decoder using the Belief Propagation (BP) iterative algorithm [Pearl 1988] to correct errors. Because it is iterative, LDPC decoding is time-consuming. In current systems, decoding is performed whether or not the codewords contain errors. Thus, the time spent decoding error-free pages is redundant and reduces read performance.

It is known that flash memory is disturbed by different noises [Fukuda et al. 2007; Lee et al. 2002; Mielke et al. 2006]. Storage reliability gradually decreases throughout the lifetime of flash memory [Zhao et al. 2013, 2014a]. During the early lifetime of flash memory, there are no or few errors, whereas the probability of errors increases late in the lifetime. Motivated by this gradual decrease in storage reliability, we propose a new CRC-Detect-First LDPC (CDF-LDPC) algorithm, which combines an error detection code (EDC, such as CRC) with ECC to improve the read performance of SSDs. To the best of our knowledge, no prior work has studied the combination of EDC and ECC to ensure the reliability of SSDs.

First, we use CRC code to find error-free pages in SSDs. These error-free pages can then bypass complex LDPC decoding and be read directly from SSDs. Pages with errors must undergo LDPC decoding to correct the errors. If pages are checked by CRC well before they are read, they may be disturbed by noises in the meantime and develop errors. To prevent errors from arising again after pages have been checked by CRC, LDPC decoding is usually carried out when data is read from SSDs. Moreover, CDF-LDPC can be performed in advance in two cases in order to further optimize read performance.
One case is to exploit idle time for the pages in fully programmed blocks (footnote 1). The other case is to take advantage of SSD parallelism to speed up the detection of error-free pages in fully programmed blocks. When the pages with errors in a block accumulate to a certain extent, CDF-LDPC is no longer used and traditional LDPC is launched, because the advantage of CDF-LDPC disappears when most pages in a block contain errors. When the block is erased, CDF-LDPC is used again.

We have developed a model [Qi et al. 2014, 2015] that captures the major noises of flash memory based on the extensive open literature of the flash memory research community. To quantitatively evaluate the read performance of the proposed technique, we use the DiskSim simulator with the SSD extension [Bucy et al. 2008] to carry out system-level simulations with different workload traces. Simulation results show that the proposed technique can significantly improve the read performance of SSDs compared with LDPC code.

We make the following contributions in this work.

- CDF-LDPC is proposed, a new ECC algorithm that combines CRC with LDPC to improve the read performance of SSDs without sacrificing their reliability. CDF-LDPC reaches a better trade-off between read and write overhead: it adds a little extra overhead to writes but avoids a large overhead in reads.
- The rule for performing the CDF-LDPC algorithm is proposed. Most of the time, CDF-LDPC decoding is performed as soon as flash memory is accessed, so that flash memory cannot develop new errors after being checked. To accelerate CDF-LDPC, decoding can also be performed in advance for fully programmed blocks during the idle time of SSDs. At the same time, die parallelism is used to further accelerate CDF-LDPC decoding.
- CDF-LDPC is comprehensively evaluated and compared with traditional LDPC code. Experimental results show that the read performance of SSDs using CDF-LDPC can be improved by up to 80% compared with LDPC code.

Footnote 1. A fully programmed block is a block in which all pages have been programmed. A fully programmed block cannot be disturbed by noises caused by P/E cycles.


Fig. 1. The structure of MLC NAND flash memory and hard-decision/soft-decision voltage sensing.

Fig. 2. The threshold voltage distribution of MLC NAND flash memory. (a) The ideal threshold voltage distribution. (b) The overlapped threshold voltage states disturbed by noises.

The rest of the article is organized as follows. Section 2 describes the structure of MLC NAND flash memory and the noises in SSDs. The CDF-LDPC algorithm is given in Section 3. The experiment setup and results are presented in Section 4. Related work is discussed in Section 5. Section 6 concludes the article.

2. BACKGROUND

2.1. Cell of NAND Flash Memory

One MLC NAND flash memory cell stores 2 bits: the Most Significant Bit (MSB) and the Least Significant Bit (LSB), which belong to the MSB page and the LSB page, respectively. A cell must be erased before data is programmed; erasing removes electric charge from the floating gate of the cell. The erase unit is a block that consists of multiple pages, while the read and program unit is a page. Data is programmed into a cell by injecting electric charge into its floating gate. An MLC NAND flash memory cell has four voltage levels, so different amounts of charge are injected into the floating gate depending on the target level. The four voltage levels of the cell, $P^{(0)}(x)$, $P^{(1)}(x)$, $P^{(2)}(x)$, and $P^{(3)}(x)$, represent the bit patterns "11," "01," "00," and "10," respectively, as shown in Figure 1. The four ideal voltage levels are separated from each other as shown in Figure 2(a). When data is read from flash memory, the cell state is identified by comparing the threshold voltage with the reference voltages. However, the charge in a cell decreases or increases after being disturbed by different noises, which lowers or raises the threshold voltage. Then, the four threshold voltage levels overlap with each other as shown in Figure 2(b), and it is difficult to recover the correct bits of the cell by comparing against only three hard-decision (footnote 2) reference voltages. NAND flash memory may carry out fine-grained soft-decision (footnote 3) voltage sensing to obtain Log-Likelihood Ratios (LLRs), which are taken as the initial input to the soft-decision LDPC decoder. We introduce LDPC code in detail in Section 2.3.

2.2. Noises of Flash Memory

Fig. 3. Raw bit error rate (RBER) versus P/E cycles [Choi et al. 2010].

The stored data in flash memory will generate errors due to the interference from different noises. Random telegraph noise (RTN) [Monzio Compagnoni et al. 2009] and Cell-to-Cell Interference (CCI) [Lee et al. 2002] are noises caused by P/E cycles. In addition, there are other noises in flash memory besides the noises caused by P/E cycles, for example, Retention Noise [Lee et al. 2003; Mielke et al. 2006] and Read Disturb [Cai et al. 2015; Cooke 2007; Grupp et al. 2009]. The probability of errors is different due to the changing interference from noises throughout the entire lifetime of SSD. In the early lifetime, NAND flash memory has few errors. Whereas with the increase of P/E cycles, data retention time, and read cycles, the probability of errors will also increase. Noises in flash memory are closely related to P/E cycles. When we program data into one cell, the threshold voltage of its neighboring victim cells will change due to parasitic capacitance coupling. Moreover, too many P/E cycles can damage the oxide around the floating gate, and some electrons in the floating gate will leak. Therefore, the threshold voltage of the cell will also change due to the leak of electrons. For example, on a real 3×-nm MLC NAND flash memory experimental test platform [Cai et al. 2011, 2013], Cai et al. show that all types of errors in NAND flash memory are highly correlated with P/E cycles, and the conclusion is also supported by other researchers [Tanakamaru et al. 2011]. Meanwhile, the paper Zhao et al. [2014b] shows that the error probability of sub-22-nm MLC NAND flash memory is highly correlated with P/E cycles, and bit error rates of pages are approximated as Gaussian distribution. Although MLC NAND flash memory can be disturbed by read operations, read disturbs are not as serious as other noises. Generally speaking, errors often occur after the effect of thousands of read operations [Cooke 2007; Grupp et al. 2009]. 
In general, all types of errors in SSDs are highly correlated with P/E cycles [Cai et al. 2012]. The Raw Bit Error Rate (RBER) of flash memory increases with the number of P/E cycles. For example, when one block of a 2GB MLC NAND flash is programmed with different numbers of P/E cycles, its RBER increases gradually, as shown in Figure 3. The block has 128 pages, and the page size is 4KB in the tested platform.

Footnote 2. Hard-decision sensing means that there is only one quantization level between two adjacent storage states.
Footnote 3. Soft-decision sensing means that there are multiple quantization levels between two adjacent storage states.

Table I. Mapping RBER to $P_{\text{error-page}}$ when RBER is $10^{-6}$ (five error bits in a block)

Distribution of the 5 error bits            Pages with errors   $P_{\text{error-page}}$
5 bits/page                                 1                   1/128 ≅ 0.7%
4 bits/page + 1 bit/page                    2                   2/128 ≅ 1.6%
3 bits/page + 2 bits/page                   2                   2/128 ≅ 1.6%
2 bits/page + 2 bits/page + 1 bit/page      3                   3/128 ≅ 2.3%
2 bits/page + 1 bit/page×3                  4                   4/128 ≅ 3.1%
1 bit/page×5                                5                   5/128 ≅ 3.9%

Table II. Relation between RBER and $P_{\text{error-page}}$

RBER in a block      Error bits   $P_{\text{error-page}}$
$1\times10^{-8}$     0.04         0
$1\times10^{-7}$     0.42         0.7%
$1\times10^{-6}$     4.19         0.7%, 1.6%, 2.3%, 3.1%, 3.9%
$1\times10^{-5}$     41.94        0.7% → 32%
$1\times10^{-4}$     419.43       0.7% → 100%

We can see that RBER is
very low when the SSD has undergone few P/E cycles. For example, when RBER is $1\times10^{-6}$, the number of P/E cycles is less than 30,000, as shown in Figure 3. In general, the more P/E cycles an SSD has undergone, the higher its RBER. Meanwhile, we can map RBER to the page error rate $P_{\text{error-page}}$ (footnote 4). We take the preceding tested platform as an example to explain the relationship between RBER and $P_{\text{error-page}}$. A block holds 128 × 4KB = 4,194,304 bits in total, so it contains about 4.2 error bits when RBER is $1\times10^{-6}$. To simplify the calculation, we assume that there are five error bits in a block. The five error bits can be distributed in six modes within a block, as shown in Table I. In the first mode, all five error bits are in one page and $P_{\text{error-page}} \cong 0.7\%$. In the second mode, four error bits are in one page and one error bit is in another page, so $P_{\text{error-page}} \cong 1.6\%$. In the third mode, three error bits are in one page and two error bits are in another page, so again $P_{\text{error-page}} \cong 1.6\%$. In the fourth mode, two pages hold two error bits each and a third page holds the remaining error bit, so $P_{\text{error-page}} \cong 2.3\%$. In the fifth mode, one page has two error bits and the remaining three error bits fall in three different pages, so $P_{\text{error-page}} \cong 3.1\%$. In the last mode, the five error bits fall in five different pages and $P_{\text{error-page}} \cong 3.9\%$. The relation between RBER and $P_{\text{error-page}}$ obtained in this way is shown in Table II. From the preceding analysis, it can be seen that the RBER and the page error rate of flash memory are not constant throughout the lifetime of SSDs.

2.3. Performing LDPC Code in SSDs

LDPC code [Gallager 1962; MacKay and Neal 1996] is a class of block codes that can be represented by a sparse parity-check matrix and a bipartite graph. LDPC code has two types: hard decision and soft decision. Hard-decision LDPC code receives hard-decision information to correct errors, while soft-decision LDPC code uses soft-decision information to achieve higher decoding capability. Soft-decision information is the probability of 0 or 1, which can be represented by LLRs. Unless stated otherwise, LDPC code refers to soft-decision LDPC code in this work.

LDPC Encoding: The purpose of LDPC encoding is to generate the parity bits of user data; the parity bits are combined with the raw user data to form LDPC codewords, which are then programmed into flash memory. LDPC encoding is an efficient process: it is the product of the raw user data and the LDPC generator matrix, as shown in the left part of Figure 4. That is, $C = R \cdot G$, where $R$ is the raw user data with $k$ bits, $G_{k \times n}$ is the LDPC generator matrix, and $C$ is the codeword with $n$ bits.

Footnote 4. Page error rate equals the ratio of pages with errors to the total pages in a block.
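The page error rate defined in footnote 4, applied to the 128-page block with 4KB pages from Section 2.2, can be sketched as follows. This is an illustrative calculation only; the function names are hypothetical and not from the original work. It computes the expected number of error bits per block for a given RBER and the range of page error rates for a given number of error bits. Note that 1/128 is 0.78125%, which Tables I and II list as 0.7%.

```python
from typing import Tuple

PAGES_PER_BLOCK = 128        # tested platform described in Section 2.2
PAGE_SIZE_BYTES = 4 * 1024   # 4KB pages


def expected_error_bits(rber: float) -> float:
    """Expected number of error bits in one block for a given RBER."""
    bits_per_block = PAGES_PER_BLOCK * PAGE_SIZE_BYTES * 8
    return rber * bits_per_block


def page_error_rate(pages_with_errors: int) -> float:
    """Ratio of pages with errors to the total pages in a block."""
    return pages_with_errors / PAGES_PER_BLOCK


def page_error_rate_range(error_bits: int) -> Tuple[float, float]:
    """Lowest rate: all error bits share one page. Highest: one bit per page."""
    if error_bits <= 0:
        return (0.0, 0.0)
    most_pages = min(error_bits, PAGES_PER_BLOCK)
    return (page_error_rate(1), page_error_rate(most_pages))


if __name__ == "__main__":
    for rber in (1e-8, 1e-7, 1e-6, 1e-5, 1e-4):
        print(f"RBER={rber:.0e}: {expected_error_bits(rber):.2f} error bits")
    low, high = page_error_rate_range(5)
    print(f"5 error bits: {low:.2%} to {high:.2%} of pages have errors")
```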

Fig. 4. The encoding and decoding process of LDPC code in NAND flash memory.
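The encoding step $C = R \cdot G$ from the LDPC Encoding paragraph can be illustrated with a minimal sketch over GF(2). The matrices below are an assumption made for readability: they are the small (7,4) Hamming code in systematic form, not an LDPC code of practical size, and the function names are hypothetical. The point is only that the codeword is the data followed by parity bits, and that a valid codeword satisfies every parity check.

```python
# Toy systematic generator matrix G = [I_k | P] with k = 4 and n = 7.
# Assumption: (7,4) Hamming code, chosen only because it is small.
G = [
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

# Matching parity-check matrix H = [P^T | I_(n-k)].
H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]


def encode(r, g=G):
    """Return the codeword C = R * G, with all arithmetic modulo 2."""
    if len(r) != len(g):
        raise ValueError("data length must equal the number of rows of G")
    n = len(g[0])
    return [sum(r[i] & g[i][j] for i in range(len(r))) % 2 for j in range(n)]


def syndrome(c, h=H):
    """Return H * C^T modulo 2; all zeros means every parity check passes."""
    return [sum(row[j] & c[j] for j in range(len(c))) % 2 for row in h]


if __name__ == "__main__":
    data = [1, 0, 1, 1]
    codeword = encode(data)
    print("codeword:", codeword)
    print("syndrome:", syndrome(codeword))
```

Because the code is systematic, the first k bits of the codeword are the raw user data, which matches the property CDF-LDPC relies on when it stores user data only once.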

LDPC Decoding: LDPC decoding is an iterative message-passing process, as shown in the right part of Figure 4. LDPC decoding needs raw LLRs as initial input, which are obtained from the flash memory channel and computed as $\mathrm{LLR} = \log \frac{P(1|v)}{P(0|v)}$, where $P(1|v)$ and $P(0|v)$ represent the posterior probabilities of 1 and 0, and $v$ represents the threshold voltage of the cell. When $P(1|v) > P(0|v)$, the probability of 1 is larger than that of 0 at threshold voltage $v$, and the LLR is positive. Conversely, the LLR is negative when $P(1|v) < P(0|v)$. LDPC decoding is complex and time-consuming. First, each variable node is initialized with LLR information from the flash memory channel. Second, the soft-decision messages are iteratively computed and passed between the variable nodes and check nodes; messages are exchanged only along the edges between neighboring nodes. Last, LDPC decoding stops when the termination conditions are reached. In addition, soft-decision voltage sensing increases the read latency of LDPC decoding. For example, 25nm MLC NAND flash memory needs 125μs to carry out seven-level soft-decision voltage sensing [Zhao et al. 2013].

3. CDF-LDPC ALGORITHM

As discussed previously, LDPC decoding is complex and time-consuming. With traditional LDPC code, LDPC decoding must be performed for every page, whether or not the page is error-free. Obviously, performing LDPC decoding for error-free pages is redundant. If error-free pages are identified in advance, LDPC decoding can be bypassed. In this way, read performance can be significantly improved without affecting the reliability of SSDs. To this end, we propose the CDF-LDPC algorithm to eliminate the cost of LDPC decoding for error-free pages, as shown in Figure 5. In this section, we first introduce the process of CDF-LDPC encoding/decoding. Then, we discuss how SSDs use idle time and parallelism to accelerate the execution of CDF-LDPC.

3.1. Encoding/Decoding of CDF-LDPC Algorithm

Fig. 5. CDF-LDPC algorithm.

Fig. 6. The encoding process of CDF-LDPC. CRC-P represents CRC parity bits and LDPC-P represents LDPC parity bits. Error-free-judgment is the flag of error-free page. Error-free-judgment has 1 bit.

CDF-LDPC consists of two processes: encoding and decoding.

CDF-LDPC Encoding. CDF-LDPC encoding generates the CDF-LDPC codeword, which consists of raw user data and the corresponding parity bits. As shown in Figure 6, raw user data is encoded by CRC code and by LDPC code separately to generate the CDF-LDPC codeword before the raw user data is programmed into flash memory. Both CRC code and LDPC code are systematic codes, in which user data and parity bits can be separated from each other. So, the raw user data of each page is programmed into flash memory only once. CRC parity bits CRC-P and LDPC parity bits LDPC-P are each generated from the raw user data. Raw user data, CRC-P, and LDPC-P together form the CDF-LDPC codeword. In addition, both CRC-P and LDPC-P are stored in the spare
space of a page. To find error-free pages quickly, a 1-bit flag Error-free-judgment is added to the spare space of each page. Initially, the Error-free-judgment of every page is set to 0, which indicates that the raw user data of the page has no errors. Conversely, the Error-free-judgment of a page is 1 if its raw user data has errors.

In CDF-LDPC encoding, the CRC parity bits occupy some storage space. However, the overhead is much smaller than that of LDPC code. For example, suppose we use CRC-16 as the EDC to detect whether a 1KB page is error-free. The CRC parity bits occupy only about 0.2% of the page size (2 bytes out of 1,024), while LDPC parity bits occupy about 5% of the page size when the LDPC code rate is 0.95. The storage space of LDPC parity bits is about 25 times larger than that of CRC parity bits.

CDF-LDPC Decoding. When data is read from flash memory, the CRC codeword (raw user data and CRC-P) is first checked by CRC, as shown in Figure 7. The Error-free-judgment is set to 1 if the CRC codeword has errors; otherwise, it remains 0. An error-free page can be read directly into a DRAM buffer without LDPC decoding. Conversely, if the Error-free-judgment is 1, the LDPC codeword (raw user data and LDPC-P, which is contained in the CDF-LDPC codeword) must be decoded with LDPC to correct the errors.
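The write and read paths described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `StoredPage`, `cdf_ldpc_encode`, and `cdf_ldpc_read` are hypothetical names, the CRC is CRC-16/XMODEM from Python's standard library rather than any particular polynomial from this article, and the LDPC encoder and decoder are passed in as callables because a real LDPC codec is out of scope here. The toy stand-ins at the bottom exist only so the sketch runs.

```python
import binascii
from dataclasses import dataclass
from typing import Callable


@dataclass
class StoredPage:
    data: bytes                   # raw user data, programmed only once
    crc_p: int                    # CRC-P, kept in the spare space
    ldpc_p: bytes                 # LDPC-P, kept in the spare space
    error_free_judgment: int = 0  # 0 = error-free, 1 = error


def crc16(data: bytes) -> int:
    # Assumption: the standard-library CRC-16/XMODEM stands in for whichever
    # CRC-16 polynomial a real controller would use.
    return binascii.crc_hqx(data, 0)


def cdf_ldpc_encode(data: bytes,
                    ldpc_parity: Callable[[bytes], bytes]) -> StoredPage:
    """Build a CDF-LDPC codeword: data + CRC-P + LDPC-P, flag initially 0."""
    return StoredPage(data=data, crc_p=crc16(data), ldpc_p=ldpc_parity(data))


def cdf_ldpc_read(page: StoredPage,
                  ldpc_decode: Callable[[bytes, bytes], bytes]) -> bytes:
    """Check the CRC first; run the LDPC decoder only if the check fails."""
    if crc16(page.data) == page.crc_p:
        page.error_free_judgment = 0
        return page.data          # error-free page bypasses LDPC decoding
    page.error_free_judgment = 1
    return ldpc_decode(page.data, page.ldpc_p)


# Hypothetical stand-ins so the sketch runs. The "parity" is a full copy of
# the data and "decoding" returns that copy. This is NOT an LDPC code.
def toy_parity(data: bytes) -> bytes:
    return bytes(data)


def toy_decode(data: bytes, parity: bytes) -> bytes:
    return parity


if __name__ == "__main__":
    page = cdf_ldpc_encode(b"hello flash page", toy_parity)
    print(cdf_ldpc_read(page, toy_decode), page.error_free_judgment)
    page.data = b"hellp flash page"   # simulate a corrupted read
    print(cdf_ldpc_read(page, toy_decode), page.error_free_judgment)
```

Passing the LDPC functions as parameters keeps the CRC-first control flow separate from the choice of ECC, which is the part of the design this section is about.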


Fig. 7. The decoding process of CDF-LDPC. Data is first decoded by CRC decoding. Only the page with errors is decoded by LDPC decoding when it is accessed. The flag Error-free-judgment represents page error state: 0 (error-free), 1 (error).

To ensure that a checked page does not develop errors again before it is used, LDPC decoding is usually performed when the page is read. Since fully programmed blocks are not disturbed by noises caused by P/E cycles, CRC detection can be performed in advance for fully programmed blocks when flash memory is idle. This is discussed in the next section. Since CRC decoding needs hard-decision information, hard-decision voltage sensing is performed first. Soft-decision voltage sensing is performed only when soft-decision LDPC decoding begins. For error-free pages, the expensive soft-decision voltage sensing can therefore be omitted, which improves the read performance of the SSD. Furthermore, when the number of pages with errors in a block exceeds a predefined threshold, the block is disturbed by high noise and most of its pages contain errors. In this case, CDF-LDPC no longer improves read performance, and it is replaced with traditional LDPC for the block. CDF-LDPC can be used again once the block is erased (footnote 5). The threshold is discussed in Section 4.

In the CDF-LDPC algorithm, CRC encoding/decoding adds to the system time. However, both CRC encoding and decoding are only divisions by the generator polynomial. Therefore, the cost of CRC code is very low compared with iterative LDPC decoding.

3.2. Performing CDF-LDPC Decoding in Idle Time of SSDs

To further improve the read performance of SSDs, CDF-LDPC decoding can be carried out in advance for fully programmed blocks during the idle time of SSDs. The reason is that fully programmed blocks are not disturbed by noises caused by P/E cycles, since they undergo no P/E cycles. Apart from noises caused by P/E cycles, flash memory can be disturbed by other noises, for example, read disturb and retention noise. However, read disturb does not cause errors until the block has been accessed thousands of times [Cooke 2007; Grupp et al. 2009]. Therefore, errors caused by retention noise and read disturb take longer to appear than those caused by P/E cycles. In fact, a flash memory block can be rewritten to a new block to eliminate read disturb and retention noise once they have accumulated to a certain extent. Moreover, CDF-LDPC decoding of a fully programmed block in idle time mainly carries out CRC detection for all pages in the block in advance and performs LDPC decoding for the pages with errors only when they are read. For example, as shown in Figure 8, block 1 and block 2 are fully programmed blocks, so CDF-LDPC decoding can be performed for them in advance during the idle time of SSDs. For a non-fully programmed block (e.g., block 0), CDF-LDPC decoding is performed when the block is read.

Fig. 8. Illustration of applying CDF-LDPC decoding for fully programmed blocks in the idle time. Read point means the start point for read operation. Idle point means the start point when SSD is idle.

At the same time, we set a flag block-detected to represent the detected state of a block. If block-detected is 1, all pages in the block have been checked by CRC code; otherwise, block-detected is 0. The flag lets us quickly identify checked blocks and skip them when we perform CDF-LDPC decoding in the idle time of SSDs. For non-fully programmed blocks, block-detected is always 0.

During the later lifetime of SSDs, especially at the end of lifetime, most pages contain errors and the advantage of CDF-LDPC disappears. In this situation, CDF-LDPC is replaced with traditional LDPC code. A threshold on the page error rate indicates when to replace CDF-LDPC with conventional LDPC code; this threshold is discussed in Section 4. In general, performing CDF-LDPC in the idle time of SSDs is suitable for fully programmed blocks. For non-fully programmed blocks, CDF-LDPC can be performed only when they are accessed.

Footnote 5. After the block is erased, it is blank. New data can then be programmed into the blank block, which has few noises.

3.3. Using SSD Parallelism to Speed Up CDF-LDPC Decoding

It is known that the internal parallelism of SSDs can significantly improve their performance. An SSD has four levels of parallelism: channels, packages, dies, and planes [Chen et al. 2011; Jung et al. 2012]. This parallelism can be used to speed up the execution of CDF-LDPC. In this work, we only discuss using die parallelism to perform CDF-LDPC during the idle time of SSDs. For example, as shown in Figure 8, CDF-LDPC decoding can be performed on Block 1 and Block 2 simultaneously during idle time by making use of die parallelism.

3.4. False Positive Rate Analysis of CDF-LDPC

Table III. The Ratio of Writing Traces

Field        Write Rate   Field        Write Rate
Financial1   72%          WebSearch1   0
Financial2   14%          WebSearch2   0
Postmark     17%          Syn2         0
Syn1         50%          Syn3         0

The probability of missing errors during detection is called the false positive rate of EDC/ECC. CRC and LDPC have different false positive rates. We assume that the
false positive rate of CRC is $X$ and the false positive rate of LDPC is $Y$. Then the probability of detecting all errors using CRC is $1 - X$, and the probability of detecting all errors using LDPC is $1 - Y$. There are two cases in computing the false positive rate of the CDF-LDPC algorithm. In the first case, CRC code detects all pages with errors in a block, and these pages are decoded by LDPC code as discussed previously; the false positive rate contributed by this case is $(1 - X) \times Y$. In the second case, CRC misses some pages with errors in a block, which means that these pages may be identified as error-free; the false positive rate contributed by this case is $X$. Combining the two cases, the false positive rate of CDF-LDPC is $(1 - X) \times Y + X$. In particular, $X = 0$ means that CRC code detects all pages with errors in a block; the false positive rate of CDF-LDPC is then $Y$, which equals the false positive rate of LDPC. The difference in false positive rate between CDF-LDPC and LDPC is computed as follows:

$$(1 - X) \times Y + X - Y = X \times (1 - Y). \tag{1}$$
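Equation (1) can be checked numerically with a short sketch. The function names and the sample values of X and Y are hypothetical, chosen only to show that the combined rate minus Y equals X × (1 − Y) and that the two special cases (X = 0 and Y = 0) behave as the text describes.

```python
import math


def cdf_ldpc_false_positive(x: float, y: float) -> float:
    """False positive rate of CDF-LDPC: (1 - X) * Y + X."""
    return (1 - x) * y + x


def extra_false_positive(x: float, y: float) -> float:
    """Right-hand side of Equation (1): X * (1 - Y)."""
    return x * (1 - y)


if __name__ == "__main__":
    x, y = 1e-5, 1e-9   # illustrative values only, not measured rates
    combined = cdf_ldpc_false_positive(x, y)
    print("CDF-LDPC:", combined)
    print("difference vs. LDPC:", combined - y)
    print("X * (1 - Y):", extra_false_positive(x, y))
    # Sanity check that both sides of Equation (1) agree.
    assert math.isclose(combined - y, extra_false_positive(x, y))
```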

We know that LDPC code has powerful error correcting capability, which means that the false positive rate of LDPC, $Y$, is very small. If we assume that $Y$ equals 0, the difference in false positive rate between CDF-LDPC and LDPC is $X$ according to Equation (1). Thus, the false positive rate of CDF-LDPC approximately equals the false positive rate of CRC, $X$, in this special case. It can be seen that the false positive rate of CDF-LDPC is higher than that of LDPC. The CDF-LDPC algorithm is therefore suitable for the early lifetime of SSDs. In addition, the RBER of flash memory cannot exceed the false positive rate of CRC; otherwise, CRC cannot detect some pages with errors. In fact, the bit error rate is very low in the early lifetime of SSDs. For example, the RBER of new SSDs is $10^{-13}$ or lower, and CRC-16 can detect almost all pages with errors. In addition, CDF-LDPC is replaced with LDPC during the later lifetime of SSDs. In conclusion, the CDF-LDPC algorithm is not suitable for the entire lifetime of SSDs.

4. EXPERIMENT SETUP AND RESULTS

To quantitatively evaluate the proposed techniques through trace-driven simulations, we use the SSD module [Agrawal et al. 2008] in DiskSim [Bucy et al. 2008] with different workload traces. The workload traces include Financial1, Financial2, Postmark, WebSearch1, and WebSearch2. We also use three synthetic workloads, Syn1, Syn2, and Syn3, as shown in Table III. Since these traces were collected on HDDs, which are much slower than SSDs, the timestamps of the traces are compressed by a factor of 1,000 to simulate a real SSD environment. In this work, the SSD has eight flash chips. Each flash chip has two dies that share some control signals and an 8-bit bus. One die contains four planes, and one plane has 2,048 blocks. One block has 64 4KB pages. We set the read latency per page to 25μs, the write latency per page to 200μs, and the erase latency per block to 1.5ms, as shown in Table IV.


Table IV. Parameters of SSD

Field              Value    Field              Value
Flash chips        8        Page read time     25μs
Dies per chip      2        Page write time    200μs
Planes per die     4        Block erase time   1,500μs
Blocks per plane   2,048    Tclk               10ns
Pages per block    64       Tbyte              8*Tclk
Page size          4KB
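The parameters in Table IV can be captured in a small configuration sketch. The class and property names are hypothetical and this is not DiskSim's configuration format. The totals it derives (number of blocks, number of pages, raw capacity) follow arithmetically from the table values and are not figures stated in the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SsdConfig:
    # Geometry, as listed in Table IV.
    flash_chips: int = 8
    dies_per_chip: int = 2
    planes_per_die: int = 4
    blocks_per_plane: int = 2048
    pages_per_block: int = 64
    page_size_bytes: int = 4 * 1024
    # Latencies, as listed in Table IV.
    page_read_us: float = 25.0
    page_write_us: float = 200.0
    block_erase_us: float = 1500.0
    t_clk_ns: float = 10.0

    @property
    def t_byte_ns(self) -> float:
        """Time to process one byte: Tbyte = 8 * Tclk."""
        return 8 * self.t_clk_ns

    @property
    def total_blocks(self) -> int:
        return (self.flash_chips * self.dies_per_chip
                * self.planes_per_die * self.blocks_per_plane)

    @property
    def total_pages(self) -> int:
        return self.total_blocks * self.pages_per_block

    @property
    def capacity_bytes(self) -> int:
        return self.total_pages * self.page_size_bytes


if __name__ == "__main__":
    cfg = SsdConfig()
    print("blocks:", cfg.total_blocks)
    print("pages:", cfg.total_pages)
    print("capacity (GiB):", cfg.capacity_bytes / 2**30)
```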

Table V. Max Length at Hamming Distance (HD)/Polynomial

Each cell lists the max length at the given HD, followed by the polynomial.

                   CRC Size (bits)
        8            16              24                  32
HD=2    0xe7         0x8d95          0x8f90e3            work in progress
HD=3    247, 0xe7    65519, 0x8d95   16777191, 0x8f90e3  4294967263, work in progress
HD=4    119, 0x83    32751, 0xd175   8388583, 0x9945b1   2147483615, work in progress
HD=5    9, 0xeb      241, 0xac9a     4037, 0x98ff8c      65505

4.1. CRC Selection

The units of CRC code and LDPC code are pages in this work. The lengths of the CRC raw user data and the CRC parity bits (CRC-P) can be set according to the actual scenario. Commonly used CRC codes include CRC-8, CRC-16, CRC-24, and CRC-32. Two aspects are usually considered when choosing a CRC code. On the one hand, the CRC code should offer a good trade-off between the maximum number of detectable errors and the data word length for which the polynomial is effective [Koopman and Chakravarty 2004]. On the other hand, we must consider how much SSD performance can be improved.

First, we compare the error detection capability of the four CRC codes. In Table V [Koopman 2015], each cell has two numbers. The top number is the maximum data word length for which the given HD is achieved, and the bottom number is a good CRC polynomial for lengths up to that maximum. The value of the Hamming Distance (HD) in the leftmost column of Table V is the smallest number of bit errors that can go undetected. For example, HD=4 means that the smallest undetected error pattern in a data word has 4 bit errors. For CRC-16, polynomial 0xd175 achieves HD=4 for data words up to 32,751 in length. However, CRC-8 achieves HD=4 only up to a data word length of 119, using polynomial 0x83. Obviously, the error detection capability of CRC-8 cannot meet the error detection requirement of a 4KB page when HD=4, since its maximum detectable data word length is far less than 4KB. CRC-16, CRC-24, and CRC-32 can meet the requirement when HD=4. As shown in Table V, the maximum detectable data word length of CRC-16 is closest to 4KB when HD=4, while those of CRC-24 and CRC-32 are substantially larger than 4KB.

Second, the performance improvement of SSDs using CRC code should be considered. The parity bits of CRC-24 and CRC-32 are 50% larger and 100% larger than those of CRC-16, respectively.
As discussed in Section 3.1, the more parity bits the CRC code has, the more storage space CDF-LDPC encoding takes. Moreover, the write performance of SSDs decreases when CDF-LDPC uses a CRC with more parity bits. Overall, CRC-16 achieves a good trade-off between error detection capability and performance improvement compared with CRC-8, CRC-24, and CRC-32. In this work, we select CRC-16 as the EDC to detect errors in flash memory.

4.2. Model Read/Write Performance of CDF-LDPC

Both CRC encoding and CRC decoding are modulo-2 operations, so they can be performed by direct bitwise XOR operations. In addition, since both depend on the preconfigured generator polynomial, their speeds are determined by the frequency of the internal processor in the SSD and are nearly the same. We assume that the SSD operates at 100MHz. The clock period is then $T_{\text{clk}} = 10$ns, and the time to process one byte with a shift register is $T_{\text{byte}} = 8 \times T_{\text{clk}}$. The encoding time $T_{\text{CRC-Encoding}}$ and the decoding time $T_{\text{CRC-Decoding}}$ for a given CRC codeword are equal to the product of $T_{\text{byte}}$ and $L_{\text{CRC-Codeword}}$ (the CRC codeword length in bytes), as shown in Equations (2) and (3). For example, for a 4KB flash memory page using CRC-16, the CRC codeword is 4,098 bytes, so the CRC encoding/decoding time per codeword is about 328μs.

$$T_{\text{CRC-Encoding}} = T_{\text{byte}} \times L_{\text{CRC-Codeword}}, \tag{2}$$

$$T_{\text{CRC-Decoding}} = T_{\text{byte}} \times L_{\text{CRC-Codeword}}. \tag{3}$$
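The arithmetic behind the 328μs figure can be reproduced with a short Python sketch of Equations (2) and (3). The function name `crc_time_us` and its parameters are illustrative; the one-bit-per-clock shift-register assumption is the one stated in the text.

```python
PAGE_BYTES = 4096          # 4KB flash memory page
CRC16_PARITY_BYTES = 2     # CRC-16 adds two parity bytes


def crc_time_us(codeword_bytes: int, clock_hz: float = 100e6) -> float:
    """CRC encoding or decoding time in microseconds, per Equations (2) and (3)."""
    t_clk_ns = 1e9 / clock_hz      # 10 ns at 100 MHz
    t_byte_ns = 8 * t_clk_ns       # one bit per clock cycle in the shift register
    return codeword_bytes * t_byte_ns / 1000.0


t_crc_100mhz = crc_time_us(PAGE_BYTES + CRC16_PARITY_BYTES)          # about 328 us
t_crc_400mhz = crc_time_us(PAGE_BYTES + CRC16_PARITY_BYTES, 400e6)   # about 82 us
```

Because the model is linear in the clock period, quadrupling the processor frequency divides the CRC time by four under the same assumption.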

Since LDPC encoding is the product of the raw user data and the LDPC generator matrix, the LDPC encoding time $T_{\text{LDPC-Encoding}}$ depends on the frequency of the internal processor in the SSD and on the LDPC codeword length. Moreover, the number of LDPC parity bits (LDPC-P) is determined by the LDPC code rate, $\text{LDPC}_{\text{coderate}} = \frac{\text{userdata}}{\text{userdata} + \text{parity}}$. The LDPC encoding time $T_{\text{LDPC-Encoding}}$ can then be computed as shown in Equation (4), in which $L_{\text{Page}}$ is the page size in bytes. For example, for an LDPC code with a code rate of 0.95, the parity bits LDPC-P occupy 215 bytes for a 4KB flash memory page, and the LDPC codeword is 4,311 bytes. The LDPC encoding time is then $T_{\text{LDPC-Encoding}} = 4{,}311 \times T_{\text{byte}} \approx 345$μs. Different LDPC code rates produce LDPC codewords of different lengths.

$$T_{\text{LDPC-Encoding}} = T_{\text{byte}} \times L_{\text{LDPC-Codeword}} = T_{\text{byte}} \times L_{\text{Page}} / \text{LDPC}_{\text{coderate}}. \tag{4}$$
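The example for Equation (4) can be checked with the following Python sketch. Rounding the codeword length down to whole bytes is an assumption made here to match the 4,311-byte figure quoted in the text; the function names are illustrative.

```python
import math

T_BYTE_NS = 80.0   # 8 clock cycles of 10 ns each at 100 MHz


def ldpc_codeword_bytes(page_bytes: int, code_rate: float) -> int:
    """Codeword length L_Page / code rate, rounded down to whole bytes (assumption)."""
    return math.floor(page_bytes / code_rate)


def ldpc_encoding_time_us(page_bytes: int, code_rate: float,
                          t_byte_ns: float = T_BYTE_NS) -> float:
    """LDPC encoding time in microseconds, per Equation (4)."""
    return ldpc_codeword_bytes(page_bytes, code_rate) * t_byte_ns / 1000.0


codeword = ldpc_codeword_bytes(4096, 0.95)      # 4,311 bytes
parity = codeword - 4096                        # 215 bytes of LDPC-P
t_encode = ldpc_encoding_time_us(4096, 0.95)    # about 345 us
```

Lowering the code rate lengthens the codeword and therefore the encoding time, which is the dependence the equation expresses.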

However, computing the LDPC decoding time is complicated because it depends on many factors, for example, the iteration number, the RBER, and the threshold voltage sensing frequency. In general, the LDPC decoding time includes the threshold voltage sensing time, the threshold voltage transferring time, and the iterative decoding time, as shown in Equation (5). According to Zhao et al. [2013] and ONFI 2.1, 25nm MLC NAND flash memory chips take 250μs to carry out a 14-level soft-decision threshold voltage sensing. Meanwhile, it takes about 160μs to transfer the soft-decision information of a 4KB page to the controller at 100MB/s I/O bandwidth, and it takes 8μs to finish the iterative decoding of one 4KB LDPC codeword when the LDPC decoder operates at 4Gbps. In fact, threshold voltage sensing and transferring occupy most of the LDPC decoding time [Qi et al. 2014]. The LDPC decoding time per page is computed as follows.

$$T_{\text{LDPC-Decoding}} = T_{\text{voltage-sensing}} + T_{\text{transfer-voltage}} + T_{\text{LDPC-Iterative-decoding}}. \tag{5}$$
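As a worked instance of Equation (5), plugging in the three per-page figures quoted above gives the following. This is a sketch that simply sums them, assuming that they apply unchanged to every page.

$$T_{\text{LDPC-Decoding}} = 250\,\mu\text{s} + 160\,\mu\text{s} + 8\,\mu\text{s} = 418\,\mu\text{s},$$

$$\frac{T_{\text{voltage-sensing}} + T_{\text{transfer-voltage}}}{T_{\text{LDPC-Decoding}}} = \frac{410}{418} \approx 0.98.$$

Under these figures, sensing and transferring account for about 98% of the decoding time, which is consistent with the statement that they dominate it.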

$T_{\text{LDPC-Decoding}}$ represents the total decoding time per page, $T_{\text{LDPC-Iterative-decoding}}$ represents the iterative decoding time per page, $T_{\text{voltage-sensing}}$ is the threshold voltage sensing time, and $T_{\text{transfer-voltage}}$ represents the voltage transferring time. In general, the LDPC decoding time per page is larger than the LDPC encoding time per page.

After we get the encoding/decoding time per page of CRC and LDPC, the encoding/decoding time per page of CDF-LDPC can be obtained as follows. First, we can get the CDF-LDPC decoding time per block as follows.

$$\sum_{i=1}^{N} T_{\text{cdfldpc-r}} = N \times P_{\text{error-page}} \times T_{\text{LDPC-Decoding}} + \sum_{i=1}^{N} T_{\text{CRC-Decoding}}. \tag{6}$$

$N$ is the number of pages in a block, $T_{\text{cdfldpc-r}}$ represents the CDF-LDPC decoding time per page, and $P_{\text{error-page}}$ represents the page error rate in a block. We can simplify Equation (6) by dividing both sides by $N$ to get the CDF-LDPC decoding time per page as follows.

$$T_{\text{cdfldpc-r}} = P_{\text{error-page}} \times T_{\text{LDPC-Decoding}} + T_{\text{CRC-Decoding}}. \tag{7}$$
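Equation (7) can be evaluated directly. The Python sketch below uses 418μs for the LDPC decoding time (the sum of the 250μs, 160μs, and 8μs components quoted for Equation (5)) and 328μs for the CRC decoding time, and evaluates the six page error rates used in the evaluation. The constant and function names are illustrative, and the results are decoding-time model values only, not the simulated read times reported in the figures.

```python
T_LDPC_DECODING_US = 418.0   # 250 + 160 + 8, summed from the figures quoted for Equation (5)
T_CRC_DECODING_US = 328.0    # 4,098-byte CRC-16 codeword at 100 MHz


def cdf_ldpc_decoding_time_us(p_error_page: float,
                              t_ldpc_us: float = T_LDPC_DECODING_US,
                              t_crc_us: float = T_CRC_DECODING_US) -> float:
    """Expected CDF-LDPC decoding time per page, per Equation (7)."""
    if not 0.0 <= p_error_page <= 1.0:
        raise ValueError("page error rate must lie in [0, 1]")
    # Every page pays for the CRC check; only the erroneous fraction pays for LDPC.
    return p_error_page * t_ldpc_us + t_crc_us


PAGE_ERROR_RATES = [0.001, 0.05, 0.10, 0.20, 0.30, 0.40]
model_times = {p: cdf_ldpc_decoding_time_us(p) for p in PAGE_ERROR_RATES}
```

The model grows linearly with the page error rate, starting from the CRC decoding time alone when no page contains errors.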

If $P_{\text{error-page}}$ equals 0, all pages in a block are error-free, and the CDF-LDPC decoding time per page equals $T_{\text{CRC-Decoding}}$. The total read time per page $T_{\text{read}}$ with CDF-LDPC can be computed as follows.

$$T_{\text{read}} = T_{\text{cdfldpc-r}} + T_{\text{addressing}} + T_{\text{bus-transfer}}. \tag{8}$$

$T_{\text{addressing}}$ and $T_{\text{bus-transfer}}$ represent the addressing time and the bus transferring time, respectively. In addition, the read time when only the LDPC code is used is computed as follows.

$$T_{\text{read-ldpc}} = T_{\text{LDPC-Decoding}} + T_{\text{addressing}} + T_{\text{bus-transfer}}. \tag{9}$$

As discussed previously, when SSDs enter the later part of their lifetime, CDF-LDPC is replaced with LDPC because the advantage of CDF-LDPC in improving read performance no longer exists. The threshold value of $P_{\text{error-page}}$ for switching between CDF-LDPC and LDPC can be derived as follows.

$$T_{\text{read}} < T_{\text{read-ldpc}} \;\rightarrow\; T_{\text{cdfldpc-r}} < T_{\text{LDPC-Decoding}} \;\rightarrow\; P_{\text{error-page}} < \frac{T_{\text{LDPC-Decoding}} - T_{\text{CRC-Decoding}}}{T_{\text{LDPC-Decoding}}}. \tag{10}$$
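The switching rule in Equation (10) can be sketched in Python as follows. The timing values passed in are the 418μs LDPC decoding time (250 + 160 + 8) and the 328μs CRC decoding time quoted earlier in this section; the function names are illustrative, and the resulting threshold is a value of the analytical model only.

```python
def switch_threshold(t_ldpc_decoding_us: float, t_crc_decoding_us: float) -> float:
    """Upper bound on the page error rate below which CDF-LDPC wins, per Equation (10)."""
    if t_ldpc_decoding_us <= 0:
        raise ValueError("LDPC decoding time must be positive")
    return (t_ldpc_decoding_us - t_crc_decoding_us) / t_ldpc_decoding_us


def use_cdf_ldpc(p_error_page: float, t_ldpc_decoding_us: float,
                 t_crc_decoding_us: float) -> bool:
    """True while the modeled CDF-LDPC read time is below the LDPC-only read time."""
    return p_error_page < switch_threshold(t_ldpc_decoding_us, t_crc_decoding_us)


threshold = switch_threshold(418.0, 328.0)   # 90 / 418, roughly 0.215
```

With these inputs the model favors CDF-LDPC at a page error rate of 20% and plain LDPC at 30%. A faster CRC stage raises the threshold, because the numerator grows while the denominator stays fixed.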

As shown in Equation (10), the upper limit of the threshold $P_{\text{error-page}}$ is decided by the difference between the CRC decoding time and the LDPC decoding time, relative to the LDPC decoding time. If this difference increases, the threshold $P_{\text{error-page}}$ can be set higher. In the same way, the write time per page of CDF-LDPC is computed as follows.

$$T_{\text{cdfldpc-w}} = T_{\text{LDPC-Encoding}} + T_{\text{CRC-Encoding}}. \tag{11}$$

Then, we can get the total write time per page $T_{\text{write}}$ of CDF-LDPC as follows.

$$T_{\text{write}} = T_{\text{cdfldpc-w}} + T_{\text{addressing}} + T_{\text{bus-transfer}}. \tag{12}$$
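As a worked instance of Equation (11), using the 345μs LDPC encoding time (code rate 0.95) and the 328μs CRC encoding time from the examples in this section:

$$T_{\text{cdfldpc-w}} = 345\,\mu\text{s} + 328\,\mu\text{s} = 673\,\mu\text{s}, \qquad \frac{T_{\text{cdfldpc-w}}}{T_{\text{LDPC-Encoding}}} = \frac{673}{345} \approx 1.95.$$

This ratio covers the encoding stage only. Since $T_{\text{addressing}}$ and $T_{\text{bus-transfer}}$ in Equation (12) are the same with and without the CRC stage, the ratio of total write times per page is smaller than 1.95.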

4.3. Performing CDF-LDPC in the Reading Process

In order to evaluate the performance of the proposed technique, we set the traditional LDPC code as the baseline. We assume six page error rates ($P_{\text{error-page}}$): 0.1%, 5%, 10%, 20%, 30%, and 40%. CDF-LDPC has two implementation modes: performing CDF-LDPC in the reading process and performing CDF-LDPC in the idle time of SSDs. This section introduces the first mode.

Performing CDF-LDPC in the reading process means that we perform CDF-LDPC decoding when data are read from SSDs. In the writing process, we carry out CRC encoding and LDPC encoding, respectively, on the input raw user data to get CDF-LDPC codewords, which are then programmed into flash memory. In the reading process, we first carry out CRC decoding to find the error-free pages, for which we can bypass LDPC decoding and read the data directly. For pages with errors, we perform LDPC decoding to correct the errors when they are read. For example, when $P_{\text{error-page}}$ is 5%, the pages with errors are found through CDF-LDPC decoding and their Error-free-judgment flags are set to 1, while the Error-free-judgment flags of the error-free pages in the block remain 0. The average read performance per page is improved owing to the bypassing of LDPC decoding for error-free pages.

Fig. 9. The average read time per page of performing CDF-LDPC in the reading process.

As shown in Figure 9, the reading times of WebSearch1 and WebSearch2 are decreased by 58%, 56%, 54%, and 49%, respectively, compared with the baseline when $P_{\text{error-page}}$ equals 0.1%, 5%, 10%, and 20%. The reason is that most pages are error-free when $P_{\text{error-page}}$ is low, so most read requests target error-free pages and bypass the complex LDPC decoding. However, the read performance per page of Financial1 is improved by only 3% to 6% compared with the baseline over the range of $P_{\text{error-page}}$ from 0.1% to 40%. The reason is that 72% of the requests of Financial1 are write requests, as shown in Table III. Write requests only program data into flash memory; only read requests perform CRC decoding to find error-free pages, for which LDPC decoding can be bypassed. In general, the read performance of write-dominated traces is not improved as significantly as that of read-dominated traces.

As $P_{\text{error-page}}$ increases, CDF-LDPC detects more pages with errors, and SSDs carry out more LDPC decoding for those pages. The read performance of SSDs is therefore affected by the increase of $P_{\text{error-page}}$. For example, the read time of Syn1 is decreased by 50%, 48%, 14%, 17%, 22%, and 26%, respectively, compared with the baseline for the six values of $P_{\text{error-page}}$ from 0.1% to 40%. In addition, the CRC encoding/decoding time per page becomes shorter with a higher processor frequency, and the LDPC encoding/decoding time per page is also reduced.
For example, the Samsung 850, which is a commodity SATA drive and not even a high-performance PCI-E drive (such drives often employ multiple processors or multicore processors), has a 400MHz processor on board. We therefore use a 400MHz processor to verify the effect of the CDF-LDPC algorithm when CDF-LDPC is performed in the reading process. The result shows that the read performance can be further significantly improved compared with the baseline. For example, the read performance of Syn3 is improved four times compared with the baseline when $P_{\text{error-page}}$ is 0.1%, as shown in Figure 10, and it is still improved two times compared with the baseline even when $P_{\text{error-page}}$ is 40%. Therefore, when CDF-LDPC is performed in the reading process, a higher processor frequency in the SSD yields better read performance than a slower processor.
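The read path described in this section (CRC check first, LDPC decoding only when the check fails) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated here: `binascii.crc_hqx` from the Python standard library computes a 16-bit CRC with the CRC-CCITT polynomial and stands in for the CRC-16 chosen in Section 4.1, while `StoredPage`, `make_page`, `read_page`, and the `ldpc_decode` callback are hypothetical names.

```python
import binascii
from dataclasses import dataclass
from typing import Callable


@dataclass
class StoredPage:
    data: bytes                    # user data as read back from flash (may contain errors)
    crc: int                       # 16-bit CRC parity stored with the page at write time
    error_free_judgment: int = 0   # 1 once the CRC check has found errors, else 0


def make_page(user_data: bytes) -> StoredPage:
    """Write path: attach the CRC parity (LDPC encoding is omitted in this sketch)."""
    return StoredPage(data=user_data, crc=binascii.crc_hqx(user_data, 0))


def read_page(page: StoredPage, ldpc_decode: Callable[[bytes], bytes]) -> bytes:
    """Read path: bypass LDPC decoding when the CRC check passes."""
    if binascii.crc_hqx(page.data, 0) == page.crc:
        page.error_free_judgment = 0
        return page.data               # error-free page: read directly
    page.error_free_judgment = 1
    return ldpc_decode(page.data)      # page with errors: fall back to LDPC decoding
```

The decoder is passed in as a callback so that the sketch stays independent of any particular LDPC implementation; in a real controller the fallback would also involve the soft-decision sensing and transfer steps of Equation (5).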


Fig. 10. The average read time per page of carrying out the CDF-LDPC technique in the reading process with a 400MHz internal processor in SSDs.

Fig. 11. Normalized average write time per page of carrying out CDF-LDPC in the reading process.

In addition, CDF-LDPC can affect the write performance of SSDs owing to the additional CRC encoding time. We select four traces that contain a portion of write requests to evaluate the write performance. The average write time per page is normalized to that of using only the LDPC code. The write performance of CDF-LDPC is lower than that of the LDPC code. For example, the average write time per page of Financial1 increases significantly because the write request ratio of Financial1 is 72%, as shown in Figure 11 and Table III. The write time per page does not depend on $P_{\text{error-page}}$, as shown in Equations (11) and (12), so the write time per page using CDF-LDPC is independent of $P_{\text{error-page}}$. In addition, as the write ratio decreases, the write performance of the SSD improves. For example, as shown in Figure 11, the average write time per page of Financial2 is decreased by 36% compared with that of Financial1 when $P_{\text{error-page}}$ is 10%, since the write ratio of Financial2 is 58% lower than that of Financial1. In this work, the evaluation results are based on software tools. If CRC encoding and decoding are performed in hardware, the write performance of SSDs using CDF-LDPC will be further improved.


Fig. 12. The idle time statistics of traces.

4.4. Implementing CDF-LDPC in Idle Time of SSDs

If we carry out CDF-LDPC decoding in advance, we can further improve the read performance of SSDs. As discussed previously, in order to prevent new errors from arising in a page after it has been checked, we perform CDF-LDPC decoding in the idle time of SSDs only for fully programmed blocks. In fact, SSDs have a considerable amount of idle time. The idle time statistics of different traces are shown in Figure 12, from which several observations can be made. First, the majority of idle periods are shorter than 5ms for several traces, for example, Financial1, Financial2, Syn1, Syn2, and Syn3. Second, the lengths of the idle periods vary across traces. Third, the idle periods of read-dominated traces are longer than those of write-dominated traces. Moreover, the idle time can exceed 60ms for read-dominated traces, for example, WebSearch1 and WebSearch2. This motivates us to carry out CDF-LDPC decoding in advance in the idle time of SSDs.

In addition, the CRC decoding time per page is 328μs and a block has 64 pages, so the CRC decoding time per block is about 21ms. SSDs therefore have enough time to finish CDF-LDPC decoding for some fully programmed blocks. However, in a short idle period, SSDs can check only part of a fully programmed block. Based on the length of the idle time, performing CDF-LDPC decoding in the idle time of SSDs is divided into two cases. In the first case, all pages in the fully programmed block can finish CDF-LDPC decoding. In the second case, only some of the pages in the fully programmed block can undergo CDF-LDPC decoding. We introduce the two cases in detail in the following sections.

4.4.1. Performing CDF-LDPC Decoding for All Pages in Idle Time of SSDs. We first discuss the case in which all pages in a fully programmed block can be decoded by CDF-LDPC during the idle time of SSDs. The read performance of read-dominated traces is significantly improved compared with immediate CDF-LDPC.
For example, as shown in Figure 13, the read performance of Syn2 is increased by 29%, 30%, 69%, 60%, 51%, and 43%, respectively, compared with immediate CDF-LDPC for the six values of $P_{\text{error-page}}$ from 0.1% to 40%. The read performance of the other read-dominated traces, for example, WebSearch1, WebSearch2, and Syn3, is also improved by 50% to 87% compared with immediate CDF-LDPC. However, for the write-dominated trace Financial1, the read performance is improved by only 13% compared with immediate CDF-LDPC. The reason is that read-dominated traces have more read requests than write-dominated traces, and CDF-LDPC decoding mainly aims to improve the performance of read requests.

Fig. 13. The average read time per page of performing CDF-LDPC for all pages in a fully programmed block of idle time of SSDs.

To illustrate the effect of $P_{\text{error-page}}$ on the performance of SSDs, we compare the IOPS of four traces with write requests. As $P_{\text{error-page}}$ increases, the number of pages with errors increases and the performance of the SSD decreases, as shown in Figure 14. The reason is that pages with errors must be decoded by the LDPC code to correct the errors before they are read from SSDs. As discussed previously, LDPC decoding is more costly than LDPC encoding and CRC encoding/decoding, so the increase of $P_{\text{error-page}}$ degrades the performance of SSDs.

Fig. 14. The IOPS of different traces containing writing requests under different $P_{\text{error-page}}$. The situation is that SSDs carry out CDF-LDPC for all pages in the fully programmed block during the idle time.

In addition, the Input/Output operations per second (IOPS) of CDF-LDPC are higher than those of LDPC. This is because CDF-LDPC skips much of the complex LDPC decoding for error-free pages, so CDF-LDPC is faster than LDPC. For example, for Financial1, the IOPS using CDF-LDPC at the different values of $P_{\text{error-page}}$ are about three times those of LDPC, as shown in Figure 14. As shown in Figure 12, for Financial2, SSDs can perform more CDF-LDPC decoding in advance during the idle time than for the other traces, since Financial2 has longer idle time. Therefore, the IOPS of CDF-LDPC for Financial2 are higher than for the other traces. Whether CDF-LDPC decoding can be completed for a fully programmed block clearly depends on the length of the idle time.
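The dependence on idle time can be made concrete with a small Python sketch that estimates how many pages and whole blocks fit into an idle window. It uses the 328μs per-page CRC decoding time and the 64 pages per block quoted in Section 4.4, and it assumes, as that estimate does, that only CRC decoding is charged against the idle window. The function names are illustrative.

```python
T_CRC_DECODING_US = 328.0   # CRC decoding time per page at 100 MHz
PAGES_PER_BLOCK = 64


def pages_checkable(idle_ms: float, t_crc_us: float = T_CRC_DECODING_US) -> int:
    """Number of pages whose CRC check fits entirely within an idle window."""
    return int(idle_ms * 1000.0 // t_crc_us)


def idle_plan(idle_ms: float) -> tuple:
    """Return (fully checked blocks, pages checked in a further, partial block)."""
    return divmod(pages_checkable(idle_ms), PAGES_PER_BLOCK)


short_idle = idle_plan(5.0)     # a 5 ms idle period
block_idle = idle_plan(21.0)    # roughly the per-block CRC decoding time
long_idle = idle_plan(60.0)     # a 60 ms idle period
```

Under these assumptions a 5ms window covers only part of one block, a 21ms window covers exactly one block, and a 60ms window covers two blocks plus part of a third, which corresponds to the split between full-block and partial-block decoding.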


Fig. 15. The read time of performing CDF-LDPC for partial pages in a fully programmed block in the idle time of SSDs.

4.4.2. Performing CDF-LDPC Decoding for Partial Pages in Idle Time of SSDs. Performing CDF-LDPC decoding for only some of the pages in a fully programmed block includes two stages: performing CDF-LDPC decoding for some pages in the idle time of SSDs, and performing CDF-LDPC decoding for the remaining pages immediately in the non-idle time of SSDs. We assume that $P_{\text{error-page}}$ remains 5% during both stages. To simplify the experiment, we assume that 80% of the pages in a fully programmed block can be decoded by CDF-LDPC in the idle time of SSDs and that the other 20% of the pages are decoded by CDF-LDPC only when they are read. As shown in Figure 15, the read performance of performing CDF-LDPC for partial pages of a fully programmed block in the idle time of SSDs is better than the baseline. Moreover, the read performance of read-dominated traces is significantly increased. For example, the read performance of traces Syn1, Syn2, and Syn3 is improved by 55%, 62%, and 63%, respectively, compared with the baseline. Even for the write-dominated trace Financial1, the read performance is increased by 47% compared with the baseline. The read performance of partial CDF-LDPC is also better than that of immediate CDF-LDPC: it is improved by 4% compared with immediate CDF-LDPC when $P_{\text{error-page}}$ is 5%. However, the read performance of partial CDF-LDPC is worse than that of performing CDF-LDPC for all pages in the fully programmed block during the idle time of SSDs. For example, for WebSearch1, the read performance of partial CDF-LDPC is 17% lower than that of performing CDF-LDPC for all pages in the fully programmed block during the idle time of SSDs.

4.5. Using Die Parallelism to Implement CDF-LDPC Decoding

To speed up CDF-LDPC decoding, we can make use of SSD parallelism. In this work, only die parallelism is used. SSD parallelism not only speeds up CDF-LDPC decoding but also increases the data transfer speed in SSDs. Moreover, more error-free pages can be detected in advance in the idle time of SSDs, so the read performance of SSDs is significantly improved. We use two-die parallelism and four-die parallelism to illustrate the effect of parallelism on the read performance. As shown in Figure 8, when both dies are simultaneously idle, we can carry out CDF-LDPC decoding in parallel. However, when one die is not idle, the parallel CDF-LDPC decoding stops. The more parallel dies an SSD has, the better its read performance. Furthermore, we identify blocks that have already been checked through the flag block-detected. Such blocks are bypassed, and other fully programmed blocks are checked during the idle time of SSDs.
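The scheduling rule described above (decode in parallel only while every die is idle, and skip blocks whose block-detected flag is set) can be sketched in Python as follows. The data structures and the function `pick_blocks_for_idle_scan` are hypothetical and only illustrate the rule; choosing the first eligible block on each die is an assumption, since the selection order is not specified in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Block:
    block_id: int
    fully_programmed: bool
    block_detected: bool = False   # set once the block has been checked in idle time


@dataclass
class Die:
    idle: bool
    blocks: List[Block] = field(default_factory=list)


def pick_blocks_for_idle_scan(dies: List[Die]) -> Dict[int, Block]:
    """Pick one unchecked, fully programmed block per die, or nothing if any die is busy."""
    if not all(die.idle for die in dies):
        return {}                  # parallel decoding stops as soon as one die is not idle
    plan: Dict[int, Block] = {}
    for index, die in enumerate(dies):
        for block in die.blocks:
            if block.fully_programmed and not block.block_detected:
                plan[index] = block    # first eligible block on this die (assumed order)
                break
    return plan
```

A caller would run CDF-LDPC decoding on the returned blocks concurrently and set `block_detected` on each one afterwards, so that later idle periods move on to other fully programmed blocks.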


Fig. 16. Illustration of using parallelism to improve read performance in SSDs. The trace Postmark is studied in this figure.

The read performance of SSDs can be significantly improved compared with the case without die parallelism. We select five values of $P_{\text{error-page}}$ (5%, 10%, 20%, 30%, and 40%) and use the trace Postmark to test the read performance under different degrees of die parallelism. For example, the read performance using two-die parallelism is improved by more than 50% compared with the case without die parallelism, as shown in Figure 16. Moreover, the read performance using four-die parallelism is increased by about 60% or more compared with that of using two-die parallelism. In addition, as die parallelism increases, the read latency per page is also reduced. For example, when $P_{\text{error-page}}$ is 20%, the read latency per page is 1.54ms without die parallelism, 0.89ms with two-die parallelism, and 0.64ms with four-die parallelism. We can see that the read performance of SSDs is further improved when SSDs perform CDF-LDPC with the help of die parallelism.

5. RELATED WORK

The data reliability of SSDs based on NAND flash memory has attracted much attention. Most works have focused on improving SSD performance and endurance. To date, ECC remains the main way to ensure the data reliability of SSDs. To improve the read performance of SSDs, Zhao et al. [2013] developed a look-ahead and fine-grained progressive threshold voltage sensing scheme for soft-decision LDPC decoding, applied after hard-decision LDPC decoding fails. Wu and Zhang [2014] proposed a trade-off between flash memory write latency and data retention time to improve the write performance of SSDs. Zhao et al. [2014b] made use of overclocked I/O to exploit the underutilized error correction capability of ECC without sacrificing storage reliability. Wu et al. [2010] slowed down the write speed of NAND flash memory and used a weaker BCH code to improve the read performance and reliability of SSDs. In order to improve SSD performance, Pan et al. [2013] proposed a quasi-EZ-NAND design strategy that distributes ECC and DSP hierarchically to the flash memory chips and the SSD controller, respectively. When the weak ECC in the flash memory chip fails, the SSD carries out the stronger ECC in the controller to correct the errors. Huang et al. [2014] used a CRC code to scan the HDD periodically to find blocks with errors. They then put those HDD blocks into the SSD cache and used BCH to correct the errors. Our work differs from theirs. We mainly focus on the reliability of the SSD device itself rather than the reliability of a hybrid storage system including HDD and SSD. We propose some new data structures and take advantage of SSD parallelism to accelerate error correction. In addition, we distinguish two block types in order to carry out CDF-LDPC in the idle time of SSDs.

SSD parallelism is another research focus. Jung et al. [2012] performed physically addressed queueing using multi-plane-mode parallelism to improve the IOPS of SSDs. Chen et al. [2011] performed a comprehensive study on the internal parallelism of SSDs and concluded that it could significantly improve I/O performance. To improve flash-level parallelism and reduce transactions, Jung and Kandemir [2014] proposed the Sprinkler scheme, which relaxes parallelism dependency by scheduling I/O requests based on the internal resource layout. Hu et al. [2011] compared four parallelism levels and obtained a parallelism priority order in SSDs.

6. CONCLUSIONS

In this article, we propose the CDF-LDPC algorithm, which bypasses the complex LDPC decoding for error-free flash memory pages in order to improve the read performance of SSDs. For pages with errors, LDPC decoding is performed as soon as they are accessed in order to prevent errors from arising again. To further accelerate CDF-LDPC, we perform CDF-LDPC decoding for fully programmed blocks in the idle time of SSDs. This is because a fully programmed block cannot be programmed further, so it is free from interference from noise caused by P/E cycles. Moreover, we make use of SSD parallelism to further accelerate CDF-LDPC decoding in the idle time of SSDs. We evaluated the read performance of CDF-LDPC using different real traces, and the experiment results show that the read performance is significantly improved. When we carry out CDF-LDPC decoding immediately, the read performance of the SSD is improved by 49%–58% compared with LDPC. Furthermore, the read performance of the SSD is improved by 50%–80% compared with LDPC when the idle time and SSD parallelism are exploited. However, for write-dominated traces, the read performance of the SSD is improved by only 3%–6% compared with the LDPC code. In general, the CDF-LDPC algorithm is suitable for improving read performance for read-dominated traces and in the early lifetime of the SSD.

ACKNOWLEDGMENTS

We would like to thank our IT support staff at Data Storage and Application Lab. We are thankful to Dr. Jonghong Kim of Seoul National University for providing specifications about the noise channel of flash memory.

REFERENCES

Nitin Agrawal, Vijayan Prabhakaran, Ted Wobber, John D. Davis, Mark S. Manasse, and Rina Panigrahy. 2008. Design tradeoffs for SSD performance. In Proceedings of the USENIX Annual Technical Conference. 57–70.
John S. Bucy, Jiri Schindler, Steven W. Schlosser, and Gregory R. Ganger. 2008. The DiskSim simulation environment version 4.0 reference manual (cmu-pdl-08-101). Parallel Data Laboratory (2008).
Yu Cai, Erich F. Haratsch, Mark McCartney, and Ken Mai. 2011. FPGA-based solid-state drive prototyping platform. In Proceedings of the 19th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 101–104.
Yu Cai, Erich F. Haratsch, Onur Mutlu, and Ken Mai. 2012. Error patterns in MLC NAND flash memory: Measurement, characterization, and analysis. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE). 521–526.
Yu Cai, Yixin Luo, Saugata Ghose, and Onur Mutlu. 2015. Read disturb errors in MLC NAND flash memory: Characterization, mitigation, and recovery. In Proceedings of the 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). 438–449.
Yu Cai, Onur Mutlu, Erich F. Haratsch, and Ken Mai. 2013. Program interference in MLC NAND flash memory: Characterization, modeling, and mitigation. In Proceedings of the 31st International Conference on Computer Design (ICCD). 123–130.
Feng Chen, Rubao Lee, and Xiaodong Zhang. 2011. Essential roles of exploiting internal parallelism of flash memory based solid state drives in high-speed data processing. In Proceedings of the 17th International Symposium on High Performance Computer Architecture (HPCA). 266–277.
Hyojin Choi, Wei Liu, and Wonyong Sung. 2010. VLSI implementation of BCH error correction for multilevel cell NAND flash memory. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 18, 5 (2010), 843–847.
Jim Cooke. 2007. Flash memory technology direction. Micron Applications Engineering Document (2007).
Koichi Fukuda, Yuui Shimizu, Kazumi Amemiya, Masahiro Kamoshida, and Chenming Hu. 2007. Random telegraph noise in flash memories-model and technology scaling. In Proceedings of the IEEE International Electron Devices Meeting (IEDM). 169–172.
Robert G. Gallager. 1962. Low-density parity-check codes. IRE Transactions on Information Theory 8, 1 (1962), 21–28.
Laura M. Grupp, Adrian M. Caulfield, Joel Coburn, Steven Swanson, Eitan Yaakobi, Paul H. Siegel, and Jack K. Wolf. 2009. Characterizing flash memory: Anomalies, observations, and applications. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 24–33.
Yang Hu, Hong Jiang, Dan Feng, Lei Tian, Hao Luo, and Shuping Zhang. 2011. Performance impact and interplay of SSD parallelism through advanced commands, allocation strategy and data granularity. In Proceedings of the International Conference on Supercomputing (ICS). 96–107.
Ping Huang, Pradeep Subedi, Xubin He, Shuang He, and Ke Zhou. 2014. FlexECC: Partially relaxing ECC of MLC SSD for better cache performance. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference (ATC). 489–500.
Myoungsoo Jung and Mahmut T. Kandemir. 2014. Sprinkler: Maximizing resource utilization in many-chip solid state disks. In Proceedings of the 20th International Symposium on High Performance Computer Architecture (HPCA). 524–535.
Myoungsoo Jung, Ellis H. Wilson III, and Mahmut Kandemir. 2012. Physically addressed queueing (PAQ): Improving parallelism in solid state disks. In Proceedings of the International Symposium on Computer Architecture (ISCA). 404–415.
Philip Koopman. 2015. Best CRC Polynomials. https://users.ece.cmu.edu/~koopman/crc/.
Philip Koopman and Tridib Chakravarty. 2004. Cyclic redundancy code (CRC) polynomial selection for embedded networks. In Proceedings of the International Conference on Dependable Systems and Networks (DSN). 145–154.
Jae-Duk Lee, Jeong-Hyuk Choi, Donggun Park, and Kinam Kim. 2003. Data retention characteristics of sub-100 nm NAND flash memory cells. IEEE Electron Device Letters 24, 12 (2003), 748–750.
Jae-Duk Lee, Sung-Hoi Hur, and Jung-Dal Choi. 2002. Effects of floating-gate interference on NAND flash memory cell operation. IEEE Electron Device Letters 23, 5 (2002), 264–266.
David J. C. MacKay and Radford M. Neal. 1996. Near Shannon limit performance of low density parity check codes. IEEE Electronics Letters 32, 18 (1996), 1645–1646.
Neal Mielke, Hanmant P. Belgal, Albert Fazio, Qingru Meng, and Nick Righos. 2006. Recovery effects in the distributed cycling of flash memories. In Proceedings of the IEEE International Reliability Physics Symposium. 29–35.
Christian Monzio Compagnoni, Michele Ghidotti, Andrea L. Lacaita, Alessandro S. Spinelli, and Angelo Visconti. 2009. Random telegraph noise effect on the programmed threshold-voltage distribution of flash memories. IEEE Electron Device Letters 30, 9 (2009), 984–986.
Yangyang Pan, Guiqiang Dong, Ningde Xie, and Tong Zhang. 2013. Using Quasi-EZ-NAND flash memory to build large-capacity solid-state drives in computing systems. IEEE Transactions on Computers 62, 5 (2013), 1051–1057.
Judea Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.
Shigui Qi, Dan Feng, and Jingning Liu. 2014. Optimal voltage signal sensing of NAND flash memory for LDPC code. In Proceedings of the IEEE Workshop on Signal Processing Systems (SiPS). 145–150.


Shigui Qi, Dan Feng, Nan Su, Wenguo Liu, and Jingning Liu. 2015. A new solution based on multi-rate LDPC for flash memory to reduce ECC redundancy. In Proceedings of IEEE Trustcom/BigDataSE/ISPA. 918–923.
Shuhei Tanakamaru, Chinglin Hung, Atsushi Esumi, Mitsuyoshi Ito, Kai Li, and Ken Takeuchi. 2011. 95%-lower-BER 43%-lower-power intelligent solid-state drive (SSD) with asymmetric coding and stripe pattern elimination algorithm. In Proceedings of the IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC). 204–206.
Guanying Wu, Xubin He, Ningde Xie, and Tong Zhang. 2010. DiffECC: Improving SSD read performance using differentiated error correction coding schemes. In Proceedings of the IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS). 57–66.
Qi Wu and Tong Zhang. 2014. OFWAR: Reducing SSD response time using on-demand fast-write-and-rewrite. IEEE Transactions on Computers 63, 10 (2014), 2500–2512.
Kai Zhao, Kalyana S. Venkataraman, Xuebin Zhang, Jiangpeng Li, Ning Zheng, and Tong Zhang. 2014b. Over-clocked SSD: Safely running beyond flash memory chip I/O clock specs. In Proceedings of the 20th International Symposium on High Performance Computer Architecture (HPCA). 536–545.
Kai Zhao, Wenzhe Zhao, Hongbin Sun, Tong Zhang, Xiaodong Zhang, and Nanning Zheng. 2013. LDPC-in-SSD: Making advanced error correction codes work effectively in solid state drives. In Proceedings of the 9th USENIX Conference on File and Storage Technologies (FAST). 243–256.
Wenzhe Zhao, Hongbin Sun, Minjie Lv, Guiqiang Dong, Nanning Zheng, and Tong Zhang. 2014a. Improving min-sum LDPC decoding throughput by exploiting intra-cell bit error characteristic in MLC NAND flash memory. In Proceedings of the 30th Symposium on Mass Storage Systems and Technologies (MSST). 1–6.

Received April 2015; revised August 2016; accepted October 2016

