A Scalable Multi-channel Parallel NAND Flash Memory Controller Architecture

Yang Ou, Nong Xiao, Mingche Lai
School of Computer, National University of Defense Technology, Changsha, China
[email protected]; [email protected]; [email protected]

Abstract—With more data-intensive applications appearing in everyday life, NAND flash memory is widely used in data centers as a replacement candidate for hard disk drives because of its lower power consumption, faster random access, and higher shock resistance. The traditional solid state disk, however, exposes a bandwidth limitation. To this end, we present in this paper a scalable multi-channel flash memory controller architecture that exploits the parallelism of multiple chips. It supports all the flash operations, boosts performance by evenly distributing accesses among different chips, and uses an error correction code to improve reliability. The functions of the multi-channel parallel controller are validated over a wide range of workloads. The evaluation results show that our proposed eight-channel controller outperforms the traditional controller by more than three times and scales well up to 128 channels without introducing an extra critical timing path.

Keywords — data-intensive computing; flash memory; multi-channel parallelism
I. INTRODUCTION
With the rapid development of network, computing, and storage technologies, the amount of data generated every 18 months now equals the sum of all data generated in history. We live in a world overflowing with data. Storing, managing, accessing, and processing this mass of data is a fundamental need and an immense challenge, since the data must be classified, searched, analyzed, and processed as information. Data-intensive computing is intended to address this need. Data-intensive applications such as Facebook are now commonplace in our lives. Such applications devote most of their processing time to I/O and to the movement and manipulation of data, often with random access patterns to small objects in large datasets. The traditional data center, which uses hard disk drives for storage, is ill-suited to data-intensive computing because of its high power consumption, long latency, and low bandwidth. Flash memory is gradually replacing the hard disk drive in data centers thanks to its lower power consumption, faster random access, and higher shock resistance. The type of flash memory used for mass-storage applications is NAND flash memory, whose unusual peculiarities make it difficult to use directly as a storage device. NAND flash memory has its own command set with special timing sequences, and strict timing requirements must be met for commands to execute correctly. Meanwhile, the trend in data-intensive computing is toward high I/O throughput, so high-bandwidth flash array architectures are needed to alleviate the computer system's I/O bottleneck. However, the bandwidth of a single NAND flash chip is only about 40MB/s. Previous research [9], [10] exploited chip-level and bus-level interleaving to hide the latency of flash operations, improving performance by more than 80%. Recently, a technique that executes multiple flash operations out of order [11] has been proposed to exploit the multi-chip parallelism inherent in flash memory; it achieves 3% to 100% more throughput than interleaving, with 46% to 88% lower response times. In contrast, we investigate a scalable approach that improves bandwidth by an order of magnitude. While obeying NAND flash memory's timing sequences, we exploit a scalable multi-channel parallel access mechanism in the flash array to improve I/O bandwidth. In this paper, we propose a multi-channel parallel NAND flash memory controller architecture. Our main contributions are as follows. First, to improve the aggregate I/O bandwidth, we propose a NAND flash memory controller with a scalable multi-channel parallel access mechanism that executes all the flash operations. Second, we describe a new dynamic replacing strategy for the mapping table in the flash translation layer (FTL) that evenly distributes operations among chips to avoid hot chips. Third, we complete the design of the multi-channel parallel NAND flash memory controller with an error correction code, and detailed experimental results show that the proposed controller correctly implements multi-channel parallel control of NAND flash operations.
In terms of performance, the total throughput of the eight-channel controller is improved by more than three times compared to the traditional controller, and the design scales well up to 128 channels without an extra critical timing path. The rest of this paper is organized as follows. In the next section, we describe the basics of flash memory and review related work on flash memory controllers. We then present the multi-channel parallel NAND flash memory controller architecture in Section 3. The results of a performance evaluation are presented in Section 4. Finally, we conclude in Section 5 and discuss directions for future research.

II. BACKGROUND AND RELATED WORK
A. NAND Flash Memory
NAND flash memory is organized into blocks, each of which consists of a set of pages. Each page has a user data area and a spare data area. The user data area is a set of sectors (512 bytes each), and the spare data area is typically 16 bytes per sector of the user data area. We use the Micron MT29F8G08AABWP 8Gb flash memory chip [2] as the example throughout this paper, although our technique does not rely on it. The block size is 128KB, and a block consists of 64 pages of 2KB. There is a 2,112-byte data register in the memory cell array, which buffers data transfers between the I/O buffers and the memory during read-page and program-page operations. One target address is represented by 5 bytes. There are three main flash operations: read, program, and erase, which take 25μs, 300μs, and 2ms respectively. Flash memory cannot perform in-place updates of data, so an erase operation must be executed before a program operation. The flash translation layer (FTL) [3] hides the peculiarities of flash memory from the host computer and provides the illusion of an HDD. The most important function of the FTL is to maintain a mapping between the logical sector addresses used by the host computer and the physical block and page addresses used in flash memory. As the number of charge/discharge cycles increases, the electrical characteristics of NAND flash memory change, producing bad blocks. Each type of NAND flash memory therefore has a limit on erasure counts, called its lifetime, which necessitates a wear-leveling technique that balances the erasure counts across blocks. In addition, NAND flash memory suffers from bit-flipping errors, in which one bit changes from 0 to 1 or from 1 to 0 (more than one bit may rarely change), so check logic is needed to handle this problem.
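As a concrete illustration of this geometry, the following sketch (the names are ours, not from the datasheet) breaks a host logical sector number down into the block, page, and in-page sector it would occupy under a simple linear mapping:

```python
# Geometry of the example chip (Micron MT29F8G08AABWP-class part).
PAGE_SIZE = 2048        # user data bytes per page
PAGES_PER_BLOCK = 64    # so one block holds 64 * 2 KB = 128 KB
SECTOR_SIZE = 512       # 4 host sectors fit in one page

def sector_to_flash_addr(logical_sector):
    """Map a host logical sector to (block, page, sector-in-page)."""
    sectors_per_page = PAGE_SIZE // SECTOR_SIZE
    page = logical_sector // sectors_per_page
    return (page // PAGES_PER_BLOCK,          # block index
            page % PAGES_PER_BLOCK,           # page within the block
            logical_sector % sectors_per_page)  # sector within the page

# Example: sector 260 lies in page 65, i.e. block 1, page 1, sector 0.
```

A real FTL replaces this linear rule with the mapping table discussed below, but the address arithmetic is the same.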
B. Related Work
Tang Lei et al. [4] designed a flash controller for FPGA applications. The controller provides a main state machine with an enhanced command set to manage the flash chip, so users can operate the controller without worrying about the chip's timing sequences, and it includes a method for replacing invalid blocks. However, it cannot overcome bit-flipping errors and thus cannot ensure data reliability. Lin and Dung [5] proposed a NAND flash memory controller for SD/MMC flash memory cards. They designed a t-EC w-bit parallel BCH ECC to correct bit-flipping errors; the controller achieves high reliability and endurance through built-in defect management and a wear-leveling algorithm, and they presented a code-banking technique to support various kinds of NAND flash memory. Chanik Park et al. [6] presented a high-performance and cost-efficient controller for NAND flash-based solid state disks, providing cost-efficiency, fast start-up time, high reliability, and low power consumption. Neither of the above controllers considers parallelism, so both are limited in bandwidth. Jin Hyuk Yoon et al. [7] described a high-performance Flash/FRAM hybrid SSD architecture that improves the efficiency of small random writes by using a small FRAM; however, it is not suitable for bulk data because of the limited FRAM size. Chang and Kuo [8] proposed chip-level interleaving to improve write performance, which increases the effective bandwidth of flash operations by allowing several flash chips to operate in parallel, but the method is limited by the flash bus bandwidth. Another interleaving approach is bus-level interleaving, which uses a super-chip consisting of a set of flash chips involved in both chip-level and bus-level interleaving [9], [10]. This technique, however, is less efficient for random flash operations. To solve this problem, the O3 controller was proposed to execute operations out of order [11], allowing it to exploit the multi-chip parallelism inherent in flash memory much more effectively than interleaving. However, neither interleaving nor O3 scales well, and the bandwidth they provide cannot satisfy the demands of flash arrays in data-intensive computing. Recently, Fusion-io introduced the commercial SSD "ioDrive Duo" [12], which provides 1.5 GB/s of read/write bandwidth. Compared to an Intel SSD using the same NAND flash chips, its read and write speeds are 8 and 16 times higher, respectively; unfortunately, the details of the technology are not public.

III. MULTI-CHANNEL PARALLELISM ARCHITECTURE

Fig. 1 shows the architecture of the scalable multi-channel parallel flash memory controller. Although we use eight flash chips, all of them are addressed uniformly; that is, the eight chips appear to the user as a single whole called a super-chip.
Figure 1. The architecture of the multi-channel parallel flash memory controller
The host interface issues read and write requests, specified in terms of logical sectors, to the FTL, which translates them into flash requests. The FTL then issues the commands, addresses (the single-controller numbers), and data to the multi-channel controller via eight channels. On receiving a request, the switching fabric forwards it to the single controller determined by the request's address. According to the request type, the single controller then sends control signals to the NAND flash chip to complete the operation, such as read page, program page, or block erase. In the following subsections, we detail the key features of the multi-channel parallel flash memory controller: the dynamic FTL mapping strategy, the switching fabric, and the design of the single controller.

A. Dynamic FTL Mapping Strategy
When the FTL receives requests from the host, it translates them into flash requests. During this process, the FTL's most important role is to look up, in the mapping table, the physical flash memory address for each logical sector address used by the host file system. As mentioned above, the FTL uses wear-leveling to even out the erasure counts of blocks within the same chip. Across different chips, however, the number of operations is not well distributed. Our goal is to make the multiple chips work in parallel as much as possible, which improves the total bandwidth significantly. We therefore describe a new dynamic replacing strategy for the mapping table in the FTL that evens out the number of operations among the chips.
The FTL obtains each chip's status from the switching fabric. If one chip has been busy for a long time while others are idle, the mapping table may be changed: for a stream of sequential program-page requests to one chip, the FTL first finds the physical flash memory addresses of the later requests in the mapping table and then replaces their target chip number with the number of an idle chip. In this way, a large file is divided into several parts programmed on different chips, and when the file is read back it can be read out through the parallel mechanism.

B. The Switching Fabric
We design the switching fabric to implement our multi-channel parallel access mechanism. The switching fabric can accept multiple requests simultaneously, issued by the FTL via the eight channels. Requests wait in the inbound feeder until they are dispatched to the target flash chip. According to the busy/idle status of each single controller and the single-controller number of the request in the inbound feeder, the switching scheduler sends select signals to the crossbar switch to connect a data-path to that single controller. The data-path transmits the data, commands, and addresses between the single controller and one channel of the FTL. At the same time, the other channels can connect to other single controllers, so multiple requests execute concurrently. When a request completes, i.e., the single controller becomes idle, the switching scheduler disconnects the data-path, and the channel can then connect to another single controller if needed.

Figure 2. Timing diagrams for a sequence of flash requests
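The dynamic replacing strategy of Section III-A can be sketched roughly as follows; the chip-status interface and mapping-table layout here are simplified assumptions of ours, not the controller's actual data structures:

```python
class DynamicFTL:
    """Toy model of the dynamic replacing strategy: a program request
    whose target chip is busy is redirected to an idle chip."""

    def __init__(self, num_chips=8):
        self.num_chips = num_chips
        self.mapping = {}               # logical page -> (chip, physical page)
        self.next_free = [0] * num_chips

    def program(self, logical_page, busy):
        # Start from a simple hashed placement, then redirect if busy.
        chip = logical_page % self.num_chips
        if busy[chip]:
            idle = [c for c in range(self.num_chips) if not busy[c]]
            if idle:
                chip = idle[0]          # modify the target chip number
        phys = self.next_free[chip]
        self.next_free[chip] += 1
        self.mapping[logical_page] = (chip, phys)
        return chip
```

Redirecting only when the hashed target chip is busy spreads a sequential program stream across idle chips, which is what allows a large file to be read back in parallel later.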
When the switching scheduler dispatches requests, two kinds of collisions can occur. First, a channel collision occurs when two requests come from the same originating channel: if the channel is already connected to a single controller, another request in the inbound feeder that wants the same channel must wait. Second, a chip collision occurs when two commands target the same chip: while a single controller is busy executing any command, another command in the inbound feeder that wants that single controller must wait. The switching scheduler allows only one command to execute on a single controller, and only one data-path to connect to it. When a collision occurs, we use a FIFO policy to decide the execution order, so the later request waits in the inbound feeder until the earlier one completes. Of course, both kinds of collision may occur concurrently. As shown in Fig. 2(a), with the traditional single controller the commands must execute in sequential order, taking about 4025μs. R1-2 denotes a read-page command coming from channel 1 and going to chip 2 (the others are analogous). Fig. 2(b) shows how the same sequence of flash operations is scheduled by the crossbar; it takes only 2425μs to execute all the commands. The read-page request R3-1 cannot start until the read-page request R2-1 has finished, because both target the same chip (chip 1). In addition, because channel 2 is occupied by the program-page request W2-3, the last request R2-1 cannot be executed immediately after R3-1.
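The FIFO collision handling can be sketched behaviorally as follows; this is an idealized model of ours in which all operations in a dispatch round take equally long, which is not true of real mixed workloads:

```python
from collections import deque

def schedule(requests):
    """FIFO dispatch honoring the two collision rules: a channel drives
    at most one data-path, and a chip (single controller) executes at
    most one command at a time. Requests are (channel, chip) pairs."""
    feeder = deque(requests)            # the inbound feeder, FIFO order
    rounds = []
    while feeder:
        busy_ch, busy_chip = set(), set()
        batch, skipped = [], deque()
        while feeder:
            ch, chip = feeder.popleft()
            if ch in busy_ch or chip in busy_chip:
                skipped.append((ch, chip))   # collision: wait in the feeder
            else:
                busy_ch.add(ch)
                busy_chip.add(chip)
                batch.append((ch, chip))     # dispatched concurrently
        rounds.append(batch)
        feeder = skipped
    return rounds
```

For the Fig. 2-style stream [(1, 2), (2, 1), (3, 1), (2, 3)], the first round dispatches (1, 2) and (2, 1), while (3, 1) waits on chip 1 and (2, 3) waits on channel 2.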
Figure 3. The architecture of an n×n crossbar switch of w bits, built from n:1 multiplexers, each consisting of log2(n) levels of 2:1 muxes driven by select signals
Fig. 3 describes the architecture of an n×n crossbar switch of w bits. It consists of a set of n:1 multiplexers, each organized into log2(n) levels of 2:1 muxes; at level i, the number of 2:1 muxes is n/2^i. The theory of logical effort [13] provides a method for fast back-of-the-envelope estimates of delay in a CMOS circuit using a technology-independent unit τ. With this theory, we obtain the total circuit delay of the w-bit n×n crossbar switch, shown in (1).
T_circuit = T_eff + T_par = 5·log4(w) + 5·log4(n) + 6·log2(n) − 2.5    (1)

where T_eff, the effort delay, is the product of the logical effort required to perform a logic function and the electrical effort required to drive an electrical load, and T_par, the parasitic delay, is the intrinsic delay of a gate due to its own internal capacitance. Similarly, we obtain the circuit delay of the switch arbiter inside the switching scheduler, shown in (2).
T_arbiter = 21.5·log4(n) + 19    (2)
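Equations (1) and (2) combine into the total fabric delay given in (3) below. As a quick numerical sketch of that closed form (the function names are ours; τ = 18 ps for the 0.18 μm process used in Section IV):

```python
import math

TAU_PS = 18.0  # 1 tau = 18 ps in the 0.18 um CMOS process of Section IV

def fabric_delay_tau(n, w=24):
    """Total circuit delay of a w-bit n x n switch fabric, Eq. (3),
    expressed in the technology-independent unit tau."""
    return (5 * math.log(w, 4) + 26.5 * math.log(n, 4)
            + 6 * math.log(n, 2) + 17.5)

def fabric_delay_ns(n, w=24):
    """Convert the Eq. (3) delay to nanoseconds for this process."""
    return fabric_delay_tau(n, w) * TAU_PS / 1000.0
```

Because the growth is only logarithmic in n, even a 128-channel fabric evaluated this way stays under the 5 ns budget imposed by the 200 MHz single controller discussed in Section IV.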
The total circuit delay of the w-bit n×n switch fabric is

T_fabric = 5·log4(w) + 26.5·log4(n) + 6·log2(n) + 17.5    (3)

The 8×8 switch fabric we use connects the 24-bit data-paths, including data, address, and control signals, between the eight single controllers and the eight channels. For each input port, the crossbar switch chooses an output port according to the select signals from the switching scheduler. The total circuit latency is 76.8τ. As shown in Fig. 3, we can easily increase the number of input and output ports of the crossbar switch by extending the n:1 multiplexers with additional levels of 2:1 muxes, thereby raising the degree of channel parallelism to obtain higher bandwidth.

C. The Single Controller
Fig. 4 shows the details of the single-controller architecture. When it receives a command from the switching fabric, the main control module correctly carries out the flash operation, such as read page, program page, or block erase, according to the control logic of the NAND flash chip. We also design a hardware-based ECC module that generates a 16-bit ECC code for each page, stored in the spare data area. The main control module is the core of the single controller. Its control logic has a main state machine that transitions among the main states, each of which contains a sub-state machine. When a request arrives, the main control logic starts the main finite state machine and drives the sub-state machine to perform the corresponding actions, including determining which operation to execute and generating the control signals for the ECC module. When the request completes, the sub-state machine finishes and the main state machine returns to the idle state to wait for the next request. Data to be input or output are first placed in the data buffer to optimize the performance of small random reads/writes, and the command and address are stored in registers before the main control logic sends them to the chip.
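The main/sub state-machine organization described above can be modeled roughly as follows; the state names and sub-step sequences are illustrative guesses of ours, not the RTL's actual encoding:

```python
from enum import Enum, auto

class MainState(Enum):
    IDLE = auto()
    READ = auto()
    PROGRAM = auto()
    ERASE = auto()

class SingleController:
    """Minimal model of the main state machine: each request enters a
    main state whose sub-state sequence runs to completion, then the
    controller returns to IDLE for the next request."""

    SUBSTEPS = {  # assumed sub-state sequences for each main state
        MainState.READ:    ["cmd", "addr", "wait_busy", "xfer_out", "ecc_check"],
        MainState.PROGRAM: ["cmd", "addr", "ecc_gen", "xfer_in", "wait_busy"],
        MainState.ERASE:   ["cmd", "addr", "wait_busy"],
    }

    def __init__(self):
        self.state = MainState.IDLE
        self.trace = []                 # record of (operation, sub-step)

    def execute(self, op):
        self.state = op                 # enter the main state
        for step in self.SUBSTEPS[op]:  # run its sub-state machine
            self.trace.append((op.name, step))
        self.state = MainState.IDLE     # back to idle when done
```

The point of the two-level structure is that the main machine only sequences whole operations, while timing-critical signal toggling lives in the sub-machines.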
Figure 4. The single controller architecture
The ECC module provides a hardware-based solution to the bit-flipping errors mentioned in Section 2.1: it detects bit-flipping errors and corrects a single-bit error. During a program-page operation, before being sent to the chip, the data pass one by one through the ECC code generator under the ECC control logic, producing a 16-bit ECC code for the page; the ECC control logic then writes this code into the spare data area of the target page after the data have been written into the user data area. The read-page operation is a similar process in reverse, except that the error-location logic compares the old code (read from the target page) with the new code (generated by the ECC code generator) to decide whether a bit-flipping error has occurred. If it has, the address of the erroneous bit is sent to the main control module, which corrects the bit already stored in the data buffer.

IV. PERFORMANCE EVALUATION

A. Experimental Set-up
We perform function simulation and performance testing on the RTL of the multi-channel parallel NAND flash memory controller, using the experimental setup shown in Fig. 5. We use the Micron MT29F8G08AABWP 8Gb flash memory chip [2] as the storage module. The flash bus is 8 bits wide and operates at 40MHz, so the maximum bandwidth of the flash bus is 40MB/s. We use Synopsys PrimeTime to simulate the delay of the crossbar switch in a 0.18μm CMOS technology, in which τ = 18ps. In our experiments, we first generate basic requests to verify the single controller. We then test the switching fabric, which provides the multi-channel parallel access. Finally, we compare the performance of the single controller and the multi-channel controller across different workloads and channel counts.

B. Function Verification
As shown in Fig. 5, the analyzer is responsible for analyzing the data and addresses, and the collector collects the data from each chip when a process finishes. We use the comparer to compare the log files generated by the collector and the analyzer, to check whether the operations executed correctly. To test the single controller, we generate a set of processes combining the three main operations in different orders. For example, as shown in Fig. 5, PROCESS_RPE denotes operations that come from the same channel, go to the same chip, and execute one by one in the order read(R)-program(P)-erase(E). The experiments show that the single controller correctly implements the basic operations according to the timing sequences of the NAND flash memory chip. In addition, the workload generator produces random requests, coming from arbitrary channels and going to arbitrary chips, to verify the multi-channel controller. Although some collisions arise (as discussed in Section 3.2), the switching fabric resolves them correctly, as we expect. The results from the comparer show that, with the multi-channel parallel NAND flash memory controller, multiple requests execute concurrently without conflicts.

Figure 5. Experimental setup
C. Performance Evaluation
To compare the performance of the multi-channel controller with the single controller, we set up the experiment as follows. First, we build a workload generator that produces mixes of read (R), program (P), and erase (E) operations, forming a stream of flash requests; the requests are generated randomly (the target page or block of any request is arbitrary). We then use the stream to assess the performance of the two flash controllers. The single controller executes the requests in the stream one by one, from which we obtain its throughput. The multi-channel controller executes the same requests in parallel mode, and we take the longest execution time among the channels as its execution time for each stream to compute its throughput. To assess the scalability of the multi-channel controller, we performed experiments varying the number of channels and the type of workload. Fig. 6 presents the speedup of different multi-channel controllers with the workloads Read Only, E:P:R=1:64:64, and E:P=1:64. As the number of channels increases, the speedup improves almost linearly for all three workloads. In practice, the performance may be limited by the total circuit latency of the switch fabric as the number of channels grows. The single controller runs at 200MHz, so performance similar to Fig. 6 can be maintained as long as the total circuit latency of the switch fabric is below 5ns. As shown in Fig. 7, the total circuit latency of the switch fabric (detailed in Section 3.2) grows slowly as the number of channels increases; for the 128-channel controller, the latency is about 2.7ns. The total circuit latency of the switch fabric therefore has little effect on performance.
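The evaluation method above amounts to comparing a serial schedule against a per-chip parallel one. A toy model of ours using the datasheet latencies from Section II-A makes the mechanism concrete (it ignores bus transfer time and fabric latency, so it gives an idealized upper bound, not the measured numbers):

```python
# Nominal operation latencies from the datasheet (microseconds).
LATENCY_US = {"R": 25.0, "P": 300.0, "E": 2000.0}

def serial_time(ops):
    """Single controller: requests execute strictly one after another.
    ops is a list of (operation, target chip) pairs."""
    return sum(LATENCY_US[op] for op, _ in ops)

def parallel_time(ops):
    """Multi-channel controller: per-chip queues run concurrently, so
    the stream's time is that of the busiest chip (idealized model)."""
    per_chip = {}
    for op, chip in ops:
        per_chip[chip] = per_chip.get(chip, 0.0) + LATENCY_US[op]
    return max(per_chip.values())

ops = [("R", i % 8) for i in range(64)]         # 64 reads over 8 chips
speedup = serial_time(ops) / parallel_time(ops)  # -> 8.0 for this ideal mix
```

With real workloads the speedup is lower than this ideal because collisions serialize requests that share a chip or a channel, which is why the measured eight-channel gain is "more than three times" rather than eight.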
Figure 6. Effect of the multi-channel controller

Figure 7. The total delay of the switch fabric

We also prepare six different streams of flash requests, consisting of read, program, and erase operations, and increase the channel count from 1 to 16. The results in Fig. 8 show that the multi-channel controller achieves high throughput for all workloads compared to the single controller, and that its performance is largely insensitive to the workload type.

Figure 8. Throughput of different workloads and channels (workloads R=64, E:P:R=1:64:512, E:P:R=1:64:256, E:P:R=1:64:64, E:P:R=1:64:32, and E:P=1:64; 1, 2, 4, 8, 12, and 16 channels)

V. CONCLUSION

In this paper, we have presented a new scalable multi-channel parallel NAND flash memory controller architecture. It supports all the flash operations and uses an error correction code to improve reliability. To improve bandwidth, we exploit multi-channel parallelism and propose a new dynamic replacing strategy for the mapping table in the FTL. The results show that the eight-channel controller performs more than three times better than the traditional one, and the design scales easily to 128 channels without an extra critical timing path. We are currently exploring interleaving of multiple chips on the same bus and the parallelism of multiple dies within a chip.

REFERENCES
[1] C. Hwang, "Nanotechnology enables a new memory growth model," Proceedings of the IEEE, vol. 91, no. 11, pp. 1765-1771, Nov. 2003.
[2] Micron, NAND Flash Memory datasheet. http://download.micron.com/pdf/datasheets/flash/nand/2gb_nand_m29b.pdf
[3] E. Gal and S. Toledo, "Algorithms and data structures for flash memories," ACM Computing Surveys, vol. 37, no. 2, pp. 138-163, Jun. 2005.
[4] Tang Lei, Zhou Xuan, Wu Yao, and Li Jincheng, "Flash controller design for FPGA application," ICEEE 2010 International Conference, pp. 1-4, Nov. 2010.
[5] Chuan-Sheng Lin and Lan-Rong Dung, "A NAND flash memory controller for SD/MMC flash memory card," IEEE Transactions on Magnetics, vol. 43, no. 2, Feb. 2007.
[6] Chanik Park, Prakash Talawar, Daesik Won, MyungJin Jung, JungBeen Im, Suksan Kim, and Youngjoon Choi, "A high performance controller for NAND flash-based solid state disk (NSSD)," IEEE 21st Non-Volatile Semiconductor Memory Workshop, pp. 17-20, Feb. 2006.
[7] Jin Hyuk Yoon, Eyee Hyun Nam, Yoon Jae Seong, Hongseok Kim, Bryan S. Kim, Sang Lyul Min, and Yookun Cho, "Chameleon: a high performance Flash/FRAM hybrid solid state disk architecture," IEEE Computer Architecture Letters, vol. 7, pp. 17-20, Jan. 2008.
[8] L.-P. Chang and T.-W. Kuo, "An adaptive striping architecture for flash memory storage systems of embedded systems," in Proceedings of the 8th Real-Time and Embedded Technology and Applications Symposium, San Jose, California, USA, 2002.
[9] Y. J. Seong, E. H. Nam, J. H. Yoon, H. Kim, J.-Y. Choi, S. Lee, Y. H. Bae, J. Lee, Y. Cho, and S. L. Min, "Hydra: a block-mapped parallel flash memory solid-state disk architecture," IEEE Transactions on Computers, vol. 59, no. 7, pp. 905-921, 2010.
[10] A. M. Caulfield, L. M. Grupp, and S. Swanson, "Gordon: using flash memory to build fast, power-efficient clusters for data-intensive applications," in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, Washington DC, USA, 2009.
[11] Eyee Hyun Nam, Bryan Suk Joon Kim, Hyeonsang Eom, and Sang Lyul Min, "Ozone (O3): an out-of-order flash memory controller architecture," IEEE Transactions on Computers, vol. 60, no. 5, pp. 653-666, May 2011.
[12] Fusion-io, ioDrive Duo. http://www.fusionio.com/products/iodriveduo
[13] I. E. Sutherland, R. F. Sproull, and D. Harris, Logical Effort: Designing Fast CMOS Circuits, Morgan Kaufmann, 1999.