2017 3rd IEEE International Conference on Computer and Communications
Optimizing Read and Write Performance Based on Deep Understanding of SSD
Xin Liu, Yutong Lu
Jie Yu
National University of Defense Technology, Computer Science and Technology, Changsha, Hunan, China e-mail:
[email protected],
[email protected]
National Supercomputing Center in Tianjin, Tianjin, China e-mail:
[email protected]
Ying Lu University of Nebraska-Lincoln, Computer Science and Engineering, Lincoln, NE, US e-mail:
[email protected]
groups tens of Flash devices together for parallel access, and it uses buffering and prefetching to improve access performance [5]. The parallelism of an SSD can be exploited among multiple Flash devices or within a single Flash device. The parallelism among Flash devices is straightforward: when sequential read/write operations keep all Flash devices working at the same time, the aggregate bandwidth is high. Within a device, Flash uses multi-plane technology, which offers some potential for parallelism among planes. This is less obvious than the parallelism among multiple Flash devices, so it is also important to understand and take full advantage of plane-level parallelism. Read and write requests can be categorized as sequential or random, and the access type has a great impact on SSD performance. In designing and implementing an SSD, the mapping granularity and the mapping relationship between the user's logical page (LP) and the SSD's internal physical page (PP) may have a significant impact on I/O performance. The mapping relationship directly determines whether the data requested by a read/write can be distributed across as many parallelizable storage components, such as planes and Flash devices, as possible. As the page size increases to 4KB~16KB, the block size reaches several MBs or even tens of MBs. Taking HPC applications as an example, the average size of user files is around 2MB to 8MB [6]. Since a block is comparable in size to a typical file, it is reasonable to choose the page as the mapping granularity. The mapping relationship is also an important factor. It can loop over all Flash devices, or it can be a random mapping determined by a hash function. A simple and practical mapping method uses the page as the mapping granularity and applies loop-mapping over the planes in a Flash device.
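The loop-mapping idea can be illustrated with a minimal sketch. It assumes that consecutive logical pages are assigned round-robin, first across the planes of one Flash device and then across devices; this ordering, the function name `loop_map`, and the parameters `num_devices` and `planes_per_device` are illustrative assumptions, not parameters of any real SSD.

```python
from typing import NamedTuple


class PhysicalLocation(NamedTuple):
    device: int  # index of the Flash device (hypothetical numbering)
    plane: int   # plane index within that device
    page: int    # page slot within that plane


def loop_map(lp: int, num_devices: int, planes_per_device: int) -> PhysicalLocation:
    """Map a logical page number to a plane by round-robin (loop) mapping.

    Assumption: logical pages are spread over planes first, then over
    devices. Real SSDs may use a different order or a hash function.
    """
    units = num_devices * planes_per_device  # total parallel units
    unit = lp % units                        # which parallel unit gets this page
    return PhysicalLocation(
        device=unit // planes_per_device,
        plane=unit % planes_per_device,
        page=lp // units,                    # how many full loops came before
    )


if __name__ == "__main__":
    # Eight sequential logical pages on 4 devices with 2 planes each
    # land on eight distinct (device, plane) pairs.
    for lp in range(8):
        print(lp, loop_map(lp, num_devices=4, planes_per_device=2))
```

With this kind of mapping, a sequential request of at least `num_devices * planes_per_device` pages touches every parallel unit once per loop, which is why sequential access can reach the aggregate bandwidth.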
Under the above mapping policy, when the request size and the stride interval fall into particular combinations, the SSD enters a low performance resonant range. The root cause of the low performance resonant range is that the read/write requests are mapped to only a few Flash devices. Because a single Flash device is slow, the bandwidth of such read/write requests is not high. Under this
Abstract—A Flash-based Solid State Drive (SSD) consists of multiple Flash devices and achieves high I/O bandwidth through parallel data access, buffering, and prefetching. Currently, the capacity of a high-performance SSD reaches several TBs, and the nominal read/write bandwidth reaches GB/s. However, since Flash is accessed in units of pages, an SSD may exhibit a low performance resonant range under specific combinations of parameters. A deep understanding of the internal structure of an SSD is important for exploiting its parallelism. Because of competition among SSD manufacturers, vendors are reluctant to publish important parameters of their SSD products, which makes it difficult to take full advantage of the high performance of SSDs. This paper analyzes the typical SSD internal structure, its parallel characteristics, the mapping relationship, and the potential low performance resonant range. It puts forward a technical method to explore the internal structure of an SSD. We analyze the average distribution of file sizes and determine the relationship between the minimum size and the bandwidth of read/write requests. We analyze why the size of read/write requests shrinks during processing, and we propose mechanisms that prevent this shrinking and avoid the low performance resonant range in order to achieve higher read/write performance.
Keywords-SSD; flash; file system; HPC
I. INTRODUCTION
Flash is a fully electronic storage device based on EEPROM. It reads and writes in units of pages, usually around 4KB to 16KB, and it erases in units of blocks. Each block contains multiple pages, and its size is several MBs or even tens of MBs. Typical page read, page write, and block erase times are tens of microseconds, hundreds of microseconds, and several milliseconds, respectively, which shows that the read/write speed of a single Flash device is not high. Taking current high-performance MLC (Multi-Level Cell) Flash as an example, the page read and page write speeds are about 100MB/s and 10MB/s, respectively [1-4]. A Solid State Drive (SSD) is a storage device based on Flash. In order to improve storage capacity and read/write bandwidth, an SSD
978-1-5090-6352-9/17/$31.00 ©2017 IEEE
circumstance, the mapping used for offset writes can improve performance. If the SSD allocates the new page within the same Flash device, the original mapping relationship between LP and PP is maintained. To keep read/write requests out of the low performance resonant range, the solution is to optimize the control at the user level. However, we must first thoroughly understand the internal structure of the SSD in order to optimize its performance at the user level. After nearly a decade of research and development, work on SSDs has shifted from academic research on various techniques to product design and implementation by manufacturers. SSD products are becoming sophisticated, and the techniques they employ have settled. Currently, the critical task is to fully exploit the high performance potential of SSDs, which is also the main purpose of our work. However, because of competition among SSD manufacturers, vendors are reluctant to publish the internal structure of their SSD products, even parameters that are closely related to read/write performance, such as the page size, the block size, the number of channels, and the mapping relationship between LP and PP. For users, an SSD is a black box. Hence, researchers have put forward many black-box methods to detect the internal structure of SSDs [7]-[10], which has consumed considerable manpower. In order to take full advantage of the high performance of SSDs, it is necessary to address the following two questions: (1) how to understand the internal parallel structure of an SSD, its parallel potential, and its possible low performance resonant range; (2) how to analyze the characteristics of data reads/writes in HPC applications, adapt them to the SSD's internal parallel structure, and optimize the utilization of the SSD. Based on HPC applications, this paper analyzes the internal parallel storage structure and the read/write processing of current representative SSDs.
It also analyzes the possible mapping relationships between LP and PP, the allocation policies for offset writes that an SSD may use, and the influence of these methods on read/write performance. In addition, we analyze the distribution of file sizes in HPC environments and examine the relationship between the size of read/write requests and their bandwidth. We summarize how a user's read/write request is processed from its generation until it reaches the SSD, and show how large read/write requests are divided into multiple small sub-requests. Because request sizes shrink and multiple sub-requests are interleaved, originally large sequential access requests may be turned into small random access requests. Accordingly, we propose methods that minimize the shrinking of request sizes and suggest a minimal read/write request size that avoids a dramatic decrease in read/write performance. We also show that certain combinations of parameters, such as read/write request size and stride interval, may cause the low performance resonant range. Our paper aims to understand how to identify the low performance resonant range and to propose methods to avoid it.
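The effect of request size and stride interval can be made concrete with a small sketch. It assumes a simplified loop mapping in which logical page `lp` is stored on Flash device `lp % num_devices`; the function name and all parameter values are hypothetical and chosen only to illustrate how a strided pattern can collapse onto a few devices.

```python
def devices_touched(request_pages: int, stride_pages: int,
                    num_requests: int, num_devices: int) -> set:
    """Return the set of device indices hit by a strided access pattern.

    Assumption: logical page lp lives on device lp % num_devices
    (simple loop mapping); real SSD mappings are not published.
    """
    touched = set()
    for i in range(num_requests):
        start = i * stride_pages                  # first page of the i-th request
        for lp in range(start, start + request_pages):
            touched.add(lp % num_devices)
    return touched


if __name__ == "__main__":
    num_devices = 16
    # Stride equal to the device count: every request hits the same device.
    print(sorted(devices_touched(1, 16, 64, num_devices)))
    # Stride one page longer: the requests rotate over all devices.
    print(sorted(devices_touched(1, 17, 64, num_devices)))
```

In this model, a stride that is a multiple of the number of devices combined with a small request size keeps all requests on the same few devices, which corresponds to the low performance resonant range described above; changing the stride or enlarging the request spreads the load again.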
II. ANALYZING AND UNDERSTANDING THE INTERNAL STRUCTURE AND PARALLELISM OF SSD
Parallel storage is the main technique for achieving high performance storage. A thorough understanding of the SSD's internal structure and parallel access mechanism is important for exploiting the high performance of parallel reads and writes. A Flash-based SSD has multiple levels of parallelism: within a single Flash device, and among multiple Flash devices. We first analyze the features of parallel reads and writes within a Flash device.
A. Analyzing Internal Read/Write Features of Flash
Flash is a fully electronic device based on EEPROM. Its data access operations are read, write, and erase. Because of its access mechanism, Flash reads and writes data in units of pages and erases data in units of blocks. Page and block sizes are around several KBs and several MBs, respectively. A die is an independent storage unit. To increase assembly density, multiple dies are usually encapsulated together in a package, and they are controlled independently of each other. Fig. 1 shows the typical internal structure of a die in a Flash device [2]. A die mainly consists of a controller and a storage array, and it transfers data and addresses through an 8-bit data/address bus. In Fig. 1, one controller controls two storage arrays, which are called planes. Each plane has an independent address path and data path to the controller. Therefore, the two planes can be accessed in parallel within the die.
(Fig. 1 diagram labels: DQ(7:0) data/address bus, I/O controller, control logic, row address, column address, and two NAND Flash array planes.)
Figure 1. Internal structure of a die in a flash device.
Some Flash devices have only one plane, while some have 4 planes. Fig. 2(a) shows the storage components, pages and blocks, within a plane in a die [2].
(Fig. 2 diagram labels: (a) 1 Page = n KB; 1 Block = N Pages; 1 Plane = P Blocks. (b) A package containing Die 0 and Die 1, with control lines Control_0 and Control_1 and a DQ(7:0) data bus.)
Figure 2. (a) The logical structure of a storage array of a die; (b) The composition of dies within one package.
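The size relations labeled in Fig. 2(a) (1 Page = n KB, 1 Block = N Pages, 1 Plane = P Blocks) can be turned into a short capacity calculation. The sketch below is illustrative only: the function name and the example values (8 KB pages, 256 pages per block, 1024 blocks per plane, 2 planes per die, 2 dies per package) are assumptions, not parameters of a specific Flash part.

```python
def flash_geometry(page_kb: int, pages_per_block: int, blocks_per_plane: int,
                   planes_per_die: int, dies_per_package: int) -> dict:
    """Compute the size in bytes of each level of the Flash hierarchy."""
    page = page_kb * 1024                # 1 Page  = n KB
    block = page * pages_per_block       # 1 Block = N Pages
    plane = block * blocks_per_plane     # 1 Plane = P Blocks
    die = plane * planes_per_die         # planes share one die controller
    package = die * dies_per_package     # dies are packaged together
    return {"page": page, "block": block, "plane": plane,
            "die": die, "package": package}


if __name__ == "__main__":
    # Hypothetical example values, chosen to match the ranges in the text
    # (page of a few KB, block of a few MB).
    sizes = flash_geometry(page_kb=8, pages_per_block=256,
                           blocks_per_plane=1024, planes_per_die=2,
                           dies_per_package=2)
    for level, size in sizes.items():
        print(f"{level}: {size / 2**20:.3f} MiB")
```

With these assumed values a block is 2 MiB, which falls in the "several MBs" range given earlier for block sizes.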
Since the width of external data/address bus for each die is only 8 bits, and there are only tens of control lines, the actual number of pins required by a die is only around 20
B. Analyzing Features of Parallel Data Access in SSD
An SSD groups multiple Flash devices together for parallel access to achieve larger storage capacity and higher read/write bandwidth. For an SSD, we need to understand the following key parameters: the page size, the number of planes/dies/channels, the mapping between LP and PP, the mapping of offset writes, etc. We take the SSD P420 from Micron as an example; its main technical parameters are as follows [5]. Fig. 3 shows the internal structure of the SSD P420.
• Sequential read/write (steady state) performance: Sequential read: Up to 3.3 GB/s (128KB IO size); Sequential write: Up to 630 MB/s (128KB IO size).
• Latency (queue depth 1): READ latency: