This article has been accepted and published on J-STAGE in advance of copyediting. Content is final as presented.

IEICE Electronics Express
An FPGA enhanced extensible and parallel query storage system for emerging NVRAM

Gezi Li1,2a), Xiaogang Chen1b), Bomy Chen3, Shunfen Li1, Mi Zhou1,2, Wenbing Han1,2, Xiaoyun Li1,2, and Zhitang Song1

1 State Key Laboratory of Functional Materials for Informatics, Shanghai Institute of Micro-system and Information Technology, Chinese Academy of Sciences, Shanghai 200050, China
2 University of Chinese Academy of Sciences, Beijing 100080, China
3 Shanghai Xinchu Integrated Circuit Incorporation, China
a) [email protected]
b) [email protected]
Abstract: Modern data appliances face severe bandwidth bottlenecks when transferring data between storage and host. In this paper, we present a storage system that improves performance by more than three times by using an FPGA to implement the off-load engine. Regardless of the data size, host CPU utilization differs vastly between the two systems: the traditional approach uses 100%, while the presented system uses only 1%. Compared to the traditional approach, memory allocation is also greatly reduced. The presented system targets emerging NVRAMs, which offer many benefits over traditional memory.

Keywords: Storage, NVRAM, FPGA, Parallel, Query
Classification: Storage technology

References
[1] E. Doller, A. Akel, J. Wang, K. Curewitz and S. Eilert: 2014 Symposium on VLSI Circuits Digest of Technical Papers (2014) 1. DOI: 10.1109/VLSIC.2014.6858357
[2] J. Do, Y. S. Kee, J. M. Patel, C. Park, K. Park and D. J. DeWitt: Proc. SIGMOD (2013) 1221.
[3] L. Woods, Z. Istvan and G. Alonso: Proc. 40th International Conference on Very Large Data Bases 7 (2014) 11.
[4] P. Francisco: The Netezza Data Appliance Architecture: A Platform for High Performance Data Warehousing and Analytics, IBM Redbook (2011) 1.

©IEICE 2016 DOI: 10.1587/elex.13.20151109
Received December 23, 2015; Accepted January 12, 2016; Publicized January 29, 2016
[5] Oracle: A Technical Overview of the Oracle Exadata Database Machine and Exadata Storage Server, White Paper, Oracle Corp, (2012) 1.
[6] T. Takenaka, M. Takagi and H. Inoue: 2012 22nd International Conference on Field Programmable Logic and Applications (FPL) (2012) 237. DOI: 10.1109/FPL.2012.6339187
[7] R. F. Freitas and W. W. Wilcke: IBM Journal of Research and Development 52 (2008) 439. DOI: 10.1147/rd.524.0439
[8] G. W. Burr, B. N. Kurdi, J. C. Scott, C. H. Lam, K. Gopalakrishnan and R. S. Shenoy: IBM Journal of Research and Development 52 (2008) 449. DOI: 10.1147/rd.524.0449
[9] Z. W. Li, Y. P. Liu and H. Z. Yang: 2014 IEEE International Conference on Electron Devices and Solid-State Circuits (EDSSC) (2014) 1. DOI: 10.1109/EDSSC.2014.7061233
[10] S. Onkaraiah, P. E. Gaillardon, J. M. Portal, M. Bocquet and C. Muller: 2011 IEEE/ACM International Symposium on Nanoscale Architectures (2011) 65. DOI: 10.1109/NANOARCH.2011.5941485
1 Introduction
Nowadays, internet servers face severe bandwidth bottlenecks and high power consumption when transferring vast amounts of data between network-attached storage and the host. To address these issues, intelligent storage engines have been proposed that off-load query operations. For example, E. Doller et al. [1] and J. Do et al. [2] have pushed query processing into smart SSDs, which improves system performance but is limited by the clock frequency of the SSD controller. Among the various solutions, using an FPGA as a hardware accelerator can be seen as the most promising approach, as in L. Woods's Ibex [3], IBM's Netezza [4] and Oracle's Exadata [5]. However, these SQL-based hardware-accelerated engines fail to support user-defined functions, because their hardware complexity easily incurs severe throughput degradation [6]. Since IBM introduced the Storage Class Memory (SCM) concept in 2008 [7, 8], emerging Non-Volatile Random Access Memories (NVRAMs) [9, 10] have become new candidates for working memory and high-performance storage. In this paper, we present an FPGA-based storage system that executes in-storage data processing (ISP) for NVRAMs. NVRAM technologies offer higher density, faster read speed, lower leakage power, byte-addressability and non-volatility. The proposed system can off-load tedious data manipulation tasks to the FPGA-based storage subsystem. With a general-purpose interface, only useful data is transferred to the host, and data can be processed simultaneously in parallel without passing through a block buffer. Owing to ISP and NVRAM technologies, the storage system achieves higher performance and releases the host from heavy data processing tasks.
2 Proposed system architecture
Fig.1 (a) shows the traditional modern-day architecture. The factor limiting performance is the transport of large amounts of data from storage to the host's DRAM. Recently, several researchers have explored the potential benefits of SSD-based ISP, as shown in Fig.1 (b). But the problem remains severe, because the data still needs to be transferred from the flash arrays to where it can be processed, i.e., DRAM (red thread in Fig.1). Also, the commodity microprocessors of traditional SSDs will
degrade system performance due to stagnant clock speeds and high power consumption. Thus, the most promising method for further increasing performance remains parallelization. In this paper, an FPGA enhanced extensible and parallel query storage system has been implemented, as shown in Fig.1 (c). The FPGA-based co-processor is inserted into the data path between storage and host CPU. The system targets next-generation NVRAMs such as phase change memory (PCM), which approaches DRAM-like performance with the additional benefits of lower power consumption and higher density as process technology scales. Our main contributions are summarized as follows. 1) Real prototype systems. A fully functional FPGA-based NVRAM storage emulation platform has been developed, which helps to evaluate the performance of off-load operation. 2) Extensible and parallel ISP structure. As mentioned above, traditional systems suffer from transmission and CPU bottlenecks. Owing to its highly parallel nature, the proposed system uses the FPGA rather than a conventional CPU to implement the off-load engine. The FPGA is inserted into the data path so as to process the data close to where it resides, without additional transfer costs. FPGAs can meet increasing throughput demands due to their bit-level flexibility. 3) Non-volatility emulation board. It can evaluate and analyze the impact of NVRAMs on the performance of storage operations.
Fig.1. Architecture contrast figure.
3 System Setup
Fig.2 (a) shows the storage server system, a ZedBoard based on the Xilinx Zynq™-7000 All Programmable SoC. The Zynq consists of a dual Cortex-A9 Processing System (PS) and 85,000 Series-7 Programmable Logic (PL) cells. We have implemented an embedded Linux system (version 3.3.0) on the PS, with an SD card holding the root file system.

Table I. Hardware configuration of the ZedBoard
CPU: Dual ARM® Cortex™-A9 MPCore™, 667 MHz
L1 Cache: 32 KB instruction, 32 KB data per processor
L2 Cache: 512 KB
Memory: 512 MB DDR3, 533 MHz
Interfaces: USB-UART, SD Card, FMC, Ethernet
Fig.2 (b) shows the storage subsystem, implemented with a Lattice FPGA (LFE3-35EA-8FN484C) and Micron DDR3 SDRAM (MT41J128M16). We emulate NVRAM using DRAM with timing parameters customized to be lower than the
performance of Micron's commercial PCM (MT66R7072A) chip. The DDR3 memory chips are connected to a DRAM memory controller and can be accessed by the PS core via the AXI bus. The DRAM of the storage subsystem is clocked at 350 MHz, while the PCM can run at 400 MHz. Both the PCM and the DRAM use a 16-bit data interface. For generalized sequential access, the read latency is incurred only once in a seamless burst read operation, so the difference in read latency hardly influences the final results.
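The amortization argument above can be sketched with a back-of-envelope model: when the initial access latency is paid once and the burst then streams one word per cycle, the latency's share of total transfer time vanishes for long sequential scans. The latency cycle counts below are illustrative assumptions, not datasheet values.

```python
# Illustrative model of why initial read latency is negligible in a long
# seamless burst read. Latency figures are hypothetical, not from datasheets.

def burst_read_cycles(initial_latency, words):
    """Cycles to read `words` data words when latency is paid once up front."""
    return initial_latency + words  # one word per cycle after the first access

def latency_overhead(initial_latency, words):
    """Fraction of total transfer time spent on the initial latency."""
    return initial_latency / burst_read_cycles(initial_latency, words)

# Assume DRAM needs 10 cycles of initial latency and PCM needs 60 (hypothetical).
for name, lat in [("DRAM", 10), ("PCM", 60)]:
    print(f"{name}: latency overhead = {latency_overhead(lat, 1_000_000):.4%}")
```

Even with a sixfold worse initial latency for PCM, the overhead on a million-word scan stays far below 0.01%, which is why the final results are largely insensitive to the read-latency difference.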
Fig.2. Proposed system architecture.
Also, a generic command mechanism has been proposed to keep the system working efficiently and safely. Fig.3 shows the execution flow of the storage system. The Linux-based OS running on the host CPU controls communication between the storage server system and the storage subsystem through a set of data paths. The engine on the FPGA, called Hunt-IP, allows the host CPU to operate directly on the data, e.g., write/read operations (as indicated by path1 and path4).
Fig.3. Data flow graph of our storage subsystem.
Moreover, Hunt-IP can off-load entire queries to the FPGA (as indicated by path3). The communication protocol between the server system and the subsystem includes a search scope and the keywords to be searched. When a search starts, a finite-state machine (FSM) called SM3 efficiently loads data from the memory controller into the local memory. Another FSM, called SM2, simultaneously compares the read-out data against the keywords. An adaptive FIFO has been implemented to match the speed of the read operation to that of the comparison. On a match, the FPGA increments the matched-word counter and records the matched word's address. As soon as the status flag is set, the host CPU retrieves the results. Throughout ISP, the host CPU is released from heavy data processing tasks.
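The search flow above can be modeled in software: one stage (SM3) streams words from memory into a FIFO, another (SM2) drains the FIFO, compares each word against the keyword, and accumulates the match count and addresses that the host later retrieves. This is a behavioral sketch for clarity, not the actual Hunt-IP RTL; the function name and FIFO depth are illustrative.

```python
from collections import deque

def hunt_ip_search(memory, keyword, start, length, fifo_depth=16):
    """Behavioral model of the off-loaded keyword search (illustrative)."""
    fifo = deque(maxlen=fifo_depth)      # adaptive FIFO between read and compare
    match_count, match_addrs = 0, []
    addr, end = start, start + length    # the search scope sent by the host
    while addr < end or fifo:
        if addr < end and len(fifo) < fifo_depth:   # SM3: load from memory
            fifo.append((addr, memory[addr]))
            addr += 1
        if fifo:                                    # SM2: compare against keyword
            a, word = fifo.popleft()
            if word == keyword:
                match_count += 1        # increment matched-word counter
                match_addrs.append(a)   # record matched word's address
    return match_count, match_addrs     # retrieved by host when status flag is set

mem = ["cat", "dog", "cat", "bird"]
print(hunt_ip_search(mem, "cat", 0, len(mem)))  # → (2, [0, 2])
```

In hardware the load and compare stages run concurrently; the loop body here interleaves them to mimic that pipeline, with the FIFO absorbing any rate mismatch between the two stages.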
4 Experimental Results
To emulate a real use case, we use a dataset containing news text from the Datatang website (http://www.datatang.com/data/47264) and store it in the DRAM of the custom-built board and the SD card of the ZedBoard. All experiments are performed on the ZedBoard and the custom-built board, as seen in Fig.2. The ZedBoard and the custom-built board are coupled by the FPGA Mezzanine Card (FMC) connector. We have implemented two types of query operation: one is the traditional modern-day way (as shown in Fig.1 (a)) and the other is FPGA-based ISP (as shown in Fig.1 (c)). For the in-host data processing (IHP) of the traditional way, the host CPU first copies the dataset into the DRAM of the ZedBoard and then begins the search; the transfer time is not included in the execution time. Instead of retrieving each individual datum for processing on the host CPU, the proposed system instructs the FPGA to locally search its portion of the dataset and return only the results.
Fig.4. Execution time and speedup comparison figure
Fig.4 shows the evaluation of speed and execution time. As can be seen, FPGA-based ISP achieves a maximum speed of almost 25 MB/s, which is more than a threefold speedup over the traditional way. As the memory size increases, the execution time of FPGA-based ISP scales better than that of CPU-based IHP.
Fig.5. The system resources utilization and performance comparison figure
Regardless of the memory size, host CPU utilization differs vastly between the two types of query operation: the traditional CPU approach uses 100%, while in-storage processing by FPGA uses less than 1%. This is because the host CPU is mostly idle, waiting for data from the storage subsystem or handling other computation jobs. As shown in Fig.5 (b), as the memory size increases, the memory allocation of
query processing increases linearly under the traditional CPU approach, whereas it stays at about 0.1% under the ISP mechanism. If the query application's memory utilization is too high, it will affect the operation of other programs and degrade overall system performance. Adopting the ISP mechanism helps avoid high host-CPU memory usage and also significantly benefits whole-system performance on big-data problems. Fig.5 (c) and (d) show the evaluation of speed and execution time for data sizes from 10 MB to 100 MB. As illustrated in Fig.5 (d), for the 100 MB query, the execution time is about 15.6 seconds with the traditional way versus about 4.1 seconds with the ISP mechanism. This demonstrates the high efficiency of our highly parallel storage system.
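The reported figures are internally consistent, as a quick arithmetic check shows: the 100 MB execution times give both the speedup factor and a throughput matching the ~25 MB/s peak in Fig.4.

```python
# Sanity check of the reported 100 MB query figures (values from the text).
data_mb = 100
t_ihp, t_isp = 15.6, 4.1   # seconds: traditional IHP vs. FPGA-based ISP

speedup = t_ihp / t_isp            # ≈ 3.8x, i.e. "more than three times"
isp_throughput = data_mb / t_isp   # ≈ 24.4 MB/s, consistent with Fig.4

print(f"speedup = {speedup:.1f}x")
print(f"ISP throughput = {isp_throughput:.1f} MB/s")
```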
5 Conclusion
In this paper, we have presented a hardware-accelerated storage system that efficiently off-loads entire queries to the storage subsystem. Compared with the in-host data processing of the traditional way, a roughly fourfold improvement is achieved in both query speed and execution time. Host CPU utilization also drops from 100% to less than 1%. The proposed ISP mechanism is well suited to big-data applications, as it avoids high CPU and memory usage and large transfer overheads. More aggressive results may be achieved by optimizing the pipeline and increasing the number of parallel processing threads in the FPGA. Furthermore, structured data processing such as database access will also greatly benefit from byte-addressable NVRAM, and further research will be reported soon.
Acknowledgments
This work was supported by the "Strategic Priority Research Program" of the Chinese Academy of Sciences (XDA09020402), the National Integrated Circuit Research Program of China (2009ZX02023-003), the National Natural Science Foundation of China (61176122, 61261160500, 61376006), and the Science and Technology Council of Shanghai (14DZ2294900, 13ZR1447200, 14ZR1447500).