19th International Conference on Telecommunications (ICT 2012)
An Operating System for a Reconfigurable Active SSD Processing Node
Ali Ali, Mohamad Jomaa, Bashar Romanous, Mageda Sharafeddine, Mazen A. R. Saghir*, Haitham Akkary, Hassan Artail, Mariette Awad, Hazem Hajj
Department of Electrical and Computer Engineering, American University of Beirut, Beirut, Lebanon
* Electrical and Computer Engineering Program, Texas A&M University at Qatar, Doha, Qatar
E-mails: {ama142, mfj03, bfr03, mas117, ha95, ha27, ma162, hh63}@aub.edu.lb
* [email protected]
Abstract—We have recently proposed a Distributed Reconfigurable Active SSD (RASSD) computation platform for processing data-intensive applications at the storage node itself, without having to move data over slow networks. In this paper, we present the design of an operating system (OS) for the RASSD node. RASSD OS is a multitasking, real-time operating system that runs on the 32-bit MicroBlaze® soft processor core available for Xilinx® FPGAs. We discuss our OS design features, which include initializing the node, configuring the different components of the RASSD node, monitoring the node’s activities, and processing middleware requests. RASSD OS provides a set of services to the middleware through which it hides the low-level details of the node’s hardware architecture. We describe the functions essential for data-intensive processing within the RASSD node using examples that capture the common states of the node and the various possible requests.

Keywords—reconfigurable computing; cloud computing; active storage; distributed computing; embedded operating system.
I. INTRODUCTION
The exponential growth rate of digital data generated by business, government, scientific, health, entertainment, and other modern human organizations is outpacing the improvements of current computing systems. For applications that need to process such massive amounts of data, the size of the data involved is not the only challenge. Large amounts of data are often collected and stored in dispersed geographical locations, and accessing and processing this data over local or wide area networks is a serious performance bottleneck. Even if the bottleneck of the network delay to access data is somehow overcome, the proliferation of many types of data sets requiring different processing applications, with complex and varying computation characteristics, poses a challenge to conventional general-purpose computer hardware. In order to overcome these challenges at a fundamental level and to meet modern society’s appetite for generating and processing massive amounts of data, we have proposed in [2] a computation platform in which data is processed at the storage site, or what is referred to as the “cloud”. We call our platform the Distributed Reconfigurable Active Solid State Drives (RASSD) architecture. The RASSD acronym reflects two architectural innovations at the hardware level that are intended to overcome the performance challenge of processing distributed massive amounts of data: 1) active solid state storage (ASSD), in which solid state storage is tightly coupled with computation
hardware that processes data without having to communicate with the SSD over a network connection, and 2) reconfigurable field programmable gate arrays (R) to tailor the computation hardware in the ASSD node to the requirements of data processing application kernels. By combining the low latency and high bandwidth of SSD storage with the high-performance spatial computation capability of FPGAs, our platform achieves significant performance gains over conventional distributed computing architectures that use electromechanical disks for storage [2]. In our proposed platform, the RASSD node is not a single isolated computational component that is expected to perform a predefined, fixed acceleration of specific calculations on the data stored in the SSD. On the contrary, the RASSD node hardware is part of an integrated platform that may contain hundreds or thousands of RASSD nodes physically located in dispersed, far-apart geographical locations, and it is designed to support multiple applications with distinct processing and storage needs. The hardware is expected to serve API requests made by client applications, through a middleware layer, in a dynamic environment where the data and computational requirements of the system change unpredictably. Although Active Disks architectures have been proposed and implemented before [1][3][5][9], our platform differs from prior work in its distributed architecture and reconfigurable hardware capability. The unique architectural characteristics of our platform require new operating software in the active storage node, and this paper presents our initial design of such an embedded operating system for the RASSD node. Our novel Reconfigurable Active SSD Operating System (RASSD OS) is a multitasking, real-time operating system that effectively interacts with the middleware layer to provide a high level of abstraction of the reconfigurable hardware, thus completely hiding from the running data processing application the details and complexity of the reconfigurable computation node. The RASSD OS turns an FPGA board connected to one or more SSDs into a smart computational engine that is both powerful and scalable enough to suit future data-intensive applications. Given the limited general-purpose processing capabilities of FPGAs, we detail the challenges in designing an operating system that manages the underlying reconfigurable hardware as a standalone system in a distributed environment with ever-increasing computation demands. In addition to the relevance of our RASSD OS to our
platform design, the paper is also a case study of how to design complex, real-time operating system software for distributed FPGA computation boards. The rest of the paper is organized as follows. Section II discusses related work. Section III gives an overview of the system architecture. Section IV describes the RASSD node hardware. Section V covers our proposed RASSD OS design. We conclude the paper in Section VI.

II. RELATED WORK
The approach of moving data processing functions closer to the storage device has its origins in the database machines of the 1970s and 1980s [4][8][10], and it was studied briefly in the late 1990s and early 2000s through the introduction of active disk drives. These devices consisted of hard disk drives tightly coupled with local disk processors [1][3][5][9]. In these systems, the data-intensive computation component, which we call a drivelet, is offloaded from a host processor to the specialized disk processors, while the higher-level tasks of coordinating, scheduling, and collecting results from the distributed disks are handled by a host processor. Active storage architectures require operating system support both at the host or network level and at the disk. Prior work reported in [1] extended the DiskOS to support memory management, stream communication with the host computer, and drivelet scheduling on the active disk. In contrast to this prior work, our platform is multitasking, distributed, and reconfigurable. The complexity of the services required in our RASSD node has made it necessary to develop a specialized real-time, multitasking operating system built on top of the Xilkernel® from Xilinx [7], which is described in this paper. A real-time operating system, e.g., LynxOS [12], provides consistency in the time it takes to respond to and complete a task. This is achieved by flexible scheduling mechanisms with adjustable task priorities that can guarantee the required minimum response and execution times. Our RASSD OS design relies on the multithreading support and the round-robin thread scheduling policy of Xilkernel®. It enables memory management, communication with the middleware server, and drivelet scheduling, as performed in prior work [1]. In addition, it enables dynamic partial reconfiguration of the FPGA fabric, as well as other processing and services performed in the RASSD node to support distributed processing of data stored in multiple RASSD nodes in our platform.

III. DISTRIBUTED RASSD ARCHITECTURE
A. General System Architecture
Fig. 1 shows our Distributed RASSD system architecture. The system consists of Application/Middleware Clients, Middleware Servers, and RASSD nodes. All components are connected via WAN and LAN networks. The Middleware Servers (MWS) are connected to the RASSD nodes via a LAN. Each RASSD node consists of one FPGA board connected to one or more SSD devices over a PCIe interconnect. PCs, laptop computers, and even handheld computers can be used as clients to run different data processing queries.
FIGURE 1: THE GENERAL SYSTEM ARCHITECTURE
An Application/Middleware client communicates with all MWSs through a Client Local Middleware (CLM) interface. When a CLM submits a query to an MWS, the latter resolves the location of the required data either locally or remotely by communicating with other MWSs. The CLM also submits to the local MWS a kernel tag along with the query. Kernels in our system refer to the processing functions that are delegated to the RASSD node. In this communication process, the MWS sends the appropriate kernel subparts, namely “drivelets” and “hardware accelerator configurations,” to the RASSD node. Drivelets are processing functions designed to run on the MicroBlaze soft processor. Hardware accelerators, on the other hand, exploit the reconfigurable FPGA logic fabric to customize computations and achieve significant speedups over software implementations. Using application profiling, we determine the data-intensive portions of our applications and extract them to be processed on the hardware. These are our kernels, which can be drivelets written in the C language, configuration bitstream files for hardware accelerators, or both. In order to integrate the desired capabilities into our platform, we adapt Hadoop [11] for our RASSD distributed system. Hadoop is a large-scale distributed processing infrastructure that has been deployed in many commercial products. It relies on the Map-Reduce software framework [11] to develop applications that run over the highly scalable, distributed Hadoop architecture and file system. In our platform, applications are written using Map-Reduce
Java processes. Matched kernels in the Map-Reduce applications are identified by the CLM and sent to the respective middleware servers (MWSs). The middleware servers send the processing task to the nodes holding the desired data. After a RASSD node processes the data, it sends the results back to its local MWS, where they are aggregated and sent back to the CLM to be delivered to the end user.

B. Application Architecture
Applications are written in Java using Hadoop’s MapReduce [11]. A service, or job, is divided into Map-tasks. Every Map-task in an application corresponds to a kernel in our CLM library. Typically, a performance-demanding subtask is derived from a kernel and is accelerated on the FPGA’s reconfigurable fabric. The kernel accelerator is implemented on the RASSD node when the library is initially defined, and the accelerator configuration bitstream is then stored in the MWS libraries. MWS libraries therefore maintain task drivelets and their corresponding bitstreams, as applicable. A drivelet’s code is defined as a Data Function Primitive at the RASSD node level. During runtime, a task is mapped to the respective MWSs, which send the kernel to the RASSD nodes where the data resides, allowing direct processing of the data on the RASSD node. The corresponding hardware accelerator configuration bitstreams are used to reconfigure the FPGA through a process called dynamic partial reconfiguration. Once configured, the FPGA can process data from the SSD in an autonomous fashion without disrupting other tasks in the RASSD node.
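To make the notion of a Data Function Primitive more concrete, the following C sketch shows what a simple drivelet might look like. The entry-point signature, the rassd_result structure, and the assumption that the input buffer has already been staged in DDR2 memory by the PCIe controller are illustrative only; the paper does not define the actual drivelet interface.

/* Illustrative only: the entry-point signature and rassd_result
 * structure are hypothetical, not the platform's actual drivelet API. */
#include <stdint.h>
#include <string.h>

struct rassd_result {
    uint32_t count;      /* number of matching records found */
    uint32_t bytes_read; /* amount of input data consumed */
};

/* A word-count style Data Function Primitive: scan a buffer that the
 * PCIe controller has already copied from the SSD into DDR2 memory
 * and count occurrences of a fixed-length key. */
int drivelet_count_key(const uint8_t *data, uint32_t len,
                       const uint8_t *key, uint32_t key_len,
                       struct rassd_result *out)
{
    uint32_t i, hits = 0;

    if (data == NULL || key == NULL || key_len == 0 || len < key_len)
        return -1;

    for (i = 0; i + key_len <= len; i++) {
        if (memcmp(&data[i], key, key_len) == 0)
            hits++;
    }

    out->count = hits;
    out->bytes_read = len;
    return 0;  /* main propagates the result to the EMAC-handler */
}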
C. Middleware Architecture
The middleware architecture consists of three components:
• CLM: Acts as a proxy through which the application interfaces with the distributed RASSD system. It locates the data on behalf of the application and processes the query. The CLM performs this function using Hadoop’s Job Tracker and NameNode infrastructure. The CLM hides the implementation details from end users; therefore, the application does not have to deal with the complexity of distributed data processing and data location.
• MWS: This is the heart of our distributed system; it connects both the front end, with the application client, and the back end, with the RASSD node. It is based on Hadoop, as mentioned earlier, with a great deal of changes made to suit our design goals. For example, Hadoop is adapted to fit our two-level Job-to-Task delegation: the first level from an application Job to MWS Tasks, and the second level from an MWS Job to RASSD node Tasks.
• Middleware libraries and hardware proxies: The MWS maintains various libraries, including a Data Catalogue that holds the description of the data files resident on the RASSD nodes, and a Data Processing library that contains the drivelets and configuration bitstreams. The MWS also maintains a data directory, which manages information about the physical location of the data items, and a site information and statistics directory. The statistics directory stores physical information about the sites, such as physical location, proximity to other sites, average load, data sizes, and other statistics. The MWS includes hardware proxies designed to simplify the interface between the Hadoop-based middleware and the RASSD node, which has to act as a Hadoop DataNode with distributed operating system capabilities. The MWS also includes a data aggregator sub-component, which combines multiple data streams arriving from multiple RASSD nodes in response to the same application request.

IV. RASSD NODE HARDWARE OVERVIEW
Our RASSD node hardware prototype [2] is implemented on a Xilinx® Virtex-5 FPGA chip (XC5VSX50TFFG1136) hosted on an ML506 development board. The board is connected over a PCIe interface to at least one SSD. A 32-bit MicroBlaze® RISC soft processor core is the RASSD node’s main processor, as shown in Fig. 2. The soft processor is connected to an external 256 MB double-data-rate (DDR2) main memory through a Multi-Ported Memory Controller (MPMC) that connects the system together. The MicroBlaze offloads data- and compute-intensive tasks to a co-processor hardware accelerator using a predefined partial bit file (bitstream). The co-processor connects to the MicroBlaze using two unidirectional FIFO-based point-to-point buses known as Fast Simplex Links (FSL).

FIGURE 2: THE NODE’S HARDWARE ARCHITECTURE

The main peripherals on the RASSD node that are controlled by the MicroBlaze include: the Ethernet Media Access Controller (EMAC), which provides the interface between the LAN and the RASSD node; the PCIe controller, used to establish a communication channel between the SSDs and the RASSD node; the Interrupt Controller, used to arbitrate between the different external and internal interrupts in our system and to pass the selected interrupts to the MicroBlaze processor; and finally the Hardware Internal Configuration Access Port (HWICAP), responsible for accessing the FPGA configuration memory and configuring the FPGA fabric with the logic needed to implement the hardware accelerators.

The middleware is the master of several RASSD nodes connected in a cluster. The RASSD node provides services to its middleware using a communication protocol that defines the types of communicated packets and the actions taken based on the packet type. There are two types of packets exchanged between the middleware and the RASSD node: data processing request packets, and synchronization and control packets. At the core of the RASSD node runs the real-time operating system responsible for communicating with the middleware and for efficiently scheduling, processing, and delivering results.
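The wire format of this protocol is not specified in the paper, so the following C sketch is purely illustrative: the header layout, field names, and type codes are assumptions, used only to show how the main thread could classify an incoming packet by its Type field (Section V).

/* Illustrative only: the packet layout and type codes are assumptions;
 * the paper only states that a Type field distinguishes data processing
 * requests from synchronization and control packets. */
#include <stdint.h>

enum rassd_pkt_type {
    PKT_STATUS_REQUEST = 0x01,  /* heartbeat / alive check              */
    PKT_RAW_DATA       = 0x02,  /* load, save, delete, replace, append  */
    PKT_PROC_DRIVELET  = 0x03,  /* processing request: drivelet only    */
    PKT_PROC_ACCEL     = 0x04,  /* processing request: accelerator only */
    PKT_PROC_BOTH      = 0x05   /* drivelet and accelerator             */
};

struct rassd_pkt_header {
    uint8_t  type;        /* one of rassd_pkt_type                   */
    uint8_t  flags;
    uint16_t payload_len; /* e.g., file name or bit file identifier  */
};

/* Sketch of the classification step performed by the main thread. */
int is_processing_request(const struct rassd_pkt_header *h)
{
    return h->type == PKT_PROC_DRIVELET ||
           h->type == PKT_PROC_ACCEL    ||
           h->type == PKT_PROC_BOTH;
}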
V. RASSD OS DESIGN
The RASSD operating system is a three-layered software platform, as shown in Fig. 3.
1- The first layer provides services for the middleware-hardware communication protocol. This layer hides low-level details using common open-source libraries, such as the lwIP TCP/IP library and the XilFATFS file system, which are both license free from Xilinx.
2- The second layer consists of the Xilkernel® [7] and its libraries as a backbone. Xilkernel® is a small and highly customizable kernel with key features needed in embedded systems, such as multitasking and pre-emptive task scheduling.
3- The third layer consists of the Xilinx drivers and other software modules needed to access hardware-specific functions. This is the lowest hardware abstraction layer.

To implement the RASSD OS, we use the multithreading library of Xilkernel® to provide all OS services concurrently. Since all these threads run on the same MicroBlaze® soft processor, they are scheduled in a round-robin fashion. However, not all of these threads are permanently active. As we explain next, most threads are terminated after completing their tasks, thus leaving the execution resources for the working threads.

A. RASSD Threads
The RASSD OS threads include the following:
1- Main: This is the top-level coordinator of threads. It controls the RASSD system and responds to all requests from the middleware. Main is always running and controls all other service threads, which are dynamically launched as needed (a sketch of this launch mechanism follows Fig. 3). The launched threads are added to the “PROC_READY” queue to wait for a time slice. This queue contains active threads competing to use the CPU for a predetermined time slice. The scheduler maintains fairness when determining when and which thread uses the CPU next. The only thread that is not controlled by main is the EMAC-handler (second in this list), since it is pushed to the “PROC_READY” queue upon receiving an external packet arrival interrupt.
FIGURE 3: THE RASSD OPERATING SYSTEM
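As an illustration of how main dynamically launches one of its service threads (such as the cache controller described below) under Xilkernel’s POSIX thread subset, consider the following sketch. The thread function, argument structure, and helper name are hypothetical; only pthread_create() is assumed from Xilkernel.

/* Sketch of launching a service thread under Xilkernel's POSIX thread
 * subset. The service thread and argument structure are illustrative;
 * the new thread joins the PROC_READY queue and is scheduled round-robin
 * with the other service threads. */
#include "xmk.h"       /* Xilkernel definitions */
#include <pthread.h>
#include <stdlib.h>

struct service_arg {
    void    *packet_addr;   /* packet copied to DDR2 by the EMAC-handler */
    unsigned packet_len;
};

void *cache_controller_thread(void *arg);   /* hypothetical service thread */

/* Called by main when a processing request needs a drivelet lookup. */
int launch_cache_controller(void *packet_addr, unsigned packet_len)
{
    pthread_t tid;
    struct service_arg *arg = malloc(sizeof(*arg));

    if (arg == NULL)
        return -1;

    arg->packet_addr = packet_addr;
    arg->packet_len  = packet_len;

    return pthread_create(&tid, NULL, cache_controller_thread, arg);
}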
2- EMAC-handler: This thread performs three tasks: 1) it keeps the node’s connection with the MWS alive, 2) it transfers received packets from the network adapter buffers to main memory, and 3) it sends results back to the MWS. Results here refer to data from tasks completed by drivelets as well as by the hardware accelerators. The EMAC-handler is a callback thread: upon receiving a packet, the network adapter interrupt initiates the EMAC-handler, and the thread is then pushed onto the “PROC_READY” queue. For communication with the MWS, we chose the lightweight IP (lwIP) TCP/IP stack provided by Xilinx [6]. lwIP in socket mode spawns a separate sub-thread to move data from the network adapter buffers to the library stack. The EMAC-handler thread moves the received packet from the EMAC buffer to memory and sends the address of the packet in memory to main. Once in memory, the packet can be accessed by all other threads.
3- Drivelet Loader: This thread is responsible for loading and launching the drivelets on the MicroBlaze. It also ensures that no memory violations happen, such as code violations, data access violations, or I/O violations.
4- Dynamic Partial Reconfiguration (DPR) Manager: This thread initializes the Hardware Internal Configuration Access Port (HWICAP) and then feeds it the bit file stored in memory by the EMAC-handler, thus configuring the FPGA fabric with the desired hardware accelerator. After the configuration is done, the DPR manager informs main and terminates.
5- Cache controller: This thread maintains storage of the drivelets and accelerator bitstreams that have been loaded in the node. If a drivelet is already in the node, it can reside in the first-level cache memory or in the second-level cache memory, which is referred to by Xilinx as the SYStem Advanced Configuration Environment (SYSACE). Main is informed whether the cache access is a hit or a miss.
6- PCIe controller: This thread is responsible for transferring data needed by the drivelet or the hardware accelerator from the storage devices (SSDs) to main memory and for updating data files on the SSDs.
7- Ftp server: This thread is launched by main to handle file transfers between the middleware and the RASSD node. It handles a file transfer by opening an ftp connection.
The following examples describe some of the RASSD OS actions that occur in our platform.

B. RASSD OS in action
Whenever the middleware layer sends a packet to the RASSD node, the network adapter of this node launches the EMAC-handler using an interrupt. The EMAC-handler is detached from the main thread because this is a requirement of the lwIP socket mode. The EMAC-handler moves the received packet from the network adapter buffer to the DDR2 main memory so it can be accessed by other threads in the system. The EMAC-handler informs main, using inter-process communication, about the new packet received and passes to it the address of the packet. The main thread parses the packet received from the EMAC-handler and classifies it as a request packet (processing, raw data, etc.) or a control and synchronization packet.

C. Handling of non-processing requests
The main thread checks the Type field to determine the packet type. Since our middleware design expects heartbeats from the nodes indicating that they are alive and functioning properly, the RASSD OS needs to process status requests. When main receives a status request packet and the node is alive, it ignores the rest of the packet, checks its record of malfunctions in the node (reported by the corresponding drivers), and sends the node’s status to the middleware. Raw data requests are more involved. After receiving a raw data request, main notifies the middleware, through the EMAC-handler, to start the actual file transfer when the ftp server thread becomes available. The port number of this server is passed with the notification. Furthermore, in the case of a load request, the file name is included in the payload of the request packet and is extracted by main, which passes it to the PCIe controller. The latter loads the file into main memory and returns its address to main. When data is no longer needed in a node, the MWS sends a delete packet. Main launches the PCIe controller, which deletes the data file on the SSD. The MWS updates its own directory. By sending a save request, the middleware asks for an ftp connection with the node in order to send a file to it. Main then
launches the ftp server and lets it handle the file transfer. After the file is received and saved in main memory, the ftp server informs main, which launches the PCIe controller to save the file on the SSD(s). Eventually, main kills both threads. Two other examples of file requests from the MWS are replace and append. In the case of a replace request, the existing file is entirely deleted and replaced with the new one. This is equivalent to a delete request followed by a save request; the same threads (the ftp server and PCIe controller) are involved in this operation. Append is used with text files. It opens the file locally and appends the new data to it. For this purpose, main launches the drivelet loader. The drivelet opens the file and appends the new data at the end, and the new file is saved on the SSD by the PCIe controller.

D. Handling of processing requests
We now examine the handling of the three types of packets requiring drivelets and/or accelerators:

1. Data processing requests requiring a drivelet
The threads involved in the response to this type of request are the EMAC-handler, main, the cache controller, the drivelet loader, and the PCIe controller. Fig. 4 shows the execution flow chart for data processing requests requiring a drivelet.

FIGURE 4: REQUESTS REQUIRING A DRIVELET

Main recognizes a drivelet-only processing request by checking the request type field in the packet’s header. As a result, it launches a distinct thread, the cache controller, responsible for searching for the executable drivelet file (.elf) in memory. Only frequently used drivelets are cached, which keeps network traffic low. Our software-implemented cache has two levels: the first-level cache memory is the low-latency, high-bandwidth SRAM tightly coupled to the FPGA chip, and the second-level cache memory is the SYSACE, which has a relatively high storage capacity of up to 8 GB. If the cache controller finds the drivelet’s .elf file, it copies it to memory, informs the main thread of the hit, and passes it the memory address of the file. Otherwise, it simply informs main of the miss. In the case of a hit, main kills the cache controller thread. This decision reflects our design goal of reducing the number of service threads running on the MicroBlaze and dedicating as many CPU resources as possible to drivelet execution, to improve the throughput of the node. Immediately afterwards, main launches two new threads. The first is the drivelet loader, in charge of loading and launching the drivelets on the MicroBlaze processor while also ensuring no memory violations; one of the arguments passed by main to this newly launched thread is the address of the .elf file found by the cache controller. The second is the PCIe controller, whose role in this case is to access the data required by the drivelet and copy it from the SSD to the DDR2 memory. The drivelet loader then loads and launches the drivelet as a separate working thread and returns its program counter (PC) to main. After the working thread is done, it reports its results to main and terminates. Main propagates the results to the EMAC-handler to be sent to the middleware.
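A minimal sketch of the two-level lookup just described is given below. The helper functions and the notification mechanism are hypothetical; only the division into an SRAM first level and a SYSACE second level follows the text.

/* Sketch of the cache controller's two-level lookup for a drivelet .elf
 * file. l1_lookup(), sysace_lookup_and_copy(), and notify_main() are
 * hypothetical helpers: the first level is the SRAM tightly coupled to
 * the FPGA, the second level is the SYSACE storage. */
#include <stdint.h>

#define CACHE_HIT   1
#define CACHE_MISS  0

void *l1_lookup(const char *elf_name);                 /* SRAM level   */
void *sysace_lookup_and_copy(const char *elf_name);    /* SYSACE level */
void  notify_main(int status, void *elf_addr_in_ddr2); /* IPC to main  */

/* Executed by the cache controller thread: report a hit with the DDR2
 * address of the .elf file, or a miss so that main can start the ftp
 * server and request the drivelet from the middleware. */
void cache_controller_lookup(const char *elf_name)
{
    void *addr = l1_lookup(elf_name);

    if (addr == NULL)
        addr = sysace_lookup_and_copy(elf_name); /* copies to DDR2 on hit */

    notify_main(addr != NULL ? CACHE_HIT : CACHE_MISS, addr);
}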
In the case of a cache miss, an ftp server thread is launched to establish a connection with the middleware and to request that the drivelet be sent to the node. Once the drivelet is received, the same steps as above are followed.

2. Data processing requests requiring an accelerator
On receiving such a request, recognized by the request type field in the packet’s header, main extracts from the payload of the packet the name of a bit file instead of a drivelet. The cache controller and PCIe controller threads are launched in the same way described above. Whether the hardware accelerator file (the bit file) was delivered from the middleware (in the case of a miss) or found locally by the cache controller, the next step for main is to launch a dedicated thread, the DPR manager, which initializes the HWICAP to access the configuration memory of the FPGA chip and implement the hardware accelerator on its fabric. After the DPR manager streams the bit file stored in memory to the ICAP buffers, it reports back to main, and main terminates it. Having direct streaming access to main memory through the MPMC on one side and a connection to the MicroBlaze over two FSL links on the other, the hardware accelerator writes its results to the slave FSL link. These results are collected by main and sent back to the middleware through the EMAC-handler.

3. Data processing requests requiring a drivelet and an accelerator
All the operations described in the above two cases are needed for the execution of such a task. In addition, the EMAC-handler keeps the node’s connection with the middleware alive, transfers the received packets from the EMAC hardware buffers to main memory, and sends the results of any finished drivelets and/or hardware accelerators to the middleware. While main coordinates all the other threads, manages the system, and remains ready to respond to other requests and/or commands from the middleware, the PCIe controller transfers any data needed dynamically by the drivelet or the hardware accelerator to main memory.
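The accelerator path can be summarized with the following sketch. The dpr_stream_bitfile() wrapper is hypothetical (in practice it would sit on top of the Xilinx HWICAP driver), whereas getfsl() is the blocking FSL read macro provided with the MicroBlaze toolchain in fsl.h; the fixed FSL port number and result count are assumptions.

/* Sketch of the accelerator path: the DPR manager streams a partial bit
 * file to the HWICAP, and main later drains results from the slave FSL. */
#include <stdint.h>
#include "fsl.h"

/* Hypothetical wrapper: feed the partial bitstream, already placed in
 * DDR2 memory by the EMAC-handler or cache controller, to the HWICAP. */
int dpr_stream_bitfile(const uint32_t *bitstream, uint32_t num_words);

/* Collect 'count' result words written by the hardware accelerator to
 * the slave FSL link, so main can forward them to the EMAC-handler. */
void collect_accel_results(uint32_t *results, uint32_t count)
{
    uint32_t i, word;

    for (i = 0; i < count; i++) {
        getfsl(word, 0);      /* blocking read from FSL port 0 */
        results[i] = word;
    }
}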
VI. CONCLUSION
We have presented the design of the operating system for a Distributed Reconfigurable Active SSD computation node (RASSD OS). The RASSD OS is a multitasking, real-time operating system that runs on the 32-bit MicroBlaze® soft processor core available for Xilinx® FPGAs. We have discussed the OS design features, which include initializing the node, configuring the different components of the system, monitoring the node’s activities, and processing middleware requests. RASSD OS provides a set of services to the middleware through which it hides the low-level details of the node’s complex reconfigurable hardware architecture.

ACKNOWLEDGMENT
This paper was made possible by NPRP grant # 09-10502-405 from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors.

REFERENCES
[1] A. Acharya et al., "Active Disks: Programming Model, Algorithms and Evaluation," ACM SIGOPS Operating Systems Rev., vol. 32, no. 5, 1998.
[2] N. Abbani et al., "A Distributed Reconfigurable Active SSD Platform for Data Intensive Applications," Proc. IEEE 13th International Conference on High Performance Computing and Communications (HPCC), Sep. 2011.
[3] G. Gibson et al., "A Cost-Effective, High-Bandwidth Storage Architecture," Proc. Conf. Architectural Support for Programming Languages and Operating Systems, Oct. 1998, pp. 92-103.
[4] D. Hsiao, "DataBase Machines Are Coming, DataBase Machines Are Coming!" IEEE Computer, March 1979.
[5] K. Keeton, D. Patterson, and J. Hellerstein, "A Case for Intelligent Disks (IDISKs)," SIGMOD Record, Sept. 1998, pp. 42-52.
[6] The lwIP TCP/IP Stack, http://savannah.nongnu.org/projects/lwip/
[7] OS and Libraries Document Collection: Xilkernel (v5.00.a), UG708, July 6, 2011, http://www.xilinx.com/support/documentation/sw_manuals/xilinx13_2/oslib_rm.pdf
[8] E. A. Ozkarahan et al., "RAP - An Associative Processor for Data Base Management," Proceedings of AFIPS NCC 44, 1975.
[9] E. Riedel, "Active Disks—Remote Execution for Network-Attached Storage," doctoral dissertation, Technical Report CMU-CS-99-177, Carnegie Mellon Univ., Pittsburgh, 1999.
[10] S. Su et al., "The Architectural Features and Implementation Techniques of the Multicell CASSM," IEEE Trans. on Computers, vol. 28, no. 6, June 1979.
[11] Apache Hadoop wiki, http://wiki.apache.org/hadoop/
[12] LynxOS RTOS, http://www.lynuxworks.com/rto