Virtualizable Hardware/Software Design Infrastructure for Dynamically Partially Reconfigurable Systems CHUN-HSIAN HUANG, National Taitung University PAO-ANN HSIUNG, National Chung Cheng University
In most existing works, reconfigurable hardware modules are still managed as conventional hardware devices. Further, the software reconfiguration overhead incurred by loading the corresponding device drivers into the kernel of an operating system has been overlooked until now. As a result, the gains in system performance and in the utilization of reconfigurable hardware modules remain quite limited. This work proposes a virtualizable hardware/software design infrastructure (VDI) for dynamically partially reconfigurable systems. Besides the gate-level hardware virtualization provided by the partial reconfiguration technology, VDI supports device-level hardware virtualization. In VDI, a reconfigurable hardware module can be virtualized such that it can be accessed efficiently by multiple applications in an interleaved manner. A Hot-Plugin Connector (HPC) replaces the conventional device driver; it not only assists device-level hardware virtualization but can also be reused across different hardware modules. To facilitate hardware/software communication and to enhance system scalability, the proposed VDI is realized as a hierarchical design framework. User-designed reconfigurable hardware modules can be easily integrated into VDI and are then executed as hardware tasks in an operating system for reconfigurable systems (OS4RS). A dynamically partially reconfigurable network security system was designed using VDI; it demonstrated a higher utilization of reconfigurable hardware modules and a reduction of up to 12.83% in the processing time required by the conventional method in a dynamically partially reconfigurable system.
Categories and Subject Descriptors: C.0 [General]: System Architectures; D.4.7 [Operating Systems]: Organization and Design—Hierarchical Design
General Terms: Design, Experimentation
Additional Key Words and Phrases: Dynamically partially reconfigurable systems, hardware virtualization
ACM Reference Format: Huang, C.-H.
and Hsiung, P.-A. 2013. Virtualizable hardware/software design infrastructure for dynamically partially reconfigurable systems. ACM Trans. Reconfig. Technol. Syst. 6, 2, Article 11 (July 2013), 18 pages. DOI: http://dx.doi.org/10.1145/2499625.2499628
Authors' addresses: C.-H. Huang (corresponding author), Department of Computer Science and Information Engineering, National Taitung University, Taitung, Taiwan; P.-A. Hsiung, Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi, Taiwan.
© 2013 ACM 1936-7406/2013/07-ART11 $15.00. DOI: http://dx.doi.org/10.1145/2499625.2499628. ACM Transactions on Reconfigurable Technology and Systems, Vol. 6, No. 2, Article 11, Publication date: July 2013.
1. INTRODUCTION
Taking virtualization one step further, the dynamic partial reconfiguration technology in FPGA devices such as those from Xilinx allows multiple applications to access a fixed set of logic resources in a temporally exclusive way. This is also called the gate-level hardware virtualization technique. Further, using the partial reconfiguration technology provided by Xilinx, one part of the FPGA device can be reconfigured while the other parts remain operational, unaffected by the reconfiguration. Thus, computation-intensive functions are implemented as reconfigurable
hardware modules that can be configured on demand at runtime, which not only supports dynamic adaptation to different environment conditions but also increases hardware resource utilization. A hardware/software embedded system realized with a dynamically partially reconfigurable FPGA device is called a Dynamically Partially Reconfigurable System (DPRS); it allows multiple applications to be accelerated in hardware and reduces the overall system execution time [Xilinx Inc. 2006]. This new dimension of dynamic hardware reconfigurability has made embedded system design more flexible than before, since a design now includes not only the traditional software applications and hardware devices but also on-demand reconfigurable hardware modules.
To efficiently manage such a complex hardware/software runtime environment, an Operating System for Reconfigurable Systems (OS4RS) is introduced in a DPRS. As in a Unix-like OS, a kernel space and a user space are defined in the OS4RS: software applications run in the user space, while the core of the OS4RS and the device drivers execute in the kernel space. When a software application requests a reconfigurable hardware function whose hardware module is not configured in the FPGA, the requested function is created on demand as a hardware task that the software application can access. However, existing DPRS designs still face the following three main problems.
—Limitations in infrastructure support. Existing OS4RS designs usually manage reconfigurable hardware modules as either specific encapsulated files [So and Brodersen 2008] or conventional hardware devices [Donato et al. 2005; Santambrogio et al. 2008]. When a reconfigurable hardware module is accessed by a software application, it is therefore held exclusively by that application.
This means that, although the software application does not access the reconfigurable hardware module all the time, the module still cannot be used by other software applications. This limitation leads to inefficient utilization of reconfigurable hardware resources.
—Software reconfiguration overhead. Dynamic partial reconfiguration incurs an additional reconfiguration time overhead in terms of both hardware logic reconfiguration and system device switching. That is, when new reconfigurable hardware modules are configured in the FPGA, their corresponding device drivers must be loaded into the kernel of the OS4RS so that applications in the user space can access them. This additional time overhead is often overlooked in existing work. Further, due to the requirements for low power and low cost, a low-frequency microprocessor is usually used in an embedded system design, so reducing the software time overhead becomes a key method for improving system performance. However, most existing methods, such as configuration prefetch [Banerjee et al. 2005] and reuse [Hsiung et al. 2007], focus only on the hardware time overhead incurred by reconfiguration, which is much smaller than the software time overhead.
—Low system scalability. Reconfigurable hardware modules are usually implemented individually at design time without the support of a complete infrastructure design. Integrating hardware modules that have different data interfaces into a DPRS at runtime therefore becomes very difficult, which not only reduces system scalability but also increases development effort.
To solve the preceding problems, we propose a Virtualizable hardware/software Design Infrastructure (VDI) for dynamically partially reconfigurable systems. In VDI, a Hot-Plugin Connector (HPC) design and two device-level hardware virtualization techniques, namely logic virtualization and hardware device virtualization, are proposed.
Further, the hierarchical design concept is adopted in the VDI. The contributions of our proposed designs or techniques in VDI are introduced as follows.
—HPC. Instead of a conventional device driver specific to a hardware device, an HPC is implemented as a unified kernel module, so that generality in accessing the hardware modules is not sacrificed. Further, because an HPC is reusable, an application in the user space can access a configured hardware module without spending additional time on loading a module-specific device driver. As a result, the more serious software time overhead incurred by reconfiguration can be addressed, thus enhancing system performance. The HPC can also be used to support device-level hardware virtualization.
—Logic virtualization. By using logic virtualization, a reconfigurable hardware module can be virtualized to support multiple applications in the user space. As a result, multiple applications can be executed under the illusion of full access to the same reconfigurable hardware module through their own HPCs. This can raise the utilization of reconfigurable hardware modules.
—Hardware device virtualization. By using hardware device virtualization, an HPC can be connected to two reconfigurable hardware modules. The processing results of one hardware module can thus be transferred directly to another hardware module via the HPC. This can enhance system performance significantly.
—Hierarchical DPRS design. It standardizes the hardware/software communication interface. Within the hierarchical DPRS design, a user-designed hardware module only needs to be integrated with a partially reconfigurable hardware task template (PR template) [Huang and Hsiung 2011], while its control method is implemented in a hardware control library. As a result, applications in the user space can interact easily with the new hardware module through the HPC by invoking the APIs in the hardware control library, which also enhances system scalability.
This article is organized as follows. Section 2 introduces the related research.
Section 3 introduces the proposed device-level hardware virtualization techniques in VDI. A hierarchical DPRS design is described in Section 4. Section 5 presents our experiments and analyses, and conclusions are given in Section 6.
2. RELATED WORK
Using the partial reconfiguration technology, more applications can be accelerated in hardware at runtime. Compared to the full software design in current embedded systems, the performance of a DPRS can be significantly enhanced due to the capability for dynamic reconfiguration [Lagger et al. 2006]. To investigate the potential of partial reconfiguration, the CHREC group proposed a performance evaluation approach [Hymel et al. 2007; El-Araby et al. 2009] and a performance analysis tool [Koehler et al. 2008]. Further, to manage such an adaptive hardware/software embedded system, several works [Donato et al. 2005; So and Brodersen 2008; Santambrogio et al. 2008; Huang et al. 2010, 2012; Chen et al. 2011] developed the corresponding system infrastructure and proposed efficient hardware/software management approaches.
A unified hardware/software runtime environment for FPGA-based reconfigurable systems called BORPH [So and Brodersen 2008] enabled reconfigurable hardware modules to be encapsulated in the BOF file format so that they could be executed as hardware tasks in an OS4RS. However, a conventional hardware device is usually accessed through its corresponding device driver. The BOF encapsulation method is too specific and very different from this conventional access method, which might lead to a lack of generality in accessing hardware devices. As a result, this unified hardware/software runtime environment was not easily applicable to most embedded OSes. Without sacrificing the generality in accessing the hardware device, the work [Donato et al. 2005; Santambrogio et al.
2008] adopted the concept of modular design in the device driver of a hardware module; that is, each device driver was implemented as a kernel module. As a result, after a reconfigurable hardware module was configured into the FPGA, the corresponding kernel module could be dynamically loaded into the Linux kernel via an IP-core manager. The applications in the user space could thus interact with the reconfigurable hardware module using specific system calls. However, reconfigurable hardware modules in this OS4RS design were managed as conventional hardware devices, and thus the dynamic loading/reloading of device drivers still had an impact on system performance. All these OS4RS designs [So and Brodersen 2008; Donato et al. 2005; Santambrogio et al. 2008] provided support for the interaction mechanism between applications in the user space and reconfigurable hardware modules; however, because reconfigurable hardware modules are managed as either specific files or conventional hardware devices, the infrastructure support for DPRS is still limited.
To solve the problem of limited infrastructure support for DPRS, device virtualization is an applicable method for increasing the utilization of reconfigurable hardware modules and improving system performance. This is because device virtualization enables a hardware module that has been configured in the FPGA to be accessed efficiently and in an interleaved manner by multiple applications running in the user space. Conventional device virtualization methods, such as Xen [Pratt et al. 2005] and KVM [Kivity et al. 2007] with QEMU [2013], are used to virtualize a hardware device to support multiple OSes. However, the device virtualization methods using Xen, KVM, and QEMU require either support from specific processor technologies, such as Intel VT [Intel Inc. 2013] and AMD-V [AMD Inc. 2013], or modification of the kernel of the guest OSes. Further, the conventional virtualization mechanisms [Pratt et al. 2005; Kivity et al.
2007; QEMU 2013] mainly focus on platform emulation for facilitating system development, in which a mediating hypervisor needs to be inserted into the system design to support the virtualization of the underlying machine. Because hardware devices are accessed indirectly via the hypervisor, system performance is lower than that of the bare hardware. In the reconfigurable computing field, the existing hardware virtualization methods [El-Araby et al. 2008; Kirischian et al. 2010; Hofmann et al. 2010; Gohringer et al. 2011; Werner et al. 2012; Garcia and Compton 2008; Sabeghi and Bertels 2009] either adopted the partial reconfiguration technique to support specific applications or proposed a virtualization layer that abstracted the hardware characteristics to increase the utilization of hardware resources. From the viewpoint of system management, the preceding hardware virtualization methods lack system flexibility and scalability compared to the OS4RS designs [So and Brodersen 2008; Donato et al. 2005; Santambrogio et al. 2008]. Further, a complete hardware resource management mechanism, such as those in Xen, KVM, and QEMU, would also need to be introduced into these hardware virtualization methods. Whereas the conventional virtualization mechanisms [Pratt et al. 2005; Kivity et al. 2007; QEMU 2013] target guest OSes, the proposed VDI targets applications. In VDI, logic virtualization and hardware device virtualization are proposed to enhance system performance.
Besides increasing the utilization of reconfigurable hardware modules, reducing the time overhead incurred by reconfiguration is also a key method for improving system performance. To reduce the hardware reconfiguration time overhead, the module graph merging approach [Koh and Diessel 2007], the configuration prefetch approach [Banerjee et al. 2005], and the reuse approach [Hsiung et al.
2007] have been widely used in DPRS designs. Nevertheless, the software reconfiguration time overhead has remained mostly unaddressed. In this work, the HPC is designed in VDI to alleviate the software reconfiguration overhead and to support device-level hardware virtualization.
Fig. 1. Comparison between the conventional and VDI techniques: (a) conventional DPRS design; (b) DPRS design with logic virtualization; (c) DPRS design with HW device virtualization.
3. DEVICE-LEVEL HARDWARE VIRTUALIZATION
In most existing OS4RS designs [Donato et al. 2005; Santambrogio et al. 2008], reconfigurable hardware modules were managed as conventional hardware devices. Take as an example a network security application based on the Transport-Layer Security (TLS) protocol, as shown in Figure 1. Before data are transmitted to a receiver on the network, they need to be encrypted using the RC6 hardware module and then transferred to a CRC hardware module for processing. Note that, to ease the explanation, Figure 1 abstracts the DPRS design into three parts: the user space and the kernel space in an OS4RS, and the configured hardware modules in an FPGA device. As shown in Figure 1(a), a network security application APP1 has opened the device nodes of the RC6 and CRC hardware modules. To access the RC6 hardware module, another network security application APP2 needs to wait until APP1 closes the RC6 device node. In reality, however, the RC6 hardware module is not accessed all the time by APP1, because APP1 can access only one hardware module at any instant. This constrains the utilization of reconfigurable hardware modules and thus degrades system performance, which becomes unacceptable as the number of requests for the same hardware module from different applications increases. Further, because APP1 interacts sequentially with the RC6 and CRC hardware modules, the processing results of the RC6 hardware module need to be transferred from the kernel space to the user space and then sent to the kernel space again for processing by the CRC hardware module. These repeated data transfers between the kernel space and the user space may cause a large time overhead.
To solve the problem of serialized access to a hardware module, the VDI method supports device-level hardware virtualization techniques, including logic virtualization and hardware device virtualization. The details are given in the following sections.
3.1. Logic Virtualization
The logic virtualization technique virtualizes a physical hardware module as multiple logical ones so that it can support more than one software application at a time. This is based on a hardware-to-software virtualization concept. As shown in Figure 1(b), using logic virtualization, the RC6 hardware module can be shared between the network security applications APP1 and APP2 through different pairs of device node and HPC, based on a round-robin policy. As a result, while APP1 is accessing the CRC hardware module, the RC6 hardware module can be accessed by APP2 through a different pair of device node and HPC.
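The round-robin sharing described above can be sketched as a small user-level model. This is only an illustration under stated assumptions: the class `SharedModule` and its methods are hypothetical names, each application name stands in for one (device node, HPC) pair, and the real mechanism lives in the OS4RS kernel rather than in Python.

```python
from collections import deque


class SharedModule:
    """One physical hardware module shared by several (device node, HPC) pairs.

    Hypothetical model: each attached application name represents one pair.
    """

    def __init__(self, name):
        self.name = name
        self.pairs = deque()  # applications in round-robin order

    def attach(self, app):
        """Give `app` its own logical view of the module."""
        self.pairs.append(app)

    def detach(self, app):
        """Remove the logical view of `app` (for example, when it closes its node)."""
        self.pairs.remove(app)

    def next_owner(self):
        """Return the application that owns the next access period."""
        if not self.pairs:
            return None
        owner = self.pairs[0]
        self.pairs.rotate(-1)  # move the current owner to the back of the queue
        return owner


rc6 = SharedModule("RC6")
rc6.attach("APP1")
rc6.attach("APP2")
schedule = [rc6.next_owner() for _ in range(4)]
print(schedule)
```

With two attached applications, successive access periods alternate between them, which is the interleaved access that Figure 1(b) depicts.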
Fig. 2. Illustration examples.
Fig. 3. Execution results using the conventional and VDI methods for the illustration example.
To further illustrate how system execution differs between the conventional method and the logic virtualization technique, we use a DPRS design with three PRRs as our example. As shown in Figure 2, the applications APP1, APP2, and APP3 are served sequentially based on the first-come-first-served scheduling policy. Further, APP2 and APP3 each run two iterations. Figure 3 shows the execution results of the conventional method and the logic virtualization technique. Here, to interact with a reconfigurable hardware module i, a software application needs to perform five operations: reconfiguring the hardware module (Ri), loading its device driver (Li), opening its device node (Oi), interacting with the hardware module (Ti), and closing its device node (Ci). We take as an example the application APP3, which requests the hardware module HW4 that is also accessed by its preceding application APP2. In the conventional method, as shown in Figure 3(a), once the device node of HW4 has been opened by the preceding application APP2, HW4 is blocked, and the current application APP3 cannot access it until APP2 closes the corresponding device node. This shows that a hardware module can be accessed by only one application at a time. The waiting time for HW4 thus delays the execution of APP3. Furthermore, with the conventional method, the software time overhead incurred by reconfiguration remains in a DPRS. For example, as shown in Figure 3(a), after configuring HW4 and HW5 in the FPGA, the conventional method still needs to load the corresponding device drivers before APP2 can access HW4 and HW5. Compared to the conventional method, the logic virtualization technique enables a hardware module to be virtualized to support multiple applications simultaneously, without being blocked by one of them.
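The five operations of the conventional method can be turned into a rough timing model. The sketch below is illustrative only: the durations in `COST` are invented placeholders (the text gives no per-operation figures), and the function names are hypothetical.

```python
# Hypothetical per-operation durations in milliseconds (not taken from the paper).
COST = {"R": 20.0, "L": 8.0, "O": 1.0, "T": 5.0, "C": 1.0}


def conventional_access(iterations, cost=COST):
    """Time for one application to use a newly requested module:
    reconfigure (R), load driver (L), open node (O),
    interact `iterations` times (T), and close node (C)."""
    return (cost["R"] + cost["L"] + cost["O"]
            + iterations * cost["T"] + cost["C"])


def blocked_start(holder_open_time, holder_iterations, cost=COST):
    """Earliest time a second application can open the same device node.

    In the conventional method the node stays blocked until the holder has
    opened it, finished all of its iterations, and closed it again.
    """
    return (holder_open_time + cost["O"]
            + holder_iterations * cost["T"] + cost["C"])


# APP2 starts at t = 0, so it opens the node after R and L have completed.
app2_open = COST["R"] + COST["L"]
print(conventional_access(2), blocked_start(app2_open, 2))
```

Under these assumed costs, a second application that wants the same module cannot start before the first one has completed its whole access sequence, which is the delay that Figure 3(a) attributes to the conventional method.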
The round-robin policy is used in logic virtualization so that each application can fairly access a shared hardware module for a constant time period called an access period. Here, it is assumed that a hardware module does not need to retain state between iterations. Figure 3(b) shows the best case
of using logic virtualization, in which the access period of HW4 is set to one execution iteration of HW4. APP3 only needs to wait for one access period of HW4 for APP2, after which HW4 can be accessed by APP2 and APP3 in an interleaved manner. However, because shared hardware modules require data synchronization and mutually exclusive access among multiple applications, in the worst case APP3 needs to wait through every access period of HW4 for APP2 until APP2 no longer accesses HW4. Note that logic virtualization also incurs an additional time overhead, because the hardware device sharing needs to be managed by a hardware task manager to ensure data synchronization in a DPRS.
Besides supporting device-level hardware virtualization, the HPC can be used to eliminate the software reconfiguration overhead. From the viewpoint of system design, the HPC implements the access interface of a PRR. As long as a hardware module conforms to the interface design of a PRR, it can be controlled through the HPC corresponding to the PRR in which the hardware module is configured. An HPC is implemented using the loadable kernel module technology available in Linux and other OSes, and it can be reused across the different hardware modules that are configured into the PRR. As a result, different applications can interact with a newly configured hardware module through the HPC without loading its device drivers, except for the first time a PRR is used. As shown in Figure 3(b), the previously loaded HPC used for HW1 can be reused for HW4, which eliminates the software reconfiguration overhead. Therefore, through the use of HPC and logic virtualization, system performance can be enhanced significantly, as shown in Figure 3.
3.2. Hardware Device Virtualization
The hardware device virtualization technique virtualizes one HPC (device driver) so that it supports multiple reconfigurable hardware modules at the same time. It is based on a software-to-hardware virtualization concept and is mainly applied to an application that sequentially accesses multiple hardware modules. Through the use of an HPC, the processing results of one hardware module can be transferred directly to another, without being read back to the user space and then written to the kernel space again. As shown in Figure 1(c), the HPC linked to the RC6 hardware module can be virtualized to also support the CRC hardware module. As a result, the processing results of the RC6 hardware module can be transferred to the CRC hardware module through the HPC. Note that our current implementation of hardware device virtualization imposes a constraint: the widths of the input data signals and the input data sizes need to be the same for all hardware modules.
4. HIERARCHICAL DPRS DESIGN
To support the VDI method and enhance system scalability, we introduce the layered approach in the DPRS design. As a result, the designs for different layers can be easily extended and integrated with new user-designed reconfigurable hardware modules. Then, the reconfigurable hardware modules can be executed as hardware tasks in the OS4RS. In the following sections, we will introduce the proposed hierarchical DPRS design, including the architecture design for each layer, and its design applicability.
4.1. Layered Design
We implement the DPRS as a hierarchical design, as shown in Figure 4. It contains a microprocessor, a memory controller, a configuration controller, a network controller, communication components, PRRs realized using the Early Access Partial Reconfiguration (EA PR) flow [Xilinx Inc. 2006], and a system bus. The system architecture consists of six layers, namely, the configuration, communication, interface, management, function, and application layers.
Fig. 4. Hierarchical DPRS design.
Fig. 5. Interface between communication component and PRR.
4.1.1. Configuration Layer. The configuration layer focuses on integrating new reconfigurable hardware modules into the FPGA. To standardize user-designed hardware modules that have different data interfaces, a partially reconfigurable hardware task template (PR template) [Huang and Hsiung 2011], as shown in Figure 5, is used to connect user-designed modules to the communication component designed in the communication layer. To use a newly developed hardware module in the OS4RS, a designer simply has to integrate the new hardware module with the proposed PR template, because the template provides a common communication interface between the hardware module and the rest of the system. The PR template consists of eight 32-bit input data signals, one 32-bit input control signal, four 32-bit output data signals, and one 32-bit output control signal. It also contains an optional Data Transformation Component (DTC) for unpacking incoming data and packing outgoing data based on the sizes of the I/O registers in the hardware module. Take the RSA hardware module in Figure 5 as an example.
Four 32-bit input signals, indata, inExp, inMod, and cypher, are connected directly to four input data signals of the PR template, while one 32-bit signal, dataout, is connected to an output data signal of the PR template. Using the DTC, the 1-bit signals ds and reset are packed into a 32-bit signal that connects to the input control signal of the PR template, while the 1-bit signal ready is packed into a 32-bit signal that connects to the output control signal of the PR template. To further raise the utilization of logic resources, PRRs of different sizes are implemented on the FPGA so that each reconfigurable hardware module can be (re)configured into a best-fit PRR at runtime. In our current implementation, the best-fit policy is based only on the available resources of a PRR and the resources required by the hardware module. However, this can lead to internal fragmentation and to the rejection of future hardware modules. To cope with these problems, the related methods [Montone et al. 2008; Hsiung et al. 2010] will be integrated into the hierarchical DPRS design in the future to support a more efficient system management mechanism.
4.1.2. Communication Layer. The communication layer includes the communication architecture used for data transfers among all hardware components. Further, as shown in Figure 4, the communication components, such as comm1, comm2, and comm3, are designed in the communication layer to act as the bus interfaces of the hardware modules that are integrated with the PR template and configured in PRRs. The communication components are realized using the OPB Intellectual Property InterFace (IPIF) design, and each of them contains fourteen software-accessible registers that connect to the fourteen 32-bit signals of the PR template. To connect a communication component with a PRR, Xilinx bus macros [Xilinx Inc. 2006] are inserted between them to allow correct communication and connection, as shown in Figure 5.
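The signal counts given for the PR template and the DTC packing of the RSA control signals can be written down as a small model. The counts (eight, one, four, one) come from the text; the register ordering, the register names, and the bit positions of ds, reset, and ready are assumptions made only for this sketch.

```python
# Signal counts come from the text; ordering and names are assumed.
N_IN_DATA, N_IN_CTRL, N_OUT_DATA, N_OUT_CTRL = 8, 1, 4, 1

REGISTERS = (
    [f"in_data{i}" for i in range(N_IN_DATA)]
    + ["in_ctrl"]
    + [f"out_data{i}" for i in range(N_OUT_DATA)]
    + ["out_ctrl"]
)

# Hypothetical bit positions inside the 32-bit control words.
DS_BIT, RESET_BIT = 0, 1
READY_BIT = 0


def pack_control(ds, reset):
    """Pack the 1-bit ds and reset signals into one 32-bit input control word."""
    word = (int(bool(ds)) << DS_BIT) | (int(bool(reset)) << RESET_BIT)
    return word & 0xFFFFFFFF


def unpack_ready(out_ctrl_word):
    """Extract the 1-bit ready signal from the 32-bit output control word."""
    return bool((out_ctrl_word >> READY_BIT) & 1)


print(len(REGISTERS), pack_control(ds=1, reset=1), unpack_ready(1))
```

The register list has fourteen entries, matching the fourteen software-accessible registers of a communication component.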
Through bus macros, the routing between a reconfigurable hardware module and the communication component is locked so that the pins of the reconfigurable hardware module remain compatible with the communication component. Through the communication component, the HPC designed in the interface layer can interact with the reconfigurable hardware modules in PRRs. Further, the processing results of a reconfigurable hardware module can be buffered in its corresponding communication component until the HPC reads them.
4.1.3. Interface Layer. The proposed hardware virtualization technique is mainly realized in the interface layer. Unlike a device driver specific to a hardware device in a conventional embedded OS, an HPC is designed to interact only with the fourteen 32-bit signals of the PR template. It implements fourteen ioctl commands, one for each of the fourteen software-accessible registers of the communication component. Through the software HPCs and the hardware communication components, applications in the user space can easily access the reconfigurable hardware modules in PRRs. To ensure data synchronization in the hierarchical DPRS design, the HPC contains a tuple (L, D) to control its data transfers, where L and D are Boolean flags asserted by the hardware task manager in the management layer. When an HPC is loaded on demand into the OS4RS kernel and linked to a reconfigurable hardware module by the hardware task manager, its L flag is set to true. To support logic virtualization, only one of the L flags of all HPCs linked to the same reconfigurable hardware module is set to true at a time, while the L flags of the other HPCs are set to false. This means that mutually exclusive access to the reconfigurable hardware module is controlled by the hardware task manager. Furthermore, when hardware device virtualization is used, the D flag of an HPC is set to true.
The HPC thus transfers the processing results of a reconfigurable hardware module to another reconfigurable hardware module, instead of transferring them back to the application in the user space. After all data are processed, the D flag of the HPC is set back to false.
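The role of the (L, D) tuple can be illustrated with a toy model. This is a sketch, not the kernel implementation: the class `HPC`, the helper `grant`, and the string labels are hypothetical, and treating an HPC whose L flag is false as blocked is an interpretation of the mutual-exclusion rule rather than something the text spells out.

```python
class HPC:
    """Toy model of a Hot-Plugin Connector holding the (L, D) flag tuple."""

    def __init__(self, name):
        self.name = name
        self.L = False  # True while this HPC is granted access to its module
        self.D = False  # True while results are forwarded to another module

    def transfer(self, result):
        """Decide where a processing result goes, based on the two flags."""
        if not self.L:
            return ("blocked", None)        # assumed behavior when L is false
        if self.D:
            return ("next_module", result)  # hardware device virtualization
        return ("user_space", result)       # normal path back to the application


def grant(hpcs, owner):
    """Hardware-task-manager step: exactly one HPC sharing a module has L true."""
    for hpc in hpcs:
        hpc.L = hpc is owner


a, b = HPC("hpc_app1"), HPC("hpc_app2")
grant([a, b], a)
print(a.transfer(7), b.transfer(7))
```

Setting `a.D = True` in this model redirects the result of `a.transfer` to the next module, mirroring the kernel-side forwarding that avoids the round trip through the user space.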
[Figure 6: flowchart. On a request for a hardware function, the hardware task manager checks whether a PRR with the requested function exists. If not, it sends a request for partial reconfiguration, selects a best-fit PRR, and configures the requested hardware module using ICAP (partial reconfiguration). If so and the request comes from a different application, it loads an HPC in the OS4RS kernel and links the HPC to the corresponding PRR (logic virtualization). If so and the request comes from the same application, it links the previously used HPC to the PRR with the requested hardware function (hardware device virtualization).]
Fig. 6. Hardware task management.
4.1.4. Management Layer. The management layer contains a hardware task manager that not only manages all data transfers between the HPCs and the reconfigurable hardware modules, but also determines which virtualization technique is used. As shown in Figure 6, hardware task management is divided into three categories: logic virtualization, hardware device virtualization, and partial reconfiguration. As introduced in Section 3, the proposed hardware virtualization technique is mainly applied to applications having ordered sequences of reconfigurable hardware functions. The hardware task manager adopts a first-come-first-served scheduling policy, so the applications in user space are served sequentially.

When an application requests a hardware function, the hardware task manager first checks whether the requested hardware module has been configured in a PRR. If not, the hardware task manager requests the configuration controller to configure the requested hardware module in a best-fit PRR. If the requested hardware module has already been configured, the hardware task manager checks whether the request comes from the same application. If not, logic virtualization is invoked to dynamically load an HPC into the OS4RS kernel. This HPC is then linked to the corresponding PRR, and its L flag is set to true by the hardware task manager. For access to a shared module, the hardware task manager adopts a round-robin policy, so that a configured hardware module can be shared among multiple applications.

If the request comes from the same application, the hardware task manager dynamically links the HPC previously used by this application to the PRR with the requested hardware function. Thus, the processing results of the hardware module previously linked to this HPC can be transferred directly to the requested hardware module via the HPC. Note that the HPC is responsible for queuing the processing results and then forwarding them to the requested hardware module.
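The three-way decision described above can be sketched as a short function. This is an illustrative model only, not the authors' implementation. It assumes that "the same application" means an application that already owns an HPC from an earlier request, that best fit means the smallest free PRR that is large enough, and that eviction of occupied PRRs is out of scope. The names `Prr`, `best_fit`, `handle_request`, and `hpc_owners` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Prr:
    name: str
    slices: int
    function: Optional[str] = None  # hardware function currently configured


def best_fit(prrs: List[Prr], needed: int) -> Optional[Prr]:
    """Smallest free PRR that is large enough (eviction is not modelled)."""
    free = [p for p in prrs if p.function is None and p.slices >= needed]
    return min(free, key=lambda p: p.slices) if free else None


def handle_request(prrs: List[Prr], hpc_owners: Set[str],
                   app: str, function: str, needed: int) -> str:
    """Classify one request according to the three branches of Figure 6."""
    if all(p.function != function for p in prrs):
        target = best_fit(prrs, needed)
        if target is None:
            raise RuntimeError("no free PRR fits the requested module")
        target.function = function  # stands in for configuration via ICAP
        hpc_owners.add(app)         # an HPC links the application to the PRR
        return "partial_reconfiguration"
    if app in hpc_owners:
        return "hw_device_virtualization"  # relink the HPC the app already uses
    hpc_owners.add(app)                    # load and link a new HPC, L = true
    return "logic_virtualization"
```

With two PRRs of 2,464 and 1,456 slices and made-up module sizes, a first request for a function triggers partial reconfiguration, a request for the same function from a second application selects logic virtualization, and a further request from the first application for an already configured function selects hardware device virtualization.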
Further, with hardware device virtualization, when a pair of device node and HPC is linked to only one hardware module, the final processing results are transferred directly back to the application in user space.

4.1.5. Function Layer. The function layer mainly contains a hardware control library that provides the APIs for applications in user space to interact with all reconfigurable hardware modules. Every hardware module has its specific I/O interface. In a conventional Unix-like OS, the control method specific to the I/O interface of a hardware device is implemented in the device driver. However, to support the hardware virtualization technique, the VDI method unifies the communication interface for all reconfigurable hardware modules by using the hardware PR template, the communication component, and the software HPC. For the applications in user space, the VDI method still needs to provide the corresponding control methods of the reconfigurable hardware modules. The hardware control library is responsible for recording all control methods for the I/O interfaces of the reconfigurable hardware modules. As a result, it acts as a pool of high-level device drivers through which applications in user space interact with all reconfigurable hardware modules. To integrate a user-designed hardware module into a DPRS, the module only needs to be interfaced with the PR template and its I/O interface added to the hardware control library. Applications in user space can then interact with the newly integrated hardware module by simply invoking the APIs in the hardware control library. System scalability is thus enhanced.

[Figure 7: UML sequence diagram spanning the Application, Function, Management, Interface, Communication, and Configuration layers. The application sends a request to the hardware task manager. If the requested hardware is not configured in a PRR, the configuration device is opened, the configuration controller is accessed, and the requested hardware is configured in a PRR. An HPC then establishes the link from the application to the requested hardware: with logic virtualization an unused HPC is adopted, and with hardware device virtualization the HPC previously used in the application is adopted. The application is notified that the requested hardware can be accessed, uses the related APIs in the hardware control library, writes data to the HPC, and accesses the requested hardware via the communication component.]
Fig. 7. UML sequence diagram for control and data flow.

4.1.6. Application Layer. The topmost layer of the hierarchical DPRS design is the application layer. The target application is one having ordered sequences of reconfigurable hardware functions. To interact with a reconfigurable hardware module, an application needs to send its request to the hardware task manager in the management layer. Figure 7 gives a UML sequence diagram illustrating the data and control flow between the six layers of the hierarchical DPRS design. As introduced in Section 4.1.4, the hardware task manager plays a key role in establishing the link from the application to the requested hardware module. The application can then use the related APIs in the hardware control library to interact with the requested hardware module.

4.2. Design Applicability
In our current implementation, only one of the two proposed techniques, logic virtualization or hardware device virtualization, is used at a time, to ensure data synchronization. When logic virtualization is used, the round-robin policy and the L flags in the HPCs are adopted so that data synchronization between multiple applications can be guaranteed. However, to comply with the mechanism of logic virtualization and to ensure that all data can be transferred and processed, the data transfers from one hardware module to another via an HPC in hardware device virtualization would become discontinuous, because logic virtualization imposes a fixed access period. This adds time overhead in waiting for the next available access period. As a result, the performance gain from hardware device virtualization in an application would be neutralized. In the future, we will propose a more complete mechanism that can integrate the two kinds of hardware virtualization techniques efficiently into a DPRS design.

Regarding the applicability of the hierarchical DPRS design, in our current implementation the layered design method is realized only in a DPRS with a single FPGA. To support a system architecture with multiple FPGAs, such as the BEE2 platform [So and Brodersen 2008], one of the FPGAs needs to be used as a master controller on which the OS4RS runs. The other, slave FPGAs are treated like PRRs in an FPGA that can be accessed by the master FPGA. Further, the communication components are configured in the master FPGA so that the OS4RS can use them to interact with all the slave FPGAs. As a result, each reconfigurable hardware module configured in a slave FPGA can be managed as a hardware device in the OS4RS. Logic virtualization and hardware device virtualization can also be applied to such a system architecture.

5. EXPERIMENTS
To demonstrate how system performance and the utilization of reconfigurable hardware modules can be further enhanced using the proposed VDI method, we adopt a real Dynamically Partially Reconfigurable Network Security System (DPRNSS) for multimedia applications as our example. It supports the TLS protocol, which is in widespread use in applications such as Web browsing, electronic mail, and instant messaging. The DPRNSS was implemented on the Xilinx ML310 platform with a Virtex II Pro XC2VP30 FPGA chip that has 13,696 slices. The proposed hardware virtualization techniques were realized in the PetaLinux embedded OS [PetaLogix 2013], which ran on a Xilinx MicroBlaze soft-core processor at 100 MHz.

The DPRNSS supports five cryptographic hardware modules, namely three variants of RSA with different key and input data sizes in bits (RSA32, RSA64, and RSA128), RC6 encryption, and RC6 decryption, and three hash hardware modules, namely three variants of CRC with different input data sizes in bits (CRC32, CRC64, and CRC128). The proposed DPRS supports all of these modules by implementing only two PRRs of different sizes, a large PRR1 with 2,464 slices and a small PRR2 with 1,456 slices. The multimedia application captures real-time 128 × 64 images from the camera, transfers the captured images sequentially to a cryptographic module and a hash hardware module for data processing, and finally transfers the encrypted images to a receiver on the network. It can also receive images over the network for decryption using the cryptographic and hash hardware modules.

5.1. System Resource Usage Analysis
The system resource usage, including the logic resource usage, the number of device nodes, and the number of device drivers, is shown in Table I for three different system implementations: a conventional embedded system, a conventional DPRS [Donato et al. 2005; Santambrogio et al. 2008], and the VDI method. A conventional embedded system must be configured at design time with all eight hardware modules to support the transfer of real-time images on a network. The two DPRS implementations, the conventional DPRS [Donato et al. 2005; Santambrogio et al. 2008] and the VDI method, both require logic resources only for the PRRs, because the different hardware modules can be configured into the PRRs at runtime to fit different system requirements.

The difference is that in the conventional DPRS [Donato et al. 2005; Santambrogio et al. 2008] the reconfigurable hardware modules are still managed as conventional hardware devices. Thus, the full set of eight device nodes and eight device drivers is required. The proposed VDI method allows the system to work with all eight reconfigurable hardware modules using fewer pairs of device nodes and HPCs (device drivers), owing to the unified hardware interface (PR template), the unified access control (hardware control library), and the unified software HPC. Thus, the minimum number of device nodes and device drivers equals the number of PRRs, instead of growing with the number of hardware modules. Since the number of PRRs is usually much smaller than the number of hardware modules, the VDI method lowers the minimum number of device nodes and drivers that the system requires. As a result, the VDI method further improves the utilization of system resources and reduces the load on kernel memory.

Table I. Comparison on System Resource Usage

                        Conventional            Conventional            VDI Method
                        Embedded System         DPRS
Logic Resource Usage    6,296 slices (8 HWs)    3,920 slices (2 PRRs)   3,920 slices (2 PRRs)
#Device Node (min)      #HW (8)                 #HW (8)                 #PRR (2)
#Device Node (max)      #HW (8)                 #HW (8)                 unlimited (∞)
#Device Driver (min)    #HW (8)                 #HW (8)                 #PRR (2)
#Device Driver (max)    #HW (8)                 #HW (8)                 unlimited (∞)

#Device Node: the number of device nodes. #Device Driver: the number of device drivers. #HW: the number of hardware functions. #PRR: the number of PRRs.

[Figure 8: (a) time ratio to access RSA in PRR1: Open 0%, Config 9%, Load 44%, Interact 48%, Close 0%; (b) time ratio to access RSA in PRR2: Open 0%, Config 7%, Load 45%, Interact 47%, Close 0%; (c) continuous requests for different hardware functions: total processing time (sec) versus number of requested hardware functions (1 to 4), conventional method versus VDI method.]
Fig. 8. Reconfiguration overhead analysis.

5.2. Reconfiguration Overheads Analysis
We measured the actual time ratios required for encrypting a 128 × 64 pixel image using an RSA hardware module in PRR1 and PRR2, as shown in Figures 8(a) and 8(b). As introduced in Section 3, the total processing time includes the time to configure the RSA hardware module, load the device driver into the OS4RS kernel, open the device node, interact with the configured RSA hardware module, and close the device node. The hardware time overheads for configuring the RSA hardware module in PRR1 and PRR2 are 177 milliseconds (9%) and 123 milliseconds (7%) of the total processing time, respectively. However, the corresponding software time overheads for loading the device driver are 830 milliseconds in both cases (44% and 45% of the total processing time, respectively), which is about 4.7 times and 6.7 times the hardware time overheads. Note that the times to open and to close a device node are almost negligible, so they appear as 0% of the total processing time in Figures 8(a) and 8(b).

From these results, we can observe that, in reality, the software reconfiguration overhead is much greater than the hardware reconfiguration overhead. Software reconfiguration overhead thus has a more prominent impact on system performance than hardware reconfiguration overhead. However, most existing methods focus on reducing the hardware reconfiguration overhead, and the more serious software reconfiguration overhead has not yet been addressed.

We validate the VDI method by experimenting with a network security application that continuously requests four different hardware functions, each for encrypting one of four 128 × 64 pixel images, in response to changing network threat conditions. The requested hardware modules are thus continuously configured into the FPGA device. The total processing time using the conventional method and using the VDI method was measured, as shown in Figure 8(c). We can observe that, because the VDI method eliminates the software reconfiguration overhead, the processing time saved grows as the number of requested hardware modules increases. These measurements validate our contribution, that is, the use of the HPC in the VDI method can further improve system performance by eliminating the software reconfiguration overheads.

[Figure 9: (a) theoretical processing time, conventional method versus logic virtualization; (b) real applications: finish time (sec) versus number of images (5 to 50), CM and LV for RSA and CRC, and CM and LV for RC6 and CRC.]
Fig. 9. Logic virtualization and conventional method.

5.3. Performance Analysis Using Logic Virtualization
For a more detailed analysis, we measured the time required for each of the following basic operations. The average times to load a device driver into the OS4RS kernel, open a device node, and close it are 830, 0.253, and 0.048 milliseconds, respectively. The average times to write a 32-bit data word to and read a 32-bit data word from the kernel space are 0.022 and 0.022 milliseconds, respectively. The average computing time for processing a 128 × 64 pixel image using the RSA32, RSA64, RSA128, RC6, CRC32, CRC64, and CRC128 hardware modules is 0.026 milliseconds, while the average time to read a 32-bit data word from a hardware module and then write it to another hardware module through an HPC is 0.0028 milliseconds.

Figure 9(a) compares logic virtualization and the conventional method by looking at the finish time of an application that has some hardware modules previously accessed by another application. The number of shared hardware modules and the number of iterations to access the shared hardware modules in the preceding application range from 0 to 1,000 and from 0 to 1,000,000, respectively. Note that Figure 9(a) focuses only on the processing time of reconfigurable hardware modules; the processing time of the software functions and the nonreconfigurable hardware functions is not taken into consideration. Further, it gives an ideal performance evaluation that does not consider the time overhead incurred by switching the access privilege of the shared hardware modules in the logic virtualization technique.

We can observe that the finish time of an application using the conventional method increases exponentially as the number of shared hardware modules and the number of iterations to access the shared hardware modules in the preceding application increase. This is because the application needs more and more waiting time to access the shared hardware modules, which seriously delays its finish time. Using logic virtualization, an application can access a hardware module that has been accessed by a preceding application as long as the preceding application is not currently using it, irrespective of whether it has been released. As shown in Figure 9(a), the time required by logic virtualization becomes progressively smaller relative to the time required by the conventional access method as the number of shared hardware modules and the number of iterations in the preceding application increase. This is because logic virtualization significantly reduces the waiting time for the shared hardware modules. Further, as the total processing time of shared hardware modules in the preceding application increases, the problem of blocked device access under the conventional method becomes more and more serious. In contrast, the performance improvement from logic virtualization becomes more and more prominent because the reduction in waiting time for the shared module grows. Note that the finish time of an application using logic virtualization also increases gradually as the number of shared hardware modules and the number of iterations in the preceding application increase. However, because of the exponential growth in time under the conventional method, the increase in finish time under logic virtualization is hardly visible and appears flat in Figure 9(a).

[Figure 10: (a) logic virtualization: two applications, APP1 and APP2, share the Crypt module through separate HPCs; (b) hardware device virtualization: one application, APP1, uses the Crypt and Hash modules.]
Fig. 10. Application setup.

To validate the performance results derived in Figure 9(a), we further experimented with real applications. As shown in Figure 10(a), two multimedia applications interacted simultaneously with the same cryptographic hardware module. Each multimedia application captured images from the camera, and then transferred the captured images to the cryptographic and hash hardware modules sequentially for data processing. By using logic virtualization, two HPCs (HPC2 and HPC3) were used, one for each multimedia application, to interact with the shared cryptographic hardware module. Note that here the time required to encrypt one image is used as the period for switching the access privilege of the shared cryptographic hardware module between the two multimedia applications.
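The interleaved access in this experiment can be pictured with a minimal round-robin model in which each turn lasts exactly one image, matching the switching period chosen above. The sketch is illustrative only: the function name `round_robin_schedule` and the application labels are hypothetical, and the cost of switching the access privilege is not modelled.

```python
from collections import deque
from typing import Dict, List, Tuple


def round_robin_schedule(pending: Dict[str, int]) -> List[Tuple[str, int]]:
    """Interleave access to one shared module, one image per turn.

    pending maps an application name to its number of images.
    Returns (application, image_index) pairs in service order.
    """
    queue = deque((app, 0) for app, n in pending.items() if n > 0)
    order = []
    while queue:
        app, idx = queue.popleft()        # this application's HPC gets L = true
        order.append((app, idx))          # one image is processed in its turn
        if idx + 1 < pending[app]:
            queue.append((app, idx + 1))  # back of the line, L returns to false
    return order
```

For two applications with three and two images, the model serves them alternately until the shorter queue is exhausted, after which the remaining application keeps the module.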
[Figure 11: (a) theoretical processing time, conventional method versus hardware device virtualization; (b) real applications: processing time (sec) versus number of images (5 to 50), CM and HDV for RSA and CRC, and CM and HDV for RC6 and CRC.]
Fig. 11. Hardware device virtualization and conventional method.
Experiments were performed to evaluate the effect of varying the data size (the number of images). Figure 9(b) shows the time required for processing 5 to 50 images using the Conventional Method (CM) and Logic Virtualization (LV), where the RSA module and the RC6 module, respectively, are shared between two different multimedia applications for image encryption. We can observe that logic virtualization saves more and more time compared to the conventional method as the number of captured images increases. Further, the time saved by using logic virtualization is much greater than the time overhead incurred by switching the access privilege of the shared cryptographic hardware module. As a result, the amount of time saved by logic virtualization grows as the number of captured images increases. For the pair of RSA and CRC hardware modules and that of RC6 and CRC hardware modules, logic virtualization reduces the time required by the conventional method by up to 12.83% and 3.73%, respectively.

5.4. Performance Analysis Using Hardware Device Virtualization
Based on the average time measured for each basic operation in Section 5.3, we analyze the effect of hardware device virtualization on system performance. Figure 11(a) compares the total processing time of an application using the conventional method and hardware device virtualization. The number of hardware modules and the number of hardware access iterations both vary from 0 to 100,000. As shown in Figure 11(a), with either the conventional method or hardware device virtualization, the total processing time of an application increases with both the number of hardware modules and the number of hardware access iterations. The widening gap between hardware device virtualization and the conventional method is mainly due to the accumulated time savings from two factors in each iteration: (1) the elimination of the software time overhead incurred by reconfiguration and (2) the reduction in the number of data transfers between user space and kernel space.
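The second factor can be made concrete with the per-word timings reported in Section 5.3. The sketch below is a back-of-the-envelope model, not a reproduction of Figure 11(a). It assumes that, in the conventional method, every 32-bit word passed from one module to the next costs one read from and one write to the kernel space, and it ignores computing time and driver loading. The function names and the word count used in the note after the code are hypothetical.

```python
# Average per-word timings reported in Section 5.3 (milliseconds).
WRITE_TO_KERNEL_MS = 0.022   # write one 32-bit word into the kernel space
READ_FROM_KERNEL_MS = 0.022  # read one 32-bit word from the kernel space
HPC_FORWARD_MS = 0.0028      # module-to-module transfer of one word via an HPC


def conventional_hop_ms(words: int) -> float:
    """Move intermediate results between two modules through user space."""
    return words * (READ_FROM_KERNEL_MS + WRITE_TO_KERNEL_MS)


def hdv_hop_ms(words: int) -> float:
    """Move the same results directly between modules through the HPC."""
    return words * HPC_FORWARD_MS
```

Under these assumptions, forwarding 1,000 words costs about 44 milliseconds through user space and about 2.8 milliseconds through the HPC, a per-word ratio of roughly 16 to 1 that accumulates with every hardware access iteration.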
To validate the performance results derived in Figure 11(a), we also experimented with real applications. As shown in Figure 10(b), a multimedia application first captures 5 to 50 images from the camera, and then transfers the captured images sequentially to a cryptographic hardware module and a hash hardware module for data processing. Different cryptographic modules, namely RSA and RC6, and a hash module, namely CRC, were used to ensure the security and integrity, respectively, of the image transfers on the network. Experiments were performed to evaluate the effect of varying the data size (the number of images). The times for processing 5 to 50 images using the Conventional Method (CM) and Hardware Device Virtualization (HDV) are compared in Figure 11(b). We can observe that the time saved by hardware device virtualization relative to the conventional method grows as the number of captured images increases. For the pair of RSA and CRC hardware modules and that of RC6 and CRC hardware modules, hardware device virtualization reduces the time required by the conventional method by up to 6.53% and 11.19%, respectively.

Note that in real applications, logic virtualization and hardware device virtualization enhance system performance by up to only 12.83% and 11.19%, respectively. However, the theoretical and experimental analyses suggest that the performance improvement would become more significant as the size of the input data (the number of images) increases.

6. CONCLUSIONS
This work proposes a virtualizable hardware/software design infrastructure (VDI) method that alleviates the limitations of infrastructure support for DPRS and improves system performance. In VDI, the logic virtualization technique enables a hardware module to be accessed efficiently by multiple applications in an interleaved manner, while the hardware device virtualization technique reduces the time overhead of repeatedly transferring data between kernel space and user space. An HPC design is proposed that not only supports device-level hardware virtualization but also eliminates the software reconfiguration overhead. To support the VDI method, we also propose a hierarchical DPRS design that facilitates system development and scalability. Through both analysis and experimental results, we have demonstrated that the VDI method not only enhances system performance, but also increases the utilization of reconfigurable hardware modules. Experiments with real applications also demonstrated that the VDI method can reduce the processing time by up to 12.83% compared with the conventional DPRS method.

REFERENCES

AMD INC. 2013. AMD-V. http://www.amd.com.
BANERJEE, S., BOZORGZADEH, E., AND DUTT, N. 2005. Physically-aware hw-sw partitioning for reconfigurable architectures with partial dynamic reconfiguration. In Proceedings of the 42nd ACM/IEEE Design Automation Conference (DAC'05). 335–340.
CHEN, E., GUSEV, V., SABAZ, D., SHANNON, L., AND GRUVER, W. A. 2011. Dynamic partial reconfigurable fpga framework for agent systems. In Proceedings of the International Conference on Industrial Applications of Holonic and Multi-Agent Systems.
DONATO, A., FERRANDI, F., SANTAMBROGIO, M. D., AND SCIUTO, D. 2005. Operating system support for dynamically reconfigurable soc architecture. In Proceedings of the IEEE International SOC Conference. 233–238.
EL-ARABY, E., GONZALEZ, I., AND EL-GHAZAWI, T. 2008. Virtualizing and sharing reconfigurable resources in high-performance reconfigurable computing systems. In Proceedings of the 2nd International Workshop on High-Performance Reconfigurable Computing Technology and Applications (HPRCTA'08). 1–8.
EL-ARABY, E., GONZALEZ, I., AND EL-GHAZAWI, T. 2009. Exploiting partial run-time reconfiguration for high performance reconfigurable computing. ACM Trans. Reconfig. Technol. Syst. 1, 4, 1–23.
GARCIA, P. AND COMPTON, K. 2008. Kernel sharing on reconfigurable multiprocessor systems. In Proceedings of the International Conference on ICECE Technology (FPT'08). 225–232.
GOHRINGER, D., WERNER, S., HUBNER, M., AND BECKER, J. 2011. RAMPSoCVM: Runtime support and hardware virtualization for a runtime adaptive mpsoc. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL'11). 181–184.
HOFMANN, A., WALDSCHMIDT, K., AND HAASE, J. 2010. SDVMR - Managing heterogeneity in space and time on multicore socs. In Proceedings of the NASA/ESA Conference on Adaptive HW and Systems (AHS'10). 142–148.
HSIUNG, P.-A., HUANG, C.-H., SHEN, J.-S., AND CHIANG, C.-C. 2010. Scheduling and placement of hardware/software real-time relocatable tasks in dynamically partially reconfigurable systems. ACM Trans. Reconfig. Technol. Syst. 4, 1.
HSIUNG, P.-A., LU, P.-H., AND LIU, C.-W. 2007. Energy efficient hardware-software co-scheduling in dynamically reconfigurable systems. In Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis (CODES + ISSS'07). ACM Press, New York, 87–92.
HUANG, C.-H. AND HSIUNG, P.-A. 2011. Model-based verification and estimation framework for dynamically partially reconfigurable systems. IEEE Trans. Indust. Informatics 7, 2, 287–301.
HUANG, M., ANDREWS, D., AND AGRON, J. 2010. Operating system structures for multiprocessor systems on programmable chip. In Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig'10). 358–363.
HUANG, M., NARAYANA, V. K., BAKHOUYA, M., GABER, J., AND EL-GHAZAWI, T. 2012. Reconfigurable hardware using architectural variants. IEEE Trans. Comput. 61, 5, 1354–1360.
HYMEL, R., GEORGE, A., AND LAM, H. 2007. Evaluating partial reconfiguration for embedded fpga applications. In Proceedings of the High-Performance Embedded Computing Workshop (HPEC'07). 18–20.
INTEL INC. 2013. Intel VT. http://www.intel.com/technology/virtualization.
KIRISCHIAN, L., DUMITRIU, V., CHUN, P. W., AND OKOUNEVA, G. 2010. Mechanism of resource virtualization in rcs for multitask stream applications. Int. J. Reconfig. Comput. 2010, 8.
KIVITY, A., KAMAY, Y., LAOR, D., LUBLIN, U., AND LIGUORI, A. 2007. KVM: The linux virtual machine monitor. In Proceedings of the Ottawa Linux Symposium. 225–230.
KOEHLER, S., CURRERI, J., AND GEORGE, A. D. 2008. Performance analysis challenges and framework for high-performance reconfigurable computing. Parallel Comput. 34, 4, 217–230.
KOH, S. AND DIESSEL, O. 2007. Module graph merging and placement to reduce reconfiguration overheads in paged fpga devices. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL'07). IEEE, 293–298.
LAGGER, A., UPEGUI, A., SANCHEZ, E., AND GONZALEZ, I. 2006. Self-reconfigurable pervasive platform for cryptographic application. In Proceedings of the 16th IEEE International Conference on Field Programmable Logic and Applications (FPL'06). IEEE Computer Society, 777–780.
MONTONE, A., REDAELLI, F., SANTAMBROGIO, M. D., AND MEMIK, S. O. 2008. A reconfiguration-aware floorplacer for fpgas. In Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig'08). IEEE Computer Society, 109–114.
PETALOGIX. 2013. PetaLinux. http://www.petalogix.com.
PRATT, I., FRASER, K., HANDA, S., LIMPACH, C., WARFIELD, A., MAGENHEIMER, D., NAKAJIMA, J., AND MALLICK, A. 2005. Xen 3.0 and the art of virtualization. In Proceedings of the Linux Symposium. Vol. 2. 65–78.
QEMU. 2013. QEMU Open source processor emulator. http://wiki.qemu.org/Main_Page.
SABEGHI, M. AND BERTELS, K. 2009. Toward a runtime system for reconfigurable computers: A virtualization approach. In Proceedings of the Conference and Exhibition on Design, Automation and Test in Europe (DATE'09). 1576–1579.
SANTAMBROGIO, M., RANA, V., AND SCIUTO, D. 2008. Operating system support for online partial dynamic reconfiguration management. In Proceedings of the 18th International Conference on Field Programmable Logic and Applications (FPL'08). IEEE Computer Society, 455–458.
SO, H. K.-H. AND BRODERSEN, R. 2008. A unified hardware/software runtime environment for fpga based reconfigurable computers using borph. ACM Trans. Embedded Comput. Syst. 7, 2, 1–28.
WERNER, S., OEY, O., GOHRINGER, D., HUBNER, M., AND BECKER, J. 2012. Virtualized on-chip distributed computing for heterogeneous reconfigurable multi-core systems. In Proceedings of the Conference and Exhibition on Design, Automation and Test in Europe (DATE'12). 280–283.
XILINX INC. 2006. Early access partial reconfiguration user guide - ug208. http://forums.xilinx.com/t5/Archived-ISE-issues/Early-Access-EA-Partial-Reconfiguration/td-p/18750.

Received July 2012; revised January 2013; accepted April 2013
ACM Transactions on Reconfigurable Technology and Systems, Vol. 6, No. 2, Article 11, Publication date: July 2013.