Hierarchical Run-Time Reconfiguration Managed by an Operating System for Reconfigurable Systems

V. Nollet, J-Y. Mignolet, T.A. Bartic, D. Verkest†,‡, S. Vernalde, R. Lauwereins‡
IMEC, Kapeldreef 75, 3001 Leuven, Belgium
† also Professor at Vrije Universiteit Brussel
‡ also Professor at Katholieke Universiteit Leuven
{nollet,mignolet,bartic}@imec.be

Abstract

The need for flexible computational power has motivated many researchers to incorporate run-time reconfigurable logic into their architectures. Most contemporary experiments use commercial FPGAs as reconfigurable hardware. Unfortunately, the FPGA does not exhibit the same run-time flexibility as the Instruction Set Processor (ISP), e.g. when it comes to the ease and speed of setting up a task. In addition, FPGAs tend to be less suited than traditional ISPs to accommodate control-flow dominated tasks. It is possible to alleviate some of these issues by using a reconfiguration hierarchy (e.g. placing and configuring an ASIP or coarse-grain reconfigurable block into the FPGA). This paper illustrates how our operating system transparently manages the complexity of hierarchical reconfiguration. In addition, this paper highlights the benefits and drawbacks of employing multiple hierarchical levels of configuration. As a proof of concept, we developed a filtering application on top of an in-house 16-bit microcontroller and a parameterizable filter block, both instantiated inside an FPGA.

1. Introduction

The use of an embedded architecture that incorporates both an Instruction Set Processor (ISP) and reconfigurable hardware (an FPGA) allows execution of computationally demanding multimedia applications with maximum performance and flexibility. To cope with multiple applications executing partly on the ISP and partly on the reconfigurable logic, one needs a suitable management infrastructure, also denoted as Operating System for Reconfigurable Systems (OS4RS) [1]. The main purpose of such an operating system is to provide an environment where tasks can execute concurrently with minimal interference between them, but with support for inter-task communication. Furthermore, the operating system is responsible for managing the reconfigurable resources in an efficient and fair way.

FPGAs prove to be an excellent platform to illustrate the benefits of run-time hardware reconfiguration and hardware acceleration of certain applications. However, there are still some drawbacks related to FPGA technology that prevent this kind of architecture from becoming mainstream:

- Size of the hardware task binary. An important issue when dealing with embedded architectures (and thus limited storage capabilities) is that the size of the partial bitstream describing the hardware task is quite large. As a consequence, the actual task setup can require a considerable amount of time, which might inhibit multitasking on the FPGA.
- Task setup overhead. In order to fill the FPGA with multiple concurrent tasks, each capable of communicating with the rest of the world, one requires some area allocation, area partitioning and run-time routing algorithms. Generally, these algorithms are quite complex and induce a considerable amount of run-time overhead [8].
- Ease of development. In spite of several good design tools, the development of hardware tasks is relatively hard compared to software tasks. In addition, the hardware development know-how is less widespread.
- Dynamic task relocation. Because the number of concurrent hardware tasks is severely limited (in contrast to the maximum number of concurrent tasks on an ISP), the operating system should be able to relocate hardware tasks at run-time. Efficient heterogeneous task relocation [6] is still in its infancy.

In this paper we illustrate how one can alleviate these issues by introducing additional abstraction levels on top of the bare system computing resources (ISP or FPGA). Combined, these abstraction levels form a reconfiguration hierarchy, transparently managed by an Operating System for Reconfigurable Systems. This means, for example, that the OS4RS is able to instantiate a virtual machine on top of an ISP or a softcore/coarse-grain block on top of an FPGA.

The remainder of this paper is organized as follows. Section 2 briefly discusses the related work. Section 3 details the operating system for reconfigurable systems. It discusses the way the operating system manages computing resources and how it performs hierarchical reconfiguration. Section 4 details the benefits of using run-time reconfigurable IP blocks in a hierarchical way. Section 5 briefly highlights the implementation details of our proof-of-concept application. Finally, conclusions are drawn in Section 6.

2. Related Work

In [2], Schaumont et al. describe how reconfigurable systems generally introduce multiple, hierarchical levels of programming and design (e.g. an FPGA can be configured with a softcore, and the softcore, in turn, can be configured with a softcore application binary). They point out that one of the advantages of employing hierarchical reconfiguration is that it allows control of the overall system complexity, while creating more opportunities for component reuse. Their focus is mainly on the application viewpoint. We will show that incorporating support for hierarchical reconfiguration into the operating system is also beneficial for managing the system resources in a more efficient way.

In [5], Ogrenci et al. aim to reduce the run-time FPGA reconfiguration time. In order to do so, they reduce the number of configuration bits by providing the application with pre-placed coarse-grain computation blocks, denoted as Versatile Parameterizable Blocks (VPB). These VPBs could be interpreted as an extra abstraction layer on top of an FPGA. This technique is also application centered. Furthermore, it employs only one hierarchical level of (re)configuration.

3. Operating System for Reconfigurable Systems (OS4RS)

The OS4RS's main duty is to divide the available computing resources among all executing tasks in an efficient and fair way. Therefore, the operating system will have to keep track of the available computing resources as well as the executing OS4RS tasks.

3.1. OS4RS Computing Resource Management

The operating system manages its computing resources by linking a processor information structure to every (programmable) computing unit in the system (e.g. ISP). It should be noted that it is also possible to register, for example, a softcore or a virtual machine as an OS4RS computing resource. This type of computing resource, further denoted as soft computing unit, provides the system with a new level of computing abstraction. A soft computing unit requires a host computing unit in order to be able to execute an OS4RS task (e.g. a softcore ASIP on top of an FPGA host or a virtual machine on top of an ISP host).

Every processor information structure contains a set of interface functions that completely describes the functionality of the computing resource. This means that for every registered resource, the OS4RS is able to instantiate/delete a task, suspend/resume a task, control inter-task communication and handle computing resource exceptions. The operating system is also able to monitor the state of the computing resource through a number of variables contained in its information structure. This mainly includes the load of the computing unit, the number of running tasks, the task setup time and, in case of a soft computing unit, a link to the host processor information structure.

Figure 1. OS4RS structure for managing different computing resources.

Figure 1 provides a typical example of a system, managed by OS4RS, that contains an ISP (a), an FPGA (b) and a soft computing unit (c). The soft computing unit will rely on a host computing unit (the ISP or the FPGA) to execute an OS4RS task. Each computing unit has an interface layer responsible for hiding its internal complexity from the OS4RS by providing the required processor information structure functions. In case of an ISP, we employ an existing RTOS as interface layer, since it readily provides all necessary processor information structure functions.
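The processor information structure described in this section can be sketched as follows. This is a minimal illustration, not the OS4RS implementation: the class name `ProcessorInfo`, the field names and the `host_chain` helper are assumptions introduced here, and only a subset of the interface functions listed above is modeled.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, List, Optional


def _no_op(task_name: str) -> None:
    """Default interface function used until an interface layer supplies one."""


@dataclass
class ProcessorInfo:
    """Hypothetical processor information structure for one computing unit."""

    name: str
    # Interface functions provided by the unit's interface layer.
    instantiate_task: Callable[[str], None] = _no_op
    delete_task: Callable[[str], None] = _no_op
    suspend_task: Callable[[str], None] = _no_op
    resume_task: Callable[[str], None] = _no_op
    # State variables the operating system can monitor.
    load: float = 0.0
    running_tasks: int = 0
    task_setup_time_us: float = 0.0
    # Link to the host structure; None for a bare resource (ISP or FPGA).
    host: Optional[ProcessorInfo] = None

    def is_soft(self) -> bool:
        """A soft computing unit is one that needs a host to execute tasks."""
        return self.host is not None

    def host_chain(self) -> List[str]:
        """Names from this unit down to the bare hardware resource."""
        chain: List[str] = []
        unit: Optional[ProcessorInfo] = self
        while unit is not None:
            chain.append(unit.name)
            unit = unit.host
        return chain


isp = ProcessorInfo("ISP")
fpga = ProcessorInfo("FPGA")
softcore = ProcessorInfo("softcore ASIP", host=fpga)
virtual_machine = ProcessorInfo("virtual machine", host=isp)
print(softcore.host_chain())
```

Following the `host` links from any soft computing unit always ends at a bare resource, which is what lets a hierarchy of arbitrary depth be described with a single structure type.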

3.2. OS4RS Task Management

The operating system keeps track of the tasks by means of a task information structure list. Every OS4RS task instantiation is linked to such a task information structure. The most important components are:

- The task state. This makes it possible, for example, to indicate that a certain task has not been assigned to a computing resource for execution, or that a task has been selected for relocation to a different computing resource.
- A list containing the available execution binaries and their respective properties, targeted at the different system computing resources.
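The two components listed above can be sketched as a small data structure. The names `TaskInfo`, `TaskBinary` and `TaskState` are hypothetical, and the `RUNNING` state is an assumption added for completeness. The binary sizes and setup times in the usage part are the values reported in Table 1.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Dict, Optional


class TaskState(Enum):
    UNASSIGNED = auto()               # not yet assigned to a computing resource
    RUNNING = auto()                  # assumption: not named in the text
    SELECTED_FOR_RELOCATION = auto()  # about to move to another resource


@dataclass
class TaskBinary:
    """One execution binary and its properties."""

    target: str            # type of computing resource it is built for
    size_bytes: int
    setup_time_us: float


@dataclass
class TaskInfo:
    """Hypothetical task information structure."""

    name: str
    state: TaskState = TaskState.UNASSIGNED
    binaries: Dict[str, TaskBinary] = field(default_factory=dict)

    def add_binary(self, binary: TaskBinary) -> None:
        self.binaries[binary.target] = binary

    def binary_for(self, target: str) -> Optional[TaskBinary]:
        """Return the binary for a resource type, or None if there is none."""
        return self.binaries.get(target)

    def smallest_binary(self) -> TaskBinary:
        return min(self.binaries.values(), key=lambda b: b.size_bytes)


edge = TaskInfo("Edge Detection")
edge.add_binary(TaskBinary("Lezard16", 234, 80.0))
edge.add_binary(TaskBinary("parameterizable filter", 18, 17.0))
edge.add_binary(TaskBinary("dedicated configuration", 521636, 108000.0))
print(edge.smallest_binary().target)
```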

3.3. Hierarchical Reconfiguration Mechanism

The different actions the OS4RS needs to perform whenever it wants to execute an OS4RS task on a computing resource are detailed in Figure 2. In case the task is assigned to a soft computing unit, the operating system needs to (recursively) allocate a suitable host computing unit and link both computing resources by means of their respective processor information structures.

Figure 2. (a) Algorithm to link the different computing units in a hierarchical way, in order to execute task "X" on top of Computing Unit "Y". (b) List of hierarchically linked computing units.

As shown in Figure 3, setting up and starting a task on any computing unit is done by means of the interface functions present in its processor information structure. Notice that it is the responsibility of the soft computing unit to set up an instance of itself on the host computing unit prior to execution of the assigned task. This is possible because the processor information structures form a hierarchically linked list (Figure 2). Naturally, if the required soft computing instance is already present, the task setup phase only requires reconfiguring the instance (e.g. applying new filter coefficients in case of a run-time configurable filter block). The reason for this hierarchical way of setting up and starting a task is that configuration is strongly dependent on the implementation of the soft computing unit (e.g. an ASIP will have to configure some program memory, while a coarse-grain filter block only needs to configure the filter components).

Figure 3. Setting up and starting an OS4RS task is achieved by recursively setting up and starting every computing unit in the hierarchy.

Notice that, although Figure 3 illustrates just two hierarchical levels of reconfiguration, the operating system is not bound to that limitation.

4. Use of Hierarchical Configuration in Reconfigurable Systems

Employing an operating system that is able to handle hierarchical (re)configuration opens up a wide range of (research) possibilities, since any kind of abstraction layer can be placed on top of another registered computing resource. This section discusses the benefits of hierarchical reconfiguration not only for the application designer, but also for the management of the run-time reconfigurable resources.

4.1. The InterConnection Network (ICN)

As presented in Section 5, our experimental platform contains both an ISP and an FPGA. In order to introduce more flexibility in managing the reconfigurable hardware, several authors [3, 4] proposed adding an abstraction layer on top of the FPGA. Their approach consists of pre-partitioning the reconfigurable logic into a fixed number of fixed-size run-time reconfigurable areas, denoted as tiles, and implementing a communication infrastructure between them. This abstraction layer facilitates the management of the reconfigurable hardware resources since it provides a way to perform run-time task placement, and to control the inter-task communication in a straightforward way, as described in [6]. We chose to partition our FPGA into two tiles, both used as independently manageable run-time reconfigurable soft computing resources. The communication between them is implemented using a packet-switched interconnection network (ICN), described by [3] and illustrated in Figure 4b.

Figure 4. (a) Top view of an FPGA partitioned in two tiles communicating through ICN. (b) The ICN implements an abstraction layer on top of the FPGA, providing two reconfigurable hardware computing units to OS4RS.

The tasks communicate through messages that are encapsulated in packets and transmitted using the network services. The network is made up of routers that dispatch the packets according to the tile address and a port number, both specified in the message packet header. The inter-tile communication is run-time reconfigurable, by overwriting the routers' routing tables. In this way, the operating system is able to adapt the inter-task communication according to the tiles' configurations.

4.2. Efficient Use of Reconfigurable Tile Area

The disadvantage of using a pre-partitioned reconfigurable area, as described in Section 4.1, is that it results in a fixed number of fixed-size tiles. In case of a size mismatch between the hardware task and the reconfigurable tile, many valuable reconfigurable resources can be wasted. In case the tiles are systematically too large, a lot of reconfigurable surface remains unused (this is called internal fragmentation [7]). This implies that the operating system will have a hard time managing the computing resources in an efficient way. In addition, the task reconfiguration time and the size of the partial bitstream describing the hardware task are coupled to the size of the tile and not to the actual size of the hardware task. In case the tiles are systematically too small, the designer will be obliged to split an oversized dedicated hardware task into multiple smaller tasks, each of which occupies a tile. Due to the limited number of tiles, it will be difficult to divide the computing resource in a fair way among all running applications.

Solving a size mismatch problem at run-time is not very appealing, since this would require repartitioning the reconfigurable logic. This not only implies a lengthy full reconfiguration of the entire ICN, it also requires the OS4RS to preempt all tasks running on the reconfigurable logic and to restart them once the repartitioning has finished. By using an operating system for reconfigurable systems capable of handling hierarchical reconfiguration, one could use the following design-time solutions.

In case the hardware tasks are significantly smaller than the reconfigurable tiles (Figure 5a), the designer might consider using or creating a second hierarchical network level by registering a special type of soft computing unit: a multiplexer block. This type of soft computing unit is in fact an extra abstraction layer that allows placing multiple small dedicated hardware tasks into one reconfigurable ICN tile (Figure 5b). The main job of this multiplexer block would be to perform, in association with the OS4RS, some kind of port masquerading: depending on the port number of an incoming data message, the message is dispatched to 'Task x' or 'Task y'.

Figure 5. (a) ’Task x’ and ’Task y’ each allocate an ICN tile, with a lot of unused reconfigurable hardware area as a consequence. (b) By using an additional abstraction layer, it is possible to reduce the amount of unused reconfigurable area.
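The two dispatch levels involved in Figure 5b, routing by tile address and then port masquerading inside the tile, can be sketched as follows. This is a simplified software model under stated assumptions: the class names `Router` and `MultiplexerBlock` are hypothetical, the network is collapsed into a single routing table rather than a mesh of routers, and a message is modeled as a `(tile address, port number, payload)` tuple.

```python
from typing import Callable, Dict, List, Tuple

Message = Tuple[int, int, bytes]  # (tile address, port number, payload)


class MultiplexerBlock:
    """Soft computing unit that hosts several small tasks inside one tile."""

    def __init__(self) -> None:
        self._tasks: Dict[int, Callable[[bytes], None]] = {}

    def attach(self, port: int, task: Callable[[bytes], None]) -> None:
        """Bind a task to a port number within this tile."""
        self._tasks[port] = task

    def deliver(self, port: int, payload: bytes) -> None:
        # Port masquerading: the port number selects the task.
        self._tasks[port](payload)


class Router:
    """Simplified model: one routing table maps tile addresses to tiles."""

    def __init__(self) -> None:
        self.routing_table: Dict[int, MultiplexerBlock] = {}

    def set_route(self, tile_address: int, tile: MultiplexerBlock) -> None:
        """Overwrite a routing table entry, as the OS would at run-time."""
        self.routing_table[tile_address] = tile

    def dispatch(self, message: Message) -> None:
        tile_address, port, payload = message
        self.routing_table[tile_address].deliver(port, payload)


received_x: List[bytes] = []
received_y: List[bytes] = []

tile = MultiplexerBlock()
tile.attach(1, received_x.append)  # 'Task x'
tile.attach(2, received_y.append)  # 'Task y'

router = Router()
router.set_route(0, tile)
router.dispatch((0, 1, b"for x"))
router.dispatch((0, 2, b"for y"))
```

Both tasks share tile address 0 and are told apart only by the port number, so the routers need no knowledge of how many tasks a tile contains.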

In case a dedicated hardware task is too large, the designer might consider using a specialized soft computing unit (DSP or ASIP) to perform the task. In previous experiments, we found that a full TCP/IP implementation requires up to 120% of a Virtex XCV800 FPGA. By using an in-house softcore, we were able to reduce this to 50% without significant loss of performance. This implies that a soft computing unit might prove to be useful for tasks that are quite complex and would require a very large state machine. In addition, the use of a microprocessor tends to be more efficient in case of control-intensive tasks [10]. The only possible downside is that the soft computing unit might not live up to the performance or power requirements. It is up to the designer to decide if these properties are considered important for the task or application under development.

4.3. Dynamic Task Relocation

The ability to relocate tasks among heterogeneous processors is an interesting research topic in the scope of run-time reconfigurable systems. This ability should lead to a more efficient use of resources, since it makes it possible, for example, to implement a dynamic load balancing scheme controlled by the OS4RS. There are, essentially, two different techniques to deal with dynamic task relocation: the translation-based technique and the interpretation-based technique [9, 12].

The translation-based technique is used when a task needs to be relocated between heterogeneous computing units (e.g. from an ICN tile to an ISP or vice versa). In this case, relocating a task first of all requires that the task binary is present for both computing unit types or that the operating system is able to perform a run-time translation. Secondly, in order to seamlessly continue task execution, the operating system needs to transfer the task state from one computing unit to another. If the task state is not kept in an application-dependent processor-independent form, as described in [6], then this technique also implies a translation of the task state by the OS4RS.

The interpretation-based technique can be used in case the task is executing on a soft computing unit available for both types of host computing units involved in the dynamic task relocation. This, in fact, boils down to having the same virtual processor emulated on both the origin and target processor in the relocation process. In this case there is no need to have multiple binary representations per task, which is quite important when dealing with limited storage devices (embedded architectures). Furthermore, transferring task state is quite straightforward, in contrast to the translation technique. The interpretation-based technique would, for example, allow a task to be started on a Java Virtual Machine running on top of an ISP and then be relocated to a Java Processor Core [13], fitted into an ICN tile.
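The choice between the two relocation techniques can be summarized as a small decision function. This is only a sketch of the reasoning in this section: the function name, its parameters and the returned strings are hypothetical, and state transfer is not modeled.

```python
from typing import Dict, Optional, Set


def choose_relocation_technique(
    virtual_processor: Optional[str],
    source_host: str,
    target_host: str,
    emulated_on: Dict[str, Set[str]],
    binaries_for: Set[str],
) -> str:
    """Pick a relocation technique for one task.

    virtual_processor: soft computing unit the task runs on, or None.
    emulated_on: host types on which each virtual processor is available.
    binaries_for: computing unit types the task has a native binary for.
    """
    hosts = emulated_on.get(virtual_processor, set())
    if source_host in hosts and target_host in hosts:
        # Same virtual processor on both sides: a single binary suffices.
        return "interpretation"
    if target_host in binaries_for:
        # A binary for the target exists; task state may need translating.
        return "translation"
    return "run-time translation required"


emulated_on = {"Java": {"ISP", "ICN tile"}}

java_task = choose_relocation_technique(
    "Java", "ISP", "ICN tile", emulated_on, set()
)
hardware_task = choose_relocation_technique(
    None, "ICN tile", "ISP", emulated_on, {"ICN tile", "ISP"}
)
print(java_task, hardware_task)
```

The first call mirrors the Java example above: the virtual processor is available on both hosts, so no second binary is needed.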

4.4. Benefits of Using Hierarchical Configuration

The benefits acquired by executing a task on a soft computing unit, like an ASIP or a coarse-grain reconfigurable block, instead of using a dedicated hardware configuration can be split into two categories: the design-time benefits and the run-time benefits.

At run-time, the use of such a soft computing unit results in less inter-task interference, since the time needed to perform a run-time reconfiguration is considerably reduced [5]. This reduced task setup time, in turn, makes multitasking on an ICN tile by means of temporal scheduling feasible. Finally, the amount of storage space (memory and disk space) needed to store a task binary can be reduced by at least an order of magnitude.

The design-time benefits of using this kind of soft computing unit can be summarized as follows:

- Faster development due to the widespread availability of software development tools and simulators. This makes it possible, for example, to use a software task while prototyping. Meanwhile a dedicated high-performance hardware configuration can be developed.
- Possible reuse of legacy code, which avoids going through a lengthy design and debug phase.

5. Proof of Concept

To demonstrate the hierarchical run-time reconfiguration capabilities of our operating system, we have developed both a softcore ISP and a parameterizable filter block as soft computing units. Both units require an ICN tile as host computing resource. The following sections describe the platform, the softcore ISP, the parameterizable filter block and an example application.

5.1. Experimental Setup

The reconfigurable computing platform we use is an in-house platform composed of a commercial Compaq iPAQ 3760 and a Xilinx VIRTEX-II FPGA. The iPAQ is a personal digital assistant (PDA) that features a StrongARM SA-1110 ISP and an expansion bus that allows connection of an external device. The on-board FPGA is a XC2V6000 device containing 6000k system gates. The FPGA is mounted on a generic prototyping board connected to the iPAQ via the expansion bus. We developed a soft packet-switched interconnection network (ICN) composed of two reconfigurable tiles as soft computing units. These tiles are run-time reconfigurable by using the partial reconfiguration feature of the VIRTEX-II devices.

5.2. The Lezard16 Soft Computing Unit

As one of the soft computing units, we use a Lezard16 ISP. This in-house developed softcore is based on the Xilinx PicoBlaze microcontroller [11]. It is a 16-bit microcontroller with an 18-bit instruction word. The Lezard16 instruction set is similar to that of the PicoBlaze, except that the Lezard16 is not able to handle interrupts. Furthermore, it features a 1024 instruction word deep program memory, as opposed to the 256 instruction word memory of the PicoBlaze. The program memory is implemented as a dual-port memory, allowing the program code to be updated through the second access port. To enable the program code download, we encapsulate the Lezard16 into an ICN tile wrapper (Figure 6). This wrapper filters the data that comes from the interconnection network. Lezard16 configuration data is put into the program memory, while the application data is forwarded to the task running on top of the Lezard16.
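The filtering role of the ICN tile wrapper can be modeled in software as follows. This is a behavioral sketch only: the text does not say how the wrapper tells configuration data from application data, so the `is_config` flag and the explicit `address` argument are assumptions, as is the class name `Lezard16Wrapper`. The memory depth and word widths are those stated above.

```python
from typing import List

PROGRAM_MEMORY_DEPTH = 1024   # instruction words
INSTRUCTION_WORD_BITS = 18
DATA_WORD_BITS = 16


class Lezard16Wrapper:
    """Behavioral model of the wrapper around the softcore."""

    def __init__(self) -> None:
        self.program_memory: List[int] = [0] * PROGRAM_MEMORY_DEPTH
        self.task_input: List[int] = []

    def receive(self, is_config: bool, address: int, word: int) -> None:
        """Handle one word arriving from the interconnection network."""
        if is_config:
            # Configuration data goes to the program memory, modeling
            # a write through the second port of the dual-port memory.
            if not 0 <= address < PROGRAM_MEMORY_DEPTH:
                raise ValueError("address outside program memory")
            if not 0 <= word < (1 << INSTRUCTION_WORD_BITS):
                raise ValueError("instruction word wider than 18 bits")
            self.program_memory[address] = word
        else:
            # Application data is forwarded to the running task.
            if not 0 <= word < (1 << DATA_WORD_BITS):
                raise ValueError("data word wider than 16 bits")
            self.task_input.append(word)


wrapper = Lezard16Wrapper()
wrapper.receive(is_config=True, address=3, word=0x2ABCD)
wrapper.receive(is_config=False, address=0, word=0x1234)
```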

Figure 6. The Lezard16 microcontroller instantiated into an ICN Tile wrapper.

Implemented on our VIRTEX-II device, the Lezard16 uses 204 slices and the encapsulated version uses 314 slices. The program memory exactly fits one BlockRAM.

5.3. Parameterizable Filter Block

The second soft computing unit is a parameterizable filter block. The filter block contains nine run-time reconfigurable filter coefficients. This implies that an OS4RS task executing on this type of soft computing unit is defined by nine 16-bit words. Furthermore, there is no need for any kind of design-time compilation phase. This parameterizable filter block implementation requires 280 slices and 27 multipliers.

5.4. Example Application

In order to illustrate the capability of our OS4RS to transparently handle hierarchical reconfiguration, we developed an Image Edge Detection application. As Figure 7 illustrates, the application is built using two communicating OS4RS tasks. The Data Retrieval and Display task (executing on the ISP) is responsible for dividing the original image into small blocks and sending them to the Edge Detection task. The Edge Detection task performs some image filtering using specific filter coefficients. Once the data block has been processed, it is sent back in order to be displayed. We have developed three versions of the Edge Detection task: a dedicated hardware configuration (Figure 7a), a Lezard16 implementation (Figure 7b) and an implementation based on a parameterizable filter (Figure 7c).

Figure 7. Edge detection application containing two communicating tasks. The Edge Detection Task is implemented in three different ways: (a) as a dedicated hardware configuration, (b) as a Lezard16 implementation, (c) using a parameterizable filter block.

The experimental results, detailed in Table 1, clearly show that the Lezard16 implementation is considerably slower than the dedicated hardware configuration. This is partly caused by the fact that the instruction set and the processing power of the Lezard16 are quite limited (e.g. no multiplication instruction). However, the size of the Lezard16 task binary is orders of magnitude smaller than that of the dedicated hardware configuration and consequently it does not require a lengthy reconfiguration time. The size of the dedicated hardware configuration, as explained in Section 4.2, is caused by the fact that the size of the task binary (an uncompressed partial bitstream) is coupled to the size of the reconfigurable tile. By replacing the Lezard16 with a parameterizable filter block, the performance significantly increases, while the size of the task binary remains small. However, the flexibility of the parameterizable filter block solution is much lower than that of the Lezard16. Again, it is up to the application designer to determine the ideal properties for the application.

The time needed by the OS4RS to set up a certain configuration hierarchy (Figure 7) depends on the number of hierarchical levels, the total number of registered computing units, the relative position of their processor information structures in the OS4RS linked list and the number of already linked soft computing units in the hierarchy. Typically the OS4RS requires between 0.2 µs and 10 µs (not including the actual configuration) to set up the hierarchy.

Table 1. Details about the different Edge Detection implementations (@ 40 MHz).

Implementation   Performance   Reconf. Time   Task Size      Flexibility
Lezard16          0.91 f/s     80 µs          234 bytes      Good
Param. Filt.     43.40 f/s     17 µs          18 bytes       Very Poor
Ded. Conf.       97.46 f/s     108 ms         521636 bytes   Nil
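The 18-byte task size reported in Table 1 for the parameterizable filter follows directly from Section 5.3: nine coefficients of 16 bits each. The sketch below packs such a task binary and applies it to one 3x3 pixel window. The byte order, the signed interpretation of the coefficients, the 3x3 window layout and the Laplacian-style example kernel are all assumptions made for illustration; the paper does not give the actual coefficients or binary format.

```python
import struct
from typing import Sequence

COEFFICIENT_COUNT = 9
TASK_FORMAT = "<9h"  # assumption: nine little-endian signed 16-bit words


def pack_filter_task(coefficients: Sequence[int]) -> bytes:
    """Build the task binary for the filter block from nine coefficients."""
    if len(coefficients) != COEFFICIENT_COUNT:
        raise ValueError("the filter block takes exactly nine coefficients")
    return struct.pack(TASK_FORMAT, *coefficients)


def filter_pixel(task_binary: bytes, window: Sequence[int]) -> int:
    """Apply the configured coefficients to one 3x3 window (row-major)."""
    if len(window) != COEFFICIENT_COUNT:
        raise ValueError("the window must hold nine pixel values")
    coefficients = struct.unpack(TASK_FORMAT, task_binary)
    return sum(c * p for c, p in zip(coefficients, window))


# Hypothetical edge-detection kernel, chosen only for this example.
edge_kernel = [0, -1, 0,
               -1, 4, -1,
               0, -1, 0]
task_binary = pack_filter_task(edge_kernel)

flat_area = filter_pixel(task_binary, [7] * 9)
isolated_point = filter_pixel(task_binary, [0, 0, 0, 0, 10, 0, 0, 0, 0])
print(len(task_binary), flat_area, isolated_point)
```

Reconfiguring the task amounts to sending a different 18-byte binary, which is consistent with the very short reconfiguration time in Table 1 compared to loading a partial bitstream.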

6. Conclusion

This paper illustrates how our operating system for reconfigurable systems (OS4RS) allows full exploitation of the different hierarchical levels of programming offered by a heterogeneous reconfigurable system. By introducing the soft computing unit concept, the operating system is able to provide both design-time and run-time benefits such as efficient use of reconfigurable area, faster reconfiguration and dynamic task relocation. In order to demonstrate these capabilities, we have developed a proof-of-concept application, based on the in-house Lezard16 ISP core and a parameterizable filter block. An Image Edge Detection application illustrates some trade-off decisions the designer will encounter while developing an application.

Acknowledgements

Part of this research has been funded by the European Commission through the IST-AMDREL project (Contract No IST-2001-34379) and by the Flemish Government through the GBOU-RESUME project (Contract No IWT020174-RESUME).

References

[1] V. Nollet, P. Coene, D. Verkest, S. Vernalde, R. Lauwereins, "Designing an Operating System for a Heterogeneous Reconfigurable SoC", Reconfigurable Architectures Workshop (RAW 2003), Nice, France, April 2003. (accepted)
[2] P. Schaumont, I. Verbauwhede, M. Sarrafzadeh, K. Keutzer, "A Quick Safari Through the Reconfiguration Jungle", Proc. of the 38th Design Automation Conference, DAC 2001, p172-177, Las Vegas, Nevada, USA.
[3] T. Marescaux, A. Bartic, D. Verkest, S. Vernalde, R. Lauwereins, "Interconnection Networks Enable Fine-Grain Dynamic Multi-Tasking on FPGA's", FPL'02, p795-805, Montpellier, France.
[4] P. Merino, M. Jacome, J.C. Lopez, "A Methodology for Task Based Partitioning and Scheduling of Dynamically Reconfigurable Systems", Proc. IEEE Symp. on FPGA's for Custom Computing Machines (FCCM), p324-325, 1998.
[5] S. Ogrenci, M. Sarrafzadeh, "Strategically Reconfigurable Systems", RAW 2001.
[6] J-Y. Mignolet, V. Nollet, P. Coene, D. Verkest, S. Vernalde, R. Lauwereins, "Infrastructure for design and management of relocatable tasks in a heterogeneous reconfigurable system-on-chip", Proceedings of Design, Automation and Test in Europe (DATE) Conference, pp. 986-991, Munich, Germany, March 2003.
[7] G. Wigley, D. Kearney, "Research Issues in Operating Systems for Reconfigurable Computing", Proc. of the 2nd International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA'02), p10-16, Las Vegas, Nevada, USA, June 2002.
[8] G. Wigley, D. Kearney, "The Development of an Operating System for Reconfigurable Computing", Proc. IEEE Symp. FCCM'01, April 2001, IEEE Press.
[9] P. Chen, B. Noble, "When Virtual Is Better Than Real", Proc. 8th Workshop on Hot Topics in Operating Systems, May 2001, p20-23.
[10] http://www.xilinx.com/company/press/kits/v2pro/backgrounder.pdf
[11] Xilinx Inc., PicoBlaze 8-bit Microcontroller for Virtex Devices, XAPP213 (v1.2), April 2002.
[12] G. Attardi, I. Filotti, J. Marks, "Techniques for Dynamic Software Migration", ESPRIT '88: Proc. of the 5th Annual ESPRIT Conference, CEC (eds.), p475-491, North-Holland, 1989.
[13] http://www.sun.com/microelectronics/picoJava/overview.html
