INTEGRATION, the VLSI journal 38 (2004) 107–130

Run-time support for heterogeneous multitasking on reconfigurable SoCs

T. Marescaux*, V. Nollet, J.-Y. Mignolet, A. Bartic, W. Moffat, P. Avasare, P. Coene, D. Verkest, S. Vernalde, R. Lauwereins

DESICS, IMEC vzw, Kapeldreef 75, B-3001 Leuven, Belgium; Katholieke Universiteit Leuven, Belgium; Vrije Universiteit Brussel, Belgium
Received 31 July 2003; received in revised form 14 January 2004; accepted 19 March 2004
Abstract

In complex reconfigurable systems on chip (SoC), the dynamism of applications requires an efficient management of the platform. To allow run-time management of heterogeneous resources, operating systems (OS) and reconfigurable SoC platforms should be developed together. For run-time support of reconfigurable architectures, the OS must abstract the reconfigurable computing resources and provide an efficient communication layer. This paper presents our efforts to simultaneously develop the run-time support and the communication layer of reconfigurable SoCs. We show that networks-on-chip (NoC) are an ideal communication layer for dynamically reconfigurable SoCs, explain how our OS provides run-time support for dynamic task relocation and detail how hardware parts of the OS are integrated into the higher layers of the NoC. An implementation of the OS and of the dedicated communication layer on our reconfigurable architecture supports the concepts we describe. © 2004 Elsevier B.V. All rights reserved.

Keywords: Reconfigurable architecture; Operating system; Network on chip
Part of this research has been funded by the EC through the IST-AMDREL project (IST-2001-34379) and by Xilinx Labs, Xilinx Inc. R&D group.
*Corresponding author. Tel.: +32-16-28-83-37; fax: +32-16-28-15-15. E-mail addresses: [email protected], [email protected] (T. Marescaux).
0167-9260/$ - see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.vlsi.2004.03.002

1. Introduction

In order to meet the ever-increasing design complexity, future sub-100 nm platforms [1,2] will consist of a mixture of heterogeneous computing resources, storage elements, hardware accelerators, etc. (denoted as tiles). These programmable/reconfigurable tiles will be interconnected by networks-on-chip (NoC).

Fig. 1. A global view of a heterogeneous multiprocessor system.

The integration of these heterogeneous resources into a single tile-based system, illustrated by Fig. 1, raises the question of how this system should be managed. Obviously, the management infrastructure should ease application development by shielding the programmer from the complexity of the system and by providing a clear application development interface. From the application point of view, the management infrastructure should provide a run-time environment where several applications can execute concurrently, with minimal interference between them, but with support for data and control communication. In addition, this infrastructure is responsible for monitoring and managing the available resources in a consistent, efficient and fair way. The novelty of our approach resides in the seamless integration of reconfigurable hardware in a multiprocessor system completely managed by an operating system (OS).

The remainder of this paper is organized as follows. Section 2 motivates the use of a management infrastructure. Section 3 details several key operating system components. Section 4 illustrates the use of hardware to support low-level OS functionality such as inter-task communication. Section 5 describes the life cycle of an application. Section 6 provides implementation details about the management infrastructure. Section 7 briefly discusses the related work. Section 8 provides an overview of the future research. Finally, conclusions are drawn in Section 9.
2. Management infrastructure motivation

A considerable part of the functionality needed to manage this heterogeneous tile-based system will be handled by an OS. Two distinct approaches to OS management can be identified:

* Treating computing tiles as peripheral devices. In this case the host OS typically provides device driver support allowing applications to directly access these devices, which should ensure maximum performance. The downside of this approach is that the application designer not only needs to be aware of the low-level aspects of every device, he/she is also responsible for the run-time management of the devices. In addition, this approach severely limits the portability of the applications, and much of the management functionality is duplicated in every application.
* Treating computing tiles as regular computing resources. The OS deals with the low-level aspects and the management of the heterogeneous resources, allowing the application designer to concentrate on the application's functionality. This technique also facilitates sharing these computing resources between simultaneously active tasks owned by unrelated applications.
We believe it is timely to develop an integrated management infrastructure that enables the full potential of the heterogeneous multiprocessor system, illustrated by Fig. 2. Such an infrastructure should provide greater ease-of-use and consistency by providing a suited abstraction for the heterogeneous computing resources. In addition, it should enable sharing the computing resources among multiple, unrelated applications.

Instead of creating the OS from scratch, we decided to extend an existing OS. This approach allows us to focus on the issues regarding the management of the heterogeneous resources, since support for regular software tasks is already present. In addition, this technique should make it possible to separate the extensions from the underlying base OS, ensuring portability to other platforms and OSs. As Fig. 2 illustrates, the largest part of our operating system for reconfigurable systems (OS4RS) executes on top of the instruction set processor (master ISP). However, depending on the amount of hardware support for the OS (cf. Section 4), a small amount of low-level management functions effectively 'executes' on top of the slave processing units.

According to the classification of distributed OSs described in [3], our OS can be classified as a master–slave configuration. This implies that one processing unit (the master ISP) is responsible for monitoring the status of the heterogeneous system and for assigning work to all the other processing units (slaves). In terms of classical distributed OSs, this configuration is considered to be the most straightforward type. A possible drawback of this type of system is that the master processor becomes a bottleneck and, consequently, fails to fully utilize the processing potential of the system. However, this is unlikely to happen in our case due to the small number of slave units and the heterogeneity of the system (i.e. the number of tasks that are able to run on top of a certain resource should be small compared to the total number of tasks in the system).
Fig. 2. OS4RS handling a heterogeneous reconfigurable SoC.
3. OS components for the targeted system

This section presents the necessary structures required by the OS to keep track of both tasks and computing resources and describes the OS components responsible for task scheduling and inter-task communication.

3.1. OS4RS task descriptors

The OS keeps track of the tasks by means of a task descriptor list. This list contains a descriptor for every OS4RS task instantiation. The most important descriptor components are (a sketch follows the list):

* The task logical ID. This unique ID is assigned by OS4RS at task initialization time and allows addressing tasks independent of their location (i.e. current computing unit) within the system.
* The task state. This allows, for example, to indicate that a certain task has been assigned to a computing resource for execution, or that a task has been selected for relocation to a different computing resource.
* A list containing the available execution binaries and their respective properties, targeted at the different heterogeneous computing resources.
* A link to the task destination look-up table (DLT) (cf. Section 3.4).
* A link to the computing unit descriptor that currently executes the task (cf. Section 3.2).
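The paper gives the descriptor only as the list above; as a minimal C sketch (all type and field names are our own invention, not taken from the OS4RS sources), it might look as follows:

```c
#include <stddef.h>
#include <stdint.h>

/* One execution binary of a task, targeted at one kind of computing
 * resource (SW binary, fine- or coarse-grain bitstream). */
struct task_binary {
    enum { BIN_SW, BIN_FINE_GRAIN, BIN_COARSE_GRAIN } type;
    const void *image;            /* binary or bitstream data */
    size_t size;
    struct task_binary *next;     /* list of available representations */
};

/* OS4RS task descriptor: one per task instantiation, kept in a list. */
struct os4rs_task {
    uint16_t logical_id;          /* location-independent task address */
    enum { T_CREATED, T_ASSIGNED, T_RUNNING,
           T_SELECTED_FOR_RELOCATION } state;
    struct task_binary *binaries; /* available execution binaries */
    struct dlt *dlt;              /* destination look-up table (Sect. 3.4) */
    struct comp_unit *unit;       /* computing unit executing the task */
    struct os4rs_task *next;      /* task descriptor list linkage */
};
```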
3.2. OS4RS computing unit descriptors

The operating system manages its computing resources by maintaining a linked list of computing unit descriptors. Every computing unit descriptor contains a set of 11 functions that completely describes the capabilities of the computing resource. The most important functions allow the operating system to:

* Set up a task. This function requires the task logical ID and the task binary as input from the OS.
* Initialize a task. This function allows the OS to reset a task and, if necessary, initialize it with a previously captured task state.
* Start a task. This function allows the OS to start a task with a specified logical ID.
* Remove a task. This function allows the OS to remove a task with a specified logical ID.
* Signal a task. This function allows the OS to send a switch signal to a certain task, to suspend/resume a task, etc. It requires a signal identifier and a task logical ID.
* Control inter-task communication. This function allows the OS to initialize/update the DLT (detailed in Section 3.4) of a specified task.
* Handle computing resource exceptions. This is a call-back function that allows the computing resource to signal exceptions to the OS.
Furthermore, the computing unit descriptor allows the OS to monitor the state of the computing unit through a number of variables like the load of the computing unit, the number of running tasks, the task set up time, etc.
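Continuing the sketch from Section 3.1, the computing unit descriptor is essentially a table of function pointers plus monitoring variables. Only the seven functions listed above are shown, and all signatures are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

struct task_binary;  /* see the task descriptor sketch in Section 3.1 */
struct dlt;

struct comp_unit {
    /* task management */
    int (*setup_task)(struct comp_unit *cu, uint16_t id,
                      const struct task_binary *bin);
    int (*init_task)(struct comp_unit *cu, uint16_t id,
                     const void *state, size_t state_size);
    int (*start_task)(struct comp_unit *cu, uint16_t id);
    int (*remove_task)(struct comp_unit *cu, uint16_t id);
    int (*signal_task)(struct comp_unit *cu, uint16_t id, int signal);
    /* inter-task communication: install or update a task's DLT */
    int (*set_dlt)(struct comp_unit *cu, uint16_t id,
                   const struct dlt *dlt);
    /* call-back through which the resource signals exceptions to the OS */
    void (*on_exception)(struct comp_unit *cu, uint16_t id, int cause);

    /* monitoring variables read by the top-level scheduler */
    unsigned load;               /* current load of the computing unit */
    unsigned running_tasks;      /* number of tasks assigned to it */
    unsigned setup_time_us;      /* typical task set-up time */
    struct comp_unit *next;      /* computing unit descriptor list */
};
```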
3.3. OS4RS task scheduling

The OS employs a dynamic two-level scheduling technique, illustrated by Fig. 3. The top-level scheduler assigns tasks to computing resources in response to timer ticks and external events (e.g. user interaction, cf. Section 5.2). This mapping is based on the information that resides in the OS4RS task descriptors (e.g. can this task be mapped on this resource?) and the computing unit descriptors (e.g. can this computing resource handle this extra task?). The top-level scheduler dispatches a task to a local scheduler by instantiating this task on a certain computing unit. This implies the creation of a computing unit specific task descriptor (referred to as local task descriptor) by means of the interface functions present in the computing unit descriptor. The local scheduler, tied to a certain computing unit, is responsible for the temporal ordering of the tasks that have been assigned to that computing unit.

The top-level scheduler is able to move tasks between heterogeneous computing resources by using task-specific contexts [4,5]. These contexts are not defined in terms of computing unit state such as registers and status flags, but describe the state of the task itself at specific points in time (switch points or preemption points). Only at these instants can tasks be moved between heterogeneous resources. As a consequence, the rescheduling of tasks is performed in a cooperative way. For the temporal ordering of their assigned tasks, computing unit local schedulers can use their own (resource-specific) technique for capturing and restoring task state.

Fig. 3. Two-level scheduling mechanism: the top-level scheduler assigns tasks to computing resources based on the OS4RS task descriptor. The local schedulers are responsible for the local temporal scheduling, based on the information contained in the local task descriptor.

3.4. OS4RS inter-task communication

By using a uniform communication scheme for all tasks (independent of their mapping), relocating a certain task at run-time between heterogeneous computing resources does not affect the way other tasks communicate with it. This uniform communication scheme is based on message passing, since this communication type is supported by the underlying hardware (cf. Section 4.2). To support point-to-multipoint and multipoint-to-point communication, we have introduced the notion of input and output ports for the tasks. Messages sent by one task to two other tasks are distinguished by the output port number they are sent on.

During application initialization, the OS assigns a system-wide unique logical ID to every task in the application. The top-level scheduler (cf. Fig. 3) maps the task on a computing resource, which determines its physical address on the platform. In addition, the application provides the OS with a task graph detailing the application's inter-task communication (Fig. 4). Thus, for every output port of a task the OS defines a triplet (destination input port, destination logical address, destination physical address). For instance, task C has two output ports, hence is assigned two triplets, which compose its DLT (Fig. 5). In our system a task may have up to 16 output ports, thus there are 16 entries in a DLT. Whenever a task gets instantiated, the OS updates its associated DLT and sends it to the computing resource responsible for the task execution. In addition, the DLTs of all tasks communicating with this newly instantiated task are updated.

Fig. 4. Application task graph showing input–output port connections.

Fig. 5. DLTs for every task in the graph.
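As an illustration, the DLT of task C from Figs. 4 and 5 could be encoded as below. The entry layout, port numbers, logical IDs and tile addresses are invented, since the paper only specifies the triplet contents and the 16-entry table size:

```c
#include <stdint.h>

#define DLT_ENTRIES 16                /* one entry per output port */

struct dlt_entry {
    uint8_t  dest_port;               /* input port at the destination */
    uint16_t dest_logical;            /* logical ID of the destination */
    uint16_t dest_phys;               /* physical (tile) address */
    uint8_t  valid;
};

struct dlt {
    struct dlt_entry entry[DLT_ENTRIES];
};

/* Task C has two output ports, hence two valid triplets. */
static struct dlt dlt_task_c = {
    .entry = {
        [0] = { .dest_port = 0, .dest_logical = 4,   /* first consumer */
                .dest_phys = 0x12, .valid = 1 },
        [1] = { .dest_port = 1, .dest_logical = 5,   /* second consumer */
                .dest_phys = 0x20, .valid = 1 },
    },
};
```

When the OS relocates a destination task, only the dest_phys field of the relevant entries has to be rewritten; the sending task keeps addressing the same logical ID.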
4. OS hardware support

It is quite common to have hardware support for the OS incorporated into contemporary microprocessors (e.g. the memory management unit). This hardware support allows the OS to perform some low-level management functions in a more efficient way. This section explains how an ensemble of NoCs and interfaces to the computing resources implements an efficient communication layer and supports the OS in three distinct domains:

* Task management. In order to instantiate/remove tasks from a certain computing resource, the OS requires efficient access to its configuration/programming mechanism.
* Inter-task communication. Traditionally, the OS offers some form of inter-task communication service to the applications. A straightforward solution would be to pass all communication through the OS running on the main ISP. Obviously, this solution lacks efficiency on a heterogeneous tile-based system, since the main ISP would spend most of its time moving data from one location to another. A better solution is to allow the different (slave) computing resources to communicate with each other in a way controlled by the OS, but without having to pass through the main ISP.
* Operation and Management (OAM). The OS needs to keep track of the behavior of the different tasks executing on all the computing resources in terms of communication and security. It would be inefficient for the OS to waste resources of the main ISP to continuously monitor the slaves. A hardware component that performs the tracking and provides the OS with the requested information and events is therefore a must. Such a hardware component also facilitates the debugging process.
Section 4.1 details how the communication on our platform is naturally split into three channels: data communication, control communication and reconfiguration communication. Section 4.2 then focuses on the OSI [6] network layer of the data communication, whereas Section 4.3 looks at the network layers of the control and reconfiguration communications. The transport layers of data and control communications, implemented in their network interface components, are finally discussed in Section 4.4. We show how they implement efficient hardware support for the OS.

4.1. HW communication layer supports three types of communication

The requirements of the OS in terms of HW support define three types of communication on the platform: reconfiguration communication, inter-task communication and control communication. These three communication channels have different requirements:

* Inter-task communication. Application communication requires high bandwidth availability and an acceptable latency.
* OAM. The OAM messages of the OS are typically short, but must arrive in a timely manner. Therefore this communication channel requires a low bandwidth and must minimize latency.
* Task management. The requirements of this reconfiguration communication channel are more complex, because they depend on the type of reconfigurable logic present on the platform. Fine-grained reconfigurable logic, such as FPGAs, needs many bits to be configured and thus requires a high bandwidth, but can tolerate a longer latency because it is not reconfigured too often. On the contrary, coarse-grain logic needs fewer bits to be configured but is reconfigured frequently.
Because the communication requirements of these three channels are so different, we have decided to first implement each communication type on a separate network to efficiently interface the slave tiles to the OS running on the master ISP. Our platform thus has three networks: a reconfiguration network, a data network and a control network (Fig. 6). Every tile connects to each of the three networks by means of a specialized interface called a network interface component (NIC; the acronym overloads the commonly used Network Interface Card because it serves the similar role of abstracting a high-level computing resource from the low-level communication of the network).
Fig. 6. The platform uses three NoCs: a data network, a control network and a reconfiguration network.
There are three NICs: a data NIC, a control NIC and a reconfiguration NIC, connecting a tile to, respectively, the data, control and reconfiguration networks. Every tile can thus be compared to a Harvard processor architecture, with independent data, control and instruction busses corresponding, respectively, to the data, control and reconfiguration networks. The services implemented on these three networks, including their interfaces, compose the HW support for the OS. In addition to efficiency, a clean logical separation of the three types of communication in three communication paths ensures independence of application and OS. The OS does not need to care about the contents of the messages carried on the data network, and an application designer does not need to take into account OS OAM interactions.

4.2. A data network supports inter-task communication

The tasks running on the different computing resources make use of the services provided by the communication layer to directly exchange information. In this way data can be transferred from one task to another without involving the master ISP or the OS. The OS manages the whole communication infrastructure through functions implemented partially in software and partially in hardware, as Section 4.4 describes. It sets up the data flow between the tasks, as required by the applications, but the actual data exchange is handled by the communication infrastructure.

4.2.1. Implementation solutions

Different implementations are possible for the inter-task communication layer. For large heterogeneous systems the best solution is to separately design the computation and the communication resources and to interconnect them through standardized interfaces [7]. For run-time reconfigurable systems the separation becomes even mandatory. As mentioned in Section 2, our system contains, among other computing resources, reconfigurable hardware. As the reconfigurable hardware can be reprogrammed to satisfy the computation needs of the moment, the flow of data has to be modified as well. In the absence of standard interfaces,
implementing ad hoc, point-to-point connections between tasks implies solving a very complex and time-consuming run-time routing problem. It also requires a certain amount of free routing resources, which cannot be guaranteed. IPs can thus be interconnected by buses or networks. Our choice goes to networks, because they are a communication architecture with better scalability properties [8–12].

4.2.2. Data network design

To address the large spectrum of communication needs we have designed a highly flexible and scalable network. There are many different types of networks, as described in [13]. In our design we chose a direct, packet-switched network, looking to maximize the efficiency of bandwidth usage. A brief description of the data network is presented next; for more information see [14].

Networks are mainly characterized by their topology, switching technique and routing algorithm. The network building blocks are the routers. Our router is independently parameterizable in the number of input and output ports, therefore allowing routers with unequal numbers of inputs and outputs. All inputs and outputs are identical, whether they connect to a router or to a task. The extra functionality required by the communication with the tasks is implemented in the interfaces. The switching technique currently implemented is virtual cut-through (VCT), because it offers a low latency. The downside of this technique is the large buffer size required to store the packets [15].

The routing algorithm is another important design choice. Although our router uses a deterministic algorithm, it still offers a certain amount of adaptivity. It is currently implemented as a look-up table with an entry for each addressable task in the network. The content of the table can be changed by the OS, dynamically, at run-time. This implementation offers a higher-level form of adaptivity that can work well in networks with a known or relatively constant traffic pattern. The different tasks can be characterized in terms of the required data rate, and knowing the senders and the receivers for each task, the routing tables can be modified to balance the network traffic. Each time a task is started that produces a change in the communication between the tasks, a new optimum can be computed and the tables changed accordingly. Also, the OS gathers statistical information about the data flow in the network. It can therefore identify the connections that have a high load for a long time and try to adapt the routing tables in order to spread the traffic. Having one entry for each tile in every router offers the possibility to customize the routing for each tile at network level.

Communicating tasks should preferably be placed next to each other. In reconfigurable systems adjacent resources can be freed by means of task relocation. However, task relocation can be avoided by reprogramming the routing tables, provided that there are enough free communication resources. Inter-task communication is then diverted on a longer path using low-traffic connections, incurring an extra delay but saving the time required for task relocation. The present design does not implement QoS services at the network level, but a higher level of QoS is implemented by the interface under the control of the OS, as described in Section 4.4.

4.3. Control and reconfiguration networks

The control network is used by the OS to control the behavior of the complete system (Fig. 6). It allows monitoring of the data NoC, debugging and control of a task, exception handling, etc.
OS OAM messages are short, but the control network must deliver them in a timely manner. We therefore need a low-bandwidth, low-latency control network. For rapid prototyping on our research platform, the communication layers are implemented on the same reconfigurable chip used by the computing tiles. To limit logic resource usage by the communication layers and to minimize message latency, we decided to first implement the control network as a shared bus, where the OS running on the ISP is the only bus master and all control network NICs of tiles are slaves. In addition, implementing the control network as a bus eases the early debugging stages of our data NoC when used on real-life applications. Nevertheless, the communication on this bus is designed to be message-based and can therefore be replaced by any type of NoC in the general case.

Our research platform for reconfigurable SoCs, detailed in Section 6.4, is composed of an ISP coupled to an FPGA. Tasks are instantiated on fine-grain reconfigurable tiles by partially reconfiguring the chip. In this case, the reconfiguration network is already present on the platform as the native reconfiguration bus of the FPGA. The reconfiguration bus is accessed by the ISP as a simple memory-mapped device.

4.4. NICs implement distributed HW OS support

The control and data NICs form the higher layers of the control and data NoCs. The main role of the data NIC is to buffer input and output messages from the local computing resource and abstract the task from the details of the low-level communication on the data NoC. The role of the control NIC is to provide an abstraction layer for the communication and computing resources of the platform. It provides the OS running on the master ISP with a clean interface to access and control both the computing tiles and the data NoC [16].

4.4.1. Data NIC supports dynamic task relocation

As Section 3.4 explains, inter-task communication is done on an input/output port basis [17]. Fig. 4 shows an example of an application task graph with the input/output port connections between tasks. The result of mapping this task graph onto the computing resources is the ensemble of destination look-up tables of the application (Fig. 5). The DLT associated with a task is stored in the data NIC of its respective computing resource. The OS can change the DLT at run-time by sending an OAM message on the control network. This method abstracts the computing resources from the details of the communication on the data network and allows dynamic task relocation in reconfigurable SoCs [17]. The relocation process is discussed in detail in Section 5.1.

4.4.2. Data NIC monitors communication resources

Three main parameters about the usage of communication resources on the data NoC are monitored in the data NIC of every tile. Two of them, the numbers of messages coming in and going out of a specific tile, are gathered in the NIC in real time and made available to the OS. The other important figure available is the average number of messages that have been blocked due to lack of buffer space in the NIC. These figures allow the OS to keep track of the communication usage on the data NoC. Based on these figures and on application priorities, the OS can manage communication resources per tile and thus ensure QoS on the platform [18]. A sketch of how the OS might act on these counters follows.
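The paper does not show the OS-side code; the following C sketch assumes a memory-mapped counter block per tile and an invented congestion threshold, and shows how the OS could act on these figures (cf. the injection rate controller of Section 4.4.3):

```c
#define CONGESTION_THRESHOLD 8   /* invented threshold, blocked messages */

struct nic_stats {
    unsigned msgs_in;      /* messages received by the tile */
    unsigned msgs_out;     /* messages sent by the tile */
    unsigned avg_blocked;  /* average messages blocked for lack of buffers */
};

/* Hypothetical accessors: read a tile's counters via the control NIC and
 * shrink its injection window (Section 4.4.3). */
struct nic_stats *control_nic_stats(int tile);
void os_shrink_injection_window(int tile);

void os_monitor_data_noc(int num_tiles)
{
    for (int t = 0; t < num_tiles; t++) {
        struct nic_stats *s = control_nic_stats(t);
        /* sustained blocking indicates congestion: throttle this tile so
         * that higher-priority communications keep their bandwidth */
        if (s->avg_blocked > CONGESTION_THRESHOLD)
            os_shrink_injection_window(t);
    }
}
```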
4.4.3. Data NIC implements communication load control

The maximum number of messages a task is allowed to send on the network per unit of time can be controlled by the OS. To this end we have added an injection rate controller in the data NIC. As explained in [17], outgoing messages from a task are first buffered in the NIC and are then injected into the network as soon as it is free (best effort service). The injection rate controller adds an extra constraint on the time period during which the messages may be injected into the NoC. It is composed of a counter and a comparator. The OS allows the NIC to inject messages only during a window of the counter time. The smaller the window, the fewer messages injected into the NoC per unit of time, freeing resources for other communications. This simple system introduces a guarantee on average bandwidth usage in the NoC (as long as the data NIC buffers are not permanently saturated) and allows the OS to manage QoS on the platform. A behavioral sketch of this controller is given at the end of Section 4.4.

4.4.4. Data NIC adds HW support for OS security

Security is a serious matter for future reconfigurable SoCs. Thanks to reconfiguration, unknown tasks may be scheduled on HW resources and will use the data NoC to communicate. We must therefore perform sanity checks on the messages circulating on the data NoC and notify the OS when problems occur. Communication-related checks are naturally performed in the NIC. We check, for instance, whether the message length is smaller than the maximum transfer unit, that messages are delivered in order and, especially, that tasks do not breach security by sending messages on output ports not configured in the DLT by the OS.

4.4.5. Control NIC implements HW OS support

The control NIC of every tile is memory-mapped in the ISP. An important role of the control NIC is to communicate to the OS information about the data NoC, such as message statistics and security exceptions. It is also through the control NIC that the OS sends DLTs or injection-rate windows to the data NIC.

Another very important role of the control network is to allow control and monitoring of a task running on a reconfigurable tile. To clearly understand the need for OS control here, let us consider the life cycle of a reconfigurable task in our SoC platform. Before instantiating a task in a tile, we must isolate the tile from the communication resources, to ensure the task does not do anything harmful on the data NoC before being initialized. To this end, the control NIC implements a reset signal and bit masks to disable task communication (step 3 in Fig. 9). After reconfiguration, the task needs to be clocked. However, its maximum clock speed might be less than that of our data NoC. Because we do not want to constrain the speed of our platform to the clock speed of the slowest task (which can always change as new tasks are introduced at run-time), the OS can set a clock multiplexer to feed the task with an appropriate clock rate. The task can now perform its computation. At some stage it might generate an exception, to signal for instance a division by zero. The control NIC implements a mechanism to signal task exceptions to the OS. As for a task executing on the ISP, the OS can also send exceptions to a task running on a reconfigurable tile. One usage of these exceptions is to perform debugging. Later on, the OS might decide to relocate the task to another reconfigurable tile or as a task on an ISP (Section 5.1, [16–18]). The NIC implements a mechanism to signal task switching and to transmit the internal task state information to the OS. The NIC also implements a mechanism to initialize a task with a certain internal state, for instance when switching from an ISP to a reconfigurable tile [4].
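To make the injection rate controller of Section 4.4.3 concrete, here is a behavioral C model of the counter-and-comparator pair. The counter period and register widths are our assumptions, not figures from the paper:

```c
#include <stdbool.h>
#include <stdint.h>

#define COUNTER_PERIOD 1024u  /* assumed: counter wraps every 1024 cycles */

struct inj_ctrl {
    uint32_t counter;         /* free-running, wraps at COUNTER_PERIOD */
    uint32_t window;          /* programmed by the OS via the control NIC */
};

/* Evaluated once per NoC clock cycle: injection is allowed only while
 * the counter is inside the OS-defined window. The smaller the window,
 * the fewer messages enter the NoC per unit of time. */
bool may_inject(struct inj_ctrl *c)
{
    c->counter = (c->counter + 1u) % COUNTER_PERIOD;
    return c->counter < c->window;
}
```

Setting window to COUNTER_PERIOD restores plain best-effort behavior; setting it to a fraction of the period bounds the tile's average injected bandwidth to roughly that fraction of the link capacity.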
5. Description of the application life cycle

This section details the life cycle of a task on our heterogeneous multitasking reconfigurable SoC. Additionally, a use case scenario illustrates the need for run-time task relocation on this kind of system.

5.1. Task life cycle

Whenever a user starts an application, the OS needs to perform a series of actions before the application actually starts running on the heterogeneous reconfigurable SoC. Three steps can be identified (Fig. 7):

(1) Loading the application. This step creates a task structure containing a unique logical ID for every task within the application. In this stage, the application should also inform the OS about its inter-task communication. This is done by registering a communication task graph (cf. Section 3.4). Based on this task graph, the OS creates a DLT for every task. In addition, the application should register, per task, the available task representations (SW binaries, fine- and/or coarse-grain bitstreams).

(2) Allocating tasks to platform computing resources. In this step the operating system decides on the mapping of the application tasks depending on their available representations, the availability of computing resources, the requested QoS, etc.

(3) Instantiating, initializing and starting the application tasks. This entails (for every task) sending the configuration/program data to the computing resource, resetting/initializing the task, updating the DLT and sending it to the computing resource and finally starting the task.
Fig. 7. Overview of the different steps involved in starting an application.
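Expressed as a hypothetical OS4RS entry point (the paper describes the steps but not an API; all names below are invented), starting an application amounts to:

```c
struct app_descr;   /* tasks, task graph and per-task representations */

/* Hypothetical OS4RS calls corresponding to steps (1)-(3) of Fig. 7. */
int os4rs_load(struct app_descr *app);        /* IDs, task graph, DLTs */
int os4rs_allocate(struct app_descr *app);    /* top-level scheduler map */
int os4rs_instantiate(struct app_descr *app); /* configure, init, start */

int start_application(struct app_descr *app)
{
    if (os4rs_load(app) < 0)        /* (1) loading the application */
        return -1;
    if (os4rs_allocate(app) < 0)    /* (2) allocating tasks to resources */
        return -1;
    return os4rs_instantiate(app);  /* (3) instantiate, initialize, start */
}
```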
Fig. 8. Switching a task at run-time from computing resource (tile) X to computing resource (tile) Y.
The OS needs the ability to relocate (migrate) a task from one heterogeneous computing resource (origin tile) to another (destination tile), in order to e.g. respond to changes in the QoS requirements. As Section 3.3 explains, we use a high-level, task-dependent abstraction to represent task state information. This mechanism allows a task to be reinitialized independently of the target computing resource. The principle of the relocation process is depicted by Fig. 8. In order to relocate a task, the OS can send a switch signal to that task at any time (1). Whenever that signaled task reaches a switchpoint, it goes into an interrupted state (2). In this state, all the relevant state information of that switchpoint is transferred to the OS (3). Next, the OS instantiates that task onto a different computing resource. The task is initialized using the state information previously stored by the OS (4). The task resumes by continuing execution in the corresponding switchpoint (5).

Because parts of the OS are distributed in different components over the platform, the actual switch process involves concurrent steps and requires synchronization of communication. The different steps performed in the actual switch process are described in more detail in Fig. 9. When the OS sends a switch signal to the origin tile (1), the task running on that tile may be in a state that requires more input data before reaching a switch point. This input data originates from another task, called the sender task, instantiated on a tile called the sender tile. Neither the OS nor the sender task knows how many input messages are required for the task on the origin tile to reach a switch point. As a result, the sender task might send more messages than required to reach that switch point. Therefore, when the task reaches its switch point and signals it to the OS (1-2), there may be an undetermined number of pending input messages on the data NoC or in the data NIC buffers destined for the origin tile. During the relocation process from origin tile to destination tile, the pending messages have to be forwarded in order to the destination tile, and the DLT of the sender tile has to be updated to send input messages to the destination tile in place of the origin tile.

After reception of the acknowledgment that the task has reached its switch point, the OS requests the sender tile to send one last tagged message to the origin tile and then stop sending further messages (2). (The message is tagged by sending it on the data NoC with a special message class in its header.) The OS then configures the destination tile with the switched task, initializes it to the state it stopped in and enables its communications on the data NoC (3). The next step is to forward all pending messages to the newly relocated task. To this end, the OS sends a new DLT to the origin tile and puts its data NIC in a special state that forwards all its input messages to the destination tile (4). The tagged message is the last one to be forwarded by the data NIC of the origin tile to the destination tile. The OS is informed by the data NIC of the destination tile upon reception of the tagged message (4-5). The task switching mechanism is then finished and the OS can simply update the DLT of the sender tile to point to the destination tile in place of the origin tile and re-enable its communication on the data NoC (5).

Fig. 9. Communication synchronization during task switching.
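The OS-side sequence of Figs. 8 and 9 can be summarized in the following sketch; every helper stands for one OAM exchange on the control network and is invented for illustration:

```c
struct os4rs_task;
struct comp_unit;

/* Hypothetical wrappers, one per OAM exchange on the control network. */
void send_switch_signal(struct os4rs_task *t);
void wait_switch_point_ack(struct os4rs_task *t);
void sync_sender(struct os4rs_task *t);    /* one tagged message, then stop */
void capture_state(struct os4rs_task *t);
void instantiate(struct comp_unit *dst, struct os4rs_task *t);
void forward_dlt(struct comp_unit *origin, struct comp_unit *dst);
void wait_tagged_message_ack(struct comp_unit *dst);
void update_sender_dlt(struct os4rs_task *t, struct comp_unit *dst);
struct comp_unit *origin_unit(struct os4rs_task *t);
void free_tile(struct comp_unit *origin);

int os4rs_relocate(struct os4rs_task *t, struct comp_unit *dst)
{
    send_switch_signal(t);            /* (1) ask the task to switch       */
    wait_switch_point_ack(t);         /* (1-2) task reached a switch point */
    sync_sender(t);                   /* (2) sender tags its last message */
    capture_state(t);                 /* state read out via control NIC   */
    instantiate(dst, t);              /* (3) configure destination tile,
                                         restore state, enable data NoC   */
    forward_dlt(origin_unit(t), dst); /* (4) origin NIC forwards pending
                                         messages up to the tagged one    */
    wait_tagged_message_ack(dst);     /* (4-5) in-flight data drained     */
    update_sender_dlt(t, dst);        /* (5) sender now targets dst       */
    free_tile(origin_unit(t));
    return 0;
}
```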
5.2. Application scenario

Run-time task relocation can be exploited to optimize resource usage on a heterogeneous reconfigurable platform. The following use case scenario demonstrates the need for task relocation in the case of mobile multimedia terminals. The scenario is depicted in Fig. 10. A user is watching a movie on the mobile multimedia terminal (1). The movie comes from a broadcast TV stream, so now and then advertisements might interrupt it. The user then wants to take advantage of that time to execute another application, while keeping an eye on the video window in order to resume watching the movie whenever the advertisements are finished. The video window is thus downsized into the corner of the screen (2). The user can then start another application (e.g. a 3D game) on the terminal.

Fig. 10 also shows the behavior of the platform corresponding to the use case scenario. The platform is composed of a set of flexible computing resources (ISP, reconfigurable hardware). At the beginning, only the movie player executes on the platform. Therefore the video decoder can use all needed resources and can run at full resolution and full frame rate (1). When the user downsizes the video window, both resolution and frame rate can be reduced. The video decoder can then be relocated to a smaller amount of computing resources (2). The resources that are no longer used can then be made available to the 3D engine (3) that is used by the game.

The user interaction on the terminal thus creates dynamism that affects the mapping of the applications on the available resources. This dynamism is one of the reasons for having flexible, yet computing-intensive (multimedia applications are computation-hungry) resources on the platform, i.e. reconfigurable hardware. It also clearly justifies the need for run-time task relocation on the platform.

Fig. 10. Multimedia applications scenario. (1) The user is watching a movie—the video decoder executes using as much hardware resources as it can; (2) during advertisements, the user downsizes the video window—the decoder is therefore relocated on less resources; (3) the user starts a 3D game—the 3D engine uses the hardware resources made available.
6. Implementation details

6.1. OS details

The OS (OS4RS) is built as an extension to RTAI version rtai-24.1.9, downloadable at [19]. The main reasons for choosing a real-time Linux extension as our base OS are:

* Source code availability.
* Small OS size with deterministic behavior.
* Construction in a modular way using Linux modules. This allows the OS to be efficiently ''(re)booted'' by inserting and removing the Linux modules. This technique decreases the development time by allowing modular OS tests.
* Availability of existing Linux services, e.g. for touchscreen, USB, UART, debugging, etc.
In order to limit the number of changes in the RTAI source code, we created a few simple software hooks. These hooks represent a link between our OS extensions and the already present RTAI structures: e.g. a void pointer is added to the RTAI task descriptor. This pointer allows connection of the OS extension task descriptor with the RTAI local task structure (cf. Fig. 3). Approximately 150 lines of source code were added directly into RTAI. The actual OS extensions are represented by several Linux modules, currently containing about 5000 lines of source code. The OS4RS core extensions (i.e. mainly run-time task management and the uniform communication scheme) have a memory footprint of approximately 50 kbyte.

6.2. Data network implementation

As discussed in Section 4.2, to implement the inter-task communication we have designed a very flexible network on chip [14]. The network is made of routers that can be interconnected to build arbitrary topologies. The data path currently implemented is 16 bits wide and the maximum packet payload length is fixed at 256 words. The routers are built with input and output blocks connected through a shared crossbar switch (cf. Fig. 11). Any input can be connected to any output, and each output block has an arbiter to solve conflicts between the different inputs. The arbitration scheme presently used is based on round robin and takes into account the acceptance ratio in order to be fair and starvation-free. For the switch many different designs are possible. The current implementation is a straightforward crossbar switch using large multiplexers in a single stage. It could be optimized with respect to area by using smaller multiplexers in multiple stages. The handshaking between routers is realized through request/acknowledge lines. A request is placed by the output block for the corresponding input block of the next router. The flow control is done at packet level, meaning that an entire packet is sent before another one can begin.
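A plausible C view of a data NoC packet, based on the 16-bit data path and the 256-word maximum payload stated above; the header fields and their widths are not specified in the paper and are assumptions (a message class field is implied by the tagged messages of Section 5.1):

```c
#include <stdint.h>

#define MAX_PAYLOAD_WORDS 256   /* maximum packet payload, 16-bit words */

struct noc_packet {
    /* header flits (field widths are assumptions) */
    uint16_t dest;              /* destination looked up in routing tables */
    uint8_t  msg_class;         /* e.g. normal vs. tagged (Section 5.1) */
    uint8_t  dest_port;         /* destination input port from the DLT */
    uint16_t length;            /* payload length in 16-bit words */
    /* payload flits */
    uint16_t payload[MAX_PAYLOAD_WORDS];
};
```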
Fig. 11. Structure of a 2-input, 2-output router.

Fig. 12. Logic resource usage of the data NoC with a mesh topology as a function of the number of routers.
For VCT switching the router needs to buffer the whole packet if it is blocked. In our implementation the buffers are placed at the output. Output buffering has the advantage of avoiding head-of-line blocking. In this case, even if the packet is blocked, once it is buffered the input is free to receive and forward a new packet. If the next packet asks for the same output then, depending on the size of the output buffer, the input will block or not. If the output buffer can hold more than one packet, then subsequent packets can be received and locally buffered until the maximum number is reached.

The network is implemented as a VHDL component and is parameterizable using generics. The topology and the size of the network can easily be changed, since the topology is passed as a generic mentioning all the connections between the routers, and between the routers and the tasks. A 3×3 mesh network takes 2800 Virtex II slices, amounting to 8.3% of the chip we use (Fig. 12). The design is optimized for performance. There is a routing table in every input block, and an arbiter and a packet buffer in every output block. In this way none of these modules has to be shared. This solution provides the shortest possible latency at the cost of silicon area. The latency per router is 3 cycles. Therefore a packet traveling 5 nodes has a base latency of 15 cycles plus a number of cycles equal to the number of packet flits. (When flowing through the network, packets are divided into smaller units; a flow control unit, or flit, is the smallest unit on which flow control is performed.) For a large packet size the delay to receive the first flit is only a small fraction of the time needed to receive the entire packet.
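These figures combine into a simple first-order latency estimate for a packet crossing h routers with F flits (our formula, ignoring handshake details):

```latex
T_{\mathrm{packet}} \approx 3h + F \quad \text{cycles}
```

For the example above (h = 5) the base latency is 15 cycles; for a maximum-length packet of roughly 256 payload flits the header therefore arrives after only about 15/271, i.e. some 6%, of the total transfer time.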
6.3. Implementation of the NICs

6.3.1. Control NIC defines HW OS reaction time

The SW part of our OS [18] runs on an ISP and controls the HW OS extensions located in the data and control NICs through the control network. Therefore the reaction time of the control network and its NIC has a strong influence on the behavior of the system. To get a feel for this latency and for the processing required both in HW and SW, let us consider the simple example where the OS sends an atomic instruction to the control NIC of a particular tile to reset the task running on it (Fig. 13). We assume that the control NIC is clocked at 22 MHz and that the ISP can access the 16-bit wide control network at 50 MHz. The SW part of the OS sends the atomic RST IP instruction to the control NIC of the task in 120 ns (measured time). A total of 12.8 μs is spent in the control NIC to decode, process and acknowledge the commands issued from the SW part of the OS. Only a total of 320 ns is spent by the SW part of the OS to send the atomic instruction and request the control NIC to clear the IRQ, acknowledging that the command has been processed. The total processing time is under 13.2 μs, of which 97% is performed by the HW OS support in the control NIC.

In the case of dynamic task relocation from SW to HW (Section 5.1), the task on the reconfigurable tile needs to be initialized with the state information extracted from the SW version of the task [4]. Assuming we have 100 16-bit words of state information to transfer, the total transaction takes about 440 μs (the control NIC transmits a word to the task in 4.3 μs, and about 10 μs go to control NIC latency, instruction decoding and acknowledgment).

In both cases the control NIC abstracts the access to the task on the reconfigurable tile from the SW part of the OS. Because the NICs offload the ISP from low-level access to the tasks, they are considered the HW part of the OS. For research purposes, we implement the fixed NoCs together with the reconfigurable tasks on the same FPGA. We therefore report in Table 1 the area usage of the NICs in terms of FPGA logic and consider it as overhead to the reconfigurable tasks they support. As expected, the support of functions required by a full OS, such as state transfer, exception handling, HW debugging or communication load control, comes at the expense of area overhead in the NICs. On our implementation platform, the Virtex-II 6000, this area overhead amounts to 611 slices, or 1.81% of the chip, per reconfigurable tile instantiated. Nevertheless, on a production reconfigurable SoC the NoCs and their interfaces could be implemented as hard cores, reducing considerably the area overhead on the chip.
Fig. 13. The OS sends a Reset command to a task on a reconfigurable tile. Most of the processing is performed in the control NIC, making it HW support for the OS. The control NIC is clocked at 22 MHz and the control network is accessed by the ISP at 50 MHz.
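As a check on the figures above, the reset example tallies as follows:

```latex
T_{\mathrm{total}} = \underbrace{120\,\mathrm{ns} + 320\,\mathrm{ns}}_{\text{SW (ISP)}}
                   + \underbrace{12.8\,\mu\mathrm{s}}_{\text{HW (control NIC)}}
                   \approx 13.2\,\mu\mathrm{s},
\qquad \frac{12.8}{13.2} \approx 97\%\ \text{in HW}.
```

Likewise, the state transfer adds up to 100 words × 4.3 μs + 10 μs ≈ 440 μs.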
Table 1
HW overhead of data and control NICs

Element                  Virtex-II slices   XC2V6000 (%)   XC2VP20 (%)
Control NIC and router   250                0.74           2.7
Data NIC                 361                1.07           3.89
Control + data NIC       611                1.81           6.58
Fig. 14. The Gecko demonstrator.
6.4. Proof-of-concept

We have developed an emulation platform to explore HW/SW multitasking on reconfigurable heterogeneous multiprocessor SoCs. The Gecko demonstrator (Fig. 14) [20] is a platform composed of a Compaq iPAQ 3760 and a Xilinx Virtex 2 FPGA. The iPAQ is a personal digital assistant that features a StrongARM SA-1110 ISP (206 MHz) and an expansion bus that allows connection of an external device. The FPGA is an XC2V6000 containing 6 million system gates and is clocked at 30 MHz. The Gecko emulation platform is powerful enough to run real-time multimedia applications on top of our extended OS. It allows us to study real cases of run-time support for heterogeneous multitasking on reconfigurable SoCs more accurately than would be possible in a more traditional simulation environment.

The second generation of Gecko demonstrators (Gecko2) emulates a control bus and a bidirectional 3×3 data mesh network that connect 9 heterogeneous processors. The StrongARM processor on the iPAQ is used as the master CPU of our system, whereas the 8 slave processors and the data and control NoCs are emulated on the FPGA (Fig. 15):

* S1, S3, S4, S6, S7 are simple 16-bit RISC processors (Lezard16),
* S2, S5 are fine-grain reconfigurable processors (Fig. 16),
* S8 is a fixed 2D IDCT block,
* I/F is the interface to the external master CPU, the StrongARM.
Fig. 15. Reconfigurable MP-SoC emulation on the Gecko2 platform.
Fig. 16. Floorplanning of Xilinx Virtex II FPGA for MP-SoC emulation.
Various tasks have been implemented to run on one or more of these heterogeneous processors (Table 2). (Some task implementations, such as the Lezard16 and the convolution filter, are themselves programmable machines. They are therefore also considered as processors when instantiated in reconfigurable hardware. The concept of changing programs on a processor that has been instantiated by reconfiguration is called hierarchical reconfiguration [21] and is out of the scope of this paper.) Applications such as a video decoder, a 3D game [22] and image processing can be run concurrently on the platform, and HW/SW trade-offs can be explored at run-time by dynamically changing task locations on the heterogeneous processor system.
Table 2
Implementations of various HW/SW tasks on Gecko2 heterogeneous processors. Processors (rows): SA-1110, reconfigurable tiles S2/S5, Lezard16, convolution filter, fixed HW. Task implementations (columns): Huffman decoder, 2D IDCT, 3D texture mapper, Lezard16 processor, convolution filter, edge detection. An X marks each processor on which an implementation of the task is available.
The floorplanning of the Gecko2 system follows Xilinx's Modular Design technique [23] to enable dynamic partial reconfiguration. The FPGA is divided into 3 independently reconfigurable modules, separated by ''bus macros'' [23] (Fig. 16):

* Fixed Module contains the data and control NoCs, their interfaces, as well as all fixed slave processors (S1, S3, S4, S6, S7, S8) and the fixed interface to the external master CPU.
* Reconfigurable Module 1 contains raw Virtex 2 logic, block-RAMs and multipliers. It connects to the Fixed Module on the data and control NoCs as slave processor S2 and contains route-through wires to connect the Fixed Module to Reconfigurable Module 2.
* Reconfigurable Module 2 contains raw Virtex 2 logic, block-RAMs and multipliers. It connects to the Fixed Module on the data and control NoCs as slave processor S5.

The task relocation overhead for the OS is only about 100 μs. This low OS overhead can be explained by the absence of a complex task placement algorithm. However, the OS relocation overhead does not take into account the reconfiguration time, which depends on the type of reconfigurable hardware and on the area to reconfigure. For instance, relocating the Huffman decoder task from software to fine-grain reconfigurable hardware (S2) takes about 108 ms. Most of the relocation latency is caused here by the actual partial reconfiguration through the slow CPU–FPGA interface. In theory, the total software-to-hardware relocation latency can be reduced to about 11 ms when performing the partial reconfiguration of S2 at full speed. When relocating a task from hardware to software, the total relocation latency is equal to the OS overhead, since in this case no partial reconfiguration is required.

Regarding power dissipation, the demo setup cannot directly show relevant results, because it uses an FPGA as reconfigurable hardware. Traditionally, FPGAs are used for rapid prototyping and are not meant to be power efficient. However, we are in the process of estimating power consumption by mapping our heterogeneous MP-SoC to standard cell logic. An estimation of the power dissipation due to the reconfiguration process is planned on the final platform we are targeting. This final platform will be composed of new, low-power fine- and coarse-grain reconfigurable hardware that will improve the total power dissipation of the platform. Power efficiency will be provided by the ability of spawning highly parallel, computation-intensive tasks on this type of hardware.
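Taking the relocation figures reported above at face value, the latencies decompose as follows (our arithmetic, not a measurement from the paper):

```latex
t_{\mathrm{SW}\to\mathrm{HW}} = t_{\mathrm{OS}} + t_{\mathrm{reconfig}}
\approx 0.1\,\mathrm{ms} + 107.9\,\mathrm{ms} \approx 108\,\mathrm{ms},
\qquad
t_{\mathrm{HW}\to\mathrm{SW}} = t_{\mathrm{OS}} \approx 0.1\,\mathrm{ms},
```

with t_reconfig dropping to roughly 10.9 ms at full reconfiguration speed, which yields the quoted 11 ms total for the software-to-hardware case.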
7. Related work

In [24], Diessel presents an overview of the challenges and opportunities that arise when developing an OS to manage a reconfigurable system. Our management infrastructure addresses most of the described run-time issues and proposes an implementation solution. Wigley [25] discusses the services and abstractions that have to be provided by an OS for reconfigurable computing: this mainly concerns task scheduling and inter-task communication. However, that work does not address the use of OS support implemented in hardware.

Simmler [26] investigates ''multitasking'' on an FPGA. An application-dependent technique to extract and restore task state information is presented: state extraction is done by filtering out the relevant state bits from the readback bitstream; state reconstruction is done through bitstream manipulation. This technique is not only application dependent, it is also architecture dependent. In addition, it only enables multitasking on the FPGA; it does not allow relocating a task between heterogeneous processors. We address the need for a high-level task state representation that enables real dynamic heterogeneous multitasking.

Rijpkema discusses in [27] the integration of best-effort and guaranteed-throughput services in a combined router. Such a combined system could be an interesting alternative to our physically separated data and control networks (Section 4).

Mignolet presents in [4] the design environment that allows development of applications featuring tasks relocatable on heterogeneous processors. A common HW/SW behavior, required for heterogeneous relocation, is obtained by using a unified HW/SW design language such as OCAPI-XL [28]. OCAPI-XL allows automatic generation of HW and SW versions of a task with an equivalent internal state representation. The introduction of switch points is thus possible and allows a high-level abstraction of task state information. Our OS is able to use this high-level task state abstraction to perform heterogeneous multitasking on reconfigurable systems.
8. Future work

In the future, on the OS side, we want to investigate cases where the management infrastructure shifts from the master–slave configuration explained in Section 2 to a more symmetric configuration, where a larger part of the OS functionality is distributed among the different heterogeneous resources. In addition, one could imagine a heterogeneous resource having its own (limited) set of independent OS functions. In this case, the system classification could shift toward a separate supervisor configuration, where every processing unit has its own set of OS functions and associated data structures.

Further research is planned on the control network. Its current bus implementation should be changed into a NoC, and we will investigate whether its communication requirements would allow it to be integrated into the data NoC, possibly in a combined best-effort/guaranteed-throughput router [27]. Further research will target the integration of the communication requirements of both coarse-grain and fine-grain logic in a reconfiguration network.
9. Conclusion

This paper presents our work on the run-time support of heterogeneous multitasking on reconfigurable SoCs. We show that an OS for reconfigurable systems should be designed together with the communication layer of the platform. Our architecture for reconfigurable SoCs is composed of a master ISP connected to heterogeneous reconfigurable tiles using an ensemble of NoCs as communication layer. The operating system abstracts the complex computation and communication resources from the application designer. The OS uses a two-level spatial and temporal scheduler to map tasks to the heterogeneous computation resources of the platform and a unified HW/SW communication scheme to allow run-time task relocation.

The communication layer has three components: a high-bandwidth data NoC that relieves the master ISP from inter-task communication, a control network that allows the OS to monitor and control the platform, and a reconfiguration network, used for run-time instantiation of tasks. The higher layers of the data and control networks implement hardware OS support distributed over the platform, essential to the run-time support of advanced features of reconfigurable SoCs such as dynamic task relocation.

Our implementation of the first reconfigurable computing platform for HW/SW multitasking supports the concepts we present. The combined design of the operating system and the communication layer allows run-time support for heterogeneous multitasking on reconfigurable SoCs and helps to dynamically optimize resource usage.

References

[1] H. De Man, On nanoscale integration and gigascale complexity in the post com world, Proceedings of DATE 2002, Paris, France, March 2002.
[2] B. Lewis, I. Bolsens, R. Lauwereins, C. Wheddon, B. Gupta, Y. Tanurhan, PANEL: reconfigurable SoC—what will it look like, Proceedings of DATE 2002, Paris, France, March 2002, pp. 660–662.
[3] M. Singhal, N.G. Shivaratri, Advanced Concepts in Operating Systems: Distributed, Database and Multiprocessor Operating Systems, McGraw-Hill Series in Computer Science, McGraw-Hill, New York, 1994, pp. 444–445.
[4] J.-Y. Mignolet, V. Nollet, P. Coene, D. Verkest, S. Vernalde, R. Lauwereins, Infrastructure for design and management of relocatable tasks in a heterogeneous reconfigurable system-on-chip, Proceedings of DATE 2003, Munich, March 2003, pp. 986–992.
[5] P.P. Bungale, S. Sridhar, V. Krishnamurthy, An approach to heterogeneous process state capture/recovery to achieve minimum performance overhead during normal execution, Proceedings of the Reconfigurable Architectures Workshop (RAW), Nice, April 2003.
[6] http://www.acm.org/sigcomm/standards/iso_stds/OSI_MODEL/
[7] J.A. Rowson, A. Sangiovanni-Vincentelli, Interface-based design, Proceedings of the Design Automation Conference, 1997.
[8] A. Jantsch, H. Tenhunen, Will networks on chip close the productivity gap, Networks on Chip, Kluwer Academic Publishers, Dordrecht, 2003, pp. 3–18, ISBN 1-4020-7392-5.
[9] L. Benini, G. De Micheli, Networks on chips: a new SoC paradigm, IEEE Comput. 35 (1) (2002) 70–78.
[10] P. Guerrier, A. Greiner, A scalable architecture for system-on-chip interconnections, Proceedings of the Sophia-Antipolis MicroElectronics Conference (SAME'99), France, October 1999.
[11] I. Saastamoinen, D. Siguenza-Tortosa, J. Nurmi, An IP-Based On-Chip Packet-Switched Network, Kluwer Academic Publishers, Dordrecht, 2003, pp. 193–213.
[12] P. Guerrier, A. Greiner, A generic architecture for on-chip packet switched interconnections, Proceedings of DATE 2000, 2000.
[13] J. Duato, S. Yalamanchili, L. Ni, Interconnection Networks: An Engineering Approach, IEEE Computer Society Press, Silver Spring, MD, 1997, ISBN 0-8186-7800-3.
[14] T.A. Bartic, J.-Y. Mignolet, V. Nollet, T. Marescaux, D. Verkest, S. Vernalde, R. Lauwereins, Highly scalable network on chip for reconfigurable systems, Systems on Chip Conference, Tampere, Finland, November 2003, pp. 79–82.
[15] J. Rexford, K.G. Shin, Support for multiple classes of traffic in multicomputer routers, Proceedings of the Parallel Computer Routing and Communication Workshop, May 1994, pp. 116–130.
[16] T. Marescaux, J.-Y. Mignolet, A. Bartic, W. Moffat, D. Verkest, S. Vernalde, R. Lauwereins, Networks on chip as hardware components of an OS for reconfigurable systems, Proceedings of the 13th International Conference on Field Programmable Logic and Applications, Lisbon, Portugal, September 2003, pp. 595–605.
[17] T. Marescaux, A. Bartic, D. Verkest, S. Vernalde, R. Lauwereins, Interconnection networks enable fine-grain dynamic multi-tasking on FPGAs, Proceedings of the 12th International Conference on Field-Programmable Logic and Applications, Montpellier, September 2002, Lecture Notes in Computer Science, Vol. 2438, Springer, Berlin, 2002, pp. 795–805.
[18] V. Nollet, P. Coene, D. Verkest, S. Vernalde, R. Lauwereins, Designing an operating system for a heterogeneous reconfigurable SoC, Proceedings of the RAW'03 Workshop, Nice, April 2003.
[19] http://www.aero.polimi.it/~rtai
[20] http://www.imec.be/reconfigurable
[21] V. Nollet, J.-Y. Mignolet, T.A. Bartic, D. Verkest, S. Vernalde, R. Lauwereins, Hierarchical run-time reconfiguration managed by an operating system for reconfigurable systems, Proceedings of the International Conference on Engineering Reconfigurable Systems and Algorithms 2003, Las Vegas, June 2003, pp. 81–87.
[22] N. Pham Ngoc, G. Lafruit, J.-Y. Mignolet, S. Vernalde, G. Deconinck, R. Lauwereins, Real-time 3D applications on mobile platforms with run-time reconfigurable hardware accelerator, Proceedings of the International Conference on Computer Vision and Graphics, ICCVG'02, Zakopane, Poland, September 2002, pp. 582–588.
[23] http://www.xilinx.com/bvdocs/appnotes/xapp290.pdf
[24] O. Diessel, G. Wigley, Opportunities for operating systems research in reconfigurable computing, Technical Report ACRC-99-018, Advanced Computing Research Centre, School of Computer and Information Science, University of South Australia, August 1999.
[25] G. Wigley, D. Kearney, Research issues in operating systems for reconfigurable computing, Proceedings of the International Conference on Engineering Reconfigurable Systems and Architecture 2002, Las Vegas, USA, June 2002, pp. 10–16.
[26] H. Simmler, L. Levinson, R. Männer, Multitasking on FPGA coprocessors, Proceedings of the 10th International Conference on Field Programmable Logic and Applications, Villach, August 2000, pp. 121–130.
[27] E. Rijpkema, et al., Trade offs in the design of a router with both guaranteed and best-effort services for networks on chip, Proceedings of DATE 2003, Munich, March 2003, pp. 350–355.
[28] http://www.imec.be/ocapi