Run-Time Scheduling for Multimedia Applications on Dynamically Reconfigurable Systems

Javier Resano (1), Diederik Verkest (2,3,4), Daniel Mozos (1), Serge Vernalde (2), Francky Catthoor (2,4)

(1) Dept. Arquitectura de Computadores, Universidad Complutense de Madrid, Spain. [email protected]
(2) IMEC vzw, Kapeldreef 75, 3001 Leuven, Belgium
(3) Professor at Vrije Universiteit Brussel, Belgium
(4) Professor at Katholieke Universiteit Leuven, Belgium

Abstract

Current multimedia applications are characterized by highly dynamic and non-deterministic behavior as well as high performance requirements. In addition, portable devices also demand low energy consumption. Dynamically Reconfigurable Hardware (DRHW) resources potentially present the ideal features to fulfill these requirements, since they can be reconfigured at run-time to match the performance and energy consumption requirements. However, the lack of programming support for dynamic task placement, as well as the large configuration overhead, has prevented a broader use of DRHW resources in embedded system design. To cope with these two problems, we have adopted a DRHW model with specific support for task migration and inter-task communication. On top of it, we have applied an energy-aware run-time scheduling technique capable of taking advantage of the DRHW flexibility. Finally, we have developed a set of modules that greatly reduces the reconfiguration overhead, making it affordable for current multimedia applications.

1. Introduction and related work

Dynamically Reconfigurable Hardware (DRHW), which allows partial reconfiguration at run-time (e.g. FPGAs), represents a powerful and flexible way to deal with the dynamism of current multimedia applications, as well as with their high performance requirements. In fact, DRHW resources are becoming more and more common in all kinds of platforms. However, there are three main drawbacks that prevent them from achieving an even broader use. The first one is the lack of programming support for dynamic task placement. Current FPGAs can be reconfigured at run-time to change part of their functionality. Hence, ideally, an FPGA is similar to a multiprocessor system where a task can be assigned for execution to a certain part of the FPGA. When it finishes, it can be removed and another task can be executed in the same place. Afterwards, if needed, it can be reassigned to another part of the FPGA. However, FPGA platforms do not provide Operating System (OS) support, so it is up to the designer to find ways to re-allocate the tasks and to allow inter-task communication. The second drawback is that, when compared to application-specific integrated circuits (ASICs), DRHW resources are less power efficient. Since power consumption is currently one of the

most important design concerns, this problem must be addressed at every possible level. Finally, the last drawback is the reconfiguration latency (by reconfiguration we mean the loading of a new task onto the DRHW). For instance, loading a task that occupies one quarter of a Virtex2 v6000 FPGA requires at least 11 ms. This overhead prevents many tasks from being implemented on the FPGA. In this paper we propose a solution for these three problems. To overcome the dynamic task placement problem, we have adopted an Interconnection-Network (ICN) DRHW model that provides support for task reallocation and inter-task communication. In order to efficiently tackle the low energy consumption requirements, we apply an emerging task concurrency management (TCM) scheduling technique that allows the system to minimize its energy consumption while achieving the Quality-of-Service (QoS) level demanded by the applications. Finally, in order to adapt TCM to the DRHW context, and more specifically to minimize the penalty of the reconfiguration overhead, we have developed a run-time module that, coupled with the TCM run-time scheduler, minimizes this overhead. Other research groups have presented scheduling approaches for DRHW resources executing multimedia tasks: R. Maestre et al. propose a technique to allocate configurations [1], M. Sanchez-Elez et al. [2] optimize the data allocation to improve both the execution time and the energy consumption, and Li Shang et al. [3] propose a static scheduling technique. However, all these approaches are applied at design-time, so they cannot efficiently tackle dynamic applications, whereas our approach selects at run-time among different energy/performance trade-offs. Hence, it avoids the use of static mappings based on worst-case conditions. This is a critical issue since, according to Kalavade and Moghe [4], the computational requirements of current multimedia applications are critically data-dependent and the average computational load may differ by one or even two orders of magnitude from the worst-case scenario. Hence, a static approach based on worst-case scenarios is often not only inefficient, but also not feasible within the timing constraints.

1.1 Contributions of the paper

The key contribution of this paper is the development of a new run-time scheduling flow for systems with DRHW resources. This scheduling flow takes advantage of two previously and independently developed ideas, namely the ICN-DRHW model and the TCM scheduling technique. However, we have identified that, even after applying the above-mentioned ideas to DRHW resources, the reconfiguration overhead of these resources can drastically affect both the system performance and the energy consumption. Hence, we have developed a set of modules that, working together with the run-time scheduler, minimize the number of reconfigurations needed and reduce the effects of their latency by at least a factor of 4. One of these modules (the prefetch module) has been previously presented in [14]. The rest of the modules, as well as the scheduling flow, are presented in this paper for the first time.

1.2 Outline of the rest of the paper

The rest of the paper is organized as follows: sections 2 and 3 provide a brief overview of the ICN-based DRHW model and the TCM methodology. Sect. 4 discusses the problem of the reconfiguration overhead with a case study. Sect. 5 presents the scheduling flow developed to cope with the reconfiguration overhead problem. Sect. 6 explains in detail the different techniques applied and analyzes their results. Finally, Sect. 7 draws some conclusions.

2. ICN-based DRHW model

The Interconnection-Network (ICN) DRHW model [6] partitions an FPGA platform into an array of identical tiles. The tiles are interconnected using a packet-switched ICN implemented in the FPGA fabric. At run-time, tasks are assigned to these tiles using partial dynamic reconfiguration. Communication between the tasks is achieved by sending messages over the ICN through a fixed network interface implemented inside each tile. As explained in [5], this approach avoids the huge Place & Route overhead that would be incurred when directly interconnecting tasks: in that case, when a new task with a different interface is mapped on a tile, the whole FPGA needs to be rerouted in order to interconnect the interface of the new task to the other tasks. The communication interfaces also contain some Operating System (OS) support [6,7], such as storage space and routing tables, which allows run-time migration of tasks from one tile to another, or even from an FPGA tile to an embedded processor. Applying the ICN model to a DRHW platform greatly simplifies the dynamic task allocation problem, providing a software-like approach where tasks can be assigned to HW resources in the same way that threads are assigned to processors. Thus, this model enables the use of the emerging Matador TCM methodology.
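To make this software-like view concrete, the sketch below shows how a task-management layer on top of the ICN model might look. All names and signatures are hypothetical illustrations of the concepts above (tile assignment, migration, message passing), not the actual IMEC interface.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical task-management layer on top of the ICN model.
// Tasks are assigned to tiles like threads to processors, and they
// communicate through messages routed by the fixed per-tile interfaces.
struct Message {
    uint16_t dstTask;               // resolved via the tile routing tables
    std::vector<uint8_t> payload;
};

class IcnPlatform {
public:
    // Load a task onto a tile through partial dynamic reconfiguration.
    bool assign(int taskId, int tileId) { /* load partial bitstream */ return true; }
    // Relocate a running task: tile-to-tile, or tile-to-embedded-processor,
    // relying on the OS support (storage space, routing tables) in each tile.
    bool migrate(int taskId, int newLocation) { /* update routing tables */ return true; }
    // Send a message over the packet-switched ICN.
    void send(const Message& m) { /* inject packet into the network */ }
};
```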

3. Matador Task Concurrency Management Methodology

The Matador TCM methodology [8,10] proposes a task scheduling technique for heterogeneous multiprocessor embedded systems. The different steps of the methodology are presented in figure 1. It starts from an application specification composed of several tasks, called Thread Frames (TFs). These TFs are dynamically created and deleted and can even be non-deterministically triggered. Nevertheless, inside each TF only deterministic and limited dynamic behavior is allowed. The whole application is represented using the grey-box model, which combines two models, namely the Control-Data Flow Graph (CDFG) model and the MTG* model, the latter specifically designed to tackle dynamic non-deterministic behavior. CDFGs are used to model the TFs, whereas MTG* models the inter-TF behavior. Each node of a CDFG is called a Thread Node (TN); TNs are the atomic scheduling units. TCM accomplishes the scheduling in two phases. The first phase generates at design-time a set of near-optimal solutions for each TF, called a Pareto curve. Each solution represents a schedule and an assignment of the TNs over the available processing elements with a different performance/energy trade-off. Whereas this first phase carries out the design-space exploration of each TF separately, the second phase tackles their run-time behavior, selecting at run-time the most suitable Pareto point for each TF. The goal of the methodology is to minimize the energy consumption while meeting the timing constraints of the applications (typically, highly-dynamic multimedia applications). TCM has been successfully applied to schedule several current multimedia applications on multiprocessor systems [9,10].
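As an illustration of what the design-time phase produces, the following sketch filters a set of candidate schedules down to a Pareto curve: a point survives only if no other point is at least as good in both time and energy and strictly better in one. The SchedulePoint type is a simplification; the real Pareto points also carry the schedule and TN-to-processor assignment.

```cpp
#include <vector>

// Simplified view of a design-time schedule: its time/energy cost.
// (Real Pareto points also store the schedule and TN assignment.)
struct SchedulePoint { double time, energy; };

// Returns true if q dominates p (no worse in both, better in at least one).
static bool dominates(const SchedulePoint& q, const SchedulePoint& p) {
    return (q.time <= p.time && q.energy <= p.energy) &&
           (q.time < p.time || q.energy < p.energy);
}

// Keep only the non-dominated points: the Pareto curve.
std::vector<SchedulePoint> paretoFilter(const std::vector<SchedulePoint>& pts) {
    std::vector<SchedulePoint> curve;
    for (const SchedulePoint& p : pts) {
        bool dominated = false;
        for (const SchedulePoint& q : pts)
            if (dominates(q, p)) { dominated = true; break; }
        if (!dominated) curve.push_back(p);
    }
    return curve;
}
```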

Fig. 1. Matador Task Concurrency Management Methodology

4. A case study: motion JPEG video decoder

We will illustrate our approach with a simple multimedia application, a motion JPEG video decoder. The application code has been developed using OCAPI-XL [11]. OCAPI-XL is a C++-based co-design environment that starts from a C++ specification of the application and, via incremental refinement, generates compilable C code for the SW tasks and synthesizable VHDL code for the HW tasks. Moreover, OCAPI-XL describes both the HW and the SW tasks with a unified model. This model guarantees that when two versions of the same task are created (one for HW and one for SW), both of them are functionally equivalent. This feature is needed to allow HW/SW task migration. We have simulated each task using the OCAPI-XL simulation environment to estimate its execution time, and we have used power estimation tools (like Xilinx XPower [12]) for the energy consumption. With these estimations, a CDFG representing the application, and a description of the underlying architecture (in this case an SA-1110 processor coupled with a Virtex2 v6000 FPGA), the current TCM design-time scheduler tool generates the Pareto curve depicted in figure 2.

Fig. 2. Pareto curve of the motion JPEG video decoder: energy (J) versus time (time units), showing both Pareto points and non-optimal points

In the figure both optimal and non-optimal schedules are depicted, but only the optimal ones are included in the Pareto curve. This curve is one of the inputs for the run-time scheduler. Thus, the scheduler can select at run-time between different energy/performance trade-offs. In this example the whole application is just one TF. Assuming that the deadline for this application is 40 time units, and that initially no other application is being executed, the run-time scheduler will always select the least energy-consuming solution, since all of them meet this timing constraint. However, if in one out of ten iterations there is another task to execute (for instance, an OS routine that requires 20 time units), the run-time scheduler will select the fastest Pareto point for these iterations. Hence, the deadlines of the two tasks will be met. A designer attempting to tackle the same situation using a static worst-case approach would be forced to always select the fastest solution in order to meet the timing constraints. Both approaches meet the timing constraints, but our TCM approach consumes on average 11*0.1 + 5*0.9 = 5.6 J/iteration, whereas the static worst-case approach consumes 11 J/iteration. Hence, a run-time scheduling approach like TCM can drastically reduce the overall energy consumption while still meeting the real-time constraints.

Currently, the TCM scheduling tools neglect the task context-switching overhead, since for many existing processors it is very low compared to the execution time of a TN. However, the overhead due to loading a configuration on an FPGA is much greater than the afore-mentioned context-switching overhead; e.g., reconfiguring a tile of our ICN-based FPGA consumes 4 ms (assuming that a tile occupies one tenth of an XC2V6000 FPGA and the configuration frequency is 50 MHz). The impact of this overhead on the system performance greatly depends on the granularity of the TNs. However, for current multimedia applications, the average TN execution time is likely to be in the order of milliseconds; for instance, the motion JPEG application must decode a frame in 40 ms. If this is the case, the reconfiguration overhead can drastically affect both the performance and the energy consumption of the system, moving the Pareto curve to a more energy- and time-consuming area. Moreover, in many cases, the shape of the Pareto curve changes when this overhead is added. Therefore, if it is not included, the TCM run-time scheduler cannot make optimal decisions. Of course, there is not always a need for reconfiguration, since when a task is loaded onto an FPGA tile it remains present for the next iteration(s). Hence, as long as just one task is implemented in each tile, no reconfigurations are needed. Clearly, a simple way to avoid the costly reconfiguration overhead is to reuse the configurations that have been loaded previously. However, the main advantage of the TCM approach is its run-time flexibility. For instance, in the motion JPEG example, the run-time scheduler changes the selected Pareto point frequently. Therefore, it is likely that new tasks must often be loaded onto the FPGA. In this example, the motivation for selecting a new Pareto point is to meet a time constraint. However, that constraint might actually not be met if several tiles must be reconfigured. Hence, apart from reusing HW tasks from one iteration to another, it is also necessary to reduce the reconfiguration overhead when new tasks must be loaded frequently onto the FPGA. To address this problem, we have developed a set of modules that co-operate with the TCM run-time scheduler to minimize the effects of this overhead.
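The selection policy of this example can be sketched in a few lines: pick the least-energy Pareto point that still fits in the available slack. This is an illustrative reading of the behavior described above, not the actual TCM scheduler of [13].

```cpp
#include <vector>

struct SchedulePoint { double time, energy; };

// Least-energy Pareto point that fits in the available slack
// (deadline minus the time reserved for other tasks).
// Returns an index into the curve, or -1 if no point fits.
int selectParetoPoint(const std::vector<SchedulePoint>& curve, double slack) {
    int best = -1;
    for (int i = 0; i < (int)curve.size(); ++i)
        if (curve[i].time <= slack &&
            (best < 0 || curve[i].energy < curve[best].energy))
            best = i;
    return best;
}
// With the numbers above: slack is 40 time units in 9 of 10 iterations,
// so the 5 J point is chosen; in the remaining iteration the OS routine
// leaves only 20 units, so the fast 11 J point is chosen. Average:
// 0.9 * 5 + 0.1 * 11 = 5.6 J/iteration.
```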

5. Run-Time Minimization of the Reconfiguration Overhead

Figure 3 depicts the scheduling flow that we have developed to cope at run-time with the reconfiguration overhead of the DRHW resources. The process starts from a schedule provided by the run-time scheduler. The following modules read this schedule and update it, making all the decisions regarding the reconfiguration overhead. Three main decisions are taken. Firstly, for each TF the reuse module decides which TNs can be reused from the previous iterations. Secondly, if some of the TNs cannot be reused, the prefetch module schedules their loads, attempting to minimize the execution-time overhead. Finally, the replacement module decides where the incoming TNs are going to be loaded onto the DRHW. If free tiles are available this is a very simple decision; otherwise this module must decide which TN is least likely to be reused in the future and overwrite it with a new TN. The decisions are taken sequentially for all the active TFs (i.e., those TFs that are going to be executed in the current iteration). The TFs are analyzed following the order of the initial schedule selected by the run-time scheduler. After re-scheduling one TF, the initial schedule is updated, if needed, adding the delay created by the reconfigurations.

Fig. 3. Run-time scheduling flow. Inputs: running-TFs information and platform description. The TCM run-time scheduler feeds an initialization phase; then, for each active TF, the reuse, prefetch, and replacement modules run in sequence; finally the QoS manager produces the final schedule or feeds back to the run-time scheduler.

When all the TFs have been updated, a QoS manager analyzes the final schedule and decides whether it still meets the QoS requirements or whether a new schedule is needed. In the latter case, the QoS manager gives some feedback to the run-time scheduler. With this information the run-time scheduler refines its previous schedule and generates a new one that passes through all these steps again. We assume that the scheduler has been assigned a maximum time-slot for finding a solution. Hence, this process continues until a valid solution is found or until the assigned time-slot expires. If none of the schedules analyzed meets all the timing constraints, the QoS manager selects the solution that is closest to meeting them.
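The overall flow can be summarized as the loop below. The types and module interfaces are illustrative placeholders; they only mirror the structure of Fig. 3 (candidate schedule, per-TF reuse/prefetch/replacement, QoS check, and feedback until the time slot expires).

```cpp
#include <functional>

// Placeholder types mirroring the structure of Fig. 3.
struct Schedule { bool meetsDeadlines = false; double cost = 1e18; };
struct FlowModules {
    std::function<void(Schedule&)> init, reuse, prefetch, replace;
};

// One candidate schedule per iteration, until a valid one is found or
// the assigned time slot (modeled here as an iteration budget) expires.
Schedule runTimeFlow(std::function<Schedule()> nextCandidate,
                     FlowModules& m, int budget) {
    Schedule best;
    for (int it = 0; it < budget; ++it) {
        Schedule s = nextCandidate();  // run-time scheduler (uses feedback)
        m.init(s);                     // free / re-useable tile lists
        m.reuse(s);                    // per active TF, in schedule order:
        m.prefetch(s);                 //   reuse, then prefetch,
        m.replace(s);                  //   then replacement decisions
        if (s.meetsDeadlines) return s;    // QoS requirements met
        if (s.cost < best.cost) best = s;  // remember least-cost schedule
    }
    return best;   // closest-to-feasible schedule (cf. Sect. 6.6)
}
```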

6. Implementation details

This section explains in detail how each module of the scheduling flow works.

6.1 TCM run-time scheduler

The TCM run-time scheduler has been recently described in detail in [13]. For each active TF, the scheduler selects one of its Pareto points and schedules its execution. This scheduler is called periodically. Since we are tackling dynamic applications with non-deterministic behavior, the set of active TFs may differ from one iteration to the next. However, the scheduler assumes that at the beginning of a period the set of active TFs is known. According to [10] this is realistic for real-life multimedia applications if specific extra code is inserted to extract the necessary information. Nevertheless, if an unexpected task appears, the run-time scheduler must be invoked again to re-evaluate the initial decisions.

6.2 Initialization phase

In this phase the previously generated schedule is analyzed. The goal is to identify which of the TNs currently located in the DRHW tiles can be reused. Since the Pareto points have already been selected (each Pareto point represents a schedule and an assignment of the TNs over the platform resources), at this point it is already known which TNs are going to be executed in the DRHW. Hence, this module simply checks which of them are still loaded on the DRHW. Thus, the tiles of the DRHW are divided into two categories, namely free tiles (those that hold a TN that is not going to be executed in the DRHW during this period) and re-useable tiles (those that hold a TN that is going to be executed in the DRHW during this period). This module generates two lists as output. The first list contains all the free tiles, whereas the second contains the re-useable tiles sorted by their execution start time. This second list is called the replacement list.
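A possible realization of this phase is sketched below: one pass over the tiles splits them into the two lists, and the replacement list is then sorted by the start time of the TN each tile holds. The data structures are illustrative.

```cpp
#include <algorithm>
#include <set>
#include <vector>

struct Tile { int id; int loadedTN; };  // TN currently held by this tile

struct TileLists {
    std::vector<Tile> freeTiles;        // TN not scheduled on DRHW this period
    std::vector<Tile> replacementList;  // re-useable tiles, by TN start time
};

TileLists initializationPhase(const std::vector<Tile>& tiles,
                              const std::set<int>& scheduledOnDrhw,
                              const std::vector<double>& tnStartTime) {
    TileLists out;
    for (const Tile& t : tiles)
        (scheduledOnDrhw.count(t.loadedTN) ? out.replacementList
                                           : out.freeTiles).push_back(t);
    std::sort(out.replacementList.begin(), out.replacementList.end(),
              [&](const Tile& a, const Tile& b) {
                  return tnStartTime[a.loadedTN] < tnStartTime[b.loadedTN];
              });
    return out;
}
```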

6.3 Reuse module

This module, like the next two modules, is executed sequentially for each TF. It looks for TNs assigned to the DRHW that are present in the replacement list. These will not be reloaded, because they can be reused, while the remaining TNs assigned to the DRHW are marked as "to be loaded".
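In code, the reuse decision reduces to a membership test against the replacement list (a sketch; names are hypothetical):

```cpp
#include <set>
#include <vector>

struct ThreadNode { int id; bool toBeLoaded = false; };

// A TN mapped to the DRHW is reused if some tile in the replacement list
// already holds it; otherwise it is marked "to be loaded".
void reuseModule(std::vector<ThreadNode>& drhwTNs,
                 const std::set<int>& tnsInReplacementList) {
    for (ThreadNode& tn : drhwTNs)
        tn.toBeLoaded = (tnsInReplacementList.count(tn.id) == 0);
}
```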

6.4 Prefetch module

After the reuse module has identified which TNs must be loaded onto the DRHW, the prefetch module decides when these TNs must be loaded in order to minimize the reconfiguration execution-time overhead. This module implements a scheduling heuristic described in detail in [14]. The configuration prefetch technique, introduced by Li and Hauck in [15] for static scheduling, attempts to overlap the configuration of a TN with the computation of other TNs in order to hide the configuration latency. A very simple example is presented in figure 4, where 4 TNs must be loaded and executed on an FPGA with 4 tiles. Since current FPGAs do not support multiple simultaneous reconfigurations, configurations must be loaded sequentially. Without applying prefetch, the best on-demand schedule is depicted in figure 4(a): TN3 cannot be loaded before the loading of TN2 is finished, and TN4 must wait until the execution of both TN2 and TN3 is finished. However, applying prefetch (figure 4(b)), the loads of TN2 and TN3 overlap with the execution of TN1, and the load of TN4 overlaps with the execution of TN2 and TN3. Hence, only the load of TN1 penalizes the system execution time.

Fig. 4. Configuration prefetch example with four TNs on four tiles: (a) without prefetch; (b) applying prefetch. L: loading, Ex: executing
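The essence of prefetch scheduling on a single reconfiguration port can be sketched as follows: loads are serialized, and each one starts as early as its dependencies allow rather than on demand, so it overlaps the execution of previously started TNs. This is a simplified illustration of the principle of Fig. 4, not the actual heuristic of [14].

```cpp
#include <algorithm>
#include <vector>

struct PendingLoad {
    int tn;           // TN to be loaded
    double earliest;  // earliest time its load may begin (0 = prefetchable)
};

// Serialize the loads on the single reconfiguration port, starting each
// one as early as possible so it can hide behind ongoing TN execution.
// Returns the scheduled start time of each load.
std::vector<double> schedulePrefetch(std::vector<PendingLoad> loads,
                                     double reconfLatencyMs = 4.0) {
    std::sort(loads.begin(), loads.end(),
              [](const PendingLoad& a, const PendingLoad& b) {
                  return a.earliest < b.earliest;
              });
    std::vector<double> start(loads.size());
    double portFreeAt = 0.0;  // only one reconfiguration at a time
    for (size_t i = 0; i < loads.size(); ++i) {
        start[i] = std::max(portFreeAt, loads[i].earliest);
        portFreeAt = start[i] + reconfLatencyMs;
    }
    return start;
}
```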

We have investigated the effectiveness of this module with four multimedia applications. Table 1 shows that the reconfiguration overhead affects more heavily those applications with less execution time; this overhead is reduced on average by a factor of 4 when our prefetch module is applied. However, it must be remarked that these results have been obtained assuming that all the TNs must be loaded onto the DRHW. If some of them are assigned to SW resources, or can be reused from previous iterations, the reconfiguration overhead will be drastically reduced, because there will be fewer loads to overlap with the same computation. Hence, even more possibilities exist to hide the reconfiguration latency.

Application           TNs   Init T (ms)   Overhead   Prefetch
Pattern Recognition    6        94          +17%        +4%
JPEG decoder           4        81          +25%        +5%
Enhanced JPEG dec.     8        57          +35%        +7%
MPEG decoder           5        28          +70%        +22%

Table 1. Reconfiguration overhead with and without applying the prefetch module for four actual multimedia applications. TNs is the number of TNs in the application; Init T is the execution time (ms) of the application assuming that no reconfigurations are required; Overhead and Prefetch are the percentage increase in execution time due to the reconfiguration overhead, when all the configurations are loaded in the FPGA, without and with the prefetch module, respectively. The first application computes the Hough transform of a given image in order to look for certain patterns; the next two are different implementations of a JPEG decoder; the last is an MPEG-1 video decoder.

6.5 Replacement module

Once the loads of the TNs have been scheduled, this module decides where they are going to be loaded. There are two possibilities. First, if the free-tiles list is not empty, the first tile of the list is selected and removed from the list. Otherwise, the module selects one of the tiles in the replacement list. In this case, the module applies a greedy policy to select the TN to replace. Three TN categories exist in the replacement list: first, TNs that can be reused in the current iteration; second, TNs that have already been reused during the current period; and finally, TNs that have been loaded during this iteration. With this information the replacement policy is very simple. Nodes that belong to the first category have the maximum priority to remain in the FPGA. This is coherent, because if these TNs are not removed from the FPGA, they are going to be reused at least this period. However, if the second and third categories are empty, the TN from the first category that is scheduled farthest from the current point in time is selected for replacement. This policy fits very well with the prefetch module. For instance, if the TF depicted in figure 4 is in the replacement list, and one of its TNs must be replaced, TN4 will be selected. Hence, when the TF starts its execution, TN4 will have to be loaded onto the DRHW, but the prefetch module will easily schedule this load in a way that does not introduce any delay in the system execution. Nevertheless, if TN1 is selected for replacement, the prefetch module will be unable to hide the reconfiguration latency and the whole TF will be delayed several milliseconds (and consequently all the following TFs). In the case of the MPEG decoder, the 22% overhead mentioned in table 1 is reduced to just 8% if the first TN of the decoder is reused, and no overhead is present at all if the second TN is reused as well. However, if the fifth TN is the only one reused, the 22% overhead remains.
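A compact form of this greedy policy is sketched below: tiles in the second and third categories are evicted first, and ties are broken by evicting the TN whose execution is scheduled farthest in the future, so the prefetch module can later hide its reload. Illustrative only.

```cpp
#include <vector>

struct ReplacementEntry {
    int category;       // 1: will be reused this iteration,
                        // 2: already reused this period,
                        // 3: loaded this iteration
    double startTime;   // scheduled execution start of the TN it holds
};

// Pick the tile to overwrite: highest category first; ties broken by the
// TN scheduled farthest from the current point in time.
int pickVictim(const std::vector<ReplacementEntry>& repl) {
    int victim = -1;
    for (int i = 0; i < (int)repl.size(); ++i) {
        if (victim < 0 ||
            repl[i].category > repl[victim].category ||
            (repl[i].category == repl[victim].category &&
             repl[i].startTime > repl[victim].startTime))
            victim = i;
    }
    return victim;  // index into the replacement list, or -1 if it is empty
}
```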


Fig. 5. Percentage of configurations reused, as a function of the number of FPGA tiles (5 to 31), for a set of 10 TFs with 74 TNs in total, using our greedy replacement heuristic and a random replacement policy.

Figure 5 shows the percentage of reused configurations for a randomly generated set of ten TFs (with 74 TNs in total) running on a platform with SW and DRHW resources (with a variable number of tiles). We have simulated the dynamic behavior of the TFs by assigning to each of them a factor that represents the probability of executing it in a given iteration (these probabilities were assigned randomly among 25%, 50%, 75% and 100%). In addition, we have generated several Pareto points for each TF (each Pareto point represents a different allocation of the TNs) and assigned each a probability of being selected. Hence, even when a TF is executed in two consecutive iterations, some of the TNs assigned to SW in the first iteration may be assigned to DRHW in the second one, and vice versa. Clearly, this dynamic behavior makes it more difficult to reuse a configuration. However, as shown in figure 5, our heuristic is able to reuse a significant percentage of TNs even with this dynamic behavior. For instance, when the FPGA has 21 tiles, 50% of the configurations are reused. Taking into account that there are 74 TNs competing for these 21 tiles, this is a very interesting result. In order to evaluate our heuristic, we have compared it with a random replacement policy. As can be seen in figure 5, when a small number of FPGA tiles is present (which is normally the case), our heuristic largely outperforms the random policy. The difference becomes smaller as the number of tiles increases. This is logical, since reusing a TN becomes simpler as the number of FPGA tiles grows.

6.6 Quality-of-Service manager

This module analyzes whether the initial schedule, after being modified by the previous modules, still meets the timing constraints. If so, the scheduling process finishes. Otherwise, if there is still time available for another iteration, the process starts again after giving some feedback to the run-time scheduler. This feedback provides information about which constraints have not been met and which TFs have created the delays. On the contrary, if the time-slot for the scheduling process has expired, the QoS manager selects the least-cost schedule. This cost is computed using equation (1):

Cost = \sum_{i \in M} T_i \cdot QoSL_i    (1)

where M is the set of missed deadlines, T_i is the difference between the deadline of TF_i and the actual time when TF_i finishes, and QoSL_i is the QoS level of TF_i. Clearly, our approach is not suitable for safety-critical applications with hard real-time constraints. However, it has been developed to tackle multimedia applications, which are typically characterized by soft real-time constraints. In this kind of application a reasonable percentage of deadline misses is allowed. This percentage heavily depends on the application. Hence, we give the designer the flexibility to decide which deadlines are more important: the QoS level of each TF can be fixed explicitly. If no explicit information is given, all TF QoS levels are set to 1.
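Equation (1) translates directly into code; only the TFs in M (those that missed their deadline) contribute, weighted by their QoS level (a sketch):

```cpp
#include <vector>

struct TFResult { double deadline, finishTime, qosLevel; };

// Cost of a schedule per Eq. (1): sum, over the TFs that missed their
// deadline, of the tardiness T_i weighted by the QoS level QoSL_i.
double scheduleCost(const std::vector<TFResult>& tfs) {
    double cost = 0.0;
    for (const TFResult& tf : tfs)
        if (tf.finishTime > tf.deadline)   // tf belongs to M
            cost += (tf.finishTime - tf.deadline) * tf.qosLevel;
    return cost;
}
```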

6.7 Timing Analysis

Since this is a run-time scheduling system, our major concern is to create the minimum possible run-time overhead while still providing good solutions. In this context, the modules we have developed generate very low overhead. This overhead is computed by adding the execution time of the run-time scheduler to the execution time of the modules that we have developed to tackle the reconfiguration overhead. When analyzing the execution time of our modules, we have observed that the prefetch module is much more time-consuming than the others (its complexity is O(NC*NTN), where NC is the number of configurations that must be loaded and NTN is the number of TNs in the TF). However, it still remains fast enough to be applied at run-time. For instance, it schedules the loads of a TF with 13 TNs assigned to HW in 4 µs using a processor running at 200 MHz. If there are 20 TFs to schedule with this module, the overhead will be 0.08 ms, and the overhead of all the modules together will be below 0.1 ms, which is still affordable, especially when compared with the reconfiguration overhead of loading a new task onto a DRHW tile (in our system we assume that this overhead is 4 ms). Note that 20 TFs with 13 reconfigurations each would mean 260 TNs loaded during a single iteration; considering that we currently target DRHW systems with no more than 32 tiles, this number is much larger than the average number of reconfigurations per iteration that we expect. In order to assess the total run-time overhead, the execution time of the run-time scheduler must be considered as well. The current TCM run-time scheduler [13] provides a near-optimal schedule for a set of 20 TFs with 9 Pareto points per TF in less than 0.1 ms using a 200 MHz processor. Hence, for 20 TFs a whole iteration can be executed in 0.2 ms (0.1 ms for the run-time scheduler and 0.1 ms for the other modules). Thus, if the scheduling system receives a time-slot of 1 ms, it would be possible, if needed, to execute up to five iterations, which should be sufficient to refine the initial solution.
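Putting the numbers of this section together, the per-iteration budget works out as follows (values taken from the measurements above):

```latex
T_{\text{iteration}} \approx \underbrace{0.1\,\text{ms}}_{\text{run-time scheduler}}
  + \underbrace{20 \times 4\,\mu\text{s}}_{\text{prefetch, 20 TFs}}
  + \underbrace{<\!0.02\,\text{ms}}_{\text{other modules}}
  \approx 0.2\,\text{ms}
  \;\Rightarrow\;
  \left\lfloor \tfrac{1\,\text{ms}}{0.2\,\text{ms}} \right\rfloor = 5 \text{ iterations per time-slot}
```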

7. Conclusion

DRHW provides both the performance and the flexibility that current multimedia applications demand. However, several drawbacks have prevented a broader use of these resources. The ICN model for DRHW resources and the TCM scheduling technique can efficiently overcome them. Nevertheless, in order to optimally manage the DRHW resources, specific support must be added to the system to tackle the reconfiguration overhead. We have developed a set of modules that reduce this overhead to a great extent. For instance, the prefetch module on its own can reduce this overhead by a factor of four, and the replacement module reduces the number of configurations to load by a factor of two, even when tackling highly dynamic TFs with almost 4 TNs per DRHW tile. In addition, these two techniques working together can reduce the overhead even more drastically (in many cases it is reduced to zero). Moreover, if it is not possible to reduce the overhead enough, our iterative scheduling flow refines the solution, attempting to find a better one. We believe that our approach makes the use of DRHW resources affordable for current highly-dynamic multimedia applications.

References

1. R. Maestre et al., "Configuration Management in Multi-Context Reconfigurable Systems for Simultaneous Performance and Power Optimizations", Proc. of ISSS'00, pp. 107-113, 2000.
2. M. Sánchez-Elez et al., "Low-Energy Data Management for Different On-Chip Memory Levels in Multi-Context Reconfigurable Architectures", Proc. of DATE'03, pp. 36-41, 2003.
3. Li Shang et al., "Hw/Sw Co-synthesis of Low Power Real-Time Distributed Embedded Systems with Dynamically Reconfigurable FPGAs", Proc. of ASP-DAC'02, pp. 345-360, 2002.
4. A. Kalavade, P. Moghe, "A Tool for Performance Estimation of Networked Embedded End-Systems", Proc. of DAC'98, pp. 257-262, 1998.
5. T. Marescaux et al., "Interconnection Networks Enable Fine-Grain Dynamic Multi-Tasking on FPGAs", Proc. of FPL'02, pp. 795-805, 2002.
6. V. Nollet et al., "Designing an Operating System for a Heterogeneous Reconfigurable SoC", Proc. of RAW'03, 2003.
7. J-Y. Mignolet et al., "Infrastructure for Design and Management of Relocatable Tasks in a Heterogeneous Reconfigurable System-on-Chip", Proc. of DATE'03, 2003.
8. P. Yang et al., "Energy-Aware Runtime Scheduling for Embedded-Multiprocessor SOCs", IEEE Design & Test of Computers, pp. 46-58, 2001.
9. P. Marchal et al., "Matador: an Exploration Environment for System-Design", Journal of Circuits, Systems and Computers, Vol. 11, No. 5, pp. 503-535, 2002.
10. P. Yang et al., "Managing Dynamic Concurrent Tasks in Embedded Real-Time Multimedia Systems", Proc. of ISSS'02, pp. 112-119, 2002.
11. G. Vanmeerbeeck et al., "Hardware/Software Partitioning for Embedded Systems in OCAPI-XL", Proc. of CODES'01, 2001.
12. Xilinx, Inc., www.xilinx.com.
13. P. Yang, F. Catthoor, "Pareto-Optimization-Based Run-Time Task Scheduling for Embedded Systems", Proc. of ISSS'03, 2003.
14. J. Resano et al., "Run-time Minimization of Reconfiguration Overhead in Dynamically Reconfigurable Systems", Proc. of FPL'03, 2003.
15. Z. Li, S. Hauck, "Configuration Prefetching Techniques for Partial Reconfigurable Coprocessor with Relocation and Defragmentation", Proc. of the Int'l Symp. on FPGAs, pp. 187-195, 2002.