Run-time Minimization of Reconfiguration Overhead in Dynamically Reconfigurable Systems

Javier Resano1, Daniel Mozos1, Diederik Verkest2,3, Serge Vernalde2, Francky Catthoor2,4

1 Dept. Arquitectura de Computadores, Universidad Complutense de Madrid, Spain. [email protected], [email protected]
2 IMEC vzw, Kapeldreef 75, 3001 Leuven, Belgium. {Verkest, Vernalde, Catthoor}@imec.be
3 Professor at Vrije Universiteit Brussel, Belgium
4 Professor at Katholieke Universiteit Leuven, Belgium
Abstract. Dynamically Reconfigurable Hardware (DRHW) can take advantage of its reconfiguration capability to adapt its performance and energy consumption at run-time. However, due to the lack of programming support for dynamic task placement on these platforms, little previous work has studied these run-time performance/power trade-offs. To cope with the task placement problem, we have adopted an interconnection-network-based DRHW model with specific support for reallocating tasks at run-time. On top of it, we have applied an emerging task concurrency management (TCM) methodology previously applied to multiprocessor platforms. We have identified that the reconfiguration overhead can drastically affect both the system performance and the energy consumption. Hence, we have developed two new modules for the TCM run-time scheduler that minimize these effects. The first module reuses previously loaded configurations, whereas the second minimizes the impact of the reconfiguration latency by applying a configuration-prefetching technique. With these techniques the reconfiguration overhead is reduced by a factor of 4.
1 Introduction and Related Work

Dynamically Reconfigurable Hardware (DRHW), which allows partial reconfiguration at run-time, represents a powerful and flexible way to deal with the dynamism of current multimedia applications. However, compared with application-specific integrated circuits (ASICs), DRHW systems are less power efficient. Since power consumption is one of the most important design concerns, this problem must be addressed at every possible level; thus, we propose to use a task concurrency management (TCM) approach, specially designed to deal with current dynamic multimedia applications, which attempts to reduce the energy consumption at the task level. Other research groups have addressed the power consumption of DRHW, proposing a technique to allocate configurations [1], optimizing the data allocation both for execution time and energy consumption [2], presenting an energy-
conscious architectural exploration [3], introducing a methodology to decrease the voltage requirements [4], or carrying out static scheduling [5]. However, all these approaches are applied at design-time, so they cannot efficiently tackle dynamic applications, whereas our approach selects at run-time among different power/performance trade-offs. Hence, we can achieve larger energy savings, since our approach avoids static mappings based on worst-case conditions. The rest of the paper is organized as follows: Sects. 2 and 3 provide a brief overview of the ICN-based DRHW model and the TCM methodology. Sect. 4 discusses the problem of the reconfiguration overhead. Sect. 5 presents the two modules developed to tackle this problem. Sect. 6 introduces some energy considerations. Sect. 7 analyses the experimental results and Sect. 8 presents the conclusions.
2 ICN-based DRHW model

The Interconnection-Network (ICN) DRHW model [6] partitions an FPGA platform into an array of identical tiles. The tiles are interconnected using a packet-switched ICN implemented on the FPGA fabric. At run-time, tasks are assigned to these tiles using partial dynamic reconfiguration. Communication between the tasks is achieved by sending messages over the ICN through a fixed network interface implemented inside each tile. As explained in [6], this approach avoids the huge Place & Route overhead that would be incurred when directly interconnecting tasks. In the latter case, when a new task with a different interface is mapped on a tile, the whole FPGA needs to be rerouted in order to connect the interface of the new task to the other tasks. The communication interfaces also contain some Operating System (OS) support, such as storage space and routing tables, which allows run-time migration of tasks from one tile to another, or even from an FPGA tile to an embedded processor. Applying the ICN model to a DRHW platform greatly simplifies the dynamic task allocation problem, providing a software-like approach where tasks can be assigned to HW resources in the same way that threads are assigned to processors. Thus, this model enables the use of the emerging Matador TCM methodology [7].
3 Matador Task Concurrency Management Methodology

The Matador TCM methodology [7, 9] proposes a task scheduling technique for heterogeneous multiprocessor embedded systems. The different steps of the methodology are presented in figure 1. It starts from an application specification composed of several tasks, called Thread Frames (TFs). These TFs are dynamically created and deleted and can even be non-deterministically triggered. Nevertheless, inside each TF only deterministic and limited dynamic behavior is allowed. The whole application is represented using the grey-box model, which combines Control-Data Flow Graphs (CDFG) with another model (called MTG*) specifically designed to tackle dynamic non-deterministic behavior. CDFGs are used to model the
TFs, whereas MTG* models the inter-TF behavior. Each node of a CDFG is called a Thread Node (TN). TNs are the atomic scheduling units. TCM accomplishes the scheduling in two phases. The first phase generates at design-time a set of near-optimal scheduling solutions for each TF, called a Pareto curve. Each solution represents a schedule and an assignment of the TNs to the available processing elements with a different performance/energy trade-off. Whereas this first phase accomplishes the design-space exploration of each TF separately, the second phase tackles their run-time behavior, selecting at run-time the most suitable Pareto point for each TF. The goal of the methodology is to minimize the energy consumption while meeting the timing constraints of the applications (typically, highly dynamic multimedia applications). TCM has been successfully applied to schedule several current multimedia applications on multiprocessor systems [8, 9].
Fig. 1. Matador Task Concurrency Management Methodology
4 Dynamic Reconfiguration Overhead

Fig. 2 represents the Pareto curve obtained using the existing TCM design-time scheduler for a system with an SA-1110 processor coupled with a Virtex2 v6000 FPGA. The TF corresponds to a motion JPEG application. In the figure both optimal and non-optimal schedules are depicted, but only the optimal ones are included in the Pareto curve. This curve is one of the inputs of the run-time scheduler. Thus, the scheduler can select at run-time among different energy/performance trade-offs. For instance, if there is no tight timing constraint, it will select the least energy-consuming solution, whereas if the timing constraint changes (for instance, when a new task starts), the run-time scheduler will look for a faster solution that meets the new constraint (with the consequent energy penalty).
Fig. 2. Pareto curve of the motion JPEG TF (Energy (J) vs. Time (Time Units); Pareto points and non-optimal points are depicted)
In this example, assuming that 80% of the time the most energy-efficient solution can be selected, TCM achieves a 44% energy saving (on average 11*0.2 + 5*0.8 = 6.2 J/iteration) compared with a static worst-case approach, while providing the same peak performance (the worst-case approach consumes 11 J/iteration). Hence, the TCM approach can drastically reduce the overall energy consumption while still meeting hard real-time constraints. However, current TCM scheduling tools neglect the task context-switching overhead, since in certain cases it is very low. The overhead due to loading a configuration on an FPGA, however, is much greater than the aforementioned context-switching overhead; e.g. reconfiguring a tile of our ICN-based FPGA takes 4 ms (assuming that a tile occupies one tenth of a XC2V6000 FPGA and the configuration frequency is 50 MHz). The impact of this overhead on the system performance greatly depends on the granularity of the TNs. For current multimedia applications, however, the average TN execution time is likely to be on the order of milliseconds (a motion JPEG application must decode a frame every 40 ms). If this is the case, the reconfiguration overhead can drastically affect both the performance and the energy consumption of the system, moving the Pareto curve to a more energy- and time-consuming area. Moreover, in many cases the shape of the Pareto curve changes when this overhead is added. Therefore, if it is not included, the TCM schedulers cannot make optimal decisions. To address this problem, we have added two new modules to the TCM schedulers, namely configuration reuse and configuration prefetch. These modules are not only used to estimate the reconfiguration overhead accurately, but also to minimize it. Configuration reuse attempts to reuse previously loaded configurations.
Thus, if a TF is executed periodically, at the beginning of each iteration the scheduler checks whether the TNs loaded in the previous iteration are still present; if so, they are reused, preventing unnecessary reconfigurations. Configuration prefetch [10] attempts to overlap the configuration loading of a TN with the computation of other TNs in order to hide the configuration latency. A very simple example is presented in figure 3, where 4 TNs must be loaded and executed on an FPGA with 3 tiles. Since current FPGAs do not support multiple simultaneous reconfigurations, configurations must be loaded sequentially. Without prefetching, the best on-demand schedule is the one depicted in figure 3(a), because TN3 cannot be loaded before the loading of TN2 has finished, and TN4 must wait until the execution of both TN2 and TN3 has finished. However, applying prefetch (figure 3(b)), the loads of TN2 and TN4 overlap with the execution of TN1 and TN3. Hence, only the loads of TN1 and TN3 penalize the system execution time.
Fig. 3. Configuration prefetch on a platform with 3 FPGA tiles. L: loading, Ex: executing. a) Schedule without prefetching: each configuration is loaded on demand, so the load of TN3 must wait for the load of TN2, and TN4 is loaded only after TN2 and TN3 have finished executing. b) Schedule with prefetching: the loads of TN2 and TN4 are overlapped with the execution of TN1 and TN3
Clearly, this powerful technique can lead to significant execution-time savings. Unfortunately, deciding the best order in which to load the configurations is an NP-complete problem. Moreover, in order to apply this technique in conjunction with the configuration reuse technique, the reconfigurations must be scheduled at run-time, since the number of configurations that must be loaded depends on the number of configurations that can be reused, and this number will typically differ from one execution to another if the system behavior is non-deterministic. Therefore, we need to introduce these techniques in the TCM run-time scheduler while keeping the resulting overhead to a minimum.
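The benefit described for Fig. 3 can be imitated with a small simulator. The sketch below is only an illustration under simplifying assumptions (a single reconfiguration port, loads issued in a fixed topological task order, invented task names and times not taken from the paper): it compares an on-demand load policy with a prefetching one.

```python
# Toy model of sequential configuration loading on an ICN-style FPGA.
# With prefetching, a load starts as soon as the reconfiguration port and
# the target tile are free; on demand, it also waits for the task's inputs.

def schedule(tasks, prefetch, load_time=1.0):
    """tasks: dict name -> (tile, exec_time, predecessors), in topological order."""
    port_free = 0.0        # the single reconfiguration port is used sequentially
    tile_free = {}         # time at which each tile finishes its last task
    finish = {}            # task -> execution finish time
    for name, (tile, ext, preds) in tasks.items():
        ready = max([finish[p] for p in preds], default=0.0)  # inputs available
        earliest = max(port_free, tile_free.get(tile, 0.0))   # port and tile free
        if prefetch:
            load_start = earliest                 # hide latency behind other work
        else:
            load_start = max(earliest, ready)     # on demand: wait for inputs too
        load_end = load_start + load_time
        port_free = load_end
        finish[name] = max(load_end, ready) + ext
        tile_free[tile] = finish[name]
    return max(finish.values())

# 4 tasks on 3 tiles; TN4 depends on TN2 and TN3 and shares TN2's tile
tasks = {
    "TN1": ("t1", 2.0, []),
    "TN2": ("t2", 2.0, ["TN1"]),
    "TN3": ("t3", 2.0, ["TN1"]),
    "TN4": ("t2", 2.0, ["TN2", "TN3"]),
}
print(schedule(tasks, prefetch=False))
print(schedule(tasks, prefetch=True))
```

With these invented times the prefetching schedule finishes earlier because two of the four loads are overlapped with computation, mirroring the qualitative behavior of Fig. 3.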
5 Run-Time Configuration Prefetch and Reuse

The run-time scheduler receives as input a set of Pareto curves and selects a point on them for each TF according to timing and energy considerations. Currently, at run-time we never alter the scheduling order imposed by the design-time scheduler; thus, in order to check whether a previous configuration can be reused, we just look for the first TN assigned to each FPGA tile in the given schedule (we use the term initial for those TNs). Since all the tiles created by the use of the ICN in the FPGA are identical, the actual tile in which a TN is executed is irrelevant. When the run-time scheduler needs to execute a given TF, it first checks whether the initial TNs are still present in any of the FPGA tiles. If so, they are reused (avoiding the costly reconfiguration); otherwise the TNs are assigned to an available tile. For instance, in the example of Fig. 3, the run-time scheduler will initially check whether the configurations of TN1, TN2, and TN3 are still loaded in the FPGA; in this case TN1 and TN3 will be reused. The run-time scheduler will not check whether it can reuse the configuration of TN4, since it knows that even if this TN remained loaded in the FPGA, it would be overwritten by TN2. The run-time configuration reuse algorithm is presented in figure 4.a; its complexity is O(NT*I), where NT is the number of tiles and I the number of initial TNs. Typically I and NT are small numbers and the overhead of this module is negligible. Once the run-time scheduler knows which TN configurations can be reused, it has to decide when the remaining configurations are going to be loaded. The pseudo-code of the heuristic developed for this step is presented in figure 4.b. It starts from a given design-time schedule that does not include the reconfiguration overhead and updates it according to the number of reconfigurations needed. Then, it schedules the reconfigurations, applying prefetching to try to minimize the execution time.
a) for (i=0; i < Number of initial TNs; i++){
     while (not found) and (j < Number of FPGA tiles){
       found = look_for_reuse(i, j);
       if (found){ assign the TN i to actual tile j; j++; }}}
   Assign the remaining TNs to actual tiles;

b) If there are TN configurations to load{
     schedule the TNs that do not need to be loaded;
     for (i=0; i < Number of configurations to load; i++){
       select&schedule a configuration;
       schedule its successors that do not need to be loaded on to the FPGA;}}
Fig. 4. a) Configuration reuse pseudo-code. b) Configuration prefetch pseudo-code
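As a hedged illustration of Fig. 4a, the reuse check could look like the sketch below. The tile table, the TN names and the Python data structures are assumptions for illustration only (look_for_reuse is folded into a direct comparison), not the actual implementation.

```python
# Reuse check: for each "initial" TN, scan the tiles for one that still holds
# that TN's configuration from the previous iteration (O(NT * I) probes).

def reuse_configurations(initial_tns, tiles):
    """initial_tns: list of TN ids; tiles: dict tile_id -> currently loaded TN id.
    Returns (assignment, reused): TN -> tile, and the set of reused TNs."""
    assignment, reused = {}, set()
    free_tiles = dict(tiles)                 # tiles not yet claimed this iteration
    for tn in initial_tns:                   # I initial TNs
        for tile, loaded in list(free_tiles.items()):   # up to NT tiles
            if loaded == tn:                 # configuration still present: reuse
                assignment[tn] = tile
                reused.add(tn)
                del free_tiles[tile]
                break
    for tn in initial_tns:                   # the rest require a reconfiguration
        if tn not in assignment:
            tile, _ = free_tiles.popitem()
            assignment[tn] = tile
    return assignment, reused

# FPGA state after the previous iteration (cf. the Fig. 3 example):
tiles = {"t1": "TN1", "t2": "TN4", "t3": "TN3"}
assignment, reused = reuse_configurations(["TN1", "TN2", "TN3"], tiles)
```

On this example state, TN1 and TN3 are reused and TN2 overwrites the tile that held TN4, matching the scenario discussed in the text.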
We have developed a simple heuristic based on an enhanced list-scheduling. It starts by scheduling all the nodes that do not need to be configured on the FPGA, i.e. those assigned to SW or those assigned to the FPGA whose configuration can be reused. After scheduling a TN, the heuristic attempts to schedule its successors. If they do not need to be loaded on to the FPGA the process continues; otherwise, a reconfiguration request is stored in a list. When it is impossible to continue scheduling TNs, one of the requests is selected according to the following criteria:
• If, at a given time t, there is just one configuration ready for loading, this configuration is selected. A configuration is ready for loading if the previous TN assigned to the same FPGA tile has already finished its execution.
• Otherwise, when several configurations are ready, the configuration with the highest weight is selected. The weight of a configuration is assigned at design-time and represents the maximum time-distance from a TN to the end of the TF. This weight is computed by carrying out an ALAP scheduling of the TF. The configurations corresponding to nodes on the critical path of the TF will be heavier than the others.
The complexity of this module is O(N*C), where N is the number of TNs and C the number of configurations to load. Since these two techniques must be applied at run-time, we are very concerned about their execution time. This time depends on the number of TNs, FPGA tiles and configurations that must be loaded. The configuration-prefetch module is much more time-demanding than the configuration-reuse module. However, we believe that the overhead is acceptable for a run-time scheduler. For instance, a TF with 20 nodes, 4 FPGA tiles and 13 configurations to be loaded is scheduled in less than 2.5 µs on a Pentium-II running at 350 MHz. This number has been obtained starting from initial C++ code, with all compiler optimizations disabled.
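The design-time weight criterion can be sketched as follows. The graph, the execution times and the helper names are invented for illustration; the real weights are computed once at design-time by an ALAP scheduling of the TF.

```python
# Weight of a TN = maximum time-distance to the end of the TF, i.e. the
# longest path from the TN to the TF sink; critical-path TNs come out heavier.

def alap_weights(exec_time, successors):
    """exec_time: TN -> duration; successors: TN -> list of successor TNs."""
    weights = {}
    def weight(tn):
        if tn not in weights:
            tail = max((weight(s) for s in successors.get(tn, [])), default=0.0)
            weights[tn] = exec_time[tn] + tail
        return weights[tn]
    for tn in exec_time:
        weight(tn)
    return weights

exec_time = {"TN1": 2.0, "TN2": 3.0, "TN3": 1.0, "TN4": 2.0}
successors = {"TN1": ["TN2", "TN3"], "TN2": ["TN4"], "TN3": ["TN4"]}
w = alap_weights(exec_time, successors)
# among several configurations ready at the same time, load the heaviest first
pick = max(["TN2", "TN3"], key=lambda tn: w[tn])
```

In this invented graph TN2 lies on the critical path, so when TN2 and TN3 are both ready the heuristic would load TN2's configuration first.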
We expect that this time will be significantly reduced when starting from C code and applying compiler optimization techniques. But even if it is not reduced, it will be compensated by the prefetching time savings as long as the heuristic hides the latency of at least one reconfiguration once every 1600 executions (assuming that the reconfiguration overhead is 4 ms, i.e. 1600 times 2.5 µs). In our experiments the heuristic has exhibited much better results than this minimum requirement. Although we believe that the overhead due to the prefetch module is acceptable, it is
not always worthwhile to execute it. For instance, when all the configurations of a design are already loaded in the FPGA, there is no gain in applying prefetch. Thus, in this case the module is not executed, preventing unnecessary computation. There is another common case where this run-time computation can be replaced by design-time computation, namely when all the configurations must be loaded. This case happens at least once (when the TF is loaded for the first time), and it can be very frequent if there is a great number of TNs competing for the resources. Hence, we can save some run-time computation by analyzing this case at design-time. However, this last optimization doubles the storage space needed for a Pareto point, since two schedules are now stored for each point in the Pareto curve. Hence, if the system has a very limited storage space, this optimization should be disabled. The run-time scheduling follows the steps depicted in figure 5.

for each Pareto point to evaluate {
  apply configuration reuse;
  If there are no configurations to load {
    read energy and execution time of schedule 1; }
  Else if all the configurations must be loaded {
    read energy and execution time of schedule 2; }
  Else { update schedule 1 applying configuration prefetching; }}

Fig. 5. Run-time evaluation process pseudo-code. Schedules 1 and 2 are computed at design-time. Both of them share the same allocation of the TNs on the system processing elements, but they have different execution times and energy consumptions, since schedule 1 assumes that all the TNs assigned to the FPGA have been previously loaded, whereas schedule 2 includes the reconfiguration overhead of these TNs
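A minimal sketch of this three-way decision is shown below. The data layout, the numeric values and the stub prefetch heuristic are assumptions for illustration, not the scheduler's actual representation.

```python
# Run-time evaluation of one Pareto point: run the prefetch heuristic only
# in the mixed case; the two extreme cases use design-time schedules.

def evaluate_point(point, to_load, total, prefetch_update):
    """point holds the two design-time schedules; returns the chosen schedule."""
    if to_load == 0:
        return point["schedule1"]        # everything reused: no reconfigurations
    if to_load == total:
        return point["schedule2"]        # full reload: precomputed at design-time
    return prefetch_update(point["schedule1"], to_load)   # mixed case only

# Illustrative values: schedules reduced to execution times, and a stub
# heuristic that adds 0.5 time units of visible latency per configuration.
point = {"schedule1": 10.0, "schedule2": 14.0}
update = lambda schedule, n: schedule + 0.5 * n
result = evaluate_point(point, 2, 4, update)
```

The point of the structure is that the expensive prefetch computation is skipped whenever one of the two precomputed schedules applies.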
6 Energy considerations

TCM is an energy-aware scheduling technique; thus, these two new modules should not only reduce the execution time but also the energy consumption. Clearly, the configuration reuse technique can generate energy savings, since loading a configuration on an FPGA involves both an execution-time and an energy overhead. According to [5], when an FPGA is frequently reconfigured, up to 50% of the FPGA energy consumption is due to the reconfiguration circuitry. Hence, reducing the number of reconfigurations is a powerful way to achieve energy savings. Configuration prefetch can also indirectly lead to energy savings. If we assume that loading a configuration onto a tile has a constant overhead Ec, and a given schedule involves 4 reconfigurations, the energy overhead due to the reconfigurations will be 4*Ec independently of the order of the loads. However, configuration prefetch reduces the execution time of the TFs, and the run-time scheduler can take advantage of this extra time to select a slower and less energy-consuming Pareto point. Fig. 6 illustrates this idea with one TF.
Fig. 6. Configuration prefetch technique for energy savings. The figure plots four solutions (s1 to s4) as energy vs. time against a deadline, with and without prefetching. Before applying prefetching, s2 was the solution that consumed the least energy while meeting the timing constraint. After applying prefetching, the time saved can be used to select a more energy-efficient solution (s3)
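The selection rule behind Fig. 6 can be sketched as below. The four points, their time/energy values and the amount of hidden latency are invented for illustration; only the selection rule (least energy among deadline-feasible points) reflects the text.

```python
# Energy-aware Pareto point selection: prefetching shortens every point's
# execution time, which can bring a less energy-hungry point under the deadline.

def pick_point(points, deadline):
    """points: list of (name, time, energy); least-energy point meeting the deadline."""
    feasible = [p for p in points if p[1] <= deadline]
    return min(feasible, key=lambda p: p[2])

# (name, execution time, energy); all values are made up for the example
points = [("s1", 10.0, 9.0), ("s2", 14.0, 6.0), ("s3", 18.0, 4.0), ("s4", 22.0, 3.0)]
deadline = 16.0
before = pick_point(points, deadline)
# assume prefetching hides 3 time units of reconfiguration latency per point
after = pick_point([(n, t - 3.0, e) for n, t, e in points], deadline)
```

With these invented numbers, s2 is chosen without prefetching, while the time saved by prefetching makes the more energy-efficient s3 feasible, as in the figure.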
7 Results and Analysis
Fig. 7. Set of TFs generated using the Task Graph For Free (TGFF) system [11]: TF A (4 TNs), TF B (3 TNs), TF C (2 TNs), TF D (2 TNs), TF E (1 TN), TF F (2 TNs) and TF G (1 TN). Inside each node its name (a letter from A to G) and the number of TNs assigned to HW are depicted. When there are different execution paths, each one is tagged with the probability of being selected (0.25, 0.5 and 0.25 in this graph). We assume that this set of TFs is executed periodically
The efficiency of the reuse module depends on the number of FPGA tiles, the TNs assigned to these tiles, and the run-time events. Figure 8 presents the average reuse percentage for the set of TFs depicted in figure 7 for different numbers of FPGA tiles. This example contains 15 TNs assigned to HW, although not all of them are executed in every iteration. The reuse percentage is significant even with just 5 FPGA tiles (29%). Most of the reuse is due to TFs A and G, since they are executed every iteration. For instance, when there are 8 tiles, 48% of the configurations are reused, and 30% are due to TF A and TF G (5 TNs). When there are 17 tiles, only 71% of the configurations are reused. This is not an optimal result, since when there are more tiles than TNs it should be possible to reuse 100% of the configurations. However, we are applying only a local reuse algorithm to each TF instead of a global one to the whole set of TFs. We have adopted a local policy because it creates almost no run-time overhead and can be easily applied to systems with non-deterministic behavior. However, we are currently analyzing how a more complex approach could lead to higher reuse percentages with an affordable overhead.
Fig. 8. Percentage of configurations reused vs. number of tiles in the FPGA (5, 8, 11, 14 and 17 tiles)
We have performed two experiments to analyze the performance of the prefetch module. First, we have studied how good the schedules computed by our heuristic are. To this end we have generated 100 pseudo-random TFs using the TGFF system and scheduled the configuration loads both with our heuristic and with a branch&bound (b&b) algorithm that accomplishes a full design-space exploration. This experiment shows that the b&b scheduler finds better solutions (on average 10% better) than our heuristic. However, for TFs with 20 TNs, it needs 800 times more computation time to find these solutions. Hence, our heuristic generates almost-optimal schedules in an affordable time. The second experiment presents the time savings achieved by the prefetch module for three multimedia applications (table 1). This experiment assumes that the whole application is executed on an ICN-like FPGA with 4 tiles that can be reconfigured in 4 ms, which is currently the fastest possible speed. However, even with this fast-reconfiguration assumption, it is remarkable how the reconfiguration overhead affects the system performance, increasing the overall execution time by up to 35%. This overhead is drastically reduced (on average by a factor of 4) when the prefetch module is applied.
8 Conclusions

The ICN-based DRHW model provides the dynamic task reallocation support needed to apply a TCM approach to heterogeneous systems with DRHW resources. With this model both the DRHW and the SW resources can be handled in the same way, simplifying the tasks of the TCM schedulers. We have identified that the DRHW reconfiguration overhead significantly decreases the system performance and increases the energy consumption. Hence, we have developed two new modules for the TCM run-time scheduler (namely configuration reuse and configuration prefetch) that reduce this problem while improving the run-time execution-time and energy-consumption estimations. Configuration reuse attempts to reuse previously loaded configurations, leading to energy and execution-time savings. Its efficiency depends on the number of FPGA tiles and the TNs assigned to these tiles. However, the module exhibits a significant reuse percentage even with 15 TNs competing for 5 FPGA tiles. Configuration prefetch
schedules the reconfigurations to minimize the overall execution time. The results show that it reduces the execution-time reconfiguration overhead by a factor of 4.

Table 1. Reconfiguration overhead with and without the prefetch module for three actual multimedia applications. TNs is the number of TNs in the application; Init T is the execution time of the application assuming that no reconfigurations are required; Overhead and Prefetch are the percentage increase in execution time due to the reconfiguration overhead, when all the configurations must be loaded in the FPGA, without and with the prefetch module respectively. The first two applications are different implementations of a JPEG decoder; the third application computes the Hough transform of a given image in order to look for certain patterns

Application             TNs   Init T   Overhead   Prefetch
JPEG decoder             4    81 ms     +25%       +5%
Enhanced JPEG decoder    8    57 ms     +35%       +7%
Pattern Recognition      6    94 ms     +17%       +4%
Acknowledgements The authors would like to acknowledge all our colleagues from the T-Recs and Matador groups at IMEC, for all their comments, help and support. This work has been partially supported by TIC 2002-00160.
References

1. R. Maestre et al., "Configuration Management in Multi-Context Reconfigurable Systems for Simultaneous Performance and Power Optimizations", ISSS'00, pp. 107-113, 2000.
2. M. Sánchez-Elez et al., "Low-Energy Data Management for Different On-Chip Memory Levels in Multi-Context Reconfigurable Architectures", DATE'03, pp. 36-41, 2003.
3. M. Wan et al., "Design Methodology of a Low-Energy Reconfigurable Single-Chip DSP System", Journal of VLSI Signal Processing 28, pp. 47-61, 2001.
4. A. D. Garcia et al., "Reducing the Power Consumption in FPGAs with Keeping a High Performance Level", WVLSI'00, pp. 47-52, 2002.
5. Li Shang et al., "Hw/Sw Co-synthesis of Low Power Real-Time Distributed Embedded Systems with Dynamically Reconfigurable FPGAs", ASP-DAC'02, pp. 345-360, 2002.
6. T. Marescaux et al., "Interconnection Networks Enable Fine-Grain Dynamic Multi-Tasking on FPGAs", FPL'02, pp. 795-805, 2002.
7. P. Yang et al., "Energy-Aware Runtime Scheduling for Embedded-Multiprocessor SOCs", IEEE Design & Test of Computers, pp. 46-58, 2001.
8. P. Marchal et al., "Matador: an Exploration Environment for System-Design", Journal of Circuits, Systems and Computers, Vol. 11, No. 5, pp. 503-535, 2002.
9. P. Yang et al., "Managing Dynamic Concurrent Tasks in Embedded Real-Time Multimedia Systems", ISSS'02, pp. 112-119, 2002.
10. Z. Li and S. Hauck, "Configuration Prefetching Techniques for Partial Reconfigurable Coprocessor with Relocation and Defragmentation", Int'l Symp. FPGAs, pp. 187-195, 2002.
11. R. P. Dick et al., "TGFF: Task Graphs for Free", Int'l Workshop HW/SW Codesign, pp. 97-101, 1998.