Hindawi Publishing Corporation International Journal of Reconfigurable Computing Volume 2016, Article ID 9012909, 24 pages http://dx.doi.org/10.1155/2016/9012909
Research Article An Efficient Evolutionary Task Scheduling/Binding Framework for Reconfigurable Systems A. Al-Wattar, S. Areibi, and G. Grewal School of Engineering and Computer Science, University of Guelph, Guelph, ON, Canada N1G 2W1 Correspondence should be addressed to S. Areibi;
[email protected] Received 13 September 2015; Revised 26 November 2015; Accepted 16 December 2015 Academic Editor: Nadia Nedjah Copyright © 2016 A. Al-Wattar et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Several embedded application domains for reconfigurable systems tend to combine frequent changes with high performance demands of their workloads such as image processing, wearable computing, and network processors. Time multiplexing of reconfigurable hardware resources raises a number of new issues, ranging from run-time systems to complex programming models that usually form a reconfigurable operating system (ROS). In this paper, an efficient ROS framework that aids the designer from the early design stages all the way to the actual hardware implementation is proposed and implemented. An efficient reconfigurable platform is implemented along with novel placement/scheduling algorithms. The proposed algorithms tend to reuse hardware tasks to reduce reconfiguration overhead, migrate tasks between software and hardware to efficiently utilize resources, and reduce computation time. A supporting framework for efficient mapping of execution units to task graphs in a run-time reconfigurable system is also designed. The framework utilizes an Island Based Genetic Algorithm flow that optimizes several objectives including performance, area, and power consumption. The proposed Island Based GA framework achieves on average 55.2% improvement over a single-GA implementation and an 80.7% improvement over a baseline random allocation and binding approach.
1. Introduction

In the area of computer architecture, choices span a wide spectrum, with Application Specific Integrated Circuits (ASICs) and General-Purpose Processors (GPPs) at opposite ends. General-Purpose Processors are flexible but, unlike ASICs, are not optimized for specific applications. Reconfigurable architectures in general, and Field Programmable Gate Arrays (FPGAs) in particular, fill the gap between these two extremes by combining much of the performance of ASICs with the flexibility of GPPs. However, FPGAs still do not match ASICs in either power consumption or raw performance. One important feature of FPGAs is their ability to be partially reprogrammed during the execution of an application. This Run-Time Reconfiguration capability provides important benefits when adapting hardware algorithms during system run-time, including sharing hardware resources to reduce device count, minimizing power, and shortening reconfiguration time [1].
Many embedded application domains for reconfigurable systems require frequent changes to support the high performance demands of their workloads. For example, in telecommunication applications, several wireless standards and technologies, such as WiMax, WLAN, GSM, and WCDMA, have to be utilized and supported. However, it is unlikely that these protocols will be used simultaneously. Accordingly, it is possible to dynamically load only the architecture that is needed onto the FPGA. Another example is an Unmanned Aerial Vehicle (UAV) that employs different machine vision algorithms and utilizes the most appropriate one based on the environment or, perhaps, the need to lower power consumption. Time multiplexing of reconfigurable hardware resources raises a number of new issues, ranging from run-time systems to complex programming models that usually form a Reconfigurable Hardware Operating System (ROS). The ROS performs online task scheduling and handles resource management. The main objective of ROS for reconfigurable
platforms is to reduce the complexity of developing applications by giving the developer a higher level of abstraction with which to work. Problem Definition. Performance is one of the fundamental reasons for using Reconfigurable Computing Systems (RCS). By mapping algorithms and applications to hardware, designers can not only tailor the computation components but also perform data-flow optimization to match the algorithm. Typically, a designer would manually divide the FPGA fabric into static and dynamic regions [2]. The static regions accommodate modules that do not change over time, such as the task manager and the buses used for communication. The dynamic region is partitioned into a set of uniform or nonuniform blocks of a certain size called Partial Reconfigurable Regions (PRRs). Each PRR can accommodate Reconfigurable Modules (RMs), application-specific hardware accelerators for the incoming tasks that need to be executed. In any type of operating system, a scheduler decides when to load new tasks to be executed. Efficient task scheduling algorithms have to take task dependencies, data communication, task resource utilization, and system parameters into account to fully exploit the performance of a dynamic reconfigurable system. A scheduling algorithm for a reconfigurable system that assumes one architecture variant for the hardware implementation of each task may lead to inferior solutions [3]. Having multiple execution units per task helps in reducing the imbalance of the processing throughput of interacting tasks [4]. Also, static resource allocation for Run-Time Reconfiguration (RTR) might lead to inferior results, as the number of PRRs required by one application might differ from the number required by another. The type of PRRs (uniform, nonuniform, or hybrid) tends to play a crucial role as well in determining both performance and power consumption (Figure 1). ROS Main Components.
Reconfigurable computing is capable of mapping hardware tasks onto a finite FPGA fabric while taking into account the dependency among tasks and timing constraints. Therefore, managing the resources becomes essential for any reconfigurable computing platform. Dynamic partial reconfiguration allows several independent bit-streams to be mapped into the FPGA fabric independently, as one architecture can be selectively swapped out of the chip while another is actively executing. Partial reconfiguration provides the capability of keeping certain parts intact in the FPGA, while other parts are replaced. A Reconfigurable Computing System traditionally consists of a GPP and several Reconfigurable Modules (as seen in Figure 2) that execute dedicated hardware tasks in parallel [5]. A fundamental feature of a partially reconfigurable FPGA is that the logical components and the routing resources are time multiplexed. Therefore, in order for an application to be mapped onto an FPGA, it needs to be divided into independent tasks such that each subtask can be executed at a different time. The use of an operating system for reconfigurable computing should ease application development, provide
a higher level of abstraction, and, most importantly, verify and maintain applications. There are several essential modules that need to exist in any reconfigurable operating system implementation, including the bit-stream manager, scheduler, placer, and communications network (Figure 2). (1) The Scheduler. This is an important component that decides when tasks will be executed. Efficient task scheduling algorithms have to take data communication, task dependencies, and task resource utilization into consideration to optimally exploit the performance of a dynamic reconfigurable system. Scheduling techniques proposed in the literature have different goals, such as improving hardware resource utilization, reducing scheduling time, and/or reducing reconfiguration overhead. Other reconfigurable computing schedulers attempt to reduce fragmentation and power consumption [6]. (2) Bit-Stream Manager. This module manages the loading/unloading of partial bit-streams from storage media into a PRR. The bit-stream manager requires fast and fairly large storage media; therefore, it is preferable to use a dedicated hardware module for this task. A bit-stream manager is further decomposed into a storage manager and a reconfigurable manager. The latter reads bit-streams from the storage manager and then downloads them onto the FPGA fabric. (3) The Placer. In conventional reconfigurable devices, a placer decides where to assign new tasks in the reconfigurable area. In RTR systems, the scheduler and placer are interdependent; therefore, they are implemented as a single entity, which makes it challenging to identify clear boundaries between them. (4) Communication Manager. This module defines how hardware and software tasks communicate and interact with each other. The communication network proposed by the manager depends on the type of applications used in the system.
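To make the interplay between the scheduler and placer concrete, the following is a minimal greedy sketch in Python. The data layout, all names, and the reuse-then-reconfigure-then-software policy are our own illustrative assumptions, not the actual implementation of the schedulers described later in this paper:

```python
from dataclasses import dataclass

@dataclass
class Task:
    tid: int
    exec_time_hw: float   # execution cycles when run in a PRR
    exec_time_sw: float   # execution cycles when run on the GPP
    reconfig_time: float  # cost of loading the task's partial bit-stream
    bitstream: str        # name of the partial bit-stream

@dataclass
class PRR:
    prid: int
    loaded: str = ""      # bit-stream currently configured in this region
    busy_until: float = 0.0

def schedule(task, prrs, gpp_busy_until, now):
    """Greedy scheduler/placer sketch: prefer a PRR that already holds the
    task's bit-stream (reuse, no reconfiguration cost), otherwise reconfigure
    the earliest-free PRR, falling back to software when that is faster."""
    # 1. Hardware reuse: a PRR already configured with this bit-stream.
    for p in prrs:
        if p.loaded == task.bitstream:
            finish = max(now, p.busy_until) + task.exec_time_hw
            p.busy_until = finish
            return ("HW-reuse", p.prid, finish)
    # 2. Otherwise reconfigure the PRR that frees up first.
    p = min(prrs, key=lambda q: q.busy_until)
    hw_finish = max(now, p.busy_until) + task.reconfig_time + task.exec_time_hw
    # 3. Software fallback on the GPP (task migration to software).
    sw_finish = max(now, gpp_busy_until) + task.exec_time_sw
    if hw_finish <= sw_finish:
        p.loaded = task.bitstream
        p.busy_until = hw_finish
        return ("HW-reconfig", p.prid, hw_finish)
    return ("SW", -1, sw_finish)
```

The key point is step 1: a task whose bit-stream is already resident in some PRR skips the reconfiguration cost entirely, which is exactly the hardware-reuse opportunity exploited by the schedulers discussed in this paper.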
For example, streaming applications such as vision need a different topology than that of a centralized computational application. Proposed Work and Motivations. In this paper, we propose and implement a framework for ROS that aids the designer from the early stages of the design to the actual implementation and deployment of the application on the reconfigurable fabric. A novel online scheduler, coined RCOnline, is developed that seeks to reuse hardware resources for tasks in order to reduce reconfiguration overhead, migrate tasks between software and hardware, and schedule operations based upon a priority scheme. With regard to the latter, RCOnline dynamically measures several system metrics, such as execution time and reconfiguration time, and then uses these metrics to assign a priority to each task. RCOnline then migrates and assigns tasks to the most suitable processing elements (SW or HW) based on these calculated priorities. One of the main problems encountered in RTR is identifying the most appropriate floorplan or infrastructure that suits an application. Every application (e.g., machine vision,
Figure 1: (a) PRR layout, (b) miscellaneous platforms/floorplans.
Figure 2: Reconfigurable OS: essential components.
wireless-sensor network) requires specific types of resources that optimize certain objectives, such as reducing power consumption and/or improving execution time. In our previous work published in [7], a dynamic reconfigurable framework consisting of five reconfigurable regions and two GPPs was implemented, along with simple scheduling algorithms. In this paper, a novel efficient and robust scheduling algorithm is developed that seeks to reuse hardware tasks to reduce reconfiguration time. The developed scheduler was tested using a collection of tasks represented by Data-Flow Graphs (DFGs). To verify the scheduler functionality and performance, DFGs
with different sizes and parameters were required; therefore, a DFG generator was also designed and implemented. The DFG generator is able to generate benchmarks with different sizes and features. However, using such a dynamic reconfigurable platform to evaluate hundreds of DFGs based on different hardware configurations is both complex and tedious. The FPGA platform to be used in this work requires a different floorplan and bit-stream for each new configuration which limits the scope of testing and evaluation. Accordingly, an architecture reconfigurable simulator was also developed to simulate the hardware platform, while running the developed
reconfigurable operating system. The simulator allows for faster evaluation and greater flexibility and thus can support different hardware scenarios by varying the number of processing elements (PRRs, GPPs), the size/shape of PRRs, and the schedulers. Contrary to software tasks, hardware tasks may have multiple implementation variants (execution units or hardware implementations). These variants differ in terms of performance, power consumption, and area. As limiting a scheduler to one execution variant may lead to inferior solutions, we also propose an efficient optimization framework that mitigates this deficiency. The optimization framework extends the work we introduced in [8], which utilizes an evolutionary Island Based Genetic Algorithm technique. Given a particular DFG, each island seeks to optimize speed, power, and/or area for a different floorplan. The floorplans differ in terms of the number, size, and layout of PRRs, as described earlier. The framework presented in this paper uses the proposed online schedulers along with the Island Based GA Engine to evaluate the quality of a particular schedule in terms of power and performance based on the most suitable floorplan for the intended application. This paper seeks to integrate the developed tools (simulator, DFG generator, and schedulers) with the Genetic Algorithm Engine for task allocation in a unique framework that is capable of mapping and scheduling tasks efficiently on a reconfigurable computing platform. Contributions. The main contributions of this paper can be summarized in the following points: (1) Operating System Support. We develop and evaluate a novel heuristic for the online scheduling of hard real-time tasks on partially reconfigurable devices. The proposed scheduler uses fixed, predefined Partial Reconfigurable Regions with reuse, relocation, and task migration capability.
The scheduler dynamically measures several performance metrics such as reconfiguration time and execution time, calculates a priority for each task, and then assigns incoming tasks to the appropriate processing elements based on this priority. To evaluate the proposed framework and schedulers, a DFG generator and a reconfigurable simulator were also developed. The simulator is freely available under the open source GPL license on GitHub [9] to enable researchers working in the area of reconfigurable operating systems to utilize it as a supporting tool for their work. (2) Multiple Execution Variants. A multiobjective optimization framework capable of optimizing total execution time and power consumption for static and partial dynamic reconfigurable systems is proposed. Not only is this framework capable of optimizing the aforementioned objectives, but also it can determine the most appropriate reconfigurable floorplan/platform for the application. This way, a good trade-off between performance and power consumption can be achieved which results in high energy efficiency. The proposed Island Based GA framework
achieves on average a 55.2% improvement over a single-GA implementation and an 80.7% improvement over a baseline random allocation and binding approach. The GA framework is made freely available under the open source GPL license on GitHub [10]. To the best of our knowledge, this is the first attempt to use multiple GA instances for optimizing several objectives and aggregating the results to further improve solution quality. Paper Organization. The remainder of this paper is organized as follows. Section 2 discusses the most significant work published in the field of scheduling and execution unit assignment for reconfigurable systems. Section 3 introduces the tools developed to test and evaluate the proposed work. Section 4 introduces a novel reconfigurable scheduler along with a comparison to other schedulers adapted from the literature. Section 5 introduces the evolutionary framework developed for allocating and binding execution units to task graphs. Finally, conclusions and future directions are presented in Section 6.
2. Related Work

The work in [11, 12] first proposed the idea of using reconfigurable hardware, not as a dependent coprocessor, but as a standalone independent hardware unit. In this early implementation, hardware tasks could access a FIFO, a memory block, and an I/O driver, or even signal the scheduler. The system uses a powerful external processor and connects to an FPGA via the PCI port. The proposed run-time environment can dynamically load and run variable-size tasks utilizing partial reconfiguration. The system proposed in [12] was used to explore online scheduling of real-time tasks for both 1D and 2D modules. The work in [13] was the first to explore migration between hardware and software tasks. The authors studied the coexistence of hardware (on an FPGA) and software threads and proposed a migration protocol between them for real-time processing. They introduced a novel allocation scheme and proposed an architecture that can benefit from utilizing dynamic partial reconfiguration. However, the published results were mainly based on pure simulation. In [14], the authors presented a run-time hardware/software scheduler that accepts tasks modeled as a DFG. The scheduler is based on a special architecture that consists of two processors (master and slave), a Reconfigurable Computing Unit (RCU), and shared memory. The RCU runs hardware tasks and can be partially reconfigured. Each task can be represented in three forms: a hardware bit-stream run on the RCU, a software task for the master processor, or a software task for the slave processor. A scheduler migrates tasks between hardware and software to achieve the highest possible efficiency. The authors found the software implementation of the scheduler to be slow; therefore, they designed and implemented a hardware version of the scheduler to minimize overhead. The work focuses mainly on the hardware architecture of the scheduler and does not address other issues associated with
the loading/unloading of hardware tasks or the efficiency of intertask communication. Another drawback was the requirement that each task in the DFG be assigned to one of the three processing elements mentioned above. This restriction conflicts with the idea of software/hardware migration. The authors of [15] designed and implemented an execution manager that loads/unloads tasks in an FPGA using dynamic partial reconfiguration. The execution manager ensures that task dependencies are respected and that tasks execute properly. It also uses prefetching and task reuse to reduce configuration time overhead. Tasks are modeled using scheduled DFGs in order to be usable by the execution manager. The authors implemented software and hardware versions of the manager but found the software version to be too slow for real-time applications. The implementation used a simulated hardware task consisting of two timers to represent execution and configuration time. There are still many unaddressed issues for the manager to be used with real tasks, such as configuration management and intertask communication. The work in [16] proposes a hardware OS, called CAP-OS, that uses run-time scheduling, task mapping, and resource management on a run-time reconfigurable multiprocessor system. The authors in [17] proposed to use the RAMPSOC with two schedulers: a static scheduler to generate a task graph and assign resources and a dynamic scheduler that uses the task graph produced by the static scheduler. A deadline constraint is imposed on the entire graph instead of on individual tasks. Most of their work was based on simulation, and the implementation lacked task reconfiguration. Many researchers have considered the use of multiple architecture variants as part of developing an operating system or a scheduler. In [18, 19], multiple core variants were swapped to enhance execution time and/or reduce power consumption.
In our previous work [7], we alternated between software and hardware implementations of tasks to reduce total execution time. The work presented here differs in its applicability, as it is more comprehensive: it can be used during the design stage to help the designer select the appropriate implementation variant for each task in the task graph, as will be described in Section 5. Genetic Algorithms (GAs) have been used by many researchers to map tasks onto FPGAs. For example, in [20] a hardware-based GA partitioning and scheduling technique for dynamically reconfigurable embedded systems was developed. The authors used a modified list scheduling and placement algorithm as part of the GA approach to determine the best partitioning. The work in [21] maps task graphs onto a single FPGA composed of multiple Xilinx MicroBlaze processors and hardware accelerators. In [22], the authors used a GA to map task graphs onto partially reconfigurable FPGAs. They modeled the reconfigurable area as tiles of resources with multiple configuration controllers operating in parallel. However, none of the above works takes multiple architecture variants into account. In [18] the authors modified the Rate Monotonic Scheduling (RMS) algorithm to support hot-swapping architecture implementations. Their goal is to minimize power consumption and reschedule tasks on the fly while the system
is running. The authors addressed the problem from the perspective of schedule feasibility, but their approach does not target any particular reconfigurable platform, nor does it take multiple objectives into consideration, which makes it different from the work proposed in this paper. In [23], the authors used an Evolutionary Algorithm (EA) to optimize both power and temperature for heterogeneous FPGAs. The work in [23] is very different from the currently proposed framework in terms of the types of objectives considered and their actual implementation, since its focus is on swapping a limited number of GPP cores in an embedded processor for the arriving tasks. In contrast, our proposed work focuses on optimizing the hardware implementation of every single task in the incoming DFG and targets static and partial reconfigurable FPGAs. In [3, 4] the authors used a GA framework for task graph partitioning along with a library of hardware task implementations that contains multiple architecture variants for each hardware task. The variants reflect trade-offs between hardware resources and task execution throughput. In their work, the selection of implementation variants dictates how the task graph should be partitioned. There are a few important differences from our proposed framework. First, our work targets dynamic partial RTR with reconfigurable operating systems. The system simultaneously runs multiple hardware tasks on PRRs and can swap tasks in real time between software and hardware (the reconfigurable OS platform is discussed in [7]). Second, the authors use a single-objective approach, selecting the variants that give minimum execution time. In contrast, our approach is multiobjective and not only seeks to optimize speed and power but also seeks to select the most suitable reconfigurable platform.
As each of the previous objectives is likely to have its own optimal solution, the resulting optimization does not give rise to a single superior solution, but rather a set of optimal solutions, known as Pareto optimal solutions [24]. Unlike the work in [3, 4], we employ a parallel, multiobjective Island Based GA to generate the set of Pareto optimal solutions.
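The notion of Pareto optimality used here can be sketched in a few lines. Assuming each candidate allocation/binding is scored as a tuple of objectives to be minimized (the tuple layout, e.g., (execution time, power), is an illustrative assumption of ours):

```python
def dominates(a, b):
    """a dominates b if a is no worse in every objective (minimization)
    and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(solutions):
    """Return the nondominated subset of a list of objective tuples."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o != s)]
```

A multiobjective GA maintains such a nondominated set rather than a single best individual, leaving the final trade-off decision to the designer.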
3. Methodology

In this section, the overall methodology of the proposed work, along with the tools developed to implement the framework, is presented. The two major phases of the flow are shown in Figure 3: (1) Scheduling and Placement. This phase is responsible for scheduling and placing incoming tasks on the available processing elements (GPP/PRRs). In our earlier work in [7], a complete reconfigurable platform, based on five PRRs and two GPPs and running a simple ROS, was designed and mapped onto a Xilinx Virtex-6 FPGA. For our current work, using the previously developed hardware reconfigurable platform imposed a limitation, as modifying the PRRs/GPPs is a tedious and time-consuming process. Accordingly, we developed a reconfigurable simulator that hosts the developed ROS and emulates the reconfigurable platform, as will be discussed in Section 3.2. The ROS
Figure 3: Overall methodology.
runs a novel online reconfigurable-aware scheduler (RCOnline) that is able to migrate tasks between software and hardware and measure real system metrics to optimize task scheduling, as will be explained in Section 4.2. We also developed a DFG generator to evaluate the proposed system, which will be presented in Section 3.1. (2) Algorithm Binding. Phase one assumes a single execution element (architecture) per task. This assumption might lead to a suboptimal solution, as a hardware task can be implemented in different forms (e.g., a serial or parallel implementation). To allow the proposed system to consider multiple (different) architecture implementations, we implemented a multiobjective Island Based Genetic Algorithm platform that not only optimizes for power and speed but also selects the most appropriate platform for the DFG, as will be explained in Section 5. 3.1. DFG Generator. Scheduling sequential tasks on multiple processing elements is an NP-hard problem [25]. To develop, modify, and evaluate a scheduler, extensive testing is required. Scheduling validation can be achieved using workloads created either from actual recorded data in user logs or, more commonly, from synthetically generated random benchmarks [25]. In this work, a synthetic random DFG generator was developed to assist in verifying system functionality and performance (note: several real-world benchmarks from the MediaBench DSP suite [26] were also used to validate the work presented, as will be explained in Section 3.3). An overview of the DFG generator is presented in Figure 4. The inputs to the DFG generator include a library of task types (operations), the number of required nodes, the total number of dependencies between nodes (tasks), and the maximum number of dependencies per task. In addition, the user can control the percentage of occurrences of each task or group of tasks. The outputs of the DFG generator consist of two files: a graph file and a platform file.
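The core of such a generator, random acyclic node/edge construction under user-supplied limits, can be sketched as follows. Function names, parameters, and the edge-sampling strategy are illustrative assumptions of ours; the actual tool additionally emits the platform file described above:

```python
import random

def generate_dfg(num_nodes, num_edges, max_deps_per_node, task_types, seed=None):
    """Randomly build a DFG: each node gets a task type, and edges only go
    from a lower-numbered node to a higher-numbered one, which guarantees
    acyclicity. The caller must request a feasible number of edges for the
    given per-node dependency limit, or the sampling loop will not finish."""
    rng = random.Random(seed)
    ops = {n: rng.choice(task_types) for n in range(num_nodes)}
    edges, deps = set(), {n: 0 for n in range(num_nodes)}
    while len(edges) < num_edges:
        src = rng.randrange(num_nodes - 1)
        dst = rng.randrange(src + 1, num_nodes)
        if deps[dst] < max_deps_per_node and (src, dst) not in edges:
            edges.add((src, dst))
            deps[dst] += 1
    return ops, sorted(edges)

def to_dot(ops, edges):
    """Emit the DFG in the DOT graph description language."""
    lines = ["digraph dfg {"]
    lines += [f'  n{n} [label="{op}"];' for n, op in ops.items()]
    lines += [f"  n{a} -> n{b};" for a, b in edges]
    lines.append("}")
    return "\n".join(lines)
```

Sampling edges only from lower-numbered to higher-numbered nodes makes the result a DAG by construction, so no explicit cycle check is needed.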
The graph file is a graphical representation of the DFG in a user-friendly format
[27], while the platform file is the same DFG in a format compatible with the targeted platform. The DFG generator consists of three major modules, as seen in Figure 4: (1) Matrix Generator. This module provides the core functionality of the DFG generator and is responsible for constructing the DFG skeleton, that is, the nodes and their dependencies. The nodes represent functions (or tasks), and the edges represent data communicated among tasks; both are generated randomly based on user-supplied directives. The matrix dependency generator thus builds a DFG without assigning operations to the nodes. (2) Matrix Operation Generator. This module randomly assigns task types, drawn from a library of task types, to the nodes generated by the Matrix Generator module, based on user-supplied parameters. (3) Graph File Generator. This module produces a graph representation of the DFG in the standard open source DOT (graph description language) format [27]. 3.2. Reconfigurable Simulator. Using the hardware platform introduced in [7] has several advantages but also introduces some limitations. The FPGA platform to be used in this work requires a different floorplan and bit-stream for each new configuration, which limits the scope of testing and evaluation. Accordingly, an architecture reconfigurable simulator was developed to simulate the hardware platform in [7], while running the developed reconfigurable operating system. The simulator consists of three distinct layers, as shown in Figure 5. The Platform Emulator Layer (PEL) emulates the RTR platform functionality and acts as a virtual machine for the upper layers. The ROS kernel along with the developed schedulers runs on top of the PEL. The developed code of the ROS kernel along with the schedulers can run on both
Figure 4: Components of the DFG generator.
Figure 5: Reconfigurable simulator layout.
the simulator and the actual hardware RTR platform without any modification. The use of the PEL enables the designer to easily develop and test new schedulers on the simulator before porting them to the actual hardware platform. The developed simulator utilizes the proposed schedulers and supports any number of PRRs and/or GPPs (hardware or software). The simulator was written in the C language and runs under Linux. The code was carefully developed in a modular format, which enables it to accommodate new schedulers in the future. The simulator uses several configuration files, similar to those used by the Linux OS. We opted to make the simulator freely available under the open source GPL license on GitHub [9] to enable researchers
working in the area of reconfigurable operating systems to utilize it as a supporting tool for their work. 3.2.1. Simulator Inputs. The simulator accepts different input parameters to control its operation, including the scheduler selection and the architecture file names. The simulator is developed to emulate the hardware platform and expects the following configuration files as input: (i) A task (architecture) library file, which stores the task information used by the simulator: this includes, on a per-task-type basis, the mode of operation (software, hardware, or hybrid; hybrid tasks can migrate between hardware and software), execution time, area, reconfiguration time, reconfiguration power, and dynamic power consumption. Some of these values are based on analytic models found in [28, 29], while others were measured manually from an actual implementation on a Xilinx Virtex-6 platform. (ii) A layout (platform) file, which specifies the FPGA floorplan: the layout includes data that represent the size, shape, and number of PRRs along with the types and number of GPPs. (iii) A DFG file, which stores the DFGs to be scheduled and executed. 3.2.2. Simulator Output. The simulator generates several useful system parameters, including total execution time,
Table 1: Benchmark specifications.

ID      Name                              # of nodes   # of edges   Avg. edges/node   Critical path   Parallelism (nodes/crit. path)
S1      Synthesized (4 task types)        50           50           1                 6               8.3
S2      Synthesized (4 task types)        150          200          1.33              15              10.1
S3      Synthesized (8 task types)        25           40           1.6               7               3.5
S4      Synthesized (8 task types)        100          120          1.2               7               14.2
S5      Synthesized (5 task types)        13           10           0.77              3               4.3
S6      Synthesized (4 task types)        10           5            0.5               3               3.3
S7      Synthesized (8 task types)        50           60           1.2               8               6.3
S8      Synthesized (4 task types)        100          120          1.2               7               14.3
S9      Synthesized (8 task types)        150          200          1.33              8               25.1
S10     Synthesized (8 task types)        200          60           0.3               4               50.1
S11     Synthesized (4 task types)        100          150          1.5               8               12.5
DFG2    JPEG-Smooth Downsample            51           52           1.02              16              3.2
DFG6    MPEG-Motion Vectors               32           29           0.91              6               5.3
DFG7    EPIC-Collapse pyr                 56           73           1.3               7               8.1
DFG12   MESA-Matrix Multiplication        109          116          1.06              9               12.1
DFG14   HAL                               11           8            0.72              4               2.8
DFG16   Finite Input Response Filter 2    40           39           0.975             11              3.6
DFG19   Cosine 1                          66           76           1.15              8               8.3
reconfiguration time (cycles), task migration information, and hardware reuse. A task placement graph can also be generated by enabling specific options as will be seen in the example presented in Section 5.2.2. 3.3. Benchmarks. Verifying the functionality of any developed scheduler with the aid of a simulator requires testing based on benchmarks with different parameters, properties, and statistics. Accordingly, ten DFGs were generated by the DFG generator presented in Section 3.1. These DFGs differ with respect to the number of tasks (nodes), dependencies, and operations, as shown in Table 1. The number of operations refers to the different unique functions (task types) in the DFG. The benchmark suite consists of two different sets of DFGs. The first set uses four operations, while the second set uses eight operations that have variable execution time, dependencies, and possible hardware/software implementation. In addition to the synthesized benchmarks, seven real-world benchmarks selected from the well-known MediaBench DSP suite [26] were used in the schedulers’ evaluation, as shown in the lower part of Table 1.
4. Scheduling Algorithms In this section, we introduce a novel online scheduling algorithm (RCOnline) along with an enhanced version (RCOnline-Enh.) for reconfigurable computing. The proposed scheduler manages both hardware and software tasks. We also introduce an efficient offline scheduler (RCOffline) for reconfigurable systems that is used to verify the quality of solutions obtained by the online scheduler. Offline schedulers often produce solutions superior to those obtained by online
schedulers, since the former can build complete information about the system activities. In addition, several works in the related literature resort to such an approach to measure the quality of solutions obtained by online schedulers, due to the lack of open source code for previously published work [30–32]. Also, schedulers developed by many researchers are usually implemented based on different assumptions and constraints that are not compatible with those employed in our proposed schedulers. 4.1. Baseline Schedulers. One of the main challenges faced in this work is the lack of compatible schedulers, tools, and benchmarks that can be used to compare with our results. Therefore, throughout the research process, we considered several published algorithms and adapted them to our RTR platform solely for the sake of comparison with our proposed schedulers. Two of the adapted scheduling algorithms are described briefly in the following sections. 4.1.1. RCOffline Scheduler. An offline scheduler was implemented and integrated into our proposed RTR platform. The RCOffline scheduler is a greedy offline reconfiguration-aware scheduler for 2D dynamically reconfigurable architectures. It supports task reuse, module prefetch, and a defragmentation technique. Task reuse refers to a policy that the operating system can adopt to reduce reconfiguration time. This is accomplished by reusing a hardware accelerator already present in the FPGA floorplan when a newly arriving task needs to be scheduled. The main advantage of utilizing such a scheduler is that it produces solutions (schedules) that are of high quality (close to those obtained by an ILP based model [33, 34]) in a fraction of the time. The RCOffline
(1)  DFGs = Compute ALAP and sort (DFG).
(2)  ReadyQ = findAvailableTasks(DFGs) // no dependencies
(3)  while there are unscheduled tasks do
(4)    if (Reuse == true)
(5)      for all tasks in ReadyQ do
(6)        PRR# = FindAvailablePRR(task)
(7)        if (PRR#) then
(8)          ts = t
(9)          for all p in PredecessorOf(task)
(10)           ts = max(ts, tps + tpl)
(11)         end for
(12)         remove task from ReadyQ
(13)       end if
(14)     end for
(15)   else /* Reuse == false */
(16)     for all tasks in ReadyQ do
(17)       PRR# = findEmptyPRR(task)
(18)       if (PRR# and Reconfiguration) then
(19)         ts = t + trs
(20)         for all p in PredecessorOf(task)
(21)           ts = max(ts, tps + tpl)
(22)         end for
(23)         remove task from ReadyQ
(24)       end if
(25)     end for
(26)   end if
(27)   ReadyQ = CheckForAvailableTasks /* dependencies met */
(28) end while

Algorithm 1: Pseudo-code for RCOffline.
pseudo-code is shown in Algorithm 1. RCOffline starts by sorting the incoming tasks in the DFG using the As-Late-As-Possible (ALAP) algorithm and then adds the ready tasks (those with no dependencies) to the ready queue (ReadyQ) [lines (1)-(2)]. For task reuse, the scheduler will locate an available PRR [line (6)] and then tentatively set the start time (𝑡𝑠) to the current time (𝑡) [line (8)]. Similarly, if task reuse is false, the scheduler will look for an empty or unused PRR and only perform reconfiguration if the reconfiguration controller is available [lines (17)-(18)]. The tentative start time in this case is set to the sum of the current time and the reconfiguration time (𝑡𝑟𝑠). To prevent a task from running before its predecessors, the scheduler checks the end times of the predecessor tasks [lines (9)–(11), (20)–(22)]. This may leave a gap between the reconfiguration of a task and its execution on the PRR (i.e., task prefetching). Since RCOffline is an offline scheduler, it expects the entire DFG in advance. It also assumes the existence of a reconfigurable platform that supports task relocation and PRR resizing at run-time. 4.2. RCOnline Scheduler. The quality of the online scheduling algorithms employed in an RTR-OS can have a significant impact on the final performance of applications running on a reconfigurable computing platform. The overall time taken to run applications on the same platform can vary considerably with different schedulers.
The online scheduler (RCOnline) proposed in this work was designed to intelligently identify and choose the most suitable processing element (PRR or GPP) for the current task based on knowledge extracted from the historical data of previous tasks. The scheduler dynamically calculates task placement priorities based on data obtained from previously running tasks. During task execution, the scheduler attempts to learn about running tasks by classifying them according to their designated types. The scheduler then dynamically updates a "task type table" with the new task type information. The data stored in the table is used by the scheduler at a later time to classify and assign priorities to new incoming tasks. The RCOnline scheduler was further enhanced by modifying task selection from the ready queue. This enhancement dramatically improves performance, as will be explained later in Section 4.4. Enhancing algorithm performance by learning from historical data is well suited to embedded systems, since the same task sets usually run repeatedly on the same platform. For example, video processing applications use the same repeated task sets for every frame. That is to say, the scheduler will always produce a valid schedule even during the learning phase; however, its efficiency (schedule quality) improves over time. The following definitions are essential to further understand the functionality of the RCOnline algorithm:
(1)  // The PRRs should be stored from smallest to largest such that
(2)  // those with smaller reconfiguration time have smaller IDs
(3)  Sort PRRs in ascending order based on size.
(4)  Read the node with the highest task priority
(5)  if (All dependencies are met)
(6)    {add task to ready queue}
(7)  // SELECTION BY RCOnline versus RCOnline-Enh
(8)  Fetch the first task from the ReadyQueue // -- used by RCOnline
(9)  Fetch a task that matches an existing reconfigured task // -- used by RCOnline-Enh
(10) switch (task->mode) {
(11) case HybridHW : {
(12)   if (TaskTypePriority == 0 and GPP is free)
(13)   { // TaskTypePriority == 0 indicates that the task is preferred to run on SW
(14)     Migrate task from HybridHW to HybridSW
(15)     Add to the beginning of the ReadyQueue }
(16)   } else {
(17)   if (there is a free PRR that fits the current task)
(18)     { Search free PRRs for a match against the ready task }
(19)     if (task found)
(20)       run ready task on the free PRR
(21)     else {
(22)       currentPRR = smallest free PRR;
(23)       if (TaskTypePriority < currentPRR and GPP is free)
(24)       { // check if it is faster to reconfigure or run on SW
(25)         Migrate task from HybridHW to HybridSW
(26)         Add to the beginning of the ReadyQueue }
(27)       } else {
(28)         Reconfigure currentPRR with the ready task bit-stream }
(29)   } else // all PRRs are busy
(30)   { if (GPP is free)
(31)     { Migrate task from HybridHW to HybridSW
(32)       Add to the beginning of the ReadyQueue }
(33)   } else {
(34)     // return to main program and
(35)     // wait for free resources
(36)     Increase the Busy Counter
(37)     return Busy }
(38)   break;
(39) }
(40) case HybridSW :
(41)   { if (GPP is free)
(42)     { load task into GPP }
(43)   } else if (there is a free PRR)
(44)     { Migrate task from HybridSW to HybridHW
(45)       Add to the beginning of the ReadyQueue }
(46)   } else {
(47)     // return to main program and
(48)     // wait for free resources
(49)     Increase the Busy Counter
(50)     return Busy
(51)     break; }}
End

Algorithm 2: Pseudo-code for RCOnline/RCOnline-Enh. (with reuse and task migration).
(i) PRRs Priority. It is calculated based on the PRR size. PRRs with smaller size have lower reconfiguration time and thus have higher priorities. (ii) Placement Priority. It is a dynamically calculated metric which is frequently updated on the arrival of new tasks. Placement priority is assigned based
on the type (function) of task. RCOnline uses this metric to decide where to place the current tasks. Placement priority is an integer value, where the lower the number the higher the priority. Based on the pseudo-code of Algorithm 2, RCOnline takes the following steps to efficiently schedule tasks:
(1) The PRRs are given priority based on their respective size. PRR priority is a number from 0 to PRRmax − 1.
(2) Arriving tasks are checked for dependencies, and if all dependencies are satisfied, the task is appended to the ready queue.
(3) The scheduler then selects the task with the highest priority from the ready queue; if the task mode is HybridHW (i.e., it can be placed in either HW or SW), then the following occurs:
(a) The scheduler checks the currently available PRRs for task reuse. If reuse is not possible (i.e., no module with a similar architecture is configured), the scheduler uses placement priority and PRR priority to either place the task on an alternative PRR or migrate it to software (lines (28) to (41) in the pseudo-code of Algorithm 2). This step is useful for nonuniform implementations, since each PRR has a different reconfiguration time. Task migration is performed when the following condition is met: Migrate task = True if [(HW Exec. time + Reconfig. time) > SW Exec. time] or [there are no free PRRs].
(b) If no resources are available (either in hardware or in software), a busy counter is incremented while waiting for a resource to become free (lines (46) to (50)). The busy counter is a performance metric that gives an indication of how busy (or idle) the scheduler is; it counts the number of cycles a task has to wait for free resources.
(4) When the task mode is HybridSW (i.e., the task should run on SW but can also run on HW), RCOnline first attempts to run the task on a GPP and then migrates it to hardware based on GPP availability. Otherwise, RCOnline increments the busy counter and waits for free resources (lines (49) to (50)).
(5) Software-only and hardware-only tasks are handled in a similar way to HybridSW and HybridHW, respectively, with no task migration.
Placement priority, which is calculated by the scheduler, differs from task priority, which is assigned to DFG nodes at design time: placement priority determines where a ready task should be placed, while task priority determines the order in which tasks run. The calculation of the PRR placement priority takes into account the PRRs' reconfiguration times in addition to task execution time: (1) At the end of task execution, RCOnline adds any newly discovered task type to the task type table. (2) The scheduler updates the reconfiguration time for the current PRR in the task type table.
(3) RCOnline updates the execution time for the current processing element based on the following formula:

𝐸new = 𝐸old + (𝐸meas − 𝐸old) ∗ 𝑋, (1)

where 𝐸old denotes the accumulated average execution time, 𝐸meas the most recently measured execution time, and 𝑋 the learning factor. Based on empirical testing and extensive experimentation, 𝑋 has been set to a value of 0.2. The basic idea is to adapt the new execution time in such a way that it does not radically deviate from the average execution times of the processing element. This helps the scheduler adapt to changes in execution time for the same operation. The learning formula weighs the old accumulated average execution time (80%) against the most recent instance of the task (20%). The goal of the task type table is to keep track of the execution times (and, in the case of PRRs, reconfiguration times) of each task type on every processing element (PRRs and GPPs). Accordingly, when similar new tasks arrive, the scheduler can use the data in the table to determine the appropriate location to run the task. (4) Finally, RCOnline uses the measured reconfiguration and execution times to calculate a placement priority for each task type. Assuming all tasks in a DFG are available, the complexity of RCOnline is O(n), where n is the number of tasks in the DFG. Figure 6 depicts the learning behavior of RCOnline with an example DFG consisting of six nodes. As task 1 completes execution, the system records the following information (as shown in Figure 6(a)) in the task type table: (i) the task operation (multiplication in this case), (ii) the reconfiguration time (10 ms in PRR1), and (iii) the execution time (HW, 200 ms). Following the execution of task 2, the task type table is updated again with the features of the new task type (addition), as shown in Figure 6(b). As tasks 4, 5, and 6 arrive, the scheduler will have more knowledge about their reconfiguration and execution times based on the information collected so far. The recorded performance metrics are then used to calculate placement priorities that assist the scheduler in efficiently placing future incoming tasks of similar type. 4.3. RCOnline-Enh.
To further enhance the performance of the RCOnline scheduler, the task selection criteria were slightly modified. Instead of selecting the first available task in the ready queue (i.e., the task with the highest priority), the scheduler searches the ready queue for a task that matches an existing reconfigured task (line (9) in Algorithm 2). In practice, this simple modification dramatically reduces the number of reconfigurations (see the results presented in Section 4.4). Figure 7 clearly demonstrates the main difference between the RCOnline algorithm and RCOnline-Enh. As can be seen in the example, RCOnline-Enh. seeks to switch task "2" with task "1" to take advantage of task reuse and thus minimize the overall reconfiguration time that takes place in the system.
[Figure 6: Learning in RCOnline scheduler. (a) After task 1 (a multiplication) completes, the task type table records an HW execution time of 200 ms and a reconfiguration time of 10 ms on PRR1. (b) After task 2 (an addition) completes, the table additionally records an HW execution time of 80 ms and a reconfiguration time of 20 ms on PRR2; entries not yet observed remain unknown.]
4.4. Results and Analysis. In this section, results for RCOnline and its enhanced version are presented. In order to achieve a fair comparison with the RCOffline scheduler, task migration and software tasks were completely disabled in the RCOnline and RCOnline-Enh. implementations. The same reconfigurable area available in the hardware platform is assigned to the RCOffline scheduler. 4.4.1. Total Execution Time. The total execution times or schedule times (cycles) measured for the proposed online scheduler (RCOnline) and its variant are presented in Tables 2 and 3, respectively. The tables compare the solution quality obtained by RCOnline to the solutions obtained by RCOffline, as well as to the solutions found by RCOnline-Enh. 𝐶off, 𝐶on, and 𝐶enh are the total execution times (in clock cycles) found by RCOffline, RCOnline, and RCOnline-Enh., respectively. Δ𝐶 = 100(𝐶off − 𝐶on)/𝐶on and Δ𝐶enh = 100(𝐶off − 𝐶enh)/𝐶enh are the relative errors in percent of the schedule lengths found by RCOnline and RCOnline-Enh. compared to RCOffline. This notation is used in all tables presented in this section. Table 2 shows the total execution time of the first run for all schedulers. Results are based on the assumption that all PRRs are empty and RCOnline is in the initial phase of learning. Table 3, on the other hand, shows the total execution time after the first two iterations. At this point, the learning phase of RCOnline is complete (i.e., third iteration). Since RCOffline is an offline scheduler with no learning capability, the time it takes to configure the first task of a DFG was subtracted (assuming it is already configured). The results in Table 2 clearly indicate that RCOnline-Enh. improves on RCOnline by 40% on average, and that its gap relative to RCOffline drops to −6%. This is a significant improvement, taking into account the fact that RCOnline-Enh. is an online scheduler that requires little CPU time compared to RCOffline. The performance of RCOnline-Enh. improves dramatically following the learning phase, as shown in Table 3: its performance gap was 49% relative to RCOnline and 2% relative to RCOffline. The performance is expected to improve even more if task migration is enabled. The CPU run-time comparison of RCOnline-Enh. and RCOffline is shown in Table 4. The tests were performed on a Red Hat Linux workstation with a 6-core Intel Xeon processor running at 3.2 GHz and equipped with 16 GB of
[Figure 7: RCOnline versus RCOnline-Enh. RCOnline checks only the first task in the ready queue, so a task whose type is not currently configured forces a reconfiguration; RCOnline-Enh. instead searches the ready queue for a task that can reuse an already configured PRR before issuing a reconfiguration command.]
RAM. The RCOnline-Enh. is on average 10 times faster than the RCOffline.

Table 2: Schedulers comparison for total execution time (no learning).

Benchmark   𝐶off    𝐶on     𝐶enh    Δ𝐶 (%)   Δ𝐶enh (%)
S1          520     740     540     −30      −4
S2          1160    1750    1220    −34      −5
S3          300     420     340     −29      −12
S4          775     1470    820     −47      −5
S6          160     160     160     0        0
S7          420     860     500     −51      −16
S8          380     500     460     −24      −17
S9          995     1990    1000    −50      −1
S10         1095    3000    1140    −64      −4
S11         920     1430    1000    −36      −8
DFG2        680     800     584     −15      16
DFG6        325     340     359     −4       −9
DFG7        405     600     481     −33      −16
DFG12       745     910     761     −18      −2
DFG14       140     160     142     −13      −1
DFG16       295     403     383     −27      −23
DFG19       455     540     460     −16      −1
Average     575     945     609     −29      −6

Table 3: Schedulers comparison for total execution time (after learning phase).

Benchmark   𝐶off    𝐶on     𝐶enh    Δ𝐶 (%)   Δ𝐶enh (%)
S1          500     660     520     −24      −4
S2          1140    1690    1160    −33      −2
S3          280     390     280     −28      0
S4          755     1370    660     −45      14
S6          140     140     120     0        17
S7          400     700     430     −43      −7
S8          360     440     345     −18      4
S9          975     1950    950     −50      3
S10         1075    2620    1075    −59      0
S11         900     1230    866     −27      4
DFG2        660     830     584     −20      13
DFG6        305     360     322     −15      −5
DFG7        385     555     436     −31      −12
DFG12       725     870     723     −17      0
DFG14       120     160     103     −25      17
DFG16       275     380     288     −28      −5
DFG19       435     540     460     −19      −5
Average     555     876     548     −28      2

4.4.2. Hardware Task Reuse. Tables 5 and 6 show the number of reused hardware tasks prior to and after the learning phase, respectively. Hardware task reuse tends to reduce the number of required reconfigurations and, hence, total execution time. It is clear from the tables that the amount of hardware task reuse was improved with RCOnline-Enh., which contributed to a reduction in the total execution time. Table 6 shows the benefit of the learning phase, which tends to increase the variety of the available reconfigured tasks. RCOnline-Enh. has on average 172% and 158% more hardware reuse than RCOnline prior to and following the learning phase, respectively. RCOnline-Enh. achieved 104% of RCOffline's hardware reuse after the learning phase, and 87% of it without learning. RCOffline is able to prefetch tasks, which influences the amount of reuse. All results presented in this section were based on scheduling tasks within a DFG using a single type of architecture per task and based on a fixed single floorplan. The next section shows how these
Table 4: Run-time comparison of RCOffline and RCOnline-Enh.

Benchmark   RCOffline (ms)   RCOnline-Enh. (ms)   Speedup (× times)
S1          12               1.52                 7.9
S2          59               9.14                 6.4
S3          5                0.418                12
S4          29               2.03                 14.3
S6          2                0.116                17.0
S7          8                1.31                 6.0
S8          5                0.614                8.0
S9          58               6.28                 9.2
S10         110              9.74                 11.3
S11         31               4.98                 6.2
DFG2        12               0.829                14.5
DFG6        3                0.553                5.4
DFG7        7                1.15                 6.1
DFG12       29               3.69                 7.9
DFG14       2                0.111                18.0
DFG16       3                0.754                3.9
DFG19       12               1.39                 8.6
Average     23               3                    9.5
Table 5: Comparison of schedulers based on number of reused tasks (before learning).

Benchmark   𝐶off    𝐶on    𝐶enh    Δ𝐶 (%)   Δ𝐶enh (%)
S1          38      22     37      −42      −3
S2          139     74     128     −47      −8
S3          17      6      10      −65      −41
S4          74      28     67      −62      −9
S6          5       4      4       −20      −20
S7          36      8      27      −78      −25
S8          18      9      14      −50      −22
S9          128     53     118     −59      −8
S10         181     54     173     −70      −4
S11         88      40     78      −55      −11
DFG2        37      23     38      −38      3
DFG6        21      19     19      −10      −10
DFG7        49      36     42      −27      −14
DFG12       86      72     84      −16      −2
DFG14       7       4      5       −43      −29
DFG16       32      26     26      −19      −19
DFG19       50      42     49      −16      −2
Average     59      31     54      −42      −13
schedulers can be used as part of a multiobjective framework to determine the most appropriate architecture from a set of variant architectures for a given task. The framework proposed in the next section not only is capable of optimizing the selection of the architecture for a task within a DFG,
Table 6: Comparison of schedulers based on number of reused tasks (after learning).

Benchmark   𝐶off    𝐶on    𝐶enh    Δ𝐶 (%)   Δ𝐶enh (%)
S1          38      26     41      −32      8
S2          139     79     131     −43      −6
S3          17      10     13      −41      −24
S4          74      33     79      −55      7
S6          5       8      8       60       60
S7          36      17     31      −53      −14
S8          18      13     21      −28      17
S9          128     56     123     −56      −4
S10         181     74     183     −59      1
S11         88      51     90      −42      2
DFG2        37      24     38      −35      3
DFG6        21      20     22      −5       5
DFG7        49      38     48      −22      −2
DFG12       86      73     85      −15      −1
DFG14       7       5      8       −29      14
DFG16       32      27     32      −16      0
DFG19       50      42     49      −16      −2
Average     59      35     59      −29      4
but also can determine the most appropriate reconfigurable floorplan/platform for the application. 4.4.3. Comparison with Online Schedulers. Table 7 presents a comparison between RCOnline-Enh. and two other efficient scheduling algorithms (RCSched-I and RCSched-II) that were published in [7]. Both RCSched-I and RCSched-II nominate free (i.e., inactive or currently not running) PRRs for the next ready task. If all PRRs are busy and the GPP is free, the scheduler attempts to migrate the ready task (i.e., the first task in the ready queue) by changing the task type from hardware to software. Busy PRRs, unlike free PRRs, accommodate hardware tasks that are active (i.e., running). The two algorithms differ with respect to the way the next PRR is nominated. Results obtained indicate that RCOnline-Enh. has on average 327% more hardware reuse than RCSched-I and 203% more hardware reuse than RCSched-II. Table 7 also indicates that the performance gap in terms of total execution time between RCSched-I and RCOnline-Enh. is, on average, 53% and that between the latter and RCSched-II is 43%.
5. Execution Unit Allocation Optimizing the hardware architecture (alternative execution units) for a specific task in a DFG is an NP-hard problem [3]. Therefore, in this section, an efficient metaheuristic based technique is proposed to select the type of execution units (implementation variants) needed for a specific task. The proposed approach uses an Island Based Genetic Algorithm (IBGA) as an optimization based technique.
Table 7: Comparison of schedulers based on execution time and number of hardware reused tasks.

            RCSched-I              RCSched-II             RCOnline-Enh.
Benchmark   Total ex. time  Reuse  Total ex. time  Reuse  Total ex. time  Reuse
S1          867             10     740             19     520             41
S2          2466            35     1767            68     1160            131
S3          478             3      441             5      280             13
S4          1773            13     1508            26     660             79
S6          201             2      160             4      120             8
S7          896             6      818             10     430             31
S8          583             5      503             9      345             21
S9          2697            17     2417            44     950             123
S10         3589            22     2887            57     1075            183
S11         1735            21     1293            42     866             90
DFG2        912             18     827             22     584             38
DFG6        390             18     358             19     322             22
DFG7        701             32     596             36     436             48
DFG12       1270            51     879             66     723             85
DFG14       183             4      162             4      103             8
DFG16       469             23     403             26     288             32
DFG19       738             31     560             39     460             49
Average     1173            18     960             29     548             59

The main feature of an IBGA is that each population of individuals (i.e., set of candidate solutions) is divided into several subpopulations, called islands. All traditional genetic operators, such as crossover, mutation, and selection, are performed separately on each island, and some individuals are periodically selected from each island and migrated to other islands. In our proposed Island Based Genetic Algorithm (GA), no migration is performed among the different subpopulations, in order to preserve the solution quality of each island, since each island is based on a different (unique) floorplan, as will be explained later on. The novel idea here is to aggregate the results obtained from the Pareto fronts of each island to enhance the overall solution quality. Each solution on the Pareto front tends to optimize multiple objectives, including power consumption and speed; however, each island addresses this multiobjective optimization problem for a distinct platform (floorplan). The proposed IBGA approach utilizes the framework presented in [7] to evaluate the solutions created, in addition to the reconfigurable simulator presented earlier in Section 3.2. 5.1. Single Island GA. The Single Island GA optimization module consists of four main components: an architecture library, a Population Generation module, a GA Engine, and a Fitness Evaluation module (online scheduler), as shown in Figure 8. The architecture library stores all of the different available architectures (execution units) for every task type. Architectures vary in execution time, area, reconfiguration time, and power consumption. For example, a multiplier can have a serial, semiparallel, or parallel implementation. Each GA module tends to optimize several criteria (objectives) based on a single fixed floorplan/platform. The initial population generator uses a given task graph along with the architecture library to generate a diverse initial population of task graphs, as demonstrated by Figure 8. Each possible solution (chromosome) of the population assembles the same DFG with one execution unit (architecture) bound to each task. The Genetic Algorithm Engine manipulates the initial population by creating new solutions using several recombination operators in the form of crossover and mutation. Newly created solutions are evaluated, and the most fit solutions replace the least fit. These iterations continue for a specified number of generations or until no further improvement in solution quality is achieved. The fourth component, the fitness function, evaluates each solution based on specific criteria. The fitness function used in the proposed model is based on the online reconfigurable scheduler presented earlier, which schedules each task graph and returns the time and overall power consumed by the DFG to the GA Engine. 5.1.1. Initial Population. The initial population represents the solution space that the GA will use to search for the best solution. The initial population can be generated randomly or partially seeded using known solutions, and it needs to be as diverse as possible. In our framework, we evaluated the system with different population sizes that varied from a population equal to the number of nodes in the (input) DFG to five times the number of nodes of the input DFGs. Chromosome Representation. Every individual (solution) in the population is represented using a bit string known as a chromosome. Each chromosome consists of genes. In our formulation, we use a gene for each of the N tasks in the task graph; each gene represents the choice of a particular implementation variant (as an integer value). A task graph is mapped to a chromosome, where each gene represents an operation within the DFG using a specific execution unit, as shown in Figure 9.
[Figure 8: A Single Island GA module. An architecture library supplies multiple execution units per task; an initial population of bound task graphs is manipulated by the GA Engine (crossover, mutation, selection method, replacement technique) and evaluated by an online scheduler for a given platform, producing a Pareto front of execution time versus power.]

[Figure 9: Task graph to chromosome mapping (binding/allocation). Each gene stores the architecture variant (e.g., A, B, or C) bound to the corresponding DFG task; a task type library lists the variants available for each operation, with unavailable variants marked NA.]
5.1.2. Fitness Function. The fitness of any given chromosome is calculated and used by the Genetic Algorithm Engine as a single figure of merit. In our proposed framework, the fitness function is based on the outcome of a scheduler which takes each chromosome and uses it to construct a schedule. The scheduler then returns a value representing time, power, or a weighted average of both values. The quality of the scheduling algorithms implemented can have a significant impact on the final performance of applications running on a reconfigurable computing platform. The overall time it takes to run a set of tasks on the same platform could vary using different schedulers. Power consumption, reconfiguration time, and other factors can also be used as metrics to evaluate schedulers. The RCOnline scheduler described in Section 4.2 is used to evaluate each chromosome as part of the fitness function, in the GA framework. The results summarized in this section are based on the RCOnline scheduler presented earlier. 5.2. Island Based GA. The proposed Island Based GA consists of multiple GA modules as shown in Figure 10. Each GA
module produces a Pareto front based on power and performance (execution time) for a unique platform. An aggregated Pareto front is then constructed from all the GAs to give the user a framework that optimizes not only power and performance, but also the most suitable platform (floorplan). A platform in this case includes the number, size, and layout of the PRRs. The proposed framework simultaneously sends the same DFG to every GA island, as shown in Figure 10. The islands then run in parallel to optimize for performance and power consumption for each platform. Running multiple GAs in parallel helps in reducing processing time. The chromosomes in each island which represent an allocation of execution units (implementation variant) and binding of units to tasks are used to build a feasible near-optimal schedule. The schedule, allocation, and binding are then evaluated by the simulator to determine the associated power consumed and total execution time. These values are fed back to the GA as a measure of fitness of different solutions in the population. The GA seeks to minimize a weighted sum of performance and power consumption (2). The value of 𝑊 determines the
Figure 10: An Island Based GA (IBGA) framework. Each island optimizes the same DFG (with multiple execution units per task) for a different platform (A–D), producing a per-platform Pareto front of power versus time; the individual fronts are then combined into an aggregated Pareto front.
weight of each objective. For example, a value of one indicates that the GA should optimize for performance only, while a value of 0.5 gives equal emphasis to performance and power (assuming the values of execution time and power are normalized):

fitness value = power * (1 − 𝑊) + exec. time * 𝑊.  (2)
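The weighted-sum objective in (2) can be sketched as follows. This is a minimal illustration, not the framework's actual code: the normalization constants `p_max` and `t_max` and the example values are assumptions.

```python
def fitness(power, exec_time, w, p_max, t_max):
    """Weighted-sum fitness from (2); lower is better.

    power, exec_time: raw values returned by the scheduler/simulator
    w: weight in [0, 1] (w = 1 -> optimize for performance only)
    p_max, t_max: normalization constants (assumed, per benchmark)
    """
    p_norm = power / p_max          # normalized power
    t_norm = exec_time / t_max      # normalized execution time
    return p_norm * (1.0 - w) + t_norm * w

# Equal emphasis on performance and power (w = 0.5), hypothetical values:
f = fitness(power=10.0, exec_time=2000.0, w=0.5, p_max=20.0, t_max=4000.0)
# 0.5 * 0.5 + 0.5 * 0.5 = 0.5
```

Sweeping `w` from 0.0 to 1.0 and recording the resulting (power, time) pairs is what traces out each island's Pareto front.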
Every island produces, in parallel, a separate Pareto front for a different platform. An aggregation module then combines all the Pareto fronts into an aggregated front that yields not only near-optimal performance and power but also the hardware platform (floorplan) associated with each solution. Each island generates its Pareto front by varying the value of 𝑊 in (2) from 0.0 to 1.0 in steps of 0.1.

5.2.1. Power Measurements. Real-time power measurements were initially performed using the Xilinx VC707 board [35].
The Texas Instruments (TI) power controllers, along with a TI dongle and a monitoring station, were used to perform the power measurements. Other measurements were calculated using the Xilinx XPower software.

5.2.2. IBGA Example. In this section, we present a detailed example of the IBGA framework. Benchmark S3 is used to illustrate how the IBGA module operates, as shown in Figure 11. S3 is a synthetic benchmark with 25 nodes and 8 different task types; each task has between 2 and 5 different architecture variants (i.e., hardware implementations).

(i) Manual Assignment. As shown in Figure 11(a), each node is manually assigned a random hardware task variant, and the platform (PRR layout) is manually selected as well. The total execution time in this case
Figure 11: IBGA example on benchmark S3 (25 nodes, 8 task types, 2–5 architecture variants per task): (a) manual selection of platform and task variants (total execution time 3310 cycles), (b) single-GA binding on the same platform (2126 cycles), and (c) Island Based GA binding on a different platform (593 cycles).
Figure 12: Platforms with different floorplans: P1 and P2 have uniform PRR size distributions, while P3 and P4 are nonuniform.
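A platform (floorplan), as used by each island, can be represented simply as a list of PRR descriptors. The sketch below is illustrative only; the PRR names mirror the figure, but the capacities are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PRR:
    """One partially reconfigurable region of a floorplan."""
    name: str
    slices: int  # region capacity (hypothetical units)

# Uniform platform: equally sized PRRs (cf. P1/P2 in Figure 12)
p1 = [PRR(f"PRR{i}", 240) for i in range(1, 4)]

# Nonuniform platform: mixed PRR sizes (cf. P3/P4)
p3 = [PRR("PRR1", 120), PRR("PRR2", 240), PRR("PRR3", 480),
      PRR("PRR4", 120), PRR("PRR5", 240)]

def is_uniform(platform):
    """A platform is uniform when all its PRRs share one size."""
    return len({r.slices for r in platform}) == 1
```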
Table 8: Resource utilization of different multiplier implementations.

Resources   Arch 1   Arch 2   Arch 3
Slices      17       72       114
LUT         30       126      200
is 3310 cycles. The execution time will be used as a baseline for the next two cases.
(ii) Single-GA. Figure 11(b) shows the single-GA approach, which produces different task bindings than the manual assignment. Despite using the same platform as the manual assignment, the different bindings lead to a more optimized DFG (lower total execution time): performance improves by 36% over the manual task assignment.
(iii) Island Based GA. This approach is shown in Figure 11(c), where task binding is optimized over multiple platforms. The DFG has different task bindings than the two previous cases and uses a different platform. As a result, performance increases dramatically (the total execution time drops from 3310 cycles to 593). The IBGA produces a different architecture binding for each of the four platforms shown in Figure 12. Each number is an index to the selected execution unit (architecture), and the position of the integer represents the node number. The architecture bindings for platforms P1–P4, for the S3 benchmark, are as follows:

P1: 2533215124431321121115213.
P2: 2533215124431321121115213.
P3: 2322231111431211111313212.
P4: 1322221114431111111111112.

Table 8 shows the resource utilization of the different multiplier implementations used in the example of Figure 11. Table 9 shows the resources available in a PRR and the resources used by an adder in the uniform implementation.

Figure 13: Convergence (generation versus average fitness) of synthesized benchmark S2.
Figure 14: Convergence (generation versus average fitness) of MediaBench benchmark DFG16.
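The architecture binding strings are positional: the digit at position i is the variant index selected for node i. A minimal decoding sketch (the function name is our own, not part of the framework):

```python
def decode_binding(binding: str) -> dict[int, int]:
    """Map each DFG node number to its selected architecture-variant index.

    The i-th digit of the binding string is the execution-unit
    (architecture variant) index chosen for node i.
    """
    return {node: int(digit) for node, digit in enumerate(binding)}

# Binding for platform P4 of the S3 benchmark (25 nodes):
p4 = decode_binding("1322221114431111111111112")
# Node 0 uses variant 1, node 2 uses variant 2, node 24 uses variant 2, ...
```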
Figure 15: Pareto fronts (execution time versus power) for the individual GA islands P1–P4 and the aggregated Pareto front, for synthesized benchmark S2.
Table 9: Resource utilization of an adder in a PRR for the uniform implementation (Virtex-V).

Site type   Available   Required   Utilization %
LUT         240         32         14
FD/LD       240         32         14
SliceL      30          5          17
SliceM      30          5          17
DSP48E      4           0          0

Table 10: Resource utilization for the system on a Virtex-V FPGA (XC5VFX70T), freq. 100 MHz.

Resource              Utilization   Available   Utilization %
Register              12022         44800       26
LUT                   11176         44800       24
Slice                 6282          11200       56
IO                    237           640         37
BSCAN                 1             4           25
BUFIO                 8             80          10
DCM_ADV               1             12          8
DSP48E                6             128         4
ICAP                  1             2           50
Global clock buffer   6             32          18

Table 10 shows the resources available in the Virtex-V FPGA operating at 100 MHz along with the resources occupied by the complete system.

5.3. Results. In this section, we first describe our experimental setup and then present results based on the proposed Island Based GA framework. As the GA is stochastic by nature, it was run 10 times for each benchmark, and solution quality was evaluated in terms of the average results produced. The run-time performance of the IBGA for the serial and parallel implementations is shown in Table 11. The IBGA was tested on a Red Hat Linux workstation with a 6-core Intel Xeon processor running at 3.2 GHz and equipped with 16 GB of RAM.
Figure 16: Pareto fronts (execution time versus power) for the individual GA islands P1–P4 and the aggregated Pareto front, for MediaBench benchmark DFG16.
Table 11: IBGA run-time for serial and parallel implementations.

DFG     Serial (min:sec)   Parallel (min:sec)   Speedup (×)
S1      00:52              00:17                3.1
S2      04:53              01:41                2.9
S3      00:23              00:10                2.3
S4      03:06              01:16                2.5
S5      00:07              00:02                3.0
DFG2    01:12              00:28                2.6
DFG6    00:12              00:08                1.5
DFG7    00:35              00:13                2.7
DFG12   02:45              00:52                3.2
DFG14   00:05              00:02                2.5
DFG16   00:26              00:09                2.9
DFG19   01:13              00:23                3.2
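The serial/parallel gap in Table 11 comes from the fact that the islands are independent and can run concurrently. A minimal sketch of that structure is below; `run_island` is a hypothetical stand-in for one GA island's full run, and a thread pool is used only for illustration (a process pool would avoid the GIL for CPU-bound GA runs).

```python
from concurrent.futures import ThreadPoolExecutor

def run_island(platform_id):
    """Hypothetical stand-in for one GA island's run on its floorplan.

    A real island would evolve its population and return its best
    schedule; here a dummy fitness is returned instead.
    """
    best_fitness = 1000.0 / (platform_id + 1)  # dummy value
    return platform_id, best_fitness

def run_all_islands(n_islands=4):
    # One worker per island; the islands share nothing, so this is
    # embarrassingly parallel (hence the ~2.3-3.2x speedups in Table 11).
    with ThreadPoolExecutor(max_workers=n_islands) as ex:
        return dict(ex.map(run_island, range(n_islands)))
```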
5.3.1. Convergence of Island Based GA. The benchmarks were tested on the proposed Island GA framework using four islands. Each island targeted a different FPGA platform (floorplan) that was generated manually; each platform is distinct in terms of the number, size, and layout of the PRRs. An example is shown in Figure 12: platforms 𝑃1 and 𝑃2 have uniform PRR size distributions, while platforms 𝑃3 and 𝑃4 have nonuniform size distributions. Although the four islands optimize the same task graph, they converge to different solutions, since they target different floorplans on the same FPGA architecture. Figures 13 and 14 show the convergence of the fitness values for two sample benchmarks, one synthetic and one from MediaBench, averaged over 10 runs.

5.3.2. The Pareto Front of the Island GA Framework. The proposed multiobjective Island Based GA optimizes for execution time and/or power consumption. Since these objectives conflict, a set of Pareto optimal solutions was generated for every benchmark, as discussed in Section 5.2. Figures 15 and 16 show the Pareto fronts obtained for the synthetic and MediaBench benchmarks, respectively, based on an average of 10 runs. The Pareto solutions obtained by each island (P1 to P4) are displayed individually along with the aggregated Pareto front for each benchmark. It is clear from the figures that each GA island produces a different and unique Pareto front that optimizes performance and power for the targeted platform. The aggregated solution based on the four GA islands combines the best solutions and thus improves upon the fronts obtained by the individual islands. Hence, the system designer can choose the most appropriate platform based on the desired objective function.

Table 12 compares the best fitness values of randomly bound architectures with no optimization (No opt.), a single-GA approach (1-GA), and the proposed aggregated Pareto optimal approach based on four islands (IBGA). Each value in Table 12 is the average of 10 different runs. The single-GA implementation achieves on average a 55.9% improvement over the baseline nonoptimized approach, while the Island Based GA achieves on average an 80.7% improvement; the latter achieves on average a 55.2% improvement over the single-GA approach.

Table 12: The average of 10 runs for the best fitness values.

Benchmark   No opt.    1-GA     IBGA
S1          4435.7     1847     978.2
S2          12514.7    4288.3   2244.4
S3          2649.95    1973     660.9
S4          8474.23    3856     1792.4
S5          1392       877      286
DFG2        5452.67    2554.1   1036.1
DFG6        2942.99    1217.9   406.1
DFG7        4558.4     1303.2   582.4
DFG12       9249.97    2595.9   1218.1
DFG14       1134.13    735      341.5
DFG16       3614.28    1301.1   505.2
DFG19       5805.24    2368.4   1528.1

Table 13 shows the fitness values for different weight values 𝑊 (introduced in (2)) for randomly bound architectures (No opt.), a single-GA (1-GA), and the Island Based GA (IBGA). On average, the single-GA implementation achieves a 52.7% improvement over the baseline nonoptimized approach, while the Island Based GA achieves on average a 75% improvement.

Table 13: Fitness values for different weights (𝑊).

𝑊 = 0.5:
Bench.   No opt.   1-GA   IBGA
S1       3327      2280   1469
S2       9483      5731   4351
S3       1894      1517   825
S4       6286      4000   2747
S5       954       684    335
DFG2     3749      2237   1324
DFG6     1851      906    387
DFG7     2880      1148   668
DFG12    5891      2400   1432
DFG14    775       557    338
DFG16    2274      1003   446
DFG19    4588      3152   2308

𝑊 = 0.7:
Bench.   No opt.   1-GA   IBGA
S1       3917      2134   1266
S2       11034     4872   3331
S3       2303      1760   734
S4       7470      3890   2277
S5       1194      790    300
DFG2     4673      2450   1189
DFG6     2434      1073   409
DFG7     3792      1231   640
DFG12    7722      2609   1342
DFG14    963       656    345
DFG16    2976      1166   505
DFG19    5232      2743   1907

𝑊 = 0.875:
Bench.   No opt.    1-GA     IBGA
S1       4435.7     1847     978.2
S2       12514.7    4288.3   2244.4
S3       2649.95    1973     660.9
S4       8474.23    3856     1792.4
S5       1392       877      286
DFG2     5452.67    2554.1   1036.1
DFG6     2942.99    1217.9   406.1
DFG7     4558.4     1303.2   582.4
DFG12    9249.97    2595.9   1218.1
DFG14    1134.13    735      341.5
DFG16    3614.28    1301.1   505.2
DFG19    5805.24    2368.4   1528.1

𝑊 = 1 (performance):
Bench.   No opt.   1-GA   IBGA
S1       4661      1601   796
S2       13457     3724   1604
S3       2914      2126   588
S4       9226      3715   1366
S5       1550      940    290
DFG2     6025      2582   903
DFG6     3306      1320   396
DFG7     5147      1324   537
DFG12    10415     2886   1044
DFG14    1249      785    340
DFG16    4095      1392   507
DFG19    6146      2186   1197

Table 14 compares the IBGA with an exhaustive procedure in terms of solution quality and CPU time. Only two small benchmarks are used, owing to the combinatorial explosion of CPU time with exhaustive search. It is clear from Table 14 that the IBGA produces near-optimal solutions for these small benchmarks: DFG14 (11 nodes) is 9% inferior to the optimal solution, while S5 (13 nodes) is within 1% of optimality.

Table 14: Exhaustive versus IBGA (average of 10 runs).

Benchmark          IBGA fitness   IBGA time (sec)   Exhaustive fitness   Exhaustive time (min)
DFG14 (11 nodes)   341            1.8               313                  5.8
S5 (13 nodes)      286            2.4               283                  257
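Combining the per-island Pareto fronts into the aggregated front amounts to a non-dominated filter over the union of all islands' (power, time) points, tagged with their platform. The sketch below is our own minimal rendering of that step; the point format and tie handling are assumptions.

```python
def dominates(a, b):
    """a dominates b if a is no worse in both objectives and not equal
    (both power and time are minimized)."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b

def aggregate_pareto(islands):
    """islands: {platform: [(power, time), ...]}.

    Returns the non-dominated set over all platforms, with each point
    tagged by the platform (floorplan) it came from, so the designer
    can read the best floorplan off the aggregated front.
    """
    pool = [(pw, t, plat) for plat, pts in islands.items() for pw, t in pts]
    return sorted(p for p in pool
                  if not any(dominates((q[0], q[1]), (p[0], p[1]))
                             for q in pool))

# Hypothetical two-island example:
front = aggregate_pareto({
    "P1": [(2.0, 5.0), (4.0, 3.0)],
    "P2": [(3.0, 4.0), (3.5, 5.5)],
})
# (3.5, 5.5) is dominated by (2.0, 5.0) and filtered out.
```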
6. Conclusions and Future Work

In this paper, we presented an efficient reconfigurable online scheduler that takes PRR size and other measures into account within a reconfigurable OS framework. The scheduler was evaluated and compared with a baseline reconfiguration-aware offline scheduler. The main objective of the proposed scheduler is to minimize the total execution time of the incoming task graph. The proposed scheduler supports hardware task reuse and software-to-hardware task migration. RCOnline learns about task types from previous history, and its performance tends to improve over time. RCOnline-Enh. further improves upon RCOnline by searching the ready queue for a task that can exploit hardware task reuse. RCOnline-Enh. performs exceptionally well: on average it reached 87% of the performance of the RCOffline scheduler without the learning phase, and 104% of the RCOffline scheduler following the learning phase. In addition
to the proposed schedulers, a novel parallel Island Based GA approach was designed and implemented to efficiently map execution units to task graphs for partially and dynamically reconfigurable systems. Unlike previous works, our approach is multiobjective: it seeks not only to optimize speed and power but also to select the best reconfigurable floorplan. Our algorithm was tested using both synthetic and real-world benchmarks. Experimental results clearly indicate the effectiveness of this approach, with the proposed Island Based GA framework achieving on average a 55.2% improvement over a single-GA implementation and an 80.7% improvement over a baseline random allocation and binding approach.
Conflict of Interests The authors declare that there is no conflict of interests regarding the publication of this paper.
References

[1] K. Eguro, “Automated dynamic reconfiguration for high-performance regular expression searching,” in Proceedings of the International Conference on Field-Programmable Technology (FPT ’09), pp. 455–459, IEEE, Sydney, Australia, December 2009.
[2] Xilinx, Partial Reconfiguration User Guide, UG702, Xilinx, 2010.
[3] M. Huang, V. K. Narayana, M. Bakhouya, J. Gaber, and T. El-Ghazawi, “Efficient mapping of task graphs onto reconfigurable hardware using architectural variants,” IEEE Transactions on Computers, vol. 61, no. 9, pp. 1354–1360, 2012.
[4] M. Huang, V. K. Narayana, and T. El-Ghazawi, “Efficient mapping of hardware tasks on reconfigurable computers using libraries of architecture variants,” in Proceedings of the 17th IEEE Symposium on Field Programmable Custom Computing Machines (FCCM ’09), pp. 247–250, IEEE, Napa, Calif, USA, April 2009.
[5] S. Hauck and A. DeHon, Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation, Morgan Kaufmann, Amsterdam, The Netherlands, 2008.
[6] P.-A. Hsiung, M. D. Santambrogio, and C.-H. Huang, Reconfigurable System Design and Verification, CRC Press, Boca Raton, Fla, USA, 2009.
[7] A. Al-Wattar, S. Areibi, and F. Saffih, “Efficient on-line hardware/software task scheduling for dynamic run-time reconfigurable systems,” in Proceedings of the IEEE Reconfigurable Architectures Workshop (RAW ’12), pp. 401–406, IEEE, Shanghai, China, May 2012.
[8] A. Al-Wattar, S. Areibi, and G. Grewal, “Efficient mapping and allocation of execution units to task graphs using an evolutionary framework,” in Proceedings of the International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART ’15), Boston, Mass, USA, June 2015.
[9] A. Al-Wattar, S. Areibi, and G. Grewal, “rcSimulator, a simulator for reconfigurable operating systems,” 2015, https://github.com/Aalwattar/rcSimulator.
[10] “Island based genetic algorithm for task allocation,” 2015, https://github.com/Aalwattar/GA-Allocation.
[11] H. Walder and M. Platzner, “A runtime environment for reconfigurable hardware operating systems,” in Field Programmable Logic and Application, pp. 831–835, Springer, Berlin, Germany, 2004.
[12] “Reconfigurable hardware operating systems: from design concepts to realizations,” in Proceedings of the 3rd International Conference on Engineering of Reconfigurable Systems and Architectures (ERSA ’03), pp. 284–287, CSREA Press, 2003.
[13] R. Pellizzoni and M. Caccamo, “Real-time management of hardware and software tasks for FPGA-based embedded systems,” IEEE Transactions on Computers, vol. 56, no. 12, pp. 1666–1680, 2007.
[14] F. Ghaffari, B. Miramond, and F. Verdier, “Run-time HW/SW scheduling of data flow applications on reconfigurable architectures,” EURASIP Journal on Embedded Systems, vol. 2009, Article ID 976296, p. 3, 2009.
[15] J. A. Clemente, C. González, J. Resano, and D. Mozos, “A task graph execution manager for reconfigurable multi-tasking systems,” Microprocessors and Microsystems, vol. 34, no. 2–4, pp. 73–83, 2010.
[16] D. Göhringer, M. Hübner, E. Nguepi Zeutebouo, and J. Becker, “Operating system for runtime reconfigurable multiprocessor systems,” International Journal of Reconfigurable Computing, vol. 2011, Article ID 121353, 16 pages, 2011.
[17] D. Göhringer and J. Becker, “High performance reconfigurable multi-processor-based computing on FPGAs,” in Proceedings of the IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW ’10), pp. 1–4, IEEE, Atlanta, Ga, USA, April 2010.
[18] T.-M. Lee, J. Henkel, and W. Wolf, “Dynamic runtime re-scheduling allowing multiple implementations of a task for platform-based designs,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, pp. 296–301, Paris, France, March 2002.
[19] W. Fu and K. Compton, “An execution environment for reconfigurable computing,” in Proceedings of the 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM ’05), pp. 149–158, IEEE, April 2005.
[20] B. Mei, P. Schaumont, and S. Vernalde, “A hardware-software partitioning and scheduling algorithm for dynamically reconfigurable embedded systems,” in Proceedings of the Workshop on Circuits, Systems and Signal Processing (ProRISC ’00), Veldhoven, The Netherlands, November–December 2000.
[21] J. Wang and S. M. Loo, “Case study of finite resource optimization in FPGA using genetic algorithm,” International Journal of Computer Applications, vol. 17, no. 2, pp. 95–101, 2010.
[22] Y. Qu, J.-P. Soininen, and J. Nurmi, “A genetic algorithm for scheduling tasks onto dynamically reconfigurable hardware,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS ’07), pp. 161–164, Napa, Calif, USA, May 2007.
[23] R. Chen, P. R. Lewis, and X. Yao, “Temperature management for heterogeneous multi-core FPGAs using adaptive evolutionary multi-objective approaches,” in Proceedings of the IEEE International Conference on Evolvable Systems (ICES ’14), pp. 101–108, IEEE, Orlando, Fla, USA, December 2014.
[24] A. Elhossini, S. Areibi, and R. Dony, “Strength Pareto particle swarm optimization and hybrid EA-PSO for multi-objective optimization,” Evolutionary Computation, vol. 18, no. 1, pp. 127–156, 2010.
[25] D. Cordeiro, G. Mounié, S. Perarnau, D. Trystram, J.-M. Vincent, and F. Wagner, “Random graph generation for scheduling simulations,” in Proceedings of the 3rd International ICST Conference on Simulation Tools and Techniques (SIMUTools ’10), pp. 60.1–60.10, ICST, Brussels, Belgium, 2010.
[26] Electrical & Computer Engineering Department, UCSB, “Express benchmarks,” 2015, http://express.ece.ucsb.edu/benchmark/.
[27] A. Bilgin and J. Ellson, “Dot (graph description language),” 2008, http://www.graphviz.org/Documentation.php.
[28] G. A. Vera, D. Llamocca, M. S. Pattichis, and J. Lyke, “A dynamically reconfigurable computing model for video processing applications,” in Proceedings of the 43rd Asilomar Conference on Signals, Systems and Computers, pp. 327–331, IEEE, Pacific Grove, Calif, USA, November 2009.
[29] L. Cai, S. Huang, Z. Lou, and H. Peng, “Measurement method of the system reconfigurability,” Journal of Communications and Information Sciences, vol. 3, no. 3, pp. 1–13, 2013.
[30] J. A. Clemente, J. Resano, C. González, and D. Mozos, “A hardware implementation of a run-time scheduler for reconfigurable systems,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 19, no. 7, pp. 1263–1276, 2011.
[31] C. Steiger, H. Walder, and M. Platzner, “Heuristics for online scheduling real-time tasks to partially reconfigurable devices,” in Field-Programmable Logic and Applications: 13th International Conference (FPL 2003), vol. 2778 of Lecture Notes in Computer Science, pp. 575–584, Springer, Berlin, Germany, 2003.
[32] Y.-H. Chen and P.-A. Hsiung, “Hardware task scheduling and placement in operating systems for dynamically reconfigurable SoC,” in Embedded and Ubiquitous Computing—EUC 2005, vol. 3824, pp. 489–498, Springer, Berlin, Germany, 2005.
[33] F. Redaelli, M. D. Santambrogio, and S. O. Memik, “An ILP formulation for the task graph scheduling problem tailored to bi-dimensional reconfigurable architectures,” in Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig ’08), pp. 97–102, Cancun, Mexico, December 2008.
[34] F. Redaelli, M. D. Santambrogio, and D. Sciuto, “Task scheduling with configuration prefetching and anti-fragmentation techniques on dynamically reconfigurable systems,” in Proceedings of the Design, Automation and Test in Europe (DATE ’08), pp. 519–522, Munich, Germany, March 2008.
[35] Xilinx Inc., XPower Estimator User Guide, Xilinx Inc., 2010.