Reliable Pre-Scheduling Delay Estimation for Hardware/Software ...

Reliable Pre-scheduling Delay Estimation for Hardware/Software Partitioning

Rania O. Hassan*, M. B. Abdelhalim** and S. E.-D. Habib*

*Electronics and Communications Engineering Department, Cairo University, Giza, Egypt
**CCIT-AASTMT, Cairo, Egypt
Email: [email protected], [email protected], [email protected]

Abstract: Hardware/software co-design has become one of the main methodologies in modern embedded system design. The partitioning step, i.e., deciding which components of the system should be implemented in hardware and which in software, is the most important step in embedded system design. Since the cost and delay of the final design depend strongly on the partitioning result, accurate estimates of hardware area, delay, and power are needed. However, accurate delay estimation methods are slow because they require a scheduling step. In this paper, we propose a reliable delay estimation method to be used within the partitioning step, prior to the scheduling step.

Keywords: Hardware/Software Co-design, Hardware/Software Partitioning, Control-Data Flow Graphs, FPGAs, High-level Synthesis, Design Space Exploration, MPSoC.

I. INTRODUCTION

High-level synthesis for digital designs has proved efficient in supporting design from behavioral specifications with constraints. The designer of an embedded system always faces the problem of partitioning design components into hardware components (e.g., FPGAs or ASICs) or software components running on a processor (e.g., DSPs, core processors, or ASIPs). The advantages of using processors are manifold, as software is more flexible and cheaper than hardware. This flexibility allows late design changes. However, hardware is used when processors cannot meet the required performance. This tradeoff between hardware and software illustrates the optimization aspect of the Hardware/Software (HW/SW) co-design problem [1].

HW/SW co-design deals with the problem of designing embedded systems while considering this tradeoff. Hence, design components must be partitioned into hardware or software, and automatic partitioning is a key issue. The aim of the partitioning task is to find a design implementation that fulfills all the specification requirements (functionality, goals, and constraints such as area, delay, and power). Basically, the HW/SW partitioning problem can be formulated as an optimization problem [1]. Consider, for example, n components, each of which is to be mapped to either HW or SW. There are 2^n candidate partitioning vectors. We may, for example, request that power be minimized subject to area (cost) and delay constraints. The central question is which of these partitioning vectors is the "fittest", or optimum, where fitness is based on metrics such as power, delay, and area. This problem is NP-hard for most cases [1]. Additionally, current system designs call for a multitude of SW design approaches (e.g., design using multiprocessors) and a multitude of HW design approaches (e.g., pipelining). Therefore, exact solutions for this problem tend to be quite slow, especially as the number of system components increases. Hence, many algorithms have been proposed to solve this problem, such as particle swarm optimization [2], greedy partitioning algorithms [3], simulated annealing [4], genetic algorithms [5], dynamic programming [6], and integer linear programming [7]. All these approaches can work well within their own co-design environments, but it is difficult to compare them because of the large differences between those environments and the lack of commonly adopted benchmarks [8].

978-1-4799-0066-4/13/$31.00 ©2013 IEEE

Delay estimation poses considerable difficulty for the partitioning problem. In contrast to area and power estimation, a given partitioning vector does not imply a definite delay. A scheduling step is needed to determine accurately the best delay for a given partitioning vector. Thus, the partitioning and scheduling steps are cross-coupled [9]: a given partition bounds the scheduling step, and the scheduling step determines the delays that are needed to decide the fitness of the individual partition vectors. This coupling between the partitioning and scheduling steps is unfortunate. At one extreme, the designer may resolve the coupling by solving the partitioning and scheduling problems concurrently, which means including an inner loop in the search algorithm that calculates the optimum schedule for each partitioning vector [10-13]. Obviously, this concurrent partition/schedule approach adds a high time penalty to an already time-consuming NP-hard problem. The other extreme is to solve the partitioning problem and then the scheduling problem sequentially [14-15].
The success of the latter, time-efficient approach depends on devising a reliable and fast pre-scheduling delay estimation formula. This paper addresses this pre-scheduling delay estimation problem. The outline of the paper is as follows: Section II gives an overview of related work. The problem definition is given in Section III. Section IV discusses the proposed hardware delay estimation method and the experimental results. The paper is concluded in Section V.

II. RELATED WORK

In general, delay calculation is an important part of the HW/SW partitioning process. The delay can be calculated post-scheduling or by applying a pre-scheduling estimation formula and/or algorithm.

Considerable research effort has addressed the concurrent solution of the partitioning/scheduling problem. Ref. [13] combines partitioning with scheduling in HW/SW co-design on Multi-Processor Systems-on-Chip (MPSoC). Scheduling First Partitioning Later (SFPL) [10, 11] and ref. [12] present algorithms that solve the joint problem of partitioning and scheduling. These algorithms consist basically of two local search heuristics: one for partitioning and one for scheduling. A partitioning algorithm that depends on scheduling a given task graph with minimum latency was presented in [16]. Reference [17] used discrete particle swarm optimization for HW/SW partitioning together with the depth-first search (DFS) algorithm [18]; the DFS algorithm is used to obtain the worst-case execution time, which serves as the metric for evaluating an obtained solution. A novel formulation of a time-indexed mixed-integer 0-1 programming model for the co-synthesis hardware/software integrated partitioning and scheduling problem is presented in [19]. It also provides a tool for simultaneously partitioning and scheduling tasks onto hardware/software units. A mixed-integer programming model was developed by Niemann and Marwedel [7] that employs a two-phase method for hardware/software partitioning, where a tentative schedule is proposed first and verified subsequently. If the timing constraints are violated, the partitioning step is repeated with timing constraints that are tighter than the estimated scheduling horizon length. All the above post-scheduling delay calculation methods are commonly known to be accurate yet time-consuming.

Other researchers adopted the sequential partition-then-schedule approach, where scheduling is performed on the final solution of the partitioning step. In general, there is a scarcity of information on how the delay is estimated when this approach is adopted. In ref. [2], the delay is split into two parts, hardware delay and software delay, and a single processor is assumed. Hence, the total delay of the software components can be calculated by summing the individual delays of the software components. The other part of the delay is due to the hardware-mapped components. Ref. [2] proceeds to calculate upper and lower bounds for the HW delay. These bounds are then used to estimate, heuristically, the total delay of the hardware-mapped components. Ref. [21] used a delay equation that calculates the total design delay by summing the software component delays and the hardware component delays, assuming that the whole design executes sequentially.

III. PROBLEM DEFINITION

The following examples demonstrate the need for reliable delay estimation in the partition-then-schedule problem. The examples in Figures 1 and 2 are given as (scheduled) control data flow graphs (CDFGs), where each example consists of the same four tasks. Each task can be implemented either in software or in hardware. Software is represented by 0, and hardware by 1 (serial implementation), 2 (parallel implementation), or 3 (pipelined implementation), as in [20]. Thus, the partition vector [0123] means that component T1 is mapped to SW, component T2 to serial HW, component T3 to parallel HW, and component T4 to pipelined HW.

Figure 1. Control Data Flow Graph 1 (CDFG1), with tasks T1 to T4

Figure 2. Control Data Flow Graph 2 (CDFG2), with tasks T1 to T4
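The partition-vector encoding described above can be illustrated with a minimal Python sketch. The code values (0 to 3) and their meanings come from the text; the function name `decode_partition_vector` and the dictionary it returns are assumptions made only for illustration.

```python
# Encoding taken from the text: 0 = SW, 1 = serial HW, 2 = parallel HW, 3 = pipelined HW.
IMPLEMENTATION_NAMES = {
    0: "SW",
    1: "serial HW",
    2: "parallel HW",
    3: "pipelined HW",
}


def decode_partition_vector(vector):
    """Map each task T1..Tn to the implementation named by its code.

    `vector` is a sequence of integer codes, one per task, in task order.
    The function name and return format are illustrative assumptions.
    """
    return {
        f"T{index}": IMPLEMENTATION_NAMES[code]
        for index, code in enumerate(vector, start=1)
    }


if __name__ == "__main__":
    # The vector [0123] from the text.
    for task, implementation in decode_partition_vector([0, 1, 2, 3]).items():
        print(task, "->", implementation)
```

With four implementation choices per task, a design with n tasks has 4^n such vectors under this encoding, which is why the fitness of each vector has to be cheap to evaluate.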

Table I lists the assumed delay information for each of the four tasks T1 to T4, for each of the three possible hardware implementations. The delay of each component is given in nanoseconds and also as a number of clock cycles. The clock cycle period is taken as 10 ns, as it is the fastest clock period over all hardware components.

TABLE I DIFFERENT IMPLEMENTATIONS HARDWARE DELAY

Component   Serial (ns/cycles)   Parallel (ns/cycles)   Pipelined (ns/cycles)
T1          30/3                 10/1                   9/1
T2          20/2                 9/1                    9/1
T3          25/3                 12/2                   9/1
T4          45/5                 25/3                   10/1

The work in [2, 15] presented a method for hardware delay estimation based on an "average" Parallelism Factor (PF). For each partition vector, the PF is defined as shown in Equation (1). The total delay (in clock cycles) is calculated by summing the cycles of the hardware-mapped components and then dividing the result by the PF.

$$\mathrm{PF} = \frac{\text{number of serial cycles}}{\text{number of parallel cycles}} \qquad (1)$$

From Table I, the number of serial cycles is 13 and the number of parallel cycles is 7; hence PF = 13/7 = 1.857. Dividing the summed cycles of the hardware-mapped components by the PF gives the total latency shown in Table II. For example, the vector [1 2 2 3] sums to 3 + 1 + 2 + 1 = 7 cycles, giving 7/1.857 = 3.77 cycles. The PF formula yields the same latency for both examples, although the two CDFGs are different.

TABLE II DESIGN DELAY FOR EACH PARTITIONING SOLUTION

Partitioning Vector   PF Latency   Exact Latency (CDFG1)   Exact Latency (CDFG2)
[T1 T2 T3 T4]         (# cycles)   (# cycles)              (# cycles)
[1 2 2 3]             3.77         6                       5
[2 1 2 3]             3.23         5                       3
[2 2 1 3]             3.23         5                       4
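The PF latencies in Table II can be reproduced with a short Python sketch. The cycle counts are copied from Table I; the reading of PF as total serial cycles divided by total parallel cycles is inferred from the numbers quoted in the text (13, 7, and 1.857), and the function names are assumptions made for illustration.

```python
# Cycle counts copied from Table I; keys 1, 2, 3 mean serial, parallel, pipelined HW.
HW_CYCLES = {
    "T1": {1: 3, 2: 1, 3: 1},
    "T2": {1: 2, 2: 1, 3: 1},
    "T3": {1: 3, 2: 2, 3: 1},
    "T4": {1: 5, 2: 3, 3: 1},
}


def parallelism_factor(hw_cycles):
    """PF as read from the text: total serial cycles / total parallel cycles."""
    serial = sum(cycles[1] for cycles in hw_cycles.values())
    parallel = sum(cycles[2] for cycles in hw_cycles.values())
    return serial / parallel


def pf_latency(vector, hw_cycles):
    """Sum the cycles of the hardware-mapped tasks and divide by PF.

    Tasks with code 0 (software) are skipped, since the PF formula
    only considers hardware-mapped components.
    """
    total = sum(
        hw_cycles[task][code]
        for task, code in zip(hw_cycles, vector)
        if code != 0
    )
    return total / parallelism_factor(hw_cycles)


if __name__ == "__main__":
    for vector in ([1, 2, 2, 3], [2, 1, 2, 3], [2, 2, 1, 3]):
        print(vector, round(pf_latency(vector, HW_CYCLES), 2))
```

Because the estimate depends only on the summed cycle counts, any two task graphs with the same tasks receive the same PF latency, regardless of their dependency structure.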

If the hardware-mapped components are dominant for a certain partitioning vector, the error incurred by the PF delay estimation formula will be large and unacceptable. Another problem with the PF formula is that it depends only on the hardware-mapped components: the individual component delays are summed and the result is divided by the PF to get the estimated hardware delay. The software-mapped components are not taken into consideration. Since the "fitness" of a given partition vector depends on the overall delay rather than on the delay of the HW-mapped or SW-mapped components alone, the PF formula can lead to unacceptable errors.

The above discussion shows the need for a reliable delay estimation formula. Note that such a formula should be based on the delay information that is available pre-scheduling, namely:
a. As-Soon-As-Possible (ASAP) hardware delay (Ta).
b. Serial hardware delay (Ts), obtained by summing the delays of all hardware components when executed sequentially. Note that these components may be internally implemented according to serial, parallel, or pipelined architectures.
c. Software delay (Tf), obtained by summing the delays of the software components for a single-processor architecture.
d. Number of hardware-mapped components per partition (Nh).
e. Number of software-mapped components per partition (N ).
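The two hardware delay bounds in items (a) and (b) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the diamond-shaped dependency graph is hypothetical (it is not the graph of Figure 1 or Figure 2), the ASAP delay is computed as the longest path with unlimited resources, and the function names are invented for this example. The cycle counts correspond to the vector [1 2 2 3] in Table I.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def serial_delay(delays):
    """Ts: every hardware component executes one after another."""
    return sum(delays.values())


def asap_delay(delays, predecessors):
    """Ta: each component starts as soon as all of its predecessors finish.

    Assumes unlimited hardware resources, so the result is the length
    of the longest dependency path through the graph.
    """
    # Give every task an entry so isolated tasks are not dropped.
    graph = {task: predecessors.get(task, ()) for task in delays}
    finish = {}
    for task in TopologicalSorter(graph).static_order():
        start = max((finish[p] for p in graph[task]), default=0)
        finish[task] = start + delays[task]
    return max(finish.values())


if __name__ == "__main__":
    # Cycles for vector [1 2 2 3] from Table I.
    delays = {"T1": 3, "T2": 1, "T3": 2, "T4": 1}
    # Hypothetical dependencies: T1 feeds T2 and T3, which both feed T4.
    predecessors = {"T2": ["T1"], "T3": ["T1"], "T4": ["T2", "T3"]}
    print("Ts =", serial_delay(delays))
    print("Ta =", asap_delay(delays, predecessors))
```

Both quantities need only the component delays and the dependency edges, so they are available before any resource-constrained scheduling is run.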

IV. PROPOSED DELAY FORMULA AND EXPERIMENTAL RESULTS

1) Case of Dominant Hardware Delay

Consider the case where the delay corresponding to a given partition vector is dominated by the delay of the hardware-mapped components. The hardware delay has an upper bound, Ts, corresponding to scheduling all components serially, and a lower bound, Ta, corresponding to scheduling all components ASAP. Additionally, the exact post-scheduling delay is expected to tend to Ta if the number of hardware-mapped components, Nh, is large, since this case corresponds to a large number of clock cycles with an attendant large freedom to assign components to different clock cycles. On the other hand, if Nh is small, the post-scheduling delay would tend to the serial hardware delay Ts, since there is little freedom to move components between different time slots.

Our proposed pre-scheduling delay estimate is shown in Equations 2-3. We call this function the Component Count Function (CCF). Note that this function behaves similarly to the post-scheduling delay, as discussed in the previous paragraph. Given that the software delay Tf is much smaller than the serial hardware delay Ts, Equation 3 predicts a delay that is bounded between Ta and Ts. Additionally, if Nh is large (relative to a fixed number M), the predicted delay CCF tends to the ASAP delay, Ta. Alternatively, if Nh ...

Figure 3. CDFG for Simple Example

The experimental results reported in this paper were obtained using the CUPSHOP tool proposed in [2], targeting Altera Cyclone FPGAs.

Case Study 1: In Table III, we compare the different delay formulas for each partitioning vector to determine whether the estimated delay falls within the range of the hardware delay, which lies between two extremes: the parallel scheduling delay and the serial scheduling delay. As shown in Table III, different partitioning vectors, resulting from different constraints on area, delay, and power, are tested for a processor frequency of 80 MHz (cycle time = 12.5 ns), where SW is indicated by 0, serial HW by 1, parallel HW by 2, and pipelined HW by 3.

TABLE III DELAY RESULTS FOR HARDWARE COMPONENTS

Partitioning Vector     Tf   PF = 2.6   CCF (M=5)   CCF (M=10)   CCF (M=15)   Ta < Delay Range < Ts
[T1 T2 T3 T4 ... T13]
[2201211211331]         75   211        289         375          407          236.54 < Delay
