An Evolutionary Approach to Area-Time Optimization of FPGA Designs

Fabrizio Ferrandi, Pier Luca Lanzi, Gianluca Palermo, Christian Pilato, Donatella Sciuto, Antonino Tumeo
Politecnico di Milano, Dipartimento di Elettronica e Informazione, Via Ponzio 34/5, Milano, Italy
{ferrandi,gpalermo,lanzi,sciuto,tumeo}@elet.polimi.it, [email protected]

Abstract—This paper presents a new methodology based on evolutionary multi-objective optimization (EMO) to synthesize multiple complex modules on programmable devices (FPGAs). It starts from a behavioral description written in a common high-level language (for instance C) and automatically produces the register-transfer level (RTL) design in a hardware description language (e.g. Verilog). Since all the high-level synthesis problems (scheduling, allocation and binding) are notoriously NP-complete and interdependent, the three problems should be considered simultaneously. This leads to a wide design space that needs to be thoroughly explored to obtain solutions able to satisfy the design constraints. Evolutionary algorithms are good candidates to tackle such complex explorations. In this paper we provide a solution based on the Non-dominated Sorting Genetic Algorithm (NSGA-II) to explore the design space in order to obtain the best solutions in terms of performance given the area constraints of a target FPGA device. Moreover, a cost estimation model has been integrated to guarantee the quality of the solutions found without requiring a complete synthesis to validate each generation, an impractical and time-consuming operation. We show on the JPEG case study that the proposed approach provides good results in terms of the trade-off between total occupied area and execution time.

I. INTRODUCTION

High-Level Synthesis (HLS) is concerned with the design and implementation of digital circuits starting from a behavioral description, subject to a set of goals and constraints, and given a library of different types of resources. The behavioral description specifies the behavior in terms of operations, assignment statements and control constructs in a common high-level language (e.g. the C language). The resource library provides a choice of components among which the synthesizer may select the one that best matches the design constraints and maximizes the optimization objectives. The overall target architecture of the HLS flow is typically based on the FSMD model [1]: a datapath description controlled by a finite state machine. At the RTL level, a datapath is composed of functional units, storage and interconnection elements. The finite state machine specifies the set of micro-operations the datapath has to perform during each control step. High-level synthesis involves three main tasks: operation scheduling, resource allocation and controller synthesis. Operation scheduling determines the cycle steps in which operations start their execution.

TABLE I
CHARACTERIZATION OF ADDER, MULTIPLIER AND MUXS [3]

Functional units and MUX   Implementation          Area (CLB)   Delay (ns)   Power (W)
adder24bit cla             Carry look-ahead            26          11.8        0.010
mul18bit                   Booth-recoded Wallace      280          14.8        0.308
mux24bit                   2to1 Synopsys design         6           0.6        0.002
mux24bit                   8to1 Synopsys design        66           4.6        0.023
mux24bit                   32to1 Synopsys design      276          10.9        0.240

Resource allocation is concerned with assigning operations and values to hardware components and interconnecting them using connection elements. Solving these problems efficiently is a non-trivial matter because of their NP-complete nature [2]. Controller synthesis provides the logic that issues datapath operations, based on the control flow. Recent studies [3] have demonstrated that interconnection costs have to be taken into account, since the area of multiplexers and interconnection elements can by far outweigh the area of functional units and registers (see Table I). This is especially true for FPGA designs, because a larger amount of transistors has to be provided in the wiring channels and logic blocks to make signal transmission programmable. This strongly motivates the design of highly effective algorithms to reduce the number and size of the multiplexers generated during high-level synthesis: a methodology that does not consider them produces an incomplete area estimation. This could lead to a wrong final design, where interconnection elements increase area costs even beyond the global constraints. In fact, a design with more functional units or registers can sometimes reduce the total area by substantially reducing the interconnection elements. As a result, interconnection allocation should be taken into account by any methodology that tries to minimize FPGA design area. Evolutionary algorithms are good candidates for high-level synthesis because they iteratively improve a set of solutions (thus improving alternative designs), they do not require the quality (cost) function to be linear (e.g. the time-area product) and they are known to work well on problems with large and

non-convex search spaces (constrained or with poor information). They can also build a model themselves to give estimations of the final solution. Thus, the problem is to select the evolutionary approach that best fits this application domain, allowing fast convergence to a feasible and good solution. The high-level synthesis cost function can be single-objective (minimize execution time, total area or power consumption) or multi-objective, where two or more objectives are considered simultaneously. Hence there is no single solution to such problems, no optimum in the traditionally expected sense; instead, there is a large family of alternative solutions, representing different balances of the various objectives. To deal with multi-objective functions, a single weighted cost function has been shown not to be useful, because the algorithm convergence depends on the weights assigned to each objective [4]. The proposed approach adopts a non-dominated sorting genetic algorithm (NSGA-II [5]), since it can maintain diversity within the population and therefore allows a better exploration of the design space.

The paper is organized as follows. Section II provides an overview of the current state of the art of high-level synthesis techniques, with particular attention to approaches integrating evolutionary algorithms. In Section III the problem is defined and the proposed flow for the high-level synthesis of a design solution is explained. In Section IV the design space exploration problem is approached using the NSGA-II algorithm. Finally, experimental results on the JPEG encoder case study are reported in Section V.

II. RELATED WORK

Synthesis on FPGA devices requires fast and easily modifiable designs, so new design methodologies have been a very hot research topic over the past two decades. There have been many deterministic and non-deterministic approaches to solving the datapath synthesis problem. Lin [6] analyzes the developments of the last two decades for the scheduling, resource allocation and controller synthesis problems. Early approaches aimed at finding optimal solutions and were based on operations research algorithms. Integer Linear Programming (ILP) formulations for both resource-constrained and time-constrained scheduling [7] have been proposed. Papachristou and Konuk [8] present a scheduling and allocation algorithm based on linear programming, followed by an interconnection optimization. In practice, the ILP approach is applicable only to very small problems. Rim et al. [9] present an ILP formulation to reduce wiring and multiplexer area, given a scheduled graph. Cordone et al. [10] recently presented an ILP branch-and-cut approach to improve scheduling using speculative computation. However, the execution time of ILP algorithms grows exponentially with the number of variables and the number of inequalities. Since the ILP method is impractical for large designs, heuristic methods that run efficiently at the expense of design optimality have been developed. Stok [11] presents a set of heuristic algorithms to sequentially solve each step of datapath creation. Symbolic BDD-based manipulations have attained interesting results as an alternative to ILP and heuristic

techniques. The key idea of the symbolic approach is to use a set of nondeterministic finite automata to describe design alternatives for highly constrained, control-dominated models, in which complex if-then-else patterns constitute the body of kernel loops. Brewer [12] and Haynal [13] present a set of techniques for representing the high-level behavior of a digital subsystem as a collection of nondeterministic finite automata, leading to a symbolic formulation of the scheduling problem based on finite automata. The technology is similar to that used in symbolic model checking; it has been extended and formalized by Cabodi et al. [14], who describe a SAT-based formulation of automata-based scheduling and propose a resolution algorithm based on SAT solvers and bounded model checking (BMC). Lakshminarayana et al. [15] create the finite state machine during the scheduling phase, under allocation constraints. Their algorithm tries to minimize the average execution time and preserves the parallelism inherent in the application. Harmanani and Saliba [16] estimate the controller area based on a PLA model implementation, using the number of PLA inputs and outputs. This can be considered an approximate approach, because some datapath design optimizations can heavily affect the controller cost in terms of components and control logic.

Classical optimization methods (including the multi-criterion decision-making methods) suggest converting the multi-objective optimization problem into a single-objective one by emphasizing one particular Pareto-optimal solution at a time. Since evolutionary algorithms (EAs) work with a population of solutions, a simple EA can be extended to maintain a diverse set of solutions. With an emphasis on moving toward the true Pareto-optimal region, an EA can be used to find multiple Pareto-optimal solutions in a single simulation run. The non-dominated sorting genetic algorithm (NSGA) proposed in [17] is among the first such EAs. This work has been further extended by Deb et al. [5] into the NSGA-II algorithm. Mandal et al. present two genetic algorithms to approach high-level synthesis: the first [18] is a scheduling algorithm that tries to minimize area cost and design latency, the second [19] is an allocation and binding algorithm that works on a scheduled graph. They consider the two phases and problems separately. Grewal et al. [20] implement a hierarchical genetic algorithm, where genetic module allocation is followed by genetic scheduling. Very few algorithms also consider interconnection costs in the cost evaluation. Typical algorithms [8], [3] only optimize and minimize the interconnection cost after the other phases have terminated. Araújo et al. [21] use a genetic programming approach, where solutions are represented as tree productions (rephrased rules of the HDL grammar) that directly create Structured Function description Language (SFL) programs.

III. PROBLEM DEFINITION

The proposed methodology accepts as input a behavioral description of the problem (in C language) and a library of resource descriptions. A set of constraints can also be specified, such as the maximum total area of the resources that can be allocated or the maximum execution time (see Section IV-G).

Fig. 1. Kim example: Control flow graph.

Fig. 2. Kim example: CDG+DFG data structure.

This section describes the translation of the behavioral specification into a graph-based specification (Section III-A) and the high-level synthesis problem (Section III-B).

A. Graph-based specification

Language-based specifications are usually translated into intermediate representations to efficiently manage and analyze the design specification. We adopt an optimized version of the data and control dependence analysis of [22] to express the precedence relations between operations. Our framework uses as front-end a customized interface to the GNU GCC compiler [23]. Starting from version 3.5, GCC provides the possibility of dumping to file the syntax tree structure representing the initial source code. The combined control and data dependence CDG+DFG data structure is built starting from this syntax tree. The use of GCC allows the introduction of several compiler optimization techniques into a high-level synthesis framework, such as loop unrolling, constant propagation, dead code elimination, common subexpression elimination, etc.

Example: Fig. 1 and Fig. 2 show the internal representations of the kim benchmark. The kim benchmark [24] is composed of 34 operations (among them 16 additions and 9 subtractions) and 2 branching blocks.
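For concreteness, the following is a minimal sketch of how such a combined dependence structure can be represented; these are not PandA's actual data structures, and all names are illustrative. It also shows the direct dependence test that constrains which operations may share a control step.

    #include <algorithm>
    #include <string>
    #include <vector>

    // Hypothetical node of a combined control/data dependence graph (CDG+DFG).
    struct Operation {
        int id;
        std::string opcode;             // e.g. "add", "sub", "ne"
        std::vector<int> dataPreds;     // ids of operations producing our inputs (DFG edges)
        std::vector<int> controlPreds;  // ids of branches we are control-dependent on (CDG edges)
    };

    // Two operations may share a control step only if neither directly depends
    // on the other (transitive dependences would be checked by graph traversal).
    bool directlyDependent(const Operation& a, const Operation& b) {
        auto contains = [](const std::vector<int>& v, int x) {
            return std::find(v.begin(), v.end(), x) != v.end();
        };
        return contains(a.dataPreds, b.id) || contains(a.controlPreds, b.id) ||
               contains(b.dataPreds, a.id) || contains(b.controlPreds, a.id);
    }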

B. The High-Level Synthesis Problem

In the high-level synthesis process, a schedule can induce a resource allocation (because operations that are executed simultaneously need more resources) and a resource allocation can induce a schedule (because resource constraints and dependences allow only some operations to be executed simultaneously). The total number of functional units required in a control step directly corresponds to the number of operations scheduled into it. If more operations are scheduled into each control step, more functional units are necessary, which results in fewer control steps for the design implementation. On the other hand, if fewer operations are scheduled into each control step, fewer functional units are sufficient, but more control steps are needed. Unit allocation determines the number and types of library components to be used in the design. Since a real resource library may contain multiple types of resources, each with different characteristics (e.g. operations that can be performed, size, delay and power dissipation), unit allocation needs to determine the number and type of the different functional and storage units from the resource library. Unit binding maps the operations, variables and data transfers of the behavioral description onto the functional, storage and interconnection units, respectively. Two parameters can therefore define how good a solution is: total occupied area and execution time (design latency). The occupied area is the sum of the areas of all allocated resources, and the total execution time is defined as the worst-case execution time (see Section IV-B). Considering total area estimation, a simple approach would select solutions with fewer functional units. However, with fewer functional units, more registers can be needed to store intermediate results and more multiplexers can be needed to select the correct inputs to them.

Fig. 3. An example of chromosome encoding for the kim benchmark: operation-binding genes followed by the SchedA, RegA and ConnA completing-algorithm genes.

Conversely, adding functional units can lead to fewer registers and multiplexers. Since the functional unit area can be almost equal to the multiplexer area, it is possible that functional units have to be added in order to reduce the total area.

Example: Table II shows different resource allocations for the kim benchmark. Choosing the minimization of the functional unit area as objective function does not lead to the optimization of the overall design area. In particular, by adding one subtractor, the number of registers and multiplexers is reduced. The problem is that this information is not available when synthesis starts, but only at the end, because there is no way to know in advance the number of registers and multiplexers used, since it highly depends on the allocation and scheduling results. This feedback information (available only after synthesis) should be used to acquire experience and then produce new, better solutions. This is why genetic algorithms have been chosen to approach the problem. The flow we used to perform synthesis is based on the following classical steps:
• Allocation-and-Binding
• Scheduling
• FSM-creation
• Register-allocation
• Interconnection-allocation
Scheduling is computed using different schedulers (e.g. an integer linear programming approach or a list-based one), in order to minimize latency under the constraints imposed by allocation and binding. Register allocation is performed by a left-edge algorithm, sequential vertex coloring or a heuristic clique covering. To minimize the interconnection elements, port swapping is performed among the inputs of a functional unit (if the operation that requires them is commutative).

IV. DESIGN SPACE EXPLORATION

In this section, the application of the NSGA-II algorithm is proposed to explore the design space and to find the designs that represent the best trade-offs between total occupied area and execution time.

A. Chromosome encoding

Chromosomes encode all the information that is necessary for the synthesis computation. In the proposed methodology a chromosome is simply a vector composed of two parts. In the first part, each gene describes the mapping between an operation of the behavioral specification and the functional unit on which it will be executed. With this formulation, the allocation and binding information is encoded. In the second part, additional genes represent which algorithms are used to complete the other steps of the high-level synthesis (scheduling, register allocation and interconnection allocation). A minimal sketch of this encoding is shown below.
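The following sketch of the two-part encoding uses hypothetical names; the paper does not publish its data structures.

    #include <vector>

    // Two-part chromosome: operation-to-FU binding genes, then the genes that
    // select the deterministic completing algorithms. Names are illustrative.
    struct Chromosome {
        std::vector<int> binding; // binding[i] = functional-unit instance executing operation i
        int schedA;               // index of the scheduling algorithm (e.g. list-based, ILP)
        int regA;                 // index of the register-allocation algorithm
        int connA;                // index of the interconnection-binding algorithm
    };

Because every gene ranges over a finite admissible set (the functional units able to execute the operation, or the available algorithms), any recombination of valid chromosomes is again valid, which is why no repair procedure is needed.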

Since these algorithms are deterministic, the results can be retrieved by directly applying them, so it is not required to encode them in the chromosome.

Example: Fig. 3 shows the chromosome encoding for one of the kim solutions. White elements represent the operations to be executed and an admissible binding. Grey elements represent the completing algorithms chosen for the other high-level synthesis steps: SchedA is the index of the scheduling algorithm, RegA of the register allocation one and ConnA of the interconnection binding one.

With this encoding, all genetic operators (see Section IV-E) create feasible solutions. In fact, recombination of operation bindings simply results in a new allocation or binding, and recombination of completing algorithms results in using different high-level synthesis algorithms. This produces good solutions using common genetic operators. Thus the algorithm is fast, in that it does not require a recovery procedure to produce feasible chromosome encodings. Moreover, the use of different completing algorithms allows the genetic algorithm to choose and use the algorithm that best matches each single high-level synthesis step.

B. Cost function

The fitness function is multi-objective and is defined as follows:

F(x) = [f1(x), f2(x)]^T = [Area(x), Time(x)]^T   (1)

where Area(x) is an estimation of the area occupied by solution x, computed using the area model, and Time(x) is an estimation of the worst-case execution time of solution x, computed as the longest path in the scheduled CDG+DFG data structure. The goal of the genetic algorithm is to find the solutions that minimize this cost function. Since the presented methodology requires a huge number of area evaluations, it is not possible to completely synthesize each generated solution on the target device. To reduce the exploration time without reducing the number of visited solutions, we chose to adapt already existing area models for FPGA designs to generate the required values [25]. The area model we used for fast estimation is shown in Fig. 4. For each architecture A, the model divides the area into two main parts: the flip-flop (FF) part and the LUT part. The flip-flop part is easy to estimate, since it is composed of the data registers used in the datapath and of the flip-flops used for the state encoding of the controller FSM; the LUT part is a little more complex. Four main parts contribute to the global area in terms of LUTs: FU, FSM, MUX and Glue. The FU part corresponds to the contribution of the functional units, so its value is the sum of the area values of each functional unit.

The other three parts (FSM, MUX, Glue) are obtained by using a regression-based approach:
• the FSM contribution is due to the combinatorial logic used to compute the output and the next state;
• the MUX contribution is due to the number and size of the multiplexers used in the datapath;
• the Glue contribution is due to the logic that enables writing into the flip-flops and to the logic used for the interaction between the controller and the datapath.
The coefficient extraction and the model validation have been done using two different sets of applications. Fig. 5 shows the model validation on the set of benchmarks presented in [10], with an average error of 3.7% and a maximum error of 14%.

TABLE II
SYNTHESIS RESULTS FOR DIFFERENT ARCHITECTURES ON THE kim BENCHMARK

           # Adders   # Subtractors   # Noteq comp   # Control steps   FU LUT (%)   MUXs LUT (%)   # Total LUT
ARCH-1        1             1               1                9             5.98         55.83           1338
ARCH-2        1             2               1                9            10.04         57.67           1115

Fig. 4. Model used to estimate the area occupation of the FPGA design:

#FF_FSM = ⌈log2(A.FSM.NumControlStates)⌉
#FF_DataPath = Σ_{R ∈ A.DataPath.Registers} sizeof(R)
A.Area.FF = #FF_FSM + #FF_DataPath
#LUT_FSM = ⌈1.99 · A.FSM.NumControlStates − 0.24 · A.FSM.Inputs + 1.50 · A.FSM.Outputs − 9.97⌉
#LUT_FU = Σ_{F ∈ A.DataPath.FunctionalUnits} F.Area
#LUT_MUX = Σ_{M ∈ A.DataPath.Mux} ⌈0.59 · M.Input − 0.3 · sizeof(M)⌉
#LUT_Glue = ⌈0.7 · #LUT_FSM + 9.99 · A.DataPath.NumRegisters⌉
A.Area.LUT = #LUT_FSM + #LUT_FU + #LUT_MUX + #LUT_Glue

Fig. 5. Validation of the FPGA area model (real area vs. estimated area, in CLBs).
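The Fig. 4 model translates directly into code. The sketch below uses the paper's coefficients but hypothetical C++ structure names; the F.Area values would come from a characterized resource library such as Table I.

    #include <cmath>
    #include <vector>

    // Transcription of the Fig. 4 area model. The coefficients are the paper's;
    // the structure names are ours.
    struct Fsm { int numControlStates, inputs, outputs; };
    struct Reg { int width; };                 // sizeof(R): register bit width
    struct FunctionalUnit { int areaLut; };    // F.Area, in LUTs, from the library
    struct Mux { int inputs, width; };         // M.Input and sizeof(M)

    struct Architecture {
        Fsm fsm;
        std::vector<Reg> registers;
        std::vector<FunctionalUnit> fus;
        std::vector<Mux> muxes;
    };

    int estimateFF(const Architecture& a) {
        int ffFsm = (int)std::ceil(std::log2((double)a.fsm.numControlStates));
        int ffDataPath = 0;
        for (const Reg& r : a.registers) ffDataPath += r.width;
        return ffFsm + ffDataPath;             // A.Area.FF
    }

    int estimateLUT(const Architecture& a) {
        int lutFsm = (int)std::ceil(1.99 * a.fsm.numControlStates
                                    - 0.24 * a.fsm.inputs
                                    + 1.50 * a.fsm.outputs - 9.97);
        int lutFu = 0;
        for (const FunctionalUnit& f : a.fus) lutFu += f.areaLut;
        int lutMux = 0;
        for (const Mux& m : a.muxes)
            lutMux += (int)std::ceil(0.59 * m.inputs - 0.3 * m.width);
        int lutGlue = (int)std::ceil(0.7 * lutFsm
                                     + 9.99 * (double)a.registers.size());
        return lutFsm + lutFu + lutMux + lutGlue;  // A.Area.LUT
    }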

C. Initial population

At the beginning of each run, an initial population of admissible resource bindings is created, either by random generation or by starting from a first admissible binding. The latter allows the algorithm to start from some interesting points (e.g. the minimum number of functional units or the minimum latency) and then to explore around them; the rest of the population is created by random mutation of these predefined individuals. A possible seeding routine is sketched below.
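This sketch reuses the hypothetical Chromosome type introduced above; the 20% perturbation rate is our assumption, not the paper's. admissibleFUs[i] lists the functional-unit instances able to execute operation i (assumed non-empty).

    #include <random>
    #include <vector>

    // Seed the population with one admissible binding, then derive the others
    // by random mutation of its genes.
    std::vector<Chromosome> initialPopulation(
            const Chromosome& seed, std::size_t size,
            const std::vector<std::vector<int>>& admissibleFUs,
            std::mt19937& rng) {
        std::vector<Chromosome> pop{seed};
        std::bernoulli_distribution perturb(0.2); // perturbation rate: our choice
        while (pop.size() < size) {
            Chromosome c = seed;
            for (std::size_t op = 0; op < c.binding.size(); ++op)
                if (perturb(rng)) {
                    const std::vector<int>& fus = admissibleFUs[op];
                    std::uniform_int_distribution<std::size_t> pick(0, fus.size() - 1);
                    c.binding[op] = fus[pick(rng)];
                }
            pop.push_back(c);
        }
        return pop;
    }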


D. Ranking and Selection

Solutions are sorted into different levels according to their fitness values. The idea is that a ranking selection method can emphasize good solution points. The non-dominated solutions are classified into the first level; then they are removed from consideration and a new classification is performed among the remaining ones. The ranking has been accelerated using the fast non-dominated sorting algorithm of NSGA-II.

E. Genetic operators



To explore and exploit the design space, the usual genetic operators are used: unary mutation and binary crossover. The two operators are applied iteratively according to their probabilities. Crossover is a reproduction technique that mates two parent chromosomes and produces two child chromosomes. Given two chromosomes, uniform crossover is applied with a high probability (Pc = 70%). The crossover mechanism can mix solution bindings, but it can also mix the genes that select the algorithms with which the synthesis phases are computed. Mutation is an operator used to find new points in the search space, and it has been implemented with a relatively low rate (Pm = 0.10%). A mutated gene is changed according to its admissible values: if the selected gene is related to an operation, mutation results in a new binding for that operation; if it corresponds to a completing algorithm among scheduling, register allocation and interconnection binding, mutation changes the algorithm used to solve the corresponding synthesis step. Both operators are sketched below.
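A sketch of the two operators on the encoding above; the probabilities Pc and Pm come from the text, while the helper names and the admissible-set representation are our assumptions.

    #include <random>
    #include <utility>
    #include <vector>

    // Uniform crossover (applied with probability Pc = 70%): each gene is
    // swapped between the two parents with probability 1/2; the algorithm
    // genes are exchanged too. Both chromosomes encode the same specification,
    // so their binding vectors have the same length.
    void uniformCrossover(Chromosome& a, Chromosome& b, std::mt19937& rng) {
        std::bernoulli_distribution swapGene(0.5);
        for (std::size_t i = 0; i < a.binding.size(); ++i)
            if (swapGene(rng)) std::swap(a.binding[i], b.binding[i]);
        if (swapGene(rng)) std::swap(a.schedA, b.schedA);
        if (swapGene(rng)) std::swap(a.regA, b.regA);
        if (swapGene(rng)) std::swap(a.connA, b.connA);
    }

    // Gene-wise mutation (rate Pm): a binding gene is redrawn from the
    // admissible FU set of its operation, so the result is always feasible.
    void mutate(Chromosome& c, double pm,
                const std::vector<std::vector<int>>& admissibleFUs,
                std::mt19937& rng) {
        std::bernoulli_distribution hit(pm);
        for (std::size_t op = 0; op < c.binding.size(); ++op)
            if (hit(rng)) {
                const std::vector<int>& fus = admissibleFUs[op];
                std::uniform_int_distribution<std::size_t> pick(0, fus.size() - 1);
                c.binding[op] = fus[pick(rng)];
            }
        // an algorithm gene (schedA, regA, connA) would similarly be redrawn
        // from the set of available completing algorithms
    }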

F. Elitism

An important feature of NSGA-II is the elitist mechanism that preserves diversity within the population. In fact, besides convergence to a Pareto-optimal set, it is also desirable that the genetic algorithm maintains a good diversity of the obtained solutions. NSGA-II therefore implements a crowded-comparison operator based on a density estimation: from this estimation, a new value is associated with each solution, the crowding distance, an estimate of the perimeter of the cuboid formed by its nearest neighbors. Each solution thus has two associated values: a non-domination rank and a crowding distance. This allows the algorithm to rank solutions also inside a non-dominated level: if two solutions have the same rank, they belong to the same front, and the solution located in the less crowded region is preferred by the selection process.

G. Constraint handling

A set of constraints can be imposed on the high-level synthesis flow: area constraints and time constraints. Area constraints are related to the maximum area available on the device, so the constraint for a solution x can be defined as:

Area(x) ≤ Area_max   (2)

where Area(x) is the total area cost, as defined in equation (1). Area_max → ∞ means that the area has no upper bound (a resource-unconstrained design with infinite resources). The execution time constraint represents the maximum execution time allowed; the constraint for a solution x is defined as:

Time(x) ≤ Time_max   (3)

where Time(x) is the execution time, as defined in equation (1). Time_max → ∞ means that the execution time has no upper bound (a time-unconstrained design). In the presence of constraints, solutions can be feasible or infeasible; they are infeasible if an objective function violates a constraint. A different definition of domination then has to be considered to compare solutions, and the NSGA-II definition of domination between solutions i and j has been applied:

Definition 1: A solution i is said to constrain-dominate a solution j if any of the following conditions is true:
1) solution i is feasible and solution j is not;
2) solutions i and j are both infeasible, but solution i has a smaller overall constraint violation;
3) solution i dominates solution j (its cost function, as defined by eq. (1), is smaller than that of j).

The effect of using this definition is that a feasible solution has a better non-domination rank than any infeasible solution. All feasible solutions are thus ranked by their objective function values, while among the infeasible ones the solution with the smaller violation has the better rank. This is useful because an infeasible solution that violates a constraint only marginally is classified in a better non-dominated level, so the algorithm searches quickly through the infeasible region before reaching the feasible one. A sketch of this comparison is given below.
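Definition 1 maps directly onto a comparison function. In the sketch below, Solution is a hypothetical record holding the two objectives of eq. (1) and an aggregate constraint violation.

    // Hypothetical record for a candidate design: the two objectives of
    // eq. (1) plus the aggregate constraint violation (0 when feasible),
    // e.g. max(0, Area - Area_max) + max(0, Time - Time_max).
    struct Solution {
        double area, time;   // f1(x), f2(x)
        double violation;    // 0 if feasible, > 0 otherwise
    };

    // Plain Pareto dominance on (area, time), both minimized.
    bool dominates(const Solution& i, const Solution& j) {
        return i.area <= j.area && i.time <= j.time &&
               (i.area < j.area || i.time < j.time);
    }

    // Definition 1: constrain-domination as used by NSGA-II.
    bool constrainDominates(const Solution& i, const Solution& j) {
        const bool iFeasible = (i.violation == 0.0);
        const bool jFeasible = (j.violation == 0.0);
        if (iFeasible != jFeasible) return iFeasible;      // 1) feasibility first
        if (!iFeasible) return i.violation < j.violation;  // 2) smaller violation
        return dominates(i, j);                            // 3) usual domination
    }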

H. Overview of the Algorithm

The information on the behavioral specification is obtained from the parser and represented as a CDG+DFG data structure. The NSGA-II algorithm starts by creating a first resource allocation and binding (as described in Section IV-C). This binding defines the constraints for the following scheduling phase, performed using the algorithm selected by the associated gene. The solution is finally generated by applying the rest of the high-level synthesis flow (e.g. FSM creation, register allocation and interconnection allocation) following the gene encoding (see Section IV-A). At this point, estimations of the area and execution time are computed (see Section IV-B). Once the fitness function values are obtained, the solutions are sorted into non-dominated fronts (see Section IV-D), the crowding distance is computed (see Section IV-F) and the solutions are also sorted inside each front by this distance. The first parent population is now ready. Tournament selection is then performed to choose the better elements for crossover, and random selection is used to choose individuals for mutation. The operators are applied to create a new child population. The offspring population is then added to the parent one and the resulting set is non-dominated sorted. In this way, parent solutions are ranked together with offspring ones and good solutions can be maintained. The process continues as described above until termination.

V. CASE STUDY: THE JPEG ENCODER

We implemented the described methodology in a C++ high-level synthesis framework, integrated into the PandA [26] environment (our open-source framework covering different aspects of the hardware-software design of embedded systems). Evolutionary computation has been provided using the Open BEAGLE framework [27], a high-level software environment for performing a large set of evolutionary computations, with support for different genetic algorithms and encoding features. The proposed approach has been used to perform high-level synthesis of two steps of the baseline JPEG compression algorithm with Huffman encoding for a Xilinx MicroBlaze System-on-Chip architecture implemented on a Virtex II-PRO XC2VP4 FPGA device. The architecture presents a single MicroBlaze v.5.00.c softcore processor, connected through a standard On-Chip Peripheral Bus (OPB) to a UART peripheral, an external memory controller and a SysAce device controller that allows reading and writing a Compact Flash memory card. The software application implements the baseline JPEG compression algorithm with Huffman encoding and is composed of six phases: (i) original image (.PPM format) reading, (ii) RGB to YUV color space conversion (for color images), (iii) expansion and downsampling, (iv) quantization table setting, (v) bi-dimensional Discrete Cosine Transform (2D-DCT), quantization and zig-zag reordering, (vi) entropic coding and file saving. It should be noted that the standard JPEG implementation applies the 2D-DCT to blocks of 64 pixels (8x8). Table III shows the total execution clock cycles of the JPEG compression algorithm on a 160x120 pixel image. The most computationally intensive phases are the RGB to YUV conversion and the 2D-DCT steps, so we chose these two parts as input to the proposed high-level synthesis flow. A population size of 1,000 individuals was used by the genetic algorithm, evolving up to a maximum of 200 generations. The RGB to YUV conversion performs an independent elaboration for each pixel of the image, so the loop that applies the conversion equations to each pixel can be unrolled to parallelize and speed up this phase.

Fig. 6. Pareto points generated applying the proposed methodology to the RGB2YUV phase (latency in cycles vs. area in CLBs).

Fig. 7. Pareto points generated applying the proposed methodology to the DCT phase (latency in cycles vs. area in CLBs).

The 2D-DCT, instead, can be decomposed into two mono-dimensional transforms. Our software application, in fact, adopts this decomposition: it executes a mono-dimensional DCT on the rows of each 8x8 block and then executes the same mono-dimensional code on the columns of the block resulting from the first transform. So, before applying our high-level synthesis flow, we unrolled the RGB to YUV conversion routine by a factor of four, while we took the mono-dimensional DCT routine without any modification. The Xilinx XC2VP4 FPGA is composed of 6,768 configurable logic blocks (CLBs). The base architecture without any accelerators accounts for 2,665 logic elements; the remaining 4,103 CLBs can be used to implement the RGB to YUV and DCT hardware functions. This is, in fact, the area constraint we considered to choose the best-fit architecture among the solutions obtained with the evolutionary exploration. It is worth noting that we decided to connect the synthesized hardware cores to the MicroBlaze architecture with a standard Fast Simplex Link (FSL) connection, which offers blocking and non-blocking message passing communication mechanisms. It is also possible, with a minimal area overhead due to the more complex interface, to adopt an OPB connection with an interrupt mechanism. Figure 6 shows the results of our methodology applied to the RGB to YUV conversion routine: the Pareto points represent the best results in terms of latency for a given area, measured in number of occupied configurable logic blocks. Figure 7 shows the results of the flow applied to the DCT; it is interesting to note that with 30% more area the algorithm is able to find a solution three times faster. The results show that the algorithm is able to explore both the fastest solution (with an unconstrained number of resources) and the minimal-area solution, while covering a good number of solutions in between.

Fig. 8. Best-fitting architecture for the target XC2VP4 device (total latency in cycles vs. total area in CLBs; the best architecture is highlighted).

Figure 8 shows all the possible final architectures that can be realized by adding to the base architecture the solutions obtained by fully combining all the RGB2YUV and DCT Pareto points (see Figures 6 and 7, respectively) for the 160x120 image. In Figure 8 it is possible to identify the best-fitting architecture found by the algorithm for the target device, which balances the execution latency of both functions given the area remaining on the FPGA after the implementation of the base MicroBlaze architecture; a combination procedure of this kind is sketched below.
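The sketch below is our reconstruction, not the authors' code; in particular, the latencies of the two kernels are combined additively here, which is a simplifying assumption.

    #include <limits>
    #include <vector>

    // Combine every RGB2YUV Pareto point with every DCT Pareto point, keep
    // the combinations fitting the device, and return the fastest one.
    struct ParetoPoint { int areaClb; long long latencyCycles; };

    ParetoPoint bestCombination(const std::vector<ParetoPoint>& rgb2yuv,
                                const std::vector<ParetoPoint>& dct,
                                int baseAreaClb,   // 2,665 CLBs for the base MicroBlaze system
                                int deviceClb) {   // 6,768 CLBs on the XC2VP4
        ParetoPoint best{0, std::numeric_limits<long long>::max()};
        for (const ParetoPoint& a : rgb2yuv)
            for (const ParetoPoint& d : dct) {
                int area = baseAreaClb + a.areaClb + d.areaClb;
                long long latency = a.latencyCycles + d.latencyCycles; // additive: our simplification
                if (area <= deviceClb && latency < best.latencyCycles)
                    best = {area, latency};
            }
        return best;  // latency stays at the sentinel value if nothing fits
    }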

After the synthesis of the hardware accelerators, the new JPEG encoder implementation triples the performance of the baseline software architecture. Performance details about all the phases of the JPEG algorithm are reported in Table III.

TABLE III
CLOCK-CYCLES BREAKDOWN OF THE EXECUTION OF THE JPEG COMPRESSION ALGORITHM ON A SINGLE MICROBLAZE ARCHITECTURE WITH AND WITHOUT THE HARDWARE MODULES SYNTHESIZED WITH OUR EVOLUTIONARY FLOW

Phase               SW               SW/HW
File reading        169,512,465      171,672,577
RGB to YUV          1,598,251,574    1,643,508
Exp & Downsample    678,638          679,306
Set quant. table    60,910           61,027
DCT & Quant         489,865,154      190,905,503
Entropic coding     599,810,406      592,318,457
Total               2,858,179,147    957,280,378

VI. CONCLUSIONS AND FUTURE WORK

In this paper we presented a new flow for high-level synthesis that implements an evolutionary algorithm to perform

multi-objective design space search. The proposed approach adopts the NSGA-II algorithm in order to find Pareto-optimal results in terms of execution latency while constraining the area of the target FPGA design. In order to evaluate each generated solution without performing a complete synthesis, an area estimation model has been integrated. We used the JPEG compression algorithm as a real case study to test this flow. The results show that, given an area constraint, the tool is able to find a best-fit solution for the target device while examining the design space of two different functions at the same time. The most important future work will be the integration of this flow into a reconfigurable design framework. Reconfigurable designs have strict requirements on area constraints: the reconfigurable area is in fact sized for the biggest module it has to allocate. As such, this flow would not only allow finding an optimal solution in terms of execution latency given the reconfigurable area for a single module, but it would also make it possible to find the best solutions for all the modules that will share that area.

REFERENCES

[1] J. Zhu and D. D. Gajski, "A unified formal model of ISA and FSMD," in Proceedings of the Seventh International Workshop on Hardware/Software Codesign (CODES '99), Rome, Italy, 1999, pp. 121-125.
[2] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. New York, NY, USA: W. H. Freeman & Co., 1979.
[3] D. Chen and J. Cong, "Register binding and port assignment for multiplexer optimization," in ASP-DAC '04: Proceedings of the 2004 Conference on Asia South Pacific Design Automation, 2004, pp. 68-73.
[4] E. Zitzler, K. Deb, and L. Thiele, "Comparison of multiobjective evolutionary algorithms: Empirical results," Evolutionary Computation, vol. 8, no. 2, pp. 173-195, 2000.
[5] K. Deb, S. Agrawal, A. Pratab, and T. Meyarivan, "A fast and elitist multi-objective genetic algorithm: NSGA-II," in Proceedings of the Parallel Problem Solving from Nature VI Conference, 2000, pp. 849-858.
[6] Y.-L. Lin, "Recent developments in high-level synthesis," ACM Trans. Des. Autom. Electron. Syst., vol. 2, no. 1, pp. 2-21, 1997.
[7] C.-T. Hwang, J.-H. Lee, and Y.-C. Hsu, "A formal approach to the scheduling problem in high level synthesis," IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 10, no. 4, pp. 464-475, 1991.
[8] C. A. Papachristou and H. Konuk, "A linear program driven scheduling and allocation method followed by an interconnect optimization algorithm," in Proceedings of the 27th ACM/IEEE Design Automation Conference, 1990, pp. 77-83.
[9] M. Rim, R. Jain, and R. D. Leone, "Optimal allocation and binding in high-level synthesis," in DAC '92: Proceedings of the 29th ACM/IEEE Design Automation Conference. Los Alamitos, CA, USA: IEEE Computer Society Press, 1992, pp. 120-123.

[10] R. Cordone, F. Ferrandi, M. D. Santambrogio, G. Palermo, and D. Sciuto, "Using speculative computation and parallelizing techniques to improve scheduling of control based designs," in ASP-DAC '06: Proceedings of the 2006 Conference on Asia South Pacific Design Automation. New York, NY, USA: ACM Press, 2006, pp. 898-904.
[11] L. Stok, "Data path synthesis," Integration, the VLSI Journal, vol. 18, no. 1, pp. 1-71, 1994.
[12] I. P. Radivojevic and F. Brewer, "A new symbolic technique for control-dependent scheduling," IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 15, no. 1, pp. 45-57, 1996.
[13] S. P. Haynal, "Automata-based symbolic scheduling," Ph.D. dissertation, 2000 (chairman: Forrest D. Brewer).
[14] G. Cabodi, A. Kondratyev, L. Lavagno, S. Nocco, S. Quer, and Y. Watanabe, "A BMC-based formulation for the scheduling problem of hardware systems," STTT, vol. 7, no. 2, pp. 102-117, 2005.
[15] G. Lakshminarayana, K. S. Khouri, and N. K. Jha, "Wavesched: a novel scheduling technique for control-flow intensive behavioral descriptions," in ICCAD, 1997, pp. 244-250.
[16] H. Harmanani and R. Saliba, "An evolutionary approach for data path synthesis," in Proceedings of the Canadian Conference on Electrical and Computer Engineering, vol. 1, Mar. 2000, pp. 380-384.
[17] N. Srinivas and K. Deb, "Multiobjective optimization using nondominated sorting in genetic algorithms," Evolutionary Computation, vol. 2, no. 3, pp. 221-248, 1994.
[18] C. Mandal, P. Chakrabarti, and S. Ghose, "Design space exploration for data path synthesis," in Proceedings of the 10th International Conference on VLSI Design, 1996, pp. 166-170.
[19] C. Mandal, P. P. Chakrabarti, and S. Ghose, "GABIND: a GA approach to allocation and binding for the high-level synthesis of data paths," IEEE Trans. Very Large Scale Integr. Syst., vol. 8, no. 6, pp. 747-750, 2000.
[20] G. Grewal, M. O'Cleirigh, and M. Wineberg, "An evolutionary approach to behavioural-level synthesis," in The 2003 Congress on Evolutionary Computation (CEC '03), vol. 1, Dec. 2003, pp. 264-272.
[21] S. G. Araújo, A. C. Mesquita, and A. Pedroza, "Optimized datapath design by evolutionary computation," in Proceedings of the 3rd IEEE International Workshop on System-on-Chip for Real-Time Applications (IWSOC '03), Calgary, Alberta, Canada, June-July 2003, pp. 6-9.
[22] M. Girkar and C. Polychronopoulos, "Automatic extraction of functional parallelism from ordinary programs," IEEE Trans. on Parallel and Distributed Systems, vol. 3, no. 2, pp. 166-178, March 1992.
[23] "GCC - GNU Compiler Collection," http://gcc.gnu.org.
[24] T. Kim, N. Yonezawa, J. W.-S. Liu, and C. L. Liu, "A scheduling algorithm for conditional resource sharing - a hierarchical reduction approach," IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 13, no. 4, pp. 425-438, 1994.
[25] C. Brandolese, W. Fornaciari, and F. Salice, "An area estimation methodology for FPGA based designs at SystemC-level," in ASP-DAC '04: Proceedings of the 2004 Conference on Asia South Pacific Design Automation, 2004, pp. 129-132.
[26] "PandA framework," available at http://trac.elet.polimi.it/panda.
[27] C. Gagné and M. Parizeau, "Open BEAGLE: A new versatile C++ framework for evolutionary computation," 2002, pp. 161-168.