Low Complexity Design Space Exploration from Early Specifications

Sebastien Bilavarn, Guy Gogniat, Jean-Luc Philippe, and Lilian Bossuet
Abstract: Performance evaluation and design space exploration from early specifications remain time consuming, experience and technology dependent issues in the design process. However, the evaluation of architectural alternatives at an early stage of the design process is important because the choices made have a critical impact on the final system characteristics (area, performance, power consumption, flexibility, ...). To address these problems, we propose an original exploration methodology based on area / delay estimations and a case study targeting modern FPGAs. Two main steps compose the estimation flow: (1) a structural estimation step that performs automatic exploration of several RTL architectural solutions and (2) a physical estimation step that performs a technology mapping of the RTL solutions onto the target FPGA. Experiments conducted on Xilinx (VirtexE) and Altera (Apex20K) FPGAs for a 2D Discrete Wavelet Transform and a G722 speech coder lead to an average error of 10% for temporal values and 18% for area estimations, starting from algorithmic descriptions given in the C language. The low complexity of the method allows fast exploration of many design parameters such as parallelism, target device, resource allocation, scheduling and clock period. Thanks to this methodology, the complexity of the design space exploration process is significantly reduced, and a reliable solution meeting both the design constraints (time to market, target FPGA) and the device characteristics (cost, performance) can be reached quickly.

Index Terms: Design space exploration, area / delay estimation, C specification, H/CDFG representation, graph scheduling, architectural synthesis, technology projection, FPGA devices.
I. INTRODUCTION
New cryptography, telecommunication and multimedia applications are expected to be very compute intensive, with an exponentially increasing demand for high performance and security. Hence an efficient exploitation of the parallelism potential, from both an application and an architecture point of view, is essential to reach a satisfying area / delay and power / delay trade-off. Recent developments in the field of Adaptive Computing Systems [1] and Field Programmable Gate Arrays (FPGAs) make reconfigurable hardware an efficient solution for future Systems on Chip (SoC) [2][3]. Reconfigurable devices provide the flexibility, energy reduction and high performance capabilities needed to cope with future design perspectives. The new possibilities introduced in terms of density and performance, dedicated resources for Digital Signal Processing (DSP operators, embedded memory blocks), or flexibility through run-time reconfiguration suggest a better exploitation of vast amounts of instruction-level parallelism [4][5] and coarse-grained parallelism within new generations of applications (adaptive video streaming, software radio, ...). However, the choice of both a suitable parallelism degree for the application and an FPGA device satisfying the physical constraints (area, performance, power consumption) and the marketing ones (final product cost, time to market) is a complex issue often left to the designer's experience.

To ease such design choices, we propose an original Design Space Exploration (DSE) approach located at an early stage of the design process, since this is where the impact on the final system characteristics is the most important. Furthermore, it is crucial for a designer to have a first performance feedback (area / delay, feasibility) while defining the application at the algorithmic level, in order to build a suitable specification. Since the design space can be very large (in case of high parallelism potential), it is also important to define precisely the limits addressed by our approach. There is always a trade-off between the design space coverage and the pertinence of an implementation solution (and the accuracy of the estimations provided). In our case we have clearly focused on the exploration of the parallelism potential (memory bandwidth and computation), which we consider a key point to reach an optimal application / architecture matching. To be less dependent on an implementation technology, we propose a two step flow: the first step defines several architectural solutions (structural estimations) while the second step computes the corresponding area / delay estimates (physical estimations) on which design choices mainly depend. Dependence on a target device and on low level synthesis tools is thus mainly deferred to the second step, and simplified through the use of libraries. This way, application to all FPGA families (including recent devices) has been made possible.

The paper is arranged as follows: Section II reviews prior contributions in the fields of design space exploration and area / delay estimators for FPGAs. Section III focuses on the exploration and estimation flow we defined.

S. Bilavarn is with the Signal Processing Institute, School of Engineering, Swiss Federal Institute of Technology, CH-1015 Lausanne, Switzerland (email: [email protected]).
G. Gogniat, J.-L. Philippe and L. Bossuet are with the Laboratory of Electronic and REal Time Systems (LESTER), University of South Brittany (UBS), Lorient, France (email: [email protected]; [email protected]; [email protected]).
TABLE I
MOST RELEVANT ESTIMATOR TOOLS DEDICATED TO FPGAS
(each entry: estimator input | estimator outputs | design space exploration | complexity | architecture model | FPGA model | accuracy)

Xu and Kurdahi 1996 [11][12]: Netlist | Area, delay (tool optimization) | No | Partitioning + analytical | Datapath, control logic; operator, register | XC4000, LUT, interconnections | %
Enzler et al. 2000 [9]: DFG | Area, delay | Yes, direct: pipelining, replication, decomposition | Analytical | Datapath; operator, register | XC4000E, LUT | %
Nayak et al. 2002 [8]: MATLAB to RTL VHDL | Area, delay (tool optimization) | Yes, iterative compilation: loop unrolling, pipelining | Scheduling + analytical | Datapath, control logic; operator, register | XC4010, LUT, interconnections | %
Bjureus et al. 2002 [13]: MATLAB to DFG | Area, delay | Yes, iterative: stream data rate, device clock speed | Trace + analytical | Datapath; operator, register | ? | %
Kulkarni et al. 2002 [4]: SA-C to DFG | Area (tool optimization) | Yes, iterative compilation: parallelization, loop unrolling | Analytical | Datapath; operator, register | XCV1000, LUT | %
So et al. 2003 [5][16]: C to DFG (loop body) | Area, delay | Yes, iterative compilation: parallelization, loop unrolling, memory bandwidth | Analytical | Datapath, memory bandwidth; operator, register | XCV1000, LUT | %
Authors: C to HCDFG | Area, delay | Yes, direct: parallelization, loop unrolling, memory bandwidth | Scheduling (min-cut, FDS, left-edge) + analytical | Datapath, control logic, memory bandwidth; operator | XCV400, EP20K200; LUT, BRAM, DSP blocks | %
This section provides the necessary information to understand the detailed description of the design flow reported in Sections IV and V. Section IV presents the structural estimation step whereas Section V presents the physical estimation step. Section VI illustrates applications of the approach on several examples in order to exhibit the benefits provided to the designer and the ability to perform easy and efficient DSE. Section VII presents future work and concludes the paper.

II. RELATED WORK
The DSE problem related to FPGA implementation can consist in exploring different RTL architectures for a given application. In that case, the FPGA architecture is fixed and several design choices may be considered, e.g. parallelism, pipelining, replication, resource binding and clock value. This exploration is motivated by the large amount of resources available within the FPGA, which can greatly speed up the execution of an algorithm. This approach can be classified into three subcategories: synthesis [6][7], compilation [8][4][5] and estimation [9]. The third subcategory (estimate and compare) uses estimations to perform DSE: synthesis or compilation steps are replaced by low complexity estimators. After estimation, each architecture is characterized by the performance and device occupation expected when the algorithm is running on the FPGA. As for the second subcategory, synthesis steps are still required to design the RTL architecture once the designer has reached a suitable RTL solution. The work presented in this paper is based on the estimate and compare approach. A detailed presentation of the major contributions in the domain of area and delay estimators is given hereafter; Table I summarizes the main characteristics of each approach described in the following.

A first technique, proposed in [10] by Miller and Owyang, is based on a benchmark library. A set of circuits is implemented and measured on a variety of FPGAs. Area and performance prediction is performed by partitioning the application into several components, which are substituted by the most similar benchmark circuit. The drawback of this approach is related to the difficult task of maintaining the library for different devices and applications.

Another methodology, described in [11][12] by Xu and Kurdahi, computes area and delay values from an estimation of the mapping and place & route processes. Starting from a logic level description, they first build a netlist approximation which is then used to compute timing estimations. During the estimation process, wiring effects and logic optimizations are considered to obtain more accurate results. This method does not address the DSE problem and is technology dependent since it is dedicated to the XC4000 family (CLB based architecture).

The method defined by Enzler et al. [9] achieves an estimation from higher abstraction levels (i.e. Data Flow Graphs, DFG). Area and delay are predicted using a combination of algorithm characterization (e.g. number of operations, parallelism degree) and an FPGA mapping representation (operation mapping characteristics in terms of area and delay). DSE is performed by analyzing the improvements introduced when using pipelining, replication and decomposition of the DFG specification. Their estimator targets a XC4000E device (CLB based architecture) and uses an analytical approach. Extension to other architectures is not obvious and may need further developments. Their approach is interesting but the limitation to DFG specifications does not allow the control and multidimensional data overhead to be considered.
The four next methods are based on the compilation of high level specifications (except [13]). Nayak et al. [8] propose an estimation technique operating from a MATLAB specification. Their method computes area and delay estimates for a XC4010 device in a two step approach: they first use the MATCH compiler [14] to perform DSE (e.g. code parallelization, loop unrolling) and to generate RTL code in VHDL; then, estimators are used to predict area and delay values. Area estimation is obtained after a scheduling and a register allocation step in order to define the number and the type of operators used. Delay estimation is based on IP characterization and considers the interconnection cost overhead. They also take into account some synthesis optimizations (through a multiplicative factor) to get more realistic results. Another interesting feature of their approach is that they handle both the datapath and the control logic in the RTL architecture model. The main limitation concerns the implementation of the memory unit and of the control using CLBs, whereas recent FPGA families allow efficient integration of product terms and ROMs into dedicated resources (e.g. Apex Embedded System Blocks [15]).

Bjureus et al. [13] propose a simulation-based technique to derive FPGA occupation rate and delay estimations from MATLAB specifications. During the simulation of the MATLAB model, a trace is generated in order to build an acyclic DFG that contains all the operations to carry out and in which loops are unfolded. Scheduling and binding are then applied using greedy algorithms. The FPGA architecture is represented by a performance model used to retrieve the area and latency information about the different operations in the execution trace; each resource is modeled as a function that maps an operation to an estimated area and delay pair. DSE is performed iteratively, considering several design alternatives like the number of input channels, the bitwidth of the input stream, the device clock speed or the device area. Their approach seems to be dedicated to dataflow specifications dealing with scalar data since they do not consider the memory and control units. In [13] the authors do not indicate which CLB based architecture has been used to build their performance model.

Kulkarni et al. [4] propose an iterative compiler-based method starting from a SA-C code specification. Their approach relies on the compiler's ability to perform extensive transformations for better code efficiency and parallelism exploitation. For each solution produced, a DFG is generated to compute area estimates. The area estimator is based on a mapping of the DFG nodes onto the FPGA architecture, where each node is represented by an approximation formula. This estimation also takes into account some synthesis optimizations to enhance accuracy (e.g. shift operations instead of multiplications by a power of 2), but the method is restricted to loop bodies and does not consider the memory and control overhead, which may represent a critical issue for future applications and standards (typically video / image processing). Moreover, like in [9], area estimation (e.g. FPGA resource occupation) is limited to one kind of resource (Configurable Logic Cells in [9] or the number of Look Up Tables in [4]) and does not take dedicated resources like embedded operators and memories into account. These features must be considered to give realistic evaluations of the final device occupation and optimized performances.
So et al. [5] also propose a compiler based approach. However, compared to [8] and [4] they introduce a new parameter influencing the DSE problem: the memory bandwidth. Their approach is based on the DEFACTO compiler [16], which can successfully identify multiple accesses to the same array location across iterations of multi-dimensional loop nests. This analysis is used to identify opportunities for exploiting parallelism, eliminate unnecessary memory accesses and optimize the mapping of data to external memories. However, they use a simplified FPGA performance model, leading to an estimation accuracy that depends on the complexity of the loop body explored.

As we can notice, most studies deal with a DFG and perform iterative DSE (Table I). Several works consider some design optimizations during the estimation process, but these are mainly applied through a corrective factor. Usually, only the datapath is considered, even if some methods take the memory and / or control unit into consideration. In most cases, those methods do not consider enough design parameters to achieve a complete characterization of the architecture in terms of processing, control and memory. As an illustration, the best accuracy is achieved in [4] and reaches a value of 5%, but the estimator only deals with area estimates of the datapath and starts from a very low abstraction level (logic level). Finally, most works consider a single RTL architecture, except [9] and [8] which propose several implementation models (e.g. pipelining, replication). In this respect, Choi et al. [17] proposed a very interesting contribution based on the definition of efficient performance models for specific application domains. Thanks to this, they are able to perform a wide DSE and take into account the main features of the algorithms to be evaluated. The main limitation is related to the difficulty of defining the performance model, which requires expertise in the application domain.

Compared to previous work, the advances of our approach can be summarized as follows. We are not limited to the implementation of basic blocks or loop kernels, as we consider a realistic and complete representation including control structures, arrays and computations (thanks to the use of the H/CDFG). We perform area and delay estimation for a particular implementation model composed of a control unit, a memory unit and a datapath (to our knowledge, no current work considers such a complete characterization). This point is very interesting since a designer can obtain a realistic performance evaluation of an application very quickly. However, the basic RTL model used is not totally satisfying with respect to all the optimizations an expert designer could apply: we do not consider computation pipelining (except for loop kernels), memory sharing and memory pipelining. Compared to other works, our estimators take into account both dedicated resources (embedded memories, DSP blocks) and LUTs. However, like most other contributions, we need to maintain a library of operators for each new FPGA, which can be considered a limitation, unless it is seen as a first step towards tool and technology independence. The automatic DSE methodology proposed exhibits the memory and computation parallelism potential; this point is original since most works only consider the computation parallelism potential and perform iterative DSE. In conclusion, the main advance of our approach resides in the fact that we consider an entire application and a complete RTL architecture model, and that we perform an automatic DSE of key design parameters.
Fig. 1. Exploration / Estimation Flow: a two step approach composed of 1) the structural definition of several solutions (i.e. at the RT level) and 2) the estimation of the physical characteristics of each solution (i.e. FPGA occupation rate vs. algorithm execution time).
III. EXPLORATION & ESTIMATION FLOW
We first present a general overview of the exploration approach [18][19], which is based on two main steps: a structural estimation step and a physical estimation step.

The first step performs an exploration at the RT level where several solutions are defined in terms of resource selection and scheduling. To achieve accuracy, a realistic analysis based on processing, control and memory characterization is performed through the following parameters.
For the processing unit (datapath):
– number of different operators / execution units;
– number of operators of each type instantiated;
– bitwidth of each operator;
– number of registers.
For the control unit:
– number of control states;
– number of control signals.
For the memory unit:
– total memory size;
– number of simultaneous reads from the RAMs;
– number of simultaneous writes to the RAMs;
– number of simultaneous reads from the ROMs.
Fig. 1 illustrates the 2D graphical view used to exhibit the results: a solution is characterized by a given execution time on the horizontal axis while the vertical axis exhibits the above characterization.

The second step, called physical estimation, computes the physical area / time trade-offs of each RTL solution. Device characteristics such as operator / memory area and delay are described in libraries and used to compute accurate estimations of:
– the algorithm execution time for this device;
– the number of FPGA resources used, including specific resources like dedicated operators / embedded memories;
– the corresponding FPGA occupation rate.
The FPGA resources considered here are logic cells (e.g. slices, logic elements), dedicated cells (e.g. embedded memories, DSP operators), I/O pads and tristate buffers. Here again, results are gathered in a 2D graphical representation where each solution
corresponds to a given execution time value on the horizontal axis. The vertical axis represents the number of FPGA resources of each type used. At the end of the DSE process, several implementation alternatives are evaluated through different area vs. delay trade-offs. Additional information concerning the inputs of the methodology is introduced in the following: we describe the representation model used as intermediate representation, which is derived from the C specification (H/CDFG), and the RTL architecture model on which the mapping process is based.

A. From C to H/CDFG model

The system specification is given in a subset of the C language. The use of a software high level language for the purpose of hardware implementation imposes some restrictions: only a basic subset of the C language is used, namely control structures (conditional / iterative structures), basic data types, arrays and function calls. Complex constructs like pointers, records or dynamic memory allocation are not accepted. The C code is then parsed into a Hierarchical Control and Data Flow Graph (H/CDFG) that has been defined specifically for exploration & estimation usage [20][21]. This model is composed of three types of elementary (i.e. non hierarchical) nodes: processing, memory and conditional nodes. A processing node represents an arithmetic or logic operation. A memory node represents a data transfer. A conditional node represents a test / branch operation (e.g. if, case, loops). A Data Flow Graph (DFG) is a graph containing only elementary memory and processing nodes; it represents a sequence of non conditional instructions in the source code. For example, the graph on the right of Fig. 2 is a DFG. A CDFG is a graph representing a conditional or a loop structure with its associated DFGs. The graph in the center of Fig. 2 is an example of CDFG. Finally, a H/CDFG is a graph containing H/CDFGs and CDFGs. A H/CDFG is used to encapsulate the entire application hierarchy, i.e. the nesting of control structures and graphs executed according to sequential or parallel patterns. The graph on the left of Fig. 2 is a H/CDFG.

Key points in the following are the notions of I/O, local and global data, which are used during the exploration step. We make a distinction between several types of memory nodes: global I/O, local I/O, local data and constant:
– global input/output corresponds to input/output data of the entire H/CDFG (whole application);
– local input/output corresponds to input/output data of a graph; this data crosses the hierarchical levels of the H/CDFG representation;
– local data is used to store an internal processing result (inside a DFG); it represents temporary data since it does not cross the hierarchical levels of the H/CDFG;
– constant corresponds to the constant data of a graph.
The creation rules of a H/CDFG from a C function are based on a depth-first search algorithm [22]. A H/CDFG is created each time a conditional node is found in the hierarchy level. When no conditional node remains in the current hierarchy level, a DFG is built.
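For illustration, a small hypothetical function written in the accepted subset (this example is not taken from the paper) could look as follows: basic data types, a static array, a "for" loop and an "if" test, with no pointers, records or dynamic allocation.

```c
/* Hypothetical example of the accepted C subset (basic types, arrays,
 * control structures; no pointers, records or dynamic allocation).        */
void clip_accumulate(int x[16], int y[16], int th)
{
    int i;
    int acc = 0;
    for (i = 0; i < 16; i++) {      /* loop -> conditional node + CDFG     */
        acc = acc + x[i];           /* processing node (addition)          */
        if (acc > th)               /* test -> conditional node            */
            y[i] = th;              /* memory nodes (array accesses)       */
        else
            y[i] = acc;
    }
}
```

In such a function, the array reads and writes become memory nodes, the additions and comparisons become processing nodes, and the loop and test give the conditional nodes around which the CDFG hierarchy is built.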
B. RTL Architecture Model

As pointed out before, processing, control and memory representations are essential in the representation model to achieve realistic evaluations. Accordingly, architectural assumptions also have to be made for the processing, control and memory units. In the following we expose these choices, which have mainly been defined to cope with FPGA specificities:
– The processing unit is composed of registers, operators and multiplexers. A general bus-based architecture has been preferred to a general register-based architecture since this solution minimizes the number of interconnections between resources of the processing unit (i.e. reduction of the number of buses). Indeed, each operator is supposed to be connected to one intra-CLB register. As a consequence, register locality enables independent operator execution and thus algorithm parallelization.
– Concerning the control unit, Moore Finite State Machines (FSMs) are considered. They are supposed to be implemented using microcode ROMs since recent devices enable efficient integration using dedicated resources. This keeps the logic cells available for the processing unit.
– The memory unit is composed of one or several RAM / ROM memories. ROMs are dedicated to constant storage. Concerning RAMs, simple dual port memories are considered since embedded memories are based on their use.
The choices presented above imply some design permissions / restrictions through the different architectural assumptions. Nevertheless, this underlying implementation model is required to promote the definition of realistic solutions. It also reduces the gap between the estimated area and execution time values (results of the physical estimation step) and the effective design characteristics (results of the physical synthesis). Other models could have been defined; in the following, we will focus on the one of Fig. 3 and refer to it as our architecture template. The possibility of changing / adapting the design procedure will be addressed in the result discussion (section VI-D).
Fig. 2. Elements of a H/CDFG
IV. STRUCTURAL ESTIMATION
The structural estimation step performs an exploration at the RT level based on the implementation model of Fig. 3. Compared to a typical DSE flow, our approach is not iterative: when the structural estimation ends, the entire application parallelism has been explored. Each RTL solution corresponds to a parallelism degree characterized, for a given cycle budget Nc, by the number and the type of resources required to execute the application within that time constraint. The exploration ranges from the most sequential execution (when only one resource of each type is allocated) to the most parallel execution (the maximum number of resources it is possible to allocate, which corresponds to the critical path). The definition of several architectural solutions starts with the scheduling of DFGs (section IV-C). Then analytical heuristics are used to deal with control patterns (section IV-D, CDFG estimation) and application hierarchy (section IV-E.3, H/CDFG combination). Progressive combinations using a bottom-up approach allow an efficient schedule of the entire graph to be reached at low complexity cost. During this exploration step, both processing resources and memory bandwidth are considered through the following parameters:
– the number and the type of execution units (Nop for each type);
– the number of simultaneous reads (Nram_rd) and writes (Nram_wr) from / to the RAMs, and the number of simultaneous reads from the ROMs (Nrom);
– the number of control states (Ns).
Those parameters fully characterize each RTL solution for a given cycle budget Nc in terms of processing (number of execution units), control (number of control steps) and memory (number of memory accesses). Their computation is divided into five steps: Pre-estimation, Selection, DFG Scheduling, CDFG Estimation and H/CDFG Combination (Fig. 4).
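A minimal sketch of the characterization attached to one RTL solution, with illustrative field names following the Nc / Nop / Ns notation of Fig. 5, could be:

```c
/* Sketch of the structural characterization of one RTL solution
 * (illustrative names, following the Nc / Nop / Ns notation of Fig. 5).   */
#define MAX_UNIT_TYPES 8

typedef struct {
    int nc;                     /* cycle budget of the solution             */
    int nop[MAX_UNIT_TYPES];    /* execution units of each selected type    */
    int nram_rd;                /* simultaneous RAM read accesses           */
    int nram_wr;                /* simultaneous RAM write accesses          */
    int nrom;                   /* simultaneous ROM read accesses           */
    int ns;                     /* control states                           */
} RtlSolution;
```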
Fig. 3. RTL Architectural Model
A. Pre-estimation

The pre-estimation process checks whether a given FPGA device is suited or not for the application under analysis in terms of I/O pads and memory resources. During that process, the set of data nodes labeled as global I/O in the H/CDFG representation is compared to the actual number of I/O pads in the target device. This set is derived from the number of formal parameters used in the C function. From the number of data in the set and their corresponding bitwidths, the number of I/O pads required is computed and compared to the number of I/O pads available in the device. This information is described in the FPGA Architecture and Performance Model (section V-A). The estimation of the total RAM size is computed from the data node sets of the H/CDFG; the technique used is the one proposed by Grun et al. [23], which computes fast and reliable RAM size estimations from high level specifications. The ROM size is derived from the number of elements in the set of constants. Once computed, the RAM and ROM sizes are compared to the amount of memory resources available in the FPGA. If the I/O pad and memory size conditions are satisfied, the next step of the flow is performed; otherwise the designer has to select another device for evaluation.
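As a minimal sketch (illustrative names; the actual values come from the H/CDFG data sets and from the FPGA Architecture and Performance Model), the feasibility test boils down to a few comparisons:

```c
/* Hedged sketch of the pre-estimation feasibility test (illustrative
 * names); a device is rejected as soon as one resource class is exceeded.  */
#include <stdbool.h>

typedef struct { int io_pads; long ram_bits; long rom_bits; } DeviceModel;

static bool device_is_suited(int io_pads_needed, long ram_bits_needed,
                             long rom_bits_needed, DeviceModel dev)
{
    if (io_pads_needed  > dev.io_pads)  return false;
    if (ram_bits_needed > dev.ram_bits) return false;
    if (rom_bits_needed > dev.rom_bits) return false;
    return true;
}
```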
B. Selection of Execution Units

First, the H/CDFG representation is analyzed using a depth-first search algorithm [22] in order to list all the operations to carry out. Then, the selection of execution units is made using a greedy algorithm. Available operators are described and classified in a technology file called the FPGA Architecture and Performance Model. For each operation, the first execution unit supporting the operation found in the technology file is selected. Although such an approach may not always give the best solution, it permits a first and quick evaluation. However, given the limitations of the greedy algorithm, the designer also has the possibility to perform a manual selection in order to choose the most interesting execution units based on his or her experience. This approach can be considered to refine the results obtained from a first greedy pass. Once the execution units are selected, a clock value can be set. As previously, the designer has the possibility to choose a clock value manually; otherwise it is set to the propagation delay of the slowest execution unit. An example of manual clock period exploration is given in section VI-C.
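A minimal sketch of this greedy selection and of the default clock choice is given below (types and names are illustrative, not taken from the Design Trotter sources):

```c
/* Hedged sketch: the first matching unit of the technology file is selected,
 * and the clock period defaults to the delay of the slowest selected unit. */
#include <string.h>

typedef struct { const char *op; double delay_ns; int area; } Unit;

/* index of the first library unit implementing 'op', or -1 if none exists  */
static int select_unit(const Unit *lib, int nlib, const char *op)
{
    int i;
    for (i = 0; i < nlib; i++)
        if (strcmp(lib[i].op, op) == 0)
            return i;
    return -1;
}

/* default clock period: propagation delay of the slowest selected unit     */
static double default_clock(const Unit *lib, const int *selected, int nsel)
{
    double t = 0.0;
    int i;
    for (i = 0; i < nsel; i++)
        if (lib[selected[i]].delay_ns > t)
            t = lib[selected[i]].delay_ns;
    return t;
}
```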
Fig. 4. Structural Estimation Flow
C. DFG Scheduling

Once execution units are selected and a clock value is set, a scheduling step is applicable to each DFG of the graph. In our case, this step is performed for several time constraints, from the most parallel solution (the fastest execution) to the most sequential one (the slowest execution), as illustrated in Fig. 5. The entire DFG parallelism is explored this way; each parallelism solution corresponds to a value on the horizontal axis of the 2D representation. The scheduling algorithm is used with the goal of minimizing the number of execution units and the bandwidth requirements for a precise time constraint. The scheduling principle is a time-constrained list-scheduling heuristic extended to deal with both processing and memory nodes [24]. Its use allows resource sharing within a DFG (operators) to be taken into account. In the example of Fig. 5, three solutions representing different resources / clock cycles trade-offs are provided. To implement the first solution (the most parallel one), the processing unit must include 2 adders and 2 multipliers, and the memory unit must enable 2 simultaneous read accesses (and 1 write access). Note here that only the I/O of the graph and the constants are mapped to the memories, since internal data is assigned to registers (there is no register allocation, as explained in section III-B). Finally, the control unit requires 5 states to manage the scheduling of this DFG.

Once a DFG has been scheduled, it is replaced in the H/CDFG representation by its estimation results (resources vs. clock cycles curve). Then, a progressive combination of the curves is applied using a bottom-up approach and leads to the estimation results of the entire specification. The combination rules that are used depend on the type of dependence between the parent graphs (Fig. 6). A H/CDFG may be composed of four types of dependences: two of them correspond to control dependences (conditional and loop structures) and the two others correspond to execution dependences (i.e. sequential and parallel execution).
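The exact heuristic used in the tool is the time-constrained list scheduler of [24], which also handles memory nodes. The following much-simplified sketch (unit latencies, two operator types, no memory accesses) instead sweeps a plain resource-constrained list scheduler over decreasing allocations; it only illustrates how each allocation yields one point of the resources vs. clock cycles curve of Fig. 5.

```c
/* Simplified illustration (not the tool's algorithm): a resource-constrained
 * list scheduler with unit latencies, swept over decreasing allocations to
 * produce (resources, cycles) points of a trade-off curve.                  */
#include <stdio.h>

#define NOPS   6
#define NTYPES 2                        /* 0 = add, 1 = mult (illustrative)  */

typedef struct { int type; int npred; int pred[4]; } Node;

static int list_schedule(const Node *g, int n, const int *alloc)
{
    int done[NOPS] = {0}, finish[NOPS] = {0};
    int scheduled = 0, cycle = 0, i, p;

    while (scheduled < n) {
        int used[NTYPES] = {0};
        cycle++;
        for (i = 0; i < n; i++) {
            int ready;
            if (done[i]) continue;
            ready = 1;
            for (p = 0; p < g[i].npred; p++)       /* predecessors must have */
                if (!done[g[i].pred[p]] ||         /* finished in an earlier */
                    finish[g[i].pred[p]] >= cycle) /* cycle                  */
                    ready = 0;
            if (ready && used[g[i].type] < alloc[g[i].type]) {
                used[g[i].type]++;
                done[i] = 1;
                finish[i] = cycle;
                scheduled++;
            }
        }
    }
    return cycle;                                  /* Nc for this allocation */
}

int main(void)
{
    /* small DFG in the spirit of Fig. 5: two multiplications feeding adds   */
    Node g[NOPS] = {
        {1, 0, {0}}, {1, 0, {0}},                  /* 0,1: mult              */
        {0, 1, {0}}, {0, 1, {1}},                  /* 2,3: add               */
        {0, 2, {2, 3}},                            /* 4:   add(2,3)          */
        {1, 1, {4}}                                /* 5:   mult(4)           */
    };
    int nadd, nmult;

    for (nadd = 2; nadd >= 1; nadd--)
        for (nmult = 2; nmult >= 1; nmult--) {
            int alloc[NTYPES] = { nadd, nmult };
            printf("Nadd=%d Nmult=%d -> Nc=%d\n",
                   nadd, nmult, list_schedule(g, NOPS, alloc));
        }
    return 0;
}
```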
Fig. 5. DFG scheduling example: three schedules of the same DFG are shown. Parallel solution: Nc=5, Nmult=2, Nadd=2, Nram_rd=2, Nram_wr=1, Nrom=1, Ns=5. Intermediate solution: Nc=6, Nmult=1, Nadd=1, Nram_rd=2, Nram_wr=1, Nrom=1, Ns=6. Sequential solution: Nc=7, Nmult=1, Nadd=1, Nram_rd=1, Nram_wr=1, Nrom=1, Ns=7.
Fig. 6. Combination schemes: conditional structure, loop structure, sequential and parallel execution.
For each type of dependence, the estimation curves are combined according to the equations presented in the following.

D. CDFG estimation

This section explains how control constructs (i.e. CDFGs) are combined. DFG cores included in CDFGs are processed as explained previously. Then we have to deal with two types of control: conditional structures (tests) and loop structures (iterations).

1) CDFG conditional structures: Conditional structures correspond to CDFGs and are equivalent to an "if" statement in a programming language. An "if" structure is composed of three subgraphs (DFGs), one for the evaluation of the condition and two for the true and false branches (corresponding respectively to the subscripts 0, 1 and 2 in the following equations). Branching probabilities (p1, p2) are used to balance the execution time of each branch; application profiling is used to obtain the branching probability of each execution branch. After scheduling, each DFG is characterized by a number of execution units / number of clock cycles trade-off. In our approach, we first compute the exhaustive combination of all the different solutions and then we eliminate all non optimal results (i.e. we keep only Pareto solutions [25]). Thus the estimation parameters of a solution (characterized by a cycle budget N'c) are obtained as follows:

N'c = Nc,0 + p1 × Nc,1 + p2 × Nc,2
N'op(N'c) = max( Nop,0(Nc,0), Nop,1(Nc,1), Nop,2(Nc,2) )
N's = Ns,0 + Ns,1 + Ns,2 + 1

where Ns,i and Nop,i(Nc,i) represent respectively the number of states and the number of operators / memory accesses for the DFG labeled i scheduled with the cycle budget Nc,i. Execution units and memory accesses are estimated under the assumption of maximum sharing between the two branch DFGs by taking the maximum value (since they are never executed simultaneously). The total number of control states needed is the sum of the number of states of each DFG, plus one for the branching to the first state of the branch to execute (a mux resource is added to perform this branching). The equations above are computed for each possible combination of solutions, i.e. each possible combination of cycle budgets (Nc,0, Nc,1, Nc,2) (exhaustive approach), which results in a new resources vs. clock cycles curve.
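A minimal sketch of this combination rule and of the Pareto filtering, with an illustrative Solution layout and function names, could be:

```c
/* Hedged sketch of the conditional combination and Pareto filtering
 * (illustrative types and names).                                          */
typedef struct { double nc; int nop; int ns; } Solution;

static int imax3(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* combine one point of the evaluation curve (s0) with one point of the true
 * branch (s1) and one point of the false branch (s2); p1, p2 are the
 * branching probabilities obtained by profiling                             */
static Solution combine_if(Solution s0, Solution s1, Solution s2,
                           double p1, double p2)
{
    Solution r;
    r.nc  = s0.nc + p1 * s1.nc + p2 * s2.nc;  /* probability-weighted cycles */
    r.nop = imax3(s0.nop, s1.nop, s2.nop);    /* branches never run together */
    r.ns  = s0.ns + s1.ns + s2.ns + 1;        /* +1 state for the branching  */
    return r;
}

/* keep only the Pareto-optimal (cycles, resources) points of a curve        */
static int pareto_filter(Solution *s, int n)
{
    int i, j, kept = 0;
    for (i = 0; i < n; i++) {
        int dominated = 0;
        for (j = 0; j < n; j++)
            if (j != i && s[j].nc <= s[i].nc && s[j].nop <= s[i].nop &&
                (s[j].nc < s[i].nc || s[j].nop < s[i].nop))
                dominated = 1;
        if (!dominated)
            s[kept++] = s[i];
    }
    return kept;                              /* number of points kept       */
}
```

In practice combine_if would be called for every triple of points of the three curves (exhaustive combination) and pareto_filter applied to the resulting set.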
2) CDFG loop structures: The most common scheme used to estimate a loop structure is based on a sequential execution. This scheme is considered when the number of iterations is not known statically or in case of dependences between iterations (non deterministic execution). In that case, the estimation is computed as follows:
– first, the three subgraphs composing the loop (evaluation, core and evolution) are processed through DFG scheduling;
– then, a sequential combination of the three subgraphs is applied;
– finally, the entire loop structure is estimated by repeating the loop pattern N_iter times (with N_iter the number of iterations).
This last step leads to the equations below.

Sequential execution:

N'c = N_iter × Nc
N'op(N'c) = Nop(Nc)
N's = Ns

where Nc, Nop(Nc) and Ns correspond to the solution resulting from the sequential combination of the three subgraphs.
In this case the execution model considered does not take into account memory and computation pipelining; execution is totally sequential. However, in order to explore the available parallelism more efficiently, we also propose a low complexity technique to unfold loops (partial unrolling and folding execution). In case of a deterministic execution, unfolding loops can exploit the memory and computation parallelism to reduce the critical path. Several parallelism factors are defined in order to explore the intra-loop parallelism potential.
Partial unrolling and folding: the N_iter iterations of the loop kernel are executed α at a time, successive groups of iterations being overlapped, where the parallelism factor α corresponds to the number of parallel executions of the loop kernel and takes a value between 1 and N_iter. Compared to the sequential solution above, the cycle count N'c decreases while the numbers of execution units, memory accesses and control states grow with α. When α is equal to one, this corresponds to a fully pipelined execution of each iteration of the loop core; when α is equal to N_iter, it corresponds to a fully parallel execution of the iterations of the loop kernel. This pipelined and parallel execution of the loop structure reduces the number of execution cycles; consequently, the number of resources increases according to the parallelism factor α.
This approach may not always be applicable, especially in case of dependences (resource, control and data). But it can be very helpful to exhibit the parallelism level that will allow the time constraint to be respected. Then, one can partially unfold the loop specification according to the parallelism factor and apply a full DFG schedule. This way of proceeding is close to optimality, but at the expense of complexity.

E. H/CDFG exploration

This section presents the approach we developed to estimate a complete H/CDFG specification. Once conditional and loop structures are estimated (section IV-D), the two following combination rules, corresponding respectively to the sequential and parallel executions of CDFGs, are applied.

1) H/CDFG sequential execution: The analytical equations translating the combination of two CDFGs that are executed sequentially (labeled 1 and 2 in the following) are obtained under the assumption of maximum reuse. This means that the execution units are shared for the execution of both graphs:
N'c = Nc,1 + Nc,2
N'op(N'c) = max( Nop,1(Nc,1), Nop,2(Nc,2) )
N's = Ns,1 + Ns,2
First an exhaustive combination of the solutions of the two graphs is computed according to the equations above. Then all non optimal solutions are rejected. The same combination scheme is repeated for each type of execution unit and memory access.

2) H/CDFG parallel execution: The estimation results for a parallel execution of two graphs are computed as follows:
N'c = max( Nc,1, Nc,2 )
N'op(N'c) = Nop,1(Nc,1) + Nop,2(Nc,2)
N's = Ns,1 + Ns,2
The execution time is the maximum of both execution times. Execution units and memory accesses are estimated in the worst case: resource sharing between the two graphs is not considered (unlike sequential execution), which may lead to an overhead in some cases. The reason is that considering resource sharing would require the storage and analysis of each operation schedule; we believe the impact on the exploration algorithm complexity is too high with respect to the improvement of the estimation accuracy. Given this, the number of states is the sum of the number of states of each graph since we do not merge FSMs, also for complexity reasons. We suppose that each subpart of the architecture has its own FSM; this results in a hierarchical FSM composed of several microcode ROMs. As in the case of sequential execution, this combination scheme is repeated for each possible combination of solutions and for each type of execution unit and memory access.

3) H/CDFG combination algorithm: The global combination algorithm takes the control and execution dependences into account over the entire H/CDFG. At the beginning, all the nodes of the H/CDFG graph are analyzed with a depth-first search algorithm [22]. When a node is labeled as a composite one (namely a hierarchical node containing subgraphs), a recursive function is called until a DFG is reached (Algorithm 1). All the DFGs are scheduled this way and replaced by their corresponding structural characterization. Then, conditional / loop structures are analyzed using the conditional and loop combination routines (implementing the heuristics of sections IV-D.1 and IV-D.2) until no unprocessed CDFG remains. The execution combination routine is invoked to process all the sequential / parallel structures (heuristics of sections IV-E.1 and IV-E.2). Finally, the global combination algorithm (Algorithm 1) combines recursively all the nodes until a single node representing the entire application remains. These combinations are illustrated on the example of Fig. 6; they lead to the final characterization curve of the entire specification. The last step to complete the exploration / estimation flow is to compute the FPGA occupation vs. execution time estimates (physical estimation step) for each RTL solution provided.
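The two execution-dependence rules can be sketched as follows (illustrative names; the sketch assumes that the execution units of two parallel graphs are simply added, since sharing between them is not considered):

```c
/* Hedged sketch of the sequential and parallel combination rules
 * (illustrative types and names).                                          */
typedef struct { double nc; int nop; int ns; } Solution;

static int imax(int a, int b) { return a > b ? a : b; }

/* sequential execution: execution times add up, execution units are reused
 * (maximum), and the FSMs are kept separate so the states add up           */
static Solution combine_sequential(Solution a, Solution b)
{
    Solution r;
    r.nc  = a.nc + b.nc;
    r.nop = imax(a.nop, b.nop);
    r.ns  = a.ns + b.ns;
    return r;
}

/* parallel execution: the slowest graph fixes the cycle count; resources are
 * not shared between the two graphs; one FSM per subgraph                  */
static Solution combine_parallel(Solution a, Solution b)
{
    Solution r;
    r.nc  = a.nc > b.nc ? a.nc : b.nc;
    r.nop = a.nop + b.nop;
    r.ns  = a.ns + b.ns;
    return r;
}
```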
V. PHYSICAL ESTIMATION
The physical characterization of each solution, i.e. FPGA resource occupation vs. execution time (in physical time units), is computed from its structural characterization. The FPGA resources considered are logic cells, dedicated cells, tristate buffers and I/O pads. A technology file is used to describe a given FPGA device (FPGA Architecture and Performance Model). It contains the following information:
– the number of FPGA resources of each type (i.e. number of LCs, DCs, multiplexers / tristate buffers and I/O resources);
– the area and delay of operators;
– the area and access times of memories.
Alg. 1 H/CDFG exploration algorithm

HCDFG_Expl(graph G)
  n = 0
  for each node of G do
    if the current node is a composite node then
      n = n + 1
      if the composite node is not a result node then
        G' = subgraph of the composite node
        HCDFG_Expl(G')
      end if
    end if
  end for
  t = type of the father of G
  if t = "if" then
    apply the conditional combination to G
  else if t = "for" then
    apply the loop combination to G
  else if n > 1 then
    apply execution_combination to G
  end if

execution_combination(graph G)
  initialize the list L of current nodes with the root nodes of G
  while |L| > 0 do
    L = successors(L)
    for each node of L do
      parallel combination of its predecessors, 2 by 2
      sequential combination of the current node and the predecessor node
    end for
  end while
TABLE II
VIRTEX V400EPQ240-7 FPGA CHARACTERIZATION

FPGA characteristics: # LC (slices): 4800; # DC (BRAM): 40; # Tri: 4800; # I/O: 48.

Execution units (op | impl | latency | area | bitwidth | ctl lines):
adder     | LC  | 4.9 ns  | 4  | 8 | 1
sub       | LC  | 4.9 ns  | 4  | 8 | 1
mult      | LC  | 12.3 ns | 36 | 8 | 1
comp      | LC  | 4.8 ns  | 6  | 8 | 1
equal     | LC  | 5.0 ns  | 4  | 8 | 1
shift reg | LC  | 4.9 ns  | 4  | 8 | 1
reg       | LC  | -       | 4  | 8 | 1
mux       | Tri | -       | 4  | 8 | 2

Memory (type | impl | access | read latency | write latency | bits per cell):
RAM DP | LC | r/w | 13.4 ns | 13.4 ns | 32
RAM DP | DC | r/w | 7.2 ns  | 7.2 ns  | 4096
ROM    | DC | r   | 7.2 ns  | -       | 4096
This information is obtained from the data sheet of the target device and from the synthesis of basic arithmetic / logic operators and memories.

A. FPGA Architecture and Performance Model

As an illustration, Table II gives a simplified characterization example of a Virtex V400EPQ240-7 FPGA [26]. This device contains 4800 logic cells and 40 dedicated cells (slices and BRAMs respectively). Each operator is characterized for usual bitwidths (8, 16 and 32 bits) by the number of FPGA resources used, by the corresponding delay and by the number of control signals. For example, an eight bit adder is characterized as follows: 4 slices, 4.9 ns, 8 bits, 1 control line (corresponding to the signal driving the adder output register). Those values are obtained after a logic synthesis step.
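One possible way to store such library entries, with illustrative field names and the 8-bit Virtex values of Table II, is sketched below:

```c
/* Hedged sketch of entries of the FPGA Architecture and Performance Model
 * (illustrative field names; values taken from Table II, 8-bit operators).  */
typedef enum { IMPL_LC, IMPL_DC, IMPL_TRI } Impl;

typedef struct {
    const char *op;        /* operation supported by the execution unit      */
    Impl        impl;      /* FPGA resource used for the implementation      */
    double      delay_ns;  /* propagation delay                              */
    int         area;      /* number of cells of that resource type          */
    int         bitwidth;
    int         ctl_lines; /* control signals needed to drive the unit       */
} OpRecord;

static const OpRecord virtex_v400e_lib[] = {
    { "add",  IMPL_LC,  4.9,  4, 8, 1 },
    { "sub",  IMPL_LC,  4.9,  4, 8, 1 },
    { "mult", IMPL_LC, 12.3, 36, 8, 1 },
    { "comp", IMPL_LC,  4.8,  6, 8, 1 },
};
```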
Fig. 7. Physical Estimation Flow
Memories are characterized by the type of resources used for their implementation, the storage capacity and the read / write delays. A generic characterization (i.e. independent from size and bitwidth) has been defined, based on the number of memory bits per cell and the worst case access time. Furthermore, two choices are possible to implement a ROM memory in Xilinx and Altera devices, for example for an Apex EP20K200RC208-1 FPGA: a logic cell based (logic elements) or a dedicated cell based (ESB) implementation. The respective characterizations are 16 bits / logic element, 11.2 ns and 2048 bits / ESB, 4 ns. As we can notice, the dedicated cell implementation is faster, but it is also more efficient in terms of area. So in the following, the dedicated cell implementation is considered first, as it also saves the logic cells for custom user defined functions.

B. Technology projection

From the characteristics of one RTL solution and the characteristics of the target FPGA, we derive a simple and accurate estimation of the expected FPGA occupation rate and algorithm execution time. For better accuracy, a specific "projection" process has been defined for each unit of the architecture.

1) Memory unit: A simple approach is applied to evaluate the area of the memory unit. It is based on the total memory size, the number of simultaneous accesses, and the characteristics of the memory resources inside the FPGA. As stated before, two implementation possibilities are considered: using logic cells or using dedicated cells. According to the type of cells used, the area of the memory unit is estimated as follows.

Logic cell implementation:
N_LC(mem) = ⌈(S_RAM × w_RAM) / Nbit_LC⌉ + ⌈(S_ROM × w_ROM) / Nbit_LC⌉

where S_RAM and S_ROM are respectively the total memory sizes (in words) for the RAM and ROM memories, Nbit_LC is the number of memory bits per logic cell, and w_RAM, w_ROM are the bitwidths of the data to be stored.
Dedicated cell implementation: in a first approach, the number of memories could be approximated by taking the maximum between the number of simultaneous write operations (Nram_wr) and the number of simultaneous read operations (Nram_rd). However, the type of resource used is important because only one memory can be integrated in a single dedicated cell. So in this case, the number of embedded memories is computed by taking the maximum between the number of dedicated cells needed to implement the total memory size and the number of simultaneous accesses:

N_RAM = max( ⌈(S_RAM × w_RAM) / Nbit_DC⌉, max(Nram_rd, Nram_wr) )
N_ROM = max( ⌈(S_ROM × w_ROM) / Nbit_DC⌉, Nrom )

where Nbit_DC is the number of memory bits per dedicated cell and w_RAM, w_ROM are the bitwidths of the data to be stored. Once the number of memories of each type is known, the numbers of control signals needed to drive the RAM memories (N_ctl(RAM)) and the ROM memories (N_ctl(ROM)) are derived. They correspond to the address and write enable signals (Fig. 3):

N_ctl(RAM) = N_RAM × (2 × AD_RAM + 1)
N_ctl(ROM) = N_ROM × AD_ROM
where the sizes of the address buses (AD_RAM and AD_ROM) are derived from the number of words in the memories:

AD_RAM = ⌈log2(S_RAM)⌉
AD_ROM = ⌈log2(S_ROM)⌉

The current RTL model does not include a specific address generator. Hence, the control of the address signals of the RAMs and ROMs is left to the control unit. This explains why the expressions of N_ctl(RAM) and N_ctl(ROM) take into account AD_RAM and AD_ROM respectively. Thus, the total number of control signals for the memory unit is:

N_ctl(mem) = N_ctl(RAM) + N_ctl(ROM)
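A minimal sketch of the dedicated-cell projection, assuming (as above) that the RAM block count is bounded both by the total size and by the number of simultaneous accesses, and that each dual-port RAM needs a read address, a write address and a write enable, could be:

```c
/* Hedged sketch of the memory-unit projection for a dedicated-cell
 * implementation (illustrative names; the rounding and the set of control
 * signals follow the expressions given above).                              */
#include <math.h>

/* embedded RAM blocks: max of the blocks needed to hold the data and the
 * number of simultaneous accesses                                           */
static int n_ram_blocks(long ram_words, int ram_width, long bits_per_dc,
                        int nram_rd, int nram_wr)
{
    int by_size   = (int)ceil((double)ram_words * ram_width / bits_per_dc);
    int by_access = nram_rd > nram_wr ? nram_rd : nram_wr;
    return by_size > by_access ? by_size : by_access;
}

/* address and write-enable signals of the RAMs, all driven by the control
 * unit in the current RTL model (Fig. 3)                                    */
static int ram_control_signals(int n_ram, long ram_words)
{
    int ad = (int)ceil(log2((double)ram_words));   /* address bus width      */
    return n_ram * (2 * ad + 1);                   /* R_AD, W_AD and WE      */
}
```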
2) Processing unit (datapath): The area of the processing unit is computed by adding the contribution of each execution unit:

A_PU = Σ_k Nop,k × A_op,k

where Nop,k is the number of execution units of type k and A_op,k their individual area. For example, 3 eight bit adders implemented on a Virtex V400E require 12 slices (3 × 4). As in the case of the memory unit, the number of control signals needed to drive the datapath is considered. There are four types of control signals, according to the current model of Fig. 3:
– signals for the control of the output register associated with each execution unit; the number of signals of this type is equal to the number of execution units;
– signals for the operation selection of multi-functional units;
– signals for the control of the registers associated with the memory read / write ports;
– signals for multiplexer / tristate control.
The number of control signals needed for the processing unit, N_ctl(PU), is the sum of these four contributions, and the total number of control signals for the entire architecture is

N_ctl = N_ctl(mem) + N_ctl(PU)
3) Control unit: The area of the control unit is estimated from the number of states and the number of control lines [27]. The control logic is supposed to be integrated in a ROM memory whose area is computed from its number of words (one per control state) and its data bitwidth (the control lines plus the next-state address):

S_CU = Ns × ( N_ctl + ⌈log2(Ns)⌉ )

where Ns is the total number of states needed to schedule the entire graph. The area of the control logic is obtained from the area occupied by such a ROM. The corresponding numbers of FPGA resources for a logic cell or a dedicated cell implementation are respectively:

N_LC(CU) = ⌈S_CU / Nbit_LC⌉
N_DC(CU) = ⌈S_CU / Nbit_DC⌉
4) Global cost characterization: The total area in terms of FPGA resource occupation is computed for each type of FPGA resource by adding the contribution of each unit of the architecture.
The physical value of the execution time is computed from the number of cycles of the solution and from the clock period T_clk defined during the selection step (section IV-B):

T_exec = Nc × T_clk
The above computation process is then iterated for each architectural solution of the structural estimation results and leads to the final cost vs. performance characterization (Fig. 1).

VI. EXPERIMENTS AND RESULTS
This section is organized as follows: first we apply the exploration methodology on several examples representative of different algorithmic complexities and processing characteristics (intense data processing, control dominated, high / low parallelism potential). Then we address the problem of estimation accuracy. For this purpose, we compare the synthesis results of one architectural solution with the area and performance estimation values given by our exploration tool. Three approaches are considered to perform this comparison: the first one is based on hand coded designs, the second one on designs obtained with a HLS tool, and the last one on pre-characterized IPs. Finally, we apply the design methodology to a 1D Discrete Wavelet Transform to illustrate how easily, and how independently of any particular design experience, the exploration process can be carried out.

A. Applications

The methodology described in this paper has been integrated in a framework for the codesign of SoCs called Design Trotter [21]. With the help of this tool, early exploration of applications starting from C descriptions is fast and easy. Applications representative of several algorithmic complexities and processing characteristics have been used to analyze the design space coverage vs. exploration time trade-offs. The first seven examples are representative of typical DSP processing: filtering (FIR, Volterra, F22 and Adaptive) and transforms (Fast Fourier Transform, 2D Discrete Wavelet Transform and Discrete Cosine Transform). Those examples are characterized by a high memory and computation parallelism and by intense data processing / storage, especially in the case of the 2D DWT. The last examples (G722 speech coding recommendation, MPEG and Huffman coding) are control dominated systems with low parallelism potential. Table III presents the number of solutions explored and the corresponding exploration times on a Pentium 3 processor running at 1.2 GHz. The results show the ability of the tool to compute several architectural solutions in a reasonable amount of time, even in the case of complex specifications. The 2D DWT represents the highest complexity; for this example, Design Trotter generates about 350 RTL solutions and the corresponding area / performance estimates within 10 minutes. For comparison, the time required to perform only the logic synthesis of one single solution is several orders of magnitude higher (about one day). The results of Table III prove the effectiveness of a global "low complexity" estimation framework operating from early specifications and exploring several RTL solutions completely characterized in terms of processing, control and memory. In the following, we address the problem of estimation accuracy in order to evaluate the relevance of the estimates provided.
TABLE III
NUMBER OF SOLUTIONS GENERATED VS. EXPLORATION TIMES FOR SEVERAL DSP APPLICATIONS

Application | Virtex V400EPQ240 (sol. / expl. time) | Apex EP20K200EFC484 (sol. / expl. time)
FIR      | 5 / 0.04 sec  | 5 / 0.03 sec
Volterra | 11 / 5.6 sec  | 11 / 4.2 sec
F22      | 40 / 2.3 sec  | 40 / 2.6 sec
Adaptive | 4 / 2.6 sec   | 4 / 1.8 sec
FFT      | 56 / 0.4 sec  | 56 / 0.4 sec
DWT      | 342 / 8.7 min | 342 / 9.4 min
DCT      | 14 / 1.5 sec  | 14 / 1.6 sec
G722     | 16 / 0.7 sec  | 16 / 0.4 sec
MPEG     | 2 / 3.1 sec   | 2 / 3.1 sec
Huffman  | 4 / 4.2 sec   | 4 / 4.5 sec
TABLE IV
ESTIMATION VS. SYNTHESIS FOR THE G722 PREDICTOR FUNCTION
(per device: estimated slices / logic elements and delay (ns), then synthesized slices / logic elements and delay (ns))

Function      | Virtex V400EPQ240-7: estim. / synth.   | Apex EP20K200EFC484-2X: estim. / synth.
Parrec        | 9, 6 ns / 10, 6 ns                     | 17, 9 ns / 19, 9 ns
Recons        | 9, 6 ns / 10, 6 ns                     | 17, 9 ns / 19, 9 ns
Upzero        | 217, 1224 ns / 255, 1171 ns            | 608, 1504 ns / 400, 1272 ns
Uppol2        | 257, 292 ns / 303, 300 ns              | 857, 358 ns / 718, 302 ns
Uppol1        | 216, 230 ns / 275, 254 ns              | 589, 282 ns / 612, 236 ns
Filtez        | 150, 77 ns / 163, 88 ns                | 518, 94 ns / 648, 103 ns
Filtep        | 181, 593 ns / 177, 511 ns              | 484, 728 ns / 497, 515 ns
Predic        | 9, 6 ns / 10, 6 ns                     | 17, 9 ns / 49, 9 ns
G722Predictor | 1166, 1224 ns / 1263, 1317 ns          | 3132, 1504 ns / 3027, 1318 ns
B. Accuracy

To perform a reliable accuracy evaluation, three approaches have been carried out: comparing the estimations with hand coded designs, with HLS designs and with pre-characterized IP (intellectual property) modules, and analyzing the contribution and benefits for the entire design process. The first approach is needed because the accuracy of the estimations is strongly related to the RTL model of Fig. 3; it is thus representative of the technology projection accuracy. The second and third approaches can lead to more significant variations because the architecture models used are different. On the other hand, a comparison with an automated architectural synthesis tool and with IP components is required to discuss the pertinence of the architectural solutions provided.

1) Comparison with hand coded designs: The two applications considered are a speech coder (G722) [28] and a 2D Discrete Wavelet Transform [29]. Concerning the G722 recommendation, we focused on the predictor, which represents the core processing of the application. The predictor is composed of eight sub-functions that are executed concurrently and represents about 260 lines of C code. The DWT algorithm considered is based on a lifting scheme composed of twelve filtering functions applied sequentially: six loops are first executed to compute the horizontal transform and then six loops are executed to compute the vertical transform. Each loop is a second order nested loop. The complexity of the algorithm is about 250 lines of C code, which corresponds to almost 500 lines of H/CDFG grammar. In the following, we compare the exploration time vs. the logic synthesis time (the time spent for architecture synthesis is not included) for two devices representative of recent FPGA families: the Virtex V400EPQ240-7 [26] and the Altera Apex EP20K200EFC484-2X [15]. The ISE Foundation and Quartus synthesis tools have been used to target the respective devices. In order to provide a reliable accuracy evaluation, the entire applications have been synthesized in both cases (G722 predictor and 2D DWT), as well as each sub-function independently. This way, the average accuracy has been computed on a total of 22 design references. Results are given in Table IV and Table V. Table IV reports the number of slices and logic elements after estimation and synthesis. Table V gives the average error in percent and reports the exploration vs. synthesis times. The average accuracy is about 18% for area values and 10% for execution time values. Some variations may be noticed locally, like in the case of function upzero, that are due to logic optimizations performed by the logic synthesis tool; in that case, the difference comes from the simplification of a multiplier one operand of which remains constant. Other variations, like the horizontal and vertical rearrange functions in the DWT, are caused by the address generation being left to the control unit; this is especially sensitive for these functions, where memory accesses are critical. Concerning the processing times, we must point out that only the logic synthesis times have been reported for comparison. The exploration times in the case of the G722 predictor and of the DWT are respectively less than one second and about five minutes (to generate 16 and 342 solutions), while the respective logic synthesis times (for only one solution) are respectively about 10 to 15 minutes and more than one day, plus the additional time needed to write the RTL description by hand (about 1 month).
TABLE V
ESTIMATION VS. SYNTHESIS ERROR AND EXPLORATION VS. (LOGIC) SYNTHESIS TIME FOR THE G722 PREDICTOR FUNCTION AND THE DWT 2D FUNCTION

                 | Virtex V400EPQ240-7                          | Apex EP20K200EFC484-2X
Function         | slices (%) | time (%) | explo (s) | synth (min) | lgc elt (%) | time (%) | explo (s) | synth (min)
Parrec           | -10    | 0     | 0.05  | 1        | -10.5 | 0     | 0.05  | 1
Recons           | -10    | 0     | 0.05  | 1        | -10.5 | 0     | 0.05  | 1
Upzero           | -14.9  | +4.5  | 0.22  | 5        | +52   | +18.2 | 0.06  | 5
Uppol2           | -15.2  | -2.7  | 0.11  | 5        | +19.4 | +18.2 | 0.11  | 5
Uppol1           | -21.5  | -9.4  | 0.11  | 5        | -3.8  | +19.5 | 0.11  | 5
Filtep           | -8     | -12.5 | 0.05  | 1        | -20.1 | -8.7  | 0.05  | 1
Filtez           | +2.2   | +16   | 0.05  | 2        | -2.6  | +41.4 | 0.05  | 2
Predic           | -10    | 0     | 0.05  | 1        | -10.5 | 0     | 0.06  | 1
G722Predictor    | -7.7   | -7.1  | 0.9   | 15       | +3.4  | +14.1 | 0.4   | 10
1stHLftStep      | +6.9   | +7.1  | 0.1   | 5        | +1.4  | +7.6  | 0.05  | 8
1stHDLftStep     | +4     | +0.6  | 0.05  | 5        | +2.6  | +1.9  | 0.06  | 8
2ndHLftStep      | +5.1   | +13.9 | 0.06  | 5        | +2.8  | +9.3  | 0.05  | 8
2ndHDLftStep     | +2.5   | +9.8  | 0.06  | 5        | -0.2  | +1.7  | 0.05  | 8
Hscaling         | +2.7   | +3.6  | 0.1   | 5        | +4.9  | +3.6  | 0.1   | 8
Hrearrange       | +46.8  | -25   | 0.06  | 5        | +67   | +9.1  | 0.05  | 8
1stVLftStep      | +7.1   | +25.5 | 0.1   | 5        | -0.2  | +2.9  | 0.05  | 8
1stVDLftStep     | +5     | +16.9 | 0.05  | 5        | -0.6  | +5.1  | 0.06  | 8
2ndVLftStep      | +5.1   | +18.5 | 0.06  | 5        | +1.1  | +2.9  | 0.05  | 8
2ndVDLftStep     | +3.4   | +18.3 | 0.06  | 5        | -2.6  | +7.7  | 0.05  | 8
Vscaling         | +3.4   | +5.5  | 0.1   | 5        | +3.2  | +3.8  | 0.1   | 8
Vrearrange       | +50.9  | -5.5  | 0.06  | 5        | +61   | +3.8  | 0.05  | 8
DWT 2D           | +35.9  | +18.2 | 5 min | 1.5 days | +37   | +3.1  | 5 min | 2 days
2) Comparison with an HLS design: In this section, we compare our exploration / estimation methodology with a High Level Synthesis approach. To achieve this, we perform an exploration both using our approach (Design Trotter framework) and an HLS tool (GAUT HLS [30]) on a 16-tap FIR filter. Design Trotter generates RTL architectural solutions. To perform our comparison, we selected one solution provided by the exploration framework and constrained the GAUT HLS tool with the corresponding timing constraint (304 ) and clock period (16 ). The left column of Table VI reports the characteristics of our solution and gives the necessary information to design the RTL architecture with the HLS tool.
TABLE VI
DESIGN TROTTER FIR16 VS. HLS (GAUT) FIR16

FIR16 (Design Trotter): add 16 bit, mult 16 bit, reg 16 bit, RAM, ROM, control steps; sec to generate 5 solutions + area / time estimates
FIR16 (GAUT): add 16 bit, mult 16 bit, reg 16 bit, memory unit not generated, control steps; min to generate this solution + minutes for logic synthesis
As we can see, there are some slight differences, which are mainly due to the different RTL architecture models used in the two tools:
- Concerning the number of registers, the difference is due to the fact that each functional unit is supposed to be associated with an output register in our approach (Section III-B). So the real number of registers needed is equal to (originally estimated) plus 2 (one at the output of the adder and one at the output of the multiplier).
- Concerning memory, Design Trotter computes a basic estimation of the number of memories needed, simply based on the analysis of the number of simultaneous accesses ( , and ), while GAUT HLS does not address the memory requirements in the version used for this evaluation.
Finally, the analysis of the processing times (bottom of Table VI) shows that the exploration methodology complements the High Level Synthesis tool rather than competing with it: the use of an HLS tool may need several time-consuming iterations before a suitable architecture is defined. With our tool, several parallelism solutions are automatically generated with an average accuracy of %. Once a suitable solution has been selected, the designer can then constrain the HLS tool (using information such as local / global time constraints, parallelism, scheduling, allocation, . . . ) to meet this solution. The confidence in reaching a suitable implementation is then significantly enhanced.
3) Comparison with pre-characterized IPs: To compare our exploration approach with existing IP modules, we also considered a 16-tap FIR filter obtained from the core generator provided by Xilinx [32]. Two IPs have been generated: the
first one is a fully pipelined filter whereas the second one is fully sequential. Design Trotter provides 5 solutions.
TABLE VII
DESIGN TROTTER FIR16 VS. IP (XILINX) FIR16 (PIPELINED AND SEQUENTIAL SOLUTIONS)
In the following we consider the fastest and the slowest of these solutions. Table VII reports the results of that comparison. For the fastest solution, BRAMs are required, as a maximum of simultaneous memory accesses has been estimated. Xilinx's solution uses only a single BRAM. The default clock value we considered corresponds to the slowest execution unit (the multiplier in our case) and we keep the same value for both solutions as we do not consider logic optimization. Compared to the solutions provided by Xilinx, the differences are due to the high optimization effort put into the IP, for which we do not have any information, for obvious confidentiality reasons. However, considering that we propose an estimation starting from a C code, the accuracy is in our view reasonable. Further improvement on this point is possible and will be discussed in Section VI-D. The last part of the result section emphasizes the exploration possibilities introduced by our methodology (design parameters, exploration / synthesis relation).

C. Exploration Approach

In this section, the design flow described previously is applied to a 1D Discrete Wavelet Transform. Fig. 8 and 9 show the exploration results for the two following devices: the Xilinx Virtex V400EPQ240-7 and the Altera Apex EP20K200EFC484-2X.
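To make the shape of this specification concrete, the sketch below shows what one horizontal lifting step can look like in C. It is illustrative only (array name, loop bounds and the coefficient value are our assumptions, not the authors' benchmark code), but its loop body exhibits the operation mix reported later for the For12 body graph in Table VIII: one multiplication, two additions, three RAM reads and one write per iteration.

/* Illustrative sketch only: one horizontal lifting step of the DWT,
   written as a second-order nested loop. Array name, coefficient value
   and bounds are assumptions, not the authors' actual benchmark code. */
#define ROWS 256
#define COLS 256

void h_lift_step(short img[ROWS][COLS])
{
    const int alpha = -3;                       /* example lifting coefficient */
    for (int r = 0; r < ROWS; r++) {            /* outer loop over rows        */
        for (int c = 1; c + 1 < COLS; c += 2) { /* inner loop over odd samples */
            /* per iteration: 3 RAM reads, 1 multiplication, 2 additions, 1 write */
            img[r][c] = (short)(img[r][c] + alpha * (img[r][c - 1] + img[r][c + 1]));
        }
    }
}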
Fig. 8. Horizontal DWT exploration results (Virtex) - slices vs. time
On this example, the exploration tool provides a total of architectural solutions in both cases, each one corresponding to a different parallelism degree. Let us consider for instance the highlighted solution
Fig. 9. Horizontal DWT exploration results (Apex) - logic elements vs. time
(respectively slices / for Virtex in Fig. 8 and logic elements / for Apex in Fig. 9), since it corresponds to an interesting area / speed trade-off. Based on that solution, the designer may want to refine the exploration. For example, in this experiment the default clock period value corresponds to the delay of the slowest execution unit. Hence, the designer can analyze the effect of different clock periods and resource selections. For the solution selected before, the impact of several clock values and data bitwidths is reported in Fig. 10 (labels correspond to clock period - data bitwidth). Once a solution has been selected (for example clock = 20 ns, bitwidth = 16), the details of the corresponding structural estimation results for each hierarchy level (each sub-function of the specification) provide useful information (Table VIII). Those partial results fully characterize each architectural solution and give the designer all the necessary information to design the system. As an illustration, we can see that the selected solution is composed of 4 multipliers and 8 adders for an execution of 223 cycles, which corresponds to a resource occupation of ( ) slices, ( ) BRAMs and ( ) tristate buffers for an execution time. To give an example of sub-hierarchy characterization, the For12 body sub-function is composed of 1 multiplier and 2 adders for the datapath, and the RAM memory bandwidth is 1 write and 3 reads. All this information can be used to guide an HLS tool to design the selected solution.
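As a quick cross-check (this figure is our own arithmetic from the values quoted above, not a number taken from the table), the execution time of the selected solution follows directly from the cycle count and the chosen clock period:

\[
t_{exec} = N_{cycles} \times T_{clk} = 223 \times 20\,\text{ns} \approx 4.46\,\mu\text{s}.
\]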
D. Result Discussion

Applying the methodology to several examples pointed out some benefits and limitations of the exploration algorithm. In the following, we discuss the pertinence of the results obtained and propose some possible enhancements. In the next section, we will focus on the contribution over related work in the perspective of future complex / heterogeneous system design.

The structural estimations represent the core of the proposed methodology, and the exploration efficiency depends strongly on the pertinence / feasibility of the different parallelism solutions provided. As a starting point, we defined a first approach based on DFG scheduling and analytical CDFG combinations using simple execution models. The main reason for doing this was to achieve the best complexity / design space coverage trade-off. Given the accuracy and processing times reported in Table V, some enhancements are possible (of course at the expense of processing complexity) and are exposed in the following:

Parallelism exploration: the weak point is the simplicity of the combination heuristics used. On the other hand, an optimal schedule of the whole graph would greatly impact the estimation complexity. That is the reason why we decided, as a compromise, to schedule only the DFGs of the graph. A solution to enhance the quality of the solutions would be to apply a fine schedule to some critical processing parts, in particular in the case of iterative structures with execution dependences.
Fig. 10. Bitwidth and clock period exploration for a Virtex implementation

TABLE VIII
SOLUTION DETAILS FOR THE SELECTED RTL ARCHITECTURE LEADING TO THE FOLLOWING PERFORMANCE: SLICES, EXECUTION TIME

Graph         | Cycles | States | Mul16 | Add16 | Reg16 | RAM (wr) | RAM (rd) | ROM
For12 body    | 5   | 5   | 1 | 2 |   | 1 | 3  | 1
H1stLftStep   | 32  | 32  | 4 | 8 |   | 4 | 12 | 4
For22 body    | 5   | 5   | 1 | 2 |   | 1 | 3  | 1
H1stDLftStep  | 32  | 32  | 4 | 8 |   | 4 | 12 | 4
For32 body    | 5   | 5   | 1 | 2 |   | 1 | 3  | 1
H2ndLftStep   | 32  | 32  | 4 | 8 |   | 4 | 12 | 4
For42 body    | 5   | 5   | 1 | 2 |   | 1 | 3  | 1
H2ndDLftStep  | 32  | 32  | 4 | 8 |   | 4 | 12 | 4
For52 body    | 3   | 3   | 2 |   |   | 2 | 2  | 2
Hscaling      | 66  | 66  | 4 |   |   | 4 | 4  | 4
For62 body    | 2   | 2   |   |   |   | 2 | 2  |
Hrearrange    | 33  | 33  |   |   |   | 8 | 8  |
HDWT          | 223 | 223 | 4 | 8 |   | 8 | 12 | 4
We believe this is something achievable without too much penalty on the exploration times, given the current complexity of the exploration algorithm.

Simplicity of the memory model: in the current version, memory estimation is based on a basic load / store model and characterized through memory size estimation and bandwidth requirements. This is also justified by the embedded memory structure inside an FPGA, which is composed of several distinct storage resources. For complex memory structures (using caches, pipelined execution modes, resource sharing), suited models and heuristics must be defined. Relevant work in this field has already been carried out and could be used for this purpose [31].

Another way to enhance the pertinence of the RTL solutions is to include the characterization of IP cores in the FPGA characterization file. For example, a Discrete Cosine Transform could be used, provided a specific DCT node is present in the specification graph and a corresponding DCT implementation is described in the technology file (in terms of FPGA resources used and latency). This possibility is enabled by the H/CDFG representation and could greatly enhance the exploration reliability. As an extension, any processing sequence or common functionality can be associated with an
actual implementation, provided that information on area and delay is available in the technology file (in the manner of [10]).

Concerning physical estimations, some variations have been noticed when comparing physical synthesis results with area and performance estimates (Table V). This is mainly due to some unconsidered low level optimizations:

Control unit / address generation: control unit estimation leads to some variations in the case of heavy data processing / address generation. A solution to this problem is to consider an address generation model separate from the control unit model (a possibility also enabled by the H/CDFG representation).

Logic optimizations: some improvements are possible by considering low level optimizations like operator area reduction when one operand remains constant, or a shift operation instead of a multiplication by a power of two. The impact of this is not critical as the average accuracy of the estimations is already around 15%. That is the reason why such optimizations have not been considered. We believe the structural estimation step is a more important concern, on which additional processing complexity should focus. This way, several scheduling strategies can be defined to cope with the processing requirements of an application domain, an HLS tool procedure or an architectural implementation style. Different strategies could be applied for data-dominated, computation-intensive or control-dominated applications, for example. Suited architecture models also have to be defined in this case (for example, a finer control model than the one of Fig. 3 for control-dominated applications, with separate address generation / merging of state machines).

E. Comparison with existing approaches

Compared to the work presented in Table I, our approach can be summarized as follows:
- The estimator inputs a C description which is then parsed into a H/CDFG representation, so a complete characterization of the application is achieved in terms of processing, control and memory (unlike most approaches).
- The RTL architecture model is complete since datapath, control and memory units are considered. Each model has been defined in order to cope with FPGA specificities and to achieve the best estimation accuracy. Other models can easily be defined.
- The estimator outputs area and delay values. No optimizations at the logic level are currently considered.
- Design Space Exploration performs automatic parallelism exploration. This means that several RTL solutions are defined for a given specification, without making several iterations of the exploration process. A great variety of design parameters can be explored: implementation technology / device, resource / memory selection, clock period, data bitwidth, parallelism, control cost, . . .
- The FPGA coverage is large and includes up-to-date devices. The characterization of the device occupation includes logic cells as well as dedicated cells (DSP operators, embedded memories), which are important features in modern reconfigurable logic. This approach allows us to be more independent of a specific FPGA, which is characterized through a library-based approach.
- The complexity of the algorithm is low, since only a list-scheduling algorithm is performed before an analytical combination method. Thanks to this, the design space coverage is large; fine scheduling in critical cases could easily be applied (typically for complex loop processing with dependences), but at the expense of estimation complexity.
- The average accuracy is 10% for delay estimations and 20% for area values.

The application of this work has shown the possibility of obtaining accurate estimations from system level specifications. The use of such low complexity estimations makes it possible to compare very quickly different implementation possibilities and to make reliable choices that are derived directly from the processing characteristics of the specification. The main conditions that have enabled this are the use of (1) a complete representation model, (2) a realistic implementation model and (3) implementable scheduling heuristics. Given this, those three conditions can be changed / adapted to cope with a specific design procedure / application domain. In this respect, defining a modeling approach suited to the design procedure of a High Level Synthesis tool would greatly ease the design of a given solution by providing the synthesis tool with all the necessary information, such as execution unit selection, clock period, sub-processing local time constraints, . . . Moreover, an extension to ASIC design (simply achievable through the definition of an appropriate library) could provide a complete framework for fast hardware prototyping. Considering power consumption in addition to the area / time characterization could also provide useful information for the design of portable devices. This could be derived, in a first approach, from the operator average consumption per use, while considering the number of operators selected and the number of operations to perform in the graph (and the static consumption depending on the implementation technology).
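A first-order formulation of this power estimate (our notation; the text does not give an explicit formula) could be

\[
P \;\approx\; \frac{1}{t_{exec}} \sum_{k} E_{k}\, N_{k} \;+\; P_{static},
\]

where E_k is the average energy per use of an operator of type k, N_k the number of operations of that type scheduled in the graph, t_exec the estimated execution time, and P_static the technology-dependent static power.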
VII. CONCLUSION AND PERSPECTIVES

The starting objective of this work was to define an accurate estimation methodology from early specifications. The accuracy achieved is high considering the abstraction level of the specification. This results from the use of a complete representation model (H/CDFG) and from the use of RTL implementation models for the datapath, control and memory units. The scheduling approach used also provides better confidence in the feasibility of the solutions (compared to a purely analytical estimation approach) and
information that can be efficiently used to guide their design. The simplification of the scheduling process is important to explore quickly a wide range of parallelism solutions. The counterpart may be the pertinence of the solutions defined in some cases, which is unavoidable given the abstraction level of the specification. But the complexity / design space coverage trade-off obtained lets us expect easy adaptation to other architecture templates and execution models, in a way that would provide better quality architectural solutions.

The benefits of this methodology on the design cycle are multiple: it allows the design cycle to be shortened significantly (especially if used in complementarity with an HLS tool), to be less dependent on designer experience and synthesis tools (thus on technology), and to converge faster towards an acceptable, constraint-compliant solution. Better application / architecture / device matching is reached thanks to the exploration coverage in terms of parallelism, devices, clock period, and functional unit allocation. This approach has been integrated in a CAD framework for the codesign of SoCs. Interesting extensions of this work are possible, among which:
- A high level exploration methodology based on area / power / performance estimation, used in complementarity with an HLS tool.
- The definition of a hardware design methodology including FPGA / ASIC devices (and a first step towards technology independence).
- Hardware compilation for Adaptive Computing systems (based on CPUs and reconfigurable hardware). An optimal area / time and / or power consumption / time trade-off can be found at compile time using such fast estimations.

REFERENCES
[1] R. Hartenstein, "Reconfigurable Computing: the Roadmap to a New Business Model - and its Impact on SoC Design", Proceedings of the 14th Symposium on Integrated Circuits and Systems Design (SBCCI 2001), Pirenopolis, DF, Brazil, September 10-15, 2001.
[2] R. Hartenstein, Are we ready for the Breakthrough?, Proceedings of the 10th Reconfigurable Architectures Workshop 2003 (RAW 2003), Nice, France, April 22, 2003.
[3] N. Tredennick and B. Shimamoto, The Rise of Reconfigurable Systems, Proceedings of the Engineering of Reconfigurable Systems and Application Conference (ERSA'2003), June 23-26, 2003, Las Vegas, Nevada, USA.
[4] G. Kulkarni, W. A. Najjar, R. Rinker and F. J. Kurdahi, Fast Area Estimation to Support Compiler Optimizations in FPGA-based Reconfigurable Systems, Proceedings of the International Symposium on Field-Programmable Custom Computing Machines (FCCM'02), April 21-24, 2002, Napa, California, USA.
[5] B. So, P. C. Diniz and M. W. Hall, Using Estimates from Behavioral Synthesis Tools in Compiler-Directed Design Space Exploration, Proceedings of the IEEE Design Automation Conference (DAC'03), June 2-6, 2003, Anaheim, California, USA.
[6] MATLAB-based IP for DSP Design of FPGAs and ASICs, White Paper, May 2004, www.accelchip.com
[7] The Application of Retiming to the Synthesis of C-based Languages using the Celoxica DK Design Suite, White Paper, March 2004, www.celoxica.com
[8] A. Nayak, M. Haldar, A. Choudhary and P. Banerjee, Accurate Area and Delay Estimators for FPGAs, Proceedings of the International Conference on Design, Automation and Test in Europe (DATE'02), March 4-8, 2002, Paris, France.
[9] R. Enzler, T. Jeger, D. Cottet and G. Tröster, High-level Area and Performance Estimation of Hardware Building Blocks on FPGAs, Proceedings of the International Conference on Field-Programmable Logic and Applications (FPL'00), volume 1896 of Lecture Notes in Computer Science, pages 525-534, Springer, 2000.
[10] W. Miller and K. Owyang, Designing a High Performance FPGA Using the PREP Benchmarks, Proceedings of the WESCON'93 Conference, pages 234-239, 1993.
[11] M. Xu and F. J. Kurdahi, Area and Timing Estimation for Lookup Table Based FPGAs, Proceedings of the European Design and Test Conference (ED&TC'96), March 1996.
[12] M. Xu and F. J. Kurdahi, Layout-Driven RTL Binding Techniques for High-Level Synthesis Using Accurate Estimators, ACM Transactions on Design Automation of Electronic Systems, Vol. 2, No. 4, pp. 313-343, October 1997.
[13] P. Bjureus, M. Millberg and A. Jantsch, FPGA Resource and Timing Estimation from Matlab Execution Traces, Proceedings of the Tenth International Symposium on Hardware/Software Codesign (CODES'02), May 6-8, 2002, Estes Park, Colorado, USA.
[14] P. Banerjee, N. Shenoy, A. Choudhary, S. Hauck, A. Nayak and S. Periyacheri, A MATLAB Compiler for Distributed, Heterogeneous, Reconfigurable Computing Systems, Proceedings of the IEEE Symposium on FPGAs as Custom Computing Machines (FCCM 2000), April 17-19, 2000, Napa Valley, CA.
[15] ALTERA, APEX 20K Programmable Logic Device Family, August 2001, www.altera.com
[16] P. Diniz, M. Hall, J. Park, B. So and H. Ziegler, Bridging the Gap between Compilation and Synthesis in the DEFACTO System, Proceedings of the Languages and Compilers for Parallel Computing Workshop (LCPC'01), August 2001, Cumberland Falls, KY, USA.
[17] S. Choi, J. W. Jang, S. Mohanty and V. K. Prasanna, Domain-Specific Modelling for Rapid System-Wide Energy Estimation of Reconfigurable Architectures, Engineering of Reconfigurable Systems and Algorithms (ERSA'02), June 2002.
[18] S. Bilavarn, Exploration Architecturale au Niveau Comportemental - Application aux FPGAs, PhD thesis, University of South Brittany, February 2002.
[19] S. Bilavarn, G. Gogniat and J. L. Philippe, Fast Prototyping of Reconfigurable Architectures: An Estimation and Exploration Methodology from System-Level Specifications, Proceedings of the Eleventh ACM International Symposium on Field-Programmable Gate Arrays (FPGA'03), February 23-25, 2003, Monterey, California, USA.
[20] J. P. Diguet, G. Gogniat, P. Danielo, M. Auguin and J. L. Philippe, The SPF Model, Proceedings of the International Forum on Design Languages (FDL'00), Tübingen, Germany, September 2000.
[21] Y. Moullec, J. P. Diguet and J. L. Philippe, Design-Trotter: a Multimedia Embedded Systems Design Space Exploration Tool, Proceedings of the International IEEE Workshop on Multimedia Signal Processing (MMSP'02), December 9-11, 2002, St. Thomas, US Virgin Islands.
[22] T. H. Cormen, C. E. Leiserson, R. L. Rivest and C. Stein, Introduction to Algorithms, Second Edition, MIT Press and McGraw-Hill, September 2001.
[23] G. Grun, N. Dutt and F. Balasa, System Level Memory Size Estimation, Technical Report, University of California, 1997.
[24] D. Gajski, N. Dutt, A. Wu and S. Lin, High-Level Synthesis: Introduction to Chip and System Design, Kluwer Academic Publishers, 1992.
[25] S. A. Blythe and R. A. Walker, Efficient Optimal Design Space Characterization Methodologies, ACM Transactions on Design Automation of Electronic Systems, Vol. 5, No. 3, July 2000.
[26] XILINX, Virtex-II 1.5 Field Programmable Gate Arrays, July 2001, www.xilinx.com
[27] S. Narayan and D. D. Gajski, Area and Performance Estimation from System-Level Specifications, Technical Report, University of California, 1992.
[28] Recommendation G.722, 7 kHz Audio Coding within 64 kbit/s, Melbourne, 1988.
[29] I. Daubechies, The Wavelet Transform, Time-Frequency Localization and Signal Analysis, IEEE Transactions on Information Theory, Vol. 36, No. 5, September 1990.
[30] E. Martin, O. Sentieys, H. Dubois, J. L. Philippe, GAUT, an Architecture Synthesis Tool for Dedicated Signal Processors, Proceedings of International EURO-DAC Conference, pp. 14-19, 1993. [31] F. Catthoor, S. Wuytack, E. DeGreef, F. Balasa, L. Nachtergaele and A. Vandecappelle, Custom Memory Management Methodology, Kluwer Academic Publishers, 1998. [32] Creating a CORE Generator Module, ISE 6 In-Depth Tutorial, www.xilinx.com
Dear reviewers,

In the following we have addressed the questions and remarks made concerning our work. The paper has been shortened and we have defined more clearly what the contributions and the limitations are, both from a methodology and from a modeling point of view. We have also extended the result section by adding a comparison between our approach and IP components from Xilinx. We have tried to address all the points you asked to be improved, and we propose in the following to answer all the questions and remarks reported (answers are in italic text style). Thank you for your remarks and comments, which enabled us to improve the clarity and presentation of this work, to focus more on where the limitations of the methodology are, and on what extensions must be addressed.

The Authors.
The authors make several claims in the introduction of the paper which I'd like to rebut, some of which I'll describe in more depth below. Claims:

1. Defines a method at a high level of abstraction: essentially this is a composition of partial scheduling with analytical performance modeling. This is not that big of a deal and I would like to see a little bit more discussion about the rationale for doing this and what the benefits and limitations are.

Basically, we wanted to start from algorithmic specifications, so the use of a C specification is interesting since it enables easy algorithm verification. We decided not to use a hardware-like description language because we address a simple preliminary step (analysis from the system level), and we do not care about parallelization capabilities on the algorithm side at this level of the design process. For the same reason, we avoid making early analysis of execution / data dependences within the C code. As a preliminary analysis, the goal is to provide the designer with useful information about the execution performance potential from a pure description of the application, i.e. without any precise implementation assumption / constraint. Moreover, the choice of the C language is motivated by its large usage within the designer / developer community in a perspective of functional validation. This does not mean that parallelism or data dependence analysis are not possible; in fact they are, but it is much more efficient to use the H/CDFG representation for such manipulations. The use of true scheduling heuristics enables (1) defining realistic solutions, (2) analyzing the parallelism potential (several time constraints are explored), and (3) considering resource sharing and resource / data dependences. So the concern is not only to count the number of resources required. On the other hand, our goal is not to implement a High Level Synthesis process either, for two main reasons: (1) because of the synthesis process complexity and (2) because our goal is rather to perform exploration, which is still time consuming and experience dependent even with the help of an HLS tool. Our approach is less accurate as it resembles (in its first steps) some kind of simplified HLS process.

2. Realistic cost modeling taking into account the RTL architecture: this RTL architecture apparently does not take into account resource sharing in the sequential execution scenario and I'm left with the impression the implementation architecture is geared towards the model and not the other way around.

Scheduling makes it possible to consider resource sharing for the DFGs (basic blocks); then combination techniques take into account the hierarchy dependences in the graph (execution and control dependences) in order to analyze whether processing and memory resources can be shared or not. In each case where the basic blocks do not execute in parallel, we consider resource sharing. In the other case, we consider a simplified model without resource sharing. This limitation is justified by the need to store the full scheduling of all resources in order to check the possibility of sharing at each time slot. The choice made corresponds to the best complexity compromise (sometimes at the expense of area cost; anyway, resource sharing is far from being always applicable). Graph merging (and then scheduling) could be an interesting way to address this point and bring better resource exploitation / further scheduling pertinence.
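Schematically (our notation, not a formula from the paper), this combination policy for the number of operators of type k required by two sub-graphs G1 and G2 amounts to

\[
N_k(G_1 ; G_2) = \max\big(N_k(G_1), N_k(G_2)\big), \qquad
N_k(G_1 \parallel G_2) = N_k(G_1) + N_k(G_2),
\]

i.e. resources are shared (max) when the blocks execute sequentially and simply added when they execute in parallel.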
Concerning the RTL architecture, this is in fact just (1) a basic model based on a typical processing / (load / store) memory architecture, (2) it has been chosen because it is suitable for FPGA implementation, and (3) HLS tools are based on such implementation models (with separate Control / Processing / Memory units). Actually, the architecture model has been defined first and then we defined the estimation models to be as close as possible to an optimal feasible implementation. In fact, the architecture model is correct but the estimation scenario does not consider all the aspects (for complexity reasons and given the abstraction level we start from). It is true that the fact we do not always consider resource sharing can be a limitation in some cases (parallel execution of graphs with numerous common resources). The answer is probably to apply a suited scenario in such cases: perhaps we could take the mean of the number of common resources of each type, or apply a fine schedule of the whole structure resulting from merging the considered graphs (the
best solution, as it corresponds to an effective and optimal schedule). Anyway, the alternative of a fine analysis of each resource schedule is absolutely excluded for complexity and storage reasons.

3. Explore memory and parallelism: the fact that the model does not capture memory pipelining and, to some extent, more dynamic bindings of the variables to the memory leads me to believe this is a simple model. Nevertheless this is a reasonable claim given that you have to start somewhere.

Indeed the memory model is a simple load / store model. An extension to more complex structures has not been studied, but some research (at IMEC - www.imec.be - in Belgium for example) addresses this point and could be efficiently used (see Section VI-D). Anyway, the use of the H/CDFG representation is a promising idea that allows fine memory and control consideration. In that sense, it is already an interesting contribution in spite of the current model limitations.

4. Several architecture models: well, this is simplified by the fact that you model the various structures of the design in about the same fashion, so I do not see the added value here.

Absolutely, we do not consider several architectural models but one model for which we provide several parallelism levels, as you emphasized. The added value comes from the fact that we consider Control and Memory units in addition to the Processing Unit, which is not the case in most other approaches. Those points have been clarified in the revised version.

The core of the proposed modeling technique, the "compare and estimate", is about partial scheduling and an analytical step to define several solutions. The analytical solutions are derived by composition of the program structure using a simple execution model. My biggest concerns, and hence the reasons why the set of limitations should be disclosed earlier on, are that the model is simplistic.

The architecture model is realistic and corresponds to typical modeling approaches used in High Level Synthesis tools. It may be considered too simplistic for a real synthesis usage, but one must keep in mind that our concern is to provide fast, reliable exploration, not synthesis. In this sense, the model is accurate enough as we start from absolutely nothing in terms of architecture assumptions (except that we want to implement the algorithm on an FPGA). Anyway, other models can be considered; the possibility of refining / defining other models is addressed in the new Section VI-D. Moreover, the technology projection process defined brings a significant contribution as it enables all FPGA devices to be considered, including recent families.

The structural estimation starts with a time budget Nc and then attempts to find out Nopk(Nc). What are the analytical heuristics mentioned at the top of page 108?

The analytical heuristics (top of page 108; this term has been removed in the new version) refer to the estimation procedures in case of conditional structures (IV-D.1), loop structures (IV-D.2), sequential execution (IV-E.1), parallel execution (IV-E.2) and the H/CDFG combination algorithm (IV-E.3).

Resource sharing at the memory level is not taken into account (page 111, bottom left column). Also, in the modeling of parallel execution the assumption is that there is no resource sharing, and in sequential execution there is resource sharing, "thus preventing software pipelining techniques and memory pipelining access to be taken into considerations".
This is the weakest part of the modeling, in my view very simplistic, and I really think you should make an effort to be more forthcoming with the limitations of your model.

This point is addressed in the new Section VI-D.

As to the applications (2 in-depth), I'd like to see them as an electronic appendix to this article together with a more relevant description of their characteristics for the modeling aspect of this work. For instance, I suspect that one could make a lot of different designs in the presence of many other compiler-oriented transformations. So if I'm given the actual code or even some pseudo-code, that would help me understand what code structures work well with your approach and which do not.

Code examples are provided in attachment to this file (G722 predictor and 2D Discrete Wavelet Transform). It is clear that we could consider different types of architectures with optimized processing / memory pipelining, plus eventually other optimization possibilities. As emphasized in the revised paper, the choices are not fixed and aim to reach the best accuracy / exploration space compromise, thus the model complexity vs. processing time trade-off. The methodology has been developed with the concern of being adaptable to several complexity / accuracy trade-offs. The current version corresponds to a low complexity / general application domain. But we could derive more pertinent models and scheduling according to a specific application domain, design procedure or even architecture style.
The hand-design comparison is useless: essentially the authors compare the DSE against a design with the same methodology but done by hand. For instance I saw no discussion of the various other techniques people use when they do designs by hand, such as pipelined memory accesses, or simply scalar replacement (saving data in registers that is later reused) to improve the performance of the design. These additional modeling and program transformations, I think, would expose the limitations of your models. So overall I think the readers would be better served by knowing up front what the limitations of the models are rather than having to read between the lines to understand that you really are not modeling pipelined execution modes.

The hand-design comparison allows us to study the technology projection accuracy and to check the feasibility of the solutions defined. A comparison with pre-characterized IPs has also been added in the Experiments and Results section (VI-B-3). But once again, such comparisons are not representative of the exploration efficiency: the more we increase the solution pertinence, the more we reduce the exploration space. Discussions about how to enhance such pertinence are provided throughout the paper. This is something we have kept in mind since we started to define the estimation approach, in particular the possibility to associate macro nodes of the specification (a DCT for example) with the characteristics of a specific implementation, in terms of area and performance, on the device considered (Section VI-D).

You also talk about a template design in the manual design. What exactly do you refer to by this?

The RTL template model refers to the architecture model of Fig. 3, Section III-B.

It is not clear in the results section, when you talk about precision, what you are referring to. It sounds like you are comparing the precision of the estimates against the actual synthesis. This should be the case, but then again, given your assumptions about the execution model, I would be surprised if the correlation were not that good. In fact it would be good to understand what the model is actually missing, given that for the limited execution models you should be missing very little.

Precision refers to the accuracy of the area / time estimation values vs. the actual synthesis results. The solutions defined are based on the architecture model and the scheduling strategy exposed in the paper. Errors are computed with respect to the logic synthesis results.

Overall the references are fine with the exception of reference [4]. Kulkarni et al. (reference [4]) does not start with C but with SA-C instead. This is a misrepresentation of the work, as the hard work of extracting data dependences is done by the programmer through the single-assignment semantics. Also, their work only considers the datapath portion of the computation, and given that the kernels they study are very small there is very little effect of high occupancy on the FPGA's clock rate and area, since, like your model, they do not consider resource sharing. So in a way both models are very successful when there is ample space available, but then again, if that is the case, the usefulness of your model in estimating when to stop applying transformations is very limited as well.

References have been checked and corrected.

However, the discussion does seem very long and masks the "actual" contributions of this paper.
In addition, the use of English is poor, because of which the paper needs to be revised / rewritten extensively to ensure an easy read.

The paper has been deeply revised / rewritten and shortened in a way to keep what seemed essential both to us and with respect to the reviewers' feedback.

In the following, the review is organized into separate sections in terms of the various technical and presentation aspects of the paper.

- This paper discusses various aspects of the proposed design techniques and the tool in great detail. While some of the details should be omitted for space concerns, the completeness should be retained.

The paper has been shortened, some details have been omitted, some figures removed.

- Table 1 is useful. It may not be a bad idea to add a row for the proposed technique so that the advantages can be obvious.

Done.

- While there are a number of quantitative and qualitative evaluations of the proposed technique, a performance comparison of
the final design and an existing IP (academic or industrial) is missing. For example, Xilinx provides IP cores for many signal processing kernels. The authors should compare the area and latency performance of these IP cores with the designs generated by the proposed technique.

This has been added in a new section (VI-B-3). The methodology enables another possibility: a node in the H/CDFG can represent an IP processing associated with an implementation described in the technology file. For example, we can define a DCT butterfly node in the graph and describe the butterfly implementation characteristics in terms of FPGA resource occupation and delay. This way, the butterfly characteristics used in the exploration process correspond to an optimal existing implementation and can lead to better estimation accuracy for a full DCT implementation (see VI-D).

- In a similar spirit, the authors should compare the "best" design generated by their tool with the "best" design generated by compilers such as AccelChip or Handel-C. Banerjee et al. have published many papers related to the compilation of MATLAB codes into FPGA designs and the associated design space exploration. This comparison can be both qualitative and quantitative.

Such a comparison of designs does not apply since, in fact, ours is not a design tool. The approach has been defined to provide maximum confidence in the estimation values. This is the case as the solutions are implementable. When the designer makes a choice of a solution, the confidence in obtaining a feasible system is granted. Then synthesis can be performed with a certain optimization potential left to the designer's experience / synthesis tool efficiency. This is the following step of the design flow (exploration => synthesis => optimization), which can avoid several iterations of time-consuming synthesis steps and does not require any design experience to surely reach a first constraint-compliant solution.

- It is not clear if the complete C language is supported. While supporting the complete syntax of C is not required for FPGA-based design, the authors should discuss what features of C are supported.

A paragraph explains the subset of C supported (Section III-A). Code examples are provided in attachment to this file.

- Prasanna et al. discuss techniques and tools based on a domain-specific modeling approach for efficient kernel design for FPGAs. The authors should compare their approach in a qualitative manner with the proposed approach. It will be an interesting contrast, as Prasanna et al. explore at the level of algorithm and architecture, and the authors do so from a given algorithm in C and an architecture of the type discussed in Fig. 6.

A comparison with existing approaches is provided in Section VI-E and Table I.

- It is not clear how the translation from RTL design to physical synthesis takes place. Is it manual?

There is no actual translation from RTL design to physical synthesis. In fact, RTL solutions are defined (but not designed) in terms of resource selection and scheduling. This information is then used to compute reliable estimations of the area occupation (FPGA resource occupation) and execution time. This computation is automatic; the user just has to choose an implementation device.

- In Section III, it says "Those results are gathered (presented) in a 2D graphical representation, where the vertical axis exhibits all the characteristics mentioned above, and the horizontal axis corresponds to...". It is not clear how the Y-axis represents all the parameters.
If so, what does a POINT in the design space mean? Also, op1, opk, etc. should be defined at the beginning of Section III (or forward referenced) when the figure appears.

Clarification of the meaning of "point in the design space": in fact there are several points, one for each estimation parameter (see Section III and Fig. 1); each solution is characterized by a time constraint (one point on the horizontal axis, referred to as Nc). Section III (and every other reference to this) has been corrected to clarify this point.

The paper didn't make clear exactly what the difference was between this work and those previous approaches. Scattered comments are made throughout the previous work section, but it would have really helped me to have a single concise paragraph at the end of that section explaining exactly what the advances are. I had thought that Section VI-D would contain this information, but really, it just summarized the current approach, without relating it to previous approaches.

At the end of Section II (Related Work) we have defined more precisely what the advances and the limitations of our work are compared to existing work. A concluding analysis has also been added at the end of the results section (Section VI-E).

Does your methodology consider the multipliers in the Virtex and the DSP blocks in the Altera chip? You do consider
different implementations of the control unit (ESB vs. logic), but do you consider different implementations of multipliers, for example? Looking at Table II, it looks like the possibility of implementing the multiplier in a hardwired multiplier block is not considered. It doesn't seem it would be hard to add this.

We do consider the multipliers in our approach, but for that study we worked with VirtexE and Apex FPGA devices, which do not integrate multipliers. In our method, considering LUT-based or hard-wired multipliers during the exploration / estimation process makes no difference, just as for LUT-based or dedicated embedded memories. The important point is to define what type of implementation is preferred and has to be considered first. For the memories we first consider embedded memories, and we apply the same policy for multipliers by taking the hard-wired implementation into account first. This is a way to guarantee a maximum usage of all FPGA resources and to keep the logic cell usage (Slices, Logic Elements) for the definition of custom functionalities by the user.

Dr S. Bilavarn - Signal Processing Institute / EPFL - Lausanne, Switzerland
Dr G. Gogniat - Lab. of Electronic and Real Time Systems (LESTER) / Univ. of South Brittany - Lorient, France
Pr J.L. Philippe - Lab. of Electronic and Real Time Systems (LESTER) / Univ. of South Brittany - Lorient, France
Dr L. Bossuet - Lab. of Electronic and Real Time Systems (LESTER) / Univ. of South Brittany - Lorient, France