DESIGN TROTTER: BUILDING AND SELECTING ARCHITECTURES FOR EMBEDDED MULTIMEDIA APPLICATIONS

Yannick Le Moullec 1), Peter Koch 1), Jean-Philippe Diguet 2), Jean-Luc Philippe 2)

1) Center for Embedded Software Systems, Aalborg University, Fr. Bajers Vej 7B, 9220 Aalborg Ø, Denmark, [email protected]
2) LESTER, Université de Bretagne Sud, BP 92116, 56321 Lorient Cedex, France, [email protected]

ABSTRACT
The context of this work is the design of embedded systems targeting multimedia applications. More specifically, we address the problem of design space exploration at high levels of abstraction, early in the design process. In this paper we present Design Trotter, a framework for guiding system designers. The main features of Design Trotter are the characterization of the application by means of metrics, the exploration of the application parallelism by means of dynamic trade-off curves, and performance estimation onto hardware and software targets. We present results on a real-life application showing how the designer can benefit from our approach.

1 INTRODUCTION

The trade-off between energy savings, area usage and real-time constraints, which is required for future embedded systems, for instance in telecommunications, makes it necessary to optimize the usage of silicon. The necessary improvements of the efficiency ratios (MIPS/Watt and Watt/mm²) can only be reached through i) the implementation of massive spatial parallelism, ii) the use of limited clock frequencies, iii) the design of more or less dedicated hardware modules, and iv) the use of state-of-the-art transistor technology. VLSI technology now enables massive integration of processing units and fast communication channels on a single chip [1]. This search for parallelism and dedicated-hardware opportunities constitutes an important task of the design flow. It has to be performed before the complete definition of the target architecture and can be defined as system-level architectural exploration. Secondly, although CAD tools for co-synthesis and co-simulation have reached a reasonable degree of maturity, system-level exploration tools still remain at a research stage.


However, fast estimations based on largely abstract and incomplete specifications (of the application and of the architecture) are vital at system level [2]. Most existing tools provide an exploration based on a fixed architecture (e.g., an embedded processor connected to an FPGA) and a library of functions pre-characterized on the elements of that fixed architecture. Such static libraries usually quantify one software implementation and one or two hardware implementations. Thus, the architectural exploration is bounded by the fixed function granularity and limited to hardware/software partitioning. Another step has to be performed beyond the one described above to really guide or parameterize architectural choices. The work described in this paper deals with this second step. Finally, regarding design decisions it is crucial to get early estimations i) to shorten design delays, ii) to rapidly measure the impact of algorithmic choices or transformations, and iii) to adapt the subsequent architectural choices to the application parallelism. Our work aims to bridge the gap between the specification of a system and the definition of a target architecture (or a set of target architectures) for that system. The exploration of a solution space for embedded systems can have different meanings. Our work focuses on the following aspects: I) exhibition and exploitation of parallelism: parallelism is a key parameter in the design of digital systems; it has a direct impact on several performance factors such as execution time, energy consumption and area. Therefore we seek to explore the potential parallelism of an application in terms of a) type (data-transfer vs. data-processing), b) granularity level, and c) form (spatial, temporal). In our work these aspects are addressed via our graph-based internal representation (detailed in 2.1) and via our estimation technique. Together they rapidly provide a dynamic exploration of an application by means of parallelism vs. delay trade-off curves, on which each point corresponds to a potential architecture. The designer then has access to a set of solutions from which he can choose the most promising ones according to his design constraints.

These solutions can then be refined and mapped to technology-dependent architectural models, as described in section 4 and in [3]; II) target architecture definition: at system level we assume that the target architecture is not yet defined. Therefore we can only consider algorithmic operators (operators performing the operations found in the application specification), since the goal of our work is to guide the architecture definition. At that level, no directive related to a specific technology is introduced. For that purpose we use an abstract model to represent the architecture components (for example, time is expressed in cycles). The model used to describe this abstract architecture is briefly defined in 2.2; III) impact of specification choices: since the core of our estimation framework is based on a fast scheduler, the designer can evaluate very rapidly the impact of algorithmic transformations. This feature is important since it enables the exploration of several algorithmic specifications for a given application. The rest of the paper is organized as follows: section 2 describes the overall flow of the methodology and the associated models. Section 3 outlines some results and shows how the designer can benefit from our approach. In section 4 we discuss existing and future work focusing on high-level estimation onto microprocessors. Finally we conclude in section 5.

2 DESIGN FLOW AND MODELS

[Figure 1. The overall Design Trotter framework: a set of functions (C language) is parsed into hierarchical graphs; function characterization and intra-function estimation, driven by a system-level UAR, produce potential solutions; after selection, solutions are projected onto software targets (software UAR, e.g., ARM10) or hardware targets (hardware UAR, e.g., Xilinx XCV2000E).]

The methodology presented in this paper has been implemented in our framework called Design Trotter, designed at LESTER. Its flow is depicted in Fig. 1.

2.1 System specification
The system to be estimated can firstly be specified, for example, with methods and tools such as Radha Ratan or Esterel Studio; this step is not covered in this paper. The actual entry point of our work is a set of functions specified in the C language. These functions are parsed into our internal representation, HCDFG, detailed in what follows. The choice of the C language has been motivated by the availability of many standard specifications. In fact our framework is an open one, and other languages could be considered in the future.

The HCDFG model: each C function of the specification is a node at the top level of the Hierarchical Control and Data Flow Graph (HCDFG); a function is an HCDFG. There are three types of elementary (i.e., non-hierarchical) nodes: a processing node represents an arithmetic or logic operation; a memory node represents a data-transfer, its main parameters being the transfer direction (read/write), the data format and the hierarchy level; a conditional node represents a test operation (if, case, loops, etc.). Three types of oriented edges are used to indicate scheduling constraints: i) a control dependency indicates an order between operations without data-transfer; it imposes a given order between independent operations or graphs with the intention of favoring resource optimization (data reuse, for instance); ii) a scalar data dependency between two nodes A and B indicates that node A uses a scalar data produced by node B; and iii) a multidimensional data dependency is a data dependency where the data produced is no longer a scalar but an array. A DFG is a graph which contains only elementary memory and processing nodes; namely, it represents a sequence of non-conditional instructions of the C code. A CDFG is a graph which represents a test or a loop pattern with associated DFGs. A HCDFG is a graph which contains only elementary conditional nodes, HCDFGs and CDFGs. A HCDFG example is given in Fig. 2.

2.2 Resources specification
In order to start the estimation processes it is necessary to have information about the type (but not the quantity) of resources available in the abstract architecture model. At system level, the designer defines a set of rules, named "UAR" (User Abstract Rules). The processing part is characterized by the types of available resources (ALU, MAC, etc.) and the operations they can perform; a number of cycles is associated with every type of operator. Regarding the memory part, the user defines the number of levels (Li) of the memory hierarchy and the number of cycles (li) associated with each type of access. For system-level exploration the system-level UAR model is used, whereas the architectural level of Design Trotter uses HW and SW UAR models (cf. Fig. 1). The latter provide accurate execution time and area information; these libraries are obtained either by synthesizing operators on specific FPGAs or, when a software projection is performed, by using a processor specification tool such as Armor [7].
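As an illustration, a system-level UAR could be encoded as follows; the field names and cycle counts in this sketch are assumptions made for the example, not the actual Design Trotter rule-file format.

```python
# A hypothetical encoding of a system-level UAR (the real Design Trotter
# rule file format is not shown in the paper; names and cycle counts here
# are illustrative assumptions).
system_uar = {
    "operators": {
        "ALU": {"ops": ["+", "-", "<<", ">>", "&", "|"], "cycles": 1},
        "MAC": {"ops": ["*", "mac"], "cycles": 2},
    },
    "memory_hierarchy": [
        {"level": 1, "read_cycles": 1, "write_cycles": 1},   # L1
        {"level": 2, "read_cycles": 4, "write_cycles": 6},   # L2
    ],
}

def op_cycles(uar: dict, op: str) -> int:
    """Cycle cost of an operation under the given abstract rules."""
    for spec in uar["operators"].values():
        if op in spec["ops"]:
            return spec["cycles"]
    raise KeyError(f"operation {op!r} is not covered by the UAR")
```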

2.3 Function characterization
The characterization step has two objectives. The first one is to sort the application functions according to their criticality. The criticality of a function is expressed as:

γ = NbOp / CP    (1)

where NbOp is the sum of data-transfer and processing operations, and CP is the critical path of the function, i.e., the length of the longest path considering processing and data-transfer operations. This metric provides a first indication of the potential parallelism of the function. If we consider that the architecture design must be driven by critical functions, it enables the pruning of the specification and the choice of the order in which the remaining functions will be estimated during the inter-function estimation step [5] (the most critical ones first). The second objective of this step is to indicate the function orientation, i.e., the nature of the dominant types of operations. By counting tests, data-transfers and processing operations within the HCDFG representation we obtain ratios which indicate the function orientation. We now give a general definition of these metrics; a more extensive description can be found in [6].

2.3.1 Memory Orientation Metric (MOM)
This metric is defined by the following formula:

MOM = Nm / (Nm + Np)    (2)

where Nm is the number of memory accesses and Np the number of processing operations. MOM indicates the frequency of memory accesses in a graph. MOM values are normalized in the [0;1] interval. The closer MOM is to 1, the more the function is considered data-access dominated. Therefore, in the case of hard time constraints, high-performance memories are required (large bandwidth, dual-port memory, etc.), as well as an efficient use of the memory hierarchy and of data locality.

2.3.2 Control Orientation Metric (COM)
To calculate this metric, test operations (i.e., comparison and equality operators such as ==, !=, <, >) must be identified. COM is defined by the general formula:

COM = Nc / (Np + Nc + Nm)    (3)

where Nc is the number of non-deterministic tests (i.e., tests which cannot be eliminated at compile time), Nm the number of memory accesses and Np the number of processing operations. COM indicates the appearance frequency of control operations in a CDFG or HCDFG (there is no test within a DFG). COM values are normalized in the [0;1] interval. The closer COM is to 1, the more the function is control dominated and hence needs complex control structures; it also indicates that pipelining is not efficient for such functions.

[Figure 2. A HCDFG example.]
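For illustration, the three metrics can be computed directly from the operation counts of a graph; the sketch below merely restates definitions (1)-(3) and is not the tool's implementation.

```python
def characterize(np_ops: int, nm_ops: int, nc_ops: int, critical_path: int):
    """Compute criticality (1), MOM (2) and COM (3) from operation counts.

    np_ops: processing operations; nm_ops: memory accesses;
    nc_ops: non-deterministic tests; critical_path: length in cycles of
    the longest processing/data-transfer path.
    """
    nb_op = np_ops + nm_ops                      # data-transfers + processing
    gamma = nb_op / critical_path                # (1) average parallelism
    mom = nm_ops / (nm_ops + np_ops)             # (2) memory orientation
    com = nc_ops / (np_ops + nc_ops + nm_ops)    # (3) control orientation
    return gamma, mom, com

# A function with 60 processing ops, 140 memory accesses, 10 remaining
# tests and a 25-cycle critical path is data-access oriented:
print(characterize(60, 140, 10, 25))   # (8.0, 0.7, 0.047...)
```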

2.4 Estimations
Once the functions have been characterized they have to be estimated. Because of the size of the design space, the estimation process is divided into two levels: the system level and the architectural level. The first is covered by the intra-function estimation step, which is based on an abstract architecture model in order to rapidly explore a large set of parallelism options. The architecture model is abstract since, at this point of the design trajectory, the architecture is not yet defined; system-level exploration therefore corresponds to an algorithmic-level exploration of the parallelism. The second level (corresponding to the HW and SW projection steps in Fig. 1) uses technological libraries to obtain accurate estimation results. The algorithms developed for the system-level exploration are presented in what follows; software estimation is discussed in section 4, and information about hardware projection can be found in [3].

2.5 DFG estimation: adaptive scheduling
Given the information contained in the graph and the designer rules (UAR), the goal of the method is to minimize the quantity of resources and the bandwidth requirement for several time constraints. The scheduling principle is a time-constrained list-scheduling heuristic where the number of resources of type i allocated at the beginning is given by the lower bound:

Nb_resources_i = Nb_operations_i / Tc    (4)

where Tc is the number of cycles allocated to the estimated function. The method includes three sub-algorithms: processing first, data-transfer first and mixed. Scheduling the most critical type of operations first reduces the quantity of resources used for that type; this implies further constraints for the scheduling of the other type. More information can be found in [11].
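A minimal sketch of this principle is given below, under simplifying assumptions: one resource class per operation, latencies taken from the UAR, and no priority function (the actual scheduler orders ready operations with priority functions and applies the three variants mentioned above).

```python
import math
from collections import defaultdict

def lower_bound_allocation(ops_by_type: dict, tc: int) -> dict:
    """Initial allocation per formula (4), rounded up to whole resources
    (the rounding is our assumption; at least one resource per used type)."""
    return {t: max(1, math.ceil(n / tc)) for t, n in ops_by_type.items()}

def list_schedule(nodes, preds, latency, alloc, tc):
    """Minimal time-constrained list-scheduling sketch.

    nodes:   dict op_name -> resource type (e.g. "ALU", "MEM")
    preds:   dict op_name -> set of predecessor op_names (data dependencies)
    latency: dict resource type -> cycles (from the UAR)
    alloc:   dict resource type -> number of allocated resources
    tc:      cycle budget
    Returns op start times, or None if the budget cannot be met (a caller
    could then increase the allocation and retry, yielding one point per
    budget on the parallelism vs. delay trade-off curve).
    """
    start, done_at = {}, {}
    busy = defaultdict(list)              # resource type -> end times
    t = 0
    while len(start) < len(nodes) and t <= tc:
        for op, rtype in nodes.items():
            if op in start:
                continue
            # all predecessors must have finished by cycle t
            if any(p not in done_at or done_at[p] > t
                   for p in preds.get(op, ())):
                continue
            if sum(1 for end in busy[rtype] if end > t) < alloc[rtype]:
                start[op] = t
                done_at[op] = t + latency[rtype]
                busy[rtype].append(done_at[op])
        t += 1
    return start if len(start) == len(nodes) else None
```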

2.6 CDFG estimation
CDFG estimation relates to estimating loops (for, while, do-while) and branching structures (if-then-else, switch-case).

2.6.1 Loop scheduling
The general scheme used to estimate a loop is as follows: 1) the three parts of the loop (evaluation, core and evolution) are estimated using DFG scheduling (cf. 2.5) and CDFG combinations (cf. 2.7); 2) the loop "pattern" is estimated by performing a sequential combination of the three parts; 3) the whole loop is estimated by repeating the loop pattern N times (with N the number of iterations). However, in order to fully explore the available parallelism, we have developed techniques to unfold loops; these are presented in what follows. Unfolding a loop exposes its potential parallelism and thus allows a reduction of the function critical path, which can be helpful for successfully scheduling a function under a given time constraint. However, loop unfolding is limited by data dependencies. We distinguish two types of dependencies: 1) memory dependencies and 2) data-flow (a.k.a. true) dependencies.

Memory dependencies: this type of dependency can be eliminated by using structural transforms. In general, a loop of size N, unfolded L times, is transformed into an L-branch tree and its critical path is given by:

CP_L = N/L + log2(L)    (5)

Data-flow dependencies: here, research usually focuses on the unfolding factor α_mgpr necessary to obtain the optimal rate (i.e., the unfolding factor which maximizes the throughput). In [8], it is shown that α_mgpr can be computed as:

α_mgpr = (Tmax · Tcr / GCD(Δcr, Tcr)) · (Δcr / GCD(Δcr, Tcr))    (6)

where Tmax is the latency of the slowest operator, Δcr the number of delays and Tcr the accumulated latency of the critical cycle, and GCD denotes the greatest common divisor. Our problem is different: we have to find the minimum unfolding factor that allows a loop to be scheduled under a time constraint T. Therefore, we have reformulated the problem as follows: if a loop requires T″ cycles to be scheduled without unfolding, we must find the unfolding factor needed to gain G = T″ − T cycles. We can show that the number of cycles obtained after unfolding a loop with a factor α is computed as:

T″ = T + (α − 1) · d_min    (7)

where d_min is computed as:

d_min = max over all cycles (Tcr / Δcr)    (8)

Finally, the unfolding factor α needed to gain G cycles is given by:

α = 1 / (1 − G / (T″ − d_min))    (9)
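As a small worked illustration of formula (5), the following sketch searches for the smallest unfolding factor that brings the critical path of a memory-dependency-limited loop under a cycle budget; restricting the search to powers of two is an assumption made for simplicity.

```python
import math

def critical_path_after_unfolding(n: int, l: int) -> float:
    """Formula (5): a loop of size N unfolded L times becomes an L-branch
    tree whose critical path is N/L + log2(L)."""
    return n / l + math.log2(l)

def min_unfolding_factor(n: int, cp_budget: float):
    """Smallest power-of-two unfolding factor whose critical path (5) fits
    the cycle budget; returns None if even full unfolding does not fit."""
    l = 1
    while l <= n:
        if critical_path_after_unfolding(n, l) <= cp_budget:
            return l
        l *= 2
    return None

# Example: a 64-iteration reduction with a 20-cycle budget needs L = 4,
# since 64/4 + log2(4) = 18 <= 20, while L = 2 gives 33.
print(min_unfolding_factor(64, 20))   # -> 4
```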

The main points of our loop management policy are: i) loop unfolding is included in the trade-off curve computation as a solution to reach resource lower bounds; ii) data-flow and memory dependencies are both considered, whereas they are usually implemented in frameworks dedicated to different application domains; and iii) the unfolding factor is computed with the aim of meeting time constraints instead of finding the optimal rate.

2.7 HCDFG estimation
Once the DFGs and CDFGs have been estimated, the framework estimates the HCDFGs (i.e., functions). This is done by combining their trade-off curves by means of mutually exclusive, sequential and parallel combinations. As in the previous steps, the goal here is to favor resource reuse. The estimation is carried out hierarchically: it starts with the DFGs, then the CDFGs and finally the HCDFGs, until the dynamic estimation of the top HCDFG is reached. Moreover, we want to obtain the estimates very rapidly in order to explore large design spaces. Considering the large number of points in each trade-off curve, we cannot apply an exhaustive search of the Pareto points. Instead, we have developed a technique based on peculiar points of the CDFG trade-off curves, detailed in [9].
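These combination rules can be sketched as follows, under one reasonable interpretation: sequential composition adds cycle budgets and reuses resources (maximum), parallel composition merges budgets (maximum) and adds resources, and mutually exclusive composition takes the maximum on both axes since only one branch executes. Unlike this sketch, which naively enumerates all pairs, the actual framework works on the peculiar points of the curves [9].

```python
def _merge(curve_a, curve_b, time_rule, res_rule):
    """Combine two trade-off curves; each curve maps a cycle budget to the
    resource set (type -> count) achieving it. Keeps the cheapest resource
    set per resulting budget (total resource count as illustrative cost)."""
    out = {}
    for ta, ra in curve_a.items():
        for tb, rb in curve_b.items():
            t = time_rule(ta, tb)
            r = {k: res_rule(ra.get(k, 0), rb.get(k, 0))
                 for k in set(ra) | set(rb)}
            if t not in out or sum(out[t].values()) > sum(r.values()):
                out[t] = r
    return out

# sequential: budgets add, resources are reused across the two subgraphs
def combine_sequential(a, b): return _merge(a, b, lambda x, y: x + y, max)
# parallel: subgraphs run simultaneously, budgets merge and resources add
def combine_parallel(a, b):   return _merge(a, b, max, lambda x, y: x + y)
# mutually exclusive: only one branch executes, hardware is shared
def combine_exclusive(a, b):  return _merge(a, b, max, max)

a = {10: {"ALU": 2}, 20: {"ALU": 1}}
b = {15: {"ALU": 3, "MAC": 1}, 30: {"ALU": 1, "MAC": 1}}
print(combine_sequential(a, b)[25])   # -> {'ALU': 3, 'MAC': 1}
```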

3 RESULTS

3.1 Object Motion Detection (OMD)
This application performs motion detection of objects recorded by a video camera. It was originally developed by the LIST laboratory of the CEA research center [10]. The typical target architecture is presented in Fig. 3: an intelligent video camera made of a CMOS sensor and a video processor combining a DSP with reconfigurable logic. This application is typically embedded in video cameras and used for parking-lot monitoring (detection of car and person movements), people counting in places such as subways, and so on. We have used a large set of representative input data (from a parking-lot monitoring appliance) to profile the application functions (the profiles are used to fill the probability attributes of the graphs). Fig. 4 illustrates how the OMD application works.

[Figure 3. Motion detection architecture (copyright CEA): a CMOS sensor, an external digital interface and other sensors feed a video processor made of a DSP and reconfigurable logic.]

We now present the intra-function results for two functions of the OMD application and explain how the designer can use them.

The OMD application is made of hundreds of functions. We found that 16 of them are the most critical ones (these are the ones which should be optimized). The results of the OMD characterization are presented in Table 1. The first observation is that all the functions have high MOM values (0.72 on average, i.e., more than 2 operations out of 3 are memory accesses); this is due to the numerous reads of data from the video stream, to the highly hierarchical structure of the application (nested control structures, for example) and to the rather short DFGs. This implies that the OMD application requires either a large local memory (data reuse) or high-end input/output mechanisms (parallel data reading/writing). Next we observe that the γ values differ widely, from 1.27 for convolveTabHisto up to 43.88 for testGravity. Using these values it is possible to sort the functions and to decide in which order they should be estimated. Focusing on the most critical ones first makes it possible to sketch an appropriate architecture and to take reuse into account (the less critical functions may be able to reuse the hardware allocated to the most critical ones). Finally, the COM values lie between 0 and 0.32, which denotes that control is not dominant (most of the tests in the application are deterministic).

[Figure 4. OMD motion detection running. Left: video from the camera; center: background detection; right: moving objects detection.]

TestGravity: this function is composed of 378 lines of C code, translated into 2408 lines of HCDFG; the corresponding graph is made of 200 sub-graphs. The results obtained for system-level estimation are presented in Fig. 5. As indicated by the γ metric, this function has a good speedup potential. We have chosen not to show all the possible solutions, since the number of resources required for very high speedup factors is extremely high. The most expensive solution shown permits a speedup factor of almost 12, using an architecture able to perform simultaneously 11 ALU-like operations, 7 multiplications and 43 data reads/writes. At the other end, the cheapest solution requires only one operation of each type at a time, but leads to a much longer execution time. Table 2 shows the results of the hardware projection of three solutions onto the Xilinx V400EPQ2 FPGA. The selected solutions are solution 21 (no speedup), solution 11 (speedup = 2.05) and solution 1 (speedup = 11.85). For each solution the estimated execution time is given (in ns), as well as the estimated numbers of Logic Cells (LC) and Dedicated Cells (DC).

Function name          γ      MOM [0,1]  COM [0,1]
Ic_testGravity         43.88  0.78       0.22
Ic_label               10.31  0.74       0.07
Ic_changeBackground     5.62  0.76       0.03
Ic_reconstDilat         4.75  0.65       0.32
Ic_dilatBin             4.69  0.70       0.02
Ic_histoThreshold       4.00  0.64       0.29
Ic_envelop              3.91  0.66       0.13
Ic_absolute             2.60  0.71       0.08
Ic_thresholdAdapt       2.20  0.75       0.08
Ic_convolveTabHisto     1.27  0.70       0.03
Ic_div                  1.25  0.73       0.00
Ic_getHistogram         1.22  0.75       0.00
Ic_setValue             1.14  0.78       0.00
Ic_add                  1.11  0.75       0.00
Ic_sub                  1.11  0.75       0.00
Ic_erodBin              1.10  0.73       0.01

Table 1. OMD characterization

Solution  #Cycles    Speedup  #ALUs  #Mults  #Memory R/W
1         2414976    11.85    11     7       43
2         4302848     6.65     7     5       23
3         8009992     3.57     5     4       13
4         8547816     3.35     5     4       13
5         9623464     2.97     5     4       13
6         10161288    2.82     5     4       11
7         10699112    2.67     5     4        9
8         11236936    2.55     5     4        9
9         11774760    2.43     5     4        8
10        12312584    2.32     5     4        7
11        13926056    2.05     5     4        7
12        16615176    1.72     5     4        7
13        26833832    1.07     5     4        7
14        27371656    1.05     5     4        6
15        27909480    1.02     4     3        5
16        28447500    1.01     4     3        4
17        28447503    1.01     3     2        3
18        28592958    1.00     2     2        3
19        28592960    1.00     2     2        2
20        28606419    1.00     1     1        2
21        28606421    1.00     1     1        1

Figure 5. TestGravity trade-off curve (the chart plots speedup and resource counts against the cycle budget; the underlying solutions are tabulated above).

Solution  Time (ns)   Nb LC  Nb DC
1         46178247    296    36
11        266932793   253    29
21        547212227   200    18

Table 2. TestGravity projection on Xilinx V400EPQ2

Label: this function is about 200 lines of C code and 1200 lines in the HCDFG description. The estimation results are presented in Fig. 6. The γ value for this function is lower than for the testGravity function, which is corroborated by the maximum speedup factor of only 2.2. Table 3 shows the results obtained for the hardware projection of the label function onto the Xilinx V400EPQ2 FPGA. The selected solutions are solution 13 (no speedup), solution 8 (speedup = 1.87) and solution 1 (speedup = 2.20). Once again this shows how the designer can explore the design space: by referring to the metric values and trade-off curves for each function of his application, he is guided in his choices for building or selecting an appropriate architecture.

Solution  #Cycles   Speedup  #ALUs  #Mults  #Memory R/W
1         2623496   2.20     5      2       12
2         2690825   2.15     5      2        8
3         2690827   2.15     5      2        7
4         2693388   2.14     4      2        7
5         2693390   2.14     4      2        6
6         2693392   2.14     3      2        6
7         2693394   2.14     3      2        5
8         3082248   1.87     3      2        5
9         5510929   1.05     3      1        5
10        5773075   1.00     3      1        3
11        5775636   1.00     2      1        3
12        5775638   1.00     2      1        2
13        5775640   1.00     1      1        1

Figure 6. Label trade-off curve (the chart plots speedup and resource counts against the cycle budget; the underlying solutions are tabulated above).

Solution  Time (ns)   Nb LC  Nb DC
1         50219189    412    21
8         59081399    368    14
13        110482217   192    13

Table 3. Label projection on Xilinx V400EPQ2

4 SOFTWARE PROJECTION

In the previous sections we have presented the system-level estimation step of our flow, as well as some results for the hardware projection step onto the Xilinx V400EPQ2 FPGA. Although hardware implementations (on ASICs, FPGAs, ...) can provide good computational efficiency for some functionalities, since they can be tailored to the considered algorithm, microprocessors (µP) offer many advantages and are better adapted to certain types of applications:
• flexibility: µPs can execute most types of functionalities, and systems built upon µPs can be easily updated and maintained;
• µPs can run operating systems, which are found in virtually all embedded systems in one form or another;
• µPs can support low-power techniques such as dynamic voltage and frequency scaling;
• cost: microprocessors for embedded systems are widely available and cheap;
• development time is normally shorter for µPs than for ASICs, for instance.

All these features explain why modern embedded systems almost always include (at least) one microprocessor (General Purpose Processor and/or Digital Signal Processor). Hence it is crucial for the designer of such systems to be able to evaluate, rapidly and easily, the potential performance of his application, or parts of it, on microprocessor(s). Therefore we have started investigating estimation techniques for inclusion in the Design Trotter framework. Our objective is to provide designers with the capability to estimate, at a high level, the execution time (E.T.), utilization factor (U.F.), memory usage (M.U.) and energy consumption (E.C.) of the application's functions on a processor. In the context of design space exploration, one of the main issues is to be able to compare, for a fixed algorithm, its performance on different targets, and, for a fixed target, to compare the performance (E.T., U.F., M.U. and E.C.) of different implementations of an algorithm or functionality. Moreover, since the design space of an embedded system can be huge, a low response time is an important feature. A tool offering these possibilities must therefore abstract away low-level implementation details (e.g., transistor switching) in order to be fast, and at the same time be accurate enough to enable safe comparisons between several solutions. Since microprocessors for embedded systems include more and more complex features (pipelines, caches, dynamic voltage and frequency scaling) that have major impacts on performance, it is crucial to take such parameters into account.

4.1 First approach
In a previous work [4] we developed an estimator that extends the intra-function estimation step. The idea is to specialize the abstract architecture model (UAR file) using the microprocessor description language Armor [7]. Compiling the microprocessor model generates a number of files which describe the instruction set of the processor, the potential parallelism between instructions and the delays between instructions. After mapping the operations found in the HCDFG (generated from the C code of the application) to the processor instructions, a resource-constrained version of the estimation scheduler is used to take the limited parallelism of the processor into account. Finally, energy consumption is estimated by associating the processor unit activities with a library characterizing the power consumption of the units, as sketched below.
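The energy part of this first approach can be illustrated by the following sketch; the unit names and per-activation energy values are hypothetical placeholders, not figures from the Armor model or from Design Trotter's libraries.

```python
# Illustrative sketch of the energy-estimation principle described above:
# unit activity counts produced by the resource-constrained scheduler are
# weighted by a per-unit energy characterization. All names and numbers
# are hypothetical assumptions made for this example.
ENERGY_NJ_PER_ACTIVATION = {"alu": 0.8, "multiplier": 2.4,
                            "register_file": 0.3, "memory_port": 1.5}

def estimate_energy_nj(activity_counts: dict) -> float:
    """Sum unit activations weighted by their characterized energy cost."""
    return sum(ENERGY_NJ_PER_ACTIVATION[unit] * count
               for unit, count in activity_counts.items())

# e.g., a scheduled function activating the ALU 1200 times, the multiplier
# 300 times and a memory port 900 times:
print(estimate_energy_nj({"alu": 1200, "multiplier": 300,
                          "memory_port": 900}))   # -> 3030.0 nJ
```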

4.2 Improving software projection
Although the previous method enables fast comparisons between several algorithm/processor combinations, it does not capture the influence of features such as pipeline stalls, cache misses, operating system overhead and the possible availability of dynamic voltage/frequency scaling. We are currently investigating other methods upon which we could expand our existing estimator, or devise a new one, in order to take the features previously mentioned into account. An interesting approach for estimating the execution time of C programs at source level is described in [12]. The methodology uses a formal hierarchical analysis of the code structure and a mathematical model of the timing features with profiling/statistical information. The same authors are currently investigating the timing estimation of libraries [13].

5 CONCLUSION

In this paper we have presented the Design Trotter tool. The results produced by the Design Trotter framework provide very useful information to the designer. First, the characterization step indicates the available average parallelism and the processing and data-transfer orientation of the application functions. This orientation information has two purposes: 1) it can be used by the designer to select a type of architecture (GPP, DSP, ASIC, ...) and 2) it is used to guide the intra-function step. Secondly, the intra-function step exhibits and explores the potential parallelism of the functions for several time constraints. Our method permits the selection of the most appropriate scheduling algorithm (processing first, data-transfer first or mixed); this selection takes into account the most critical aspects of a function. The estimation process starts with the parallelism exploration of DFGs (equivalent to C basic blocks). Then higher granularity levels of parallelism are explored hierarchically, with a bottom-up approach. The parallelism vs. cycle budget trade-off curves are computed while favoring the reuse of resources and taking into account parameters such as loop unrolling and the mutual exclusions due to conditional statements. The resulting delay/resources trade-off curves also indicate the parallelism lower and upper bounds of the application; thus, the designer can choose or scale his architecture according to these parallelism bounds. Thirdly, the software projection step enables very rapid evaluation of the performance of the application functions on microprocessors. We have also considered extending this work; this is currently our main research topic, which we expect to report on in detail at the conference.

The estimates produced by the Design Trotter tool provide relevant information very rapidly, which helps the designer take decisions very early in the design process in order to build or select the system architecture. The applicability of the method has been demonstrated on a real-life example, which illustrates how the designer can explore the design space by referring to the trade-off curves and using the hardware projection step [3], which enables the refinement of the estimations onto FPGA architectural models.

6 REFERENCES

[1] W. R. Davis, N. Zhang, K. Camera et al., "A Design Environment for High Throughput, Low Power Dedicated Signal Processing Systems", CICC'01, San Diego, USA, May 2001.
[2] J. Plantin and E. Stoy, "Aspects on System Level Design", CODES'99, Rome, Italy, 1999.
[3] S. Bilavarn, G. Gogniat, J.-L. Philippe, and L. Bossuet, "Fast Prototyping of Reconfigurable Architectures From a C Program", ISCAS'03, Bangkok, Thailand, 2003.
[4] Y. Le Moullec, J.-Ph. Diguet, and P. Koch, "A Power Aware System-Level Design Space Exploration Framework", DDECS'02, Brno, Czech Republic, 2002.
[5] Th. Gourdeaux, J.-Ph. Diguet, and J.-L. Philippe, "Design Trotter: Inter-function Cycle Distribution Step", RTS'03, Paris, France, 2003.
[6] Y. Le Moullec, N. Ben Amor, J.-Ph. Diguet, J.-L. Philippe, and M. Abid, "Multi-granularity Metrics for the Era of Strongly Personalized SOCs", DATE'03, Munich, Germany, 2003.
[7] F. Charot and V. Messé, "A Flexible Code Generation Framework for the Design of Application Specific Programmable Processors", CODES'99, Rome, Italy, 1999.
[8] D.-J. Wang and Y. H. Hu, "Rate Optimal Scheduling of Recursive DSP Algorithms by Unfolding", IEEE Trans. on Circuits and Systems, vol. 41, pp. 672-675, October 1994.
[9] Y. Le Moullec, J.-Ph. Diguet, and J.-L. Philippe, "Design Trotter: a Multimedia Embedded Systems Design Space Exploration Tool", MMSP'02, St. Thomas, US Virgin Islands, 2002.
[10] L. Letellier and E. Duchesne, "Motion Estimation Algorithms", Technical report, L.C.E.I., C.E.A., Saclay, France, 2001.
[11] Y. Le Moullec, J.-Ph. Diguet, D. Heller, and J.-L. Philippe, "Fast and Adaptive Data-Flow and Data-Transfer Scheduling for Large Design Space Exploration", GLSVLSI'02, New York, USA, 2002.
[12] C. Brandolese, W. Fornaciari, F. Salice, and D. Sciuto, "Source-Level Execution Time Estimation of C Programs", CODES'01, Copenhagen, Denmark, 2001.
[13] C. Brandolese, W. Fornaciari, F. Salice, and D. Sciuto, "Library Functions Timing Characterization for Source-Level Analysis", DATE'03, Munich, Germany, 2003.
