An Automatic Approach to Generate Haste Code from Simulink Specifications

Maurizio Tranchero and Leonardo M. Reyneri
Dipartimento di Elettronica, Politecnico di Torino
C.so Duca degli Abruzzi, 24, 10129 Torino, Italy
Email: [email protected]
Email: [email protected]

Arjan Bink and Mark de Wit
Handshake Solutions
High Tech Campus 12, 5656 AE Eindhoven, The Netherlands
Email: [email protected]
Email: [email protected]

Abstract
One of the reasons that prevent digital designers from adopting asynchronous design methodologies is the lack of high-level design tools for asynchronous design. Nowadays, it is quite common to use Simulink, from The Mathworks, as a modeling tool and then to synthesize the developed diagram into RTL code automatically. In the synchronous domain some tools are able to synthesize such models. Until now, however, such tools were not available in the asynchronous domain. Our work aims at filling this gap using the Haste language and the TiDE tools from Handshake Solutions to facilitate mapping Simulink models onto self-timed circuits. The proposed solution is based on the CodeSimulink environment, a co-design tool able to synthesize systems composed of hardware and software parts from Simulink descriptions. A model of a commercial audio CODEC has been converted using this approach, showing a significant reduction in development time.

1. Introduction
Model-based design [1], [2] is nowadays a common approach for developing digital designs, since it helps the designer to reduce development time. It also helps in maintaining consistency between the top-level specification of the design and the actual implementation. Most of the tools available on the market in this field target a synchronous implementation. In recent years we have seen a growing interest of both academia and industry in the application of asynchronous circuits [3], due to the many advantages they provide, such as lower power consumption, adaptation to process variability, reduction of electromagnetic emission, and modularity of the design [4]. Widespread use of asynchronous design is, however, hampered by the limited availability of design tools and by the fact that such tools are often different from their synchronous counterparts. Our aim is to provide designers with tools for model-based design that are comparable with the synchronous alternatives and are well integrated with existing design flows. By doing this we enable the application of self-timed circuits for a wider design community.

This paper introduces the work done to integrate a Simulink-based co-design environment (CodeSimulink, developed at Politecnico di Torino, see Sec. 3.1) with an IC design flow for asynchronous circuits (Handshake Solutions TiDE, see Sec. 3.2). The paper is organized as follows: we start by introducing Simulink-based design (Sec. 2), the two environments used in this work (Sec. 3), and the reasons that led us to use these environments in our approach. We then highlight some of the design decisions that were made during the integration of the two environments (Sec. 4). Next we describe the developed design flow (Sec. 5) and provide some results for a commercial audio chip used as a benchmark (Sec. 6). In the last section we present possible future work and further optimizations.

2. Simulink-Based Design
In the commercial field, designers often use Simulink-based development [5] to reduce time-to-market, thanks to the flexibility of the environment in describing and simulating heterogeneous systems. To help engineers save time, many tools are available to automatically convert a Simulink model into a target application. For software implementation, The Mathworks provides a well-known Simulink-to-C compiler: Real-Time Workshop. For hardware implementation, there are several options [6], [7], [8] for generating synthesizable HDL code from a Simulink model. All these tools are based on the translation of a Simulink diagram into RTL code, so they use Simulink as a graphical front-end to their own development environment.

The main problem, however, is that Simulink is based on dataflow computation [9], which is not one-to-one compatible with an RTL implementation. Another problem is maintaining consistency between the model and the generated code, since the conversion has to be performed by hand and can introduce errors. For the conversion the designer uses two orthogonal block sets with different functions and meaning. When starting a new system, the designer describes it with the target-specific block set, simply describing RTL hardware at a graphical level. This is still a good way to reduce development time, but it is less expressive than the dataflow language itself. Furthermore, as the usage of Simulink becomes more common, it also becomes more likely that a diagram of the desired system already exists. In a standard RTL environment, such a diagram has to be manually converted into another one, which requires special attention to ensure that the functionality of the system remains unchanged.

Another Simulink-based development environment is CodeSimulink [10], [11]. This tool, developed by Politecnico di Torino, stands out from other environments by offering the ability to describe, implement and simulate heterogeneous systems composed of hardware and software. It does so in a transparent way thanks to its implementation of the dataflow paradigm and its platform independence (see Tab. 1). Since we want to maintain the computational model used in Simulink diagrams, CodeSimulink gives us a tool which is fully Simulink compatible and guarantees that a model developed in pure Simulink can be synthesized in hardware.

Table 1. Comparison of some Simulink-to-hardware compilers available on the market.

Tool               RTL   DF    Library      Platform
DSP Builder        yes   no    custom       Altera
System Generator   yes   no    custom       Xilinx
HDL Coder          yes   no    compatible   Independent
CodeSimulink       yes   yes   compatible   Independent

[Figure 1 shows the following stages: Simulink Model, Functional Simulation, HW/SW Partitioning, Software (Real-Time Workshop), Hardware (Digital Hw Compiler), Logic Simulation, Synthesis, P&R, Post P&R Simulation, Target Programming, On-Chip Test.]

Figure 1. CodeSimulink flow.

3. Used Environments
In this section we introduce the design environments we used in our approach, and briefly describe their characteristics and their limitations.

3.1. CodeSimulink
Simulink uses a dataflow specification [9], in which the designer statically specifies the scheduling. The scheduling analysis ensures the correct execution order according to the data dependencies of each elaboration unit. This computation paradigm is well suited to asynchronous circuits, where the concept of data validity is natural [12], and to software, where the execution order guarantees the correctness of the algorithm. The same can be achieved in synchronous design by inserting a protocol part that guarantees the exact scheduling of operations.

Thanks to its adherence to this dataflow model, CodeSimulink is an environment which can generate, starting from a Simulink model, VHDL code for digital hardware (both synchronous and asynchronous) using an internal compiler, and C code for software (through The Mathworks' Real-Time Workshop compiler). It is composed of:
• an ensemble of Simulink-compatible block libraries characterized by implementation-specific parameters (e.g. data representation, number of pipeline stages, physical addresses, ...);
• at least one VHDL implementation for each block (see footnote 1);
• a set of Matlab scripts for the synthesis process;
• the needed interfaces between blocks with different implementations;
• other target-specific files.
All these components are necessary to obtain a complete implementation of the high-level model for both FPGAs and DSPs.

CodeSimulink uses the flow depicted in Fig. 1 for systems based on digital hardware and software components. The flow starts with a system description, made using the Simulink environment and several component libraries. Once the description is ready, we proceed to the partitioning step, where the system is divided into hardware and software. The hardware branch continues with the logic synthesis and "Placement & Routing" phases, both performed using commercial tools that depend on the actual target device. Every step can be checked using simulation, which is fully integrated in the environment and is automatically generated by CodeSimulink in order to keep consistency between the simulation performed at the highest level and the one used at lower levels.

This environment has already been extended to the asynchronous circuit domain [13], but that work focused on FPGA development and therefore lacks support for the back-ends used in ASIC development. To overcome this limitation, we used the TiDE flow offered by Handshake Solutions, which is introduced in the next section.

Footnote 1. For some blocks we have different implementations, e.g. synchronous, asynchronous, bit-serial, ...
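The scheduling analysis mentioned in Sec. 3.1 can be illustrated with a small sketch. It is not CodeSimulink code: the block names and the dependency map below are hypothetical, and Python's standard `graphlib` module merely stands in for whatever analysis the tool actually performs. Each block lists the blocks whose outputs it consumes, and a topological sort yields one execution order that respects every data dependency.

```python
from graphlib import TopologicalSorter

# Hypothetical block diagram: each block maps to the set of blocks feeding it.
diagram = {
    "in_a": set(),
    "in_b": set(),
    "sum": {"in_a", "in_b"},
    "gain": {"sum"},
    "out": {"gain", "in_b"},
}


def schedule(blocks):
    """Return one execution order in which every block follows its producers."""
    return list(TopologicalSorter(blocks).static_order())


def respects_dependencies(order, blocks):
    """Check that each producer appears before every block consuming its output."""
    position = {name: index for index, name in enumerate(order)}
    return all(
        position[producer] < position[consumer]
        for consumer, producers in blocks.items()
        for producer in producers
    )


order = schedule(diagram)
```

Several valid orders usually exist; the sketch only guarantees that the one returned satisfies the data dependencies, which is the property the dataflow model relies on.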

3.2. TiDE Flow
The Timeless Design Environment (TiDE) [14] is a set of tools, provided by Handshake Solutions, that maps hardware descriptions written in the Haste language onto a self-timed gate-level netlist. Haste [15] is a high-level behavioral language that supports asynchronous communication using CSP [16] constructs. Values can be communicated between parallel processes using channels. Channels are objects on which read and write operations are synchronized, so a process that writes on a channel can only complete its communication when the corresponding reader has read the data.

The standard design flow based on TiDE (see Fig. 2) starts with the system description in Haste. This description is converted by the Haste compiler (htcomp) into a behavioral description that can be used to perform functional simulation and can then be mapped onto the desired technology (using htmap). Using the tool htlink, external Verilog netlists can be linked to the netlist generated from Haste. After this operation the resulting netlist is optimized and adapted for the back-end part of the flow. (For further details, please refer to the TiDE manual [14].)

There were several reasons why we chose to integrate CodeSimulink and TiDE:
• CodeSimulink can convert Simulink models into VHDL, but it lacks an ASIC back-end;
• Haste is a powerful language which allows asynchronous circuits to be described easily using its native constructs;
• TiDE can synthesize both RTL code and Haste together.
By integrating these two environments we can cover the design process from high-level modeling and simulation to the physical implementation.

[Figure 2 shows the following stages: Haste description, Synthesis (htcomp), Handshake Circuits, Behavioral Simulation (htview), Mapping (htmap), Verilog Netlist, External Verilog Netlists, Linking (htlink), Netlist Verification, Logical Optimization (htlog), Back-End.]

Figure 2. TiDE flow.

4. Preliminary Analysis
In this section we describe some considerations made during the implementation of Simulink diagrams in Haste. They are related to the Haste language itself and to the characteristics of Simulink.

4.1. Haste Coding Style
Haste relies on a transparent compiler. This means that you get what you described, and for this reason the coding style must be chosen with care in order to obtain the most efficient implementation of the design. To compare different coding styles, we developed a set of small models. One of these models is depicted in Fig. 3 and represents a simple datapath of 6 different operations on two inputs, which can be selected from another input. Although the model is small, it includes several computational blocks that are often used in Simulink diagrams to describe more complex systems.

[Figure 3: block diagram with inputs A, B and S, output O, operator blocks including *, +, > and x3, and signal widths labelled 16, 32, 3 and 2.]

Figure 3. Small design used to test different coding styles in Haste.

4.1.1. Communication. To pass data between modules we can use two different modes: shared variables or channel communications. The implementation of shared variables is relatively cheap, but they require explicit synchronization between readers and writers in order to avoid missed or duplicated data, since registers are shared between the writer and the readers. Channels, on the other hand, automatically synchronize the input and output actions of modules running in parallel and thereby guarantee a correct timing relationship between the read and write actions. Their implementation is more expensive in terms of area than a shared variable (around 1.5%, see Tab. 2 for further details). To keep the conversion of the Simulink model to Haste straightforward, we would like to avoid explicit synchronization between modules. Therefore we choose to use channels instead of shared variables.

A channel is a communication mechanism shared between different objects, with at least one transmitter and at least one receiver. The implementation of a channel relies on the bundled-data approach. This implementation consists of a data part and a control part. The control part takes care of the communication protocol and of the required delay matching of the data part.

The simplest way to describe how the blocks in a Simulink diagram communicate is to use a separate channel for each input and output. This solution is straightforward to implement, but it can be more expensive, since every input has its own control logic. Haste allows the user to group data channels together, thereby sharing handshake control circuitry. Such a multiple-data channel is called a tuple channel. This solution requires less area. It can, however, introduce deadlock, because all the input communications are synchronized together and cannot complete individually. A typical example is the one depicted in Fig. 5: block A needs a complete handshake on its inputs in order to compute; block I needs to wait until all the blocks fed by its output have captured its value before continuing. For this reason, before concluding the communication with A, it needs to wait for the completion of the communication with B. However, B cannot finish its communication with I until it receives data from A, and this can never happen, since A cannot compute until it finishes its input communication with I. So the system is stuck waiting for a condition that will never occur.

Table 2. Comparison among different blocks using channels or shared variables, implemented with or without state. (These results refer to a different implementation of the design depicted in Fig. 3 and are in number of gates.)

Implementation choices               Area [µm²]
Registers   Channels   Variables     Memory     C-gates   Total
X           X          -             15857.6    1441.0    54829.1
X           -          X             15857.6    490.6     54140.9
-           X          -             0          1134.4    44470.9
-           -          X             0          367.9     43683.8

[Figure 4: block I (o!I) feeds block A (i?v ; o!A(v)) and block B (i?[[ao,io]] ; o!B(ao,io)); A also feeds B, and B drives the output O (i?v).]

Figure 4. Example of a Simulink model that can lead to a deadlock (see Fig. 5).

Table 3. Comparison among different coding styles for the design depicted in Fig. 3.

Design: Datapath. Columns: Tuple (Not Used / Used), Registers (Not Used / Used), Area [µm²] (Total/C-gates).
Area [µm²], Total/C-gates, in row order: 11454.9/4804.4, 11883.8/4792.2, 4067.3/438.3, 3670.3/254.5.
[The X marks that assign each row to a Tuple and a Registers setting are not recoverable.]
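The deadlock argument of Sec. 4.1.1 can be checked mechanically with a small sketch. This is only an illustration, not part of the TiDE tools: the event names are hypothetical labels for the three conditions discussed in the text, and each event lists the events that must happen before it. A cycle among these prerequisites means that none of the events can ever occur.

```python
from graphlib import CycleError, TopologicalSorter


def find_deadlock(prerequisites):
    """Return a cyclic chain of events if one exists, otherwise None."""
    try:
        TopologicalSorter(prerequisites).prepare()
    except CycleError as err:
        # graphlib stores the offending cycle as the second exception argument.
        return err.args[1]
    return None


# Tuple input on B: its single input handshake needs the data from A as well.
tuple_inputs = {
    "I->A handshake done": {"I->B handshake done"},  # I serves all readers together
    "I->B handshake done": {"A output sent"},        # tuple waits for A's data
    "A output sent": {"I->A handshake done"},        # A computes after its input
}

# Separate inputs on B: the handshake with I completes on its own.
separate_inputs = {
    "I->A handshake done": {"I->B handshake done"},
    "I->B handshake done": set(),
    "A output sent": {"I->A handshake done"},
}

tuple_cycle = find_deadlock(tuple_inputs)
separate_cycle = find_deadlock(separate_inputs)
```

Under these assumptions the tupled description yields a cycle through all three events, while the description with separate channels has none, which matches the behavior described for Fig. 5.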

4.1.2. Functions or Procedures. A module in Haste can be described as a fully combinational block or as a block with registers (Fig. 6). Dataflow networks usually do not include decoupling stages (since data is processed from input to output continuously). However, decoupling stages (i.e. registers or latches) can be required in order to increase system throughput. The results presented in Tab. 3 show a large difference in area between the two implementations. Since there is a design trade-off between area and speed, the designer can choose the desired implementation.

4.1.3. Register Placement. As previously mentioned, Simulink models do not have the concept of registers that is usual in digital design. Most standard blocks perform operations regardless of the concept of time. Only a few blocks are related to timing events; we will come back to these blocks later. Registers are necessary to achieve performance, but we have to decide where to insert them. Since each Simulink block has only one output, whereas it can have more than one input, it is natural to insert registers on the output in order to optimize area. Using the Haste language it is difficult to describe such an implementation, since data received from one or more input channels has to be stored into registers, and this results in latching the inputs. In the present version of the TiDE flow (5.2) the compiler puts registers where the designer has inserted them in the Haste description. In the future release (6.0), the compiler can optimize the number of registers automatically, given the required number of decoupling stages. For this reason we choose to use the more common way to describe modules (with input registers) and let the compiler decide where to put them.

    B: process ( i0 ? chan [0..255]
               & i1 ? chan [0..255]
               & o  ! chan [0..255] ).
    begin
      & ao : var [0..255]
      & io : var [0..255]
    |
      forever do
        ( i0 ? ao || i1 ? io )
        ; o ! b( ao , io )
      od
    end
    (a)

    B: process ( & i ? chan [0..255]
               & o ! chan [0..255] ).
    begin
      & ao : var [0..255]
      & io : var [0..255]
    |
      forever do
        i ? [[ ao , io ]]
        ; o ! b( ao , io )
      od
    end
    (b)

[Panels (c) and (d) are sequence diagrams of the request/acknowledge events (R+, R1+, A1+, R1-, A1-, R2+, A2+, R2-, A2-, Ra+) exchanged by the processes o!I, i?v, o!A(v), i?[[ao,io]] and o!B(ao,io). In (d) the annotations read: "waiting for Ab+", "waiting for Ra+", "Ra+ cannot be generated", "Input HS not yet completed".]

Figure 5. A valid Simulink-like diagram (Fig. 4) can be described using separate input channels (a) or with a tupled input (b); the latter can lead to a deadlock, as shown by the sequence diagram that describes its behavior (d), while the former works correctly (c).

4.2. Sampling Blocks
There are three main blocks in Simulink that deal with a fixed sampling time: the “unit delay”, the “zero order hold”, and the “rate transition” (Fig. 7). These blocks are often used to change the input-to-output data rate of a given function, especially when the system has to deal with interfaces providing (or requiring) data at slower (or faster) data rates. Such blocks are also used when it is necessary to explicitly insert a storage element in a design (e.g. for an accumulator). The “unit delay” block acts as a memory element, which can also oversample the input data in order to increase the output data rate. The “zero order hold” block can reduce the output data rate. Finally, the “rate transition” block is a superset of the previous two. There are also other blocks (like Buffer and Unbuffer) that are not taken into account now, since they are used less frequently than the ones cited above.

These blocks are often used in Simulink diagrams for two main purposes:
• introducing an explicit storage element in a design (e.g. an accumulator, a decoupling register in loops, ...);
• adapting different rates in multi-rate systems (e.g. a high-speed ADC or DAC interfaced with lower-speed circuitry, or vice versa).

In the synchronous implementation, these blocks need to both sample and generate data at a given time, according to their parameters (input and output sampling time), to guarantee the same behavior as the Simulink model. The asynchronous version needs the same behavior, which can be achieved in two different ways:
• to introduce, in each of these blocks, a clock signal which can be used to derive the desired timing relationships;
• to move the clock interface only to the input blocks,

after which the following sampling block will just have to up-sample (or down-sample) the incoming data, which was already synchronized.

Since the first solution introduces a lot of interaction with the clock, it results in more area overhead, whereas the second one keeps clock interactions only at the boundary of the system. For this reason we prefer the latter choice, even though both options are available and can be generated automatically from the same model.

    & sum2 = func ( & A ? var [255..0]
                  & B ? var [255..0] ): [511..0].
      ( A + B ) fit [511..0]
    (a)

    & sum2 = proc ( & A ? chan [255..0]
                  & B ? chan [255..0]
                  & Y ! chan [511..0] ).
    begin
      & a : var [255..0]
      & b : var [255..0]
    |
      forever do
        ( A?a || B?b )
        ; Y ! (a+b) fit [511..0]
      od
    end
    (b)

Figure 6. Examples of fully combinational logic (a) and logic with registers (b).

[Figure 7 shows three Simulink block icons: Unit Delay (1/z), Zero-Order Hold, and Rate Transition.]

Figure 7. Simulink blocks related to signal sampling: Unit Delay (a), Zero-Order Hold (b), and Rate Transition (c).

5. Simulink to Haste Conversion

5.1. Proposed Flow
Our aim is to integrate TiDE and CodeSimulink in order to automatically synthesize Simulink models without describing them in Haste by hand. Fig. 8 depicts the proposed integrated tool flow. The input to this flow is a description of the desired algorithm in Simulink, using either pure Simulink blocks or CodeSimulink blocks (from the CodeSimulink libraries). At this level of abstraction both simulation and architectural exploration are well supported, with a fast design-evaluation loop. In particular, this support is based on the Simulink environment being very suitable for dataflow design (e.g. filters, streaming data processing, control systems, ...) and on the availability of simulation libraries which help in evaluating the system's behavior (e.g. with respect to the number of bits used in the implementation of each block, the usage of integer, fixed-point or floating-point representation, the presence of signed or unsigned numbers, ...). These choices can be modified at the Simulink level and they affect the simulation, providing the designer with a powerful way to evaluate (manually or with user-developed scripts) the optimal implementation for each block. All these parameters are fully tunable and accessible to the designer, who can easily find the best trade-off between circuit complexity and the accuracy required by the system under development.

If pure Simulink blocks are used, a script can convert the model into a CodeSimulink-compliant one, introducing all the hardware-specific parameters that our environment needs in order to simulate and synthesize the desired hardware. If specified, during this conversion step our scripts can automatically estimate the best characteristics of each block used in the design according to the simulation set in the model.

From this point the CodeSimulink scripts will generate:
• a Haste language file which describes the structure present in the Simulink model;
• a list of all the library files needed for the synthesis;
• a set of VHDL files describing the functionality of each block in the model;
• a set of Cadence RTL Compiler scripts which include all the library files needed and generate from them a synthesizable Verilog version, so that they can be included in the standard TiDE flow.

With these files the standard TiDE flow can start: we first use htcomp and htmap to generate the Verilog netlist implementing the system structure, then we synthesize all the RTL VHDL files into gate-level Verilog ones, and finally we link them together. After this the TiDE back-end flow can continue, optimizing the design and performing the timing analysis necessary to insert the delay chains required in the control path.

At the end of this process, the final netlist is available and can be used to verify the system behavior at this level of abstraction, a step that is also necessary to analyze system performance.

In the next sections we describe how each block is converted into Haste and VHDL to allow automatic synthesis.
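The evaluation of "the number of bits used in the implementation of each block" mentioned in Sec. 5.1 can be illustrated with a short sketch. It is not one of the CodeSimulink scripts: the helper names are hypothetical, and the sketch assumes unsigned integer signals described by their value range, in the spirit of the Haste range types (for example `[0..2^16 -1]`). It propagates ranges through an adder and derives the number of bits the result needs.

```python
def unsigned_range(bits):
    """Value range of an unsigned signal with the given number of bits."""
    return (0, 2**bits - 1)


def add_ranges(a, b):
    """Range of the sum of two signals whose ranges are a and b."""
    return (a[0] + b[0], a[1] + b[1])


def bits_needed(value_range):
    """Smallest unsigned width able to hold every value in the range."""
    low, high = value_range
    if low < 0:
        raise ValueError("this sketch only covers unsigned ranges")
    return max(1, high.bit_length())


# Two 16-bit operands: the sum spans 0 .. 2**17 - 2.
sum_range = add_ranges(unsigned_range(16), unsigned_range(16))
sum_bits = bits_needed(sum_range)
```

The same helper could be fed with the minimum and maximum values observed in a simulation instead of the nominal type range, which is one plausible way to estimate block parameters from simulation data.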

[Figure 8 shows the CodeSimulink front-end flow (Simulink Model, modelConvert, CodeSimulink Model, DigHwCompile producing HDL Code as RTL VHDL, HasteDescription producing Haste Code) followed by the Timeless Design Environment flow (htcomp, Handshake Circuits, htmap, Verilog Netlist; RTL Compiler, Verilog Netlist; htlink; TiDE Flow Back-End).]

Figure 8. Proposed Simulink to Haste flow.

5.2. Block Structure
The structure of a Simulink block implemented in Haste is the one depicted in Fig. 9. In that figure we can see how the block is divided into:
• a functional circuit, which implements the combinational logic function of the block;
• a protocol controller, which coordinates all the operations within the block and describes the elements used for storage.

[Figure 9 shows a block composed of a Protocol Controller and a Functional Block.]

Figure 9. Structure of a Simulink block described in Haste.

The automatic conversion of Simulink models into Haste and Verilog descriptions is based on the CodeSimulink environment. Thanks to this, we reuse everything already developed for that environment in order to reduce both development and debug time. Each Simulink block is automatically converted into Haste code (for controlling the data flows) and into Verilog code (for data elaboration). Haste is used to describe the top module and the communication infrastructure among blocks.

The TiDE flow uses Haste as its main specification language, but it is also possible to insert Verilog components, whereas CodeSimulink uses VHDL as the specification language for its implementation. CodeSimulink uses a library-based approach in which each block has a parametric description that is called when needed in the top module. In order to reuse this extensive library, we decided to use a commercial synthesizer to convert its VHDL behavioral descriptions into gate-level Verilog netlists. The proposed flow merges the CodeSimulink flow and the TiDE flow, as depicted in Fig. 8. We start from a Simulink model (developed with pure Simulink blocks or with CodeSimulink ones). Using the scripts we developed, this model is converted into Haste and Verilog, which are processed differently along the TiDE flow.

In order to reuse the RTL libraries available in CodeSimulink and to guarantee modularity and maintainability, each block is divided into different files. In this way the common part can be shared among blocks without the burden of rewriting existing code. Each module is composed of:
• a structure definition of the design, made of a single Haste file;
• a set of VHDL files, one for each block in the diagram, that describe the RTL behavior of the block.
Not all the blocks have this structure: interfaces with synchronous environments and sampling blocks are quite different, since their function is more related to the protocol than to the processing part (which is usually not present). For this reason such blocks have been described completely in Haste. The following sections provide details on these categories after describing the common parts.

5.2.1. The Haste Shell. According to the ideas presented in Sec. 4, the skeleton on which a block is built needs to pass the input data to the desired logic function, and the results returned by that function to the block output. Figure 10 shows the Haste shell structure for a two-input adder; as can be seen, the structure is straightforward. Each block is represented by a Haste procedure in which each input and each output is listed as an input or an output channel, respectively. In the body of the procedure only the interface operations are performed: inputs are read, and outputs are generated by the external function associated with the block itself. Note the order of execution: the inputs are collected in parallel, and only when all of them are available can the outputs be generated.

    & sim_sum2 = proc ( & Y1 ! chan VECTOR_17
                      & A1 ? chan VECTOR_16
                      & A2 ? chan VECTOR_16 ).
    begin
      & v_A1 : var VECTOR_16
      & v_A2 : var VECTOR_16
    |
      forever do
        // input acquisition
        ( A1 ? v_A1 || A2 ? v_A2 )
        // output generation
        // ( sim_sum2_f is imported )
        ; Y1 ! sim_sum2_f ( . A1 ( v_A1 ), . A2 ( v_A2 ) )
      od
    end

Figure 10. Haste shell for a 2-input adder.

    & sim_ud = proc ( & Y1 ! chan VECTOR_16
                    & A1 ? chan VECTOR_16 ).
    begin
      & v_A1 : var VECTOR_16
    |
      forever do
        // output generation ( oversampled )
        for 5 do ( Y1 ! v_A1 ) od
        // input acquisition
        ; A1 ? v_A1
      od
    end
    (a)

    & sim_zoh = proc ( & Y1 ! chan VECTOR_16
                     & A1 ? chan VECTOR_16 ).
    begin
      & v_A1 : var VECTOR_16
    |
      forever do
        // input acquisition
        for 5 do ( A1 ? v_A1 ) od
        // output generation ( undersampled )
        ; Y1 ! v_A1
      od
    end
    (b)

Figure 11. Haste description of a “unit delay” block (a) and of a “zero order hold” block (b), with an oversampling and an undersampling ratio of 5, respectively.
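The order of execution of the shell in Fig. 10 (parallel input acquisition, then output generation) can be mimicked in software. The sketch below is a loose Python analogue, not generated code: `run_shell` is a hypothetical name, and `queue.Queue` objects are buffered, so they only approximate Haste channels, whose writers block until the reader has taken the data.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def run_shell(inputs, output, func, tokens):
    """Hypothetical analogue of a block shell processing a fixed number of tokens."""
    with ThreadPoolExecutor(max_workers=len(inputs)) as pool:
        for _ in range(tokens):
            # Input acquisition: wait on every input channel in parallel.
            values = list(pool.map(lambda channel: channel.get(), inputs))
            # Output generation: only after all inputs have arrived.
            output.put(func(*values))


a1, a2, y1 = queue.Queue(), queue.Queue(), queue.Queue()
for left, right in [(1, 2), (65535, 65535)]:
    a1.put(left)
    a2.put(right)

# The lambda plays the role of the imported function sim_sum2_f.
run_shell([a1, a2], y1, lambda x, y: x + y, tokens=2)
results = [y1.get_nowait(), y1.get_nowait()]
```

The separation between the shell and the function it calls mirrors the split between the Haste protocol part and the imported RTL function.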

5.2.2. Sampling Blocks. Sampling blocks can have two different implementations: synchronized with a global clock, in order to slow down the circuit operation (to make it operate at a certain sampling time), or completely asynchronous (see Sec. 4.2). In both modes the input data rate can differ from the output one. Using these blocks it is possible to build a multi-rate system in which the data rate is increased (using a unit delay block) or decreased (using a zero order hold block). Figure 11 shows the Haste description of such blocks.
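The rate-changing behavior of the two descriptions in Fig. 11 can be reproduced with a pair of Python generators. This is an illustrative model only: the function names are hypothetical, the initial content of the unit delay register is assumed to be 0 (the Haste listing does not initialize it), and a finite input stream is used so that the generators terminate.

```python
def unit_delay(inputs, ratio=5, initial=0):
    """Emit the stored value `ratio` times, then acquire the next input."""
    held = initial
    stream = iter(inputs)
    while True:
        for _ in range(ratio):
            yield held              # output generation (oversampled)
        try:
            held = next(stream)     # input acquisition
        except StopIteration:
            return


def zero_order_hold(inputs, ratio=5):
    """Acquire `ratio` inputs, then emit only the last one."""
    stream = iter(inputs)
    while True:
        held = None
        for _ in range(ratio):
            try:
                held = next(stream)  # input acquisition
            except StopIteration:
                return               # an incomplete group produces no output
        yield held                   # output generation (undersampled)


oversampled = list(unit_delay([7, 9], ratio=2))
undersampled = list(zero_order_hold(range(10), ratio=5))
```

Chaining the two generators with the same ratio returns to the original token rate, which is a quick way to sanity-check a multi-rate path before looking at the generated Haste.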

5.2.3. RTL Processing Part / Parametric RTL Description. Each block has a set of parameters that can be configured so that the module can deal with different scenarios (serial or parallel input/output representation, different data widths, ...), and all these parameters can be configured in the VHDL description. For each block, an HDL file is generated with all the desired parameters set, together with an RTL Compiler script that can synthesize it into a Verilog netlist.

5.3. Simulink to CodeSimulink Conversion
The typical approach used to develop a design that should be converted into hardware is to build a diagram using CodeSimulink blocks from the start. The advantage of starting with CodeSimulink blocks instead of Simulink blocks is that their simulation behavior matches that of their hardware implementation. Since the CodeSimulink block set is one-to-one compatible with the standard Simulink one, we also provide a conversion utility which automatically converts a pure Simulink model into a CodeSimulink one by setting the parameters needed for the implementation according to the simulation results of the model.

5.4. System Description
Now that we have introduced the structure of each block in the design, we explain how the whole system is described. The main Haste file is composed of different sections (see Fig. 12):
• the definition of the types used across the design;
• the definition of the system interface;
• the import of the external RTL functions;
• the Haste declaration of each block;
• the block instances and connections.

6. Case Study: a Commercial Audio CODEC
To test our methodology we apply it to a Simulink model of a commercial Audio CODEC. Such a model describes one of the two channels in a stereo audio chip implementing a Sigma-Delta modulator [17].

    & VECTOR_16 = type [0..2^16 -1]
    & VECTOR_17 = type [0..2^17 -1]
    & VECTOR_32 = type [0..2^32 -1]

    & datapath = main proc ( & O ! chan VECTOR_32
                           & A ? chan VECTOR_16
                           & B ? chan VECTOR_16 ).
    begin
      // Internal channel declaration
      & Y1_6 : chan VECTOR_16 broad
      // ...
      // External function declaration
      & Sum = func ( & A1 ? var VECTOR_16
                   & A2 ? var VECTOR_16 ): VECTOR_16 . import
      // ...
      // Haste shell description of each block
      & Sum_sh = proc ( & Y1 ! chan VECTOR_17
                      & A1 ? chan VECTOR_16
                      & A2 ? chan VECTOR_16 ).
        begin
          & v_A1 : var VECTOR_16 := 0
          & v_A2 : var VECTOR_16 := 0
        |
          forever do
            ( A1 ? v_A1 || A2 ? v_A2 )
            ; Y1 ! Sum ( . A1 ( v_A1 ), . A2 ( v_A2 ))
          od
        end
      // ...
    |
      // Block instance and connection
      // ...
      || Sum_sh ( . Y1 ( Y1_6 ), . A1 ( Y1_8 ), . A2 ( Y1_3 ))
      // ...
    end

Table 4. Synthesis result comparisons of the same Simulink model in different implementations. The designs have been implemented using a 180 nm technology library.

Design              Hand written    Automatic Generated   Automatic Generated
Tool                TiDE 5.2        TiDE 5.2              TiDE 6.0
Sequential [µm²]    32018           89792                 11632
Logic [µm²]         138244          357368                152468
Total [µm²]         173694          468746                164100
Overhead            -               +170%                 -5.5%
Coding time         about 1 week    20 minutes            20 minutes

Figure 12. Example of the Haste code generated for the main procedure.

This model is quite complex, since it is composed of about 150 blocks, including about 30 16-bit multiplications by constant values, 15 8-bit multipliers, and 30 16-bit adders. It has been used to develop a hand-written implementation in Haste. Thanks to the collaboration with an industrial partner, we had access to the synthesis results of this asynchronous hand-written version and could compare them with the Haste version generated by our tool. Both versions are compared on optimized pre-layout netlists mapped onto the same technology library. The results of this analysis are reported in Tab. 4. In this table we compare the hand-written Haste code with two versions of the automatically generated one: the first was passed through version 5.2 of the TiDE flow, while the second was processed with the new pre-release version (6.0). Unfortunately, it was not possible to compile the hand-written version with the TiDE 6.0 flow, since it no longer supports some low-level constructs available in the old release.

We can notice a number of differences between the three versions. The designs are not architecturally the same, since the number of registers differs among them. This is due to the code generated (or written):
• for the hand-written code, most of the blocks in the Simulink model have been implemented using Haste functions [15]. The number of blocks for which the designer decided to insert registers is small compared to the total number of blocks;
• for the TiDE 5.2 version, each block has registers on its inputs, which results in a high overhead, since many of them are not required;
• for the TiDE 6.0 version, the compiler automatically decides the minimum number of registers required for the described circuit.

For the reasons above, we can conclude that, at the moment, the automatically generated code compiled with TiDE 6.0 represents the lower bound with respect to the number of registers. On the other hand, the same design compiled with version 5.2 is the upper bound, since the granularity at the Simulink level is very fine. Since our work targets TiDE 6.0, the results shown in Tab. 4 are promising: the implementation based on this new flow (see footnote 2) requires less area than the hand-written counterpart.

Footnote 2: At the moment TiDE 6.0 is not complete; indeed some operations have to be performed by hand, but the optimizations performed by the tool are stable and will not change significantly with the official tool release.

In order to verify circuit equivalence, we simulated the TiDE 5.2 netlists of the hand-written code and of the automatically generated code with the same test bench. Since we did not have access to the test benches used to develop the original version, we had to create a new test bench based on the data streams derived from the Simulink simulation. Because we are still working on a feature that generates input patterns directly from the Simulink environment, we had to do the verification partially by hand. The result of this analysis demonstrates the functional equivalence of the two circuits (with respect to the simulation performed).

The results shown do not include any timing figures, since we did not have access to this data for the hand-written version. The only timing constraint we had was to process all the samples provided at a given data rate, and this was easily achieved by the automatically generated code compiled with both TiDE flows.
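
As a cross-check of the area comparison discussed in this section, the overhead column of Tab. 4 can be re-derived from the total-area figures, taking the hand-written design as the reference. The sketch below is only an illustration: the dictionary keys and the function name `overhead_percent` are hypothetical, while the numbers are copied from Tab. 4.

```python
# Total area in µm2, copied from Tab. 4. The key names are hypothetical labels.
TOTAL_AREA = {
    "hand_written_tide_5_2": 173694,
    "generated_tide_5_2": 468746,
    "generated_tide_6_0": 164100,
}


def overhead_percent(total: int, reference: int) -> float:
    """Relative area difference with respect to the reference design, in percent."""
    return (total - reference) / reference * 100.0


# The hand-written implementation is the reference for both generated versions.
reference = TOTAL_AREA["hand_written_tide_5_2"]
for name in ("generated_tide_5_2", "generated_tide_6_0"):
    print(f"{name}: {overhead_percent(TOTAL_AREA[name], reference):+.1f}%")
```

The two printed values are about +169.9% and -5.5%, which agree with the rounded +170% and -5.5% reported in the table.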

7. Conclusions and Future Work

This paper has shown a complete flow for generating asynchronous circuits starting from Simulink diagrams, using Haste as the intermediate description language. At the moment only a subset of Simulink blocks is supported, but the methodology can easily be extended to cover all blocks. Our proposal has been applied to a commercial model of an audio CODEC, showing appealing advantages (code reuse, time-to-market reduction) without introducing area overhead when compiled with the TiDE 6.0 flow. A number of optimizations and improvements can still be added:
• high-level optimizations, like block merging, in order to reduce the number of asynchronous controllers inserted in the circuit;
• an integration between the Simulink and the RTL simulation, in order to have the same test set for both abstraction levels and, consequently, shorten the testing phase;
• a way to automatically select where to insert pipeline stages in the design.
We are looking forward to all these optimizations, since they can further reduce area overhead and development time. Moreover, we are working on applying the different methodologies presented here, in [8], and in [13], in order to compare which one produces better results.

8. Acknowledgment

We would like to thank Luciano Lavagno, who helped us during the writing phase of this paper with remarks and suggestions.

References

[1] E. A. Lee, S. Neuendorffer, and M. J. Wirthlin, “Actor-oriented design of embedded hardware and software systems,” Journal of Circuits, Systems, and Computers, 2002.
[2] W. Wong, “Model-Based Design,” Electronic Design, March 2006. [Online]. Available: http://electronicdesign.com/Files/29/12086/12086 01.pdf
[3] A. Taubin, J. Cortadella, and L. Lavagno, Design Automation of Real-Life Asynchronous Devices and Systems. United States: Now Publishers Inc, 2007.
[4] C. Van Berkel, M. Josephs, and S. Nowick, “Scanning the technology: Applications of asynchronous circuits,” Proceedings of the IEEE, vol. 87, no. 2, February 1999.
[5] The Mathworks. Simulink on-line documentation. [Online]. Available: http://www.mathworks.com/products/simulink/
[6] Xilinx. System generator. [Online]. Available: http://www.xilinx.com/ise/optional prod/system generator.htm
[7] Altera. DSP-Builder. [Online]. Available: http://www.altera.com/products/software/products/dsp/dspbuilder.html
[8] The Mathworks. HDL-Coder. [Online]. Available: http://www.mathworks.com/products/slhdlcoder/
[9] E. A. Lee and D. G. Messerschmitt, “Dataflow process network,” Proceedings of the IEEE, vol. 83, no. 5, May 1995.
[10] L. Reyneri, F. Cucinotta, A. Serra, and L. Lavagno, “A hardware/software co-design flow and IP library based on Simulink,” DAC, June 2001.
[11] E. Bellei, E. Bussolino, F. Gregoretti, L. Mari, F. Renga, and L. Reyneri, “Simulink-based codesign and cosimulation of a common-rail injector test bench,” Journal on Computer, Systems and Circuits, vol. 12, pp. 171–202, 2003.
[12] I. E. Sutherland, “Micropipelines,” Communications of the ACM, vol. 32, no. 6, June 1989.
[13] M. Tranchero and L. Reyneri, “Automatic generation of self-timed circuits from Simulink specifications,” International Conference on Electronics, Circuits and Systems, December 2007.
[14] TiDE Manual, Internal documentation, Handshake Solutions, 2007.
[15] A. Peeters and M. de Wit, Haste Manual, Handshake Solutions, 2007. [Online]. Available: http://www.handshakesolutions.com
[16] C. Hoare, “Communicating sequential processes,” Communications of the ACM, vol. 21, pp. 666–677, 1978.
[17] P. Allen and D. Holberg, CMOS Analog Circuits Design. New York: Oxford University Press, 2002.