Achilles: A High-Level Synthesis System for Asynchronous Circuits
Jordi Cortadella, Rosa M. Badia, Enric Pastor, and Abelardo Pardo†
Dept. of Computer Architecture, Universitat Politecnica de Catalunya (Barcelona)
e-mail: [email protected]
† Dept. of Electrical Engineering, University of Colorado (Boulder)
Abstract
This paper presents Achilles, a High-Level Synthesis System for asynchronous digital circuits. A new architecture model based on a completely distributed control structure is proposed. The most relevant differences from synthesis systems for synchronous circuits appear in the phases of scheduling and synthesis of the control. Signal Transition Graphs are automatically generated to describe the behavior of local controllers.
1 Introduction
Asynchronous circuits require no global clock and thus avoid the problem of clock skew. A task can be executed as soon as its preceding tasks are completed. The computation time of a task depends, in general, on the input data and can be represented by a distribution function with a worst-case and an average delay. The absence of a global clock eases the extensibility of asynchronous systems [1]. One-to-one local synchronization between modules allows their interconnection without affecting the system's functionality. Furthermore, asynchronism does not add timing constraints that could eventually limit the physical size of the system.
However, asynchronism also involves several problems that make circuit design difficult. Metastability, race conditions, and hazards [2] have been studied in depth for many years, and design methodologies to synthesize error-free circuits have been proposed [3]. In addition, the design of an asynchronous circuit implies some area penalty due to the overhead required for synchronization, mainly for the routing of differential signals when using dual-rail encoding in self-timed units.
As asynchronous circuits become more extensively used, there is a need for synthesis tools that hide the underlying complexity of their design from the user. Currently, most research is focused on the synthesis of asynchronous circuits from signal transition graphs (STGs), first introduced by Chu [4].
Work funded by CYCIT TIC 91-1036, Dept. d'Ensenyament de la Generalitat de Catalunya and ACiD-WG (Esprit 7225)
Until now, only synchronous architectures have been proposed for high-level synthesis systems. Efforts on synthesis from high-level languages have mainly focused on the generation of asynchronous circuits by syntax-directed translation [5].
When conceiving synthesis algorithms for asynchronous circuits, the major differences from those in the field of synchronous circuits come from the use of a different timing model. Scheduling and control synthesis are thus the steps that need to be rethought when considering an asynchronous target architecture. Until now, the algorithms proposed for scheduling have been based on the existence of a global clock to define control states [6, 7]. In an asynchronous execution model, the scheduling phase must determine timing precedences that preserve dependencies between operations, with an estimate of the performance according to the average processing speed of the hardware modules used to execute the operations.
A centralized control unit would forfeit most of the attractiveness of asynchronous systems. Global signals introduce delays that reduce the potential parallelism inherent to asynchronism. Furthermore, the number of states of the control unit grows exponentially with the number of control signals [4], which makes the synthesis of control units for large systems prohibitive. Therefore, models based on the distribution of control functions must be proposed for asynchronous systems.
This paper presents Achilles, a high-level synthesis system for asynchronous circuits that is currently under development, and focuses on those issues that differentiate it from synthesis systems for synchronous circuits: architecture model, scheduling, and control synthesis. Section 2 describes the target architecture model used in Achilles. An overview of the synthesis system is presented in section 3. The scheduling algorithm is described in section 4. Section 5 explains how control units are generated from the schedule defined for the operations. Finally, section 6 concludes the paper.
2 Architecture Model
The architecture model proposed in Achilles is based on two key notions: Processing Unit and Channel. A processing unit (PU) is an entity that executes operations. Two PUs can communicate with each other through unidirectional channels. The granularity of a PU may vary from the basic modules given in a library (adder, ALU, multiplexor, register, etc.) to large, complex modules (elliptic filter, microprocessor, etc.). Moreover, at lower levels of abstraction a PU may be composed of other PUs.
Each PU consists of a data processing part and a control part. The data processing part is implemented by a self-timed module (either asynchronous or locally-synchronous) that executes operations. The control part manages the communication and synchronization with other PUs. There is no global control managing the functioning of the circuit. Thus, control is fully distributed among PUs, which communicate with each other locally.
Figure 1: Communication Protocol (an active PU with port 1, using snd(p1) and ack_rcv(p1), connected through a channel to a passive PU with port 2, using rcv(p2) and ack_snd(p2))
2.1 Communication Protocol
Communication between PUs follows a message-passing handshake protocol, as illustrated in figure 1. A channel has an active port in one PU (designated as p1) and a passive port in another PU (p2). Four primitives are used to implement the communication protocol:
snd(p1): Send data to p1.
rcv(p2): Receive data from p2.
ack_snd(p2): Send data reception ack to p2.
ack_rcv(p1): Receive data reception ack from p1.
Communication between two PUs is initiated by the active PU (the PU connected to the active port). Once the active PU has completed its operation, it sends the output data to p1 (snd(p1)). The passive PU receives its input data from p2 (rcv(p2)). When the data received from p2 is no longer required, the passive PU performs an ack_snd(p2) and continues its execution. On the other side, the active PU receives the acknowledgment (ack_rcv(p1)) and continues its execution.
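The ordering imposed by the four primitives can be illustrated with a small software model. The following Python sketch is only an illustration under stated assumptions: the Channel class, the run_transfer helper, and the use of threads and queues are hypothetical choices made here, not part of Achilles, and no low-level signalling (2- or 4-phase, dual or single rail) is modelled.

```python
import queue
import threading


class Channel:
    """Hypothetical software model of a unidirectional channel.

    Port p1 is the active port and port p2 is the passive port.
    """

    def __init__(self, log):
        self._data = queue.Queue(maxsize=1)  # carries the data item
        self._ack = queue.Queue(maxsize=1)   # carries the acknowledgment
        self._log = log

    # Primitives used by the active PU on port p1.
    def snd(self, value):
        self._log.append("snd(p1)")
        self._data.put(value)

    def ack_rcv(self):
        self._ack.get()  # blocks until the passive PU acknowledges
        self._log.append("ack_rcv(p1)")

    # Primitives used by the passive PU on port p2.
    def rcv(self):
        value = self._data.get()  # blocks until the active PU sends
        self._log.append("rcv(p2)")
        return value

    def ack_snd(self):
        self._log.append("ack_snd(p2)")
        self._ack.put(None)


def run_transfer(values):
    """Transfer each value from an active PU to a passive PU."""
    log, received = [], []
    channel = Channel(log)

    def active_pu():
        for v in values:
            channel.snd(v)      # output data is ready
            channel.ack_rcv()   # wait before continuing execution

    def passive_pu():
        for _ in values:
            received.append(channel.rcv())
            channel.ack_snd()   # data is no longer required

    threads = [threading.Thread(target=active_pu),
               threading.Thread(target=passive_pu)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log, received
```

Calling run_transfer([3, 5]) delivers both values in order, and the log repeats the sequence snd(p1), rcv(p2), ack_snd(p2), ack_rcv(p1) once per transfer, which is the ordering described in the text.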
2.2 Behavioral Model for Processing Units
Figure 2: (a) Communication Processes and (b) Implemented code (panel (a) shows an adder, multiplexors M1 and M2, registers R1 to R4, and inputs In 1 and In 2, with the labels Local Control and Data processing unit; panel (b) is RTL code with a Repeat ... For ever loop, the statements R4 := R1 + R2, R3 := R1 + R4, and R4 := R2 + R3, and In and Out transfers)
The semantics of the model used for the description of the behavior of the PUs must be capable of expressing all the potential parallelism inherent to the asynchronism of the operation execution. Different languages have been used to express parallel behavior in hardware, like CSP or ISPS. In Achilles, we have chosen Petri nets. Two main reasons have endorsed this decision:
The ability of Petri nets to express fine-grain parallelism.
The simplicity of translating Petri nets into Signal Transition Graphs [4].
Figure 2(a) depicts a circuit composed of seven processing units: an adder, two multiplexors, and four registers. The operations executed in this circuit are described by the RTL code in figure 2(b). Each PU receives data, performs an operation, and sends the result to other PUs through output channels in an asynchronous manner. Figure 3(a) shows the behavioral specification (Petri net) for multiplexor M2. Actions like rcv(R4), rcv(R2), and rcv(R3) can occur in parallel with other actions like evaluate, snd(ALU), and ack_rcv(ALU).
Figure 3: Petri net for (a) MUX2 and (b) whole circuit (transition labels in (a) include rcv, ack_rcv, and evaluate for each of r2, r3, and r4, together with snd(alu) and ack_rcv(alu); transition labels in (b) include rcv(in1), rcv(in2), evaluate, ack_rcv(in1), ack_rcv(in2), snd(out), and ack_rcv(out))
With this approach, the whole circuit can also be expressed as a hierarchical structure of PUs. For example, figure 3(b) describes the behavior of the whole circuit. The operation evaluate includes all the operations executed in parallel by all the internal PUs. This description can then be used to consider the circuit as a new PU for more complex circuits.
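To make the notion of concurrently enabled actions concrete, the following Python sketch plays the token game on a small Petri net. It is a minimal illustration under stated assumptions: the PetriNet class, the place names, and the net built by build_example_net are hypothetical and cover only one input of a multiplexor, using the primitive names of section 2.1. It is not a reproduction of figure 3(a).

```python
class PetriNet:
    """Minimal place/transition net with the usual firing rule."""

    def __init__(self, transitions, marking):
        # transitions: name -> (input places, output places)
        self.transitions = transitions
        self.marking = dict(marking)

    def enabled(self):
        """Transitions whose input places all hold at least one token."""
        return sorted(
            name
            for name, (inputs, _) in self.transitions.items()
            if all(self.marking.get(p, 0) > 0 for p in inputs)
        )

    def fire(self, name):
        """Consume one token per input place, produce one per output place."""
        if name not in self.enabled():
            raise ValueError(f"{name} is not enabled")
        inputs, outputs = self.transitions[name]
        for p in inputs:
            self.marking[p] -= 1
        for p in outputs:
            self.marking[p] = self.marking.get(p, 0) + 1


def build_example_net():
    """Hypothetical controller slice: one register input, one ALU output."""
    transitions = {
        "rcv(r2)": (["acked", "done"], ["got"]),
        "evaluate": (["got"], ["to_ack", "to_send"]),
        "ack_snd(r2)": (["to_ack"], ["acked"]),   # branch 1
        "snd(alu)": (["to_send"], ["sent"]),      # branch 2
        "ack_rcv(alu)": (["sent"], ["done"]),
    }
    return PetriNet(transitions, {"acked": 1, "done": 1})


net = build_example_net()
net.fire("rcv(r2)")
net.fire("evaluate")
# Both branches are now enabled at once: the acknowledgment to the
# register and the transfer to the ALU can proceed in either order.
concurrent = net.enabled()  # ['ack_snd(r2)', 'snd(alu)']
```

After evaluate fires, two transitions are enabled simultaneously, which is how a Petri net expresses that acknowledging the source and forwarding the result need not be ordered with respect to each other.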
3 Overview of Achilles
Figure 4 describes the main phases of the Achilles synthesis system. From a Control-Data Flow Graph (CDFG), a library of self-timed modules, and a set of constraints, the scheduler defines a partial ordering for the execution of the operations and determines the type of module that will execute each operation. The scheduler generates a Scheduled Control-Data Flow Graph. The Module Binder assigns a hardware module instance to each operation (registers are considered hardware modules that execute a storage operation). Thus, a processing unit can be defined for each hardware module of the circuit (ALUs, multipliers, multiplexors, registers, etc.). As a starting point, Achilles defines the PUs at the finest level of granularity: one for each hardware module.
After module binding, a network of PUs and communication channels can be derived. Moreover, the data processing part of each PU is completely defined. Therefore, only the local control of each PU is left to be synthesized. From the Scheduled CDFG, a data-flow analysis similar to that performed by optimizing compilers determines the input/output relationships among PUs and generates the Petri nets describing the behavior of the local controllers (see [8] for a more detailed explanation).
At this point, the description in terms of Petri nets is still independent of the low-level handshaking protocol. These details are defined when a Signal Transition Graph is generated for each local controller: handshake signals, asynchronous or locally-synchronous communications, 2- or 4-phase protocol, dual or single rail, etc.
Next, STGs are merged to reduce the granularity of the distributed control. This step is optional and is only applied to reduce the area and improve the performance of tightly-coupled local controllers (e.g., an ALU with its input multiplexors). Finally, an asynchronous sequential circuit is synthesized from each STG.
Figure 4: Synthesis flow in Achilles (Behavioral Description, Compilation, CDFG, Scheduling & Allocation with Constraints and Module Library, Scheduled CDFG, Module Binding, Definition of Processes & Channels, Control Definition as Petri nets, STG generation with handshake protocol and dual or single rail, STG merging, Control Synthesis)
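As a rough illustration of how a network of PUs and channels might be derived once operations are bound to modules, the following Python sketch processes the three register-transfer statements legible in figure 2(b). Everything about the representation is an assumption made here: the tuple encoding, the derive_channels and inputs_of helpers, the rule that M1 feeds the first adder operand and M2 the second, and the omission of the In and Out transfers. The actual data-flow analysis used by Achilles is described in [8].

```python
# Statements from figure 2(b) as (destination, first operand, second operand).
STATEMENTS = [
    ("R4", "R1", "R2"),  # R4 := R1 + R2
    ("R3", "R1", "R4"),  # R3 := R1 + R4
    ("R4", "R2", "R3"),  # R4 := R2 + R3
]


def derive_channels(statements, mux1="M1", mux2="M2", unit="adder"):
    """Return the set of (source PU, destination PU) channels.

    Assumed binding: every statement runs on the single adder, the first
    operand is routed through mux1 and the second through mux2.
    """
    channels = set()
    for dst, first, second in statements:
        channels.add((first, mux1))   # register feeds the first multiplexor
        channels.add((second, mux2))  # register feeds the second multiplexor
        channels.add((mux1, unit))
        channels.add((mux2, unit))
        channels.add((unit, dst))     # result is stored in the destination
    return channels


def inputs_of(pu, channels):
    """PUs that send data to the given PU."""
    return sorted(src for src, dst in channels if dst == pu)


channels = derive_channels(STATEMENTS)
m2_inputs = inputs_of("M2", channels)  # ['R2', 'R3', 'R4']
```

Under these assumptions, M2 receives data from R2, R3, and R4, which agrees with the actions rcv(R4), rcv(R2), and rcv(R3) mentioned for multiplexor M2 in section 2.2.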
4 Operation Scheduling in Achilles
This section presents the approach used by Achilles for operation scheduling (see [9] for further details). The problem to be solved is finding a partial ordering of the graph vertices that minimizes the total execution time under resource constraints. This partial ordering must preserve both data and structural¹ dependencies between vertices. The total execution time must be calculated by considering the estimated average execution time of each operation.
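The effect of a partial ordering on the estimated execution time can be sketched as a longest-path computation. The Python example below is an illustration under stated assumptions, not the scheduling algorithm of Achilles (see [9]): the function name, the operation names, and the delays are hypothetical, and average delays are treated as fixed values, which is a simplification because the mean of a maximum of random delays is not, in general, the maximum of the means. It uses graphlib from the standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def estimated_finish_times(preds, avg_delay):
    """Estimate the finish time of each operation.

    preds maps an operation to the operations that must complete first
    (data or structural dependencies). An operation starts as soon as
    all of its predecessors have finished.
    """
    finish = {}
    for op in TopologicalSorter(preds).static_order():
        start = max((finish[p] for p in preds.get(op, ())), default=0.0)
        finish[op] = start + avg_delay[op]
    return finish


# Hypothetical example: two additions share one adder (structural
# dependency) and a multiplication consumes the result of a1.
avg_delay = {"a1": 2.0, "a2": 2.0, "m1": 5.0}

order_a = {"a1": [], "a2": ["a1"], "m1": ["a1"]}  # a1 before a2
order_b = {"a2": [], "a1": ["a2"], "m1": ["a1"]}  # a2 before a1

total_a = max(estimated_finish_times(order_a, avg_delay).values())  # 7.0
total_b = max(estimated_finish_times(order_b, avg_delay).values())  # 9.0
```

Both orderings respect the data dependency between a1 and m1, but resolving the adder conflict in favour of a1 gives the shorter estimate, which is the kind of choice the scheduler has to make.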
4.1 Frame Reservation Table (FRT)
A Frame Reservation Table FRT_r is a data structure that represents the utilization of a resource type r during the execution of the scheduling algorithm. It stores the number of active resources at each time instant. An FRT can also be represented as an Event List. An Event List EL_r is a list of ordered pairs <time_i, nfus_i>, where nfus_i is the number of available instances of type r from time_i to time_{i+1}.
¹ A structural dependency exists between two vertices if they cannot be executed simultaneously due to resource conflicts.
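The event-list representation of section 4.1 can be sketched as a sorted list of breakpoints. The Python class below is a minimal illustration under stated assumptions: the EventList name, its methods, and the reservation policy are hypothetical choices made here, and only the pair structure <time_i, nfus_i> comes from the text.

```python
import bisect


class EventList:
    """EL_r as sorted pairs (time_i, nfus_i).

    nfus_i is the number of available instances of the resource type
    from time_i up to, but not including, time_{i+1}.
    """

    def __init__(self, total_instances):
        self.times = [0]
        self.nfus = [total_instances]

    def available(self, t):
        """Number of free instances at time t (t >= 0)."""
        i = bisect.bisect_right(self.times, t) - 1
        return self.nfus[i]

    def _split(self, t):
        """Ensure a pair starts exactly at t and return its index."""
        i = bisect.bisect_left(self.times, t)
        if i == len(self.times) or self.times[i] != t:
            self.times.insert(i, t)
            self.nfus.insert(i, self.nfus[i - 1])  # same count as before t
        return i

    def reserve(self, start, end):
        """Reserve one instance during [start, end)."""
        if start < 0 or end <= start:
            raise ValueError("invalid interval")
        i = self._split(start)
        j = self._split(end)
        if any(n <= 0 for n in self.nfus[i:j]):
            raise ValueError("no free instance in the interval")
        for k in range(i, j):
            self.nfus[k] -= 1

    def pairs(self):
        return list(zip(self.times, self.nfus))


el_adder = EventList(total_instances=2)
el_adder.reserve(2, 5)
el_adder.reserve(0, 3)
snapshot = el_adder.pairs()  # [(0, 1), (2, 0), (3, 1), (5, 2)]
```

With two adders, reserving one during [2, 5) and another during [0, 3) leaves no free adder between times 2 and 3, so a third reservation overlapping that interval is rejected.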
Figure 5: (a) Frame Re