Foundations and Trends in Electronic Design Automation Vol. 2, No. 1 (2007) 1–133. © 2007 A. Taubin, J. Cortadella, L. Lavagno, A. Kondratyev and A. Peeters. DOI: 10.1561/1000000006
Design Automation of Real-Life Asynchronous Devices and Systems

Alexander Taubin (Boston University, USA, [email protected])
Jordi Cortadella (Universitat Politècnica de Catalunya, Spain, [email protected])
Luciano Lavagno (Politecnico di Torino, Italy, [email protected])
Alex Kondratyev (Cadence Design Systems, USA, [email protected])
Ad Peeters (Handshake Solutions, The Netherlands, [email protected])
Abstract

The number of gates on a chip is quickly growing toward and beyond the one billion mark. Keeping all the gates running at the beat of a single clock, or of a few rationally related clocks, is becoming impossible. In static timing analysis, process variations and signal integrity issues stretch the timing margins to the point where they become too conservative and result in significant over-design. The importance and difficulty of these problems are pushing some developers to once again turn to asynchronous alternatives. However, the electronics industry for the most part is still reluctant to adopt asynchronous design (with a few notable exceptions), due to a common belief that we still lack commercial-quality Electronic Design Automation tools for asynchronous circuits (similar to the synchronous RTL-to-GDSII flow).
The purpose of this paper is to counteract this view by presenting design flows that can tackle large designs without significant changes with respect to the synchronous design flow. We limit ourselves to the four design flows that we believe to be closest to this goal. We start with the Tangram flow, because it is the most commercially proven and one of the oldest from a methodological point of view. The other three flows (Null Convention Logic, de-synchronization, and gate-level pipelining) can be considered together as asynchronous re-implementations of synchronous (RTL- or gate-level) specifications. Their main common idea is to replace the global clock with local synchronization. Their most important feature is that they open the possibility of implementing large legacy synchronous designs in an almost "push button" manner, in which all the asynchronous machinery is hidden, so that synchronous RTL designers do not need to be re-educated. These three flows offer a trade-off ranging from very low overhead, almost synchronous implementations, to very high performance, extremely robust dual-rail pipelines.
1 Introduction
1.1 Requirements for an Asynchronous Design Flow
With the increases in die size and clock frequency, it has become increasingly difficult to drive signals across a die following a globally synchronous approach. Process variations and signal integrity stretch the timing margins in static timing analysis to the point where they become too conservative and result in significant over-design. Asynchronous circuits measure, rather than estimate, the delay of the combinational logic, of the wires, and of the memory elements. They operate reliably at very low voltages (even below the transistor threshold), when device characteristics exhibit second- and third-order effects. They do not require cycle-accurate specifications of the design, but can exploit specification concurrency at virtually every level of granularity. They can save power because they naturally perform computation on demand. Asynchronous design also provides unique advantages, such as reduced electromagnetic emission, extremely aggressive pipelining for high performance, and improved security for cryptographic devices. In addition, asynchronous design might become a necessity for
non-standard future fabrication technologies, such as flexible electronics based on low-temperature poly-silicon TFT technology [50] or nanocomputing [97]. In the next section, we will discuss in more detail the main motivations for starting to use asynchronous approaches in real-life designs. Here we just mention that the problems listed above are leading some designers to consider asynchronous implementation strategies again, after decades of disuse. Despite all its potential advantages, asynchronous design has traditionally been avoided by digital circuit and system designers for several reasons:
• There were no good EDA tools and methodologies that completely covered the design flow (especially for large, realistic designs).
• Testing without a clock did not benefit from the well-established and reliable set of procedures ensuring that a synchronous circuit will work as expected after fabrication.
• Asynchrony required a deep change in designers' mentality when devising the synchronization among the various components of a digital system.
Custom and semi-custom design of asynchronous devices and systems, on the other hand, is a well-established academic research area, offering several well-known success stories including realistic designs (e.g., [31, 113, 36, 50, 72, 79, 80]). Some asynchronous implementations have reportedly been developed and sometimes commercialized by major design companies (e.g., Intel's RAPPID design [107]; Sun's high-speed pipelined devices used in a commercial Sun Ultra, based on results from [15, 83, 124]; the IBM/Columbia low-latency synchronous–asynchronous FIR filter chip [114, 128] that was fabricated; etc.). However, until recently design and test automation was considered a major weakness of asynchronous design approaches. Asynchronous design is a very large research domain, and it is almost impossible to cover it in depth within a single paper.
The interested reader is referred to a number of excellent research papers, books, and surveys devoted to design automation tools and flows for
asynchronous circuits (e.g., [17, 18, 23, 32, 33, 64, 88]) and in general to asynchronous design (e.g., [123, 51, 81]). This article is devoted specifically to one topic: automation of asynchronous design based on industrial-quality tools and flows. This requires support throughout the design cycle, mostly re-using synchronous tools and modeling languages, due to the otherwise prohibitive investment in training and development. We also aim at dispelling a common misbelief, namely that asynchronous design is difficult and that only specially educated PhDs can do it. We show that this is not true and that there exist flows and tools that satisfy the following key requirements:
• They must be able to handle real-size designs and to re-use the huge investments in tools, languages, and methodologies that enable synchronous design.
• They must use standard hardware description languages (HDLs), such as Verilog or VHDL, with a modeling style that does not differ much from the synthesizable subset, in order to handle legacy designs.
• They must use standard EDA tools to synthesize, map, place, and route the design, and to perform timing analysis, equivalence checking, parasitics extraction, scan insertion, automated test pattern generation, and so on.
• They must use CMOS standard-cell libraries, except when specially designed dynamic CMOS libraries are required for maximum performance and minimum power consumption.
• They must be scalable, in order to create industrial-quality and industrial-size designs.
Any change in the design methodology, no matter how small, must be strongly motivated. We believe that this is happening now, particularly since asynchrony is the most natural, powerful, and effective mechanism to address manufacturing and operating-condition variability.
For integrated circuits at 45 nm and beyond, asynchrony is also a natural way to overcome power, performance and timing convergence issues related to the use of clock signals, supply voltage drop and delay uncertainty caused by noise.
1.2 Motivation for Asynchronous Approach
Feedback closed-loop control is a classical engineering technique used to improve the performance of a design in the presence of manufacturing uncertainty. In digital electronics, synchronization is performed in an open-loop fashion. That is, most synchronization mechanisms, including clock distribution, clock gating, and so on, are based on a feed-forward network. All delay uncertainties in both the clock tree and the combinational logic must be designed out, i.e., taken care of by means of appropriate worst-case margins. This approach has worked very well in the past, but currently it shows several signs of weakness. A designer, helped by electronic design automation tools, must estimate at every design stage (floor-planning, logic synthesis, placement, routing, mask preparation) the effect that uncertainties will have on the geometry, performance, and power (or energy) of the circuit. Cycles in the design and/or fabrication flow translate into both time-to-market delays and direct non-recurrent engineering costs. In the case of mask geometry and fabrication, these uncertainties have so far had a mostly local effect, which can be translated into design rules. However, as technology progresses towards 45 nm and beyond, non-local effects, e.g., due to geometry density and orientation, are becoming more and more significant. In the case of delay and power, these uncertainties add up to significant margins (e.g., 30% was reported in [89]) in order to ensure that a sufficiently large percentage of manufactured chips works within specifications. Statistical timing analysis [3] attempts to improve the modeling of the impact of correlated and independent variability sources on performance. However, it requires a deep change in the business relationship between foundries and design houses in order to make statistical process data available to design tools. The commercial demonstrations of statistical timing analysis viability have been scarce up to now.
Moreover, it is still based on improving predictions, not on actual post-manufacturing measurements. Signal integrity comes into play as both crosstalk-induced timing errors and voltage drop on power lines. Commercial tools for signal and power integrity analysis and minimization (e.g., [11, 112]) predict and
Fig. 1.1 Delay penalties in a synchronous system.
help reduce the impact of voltage drop and crosstalk noise on circuit performance. However, it is believed that up to 25% of the delay penalty may be due to signal integrity. Figure 1.1 (from [7]) summarizes the delay penalties that are typical of a state-of-the-art synchronous methodology. In addition to design-flow and manufacturing-process uncertainties, modern circuit-level power minimization techniques, such as Dynamic Voltage Scaling and Adaptive Body Biasing, deliberately introduce performance variability. Changing the clock frequency in order to match performance with a scaled supply voltage is very difficult, since it multiplies the complexity of timing analysis by the number of voltage steps, and the impact of variability at low voltages is much more significant. Doing the same in the presence of varying body biasing, and hence varying threshold voltages, is even more complex. Moreover, phase-locked loops provide limited guarantees about periods and phases during frequency changes; hence the clock must be stopped while the frequency is stepped up or down. It is well known that, under normal operating conditions, the delay of a CMOS circuit scales linearly with its voltage supply, while its power scales quadratically. Thus the normalized energy-per-cycle or energy-per-operation efficiency measure scales linearly with voltage supply. However, it is very difficult to push this optimization opportunity to the extreme by operating very close to the threshold voltage. Two approaches have been proposed in the literature to tackle this problem with purely synchronous means. Both are based on sampling
the output of a signal that is forced to make a transition very close to the clock edge, and on slowing down the clock frequency or increasing the voltage supply if this critical sampling happens at the current voltage and frequency conditions. The Razor CPU [25] is designed with double slave latches and an XOR comparator in each master–slave pair (thus increasing the area of each converted latch by over 100%). The second slave is clocked half a clock cycle later than the first slave. When the comparator detects a difference in values between the slaves, the inputs must have changed very close to the falling edge of the clock of the first slave, and the latch memorized an incorrect value. Razor in that case "skips a beat" and restarts the pipeline with the value of the second latch, which (assuming that environmental conditions change slowly) is always latched correctly. An external controller keeps voltage and clock frequency very close to this "critical clocking" condition in order to operate the processor very close to the best Vdd point for the required clock frequency, under the current temperature conditions. The approach, while very appealing for processors, has an inherent problem that makes it not applicable to ASICs. Due to the near-critical clocking, it is always possible that the first latch goes meta-stable. In that case, the whole detector and the clock controller may suffer from meta-stability problems. That case is detected with analogue mechanisms, and the whole pipeline is flushed and restarted. This is easy to do in a processor, for which flushing and restarting are already part of a modern micro-architecture. However, it is very difficult, if not impossible, to achieve the same objective automatically for a generic logic circuit.
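As a rough illustration of the mechanism just described, the following is a toy timing model of Razor-style error detection (an illustrative sketch with names of our own choosing, not the actual circuit from [25]):

```python
# Toy model of Razor-style error detection: a shadow latch samples half a
# clock cycle after the main latch; a mismatch means the data arrived too
# late and the stage must replay using the (correctly latched) shadow value.
def razor_sample(arrival_ns, clock_edge_ns, shadow_offset_ns, old_val, new_val):
    main = new_val if arrival_ns <= clock_edge_ns else old_val
    shadow = new_val if arrival_ns <= clock_edge_ns + shadow_offset_ns else old_val
    error = main != shadow          # the XOR comparator between the two slaves
    return (shadow if error else main), error

# Data arrives 0.2 ns after the edge: the main latch misses it, the shadow catches it.
print(razor_sample(5.2, 5.0, 2.5, old_val=0, new_val=1))  # (1, True)
```

Note that this model deliberately ignores the metastability case discussed above, which is precisely the part that requires analogue detection in the real design.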
Another technique that has been proposed is to dynamically monitor the delay of the logic at the current voltage value and adjust the clock frequency accordingly (the PowerWise™ technology from National Semiconductor). It samples, with a high-frequency clock, the output of a digital delay line that toggles once per system clock cycle. This is used, more or less as in Razor, to measure the delay of the line in the current environmental conditions (temperature, Vdd, etc.). The scheme is safer than Razor, because it allows one to insert enough synchronizers after the delay line to reduce the meta-stability
danger. However, it is an indirect measure, and it requires complicated (patented) logic to monitor and control the clock speed. Asynchronous implementation, as demonstrated, e.g., in [52, 80, 90], achieves similar goals with much simpler logic, because the delay of the logic is directly used to generate the synchronization signals in a feedback-control fashion. On the other hand, several kinds of applications, in particular most of those that use complex processor architectures for part of the computation (e.g., general-purpose computing and multi-media), and several others that are tolerant to environmental variations (e.g., wireless communications), do not have to obey strict timing constraints at all times. The widespread use of caches, the difficulty of tight worst-case execution-time analysis for software, and the use of multi-tasking kernels already require DSP algorithms that are tolerant to internal performance variations and offer only average-case guarantees. Hence, a design style in which the device provides average-case guarantees, but may occasionally go slower or faster, is quite acceptable for many applications. If the performance of such a device is on average twice that of a traditionally designed device, then the performance advantage is significant enough to make a limited change in the design flow acceptable.
1.3 Asynchronous Design
Asynchronous design can be viewed as a method to introduce feedback control for the synchronization of latches and flip–flops in a digital design. Asynchronous circuits measure, rather than estimate, the delay of the combinational logic, of the wires, and of the memory elements. Handshaking between controllers that generate the clock signal for groups of flip–flops and latches ensures the satisfaction of both setup and hold constraints, as illustrated in Figure 1.2. Asynchronous circuits also reduce electromagnetic emission with respect to equivalent synchronous ones [53, 82] because they reduce the power consumption peaks in the vicinity of clock edges. Hence they produce a flatter power spectrum and exhibit smaller voltage supply drops.
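The request–acknowledge discipline behind such handshaking controllers can be sketched as a toy four-phase (return-to-zero) protocol (an illustrative Python model; the channel structure and names are ours, not from the paper):

```python
# One four-phase handshake transfer over a req/ack/data channel.
def four_phase_transfer(value, ch):
    ch["data"] = value          # sender places valid data on the channel
    ch["req"] = 1               # phase 1: request goes high
    received = ch["data"]       # phase 2: receiver latches the data...
    ch["ack"] = 1               # ...and raises acknowledge
    ch["req"] = 0               # phase 3: sender withdraws the request
    ch["ack"] = 0               # phase 4: receiver returns to zero; channel idle
    return received

ch = {"req": 0, "ack": 0, "data": None}
print(four_phase_transfer(42, ch), ch["req"], ch["ack"])  # 42 0 0
```

The point of the protocol is that data is stable whenever req is high, regardless of how long any phase takes, which is how setup and hold are satisfied by construction rather than by margins.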
Fig. 1.2 Synchronous–Asynchronous Direct Translation: from synchronous (a) to desynchronized (b) and fine-grain pipelined (c) circuits.
Asynchronous circuits offer two kinds of power-related advantages. First, they provide very fine-grained control over the activation of both datapath and storage resources, in a manner that is similar to clock gating but much easier to describe, verify, and implement robustly at the circuit level. Second, they reliably operate at very low voltages (even below the transistor threshold), when device characteristics exhibit second- and third-order effects. Synchronous operation becomes virtually impossible under these conditions [90] because:
• Library cells are seldom characterized by the manufacturer at such extreme operating conditions. Hence the normal synchronous ASIC design flow is unsuitable to guarantee correct operation.
• The transistor electrical models deviate significantly from those used under nominal conditions and make a straightforward scaling of performance and power impossible, or at least very risky.
• The effects of various random or hard-to-predict phenomena, such as threshold voltage variations, wire width variations, and local voltage supply variations due to IR drop, are dramatically magnified.
All this means that, even if one were able to use the traditional synchronous flow for circuits that operate at a voltage supply close to or even below the transistor threshold voltage, the performance margins that one would have to use to ensure correct operation would be huge. The robustness of asynchronous circuits to delay variations allows them to run at very low voltage levels. Recently, Achronix reported that
an asynchronous FPGA chip, built in 90 nm CMOS, reliably operates with a supply of just 0.2 V and exhibits an 87% power consumption reduction by scaling the supply voltage from 1.2 V to 0.6 V [2].
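A first-order sanity check of that figure: if delay scales roughly as 1/Vdd (so the achievable frequency scales with Vdd) and dynamic power as f·Vdd², then power scales roughly as Vdd³. This is a textbook approximation of our own, not a model from [2]:

```python
# First-order CMOS scaling: f ~ Vdd, dynamic power P ~ f * Vdd^2 ~ Vdd^3.
v_nom, v_low = 1.2, 0.6
reduction = 1 - (v_low / v_nom) ** 3
print(f"{reduction:.1%}")  # 87.5%, consistent with the 87% reported above
```

The asynchronous circuit tracks its own slowdown automatically, which is what makes such deep scaling usable in practice.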
1.4 An Overview of Asynchronous Design Styles
Clocking is a common, simple abstraction for representing the timing issues in the behavior of real circuits. Generally speaking, it lets designers ignore timing when considering functionality. Designers can describe both the functions performed and the circuits themselves in terms of logical equations (Boolean algebra). In general, synchronous designers do not need to worry about the exact sequence of gate switching as long as the outputs are correct at the clock pulses. In contrast, asynchronous circuits must strictly coordinate their behavior. Logic synthesis for asynchronous circuits not only must handle circuit functionality but must also properly order gate activity (switching). The solution is to use functional redundancy to explicitly model computation flows without using abstract means such as clocks. Using logic to ensure correct circuit behavior under any delay distribution can be costly and impractical. Therefore, most asynchronous design styles use some timing assumptions to correctly coordinate and synchronize computations. These assumptions can have different degrees of locality, from matching delays on some fanout wires, to making sure that some logic paths are faster than others. Localized assumptions are easier to meet in a design because they simplify timing convergence and provide more modularity. But ensuring the correctness of such assumptions can be costly because it requires more system redundancy at the functional level. Asynchronous design styles differ in the way they handle the trade-off between the locality of timing assumptions [121] and design cost. The main asynchronous design flows rely on the following assumptions:
• Delay-insensitive (DI) circuits [86] impose no timing assumptions, allowing arbitrary gate and wire delays. Unfortunately, the class of DI implementations is limited and impractical [77].
• Quasi-delay-insensitive (QDI) circuits [76] partition wires into critical and noncritical categories. Designers of such circuits consider fanout in critical wires to be safe by assuming that the skew between wire branches is less than the minimum gate delay. Designers thus assume these wires, which of course must be constrained to physically lie within a small area of the chip, to be isochronic. In contrast, noncritical wires can have arbitrary delays on fanout branches.
• Bundled-delay (BD) circuits assume that the maximum delay of each combinational logic island is smaller than that of a reference logic path (usually called a matched delay) [121]. Matched delays are implemented using the same gate library as the rest of the datapath, and they are subject to the same operating conditions (temperature, voltage). This results in consistent tracking of datapath delays by matched delays and allows one to reduce design margins with respect to synchronous design (bundled-data protocols).
As Figure 1.3 shows, the locality of timing assumptions decreases, from DI systems (which make no global assumptions) to synchronous circuits.
Fig. 1.3 Functional redundancy and locality of timing assumptions in asynchronous designs (from DI and QDI, through bundled data, to synchronous).
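The bundled-delay assumption above can be phrased as a small check: the matched delay must exceed the logic delay under every shared operating condition, and because both are built from the same gates on the same die, a common scale factor cannot break the margin (an illustrative Python sketch with our own names, not from the text):

```python
# The matched delay tracks the datapath: both scale by the same factor s
# under voltage/temperature variation, so a relative margin is preserved.
def bundling_safe(logic_ns, matched_ns, condition_scales):
    return all(logic_ns * s < matched_ns * s for s in condition_scales)

# A 10% matched-delay margin survives even a 3.5x common slowdown...
print(bundling_safe(2.0, 2.2, [0.8, 1.0, 3.5]))  # True
# ...whereas a fixed synchronous period of 2.5 ns would not (2.0 * 3.5 > 2.5).
print(2.0 * 3.5 < 2.5)                           # False
```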
The imposed timing assumptions help in differentiating asynchronous implementations. To further categorize asynchronous design flows, one needs to find out how the following key issues are addressed: (a) how a designer expresses his or her intent, i.e., the design specification, and (b) how a designer proceeds with synthesis. The spectrum of the flows described in this paper ranges from think asynchronously — implement asynchronously to think synchronously — implement almost synchronously. Table 1.1 gives a high-level picture of the reviewed flows.
1.5 Asynchronous Design Flows
The ability to specify a design at a relatively high level, roughly equivalent to the Register Transfer Level (RTL), is essential to enable sufficient designer productivity today. Two basic approaches have been proposed in the asynchronous domain for this purpose. (1) The first one is to use an asynchronous HDL, generally based on the formal model of Communicating Sequential Processes (CSP [44]), because it fits the underlying asynchronous implementation fabrics very well [75, 134]. By nature, asynchronous circuits are highly concurrent, and communication between modules is based on handshake channels (local synchronization). Haste, the design language used by the flow in Chapter 2, offers a syntax similar to behavioral Verilog, and in addition has explicit constructs for parallelism, communication, and hardware sharing. Starting from Haste, one can compile to behavioral Verilog for functional verification. Its main forte is the ability to achieve all the advantages of asynchronous implementation, e.g., by controlling the activation of both control and datapath units (and hence power and energy consumption) at a very fine level with direct and explicit HDL support. Several groups have tried to combine RTL with handshake-channel-based specifications, in particular by adding channels as an extension of a standard HDL (e.g., Verilog or VHDL) [103,
Table 1.1 High-level view on presented fully-automated design flows.
• Haste: design style from QDI to bundled data; specification style: asynchronous high-level and RTL (CSP-based); type of synthesis: asynchronous; implementation library: asynchronous DesignWare mapped to standard cells. Summary: think asynchronously — implement asynchronously.
• NCL: design style QDI; specification style: synchronous RTL for datapath, asynchronous for control; type of synthesis: synchronous + scripts to map into async. library; implementation library: custom NCL, possible to extend to standard cells. Summary: think asynchronously — implement almost synchronously.
• Desynchronization: design style bundled data; specification style: synchronous RTL; type of synthesis: synchronous + scripts to implement local clocking; implementation library: standard cells. Summary: think synchronously — implement almost synchronously.
• Fine-grain pipelining (Weaver): design style QDI; specification style: synchronous RTL; type of synthesis: synchronous + scripts to map into pipeline cells; implementation library: custom (dynamic logic), possible to extend to standard cells. Summary: think synchronously — implement asynchronously.
108, 109], which could be considered an HDL-level automation of the Caltech group's ideas [79]. Unfortunately, this approach, while attractive from a theoretical point of view, requires re-education of RTL designers and rewriting of existing specifications, which is not realistic for big and expensive designs. (2) The other approach, which we call Synchronous–Asynchronous Direct Translation (SADT), starts from a synchronous synthesizable specification, translates it into a gate-level netlist using traditional logic synthesis tools, and then applies a variety of techniques to translate it into an asynchronous implementation [20, 68, 70, 117]. Its main advantage is that it allows re-implementation of legacy designs without any designer intervention at the HDL level. Since eliminating logic bugs takes up to 50% of design time, this can potentially translate into a significant advantage in terms of design time and cost with respect to approaches that require a significant redesign. This approach is represented by the following design styles. (a) Null-Convention Logic (NCL), which is used by Theseus Logic and is described in Chapter 3, is based on a proprietary library of majority gates and produces fully delay-insensitive circuits. This has very high overhead (about 3–4× in terms of area), but is also very robust, because functional correctness is achieved independently of gate and wire delays. (b) De-synchronization, described in Chapter 4, uses a micropipeline-style architecture and an interconnection of small handshaking controllers to derive the clock for groups of up to 100 latches (derived by splitting synchronous flip–flops). It has very low overhead in terms of area (between 5% and 10%), but requires careful matching of delays all the way down through physical design. Note that the same implementation architecture is used by the Handshake Solutions flow of Chapter 2, starting from the
Haste language, thus easing interfacing between logic blocks produced by one of these two flows. (c) Automated fine-grain pipelining, presented in Chapter 5. At the finest granularity (gate-level pipelining, Figure 1.2(c)), in addition to replacing global synchronization with local handshake control, this flow also removes functionally unnecessary synchronization. The flow offers support for automated pipelining, and is therefore targeted at improving the performance of the original design. In this flow [115, 117], by default, pipelining is done at the finest granularity, resulting in high performance. The flow can also exploit aggressive pipelining to reduce the performance gap while maintaining low power. For example, for a standard-cell library developed using a 180 nm TSMC process, the fine-grain pipelined cells are functional with VDDs down to 0.6 V. FIFO performance of 780 MHz at the nominal 1.8 V drops to 135 MHz at a safe 0.8 V, with 14.2× lower power consumption. We believe that the Synchronous–Asynchronous Direct Translation model could play for asynchronous design automation a role as important as the one RTL played for synchronous EDA. The main contribution of the RTL model to EDA is due to a separation of optimization and timing (all sequential behavior is in the interaction between registers; all synthesis and optimization are only about combinational clouds). The key idea that enables SADT flows is as follows. The RTL model (Figure 1.2(a)) is based on global synchronization and timing assumptions (computations complete in every stage before the next clock edge). During every clock cycle, every latch undergoes two phases: pass and store. Master–slave flip–flops prevent the register from ever being transparent, but introduce the requirement to carefully control clock skew in order to avoid double-clocking failures (also called hold time violations), which are especially nasty because they cannot be fixed simply by slowing down the clock. Dynamic logic also has
a two-phase operation: evaluate and precharge (reset). These phases naturally map to asynchronous handshake (request–acknowledge) protocols [121]. The separation into phases enables, in asynchronous just as in synchronous design, a clean separation between functionality and timing. A datapath implemented using any of the techniques described in this paper behaves just like combinational logic (and is in fact just plain combinational logic in the Haste and the desynchronization flows) during evaluation. A resetting or precharge phase is used in the NCL and fine-grain pipelining flows to ensure reliable delay-insensitive or high-speed operation. In other words, asynchronous implementation does not change the externally observable behavior, and the sequence of values that appears on the boundary registers is the same before and after the application of SADT. Among other advantages (e.g., in terms of reuse of functional and timing verification simulation vectors), this enables easy interfacing with synchronous logic. The latter can be achieved by driving the clocks of the synchronous blocks with the request signals coming from the asynchronous blocks, if the following two assumptions are satisfied:
• Each synchronous block has an interface whose fanin and fanout registers are all clocked by a single asynchronous controller.
• Each synchronous block is faster than the asynchronous one that drives its clock.
For this reason, synchronous legacy logic can be used unchanged in an asynchronous fabric if it is non-critical. This is very different from GALS-based methods, which require the use of wrappers and introduce significant system-level integration problems due to the need to create ad-hoc manual handshaking between blocks that belong to different clock domains. The various SADT flows described in this paper differ in the granularity of the pipeline stages.
The NCL and desynchronization approaches retain exactly the same position of registers, and hence pipelining level, as the original synchronous specification. Gate-level
pipelining, on the other hand, significantly decreases the granularity of the pipelining, down to the gate level, as shown in Figure 1.2(c). All the approaches considered in this paper share several common characteristics. First of all, control is derived through a syntax-directed translation from a specification, whether it is written in Haste or in synthesizable Verilog. Second, the datapath is generated (at least initially, for the NCL and fine-grain pipelining flows) using traditional logic synthesis techniques, starting from design libraries such as DesignWare [22]. Third, physical design and implementation verification (equivalence checking, extraction, back-annotation, etc.) are essentially unchanged. Finally, testing of the datapath and of the controllers is performed mostly synchronously, thanks to the fact that timing faults in the controller networks are easy to detect with simple functional tests. Hence design-for-testability and automated test pattern generation tools and techniques can be reused almost without change. We present a variety of automated flows (Haste and the flows related to the SADT approach) because we believe that the full advantages of asynchronous design in tackling power, energy, variability, and electromagnetic emission will come from a judicious mix of:
• Fully redesigning performance- and power-critical, widely used components (e.g., microprocessors) using a language like Haste.
• Converting synchronous designs of special-purpose critical modules to asynchronous implementations, using one of the SADT techniques outlined in this paper. The choice of method will depend on whether the main goal is robustness, cost, or performance.
• Leaving non-critical components synchronous and clocking them with the handshake signals produced by the asynchronous interface controllers.
1.6 Paper Organization
We will start our review by considering in Chapter 2 the most radical design approach that has been applied so far to real-life designs. It is based on the Haste language, and commercialized by Handshake
Solutions. It is radical because it starts from a non-standard, asynchronous-specific model and requires specific training for designers. However, this non-standard HDL can lead to rather efficient solutions that would be unreachable from standard HDL specifications, and it has been used successfully for very large real-life designs.

We will then describe the SADT approaches, starting with the pioneering NCL technique [57, 68] discussed in Chapter 3. NCL was the first approach to asynchronous design to exploit the idea of synthesizing large designs using commercial EDA tools. NCL circuits are dual-rail to enable completion detection. They are architecturally equivalent to the RTL implementation. However, full synchronization of completion detection at each register implies that NCL circuits are significantly slower and larger (by a factor of 2 to 4) than the synchronous starting point.

Reducing NCL overheads brings us to de-synchronization [19, 20], in Chapter 4, which uses delay matching in order to achieve a good compromise between robustness, performance, and cost. When run at their worst-case speed, de-synchronized designs exhibit an almost negligible overhead with respect to synchronous ones. On the other hand, when run at the actual speed allowed by the process, voltage, and temperature conditions, they can dramatically reduce the delay margins required by synchronous design.

None of the above approaches offers support for automated pipelining, so they cannot directly improve this aspect of the performance equation. Gate-level pipelining [117, 118], on the other hand, can pipeline at the level of individual gates, thus achieving performance levels that are virtually impossible to match with synchronous designs. We present this flow in Chapter 5.

Chapter 6 is dedicated to design examples that illustrate both the achievable results and the possible application areas of the various design flows.
Finally, Chapter 7 presents some conclusions on the opportunities offered by asynchronous circuits and flows.
2 Handshake Technology
2.1 Motivation
The Handshake Technology design flow is aimed at making clockless circuit technology available to the IC community in general, and not only to asynchronous circuit experts. This has led us to develop a complete tool set that goes all the way from high-level design entry down to a circuit-level realization. The design flow has been set up in such a way that new tools are developed only where needed. Furthermore, it is designed to be complementary to and compatible with standard design flows. In particular, the Handshake Technology design flow reuses the following:

• Cell library. Any standard-cell library can be connected, as no dedicated “asynchronous” cells are needed, cf. Section 2.6.
• Memories. Standard (on-chip) memories can be interfaced with directly, in a glue-less manner.
• Optimization. Combinational datapath logic can either be incorporated directly (and the tools will create a handshake wrapper around it) or it can be generated using the design-entry language. Logic optimization can be done using tools designed for synchronous flows, since the datapaths are implemented using single-rail logic.
• Test-pattern generation. Although asynchronous circuits will always require a dedicated Design-for-Test solution, the test-pattern generation part can be done using standard ATPG (automatic test-pattern generation) tools, provided that a scan-like solution is implemented, as we did in the Handshake Technology flow, cf. Section 2.8.2.
• Placement and routing. Since the design flow uses only standard cells, we can reuse existing placement and routing tools to do the layout. Care is required for the correct handling of timing assumptions and combinational feedbacks, cf. Section 2.9.
• Timing analysis. Timing characterization of digital circuits has become a science in itself, and we are happy that we can reuse these tools as well, by cutting all combinational loops before timing analysis and by specifying different dummy clocks.
• FPGA prototyping. Although in principle asynchronous circuits can be mapped onto an FPGA as well, we have chosen to use a clocked implementation of handshake protocols for prototyping. This results in standard clocked circuits which can be mapped onto FPGAs directly and serve as functional prototypes. Naturally, the timing behavior on an FPGA will be different from that on the final silicon.
The Handshake Technology design flow does not only consist of scripts around standard third-party EDA tools. Dedicated tools and representations have also been developed, specifically in the following areas:

• Design entry. A new design-entry language — called Haste, formerly known as Tangram — has been developed to enable expressing asynchronous behavior and communication.
• Handshake circuits. An intermediate architecture called Handshake Circuits has been defined, in which we still abstract from the implementation details and the handshake protocols that are used.
• Netlist generation. We have chosen to implement netlist generators for handshake circuits, where we select a specific handshake protocol for each handshake channel and component, and then generate a circuit implementation in terms of standard cells from a pre-selected standard-cell library. This step could alternatively be implemented through the use of asynchronous control synthesis tools such as Petrify [17, 64] and Minimalist [32, 33].
• Design for Test. Design-for-Test of asynchronous circuits cannot be handled by synchronous scan-insertion tools directly, as these cannot work with the combinational feedback loops that are abundant in any asynchronous circuit, for instance inside Muller-C [86] and mutual-exclusion elements [111]. A new test solution has therefore been developed that is in itself as compatible as possible with synchronous scan-based test.

The remainder of this chapter goes into more detail regarding the innovative aspects of the Handshake Technology design flow. It addresses both the dedicated “asynchronous” tools and representations that have been developed, and the interfacing to “synchronous” tools.
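The Muller-C element mentioned above is the archetypal example of such a combinational feedback loop. A minimal sketch (ours, not from the paper) models its usual set-dominant implementation, whose output feeds back into its own hold term:

```python
# Sketch: a Muller C-element modeled at the gate level. The output g appears
# in its own next-state function, g_next = (a AND b) OR (g AND (a OR b)),
# which is exactly the combinational feedback that confuses scan-insertion tools.

def c_element(a: int, b: int, g: int) -> int:
    """Next output of a C-element with inputs a, b and current output g."""
    return (a & b) | (g & (a | b))

# The output switches only when both inputs agree, and holds otherwise.
g = 0
g = c_element(1, 0, g)  # inputs disagree -> holds 0
assert g == 0
g = c_element(1, 1, g)  # both high -> sets to 1
assert g == 1
g = c_element(1, 0, g)  # inputs disagree -> holds 1
assert g == 1
g = c_element(0, 0, g)  # both low -> resets to 0
assert g == 0
```

Because the current output is an input to the same function, cutting the loop (as the dedicated test solution does) is a prerequisite for reusing synchronous scan tooling.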
2.2 Functional Design
The functional part of the design flow, during which the designer’s focus is on the functional correctness of the design, is shown schematically in Figure 2.1. In this diagram, boxes denote design representations and test benches, and the oval boxes refer to tools. The two central tools are htcomp, which compiles a Haste program into a handshake circuit, and htmap, which compiles a handshake circuit into a Verilog netlist. Other tools and representations also play a role during functional design.
[Figure 2.1 shows the functional design flow: boxes denote design representations (Haste program, handshake circuit, behavioral Verilog, System-C model, Verilog testbench, asynchronous and synchronous netlists), while ovals denote tools (htcomp, htmap, htsim, htview, and a Verilog simulator).]

Fig. 2.1 Functional design flow.

2.3 Design Language Haste

The goal of the Handshake Technology design flow is to make the benefits of clockless IC design available to designers that themselves are
not asynchronous circuit experts. Even more, we want to make Handshake Technology available to system designers and application specialists as well, rather than to ASIC/IC designers only. The design language thus has to abstract from the physical implementation, such as the handshake protocols and the data-valid schemes that are being used. Therefore it needs to be a behavioral rather than a structural design language. Three essential requirements for the design language — which were not available in any design language when Haste was conceived — are:

(1) Parallelism must be supported at any granularity: at the process level, at the level of atomic actions, and anywhere in between. Arbitrary nesting of parallel and sequential operations should also be supported.
(2) Channels should be supported for communication (and synchronization) between parallel processes. For such channels, send and receive actions should be available as atomic communication actions. The representation of channels and their communication actions (send, receive) in the language must be independent of their implementation (which could be clocked or clockless and employ many different data-encoding options).
(3) The parameter mechanism (for, e.g., modules, functions, procedures) should support channels as well. More precisely, it must be possible to refer to such a channel parameter as an atomic entity. Since, at the implementation level, a channel is a bidirectional object (with both input and output signals in it), most languages do not support this by design.

The combination of the above requirements was not found in any programming or design language. In addition, no language was found where this could be added through an abstraction mechanism.

The considerations mentioned above could have led to the choice of starting from a standard language and then supporting an extended subset. Such an approach was taken, for instance, by Celoxica when developing Handel-C [10]. This choice typically reflects the expectation that, in order to save on engineering (re)training, a standard language would be preferable. Indeed, a standard language would offer the advantage of a familiar syntax. However, there are some disadvantages to a standard language as well. First of all, there is no consensus as to which standard language should be chosen. We have had requests for C, System-C, Verilog, SystemVerilog, and VHDL. Each language has its own proponents, depending on the background of the teams, which may have a focus on hardware, software, or system-level design. Supporting a multitude of extended subsets of different input languages could be an option, but it is quite a challenge to maintain consistency between the expectations and capabilities of such an approach. Additionally, for each language, a designer will typically have a standard way of using it.
That may be a handicap when that language is also going to be used to design clockless circuits, for which a new way of using the language would have to be learned. Having a new syntax may be an advantage in this learning process, as it makes the designer aware of the new approach.
With the choice of developing a design language that strikes a good balance between the asynchronous target implementation and the envisioned high abstraction level, it has been possible to demonstrate the high-level design capabilities of an asynchronous circuit design style. Haste has been set up such that it is fully synthesizable: any Haste program can be compiled into a silicon implementation. Since the work on Haste and its silicon compiler started, the offering of design languages has changed. Whether in hindsight a new design language was the optimal choice remains to be seen. It is definitely intriguing to see that academic design environments that have been developed to mimic the Haste environment, such as the Balsa system, have chosen to develop additional new languages. It is likely that, with the emergence of high-level modeling languages like System-C and SystemVerilog, at some point the Haste compiler will be tuned so as to also accept a SystemVerilog or System-C syntax, in combination with coding-style guidelines.

A complete overview of the design language Haste can be found in the Haste manual [21]. Below, we highlight some key aspects that make Haste a unique and powerful design language.

2.3.1 Haste as a Behavioral Design Language
The gain in code size that is reported by designers using Haste is to a large extent due to the behavioral aspects of the language. In contrast to an RTL description, Haste does not require alignment and scheduling of actions to a global event such as a clock edge. One can freely mix actions of different grain size, as in

x := 0          // ‘fast’ (reset)
y := y + a*z    // ‘slow’ (multiply-acc),
which combines a fast reset action with a slower multiply-accumulate. In a synchronous circuit, the reset would be slowed down to the slowest action in the circuit, i.e., to at least the time needed for the multiply-accumulate. In an asynchronous circuit, such a fast action can actually be exploited. The non-uniform timing does not only allow microtiming at a finer granularity than what one would achieve in a uniformly clocked circuit, it also allows exceptions to take more time. In a synchronous circuit, significant effort goes into fitting even extreme cases into the clock period, which may result in a serious investment in circuit area and operation energy for that exception. Asynchronous circuits can accommodate such slow actions naturally.

Actions can have different durations that may even be data dependent. This may come in the form of an if-statement, where the different branches may have (completely) different completion times, leading to completion times of the if-statement that depend on the actual value of the guard, as in:

if x > y then x := 0 else y := y + x*z fi

It may also come in the form of a data-dependent while loop, where the number of iterations depends on the value of the guard and its evolution over time, such as in the following implementation of the greatest common divisor algorithm:

do x <> y then if x < y then y := y - x else x := x - y fi od
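A data-dependent loop like this subtractive GCD has a completion time that varies with its operands. A small Python analogue (ours, purely illustrative) makes the iteration count, and hence the varying "completion time", explicit:

```python
# Illustrative only: the subtractive GCD loop in Python, counting iterations
# to show that the number of steps (and thus completion time) is data dependent.

def gcd_subtractive(x: int, y: int):
    steps = 0
    while x != y:          # 'do x <> y then ... od' in the Haste example
        if x < y:
            y = y - x
        else:
            x = x - y
        steps += 1
    return x, steps

assert gcd_subtractive(12, 8) == (4, 2)   # 12,8 -> 4,8 -> 4,4
assert gcd_subtractive(7, 1) == (1, 6)    # near worst case: many subtractions
```

A synchronous implementation would have to budget its clock period for the slowest iteration; the asynchronous loop simply takes as many handshake cycles as the data requires.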
The NCL design flow uses off-the-shelf simulation and synthesis components (see Figure 3.4). The flow executes two synthesis steps. The first step treats NCL variables as single wires. The synthesis tool performs HDL optimizations and outputs a network in a generic library (GTECH), as if it were a conventional Boolean RTL. The second step expands the intermediate generic netlist into dual-rail NCL by making dual-rail expansions and mapping into the library of hysteresis gates. The particulars of the Step 1 and Step 2 implementations impact the quality of the final results. Note that, although two-rail expansion is a specific step that is not present in synchronous flows, it can be performed by mere manipulation of gate networks. Editing capabilities for gate networks are present in all modern synthesis tools, so two-rail expansion can be carried out within commercial flows by developing corresponding scripts.

Fig. 3.4 RTL flow for NCL.
3.4 DIMS-based NCL Design Flow
In [68] a regular method for NCL implementation based on Delay Insensitive Minterm Synthesis (DIMS) [122] was suggested. In this method, Steps 1 and 2 of the design flow are implemented as follows: Step 1 performs a mapping of the optimized network into two-input NAND, NOR, and XOR gates.
Fig. 3.5 Dual-rail expansion for NAND gate.
Step 2 first represents each wire a as a dual-rail pair a.0 and a.1, and then makes a direct translation of two-input Boolean gates into pairs of gates with hysteresis, performing some limited optimizations. Figure 3.5 illustrates the Step 2 implementation for a two-input NAND gate c = a ∗ b. Note that, in the DIMS method, a function is represented as a set of minterms, with only one of them turning “ON” in the set phase. Therefore the outputs of the pair of gates with hysteresis acknowledge all changes at their inputs. The latter is the basis for the delay-insensitivity of a DIMS-based NCL implementation. The advantage of the DIMS-based approach is the simplicity of the translation scheme and the automatic guarantee of DI properties in the implementation. Unfortunately, DIMS-based implementations have significant overhead, which comes from three sources:

(1) Over-design due to the locality of ensuring DI. Each dual pair of gates acknowledges ALL its inputs. This is excessive when inputs are shared between several pairs of gates.
(2) Little room for optimization. This comes from the close coupling of implementing functions and ensuring delay-insensitivity by the very same gates. Optimization of functionality can easily destroy DI properties.
(3) Need for proprietary gates. Gates with hysteresis are required to ensure a correct-by-construction reset in DATA → NULL transitions. These gates are significantly larger than conventional gates (approximately 2 times).
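A hedged sketch of the DIMS construction, written by us and assuming the standard minterm scheme of [122]: one C-element (gate with hysteresis) per input minterm, with minterms OR-ed onto the rail that carries the corresponding output value. We show a two-input AND; a NAND merely swaps which rail each minterm drives.

```python
def c_el(a, b, g):
    # 2-input Muller C-element (a gate with hysteresis): switches only
    # when both inputs agree, otherwise holds its previous output g
    return (a & b) | (g & (a | b))

def dims_and(a0, a1, b0, b1, state):
    """One evaluation of the DIMS network; state holds the 4 C-element outputs."""
    m00 = c_el(a0, b0, state[0])  # minterm a=0,b=0 -> c=0
    m01 = c_el(a0, b1, state[1])  # minterm a=0,b=1 -> c=0
    m10 = c_el(a1, b0, state[2])  # minterm a=1,b=0 -> c=0
    m11 = c_el(a1, b1, state[3])  # minterm a=1,b=1 -> c=1
    return (m00 | m01 | m10, m11), [m00, m01, m10, m11]   # (c.0, c.1)

state = [0, 0, 0, 0]                           # NULL: every C-element reset
(c0, c1), state = dims_and(0, 1, 0, 0, state)  # a=1 arrived, b still NULL
assert (c0, c1) == (0, 0)                      # no early output: DI behavior
(c0, c1), state = dims_and(0, 1, 0, 1, state)  # b=1 arrives
assert (c0, c1) == (0, 1)                      # DATA: c=1, both inputs acknowledged
```

The sketch shows the acknowledgment property concretely: no output rail fires until the minterm C-element has seen both of its inputs, so every input transition is observed.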
3.5 NCL Flow with Explicit Completeness
The next step in gaining efficiency of NCL implementations was proposed in [57] with the development of the so-called NCL_X flow. It exploits the idea of separate implementations for functionality and delay-insensitivity. NCL_X partitions an NCL circuit into functional and completion parts, which permits independent optimization of each. Modifying the flow's implementation steps permits this separate implementation and optimization.

In Step 1, designers perform a conventional logic synthesis (with optimization) from the RTL specification of an NCL circuit. This also maps the resulting network into a given library, using gates that implement the set functions of threshold gates. Step 2 involves the following substeps:

• reducing the logic network to unate gates by using two different variables, a.0 and a.1, for the direct and inverse values of signal a (the obtained unate network implements rail.1 of a dual-rail combinational circuit);
• enabling dual-rail expansion of the combinational logic by creating a corresponding dual gate in the rail.0 network for each gate in the rail.1 network (for this, one needs to constrain the library content by ensuring that each gate of the library has a gate with the dual function in the same library);
• ensuring delay-insensitivity by providing local completion detectors (OR gates) for each pair of dual gates and connecting them in a completion network (a multi-input C-element) with a single output, done.

Implementing the NCL_X design flow requires a minor modification of the interfacing conventions within the NCL system. We assume that, for each two-rail primary input (a.0, a.1), an explicit signal a.go exists such that a.0 ≠ a.1 → a.go = 1 (set phase), whereas a.0 = a.1 = 0 → a.go = 0 (reset phase). Figure 3.6 shows the modified organization of the NCL system. Unlike the system in Figure 3.2, this system has separate completion detectors for the combinational logic and the registers.
Designing the combinational logic for a 4-to-2 encoder (from Section 3.3) illustrates this design flow.
[Figure 3.6: an NCL system with explicit completion detection — registers RG_A, combinational logic (CL), C-elements combining the local completion signals into done, and the go/req/ack handshake signals.]

Fig. 3.6 NCL system with explicit completion detection.
Figure 3.7 shows the design steps in the NCL_X implementation of the encoder. After the reduction to unate gates (Figure 3.7(b)), the network is expanded into a dual-rail implementation (Figure 3.7(c)). This is done locally by providing a dual function for each of the gates in the unate representation of Figure 3.7(b). Finally, each pair of dual gates is supplied with local completion detectors (OR gates) whose outputs are connected to a multi-input C-element providing the signal done (Figure 3.7(d)). Information about the validity of input codewords is provided by the C-element I.go that combines all completion signals for primary inputs. The output I.go is connected to the done C-element.

Fig. 3.7 4-to-2 encoder with explicit completion.

The following property is crucial for the proposed implementation: an NCL circuit implemented with explicit completion belongs to the QDI class. The proof of this statement is rather straightforward. In each phase (set and reset), exactly one gate out of each dual pair must switch. The switching is acknowledged by the local completion detectors which, in turn, are acknowledged by the primary output done. Hence, transitions at every gate of the circuit are acknowledged by circuit outputs, and the circuit belongs to the QDI class.

A clear boundary between a circuit's functional and completion parts simplifies optimization procedures and results in a significant area reduction of NCL_X versus the conventional NCL flow (about 30% [57]). Optimization of NCL implementations is an on-going topic of research. Jeong and Nowick [48] suggested a set of QDI-preserving transformations that help in more efficient technology mapping of NCL structures. Zhou et al. [145] proposed the idea of partial completion to reduce the size of NCL gates and the completion network. Note, however, that area improvements in the NCL flow are limited by the theoretical 2× penalty bound stemming from the need to perform a dual-rail expansion. Figure 3.8 summarizes the efforts in optimizing NCL implementations and provides a comparison to the reference synchronous implementations.
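The separation of function and completion can be sketched for a single AND gate (our illustration, much smaller than the 4-to-2 encoder): the rail.1 network computes the function, its dual computes the 0-rail, an OR over the pair is the local completion detector, and a C-element combines the local detectors with I.go into done.

```python
def c_el(ins, g):
    # multi-input C-element: sets when all inputs are 1, resets when all are 0
    if all(ins):
        return 1
    if not any(ins):
        return 0
    return g

def ncl_x_and(a0, a1, b0, b1, done_state):
    c1 = a1 & b1                    # rail.1 network (set function of the AND)
    c0 = a0 | b0                    # dual gate in the rail.0 network
    local = c0 | c1                 # local completion detector (OR of the pair)
    i_go = (a0 | a1) & (b0 | b1)    # all primary inputs valid (plays the role of I.go)
    done = c_el([local, i_go], done_state)
    return (c0, c1), done

(c0, c1), done = ncl_x_and(0, 1, 1, 0, 0)    # a=1, b=0: DATA wavefront
assert (c0, c1) == (1, 0) and done == 1      # c=0, completion raised
(c0, c1), done = ncl_x_and(0, 0, 0, 0, done) # NULL wavefront
assert (c0, c1) == (0, 0) and done == 0      # completion lowered
```

Because done only rises once both the gate pair and the inputs have settled, every transition is acknowledged, which is the QDI argument made above in miniature.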
3.6 Verification of NCL Circuits
In synchronous circuits, static timing analysis plays the major role in design sign-off: every combinational path must be checked for setup and hold constraints to make sure that the circuit functions correctly. The big advantage of QDI circuits (NCL in particular) is that their behavior is independent of gate delays. This makes static timing analysis obsolete for verifying NCL circuits. Ideally, the QDI properties of NCL implementations should be satisfied by correct-by-construction synthesis methods. In practice, explicitly verifying QDI is necessary because:

(1) Parts of a system might be designed manually.
(2) Even the best automated tools are prone to errors.
(3) It increases a designer's confidence in the implementation results.

[Figure 3.8: area of NCL implementations over time, relative to a 1× synchronous reference — conventional NCL (Ligthart, 2000) at about 3×; NCL_X (Kondratyev, 2002), Jeong (2006), and Zhou (2006) approaching the 2× bound.]

Fig. 3.8 Comparative analysis of NCL implementations.

Formally, QDI verification reduces to checking that every gate of a circuit translates the result of its firing into changes at the primary outputs. This is essentially the same as the acknowledgment property introduced in Section 3.2. If the outputs of a circuit take a settled value (Null or Data) before some internal gate transitions to its settled value, then that internal gate is called an orphan [58], because its transition is left unsupervised. In this terminology, verifying the QDI property means proving that a circuit is free from orphan gates. In NCL systems, orphan checking plays the same role in design sign-off as static timing analysis does for synchronous circuits.

Several important observations help in reducing the complexity of the delay-insensitive analysis of NCL systems:

• Analysis of a whole NCL system is decomposed into smaller tasks of checking its combinational subparts, in the same way as with ordinary static timing analysis.
• Data and Null wavefronts cannot collide in NCL systems. This suggests that orphans must be checked separately for the set and reset phases (but not for their mix). For a gate g with hysteresis (g = S + gR), checking must be performed with respect to a combinational network of gates with functions S (for the set phase) or a network of gates with functions R (for the reset phase).
• It is easy to prove that consecutive Data and Null wavefronts sensitize the very same combinational paths. This property simplifies orphan checking by confining it to the set phase only.

Figure 3.9 shows the overall flow for QDI verification in NCL circuits. The key component of the verification tool is the satisfiability solver Chaff [85], which provides a simple and efficient solution for orphan analysis. The analysis is based on constructing a miter [63] that combines the outputs of the “correct” sub-network IO*(gi) (in which gi might switch) and a “faulty” sub-network O*(gi) (in which the value of gi is kept at “0”). The output F of the resulting circuit (Figure 3.10) takes the value “0” if there is an input set DATAj that does not distinguish the outputs of O*(gi) and IO*(gi). Gate gi must switch to “1” under DATAj. However, in O*(gi), the value at the output of gi is replaced by “0”. Therefore, under the DATAj input set, the outputs settle down irrespective of the transition at gi. This behavior describes an orphan crossing gi. Clearly, F being a tautology implies that the circuit has no orphans and is delay-insensitive.
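The miter check can be illustrated by exhaustive simulation instead of SAT (our sketch; a solver such as Chaff performs the same check symbolically): a gate is orphan-crossed if some input makes it switch to 1 while the circuit outputs settle to the same values as when the gate is forced to 0.

```python
from itertools import product

def check_orphans(n_inputs, gates, outputs):
    """gates: dict name -> function(values), evaluated in insertion order.
    outputs: gate names observed at the circuit boundary."""
    def evaluate(vec, stuck=None):
        values = {f"x{i}": v for i, v in enumerate(vec)}
        for name, fn in gates.items():
            values[name] = 0 if name == stuck else fn(values)
        return values
    orphans = set()
    for name in gates:
        for vec in product([0, 1], repeat=n_inputs):
            good = evaluate(vec)
            if good[name] != 1:
                continue                      # gate does not switch here
            bad = evaluate(vec, stuck=name)   # faulty network O*(g)
            if all(good[o] == bad[o] for o in outputs):
                orphans.add(name)             # switching of g is unobservable
    return orphans

# Tiny set-phase network: g1 = x0 AND x1, out = g1 OR x0.
# Whenever g1 fires (x0 = x1 = 1), x0 already drives out to 1, so g1's
# transition is never acknowledged at the output: g1 is an orphan.
gates = {
    "g1": lambda v: v["x0"] & v["x1"],
    "out": lambda v: v["g1"] | v["x0"],
}
assert check_orphans(2, gates, ["out"]) == {"g1"}
```

Brute force is exponential in the number of inputs, which is exactly why the real flow delegates the check to a SAT solver.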
Fig. 3.9 Verification flow for NCL.
Fig. 3.10 Miter construction for checking orphans crossing gi.
Fig. 3.11 ALU design size vs verification time.
The efficiency of the SAT approach for orphan checking is confirmed by Figure 3.11 from [58], which shows an example of a scalable ALU. One can see that, as the size of the circuit increases, the verification time grows fairly slowly.

3.6.1 Testing of NCL
Testing is the last piece of the mosaic needed to provide a complete picture of the NCL design methodology. Delay-insensitive NCL systems are immune to timing failures: delay faults in these circuits influence the circuit's performance but not the correctness of its behavior. Therefore the main focus of testing efforts for NCL designs is centered on stuck-at faults. As we will show below, a scan-based approach is well suited for testing NCL circuits. This suggests that bridging faults in NCL implementations
might be tested using conventional techniques of applying IDDQ vectors. This topic is, however, outside the scope of this section.

At first glance, testing micropipelines seems to be very difficult because of the numerous loops that couple pipeline stages and implement handshakes. Fortunately, many asynchronous systems (micropipelines in particular) enjoy self-checking properties with respect to stuck-at output faults [139]. If one of the handshake wires is stuck at a constant level, then sooner or later the whole pipeline stalls, which provides excellent observability for control faults. These results can be generalized beyond handshake wires to any input stuck-at faults in completion detectors [59].2 The latter brings us to an important conclusion: register and control circuitry is self-checking in NCL designs and can therefore be removed from consideration when exploring test patterns. Intuitively this is easy to understand, because transparent latches under monotonic changes at data inputs behave like buffers and do not influence the testability of the system. Therefore, for the purpose of test pattern generation, one might take a simplified view of the NCL pipeline as a connection of combinational logic networks in sequence, which is equivalent to a single combinational logic network (see Figure 3.12).
Fig. 3.12 Removal of registers from NCL pipeline.
2 Here, we assume that feedback wires in NCL gates and C-elements are not externally visible inputs. Testing input stuck-at faults for feedback wires is an open problem for NCL designs.
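Once the pipeline is viewed as a single combinational network, test generation reduces to ordinary stuck-at ATPG on the set-phase Boolean network. A brute-force sketch (ours; a real flow would use a commercial ATPG tool) finds a test vector for every input stuck-at fault of a tiny example network:

```python
from itertools import product

def net(x):                      # set-phase network: out = (x0 AND x1) OR x2
    return (x[0] & x[1]) | x[2]

def find_tests(fn, n):
    """For each input i and stuck-at value sa, find a vector whose fault-free
    and faulty outputs differ; such a vector detects the fault."""
    tests = {}
    for i in range(n):
        for sa in (0, 1):
            for vec in product([0, 1], repeat=n):
                faulty = list(vec)
                faulty[i] = sa           # inject the stuck-at fault
                if fn(vec) != fn(faulty):
                    tests[(i, sa)] = vec
                    break
    return tests

tests = find_tests(net, 3)
assert len(tests) == 6           # all 6 input stuck-at faults are detectable
```

For this acyclic network every input stuck-at fault has a test, in line with the conclusion below that acyclic NCL pipelines are fully testable for input stuck-at faults.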
Even after the registers are stripped out of the NCL pipeline, the designer is left with a combinational cloud of sequential gates with hysteresis. These gates cannot be directly submitted to any of the existing commercial ATPG tools. However, using the very same observations as for verification, one can represent a single cloud of NCL gates by two separate components: one specifying the behavior of the NCL network in the set phase, and another its behavior in the reset phase. Both of these components are conventional Boolean networks. Furthermore, from the symmetry of path sensitization in the set and reset phases, it follows that testing the single combinational network for the set phase suffices to cover all stuck-at faults. This leads to the following important conclusion: acyclic NCL pipelines are 100% testable for input stuck-at faults. For the general case of pipelines with computational loops, one needs first to break all computational cycles, after which the ease of testing of the acyclic case applies. Breaking cycles can be done by inserting scan registers that are very similar to synchronous scan registers (see [59]). This method is effective and has low cost because it uses partial scan rather than full scan insertion.

3.6.2 Summary
The NCL design methodology was the first to show that asynchronous circuits can indeed be designed using existing commercial EDA tools. This was an important proof of concept. NCL implementations provide the following advantages:

(1) Significant reduction of electromagnetic interference (orders of magnitude) due to the even distribution of circuit transitions in time [82].
(2) Potential power savings that stem from re-implementing an algorithm in an event-centric fashion. One encouraging example is provided by the NCL version of Motorola's STAR08 processor core, which saved about 40% of the power consumption with respect to its synchronous counterpart [82].
(3) Robustness to process and operating variations, which is inherent to QDI implementations.

However, NCL's advantages come at a cost:

(1) Significant area penalty (up to 3–4× for conventional NCL and up to 2.5× for NCL_X). The latter cannot be reduced below 2× due to the nature of the dual-rail implementation.
(2) Power consumption may grow significantly, because one gate out of each pair of dual gates switches up and down in every phase of operation. This may offset the power savings that are achieved at the algorithmic level.

Area overhead seems to be the main bottleneck in the acceptance of the NCL methodology. It makes it difficult to use NCL as a universal approach, while leaving room for “sweet spot” applications [82]. The NCL example stimulated the development of other automated design flows. De-synchronization is a successful follower of NCL that manages to keep area penalties negligible without significant sacrifices in the robustness of the implementation.
4 De-synchronization: A Simple Mutation to Transform a Circuit into Asynchronous
4.1 Introduction
If one were asked for a strategy to design an asynchronous circuit that keeps the differences from its synchronous counterpart to a minimum, that strategy would probably be the one presented in this chapter: de-synchronization. The idea is simple. Consider the implementation of every flip-flop as a pair of master/slave latches. The role of the clock signal is to generate the enable pulses for the latches in such a way that data flows through the circuit to perform a certain computation. De-synchronization simply substitutes the clock with an asynchronous control layer that generates the enable pulses of the latches in such a way that the behavior is preserved. The data-path remains intact.

De-synchronization has been “in the air” since Ivan Sutherland, in his Turing Award lecture [125], pioneered micropipelines, a scheme to generate local clocks for a synchronous latch-based datapath. Mimicking synchronous behavior with asynchronous logic was suggested in [138]. This work was mainly focused on the synchronization
of asynchronous arrays. Similar ideas were exploited in the doubly-latched asynchronous pipeline proposed in [55]. However, these previous approaches were never formalized in a way that allowed them to be integrated into an automated synthesis flow.

The first attempt to derive an asynchronous circuit from an arbitrary synchronous circuit was presented in [69]. The scheme required a re-encoding of the data-path using LEDR, a delay-insensitive code. This results in a significant overhead in area and power that makes the approach not very attractive to designers. To alleviate this overhead, a coarse-grain approach was suggested in [102]. Finally, a formal approach that requires no overhead for the datapath was presented in [20]. This approach will be explained in more detail in this chapter.

In this chapter, a formal model to specify timing diagrams will be used. This model is called the Signal Transition Graph and is presented in the next section. The behavior of synchronous systems is then revisited and analyzed. Understanding the properties of synchronous data transfers is crucial to proposing similar schemes with asynchronous control. The foundations and implementation of de-synchronization are presented in the remaining sections.
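The basic mutation can be sketched in a few lines of Python (our illustrative model, not the controllers of [20]): each flip-flop becomes a master/slave latch pair, and the clock is replaced by locally generated enable pulses. Of the many interleavings the handshake controllers allow, we use the simplest legal schedule, pulsing all master enables and then all slave enables, which reproduces the synchronous flip-flop behavior exactly.

```python
def latch(enable, d, q):
    return d if enable else q       # transparent latch

def desync_step(masters, slaves, data_in, logic):
    """One 'cycle' of a de-synchronized linear pipeline of latch pairs.
    masters[i]/slaves[i] hold the latch states of stage i."""
    for i in range(len(masters)):   # pulse every master enable
        src = data_in if i == 0 else logic(slaves[i - 1])
        masters[i] = latch(True, src, masters[i])
    for i in range(len(slaves)):    # then pulse every slave enable
        slaves[i] = latch(True, masters[i], slaves[i])
    return slaves[-1]

masters, slaves = [0, 0, 0], [0, 0, 0]
inc = lambda v: v + 1               # combinational logic between stages: +1
outs = [desync_step(masters, slaves, x, inc) for x in [10, 20, 30, 40, 50]]
# After the pipeline fills, each output equals its input plus 2 (two logic
# stages between three latch pairs), exactly as in the clocked pipeline.
assert outs[2:] == [12, 22, 32]
```

The datapath (`logic`) is untouched; only the enable-generation policy changed, which is the essence of the de-synchronization mutation.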
4.2 Signal Transition Graphs
The model used in this paper to specify asynchronous controllers is based on Petri nets [87] and is called Signal Transition Graph (STG) [14, 106]. Roughly speaking, an STG is a formal model for timing diagrams that can specify causality, concurrency, and choice among rising and falling transitions of signals. For those readers not familiar with Petri nets, we refer to [87]. Graphically, Petri nets are depicted as bipartite graphs with boxes representing transitions, circles representing places, and arcs representing the flow relation. The tokens held in places represent markings of the net and denote the state of the system. In an STG, the Petri net transitions are labeled with signal events, e.g., a+ and a− representing rising and falling transitions of signal a, respectively. An example of an STG is shown in Figure 4.1(a).
De-synchronization: Simple Mutation Circuit into Asynchronous

[Fig. 4.1 (a) STG, (b) compact representation of an STG, (c) reachability graph.]
To represent STGs in a more compact form, places with exactly one predecessor and one successor transition are omitted. Therefore, arcs between transitions implicitly represent a place. In that case, the tokens are held on the arc. A compact representation of the previous STG is depicted in Figure 4.1(b).

A Petri net models the dynamic behavior of a concurrent system. The set of markings reachable from the initial marking forms a graph called the reachability graph. Figure 4.1(c) depicts the reachability graph of the previous STG.

Let us analyze the behavior of the example by looking at the STG in Figure 4.1(b). After firing the transitions a+ and b+, a marking is reached with a token in place p1, in which two branches can be taken (choice). After firing any of them, c+ e+ c− or d+ e+ d−, another marking is reached (tokens in p2 and p3). At this point, the sequence a− b− can be fired concurrently with the event e−, thus reaching the initial marking again.
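To make the token-game semantics concrete, the following sketch interprets a 1-safe Petri net whose transitions are signal events. The net built here is a small hypothetical STG in the spirit of Figure 4.1(b): the place names p1, p2 and the exact branch structure are illustrative, not copied from the figure. Arcs between two transitions create an implicit place, as in the compact representation described above.

```python
class STG:
    """Interpreter for 1-safe Petri nets whose transitions are signal events.
    Names ending in '+' or '-' are transitions; anything else is a place.
    An arc between two transitions creates an implicit place, mirroring
    the compact STG representation."""

    def __init__(self, arcs, marking):
        self.pre, self.post = {}, {}
        for src, dst in arcs:
            if not self.is_place(src) and not self.is_place(dst):
                p = f"<{src},{dst}>"  # implicit place on a transition-transition arc
                self.post.setdefault(src, set()).add(p)
                self.pre.setdefault(dst, set()).add(p)
            elif self.is_place(src):
                self.pre.setdefault(dst, set()).add(src)
            else:
                self.post.setdefault(src, set()).add(dst)
        self.marking = set(marking)

    @staticmethod
    def is_place(name):
        return not name.endswith(("+", "-"))

    def enabled(self):
        """Transitions whose input places all hold a token."""
        return {t for t in self.pre if self.pre[t] <= self.marking}

    def fire(self, t):
        assert t in self.enabled(), f"{t} is not enabled"
        self.marking = (self.marking - self.pre[t]) | self.post.get(t, set())

# Hypothetical STG: after a+ b+ a choice is made in p1 between two branches;
# both converge in p2, and a- b- returns to the initial marking.
arcs = [("a+", "b+"), ("b+", "p1"),
        ("p1", "c+"), ("c+", "c-"), ("c-", "p2"),
        ("p1", "d+"), ("d+", "d-"), ("d-", "p2"),
        ("p2", "a-"), ("a-", "b-"), ("b-", "a+")]
net = STG(arcs, {"<b-,a+>"})
for t in ["a+", "b+", "c+", "c-", "a-", "b-"]:
    net.fire(t)
assert net.marking == {"<b-,a+>"}  # the cycle returns to the initial marking
```

Firing either branch (c+ c− or d+ d−) from p1 leads back to the same marking, which is exactly the choice behavior discussed above.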
[Fig. 4.2 (a) Netlist, (b) marked graph representation.]
There is a specific class of Petri nets that has special relevance in this work: marked graphs. A marked graph only has places with exactly one predecessor and one successor transition. Most of the STGs presented in this paper will be marked graphs in which places hold at most one token in every reachable marking.

Figure 4.2 describes how a netlist can be modeled with a marked graph. The combinational logic (blocks A, B, and C) is represented by transitions. Latches (or flip–flops) are represented by places. Every latch has only one input predecessor, which is the combinational block that produces the result stored in the latch. A latch may have multiple fanout. In that case, the latch is represented by a set of places, one associated with every output block. This place multiplicity is required to model the fact that the value of the latch is transmitted to all combinational blocks. A place with multiple output arcs would represent a choice, i.e., a transfer to only one of the successors, which is not the actual behavior of the circuit. This is the reason why netlists are typically represented with choice-free nets, i.e., marked graphs.
4.3 Revisiting Synchronous Circuits
Before presenting the principles of de-synchronization it is convenient to understand how synchronous circuits work. The clock signal is the one that keeps the system continuously beating, interleaving data computation and storage. It is a signal with a fixed frequency arriving simultaneously (or with negligible skew) at all latches. A conceptual clocking scheme is depicted in Figure 4.3. The clock is generated by an oscillator (inverted delay) that feeds a non-overlapping phase generator. The oscillator can be internally implemented as a ring with an odd number of inverters. However, the most conventional way is by means of an external quartz crystal that guarantees a fixed oscillation frequency with high accuracy.

[Fig. 4.3 Conceptual view of the clock generation in a synchronous circuit.]

The registers are typically implemented as flip–flops, internally composed of two level-sensitive latches: master and slave. Both latches operate with the two complementary phases of the clock. The cross-coupled NOR gates guarantee that the phases Φ1 and Φ2 do not overlap. The scheme depicted in Figure 4.3 has two important properties:

• Absence of races. At each phase, half of the latches are sending data while the other half are receiving data. Transfers occur from master to slave during Φ2 and from slave to master during Φ1. Since the phases do not overlap, simultaneous transfers master → slave → master or slave → master → slave can never occur and, therefore, no data overruns are produced. This constraint is what determines the hold time of a latch.

• No data loss. The frequency of the clock, determined by the delay of the oscillator, guarantees that the receiving latch does not become opaque before capturing the incoming data; otherwise the data would be lost. This constraint is what determines the performance of the system: the clock period must accommodate the longest combinational path from latch to latch. At the level of latches, this constraint is directly related to the setup time.
4.4 Relaxing the Synchronous Requirement
The synchronous scheme results in simple implementations, since only one synchronization signal is required in the whole system. However, this requirement is not strictly necessary for a correct operation of the system. A well-known asynchronous scheme is the Muller pipeline [86], depicted in Figure 4.4. In this case, there is no explicit distinction between master and slave latches and there is no global synchronization. Instead, a correct synchronization is guaranteed by the local interactions with the neighboring latches: a transfer only occurs when the sender has "valid" data (req = 1) and the receiver is "empty" (ack = 0).

[Fig. 4.4 Muller pipeline.]

A behavioral specification of the interaction of the req and ack signals between two adjacent stages is shown in Figure 4.5(a). The acki signals play the role of the enable signals (Φi) for the latch at stage i. By hiding the events of the req signals and observing only the events of the enable signals (Φi = acki), the causality relation depicted in Figure 4.5(b) can be derived. With this particular protocol, the phases Φi−1 and Φi overlap during the initial period of the transfer (Φi−1+ → Φi+), but this always occurs when the previous data item at stage i has been delivered to the next stage (Φi+1− → Φi+), thus preventing data overruns.

[Fig. 4.5 Muller pipeline: (a) STG specification, (b) enable signals of adjacent latches, (c) timing diagram.]

To avoid data losses, the delay at the req signals is included. This delay must be designed in such a way that the latch does not become opaque before the combinational logic has completed its calculation. To make this scheme more efficient, it is possible to use asymmetric delay chains with short return-to-zero delays, as shown in Figure 4.6. Finally, Figure 4.5(c) shows a possible timing diagram of the interaction between consecutive stages in a Muller pipeline.

[Fig. 4.6 Asymmetric delay.]
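The request/acknowledge interplay described above can be sketched behaviorally. The model below is an assumption-laden abstraction of Figure 4.4: each stage is reduced to a single C element whose output serves both as the request to the next stage and the acknowledge to the previous one, gate and matched delays are collapsed into discrete sweeps, and data is ignored.

```python
def c_element(a, b, prev):
    """Muller C element: the output follows the inputs when they agree
    and keeps its previous value otherwise."""
    return a if a == b else prev

def step(c, left_req, right_ack=0):
    """One sweep over the control chain. Stage i fires when its predecessor
    has valid data (req = 1) and its successor is empty (ack = 0),
    i.e., c[i] = C(req, not ack)."""
    n = len(c)
    nxt = c[:]
    for i in range(n):
        req = c[i - 1] if i > 0 else left_req
        ack = c[i + 1] if i < n - 1 else right_ack
        nxt[i] = c_element(req, 1 - ack, c[i])
    return nxt

# A request presented on the left ripples stage by stage to the right.
c = [0, 0, 0, 0]
trace = [c]
for _ in range(4):
    c = step(c, left_req=1)
    trace.append(c)
```

Lowering the left request afterwards ripples the return-to-zero phase through the chain in the same fashion, completing the four-phase handshake.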
4.5 Minimum Requirements for Correct Asynchronous Communication
In this section, we face the problem of data communication in a more general way. In particular, we want to answer the following question: what are the minimum requirements for an asynchronous scheme to implement correct data transfers between latches?

Figure 4.7(a) depicts a linear pipeline in which the signals A–D denote the enable signals of the latches. The dark and white circles inside the latches represent valid and non-valid data (tokens and bubbles, respectively). To operate correctly, data transfers must always move tokens to latches with bubbles.

[Fig. 4.7 Asynchronous data transfer for a linear pipeline and a ring.]

Figure 4.7(b) describes the minimum causality relations between the events of the enable signals to guarantee correct data transfers. The picture depicts a fragment of the unfolded marked graph. The tokens on the arcs denote the current state of the system. There are three types of arcs in the diagram:

• A+ → A− → A+, arcs that simply denote that the rising and falling transitions of each signal must alternate.

• B− → A+, arcs that indicate that for latch A to read a new data token, B must have completed the reading of the previous token coming from A. If this arc is not present, data overruns can occur, or in other terms hold constraints can be violated.

• A+ → B−, arcs that indicate that for latch B to complete the reading of a data token coming from A it must first wait for the data token to be stored in A. If this arc is not present, B can "read a bubble" and a data token can be lost, or in other terms setup constraints can be violated.

Figure 4.7(c) depicts the marked graph corresponding to the causality relations in Figure 4.7(b). A simplified notation is used in Figure 4.7(d) to represent the same graph, substituting each cycle x → •y → x by a double arc x ↔ •y, where the token is located close to the enabled event in the cycle (y in this example).

It is interesting to notice that the previous model is more aggressive than the Muller pipeline or the classical scheme for generating non-overlapping phases for latch-based designs. As an example, the following sequence can be fired in the model of Figure 4.7(a–d):

D+ D− C+ B+ B− A+ C− · · ·

After the events D+ D− C+ B+, a state in which B = C = 1 and A = D = 0 is reached, where the data token stored in A is rippling through the latches B and C. A timing diagram illustrating this sequence is shown in Figure 4.8. This feature only illustrates the fact that various schemes can be proposed for a correct flow of data along a pipeline, exploring different degrees of concurrency among the events. The practicality and efficiency of every specific scheme will depend on its implementation.

Another important detail is the initialization of the marked graph. If the latch has a token (e.g., latch A), the arc A+ → B− must be marked, i.e., the pulse at latch B must occur before the one at latch A. Conversely, the arc C− → B+ is marked since there is a bubble in latch B. Thus, latch C can deliver its token before capturing the new incoming token from B.
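These rules can be checked mechanically. The sketch below builds the marked graph of Figure 4.7(b–d) for a four-stage pipeline from exactly the three arc types and the initialization rule just described, assuming (consistently with the fired sequence above) that latches A and C hold tokens while B and D hold bubbles.

```python
latches = "ABCD"
has_token = {"A": True, "B": False, "C": True, "D": False}  # assumed initial state

arcs, marking = set(), set()
for x in latches:                            # alternation arcs X+ -> X- -> X+
    arcs |= {(x + "+", x + "-"), (x + "-", x + "+")}
    marking.add((x + "-", x + "+"))          # all enable signals start low
for x, y in zip(latches, latches[1:]):       # latch x feeds latch y
    arcs.add((x + "+", y + "-"))             # no data loss (setup)
    arcs.add((y + "-", x + "+"))             # no overrun (hold)
    # initialization: a token in x marks x+ -> y-, a bubble marks y- -> x+
    marking.add((x + "+", y + "-") if has_token[x] else (y + "-", x + "+"))

def fire(event):
    """Fire a marked-graph event: all input arcs must hold a token."""
    inputs = {a for a in arcs if a[1] == event}
    assert inputs <= marking, f"{event} is not enabled"
    marking.difference_update(inputs)
    marking.update(a for a in arcs if a[0] == event)

level = dict.fromkeys(latches, 0)
for e in ["D+", "D-", "C+", "B+"]:
    fire(e)
    level[e[0]] = 1 if e[1] == "+" else 0
snapshot = dict(level)                       # B = C = 1 and A = D = 0
for e in ["B-", "A+", "C-"]:                 # the sequence continues legally
    fire(e)
```

The snapshot after the first four events reproduces the state B = C = 1, A = D = 0 discussed above, with the token from A rippling through B and C.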
As a particular case, Figure 4.7(f) depicts the graph for a 2-stage ring (Figure 4.7(e)). This is the typical communication mechanism obtained from a state machine that holds the combinational logic between the latches B and A. Interestingly, the folded marked graph after removal of redundant arcs (Figure 4.7(h)) represents the behavior of two non-overlapping phases. Precisely, this is the only protocol that allows a correct operation of the ring, since any phase overlapping would overwrite the contents of both latches. Still, the interesting question that we need to answer is: can this asynchronous data transfer model be generalized for any arbitrary netlist?

[Fig. 4.8 Timing diagram of the linear pipeline in Figure 4.7(a–d): A reads a new data token; D reads the data token from C; the data token from A ripples through B and C.]

4.5.1 A Simple Example
Figure 4.9 depicts a synchronous netlist with a global clock. The shadowed boxes represent level-sensitive latches. The labels 0 and 1 at the latches represent the polarity of the phase. The reader can notice that the phases alternate along any path of the netlist.

[Fig. 4.9 A synchronous circuit with a single global clock.]

The de-synchronization method proposed in this chapter substitutes the clock network with a network of asynchronous controllers. This network is shown in Figure 4.10. Every box is an asynchronous controller that generates the enable signal (Φ) of one of the latches. The controllers synchronize through request and acknowledge signals: Ri and Ai for the input data and Ro and Ao for the output data. When a latch receives data from more than one latch, a C element synchronizes the incoming requests. Similarly, a C element synchronizes the acknowledgments of the output data. As an example, latch A receives data from F and G through the combinational logic that precedes A and, thus, the C element for Ri synchronizes the Ro signals coming from F and G.

[Fig. 4.10 Asynchronous control for the netlist in Figure 4.9.]

The rest of this chapter will be devoted to studying the properties and implementations of the asynchronous controllers that guarantee a correct operation of the system.

4.5.2 Correct De-synchronization
The interaction of the asynchronous controllers in Figure 4.10 produces a specific alternation of the rising and falling transitions of Φi for all latches. We are seeking the requirements for these alternations to produce an asynchronous behavior equivalent to the synchronous one. The main results in this area were presented in [20]. We next review and discuss these results in an informal and intuitive way.

4.5.2.1 Flow Equivalence
This is a concept that defines the behavioral equivalence between two circuits [65]. Two circuits are flow equivalent if they have the same set of latches and, for each latch A, the projections of the traces onto A are the same in both circuits. Intuitively, two circuits are flow-equivalent if their behavior cannot be distinguished by observing the sequence of values stored at each latch. This observation is done individually for each latch and, thus, the relative order with which values are stored in different latches can change, as illustrated in Figure 4.11. The top diagram depicts the behavior of a synchronous system by showing the values stored in two latches, A and B, at each clock cycle. The diagram at the bottom shows a possible de-synchronization, where the pulses of signals A and B represent the behavior of the enable signals of these latches. An important result presented in [20] is the following: The de-synchronization protocol presented in Figure 4.7 preserves flow equivalence.
[Fig. 4.11 Flow equivalence: synchronous behavior (top) vs. desynchronized behavior (bottom).]
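Flow equivalence lends itself to a very small executable definition: project each behavior onto the per-latch sequences of stored values and compare the projections. The traces below are hypothetical, merely in the spirit of Figure 4.11.

```python
def projections(trace):
    """trace: list of (latch, value) storage events in order of occurrence.
    Returns, for each latch, the sequence of values it stores."""
    proj = {}
    for latch, value in trace:
        proj.setdefault(latch, []).append(value)
    return proj

def flow_equivalent(t1, t2):
    """Two behaviors are flow equivalent iff every latch stores the same
    sequence of values in both, regardless of interleaving across latches."""
    return projections(t1) == projections(t2)

# A synchronous trace and a reordering that only changes the interleaving
# across latches, not the per-latch sequences: flow equivalent.
sync   = [("A", 1), ("B", 5), ("A", 3), ("B", 1), ("A", 0), ("B", 2)]
desync = [("A", 1), ("A", 3), ("B", 5), ("B", 1), ("A", 0), ("B", 2)]
assert flow_equivalent(sync, desync)

# Swapping the order of values *within* latch A breaks flow equivalence.
assert not flow_equivalent(sync, [("A", 1), ("A", 0), ("A", 3),
                                  ("B", 5), ("B", 1), ("B", 2)])
```

The second assertion shows that flow equivalence is an observation per latch: the global interleaving may change, but each latch must see its values in the original order.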
This result guarantees that all computations performed by a de-synchronized circuit will be the same as the ones performed by the synchronous counterpart. However, another subtle property is required to guarantee a complete mimicry of the synchronous behavior: liveness. The following result also holds [20]:

The de-synchronization protocol presented in Figure 4.7 preserves liveness.

This property guarantees that the protocol will never enter a deadlock state due to an inappropriate synchronization. At first sight it might look like liveness is an obvious property of any reasonable handshake protocol. Unfortunately, this is not true. The protocol of the Muller pipeline presented in Figures 4.4 and 4.5 preserves flow equivalence. In fact, the protocol can be obtained by reducing the concurrency of the protocol shown in Figure 4.7. However, an n-stage ring with n/2 tokens (such a ring can be obtained by connecting the output of D to the input of A in Figure 4.7) would enter a deadlock if we would use the controllers of a Muller pipeline.

Formally, liveness can be analyzed by studying the marked graph obtained by the cyclic composition of the handshake, as shown in Figure 4.12. A ring with non-overlapping phases would manifest a behavior like the one specified in Figure 4.12(a). The one in Figure 4.12(b) corresponds to a ring with the controllers of a Muller pipeline.

[Fig. 4.12 Synchronization of a ring: (a) non-overlapping phases (live), (b) Muller pipeline (deadlock).]

Liveness can be deduced by using a well-known result in the theory of marked graphs [16]:

A marked graph is live if and only if the initial marking assigns at least one token to each directed circuit.

One can immediately realize that the marked graph in Figure 4.12(b) contains a cycle without tokens, the one covering all falling signal transitions (A−, B−, C−, and D−).
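The criterion is easy to check mechanically: a token-free directed circuit exists exactly when the subgraph of unmarked arcs contains a cycle, so liveness reduces to an acyclicity test (Kahn's topological sort below). The ring used here is a simplified stand-in for Figure 4.12, not a literal transcription: a rising circuit, a falling circuit, and X+ → X− arcs.

```python
def live(arcs, marked):
    """Marked-graph liveness: every directed circuit must hold at least one
    token, i.e., the subgraph of token-free arcs must be acyclic."""
    free = [a for a in arcs if a not in marked]
    nodes = {n for a in free for n in a}
    indeg = {n: 0 for n in nodes}
    succ = {n: [] for n in nodes}
    for u, v in free:
        succ[u].append(v)
        indeg[v] += 1
    frontier = [n for n in nodes if indeg[n] == 0]
    removed = 0
    while frontier:                       # Kahn's algorithm
        u = frontier.pop()
        removed += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                frontier.append(v)
    return removed == len(nodes)          # acyclic iff every node was removed

sigs = "ABCD"
rise = [(s + "+", t + "+") for s, t in zip(sigs, sigs[1:] + sigs[0])]
fall = [(s + "-", t + "-") for s, t in zip(sigs, sigs[1:] + sigs[0])]
cross = [(s + "+", s + "-") for s in sigs]
arcs = rise + fall + cross

# Muller-pipeline-like marking: no token on the circuit of falling
# transitions A- -> B- -> C- -> D- -> A-, hence a deadlock.
deadlocked = {("D+", "A+")} | set(cross)
assert not live(arcs, deadlocked)

# Placing one token on the falling circuit makes every circuit marked.
assert live(arcs, deadlocked | {("D-", "A-")})
```

Removing marked arcs and testing for acyclicity is equivalent to enumerating all directed circuits, but runs in linear time in the size of the graph.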
4.6 Handshake Protocols for De-synchronization
What is inside the asynchronous controllers in Figure 4.10? Several handshake protocols have been proposed in the literature for asynchronous pipelines. The question is: are they suitable for a fully automatic de-synchronization approach? Is there any controller that manifests the concurrency of the de-synchronization model proposed in this paper?

We now review the classical four-phase micropipeline latch control circuits presented in [35]. For that, the specification of each controller (Figures 5, 7, and 11 in [35]) has been projected onto the handshake signals (Ri, Ai, Ro, Ao) and the latch control signal (A), thus abstracting away the behavior of the internal state signals. The controller shown in Figures 5 and 6 of [35] (simple 4-phase control) corresponds exactly to the behavior of the Muller pipeline.

Figures 4.13(a) and 4.13(b) show the projections of the semi-decoupled and fully-decoupled controllers from [35]. The left part of the figure depicts the connection between adjacent controllers generating the latch enable signals A and B, respectively. The right part depicts only the projection on the latch control signals when three controllers are connected in a row with alternating tokens and bubbles.

[Fig. 4.13 Handshake protocols: (a) semi-decoupled four-phase control (Furber & Day), (b) fully decoupled four-phase control (Furber & Day), (c) fall-decoupled control, (d) de-synchronization control.]

The controllers from [35] show less concurrency than the de-synchronization model. For this reason, we also propose a new controller implementing the protocol with maximum concurrency proposed in this paper (Figure 4.13(d)). For completeness, a handshake decoupling the falling events of the control signals (fall-decoupled) is also described in Figure 4.13(c). In all cases, it is crucial to properly define the initial state of each controller, which turns out to be different for the latches initially holding tokens and those initially holding bubbles.

We now present a general study of four-phase protocols, illustrated in Figure 4.14. The figure describes a partial order defined by the degree of concurrency of different protocols. Each protocol has been annotated with the number of states of the corresponding state graph. The marked graphs in the figure do not contain redundant arcs.

[Fig. 4.14 Different degrees of concurrency in handshake protocols: non-overlapping (4 states), simple (Muller pipeline, not live), semi-decoupled (Furber & Day, 5 states), rise-decoupled (fully decoupled, Furber & Day, 6 states), fall-decoupled (6 states), and the de-synchronization model (8 states); arcs denote concurrency reduction; all models except the simple one are live and flow-equivalent.]

An arc in the partial order indicates that one protocol can be obtained from the other by a local transformation (i.e., by moving the target of one of the arcs of the model). The arcs A+ ↔ A− and B+ ↔ B− cannot be removed for obvious reasons (they can only become redundant). For example, the semi-decoupled protocol (5 states) can be obtained from the rise-decoupled protocol (6 states) by changing the arc² A+ → •B− to the arc A+ → •B+, thus reducing concurrency.

The model with 8 states, labeled as "de-synchronization model," corresponds to the most concurrent model presented earlier in this chapter. The other models are obtained by successive reductions or increases of concurrency. The nomenclature rise- and fall-decoupled has been introduced to designate the protocols in which the rising or falling edges of the pulses have been decoupled, respectively. The rise-decoupled protocol corresponds to the fully decoupled one proposed in [35].

Any attempt to increase the concurrency of the de-synchronization model results in the violation of one of the properties for correct data transfer described in Section 4.3: absence of races or no data loss. In [20], the following results were proved for the models shown in Figure 4.14:

• All models preserve flow equivalence.
• All models except the simple four-phase protocol (top left corner) are live.
• De-synchronization can be performed by using any hybrid combination of the live models shown in the figure (i.e., using different types of controllers for different latches).

These results offer great flexibility to design different schemes for de-synchronized circuits.

4.6.1 Which Controller to Use?
There is no simple answer to this question. It is generally known that the degree of concurrency is highly correlated with the complexity of the controller. Depending on the technology used for the implementation, the critical cycles in the controller may vary.

² Note that this arc is not explicitly drawn in the picture because it is redundant.
Since any hybrid combination of controllers is valid for de-synchronization, a natural approach to select the controllers can be the following:

(1) Calculate the criticality of the cycles in the datapath. The most stringent cycles are those that have the highest average delay for each combinational block.
(2) Use simple controllers (e.g., non-overlapping) for the latches in non-critical cycles.
(3) Use more concurrent controllers for the latches in critical cycles.
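The selection procedure can be sketched as follows. The cycle set, delay values, and controller names are hypothetical; cycle criticality is taken, per step (1), as the average combinational delay per block along the cycle.

```python
# Hypothetical datapath cycles: each maps the latches on the cycle to the
# delays (ns) of the combinational blocks between them.
cycles = {
    ("A", "B"): [9.0, 2.0],
    ("C", "D", "E"): [3.0, 3.0, 2.5],
}

def criticality(delays):
    """Average delay per combinational block along the cycle (step 1)."""
    return sum(delays) / len(delays)

ranked = sorted(cycles, key=lambda c: criticality(cycles[c]), reverse=True)
critical = ranked[0]

controller = {}
for cycle in ranked:  # most critical cycle first
    for latch in cycle:
        # Steps (2)-(3): concurrent controllers on the critical cycle,
        # simple non-overlapping controllers elsewhere; a latch shared
        # between cycles keeps the assignment from the more critical one.
        controller.setdefault(
            latch,
            "de-synchronization" if cycle == critical else "non-overlapping")
```

Since hybrid combinations of live controllers are valid, this per-latch assignment needs no global consistency check beyond choosing from the live models of Figure 4.14.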
4.7 Implementation of Handshake Controllers
The protocols described in Section 4.6 can be implemented in different ways using different design styles. In this section, a possible implementation of the semi-decoupled four-phase handshake protocol is described. It is an implementation with static CMOS gates, while the original one was designed at transistor level. This controller offers a good trade-off between performance and complexity. In time-critical applications, other controllers can be used, including hybrid approaches that combine protocols different from the ones shown in Figure 4.14.

Figure 4.15 depicts an implementation of a pair of controllers for a fragment of data-path. The figure also shows the marked graphs modeling the behavior of each controller. The only difference is the initial marking, which determines the reset logic (signal RST). Resetting the controllers is crucial for a correct behavior. In this case, alternating latches are, respectively, transparent and opaque in the initial state, playing the role of master and slave latches in synchronous designs. With this strategy, only the opaque latches must be reset in the data-path.

[Fig. 4.15 Implementation of semi-decoupled controllers. Initial states: A = Ao = 1, Ri = Ai = Ro = 0 for the first controller; Ai = 1, B = Ri = Ro = Ao = 0 for the second.]

The controllers also include a delay that must match the delay of the combinational logic and the pulse width of the latch control signal. Each latch control signal (A and B) is produced by a buffer (tree) that drives all the latches. If all the buffer delays are similar, they can be neglected during timing analysis. Otherwise, they can be included in the matched delays, with a similar but slightly more complex analysis.

In particular, the delay of the sequence of events A+ → Ro/Ri− → logic delay → B+ is the one that must be matched with the delay of the combinational logic plus the clock-to-output delay of a latch. The event Ro/Ri− corresponds to the falling transition of the signal Ro/Ri between the A- and B-controllers. On the other hand, the delay of the sequence B+ → Ai/Ao− → Ro/Ri+ → pulse delay → B− is the one that must be matched with the minimum pulse width. It is interesting to note that both delays appear between transitions of the control signals of Ri and B, and can be implemented with just one asymmetric delay. A more detailed analysis about the timing requirements of the asynchronous controllers used for de-synchronization can be found in [20].
4.8 Design Flow
Given a synchronous circuit implemented with combinational logic and D flip–flops, all of which work at the same clock edge, de-synchronization can be performed in the following steps:

(1) Conversion into a latch-based design. All flip–flops must be decomposed into the corresponding master and slave latches.
(2) Generation of matched delays. Each matched delay must be greater than the critical path of the corresponding combinational block. The matched delays serve as completion detectors. In the case of variable-delay units that may have data-dependent delays, a multiplexor selecting a specific delay can be used, thus taking advantage of the average performance of the functional units [9].
(3) Selection and implementation of the asynchronous controllers. As shown in Section 4.6, different controllers can be used for each latch, trading off area and delay. Finally, the C elements combining the request and acknowledge signals of multiple inputs and outputs, respectively, must be inserted.

Unfortunately, the delay of each functional unit is not known until the circuit layout has been generated. This can make the estimation of matched delays overly conservative, with a negative impact on performance. Matched delays are simply chains of inverters and nand/nor gates, as shown in Figure 4.6. A practical approach to design the matched delays is to initially start with pessimistic delays, according to pre-layout timing analysis. After layout, timing analysis can be performed and more accurate delays can be calculated. The already placed delays can be re-adjusted by removing some of the gates. This operation only requires incremental rewiring, leaving the unused gates in their original location.

The previous design flow offers important benefits with regard to other more intricate flows for asynchronous design. Given that the datapath remains intact, conventional equivalence checking tools can be used for verification and scan paths can be inserted for testability. Asynchronous handshake circuits can also be tested by using a full-scan methodology, as discussed in [98]. Handshake circuits are self-checking, and the work in [59] showed that 100% stuck-at coverage can be achieved for asynchronous pipelines using conventional test pattern generation tools.

Since variability is small between devices in relatively small regions of the layout, it is reasonable to assume that the performance of the circuit is determined by the matched delays. In this scenario, the response time of every individual die can be measured by observing the request and acknowledge wires at the boundaries of the system. In other words, the maximum speed of a die can be estimated by only looking at the timing of transitions of some output signals with respect to the clock input, without the need for expensive at-speed delay testing equipment. This allows one to classify dies according to their maximum operational speed (binning), which so far was only used for leading-edge CPUs due to the huge cost of at-speed testing equipment.
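Step (2) and the post-layout trimming of matched delays can be sketched numerically. All figures here are invented for illustration: a fixed per-stage delay for the inverter chain, a pessimistic pre-layout margin, and a tighter post-layout one.

```python
import math

STAGE_DELAY_PS = 35.0  # delay of one inverter pair in the chain (assumed value)

def stages_needed(path_ps, margin):
    """Number of chain stages so that the matched delay covers the
    combinational critical path plus a relative safety margin."""
    return math.ceil(path_ps * (1.0 + margin) / STAGE_DELAY_PS)

def size_pre_layout(path_ps):
    """Initial, pessimistic sizing from pre-layout timing analysis."""
    return stages_needed(path_ps, margin=0.30)

def trim_post_layout(placed, path_ps):
    """Post-layout re-adjustment: stages may only be removed (rewired
    around, leaving unused gates in place), never added."""
    return min(placed, max(stages_needed(path_ps, margin=0.05), 1))

n_pre = size_pre_layout(685.0)            # pessimistic pre-layout path estimate (ps)
n_post = trim_post_layout(n_pre, 540.0)   # accurate post-layout path (ps)
```

The asymmetry of the flow (only removing stages after layout) mirrors the incremental-rewiring constraint: the placed chain is never grown, so no re-placement is needed.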
4.9 Why De-synchronize?
While de-synchronized circuits closely resemble their synchronous counterparts, there are subtle but important differences. These make de-synchronization an attractive approach to tackle some of the problems inherent to modern circuit design. The significant uncertainties in gate and wire delays of current technologies are causing designers to adopt extremely conservative strategies to ensure that a sufficiently large number of manufactured chips work correctly. While statistical timing analysis may help to identify and quantify the different sources of variability, designers must still define huge margins that have a negative impact on performance.
De-synchronization is an approach to tackle the effects of correlated variability sources, such as supply voltage, operating temperature, and large-scale process variations (e.g., optical imperfections). We believe that the natural capability of accommodation of circuit performance to environmental conditions is what allows this approach to explore a wide range of voltage/frequency points that are not achievable by synchronous circuits. We refer the reader to Section 6.2, which analyzes the results obtained from the de-synchronization of a DLX microprocessor, to observe the features previously mentioned.
4.10 Conclusions
The approach presented in this chapter is probably the simplest strategy to mimic a synchronous circuit without a clock. A correct synchronization is guaranteed by replacing the clock with a network of distributed asynchronous controllers associated with the sequential elements of the data-path. This simple strategy allows one to use existing tools for synthesis and timing analysis of synchronous systems. Section 6.2 presents a practical example of de-synchronization.
5 Automated Gate Level Pipelining (Weaver)
5.1 Automated Pipelining: Motivation
This chapter presents a direct translation approach for automatic implementation of conventional RTL designs as fine-grain pipelined asynchronous QDI circuits. This approach has two main distinctive features. It solves the global clocking problem per se and replaces the register transfer architecture with a gate transfer architecture (further referred to as GTL). This results in very fine-grain pipelined circuits, regardless of the level of pipelining of the original synchronous implementation.

In synchronous designs, automation of pipelining is difficult to implement as it changes the number and position of registers, which finally results in a completely new specification. There are no tools capable of establishing the correspondence between the functionality of synchronous pipelined and synchronous non-pipelined designs. Moreover, synchronous pipelining is reasonable only for more than eight levels of logic. This number is increasing because gate latency decreases as technology shrinks while clock limitations (routing and skew) stay the same [39, 46].
Fig. 5.1 Retiming-based automated RTL pipelining.
To illustrate the inefficiency of fine-grain pipelining in synchronous systems, we set up an experiment with retiming-based automated synchronous pipelining. We synthesized the Advanced Encryption Standard (AES) [30] from RTL Electronic Code Book with Synopsys DC using the Artisan 0.18 library (typical). The first column of Figure 5.1 shows the non-pipelined implementation. It is a combinational design comprising over 42,000 gates. Columns 2–9 present the results of various modes of automatic pipelining. Columns 2 and 3 show automatically pipelined hierarchical and flat implementations targeting maximum performance. The results with high numbers of stages (columns 4–9) are obtained by forcing Design Compiler to insert 10, 25, 50, 100, 200, and 400 pipeline stages, respectively. It is clear from Figure 5.1 that finer pipeline granularity in synchronous RTL causes a significant area overhead with little effect on performance. In the extreme cases, performance of the very fine-grain pipelined implementations is worse due to the growing complexity of the clock tree. These results support the theoretical conclusions from [39, 46] on the granularity limits of synchronous pipelining. The comparison between synchronous and automatically fine-grain pipelined asynchronous implementations for the same example is presented later on (see Figure 6.7 in Section 6.3).
Contrary to the synchronous case, in QDI designs the delays of gates and the delays of handshake control implementations shrink consistently. No margin is required because data transfers are based on explicit local handshakes. These local handshakes allow a circuit to operate correctly regardless of process and environmental variations. In the GTL synthesis flow [116, 117, 118], pipelining is done by default at the gate level, which gives the finest degree of pipelining and the highest potential performance. Apart from efforts to automate pipelining in circuit synthesis, a serious effort is dedicated to designing efficient micropipeline implementations based on dynamic logic. Dynamic logic is relevant not only for high-speed implementations [101], but for hardware security as well [129]. Unfortunately, up to now the usage of dynamic gates was limited to custom or semi-custom design flows. There are two major reasons why EDA support for dynamic logic designs is difficult when using the synchronous methodology [13, 38]. First, each synchronous dynamic gate requires a clock input and uses both levels of the clock signal, which makes it behave like a flip-flop from the EDA tool's point of view. Second, charge sharing and uncertainty about the worst case delay make static timing analysis (STA) of dynamic circuits very complex. As a result, commercial RTL synthesis tools provide almost no support for design in dynamic logic libraries. Our approach incorporates dynamic logic within QDI implementations, for which the timing problems are solved gracefully. As a result, the approach enables the usage of dynamic gates in an automatic standard-cell based design flow. The QDI implementation produced by the GTL synthesis flow is based on ideas borrowed from the custom design methodology suggested for a MIPS microprocessor in [71, 79], which were further developed to result in a fully automated design flow.
The most attractive feature of this flow is its complete compatibility with the standard synchronous RTL flow. The dual-rail nature of the implementation and the handshake control are completely hidden from the RTL designer. The GTL flow does not require channel extensions of RTL. This is important as any standard
synchronous Verilog or VHDL specification can be directly implemented as an asynchronous fine-grain pipelined design without any modifications to the existing HDL code. In short, our flow uses the results of synchronous synthesis to trace register transfers and set/reset points, and structurally replaces communication between registers by chains of gate-level communications. Thus, the asynchronous channels (dual-rail data and handshake control) are introduced only at the level of technology mapping. Other known approaches suggest adding channels to a standard HDL (e.g., Verilog or VHDL) [103, 108, 109] to make these languages more appropriate for describing the behavior of asynchronous systems. This approach, though attractive, requires the re-education of RTL designers and increases the difficulty of reusing legacy RTL code.
5.2 Automated Gate-Level Pipelining: General Approach
Micropipeline synthesis (as a particular case of the SADT methodology) consists of three main steps: RTL synthesis, re-implementation, and final mapping. RTL synthesis is performed by standard RTL synthesis tools from a given HDL specification. Initially, RTL functionality is identified. Every combinational gate or clocked latch (gi) is represented by a library cell (see examples in Figure 5.2). Clocked flip-flops are treated as pairs of sequentially connected latches (master gmi and slave gsi) controlled by two opposite-phase clocks and are represented as two cells each. Every data wire (i.e., any wire except for clock and reset) is mapped to a cell connection. This way, no additional data dependencies are added and no existing data dependencies are removed. The initial state of state-holding gates (D-latches and D-flip-flops) is guaranteed by an appropriate reset. The only difference from standard RTL synthesis is that the implementation is performed using a virtual library whose content corresponds to the micropipeline library (see details below). This synthesis step determines the design implementation architecture. It can be tuned, in the same way as conventional RTL synthesis, to trade off area, performance, or dynamic power consumption.
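The RTL identification step above can be sketched as a simple netlist pass. The following is an illustrative model, not the actual flow implementation: every clocked flip-flop is treated as a master/slave pair of latches on opposite clock phases, while combinational gates and plain latches map one-to-one to virtual-library cells. The cell names and tuple format are hypothetical.

```python
def expand_netlist(cells):
    """cells: list of (instance_name, cell_kind) pairs from RTL synthesis.
    Returns (instance_name, cell_kind, clock_phase) triples, with each DFF
    expanded into a master latch (phase 0) and a slave latch (phase 1)."""
    expanded = []
    for name, kind in cells:
        if kind == "DFF":
            expanded.append((name + "_m", "DL", 0))  # master latch
            expanded.append((name + "_s", "DL", 1))  # slave latch
        else:
            expanded.append((name, kind, None))      # combinational gate or DL
    return expanded
```

Note that the pass only rewrites cells; it touches no data wires, matching the property that no data dependencies are added or removed.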
[Figure: (a) virtual library cells for a clocked latch (D-latch with set/reset) and an AND2 gate; (b) the corresponding micropipeline cells, in which data wires become channels of spaced dual-rail data with req/ack handshakes, synchronized by join and fork modules and built from completion detection (CD), phase control (PC), function (F, with F0=A0, F1=A1 for the latch and F0=A0+B0, F1=A1&B1 for the AND2), handshake (ACK), and memory (M) blocks.]
Fig. 5.2 Micropipeline synthesis examples: clocked latch (top) and AND2 gate with a fork (bottom).
Re-implementation takes the RTL netlist obtained in the previous step as input. The algorithm has linear complexity as it is substitution based. We assume that for every logic gate or latch there exists a cell or a previously synthesized module implementing the cell's functionality. This assumption is satisfied by targeting RTL synthesis to the virtual library, which is functionally equivalent to the micropipeline library, and by the bottom-up synthesis of hierarchical designs. Next, implementability and optimality are ensured. The netlist of virtual cells is translated into a micropipeline netlist using the stage cells from the micropipeline library. Finally, the micropipeline netlist is checked against fan-out/driving capability design rules and modified, if necessary, by the same RTL synthesis tool. The final netlist can be written in any of the formats supported by the tool, in hierarchical or flat form. Figure 5.2 presents micropipeline synthesis examples for a clocked latch with a fan-out of 1 and an AND2 gate with a fan-out of 2. The latch (Figure 5.2(a), top) is connected to reset with its preset pin and is initialized to "1" in the RTL implementation. During micropipeline synthesis, an identity cell from the micropipeline library replaces the latch. This cell is initialized to the dual-rail value of logical "1."
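The substitution-based nature of the re-implementation step can be sketched as a single linear pass. This is a hedged illustration only; the dictionary below is a hypothetical stand-in for a real micropipeline library.

```python
# Hypothetical library mapping: virtual cell kind -> micropipeline stage cell.
MICROPIPELINE_LIB = {
    "AND2": "AND2_STAGE",  # dual-rail AND2 pipeline stage
    "DL":   "ID_STAGE",    # clocked latch -> identity (buffer) stage
}

def reimplement(netlist):
    """Substitute every virtual cell with its micropipeline stage in one pass,
    which is why the algorithm is O(n). Data wires (channels) are untouched."""
    return [(name, MICROPIPELINE_LIB[kind]) for name, kind in netlist]
```

Because each cell is substituted independently, the pass preserves the netlist topology exactly, mirroring the properties guaranteed by the flow.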
The combinational AND gate is implemented by a micropipeline stage with equivalent dual-rail functionality. The gate output depends on both inputs, therefore the inputs must be synchronized by a join module. Likewise, the output is split to X and Y, which makes it necessary to synchronize the feedback acknowledgments with a fork module. Channels denote sets of signals (two for dual-rail) carrying data and up to two handshake signals associated with the data (a request propagating in the same direction as the data and an acknowledge moving backward). Other components of micropipeline cells will be explained shortly. Figure 5.2 shows a general case of channel expansion using request (req), acknowledgment (ack), and dual-rail binary data wires. The join and fork module implementations are protocol dependent. The Muller C-element, for example, may serve as a handshake synchronizer. The channels depicted in Figure 5.2(b) show the most general case, when both req and ack signals are present together with synchronization for multiple-input and multiple-fan-out gates. However, some templates may not use all four (handshake req, ack, and data d0, d1) communication lines (e.g., the precharge half-buffer (PCHB) from [92] does not use req). Portions of combinational logic are substituted by functionally equivalent pipeline stages. Without loss of generality, let every such portion be a single logic gate in the synchronous implementation. During the mapping of the RTL implementation, data wires are substituted by channels, and the following properties of micropipeline synthesis are guaranteed by the flow design:
• No additional data dependencies are added and no existing data dependencies are removed.
• Every gate implementing a logical function is mapped to a fine-grain cell (stage) implementing the equivalent function for dual-rail encoded data that is initialized to a spacer.
A more detailed consideration of these and other properties of micropipeline synthesis can be found in [115, 118].
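The two building blocks just mentioned can be modeled in a few lines. The sketch below is illustrative only, using the dual-rail convention of Fig. 5.2 where (0, 0) is the spacer: the Muller C-element changes its output only when both inputs agree (holding state otherwise), which is what makes it usable as a join synchronizer, and the AND2 stage function is F0 = A0 + B0, F1 = A1 & B1.

```python
class CElement:
    """Behavioral model of a Muller C-element (state-holding)."""
    def __init__(self, init=0):
        self.out = init

    def step(self, a, b):
        if a == b:          # inputs agree: output follows them
            self.out = a
        return self.out     # inputs differ: output holds its state

def dual_rail_and(a, b):
    """Dual-rail AND2 per Fig. 5.2: F0 = A0 + B0, F1 = A1 & B1.
    Each value is a (d0, d1) rail pair; (0, 0) is the spacer."""
    (a0, a1), (b0, b1) = a, b
    return (a0 | b0, a1 & b1)
```

Note that feeding a spacer on either input keeps the output at the spacer, which is the behavior the join relies on.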
5.3 Micropipeline Stages: QDI Template
The suggested fine-grain pipelining flow is targeted at QDI implementations. The assumptions about isochronic forks are supported by keeping all wire forks internal to micropipeline cells and by constraining the internal layout of cells. An example of a dynamic implementation of a micropipeline stage cell implementing the AND2 function is shown in Figure 5.3. This particular example illustrates the reduced stack precharge half-buffer (RSPCHB) template from [92]. A block implementing the stage logical function is denoted by F. The rest of the blocks are typical for most of the stages. LReq and LAck are the left and RReq and RAck the right requests and acknowledgments, respectively. ACK implements the handshake, PC is a phase (precharge/evaluate) control, CD serves as completion detection, and M stands for memory. "Staticizers" (or keepers) are formed by adding weak inverters as shown in Figure 5.3. They are needed to store the stage output value for an unlimited time, thus eliminating timing assumptions. At the same time, keepers solve the charge sharing problem and improve the noise margin of precharge style implementations. The req line is used in protocols to signal data availability to the following stages, while the ack is used to indicate data consumption to the previous stage. Depending on the communication protocol, some or all of the handshake events can be transmitted over the data lines, so the req and/or ack lines may not be needed.
[Figure: (a) input completion detection; (b) AND2 stage implementation example with dual-rail inputs A0/A1, B0/B1, outputs Z0/Z1, weak keepers, and parasitic capacitances.]
Fig. 5.3 AND2 micropipeline stage dynamic implementation example.
A dedicated micropipeline library, with each cell representing an entire micropipeline stage, localizes in-stage timing assumptions and power balancing inside the cell, thereby leaving them to the library designer. With DI inter-stage communication, the functionality of the implementation no longer depends on its place & route. Note (Figure 5.3) that the memory and logic function implementations are of the same cost and speed as their synchronous dual-rail domino counterparts. The main sources of area overhead are the Muller C-element for the handshake control implementation (ACK), the completion detection circuitry (CD), and the ack/req synchronization (see Figure 5.2).
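The completion detection block (CD) mentioned above can be illustrated with a minimal sketch of its logical behavior for one dual-rail signal, assuming the usual convention: exactly one rail high means valid data, both rails low is the spacer, and both rails high is an illegal codeword.

```python
def completion(d):
    """Completion detection for one dual-rail signal (d0, d1).
    Returns 1 when valid data is present, 0 for the spacer."""
    d0, d1 = d
    if d0 and d1:
        raise ValueError("illegal dual-rail codeword: both rails high")
    return d0 | d1
```

For a multi-bit stage, the per-signal results are combined (e.g., with a C-element tree) so the stage acknowledges only after all inputs have arrived.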
5.4 Design Flow Basics
The Weaver synthesis tool [115, 144] consists of two main parts: an engine and a set of scripts. These scripts implement the user interface (commands) and handle the interaction with a standard RTL synthesis tool. The flow is implemented [115, 117, 118] maximizing the use of commercial design tools. The Weaver engine is a VHDL compiler based on the Savant VHDL compiler [110]. The engine incorporates VHDL and Synopsys Liberty [67] parsers/generators to interface the design and library specifications with industry-standard tools. The RTL synthesis tool currently used in the flow is Synopsys Design Compiler. This set of tools is targeted at micropipeline synthesis but also automates some library installation tasks. Library installation is executed once per library, or every time the library is modified. This step is essential for ensuring the flexibility of the flow in using various micropipeline libraries. A micropipeline stage cell is defined as a pre-designed module that implements one or more functions of its inputs. This layer of abstraction allows the flow to be flexible and support different micropipeline styles. Every data input or output is considered to be a channel consisting of encoded data and, optionally, handshake lines. The virtual library is an imaginary single-rail synchronous RTL library functionally equivalent to the micropipeline library. The virtual library is generated from the micropipeline library during its installation. The cell AND2 in Figure 5.2(a) is a virtual library cell generated for the micropipeline cell AND2 (Figure 5.2(b)) found in the micropipeline library. Area and delay characteristics of the virtual library cells are mapped from the corresponding micropipeline library cells to make optimization during RTL synthesis meaningful. For the adopted QDI style, only the skew of wire delays after forks matters. The delays after forks are kept under control by ensuring that inter-cell communications are delay insensitive. The flow achieves this by placing all isochronic forks inside library cells. Together with guaranteeing correct functionality, this requirement lowers the effect of variation in design parameters. As long as individual cells are designed to handle variation inside the cell, no global constraints are necessary. The communication protocol is a return-to-zero DI protocol. All submodules of any given module must use this protocol, in which data items are separated with spacers. The flow currently cannot handle a library designed for non-return-to-zero (non-RTZ) protocols, like phased logic, since the synthesis procedure would be quite different. Apart from the handshake synchronization cells, the library must also implement the equivalent of an AND2 cell (or OR2, assumed to be the dual of AND2) and an identity function stage (buffer stage). The inverter is implemented as a cross-over of the data wires.
There must be at least three versions of each buffer stage implementation (resettable to data1, data0, and spacer) and at least one resettable-to-spacer version of the cells implementing other logic functions.
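The return-to-zero convention and the "free" inverter described above can be made concrete with a short sketch (illustrative only, not part of the flow): every data item on a channel is followed by a spacer, and logical inversion needs no stage at all because it is just a cross-over of the two data rails.

```python
SPACER = (0, 0)  # all-rails-low: separates consecutive data items

def encode_rtz(bits):
    """Encode a bit sequence as dual-rail codewords separated by spacers."""
    stream = []
    for b in bits:
        stream.append((0, 1) if b else (1, 0))  # rail d1 high encodes logic 1
        stream.append(SPACER)                    # return to zero
    return stream

def invert(d):
    """Inverter as a cross-over of the data wires; spacers are unaffected."""
    d0, d1 = d
    return (d1, d0)
```

The cross-over maps data1 to data0 and back while leaving the spacer fixed, so inversion costs no logic, area, or delay on the channel.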
5.5 Fine-Grain Pipelining with Registers
Suppose that a synchronous RTL netlist produced by a standard RTL synthesis engine consists of combinational logic (CL) gates, D-latches (DL), and D-flip-flops (FF). Clocked registers (either composed of DFFs or DLs) are used in RTL implementations to separate the concurrent processing of distinct consecutive data items, which advance through the registers at the pace of the clock signal.
[Figure: (a) pipelined synchronous combinational circuit built from DL and CL blocks; (b) the corresponding fine-grain pipelined GTL circuit built from HB stages.]
Fig. 5.4 Weaving with clocked registers.
Figure 5.4(a) shows master and slave latches with empty and shaded circles representing bubbles and data items. "Master" and "slave" latches are controlled by opposite edges of the clock. The pipeline cells can be divided into two categories: full-buffer (FB) and half-buffer (HB) [79, 92]. The two are distinguished by their token capacity, i.e., the number of data items that can fit in a pipeline of a given length. The handshake protocols we are considering support the separation of consecutive data tokens (with a spacer/bubble), which makes it possible to substitute every DL with one HB stage (and every DFF with two HB or one FB stage). The result is shown in Figure 5.4(b) where, in the absence of a clock signal, the circles represent the initial states of the stages. Bubbles are denoted by empty circles and tokens by shaded ones. In HB pipelines, distinct tokens are always separated by bubbles (no two distinct tokens in any two adjacent stages). From this follows the next property: for each FF in the RTL implementation there exist two HB stages in the GTL implementation, one initialized to a bubble and the other to a token. More tokens can simultaneously fit in the pipeline, increasing its performance relative to the synchronous implementation.
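The token/bubble behavior of an HB pipeline can be captured in a toy model (not from the text): a token may advance only into a bubble whose own successor does not hold a token, so two distinct tokens can never occupy adjacent stages.

```python
def step(stages):
    """One step of an abstract HB pipeline. stages is a string of 'T' (token)
    and 'B' (bubble); the right end is closed, so the last token stalls there
    until (in a fuller model) a consumer would remove it."""
    out = list(stages)
    for i in range(len(out) - 1, 0, -1):
        target_free = out[i] == 'B'
        successor_free = (i == len(out) - 1) or out[i + 1] == 'B'
        if target_free and successor_free and out[i - 1] == 'T':
            out[i - 1], out[i] = 'B', 'T'   # token advances one stage
    return ''.join(out)
```

Running the model shows the HB invariant directly: a token never enters a stage adjacent to another token, so legal markings stay legal.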
For linear and loop-free data paths, initialization is less of an issue because it impacts only a few output items until the pipeline is flushed with new data tokens. In the presence of loops and/or nonlinearities, initialization is crucial for liveness, as shown in [118].
5.6 Pipeline Petri Net Model of the Flow
We use Petri nets (PN) [87] to model the behavior of the original synchronous and resulting GTL circuits. Petri nets are also used to prove the correctness of the GTL implementation. We use a high-level abstraction where PN markings represent the positions of tokens (data portions) in the pipeline.
5.6.1 Linear Case
In synchronous implementations, each pipeline stage is implemented with a D-flip-flop (DFF) or two D-latches (DL) with oppositely driven clocks to store the result of data processing in the combinational logic (CL) (see Figures 5.5(a) and 5.5(b)). The synchronous pipeline PN model is shown in Figure 5.5(c), where t1 represents the DL1 state change, t3 the DL2 state change, etc., while the transitions clk0, clk1 represent clock edges. For asynchronous pipeline implementations, the model from [139] (Figure 5.5(d)) can be used. It restricts the PN in such a way that for every two transitions ti, tj connected through a place pk, there exists a place pl that succeeds tj and precedes ti. We refer to such PNs as Pipeline Petri nets (PPNs) [139]. We call a pair of places such as pk, pl a stage space, and the stage space with its adjacent transitions a stage. A PPN consists of stages for which the following properties are satisfied:
• No stage space is shared between any two stages.
• Exactly one place is marked in every stage space.
Depending on whether the post- or precondition is marked for a stage, the latter is said to contain a token or a bubble. In the course of PPN execution, adjacent stages having opposite markings can change their states by firing transitions. Tokens propagate in one direction (the
[Fig. 5.5: (a) synchronous (DFF) pipeline implementation; (b) synchronous (DL) pipeline implementation; Petri net model with transitions t1, t3–t6 and clock transitions clk0, clk1.]