Activity-Monitoring Completion-Detection (AMCD): A new single rail approach to achieve self-timing

E. Grass, R. C. S. Morling and I. Kale
School of Electronic and Manufacturing Systems Engineering, University of Westminster, 115 New Cavendish Street, London, W1M 8JS, GB

Abstract

A new method for designing single rail asynchronous circuits is studied. It utilises additional circuitry to monitor the activity of nodes within combinational logic blocks. When all transitions have halted a completion signal is generated. Details of the circuit and design methodology are given and the influence of glitches on the proposed circuit is discussed. Three different levels of granularity are investigated. Experimental physical layout of the circuit with extracted and back-annotated simulation results is provided. The proposed approach results in faster operation than synchronous circuits with minimum circuit overhead incurred.
1 Introduction
The difficulty in distributing a high-speed, accurate clock over a large circuit area, as well as the inherently high power consumption of the clock network, has prompted designers to reconsider the role of asynchronous circuits. Whereas in synchronous systems the clock frequency governs the speed of the whole circuit, which in turn is determined by its critical path, asynchronous circuits rely on completion-detection methods to determine when a Combinational Logic block (CL) has completed its operation.

Several methods of completion-detection are known. Dual rail coding techniques carry the disadvantages of a very high hardware overhead coupled with low operation speed [1]. The bounded delay approach, on the other hand, does not utilise the data dependency of internal delays [2]. Current-Sensing Completion-Detection (CSCD) causes high quiescent currents leading to undesirably high power consumption [3] [4] or requires expensive BiCMOS technology for satisfactory operation [5]. Some of the proposed circuits in [4] and [6] need reference voltages which are expensive to implement and contribute to an increase in power consumption. The reference currents necessary for the current mode current comparator deployed in [7] lead to a high quiescent current of the entire circuit.

The objective of this paper is to investigate an alternative method of completion-detection [8] which substantially improves asynchronous circuit performance. The method is based on a single rail implementation of the CL and uses minimal additional circuitry to determine whether the CL is in a steady or transient state. Three different levels of granularity are studied and compared with each other. It is shown that glitches occurring within the CL do not cause the circuit to malfunction. Theoretical considerations on the magnitude of achievable speed relative to synchronous circuits are carried out for the example of a carry ripple adder.
2 Completion-Detection Method
The proposed method monitors transitions of logic levels on signals within CLs. The combinational logic block is assumed to be implemented in static gate logic. The transient state of a CL is characterised by transitions occurring on internal signals. If no transitions occur within a certain period of time it can be concluded that the circuit has reached a steady state. This period is determined by the maximum delay between two consecutive transitions and comprises a gate delay coupled with the associated interconnection delay. To extract the information on the activity of internal nodes, Activity Monitors (AMs) are applied to internal signals of the CL. The circuit of an AM has one input, which is connected to the signal that is to be observed, and one Open Drain (OD) output. The function of an AM is to generate a pulse, i.e. a high-low-high transition on its OD output, when a transition of the input signal occurs, regardless of whether that transition is high-low or low-high. If no
transition of the input signal occurs, the OD output remains in a high impedance state. Since timing assumptions on basic building blocks of the CL have to be made, the approach is not delay independent.

An AM can be attached to every internal signal of the CL. The outputs of the AMs monitoring internal nodes have to be connected to a common signal ACT (Figure 1) which is fitted with a pull-up resistor and can be used to determine whether the CL is in a transient or steady state. For practical implementation purposes the pull-up resistors are replaced with active p-mos pull-ups. To treat the cases when no transitions occur at all (state similarity) and to cover the delay before the first vector of AMs is activated, a Minimum Delay Generator (MDG) is needed [4]. The MDG is activated when a new data token is applied to the CL.

Figure 1: Ripple-carry adder using AMCD together with proposed circuit of an Activity Monitor (AM)

The transistor level implementation of an AM as shown in Figure 1 has been found to be an optimal solution with respect to symmetrical operation as well as area and speed trade-offs. INV1 and INV2 can be implemented with minimum size transistors. The repercussion on the signal to be monitored, imposed by additional capacitance, is kept to a minimum since only three additional transistors, two of which are minimum size, have to be driven. Eight transistors in total are needed for one AM.

Attaching one AM to every internal signal of a CL may result in a substantial increase in the overall circuit area. Alternatively, AMs can be attached to certain exposed signals only. In order to guarantee reliable operation the following conditions must be fulfilled.
• To attach AMs the CL has to be subdivided into non-overlapping subcircuits, i.e. each subcircuit must be self-contained without sharing devices belonging to other subcircuits. Each signal which connects two different subcircuits has to be fitted with an AM. The group of all AMs which are attached to the outputs of one subcircuit is called a vector of AMs.
• In order to ensure the safe overlapping of output pulses of AMs, and hence avoid spikes on the common output signal ACT, the pulse width (tp) generated by each vector of AMs when detecting a signal transition must exceed the delay of the critical path (tcrit) of the subsequent subcircuit plus the switch-on delay (tA) of the subsequent vector of AMs. The pulse width generated by the MDG (tMDG) must exceed the tcrit of the first subcircuit plus the switch-on delay (tA) of the first vector of AMs.
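The second condition can be expressed as a simple design-rule check. The following Python fragment is only a sketch under stated assumptions: the names `Stage`, `min_pulse_width` and `check_chain` are hypothetical, the `margin` scaling is an illustrative way of adding a safety margin, and a final subcircuit without an AM vector at its outputs is modelled with a switch-on delay of zero. The numerical values in the usage note are chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One subcircuit and the AM vector at its outputs (all times in ns)."""
    t_crit: float  # critical-path delay of the subcircuit
    t_a: float     # switch-on delay of the AM vector (0.0 if none is fitted)


def min_pulse_width(next_stage: Stage, margin: float = 0.0) -> float:
    """Lower bound on the pulse width of the source driving `next_stage`.

    Encodes t_p > t_crit + t_A. `margin` (1.0 = 100 %) is an assumed,
    illustrative scaling and not a rule stated in the text.
    """
    return (1.0 + margin) * (next_stage.t_crit + next_stage.t_a)


def check_chain(t_mdg: float, pulse_widths: list[float],
                stages: list[Stage]) -> list[str]:
    """Return a list of violations; an empty list means both conditions hold.

    stages[0] is covered by the MDG pulse. stages[k] (k >= 1) is covered by
    the AM vector at the outputs of stages[k - 1], whose pulse width is
    pulse_widths[k - 1].
    """
    violations = []
    if t_mdg <= min_pulse_width(stages[0]):
        violations.append("t_MDG does not cover the first subcircuit")
    for k in range(1, len(stages)):
        if pulse_widths[k - 1] <= min_pulse_width(stages[k]):
            violations.append(
                f"t_p of AM vector {k} does not cover subcircuit {k + 1}")
    return violations
```

For example, four identical subcircuits with `Stage(1.4, 0.9)` for the first three and `Stage(1.4, 0.0)` for the last, an MDG pulse of 3.0 ns and three AM pulses of 3.0 ns each would pass the check, whereas shortening any of these pulses to 2.0 ns would be reported as a violation.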
For a ripple-carry adder an appropriate place to attach AMs is to each carry signal (C1, C2, ...) as shown in Figure 1. Here the vector of AMs degenerates to a single AM. Each bit-slice of the full adder represents a subcircuit. The pulse width generated by each AM (tp) when detecting a signal transition must cover the tcrit of a one-bit full adder plus the tA of the subsequent AM. In practice, as with synchronous circuits, a substantial safety margin has to be added to the nominal value of tp. As discussed later, the AMCD approach benefits from the fact that this safety margin has to be allocated to each subcircuit of the CL rather than to the whole CL as is necessary with synchronous circuits.

Figure 2 displays the waveforms obtained from the SPICE simulation of a 4-bit ripple-carry adder according to Figure 1 using SPICE parameters of a commercially available 1.2 µm CMOS process. An AM has been attached to each of the signals C1, C2 and C3. Signal C4 is not fitted with an AM since it represents a primary output of the 4-bit adder. Two transitions of input data, t1 and t2, have been applied:

(A[3:0], B[3:0]) = (0FH, 0FH) --t1-> (0H, 1H) --t2-> (0FH, 1H)

The signal A0 shown in Figure 2 acts as an example for the applied input vector. The measured switch-on delay for the signal ACT is tA = 0.9 ns and the delay of the critical path of a subcircuit is tcrit = 1.4 ns. The rising edge of signal ACT can be used to trigger the transport of a processed data token to subsequent processing stages.

Considering von Neumann's finding that the average length (pa) of the ripple path on a carry chain of length N for random input data is pa = log2 N [9], the theoretical average speedup (s) of a ripple-carry adder
using AMCD compared to a similar synchronous circuit is

$$ s = \frac{p_{sync}}{p_{amcd}} = \frac{N}{\log_2 N + 1} \qquad (1) $$

The parameter psync is the maximum ripple path of the adder which would apply to a synchronous working circuit without taking a safety margin into account. pamcd is the average delay of the AMCD circuit for random input data. The addition of the constant one to the denominator is necessary for two reasons. Firstly, the delay of the MDG needs to cover the case of state similarity: if no transitions at all occur within the circuit, the MDG will still delay the assertion of the RDY signal (Figure 1) by the critical path of the first adder stage, as pointed out earlier. Secondly, the required pulse width of the AMs has to be added to the delay of the actual ripple path. For practical application, in order to achieve a sufficient safety margin, this value needs to be increased to approximately two. Note that, in contrast to synchronous circuits, the safety margin for the AMCD version has to be applied to the subcircuit rather than the whole circuit.

The theoretical average speedup of a 16-bit ripple-carry adder for random input data and without imposing any safety margin is s = 16/5. In order to achieve high yields and to guarantee correct operation under worst case conditions, safety margins which are typically in the order of magnitude of 70% to 130% have to be added to the critical path delays. If a 100% safety margin is assumed for both the ‘normal’ synchronous circuit and the AMCD version, the speedup is increased to s = 32/6. This is due to the fact that for the synchronous circuit the safety margin needs to be applied to the whole circuit, whereas for the AMCD version a safety margin has to be applied to the pulse width of each vector of AMs, i.e. to the delay of each subcircuit. Consequently, imposing a 100% safety margin on an AMCD circuit increases the overall delay by the delay of one subcircuit only. According to (1) the achievable speedup (s) is in theory higher for larger and more complex circuits. In practice, parasitic capacitances will deteriorate the performance of the AMCD version slightly.
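The two speedup figures quoted for the 16-bit adder can be reproduced with a few lines of Python. This is a sketch only: the function name `amcd_speedup` is hypothetical, and treating the safety margin as a continuous parameter is an assumption that merely interpolates between the two cases discussed in the text (no margin, and a 100% margin). Delays are counted in units of one subcircuit delay.

```python
import math


def amcd_speedup(n_bits: int, margin: float = 0.0) -> float:
    """Theoretical average speedup of an N-bit ripple-carry adder using AMCD.

    With margin = 0.0 this is equation (1): s = N / (log2 N + 1).
    With margin = 1.0 (100 %) the synchronous path doubles to 2N, while the
    AMCD delay grows by one subcircuit delay only: s = 2N / (log2 N + 2).
    """
    p_sync = (1.0 + margin) * n_bits               # margin on the whole path
    p_amcd = math.log2(n_bits) + 1.0 + margin      # margin on one subcircuit
    return p_sync / p_amcd


if __name__ == "__main__":
    for n in (4, 8, 16, 32, 64):
        print(n, round(amcd_speedup(n), 2), round(amcd_speedup(n, 1.0), 2))
```

For N = 16 the function evaluates to 16/5 without a margin and to 32/6 with a 100% margin, in agreement with the values given above, and the result grows with N as stated for larger circuits.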
3 Requirements for the Activity Monitoring Circuit (AM)
The activity monitoring circuit (AM) must be capable of reliably detecting signal transitions under all circumstances. A particular problem arises when glitches with pulse widths in the order of magnitude of one inverter delay occur. In this case the response of the AM is not necessarily a superposition of the responses to the two individual transitions. However, simulations of such cases have shown that the proposed circuit of an AM has very good properties in this respect. It can be shown that the sensitivity of the AM with regard to pulses is higher than the sensitivity of the actual combinational logic. To demonstrate this property the test circuit shown in Figure 3 has been used. The buffer (BUF) is used to generate a ‘natural’ input signal for the actual test circuitry. This is fed to the AM and a NAND gate in parallel. The NAND gate
serves as a sensor that allows the determination of whether the appropriate pulse is recognised by the subsequent logic. The AM performs the actual transition detection. Simulations of the circuit of Figure 3 have been performed by applying two input pulses, the first (P1) having a width of 0.5 ns and the second (P2) a width of 0.3 ns. The resulting output waveforms have been named V1 and V2 respectively. As can be seen in Figure 4 the AM detects a 0.3 ns pulse which does not lead to a sufficient output pulse at the output of the NAND gate. This means that all pulses which cause any further transitions within the CL can be detected. This is an important property since those pulses may propagate to the primary outputs of the CL and, if they are not detected, may cause false results.

Figure 2: SPICE simulation of a 4-bit ripple-carry adder using AMCD (voltage [V] versus time [ns]; signals A0, C1 to C4 and ACT, with tA, tcrit and tp marked)

Figure 3: Circuit for performance test of AM with respect to the detection of glitches

Figure 4: SPICE simulation of the test circuit applying the input pulses P1 (0.5 ns) and P2 (0.3 ns) (signals V1(OUT), V1(ACT), V2(OUT) and V2(ACT) versus time [ns])
3.1 Layout Considerations

In order to verify the operation of the AM, a custom silicon implementation of the AM circuit according to Figure 3 has been produced using the ES2 1.0 µm double metal n-well CMOS process. Figure 5 shows the experimental AM layout used. The four AND-OR configured n-channel MOSFETs of the AM circuit are clearly seen at the bottom right of the layout, characterised by their large widths. The post-layout parasitic extracted SPICE simulation results shown in Figure 6 confirm the correct operation of the circuit for realistic and accurately extracted load conditions. The overall layout size of the AM cell is 27 µm x 49 µm. There is, however, room for further optimization of the cell to yield a smaller area. As expected, the waveforms in Figure 6 clearly indicate that the ACT signal is more pronounced than the OUT signal.

Figure 5: Experimental CMOS layout of the AM
4 Results of Applying Different Levels of Granularity
In order to study the approach in more detail a transistor level implementation of a 4 by 4-bit parallel multiplier has been used. The multiplier comprises a regular array of cells which compute the product of appropriate inputs (P) and a number of adders (A). To investigate the effects of parasitic capacitance, including estimates of wire capacitances, a SPICE simulation of this circuit has been carried out, again using the SPICE parameters of a 1.2 µm CMOS process.
The circuit has been divided into a number of subcircuits by inserting vectors (columns) of AMs as shown in Figure 7. The pulse width generated by a vector of AMs must exceed the critical path of the circuit elements between this vector and the succeeding vector. The capacitance of each ACTx signal in conjunction with its pull-up resistor can be used to adjust the pulse width accordingly. In our simulations the resistors were implemented as active grounded-gate p-mos devices. The width of the pull-up devices was tuned such that for each circuit a 100% safety margin in addition to tp was guaranteed. For the multiplier investigated, connecting between four and eight outputs to one ACTx signal proved to give optimal results.

The combination of an open drain circuit on the lowest level with a completion tree for subsequent higher levels has been found to be a particularly beneficial compromise. The inherent wire capacitances of the open drain configuration together with the active pull-up devices can be utilised to generate the desired time constant for the pulse width tp, which guarantees safe operation with a specific safety margin. On higher levels of the completion tree no further delays are desired, and therefore the tree structure deploying static gate logic ensures fast operation.

A series of investigations has been carried out to determine the optimal granularity of AMs for the multiplier circuit considered here. Six transitions of input data t1 ... t6 have been applied to the circuit at 50 ns intervals:
(X[3:0], Y[3:0]) = (0H,0H) --t1-> (1H,1H) --t2-> (FH,1H) --t3-> (FH,8H) --t4-> (FH,9H) --t5-> (FH,FH) --t6-> (0H,0H)

Figure 6: Simulation result of the post-layout extracted circuit (signals V(IN1), V(OUT) and V(ACT))

The results shown in Table 1 compare the performance of the 4 by 4-bit multiplier using AMCD on three levels of granularity to an identical synchronous circuit. A 100% safety margin has been allocated to both the synchronous circuit and the subcircuits employing the AMCD approach. The maximum delay (tmax) specifies the delay which occurs when the longest ripple path is invoked by the applied input data. This is the case during transition t3 of the above input vectors. As discussed in Section 2, the values for tmax in the case of the synchronous circuit have been obtained by multiplying the delay of the longest ripple path by two, whereas in the case of the AMCD circuit the pulse width of each AM has been adjusted such that it covers twice the delay of a subcircuit. The average delay is calculated for all six transitions of the input vectors. E is the energy which is needed to carry out all six transitions. Please note that for asynchronous circuits the classical ‘power dissipation’, i.e. E/t, is not a sensible parameter, since it depends on the rate at which input data tokens are applied.

The circuit area is quoted in percent relative to the area of the synchronous circuit. The values represent a rough approximation obtained by adding the number of transistors and weighting them with their width (all transistors are minimum size). Neither the clock distribution network of the synchronous circuit nor the area which is needed to implement the communication protocol for the AMCD versions is taken into consideration. Further, the area needed for signal routing within the CL is also disregarded.

The first investigated version, which is referred to as the fine-grained version (AMCDF), has vectors of AMs fitted in every column of the multiplier cells (C1, C2, ... C7).
The circuit parameters and simulation results obtained from this version show that, as expected, this circuit results in the fastest operation but incurs a relatively high area overhead. The version shown in Figure 7 is referred to as the medium-grained version (AMCDM). In this version every other column (C1, C3, C5 and C7) of multiplier cells is fitted with one vector of AMs, resulting in a very good area and speed compromise. The average speed of the AMCD version is substantially increased by 37% and the worst case speed is still 9% higher than
the synchronous counterpart. The additional area required for implementing the AMs is relatively small at 12%. This additional hardware causes an almost proportional increase in energy consumption (E) of 11%.

Finally a coarse-grained version (AMCDC) has been designed and investigated in a similar manner to the one described above. The coarse-grained version has vectors of AMs fitted to only two columns of the multiplier cells (C2 and C5). The additional delays caused in this case are naturally quite high. In particular the worst case delay is substantially increased. This makes the circuit less attractive when compared to the previous versions.

Figure 7: Multiplier circuit fitted with vectors (columns) of AMs
VERSION   taverage [ns]   tmax [ns]   E [nJ]   Area [%]
AMCDF     13.7            17.2        2.9      123
AMCDM     14.8            18.6        2.5      112
AMCDC     17.4            20.4        2.4      106
SYNC      20.3            20.3        2.2      100
Table 1: Performance of a 4 by 4-bit AMCD multiplier using different levels of granularity compared to an identical synchronous circuit
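The speed figures quoted for the medium-grained version follow directly from the delays in Table 1 if speed is taken as the reciprocal of delay. The short Python sketch below recomputes them; the names `TABLE_1` and `speed_gain_percent` are hypothetical, and the delay values are copied from the table.

```python
# (taverage [ns], tmax [ns]) for each version, copied from Table 1.
TABLE_1 = {
    "AMCDF": (13.7, 17.2),
    "AMCDM": (14.8, 18.6),
    "AMCDC": (17.4, 20.4),
    "SYNC": (20.3, 20.3),
}


def speed_gain_percent(version: str) -> tuple[int, int]:
    """Average and worst-case speed gain over SYNC, rounded to whole percent.

    Speed is taken as 1/delay, so the gain is t_sync / t_version - 1.
    """
    t_avg, t_max = TABLE_1[version]
    sync_avg, sync_max = TABLE_1["SYNC"]
    return (round((sync_avg / t_avg - 1.0) * 100.0),
            round((sync_max / t_max - 1.0) * 100.0))


if __name__ == "__main__":
    for name in TABLE_1:
        print(name, speed_gain_percent(name))
```

For AMCDM this gives 37% on average and 9% in the worst case, matching the figures stated in the text.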
At the system level, the increase in energy consumption will be more than compensated by the absence of the clock signal.
5 Conclusions
The feasibility of a new method called AMCD to achieve completion-detection for self-timed circuits has been demonstrated. The major advantage of the proposed method is that data dependent delays of CLs can be exploited whilst imposing only marginal repercussions on the CL itself. Requirements for the deployed Activity Monitors in terms of switch-on delays and generated pulse widths have been discussed. The issue of glitches occurring within the CL has been studied and it has been demonstrated that they will be detected and hence not cause the circuit to malfunction. A SPICE simulation of a 4 by 4-bit multiplier employing the models of a commercially available CMOS process has demonstrated the practicality of the approach. Three circuits, each on a different level of granularity, have been investigated and the obtained parameters have been compared to an identical synchronous circuit. The experimental observations demonstrate the benefits of the method and suggest that a medium level of granularity delivers the best results in terms of area, speed and power dissipation. The post-layout parasitic extracted simulation results of test circuits confirmed the correct operation of the proposed AM under various conditions.
Further work is being carried out to investigate methods to actively prevent glitches from being propagated. This is expected to result in both less power dissipation and increased speed.
References

[1] N. R. Poole, “Self-timed logic circuits”, Electronics & Communication Engineering Journal, Vol. 6, (6), pp. 261-270, 1994.
[2] S. Hauck, “Asynchronous Design Methodologies: An Overview”, Proceedings of the IEEE, Vol. 83, (1), pp. 69-93, 1995.
[3] O. A. Izosimov, I. I. Shagurin, V. V. Tsylyov, “Physical approach to CMOS Self-Timing”, Electronics Letters, Vol. 26, pp. 1835-1836, 1990.
[4] M. E. Dean, D. L. Dill, M. Horowitz, “Self-Timed Logic Using Current-Sensing Completion Detection (CSCD)”, Journal of VLSI Signal Processing, Vol. 7, pp. 7-16, 1994.
[5] E. Grass, S. Jones, “Asynchronous Circuits based on Multiple Localised Current-Sensing Completion Detection”, Proc. of the 2nd Working Conference on Asynchronous Design Methodologies, IEEE Computer Society Press, pp. 170-177, 1995.
[6] V. I. Varshavsky, V. B. Marakowsky, R. A. Lashevsky, “Asynchronous Interaction in Massively Parallel Computing Systems”, Proc. of the IEEE 1st International Conference on Algorithms and Architectures for Parallel Processing, Brisbane (Australia), Vol. 2, pp. 481-492, 1995.
[7] M. J. Gamble, A Novel Current-Sensing Completion-Detection Circuit Adapted to the Micropipeline Methodology, M.Sc. Thesis, University of Manitoba (Canada), pp. 25-31, 1994.
[8] E. Grass, S. Jones, “Activity-Monitoring Completion-Detection (AMCD) a new approach to achieve self-timing”, Electronics Letters, accepted for publication.
[9] B. E. Briley, “Some New Results on Average Worst Carry”, IEEE Trans. Comp., Vol. 22 (5), pp. 459-463, 1973.