Activity-Monitoring Completion-Detection (AMCD): A new ... - CiteSeerX

Activity-Monitoring Completion-Detection (AMCD): A new single rail approach to achieve self-timing

E. Grass, R. C. S. Morling and I. Kale
School of Electronic and Manufacturing Systems Engineering, University of Westminster, 115 New Cavendish Street, London, W1M 8JS, GB

Abstract

A new method for designing single rail asynchronous circuits is studied. It utilises additional circuitry to monitor the activity of nodes within combinational logic blocks. When all transitions have halted a completion signal is generated. Details of the circuit and design methodology are given and the influence of glitches on the proposed circuit is discussed. Three different levels of granularity are investigated. Experimental physical layout of the circuit with extracted and back-annotated simulation results is provided. The proposed approach results in faster operation than synchronous circuits with minimum circuit overhead incurred.

1 Introduction

The difficulty in distributing a high speed accurate clock over a large circuit area, as well as the inherent high power consumption caused by the clock network, have prompted designers to reconsider the role of asynchronous circuits. Whereas in synchronous systems the clock frequency governs the speed of the whole circuit, which in turn is determined by its critical path, asynchronous circuits rely on the use of completion-detection methods to determine when a Combinational Logic block (CL) has completed its operation. Several methods of completion-detection are known. Dual rail coding techniques carry the disadvantages of a very high hardware overhead coupled with low operation speed [1]. The bounded delay approach on the other hand does not utilise the data dependency of internal delays [2]. Current-Sensing Completion-Detection (CSCD) causes high quiescent currents leading to undesirably high power consumption [3] [4] or requires expensive BiCMOS technology for satisfactory operation [5]. Some of the proposed circuits in [4] and [6] need reference voltages which are expensive to implement and contribute to an increase in power consumption. The reference currents necessary for the current mode current comparator deployed in [7] lead to a high quiescent current of the entire circuit.

The objective of this paper is to investigate an alternative method of completion-detection [8] which substantially improves asynchronous circuit performance. The method is based on a single rail implementation of the CL and uses minimal additional circuitry to determine whether the CL is in a steady or transient state. Three different levels of granularity are studied and compared with each other. It is shown that glitches occurring within the CL do not cause the circuit to malfunction. Theoretical considerations on the magnitude of achievable speed relative to synchronous circuits are carried out for the example of a ripple-carry adder.

2 Completion-Detection Method

The proposed method monitors transitions of logic levels on signals within CLs. The combinational logic block is assumed to be implemented in static gate logic. The transient state of a CL is characterised by transitions occurring on internal signals. If no transitions occur within a certain period of time it can be concluded that the circuit has reached a steady state. This period is determined by the maximum delay between two consecutive transitions and comprises a gate delay coupled with the associated interconnection delay. To extract the information on the activity of internal nodes, Activity Monitors (AMs) are applied to internal signals of the CL. The circuit of an AM has one input, which is connected to the signal that is to be observed, and one Open Drain (OD) output. The function of an AM is to generate a pulse, i.e. a high-low-high transition on its OD output, when a transition of the input signal occurs, regardless of whether the input transition is high-low or low-high. If no transition of the input signal occurs the OD output remains in a high impedance state. Since timing assumptions on basic building blocks of the CL have to be made the approach is not delay independent.

Figure 1: Ripple-carry adder using AMCD together with proposed circuit of an Activity Monitor (AM)
[Labels recovered from the figure: inputs A0, B0, A1, B1; full adders full0 and full1 with outputs S0, S1 and carries C1, C2; activity monitors AM1 and AM2; MDG with START input; common signal ACT with pull-up R to VDD; output RDY; AM circuit with input IN, output ACT, inverters INV1 and INV2 and transistors M1 to M4.]

An AM can be attached to every internal signal of the CL. The outputs of the AMs monitoring internal nodes have to be connected to a common signal ACT (Figure 1) which is fitted with a pull-up resistor and can be used to determine whether the CL is in a transient or steady state. For practical implementation purposes the pull-up resistors are replaced with active p-mos pull-ups. To treat the cases when no transitions occur at all (state similarity) and to cover the delay before the first vector of AMs is activated, a Minimum Delay Generator (MDG) is needed [4]. The MDG is activated when a new data token is applied to the CL.

The transistor level implementation of an AM as shown in Figure 1 has been found to be an optimal solution with respect to symmetrical operation as well as area and speed trade-offs. INV1 and INV2 can be implemented with minimum size transistors. The repercussion on the signal to be monitored, imposed by additional capacitance, is kept to a minimum since only three additional transistors, two of which are minimum size, have to be driven. Eight transistors in total are needed for one AM.

Attaching one AM to every internal signal of a CL may result in a substantial increase in the overall circuit area. Alternatively AMs can be attached to certain exposed signals only. In order to guarantee reliable operation the following conditions must be fulfilled.

• To attach AMs the CL has to be subdivided into non-overlapping subcircuits, i.e. each subcircuit must be self contained without sharing devices belonging to other subcircuits. Each signal which connects two different subcircuits has to be fitted with an AM. The group of all AMs which are attached to the outputs of one subcircuit is called a vector of AMs.

• In order to ensure the safe overlapping of output pulses of AMs, and hence avoid spikes on the common output signal ACT, the pulse width (tp) generated by each vector of AMs when detecting a signal transition must exceed the delay of the critical path (tcrit) of the subsequent subcircuit plus the switch-on delay (tA) of the subsequent vector of AMs. The pulse width generated by the MDG (tMDG) must exceed the tcrit of the first subcircuit plus the switch-on delay (tA) of the first vector of AMs.
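The two pulse-width conditions above can be expressed as a small check. The following Python sketch is illustrative only: the names `Subcircuit` and `check_pulse_widths` are hypothetical, the 4.6 ns pulse width is an assumed example value, and a final subcircuit without an AM vector at its outputs is assumed to contribute a switch-on delay of zero. The 1.4 ns and 0.9 ns figures are the tcrit and tA values reported for the 4-bit adder simulation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Subcircuit:
    """One subcircuit and the AM vector at its outputs (all times in ns)."""
    t_crit: float       # critical-path delay of the subcircuit
    t_a: float = 0.0    # switch-on delay of its AM vector (assumed 0 if no AMs)
    t_p: float = 0.0    # pulse width generated by its AM vector


def check_pulse_widths(t_mdg: float, chain: List[Subcircuit]) -> List[str]:
    """Return the names of the pulse sources that violate the conditions."""
    violations = []
    # The MDG pulse must exceed tcrit of the first subcircuit plus tA of
    # the first vector of AMs.
    first = chain[0]
    if t_mdg <= first.t_crit + first.t_a:
        violations.append("MDG")
    # Each AM vector's pulse must exceed tcrit of the subsequent subcircuit
    # plus tA of the subsequent vector of AMs.
    for i in range(1, len(chain)):
        prev, nxt = chain[i - 1], chain[i]
        if prev.t_p <= nxt.t_crit + nxt.t_a:
            violations.append(f"AM vector {i - 1}")
    return violations


# 4-bit ripple-carry adder: AMs on C1, C2 and C3, none on the primary output C4.
adder = [Subcircuit(1.4, 0.9, 4.6) for _ in range(3)] + [Subcircuit(1.4)]
print(check_pulse_widths(4.6, adder))
```

With these assumed values every pulse exceeds the required 2.3 ns, so the returned list is empty; shortening a pulse below that bound makes the corresponding source appear in the list.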

For a ripple-carry adder an appropriate place to attach AMs is at each carry signal (C1, C2, ...) as shown in Figure 1. Here the vector of AMs degenerates to a single AM. Each bit-slice of the full adder represents a subcircuit. The pulse width generated by each AM (tp) when detecting a signal transition must cover the tcrit of a one-bit full adder plus the tA of the subsequent AM. In practice, as with synchronous circuits, a substantial safety margin has to be added to the nominal value of tp. As discussed later, the AMCD approach benefits from the fact that this safety margin has to be allocated to each subcircuit of the CL rather than to the whole CL as is necessary with synchronous circuits.

Figure 2 displays the waveforms obtained from the SPICE simulation of a 4-bit ripple-carry adder according to Figure 1 using SPICE parameters of a commercially available 1.2 µm CMOS process. An AM has been attached to each of the signals C1, C2 and C3. Signal C4 is not fitted with an AM since it represents a primary output of the 4-bit adder. Two transitions of input data, t1 and t2, have been applied:

(A[3:0], B[3:0]) = (0FH, 0FH) --t1--> (0H, 1H) --t2--> (0FH, 1H)

The signal A0 shown in Figure 2 acts as an example for the applied input vector. The measured switch-on delay for the signal ACT is tA = 0.9 ns and the delay of the critical path of a subcircuit is tcrit = 1.4 ns. The rising edge of signal ACT can be used to trigger the transport of a processed data token to subsequent processing stages.

Considering von Neumann's finding that the average length (pa) of the ripple path on a carry chain of length N for random input data is pa = log2 N [9], the theoretical average speedup (s) of a ripple-carry adder using AMCD compared to a similar synchronous circuit is

\[ s = \frac{p_{sync}}{p_{amcd}} = \frac{N}{\log_2 N + 1} \qquad (1) \]

The parameter psync is the maximum ripple path of the adder which would apply to a synchronous circuit without taking a safety margin into account. pamcd is the average delay of the AMCD circuit for random input data. The addition of the constant one to the denominator is necessary for two reasons. Firstly, the delay of the MDG needs to cover the case of state similarity. Therefore, if no transitions at all occur within the circuit, the MDG will still delay the assertion of the RDY signal (Figure 1) by an amount appropriate to the critical path of the first adder stage, as pointed out earlier. Secondly, this constant is caused by the required pulse width of the AMs, which has to be added to the delay of the actual ripple path. For practical application, in order to achieve a sufficient safety margin, this value needs to be increased to approximately two. Note that, in contrast to synchronous circuits, the safety margin for the AMCD version has to be applied to the subcircuit rather than the whole circuit.

The theoretical average speedup of a 16-bit ripple-carry adder for random input data and without imposing any safety margin is s = 16/5. In order to achieve high yields and to guarantee correct operation under worst case conditions, safety margins which are typically in the order of magnitude of 70% to 130% have to be added to the critical path delays. If a 100% safety margin is assumed for both the ‘normal’ synchronous circuit and the AMCD version, the speedup is increased to s = 32/6. This is due to the fact that for the synchronous circuit the safety margin needs to be applied to the whole circuit, whereas for the AMCD version a safety margin has to be applied to the pulse width of each vector of AMs, i.e. to the delay of each subcircuit. Consequently, imposing a 100% safety margin on an AMCD circuit increases the overall delay by the delay of one subcircuit only. According to (1) the achievable speedup (s) is in theory higher for larger and more complex circuits. In practice, parasitic capacitances will deteriorate the performance of the AMCD version slightly.
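The speedup figures quoted for the 16-bit adder can be reproduced with a short calculation. The Python sketch below evaluates equation (1); the function name `amcd_speedup` and the `margin` parameter are hypothetical. Treating the constant in the denominator as 1 + margin is an assumption that generalises the text's statement that the constant grows to approximately two at a 100% safety margin, while the synchronous path is scaled as a whole.

```python
import math


def amcd_speedup(n_bits: int, margin: float = 0.0) -> float:
    """Theoretical average speedup of an N-bit ripple-carry adder, eq. (1).

    margin is the safety margin as a fraction (1.0 means 100%).
    """
    # Synchronous circuit: the margin applies to the whole ripple path.
    p_sync = n_bits * (1.0 + margin)
    # AMCD (assumed model): the margin only enlarges the constant term,
    # i.e. the delay of one subcircuit.
    p_amcd = math.log2(n_bits) + 1.0 + margin
    return p_sync / p_amcd


print(amcd_speedup(16))        # no safety margin: 16/5
print(amcd_speedup(16, 1.0))   # 100% safety margin: 32/6
```

Evaluating the same function for larger N illustrates the remark that the achievable speedup is, in theory, higher for larger circuits.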

3 Requirements for the Activity Monitoring Circuit (AM)

The activity monitoring circuit (AM) must be capable of reliably detecting signal transitions under all circumstances. A particular problem arises when glitches with pulse widths in the order of magnitude of one inverter delay occur. In this case the response of the AM is not necessarily a superposition of the responses to the two individual transitions. However, simulations of such cases have shown that the proposed circuit of an AM has very good properties in this respect. It can be shown that the sensitivity of the AM with regard to pulses is higher than the sensitivity of the actual combinational logic. To demonstrate this property the test circuit shown in Figure 3 has been used. The buffer (BUF) is used to generate a ‘natural’ input signal for the actual test circuitry. This is fed to the AM and a NAND gate in parallel. The NAND gate serves as a sensor that allows the determination of whether the appropriate pulse is recognised by the subsequent logic. The AM performs the actual transition detection.

Simulations of the circuit of Figure 3 have been performed by applying two input pulses, the first (P1) having a width of 0.5 ns and the second (P2) a width of 0.3 ns. The resulting output waveforms have been named V1 and V2 respectively. As can be seen in Figure 4 the AM detects a 0.3 ns pulse which does not lead to a sufficient output pulse at the output of the NAND gate. This means that all pulses which cause any further transitions within the CL can be detected. This is an important property since those pulses may propagate to the primary outputs of the CL and, if they are not detected, may cause false results.

Figure 2: SPICE simulation of a 4-bit ripple-carry adder using AMCD (voltage [V] versus time [ns]; traces A0, C1 to C4 and ACT, with tcrit, tA and tp marked)

Figure 3: Circuit for performance test of AM with respect to the detection of glitches

Figure 4: SPICE simulation of the test circuit applying the input pulses P1 (0.5 ns) and P2 (0.3 ns) (traces V1(OUT), V1(ACT), V2(OUT) and V2(ACT) versus time [ns])

3.1 Layout Considerations

In order to verify the operation of the AM, a custom-silicon implementation of the AM circuit according to Figure 3 has been carried out using the ES2 1.0 µm double metal n-well CMOS process. Figure 5 shows the experimental AM layout used. The four AND-OR configured n-channel MOSFETs of the AM circuit can clearly be seen at the bottom right of the layout, characterised by their large widths. The post-layout parasitic extracted SPICE simulation results shown in Figure 6 confirm the correct operation of the circuit for realistic and accurately extracted load conditions. The overall layout size of the AM cell is 27 µm × 49 µm. There is however room for further optimization of the cell to yield a smaller area. As expected, the waveforms in Figure 6 clearly indicate that the ACT signal is more pronounced than the OUT signal.

Figure 5: Experimental CMOS layout of the AM

4 Results of Applying Different Levels of Granularity

In order to study the approach in more detail a transistor level implementation of a 4 by 4-bit parallel multiplier has been used. The multiplier comprises a regular array of cells which compute the product of appropriate inputs (P) and a number of adders (A). To investigate the effects of parasitic capacitance, including estimates of wire capacitances, a SPICE simulation of this circuit has been carried out, again using the SPICE parameters of a 1.2 µm CMOS process.

Figure 6: Simulation result of the post-layout extracted circuit (traces V(IN1), V(OUT) and V(ACT))

The circuit has been divided into a number of subcircuits by inserting vectors (columns) of AMs as shown in Figure 7. The pulse width generated by a vector of AMs must exceed the critical path of the circuit elements between this vector and the succeeding vector. The capacitance of each ACTx signal in conjunction with its pull-up resistor can be used to adjust the pulse width accordingly. In our simulations the resistors were implemented as active grounded-gate p-mos devices. The width of the pull-up devices was tuned such that for each circuit a 100% safety margin in addition to tp was guaranteed. For the multiplier investigated, connecting between four and eight outputs to one ACTx signal proved to give optimal results.

The combination of an open drain circuit on the lowest level with a completion tree for subsequent higher levels has been found to be a particularly beneficial compromise. The inherent wire capacitances of the open drain configuration together with the active pull-up devices can be utilised to generate the desired time constant for the pulse width tp, which guarantees safe operation with a specific safety margin. On higher levels of the completion tree no further delays are desired and therefore the tree structure deploying static gate logic ensures fast operation.

A series of investigations has been carried out to determine the optimal granularity of AMs for the multiplier circuit considered here. Six transitions of input data t1 ... t6 have been applied to the circuit at 50 ns intervals:

(X[3:0], Y[3:0]) = (0H,0H) --t1--> (1H,1H) --t2--> (FH,1H) --t3--> (FH,8H) --t4--> (FH,9H) --t5--> (FH,FH) --t6--> (0H,0H)

The results shown in Table 1 compare the performance of the 4 by 4-bit multiplier using AMCD on three levels of granularity to an identical synchronous circuit. A 100% safety margin has been allocated to both the synchronous circuit and the subcircuits employing the AMCD approach. The maximum delay (tmax) specifies the delay which occurs when the longest ripple path is invoked by the applied input data. This is the case during transition t3 of the above input vectors. As discussed in Section 2, the values for tmax in the case of the synchronous circuit have been obtained by multiplying the delay of the longest ripple path by two, whereas in the case of the AMCD circuit the pulse width of each AM has been adjusted such that it covers twice the delay of a subcircuit. The average delay is calculated over all six transitions of the input vectors. E is the energy which is needed to carry out all six transitions. Please note that for asynchronous circuits the classical ‘power dissipation’, i.e. E/t, is not a sensible parameter, since it depends on the rate at which input data tokens are applied.

The circuit area is quoted in percent relative to the area of the synchronous circuit. The values represent a rough approximation obtained by adding the number of transistors and weighting them with their width (all transistors are minimum size). Neither the clock distribution network of the synchronous circuit nor the area which is needed to implement the communication protocol for the AMCD versions is taken into consideration. Further, the area needed for signal routing within the CL is also disregarded.

The first investigated version, which is referred to as the fine-grained version (AMCDF), has vectors of AMs fitted in every column of the multiplier cells (C1, C2, ... C7).
The circuit parameters and simulation results obtained from this version show that, as expected, this circuit results in the fastest operation but incurs a relatively high area overhead.

The version shown in Figure 7 is referred to as the medium-grained version (AMCDM). In this version every other column (C1, C3, C5 and C7) of multiplier cells is fitted with one vector of AMs, resulting in a very good area and speed compromise. The average speed of the AMCD version is substantially increased by 37% and the worst case speed is still 9% higher than that of the synchronous counterpart. The additional area required for implementing the AMs is relatively small at 12%. This additional hardware causes an almost proportional increase in energy consumption (E) of 11%.

Finally a coarse-grained version (AMCDC) has been designed and investigated in a similar manner to the one described above. The coarse-grained version has vectors of AMs fitted to only two columns of the multiplier cells (C2 and C5). The additional delays caused in this case are naturally quite high. In particular the worst case delay is substantially increased. This makes the circuit less attractive when compared to the previous versions.

Figure 7: Multiplier circuit fitted with vectors (columns) of AMs
[Labels recovered from the figure: inputs X0 to X3 and Y0 to Y3; product cells P and adders A; AM columns C1 to C7; signals ACT1 to ACT4 with pull-ups to VDD combined into ACT; outputs P0 to P7.]

VERSION   taverage [ns]   tmax [ns]   E [nJ]   Area [%]
AMCDF     13.7            17.2        2.9      123
AMCDM     14.8            18.6        2.5      112
AMCDC     17.4            20.4        2.4      106
SYNC      20.3            20.3        2.2      100

Table 1: Performance of a 4 by 4-bit AMCD multiplier using different levels of granularity compared to an identical synchronous circuit
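The relative figures quoted in the text can be recomputed from Table 1. The Python sketch below is a plain arithmetic check; the names `TABLE_1` and `gain_over_sync` are hypothetical. Because the table values are rounded, the results only approximate the quoted percentages. In particular, the energy overhead of AMCDM computed from the rounded entries comes out at roughly 14% rather than the quoted 11%, which may reflect rounding in the table.

```python
# Values transcribed from Table 1:
# (t_average [ns], t_max [ns], E [nJ], area [% of synchronous circuit])
TABLE_1 = {
    "AMCDF": (13.7, 17.2, 2.9, 123),
    "AMCDM": (14.8, 18.6, 2.5, 112),
    "AMCDC": (17.4, 20.4, 2.4, 106),
    "SYNC": (20.3, 20.3, 2.2, 100),
}


def gain_over_sync(version: str) -> dict:
    """Percentage gains and overheads of one version relative to SYNC."""
    t_avg, t_max, energy, area = TABLE_1[version]
    s_avg, s_max, s_energy, s_area = TABLE_1["SYNC"]
    return {
        # Speed is the inverse of delay, so the gain is a ratio of delays.
        "avg_speed_gain_pct": 100.0 * (s_avg / t_avg - 1.0),
        "worst_speed_gain_pct": 100.0 * (s_max / t_max - 1.0),
        "energy_overhead_pct": 100.0 * (energy / s_energy - 1.0),
        "area_overhead_pct": float(area - s_area),
    }


for name in ("AMCDF", "AMCDM", "AMCDC"):
    gains = gain_over_sync(name)
    print(name, {key: round(value, 1) for key, value in gains.items()})
```

For AMCDC the worst-case speed gain comes out slightly negative, consistent with the observation that its worst-case delay is no better than that of the synchronous circuit.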

At the system level, the increase in energy consumption will be more than compensated by the absence of the clock signal.

5 Conclusions

The feasibility of a new method called AMCD to achieve completion-detection for self-timed circuits has been demonstrated. The major advantage of the proposed method is that data dependent delays of CLs can be exploited whilst imposing only marginal repercussions on the CL itself. Requirements for the deployed Activity Monitors in terms of switch-on delays and generated pulse widths have been discussed. The issue of glitches occurring within the CL has been studied and it has been demonstrated that they will be detected and hence not cause the circuit to malfunction. A SPICE simulation of a 4 by 4-bit multiplier employing the models of a commercially available CMOS process has demonstrated the practicality of the approach. Three circuits, each on a different level of granularity, have been investigated and the obtained parameters have been compared to an identical synchronous circuit. The experimental observations demonstrate the benefits of the method and suggest that a medium level of granularity delivers the best results in terms of area, speed and power dissipation. The post-layout parasitic extracted simulation results of test circuits confirmed the correct operation of the proposed AM under various conditions.

Further work is being carried out to investigate methods to actively prevent glitches from being propagated. This is expected to result in both less power dissipation and increased speed.

References

[1] N. R. Poole, “Self-timed logic circuits”, Electronics & Communication Engineering Journal, Vol. 6 (6), pp. 261-270, 1994.

[2] S. Hauck, “Asynchronous Design Methodologies: An Overview”, Proceedings of the IEEE, Vol. 83 (1), pp. 69-93, 1995.

[3] O. A. Izosimov, I. I. Shagurin, V. V. Tsylyov, “Physical approach to CMOS Self-Timing”, Electronics Letters, Vol. 26, pp. 1835-1836, 1990.

[4] M. E. Dean, D. L. Dill, M. Horowitz, “Self-Timed Logic Using Current-Sensing Completion Detection (CSCD)”, Journal of VLSI Signal Processing, Vol. 7, pp. 7-16, 1994.

[5] E. Grass, S. Jones, “Asynchronous Circuits based on Multiple Localised Current-Sensing Completion Detection”, Proc. of the 2nd Working Conference on Asynchronous Design Methodologies, IEEE Computer Society Press, pp. 170-177, 1995.

[6] V. I. Varshavsky, V. B. Marakowsky, R. A. Lashevsky, “Asynchronous Interaction in Massively Parallel Computing Systems”, Proc. of the IEEE 1st International Conference on Algorithms and Architectures for Parallel Processing, Brisbane (Australia), Vol. 2, pp. 481-492, 1995.

[7] M. J. Gamble, A Novel Current-Sensing Completion-Detection Circuit Adapted to the Micropipeline Methodology, M.Sc. Thesis, University of Manitoba (Canada), pp. 25-31, 1994.

[8] E. Grass, S. Jones, “Activity-Monitoring Completion-Detection (AMCD) a new approach to achieve self-timing”, Electronics Letters, accepted for publication.

[9] B. E. Briley, “Some New Results on Average Worst Carry”, IEEE Trans. Comp., Vol. 22 (5), pp. 459-463, 1973.
