(31,5) Parallel Counter Circuit Based on ... - Semantic Scholar

IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 31, NO. 8, AUGUST 1996


A Compact High-Speed (31,5) Parallel Counter Circuit Based on Capacitive Threshold-Logic Gates

Y. Leblebici, H. Özdemir, A. Kepkep, and U. Çilingiroğlu

Abstract—A novel high-speed circuit implementation of the (31,5) parallel counter (i.e., population counter) based on capacitive threshold logic (CTL) is presented. The circuit consists of 20 threshold logic gates arranged in two stages, i.e., the parallel counter described here has an effective logic depth of two. The charge-based CTL gates are essentially dynamic circuits which require a periodic refresh or precharge cycle, but unlike conventional dynamic CMOS gates, the circuit can be operated in synchronous as well as in asynchronous mode. The counter circuit is implemented using conventional 1.2 µm double-poly CMOS technology, and it occupies a silicon area of about 0.08 mm². Extensive post-layout simulations indicate that the circuit has a typical input-to-output propagation delay of less than 3 ns, and the test circuit is shown to operate reliably when consecutive 31-b input vectors are applied at a rate of up to 16 Mvectors/s. With its demonstrated data processing capability of about 500 Mb/s, the CTL-based (31,5) parallel counter offers a number of application possibilities, e.g., in high-speed parallel multiplier arrays and data encoding circuits.

I. INTRODUCTION

PARALLEL counters, or population counters, are multi-input, multi-output combinational logic circuits which determine the number of logic one’s in their input vectors and generate a binary encoded output vector which corresponds to this number. An alternate terminology for this type of circuit is $(n,k)$-counter, where $n$ represents the number of input bits (input vector length) and $k$ represents the number of output bits required to encode the number of one’s in the input (output vector length). High-speed counting of one’s in a large input vector is a very useful operation in various areas, especially in high-speed parallel multiplication, multiple-input adders, and in digital signal processing applications [1], [2]. The use of parallel counters for fast parallel multiplication was first studied by Dadda [3], who devised various schemes for the reduction of partial product terms into two binary numbers whose sum equals the product. Wallace tree structures, which are used extensively in high-speed parallel binary multipliers, are also based on counting the number of one’s in a column of partial product terms [4]–[7].

Manuscript received March 17, 1995; revised March 20, 1996. This work was supported in part by the Technology Development Fund of Turkey under Contract TTGV-036 and by the following member organizations of ITU-ETA Foundation: Alcatel-Teletas A.S., Bekoteknik A.S., Netas A.S., Simko A.S., and Vestel A.S. The authors are with the ITU-ETA Design Center and Department of Electronics and Communication Engineering, Istanbul Technical University, 80626 Maslak, Istanbul, Turkey. Publisher Item Identifier S 0018-9200(96)05696-X.

A number of methodologies exist for constructing high-order parallel counters which can operate on a large input vector. Multiple-input, dedicated counter circuits based on combinational logic typically suffer from large input capacitances and intermediate node capacitances, and from long pull-down paths, since the number of input transistors increases quadratically with the number of inputs [1]. Therefore, parallel counters with a large input vector length are typically constructed by using networks of full adders. In fact, the full adder circuit itself is a parallel counter, called the (3,2)-counter, which encodes the number of one’s on its three input terminals onto its two output terminals. Several approaches have been developed to implement high-order counters as networks consisting of (3,2)-counters or adders [1], [2], [8]. In these approaches, however, the number of full adders to be used as well as the logic depth of the overall circuit grow quite rapidly with the input vector size. Other approaches for implementing parallel counters using ROM arrays [2], [9], mixed analog/digital circuits [2], and threshold logic circuits [3] have also been proposed.

In this paper, a novel scheme is proposed for two-level realization of $(n,k)$-counters using threshold logic gates, and the circuit implementation of this scheme is demonstrated with a high-speed (31,5) parallel counter based on capacitive threshold logic (CTL) gates. The charge-based CTL gates are essentially dynamic circuits which require a periodic refresh or precharge cycle, but unlike conventional dynamic CMOS gates, the circuit can be operated in synchronous as well as in asynchronous mode [10]. The (31,5)-counter circuit consists of 20 threshold logic gates arranged in two stages, i.e., the parallel counter described here has an effective logic depth of two [11]. Thus, the circuit offers a radically different, compact, and potentially superior alternative for the implementation of large-input counters.

In Section II of this paper, the fundamental aspects of threshold logic and the basic structure of the CTL gate are presented. The (31,5) parallel counter circuit is described in detail in Section III.
In Section IV, the experimental results obtained from the test chip are presented and discussed. The results of this work are summarized in Section V.

II. CAPACITIVE THRESHOLD-LOGIC GATES

Threshold logic functions comprise an important subset of Boolean functions defined for multiple binary input variables. A threshold gate is defined as an $m$-input logic gate, such that the output of the gate is determined by the following set of relations:

0018–9200/96$05.00 © 1996 IEEE

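$$
f(X) =
\begin{cases}
1, & \text{if } \sum_{i=1}^{m} w_i x_i \ge T \\
0, & \text{if } \sum_{i=1}^{m} w_i x_i < T
\end{cases}
\tag{1}
$$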
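The definition of a threshold gate can be illustrated with a minimal Python sketch. The function name `threshold_gate` and the three example gates are hypothetical names chosen for illustration and do not appear in the paper; the sketch models only the logical behavior (arithmetic weighted sum of the binary inputs compared against a threshold), not the capacitive circuit.

```python
from typing import Sequence


def threshold_gate(inputs: Sequence[int],
                   weights: Sequence[float],
                   threshold: float) -> int:
    """Return 1 if the arithmetic weighted sum of the binary inputs
    reaches the threshold, and 0 otherwise."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum >= threshold else 0


# Three 3-input functions, each realized by a single gate with unit weights;
# only the threshold differs (hypothetical examples).
def and3(x: Sequence[int]) -> int:
    return threshold_gate(x, [1, 1, 1], 3)


def or3(x: Sequence[int]) -> int:
    return threshold_gate(x, [1, 1, 1], 1)


def majority3(x: Sequence[int]) -> int:
    return threshold_gate(x, [1, 1, 1], 2)
```

With unit weights, changing only the threshold turns the same gate into an AND, an OR, or a majority function, which is one way to see why threshold gates can reduce gate count and logic depth.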

Here, $x_1, \ldots, x_m$ represent the binary input variables, and $X$ represents the input vector consisting of $m$ binary input variables. The function $f(X)$ is the Boolean (switching) logic function realized by the threshold gate, where $w_1, \ldots, w_m$ represent the input weights (real numbers) corresponding to the input variables, and $T$ is a real number representing the gate threshold. It must be noted that the summation operator used in the above definition indicates arithmetic summation of the weighted binary input variables. A Boolean function thus represented by the output of a threshold gate is called a threshold function. It can easily be shown that all Boolean functions can be realized by a two-level network consisting of threshold gates. Fig. 1(a) shows the symbolic representation of a general threshold logic gate with $m$ binary inputs and with threshold $T$. Threshold gates typically offer the capability of realizing complex Boolean functions using a smaller number of logic gates and/or fewer logic stages [12], [13].

The charge-based CTL gate concept used in this work offers a viable alternative for the realization of threshold functions. This approach effectively merges the advantages of a very simple, area-efficient circuit structure with the capability of handling a large number of parallel inputs simultaneously, thus exploiting the full advantage of threshold logic gates. A CTL gate of $m$ inputs, as shown schematically in Fig. 1(b), comprises a row of weight-implementing capacitors and a chain of identical inverters which functions as a voltage comparator to generate the output [10]. The gate operates in a two-phase nonoverlapping clock scheme consisting of a reset phase and an evaluation phase, each defined by its own clock signal. In the reset phase, the row voltage is reset to the logic threshold voltage of the first inverter stage of the comparator chain, while the bottom plates of all weight-implementing capacitors are precharged to a reference voltage. The evaluation phase begins with the arrival of the evaluation clock. Binary input signals are forced onto the input columns, while two fixed voltages, one of them 0 V, are applied to the two reference capacitors. As a result, the row voltage is perturbed from the reset level. The small row voltage perturbation is binarized by the comparator. It can be shown that the output of the comparator chain indeed realizes a threshold logic function of the input variables, where the weight-implementing capacitors, realized as integer multiples of a unit capacitor, determine the input weights, and the reference capacitors determine the threshold value.

Details and operation of the basic CTL gate have been extensively studied in [10]. It was demonstrated that a very large fan-in can be achieved in CTL gates (experimentally verified fan-in capacity of at least 255), enabling the simple realization of threshold functions with a large number of input variables. Also notice that the CTL gate described above is fully input- and output-compatible with conventional CMOS logic gates.

Now examine the two-dimensional (2-D) capacitor array structure shown in Fig. 1(c). Here, each row of capacitors implements a different threshold function of the same input variables, and one set of input switches is used to transfer the input voltages to all rows. It can easily be seen that the


relative area overhead of the input switches will diminish with increasing number of rows. This observation leads to the conclusion that the full advantage of the CTL gate concept can be exploited in applications where i) there is a large number of input variables and ii) many threshold functions having a large common input set are to be implemented simultaneously.

Unlike conventional dynamic CMOS gates, which require a precharge phase before evaluating each new input vector, a CTL gate is capable of evaluating a large number of successive input vectors between two consecutive reset phases. This capability is a result of the fact that the evaluation process is nondestructive with respect to charge conservation. The (31,5) parallel counter circuit presented in the following section also exploits this feature in order to achieve high throughput rates.

III. DESCRIPTION OF THE PARALLEL COUNTER CIRCUIT

As briefly mentioned in Section I, the $n$-bit parallel counter or $(n,k)$-counter is a combinational logic circuit which generates a binary encoded output vector of length $k$ corresponding to the number of logic one’s in its input vector of length $n$. The least significant bit (LSB) in the output vector is the parity function of the input bits, i.e., the LSB corresponds to the parity of one’s in the input vector. The second significant bit of the output can be expressed as the parity of one-bit pairs in the input vector, the third significant bit of the output can be expressed as the parity of one-bit quadruples in the input vector, etc. Generalizing this observation, it can be shown that the $j$th significant output bit of a parallel counter is indeed the parity function of one-bit $2^{j-1}$-tuples in the input vector. In fact, the $j$th significant output of the $(n,k)$-counter is equal to a residue threshold function $R_t(q \mid X)$ with $t = 2^{j-1}$ and $q = 2^{j}$, where $X$ represents the input vector consisting of $n$ binary variables, and the function is defined as [9]

$$
R_t(q \mid X) =
\begin{cases}
1, & \text{iff } \left( \sum_{i=1}^{n} x_i \right) \bmod q \ge t \\
0, & \text{otherwise}
\end{cases}
\tag{2}
$$
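The link between residue threshold functions and the output bits of a counter can be checked with a short sketch. A residue threshold function is taken here to output one exactly when the number of one’s in the input vector, reduced modulo $q$, is at least $t$; the function names and the parameter names `t` and `q` are assumptions of this sketch and not necessarily the notation of [9].

```python
from typing import List, Sequence


def residue_threshold(x: Sequence[int], t: int, q: int) -> int:
    """Return 1 iff (number of ones in x) mod q is at least t."""
    return 1 if sum(x) % q >= t else 0


def counter_bit(x: Sequence[int], j: int) -> int:
    """j-th significant bit (j = 1 is the LSB) of the number of ones in x,
    obtained as a residue threshold function with t = 2**(j-1), q = 2**j."""
    return residue_threshold(x, t=2 ** (j - 1), q=2 ** j)


def counter_outputs(x: Sequence[int], k: int) -> List[int]:
    """All k output bits of an (n, k)-counter, LSB first."""
    return [counter_bit(x, j) for j in range(1, k + 1)]
```

For a 31-b vector, `counter_outputs(x, 5)` yields the binary encoding of the number of one’s, LSB first; each bit depends only on the count, which is what makes a threshold-logic realization possible.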

The realization of residue threshold functions, i.e., the output functions of an $(n,k)$-counter, using a threshold logic network was first considered by Dadda [3], who proposed a network structure consisting of gates arranged in consecutive stages. This feed-forward network structure was generalized by Muroga for the realization of XOR (parity) functions [12]. Ho et al. have discussed the implementation of residue threshold functions using dedicated ROM arrays [9]. Here, we propose a two-level realization of the functions described above, offering a significant speed advantage compared with other alternatives.

The realization of simple parity functions using a two-level threshold logic network was studied earlier by Muroga [12]. Extending the two-level parity circuit structure given in [12] to general parity functions of $2^{j-1}$-tuples for the $j$th significant output bit, the parallel counter can be constructed as a regular two-level threshold logic network. Fig. 2 shows the gate-level diagram of the (31,5)-counter circuit consisting of 20 threshold logic gates, arranged in two logic stages. The counter is actually designed as a 32-input, 5-output circuit which determines the number of the one’s in the input vector if this number is smaller


Fig. 1. (a) Symbolic representation of the generic threshold logic gate. (b) Circuit schematic of a CTL gate of m input variables, with the reset (precharge) and evaluation clock signals. (c) Circuit schematic of a CTL array which implements n independent threshold functions of m input variables.

than or equal to 31. If the number of one’s is equal to 32, a sixth flag bit can be generated. The 16 first-level threshold gates all receive the same 32 input variables, where the input weight corresponding to each input variable is equal to one.

The threshold values of the first-level gates are chosen as 2, 4, 6, 8, …, 32. The inverted outputs of all first-level threshold gates are applied to the second-level parity (LSB) gate, each with an input weight of two. Its output $B_0$ corresponds to the



Fig. 2. Gate-level diagram of the (31,5) parallel counter circuit, consisting of 20 threshold logic gates arranged in two logic levels.

parity function of the 32-b input vector. To obtain the second significant output bit, the inverted outputs of the first-level gates with threshold values of 4, 8, 12, …, 32 are applied to the second-level gate generating $B_1$, each with an input weight of four. The third and fourth significant bits of the output vector, $B_2$ and $B_3$, are generated in a similar manner. Notice that all second-level threshold gates also receive the 32 primary input variables, with unity input weights. The most significant bit (MSB) of the output vector, $B_4$, becomes logic one when the number of one’s in the input vector is larger than or equal to

16 and less than 32. The sixth flag bit is simply obtained as the noninverted output of the first-level threshold gate with a threshold value of 32 [11].

It is worth noting that the two-level realization of the (31,5) parallel counter described here can be easily generalized for an arbitrary number of input bits. The generic structure would consist of first-level threshold gates and second-level threshold gates. It can be seen that this network consisting of 20 threshold gates satisfies the requirements set forth in Section II for efficient implementation of the CTL concept: The number of


Fig. 3. (a) Microphotograph of the CTL-based (31,5) parallel counter implemented in 1.2 µm CMOS technology; the circuit occupies an area of (580 × 300) µm². (b) Microphotograph of a more compact version of the (31,5) counter using the same technology; the circuit occupies an area of (360 × 220) µm².

input variables is large (at least 32) for all threshold gates, the input weights are mostly very uniform, and all threshold gates in this network share the same 32 primary inputs. Therefore, the CTL implementation of this (31,5) parallel counter can offer a significant area advantage.

Fig. 3(a) shows the microphotograph of the CTL-based (31,5)-counter circuit designed and fabricated using the 1.2 µm double-polysilicon CMOS technology of Austrian Micro Systems (AMS). The weight-implementing capacitors are built as multiples of minimum-geometry poly1-poly2 crossings, where the overlap area determines the unit capacitance value. The unit weight capacitance thus implemented has a value of 9.18 fF. The absence of any active devices within the array allows a very high integration density. The input switches controlled by the two clock signals are arranged on one side of the capacitor array. The comparators which binarize the row voltage perturbations are arranged perpendicular to the switch array, on the periphery. Since 32 of the 48 input switches supply the same primary input variables to all 20 rows (i.e., 20 threshold gates), the area overhead of input circuitry is not significant. A more compact version of the CTL-based (31,5)-counter designed using the same technology is shown in Fig. 3(b). The entire parallel counter circuit occupies an area of about 0.08 mm².
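The two-level network described in this section (Fig. 2) can be modeled at the logic level with a short Python sketch. The first-level thresholds (2, 4, …, 32), the unit weights on the primary inputs, and the weights of two and four on the inverted first-level outputs follow the text. The weights of 8, 16, and 32 for the higher bits and all second-level threshold values (32 + 1, 32 + 2, 32 + 4, 32 + 8, 32 + 16) are assumptions of this sketch, chosen so that each gate produces the required bit; they are not taken from the paper. The sketch uses one second-level gate per output bit, so its gate count need not match the 20 gates of the fabricated circuit.

```python
from typing import List, Sequence, Tuple


def threshold_gate(inputs: Sequence[int],
                   weights: Sequence[int],
                   threshold: int) -> int:
    """Logical model of a threshold gate: 1 iff weighted sum >= threshold."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0


def two_level_counter(x: Sequence[int]) -> Tuple[List[int], int]:
    """Count the ones in a 32-b vector with a two-level threshold network.

    Returns ([B0, B1, B2, B3, B4], flag), where flag is 1 only when all
    32 inputs are one.
    """
    if len(x) != 32:
        raise ValueError("expected exactly 32 input bits")
    unit = [1] * 32

    # First level: 16 gates with unit weights and thresholds 2, 4, ..., 32.
    first = {t: threshold_gate(x, unit, t) for t in range(2, 33, 2)}

    bits = []
    for j in range(5):
        step = 2 ** (j + 1)  # 2, 4, 8, 16, 32
        # Inverted outputs of the first-level gates whose thresholds are
        # multiples of `step`, each applied with an input weight of `step`.
        taps = [1 - first[t] for t in range(step, 33, step)]
        inputs = list(x) + taps
        weights = unit + [step] * len(taps)
        # Assumed threshold: the weighted sum equals 32 + (count mod step),
        # so bit j is one exactly when the sum reaches 32 + 2**j.
        bits.append(threshold_gate(inputs, weights, 32 + 2 ** j))

    flag = first[32]  # noninverted output of the threshold-32 gate
    return bits, flag
```

Every output bit is produced after exactly two gate evaluations, regardless of the input vector, which mirrors the logic depth of two claimed for the circuit.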


The comparators consist of two cascaded inverter stages, an nMOS reset switch to be activated during the precharge phase, and a dummy switch attached to its input for charge-injection compensation. The inverter stages in this test structure were designed to maximize the small-signal voltage gain only; therefore, the dynamic response times of the measured CTL gates leave ample room for improvement. It was demonstrated in [10] that the input-to-output signal propagation delay of a CTL gate exhibits only a weak logarithmic dependence on fan-in, due to the fact that the minimum row voltage perturbation decreases with the increasing number of inputs. The propagation delay of the parallel counter is always determined by two CTL stage delays, which are approximately equal in magnitude. Fig. 4 shows the input and output voltage waveforms of the counter circuit obtained through post-layout SPICE simulation. The typical signal propagation delay of the CTL-based (31,5) parallel counter circuit is found to be about 3 ns. Since the propagation delay of a CTL gate actually depends on the amount of row voltage perturbation (overdrive) generated by the input vector, worst-case delays may exceed this number by up to 40%.

This CTL-based realization of the 31-b parallel counter can now be compared with the conventional, full-adder-based realization. Swartzlander [2] has derived both the number of full adders needed to build an $n$-input population counter and an upper bound on the counter delay, expressed in adder delays. Thus, a conventional realization of the (31,5) parallel counter would require a total of 26 full adders, arranged as a network where the maximum logic depth is equal to seven adder stages. A compact CMOS full adder cell has been designed by the authors using the same technology (1.2 µm CMOS) for comparison purposes.
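The figures of 26 full adders and a depth of seven adder stages can be reproduced with a small sketch. It assumes one common recursive construction, in which a counter with $2^k - 1$ inputs is built from two counters with $2^{k-1} - 1$ inputs whose outputs are summed by a ripple-carry chain of full adders, with the remaining input used as carry-in. Whether this matches the exact arrangement analyzed in [2] is an assumption, as is the unit delay assigned to both the sum and carry outputs of every adder. All function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Signal = Tuple[int, int]  # (logic value, arrival time in adder delays)


def full_adder(a: Signal, b: Signal, c: Signal) -> Tuple[Signal, Signal]:
    """(3,2)-counter: returns (sum, carry), both one delay after the
    latest-arriving input."""
    total = a[0] + b[0] + c[0]
    ready = max(a[1], b[1], c[1]) + 1
    return (total & 1, ready), (total >> 1, ready)


def adder_tree_counter(bits: Sequence[Signal],
                       stats: Dict[str, int]) -> List[Signal]:
    """Count ones among 2**k - 1 inputs; returns k signals, LSB first."""
    n = len(bits)
    if n == 0 or (n + 1) & n:
        raise ValueError("number of inputs must be 2**k - 1")
    if n == 1:
        return [bits[0]]
    half = (n - 1) // 2
    left = adder_tree_counter(bits[:half], stats)
    right = adder_tree_counter(bits[half:2 * half], stats)
    carry = bits[-1]  # the one remaining input enters as carry-in
    out = []
    for a, b in zip(left, right):  # ripple-carry addition of two counts
        s, carry = full_adder(a, b, carry)
        stats["adders"] += 1
        out.append(s)
    out.append(carry)
    return out


def analyze(values: Sequence[int]) -> Tuple[int, int, int]:
    """Return (count of ones, full adders used, depth in adder delays)."""
    stats = {"adders": 0}
    out = adder_tree_counter([(v, 0) for v in values], stats)
    count = sum(v << i for i, (v, _) in enumerate(out))
    depth = max(t for _, t in out)
    return count, stats["adders"], depth
```

For 31 inputs, `analyze` reports 26 full adders and a depth of seven adder delays under these assumptions, in agreement with the numbers quoted above.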
The adder cell occupies an area of about 0.004 mm², which dictates that the minimum area required for an adder-based (31,5) parallel counter (including intercell routing and output buffers) would be 0.15 mm². Note that this area is about twice as large as the area occupied by the CTL-based parallel counter. The signal propagation delay of the full adder cell has been determined to be about 1.2 ns with post-layout simulations. Hence, the total delay of the full-adder-based (31,5) parallel counter would be about 8.5 ns, which is higher than the delay of the CTL-based counter circuit. For counters with a larger number of inputs, the speed advantage of the CTL-based approach is expected to become even more significant.

IV. EXPERIMENTAL RESULTS

The fabricated 31-b parallel population counter [Fig. 3(a)] described in Section III was tested under various dynamic operating conditions. To verify its operation at high input rates, consecutive 31-b input vectors were applied to the counter circuit at a rate of up to 16 Mvectors/s, in between two consecutive reset phases. The test vectors used to verify the counting function were designed such that the number of one’s in consecutive vectors either increases or decreases steadily. In order to simplify the application of test vectors, the 31 inputs of the counter circuit were grouped together into five separately accessible blocks with 1, 2, 4, 8, and 16 input



Fig. 4. Post-layout SPICE simulation results showing a primary input voltage, an intermediate stage (first stage) output voltage, and the LSB output voltage ($B_0$) waveform of the (31,5) parallel counter. The overall signal propagation delay of the two-stage counter circuit is 3 ns.

switches. These inputs were then driven by an external high-speed 5-b counter circuit, while the 32nd input was held at logic zero. Fig. 5 shows the five output voltage waveforms $B_0$ through $B_4$ obtained at an input rate of about 16 Mvectors/s, i.e., when $16 \times 10^6$ distinct input vectors are applied to the counter circuit every second. To achieve this input rate, the external 5-b counter circuit generating the input vectors was driven with a 16 MHz master clock signal. Thus, the frequency of the LSB input waveform is 8 MHz, the frequency of the second significant input bit waveform is 4 MHz, etc. The reference voltage was V, and the reset/evaluate clock frequency was 100 kHz. The output waveforms verify that the circuit operates correctly as a (31,5)-counter, where the LSB output waveform frequency is, as expected, equal to the LSB input waveform frequency of 8 MHz. Since the oscilloscope used in the measurements could display only four traces simultaneously, the MSB output waveform $B_4$ is shown in a separate plot, along with two of the other output waveforms. It is clear that the five output waveforms shown here can be sampled at a clock frequency of 16 MHz, resulting in an output rate of 16 Mvectors/s. With 31-b input vectors, the measured output rate of the parallel counter circuit translates to a data processing capability of about 500 Mb/s (31 b × 16 Mvectors/s = 496 Mb/s).

The observed transient glitch at the output voltage is due both to signal race between the first and second stages of the CTL parallel counter circuit and to slight signal skew between the rising and falling edges of the input vector bits. With proper sampling of the output signal, this situation can be resolved in a simple manner. As indicated in Section III, the voltage comparators used in this test structure were designed to maximize the small-signal voltage gain only; they were not designed to drive the considerably high

input capacitances of the off-chip output buffers. The measured input-to-output signal propagation delay of the parallel counter test chip is therefore approximately equal to 40 ns. The dynamic response characteristics of the CTL-based parallel counter circuit can be expected to improve further with proper gain/speed optimization.

V. CONCLUSION

Threshold gates offer the capability of realizing complex Boolean functions using a smaller number of logic gates and/or fewer logic stages. The novel, charge-based CTL gate concept used in this work effectively merges the advantages of a very simple, area-efficient circuit structure with the capability of handling a large number of parallel inputs simultaneously, thus exploiting the full advantage of threshold logic gates. CTL gates are essentially dynamic circuits which require a periodic refresh or precharge cycle, but unlike conventional dynamic CMOS gates, the circuit can be operated in synchronous as well as in asynchronous mode. It was also shown that the full advantage of the CTL gate concept can be exploited in applications where i) there is a large number of input variables and ii) many threshold functions of the same input variables (same input set) are to be implemented simultaneously.

To demonstrate the area and speed advantages offered by the CTL gate concept, a 31-b parallel counter circuit has been designed, fabricated, and tested. The (31,5)-counter circuit consists of 20 threshold logic gates arranged in two stages, i.e., the circuit described here has an effective logic depth of two, and it occupies an area of approximately 0.08 mm². Thus, the circuit offers a radically different, compact, and potentially superior alternative for the implementation of large-input counters. Detailed post-layout simulations have indicated


that the total signal propagation delay of the circuit is about 3 ns. The test circuit was shown to operate reliably when consecutive 31-b input vectors are applied at a rate of 16 Mvectors/s. With its demonstrated data processing capability of about 500 Mb/s, the 31-b parallel counter offers a number of application possibilities, e.g., in high-speed parallel multiplier arrays and data encoding circuits.

Theoretical limits for the size of CTL-based parallel counters are primarily determined by the maximum number of inputs of the CTL gates used in the design. This, in turn, is dictated by the upper limit of the sum of weights on one CTL gate, which was demonstrated to be well above 256 [10]. On the other hand, the capacitor matrix of the CTL-based counter occupies an area set by the product of its column and row counts, whereas the area of the adder-tree implementation of the parallel counter grows more slowly with the number of inputs [14]. Therefore, the area advantage of the CTL-based counter is expected to diminish for very large input vector sizes. In this case, however, a large-input counter can be constructed by using two smaller counters in parallel [2]. The propagation delay is far less sensitive to the number of inputs, since the logic depth of two is independent of the input vector length. The delay of each CTL stage exhibits a weak dependence on the number of inputs due to the decrease in row voltage perturbation, but the influence on overall speed is not very significant.

Fig. 5. Measured output voltage waveforms of the (31,5) parallel counter circuit, at an input rate of 16 Mvectors/s. The number of one’s in each consecutive 31-b input vector is steadily being decreased by one.

REFERENCES

[1] P. J. Song and G. De Micheli, “Circuit and architecture trade-offs for high-speed multiplication,” IEEE J. Solid-State Circuits, vol. 26, pp. 1184–1198, Sept. 1991.
[2] E. E. Swartzlander, Jr., “Parallel counters,” IEEE Trans. Comput., vol. C-22, pp. 1021–1024, Nov. 1973.
[3] L. Dadda, “Some schemes for parallel multipliers,” Alta Frequenza, vol. 34, pp. 349–356, Mar. 1965.
[4] C. S. Wallace, “A suggestion for a fast multiplier,” IEEE Trans. Electron. Comput., vol. EC-13, pp. 14–17, Feb. 1964.
[5] E. Hokenek, R. K. Montoye, and P. W. Cook, “Second-generation RISC floating point with multiply-add fused,” IEEE J. Solid-State Circuits, vol. 25, pp. 1207–1213, Oct. 1990.
[6] G. Goto, T. Sato, M. Nakajima, and T. Sukemura, “A 54 × 54-b regularly structured tree multiplier,” IEEE J. Solid-State Circuits, vol. 27, pp. 1229–1236, Sept. 1992.
[7] J. Mori et al., “10-ns 54 × 54-b parallel-structured full-array multiplier fabricated with 0.5 µm CMOS technology,” IEEE J. Solid-State Circuits, vol. 26, pp. 600–606, Apr. 1991.
[8] C. C. Foster and F. D. Stockton, “Counting responders in an associative memory,” IEEE Trans. Comput., vol. C-20, pp. 1580–1583, Dec. 1971.
[9] I. T. Ho and T. C. Chen, “Multiple addition by residue threshold functions and their representation by array logic,” IEEE Trans. Comput., vol. C-22, pp. 762–767, Aug. 1973.
[10] H. Özdemir, A. Kepkep, B. Pamir, Y. Leblebici, and U. Çilingiroğlu, “A capacitive threshold-logic gate,” IEEE J. Solid-State Circuits, this issue, pp. 1141–1150.
[11] Y. Leblebici, H. Özdemir, A. Kepkep, and U. Çilingiroğlu, “A compact parallel (31,5)-counter circuit based on capacitive threshold-logic gates,” in Proc. 1995 European Solid-State Circuits Conf., Sept. 1995, pp. 390–393.
[12] S. Muroga, Threshold Logic and Its Applications. New York, NY: Wiley, 1971.
[13] P. M. Lewis and C. L. Coates, Threshold Logic. New York, NY: Wiley, 1967.
[14] K. Choi, “VLSI implementation of a neural network: Binary weight pattern associator,” Ph.D. dissertation, The Pennsylvania State University, May 1993.
