IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 47, NO. 6, JUNE 2012
Associative Memory for Nearest-Hamming-Distance Search Based on Frequency Mapping
Hans Jürgen Mattausch, Senior Member, IEEE, Wataru Imafuku, Akio Kawabata, Tania Ansari, Masahiro Yasuda, and Tetsushi Koide, Member, IEEE
Abstract—The developed associative-memory architecture utilizes a mapping operation of the Hamming distances into frequency space with ring oscillators programmable in discrete frequency steps. As a result, a fast word-parallel search of the nearest Hamming distance with low power consumption is obtained. Additionally, high robustness against fabrication-related variations of the MOSFET characteristics is achievable by design, because the size of the frequency steps is a freely selectable design parameter which can be adjusted to compensate for the variation magnitude. A quantitative analysis of within-die variation effects on the reliability of the associative-memory architecture is presented, and guidelines for the choice of the design parameters at a given magnitude of the variation effects are derived. Feasibility and performance of this associative-memory architecture are experimentally evaluated with a VLSI design in 180 nm CMOS technology containing 64 reference patterns, each consisting of 256 bits. The fabricated chip operates correctly down to low supply voltages (Vdd) of 0.7 V. The power dissipation is less than 36.5 mW and 307 µW at supply voltages of 1.8 V (nominal supply) and 0.7 V, respectively. Measured search reliability is found to be in agreement with measured variations of the important design parameters and expectations from the variation analysis. In comparison to previously reported digital associative-memory designs, the achieved power dissipation is more than 5 times smaller, while the average search speed is only slightly improved. For Vdd = 1.8 V the search time ranges from a minimum of 50 ns at Hamming distance 0 to a maximum of 245 ns for the largest Hamming distance 255.
Index Terms—Associative memory, frequency mapping, CMOS, Hamming distance, process variation, reliability, word-parallel search.
Manuscript received January 03, 2011; revised January 26, 2012; accepted January 30, 2012. Date of publication April 19, 2012; date of current version May 22, 2012. This paper was approved by Associate Editor Bevan Baas. This work was supported by grants 19360163 and 9D06JE324 from the Ministry of Education and Science, Japanese Government. The authors are with Hiroshima University, Higashi Hiroshima 739-8527, Japan (e-mail: [email protected]). Digital Object Identifier 10.1109/JSSC.2012.2190191
I. INTRODUCTION
NEAREST-MATCH detection goes beyond the capabilities of conventional content addressable memory (CAM), which can only detect a complete match between the analyzed bits of the input-data word and the stored reference-data words. Fast detection of nearest-matching reference data is required in many advanced applications with artificial-intelligence requirements. Examples of such applications are speech-pattern recognition, database search, computational biology and machine learning [1]–[4]. Additionally, low power consumption for these matching operations is gaining importance due to environmental concerns and the general trend towards battery-powered, mobile equipment. The ideal solution for fast nearest-match search is a fully parallel associative memory.
Previously reported solutions for word-parallel nearest-Hamming-distance search include mixed digital/analog [5], [6] as well as fully digital solutions [7], [8]. All of these solutions rely on the mapping of the Hamming distance into a convenient representation space, for example:
— Delay time of an inverter chain changeable by drive-current adjustment [5].
— Input current of a regulated amplification circuit [6].
— Clock-cycle number of signal transmission through a register chain [7].
— Thermometer code for the number of non-matching bits by sorting algorithm [8].
Mixed digital/analog solutions, such as the distance-current mapping, have the advantage of requiring fewer transistors per bit. Furthermore, the completely asynchronous self-timed search operation leads to low power consumption. A main disadvantage of mixed digital/analog solutions is a more difficult scaling to small design rules and low supply voltages [6]. On the other hand, the digital solutions have the advantages of scalability and higher search reliability. However, they need a larger number of transistors per bit and require additional control signals within the search circuitry for detection of the nearest-matching reference data. This increases the power dissipation, because the control signals must be routed to the bit circuitry of all word comparators and because they must be asserted once or multiple times during each search operation [7], [8]. This paper demonstrates that distance-frequency mapping with ring oscillators can overcome the disadvantages of mixed digital/analog architectures while maintaining their advantages.
Low power dissipation, fast search operation, design-rule/supply-voltage scalability and high search reliability can be achieved at the same time. Consequently, the advantages of previous architectures are combined and their disadvantages are largely avoided. In Section II of this paper we describe the developed frequency-domain associative-memory architecture in detail. The influence of fabrication-process variation effects on the search reliability of the frequency-domain architecture is analyzed in Section III. Then the experimental results of a CMOS-design study in 180 nm technology are explained and compared to the expectations from variation effects in Section IV. Finally, a summary and the conclusions are given in Section V.
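As a purely behavioral reference for the function that the associative memory computes, the nearest-Hamming-distance search can be sketched in software. This is a minimal sketch and not the circuit's method: the names `hamming_distance` and `nearest_match` are hypothetical, patterns are assumed to be held as Python integers, and the loop below is sequential, whereas the hardware compares all reference words in parallel.

```python
from typing import List, Tuple


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ (EXOR, then count ones)."""
    return bin(a ^ b).count("1")


def nearest_match(query: int, references: List[int]) -> Tuple[int, List[int]]:
    """Return the minimum Hamming distance and the indices of all rows reaching it."""
    distances = [hamming_distance(query, ref) for ref in references]
    best = min(distances)
    winners = [row for row, d in enumerate(distances) if d == best]
    return best, winners


if __name__ == "__main__":
    # Three 4-bit reference patterns; rows 0 and 1 both differ from the query in one bit.
    refs = [0b1010, 0b1111, 0b0000]
    print(nearest_match(0b1011, refs))
```

Returning all rows that share the minimum distance, rather than a single index, keeps the sketch neutral about how ties are resolved.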
0018-9200/$31.00 © 2012 IEEE
MATTAUSCH et al.: ASSOCIATIVE MEMORY FOR NEAREST-HAMMING-DISTANCE SEARCH BASED ON FREQUENCY MAPPING
II. FREQUENCY-DOMAIN ARCHITECTURE
We have developed the frequency-domain associative-memory architecture by building on ideas from a previously reported time-domain architecture [5], which maps the Hamming distance onto discrete drive-current steps of the inverters in a delay chain. The drive-current mapping leads to a time delay for different Hamming distances, which is then used to determine the reference data with the minimum distance. However, the achievable size of the time-delay steps with this drive-current mapping concept is rather limited and is expected to remain too small for coping with the effects of fabrication-process variations, especially in the case of small design rules. The main task for overcoming the limitations of the architecture reported in [5] is therefore the provision of mechanisms for making the time-delay steps sufficiently large.
The approach reported here applies a mapping of the Hamming distances into the frequency domain by using ring oscillators with delay stages having selectable discrete delay steps [9]. Special care is taken that the delay-step size can be adjusted freely to obtain sufficient margin for compensation of process-related variation effects. For a practical delay-step implementation, a circuit consisting of, e.g., inverters and intermediate capacitances is a good solution. The delay-step size would in this case be adjusted by the inverter number, the length/width sizing of the constituting MOSFETs, the capacitance number or the capacitance size.
The obtainable advantageous properties of the developed associative-memory architecture include:
— Scalability with respect to design rules and power-supply voltage.
— Asynchronous search operation with self-timing of the search-operation end.
— Minimized simultaneous switching activity of just 1 gate per reference pattern at any time during the search operation. For any ring-oscillator configuration selected by the mapped distance, the basic structure of a closed sequential chain, in which only one element switches at any given time, remains.
— No necessity for additional control-signal routing within word comparators and a moderate number of word-comparator transistors per bit.
Fig. 1 gives an overview of the developed associative-memory architecture. Bit comparison is carried out by implementing an EXOR operation with conventional circuitry [5]–[8]. The resulting output signals, which code the information on matching and non-matching bits, are connected to the adjustable ring-oscillator stages. An additional fixed delay is selected for each non-matching bit in these ring-oscillator stages. The Hamming distance $D$ is therefore transformed into the frequency domain according to the equation
$$f(D) = \frac{1}{\tau_0 + D \cdot \tau_D} \qquad (1)$$
where $\tau_0$ is the basic ring delay for Hamming distance $D = 0$. The delay $\tau_D$ is a designed additional delay for a Hamming-distance increase by 1. $\tau_D$ must be designed sufficiently large so that process variations are compensated. Additionally, the remaining difference between the frequency of the ring oscillator for the nearest-matching reference pattern (the winner) and the ring-oscillator frequencies for the other reference patterns (the losers) must be larger than the resolution limit of the Winner-Take-All (WTA) circuit.
In Fig. 1 a circular implementation concept for the ring oscillators is assumed, with an upper row of delay stages (upper stages) for the signal flow from right to left and a lower row of delay stages (lower stages) for the signal flow from left to right. This concept for the placement of delay stages avoids long wiring in the ring-oscillator layout. Neighboring bits are connected in this scheme to lower and upper stages. Different ring-oscillator implementation concepts are of course possible.
Two implementation examples for the delay, by (a) a simple inverter chain and (b) a gated inverter chain, are shown in Fig. 2. Solution (a) has the advantage of fewer transistors, while (b) results in smaller power dissipation because the inverters in unused delay elements are prevented from switching during search operations.
Fig. 1. Minimum Hamming-distance-search architecture based on ring oscillators with delay stages adjustable in time-delay steps for non-matching bits, and a time-domain WTA circuit.
Fig. 2. Ring-oscillator stages with selectable delay for non-matching bits, implemented (a) by a simple inverter chain or (b) as a gated inverter chain for reduced power dissipation.
Each search operation begins by simultaneously starting all ring oscillators with a search-begin signal (SB). After signal circulation around the ring-oscillator loops for a suitable number
Fig. 3. High-resolution closed-loop time-domain WTA circuit, based on positive-edge resettable registers, delay stages, frequency dividers and a winner-detector, which can be viewed as a very sensitive phase detector for the ring-oscillator frequencies (see also Fig. 9).
of times, a time-domain WTA circuit is used for making the winner decision based on the ring-oscillator outputs. A single circulation leads to the shortest search time, while multiple circulations increase the winner-loser time delay and therefore the search reliability. A search-end signal (SE) is also generated, which indicates the end of the search operation and can be used to reduce power dissipation by turning off all ring oscillators.
A time-domain WTA circuit, which implements a closed-loop decision concept, is shown in Fig. 3. Basic construction elements are frequency dividers, delay stages with delay $\tau_{DS}$, positive-edge-triggered resettable registers and a winner-detection circuit. All ring-oscillator output signals pass through the frequency dividers and the identical delay stages to the data inputs of their corresponding registers. After the frequency dividers all signals are also connected to the winner-detection circuit, which can be viewed as a very sensitive phase detector of the ring-oscillator frequencies operating in the time domain. The detailed design of a winner-detection circuit is described in the section on experimental results (see Section IV, Fig. 9). At the arrival time of the first signal from the frequency dividers, which is by design the winner signal, the winner-detection circuit generates a positive-edge clock signal with delay $\tau_{WD}$ for all registers, to store a 1 in the register of the correct row holding the winner pattern and 0 values in all other registers. The winner-detection circuit must be designed symmetrically so that the delay $\tau_{WD}$ remains the same for any location of the winner row. The freely adjustable input-signal delay $\tau_{DS}$ is designed according to the following delay equation:
$$\tau_{DS} = \tau_{WD} - t_{su} - t_m \qquad (2)$$
Here $t_{su}$ is the register set-up time and $t_m$ is a margin for compensating process-variation effects for the delays $\tau_{DS}$ and $\tau_{WD}$, to assure that the lower-bound condition $\tau_{WD} - \tau_{DS} \geq t_{su}$ always holds.
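The search flow described so far (distance-dependent ring delay, a chosen number of loop passes, and a first-arrival decision) can be illustrated with an idealized timing model. The following is a sketch under stated assumptions, not the circuit itself: the function names are hypothetical, variation effects are ignored, the WTA timing is lumped into a single assumed `capture_window_ps`, and the numeric values in the usage part are illustrative only.

```python
from typing import List, Tuple


def arrival_time_ps(distance: int, t_base_ps: float, t_step_ps: float, loops: int = 1) -> float:
    """Ideal arrival time of one row signal: loop passes times (basic delay + one step per mismatch)."""
    return loops * (t_base_ps + distance * t_step_ps)


def wta_decision(distances: List[int], t_base_ps: float, t_step_ps: float,
                 loops: int = 1, capture_window_ps: float = 160.0) -> Tuple[List[int], float]:
    """Flag every row whose signal arrives within the capture window after the first arrival.

    capture_window_ps is an assumed lumped parameter standing in for the
    WTA-internal timing; it is not a quantity defined in the text.
    """
    arrivals = [arrival_time_ps(d, t_base_ps, t_step_ps, loops) for d in distances]
    first = min(arrivals)
    flags = [1 if (t - first) <= capture_window_ps else 0 for t in arrivals]
    return flags, first


if __name__ == "__main__":
    # Hamming distances of four rows to the input word (illustrative values).
    flags, first = wta_decision([3, 1, 4, 1], t_base_ps=50_000.0, t_step_ps=953.0)
    print(flags, first)
```

In this idealized model, rows with equal minimum distance are all flagged, and a loser is rejected as long as the number of loop passes times the delay step exceeds the capture window.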
The corresponding upper-bound condition is given by the requirement that the nearest-loser signal should arrive at its corresponding register after the clock signal. Since the smallest possible time difference between winner and nearest loser is given by the designed delay step, this means that the clock-signal delay of the winner-detection circuit may exceed the data-signal delay of the delay stages by less than one delay step, which should hold as the upper-bound condition even under within-die process-variation effects. The consequences of the upper-bound condition including variation effects are discussed in detail in Section III of this paper. If both lower- and upper-bound conditions are fulfilled, only the correct row holding the winner pattern is registered as winner by its corresponding register, while all other rows are correctly registered as losers.
In search cases where 2 or more reference patterns with the same minimum distance to the input pattern are present among the reference data, the following search results are possible:
— If the time difference of the respective row signals after the frequency dividers, induced by process-variation effects, remains smaller than the time resolution of the time-domain WTA circuit (basically set by the WTA-internal delays and their variation effects), all rows are detected as winners.
— Those rows for which the signals after the frequency dividers are delayed beyond the time resolution of the WTA circuit due to variation effects are detected as losers.
For dealing with the case of multiple winners, an additional priority-encoder circuit, which selects one of the multiple winners as final winner, is needed after the WTA circuit. Efficient priority-encoder circuits are known from previous work on associative memories and CAMs [10], [11].
The described associative-memory architecture offers 2 independent design methods for adapting the search reliability to the needs of each individual application. The 1st method adjusts the delay-step size and allows search-error reduction or elimination with only moderate search-time degradation. The main cost of this method is an increasing Si-area consumption proportional to the bit number of the reference data, because all delay stages in all ring oscillators are affected by the transistor-number increase.
The 2nd method uses the ring-oscillator output after multiple signal circulations around the ring-oscillator loop to increase the time difference between winner and loser signals. It is, however, only effective in search cases where such a time difference between winner and nearest loser is present but is too small for detection by the WTA circuit. The 2nd method leads to a smaller Si-area penalty, proportional only to the pattern number of the reference data. It can be implemented with the frequency dividers of the WTA as indicated in Fig. 3. The main cost is a search time that grows in proportion to the number of circulations.
Interfacing of asynchronous self-timed systems with standard clocked systems is often a concern. However, for the described frequency-domain associative memory this interfacing is particularly simple and efficient, because we have a conventional digital memory interface on the input side. Furthermore, all output results are already stored in digital registers at the end of the asynchronous winner-search operation and a search-end signal SE is provided. The probability of metastability in the output registers is quite small, since winner and loser signals are normally clearly separated in the time domain. Even if metastable states occur, they can be easily overcome by allowing for sufficient register settling time.
Fig. 4 shows the structure of our proposed output interface to a synchronous system. It uses a handshake-type control part
and positive-edge registers for transferring the minimum-distance-search result to the synchronous system side with a maximum latency of only one clock cycle. The control part of the interface consists of 3 AND gates, 2 delay elements and a C element. The C element changes its output only if both inputs are 0 (output is forced to 0) or both inputs are 1 (output is forced to 1); otherwise the output remains unchanged [12]. While the search-end signal SE is low, the clock of the synchronous side is gated and the internal interface clock remains low. Therefore, both inputs to the C element are low, so that the acknowledge signal ACK is also forced low. When a minimum-distance search on the asynchronous side finishes, SE rises to 1 and thus one of the C-element inputs also rises to 1. After a delay, designed to allow for settling of metastable states in the output registers of the associative memory, the delayed SE signal also rises to 1 and the synchronous clock is allowed to propagate to the synchronization registers. If the synchronous clock is 1 at this time, the internal interface clock immediately rises and the data are transferred to the synchronous side. However, if the synchronous clock does not stay at 1 for a minimum time, a very short pulse can occur for the internal clock signal, resulting in either metastable register states or the input data not being stored in the registers. Therefore, in cases of pulses shorter than this minimum time, the 2nd input of the C element is prevented from rising, so that ACK remains low. Consequently, the synchronous clock is not gated and the associative-memory search results are safely stored with its next rising edge, as also in the cases where the synchronous clock is 0 at the time of the rising delayed SE signal. After safe storage of the associative-memory data with the synchronous system clock, ACK rises and the synchronous clock is again gated. ACK can then be used to reset the search-begin signal SB of the associative memory, which results in SE and ACK returning to 0, i.e., to the initial state for accepting the next data transfer.
Fig. 4. Proposed interface for synchronization of outputs from the asynchronous frequency-domain associative memory with the clock of a synchronous system, which ensures that no metastable states can arise. The gate labeled C is a C-element (or C-gate) [12].
III. ANALYSIS OF VARIATION EFFECTS
As discussed already in Section II, fabrication-process variations will influence the search reliability, and therefore we analyze the influence of these variation effects in more detail in this section. We derive some quantitative relationships between the important design parameters to enhance the insight into the variation effects and to provide guidelines for the actual design. Throughout this section we assume that the within-die variation, which is responsible for the relative differences between the ring-oscillator components of the associative memory and therefore determines the search reliability, is dominantly random in nature. On the other hand, die-to-die variations are assumed to have equal effects on all ring-oscillator components, so that they are only important for the variation of minimum search time and power dissipation but not for the search reliability.
The analysis is made in terms of the variance and the standard deviation of the delay times important for the search reliability. Variance and standard deviation are interrelated by the equation $\mathrm{Var}(\tau) = \sigma^2(\tau)$. The important delay times, the basic ring delay and the delay step, are constructed as chains of delay elements E consisting mainly of inverters and transmission gates. The variation of these delay elements within a single die can be assumed as random and uncorrelated. Let $\tau_n$ be the delay of n such identical delay elements in a chain. Since these delays are independent variables with the same mean $\tau_E$ and variance $\sigma_E^2$, the variance of the chain delay is given by $n \sigma_E^2$. Consequently, we can write the standard deviation of the delay of a chain of n delay elements as
$$\sigma(\tau_n) = \sqrt{n}\, \sigma_E \qquad (3)$$
This means that the normalized standard deviation of a delay chain is expected to diminish with $1/\sqrt{n}$ as a function of the number of delay elements n, according to the equation
$$\frac{\sigma(\tau_n)}{\tau_n} = \frac{1}{\sqrt{n}} \cdot \frac{\sigma_E}{\tau_E} \qquad (4)$$
The two key distance parameters for the characterization of the nearest-distance-search performance of our frequency-domain associative memory are:
i) The distance of a search-input pattern to the best-matching stored reference pattern, which we refer to as the winner-input distance. It describes the similarity of the best-matching reference pattern. A larger winner-input distance means that we have less similarity between the input data and the best-matching reference data.
ii) The distance between the best-matching and the 2nd-best-matching reference pattern, which we refer to as the winner-nearest-loser distance. A smaller winner-nearest-loser distance means less difference between the best-matching and the 2nd-best-matching reference pattern and therefore less significance of a nearest-distance-search error.
Fig. 5 illustrates the relation between the designed arrival times of the winner signal W, the register
clock signal CL(W) and the nearest-loser signal NL at the registers for storage of the search results. W and NL arrive at different registers. The generation of CL(W) is triggered by W, and CL(W) is distributed to all registers. For reliable storage of the correct search result, W must be stable at its register's data input during the set-up time before CL(W) rises, and NL must arrive after CL(W) has risen. Thus, the following 2 conditions must be fulfilled in the presence of variation effects:
(5)
(6)
Fig. 5. Schematic illustration of the relations between the arrival times of the winner signal, the winner-triggered clock signal and the nearest-loser signal at the registers in the WTA circuit. The influence of process-variation effects is also illustrated schematically.
Condition (5) refers only to delays within the WTA circuit, does not depend on the specific search case and is thus relatively easy to fulfill in an actual design. Condition (6) is more complicated, because it depends on the search parameters and the ring-oscillator delay constants. We will therefore analyze the consequences of condition (6) in more detail. In the general case we can express the delay times of the winner-induced clock signal and the nearest-loser signal as
(7)
(8)
Since the contributing delay elements are independent variables with the same mean and the same variance, the separate variances for the delays of the winner-induced clock signal and the nearest-loser signal can thus be written as
Here, the frequency-divider delay and the ring-oscillator period jitter are also included. The quadratic variation increase as a function of the ring-oscillator loop number for the ring-oscillator delay terms is due to the fact that these multiple loops occur in identical ring oscillators, so that their variations are correlated. For the variance of the delay difference between the nearest-loser signal and the winner-induced clock signal we then obtain:
(9)
With (7), (8) and (9) the requirement (6) for reliable winner search becomes
(10)
From (9) and (10) we can conclude that:
— Search reliability strongly increases with larger distance between winner and nearest loser and also with a larger delay step.
— Search reliability decreases with larger distance between winner and input pattern.
— Increasing the number of ring-oscillator loop passes can only help to reduce the impact of the variation components outside of the ring-oscillator loop, as well as the impact of the ring-oscillator jitter and the WTA design margin.
The requirement (10) for reliable winner search can be solved to obtain the maximum winner-input distance up to which a reliable winner search can be expected:
(11)
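The two accumulation rules used in this analysis can be checked numerically: independent delay elements add in variance, so the standard deviation of a chain grows with the square root of the element number, whereas repeated passes through the same ring are fully correlated, so the standard deviation grows linearly with the loop number. The following Monte Carlo sketch assumes Gaussian, zero-mean delay deviations; the function names and all parameter values are hypothetical.

```python
import random
import statistics


def independent_chain_std(n_elements: int, sigma: float, trials: int, rng: random.Random) -> float:
    """Std. dev. of the summed delay deviation of n independently varying delay elements."""
    totals = [sum(rng.gauss(0.0, sigma) for _ in range(n_elements)) for _ in range(trials)]
    return statistics.pstdev(totals)


def repeated_loop_std(n_elements: int, loops: int, sigma: float, trials: int,
                      rng: random.Random) -> float:
    """Std. dev. when one fabricated ring of n elements is traversed `loops` times."""
    totals = []
    for _ in range(trials):
        # One ring is "fabricated" once; every loop pass sees the same deviation.
        ring_deviation = sum(rng.gauss(0.0, sigma) for _ in range(n_elements))
        totals.append(loops * ring_deviation)
    return statistics.pstdev(totals)


if __name__ == "__main__":
    rng = random.Random(1)
    # 100 independent elements: expected std. dev. near sqrt(100) * sigma.
    print(independent_chain_std(100, 1.0, 3000, rng))
    # 25-element ring traversed 4 times: expected std. dev. near 4 * sqrt(25) * sigma.
    print(repeated_loop_std(25, 4, 1.0, 3000, rng))
```

Both cases involve 100 element traversals, yet the correlated case shows about twice the spread, which is why additional loop passes cannot reduce the relative impact of the in-loop variation.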
From (11) we can conclude that the main factors degrading the search reliability are the variations of and . To obtain a simplified quantitative view of the relationships, we relate all delays appearing in above equations to a basic delay-step . In particular, we assume , and . Furthermore, and are chosen, which is a realistic setting from our design experience. We could also assume further factors between the variations outside the ring-oscillator loop and the jitter to express their differences, but this would not change the basic insight and would only lead to small changes of quantitative results. With above assumptions (11) simplifies to
(12) On the basis of (12) we further conclude that an increase of by a factor z is a very efficient measure for the delay step improving the search reliability, because this increases the 1st positive term and decreases 3 of the 4 negative terms simultaneously. Reduction of the standard deviation y of would be also effective because it results in a quadratic improvement of the maximum reliably searchable winner input distance. However, y is mainly determined by the used technology and the designer has only limited means to improve y. Additionally we can conclude that the basic ring-oscillator delay should be designed as small as possible, because this reduces the factor and therefore the main diminishing term for the maximum . According to (12), we expect reliable winner search up to a winner-input distance of for the most critical winner-nearest loser distance , in a typical design case with standard deviation of 1% for , 50 times larger in comparison to and . For , the reliably searchable winner-input distance would be expected to increase drastically to . In Fig. 6 the dependence of on the delay step size , representing a Hamming distance of 1, is plotted for and . It can be seen that even a small increase of z results in a large improvement of . As expected, also leads to drastic improvements, while the positive effect of is much smaller in comparison. Fig. 7 verifies the strong influence of the standard deviation of the basic delay step . With the design choice and , reasonable search performance can be expected until 1.2% standard deviation . For technologies with larger standard deviation larger delay steps i.e., must be chosen. The positive influence of increasing and is of similar magnitude as in Fig. 6 for the basic delay-step size dependence. The influence of the size of the ring-oscillator loop delay in comparison to on is depicted in Fig. 8. As expected decreases with larger , which confirms that minimization of in an actual design is desirable. 
Again, both a larger number of ring-oscillator loop passes and a larger winner-nearest-loser distance improve the reliably searchable winner-input distance, with the improvement magnitude for the larger winner-nearest-loser distance being much larger than for the additional loop passes.
Fig. 6. Dependence of the reliably searchable winner-input distance on the delay-step size as obtained from (12). Data is plotted for 1% standard deviation of the delay step, a basic ring delay 50 times larger than the delay step, and the 2 most critical cases of the winner-nearest-loser distance. The effect of the number of ring-oscillator-loop passes is also illustrated.
Fig. 7. Dependence of the reliably searchable winner-input distance on the standard deviation of the delay step as obtained from (12). The other plot parameters are the same as in Fig. 6.
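The qualitative trends of this section can be reproduced with a deliberately simplified k-sigma model. It is not equation (11) or (12): it assumes a single loop pass, independent Gaussian variations in the winner and nearest-loser rings, no jitter and no WTA-internal variation, and it lumps the WTA timing into one resolution value. The function name, the multiplier `k` and all numbers are hypothetical. Under these assumptions the mean time difference between nearest loser and winner is the winner-nearest-loser distance times the delay step, and its variance is twice the basic-delay variance plus one delay-step variance per selected step in both rings.

```python
import math


def max_winner_input_distance(delta: int, t_step_ps: float, sigma_step_ps: float,
                              sigma_base_ps: float = 0.0, t_res_ps: float = 0.0,
                              k: float = 3.0, n_bits: int = 256) -> int:
    """Largest winner-input distance D satisfying the simplified k-sigma requirement.

    Requirement: delta * t_step - t_res >= k * sqrt(2 * sigma_base^2 + (2 * D + delta) * sigma_step^2)
    Returns -1 if the requirement cannot be met even for D = 0.
    sigma_step_ps must be larger than zero.
    """
    margin = delta * t_step_ps - t_res_ps
    if margin <= 0.0:
        return -1
    variance_budget = (margin / k) ** 2
    fixed_variance = 2.0 * sigma_base_ps ** 2 + delta * sigma_step_ps ** 2
    if variance_budget < fixed_variance:
        return -1
    d_max = math.floor((variance_budget - fixed_variance) / (2.0 * sigma_step_ps ** 2))
    # The winner-input distance plus delta cannot exceed the word length.
    return min(d_max, n_bits - delta)


if __name__ == "__main__":
    # Illustrative values only: 1000 ps step, 30 ps step deviation, 100 ps resolution.
    for delta in (1, 2):
        print(delta, max_winner_input_distance(delta, 1000.0, 30.0, t_res_ps=100.0))
```

Even in this reduced form, the limit rises steeply with a larger delay step or a larger winner-nearest-loser distance and falls with larger variation, which matches the direction of the conclusions drawn above.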
IV. EXPERIMENTAL RESULTS
A full-custom CMOS test chip, applying manual layout, placement and routing for all sub-blocks, was designed in 180 nm technology. This proof-of-concept test chip has a storage capacity of 64 patterns with 256 bit length each and was fabricated to experimentally evaluate achievable search times, power dissipation and variation robustness. Frequency dividers were not implemented in this design. The selectable delay steps for the ring-oscillator stages were implemented according to Fig. 2(a) with 4 inverters and 2 MOS capacitances at the first intermediate node. In the actual design, transistors
Fig. 8. Dependence of the reliably searchable winner-input distance on the increase factor of the basic loop delay of the ring oscillator over the delay-step size, as obtained from (12). The other plot parameters are the same as in Fig. 7.
with long channel length L were used whenever possible to implement an additional delay of about 1 ns for non-matching bits at the nominal supply voltage Vdd = 1.8 V and to minimize its standard deviation. Care was also taken that the designed delay difference for rising and falling stage inputs remained smaller than 0.5 ps. Fig. 9(a) shows the 2-stage wired-OR circuit which is applied as winner detector in the time-domain WTA circuit and realizes a simulated minimum resolution of 160 ps for the winner-loser time difference. The implemented precharge and keeper circuitry is depicted in Fig. 9(b). It is used to precharge the wired-OR lines to Vdd while the search-begin signal SB is low, i.e., before a search is carried out, and keeps discharged wired-OR lines at VSS during the complete search time as long as SB is high.
The die micrograph is depicted in Fig. 10 and shows that the total design area is 3.08 mm². Each of the 64 ring oscillators has a size of 1690 µm by 15 µm and includes 256 selectable delay stages. Ring oscillators and time-domain WTA require about 52% and 4% of the design area, respectively. Design functionality is verified from the nominal supply voltage of Vdd = 1.8 V down to 0.7 V.
The design implements a multiplexer to enable the connection of the ring-oscillator signals to output pads of the chip. Therefore, direct measurements of the ring-oscillator frequencies for design verification become possible. Fig. 11 shows the measured ring-oscillator frequencies for Hamming distances 0 and 1 with supply voltages of Vdd = 1.8 V and Vdd = 0.9 V, respectively. The measurements for Hamming distance 0 are repeated measurements of the same ring oscillator to verify the measurement-equipment accuracy. For the measurements with Hamming distance 1, the stage with the non-matching bit is changed to different ring-oscillator positions.
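The area figures quoted for the test chip can be cross-checked with a few lines of arithmetic. The sketch below assumes the ring-oscillator dimensions are in micrometers and the design area in square millimeters; the function name is hypothetical.

```python
from typing import Tuple


def ring_oscillator_area_share(n_rings: int = 64, length_um: float = 1690.0,
                               height_um: float = 15.0,
                               total_area_mm2: float = 3.08) -> Tuple[float, float]:
    """Return the combined ring-oscillator area in mm^2 and its share of the design area."""
    ring_area_mm2 = n_rings * length_um * height_um * 1e-6  # 1 mm^2 = 1e6 um^2
    return ring_area_mm2, ring_area_mm2 / total_area_mm2


if __name__ == "__main__":
    area_mm2, share = ring_oscillator_area_share()
    print(f"{area_mm2:.4f} mm^2, {100.0 * share:.1f} % of the design area")
```

The result of roughly 1.62 mm², or a little under 53% of 3.08 mm², is consistent with the stated share of about 52%.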
By subtracting the result for Hamming distance 0 from the results with Hamming distance 1 for the different positions along the ring oscillator, the within-die fabrication variation for the time-delay step in horizontal die direction within a single ring oscillator can be evaluated experimentally. These measurements verify an average delay step of 953 ps with a standard deviation of 0.55% of the designed delay at Vdd = 1.8 V. The average increases to 3324 ps at the lower supply voltage of Vdd = 0.9 V, with a standard deviation of 1.25% of the average delay. This confirms that the within-die variation of the delay step is small in comparison to its absolute magnitude. However, the relative variation increases by about a factor 2.3 when Vdd is reduced from 1.8 V to 0.9 V.
We determined the standard deviation of the basic ring delay from frequency-meter measurements of the 64 ring oscillators of the designed test chip set to Hamming distance 0. A standard deviation of 0.22% of the designed delay was obtained. This is a factor 2.8 larger than the straightforward estimate of 0.078% and may be explained by the fact that we used minimum gate length L for the inverters determining the basic ring delay in order to keep it small, in contrast to the large L applied for most of the inverters used in the design of the delay step. The ring-oscillator period jitter was determined from oscilloscope measurements. Its standard deviation is also quite large, most likely because we did not take enough care in the layout to minimize the power-supply noise.
Fig. 9. (a) 2-stage wired-OR implementation of the winner detector circuit used in the test chip. (b) Implementation of the Precharge & Keeper subcircuit in (a).
Fig. 10. Micrograph of the fabricated 64 × 256 bit nearest Hamming-distance-search associative memory in 180 nm CMOS. Design area is 3.08 mm². Ring oscillators and time-domain WTA require about 52% and 4% of the design area, respectively. The design area of 1 ring oscillator is 1690 µm by 15 µm.
Fig. 11. Measured ring-oscillator frequencies for Hamming distances 0 and 1 at (a) Vdd = 1.8 V and (b) Vdd = 0.9 V supply voltage. Repeated measurements of the same ring oscillator for Hamming distance 0 confirm equipment accuracy. Average delay-step values of 953 ps and 3324 ps, with small within-die variations of 0.55% and 1.25% in comparison to the absolute values, are observed for Vdd = 1.8 V and Vdd = 0.9 V, respectively.
Fig. 12. Search times measured as a function of winner-input distance, at power-supply voltages from 0.7 V to 1.8 V for the most critical winner-nearest-loser distance of 1.
Measured search times as a function of the winner-input distance are plotted in Fig. 12 and show that the minimum search time at Vdd = 1.8 V is 50 ns. The search time increases with larger winner-input distance to a maximum of 245 ns for a distance of 255. The search time also increases with lower supply voltage, to 0.79 µs for a winner-input distance of 0 and to 2.99 µs for a distance of 255 at the lowest supply voltage of Vdd = 0.7 V. On the other hand, the maximum power dissipation decreases from 36.5 mW at Vdd = 1.8 V to 307 µW at
Fig. 13. Maximum normalized power dissipation as a function of supply voltage Vdd. The power dissipation becomes as low as 20 nW/bit at Vdd = 0.7 V.
Vdd = 0.7 V. Fig. 13 plots the supply-voltage dependence of the normalized maximum power dissipation, which becomes as low as 20 nW/bit at Vdd = 0.7 V. Further substantial power-dissipation reductions and similar search times are expected with a gated ring-oscillator stage according to Fig. 2(b). We have verified the effectiveness of such gated ring-oscillator stages in a scaled design using 65 nm CMOS technology, which also verifies the scalability of the proposed associative-memory architecture to small design rules [13]. For the search reliability, sufficiently large differences between the arrival times of the winner signal and the winner-triggered register clock signal, as well as between the register clock signal and the nearest-loser signal, at the registers in the WTA circuit are important. As discussed in Section III, these arrival times depend on the division factor of the frequency divider, the winner-input distance, the winner-nearest-loser distance, the designed delays and the standard deviations of these delays. The WTA circuit starts the winner-detection loop when the winner signal has passed the frequency divider. Therefore, only variation effects in the data-signal delay, the register-set-up time and the clock-signal generation time are important for correct detection of the winner by the
WTA circuit, i.e., storing a 1 in the respective register. The design variable should therefore be chosen large enough to assure that including process-variation effects is always fulfilled. The most critical case for correct detection of the nearest-loser signal, i.e., storing a 0 in the respective register, is , where results. As discussed in Section III (see Fig. 6), the most effective design method for this purpose is to design large enough to absorb the variations of .

The characterization of the nearest-distance-search reliability of the fabricated test chip in 180 nm CMOS technology was done by measuring two properties:
- The limit of error-free search for the winner-input distance as a function of the winner-nearest-loser distance and the supply voltage Vdd. An exactly distinguishable winner-nearest-loser distance below 1% of the winner-input distance is regarded as sufficient for most practical applications.
- The cumulative search-error rate as a function of the winner-input distance for the most critical winner-nearest-loser distance at nominal and low Vdd.

With the chosen design parameters of the test chip, the limit for error-free search ratio , defined as the exactly resolvable winner-nearest-loser distance in % of the winner-input distance, has been estimated by circuit simulation to be %. Variation effects are included in this simulation by applying a method based on surface-potential compact models [14], [15]. For determination from (11) of the variation analysis in Section III, the measured data is unfortunately not complete, because we could not measure the internal delay and the standard deviations and of the internal delays and . However, with measured delay step ps, measured standard deviations ps, ps, ps, designed and the assumption , we obtain an estimate of for the most critical search case. The measured error-free search limits of the fabricated test chip, namely the maximum winner-input distance until which exact winner determination is obtained for any location of winner and nearest-loser rows, are shown in Fig. 14 as a function of the supply voltage Vdd for winner-nearest-loser distances from 1 to 4.
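The definition of the error-free search ratio given above can be illustrated with a short sketch. It relies only on the stated definition (the exactly resolvable winner-nearest-loser distance expressed in % of the winner-input distance) and on the stated 1% target; the function names, argument names and the example distances are hypothetical and are not measured values from the test chip.

```python
def error_free_search_ratio(nearest_loser_distance: int,
                            winner_input_distance: int) -> float:
    """Winner-nearest-loser distance in % of the winner-input distance."""
    if nearest_loser_distance <= 0 or winner_input_distance <= 0:
        raise ValueError("both distances must be positive")
    return 100.0 * nearest_loser_distance / winner_input_distance


def meets_target(nearest_loser_distance: int,
                 winner_input_distance: int,
                 target_percent: float = 1.0) -> bool:
    """True if the resolvable distance is below the target percentage
    of the winner-input distance (1% is the target named in the text)."""
    ratio = error_free_search_ratio(nearest_loser_distance,
                                    winner_input_distance)
    return ratio < target_percent


# Hypothetical example: a nearest loser 1 bit further away than a winner
# at distance 200 corresponds to a ratio of 0.5%.
example_ratio = error_free_search_ratio(1, 200)
```

A smaller ratio means that a finer relative difference between winner and nearest loser is still resolved without error.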
At V, exact winner determination in the most difficult case of a winner-nearest-loser distance of 1 is possible until , which is reasonably close to our estimate from (11). For , both winner and nearest loser may be identified as winners for at least one combination of the locations for winner and nearest-loser row. These measured results verify an error-free search ratio % at V. The error-free search ratio degrades with lower supply voltage as also shown in Fig. 14. As expected from the variation analysis in Section III, increasing is a very strong measure to improve . Even at V an of just 4 is sufficient for exact winner-search results up to the largest possible winner-input distance , which corresponds to an error-free search ratio of %. According to the relationships derived in Section III (see Fig. 6), improved error-free search ratios down to %, which is the limit for our test chip with a finite pattern length of 256 bit, are of course achievable by increasing . We have additionally measured the cumulative search-error rate for all possible combinations of physical winner row and
Fig. 14. Measured maximum winner-input distance with error-free search capability as a function of power-supply voltages from 0.7 V to 1.8 V for different winner-nearest-loser distances from 1 to 4. The error-free search ratio is verified to be % at V and to increase to % at V.
nearest-loser row locations in the associative memory. Several realizations of the respective distances with randomly chosen locations of the delay stages implementing these distances were measured for each of the above combinations. Thus the measurement is exhaustive with respect to the variation of but not exhaustive with respect to the variation of . This search-error rate measurement has been done as a function of the winner-input distance for the most difficult winner-nearest-loser distances of 1 and 2, to further characterize the search reliability of the designed test chip also beyond the range of error-free search. Search errors are defined in these measurements as the cases where the nearest loser, together with the real winner or alone, is mistakenly identified as winner. Fig. 15(a) verifies that the cumulative search-error rate up to the maximum winner-input distance for the most difficult case of a nearest-loser distance of 1 stays below 0.5% at nominal Vdd = 1.8 V and below 2.5% at a factor 2 reduced Vdd = 0.9 V. If the winner-nearest-loser distance is increased to the 2nd most difficult case of 2, no errors occur anymore up to the maximum winner-input distance, as shown in Fig. 15(b), and as expected from the variation analysis in Section III. Unfortunately, frequency dividers were not available in the 180 nm CMOS test chip for proving our proposed associative-memory concept. Therefore, we could not evaluate the expected small practical reliability improvements achievable with frequency division. However, in a 65 nm CMOS technology test chip to verify the scalability to small design rules [13] we also implemented frequency dividers. The measured search reliability of this test chip (128 patterns with 512 bits each) up to for and verifies an improvement of the search reliability by about a factor of 1.8 for [13]. Table I compares search times and power dissipation achieved by the 180 nm CMOS test-chip design to previous digital solutions for fully-parallel minimum-Hamming-distance search [7], [8].
Since our design has 2 times and 8 times larger storage capacity for reference data than the designs in [8] and [7], respectively, a scaling to the same capacity has to be carried out for this comparison. The design in [8] uses furthermore
TABLE I COMPARISON OF THE REPORTED ASSOCIATIVE MEMORY FOR MINIMUM HAMMING-DISTANCE SEARCH BASED ON FREQUENCY MAPPING TO PREVIOUSLY REPORTED SEARCH CIRCUITS [7], [8]. SCALED ESTIMATES TO THE SAME STORAGE CAPACITIES AND THE SAME TECHNOLOGY NODE ARE ALSO LISTED TO FACILITATE THE COMPARISON
Fig. 15. Search-error rates as a function of winner-input distance at Vdd = 1.8 V and Vdd = 0.9 V for the most difficult winner-nearest-loser distances of (a) 1 up to the maximum winner-input distance and (b) 2 up to the maximum winner-input distance. The cumulative search-error rate is very low for a winner-nearest-loser distance of 1 and stays below 2.5% even at Vdd = 0.9 V. If the winner-nearest-loser distance is increased to 2, errors do not occur anymore.
a more advanced CMOS technology with about 30% smaller design rules, so that an additional scaling to the same design
rules is necessary. But even without any scaling, since the ring-oscillator-based architecture avoids clock and other global control-signal lines in the word-comparator’s bit-level circuitry and keeps transistor numbers per bit smaller than the digital solutions, we achieve considerably lower power dissipation at higher reference-data capacity in comparison to [7] and [8] and with less advanced design rules in comparison to [8]. The scaling of the clocked-register-based design in [7] with word number increasing m-times and bit number per word increasing n-times is as for clock cycle and as for winner decision and encoding. Since the scaling of the worst-case search time with n in [7] is proportional to the product of n and the clock-cycle time, a stronger than linear increase due to n-scaling results. Therefore, the clock-cycle time and the worst-case search time become quite large with increasing word length. For the estimate in Table I, the small increase due to winner decision and encoding is neglected, so that the scaling rule with is applied for the best-case search time and the scaling rule with is applied for the worst-case search time. The power dissipation in [7] scales proportionally to the storage capacity and inversely proportionally to the clock-cycle time, so that a scaling rule with is used in Table I. The obtained scaling results for search time and power dissipation of [7] to 64 words and 256 bit/word are listed in Table I. For the bubble-sorting solution of [8] the scaling for the search time is basically linear with the bit number. Consequently, a scaling rule proportional to n (independent of m) applies. The power dissipation scales proportionally to the storage capacity and inversely proportionally to the search time, resulting in a scaling rule with m (independent of n). For [8] furthermore the design rule and the power-supply voltage have to be scaled to 180 nm and 1.8 V, respectively.
With the scaling factors for the gate length and for the power-supply voltage, the scaling factors for the delay time and power dissipation become and , respectively. The thus obtained scaling results for search time and power dissipation of [8] to 64 words with 256 bit/word and 180 nm technology with Vdd = 1.8 V are also listed in Table I.
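The capacity-scaling rules stated for the bubble-sorting solution of [8] (search time proportional to n and independent of m, power dissipation proportional to m and independent of n) can be written as a minimal sketch. The class and function names are hypothetical, the baseline numbers are placeholders rather than values from [8] or Table I, and the additional gate-length and supply-voltage scaling is deliberately left out because its expressions are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    words: int             # number of reference words
    bits_per_word: int     # bits per reference word
    search_time_ns: float  # search time in ns
    power_mw: float        # power dissipation in mW


def scale_capacity_like_ref8(design: Design,
                             target_words: int,
                             target_bits: int) -> Design:
    """Apply only the capacity-scaling rules described for [8]."""
    m = target_words / design.words         # word-number factor
    n = target_bits / design.bits_per_word  # bits-per-word factor
    return Design(
        words=target_words,
        bits_per_word=target_bits,
        search_time_ns=design.search_time_ns * n,  # time ~ n
        power_mw=design.power_mw * m,              # power ~ m
    )


# Placeholder baseline (NOT data from [8]): 32 words x 256 bits.
baseline = Design(words=32, bits_per_word=256,
                  search_time_ns=100.0, power_mw=10.0)
scaled = scale_capacity_like_ref8(baseline, target_words=64, target_bits=256)
```

With this placeholder baseline, doubling the word number doubles the power estimate and leaves the search-time estimate unchanged, which is the behaviour the stated rules imply.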
It can be seen that the power dissipation is more than a factor of 5 smaller for the frequency-mapping-based associative memory in comparison to [7] and [8]. The search-time comparison requires the consideration of the winner-input distance, because the search time is independent of this distance in [8] but depends on it in [7] and in our design. The search time in [7] is very short for small winner-input distance but increases strongly with larger distance, so that our design has the shorter search time for 96% of the possible search cases. On the other hand, the search time in [8] is constant and slower than in our design up to . Since a small winner-input distance is more frequent and thus more important in practical applications, an advantage up to means that our solution can be regarded as slightly faster than [8] on the average. In summary, we can therefore say that our design has only slight average speed advantages over the solutions in [7] and [8].

V. SUMMARY

A low-power architecture with word-parallel search capability for the reference word, which has the smallest Hamming distance in comparison to an input word, has been developed on the basis of ring oscillators adjustable in discrete frequency steps. The reported associative memory maps the Hamming distance onto an equivalent number of frequency steps and can achieve any required level of search accuracy by designing an appropriate frequency-step size. A closed-loop time-domain WTA circuit, with high time-difference resolution for arriving input signals, is used to determine the winner. An analysis of fabrication-variation effects confirms that high reliability of the search results can be expected in practical applications. For experimental verification, a test chip in 180 nm CMOS technology with 64 words, each consisting of 256 bits, has been designed. Correct operation over a large range of supply voltages Vdd and a low measured power consumption of less than 36.5 mW and 307 µW at Vdd = 1.8 V and Vdd = 0.7 V, respectively, are verified.
Additionally, fast search times from 50 ns to 245 ns for winner-input distances from 0 to 255 are achieved. In comparison to previous digital solutions, power dissipation is more than 5 times smaller while average search times are only slightly faster. By directly measuring the ring-oscillator frequencies, a small relative magnitude of % at V for fabrication-related within-die frequency variations is confirmed. The measured error-free search ratio of the designed test chip is % at V and increases to % at low supply voltage of V. Search errors can be completely eliminated, i.e., the minimum value % pattern length of 256 bit, can be achieved, by increasing the size of the designed frequency steps. The cumulative search-error rate up to for the most difficult is measured to be % and % at V and V, respectively. If the winner-nearest-loser distance is increased to the 2nd most difficult case of 2, this cumulative search error up to reduces to 0%. An analysis of variation effects confirms that the ring-oscillator time delay step for a Hamming-distance change of 1 should be designed as large as possible, while the ring-oscillator loop delay for Hamming-distance 0 should be designed as small as possible, and that it is important to minimize the ring-oscillator jitter in the layout design. The variation analysis further provides quantitative guidelines for the choice of the different delay parameters as a function of the variation magnitude of the used fabrication technology. Scalability of the developed associative-memory architecture to small design rules has been successfully verified with a design in 65 nm CMOS technology [13].

ACKNOWLEDGMENT

The VLSI chip fabrication was supported by VDEC, the University of Tokyo, in collaboration with Rohm, Simucad, and Cadence.

REFERENCES

[1] C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer, 2007.
[2] G. Valiente, Combinatorial Pattern Matching Algorithms in Computational Biology. Boca Raton, FL: CRC Press, 2009.
[3] M. Yagi and T. Shibata, “An image representation algorithm compatible with neural-associative-processor-based hardware recognition systems,” IEEE Trans. Neural Networks, vol. 14, no. 5, pp. 1144–1161, 2003.
[4] A. Ahmadi, H. J. Mattausch, M. A. Abedin, T. Koide, Y. Shirakawa, and M. A. Ritonga, “Developing a reliable learning model for cognitive classification tasks using an associative memory,” in IEEE Symp. Computational Intelligence in Image and Signal Processing (CIISP’2007), Apr. 2007, pp. 214–219.
[5] M. Ikeda and K. Asada, “Time-domain minimum-distance detector and its application to low power coding scheme on chip interface,” in Proc. 24th Eur. Solid-State Circuits Conf. (ESSCIRC’98), Sep. 1998, pp. 464–467.
[6] H. J. Mattausch, T. Gyohten, Y. Soda, and T. Koide, “Compact associative-memory architecture with fully-parallel search capability for the minimum hamming distance,” IEEE J. Solid-State Circuits, vol. 37, no. 2, pp. 218–227, Feb. 2002.
[7] Y. Oike, M. Ikeda, and K. Asada, “A high-speed and low-voltage associative co-processor with exact hamming/manhattan-distance estimation using word-parallel and hierarchical search architecture,” IEEE J. Solid-State Circuits, vol. 39, no. 8, pp. 1383–1387, Aug. 2004.
[8] S. Nakahara and T. Kawata, “A design for a minimum Hamming-distance search using asynchronous digital techniques,” IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 276–285, Jan. 2005.
[9] H. J. Mattausch, W. Imafuku, T. Ansari, A. Kawabata, and T. Koide, “Low-power word-parallel nearest-hamming-distance search circuit based on frequency mapping,” in Proc. 36th Eur. Solid-State Circuits Conf. (ESSCIRC’2010), Sep. 2010, pp. 538–541.
[10] C. H. Huang, J. S. Wang, and Y. C. Huang, “Design of high-performance CMOS priority encoders and incrementers/decrementers using multilevel lookahead and multilevel folding techniques,” IEEE J. Solid-State Circuits, vol. 37, no. 1, pp. 63–76, Jan. 2002.
[11] H. Noda, K. Inoue, M. Kuroiwa, F. Igaue, K. Yamamoto, H. J. Mattausch, T. Koide, A. Amo, A. Hachisuka, S. Soeda, F. Morishita, K. Dosaka, K. Arimoto, and T. Yoshihara, “A cost-efficient high-performance dynamic TCAM with pipelined hierarchical searching and shift redundancy architecture,” IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 245–253, Jan. 2005.
[12] D. E. Muller and W. S. Bartky, “A theory of asynchronous circuits,” in Proc. Int. Symp. Switching, Part 1, Harvard Univ. Press, 1959, pp. 204–243.
[13] H. J. Mattausch, M. Yasuda, A. Kawabata, W. Imafuku, and T. Koide, “A 381 fs/bit, 51.7 nW/bit nearest hamming-distance search circuit in 65 nm CMOS,” in Proc. 2011 Symp. VLSI Circuits Dig., Jun. 2011, pp. 192–193.
[14] H. J. Mattausch, N. Sadachika, A. Yumisaki, A. Kaya, W. Imafuku, K. Johguchi, T. Koide, and M. Miura-Mattausch, “Correlating microscopic and macroscopic variation with surface-potential compact model,” IEEE Electron Device Lett., vol. 30, pp. 873–875, Aug. 2009.
[15] M. Miura-Mattausch, N. Sadachika, D. Navarro, G. Suzuki, Y. Takeda, M. Miyake, T. Warabino, Y. Mizukane, K. Machida, R. Inagaki, T. Ezaki, H. J. Mattausch, T. Ohguro, T. Iizuka, M. Taguchi, S. Kumashiro, and S. Miyamoto, “HiSIM2: Advanced MOSFET model valid for RF-circuit simulation,” IEEE Trans. Electron Devices, vol. 53, no. 9, pp. 1994–2007, Sep. 2006.
Hans Jürgen Mattausch (M’96–SM’00) received the Dr. rer. nat. degree from the University of Stuttgart, Stuttgart, Germany, in 1981. He is a Professor at the Research Institute for Nanodevice and Bio Systems and the Graduate School for Advanced Sciences of Matter, Hiroshima University, Higashi, Hiroshima, Japan. From 1982 to 1996, he was with Siemens AG in Munich, Germany, where he was involved in the development of CMOS technology, memory and telecommunication circuits, power semiconductor devices, chip-card ICs and compact models. Since 1996, he has been a Professor with Hiroshima University, researching in the fields of VLSI design, nano-electronics and compact modeling. Dr. Mattausch is a senior member of IEEE and a member of IEICE.
Wataru Imafuku was born in Osaka, Japan, in 1985. He received the B.S. and M.S. degrees in electrical engineering from Hiroshima University, Japan, in 2008 and 2010, respectively. In his research, he was engaged in the design of associative memories based on distance mapping into the voltage and time domain. In 2011, he joined Rohm Corporation, Kyoto, Japan, where he is currently engaged in LSI design for consumer applications.
Akio Kawabata was born in Okayama, Japan, in 1987. He received the B.S. and M.S. degrees in electrical engineering from Hiroshima University, Japan, in 2009 and 2011, respectively. In his research, he evaluated performance, reliability and applications of associative memories. In 2011, he joined the Research & Development Center, Mitsubishi Motors Corporation, Okazaki, Aichi, Japan. He is currently engaged in the design of car audio systems.
Tania Ansari received her B.Sc. and M.Sc. degrees in EEE from Bangladesh University of Engineering and Technology (BUET) in 2004 and 2008, respectively. She worked as a Design Engineer in Power IC Limited (www.picsemi.com), a fabless design company in Bangladesh, from 2004 to 2008, where she was involved mainly in designing LDO power converters. Since October 2008, she has been a Ph.D. student in the Research Institute for Nanodevice and Bio Systems, Hiroshima University. Her current research interest is to mitigate the impact of process variation on the frequency-domain associative memory circuit structure.
Masahiro Yasuda was born in Fukuoka, Japan, in 1987. He received the B.S. degree in electrical engineering from Hiroshima University, Japan, in 2010 and is currently working toward the M.S. degree also at Hiroshima University, Japan. His main research interest is in the VLSI realization of the associative memory based on frequency mapping.
Tetsushi Koide (M’92) was born in Wakayama, Japan, in 1967. He received the B.E. degree in physical electronics and the M.E. and Ph.D. degrees in systems engineering from Hiroshima University, Hiroshima, Japan, in 1990, 1992, and 1998, respectively. He was a Research Associate during 1992–1999 and an Associate Professor in 1999 in the Faculty of Engineering at Hiroshima University. From 1999, he was with the VLSI Design and Education Center (VDEC), University of Tokyo, as an Associate Professor. Since 2001, he has been an Associate Professor in the Research Center for Nanodevices and Systems, Hiroshima University. His research interests include system design and architecture issues for memory-based systems, realtime image processing, VLSI CAD/DA, genetic algorithms, and combinatorial optimization. Dr. Koide is a member of the Institute of Electrical and Electronics Engineers, the Association for Computing Machinery, the Institute of Electronics, Information and Communication Engineers of Japan, and the Information Processing Society of Japan.