Fig. 3 Phase modulation vector with partial elliptical component due to amplifier saturation, decomposition into circularly polarised components, and resultant asymmetric two-sided beat spectrum with f1 ≠ f2
a Phase modulation vector
b Decomposition (not to scale)
c Resultant beat spectrum

The frequency span in Fig. 4 is 1 MHz, and the beat frequency separation is approximately 23 kHz for both sidebands.

Fig. 4 Experimental response of driven unlocked SAW oscillator with f1 = 914.360 MHz, f2 = 914.534 MHz

Conclusion: We have shown that amplifier saturation in a driven unlocked oscillator causes the inherent one-sided beat frequency spectrum to be converted into a two-sided beat frequency spectrum of asymmetric amplitude but equal beat frequency separation. This can be of importance in injection-locking oscillator design, as it can affect the approach to the threshold-lock condition with large injection signals.

Acknowledgment: This work was supported in part by grants from the Natural Sciences and Engineering Research Council of Canada. Appreciation is also extended to RF Monolithics, Inc. of Dallas, Texas, for supplying the SAW resonator for study.

10th April 1992

C. K. Campbell and P. M. Smith (Department of Electrical and Computer Engineering, McMaster University, 1280 Main Street West, Hamilton, Ontario L8S 4L7, Canada)
P. J. Edmonson (Department of Electrotechnology, Mohawk College, Fennell Campus, Hamilton, Ontario L8N 3T2, Canada)

CONSTRUCTING ASSOCIATIVE MEMORIES USING HIGH-ORDER NEURAL NETWORKS

Y.-H. Tseng and J.-L. Wu

Indexing terms: Neural networks, Memories


A class of neural networks for constructing associative memories that learn the memory patterns as well as their neighbouring patterns is presented. The network is basically a layer of perceptrons with high-order polynomials as their discriminant functions. A learning algorithm is proposed for the network to learn arbitrary bipolar patterns. The simulation results show that associative memories implemented in this way achieve a set of desirable characteristics, namely high storage capacity, convergence to the nearest memory, and the existence of a 'no decision' state which attracts indistinguishable inputs. Furthermore, it is also possible to shape the attraction basin of a memory pattern under any metric definition of distance.

Introduction: Associative memories using the sum-of-outer-products encoding rule [1] have the disadvantages of inducing spurious memories and low memory capacity. These disadvantages are caused by crosstalk between correlated memory patterns. To avoid crosstalk, Hamming-type associative memories [2, 3] have emerged and have been built in hardware for image coding applications [4]. Basically, this type of associative memory stores each memory pattern in the weights of an individual node. The network complexity therefore grows linearly with the number of stored patterns, which may not be cost-effective when the number of memory patterns is large. Other constant-size associative memories using various encoding techniques, such as the spectral scheme [5], the pseudo-inverse, and the eigenstructure method [6], have been proposed, but all have the limitation that the reliable memory capacity does not exceed n, the dimension of the memory patterns. Furthermore, their error-correction capability is not quantitatively clear, short of exhaustively visiting each state to see what the attraction region of each pattern might be. An associative memory using the delta learning rule has been indicated in Reference 7, extensive simulations have been reported in Reference 8, and the idea of learning all possible neighbouring patterns (those that are learnable by the delta rule) can be found in Reference 9. The delta encoding rule can significantly improve the memory capacity to beyond the dimension of the memory patterns when only the autoassociative memory patterns are learned. In fact, the memory patterns in an autoassociative memory are linearly separable [9], and thus the maximum memory capacity is 2^n for bipolar patterns.


If any of the neighbouring patterns of the desired memories are also required to be learned, linear separability is no longer guaranteed and the delta rule may fail. To overcome this limitation, perceptrons with polynomial discriminant functions are used here as associative memories, both to achieve high memory capacity and to learn arbitrary memory patterns as well as their neighbouring patterns.

Perceptrons with polynomial discriminant functions: Consider an associative memory consisting of m independent perceptrons with polynomials as their discriminant functions. The output function of a perceptron is defined as [10]

z = sgn [g(X)]     (1)

In the above equation, X = [x1, x2, ..., xn] ∈ {1, -1}^n is the input pattern, sgn is the sign function: sgn(a) = 1 if a ≥ 0 and -1 if a < 0, and

g(X) = w1 x1 + w2 x2 + ... + wn xn + t     (2)

if g is a linear discriminant function, in which case the perceptron is called a linear perceptron, or

g(X) = w1 f1(X) + w2 f2(X) + ... + wN fN(X) + w0     (3)

if g is an rth-order polynomial function, where the wi are called weights and each product term fi(X) is of the form x_k1^n1 x_k2^n2 ... x_kr^nr, with k1, k2, ..., kr ∈ {1, ..., n} and n1, n2, ..., nr ∈ {1, 0}. It is noted that the value of a product x1 x2 ... xn of n bipolar variables is -1 if there is an odd number of -1s among the variables and +1 otherwise. This function is isomorphic to the parity function of n binary elements, for which the output is 1 if there is an odd number of 1s in the n binary elements and 0 otherwise. Because there is a two-layer linear perceptron for the parity problem [11], high-order perceptrons can be considered as three-layer linear perceptrons by replacing each product term with a two-layer perceptron. Although this suggests a possible implementation of high-order perceptrons with an existing neural network model, we do not regard it as a multilayer perceptron as we develop its learning algorithm in the following.

The accompanying problem of using high-order polynomial functions in the perceptrons is to decide what order to use and which product terms to choose. There are some guidelines for determining these parameters. One is that a discriminant polynomial of order n is able to dichotomise any complex bipolar pattern set. This is elaborated in the following. Define a discriminant polynomial P(X) to be exact if P(X) = sgn (P(X)). A discriminant polynomial is said to be inexact if it is not exact. An example of an exact discriminant performing the AND operation over n bipolar elements can be defined as a polynomial of order n:

AND(X) = 1 - 2^(1-n) (1 - x1)(1 - x2) ... (1 - xn)     (4)

where AND(X) = -1 if all xi = -1 and AND(X) = 1 if any one of the xi = 1, i = 1, ..., n. (Here -1 is considered as true and 1 as false.) We can alternatively define an inexact linear discriminant function as

AND(X) = sgn [P(X)] = sgn [x1 + x2 + ... + xn + (n - 1)]     (5)

Similarly, an exact (or inexact) discriminant function can be defined for an OR operation. Because any bipolar function can be expressed in an AND-OR (or OR-AND) two-level expression, and an exact discriminant is itself a bipolar variable, it turns out that any bipolar function always has an exact/inexact discriminant polynomial of order at most n to compute it. However, as n grows large, generating all 2^n product terms of an nth-order polynomial becomes impractical. Another guideline comes from the result in Reference 10, where it is shown that if the number of product terms is more than one half of the number of training patterns K, i.e. N ≥ K/2, then all the training patterns can be learned with probability 1 as N goes to infinity. Although these guidelines do not reveal any apparent procedure for determining the product terms in the polynomial function, they lead us to the following simple but effective learning process:

(i) Step 1: generate N product terms for the g function at random.

(ii) Step 2: learn the training patterns with the error-correction procedure; if it does not converge within an iteration bound, go to step 1.

The learning procedure in step 2 is

W' = W + Y   if g(X) ≤ 0 and X maps to 1     (6a)
W' = W - Y   if g(X) ≥ 0 and X maps to -1     (6b)

where W = [w1, w2, ..., wN, w0] ∈ R^(N+1) is the weight vector, W' is the new value of W, and Y = [f1(X), f2(X), ..., fN(X), 1] ∈ {1, -1}^(N+1).
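As a concrete illustration of eqns. 1, 3, 6a and 6b, the following sketch implements a single high-order perceptron with randomly chosen product terms and the error-correction update. It is a minimal reconstruction of our own, not code from the original letter: the function names, the list-based weight representation, and the coin-flip choice of product terms in step 1 are assumptions.

```python
import random

def make_product_terms(n, N, rng=random):
    """Step 1: draw N product terms at random.  Each term is the set of input
    indices whose bipolar components are multiplied together (f_i(X) in eqn. 3)."""
    return [tuple(i for i in range(n) if rng.random() < 0.5) for _ in range(N)]

def features(x, terms):
    """The vector Y of eqns. 6a/6b: the product-term values f_1(X)..f_N(X)
    followed by a constant 1 that pairs with the bias weight w_0."""
    y = []
    for term in terms:
        v = 1
        for i in term:
            v *= x[i]               # product of bipolar (+1/-1) components
        y.append(v)
    y.append(1)
    return y

def g(w, x, terms):
    """Eqn. 3: g(X) = w_1 f_1(X) + ... + w_N f_N(X) + w_0."""
    return sum(wi * fi for wi, fi in zip(w, features(x, terms)))

def output(w, x, terms):
    """Eqn. 1: z = sgn(g(X)), with sgn(a) = 1 for a >= 0 and -1 otherwise."""
    return 1 if g(w, x, terms) >= 0 else -1

def error_correction_step(w, x, target, terms):
    """Eqns. 6a/6b: add Y to W when g(X) <= 0 but X should map to 1, and
    subtract Y when g(X) >= 0 but X should map to -1; otherwise W is unchanged."""
    y = features(x, terms)
    gx = sum(wi * fi for wi, fi in zip(w, y))
    if target == 1 and gx <= 0:
        return [wi + fi for wi, fi in zip(w, y)]
    if target == -1 and gx >= 0:
        return [wi - fi for wi, fi in zip(w, y)]
    return w

# Typical use: terms = make_product_terms(n=7, N=64)
#              w = [0] * (len(terms) + 1)   # N weights plus w_0, initially zero
```

Step 2 then sweeps the training pairs repeatedly with error_correction_step; if some weight still changes when the iteration bound is reached, a fresh set of product terms is drawn and the sweep restarts, as described above.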

This error-correction rule is equivalent to the delta rule W' = W + c(d - y)Y at c = 1/2, where d is the desired output, y is the network output, and c is the step size. However, we retain the name error-correction procedure rather than the better-known name delta rule, first because it is in this form that the convergence of the learning procedure is proved [10], and secondly because the delta rule involves a predetermined step size whereas the above learning rule does not. The convergence of the error-correction procedure is guaranteed only when the training patterns are 'linearly separable'. Therefore a maximum number of learning steps has to be specified to prevent infinite iteration. We set a condition in the examples below that if the procedure does not converge in L iterations (L = 10 in example 1 and L = 50 in examples 2 and 3), the training patterns are assumed not to be learnable under these N product terms. In that case, another set of product terms is generated and the learning process runs again.

Simulation results: (i) Example 1: In our simulation, the above learning process proves to be effective. We test the learning algorithm on decoding a (7, 4) Hamming code, where there are 16 memory patterns (the legal codewords) and seven nearest neighbours at Hamming distance one from each memory, making up a total of 128 training patterns. A codeword in the (7, 4) Hamming code has four information bits x1, x2, x3, x4 and three parity bits x5, x6, x7. The parity bits are related to the information bits as x5 = x1x2x4, x6 = x1x3x4, x7 = x2x3x4. (Note that the original parity operation in the binary domain is replaced by a multiplication in the bipolar domain because they are isomorphic.) To associate a received 7 bit Hamming codeword with its correct four information bits, four independent high-order perceptrons are used in the associative memory. We begin our simulation with initial zero weights and 64 product terms in each polynomial. The number 64 is chosen to be one half of the total 128 patterns, as indicated in the preceding Section. The network learns quickly, in about 10 s or less on a PC/486. When the number of product terms N is decreased to as low as 14, it takes a couple of minutes to converge. One of the solutions in our simulation is given below:

g1(X) = 14x1 - 2x3 + 8x2x4x5 + ...
g2(X) = 12x2 + ...
g3(X) = 12x3 + ...
g4(X) = 16x4 + 8x5x6x7 + ...

Not all 14 terms are listed; we omit those terms whose weights are zero. A legal codeword and its seven error patterns can be represented by (x1, x2, ..., x7), (-x1, x2, ..., x7), (x1, -x2, ..., x7), ..., and (x1, x2, ..., -x7), respectively. The solution can be verified by taking any of these codewords as input to the network and observing the result. Note that the connection complexity of this network is very low compared with the number of memory patterns it stores and the error-correction capability it achieves. This is an example in which a neural associative memory has a storage capacity larger than its network size.
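For concreteness, the training set of example 1 can be generated directly from the bipolar parity relations x5 = x1x2x4, x6 = x1x3x4 and x7 = x2x3x4. The sketch below is our own illustration (the helper names are not from the letter): it enumerates the 16 legal codewords and their 112 single-error neighbours, 128 training pairs in total, each labelled with the information bits of the originating codeword.

```python
from itertools import product

def hamming_codewords():
    """The 16 legal (7, 4) codewords in the bipolar domain: the parity bits are
    products of the information bits, x5 = x1*x2*x4, x6 = x1*x3*x4, x7 = x2*x3*x4."""
    words = []
    for x1, x2, x3, x4 in product((1, -1), repeat=4):
        words.append((x1, x2, x3, x4, x1 * x2 * x4, x1 * x3 * x4, x2 * x3 * x4))
    return words

def training_set():
    """Each codeword plus its seven Hamming-distance-1 neighbours, all labelled
    with the codeword's information bits: 16 * (1 + 7) = 128 training pairs."""
    pairs = []
    for w in hamming_codewords():
        target = w[:4]                        # desired outputs of the four perceptrons
        pairs.append((w, target))             # the legal codeword itself
        for k in range(7):                    # flip one bit at a time
            corrupted = w[:k] + (-w[k],) + w[k + 1:]
            pairs.append((corrupted, target))
    return pairs

assert len(training_set()) == 128
```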

(ii) Example 2: In this example, six patterns, the same as those in Reference 6, are used as memory patterns:

X1 = [-1, 1, -1, 1, 1, 1, -1, 1, 1, 1]
X2 = [1, 1, -1, -1, 1, -1, 1, -1, 1, 1]
X3 = [-1, 1, 1, 1, -1, -1, 1, -1, 1, -1]
X4 = [1, 1, -1, 1, -1, 1, -1, 1, 1, 1]
X5 = [1, -1, -1, -1, 1, 1, 1, -1, -1, -1]
X6 = [-1, -1, -1, 1, 1, -1, 1, 1, -1, 1]

A network of 10 high-order perceptrons is used to store these six memories as well as their neighbouring patterns (in Hamming distance). Among the 2^10 = 1024 possible inputs, there are 351 patterns that are equidistant from at least two memories. These patterns are classified as indistinguishable inputs and are indicated by an eleventh perceptron, which outputs -1 if any of these inputs is detected and 1 otherwise. There are 512 product terms in each perceptron at the beginning. The algorithm learns the memory patterns and their neighbouring patterns in a short period, whereas it takes about 4 h to learn to classify the indistinguishable inputs. The number of product terms in the first 10 perceptrons can be further reduced to 256 without affecting the training time very much, but a reduction in the number of terms in the 11th perceptron leads to an indefinite convergence time. This shows that it is easy to learn patterns clustered in neighbouring regions and difficult to learn patterns scattered over the whole pattern space.

(iii) Example 3: In this example, the same memory patterns as in the above example are stored, but a distance measure different from the common Hamming distance is used. Let X be a vector of n bipolar elements and X' be the reverse vector of X. For example, if X = [1, 1, -1], then X' = [-1, 1, 1]. Define an undirected Hamming (UH) distance between two vectors X and Y as

UH(X, Y) = MIN [H(X, Y), H(X', Y)]

where MIN chooses the smaller of its two arguments and H(X, Y) is the Hamming distance between X and Y. (As an application example, the bar code used on consumer products might require this distance measure to allow it to be scanned in either direction.) In this simulation, there are now 410 equidistant inputs and a total of 608 neighbouring patterns for the six memories. Again, 512 product terms appear in each perceptron, and the convergence speed for the first 10 perceptrons is high, as can be expected. But the convergence time for the 11th perceptron is indefinite; it has difficulty converging because it has to classify the indistinguishable patterns among all possible inputs.
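The partition of the 1024 inputs into nearest-memory and 'no decision' classes used in examples 2 and 3 can be reproduced with a short routine. The sketch below is our own illustration (the helper names are assumptions, not from the letter): it implements the ordinary Hamming distance, the undirected Hamming distance defined above, and the rule that an input is indistinguishable when two or more memories tie at the minimum distance.

```python
from itertools import product

def hamming(x, y):
    """Number of positions in which two bipolar vectors differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

def undirected_hamming(x, y):
    """UH(X, Y) = min(H(X, Y), H(X', Y)), where X' is X reversed, so a pattern
    matches a memory read in either direction."""
    return min(hamming(x, y), hamming(x[::-1], y))

def classify(memories, dist, n=10):
    """Assign every bipolar input of length n to its unique nearest memory, or to
    the 'no decision' class when two or more memories tie at the minimum distance."""
    nearest, no_decision = {}, []
    for x in product((1, -1), repeat=n):
        d = [dist(x, m) for m in memories]
        dmin = min(d)
        if d.count(dmin) > 1:
            no_decision.append(x)        # indistinguishable: the 11th perceptron outputs -1
        else:
            nearest[x] = d.index(dmin)   # index of the unique nearest memory
    return nearest, no_decision

# classify(memories, hamming) corresponds to example 2;
# classify(memories, undirected_hamming) to example 3.
```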

Conclusion: We have shown that associative memories using high-order perceptrons achieve the set of desirable characteristics described in Reference 12: tolerance of noisy versions of the stored memories at the input; high storage capacity; convergence to a stable state in a small number of iterations (in this feedforward associative memory, only one forward pass is needed); and the existence of a passive or 'no decision' state which attracts indistinguishable inputs. Moreover, it is possible to shape the basin of attraction of a memory pattern arbitrarily, under any metric definition of distance, by selecting a pattern to be a 'neighbour' of that memory pattern and putting it on the training list, so that once the learning process converges the neighbouring pattern is associated with its corresponding memory pattern. All these characteristics are possible if the number of training patterns is within tractable bounds. Thus the method can be considered as an alternative approach to constructing associative memories. As pointed out in the second Section, high-order perceptrons can be reduced to multilayer perceptrons, so the above results can all be realised by existing implementation techniques that are applicable to multilayer perceptrons.

13th April 1992

Y.-H. Tseng and J.-L. Wu (Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, Republic of China)

References

1 HOPFIELD, J. J.: 'Neurons with graded response have collective computational properties like those of two-state neurons', Proc. Nat. Acad. Sci. USA, 1984, 81, pp. 3088-3092
2 LIPPMANN, R. P.: 'An introduction to computing with neural nets', IEEE ASSP Mag., April 1987, pp. 4-22
3 WU, JA-LING, and TSENG, WEN-HSIEN: 'An associative mapping network without redundant local minima', Int. J. Electron., 1991, 71, (6), pp. 899-916
4 CHIUEH, T. D., and GOODMAN, R. M.: 'Recurrent correlation associative memories', IEEE Trans., March 1991, NN-2
5 VENKATESH, S. S., PANCHA, G., PSALTIS, D., and SIRAT, G.: 'Shaping attraction basins in neural networks', Neural Netw., 1990, 3, (6), pp. 613-624
6 MICHEL, A. N., SI, J., and YEN, G.: 'Analysis and synthesis of a class of discrete time neural networks described on hypercubes', IEEE Trans., 1991, NN-2, (1), pp. 32-47
7 LEVIN, E.: 'A recurrent neural network: limitations and training', Neural Netw., 1990, 3, (6), pp. 641-650
8 PRASAD, L., and KAK, S. C.: 'Neural network capacity using delta rule', Electron. Lett., 1989, 25, (3), pp. 197-199
9 MEKKAOUI, A., and JESPERS, P.: 'A perceptron based autoassociative memory'. IJCNN, Washington, 1990, pp. 1672-1675
10 NILSSON, N. J.: 'Learning machines: foundations of trainable pattern-classifying systems' (McGraw-Hill, 1965)
11 RUMELHART, D. E., HINTON, G. E., and WILLIAMS, R. J.: 'Learning internal representations by error propagation', in RUMELHART, D. E., and the PDP Research Group: 'Parallel distributed processing' (MIT Press, Cambridge, MA, 1986)
12 HASSOUN, M. H., and WATTA, P. E.: 'Exact associative neural memory dynamics utilizing Boolean matrices', IEEE Trans., 1991, NN-2, (4), pp. 437-448

BiCMOS DYNAMIC FULL ADDER CIRCUIT FOR HIGH-SPEED PARALLEL MULTIPLIERS

H. P. Chen, H. J. Liao and J. B. Kuo

Indexing terms: Large-scale integration, Adders, Multipliers

A BiCMOS dynamic full adder circuit for VLSI implementation of high-speed parallel multipliers using a Wallace tree reduction architecture is presented. With the BiCMOS dynamic full adder circuit, an 8 x 8 multiplier designed in a 2 μm BiCMOS technology shows a six-times improvement in speed compared with the CMOS static circuit. The speed advantage of using BiCMOS dynamic full adder circuits is even greater in 16 x 16 and 32 x 32 multipliers as a result of the large driving capability of BiCMOS for realising the complex Wallace tree reduction architecture.

Introduction: High-speed multipliers are usually realised by parallel architectures [1], where the Wallace tree reduction structure [1] and carry look-ahead circuits have been used to increase the speed. In a high-speed parallel multiplier using the Wallace tree reduction structure, the most important building cell is the full adder circuit. Although the CMOS dynamic technique [2] can provide a speed advantage over the static technique for implementing serial adders, it is not suitable for realising the full adder circuit for parallel multipliers using the Wallace tree reduction structure due to race prob-

