© 1995 Society of Photo-Optical Instrumentation Engineers.

This paper was published in P. Papamichalis and R. Kerwin, editors, Digital Signal Processing Technology, volume CR57 of Critical Reviews Series, pages 314-45. SPIE Optical Engineering, Orlando, Fla., 1995, and is made available as an electronic reprint with permission of SPIE. Single print or electronic copies for personal use only are allowed. Systematic or multiple reproduction, or distribution to multiple locations through an electronic listserver or other electronic means, or duplication of any material in this paper for a fee or for commercial purposes is prohibited. By choosing to view or print this document, you agree to all the provisions of the copyright law protecting it.

Digital Systems for Neural Networks

Paolo Ienne
EPFL Microcomputing Laboratory
IN-F Ecublens
CH-1015 Lausanne
[email protected]

Gary Kuhn
SCR Learning Systems
755 College Road
Princeton, NJ 08540
[email protected]

ABSTRACT

Neural networks are non-linear static or dynamical systems that learn to solve problems from examples. Those learning algorithms that require a lot of computing power could benefit from fast dedicated hardware. This paper presents an overview of digital systems to implement neural networks. We consider three options for implementing neural networks in digital systems: serial computers, parallel systems with standard digital components, and parallel systems with special-purpose digital devices. We describe many examples under each option, with an emphasis on commercially available systems. We discuss the trend toward more general architectures, we mention a few hybrid and analog systems that can complement digital systems, and we try to answer questions that came to our minds as prospective users of these systems. We conclude that support software and, in general, system integration is beginning to reach the level of versatility that many researchers will require. The next step appears to be integrating all of these technologies together, in a new generation of big, fast and user-friendly neurocomputers.

Keywords: Digital Systems, Neural Networks, Serial Systems, Parallel Systems.

1 INTRODUCTION

Artificial neural networks (ANNs) are used by different people for different purposes. Some people want to make increasingly biological systems. Other people just want computational tools with some of the benefits of biological systems, like the ability to model non-linearities in the system inputs, and the ability to do massively parallel processing with simple computing elements. This article emphasizes digital systems that implement ANNs as a computational tool.

The field of ANNs as a computational tool has matured a lot in the last decade. Many useful results have been achieved: in theory, in the laboratory, and in real-world applications. While further results can be expected, ANNs are now ready for the computational toolbox. A great variety of digital implementations of ANNs are also ready, as our description of commercially available systems will show.

To motivate the reader, we now mention 3 applications of ANNs, from among hundreds which have been reported. The first application saves lives, the second saves natural resources, and the third saves money.

In the application which saves lives, a heartbeat arrhythmia detector based on expert knowledge provided by a major medical manufacturer was re-implemented as an ANN. The ANN performed identically to the original detector, until it was optimized, analyzed, and simplified by eliminating 40% of the expert connections. At that point the ANN no longer performed identically to the original detector. Instead, it "reduced the heartbeat classification errors by a factor of 2" [68].

In the application which saves natural resources, an ANN was applied to control the cooking process used in pulp production. At the end of the process, one can tell how successful the cooking was, using indicators like the permanganate concentration. An ANN was trained to predict the permanganate concentration from other measurements available during the cooking. After training, the ANN was "significantly more accurate (30%) than the previously used analytical model". You can now buy this ANN as part of a pulp process control system [48].

In the application which saves money, ANNs were compared with multiple discriminant analysis (MDA) in the prediction of credit risk. Apparently, MDA is widely used by financial institutions, although it is based on only a linear equation. A banker and a professor in financial management chose 15 indicators of creditworthiness as inputs for the ANN and for the MDA. No one of these indicators is very strong, and the combination of indicators had a coefficient of determination (R^2) of only 44.2%. Tests showed that, despite the weak indicators, the MDA consistently classified 80% of credit applicants correctly. Using the same indicators, the ANN consistently classified 88% of the applicants correctly [65].

These 3 applications have certain characteristics in common. One characteristic is that they use the most popular type of ANN, a three-layer network trained by back-propagation. Another characteristic is that they use only tens of inputs, not hundreds or thousands, as one might want in imaging applications. Still, training can be extremely slow even for small nets. In a time series prediction competition [18] open to all techniques, ANNs performed extremely well, but the run times ranged from 3 hours on a CRAY Y-MP to 3 weeks on a SPARC 1. Clearly, ANN users require powerful hardware.

In §2, we describe the mathematics of the most popular types of ANNs. In §3 and §4, we describe digital systems which implement these ANNs and others. The digital systems in §4 are either commercially available or show interesting uses of the characteristic features of ANNs. All of them, to one degree or another, permit the massive computations necessary with larger networks. In §5, we outline trends and mention a few hybrid and analog systems that can complement digital systems. In §6 we conclude with our view on the outlook for successful system integrations.

2 THE MATHEMATICS OF ANNs

We describe the mathematics of three types of ANNs which have been implemented by many digital systems. The first two types are the multi-layer sigmoid and radial basis function nets, and the third type is Kohonen's single-layer self-organizing feature map.

2.1 Sigmoid nets

Sigmoid nets are either static or dynamic. We put both types of net into a single computing framework, so that we can describe their training with just one set of equations.

2.1.1 Static and dynamic nets

Sigmoid nets, often called MLPs (for multi-layer perceptrons), are networks with at least some processing units or neurons that compute their output y as a continuous, monotonically increasing, non-linear sigmoid function $\sigma$ of their input x: $y = \sigma(x) = (1 + e^{-x})^{-1}$. The sigmoid function maps numbers from domain $(-\infty, +\infty)$ to range $(0, 1)$. A symmetric version, in which the output is doubled and then shifted down by 1, maps numbers to $(-1, 1)$. The argument of the sigmoid, x, is sometimes called the unit potential.

Sigmoid nets are not popular just because $\sigma(x)$ looks like the saturating response of a biological neuron. There is also an important theoretical result: a net whose output is a linear combination of n sigmoids of linear combinations of its inputs, can uniformly approximate any continuous function f with support in the unit hypercube of $\mathbb{R}$ [12]. However, this theoretical result makes no claim about the number of combinations of the inputs which will be needed for a good sigmoid approximation; that number could approach infinity. Therefore the result makes no claim about what class of problems is practically solvable. The result also makes no claim about what procedure to follow for finding good solutions. So the practitioner has to look elsewhere for hints on how to solve these important remaining problems. Since the practitioner's approach is usually partly empirical, it is a big help to have a fast digital system when exploring sigmoid nets.

By now, many people have applied a static, first-order sigmoid net to learn the mapping between vector pairs (x, f(x)) over the domain of f. Static means the net has no memory of previous inputs, and first-order means there is only one weight on each connection to a neuron. The static system equation is typically something like

$$y_k = f_k(\mathbf{x}) = \sigma\!\left( \theta^{II}_k + \sum_{j=1}^{N_j} w^{II}_{kj} \, \sigma\!\left( \theta^{I}_j + \sum_{i=1}^{N_i} w^{I}_{ji} \, y^0_i \right) \right)$$

where $i = 1, \ldots, N_i$, $j = 1, \ldots, N_j$, and $k = 1, \ldots, N_k$ are indexes for units in an input, hidden or output layer, respectively, the $w^{I}_{ji}$ are weights on the connections to hidden units from input units, the $w^{II}_{kj}$ are weights on the connections to output units from hidden units, and the $\theta^{I}_j$ and $\theta^{II}_k$ are weights on a single connection to each hidden or output unit, from a bias unit that always outputs a 1. The biologically inspired name for all of these connections between units is synapses. Note that a second non-linearity is applied even though Cybenko's result does not require one.

The reader will wonder: where is input vector x in this equation? We have written system input $x_i$ as output $y^0_i$ of input unit i. This notation reminds us that in real-world problems the inputs to our net are themselves the outputs of some kind of feature extractor. This notation helps us to think of jointly training the feature extractors and the net, in order to optimize the features for the specific mapping at hand [34]. Of course, joint optimization means there is even more work for the implementing system.
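As a present-day illustration of this equation (not part of the original systems described later in the paper), the static forward pass is just two matrix-vector products with an elementwise sigmoid. The NumPy sketch below uses arbitrary layer sizes and random weights purely as an example:

    import numpy as np

    def sigmoid(x):
        # y = (1 + exp(-x))^-1, the logistic squashing function
        return 1.0 / (1.0 + np.exp(-x))

    def static_forward(y0, W1, theta1, W2, theta2):
        """Static two-layer sigmoid net:
        y_k = sigma(theta2_k + sum_j W2_kj * sigma(theta1_j + sum_i W1_ji * y0_i))."""
        hidden = sigmoid(theta1 + W1 @ y0)    # hidden-unit outputs
        return sigmoid(theta2 + W2 @ hidden)  # output-unit outputs

    # Example with N_i = 4 inputs, N_j = 3 hidden units, N_k = 2 outputs (arbitrary sizes)
    rng = np.random.default_rng(0)
    W1, theta1 = rng.normal(size=(3, 4)), rng.normal(size=3)
    W2, theta2 = rng.normal(size=(2, 3)), rng.normal(size=2)
    print(static_forward(rng.normal(size=4), W1, theta1, W2, theta2))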

We can turn the static system equation into a dynamic one by adding past context. With past context, the system may respond differently to the same current input, depending on the values of earlier inputs. Past context can be represented with a tapped delay line, like an FIR filter, or with feedback connections, which make the net recurrent, recursive, or cyclic, like an IIR filter. A dynamic net which illustrates both kinds of past context is this non-linear ARMA net which was used in speech recognition [36]:

$$y_k(t) = \sigma\!\left( w^{rec}_{kk1}\, y_k(t-1) + \theta^{II}_k + \sum_{j=1}^{N_j} \sum_{l=1,3,5} w^{II}_{kjl} \, \sigma\!\left( \theta^{I}_j + \sum_{i=1}^{N_i} \sum_{d=1,\ldots,5} w^{I}_{jid} \, y^0_i(t-l-d) \right) \right).$$

Note that this system is represented in discrete time: all unit outputs are indexed by time, and all weights, except those on the bias connections, have a third index, for the time-delay. In image processing, time could be interpreted as scan number, as row after row of pixels arrive from a hardware scanner. Note also, that the recursive part of this system returns the previous system output, $y_k(t-1)$, inside the second non-linearity. Training can produce recursive weights greater than 1, without a problem for system stability. If we omit the non-linearities and revert to complete connections in a single layer, the network would be a Hopfield net [25].

2.1.2 A single framework

To be able to implement both a static net and a dynamic net on a single multi-model digital system, it would be helpful to represent both types of equation in a single computing framework. Such a framework is available, if we adopt the convention that there is a time delay on the connections between adjacent layers of the static net. Time delays make sense for the connections in the static net, if we model the neurons as computing in synchrony. Synchronous processing requires a system memory to hold all unit outputs that will be needed at subsequent time steps. The memory is updated in preparation for the next synchronous processing, by pushing all current memory contents back one row in time, i.e., by incrementing a memory ring-buffer pointer.

Synchronous computation can be implemented if a digital system has more than one processor. ANNs contain a lot of intrinsic parallelism. A simple goal would be to compute several neurons in parallel. In this case one can talk of neuron-oriented parallelism. This can extend to a few neurons, a whole layer or a full network. A finer grain of parallelism would compute in parallel the weighted connections, thus achieving synapse-oriented parallelism. Examples of these possibilities can be found among the digital systems described in §4.
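The ring-buffer convention described above can be sketched in a few lines; the class below is only an illustration of the idea, and the names and the fixed history depth are our own choices, not those of any particular system:

    import numpy as np

    class OutputHistory:
        """Holds unit outputs for the last `depth` time steps; 'pushing back in time'
        is done by advancing a ring-buffer pointer rather than by copying rows."""
        def __init__(self, n_units, depth):
            self.buf = np.zeros((depth, n_units))
            self.head = 0            # row holding the most recent outputs
            self.depth = depth

        def step(self, new_outputs):
            # advance the pointer and overwrite the oldest row with the newest outputs
            self.head = (self.head - 1) % self.depth
            self.buf[self.head] = new_outputs

        def delayed(self, d):
            # outputs from d time steps ago (d = 0 is the current step)
            return self.buf[(self.head + d) % self.depth]

    hist = OutputHistory(n_units=5, depth=6)
    for t in range(10):
        hist.step(np.full(5, float(t)))
    print(hist.delayed(0), hist.delayed(3))   # outputs at t = 9 and t = 6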

2.1.3 Weight-change equation

Now that we have a uniform framework for representing both static and dynamic sigmoid nets, we can write one set of training equations. One might think that the purpose of training is to make the outputs of the network approximate ever more closely the function values in our training pairs {(x, f(x))}, i.e., to reduce the training error. This is false. The purpose of training is to make the outputs approximate ever more closely the function values in pairs (x, f(x)) that the network is not being trained on, i.e., to reduce the generalization error.

Because the purpose of training is to reduce generalization error, the simplest, and most completely parallelizable training algorithms are not good enough. For example, it is not good enough to keep moving the network weights in proportion to the negative gradient of the network error with respect to the weights, stopping when the gradient is zero, i.e. when the error is at a minimum. Instead, one must take extra steps to avoid spurious minima on the error surface, and to avoid over-fitting to the training examples while generalization error goes up. Nevertheless, computing the negative gradient of the network error with respect to the weights is a good place to start. Several different measures of approximation error have been used. We illustrate training with error measure

$$E = \frac{1}{2} \sum_{k,t} \left( f_k(t) - y_k(t) \right)^2 .$$

Error E is a sum over time and over all outputs of time-dependent errors, and the error gradient with respect to the weights is similarly a sum over time of time-dependent derivatives. The best-known algorithm for computing the error gradient is called the error back-propagation algorithm [58], or back-prop (BP), because it computes the error gradient from the last time-frame to the first, after a forward pass has computed and stored all necessary unit outputs. For our framework of sigmoid nets with time-delays and feedback, the error back-propagation algorithm would compute the time-dependent contribution to a gradient-based weight change from

$$\Delta w_{abd}(t) = \frac{-\partial E(t,\ldots,T)}{\partial w_{abd}}
= \frac{-\partial E(t,\ldots,T)}{\partial x_a(t)} \, \frac{\partial x_a(t)}{\partial w_{abd}}
= y_b(t-d)\,\sigma'(x_a(t)) \left[ \frac{-\partial E(t)}{\partial y_a(t)} + \sum_{n,l} w_{nal}(t)\, \frac{-\partial E(t+1,\ldots,T)}{\partial x_n(t+l)} \right],$$

where $-\partial E(t)/\partial y_a(t) = f_a(t) - y_a(t)$ if unit a is an output unit and 0 otherwise, $\sigma'(x_a(t)) = y_a(t)\,(1 - y_a(t))$, $\partial E(t>T)/\partial x_n(t) = 0$, and the indexes n and l range over all possible next units and time lags to which the output of unit a might be connected.

Alternatively, the time-dependent contribution to a gradient-based weight change can be computed on the forward pass [33,57,70] from

$$\Delta w_{abd}(t) = \frac{-\partial E(t)}{\partial w_{abd}}
= \sum_c \frac{\partial x_c(t)}{\partial w_{abd}} \, \frac{-\partial E(t)}{\partial x_c(t)}
= \sum_c \left\{ \sum_{p,l} \frac{\partial x_p(t-l)}{\partial w_{abd}} \, \sigma'(x_p(t-l)) \, w_{cpl} + \delta\!\left(y_p(t-l)\right) \right\} \frac{-\partial E(t)}{\partial x_c(t)},$$

where unit c is an error-producing unit, indexes p and l range over all possible preceding units and time lags to which the input of unit a might be connected, and the Kronecker term $\delta\!\left(y_p(t-l)\right) = y_p(t-l)$ iff $cpl = abd$ (and 0 otherwise).

The back-prop calculation has the disadvantage that one must compute and store an entire forward pass, before calculating the gradient on the backward pass. The forward calculation has the disadvantage that one must compute and store a large array of partials $\partial x_p / \partial w_{abd}$. To our knowledge, no one has tried to implement the forward calculation on parallel hardware, so we now concentrate on the matrix representation of the backward calculation, for hardware implementation [37].

In the back-prop calculation one recursively maintains just a vector of negative derivatives of current and following errors with respect to the current system outputs. At the last time, t = T, one initializes that vector with $e^0(T) = f(T) - \hat{f}(T)$, where $\hat{f}(T)$ is the vector of $y_k(T)$'s from the output units. At times t < T the current and following negative error derivatives must be added together, as was shown between square brackets in the back-prop equation. If we represent the outputs of the feature extractors as vector $y^0$, the input-to-hidden weights $\{w_{jid}\}$ as matrix $W^I$, the outputs of the hidden layer as $y^I$, and the hidden-to-output weights $\{w_{kjl}\}$ as matrix $W^{II}$, then the back-prop matrix equations for the weight changes are

$$\Delta W^{II} = \left[\, W^{II\,T} \left( \sigma' \odot e^0 \right) \right] {y^I}^T$$

and

$$\Delta W^{I} = \left[\, W^{I\,T} \left( \sigma' \odot W^{II\,T} \left( \sigma' \odot e^0 \right) \right) \right] {y^0}^T ,$$

where $e^0$ is the recursively calculated vector $-\partial E(t,\ldots,T)/\partial y^{II}(t)$, i.e. the error derivative with respect to vector $y^{II}$, the system output vector. So, the back-prop weight-change calculation for matrix $W^{II}$ requires us to recursively update the output error derivative vector, multiply by the slope of the output unit transfer function, multiply by $W^{II}$ transposed, and then take an outer product. The weight change calculation for matrix $W^I$ requires the same steps except the outer product for $W^{II}$, then multiply by the slope of the hidden unit transfer function, multiply by $W^I$ transposed, and take an outer product. If we want a digital system to speed up or even parallelize our back-prop gradient calculation, these are the matrix and vector operations that the digital system must be able to compute.
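For readers who prefer code to matrix notation, the sketch below shows the backward-pass operations for the static two-layer net of §2.1.1, written in the conventional textbook form (delta = slope times error, followed by an outer product) and without the recursion over time or the bias weights. It is an illustrative NumPy sketch with invented names, not the implementation used by any of the systems discussed later:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def backprop_static(y0, W1, W2, target):
        """One gradient contribution for a static two-layer sigmoid net, standard form:
        delta_out = sigma'(x2) * (f - y2);     dW2 = outer(delta_out, y1)
        delta_hid = sigma'(x1) * (W2^T delta_out);  dW1 = outer(delta_hid, y0)
        (bias weights omitted for brevity)."""
        y1 = sigmoid(W1 @ y0)                 # hidden outputs
        y2 = sigmoid(W2 @ y1)                 # system outputs
        e0 = target - y2                      # -dE/dy2 for E = 1/2 sum (f - y)^2
        delta_out = y2 * (1.0 - y2) * e0      # multiply by slope of output units
        dW2 = np.outer(delta_out, y1)         # outer product with hidden outputs
        delta_hid = y1 * (1.0 - y1) * (W2.T @ delta_out)  # back-propagate through W2^T
        dW1 = np.outer(delta_hid, y0)         # outer product with inputs
        return dW1, dW2

    rng = np.random.default_rng(8)
    W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(2, 3))
    dW1, dW2 = backprop_static(rng.normal(size=4), W1, W2, target=np.array([0.0, 1.0]))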

2.2 Radial Basis Function nets

A second type of multi-layer ANN, whose mathematics is much easier to describe, is called radial basis function (RBF) nets. The hidden units in these nets have some sort of distance-limited activation function. Depending on the choice of distance, activation function, and learning algorithm, RBF nets are known under different names, like probabilistic neural nets (PNNs) [62], restricted coulomb energy nets (RCEs) [60], Gaussian activation nets [23], or simply, nets with local receptive fields [43]. RBFs are "neural" nets, in the sense that local receptive fields are found in biological systems.

A center vector in input space is assigned to each hidden unit. Center vectors can be placed on a regular grid, or by running a clustering algorithm. Distance is computed between inputs and center vectors, and an activation is computed for each distance. Optimal weightings of the hidden unit activations can be found by supervised training, in order to perform system functions like prediction or classification.

The distance chosen might be Euclidean (ℓ2) but often it is only Manhattan (ℓ1). The activation function might be Gaussian or exponential, but often it is only the sign (binary bump) function. Gaussian activations might be normalized to sum to 1, but the other activations are often left unnormalized. Supervised training for the optimal weightings is often accomplished using supervised LMS. So a typical static system equation for a 3-layer RBF is simply

$$y_k = f_k(\mathbf{x}) = \sum_{j=1}^{N_j} w_{kj} \, a\!\left( d[\mathbf{w}_j, \mathbf{y}] \right)$$

where d[·] is the chosen distance calculation and a(·) is the activation function. If one has enough data to train all of the necessary clusters, RBFs have two advantages over sigmoid nets. First, training is extremely fast. Only a few iterations of a clustering algorithm are necessary, and supervised LMS provides straightforward quadratic minimization. Therefore, data and weights can be represented with fewer bits than for the many iterations of gradient calculations of the sigmoid nets. Second, with zero/one-weighted combinations of simple activations and distances, it may not even be necessary to perform any multiplication. Therefore, RBFs can be very inexpensive to implement in digital systems.
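A minimal sketch of this RBF forward equation, using the Manhattan distance and a Gaussian activation; the center placement, widths and sizes are arbitrary assumptions for illustration only:

    import numpy as np

    def rbf_forward(x, centers, widths, W):
        """y_k = sum_j W_kj * a(d[w_j, x]) with d = l1 (Manhattan) distance
        and a Gaussian activation a(d) = exp(-(d / width)^2)."""
        d = np.abs(centers - x).sum(axis=1)        # l1 distance to every center
        a = np.exp(-(d / widths) ** 2)             # one activation per hidden unit
        return W @ a                               # LMS-trainable output weighting

    rng = np.random.default_rng(1)
    centers = rng.uniform(size=(8, 4))   # 8 hidden units, 4-dimensional input
    widths = np.full(8, 0.5)
    W = rng.normal(size=(3, 8))          # 3 outputs
    print(rbf_forward(rng.uniform(size=4), centers, widths, W))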

2.3 Kohonen's Self-Organizing Feature Maps

Kohonen's self-organizing feature maps (SOFMs) are generally thought of as single-layer ANNs [32]. Input vectors are pictured in a layer below the SOFM, but there is no thought of optimizing the inputs before the SOFM. Instead, the map itself is an unsupervised optimizer of features.

The units of the SOFM are usually arranged in a square array. Each unit in the array is connected to the input vector. These connections permit each unit to measure the distance $\|x - w_{ij}\|$ between the vector at the input x and its own characteristic vector $w_{ij}$. These distances are used in a winner-take-all computation, where only the unit which has $\min_{i,j} \|x - w_{ij}\|$ responds by outputting a 1. During training, the winning unit then causes its own characteristic vector and the characteristic vector of neighboring units to be modified by the rule

$$\Delta w_{ij}(t) = \alpha(t) \left( x(t) - w_{ij}(t) \right), \qquad ij \in N_t ,$$

where $\alpha(t)$ is a learning rate, $N_t$ is the size of the affected spatial neighborhood, and both are decreased over training iterations t. Because modifications take place in the spatial neighborhood of the winning unit, the characteristic vectors of neighboring units will be similar after training. In some versions of the algorithm, all neighboring neurons do not get updated by the same amount. Instead, another coefficient is applied, whose value depends on the distance on the map between the winning unit and the unit to be updated.

The unsupervised training of the SOFM can be converted to a form of supervised training called learning vector quantization (LVQ). Class boundaries are identified approximately in the trained map, and further inputs are presented which are of known class. If the winning unit has the wrong class, its characteristic vector is now moved not toward, but away from the input vector.

The SOFM is "neural" in the sense that it is reminiscent of the lateral excitation, or with LVQ, the lateral inhibition, of biological systems. The SOFM is apt for implementation as a digital system, especially if neighborhoods can be rectangular, distances can be ℓ1 or dot-products, and updates can be of fixed magnitude.
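The following sketch illustrates one SOFM training step with a rectangular neighborhood, exactly the kind of simplification mentioned above; the linear decay schedules for the learning rate and the neighborhood radius are our own arbitrary choices, not Kohonen's prescriptions:

    import numpy as np

    def sofm_step(x, W, t, alpha0=0.5, radius0=3, n_iter=1000):
        """One training step of a square self-organizing map.
        W has shape (rows, cols, dim); alpha(t) and the neighborhood shrink with t."""
        alpha = alpha0 * (1.0 - t / n_iter)                 # decreasing learning rate
        radius = max(1, int(round(radius0 * (1.0 - t / n_iter))))
        dist = np.linalg.norm(W - x, axis=2)                # ||x - w_ij|| for every unit
        wi, wj = np.unravel_index(np.argmin(dist), dist.shape)  # winner-take-all
        for i in range(max(0, wi - radius), min(W.shape[0], wi + radius + 1)):
            for j in range(max(0, wj - radius), min(W.shape[1], wj + radius + 1)):
                W[i, j] += alpha * (x - W[i, j])            # delta w_ij = alpha (x - w_ij)
        return wi, wj

    rng = np.random.default_rng(2)
    W = rng.uniform(size=(10, 10, 3))
    for t in range(1000):
        sofm_step(rng.uniform(size=3), W, t)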

3 SERIAL SYSTEMS

In discussing digital systems for ANNs, it would be easy to overlook serial systems. All other things equal, serial systems are not advantageous for ANNs, because they fail to exploit the intrinsic parallelism. But all other things are not equal. Serial hardware is more widely available and less expensive, and software support is usually more reliable and complete. As a result, modern practice for training ANNs on serial systems is a good place to start our discussion of digital implementations. We will emphasize the use of sophisticated training software on serial systems, and see exactly where and how parallel hardware might be a welcome addition to the system.

ANN practitioners have been idled for long hours by training on their serial systems, particularly by training of their sigmoid nets. So they have re-discovered that there are much better ways to do numerical optimization than by just accumulating the infinitesimal changes of gradient descent, e.g. just accumulating the $\Delta W^I$ and $\Delta W^{II}$ of §2.1.3 over their datasets. Here better means both faster, and achieving better generalization error. These better ways are now implemented in a large number of commercial software packages for training ANNs on serial systems. An article published in November 1994 lists more than 35 vendors from the US alone [8].

For example, one commercial package or simulator that runs on serial systems includes tools, libraries and kernel software. The tools include graphics interfaces for representation of networks, back-prop quantities, and statistical results. The libraries include general utilities, an interface library, a plotting library, and a simulation library. The simulator kernel takes care of objects and messages, graphics primitives, lisp interpretation and, of course, the low-level simulation. The advantages of all this software support are real. The company that sells this simulator also uses it on serial machines, and has reduced by 30% the error in predicting water demand for the Compagnie Generale des Eaux (Paris). Their system has been in operation at the Paris water works since September 1992 [15].

In contrast, we attempted to contact by telephone the 6 companies which were listed in that same November 1994 article as providers of hardware accelerators for ANNs on PC's. One product was discontinued, because 486 PC's had caught up to the speed of a 386 with a pipelined accelerator. The second and third telephone numbers were disconnected, or not in service with the company not known to the telephone operator for that area code. The fourth company, NeuroDynamX, has been taken over by Infotech, but one can still get their fast i860 RISC accelerator. The fifth company, California Scientific, offers from one to 4 fast TMS320 processors on a PC board, and the sixth company, Telebyte, offers a MIMD PC coprocessor.

If our attempts to make contact were a statistically valid sampling, and assuming that the information for the article was up-to-date in October 1994, the results of these phone calls would indicate that providers of hardware accelerators have a half-life of three months! Of course, our sampling is not statistically valid, the journal in which the article appeared is not much more than a subscriber's newsletter, the companies are small, and our emphasis should not be just on practical considerations. But, the results of our phone calls raise two very real questions if one is looking for even serial accelerators. First, how long will one get reliable support? Second, how long will it be advantageous to use the specialized hardware, instead of using the next generation of general-purpose, serial hardware?

3.1 Improved learning algorithms

Having raised these issues, we now give an example of a training algorithm that saves time and decreases generalization error compared with simple gradient descent, and that runs relatively quickly on a serial system [35]. Reviews have been published with other algorithms [5]. For this example, we assume that there is a reason to use sigmoid nets for the application, that the problem is not so large that it won't fit in the serial computer, that one can store all training examples ahead of time instead of having to deal with training within a real-time process, that the stored examples are of several classes, that the relative quantities of examples reflect the prior probabilities of the classes, that the ANN is supposed to learn to turn on the correct class output unit for each example, and that the ANN library for our hardware/software combination supports these 3 calls for our type of net:

- Call setup(network structure, initial weights array)
- Call learn(input array, target array, delta weights array)
- Call forward(input array, output array)

The call to setup initializes the ANN for the available hardware. The call to learn computes $\Delta W^I$ and $\Delta W^{II}$ given some training pairs (x, f(x)), and the call to forward computes output vectors $\hat{f}(x)$ given inputs x.

Our algorithm is now free to schedule the use of these calls as it chooses. Our choice is to train the network using a procedure which alternates between a global randomized search for an error minimizing gradient, and a local deterministic search in the conjugate gradient direction. A pseudo-program that implements this procedure is:

- Initialize the size of a set of balanced blocks of training examples
- Call setup: Initialize network and weights
- Loop: Randomly re-partition the total dataset
  - For each set of balanced blocks in the training partition
    - Call learn on the current set of balanced blocks
    - Compute normalized conjugate gradient over current & previous sets, and compute current search direction
    - Call forward repeatedly: Execute a golden-section stepping line search to find a local error minimum for the current set of balanced blocks, and then update the weights
  - Next set of balanced blocks
  - If error criterion is met, then quit
  - If regularization error increased, then increase the set size if possible
- Next randomized re-partitioning

The whole training dataset consists of some number of balanced blocks of examples from all the classes. We decide what definition of balanced to use. On every "iteration" or "complete sweep" through the dataset, we randomly re-partition the whole dataset into a large number of balanced blocks to be used for training and a small number to be used to test regularization. Within the randomly selected training partition, we estimate the error gradient on increasingly large sets of balanced blocks, increasing the set size only when the error at the end of the iteration on the regularization partition goes up at the current set size.

Given the negative gradient $\Gamma = (\Delta W^I, \Delta W^{II})$ estimated on the current set of balanced blocks and normalized by the current T, we compute the normalized Polak-Ribiere conjugate gradient

$$\beta = \max\!\left( 0, \; \frac{\Gamma \cdot (\Gamma - \Gamma_{-1})}{\Gamma_{-1} \cdot \Gamma_{-1}} \right)$$

over the current $\Gamma$ and previous $\Gamma_{-1}$ negative gradients, and we set the current search direction to

$$Z = \Gamma + \beta Z_{-1} .$$

The $\beta$-weighted term is sometimes called the momentum. Then we execute a golden-section stepping line search, i.e., we modify the total weight matrix W according to

$$\hat{W} = W + \eta s Z$$

where $\eta$ is a user-supplied infinitesimal learning rate and step size s is varied as $g^0, g^1, g^2, \ldots$, with g = 1.618, the golden section, until the error that we compute on the outputs indicates that with the previous s we had a local error minimum. Termination of the line search by fitting a quadratic is an option.

When we use this software approach on a serial system, with an error measure that includes a penalty term for large weights [69], we expect to obtain a saving of iterations on the order of 10^3, a time savings on the order of 10^2, and better performance, compared with simple back-propagation. So, a good software approach to training ANNs, under appropriate circumstances, can make a serial system seem less burdensome. Now, can we combine our good software approaches with good hardware approaches for algorithm parallelization? In the case of sigmoid nets, for example, what is available to parallelize learn and forward? To begin to answer this question, we turn to the next section, on parallel digital systems.
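To make the procedure concrete, the sketch below implements one Polak-Ribiere / golden-section-stepping update in NumPy. It is a simplified illustration, not the commercial package discussed earlier: error() and grad() stand in for the learn/forward calls, grad() is assumed to return the negative gradient, and the balanced-block and regularization bookkeeping of the pseudo-program is omitted:

    import numpy as np

    def train_step(W, error, grad, Z_prev, G_prev, eta=1e-3, g=1.618, max_steps=20):
        """One Polak-Ribiere / golden-section-stepping update (a sketch).
        error(W) -> scalar error, grad(W) -> negative gradient Gamma (same shape as W)."""
        G = grad(W)                                          # current negative gradient
        beta = 0.0
        if G_prev is not None:
            beta = max(0.0, float(np.sum(G * (G - G_prev)) / np.sum(G_prev * G_prev)))
        Z = G + beta * (Z_prev if Z_prev is not None else 0.0)   # search direction
        # golden-section stepping: grow s = g^0, g^1, ... until the error rises again
        best_W, best_err, s = W, error(W), 1.0
        for _ in range(max_steps):
            cand = W + eta * s * Z
            e = error(cand)
            if e > best_err:
                break                                        # previous s was a local minimum
            best_W, best_err, s = cand, e, s * g
        return best_W, Z, G

    # usage sketch on a toy quadratic error surface
    error = lambda W: float(np.sum((W - 3.0) ** 2))
    grad = lambda W: -2.0 * (W - 3.0)            # negative gradient of the error
    W, Z, G = np.zeros(4), None, None
    for _ in range(50):
        W, Z, G = train_step(W, error, grad, Z, G, eta=0.05)
    print(W)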

4 PARALLEL SYSTEMS

The serial systems discussed in §3 help the developer of an ANN application in two ways: First, they provide the user with software libraries and environments adapted for the purpose. Second, they use top performance digital signal processors or RISC processors as small but powerful computational servers in common platforms (such as PCs) lacking the required performance. Unfortunately, the intrinsic parallelism of the connectionist algorithms is not exploited or it is used only up to a very moderate extent (e.g., instruction level parallelism in pipelined RISC processors). On the contrary, the systems addressed in the present section are conceived explicitly to address the regularity of these algorithms and to make the most of it in order to distribute the computation on several units working concurrently.

Two great families of ANN-dedicated parallel systems can be identified, depending on the type of computation units employed in these machines. Some machines use commercially available processors and either exploit the availability of effective communication channels (e.g., Inmos Transputers or Texas TMS320) or add special communication hardware optimized for the application. These systems are described in §4.1. Other systems try to take advantage of other peculiarities of ANN algorithms, such as a reduced precision required in the computations. This makes it possible to develop ad-hoc processing elements which are characterized by a small size and cost. In turn, this often enables one to design systems with a very high processing element count and therefore to increase the degree of parallelism. Systems using custom processors are described in two sections: §4.2 deals with simple systems developed with one algorithm or a family of ANN algorithms in mind and not conceived to be programmed for other purposes; §4.3 deals with systems using more complex processing elements and usually displaying a higher degree of programmability.

We will not even try to be exhaustive in addressing these categories of designs. The purpose is to show which kind of design choices are possible when conceiving a system devoted to ANN applications and to outline the consequences these choices have for the users.

4.1 Parallel systems using general-purpose processing elements

One straightforward design technique for ANN hardware is to use multiprocessor systems made of commercial CPUs. A clear advantage is, in general, a rather short design time. When a long ANN experimentation program is foreseen and an adequate computational server is not available, a multiprocessor design can be seen as an interim choice to obtain high performance in a short time. Designs based on custom processing elements are usually much longer to develop because of the need of developing the custom processor itself and because of less traditional design paradigms. This type of evolution from general purpose processors to custom chips has been taken at ICSI and led from a first array of DSPs (Ring Array Processor [44]) to a special-purpose VLIW processor (SPERT [2], see §5.1) and will finally result in a large server dedicated to connectionist algorithms (CNS-1 [1]).

Systems using commercial µPs were more common in the late 80's when many research and commercial machines dedicated to ANN simulation were still at early stages of development. Moreover, the use of general-purpose processors often makes possible the application of these systems to problems other than neural networks, and, conversely, commercial general-purpose multiprocessor systems display similar performances at similar prices. Indeed, systems such as MUSIC [45] have been developed with ANNs among the first applications and are now marketed as generic desktop multi-processors. If this adaptability of use is an interesting feature, on the other side it is paid for in limited size and parallelism: commercial DSPs are rather expensive and systems with more than a few tens of processors are unrealistic both because of cost and physical size. If one wants higher performance, custom processor designs are a must: the simplifications made possible by ANN algorithms enable the integration of many processors on a single die, often tens or hundreds per integrated circuit.

Several examples exist in this category: SPRINT [13] is a general purpose Transputer-based systolic-array computer designed in the second half of the 80's which, while not conceived specifically for ANNs, was found well adapted for the purpose. Similar experiences were carried out at about the same time on other systolic arrays of processors, such as Warp [52], obtaining record performances. MUSIC [45], a more recent system designed with BP in mind, is interesting for its communication system well adapted to ANN simulation and its performance to price ratio. Of course, this kind of system can be cheap compared to a full-fledged supercomputer but is still far away from the reach of most ANN researchers seeking large computing power at small budgets. For lower price, one has to turn to simpler custom processing elements, but these will bring along other problems.

4.2 Parallel systems using single-model custom processing elements

As already mentioned, the way to move from the limited parallelism of standard DSP-based machines to the full parallelism made possible by ANNs consists in the design of custom processing elements. This technique may also enable the design of small systems with interesting performances at a fraction of the cost and of the size of multiprocessor systems. The basic technique used to simplify the processing units consists in implementing a minimal set of computational primitives, registers, and datapaths as required by the target neural algorithms. In some cases the processing units reduce to simple ad-hoc arithmetic units (with one or a few functions) replicated many times (e.g., distance computation units in the Ni1000 chip, see §4.2.3). In these cases, the final system is almost completely contained on a single chip and implements a very specific ANN algorithm. Sometimes, some programming is possible to specify algorithms in the same family. Other systems are more general in that the processing units are more complex and contain registers, true reconfigurable arithmetic units, and communication ports. Higher degrees of versatility are therefore possible and systems may be designed with many ANN models in mind, or systems conceived for one model can easily be ported to others. Systems of the latter sort will be addressed in a separate section (§4.3). Conversely, the present section is dedicated to smaller systems (often single cascadable chips) designed for a particular ANN model or family.

Another simplification made possible by ANN algorithms is due to the reduced precision required in most calculations and in most models [4,24,64]. As a consequence, the designer may avoid area-expensive floating point arithmetic units and use reduced integer precision. These result in smaller arithmetic units, registers and datapaths, therefore using less area on the die, and in a smaller number of communication links with other peripherals (e.g., IC pins). These simplified codings may go to extremes, such as network weights or activations coded on single bits, or they may use original coding techniques, such as stochastic pulse coded variables [49,10] (see §4.2.2).

The systems investigated in this section, with one exception, rely on these peculiar features to produce single chips designed for easy interfaceability with traditional processors and therefore serve as ANN coprocessors. The single exception [72,71] (see §4.2.4) consists of two very particular designs making the best use of ANN regularity for Wafer Scale Integration (WSI). As is the case with other designs in this section, a single ANN model is implemented and the systems are almost self-contained. The evident difference that puts these designs in a class of their own is the high level of integration.
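As an illustration of the reduced-precision coding these chips rely on, the sketch below quantizes weights and inputs to 8-bit signed integers with a single scale factor per array; the scaling scheme is a common textbook choice, not the coding used by any specific chip described below:

    import numpy as np

    def quantize_weights(W, bits=8):
        """Map floating-point values onto signed integers of the given width,
        with a single per-array scale factor (a simple, common scheme)."""
        qmax = 2 ** (bits - 1) - 1                 # e.g. 127 for 8 bits
        scale = np.max(np.abs(W)) / qmax
        Wq = np.round(W / scale).astype(np.int32)  # integer values for the datapath
        return Wq, scale

    def int_matvec(Wq, xq, w_scale, x_scale):
        # integer multiply-accumulate, rescaled back to real units at the end
        return (Wq @ xq) * (w_scale * x_scale)

    rng = np.random.default_rng(3)
    W = rng.normal(size=(4, 6))
    x = rng.normal(size=6)
    Wq, ws = quantize_weights(W)
    xq, xs = quantize_weights(x)
    print(W @ x)
    print(int_matvec(Wq, xq, ws, xs))   # close to the exact result, despite 8-bit storage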

4.2.1 Philips' L-Neuro 1.0

L-Neuro 1.0 [63,39] is an early example of a chip aimed at the implementation of sigmoid networks and provides internally the most time-critical primitives for the purpose. The integrated circuit has been used by Franzi at LAMI-EPFL in a compact BP coprocessor board for a 68030-based machine [16]. Figure 1 shows the structure of a single L-Neuro 1.0 processing element.

[Figure 1. L-Neuro 1.0. Processing element structure.]

As is the case with most single-chip architectures, it is a stored weight architecture, in the sense that the weights of each neuron are stored in on-chip memory. One kbyte of memory is arranged as 8-bit weights for 64 neurons (0 ≤ i < 64) with 16 inputs each (n = 16). A variable-precision technique allows one to rearrange the memory for 256 neurons with weights on 4 bits. During the learning process, 16-bit weights are used (two 8-bit words are coupled), reducing the maximum number of neurons to 32.

In the forward phase, a single serial-parallel multiplier performs the products and sums required in the matrix-vector product

$$y_j = \sum_{i=0}^{n-1} W_{ji} \, x_i .$$

Each neuron (corresponding to row j of W and whose transpose is represented in figure 1) is processed sequentially, that is, in this case we have synapse parallelism but on a single neuron at a time. This is also done with the intent of producing a single potential output at a time, allowing one to time-share the external hardware dedicated to the non-linear function σ (typically a look-up table).

A built-in Hebbian learning operation computes

$$W^{NEW}_{ij} = W^{OLD}_{ij} + \eta \, \delta_i \, x_j$$

in parallel over the sixteen values of j, for one value of i at a time, that is, one neuron is updated in each step. The BP algorithm also requires the Kronecker product of the error vectors with the activation derivative. This operation has not been implemented on chip and has to be done on the host processor, thus severely hindering BP performance. Another problem is created by the need for back-propagating the error using W^T instead of W. Again this operation has not been implemented and a possible workaround [16], common to many designs [71,40], is to update two identical networks in parallel, one storing the normal weight matrix (used in the forward phase) and the other storing the transpose (used when back-propagating the errors). The evident disadvantage is double: the parallel hardware is doubled and the weight update must be repeated twice without a possibility of parallelization.

As with most designs of its category, the chip is cascadable (to increment the number of neurons or inputs) but networks whose weights exceed the size of the on-chip memory can hardly be implemented due to the low bandwidth from external memory to the internal weight storage. L-Neuro 1.0 is therefore an interesting early example of sigmoid network implementation for embedded feed-forward or BP applications sufficiently small to be resident in the internal memory of one or a few chips. It interfaces easily with a traditional microcontroller which uses the circuit for some compute-intensive activities. Conversely, its limited degree of parallelization and the absence of a direct memory interface make it an example of a system whose improvement is easily outperformed by a successive generation of traditional µP.

4.2.2 Ricoh's RN-200

Ricoh's RN-200 [49] is another example of a chip for on-board MLP applications and it implements BP learning. The main difference with L-Neuro 1.0 (§4.2.1) and its peculiarity compared to most other systems consists in the coding of input and output variables. While most designs use a traditional binary coding, either in serial or parallel form, the RN-200 uses stochastic pulse trains. A variable is coded by the number of active, fixed-width pulses present in a predetermined sampling interval. In the case of the RN-200, the sampling interval lasts 128 ticks of the system clock, therefore coding 7 bits of information. The interest of this coding is that it leads to an extremely simple hardware for multiplication, sum and activation function. If the pulses representing the value of a variable are randomly distributed within the 128 available positions, the logic AND of two signals represents their product while the logic OR produces a result that may be interpreted as a saturated sum of the inputs. This basic scheme is extended to include negative quantities by modeling excitatory and inhibitory synapses. Similarly, the primitives for back-propagation of the error are implemented. The RN-200, compared with a previous version, implements also the derivative of the sigmoidal function implicitly obtained in the OR gate. This is obtained from σ' = σ(1 - σ), where the complement is performed with an inverter and the product with an AND gate. Decorrelation of the signal imposes a one-cycle delay on one of the inputs of the gate.

Each circuit contains 16 neurons, each with 16 inputs. Weights are stored in 8-bit registers and are converted to pulse trains for the computation. This is one of the most delicate aspects of the chip because the accuracy of the computation critically depends on the lack of correlation of the various pulse distributions. To satisfy the requirement, a different random number generator is implemented in each synapse, therefore partly offsetting the apparent simplicity of a design otherwise using one or two gates per synapse and explaining why a rather large chip (13.73 mm × 13.73 mm in 0.8 µm CMOS technology) cannot contain more than 256 synapses.

The RN-200 relies on a clever idea and may be useful for a small, fixed-size, embedded application. Furthermore, the control is completely integrated in the chip, making possible a CPU-less neural network. Otherwise, it shares most of the limitations of the L-Neuro 1.0 in terms of reduced programmability and impossibility of multiplexing larger networks on limited-size hardware.
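The pulse-train arithmetic can be illustrated in software: with independent random pulse placements, a bitwise AND estimates the product of two values and a bitwise OR their saturated sum. The sketch below is only a behavioural illustration of the coding principle; the RN-200's per-synapse hardware generators and its handling of negative values are not modelled:

    import numpy as np

    TICKS = 128   # pulses per sampling interval, i.e. 7 bits of resolution

    def to_pulses(v, rng):
        """Encode a value in [0, 1] as a random 0/1 pulse train of length TICKS."""
        return (rng.random(TICKS) < v).astype(np.uint8)

    def value(pulses):
        return pulses.sum() / TICKS

    rng = np.random.default_rng(4)
    a, b = 0.6, 0.3
    pa, pb = to_pulses(a, rng), to_pulses(b, rng)
    product = value(pa & pb)        # AND of uncorrelated trains ~ a * b
    sat_sum = value(pa | pb)        # OR ~ saturated sum a + b - a*b
    print(a * b, product)
    print(a + b - a * b, sat_sum)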

4.2.3 Intel's Ni1000

In contrast with the chips described in the previous sections, Intel's Ni1000 [50,47] is a chip aimed at the simulation of RBFs. The chip is composed of three blocks: a classifier, a traditional microcontroller and an external interface. The parallel part of the chip consists of a large prototype array (1024 prototypes with 256 components or features, each component coded on 5 bits) tightly coupled with 512 distance calculation units (DCUs), time multiplexed over the two halves of the prototypes. The DCUs compute the ℓ1-norm between the input patterns and each prototype. Their outputs are latched and used by a pipelined 16-bit floating point unit to compute the radial basis function and the probability density function estimation for at most 64 different classes. The chips easily interface to standard µP buses and many circuits can be cascaded to increase the number of stored prototypes. The interface may also convert the results in IEEE standard 32-bit format to minimize the load on the host processor. The whole classification pipeline, processing 32,000 input vectors per second at 33 MHz, is shown in figure 2.

[Figure 2. Ni1000. Pipeline classification [30].]

While the classification pipeline is hardwired, the learning process is supervised by a 16-bit traditional microcontroller with 4 kwords of Flash program memory. The controller can access all the memory locations in the chip and this makes it possible to implement different learning algorithms, such as RCE or PNN. Other chips for classification purposes and with similar features are appearing on the market. The Ni1000 is particularly attractive thanks to its mix of hardwired computation in the highly parallel and time consuming part of the process (distance computation and classification), and the programmable, highly versatile control in the learning process. A similar system, IBM's ZISC036 [26], has the advantage of more precise distance computation and a choice of norms (ℓ1- or ℓ∞-norm) but lacks the versatility that the on-chip microcontroller gives to the Ni1000.

An extreme example of the use of the adaptability of the Ni1000 chip has been reported [51], showing how the dot product required for MLPs can be approximated by an expression using the ℓ1-norm available in the circuit. The approximation improves for inputs having large dimensionality. Even if problems may arise in practical applications due to the reduced weight precision (especially in the BP phase), the approach is an exemplary case of adaptation of neural algorithms to restrictive hardware constraints.
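Functionally, what the prototype array and the DCUs compute is a nearest-prototype classification under the ℓ1 distance. The sketch below emulates that behaviour at toy sizes (64 prototypes instead of 1024, invented labels); it is an illustration of the computation, not a model of the chip's pipeline or timing:

    import numpy as np

    def classify_l1(x, prototypes, labels):
        """Nearest-prototype classification with the l1 (city-block) distance,
        the distance computed in parallel by the chip's calculation units."""
        d = np.abs(prototypes - x).sum(axis=1)   # one distance per stored prototype
        return labels[np.argmin(d)], d.min()

    rng = np.random.default_rng(5)
    prototypes = rng.integers(0, 32, size=(64, 16))   # 5-bit features, toy sizes
    labels = rng.integers(0, 4, size=64)              # up to 64 classes on the chip; 4 here
    x = prototypes[10] + rng.integers(-1, 2, size=16) # a pattern near prototype 10
    print(classify_l1(x, prototypes, labels), labels[10])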

4.2.4 Hitachi's WSI neurocomputers

Hitachi developed two wafer scale integration (WSI) neurocomputers dedicated to MLPs [72,71]. The second machine is essentially a reworking of the first design to provide the system with BP learning. The basic architecture, common to both systems, consists of processing elements, each representing a neuron, connected as in figure 3.

[Figure 3. Hitachi. Time sharing bus architecture.]

The processing element constitutes a typical single instruction-flow multiple data-flow (SIMD) monodimensional array for neuro-computers. For instance, this structure, often referred to as a broadcast bus, is also used in the CNAPS [21] (see §4.3.2) and MY-NEUPOWER [59] neurocomputers. In the first design, 576 neurons were integrated on a 5" wafer. The input bus feeds the array with network inputs, one per cycle. During the potential computation, the address bus is used to select a weight from register files in the processing elements and the product of the weight with the input is accumulated. After potential computation, each processing element in turn uses the output bus to communicate the output of the respective neuron. The simple structure of the processing elements in the earlier design is shown in figure 4.

[Figure 4. Hitachi. Processing element structure.]

The performed operation essentially consists of a weighted sum with 8-bit coefficients of 9-bit inputs. One peculiarity of the implementation is the small weight register file (64 entries compared to 576 neurons per wafer). The 64 locations are assigned (using the address part of the file) to the heaviest weights, the others being rounded down to zero. During the output computation, the address bus is inspected, compared to each of the stored addresses and the weight, if non-zero, is multiplied by the value on the input bus and accumulated.

In the later system, the complete network is duplicated to make possible the multiplication of the error signal with the weight matrix transpose. The weight update is performed on both W and W^T at the same time, so as to guarantee that the matrixes are kept identical. The internal structure of the processing elements is slightly modified to accommodate a few more primitives and the number of neurons per wafer is reduced to one half (144) during the learning process. A few additional registers are added: the weights are now stored on 16 bits and a matrix of ΔW is stored on 8 bits to implement a momentum term. The reduced weight table described for the feedforward-only network is implemented also in the back-propagation processing element but it is not clear to us how this is used during learning. In general, as is unfortunately often the case with algorithmic simplifications imposed by hardware constraints, the impact of the rounding to zero of all weights but the largest 64 is not addressed in the cited papers.

The most interesting feature of the systems is the exploitation of the ANN regularity and of the easy reconfigurability of the broadcast bus to obtain an acceptable yield in WSI [17]. To minimize the impact of defects on internal busses, a hierarchical bus structure is implemented with neurons clustered by local busses, grouped in global busses and finally collected in the WSI main bus. Another example of a chip reconfigurable to withstand manufacturing defects will be described in §4.3.2.
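The reduced weight table can be sketched as follows: each processing element keeps only its heaviest weights as (address, weight) pairs, rounds the rest to zero, and accumulates only when the broadcast address hits the table. This is a behavioural illustration with invented class and variable names, not Hitachi's implementation:

    import numpy as np

    class SparsePE:
        """One processing element of the broadcast-bus array: it keeps only the
        table_size heaviest incoming weights, the rest being rounded down to zero."""
        def __init__(self, weights, table_size=64):
            order = np.argsort(-np.abs(weights))[:table_size]
            self.table = {int(addr): float(weights[addr]) for addr in order}
            self.acc = 0.0

        def broadcast(self, addr, x):
            # the controller puts (address, input value) on the bus; every PE listens
            w = self.table.get(addr)
            if w is not None:
                self.acc += w * x

    rng = np.random.default_rng(6)
    weights = rng.normal(size=576)              # one weight per input neuron
    pe = SparsePE(weights, table_size=64)
    x = rng.normal(size=576)
    for addr, xi in enumerate(x):
        pe.broadcast(addr, xi)
    print(pe.acc, float(weights @ x))           # approximate vs. exact potential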

4.3 Parallel systems using programmable custom processing elements

When an ANN user is not interested in a very small embedded system or is not interested in a single fixed type of network, a programmable system is required. In a sense, we will address in this section a family of systems that are midway between the DSP-based systems presented in §4.1 and the single-chip systems of §4.2: these designs try to achieve an optimal compromise between cost and generality.

Most designs of this class are SIMD systems, with minor variations. The address generation (including program-flow control) and instruction fetch from memory is therefore handled by a controller unique for the whole system. In some cases it is the host machine that directly issues instructions for the SIMD agent. Instructions are often horizontally encoded, that is, each field of the instruction word directly configures a part of the processing element. These two features (no address/issue logic and reduced instruction decoding), together with the reduction of resources to some average maximum requirement for ANNs, are the primary sources of complexity reduction compared to traditional processors. These simplifications make it more appropriate to use the term processing element (PE) than processor for the computation units.
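As a toy illustration of horizontal encoding, the snippet below packs an instruction word whose fields drive parts of a PE directly, so that "decoding" reduces to slicing bit fields; the field layout is entirely invented and does not correspond to any of the chips discussed below:

    # A toy horizontally-encoded instruction word (field layout is invented):
    #   bits 0-3  : ALU operation select
    #   bits 4-8  : register-file read address
    #   bits 9-13 : register-file write address
    #   bit  14   : accumulate enable
    def encode(alu_op, rd_addr, wr_addr, acc_en):
        return (alu_op & 0xF) | ((rd_addr & 0x1F) << 4) | ((wr_addr & 0x1F) << 9) | ((acc_en & 0x1) << 14)

    def decode(word):
        # no real decoding logic: each field feeds its part of the datapath directly
        return {
            "alu_op": word & 0xF,
            "rd_addr": (word >> 4) & 0x1F,
            "wr_addr": (word >> 9) & 0x1F,
            "acc_en": (word >> 14) & 0x1,
        }

    word = encode(alu_op=3, rd_addr=17, wr_addr=5, acc_en=1)
    print(hex(word), decode(word))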

4.3.1 HNC's HNC100 and SNAP

The HNC100 processing element [41,42] is conceived for the implementation of a mixed SIMD ring-array and broadcast-bus architecture. The chip contains one of the rare custom PEs for ANNs implementing floating-point computations. It has a rather simple but very orthogonal structure. As figure 5 shows, the core of the PE is composed of a 32-bit floating-point multiplier and a 32-bit floating-point and integer arithmetic and logic unit. Around these two units, there are seven data registers, three instruction registers, and one status register. The many 32-bit interconnections allow a versatile data routing among the different elements. The choice of a 32-bit floating-point representation limits the number of PEs in each chip to 4.

[Figure 5. SNAP. Processing element.]

The communications with the memory and between PEs are performed through four channels: (1) a bidirectional datapath to and from the PE local memory; (2) a bidirectional datapath to and from a memory bank global to the whole system; and (3) two bidirectional datapaths connecting each PE with the nearest neighbors. The structure of the resulting system is shown in figure 6. Compared to other machines implementing ring architectures, SNAP has the additional feature of a global interconnection bus. This is not simply an interface to a global memory but is designed to allow a direct PE to PE interconnection. This may prove useful to speed up some operations that require point-to-point communication, such as the accumulation of partial results stored in the different processors [42].

Complete systems contain between 16 and 64 PEs on several boards. These numbers are not very different from the typical sizes of machines based on commercial processors. This is partly due to the orthogonality of the processor, partly to the implementation of a floating-point unit. To achieve higher parallelism, manufacturers have to combine a simplified processor structure with reduced precision fixed-point capabilities, as shown in the next sections.

[Figure 6. SNAP. System architecture.]

4.3.2 Adaptive Solutions' N64000 and CNAPS Servers

Adaptive Solutions' CNAPS Server [21] was one of the first commercial examples of large neurocomputers. The chip at the core of the system has a very aggressive design: the die measures about 1 square inch and has 13.6 million transistors fabricated [20,29]. Similar to the Hitachi WSI designs, this is a good example of use of the regularity of the broadcast bus architecture to reconfigure faulty elements on the die and improve yield. Redundancies at three levels of the design guarantee a 90% yield. Active transistors are thus reduced to 11 million and 64 processing nodes (PN) are guaranteed working out of the 80 fabricated. Figure 7 shows the connections between PNs on different chips and the sequencer.

[Figure 7. CNAPS. Inter-chip communication.]

One can observe the reduced connectivity: essentially, only two 8-bit and one 31-bit busses connect the chips among themselves and with the sequencer. The figure suggests two important advantages of the architecture: (1) the regularity of the structure, allowing expansion of the machine by simply adding more N64000 chips on the bus (up to 8 in available systems); (2) the minimal connectivity, making the processor feasible and reducing packaging and mounting costs.

The PN internal architecture is that of a very simple DSP processor (figure 8). The reduced connectivity is made possible by the presence of 4 Kbytes of static RAM in each PN to hold the weights.

[Figure 8. CNAPS. Block diagram of a N64000 Processing Node.]

The basic operating mode is the classic matrix-vector multiplication scheme for a broadcast bus architecture already described in §4.2.4. Each neuron is assigned to a different PN until there are no free units. Large networks may assign more than one neuron to each PN and process them in sequence. For MLPs one may transmit outputs from one layer to the next by allowing the processing elements in turn to place the results on the output bus. These are transmitted to all the PNs through the sequencer.

Once more, error back-propagation, which requires the multiplication of the transpose of the weight matrix by the errors on the output of the layer, poses some problems. The implemented solution [40] duplicates the weight matrix in the processors, once stored one row in each processor and once one column each. In contrast with other less programmable systems [16,71], only the matrix is replicated, without requiring any further hardware. Both matrixes are updated successively. The calculated performances for a network with 1,900 inputs, 500 hidden units and 12 outputs are among the highest ever reported: 9,671 million synapses simulated per second and 2,379 million synapses updated per second. Unfortunately no detail is given on the number of iterations for convergence of the algorithm. It is possible that the low precision in the implementation of the sigmoid function and of its derivative (both fully contained in the PN memory and therefore sampled on a very limited number of points) degrades the performance. This effect could partly offset the high raw performance. No comprehensive study of the problem has been reported even though research has shown that the coarseness of these functions has an important impact on the final result [46].

The low connectivity is convenient if there is no need for frequent accesses to the weight memory, otherwise the system becomes prohibitively slow and the concurrency of operation between PNs is lost. This is the case when the network size exceeds the capacities of the PN memory.

that two weight matrixes per layer (except the rst) should be stored. Furthermore, the implementation of momentum would probably require twice as much memory to store previous updates. Note that the apparently very large test network (1900-50012) replicates only a small matrix (6,000 elements) because most of the weights are concentrated in the rst layer (950,000 elements). Real applications may not be so well-matched to the architecture. Techniques such as epoch (block) processing and updating may reduce the catastrophic impact of weight swapping by performing it only once every n vectors. The great advantage of the CNAPS architecture is the versatility of the processing elements and the easy programmability. Other algorithms have been mapped on the machine (e.g., Kohonen's SOFM22 or image processing algorithms) and a C compiler is available making Adaptive Solutions system very powerful when the user is not reaching the PN memory limit. These advantages of the broadcast bus explain why other commercial systems have favored the use of this architecture (e.g., Hitachi MY-NEUPOWER59 ).
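To make the mapping concrete, the following C++ sketch models the broadcast-bus scheme described above under simplifying assumptions of our own (one neuron per PN, floating-point arithmetic, no fixed-point or memory-size constraints; the names PN, forward and backward are illustrative, not part of any CNAPS API). Each PN keeps one row of the weight matrix for the forward pass and, following the duplicated-matrix solution40, one column for the backward pass, so that the transposed product W^T e can be accumulated with the same broadcast dataflow.

#include <vector>
#include <cstddef>

// Illustrative model of a broadcast-bus SIMD array (one neuron per PN).
struct PN {
    std::vector<float> row;     // one row of W: weights of the neuron mapped to this PN
    std::vector<float> column;  // one column of W: used for the transposed product in learning
    float acc = 0.0f;
};

// Forward pass: the sequencer broadcasts x[j]; the PNs hosting the neurons of
// the current layer accumulate their dot products in lock-step.
std::vector<float> forward(std::vector<PN>& layerPNs, const std::vector<float>& x) {
    for (auto& pn : layerPNs) pn.acc = 0.0f;
    for (std::size_t j = 0; j < x.size(); ++j)       // one broadcast per input element
        for (auto& pn : layerPNs)                     // conceptually simultaneous on the array
            pn.acc += pn.row[j] * x[j];
    std::vector<float> y;
    for (auto& pn : layerPNs) y.push_back(pn.acc);    // results placed on the output bus in turn
    return y;
}

// Backward pass: the errors e[i] of the layer are broadcast; the PNs hosting the
// units of the *previous* layer use their stored column of W to accumulate one
// component of W^T e each, so no extra hardware is needed beyond the copy of W.
std::vector<float> backward(std::vector<PN>& prevLayerPNs, const std::vector<float>& e) {
    for (auto& pn : prevLayerPNs) pn.acc = 0.0f;
    for (std::size_t i = 0; i < e.size(); ++i)
        for (auto& pn : prevLayerPNs)
            pn.acc += pn.column[i] * e[i];
    std::vector<float> d;
    for (auto& pn : prevLayerPNs) d.push_back(pn.acc);
    return d;
}

On the real chip the inner loop over PNs is what the hardware executes simultaneously, so the visible cost is roughly one bus cycle per input element, as discussed above.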

4.3.3 Siemens' MA16 and SYNAPSE-1

As for most ANN processing elements, Siemens' MA1654,7 array processor chip has been designed by examination of common computational requirements for neural algorithms and by mathematical unification of the different learning algorithms53,55. The primitives have been implemented in a rather complex PE, called the elementary chain. A simplified version of the PE is shown in figure 9.

Figure 9. SYNAPSE-1. MA16 processing element.

The first important characteristic is the fact that an elementary chain does not handle scalars (as do all the other systems described in this article) but 4 × 4 matrices of integers. Communication channels are one element wide, and 16 basic clock cycles are required to send or receive operands from and to a PE. This bandwidth matches the throughput of the PE, whose central element is a matrix-matrix multiplier that takes 16 cycles to multiply two 4 × 4 operands. As shown in figure 9, all input and output registers are duplicated, which makes it possible to pipeline computation and data communication. The typical operations performed by the PE, in addition to the matrix-matrix and transposed-matrix multiplications, include matrix addition and subtraction, computation of ℓ1 and ℓ2 distances between corresponding matrix rows, and minimum and maximum search. Each MA16 chip contains four PEs connected as shown in figure 10.

Figure 10. SYNAPSE-1. MA16 chip structure.

An instruction word propagates in a systolic form in parallel with the data. The SYNAPSE-1 neurocomputer56 uses 8 of these chips connected in two parallel rings sharing the weight memory (see figure 11). The FIFOs and the loop arithmetic units are implemented on a data unit controlled by a Motorola MC68040 processor. Another MC68040 processor controls the whole system through a microsequencer. A key feature of the system is the direct interface of the PE chips to large amounts of dynamic RAM, that is, to the cheapest and most dense memory chips on the market. The standard weight memory amounts to 128 Mbytes and can be expanded up to 512 Mbytes. In general, the design of SYNAPSE-1 lends itself to batch processing, that is, to the processing of many input vectors with the same weights, deferring all the weight updates to the end. This is due to two factors: first, the handling of matrices instead of scalars, which in most cases imposes an epoch of four prototypes, and, second, the use of a double ring with a single weight memory. This latter feature, another peculiarity of SYNAPSE-1, is intended to double the performance in the forward phase of MLP networks. The need to serially update the weights with the results of each ring reduces performance during learning. SYNAPSE-1 is a machine in the same league as CNAPS (see §4.3.2) in terms of approximate price, versatility and performance. Nevertheless, some features favor either one or the other system and show that the ideal neurocomputer has still to come. The great advantage of the SYNAPSE-1 architecture is the interface with huge amounts of on-line weight memory. The first demonstration of the system consisted in a simulation of a Hopfield network for retrieval of 64 × 64 typographic characters. With its 4,096 neurons and its 16 million connections, the problem largely exceeds the capacity of any CNAPS Server on the market, and the execution speed is very remarkable (48 ms per 8 patterns, or approximately 2,800 million synapses simulated per second).
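For the record, the quoted benchmark figures are mutually consistent; the tiny C++ check below is our own arithmetic, assuming a fully connected 4,096-neuron network, and is not taken from the SYNAPSE-1 documentation.

#include <cstdio>

int main() {
    // Back-of-the-envelope check of the Hopfield demonstration figures quoted above.
    const double neurons = 4096.0;
    const double connections = neurons * neurons;        // about 16.8 million synapses
    const double patterns = 8.0, seconds = 0.048;        // 8 patterns in 48 ms
    const double synapsesPerSecond = connections * patterns / seconds;
    std::printf("%.0f connections, %.0f synapses/s\n", connections, synapsesPerSecond);
    return 0;  // prints roughly 16.8e6 and 2.8e9, consistent with the text
}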

Figure 11. SYNAPSE-1. System architecture.

On the other hand, what makes the CNAPS architecture attractive is its relative programming simplicity: the strict SIMD paradigm, together with the orthogonality of the processor, may make assembly programming accessible to most motivated users. Moreover, the existence of a special C compiler (even if not very optimized11) and of large hand-coded libraries of functions, freely mixable with C code, makes experimentation acceptably simple. SYNAPSE-1 is hindered by a complex PE and by a systolic paradigm that makes programming difficult. The PE has a large number of primitives but is not at all orthogonal, making it tricky or impossible to implement operations that were not foreseen by the designers. So far, we are not aware of demonstrations of complex ANN algorithms or of comprehensive software packages and programming tools, even though libraries with standard ANN models were announced several months ago.
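As an illustration of why a 4 × 4 granularity pushes towards batch processing, the following C++ sketch is a simplified model of our own, not Siemens code, and all names in it are hypothetical. It expresses the forward pass for a batch of exactly four patterns as a sequence of 4 × 4 block multiply-accumulates, the primitive a matrix-oriented PE executes in a fixed number of cycles; with fewer than four patterns, or with dimensions not padded to multiples of four, part of every block operation is simply wasted.

#include <array>
#include <vector>
#include <cstddef>

constexpr std::size_t B = 4;                        // block size of the matrix-oriented PE
using Block = std::array<std::array<float, B>, B>;  // one 4 x 4 operand

// The primitive provided by the PE: C += A * Bm, always a full 4 x 4 block
// operation (a fixed 16-cycle job, however sparse the operands happen to be).
void mac_block(const Block& A, const Block& Bm, Block& C) {
    for (std::size_t i = 0; i < B; ++i)
        for (std::size_t k = 0; k < B; ++k)
            for (std::size_t j = 0; j < B; ++j)
                C[i][j] += A[i][k] * Bm[k][j];
}

// Forward pass for a batch of exactly four input vectors, stored as the columns
// of the blocked matrix X. W is the weight matrix, also stored in 4 x 4 blocks.
// Dimensions are assumed to be padded to multiples of four.
std::vector<Block> forward_batch(const std::vector<std::vector<Block>>& W,  // [outBlocks][inBlocks]
                                 const std::vector<Block>& X) {             // [inBlocks], 4 patterns wide
    std::vector<Block> Y(W.size(), Block{});                                // [outBlocks], 4 patterns wide
    for (std::size_t ib = 0; ib < W.size(); ++ib)
        for (std::size_t kb = 0; kb < W[ib].size(); ++kb)
            mac_block(W[ib][kb], X[kb], Y[ib]);
    return Y;
}

Batching four patterns per block is what lets every multiply in the 16-cycle primitive do useful work; a pattern-by-pattern (on-line) update would leave three of the four columns of each operand idle.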

4.3.4 EPFL's Genes IV and Mantra I

All of the designs seen so far address neuron parallelism, that is, they associate at most one processing element per neuron. A further degree of parallelism could be attained by assigning up to one processing element per synapse. The advantage is an increased throughput. This is roughly limited, in the former case, to one vector processed in a single layer every N_i cycles, where N_i indicates the number of inputs of the layer. With one processor per synapse, thanks to the higher parallelization, one could produce up to one vector per cycle. One of the rare examples of neurocomputers with synaptic PEs is Mantra I66,67, developed at EPFL. At the computational heart of the system is a bidimensional mesh of custom processing elements called Genes IV28. PEs are connected by serial lines to their four neighbors, as shown in figure 12(a).

Figure 12. Mantra I. (a) Square systolic array of Genes IV processing elements. (b) Architecture of a Genes IV diagonal PE.

All input and output operations are performed by the PEs located on the North-West to South-East diagonal. Figure 12(b) shows the architecture of a diagonal PE (non-diagonal PEs do not require the diagonal input/output signals). The array implements a few general primitives sufficient for most neural models, including Hopfield nets, MLPs, BP learning, and Kohonen SOFMs. Special features in the PEs make it possible to achieve a 100% utilization rate in normal conditions. The array implemented in Mantra I contains 40 × 40 PEs running at 10 MHz. A large section of the part synchronous with the array (delimited by a dashed line in figure 13) is devoted to an input/output system with a bandwidth large enough to keep the systolic array busy.

Figure 13. Mantra I. System architecture.

This part also contains two look-up tables and a linear array composed of other custom PEs, labeled GACD1. The system is controlled by a Texas Instruments TMS320C40 processor. Its tasks include configuration of the SIMD part, instruction dispatching, and input and output management. It also handles the communications with a host computer and between multiple Mantra I accelerators. Mantra I is one of the rare examples of genuine massive parallelism (see also §4.3.5). The controller is built with commercial components and cannot support the array at speeds above 10 MHz. Furthermore, the use of a serial communication format to reduce connectivity also reduces the performance. This explains why 1,600 PEs are needed to achieve, for instance, a peak of 400 million MLP synapses simulated per second, less than half the measured performance of the SYNAPSE-1. But the main problem is elsewhere: Mantra I probably shares the basic shortcoming of SYNAPSE-1, namely difficult reconfigurability and a very complex programming paradigm. As in the case of SYNAPSE-1, tools have been developed to help the programmer27, but coding is still painful and the hardware lacks the transparency to software that is essential for these machines to become popular.
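The throughput argument that motivates synaptic parallelism can be made concrete with a back-of-the-envelope model. The rough C++ sketch below is our own simplification, not a description of Mantra I's actual timing: a neuron-parallel array needs about N_i cycles per pattern for a layer with N_i inputs, while a filled synapse-parallel pipeline delivers about one pattern per cycle after an initial latency of the order of N_i + N_o cycles.

#include <cstdint>
#include <iostream>

// Rough cycle-count model for processing P patterns through one layer with
// nIn inputs and nOut outputs (ignores I/O, serial links and learning).

// Neuron parallelism: one PE per neuron, each performing one
// multiply-accumulate per cycle, so a pattern costs about nIn cycles.
std::uint64_t neuronParallelCycles(std::uint64_t P, std::uint64_t nIn) {
    return P * nIn;
}

// Synapse parallelism: an nIn x nOut mesh of PEs working as a pipeline; after
// a fill latency of roughly nIn + nOut cycles, one pattern completes per cycle.
std::uint64_t synapseParallelCycles(std::uint64_t P, std::uint64_t nIn, std::uint64_t nOut) {
    return (nIn + nOut) + P;
}

int main() {
    const std::uint64_t P = 10000, nIn = 500, nOut = 500;
    std::cout << "neuron-parallel:  " << neuronParallelCycles(P, nIn) << " cycles\n"
              << "synapse-parallel: " << synapseParallelCycles(P, nIn, nOut) << " cycles\n";
    // With these numbers the synaptic mapping is roughly nIn times faster,
    // at the price of nIn times more processing elements.
    return 0;
}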

4.3.5 Current Technology's MM32k

A nice example of good software integration for a general-purpose accelerator board is Current Technology's MM32k19, a massively parallel system containing 32,768 serial PEs on a single PC card. The PEs are extremely simple and are composed of three one-bit registers, a serial arithmetic and logic unit, and 512 bits of memory. The instruction set has 17 different operations and is complete, that is, it can implement any logic function. PEs are interconnected with a 64 × 64 crossbar, with 512 PEs connected to each port. While the hardware is interesting for its mix of PE simplicity and versatility, the reported software interface is exemplary: a C++ language class library hides the hardware and overloads common arithmetic operators to perform vector operations on the PEs. It is probably the kind of interface that every user expects. Unfortunately, coding such an interface is rather simple for a very regular SIMD structure, but implementing a similarly effective interface on heavily pipelined machines and general-purpose systolic arrays may prove almost impossible without tools to compact and optimize the array microcode dynamically during execution.
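The kind of interface described above can be mimicked in ordinary C++. The sketch below is purely illustrative: the class name and its behavior are our assumptions, not Current Technology's actual library. Overloaded operators act on whole vectors; on the real board each such operation would be dispatched as SIMD instructions to the 32,768 PEs instead of being executed by the host-side loop used here.

#include <vector>
#include <cstddef>
#include <cassert>

// Illustrative stand-in for a class library that hides a SIMD accelerator:
// operations on whole vectors would be translated into array instructions.
class ParVector {
public:
    explicit ParVector(std::size_t n, float v = 0.0f) : data_(n, v) {}

    // Element-wise operations; a real library would issue these to the PE array.
    ParVector operator+(const ParVector& o) const { return zip(o, [](float a, float b) { return a + b; }); }
    ParVector operator*(const ParVector& o) const { return zip(o, [](float a, float b) { return a * b; }); }

    float& operator[](std::size_t i) { return data_[i]; }
    std::size_t size() const { return data_.size(); }

private:
    template <typename F>
    ParVector zip(const ParVector& o, F f) const {
        assert(data_.size() == o.data_.size());
        ParVector r(data_.size());
        for (std::size_t i = 0; i < data_.size(); ++i) r.data_[i] = f(data_[i], o.data_[i]);
        return r;
    }
    std::vector<float> data_;
};

int main() {
    ParVector x(32768, 1.5f), w(32768, 2.0f);
    ParVector y = x * w + x;            // reads like scalar code, operates on all elements
    return static_cast<int>(y[0]);      // 4.5 -> 4, just to use the result
}

The attraction is that user code reads like ordinary scalar C++; the difficulty noted above lies in generating efficient microcode for less regular architectures, not in writing the wrapper itself.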

5 DISCUSSION

We now discuss the trend toward more generality, the complementary role that hybrid and analog systems can play alongside digital ANN systems, and the questions that we raised earlier in the paper.

5.1 Toward more generality

One trend that we see as we look into the future of digital systems for ANNs is the trend towards more generality in the designs. The quest for generality has already influenced the design of many new processing elements. As mentioned in §3, simple BP is seldom a good choice to attack a problem; therefore, users already wish for more generality, to avoid the risk of running slow algorithms on fast hardware with the same performance as fast algorithms on slow hardware (and with the difference that fast hardware is expensive while fast algorithms are not). The tendency is increasingly towards general processing elements, more and more closely resembling RISC or VLIW processors with reduced arithmetic capabilities (no floating point, reduced precision). Many examples exist, such as the University of Nottingham's toroidal neural processor31, ICSI's SPERT3 or the new Philips generation of chips, L-Neuro 2.314. This tendency to generality seems so fundamental as to offset the quest for massive parallelism. Essentially, we see a trend in digital ANN designs in favor of programmable systems that provide a flexibility similar to that of serial hardware with a comparable coding effort. On the side of very high-performance dedicated applications, however, purely digital designs may need to be complemented by hybrid and analog systems that achieve small size, low power consumption, or extraordinarily high speeds. We present some examples in the next section.

5.2 Complementing systems

Real neural networks, as opposed to artificial ones, are not digital systems. Here are some examples where hybrid and analog systems appear to be attractive complements to purely digital systems.

In the first example, a sigmoid net with feed-forward connections was trained on a serial digital system to provide signals that control the shape of a plasma in a tokamak6. To implement the forward pass of the trained sigmoid net for fast feedback control, a rapid matrix multiplier was built, using 12-bit frequency-compensated multiplying DACs for multiplication of analog signals by a digitally stored number, i.e., the DACs are used as digitally controlled attenuators, not as converters. A sigmoidal summing unit was also built, using two transistors as a long-tailed pair, to generate the symmetric (−1, +1) sigmoidal transfer function. Temperature sensitivity is overcome by placing 5 transistors in close thermal contact in one chip. Two transistors form the long-tailed pair, one is used as a heat source, and the other two measure the temperature and permit active thermal feedback control. A separate 12-bit DAC with a fixed DC input implements the bias for each sigmoid.

In the second example, a CCD programmable image processor has been built that has an analog input buffer, 49 multipliers, and 49 eight-bit 20-stage local memories, all in a 29 mm² chip area9. These processors have a 42-dB dynamic range and perform one billion arithmetic operations per second, dissipating less than 1 W when clocked at 10 MHz. The first layer of an ANN with replicated input weights has been implemented using this processor as an image-feature extractor.

In the third example, an analog VLSI implementation of the ART1 (clustering) algorithm has been fabricated61 on a die area of 1 cm². This chip performs pattern clustering or classification in less than 1 µs on input patterns of 100 binary pixels, in up to 18 different categories. The chip is analog internally but interfaces to the outside world through digital signals.

In our final example, a two-layer optical network has been demonstrated in a system for face recognition38. Input images are accepted at video rates and classified in real time. The adaptable interconnections of the network are implemented as holograms stored in a photorefractive crystal. The training algorithm trains each hidden optical unit separately and not iteratively, to avoid the decay of image information by re-recordings.

5.3 A return to our questions

Having reviewed many digital systems, we return to three questions that we raised earlier:

A. Does it appear that we will be able to get reliable support on commercial systems? In the best case, yes. But some digital systems are no longer available, and many have an availability that remains limited for the moment (e.g., to joint-research applications). And in some cases it is not just small firms but also big ones that may not be staying in the ANN systems business.

B. Will it be advantageous to use the specialized hardware, instead of using the next generation of general-purpose, serial hardware? In the best case, yes. But it all depends. Can we wait? How demanding are we? Do we have a tiny embedded application? Does our application require very massive computing? Does it need to run in real time? These are questions that each potential user has to answer for himself.

C. Can we combine our good software approaches with good hardware approaches for algorithm parallelization? Again, in the best case, yes. When we find a system provider who has his own full environment and debugged support software, he may try to put us into his framework, and we may try to do the same with him. If we are patient, and the provider is patient, we maximize our chances for a successful combination.

6 CONCLUSION

Neural networks are an alternative to traditional systems whose behavior is dictated by the algorithmic analysis made by the programmer. Conversely, ANNs learn to find the solution of a problem by processing examples. Their usefulness and competitiveness with traditional techniques have been proved on many real problems. Unfortunately, the training process can be lengthy and burdensome. To speed up this process, several digital systems have been presented in recent years; some have been developed in academic research laboratories, others are becoming commercial products. We reviewed a number of systems, selected as a function of their technological representativeness, emphasizing commercially available systems whenever possible. These chips and machines all attempt to exploit the intrinsic parallelism of ANN algorithms. The way this is achieved is extremely variable and ranges from systems built with standard processors to systems using full-custom processing units. The systems are also very different in size, depending on their target application: some are single chips essentially devoted to embedded applications, others are massively parallel machines controlled by powerful workstations. Confronting these systems with the potential expectations of users raises several questions that we tried to answer. Essentially, while the hardware is achieving some maturity (many different directions have been tried and some abandoned, and research efforts are now concentrating on a few key issues), other aspects may still require some tuning. Software support, and system integration in general, is just beginning to have the versatility that a wider class of users may require. Until more tools are available, only the largest applications or the most critical embedded designs will find it easy to justify the use of the available parallelizing hardware. An important by-product of future advancements in system integration might and should be the possibility of incorporating into digital designs the promising results of other technologies, such as analog electronics or optical devices, so as to make the next generation of neurocomputers well adapted to huge problems, fast, and user-friendly.

7 ACKNOWLEDGEMENTS

The authors would like to thank all the colleagues who provided them with information on the reviewed systems, namely J. Anlauf, M. Duranton, H. Eguchi, E. Franzi, D. Hammerstrom, R. Means, J. Pampus, U. Ramacher, T. Watanabe.

Also, several people helped us to understand the state of the art more generally. We gratefully acknowledge L. Bottou, S.-Y. Kung, J.-D. Nicoud, E. Ojamaa, J. Principe, M. Viredaz, R. Watrous, B. Yoon. Paolo Ienne was financially supported in this work by the Swiss Confederation "SPP-IF Mantra" program under grant 5003-34353.

8 REFERENCES

[1] K. Asanovic, J. Beck, T. Callahan, J. Feldman, B. Irissou, B. Kingsbury, P. Kohn, J. Lazzaro, N. Morgan, D. Stoutamire, and J. Wawrzynek. CNS-1 architecture specification: A connectionist network supercomputer. Technical Report TR-93-021, International Computer Science Institute, Berkeley, Calif., Apr. 1993.
[2] K. Asanovic, J. Beck, B. E. D. Kingsbury, P. Kohn, N. Morgan, and J. Wawrzynek. SPERT: A VLIW/SIMD microprocessor for artificial neural network computations. Technical Report TR-91-072, International Computer Science Institute, Berkeley, Calif., Dec. 1991.
[3] K. Asanovic, J. Beck, B. E. D. Kingsbury, P. Kohn, N. Morgan, and J. Wawrzynek. SPERT: A neuro-microprocessor. In J. G. Delgado-Frias and W. R. Moore, editors, VLSI for Neural Networks and Artificial Intelligence, pages 103–7. Plenum Press, New York, 1994.
[4] K. Asanovic and N. Morgan. Experimental determination of precision requirements for back-propagation training of artificial neural networks. In Proceedings of the Second International Conference on Microelectronics for Neural Networks, pages 9–15, Munich, 1991.

[5] R. Battiti. First- and second-order methods for learning: Between steepest descent and Newton's method. Neural Computation, 4:141–66, 1992.
[6] C. M. Bishop, P. S. Haynes, M. E. U. Smith, and T. N. Todd. Fast feedback control of a high temperature fusion plasma. Neural Computing & Applications, 2:148–59, 1994.
[7] N. Bruls. Programmable VLSI array processor for neural networks and matrix-based signal processing – User description. Technical report, Siemens AG, Corporate Research and Development Division, Munich, Oct. 1993. Version 1.3.
[8] R. B. Caldwell. Selecting the right neural network tool. Neurove$t Journal, 2(6):16–23, Nov. 1994.
[9] A. Chiang and M. Chuang. A CCD programmable image processor and its neural network applications. IEEE Journal of Solid-State Circuits, SC-26(12):1894–1901, 1991.
[10] T. G. Clarkson, D. Gorse, J. G. Taylor, and C. K. Ng. Learning probabilistic RAM nets using VLSI structures. IEEE Transactions on Computers, C-41(12):1552–61, Dec. 1992.
[11] T. Cornu and P. Ienne. Performance of digital neuro-computers. In Proceedings of the Fourth International Conference on Microelectronics for Neural Networks and Fuzzy Systems, pages 87–93, Turin, Sept. 1994.
[12] G. Cybenko. Approximation by superpositions of a sigmoidal function. Control, Signals, & Systems, 2(4):303–14, Dec. 1989.

[13] A. J. De Groot and S. R. Parker. Systolic implementation of neural networks. In K. Bromley, editor, High Speed Computing II, volume 1058, pages 182–90, Los Angeles, Calif., Jan. 1989. SPIE – The International Society for Optical Engineering.

[14] C. Dejean and F. Caillaud. Parallel implementations of neural networks using the L-Neuro 2.0 architecture. In Proceedings of the International Conference on Solid State Devices and Materials, pages 388–90, Yokohama, 1994.
[15] X. Driancourt, Dec. 1994. Personal Communication. Neuristique, Inc.
[16] E. Franzi. Neural accelerator for parallelization of back-propagation algorithm. Microprocessing and Microprogramming, 38:689–96, 1993.
[17] M. Fujita, Y. Kobayashi, K. Shiozawa, T. Takahashi, F. Mizuno, H. Hayakawa, M. Kato, S. Mori, T. Kase, and M. Yamada. Development and fabrication of digital neural network WSIs. IEICE Transactions on Electronics, E76-C(7):1182–90, July 1993.
[18] N. A. Gershenfeld and A. S. Weigend. The future of time series: Learning and understanding. In A. S. Weigend and N. A. Gershenfeld, editors, Time Series Prediction: Forecasting the Future and Understanding the Past, pages 1–70. Addison-Wesley, Reading, Mass., 1993.
[19] M. A. Glover and W. T. Miller, III. A massively-parallel SIMD processor for neural network and machine vision applications. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems, volume 6, pages 843–49, San Mateo, Calif., 1994. Morgan Kaufmann.
[20] M. Griffin, G. Tahara, K. Knorpp, and B. Riley. An 11-million transistor neural network execution engine. In IEEE International Conference on Solid-State Circuits, pages 180–81, 1991.
[21] D. Hammerstrom. A highly parallel digital architecture for neural network emulation. In J. G. Delgado-Frias and W. R. Moore, editors, VLSI for Artificial Intelligence and Neural Networks, chapter 5.1, pages 357–66. Plenum Press, New York, 1991.
[22] D. Hammerstrom and N. Nguyen. An implementation of Kohonen's self-organizing map on the Adaptive Solutions neurocomputer. In Proceedings of the International Conference on Artificial Neural Networks, volume I, pages 715–20, Helsinki, June 1991.
[23] J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural Computation. Santa Fe Institute Studies in Sciences of Complexity. Addison-Wesley, Redwood City, Calif., 1991.
[24] J. L. Holt and T. E. Baker. Back propagation simulations using limited precision calculation. In Proceedings of the International Joint Conference on Neural Networks, Seattle, Wash., July 1991.
[25] J. J. Hopfield and D. W. Tank. Computing with neural circuits: A model. Science, 233:625–33, Aug. 1986.
[26] IBM. ZISC036 Data Book (Preliminary), Nov. 1994. Version 2.1.
[27] P. Ienne and M. A. Viredaz. Implementation of Kohonen's self-organizing maps on MANTRA I. In Proceedings of the Fourth International Conference on Microelectronics for Neural Networks and Fuzzy Systems, pages 273–79, Turin, Sept. 1994.
[28] P. Ienne and M. A. Viredaz. GENES IV: A bit-serial processing element for a multi-model neural-network accelerator. Journal of VLSI Signal Processing, 9(3), 1995. To appear.
[29] inova microelectronics corporation, Santa Clara, Calif. N64000 Digital Neural Network Processor (Preliminary Data).
[30] Intel Corp., Santa Clara, Calif. Ni1000 Datasheet (Preliminary), 1993.

[31] S. R. Jones and K. M. Sammut. Learning in linear systolic neural network engines: Analysis and implementation. IEEE Transactions on Neural Networks, NN-5(4):584–93, July 1994.
[32] T. Kohonen. Self-Organization and Associative Memory, volume 8 of Springer Series in Information Sciences. Springer-Verlag, Berlin, third edition, 1989.
[33] G. Kuhn. A first look at phonetic discrimination using a connectionist network with recurrent links. Technical Report 82018, IDA-CRD, Princeton, N.J., Sept. 1987.
[34] G. Kuhn. Joint optimization of classifier and feature space in speech recognition. In Proceedings of the International Joint Conference on Neural Networks, volume IV, pages 709–14, Seattle, Wash., July 1992.
[35] G. Kuhn and N. Herzberg. Some variations on training of recurrent networks. In R. Mammone and Y. Zeevi, editors, Neural Networks: Theory and Applications, chapter 11, pages 233–44. Academic Press, San Diego, Calif., 1991.
[36] G. Kuhn, R. Watrous, and B. Ladendorf. Connected recognition with a recurrent network. Speech Communications, 9(1):41–48, Feb. 1990.
[37] S. Y. Kung. Tutorial: Digital neurocomputing for signal/image processing. In B. H. Juang, S. Y. Kung, and C. A. Kamm, editors, Neural Networks for Signal Processing, volume 1, pages 616–44, Piscataway, N.J., Sept. 1991. IEEE Signal Processing Society.
[38] H.-Y. S. Li, Y. Qiao, and D. Psaltis. Optical network for real-time face recognition. Applied Optics, 32(26):5026–35, Sept. 1993.
[39] N. Mauduit, M. Duranton, J. Gobert, and J.-A. Sirat. Lneuro 1.0: A piece of hardware LEGO for building neural network systems. IEEE Transactions on Neural Networks, NN-3(3):414–22, May 1992.
[40] H. McCartor. Back propagation implementation on the Adaptive Solutions neurocomputer chip. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems, volume 3, San Mateo, Calif., 1991. Morgan Kaufmann.
[41] R. W. Means and L. Lisenbee. Extensible linear floating point SIMD neurocomputer array processor. In Proceedings of the International Joint Conference on Neural Networks, volume I, pages 587–92, Seattle, Wash., July 1991.
[42] R. W. Means and L. Lisenbee. Floating-point SIMD neurocomputer array processor. In K. W. Przytula and V. K. Prasanna, editors, Parallel Implementations of Neural Networks. Prentice Hall, New York, 1993.
[43] J. Moody and C. Darken. Learning with local receptive fields. In D. Touretzky, G. Hinton, and T. Sejnowski, editors, Proceedings of the 1988 Connectionist Models Summer School, pages 133–44, San Mateo, Calif., June 1989. Morgan Kaufmann.
[44] N. Morgan, J. Beck, P. Kohn, J. Bilmes, E. Allman, and J. Beer. The Ring Array Processor: A multiprocessing peripheral for connectionist applications. Journal of Parallel and Distributed Computing, 14:248–59, 1992.
[45] U. A. Muller, A. Gunzinger, and W. Guggenbuhl. Fast neural net simulation with a DSP processor array. IEEE Transactions on Neural Networks, NN-6(1):203–13, Jan. 1995.
[46] P. Murtagh and A. C. Tsoi. Implementation issues of sigmoid function and its derivative for VLSI digital neural networks. IEE Proceedings, 139(3):207–14, May 1992.
[47] Nestor Inc., Providence, R.I. Ni1000 Recognition Accelerator Datasheet, 1994.

[48] D. Obradovic, G. Deco, H. Furumoto, and C. Fricke. Neural networks for industrial process control: Applications in pulp production. In Proceedings of Neuro-Nîmes 93, volume I, pages 25–32, Nîmes, France, Oct. 1993.
[49] S. Oteki, A. Hashimoto, T. Furuta, T. Watanabe, D. G. Stork, and H. Eguchi. A digital neural network VLSI with on-chip learning using stochastic pulse encoding. In Proceedings of the International Joint Conference on Neural Networks, volume III, pages 3039–45, Nagoya, Japan, Oct. 1993.
[50] C. Park, K. Buckmann, J. Diamond, U. Santoni, S.-C. The, M. Holler, M. Glier, C. L. Scofield, and L. Nunez. A radial basis function neural network with on-chip learning. In Proceedings of the International Joint Conference on Neural Networks, volume III, pages 3035–38, Nagoya, Japan, Oct. 1993.
[51] M. P. Perrone and L. N. Cooper. The Ni1000: High speed parallel VLSI for implementing multilayer Perceptrons. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems, volume 8, 1995.
[52] D. A. Pomerleau, G. L. Gusciora, D. S. Touretzky, and H. T. Kung. Neural network simulation at Warp speed: How we got 17 million connections per second. In Proceedings of the IEEE Conference on Neural Networks, volume II, pages 143–50, San Diego, Calif., 1988.
[53] U. Ramacher and J. Beichter. Systolic architectures for fast emulation of artificial neural networks. In J. McCanny, J. McWhirter, and E. Swartzlander, Jr., editors, Systolic Array Processors, pages 277–86. Prentice Hall, New York, 1989.
[54] U. Ramacher, J. Beichter, W. Raab, J. Anlauf, N. Bruls, U. Hachmann, and M. Wesseling. Design of a 1st generation neurocomputer. In VLSI Design of Neural Networks, pages 271–310. Kluwer Academic, Norwell, Mass., 1991.
[55] U. Ramacher and P. Nachbar. Hamiltonian dynamics of neural networks. In Proceedings of the Second International Conference on Microelectronics for Neural Networks, pages 95–102, Munich, 1991.

[56] U. Ramacher, W. Raab, J. Anlauf, U. Hachmann, and M. Wesseling. SYNAPSE-1 – A general-purpose neurocomputer. Technical report, Siemens AG, Corporate Research and Development Division, Munich, Feb. 1993.
[57] A. J. Robinson and F. Fallside. Static and dynamic error propagation networks with application to speech coding. In D. Z. Anderson, editor, Advances in Neural Information Processing Systems, volume 1, pages 632–41, New York, 1987. American Institute of Physics.
[58] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1: Foundations, chapter 8. MIT Press, Cambridge, Mass., 1986.
[59] Y. Sato, K. Shibata, M. Asai, M. Ohki, M. Sugie, T. Sakaguchi, M. Hashimoto, and Y. Kuwabara. Development of a high-performance general purpose neurocomputer composed of 512 digital neurons. In Proceedings of the International Joint Conference on Neural Networks, volume II, pages 1967–70, Nagoya, Japan, Oct. 1993.
[60] C. L. Scofield and D. L. Reilly. Into silicon: Real time learning in a high density RBF neural network. In Proceedings of the International Joint Conference on Neural Networks, volume I, pages 551–56, Seattle, Wash., July 1991.

[61] T. Serrano, B. Linares-Barranco, and J. L. Huertas. Analog VLSI implementation of the ART1 algorithm. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems, volume 8, 1995.
[62] D. F. Specht. Enhancements to probabilistic neural networks. In Proceedings of the International Joint Conference on Neural Networks, volume I, pages 761–68, Baltimore, Md., June 1992.
[63] J. B. Theeten, M. Duranton, N. Mauduit, and J. A. Sirat. The Lneuro chip: A digital VLSI with on-chip learning mechanism. In International Neural Networks Conference, pages 593–96, 1990.
[64] P. Thiran, V. Peiris, P. Heim, and B. Hochet. Quantization effects in digitally behaving circuit implementations of Kohonen networks. IEEE Transactions on Neural Networks, NN-5(3):450–58, May 1994.
[65] R. van Eyden and J. Cronje. Discriminant analysis versus neural networks in credit scoring. Neurove$t Journal, 2(6):11–15, Nov. 1994.
[66] M. Viredaz. Design and Analysis of a Systolic Array for Neural Computation. PhD Thesis No. 1264, École Polytechnique Fédérale de Lausanne, Lausanne, 1994.
[67] M. A. Viredaz and P. Ienne. MANTRA I: A systolic neuro-computer. In Proceedings of the International Joint Conference on Neural Networks, volume III, pages 3054–57, Nagoya, Japan, Oct. 1993.
[68] R. Watrous, G. Towell, M. S. Glassman, M. Shahraray, and D. Theivanayagam. Synthesize, optimize, analyze, repeat (SOAR): Application of neural network tools to ECG patient monitoring. In Proceedings of the International Symposium on Nonlinear Theory and Its Applications, volume 2, pages 565–70, Hawaii, Dec. 1993.
[69] A. S. Weigend, B. A. Huberman, and D. E. Rumelhart. Predicting the future: A connectionist approach. International Journal of Neural Systems, 1:193–209, 1990.
[70] R. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1:270–80, 1989.
[71] M. Yasunaga, N. Masuda, M. Yagyu, M. Asai, K. Shibata, M. Ooyama, M. Yamada, T. Sakaguchi, and M. Hashimoto. A self-learning neural network composed of 1152 digital neurons in wafer-scale LSIs. In Proceedings of the International Joint Conference on Neural Networks, pages 1844–49, Seattle, Wash., July 1991.
[72] M. Yasunaga, N. Masuda, M. Yagyu, M. Asai, M. Yamada, and A. Masaki. Design, fabrication and evaluation of a 5-inch wafer scale neural network LSI composed of 576 digital neurons. In Proceedings of the International Joint Conference on Neural Networks, volume II, pages 527–36, San Diego, Calif., June 1990.

