Neurocomputing on the RAP

NELSON MORGAN, JAMES BECK, PHIL KOHN, JEFF BILMES
International Computer Science Institute, Berkeley, CA
1 Abstract

In 1989 we designed and implemented a Ring Array Processor (RAP) for fast execution of our continuous speech recognition training algorithms, which have been dominated by connectionist calculations. The RAP is a multi-DSP system with a low-latency ring interconnection scheme using programmable gate array technology and a significant amount of local memory per node (16 MBytes of dynamic memory and 256 KBytes of fast static RAM). Theoretical peak performance is 128 MFlops/board, with sustained performance of 30-90% for back-propagation problems of interest to us. Systems with up to 40 nodes have been tested, for which throughputs of up to 574 Million Connections Per Second (MCPS) have been measured, as well as learning rates of up to 106 Million Connection Updates Per Second (MCUPS) for training. While the system is tuned to these algorithms, it is also a fully programmable computer, and users code in C++, C, and assembly language. Practical considerations such as workstation address space and clock skew restrict current implementations to 64 nodes, but in principle the architecture scales to about 16,000 nodes for back-propagation. We now have considerable experience with the RAP as a day-to-day computational tool for our research. With the aid of the RAP hardware and software, we have done network training studies that would have taken a significant fraction of a century on a UNIX workstation. We have also used the RAP to simulate variable precision arithmetic to guide us in the design of higher performance neurocomputers that are currently in the early stages of planning.
2 Introduction

Connectionist computation can be defined as computation in which information is represented by the pattern of activation and strengths of connections between elements that evaluate some simple function of their inputs. The most common such element computes a saturation nonlinearity on the weighted sum of its inputs, but other functions have also been used. These systems are distantly modeled after the activity of biological neurons, whose firing frequency can be related to the activity at their inputs. The simplicity of this computation model, along with the versatility of the approach, suggests that connectionist algorithms may be the best solution to the challenge of parallel computing. Certainly those of us involved in the field consider the brain as an existence proof for the generality of such algorithms that depend on a massively parallel system of relatively simple operators.

We have been particularly interested in applying connectionist techniques to machine recognition of continuous speech. Numerous researchers have found these algorithms to be useful for this application [5][8][14][21][23]. In our own work, we have found that layered networks can be effectively used as probabilistic estimators for a Hidden Markov Model (HMM) procedure [3][4][18]. Features representing the spectral content of the speech are estimated 100 times per second. A layered network is trained to predict the phonetic label of each 10 msec "frame". This network takes as its input the spectral features from one or more frames, and has an output layer consisting of one unit per phonetic category. For some of our experiments, the inputs are real-valued, and hidden layers are used. For others, the speech features are vector-quantized to map the frame into one of a set of prototype vectors, and the network input consists of a binary input unit for each possible feature value, only one of which can be active at a time. In either case, the neural network is trained by back-propagation [22][24] augmented by a generalization-based stopping criterion [17]. It can be shown [6] that the net outputs can be trained to estimate emission probabilities for the Viterbi decoding step of an HMM speech recognizer. A network is useful for this procedure because it can estimate joint probabilities (joint over multiple features or time frames) without strong assumptions about the independence or parametric relation of the separate dimensions. We have conducted a number of experiments which seem to confirm the utility of this approach.
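To make the framewise setup concrete, the following sketch (illustrative C++, not the authors' code; the function name, the boundary clamping, and the dense one-of-N target encoding are our assumptions) assembles one training pattern for the real-valued case: the input is the spectral features from a window of consecutive frames around the frame being labeled, and the target has one unit per phonetic category.

```cpp
#include <cstddef>
#include <vector>

// One framewise training pattern: input = features from a window of frames,
// target = one-of-N encoding of the phonetic label of the center frame.
struct Pattern {
    std::vector<float> input;   // contextFrames * featuresPerFrame values
    std::vector<float> target;  // one unit per phonetic category
};

// frames[t]     : feature vector of the t-th 10 msec frame
// phoneLabel[t] : phonetic class index of the t-th frame
// contextFrames : size of the input window, assumed odd (e.g. 9)
Pattern makePattern(const std::vector<std::vector<float>>& frames,
                    const std::vector<int>& phoneLabel,
                    std::size_t center,
                    std::size_t contextFrames,
                    std::size_t numPhones)
{
    Pattern p;
    const std::ptrdiff_t half = static_cast<std::ptrdiff_t>(contextFrames / 2);
    for (std::ptrdiff_t k = -half; k <= half; ++k) {
        std::ptrdiff_t t = static_cast<std::ptrdiff_t>(center) + k;
        // Assumption: clamp the window at utterance boundaries.
        if (t < 0) t = 0;
        if (t >= static_cast<std::ptrdiff_t>(frames.size()))
            t = static_cast<std::ptrdiff_t>(frames.size()) - 1;
        p.input.insert(p.input.end(), frames[t].begin(), frames[t].end());
    }
    p.target.assign(numPhones, 0.0f);
    p.target[phoneLabel[center]] = 1.0f;   // the center frame's phonetic label
    return p;
}
```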
For continuous speech recognition, computer resources are commonly dominated by the primitives of dynamic programming (as used in Viterbi decoding): address calculations, reads, adds, compares, and branches (or conditional loads). This is particularly true for large vocabulary recognition. However, the use of large connectionist probability estimators can add a significant amount of computation to the recognition process. For example, for a 1000-word vocabulary we are using for recognition, a SparcStation 2 takes roughly 10 times real time to do the dynamic programming (with no pruning of unlikely hypotheses). The neural network calculations for a large network with 300,000 connections take about 60 seconds on the workstation for each second of speech (for a large continuous input network). However, training via back-propagation is perhaps 5 times as long as the forward network calculation, and must be repeated over 10-20 iterations through a data set that could easily be as large as 1,000 seconds per speaker (for the speaker-dependent Resource Management task, for instance), or 10,000 seconds of speech (for the speaker-independent Resource Management task). Thus, the training runs we are currently doing could take anywhere from a month to a year on a uniprocessor workstation. Planned experiments in feature selection will also require iterations over the training procedure. Since our research is largely in the area of training algorithms as opposed to recognition per se, a fast processor was required.
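As a rough cross-check on that estimate (a back-of-the-envelope calculation using only the figures quoted above, not a separate measurement), the low end of the range follows directly from the forward-pass cost:

    T_{\mathrm{train}} \approx 60\,\tfrac{\mathrm{s}}{\mathrm{s\ of\ speech}} \times 5 \times 10\ \text{iterations} \times 1{,}000\ \mathrm{s\ of\ speech} \approx 3\times 10^{6}\ \mathrm{s} \approx 1\ \text{month},

with larger data sets and more iterations pushing the same estimate toward the year end of the quoted range.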
3 Building a Neurocomputer

The challenge in building a connectionist computing machine is to provide sufficient power and flexibility so that it will still be useful by the time it is finished. This can be a problem for overly special-purpose designs, since the algorithms in use change rapidly in a healthy research environment. Additionally, a design that takes too long to implement may be rendered obsolete by the rapid advances in general-purpose computers. Our approach in the RAP was to take maximum advantage of commercial Digital Signal Processing (DSP) chips to satisfy the computational requirement while still providing flexibility. An internode ring communication network has also been designed to allow easy expansion to more boards with a higher bandwidth than can be provided from the standard bus connections.

Taking advantage of the full power of fast DSP circuits in a parallel connectionist system is more complex than just wiring chips together with
a convenient interface. In particular, the designer has at least three major concerns (beyond raw numerical power) when attempting to effectively use multiple computation units for increased performance:

1. Having data at the right place at the right time. This requirement is frequently not satisfied for a fast general-purpose machine that is used for connectionist applications, or for parallel machines with relatively slow routing mechanisms.

2. Ensuring the machine can perform all functions of the target application. Even machines specifically designed for an idealization of connectionist networks may perform worse than general purpose machines if a crucial function is missing.

3. Giving users the tools to develop and debug programs without compromising performance. The user will not gain if programming is sufficiently awkward to eliminate time gains from fast execution.

To some extent, we have addressed these concerns by focusing our primary design effort on the set of operations that must perform well on the system. This design process involves starting with a set of equations for the desired computations, inferring the component data movements and functional operations, and then mapping them onto the proposed hardware design. In many cases this approach leads to a familiar set of equations defining recall, learning, and update using multiply-accumulate and nonlinear lookups as the basic operations.

For a more general application to problems in machine perception and human neural modeling, however, this approach has some limitations. Many useful calculations are not expressed by these "standard" formulae, such as the calculation of maxima, and comparative computations such as dynamic programming, which are so important for speech recognition (even when connectionist computation is used). If we cannot write a few simple equations to summarize connectionist computation, are we doomed to building a completely general computer? If this is so, one would do best to follow commercial developments and purchase a commercial supercomputer. Fortunately, the situation is not quite so grim. Fast arithmetic chips can be used effectively, but they must be incorporated in a system that can access large collections of data with a small time penalty,
and they must be able to handle computations that are either impossible or inefficient on a specialized "neural" processor. While we cannot say that all computation will be connectionist, we can design so that the highest requirements will be for such operations, and also provide some general-purpose capabilities to prevent non-connectionist operations from becoming a bottleneck. Viewing the system as a general-purpose host with some special computational servers, we can also place an emphasis on the connectionist paradigm to guide the design of communications mechanisms.

Ultimately, the special concerns of these algorithms push us toward custom VLSI designs. We have been working in this area as well. However, our immediate need in 1989 was for a programmable machine with performance roughly two orders of magnitude higher than we could get from our workstations, and the fastest way to accomplish this was to use existing DSP chips as the computational elements. The resulting design took only one year to implement, so that we have been able to accomplish many large studies since the summer of 1990. In the meanwhile, we have begun work on the design of VLSI elements for a successor machine that will have much higher performance, while retaining the flexibility of the current system.
4 Architectural Considerations

Artificial neural networks (ANNs) frequently do not have complete connectivity [7], even between layers of a feedforward network [15]. Nonetheless, an extremely useful subclass of these networks uses nonsparse connectivity between layers of "units", which are (for the most common case) nonlinear functions of the weighted sums of their inputs. The most common unit function uses a sigmoid nonlinearity, namely,

    f(y) = \frac{1}{1 + e^{-y}}                    (1)

with

    y = \sum_{i=1}^{N} w_i x_i + \theta            (2)

where the w's are connection strengths (weights), the x's are unit inputs, θ is a unit bias, y is the unit potential, and f(y) is the unit output.
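As a concrete reference for equations (1) and (2), a single unit's output reduces to a multiply-accumulate loop followed by one nonlinearity evaluation; the sketch below (plain C++, not RAP library code) spells this out.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Output of one unit per equations (1) and (2):
//   y    = sum_i w_i * x_i + theta      (unit potential)
//   f(y) = 1 / (1 + exp(-y))            (sigmoid nonlinearity)
float unitOutput(const std::vector<float>& w,  // connection strengths w_i
                 const std::vector<float>& x,  // unit inputs x_i
                 float theta)                  // unit bias
{
    float y = theta;
    for (std::size_t i = 0; i < w.size(); ++i)
        y += w[i] * x[i];                      // multiply-accumulate (MAC)
    return 1.0f / (1.0f + std::exp(-y));
}
```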
The computational requirements of such algorithms are well matched to the capabilities of commercial DSPs. In particular, these circuits are designed with a high memory bandwidth and efficient implementation of the "multiply-accumulate" (MAC) operation (including addressing). However, if the unit implementations and corresponding weight storage are to be divided between multiple processors, there must be an efficient means for distributing unit outputs to all of the processors. If this is not provided in the system hardware, overall operation may be extremely inefficient despite specialized circuitry for the required arithmetic.

Full connectivity between processors is impractical even for a moderate number of nodes. A reasonable design for networks in which all processors need all unit outputs is a single broadcast bus. However, this design is not appropriate for other related algorithms such as the backward phase of the back-propagation learning algorithm. More specifically, for a forward step the weight matrix should be stored in row-major form, i.e., each processor has access to a particular row vector of the weight matrix. This corresponds to a list of connection strengths for inputs to a particular output unit. However, for a backward step the matrix should be distributed in column-major form, so that each processor has access to all connection strengths from a particular input unit. As Kung [13] has pointed out, the backward phase corresponds to a vector-matrix multiplication (as opposed to the matrix-vector multiplication of the forward case).

One can use a circular pipeline or ring architecture to distribute partial sums to neighboring processors where local contributions to these sums can be added. Using this systolic mode of operation, partial sums for N units on N processors can be distributed in O(N) cycles, where in contrast, a single-bus broadcast architecture would require O(N^2) broadcasts to get all the partial sums to the processors where the complete sums can be computed. Alternatively, with a simple bus-based architecture the weight matrices could be stored twice, once in each ordering, and weight updates could be computed twice using error terms that have been distributed via broadcast. These added costs are unnecessary for the RAP because of the ring.
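The asymmetry between the two phases can be seen in a small sketch (illustrative C++, not the RAP matrix library). With the weight matrix distributed by rows, each processor's forward step is entirely local once the input activations have been distributed, whereas its backward step yields only a partial error sum for every unit of the previous layer; those partial sums are what the ring must then combine, as Table I below illustrates.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;  // weights[output unit][input unit]

// Forward step on one processor, which owns a block of rows of the weight
// matrix (the incoming weights of the output units assigned to it):
// a plain matrix-vector product, needing no further communication.
std::vector<float> forwardLocal(const Matrix& myRows, const std::vector<float>& x)
{
    std::vector<float> y(myRows.size(), 0.0f);
    for (std::size_t i = 0; i < myRows.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j)
            y[i] += myRows[i][j] * x[j];
    return y;  // potentials of this processor's output units
}

// Backward step on the same processor: propagating the errors of its output
// units through the same rows is a vector-matrix product, and it produces
// only this processor's *partial* contribution to the error of every input
// unit; the complete sums require combining all processors' contributions.
std::vector<float> backwardPartial(const Matrix& myRows, const std::vector<float>& myErr)
{
    const std::size_t nIn = myRows.empty() ? 0 : myRows[0].size();
    std::vector<float> partial(nIn, 0.0f);
    for (std::size_t i = 0; i < myRows.size(); ++i)
        for (std::size_t j = 0; j < nIn; ++j)
            partial[j] += myRows[i][j] * myErr[i];
    return partial;  // one partial error sum per input unit
}
```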
Table I shows the process of calculating error terms for back propagation on a 4-processor ring. The top table shows the initial location of partial sums: sij refers to the ith partial sum (corresponding to the local contribution to the error for hidden unit i) as computed in processor j.

Initial Partial Sum Location
      P1    P2    P3    P4
      s11   s12   s13   s14
      s21   s22   s23   s24
      s31   s32   s33   s34
      s41   s42   s43   s44

Partial Sum Location After One Ring Shift
      P1        P2        P3        P4
      s11       s12       s12+s13   s14
      s21       s22       s23       s23+s24
      s34+s31   s32       s33       s34
      s41       s41+s42   s43       s44

Partial Sum Location After Two Ring Shifts
      P1            P2            P3            P4
      s11           s12           s12+s13       s12+s13+s14
      s23+s24+s21   s22           s23           s23+s24
      s34+s31       s34+s31+s32   s33           s34
      s41           s41+s42       s41+s42+s43   s44

Partial Sum Location After Three Ring Shifts
      P1                P2                P3                P4
      s12+s13+s14+s11   s12               s12+s13           s12+s13+s14
      s23+s24+s21       s23+s24+s21+s22   s23               s23+s24
      s34+s31           s34+s31+s32       s34+s31+s32+s33   s34
      s41               s41+s42           s41+s42+s43       s41+s42+s43+s44

Table I: Accumulation of partial error sums via the ring
In other words, sij is all of the error term for hidden unit i which could be computed locally in processor j given the distribution of weights. In each step, each processor passes one partial sum to the processor on its right, and receives a partial sum from the processor on its left (with a ring connection between the end processors). The received sum is added into one of the partial sums. By choosing the passed values correctly, all processors can be usefully employed adding in values. Thus, in the example shown, each of the four processors has a completed error sum for a hidden unit after 3 steps. In general, N - 1 steps are required to compute N such sums using N processors.

Because of the ring hardware, the data movement operations are not a significant amount of the total computation. For each board, the peak transfer rate between 4 nodes is 64 million words/sec (256 MBytes/second). This is a good balance to the 64 million MAC/sec (128 MFLOPS) peak performance of the computational elements. In general, units of a layer (actually, the activation calculations for the units) are split between the processors, and output activations are then distributed from all processors to all processors in the ring pseudo-broadcast described above. As long as the network is well-expressed as a series of matrix operations (as in the feedforward layered case), partitioning is done "automatically" when the user calls assembly language matrix routines which have been written for the multi-processor hardware.
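The shift-and-add schedule of Table I can be written down compactly. The following is an illustrative single-process simulation of that schedule (not the RAP's hand-coded ring routines); acc[p][u] starts as processor p's local partial sum for hidden unit u, and after N-1 steps acc[p][p] holds the completed error sum for unit p.

```cpp
#include <vector>

// Simulate the ring accumulation of Table I on one host for clarity.
// acc[p][u] : processor p's accumulated partial error sum for hidden unit u.
// In each of the N-1 steps every processor sends one accumulated value to
// its right neighbour, which folds it into its own partial for that unit.
void ringAccumulate(std::vector<std::vector<double>>& acc)
{
    const int N = static_cast<int>(acc.size());    // processors (== units here)
    for (int t = 1; t < N; ++t) {                   // N-1 ring shifts
        std::vector<double> inFlight(N);
        std::vector<int>    unitSent(N);
        for (int p = 0; p < N; ++p) {               // each node picks what to pass on
            int u = ((p - t) % N + N) % N;          // unit whose sum travels this step
            unitSent[p] = u;
            inFlight[p] = acc[p][u];
        }
        for (int p = 0; p < N; ++p) {               // each node receives from its left
            int left = (p - 1 + N) % N;
            acc[p][unitSent[left]] += inFlight[left];
        }
    }
    // Now acc[p][p] is the complete error sum for hidden unit p
    // (with N = 4 this reproduces the final pattern of Table I).
}
```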
5 RAP Hardware

The goal was to realize a programmable system quickly with a small engineering team, but still achieve a speedup of around 100 times what we could achieve on a commercial workstation. As the system architecture evolved, this concern was translated into several concrete decisions:

1. Only standard, commercially available components (e.g., TMS320C30) were used.

2. The computational power of the system came from using several high-performance processors, not from complex interconnection schemes or sophisticated peripheral circuits. To that end, the processing nodes were kept simple (though they are based on full floating-point processors) to allow placing four nodes on a circuit board. This has also permitted us to build powerful single-board RAP systems local to a Sun workstation, as well as our main multi-board RAP.

3. The memory system used only a single bank of dynamic RAM (DRAM) and static RAM (SRAM) at each processing node. This decision permitted greatly simplified memory control logic and provided the minimum electrical loading of the processor data lines.

4. The backplane bus chosen was a standard VME bus using only the 32 address and data lines on the standard interface. The clock signals and data distribution ring used separate connectors and cables at the front panel in order to maintain strict compatibility with the VME bus.

5. As much of the logic as possible was implemented in Programmable Gate Arrays (PGAs; PGA is a registered trademark of Xilinx Incorporated). This decision reduced parts count, simplified board design, and allowed flexibility for later design enhancements. PGAs were used for two functions at each node: the interprocessor communication ring, and the memory control for DRAM. One additional PGA was used in the VME bus interface.

6. The RAP is all digital, and does not directly support analog I/O. By using only digital circuits, we simplified board layout and sidestepped choices of the particular A/D or D/A converters required. Analog I/O is now being added without redesigning the RAP by using a pair of connectors that are provided for an add-on board which connects to the DSP serial ports. Each of these high-speed ports can support data conversion rates as high as 1 MB/second.

7. Overall, the RAP is a very regular design. Although expansion to heterogeneous external devices is available via synchronous serial lines and the VME bus, the boards themselves consist of repeated nodes.
Figure 1 shows the major elements of a RAP node. Each node consists of a DSP chip, the local memory associated with that node, a simple 2-register pipeline and handshake for ring communication (implemented on a PGA and a PAL), and a very small amount of miscellaneous control logic; almost all packages are processors, memory, and PGAs.
Fig. 1 RAP Node Architecture. (Block diagram: the DSP primary bus connects to 256 KB of static RAM and 16 MB of dynamic RAM through a RAM-control PGA, to a bus-interface PGA leading to the VME bus, and to the ring handshake logic (PGA and PAL) with Ring In and Ring Out ports; the DSP serial ports and an extension bus are also brought out.)
Four interconnected nodes are contained on a single circuit board, with the internode busses also brought to connectors at the board front edge. At the next higher level, several boards are plugged into a common backplane bus (VME) with all ring busses connected using flat ribbon cable. The user interface to the RAP is through a Sun CPU board which is plugged into the same VME bus, or using NFS to access a daemon running on that CPU from over the Ethernet.
6 Benchmark Results

Three six-layer printed circuit boards were fabricated in late 1989, and low-level software and firmware were written to bring the system to a usable state by mid-1990. The boards were programmed to implement common matrix-vector library routines, and to do the forward and backward phases of back-propagation. More recently (in early 1991), ten boards were fabricated and tested in a single card cage. Results on this larger system with a simple benchmark for a network with one hidden layer and all layers of the same size are shown in the first 2 rows of Table II. Matching layer sizes are not required by either the hardware or neural net software, but this choice permits a simplified analysis of the machine performance as a function of the number
of units in a layer. The second column in Table II shows the performance for a subtask that is close to optimal for the TI DSP, forward propagation. For a large enough dimension, this routine exceeds 80% efficiency (with respect to 64 MMACS/board), and ten RAP boards are roughly 600 times the speed of a Sun SparcStation 2 running the same benchmark. For the larger networks, the forward propagation performance becomes almost identical to a matrix-vector multiply (which is O(N^2)), since the sigmoid calculation (which is O(N)) becomes inconsequential in comparison to the multiply-accumulates.

Finally, when learning is performed on each cycle (for a network with one hidden layer), the weight update steps dominate the computation. This is commonly the case with the back propagation algorithm, and similar ratios have been reported for other multi-processor implementations [9][11][25]. The update and the delta calculation each require at least as many arithmetic operations as the forward step, so that a factor of 3-5 decrease in throughput should be expected for the network calculation when learning is included. Another limitation is the DSP, which is optimized for dot-product calculations rather than the read-add-write of the weight update step. For the complete forward and backward cycles, the 45-100 MCUPS shown in the table corresponds to a speed-up over the SparcStation 2 of about 100-200. For an average of five floating-point arithmetic operations per connection during learning, the last column of Table II corresponds to 225-500 MFLOPS, or roughly one-fourth to one-half of the peak arithmetic capability of the machine.
Network Size    Forward Propagation    Full Learning
128→128→128     211 MCPS               63 MCUPS
512→512→512     527 MCPS               100 MCUPS
512→256→512     429 MCPS               83 MCUPS
234→1024→61     341 MCPS               45 MCUPS

Table II: Measured Performance, 10-board RAP
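For reference, converting the table entries to arithmetic throughput under the operation counts stated in the text (roughly two floating-point operations per connection for forward propagation and five per connection during learning) gives:

    527\ \mathrm{MCPS} \times 2\ \tfrac{\mathrm{FLOP}}{\mathrm{conn}} \approx 1054\ \mathrm{MFLOPS} \approx 0.82 \times (10 \times 128\ \mathrm{MFLOPS\ peak}),

    100\ \mathrm{MCUPS} \times 5\ \tfrac{\mathrm{FLOP}}{\mathrm{conn}} \approx 500\ \mathrm{MFLOPS}, \qquad 45\ \mathrm{MCUPS} \times 5 \approx 225\ \mathrm{MFLOPS}.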
Figures 2 and 3 further illustrate the benchmark performance of a 40-node (ten board) RAP system. As the problems increase in size, performance approaches peak computation rates and demonstrates a closer approximation to a linear speedup. For forward propagation, with an average of two floating point operations per connection, 90% of the peak arithmetic capability is obtained for networks with layers of about 800 units. The loss in performance for the largest problems shown is due to external SRAM storage of the network variables, as opposed to on-chip RAM for the smaller problems.

The last two rows of Table II show the 10-board RAP performance for two feedforward networks with nonuniform layer sizes. The first of these corresponds to a common autoassociative network, and the second to an architecture that we have used in our speech recognition training. For these examples, performance is in the 341-429 MCPS range for forward propagation alone, and 45-83 MCUPS for the complete learning cycle for one pattern.

We have analyzed the costs for communication, control, and computation in the back-propagation algorithm on a many-node RAP [16]. This analysis suggests that, for the largest problems that will fit on such a RAP, computation will be at least 50% efficient for configurations of up to 16,000 processors. Such a machine would have a peak computational throughput
Fig. 2 RAP Performance for uniform layer size: Million Connections Per Second (MCPS) versus layer size (0 to 1,280 units), for the 40-node RAP and a SPARC 2 workstation.

Fig. 3 RAP Performance for uniform layer size, one hidden layer: Million Connection Updates Per Second (MCUPS) versus layer size (0 to 1,280 units), for the 40-node RAP and a SPARC 2 workstation.
of 256 GFLOPS or 128,000 MCPS.²

² We have included this projection for the reader who enjoys fantasy. Since this is a coarse-grained system in which each processor consumes a large amount of resources, issues of clock distribution, cooling, host interface, etc. could easily make such a machine quite impractical, or minimally require a billion-dollar company for support.
7 Software

As with the hardware, our connectionist focus reduced the complexity of the software so that the system was up and running shortly after the hardware was functioning. In particular:

1. The system requirements were minimal: no virtual memory, no coherent shared memory, and no operating system.

2. The RAP functions as a computational server for a host Unix machine; Unix-compatible libraries are available for user code running on the RAP (e.g., printf(), fopen(), etc.).

3. Although the RAP is potentially a general purpose machine, the software effort has focused on application-specific programming.

4. Although the RAP is capable of MIMD (Multiple Instruction streams controlling Multiple Data streams) operation, the software and communications ring were designed to facilitate an SPMD (Single Program operating on Multiple Data streams) style of programming.

In SPMD programming the same program is loaded into all of the processors, while the data is distributed among all processors. Usually, the processors will all be doing the same operations on different parts of the data. For example, to multiply a matrix by a vector, each processor would have its own subset of the matrix rows that must be multiplied. This is equivalent to partitioning the output vector elements among the processors. If the complete output vector is needed by all processors, a ring broadcast routine is called to redistribute the part of the output vector from each processor to all the other processors.
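As an illustration of this SPMD pattern (a single-process sketch with hypothetical names, not the RAP library routines), each node multiplies its own block of matrix rows by the vector and the resulting slices are then redistributed so that every node holds the complete output vector; on the RAP that redistribution is done by the ring broadcast routine.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Single-process illustration of the SPMD matrix-vector multiply described
// above (not the RAP library).  Node p owns a contiguous block of matrix
// rows, computes the corresponding slice of the output vector, and the
// slices are then redistributed so that every node ends up with the full
// vector.
std::vector<std::vector<float>> spmdMatVec(const Matrix& W,
                                           const std::vector<float>& x,
                                           int numNodes)
{
    const std::size_t rows = W.size();
    std::vector<std::vector<float>> slice(numNodes);
    for (int p = 0; p < numNodes; ++p) {
        const std::size_t first = p * rows / numNodes;        // this node's rows
        const std::size_t last  = (p + 1) * rows / numNodes;
        for (std::size_t i = first; i < last; ++i) {
            float y = 0.0f;
            for (std::size_t j = 0; j < x.size(); ++j)
                y += W[i][j] * x[j];
            slice[p].push_back(y);
        }
    }
    // "Ring broadcast": every node contributes its slice and receives the rest.
    std::vector<float> full;
    for (int p = 0; p < numNodes; ++p)
        full.insert(full.end(), slice[p].begin(), slice[p].end());
    return std::vector<std::vector<float>>(numNodes, full);   // full vector on every node
}
```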
Since there is no shared memory between processing nodes, all interprocessor communication is handled by the ring. The hardware does not automatically keep the processors in lock step; for example, they may become out of sync because of branches conditioned on the processor's node number or on the data. However, when the processors must communicate with each other through the ring, hardware synchronization automatically occurs. A node that attempts to read before data is ready, or to write when there is already data waiting, will stop executing until the data can be moved.

An extensive set of software tools has been implemented with an emphasis on improving the efficiency of layered artificial neural network algorithms. This was done by providing a library of matrix-oriented assembly language routines, some of which use node-custom compilation. An object-oriented RAP interface in C++ is provided that allows programmers to incorporate the RAP as a computational server into their own UNIX applications. For those not wishing to program in C++, a command interpreter has been built that provides interactive and shell-script style manipulation of the RAP. A veneer of C functions with distributed matrix and vector structures is also provided.

The RAP DSP software is built in three levels. At the lowest level are hand-coded assembler routines for matrix, vector and ring operations. Many standard matrix and vector operations are currently supported, as well as some operations specialized for efficient back-propagation. There is also a UNIX-compatible library including standard input, output, math and string functions. An intermediate level consists of matrix and vector object classes coded in C++. A programmer writing at this level or above can view the RAP as a conventional serial machine. These object classes divide the data and processing as evenly as possible among the available processing nodes, using the ring to redistribute data.

The top level of RAP software is the Connectionist Layered Object-oriented NEtwork Simulator (CLONES) environment for constructing ANNs [12]. The goals of the CLONES design are efficiency, flexibility and ease of use. Experimental researchers often generate either a proliferation of versions of the same basic program, or one giant program with a large number of options and many potential interactions and side-effects. By using an object-oriented design, we attempt to make the most frequently changed parts of the program very small and well localized. The parts that rarely change are in a centralized library.
Fig. 4 Examples of the layer and connection class hierarchies in CLONES: the Layer hierarchy includes Binary and Analog layers with subclasses such as Radial, Softmax, Weighted Sum, Linear, and Sigmoid; the Connection hierarchy includes Full, Sparse, and Bus connections.
One of the many advantages of an object-oriented library for experimental work is that any part can be specialized by making a new class of object that inherits the desired operations from a library class. In CLONES, there are two class hierarchies for constructing ANNs: layer and connection (Figure 4). A layer represents a collection of units that have some internal representation. For example, the representation may be a
floating point number for each unit (analog layer), or it may be a set of unit indices, indicating which units are active (binary layer). These analog and binary layers are built into the CLONES library as subclasses of the class layer. The analog layer in turn has subclasses for various transfer functions. A layer class also has functions that initialize it for a forward or backward pass, and generate activations or errors from the partial results produced by its connections.

A connection class includes two functions: one that transforms activations from the incoming layer into partial results in the outgoing layer, and one that takes outgoing errors and generates partial results in the incoming layer. The structure of a partial result is part of the layer class. Connection classes include: Bus (one to one), Full (all to all) and Sparse (some to some). In order to do its job efficiently, a connection must know something about the internal representation of the
layers that are connected. By using overloaded functions, a connection class may have several forward and backward functions for different pairs of layer classes. The connection function selected depends not only on the class of connection, but also on the classes of the two layers that are connected. Not all connection classes are defined for all pairs of layer classes. However, connections that convert between layer classes can be utilized to quickly compensate for missing functions. This structure allows the user to view layers and connections much like tinker-toy wheels and rods. ANNs are built up by creating layer objects and passing them to the create functions of the desired connection classes. Changing the interconnection pattern usually does not require any changes to the layer classes or objects and vice-versa.
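The description above can be summarized in a small interface sketch (illustrative C++; the names and signatures are ours, not the CLONES API): layers own their internal representation and turn partial results into activations or errors, while connections provide a forward and a backward transformation whose overloads depend on the classes of the two layers they join.

```cpp
// Interface sketch of the CLONES-style class structure described above
// (illustrative; names and signatures are ours, not the CLONES API).
struct PartialResult;                    // its layout is owned by the layer class

class Layer {                            // a collection of units
public:
    virtual void initForward()  = 0;     // prepare for a forward pass
    virtual void initBackward() = 0;     // prepare for a backward pass
    virtual void activate(const PartialResult& in)      = 0;  // partials -> activations
    virtual void computeErrors(const PartialResult& in) = 0;  // partials -> errors
    virtual ~Layer() = default;
};

class AnalogLayer : public Layer { /* one floating point value per unit;
                                      specialized for Linear, Sigmoid, ... */ };
class BinaryLayer : public Layer { /* indices of the active units only */ };

class Connection {                       // joins an incoming and an outgoing layer
public:
    // Forward: incoming activations -> partial results in the outgoing layer.
    // Backward: outgoing errors -> partial results in the incoming layer.
    // Overloads per layer-class pair let a connection exploit each layer's
    // internal representation (not every pair need be defined).
    virtual void forward(const AnalogLayer& in, AnalogLayer& out) = 0;
    virtual void forward(const BinaryLayer& in, AnalogLayer& out) = 0;
    virtual void backward(AnalogLayer& in, const AnalogLayer& out) = 0;
    virtual ~Connection() = default;
};

class FullConnection   : public Connection { /* all to all   */ };
class SparseConnection : public Connection { /* some to some */ };
class BusConnection    : public Connection { /* one to one   */ };
```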
8 RAP Applications

The RAP has now been used for a large number of studies in feature extraction for continuous speech recognition [19][20]. We have experimented with a number of connectionist architectures, but one of the most successful has been a simple but large network that was referred to above as the 234→1024→61 architecture. This net consisted of the following components:

1. Input layer: 9 groups of 26 inputs, one for the current 10 msec frame of speech and one for each of four frames into the past and future (i.e., a 9-frame window of temporal context). The inputs for each frame were 13 coefficients from a Perceptual Linear Prediction (PLP) analysis [10] and 13 coefficients from an estimate of the instantaneous temporal slope of these coefficients.

2. Hidden layer: 1024 units that receive connections from all input units. Experiments showed that significant performance improvements were seen for increases in hidden layer size up to this number.

3. Output layer: 61 units, corresponding to 61 phonetic classes, receiving connections from all hidden units.

While clever shortcuts could have reduced the size of this net, it was an enormous convenience not to have to be concerned with such matters while
doing the feature explorations. Additionally, we got our speed-up (overnight on the RAP rather than 1-2 months on the Sun) without any special programming; for most of our experiments, we used a standard layered network program that was designed at ICSI early in 1990.

As described above, the primary target application for the RAP was back-propagation training of layered neural networks. However, we did not build special-purpose integrated circuits, which could have had a considerable speed advantage, because we wanted a programmable machine.³ While our current uses for the RAP continue to be dominated by back-propagation, we are able to modify the network algorithm daily for our experiments. Furthermore, we have experimented with using the RAP for computations such as the generation of Mandelbrot and Julia sets, computation of radial basis functions, and optimization of multidimensional Gaussian mixtures by a discriminative criterion. We also have used the RAP for calculation of dynamic features (first, second, and third temporal derivatives) to feed the layered network.

While the topology has been optimized for the block matrix operations required in back-propagation, many algorithms can benefit from the fast computation and communication provided by the RAP. In the case of dynamic programming, for instance, one could treat a board as four independent processors that perform recognition on different sentences, thus speeding up a batch run by four. For real-time operation, the reference lexicon would be split up between processors, so that processors only need to communicate once for each speech frame. Thus, the RAP can be used as an SPMD machine (for our matrix operations, as in back-propagation), as a farm of separate SISD machines requiring essentially no intercommunication (as in our current use for offline dynamic programming), or as a MIMD machine with simple and infrequent communication (as in the dynamic programming case for a distributed lexicon).

³ And we wanted it soon!
9 FUTURE DIRECTIONS

The RAP is being used both for our speech research and to run simulations of new architectures that we will implement in the coming years. Our experience with the RAP has also suggested a few modifications to our design strategy for our next machine, which we have tentatively named CNS-1
(Connectionist Network Supercomputer), which has a target of a billion connections between a million units evaluated one hundred times per second:

1. Memory interface: the RAP uses a programmable gate array as a simple memory controller to select either a single bank of dynamic RAM or a single bank of static RAM. The CNS-1 design has much higher memory bandwidth and capacity requirements. To provide a large amount of memory operating at high speed, DRAMs with wider data buses (4 bits rather than 1 bit) will be used in page mode whenever possible. While this forces some additional complexity, the CNS-1 design will rely on a single type of primary storage, so that memory control will actually be simpler.

2. Custom VLSI: the RAP relies on commercial DSP chips for computation and programmable gate arrays for communication. The new system will require much higher levels of performance, so custom VLSI designs will be necessary. These designs can capitalize on moderate fixed-point precision requirements, which we have verified for our connectionist speech training algorithms [2]. As with the RAP's DSP, however, the CNS-1's custom processor will implement a general-purpose instruction set (albeit at a lower rate than the vector-style connectionist operations).

3. Network connectivity: the RAP architecture and software were optimized for fully connected layered networks. An explicit design goal for CNS-1 is the ability to efficiently implement arbitrarily connected and activated networks. In particular, our target network is extremely sparse, and this will need to be incorporated in the design from the beginning.

We are currently in the final design stages for a preliminary microcoded processor chip that is a simplified form of what we will need for this machine [1].
10 SUMMARY

Ring architectures have been shown to be a good match to a variety of signal processing and connectionist algorithms. We have built a Ring Array
Processor using commercial DSP chips for computation and Programmable Gate Arrays for communication. Measured performance for a 10-board system on target calculations is over 2 orders of magnitude higher than we have achieved on a general-purpose workstation. A programming environment has been provided so that a programmer may use familiar UNIX system calls for I/O and memory allocation. This tool is greatly aiding our connectionist research, particularly for the training of speech recognition systems, allowing exploration of problems previously considered computationally impractical.
ACKNOWLEDGEMENTS

Joachim Beer and Eric Allman were major contributors in the early stages of RAP design. Joel Libove provided a detailed hardware design review at critical stages. Herve Bourlard of L&H Speechproducts continues to provide the foundations for the speech application. Chuck Wooters was our first non-developer RAP user, and worked to apply the RAP routines to our speech problems. Krste Asanovic, who is the principal architect of the VLSI machines that we are starting to build, also extended the RAP software to simulate variable precision arithmetic. Components were contributed by Toshiba America, Cypress Semiconductor, and Xilinx Inc., and Texas Instruments provided free emulators to debug the DSPs in-circuit. Finally, support from the International Computer Science Institute for this work is gratefully acknowledged.
Bibliography

[1] K. Asanovic, B. Kingsbury, J. Beck, N. Morgan, and J. Wawrzynek, SPerT: A Microcoded SIMD Array for Synthetic Perceptron Training, International Computer Science Institute Technical Report, in preparation.

[2] K. Asanovic and N. Morgan, Experimental Determination of Precision Requirements for Back-Propagation Training of Artificial Neural Networks, International Computer Science Institute TR-91-036, 1991.

[3] H. Bourlard and N. Morgan, Merging Multilayer Perceptrons and Hidden Markov Models: Some Experiments in Continuous Speech Recognition, International Computer Science Institute TR-89-033, 1989.
[4] H. Bourlard, N. Morgan, and C. Wellekens, Statistical Inference in Multilayer Perceptrons and Hidden Markov Models with Applications in Continuous Speech Recognition, in Neuro Computing: Algorithms, Architectures and Applications, NATO ASI Series, 1990.

[5] H. Bourlard and N. Morgan, Merging Multilayer Perceptrons and Hidden Markov Models: Some Experiments in Continuous Speech Recognition, in Neural Networks: Advances and Applications, E. Gelenbe (Ed.), Elsevier Science Publishers B.V. (North-Holland), 1991.

[6] H. Bourlard and C. Wellekens, Links Between Markov Models and Multilayer Perceptrons, in Advances in Neural Information Processing Systems 1, Morgan Kaufmann, pp. 502-510, 1989.

[7] J. Feldman, M. Fanty, N. Goddard, and K. Lynne, Computing with Structured Connectionist Networks, Communications of the ACM, 1988.

[8] M. Franzini, K. Lee, and A. Waibel, Connectionist Viterbi Training: A New Hybrid Method for Continuous Speech Recognition, Proc. of the IEEE Intl. Conf. on Acoustics, Speech, & Signal Processing, pp. 425-428, Albuquerque, NM, April 1990.

[9] S. Garth, A Chipset for High Speed Simulation of Neural Network Systems, First International Conference on Neural Networks, San Diego, pp. III-443-452, June 1987.

[10] H. Hermansky, Perceptual Linear Predictive (PLP) Analysis of Speech, J. Acoust. Soc. Am. 87 (4), April 1990.

[11] A. Iwata, Y. Yoshida, S. Matsuda, Y. Sato, and N. Suzumura, An Artificial Neural Network Accelerator using General Purpose Floating Point Digital Signal Processors, Proceedings IJCNN 1989, pp. II-171-175.

[12] P. Kohn, CLONES: A Connectionist Layered Object-oriented NEtwork Simulator, International Computer Science Institute Technical Report, in preparation.
[13] S. Kung and J. Hwang, A Unified Systolic Architecture for Artificial Neural Networks, Journal of Parallel and Distributed Computing, April 1989 (Michael Arbib, ed.).

[14] R. Lippmann and B. Gold, Neural Classifiers Useful for Speech Recognition, First Int. Conf. Neural Networks, pp. IV-417, San Diego, CA, 1987.

[15] Y. Le Cun, J. Denker, S. Solla, R. Howard, and L. Jackel, Optimal Brain Damage, in Advances in Neural Information Processing Systems II, D. Touretzky (Ed.), Morgan Kaufmann, San Mateo, 1990.

[16] N. Morgan, J. Beck, P. Kohn, J. Bilmes, E. Allman, and J. Beer, The RAP: a Ring Array Processor for Layered Network Calculations, Proc. of Intl. Conf. on Application Specific Array Processors, pp. 296-308, IEEE Computer Society Press, Princeton, NJ, 1990.

[17] N. Morgan and H. Bourlard, Generalization and Parameter Estimation in Feedforward Nets: Some Experiments, International Computer Science Institute TR-89-017.

[18] N. Morgan and H. Bourlard, Continuous Speech Recognition Using Multilayer Perceptrons with Hidden Markov Models, Proc. IEEE Intl. Conf. on Acoustics, Speech, & Signal Processing, pp. 413-416, Albuquerque, New Mexico, 1990.

[19] N. Morgan, H. Hermansky, C. Wooters, P. Kohn, and H. Bourlard, Phonetically-based Speaker-Independent Continuous Speech Recognition Using PLP Analysis with Multilayer Perceptrons, IEEE Intl. Conf. on Acoustics, Speech, & Signal Processing, Toronto, Canada, 1991, in press.

[20] N. Morgan, C. Wooters, H. Bourlard, and M. Cohen, Continuous Speech Recognition on the Resource Management Database using Connectionist Probability Estimation, ICSI Technical Report TR-90-044; also in Proceedings of ICSLP-90, Kobe, Japan.

[21] T. Robinson and F. Fallside, A Recurrent Error Propagation Network Speech Recognition System, to be published in Computer Speech & Language, 1991.
[22] D. Rumelhart, G. Hinton, and R. Williams, Learning Internal Representations by Error Propagation, in D. Rumelhart and J. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, MIT Press, 1986.

[23] R. Watrous and L. Shastri, Learning Phonetic Features Using Connectionist Networks: An Experiment in Speech Recognition, First Int. Conf. Neural Networks, pp. IV-381-388, San Diego, CA, 1987.

[24] P. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, Ph.D. thesis, Dept. of Applied Mathematics, Harvard University, 1974.

[25] X. Zhang, An Efficient Implementation of the Back-propagation Algorithm on the Connection Machine CM-2, in Advances in Neural Information Processing Systems II, D. Touretzky (Ed.), Morgan Kaufmann, San Mateo, 1990.