Microprocessing and Microprogramming 38 (1993) 529-537 North-Holland


A PROCESSOR ARRAY MODULE FOR DISTRIBUTED, MASSIVELY PARALLEL, EMBEDDED COMPUTING

Lars Bengtsson^a, Kenneth Nilsson^a, and Bertil Svensson^a,b

a Centre for Computer Architecture, Halmstad University, Box 823, S-30118 Halmstad, Sweden
b Department of Computer Engineering, Chalmers University of Technology, S-41296 Göteborg, Sweden
E-mail: [email protected], [email protected], and [email protected]

With the increased degree of miniaturization resulting from the use of modern VLSI technology and the high communication bandwidth available through optical connections, it is now possible to build massively parallel computers based on distributed modules which can be embedded in advanced industrial products. Examples of such future possibilities are "action-oriented systems", in which a network of highly parallel modules performs a multitude of tasks related to perception, cognition, and action. The paper discusses questions of architecture on the level of modules and inter-module communication and gives concrete architectural solutions which meet the demands of typical, advanced industrial real-time applications. The interface between the processor arrays and the all-optical communication network is described in some detail. Implementation issues specifically related to the demand for miniaturization are discussed.

Keywords: Parallel processing; distributed embedded systems; massive parallelism; learning systems; real-time computer systems; optical communication.

1. INTRODUCTION

During the last decade, massively parallel processing has made a major breakthrough in supercomputing, and it is a safe guess that the data-parallel approach to computation will also find application in embedded systems. As a matter of fact, already before the breakthrough came in supercomputing, researchers conducting projects in massively parallel computing identified industrial real-time tasks as an important realm of application. For example, an early use of STARAN was air traffic control [1], and LUCAS was, among other tasks, used for real-time image and signal processing [2, 3]. As an example of efforts to make miniaturized implementations of existing massively parallel supercomputer architectures in order to make them useful as embedded systems, the Blitzen project [4] may be taken. Blitzen is a miniaturized and improved MPP architecture [5], integrating as many as 128 processing elements (PEs) per chip.

The fact that massively parallel processing is now entering the realm of embedded systems implies that considerations like application-tuned architectures, special-purpose I/O processing, and real-time demands are placed in focus. A resulting architecture may be a network of highly parallel modules, each with a specific task, see Figure 1. Such a system is a candidate architecture for "action-oriented systems", which interact with their environments by means of sophisticated sensors and actuators, often with a high degree of parallelism [6]. The ability to learn and adapt to different circumstances and environments is among the key characteristics of such systems. Development of applications based on action-oriented systems relies heavily on training, rather than programming, of the detailed behaviour. The modules perform perceptual tasks close to the sensors, advanced motoric control tasks close to the actuators, or complex calculations at "higher cognitive levels".


In the REMAP project (Real-time, Embedded, Modular, Adaptive, Parallel processor) questions related to this new use of the massively parallel computing paradigm are addressed. The problem areas cover a spectrum from the architecture of individual PEs, modules, and I/O-system organization, via overall system architecture and inter-module communication, to tools, methods and philosophies for application system development. The project work includes the design and implementation of prototype systems and the experimental use of these in various applications. In this paper we discuss questions of architecture on the level of modules and inter-module communication and give concrete architectural solutions which meet the demands of typical, advanced, industrial real-time applications. We also discuss implementation issues specifically related to the demand for miniaturization.

Figure 1. A multi-module architecture for an action-oriented system

Earlier papers from the REMAP project report on detailed studies of artificial neural network (ANN) execution on bit-serial processor arrays [7, 8, 9] as well as a general study and overview of the use of massively parallel processors for ANN computations [10]. A major conclusion is that neural network algorithms map efficiently onto SIMD (Single Instruction stream, Multiple Data streams) architectures, and that certain architectural features (fast multiplication, ring or broadcast communication) are important. [11] describes the potential of the multiple-SIMD model for neural-network based real-time systems, and [12] specifically addresses the use in control applications, including an outline of the required application development environment. A prototype implementation (REMAP-Beta) of a processor array suited for ANN computing, image processing, etc., has been made [13]. It is based on FPGA (Field-Programmable Gate Array) circuits [14] and is now being used for architectural experiments and real-time application studies.

In the present paper we describe the architecture of a module for a distributed system. It includes a high performance optical communication interface over which it connects to other modules. The goal of the future implementation is that such a module, containing 1K PEs or more, can be miniaturized to the size of the palm of one hand.

The rest of the paper is organized as follows: In the next section we introduce the architectural paradigm and the overall system architecture. We then identify the different parts of the processor array module - the Processing Unit, the Communication Unit, and the Communication Interface - and describe each of these in detail. Before concluding the paper we discuss issues of implementation with regard to the goal of achieving the degree of miniaturization that we aim at.

2. OVERALL SYSTEM PHILOSOPHY AND ARCHITECTURE

In the design of advanced real-time systems the principle of resource adequacy [15] may be used in order to achieve predictability. This means that enough processing and communication resources are designed into the system and statically allocated to guarantee that the maximum possible work-load can always be handled. The system architecture that we describe allows the use of this principle. A further discussion of the principles of time-driven control and resource adequacy in the design of parallel, even heterogeneous, real-time systems may be found in [16].

A hypothetical architecture for Artificial Neural Systems (ANSs) was shown in Figure 1. Different modules (SIMD arrays) typically execute different Artificial Neural Network (ANN) models, or different instances of the same model. Full connectivity may be used within the modules, while the communication between modules is expected to be less intensive.

The architecture for action-oriented and other advanced, real-time control systems, which was outlined above, can be seen as an implementation of a more general architectural concept which we now present. It is based on the notions of nodes, channels, and local real-time databases.

Nodes, which differ in functionality, communicate via a shared medium. Input nodes deliver sensor data to the rest of the system and may perform perceptual tasks. Output nodes control actuators and may perform motoric control tasks. Processing nodes perform various kinds of calculations. I/O nodes and processing nodes may have great similarities but, because of their closeness to the environment, I/O nodes have additional circuits for interfacing to sensor and actuator signals.

Communication between nodes takes place via channels. A communication channel is a logical connection on the shared medium between a sending node and one or more listening nodes. The channels are statically scheduled so that the communication pattern required for the application is achieved. This is done by the designer. Two types of data are transported over the medium: program data is distributed to the nodes to allow changes "on the fly" of the cyclically executed programs in the nodes, and process data informs the nodes about the status of the environment (including the states of other nodes). If the application requires intensive communication within a set of related nodes, a hierarchical communication can be set up. The related nodes form a cluster with more available bandwidth on the internal channels. Rather than being individual signals, the process data exchanged between the nodes is more like patterns, often multi-dimensional. Therefore, the shared medium must be able to carry large amounts of information (Gigabits per second in a typical system).

Every node in the system executes its program in a cyclic manner. The cyclically executed program accesses its data from a local real-time database (LRTDB). This LRTDB is updated, likewise cyclically, via channels from the other nodes of the system. The principle of resource adequacy, the cyclic paradigm and the statically scheduled communication via the LRTDBs imply the time-deterministic behaviour of the system which is so important in real-time applications.

The architecture permits two or more modules to be linked together to form a larger module, if necessary. This linking may be done either over the communication medium, in which case the inter-module communication shares time with all modules of the system, or over a separate medium. In the latter case the cooperating modules form a cluster with more available bandwidth for internal communication. Special "dual-port" nodes form the interface between the cluster and the main medium.

As stated earlier, the distributed nodes have cyclically executing programs. These cycles have two parts: the Monitor and the Work Process. The Monitor has the following functions:
• Start at a given time (a new dt has passed).
• Copy output data from the LRTDB to the communication buffer.
• Copy input data from the communication buffer to the LRTDB.
• Handle program changes.

One of the nodes connected to the network is a Development Node, as shown in Figure 2. It establishes a channel to an executing node when it needs to send program changes. Instructions along with address information are sent to the executing node, where the Monitor makes the change between two executions of the Work Process.
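As a concrete illustration of the node cycle just described, the following is a minimal Python sketch of one cycle: the Monitor copies data between the LRTDB and the communication buffer and applies any pending program change, after which the Work Process runs. All names (Node, lrtdb, comm_buffer, dt) are our own illustrations; only the behaviour listed above is assumed.

```python
import time

class Node:
    def __init__(self, dt, work_process):
        self.dt = dt                                  # cycle time (a new dt starts each cycle)
        self.lrtdb = {"in": [], "out": []}            # local real-time database (LRTDB)
        self.comm_buffer = {"in": [], "out": []}      # communication buffer D
        self.work_process = work_process
        self.pending_program = None                   # program change from the Development Node

    def monitor(self):
        # Copy output data from the LRTDB to the communication buffer,
        # and input data from the communication buffer to the LRTDB.
        self.comm_buffer["out"] = list(self.lrtdb["out"])
        self.lrtdb["in"] = list(self.comm_buffer["in"])
        # Handle program changes between two executions of the Work Process.
        if self.pending_program is not None:
            self.work_process = self.pending_program
            self.pending_program = None

    def run_one_cycle(self):
        start = time.monotonic()
        self.monitor()                                # Monitor part (minor part of the cycle)
        self.work_process(self.lrtdb)                 # Work Process part (main part of the cycle)
        remaining = self.dt - (time.monotonic() - start)
        if remaining > 0:                             # resource adequacy: the work always fits in dt
            time.sleep(remaining)

def mirror(db):
    db["out"] = list(db["in"])                        # example Work Process: mirror inputs to outputs

node = Node(dt=0.01, work_process=mirror)
node.comm_buffer["in"] = [1, 0, 1]
node.run_one_cycle()
print(node.lrtdb["in"], node.lrtdb["out"])            # [1, 0, 1] [1, 0, 1]
```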

Figure 2. Multi-node target system and multiple-workstation development system

The Development Node is connected to a Local Area Network (LAN) of workstations (WS) running the development system. The LAN connection can be removed without affecting the running system.

3. THE PROCESSOR ARRAY MODULE

We present a Processor Array Module (Figure 3) which includes: a SIMD Unit for data processing, a Communication Unit for carrying out the communication task of the node and a Communication Interface to a high speed medium. The Processor Array Module architecture reflects the system concept described earlier. Data is processed and stored as a bitslice of a SIMD array. The communication on the high speed medium is bit serial and a communication package is a bitslice, or part of a bitslice, of a SIMD array.

Figure 3. The Processor Array Module, basic blocks

3.1. The SIMD Unit

The data processing unit consists of a massively parallel SIMD array with a high number (typically 1K to 2K) of bit-serial processing elements (PEs). The logical structure of the SIMD array is shown in Figure 3 (CU2, LRTDB (memory), ICN and PE-array). Each PE has its own memory (where mainly weight matrices are housed). The PEs communicate through the InterConnection Network (ICN) described in a further section below. The activity is controlled by the microprogrammed Control Unit (CU2), executing the program algorithm.

3.1.1. The InterConnection Network (ICN)

The InterConnection Network (ICN) serves the purpose of interchanging data between PEs within the processing modules. Such communication occurs both locally (nearest neighbour) and globally (one-to-all broadcast). The linear array configuration that we use handles the short-path local data transfers. However, the long-distance global communication, required in, e.g., some neural net models, imposes more problems when it comes to physical realization. A broadcast line with, say, 2048 PEs needs buffer drivers because of the high fan-out seen from the sending PE. Now when we design for VLSI implementation in CMOS there is a way of arranging these buffers which solves both the broadcast "fan-out" problem and a "fan-in" problem found in many algorithms. For example, in one phase of the neural net Backpropagation algorithm, the sum of all elements of a vector distributed over all the PEs is required. Previously, we used an "adder-tree" or a "ring configuration" to form this sum. These solutions are now supplemented with a new one, suited for modules with a high number of PEs, which combines all these aspects (broadcast with a proper physical realization, and the backpropagation "PE-add scheme"). This solution is called the "broadcast-tree" (Figure 4) and it is well suited to modules with a large number of PEs (many thousands) due to the logarithmic relationship between the number of levels in the tree and the number of PEs.

Figure 4. The broadcast-tree with 3 levels and 8 PEs

The maximum fan-out seen in a physical realization is only two! The maximum delay imposed by the ICN is 2*(log2 N - 1)*BD, where BD equals the delay through one buffer. As an estimate, if we assume a silicon CMOS process with BD = 1 ns, and 2048 PEs, we get a maximum delay of 20 ns, which is acceptable if
ICN communication does not have to take place every clock cycle (at, say, 50 MHz). The number of buffer stages equals N - 2. However, the increase in area is relatively small because the buffers are very simple. The only requirements are that they be bi-directional and tri-statable. About 6-8 transistors (small ones because of the very low drive requirement) are sufficient. Compared to a PE size of about 1000 transistors this is not significant. The wires connecting the buffers to each other, and the PEs to the buffers, worsen this slightly.
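The figures quoted above can be reproduced with a small calculation. The helper below is illustrative only; the formula, the 1 ns buffer delay and the 2048-PE case are taken from the text.

```python
import math

def broadcast_tree_metrics(n_pes, buffer_delay_ns=1.0):
    levels = int(math.log2(n_pes))                        # levels in the broadcast-tree
    max_fanout = 2                                        # each stage drives at most two loads
    n_buffers = n_pes - 2                                 # buffer stages in the tree
    max_delay_ns = 2 * (levels - 1) * buffer_delay_ns     # worst-case delay through the ICN
    return levels, max_fanout, n_buffers, max_delay_ns

# With a CMOS buffer delay of 1 ns and 2048 PEs this gives the 20 ns estimate above:
print(broadcast_tree_metrics(2048, 1.0))                  # (11, 2, 2046, 20.0)
```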

3.2. The Communication Unit

The Communication Unit carries out the communication task of the node. It communicates both data and instructions (commands) on the shared medium. Data is transmitted between the nodes in the system. Instructions (commands) are transmitted from the development node to the other nodes in the system. This allows incremental development, using the running system as a development platform, and easy change of the system behaviour "on the fly".

We present an implementation with optical communication. The shared medium is an all-optical network (the entire path between end-nodes is passive and optical) (Figure 5). Such a network has high bandwidth (25,000 GHz) and low bit-error rate (10^-15) [17].

Figure 5. The All-Optical Network

A communication channel between nodes is implemented as a time window on the shared medium. In this time window only one node is allowed to send and one or more nodes are allowed to listen (broadcast). The time window is further implemented as one send instruction in the sending node's program and one receive instruction in the listening nodes' programs. Implementing a communication channel in such a way means that the nodes in the system must be synchronized very carefully [18]. On the other hand it gives a simple and efficient communication protocol.

An optical communication medium gives the possibility to use different wavelengths for the information flow on the shared medium (wavelength multiplexing). This means that parallel communication channels can be implemented with an increase in bandwidth. We have chosen in this implementation to allocate three different wavelengths: one wavelength for data, one wavelength for instructions to the Communication Unit, and one wavelength for instructions to the Processing Unit. Our implementation uses devices like laser diodes, PIN diodes for the optical-to-electrical conversion, and passive wavelength multiplexers and demultiplexers. To fully take advantage of the wavelength multiplexing technique, devices like tunable optical filters and tunable laser diodes must be used. These devices are still laboratory devices and will be practical within three to five years [17].

Figure 6 shows how the information transmitted on the shared medium is split up optically (wavelength demultiplexing) into data (D) and instructions (I).

Figure 6. The Processor Array Module, data and control path


Instructions are further split up into instructions for the Communication Unit (I1) and instructions for the SIMD Unit (I2). The information is converted from serial to parallel format (S/P) in the Communication Interface. Both Control Units (CU) execute their programs cyclically. As stated earlier, the execution cycle can be divided into two parts: the Monitor and the Work Process. In the Work Process part (the main part of the cycle) CU1 controls the Communication task and CU2 controls the Data Processing task of the node. To carry out the Communication task, CU1 controls the communication buffers (D, I1 and I2) by executing the statically scheduled send/receive instructions. The Data Processing task is carried out by CU2, which controls the SIMD-array according to the application program.

The Monitor part (the minor part of the cycle) handles the task of updating the node, which means an update of data and a change of the programs. The update of data is controlled by the Monitor part in CU2 and means: 1) copy data from the LRTDB to the communication buffer D (data which have been processed in cycle i-1, and which should be transmitted in cycle i); 2) copy data from the communication buffer D to the LRTDB (data which have been received in cycle i-1, and which should be processed in cycle i). Two programs can be changed: the program in the Communication Unit and the program in the SIMD Unit. The Monitor part in CU1 handles the change of the program in the Communication Unit with the received instructions in the communication buffer I1. The Monitor part in CU2 handles the change of the program in the SIMD Unit with the received instructions in the communication buffer I2.

The implementation described above implies that the Processing Unit processes one cycle old data. If the application requires a shorter delay than one cycle between input and output in different nodes, the implementation described cannot meet this requirement. By letting processes in different nodes communicate directly via their LRTDBs, without the intermediate storing in the communication buffer D (which thus can be omitted), a shorter delay can be achieved. This implies, however, a more complex algorithm for the scheduling of time windows than before [19]. In this case the LRTDBs can be implemented as dual-port memories, one port being controlled by CU1 for communication and the other port by CU2 for processing. The scheduling algorithm guarantees collision-free access to the LRTDBs. As a result the complexity is moved from the nodes to the development tool.
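To illustrate how the statically scheduled channels and the three-wavelength allocation (D, I1, I2) could be represented by the development tool, here is a hypothetical Python sketch. The channel table, node names and the property checked (at most one sender per time window and wavelength) are our own illustration and are not taken from the paper.

```python
from dataclasses import dataclass

WAVELENGTHS = ("D", "I1", "I2")   # data, Communication Unit instructions, SIMD Unit instructions

@dataclass(frozen=True)
class Channel:
    window: int          # time window within the communication cycle
    wavelength: str      # one of WAVELENGTHS
    sender: str          # exactly one sending node per window and wavelength
    listeners: tuple     # one or more listening nodes (broadcast)

def check_schedule(channels):
    """Static check: at most one sender per (time window, wavelength) pair."""
    seen = {}
    for ch in channels:
        assert ch.wavelength in WAVELENGTHS, f"unknown wavelength {ch.wavelength}"
        key = (ch.window, ch.wavelength)
        assert key not in seen, f"collision in window {ch.window} on {ch.wavelength}"
        seen[key] = ch.sender

# A toy schedule: sensor data broadcast, processed data forwarded, and a program
# change sent "on the fly" from the development node on the I2 wavelength.
schedule = [
    Channel(0, "D",  "input_node",  ("proc_node_1", "proc_node_2")),
    Channel(1, "D",  "proc_node_1", ("output_node",)),
    Channel(2, "I2", "dev_node",    ("proc_node_2",)),
]
check_schedule(schedule)
```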

3.3. The Communication Interface

Data is processed and stored as a bitslice of a SIMD array. The communication is bit serial and in every scheduled time window a communication package is sent. A communication package is a bitslice, or part of a bitslice, of a SIMD array. The communication medium used is optical. Thus the two main tasks of the Communication Interface are: to do a serial-to-parallel (S/P) conversion and to do an optical-to-electrical conversion (O/E). The Communication Interface includes (Figure 7): a laser diode on the sending end (E/O), a PIN diode on the receiving end (O/E), a fast shift register (SR), encoding/decoding circuits, multiplexing/demultiplexing circuits, a clock recovery circuit on the receiving end, and circuits for multiplying (mul c) and dividing (div a/b) the local module clock (C1).


Figure 7. The Communication Interface

The Communication Interface serializes the bitslice in the following way: the multiplexer divides the bitslice, or part of the bitslice, into smaller parts which are encoded before they are shifted out on the
communication medium with the shift register (SR). Encoding the signal serves two purposes: to get a DC-balanced signal and to allow clock recovery at the receiving end. At the receiving end the opposite takes place, namely decoding and demultiplexing. Thus, the Communication Interface performs the conversion from/to a high speed serial optical signal (Gbps) to/from a low speed parallel electrical signal (MHz). Due to the high speed, the Communication Interface must be integrated into one IC to work properly. GaAs is a suitable technology to use, which also gives the possibility to integrate optical devices, such as laser diodes and PIN diodes, with logic.

Example: C1 = 320 MHz, a = 256, b = 16, c = 20 gives an effective transmission rate of 16 x 320 Mbit/s = 5.12 Gbit/s through the Communication Interface. The speed on the medium is then 6.4 Gbit/s due to the encoding of the data.

The clock recovery at the receiving end can be implemented using different techniques. A digital Phase Locked Loop (PLL) is a common technique, with the drawback of hysteresis. Simpler clock recovery techniques can be implemented if, for example, Manchester encoding of the data is used. The drawback of this encoding technique is a higher transmission rate on the communication medium for the same effective transmission rate through the Communication Interface. Due to the choice of optical communication there is a possibility to transmit both data and clock the same physical way, namely by using two different wavelengths, one for the clock and one for the data. This implies that no clock recovery circuit is needed at the receiving end, with the drawback of the increased cost for an extra laser diode and a passive wavelength multiplexer at the sending end, and an extra PIN diode and a passive wavelength demultiplexer at the receiving end.
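The example figures and the Manchester trade-off can be made explicit with a small sketch. The interpretation of the parameters is an assumption on our part: b data bits are encoded into c line bits (16 into 20 in the example above), and the shift register is clocked at c times the module clock C1.

```python
def interface_rates(c1_mhz, b, c):
    # Assumed roles: b data bits become c encoded line bits; serializer clock = c1 * c.
    line_rate_gbps = c1_mhz * c / 1000.0     # bit rate on the optical medium (6.4 for the example)
    effective_gbps = c1_mhz * b / 1000.0     # effective rate through the interface (5.12)
    return line_rate_gbps, effective_gbps

def manchester_encode(bits):
    # Manchester encoding (one common convention): every data bit becomes two line bits,
    # which doubles the rate on the medium for the same effective rate, as noted above.
    out = []
    for bit in bits:
        out += [0, 1] if bit else [1, 0]
    return out

print(interface_rates(c1_mhz=320, b=16, c=20))   # (6.4, 5.12)
print(manchester_encode([1, 0, 1, 1]))           # [0, 1, 1, 0, 0, 1, 0, 1]
```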

4. MINIATURIZATION

4.1. Physical outline of the SIMD-node

The size requirement of the complete computing module (about the palm of one hand) implies that the SIMD-node must be highly integrated. The use of MCM packaging is very appealing for massively parallel systems where thousands of PEs are included. The status of MCM today is that it is not yet suited for mass production. Problems of getting "known good die" and the like are currently limiting MCM use in industry. However, this is surely to be overcome and we strongly believe in this packaging technology for the future. In our system, integrating up to 2K of PEs in one package with such small dimensions dictates the need for MCM. Also, by eliminating many of the bond-wires for chip-to-chip communication, replacing these with small-dimension metal wires, and using the 'flip-chip' technique, speed performance is increased.

Traditionally, PE chips for massively parallel computers [4, 20] integrate PE logic and static RAM cells on the same chip. This is because the I/O pin count would be far too high if the memory were to reside off-chip. Also, speed would suffer. With our approach, memory may be off-chip and the PE-chip to memory-chip connections are done via metal wires on the MCM substrate (instead of bond-wires). The RAMs may now be fabricated in a high-density DRAM technology, the PE-chip in a technology best suited for logic, and the two kinds of chip can be cut out of their respective wafers and assembled together in the MCM package. Speed will not suffer as much in this approach because of the eliminated bond-wires. The density of metal wires possible today in MCM-C (ceramic) is about 800 per cm [21]. A chip side measure in the range of 10-15 mm enables roughly 1K of metal wires per side. Using a configuration of three chips, one PE-chip in the middle and RAM-chips on the upper and lower sides, gives a package with 2K PEs (Figure 8).

Figure 8. The MCM package
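A back-of-the-envelope check of the wiring figures above; the 12.5 mm chip side is an assumed value within the 10-15 mm range quoted in the text.

```python
def wires_per_chip_side(density_per_cm=800, side_mm=12.5):
    return int(density_per_cm * side_mm / 10.0)   # metal wires available along one chip side

print(wires_per_chip_side())                      # about 1000 wires per side
print(2 * wires_per_chip_side())                  # two sides (two RAM chips) -> about 2000 PE connections
```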


5. FUTURE IMPLEMENTATION TECHNOLOGY

The speed limiting factor in the current MCM design is mainly due to the track-to-track capacitance between the metal wires connecting the memory and PE chips. Also, bond-wire inductance limits the signal transition time for off-package communication. A further speed-limiting factor is that switching at high speed requires a lot of current; consequently a lot of power is dissipated at the output buffers. The total power budget of the chip may thus limit the overall speed. Also, cross-talk is a problem that gets increasingly severe with higher switching speed.

An appealing alternative to investigate would be the use of an all-optical communication scheme both chip-to-chip and package-to-package (if modules with higher numbers of PEs are desired). A digital GaAs VLSI process could be used. Not only does this offer higher clock speeds than CMOS (due to the higher electron mobility in GaAs), it also opens the possibility of integrating both optics and electronics on the same chip [22]. This is today at a research stage. An AlGaAs process is used and a multi-quantum well (MQW) structure is built up with layers of AlGaAs and GaAs to implement a transmission modulator that modulates an externally supplied laser source [23]. At the receiving end, a PIN diode could be used. Reports of multi-gigabit transmissions have been made [24].

6. CONCLUSIONS

We have presented an implementation of a distributed, massively parallel computer, attractive for a variety of demanding real-time processing tasks in advanced embedded systems. Of special interest is the potential of the architecture to execute multiple-neural-network systems in action-oriented computers that interact with the environment by means of highly parallel sensors and actuators. The way the architecture is organized suits the needs of such applications perfectly: the modules have a regular structure for easy miniaturization, and the means of synchronization and communication between modules relies on optical links, allowing easy distribution of modules. Future technology development strengthens even further the combination of electric and optic technology used.

Embedded computing calls for architectures which can be adapted to the varying processing needs and are flexible enough to allow changes in function even after installation, i.e. when the system is in use. The modular solution, the flexible communication structure, allowing logical rearrangements of the physical hardware, and the program execution model all serve this purpose.

A prototype implementation of a processing module has been made. This is presently used for evaluations of architectural variations and for application experiments. An implementation of a multi-module system as described in this paper is under way.

REFERENCES

1. Rudolph, J.A. and K.E. Batcher (1982). A productive implementation of an associative array processor: STARAN. In D.P. Siewiorek, C.G. Bell, and A. Newell, Computer Structures: Principles and Examples, McGraw-Hill, New York, pp. 317-331.
2. Svensson, B. (1983). Image operations performed on LUCAS - an array of bit-serial processors. Proceedings of the 3rd Scandinavian Conference on Image Analysis, Copenhagen, Denmark, July 1983, pp. 308-313.
3. Ohlsson, L. (1984). An improved LUCAS architecture for signal processing. Technical Report, Dept. of Computer Engineering, University of Lund, Sweden.
4. Blevins, D.W., E.W. Davis, R.A. Heaton, and J.H. Reif (1990). BLITZEN: A highly integrated massively parallel machine. J. Parallel and Distributed Computing, Vol. 8, No. 2, pp. 150-160.
5. Batcher, K.E. (1980). Design of a massively parallel processor. IEEE Transactions on Computers, Vol. C-26, No. 2, pp. 174-177.
6. Arbib, M.A. (1989). Schemas and neural networks for sixth generation computing. J. Parallel and Distributed Computing, Vol. 6, No. 2, pp. 185-216.
7. Svensson, B. and T. Nordström (1990). Execution of neural network algorithms on an array of bit-serial processors. Proceedings of 10th International Conference on Pattern Recognition: Computer Architectures for Vision and Pattern Recognition, Atlantic City, NJ, USA, June 1990, Vol. II, pp. 501-505.
8. Nordström, T. (1991). Sparse distributed memory simulation on REMAP3. Research Report No. TULEA 1991:16, Luleå University of Technology, Luleå, Sweden.
9. Nordström, T. (1991). Designing parallel computers for self organizing maps. Research Report No. TULEA 1991:17, Luleå University of Technology, Luleå, Sweden.
10. Nordström, T. and B. Svensson (1992). Using and designing massively parallel computers for artificial neural networks. J. Parallel and Distributed Computing, Vol. 14, No. 3, pp. 260-285.
11. Svensson, B., T. Nordström, K. Nilsson, and P.A. Wiberg (1992). Towards modular, massively parallel neural computers. Proceedings of First Swedish National Conference on Connectionism, Skövde, Sweden, September 1992 (in publication). Available as Research Report CDv202 from Centre for Computer Architecture, Halmstad University, Sweden.
12. Nilsson, K., B. Svensson, and P.A. Wiberg (1992). A modular, massively parallel computer architecture for trainable real-time control systems. AARTC '92: 2nd IFAC Workshop on Algorithms and Architectures for Real-Time Control, Seoul, Korea, Aug. 31 - Sept. 2.
13. Bengtsson, L., A. Linde, B. Svensson, M. Taveniku, and A. Åhlander (1993). The REMAP massively parallel computer platform for neural computations. MicroNeuro 1993: Proceedings of the Third International Conference on Microelectronics for Neural Networks, Edinburgh, Scotland, 6-8 April, 1993.
14. Linde, A., T. Nordström, and M. Taveniku (1992). Using FPGAs to implement a reconfigurable highly parallel computer. Second International Workshop on Field-Programmable Logic and Applications, Vienna, Austria, Aug. 31 - Sept. 2.
15. Lawson, H.W., with contributions by B. Svensson and L. Wanhammar (1992). Parallel Processing in Industrial Real-Time Applications. Prentice-Hall, Englewood Cliffs.
16. Lawson, H.W. and B. Svensson (1993). An architecture for time-critical distributed/parallel processing. Proceedings of Euromicro Workshop on Parallel and Distributed Processing, Gran Canaria, Spain, Jan. 1993. Published by IEEE Computer Society Press.
17. Green, P.E. (1991). The future of fiber-optic computer networks. Computer, Vol. 24, No. 9.
18. Wiberg, P. and K. Nilsson (1993). Node synchronization in a distributed system. Research Report, Halmstad University, Halmstad, Sweden.
19. Wiberg, P. (1993). Distributed system for time-deterministic execution and incremental development. Proceedings of International Workshop on Mechatronical Computer Systems for Perception and Action, Halmstad, Sweden, June 1-3, 1993.
20. Hammerstrom, D. (1990). A VLSI architecture for high performance, low-cost, on-chip learning. In International Joint Conference on Neural Networks, Vol. 2, pp. 537-543, San Diego.
21. Electronic Design (1992). MCMs: 11 experts debate the future. Nov. 25, Vol. 40, No. 24, pp. 157-171.
22. Harrold, S.J. (1993). An Introduction to GaAs IC Design. Prentice Hall International (UK), pp. 151-158.
23. Whitehead, M. et al. (1988). Effects of well width on the characteristics of GaAs/AlGaAs multiple quantum well electroabsorption modulators. Appl. Phys. Lett., Vol. 53, p. 956.
24. Nobuhara, H. et al. (1988). Monolithic pin-HEMT receiver for long wavelength optical communications. Electron. Lett., Vol. 24, pp. 1246-1248.
