What Computer Architecture Can Learn From Computational Intelligence — And Vice Versa

Ronald Moore, Bernd Klauer, Klaus Waldschmidt

J. W. Goethe-University, Technische Informatik, P.O. Box 11 19 32, 60054 Frankfurt, Germany
Email: {moore, klauer, waldsch}@ti.informatik.uni-frankfurt.de

Published in the Proceedings of the 23rd Euromicro Conference, Budapest, Hungary, Sept. 1997

Abstract

This paper considers whether the seemingly disparate fields of Computational Intelligence (CI) and computer architecture can profit from each other's principles, results and experience. In the process, we identify important common issues, such as parallelism, distribution of data and control, granularity and regularity. We present two novel computer architectures which have profited from principles found in CI, and identify two constraints on CI intended to eliminate the hidden influence of the von Neumann model of computation.

Keywords: Neuro-Computing, Parallel and Distributed Systems, Multithreaded Architecture, Memory systems and management, Parallelizing compilers.

1. Introduction

What, if anything, can computer architecture learn from Computational Intelligence (CI)? Conversely, what, if anything, can CI learn from computer architecture? In order to pose these questions more precisely, we first need to clarify what we mean by "Computational Intelligence" and "computer architecture":

Computational Intelligence: We take Computational Intelligence (CI) to be the study of models of cognition built from primitive numerical operations, without explicit knowledge representation or symbolic, syntactical models of reasoning (compare [5]). Instead, CI attempts to show how cognitive capabilities such as reasoning and memory can be emergent properties of systems composed of numerical, non-cognitive components. CI includes most artificial neural networks (ANNs) and some evolutionary and fuzzy models [29]. Of particular interest are hybrid models, which combine more than one of these three fields. One could also characterize CI as the intersection of cognitive modeling and Complex Adaptive Systems (in the sense used in [13]).

Computer Architecture: For the purposes of this paper, we distinguish (somewhat arbitrarily) between computer architecture and circuit design. Without drawing a sharp boundary, we treat computer architecture as typically operating at a higher level of abstraction than circuit design: circuit design builds individual components, out of which computer architecture constructs entire computational systems. As such, when we ask "What can computer architecture learn from CI?", we do not mean to ask "How can CI be implemented as hardware components?". That question has been extensively examined elsewhere (see e.g. [17, 24, 23]) and is outside the scope of this paper. Instead, we want to pursue the possibility that the basic principles explored in CI could have a beneficial impact on the design of future computer architectures — and vice versa. Of course, a computer built on CI principles should also be useful for implementing CI programs. The long-term motivation for our research is the development of programmable architectures for the simulation (computation) of parallel, distributed and object-oriented systems.

This work was supported in part with funds of the Deutsche Forschungsgemeinschaft under reference number WA 357/11-2 within the priority program "System and Circuit Technology for Massively Parallel Computation".

The rest of this paper is organized as follows: Section 2 identifies and discusses common issues in CI and computer architecture. Sections 3 and 4 present two novel architectures influenced by CI principles: the Associative Dataflow ARChitecture (ADARC) and the Self-Distributing Associative ARChitecture (SDAARC). Section 5 analyzes these architectures in light of the issues identified in section 2. Section 6 reverses the question and asks "What can CI learn from computer architecture?". Section 7 presents our conclusions.

2. Common Issues in CI and Computer Architecture

The most fundamental commonality between CI and computer architecture is their interest in novel models of computation. When developing and evaluating these models, the following issues are essential for both computer architecture and CI:

Parallelism: Computer architecture has been searching for decades for parallel alternatives to the Turing and von Neumann models. However, every attempt to move away from the "von Neumann bottleneck" (sequential computation) toward parallel systems opens "the Pandora's box of questions . . . closed four decades ago by the von Neumann model" ([2], p. 3). A similar dissatisfaction with the symbolic, syntactical models of Artificial Intelligence (AI) led to the renewed interest in Artificial Neural Networks (ANNs) and CI in the 1980s, in large part due to Hopfield's work on the emergent abilities of ANNs and to the cognitive models of the PDP Research Group [22, 16]. Parallelism is essential and inherent to these models.

Programmability: Programming parallel computers must become easier if parallel computers are to achieve wider acceptance. Programmability has thus become perhaps the central issue in computer architecture today. One often-quoted advantage of CI systems is that "they do not need to be programmed" (e.g. [20]); the truth of the matter is considerably more complicated (compare [12]). Nonetheless, CI systems represent examples of comprehensible, intuitively appealing, yet inherently parallel programs. Understanding the design and evaluation of CI systems should thus help develop new programming paradigms for parallel computation.

Distribution of Control: Rumelhart and McClelland identified the use of "distributed, not central, control" as one of the key features of CI models of cognition [21]. This characteristic has been implemented with varying degrees of completeness in various CI models, a point we will return to in section 6. For computer architecture, distributing the executive function is also a clear priority, but no consensus is evident. Different architectures distribute control at different levels: multicomputers, for example, employ multiple processors, while modern microprocessors contain deep pipelines and other parallel structures (compare [10]).

Distribution of Data: For parallel architectures without a shared memory (including those with distributed virtual shared memories), distributing data amongst the various local memories is a difficult problem. The problem of data distribution takes a different form in CI systems: here the question is normally between localized and distributed representations. However, here also we find that many CI models fall back on the conventional model of random access memory, implicitly reintroducing problems of data distribution (see section 6).

Granularity: One of the major themes of this paper is the concept of granularity. A grain is a unit of computation performed by one component without communication or coordination with other components. Levels of granularity can be smaller than an instruction, instruction-level, task-level, process-level or program-level (for example). Similarly, one of the main differences between various neuron models is the complexity of the function they compute. Binary threshold neurons are simpler (and thus finer-grain) than units with a sigmoid activation function, which are in turn simpler than radial-basis function (RBF) neurons (compare the sketch at the end of this section). Granularity is also an important concept when analyzing network structure. For example, a Hopfield network displays only two levels of granularity (the individual neuron and the whole network), while a multi-layer perceptron (MLP) has an extra level of granularity (that of a layer).

Regularity: Some CI systems are inherently regular, having a static, unchanging structure, while others are inherently irregular, having a dynamic structure. Amongst ANNs, examples of regular systems include Hopfield networks and feed-forward multi-layer perceptrons (MLPs), while irregular structures include cascade correlation networks, self-pruning networks, evolutionarily specified networks and learning vector quantization. While computer architectures are not usually classified as regular or irregular, the programs which run on them are. Examples of regular programs are those which perform operations on vectors and matrices of fixed size, whereas irregular programs operate on arbitrary structures such as trees and graphs. Many architectures are specialized for regular programs.

Other important issues include adaptivity and scalability. To see what implications these issues can have when designing computer architectures, we examine two example architectures in sections 3, 4 and 5. We examine the implications of these issues for CI in section 6.
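To make the comparison of neuron-model grain sizes concrete, the following sketch contrasts the three unit types mentioned above. It is purely illustrative (the function names and the use of numpy are our own assumptions, not part of any architecture discussed later).

```python
import numpy as np

# Three neuron models of increasing computational grain size. Each maps an
# input vector x and a parameter vector to a single scalar output.

def threshold_unit(x, w, theta=0.0):
    # Binary threshold neuron: one dot product and one comparison (finest grain).
    return 1.0 if np.dot(w, x) >= theta else 0.0

def sigmoid_unit(x, w, bias=0.0):
    # Sigmoid neuron: dot product followed by a smooth nonlinearity.
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + bias)))

def rbf_unit(x, c, width=1.0):
    # Radial-basis-function neuron: squared distance to a centre c (coarsest grain here).
    return float(np.exp(-np.sum((x - c) ** 2) / (2.0 * width ** 2)))

x = np.array([0.5, -1.0, 2.0])
w = np.array([1.0, 0.3, -0.2])
print(threshold_unit(x, w), sigmoid_unit(x, w), rbf_unit(x, w))
```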

3. The Associative Dataflow ARChitecture (ADARC)

Starting in 1993, we proposed the Associative Dataflow ARChitecture (ADARC) [25, 11, 26]. ADARC is an instruction-level parallel, programmable architecture belonging to the class of dataflow computers. While suitable for general applications, ADARC's design is particularly well suited to the computation of regular CI structures.



3.1. Background: Dataflow Computation

Before introducing ADARC, we quickly review the essentials of the dataflow model of computation. In this model, a program (i.e. a computational process) is modeled as a graph, where the vertices represent operations and the directed edges represent data transfers between operations. Dataflow Graphs (DFGs) are interesting for several reasons:





- They provide a maximally parallel representation of an algorithm: operations not connected by a path through the graph are independent of each other and can in principle be executed in parallel. DFGs do not contain dependencies caused by the side-effects of memory usage (so-called anti-dependencies, see [1]); anti-dependencies appear in DFGs only where random-access data structures (arrays) are utilized.

- The firing (execution) of a vertex is controlled only by the presence of data on the incoming edges of that vertex: computation is thus data-driven and not control-driven.

- DFGs show interesting similarities with (and differences from) ANNs (see [14]). In essence, both DFGs and ANNs are graph-based, parallel, data-driven systems with distributed data and control.

While DFGs were originally seen as a model which could be used directly to structure computer architectures [7, 4], they are increasingly seen instead as a guiding principle.
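As an illustration of the data-driven firing rule, here is a minimal DFG interpreter; the graph encoding and the operation set are illustrative assumptions of ours, not ADARC's internal representation.

```python
import operator

ops = {"+": operator.add, "*": operator.mul}

# Dataflow graph for y = (a + b) * (a + c): vertex -> (operation, input edges).
graph = {
    "t1": ("+", ["a", "b"]),
    "t2": ("+", ["a", "c"]),
    "y":  ("*", ["t1", "t2"]),
}

def run(graph, inputs):
    tokens = dict(inputs)        # edges currently carrying data
    pending = dict(graph)        # vertices that have not fired yet
    while pending:
        # A vertex fires as soon as all of its operands are present; all
        # ready vertices are mutually independent and could fire in parallel.
        ready = [v for v, (_, srcs) in pending.items()
                 if all(s in tokens for s in srcs)]
        if not ready:
            raise ValueError("graph is cyclic or has missing inputs")
        for v in ready:
            op, srcs = pending.pop(v)
            tokens[v] = ops[op](*(tokens[s] for s in srcs))
    return tokens

print(run(graph, {"a": 1, "b": 2, "c": 3})["y"])   # (1 + 2) * (1 + 3) = 12
```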

   



3.2. ADARC Configuration

Figure 1. Connection Scheme for ADARC (a crossbar of associative switches).

The central concept upon which ADARC is built is an Associative Communication Network (ACN); see figure 1. The ACN is a crossbar switch in which each processor writes onto a dedicated horizontal line and reads from one or more dedicated vertical lines. The ACN can be regarded as the fusion of an associative memory and a conventional crossbar. After each operation, each processor sends a data packet to the network. Each packet consists of an identifier and a data word. Each processor then specifies the identifiers of the packets it wishes to receive (the operands for the next operation). The connections between the vertical and horizontal lines are made with associative switches. Each of these switches compares the identifier of an incoming data packet on a horizontal line with the identifier requested on a vertical line and, if they match, routes the incoming data packet to the requesting processor.

This ACN is inherently scalable. The number of associative switches necessary to connect N processors is N². An ASIC (Application Specific Integrated Circuit) with sufficient switches to connect 4 processors has been fabricated [26], and an ADARC hardware prototype with 12 processors (connected by a 3 by 3 matrix of ASICs) is up and running at the J. W. Goethe-University in Frankfurt.

This configuration has two major limitations. First, each associative switch has exactly one word of associative memory. This is sufficient to match incoming data packets against the current requests, but not to match incoming data packets against previous requests. Thus, the compiler receives the responsibility to see that each read request comes no later than the corresponding write operation. Second, the vertical lines have no way of resolving multiple matches. It is thus not possible for two processors to simultaneously write to a third processor. The compiler thus receives an additional responsibility: to ensure that no more than one packet is sent to any given processor at any single point in time. These two limitations result in important responsibilities for the compiler. In section 3.3, we outline the approach we took to create compilers and schedulers which can cope with these responsibilities.
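To make the matching behaviour of the ACN concrete, the following sketch routes one round of packets through the crossbar. It is a simplified software model, not the ASIC's logic; the packet and request formats are our assumptions.

```python
# One communication round on the Associative Communication Network (simplified).
# Processor i writes at most one (identifier, value) packet onto horizontal line i;
# the switches on vertical line j compare identifiers against processor j's requests.

def acn_step(written_packets, requests):
    """written_packets: one (identifier, value) tuple or None per processor.
    requests: one set of wanted identifiers per processor (its next operands)."""
    n = len(written_packets)
    delivered = [{} for _ in range(n)]
    for i in range(n):                      # horizontal line i
        packet = written_packets[i]
        if packet is None:
            continue
        ident, value = packet
        for j in range(n):                  # associative switch at crosspoint (i, j)
            if ident in requests[j]:        # match: route the packet downwards
                delivered[j][ident] = value
    return delivered

# The compiler must guarantee at most one match per vertical line per round.
written = [("x1", 3.0), ("x2", 4.0), None, None]
wanted = [set(), set(), {"x1"}, {"x2"}]
print(acn_step(written, wanted))   # processor 2 receives x1, processor 3 receives x2
```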

3.3. The ADARC Compiler Technology

The compiler technology for ADARC consists of several programs, which are illustrated in figure 2. These programs can be roughly arranged into four categories: compilers, schedulers, back end and simulators. The compilers and the schedulers are described in more detail below.

Figure 2. The ADARC Compiler Components (compilers, schedulers, back end and simulators).

3.3.1 The Compilers

Currently, two compiler front-ends have been implemented for ADARC: a neuro-compiler and a subset-C compiler.

The neuro-compiler accepts as input a textual description of an artificial neural network. Back-propagation networks and radial-basis function (RBF) networks are currently supported. The forward and the backward phases are translated into separate sub-graphs. Further, each neuron (in each phase) is translated into a sub-graph containing multiple simple dataflow vertices. The mapping of neurons onto dataflow graphs can thus be said to employ sub-neural granularity.

The subset-C compiler accepts as input a significant subset of C, sufficient for most numerical applications with regular structures. Notably, this subset of C allows us to experiment with ANN types other than back-propagation and RBF networks. Arbitrary control structures are allowed (if-then-else, for-loops, while-loops, etc.), and all control structures can be arbitrarily nested. Loop unrolling and function in-lining are used in order to increase the parallelism in the dataflow graphs. Arrays of fixed length are permitted; array elements are treated internally as scalars.

3.3.2 The Schedulers

The ADARC communication network requires the processors to be precisely coordinated: the recipients must be ready for each result exactly when that result is communicated to the network. This is intractable in the presence of branch instructions. Note, however, that the scheduling of all other vertices (that is, all vertices except control vertices) is independent of the value of the incoming data — it is not necessary to know what data is available, but only whether data is available. Thus, simple vertices can be scheduled at compile time, as long as we make special provisions for the control vertices.

Based on these considerations, we use a multi-level approach when scheduling a dataflow graph. Our schedulers first dissect the graph into basic blocks containing only simple vertices (no control vertices). The vertices inside each of these basic blocks can then be scheduled independently. Subsequently, the basic blocks are reassembled, with global branch, merge and call vertices serving as "glue". Only one basic block is executed at any given time. The compiler must thus be able to find sufficient parallelism inside the basic blocks in order to keep as many processors occupied as possible.

We divide the scheduling task into three sub-tasks: the pre-scheduler dissects the dataflow graph into basic blocks; a schedule for each basic block is then generated by the basic-block scheduler; and the dataflow graph is reassembled by the meta-scheduler. Since scheduling is in general an NP-complete task (see [15] for a review of complexity results), we do not attempt to find the globally optimal schedule, but rather employ heuristics in the hope of finding a sufficiently good schedule. A uniform framework for experimentation with different heuristics has been implemented.
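The paper leaves the concrete heuristics open; as one plausible instance (a sketch of ours, not the ADARC basic-block scheduler itself), the following greedy list scheduler assigns the ready vertices of a basic block to a fixed number of processors, one vertex per processor per step.

```python
# Greedy list scheduling of one basic block onto a fixed number of processors.
# A vertex becomes ready once all of its predecessors were scheduled earlier.

def schedule_basic_block(deps, num_procs):
    """deps: dict mapping each simple vertex to the set of its predecessors."""
    done, schedule, step = set(), [], 0
    remaining = dict(deps)
    while remaining:
        ready = sorted(v for v, preds in remaining.items() if preds <= done)
        if not ready:
            raise ValueError("cyclic dependencies inside a basic block")
        slot = ready[:num_procs]          # at most one vertex per processor and step
        schedule.append((step, slot))
        done.update(slot)
        for v in slot:
            del remaining[v]
        step += 1
    return schedule

# t1 and t2 are independent additions, t3 combines their results, t4 uses t3.
deps = {"t1": set(), "t2": set(), "t3": {"t1", "t2"}, "t4": {"t3"}}
for step, vertices in schedule_basic_block(deps, num_procs=2):
    print(step, vertices)
# 0 ['t1', 't2']
# 1 ['t3']
# 2 ['t4']
```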

3.4. Empirical Results

Figure 3 shows the results of ADARC benchmarks with neural networks of varying size. Separate results are shown for the receptive (forward) and for the adaptive (backward) phases. The results are presented in terms of "speed-up" (the sequential run-time divided by the parallel run-time) for various numbers of parallel processors. We see that an increasing number of parallel processors brings increased speed-up until the implicit parallelism of the neural network is exhausted. These results provide a "proof of concept" for the ADARC architecture and for our compiler components.

Figure 3. Empirical Results: (a) Benchmark Neural Nets (XOR4, XOR5, NOT, PAR, ENCODE and PRNET, with their topologies and neuron counts) (b) Speedup.

4. The SDAARC Architecture

4.1. ADARC and COMAs

If we lift the restrictions identified in section 3.2, we arrive at a new architecture, which we have chosen to call SDAARC (Self-Distributing Associative ARChitecture). This architecture is in the earliest stages of analysis and simulation. In SDAARC, we allow multiple matches on the vertical lines, and give the associative switches enough memory to store tokens over time. Having done so, the communication network becomes a form of Virtual Shared Memory (VSM), since processors communicate asynchronously with the network simply by loading and storing data to the network.

This is made possible by generalizing the concept of the associative switch. In both systems (ADARC and SDAARC), the switches are simply responsible for making the decision: "Is the current identifier in this column's local memory, or is it elsewhere?" Interestingly, this decision must also be made by a cache. A cache can be treated abstractly as a triple consisting of a fast memory, an associative memory to identify the cache's contents, and a protocol which specifies when and how to move data in and out of the cache. The associative switches already have the necessary associative functionality. If we extend them with an appropriate protocol, then each processor's local memory, together with its column's associative switches, can be seen abstractly as a giant (slow) cache. We thus arrive, by an unusual path, at the architectural configuration known as COMA (Cache Only Memory Architecture) [9, 3]. Perhaps the most interesting feature of COMAs is their ability to dynamically distribute data. In COMA terminology, an augmented local memory is called an attraction memory, since each memory attracts the data it uses.

4.2. SDAARC Configuration

The topology of SDAARC is shown in figure 4. The basic topology remains a crossbar: the i-th horizontal line is reserved for messages sent from the i-th processor, and the i-th vertical line is reserved for messages sent to the i-th processor. The horizontal and vertical lines should now be considered asynchronous, split-phase buses. However, only one processor can write onto any given horizontal line, and only one processor can read from any given vertical line. Further, the associative switches are extended in two respects: they now have depth, and sufficient hardware to implement a cache-coherency protocol. The associative memories in the associative switches do not store any data objects, but rather only the presence (and implicitly the absence) of data objects. We call the associative switches on the diagonal dedicated switches (see figure 4). These switches contain the same functionality as the other switches, but are additionally responsible for translating between global (virtual) addresses and local addresses.

This network configuration has two interesting properties. First, it reduces bus traffic (and thus bus congestion) to the theoretical minimum; the only remaining congestion is unavoidable. Second, the situation for any single processor is equivalent to that in a symmetric multi-processor (SMP): for any given processor, its dedicated switch looks like another processor sharing a common bus (see figure 4 (b)). Effectively, the dedicated switch acts as a stand-in for all the other processors. This means that SDAARC can be built with standard (off-the-shelf) commercial SMPs.

The topology described above is for a non-hierarchical SDAARC. Hierarchical topologies can also be built, by constructing a new SDAARC crossbar in which the individual processors are replaced by SDAARC clusters. This process can be repeated to create hierarchies of arbitrary depth and branching factor.
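The following sketch illustrates the "giant cache" view described above: the switches record only the presence of identifiers, and a local miss attracts the datum from whichever node currently holds it. The states, replacement policy and class names are simplifying assumptions of ours, not the SDAARC coherency protocol.

```python
# Simplified model of an SDAARC column seen as one large attraction memory.

class AttractionMemory:
    def __init__(self, node_id, capacity):
        self.node_id = node_id
        self.capacity = capacity
        self.lines = {}              # identifier -> value held locally

    def present(self, ident):
        # All that the column's associative switches record: presence, not data.
        return ident in self.lines

    def load(self, ident, network):
        if self.present(ident):
            return self.lines[ident]                 # local hit
        value = network.fetch(ident, self.node_id)   # miss: attract the datum
        if len(self.lines) >= self.capacity:
            self.lines.pop(next(iter(self.lines)))   # naive eviction
        self.lines[ident] = value                    # data migrates to its user
        return value

class Crossbar:
    def __init__(self, nodes):
        self.nodes = nodes
    def fetch(self, ident, requester):
        for node in self.nodes:
            if node.node_id != requester and node.present(ident):
                # Moving (rather than copying) the datum; a real protocol would
                # distinguish shared and exclusive copies.
                return node.lines.pop(ident)
        raise KeyError(ident)

a, b = AttractionMemory(0, capacity=4), AttractionMemory(1, capacity=4)
net = Crossbar([a, b])
b.lines["x"] = 42
print(a.load("x", net), a.present("x"), b.present("x"))   # 42 True False
```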




Figure 4. Topology for a 3 Processor SDAARC: (a) Overview (b) Detail of one processor with its level 1 and level 2 caches, active message queues, local (attraction) memory and dedicated associative switch.

4.3. Scheduling with SDAARC

A COMA which distributes data can also be used to distribute computation — that is, to schedule. Parallel computation can be organized around a cactus stack which holds execution frames (where each frame holds the arguments, local variables and return value for a function, compare [8]). If we use the self-distributing properties of a COMA to automatically and transparently distribute this stack, we can use the virtual shared memory to distribute control automatically and transparently. Store operations which write to frames become functionally equivalent to the active messages [27] used in message-driven multi-threaded architectures (e.g. [18, 6]).

Multi-threaded architectures typically use demand-driven scheduling. Threads are, by default, executed on the processor which spawned them. A processor with no work polls its neighbors, looking for work; a processor farms out work from its own queue only when interrupted by an idle neighbor. Whenever no processor is idle, no distribution takes place, and an idle processor must wait for at least two messages to be transmitted (one looking for work, one to receive work) before restarting. In contrast, COMA-based scheduling is supply-driven. Heavily loaded processors initiate the distribution by evicting frames out of their attraction memories into less heavily loaded attraction memories. Idle processors do not request work (but accept it when it is farmed out).

This makes some extensions to the COMA cache protocols necessary. First, the frame of the currently executing thread must be immune to eviction. Second, we need to mark one copy of each frame as executable, and all other copies as unexecutable, in order to prevent redundant execution; the cache protocol needs to reflect this new state. Finally, perhaps the most important question of all concerns the selection of a new home processor when a frame migrates. We propose three criteria. The first criterion is to select a processor whose frame attraction memory already has a copy of the frame, if such a processor exists; in this case, only the executable mark needs to be moved. Second, whenever we send arguments to a frame, and the arguments are in a different attraction memory than the frame, we have the choice between moving the frame to the argument(s) or moving the argument(s) to the frame. Both of these criteria are attractive ones: a frame travels to another processor if either one of its arguments, or a copy of itself, is resident there. We also need a dissipative criterion. The rule is simple and obvious: we need to move frames from heavily loaded processors to more lightly loaded processors. Load can be defined operationally as the number of executable frames in the frame attraction memory.

With the scheduling scheme described above, we obtain a system where scheduling and data distribution are both done in hardware. Both automatic distributions are fully transparent to the processors, the compilers and the programmer.
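To illustrate the three home-selection criteria and the supply-driven eviction rule, the following schematic routine makes the migration decision for a single frame. The data structures, threshold and tie-breaking rules are our assumptions, not a specification of SDAARC.

```python
# Schematic home selection for a migrating execution frame.

class Frame:
    def __init__(self, home, copies=(), argument_homes=()):
        self.home = home                           # processor currently marked executable
        self.copies = set(copies)                  # processors holding a copy of the frame
        self.argument_homes = set(argument_homes)  # processors holding pending arguments

def choose_home(frame, processors, load, overload_threshold=8):
    """load[p] = number of executable frames in processor p's frame attraction memory."""
    current = frame.home

    # Criterion 1: a processor that already holds a copy of the frame --
    # only the 'executable' mark has to move.
    for p in sorted(frame.copies):
        if p != current:
            return p

    # Criterion 2: move the frame towards its arguments instead of shipping
    # the arguments to the frame.
    for p in sorted(frame.argument_homes):
        if p != current:
            return p

    # Dissipative criterion: evict from an overloaded processor to the most
    # lightly loaded one.
    if load[current] > overload_threshold:
        return min(processors, key=lambda p: load[p])

    return current   # otherwise the frame stays where it is

f = Frame(home=0, copies={0}, argument_homes={2})
print(choose_home(f, [0, 1, 2, 3], load={0: 12, 1: 1, 2: 5, 3: 0}))   # 2
```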

5. Architecture Analysis

This section analyzes the ADARC and SDAARC architectures in light of the issues identified in section 2: parallelism, distribution of control and data, granularity, programmability, regularity, adaptivity and scalability.

Both architectures are inherently parallel and scalable. Both employ distributed data and distributed control. Both are programmable. The differences become apparent when we distinguish what is done at compile time from what is done at run time. In ADARC, distribution of data and control is fixed at compile time. Thus, parallelism always remains inside a basic block, which limits ADARC to fine-grained, instruction-level parallelism. For regular applications involving a sufficient (but constant) number of parallel objects, these restrictions are not too harsh. In SDAARC, data and control distribution are done dynamically at run time. The granularity is thus flexible and adaptive: the frequency of communication, and thus the grain size, is a dynamic function of the current distribution of data and control. These distributions can also adapt to irregularities in applications. SDAARC increases programmability, since it allows random-access data structures and simplifies compilation. SDAARC also increases scalability, first by introducing the possibility of hierarchical topologies, and second by exploiting the parallelism outside (between) basic blocks.

6. What CI can learn from Computer Architecture

The question "What can CI learn from computer architecture?" is in some ways dangerous: the reemergence of interest in CI was, in part, motivated by a deep dissatisfaction with computational models of cognition [19]. Further, if computer architecture is to continue to learn from CI, it is important for CI to lead the way, exploring principles not yet incorporated into computer architecture. Thus, we do not propose that CI researchers should make their models more like those currently employed in computer architecture, or even like those employed in ADARC or SDAARC. Instead, we prefer to keep CI models free of hidden or implicit influences of conventional (von Neumann) models of computation. Ironically, in the process, CI models can be made more amenable to novel computer architectures like ADARC and SDAARC. We propose two simple constraints for CI:

1. No global "programming language" control constructs. All components should be self-controlling;

2. No random access memory. Data is found either in the state of the components or in explicit communication between them.

These constraints are perhaps best illustrated with an example: the well-known error back-propagation neural network, as made popular in [20]. This model violates the first constraint and, when generalized to simulate recurrent networks (so-called back propagation through time, see [20]), violates the second as well. Training a back-propagation network involves two distinct phases: a feed-forward phase and a back-propagation phase. Switching from one phase to the other requires some kind of global control not local to the individual neurons. Back propagation through time requires past values to be buffered, and requires different synapses to have access to each others' state (see [20], page 356, for a more detailed discussion of the memory issues). Why are control structures and random access memory problematic? In essence, because they make CI models dependent on the von Neumann model. This becomes apparent when compiling CI models for non-von-Neumann architectures such as ADARC or SDAARC.
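As a small illustration of what the two constraints permit, the following sketch updates a unit using only its own state and explicitly received messages: there is no global phase variable and no shared random access memory. The class and message format are hypothetical, chosen by us for illustration.

```python
import math

# A self-controlling unit: all data lives in its own state or arrives as an
# explicit message; there is no global controller and no shared memory.

class Unit:
    def __init__(self, weights):
        self.weights = dict(weights)   # local state: one weight per known sender
        self.inbox = {}                # messages received so far in this round

    def receive(self, sender, value):
        self.inbox[sender] = value
        # The unit decides by itself when to fire: as soon as a value has
        # arrived from every sender it knows about.
        if set(self.inbox) == set(self.weights):
            return self.fire()
        return None

    def fire(self):
        activation = sum(self.weights[s] * v for s, v in self.inbox.items())
        self.inbox.clear()
        return 1.0 / (1.0 + math.exp(-activation))

u = Unit({"a": 0.5, "b": -1.0})
print(u.receive("a", 2.0))   # None -- still waiting for a message from b
print(u.receive("b", 1.0))   # fires: sigmoid(0.5*2.0 - 1.0*1.0) = 0.5
```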

7. Conclusion

We opened this paper with two questions: What, if anything, can computer architecture learn from Computational Intelligence (CI)? And what, if anything, can CI learn from computer architecture? While pursuing these questions, we found that both fields are involved in the search for novel models of computation, and evaluate new models in terms of parallelism, distribution of control and data, granularity, programmability, regularity, adaptivity and scalability.

To illustrate how computer architecture can learn from CI, we presented two novel architectures: ADARC and SDAARC. Both SDAARC and ADARC model computation as a dataflow graph. ADARC maps this graph onto the available hardware algorithmically at compile time. SDAARC introduces a complex and adaptive system responsible for this mapping: a cache-only virtual shared memory which transparently distributes both data and control. To show what CI can learn from computer architecture, we identified two constraints on CI, proposed in order to free CI from the last remaining influences of the von Neumann model of computation.

In essence, the challenge for both communities is to understand their research as part of the decades-old search for parallel and distributed models of computation, free from centralized control and ready for new models of memory.

References

[1] G. S. Almasi and A. Gottlieb. Dependency graphs and analysis. In Highly Parallel Computing [2], section 6.2.2, pages 309–314.
[2] G. S. Almasi and A. Gottlieb. Highly Parallel Computing. Benjamin/Cummings Publishing Company, second edition, 1994.

[3] G. S. Almasi and A. Gottlieb. Kendall Square Research KSR1. In Highly Parallel Computing [2], section 10.3.3, pages 549–553.
[4] Arvind, L. Bic, and T. Ungerer. Evolution of dataflow computers. In Advanced Topics in Data-Flow Computing. Prentice Hall, 1991.
[5] J. C. Bezdek. What is computational intelligence? In Zurada et al. [28], pages 1–12.
[6] D. E. Culler, S. C. Goldstein, K. E. Schauser, and T. von Eicken. TAM — A compiler controlled Threaded Abstract Machine. In Journal of Parallel and Distributed Computing, Special Issue on Dataflow, June 1993.
[7] J. B. Dennis. First version of a data flow procedure language. In Lecture Notes in Computer Science, volume 19. Springer Verlag, 1974.
[8] S. C. Goldstein, K. E. Schauser, and D. Culler. Enabling primitives for compiling parallel languages. In Languages, Compilers and Run-Time Systems for Scalable Systems, pages 153–168. Kluwer Academic Press, 1996.
[9] E. Hagersten, A. Landin, and S. Haridi. DDM — A Cache-Only Memory Architecture. IEEE Computer, 25(9), 1992.
[10] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers Inc., 1990.
[11] F. Henritzi, A. Bleck, R. Moore, B. Klauer, and K. Waldschmidt. ADARC: A new multi-instruction issue approach. In International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA '96), 1996.
[12] J. Hertz, A. Krogh, and R. G. Palmer. Introduction. In Introduction to the Theory of Neural Computation, chapter 1, page 10. Addison-Wesley Publishing Company, 1991.
[13] J. H. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, 1975.
[14] B. Klauer, J. Strohschneider, F. Henritzi, S. Zickenheiner, and K. Waldschmidt. Computation of neural structures on dataflow computers. In EUROMICRO '95, Como, Italy, 1995.
[15] T. G. Lewis and H. El-Rewini. Scheduling parallel programs. In Introduction to Parallel Computing, chapter 9, pages 245–281. Prentice-Hall, 1992.
[16] J. L. McClelland, D. E. Rumelhart, and the PDP Research Group, editors. Parallel Distributed Processing: Explorations in the Microstructure of Cognition; Volume 2: Psychological and Biological Models. MIT Press, 1986.
[17] C. Mead, editor. Analog VLSI and Neural Systems. Addison-Wesley Publishing Company, 1989.
[18] R. S. Nikhil. A multithreaded implementation of Id using P-RISC graphs. In Proceedings of the Sixth Annual Workshop on Languages and Compilers for Parallel Computing, pages 390–405, Portland, Oregon, Aug. 1993. Springer Verlag LNCS 768.
[19] D. A. Norman. Reflections on cognition and parallel distributed processing. In McClelland et al. [16], chapter 26, pages 531–546.
[20] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Rumelhart et al. [22], chapter 8, pages 318–364.

[21] D. E. Rumelhart and J. L. McClelland. PDP models and general issues in cognitive science. In Rumelhart et al. [22], chapter 4, pages 110–146.
[22] D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, editors. Parallel Distributed Processing: Explorations in the Microstructure of Cognition; Volume 1: Foundations. MIT Press, 1986.
[23] S. G. Shiva. Neural networks. In Pipelined and Parallel Computer Architectures, section 8.3.1, pages 243–247. HarperCollins Publishers, 1996.
[24] E. S. Sinencio and R. W. Newcomb, editors. IEEE Transactions on Neural Networks: Special Issue on Neural Network Hardware, volume 4, May 1993.
[25] J. Strohschneider, B. Klauer, and K. Waldschmidt. Concept for an associative dataflow architecture. In 2nd Associative Processing and Applications Workshop, 1993.
[26] J. Strohschneider, B. Klauer, S. Zickenheiner, and K. Waldschmidt. ADARC: An Associative Dataflow Architecture. In A. Krikelis and C. C. Weems, editors, Associative Processing and Processors. IEEE Press, 1997.
[27] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Active messages: a mechanism for integrated communication and computation. In Proc. of the 19th International Symposium on Computer Architecture, Gold Coast, Australia, May 1992.
[28] J. M. Zurada, R. J. Marks II, and C. J. Robinson, editors. Computational Intelligence: Imitating Life. IEEE Press, 1994.
[29] J. M. Zurada, R. J. Marks II, and C. J. Robinson. Introduction. In Computational Intelligence: Imitating Life [28], pages v–xi.
