Compiler Technology for Two Novel Computer Architectures

Ronald Moore, Bernd Klauer, Klaus Waldschmidt



J. W. Goethe-University, Technische Informatik, P.O. Box 11 19 32, 60054 Frankfurt, Germany
Tel: +49 69 798-22121, Fax: +49 69 798-22351
Email: {moore, klauer, waldsch}@ti.informatik.uni-frankfurt.de

Abstract

Before it can achieve wide acceptance, parallel computation must be made significantly easier to program. One of the main obstacles to this goal is the current usage of memory, both abstractly, by programmers, and concretely, by computer architects. In this paper, we present compiler technology for two novel computer architectures, and discuss how, on the one hand, many traditional, memory-based restraints on parallelism can be removed by the compiler — and, on the other hand, how computer architecture (along with appropriate compiler components) can provide a truly transparent virtual distributed memory in such a way as to move both data distribution and scheduling into the hardware domain, relieving the programmer of these concerns.

Keywords: Parallelizing compilers, Neuro-Computing, Innovative Architectures, Interconnection networks, Memory systems and management

1 Introduction

Despite decades of research, parallel computing still has not achieved the breakthrough so long expected. This has certainly been due to many factors, but there is wide agreement that, in order to become widely accepted, parallel computation will need to become much easier to program — ideally, as easy as or easier to program than non-parallel computation. Some of the difficulty in conceiving, specifying and implementing parallel systems may be unavoidable: it is perhaps inescapably easier to understand the consequences of an action if that action is the only action taken by the system at a given time. Nonetheless, much of the difficulty of programming parallel systems is artificial, a symptom of the way memory is understood and employed in conventional programming.

The simple concept of random access memory, while so helpful in making von Neumann machines popular, has been recognized for years now as a problem, and not only for parallel systems (see e.g. [Bac82]). The problems go deeper: in essence, von Neumann programming is side-effect programming, and extracting and respecting the dependence relationships between side-effects is often intractable, if not impossible. One alternative is to hide the concept of memory from the programmer, generally by usage of declarative, and in particular functional, languages (e.g. [BW88]). On the architecture side, this approach was mirrored by the development of dataflow architectures and languages [AG94], [Den74]. In all of these paradigms, the programmer maps values onto values, and memory is employed, where necessary, by the compiler simply as a storage area for values waiting to be used.

* This work was supported in part with funds of the Deutsche Forschungsgemeinschaft under reference number WA 357/11-2 within the priority program "System and Circuit Technology for Massively Parallel Computation".

This pure world, despite much initial enthusiasm, has not survived. Dataflow languages introduced impure structures early in their history to allow efficient handling of data objects like arrays [ABU91]. Meanwhile, even functional languages have rediscovered memory as a necessity for modeling objects with local state [JW93]. The concept of over-writable, random access memory seems unavoidable, and thus remains a problem whenever computation is to be distributed.

In this paper, we present compiler technology for two novel architectures, both of which are remarkable for their treatment of the concept of memory. The first architecture, which we call ADARC (Associative Dataflow ARChitecture) [SW94], [HBM+96], [SKZW97], is a fine-grained, tightly-coupled MIMD system with a distributed memory. ADARC is built around a novel associative crossbar network, which dynamically routes tokens between processors by matching result identifiers with request identifiers. We have implemented a set of compiler components for ADARC, including several language-specific front ends, and a language-independent scheduler and assembler.

Building on our experience with ADARC, we have begun work on a second architecture, which we call SDAARC (Self-Distributing Associative ARChitecture, pronounced so as to rhyme with "stark"). SDAARC is the next logical step after ADARC, loosening the most restrictive aspects of ADARC so as to allow for more flexible, loosely-coupled, variable (indeed, adaptive) grain computation. Likewise, the compiler technology for SDAARC is an extension and continued development of the compiler components for ADARC.

In both of these systems, we take an approach to memory that is somewhere between the conventional approach and that employed in dataflow architectures and functional languages: we optimize the architecture and the compiler for handling values, and use memory only where necessary (to store values until they are needed) or particularly practical (to model local state and container objects such as arrays). Initial benchmarks using neural network programs provide encouraging results.



The rest of this paper is organized as follows: The ADARC architecture is presented in section 2. The compiler strategy for ADARC is described in section 3. Empirical results obtained with ADARC are presented in section 4. The SDAARC architecture is presented in section 5. The compiler strategy for SDAARC is described in section 6, and conclusions are presented in section 7.

2 The ADARC Architecture

The central concept upon which ADARC is built is an Associative Communication Network (ACN); see figure 1. The ACN is a crossbar switch, where each processor writes onto a dedicated horizontal line, and reads from one or more dedicated vertical lines. The ACN can be regarded as the fusion of an associative memory and a conventional crossbar. When writing, a processor sends a data packet to the network. Each packet consists of an identifier and a data word. When reading, the processor specifies the identifier of the packet it wishes to receive. The connections between the vertical and horizontal lines are made with associative switches. These switches compare the identifiers on the incoming data packets with the identifier on the read-line, and if they match, route the incoming data packet onto the vertical line, sending it to the reading processor.

The ACN is inherently scalable: the number of associative switches necessary to connect $n$ processors is $n^2$. An ASIC with sufficient switches to connect 4 processors has been fabricated [SKZW97], and a prototype ADARC computer with 12 processors (connected by a 3 by 3 matrix of ASICs) is up and running at the J. W. Goethe University in Frankfurt.

This configuration has two major limitations: First, each associative switch has exactly one word of associative memory. This is sufficient to match incoming data packets against the current requests, but not against previous requests. The compiler thus receives the responsibility of ensuring that each read request is issued no later than the corresponding write operation. Second, the vertical lines have no way of resolving multiple matches, so it is not possible for two processors to simultaneously write to a third processor. The compiler thus receives an additional responsibility: to ensure that no more than one packet is sent to any given processor at any single point in time.

In section 3, we outline the approach we took to obtain compilers (including schedulers) which can cope with these responsibilities. In sections 5 and 6, we examine the possibilities created when we lift these restrictions.
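To make the matching behaviour concrete, the following is a minimal software sketch of the associative crossbar (our own illustrative model in Python, with invented names such as `AssociativeSwitch`; it is not the hardware design or the ASIC's actual logic):

```python
# Minimal software model of the ADARC associative crossbar (illustrative only).
# Each processor writes one (identifier, data) packet per cycle onto its horizontal
# line; each switch compares that identifier with the single request identifier on
# its vertical (read) line and, on a match, forwards the packet to the reader.

from typing import Optional, Tuple

Packet = Tuple[str, int]  # (identifier, data word)

class AssociativeSwitch:
    """One crosspoint: exactly one word of associative memory (the current request)."""
    def __init__(self) -> None:
        self.request_id: Optional[str] = None  # identifier the reading processor waits for

    def set_request(self, identifier: Optional[str]) -> None:
        self.request_id = identifier

    def match(self, packet: Optional[Packet]) -> Optional[Packet]:
        if packet is not None and self.request_id is not None and packet[0] == self.request_id:
            return packet
        return None

def acn_cycle(writes: list, requests: list) -> list:
    """One network cycle for n processors: writes[i] is the packet written by
    processor i (or None), requests[j] is the identifier requested by processor j."""
    n = len(writes)
    switches = [[AssociativeSwitch() for _ in range(n)] for _ in range(n)]
    received = [None] * n
    for j, req in enumerate(requests):
        for i in range(n):
            switches[i][j].set_request(req)
    for i, packet in enumerate(writes):           # horizontal lines
        for j in range(n):                        # vertical lines
            hit = switches[i][j].match(packet)
            if hit is not None:
                # The real vertical line cannot resolve multiple matches; the
                # compiler must guarantee at most one writer per reader per cycle.
                assert received[j] is None, "schedule violated: two writers for one reader"
                received[j] = hit
    return received

# Example: processor 0 writes result "x1"; processors 1 and 2 both request it.
print(acn_cycle([("x1", 42), None, None], [None, "x1", "x1"]))
```

The assertion in the model spells out the invariant that, in the real system, the compiler's schedule must guarantee.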

Figure 1: Connection Scheme for ADARC

3 The ADARC Compiler Technology

Before presenting the compiler technology per se, we need a solid theoretical basis for representing and scheduling parallel programs. We chose dataflow graphs as an intermediate format for both our architectures.

3.1 Dataflow Graphs

Dataflow graphs (DFGs) represent data dependencies in terms of a directed graph $G = (V, E)$ with $E \subseteq V \times V$, where the vertices represent operations (functions) and the edges represent data connections: if $(v_1, v_2) \in E$ is an edge from vertex $v_1$ to vertex $v_2$, then $v_2$ takes one of the outputs of $v_1$ as one of its arguments (compare fig. 2). A dataflow graph thus constitutes a partial ordering on a set of operations (vertices). Indeed, DFGs provide a maximally parallel representation of an algorithm, since no ordering has been imposed on the operations apart from that implied by data dependencies. We consider dataflow graphs to be a natural choice for an intermediate representation of parallel algorithms for precisely this reason. Note that we allow so-called branch and merge vertices to appear in dataflow graphs. We call branch, merge and call vertices control vertices, and all others simple vertices. Control vertices are essential so that we can construct loops. As such, our DFGs are not (necessarily) DAGs (directed acyclic graphs).
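As a concrete (and deliberately simplified) picture of this intermediate representation, the sketch below models a DFG with simple and control vertices; the class names are our own and do not correspond to the compiler's actual data structures:

```python
# Simplified dataflow-graph representation: vertices are operations, edges are
# data dependencies; branch, merge and call vertices are "control" vertices.

from dataclasses import dataclass, field
from typing import List, Tuple

CONTROL_KINDS = {"branch", "merge", "call"}

@dataclass
class Vertex:
    name: str
    kind: str = "simple"          # "simple", "branch", "merge" or "call"

    @property
    def is_control(self) -> bool:
        return self.kind in CONTROL_KINDS

@dataclass
class DataflowGraph:
    vertices: List[Vertex] = field(default_factory=list)
    edges: List[Tuple[Vertex, Vertex]] = field(default_factory=list)

    def add_edge(self, producer: Vertex, consumer: Vertex) -> None:
        # An edge (v1, v2) means v2 consumes one of v1's outputs.
        self.edges.append((producer, consumer))

# A tiny example: (a + b) * c, with a branch vertex consuming the product.
g = DataflowGraph()
a, b, c = Vertex("a"), Vertex("b"), Vertex("c")
add, mul = Vertex("+"), Vertex("*")
br = Vertex("branch", kind="branch")
g.vertices += [a, b, c, add, mul, br]
for p, q in [(a, add), (b, add), (add, mul), (c, mul), (mul, br)]:
    g.add_edge(p, q)
print([v.name for v in g.vertices if not v.is_control])
```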

3.2 Scheduling Parallel Operations

The ADARC communication network requires the processors to be precisely coordinated: the recipients must be ready for each result exactly when that result is communicated to the network. This is intractable in the presence of branch instructions. Note, however, that the scheduling of all other vertices (that is, all vertices except control vertices) is independent of the value of the incoming data — it is not necessary to know what data is available, but only whether data is available. Thus, simple vertices can be scheduled at compile time, as long as we make special provisions for the control vertices. Based on these considerations, we use a multi-level approach when scheduling a dataflow graph. Our schedulers first dissect the graph into basic blocks containing only simple vertices (no control vertices — see fig. 2). The vertices inside each of these basic blocks can then be scheduled independently. Subsequently, the basic blocks are reassembled, with global branch, merge and call vertices serving as "glue".

Figure 2: Dataflow Graph with marked basic blocks (Graph 1: Main Program Graph; Graph 2: Function Graph)

Only one basic block is executed at any given time. The compiler must thus be able to find sufficient parallelism inside each basic block in order to keep as many processors occupied as possible.
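The dissection step itself can be sketched as follows, using a simplified graph representation of our own (the paper does not give the implementation; this only illustrates the idea of cutting the graph at control vertices):

```python
# Sketch of the pre-scheduler: cut the dataflow graph at control vertices so
# that every basic block contains only simple vertices (hypothetical model,
# not the actual ADARC implementation).

CONTROL_KINDS = {"branch", "merge", "call"}

def dissect_into_basic_blocks(vertices, edges):
    """vertices: dict name -> kind; edges: list of (producer, consumer) names.
    Returns a list of basic blocks, each a set of simple vertex names."""
    simple = {v for v, kind in vertices.items() if kind not in CONTROL_KINDS}
    parent = {v: v for v in simple}           # union-find over simple vertices

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Union two simple vertices whenever they are directly connected;
    # edges that touch a control vertex are the cut points ("glue").
    for producer, consumer in edges:
        if producer in simple and consumer in simple:
            parent[find(producer)] = find(consumer)

    blocks = {}
    for v in simple:
        blocks.setdefault(find(v), set()).add(v)
    return list(blocks.values())

# Example: a while loop; the merge/branch pair separates the initialization
# block from the predicate and body blocks (compare figure 4).
vertices = {"init": "simple", "merge": "merge", "pred": "simple",
            "branch": "branch", "body1": "simple", "body2": "simple"}
edges = [("init", "merge"), ("merge", "pred"), ("pred", "branch"),
         ("branch", "body1"), ("body1", "body2"), ("body2", "merge")]
print(dissect_into_basic_blocks(vertices, edges))
```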

3.3 The ADARC Compiler Components

The compiler technology for ADARC consists of several programs, illustrated in figure 3. They can be roughly arranged into four categories: compilers, schedulers, back end and simulators. The compilers and the schedulers are described in the next two subsections.

3.3.1 The Compilers

Currently, two compiler front-ends have been implemented for ADARC: a neuro-compiler and a subset-C compiler. The neuro-compiler accepts as input a textual description of an artificial neural network; back-propagation networks and radial-basis function (RBF) networks are currently supported. The forward and the backward phases are translated into separate sub-graphs, and each neuron (in each phase) is translated into a sub-graph containing multiple simple dataflow vertices.

Figure 3: The ADARC Compiler Components (compilers: subset-C and ANN front ends; schedulers: DFG prescheduler, basic-block scheduler and DFG meta-controller; back end: ADARC assembler; simulators: DFG and ADARC simulators with graphics output, alongside the ADARC hardware)

The subset-C compiler accepts as input a significant subset of C, sufficient for most scientific applications with regular structures. Recursive function calls are not yet supported, nor are user-defined types. Arbitrary control structures are allowed (if-then-else, for-loops, while-loops, etc.), and all control structures can be arbitrarily nested. Loop unrolling and function-call in-lining are used to increase the parallelism in the dataflow graphs. Arrays of fixed length are permitted; array elements are treated internally as scalars.
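As an illustration of how loop unrolling and the scalar treatment of array elements work together, here is a toy sketch (our own, with invented naming conventions; it is not the front end's actual output format):

```python
# Illustration of how a fixed-length array loop can be unrolled into purely
# scalar dataflow operations (toy example, not the subset-C front end itself).

def unroll_accumulate(array_name: str, length: int) -> list:
    """Emit one scalar multiply 'vertex' per array element for a loop of the
    form: for (i = 0; i < length; i++) sum += a[i] * b[i];"""
    ops = []
    for i in range(length):
        # Each array element becomes an independent scalar value, so the
        # multiplies are mutually independent and can be scheduled in parallel.
        ops.append(f"t{i} = {array_name}_a_{i} * {array_name}_b_{i}")
    # The reduction introduces the only sequential dependency chain.
    ops += ["sum = t0"] + [f"sum = sum + t{i}" for i in range(1, length)]
    return ops

for op in unroll_accumulate("dot", 4):
    print(op)
```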

3.3.2 The Schedulers

We divide the scheduling task into three sub-tasks: the pre-scheduler dissects the dataflow graph into basic blocks; a schedule for each basic block is then generated by the basic-block scheduler; and the dataflow graph is reassembled by the meta-scheduler. Since scheduling (for more than two processors) is an NP-complete problem, we do not attempt to find the globally optimal schedule, but rather employ heuristics in the hope of finding a sufficiently good one. A uniform framework for experimentation with different heuristics has been implemented. Heuristics implemented so far include critical path, depth-first, breadth-first, minimum fan-in, and various combinations of these. Many heuristics produce adequate results, and while none of them is uniformly superior for all types of applications, the depth-first heuristic consistently delivers the best performance for a wide range of neural networks. The dissection and reassembly of while loops is illustrated in figure 4.
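To make the heuristic framework concrete, the following is a minimal list-scheduling sketch with a pluggable priority, shown here with a critical-path priority (one of the heuristics named above). It is our own simplification with hypothetical names, not the ADARC basic-block scheduler itself:

```python
# Minimal list-scheduling framework with a pluggable priority heuristic
# (hypothetical sketch, not the ADARC basic-block scheduler).

def critical_path_length(op, children, memo):
    """Length of the longest dependency chain starting at op."""
    if op not in memo:
        memo[op] = 1 + max((critical_path_length(c, children, memo)
                            for c in children[op]), default=0)
    return memo[op]

def list_schedule(ops, deps, num_procs):
    """ops: operation names; deps: op -> set of operations it depends on.
    Greedily fills each time step with up to num_procs ready operations,
    preferring operations that head long dependency chains (critical path)."""
    children = {op: [] for op in ops}
    for op in ops:
        for p in deps.get(op, set()):
            children[p].append(op)
    memo = {}
    priority = {op: critical_path_length(op, children, memo) for op in ops}

    done, remaining, schedule = set(), set(ops), []
    while remaining:
        ready = [op for op in remaining if deps.get(op, set()) <= done]
        ready.sort(key=lambda op: -priority[op])
        step = ready[:num_procs]
        schedule.append(step)
        done.update(step)
        remaining.difference_update(step)
    return schedule

# Tiny example: a diamond plus an independent tail, scheduled on 2 processors.
ops = ["a", "b", "c", "d", "e"]
deps = {"c": {"a", "b"}, "d": {"c"}, "e": {"b"}}
print(list_schedule(ops, deps, num_procs=2))
```

Swapping the priority function (e.g. for a breadth-first or minimum fan-in ordering) is the kind of experimentation the uniform framework was built for.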

4 Empirical Results

Figure 5 shows the results of ADARC benchmarks with neural networks of varying size. Separate results are shown for the receptive (forward) and for the adaptive (backward) phases. The results are presented in terms of "speedup" (the sequential run-time divided by the parallel run-time) for various numbers of parallel processors. We see that increasing the number of parallel processors brings increased speedup until the implicit parallelism of the neural network is exhausted. These results provide a "proof of concept" for the ADARC architecture and for our compiler components. Nonetheless, the ADARC approach is limited to scientific applications with very regular structures, such as feed-forward neural networks. In order to accommodate neural networks with dynamic topologies, other object-oriented systems, or mixed systems containing competing components of very different nature (e.g. signal processing combined with neural post-processing), we need to extend the ADARC architecture.

Figure 4: Scheduling a While-loop (the while loop in the original dataflow graph; as a list of basic-block graphs; as independent basic-block schedules; and in the finished global schedule)

Figure 5: Empirical Results: (a) Benchmark Neural Nets (XOR4, XOR5, NOT, PAR, ENCODE and PRNET, of varying topologies and neuron counts) (b) Speedup

5 The SDAARC Architecture

The approach taken with ADARC built upon fine-grained, instruction-level parallelism. We sacrificed the possibility of executing multiple function calls, or multiple iterations of a (not unrolled) loop, in parallel, and received in exchange an environment where the primitive operations could be precisely coordinated. In the process, we arrived at an architecture that shows many interesting similarities to VLIW (Very Long Instruction Word) architectures; see [HBM+96] for a more detailed comparison.

However, we can lift the two major restrictions on the ADARC associative communication network: First, we can give the associative switches depth, and allow them to store and match more than just the current identifier. Second, we can add queuing capabilities to the vertical lines, so that more than one token can be sent to a processor at a time. See figure 6. After making these extensions, the identifiers do more than just help route tokens: they now identify objects, and can be thought of as virtual addresses. The private memories, once coupled with the associative switches, come to represent the functional equivalent of distributed caches.



Figure 6: Schematic Topology for a 3 Processor SDAARC (processors with dedicated associative switches, active message queues, level 1 and level 2 caches, and local attraction memories)

Thus, instead of a VLIW-like architecture, we arrive at a new species of COMA (Cache Only Memory Architecture) [HLH92]. The virtual shared memory is now self-distributing, or in COMA terms, an attraction memory.

A conventional cache arbitrates between two memories: its own (fast) memory and an external (slow) memory. Often this external memory turns out to be, in turn, another (second-level) cache. Thus, a cache can be seen abstractly as a memory paired with an associative switch. The associative switch determines whether a given object is contained here (in the cache's own memory) or elsewhere (in another cache or another memory). This associative switch is thus abstractly equivalent to an extended (deep) ADARC associative switch.

Note that, to avoid deadlock, the network now requires us to use split-phase loads: sending a load operation to the network is one phase, and sending the data back is another. The time between the two phases is non-uniform, and the compiler is responsible for keeping the processors occupied in the meantime (see section 6 below). This leads us into the world of multi-threaded architectures, such as those described in [Nik93] or [CGSvE93].

The description above is sufficient to describe a cache hit. On a cache miss, more happens. The requested cache line is retrieved from another memory. If the cache is full, it becomes necessary to replace some cache line previously in the cache with the new line. At this point, caches diverge, and in the multi-processor world, cache coherence protocols multiply. The details go beyond the scope of this paper. The essential point for now is that cache replacement instigates object migration, and that objects migrate to the processors that use them.

The integration of the attraction memory and the communication network makes an interesting optimization possible. In general, we want objects to migrate from heavily loaded memories to lightly loaded memories. If we transmit load information along the horizontal and vertical lines, each associative switch can easily calculate the load difference between two processors. These load differences can then be used to determine whether the load imbalance is large enough to warrant a cache replacement, so that we do not have to wait until a local memory is actually full.

To summarize: if sufficient depth and an appropriate cache-coherence protocol can be built into the ADARC associative switches, the resulting network can assume the responsibility not only for routing tokens between processors, but also for distributing objects (and copies of objects) amongst the processors. This relieves the compiler and the programmer significantly: neither the compiler nor any processor must know where an object currently resides; they trust that the object is out there, and let the network find it.
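The load-balancing idea can be sketched as follows; the threshold, the bookkeeping structures and the "migrate to the least-loaded processor" policy are all hypothetical illustrations, not SDAARC's actual coherence protocol:

```python
# Sketch of load-difference driven object migration in the attraction memory
# (illustrative; threshold, policy and bookkeeping are hypothetical).

def should_migrate(load_src: int, load_dst: int, threshold: int = 4) -> bool:
    """Each associative switch sees the load figures transmitted on its
    horizontal and vertical lines and can compare them locally."""
    return (load_src - load_dst) > threshold

def migrate_candidates(loads: dict, owners: dict, threshold: int = 4) -> list:
    """loads: processor -> number of resident cache lines;
    owners: object id -> owning processor.
    Returns (object, src, dst) migrations that would reduce the imbalance."""
    migrations = []
    for obj, src in owners.items():
        dst = min(loads, key=loads.get)     # least-loaded processor (one possible policy)
        if dst != src and should_migrate(loads[src], loads[dst], threshold):
            migrations.append((obj, src, dst))
            loads[src] -= 1
            loads[dst] += 1
    return migrations

loads = {"P0": 12, "P1": 3, "P2": 5}
owners = {"frame_17": "P0", "array_a": "P0", "frame_9": "P2"}
print(migrate_candidates(loads, owners))
```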

6 The SDAARC Compiler Technology

In the previous section, we saw how straightforward extensions of the ADARC concept lead us away from the VLIW world into the COMA world, and create a virtual shared attraction memory out of the private memories of the various processors. The existence of such a memory has important implications for the compiler in several areas. One implication has been identified above: since the network is split-phase, we employ multi-threaded techniques to mask the non-uniform latencies. The original motivation for these extensions was two-fold: to allow more flexible scheduling, and to allow more flexible data structures. We can address both concerns by uniform usage of active messages.

An active message is, abstractly, one or more data objects combined with an instruction pointer, so that the recipient of the data knows how to process the data [vECGS92]. Data and computation are thus coupled. We arrive at a completely object-oriented view of computation, where the distinction between data and computation is transcended. Thus, for example, an array is no longer conceived of as a contiguous block of memory, but rather as a container object, which can be implemented in a variety of ways, as long as it supports basic load and store operations. In particular, a container object can contain smaller container objects hierarchically. Once the object size has been reduced to that of a cache line, the attraction memory can transparently implement a form of adaptive block-cyclic data distribution. This approach compares favorably in terms of simplicity and programmability, if not in quality of the distribution, with approaches such as that in [GAL96].

More ambitiously, we can treat the computation stack, where functions are evaluated (compare [GSC96]), as an active container object in the sense sketched above. The individual stack frames (or framelets, following [AN96]) would thus be seen as active objects. The implication of this move is that the same attraction memory mechanism which makes data distribution automatic and transparent can be used for scheduling. As in [GSC96], stack frames would originally be resident on the processor which created them. In multithreaded systems like [GSC96] or [Nik93], these stack frames do not migrate until some processor is idle and queries its neighbors looking for work. In a COMA-based system, stack frame migration, like all other object migration, is initiated by cache replacement, which occurs whenever the local memory is sufficiently full.

All of this has several important implications for the compiler components. First, graph partitioning remains an important preparation for scheduling; now, however, we need to partition the graph into micro-threads (in the sense used in [Nik93]). Further, data dependence analysis must be extended to container objects like arrays and graphs (compare [RD96]), since the attraction memory itself does not enforce any particular ordering on distributed operations. The final implication is that scheduling ceases to be a compile-time operation: the proper question at compile time is data dependence, while computation distribution is now done at run-time, in hardware. We are currently fashioning simulator experiments to show that the efficiency obtained by ADARC can also be obtained by SDAARC, while encompassing a much larger and more dynamic range of applications.
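To make the active-message notion concrete, here is a minimal model of data coupled with its handler and a per-processor message queue (illustrative only; the names and the run-to-completion policy are our own assumptions, not the SDAARC implementation):

```python
# Minimal model of an active message: data plus a handler ("instruction
# pointer") telling the recipient how to process it. Illustrative only.

from dataclasses import dataclass
from typing import Callable, Tuple
from collections import deque

@dataclass
class ActiveMessage:
    handler: Callable          # what to do with the data on arrival
    payload: Tuple             # the data objects themselves

class Processor:
    def __init__(self, name: str):
        self.name = name
        self.queue = deque()   # the per-processor active message queue

    def send(self, msg: ActiveMessage) -> None:
        self.queue.append(msg)

    def run(self) -> None:
        # Each dequeued message behaves like a micro-thread: it runs to
        # completion without blocking, possibly sending further messages.
        while self.queue:
            msg = self.queue.popleft()
            msg.handler(self, *msg.payload)

def store_element(proc: "Processor", array_id: str, index: int, value: float) -> None:
    print(f"{proc.name}: {array_id}[{index}] <- {value}")

p1 = Processor("P1")
p1.send(ActiveMessage(store_element, ("A", 3, 2.5)))
p1.run()
```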

7 Conclusion

We have presented two novel computer architectures. The first, ADARC, provides an efficient instruction-level approach to evaluating dataflow graphs. This required compiler technology that partitions the graphs into basic blocks and coordinates the operations within the blocks so as to exploit the routing capabilities of ADARC's associative crossbar network. This architecture has been shown, in both simulation experiments and on a 12-processor hardware prototype, to be highly efficient for regularly-structured scientific applications such as neural networks.

The second system, SDAARC, represents our current plans, based upon our experience with ADARC. It retains the associative crossbar network, but extends it to allow looser coordination between the processors. Surprisingly, the natural extensions to the associative communication network lead us to the attraction memory necessary for a Cache Only Memory Architecture (COMA). The communication network thus fashions a transparent, self-distributing Virtual Shared Memory (VSM) from the local memories of the parallel processors. Further, by partitioning the dataflow graphs into micro-threads instead of basic blocks, we can use the same techniques which distribute data in order to distribute computation (scheduling). In ADARC, the compiler relieved the architecture; with SDAARC, the architecture relieves the compiler. This allows the compiler designer to concentrate on questions like dependence analysis and graph partitioning, leaving data and computation distribution to the hardware. In both systems, the programmer is freed from all of these questions, and can concentrate on producing correct code with maximal implicit parallelism.

SDAARC allows automatic and adaptive mapping of sequential programs, including a large class of "dusty decks" (legacy software), onto parallel hardware. The cost of parallel programming is thus reduced to approximately that of sequential programming. We anticipate that this approach will be attractive for the large class of applications which are currently not run on parallel architectures because of the high cost of parallel programming. We are currently performing simulation experiments to fine-tune the attraction-memory coherence protocols for SDAARC, and to obtain empirical verification of the SDAARC architecture.

In closing, the ADARC and SDAARC architectures have the following in common: they are guided by a functional, dataflow conception of data and computation. Abstractly, von Neumann memory does not enter into this picture.

In both systems, the source code, regardless of language, is taken as a specification of the transformations performed on values, and the dependencies between these values determine the order of computation. With SDAARC, we take this simple view a step further, and allow the programmer much more freedom to model local state and container objects — areas traditionally modeled by a random-access, over-writable memory. We can do this, and preserve the advantages of the dataflow paradigm, by constructing an interconnection network for multiple processors which encapsulates data storage, distribution and transportation.

References

[ABU91] Arvind, L. Bic, and Th. Ungerer. Evolution of dataflow computers. In Advanced Topics in Data-Flow Computing. Prentice Hall, 1991.

[AG94] George S. Almasi and Allan Gottlieb. Highly Parallel Computing. Benjamin/Cummings Publishing Company, second edition, 1994.

[AN96] Murali Annavaram and Walid A. Najjar. Comparison of two storage models in data-driven multithreaded architectures. In Eighth IEEE Symposium on Parallel and Distributed Processing (SPDP) [IEE96], pages 122–129.

[Bac82] John Backus. Function level computing. IEEE Spectrum, 19:22–27, August 1982.

[BW88] Richard Bird and Philip Wadler. Introduction to Functional Programming. Prentice Hall International, 1988.

[CGSvE93] David E. Culler, Seth Copen Goldstein, Klaus Erik Schauser, and Thorsten von Eicken. TAM — A compiler controlled Threaded Abstract Machine. In Journal of Parallel and Distributed Computing, Special Issue on Dataflow, June 1993.

[Den74] Jack B. Dennis. First version of a data flow procedure language. In Lecture Notes in Computer Science, volume 19. Springer Verlag, 1974.

[GAL96] Jordi Garcia, Eduard Ayguadé, and Jesús Labarta. A framework for automatic dynamic data mapping. In Eighth IEEE Symposium on Parallel and Distributed Processing (SPDP) [IEE96], pages 92–99.

[GSC96] Seth Copen Goldstein, Klaus Erik Schauser, and David Culler. Enabling primitives for compiling parallel languages. In Languages, Compilers and Run-Time Systems for Scalable Systems, pages 153–168. Kluwer Academic Press, 1996.

[HBM+96] Frank Henritzi, Andreas Bleck, Ronald Moore, Bernd Klauer, and Klaus Waldschmidt. ADARC: A new multi-instruction issue approach. In International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA '96), 1996.

[HLH92] E. Hagersten, A. Landin, and S. Haridi. DDM — A Cache-Only Memory Architecture. IEEE Computer, 25(9), 1992.

[IEE96] IEEE. Eighth IEEE Symposium on Parallel and Distributed Processing (SPDP), New Orleans, LA, October 1996. IEEE Computer Society Press.

[JW93] Simon L. Peyton Jones and Philip Wadler. Imperative functional programming. In 20th ACM Symposium on Principles of Programming Languages, pages 71–84, Charleston, January 1993. ACM.

[Nik93] Rishiyur S. Nikhil. A multithreaded implementation of Id using P-RISC graphs. In Proceedings of the Sixth Annual Workshop on Languages and Compilers for Parallel Computing, pages 390–405, Portland, Oregon, August 1993. Springer Verlag LNCS 768.

[RD96] Martin C. Rinard and Pedro Diniz. Commutativity analysis: A new analysis framework for parallelizing compilers. In Proceedings of the SIGPLAN '96 Conference on Programming Language Design and Implementation, pages 54–67, Philadelphia, PA, May 1996. ACM.

[SKZW97] Justin Strohschneider, Bernd Klauer, Stefan Zickenheiner, and K. Waldschmidt. ADARC: An Associative Dataflow Architecture. In Anargyros Krikelis and Charles C. Weems, editors, Associative Processing and Processors. IEEE Press, 1997.

[SW94] Justin Strohschneider and Klaus Waldschmidt. ADARC: A fine grain dataflow architecture with associative communication network. In EUROMICRO '94, Liverpool, September 1994.

[vECGS92] Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Erik Schauser. Active messages: a mechanism for integrated communication and computation. In Proc. of the 19th International Symposium on Computer Architecture, Gold Coast, Australia, May 1992.
