Tailoring a Self-Distributing Architecture to a Cluster Computer Environment

Ronald Moore, Bernd Klauer, Klaus Waldschmidt

Technische Informatik, J. W. Goethe-University, P.O. Box 11 19 32, Frankfurt, Germany
{moore, klauer, waldsch}@ti.informatik.uni-frankfurt.de

Published in: 8th Euromicro Workshop on Parallel and Distributed Processing (EURO-PDP 2000), Rhodes, Greece, IEEE Computer Society Press, 2000

Abstract

This paper analyzes the consequences of existing network structure for the design of a protocol for a radical COMA (Cache Only Memory Architecture). Parallel Computing today faces two significant challenges: the difficulty of programming and the need to leverage existing "off-the-shelf" hardware. The difficulty of programming parallel computers can be split into two problems: distributing the data, and distributing the computation. Parallelizing compilers address both problems, but have limited application outside the domain of loop-intensive "scientific" code. Conventional COMAs provide an adaptive, self-distributing solution to data distribution, but do not address computation distribution. Our proposal leverages parallelizing compilers, and then extends COMA to provide adaptive self-distribution of both data and computation. The radical COMA protocols can be implemented in hardware, software, or a combination of both. When, however, the implementation is constrained to operate in a cluster computing environment (that is, to use only existing, already installed hardware), the protocols have to be reengineered to accommodate the deficiencies of the hardware. This paper identifies the critical qualities of various existing network structures, and discusses their repercussions for protocol design. A new protocol is presented in detail.

Keywords: Runtime Support Systems, Scheduling, Load balancing, Task and object migration, Cache Only Memory Architectures (COMAs), Cluster Computing. 

This work was supported in part with funds of the Deutsche Forschungsgemeinschaft under reference number WA 357/15-1.

1. Introduction

Parallel and Distributed Computing has always faced two significant problems: the difficulty of programming parallel computers and the cost of high performance hardware. For the classical "scientific computation" field, these problems are relatively small compared to the need for high performance. Further, the computation to be performed is typically regularly structured, and mapping data and computation onto the available hardware is relatively easy. However, if Parallel and Distributed Computing is to make significant inroads outside its traditional "scientific" application area, the hardware and software costs will have to be reduced. On the one hand, cluster computing (e.g. [12]) leverages the huge installed base of network hardware to reduce hardware costs. On the other hand, various approaches are making progress at decreasing the cost of producing parallel software.

The problem here can, at a very high level of abstraction, be expressed quite simply: programming parallel computers is hard because data and computation have to be distributed amongst the available hardware. The distribution problem, in turn, consists of first deciding on the granularity of the entities to distribute, and second on positioning those entities in time and space. Automatic parallelizing compilers can identify and distribute significant amounts of implicit parallelism in conventional "sequential" code (e.g. [16]). Nonetheless, these compilers are still largely restricted to regular control structures, typically loops, and simple data structures, typically arrays.

Help for programs with irregular structures is not entirely missing. The existence of a common address space, either in the form of an SMP (Symmetric Multiprocessor) architecture or some form of Distributed Shared Memory (DSM), reduces the complexity of data distribution significantly. COMAs (Cache Only Memory Architectures, see [3]) go even further by supplying an adaptive, self-distributing common address space. COMAs do not, however, address the problem of computation distribution. Further, some computation distributions can be very detrimental to overall COMA performance [11].

To address these problems, we have proposed a radical COMA architecture, which we call SDAARC, for Self-Distributing Associative ARChitecture [5] [6] [7]. This architecture leverages automatic parallelizing compiler techniques to identify parallelism in conventional code. Relatively sequential code segments are converted into threads, in the sense used in multithreaded architectures such as TAM [2] or P-RISC [9]. These threads are then mapped onto the available hardware using extended COMA techniques (see section 2, below). Thus, the entire distribution problem is addressed: both data and computation are distributed automatically, transparently and adaptively.

We are now in the process of implementing the SDAARC proposal on a small cluster of Linux PCs. While doing so, we have reexamined the coherency protocol published in [6] and discovered serious problems. These problems are discussed in a general fashion in section 3, and their implications for the protocol design are discussed in section 4. A new SDAARC protocol is presented in section 5. Conclusions are presented in section 6.

2. The SDAARC Proposal

In this section, we review the SDAARC Proposal as published in [7] [6] [5]. We first review conventional Cache Only Memory Architectures in section 2.1 to establish the necessary terminology, and then distinguish what is new in SDAARC in section 2.2.

2.1. Conventional COMA

The concept behind COMAs builds on and extends the following simple realization: in every multiprocessor architecture where each processor has at least one private cache, the caches form a self-organizing, adaptive system, where data is transparently and dynamically mapped onto the processors. COMAs extend this idea by augmenting each local memory with a directory and a cache coherency protocol controller. Thus augmented, each local memory is functionally equivalent to a cache, albeit a very large and rather slow cache (compare figure 1). Since it leads to confusion to call these augmented memories "caches", they are instead referred to as Attraction Memories (AMs) in the literature [3]. A minimal COMA thus has a structure such as that illustrated in figure 1b.

[Figure 1. Two Parallel Memory Architectures: (a) A Cache Coherent Non-Uniform Memory Architecture (CC-NUMA), with exploded cache (cf. [4], fig. 1.6); (b) A minimal Cache Only Memory Architecture (COMA).]

The augmentation can be done in hardware (e.g. [3]), in software, or in a mixture of software and hardware (e.g. [10]). COMA relies on the programmer and/or the compiler to distribute computation amongst the processors. Whether a COMA will be able to distribute the data efficiently depends to a large degree on just how the programmer and/or the compiler distributed the computation. See [11] for a detailed discussion of COMA performance with various kinds of applications.
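To make the attraction-memory idea concrete, the following minimal sketch (in Java, the language of our emulator) shows how a read might be serviced by an AM that behaves like a very large cache. The class and method names, and the Network abstraction, are illustrative assumptions for this paper, not part of any published COMA implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of an Attraction Memory: a local memory plus a directory,
// behaving like a very large, slow cache. Names are illustrative only.
class AttractionMemory {
    enum State { INVALID, SHARED, EXCLUSIVE }

    static class Line { State state; byte[] data; }

    private final Map<Long, Line> directory = new HashMap<>(); // global address -> local line
    private final Network network;                              // abstraction of the interconnect

    AttractionMemory(Network network) { this.network = network; }

    /** Service a read issued by the local processor. */
    byte[] read(long globalAddress) {
        Line line = directory.get(globalAddress);
        if (line != null && line.state != State.INVALID) {
            return line.data;                    // hit: the data has already been "attracted" here
        }
        // Miss: obtain a copy from the other AMs (broadcast or directory lookup,
        // depending on the network, see section 3).
        byte[] data = network.fetchCopy(globalAddress);
        Line fresh = new Line();
        fresh.state = State.SHARED;
        fresh.data = data;
        directory.put(globalAddress, fresh);     // the copy now resides in this AM
        return data;
    }
}

// Placeholder for whatever transport the cluster provides (bus, Ethernet, ...).
interface Network {
    byte[] fetchCopy(long globalAddress);
}
```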

2.2. SDAARC: Radical COMA

The SDAARC Proposal [5] [6] [7] consists of two components: a low-latency network for parallel systems with migratory objects, and a run-time system for implementing a COMA which automatically and transparently distributes both data and computation at run-time. We consider the network proposal below (section 3). In this section we consider the run-time system.

The basic idea behind the SDAARC run-time system is fairly simple. If we can partition a program into a set of grains, and if each grain is associated with a data object, then we can use a COMA to distribute these grains. If we then execute each grain at the site where it is resident, we have effectively used COMA to distribute computation by means of distributing data. This approach is thus a hybrid between static and dynamic scheduling: static scheduling is used to extract sequential grains, and dynamic scheduling is used to distribute these grains in space and time. We utilize compiler techniques from multiprocessor architectures [2] [9] [15] and thus refer to the grains as microthreads (a term established in [9]). The set of all microthreads represents a coarse-grain dataflow graph. See [5] for more details on how we extract threads.

Each microthread has a frame (actually a framelet, see [1]): a data structure which stores all the information necessary to execute the microthread. This includes all of the microthread's arguments, and the addresses to which the microthread's results should be sent. Executing one microthread creates results which can be sent to other microthreads. If the recipients are remote, two courses of action are available: either we can simply send the data to the recipient framelet, leaving the recipient framelet where it is, or we can bring the recipient framelet to the argument. The first alternative (argument travels to framelet) is functionally equivalent to active messages [15]. The second alternative (framelet travels to argument) is more in the spirit of conventional COMAs. SDAARC employs both, leaving the decision up to the protocol controller. The controller thus shops for framelets for its site to store and (potentially) execute. For this reason, we call the controllers brokers, to distinguish them from the simpler controllers in conventional COMAs.

The distribution is thus driven by two sets of forces: sharing arguments leads to attractive forces, which bring related framelets together; congestion leads to dissipative forces, which evict framelets to balance the load. The entire system is thus constantly in flux, attempting to balance the contradictory goals of minimizing communication while maximizing parallelism.

So far, we have only considered framelets, and the only communication has been to exchange scalar arguments. A system with only framelets could be built (see [8] [13] for one such system), but it would be extremely difficult to implement important classes of data structures. To allow arrays and other random-access container structures, we also need a data partition, which allows conventional COMA reads and writes to arbitrary addresses. Interestingly, however, reads are now inherently non-blocking ("split-phase"), and the results of a read can be sent directly to a recipient framelet, and not (necessarily) to the processor which initiated the read.

SDAARC can, like other COMAs, be realized with or without special purpose hardware. The implications of various degrees of hardware support are the subject of the remainder of this paper.
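The following sketch illustrates, again in Java, what a framelet might look like and how a broker could choose between the two delivery alternatives described above (argument travels to framelet versus framelet travels to argument). All class, field and method names here are assumptions made for illustration; they are not taken from the SDAARC sources.

```java
import java.util.List;

// Hypothetical framelet: the data structure holding everything needed to run one microthread.
class Framelet {
    long globalAddress;        // identity of this microthread instance
    Object[] arguments;        // slots filled by incoming apply messages
    int missingArguments;      // the framelet becomes executable when this reaches zero
    List<Long> resultTargets;  // global addresses of framelets that receive our results
}

// Hypothetical broker: decides whether the argument or the framelet should travel.
class Broker {
    /** Deliver one argument produced on this site to a (possibly remote) framelet. */
    void deliver(long targetFramelet, int slot, Object value) {
        if (isResidentLocally(targetFramelet)) {
            applyLocally(targetFramelet, slot, value);
        } else if (preferActiveMessage(targetFramelet)) {
            // Alternative 1: the argument travels to the framelet (active-message style).
            sendApply(targetFramelet, slot, value);
        } else {
            // Alternative 2: the framelet travels to the argument (COMA style):
            // request migration of the framelet to this site, then apply the argument here.
            requestMigration(targetFramelet);
            sendApply(targetFramelet, slot, value);
        }
    }

    // This decision is where load balancing enters: attractive forces favour
    // pulling related framelets together, dissipative forces favour staying put.
    private boolean preferActiveMessage(long target) { return localLoadIsHigh(); }

    // Stubs standing in for the protocol machinery described in section 5.
    private boolean isResidentLocally(long a) { return false; }
    private void applyLocally(long a, int s, Object v) {}
    private void sendApply(long a, int s, Object v) {}
    private void requestMigration(long a) {}
    private boolean localLoadIsHigh() { return true; }
}
```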

3. Networks for COMAs

In order to better understand the constraints placed on protocol design by existing installed networks, we first compare some of the networks suggested in the literature for COMAs with those widely available (section 3.1). With this background, we discuss the problems which various networks can cause for the correct operation of a COMA (section 3.2).

3.1. Example Networks

A set of network topologies which come into consideration for COMAs is presented in Table 1.

[Table 1. Major Network types: flat and hierarchical topologies for the Common Bus, the DDM Network, the SDAARC Network, and Ethernet.]

Bus-based Networks: A simple common bus, such as those used in SMPs, consists of one physical link shared by all of the processors and one bus arbiter. Hierarchical bus systems are possible. No special routing is available for COMAs. Collisions are possible at each stage of the network. All messages are broadcast on the non-hierarchical system. Hierarchical buses offer limited support for multicasting.

The DDM Bus: One of the first COMAs in the literature, the Data Diffusion Machine [3], used common buses, where each bus was augmented with a directory. Hierarchical systems can be built by adding buses and directories. All messages were sent first to the directory node and then from the directory to the recipient(s). The directories perform dynamic routing based on the current data distribution. Broadcasting is supported, but must also go over the directory. Hierarchical buses offer limited support for multicasting.

The SDAARC Network, as proposed in [7], used a two-dimensional grid of buses. There are twice as many buses as processors. Each horizontal bus (compare the illustration in Table 1) carries output from exactly one processor, and each vertical bus carries input to exactly one processor. Every message must traverse first a horizontal and then a vertical bus. Collisions happen only on the vertical lines, and only when more than one processor sends to one recipient. The switches between the horizontal and vertical buses have small associative memories similar to the directories in the DDM. Hierarchical systems can be built by building grids of grids. Both broadcasting and multicasting are supported.

Fast Ethernet (IEEE Standard 802.3, see e.g. [14]) is perhaps the most commonly used network technology for Cluster Computing today [12]. It presents an attractive tradeoff between cost and performance. Better still, it is already installed, or soon to be installed, in many locations, and is thus effectively "free". In a flat (non-hierarchical) Ethernet system, each processor has dedicated links to and from a shared hub or switch (a.k.a. a "switched hub"). Collisions can occur at the hub. Hierarchies are possible, and are usually built through processors with two network connections (called gateways). No special hardware is available for accelerating COMAs. Support for broadcasting or multicasting is limited or nonexistent.

Other networks are of course possible and available. We focus on these networks since they represent two basic technologies found in non-COMAs (common buses and Ethernet), and two networks (DDM and SDAARC) explicitly tailored for COMAs. The important qualities in the two COMA-specific networks are:

1. Associative switching;
2. Support for broadcasting;
3. Support for multicasting.

We will examine why these features are important, and what can be done if they are missing, in the next section.
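To illustrate what missing multicast support means in practice on a commodity Ethernet cluster, the following hedged sketch (using only standard java.net classes, not any SDAARC code) shows the typical fallback: one unicast datagram per possible holder of an object, so a single logical multicast costs N sends.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetSocketAddress;
import java.util.List;

// Emulating a multicast on a switched Ethernet without hardware support:
// one unicast datagram per possible holder of the object.
class UnicastFanout {
    void send(byte[] message, List<InetSocketAddress> possibleHolders) throws Exception {
        try (DatagramSocket socket = new DatagramSocket()) {
            for (InetSocketAddress site : possibleHolders) {
                socket.send(new DatagramPacket(message, message.length, site));
            }
        }
    }
}
```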

3.2. Problems Networks Make For COMAs

COMAs in general (and SDAARC in particular) present two novel problems for networks. Both problems arise since messages are (usually) addressed to migratory objects and not to physical sites:

1. Since an object can be anywhere, a message addressed to this object either needs to be sent everywhere the object could be (each site then checks whether the object is currently resident locally), or the message needs to be sent to some form of associative switch (such as those in the SDAARC or DDM networks) which knows where to forward the message. All of these approaches use some form of broadcasting or multicasting. If neither broadcasting nor multicasting is particularly efficient, then care must be taken to reduce the network traffic as much as possible.

2. Even if we assume that a message can be successfully routed to all the sites where a recipient object is resident, we still need to take special care during migration. In its simplest form, the problem is that for some amount of time, the recipient object is nowhere.

In the general case, where we can assume next to nothing about the timing of messages sent across the network (we do assume at least that two messages traveling along the same path arrive in the same order they were transmitted), a message can still miss its recipient, no matter how sophisticated the protocol is. This problem is illustrated with a message sequence diagram in figure 2. The scenario is as follows: at some time, an object migrates from Site a to Site b. The two sites (a and b) can exchange any number of acknowledgments; we assume only that there is some amount of time before the total transaction reaches Site b, during which Site b refuses messages sent to the object, and some amount of time after the transaction, during which Site a believes that the object is safely moved away and also refuses messages sent to the object. As such, it is possible for a third Site c to send a message to the object, such that the message arrives before the transaction at Site b and after the transaction at Site a.

[Figure 2. A message sent from Site c misses its recipient at both Site a and Site b.]

4. Protocol Alternatives

We saw above (section 3.2) that the main challenge for a COMA protocol and network is to deliver every message to every copy of an object, even when that object is in the midst of migration. Depending on the characteristics of the network, several alternatives exist:

Special COMA problems do not arise if the network resembles a common bus. If each message is broadcast, and no two messages can be in the network at once, then it is not possible for a third party to send a message which arrives at one site before the migration, and at another site after the migration. In this case, we can use fairly simple COMA protocols. This approach breaks down, however, for any more complicated network, including a hierarchical bus system.

For a hierarchical system with associative switches, it is sufficient if the associative switches note that an object is leaving during migration. There are then two possibilities: messages sent to an object in the leaving state can be returned to the sender, who then repeats the request (this is done in the DDM Protocol), or the message can be forwarded to the object's new home (this was done in the SDAARC Protocol in [6]). This requires, however, that a well-defined time exists when the leaving copy can be flushed out of the AM. Such a time can be derived for the bus-based networks, but does not exist for packet-switched networks such as Fast Ethernet (see section 3.1 above).

Finally, given a network where we can assume virtually nothing about the timing of messages, the protocol must go to the greatest lengths to ensure that no object disappears from the system, even intermittently during migration. Such a protocol for conventional COMAs was introduced in [11] and is called COMA-F. COMA-F changes the concept of a directory in the following manner. Each site now has an extra data structure, called a directory, which stores information about a subset of the global address space. Given an address, any site can find the directory for that address, and the directory does not migrate. The directory stores the number and locations of all copies of each object in its subset of the address space. All messages are sent first to the directory, and then the site with the directory forwards the messages to the copies. If more than one copy of an object exists, one of those copies is marked (it is in a state called Master-Shared in COMA-F). Unmarked copies can be dropped from the system at any time, but marked copies are returned to the site with the directory.

COMA-F is the only protocol known to the authors which can cope with a totally general network, and serves as the basis for our new protocol in the next section.

5. A SDAARC Protocol for an Existing Network

In the last section, we examined how various networks make various extensions to a simple COMA protocol necessary. At the end, we arrived at the COMA-F protocol [11], which makes the smallest set of assumptions about the network. Like other conventional COMA protocols, COMA-F only provides a DSM. That is, it only provides support for read and write operations (and for operations occurring as a result of the protocol itself, such as object migration). We present a protocol in this section based on the more radical COMA envisioned in the SDAARC project and outlined above in section 2.2. We present definitions in section 5.1, and then present the major transactions in section 5.2.

5.1. Definitions

Using the terminology from section 2.1 above, we assume a system of Sites, where each site consists of a processor and an Attraction Memory (AM). For this protocol, we stipulate that an AM consists of the following data structures:

Local Tables: Two set-associative tables that map global addresses onto a pair consisting of a state and a local address. One table is used for framelets, and one for container objects.

Routing Table: (the "directory" in COMA-F) maps a set of global addresses onto lists of site numbers (or equivalently onto a site vector). The Routing Table must be fully associative, since there is no way to evict an item from the Routing Table. Every object is in the Routing Table at only one site. The mapping of addresses to sites is static: any processor can at any time compute the site where a given object is in the Routing Table. This site is called the home site for this object.

An object can be in one of 8 states: Invalid, Routed, Exclusive-Owned, Exclusive-Borrowed, Original-Owned, Original-Borrowed, Clone-Owned and Clone-Borrowed. An object is Invalid at a given site if it is in neither the Local Table nor the Routing Table. An object is Routed if it is in the Routing Table but not in the Local Table. An object is Owned if it is in both the Routing and the Local Table, and Borrowed if it is in the Local Table but not the Routing Table. Further, the object is Exclusive if it is only in the Local Table of this site, and otherwise either the Original or a Clone. Only one copy of an object can be in the Original state at any time, and at all times there must be at least one copy in either the Original or the Exclusive state. (The Original state is thus equivalent to the Master-Shared state in COMA-F, see above.)

The broker receives messages either from the local processor or from the network. Messages can be of the following types:

Inject: The AM is directed to store an object. Inject messages contain the state which the new object should have (thus there are inject exclusive messages, inject original messages, and so on).

Application: Data is being sent to a migratory object. Applications come in three basic flavors: Apply-Arguments, where one or more arguments are sent to a framelet; Apply-Read, where a container object is directed to send data to one framelet and a synchronization to another framelet (both of which are done with Apply-Arguments messages); and Apply-Write, where a container object receives data and is directed to send a synchronization to another framelet.

Delete: The AM is directed to delete an object.

Evict: Used to notify the home site of an object migration.

All messages include information identifying the site which initiated the message, and also identifying the state in which the object is expected to be found.
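The following Java fragment restates this vocabulary as types, purely as a sketch. The enum and field names mirror the definitions above; the modulo mapping of addresses to home sites is our own assumption for illustration, since the protocol only requires that the mapping be static and computable by every processor.

```java
// Sketch of the protocol vocabulary from section 5.1 as Java types (names illustrative).
enum ObjectState {
    INVALID, ROUTED,
    EXCLUSIVE_OWNED, EXCLUSIVE_BORROWED,
    ORIGINAL_OWNED, ORIGINAL_BORROWED,
    CLONE_OWNED, CLONE_BORROWED
}

enum MessageType {
    INJECT,            // store an object (carries the state the new copy should have)
    APPLY_ARGUMENTS,   // send one or more arguments to a framelet
    APPLY_READ,        // container object sends data + synchronization to framelets
    APPLY_WRITE,       // container object receives data, sends a synchronization
    DELETE,            // remove an object
    EVICT              // notify the home site of a migration
}

class Message {
    MessageType type;
    long objectAddress;
    int initiatorSite;         // every message identifies the site that initiated it
    ObjectState expectedState; // ... and the state the object is expected to be found in
    Object payload;
}

final class HomeSite {
    /** One possible static mapping of global addresses to home sites (an assumption). */
    static int of(long globalAddress, int numberOfSites) {
        return (int) Long.remainderUnsigned(globalAddress, numberOfSites);
    }
}
```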

5.2. Transactions

The rest of the protocol consists of a table cross-referencing states and messages. The transactions for object creation (Inject) and deletion (Delete) are relatively straightforward and are not considered here in detail. We focus instead on application and migration. Examples of these transactions are illustrated in Table 2.

Table 2 illustrates four possible transactions. In case (a), a CPU sends an apply message, and finds the recipient object in the Local Table and the Routing Table. The application can be performed without notifying any other site.

Cases (b) and (c) are more general. Here, the CPU's message must be sent to the home site (if not already there), and then forwarded to the object's current site(s). Sites with clones either update their copies (if a write-update protocol is desired) or invalidate them (if a write-invalidate protocol is desired). The site with the Original or the Exclusive copy decides, based on the current load distribution, whether to send a copy of the object to the site which initiated the apply (compare section 2.2 above), and if so, whether to keep a copy for itself.

[Table 2. Examples of Atomic Transactions:
(a) The simplest transaction: all data is local. No network traffic is required.
(b) Object is in Exclusive-Borrowed state. The CPU sends the message first to the home site (1). The home site forwards the message to the object (2). The object's site either simply accepts the message and performs the apply, or optionally sends the object to the initiating site (3a) with an inject message and updates the Routing Table (3b) with an evict message. The copy sent to the CPU can be in either Exclusive or Original (but not Clone) state, and the existing copy goes into Invalid or Clone state, respectively (it cannot stay Original).
(c) Object is in Original-Borrowed state. Exactly as in (b), except that the home site must also send updates to the copies in Clone state (2b and 2c). These copies are either updated (write-update) or invalidated (write-invalidate). Updating never causes additional traffic from a clone, however.
(d) Two forms of migration: An object in Exclusive or Original state is sent to another site, as in (b) or (c) above (1a and 1b). At some later time, the new copy is returned to the home site (2) with an evict message. The home site then elects a site and sends an inject message (3).]

Importantly, the site initiating the migration is allowed to send out an exclusive copy (and degrade its own copy to invalid), or to send out an original copy (and degrade its own copy to clone), but may not send out a clone and keep its own copy as original. This restriction is due to race conditions discussed below.

At any point, a site can decide to evict an object. This is illustrated in Table 2 (d). The home site must be notified, even if the object was in the clone state. When an original or exclusive copy is spontaneously evicted, it is sent back to the home site, which then either keeps it locally (in exclusive-owned or original-owned state) or sends it out again.

We need one final rule to cope with certain race conditions (see figure 3). We specify that any message sent from the home site is returned to sender if the message expected

to find an object in either the exclusive or the original state, and does not do so (i.e., the object is in the clone or invalid state instead). These returned messages, upon arrival at the home site, are then redirected to the copy in the exclusive or original state. Interestingly, this is sufficient to handle all transactions. More complicated exchanges are implicit in this specification. Thus, for example, a read to a container structure is a chain reaction of apply transactions. Migration can also cause chain reactions.
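As a non-authoritative illustration of this forwarding behaviour, the following Java fragment sketches how the home site might fan out an apply and how a copy-holding site might enforce the return-to-sender rule from figure 3. It reuses the hypothetical Message and ObjectState types from the earlier sketch; routingTable, forward() and the other helpers are placeholders, not SDAARC code, and all table bookkeeping is omitted.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of apply forwarding plus the return-to-sender rule (names illustrative).
class HomeSiteBroker {
    private final Map<Long, List<Integer>> routingTable = new HashMap<>();

    /** At the home site: forward an apply to every site currently holding a copy. */
    void onApplyAtHomeSite(Message msg) {
        List<Integer> copies = routingTable.getOrDefault(msg.objectAddress, Collections.emptyList());
        for (int site : copies) {
            forward(msg, site); // write-update: update clones; write-invalidate: invalidate them
        }
    }

    /** At a copy-holding site: handle a message forwarded by the home site. */
    void onForwardedApply(Message msg, ObjectState actualState) {
        if (isExclusiveOrOriginal(msg.expectedState) && !isExclusiveOrOriginal(actualState)) {
            // Race: the copy migrated after the home site forwarded the message.
            // Return it to the home site, which redirects it to the current
            // Exclusive/Original copy (figure 3).
            returnToHomeSite(msg);
            return;
        }
        performApply(msg);
        // Optionally migrate the object to msg.initiatorSite based on load
        // (inject + evict, cases (b) and (c) in Table 2).
    }

    private static boolean isExclusiveOrOriginal(ObjectState s) {
        return s == ObjectState.EXCLUSIVE_OWNED || s == ObjectState.EXCLUSIVE_BORROWED
            || s == ObjectState.ORIGINAL_OWNED  || s == ObjectState.ORIGINAL_BORROWED;
    }

    private void forward(Message msg, int site) {}
    private void returnToHomeSite(Message msg) {}
    private void performApply(Message msg) {}
}
```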

[Figure 3. Race Conditions in the new SDAARC Protocol: An Apply is sent from Site c to the home site and then on to Site a. At about the same time, Site a evicts the object, and the home site relocates the object at Site b. The Apply arrives too late at Site a, is returned to the home site, and is then forwarded to Site b. Compare figure 2 above.]

6. Conclusions and Future Work

In this paper, we have reviewed a proposal for a radical COMA which leverages parallelizing compiler techniques and then extends COMA techniques to tackle the entire problem of distributing both data and computation across a parallel and distributed computing system (whereas conventional COMA addresses only data distribution). Since first publishing the SDAARC proposal in [7], we have made the following progress toward implementing it:

A first prototype for the microthread extraction and code generation has been completed (preliminary thread extraction results have been published in [5]).



Further, a distributed emulator has been written in Java and is currently being tested on a Beowulf cluster consisting of four Linux PCs and a dedicated Fast Ethernet network.

Initial experience with the emulator identified many of the problems discussed in this paper, and led to the development of the protocol presented above.

References

[1] M. Annavaram and W. A. Najjar. Comparison of two storage models in data-driven multithreaded architectures. In Eighth IEEE Symposium on Parallel and Distributed Processing (SPDP), pages 122–129, New Orleans, LA, Oct. 1996. IEEE Computer Society Press.

[2] D. E. Culler, S. C. Goldstein, K. E. Schauser, and T. von Eicken. TAM — A compiler controlled Threaded Abstract Machine. Journal of Parallel and Distributed Computing, Special Issue on Dataflow, June 1993.

[3] E. Hagersten, A. Landin, and S. Haridi. DDM — A Cache-Only Memory Architecture. IEEE Computer, 25(9):44–54, 1992.

[4] J. Handy. The Cache Memory Book. Academic Press, San Diego, CA, 1993.

[5] R. Moore, M. Klang, B. Klauer, and K. Waldschmidt. Combining static partitioning with dynamic distribution of threads. In F. J. Rammig, editor, Distributed and Parallel Embedded Systems, IFIP WG10.3/WG10.5 International Workshop on Distributed and Parallel Embedded Systems (DIPES '98), pages 85–96. Kluwer Academic Publishers, 1999.

[6] R. Moore, B. Klauer, and K. Waldschmidt. Automatic scheduling for Cache Only Memory Architectures. In Third International Conference on Massively Parallel Computing Systems (MPCS '98), Colorado Springs, Colorado, Apr. 1998.

[7] R. Moore, B. Klauer, and K. Waldschmidt. A combined virtual shared memory and network which schedules. International Journal of Parallel and Distributed Systems and Networks, 1(2):51–56, 1998.

[8] R. Moore, S. Zickenheiner, B. Klauer, F. Henritzi, A. Bleck, and K. Waldschmidt. Neural compiler technology for a parallel architecture. In International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA '96), Sunnyvale, CA, Aug. 1996.

[9] R. S. Nikhil. A multithreaded implementation of Id using P-RISC graphs. In Proceedings of the Sixth Annual Workshop on Languages and Compilers for Parallel Computing, pages 390–405, Portland, Oregon, Aug. 1993. Springer Verlag LNCS 768.

[10] A. Saulsbury, T. Wilkinson, J. Carter, and A. Landin. An argument for simple COMA. In First IEEE Symposium on High Performance Computer Architecture, pages 276–285, Raleigh, North Carolina, Jan. 1995.

[11] P. Stenström, T. Joe, and A. Gupta. Comparative performance evaluation of cache-coherent NUMA and COMA architectures. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 80–91, Gold Coast, Australia, May 1992.

[12] T. Sterling, J. Salmon, D. Becker, and D. F. Savarese. How to Build a Beowulf. MIT Press, May 1999.

[13] J. Strohschneider, B. Klauer, S. Zickenheiner, F. Henritzi, and K. Waldschmidt. ADARC – Associative processors and processing. In A. Krikelis and C. C. Weems, editors, Associative Processing and Processors, pages 82–96. IEEE Computer Society Press, 1997.

[14] A. S. Tanenbaum. Computer Networks. Prentice Hall, 3rd edition, 1996.

[15] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Active messages: a mechanism for integrated communication and computation. In Proc. of the 19th International Symposium on Computer Architecture, Gold Coast, Australia, May 1992. (Also available as Technical Report UCB/CSD 92/675, CS Div., University of California at Berkeley.)

[16] M. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, 1996.