MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY

A. I. Memo No. 1272

April 26, 1994

Experiments with Dataflow on a General-Purpose Parallel Computer

Ellen Spertus and William J. Dally

Abstract: The MIT J-Machine [2], a massively-parallel computer, is an experiment in providing general-purpose mechanisms for communication, synchronization, and naming that will support a wide variety of parallel models of computation. We have developed two experimental dataflow programming systems for the J-Machine. For the first system, we adapted Papadopoulos' explicit token store [12] to implement static and then dynamic dataflow. Our second system made use of Iannucci's hybrid execution model [10] to combine several dataflow graph nodes into a single sequence, decreasing scheduling overhead. By combining the strengths of the two systems, it is possible to produce a system with competitive performance. We have demonstrated the feasibility of efficiently executing dataflow programs on a general-purpose parallel computer.

Keywords: compilation, parallelization, dataflow, hybrid architectures, MIMD.

The research described in this paper was supported in part by the Defense Advanced Research Projects Agency under contracts N00014-88K-0738 and N00014-87K-0825, by a National Science Foundation Presidential Young Investigator Award, grant MIP-8657531, with matching funds from General Electric Corporation and IBM Corporation, and by a National Science Foundation Graduate Fellowship.

1 Introduction

The dataflow programming model is attractive because it exposes parallelism in computer programs, a crucial step in compiling for multiprocessing computers. Once a program has been converted into a dataflow graph, only the necessary constraints among the operations remain, and the program can, ideally, be spread across many processors. In the past, this has been done by special-purpose dataflow machines such as DDM1 [5], the Manchester Dataflow Machine [7], Monsoon [13], and Sigma-1 [15] that directly execute dataflow graphs. In addition to research in pure dataflow architectures, there is growing interest in developing hybrid architectures, such as the EM-4 [14], that take advantage of the parallelism found by dataflow methods without sacrificing the straight-line efficiency of von Neumann machines [6].

Our approach is to develop a system to execute dataflow programs on the J-Machine, a massively-parallel general-purpose computer [2]. The J-Machine was not specifically designed to support dataflow execution but instead to provide universal mechanisms for concurrency, synchronization, and naming that support many parallel programming models. Having universal mechanisms allows the separation of programming-model issues from issues of machine organization [4]. Our task was to find ways to utilize the J-Machine's mechanisms to meet the requirements of dataflow programs. Additionally, we sought to find the best possible representation of dataflow programs to match the J-Machine. Specifically, our first system involved translating each node of a dataflow graph into a sequence of code, where execution proceeded sequentially within each sequence, but the order of the sequences was determined at run-time.
The second approach, motivated by a desire to lessen run-time scheduling overhead, made use of Iannucci's work on hybrid architectures [10] and Traub's work on dataflow graph sequentialization [21] to produce a single sequence of code for several dataflow graph nodes, allowing some scheduling to be done at compile time. Both systems used dataflow graphs produced from the dataflow language Id [11]. This paper describes our experience and results with these systems.

1.1 Background

1.1.1 Id

Id is a mostly-functional language originally designed for programming dataflow computers [11]. Interesting features include I-structures and mechanisms for loop parallelization. I-structures are data structures that bypass the inefficiency of array modifications in purely-functional languages, where an array must be copied every time it is modified. Copying is not necessary for I-structures. Slots are initialized to empty, and read requests of an empty cell are silently deferred until the data is available. Writing a cell more than once denotes a run-time error. While I-structures prevent Id from being purely functional, the language remains deterministic, an arguably more important property. The unbounded latency of an I-structure read is one of the reasons fine-grained scheduling is needed: to use efficiently the time that would otherwise be wasted waiting for a read request to complete.

A major source of parallelism in Id programs comes from loops. The semantics of Id is such that instructions from different iterations of a loop can be executed out of order as long as data dependencies are obeyed. More details about loop statements will appear later in the paper.
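The I-structure behavior described above can be sketched in a few lines of Python. This is an illustrative model only, not the paper's implementation: slots begin empty, a read of an empty slot is silently deferred (here, by queuing a callback), and a second write to the same slot is a run-time error.

```python
# Sketch of I-structure semantics: write-once slots with deferred reads.
class IStructure:
    def __init__(self, size):
        self.slots = [None] * size                 # slot contents, once written
        self.written = [False] * size              # empty/full state per slot
        self.deferred = [[] for _ in range(size)]  # readers waiting on each slot

    def read(self, i, callback):
        """Deliver slot i's value to callback, now or when it is written."""
        if self.written[i]:
            callback(self.slots[i])
        else:
            self.deferred[i].append(callback)      # silently defer the read

    def write(self, i, value):
        if self.written[i]:
            raise RuntimeError(f"I-structure slot {i} written twice")
        self.slots[i] = value
        self.written[i] = True
        for cb in self.deferred[i]:                # satisfy deferred readers
            cb(value)
        self.deferred[i].clear()
```

Note that a read issued before the corresponding write still succeeds, which is why I-structures preserve determinism even though they are not purely functional.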

1.1.2 The J-Machine

The J-Machine [2] is a massively-parallel MIMD computer based on the Message-Driven Processor (MDP) [3], a custom chip. For this research, we used a simulator of a 32-node J-Machine [9].¹ Each processor has 260K (4K on chip) of 32-bit words augmented with 4-bit tags. Tag types include booleans, integers, symbols, pointers, and cfutures. A cfuture is used for synchronization to represent an empty location and typically is written into a memory location before the actual data value is ready. If an attempt is made to operate on the cfuture, a fault occurs.

The MDPs communicate with each other by sending messages through a low-latency network. When a message arrives at its destination, it is placed on the message queue, and a new task is created when the message reaches the head of the queue.
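The cfuture convention can be modeled in a short Python sketch. This is a hypothetical illustration of the idea, not MDP hardware behavior: a distinguished tag value marks a not-yet-computed word, and any attempt to use such a word as an operand faults.

```python
# Sketch of cfuture-style synchronization: a tagged "empty" word faults on use.
CFUTURE = object()   # stands in for the 4-bit cfuture tag on a memory word

class Memory:
    def __init__(self, size):
        self.words = [CFUTURE] * size    # locations start out empty

    def store(self, addr, value):
        self.words[addr] = value         # writing the real value clears the tag

    def operate(self, addr):
        """Fetch a word for use as an operand; fault if it is still a cfuture."""
        word = self.words[addr]
        if word is CFUTURE:
            raise RuntimeError(f"cfuture fault at address {addr}")
        return word
```

In a real system the fault handler would suspend the reading task until the value arrives; here the exception simply marks where that would happen.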

1.2 Overview

In the next section, we describe a straightforward method of implementing Id on the J-Machine, based on Papadopoulos' explicit token store (ETS) [12]. In the following section, we describe the system we built to simulate Iannucci's hybrid architecture on the J-Machine, focusing on the run-time data structures used to support his hybrid architecture's style of synchronization, and on loop parallelization. In the conclusion, we discuss the strengths and weaknesses of the two systems and describe our plans to combine them into an efficient implementation of Id.

2 ETS on the J-Machine

2.1 The Explicit Token Store

In a dataflow graph representation of a program, nodes represent operators, and arcs represent dependencies. Tokens are the mechanism for carrying data values on these arcs. Abstract dataflow machines have a waiting-matching unit that matches tokens destined for dyadic (two-input) operators with their partners. A token consists of several components:

1. A value to be operated on.

2. A context, indicating what instantiation of the graph it belongs to. (This will be more fully explained in the section on dynamic dataflow.)

3. The destination address, corresponding to the node to which it should be delivered.

4. A port number, indicating whether it is the "left", "right", or sole input.

Left and right tokens with the same context and destination address must be matched with each other and sent to their destination address together to be executed. It is usually more efficient to explicitly store tokens that arrive before their partners than to implement a waiting-matching unit directly [12, pp. 44-45]. In an ETS strategy, tokens that arrive before their partners are stored in ordinary memory. When a left token, for example, is processed, the appropriate memory location is checked for its partner. The memory address is a function of the destination address and context, both of which the left and right tokens share. If that location is not empty, it must contain the right token, and the operation is performed. If the location is empty, the left token is stored there, to be retrieved when the right token arrives and checks that location.

¹By the time of the conference, we expect a real J-Machine to be operational.

[Figure 1 panels not reproduced: four snapshots (A)-(D) of the token store, showing slots such as C1: 3 and C2: 4 being matched into pairs C1 + D1 and C2 + D1.]

Figure 1: Snapshots for Dynamic Dataflow ETS
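The ETS matching discipline just described can be sketched in Python. This is a simplified illustration, not the paper's system: a dictionary keyed by (context, destination) stands in for the memory address computed from those two fields, and `ops` is an assumed mapping from destination nodes to dyadic operators.

```python
import operator

# Sketch of explicit-token-store matching for dyadic operators.
class ETS:
    def __init__(self, ops):
        self.ops = ops        # destination node -> dyadic operator
        self.store = {}       # (context, dest) -> first-arriving (port, value)
        self.results = []

    def token(self, value, context, dest, port):
        key = (context, dest)                  # plays the role of the ETS address
        if key not in self.store:
            self.store[key] = (port, value)    # partner not here yet: park token
            return
        other_port, other_value = self.store.pop(key)
        left, right = ((value, other_value) if port == "left"
                       else (other_value, value))
        self.results.append(self.ops[dest](left, right))

ets = ETS({"n1": operator.sub})
ets.token(7, "C1", "n1", "left")    # stored: its partner has not arrived
ets.token(3, "C1", "n1", "right")   # partner found: node n1 fires with 7 - 3
```

A non-commutative operator like subtraction is used deliberately, since it shows why the stored token must remember its port.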

2.2 Static Implementation

Our rst experiments with data ow on the J-Machine involved static data ow, in which data ow graphs must be acyclic and nonreentrant. This eliminates the need 3

for contexts, because the static discipline ensures that only one instantiation of a data ow graph will be active at a time. For a J-Machine implementation of static data ow, a token on its way to a commutative dyadic node, such as plus, can be represented by two words: 1. A header with the address of the destination instruction. 2. The data value. When a token is sent to a data ow node, this two-word message is sent to the processor on which it should run. The code for a plus node, written symbolically, is: [Initialization code] R0 0 and 11 otherwise. The example is still relevant, because procedures exist for which no such reduction is possible, for example, if the bindings for a and b were changed to a = f x bb and b = g x aa, where f and g were passed in as parameters [19, p. 2]. The simpler program is used purely for ease of exposition. 3

9
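Since the symbolic MDP code above is garbled in this copy, the behavior of a static-dataflow plus node can be sketched in Python instead. This is an assumed model, not the actual MDP routine: each arriving two-word message carries only a destination header and a value, the node keeps a single empty/full slot (no context is needed in the static discipline), and the second arrival fires the addition and forwards a result token.

```python
# Sketch of a static-dataflow plus node driven by two-word token messages.
EMPTY = object()

class PlusNode:
    def __init__(self, send, dest):
        self.slot = EMPTY     # holds the first operand to arrive
        self.send = send      # forwards the result token onward
        self.dest = dest      # destination instruction for the result token

    def handle(self, value):
        if self.slot is EMPTY:
            self.slot = value              # first token: park it in the slot
        else:
            total, self.slot = self.slot + value, EMPTY
            self.send((self.dest, total))  # second token: fire and emit result

out = []
node = PlusNode(out.append, "node7")       # "node7" is a made-up destination
node.handle(3)                             # first operand arrives and is stored
node.handle(4)                             # partner arrives: 3 + 4 is forwarded
```

Commutativity is what lets the node ignore which operand arrived first, which is why the paper singles out commutative dyadic nodes for this two-word representation.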

def add_em_up f n = { total = 0 in { for count
