Proc. PARLE 1989, LNCS 365, Springer Verlag 1989, pp. 136-157

Distributed Implementation of Programmed Graph Reduction

Rita Loogen, Herbert Kuchen and Klaus Indermark
RWTH Aachen, Lehrstuhl für Informatik II, Ahornstraße 55, 5100 Aachen, West Germany

Werner Damm
FB Informatik, Universität Oldenburg, Postfach 2503, 2900 Oldenburg, West Germany

Abstract

Programmed graph reduction has been shown to be an efficient implementation technique for lazy functional languages on sequential machines. If programmed graph reduction is viewed as a generalization of conventional environment-based implementations, where the activation records are allocated in a graph instead of on a stack, it becomes very easy to use this technique for the execution of functional programs on a parallel machine with distributed memory. We describe in this paper the realization of programmed graph reduction in PAM, a parallel abstract machine with distributed memory. Results of our implementation of PAM on an Occam/Transputer system are given.

1 Introduction

The most important properties of lazy functional programs are their expressiveness, which is due to the possibility to work with higher-order functions and infinite data structures, their semantic simplicity, which is due to their mathematical foundation, and finally their implicit parallelism, which is due to the Church-Rosser property of the reduction semantics. There is no a priori need to extend a functional language by special syntactic constructs to indicate parallelism that can be exploited in an implementation on a parallel machine. A parallelizing compiler may detect the implicit parallelism and decompose the functional program for parallel execution. Thus, the programmer need not think about the organisation of parallelism at all. An approach to the automatic parallelization of functional programs has been described in [Hudak/Goldberg 85]: each functional program can be translated into a system of serial combinators, i.e. a set of fully lazy global function definitions in which the places where parallel evaluation of subexpressions should take place are indicated by a special syntactic construct. Some form of strictness analysis is used to detect the maximal parallelism within a program. We will assume here that the technique of evaluation transformers, which has been introduced in [Burn 87a] and which handles structured data types in an appropriate way, is used for strictness analysis. For each predefined function and for each combinator an evaluation transformer tells us which amount of evaluation can be done on the arguments of the function or combinator if we know how much evaluation has to be done on the application. This information will be used in the parallel machine to control the evaluation. In section 3 we will give a short introduction to evaluation transformers as we will use them.

In section 2 we introduce serial combinator systems over structured data types. The combinators will be given in a flat syntactic form, which provides a clean distinction between first-order and higher-order expressions. As we will point out later, the special syntactic form enables us to treat first-order expressions in a more efficient way during the reduction process. The parallel implementation of the combinator systems is based on a parallel abstract machine (PAM), which consists of a finite number of identical processor elements with local storage. The processor elements may communicate by exchanging messages via an interconnection network. PAM has a very modular structure. Each processor element consists of two independent processing units:

- a communication unit that is responsible for the organizational aspects of the parallelization of the reduction process, and
- a reduction unit that executes, by programmed graph reduction, the parallel processes that have been derived from the original functional program.

The modularization of the processor elements simplifies the formal specification of the parallel machine as it is e.g. given in [Loogen 88] by the use of state transition systems and in [Loogen 87] by the use of an axiomatic architecture description language. On the other hand the modularization causes a decentralization of the parallel program execution by separating the overhead of parallelism (message handling, work distribution, workload balancing) from the reduction process. This promises a better exploitation of parallelism. Due to space limitations we omit a complete formal specification of the abstract machine in this paper and describe the state transitions of the machine in sections 4 and 5 only in an informal way. A detailed description can be found in [Loogen, Kuchen et al. 89]. The abstract machine has been implemented on an Occam/Transputer system. In the final sections we describe this implementation and present first simulation results. Furthermore we give an overview of and comparison with some related projects.

2 Serial Combinator Systems

A program P in the parallel intermediate language consists of a serial combinator system R and an applicative expression that defines the value of the program:

P = ⟨main = e; R⟩   with   R = { F1(x1, ..., xn1) = e1,
                                 ...
                                 Fr(x1, ..., xnr) = er }

The serial combinator system R contains a finite number of recursive combinator definitions whose bodies are parallelized applicative expressions. Let Ω denote the set of basic functions and Γ the set of data constructors. Let V be a set of typed argument and local variables and C be a set of combinator variables. The set PExp of parallelized applicative expressions is the smallest set with

1. variables and constants: V ∪ C ∪ Ω ∪ Γ ⊆ PExp,

2. simple conditional expressions: if e then e1 else e2 ∈ PExp, where e ∈ PExp with type bool and e1, e2 ∈ PExp,

3. complete case analysis on the possible decompositions of a data structure of type d:

   case e of c1(y11, ..., y1n1): e1; ...; ck(yk1, ..., yknk): ek esac ∈ PExp,

4. first-order applications of constants and combinators, where we admit partial applications of basic functions and combinators up to the arity of the function symbol: φ(e1, ..., em) ∈ PExp, where φ ∈ Ω ∪ Γ ∪ Com, and arity(φ) = m if φ ∈ Γ and arity(φ) ≥ m otherwise,

5. higher-order applications: ap(e, e1, ..., em) ∈ PExp, where e ∈ PExp is an arbitrary functional expression,

6. explicit parallelism:

   letpar y1 = F1(e11, ..., e1n1) and ... and yp = Fp(ep1, ..., epnp) in e ∈ PExp,

   where p ≥ 1, yi ∈ V (1 ≤ i ≤ p) and Fi(ei1, ..., eini) ∈ PExp are (complete) combinator applications with Fi ∈ C (1 ≤ i ≤ p).

The letpar-construct is used to make parallelism in the intermediate program explicit. Subexpressions that should be evaluated in parallel are abstracted out from combinator bodies and represented by applications of 'serial' combinators. During the parallelization process new combinators may be defined to represent such parallel subexpressions. The remaining body e, which contains (via the local variables yi) references to the parallel subtasks, is called the sequential thread of the computation. Parallel subtasks always correspond to combinator applications. There are two sources of parallelism within serial combinator systems. On the one hand parallel processes will result from the execution of letpar-constructs. On the other hand a parallel process will also be generated for the delayed execution of a combinator application in a non-strict argument position. Expressions and combinator definitions are monomorphically typed because we use a strictness analysis that is only applicable to monomorphically typed expressions. We omit a formal description of the reduction semantics as it is straightforward assuming a strict interpretation of the base functions. The data constructors are handled as non-strict non-interpreted functions. The combinator definitions of the serial combinator system are used as rewrite rules. Additionally there are special rules for the treatment of conditional expressions and higher-order applications. The rules for higher-order applications simply describe the collection of arguments for the completion of partial applications. An example program will be given in the next section.
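To relate the letpar-construct to something executable today, here is a rough Haskell analogue (our illustration, not part of the paper; it uses par and pseq from the parallel package): the bound subexpression is sparked for parallel evaluation while the sequential thread continues, mirroring the nfib program of the appendix.

import Control.Parallel (par, pseq)

-- 'letpar v = nfib(n-1) in v + nfib(n-2) + 1' sketched with sparks
nfib :: Int -> Int
nfib n
  | n <= 1    = 1
  | otherwise = v `par` (w `pseq` (v + w + 1))  -- spark v in parallel, evaluate w, combine
  where
    v = nfib (n - 1)  -- the letpar-bound parallel subtask
    w = nfib (n - 2)  -- part of the sequential thread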

3 Evaluation Transformers

In the evaluation of functional programs with structured data types several levels of evaluation may be distinguished. In [Burn 87a] a selection of evaluation levels is considered that seems to be natural for the treatment of structured data types. These levels are described by the following evaluators, which are used to evaluate expressions:

ξ0 - does no evaluation,
ξ1 - evaluates expressions to weak head normal form,
ξ2 - evaluates the structure of data types,
ξ3 - evaluates the structure of data types and, in addition, all elements to normal form.

The first two evaluators are common for the lazy evaluation of functional programs. The new evaluators will be used to organize the evaluation of data structures in a more appropriate way. Evaluating the structure of a data type simply means evaluating the sequence of constructor nodes of that data type. In the following, let the set of evaluators applicable to an expression e of type t be given by

Evset_t := {ξ0, ..., ξ3} if t is a structured type, and Evset_t := {ξ0, ξ1} otherwise.

The evaluators ξ0, ..., ξ3 handle only top-level data structures. In the case of nested data structures, e.g. lists of integer lists, it is not possible to force the evaluation of each element of a structure with evaluator ξ2 or ξ3. To handle nested structures in a more sophisticated way one would have to extend the set of evaluators. As this set has to be finite it is only possible to specify the degree of evaluation of nested data structures up to a finite depth. Using the technique of abstract interpretation one can derive evaluation transformers for basic functions, constructors and combinators of a serial combinator system [Burn 87b, Loogen 88]. An evaluation transformer tells us for a function which evaluator may be used for the evaluation of the arguments in an application when the evaluator for the application is given. Thus, if we have a function φ of type t1 × ... × tk → t0, the corresponding evaluation transformer ET(φ) is a mapping from the set of evaluators Evset_t0 into the set of k-tuples of evaluators from Evset_t1 × ... × Evset_tk. We denote the component functions of ET(φ) by ETi(φ) for 1 ≤ i ≤ k. As we assumed a strict interpretation we get the following evaluation transformers for n-ary base functions f ∈ Ω:

∀ i ∈ {1, ..., n}, ∀ ξ ∈ {ξ0, ξ1}:  ETi(f)(ξ) = ξ.

For data constructors c ∈ Γ of type d with arity k ≥ 0 and argument types t1, ..., tk the evaluation transformer can be defined by:

∀ i ∈ {1, ..., k} with ti ≠ d, ∀ ξ ∈ Evset_d:  ETi(c)(ξ) := if ξ = ξ3 then ξ1 else ξ0, and
∀ i ∈ {1, ..., k} with ti = d, ∀ ξ ∈ Evset_d:  ETi(c)(ξ) := if ξ = ξ1 then ξ0 else ξ.

To derive the evaluation transformers of combinators one has to use the techniques described in [Burn 87b, Loogen 88]. The activation of parallel subtasks will naturally depend on the given evaluator. For this reason the syntax of the letpar-construct will be slightly changed to become

letpar y1 = F1(e11, ..., e1n1) if ev1 and ... and yp = Fp(ep1, ..., epnp) if evp in e ∈ PExp

under the assumptions given in section 2 and ev1, ..., evp ∈ Evset. The parallel activation of the combinator applications depends on the evaluator of the letpar-construct. The combinator application Fi(ei1, ..., eini) will be evaluated in parallel only if the evaluator of the letpar-expression is stronger than or equal to evi. The evaluators for the combinator applications will be determined by the evaluation transformer of the sequential thread of the letpar-expression. For this reason we will additionally assume that each letpar-expression is annotated by the evaluation transformer of its sequential thread. One could also note the evaluation transformers of the different alternatives of case-expressions and use this information to speed up the evaluation of the components of data structures. We omit this optimization here for simplicity.

Example: The following first-order example program computes all solutions of the well-known queens problem. It has been adapted from [Hudak/Goldberg 85]. The types of parameters and combinator applications are noted in italic font style. The program uses the data types

intlist with constructors nil ∈ Γ_intlist and cons ∈ Γ_int×intlist→intlist, and listofintlist with constructors lnil ∈ Γ_listofintlist and lcons ∈ Γ_intlist×listofintlist→listofintlist.

main: listofintlist
  = print(allsolutions(nil, 0))

allsolutions (psol: intlist, col: int): listofintlist
  = if col=8 then lcons(psol, lnil)
    else findsolutions(psol, col, 0)

findsolutions (psol: intlist, col: int, row: int): listofintlist
  = if row=8 then lnil
    else letpar y1 = findsolutions(psol, col, row+1) if ξ2
         and    y2 = complete(psol, col, row) if ξ2
         in append(y1, y2)

complete (psol: intlist, col: int, row: int): listofintlist
  = if safe(col, row, psol, col-1)
    then allsolutions(cons(row, psol), col+1)
    else lnil

safe (col: int, row: int, ps: intlist, col1: int): bool
  = case ps of
      nil: true;
      cons(h,t): (h≠row and col≠col1) and
                 (|h-row| ≠ |col-col1| and safe(col, row, t, col1-1))
    esac

The evaluation transformers for the combinators of this example are given by the following table, where ξ ≠ ξ0:

combinators     ET1(.)(ξ)  ET2(.)(ξ)  ET3(.)(ξ)  ET4(.)(ξ)
allsolutions    ξ0         ξ1
findsolutions   ξ0         ξ0         ξ1
complete        ξ3         ξ0         ξ0
safe            ξ0         ξ0         ξ3         ξ0

These evaluation transformers are very simple as they do not depend on the given evaluator ξ. The letpar-expression in the body of the combinator findsolutions indicates the parallel evaluation of the arguments of the library function append only if the actual evaluator is stronger than or equal to ξ2. This becomes clear if one considers the evaluation transformer of the function append defined by:

append (l1: listofintlist, l2: listofintlist): listofintlist
  = case l1 of
      lnil: l2;
      lcons(h,t): lcons(h, append(t, l2))
    esac

The evaluation transformer of append is given in the following table:

ξ    ET1(append)(ξ)  ET2(append)(ξ)
ξ0   ξ0              ξ0
ξ1   ξ1              ξ0
ξ2   ξ2              ξ2
ξ3   ξ3              ξ3

The second argument may only be evaluated if the evaluator is stronger than or equal to ξ2. As the sequential thread of the letpar-expression equals the function append, the evaluation transformer of append is also used to annotate the evaluation transformer of the letpar-expression.
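To make the machinery concrete, the following Haskell sketch (ours; it fixes lists of integers as the only structured type) shows the four evaluators as forcing functions and the evaluation transformer of append transcribed from the table above:

data Ev = Xi0 | Xi1 | Xi2 | Xi3 deriving (Eq, Ord, Show)

-- the four evaluators as forcing functions on an integer list
force :: Ev -> [Int] -> ()
force Xi0 _  = ()                               -- no evaluation
force Xi1 xs = xs `seq` ()                      -- weak head normal form
force Xi2 xs = foldr (\_ r -> r) () xs          -- the structure (spine) only
force Xi3 xs = foldr (\x r -> x `seq` r) () xs  -- spine and all elements

-- ET(append), read off the table above: the second argument is only
-- evaluated when at least Xi2 is demanded of the application
etAppend :: Ev -> (Ev, Ev)
etAppend Xi0 = (Xi0, Xi0)
etAppend Xi1 = (Xi1, Xi0)
etAppend xi  = (xi, xi)    -- Xi2 and Xi3 propagate to both arguments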

4 The Parallel Abstract Machine

In this section we give an overview of our parallel abstract machine, concentrating on the organization of parallelism. The implementation of serial combinators and the realization of programmed graph reduction in the parallel machine will be discussed in the next section. A complete formal specification of the machine can be found in [Loogen 87/88].

[Figure 1: Global Structure of PAM. Processor elements PE1, ..., PEn are attached to a message transfer system.]

The parallel abstract reduction machine consists of a finite number of identical processor elements with local storage which may communicate with each other via a message transfer system (see figure 1). For the purpose of this paper we assume an interconnection network which provides a logically complete communication between processor elements.

4.1 Processor Elements

Each processor element consists of two autonomous processing units, the communication unit and the reduction unit, which may communicate via exchange of messages in local shared memories. The communication units manage the distribution of work. The reduction units perform the combinator reductions by programmed graph reduction. Figure 2 shows the structure of a single processor element. The communication unit contains two independent processors:

- The communication processor is responsible for the organization of the parallelism. Its main task is the distribution of the parallel processes and the workload balancing.
- The network adapter collects the incoming messages from the inports and forwards them to the communication processor. Furthermore, it passes messages from the communication processor to the correct outports.

The shared memories within the local store of a processor element contain queues for the transfer of messages between the different processing units. The next-task flag in the common store of reduction and communication unit can be set by the reduction unit to ask the communication processor for a new process. To avoid conflicts the flag can only be reset by the communication processor.

4.2 Reduction Messages

[Figure 2: Structure of a Processor Element. The communication unit comprises the network adapter (connected to the inports and outports), the communication processor with its information tables, and the input, output, process and communication queues; the reduction unit comprises the reduction processor with the pointer to the active task, the next-task flag, the working mode, the graph, the program store, the local task queue and the activation list.]

Four types of messages are essential for the parallelization of the graph reduction process:

- process messages to transfer newly generated parallel processes to other processor elements,
- answer messages to communicate the result of distributed parallel processes and the "value" of global subgraphs,
- request messages to ask for nodes of the program graph that are allocated on other processor elements, and
- activation messages to activate the evaluation of subgraphs on other processor elements.

A parallel process always corresponds to a combinator application. Thus it is completely specified by the combinator name, the list of arguments (where basic values are directly given and complex arguments are represented by pointers), the evaluator with which the parallel task has to be evaluated, the return address (the address where the result of the task has to be sent to) and finally the kind of activation, which will be explained later:

[PROCESS, combinator name, argument list, evaluator, return address, kind of activation]

Answer messages consist of the destination address where the answer has to be sent to and the result, which is a terminal node of the program graph (1). Graphs are transferred node-wise between processor elements.

[ANSWER, address, terminal graph node]

Request messages contain two addresses: the result address where the answer to the request must be sent to and the address of the graph node whose content is needed. Additionally, the evaluator which can be used for the evaluation of the requested node (if it is not a terminal node) is given.

[REQUEST, address, return address, evaluator]

Activation messages are similar to request messages except that they contain no return address, as no result of the initiated evaluation needs to be returned.

[ACTIVATE, address, evaluator]
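The four message formats can be transcribed almost literally into a data type; the following Haskell sketch is ours, with placeholder component types rather than the paper's concrete encoding:

type ProcNo    = Int
type LocalAddr = Int
type Addr      = (ProcNo, LocalAddr)  -- global address = <processor number, local address>

data Ev       = Xi0 | Xi1 | Xi2 | Xi3  -- the evaluators of section 3
data Arg      = Basic Int | Ptr Addr   -- basic values directly, complex arguments as pointers
data ActKind  = Direct | Indirect      -- kind of activation (cf. section 5.3)
data TermNode = TermNode               -- stands in for the terminal node kinds of section 5.1

data Message
  = PROCESS String [Arg] Ev Addr ActKind  -- combinator name, arguments, evaluator, return address, activation kind
  | ANSWER Addr TermNode                  -- destination address, terminal graph node
  | REQUEST Addr Addr Ev                  -- requested node, return address, evaluator
  | ACTIVATE Addr Ev                      -- node to activate, evaluator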

4.3 Process Management

The communication processor maintains in its local store a process queue that contains process messages (from other processor elements and from the reduction unit) (see figure 2). Process messages represent the processes that have been activated in parallel but not yet started. A process is passed to the reduction unit only if this unit has set the next-task flag. As soon as a reduction unit gets a process, execution of this process starts and the process is never migrated again. Parallel subprocesses that are generated in the reduction unit are passed to the communication processor, which will decide whether to keep them in its own process queue or to distribute them among the neighbour processors (2). To make this decision the processor should take into consideration which processors are capable of performing the task, where the arguments of the task are allocated, the workload of the neighbours, as far as it is known, and the length of its own process queue. To provide such information the communication processor contains certain information tables in its local store. At certain time intervals dynamically changing information, as for example the workload of processors, will be communicated in special administration messages that are exchanged between the communication processors. On the level of the abstract machine we will not go further into these issues. We abstract from the algorithm that is used for work distribution and workload balancing, as appropriate algorithms will also depend on the special structure of the interconnection network.

(1) Note that only terminal nodes of the program graph are copied and sent in answer messages.
(2) Neighbour processors are physically connected processors.

5 Distributed Programmed Graph Reduction

In the parallel abstract machine program execution is done by the reduction units. The implementation technique is distributed programmed graph reduction, i.e.:

- The program is represented as a graph that will be transformed during execution. In fact, the nodes representing combinator applications correspond to the activation records of conventional implementations.
- For each ⟨serial combinator, evaluator⟩ pair a machine code sequence is generated that controls the reduction process.
- The program graph will be distributed among several processor elements.

5.1 Graph Representation

The distinction between first-order and higher-order applications in the syntax of applicative expressions has been made to provide a much more efficient graph representation of these expressions. The standard technique in the literature is to work with curried functions and to represent them by binary graphs, as e.g. in the G-machine [Johnsson 84]. This requires unwinding the sequence of apply-nodes to determine the function symbol before any application can be reduced. Due to the special syntactic structure of our typed applicative expressions we get a more efficient graph representation of applications, such that a process like the unwinding in the G-machine is avoided whenever possible. Complete applications φ(e1, ..., en) are represented by so-called task nodes of the following structure:

TASK | φ | e1 / ... / en | status information

Task nodes are similar to the vector-apply nodes which are used in an optimized version of the G-machine described in [Johnsson 87]. Vector-apply nodes however do not contain status information. The status information will be explained later. Partial applications φ(e1, ..., ek) with k < n, where n denotes the arity of φ, are represented by incomplete-task nodes:

INCOMPLETE-TASK | φ | e1 / ... / ek / ⊥ / ... / ⊥ | n-k

Because of the full laziness of serial combinator systems copying partial applications does not unshare computations. Thus incomplete-task nodes are treated as terminal nodes. Higher-order applications ap(e, e1, ..., em) do not require an explicit graph representation. This is because of the special treatment of (non-strict) arguments whose evaluation must be suspended. The function name in a task node may be a combinator name, a basic function or the special tag 'arg' that indicates tasks that are generated for the delayed evaluation of arguments. For each of these symbols machine code is contained in the program store of each reduction unit. The arguments ei in task nodes are directly given if they are basic values. Otherwise the task node contains a pointer to the graph representation of the argument, which may be an argument node. As the program graph will be distributed among several processor elements by the migration of tasks and by request and answer messages, a global address space which is composed of disjoint local address spaces is required. Each global address consists of two components:

global address = ⟨processor number, local address⟩.

All the status information that is necessary for the reduction of a task is stored in the task node. If a task is not yet activated the status information consists only of the evaluator ξ0. The status of an activated task has the following form:

status = ⟨ev, ip, d, lv, pc, lq, gq⟩, where

- ev is the evaluator of the task,
- ip is the instruction pointer, which indicates the next instruction that must be executed,
- d denotes the data stack, which is used for the execution of basic functions (3),
- lv stands for storage for the local variables and pointers to parallel subtasks in the case of a combinator task; this storage is organized as a stack,
- pc is a pending count, which indicates the number of results a suspended task has to wait for,
- lq is a list of the addresses of local tasks that are waiting for the result of the task, and
- gq is a list of global addresses where the result of the task must be sent to.

The entries in the argument list, on the data stack or on the stack of local variables are basic values, local addresses or global addresses. As usual, tags are used to distinguish the different kinds of values. For addresses the evaluator which can be assumed for the subgraph with this address is additionally noted. This information is especially important for global addresses as it can be used to avoid multiple activations with the same evaluator. In the original G-machine [Johnsson 84] arguments whose evaluation must be delayed were represented by graphs which were interpreted in the case of evaluation. In our machine we follow the approach taken in [Fairbairn/Wray 87] and represent suspended expressions by special so-called argument nodes which correspond to closures:

ARGUMENT | environment | pointer to code

The environment within argument nodes consists of the argument list and the local variable stack of the task node that created the argument node. If the argument node represents a data structure, three different code addresses corresponding to the possible evaluators are given for the evaluation of the node. Otherwise only one address needs to be noted within the node. In the case of evaluation the argument node will be overwritten with a task node with the function tag 'arg'. The argument list and the stack of local variables will be copied from the argument node. The instruction pointer will be initialized with the code address given for the actual evaluator. Thus the evaluation of suspended arguments will still be controlled by code. In particular, no graph representation for case- and letpar-expressions or higher-order applications is needed. Structured data objects are represented by data nodes that contain the data constructor, the components of this structure and the evaluator which indicates the amount of evaluation that has already been done on the structure:

SDATA | constructor | components | evaluator

Basic values can be stored in basic data nodes. These are e.g. necessary if the result of a task is a basic value, as a task node is, after termination of the task, overwritten with its result. Data nodes and incomplete task nodes are terminal graph nodes that may be contained in answer messages. If a task is activated and sent to another processor element for execution, its original task node is overwritten by an indirection node. The indirection node indicates that the value of this subgraph will be computed on some other processor element, which is usually not known. It contains the new address of the subgraph, if this is known, the evaluator of the task and lists of local and global addresses to note the places where the result of this graph has to be communicated:

INDIRECTION | address | evaluator | lists of addresses

The address of the indirection node is the return address of the distributed task and thus the result of the task will be sent back to the indirection node. An indirection node is also generated when a request to a global address is produced. The indirection node becomes the placeholder for the result that will be returned in an answer message. If a task tries to access an indirection node, it is suspended (its pending count is incremented) and the address of the task node is noted in the list of local addresses within the indirection node. The pending count of the task will be decremented when the indirection node is overwritten by a terminal node. If there is a request message to an indirection node, the address where the answer to the request should be sent to is noted in the list of global addresses within the indirection node. The request message will be answered as soon as the indirection node is overwritten. For technical reasons we also need local indirection nodes which simply consist of the corresponding tag and a local address. They are sometimes used to overwrite task nodes after termination of the task by an indirection to the result.

(3) The length of the data stack is statically bounded because tasks themselves are not recursive. Recursive combinator calls result in the activation of new tasks.
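Summarizing this subsection, the node kinds and the status information can be collected into a single sum type; this Haskell transcription is our sketch with simplified field types:

type Fun       = String
type Cons      = String
type CodeAddr  = Int
type LocalAddr = Int
type Addr      = (Int, LocalAddr)     -- <processor number, local address>

data Ev    = Xi0 | Xi1 | Xi2 | Xi3
data Value = BasicV Int | LocalA LocalAddr Ev | GlobalA Addr Ev  -- tagged entries

type Env = ([Value], [Value])         -- argument list and local variable stack

data Status
  = Dormant                           -- not yet activated (evaluator Xi0 only)
  | Active Ev CodeAddr [Value] [Value] Int [LocalAddr] [Addr]
    -- ev, ip, data stack d, local variables lv, pending count pc,
    -- local waiting list lq, global waiting list gq

data GraphNode
  = Task Fun [Value] Status
  | IncompleteTask Fun [Value] Int    -- number of missing arguments (n-k)
  | Argument Env [CodeAddr]           -- closure: one code address, or three for data structures
  | SData Cons [Value] Ev             -- constructor, components, evaluator already reached
  | BasicData Int
  | Indirection (Maybe Addr) Ev [LocalAddr] [Addr]  -- new address if known, evaluator, waiting lists
  | LocalIndirection LocalAddr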

5.2 State of a Reduction Unit

The local store of each reduction unit consists of six components:

⟨m, atp, G, ltq, al, p⟩

m indicates the working mode of the reduction unit. We distinguish four different modes:

- In the communication mode (cm) messages from the communication unit are processed.
- In the reduction mode (rm) combinator reductions are performed by executing machine instructions.
- In the activation mode (am) the further evaluation of the components of data structures whose evaluator has been incremented is initiated.
- In the wait mode (wm) the reduction unit waits for new processes to be passed by the communication processor.

atp denotes the active task pointer, which indicates the task that is presently executed. Each task is represented by a node of the program graph. Thus the active task pointer is just an element of the local address space pointing to a node of the local graph. In each local graph there is at most one active task. If the reduction unit is running idle, the active task pointer has the value 'nil'. G denotes the graph component, which is a mapping from the local addresses into the graph nodes, which have been explained in the previous section.

ltq denotes the local task queue, which contains the addresses of tasks that are ready for execution. On the one hand these are subtasks of local tasks that will be executed locally. These subtasks correspond to subroutines that are not distributed. On the other hand these are tasks that were suspended because some argument value or the result of some subtask was not yet available, but that can now be reactivated because the missing result has arrived. A task will be reactivated if its pending count becomes zero. al denotes the activation list, which contains pairs of pointers to local structured data nodes and evaluators. Each entry represents a data structure whose evaluator has been increased. In the activation mode the reduction unit processes this list and initiates the further evaluation of the components of the data structure. p denotes the program store, which contains the translations of the combinators and sequences of machine instructions for the reduction of basic function applications and higher-order applications. The state of a reduction unit is determined by the state of the local store and additionally by the state of the shared memory of the reduction unit and the communication unit. This shared memory contains three components:

⟨red-q, com-q, next⟩

red-q is the queue of (reduction) messages that are passed from the communication processor to the reduction unit. com-q is the queue of reduction messages that are generated by the reduction unit and passed to the communication unit. next is the boolean next-task flag that can be set by the reduction unit to ask for new processes.
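Put together, the local store and the shared memory of a reduction unit can be sketched as two records (ours; the empty data declarations stand for the types sketched in the previous subsections):

import qualified Data.Map as M

type LocalAddr = Int
data Ev   = Xi0 | Xi1 | Xi2 | Xi3
data Mode = CM | RM | AM | WM   -- communication, reduction, activation, wait mode
data GraphNode                  -- placeholder, see section 5.1
data Message                    -- placeholder, see section 4.2
data ProgramStore               -- placeholder: combinator/evaluator code sequences

data RedUnit = RedUnit
  { mode      :: Mode
  , atp       :: Maybe LocalAddr            -- active task pointer, Nothing = idle ('nil')
  , graph     :: M.Map LocalAddr GraphNode  -- the graph component G
  , ltq       :: [LocalAddr]                -- local task queue
  , al        :: [(LocalAddr, Ev)]          -- activation list
  , progStore :: ProgramStore               -- the program store p
  }

data Shared = Shared
  { redQ :: [Message]  -- from communication processor to reduction unit
  , comQ :: [Message]  -- from reduction unit to communication unit
  , next :: Bool       -- the next-task flag
  }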

5.3 Machine Code

We distinguish four classes of machine instructions, namely

- data stack instructions:
  EXEC f (f ∈ Ω): execute base function f
  NODE (c, ξ) (c ∈ Γ): construct a structured data node,

- control instructions:
  JMP l: jump to label l (l is a program address)
  JPFALSE l: jump to l if "false" is on top of the data stack
  CASE ⟨(c1, l1) ... (ck, lk)⟩: jump to li if the top of the data stack represents a structure with constructor ci,

- graph instructions:
  LOAD i: load the i-th argument of the active task on the data stack
  LOADLOC i: load the i-th element of the local variable stack of the active task on the data stack
  SPLIT: load the components of the data structure represented by the top of the data stack on the local variable stack
  STORE n: move n elements from the data stack to the stack of local variables
  POP n: delete n elements from the stack of local variables
  ARGNODE (l1, l2, l3): create an argument node
  MKNODE (φ, i): create a task node, where φ is a combinator or basic function and i ∈ N gives the number of arguments that are available on the stack; the node that is constructed has status dormant (evaluator ξ0) if enough arguments are given, otherwise an incomplete task node is generated
  APPLY: add arguments to an incomplete task node,

- and process instructions:
  EVALUATE ξ: activate a local subtask with evaluator ξ
  ACTIVATE ξ: activate a parallel subtask with evaluator ξ
  INITIATE ξ: initiate the evaluation of the top element of the data stack
  INITARG/LOC (i, ξ): initiate the evaluation of the i-th argument / the i-th element of the local variable stack of the active task using evaluator ξ
  GETARG/LOC (i, ξ): the same as INITARG/LOC, but it is ensured that the result of this evaluation will be locally available
  WAIT m: suspend the active task if one of the m elements on top of the data stack is not yet completely evaluated or not locally available
  RET: finish the active task
  PUSH (F, i): perform a tail-recursive combinator call

The first three classes of instructions work only on the local store of the reduction unit. The most important instructions are the process instructions, which control the evaluation and the local process and task management. They may lead to the generation of reduction messages, which will be written into the shared memory component com-q. We distinguish instructions for the activation, suspension and termination of computations. There are four different mechanisms for the activation of tasks and processes. First of all, a task may be executed locally or in parallel. Local tasks are initiated by the instructions EVALUATE and INITIATE. Parallel processes are generated by the instructions ACTIVATE and INITARG/LOC and GETARG/LOC, respectively. A local task activation leads to the extension of the task node by the status information that is necessary for its execution. The address of the task node is added to the internal task queue. The active task is not necessarily suspended. In the case of a parallel activation a process message is generated and written into the shared memory of the reduction unit and the communication unit. Additionally, we distinguish direct and indirect activations. A task will be activated directly (by the instructions EVALUATE or ACTIVATE) if the evaluation of its strict arguments, with respect to the context-sensitive evaluation transformer of the corresponding application, has already been initiated. This is always the case when a task is activated immediately after the generation of its task node with the MKNODE-instruction. A direct activation is not possible for dynamically (by higher-order applications) generated task nodes or delayed evaluations. If a task is activated indirectly (by the instructions INITIATE or INITARG/LOC, GETARG/LOC) the execution of the task starts with a special code sequence for the evaluation of the strict arguments, with respect to the context-free evaluation transformer of the function corresponding to the task.

INIT-instructions generate an activation message; the GET-instructions send a request message to a global address to ensure that the contents of the global graph node will be returned (after its evaluation to weak head normal form).
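As with the reduction messages, the instruction set can be written down as a data type. The transcription below is ours; note that LIT, which pushes a literal on the data stack, is not listed above but appears in the compiled code of section 5.4, so we include it:

type Fun   = String
type Cons  = String
type Label = Int
data Ev    = Xi0 | Xi1 | Xi2 | Xi3

data Instr
  -- data stack instructions
  = EXEC Fun | NODE Cons Ev | LIT Int
  -- control instructions
  | JMP Label | JPFALSE Label | CASE [(Cons, Label)]
  -- graph instructions
  | LOAD Int | LOADLOC Int | SPLIT | STORE Int | POP Int
  | ARGNODE Label Label Label | MKNODE Fun Int | APPLY
  -- process instructions
  | EVALUATE Ev | ACTIVATE Ev | INITIATE Ev
  | INITARG Int Ev | INITLOC Int Ev | GETARG Int Ev | GETLOC Int Ev
  | WAIT Int | RET | PUSH Fun Int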

5.4 Compilation Rules

We distinguish the following compilation schemes, whose complete formal definitions can be found in [Loogen 88] or [Loogen, Kuchen et al. 89].

- ComTrans⟦Fi(x1, ..., xn) = ei⟧ ξ gives the translation of a combinator definition for the evaluator ξ. We generate special code for each combinator/evaluator combination. The code starts with a sequence of INITARG-instructions that cause the evaluation of the strict arguments of the combinator before the evaluation of the combinator body is started. These instructions will not be executed when a task is directly activated, i.e. activated by an EVALUATE- or ACTIVATE-instruction immediately after the construction of the task node, because in this case the evaluation of the strict arguments has already been initiated.

- EvalRet⟦e⟧ ξ lv m is used for the translation of combinator bodies. It generates code for the evaluation of an expression with evaluator ξ and the subsequent termination of the executed task. It is used to handle tail-recursive combinator calls in an optimized way. The parameters lv and m are used to control the local variable stack.

- Eval⟦e⟧ ξ lv m and Init⟦e⟧ ξ lv m give code for the evaluation of an expression e with evaluator ξ. The scheme Eval is used for the translation of strict arguments of base functions, whose value is locally needed. The scheme Init is used for the translation of arguments of combinators, constructors or other functional expressions. The generated code initiates computations, but does not ensure that the result of the computations will be locally available. Thus values (graph nodes) are only transferred to processor elements where they are really needed. The necessity of distinguishing locally needed and strict, but not (yet) locally needed global arguments has been pointed out to the authors by Geoffrey Burn [Burn 88b]. He detected that in a previous version of our machine arbitrarily long chains of indirection nodes could be generated and that consequently answer messages following these chains could take a roundabout way through the system before they reached their destination. In the current version, there are at most chains of indirection nodes with length two.

The schemes Eval and Init differ only for variables, for which Eval generates a GETARG-instruction and Init generates an INITARG-instruction. Therefore these schemes are defined using the scheme ETrans⟦e⟧ ξ lv m μ, which has an extra parameter μ ∈ {eval, init} to indicate the mode of translation. We give here only the translation of a letpar-expression (let ξ ∈ Evset∖{ξ0} and μ ∈ {eval, init}):

ETrans⟦letpar y1 = F̃1(e11, ..., e1l1) if ev1 and ... and yp = F̃p(ep1, ..., eplp) if evp in e⟧ ξ lv m μ :=
    pcode(F̃1(e11, ..., e1l1), ET1cs(e)(ξ), ξ, ev1, lv, m)
    ...
    pcode(F̃p(ep1, ..., eplp), ETpcs(e)(ξ), ξ, evp, lv, m)
    STORE p;
    ETrans⟦e⟧ ξ lv[y1 = m+p-1, ..., yp = m] (m+p) μ

where

pcode(F̃j(ej1, ..., ejlj), ξj, ξ, evj, lv, m) :=
    Init⟦ej1⟧ ET1cs(F̃j(ej1, ..., ejlj))(ξj) lv m; ... Init⟦ejlj⟧ ETljcs(F̃j(ej1, ..., ejlj))(ξj) lv m;
    MKNODE(F̃j, lj); ACTIVATE ξj;          if ξ ≥ evj

    Init⟦F̃j(ej1, ..., ejlj)⟧ ξj lv m;      if ξ < evj

and ETcs(e) denotes the context-sensitive evaluation transformer of the sequential thread e.

- DELAY⟦e⟧ lv m is used to handle "lazy evaluation". It generates a representation of expressions in non-strict argument positions whose evaluation must be delayed.

Below we show the machine code sequence that is generated for the combinator 'findsolutions' of our example program of section 3 with evaluator ξ3.

caindir: INITARG (3, ξ1);
cadir:   GETARG (3, ξ1);
         LOAD 3; LIT 8; WAIT 2; EXEC =;
         JPFALSE lF;
         LIT lnil; JMP lEND;
lF:      LOAD 1; LOAD 2;
         GETARG (3, ξ1); LOAD 3; LIT 1; WAIT 2; EXEC +;
         MKNODE (findsolutions, 3); ACTIVATE ξ3;
         INITARG (1, ξ3);
         LOAD 1; LOAD 2; LOAD 3;
         MKNODE (complete, 3); ACTIVATE ξ3;
         STORE 2;
         INITLOC (1, ξ3); LOADLOC 1;
         INITLOC (0, ξ3); LOADLOC 0;
         PUSH (append, 2);

The schematic translation of combinator systems of course generates a lot of unnecessary instructions, e.g. the INITLOC-instructions in the code for findsolutions. These can be detected and eliminated. We will not go further into these issues.

6 Implementation Aspects

The parallel abstract machine has been implemented on an Occam/Transputer system, where one processor element of the abstract machine runs on one transputer. Before we discuss our first experimental results in the next section, we want to give an overview of special aspects of this implementation which have not been handled on the level of the abstract machine.

6.1 Asynchronous Communication

All the processing units of PAM work completely asynchronously. Unfortunately, the language Occam only supports synchronous hand-shake communication. Thus we have to simulate asynchronous message passing by always buffering messages, to ensure that no processor is ever blocked because it has to wait for a communication. The buffers have a highly decentralized structure. Each buffer consists of a finite number of one-element buffers that work in parallel. Access to the buffers is controlled by the processors using them, by simple counters indicating the buffer cell which can be read or written next. Thus it is always possible for two processors to write a message into the buffer and to read a message from the buffer at the same time. (A sketch of this buffering scheme is given at the end of this subsection.) The structure of the processor elements was slightly changed to optimize the message traffic within a processor element, because it is not possible to run the communication unit and the reduction unit on separate transputers. The different processors are implemented by independent Occam processes which communicate via buffered channels. The communication unit now contains three independent processors:

- The network-adapter-in collects the incoming messages from the inports and forwards them to the reduction unit, the process manager or the network-adapter-out, if messages must be routed through the network.
- The network-adapter-out is responsible for the routing of messages. It passes messages from the reduction unit, the process manager and the network-adapter-in to the correct outports.

We use two processors for the adaption to the interconnection network in order that receiving messages from the inports and sending messages via the outports can be done in parallel. In the realization of the abstract machine on transputers we cannot assume a complete interconnection of the different processor elements. Therefore messages must be routed through the system. The reduction unit now sends answer, request and activation messages directly to the network-adapter-out, which sends them to their destination. Process messages are always sent to the process manager, which is responsible for their distribution.
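The buffering scheme of this subsection can be sketched in Haskell as follows (our analogue, not the Occam code): every one-element buffer becomes an MVar, and reader and writer each advance a private counter, so a put and a get on different cells proceed in parallel. Unlike the Occam original, putMVar blocks when the addressed cell is still full, so the sketch assumes the buffers are dimensioned large enough:

import Control.Concurrent.MVar
import Data.Array

data AsyncBuf a = AsyncBuf { cells :: Array Int (MVar a), bufSize :: Int }

newAsyncBuf :: Int -> IO (AsyncBuf a)
newAsyncBuf n = do
  ms <- mapM (const newEmptyMVar) [1 .. n]   -- n independent one-element buffers
  return (AsyncBuf (listArray (0, n - 1) ms) n)

-- the writer owns counter w, the reader owns counter r; each indexes the
-- next cell to be written or read and is advanced only by its owner
bufPut :: AsyncBuf a -> Int -> a -> IO Int
bufPut b w x = do
  putMVar (cells b ! (w `mod` bufSize b)) x
  return (w + 1)

bufGet :: AsyncBuf a -> Int -> IO (a, Int)
bufGet b r = do
  x <- takeMVar (cells b ! (r `mod` bufSize b))
  return (x, r + 1)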

6.2 Routing of Messages

In general, the interconnection network does not provide a full interconnection of all transputers. Thus messages must be routed through the network. This work is done by the network-adapter-out. In the current implementation we use a very simple static routing scheme. Each network-adapter-out has a routing table in its local store that gives for each processor element the number of the outport (link) into which a message for this processor element has to be written. Of course, the routing table fixes a shortest path between each pair of processor elements. The advantages of this method are that routing can be done very efficiently and that the order of messages is never changed (although this is not necessary in our machine).
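The routing scheme thus amounts to one table lookup per message; a minimal sketch, assuming the table has been precomputed from shortest paths in the physical network:

import qualified Data.Map as M

type ProcNo = Int
type Link   = Int   -- outport number

type RoutingTable = M.Map ProcNo Link

-- the network-adapter-out consults the table with the destination
-- processor of a message and writes the message to that outport
nextLink :: RoutingTable -> ProcNo -> Maybe Link
nextLink = flip M.lookup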

6.3 Process Distribution

The process manager in the Occam implementation corresponds to the communication processor of the parallel abstract machine. Its main task is the process organization. One can coarsely distinguish two different approaches to process distribution, namely passive or active distribution.

- Active distribution of processes means that the process manager may decide to send processes that have been generated by the reduction unit to neighbour processor elements without these processor elements having asked for work.
- Passive process distribution means that processor elements which have no work have to ask the neighbour processor elements to send them processes.

Active process distribution schemes seem to provide a better balancing of the workload and a more appropriate distribution of the program graph, especially in the presence of complex data structures. We intend to test various process distribution schemes in our implementation. In fact we want to support some combination of active and passive process distribution, as it seems that an active scheme should be preferred when there are fewer processes than processors, while a passive scheme is advantageous in the other case. Especially in the beginning and before the termination of the overall computation the workload will be low. The passive scheme produces an enormous communication overhead in these situations, because the idle processors keep on sending workrequests to their neighbours. In our current implementation we have implemented a very simple passive scheme. Processors which have no work and run idle have the possibility to send workrequest messages to locally connected processor elements. These messages may then be answered by process messages if the addressed processor has enough work and decides to distribute some processes. Otherwise they are answered by nowork messages.

In order to prevent the idle processors from sending workrequests when the workload is low and there is no more work to distribute, we use a clock. Whenever a processor only gets nowork-answers to its workrequests, it sets an alarm clock and sends no more workrequests until this clock has run down.

6.4 Garbage Collection

In the abstract machine we completely abstracted from garbage collection in the program graph. In an implementation this is of course not possible. We use weighted reference counting [Bevan 87] for garbage collection in the distributed program graph. For its realization special decrement messages are necessary to decrement the reference count of graph nodes on other processor elements.
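For orientation, a condensed sketch of weighted reference counting in the spirit of [Bevan 87] (names and simplifications are ours): every remote pointer carries a weight, a node stores the sum of all outstanding weights, copying a pointer splits its weight without any message, and discarding a pointer turns its weight into a decrement message.

type Addr = (Int, Int)              -- <processor number, local address>

data WPtr  = WPtr Addr Int          -- pointer with its weight
data GcMsg = Decrement Addr Int     -- the special decrement message

-- copying a pointer splits the weight; no message to the node is needed
-- (when the weight is 1 it cannot be split; Bevan's scheme then inserts
-- an indirection cell, which we omit here)
dupPtr :: WPtr -> (WPtr, WPtr)
dupPtr (WPtr a w) = (WPtr a (w - h), WPtr a h) where h = w `div` 2

-- discarding a pointer sends its weight to the owning processor element
dropPtr :: WPtr -> GcMsg
dropPtr (WPtr a w) = Decrement a w

-- the owner subtracts the weight; the node is garbage when the count is 0
applyDec :: Int -> GcMsg -> Int
applyDec refCount (Decrement _ w) = refCount - w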

7 Applicative Data Structures

An important topic of our project is the development and comparison of different applicative (i.e. side-effect-free) data structures. In the following we concentrate on the structures list and sequence. Sequences are very similar to lists but allow new elements to be inserted at arbitrary positions and arbitrary elements to be deleted and accessed more efficiently. They are implemented using binary inhomogeneous trees, i.e. the elements of the structure are stored in the leaves. The internal nodes are marked by the length of the subsequence represented by the corresponding subtree. We use inhomogeneous trees since they are easier to append and to traverse in a divide-and-conquer style, which is essential for a parallel implementation. Consequently, a sequence can be divided into subsequences in O(1), and an update (insertion or deletion) can be done in O(log n) in the average case. The updated structure and the original one are stored in an overlapping manner. For each update only a path from the new root to the changed leaf needs to be built. The rest of the structure can be shared. In order to search for the i-th element of a sequence, i has to be compared with the length information of the left son for each node on the search path down to a leaf. We consider unbalanced sequences and height-balanced sequences based on AVL-trees, which have been proposed by Myers [Myers 84].
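The representation described above can be sketched as a binary leaf tree whose inner nodes carry the number of elements below them (our Haskell sketch of the unbalanced variant; the AVL variant of [Myers 84] would additionally rebalance in seqappend):

data Seq a = Empty
           | SeqLeaf a
           | SeqNode Int (Seq a) (Seq a)   -- length of the represented subsequence

len :: Seq a -> Int
len Empty           = 0
len (SeqLeaf _)     = 1
len (SeqNode n _ _) = n

-- O(1) append for the unbalanced variant
seqappend :: Seq a -> Seq a -> Seq a
seqappend s t = SeqNode (len s + len t) s t

-- O(1) decomposition: seqappend (leftseq s) (rightseq s) represents s
leftseq, rightseq :: Seq a -> Seq a
leftseq  (SeqNode _ l _) = l
leftseq  s               = s
rightseq (SeqNode _ _ r) = r
rightseq _               = Empty

-- selecting the i-th element compares i with the length of the left son
-- on the way down: O(log n) on balanced trees
seqsel :: Seq a -> Int -> a
seqsel (SeqLeaf x) 1 = x
seqsel (SeqNode _ l r) i
  | i <= len l = seqsel l i
  | otherwise  = seqsel r (i - len l)
seqsel _ _ = error "index out of range"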

8 Experimental results

The machine code of the parallel abstract machine is interpreted, which might explain the absolute times needed by our simple example programs (see figures 3 and 4). Until now, we have run experiments on 1, 2, 4, and 12 transputers. Two examples work on integer numbers (nfib and one) and the other examples use data structures (quicksort, towers of Hanoi, the queens problem, matrix multiplication, fold and map) (see appendix). For the examples with integer numbers, we got good speedups, as can be seen from figure 3. In both examples the grain of parallelism was too fine. Very few of the generated processes were sent to another processor; most of them returned to their origin. Because of this, we modified the PAM in such a way that process messages were only generated when a workrequest from another processor arrived. A second task queue, containing only pointers to those processes, was added to the reduction unit. With this modification, it was possible to save about 75% of the time. For programs working on data structures, we get reasonable speedups up to 8 or 9 with 12 processors. The workload balancing for all problems except quicksort is good. Each processor executes approximately the same number of PAM-commands. After a short starting phase, all processors work continuously, until in a short final phase the results are collected.

                           |       on 1 processor        |     on 12 processors
problem    size  structure |  #com     sec      #nodes   |   sec    #messages  speedup
one          13            |  238K    12.36      0.1K    |   1.21     0.1K      10.21
one          15            |  950K    49.41      0.1K    |   4.31     0.3K      11.46
one          17            | 3801K   197.62      0.2K    |  16.69     0.4K      11.84
nfib         20            |  317K    16.48      0.1K    |   1.55     0.2K      10.63
nfib         22            |  831K    43.14      0.1K    |   3.80     0.3K      11.35
nfib         24            | 2176K   112.94      0.1K    |   9.65     0.4K      11.70
map         500      L     |   44K     2.41      7.0K    |   2.30     3.0K       1.05
map         500      U     |   99K     5.35      4.2K    |   1.02     5.7K       5.25
map         500      B     |  135K     7.10      6.2K    |   1.54     9.8K       4.61
map        1000      L     |   88K     4.80     14.0K    |   4.58     6.0K       1.05
map        1000      U     |  199K    10.69      8.2K    |   1.93    11.9K       5.54
map        1000      B     |  270K    14.18     12.2K    |   2.35    22.4K       6.03
map        1500      L     |  132K     7.20     21.0K    |   6.81     9.0K       1.06
map        1500      U     |  298K    16.04     12.2K    |   2.44    19.2K       6.57
map        1500      B     |  407K    21.34     18.3K    |   3.63    31.6K       5.88
map        2000      L     |  176K     9.59     28.0K    |   9.09    12.1K       1.06
map        2000      U     |  398K    21.38     16.2K    |   3.01    23.9K       7.10
map        2000      B     |  541K    28.35     24.3K    |   4.17    40.0K       6.80
map        2500      L     |  220K    12.00     35.0K    |  11.36    15.1K       1.06
map        2500      U     |  497K    26.72     20.3K    |   3.61    28.3K       7.40
map        2500      B     |  678K    35.50     30.3K    |   5.29    47.2K       6.71
fold       2500      L     |  231K    12.82     20.0K    |  12.82     0          1.00
fold       2500      U     |  437K    25.28     30.2K    |   3.65    26.6K       6.93
towers       11      L     |  578K    35.09     15.6K    |  12.13     5.9K       2.89
towers       11      U     |  292K    15.00      8.4K    |   2.05     0.2K       7.32
qsort       700      L     |  889K    48.76     14.6K    |  24.04    36.0K       2.03
qsort       700      U     | 1727K    99.23     20.2K    |  24.69    76.3K       4.01
matmult      10      L     |  761K    47.07      1.8K    |   6.54    51.5K       7.20
matmult      10      U     | 1291K    71.05      2.4K    |   9.15    72.0K       7.76

size: problem size
structure: data structure used: L = strict lists, U = unbalanced sequences, B = height-balanced sequences
#com: total number of executed PAM-commands
sec: runtime in seconds
#nodes: number of graph nodes; each node needs 16 bytes
#messages: total number of messages

Figure 3: Experimental results

#processors                  1        2        4       12
problem    size           sec    speedup  speedup  speedup
map        2500         26.72     1.75     2.99     7.40
fold       2500         25.28     1.74     3.08     6.93
towers       12         30.07     2.03     3.95     8.92
qsort       700         99.23     1.94     2.88     4.01
queens        8        428.00     1.89     3.61     9.44
matmult      10         71.05     1.81     3.18     7.76

Figure 4: Speedup for problems on unbalanced sequences

Unbalanced sequences behave better than balanced ones in all our examples, since they are automatically balanced by the divide-and-conquer algorithms which produced them. Explicit balancing only causes overhead. This can be seen for the map-function in figure 3. Lists are easier to traverse than sequences, since they consist of fewer nodes. Because of this, algorithms which are based on traversing, like map and fold, are more efficient on lists on one processor. But lists can only be handled sequentially. So, sequences are better on many processors, since they allow a divide-and-conquer style traversal. Sequences are also easier to append (in O(1)). Since this is the only operation needed in the towers of Hanoi example, sequences are better here even on one processor. Lists are better where small structures of up to 10 or 15 elements are needed, since building an inhomogeneous tree is more expensive than building a list. This overhead cannot be made up on such small structures by a divide-and-conquer style algorithm or by a more efficient access operation (see queens and matmult in figure 3). Map and fold are recursive higher-order functions. The same functional parameter is passed as an argument in each recursive call until the base case is reached. Then, the corresponding process accesses the representation of this parameter, in general via a message. Since the base case is reached n times for a tree with n leaves, this function is accessed n times. This causes an unnecessary communication overhead, especially for the processor element where the function representation is stored. In fact, this processor element sequentializes all these accesses. To avoid this problem, we have sent (the root of) each argument in which a function is strict together with the corresponding task. Following recursive calls refer to these copies of the original arguments. A similar problem occurs in the matmult example. The roots of the matrix representations are accessed n² times (n being the size of the matrices). The technique mentioned above doesn't help, since there remain n²/2 accesses to each son of the roots, n²/4 to their grandsons and so on. In the matmult example, the processing elements which store those critical nodes (and their neighbour processing elements) execute up to 30% fewer PAM-commands than the others. This problem occurs since our matrix multiplication algorithm does not take the structure of the matrix (representations) into consideration. Another algorithm could work better. Nevertheless, we are working on solutions to the problem. Our quicksort implementations perform badly in parallel (see figure 3), since they are not structurally recursive. So, more communication is needed, which prevents a good speedup. The algorithm on lists suffers from the additional problem that dividing a list into two sublists can only be done sequentially. Figure 4 shows how the speedup for some programs using unbalanced sequences increases with the number of processors.

9 Related Work

There are a lot of projects underway with the goal of implementing functional languages on appropriate parallel architectures. We mention here only the two projects which are most similar to our work. A distributed memory architecture for programmed graph reduction is also described in [Burn 88a]. The approach of Burn is based on the same principles as our work. The main differences can be found in the task and process management. Furthermore Burn does not consider serial combinator systems where parallelism has been made explicit using the letpar-construct. An implementation of his machine design is not yet available. Another approach similar to our work has been described in [Raber, Remmel et al. 88]. Raber and his coworkers developed a parallel version of the G-machine [Johnsson 84] which is much closer to the original G-machine than our implementation of distributed programmed graph reduction. Implementation results of their machine are also not yet available.

10 Conclusion

Our experiments with the implementation of the parallel architecture are still in a preliminary stage. It is too early to draw final conclusions. Our first experience with the handling of data structures is that some optimizations should be done on the message traffic to reduce the total number of messages. This could be done by introducing an address translation table within each processor element that indicates which global addresses correspond to which local addresses, or by bundling request messages with the same destination that must be routed. We will report on these optimizations and further implementation results in forthcoming papers.

Acknowledgements

This research benefitted from lots of interesting discussions with Geoffrey Burn. In particular, Geoffrey pointed out that there could be arbitrarily long chains of indirection nodes in an earlier version of our parallel abstract machine design, and showed us how such chains can be avoided.

References

[Bevan 87] D. Bevan: Distributed Garbage Collection Using Reference Counting, in: Lecture Notes in Computer Science, Vol. 259, Springer Verlag 1987.
[Burn 87a] G. Burn: Evaluation Transformers - A Model for the Parallel Evaluation of Functional Languages, in: Lecture Notes in Computer Science 274, Springer Verlag 1987.
[Burn 87b] G. Burn: Abstract Interpretation and the Parallel Evaluation of Functional Languages, Ph.D. Thesis, Imperial College, London 1987.
[Burn 88a] G. Burn: Developing a Distributed Memory Architecture for Parallel Graph Reduction, CONPAR 88.
[Burn 88b] G. Burn: personal communication at the Aspenäs Workshop on Implementation of Lazy Functional Languages, Göteborg, Sept. 1988.
[Fairbairn/Wray 87] J. Fairbairn, S. Wray: TIM - A Simple Lazy Abstract Machine to Execute Supercombinators, in: Lecture Notes in Computer Science 274, Springer Verlag 1987.
[Hudak/Goldberg 85] P. Hudak, B. Goldberg: Efficient Distributed Evaluation of Functional Programs Using Serial Combinators, IEEE Transactions on Computers, Vol. C-34, No. 10, October 1985.
[Johnsson 84] Th. Johnsson: Efficient Compilation of Lazy Evaluation, SIGPLAN Notices Vol. 19, No. 6, June 1984.

[Johnsson 87] Th. Johnsson: Compiling Lazy Functional Languages, Dissertation, Chalmers University of Technology, Göteborg 1987.
[Loogen 87] R. Loogen: Design of a Parallel Programmable Graph Reduction Machine with Distributed Memory, Aachener Informatik-Berichte Nr. 87-11.
[Loogen 88] R. Loogen: Parallele Implementierung funktionaler Programmiersprachen, Dissertation (in German), RWTH Aachen, 1988.
[Loogen, Kuchen et al. 89] R. Loogen, H. Kuchen, K. Indermark, W. Damm: Implementation of Functional Languages on Multiprocessor Systems, Aachener Informatik-Berichte 1989.
[Myers 84] E. Myers: Efficient Applicative Data Types, ACM Symp. on Principles of Programming Languages 1984, 66-75.
[Raber, Remmel et al. 88] M. Raber, Th. Remmel, E. Hoffmann, D. Maurer, F. Müller, H. Oberhauser, R. Wilhelm: Compiled Graph Reduction on a Processor Network, GI/ITG Tagung Paderborn, 1988, Informatik-Fachberichte, Springer Verlag.

A Example Programs

We list some of the example programs which we used in our experiments, as serial combinator systems, to give the reader an idea of which algorithms have been used for the different problems and which form of parallelization has been taken:

one (n: int): int
  = if n = 0 then 1
    else letpar v = one(n-1) in one(n-1) * v

nfib (n: int): int
  = if n ≤ 1 then 1
    else letpar v = nfib(n-1) in v + nfib(n-2) + 1

The programs map, fold, quicksort, towers of Hanoi and matrix multiplication will only be given for sequences. It is straightforward to get the corresponding programs for lists. The following base operations are available for sequences:

- seqsel, seqdel, sequpd: to select, delete, update an element of the sequence,
- seqappend: to append two sequences, and additionally
- leftseq and rightseq: to decompose a given sequence into two subsequences such that seqappend(leftseq(S), rightseq(S)) = S for every sequence S.

All basic operations are implemented in a straightforward way. The example programs are not always the most efficient ones to solve the problem. They serve to show the properties of our abstract machine and the different applicative data structures. To compare lists and sequences, we use nearly the same algorithms for the corresponding example programs (except map and fold). In our experiments, map has always been called with the successor function as argument, and fold has been called with the addition and the identity function.

mapseq (f: int → int, T: intseq): intseq
  = case T of
      emptyseq: emptyseq;
      seqleaf(x): seqleaf(ap(f,x));
      seqnode(L,R,lg): letpar v = mapseq(f, rightseq(T))
                       in seqnode(mapseq(f, leftseq(T)), v, lg)
    esac

foldseq (f: int × int → int, g: int → int, T: intseq): int
  = case T of
      emptyseq: error;
      seqleaf(x): ap(g,x);
      seqnode(L,R,lg): letpar v = foldseq(f, g, rightseq(T))
                       in ap(f, foldseq(f, g, leftseq(T)), v)
    esac

tower (n: int, source: int, help: int, dest: int): pairseq
  = let move = pair(source, dest) in
    if n=1 then seqleaf(move)
    else letpar v = tower(n-1, help, source, dest)
         in seqappend(tower(n-1, source, dest, help), seqins(v, 1, move))

quicksort (L: intseq): intseq
  = case L of
      emptyseq: emptyseq;
      seqleaf(x): L;
      seqnode(L,R,lg):
        letpar v0 = seqsel(L,1)
        and pair(L1,R1) = divide(v0, seqdel(L,1))
        and v1 = quicksort(R1)
        in seqappend(quicksort(L1), seqins(v1, 1, v0))
    esac

divide (x: int, S: intseq): intseqpair
  = case S of
      emptyseq: pair(emptyseq, emptyseq);
      seqleaf(y): if y ≤ x then pair(S, emptyseq) else pair(emptyseq, S);
      seqnode(L,R,lg):
        letpar pair(L1,R1) = divide(x, leftseq(S))
        and pair(L2,R2) = divide(x, rightseq(S))
        and Ln = seqappend(L1, L2)
        in pair(Ln, seqappend(R1, R2))
    esac

matmult (n: int, A,B: seqofintseq): seqofintseq = matrix(ap(sprod,n,A,B), n)

matrix (f: int → int → intseq, n: int): seqofintseq = foldint(1, n, ap(row,f,n), seqappend)

row (f: int → int → intseq, n: int, i: int): intseq = seqleaf(foldint(1, n, ap(f,i), seqappend))

sprod (n: int, A,B: seqofintseq, i,j: int): intseq = seqleaf(foldint(1, n, ap(mult,A,B,i,j), +))

mult (A,B: seqofintseq, i,j,k: int): int = seqsel(seqsel(A,i), k) * seqsel(seqsel(B,k), j)

foldint (i,j: int, f: int → α, g: α × α → α): α
  = if i=j then ap(f,i)
    else let med = (i+j)/2 in
         letpar v = foldint(i, med, f, g)
         in ap(g, foldint(med+1, j, f, g), v)
