CONCURRENCY: PRACTICE AND EXPERIENCE, VOL. 8(7), 499-515 (SEPTEMBER 1996)

Effective data parallel computation using the Psi calculus

L.M.R. MULLIN
Department of Computer Science, University at Albany, SUNY, Albany NY 12222, USA

M.A. JENKINS
Department of Computing and Information Science, Queen's University, Kingston, Canada K7L 3N6

SUMMARY
Large scale scientific computing necessitates finding a way to match the high level understanding of how a problem can be solved with the details of its computation in a processing environment organized as networks of processors. Effective utilization of parallel architectures can then be achieved by using formal methods to describe both computations and computational organizations within these networks. By returning to the mathematical treatment of a problem as a high level numerical algorithm we can express it as an algorithmic formalism that captures the inherent parallelism of the computation. We then give a meta description of an architecture, followed by the use of transformational techniques to convert the high level description into a program that utilizes the architecture effectively. The hope is that one formalism can be used to describe both computations and architectures, and that a methodology for automatically transforming computations can be developed. The formalism and methodology presented in the paper are a first step toward the ambitious goals described above. It uses a theory of arrays, the Psi calculus, as the formalism, and two levels of conversions: one for simplification and another for data mapping.

1. INTRODUCTION

The goal of our research is to provide technology that enhances the ability of scientists and engineers to use parallel computers to do large scale scientific and engineering computations effectively. Efforts are under way within the programming language implementation community to provide compilers for Fortran and C that are capable of turning sequential codes into parallelized programs for a particular architecture. A significant part of the effort is to extend Fortran and C with operations that perform array computations as a unit action, so that the programmer does not need to write out the details of the array computation. The additional operations are analogous to operations used in the language APL, but are integrated into the notation of the particular language, e.g. Fortran 90. The expectation is that, by expressing parts of the algorithm in higher level constructs, a compiler may be able to extract parallelism implicitly from the overall structure of arrays and algorithms on arrays.

In current methodology, the problem under study, originally programmed for a sequential solution on a single processor, is rewritten or automatically processed to run on a particular parallel architecture. The task of parallelizing sequential algorithms effectively is very difficult because the performance depends both on dividing the work between the processors so that the overall computing time is kept small, and on placing the data of the problem to minimize communication between the processors.

CCC 1040-3108/96/070499-17 ©1996 by John Wiley & Sons, Ltd.

Received 9 February 1994 Revised 25 July 1995


If the problem is data intensive, then it is likely that the original formulation, which might have involved operations on large array objects, has been transformed into a sequential program with nested loops and detailed subscript handling. Sophisticated algorithms are being developed to analyze nested loop code to discover opportunities to produce parallel code[1,2]. However, despite extensive efforts to develop technology that parallelizes sequential code, it appears infeasible, at least for the foreseeable future, to construct a compiler that can automatically parallelize sequential programs into efficient parallel code[3]. The high performance computing community has recognized the need to provide language extensions that allow a more direct expression of the problem in terms of array data structures[4-7].

This paper describes the use of a mathematical formalism that can be used as an analytical tool in dealing with the array objects being introduced into high performance Fortran (HPF) and high performance C (HPC). The formalism provides a systematic way of describing array operations on data suitable for inclusion in a programming notation. It also has a calculus that can be used to reduce an expression formulated in these operations to one that does the minimal work. A compiler has been developed that demonstrates how the reduction calculus could be used in conjunction with existing parallel compiler technology[8,9].

HPF[10] also assists semi-automatic parallelization through programmer directives which describe data partitioning and dependencies that are communicated to the compiler. Parallel tools work in conjunction with these directives to assist the programmer and compiler in achieving an effective solution. Various implementations of HPF have emerged which provide high performance as well as superior programmer tools with a guarantee of portability to various platforms[11-14]. However, it is our belief that, for a substantial number of subproblems, an effective approach to assigning processors and doing data placement can be deduced from the array expression. The results here are more speculative, but we show that, by using the same formalism to describe the parallel architecture as an array of processors, a reasonable partitioning solution can be found for a nontrivial problem.

Related work by Chen[15] views arrays as functions (or data fields) defined over a set of index points (or index domains). Similar to our approach, Chen views arrays of data and arrays of processors using the same methodology, an equational theory. She can therefore manipulate data into a more efficient interpretation through equivalence preserving transformations. The main difference between Chen's work and ours is that we define everything in terms of structure and indexing. Through the reduction semantics of the Psi calculus we can simplify an arbitrary array expression even on one processor. A unified theory of arrays is used throughout our analysis without any reliance on graph theory, set theory or data flow analysis.

It is our view that, in the long term, effective utilization of parallel architectures will be achieved by using formal methods to describe both computations and computational organization within networks of processors.
The idea is to return to the mathematical treatment of the problem as a high level numerical algorithm, describe it in an algorithmic formalism that captures the inherent parallelism of the computation, provide a high level description of the available architecture, and then use transformational techniques to convert the high level description into a program that utilizes the architecture effectively. The hope is that a unified mathematical treatment can be used to describe both computations and architectures and that a methodology for automatically transforming computations can be developed.

2. THE Psi CALCULUS

The MOA algebra and its associated Psi calculus is a formalism that we use to describe array computations on uni- or multi-processor topologies. It is centered around a generalized array indexing function, Psi, that selects a partition of an array described by a partial index. All array operations in this theory are defined using the indexing function. The central theorem presented herein indicates how to describe the effect of the indexing operation in terms of selecting items from a one-dimensional list of the items (data or processors) assumed to be in row major order.

Many of the ideas for the Psi calculus have their origin in APL and other interpreted languages that support arrays, such as Nial[16] and Paralation Lisp[17]. There is considerable academic interest in finding a suitable formalism for parallel programming[18,19]. There have been several investigations into a mathematics to describe array computations. These include More's theory of arrays (AT)[20], a formal description of AT's nested arrays in a first order logic[21] and the development of a mathematics of arrays (MOA)[22]. In [23] the correspondence between AT and MOA is described. The algebra of MOA denotes a core set of operations proven useful for representing algorithms in scientific disciplines. Unlike other theories about arrays[7,20], all operations in MOA are defined using shapes and the indexing function Psi (ψ), which in conjunction with the reduction semantics of the Psi calculus provides not just transformational properties but compositional reduction properties which produce an optimal normal form possessing the Church-Rosser property. Often transformations require pattern matching and knowledge of equivalence preserving transformations, while Psi reductions are deterministic, and hence mechanical. Both MOA and AT build on Iverson's concepts and notations from APL and extend the transformational properties of the original notation[24]. The Psi calculus and MOA put closure on concepts introduced by Abrams[25] to optimize the evaluation of APL array expressions based on an algebra of indexing. His work led to many attempts to compile APL, with varying degrees of success[26,27].

Our goal is to optimize array computation given a linear address space. The first level of optimization, termed a functional normal form, is a minimal semantic form expressed in terms of selections using Cartesian co-ordinates. The normal form can be built in numerous ways. We describe a constructive way to build it using the Psi calculus. The second level of optimization is to transform the functional normal form to its equivalent operational normal form, or description for implementation, which describes the result in terms of starts, strides and lengths to select from the linear arrangement of the items. We show how the Psi correspondence theorem is used to achieve the operational form. In the last Section of the paper we use MOA to describe partitioning and mapping of data arrays to arrays of processors. This work is more speculative in that our results are still preliminary but, as we demonstrate, it shows that the same theory of array structure can be used to describe arrays of processors and hence allows a unified formal treatment of algorithm description and problem mapping.

The central operation of MOA is the indexing function p ψ A, in which a vector of n integers p is used to select an item of the n-dimensional array A. The operation is generalized to select a partition of A, so that if q has only k < n components


then q ψ A is an array of dimensionality n - k and q selects among the possible choices for the first k axes. In MOA zero origin indexing is assumed. For example, if A is the 3 x 5 x 4 array

     1  2  3  4        21 22 23 24        41 42 43 44
     5  6  7  8        25 26 27 28        45 46 47 48
     9 10 11 12        29 30 31 32        49 50 51 52
    13 14 15 16        33 34 35 36        53 54 55 56
    17 18 19 20        37 38 39 40        57 58 59 60

then

    < 1 > ψ A  =  21 22 23 24
                  25 26 27 28
                  29 30 31 32
                  33 34 35 36
                  37 38 39 40

    < 2 1 > ψ A  =  < 45 46 47 48 >

    < 2 1 3 > ψ A  =  48

Most of the common array manipulation operations found in languages like APL and Nial can be defined from ψ and a few elementary vector operations. We now introduce notation to permit us to define ψ formally and to develop the Psi correspondence theorem, which is central to the effective exploitation of MOA in array computations.

We will use A, B, ... to denote an array of numbers (integers or reals). An array's dimensionality will be denoted by dA and will be assumed to be n if not specified. The shape of an array A, denoted by sA, is a vector of integers of length dA, each item giving the length of the corresponding axis. The number of items in an array, denoted by tA, is the product of the items of sA; we write π v for the product of the items of a vector v, so that tA = π sA, and ι m for the vector of the first m non-negative integers < 0 1 ... m-1 >. Θ will denote the empty vector.

For every n-dimensional array A, there is a vector of the items of A, taken in row major order, which we denote by the corresponding lower case letter, here a. The length of the vector of items is tA. This vector is itself a one-dimensional array, whose shape is the one-item vector holding the length. Thus, for a, the vector of items of A, the shape of a is < tA > and the number of items, or tally, of a is tA. An array can therefore also be specified in a second form, as the pair consisting of its shape and its vector of items, written {sA, a}.

We use cat to denote vector concatenation, and for a vector u and an integer k we define three elementary vector operations:

k take u is, when k is non-negative, the vector of the first k items of u and, when k is negative, the vector of the last |k| items of u;

k drop u is, when k is non-negative, the vector of the tu - k last items of u and, when k is negative, the vector of the first tu - |k| items of u;

k rotate u is, when k is non-negative, the vector (k drop u) cat (k take u) (i.e. rotate left) and, when k is negative, the vector (k take u) cat (k drop u) (i.e. rotate right).
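As an illustration, here is a small Python sketch of these vector operations on ordinary Python lists (our own illustration; the function names take, drop and rotate are chosen to mirror the MOA operations and are not part of the formalism):

    def take(k, u):
        # k take u: first k items when k is non-negative, last |k| items otherwise
        return u[:k] if k >= 0 else u[k:]

    def drop(k, u):
        # k drop u: remove the first k items when k is non-negative,
        # remove the last |k| items otherwise
        return u[k:] if k >= 0 else u[:k]

    def rotate(k, u):
        # k rotate u: rotate left for non-negative k, right for negative k
        return drop(k, u) + take(k, u) if k >= 0 else take(k, u) + drop(k, u)

    u = [10, 20, 30, 40, 50]
    assert take(2, u) == [10, 20]
    assert take(-2, u) == [40, 50]
    assert drop(2, u) == [30, 40, 50]
    assert rotate(2, u) == [30, 40, 50, 10, 20]
    assert rotate(-1, u) == [50, 10, 20, 30, 40]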

2.1. Definition 1

Let A be an n-dimensional array and p a vector of integers. If p is an index of A,

    p ψ A = a[γ(sA, p)]

where γ(s, p) = x_{n-1} is defined by the recurrence

    x_0 = p_0,    x_j = x_{j-1} * s_j + p_j,    j = 1, ..., n-1.

If p is a partial index of length k < n, p ψ A = B, where the shape of B is sB = k drop sA and, for every index q of B, q ψ B = (p cat q) ψ A.

The definition uses the second form of specifying an array to define the result of a partial index. For the index case, the function γ(s, p) is used to convert an index p to an integer giving the location of the corresponding item in the row major order list of items of an array of shape s. The recurrence computation for γ is the one used in most compilers for converting an index to a memory address[29].
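A direct transcription of Definition 1 into executable form may help make the recurrence concrete. The following Python sketch (our own illustration; gamma and psi are hypothetical helper names) represents an array in the second form, as a shape together with its row major vector of items, and checks the examples given above for the 3 x 5 x 4 array A:

    from itertools import product

    def gamma(s, p):
        # gamma(s, p): offset of index p in the row major layout of shape s,
        # computed by the recurrence x_j = x_{j-1} * s_j + p_j
        x = p[0]
        for j in range(1, len(s)):
            x = x * s[j] + p[j]
        return x

    def psi(p, shape, a):
        # p psi A following Definition 1; A is given in the second form (shape, items).
        # For a partial index p of length k, every index q of the result B satisfies
        # q psi B = (p cat q) psi A, so the items are gathered by enumerating q.
        k = len(p)
        rest = list(shape[k:])                       # k drop shape
        if not rest:                                 # full index: a single item
            return [], [a[gamma(shape, p)]]
        items = [a[gamma(shape, list(p) + list(q))]  # (p cat q) psi A
                 for q in product(*[range(d) for d in rest])]
        return rest, items

    shape = [3, 5, 4]
    a = list(range(1, 61))            # the 3 x 5 x 4 array of Section 2, items 1..60

    assert psi([2, 1, 3], shape, a) == ([], [48])
    assert psi([2, 1], shape, a) == ([4], [45, 46, 47, 48])
    assert psi([1], shape, a) == ([5, 4], list(range(21, 41)))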

2.2. Corollary 1

Θ ψ A = A.

The following theorem shows that a ψ selection with a partial index can be expressed as a composition of ψ selections.

2.3. Theorem 1

Let A be an n-dimensional array and p a partial index so that p = q cat r. Then p ψ A = r ψ (q ψ A).

Proof: The proof is a consequence of the fact that for vectors u, v, w

    (u cat v) cat w = u cat (v cat w).


If we extend p to a full index by p cat p', then

    (p cat p') ψ A = ((q cat r) cat p') ψ A = (q cat (r cat p')) ψ A = (r cat p') ψ (q ψ A) = p' ψ (r ψ (q ψ A)),

which completes the proof.

We can now use Psi to define other operations on arrays. For example, consider definitions of take and reverse for multidimensional arrays.

2.4. Definition 2 (take)

Let A be an n-dimensional array, and k a non-negative integer such that 0 ≤ k ≤ s_0. Then B = k take A is the array whose shape is sB = < k > cat (1 drop sA) and for every index p of B, p ψ B = p ψ A. (In MOA take is also defined for negative integers and is generalized to any vector u with its absolute value vector a partial index of A. The details are omitted here.)

2.5. Definition 3 (reverse)

Let A be an n-dimensional array. Then s_{reverse A} = sA and for every integer i, 0 ≤ i < s_0,

    < i > ψ reverse A = < s_0 - i - 1 > ψ A.

This definition of reverse does a reversal of the 0th axis of A.

2.6. Example

Consider the evaluation of the expression

    < 1 3 > ψ (2 take reverse A)     (1)

where A is the array given in the previous Section. The shape of the result is

    2 drop s_{2 take reverse A}
      = 2 drop (< 2 > cat (1 drop s_{reverse A}))
      = 2 drop (< 2 > cat (1 drop sA))
      = 2 drop (< 2 > cat < 5 4 >)
      = 2 drop < 2 5 4 >
      = < 4 >

The expression can be simplified using the definitions:

    < 1 3 > ψ (2 take reverse A)
      = < 1 3 > ψ reverse A              (by Definition 2, since 1 < 2)
      = < 3 > ψ (< 1 > ψ reverse A)      (by Theorem 1)
      = < 3 > ψ (< 1 > ψ A)              (by Definition 3, since < 1 > ψ reverse A = < 3 - 1 - 1 > ψ A)
      = < 1 3 > ψ A                      (by Theorem 1)     (2)

This process of simplifying the expression for the item in terms of its Cartesian co-ordinates is called Psi reduction. The operations of MOA have been designed so that all expressions can be reduced to a minimal normal form[30].
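The reduction (2) can be checked mechanically. The sketch below (our own illustration; take_arr, reverse_arr and psi are hypothetical helper names, not MOA notation) evaluates both sides of (2) on the array A of this Section and confirms that they select the same items:

    from itertools import product

    def gamma(s, p):
        x = p[0]
        for j in range(1, len(s)):
            x = x * s[j] + p[j]
        return x

    def psi(p, shape, a):
        # p psi A per Definition 1, for a full or partial index p
        rest = list(shape[len(p):])
        if not rest:
            return [], [a[gamma(shape, p)]]
        items = [a[gamma(shape, list(p) + list(q))]
                 for q in product(*[range(d) for d in rest])]
        return rest, items

    def take_arr(k, shape, a):
        # k take A for non-negative k: keep the first k subarrays along axis 0
        sub = a[: k * (len(a) // shape[0])]
        return [k] + list(shape[1:]), sub

    def reverse_arr(shape, a):
        # reverse A: reverse along axis 0 by reversing the order of its subarrays
        step = len(a) // shape[0]
        items = []
        for i in reversed(range(shape[0])):
            items += a[i * step:(i + 1) * step]
        return list(shape), items

    shape, a = [3, 5, 4], list(range(1, 61))

    lhs = psi([1, 3], *take_arr(2, *reverse_arr(shape, a)))
    rhs = psi([1, 3], shape, a)
    assert lhs == rhs == ([4], [33, 34, 35, 36])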

2.7. Theorem 2 (Psi correspondence theorem)

Let A be an n-dimensional array and p a partial index for A of length k satisfying 0 < k ≤ n. Then

    p ψ A = {k drop sA, a[start(sA, p) * stride(sA, p) + ι π(k drop sA)]}

where

    start(s, p) = γ(k take s, p)
    stride(s, p) = π(k drop s) = s_k * ... * s_{n-1}.

Proof: By Definition 1, p ψ A is the array B with shape sB = k drop sA; let b be its vector of items, so that B = {k drop sA, b} and the tally of b is π(k drop sA). If we compare this with the theorem statement we see that to complete the proof we have to show that

    b[i] = a[start(s, p) * stride(s, p) + i]    for 0 ≤ i < π(k drop sA).

For a given i, let q represent the corresponding index of B. Then

    b[i] = q ψ B
         = (p cat q) ψ A          by the definition of ψ
         = a[γ(sA, p cat q)]

and the proof reduces to showing that for 0 ≤ i < π(k drop sA),

    γ(sA, p cat q) = start(s, p) * stride(s, p) + γ(k drop sA, q)     (3)

On the left of (3), γ(sA, p cat q) = x_{n-1} where

    x_0 = p_0
    x_j = x_{j-1} * s_j + p_j        j = 1, ..., k-1
    x_j = x_{j-1} * s_j + q_{j-k}    j = k, ..., n-1

Thus

    x_{n-1} = ((...((x_{k-1} * s_k + q_0) * s_{k+1} + q_1) ...) * s_{n-1}) + q_{n-k-1}     (4)

Note also that

    start(s, p) = γ(k take s, p) = x_{k-1}
    stride(s, p) = s_k * ... * s_{n-1}.

On the right of (3), γ(k drop sA, q) = y_{n-k-1} where

    y_0 = q_0
    y_j = y_{j-1} * s_{j+k} + q_j    j = 1, ..., n-k-1

Thus, the right-hand side of (3) becomes

    start(s, p) * stride(s, p) + γ(k drop sA, q)
      = x_{k-1} * s_k * ... * s_{n-1} + y_{n-k-1}
      = x_{k-1} * s_k * ... * s_{n-1} + ((...(q_0 * s_{k+1} + q_1) ...) * s_{n-1} + q_{n-k-1})

which is a partially factored version of (4), and the proof is complete.

The practical significance of the Psi correspondence theorem is that, because all the operations in the Psi calculus are defined in terms of ψ, their definitions can be translated from one involving Cartesian co-ordinates into one involving selection from the list of items stored in memory or, through abstract data or processor restructurings, their lexicographic location(s). We can now use the Psi correspondence theorem to find the algorithm to compute expression (1) directly:

    < 1 3 > ψ (2 take reverse A)
      = < 1 3 > ψ A                                                            (by the reduction (2))
      = {< 4 >, a[start(< 3 5 4 >, < 1 3 >) * stride(< 3 5 4 >, < 1 3 >)
                  + ι π(2 drop < 3 5 4 >)]}                                    (by Theorem 2)
      = {< 4 >, a[8 * 4 + ι 4]}
      = {< 4 >, a[< 32 33 34 35 >]}
      = < 33 34 35 36 >
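In executable terms, the operational normal form above is just a slice of the flat item vector. A small Python check (our own illustration; the names start, stride and gamma mirror the theorem and are not library functions) computes the start and stride for this example and confirms the selection:

    def gamma(s, p):
        # row major offset of index p in an array of shape s
        x = p[0]
        for j in range(1, len(s)):
            x = x * s[j] + p[j]
        return x

    def start(s, p):
        # start(s, p) = gamma(k take s, p), with k the length of the partial index p
        return gamma(s[:len(p)], p)

    def stride(s, p):
        # stride(s, p) = product of (k drop s), the number of items per selected partition
        prod = 1
        for d in s[len(p):]:
            prod *= d
        return prod

    shape = [3, 5, 4]
    a = list(range(1, 61))                  # items of A in row major order
    p = [1, 3]

    s0, st = start(shape, p), stride(shape, p)
    assert (s0, st) == (8, 4)
    offset = s0 * st
    assert a[offset:offset + st] == [33, 34, 35, 36]   # <1 3> psi A as a flat slice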

Thus, we have seen that an MOA array expression can be reduced to an expression only involving the operation ψ by Psi reduction. The Psi correspondence theorem (PCT) is then used to express the selection in terms of starts, strides and lengths. The latter can be used to access memory efficiently.

3. EFFICIENT MEMORY ACCESS — A MULTIDIMENSIONAL ARRAY

The significance of the PCT is that the multiple loops necessitated by a multidimensional array can be reliably collapsed to facilitate the use of special purpose hardware designed for starts, stops and strides, thus speeding memory access and enhancing high performance computing. Later, we will also demonstrate how the PCT can contribute to reliable methods for partitioning and mapping to multiple processors. In conjunction with the Psi reduction rules, the PCT is essential for high performance array based scientific computation.

To illustrate the collapsing of multiple loops we present a higher dimensional example. To make the meaning very explicit we introduce three notations (a small executable sketch of mix and link follows this list):

• The notation [expr | i = Vi, j = Vj, ...] means an array of dimensionality given by the number of generators explicitly given and of shape equal to the lengths of the generators. The values taken on by the generators can be noncontiguous and do not have to start with zero. If A is the array mentioned at the beginning, with sA = < 3 5 4 >, then

    [< i j k > ψ A | i = < 0 1 >, j = < 2 3 4 >, k = < 2 3 >]

is the 2 x 3 x 2 array

    11 12        31 32
    15 16        35 36
    19 20        39 40

• The notation mix [expr | i = Vi, j = Vj, ...], where expr denotes an array of a fixed size, means the array formed by joining the arrays generated by expr as slices. The axes of the arrays in the expression become the trailing axes. Thus, using the same A above,

    mix [< i > ψ A | i = < 0 2 >]

is the array formed from the first and third planes. The mix notation is needed to describe the gluing of arrays formed using ψ on partial indices.

• The notation link [expr | i = Vi, j = Vj, ...], where expr denotes a vector, means the vector formed by concatenating them. The link notation is needed to describe the vector of items collected from a set of subarrays. [Note 1]

Note 1: We have used the terms mix and link in these notations since they correspond to operations in Nial that have the same semantic effect. They are not part of the MOA notation because they require arrays of arrays as arguments.
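The following Python fragment (our own illustration; mix and link here are hypothetical helper names, not MOA operations) shows the intended meaning of the two gluing notations on arrays kept in the second form (shape, row major items):

    def mix(parts):
        # mix: join equally shaped arrays as slices of a new leading axis;
        # parts is a list of (shape, items) pairs with identical shapes
        shape, _ = parts[0]
        items = []
        for _, part_items in parts:
            items += part_items
        return [len(parts)] + list(shape), items

    def link(vectors):
        # link: concatenate a collection of vectors into one vector of items
        items = []
        for v in vectors:
            items += v
        return items

    # mix of two 2-item vectors gives a 2 x 2 array
    assert mix([([2], [1, 2]), ([2], [3, 4])]) == ([2, 2], [1, 2, 3, 4])
    # link simply flattens the collection in order
    assert link([[1, 2], [3, 4], [5]]) == [1, 2, 3, 4, 5]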

4. A 4-D EXAMPLE

Let A denote a 4-dimensional array such that sA is < 3 4 5 3 >. We want to choose the subarray consisting of the first two hyperplanes on axis 0, all planes, the last 3 rows and all the columns. This subarray can be expressed in MOA as:

    < 2 > take (< 0 0 2 > drop A)

(See Reference 22 for a general definition of take.) The result of the Psi reduction is an array of four dimensions:

    R = [< i, j, k, l > ψ A | i = < 0 1 >, j = < 0 1 2 3 >, k = < 2 3 4 >, l = < 0 1 2 >]

where the above notation denotes an array with sR = < 2 4 3 3 >. But we want to describe R in the form:

    R = {< 2 4 3 3 >, a[u]}

where u is a vector of indices for a. We can achieve the latter by a double application of the PCT. First, we observe that by Theorem 1, R can be rewritten as:

    R = [< k, l > ψ (< i, j > ψ A) | i = < 0 1 >, j = < 0 1 2 3 >, k = < 2 3 4 >, l = < 0 1 2 >]

Let Bij = < i, j > ψ A, and bij its list of items, with sBij = < 5 3 >. Then

    Rij = [< k, l > ψ Bij | k = < 2 3 4 >, l = < 0 1 2 >]
        = mix [< k > ψ Bij | k = < 2 3 4 >]

since l ranges over the entire subrange of the 2nd dimension of Bij. We apply the PCT to the expression in the mix notation for Rij to obtain

    Rij = mix [{1 drop < 5 3 >, bij[start(< 5 3 >, < k >) * stride(< 5 3 >, < k >)
                                     + ι π(1 drop < 5 3 >)]} | k = < 2 3 4 >]
        = mix [{< 3 >, bij[start(< 5 3 >, < k >) * stride(< 5 3 >, < k >) + ι π < 3 >]} | k = < 2 3 4 >]
        = mix [{< 3 >, bij[k * 3 + ι 3]} | k = < 2 3 4 >]

Applying the PCT to Bij we get

    Bij = {2 drop < 3 4 5 3 >, a[start(< 3 4 5 3 >, < i, j >) * stride(< 3 4 5 3 >, < i, j >)
                                  + ι π(2 drop < 3 4 5 3 >)]}
        = {< 5 3 >, a[(i * 4 + j) * 15 + ι 15]}

Combining these we have

    Rij = mix [{< 3 >, a[(i * 4 + j) * 15 + ι 15][k * 3 + ι 3]} | k = < 2 3 4 >]



To form R, we have

    R = mix [Rij | i = < 0 1 >, j = < 0 1 2 3 >]
      = {< 2 4 3 3 >, link [a[(i * 4 + j) * 15 + ι 15][k * 3 + ι 3]
                             | i = < 0 1 >, j = < 0 1 2 3 >, k = < 2 3 4 >]}
      = {< 2 4 3 3 >, link [a[m * 15 + ι 15][k * 3 + ι 3] | m = < 0 ... 7 >, k = < 2 3 4 >]}

where we combine the i and j indices into an index over m. Using the property of indexing that

    a[u][v] = a[u[v]]

and the fact that all items of [k * 3 + ι 3 | k = < 2 3 4 >] are in ι 15, we can further reduce this to

    R = {< 2 4 3 3 >, link [a[m * 15 + k * 3 + ι 3] | m = < 0 ... 7 >, k = < 2 3 4 >]}
      = {< 2 4 3 3 >, a[u]}, where u[i] = (i/9) * 15 + ((i/3) mod 3 + 2) * 3 + i mod 3
        (with integer division) and i = < 0 ... 71 >

which is in the desired form. In practice, we may wish to use the description in the form:

    R = {< 2 4 3 3 >, link [a[s + ι 9] | s = < 6 21 36 ... >]}

indicating that the items of R are expressed in terms of items of a found using a start point of 6, strides of 15 and segments of length 9. From an algorithmic view, we have replaced a four-deep nested loop of the form:

    for i := 0 to 1 do
      for j := 0 to 3 do
        for k := 0 to 2 do
          for l := 0 to 2 do
            ...

with a two-level loop of the form:

    for i := 0 to 7 do
      for j := 0 to 8 do
        r[9*i + j] := a[start + stride*i + j];
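As a check of the collapsed loop, the following Python sketch (our own illustration, not code from the paper) builds the flat item vector of a 3 x 4 x 5 x 3 array, selects the subarray both by the four-deep Cartesian loop and by the two-level start/stride loop with start = 6 and stride = 15, and confirms that the two agree:

    # Flat row major items of a 3 x 4 x 5 x 3 array; the values are just the offsets.
    shape = [3, 4, 5, 3]
    a = list(range(3 * 4 * 5 * 3))

    # Four-deep Cartesian loop: first two hyperplanes, all planes,
    # last three rows, all columns.
    direct = []
    for i in range(0, 2):
        for j in range(0, 4):
            for k in range(2, 5):
                for l in range(0, 3):
                    direct.append(a[((i * 4 + j) * 5 + k) * 3 + l])

    # Collapsed two-level loop from the PCT: 8 segments of 9 contiguous items,
    # starting at 6 and separated by a stride of 15.
    start, stride = 6, 15
    collapsed = [0] * 72
    for i in range(0, 8):
        for j in range(0, 9):
            collapsed[9 * i + j] = a[start + stride * i + j]

    assert collapsed == direct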


Note that, if our example were for a 5-D array (or higher), the reduced form would still have only two loops. In the 5-D case we might have

    for i := 0 to 1 do
      for j := 0 to 3 do
        for k := 0 to 2 do
          for l := 0 to 2 do
            for m := 0 to 4 do
              ...

which would reduce to a two-level loop of the form

    for i := 0 to 7 do
      for j := 0 to 44 do
        r[45*i + j] := a[start + stride*i + j];

5. MAPPING ARRAYS TO PROCESSORS

Much work has been done in describing architectures in terms of abstract models and then using the abstract model as the basis for mapping decisions[31-34]. However, it is still a primarily manual effort to design the mapping of the computation to the abstract model. The automation of this step is crucial if we are to make effective use of parallel architectures.

The abstract model for the organization of the processors is often in the form of a graph whereby the nodes of the graph are processors and the edges are communication links. For many practical models the graph can be represented by an array in which each processor is given an address and each processor to which a link is available is at an address one away along one of the axes. The array model can describe a list of processors, a two-dimensional mesh, a hypercube, a balanced tree[31,32], or a network of workstations[33-35]. Our approach will be to view an architecture as having two array organizations: one which is the abstract model best suited for describing the problem, and a second which corresponds to an enumeration of the processors as a list. For an actual architecture the latter corresponds to the list of processor identity numbers and is used to determine the actual send and receive instructions issued by the resulting program.

In doing the mapping from the data arrays of the problem to the array-like arrangements of processors, there is a need to be able to systematically determine what information to distribute to which processors. Having made those decisions, the high level algorithm expressed in terms of array operations has to be turned into low level code that selects data elements from one-dimensional memory and sends them to the appropriate processor in the one-dimensional list of processors. Each processor has to be supplied with code that is parameterized so that it operates on the data in its local memory to carry out its portion of the parallel algorithm. The difficulties are compounded when a problem is so large that it must be attacked in slices.


Thus, a vector of length k = m * n * p, where m is the number of slices, n is the amount of data each processor can process in one go, and p is the number of processors, can be viewed algorithmically as a three-dimensional array of shape m x n x p, where the first axis indicates slices of work to be done one after the other. The manipulations of the data addresses to ensure that the problem decompositions are handled correctly can be quite intricate and are difficult to get right by hand. Thus, having a formal technique for deriving the address computations, one that can be automated to a large extent, is essential if rapid progress is going to be made in exploiting parallel hardware for scientific computation.

We can also use the idea of mapping Cartesian co-ordinates to their lexicographic ordering when we want to partition and map arrays to a multiprocessor topology in a portable, scalable way. Consider, for example, a parallel vector-matrix multiply of vector A and matrix B, where sA = < n > and sB = < n p >. One effective way to organize the computation is to map each of the rows of B to a processor, send the elements of A to the corresponding processors, do an integer-vector multiply to form vectors in each processor, and then add the vectors pointwise to produce the result. The last step involves the adding together of n vectors and is best done by adding pairs of vectors in parallel. Abstractly, this last step can be seen as adding together the rows of a matrix pointwise. A hypercube topology is ideal for such a computation and at best would take O(log n) steps on n processors. Often a hypercube topology is not available; we may only have a LAN of workstations or a linear list of processors. But we can view any processor topology abstractly as a hypercube and map the rows to processors by imposing an ordering on the p available processors. That is, we look at p_i, where 0 ≤ i < p, as the lexicographically ordered items of the hypercube. Hence, in the case of the LAN we obtain a vector of socket addresses. We then abstractly restructure the vector of addresses as a k-dimensional hypercube where k = ⌈log2 n⌉.

In order to map the matrix to the abstract hypercube we restructure the matrix into a 3-dimensional array such that there is a 1-1 correspondence between the restructured array's planes and the available processors. That is, we send the ith plane of the restructured array to the ith processor lexicographically. If there are more rows than processors then the planes are sequentially reduced within each processor in parallel. We can apply the Psi correspondence theorem to the data to see how to address the ith planes from memory efficiently. The same methodology can be applied to address the processors effectively.

For example, suppose we want to add up the rows of a 256 x 512 matrix and we have eight workstations connected by a LAN. We would restructure the matrix into an 8 x 32 x 512 array, which we denote by A'. The socket addresses of the workstations are put into a matrix P, and each < i > ψ A', i = 0, ..., 7, is sent to the processor addressed by P_i. The sums of the rows for each plane are formed in parallel, producing eight vectors of length 512, one in each processor. We then restructure P into a 3-D hypercube implicitly and use this arrangement to decide how to perform the access and subsequent addition between the processors. In the first step we add processor plane 1 to plane 0. By the Psi correspondence theorem this implies adding the contents of processors 4 to 7 to those of processors 0 to 3. In the next step we add processor row 1 to row 0, which implies the contents of processors 2 and 3 are added to those of processors 0 and 1. Finally, we add the contents of processor 1 to the contents of processor 0. Thus, we have added up all the rows in log2 8, or 3, steps.
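A minimal Python simulation of this scheme (our own sketch; the processor 'memories' are plain lists, and all names are hypothetical) distributes the 8 x 32 x 512 planes, forms the local sums, and then performs the three hypercube-style combining steps described above:

    # Simulate 8 "processors", each holding one 32 x 512 plane of the
    # restructured 256 x 512 matrix (values chosen arbitrarily for the check).
    ROWS, COLS, P = 256, 512, 8
    matrix = [[(r * COLS + c) % 97 for c in range(COLS)] for r in range(ROWS)]

    # Distribute: processor i receives plane i, i.e. rows 32*i .. 32*i + 31.
    planes = [matrix[i * (ROWS // P):(i + 1) * (ROWS // P)] for i in range(P)]

    # Local step: each processor sums its own rows, leaving one vector per processor.
    local = [[sum(row[c] for row in plane) for c in range(COLS)] for plane in planes]

    # Hypercube-style combining: at each of the log2(8) = 3 steps, the upper half
    # of the remaining processors sends its vector to a partner in the lower half.
    active = P
    while active > 1:
        half = active // 2
        for i in range(half):
            partner = local[i + half]          # "receive" from processor i + half
            local[i] = [x + y for x, y in zip(local[i], partner)]
        active = half

    # Processor 0 now holds the column sums of the whole matrix.
    expected = [sum(matrix[r][c] for r in range(ROWS)) for c in range(COLS)]
    assert local[0] == expected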


The method can be employed for any size matrix and can utilize an arbitrary number of homogeneous workstations connected by a LAN. It is a portable, scalable design. A more detailed description of this technique (including timing results) can be found in [34]. Recently, we ported and scaled our designs to a 32-processor CM5[35]. We used a similar approach, a linear processor array, to map a parallel sparse LU decomposition to a network of RS6000s[33].

6. CONCLUSION

It is recognized that many large scale scientific and engineering problems have massive computational requirements that can utilize the collective power of large numbers of processors working together. The difficulty facing the computer science community is to provide tools to the scientists and engineers that allow them to solve such problems on large networks of workstations or specific parallel architectures in an effective and timely manner. The approach we are espousing is that a formal methodology be developed that assists in automating the translation of computationally intensive problems, working from a high level array-based description of the problem.

The paper discusses a specific formalism, the Psi calculus, and demonstrates how, by a combination of Psi reduction and the Psi correspondence theorem, progress has been made both on the problem of expressing problems at a high level, and on using the formalism to generate low level code and to assist in the organization of the computation on a processor network. Our goal is to prototype tools that can be adopted by compiler designers to assist in the problem of parallelizing programs. We see this happening in two directions:

• by a language providing a high level notation embedded within a standard language, e.g. HPF, that is preprocessed using the techniques described in the paper, or
• by the compiler extracting a high level description based on loop analysis and then using the techniques to achieve an effective translation.

The next step towards our goal is to apply the methodology to a problem of practical size and complexity to demonstrate that significant scientific and engineering computations can be solved effectively in this manner.

ACKNOWLEDGEMENTS

This work was supported by the National Science Foundation's PFF Award Program and the Natural Sciences and Engineering Research Council of Canada.

REFERENCES

1. Saman P. Amarasinghe and Monica S. Lam, 'Communication optimization and code generation for distributed memory machines', SIGPLAN Not., 28, (6), 126-138 (1993).
2. Jennifer Anderson and Monica S. Lam, 'Global optimizations for parallelism and locality on scalable parallel machines', SIGPLAN Not., 28, (6), 112-125 (1993).
3. Pietro Rossi, 'Science in industry in HPCN', in HPCN Europe 93, 1993.
4. M. Chen, Y. Choo and J. Li, 'Crystal: theory and pragmatics of generating efficient parallel code', in K. Szymanski (Ed.), Parallel Functional Languages and Compilers, ACM Press, 1991.
5. U. Geuder, M. Hardtner, B. Worner and R. Zink, 'The GRIDS approach to automatic parallelization', in Fourth International Workshop on Compilers for Parallel Computers, TU Delft, 1993.


6. E. M. Paalvast, H. J. Sips and A. J. van Gemund, 'Automatic parallel program generation with an application to the Booster language', in Proceedings of the 1991 International Conference on Parallel Processing, August 1991.
7. E. M. Paalvast, Programming for Parallelism and Compiling for Efficiency, PhD thesis, Delft University of Technology, June 1992.
8. L. Mullin and T. McMahon, 'Parallel algorithm derivation and program transformation in a preprocessing compiler for scientific languages', Technical Report CSC-94-29, University of Missouri-Rolla, Dept of CS, 1994 (under review, J. Sci. Program.).
9. L. Mullin and S. Thibault, 'A reduction semantics for array expressions: the Psi compiler', Technical Report CSC-95-05, University of Missouri-Rolla, Dept of Computer Science, 1994.
10. High Performance Fortran Forum, High Performance Fortran Language Specification, version 1, May 1993.
11. Z. Bozkus, A. Choudhary, G. Fox, T. Haupt and S. Ranka, 'A compilation approach for Fortran 90D/HPF compilers on distributed memory MIMD computers', in Proceedings of the Sixth Annual Workshop on Languages and Compilers for Parallel Computing, 1993.
12. S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer and C. Tseng, 'An overview of the Fortran D programming system', in Proceedings of the Fourth Workshop on Languages and Compilers for Parallel Computing, 1991.
13. J. Hulman, S. Andel, B. Chapman and H. Zima, 'Intelligent parallelization within the Vienna Fortran compilation system', in Fourth International Workshop on Compilers for Parallel Computers, TU Delft, 1993.
14. Z. Bozkus, L. Meadows, S. Nakamoto and V. Schuster, 'Retargetable HPF compiler interface', in Fourth International Workshop on Compilers for Parallel Computers, TU Delft, 1993.
15. J. Yang and Y. Choo, 'Data fields as parallel programs', in Second International Workshop on Array Structures, University of Montreal, 1992.
16. M. A. Jenkins, J. I. Glasgow, Carl McCrosky and H. Meijer, 'Expressing parallel algorithms in Nial', Parallel Comput., 11, (3), (1989).
17. G. W. Sabot, The Paralation Model: Architecture-Independent Parallel Programming, MIT Press, Cambridge, Mass., 1988.
18. D. Skillicorn, 'Architecture-independent parallel computation', IEEE Comput., December 1990.
19. Guy E. Blelloch and Gary W. Sabot, 'Compiling collection-oriented languages onto massively parallel computers', J. Parallel Distrib. Comput., 8, 119-134 (1990).
20. T. More, 'Axioms and theorems for arrays', IBM J. Res. Dev., 17, (2), (1973).
21. M. A. Jenkins and J. I. Glasgow, 'A logical basis for nested array data structures', Comput. Lang., 14, (1), 35-51 (1989).
22. L. M. R. Mullin, A Mathematics of Arrays, PhD dissertation, Syracuse University, December 1988.
23. M. Jenkins and L. Mullin, 'A comparison of array theory and a mathematics of arrays', in Arrays, Functional Languages, and Parallel Systems, Kluwer Academic Publishers, 1991.
24. K. E. Iverson, A Programming Language, Wiley, 1962.
25. P. S. Abrams, An APL Machine, PhD thesis, Stanford University, 1970.
26. R. Bernecky, 'Compiling APL', in Arrays, Functional Languages and Parallel Systems, Kluwer Academic Publishers, 1991.
27. T. Budd, 'A parallel intermediate representation based on lambda expressions', in Arrays, Functional Languages and Parallel Systems, Kluwer Academic Publishers, 1991.
28. M. A. Jenkins, Arrays and Functional Programming, in progress, 1994.
29. A. V. Aho, R. Sethi and J. D. Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley, Reading, Mass., 1986.
30. L. R. Mullin, E. M. Insall and W. Kluge, 'The Psi calculus and the Church-Rosser property', in progress, 1994.
31. N. Bélanger, L. Mullin and Y. Savaria, 'Formal methods for the partitioning, scheduling and routing of arrays on a hierarchical bus multiprocessing architecture', Technical Report CSEE/92/06-04, University of Vermont, Dept of CSEE, 1992; Proceedings of ATABLE92, Montreal, Québec, 1992.
32. N. Bélanger, Y. Savaria and L. Mullin, 'Data structure and algorithms for partitioning arrays on a multiprocessor', technical report, University of Missouri-Rolla, Dept of CS, 1994, in progress.


33. L. Mullin and D. Dooling, 'Indexing and distributing a general patterned sparse array', in First Workshop on Solving Irregular Problems on Distributed Memory Machines, held in conjunction with IPPS 95, http://www.cis.syr.edu/people/ranka/ipps.html, 1995.
34. L. R. Mullin, D. Dooling, E. Sandberg and S. Thibault, 'Formal methods for scheduling and communication protocol', in Proceedings of the Second International Symposium on High Performance Distributed Computing, July 1993.
35. L. R. Mullin, D. Dooling, E. Sandberg and S. Thibault, 'Formal methods for partitioning, scheduling, routing, and communication protocol', Technical Report CSC-95-04, University of Missouri-Rolla, Dept of Computer Science, 1994.
